Tuesday, October 20, 2015

Indexing and Analyzing Hypocoristics, Diminutives and Synonyms with ElasticSearch

Hypocoristics, Diminutives and Synonyms with ElasticSearch/Lucene

Intro


Indexing and searching through social data introduces two problems: matching colloquial names to formal names, and matching synonyms or shorthand to full words. Colloquial words may include a combination of hypocoristics, diminutives and synonyms, for example California, Golden State, Cali, CA, etc.
We shall discuss the process of achieving a contextualized data ecosystem using Elasticsearch, and we will look at synonym matching as a better tool than fuzzy search for contextualizing raw data.

Fuzzy search



Fuzzy search uses string edit distance to determine how close two strings are. This mechanism is best suited to matching possibly misspelled words against indexed terms. For example, a search engine can suggest words as you type; the suggestions are fuzzy matched against the index.
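
To make "string edit distance" concrete, here is a minimal Levenshtein distance computation in Java. This snippet is only an illustration of the concept and is not part of the Elasticsearch setup described below; Elasticsearch's fuzzy matching bounds this kind of edit distance internally.

public final class EditDistanceDemo {

    // Levenshtein edit distance: the number of single-character insertions,
    // deletions and substitutions needed to turn a into b.
    public static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // "califronia" is two single-character edits away from "california",
        // so a fuzzy query allowing an edit distance of 2 would still match it.
        System.out.println(editDistance("california", "califronia"));
    }
}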


Synonym Matching



Synonym matching uses the index to store all the possible equivalents of a word or phrase. This requires an analyst who understands the data well enough to construct synonym mappings. If you are dealing with state names, you can map abbreviations, popular names and formal names to each other, such as CA, California, Golden State, Cali, etc.
During indexing, your data (social media feeds, etc.) can reference the synonym mappings to add more context to the data; hence, contextualization.


Lucene synonym handling


See my post ETL on the fly with BIG Data: A Hive Lucene Integration, included below.


Elasticsearch synonym handling



I created a state name index in Elasticsearch that maps formal state names to their respective abbreviations and shortened postal names.


I created the state name synonym map using this shell script:

# Wrap each line of the abbreviation file in quotes, join the lines with commas,
# strip the trailing comma, and lowercase everything.
states=`cat stateabbreviations.txt | sed 's/^.*$/"&"/g' | tr '\n' ',' | sed 's/,$//' | awk '{ print tolower($0) }'`

curl -XPUT "http://localhost:9200/statesabbreviation_index/" -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "statesabbr_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            '"$states"'
          ]
        }
      },
      "analyzer": {
        "statesabbr_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "statesabbr_synonym_filter"
          ]
        }
      }
    }
  }
}'


Note that we are using a lowercase token filter and that the script converts all synonyms to lowercase. Each query to this index is therefore evaluated in lowercase, because the indexed synonyms are lowercase. The opposite can be achieved by changing the awk command and the filter to uppercase.


The abbreviation file, stateabbreviations.txt, is of the form:

Alabama,Ala.=>AL
Alaska=>AK
Arizona,Ariz.=>AZ
Arkansas,Ark.=>AR
California,Calif.=>CA
Colorado,Colo.=>CO
Connecticut,Conn.=>CT
Delaware,Del.=>DE
Florida,Fla.=>FL
Georgia,Ga.=>GA
Hawaii=>HI

Now you can make a GET query through your browser:

http://localhost:9200/statesabbreviation_index/_analyze?analyzer=statesabbr_synonyms&text="CALIFORNIA is in United States of America and so is Maine"


Or query using curl in shell:

curl -XGET "http://localhost:9200/statesabbreviation_index/_analyze?analyzer=statesabbr_synonyms&text='CALIFORNIA%20is%20in%20United%20States%20of%20America%20and%20so%20is%20Maine'"

where the text parameter is "CALIFORNIA is in United States of America and so is Maine" and the recognized terms are expected to come back as indexed synonyms.


Results:


{"tokens":[
{"token":"ca","start_offset":1,"end_offset":11,"type":"SYNONYM","position":1},
{"token":"is","start_offset":12,"end_offset":14,"type":"<ALPHANUM>","position":2},
{"token":"in","start_offset":15,"end_offset":17,"type":"<ALPHANUM>","position":3},
{"token":"usa","start_offset":18,"end_offset":42,"type":"SYNONYM","position":4},
{"token":"and","start_offset":43,"end_offset":46,"type":"<ALPHANUM>","position":5},
{"token":"so","start_offset":47,"end_offset":49,"type":"<ALPHANUM>","position":6},
{"token":"is","start_offset":50,"end_offset":52,"type":"<ALPHANUM>","position":7},
{"token":"me","start_offset":53,"end_offset":58,"type":"SYNONYM","position":8}]}

Names:


I also created a first name hypocoristics/diminutives index that can be queried in the same way as above, where the map file is of the form:
Frederick=>Fred,Freddy,Rick,Fritz,Freddie


A browser query of the form:

http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'


Or query using curl in shell:

curl -XGET "http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'"


Resulted in:


{"tokens":[{"token":"fred","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"freddy","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"rick","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"fritz","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"freddie","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1}]}

Note that this first name index takes the formal name and returns its other names. In order to search by a diminutive such as freddie, a reverse index needs to exist. You can create such a reverse lookup index using a many-to-one map, i.e.:

Fred,Freddy,Rick,Fritz,Freddie=>Frederick

However, there’s a caveat: some diminutives refer to more than one given name. For example, Frank could refer to Francesco, Francis or Franklin.  


Frances=>Fran,Franny,Fanny,Frankie
Francesca=>Fran,Franny,Fanny
Francesco=>Fran,Frank,Frankie
Francis=>Fran,Frank,Frankie
Franklin=>Frank


How can you exploit these synonym capabilities programmatically?

Unfortunately there is no stable php-elasticsearch package out there. However, if you're familiar with traversing JSON, you should have no problem making HTTP(S) queries with curl or JavaScript. Here is an example using the PHP cURL API:



<?php
function restgetcall() {
    $url = "http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'";
    $headers = array(
        'Accept: application/json',
        'Content-Type: application/json',
    );

    $handle = curl_init();
    curl_setopt($handle, CURLOPT_URL, $url);
    curl_setopt($handle, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_HTTPGET, true);

    $response = curl_exec($handle);
    $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);

    if ($response === false) {
        $info = curl_getinfo($handle);
        curl_close($handle);
        die('error occurred during curl exec. Additional info: ' . var_export($info, true));
    }
    curl_close($handle);

    $decoded = json_decode($response);
    if (isset($decoded->response->status) && $decoded->response->status == 'ERROR') {
        die('error occurred: ' . $decoded->response->errormessage);
    }
    echo 'response ok!';
    print_r($decoded);
    #var_export($decoded->response);
}

restgetcall();
?>




Other topics:

Phonetic engines as an alternative to Elasticsearch fuzzy matching.
Elasticsearch word stemming: algorithmic, dictionary, Hunspell.


ETL on the fly with BIG Data: A Hive Lucene Integration

Hive Lucene Integration.


Summary:

This topic looks at integrating two technologies, Hive and Lucene, to allow for a fast and cheap automated ETL process: for example, contextualizing data into a standard format on the fly before matching it against data in a Hive repository. Non-standard data includes diminutives, hypocoristics and acronyms in people's names and street addresses.
Some form of ETL may be required when importing third party data into your data warehouse, and if you have multiple third party sources, the need for automated ETL becomes apparent. The transformation/translation phase can be streamlined using Lucene technologies. Hive supports many UDFs, including Java regular expressions, but to extend HQL capabilities further, a more involved UDF implementation may be necessary.
This topic shows a simple Hive UDF example for matching date strings with different formats using Java regular expressions. The second example matches street name abbreviations by incorporating Lucene/Solr synonyms into a UDF.

You may also visit the project on GitHub.

Compare dates example:


If you have two data sources containing date columns that you are interested in matching, you can implement a UDF that will anticipate different date formats and even use parts of the date (such as year) to perform a match.   
We implemented a class CompareYearInDateString that extends the org.apache.hadoop.hive.ql.exec.UDF class, which requires a public evaluate() method. The method lets you pass more than one argument and return a user-defined type, in our case a Boolean. If either string is empty it returns false. Otherwise it extracts the year string from both dateStr arguments by calling getYear(), which first calls getDatePattern() and then calls getYearFromDate() with the Pattern object and the year string as arguments. getDatePattern() can be made more robust as you learn about potential date formats, such as international formats and different delimiters.
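
The full implementation is in the GitHub project; as a rough sketch of the pattern (a simplified stand-in, not the project's exact code), a year comparison UDF might look like this, where the class name, regex and helper method are illustrative assumptions:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simplified sketch of a year-comparison UDF (old-style Hive UDF API).
public class CompareYearInDateStringSketch extends UDF {

    // Assumes a four-digit year appears somewhere in each date string.
    private static final Pattern YEAR = Pattern.compile("\\d{4}");

    public Boolean evaluate(Text first, Text second) {
        if (first == null || second == null
                || first.toString().isEmpty() || second.toString().isEmpty()) {
            return false;
        }
        String y1 = getYear(first.toString());
        String y2 = getYear(second.toString());
        return y1 != null && y1.equals(y2);
    }

    private String getYear(String dateStr) {
        Matcher m = YEAR.matcher(dateStr);
        return m.find() ? m.group() : null;
    }
}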


Compare Street names example:


We anticipated finding addresses with 'st' in place of 'street', 'ave' in place of 'avenue', etc. It would be a daunting task to incorporate all the different combinations of acronyms in a SQL statement, so it makes sense to create a UDF which can transform a non-standard street string into standard form. Lucene analyzers are well suited for detecting and replacing acronyms.
There are two approaches to implementing analyzers for Hive map-reduce tasks, or any other parallel process: a RESTful index such as Elasticsearch or Solr, or an in-memory analyzer. We felt that a RESTful index would burden the network with more traffic, so we settled for a standalone in-memory Analyzer as a singleton per Hadoop JVM. Since a JVM container can be recycled for new map-reduce tasks, this approach scales as long as the synonym data remains moderately small.
As in the previous example, the class CompareStreetNames extends the UDF class and implements an evaluate() method, where the Text arguments are transformed using a custom AnalyzeUtil.analyzeString() method before being compared to each other.
The AnalyzeUtil.analyzeString() method transforms a string by using a custom Analyzer, a SynonymAnalyzer instance, which tokenizes a string reader into transformed term attribute tokens. We iterate over the tokens to build the transformed string.
A SynonymAnalyzer is instantiated with each transformation request; its overridden createComponents() method instantiates a SynonymFilter with a SynonymMap object and a whitespace tokenizer. The tokenizer tells the filter how to break up the subject string, while the SynonymMap instance contains the key/value pairs to match tokenized synonyms against.
Construction of the synonym map happens in the constructSynonymMap() method, which obtains a singleton SynonymMapLoader implementing the getSynonymMap() method. This singleton builds and keeps one copy of the SynonymMap in memory per JVM, thereby removing IO constraints.
In the getSynonymMap() method we start by parsing a file containing Solr analyzer mappings and building a SynonymMap object using the SolrSynonymParser class of the lucene-analyzers-common API (a sketch of this flow follows the mapping examples below). Something to note: instantiating SolrSynonymParser requires an analyzer, which is not our custom analyzer (we used SimpleAnalyzer) and should be closed before exiting the method. The synonym map file in the static variable LOCAL_SYNONYM_MAP is a jar resource containing data in the Solr synonym format, which is a many-to-one mapping:
ARC => ARCADE
AVE => AVENUE
BLVD => BOULEVARD
STR , ST => STREET

An alternative to the Solr synonym format is the WordNet format, which also has a parser implemented in Lucene.
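
Here is a rough sketch of how these pieces fit together, written against the Lucene 4.x analyzers-common API. It is a simplified stand-in for the project's code: the class name, the resource name street_synonyms.txt and the use of Version.LUCENE_47 are assumptions, constructor signatures differ in later Lucene versions, and the singleton wiring in the actual project may differ.

import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SolrSynonymParser;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical sketch of the in-memory synonym transformation (Lucene 4.x API).
public final class SynonymSketch {

    // Single copy of the synonym map, built once per JVM to avoid repeated IO.
    private static SynonymMap SYNONYM_MAP;

    private static synchronized SynonymMap getSynonymMap() throws Exception {
        if (SYNONYM_MAP == null) {
            // Parse a Solr-format synonym file shipped as a jar resource
            // (resource name is an assumption).
            SimpleAnalyzer parseAnalyzer = new SimpleAnalyzer(Version.LUCENE_47);
            SolrSynonymParser parser = new SolrSynonymParser(true, true, parseAnalyzer);
            Reader mapping = new InputStreamReader(
                    SynonymSketch.class.getResourceAsStream("/street_synonyms.txt"),
                    StandardCharsets.UTF_8);
            parser.parse(mapping);
            SYNONYM_MAP = parser.build();
            parseAnalyzer.close();
        }
        return SYNONYM_MAP;
    }

    // Tokenize the input on whitespace, apply the synonym filter,
    // and rebuild the transformed string from the term attributes.
    public static String analyzeString(String input) throws Exception {
        WhitespaceTokenizer tokenizer =
                new WhitespaceTokenizer(Version.LUCENE_47, new StringReader(input));
        TokenStream stream = new SynonymFilter(tokenizer, getSynonymMap(), true);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        StringBuilder out = new StringBuilder();
        stream.reset();
        while (stream.incrementToken()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(term.toString());
        }
        stream.end();
        stream.close();
        return out.toString();
    }
}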


Creating the UDFs in hive:

This part is straightforward.
You can either create a fat jar containing your custom implementation and the Lucene classes, or create a thin jar of your implementation and copy the Lucene jars into the Hive lib directory. See our Maven POM.
We built a fat jar and loaded it into HDFS:

hadoop fs -put ~/ritho-hive-0.0.1-dist.jar /apps/gdt/

Then create the UDFs:


echo "CREATE FUNCTION CompareYearInDateString  AS 'com.ritho.hadoop.hive.udf.CompareYearInDateString' USING JAR 'hdfs:///apps/gdt/ritho-hive-0.0.1-dist.jar' ;" | hive


echo "CREATE FUNCTION CompareStreetNames  AS 'com.ritho.hadoop.hive.udf.CompareStreetNames' USING JAR 'hdfs:///apps/gdt/ritho-hive-0.0.1-dist.jar' ;" | hive

Testing CompareYearInDateString class

echo "select * from (select CompareYearInDateString('1922/12/12', '19221101’) ) x;"|hive

Another example testing CompareYearInDateString class

echo "select * from (select CompareYearInDateString('1922/12/12', '1922’) ) x;"|hive

Testing CompareStreetNames class

echo "select * from (select CompareStreetNames('88 Mountain View Street STE 10', '88 Mtn View St Suite 10') ) x;"|hive

Conclusion:

These UDFs made our SQL statements clean and easy to read. Most important was the ability to enhance Hive's functionality, and it was also easy to implement the Lucene analyzer. The overall result is a flexible architecture that DBAs and analysts can exploit.
Another example of how to exploit this architecture is person name internationalization or standardization (i.e., José to Joseph conversions or Jim to James mapping). Text similarity scoring in HQL can also be adapted in the same manner.


References

https://hive.apache.org
http://lucene.apache.org
http://lucene.apache.org/solr/