Hypocoristics, Diminutives and Synonyms with ElasticSearch/Lucene
Intro
Indexing and searching social data raises two matching problems: colloquial names to formal names, and synonyms or shorthand to full words. Colloquial words may be any combination of hypocoristics, diminutives and synonyms. For example, California may appear as Golden State, Cali or CA.
This post walks through building a contextualized data ecosystem with Elasticsearch. We will look at synonym matching as a better tool than fuzzy search for contextualizing raw data.
Fuzzy search
Fuzzy search uses string edit distance to measure how close two strings are. It is best suited to finding indexed words that may have been misspelled. For example, a search engine can suggest words as you type; the suggestions are fuzzy-matched against the index. Edit distance cannot connect Golden State to California, though, which is why fuzzy search is a poor fit for synonyms.
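To make the idea of edit distance concrete, here is a minimal Python sketch of plain Levenshtein distance plus a toy suggester. It is an illustration only, not Elasticsearch's own implementation; the function names, the vocabulary and the max_distance default are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_suggest(query, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance edits of query, closest first."""
    scored = [(levenshtein(query.lower(), word.lower()), word) for word in vocabulary]
    return [word for distance, word in sorted(scored) if distance <= max_distance]


# A misspelling is one edit away, but the nickname "cali" is six edits away.
print(fuzzy_suggest("californa", ["california", "colorado", "cali"]))
print(levenshtein("cali", "california"))
```

The misspelling is caught, while the colloquial form falls well outside any sensible distance threshold.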
Synonym Matching
Synonym matching stores, in the index, all the known equivalents of a word or phrase. This requires an analyst to understand the data well enough to construct the synonym mappings. For state names, you can map abbreviations, popular names and formal names to one another, such as CA, California, Golden State and Cali.
While your data (social media feeds etc.) is being indexed, the analyzer can reference the synonym mappings to add more context to it; hence contextualization.
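The contextualization step can be sketched outside Elasticsearch as a dictionary lookup over tokens. This is only a rough model of what a synonym filter does at index time; the SYNONYMS table, the contextualize function and the two-word phrase limit are assumptions for this example.

```python
import re

# Hypothetical synonym table: every known variant maps to one canonical name.
SYNONYMS = {
    "ca": "california",
    "cali": "california",
    "golden state": "california",
    "california": "california",
}


def contextualize(text, synonyms=SYNONYMS, max_phrase_len=2):
    """Return the canonical names found in text, trying longer phrases first."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        for n in range(max_phrase_len, 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in synonyms:
                found.append(synonyms[phrase])
                i += n  # skip the tokens consumed by the match
                break
        else:
            i += 1  # no synonym starts at this token
    return found


print(contextualize("Road trip through Cali and the Golden State!"))
```

Both the nickname and the two-word popular name resolve to the same canonical value, which is the extra context attached to the raw text.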
Lucene synonym handling
Elasticsearch synonym handling
I created a state-name index that holds each state's abbreviations and shortened postal names. The index lives in Elasticsearch.
I created the state-name synonym map using this shell script:
# Quote each line of the synonym file, join the lines with commas,
# drop the trailing comma and lowercase everything.
states=$(sed 's/^.*$/"&"/' stateabbreviations.txt | tr '\n' ',' | sed 's/,$//' | awk '{ print tolower($0) }')

# Create the index with a synonym filter and an analyzer that uses it.
curl -XPUT "http://localhost:9200/statesabbreviation_index/" -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "statesabbr_synonym_filter": {
          "type": "synonym",
          "synonyms": [ '"$states"' ]
        }
      },
      "analyzer": {
        "statesabbr_synonyms": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "statesabbr_synonym_filter" ]
        }
      }
    }
  }
}'
Note that the analyzer also applies a lowercase filter and that all synonyms are converted to lowercase. Every text analyzed against this index is therefore lowercased before the synonyms are matched. The opposite can be achieved by switching awk's tolower to toupper and the lowercase filter to an uppercase filter.
The abbreviation file, stateabbreviations.txt, is of the form:
Alabama,Ala.=>AL
Alaska=>AK
Arizona,Ariz.=>AZ
Arkansas,Ark.=>AR
California,Calif.=>CA
Colorado,Colo.=>CO
Connecticut,Conn.=>CT
Delaware,Del.=>DE
Florida,Fla.=>FL
Georgia,Ga.=>GA
Hawaii,Hawaii=>HI
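If shell quoting gets in the way, the same settings body could be assembled from rule lines of this form in Python and serialized with the json module. This is a sketch rather than the method used for this post; build_synonym_settings and the sample list are names invented for the example, while the filter and analyzer names are the ones used in the index settings.

```python
import json


def build_synonym_settings(rule_lines):
    """Build the index settings body from 'A,B=>C' synonym rule lines."""
    # Lowercase every rule and skip blank lines.
    rules = [line.strip().lower() for line in rule_lines if line.strip()]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "statesabbr_synonym_filter": {
                        "type": "synonym",
                        "synonyms": rules,
                    }
                },
                "analyzer": {
                    "statesabbr_synonyms": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "statesabbr_synonym_filter"],
                    }
                },
            }
        }
    }


# Two rules copied from the abbreviation file, plus a blank line to show skipping.
sample = ["Alabama,Ala.=>AL", "Alaska=>AK", ""]
print(json.dumps(build_synonym_settings(sample), indent=2))
```

The printed JSON can be sent as the body of the index-creation request; json.dumps takes care of quoting and of the commas between rules.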
Now you can make a GET query through your browser:
http://localhost:9200/statesabbreviation_index/_analyze?analyzer=statesabbr_synonyms&text="CALIFORNIA is in United States of America and so is Maine"
Or query using curl in shell:
curl -XGET "http://localhost:9200/statesabbreviation_index/_analyze?analyzer=statesabbr_synonyms&text='CALIFORNIA%20IS%20in%20United%20States%20of%20America%20and%20so%20is%20Maine'"
Results:
{"tokens":[
  {"token":"ca","start_offset":1,"end_offset":11,"type":"SYNONYM","position":1},
  {"token":"is","start_offset":12,"end_offset":14,"type":"<ALPHANUM>","position":2},
  {"token":"in","start_offset":15,"end_offset":17,"type":"<ALPHANUM>","position":3},
  {"token":"usa","start_offset":18,"end_offset":42,"type":"SYNONYM","position":4},
  {"token":"and","start_offset":43,"end_offset":46,"type":"<ALPHANUM>","position":5},
  {"token":"so","start_offset":47,"end_offset":49,"type":"<ALPHANUM>","position":6},
  {"token":"is","start_offset":50,"end_offset":52,"type":"<ALPHANUM>","position":7},
  {"token":"me","start_offset":53,"end_offset":58,"type":"SYNONYM","position":8}
]}
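The offsets in this response make it possible to tie each synonym back to the phrase it replaced. The Python sketch below does that for a subset of the tokens shown above; synonym_hits is a hypothetical helper name, and the text includes the leading quote character because the offsets start at 1.

```python
def synonym_hits(text, analyze_response):
    """Pair each SYNONYM token with the original span of text it replaced."""
    hits = []
    for tok in analyze_response["tokens"]:
        if tok["type"] == "SYNONYM":
            original = text[tok["start_offset"]:tok["end_offset"]]
            hits.append((original, tok["token"]))
    return hits


# The analyzed text, including the quote that occupies offset 0.
text = "'CALIFORNIA IS in United States of America and so is Maine'"

# A subset of the tokens from the results above.
response = {"tokens": [
    {"token": "ca", "start_offset": 1, "end_offset": 11, "type": "SYNONYM", "position": 1},
    {"token": "is", "start_offset": 12, "end_offset": 14, "type": "<ALPHANUM>", "position": 2},
    {"token": "usa", "start_offset": 18, "end_offset": 42, "type": "SYNONYM", "position": 4},
    {"token": "me", "start_offset": 53, "end_offset": 58, "type": "SYNONYM", "position": 8},
]}

for original, canonical in synonym_hits(text, response):
    print(f"{original} -> {canonical}")
```

Tokens of type <ALPHANUM> pass through unchanged, so filtering on SYNONYM leaves only the spans that gained context.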
Names:
I also created a first-name index of hypocoristics and diminutives that can be queried the same way as above. Its map file is of the form:
Frederick=>Fred,Freddy,Rick,Fritz,Freddie
A browser query of the form:
http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'
Or query using curl in shell:
curl -XGET "http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'"
Resulted in:
{"tokens":[
  {"token":"fred","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},
  {"token":"freddy","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},
  {"token":"rick","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},
  {"token":"fritz","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},
  {"token":"freddie","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1}
]}
Note that this first-name index takes the formal given name and returns its other names. To search by a diminutive such as freddie, a reverse index needs to exist. You can create such a reverse lookup index using a many-to-one map, e.g.:
Fred,Freddy,Rick,Fritz,Freddie=>Frederick
However, there’s a caveat: some diminutives refer to more than one given name. For example, Frank could refer to Francesco, Francis or Franklin.
Frances=>Fran,Franny,Fanny,Frankie
Francesca=>Fran,Franny,Fanny
Francesco=>Fran,Frank,Frankie
Francis=>Fran,Frank,Frankie
Franklin=>Frank
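One way to handle this ambiguity is to invert the map so that each diminutive points to every given name it may stand for. The Python sketch below is an assumption about how such a reverse map could be built, not what was done for the index above; reverse_lookup and to_rules are hypothetical helper names.

```python
from collections import defaultdict


def reverse_lookup(rule_lines):
    """Invert 'Name=>Dim1,Dim2' rules into {diminutive: [given names]}."""
    reverse = defaultdict(set)
    for line in rule_lines:
        name, _, diminutives = line.partition("=>")
        for diminutive in diminutives.split(","):
            if diminutive.strip():
                reverse[diminutive.strip().lower()].add(name.strip().lower())
    return {dim: sorted(names) for dim, names in reverse.items()}


def to_rules(reverse):
    """Render the reverse map back into 'dim=>name1,name2' rule lines."""
    return [f"{dim}=>{','.join(names)}" for dim, names in sorted(reverse.items())]


rules = [
    "Frances=>Fran,Franny,Fanny,Frankie",
    "Francesca=>Fran,Franny,Fanny",
    "Francesco=>Fran,Frank,Frankie",
    "Francis=>Fran,Frank,Frankie",
    "Franklin=>Frank",
]
for line in to_rules(reverse_lookup(rules)):
    print(line)
```

Each diminutive now expands to all of its candidate given names, so a search for frank can match Francesco, Francis and Franklin instead of being forced onto a single name.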
How can you exploit these synonym capabilities programmatically?
<?php
// Query the first-name analyzer and print the synonym tokens it returns.
function restgetcall()
{
    $url = "http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'";
    $headers = array(
        'Accept: application/json',
        'Content-Type: application/json',
    );

    $handle = curl_init();
    curl_setopt($handle, CURLOPT_URL, $url);
    curl_setopt($handle, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_HTTPGET, true);

    $response = curl_exec($handle);
    $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);

    if ($response === false) {
        $info = curl_getinfo($handle);
        curl_close($handle);
        // Pass true so var_export returns the string instead of printing it.
        die('error occurred during curl exec. Additional info: ' . var_export($info, true));
    }
    curl_close($handle);

    // Stop on any non-200 HTTP status.
    if ($code !== 200) {
        die('error occurred: HTTP status ' . $code);
    }

    $decoded = json_decode($response);
    if (isset($decoded->response->status) && $decoded->response->status == 'ERROR') {
        die('error occurred: ' . $decoded->response->errormessage);
    }
    echo 'response ok!';
    print_r($decoded);
}

restgetcall();
?>
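For comparison, the same call can be sketched in Python with only the standard library. It assumes an Elasticsearch version that still accepts analyzer and text as query parameters, as the examples in this post do; the function names and the timeout default are invented for this sketch.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_analyze_url(base_url, index, analyzer, text):
    """Build an _analyze URL that passes analyzer and text as query parameters."""
    # quote_via=quote encodes spaces as %20 rather than '+'.
    query = urlencode({"analyzer": analyzer, "text": text}, quote_via=quote)
    return f"{base_url}/{index}/_analyze?{query}"


def tokens_from_response(raw_json):
    """Return the token strings from an _analyze JSON response body."""
    return [tok["token"] for tok in json.loads(raw_json)["tokens"]]


def analyze(base_url, index, analyzer, text, timeout=5):
    """Call the endpoint and return its tokens (needs a running cluster)."""
    url = build_analyze_url(base_url, index, analyzer, text)
    with urlopen(url, timeout=timeout) as resp:
        return tokens_from_response(resp.read().decode("utf-8"))


# Example, assuming the index from this post exists on localhost:
# analyze("http://localhost:9200", "firstnames_index", "firstnames_synonyms", "Frederick")
print(build_analyze_url("http://localhost:9200", "firstnames_index",
                        "firstnames_synonyms", "Frederick"))
```

Building the query string with urlencode avoids hand-encoding spaces, which is where the raw curl commands are easiest to get wrong.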
References:
Other topics:
Phonetic engines as an alternative to Elasticsearch fuzzy matching.
Elasticsearch word stemming - algorithmic, dictionary, hunspell
Partial matching and postal codes - https://www.elastic.co/guide/en/elasticsearch/guide/master/_postcodes_and_structured_data.html