Tuesday, October 20, 2015

Indexing and Analyzing Hypocoristics, Diminutives and Synonyms with ElasticSearch

Hypocoristics, Diminutives and Synonyms with ElasticSearch/Lucene

Intro


Indexing and searching social data introduces two problems: matching colloquial names to formal names, and matching synonyms or shorthand to full words. Colloquial words may include a combination of hypocoristics, diminutives and synonyms. For example, California may appear as Golden State, Cali or CA.
We shall discuss the process of building a contextualized data ecosystem using Elasticsearch, and look at synonym matching as a better tool than fuzzy search for contextualizing raw data.

Fuzzy search



Fuzzy search uses the string edit distance to determine the closeness between two strings. This mechanism is best suited to finding indexed words that may have been misspelled. For example, a search engine can suggest words as you type; the suggestions are fuzzy-matched against an index.
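To make the edit-distance idea concrete, here is a minimal Python sketch of the Levenshtein distance, one common string edit distance. The function name edit_distance and the sample words are illustrative assumptions; this is not how Elasticsearch computes fuzziness internally, only a way to see the numbers involved.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character inserts,
    deletes and substitutions needed to turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("califronia", "california"))  # a typo
    print(edit_distance("cali", "california"))        # a shorthand
```

A typo such as "califronia" sits only two edits away from "california", while the shorthand "cali" is six edits away. That gap is why fuzzy matching handles misspellings well but is a poor fit for abbreviations and nicknames.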


Synonym Matching



Synonym matching uses an index to store all the possible matches for a word or phrase. This requires an analyst to understand their data well enough to construct the synonym mappings. When dealing with state names, for instance, you can map abbreviations, popular names and formal names to one another, such as CA, California, Golden State and Cali.
While your data (social media feeds etc.) is being indexed, the analyzer can reference the synonym mappings to add more context to it; hence contextualization.
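As a rough illustration of that contextualization step, the Python sketch below normalizes variants of a state name to a single canonical token before the text would be indexed. The STATE_SYNONYMS table and the contextualize helper are hypothetical names for this example; Elasticsearch performs the equivalent work inside its analysis chain.

```python
import re

# Hypothetical synonym table: every known variant maps to one canonical token.
STATE_SYNONYMS = {
    "california": "ca",
    "golden state": "ca",
    "cali": "ca",
    "ca": "ca",
}


def contextualize(text: str) -> str:
    """Replace each known variant with its canonical token (case-insensitive)."""
    # Try longer phrases first so "golden state" wins over shorter variants.
    variants = sorted(STATE_SYNONYMS, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: STATE_SYNONYMS[m.group(1).lower()], text)


if __name__ == "__main__":
    print(contextualize("Moving to the Golden State, Cali here I come"))
```

The word-boundary anchors keep unrelated words such as "Cabin" untouched, which is the same reason the synonym filter operates on whole tokens rather than substrings.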


Lucene synonym handling


See my blog ETL on the fly Hive Lucene integration.


Elasticsearch synonym handling



I created a state-name index containing each state's abbreviations and shortened postal names. The index lives in Elasticsearch.


I created a state name synonym map using this shell script :

states=$(sed 's/^.*$/"&"/' stateabbreviations.txt | tr '\n' ',' | sed 's/,$//' | awk '{ print tolower($0) }')

 curl -XPUT "http://localhost:9200/statesabbreviation_index/" -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "statesabbr_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            '"$states"'
          ]
        }
      },
      "analyzer": {
        "statesabbr_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "statesabbr_synonym_filter"
          ]
        }
      }
    }
  }
}'


Note that we are also using a lowercase filter and that all synonyms are converted to lowercase. This means each query to this index is lowercased before it is matched, because the indexed synonyms are in lowercase. The opposite can be achieved by switching the awk call to toupper($0) and the lowercase filter to uppercase.


The abbreviation file, stateabbreviations.txt, is of the form:

Alabama,Ala.=>AL
Alaska=>AK
Arizona,Ariz.=>AZ
Arkansas,Ark.=>AR
California,Calif.=>CA
Colorado,Colo.=>CO
Connecticut,Conn.=>CT
Delaware,Del.=>DE
Florida,Fla.=>FL
Georgia,Ga.=>GA
Hawaii,Hawaii=>HI
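For readers who prefer to avoid the sed/tr/awk pipeline, the same synonym list can be assembled in a few lines of Python. This is a sketch under assumptions: build_synonym_settings is a hypothetical helper, and the settings structure simply mirrors the curl body shown earlier.

```python
import json


def build_synonym_settings(rule_lines):
    """Turn lines like 'California,Calif.=>CA' into the index settings body."""
    # Lowercase every rule because the analyzer lowercases tokens before
    # the synonym filter sees them; blank lines are skipped.
    rules = [line.strip().lower() for line in rule_lines if line.strip()]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "statesabbr_synonym_filter": {
                        "type": "synonym",
                        "synonyms": rules,
                    }
                },
                "analyzer": {
                    "statesabbr_synonyms": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "statesabbr_synonym_filter"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    sample = ["Alaska=>AK", "California,Calif.=>CA", ""]
    print(json.dumps(build_synonym_settings(sample), indent=2))
```

The printed JSON can be used as the body of the PUT request that creates the index. Letting json.dumps do the quoting avoids the trailing-comma handling that the shell version needs.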

Now you can make a GET query through your browser:

http://localhost:9200/statesabbreviation_index/_analyze?analyzer=statesabbr_synonyms&text="CALIFORNIA is in United States of America and so is Maine"


Or query using curl in shell:

curl -XGET "http://localhost:9200/statesabbreviation_index/_analyze?analyzer=statesabbr_synonyms&text='CALIFORNIA%20is%20in%20United%20States%20of%20America%20and%20so%20is%20Maine'"

where the text query is “CALIFORNIA is in United States of America and so is Maine”, and the names CALIFORNIA, United States of America and Maine are expected to return their indexed synonyms.
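Hand-encoding spaces as %20 is error-prone, so it can help to let a library build the query string. The sketch below uses Python's standard urllib.parse module; the analyze_url helper name and its default host are assumptions for illustration, and the function only builds the URL without contacting the server.

```python
from urllib.parse import quote, urlencode


def analyze_url(index: str, analyzer: str, text: str,
                host: str = "http://localhost:9200") -> str:
    """Build an _analyze URL with the text parameter percent-encoded."""
    # quote_via=quote encodes spaces as %20 rather than '+'.
    query = urlencode({"analyzer": analyzer, "text": text}, quote_via=quote)
    return f"{host}/{index}/_analyze?{query}"


if __name__ == "__main__":
    print(analyze_url(
        "statesabbreviation_index",
        "statesabbr_synonyms",
        "CALIFORNIA is in United States of America and so is Maine",
    ))
```

The resulting string can be pasted into a browser or passed to curl as-is.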


Results:


{"tokens":[
{"token":"ca","start_offset":1,"end_offset":11,"type":"SYNONYM","position":1},
{"token":"is","start_offset":12,"end_offset":14,"type":"<ALPHANUM>","position":2},
{"token":"in","start_offset":15,"end_offset":17,"type":"<ALPHANUM>","position":3},
{"token":"usa","start_offset":18,"end_offset":42,"type":"SYNONYM","position":4},
{"token":"and","start_offset":43,"end_offset":46,"type":"<ALPHANUM>","position":5},
{"token":"so","start_offset":47,"end_offset":49,"type":"<ALPHANUM>","position":6},
{"token":"is","start_offset":50,"end_offset":52,"type":"<ALPHANUM>","position":7},
{"token":"me","start_offset":53,"end_offset":58,"type":"SYNONYM","position":8}]}
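Once the response is in hand, picking out the substituted terms is a matter of filtering on the token type. The following Python sketch runs against a shortened copy of the response above; synonym_tokens is a hypothetical helper name.

```python
import json


def synonym_tokens(response_text: str):
    """Return (token, start_offset, end_offset) for every SYNONYM token."""
    tokens = json.loads(response_text)["tokens"]
    return [
        (t["token"], t["start_offset"], t["end_offset"])
        for t in tokens
        if t["type"] == "SYNONYM"
    ]


if __name__ == "__main__":
    # Shortened copy of the _analyze response shown above.
    sample = """{"tokens":[
    {"token":"ca","start_offset":1,"end_offset":11,"type":"SYNONYM","position":1},
    {"token":"is","start_offset":12,"end_offset":14,"type":"<ALPHANUM>","position":2},
    {"token":"me","start_offset":53,"end_offset":58,"type":"SYNONYM","position":8}]}"""
    print(synonym_tokens(sample))
```

The offsets point back into the submitted text, so they can be used to annotate the original feed with the canonical names.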

Names:


I also created a first-name index of hypocoristics and diminutives that can be queried in the same way as above. Its map file is of the form:
Frederick=>Fred,Freddy,Rick,Fritz,Freddie


A browser query of the form:

http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'


Or query using curl in shell:

curl -XGET "http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'"


Resulted in:


{"tokens":[{"token":"fred","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"freddy","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"rick","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"fritz","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"freddie","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1}]}

Note that this first-name index takes the actual given name and returns its other names. To search by a diminutive such as freddie, a reverse index needs to exist. You can create such a reverse lookup index using a many-to-one map, e.g.:

Fred,Freddy,Rick,Fritz,Freddie=>Frederick

However, there’s a caveat: some diminutives refer to more than one given name. For example, Frank could refer to Francesco, Francis or Franklin, as these mappings show:


Frances=>Fran,Franny,Fanny,Frankie
Francesca=>Fran,Franny,Fanny
Francesco=>Fran,Frank,Frankie
Francis=>Fran,Frank,Frankie
Franklin=>Frank
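One way to handle this ambiguity is to invert the map so that each diminutive lists every given name it may stand for. The Python sketch below does this for the rules above; invert_rules is a hypothetical helper, and emitting one-to-many reverse rules is an assumption about how you might want to index them, not something the index described here already does.

```python
from collections import defaultdict


def invert_rules(rule_lines):
    """Map each diminutive to the sorted list of given names it can refer to."""
    reverse = defaultdict(set)
    for line in rule_lines:
        line = line.strip()
        if not line:
            continue
        given, _, diminutives = line.partition("=>")
        for name in diminutives.split(","):
            reverse[name.strip().lower()].add(given.strip().lower())
    return {dim: sorted(names) for dim, names in reverse.items()}


if __name__ == "__main__":
    rules = [
        "Frances=>Fran,Franny,Fanny,Frankie",
        "Francesca=>Fran,Franny,Fanny",
        "Francesco=>Fran,Frank,Frankie",
        "Francis=>Fran,Frank,Frankie",
        "Franklin=>Frank",
    ]
    for diminutive, names in sorted(invert_rules(rules).items()):
        # Emit reverse rules in the same 'left=>right' synonym format.
        print(f"{diminutive}=>{','.join(names)}")
```

With this shape, a search for frank expands to all three candidate given names, and it is left to the application to rank or disambiguate them.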


How can you exploit these synonym capabilities programmatically?

Unfortunately, I did not find a stable php-elasticsearch package at the time of writing. However, if you're familiar with traversing JSON, you should have no problem making HTTP(S) queries with curl or JavaScript. Here is an example using the PHP curl API:



<?php
// Query the _analyze endpoint and print the decoded token list.
function restgetcall() {
    $url = "http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'";
    $headers = array(
        'Accept: application/json',
        'Content-Type: application/json',
    );

    $handle = curl_init();
    curl_setopt($handle, CURLOPT_URL, $url);
    curl_setopt($handle, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($handle, CURLOPT_HTTPGET, true);

    $response = curl_exec($handle);
    $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);

    if ($response === false) {
        $info = curl_getinfo($handle);
        curl_close($handle);
        // Pass true so var_export returns the string instead of printing it.
        die('error occurred during curl exec. Additional info: ' . var_export($info, true));
    }
    curl_close($handle);

    // Treat any HTTP error status as a failed request.
    if ($code >= 400) {
        die('error occurred (HTTP ' . $code . '): ' . $response);
    }

    $decoded = json_decode($response);
    echo 'response ok!';
    print_r($decoded);
}

restgetcall();
?>

Other topics:

Phonetic engines as an alternative to elasticsearch fuzzy matching.
Elasticsearch word stemming - algorithmic,dictionary,hunspell
