Hive Lucene Integration
Summary:
This topic looks at integrating Hive and Lucene, two technologies that together allow for a fast and cheap automated ETL process: for example, normalizing data into a standard format on the fly before matching it against data in a Hive repository. Non-standard data includes diminutives, hypocoristics, and acronyms in people’s names and street addresses.
Some form of ETL may be required when importing third-party data into your data warehouse, and if you have multiple third-party sources, the need for automated ETL becomes apparent. The transformation/translation phase can be streamlined using Lucene technologies. Hive supports many UDFs, including Java regular expressions, but to extend HQL capabilities further, a more involved UDF implementation may be necessary.
This topic will show a simple Hive UDF example that matches date strings with different formats using Java regular expressions. The second example matches street-name abbreviations by incorporating Lucene Solr synonyms into a UDF.
You may also visit the project on GitHub here.
Compare dates example:
If you have two data sources containing date columns that you want to match, you can implement a UDF that anticipates different date formats and even uses parts of the date (such as the year) to perform a match.
We implemented a class CompareYearInDateString that extends the org.apache.hadoop.hive.ql.exec.UDF class, which requires a public evaluate() method. The method allows you to pass more than one argument and return a user-defined type, in our case a Boolean. If either string is empty, it returns false. Otherwise it extracts the year string from both dateStr arguments by calling getYear(), which first calls getDatePattern() and then calls getYearFromDate() with the resulting Pattern object and the date string as arguments. The getDatePattern() method can be made more robust as you learn about potential date formats, such as international formats and/or different delimiters.
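A minimal sketch of what such a UDF might look like follows; only the class and method names come from the description above, while the regular expression and exact helper bodies are our assumptions:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Sketch only; the project code may differ in detail.
public class CompareYearInDateString extends UDF {

    // Assumed pattern: a 19xx/20xx year anywhere in the date string.
    private static final Pattern YEAR_PATTERN = Pattern.compile("(19|20)\\d{2}");

    public Boolean evaluate(Text first, Text second) {
        if (first == null || second == null
                || first.toString().isEmpty() || second.toString().isEmpty()) {
            return false; // empty input never matches
        }
        String year1 = getYear(first.toString());
        String year2 = getYear(second.toString());
        return year1 != null && year1.equals(year2);
    }

    private String getYear(String dateStr) {
        Pattern pattern = getDatePattern();
        return getYearFromDate(pattern, dateStr);
    }

    // Extend this as more date formats and delimiters are discovered.
    private Pattern getDatePattern() {
        return YEAR_PATTERN;
    }

    private String getYearFromDate(Pattern pattern, String dateStr) {
        Matcher matcher = pattern.matcher(dateStr);
        return matcher.find() ? matcher.group() : null;
    }
}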
Compare Street names example:
We anticipated finding addresses with ‘st’ in place of ‘street’, ‘ave’ in place of ‘avenue’, and so on. It would be a daunting task to incorporate all the different abbreviation combinations in a SQL statement, so it makes sense to create a UDF that can transform a non-standard street string into standard form. Lucene analyzers are well suited to detecting and replacing these abbreviations.
There are two approaches to using analyzers in Hive MapReduce tasks, or any other parallel process: a RESTful index such as Elasticsearch or Solr, or an in-memory analyzer. We felt that a RESTful index would burden the network with more traffic, so we settled on a standalone in-memory Analyzer as a singleton per Hadoop JVM. Since a JVM container can be recycled for new MapReduce tasks, this approach is extensible as long as the synonym data remains moderately small.
As in the previous example, the class CompareStreetNames extends the UDF class and implements an evaluate() method, where the Text arguments are transformed using a custom AnalyzeUtil.analyzeString() method before being compared to each other.
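A hedged sketch of this UDF, assuming the AnalyzeUtil helper described next (the case-insensitive comparison and error handling are our assumptions):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Sketch only: normalize both street strings, then compare them.
public class CompareStreetNames extends UDF {

    public Boolean evaluate(Text first, Text second) {
        if (first == null || second == null) {
            return false;
        }
        try {
            String left = AnalyzeUtil.analyzeString(first.toString());
            String right = AnalyzeUtil.analyzeString(second.toString());
            return left.equalsIgnoreCase(right);
        } catch (java.io.IOException e) {
            throw new RuntimeException("Street name analysis failed", e);
        }
    }
}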
The AnalyzeUtil analyzeString() method transforms a string using a custom Analyzer, a SynonymAnalyzer instance, which tokenizes a string reader into transformed term-attribute tokens. We iterate over the tokens to build the transformed string, as sketched below.
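A sketch of the helper, assuming a Lucene 4.x-era API (the field name passed to tokenStream() is arbitrary here):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class AnalyzeUtil {

    private AnalyzeUtil() {
    }

    // Tokenizes the input with the custom SynonymAnalyzer and rebuilds
    // the string from the transformed term attributes.
    public static String analyzeString(String input) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Analyzer analyzer = new SynonymAnalyzer();
             TokenStream stream = analyzer.tokenStream("street", new StringReader(input))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(term.toString());
            }
            stream.end();
        }
        return out.toString();
    }
}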
SynonymAnalyzer is instantiated with each transformation request. Its overridden createComponents() method instantiates a SynonymFilter with a SynonymMap object and a whitespace tokenizer. The tokenizer tells the filter how to break up a subject string, while the SynonymMap instance contains the key/value pairs to match tokenized synonyms against.
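A sketch of the analyzer, again assuming a Lucene 4.x-era API (the createComponents() signature and SynonymFilter changed in later Lucene releases, and SynonymMapLoader.getInstance() is our assumed accessor for the singleton described next):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;

public class SynonymAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // The whitespace tokenizer decides how the subject string is split.
        Tokenizer tokenizer = new WhitespaceTokenizer(reader);
        // The SynonymFilter rewrites tokens against the in-memory SynonymMap.
        SynonymMap map = SynonymMapLoader.getInstance().getSynonymMap();
        TokenStream filtered = new SynonymFilter(tokenizer, map, true); // ignore case
        return new TokenStreamComponents(tokenizer, filtered);
    }
}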
Construction of the synonym map happens in the constructSynonymMap() method, which creates a singleton SynonymMapLoader that implements the getSynonymMap() method. This singleton creates and keeps a copy of the SynonymMap in memory per JVM, thereby removing I/O constraints.
In the getSynonymMap() method we start by parsing a file containing Solr synonym mappings and building a SynonymMap object using the SolrSynonymParser class from the lucene-analyzers-common API. Something to note: instantiating SolrSynonymParser requires an analyzer, which is not our custom analyzer (we used SimpleAnalyzer) and should be closed before exiting the method. The synonym map file named by the static variable LOCAL_SYNONYM_MAP is a jar resource containing data in the Solr synonym format, which is a many-to-one mapping (a loader sketch follows this excerpt):
ARC => ARCADE
AVE => AVENUE
BLVD => BOULEVARD
STR , ST => STREET
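Given a resource like the excerpt above, here is a hedged sketch of the loader; the resource name, getInstance() accessor, and error handling are our assumptions, and the SolrSynonymParser usage follows a Lucene 4.x-era API:

import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.synonym.SolrSynonymParser;
import org.apache.lucene.analysis.synonym.SynonymMap;

public final class SynonymMapLoader {

    // Jar resource in Solr synonym format (assumed name).
    private static final String LOCAL_SYNONYM_MAP = "/synonyms.txt";

    // One instance, and therefore one SynonymMap, per JVM.
    private static final SynonymMapLoader INSTANCE = new SynonymMapLoader();

    private final SynonymMap synonymMap;

    private SynonymMapLoader() {
        this.synonymMap = constructSynonymMap();
    }

    public static SynonymMapLoader getInstance() {
        return INSTANCE;
    }

    public SynonymMap getSynonymMap() {
        return synonymMap;
    }

    private static SynonymMap constructSynonymMap() {
        // SolrSynonymParser needs its own analyzer (not the custom one);
        // SimpleAnalyzer is used here and closed before the method exits.
        Analyzer analyzer = new SimpleAnalyzer();
        try (InputStream in = SynonymMapLoader.class.getResourceAsStream(LOCAL_SYNONYM_MAP)) {
            SolrSynonymParser parser = new SolrSynonymParser(true, true, analyzer);
            parser.parse(new InputStreamReader(in, StandardCharsets.UTF_8));
            return parser.build();
        } catch (Exception e) {
            throw new IllegalStateException("Could not load synonym map", e);
        } finally {
            analyzer.close();
        }
    }
}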
An alternative to the Solr synonym format is the WordNet format, which also has a parser implemented in Lucene.
Creating the UDFs in Hive:
This part is straightforward. You can either create a fat jar containing your custom implementation and the Lucene classes, or create a thin jar of your implementation and copy the Lucene jars into the Hive lib directory. See our Maven POM.
We implemented a fat jar, which was loaded into HDFS:
hadoop fs -put ~/ritho-hive-0.0.1-dist.jar /apps/gdt/
Then create the UDFs:
echo "CREATE FUNCTION CompareYearInDateString AS 'com.ritho.hadoop.hive.udf.CompareYearInDateString' USING JAR 'hdfs:///apps/gdt/ritho-hive-0.0.1-dist.jar' ;" | hive
echo "CREATE FUNCTION CompareStreetNames AS 'com.ritho.hadoop.hive.udf.CompareStreetNames' USING JAR 'hdfs:///apps/gdt/ritho-hive-0.0.1-dist.jar' ;" | hive
Testing the CompareYearInDateString class:
echo "select * from (select CompareYearInDateString('1922/12/12', '19221101’) ) x;"|hive
Another example testing the CompareYearInDateString class:
echo "select * from (select CompareYearInDateString('1922/12/12', '1922’) ) x;"|hive
Testing the CompareStreetNames class:
echo "select * from (select CompareStreetNames('88 Mountain View Street STE 10', '88 Mtn View St Suite 10') ) x;"|hive
Conclusion:
These UDFs made our SQL statements clean and easy to read. Most importantly, they let us enhance Hive's functionality. It was also easy to implement the Lucene analyzer. The overall result is a flexible architecture that DBAs and analysts can exploit.
Another example of how to exploit this architecture is person-name internationalization or standardization (e.g., José-to-Joseph conversion or Jim-to-James mapping). Text similarity scoring in HQL can also be adapted in the same manner.
References
https://hive.apache.org
http://lucene.apache.org
http://lucene.apache.org/solr/