Flexing Data: 2015

Tuesday, October 20, 2015

Indexing and Analyzing Hypocoristics, Diminutives and Synonyms with ElasticSearch

Hypocoristics, Diminutives and Synonyms with ElasticSearch/Lucene

Intro

Indexing and searching social data introduces two problems: matching colloquial names to formal names, and matching synonyms or shorthand to full words. Colloquial words may combine hypocoristics, diminutives and synonyms. For example, California may appear as Golden State, Cali or CA.

We shall discuss how to build a contextualized data ecosystem using Elasticsearch, and why synonym matching is a better tool than fuzzy search for contextualizing raw data.

Fuzzy search

Fuzzy search uses the string edit distance to determine the closeness between two strings. Ideally, this mechanism is best used for searching for possibly misspelled indexed words. For example, a search engine can suggest words as you type along. The suggestions are fuzzy matched against an indexing engine.
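To make the idea of edit distance concrete, here is a minimal sketch of the classic Levenshtein distance in plain Python. It is an illustration only and is not how Elasticsearch or Lucene implement fuzzy matching internally; the function name edit_distance is made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("californa", "california"))  # a misspelling: small distance
    print(edit_distance("cali", "california"))       # a nickname: large distance
```

A misspelling such as "californa" is one edit away from "california", while the nickname "cali" is six edits away, which is why fuzzy search suits typos better than colloquial names.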

Synonym Matching

Synonym matching uses an index to store all the known equivalents of a word or phrase. This requires an analyst to understand the data well enough to construct the synonym mappings. For state names, you can map abbreviations, popular names and formal names to one another, such as CA, California, Golden State and Cali.

While indexing your data (social media feeds, etc.), the analyzer can reference the synonym mappings to add more context to your data. Hence contextualization.
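The idea can be sketched outside any search engine with a small lookup table. The following Python is a toy illustration, assuming a hand-written SYNONYMS table and whitespace tokenization; the names SYNONYMS and contextualize are hypothetical and this is not the Lucene synonym filter itself.

```python
# Hypothetical synonym table: each key is a tuple of lowercase tokens.
SYNONYMS = {
    ("golden", "state"): "california",
    ("cali",): "california",
    ("ca",): "california",
    ("california",): "california",
}
MAX_PHRASE_LEN = max(len(key) for key in SYNONYMS)


def contextualize(text):
    """Replace known synonyms (longest phrase first) with their formal name."""
    tokens = text.lower().split()
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for size in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + size])
            if phrase in SYNONYMS:
                result.append(SYNONYMS[phrase])
                i += size
                break
        else:
            # No synonym starts here: keep the token unchanged.
            result.append(tokens[i])
            i += 1
    return result


if __name__ == "__main__":
    print(contextualize("Moving to the Golden State"))
```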

Lucene synonym handling

See my post ETL on the fly with BIG Data: A Hive Lucene Integration.

Elasticsearch synonym handling

I created a state name index with the states' respective abbreviations and shortened postal names. The index lives in Elasticsearch at:

http://localhost:9200/statesabbreviation_index

I created a state name synonym map using this shell script :

states=`cat stateabbreviations.txt | sed  's/^.*$/"&"/g'  | tr '\n' ',' |sed 's/.\w*$//' | awk '{ print tolower($0) }'` 

 curl -XPUT "http://localhost:9200/statesabbreviation_index/" -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "statesabbr_synonym_filter": {
          "type": "synonym",
   "synonyms": [
           '"$states"' 
          ]
        }
      },
      "analyzer": {
        "statesabbr_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "statesabbr_synonym_filter"
          ]
        }
      }
    }
  }
}'

Note that we are also using a lowercase filter and that all synonyms are converted to lowercase. Text analyzed against this index is therefore lowercased before synonym matching, which makes the matching case-insensitive. The opposite can be achieved by changing the awk call and the filter to their uppercase counterparts.

The abbreviation file, stateabbreviations.txt, is of the form:

Alabama,Ala.=>AL
Alaska=>AK
Arizona,Ariz.=>AZ
Arkansas,Ark.=>AR
California,Calif.=>CA
Colorado,Colo.=>CO
Connecticut,Conn.=>CT
Delaware,Del.=>DE
Florida,Fla.=>FL
Georgia,Ga.=>GA
Hawaii,Hawaii=>HI
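For readers who prefer not to assemble JSON with sed and awk, the same request body can be sketched in Python. This is only an illustration under the assumption that the rules file has one rule per line as shown above; build_index_settings is a hypothetical helper name, and the filter and analyzer names are the ones used in the curl command.

```python
import json


def build_index_settings(rule_lines):
    """Build the index settings body from synonym rule lines, lowercased."""
    synonyms = [line.strip().lower() for line in rule_lines if line.strip()]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "statesabbr_synonym_filter": {
                        "type": "synonym",
                        "synonyms": synonyms,
                    }
                },
                "analyzer": {
                    "statesabbr_synonyms": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "statesabbr_synonym_filter"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    rules = ["Alabama,Ala.=>AL", "Alaska=>AK"]
    # The printed JSON is what would be sent as the body of the PUT request.
    print(json.dumps(build_index_settings(rules), indent=2))
```

Letting json.dumps do the quoting avoids the trailing-comma and escaping issues that the shell pipeline has to work around.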

Now you can make a GET query through your browser:

http://localhost:9200/statesabbreviation_index/_analyze?analyzer=statesabbr_synonyms&text="CALIFORNIA is in United States of America and so is Maine"

Or query using curl in shell:

curl -XGET "http://localhost:9200/statesabbreviation_index/_analyze?analyzer=statesabbr_synonyms&text='CALIFORNIA%20IS%20in%20United%20States%20of%20America%20and%20so%20is%20Maine'"

where the text query is “CALIFORNIA is in United States of America and so is Maine” and the place names (CALIFORNIA, United States of America and Maine) are expected to return indexed synonyms.
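Hand-encoding spaces as %20 is error-prone, so the query URL can also be built programmatically. This is a minimal sketch using only the Python standard library, assuming the query-string form of the _analyze endpoint used in this post; analyze_url is a hypothetical helper name.

```python
from urllib.parse import quote, urlencode


def analyze_url(index, analyzer, text, base="http://localhost:9200"):
    """Build an _analyze URL with the text percent-encoded (spaces become %20)."""
    query = urlencode({"analyzer": analyzer, "text": text}, quote_via=quote)
    return f"{base}/{index}/_analyze?{query}"


if __name__ == "__main__":
    print(analyze_url(
        "statesabbreviation_index",
        "statesabbr_synonyms",
        "CALIFORNIA is in United States of America and so is Maine",
    ))
```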

Results:

{"tokens":[
{"token":"ca","start_offset":1,"end_offset":11,"type":"SYNONYM","position":1},
{"token":"is","start_offset":12,"end_offset":14,"type":"<ALPHANUM>","position":2},
{"token":"in","start_offset":15,"end_offset":17,"type":"<ALPHANUM>","position":3},
{"token":"usa","start_offset":18,"end_offset":42,"type":"SYNONYM","position":4},
{"token":"and","start_offset":43,"end_offset":46,"type":"<ALPHANUM>","position":5},
{"token":"so","start_offset":47,"end_offset":49,"type":"<ALPHANUM>","position":6},
{"token":"is","start_offset":50,"end_offset":52,"type":"<ALPHANUM>","position":7},
{"token":"me","start_offset":53,"end_offset":58,"type":"SYNONYM","position":8}]}
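The response is plain JSON, so picking out the synonym hits takes only a few lines. The sketch below assumes the response shape shown above (a "tokens" list whose entries carry "token" and "type"); synonym_tokens is a hypothetical helper name.

```python
import json


def synonym_tokens(response_text):
    """Return the tokens that the analyzer marked as SYNONYM, in order."""
    tokens = json.loads(response_text)["tokens"]
    return [entry["token"] for entry in tokens if entry["type"] == "SYNONYM"]


if __name__ == "__main__":
    sample = (
        '{"tokens":['
        '{"token":"ca","start_offset":1,"end_offset":11,"type":"SYNONYM","position":1},'
        '{"token":"is","start_offset":12,"end_offset":14,"type":"<ALPHANUM>","position":2},'
        '{"token":"me","start_offset":53,"end_offset":58,"type":"SYNONYM","position":8}]}'
    )
    print(synonym_tokens(sample))
```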

Names:

I also created a first-name index of hypocoristics and diminutives that can be queried the same way as above, where the map file is of the form:

Frederick=>Fred,Freddy,Rick,Fritz,Freddie

A browser query of the form:

http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'

Or query using curl in shell:

curl -XGET "http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'"

Resulted in:

{"tokens":[{"token":"fred","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"freddy","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"rick","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"fritz","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"freddie","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1}]}

Note that this first-name index is keyed on the formal given name and returns its other names. To search by a diminutive such as freddie, a reverse index needs to exist. You can create such a reverse lookup index using a many-to-one map, i.e.:

Fred,Freddy,Rick,Fritz,Freddie=>Frederick

However, there’s a caveat: some diminutives refer to more than one given name. For example, Frank could refer to Francesco, Francis or Franklin.

Frances=>Fran,Franny,Fanny,Frankie

Francesca=>Fran,Franny,Fanny

Francesco=>Fran,Frank,Frankie

Francis=>Fran,Frank,Frankie

Franklin=>Frank
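The ambiguity is easy to see once the map is inverted. Here is a small Python sketch that builds a diminutive-to-given-names lookup from the rules above; reverse_index is a hypothetical helper name, and the approach of returning every candidate name is one possible way to handle the caveat, not something the index does for you.

```python
from collections import defaultdict

RULES = [
    "Frances=>Fran,Franny,Fanny,Frankie",
    "Francesca=>Fran,Franny,Fanny",
    "Francesco=>Fran,Frank,Frankie",
    "Francis=>Fran,Frank,Frankie",
    "Franklin=>Frank",
]


def reverse_index(rules):
    """Map each lowercase diminutive to every given name that lists it."""
    index = defaultdict(list)
    for rule in rules:
        given_name, _, diminutives = rule.partition("=>")
        for diminutive in diminutives.split(","):
            index[diminutive.strip().lower()].append(given_name.strip())
    return dict(index)


if __name__ == "__main__":
    lookup = reverse_index(RULES)
    print(lookup["frank"])  # every given name that "Frank" may stand for
```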

How can you exploit these synonym capabilities programmatically?

Unfortunately there is no stable php-elasticsearch package out there. However, if you're familiar with traversing JSON, you should have no problem making HTTP(S) queries with curl or JavaScript. Here is an example using the PHP curl API:

<?php
// Query the _analyze endpoint and print the decoded synonym tokens.
function restgetcall() {
    $url = "http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'";
    $headers = array(
        'Accept: application/json',
        'Content-Type: application/json',
    );

    $handle = curl_init();
    curl_setopt($handle, CURLOPT_URL, $url);
    curl_setopt($handle, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_HTTPGET, true);

    $response = curl_exec($handle);
    $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);

    if ($response === false) {
        $info = curl_getinfo($handle);
        curl_close($handle);
        // var_export() only returns a string when its second argument is true.
        die('error occurred during curl exec. Additional info: ' . var_export($info, true));
    }
    curl_close($handle);

    // Treat a non-200 status or an undecodable body as a failure.
    $decoded = json_decode($response);
    if ($code != 200 || $decoded === null) {
        die('error occurred: HTTP status ' . $code);
    }
    echo 'response ok!';
    print_r($decoded);
}

restgetcall();
?>

References:

Lucene.Net – Custom Synonym Analyzer - CodeProject

Multiword Synonyms and Phrase Queries

http://en.wiktionary.org/wiki/Appendix:English_given_names

ETL on the fly with BIG Data: A Hive Lucene Integration

Hive Lucene Integration.

Summary:

This topic looks at integrating Hive and Lucene to allow for a fast and cheap automated ETL process. For example, contextualizing data into a standard format on the fly before matching it against data in a Hive repository. Non-standard data includes diminutives, hypocoristics and acronyms in people’s names and street addresses.

Some form of ETL may be required when importing third-party data into your data warehouse. And if you have multiple third-party sources, the need for automated ETL becomes apparent. The transformation/translation phase can be streamlined using Lucene technologies. Hive supports many UDFs, including Java regular expressions. But to extend HQL capabilities further, a more involved UDF implementation may be necessary.

This topic will show a simple Hive UDF example for matching date strings with different formats using Java regular expressions. The second example matches street name abbreviations by incorporating Lucene/Solr synonyms into a UDF.

You may also visit the project on github here

Compare dates example:

If you have two data sources containing date columns that you are interested in matching, you can implement a UDF that will anticipate different date formats and even use parts of the date (such as year) to perform a match.

We implemented a class, CompareYearInDateString, that extends the org.apache.hadoop.hive.ql.exec.UDF class, which requires a public evaluate() method. The method can take more than one argument and return a user-defined type; in our case, a Boolean. If either string is empty it returns false. Otherwise it extracts the year string from both dateStr arguments by calling getYear(), which first calls getDatePattern() and then calls getYearFromDate() with the Pattern object and the yearStr String as arguments. getDatePattern() can be made more robust as you learn about other potential date formats, such as international formats and/or delimiters.
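The logic described above can be sketched in a few lines of Python. This is not the project's Java UDF; it is an illustration of the same flow, and the three date patterns are assumptions chosen to cover the formats used in the test queries later in this post.

```python
import re

# Assumed patterns; each one captures the four-digit year in group 1.
YEAR_PATTERNS = [
    re.compile(r"^(\d{4})[/-]\d{1,2}[/-]\d{1,2}$"),  # 1922/12/12 or 1922-12-12
    re.compile(r"^(\d{4})\d{4}$"),                   # 19221101
    re.compile(r"^(\d{4})$"),                        # 1922
]


def get_year(date_str):
    """Return the year of the first matching pattern, or None if none match."""
    for pattern in YEAR_PATTERNS:
        match = pattern.match(date_str.strip())
        if match:
            return match.group(1)
    return None


def compare_year_in_date_string(first, second):
    """True only when both strings are non-empty and yield the same year."""
    if not first or not second:
        return False
    year_first = get_year(first)
    return year_first is not None and year_first == get_year(second)


if __name__ == "__main__":
    print(compare_year_in_date_string("1922/12/12", "19221101"))
```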

Compare Street names example:

We anticipated finding addresses with ‘st’ in place of ‘street’, ‘ave’ in place of ‘avenue’, etc. It would be a daunting task to incorporate all the different combinations of abbreviations in a SQL statement. Therefore it makes sense to create a UDF which can transform a non-standard street string into standard form. Lucene analyzers are well suited for detecting and replacing such abbreviations.

There are two approaches to implementing analyzers for Hive MapReduce tasks or any other parallel process: a RESTful index such as Elasticsearch or Solr, or an in-memory analyzer. We felt that a RESTful index would burden the network with more traffic. Therefore we settled for a standalone in-memory Analyzer as a singleton per Hadoop JVM. Since a JVM container can be recycled for new MapReduce tasks, this approach scales as long as the synonym data remains moderately small.

As in the previous example, the class CompareStreetNames extends the UDF class and implements an evaluate() method, where the Text arguments are transformed using a custom AnalyzeUtil.analyzeString() method before being compared to each other.

The AnalyzeUtil.analyzeString() method transforms a string by using a custom Analyzer, a SynonymAnalyzer instance, which tokenizes a string reader into transformed term attribute tokens. We iterate over the tokens to create the transformed string.

SynonymAnalyzer is instantiated with each transformation request; its overridden createComponents() method instantiates a SynonymFilter with a SynonymMap object and a whitespace tokenizer. The tokenizer tells the filter how to break up a subject string, while the SynonymMap instance contains the key-value pairs to match tokenized synonyms against.

Construction of the synonym map happens in the contructSynonymMap method, which creates a singleton SynonymMapLoader that implements the getSynonymMap() method. This singleton creates and keeps a copy of the SynonymMap in memory per JVM, thereby removing IO constraints.

In the getSynonymMap() method we start by parsing a file containing Solr analyzer mappings and building a SynonymMap object using the SolrSynonymParser class of the lucene-analyzers-common API. Note that instantiating SolrSynonymParser requires an analyzer, which is not our custom analyzer (we used SimpleAnalyzer) and which should be closed before exiting the method. The synonym map file named in the static variable LOCAL_SYNONYM_MAP is a jar resource containing data in Solr synonym format, which is a many-to-one mapping:

ARC => ARCADE

AVE => AVENUE

BLVD => BOULEVARD

STR , ST => STREET

An alternative to the Solr synonym format is the WordNet format, which also has a parser implemented in Lucene.
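The transform-then-compare flow of the street name UDF can be illustrated without Lucene. The Python below is a simplified stand-in, not the project's SynonymAnalyzer: it uses a plain dictionary and whitespace tokenization, and the MTN and STE entries are assumed additions that do not appear in the excerpt of the map shown above.

```python
# Many-to-one street synonyms, keyed by the non-standard form.
STREET_SYNONYMS = {
    "ARC": "ARCADE",
    "AVE": "AVENUE",
    "BLVD": "BOULEVARD",
    "STR": "STREET",
    "ST": "STREET",
    # Assumed extra entries for this example; not from the excerpt above.
    "MTN": "MOUNTAIN",
    "STE": "SUITE",
}


def analyze_string(text):
    """Uppercase, split on whitespace and replace each known abbreviation."""
    tokens = text.upper().split()
    return " ".join(STREET_SYNONYMS.get(token, token) for token in tokens)


def compare_street_names(first, second):
    """Compare two addresses after both are transformed to standard form."""
    return analyze_string(first) == analyze_string(second)


if __name__ == "__main__":
    print(analyze_string("88 Mtn View St Suite 10"))
    print(compare_street_names("88 Mountain View Street STE 10",
                               "88 Mtn View St Suite 10"))
```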

Creating the UDFs in hive:

This part is straightforward.

You can either create a fat jar containing your custom implementation and the Lucene classes, or create a thin jar of your implementation and copy the Lucene jars into the Hive lib directory. See our Maven pom.

We built a fat jar, which was loaded into HDFS:

hadoop fs -put ~/ritho-hive-0.0.1-dist.jar /apps/gdt/

Then create the UDFs:

echo "CREATE FUNCTION CompareYearInDateString  AS 'com.ritho.hadoop.hive.udf.CompareYearInDateString' USING JAR 'hdfs:///apps/gdt/ritho-hive-0.0.1-dist.jar' ;" | hive

echo "CREATE FUNCTION CompareStreetNames  AS 'com.ritho.hadoop.hive.udf.CompareStreetNames' USING JAR 'hdfs:///apps/gdt/ritho-hive-0.0.1-dist.jar' ;" | hive

Testing CompareYearInDateString class

echo "select * from (select CompareYearInDateString('1922/12/12', '19221101') ) x;"|hive

Another example testing CompareYearInDateString class

echo "select * from (select CompareYearInDateString('1922/12/12', '1922') ) x;"|hive

Testing CompareStreetNames class

echo "select * from (select CompareStreetNames('88 Mountain View Street STE 10', '88 Mtn View St Suite 10') ) x;"|hive

Conclusion:

These UDFs made our SQL statements clean and easy to read. Most important was the ability to enhance Hive's functionality. It was also easy to implement the Lucene analyzer. The overall result is a flexible architecture that DBAs and analysts can exploit.

Another example of how to exploit this architecture is person name internationalization or standardization (e.g., José to Joseph conversions or Jim to James mapping). Also, text similarity scoring in HQL can be adapted in the same manner.

References

https://hive.apache.org

http://lucene.apache.org

http://lucene.apache.org/solr/

Thursday, June 11, 2015

Setting up Ganglia on Ubuntu nodes

Setting up Ganglia on ubuntu Hadoop cluster:

Ganglia depends on apache2, the PHP client and the Apache PHP module. Therefore install Apache first:

sudo apt-get install apache2 php5 php5-cli libapache2-mod-php5

And visually verify that apache2 is running on http://localhost:80

Ganglia master: choose a low-utilization, low-consequence node, e.g. slave6

sudo apt-get install ganglia-monitor rrdtool gmetad ganglia-webfrontend

This installs the monitor, rrdtool and the web UI.

Modify webcontext file and copy to apache

sudo cp /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled/ganglia.conf

Modify /etc/ganglia/gmond.conf to your specs (we are using unicast configuration). I commented out multicast attributes and replaced default ips with my ganglia master ip:

globals {
  daemonize = yes
  setuid = yes
  user = ganglia
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  host_dmax = 0 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  send_metadata_interval = 30
}

/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
 * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
 * NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "Hadoop Ganglia Monitor" 
  owner = "hduser"     
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  host = 10.77.201.104
  port = 8649
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  port = 8649
}


/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
}

Modify /etc/ganglia/gmetad.conf to state the data collecting node for ganglia

data_source "Hadoop Cluster" 10.77.201.104

Starting and stopping:

Restart gmetad

sudo service gmetad restart

Restart ganglia monitor in master node

sudo service ganglia-monitor restart

Ganglia Clusters

Install ganglia monitor on nodes

sudo apt-get install ganglia-monitor

Modify /etc/ganglia/gmond.conf so that each node sends its data to the receiver, i.e. the data_source node:


/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
 * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
 * NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "Hadoop Ganglia Monitor" 
  owner = "hduser"     
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  host = 10.77.201.104
  port = 8649
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
}

Starting and stopping:

sudo service ganglia-monitor start

Hadoop ganglia

Issues:

When I tried accessing the Ganglia URL, the HTML page showed:

There was an error collecting ganglia data (127.0.0.1:8652): fsockopen error: Connection refused.

The problem is a file permissions issue.

Solution:

chown -R nobody:root /var/lib/ganglia/rrds

then restart the daemons

MONITORING

send datagram

master node:

ps -ef | grep -v grep | grep gm

Results

ganglia 21116 1 0 12:58 ? 00:00:00 /usr/sbin/gmond --pid-file=/var/run/ganglia-monitor.pid

nobody 21127 1 0 12:58 ? 00:00:01 /usr/sbin/gmetad --pid-file=/var/run/gmetad.pid

network


sudo netstat -plane | egrep 'gmon|gme'

results:

tcp 0 0 0.0.0.0:8649 0.0.0.0:* LISTEN 999 87732663 21116/gmond

tcp 0 0 0.0.0.0:8651 0.0.0.0:* LISTEN 65534 87729072 21127/gmetad

tcp 0 0 0.0.0.0:8652 0.0.0.0:* LISTEN 65534 87729073 21127/gmetad

udp 0 0 0.0.0.0:8649 0.0.0.0:* 999 87732662 21116/gmond

udp 0 0 192.168.179.103:60243 192.168.179.103:8649 ESTABLISHED 999 87732666 21116/gmond

unix 3 [ ] STREAM CONNECTED 87785728 21116/gmond

unix 3 [ ] STREAM CONNECTED 83863311 21127/gmetad
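Beyond ps and netstat, you can check which nodes are actually reporting by reading the XML that gmond serves on its tcp_accept_channel (port 8649 in the configuration above). The sketch below uses only the Python standard library; it assumes the dump contains HOST elements with a NAME attribute, and the function names, the sample XML and the host name slave1 are made up for illustration.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host, port=8649, timeout=5.0):
    """Connect to gmond's tcp_accept_channel and read the XML dump as bytes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:  # gmond closes the connection after sending the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def reporting_hosts(xml_data):
    """Return the sorted NAME attribute of every HOST element in the dump."""
    root = ET.fromstring(xml_data)
    return sorted(host.get("NAME") for host in root.iter("HOST"))


if __name__ == "__main__":
    # Illustrative sample only; on a live cluster use read_gmond_xml("10.77.201.104").
    sample = (
        '<GANGLIA_XML SOURCE="gmond"><CLUSTER NAME="Hadoop Ganglia Monitor">'
        '<HOST NAME="slave6"/><HOST NAME="slave1"/>'
        '</CLUSTER></GANGLIA_XML>'
    )
    print(reporting_hosts(sample))
```

If a node is missing from the list, its gmond is probably not sending to the master's udp_recv_channel.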

Monday, June 8, 2015

Managing a maven artifact repository with Artifactory

Planning

The assumption is that you know how to use maven at an elementary level. You should have an idea of how to deploy web applications using tomcat or other JEE app server. Our examples use Subversion as the source code version management tool. You should have some familiarity with branching.

Our goals:

1. Reduce the disk space usage in our source code repository, such as Subversion.

2. Set up a local repository to manage all third-party libraries.

3. Standardize building, testing and releasing work product.

4. Provide scripts that give a stable methodology for building, testing and releasing.

5. Provide a secure environment in which to execute 1) and 2).

Artifactory is a secure external repository management tool.
Maven has plugins for source code version control; third-party repository and FTP plugins; build plugins, including an ANT plugin; JUnit plugins for testing; and release plugins.

Infrastructure management is still required to completely close all security holes:

A shared local repository (Artifactory) should be located on an accessible server. It should be accessible to the development, integration, testing and production build environments. Some organizations have higher security requirements than others, but generally local/intranet read-only accessibility is sufficient.
Work product should be separated out into separate pom files. See section III.
I. Setting up maven

Reference: http://www.theserverside.com/news/1364121/Setting-Up-a-Maven-Repository

http://maven.apache.org/guides/mini/guide-configuring-maven.html

II. Setting up Artifactory

1. Download Artifactory. http://sourceforge.net/projects/artifactory/files/artifactory/2.5.1.1/artifactory-2.5.1.1.zip/download

2. Installing in JBoss 6 didn’t work.

I installed it in Tomcat 5 using JDK6 instead.

Set the environment variables for Artifactory and JDK 6:

$> cd <install dir>/apache-tomcat-5.5.28/bin/

$> export JAVA_HOME='/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home'

$> export JAVA_OPTS="$JAVA_OPTS -Dartifactory.home=/Users/kaniu_n/artifactory-2.5.1.1"

$> ./startup.sh

3. The maven repositories:

Artifactory manages multiple pre-configured remote maven repositories. It also has caching capability i.e. stores downloaded libraries locally temporarily to reduce IO. Artifactory administration allows for adding new or modifying existing repositories: local or remote.

Therefore, Artifactory becomes your library ‘repository’ instead of the many remote repositories.

4. pom.xml example:

<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
  http://maven.apache.org/maven-v4_0_0.xsd">
  <groupId>com.ritho.maven</groupId>
  <artifactId>simple-parent</artifactId>
  <name>Parent Project</name>
  <url>http://maven.ritho.com</url>
  <repositories>
    <repository>
      <id>central</id>
      <url>http://localhost:8080/artifactory/repo</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
    <repository>
      <id>snapshots</id>
      <url>http://localhost:8080/artifactory/repo</url>
      <releases>
        <enabled>false</enabled>
      </releases>
    </repository>
  </repositories>
  <pluginRepositories>
    <pluginRepository>
      <id>central</id>
      <url>http://localhost:8080/artifactory/repo</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </pluginRepository>
    <pluginRepository>
      <id>snapshots</id>
      <url>http://localhost:8080/artifactory/repo</url>
      <releases>
        <enabled>false</enabled>
      </releases>
    </pluginRepository>
  </pluginRepositories>
</project>

Note that the urls point to the deployed Artifactory instance.

III. Build, test and Release Management using Maven.

Each project should have a pom. Related projects should have a parent pom file and a separate release pom file. In the following example, I have 4 eclipse projects:

-Exam/ has dependencies on DataObjects/

-ParentProject/ contains the main pom for all projects.

-Release/ is the end point for all packaged entities.

Here is the Eclipse project layout:

This example will use implied classpaths for compiling the projects with dependencies. This is done by referencing the parent project which is ‘aware’ of all the relevant modules in the workspace. The goal is to build, test or package from the parent pom.xml file. <module> element allows mvn to decipher project dependencies and then create the compilation sequence of all artifacts. By always executing mvn process from the parent pom, you are ensuring that all changes in subsequent projects are reflected in the final artifact of the compilation sequence.

Caveats:

All projects will be compiled using the parent JDK. Therefore all projects have to be up to date with the current JDK, i.e. code that only raised warnings under JDK5 may fail to compile under JDK6.

Building does not auto-clean the build directories, therefore do a ‘clean’ before rebuilding/testing/packaging, i.e.:

>mvn clean

>mvn clean package

Here are my examples:

1. Parent pom

<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
  http://maven.apache.org/maven-v4_0_0.xsd">
  <groupId>com.ritho.maven</groupId>
  <artifactId>simple-parent</artifactId>
  <name>Parent Project</name>
  <url>http://maven.ritho.com</url>
  <repositories>
    <repository>
      <id>central</id>
      <url>http://localhost:8080/artifactory/repo</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
    <repository>
      <id>snapshots</id>
      <url>http://localhost:8080/artifactory/repo</url>
      <releases>
        <enabled>false</enabled>
      </releases>
    </repository>
  </repositories>
  <pluginRepositories>
    <pluginRepository>
      <id>central</id>
      <url>http://localhost:8080/artifactory/repo</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </pluginRepository>
    <pluginRepository>
      <id>snapshots</id>
      <url>http://localhost:8080/artifactory/repo</url>
      <releases>
        <enabled>false</enabled>
      </releases>
    </pluginRepository>
  </pluginRepositories>
  <modules>
    <module>../DataObjects</module>
  </modules>
  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <configuration>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
    </dependency>
  </dependencies>
</project>

Some notes: This pom uses JDK5 but you may alter it for JDK6 as below:

<pluginManagement>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
      </configuration>
    </plugin>
  </plugins>
</pluginManagement>

Make sure your Maven is running on JDK6 or higher by setting JAVA_HOME; in a Mac OS X terminal:

$> export JAVA_HOME='/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home'

$> mvn -e package

2. DataObjects pom

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <groupId>com.dataobjects</groupId>
  <artifactId>dataobjects</artifactId>
  <parent>
    <groupId>com.ritho.maven</groupId>
    <artifactId>simple-parent</artifactId>
    <relativePath>../ParentProject</relativePath>
  </parent>
  <build>
    <directory>${project.parent.relativePath}/maven/${project.artifactId}</directory>
  </build>
</project>

3. Exam pom

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <parent>
    <groupId>com.ritho.maven</groupId>
    <artifactId>simple-parent</artifactId>
    <relativePath>../ParentProject</relativePath>
  </parent>
  <dependencies>
    <dependency>
      <groupId>com.dataobjects</groupId>
      <artifactId>dataobjects</artifactId>
      <version>${project.version}</version>
    </dependency>
  </dependencies>
  <build>
    <directory>${project.parent.relativePath}/maven/${project.artifactId}</directory>
  </build>
</project>

4. Distribution pom

This pom can be the parent pom with additional plugins enabled, e.g. for reporting, distribution management and remote connection.

i. Reporting allows you to autogenerate html that you may use to describe the distribution product of the pom file. Example:

<reporting>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-report-plugin</artifactId>
      <configuration>
        <outputDirectory>${project.build.directory}/${project.version}/surefire</outputDirectory>
      </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-site-plugin</artifactId>
      <configuration>
        <outputDirectory>${project.build.directory}/${project.version}/site</outputDirectory>
      </configuration>
    </plugin>
  </plugins>
</reporting>

Note structure of the project below containing site and site.apt directories:

site.xml configures the template and its look and feel.

index.apt configures the content using Almost Plain Text notation.

site.xml

<?xml version="1.0" encoding="ISO-8859-1"?>
<project name="Maven" xmlns="http://maven.apache.org/DECORATION/1.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/DECORATION/1.0.0 http://maven.apache.org/xsd/decoration-1.0.0.xsd">
  <bannerLeft>
    <name>Ritho HBase Utility for V-1.0.+</name>
    <src>https://sites.google.com/site/rithotechcorp/_/rsrc/1332167517940/config/customLogo.gif?revision=3</src>
    <href>http://maven.ritho.com/</href>
  </bannerLeft>
  <skin>
    <groupId>org.apache.maven.skins</groupId>
    <artifactId>maven-fluido-skin</artifactId>
  </skin>
  <custom>
    <fluidoSkin>
    </fluidoSkin>
  </custom>
  <body>
    <links>
    </links>
    <menu>
    </menu>
  </body>
</project>

site.apt

<<Ritho HBase Utility:>>

Ritho HBase Utility library is a utility for interacting with HBase versions 1.0.+ .

<<Benefits:>>

This utility simplifies back-end calls to only a few lines of code reducing developer mistakes. Some level of connection management \

can be handled for you but the most significant advantage is simplified syntax.

<<Usage:>>

This testcase outlines the most common way to use this utility. Each testcase is self explanatory. You would need to set up \

configuration for each testcases:

----

package test

class test{

//some code ...

}

----

APT tool consumes this to produce decorated html encapsulating your text.

ii. distribution management and remote connection

This setting points to a remote/or local ftp site where you can submit your product. Here you can specify the site and repository locations.

<distributionManagement>
  <site>
    <id>ritho-ftp-repository</id>
    <url>ftp://ritho.com/downloads/site</url>
  </site>
  <repository>
    <id>ritho-ftp-repository</id>
    <url>ftp://ritho.com/downloads/opensource</url>
  </repository>
</distributionManagement>

Distribution management requires an extension. Here is an FTP wagon example:

<build>
  <extensions>
    <extension>
      <groupId>org.apache.maven.wagon</groupId>
      <artifactId>wagon-ftp</artifactId>
    </extension>
  </extensions>
</build>

The assumption is that your remote FTP server requires authentication. Note that distribution management identifies the target remote repository (ritho-ftp-repository). settings.xml will contain a username and password for this repository. settings.xml resides in the local/execution environment at ~/.m2/settings.xml

<settings>
...
<servers>
   <server>
     <id>ritho-ftp-repository</id>
     <username>username</username>
     <password>password</password>
   </server>
</servers>
...
</settings>

References:

http://maven.apache.org/guides/mini/guide-configuring-maven.html

http://maven.apache.org/guides/mini/guide-encryption.html
