Tuesday, October 20, 2015

Indexing and Analyzing Hypocoristics, Diminutives and Synonyms with ElasticSearch

Hypocoristics, Diminutives and Synonyms with ElasticSearch/Lucene

Intro


Indexing and searching through social data introduces two problems: matching colloquial names to formal names, and matching synonyms/shorthand to full words. Colloquial words may include a combination of hypocoristics, diminutives and synonyms, for example California, Golden State, Cali, CA, etc.
We shall discuss the process of achieving a contextualized data ecosystem using Elasticsearch, and look at synonym matching as a better tool than fuzzy search for contextualizing raw data.

Fuzzy search



Fuzzy search uses string edit distance to determine the closeness of two strings. This mechanism is best suited to matching possibly misspelled words against indexed terms. For example, a search engine can suggest words as you type; each partial query is fuzzy-matched against the indexing engine.
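
For illustration, here is a minimal, self-contained Lucene sketch of fuzzy matching (my assumption: Lucene 5.x APIs; the "state" field and sample values are made up). It indexes one document and then finds it with a misspelled query term:

// Hypothetical illustration only: index one document and find it with a
// misspelled query term using Lucene's FuzzyQuery (edit distance <= 2).
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class FuzzyDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("state", "california", Field.Store.YES));
            writer.addDocument(doc);
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // "califrnia" is one edit away from "california", so it still matches.
            TopDocs hits = searcher.search(new FuzzyQuery(new Term("state", "califrnia"), 2), 10);
            System.out.println("hits: " + hits.totalHits);
        }
    }
}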


Synonym Matching



Synonym matching uses the index to store all the possible matches for a word or phrase. This requires an analyst who understands the data well enough to construct the synonym mappings. If you are dealing with state names, you can map abbreviations, popular names and formal names such as CA, California, Golden State, Cali, etc.
During indexing, your data (social media feeds, etc.) can then reference these synonym mappings to add more context to your data, hence contextualization.


Lucene synonym handling


See the post "ETL on the fly with BIG Data: A Hive Lucene Integration" below.


Elasticsearch synonym handling



I created a state-name index in Elasticsearch containing each state's formal name, common abbreviations and shortened postal name.


I created a state-name synonym map using this shell script:

# Wrap each mapping line in double quotes, join the lines with commas,
# drop the trailing comma and lowercase everything.
states=`cat stateabbreviations.txt | sed 's/^.*$/"&"/g' | tr '\n' ',' | sed 's/,$//' | awk '{ print tolower($0) }'`

 curl -XPUT "http://localhost:9200/statesabbreviation_index/" -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "statesabbr_synonym_filter": {
          "type": "synonym",
   "synonyms": [
           '"$states"' 
          ]
        }
      },
      "analyzer": {
        "statesabbr_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "statesabbr_synonym_filter"
          ]
        }
      }
    }
  }
}'


Note that the analyzer also uses a lowercase filter and that all synonyms are converted to lowercase. This means every query against this index is evaluated in lowercase, because the index itself is lowercase. The opposite can be achieved by switching the awk arguments and the filter arguments to uppercase.


The abbreviation file, stateabbreviations.txt, is of the form:

Alabama,Ala.=>AL
Alaska=>AK
Arizona,Ariz.=>AZ
Arkansas,Ark.=>AR
California,Calif.=>CA
Colorado,Colo.=>CO
Connecticut,Conn.=>CT
Delaware,Del.=>DE
Florida,Fla.=>FL
Georgia,Ga.=>GA
Hawaii,Hawaii=>HI

Now you can make a GET query through your browser:

http://localhost:9200/statesabbreviation_index/_analyze?analyzer=statesabbr_synonyms&text="CALIFORNIA is in United States of America and so is Maine"


Or query using curl in shell:

curl -XGET "http://localhost:9200/statesabbreviation_index/_analyze?analyzer=statesabbr_synonyms&text='CALIFORNIA%20is%20in%20United%20States%20of%20America%20and%20so%20is%20Maine'"

where the query text is "CALIFORNIA is in United States of America and so is Maine" and the state and country terms are expected to come back as indexed synonyms.


Results:


{"tokens":[
{"token":"ca","start_offset":1,"end_offset":11,"type":"SYNONYM","position":1},
{"token":"is","start_offset":12,"end_offset":14,"type":"<ALPHANUM>","position":2},
{"token":"in","start_offset":15,"end_offset":17,"type":"<ALPHANUM>","position":3},
{"token":"usa","start_offset":18,"end_offset":42,"type":"SYNONYM","position":4},
{"token":"and","start_offset":43,"end_offset":46,"type":"<ALPHANUM>","position":5},
{"token":"so","start_offset":47,"end_offset":49,"type":"<ALPHANUM>","position":6},
{"token":"is","start_offset":50,"end_offset":52,"type":"<ALPHANUM>","position":7},
{"token":"me","start_offset":53,"end_offset":58,"type":"SYNONYM","position":8}]}

Names:


I also created a first-name hypocoristics/diminutives index that can be queried the same way as above, where the map file is of the form:
Frederick=>Fred,Freddy,Rick,Fritz,Freddie


A browser query of the form:

http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'


Or query using curl in shell:

curl -XGET "http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'"


Resulted in:


{"tokens":[{"token":"fred","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"freddy","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"rick","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"fritz","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1},{"token":"freddie","start_offset":1,"end_offset":10,"type":"SYNONYM","position":1}]}

Note that this first-name index is queried with the formal name and returns the other names. In order to search by a diminutive such as Freddie, a reverse index needs to exist. You can create a reverse-lookup index using a many-to-one map, i.e.:

Fred,Freddy,Rick,Fritz,Freddie=>Frederick

However, there’s a caveat: some diminutives refer to more than one given name. For example, Frank could refer to Francesco, Francis or Franklin.  


Frances=>Fran,Franny,Fanny,Frankie
Francesca=>Fran,Franny,Fanny
Francesco=>Fran,Frank,Frankie
Francis=>Fran,Frank,Frankie
Franklin=>Frank


How can you exploit these synonym capabilities programmatically?

Unfortunately, there is no stable php-elasticsearch package out there. However, if you're familiar with traversing JSON, you should have no problem making HTTP(S) queries with curl or JavaScript. Here is an example using the PHP cURL API:



<?php
// Call the _analyze endpoint of the first-names index and print the synonym tokens.
function restgetcall() {
    $url = "http://localhost:9200/firstnames_index/_analyze?analyzer=firstnames_synonyms&text='Frederick'";
    $headers = array(
        'Accept: application/json',
    );

    $handle = curl_init();
    curl_setopt($handle, CURLOPT_URL, $url);
    curl_setopt($handle, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);   // return the body instead of echoing it
    curl_setopt($handle, CURLOPT_HTTPGET, true);

    $response = curl_exec($handle);
    $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);

    if ($response === false) {
        $info = curl_getinfo($handle);
        curl_close($handle);
        die('Error occurred during curl_exec. Additional info: ' . var_export($info, true));
    }
    curl_close($handle);

    $decoded = json_decode($response);
    if ($code >= 400 || isset($decoded->error)) {
        die('Error occurred: ' . var_export($decoded, true));
    }
    echo "response ok!\n";
    print_r($decoded);   // $decoded->tokens holds the synonym tokens
}

restgetcall();
?>
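
If you prefer Java, the same REST call can be made with nothing more than the JDK's HttpURLConnection. Here is a minimal sketch against the index and analyzer created above; parse the returned JSON with whichever library you like:

// Minimal Java counterpart to the PHP call above, using only the JDK.
// The index and analyzer names are the ones created earlier in this post.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class AnalyzeClient {
    public static void main(String[] args) throws Exception {
        String text = URLEncoder.encode("Frederick", "UTF-8");
        URL url = new URL("http://localhost:9200/firstnames_index/_analyze"
                + "?analyzer=firstnames_synonyms&text=" + text);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");

        // Print the raw JSON token list; feed it to your favorite JSON parser from here.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            conn.disconnect();
        }
    }
}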



Other topics:

Phonetic engines as an alternative to Elasticsearch fuzzy matching.
Elasticsearch word stemming: algorithmic, dictionary, hunspell.


ETL on the fly with BIG Data: A Hive Lucene Integration

Hive Lucene Integration.


Summary:

This post looks at integrating Hive and Lucene to build a fast, cheap, automated ETL process: for example, contextualizing data into a standard format on the fly before matching it against data in a Hive repository. Non-standard data includes diminutives, hypocoristics and acronyms in people's names and street addresses.
Some form of ETL may be required when importing third-party data into your data warehouse, and if you have multiple third-party sources, the need for automated ETL becomes apparent. The transformation/translation phase can be streamlined using Lucene technologies. Hive supports many UDFs, including Java regular expressions, but to extend HQL capabilities further, a more involved UDF implementation may be necessary.
This post shows a simple Hive UDF example that matches date strings with different formats using Java regular expressions. The second example matches street-name abbreviations by incorporating Lucene/Solr synonyms into a UDF.

You may also visit the project on github here

Compare dates example:


If you have two data sources containing date columns that you want to match, you can implement a UDF that anticipates different date formats and can even use parts of the date (such as the year) to perform the match.
We implemented a class CompareYearInDateString that extends org.apache.hadoop.hive.ql.exec.UDF, which requires a public evaluate() method. The method lets you pass more than one argument and return a user-defined type, in our case a Boolean. If either string is empty, it returns false. Otherwise it extracts the year from both dateStr arguments by calling getYear(), which first calls getDatePattern() and then calls getYearFromDate() with the resulting Pattern object and the year string. getDatePattern() can be made more robust as you learn about potential date formats, such as international formats and other delimiters.
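
Below is a condensed, illustrative sketch of such a UDF. It folds the getDatePattern()/getYearFromDate() helpers into a single getYear() method, and the regular expressions shown are my own assumptions covering the formats used in the tests below, not the production set:

// Condensed, illustrative sketch of a year-comparing Hive UDF.
// The patterns cover yyyy/MM/dd, yyyyMMdd and a bare yyyy; a production
// version would handle more formats and delimiters.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.hive.ql.exec.UDF;

public class CompareYearInDateString extends UDF {

    private static final Pattern[] DATE_PATTERNS = {
        Pattern.compile("^(\\d{4})[/.-]\\d{1,2}[/.-]\\d{1,2}$"),  // 1922/12/12
        Pattern.compile("^(\\d{4})\\d{2}\\d{2}$"),                // 19221101
        Pattern.compile("^(\\d{4})$")                             // 1922
    };

    public Boolean evaluate(String leftDateStr, String rightDateStr) {
        if (leftDateStr == null || rightDateStr == null
                || leftDateStr.trim().isEmpty() || rightDateStr.trim().isEmpty()) {
            return false;
        }
        String leftYear = getYear(leftDateStr.trim());
        String rightYear = getYear(rightDateStr.trim());
        return leftYear != null && leftYear.equals(rightYear);
    }

    // Finds the first pattern that matches and returns its captured year group.
    private String getYear(String dateStr) {
        for (Pattern pattern : DATE_PATTERNS) {
            Matcher matcher = pattern.matcher(dateStr);
            if (matcher.matches()) {
                return matcher.group(1);
            }
        }
        return null;
    }
}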


Compare Street names example:


We anticipated finding addresses with 'st' in place of 'street', 'ave' in place of 'avenue', etc. Incorporating all the different combinations of abbreviations into a SQL statement would be a daunting task, so it makes sense to create a UDF that transforms a non-standard street string into a standard form. Lucene analyzers are well suited to detecting and replacing these abbreviations.
There are two approaches to using analyzers from Hive map-reduce tasks (or any other parallel process): a RESTful index such as Elasticsearch or Solr, or an in-memory analyzer. We felt that a RESTful index would burden the network with extra traffic, so we settled on a standalone in-memory Analyzer held as a singleton per Hadoop JVM. Since a JVM container can be recycled for new map-reduce tasks, this approach scales as long as the synonym data stays moderately small.
As in the previous example, the class CompareStreetNames extends UDF and implements evaluate(), where the Text arguments are transformed using a custom AnalyzeUtil.analyzeString() method before being compared to each other.
AnalyzeUtil.analyzeString() transforms a string using a custom Analyzer, a SynonymAnalyzer instance, which tokenizes a string reader into transformed term-attribute tokens. We iterate over the tokens to build the transformed string.
SynonymAnalyzer is instantiated with each transformation request; its overridden createComponents() method wires a SynonymFilter to a SynonymMap object and a whitespace tokenizer. The tokenizer tells the filter how to break up the subject string, while the SynonymMap instance contains the key/value pairs the tokenized synonyms are matched against.
Construction of the synonym map happens in the constructSynonymMap method, which creates a singleton SynonymMapLoader implementing getSynonymMap(). This singleton builds and keeps one copy of the SynonymMap in memory per JVM, thereby removing the IO constraint.
In getSynonymMap() we parse a file containing Solr analyzer mappings and build a SynonymMap object using the SolrSynonymParser class from the lucene-analyzers-common API. Note that instantiating SolrSynonymParser requires an analyzer, which is not our custom analyzer (we used SimpleAnalyzer) and should be closed before exiting the method. The synonym map file referenced by the static variable LOCAL_SYNONYM_MAP is a jar resource containing data in Solr synonym format, a many-to-one mapping (a code sketch follows the mapping example below):
ARC => ARCADE
AVE => AVENUE
BLVD => BOULEVARD
STR , ST => STREET
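
Below is a minimal sketch of the analyzer side, assuming Lucene 5.x and lucene-analyzers-common. Class and method names mirror the description above, but the bodies are illustrative: the SynonymMapLoader singleton is condensed into a lazily initialized static, AnalyzeUtil.analyzeString() is folded in as a static method, and street_synonyms.txt stands in for the LOCAL_SYNONYM_MAP resource:

// Illustrative sketch only; assumes Lucene 5.x APIs and a Solr-format
// synonym file on the classpath (resource name is a placeholder).
import java.io.InputStreamReader;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SolrSynonymParser;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SynonymAnalyzer extends Analyzer {

    // One synonym map per JVM (the SynonymMapLoader singleton, condensed).
    private static SynonymMap synonymMap;

    private static synchronized SynonymMap getSynonymMap() throws Exception {
        if (synonymMap == null) {
            SimpleAnalyzer parsingAnalyzer = new SimpleAnalyzer();        // not the custom analyzer
            SolrSynonymParser parser = new SolrSynonymParser(true, true, parsingAnalyzer);
            parser.parse(new InputStreamReader(
                    SynonymAnalyzer.class.getResourceAsStream("/street_synonyms.txt")));
            synonymMap = parser.build();
            parsingAnalyzer.close();                                      // close before leaving the method
        }
        return synonymMap;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        try {
            // ignoreCase=true so "STR", "Str" and "str" all hit the map.
            return new TokenStreamComponents(tokenizer,
                    new SynonymFilter(tokenizer, getSynonymMap(), true));
        } catch (Exception e) {
            throw new RuntimeException("Could not load synonym map", e);
        }
    }

    // Roughly what AnalyzeUtil.analyzeString() does: run the analyzer over the
    // input and rebuild the normalized string from the emitted term attributes.
    public static String analyzeString(String input) throws Exception {
        StringBuilder normalized = new StringBuilder();
        try (Analyzer analyzer = new SynonymAnalyzer();
             TokenStream stream = analyzer.tokenStream("street", new StringReader(input))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                if (normalized.length() > 0) {
                    normalized.append(' ');
                }
                normalized.append(term.toString());
            }
            stream.end();
        }
        return normalized.toString();
    }
}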

An alternative to the Solr synonym format is the WordNet format, which also has a parser implemented in Lucene.


Creating the UDFs in hive:

This part is straightforward.
You can either create a fat jar containing your custom implementation and the Lucene classes, or create a thin jar of your implementation and copy the Lucene jars into Hive's lib directory; see our Maven pom.
We built a fat jar and loaded it into HDFS:

hadoop fs -put ~/ritho-hive-0.0.1-dist.jar /apps/gdt/

Then create the UDFs:


echo "CREATE FUNCTION CompareYearInDateString  AS 'com.ritho.hadoop.hive.udf.CompareYearInDateString' USING JAR 'hdfs:///apps/gdt/ritho-hive-0.0.1-dist.jar' ;" | hive


echo "CREATE FUNCTION CompareStreetNames  AS 'com.ritho.hadoop.hive.udf.CompareStreetNames' USING JAR 'hdfs:///apps/gdt/ritho-hive-0.0.1-dist.jar' ;" | hive

Testing CompareYearInDateString class

echo "select * from (select CompareYearInDateString('1922/12/12', '19221101') ) x;"|hive

Another example testing CompareYearInDateString class

echo "select * from (select CompareYearInDateString('1922/12/12', '1922') ) x;"|hive

Testing CompareStreetNames class

echo "select * from (select CompareStreetNames('88 Mountain View Street STE 10', '88 Mtn View St Suite 10') ) x;"|hive

Conclusion:

These UDFs made our SQL statements clean and easy to read; most importantly, they let us extend Hive's functionality, and the Lucene analyzer was easy to implement. The overall result is a flexible architecture that DBAs and analysts can exploit.
Another way to exploit this architecture is person-name internationalization or standardization (e.g., José to Joseph conversions or Jim to James mappings). Text similarity scoring in HQL can be adapted in the same manner.


References

https://hive.apache.org
http://lucene.apache.org
http://lucene.apache.org/solr/


Thursday, June 11, 2015

Setting up Ganglia on Ubuntu nodes


Setting up Ganglia on an Ubuntu Hadoop cluster:

The web frontend depends on Apache2 and the PHP CLI and Apache PHP module, so install Apache first:
sudo apt-get install apache2 php5 php5-cli libapache2-mod-php5
Then visually verify that Apache2 is running at http://localhost:80

Ganglia master: choose a low-utilization, low-consequence node, e.g. slave6.

sudo apt-get install ganglia-monitor rrdtool gmetad ganglia-webfrontend

This installs the monitor, rrdtool and the web UI.

Copy the Ganglia web frontend config into Apache's sites-enabled directory:
sudo cp /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled/ganglia.conf
Modify /etc/ganglia/gmond.conf to your specs (we are using a unicast configuration). I commented out the multicast attributes and replaced the default IPs with my Ganglia master's IP:

globals {
  daemonize = yes
  setuid = yes
  user = ganglia
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  host_dmax = 0 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  send_metadata_interval = 30
}

/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
 * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
 * NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "Hadoop Ganglia Monitor" 
  owner = "hduser"     
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  host = 10.77.201.104
  port = 8649
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  port = 8649
}


/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
}
 

Modify /etc/ganglia/gmetad.conf to name the data-collecting node for Ganglia:

data_source "Hadoop Cluster" 10.77.201.104


Starting and stopping:

Restart gmetad
sudo service gmetad restart


Restart the Ganglia monitor on the master node:

sudo service ganglia-monitor restart

Ganglia Clusters


Install ganglia monitor on nodes
sudo apt-get install ganglia-monitor

Modify /etc/ganglia/gmond.conf on each node so it sends data to the receiving node (the gmetad data source):

/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
 * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
 * NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "Hadoop Ganglia Monitor" 
  owner = "hduser"     
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  host = 10.77.201.104
  port = 8649
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
}

Starting and stopping:
sudo service ganglia-monitor start
Hadoop ganglia


Issues:
Tried accessing the Ganglia URL and got the following error on the page:
There was an error collecting ganglia data (127.0.0.1:8652): fsockopen error: Connection refused.

The problem is a file-permissions issue on the RRD directory.

Solution:

chown -R nobody:root /var/lib/ganglia/rrds
then restart the daemons.


MONITORING

Check that the daemons are running (and sending datagrams) on the master node:
ps -ef | grep -v grep | grep gm
Results:
ganglia  21116     1  0 12:58 ?        00:00:00 /usr/sbin/gmond --pid-file=/var/run/ganglia-monitor.pid
nobody   21127     1  0 12:58 ?        00:00:01 /usr/sbin/gmetad --pid-file=/var/run/gmetad.pid


Network:

sudo netstat -plane | egrep 'gmon|gme'
Results:
tcp        0      0 0.0.0.0:8649            0.0.0.0:*               LISTEN      999        87732663    21116/gmond     
tcp        0      0 0.0.0.0:8651            0.0.0.0:*               LISTEN      65534      87729072    21127/gmetad    
tcp        0      0 0.0.0.0:8652            0.0.0.0:*               LISTEN      65534      87729073    21127/gmetad    
udp        0      0 0.0.0.0:8649            0.0.0.0:*                           999        87732662    21116/gmond     
udp        0      0 192.168.179.103:60243   192.168.179.103:8649    ESTABLISHED 999        87732666    21116/gmond     
unix  3      [ ]         STREAM     CONNECTED     87785728 21116/gmond         
unix  3      [ ]         STREAM     CONNECTED     83863311 21127/gmetad        

Monday, June 8, 2015

Managing a maven artifact repository with Artifactory

Planning

The assumption is that you know how to use Maven at an elementary level, and that you have an idea of how to deploy web applications using Tomcat or another JEE app server. Our examples use Subversion as the source code version management tool, so you should have some familiarity with branching.

Our goals:
  1. Reduce disk-space usage in our source code repository (Subversion).
    1. Set up a local repository to manage all third-party libraries.
  2. Standardize building, testing and releasing work product.
    1. Scripts provide a stable methodology for building, testing and releasing.
  3. A secure environment in which to execute 1) and 2).
    1. Artifactory is a secure external repository management tool.
    2. Maven has plugins for source code version control; third-party repository and FTP plugins; build plugins, including an ANT plugin; JUnit plugins for testing; and release plugins.

Infrastructure management is still required to completely close all security holes:
  • A shared local repository (Artifactory) should be located on an accessible server. It should be reachable from the development, integration, testing and production build environments. Some organizations have higher security requirements than others, but local/intranet read-only access is generally sufficient.
  • Work product should be separated out into separate pom files. See section III.

I. Setting up maven



II. Setting up Artifactory



2. Installation in JBoss 6 didn't work, so I installed in Tomcat 5 using JDK 6.
Set global variables for Artifactory and JDK 6:
     $> cd <install dir>/apache-tomcat-5.5.28/bin/
     $> export JAVA_HOME='/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home'
     $> export JAVA_OPTS="$JAVA_OPTS -Dartifactory.home=/Users/kaniu_n/artifactory-2.5.1.1"
     $> ./startup.sh

3. The maven repositories:

Artifactory manages multiple pre-configured remote Maven repositories. It also has a caching capability, i.e. it temporarily stores downloaded libraries locally to reduce IO. Artifactory administration allows you to add new repositories or modify existing ones, local or remote.
Artifactory therefore becomes your single library repository in place of the many remote repositories.

4. pom.xml example:

<project xmlns="http://maven.apache.org/POM/4.0.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                            http://maven.apache.org/maven-v4_0_0.xsd">
   <modelVersion>4.0.0</modelVersion>

   <groupId>com.ritho.maven</groupId>
   <artifactId>simple-parent</artifactId>
   <packaging>pom</packaging>
   <version>1.0</version>
   <name> Parent Project</name>

    <url>http://maven.ritho.com</url>

    <repositories>
        <repository>
            <id>central</id>
            <url>http://localhost:8080/artifactory/repo</url>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
        <repository>
            <id>snapshots</id>
            <url>http://localhost:8080/artifactory/repo</url>
            <releases>
                <enabled>false</enabled>
            </releases>
        </repository>
    </repositories>
    <pluginRepositories>
        <pluginRepository>
            <id>central</id>
            <url>http://localhost:8080/artifactory/repo</url>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
        <pluginRepository>
            <id>snapshots</id>
            <url>http://localhost:8080/artifactory/repo</url>
            <releases>
                <enabled>false</enabled>
            </releases>
        </pluginRepository>
    </pluginRepositories>
        ..
        ..
        ..
 
   
Note that the URLs point to the deployed Artifactory instance.

III. Build, test and Release Management using Maven.


Each project should have a pom. Related projects should have a parent pom file and a separate release pom file. In the following example, I have four Eclipse projects:
    -Exam/ has dependencies on DataObjects/
    -ParentProject/ contains the main pom for all projects.
    -Release/ is the end point for all packaged entities.
Here is an Eclipse project layout:


This example uses implied classpaths for compiling the projects with their dependencies. This is done by referencing the parent project, which is 'aware' of all the relevant modules in the workspace. The goal is to build, test or package from the parent pom.xml file. The <module> element allows mvn to work out the project dependencies and create the compilation sequence for all artifacts. By always running mvn from the parent pom, you ensure that changes in the sub-projects are reflected in the final artifact of the compilation sequence.

Caveats:
All projects will be compiled using the parent's JDK, so all projects have to be up to date with that JDK; for example, code that only raises warnings under JDK 5 may not compile under JDK 6.
Building does not auto-clean the build directories, so run a 'clean' before rebuilding/testing/packaging, i.e.:
>mvn clean
Or
>mvn clean package



Here are my examples:
1. Parent pom

<project xmlns="http://maven.apache.org/POM/4.0.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                            http://maven.apache.org/maven-v4_0_0.xsd">
      <modelVersion>4.0.0</modelVersion>
       <groupId>com.ritho.maven</groupId>
       <artifactId>simple-parent</artifactId>
      <packaging>pom</packaging>
       <version>1.0</version>
      <name> Parent Project</name>
       <url>http://maven.ritho.com</url>
    <repositories>
        <repository>
            <id>central</id>
            <url>http://localhost:8080/artifactory/repo</url>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
        <repository>
            <id>snapshots</id>
            <url>http://localhost:8080/artifactory/repo</url>
            <releases>
                <enabled>false</enabled>
            </releases>
        </repository>
    </repositories>
    <pluginRepositories>
        <pluginRepository>
            <id>central</id>
            <url>http://localhost:8080/artifactory/repo</url>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
        <pluginRepository>
            <id>snapshots</id>
            <url>http://localhost:8080/artifactory/repo</url>
            <releases>
                <enabled>false</enabled>
            </releases>
        </pluginRepository>
    </pluginRepositories>
   <modules>
       <module>../Exam</module>
       <module>../DataObjects</module>
   </modules>
   <build>
       <pluginManagement>
           <plugins>
               <plugin>
                   <groupId>org.apache.maven.plugins</groupId>
                   <artifactId>maven-compiler-plugin</artifactId>
                   <configuration>
                       <source>1.5</source>
                       <target>1.5</target>
                   </configuration>
               </plugin>
           </plugins>
       </pluginManagement>
   </build>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.5</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>


Some notes: this pom targets JDK 5, but you may alter it for JDK 6 as below:
<pluginManagement>
           <plugins>
               <plugin>
                   <groupId>org.apache.maven.plugins</groupId>
                   <artifactId>maven-compiler-plugin</artifactId>
                   <configuration>
                       <source>1.6</source>
                       <target>1.6</target>
                   </configuration>
               </plugin>
           </plugins>
       </pluginManagement>

Make sure your Maven is running on JDK 6 or higher by setting JAVA_HOME; in a Mac OS X terminal:

$export JAVA_HOME='/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home'
$mvn -e package


2. DataObjects pom
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>
    <groupId>com.dataobjects</groupId>
    <artifactId>dataobjects</artifactId>
    <parent>
        <groupId>com.ritho.maven</groupId>
        <artifactId>simple-parent</artifactId>
        <version>1.0</version>
        <relativePath>../ParentProject</relativePath>
    </parent>
    <packaging>jar</packaging>
    <build>
        <sourceDirectory>src</sourceDirectory>
        <directory>${project.parent.relativePath}/maven/${project.artifactId}</directory>
    </build>
</project>



3. Exam pom
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com</groupId>
    <artifactId>exam</artifactId>
    <parent>
        <groupId>com.ritho.maven</groupId>
        <artifactId>simple-parent</artifactId>
        <version>1.0</version>
        <relativePath>../ParentProject</relativePath>
    </parent>
    <packaging>jar</packaging>
    <dependencies>
        <dependency>
            <groupId>com.dataobjects</groupId>
            <artifactId>dataobjects</artifactId>
            <version>${project.version}</version>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src</sourceDirectory>
        <testSourceDirectory>test</testSourceDirectory>
        <directory>${project.parent.relativePath}/maven/${project.artifactId}</directory>
    </build>
</project>



4. Distribution pom
This pom can be the parent pom with additional plugins enabled, e.g. for reporting, distribution management and remote connections.
i. Reporting allows you to auto-generate HTML that describes the distribution product of the pom file. Example:
<reporting>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-report-plugin</artifactId>
                <version>2.12</version>
                <configuration>
                    <outputDirectory>${project.build.directory}/${project.version}/surefire</outputDirectory>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-site-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                <outputDirectory>${project.build.directory}/${project.version}/site</outputDirectory>
                </configuration>
            </plugin>
        </plugins>
</reporting>

Note the structure of the project below, containing the site and site.apt directories:
site.xml configures the template and its look and feel.
index.apt configures the content using Almost Plain Text notation.
site.xml
<?xml version="1.0" encoding="ISO-8859-1"?>

<project name="Maven" xmlns="http://maven.apache.org/DECORATION/1.0.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://maven.apache.org/DECORATION/1.0.0 http://maven.apache.org/xsd/decoration-1.0.0.xsd">  
   <bannerLeft>
       <name>Ritho HBase Utility for V-1.0.+</name>
       <src>https://sites.google.com/site/rithotechcorp/_/rsrc/1332167517940/config/customLogo.gif?revision=3</src>
       <href>http://maven.ritho.com/</href>
   </bannerLeft>
   <skin>
       <groupId>org.apache.maven.skins</groupId>
       <artifactId>maven-fluido-skin</artifactId>
       <version>1.2.1</version>
   </skin>
   <custom>
       <fluidoSkin>
           <sideBarEnabled>true</sideBarEnabled>
           <googlePlusOne />
       </fluidoSkin>
   </custom>
   <body>
       <links>
           <item name="Labs @ Ritho" href="http://labs.ritho.com/"/>
       </links>
       <menu name="Sources and Libraries">
          <item name="Download Directory" href="../../opensource/com/ritho/hbase/util//rithohbaseutility/" />
       <item name="Maven Descriptor" href="../../opensource/com/ritho/hbase/util/rithohbaseutility/maven-metadata.xml" />
       </menu>
       <menu ref="reports"/>
   </body>
</project>


site.apt
 <<Ritho HBase Utility:>>
 
 Ritho HBase Utility library is a utility for interacting with HBase versions 1.0.+ .

<<Benefits:>>
 This utility simplifies back-end calls to only a few lines of code reducing  developer mistakes. Some level of connection management \
 can be handled for you but the most significant advantage is simplified syntax.
 
<<Usage:>>
 This testcase outlines the most common way to use this utility. Each testcase is self explanatory. You would need to set up \
 configuration for each testcases:

----
package test
class test{
    //some code ...
}
----

The APT tool consumes this to produce decorated HTML encapsulating your text.

ii. Distribution management and remote connection
This setting points to a remote (or local) FTP site where you can publish your product. Here you specify the site and repository locations.

    <distributionManagement>
        <site>
            <id>ritho-ftp-repository</id>
            <url>ftp://ritho.com/downloads/site</url>
        </site>
        <repository>
            <id>ritho-ftp-repository</id>
            <url>ftp://ritho.com/downloads/opensource</url>
        </repository>
    </distributionManagement>

Distribution management requires an extension. Here is an FTP wagon example:
    <build>
        <extensions>
            <!-- Enabling the use of FTP -->
            <extension>
                <groupId>org.apache.maven.wagon</groupId>
                <artifactId>wagon-ftp</artifactId>
                <version>1.0-beta-6</version>
            </extension>
        </extensions>
        ..
        ..

The assumption is that your remote FTP server requires authentication. Note that distribution management identifies the target remote repository (ritho-ftp-repository). settings.xml will contain a username and password for this repository; settings.xml resides in the local/execution environment at ~/.m2/settings.xml:
<settings>
 ...
 <servers>
   <server>
     <id>ritho-ftp-repository</id>
     <username>username</username>
     <password>password</password>
   </server>
 </servers>
 ...
</settings>
