Tuesday, August 2, 2016

Notes on Nutch crawler with indexing


Understanding Nutch with HBase: the HBase schema.

http://svn.apache.org/repos/asf/nutch/tags/release-2.3.1/conf/gora-hbase-mapping.xml


Data synchronization between rdbms and hive using sqoop

Hive can be a great backup environment for RDBMS data, or simply a data warehouse in its own right. Hive provides a good architecture for bulk OLAP data, and it is also a good choice as a data exploration and charting workspace where hadoop technologies can be employed to crunch the data.
Because many organizations still use RDBMS and SQL technology in their data warehouses, it is often easier to export that data into Hive to perform bulk processing. Dumping data and re-importing it into Hive can be inefficient, so a data synchronization strategy using JDBC technology is often more logical. Sqoop is designed to replicate data between different databases by speaking the same 'JDBC language'.
Let's see how Sqoop works between SQL Server and Hive.
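
As a concrete starting point, here is a hedged sketch of a one-off Sqoop import from SQL Server into Hive; the host, database, table, credentials and Hive table names below are placeholders:

sqoop import \
  --connect "jdbc:sqlserver://sqlserver-host:1433;databaseName=SalesDb" \
  --username sqoop_user \
  --password '********' \
  --table ORDERS \
  --hive-import \
  --hive-table staging.orders \
  --num-mappers 4

For ongoing synchronization rather than a one-off dump, Sqoop's incremental mode (--incremental lastmodified together with --check-column and --last-value) pulls only the rows that changed since the previous run.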



Thursday, July 14, 2016

Machine to machine conversation with Avro, Protobuf or Thrift


Introduction


API data exchange requires an established mode of transport, such as a network protocol or a data file system. The data being transferred should also fit an agreed-upon format. SOAP solves that problem by establishing XML as the data format. Modern architectures have deviated from XML because of the payload overhead attributed to its markup and its dependency on developer expertise. Moreover, XML processing can be expensive and require more resources than are available on small devices.

A popular alternative data format is JSON, which is lightweight, schema enforceable and simple to understand. In order for machines/APIs to 'talk to each other', JSON needs to be translated into the API language, i.e. Java, Python etc. This requires some data extraction and translation on the fly. A serialization framework (SF) such as Avro, Thrift or Protobuf solves the language bridge by making extraction and translation seamless through pre-generated client stubs. A translation layer based on a defined schema is created for both clients in their respective languages, so that Java can talk to Python or any other supported language. SFs use an interface description language (IDL) to define the generated translation-layer source code.

Here are some examples in JAVA:

Avro


Avro allows for native type definitions, nesting, collections (arrays and maps) and complex types. It uses JSON as the IDL, which makes it familiar. A precompiled translation/serialization layer is not required, although depending on the business use case you may still want one. On-the-fly, schema-driven data translation makes it easier to work with a growing schema or to adopt new schemas, decoupling the development process from the data definition. However, minimal development is still required for every client to implement the Avro APIs.
Language support is generous: Java, C++, C#, Python and JavaScript, among others.

Download avro tools jar from apache

Create the schema. For example, create the file CustomMessage.avsc:

{"namespace": "avro",
 "type": "record",
 "name": "CustomMessage",
 "fields": [
     {"name": "id", "type": "string"},
     {"name": "payload",  "type": "string"}
 ]

}

As mentioned before, you have two options: generate stubs, or create a generic record parser.
Option 1 Generate stub:

java -jar ./avro-tools-1.7.7.jar compile schema  avro/CustomMessage.avsc .

Option 2 stubless avro parser: See my parser example on github.

Run tests using maven dependency:

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.7.7</version>
</dependency>
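
To sanity check the setup, here is a minimal, hedged sketch of the stubless (GenericRecord) approach against the CustomMessage.avsc file above; the class name and file path are my own:

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroRoundTripTest {
    public static void main(String[] args) throws IOException {
        // Load the schema defined above (the path is an assumption)
        Schema schema = new Schema.Parser().parse(new File("avro/CustomMessage.avsc"));

        // Build a record without any generated stub
        GenericRecord message = new GenericData.Record(schema);
        message.put("id", "id-1");
        message.put("payload", "hello avro");

        // Serialize to bytes
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(message, encoder);
        encoder.flush();

        // Deserialize and read a field back
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(decoded.get("payload"));
    }
}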


Protobuf


Protobuf is a Google creation that is widely used in its ecosystem. It has more limited language support, but the documentation is very detailed. It requires precompiled code stubs, which makes the development process slightly less flexible. The IDL is simple and easy to understand, but it is not JSON.

It requires system installation:

  1. download
  2. ./configure
  3. make
  4. make check
  5. sudo make install
  6. protoc --version


Create schema. For example, create file CustomMessage.proto :

package syntax2;
option java_outer_classname="CustomMessageProtobuf";
message CustomMessage {
    optional int32 age = 1;
    optional string name = 2;
}

Note the native type support. The same message can also be written with proto3 syntax (see below): the 'optional' keyword is removed, and the file must start with a syntax = "proto3"; declaration.

syntax = "proto3";
package syntax3;
option java_outer_classname="CustomMessageProtobuf";
message CustomMessage {
    int32 age = 1;
    string name = 2;
}

Generate the protobuf java objects:
protoc CustomMessage.proto --java_out=../

When testing, note the relationship between syntax and maven dependencies:

Maven dependencies for syntax2:
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>2.6.1</version>
  <scope>compile</scope>
</dependency>

Maven dependencies for syntax3:
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>3.0.0-beta-2</version>
  <scope>compile</scope>
</dependency>
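
For a quick round-trip test with the generated stub, here is a minimal, hedged sketch (the test class name is my own; because no java_package option is set, the generated class lives in a Java package named after the proto package, e.g. syntax2):

import syntax2.CustomMessageProtobuf;

import com.google.protobuf.InvalidProtocolBufferException;

public class ProtobufRoundTripTest {
    public static void main(String[] args) throws InvalidProtocolBufferException {
        // Build a message through the generated builder (class names follow
        // java_outer_classname and the message name in CustomMessage.proto)
        CustomMessageProtobuf.CustomMessage original =
                CustomMessageProtobuf.CustomMessage.newBuilder()
                        .setAge(30)
                        .setName("jane")
                        .build();

        // Serialize to a compact binary payload
        byte[] bytes = original.toByteArray();

        // Parse it back, as a receiving client would
        CustomMessageProtobuf.CustomMessage decoded =
                CustomMessageProtobuf.CustomMessage.parseFrom(bytes);
        System.out.println(decoded.getName() + " is " + decoded.getAge());
    }
}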

Thrift


Thrift, like Avro, supports many languages beyond Java, C++ and Python. It has a long history and popular usage in big data technologies, for example HBase's multi-language support. Thrift's drawback is comparatively poor documentation and support, and it leans heavily on developer expertise.
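
For comparison with the Avro and protobuf schemas above, the equivalent message could be described in Thrift IDL roughly as follows (the namespace is an assumption); the thrift compiler then generates the client stubs:

namespace java thrift.generated

struct CustomMessage {
  1: i32 age,
  2: string name
}

thrift --gen java CustomMessage.thrift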




Missing in Gora: Avro Phoenix translation

I was disappointed to see that while Gora supports Avro-to-SQL bridging (for MySQL etc.), it does not support Apache Phoenix SQL. Note that Gora supports HBase data, which is logically different from Phoenix.
I found a way around this gap, and that is to create a custom Avro-Phoenix translator.

The challenge is to be aware of both the Avro and the Phoenix data types and to do a programmatic mapping onto a prepared statement. This approach can be extended to any JDBC-driven connectivity:

My Avro Schema




Create phoenix table:
!sql CREATE TABLE TEST_TABLE (CREATED_TIME TIMESTAMP NOT NULL, ID VARCHAR(255) NOT NULL, MESSAGE VARCHAR(255), INTERNAL_ID VARCHAR, CONSTRAINT PK PRIMARY KEY ( CREATED_TIME, ID ) ) SALT_BUCKETS=20;

Write to phoenix to test
!sql UPSERT INTO TEST_TABLE (CREATED_TIME, ID, MESSAGE) VALUES (now(),'id1', 'my message');

Write Avro to Phoenix translation
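
The core of the translator (the full version is in the github project linked below) walks the Avro record and binds each value onto the matching column of a Phoenix UPSERT prepared statement. Here is a minimal, hedged sketch of that idea, reusing the id/payload fields from the CustomMessage schema in the earlier post and the TEST_TABLE columns above; the zookeeper quorum string is a placeholder:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

import org.apache.avro.generic.GenericRecord;

public class AvroToPhoenixWriter {

    // Column order matches the parameter order used below
    private static final String UPSERT_SQL =
            "UPSERT INTO TEST_TABLE (CREATED_TIME, ID, MESSAGE) VALUES (?, ?, ?)";

    public static void write(GenericRecord record, String zkQuorum) throws Exception {
        // Phoenix JDBC URL; zkQuorum is a placeholder such as "localhost:2181"
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:" + zkQuorum);
             PreparedStatement stmt = conn.prepareStatement(UPSERT_SQL)) {

            // Map Avro values onto Phoenix/JDBC types field by field.
            // Avro strings come back as CharSequence (Utf8), so convert explicitly.
            stmt.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
            stmt.setString(2, String.valueOf(record.get("id")));
            stmt.setString(3, String.valueOf(record.get("payload")));

            stmt.executeUpdate();
            conn.commit(); // Phoenix connections do not auto-commit by default
        }
    }
}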


View the github project.

Read Phoenix Data to Avro translation ( Future work ).


Docs:

https://avro.apache.org/docs/1.7.7/spec.html#schema_record
https://avro.apache.org/docs/1.5.0/api/java/org/apache/avro/Schema.Type.html
https://phoenix.apache.org/language/datatypes.html

Friday, June 17, 2016

Configuring Thrift for HBase IO

Configuring php Thrift for HBase IO

There are a few ways to read and write data to HBase from php: raw programmatic socket connection IO, shell script system calls, HBase Stargate RESTful calls, and Thrift connection calls.
Thrift php is a library that wraps socket connection calls to HBase. It works well for php applications being migrated from mysql.

The setup

Assuming you already have php/php5 installed and running on ubuntu.

Install git and the build dependencies
sudo apt-get install git build-essential automake libtool flex bison libboost*

Get latest Thrift
git clone https://git-wip-us.apache.org/repos/asf/thrift.git

Make Thrift
Change directory into ~/thrift and run the bootstrap script:
cd ~/thrift
./bootstrap.sh

Run environment configuration, then build and install:
./configure
make
sudo make install
The php library ends up installed in /usr/lib/php

HBase IO

Setting up Thrift php modules
The basic requirements:
- set globals
- set the required files
- use the HBase and Thrift classes

Import dependencies


Initialize classes

Write to Hbase Example

Read from Hbase Example



Monday, February 15, 2016

Scripting with hadoop technologies

Introduction

Hadoop commands described in the Apache tutorial are designed to work like shell commands. They are simple commands that can be scripted into complex ones. I will explore some commands to give a hint of how to achieve complex manipulations outside of the map reduce/YARN process. Note that these commands are not a replacement for map reduce/YARN processes but merely a complement.

Assumptions

You have a running hadoop environment. You are familiar with hdfs structure. Hadoop1 or hadoop2 works fine. A distributed hdfs system is not necessary to run these examples.


When working with or analyzing large data sets, hadoop becomes the preferred file system over local/PC file systems because of the distributed nature of its resources.
Hadoop commands work on files loaded into hdfs, therefore you will need to load files into your running hadoop file system. A simple command works fine, i.e.

/usr/local/hadoop/bin/hadoop fs  -copyFromLocal input_file hdfs_file

copyFromLocal is not flexible because it does not allow you to pipe in row manipulation commands; the imported file is treated as a single object. As with any transfer mechanism, copyFromLocal can be interrupted when the file is oversized or the network is burdened, causing unrecoverable failure. To reduce this risk, work with smaller files. You can chunk a large file into smaller files in your linux environment using the split command:

gzip -cd /mnt/upload/SuperHugeCompressed.zip | split -b 10G --filter='gzip -cf > $FILE' - /mnt/jobcache/dataextraction/SmallerSuperHugeCompressed_gz_split_ &

This command splits the large compressed file into 10GB files (starting with SmallerSuperHugeCompressed_gz_split_aa) and recompresses each chunk with gzip, so each file carries its own gzip header. The gzip header allows tools to recognize each individual file as compressed data.

Then load the file into hdfs using hadoop copyFromLocal or put commands. 

copyFromLocal will copy the files matching SmallerSuperHugeCompressed_gz_split_* and store them in the hdfs /raw directory intact and compressed.

/usr/local/hadoop/bin/hadoop fs  -copyFromLocal /mnt/jobcache/dataextraction/SmallerSuperHugeCompressed_gz_split_*  /raw/ 

Here is a hadoop put example where the input file is compressed and the hdfs file is decompressed:

gunzip -c  /mnt/jobcache/dataextraction/SmallerSuperHugeCompressed_gz_split_aa | /usr/local/hadoop/bin/hadoop fs -put - /raw/SmallerSuperHugeDecompressed_gz_split_aa &

Hadoop put allows for row string manipulation using linux shell commands such as awk and sed. For example, load a file into hdfs and modify each row by stripping out the double quotes ("):

gunzip -c  /mnt/jobcache/dataextraction/SmallerSuperHugeCompressed_gz_split_aa  | sed -e 's/"//g' | /usr/local/hadoop/bin/hadoop fs -put - /raw/SmallerSuperHugeDecompressed_gz_split_aa &

View contents of the first row  
You can cat a hdfs file and stop at row number 1:


hadoop fs -cat /tmp/hadoop-yarn/staging/history/done/2015/03/14/000000/job_1426372650076_0011_conf.xml |  sed -n 1p 

Extract first row into file
You can pipe the above command into another hadoop put command in order to write the results back into hdfs:


hadoop fs -cat /tmp/hadoop-yarn/staging/history/done/2015/03/14/000000/job_1426372650076_0011_conf.xml |  sed -n 1p | hadoop fs -put - /raw/first_row

Search for content in a file
You can search for content in an hdfs file using hadoop cat and grep. Cat streams the file row by row while grep matches the search term and writes the results to stdout.


hadoop fs -cat /tmp/hadoop-yarn/staging/history/done/2015/03/14/000000/job_1426372650076_0011_conf.xml |  grep configuration

It is important to note that hdfs data is immutable. Modifying a row in hdfs requires streaming one file into another and modifying the stream during that process.

Reduce replication on a file (set the replication factor to 1)

/usr/local/hadoop/bin/hadoop fs -setrep -R 1 /raw/Acx_NonMatch_IB123838output2

Conclusions

Most hadoop commands are local commands that do not take advantage of the distributed nature of hadoop. However, some hadoop commands, especially 'dfsadmin' commands, are map reduce/yarn applications. Scripting these distributed applications would require monitoring mechanisms in order to be incorporated into a larger automated workflow.