How do successful businesses collect, use, and store data? How do data architects configure large data sets from vastly different data sources in anticipation of an ever-changing business environment?
Tuesday, August 2, 2016
Notes on Nutch crawler with indexing
Understanding Nutch with HBase: the HBase schema.
http://svn.apache.org/repos/asf/nutch/tags/release-2.3.1/conf/gora-hbase-mapping.xml
Data synchronization between RDBMS and Hive using Sqoop
Hive can be a great backup environment for RDBMS data, or simply a data warehouse. Hive provides a solid architecture for bulk OLAP data, and it is also a great choice as a data-charting workspace where Hadoop technologies can be employed to crunch the data.
Because many organizations still rely on RDBMS and SQL technology in their data warehouses, it is often easier to export that data into Hive for bulk processing. Dumping and re-importing data into Hive can be inefficient, so a synchronization strategy built on JDBC is often more logical. Sqoop is designed to replicate data between different databases by speaking a common 'JDBC language'.
Let's see how Sqoop works between SQL Server and Hive.
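As a rough illustration (the host name, database, credentials and table names below are placeholders, and the Microsoft SQL Server JDBC driver jar must be available in Sqoop's lib directory), a pull from SQL Server into Hive and a push back out might look like this:
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=SalesDW" \
  --username sqoop_user -P \
  --table Customers \
  --hive-import --hive-table customers --hive-overwrite \
  -m 4
sqoop export \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=SalesDW" \
  --username sqoop_user -P \
  --table Customers_backup \
  --export-dir /user/hive/warehouse/customers \
  --input-fields-terminated-by '\001'
Scheduling commands like these (for example from cron) gives a simple, repeatable synchronization loop without dumping and re-importing data by hand.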
Thursday, July 14, 2016
Machine to machine conversation with Avro, Protobuf or Thrift
Introduction
API data exchange requires an established mode of transport such as a network protocol or data file system. The data being transferred should also fit an agreed-upon format. SOAP solves that problem by establishing XML as the data format. Modern architectures have moved away from XML because of the payload overhead of its markup and its dependency on developer expertise. Moreover, XML processing can be expensive and require more resources than are available on small devices.
A popular alternative data format is JSON, which is lightweight, schema-enforceable and simple to understand. For machines/APIs to 'talk to' each other, JSON must be translated into the API's language, i.e. Java, Python, etc. This requires some data extraction and translation on the fly. A serialization framework (SF) such as Avro, Thrift or Protobuf bridges the languages by making extraction and translation seamless through pre-generated client stubs. A translation layer based on a defined schema is created for each client in its respective language, so that Java can talk to Python or any other supported language. SFs use an interface description language (IDL) to define the generated translation-layer source code.
Here are some examples in JAVA:
Avro
Avro allows for native type definitions, nesting, collections (arrays and maps) and complex types. It uses JSON as its IDL, which makes it familiar. A precompiled translation/serialization layer is not required, although depending on the business use case you may still want one. On-the-fly, schema-driven data translation makes it easier to work with a growing schema or to adopt new schemas, decoupling the development process from the data definition. However, minimal development is still required for all clients to implement the Avro APIs.
Language support is generous: Java, C++, C#, Python and JavaScript, among others.
Download the avro-tools jar from Apache.
Create a schema. For example, create the file CustomMessage.avsc:
{"namespace": "avro",
"type": "record",
"name": "CustomMessage",
"fields": [
{"name": "id", "type": "string"},
{"name": "payload", "type": "string"}
]
}
As mentioned before, you have two options: generate stubs, or create a generic record parser.
Option 1 Generate stub:
java -jar ./avro-tools-1.7.7.jar compile schema avro/CustomMessage.avsc .
Option 2: stubless Avro parser. See my parser example on GitHub (a minimal sketch also follows the Maven dependency below).
Run tests using maven dependency:
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.7.7</version>
</dependency>
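With that dependency on the classpath, a minimal stubless round trip using GenericRecord (option 2 above) might look like the sketch below. The schema path avro/CustomMessage.avsc and the class name are assumptions for illustration; the parser on GitHub is the fuller version.
import java.io.ByteArrayOutputStream;
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class StublessAvroDemo {
    public static void main(String[] args) throws Exception {
        // Parse the schema created earlier; no generated stub class is needed.
        Schema schema = new Schema.Parser().parse(new File("avro/CustomMessage.avsc"));

        // Build a record directly against the schema.
        GenericRecord record = new GenericData.Record(schema);
        record.put("id", "msg-001");
        record.put("payload", "hello avro");

        // Serialize to bytes.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();

        // Deserialize back with the same schema.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(decoded.get("id") + " -> " + decoded.get("payload"));
    }
}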
Protobuf
Protobuf is a Google creation that is widely used within Google's ecosystem. It has more limited language support, but the documentation is very detailed. It requires precompiled code stubs, which makes the development process slightly less flexible. The IDL is simple and easy to understand, but it is not JSON.
It requires system installation:
- download
- ./configure
- make
- make check
- sudo make install
- protoc --version
Create a schema. For example, create the file CustomMessage.proto. Using proto2 syntax:
package syntax2;
option java_outer_classname="CustomMessageProtobuf";
message CustomMessage {
optional int32 age = 1;
optional string name = 2;
}
Or, using proto3 syntax (the syntax declaration is required when the field labels are dropped):
syntax = "proto3";
package syntax3;
option java_outer_classname="CustomMessageProtobuf";
message CustomMessage {
int32 age = 1;
string name = 2;
}
Generating protobuf java objects
protoc CustomMessage.proto --java_out ../
When testing, note the relationship between the proto syntax version and the Maven dependency version:
Maven dependencies for syntax2:
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>2.6.1</version>
<scope>compile</scope>
</dependency>
Maven dependencies for syntax3:
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>3.0.0-beta-2</version>
<scope>compile</scope>
</dependency>
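With the generated classes on the classpath, a minimal build/serialize/parse round trip might look like the sketch below. The package and class names assume the syntax2 file above (no java_package option was set, so the generated outer class CustomMessageProtobuf lands in package syntax2):
import syntax2.CustomMessageProtobuf.CustomMessage;

public class ProtobufDemo {
    public static void main(String[] args) throws Exception {
        // Build a message with the generated builder.
        CustomMessage message = CustomMessage.newBuilder()
                .setAge(30)
                .setName("jane")
                .build();

        // Serialize to a compact byte array suitable for the wire.
        byte[] bytes = message.toByteArray();

        // Parse it back and read the fields.
        CustomMessage decoded = CustomMessage.parseFrom(bytes);
        System.out.println(decoded.getName() + " is " + decoded.getAge());
    }
}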
Thrift
Thrift, like Avro, supports many languages beyond Java, C++ and Python. It has a long history and popular usage in big data technologies, for example HBase's multi-language support. Thrift's drawback is its poor documentation and support; it is also driven by developer expertise.
Missing in Gora: Avro Phoenix translation
I was disappointed to see that while Gora supports Avro-to-SQL bridging (for MySQL, etc.), it does not support Apache Phoenix SQL. Note that Gora supports HBase data, which is logically different from Phoenix.
I found a way around this gap: create a custom Avro-Phoenix translator.
The challenge is to be aware of both Avro and Phoenix data types and to map them programmatically onto a prepared statement. This approach can be extended to any JDBC-driven connectivity; a sketch follows under 'Write Avro to Phoenix translation' below.
My Avro Schema
Create phoenix table:
!sql CREATE TABLE TEST_TABLE (CREATED_TIME TIMESTAMP NOT NULL, ID VARCHAR(255) NOT NULL, MESSAGE VARCHAR(255), INTERNAL_ID VARCHAR, CONSTRAINT PK PRIMARY KEY (CREATED_TIME, ID)) SALT_BUCKETS=20;
Write to Phoenix to test:
!sql UPSERT INTO TEST_TABLE (CREATED_TIME, ID, MESSAGE) VALUES (now(),'id1', 'my message');
Write Avro to Phoenix translation
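The project on GitHub does the full Avro-to-Phoenix type mapping; the sketch below only shows the core idea against TEST_TABLE, assuming the Phoenix client jar is on the classpath. The ZooKeeper host in the JDBC URL and the Avro field names (id, payload, borrowed from the CustomMessage example earlier) are placeholders, not the project's actual schema.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

import org.apache.avro.generic.GenericRecord;

public class AvroToPhoenixWriter {

    // Upsert one Avro record into TEST_TABLE. Field names here are
    // illustrative; a real translator walks the Avro schema and maps
    // each Avro type onto the matching Phoenix/JDBC setter.
    public static void write(GenericRecord record) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
            String sql = "UPSERT INTO TEST_TABLE (CREATED_TIME, ID, MESSAGE) VALUES (?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
                ps.setString(2, record.get("id").toString());
                ps.setString(3, record.get("payload").toString());
                ps.executeUpdate();
            }
            // Phoenix connections do not auto-commit by default.
            conn.commit();
        }
    }
}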
View the github project.
Read Phoenix data to Avro translation (future work).
Docs:
https://avro.apache.org/docs/1.7.7/spec.html#schema_record
https://avro.apache.org/docs/1.5.0/api/java/org/apache/avro/Schema.Type.html
https://phoenix.apache.org/language/datatypes.html
Friday, June 17, 2016
Configuring Thrift for HBase IO
Configuring PHP Thrift for HBase IO
There are a few ways to read and write data to HBase in PHP: raw programmatic socket IO; shell-script system calls; HBase Stargate RESTful calls; and Thrift connection calls.
Thrift PHP is a library that makes socket connection calls to HBase. It works well for PHP applications being migrated from MySQL.
The setup
This assumes you already have PHP (php5) installed and running on Ubuntu.
Install git and the build dependencies
sudo apt-get install git build-essential automake libtool flex bison libboost*
Get latest Thrift
git clone https://git-wip-us.apache.org/repos/asf/thrift.git
Make Thrift
Change directory into ~/thrift and execute bootstrap
cd ~/thrift
./bootstrap.sh
Run environment configuration
./configure
make
make install
The final product is installed in /usr/lib/php
HBase IO
Setting up Thrift php modules
The basic requirements:
- set globals
- set required files
- use the HBase and Thrift classes
Import dependencies
Initialize classes
Write to HBase example
Read from HBase example
Complete example on github:
https://github.com/Kaniuritho/openprojects/blob/master/thrifthbase/thriftWriteTest.php
Monday, February 15, 2016
Scripting with hadoop technologies
Introduction
The Hadoop commands described in the Apache tutorial are designed to work like shell commands. They are simple commands that can be scripted into more complex ones. I will explore some of them to hint at how to achieve complex manipulations outside of the MapReduce/YARN process. Note that these commands are not a replacement for MapReduce/YARN jobs, merely a complement.
Assumptions
You have a running Hadoop environment and are familiar with the HDFS structure. Hadoop 1 or Hadoop 2 works fine. A distributed HDFS cluster is not necessary to run these examples.
When working with or analyzing large data sets, Hadoop becomes the preferred file system over local/PC file systems because of the distributed nature of its resources.
Hadoop commands work on files loaded into HDFS, so you will need to load files into your running Hadoop file system. A simple command works fine, e.g.:
/usr/local/hadoop/bin/hadoop fs -copyFromLocal input_file hdfs_file
copyFromLocal is not flexible because it does not allow you to pipe row-manipulation commands; the imported file is treated as a single object. As with any transfer mechanism, copyFromLocal can be interrupted when the file is oversized or the network is burdened, causing unrecoverable failure. To reduce this risk, work with small files. You can chunk a large file into smaller files in your Linux environment using the split command:
gzip -cd /mnt/upload/SuperHugeCompressed.zip | split -b 10G --filter='gzip -cf > $FILE' - /mnt/jobcache/dataextraction/SmallerSuperHugeCompressed_gz_split_ &
This command decompresses the large archive, splits the stream into 10GB chunks, and re-compresses each chunk, so every output file (starting with SmallerSuperHugeCompressed_gz_split_aa) carries its own gzip header. The gzip header allows the OS to recognize each individual file as compressed data.
Then load the files into HDFS using the hadoop copyFromLocal or put commands.
copyFromLocal will copy the files matching SmallerSuperHugeCompressed_gz_split_* and store them in the HDFS /raw directory intact and compressed:
/usr/local/hadoop/bin/hadoop fs -copyFromLocal /mnt/jobcache/dataextraction/SmallerSuperHugeCompressed_gz_split_* /raw/
Here is a hadoop put example where the input file is compressed and the hdfs file is decompressed:
gunzip -c /mnt/jobcache/dataextraction/SmallerSuperHugeCompressed_gz_split_aa | /usr/local/hadoop/bin/hadoop fs -put - /raw/SmallerSuperHugeDecompressed_gz_split_aa &
hadoop put allows for row string manipulation using Linux shell commands such as awk and sed. For example, load a file into HDFS and modify each row by stripping the double quotes ("):
gunzip -c /mnt/jobcache/dataextraction/SmallerSuperHugeCompressed_gz_split_aa | sed -e 's/"//g' | /usr/local/hadoop/bin/hadoop fs -put - /raw/SmallerSuperHugeDecompressed_gz_split_aa &
View contents of the first row
You can cat an HDFS file and stop at row number 1:
hadoop fs -cat /tmp/hadoop-yarn/staging/history/done/2015/03/14/000000/job_1426372650076_0011_conf.xml | sed -n 1p
Extract first row into file
You can pipe the above command into another hadoop put command in order to write the results back into hdfs:
hadoop fs -cat /tmp/hadoop-yarn/staging/history/done/2015/03/14/000000/job_1426372650076_0011_conf.xml | sed -n 1p | hadoop fs -put - /raw/first_row
Search for content in a file
You can search for content in an HDFS file using hadoop cat and grep. This allows cat to scan row by row, searching for the grep term and writing matches to stdout.
hadoop fs -cat /tmp/hadoop-yarn/staging/history/done/2015/03/14/000000/job_1426372650076_0011_conf.xml | grep configuration
It is important to note that HDFS data is immutable. Modifying a row in HDFS requires streaming one file into another and modifying the stream during that process.
Remove extra replication on a file (set the replication factor to 1)
/usr/local/hadoop/bin/hadoop fs -setrep -R 1 /raw/Acx_NonMatch_IB123838output2
Conclusions
Most hadoop commands are local commands that do not take advantage of the distributed nature of Hadoop. However, some commands, especially 'dfsadmin' administrative commands, operate against the distributed cluster rather than a single node. Scripting these distributed operations would require monitoring mechanisms in order to be incorporated into a larger automated workflow.