Tuesday, November 28, 2017

Visual approaches to gaining knowledge from big data using Spark, ELK and directed graphs

Background


Traditional methods of information extraction require the quintessential Extract, Transform and Load (ETL) process; followed by complex analytical queries; and finally, visualization.
In this article we look at a streamlined process that gets us to the visual aids faster, if not near-instantly.


The problem


Let us suppose data has been dropped on your screen for you to gain business insight or solve a problem. Your first questions may be: what is the source of the data, and what does each data point represent? How granular is the data? What is the global relationship between the data points?

We propose three steps to answering these questions:
1. randomly sample your data and profile the attributes (see the sketch after this list)
2. extract information from the rest of your data based on (1)
3. build a holistic view of the information by drawing links between the extracted pieces; repeat until satisfied
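
As a concrete starting point for step 1, a Spark job can pull a small random sample and print a basic profile of the attributes. Below is a minimal sketch using the Spark 2.x Java API; the input path and the 'host' column are placeholders for whatever landed on your screen.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SampleAndProfile {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("sample-and-profile").getOrCreate();

        // Load the raw drop as-is; json() is used here, but csv()/parquet() work the same way
        Dataset<Row> raw = spark.read().json("hdfs:///data/drop/*.json");

        // Step 1: take a 1% random sample, then profile what the attributes look like
        Dataset<Row> sample = raw.sample(false, 0.01);
        sample.printSchema();                    // inferred attribute types and nesting
        sample.describe().show();                // count/mean/stddev/min/max for numeric columns
        sample.groupBy("host").count().show();   // cardinality of a candidate attribute

        spark.stop();
    }
}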


Consideration 


In order to streamline this process, let us consider an API integration exercise and some code. We want to integrate a stream processing API; an indexing engine; a backend database; and the visual aids.

Spark for extraction and transformation


Profiling data using ELK


Spark for directed graph loading into JanusGraph db.
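
To make the loading step concrete, the JanusGraph insert on its own (without the Spark plumbing) can be as small as the sketch below. It assumes a janusgraph.properties file pointing at your storage backend, and the vertex and edge labels are made up for illustration.

import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class GraphLoad {
    public static void main(String[] args) throws Exception {
        // janusgraph.properties points at the storage/index backends (HBase, Cassandra, ES, ...)
        JanusGraph graph = JanusGraphFactory.open("janusgraph.properties");

        // Hypothetical example: link a host to an event extracted upstream
        Vertex host  = graph.addVertex(T.label, "host", "name", "uinak");
        Vertex event = graph.addVertex(T.label, "event", "id", "e-001");
        host.addEdge("emitted", event);

        graph.tx().commit();
        graph.close();
    }
}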


Gephi for analytics




Thursday, April 6, 2017

Attaining data flexibility while posting data to Elasticsearch

The problem:


Elasticsearch can prove unwieldy when the data being ingested does not share a single schema. While it is possible to auto-create indices in Elasticsearch by posting JSON data, you may run into validation exceptions when the schema of incoming data changes.

As an example, post this message to a non-existent index so that the index is auto-created:

curl -XPOST http://localhost/eventindex_uinak/event -d '{"timestamp":"2017-03-22T21:59:34 UTC","host":"uinak" ,"data":"a string"  }'

returns:

{"_index":"eventindex_uinak","_type":"event","_id":"AVr4DDXJ6XT8Rp6bGeEA","_version":1,"created":true}

Let's change the data schema a little by making data a JSON object:

curl -XPOST http://localhost/eventindex_uinak/event -d '{"timestamp":"2017-03-22T21:59:34 UTC","host":"uinak" ,"data": {"key":"my value"}  }'

returns:

{"error":"RemoteTransportException[[Super-Nova][inet[/x.x.x.x:y]][indices:data/write/index]]; nested: MapperParsingException[failed to parse [data]]; nested: ElasticsearchIllegalArgumentException[unknown property [key]]; ","status":400}

What happened here? Elasticsearch inferred a mapping for the index from the first data payload, so when the next message is posted it fails validation because the data type of the "data" element has changed.
Elasticsearch is described as 'schema-free', but is it really? It depends at what level: you can create an index without defining a schema, but once a field's mapping has been set, data that contradicts it is rejected.

Solution:

There are three possible solutions:
1. rename the conflicting attribute so the two types never collide.
2. enforce data types using a schema beforehand, i.e. as part of the data pipeline.
3. not recommended, but you may map the conflicting attribute as a 'not analyzed' string and then stringify all of its values, i.e. serialize any nested JSON to an escaped string before posting (see the sketch after this list).
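
As a sketch of option 3 against the 1.x-era API used above (assuming the index is created fresh with this mapping before any documents are posted), the conflicting attribute can be mapped up front as a plain string:

curl -XPUT http://localhost/eventindex_uinak -d '{
  "mappings": {
    "event": {
      "properties": {
        "data": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'

Posting then works for both shapes of "data", as long as the nested variant is serialized to an escaped string first:

curl -XPOST http://localhost/eventindex_uinak/event -d '{"timestamp":"2017-03-22T21:59:34 UTC","host":"uinak","data":"{\"key\":\"my value\"}"}'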



Tuesday, August 2, 2016

Notes on Nutch crawler with indexing


Understanding Nutch with HBase: the HBase schema.

http://svn.apache.org/repos/asf/nutch/tags/release-2.3.1/conf/gora-hbase-mapping.xml


Data synchronization between rdbms and hive using sqoop

Hive can be a great backup environment for RDBMS data, or simply a data warehouse in its own right; its architecture suits bulk OLAP data well. Hive is also a good workspace for charting data, since Hadoop technologies can be employed to crunch it.
Because many organizations still run their data warehouses on RDBMS and SQL technology, it is often easier to export that data into Hive for bulk processing. Dumping data and re-importing it into Hive each time is inefficient, so a data synchronization strategy built on JDBC is more logical. Sqoop is designed to replicate data between different databases by speaking the same 'JDBC language'.
Let's see how Sqoop works between SQL Server and Hive.
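
For example, a single Sqoop import can pull a SQL Server table straight into a Hive table over JDBC. This is a sketch: the host, database, table and user names are made up, and the SQL Server JDBC driver jar needs to be on Sqoop's classpath.

sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=warehouse" \
  --username etl_user -P \
  --table ORDERS \
  --hive-import \
  --hive-table orders \
  --num-mappers 4

For ongoing synchronization rather than a one-off copy, Sqoop's --incremental option (together with --check-column and --last-value) restricts each run to rows changed since the previous import.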



Thursday, July 14, 2016

Machine to machine conversation with Avro, Protobuf or Thrift


Introduction


API data exchange requires an established mode of transport, such as a network protocol or a shared file system, and the data being transferred should fit an agreed-upon format. SOAP solves that problem by establishing XML as the data format. Modern architectures have moved away from XML because of the payload overhead of its markup, its dependency on developer expertise, and because XML processing can be expensive and demand more resources than are available on small devices.

A popular alternative is JSON, which is lightweight, schema-enforceable and simple to understand. In order for machines/APIs to 'talk to each other', JSON needs to be translated into each API's language, i.e. Java, Python etc. This requires some data extraction and translation on the fly. A serialization framework (SF) such as Avro, Thrift or Protobuf bridges the language gap by making extraction and translation seamless through pre-generated client stubs. A translation layer based on a defined schema is created for both clients in their respective languages, so that Java can talk to Python or any other supported language. SFs use an interface description language (IDL) to define the generated translation-layer source code.

Here are some examples in JAVA:

Avro


It allows for native type definitions, nesting, accumulations (arrays and maps) and complex types. It uses JSON as the IDL, which makes it familiar. A precompiled translation/serialization layer is not required, although depending on the business use case you may still want one. On-the-fly, schema-driven data translation makes it easier to work with a growing schema or to adopt new schemas, decoupling the development process from the data definition. However, minimal development is still required for every client to implement the Avro APIs.
Language support is generous: Java, C++, C#, Python and JavaScript among others.

Download avro tools jar from apache

Create schema. For example, create file CustomMessage.avsc

{"namespace": "avro",
 "type": "record",
 "name": "CustomMessage",
 "fields": [
     {"name": "id", "type": "string"},
     {"name": "payload",  "type": "string"}
 ]

}

As mentioned before you have two options: generate stubs, or create a generic record parser.
Option 1 Generate stub:

java -jar ./avro-tools-1.7.7.jar compile schema  avro/CustomMessage.avsc .

Option 2, stubless Avro parser: see my parser example on GitHub.
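
For reference, the stubless route boils down to a generic-record round trip like the sketch below (minimal, against Avro 1.7.7; the schema file path and field values simply reuse the example above):

import java.io.ByteArrayOutputStream;
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class GenericAvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // Parse the schema at runtime -- no generated stub required
        Schema schema = new Schema.Parser().parse(new File("avro/CustomMessage.avsc"));

        // Build a record that conforms to the schema
        GenericRecord record = new GenericData.Record(schema);
        record.put("id", "msg-001");
        record.put("payload", "hello avro");

        // Serialize to Avro binary
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();

        // Deserialize back into a GenericRecord
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(decoded.get("id") + " / " + decoded.get("payload"));
    }
}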

Run tests using maven dependency:

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.7.7</version>

</dependency>


Protobuf


Protobuf is a Google creation which is widely used in the Google ecosystem. It has more limited language support, but the documentation is very detailed. It requires code stubs which are precompiled, making the development process slightly less flexible. The IDL is simple and easy to understand, but it is not JSON.

It requires system installation:

  1. download
  2. ./configure
  3. make
  4. make check
  5. sudo make install
  6. protoc --version


Create schema. For example, create file CustomMessage.proto :

syntax = "proto2";   // proto2 is also the default when the syntax line is omitted
package syntax2;
option java_outer_classname="CustomMessageProtobuf";
message CustomMessage {
    optional int32 age = 1;
    optional string name = 2;
}

Note the native type support. The same message can also be written using proto3 syntax instead of proto2 (declared with the syntax statement); see below, where the 'optional' keyword is removed.

syntax = "proto3";
package syntax3;
option java_outer_classname="CustomMessageProtobuf";
message CustomMessage {
    int32 age = 1;
    string name = 2;
}

Generating the protobuf Java classes:
protoc --java_out=../ CustomMessage.proto
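
Once the classes are generated, a round trip might look like the sketch below (field values are made up; the Java package comes from the proto 'package' statement since no java_package option was set):

import syntax2.CustomMessageProtobuf;

public class ProtobufRoundTrip {
    public static void main(String[] args) throws Exception {
        CustomMessageProtobuf.CustomMessage msg = CustomMessageProtobuf.CustomMessage.newBuilder()
                .setAge(30)
                .setName("uinak")
                .build();

        byte[] wire = msg.toByteArray();   // serialize to the protobuf wire format

        CustomMessageProtobuf.CustomMessage decoded =
                CustomMessageProtobuf.CustomMessage.parseFrom(wire);   // parse it back
        System.out.println(decoded.getName() + " is " + decoded.getAge());
    }
}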

When testing, note the relationship between syntax and maven dependencies:

Maven dependencies for syntax2:
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>2.6.1</version>
  <scope>compile</scope>
</dependency>

Maven dependencies for syntax3:
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>3.0.0-beta-2</version>
  <scope>compile</scope>
</dependency>

Thrift


Thrift, like Avro, supports more languages than just Java, C++ and Python. It has a long history and popular usage in big data technologies, for example HBase's multi-language support. Thrift's drawback is poor documentation and support, and like the others it leans on developer expertise.
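
For comparison, a Thrift definition equivalent to the CustomMessage above and its code-generation step would look roughly like this (a sketch; the namespace is made up):

namespace java thriftdemo

struct CustomMessage {
  1: required string id;
  2: optional string payload;
}

thrift --gen java CustomMessage.thrift

The generated Java bean can then be serialized and deserialized with libthrift's TSerializer and TDeserializer, using a protocol factory such as TBinaryProtocol.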




Missing in Gora: Avro Phoenix translation

I was disappointed to see that while Gora supports Avro-SQL bridging (for MySQL etc.), it does not support Apache Phoenix SQL. Note that Gora supports HBase data, which is logically different from Phoenix.
I found a way around this gap: create a custom Avro-Phoenix translator.

The challenge is to be aware of the Avro and Phoenix data types and do a programmatic mapping onto a prepared statement (see the sketch further down). This can be extended to any JDBC-driven connectivity:

My Avro Schema




Create phoenix table:
!sql CREATE TABLE TEST_TABLE (CREATED_TIME TIMESTAMP NOT NULL, ID VARCHAR(255) NOT NULL, MESSAGE VARCHAR(255), INTERNAL_ID VARCHAR CONSTRAINT PK PRIMARY KEY ( CREATED_TIME, ID ) ) SALT_BUCKETS=20;

Write to phoenix to test
!sql UPSERT INTO TEST_TABLE (CREATED_TIME, ID, MESSAGE) VALUES (now(),'id1', 'my message');

Write Avro to Phoenix translation


View the github project.
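
Reduced to a sketch against the TEST_TABLE above, the core idea is to walk the Avro record and bind each field onto a Phoenix UPSERT through a prepared statement. The Avro field names and the ZooKeeper quorum below are made up for illustration; the actual translator lives in the GitHub project above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import org.apache.avro.generic.GenericRecord;

public class AvroToPhoenix {

    // Map one Avro record onto the Phoenix table created above
    public static void upsert(Connection conn, GenericRecord record) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
                "UPSERT INTO TEST_TABLE (CREATED_TIME, ID, MESSAGE) VALUES (?, ?, ?)");
        // Hypothetical Avro fields: createdTime (long, epoch millis), id and message (strings)
        ps.setTimestamp(1, new Timestamp((Long) record.get("createdTime")));
        ps.setString(2, record.get("id").toString());
        ps.setString(3, record.get("message").toString());
        ps.executeUpdate();
        conn.commit();   // Phoenix connections do not auto-commit by default
    }

    public static void main(String[] args) throws Exception {
        // The Phoenix JDBC URL points at the HBase ZooKeeper quorum
        Connection conn = DriverManager.getConnection("jdbc:phoenix:zkhost:2181");
        // build or receive GenericRecords, then call upsert(conn, record)
        conn.close();
    }
}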

Read Phoenix Data to Avro translation ( Future work ).


Docs:

https://avro.apache.org/docs/1.7.7/spec.html#schema_record
https://avro.apache.org/docs/1.5.0/api/java/org/apache/avro/Schema.Type.html
https://phoenix.apache.org/language/datatypes.html

Friday, June 17, 2016

Configuring Thrift for HBase IO

Configuring php Thrift for HBase IO

There are a few ways to read and write data to HBase from PHP: hard programmatic socket-connection IO; shell-script system calls; HBase Stargate RESTful calls; and Thrift connection calls.
Thrift PHP is a library that makes socket-connection calls to HBase. It works well for PHP applications being migrated from MySQL.

The setup

Assuming you already have PHP (php5) installed and running on Ubuntu.

Install git and the build dependencies
sudo apt-get install git build-essential automake libtool flex bison libboost*

Get latest Thrift
git clone https://git-wip-us.apache.org/repos/asf/thrift.git

Make Thrift
Change into the thrift directory and execute bootstrap:
cd ~/thrift
./bootstrap.sh

Run environment configuration and build:
./configure
make
make install
The PHP library is installed in /usr/lib/php

HBase IO

Setting up Thrift php modules
The basic requirements:
- set globals
- set the required files (includes)
- use the HBase and Thrift classes

Import dependencies


Initialize classes

Write to Hbase Example

Read from Hbase Example