Tuesday, November 28, 2017

Visual approaches to gaining knowledge from big data using Spark, ELK and directed graphs

Background


Traditional methods of information extraction require the quintessential Extract, Transform, Load (ETL) process, followed by complex analytical queries and, finally, visualization.
In this article we look at a streamlined process that gets us to the visual aids faster, if not instantly.


The problem


Let us suppose data has been dropped on your screen for you to gain business insight or solve a problem. Your first questions may be: what is the source of the data, and what does each data point represent? How granular is the data? What is the global relationship between the data points?

We propose three steps to answering these questions:
1. randomly sample your data and profile the attributes (see the sketch after this list)
2. extract information from the rest of your data based on (1).
3. generate holistic knowledge of the information by drawing links between the information extracted. Repeat (3) to satisfaction.
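
A minimal sketch of step 1, assuming the drop is a set of JSON files and using Spark for the sampling; the path and the 1% sample fraction are hypothetical:

import org.apache.spark.sql.SparkSession

object SampleAndProfile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sample-and-profile").getOrCreate()

    // Read the raw drop (path is illustrative)
    val raw = spark.read.json("hdfs:///data/raw/drop/*.json")

    // Step 1: take a small random sample and profile the attributes
    val sample = raw.sample(withReplacement = false, fraction = 0.01)
    sample.printSchema()       // what does each data point represent?
    sample.describe().show()   // count, mean, min, max per column

    spark.stop()
  }
}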


Consideration 


In order to streamline this process, let us consider an API integration exercise and some code. We want to integrate a stream processing API, an indexing engine, a backend database and the visual aids.

Spark for extraction and transformation
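
A hedged sketch of the extraction and transformation step, again in Spark; the selected columns and the curated output path are assumptions that would come out of the profiling above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ExtractTransform {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("extract-transform").getOrCreate()

    // Extract: read the full raw data set (path is illustrative)
    val raw = spark.read.json("hdfs:///data/raw/drop/*.json")

    // Transform: keep the attributes identified during profiling,
    // drop rows missing key fields and tag each record with an ingestion time
    val events = raw.select("timestamp", "host", "data")
      .na.drop(Seq("timestamp", "host"))
      .withColumn("ingested_at", current_timestamp())

    // Write a curated copy for the downstream steps
    events.write.mode("overwrite").parquet("hdfs:///data/curated/events")

    spark.stop()
  }
}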


Profiling data using ELK
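
One way to profile interactively is to push a sample of the curated records into Elasticsearch and explore them in Kibana. A sketch using the elasticsearch-spark connector; the cluster address, index name and sample fraction are assumptions:

import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._   // elasticsearch-spark connector

object ProfileWithElk {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("profile-with-elk")
      .config("es.nodes", "localhost")   // assumption: local Elasticsearch
      .config("es.port", "9200")
      .getOrCreate()

    val sample = spark.read.parquet("hdfs:///data/curated/events")
      .sample(withReplacement = false, fraction = 0.01)

    // Index the sample; Kibana can then be used to eyeball value
    // distributions, cardinality and missing fields
    sample.saveToEs("profile_sample/event")   // index/type naming is illustrative

    spark.stop()
  }
}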


Spark for directed graph loading into JanusGraph db.
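
A sketch of loading a directed graph with the JanusGraph API from a Spark job. It assumes the JanusGraph client jars and a connection properties file are available on the executors; the vertex and edge labels are illustrative, and a real job would also deduplicate host vertices:

import org.apache.spark.sql.SparkSession
import org.janusgraph.core.JanusGraphFactory

object LoadJanusGraph {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("load-janusgraph").getOrCreate()
    val events = spark.read.parquet("hdfs:///data/curated/events")

    // Open one JanusGraph connection per partition and write host -> event edges
    events.rdd.foreachPartition { rows =>
      val graph = JanusGraphFactory.open("conf/janusgraph-cassandra-es.properties")
      rows.foreach { row =>
        val host  = graph.addVertex("host")
        host.property("name", row.getAs[String]("host"))
        val event = graph.addVertex("event")
        event.property("timestamp", row.getAs[String]("timestamp"))
        host.addEdge("emitted", event)   // directed edge: host -> event
      }
      graph.tx().commit()
      graph.close()
    }
    spark.stop()
  }
}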


Gephi for analytics
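
Gephi is a desktop tool, so the remaining step is to hand it the graph in a format it can open. One option, sketched here, is to export the JanusGraph contents to GraphML with the TinkerPop IO API; the properties file and output path are assumptions:

import org.apache.tinkerpop.gremlin.structure.io.IoCore
import org.janusgraph.core.JanusGraphFactory

object ExportForGephi {
  def main(args: Array[String]): Unit = {
    val graph = JanusGraphFactory.open("conf/janusgraph-cassandra-es.properties")
    // Write the whole graph as GraphML, which Gephi imports directly
    graph.io(IoCore.graphml()).writeGraph("events.graphml")
    graph.close()
  }
}

From there, Gephi's layout, degree and community-detection tools can be applied to the directed graph.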




Thursday, April 6, 2017

Attaining data flexibility while posting data to Elasticsearch

The problem:


Elasticsearch can prove unwieldy when the data being indexed does not keep a fixed schema. While it is possible to auto-create indices in Elasticsearch by posting JSON data, you may run into validation exceptions when the incoming data schema changes.

As an example, post this message to an index that does not yet exist in order to auto-create it:

curl -XPOST http://localhost/eventindex_uinak/event -d '{"timestamp":"2017-03-22T21:59:34 UTC","host":"uinak" ,"data":"a string"  }'

returns:

{"_index":"eventindex_uinak","_type":"event","_id":"AVr4DDXJ6XT8Rp6bGeEA","_version":1,"created":true}

Let's change the data schema a little by making "data" a JSON object:

curl -XPOST http://localhost/eventindex_uinak/event -d '{"timestamp":"2017-03-22T21:59:34 UTC","host":"uinak" ,"data": {"key":"my value"}  }'

returns:

{"error":"RemoteTransportException[[Super-Nova][inet[/x.x.x.x:y]][indices:data/write/index]]; nested: MapperParsingException[failed to parse [data]]; nested: ElasticsearchIllegalArgumentException[unknown property [key]]; ","status":400}

What happened here? Elasticsearch created an index whose mapping was inferred from the first data payload. When the next message is posted, it fails validation because the data type of the "data" element has changed.
Elasticsearch is described as 'schema-free', but is it really? It depends at what level: when defining an index it is schema-free, but when updating an index or adding data that conflicts with the existing mapping it is not.

Solution:

There are three possible solutions:
1. rename the conflicting attribute so it no longer collides with the existing mapping.
2. enforce data types using a schema defined beforehand, i.e. as part of a data pipeline (see the sketch after this list).
3. not recommended, but you may map the conflicting attribute as a 'not_analyzed' string and then stringify all values; for example, escape the JSON structural characters ( \{, \}, etc. ).
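
As a sketch of option 2, the index can be created with an explicit mapping before any data is posted, so the pipeline knows the shape it must conform to. The field types below are assumptions based on the example payloads:

curl -XPUT http://localhost/eventindex_uinak -d '{
  "mappings": {
    "event": {
      "properties": {
        "timestamp": { "type": "string", "index": "not_analyzed" },
        "host":      { "type": "string", "index": "not_analyzed" },
        "data":      { "type": "object" }
      }
    }
  }
}'

With "data" mapped as an object up front, the pipeline is then responsible for wrapping plain-string payloads (for example as {"value": "a string"}) before posting them.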