Tuesday, November 28, 2017

Visual approaches to gaining knowledge from big data using Spark, ELK and directed graphs

Background


Traditional methods of information extraction require the quintessential Extract, Transform, Load (ETL) process, followed by complex analytical queries and, finally, visualization.
In this article we look at a streamlined process that gets us to the visual aids more quickly, if not instantaneously.


The problem


Let us suppose data has been dropped on your screen for you to gain business insight or solve a problem. Your first questions may be: what is the source of the data, and what does each data point represent? How granular is the data? What is the global relationship between the data points?

We propose three steps to answering these questions:
1. Randomly sample your data and profile the attributes.
2. Extract information from the rest of your data based on the profile from (1).
3. Generate a holistic view of the information by drawing links between the pieces of information extracted. Repeat (3) to satisfaction.
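Step 1 can be sketched in a few lines of Python. The record layout here (host, status, latency fields) is an illustrative assumption, not part of the article:

```python
import random
from collections import Counter

def profile_sample(records, k, seed=None):
    """Step 1: randomly sample k records, then profile each attribute
    by counting the value types observed per field."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(k, len(records)))
    profile = {}
    for record in sample:
        for field, value in record.items():
            profile.setdefault(field, Counter())[type(value).__name__] += 1
    return sample, profile

# Hypothetical data points: each one is a small dict of attributes.
records = [
    {"host": "web-01", "status": 200, "latency_ms": 12.5},
    {"host": "web-02", "status": 500, "latency_ms": 98.1},
    {"host": "web-01", "status": 404, "latency_ms": 7.3},
]
sample, profile = profile_sample(records, k=2, seed=42)
```

The resulting profile (which fields exist, and what types they hold) is what drives the extraction in step (2) and the link-drawing in step (3).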


Consideration 


In order to streamline this process, let us consider an API integration exercise and some code. We want to integrate a stream-processing API, an indexing engine, a backend database, and the visual aids.

Spark for extraction and transformation
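In Spark the per-record extraction logic can be written as an ordinary Python function and handed to `rdd.map`. A minimal sketch, assuming a comma-separated log format (the field layout and HDFS path are hypothetical):

```python
def parse_line(line):
    """Extract structured fields from one raw, comma-separated log line.
    Returns None for malformed lines so they can be filtered out."""
    parts = line.strip().split(",")
    if len(parts) != 3:
        return None
    host, status, latency = parts
    try:
        return {"host": host, "status": int(status), "latency_ms": float(latency)}
    except ValueError:
        return None

# With PySpark the same function plugs straight into the pipeline, e.g.:
#   parsed = sc.textFile("hdfs://.../logs").map(parse_line) \
#              .filter(lambda r: r is not None)
# Here we apply it locally to show the transformation itself:
parsed = [r for r in map(parse_line, ["web-01,200,12.5", "bad line"]) if r is not None]
```

Keeping the extraction logic in a plain function makes it testable outside the cluster before it is shipped to the executors.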


Profiling data using ELK
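To profile the sampled records visually in Kibana, they first need to be indexed into Elasticsearch. One way is the `_bulk` REST endpoint, whose newline-delimited body can be built with the standard library alone; the index name below is an assumption:

```python
import json

def to_bulk_body(records, index):
    """Build an Elasticsearch _bulk request body: one action line
    followed by one document line per record, newline-delimited."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"

body = to_bulk_body(
    [{"host": "web-01", "status": 200}, {"host": "web-02", "status": 500}],
    index="profile-sample",  # hypothetical index name
)
# POST this body to http://localhost:9200/_bulk with
# Content-Type: application/x-ndjson, then explore the fields in Kibana.
```

Once indexed, Kibana's field summaries and aggregations give the attribute profile of step (1) with no extra query-writing.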


Spark for directed graph loading into JanusGraph
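JanusGraph is loaded through Gremlin traversals, so the job reduces to turning extracted records into a directed edge list and rendering each edge as a traversal to submit (e.g. via the gremlinpython driver). A sketch, with assumed field names and a `name` vertex property that is not prescribed by the article:

```python
def build_edges(records, src_field, dst_field):
    """Derive a directed edge list from records: one edge per record,
    pointing from the value of src_field to the value of dst_field."""
    return [(r[src_field], r[dst_field]) for r in records]

def gremlin_statements(edges, label="links_to"):
    """Render each edge as a string-based Gremlin traversal that could
    be submitted to a JanusGraph/Gremlin server."""
    stmts = []
    for src, dst in edges:
        stmts.append(
            "g.V().has('name','%s').as('a')."
            "V().has('name','%s').addE('%s').from('a')" % (src, dst, label)
        )
    return stmts

edges = build_edges(
    [{"client": "web-01", "server": "db-01"}],
    src_field="client", dst_field="server",  # hypothetical field names
)
stmts = gremlin_statements(edges)
```

In a Spark job, `build_edges` would run per partition and the statements would be submitted from a `foreachPartition` so each executor holds one graph connection.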


Gephi for analytics
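Gephi can ingest a directed edge list directly from a CSV with `Source` and `Target` columns, which the standard library can emit. A minimal sketch (node names are the same hypothetical hosts as above):

```python
import csv
import io

def edges_to_gephi_csv(edges):
    """Serialize a directed edge list in the Source,Target CSV format
    that Gephi's spreadsheet import understands."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Source", "Target"])
    for src, dst in edges:
        writer.writerow([src, dst])
    return buf.getvalue()

csv_text = edges_to_gephi_csv([("web-01", "db-01"), ("web-02", "db-01")])
# Save as edges.csv, import it into Gephi as an edges table,
# then run layout and centrality analytics on the resulting digraph.
```

From there, Gephi's layouts and built-in metrics (degree, PageRank, community detection) supply the holistic view called for in step (3).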