Tuesday, August 2, 2016

Notes on Nutch crawler with indexing


Understanding Nutch with HBase: the HBase schema.

http://svn.apache.org/repos/asf/nutch/tags/release-2.3.1/conf/gora-hbase-mapping.xml


Data synchronization between RDBMS and Hive using Sqoop

Hive can serve as a backup environment for RDBMS data or simply as a data warehouse. Its architecture is well suited to bulk OLAP data. Hive is also a good choice as a workspace for data charting, where Hadoop technologies can be employed to crunch the data.
Because many organizations still use RDBMS and SQL technology in their data warehouses, it is often easier to export that data into Hive for bulk processing. Dumping the data and reimporting it into Hive each time is inefficient, so a data synchronization strategy based on JDBC is more logical. Sqoop is designed to move data between relational databases and Hadoop stores such as Hive by speaking the same 'JDBC language'.
Let's see how Sqoop works between SQL Server and Hive.
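As a rough illustration of that synchronization idea, here is a minimal sketch that assembles a Sqoop incremental import command from SQL Server into Hive. The host, database, table, column, and file names are all hypothetical placeholders, and the helper function build_sqoop_import_args is made up for this note. The sketch assumes Sqoop 1.4.x with the Microsoft SQL Server JDBC driver jar already in Sqoop's lib directory; it only builds and prints the command rather than running it.

```python
import shlex


def build_sqoop_import_args(host, database, table, hive_table, username,
                            password_file, check_column, last_value,
                            port=1433, num_mappers=1):
    """Build the argument list for an incremental Sqoop import into Hive.

    Every value passed in is a placeholder chosen for illustration.
    """
    # SQL Server JDBC URL format: jdbc:sqlserver://host:port;databaseName=db
    jdbc_url = f"jdbc:sqlserver://{host}:{port};databaseName={database}"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        # Read the password from a file instead of putting it on the command line.
        "--password-file", password_file,
        "--table", table,
        # Load the imported rows into a Hive table.
        "--hive-import",
        "--hive-table", hive_table,
        # Only fetch rows whose check column is greater than the last value seen.
        "--incremental", "append",
        "--check-column", check_column,
        "--last-value", str(last_value),
        "--num-mappers", str(num_mappers),
    ]


# Hypothetical example values; replace them with your own environment.
args = build_sqoop_import_args(
    host="sqlserver.example.com",
    database="sales",
    table="orders",
    hive_table="sales.orders",
    username="etl_user",
    password_file="/user/etl/.sqlserver.password",
    check_column="order_id",
    last_value=0,
)

if __name__ == "__main__":
    # Print a shell-safe command line that could be pasted into a terminal.
    print(shlex.join(args))
```

Each run would pass the highest order_id already imported as last_value, so only new rows cross the JDBC connection instead of a full dump and reimport. Whether append mode combines cleanly with a Hive import depends on the Sqoop version, so treat this as a starting point to test.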