Introduction
Hadoop commands described in the Apache tutorial are designed to work like shell commands: simple commands that can be composed into more complex ones. I will explore some of these commands to hint at how to achieve complex manipulations outside of the map reduce / YARN process. Note that these commands are not a replacement for map reduce/YARN processes but merely a complement.
Assumptions
You have a running hadoop environment. You are familiar with hdfs structure. Hadoop1 or hadoop2 works fine. A distributed hdfs system is not necessary to run these examples.
When working with or analyzing large data sets, hadoop becomes the preferred file system over local/PC file systems because of the distributed nature of its resources.
Hadoop commands work on files loaded into hdfs, so you will need to load files into your running hadoop file system. A simple command works fine, for example:
/usr/local/hadoop/bin/hadoop fs -copyFromLocal input_file hdfs_file
CopyFromLocal is not flexible: it does not allow you to pipe row manipulation commands, and the imported file is treated as a single object. As with any transfer mechanism, copyFromLocal can be interrupted when the file is oversized or when the network is burdened, causing an unrecoverable failure. To reduce this risk, work with small files. You can chunk a large file into smaller files in your linux environment using the split command:
gzip -cd /mnt/upload/SuperHugeCompressed.zip | split -b 10G --filter='gzip -cf > $FILE' - /mnt/jobcache/dataextraction/SmallerSuperHugeCompressed_gz_split_ &
This command decompresses the large file, splits the stream into 10GB chunks (starting with SmallerSuperHugeCompressed_gz_split_aa), and recompresses each chunk with gzip so that every output file carries its own gzip header. The gzip header allows the OS to recognize each individual file as compressed data.
Then load the files into hdfs using the hadoop copyFromLocal or put commands.
CopyFromLocal will copy the files that match SmallerSuperHugeCompressed_gz_split_* and store them in the hdfs /raw directory intact and compressed:
/usr/local/hadoop/bin/hadoop fs -copyFromLocal /mnt/jobcache/dataextraction/SmallerSuperHugeCompressed_gz_split_* /raw/
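If you want to confirm the upload, a quick listing of the hdfs /raw directory shows the copied chunks (the second column of each entry is the file's replication factor):
/usr/local/hadoop/bin/hadoop fs -ls /raw/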
Here is a hadoop put example where the input file is compressed and the hdfs file is decompressed:
gunzip -c /mnt/jobcache/dataextraction/SmallerSuperHugeCompressed_gz_split_aa | /usr/local/hadoop/bin/hadoop fs -put - /raw/SmallerSuperHugeDecompressed_gz_split_aa &
Hadoop put allows for row string manipulation using linux shell commands such as awk and sed. For example, load a file into hdfs and modify each row by removing double quotes ("):
gunzip -c /mnt/jobcache/dataextraction/SmallerSuperHugeCompressed_gz_split_aa | sed -e 's/"//g' | /usr/local/hadoop/bin/hadoop fs -put - /raw/SmallerSuperHugeDecompressed_gz_split_aa &
View contents of the first row
You can cat an hdfs file and stop at row number 1:
hadoop fs -cat /tmp/hadoop-yarn/staging/history/done/2015/03/14/000000/job_1426372650076_0011_conf.xml | sed -n 1p
Extract first row into file
You can pipe the above command into another hadoop put command in order to write the results back into hdfs:
hadoop fs -cat /tmp/hadoop-yarn/staging/history/done/2015/03/14/000000/job_1426372650076_0011_conf.xml | sed -n 1p | hadoop fs -put - /raw/first_row
Search for content in a file
You can search for content in an hdfs file using hadoop cat and grep. This lets cat scroll through the file row by row while grep searches for the term and writes matching rows to stdout.
hadoop fs -cat /tmp/hadoop-yarn/staging/history/done/2015/03/14/000000/job_1426372650076_0011_conf.xml | grep configuration
It is important to note that hdfs data is immutable. Modifying a row in hdfs requires streaming the file into another file and modifying the stream during that process.
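A minimal sketch of that pattern, reusing the conf.xml file from the earlier examples, streams the file through sed and writes the modified rows to a new hdfs file (the /raw/job_conf_modified target name is only an illustration):
hadoop fs -cat /tmp/hadoop-yarn/staging/history/done/2015/03/14/000000/job_1426372650076_0011_conf.xml | sed -e 's/configuration/CONFIGURATION/g' | hadoop fs -put - /raw/job_conf_modified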
Remove replication on file
/usr/local/hadoop/bin/hadoop fs -setrep -R 1 /raw/Acx_NonMatch_IB123838output2
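The -setrep command sets the replication factor, so -R 1 reduces the file to a single copy in hdfs rather than deleting it. To confirm the new factor you can list the file; the second column of the listing is the replication count:
/usr/local/hadoop/bin/hadoop fs -ls /raw/Acx_NonMatch_IB123838output2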
Conclusions
Most hadoop commands are local commands that do not take advantage of the distributed nature of hadoop. However, some hadoop commands, notably distcp, run as map reduce/yarn applications. Scripting these distributed applications would require monitoring mechanisms in order to be incorporated into a larger automated workflow.
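As a minimal sketch of such a monitoring mechanism (the distcp source and target paths below are assumptions, not part of the examples above), a wrapper script can at least check the exit status of the distributed job before the workflow continues:
#!/bin/bash
# run a distributed copy and only continue the workflow if it succeeded
/usr/local/hadoop/bin/hadoop distcp /raw /raw_backup
if [ $? -ne 0 ]; then
    echo "distcp failed; stopping the workflow" >&2
    exit 1
fi
echo "distcp finished; continuing with the next step"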