Tuesday, November 11, 2014

Basic data preparation using scripting

Not too long ago I had to work with an 800GB zipped file. Disk space was limited, and memory and processor resources were constrained enough that unzipping the file would have taken several hours and perhaps crashed the computer in the process. All I wanted to know was what type of data the file contained and how many rows it had. This was going to help me with resource planning for a data system.

My approach was straightforward. I started by splitting the large zip file into smaller, still-compressed files.

gzip -c /mydir/verylargezip.zip | split -b 1G - /workdir/smallerzip_gz_split_

But there is one issue to watch out for: the resulting files are slices of a single compressed stream, so the compression start marker lives in the first file and the end marker in the last. If you try to use or decompress a middle file on its own you may get a 'file type not recognized' error. The following variant creates independent compressed files instead:

gzip -cd /mydir/verylargezip.zip | split -b 1G --filter='gzip -cf > $FILE' - /workdir/smallerzip_gz_split_ &
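
You can confirm that a middle piece now stands on its own, for example (the _ab suffix here is just one of the pieces split generates):

gzip -t /workdir/smallerzip_gz_split_ab && echo 'valid gzip'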

Each resulting file holds 1G of uncompressed data, stored roughly 10X smaller on disk thanks to compression. You can then read the first few rows into a small delimited file and import it into a spreadsheet such as Google Sheets. A short Perl script can extract a range of rows:

The script (readrows.pl):
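
A minimal sketch of what this script might look like (assuming each split piece is an independent gzip file and that row indices are zero-based and inclusive):

#!/usr/bin/perl
# readrows.pl - copy a range of rows from a gzipped file to an output file
# usage: perl readrows.pl <input.gz> <output> <startrow> <endrow>
use strict;
use warnings;

my ($infile, $outfile, $start, $end) = @ARGV;
die "usage: readrows.pl <input.gz> <output> <startrow> <endrow>\n"
    unless defined $end;

# decompress through a pipe so the whole file never touches the disk
open(my $in,  '-|', 'gzip', '-cd', $infile) or die "cannot read $infile: $!";
open(my $out, '>',  $outfile)               or die "cannot write $outfile: $!";

while (my $line = <$in>) {
    my $row = $. - 1;            # $. is the 1-based input line counter
    last if $row > $end;         # stop early once past the requested range
    print {$out} $line if $row >= $start;
}

close($in);
close($out);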
The command:

perl readrows.pl /workdir/smallerzip_gz_split_aa /workdir/first100rows.csv 0 99

From here you can also extract the column-header row (row 1):

perl readrows.pl /workdir/smallerzip_gz_split_aa /workdir/datacolumns.csv 0 0

Suppose you would like to know roughly how many rows are in your big file; since each piece is compressed, decompress one to a pipe and count its rows

gzip -cd /workdir/smallerzip_gz_split_aa | wc -l

and multiply by the number of pieces created in the split step. This is an estimate, but a decent one, since every piece holds the same 1G of uncompressed data; a one-liner putting it together follows.
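
For example (assuming bash and the file prefix used above):

pieces=$(ls /workdir/smallerzip_gz_split_* | wc -l)
rows=$(gzip -cd /workdir/smallerzip_gz_split_aa | wc -l)
echo $((rows * pieces))   # estimated total row count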

Suppose instead that your data is small enough to decompress onto disk yet too large to load into a spreadsheet; you can use scripts to manipulate and search through the data. Consider one-liner shell commands; they are powerful and easy to maintain.

Given a multi-column delimited file (use this example: https://raw.githubusercontent.com/Azzurrio/moviestore/master/db/seeds_data/movies.csv), how can you find all movies that cost 24.99?

grep '24.99' movies.csv

which returns all rows containing the search term (note that the unescaped '.' matches any character, so grep '24\.99' would be a stricter pattern):

The Godfather,1972,24.99,"The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.",tt0068646,"MV5BMjEyMjcyNDI4MF5BMl5BanBnXkFtZTcwMDA5Mzg3OA@@._V1_SX214_AL_.jpg"
Apocalypse Now,1979,24.99,"During the U.S.-Viet Nam War, Captain Willard is sent on a dangerous mission into Cambodia to assassinate a renegade colonel who has set himself up as a god among a local tribe.",tt0078788,"MV5BMTcyMzQ5NDM4OV5BMl5BanBnXkFtZTcwODUwNDg3OA@@._V1_SY317_CR12,0,214,317_AL_.jpg"

Suppose you only care about the movie name, year released, and price of a DVD: you can extract just those columns, i.e. columns 1 to 3, using 'cut':

grep '24.99' movies.csv | cut -d ',' -f 1,2,3

Or, for more powerful output options, use awk:

grep '24.99' movies.csv | awk -v OFS="," -F"," '{print $1, $2, $3}'

Both produce:

The Godfather,1972,24.99
Apocalypse Now,1979,24.99

Awk allows you to decorate your output by defining a new output delimiter (tr ',' '|' achieves the same delimiter swap; see the example further below) and adding text. For example:

grep '24.99' movies.csv | awk -v OFS="|" -F"," '{print $1, "year="$2, "price="$3}'

produces '|'-delimited results decorated with a data description:

The Godfather|year=1972|price=24.99
Apocalypse Now|year=1979|price=24.99
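
For comparison, the tr variant mentioned above swaps the delimiter but does not add the labels:

grep '24.99' movies.csv | cut -d ',' -f 1,2,3 | tr ',' '|'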

Let's get all the unique prices from our list, sorted in ascending order:

cat movies.csv | cut -d ',' -f 3 | sort | uniq

to get (note that the string 'price' appears below because the header row, row 1, is included):

13.99
14.99
15.99
17.99
19.99
20.99
22.99
24.99
26.99
price

To exclude the header row, use "sed '1d'":

cat movies.csv | sed '1d' | cut -d ',' -f 3 | sort | uniq

Let's change all the movie titles to upper case and replace spaces with '_':

cat movies.csv | sed '1d' | cut -d ',' -f 1 | tr '[a-z]' '[A-Z]' | sed 's/ /_/g'

produces:
THE_GODFATHER
WILD_STRAWBERRIES
THE_SHAWSHANK_REDEMPTION
ONE_FLEW_OVER_THE_CUCKOO'S_NEST
THE_DARK_KNIGHT
IT'S_A_WONDERFUL_LIFE
KILL_BILL
SEVEN_SAMURAI
APOCALYPSE_NOW
BLOOD_DIAMOND
MODERN_TIMES
"A_BEAUTIFUL_MIND"

Once again, this is only a demonstration of the power of simple scripting to handle moderately sized data. When we get to BIG DATA sized material, these one-liner shell commands can be combined with big data technology to produce amazing results. One example is the Hadoop filesystem shell, whose commands can be piped in the same way; a quick sketch follows.
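
For instance, assuming movies.csv had been loaded into HDFS at /data/movies.csv (a hypothetical path), the same pipeline runs against the cluster's copy:

hadoop fs -cat /data/movies.csv | grep '24.99' | cut -d ',' -f 1,2,3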