My approach was straightforward. I started by splitting the large zip file into smaller, still-compressed files:
gzip -c /mydir/verylargezip.zip | split -b 1G - /workdir/smallerzip_gz_split_
But there's one issue to watch out for: the resulting pieces form a single compressed stream, with the start marker in the first file and the end marker in the last. If you try to decompress a middle file on its own, you may get a 'file type not recognized' error. The following variant instead creates independent compressed files:
gzip -cd /mydir/verylargezip.zip | split -b 1G --filter='gzip -cf > $FILE' - /workdir/smallerzip_gz_split_ &
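A quick way to convince yourself that each piece is now a complete gzip file on its own is gzip's integrity test, gzip -t. The sketch below uses throwaway data and paths invented just for the demonstration (it also assumes GNU split, which provides --filter):

```shell
# Build throwaway data and split it with the --filter recipe (GNU split)
mkdir -p /tmp/splitdemo && cd /tmp/splitdemo
seq 1 100000 > big.txt
split -b 100K --filter='gzip -cf > $FILE' big.txt chunk_gz_
# Every chunk passes gzip's integrity test independently...
for f in chunk_gz_*; do gzip -t "$f"; done
# ...and decompressing the chunks in order reproduces the original exactly
gzip -cd chunk_gz_* | cmp - big.txt
```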
Because split now sees the decompressed stream, each piece holds 1G of uncompressed data and re-compresses to roughly a tenth of that size. You can then read the first few rows into a small delimited file to import into a spreadsheet such as Google Drive. You can write a Perl script to extract rows:
The script (readrows.pl):
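The Perl source is not reproduced here; as a sketch of what such a script would do, here is a hypothetical shell equivalent. The function name and the zero-indexed row arguments are assumptions inferred from the commands below:

```shell
# Hypothetical stand-in for readrows.pl (its Perl source is not shown in this post).
# Usage: readrows <input.gz> <output-file> <first-row> <last-row>, rows zero-indexed.
readrows() {
    # sed addresses are 1-indexed, so shift both zero-indexed bounds up by one
    gzip -cd "$1" | sed -n "$(($3 + 1)),$(($4 + 1))p" > "$2"
}
```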
The command:
perl readrows.pl /workdir/smallerzip_gz_split_aa /workdir/first100rows.csv 0 99
From here you can also extract the column-header row (row 1):
perl readrows.pl /workdir/smallerzip_gz_split_aa /workdir/datacolumns.csv 0 0
Suppose you would like to know how many rows are in your big file. You can count the rows in each piece; since the pieces are themselves gzip-compressed, decompress before counting (then sum across the pieces):
gzip -cd /workdir/smallerzip_gz_split_aa | wc -l
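Since the --filter recipe leaves each piece gzip-compressed, you can also decompress all the pieces into one stream and count the total in a single pass. Demonstrated below on throwaway data and paths invented for the demo:

```shell
# Split throwaway data into gzip-compressed pieces, then count rows across all of them
mkdir -p /tmp/countdemo && cd /tmp/countdemo
seq 1 5000 | split -l 2000 --filter='gzip -cf > $FILE' - part_gz_
gzip -cd part_gz_* | wc -l   # counts every piece at once
```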
Given a multi-column delimited file (use this example: https://raw.githubusercontent.com/Azzurrio/moviestore/master/db/seeds_data/movies.csv), how can you search for all movies that cost 24.99? Escape the dot, since an unescaped '.' matches any character:
grep '24\.99' movies.csv
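Note that grep matches the pattern anywhere on the line, so a movie whose description happened to contain 24.99 would also be returned. To restrict the match to the price column, you can compare field 3 directly with awk. A small stand-in file (invented for the demo) keeps the example self-contained:

```shell
# A tiny stand-in for movies.csv: header plus three rows
cat > /tmp/movies_sample.csv <<'EOF'
name,year,price
The Godfather,1972,24.99
Modern Times,1936,13.99
Apocalypse Now,1979,24.99
EOF
# Compare field 3 (the price column) as a string instead of grepping the whole line
awk -F',' '$3 == "24.99"' /tmp/movies_sample.csv
```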
which returns all rows containing the search term:
The Godfather,1972,24.99,"The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.",tt0068646,"MV5BMjEyMjcyNDI4MF5BMl5BanBnXkFtZTcwMDA5Mzg3OA@@._V1_SX214_AL_.jpg"
Apocalypse Now,1979,24.99,"During the U.S.-Viet Nam War, Captain Willard is sent on a dangerous mission into Cambodia to assassinate a renegade colonel who has set himself up as a god among a local tribe.",tt0078788,"MV5BMTcyMzQ5NDM4OV5BMl5BanBnXkFtZTcwODUwNDg3OA@@._V1_SY317_CR12,0,214,317_AL_.jpg"
Suppose you only care about movie names, year released, and DVD price: you can extract just the columns you want, i.e. columns 1 to 3, using 'cut':
grep '24.99' movies.csv | cut -d ',' -f 1,2,3
Or use awk, which gives more control over the output:
grep '24.99' movies.csv | awk -v OFS="," -F"," '{print $1, $2, $3}'
Both produce:
The Godfather,1972,24.99
Apocalypse Now,1979,24.99
Awk also lets you decorate the output by defining a new delimiter (tr ',' '|' does the same) and adding text. For example:
grep '24.99' movies.csv | awk -v OFS="|" -F"," '{print $1, "year="$2, "price="$3}'
produces '|'-delimited results decorated with a data description:
The Godfather|year=1972|price=24.99
Apocalypse Now|year=1979|price=24.99
Let's get all the different unique prices from our list and sort them in ascending order:
cat movies.csv | cut -d ',' -f 3 | sort | uniq
to get (note that the string 'price' appears below because it is the row 1 header):
13.99
14.99
15.99
17.99
19.99
20.99
22.99
24.99
26.99
price
To exclude row 1 data use "sed '1d'":
cat movies.csv | sed '1d' | cut -d ',' -f 3 | sort | uniq
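One more refinement: plain sort orders the prices as strings; sort -n guarantees numeric order. On a small stand-in file (invented for the demo):

```shell
# A tiny price list: header plus three rows, with one duplicate price
cat > /tmp/prices_sample.csv <<'EOF'
name,year,price
A,2000,24.99
B,2001,13.99
C,2002,24.99
EOF
# Drop the header, cut the price column, sort numerically, de-duplicate
sed '1d' /tmp/prices_sample.csv | cut -d ',' -f 3 | sort -n | uniq
```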
Let's convert all the movie titles to upper case and replace spaces with '_':
cat movies.csv | sed '1d' | cut -d ',' -f 1 | tr '[a-z]' '[A-Z]' | sed 's/ /_/g'
produces:
THE_GODFATHER
WILD_STRAWBERRIES
THE_SHAWSHANK_REDEMPTION
ONE_FLEW_OVER_THE_CUCKOO'S_NEST
THE_DARK_KNIGHT
IT'S_A_WONDERFUL_LIFE
KILL_BILL
SEVEN_SAMURAI
APOCALYPSE_NOW
BLOOD_DIAMOND
MODERN_TIMES
"A_BEAUTIFUL_MIND"
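To see each step's effect in isolation, here is the same pipeline run on a single made-up CSV row:

```shell
# One made-up row piped through the same steps:
# sed drops the header, cut keeps the title, tr upper-cases it, sed replaces spaces
printf 'name,year,price\nThe Dark Knight,2008,19.99\n' |
  sed '1d' | cut -d ',' -f 1 | tr '[a-z]' '[A-Z]' | sed 's/ /_/g'
# prints THE_DARK_KNIGHT
```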
Once again, this is only a demonstration of the power of simple scripting for handling moderately sized data. When we get to the BIG DATA size material, these one-liner shell commands can be combined with big data technology to produce impressive results. An example is Hadoop shell commands, which can be piped in the same way.