My approach was straightforward. I started by splitting the large zip file into smaller, still-compressed files:
gzip -c /mydir/verylargezip.zip | split -b 1G - /workdir/smallerzip_gz_split_
But there's one issue to watch out for: the resulting files share a single compression start marker (in the first file) and a single end marker (in the last file). If you try to use or decompress a middle file on its own, you may get a 'file type not recognized' error. The following solution creates independent compressed files:
gzip -cd /mydir/verylargezip.zip | split -b 1G --filter='gzip -cf > $FILE' - /workdir/smallerzip_gz_split_ &
This splits the decompressed data into 1 GB chunks and compresses each chunk independently (roughly 10x compression). You can then read the first few rows to create a small delimited file to import into a spreadsheet such as Google Sheets.
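As a quick check that the pieces really are independent, any one of them should now decompress on its own (split names the second chunk smallerzip_gz_split_ab by default):
gzip -cd /workdir/smallerzip_gz_split_ab | head
To pull out a range of rows from a chunk, you can write a small Perl script.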
The script (readrows.pl):
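A minimal sketch of what it could look like (assuming the chunks are gzip-compressed as above, and that row numbers are zero-based as in the commands below):
#!/usr/bin/perl
# readrows.pl - write rows first..last (zero-based) of a gzip-compressed chunk to a file
use strict;
use warnings;

my ($infile, $outfile, $first, $last) = @ARGV;
die "Usage: $0 infile outfile first_row last_row\n" unless defined $last;

# The chunks were written with gzip, so decompress on the fly.
open(my $in,  '-|', 'gzip', '-cd', $infile) or die "Cannot read $infile: $!";
open(my $out, '>', $outfile) or die "Cannot write $outfile: $!";

my $row = 0;
while (my $line = <$in>) {
    print {$out} $line if $row >= $first && $row <= $last;
    last if $row >= $last;   # stop once the last requested row has been written
    $row++;
}
close $in;
close $out;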
The command:
perl readrows.pl /workdir/smallerzip_gz_split_aa /workdir/first100rows.csv 0 99
From here you can also extract the header row (the 1st row) on its own:
perl readrows.pl /workdir/smallerzip_gz_split_aa /workdir/datacolumns.csv 0 0
Suppose you would like to know how many rows are in your big file; you can count the rows in a single chunk (decompressing it first, since the chunks are gzipped):
gzip -cd /workdir/smallerzip_gz_split_aa | wc -l
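Or, if you don't mind streaming through everything once, you can total it across all the chunks:
gzip -cd /workdir/smallerzip_gz_split_* | wc -l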
Given a multi column delimited file (use this example: https://raw.githubusercontent.com/Azzurrio/moviestore/master/db/seeds_data/movies.csv), how can you search for all movies that cost 24.99?
grep '24.99' movies.csv
which returns all rows containing the search term:
The Godfather,1972,24.99,"The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.",tt0068646,"MV5BMjEyMjcyNDI4MF5BMl5BanBnXkFtZTcwMDA5Mzg3OA@@._V1_SX214_AL_.jpg"
Apocalypse Now,1979,24.99,"During the U.S.-Viet Nam War, Captain Willard is sent on a dangerous mission into Cambodia to assassinate a renegade colonel who has set himself up as a god among a local tribe.",tt0078788,"MV5BMTcyMzQ5NDM4OV5BMl5BanBnXkFtZTcwODUwNDg3OA@@._V1_SY317_CR12,0,214,317_AL_.jpg"
Suppose you only care about the movie name, year released, and DVD price: you can extract just the columns you want, i.e. columns 1 to 3, using 'cut':
grep '24.99' movies.csv | cut -d ',' -f 1,2,3
Or use awk, which gives you more control over the output:
grep '24.99' movies.csv | awk -v OFS="," -F"," '{print $1, $2, $3}'
Both produce:
The Godfather,1972,24.99
Apocalypse Now,1979,24.99
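Note that the plain grep matches 24.99 anywhere in the row (and the '.' also matches any character), so a value like 124.99 in any column would match too. If you want to restrict the match to the price column, an awk field comparison is more precise (a sketch, assuming the price is always the third field):
awk -F',' '$3 == "24.99"' movies.csv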
Awk also allows you to decorate your output by defining a new output delimiter (tr ',' '|' does the same) and adding text. For example:
grep '24.99' movies.csv | awk -v OFS="|" -F"," '{print $1, "year="$2, "price="$3}'
produces '|'-delimited results decorated with a data description:
The Godfather|year=1972|price=24.99
Apocalypse Now|year=1979|price=24.99
Let's get all the unique prices from our list and sort them in ascending order:
cat movies.csv | cut -d ',' -f 3 | sort | uniq
to get (note that the string 'price' appears below because it is the header row):
13.99
14.99
15.99
17.99
19.99
20.99
22.99
24.99
26.99
price
To exclude the header row, use sed '1d':
cat movies.csv | sed '1d' | cut -d ',' -f 3 | sort | uniq
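If you also want to know how many movies are listed at each price, uniq -c prefixes each value with its count:
cat movies.csv | sed '1d' | cut -d ',' -f 3 | sort | uniq -c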
Let's convert all the movie titles to upper case and replace spaces with '_':
cat movies.csv | sed '1d' | cut -d ',' -f 1 | tr '[a-z]' '[A-Z]' | sed 's/ /_/g'
produces:
THE_GODFATHER
WILD_STRAWBERRIES
THE_SHAWSHANK_REDEMPTION
ONE_FLEW_OVER_THE_CUCKOO'S_NEST
THE_DARK_KNIGHT
IT'S_A_WONDERFUL_LIFE
KILL_BILL
SEVEN_SAMURAI
APOCALYPSE_NOW
BLOOD_DIAMOND
MODERN_TIMES
"A_BEAUTIFUL_MIND"
Once again, this is only a demonstration of the power of simple scripting to handle moderately sized data. When we get to the BIG DATA size material, these one-liner shell commands can be combined with big data technology to produce amazing results. One example is Hadoop shell commands, which can be piped in the same way.
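For example, the same kind of pipeline can be run against a file sitting in HDFS (a sketch; the /data/movies.csv path is just for illustration):
hadoop fs -cat /data/movies.csv | grep '24.99' | cut -d ',' -f 1,2,3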
Some other notes on shell scripting
I started off with a file like this:
• Catherine - Cat, Cath, Cate (Kate), Cathy, Cassie, Katie
• Terence - Terry, Tel
S[edit]
• Teresa - Tracy/Tracey, Terry/Terrie
The desired output is of this format:
Teresa=>Tracy,Tracey,Terry,Terri
Theodore=>Ted,Teddy,Ed,Eddy,Ned,Neddy,The
Theresa=>Tracy,Tracey,Terry,Terri
Here’s my bash command:
grep • Untitled\ 4.txt | cut -d '•' -f2 | sed -e s/\ //g | sed -e s/\(/,/g | sed -e s/\)//g | sed -e s/\\//,/g | sed -e s/\-/=\>/g | awk '{print $1}' > firstnames_hypocoristics.txt
grep keeps any row containing '•', which excludes 'S[edit]'
cut -d '•' -f2 keeps only the part after the '•'
sed -e s/\ //g removes all the spaces
sed -e s/\(/,/g | sed -e s/\)//g replaces '(' with ',' and removes ')'
sed -e s/\\//,/g replaces '/' with ','
sed -e s/\-/=\>/g replaces '-' with '=>'
awk '{print $1}' removes the leading tab (once the spaces are gone, each line is a single field)
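To see the stages working together, you can feed just the Catherine line through the same pipeline (the grep is dropped since there is only one line):
echo '• Catherine - Cat, Cath, Cate (Kate), Cathy, Cassie, Katie' | cut -d '•' -f2 | sed -e s/\ //g | sed -e s/\(/,/g | sed -e s/\)//g | sed -e s/\\//,/g | sed -e s/\-/=\>/g | awk '{print $1}'
which prints:
Catherine=>Cat,Cath,Cate,Kate,Cathy,Cassie,Katie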