File compression tools like gzip and bzip2 can compress text files to a fraction of their size, often to as little as 20% of the original. Data files often come compressed to save storage space and network bandwidth. A typical workflow is to uncompress a file before analysis, but it can be more convenient to leave it in its compressed form, especially if the uncompressed version would take up a significant amount of space. In this post I'll show how to work directly with compressed files in Python.
Compression Ratios
Let's look at a small CSV file containing data on National Parks (originally from Wikipedia). The uncompressed file is 3.1 KB.
!head -n4 data/nationalparks.csv
!ls -lh data/nationalparks.csv
These commands compress the file so we can compare sizes. The -k option keeps the original file for bzip2; only recent versions of gzip support this option, so the gzip command is written with a redirect instead.
# Clean file before creating it again
!rm -f data/nationalparks.csv.bz2
%%bash
gzip < data/nationalparks.csv > data/nationalparks.csv.gz
bzip2 -k data/nationalparks.csv
ls -lh data/nationalparks*
In general bzip2 compresses slightly better than gzip, but is significantly slower; for general use I find gzip preferable. Now on to Python! The function below computes the total area of all the National Parks using the uncompressed file.
import csv

def sum_area(f):
    reader = csv.reader(f.readlines()[1:])  # exclude header line
    total_area = sum(float(row[3]) for row in reader)
    return total_area
def total_area_uncompressed(filename):
    with open(filename) as f:
        return sum_area(f)

total = total_area_uncompressed('data/nationalparks.csv')
print('Total National Park Area = {:,} acres'.format(total))
To accomplish the same thing with compressed files, we can use the gzip and bz2 libraries:
import bz2
import gzip

def total_area_gzip(filename):
    # 'rt' opens the compressed file in text mode so csv can parse it
    with gzip.open(filename, 'rt') as f:
        return sum_area(f)

def total_area_bz2(filename):
    with bz2.open(filename, 'rt') as f:
        return sum_area(f)
print('Total National Park Area = {:,} acres'.format(
    total_area_gzip('data/nationalparks.csv.gz')
))
print('Total National Park Area = {:,} acres'.format(
    total_area_bz2('data/nationalparks.csv.bz2')
))
Now our code operates on the compressed files directly. The downside is that reading the compressed files takes longer (for larger files this difference will be more significant). Keep in mind that the cost of decompressing the file must be paid at some point, either before or during analysis, so as long as you're not running the analysis many times the overhead is minimal.
%%timeit
total_area_uncompressed('data/nationalparks.csv')
%%timeit
total_area_gzip('data/nationalparks.csv.gz')
%%timeit
total_area_bz2('data/nationalparks.csv.bz2')
We could also write a function that deals with the file appropriately based on its extension, saving us from having three separate functions.
def opener(filename):
    if filename.endswith('.gz'):
        return gzip.open(filename, 'rt')
    elif filename.endswith('.bz2'):
        return bz2.open(filename, 'rt')
    else:
        return open(filename)
for extension in ['', '.gz', '.bz2']:
    filename = 'data/nationalparks.csv' + extension
    print('Reading {}'.format(filename))
    with opener(filename) as f:
        print('Total National Park Area = {:,} acres'.format(sum_area(f)))
Of course, using library functions is preferred when possible. Happily, pandas supports reading compressed files with the compression= parameter of read_csv().
import pandas as pd
npdf = pd.read_csv('data/nationalparks.csv.bz2', compression='bz2')
npdf.head()
print('Total National Park Area = {:,} acres'.format(npdf['Area'].sum()))
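pandas can write compressed files too: to_csv() also accepts a compression= parameter. Here is a minimal round trip to sketch the idea; the two-row frame and the file name are made up for illustration, not part of the National Parks data.

```python
import os
import tempfile
import pandas as pd

# A small made-up frame for illustration
df = pd.DataFrame({'Name': ['Acadia', 'Arches'],
                   'Area': [47389.67, 76518.98]})

# Write a gzip-compressed CSV, then read it back
path = os.path.join(tempfile.mkdtemp(), 'parks.csv.gz')
df.to_csv(path, index=False, compression='gzip')

roundtrip = pd.read_csv(path, compression='gzip')
print('Total area = {:,} acres'.format(roundtrip['Area'].sum()))
```

Newer versions of pandas can also infer the codec from the .gz or .bz2 extension, so the compression= argument on the read side is often optional.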