Dates and times are an endless source of hassles for anyone working with them. In this post I'll discuss a performance pitfall I encountered when parsing dates in pandas. Conclusion: create a DatetimeIndex by parsing your data with to_datetime(my_dates, format='my_format').
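Here is a minimal sketch of that conclusion. The sample strings and the '%m-%d-%Y' format are made up for illustration; the format string has to match the layout of your own dates.

```python
import pandas as pd

# Hypothetical sample data, not taken from the post.
my_dates = ["11-01-2014", "11-02-2014", "11-03-2014"]

# An explicit format means pandas does not have to guess the layout of each string.
date_index = pd.to_datetime(my_dates, format="%m-%d-%Y")

print(date_index)
```

For a list of strings, to_datetime returns a DatetimeIndex, so the result can be used directly as the index of a Series or DataFrame.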
Analyzing large xml files in python
To show some techniques for working with files that are too large to fit in memory, I'm writing this post on a 10 year old laptop with 512 MB of RAM and a 1.2 GHz Celeron processor. The data in question is an xml format dump of data from …
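One common way to handle xml that won't fit in memory is to stream it with iterparse from the standard library. The sketch below is an assumption about the general technique, not necessarily the method used in the post; the in-memory sample and the 'row' tag are hypothetical stand-ins for a real file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a large file; in practice pass a filename to iterparse.
sample_xml = io.BytesIO(
    b"<rows><row id='1'/><row id='2'/><row id='3'/></rows>"
)

row_count = 0
# iterparse yields each element as its closing tag is read, so the full
# tree never has to be held in memory at once.
for event, element in ET.iterparse(sample_xml, events=("end",)):
    if element.tag == "row":
        row_count += 1
        element.clear()  # drop the element's contents once it has been processed

print(row_count)  # prints 3
```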
Saving time and space by working with gzip and bzip2 compressed files in python
File compression tools like gzip and bzip2 can compress text files into a fraction of their size, often to as little as 20% of the original. Data files often come compressed to save storage space and network bandwidth. A typical workflow is to uncompress the file before analysis, but it can be more convenient to leave the file in its compressed form, especially if the uncompressed file would take up a significant amount of space. In this post I'll show how to work directly with compressed files in python.
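As a rough sketch of the idea, the standard library's gzip and bz2 modules can open compressed files much like ordinary ones. The file names and sample rows here are made up for illustration.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical sample data written to temporary files for the demo.
text = "date,temp\n11-1-2014,56\n11-2-2014,56\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "temps.csv.gz")
    bz2_path = os.path.join(tmp_dir, "temps.csv.bz2")

    # "wt" and "rt" open the compressed file in text mode.
    with gzip.open(gz_path, "wt") as f:
        f.write(text)
    with bz2.open(bz2_path, "wt") as f:
        f.write(text)

    # Iterate line by line; the file is decompressed as it is read.
    with gzip.open(gz_path, "rt") as f:
        gz_lines = [line.rstrip("\n") for line in f]
    with bz2.open(bz2_path, "rt") as f:
        bz2_lines = [line.rstrip("\n") for line in f]

print(gz_lines)
print(bz2_lines)
```

Reading line by line keeps memory use low, since only a small part of the uncompressed data is in memory at any moment.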
Annotating matplotlib plots
To expand on my post about plotting and reshaping data from the BART API, I worked a bit with the matplotlib annotation interface to add text and arrows to a plot. The meat of this post is in cell #4 below. Download notebook.
When joins go wrong, check data types
Writing and debugging joins can be especially difficult when dealing with data from text files. In some cases there is no resulting data, or (much harder to notice!) a few lines that should be included are dropped. Here I'll go into an example of a failed join in pandas, and …
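A minimal sketch of the kind of check the title suggests, assuming two hypothetical tables whose key column holds integers in one and strings in the other (a typical result of loading from text files). The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical tables: the same ids, stored as integers in one and strings in the other.
left = pd.DataFrame({"id": [1, 2, 3], "temp": [56, 59, 60]})
right = pd.DataFrame({"id": ["1", "2", "3"], "city": ["A", "B", "C"]})

# Compare the key dtypes before joining; a mismatch here is a warning sign.
print(left["id"].dtype, right["id"].dtype)

# Cast the string key to the integer type used on the other side, then join.
right["id"] = right["id"].astype("int64")
joined = left.merge(right, on="id", how="inner")

print(joined)
```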
Installing python for data science
Installing all the python libraries required for data science can be a challenge, especially on a Windows machine. Unfortunately the same thing that makes the libraries fast also makes them difficult to distribute to different system types. Luckily there are a few free options for getting up and running painlessly. I …
Label graph axes!
Download IPython notebook for this post.
It's easy enough to make a plot using matplotlib.
import matplotlib.pyplot as plt
import numpy as np
time_point_array = np.arange(0, 5, .1)
y_value_array = np.exp(time_point_array)
plt.plot(time_point_array, y_value_array)
This plot is not great data science. In fact it's poor data …
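A sketch of what a labeled version of that plot could look like. The label and title text are my own placeholders, and the Agg backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

time_point_array = np.arange(0, 5, .1)
y_value_array = np.exp(time_point_array)

fig, ax = plt.subplots()
ax.plot(time_point_array, y_value_array)

# Placeholder label text; use the real quantities and units for your data.
ax.set_xlabel("time (arbitrary units)")
ax.set_ylabel("exp(time)")
ax.set_title("Exponential growth")
```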
Working with dates in pandas: a few examples
pandas has a lot of lifesaving features for dealing with dates. Here's an example timeseries data file, which happens to contain a missing date:
date,temp
11-1-2014,56
11-2-2014,56
11-3-2014,59
11-5-2014,60
11-6-2014,55
Loading and plotting the data in pandas gives this result:
In [0]: %matplotlib inline
In …
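One way to surface the missing date, sketched here as an assumption rather than the post's own code, is to reindex the series against a complete daily date range. The data is the sample file shown above, read from a string.

```python
import io
import pandas as pd

csv_text = """date,temp
11-1-2014,56
11-2-2014,56
11-3-2014,59
11-5-2014,60
11-6-2014,55
"""

df = pd.read_csv(io.StringIO(csv_text))
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")
series = df.set_index("date")["temp"]

# Build every calendar day between the first and last date, then reindex;
# days absent from the file show up as NaN.
full_range = pd.date_range(series.index.min(), series.index.max(), freq="D")
daily = series.reindex(full_range)
missing = daily[daily.isna()].index

print(missing)
```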
Filter common words from documents
Often when working with text documents it is useful to filter out words that occur frequently in all documents (e.g. 'the', 'is', ...). These words, called stop words, don't give any special hint about the document's content. The nltk (Natural Language Toolkit) library for python includes a list of stop …
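The filtering step itself can be sketched in a few lines. To keep the example self-contained, a small hand-written set stands in for a real stop word list; with nltk installed and its stop word data downloaded, that list could be used instead.

```python
# Small hand-written stand-in for a real stop word list (an assumption for this sketch).
stop_words = {"the", "is", "a", "of", "in"}

document = "The cat is in the garden"

# Lowercase and split on whitespace, then keep only words not in the stop word set.
filtered = [word for word in document.lower().split() if word not in stop_words]

print(filtered)  # prints ['cat', 'garden']
```

Using a set rather than a list makes each membership check fast, which matters when filtering many documents.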
Get a list of all English words in python
The nltk library for python contains a lot of useful data in addition to its functions. One convenient data set is a list of all English words, accessible like so:
from nltk.corpus import words
word_list = words.words()
# prints 236736
print(len(word_list))
You will probably first have to download …