Introduction
I recently started collecting data from the BART API, specifically estimated time to departure for trains at the two stations I use most frequently. In this notebook I'll show how I parsed the data from a CSV file, reshaped it to fit the questions at hand, and made a few plots. Download notebook
When joins go wrong, check data types
Writing and debugging joins can be especially difficult when dealing with data from text files. In some cases there is no resulting data, or (much harder to notice!) a few lines that should be included are dropped. Here I'll go into an example of a failed join in pandas, and …
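As a rough illustration of the check the title suggests, here is a minimal sketch using two made-up tables. The column names and values are assumptions for illustration, not data from the post. It compares the key dtypes before joining and casts one side so the keys agree.

```python
import pandas as pd

# Hypothetical tables: the same ids, but one side was parsed as text.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": ["1", "2", "4"], "score": [10, 20, 40]})

# Inspect the key dtypes first; a mismatch here is a red flag.
print(left["id"].dtype, right["id"].dtype)

# Cast the text key to integers so both sides agree, then join.
right_fixed = right.assign(id=right["id"].astype("int64"))
merged = left.merge(right_fixed, on="id", how="inner")
print(merged)
```

Casting before the merge keeps the comparison explicit, rather than relying on how a given pandas version handles mismatched key types.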
Installing python for data science
Installing all the Python libraries required for data science can be a challenge, especially on a Windows machine. Unfortunately the same thing that makes the libraries fast also makes them difficult to distribute to different system types. Luckily there are a few free options for getting up and running painlessly. I …
Using topic modeling to find related blog posts
Over the weekend I got curious about how different posts in this blog were similar to each other, and thought about putting links to similar posts at the end of each article. I used the gensim python library (topic modeling for humans) to find similar articles and I wrote a …
Label graph axes!
Download IPython notebook for this post.
It's easy enough to make a plot using matplotlib.
import matplotlib.pyplot as plt
import numpy as np
time_point_array = np.arange(0, 5, .1)
y_value_array = np.exp(time_point_array)
plt.plot(time_point_array, y_value_array)
This plot is not great data science. In fact it's poor data …
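One way to do what the title asks, sketched with the same arrays as the snippet above. The label text, units, and title are placeholders chosen for illustration, not taken from the post.

```python
import matplotlib.pyplot as plt
import numpy as np

time_point_array = np.arange(0, 5, .1)
y_value_array = np.exp(time_point_array)

fig, ax = plt.subplots()
ax.plot(time_point_array, y_value_array)

# Placeholder labels: say what each axis measures and in which units.
ax.set_xlabel("time (s)")
ax.set_ylabel("exp(time)")
ax.set_title("Exponential growth over time")
```

In a notebook with inline plotting enabled, the figure renders with both axes named, so a reader does not have to guess what is being plotted.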
Working with dates in pandas: a few examples
pandas has a lot of lifesaving features for dealing with dates. Here's an example time series data file, which happens to contain a missing date:
date,temp
11-1-2014,56
11-2-2014,56
11-3-2014,59
11-5-2014,60
11-6-2014,55
Loading and plotting the data in pandas gives this result:
In [0]: %matplotlib inline
In …
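A minimal sketch of how the gap in the sample file might be exposed, assuming the dates are in month-day-year order (the excerpt does not say so explicitly). The data is inlined as a string here so the example is self-contained; the variable names are illustrative.

```python
import io
import pandas as pd

csv_text = """date,temp
11-1-2014,56
11-2-2014,56
11-3-2014,59
11-5-2014,60
11-6-2014,55
"""

# Parse the date column explicitly (assumed month-day-year) and index by it.
df = pd.read_csv(io.StringIO(csv_text))
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")
df = df.set_index("date")

# Reindex onto a complete daily range; days absent from the file become NaN.
full_range = pd.date_range(df.index.min(), df.index.max(), freq="D")
daily = df.reindex(full_range)
missing = daily[daily["temp"].isna()].index
print(missing)
```

Reindexing against a full `date_range` makes the missing day an explicit row, which is easier to spot than a slightly stretched segment in a line plot.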
How to quickly test if an element belongs to a group
A common need in data science is to test if some group of data contains a given value. One specific example would be to test if a word is a stop word.
The slow way
If the elements of the group exist in a list named group in python …
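The excerpt cuts off before the post's faster approach, so the following is only a sketch of one common alternative: keeping the same elements in a `set`. The group contents and the names `group_set` and `target` are made up for illustration.

```python
import timeit

# Hypothetical group of 100,000 integers, stored as a list and as a set.
group = list(range(100_000))
group_set = set(group)

target = 99_999  # worst case for the list: the last element

# `in` scans a list element by element, but uses a hash lookup on a set.
list_time = timeit.timeit(lambda: target in group, number=200)
set_time = timeit.timeit(lambda: target in group_set, number=200)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

Exact timings depend on the machine, but the list lookup grows with the size of the group while the set lookup stays roughly constant.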
Filter common words from documents
Often when working with text documents it is useful to filter out words that occur frequently in all documents (e.g. 'the', 'is', ...). These words, called stop words, don't give any special hint about the document's content. The nltk (Natural Language Toolkit) library for python includes a list of stop …
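To make the idea concrete without requiring an nltk download, here is a small sketch that uses a hand-picked stop list. The list and the helper `remove_stop_words` are hypothetical and far shorter than what nltk provides.

```python
# A tiny hand-picked stop list for illustration only.
stop_words = {"the", "is", "a", "an", "of", "and", "to", "in"}


def remove_stop_words(text):
    """Return the lowercased words of `text` that are not stop words."""
    return [word for word in text.lower().split() if word not in stop_words]


print(remove_stop_words("The cat is in the garden"))
# prints ['cat', 'garden']
```

Storing the stop words in a set keeps each membership check fast even when the list is long.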
Get a list of all English words in python
The nltk library for python contains a lot of useful data in addition to its functions. One convenient data set is a list of all English words, accessible like so:
from nltk.corpus import words
word_list = words.words()
# prints 236736
print(len(word_list))
You will probably first have to download …