I gave a lightning talk at the SF Python meeting tonight about the recommender system I wrote to generate the "Similar Posts" links on this site. The slides are up here.
In Part I of this post I made a topic model of the speech of Shakespeare characters from eight plays. Here in Part II I'll analyze the results of the model. Download notebook.
In this post I extract all the words spoken by each character in eight of Shakespeare's plays. Then I construct a topic model to see which characters are generally speaking about similar things. In Part II I look into the information revealed by the topic model. Download notebook.
To extend my post about plotting and reshaping data from the BART API, I worked a bit with the matplotlib annotation interface to add text and arrows to a plot. The meat of this post is in cell #4 below. Download notebook.
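As a rough illustration of what adding text and an arrow to a plot looks like, here is a minimal sketch using `Axes.annotate`. The sine curve, the "peak" label, and the label offsets are made up for this example and are not taken from the BART notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up data for illustration only.
x = np.arange(0, 5, 0.1)
y = np.sin(x)

fig, ax = plt.subplots()
ax.plot(x, y)

# Index of the largest y value; this is the point the arrow will target.
peak = int(np.argmax(y))

# xy is the point being annotated, xytext is where the label sits,
# and arrowprops draws an arrow from the label to the point.
note = ax.annotate(
    "peak",
    xy=(x[peak], y[peak]),
    xytext=(x[peak] + 1.0, y[peak] - 0.3),
    arrowprops=dict(arrowstyle="->"),
)
```

In a notebook you would drop the `matplotlib.use("Agg")` line and let the figure render inline.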
I recently started collecting data from the BART API, specifically the estimated time to departure for trains at the two stations I use most frequently. In this notebook I'll show how I parsed the data from a CSV file, reshaped it to fit the questions at hand, and made a few plots. Download notebook.
Spark 1.2.0 was released yesterday (release notes). I'm curious to see how the new machine learning APIs in spark.ml evolve.
Writing and debugging joins can be especially difficult when dealing with data from text files. In some cases there is no resulting data, or (much harder to notice!) a few lines that should be included are dropped. Here I'll go into an example of a failed join in pandas, and …
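One way to catch silently dropped rows is an outer merge with `indicator=True`, sketched below. The two tables and the trailing-space key are hypothetical; the full post may diagnose a different cause.

```python
import pandas as pd

# Hypothetical tables; the trailing space in "bob " mimics a text-file artifact.
users = pd.DataFrame({"name": ["alice", "bob ", "carol"], "age": [31, 45, 27]})
cities = pd.DataFrame(
    {"name": ["alice", "bob", "carol"], "city": ["SF", "Oakland", "Berkeley"]}
)

# An inner join silently drops the row whose key does not match exactly.
inner = users.merge(cities, on="name", how="inner")

# An outer join with indicator=True adds a _merge column saying whether each
# key was found in the left table only, the right table only, or both.
audit = users.merge(cities, on="name", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

# Stripping whitespace from the key column repairs the join in this case.
users["name"] = users["name"].str.strip()
fixed = users.merge(cities, on="name", how="inner")
```

Inspecting `unmatched` shows both sides of the failed match ("bob " from the left, "bob" from the right), which usually makes the cause easy to spot.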
Installing all the Python libraries required for data science can be a challenge, especially on a Windows machine. Unfortunately the same thing that makes the libraries fast also makes them difficult to distribute to different system types. Luckily there are a few free options for getting up and running painlessly. I …
Over the weekend I got curious about how different posts in this blog were similar to each other, and thought about putting links to similar posts at the end of each article. I used the gensim python library (topic modeling for humans) to find similar articles and I wrote a …
Scikit-learn has a nice flowchart of when to use different machine learning algorithms. View the whole chart here.
Data science, I'm sorry to say, often involves cleaning up input data into a usable and uniform format. Command line tools like grep, awk and sed provide an arcane power to manipulate text in files of arbitrary size. Mastering these tools can separate data science novices from data scientists with …
Download IPython notebook for this post.
It's easy enough to make a plot using matplotlib.
import matplotlib.pyplot as plt
import numpy as np
time_point_array = np.arange(0, 5, .1)
y_value_array = np.exp(time_point_array)
plt.plot(time_point_array, y_value_array)
This plot is not great data science. In fact it's poor data …
pandas has a lot of lifesaving features for dealing with dates. Here's an example time series data file, which happens to contain a missing date:
date,temp
11-1-2014,56
11-2-2014,56
11-3-2014,59
11-5-2014,60
11-6-2014,55
Loading and plotting the data in pandas gives this result:
In [0]: %matplotlib inline
In …
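The excerpt cuts off here. As a hedged sketch of one date feature that helps with this kind of file, the code below reindexes the sample data onto a complete daily range so the gap shows up as `NaN`. The variable names are illustrative and this is not necessarily the approach the full post takes.

```python
import io
import pandas as pd

# The sample file from above, inlined so the sketch is self-contained.
csv_text = """date,temp
11-1-2014,56
11-2-2014,56
11-3-2014,59
11-5-2014,60
11-6-2014,55
"""

df = pd.read_csv(io.StringIO(csv_text))

# Parse month-day-year strings into real timestamps.
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")
series = df.set_index("date")["temp"]

# Build every calendar day between the first and last observation,
# then reindex so days absent from the file appear as NaN.
full_range = pd.date_range(series.index.min(), series.index.max(), freq="D")
with_gaps = series.reindex(full_range)

# The days that had no row in the file.
missing = with_gaps[with_gaps.isna()].index
```

Plotting `with_gaps` instead of `series` leaves a visible break at the missing day rather than drawing a line straight across it.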
A common need in data science is to test whether some group of data contains a given value. One specific example would be to test if a word is a stop word.
If the elements of the group exist in a list named group in Python …
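The excerpt is truncated, so the following is only a sketch of one common approach and may not match the full post: test membership with `in`, and convert the list to a set when many lookups are needed. The words in `group` are made up for illustration.

```python
# Hypothetical group of values; the name `group` follows the excerpt above.
group = ["the", "is", "at", "which", "on"]

# Membership in a list scans the elements one by one.
in_list = "which" in group

# Converting once to a set gives hash-based lookups,
# which take roughly constant time on average.
group_set = set(group)
in_set = "which" in group_set
not_in_set = "python" in group_set
```

For a handful of lookups the list is fine; the set pays off when the same group is checked many times, as when filtering every word in a document.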
Often when working with text documents it is useful to filter out words that occur frequently in all documents (e.g. 'the', 'is', ...). These words, called stop words, don't give any special hint about the document's content. The nltk (Natural Language Toolkit) library for python includes a list of stop …
The nltk library for Python contains a lot of useful data in addition to its functions. One convenient data set is a list of all English words, accessible like so:
from nltk.corpus import words
word_list = words.words()
# prints 236736
print(len(word_list))
You will probably first have to download …
A friend of mine recently asked me to share some of my experiences in making the transition from a biophysics Ph.D. student to data scientist. I realized there are probably a lot of people interested in making a similar transition who could benefit from my experience.
A year …
I put together a list of data science and Python videos that I've found to be especially useful and entertaining. The list is here, also linked from the nav bar up top.
I put together a list of data science books I recommend, including a few for preparing for data science interviews. The list is available here.