SF Python meetup talk

Published: Wed 11 February 2015
By Frank Cleary

In News.

tags: python data

I gave a lightning talk at the SF Python meeting tonight about the recommender system I wrote to generate the "Similar Posts" links on this site. The slides are up here.
...read more
There are comments.
Analysis of Shakespeare character speech topics

Published: Wed 31 December 2014
By Frank Cleary

In Tutorials.

tags: gensim data python matplotlib topic models

In Part I of this post I made a topic model of the speech of Shakespeare characters from eight plays. Here in Part II I'll analyze the results of the model. Download notebook.

...read more
There are comments.
Topic modeling of Shakespeare characters

Published: Tue 30 December 2014
By Frank Cleary

In Tutorials.

tags: gensim data python topic models

In this post I extract all the words spoken by each character in eight of Shakespeare's plays. Then I construct a topic model to see which characters are generally speaking about similar things. In Part II I look into the information revealed by the topic model. Download notebook.

...read more
There are comments.
Annotating matplotlib plots

Published: Mon 29 December 2014
By Frank Cleary

In Tips.

tags: pandas matplotlib data python

To build on my post about plotting and reshaping data from the BART API, I worked a bit with the matplotlib annotation interface to add text and arrows to a plot. The meat of this post is in cell #4 below. Download notebook.
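
As a rough illustration of adding text and an arrow with matplotlib, here is a minimal sketch. It does not reproduce the notebook's cell #4; the data, the label text, and the text position are all hypothetical, chosen only to show the shape of an `annotate` call.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data standing in for the BART time series
x = np.arange(0, 5, 0.1)
y = np.exp(x)

fig, ax = plt.subplots()
ax.plot(x, y)

# Point the arrow at the last data point and place the label elsewhere
target_x, target_y = x[-1], y[-1]
annotation = ax.annotate(
    "last point",                      # hypothetical label text
    xy=(target_x, target_y),           # where the arrow head lands
    xytext=(2.0, 100.0),               # where the text sits (data coordinates)
    arrowprops=dict(arrowstyle="->"),  # draw a simple arrow from text to point
)
```

Here `xy` is the point being annotated and `xytext` is where the label is drawn; both default to data coordinates.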

...read more
There are comments.
Cleaning, reshaping, and plotting BART time series data with pandas

Published: Sat 20 December 2014
By Frank Cleary

In Tutorials.

tags: pandas matplotlib data python

Introduction

I recently started collecting data from the BART API, specifically estimated time to departure for trains at the two stations I use most frequently. In this notebook I'll show how I parsed the data from a csv file, reshaped it to fit the questions at hand, and made a few plots. Download notebook.
...read more
There are comments.
Spark 1.2.0 released

Published: Fri 19 December 2014
By Frank Cleary

In News.

tags: spark

Spark 1.2.0 was released yesterday (release notes). I'm curious to see how the new machine learning APIs in spark.ml evolve.
...read more
There are comments.
When joins go wrong, check data types

Published: Thu 27 November 2014
By Frank Cleary

In Tips.

tags: pandas python data cleaning data

Writing and debugging joins can be especially difficult when dealing with data from text files. In some cases there is no resulting data, or (much harder to notice!) a few lines that should be included are dropped. Here I'll go into an example of a failed join in pandas, and …
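
One way to check for a type mismatch before joining is sketched below. This is not the example from the post itself; the two frames, their column names, and the cast to `int64` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical tables: one key was read as integers, the other as text
left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right_raw = pd.DataFrame({"id": ["1", "2", "3"], "score": [10, 20, 30]})

# Compare the key dtypes before merging; a mismatch is a warning sign
key_types_match = left["id"].dtype == right_raw["id"].dtype

# Cast the text key to integers so both sides agree, then join
right = right_raw.assign(id=right_raw["id"].astype("int64"))
merged = left.merge(right, on="id", how="inner")
```

Printing `left.dtypes` and `right_raw.dtypes` side by side is usually the quickest way to spot this kind of problem.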
...read more
There are comments.
Installing python for data science

Published: Sun 23 November 2014
By Frank Cleary

In Tips.

tags: python data

Installing all the python libraries required for data science can be a challenge, especially on a Windows machine. Unfortunately the same thing that makes the libraries fast also makes them difficult to distribute to different system types. Luckily there are a few free options for getting up and running painlessly. I …
...read more
There are comments.
Using topic modeling to find related blog posts

Published: Thu 20 November 2014
By Frank Cleary

In Tutorials.

tags: data python gensim nltk pelican machine learning

Over the weekend I got curious about how different posts in this blog were similar to each other, and thought about putting links to similar posts at the end of each article. I used the gensim python library (topic modeling for humans) to find similar articles and I wrote a …
...read more
There are comments.
Scikit-learn machine learning algorithm flowchart

Published: Mon 17 November 2014
By Frank Cleary

In News.

tags: data scikit-learn machine learning

Scikit-learn has a nice flowchart of when to use different machine learning algorithms. View the whole chart here.

...read more
There are comments.
Using sed to make specific text lowercase in place

Published: Thu 13 November 2014
By Frank Cleary

In Tutorials.

tags: data command line sed cleaning data

Data science, I'm sorry to say, often involves cleaning up input data into a usable and uniform format. Command line tools like grep, awk and sed provide an arcane power to manipulate text in files of arbitrary size. Mastering these tools can separate data science novices from data scientists with …
...read more
There are comments.
Label graph axes!
Published: Sat 08 November 2014
By Frank Cleary

In Tips.

tags: data python matplotlib code

Download IPython notebook for this post.

It's easy enough to make a plot using matplotlib.
```
import matplotlib.pyplot as plt
import numpy as np

time_point_array = np.arange(0, 5, .1)
y_value_array = np.exp(time_point_array)

plt.plot(time_point_array, y_value_array)
```
This plot is not great data science. In fact it's poor data …
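
A labeled version of the plot above might look like the following sketch. The label and title strings are placeholders, since the excerpt does not say what the axes represent.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

time_point_array = np.arange(0, 5, .1)
y_value_array = np.exp(time_point_array)

fig, ax = plt.subplots()
ax.plot(time_point_array, y_value_array)

# Placeholder labels: replace with the real quantities and units
ax.set_xlabel("Time (placeholder units)")
ax.set_ylabel("exp(time)")
ax.set_title("Exponential growth (placeholder title)")
```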
...read more
There are comments.
Working with dates in pandas: a few examples
Published: Fri 07 November 2014
By Frank Cleary

In Tips.

tags: python pandas data code

pandas has a lot of lifesaving features for dealing with dates. Here's an example timeseries data file, which happens to contain a missing date:
```
date,temp
11-1-2014,56
11-2-2014,56
11-3-2014,59
11-5-2014,60
11-6-2014,55
```
Loading and plotting the data in pandas gives this result:
```
In [0]: %matplotlib inline

In …
```
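
One way to surface the missing date is sketched here. It is not the post's own code, which is cut off above; reading from an in-memory string and reindexing against a daily range are assumptions made so the example is self-contained.

```python
import io
import pandas as pd

# Same rows as the example file, held in a string instead of on disk
csv_text = """date,temp
11-1-2014,56
11-2-2014,56
11-3-2014,59
11-5-2014,60
11-6-2014,55
"""

df = pd.read_csv(io.StringIO(csv_text))
# Parse month-day-year strings into real timestamps and index by them
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")
df = df.set_index("date")

# Build the full daily range and find which days have no row
full_range = pd.date_range(df.index.min(), df.index.max(), freq="D")
missing_dates = full_range.difference(df.index)

# Reindexing inserts the missing day with NaN for temp
reindexed = df.reindex(full_range)
```

After reindexing, a line plot shows a gap at the missing day instead of silently connecting the neighboring points.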
...read more
There are comments.
How to quickly test if an element belongs to a group

Published: Wed 05 November 2014
By Frank Cleary

In Tutorials.

tags: code data python

A common need in data science is to test if some group of data contains a given value. One specific example would be to test if a word is a stop word.

The slow way

If the elements of the group exist in a list named group in python …
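
The list-based test can be sketched as follows. The set-based alternative shown next to it is an assumption added for comparison, not something quoted from the post, and the words in `group` are made up.

```python
# Hypothetical group of stop words stored in a list
group = ["the", "is", "at", "which", "on"]

def in_list(word):
    # A list is scanned element by element, so cost grows with its length
    return word in group

# Assumed alternative: a set uses hashing, so lookups take roughly constant time
group_set = set(group)

def in_set(word):
    return word in group_set
```

Both functions return the same answers; the difference only shows up in speed as the group gets large.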
...read more
There are comments.
Filter common words from documents

Published: Tue 04 November 2014
By Frank Cleary

In Tips.

tags: python nltk data cleaning data

Often when working with text documents it is useful to filter out words that occur frequently in all documents (e.g. 'the', 'is', ...). These words, called stop words, don't give any special hint about the document's content. The nltk (Natural Language Toolkit) library for python includes a list of stop …
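
A minimal sketch of stop word filtering follows. To keep it runnable without downloading any corpus, it uses a small hand-written set as a stand-in; with nltk installed and the corpus downloaded, `set(stopwords.words("english"))` from `nltk.corpus` could be used in its place.

```python
# Hypothetical stand-in for nltk's English stop word list
stop_words = {"the", "is", "a", "of", "and"}

document = "The cat is on the mat"

# Lowercase and split on whitespace; real text would need a proper tokenizer
tokens = document.lower().split()

# Keep only the words that are not stop words
filtered = [word for word in tokens if word not in stop_words]
```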
...read more
There are comments.
Get a list of all English words in python
Published: Mon 03 November 2014
By Frank Cleary

In Tips.

tags: data python nltk

The nltk library for python contains a lot of useful data in addition to its functions. One convenient data set is a list of all English words, accessible like so:
```
from nltk.corpus import words
word_list = words.words()
# prints 236736
print(len(word_list))
```
You will probably first have to download …
...read more
There are comments.
How to Transition from Ph.D. Student to Data Scientist

Published: Sat 01 November 2014
By Frank Cleary

In Tutorials.

tags: data career

A friend of mine recently asked me to share some of my experiences in making the transition from a biophysics Ph.D. student to data scientist. I realized there are probably a lot of people interested in making a similar transition who could benefit from my experience.

Goals

A year …
...read more
There are comments.
Recommended Videos

Published: Sat 01 November 2014
By Frank Cleary

In News.

tags: data code

I put together a list of data science and python videos that I've found to be especially useful and entertaining. The list is here, also linked from the nav bar up top.
...read more
There are comments.
Recommended Books

Published: Sat 25 October 2014
By Frank Cleary

In News.

tags: data code

I put together a list of data science books I recommend, including a few for preparing for data science interviews. The list is available here.
...read more
There are comments.
