The Timedelta object is relatively new to pandas. Recently I worked with Timedeltas but found it wasn't obvious how to do what I wanted. Worse, some operations were seemingly obvious but could easily return the wrong answer (update: this issue was fixed in pandas version 0.17.0). Here I go through a few Timedelta examples to provide a companion reference to the official documentation.
The Data
I got data on Old Faithful eruptions from geyserstudy.org to serve as interesting Timedeltas (credit to Yellowstone National Park and Ralph Taylor). See the shell commands below for how I downloaded and cleaned the files into old_faithful_data.csv.
wget http://www.geyserstudy.org/geysers/OLDFAITHFUL/eruptions/Old%20Faithful%20eruptions%20for%20200{0..9}.TXT
wget http://www.geyserstudy.org/geysers/OLDFAITHFUL/eruptions/Old%20Faithful%20eruptions%20for%2020{10..11}.TXT
# remove file headers by finding lines that begin with two numbers and a slash
grep -h '^\d\{2\}/' Old\ Faithful\ eruptions\ for\ 20* > old_faithful_data.csv
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')
The data has two columns: the date and time of the observed eruption and the elapsed time since the last eruption.
old_faithful_data = pd.read_csv('data/old_faithful_data.csv', header=None,
names=['date', 'elapsed_time'])
old_faithful_data.head(3)
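As an aside, the date parsing done in the next section could also be folded into the read_csv call itself. This is only a sketch assuming the same two-column layout; the explicit-format parsing below is more defensive about the '%m/%d/%y %H:%M:%S' layout than this inferred parsing.
# Optional alternative: let read_csv parse the date column while reading.
parsed = pd.read_csv('data/old_faithful_data.csv', header=None,
                     names=['date', 'elapsed_time'],
                     parse_dates=['date'])
print(parsed.dtypes)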
Summary statistics
Here I calculate the time between eruptions directly from the observed eruption times rather than from the elapsed_time column.
eruption_times = pd.to_datetime(old_faithful_data['date'].str.strip(),
format='%m/%d/%y %H:%M:%S')
eruption_deltas = eruption_times.diff()
eruption_deltas.describe()
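Note that diff() leaves NaT in the first position, since there is nothing to subtract the first timestamp from. A tiny illustration with made-up timestamps (not the eruption data):
# diff() on a datetime series yields Timedeltas; the first element is NaT.
toy_times = pd.Series(pd.to_datetime(['2011-01-01 00:00:00',
                                      '2011-01-01 01:33:00',
                                      '2011-01-01 03:05:00']))
print(toy_times.diff())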
Filtering Timedeltas
It looks like there might be some bad data: the longest interval is over 79 days. For the rest of the analysis I'll filter out everything longer than 3 hours. Although some of these long intervals may be genuine, they clearly don't represent the usual behavior. It's always important to look at more than just the mean of a dataset like this, since a few outliers can have a large effect on its value.
eruption_deltas = eruption_deltas[eruption_deltas < pd.Timedelta(hours=3)].dropna()
eruption_deltas.describe()
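For reference, the three-hour cutoff can be spelled in several equivalent ways; a quick sketch (check these against your pandas version):
# Equivalent ways to write the 3-hour threshold used above.
cutoff = pd.Timedelta(hours=3)
print(cutoff == pd.Timedelta('3 hours'))    # string form
print(cutoff == pd.Timedelta(3, unit='h'))  # value plus unit
print(cutoff == np.timedelta64(3, 'h'))     # numpy scalars compare too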
Histograms of Timedeltas
The summary table above is nice, but I'd like to visualize the data in a histogram. Unfortunately a simple eruption_deltas.hist() call produces a TypeError. Luckily there is an easy way to get a histogram: convert the type of the series (what the documentation calls frequency conversion) and plot the result. Below I show two different conversion methods:
- Using .astype() and supplying a string argument of the form 'timedelta64[unit]', where unit can be 's' for seconds, 'h' for hours, 'D' for days, etc. Any decimal part of the result will be discarded (floor division).
- Dividing by a Timedelta. Any Timedelta value will work, so if you want to find out how many times you could have listened to Piano Man, just run my_timedelta_series / pd.Timedelta(minutes=5, seconds=38).
print "Data as Timedelta:"
print eruption_deltas.head(3)
print ""
print "Data converted to the floor of the total hours (astype()):"
print (eruption_deltas.astype('timedelta64[h]')).head(3)
print ""
print "Data converted to total hours (/):"
print (eruption_deltas / pd.Timedelta(hours=1)).head(3)
print ""
print "Number of times you could listen to Piano Man between eruptions:"
print (eruption_deltas / pd.Timedelta(minutes=5, seconds=39)).head(3)
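Going the other way, from plain numbers back to Timedeltas, pd.to_timedelta does the job. A quick sketch, not part of the original walkthrough:
# Round trip: Timedeltas -> float hours -> Timedeltas again.
hours_between = eruption_deltas / pd.Timedelta(hours=1)
print(pd.to_timedelta(hours_between, unit='h').head(3))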
(eruption_deltas / pd.Timedelta(minutes=1)).hist(bins=xrange(50, 120, 1))
plt.xlabel('Time between eruptions (min)')
plt.ylabel('# of eruptions');
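There is also a third conversion route via the .dt accessor, which returns floats so nothing gets floored away. A sketch, assuming a pandas version that provides .dt.total_seconds() on timedelta Series:
# .dt.total_seconds() gives float seconds; divide by 60 for minutes.
minutes_between = eruption_deltas.dt.total_seconds() / 60
print(minutes_between.describe())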
Beware of overflow! (update: fixed in 0.17.0)
The overflow issue discussed below was fixed in pandas version 0.17.0.
Timedeltas are stored under the hood with nanosecond precision as 64-bit integers. It turns out it is relatively easy to run into overflow issues, possibly without noticing (see more discussion here).
For example, let's look at the average time between days for the first 2,015 years of the common era (ignoring corrections for leap years):
common_era = pd.Series([pd.Timedelta(days=1)] * 2015 * 365)
common_era.mean()
The average time between days comes out to about 3 hours and 6 minutes. My, how time flies!
Here's what's happening... One day is 8.64E13 nanoseconds:
day_delta = pd.Timedelta(days=1)
print "{:,}".format(day_delta.value)
A signed 64-bit integer can hold only 106,751 of these day-long Timedeltas:
(2**63 - 1) / day_delta.value
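As a cross-check (assuming your pandas exposes the Timedelta.max class attribute), the largest representable Timedelta sits just under 106,752 days:
# Timedelta.max is the int64-nanosecond ceiling, matching the division above.
print(pd.Timedelta.max)
print(pd.Timedelta.max / pd.Timedelta(days=1))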
Calculation of the mean works fine for 106,751 days, but overflow occurs if we add one more:
all_good_num_days = 106751
overflow_num_days = all_good_num_days + 1
all_good = pd.Series([pd.Timedelta(days=1)] * all_good_num_days)
print 'Mean time between the last {:,} days: {}'.format(all_good_num_days, all_good.mean())
overflow = pd.Series([pd.Timedelta(days=1)] * overflow_num_days)
print 'Mean time between the last {:,} days: {}'.format(overflow_num_days, overflow.mean())
This can be overcome by converting the data to a lower precision, performing the operation you need, and then creating a Timedelta from the result:
data_as_microseconds = overflow / pd.Timedelta(microseconds=1)
pd.Timedelta(data_as_microseconds.mean(), unit='us')
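One way to package that workaround is a small helper; the name safe_timedelta_mean is mine, not a pandas function:
def safe_timedelta_mean(td_series):
    # Average at microsecond precision to dodge the int64 nanosecond
    # overflow, then convert the float result back to a Timedelta.
    as_us = td_series / pd.Timedelta(microseconds=1)
    return pd.Timedelta(as_us.mean(), unit='us')

print(safe_timedelta_mean(overflow))  # roughly one day, as expected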