Dates and times are an unlimited source of hassles for anyone who works with them. In this post I'll discuss a performance pitfall I encountered while parsing dates in pandas. The conclusion: create DatetimeIndices by parsing your data with to_datetime(my_dates, format='my_format').
import numpy as np
import pandas as pd
import random
import time
from datetime import datetime
To get some data to test on, I create a list of dictionaries with strings representing a datetime and an observed value.
# Generate some fake timestamped data
num_rows = int(1e6)
np.random.seed(42)
data = np.random.randn(num_rows)
records = []
for i in range(num_rows):
    # Pick a random second between the Unix epoch and now
    timestamp = np.random.randint(0, int(time.time()))
    date_string = datetime.fromtimestamp(timestamp).strftime('%m/%d/%y %H:%M')
    records.append({'timestamp': date_string, 'value': data[i]})
print(records[0])
Now I'll load the records into a DataFrame, indexed by the timestamp strings.
df = pd.DataFrame(records).set_index('timestamp')
df.head(3)
Performance
I want to create a datetime index so I can calculate the sum of each value by day. Here are two different methods with very different performance.
Method 1: Use the DatetimeIndex constructor on the strings (very slow!):
%%timeit
df_dt_constructor = df.copy()
df_dt_constructor.index = pd.DatetimeIndex(df_dt_constructor.index)
Method 2: Parse the datetimes using to_datetime and assign the index (much faster!):
%%timeit
df_parse_with_format = df.copy()
df_parse_with_format.index = pd.to_datetime(df_parse_with_format.index,
format='%m/%d/%y %H:%M')
Parsing the dates with to_datetime and an explicit format argument makes the operation run about 30x faster. If no format argument is supplied, to_datetime is still faster than calling the DatetimeIndex constructor directly, but only by about 2x. That slower mode can still be useful, though, because without a fixed format to_datetime can handle data with inconsistent date formats.
Method 3: Parse without format (flexible but slow):
%%timeit
df_parse_no_format = df.copy()
df_parse_no_format.index = pd.to_datetime(df_parse_no_format.index)
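As a quick illustration of that flexibility, here is a small hypothetical example (not from the benchmark data above) of to_datetime coping with a column that mixes two date styles, which a single format string could not parse:

mixed = pd.Series(['01/02/15 10:00', '2015-01-02 10:00'])
# Flexible parsing handles both styles, at the cost of speed.
# (Recent pandas releases may require format='mixed' for this.)
print(pd.to_datetime(mixed))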
Pandas makes it easy to aggregate the data into different time periods. This code computes the sum of the values for each day.
df_parse_first = df.copy()
df_parse_first.index = pd.to_datetime(df_parse_first.index,
format='%m/%d/%y %H:%M')
daily_sum = df_parse_first.resample('D').sum()
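Daily buckets are only one option; as a rough sketch, the same pattern aggregates to other periods using standard pandas offset aliases:

# Weekly totals and monthly averages from the same index
weekly_sum = df_parse_first.resample('W').sum()
monthly_mean = df_parse_first.resample('M').mean()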
Although we collapsed many records into just a few, we can still learn something about the original data set. Because the underlying values were drawn from a standard normal distribution, the sum of the n values in a daily bucket has variance n, so the variance of the daily sums should equal the average number of values per bucket. Multiplying that variance by the number of days therefore estimates the total number of original observations.
variance = daily_sum['value'].var()
num_days = len(daily_sum)
est_num_original_records = variance * num_days
error_pct = 100 * (est_num_original_records - num_rows) / num_rows
print('Estimated there were {:,.2f} original records'.format(est_num_original_records))
print('Error: {0:.2f}%'.format(error_pct))
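To see why the estimator works, here is a minimal sanity check (with illustrative numbers of my own choosing, not from the data above): the sum of n independent N(0, 1) draws has variance n, so the sample variance of many such sums should recover n.

# Simulate many "daily buckets" of n standard normal values each
n = 500
bucket_sums = np.random.randn(10000, n).sum(axis=1)
# The sample variance of the sums should come out close to n
print('Variance of sums: {:.1f} (expected ~{})'.format(bucket_sums.var(), n))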