We'd like to have a central place to collect knowledge and opinions about data structures for statistics, finance, econometrics, and other related application domains. Since a number of projects addressing this problem domain have appeared recently, constructive discussion of the various design choices would benefit all parties.
Packages of interest
Comparison of packages
package name | pandas | la |
---|---|---|
web site | | |
license | New BSD | Simplified BSD |
programming languages | Python, Cython | Python, Cython |
required dependencies | Python, NumPy, dateutil | Python, NumPy |
optional dependencies | scikits.statsmodels, nose, SciPy | h5py, SciPy, nose |
build dependencies | C compiler | optional: C compiler |
year started (open source) | 2008 (2009) | 2008 (2010) |
data objects | 1d: Series; 2d: DataFrame, DataMatrix; 3d: WidePanel, LongPanel | nd: larry |
number of dimensions supported | 1d-3d | nd (n > 0) |
mutability | data mutable, expandable on one axis | data and label mutable; axis insertable |
data container | 1d: ndarray; 2d: dict of Series or 2d ndarray (DataMatrix); 3d: ndarray | ndarray |
data types | homogeneous; DataFrame/DataMatrix support heterogeneous | homogeneous: float, int, str, object |
index/label container | ndarray (Index class) | lists |
index/label types | heterogeneous, hashable | heterogeneous, hashable |
index/label constraints | unique along any one axis, hashable | unique along any one axis, hashable |
missing values | NaN (float) or None (object) | NaN (float); partial support: "" (str), None (object) |
binary operations on two data objects | union of indices | intersection of labels |
IO | pickle, partial CSV support | HDF5, partial CSV support |
pandas
pandas started in 2008 as a set of data structures for quantitative finance applications. Its original intent was to improve on ideas found in R's data.frame structure (hence the DataFrame class) and to make Python a better tool for building robust systems. For example, the R data.frame does not support automatic data alignment, which is often desirable. Development on pandas proceeded with the following goals:
- Make it easy to interactively explore data in memory, particularly in an IPython session. R, for example, is quite good at this, as are MATLAB and other libraries / research environments.
- The user should not need to become a Python expert to become an expert in the use of pandas (in other words, where possible it tries not to make your life difficult)
- No need to support higher than 3-dimensional data since such data sets are seldom found in statistical / financial applications. For this reason I avoided a generic n-dimensional approach (like larry, for example)
- Performance is extremely important. This led to significant amounts of Cython code to optimize core operations (more on this below).
At a glance, the two workhorse data objects of pandas are the Series and the DataFrame:
    from pandas import *

    s = Series([0, 1, 2], index=['a', 'b', 'c'])
    df = DataFrame({'col1' : s, 'col2' : s * 2})

    >>> s
    a    0
    b    1
    c    2

    >>> df
       col1  col2
    a  0     0
    b  1     2
    c  2     4

    >>> df['col3'] = 6

    >>> df
       col1  col2  col3
    a  0     0     6
    b  1     2     6
    c  2     4     6
Series behaves like both a dict and an ndarray. Binary operations (+, -, etc.) align on labels. For R users, a DataFrame is like a data.frame plus automatic data alignment.
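As a small illustration (a sketch with made-up values; the exact output formatting depends on the pandas version), label-based lookup works like a dict and arithmetic aligns on labels, filling unmatched labels with NaN:

    # illustrative sketch of Series behavior
    from pandas import Series

    s = Series([0, 1, 2], index=['a', 'b', 'c'])
    s['b']                                    # dict-like lookup by label -> 1

    s2 = Series([10, 20], index=['b', 'c'])
    s + s2                                    # aligned on labels: 'a' -> NaN, 'b' -> 11, 'c' -> 22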
larry
Here's larry in schematic form:
                 date1   date2   date3
        'AAPL'  209.19  207.87  210.11
    y = 'IBM'   129.03  130.39  130.55
        'DELL'   14.82   15.11   14.94
The larry above is stored internally as a Numpy array and a list of lists:
    y.label = [['AAPL', 'IBM', 'DELL'], [date1, date2, date3]]
    y.x = np.array([[209.19, 207.87, 210.11],
                    [129.03, 130.39, 130.55],
                    [ 14.82,  15.11,  14.94]])
A larry can have any number of dimensions except zero. Here, for example, is one way to create a one-dimensional larry:
    >>> import la
    >>> y = la.larry([1, 2, 3])
In the statement above the list is converted to a Numpy array and the labels default to range(n), where *n* in this case is 3.
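Continuing the example, the default labels can be inspected directly (output sketched here; there is one list of labels per axis):

    >>> y.label
    [[0, 1, 2]]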
For the most part larry acts like a Numpy array. And, whenever you want, you have direct access to the Numpy array that holds your data. For example, if you have a function, *myfunc*, that works on Numpy arrays and doesn't change the shape or ordering of the array, then you can use it on a larry, *y*, like this:
y.x = myfunc(y.x)
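As a concrete (illustrative) instance of the pattern, any element-wise NumPy function that preserves shape and ordering can be applied in place:

    # illustrative example of operating on the underlying array of a larry
    import numpy as np
    import la

    y = la.larry(np.array([[1.0, -2.0], [3.0, -4.0]]))   # labels default to range(n) on each axis
    y.x = np.abs(y.x)   # element-wise and shape-preserving, so the labels stay valid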
One of the design goals of the larry data object is to behave in a way that a Numpy array user would expect. Here's a brief comparison to Numpy arrays:
Numpy | la |
---|---|
arr = np.array([[1, 2], [3, 4]]) | lar = larry([[1, 2], [3, 4]]) |
np.nansum(arr) | lar.sum() |
arr.shape, arr.dtype, arr.ndim, arr.T | lar.shape, lar.dtype, lar.ndim, lar.T |
arr.astype(float) | lar.astype(float) |
arr1 + arr2 | lar1 + lar2 |
arr[:,0] | lar[:,0] |
nd --> (nd + 1) | lar.insertaxis(axis, label) |
Design and performance issues for data structures
For each item, we'd like to address the following points:
- What is a relevant example (with code snippets)?
- How are existing packages solving (or not solving) the problem?
- For performance-sensitive operations, how fast is each package (provide a code snippet)? We might also try to quantify the overhead of using a particular data structure over vanilla NumPy arrays (homogeneous or structured).
Creation of data objects
One of the most basic operations is the creation of a data object. Here are examples of how to create three different data objects: datarray.DataArray, pandas.DataMatrix, and la.larry.
First let's set up the data we will use to create the data objects:
    >>> import numpy as np
    >>> N = 100
    >>> arr = np.random.rand(N, N)
    >>> idx1 = map(str, range(N))
    >>> idx2 = map(str, range(N))
Create a datarray.DataArray:
    >>> import datarray
    >>> datarray.__version__
    '0.0.4'
    >>> timeit datarray.DataArray(arr, [('row', idx1), ('col', idx2)])
    10000 loops, best of 3: 153 us per loop
Create a pandas.DataMatrix:
    >>> import pandas
    >>> pandas.__version__
    '0.2.1.dev'
    >>> timeit pandas.DataMatrix(arr, idx1, idx2)
    10000 loops, best of 3: 50.1 us per loop
Create a la.larry:
    >>> import la
    >>> la.__version__
    '0.5.0dev'
    >>> timeit la.larry(arr, [idx1, idx2])
    100000 loops, best of 3: 14 us per loop
Most of the time used to create a larry is spent doing an integrity check of the labels. If you skip the integrity check, you get a 10x speed up:
    >>> timeit la.larry(arr, [idx1, idx2], integrity=False)
    1000000 loops, best of 3: 1.33 us per loop
Tabular data, mutability
Many statistical data sets arrive in tabular (2D) form: typically a collection of vectors (columns) with associated labels. R provides the mutable data.frame class:
    > df <- data.frame(a=rnorm(5), b=rnorm(5))
    > df
               a           b
    1 -0.4454339 -0.87899052
    2  0.1119211  0.08378914
    3  0.5823613  1.51314787
    4 -0.4235883  1.51682599
    5 -0.4429238 -1.86833929
    > df$a
    [1] -0.4454339  0.1119211  0.5823613 -0.4235883 -0.4429238
    > df$c <- 1
    > df
               a           b c
    1 -0.4454339 -0.87899052 1
    2  0.1119211  0.08378914 1
    3  0.5823613  1.51314787 1
    4 -0.4235883  1.51682599 1
    5 -0.4429238 -1.86833929 1
For users of other statistical environments, I'd be interested to hear about their tabular data structures and what they can do. Is such mutability useful? I've found it quite useful in cases like this. The R data.frame does not provide any sort of row-oriented data alignment (see below), but it's worth asking which features of such a 2D data structure are the most useful.
The pandas code for the above is nearly identical, with pythonic notation of course:
    In [22]: df = DataFrame({'a' : randn(5), 'b' : randn(5)})

    In [23]: df
    Out[23]:
            a        b
    0  -1.894   -1.373
    1  -0.8486  -0.5785
    2   1.838   -3.761
    3  -0.6144  -0.7812
    4   0.8473   0.9544

    In [24]: df['a']
    Out[24]:
    0    -1.89423416608
    1    -0.848596359606
    2     1.83799383312
    3    -0.614350513146
    4     0.84729188782

    In [25]: df['c'] = 1

    In [26]: df
    Out[26]:
            a        b        c
    0  -1.894   -1.373    1
    1  -0.8486  -0.5785   1
    2   1.838   -3.761    1
    3  -0.6144  -0.7812   1
    4   0.8473   0.9544   1
In the context of misaligned data, I've seen many pandas users treat the DataFrame purely as a container for aligning labeled data:
    In [74]: df
    Out[74]:
                          A         B
    2000-01-03 00:00:00  -0.2306    0.3456
    2000-01-04 00:00:00  -0.882    -0.1626
    2000-01-05 00:00:00   0.2474   -0.1887
    2000-01-06 00:00:00  -0.2704    0.1398
    2000-01-07 00:00:00   1.498     1.628
    2000-01-10 00:00:00  -1.645     0.09256
    2000-01-11 00:00:00   0.5666   -1.128
    2000-01-12 00:00:00  -0.5082   -0.5381
    2000-01-13 00:00:00   0.2205   -0.5728
    2000-01-14 00:00:00  -0.4916   -0.3333

    In [76]: s
    Out[76]:
    2000-01-03 00:00:00    -0.115323455775
    2000-01-04 00:00:00    -0.441017336819
    2000-01-05 00:00:00     0.123677568341
    2000-01-06 00:00:00    -0.135191054412
    2000-01-07 00:00:00     0.749180511001
    2000-01-10 00:00:00    -0.822250171143
    2000-01-11 00:00:00     0.283300117597
    2000-01-12 00:00:00    -0.254115521874

    In [77]: df['C'] = s

    In [78]: df
    Out[78]:
                          A         B         C
    2000-01-03 00:00:00  -0.2306    0.3456   -0.1153
    2000-01-04 00:00:00  -0.882    -0.1626   -0.441
    2000-01-05 00:00:00   0.2474   -0.1887    0.1237
    2000-01-06 00:00:00  -0.2704    0.1398   -0.1352
    2000-01-07 00:00:00   1.498     1.628     0.7492
    2000-01-10 00:00:00  -1.645     0.09256  -0.8223
    2000-01-11 00:00:00   0.5666   -1.128     0.2833
    2000-01-12 00:00:00  -0.5082   -0.5381   -0.2541
    2000-01-13 00:00:00   0.2205   -0.5728    NaN
    2000-01-14 00:00:00  -0.4916   -0.3333    NaN
To combine two larrys, use merge:
    >> lar1 = larry(randn(5,2))
    >> lar2 = larry(randn(2,1), [[0, 4], ['c']])
    >> lar1.merge(lar2)
    label_0
        0
        1
        2
        3
        4
    label_1
        0
        1
        c
    x
    array([[ 0.19271985,  0.38086157,  0.89416946],
           [ 0.17987281, -0.20143654,         NaN],
           [ 0.56567726,  2.13787551,         NaN],
           [-1.92165255,  0.22968882,         NaN],
           [ 1.14440607,  0.07242789, -0.56150006]])
If the two larrys overlap, you can use the update option:
    >> lar3 = larry(np.ones((2,1)), [[0, 4], [0]])
    >> lar1.merge(lar3, update=True)
    label_0
        0
        1
        2
        3
        4
    label_1
        0
        1
    x
    array([[ 1.        ,  0.38086157],
           [ 0.17987281, -0.20143654],
           [ 0.56567726,  2.13787551],
           [-1.92165255,  0.22968882],
           [ 1.        ,  0.07242789]])
Data alignment
- Union versus intersection of data labels (see the sketch after this list)
- Performance for differently-indexed data
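To make the difference concrete, here is a minimal sketch (illustrative values) of the two alignment policies noted in the comparison table: pandas takes the union of the indices and fills unmatched labels with NaN, while larry takes the intersection of the labels.

    # illustrative sketch of union vs. intersection alignment
    import numpy as np
    import pandas
    import la

    s1 = pandas.Series([1., 2.], index=['a', 'b'])
    s2 = pandas.Series([10., 20.], index=['b', 'c'])
    s1 + s2                                  # union of indices: 'a' -> NaN, 'b' -> 12.0, 'c' -> NaN

    lar1 = la.larry(np.array([1., 2.]), [['a', 'b']])
    lar2 = la.larry(np.array([10., 20.]), [['b', 'c']])
    lar1 + lar2                              # intersection of labels: only 'b' remains, with value 12.0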
Here is a pretty pathological but realistic example:
    # Setup
    import numpy as np
    import pandas
    import la

    N = 10000
    arr1 = np.random.randn(N)
    arr2 = np.random.randn(N)
    idx1 = range(N)
    idx2 = range(N)[::-1]

    # pandas
    series1 = pandas.Series(arr1, index=idx1)
    series2 = pandas.Series(arr2, index=idx2)

    # larry
    lar1 = la.larry(arr1, [idx1])
    lar2 = la.larry(arr2, [idx2])
Here are the timings with the latest svn/bzr trunks as of 5/30/10:
    # aligned
    In [5]: timeit series1 + series1
    10000 loops, best of 3: 36.4 us per loop

    In [6]: timeit lar1 + lar1
    10000 loops, best of 3: 72.5 us per loop

    # unaligned
    In [9]: timeit series1 + series2
    100 loops, best of 3: 2.72 ms per loop

    In [8]: timeit lar1 + lar2
    100 loops, best of 3: 4.97 ms per loop
A 2d example:
    # Setup
    import numpy as np
    import pandas
    import la

    N = 1000
    arr1 = np.random.randn(N, N)
    arr2 = np.random.randn(N, N)
    idx1 = range(N)
    idx2 = range(N)[::-1]

    # pandas
    dma1 = pandas.DataMatrix(arr1, idx1, idx1)
    dma2 = pandas.DataMatrix(arr2, idx2, idx2)

    # larry
    lar1 = la.larry(arr1, [idx1, idx1])
    lar2 = la.larry(arr2, [idx2, idx2])
Here are the timings with the latest svn/bzr trunks as of 5/30/10:
    # aligned
    >> timeit dma1 + dma1
    100 loops, best of 3: 16 ms per loop
    >> timeit lar1 + lar1
    100 loops, best of 3: 15.9 ms per loop

    # unaligned
    >> timeit dma1 + dma2
    10 loops, best of 3: 93.5 ms per loop
    >> timeit lar1 + lar2
    10 loops, best of 3: 91.7 ms per loop
Reduction operations
Find the maximum value in a 2d data object of shape (1000, 1000):
    import numpy as np
    import pandas
    import la

    N = 1000
    arr = np.random.rand(N, N)
    dma = pandas.DataMatrix(arr, range(N), range(N))
    lar = la.larry(arr)

    # max
    >> dma.max().max()
    0.99999943789619117
    >> lar.max()
    0.99999943789619117

    >> timeit dma.max().max()
    10 loops, best of 3: 18 ms per loop
    >> timeit lar.max()
    100 loops, best of 3: 14.1 ms per loop
    >> timeit np.nanmax(arr)
    100 loops, best of 3: 14.1 ms per loop
IO
Users of data objects often need the ability to quickly archive their data. Here is the time taken to save and load two data objects of shape (N, N):
Could someone run this benchmark on a Linux machine with NumPy > 1.3? I (Wes) ran these on Windows and don't fully trust the timings there.
    In [8]: roundtrip_archive(100)
    Numpy (npz)     0.0107 seconds
    larry (HDF5)    0.0064 seconds
    pandas (HDF5)   0.0050 seconds

    In [9]: roundtrip_archive(500)
    Numpy (npz)     0.1026 seconds
    larry (HDF5)    0.0908 seconds
    pandas (HDF5)   0.0836 seconds

    In [10]: roundtrip_archive(1000)
    Numpy (npz)     0.5945 seconds
    larry (HDF5)    0.3517 seconds
    pandas (HDF5)   0.2611 seconds

    In [11]: roundtrip_archive(2000, iterations=2)
    Numpy (npz)     4.6487 seconds
    larry (HDF5)    0.8843 seconds
    pandas (HDF5)   0.7653 seconds

    In [12]: roundtrip_archive(4000, iterations=1)
    Numpy (npz)    19.7412 seconds
    larry (HDF5)    2.9064 seconds
    pandas (HDF5)   3.3371 seconds
And here is the code used:
This needs pandas r180 and PyTables.
    import time, os
    import numpy as np
    import la
    import pandas

    def timeit(f, iterations):
        start = time.clock()
        for i in xrange(iterations):
            f()
        return time.clock() - start

    def roundtrip_archive(N, iterations=10):

        # Create data
        arr = np.random.randn(N, N)
        lar = la.larry(arr)
        dma = pandas.DataMatrix(arr, range(N), range(N))

        # Filenames
        filename_numpy = 'c:/temp/numpy.npz'
        filename_larry = 'c:/temp/archive.hdf5'
        filename_pandas = 'c:/temp/pandas_tmp'

        # Delete old files
        try:
            os.unlink(filename_numpy)
        except:
            pass
        try:
            os.unlink(filename_larry)
        except:
            pass
        try:
            os.unlink(filename_pandas)
        except:
            pass

        # Time a round trip save and load
        numpy_f = lambda: numpy_roundtrip(filename_numpy, arr, arr)
        numpy_time = timeit(numpy_f, iterations) / iterations

        larry_f = lambda: larry_roundtrip(filename_larry, lar, lar)
        larry_time = timeit(larry_f, iterations) / iterations

        pandas_f = lambda: pandas_roundtrip(filename_pandas, dma, dma)
        pandas_time = timeit(pandas_f, iterations) / iterations

        print 'Numpy (npz)   %7.4f seconds' % numpy_time
        print 'larry (HDF5)  %7.4f seconds' % larry_time
        print 'pandas (HDF5) %7.4f seconds' % pandas_time

    def numpy_roundtrip(filename, arr1, arr2):
        np.savez(filename, arr1=arr1, arr2=arr2)
        npz = np.load(filename)
        arr1 = npz['arr1']
        arr2 = npz['arr2']

    def larry_roundtrip(filename, lar1, lar2):
        io = la.IO(filename)
        io['lar1'] = lar1
        io['lar2'] = lar2
        lar1 = io['lar1']
        lar2 = io['lar2']

    def pandas_roundtrip(filename, dma1, dma2):
        from pandas.io.pytables import HDFStore
        store = HDFStore(filename)
        store['dma1'] = dma1
        store['dma2'] = dma2
        dma1 = store['dma1']
        dma2 = store['dma2']

    def pandas_roundtrip_pickle(filename, dma1, dma2):
        dma1.save(filename)
        dma1 = pandas.DataMatrix.load(filename)
        dma2.save(filename)
        dma2 = pandas.DataMatrix.load(filename)
Sometimes you only want to load a small part of a large data object from your archive. larry supports loading pieces of a data object. Here are some examples:
    io['lar2'][:,-1]
    io['lar2'][0,:]
Joining / merging data sets
In pandas, joining data sets works much as it does in other data environments:
    In [65]: df1
    Out[65]:
                          A         B
    2000-01-03 00:00:00  -0.1174   -0.941
    2000-01-04 00:00:00  -0.6034   -0.008094
    2000-01-05 00:00:00  -0.3816   -0.9338
    2000-01-06 00:00:00  -0.3298   -0.9548
    2000-01-07 00:00:00   0.9576    0.4652
    2000-01-10 00:00:00  -0.7208   -1.131
    2000-01-11 00:00:00   1.568     0.8498
    2000-01-12 00:00:00   0.3717   -0.2323
    2000-01-13 00:00:00  -1.428    -1.997
    2000-01-14 00:00:00  -1.084    -0.271

    In [67]: df2
    Out[67]:
                          C         D
    2000-01-03 00:00:00   0.2833   -0.1937
    2000-01-05 00:00:00   1.868     1.207
    2000-01-07 00:00:00  -0.8586   -0.7367
    2000-01-11 00:00:00   2.121     0.9104
    2000-01-13 00:00:00   0.7856    0.9063

    In [68]: df1.join(df2)
    Out[68]:
                          A         B          C         D
    2000-01-03 00:00:00  -0.1174   -0.941      0.2833   -0.1937
    2000-01-04 00:00:00  -0.6034   -0.008094   NaN       NaN
    2000-01-05 00:00:00  -0.3816   -0.9338     1.868     1.207
    2000-01-06 00:00:00  -0.3298   -0.9548     NaN       NaN
    2000-01-07 00:00:00   0.9576    0.4652    -0.8586   -0.7367
    2000-01-10 00:00:00  -0.7208   -1.131      NaN       NaN
    2000-01-11 00:00:00   1.568     0.8498     2.121     0.9104
    2000-01-12 00:00:00   0.3717   -0.2323     NaN       NaN
    2000-01-13 00:00:00  -1.428    -1.997      0.7856    0.9063
    2000-01-14 00:00:00  -1.084    -0.271      NaN       NaN

    In [70]: df1.join(df2, how='inner')
    Out[70]:
                          A         B         C         D
    2000-01-03 00:00:00  -0.1174   -0.941     0.2833   -0.1937
    2000-01-05 00:00:00  -0.3816   -0.9338    1.868     1.207
    2000-01-07 00:00:00   0.9576    0.4652   -0.8586   -0.7367
    2000-01-11 00:00:00   1.568     0.8498    2.121     0.9104
    2000-01-13 00:00:00  -1.428    -1.997     0.7856    0.9063
The index (the row labels) is the default join key, but a column can also be used for a similar SQL-like join:
    In [73]: df2
    Out[73]:
                          C         D         key
    2000-01-03 00:00:00   0.2833   -0.1937   0
    2000-01-05 00:00:00   1.868     1.207    1
    2000-01-07 00:00:00  -0.8586   -0.7367   0
    2000-01-11 00:00:00   2.121     0.9104   1
    2000-01-13 00:00:00   0.7856    0.9063   0

    In [74]: df3 = DataFrame({'code' : {0 : 'foo', 1 : 'bar'}})

    In [75]: df3
    Out[75]:
       code
    0  foo
    1  bar

    In [76]: df2.join(df3, on='key')
    Out[76]:
                          C         D         code   key
    2000-01-03 00:00:00   0.2833   -0.1937   foo    0
    2000-01-05 00:00:00   1.868     1.207    bar    1
    2000-01-07 00:00:00  -0.8586   -0.7367   foo    0
    2000-01-11 00:00:00   2.121     0.9104   bar    1
    2000-01-13 00:00:00   0.7856    0.9063   foo    0
Groupby-type operations
Time series / date functionality
Handling missing data
Statistics in computational biology and bioinformatics
I am writing here in response to Vincent Davis' message on the Biopython mailing list about feedback on statistical data structure design issues in biopython. While I have been a Biopython developer for many years, different users have different needs and perspectives, so what I write here is just my personal opinion rather than an official position of the Biopython consortium.
With respect to Python/NumPy/SciPy, computational biology and bioinformatics is different from other numerical sciences in three important ways:
- Bioinformatics is dominated by Perl, and therefore people new to bioinformatics tend to choose Perl as a default rather than Python. Many people do their statistical analysis by a combination of Perl and R. For us in Biopython this means that we should make using Python as straightforward as possible.
- People in bioinformatics come from a wide range of educational backgrounds. Some come from hard-core computational sciences such as physics, mathematics, or computer science, whereas others have a more traditional background in molecular biology and may not have a solid background in numerics. Some people starting in bioinformatics do not know what a compiler is (but otherwise they can be great scientists). Unfortunately, this means that tools available in SciPy are often not accessible simply because SciPy is too difficult to install.
- Some numerical libraries may be more important in bioinformatics than in other fields and vice versa. For example, to be able to calculate p-values it is important to have a library of special functions, whereas linear algebra may be less important.
What I really miss in NumPy are special functions and some basic statistical functions such as kernel density estimators. I know that these exist in SciPy, but because of the second point this won't do me much good. I hope that some day at least the special functions can be made available in NumPy instead of SciPy.
See below for a discussion of MaskedArrays (numpy.ma).
MaskedArrays commentary
Masked arrays are slower than ndarrays since they do more. Out of the packages listed above, scikits.timeseries is the only one that uses MaskedArrays heavily.
Here are some timings for a 1000x10 array (the first number is the time ratio relative to ndarray, since absolute times will differ between CPUs when you run these benchmarks):
Operation | ndarray | MaskedArray | larry | pandas |
---|---|---|---|---|
arr * 2 | 1.0 (17 us) | 4.95 (84.2 us) | 1.94 (32.9 us) | 1.85 (31.6 us) |
arr * arr | 1.0 (15.9 us) | 6.42 (102 us) | 1.88 (29.9 us) | 1.99 (31.7 us) |
Code used to create these data structures:
    import numpy as np
    from numpy.random import randn
    from la import larry
    from pandas import DataMatrix

    arr = randn(1000, 10)
    marr = np.ma.masked_array(arr)
    larr = larry(arr)
    parr = DataMatrix(arr, range(1000), range(10))
The slowdown from using MaskedArrays does not seem acceptable for general-purpose use. However, in applications where one needs to distinguish between NaN meaning "missing" and NaN produced by the computation itself (e.g., taking sqrt(-1)), perhaps they should be used. But paying a 4-7x cost on every operation seems excessive.
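As a small sketch of that distinction (illustrative values, using numpy.ma), a masked entry records "missing" separately from a NaN produced by the computation itself, and reductions treat the two differently:

    # illustrative sketch of masked (missing) vs. computational NaN
    import numpy as np

    data = np.array([1.0, np.nan, 3.0, 4.0])   # the NaN here came from a computation
    marr = np.ma.masked_array(data, mask=[False, False, True, False])   # 3.0 is marked as missing

    # Reductions skip the masked (missing) entry, but the computational NaN
    # still propagates, so the two kinds of "missing" remain distinguishable.
    marr.sum()   # -> nan (1.0 + nan + 4.0; the masked 3.0 is ignored)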