Financial data analysis in Python with pandas
Wes McKinney
@wesmckinn
10/17/2011
@wesmckinn () Data analysis with pandas 10/17/2011 1 / 22
My background
3 years as a quant hacker at AQR, now consultant / entrepreneur
Math and statistics background with the zest of computer science
Active in scientific Python community
My blog: http://blog.wesmckinney.com
Twitter: @wesmckinn
Bare essentials for financial research
Fast time series functionality
Easy data alignment
Date/time handling
Moving window statistics
Resampling / frequency conversion
Fast data access (SQL databases, flat files, etc.)
Data visualization (plotting)
Statistical models
Linear regression
Time series models: ARMA, VAR, ...
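As a rough illustration of the time series essentials listed above (alignment, moving windows, frequency conversion), here is a minimal sketch. It uses the current pandas API (`rolling`, `resample`), which postdates these slides, and the prices and dates are made up for the example.

```python
import pandas as pd

# Hypothetical daily prices for two assets on partially overlapping dates.
dates_a = pd.date_range("2011-01-03", periods=6, freq="D")
dates_b = pd.date_range("2011-01-05", periods=6, freq="D")
a = pd.Series([10.0, 10.5, 10.2, 10.8, 11.0, 10.9], index=dates_a)
b = pd.Series([20.0, 19.5, 19.8, 20.4, 20.1, 20.6], index=dates_b)

# Data alignment: arithmetic matches on dates, not on position.
# Dates present in only one series come out as NaN.
spread = b - a

# Moving window statistic: 3-day rolling mean (first two values are NaN).
rolling_mean = a.rolling(window=3).mean()

# Frequency conversion: keep the last observation in each 2-day bin.
two_day_last = a.resample("2D").last()
```

The point of the sketch is that none of these steps needs an explicit loop or manual date matching.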
Would be nice to have
Portfolio and risk analytics, backtesting
Easy enough to write yourself, though most people do a bad job of it
Portfolio optimization
Most financial firms use a 3rd party library anyway
Derivative pricing
Can use QuantLib in most languages
What are financial firms using?
HFT: a C++ and hardware arms race, a different topic
Research
Mainstream: R, MATLAB, Python, ...
Econometrics: Stata, eViews, RATS, etc.
Non-programmatic environments: ClariFI, Palantir, ...
Production
Popular: Java, C#, C++
Less popular, but growing: Python
Fringe: Functional languages (OCaml, Haskell, F#)
What are financial firms using?
Many hybrid-language environments (e.g. Java/R, C++/R, C++/MATLAB, Python/C++)
Which is the main implementation language?
If the main language is Java/C++, the result is lower productivity and a higher cost of prototyping new functionality
Trends
Banks and hedge funds are realizing that Java-based production systems can be replaced with 20% as much Python code (or less)
MATLAB is being increasingly ditched in favor of Python
R and Python use for research generally growing
Python language
Simple, expressive syntax
Designed for readability, like “runnable pseudocode”
Easy-to-use, powerful built-in types and data structures:
Lists and tuples (fixed-size, immutable lists)
Dicts (hash maps / associative arrays) and sets
Everything’s an object, including functions
“There should be one, and preferably only one, obvious way to do it”
“Batteries included”: great general purpose standard library
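To make the built-in types above concrete, here is a small sketch. The tickers and numbers are made up, and the `mean` helper is a hypothetical name introduced only for this example.

```python
# A list is mutable and can grow; a tuple is fixed once created.
prices = [101.5, 99.0, 100.25]
prices.append(102.0)
trade = ("AAPL", 100, 101.5)  # (ticker, shares, price), hypothetical values

# A dict maps keys to values; a set keeps only unique items.
positions = {"AAPL": 100, "MSFT": 250}
tickers = set(["AAPL", "MSFT", "AAPL"])

# Functions are objects too: they can be stored in a dict and passed around.
def mean(values):
    return sum(values) / len(values)

stats = {"mean": mean, "max": max, "min": min}
summary = {name: func(prices) for name, func in stats.items()}
```

Here `summary` ends up holding one result per function, computed by looping over the functions themselves.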
A simple example: quicksort
Pseudocode from Wikipedia:
function qsort(array)
    if length(array) < 2
        return array
    var list less, greater
    select and remove a pivot value pivot from array
    for each x in array
        if x < pivot then append x to less
        else append x to greater
    return concat(qsort(less), pivot, qsort(greater))
A simple example: quicksort
First try Python implementation:
def qsort(array):
    if len(array) < 2:
        return array
    less, greater = [], []
    pivot, rest = array[0], array[1:]
    for x in rest:
        if x < pivot:
            less.append(x)
        else:
            greater.append(x)
    return qsort(less) + [pivot] + qsort(greater)
A simple example: quicksort
Use list comprehensions:
def qsort(array):
    if len(array) < 2:
        return array
    pivot, rest = array[0], array[1:]
    less = [x for x in rest if x < pivot]
    greater = [x for x in rest if x >= pivot]
    return qsort(less) + [pivot] + qsort(greater)
A simple example: quicksort
Heck, fit it onto one line!
qs = lambda r: (r if len(r) < 2
                else (qs([x for x in r[1:] if x < r[0]])
                      + [r[0]]
                      + qs([x for x in r[1:] if x >= r[0]])))
Though that’s starting to look like Lisp code...
A simple example: quicksort
A quicksort using NumPy arrays
import numpy as np

def qsort(array):
    if len(array) < 2:
        return array
    pivot, rest = array[0], array[1:]
    # Boolean masks select the elements on each side of the pivot
    less = rest[rest < pivot]
    greater = rest[rest >= pivot]
    return np.r_[qsort(less), [pivot], qsort(greater)]
Of course no need for this when you can just do:
sorted_array = np.sort(array)
Python: drunk with power
This comic has way too much airtime but:
Staples of Python for science: MINS
(M) matplotlib: plotting and data visualization
(I) IPython: rich interactive computing and development environment
(N) NumPy: multi-dimensional arrays, linear algebra, FFTs, random number generation, etc.
(S) SciPy: optimization, probability distributions, signal processing, ODEs, sparse matrices, ...
Why did Python become popular in science?
NumPy traces its roots to 1995
Extremely easy to integrate C/C++/Fortran code
Access fast low level algorithms in a high level, interpreted language
The language itself
“It fits in your head”
“It [Python] doesn’t get in my way” - Robert Kern
Python is good at all the things other scientific programming languages are not good at (e.g. networking, string processing, OOP)
Liberal, BSD-style open-source licenses: can use Python for commercial applications
Some exciting stuff in the last few years
Cython
“Augmented” Python language with type declarations, for generating compiled extensions
C-like speedups with Python-like development time
IPython: enhanced interactive Python interpreter
The best research and software development env for Python
An integrated parallel / distributed computing backend
GUI console with inline plotting and a rich HTML notebook (more on this later)
PyCUDA / PyOpenCL: GPU computing in Python
Transformed Python overnight into one of the best languages for doing GPU computing
Where has Python historically been weak?
Rich data structures for data analysis and statistics
NumPy arrays, while powerful, feel distinctly “lower level” if you’re used to R’s data.frame
pandas has filled this gap over the last 2 years
Statistics libraries
Nowhere near the depth of R’s CRAN repository
statsmodels provides tested implementations of a lot of standard regression and time series models
Turns out that most financial data analysis requires only fairly elementary statistical models
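As an example of how elementary such a model can be, here is a sketch of an ordinary least squares fit of an asset's returns on market returns. It uses plain NumPy rather than statsmodels to stay self-contained. The data is synthetic and noise-free, so the fit recovers the coefficients exactly; real return data would not behave this way.

```python
import numpy as np

# Synthetic data: asset returns generated exactly as alpha + beta * market.
market = np.array([-0.02, -0.01, 0.0, 0.01, 0.02, 0.03])
asset = 0.001 + 1.5 * market

# Design matrix with an intercept column, then a least squares solve.
X = np.column_stack([np.ones_like(market), market])
coef, _, _, _ = np.linalg.lstsq(X, asset, rcond=None)
alpha, beta = coef
```

A statistics library adds standard errors, diagnostics, and richer model families on top of this core computation.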
pandas library
Began building at AQR in 2008, open-sourced late 2009
Why
R / MATLAB, while good for research / data analysis, are not suitable implementation languages for large-scale production systems
(I personally don’t care for them for data analysis)
Existing data structures for time series in R / MATLAB were too limited / not flexible enough for my needs
Core idea: indexed data structures capable of storing heterogeneous data
Etymology: panel data structures
pandas in a nutshell
A clean axis indexing design to support fast data alignment, lookups, hierarchical indexing, and more
High-performance data structures
Series/TimeSeries: 1D labeled vector
DataFrame: 2D spreadsheet-like structure
Panel: 3D labeled array, collection of DataFrames
SQL-like functionality: GroupBy, joining/merging, etc.
Missing data handling
Time series functionality
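A minimal sketch of a few of the features listed above (DataFrame, GroupBy, joining, missing data), written against the current pandas API. The tickers, returns, and sector labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns; one observation is missing (NaN).
returns = pd.DataFrame(
    {
        "ticker": ["AAPL", "AAPL", "MSFT", "MSFT"],
        "ret": [0.01, np.nan, -0.02, 0.03],
    }
)
sectors = pd.DataFrame(
    {"ticker": ["AAPL", "MSFT"], "sector": ["Hardware", "Software"]}
)

# GroupBy: mean return per ticker. NaN values are skipped by default.
mean_by_ticker = returns.groupby("ticker")["ret"].mean()

# SQL-like join: attach each ticker's sector via a left join on the key column.
merged = returns.merge(sectors, on="ticker", how="left")

# Missing data handling: explicitly replace NaN with zero where that is wanted.
filled = returns["ret"].fillna(0.0)
```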
pandas design philosophy
“Think outside the matrix”: stop thinking about shape and start thinking about indexes
Indexing and data alignment are essential
Fault-tolerance: save you from common blunders caused by coding errors (specifically misaligned data)
Lift the best features of other data analysis environments (R, MATLAB, Stata, etc.) and make them better, faster
Performance and usability equally important
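The kind of blunder that index-based thinking guards against can be sketched as follows. The weights and returns are made up; the two Series deliberately list the same tickers in a different order.

```python
import numpy as np
import pandas as pd

# Same two tickers, stored in a different order (hypothetical values).
weights = pd.Series([0.6, 0.4], index=["AAPL", "MSFT"])
rets = pd.Series([0.02, 0.01], index=["MSFT", "AAPL"])

# Shape-based: raw arrays pair values by position, so AAPL's weight is
# silently multiplied by MSFT's return.
positional = float(np.dot(weights.values, rets.values))

# Index-based: pandas matches on labels before multiplying.
aligned = float((weights * rets).sum())
```

The two results differ (0.016 versus 0.014), and only the label-aligned one is the intended portfolio return.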
The pandas killer feature: indexing
Each axis has an index
Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data
Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data
Natural way of expressing “group by” and join-type operations
Better integrated and more flexible indexing than anything available in R or MATLAB
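Hierarchical indexing and index-based "group by" can be sketched with a two-level (date, ticker) index. This uses the current pandas `MultiIndex` API, and the prices are invented for the example.

```python
import pandas as pd

# Two-level index: every (date, ticker) pair labels one price.
index = pd.MultiIndex.from_product(
    [["2011-10-14", "2011-10-17"], ["AAPL", "MSFT"]],
    names=["date", "ticker"],
)
prices = pd.Series([422.0, 27.3, 420.0, 27.0], index=index)  # made-up prices

# Selecting on the outer level returns a Series indexed by ticker.
one_day = prices.loc["2011-10-17"]

# Pivot the inner level into columns: dates as rows, tickers as columns.
wide = prices.unstack("ticker")

# "Group by" expressed directly on an index level.
avg_by_ticker = prices.groupby(level="ticker").mean()
```

The same labeled data can be viewed as a long 1D Series or a 2D table without any manual reshaping code.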
Tutorial time
To the IPython console!