Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013)

PyData NYC 2013

Practical Medium Data Analytics with Python

PyData NYC 2013

Practical Medium Data Analytics with Python

10 Things I Hate About pandas

www.datapad.io

Wes McKinney

3

• Former quant and MIT math dude

• Creator of Pandas project for Python

• Author of Python for Data Analysis — O’Reilly

• Founder and CEO of DataPad

@wesmckinn

http://www.datapad.io


www.datapad.io

• > 20k copies since Oct 2012• Bringing many new people

to Python and data analysis with code

4



www.datapad.io

•http://datapad.io

•Founded in 2013, located in SF

• In private beta, join us!

•Hiring for engineering



http://datapad.io

http://datapad.io

Why hate on pandas?

www.datapad.io7



pandas rocks!

www.datapad.io

• Easy-to-use, fast in-memory data wrangling and analytics library

• Enabled loads of complex data work to be done by mere mortals in Python

• Might have kept R from taking over the world (hehe)

10

So, pandas



www.datapad.io11



www.datapad.io

•170 distinct contributors

•Over 5400 issues and pull requests on GitHub

•Upcoming 0.13 release

12

pandas, the project



www.datapad.io

•pandas’s broad applicability also a liability

•Only game in town for lot of things

•pandas being used in some unplanned ways

13

But.



www.datapad.io

• No more structured dtype drudgery!

• Easy IO!

• Data alignment!

• Hierarchical indexing!

• Time series analytics!

14

Some things to love



www.datapad.io

•Table reshaping

•Missing data handling

•pandas.merge, pandas.concat

•Expressive groupby machinery

15

More things to love



www.datapad.io

•General data wrangling

•ETL jobs

•Business analytics (incl. BI uses)

•Time series analysis, statistical modeling

16

Some pandas use cases



pandas does many things that are tedious, slow, or

difficult to do correctly without it

Unfortunately, pandas is not a database

www.datapad.io

•DataFrame’s internal structure intended to make row-oriented ops fast on numerical data

•Python objects can be used as data, indices (a feature, not a bug)

19

#1 Slightly too far from the metal



www.datapad.io

• Many analytics ops require a small portion of the data

• Many ways to “materialize” the full data set in memory by accident

• Axis indexes wouldn’t necessarily make sense on out of core data sets

20

#2 No support (yet) for memory maps



www.datapad.io

•N.B. HDF5/PyTables support is a partial solution

21

#2 No support (yet) for memory maps



www.datapad.io

•Makes it difficult to be a serious tool in an ETL toolchain on top of some SQL-ish system

• Inadequacy of pandas/NumPy data type systems

22

#3 No tight database integration



www.datapad.io

• Jobs with heavy SQL-reading are slow and use tons of memory

•TODO: integrate pandas with ODBC C API and write out SQL data directly into NumPy arrays

23

#3 No tight database integration



www.datapad.io

• Inconsistent representation of missing data

•No Boolean or Integer NA values

•NA needs to be a first class citizen in analytics operations

24

#4 Best-efforts NA representation



www.datapad.io

• Difficult to understand footprint of pandas object

• Ample data copying throughout library

• Would benefit from being able to compress data in-memory or shuttle data temporarily to disk

25

#5 RAM management



www.datapad.io

•Makes pandas not quite a fully-fledged R replacement

•GroupBy and Joins slower than they could be

26

#6 Weak support for categorical data



www.datapad.io

•Must write custom functions to pass to .apply(..)

•Easy to run up against DRY problems and general Python syntax limitations

27

#7 Complex GroupBy operations get messy



www.datapad.io

•DataFrame not intended as a database table

•Makes streaming data use a challenge

•B+ tree tables interesting?

28

#8 Appending data slow and tedious



www.datapad.io

•Currencies, units

•Time zones

•Geographic data

•Composite data types

29

#9 Limited type system, column metadata



www.datapad.io

•Filter

•Group

• Join

•Aggregate

•Limit/TopK

•Sorting

30

#10 No true query processing layer

WHERE, HAVINGGROUP BYJOINSUM, MEAN, ...LIMITORDER BY



www.datapad.io

•Hampered by use of Python data structures / GIL interactions

•Object internals not designed for concurrent use

31

#11 “Slow”: no multicore / distributed algos



Oh no what do we do

Stop believing in the “one tool to rule them all”

“Real Artists Ship”- Steve Jobs

www.datapad.io



www.datapad.io

• I am heavily biased by focus on business analytics/BI use cases

•Need production-ready software to ship in relatively short time frame

36

Focus on results



www.datapad.io

• In internal development at DataPad

•Code named “badger”

•pandas-ish syntax: designed for data processing and analytical queries

37

A new project



www.datapad.io

•Consistent data type system

•Compressed columnar binary storage

•High perf analytical query processor

•Data preparation/cleaning tools

38

Badger in a nutshell



www.datapad.io

•Time series analytics

• Immutable array data, little copying

•Analytics kernels: written C with no dependencies

•Caching of useful intermediates

39

Badger in a nutshell



www.datapad.io

•Data set: 2012 Election data (FEC)

•5.3 mm records 7 columns

•Tools

•pandas

•badger

•R: data.table

•SQL: PostgreSQL, SQLite

40

Some benchmarks



www.datapad.io

•Total contributions by candidate

41

Query 1

SELECT cand_nm, sum(contb_receipt_amt) AS totalFROM fecGROUP BY cand_nm



www.datapad.io

•Total contributions by candidate

42

Query 1

badger (in-‐memory) : 19ms (1x)badger (from-‐disk) : 131ms (6.9x)pandas (in-‐memory) : 273ms (14.3x)R data.table 1.8.10: 382ms (20x)PostgreSQL : 4.7s (247x)SQLite : 72s (3800x)



www.datapad.io

•Total contributions by candidate and state

43

Query 2

SELECT cand_nm, contbr_st, sum(contb_receipt_amt) AS totalFROM fecGROUP BY cand_nm, contbr_st



www.datapad.io44

Query 2

badger (in-‐memory) : 269ms (1x)badger (from-‐disk) : 391ms (1.5x)R data.table 1.8.10: 500ms (1.8x)pandas (in-‐memory) : 770ms (2.9x)PostgreSQL : 5.96s (23x)

•Total contributions by candidate and state



www.datapad.io

•Total contributions by candidate and state with 2 filter predicates

45

Query 3

SELECT cand_nm, sum(contb_receipt_amt) as totalFROM fecWHERE contb_receipt_dt BETWEEN '2012-‐05-‐01' and '2012-‐11-‐05' AND contb_receipt_amt BETWEEN 0 and 2500GROUP BY cand_nm

www.datapad.io

•Total contributions by candidate and state with 2 filter predicates

46

Query 3

badger (in-‐memory) : 96ms (1x)badger (from-‐disk) : 275ms (2.9x)pandas (in-‐memory) : 946ms (9.8x)PostgreSQL : 6.2s (65x)



www.datapad.io

•Distributed in-memory analytics

•Multicore algorithms

•ETL job-building tools

•Open source in some form someday

•Looking for algorithms hackers to help

47

Badger, the future



www.datapad.io

Thank you!

48



Date post:	27-Jan-2015
Category:	Technology
Upload:	wesm
View:	114 times
Download:	1 times

Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013)

Technology