Date post: | 27-Jan-2015 |
Category: |
Technology |
Upload: | wesm |
View: | 114 times |
Download: | 1 times |
PyData NYC 2013
Practical Medium Data Analytics with Python
PyData NYC 2013
Practical Medium Data Analytics with Python
10 Things I Hate About pandas
www.datapad.io
Wes McKinney
3
• Former quant and MIT math dude
• Creator of Pandas project for Python
• Author of Python for Data Analysis — O’Reilly
• Founder and CEO of DataPad
@wesmckinn
www.datapad.io
• > 20k copies since Oct 2012• Bringing many new people
to Python and data analysis with code
4
www.datapad.io
•http://datapad.io
•Founded in 2013, located in SF
• In private beta, join us!
•Hiring for engineering
Why hate on pandas?
pandas rocks!
www.datapad.io
• Easy-to-use, fast in-memory data wrangling and analytics library
• Enabled loads of complex data work to be done by mere mortals in Python
• Might have kept R from taking over the world (hehe)
10
So, pandas
www.datapad.io
•170 distinct contributors
•Over 5400 issues and pull requests on GitHub
•Upcoming 0.13 release
12
pandas, the project
www.datapad.io
•pandas’s broad applicability also a liability
•Only game in town for lot of things
•pandas being used in some unplanned ways
13
But.
www.datapad.io
• No more structured dtype drudgery!
• Easy IO!
• Data alignment!
• Hierarchical indexing!
• Time series analytics!
14
Some things to love
www.datapad.io
•Table reshaping
•Missing data handling
•pandas.merge, pandas.concat
•Expressive groupby machinery
15
More things to love
www.datapad.io
•General data wrangling
•ETL jobs
•Business analytics (incl. BI uses)
•Time series analysis, statistical modeling
16
Some pandas use cases
pandas does many things that are tedious, slow, or
difficult to do correctly without it
Unfortunately, pandas is not a database
www.datapad.io
•DataFrame’s internal structure intended to make row-oriented ops fast on numerical data
•Python objects can be used as data, indices (a feature, not a bug)
19
#1 Slightly too far from the metal
www.datapad.io
• Many analytics ops require a small portion of the data
• Many ways to “materialize” the full data set in memory by accident
• Axis indexes wouldn’t necessarily make sense on out of core data sets
20
#2 No support (yet) for memory maps
www.datapad.io
•N.B. HDF5/PyTables support is a partial solution
21
#2 No support (yet) for memory maps
www.datapad.io
•Makes it difficult to be a serious tool in an ETL toolchain on top of some SQL-ish system
• Inadequacy of pandas/NumPy data type systems
22
#3 No tight database integration
www.datapad.io
• Jobs with heavy SQL-reading are slow and use tons of memory
•TODO: integrate pandas with ODBC C API and write out SQL data directly into NumPy arrays
23
#3 No tight database integration
www.datapad.io
• Inconsistent representation of missing data
•No Boolean or Integer NA values
•NA needs to be a first class citizen in analytics operations
24
#4 Best-efforts NA representation
www.datapad.io
• Difficult to understand footprint of pandas object
• Ample data copying throughout library
• Would benefit from being able to compress data in-memory or shuttle data temporarily to disk
25
#5 RAM management
www.datapad.io
•Makes pandas not quite a fully-fledged R replacement
•GroupBy and Joins slower than they could be
26
#6 Weak support for categorical data
www.datapad.io
•Must write custom functions to pass to .apply(..)
•Easy to run up against DRY problems and general Python syntax limitations
27
#7 Complex GroupBy operations get messy
www.datapad.io
•DataFrame not intended as a database table
•Makes streaming data use a challenge
•B+ tree tables interesting?
28
#8 Appending data slow and tedious
www.datapad.io
•Currencies, units
•Time zones
•Geographic data
•Composite data types
29
#9 Limited type system, column metadata
www.datapad.io
•Filter
•Group
• Join
•Aggregate
•Limit/TopK
•Sorting
30
#10 No true query processing layer
WHERE, HAVINGGROUP BYJOINSUM, MEAN, ...LIMITORDER BY
www.datapad.io
•Hampered by use of Python data structures / GIL interactions
•Object internals not designed for concurrent use
31
#11 “Slow”: no multicore / distributed algos
Oh no what do we do
Stop believing in the “one tool to rule them all”
“Real Artists Ship”- Steve Jobs
www.datapad.io
• I am heavily biased by focus on business analytics/BI use cases
•Need production-ready software to ship in relatively short time frame
36
Focus on results
www.datapad.io
• In internal development at DataPad
•Code named “badger”
•pandas-ish syntax: designed for data processing and analytical queries
37
A new project
www.datapad.io
•Consistent data type system
•Compressed columnar binary storage
•High perf analytical query processor
•Data preparation/cleaning tools
38
Badger in a nutshell
www.datapad.io
•Time series analytics
• Immutable array data, little copying
•Analytics kernels: written C with no dependencies
•Caching of useful intermediates
39
Badger in a nutshell
www.datapad.io
•Data set: 2012 Election data (FEC)
•5.3 mm records 7 columns
•Tools
•pandas
•badger
•R: data.table
•SQL: PostgreSQL, SQLite
40
Some benchmarks
www.datapad.io
•Total contributions by candidate
41
Query 1
SELECT cand_nm, sum(contb_receipt_amt) AS totalFROM fecGROUP BY cand_nm
www.datapad.io
•Total contributions by candidate
42
Query 1
badger (in-‐memory) : 19ms (1x)badger (from-‐disk) : 131ms (6.9x)pandas (in-‐memory) : 273ms (14.3x)R data.table 1.8.10: 382ms (20x)PostgreSQL : 4.7s (247x)SQLite : 72s (3800x)
www.datapad.io
•Total contributions by candidate and state
43
Query 2
SELECT cand_nm, contbr_st, sum(contb_receipt_amt) AS totalFROM fecGROUP BY cand_nm, contbr_st
www.datapad.io44
Query 2
badger (in-‐memory) : 269ms (1x)badger (from-‐disk) : 391ms (1.5x)R data.table 1.8.10: 500ms (1.8x)pandas (in-‐memory) : 770ms (2.9x)PostgreSQL : 5.96s (23x)
•Total contributions by candidate and state
www.datapad.io
•Total contributions by candidate and state with 2 filter predicates
45
Query 3
SELECT cand_nm, sum(contb_receipt_amt) as totalFROM fecWHERE contb_receipt_dt BETWEEN '2012-‐05-‐01' and '2012-‐11-‐05' AND contb_receipt_amt BETWEEN 0 and 2500GROUP BY cand_nm
www.datapad.io
•Total contributions by candidate and state with 2 filter predicates
46
Query 3
badger (in-‐memory) : 96ms (1x)badger (from-‐disk) : 275ms (2.9x)pandas (in-‐memory) : 946ms (9.8x)PostgreSQL : 6.2s (65x)
www.datapad.io
•Distributed in-memory analytics
•Multicore algorithms
•ETL job-building tools
•Open source in some form someday
•Looking for algorithms hackers to help
47
Badger, the future