Date posted: 25-Jun-2015
Uploaded by: datapad-inc
Strata Santa Clara 2014
The Last Mile: Challenges and opportunities in data tools
www.datapad.io
Wes McKinney
• Former quant @ AQR (a hedge fund)
• Creator of pandas
• Author of Python for Data Analysis — O’Reilly
• Founder and CEO of DataPad
@wesmckinn
• http://datapad.io
• New web-based visual analytics environment
• In private beta, join us!
• Hiring for engineering
Some Problems
• Business Analytics
• Statistics and ML
• ETL
• Data Visualization
• Workflows + Collaboration
Data toolchains
[Diagram labels: Data Acquisition; ETL; SQL / Tidy Form; Code-based Env and UI-based Env; Data Slinging / Management; Analysis]
Data toolchains
[Diagram labels: Data Acquisition; ETL; HDFS; Code-based Env and UI-based Env; Analytic DBMS; ETL; ETL?; Maybe]
Some Trends
• Columnar / analytic databases
• SQL-on-Hadoop
• Spark / Spark ecosystem
• New life in visual ETL / data prep
• Better data manipulation libraries
Crunching data with code
• R (+ data.table, dplyr)
• Python: pandas
• Data frames in Scala, F#, Julia, …
• Spark (Scala/Java)
Some Programmatic Tool Problems
• Awkward / slow DB interactions
• In-process memory management
• Reuse of intermediate results
• Execution speed
• Evaluation semantics
dplyr (R library)
• By Hadley Wickham and Romain Francois
• Uniform R API, SQL and in-memory backends
• Describe complex data manipulation using “chaining”
dplyr (R library)
final <- crime.by.state %.%
  filter(State == "New York", Year == 2005) %.%
  arrange(desc(Count)) %.%
  select(Type.of.Crime, Count) %.%
  mutate(Proportion = Count / sum(Count)) %.%
  group_by(Type.of.Crime) %.%
  summarise(num.types = n(), counts = sum(Count))
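The chaining idea in the dplyr example can be sketched in plain Python. This is only an illustration of the pattern, not dplyr or pandas: the `Frame` class, its method names, and the sample rows are all hypothetical, and only the first four steps of the chain (filter, arrange, select, mutate) are mirrored.

```python
class Frame:
    """Tiny table held as a list of row dicts (hypothetical, for illustration)."""

    def __init__(self, rows):
        # Copy each row so every step returns an independent table.
        self.rows = [dict(r) for r in rows]

    def column(self, name):
        return [r[name] for r in self.rows]

    def filter(self, predicate):
        return Frame(r for r in self.rows if predicate(r))

    def arrange(self, key, descending=False):
        return Frame(sorted(self.rows, key=lambda r: r[key], reverse=descending))

    def select(self, *columns):
        return Frame({c: r[c] for c in columns} for r in self.rows)

    def mutate(self, name, func):
        # func sees the whole frame, so it can use column-wide values like a sum.
        values = func(self)
        return Frame({**r, name: v} for r, v in zip(self.rows, values))


# Made-up rows; the real crime.by.state data is not part of the slides.
crime_by_state = Frame([
    {"State": "New York", "Year": 2005, "Type.of.Crime": "Burglary", "Count": 30},
    {"State": "New York", "Year": 2005, "Type.of.Crime": "Robbery", "Count": 10},
    {"State": "New York", "Year": 2004, "Type.of.Crime": "Burglary", "Count": 25},
    {"State": "Ohio", "Year": 2005, "Type.of.Crime": "Burglary", "Count": 40},
])


def proportion(frame):
    total = sum(frame.column("Count"))
    return [count / total for count in frame.column("Count")]


# Each method returns a new Frame, so the steps read top to bottom.
final = (
    crime_by_state
    .filter(lambda r: r["State"] == "New York" and r["Year"] == 2005)
    .arrange("Count", descending=True)
    .select("Type.of.Crime", "Count")
    .mutate("Proportion", proportion)
)
```

Because every step returns a new table rather than modifying the old one, intermediate results such as the filtered frame can be kept and reused by other chains.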
Apache Spark
• Broad set of primitive data ops
• Distributed in-memory model scales naturally, high performance
• Build complex computation graphs for analytics
• Applications: Shark, GraphX, …
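A rough single-process sketch of what "building a computation graph" can mean, in plain Python. It does not use Spark or its API: the `Node` class, `source` helper, and `collect` name are hypothetical. One simplifying assumption is that every node caches its result after the first evaluation, which Spark does not do by default.

```python
class Node:
    """One step in a lazily evaluated computation graph (illustrative only)."""

    def __init__(self, func, *parents):
        self.func = func
        self.parents = parents
        self._cached = None
        self._done = False
        self.evaluations = 0  # how many times this node actually ran

    def map(self, f):
        return Node(lambda xs: [f(x) for x in xs], self)

    def filter(self, pred):
        return Node(lambda xs: [x for x in xs if pred(x)], self)

    def collect(self):
        # Work happens only here: evaluate parents first, then this node.
        if not self._done:
            inputs = [p.collect() for p in self.parents]
            self._cached = self.func(*inputs)
            self._done = True
            self.evaluations += 1
        return self._cached


def source(data):
    """Leaf node that materializes the input data when first collected."""
    return Node(lambda: list(data))


# Describe the graph; nothing is computed yet.
numbers = source(range(10))
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)
total = Node(sum, evens)
count = Node(len, evens)

evaluations_before = evens.evaluations  # 0: the graph is only a description so far

result_total = total.collect()  # 0 + 4 + 16 + 36 + 64 = 120
result_count = count.collect()  # 5, reusing the cached `evens` result
```

Here `evens` feeds two downstream results but is evaluated once, which is the kind of intermediate-result reuse a graph-based engine makes possible.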
pandas (Python library)
• Broad traction
• Strong feature: time series analytics
• User-friendly API and community
• Being used in many unexpected ways
badger (DataPad internal)
• A high performance in-memory analytics engine for DataPad
• Addresses many performance and memory management concerns in pandas
• May become an OSS project someday
Standardized machine learning toolkits
• scikit-learn
• PMML
• Mahout
• Cloudera ML
Enterprise data workflows
• Cascading (+ Scalding, Cascalog)
• Apache Crunch
• Pig
Analytic databases
• Powering visual analytics tools on big data
• Compressed columnar storage
• MPP / in-memory execution model
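To make "compressed columnar storage" concrete, here is a small sketch in plain Python. It assumes run-length encoding as the compression scheme, which is only one of several that analytic databases use; the function names and sample rows are hypothetical.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def rle_encode(values):
    """Run-length encode: collapse runs of equal values into (value, length)."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(pairs):
    """Expand (value, length) pairs back into the original list."""
    return [value for value, length in pairs for _ in range(length)]


# Made-up rows, sorted by state so equal values sit next to each other.
rows = [
    {"state": "NY", "count": 3},
    {"state": "NY", "count": 5},
    {"state": "NY", "count": 2},
    {"state": "OH", "count": 7},
    {"state": "OH", "count": 1},
]

columns = to_columns(rows)
encoded_state = rle_encode(columns["state"])  # [("NY", 3), ("OH", 2)]

# An aggregate over one column touches only that column's values.
total_count = sum(columns["count"])  # 18
```

Storing each column contiguously is what makes the encoding effective: values of the same type and often the same content end up adjacent, and a query over one column never has to read the others.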
Visual data tools
• Visual Analytics/BI gone mainstream
• New Data Prep products
• Drag-and-drop predictive analytics
• Proliferation of vertical SaaS solutions
Visual tool challenges
• Tend to be less flexible than code
• Multiple tools to get the job done
• Many still dependent on Excel
• Collaboration, versioning, provenance
Collaboration tools
• Discovery and reuse
• Cataloguing insights
• Analytics from ad-hoc to production
• Interesting projects: IPython Notebook, Shiny, Pivotal Chorus