The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)

Date post: 25-Jun-2015
Upload: datapad-inc
Page 1: The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)

Strata Santa Clara 2014

The Last Mile: Challenges and opportunities in data tools

Page 2: Wes McKinney

• Former quant @ AQR (a hedge fund)
• Creator of pandas
• Author of Python for Data Analysis (O'Reilly)
• Founder and CEO of DataPad

@wesmckinn
www.datapad.io


Page 4

• http://datapad.io
• New web-based visual analytics environment
• In private beta, join us!
• Hiring for engineering

Page 5: Some Problems

• Business Analytics
• Statistics and ML
• ETL
• Data Visualization
• Workflows + Collaboration

Page 6: Data toolchains

[Diagram with boxes: Data Acquisition; ETL; SQL / Tidy Form; Code-based Env; UI-based Env. Stage labels: Data Slinging / Management; Analysis]

Page 7: Data toolchains

[Diagram with boxes: Data Acquisition; ETL; HDFS; Analytic DBMS; Code-based Env; UI-based Env. Other labels: ETL; ETL?; Maybe]

Page 8: Some Trends

• Columnar / analytic databases
• SQL-on-Hadoop
• Spark / Spark ecosystem
• New life in visual ETL / data prep
• Better data manipulation libraries

Page 9: Crunching data with code

• R (+ data.table, dplyr)
• Python: pandas
• Data frames in Scala, F#, Julia, …
• Spark (Scala/Java)

Page 10: Some Programmatic Tool Problems

• Awkward / slow DB interactions
• In-process memory management
• Reuse of intermediate results
• Execution speed
• Evaluation semantics

Page 11: dplyr (R library)

• By Hadley Wickham and Romain Francois
• Uniform R API, SQL and in-memory backends
• Describe complex data manipulation using “chaining”

Page 12: dplyr (R library)

final <- crime.by.state %.%
  filter(State == "New York", Year == 2005) %.%
  arrange(desc(Count)) %.%
  select(Type.of.Crime, Count) %.%
  mutate(Proportion = Count / sum(Count)) %.%
  group_by(Type.of.Crime) %.%
  summarise(num.types = n(), counts = sum(Count))
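
For readers who think in Python, the chained style above can be roughly imitated with small functions applied left to right. This is only a sketch using the standard library: the rows in `crime_by_state` are made-up illustrative values (not data from the talk), and `pipe` and the verb helpers are hypothetical names, not part of dplyr or pandas.

```python
from collections import defaultdict
from functools import reduce

# Made-up illustrative rows; not data from the talk.
crime_by_state = [
    {"State": "New York", "Year": 2005, "Type.of.Crime": "Violent", "Count": 30},
    {"State": "New York", "Year": 2005, "Type.of.Crime": "Property", "Count": 70},
    {"State": "New York", "Year": 2004, "Type.of.Crime": "Property", "Count": 50},
    {"State": "Ohio", "Year": 2005, "Type.of.Crime": "Violent", "Count": 20},
]


def pipe(data, *steps):
    """Hypothetical helper: pass data through each step, left to right."""
    return reduce(lambda acc, step: step(acc), steps, data)


def filter_rows(pred):
    # Keep only rows for which the predicate is true.
    return lambda rows: [r for r in rows if pred(r)]


def arrange_desc(key):
    # Sort rows by one column, largest first.
    return lambda rows: sorted(rows, key=lambda r: r[key], reverse=True)


def select(*cols):
    # Keep only the named columns.
    return lambda rows: [{c: r[c] for c in cols} for r in rows]


def mutate_proportion(rows):
    # Add each row's share of the total Count.
    total = sum(r["Count"] for r in rows)
    return [{**r, "Proportion": r["Count"] / total} for r in rows]


def summarise_by(key):
    # Group rows by one column, then count rows and sum Count per group.
    def step(rows):
        groups = defaultdict(list)
        for r in rows:
            groups[r[key]].append(r)
        return [
            {key: k, "num.types": len(g), "counts": sum(r["Count"] for r in g)}
            for k, g in groups.items()
        ]
    return step


final = pipe(
    crime_by_state,
    filter_rows(lambda r: r["State"] == "New York" and r["Year"] == 2005),
    arrange_desc("Count"),
    select("Type.of.Crime", "Count"),
    mutate_proportion,
    summarise_by("Type.of.Crime"),
)
print(final)
```

Each step takes a table and returns a new one, so the pipeline reads in the same order in which it runs, which is the point of the chaining style.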

Page 13: Apache Spark

• Broad set of primitive data ops
• Distributed in-memory model scales naturally, high performance
• Build complex computation graphs for analytics
• Applications: Shark, GraphX, …

Page 14: pandas (Python library)

• Broad traction
• Strong feature: time series analytics
• User-friendly API and community
• Being used in many unexpected ways

Page 15: badger (DataPad internal)

• A high performance in-memory analytics engine for DataPad
• Addresses many performance and memory management concerns in pandas
• May become an OSS project someday

Page 16: Standardized machine learning toolkits

• scikit-learn
• PMML
• Mahout
• Cloudera ML

Page 17: Enterprise data workflows

• Cascading (+ Scalding, Cascalog)
• Apache Crunch
• Pig

Page 18: Analytic databases

• Powering visual analytics tools on big data
• Compressed columnar storage
• MPP / in-memory execution model

Page 19: Visual data tools

• Visual Analytics/BI gone mainstream
• New Data Prep products
• Drag-and-drop predictive analytics
• Proliferation of vertical SaaS solutions

Page 20: Visual tool challenges

• Tend to be less flexible than code
• Multiple tools to get the job done
• Many still dependent on Excel
• Collaboration, versioning, provenance

Page 21: Collaboration tools

• Discovery and reuse
• Cataloguing insights
• Analytics from ad-hoc to production
• Interesting projects: IPython Notebook, Shiny, Pivotal Chorus

Page 22: Some ideas

Page 23: Abstract away the execution model (where possible)
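
One possible reading of this idea, sketched in Python under stated assumptions: the caller asks for a result through a single function, and the work runs either in process or inside a SQL engine. The class names, the `sum_by_key` method, and the `ROWS` data are all hypothetical and made up for illustration; only the standard-library `sqlite3` module is a real API here.

```python
import sqlite3

# Made-up (key, value) pairs for illustration only.
ROWS = [("a", 1), ("b", 2), ("a", 3)]


class InMemoryBackend:
    """Hypothetical backend: aggregates with plain Python in process."""

    def __init__(self, rows):
        self.rows = list(rows)

    def sum_by_key(self):
        totals = {}
        for key, value in self.rows:
            totals[key] = totals.get(key, 0) + value
        return dict(sorted(totals.items()))


class SQLiteBackend:
    """Hypothetical backend: pushes the same aggregation down to SQL."""

    def __init__(self, rows):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE t (key TEXT, value INTEGER)")
        self.conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

    def sum_by_key(self):
        cur = self.conn.execute(
            "SELECT key, SUM(value) FROM t GROUP BY key ORDER BY key"
        )
        return dict(cur.fetchall())


def total_by_key(backend):
    """Caller code stays the same regardless of where the work runs."""
    return backend.sum_by_key()


in_memory = total_by_key(InMemoryBackend(ROWS))
in_sqlite = total_by_key(SQLiteBackend(ROWS))
print(in_memory, in_sqlite)
```

Both calls return the same totals, so the choice of engine becomes a deployment detail rather than something the analysis code has to know about.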


Page 24: More integrated environments

Page 25: Enhance collaboration

Page 26: Thank you!