Date post: | 12-Jul-2015 |
Category: |
Technology |
Upload: | big-data-spain |
STATE OF PLAY
SEAN OWEN, DIRECTOR OF DATA SCIENCE, CLOUDERA
State of Play: Data Science on Hadoop in 2015
Sean Owen // Director, Data Science @ Cloudera
2
About …
• Engineer
• Data Science @ Cloudera
• Oryx project founder
• Committer, erstwhile VP Apache Mahout
• Apache Spark contributor / personality
• Co-author, Mahout in Action / Advanced Analytics on Spark
• [email protected] / @sean_r_owen
3
Where Is My Magic Wand?
4
We Like Hadoop Because …
It’s Aspirational
• (Was) Shiny New Toy
• Be Like Yahoo, Google, FB
• Data as Strategy
It Costs Less
• Free – Just Add Hardware
• Open, Standard
• Cost-Savings Projects
We Get More Computing
• Bigger and Faster is Better
• Fewer Hacks to Survive Scale
• Do The Previously Impossible
www.avalonconsulting.net/blog/485-thinking-beyond-shiny-and-new
www.pianta.co.uk/massive-sale-now-on/
www.google.com/about/careers/locations/mayes-county/
5
Incremental Today vs. Revolutionary Tomorrow
Incremental Today
• We set up a prototype Hadoop cluster as part of a big data POC
• We cut our IT budget by 22% by moving some operations to Hadoop
• Our SQL queries are 3 times faster and overnight reports finish in 39 minutes now
• We do the same things with data, but do them notably better.
Revolutionary Tomorrow
• We want to become a real-time product business that reacts to new machine sensor data in seconds, not days
• We want to predict which merchants will take out a business loan this month
• We want a complete customer profile that “understands” what they want at any time
• We think there is a magic wand available?
6
Phase 1. Collect Data
Phase 2. Data Science?
Phase 3. Profit!
7
Demystifying with Data Science
• Machine Learning is not new
• Big Machine Learning is qualitatively different
– More data beats algorithm improvement
– Scale trumps noise and sample size effects
– Can brute-force manual tasks
  • Feature selection
  • Hyperparameter tuning
• Engineering “Big” is Difficult
– Build new scalable data platforms
– Re-engineering parallel algorithms
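As a rough illustration of brute-forcing hyperparameter tuning, the sketch below tries every candidate value and keeps the one with the lowest validation error. Everything in it is an assumption made for illustration, not taken from the talk: the toy one-parameter ridge model, the data points, and the names `fit`, `mse` and `bestLambda`. At scale, each candidate could be evaluated in parallel on a cluster.

```scala
object GridSearchSketch {
  // Toy model (assumption): y ≈ w * x, fit by ridge regression with penalty lambda.
  // Closed form: w = sum(x * y) / (sum(x * x) + lambda)
  def fit(train: Seq[(Double, Double)], lambda: Double): Double = {
    val sxy = train.map { case (x, y) => x * y }.sum
    val sxx = train.map { case (x, _) => x * x }.sum
    sxy / (sxx + lambda)
  }

  // Mean squared error of slope w on held-out data.
  def mse(w: Double, data: Seq[(Double, Double)]): Double =
    data.map { case (x, y) => math.pow(y - w * x, 2) }.sum / data.size

  // Brute force: evaluate every candidate, keep the lowest validation error.
  def bestLambda(train: Seq[(Double, Double)],
                 validation: Seq[(Double, Double)],
                 grid: Seq[Double]): (Double, Double) =
    grid.map(lambda => (lambda, mse(fit(train, lambda), validation))).minBy(_._2)

  def main(args: Array[String]): Unit = {
    // Made-up data points, for illustration only.
    val train = Seq((1.0, 2.1), (2.0, 3.9), (3.0, 6.2))
    val validation = Seq((4.0, 8.0), (5.0, 10.1))
    val grid = Seq(0.0, 0.1, 1.0, 10.0)
    val (lambda, error) = bestLambda(train, validation, grid)
    println(s"best lambda = $lambda, validation MSE = $error")
  }
}
```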
8
What is Data Science?
What skill sets does it require?
What tools are commonly used?
How do we architect data products?
How do we get started?
9
Three Camps
10
s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html
11
Business
12
Business
13
Engineering vs. Statistics
• Programming languages
• Systems languages
• Latency, throughput
• Huge data
• Online problems
• Automated
• Developers, Engineers
vs.
• Statistical environments, BI tools
• High-level languages
• Accuracy
• Medium-sized data
• Offline work
• Ad-hoc
• Statisticians, Analysts
14
Data Science + Hadoop
15
Engineering, Statistics & Hadoop: Before
Gap.
16
Engineering, Statistics & Hadoop: 2014
[Diagram labels: YARN, R, M]
17
Apache Spark: Something for Everyone
• Now Apache TLP
– From UC Berkeley AMPLab
– … inspired by MS DryadLINQ
• Scala-based
– Expressive, efficient
– JVM-based
• Scala-like abstractions
– RDD: Resilient Distributed (immutable) Dataset
– Distributed works like local
– Like Apache Crunch is Collection-like
• Read-Evaluate-Print-Loop
– Interactive
– No compile/deploy cycle needed
• Python API too
• Natively Distributed
• Hadoop-friendly
– Integrate with where data already is
– ETL no longer separate
• Subprojects: MLlib and more
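To make "distributed works like local" a little more concrete, here is a minimal sketch written against a plain Scala `Seq`, so it runs without a cluster. The function name `longWords` and the sample lines are made up for illustration. Spark's RDD offers `flatMap`, `map`, `filter` and `distinct` under the same names, so, assuming `lines` were an `RDD[String]` (for example from `sc.textFile`), the same chain of calls should read almost identically.

```scala
object LocalLikeDistributed {
  // Runs on a local Seq; the same method names exist on Spark's RDD.
  def longWords(lines: Seq[String], minLength: Int): Seq[String] =
    lines
      .flatMap(_.split("\\s+").toSeq) // one line -> many words
      .map(_.toLowerCase)             // normalize case
      .filter(_.length >= minLength)  // keep only longer words
      .distinct                       // drop duplicates

  def main(args: Array[String]): Unit = {
    // Hypothetical input, for illustration only.
    val lines = Seq("Spark is Scala-based", "Spark has a Python API too")
    println(longWords(lines, 5))
  }
}
```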
18
Statisticians: Shell, Concise Syntax
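<row Id="4"...Tags="...c#...winforms..."/>
(4,"c#") (4,"winforms") ...
(4,3104,1.0) (4,2148819,1.0) ...
scala> val postIDTags = postsXML.flatMap { line =>
  // Matches Id="..." ... Tags="..." in a row
  val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  // Tag brackets are entity-escaped in the XML, hence &lt; and &gt;
  val tagRegex = "&lt;([^&]+)&gt;".r
  idTagRegex.findFirstMatchIn(line) match {
    // No match: the line contributes no (post, tag) pairs
    case None => Nil
    case Some(m) => {
      val postID = m.group(1).toInt
      val tagsString = m.group(2)
      // Pick out just the tag name from each match
      val tags = tagRegex.findAllMatchIn(tagsString).map(_.group(1)).toList
      // One (postID, tag) tuple per tag
      tags.map((postID,_))
    }
  }
}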
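The parsing step above can be tried without Spark. The following is a sketch of the same logic as a plain function applied to a local `Seq`. It assumes the tag brackets in the XML are entity-escaped as `&lt;` and `&gt;`. The object name `TagParseSketch`, the function name `parse` and the two sample rows are hypothetical.

```scala
object TagParseSketch {
  // Matches Id="..." ... Tags="..." in a row
  private val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  // Assumption: tags appear entity-escaped, e.g. &lt;c#&gt;
  private val tagRegex = "&lt;([^&]+)&gt;".r

  // Returns one (postID, tag) pair per tag, or Nil if the line has no tags.
  def parse(line: String): List[(Int, String)] =
    idTagRegex.findFirstMatchIn(line) match {
      case None => Nil
      case Some(m) =>
        val postID = m.group(1).toInt
        tagRegex.findAllMatchIn(m.group(2)).map(_.group(1)).toList.map((postID, _))
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical rows shaped like the example on the slide.
    val postsXML = Seq(
      "<row Id=\"4\" Tags=\"&lt;c#&gt;&lt;winforms&gt;\"/>",
      "<row Id=\"7\" Body=\"no tags here\"/>"
    )
    postsXML.flatMap(parse).foreach(println)
  }
}
```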
19
Engineers: Distributed, Manageable
20
2015 is Time to Operationalize
21
From Exploratory to Operational
Exploratory Analytics Operational Analytics
Explore Data
Pick Model
Build Model at Scale, Offline
Continuously Update Model
Score Model in Real-Time
22
Lambda (λ) Architecture, noun. 1. Name of a design idea you’ve had before but didn’t realize was a thing that needed a name.
23
Lambda Architecture
λ:Streaming
• Lambda Architecture
– Batch Layer: compute full answer offline, in batch
– Speed Layer: compute approximate answer online, in near-real-time
– Serving Layer: stitch speed/batch answers together in real-time
• Great fit for big, real-time ML
• Ecosystem has right components now
– Batch: Spark + MLlib
– Speed: Spark Streaming
– Serving: Tomcat / Jetty
– Data Fabric: Kafka, HDFS
24
Oryx 2: Lambda for ML (alpha)
github.com/OryxProject/oryx
Thank You
[email protected] / @sean_r_owen
17th ~ 18th NOV 2014, MADRID (SPAIN)