Date posted: 10-Dec-2014
Category: Technology
Uploaded by: hortonworks
Copyright 2014 Zementis, Inc. All rights reserved.
Hadoop’s Advantages for Machine Learning and Predictive Analytics
Presented by Hortonworks & Zementis
September 10, 2014
Moderator
Mark Rabkin, Director Business Development, Zementis
Presenters
Ofer Mendelevitch, Director of Data Science, Hortonworks
Michael Zeller, CEO, Zementis
The Speakers
Michael Zeller CEO & Founder Zementis
Ofer Mendelevitch Director of Data Science Hortonworks
Ofer Mendelevitch is Director of Data Science at Hortonworks, where he is responsible for professional services involving data science with Hadoop, including use cases like recommender systems, prediction, classification and search. Prior to joining Hortonworks, Ofer held a number of positions, including Entrepreneur in Residence at XSeed Capital, VP of Engineering at Nor1, and Director of Engineering at Yahoo!, where he led multiple engineering and data science teams.
Michael Zeller is the CEO and Co-Founder of Zementis. His vision is to help companies deepen and accelerate insights from big data through the power of predictive analytics. Michael also serves on the Board of Directors of Software San Diego and as Secretary/Treasurer on the Executive Committee of ACM SIGKDD, which is the premier international organization for data mining researchers and practitioners from academia, industry, and government.
Zementis provides software for operational deployment of predictive analytics
Products & Capabilities:
• Vendor-neutral architecture for
  - Data mining tools
  - Analytics and data warehouse platforms
• Supports PMML industry standard and wide range of predictive modeling techniques
• Rapidly deploys and executes predictive models
• Accelerates business insight
Hortonworks & Zementis
Our Commitment:
Hortonworks: We Do Hadoop. Our mission is to power your Modern Data Architecture by delivering Enterprise Apache Hadoop
• Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process
• Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
• Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills
Reseller Partners:
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
A data architecture under pressure from new data
Applications: Business Analytics, Custom Applications, Packaged Applications
Data System (Repositories): RDBMS, EDW, MPP
Sources: Existing Sources (CRM, ERP, Clickstream, Logs)
New data:
• OLTP, ERP, CRM Systems
• Unstructured documents, emails
• Clickstream
• Server logs
• Sentiment, Web Data
• Sensor, Machine Data
• Geolocation
Source: IDC
• 2.8 ZB in 2012
• 85% from New Data Types
• 15x Machine Data by 2020
• 40 ZB by 2020
Hadoop within an emerging Modern Data Architecture
Data Lake: An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale
Applications: Business Analytics, Custom Applications, Packaged Applications
Dev & Data Tools: Build & Test
Operations Tools: Provision, Manage & Monitor
Data System (Repositories): RDBMS, EDW, MPP
Hadoop components shown: Governance & Integration, Data Access, Data Management, Security, Operations
Sources:
• OLTP, ERP, CRM Systems
• Documents, Emails
• Web Logs, Click Streams
• Social Networks
• Machine Generated
• Sensor Data
• Geolocation Data
Hadoop unlocks a new approach: Iterative Analytics
Current Reality: SQL, Single Query Engine, Repeatable Linear Process
• Apply schema on write
• Dependent on IT
• Determine list of questions → Design solutions → Collect structured data → Ask questions from list → Detect additional questions
Augment w/ Hadoop: Multiple Query Engines, Iterative Process: Explore, Transform, Analyze
• Apply schema on read
• Batch, Interactive, Real-time, Streaming
• Support range of access patterns to data stored in HDFS: polymorphic access
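To make the schema-on-write versus schema-on-read contrast concrete, here is a minimal Python sketch. It is not taken from the webinar: the record fields, the `read_with_schema` helper and the in-memory list standing in for files on HDFS are all assumptions made for illustration. The raw records are stored untouched, and each question applies its own schema only when the data is read.

```python
import json

# Raw events kept exactly as they arrived; no table schema was declared up front.
# (Hypothetical data; on a cluster these lines would sit in files on HDFS.)
RAW_LINES = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart", "ms": 340, "coupon": "SAVE10"}',
    '{"user": "u1", "page": "/cart"}',
]


def read_with_schema(lines, schema):
    """Project each raw record onto a schema chosen at read time.

    `schema` maps field name -> (type converter, default value).
    Fields missing from a record fall back to the default.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else default
            for name, (cast, default) in schema.items()
        }


# First question: page latency. Only these two fields are interpreted.
latency_view = list(
    read_with_schema(RAW_LINES, {"page": (str, ""), "ms": (int, 0)})
)

# A later question needs the coupon field; the stored data is not reloaded
# or migrated, only the read-time schema changes.
coupon_view = list(
    read_with_schema(RAW_LINES, {"user": (str, ""), "coupon": (str, None)})
)
```

The second view needs a field nobody planned for, and the stored data does not change to provide it. Under schema-on-write, the same request would typically start with a schema change before any new data could be collected.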
A (partial) map of machine learning “tasks”
Discovery
• Clustering: Detect natural groupings
• Outlier detection: Detect anomalies
• Association rule mining: Co-occurrence patterns
Prediction
• Classification: Predict a category
• Regression: Predict a value
• Recommendation: Predict a preference
Typical iterative flow in machine learning modeling
Acquire Data → Clean Data → Visualize, Explore → Hypothesize; Model → Measure/Evaluate → Deploy & Monitor
Why Apache Hadoop for Data Science?
• Hadoop’s schema-on-read reduces cycle time
• Hadoop is ideal for pre-processing of raw data
  – Structured & unstructured
• Larger datasets enable better models
• Large-scale parallel scoring
Hadoop’s schema-on-read accelerates innovation
Timeline with a “schema change” project (Start, 6 months, 9 months): “I need new data” → “Finally, we start collecting” → “Let me see… is it any good?”
Timeline with Hadoop (3 months): “Let’s just put it in a folder on HDFS” → “Let me see… is it any good?” → “My model is awesome!”
Hadoop is ideal for large scale pre-processing
Raw Data → Feature Matrix
Pre-processing steps: Join, Normalize, OCR, Sample, Aggregate, NLP, Transform
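As a rough illustration of how raw data becomes a feature matrix, the sketch below applies three of the listed steps (aggregate, join, normalize) to a toy dataset in plain Python. The tables, field names and the `build_feature_matrix` helper are hypothetical. On a cluster the same steps would typically run as distributed jobs over data in HDFS rather than in a single process.

```python
from collections import defaultdict

# Hypothetical raw inputs: a customer table and a purchase log.
customers = {"c1": {"age": 30}, "c2": {"age": 50}}
purchases = [("c1", 3.0), ("c1", 5.0), ("c2", 10.0)]


def build_feature_matrix(customers, purchases):
    # Aggregate: total amount and number of events per customer.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cust_id, amount in purchases:
        totals[cust_id] += amount
        counts[cust_id] += 1

    # Join: combine the aggregates with the customer attributes.
    ids = sorted(customers)
    rows = [
        [float(customers[i]["age"]), totals[i], float(counts[i])] for i in ids
    ]

    # Normalize: min-max scale every column to the range [0, 1].
    scaled_columns = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        span = hi - lo
        scaled_columns.append(
            [(v - lo) / span if span else 0.0 for v in column]
        )

    # Transpose back so that each row is one customer.
    return ids, [list(row) for row in zip(*scaled_columns)]


ids, matrix = build_feature_matrix(customers, purchases)
```

Each row of `matrix` is one customer and each column one feature (age, total amount, event count), the shape a modeling tool expects as input.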
Hadoop enables modeling with larger datasets
Larger datasets → better outcomes (Banko & Brill, 2001)
• More examples
• More features
Hadoop enables large-scale parallelized “scoring”
Training set → Learning → Model (PMML or native)
Test set → Scoring → Output
Embarrassingly parallel: using Hadoop as grid compute infrastructure
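Scoring is embarrassingly parallel because each row is scored independently of every other row once the model is fixed. The sketch below shows the idea on one machine, under stated assumptions: the logistic-regression weights are made up, and a thread pool stands in for the map tasks a Hadoop cluster would run over data partitions.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, already-trained logistic regression: fixed weights, no shared state.
WEIGHTS = [0.8, -0.5]
BIAS = 0.1


def score_row(row):
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, row))
    return 1.0 / (1.0 + math.exp(-z))


def score_partition(partition):
    # Each partition is scored on its own, like one map task.
    return [score_row(row) for row in partition]


def split(rows, n_parts):
    # Contiguous chunks, so concatenating the results preserves row order.
    size = max(1, math.ceil(len(rows) / n_parts))
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def parallel_score(rows, n_parts=4):
    partitions = split(rows, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        # Executor.map returns results in the order of its inputs.
        results = list(pool.map(score_partition, partitions))
    return [score for part in results for score in part]


rows = [[float(i), float(i % 3)] for i in range(10)]
scores = parallel_score(rows)
```

No partition needs to communicate with any other, so adding workers (or cluster nodes) splits the work without changing the result.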
What is PMML?
• Mature standard developed by the DMG (Data Mining Group) to avoid proprietary issues and incompatibilities and to deploy models
• XML-based language used to define statistical and data mining models and to share these between compliant applications
• Supported by most leading data mining tools, commercial and open-source
• Data handling and transformations (pre- and post-processing) are a core component of the PMML standard
• Allows for the clear separation of tasks: Model development vs. model deployment
• Eliminates the need for custom code and proprietary model deployment solutions
Predictive Model Markup Language (PMML) industry standard reduces the complexity of operationalizing models
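To give a feel for what an XML-based model definition looks like, the sketch below embeds a small, hand-written PMML regression model and scores one record with Python's standard library. It is an illustration only. The field names and coefficients are made up, the document is far simpler than what a modeling tool would export, and the two helper functions read only the intercept and linear coefficients. They are not a general PMML engine.

```python
import xml.etree.ElementTree as ET

# Namespace of PMML 4.2 documents.
NS = {"p": "http://www.dmg.org/PMML-4_2"}

# Hand-written, simplified PMML for illustration (hypothetical fields).
PMML_DOC = """
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="toy linear regression"/>
  <DataDictionary numberOfFields="3">
    <DataField name="tenure" optype="continuous" dataType="double"/>
    <DataField name="calls" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="tenure"/>
      <MiningField name="calls"/>
      <MiningField name="risk" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="tenure" coefficient="-0.25"/>
      <NumericPredictor name="calls" coefficient="0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_regression(pmml_text):
    """Read the intercept and linear coefficients from the document."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        pred.get("name"): float(pred.get("coefficient"))
        for pred in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, record):
    """Linear score; exponents and transformations are ignored in this sketch."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_regression(PMML_DOC)
prediction = score(model, {"tenure": 4.0, "calls": 3.0})
```

The model lives entirely in the XML text, so the tool that produced it and the code that scores with it need to agree only on the standard.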
Predictive Analytics Workflow
Raw Inputs → Data Pre-Processing → Derived Model Inputs → Predictive Model → Model Outputs → Data Post-Processing → Prediction
PMML File:
• Model Signature: Data and operational types
• Input Validation: Outliers, Missing Values, Invalid Values
• Data Pre-Processing: Normalize, Discretize, Bin, Map, etc.
• Data Post-Processing: Scaling, Business Decisions, Thresholds, etc.
PMML in action, covering a complete workflow from raw data input to decision output
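The stages of this workflow can be sketched as a chain of small functions. The Python below is a hypothetical stand-in, not PMML itself: the value range, the default for missing inputs, the linear "model" and the decision threshold are all assumptions chosen to keep the example short.

```python
def validate(raw, lo=0.0, hi=100.0, default=40.0):
    """Input validation: replace missing values and clamp outliers."""
    if raw is None:
        return default
    return min(max(raw, lo), hi)


def pre_process(value, lo=0.0, hi=100.0):
    """Pre-processing: min-max normalize to [0, 1]."""
    return (value - lo) / (hi - lo)


def model(x):
    """Stand-in predictive model: a fixed linear score (made-up coefficients)."""
    return 0.2 + 0.6 * x


def post_process(score, threshold=0.5):
    """Post-processing: turn the score into a business decision."""
    return "review" if score >= threshold else "approve"


def run_workflow(raw):
    # Raw input -> validated -> derived model input -> model output -> decision.
    return post_process(model(pre_process(validate(raw))))


# One ordinary value, one missing value, one outlier.
decisions = [run_workflow(v) for v in (10.0, None, 250.0)]
```

When all four stages travel in one file, the deployed model sees exactly the transformations it was developed with, and that is the case the slide makes for covering the full workflow.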
Predictive Analytics: Path to Business Value
Big Data (Applications, Cloud, Log Files, RSS Feeds, Databases, Other Sources) → Predictive Models (Machine Learning Techniques, Data Mining Tools) → Business Insights → Decisions & Actions → Business Value
Business Insights
• More relevant
• More accurate
• More comprehensive
• More nuanced
Decisions & Actions
• Faster
• Lower risk
• Greater positive impact
Business Value
• Accelerated time-to-market
• More precise targeting
• Real-time responsiveness
• Enhanced operational agility
• Competitive advantage
• Higher revenue growth rates
• Greater profitability
Predictive analytics helps organizations unlock the value of their big data …
Traditional Deployment Cycle
Develop (Data Scientist) → Operationalize (IT Engineer) → Utilize (Business Professional) → Business Decisions
Predictive model deployment becomes a rework cycle
• Extensive manual coding
• Cross-checking
• Fixing coding errors
• Delayed insight
• Less accurate decisions
• Missed opportunities
• Loss of value
… but model deployment challenges can often erode much of the value that predictive analytics can deliver
Deployment with Zementis & PMML
Model Deployment Cycle Time (chart axes: Economic Value, Time-to-insight)
• Without Zementis: ~ 6 months
• With Zementis: Within 2 days *
* And sometimes even within a few hours!
• Accelerated deployment timeline
• Reduced model deployment cycle time
• Reduced model deployment expense
• Increased model throughput
• Enhanced accuracy
• Minimal rework, if any
Rapid insight = Rapid time-to-value from predictive analytics
Enter Zementis, whose solutions accelerate time-to-insight for predictive analytics
Universal PMML Plug-in (UPPI)
Model Deployment, Integration/Execution
Data Mining Tools → PMML → Zementis UPPI for Hive/Hadoop
Data Mining Tools
• Commercial Vendors (e.g. IBM SPSS, SAS)
• Open Source Tools (R, KNIME, ...)
Predictive Algorithms
• Decision Trees
• Neural Networks
• Support Vector Machines
• Linear and Logistic Regression
• Naive Bayes Classifiers
• General and Generalized Linear Models
• Cox Regression
• Rule Set Models
• Clustering
• Scorecards
• Association Rules
• Multiple Models (Segmentation, Chaining, Composition and Ensemble, including Random Forest Models)
Simple Deployment & Execution
• Upload PMML file(s) in Hive
• PMML turns into HiveQL functions
• Seamlessly score data on Hadoop
Hive 0.13
Now faster than ever, up to 100x performance improvements and more to come…
Speeding Up Performance with Tez & ORC
UPPI for Hive 0.13 Performance
Performance executing a complex PMML model as UDF (User-Defined Function) using Hive 0.13
Chart: Scaling by Hadoop Cluster Size (Time for 10 Nodes vs. 20 Nodes)
Chart: Time for Hive 0.13, Tez, and Tez & ORC (labels: 21%, 29%)
29% performance improvement when executing the same model and data by enabling Tez & ORC
DEMO
Zementis Universal PMML Plug-in (UPPI) demo on Hortonworks Sandbox
Zementis UPPI for Hive
1. PMML Sample Models > Hive UDFs
2. Run Customer Churn Example
Broad Applicability
Fraud & Risk Scoring
• Financial institutions
• Scoring bureaus
• Fraud detection
• Advanced decision management
Marketing & Sales
• Up-/cross-sell and next-best-offer
• Marketing campaign optimization
• Real-time recommendations
Sensor & Device Data Processing
• Rotating equipment
• Energy
• Biometrics
• IP network security
Hortonworks and Zementis products accelerate predictive model insights for multiple industries and business use cases
Thank You!
Questions?