Date posted: 27-Aug-2014
Category: Software
Uploaded by: hortonworks
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Spring 2014 Version 1.5
We do Hadoop.
Your speakers…
Ofer Mendelevitch, Director of Data Science, Hortonworks
Wayne Thompson, Chief Data Scientist, SAS
A data architecture under pressure from new data
Applications: Business Analytics, Custom Applications, Packaged Applications
Data systems (repositories): RDBMS, EDW, MPP
Sources: Existing Sources (CRM, ERP, Clickstream, Logs)
Data types: OLTP, ERP, CRM systems; unstructured documents, emails; clickstream; server logs; sentiment, web data; sensor and machine data; geo-location
Source: IDC
• 2.8 ZB in 2012
• 85% from new data types
• 15x machine data by 2020
• 40 ZB by 2020
Hadoop within an emerging Modern Data Architecture
Operations tools: Provision, Manage & Monitor
Dev & data tools: Build & Test
Applications: Business Analytics, Custom Applications, Packaged Applications
Data systems (repositories): RDBMS, EDW, MPP
Platform capabilities: Governance & Integration; Security; Operations; Data Access; Data Management
Sources: OLTP, ERP, CRM systems; documents, emails; web logs, click streams; social networks; machine generated; sensor data; geolocation data
Data Lake: An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale
Hadoop unlocks a new approach: Iterative Analytics
Current reality (SQL): single query engine, repeatable linear process
• Apply schema on write
• Dependent on IT
• Determine list of questions
• Design solutions
• Collect structured data
• Ask questions from list
• Detect additional questions
Augment with Hadoop: multiple query engines, iterative process (explore, transform, analyze)
• Apply schema on read
• Support a range of access patterns to data stored in HDFS (polymorphic access): batch, interactive, real-time, streaming
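The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw records, field names, and the `read_with_schema` helper are hypothetical and are not part of any Hadoop or SAS API. The raw lines are stored untouched, and each analysis applies its own schema when it reads them.

```python
import json

# Raw events stored as-is (as they might sit in a folder on HDFS).
# The records and field names are made up for illustration.
RAW_LINES = [
    '{"user": "U101", "page": "/home", "ms": 120}',
    '{"user": "U102", "page": "/cart", "ms": 340, "referrer": "ad"}',
    '{"user": "U101", "page": "/cart", "ms": 95}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only the requested fields,
    cast them, and fill fields missing from a record with None."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        }


# Two analyses read the same raw data with different schemas.
latency_view = list(read_with_schema(RAW_LINES, {"page": str, "ms": int}))
referrer_view = list(read_with_schema(RAW_LINES, {"user": str, "referrer": str}))

print(latency_view)
print(referrer_view)
```

Adding the `referrer` field required no change to how the data was stored, which is the cycle-time advantage the slide attributes to schema on read.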
Apache Hadoop for Data Science
• Hadoop’s schema on read reduces cycle times
• Hadoop is ideal for pre-processing of raw data
• Improved models with larger datasets
Hadoop’s schema-on-read accelerates innovation
Timeline with a “schema change” project (start, 6 months, 9 months): “I need new data” → “Finally, we start collecting” → “Let me see… is it any good?”
Timeline with Hadoop (3 months): “Let’s just put it in a folder on HDFS” → “Let me see… is it any good?” → “My model is awesome!”
Hadoop ideal for large scale pre-processing
Raw Data → Feature Matrix
Pre-processing steps: Join, Normalize, OCR, Sample, Aggregate, NLP, Transform
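As a rough illustration of three of these steps (aggregate, join, normalize), the Python sketch below turns a small transaction log into a feature matrix. The customer table, the transactions, and the choice of min-max scaling are assumptions made for the example. On a real cluster the same logic would typically run in a distributed engine such as Pig or Hive rather than in a single process.

```python
from collections import defaultdict

# Hypothetical raw inputs: a customer table and a transaction log.
customers = {"11001": {"age": 45}, "11002": {"age": 43}, "11003": {"age": 65}}
transactions = [
    ("11001", 20.0), ("11001", 30.0),
    ("11002", 10.0),
    ("11003", 5.0), ("11003", 15.0), ("11003", 40.0),
]

# Aggregate: total spend and purchase count per customer.
totals = defaultdict(float)
counts = defaultdict(int)
for customer_id, amount in transactions:
    totals[customer_id] += amount
    counts[customer_id] += 1


def min_max(values):
    """Normalize a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


# Join the aggregates with the customer table, then normalize each column.
ids = sorted(customers)
columns = [
    min_max([customers[i]["age"] for i in ids]),
    min_max([totals[i] for i in ids]),
    min_max([counts[i] for i in ids]),
]
# Transpose the columns into one feature row per customer.
feature_matrix = {i: [col[k] for col in columns] for k, i in enumerate(ids)}

for customer_id, row in feature_matrix.items():
    print(customer_id, [round(x, 2) for x in row])
```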
Why big data science? Larger datasets → better outcomes
Banko & Brill, 2001
• More examples
• More features
A (partial) map of data science “tasks”
Discovery
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Association rule mining: co-occurrence patterns
Prediction
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference
Big Data Science: High energy physics, Genomics, etc.
Typical iterative flow in data science
Acquire Data → Clean Data → Visualize, Explore → Hypothesize; Model → Measure/Evaluate → Deploy & Monitor
SAS in-memory and Visual Statistics
HDP 2.1 (Hortonworks Data Platform)
• Governance & Integration: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS)
• Data Access: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (In-Memory Analytics, ISV engines)
• Data Management: YARN (Data Operating System), HDFS (Hadoop Distributed File System)
• Security: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox)
• Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
• Deployment choice: Linux, Windows, On-Premise, Cloud
SAS® Visual Statistics and SAS® In-Memory Statistics for Hadoop
• Provide powerful advanced analytics integrated directly on HDP
Copyright © 2012, SAS Institute Inc. All rights reserved.
BIG ANALYTICS + HORTONWORKS DATA PLATFORM (HDP) = BIG OPPORTUNITIES
WHAT IS IT?
Provides a single interactive analytical platform on Hadoop to perform
• analytical data preparation
• variable transformations
• exploratory analysis
• statistical modeling and machine learning
• integrated modeling comparison and scoring
Takes advantage of distributed in-memory computing optimized for analytical workloads
SAS® In-Memory Analytics: Prepare Data → Explore Data → Develop Models → Score
SAS® IN-MEMORY ANALYTICS: INTEGRATED USER EXPERIENCE
Stages: Data Preparation, Exploration/Visualization, Modeling, Deployment
Users: data scientist / programmer, statistician
Products: SAS® Visual Statistics (GUI), SAS® In-Memory Statistics for Hadoop (programming)
SAS IN-MEMORY STATISTICS FOR HADOOP
Analytical life cycle: Data Management & Exploration → Modeling → Model Evaluation & Deployment
• Data Management: Aggregate, Compute, Update, Append, Set, Schema, DeleteRows, DropTables, PurgeTempTables
• Data Exploration: Boxplot, Corr, Crosstab, Distinct, Fetch, Frequency, Histogram, KDE, MDSummary, Percentile, Summary, TopK
• Descriptive Modeling: Association, Path Analysis, Clustering (k-means), Clustering (DBSCAN)
• Predictive Modeling: Decision Tree, Forecast, Gen Linear Model, Linear Regression, Logistic Regression, Random Forests
• Text Analytics: Parsing, SVD, Topic generation, Document projection
• Recommendation Systems: Association, Clustering, kNN, SVD, Ensemble
• Evaluation, Deployment: Assess (misclassification matrix, lift, ROC, concordance), Score, Training / Validation
• Utilities: Where; GroupBy; TableInfo, ColumnInfo, ServerInfo; Partition, Balance; Store, Replay, Free; Table, Promote
• HDFS I/O: Sasiola, Sashdat, Anyfile Reader
SAS ON HADOOP: IN-MEMORY, CLIENT-SERVER, WEB-BASED
• Web clients: SAS® In-Memory Analytic Products (SAS® Visual Analytics, SAS® Visual Statistics, SAS® In-Memory Statistics)
• Edge node
• SAS® LASR™ Analytic Server (in memory): head node and data nodes
• Hortonworks Data Platform: data stored on the data nodes
SAS ON HADOOP: SUMMARY STATISTICS
Web client program:
proc imstat; table dat1; summary X / mean; run;
Edge node:
• Send request SampleMean(X) to LASR, then wait
• Receive $\bar{X}$ and write it to the output
SAS® LASR™ Analytic Server (head node and data nodes, in memory):
• A) Head node broadcasts a request for $S_X = \sum_i x_i$ to the data nodes
• B) Data node $j$ computes $S_{X,j} = \sum_i x_{i,j}$, $j = 1, 2, 3, 4$
• C) Head node aggregates $\bar{X} = \sum_j S_{X,j} / N$
• D) Head node sends $\bar{X}$ back to the edge node
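Steps A to D of the summary-statistics slide can be mimicked in plain Python. This is a single-process sketch, not the LASR implementation: the partitioning of X across four lists is hypothetical, and having each node return its row count alongside its partial sum is an assumption about how N is obtained.

```python
# Hypothetical partitioning of column X across four data nodes.
data_nodes = [
    [2.0, 4.0, 6.0],
    [1.0, 3.0],
    [10.0],
    [5.0, 5.0, 5.0, 5.0],
]


def node_partial(values):
    """Step B: a data node returns its local sum and local row count."""
    return sum(values), len(values)


def head_node_mean(nodes):
    """Steps A, C, D: request partials, aggregate them, return the mean."""
    partials = [node_partial(values) for values in nodes]  # A (request) + B
    total = sum(s for s, _ in partials)                    # C: sum of S_{X,j}
    n = sum(c for _, c in partials)                        # C: N = total rows
    return total / n                                       # D: result sent back


mean_x = head_node_mean(data_nodes)
print(mean_x)
```

Only one small tuple per node travels to the head node, so the raw rows never leave the data nodes. That is why a mean can be computed in a single pass without moving the data.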
SAS ON HADOOP: PRINCIPLES OF THE DESIGN
Components: web clients (SAS® In-Memory Analytic Products), edge node, SAS® LASR™ Analytic Server (head node and data nodes, in memory)
• Thin clients, multi-user, interactive, real-time
• Point-and-click or programming
• Receive requests from a UI or SAS program
• Work on light computations (interactive trees)
• No MapReduce
• One data copy
• Concurrency
• Temporary tables or columns
• MPP or SMP
Use Case #1: Recommendation systems
Why recommender systems?
• 5 – 20% increase in sales
• 60% use “recommendations” to determine suitable product
• In 2011, 15% of customers admitted to buying recommended products; in 2013, nearly 30%
• 36 million subscribers; 60-70% view results from recommendation
• Tens of billions of “thumbs up”; 60 million active users; 3.8 billion hours of music (last quarter); 47% uptick in active users; 67% increase in music served
• 25% YOY growth
• Trip Advisor collaborates with eBay, Orbitz and others
Pre-processing raw data for recommendation
• Inputs:
  • Explicit product ratings (when provided)
  • Implicit information: purchase transactions, page views, comments
Raw data: Ratings, Page views, Forum Comments
Ratings matrix (users U101, U102, U103, U104, U105, …; movies Epic, X-Men, Hobbit, Argo, Pirates; “?” marks an unknown rating):
5 2 4 ? ? ? ? 5 2 ? 1 2 ? ? 3 ? 2 3 1 5
Goal: predict a preference
Observed ratings (users U101, U102, U103, U104, U105, …; movies Epic, X-Men, Hobbit, Argo, Pirates; “?” marks an unknown rating):
5 2 4 ? ? ? ? 5 2 ? 1 2 ? ? 3 ? 2 3 1 5
Completed matrix with predicted ratings:
5 2 4 1 3 4 1 5 2 3 1 2 4 1 3 3 2 3 1 5
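One simple way to fill in a “?” is user-based collaborative filtering: predict a user's rating as a similarity-weighted average of the ratings given by other users. The Python sketch below shows that idea only. The ratings dictionary is hypothetical (it is not the matrix from the slide), cosine similarity over co-rated movies is one choice among many, and nothing here reflects how the SAS recommendation procedures are implemented.

```python
import math

# Hypothetical ratings (user -> {movie: rating}); missing entries are unknown.
ratings = {
    "U101": {"Epic": 5, "X-Men": 2, "Hobbit": 4},
    "U102": {"Epic": 5, "X-Men": 2, "Hobbit": 4, "Argo": 1},
    "U103": {"Epic": 1, "X-Men": 5, "Argo": 5},
}


def cosine(a, b):
    """Cosine similarity over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict(user, movie):
    """Similarity-weighted average of other users' ratings for the movie.
    Returns None when no other user has rated it."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        weight = cosine(ratings[user], their)
        num += weight * their[movie]
        den += weight
    return num / den if den else None


print(predict("U101", "Argo"))
```

U101 agrees with U102 on every movie they share, so the prediction for Argo lands closer to U102's low rating than to U103's high one.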
MACHINE LEARNING INTEGRATION
PREDICTIVE ANALYTICS & MACHINE LEARNING: RECOMMENDATION SYSTEM DEMO
Data wrangling (Data Director*): Convert JSON Files, Standardize, Load LASR
SAS In-Memory Statistics
• Users: Tony, Patty, George
• Businesses: Tony’s Bar, Trees Lounge, The Tropicana, Blue Parrot
• Business: Beer & Wine, Chinese Food, Mexican Food
• Topics (from reviews): LOUNGE, PUB, BEER, DRINK, GAME, MUSIC, PINT, BAND, PLAY, GLASS, VODKA, PATIO, KARAOKE, COCKTAIL, WINGS, LIQUOR, ALCOHOL, BARTENDER, DRAFT, TAP, FUN, LIVE, SCENE, POOL
SAS Visual Analytics: Deployment; Relevant, Real-time Interactions
* New SAS Product
PREDICTIVE ANALYTICS & MACHINE LEARNING: RECOMMENDATION SYSTEM DEMO
John Clark
Review History
1. Oyster Bar
2. The Brick
3. Trees Lounge
4. Blue Parrot
5. Winchester Club
6. Starlight Lounge
7. Tony’s Bar
8. Lucy’s
9. The Tropicana
Recommendation: Rank 1, 2, 3, …
Use Case # 2: Building a prediction model
Labeled Data → Model
Customer ID | Age | Gender | Loyalty Card | More features… | Buys organic
11001 | 45 | M | Yes | … | Yes
11002 | 43 | M | No | … | Yes
11003 | 65 | F | Yes | … | No
… | … | … | … | … | …
Unseen data → Model → Buys organic
Customer ID | Age | Gender | Home Owner | More features…
11004 | 33 | M | No | …
11005 | 25 | F | No | …
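The train-then-score flow can be shown with the smallest possible classifier, a decision stump (a decision tree with a single split). The sketch below is illustrative only: the labeled rows, the unseen customer IDs, and the use of age as the only feature are assumptions, and a real model would use many features and a held-out validation set.

```python
# Hypothetical labeled data: (age, buys_organic).
labeled = [
    (25, False), (31, False), (38, False),
    (43, True), (45, True), (52, True),
]


def fit_stump(rows):
    """Find the age threshold that classifies the most training rows correctly.
    The model predicts 'buys organic' when age >= threshold."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({age for age, _ in rows}):
        correct = sum((age >= threshold) == label for age, label in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def predict(threshold, age):
    """Score one customer with the fitted stump."""
    return age >= threshold


# Train on labeled data.
threshold = fit_stump(labeled)

# Score unseen customers, for whom no label is available.
unseen = {"20001": 33, "20002": 57}
predictions = {cid: predict(threshold, age) for cid, age in unseen.items()}
print(threshold, predictions)
```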
Demo #2: Predicting who buys organic products
• Dataset: grocery transaction and customer data
• Goals:
  • Understand customer propensity to buy organic products
  • Develop segments using an interactive decision tree
  • Develop stratified models to predict organic purchases
• Why is it useful?
  • Inventory strategy
  • Store layout planning
  • Provider management
PREDICTIVE ANALYTICS & MACHINE LEARNING
SAS VISUAL STATISTICS 6.4 – ORGANICS PURCHASE DEMO
Wrap up: SAS and Hortonworks Data Platform
• Increase productivity for data scientists
  • Users can concurrently & interactively analyze traditional & new data sets in HDP to help businesses quickly discover and capitalize on new business insights from their data
• Increase efficiency
  • Avoid unnecessary, multiple passes through the data
  • SAS in-memory infrastructure running on top of Hadoop eliminates costly data movement and persists data in-memory for the entire analytics session
• Capture and analyze new data types
  • HDP + SAS enables data scientists to look at more of their enterprise data
• Leverage 100 percent open-source Apache Hadoop
  • SAS customers can now embrace Hadoop as a core platform in their data architecture
How should you get started? Next steps…
• Get the Data
• Formulate a well-defined business objective
• Data exploration: integrate and fuse heterogeneous data types
• Pre-process: generate features from raw data
• Manage the long-tail distribution and data imbalance
• Modeling: remember model building is cyclical
• Evaluate your results
• Work with IT to move analytics from research and into operations
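The steps on data imbalance and evaluation interact: on an imbalanced data set, accuracy alone can make a useless model look good. The Python sketch below illustrates this with a confusion matrix, precision, and recall. The test set and both sets of predictions are made up for the example.

```python
def confusion(actual, predicted):
    """Return (tp, fp, fn, tn) for boolean labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)
    fp = sum((not a) and p for a, p in pairs)
    fn = sum(a and (not p) for a, p in pairs)
    tn = sum((not a) and (not p) for a, p in pairs)
    return tp, fp, fn, tn


# Hypothetical imbalanced test set: 2 positives out of 10.
actual = [True, True] + [False] * 8
always_no = [False] * 10                      # trivial model: never predicts a positive
model = [True, False, True] + [False] * 7     # finds 1 of 2 positives, 1 false alarm

for name, predicted in [("always_no", always_no), ("model", model)]:
    tp, fp, fn, tn = confusion(actual, predicted)
    accuracy = (tp + tn) / len(actual)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    print(name, accuracy, precision, recall)
```

Both models reach the same accuracy here, but only the second one finds any positives, which is why the misclassification matrix is worth checking alongside a single summary score.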
More details..
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
More about SAS Software & Hortonworks http://hortonworks.com/partner/SAS/
Contact us: [email protected]