Date post: | 20-Jan-2015 |
Category: | Technology |
Upload: | sunz-sas-users-of-new-zealand |
1 © Copyright 2012 EMC Corporation. All rights reserved.
MAD SKILLS: NEW ANALYSIS PRACTICES FOR BIG DATA
Big Data Has Arrived
The digital universe will grow 44X in the next 10 years.
Source: 2011 IDC Digital Universe Study
Big Data: Hype or Reality?
• Do we have a Big Data problem in New Zealand?
• Do we have a Big Data problem in my organisation?
• Do I really need to care?
Big Data: Hype or Reality?
• Do we have a Big Data problem in New Zealand? Maybe.
• Do we have a Big Data problem in my organisation? Maybe.
• Do I really need to care? ABSOLUTELY.
• Big Data Practices is about widely applicable
Today
• Shadow systems
• ‘Shallow’ Business Intelligence
• Static schemas accrete over time
• Slow-moving models
• Slow-moving data
• Departmental warehouses
• Data Sources
• Images
• EDW
A Common Analytics Environment
Data Warehouses, File Systems
SAS/ACCESS
A Common Analytics Environment
Data Warehouse, File Systems
Cluster Computer
SAS/CONNECT SAS/ACCESS
Bag of Tricks
• DATA Step Bravado
• SQL Pushdown
• SASFILE
• SAS Views
• Compression
A Leaner Configuration
Parallel Database
SAS/ACCESS
Greenplum
An Integrated Architecture
Parallel Database
SAS/ACCESS SAS/CONNECT
Greenplum
SAS High Performance Appliance on Greenplum
Parallel Greenplum Database
SAS/ACCESS SAS/CONNECT
Greenplum
SAS-GP High-Performance Analytics
Nodes: Master (Root) Node, Worker Node 1, Worker Node 2, …, Worker Node N
1. Analytical computation and data request sent to the worker nodes.
2. Data request sent to the database; data slice moved into memory.
3. Analytic processing with internode communication.
4. Worker node results returned to the Master Node, which finalizes the computation.
5. Result returned to the client.
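The request/compute/gather flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads stand in for worker nodes, in-memory lists stand in for database slices, and the names `worker_partial` and `master_mean_std` are hypothetical. It is not the SAS-Greenplum implementation, and it omits the internode communication step.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def worker_partial(data_slice):
    """Worker side: summarise the local slice as (count, sum, sum of squares)."""
    n = len(data_slice)
    s = sum(data_slice)
    ss = sum(v * v for v in data_slice)
    return n, s, ss


def master_mean_std(slices):
    """Master side: send the request to every worker, then finalize the result."""
    # Threads are a stand-in for worker nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(worker_partial, slices))
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    mean = s / n
    variance = ss / n - mean * mean  # population variance from the partial sums
    return mean, sqrt(max(variance, 0.0))


# Hypothetical data slices, one per worker node.
slices = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
mean, std = master_mean_std(slices)
print(mean, std)
```

Only three numbers per worker travel back to the master, regardless of how large each slice is.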
How do we sort out this mess?
• Shadow systems
• ‘Shallow’ Business Intelligence
• Static schemas accrete over time
• Slow-moving models
• Slow-moving data
• Departmental warehouses
• Data Sources
• Images
Keep the Enterprise Data Warehouse
Enterprise Data Warehouse
• Single Source of Truth
• Heavy data governance and quality
• Operational reporting
• Financial consolidation
Add an Analytics Data Cloud as a Complement
Commodity Hardware
Virtual Machines
Public Cloud
SAS/Greenplum/Hadoop
Enterprise Data Warehouse
• Single Source of Truth
• Heavy data governance and quality
• Operational reporting
• Financial consolidation
Analytics Data Cloud
• Source of all raw data (often 10X the size of the EDW)
• Self-service infrastructure to support multiple marts and sandboxes
• Ad hoc, business-led analytics solutions
MAD Analytics
Magnetic
ADC PLATFORM
Data
Analytics
Chorus
Agile
Data
Fast ETL/ELT
Simple linear models
Trend analysis
Model selection
Model design
Agile
• get data into the cloud
• analyze and model in the cloud
• push results back into the cloud
Deep
• Past, Facts: What happened where and when?
• Past, Interpretation: How and why did it happen?
• Future, Facts: What will happen?
• Future, Interpretation: How can we do better?
Different Phases of Analytics
Data Exploration → Data Prep → Modeling → Model Fit → Scoring
DATA EXPLORATION
• Frequency
• Histogram
• Bar Chart
• Box Plot Chart
• Correlation Matrix
TRANSFORMATION
• Aggregation
• Row Filtering
• Deriving New Variables
• Pivoting
• Normalizing
PREDICTIVE MODEL
• Linear Regression
• Logistic Regression
• Naïve Bayes Classifier
• Decision Trees
• Neural Networks
DATA MINING
• Association Rule
• K-means Clustering
MODEL FIT STATISTICS
• Goodness of Fit
• ROC
• Significance statistics for all independent variables
SCORING
• Linear Regression
• Logistic Regression
• Naïve Bayes Classifier
• Decision Trees
• Neural Networks
In-Database Machine Learning
• Goal: Build models using all available data
• Principle: Avoid using samples if possible.
• Principle: Bring computation to data, not the other way round.
• In practice: Write machine learning algorithms in (parallelised) data languages like SQL, SAS, and MapReduce.
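As a rough illustration of bringing computation to the data, the sketch below fits a one-variable linear regression using nothing but SQL aggregates, so only five numbers leave the database. It assumes SQLite (via Python's standard `sqlite3` module) as a stand-in for a parallel database, and the table `obs` and its contents are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the parallel database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],  # hypothetical data, y = 2x + 1
)

# The database scans every row; the client receives only the sufficient statistics.
n, sx, sy, sxy, sxx = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * y), SUM(x * x) FROM obs"
).fetchone()

# Closed-form least-squares estimates from the aggregates.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(slope, intercept)
```

Because the sums are ordinary aggregates, a parallel database can compute them per segment and combine them without sampling.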
Design Pattern – Online Learning
• Process data one record at a time using an incrementally maintained model; adjust the model every time it makes a prediction error
• Examples: perceptron, online SVMs, Bayesian filters, etc.
• Such algorithms can be implemented using SAS DATA steps or SQL aggregate functions
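A minimal perceptron sketch shows the shape of this pattern: the model is a small state that a transition function updates one record at a time, which is the same shape as a SQL aggregate's state function or a DATA step's retained variables. The function names and the toy stream below are hypothetical.

```python
from functools import reduce


def perceptron_step(state, example):
    """Transition function: fold one labelled record into the model state."""
    weights, bias = state
    features, label = example  # label is +1 or -1
    activation = sum(w * f for w, f in zip(weights, features)) + bias
    predicted = 1 if activation > 0 else -1
    if predicted != label:
        # Adjust the model only when it makes a prediction error.
        weights = [w + label * f for w, f in zip(weights, features)]
        bias = bias + label
    return weights, bias


def predict(state, features):
    """Classify one record with the current model state."""
    weights, bias = state
    return 1 if sum(w * f for w, f in zip(weights, features)) + bias > 0 else -1


# Hypothetical, linearly separable stream of (features, label) records.
stream = [([2.0, 1.0], 1), ([1.0, 3.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]

state = ([0.0, 0.0], 0.0)
for _ in range(5):  # a few passes over the stream
    state = reduce(perceptron_step, stream, state)
print(state)
```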
Design Pattern – Parallel Ensemble Learning
• Break a (large) dataset into i.i.d. subsets residing on each node, learn a model on each subset in parallel, and then combine the models appropriately
• Examples: random forests, ensembles of SVMs, etc.
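The pattern can be sketched with a deliberately simple base learner. In the illustration below each shard trains a one-dimensional threshold classifier and the ensemble combines them by majority vote. Threads stand in for nodes, the synthetic data is hypothetical, and the stump learner is a simplification chosen for brevity rather than a random forest or SVM.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def train_stump(shard):
    """Learn a 1-D threshold on one shard: the midpoint between the class means.

    Assumes class 1 tends to have larger values than class 0.
    """
    pos = [x for x, y in shard if y == 1]
    neg = [x for x, y in shard if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def ensemble_predict(thresholds, x):
    """Combine the per-shard models by majority vote."""
    votes = sum(1 for t in thresholds if x > t)
    return 1 if votes * 2 > len(thresholds) else 0


# Hypothetical synthetic data: class 0 centred at 0, class 1 centred at 4.
random.seed(0)
data = [(random.gauss(0, 1), 0) for _ in range(300)]
data += [(random.gauss(4, 1), 1) for _ in range(300)]
random.shuffle(data)

# Random, disjoint shards approximate i.i.d. subsets, one per node.
shards = [data[i::3] for i in range(3)]
with ThreadPoolExecutor(max_workers=3) as pool:
    thresholds = list(pool.map(train_stump, shards))

print(thresholds, ensemble_predict(thresholds, 5.0), ensemble_predict(thresholds, -1.0))
```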
Design Pattern – MapReduce
• Repeatedly apply a Map function to transform (local) chunks of data and then use a Reduce function to consolidate the transformed results
• Examples: parallel LDA, k-Means, Naive Bayes, etc.
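As one possible illustration, k-means fits this pattern directly: the Map step turns each local chunk into per-cluster sums and counts, and the Reduce step merges them to produce new centroids. The sketch below is single-process and one-dimensional, with hypothetical function names and data. A real deployment would run `map_chunk` where each chunk lives.

```python
from functools import reduce


def map_chunk(chunk, centroids):
    """Map: summarise one local chunk as {cluster index: (sum, count)}."""
    partial = {}
    for x in chunk:
        nearest = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
        s, c = partial.get(nearest, (0.0, 0))
        partial[nearest] = (s + x, c + 1)
    return partial


def reduce_partials(a, b):
    """Reduce: merge two per-cluster summaries."""
    merged = dict(a)
    for k, (s, c) in b.items():
        s0, c0 = merged.get(k, (0.0, 0))
        merged[k] = (s0 + s, c0 + c)
    return merged


def kmeans(chunks, centroids, iterations=10):
    """Repeat map and reduce; empty clusters keep their previous centroid."""
    for _ in range(iterations):
        partials = [map_chunk(chunk, centroids) for chunk in chunks]
        totals = reduce(reduce_partials, partials, {})
        centroids = [
            totals[k][0] / totals[k][1] if k in totals else centroids[k]
            for k in range(len(centroids))
        ]
    return centroids


# Hypothetical chunks, as if stored on two different nodes.
chunks = [[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]]
centroids = kmeans(chunks, [0.0, 5.0])
print(centroids)
```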
Design Pattern – Prediction Markets
Japanese Telco: What People Are Talking About
Traffic Network Modelling
Massively Parallel Model Learning
• Solving tens of thousands of statistical modelling problems, one for each road in the city, in parallel:
    libname adc greenplm server=gplum db=traffic port=5432 user=keesiong …
    proc sql;
      select origin, dest,
             linregr(travel_time, array[peak_period(entry_time), …, origin_vol, dest_vol])
      from adc.route_travel_info
      group by origin, dest;
• A model: t(x) = 466 + 7.72 peakPeriod(x) + 22.5 workDay(x) + 0.378 originVol(x) + 0.691 destVol(x)
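To make the fitted model concrete, the sketch below scores a single trip with the coefficients quoted above. The input values are hypothetical, the slide does not state the units of travel time or volume, and `predict_travel_time` is an illustrative name rather than part of the original system.

```python
# Coefficients copied from the model on the slide.
COEFFICIENTS = {
    "intercept": 466.0,
    "peak_period": 7.72,
    "work_day": 22.5,
    "origin_vol": 0.378,
    "dest_vol": 0.691,
}


def predict_travel_time(peak_period, work_day, origin_vol, dest_vol, coef=COEFFICIENTS):
    """Evaluate t(x) as the intercept plus a weighted sum of the inputs."""
    return (
        coef["intercept"]
        + coef["peak_period"] * peak_period
        + coef["work_day"] * work_day
        + coef["origin_vol"] * origin_vol
        + coef["dest_vol"] * dest_vol
    )


# Hypothetical trip: peak period, work day, origin volume 100, destination volume 200.
estimate = predict_travel_time(1, 1, 100, 200)
print(round(estimate, 2))
```

In the grouped query, one such coefficient set would exist per (origin, dest) pair, so scoring becomes a lookup followed by this weighted sum.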
It’s MAD, but is it Mad?
Commodity Hardware
Virtual Machines
Public Cloud
SAS/Greenplum/Hadoop
Enterprise Data Warehouse
• Single Source of Truth
• 1 Logical Model
• Heavy data governance and quality
• Operational reporting
• Financial consolidation
Analytics Data Cloud
• Source of all the raw data (often 10X the size of the EDW)
• Self-service infrastructure to support multiple marts and sandboxes
• Rapid analytic iteration and business-led solutions
Public Cloud Computing Services
Democratisation of Data
Helping Organizations Evolve From This…
Line Of Business User
Business
I.T. Department
Database Administrator
Business Intelligence Analyst
To This…
Line Of Business User
Data Platform Administrator
Business Intelligence Analysts
Data Scientists
Top 3 Steps You Should Take On Your Journey To Big Data Analytics
3. Put all your data to work.
2. Have a data strategy. Model less, iterate more.
1. First invest in people, then technology.
“A journey of a thousand miles begins with a single step”
- Lao-tzu, Chinese philosopher (531 BC)
First Step: Walk towards the SAS-Greenplum Technical Session.