Mad skills new analysis practices for big data

Date post: 20-Jan-2015
Upload: sunz-sas-users-of-new-zealand
Page 1: Mad skills new analysis practices for big data

1 © Copyright 2012 EMC Corporation. All rights reserved.

MAD SKILLS: NEW ANALYSIS PRACTICES FOR BIG DATA

Page 2: Mad skills new analysis practices for big data

Big Data Has Arrived

THE DIGITAL UNIVERSE WILL GROW 44X IN THE NEXT 10 YEARS

Source: 2011 IDC Digital Universe Study

Page 3: Mad skills new analysis practices for big data

Big Data: Hype or Reality?

•  Do we have a Big Data problem in New Zealand?

•  Do we have a Big Data problem in my organisation?

•  Do I really need to care?

Page 4: Mad skills new analysis practices for big data

Big Data: Hype or Reality?

•  Do we have a Big Data problem in New Zealand? Maybe.

•  Do we have a Big Data problem in my organisation? Maybe.

•  Do I really need to care? ABSOLUTELY.

•  Big Data Practices is about widely applicable

Page 5: Mad skills new analysis practices for big data

Today

Shadow systems

‘Shallow’ Business Intelligence

Static schemas accrete over time

Slow-moving models

Slow-moving data

Departmental warehouses

Data Sources

Images

EDW

Page 6: Mad skills new analysis practices for big data

A Common Analytics Environment

Data Warehouses, File Systems

SAS/ACCESS

Page 7: Mad skills new analysis practices for big data

A Common Analytics Environment

Data Warehouse, File Systems

Cluster Computer

SAS/CONNECT SAS/ACCESS

Page 8: Mad skills new analysis practices for big data

A Common Analytics Environment

Data Warehouse, File Systems

Cluster Computer

SAS/CONNECT SAS/ACCESS

Bag of Tricks

•  DATA Step Bravado

•  SQL Pushdown

•  SASFILE

•  SAS Views

•  Compression

Page 9: Mad skills new analysis practices for big data

A Leaner Configuration

Parallel Database

SAS/ACCESS

Greenplum

Page 10: Mad skills new analysis practices for big data

An Integrated Architecture

Parallel Database

SAS/ACCESS SAS/CONNECT

Greenplum

Page 11: Mad skills new analysis practices for big data

SAS High Performance Appliance on Greenplum

Parallel Greenplum Database

SAS/ACCESS SAS/CONNECT

Greenplum

Page 12: Mad skills new analysis practices for big data

SAS-GP High-Performance Analytics

Worker Node 1

Worker Node 2

Worker Node N

Master

Analytical Computation and data request sent to the worker nodes

Page 13: Mad skills new analysis practices for big data

SAS-GP High-Performance Analytics

Worker Node 1

Worker Node 2

Worker Node N

Master

Data request sent to the database, data slice moved into memory

Page 14: Mad skills new analysis practices for big data

SAS-GP High-Performance Analytics

Worker Node 1

Worker Node 2

Worker Node N

Master

Analytic Processing with internode communication

Page 15: Mad skills new analysis practices for big data

SAS-GP High-Performance Analytics

Worker Node 1

Worker Node 2

Worker Node N

Master

Worker node results returned to the Master Node, finalize computation

Page 16: Mad skills new analysis practices for big data

SAS-GP High-Performance Analytics

Worker Node 1

Worker Node 2

Worker Node N

Root Node

Result returned to the client
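
The five slides above describe a scatter/gather flow: the master sends the computation to the workers, each worker summarises its own data slice in memory, and the master combines the partial results before returning them to the client. A minimal Python sketch of that flow follows. It is an illustration only, not the SAS or Greenplum implementation: threads stand in for worker nodes, the function names `worker_partial` and `master_mean_variance` are hypothetical, and the internode communication step is omitted because a mean and variance need none.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_partial(data_slice):
    """Worker step: summarise the local slice held in memory."""
    n = len(data_slice)
    total = sum(data_slice)
    total_sq = sum(v * v for v in data_slice)
    return n, total, total_sq


def master_mean_variance(slices):
    """Master step: send the computation to every worker, then combine."""
    # Threads are an assumption standing in for separate worker nodes.
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(worker_partial, slices))
    # Finalise the computation from the small per-worker summaries.
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, variance


# Each inner list plays the role of one node's data slice.
slices = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, variance = master_mean_variance(slices)
```

Only three numbers per worker travel back to the master, regardless of how many rows each slice holds.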

Page 17: Mad skills new analysis practices for big data

How do we sort out this mess?

Shadow systems

‘Shallow’ Business Intelligence

Static schemas accrete over time

Slow-moving models

Slow-moving data

Departmental warehouses

Data Sources

Images

Page 18: Mad skills new analysis practices for big data

Keep the Enterprise Data Warehouse

Enterprise Data Warehouse

•  Single Source of Truth

•  Heavy data governance and quality

•  Operational reporting

•  Financial consolidation

Page 19: Mad skills new analysis practices for big data

Add an Analytics Data Cloud as a Complement

Commodity Hardware

Virtual Machines

Public Cloud

SAS/Greenplum/Hadoop

Enterprise Data Warehouse

•  Single Source of Truth

•  Heavy data governance and quality

•  Operational reporting

•  Financial consolidation

Analytics Data Cloud

•  Source of all raw data (often 10X size of EDW)

•  Self-service infrastructure to support multiple marts and sandboxes

•  Ad hoc, business-led analytics solutions

Page 20: Mad skills new analysis practices for big data

MAD Analytics

Page 21: Mad skills new analysis practices for big data

Magnetic

ADC PLATFORM

Data

Analytics

Chorus

Agile

Data Fast ETL/ELT

Simple linear models

Trend analysis

Model selection

Model design

Page 22: Mad skills new analysis practices for big data

Agile

get data into the cloud

analyze and model in the cloud

push results back into the cloud

Page 23: Mad skills new analysis practices for big data

Deep

Past, Facts: What happened where and when?

Past, Interpretation: How and why did it happen?

Future, Facts: What will happen?

Future, Interpretation: How can we do better?

Page 24: Mad skills new analysis practices for big data

Different Phases of Analytics

Phases: Data Exploration, Data Prep, Modeling, Model Fit, Scoring

DATA EXPLORATION: Frequency, Histogram, Bar Chart, Box Plot Chart, Correlation Matrix

TRANSFORMATION: Aggregation, Row Filtering, Deriving New Variables, Pivoting, Normalizing

PREDICTIVE MODEL: Linear Regression, Logistic Regression, Naïve Bayes Classifier, Decision Trees, Neural Networks

DATA MINING: Association Rule, K-means Clustering

MODEL FIT STATISTICS: Goodness of Fit, ROC, Significance statistics for all independent variables

SCORING: Linear Regression, Logistic Regression, Naïve Bayes Classifier, Decision Trees, Neural Networks

Page 25: Mad skills new analysis practices for big data

In-Database Machine Learning

•  Goal: Build models using all available data

•  Principle: Avoid using samples if possible.

•  Principle: Bring computation to data, not the other way round.

•  In practice: Write machine learning algorithms in (parallelised) data languages like SQL, SAS, and MapReduce.
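
As a small illustration of bringing computation to the data, the Python sketch below fits a one-variable least-squares line using only SQL aggregates, so the database scans every row and hands back five numbers. SQLite (via the standard `sqlite3` module) is an assumption standing in for a parallel database, and the table `obs` and its values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],  # illustrative data
)

# The database does the full scan; only the sufficient statistics
# (count and four sums) leave it, never the raw rows.
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM obs"
).fetchone()

# Closed-form least-squares estimates from the sufficient statistics.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
```

Because sums are additive, the same query parallelises naturally: each node can sum its own rows and the partial sums can be added together.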

Page 26: Mad skills new analysis practices for big data

Design Pattern – Online Learning

•  Process data one example at a time using an incrementally maintained model; adjust the model every time we make a prediction error

•  Examples: perceptron, online SVMs, Bayesian filters, etc.

•  Such algorithms can be implemented using SAS DATA steps or SQL aggregate functions
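
The perceptron named above is perhaps the simplest instance of this pattern. The sketch below is in Python rather than a SAS DATA step or SQL aggregate, and the stream of examples is invented for illustration. The model is a weight list plus a bias, and it changes only when the current example is misclassified.

```python
def perceptron_update(weights, bias, features, label, rate=1.0):
    """Return the model after seeing one example; label is +1 or -1."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    predicted = 1 if activation > 0 else -1
    if predicted != label:  # adjust only on a prediction error
        weights = [w + rate * label * x for w, x in zip(weights, features)]
        bias = bias + rate * label
    return weights, bias


def predict(weights, bias, features):
    """Classify one example with the current model."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else -1


# Illustrative stream of (features, label) pairs, processed in one pass.
stream = [([2.0, 1.0], 1), ([1.0, 3.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
weights, bias = [0.0, 0.0], 0.0
for features, label in stream:
    weights, bias = perceptron_update(weights, bias, features, label)
```

The update function takes the old state and one row and returns the new state, which is the same shape as the transition function of a user-defined SQL aggregate.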

Page 27: Mad skills new analysis practices for big data

Design Pattern – Parallel Ensemble Learning

•  Break a (large) dataset into i.i.d. subsets, one residing on each node, learn a model on each subset in parallel, and then combine the models appropriately

•  Examples: random forests, ensembles of SVMs, etc.
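
A minimal Python sketch of the pattern, under simplifying assumptions: each subset trains a nearest-centroid classifier on one numeric feature (a stand-in for the heavier models named above), threads stand in for nodes, and the models are combined by majority vote. The data and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_centroids(subset):
    """Learn one nearest-centroid model from a subset of (x, label) pairs."""
    sums, counts = {}, Counter()
    for x, label in subset:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_one(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))


def predict_ensemble(models, x):
    """Combine the per-node models by majority vote."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Each inner list plays the role of the subset residing on one node.
subsets = [
    [(0.9, "low"), (1.2, "low"), (4.8, "high"), (5.1, "high")],
    [(1.0, "low"), (0.7, "low"), (5.3, "high"), (4.9, "high")],
    [(1.4, "low"), (1.1, "low"), (5.0, "high"), (4.6, "high")],
]
with ThreadPoolExecutor(max_workers=len(subsets)) as pool:
    models = list(pool.map(train_centroids, subsets))
```

Calling `predict_ensemble(models, 1.3)` asks each of the three models for a vote and returns the most common answer. Training needs no communication between nodes, which is what makes the pattern easy to parallelise.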

Page 28: Mad skills new analysis practices for big data

Design Pattern – MapReduce

•  Repeatedly apply a Map function to transform (local) chunks of data and then use a Reduce function to consolidate the transformed results

•  Examples: parallel LDA, k-Means, Naive Bayes, etc.
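
The k-Means example can be sketched in a few lines of Python. This is a single-process illustration on one-dimensional data, not a Hadoop or Greenplum MapReduce job; the function names and the data are hypothetical. Each iteration maps every chunk to per-centroid (sum, count) pairs, reduces those pairs by addition, and recomputes the centroids.

```python
from collections import defaultdict
from functools import reduce


def map_chunk(chunk, centroids):
    """Map: for one local chunk, emit {centroid index: (sum, count)}."""
    partial = defaultdict(lambda: (0.0, 0))
    for x in chunk:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        s, c = partial[nearest]
        partial[nearest] = (s + x, c + 1)
    return dict(partial)


def reduce_partials(left, right):
    """Reduce: merge two partial results by adding sums and counts."""
    merged = dict(left)
    for idx, (s, c) in right.items():
        ms, mc = merged.get(idx, (0.0, 0))
        merged[idx] = (ms + s, mc + c)
    return merged


def kmeans(chunks, centroids, iterations=10):
    """Repeat map and reduce, moving each centroid to its cluster mean."""
    for _ in range(iterations):
        partials = [map_chunk(chunk, centroids) for chunk in chunks]
        totals = reduce(reduce_partials, partials, {})
        centroids = [
            totals[i][0] / totals[i][1] if i in totals else centroids[i]
            for i in range(len(centroids))
        ]
    return centroids


# Each inner list plays the role of one node's local chunk.
chunks = [[1.0, 2.0, 9.0], [3.0, 10.0], [11.0]]
centroids = kmeans(chunks, [0.0, 5.0])
```

The reduce step is associative, so partial results can be merged in any order or in a tree across nodes.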

Page 29: Mad skills new analysis practices for big data

Design Pattern – Prediction Markets

Page 30: Mad skills new analysis practices for big data

Japanese Telco: What People Are Talking About

Page 31: Mad skills new analysis practices for big data

Traffic Network Modelling

Page 32: Mad skills new analysis practices for big data

Massively Parallel Model Learning

•  Solving tens of thousands of statistical modelling problems, one for each road in the city, in parallel:

libname adc greenplm server=gplum db=traffic port=5432 user=keesiong …

proc sql;
select origin, dest,
       linregr(travel_time, array[peak_period(entry_time), …, origin_vol, dest_vol])
from adc.route_travel_info
group by origin, dest;

•  A model: t(x) = 466 + 7.72 peakPeriod(x) + 22.5 workDay(x) + 0.378 originVol(x) + 0.691 destVol(x)
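
The query above relies on an aggregate that fits one regression per group. A rough analogue can be sketched in Python with SQLite's user-defined aggregates. Everything here is an assumption made for illustration: `linregr_1d` is a hypothetical one-predictor stand-in for `linregr`, the table holds a handful of invented rows, and SQLite runs in a single process rather than in parallel.

```python
import sqlite3


class LinRegr1D:
    """Hypothetical one-predictor stand-in for the slide's linregr aggregate."""

    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0

    def step(self, y, x):
        # Called once per row of the group: accumulate sufficient statistics.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def finalize(self):
        # Called once per group: turn the statistics into coefficients.
        denom = self.n * self.sxx - self.sx * self.sx
        if denom == 0:
            return None  # slope is undefined for this group
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return f"{intercept:.3f},{slope:.3f}"


conn = sqlite3.connect(":memory:")
conn.create_aggregate("linregr_1d", 2, LinRegr1D)
conn.execute(
    "CREATE TABLE route_travel_info "
    "(origin TEXT, dest TEXT, origin_vol REAL, travel_time REAL)"
)
conn.executemany(
    "INSERT INTO route_travel_info VALUES (?, ?, ?, ?)",
    [  # invented rows, two routes
        ("A", "B", 10, 120), ("A", "B", 20, 140), ("A", "B", 30, 160),
        ("B", "C", 10, 305), ("B", "C", 20, 310), ("B", "C", 40, 320),
    ],
)
# One model per (origin, dest) group, as in the slide's query.
models = conn.execute(
    "SELECT origin, dest, linregr_1d(travel_time, origin_vol) "
    "FROM route_travel_info GROUP BY origin, dest ORDER BY origin, dest"
).fetchall()
```

Each result row carries the route and its fitted "intercept,slope" pair. In a parallel database the groups can be fitted on different nodes at once, which is how one query can solve a separate model for every road.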

Page 33: Mad skills new analysis practices for big data

It’s MAD, but is it Mad?

Commodity Hardware

Virtual Machines

Public Cloud

SAS/Greenplum/Hadoop

Enterprise Data Warehouse

•  Single Source of Truth

•  1 Logical Model

•  Heavy data governance and quality

•  Operational reporting

•  Financial consolidation

Analytics Data Cloud

•  Source of all the raw data (often 10X size of the EDW)

•  Self-service infrastructure to support multiple marts and sandboxes

•  Rapid analytic iteration, and business-led solutions

Page 34: Mad skills new analysis practices for big data

Public Cloud Computing Services

Page 35: Mad skills new analysis practices for big data

Democratisation of Data

Page 36: Mad skills new analysis practices for big data

Democratisation of Data

Page 37: Mad skills new analysis practices for big data

Democratisation of Data

Page 38: Mad skills new analysis practices for big data

Helping Organizations Evolve From This…

Line Of Business User

Business

I.T. Department

Database Administrator

Business Intelligence Analyst

Page 39: Mad skills new analysis practices for big data

To This…

Line Of Business User

Data Platform Administrator

Business Intelligence Analysts

Data Scientists

Page 40: Mad skills new analysis practices for big data

Top 3 Steps You Should Take On Your Journey To Big Data Analytics

Page 41: Mad skills new analysis practices for big data

3. Put all your data to work.

Page 42: Mad skills new analysis practices for big data

2. Have a data strategy.

Model less, iterate more.

Page 43: Mad skills new analysis practices for big data

1. First invest in people, then technology.

Page 44: Mad skills new analysis practices for big data

“A journey of a thousand miles begins with a single step”

- Lao-tzu, Chinese philosopher (531 BC)

First Step: Walk towards the SAS-Greenplum Technical Session.

