Mad skills new analysis practices for big data

Post on 20-Jan-2015

1 © Copyright 2012 EMC Corporation. All rights reserved.

MAD SKILLS: NEW ANALYSIS PRACTICES FOR BIG DATA

Big Data Has Arrived

THE DIGITAL UNIVERSE WILL GROW 44X IN THE NEXT 10 YEARS

Source: 2011 IDC Digital Universe Study

Big Data: Hype or Reality?

•  Do we have a Big Data problem in New Zealand?

•  Do we have a Big Data problem in my organisation?

•  Do I really need to care?

Big Data: Hype or Reality?

•  Do we have a Big Data problem in New Zealand? Maybe.

•  Do we have a Big Data problem in my organisation? Maybe.

•  Do I really need to care? ABSOLUTELY.

•  Big Data Practices is about widely applicable

Today

•  Shadow systems
•  ‘Shallow’ Business Intelligence
•  Static schemas accrete over time
•  Slow-moving models
•  Slow-moving data
•  Departmental warehouses
•  Data Sources
•  Images
•  EDW

A Common Analytics Environment

Data Warehouses, File Systems

SAS/ACCESS

A Common Analytics Environment

Data Warehouse, File Systems

Cluster Computer

SAS/CONNECT SAS/ACCESS

A Common Analytics Environment

Data Warehouse, File Systems

Cluster Computer

SAS/CONNECT SAS/ACCESS

Bag of Tricks
- DATA Step Bravado
- SQL Pushdown
- SASFILE
- SAS Views
- Compression

A Leaner Configuration

Parallel Database

SAS/ACCESS

Greenplum

An Integrated Architecture

Parallel Database

SAS/ACCESS SAS/CONNECT

Greenplum

SAS High Performance Appliance on Greenplum

Parallel Greenplum Database

SAS/ACCESS SAS/CONNECT

Greenplum

SAS-GP High-Performance Analytics

Worker Node 1

Worker Node 2

Worker Node N

Master

Analytical Computation and data request sent to the worker nodes

SAS-GP High-Performance Analytics

Worker Node 1

Worker Node 2

Worker Node N

Master

Data request sent to the database, data slice moved into memory

SAS-GP High-Performance Analytics

Worker Node 1

Worker Node 2

Worker Node N

Master

Analytic Processing with internode communication

SAS-GP High-Performance Analytics

Worker Node 1

Worker Node 2

Worker Node N

Master

Worker node results returned to the Master Node, finalize computation

SAS-GP High-Performance Analytics

Worker Node 1

Worker Node 2

Worker Node N

Root Node

Result returned to the client
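
The request flow on the SAS-GP slides (computation sent to the workers, each worker loading its own data slice, partial results returned to the master, final result returned to the client) can be sketched in a few lines of Python. This is only an illustration and not how the SAS or Greenplum software is implemented: threads stand in for worker nodes, the data slices are made-up values, the names `worker` and `master` are hypothetical, and the internode communication step is left out.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data slices, one per worker node (illustrative values only).
SLICES = [
    [2.0, 4.0, 6.0],
    [1.0, 3.0],
    [5.0, 7.0, 9.0, 11.0],
]

def worker(data_slice):
    """Runs on a worker: read the local slice and compute partial sums."""
    n = len(data_slice)
    total = sum(data_slice)
    total_sq = sum(x * x for x in data_slice)
    return n, total, total_sq

def master(slices):
    """Send the computation to every worker, then finalize the result."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(worker, slices))
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, variance

# The "client" only ever sees the finalized result.
mean, variance = master(SLICES)
print(mean, variance)
```

Only three numbers per worker travel back to the master, regardless of how large each slice is.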

How do we sort out this mess?

•  Shadow systems
•  ‘Shallow’ Business Intelligence
•  Static schemas accrete over time
•  Slow-moving models
•  Slow-moving data
•  Departmental warehouses
•  Data Sources
•  Images

Keep the Enterprise Data Warehouse

Enterprise Data Warehouse

•  Single Source of Truth
•  Heavy data governance and quality
•  Operational reporting
•  Financial consolidation

Add an Analytics Data Cloud as a Complement

Commodity Hardware

Virtual Machines

Public Cloud

SAS/Greenplum/Hadoop

Enterprise Data Warehouse

•  Single Source of Truth
•  Heavy data governance and quality
•  Operational reporting
•  Financial consolidation

Analytics Data Cloud

•  Source of all raw data (often 10X size of EDW)
•  Self-service infrastructure to support multiple marts and sandboxes
•  Ad hoc, business-led analytics solutions

MAD Analytics

Magnetic

ADC PLATFORM

•  Data
•  Analytics
•  Chorus
•  Agile
•  Data
•  Fast ETL/ELT
•  Simple linear models
•  Trend analysis
•  Model selection
•  Model design

Agile

•  get data into the cloud
•  analyze and model in the cloud
•  push results back into the cloud

Deep

Past, Facts: What happened where and when?
Past, Interpretation: How and why did it happen?
Future, Facts: What will happen?
Future, Interpretation: How can we do better?

Different Phases of Analytics

Phases: Data Exploration, Data Prep, Modeling, Model Fit, Scoring

DATA EXPLORATION
•  Frequency
•  Histogram
•  Bar Chart
•  Box Plot Chart
•  Correlation Matrix

TRANSFORMATION
•  Aggregation
•  Row Filtering
•  Deriving New Variables
•  Pivoting
•  Normalizing

PREDICTIVE MODEL
•  Linear Regression
•  Logistic Regression
•  Naïve Bayes Classifier
•  Decision Trees
•  Neural Networks

DATA MINING
•  Association Rule
•  K-means Clustering

MODEL FIT STATISTICS
•  Goodness of Fit
•  ROC
•  Significance statistics for all independent variables

SCORING
•  Linear Regression
•  Logistic Regression
•  Naïve Bayes Classifier
•  Decision Trees
•  Neural Networks

In-Database Machine Learning

•  Goal: Build models using all available data

•  Principle: Avoid using samples if possible.

•  Principle: Bring computation to data, not the other way round.

•  In practice: Write machine learning algorithms in (parallelised) data languages like SQL, SAS, and MapReduce.
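
As a small illustration of bringing the computation to the data, the sketch below fits a one-variable linear regression using nothing but SQL aggregates. It assumes Python's built-in `sqlite3` as a stand-in database (it is not parallel, unlike the systems discussed here), and the table `obs` and its contents are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
# Hypothetical data lying exactly on y = 1 + 2x.
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(float(x), 1.0 + 2.0 * x) for x in range(10)],
)

# The database scans every row; only five summary numbers leave it.
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM obs"
).fetchone()

# Closed-form least-squares estimates from the aggregated sums.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(slope, intercept)
```

Because sums can be computed per partition and then added together, the same query shape suits a parallel database: no sampling is needed and no raw rows are moved to the client.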

Design Pattern – Online Learning

•  Process data one record at a time using an incrementally maintained model; adjust the model every time it makes a prediction error

•  Examples: perceptron, online SVMs, Bayesian filters, etc.

•  Such algorithms can be implemented using SAS DATA steps or SQL aggregate functions
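
A minimal sketch of the online pattern, using the perceptron named in the examples. It is written in Python rather than a SAS DATA step or SQL aggregate, and the function names and the four training records are hypothetical.

```python
def train_perceptron(records, n_features, epochs=1):
    """Online perceptron: the weights change only after a prediction error."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, label in records:         # label is +1 or -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            predicted = 1 if score > 0 else -1
            if predicted != label:       # prediction error: adjust the model
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
    return w, b

def predict(w, b, x):
    """Classify one record with the current model."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical, linearly separable records.
DATA = [((2.0, 0.0), 1), ((0.0, 2.0), -1), ((3.0, 1.0), 1), ((1.0, 3.0), -1)]
w, b = train_perceptron(DATA, n_features=2, epochs=10)
print(w, b)
```

The model state is just `w` and `b`, which is why the pattern maps naturally onto a running accumulator such as a user-defined aggregate.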

Design Pattern – Parallel Ensemble Learning

•  Break a (large) dataset into i.i.d. subsets, one residing on each node, learn a model on each subset in parallel, and then combine the models appropriately

•  Examples: random forests, ensembles of SVMs, etc.
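
The pattern can be sketched with a deliberately tiny model. Below, each simulated node learns a one-threshold classifier (a decision stump) on its own random subset, and the ensemble combines them by majority vote. The data, the use of threads in place of nodes, and the helper names are all assumptions made for this illustration; it is not a random forest.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fit_stump(rows):
    """Pick the threshold on x that misclassifies the fewest rows in this subset."""
    best_t, best_err = None, None
    for t, _ in rows:
        err = sum((x > t) != y for x, y in rows)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def ensemble_predict(thresholds, x):
    """Combine the per-subset models by majority vote."""
    votes = sum(x > t for t in thresholds)
    return votes * 2 > len(thresholds)

# Hypothetical data: the label is True when x > 50.
rng = random.Random(0)
rows = [(x, x > 50) for x in (rng.uniform(0, 100) for _ in range(600))]
rng.shuffle(rows)                         # random assignment keeps subsets i.i.d.
subsets = [rows[i::3] for i in range(3)]  # one subset per simulated node

with ThreadPoolExecutor(max_workers=3) as pool:
    thresholds = list(pool.map(fit_stump, subsets))

print(thresholds, ensemble_predict(thresholds, 90.0))
```

No subset ever sees another subset's rows; only the fitted thresholds are gathered for the combination step.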

Design Pattern – MapReduce

•  Repeatedly apply a Map function to transform (local) chunks of data and then use a Reduce function to consolidate the transformed results

•  Examples: parallel LDA, k-Means, Naive Bayes, etc.
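
As one possible reading of this pattern, here is k-Means on one-dimensional data written as repeated map and reduce steps. The chunks, the starting centroids, and the function names are hypothetical, and everything runs in a single process; a real deployment would run `map_chunk` where each chunk is stored.

```python
from functools import reduce

def map_chunk(chunk, centroids):
    """Map: assign each local point to its nearest centroid; emit (sum, count) per centroid."""
    stats = [(0.0, 0) for _ in centroids]
    for x in chunk:
        k = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        s, c = stats[k]
        stats[k] = (s + x, c + 1)
    return stats

def reduce_stats(a, b):
    """Reduce: merge two partial results centroid by centroid."""
    return [(sa + sb, ca + cb) for (sa, ca), (sb, cb) in zip(a, b)]

def kmeans(chunks, centroids, iterations=10):
    """Repeat map and reduce, moving each centroid to the mean of its points."""
    for _ in range(iterations):
        merged = reduce(reduce_stats, (map_chunk(c, centroids) for c in chunks))
        # Keep the old centroid if no point was assigned to it.
        centroids = [s / c if c else old for (s, c), old in zip(merged, centroids)]
    return centroids

# Hypothetical 1-D data, already split into chunks as if stored on three nodes.
CHUNKS = [[1.0, 2.0, 9.0], [3.0, 10.0], [11.0, 2.0]]
centroids = kmeans(CHUNKS, centroids=[0.0, 5.0])
print(centroids)
```

Each map call returns one (sum, count) pair per centroid, so the amount of data consolidated by the reduce step does not grow with the chunk size.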

Design Pattern – Prediction Markets

Japanese Telco: What People Are Talking About

Traffic Network Modelling

Massively Parallel Model Learning

•  Solving tens of thousands of statistical modelling problems, one for each road in the city, in parallel:

    libname adc greenplm server=gplum db=traffic port=5432 user=keesiong …

    proc sql;
      select origin, dest,
             linregr(travel_time, array[peak_period(entry_time), …, origin_vol, dest_vol])
      from adc.route_travel_info
      group by origin, dest;

•  A model: t(x) = 466 + 7.72 peakPeriod(x) + 22.5 workDay(x) + 0.378 originVol(x) + 0.691 destVol(x)
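
The "one model per group" idea behind the query above can be sketched outside the database as follows. This is a simplified stand-in and not the `linregr` function itself: it fits a single-predictor line per route (travel time against origin volume only), and the rows, route names, and `fit_line` helper are invented for the example.

```python
from collections import defaultdict

def fit_line(points):
    """Ordinary least squares for y = a + b*x from (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Hypothetical rows: (origin, dest, origin_vol, travel_time). Values are made up.
ROWS = [
    ("A", "B", 10.0, 105.0), ("A", "B", 20.0, 110.0), ("A", "B", 30.0, 115.0),
    ("B", "C", 10.0, 220.0), ("B", "C", 20.0, 240.0), ("B", "C", 40.0, 280.0),
]

# Equivalent of GROUP BY origin, dest.
groups = defaultdict(list)
for origin, dest, vol, travel_time in ROWS:
    groups[(origin, dest)].append((vol, travel_time))

# One independent model per route; each fit could run on a different node.
models = {route: fit_line(points) for route, points in groups.items()}
print(models)
```

Since the groups share no state, the fits are trivially parallel, which is what makes solving tens of thousands of such problems at once feasible.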

It’s MAD, but is it Mad?

Commodity Hardware

Virtual Machines

Public Cloud

SAS/Greenplum/Hadoop

Enterprise Data Warehouse

•  Single Source of Truth
•  1 Logical Model
•  Heavy data governance and quality
•  Operational reporting
•  Financial consolidation

Analytics Data Cloud

•  Source of all the raw data (often 10X size of the EDW)
•  Self-service infrastructure to support multiple marts and sandboxes
•  Rapid analytic iteration, and business-led solutions

Public Cloud Computing Services

Democratisation of Data

Helping Organizations Evolve From This…

Line Of Business User

Business

I.T. Department

Database Administrator

Business Intelligence Analyst

To This…

Line Of Business User

Data Platform Administrator

Business Intelligence Analysts

Data Scientists

Top 3 Steps You Should Take On Your Journey To Big Data Analytics

3. Put all your data to work.

2. Have a data strategy. Model less, iterate more.

1. First invest in people, then technology.

“A journey of a thousand miles begins with a single step”

- Lao-tzu, Chinese philosopher (531 BC)

First Step: Walk towards the SAS-Greenplum Technical Session.