
MAD Skills: New Analysis Practices for Big Data

Page 1: MAD Skills: New Analysis Practices for Big Data

MAD Skills: New Analysis Practices for Big Data

Slides courtesy of the original paper and Christan Grant’s slides

Presented by Long Pham

11/11/2014

Page 2: MAD Skills: New Analysis Practices for Big Data

MAD Skills: New Analysis Practices for Big Data

• Authors

• Jeff Cohen – Greenplum

• Brian Dolan – Fox Audience Network

• Mark Dunlap – Evergreen Technologies

• Joseph M Hellerstein – UC Berkeley

• Caleb Welton – Greenplum

• Presented at the Very Large Data Bases (VLDB) conference, 2009, in Lyon, France

Page 3: MAD Skills: New Analysis Practices for Big Data

What not to expect…

• A smart system that supports novice users

• Diagrams and pictures

• Quantitative experiments

• Formal proofs

Page 4: MAD Skills: New Analysis Practices for Big Data

What to expect…

• A smart system that supports smart users

• Analysis

• Implementation examples

• Real application scenarios

• Reflection and future discussion

Page 5: MAD Skills: New Analysis Practices for Big Data

“If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.”

– Prof. Hal Varian, Chief Economist at Google

Page 6: MAD Skills: New Analysis Practices for Big Data

Traditionally, data analytics (OLAP) meant a well-structured data warehouse

• A single, expensive center dedicated to analytics (separate from OLTP)

• Pre-materialization for pre-defined tasks

• Jealously guarded by engineers

• To ensure high-quality integration

Page 7: MAD Skills: New Analysis Practices for Big Data

Things are changing towards decentralized analytics centers

• Cheap storage

• World’s largest data warehouse 10 years ago ~ $100 of storage nowadays

• Massive data

• Even from a single source, such as clicks

• Popular analytics

• Proven to be profitable

Page 8: MAD Skills: New Analysis Practices for Big Data

A new paradigm is needed: Magnetic, Agile, Deep

• Magnetic: attracts all data sources regardless of quality

• vs. a single center

• Agile: continuously adaptive structure

• vs. a rigid, well-structured architecture

• Deep: supports sophisticated algorithms

• vs. being limited to roll-up, drill-down, etc.

Page 9: MAD Skills: New Analysis Practices for Big Data

Agenda

• MAD vs. traditions

• Scenario: Fox Audience Network (FAN)

• MAD design

• MAD algebra implementation

• Variable types

• Corresponding operators

• MAD system implementation

• Reflections & Directions

Page 11: MAD Skills: New Analysis Practices for Big Data

MAD is deeper than Data Cubes: inferential vs. descriptive

• Descriptive Data Cubes:

• Roll-up, drill-down, etc.

• To gain understanding

• Inferential MAD:

• Fits data with models, e.g., a Gaussian distribution

• Deeper understanding:

• Robust to outliers

• Robust to the particulars of a specific dataset

• Enables advanced tasks:

• Prediction

• Causality analysis

• Distributional comparison

Tags: Deep
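The descriptive/inferential contrast can be illustrated with a small sketch. It is only an illustration: the click counts are made up, and the three-sigma cutoff is an assumed convention, not something the slides or the paper prescribe. The descriptive step just totals the data, while the inferential step fits a Gaussian to past observations and asks how surprising a new observation is.

```python
import statistics

def fit_gaussian(values):
    """Maximum-likelihood Gaussian fit: returns (mean, standard deviation)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values, mu)  # population formula is the MLE
    return mu, sigma

# Hypothetical daily click counts (assumption, not data from the paper).
history = [98, 102, 101, 97, 103, 99, 100]
new_value = 160

# Descriptive: a roll-up style aggregate.
total_clicks = sum(history)

# Inferential: fit a model, then judge a new observation against it.
mu, sigma = fit_gaussian(history)
z_score = (new_value - mu) / sigma
is_outlier = abs(z_score) > 3  # assumed three-sigma rule
```

The roll-up answers "how much?", while the fitted model supports the advanced tasks listed on the slide, such as flagging outliers or comparing distributions.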

Page 12: MAD Skills: New Analysis Practices for Big Data

MAD is closer to the data than stats software

• Stats software examples: Matlab, R, etc.

• Runs directly in the database vs. loading data into the software

• Distributed vs. in-memory data

Tags: Agile, Magnetic

Page 13: MAD Skills: New Analysis Practices for Big Data

MAD is a more extensible ecosystem than current MapReduce

• Current MapReduce: complicated algorithms are black boxes

• MAD advocates a more extensible and modifiable ecosystem

• Currently: SQL-based

• Possibly: MapReduce-based

Tags: Deep, Agile, Magnetic

Page 15: MAD Skills: New Analysis Practices for Big Data

Fox Audience Network

• Served MySpace.com, IGN.com, Scout.com, etc.

• About 150 million users

• Ad network, bought by the Rubicon Project in Nov 2010

Page 16: MAD Skills: New Analysis Practices for Big Data

It is big!

• 42 nodes (2 masters, 40 workers), each a Sun X4500 “Thumper”

• 48 × 500 GB drives

• 16 GB RAM

• ~5 TB daily

• One table with 1.5 trillion rows

• Many different types of user workloads

• Dynamic query ecosystems

Tags: Magnetic

Page 17: MAD Skills: New Analysis Practices for Big Data

It has various query types

• How many female WWF enthusiasts under the age of 30 visited the Toyota community over the last four days and saw a medium rectangle?

• Ad hoc and expensive: can it be made fast?

• How are these people similar to those that visited Nissan?

• Open-ended; requires some statistics and an analyst in the loop.

Tags: Agile, Deep

Page 19: MAD Skills: New Analysis Practices for Big Data

MAD Design Requirements

• Loading shouldn’t take too long

• Up-front integration and cleaning routines are unaffordable

• Analysts tolerate noise, in exchange for

• Being the first to analyze the data

• Running sophisticated analyses

• Besides, it is advisable to have a single data center

• Decentralized logically, not physically

• Thus:

• The data warehouse can be improved gradually

• Analysts must be armed with the necessary tools

Tags: Agile, Deep, Magnetic

Page 20: MAD Skills: New Analysis Practices for Big Data

MAD philosophy: 3 logical layers that let analysts touch data as soon as possible

• Reporting – specialized static aggregates

• For novice analysts

• Specialized

• Tuned for performance

• Production – aggregates used by most users

• For more advanced analysts

• Armed with common aggregations

• Staging – raw tables and logs

• For engineers and some analysts

• Besides these, Sandboxes – a playground for analysts

Page 21: MAD Skills: New Analysis Practices for Big Data

Challenge: Provide powerful tools for analysts to agilely keep up with magnetic data & go deep!

Page 23: MAD Skills: New Analysis Practices for Big Data

MADlib = “RDBMS” + Stats + Math + Machine learning

• Built on an RDBMS (PostgreSQL)

• One single native type: scalar (value)

• Advances to more math-friendly and stats-friendly types, with corresponding operations:

Scalar: 0.1, 0.2…

Vector: [0.1, 0.2], [0.2, 0.4]

Matrix: [[0.1, 0.2], [0.2, 0.4]]

Function: probability density function f(.)

Functional: Mann-Whitney U test on distributions f(.) and g(.)

• Enables stats-friendly and ML-friendly operations:

Resampling

• Through user-defined functions (UDFs)

Tags: Magnetic, Agile, Deep

Page 24: MAD Skills: New Analysis Practices for Big Data

Scalar operations are already implemented in the RDBMS

• SELECT 5*4;

• SELECT sqrt(64);

• SELECT cos(-3.14159 * sqrt(2) / 2 );

Page 25: MAD Skills: New Analysis Practices for Big Data

Vectors/Matrices can be treated as relation objects in an object-relational database

• Matrix = (row_number integer, vector numeric[])

• Postgres has the extension!

• Summation, product, and dot product are trivial
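A minimal sketch of this layout, written in Python rather than SQL: a matrix is assumed to be a list of (row_number, vector) tuples mirroring the relation on the slide. The function names are illustrative and are not part of PostgreSQL or of the paper's library.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

def vector_add(u, v):
    """Element-wise sum of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a + b for a, b in zip(u, v)]

def matrix_vector_product(matrix, v):
    """Multiply a (row_number, vector) relation by a vector, one row at a time."""
    return [(row_number, dot(row, v)) for row_number, row in matrix]

# One tuple per matrix row, as in Matrix = (row_number, vector).
matrix = [(1, [1.0, 2.0]), (2, [3.0, 4.0])]
product = matrix_vector_product(matrix, [1.0, 1.0])
```

Each output tuple depends on a single input row, which is why these operators map naturally onto per-row evaluation in a database.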

Page 26: MAD Skills: New Analysis Practices for Big Data

Matrices may use other data layouts to facilitate a particular operator, such as transpose

• If using the previous representation

• If using a sparse representation:

• (row number, column number, value)

• Trivial transpose

• Fast multiplication
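The sparse layout can be tried out with a short sketch. It uses SQLite through Python's `sqlite3` module purely as a stand-in for PostgreSQL, and the table name `a` and column names `row_num`, `col_num`, `val` are hypothetical. Transpose becomes a swap of two columns, and multiplication becomes a join plus a grouped sum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (row_num INTEGER, col_num INTEGER, val REAL)")
# A = [[1, 2], [0, 3]]; the zero entry is simply not stored.
conn.executemany(
    "INSERT INTO a VALUES (?, ?, ?)",
    [(1, 1, 1.0), (1, 2, 2.0), (2, 2, 3.0)],
)

# Transpose: swap the row and column indices.
transpose = conn.execute(
    "SELECT col_num AS row_num, row_num AS col_num, val "
    "FROM a ORDER BY 1, 2"
).fetchall()

# Product A * A^T: entry (i, j) sums A[i][k] * A[j][k] over the shared k.
product = conn.execute(
    "SELECT x.row_num, y.row_num, SUM(x.val * y.val) "
    "FROM a AS x JOIN a AS y ON x.col_num = y.col_num "
    "GROUP BY x.row_num, y.row_num ORDER BY 1, 2"
).fetchall()
conn.close()
```

Only stored (non-zero) entries take part in the join, which is where the speed-up for sparse matrices comes from.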

Page 27: MAD Skills: New Analysis Practices for Big Data

Application Example: cosine similarity for fraud detection

• Scenario: detect similar documents (measured by cosine similarity) promoted by different advertisers:

• They are usually fraudulent

• The advertisers usually use stolen credit cards

• Using matrix operators, the implementation is natural and not a black box
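As a rough illustration of the measure itself (not the listing from the slide or the paper), cosine similarity between two documents can be sketched as below. It assumes documents are represented as plain term-count vectors; the helper names and the ad texts are made up.

```python
import math
from collections import Counter

def term_vector(text):
    """Term-count vector of a document (assumed whitespace tokenization)."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    shared_terms = u.keys() & v.keys()
    dot_product = sum(u[t] * v[t] for t in shared_terms)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is similar to nothing
    return dot_product / (norm_u * norm_v)

# Hypothetical ad texts from two different advertisers.
ad_one = term_vector("cheap watches best price buy now")
ad_two = term_vector("buy now cheap watches best price")
ad_three = term_vector("local bakery fresh bread daily")

same_text_score = cosine_similarity(ad_one, ad_two)
unrelated_score = cosine_similarity(ad_one, ad_three)
```

Pairs of ads from different advertisers that score close to 1 would be the candidates to review for fraud.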

Page 28: MAD Skills: New Analysis Practices for Big Data

Other examples in the paper

• Ordinary Least Squares

• Uses a pseudo-inverse matrix routine

• Found in math textbooks

• Also applicable to matrix division

• Conjugate Gradient

• Iterative

• Efficient?

• Support Vector Machine

Using the existing operators, analysts can solve a number of complicated problems. Previously, they had to load the data into R, which is slow.
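The first two examples can be combined in one small sketch. Note the substitution: instead of the pseudo-inverse routine mentioned on the slide, this version forms the normal equations (XᵀX)β = Xᵀy and solves them with conjugate gradient. It is plain Python on toy data, not the in-database implementation from the paper, and all names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(matrix, x):
    return [dot(row, x) for row in matrix]

def conjugate_gradient(matrix, b, tol=1e-12, max_iter=100):
    """Solve matrix * x = b for a symmetric positive-definite matrix."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - matrix * x, with x = 0
    p = list(r)            # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old <= tol:  # squared residual norm is small enough
            break
        ap = mat_vec(matrix, p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

def ordinary_least_squares(X, y):
    """Fit y ~ X * beta via the normal equations (X^T X) beta = X^T y."""
    columns = list(zip(*X))
    xtx = [[dot(ci, cj) for cj in columns] for ci in columns]
    xty = [dot(ci, y) for ci in columns]
    return conjugate_gradient(xtx, xty)

# Toy data generated from y = 1 + 2x; the first column is the intercept term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
intercept, slope = ordinary_least_squares(X, y)
```

Everything here is built from dot products and matrix-vector products, the same operators the earlier slides describe as available inside the database.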

Page 29: MAD Skills: New Analysis Practices for Big Data

Function: a UDF, so roughly trivial

• Correct me if I am wrong!

Page 30: MAD Skills: New Analysis Practices for Big Data

Functional example: Mann-Whitney U Test (MWU)

• Scenarios:

• Web companies compare user experiences across different versions of their website to find the best one.

• Ad companies compare different ad campaigns to find the one with the highest click-through rate.

• Non-parametric: suited to data that does not fit a well-known distribution

• The calculation reduces to some counts
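To show what "some counts" means, here is a minimal sketch of the U statistic: for every pair drawn from the two samples, count which side is larger, with ties counted as one half. It is a direct pairwise version for illustration, not the SQL implementation from the paper, and it stops at the statistic without computing a p-value. The sample values are made up.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b): pairwise win counts for each sample, ties count 0.5."""
    u_a = 0.0
    for x in sample_a:
        for y in sample_b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b

# Hypothetical click-through measurements for two ad campaigns.
campaign_a = [3, 5, 8]
campaign_b = [1, 5, 7]
u_a, u_b = mann_whitney_u(campaign_a, campaign_b)
```

A U value close to 0 or to the product of the sample sizes suggests that one campaign tends to score higher than the other.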

Page 31: MAD Skills: New Analysis Practices for Big Data

MWU implementation

• No black box

• Direct computation in the database

• Easy-to-use interface

Page 32: MAD Skills: New Analysis Practices for Big Data

Other example: Log-likelihood ratio

• Binomial distribution

• Multinomial distribution

• Questions/Comments?
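For the binomial case, one common form of the statistic can be sketched as follows. This is an assumption about which variant is meant: it compares "each group has its own success rate" against "both groups share one pooled rate". The function names and the counts are illustrative.

```python
import math

def binomial_log_likelihood(successes, trials, p):
    """Log-likelihood of the counts under rate p (binomial coefficient omitted,
    since it cancels in the ratio)."""
    log_likelihood = 0.0
    if successes > 0:
        log_likelihood += successes * math.log(p)
    if trials - successes > 0:
        log_likelihood += (trials - successes) * math.log(1.0 - p)
    return log_likelihood

def log_likelihood_ratio(k1, n1, k2, n2):
    """2 * (log-likelihood with separate rates - log-likelihood with a pooled rate)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    separate = (binomial_log_likelihood(k1, n1, p1)
                + binomial_log_likelihood(k2, n2, p2))
    shared = (binomial_log_likelihood(k1, n1, pooled)
              + binomial_log_likelihood(k2, n2, pooled))
    return 2.0 * (separate - shared)

# Hypothetical clicks out of impressions for two ads.
same_rate = log_likelihood_ratio(10, 100, 20, 200)
different_rate = log_likelihood_ratio(10, 100, 30, 100)
```

The ratio stays near zero when the two rates agree and grows as they diverge; like the Mann-Whitney U test, it needs only counts.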


Page 33: MAD Skills: New Analysis Practices for Big Data

Resampling Implementation

• Create 10,000 trials, each of size 3, as a view

• Specify the experiment (e.g., the average of each subsample) as a view

• Run the experiment with a single query
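The three steps can be mimicked outside the database with a short sketch. It assumes sampling with replacement (bootstrap style), and the population values and the fixed seed are arbitrary choices for illustration. The paper expresses the same idea with SQL views.

```python
import random
import statistics

def resample_means(population, trials=10000, size=3, seed=0):
    """Draw `trials` subsamples of `size` with replacement; return each mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        statistics.fmean(rng.choices(population, k=size))
        for _ in range(trials)
    ]

# Hypothetical measurements.
population = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Steps 1 and 2: define the trials and the per-trial experiment (the average).
trial_means = resample_means(population)

# Step 3: "run the experiment" by summarizing the trial results.
estimate = statistics.fmean(trial_means)
spread = statistics.pstdev(trial_means)
```

The spread of the trial means gives an empirical picture of how much the average varies from sample to sample, without assuming a particular distribution.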

Page 34: MAD Skills: New Analysis Practices for Big Data

Agenda

• MAD vs. traditions

• Scenario: Fox Audience Network (FAN)

• MAD design

• MAD algebra implementation

• Variable types

• Corresponding operators

• MAD system implementation (MAD RDBMS)

• Reflections & Directions

Page 35: MAD Skills: New Analysis Practices for Big Data

MAD RDBMS

• Magnetic

• Get data painlessly

• Agile

• Efficient and adaptive physical storage

• Deep

• Flexible programming ecosystem

Page 36: MAD Skills: New Analysis Practices for Big Data

MAD Loading/Unloading

• Scatter/Gather Streaming

• Shared-nothing

• Coordination with external data:

• Data can be queried while streaming

• Fast

• 4 TB/hour with minimal impact on concurrent DB operations

• Greenplum has MapReduce support!

Tags: Agile, Magnetic

Page 37: MAD Skills: New Analysis Practices for Big Data

MAD Storage

• Tunable table types for different stages:

• External tables (e.g. files)

• Heap tables (frequent updates)

• Append-only tables (rare updates)

• Column stores for flexibility

• Users can specify the distribution policy

Tags: Agile, Magnetic
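A distribution policy decides which node stores each row. The sketch below shows one plausible policy, hashing a chosen column. It is an assumption for illustration: the CRC32 hash, the helper names, and the rows are made up, and Greenplum's actual hashing is not reproduced here.

```python
import zlib

def segment_for(key, num_segments):
    """Pick a segment by hashing the distribution key (CRC32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments

def distribute(rows, key_column, num_segments):
    """Assign each row to exactly one segment based on its key column."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_column], num_segments)].append(row)
    return segments

# Hypothetical rows distributed by user_id across 4 segments.
rows = [{"user_id": uid, "clicks": uid % 7} for uid in range(1000)]
segments = distribute(rows, "user_id", 4)
segment_sizes = [len(segment) for segment in segments]
```

Rows with the same key always land on the same segment, so joins and aggregations on that key can run locally on each node.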

Page 38: MAD Skills: New Analysis Practices for Big Data

MAD Partitioning

• Partition by a range of values or by a list of column values

• e.g. partition by timestamp: old data goes to compressed tables, new data goes to heap storage

• The query optimizer knows the partitioning scheme

• Users can hold off on using partitions until partitioning is complete

Tags: Agile, Magnetic
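The timestamp example can be sketched as a two-partition range scheme. The cutoff date and the partition names are hypothetical. The second function shows why it matters that the optimizer knows the scheme: a query restricted to recent dates never needs to read the archive.

```python
from datetime import date

# Hypothetical boundary between archived and recent data.
CUTOFF = date(2014, 11, 1)

def partition_for(day):
    """Route a row to a partition by its date (range partitioning)."""
    return "recent_heap" if day >= CUTOFF else "archive_compressed"

def partitions_to_scan(start, end):
    """Partitions a query over the inclusive range [start, end] must read."""
    names = []
    if start < CUTOFF:
        names.append("archive_compressed")
    if end >= CUTOFF:
        names.append("recent_heap")
    return names

recent_only = partitions_to_scan(date(2014, 11, 5), date(2014, 11, 10))
spanning = partitions_to_scan(date(2014, 10, 20), date(2014, 11, 10))
```

Skipping partitions that cannot contain matching rows is usually called partition pruning.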

Page 39: MAD Skills: New Analysis Practices for Big Data

MAD Programming

• Flexible in coding: extensible library

• Flexible in programming metaphors: MapReduce vs. SQL

• Programmers must make sure the code works without shared memory (i.e. data-parallel?)

Tags: Deep
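What "works without shared memory" means in practice can be shown with an average computed in two phases. Each segment summarizes only its own rows, and a final step merges the small partial results. The function names and the data are illustrative; this is a sketch of the data-parallel style, not code from Greenplum.

```python
def partial_aggregate(segment_rows):
    """Runs independently on one segment: returns (sum, count) of its rows."""
    return sum(segment_rows), len(segment_rows)

def merge_average(partials):
    """Combines the per-segment partial results into the global average."""
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    if count == 0:
        raise ValueError("no rows to average")
    return total / count

# Hypothetical values spread over three segments that share nothing.
segments = [[1, 2, 3], [4, 5], [6]]
partials = [partial_aggregate(rows) for rows in segments]
average = merge_average(partials)
```

The same split into a per-partition step and a merge step underlies both a MapReduce job and a parallel SQL aggregate, which is why the two metaphors can coexist.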

Page 41: MAD Skills: New Analysis Practices for Big Data

Directions

• Package management and reuse

• Co-optimizing storage and queries for linear algebra

• Automating physical design for iterative tasks

• Online query processing for MAD analytics


Page 42: MAD Skills: New Analysis Practices for Big Data

“The question is not whether to get MAD, but how and when”

– Authors

Page 43: MAD Skills: New Analysis Practices for Big Data

Questions

• Do Spark/Shark/BlinkDB provide a better “how”?

• It is unclear how they handle parallel processing

• Is that implied when using SQL and a shared-nothing architecture?

Page 44: MAD Skills: New Analysis Practices for Big Data

Thank you!
