MAD Skills: New Analysis Practices for Big Data

Slides courtesy of the original paper and Christan Grant’s slides. Presented by Long Pham.

11/11/2014

MAD Skills: New Analysis Practices for Big Data

• Authors

• Jeff Cohen – Greenplum

• Brian Dolan – Fox Audience Network

• Mark Dunlap – Evergreen Technologies

• Joseph M. Hellerstein – UC Berkeley

• Caleb Welton – Greenplum

• Presented at the Very Large Data Bases (VLDB) conference 2009 in Lyon, France

What not to expect…

• Smart system supports novice users

• Diagrams + Pictures

• Quantitative experiments

• Formal proof

What to expect…

• Smart system supports smart users

• Analysis

• Implementation examples

• Real application scenarios

• Reflection & Future discussion

• If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap.

• So what’s getting ubiquitous and cheap? Data.

• And what is complementary to data? Analysis.

– Prof. Hal Varian, Chief Economist at Google

Traditionally, data analytics (OLAP) = well-structured data warehouse

• Single expensive center dedicated to analytics (which is separate from OLTP)

• Pre-materialization for pre-defined tasks

• Jealously guarded by engineers

• To ensure high quality integration

Things are changing towards decentralized analytics centers

• Cheap storage

• The world's largest data warehouse of 10 years ago now fits on a few commodity disks priced at about $100 each

• Massive data

• Even from a single source like clicks

• Popular analytics

• Proved to be profitable

New paradigm is needed: Magnetic Agile Deep

• Magnetic: attracts all data sources regardless of quality

• vs. a single center

• Agile: continuously adaptive structure

• vs. a rigid, well-structured architecture

• Deep: supports sophisticated algorithms

• vs. being limited to roll-up, drill-down, etc.

Agenda

• MAD vs. traditions

• Scenario: Fox Audience Network (FAN)

• MAD design

• MAD algebra implementation

• Variable types

• Corresponding operators

• MAD system implementation

• Reflections & Directions

MAD is deeper than Data Cubes: inferential vs. descriptive

• Descriptive Data Cubes:

• Roll-up, Drill-down, etc.

• To gain understanding

Deep

• Inferential MAD:

• Fit with models: e.g., Gaussian distribution

• Deeper understanding:

• Robust to outliers

• Robust to the particulars of a specific dataset

• Enable advanced tasks:

• Prediction

• Causality analysis

• Distributional comparison

MAD is closer to data than Stats software

• Stats software examples: Matlab, R, etc.

• Runs directly in the database vs. loading data into separate software

• Distributed vs. in-memory data

Agile, Magnetic

MAD is a more extensible eco-system than current MapReduce

• Current MapReduce: complicated algorithms are black-boxes

• MAD advocates a more extensible and modifiable eco-system

• Currently: SQL-based

• Possibly: MapReduce-based

Deep, Agile, Magnetic



Fox Audience Network

• Served MySpace.com, IGN.com, Scout.com etc.

• About 150 Million users

• Ad network, bought by Rubicon in Nov 2010

It is big!

• 42 Nodes (2 Masters, 40 Workers) Sun X4500

• Thumper

• 48 500GB drives

• 16GB RAM

• ~ 5TB Daily

• One table with 1.5 trillion rows

• Many different types of user workloads

• Dynamic query ecosystems

Magnetic

It has various query types

• How many female WWF enthusiasts under the age of 30 visited the Toyota community over the last four days and saw a medium rectangle?

• Ad hoc + Expensive -> fast?

• How are these people similar to those that visited Nissan?

• Open-ended, requires some statistics and the analyst to be in the loop.

Agile

Deep



MAD Design Requirements

• Loading shouldn’t take too long

• Integration and cleaning routines are unaffordable

• Analysts tolerate noise in exchange for

• Being the first to analyze data

• Being able to run sophisticated analysis

• Besides, it is still advisable to have a single data center

• Decentralized logically, not physically

• Thus:

• Data warehouse can be improved gradually

• Analysts must be armed with necessary tools

Agile, Deep

Magnetic

Magnetic

MAD philosophy: 3 logical layers to allow analysts to touch data as soon as possible

• Reporting – Specialized static aggregates

• For novice analysts

• Specialized

• Tuned for performance

• Production – Aggregates used by most users

• For more advanced analysts

• Armed with common aggregations

• Stages – Raw tables and logs

• For engineers & some analysts

• Besides, Sandboxes – Playground for analysts

Challenge: Provide powerful tools for analysts to agilely keep up with magnetic data & go deep!

MADlib = “RDBMS” + Stats + Math + Machine learning

• Built on an RDBMS (PostgreSQL)

• One single type: Scalar (value)

• Advance to more math-friendly, stats-friendly types and corresponding operations:

Scalar: 0.1, 0.2…

Vector: [0.1, 0.2] [0.2, 0.4]

Matrix: [[0.1, 0.2], [0.2, 0.4]]

Function: probability density function f(.)

Functional: Mann-Whitney U test on distributions f(.) and g(.)

• Enable stats-friendly and ML-friendly operations:

Resampling

• Through user-defined functions (UDFs)

Agile

Deep

Magnetic

Agile

Deep Agile

Scalar operations have been implemented in RDBMS

• SELECT 5*4;

• SELECT sqrt(64);

• SELECT cos(-3.14159 * sqrt(2) / 2 );

Vectors/Matrices can be considered as relation objects in an Object-Relational Database

• Matrix = (row_number integer, vector numeric[])

• Postgres has the extension!

• Summation, product, dot product are trivial
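
A minimal sketch of the row-wise layout above, written in Python rather than SQL so it runs anywhere. The helper names (`dot`, `add`, `matrix_vector_product`) are hypothetical and are not functions from the paper or from Postgres. Each matrix row is a `(row_number, vector)` pair, as in the slide's definition.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a + b for a, b in zip(u, v)]


def matrix_vector_product(matrix_rows, v):
    """Multiply a row-wise matrix by a vector: one dot product per stored row."""
    return [(row_number, dot(vector, v)) for row_number, vector in matrix_rows]


# A small matrix stored as (row_number, vector) tuples.
matrix = [(1, [0.1, 0.2]), (2, [0.2, 0.4])]
```

Because every row carries its whole vector, these operations touch one row at a time, which is why they parallelize easily across workers.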

Matrices may have other data layouts to facilitate a particular operator like transpose

• If using previous representation

• If using sparse representation:

• (row number, column number, value)

• Trivial transpose

• Fast multiplication
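
The sparse layout can be tried out with Python's built-in `sqlite3` module standing in for the real database. This is only a sketch: the table names `a` and `b` and the column names `row_num`, `col_num`, `value` are assumptions made for illustration, and the queries are not copied from the paper.

```python
import sqlite3


def build_database():
    """Create two small sparse matrices; zero entries are simply not stored."""
    conn = sqlite3.connect(":memory:")
    for name in ("a", "b"):
        conn.execute(
            f"CREATE TABLE {name} (row_num INTEGER, col_num INTEGER, value REAL)"
        )
    # A = [[1, 2], [0, 3]]
    conn.executemany("INSERT INTO a VALUES (?, ?, ?)",
                     [(1, 1, 1.0), (1, 2, 2.0), (2, 2, 3.0)])
    # B = [[4, 0], [5, 6]]
    conn.executemany("INSERT INTO b VALUES (?, ?, ?)",
                     [(1, 1, 4.0), (2, 1, 5.0), (2, 2, 6.0)])
    return conn


# Transpose: swap the two index columns, nothing else changes.
TRANSPOSE_A = """
    SELECT col_num AS row_num, row_num AS col_num, value
    FROM a
    ORDER BY 1, 2
"""

# Multiplication: join on the shared inner index, then sum the products.
MULTIPLY_A_B = """
    SELECT a.row_num, b.col_num, SUM(a.value * b.value) AS value
    FROM a JOIN b ON a.col_num = b.row_num
    GROUP BY a.row_num, b.col_num
    ORDER BY 1, 2
"""


def transpose_and_multiply():
    conn = build_database()
    transposed = conn.execute(TRANSPOSE_A).fetchall()
    product = conn.execute(MULTIPLY_A_B).fetchall()
    conn.close()
    return transposed, product
```

The join only visits stored (non-zero) entries, so the work grows with the number of non-zeros rather than with the full matrix dimensions.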

Application Example: cosine similarity for fraud detection

• Scenario: Detect similar docs (measured by cosine similarity) promoted by different advertisers:

• They are usually fraudulent

• The advertisers usually use stolen credit cards

• Using matrix operators, the implementation is natural:
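
The query from the original slide is not reproduced here. As a stand-in, the following sketch computes pairwise cosine similarity over sparse term vectors, again using `sqlite3` in place of the real database. The table `doc_terms`, its columns, and the function `pairwise_cosine` are hypothetical names, and `sqrt` is registered as a user-defined function because not every SQLite build ships one.

```python
import math
import sqlite3


def pairwise_cosine(rows):
    """rows: iterable of (doc_id, term, weight). Returns (doc_a, doc_b, cosine)."""
    conn = sqlite3.connect(":memory:")
    # Register sqrt as a UDF so the whole computation stays inside SQL.
    conn.create_function("sqrt", 1, math.sqrt)
    conn.execute("CREATE TABLE doc_terms (doc_id TEXT, term TEXT, weight REAL)")
    conn.executemany("INSERT INTO doc_terms VALUES (?, ?, ?)", rows)
    query = """
        WITH norms AS (
            SELECT doc_id, sqrt(SUM(weight * weight)) AS norm
            FROM doc_terms
            GROUP BY doc_id
        ),
        dots AS (
            -- Dot product: join two documents on their shared terms.
            SELECT x.doc_id AS doc_a, y.doc_id AS doc_b,
                   SUM(x.weight * y.weight) AS dot
            FROM doc_terms x
            JOIN doc_terms y ON x.term = y.term AND x.doc_id < y.doc_id
            GROUP BY x.doc_id, y.doc_id
        )
        SELECT d.doc_a, d.doc_b, d.dot / (na.norm * nb.norm) AS cosine
        FROM dots d
        JOIN norms na ON na.doc_id = d.doc_a
        JOIN norms nb ON nb.doc_id = d.doc_b
        ORDER BY cosine DESC, d.doc_a, d.doc_b
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

Pairs of documents with no term in common never appear in the join, so they are omitted from the result instead of being reported with similarity 0. A document whose weights are all zero has norm 0, and SQLite returns NULL for its similarity.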

Not black box

Other examples in the paper

• Ordinary Least Squares

• Using a pseudo-inverse matrix routine

• Found in math textbooks

• Also applied to matrix division

• Conjugate Gradient

• iterative

• efficient?

• Support Vector Machine
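
For a full-rank design matrix X, the pseudo-inverse gives β = (XᵀX)⁻¹Xᵀy, and both XᵀX and Xᵀy are plain sums over rows, which is what lets the method map onto database aggregates. Below is a minimal Python sketch for a small dense X. It solves the normal equations by Gauss-Jordan elimination instead of forming the pseudo-inverse explicitly, and `ols_fit` is a hypothetical name, not a routine from the paper.

```python
def ols_fit(X, y):
    """Least-squares coefficients from the normal equations (X^T X) beta = X^T y."""
    k = len(X[0])
    # Both quantities are sums over rows, i.e. simple aggregates.
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(k)]

    # Gauss-Jordan elimination on the augmented matrix [X^T X | X^T y].
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: columns of X are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [rv - factor * cv for rv, cv in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]
```

For example, `ols_fit([[1, 0], [1, 1], [1, 2], [1, 3]], [1, 3, 5, 7])` fits an intercept and a slope to four points that lie on a straight line.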

Using the existing operators, analysts can solve a number of complicated problems.

Before that, they used to load data into R, which is slow.

Function: UDF ~ trivial

• Correct me if I am wrong!

Functional example: Mann-Whitney U Test (MWU)

• Scenarios:

• Web companies compare user experiences from different versions of their website to find the best.

• Ad companies compare different ad campaigns to find the one with the highest click-through rate

• For non-parametric data = data sets that do not fit a well-known distribution

• Calculation involves some counts

MWU implementation
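
The query shown on the original slide is not reproduced here. The sketch below is a stand-in that uses the pair-counting definition of the U statistic, with `sqlite3` playing the role of the database; it is not the paper's rank-based query. The table `observations`, the column `grp`, and the function `mann_whitney_u` are hypothetical names, and the z-score uses the normal approximation without a tie correction.

```python
import math
import sqlite3


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, z-score under the normal approximation)."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE observations (grp TEXT, value REAL)")
    conn.executemany("INSERT INTO observations VALUES ('a', ?)",
                     [(v,) for v in sample_a])
    conn.executemany("INSERT INTO observations VALUES ('b', ?)",
                     [(v,) for v in sample_b])
    # U counts the (a, b) pairs where a beats b; a tie counts as half a win.
    (u_a,) = conn.execute("""
        SELECT SUM(CASE WHEN x.value > y.value THEN 1.0
                        WHEN x.value = y.value THEN 0.5
                        ELSE 0.0 END)
        FROM observations x
        JOIN observations y ON x.grp = 'a' AND y.grp = 'b'
    """).fetchone()
    conn.close()

    n_a, n_b = len(sample_a), len(sample_b)
    mean_u = n_a * n_b / 2
    # Standard deviation of U under the null hypothesis (no tie correction).
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    return u_a, (u_a - mean_u) / sd_u
```

Comparing every pair costs n_a × n_b work, so this form is only meant to make the "some counts" idea concrete; a rank-based formulation scales better on large samples.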

• No black box

• Direct computation in the database

• Easy-to-use interface

Other example: Log-likelihood ratio

• Binomial distribution

• Multinomial distribution

• Questions/Comments?

Resampling Implementation

Create 10000 trials, each of size 3, as a view

Specify experiment (e.g., avg each subsample) by view

Run experiment by a single query
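
The three steps can be sketched with `sqlite3` as follows. One deliberate difference from the slide: the trial design is stored as a table filled from a seeded Python random generator instead of a view over the database's own random function, purely so the run is reproducible. The names `data`, `design`, `trials`, and `bootstrap_means` are hypothetical.

```python
import math
import random
import sqlite3


def bootstrap_means(values, n_trials=10000, sample_size=3, seed=0):
    """Return (number of trials, mean of trial means, std. dev. of trial means)."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (row_id INTEGER PRIMARY KEY, value REAL)")
    conn.executemany("INSERT INTO data VALUES (?, ?)",
                     list(enumerate(values, start=1)))

    # Step 1: the design, one (trial_id, row_id) pair per draw, with replacement.
    conn.execute("CREATE TABLE design (trial_id INTEGER, row_id INTEGER)")
    conn.executemany(
        "INSERT INTO design VALUES (?, ?)",
        [(trial, rng.randint(1, len(values)))
         for trial in range(1, n_trials + 1)
         for _ in range(sample_size)],
    )

    # Step 2: the experiment as a view, here the average of each subsample.
    conn.execute("""
        CREATE VIEW trials AS
        SELECT d.trial_id, AVG(t.value) AS trial_mean
        FROM design d JOIN data t ON t.row_id = d.row_id
        GROUP BY d.trial_id
    """)

    # Step 3: run the whole experiment with a single query over the view.
    count, mean, variance = conn.execute("""
        SELECT COUNT(*),
               AVG(trial_mean),
               AVG(trial_mean * trial_mean) - AVG(trial_mean) * AVG(trial_mean)
        FROM trials
    """).fetchone()
    conn.close()
    # Guard against a tiny negative variance caused by floating-point rounding.
    return count, mean, math.sqrt(max(variance, 0.0))
```

Swapping the aggregate inside the `trials` view (for example a maximum instead of an average) changes the experiment without touching the design or the final query.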



MAD RDBMS

• Magnetic

• Get data painlessly

• Agile

• Efficient & Adaptive physical storage

• Deep

• Flexible programming eco-system

MAD Loading/Uploading

• Scatter/Gather Streaming

• Shared-nothing

• Coordination with external data:

• Data are queried while streaming

• Fast

• 4 TB/hour with minimal impact on current DB operations

• Greenplum has MapReduce support!

Agile, Magnetic

MAD Storage

• Tunable table types for different stages:

• external tables (e.g. files)

• heap tables (frequent updates)

• append-only tables (rare updates)

• column-store flexibility

• Users can specify distribution policy

Agile, Magnetic

MAD Partitioning

• Partition by range or by list of column values

• e.g., partition by timestamp: old data goes to compressed tables, new data goes to heap storage.

• Query optimizer knows the partitioning scheme

• Users can delay using partitions until partitioning is complete

Agile, Magnetic

MAD Programming

• Flexible in coding: extensible library

• Flexible in programming metaphors: MapReduce vs. SQL

• Programmers must think through how the code works without shared memory (data-parallel)?

Deep



Directions

• Package management and reuse

• Co-optimizing storage and queries for linear algebra

• Automating physical design for iterative tasks

• Online query processing for MAD analytics

“The question is not whether to get MAD, but how and when”

– Authors

Questions

• Do Spark/Shark/BlinkDB provide a better “how”?

• It is unclear how they handle parallel processing

• Is that implied when using SQL and share-nothing architecture?

Thank you!
