
In-Database Predictive Analytics

Date post: 18-Jan-2015
Upload: john-de-goes
Description:
Predictive analytics has long lived in the domain of statistical tools like R. Increasingly, however, as companies struggle with exploding volumes of data that small-data tools cannot easily analyze, they are looking at ways of doing predictive analytics directly inside the primary data store. This approach, called in-database predictive analytics, eliminates the need to sample data and run a separate ETL process into a statistical tool, which can decrease total cost, improve the quality of predictive models, and dramatically shorten development time. In this class, you will learn the pros and cons of in-database predictive analytics, see its main limitations, and survey the tools and technologies needed to head down this path.
Transcript
Page 1: In-Database Predictive Analytics

In-Database Predictive Analytics

John A. De Goes, @jdegoes

Page 2: In-Database Predictive Analytics

Agenda

• Introduction
• Abusing SQL
• Painful by Design
• Database Extensions
• MADlib
• Other Approaches
• Summary

Page 3: In-Database Predictive Analytics

Introduction

In-Database Predictive Analytics

In-database predictive analytics refers to the process of performing advanced predictive analytics directly inside the database.

Page 4: In-Database Predictive Analytics

Traditional Predictive Analytics

Introduction

[Diagram: database, R, SAS]

Page 5: In-Database Predictive Analytics

Data Bottleneck: Painful, Slow

Introduction

[Diagram: database, R, SAS]

Page 6: In-Database Predictive Analytics

What’s the answer?

Introduction

Page 7: In-Database Predictive Analytics

“MapReduce”

Move the Code, not the Data!

Advanced Analytics

Introduction

Page 8: In-Database Predictive Analytics

Let’s Do K-Means in SQL!

Abusing SQL

Page 9: In-Database Predictive Analytics

General Approach in RDBMS

[Diagram: the Driver sends SQL to the Database and receives Feedback]

Abusing SQL
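The driver and feedback loop on this slide could be sketched roughly as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for the RDBMS, the helper name run_driver and its parameters are hypothetical, and the single UPDATE is a toy stand-in for the k-means statements shown on the following slides. Only the model table and its columns come from the deck.

```python
import sqlite3

def run_driver(conn, step_statements, max_iterations=20, tolerance=1e-3):
    """Hypothetical driver: send SQL each pass, read avg_q back as feedback."""
    previous = None
    for _ in range(max_iterations):
        for statement in step_statements:
            conn.execute(statement)
        # Feedback: the driver reads the model quality back from the database.
        (current,) = conn.execute("SELECT avg_q FROM model").fetchone()
        if previous is not None and abs(previous - current) < tolerance:
            break
        previous = current
    return conn.execute("SELECT iteration, avg_q FROM model").fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE model (d INT, k INT, n INT, iteration INT, avg_q REAL)")
conn.execute("INSERT INTO model VALUES (2, 2, 4, 0, 8.0)")

# Toy stand-in for one k-means pass (assumption, not from the slides):
# each pass halves avg_q and bumps the iteration counter.
toy_step = ["UPDATE model SET avg_q = avg_q / 2, iteration = iteration + 1"]
iterations, avg_q = run_driver(conn, toy_step)
```

The point of the sketch is that the loop and the stopping rule live outside the database, because plain SQL has no construct for iterating until convergence.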

Page 10: In-Database Predictive Analytics

Our Initial Model

Table model, columns:

d: number of dimensions
k: number of clusters
n: number of points
iteration: number of iterations
avg_q: variance

Abusing SQL

Page 11: In-Database Predictive Analytics

Our Initial Data Set

Table Y (n rows), columns: Y1 Y2 Y3 ...

Abusing SQL

Page 12: In-Database Predictive Analytics

Projection & Numbering

Y: Y1 Y2 Y3 ...

YH: i Y1 ... Yd (i = 1, 2, 3, 4, ..., n)

INSERT INTO YH
SELECT sum(1) over(rows unbounded preceding) AS i, Y1, Y2, ..., Yd
FROM Y;

Abusing SQL

Page 13: In-Database Predictive Analytics

Flattening

YH: i Y1 ... Yd (i = 1, 2, 3, 4, ..., n)

YV: i l val (n x d rows; for each i = 1, ..., n there is one row per l = 1, ..., d)

INSERT INTO YV SELECT i, 1, Y1 FROM YH;
...
INSERT INTO YV SELECT i, d, Yd FROM YH;

Abusing SQL
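Because the flattening needs one INSERT per dimension, the statements are naturally generated by the driver. A minimal sketch, assuming Python's sqlite3 as a stand-in database, d = 2, and made-up sample rows; the helper name flatten_statements is hypothetical, while the table and column names follow the slide.

```python
import sqlite3

def flatten_statements(d):
    """Build one INSERT per dimension l = 1..d, mirroring the slide."""
    return [f"INSERT INTO YV SELECT i, {l}, Y{l} FROM YH" for l in range(1, d + 1)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YH (i INT, Y1 REAL, Y2 REAL)")
conn.execute("CREATE TABLE YV (i INT, l INT, val REAL)")
# Illustrative data only: three points in two dimensions.
conn.executemany(
    "INSERT INTO YH VALUES (?, ?, ?)",
    [(1, 0.0, 1.0), (2, 4.0, 5.0), (3, 9.0, 9.0)],
)

for statement in flatten_statements(2):
    conn.execute(statement)

# YV now holds n x d rows, one per (point, dimension) pair.
rows = conn.execute("SELECT i, l, val FROM YV ORDER BY i, l").fetchall()
```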

Page 14: In-Database Predictive Analytics

Initializing k Cluster Centers

YH: i Y1 ... Yd (i = 1, 2, 3, 4, ..., n)

CH: j Y1 ... Yd (j = 1, 2, 3, 4, ..., k)

INSERT INTO CH SELECT 1, Y1, ..., Yd FROM YH SAMPLE 1;
...
INSERT INTO CH SELECT k, Y1, ..., Yd FROM YH SAMPLE 1;

Abusing SQL

Page 15: In-Database Predictive Analytics

Flattening

CH: j Y1 ... Yd (j = 1, 2, 3, 4, ..., k)

C: l j val (d x k rows; for each l = 1, ..., d there is one row per j = 1, ..., k)

INSERT INTO C SELECT 1, 1, Y1 FROM CH WHERE j = 1;
...
INSERT INTO C SELECT d, k, Yd FROM CH WHERE j = k;

Abusing SQL

Page 16: In-Database Predictive Analytics

Computing Distances to Clusters

INSERT INTO YD
SELECT i, j, sum((YV.val - C.val)**2)
FROM YV, C
WHERE YV.l = C.l
GROUP BY i, j;

YD: i j dist (n x k rows; for each i = 1, ..., n there is one row per j = 1, ..., k)

Abusing SQL
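The distance step can be tried on a tiny data set. A minimal sketch, assuming Python's sqlite3 as a stand-in database and made-up points and centroids; SQLite has no ** operator, so the square is written as a product here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YV (i INT, l INT, val REAL);
CREATE TABLE C  (l INT, j INT, val REAL);
CREATE TABLE YD (i INT, j INT, dist REAL);
-- Illustrative data: points (0,0) and (3,4); centroids (0,0) and (3,0).
INSERT INTO YV VALUES (1,1,0.0),(1,2,0.0),(2,1,3.0),(2,2,4.0);
INSERT INTO C  VALUES (1,1,0.0),(2,1,0.0),(1,2,3.0),(2,2,0.0);
""")

# Join on the dimension index l, then sum squared differences per (point, cluster).
conn.execute("""
INSERT INTO YD
SELECT i, j, sum((YV.val - C.val) * (YV.val - C.val))
FROM YV, C
WHERE YV.l = C.l
GROUP BY i, j
""")

distances = {(i, j): dist for i, j, dist in conn.execute("SELECT i, j, dist FROM YD")}
```

Note that dist holds the squared Euclidean distance; no square root is needed to find the nearest cluster.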

Page 17: In-Database Predictive Analytics

Computing Nearest Neighbors

INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD, (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist;

YNN: i j (n rows; j is the nearest cluster for each point i = 1, 2, 3, 4, 5, ..., n)

Abusing SQL
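The nearest-cluster query can be checked the same way. A minimal sketch, assuming Python's sqlite3 as a stand-in database and a hand-filled YD table with made-up distances.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YD  (i INT, j INT, dist REAL);
CREATE TABLE YNN (i INT, j INT);
-- Illustrative distances for two points and two clusters.
INSERT INTO YD VALUES (1,1,0.0),(1,2,9.0),(2,1,25.0),(2,2,16.0);

-- Keep, for each point i, the cluster j whose distance equals the minimum.
INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD, (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist;
""")

nearest = dict(conn.execute("SELECT i, j FROM YNN"))
```

One caveat with this formulation: a point that is exactly equidistant from two clusters produces two rows in YNN, so the "n rows" count holds only when there are no ties.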

Page 18: In-Database Predictive Analytics

Count Points Per Cluster

INSERT INTO W
SELECT j, count(*)
FROM YNN
GROUP BY j;

UPDATE W SET w = w/model.n;

Abusing SQL

Page 19: In-Database Predictive Analytics

Compute New Centroids

INSERT INTO C
SELECT l, j, avg(YV.val)
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY l, j;

Abusing SQL
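A small worked instance of the centroid update, assuming Python's sqlite3 as a stand-in database and made-up points and assignments. The DELETE that empties C before the insert is an assumption; the slide shows only the INSERT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YV  (i INT, l INT, val REAL);
CREATE TABLE YNN (i INT, j INT);
CREATE TABLE C   (l INT, j INT, val REAL);
-- Illustrative data: points (0,0), (2,4), (10,10).
INSERT INTO YV VALUES (1,1,0.0),(1,2,0.0),(2,1,2.0),(2,2,4.0),(3,1,10.0),(3,2,10.0);
-- Points 1 and 2 belong to cluster 1, point 3 to cluster 2.
INSERT INTO YNN VALUES (1,1),(2,1),(3,2);

-- Assumption: old centroids are cleared before the new ones are written.
DELETE FROM C;

-- New centroid = per-dimension average of the points assigned to each cluster.
INSERT INTO C
SELECT l, j, avg(YV.val)
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY l, j;
""")

centroids = {(l, j): val for l, j, val in conn.execute("SELECT l, j, val FROM C")}
```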

Page 20: In-Database Predictive Analytics

Compute Variances

INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
GROUP BY C.l, C.j;

Abusing SQL

Page 21: In-Database Predictive Analytics

Update Model

INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
GROUP BY C.l, C.j;

Abusing SQL

Page 22: In-Database Predictive Analytics

Let’s not do that again!

Abusing SQL

Page 23: In-Database Predictive Analytics

Why are predictive analytics so hard to express in SQL?

Painful by Design

Page 24: In-Database Predictive Analytics

#1: No Arrays

Sets: rows

Tuples: columns

Arrays

Painful by Design

Page 25: In-Database Predictive Analytics

#2: Relational Algebra Sucks

Projection, Selection, Rename

Natural join: R ⋈ S

Theta join, Semijoin: R ⋉ S, R ⋊ S

Antijoin: R ▷ S

Division: R ÷ S

Left outer join: R ⟕ S

Right outer join: R ⟖ S

Full outer join: R ⟗ S

Aggregation: G1, G2, ..., Gm g f1(A1'), f2(A2'), ..., fk(Ak') (r)

Painful by Design

Iteration Recursion Multiple Dimensions

Page 26: In-Database Predictive Analytics

There’s GOT to be a better way!

Database Extensions

Page 27: In-Database Predictive Analytics

C Extension

Database Extensions

Page 28: In-Database Predictive Analytics

UDF (User-Defined Function): Map, map(a)

UDA (User-Defined Aggregate): Reduce, op2(a, b)

UDA functions: init(a), accum(a, b), merge(a, b), final(a)

Database Extensions
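The init / accum / merge / final shape of a user-defined aggregate can be illustrated outside any database. The sketch below is plain Python, not a real database extension API: the class name MeanAggregate is hypothetical and the signatures are simplified (for example, init takes no argument here). It computes a mean over data split into partitions, which is the situation merge exists for.

```python
from functools import reduce

class MeanAggregate:
    """Hypothetical aggregate whose state is a (running total, count) pair."""

    @staticmethod
    def init():
        return (0.0, 0)

    @staticmethod
    def accum(state, value):
        # Fold one input value into the running state.
        total, count = state
        return (total + value, count + 1)

    @staticmethod
    def merge(left, right):
        # Combine two partial states computed on different partitions.
        return (left[0] + right[0], left[1] + right[1])

    @staticmethod
    def final(state):
        # Turn the state into the result; None for an empty input.
        total, count = state
        return total / count if count else None

def aggregate_partition(values):
    return reduce(MeanAggregate.accum, values, MeanAggregate.init())

# Illustrative data split across three partitions (segments or nodes).
partitions = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
partial_states = [aggregate_partition(p) for p in partitions]
mean = MeanAggregate.final(reduce(MeanAggregate.merge, partial_states))
```

Because merge is associative, the partial states can be combined in any order, which is what lets an aggregate of this shape run in parallel across partitions.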

Page 29: In-Database Predictive Analytics

MADlib is an open-source library for scalable in-database analytics. It is implemented using database extensions written in C, and is available for PostgreSQL and Greenplum.

MADlib

Page 30: In-Database Predictive Analytics

1. Download the binary

Mac OS X: http://www.madlib.net/files/madlib-0.6-Darwin.dmg

Linux: http://www.madlib.net/files/madlib-0.6-Linux.rpm

MADlib

Page 31: In-Database Predictive Analytics

2. Start the Installation

Mac OS X: Double-click on installer

Linux: yum install $MADLIB_PACKAGE --nogpgcheck

MADlib

Page 32: In-Database Predictive Analytics

3. Verify Locatability

Greenplum: source /path/to/greenplum/greenplum_path.sh

PostgreSQL: Make sure psql is in PATH

MADlib

Page 33: In-Database Predictive Analytics

4. Register MADlib

Greenplum: /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install

PostgreSQL: /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install

MADlib

Page 34: In-Database Predictive Analytics

5. Test Installation

Greenplum: /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install-check

PostgreSQL: /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install-check

MADlib

Page 35: In-Database Predictive Analytics

Clustering in MADlib

SELECT * FROM kmeans_random(
  'rel_source', 'expr_point', k,
  [ 'fn_dist', 'agg_centroid', max_num_iterations, min_frac_reassigned ]
);

MADlib

Page 36: In-Database Predictive Analytics

Ahhhhhh......

MADlib

Page 37: In-Database Predictive Analytics

Our Way or the Highway

Composability

MADlib

Page 38: In-Database Predictive Analytics

RDBMS Isn’t the Only Game in Town!

Other Approaches

Page 39: In-Database Predictive Analytics

1. Embrace Coding

• Hadoop Ecosystem
  • Mahout, Cascading/Scalding, Crunch/Scrunch, Pangool, Cascalog, and, of course, MapReduce

• BDAS Ecosystem
  • Spark

Other Approaches

Page 40: In-Database Predictive Analytics

2. Reject RDBMS

• Datalog + variants
  • In theory, ideal for many kinds of predictive analytics
  • Suffers from a lack of distributed, feature-complete implementations

Other Approaches

Page 41: In-Database Predictive Analytics

2. Reject RDBMS

• Rasdaman / RASQL
  • Arrays but not analytics

Community Editions: http://www.rasdaman.org

Other Approaches

Page 42: In-Database Predictive Analytics

2. Reject RDBMS

• MonetDB / SciQL
  • Array extension of SQL
  • Poor analytics

Community Editions: http://www.monetdb.org

Other Approaches

Page 43: In-Database Predictive Analytics

2. Reject RDBMS

• SciDB / AFL (AQL)
  • Excellent analytics
  • Limited composability

Community Editions: http://www.scidb.org/forum/viewtopic.php?f=16&t=364/

Other Approaches

Page 44: In-Database Predictive Analytics

2. Reject RDBMS

• Precog / Quirrel (simple “R for big data”)
  • Multidimensional, arrays + functions
  • Still immature

Community Editions:
http://www.precog.com/editions/precog-for-mongodb (MongoDB)
http://www.precog.com/editions/precog-for-postgresql (PostgreSQL)

Other Approaches

Page 45: In-Database Predictive Analytics

Summary

• Increase performance, reduce friction by doing more inside the database

• Not a panacea
  • Hard to do in SQL
  • Hard to do in C (but you may not have to: MADlib)
  • Pre-canned & brittle in most databases

• Ultimately what’s needed is tech designed for advanced analytics

Page 47: In-Database Predictive Analytics

References

• Programming the K-means Clustering Algorithm in SQL (Teradata, NCR)

