Top 3 Considerations for Machine Learning on Big Data

Post on 15-May-2015

description

View the full recording of this deck here: http://info.datameer.com/Slideshare-Top-3-Things-to-Consider-for-Machine-Learning-on-Big-Data.html

Machine learning is powerful but requires coding and access to all the relevant datasets to get full insights. With new Big Data analytic tools, business users can now use machine learning to gain a competitive edge. Based on best practices and customer experiences, join Datameer and Caserta Concepts as we discuss what to look for and what value organizations get out of Machine Learning on Big Data.

This webinar will provide:
* an overview of challenges and tools available today
* use cases for machine learning on Hadoop
* capabilities to look for
* comparison of available solutions

transcript

© 2013 Datameer, Inc. All rights reserved.

Top 3 Things to Consider with Machine Learning on Big Data

Karen Hsu, Elliott Cordo

About our Speakers: Karen Hsu

• Karen is Senior Director, Product Marketing at Datameer. With over 15 years of experience in enterprise software, she has co-authored 4 patents and worked in a variety of engineering, marketing and sales roles.

• Most recently she came from Informatica, where she worked with the start-ups Informatica purchased to bring data quality, master data management, B2B and data security solutions to market.

• Karen has a Bachelor of Science degree in Management Science and Engineering from Stanford University.

About our Speakers: Elliott Cordo

• Elliott is a data warehouse and information management expert. He brings more than a decade of experience in implementing data solutions with hands-on experience in every component of the data warehouse software development lifecycle.

• At Caserta Concepts, Elliott oversees large-scale major technology projects, including those involving business intelligence, data analytics, Big Data and data warehousing.

Drivers & Challenges | Use Cases | Key Criteria | Best Practices | Next Steps

Drivers & Challenges

Big Data Analytics Drives Results

[Chart: Amazon vs Barnes & Noble, vertical axis $0 to $300, quarterly from 12/31/09 to 03/21/13]

[Chart: Netflix vs Blockbuster, vertical axis $0 to $300, quarterly from 12/31/09 to 03/21/13]

Alternatives Are Lacking

Data Mining
• Hard to use
• Requires PhD experts
• Must write code
• Expensive

Traditional BI
• Fixed DW models
• Must write code for analytics
• Very high IT labor costs
• Not agile

Visualization
• Easy for small teams
• Can’t manage large data volume
• Lack support of advanced analytics

Costs of Building Can be $1M+

$1M+ in Salaries

Job Title | Bay Area | New York
IT Project Manager | $140,000.00 | $126,000.00
System Administrator | $117,000.00 | $105,000.00
Network Administrator | $119,000.00 | $107,000.00
Database Administrator | $125,000.00 | $119,000.00
IT Security Manager | $116,000.00 | $104,000.00
Business Intelligence Analyst | $137,000.00 | $133,000.00
Data Scientist | $138,000.00 | $133,000.00
Java Developer | $136,000.00 | $133,000.00
QA Engineer | $120,000.00 | $114,000.00
Total | $1,148,000.00 | $1,074,000.00

$1M+ in Capital

Solution | Cost / 100TB
Teradata EDW | $1,650,000.00
Oracle Exadata | $1,400,000.00
IBM Netezza | $1,000,000.00

Use Cases

Use Case | What is Revealed
Profiling and segmentation | Customer, product, market characteristics and segments
Acquisition and retention | What leads a person to become a customer or stop being a customer
Product development and operations optimization | What led to product or network failure
Campaign management | Patterns of successful campaigns
Cross-sell / up-sell | Recommendations on services, products, or advisors for a given user/customer profile

Customer Examples

Industry | Use Case

Financial Services
• Show correlation between services purchased and investments/trades made
• Identify customer segments
• Recommendations for research articles to drive trading

eCommerce
• Show types of events person will like
• Decision tree based on likelihood to click through
• Recommendations for a large “cold start” population

Gaming
• Clustering for user profiles
• Correlation between attributes of a game and behavior
• Churn analysis

Healthcare
• Recommend tests or other offerings
• Identify factors/trends that lead to disease

Polling Question I

Key Criteria

Ease of Use | Quality

Clustering

Clustering Overview

K-Means

• K-means is a popular and versatile general-purpose clustering algorithm.
• Commonly used to group people and objects together to form segments
• Often leveraged to enhance recommendation and search systems

How it works

1. Treats items as coordinates
2. Places a number of random “centroids” and assigns the nearest items
3. Moves the centroids around based on average location
4. Process repeats until the assignments stop changing

*Diagram from Programming Collective Intelligence by Toby Segaran
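
The four "How it works" steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by Datameer, R, or Mahout. It assumes squared Euclidean distance and picks the initial centroids from the data points themselves; the function name `kmeans`, its parameters, and the sample points are all hypothetical.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (equal-length tuples of numbers) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick k of the data points at random as the starting centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Step 1: items expressed as coordinates (made-up sample data).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = kmeans(points, k=2)
print(labels)
```

On this sample the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random start.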

Ease of Use

First, the setup...

And then run the results...

And write additional code to scale...

In Datameer, you select the columns... And get the results

And the quality of results increases with larger data sets…

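Ease of Use

First, you have to set up...

# Project the four iris measurements onto principal components
pca <- princomp(iris[1:4])
# Assign each row to one of 3 k-means clusters
colors <- kmeans(iris[1:4], 3)$cluster
# Plot the first two components, colored by cluster
plot(pca$scores[,1], pca$scores[,2], col=colors, pch=5)

And then run the results...

And then write more code to scale...

In Datameer, you select the columns... And get the results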

Ease of Use

First, select the data...

Second, you need to create the cluster...

And then see the results

In Datameer, you select the columns... And get the results

Ease of Use

1. First, a dataset’s attributes must be converted to numeric representations.

User | Location | Company | Favorite Algo
Elliott | New Jersey | Caserta | K-Means
Karen | California | Datameer | K-Means

User | Location | Company | Favorite Algo
1001 | 1 | 101 | 1001
1002 | 2 | 102 | 1001

2. This numeric dataset is then converted to a sequence file, then to sparse vectors, using seqdirectory and seq2sparse.

3. Mahout is called with the number of clusters and the distance calculation specified:

bin/mahout kmeans \
  -i /user/kmeans/vectors \
  -c /user/kmeans/input \
  -o /user/kmeans/output \
  -k 200 \
  -dm CosineSimilarity \
  -x 20 \
  -ow

4. The sparse vector output is then converted back to a delimited format.

5. Textual attributes will be appended back to the record; numeric values are preserved for ad-hoc distance comparison of members within a cluster.

In Datameer, you select the columns... And get the results

*Diagram from Programming Collective Intelligence by Toby Segaran
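
Steps 1 and 5 (turning text attributes into numeric keys, then appending the text back afterwards) might look roughly like the sketch below. It is an illustration only: the helper names `encode_columns` and `decode_row` are hypothetical, and keys are simply numbered 1, 2, 3, ... per column rather than following the 1001/101 numbering shown on the slide.

```python
def encode_columns(rows, columns):
    """Map each distinct text value in each column to an integer key."""
    lookups = {col: {} for col in columns}
    encoded = []
    for row in rows:
        numeric_row = []
        for col in columns:
            mapping = lookups[col]
            # The first time a value is seen it gets the next free key (1, 2, 3, ...).
            key = mapping.setdefault(row[col], len(mapping) + 1)
            numeric_row.append(key)
        encoded.append(numeric_row)
    return encoded, lookups

def decode_row(numeric_row, columns, lookups):
    """Translate a numeric row back to its original text values."""
    decoded = {}
    for col, key in zip(columns, numeric_row):
        reverse = {k: text for text, k in lookups[col].items()}
        decoded[col] = reverse[key]
    return decoded

columns = ["User", "Location", "Company", "Favorite Algo"]
rows = [
    {"User": "Elliott", "Location": "New Jersey", "Company": "Caserta", "Favorite Algo": "K-Means"},
    {"User": "Karen", "Location": "California", "Company": "Datameer", "Favorite Algo": "K-Means"},
]
encoded, lookups = encode_columns(rows, columns)
print(encoded)
print(decode_row(encoded[1], columns, lookups))
```

Keeping the lookup tables around is what makes step 5 possible: the clustered numeric rows can be joined back to their readable labels.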

Quality Comparison

Column Dependencies

Column Dependencies Overview

A | B
a | x
b | y
b | y
a | x
c | z
a | y

Column Dependency ~ 0.99

C | D
a | x
b | x
b | y
a | z
c | y
a | y

Column Dependency ~ 0.01

Value
• See how data is related after joining multiple sets of data
• See column dependencies on multiple types of data
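
The deck does not say how the column dependency score is calculated. As a stand-in, the sketch below uses normalized mutual information, a common way to score how much one column tells you about another on a 0 to 1 scale. It is an assumption for illustration, not Datameer's formula, and it does not reproduce the 0.99 and 0.01 values on the slide; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def dependency_score(col_a, col_b):
    """Mutual information of two columns, normalized to the range [0, 1]."""
    h_a, h_b = entropy(col_a), entropy(col_b)
    if h_a == 0 or h_b == 0:
        return 0.0  # a constant column carries no information
    h_ab = entropy(list(zip(col_a, col_b)))
    mutual_info = h_a + h_b - h_ab
    return max(0.0, mutual_info / min(h_a, h_b))

# The two column pairs from the slide.
col_a = ["a", "b", "b", "a", "c", "a"]
col_b = ["x", "y", "y", "x", "z", "y"]
col_c = ["a", "b", "b", "a", "c", "a"]
col_d = ["x", "x", "y", "z", "y", "y"]

score_ab = dependency_score(col_a, col_b)
score_cd = dependency_score(col_c, col_d)
print(score_ab > score_cd)
```

With this stand-in measure the A/B pair still scores higher than the C/D pair, which matches the ordering on the slide even though the absolute numbers differ.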

Quality Comparison

[Plots: Column A vs Column B scatter panels with ColumnDependency(A,B) = 0, 0.5, 0.5, and 1; Column A (NUMBER) vs Column B (STRING) panels with ColumnDependency(A,B) = 0.5 and 1]

Decision Tree

Goal: Create a model that predicts the value of a target based on several inputs.

Decision Tree Overview
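
To make the goal concrete, here is a small decision tree learner in plain Python. It is a sketch under stated assumptions: numeric inputs only, splits chosen by Gini impurity, and a fixed maximum depth. It is not the algorithm used by rpart, Weka, or Datameer, and the measurements below are made-up values in the style of the iris dataset.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Build a tree of (feature, threshold, left, right) tuples with label leaves."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Illustrative (petal length, petal width) inputs and their target class.
rows = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (4.5, 1.5), (4.7, 1.4),
        (4.1, 1.3), (5.8, 2.2), (6.0, 2.5), (5.5, 2.1)]
labels = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3
tree = build_tree(rows, labels)
print(predict(tree, (4.4, 1.4)))
```

The several inputs are the columns of each row and the target is the class label, which is the same shape of problem the R and Datameer examples on the following slides address.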

Ease of Use

First, you need to code...

# Install and load the rpart package
install.packages("rpart")
library(rpart)
# Read the iris data and fit a classification tree on the four measurements
treeInput <- read.csv("/PathToData/iris.csv")
fit <- rpart(class ~ sepalLength + sepalWidth + petalLength + petalWidth, data=treeInput)
# Plot the fitted tree with observation counts at each node
par(mfrow=c(1,2), xpd=NA)
plot(fit)
text(fit, use.n=TRUE)

And then run the results...

And then write more code to scale...

In Datameer, you select the columns... And get the results

Ease of Use

First, select the data...

Second, you configure the settings...

And then see the results

In Datameer, you select the columns... And get the results

Quality Comparison

Tool | Iris | Wine | Breast Cancer Wisconsin
R | 92.66% | 86.47% | 92.86%
Weka | 95.33% | 89.33% | 93.5%
Datameer | 93.33% | 91.18% | 93.04%

Recommendations

Recommendations Overview

Increased revenue

Your customers expect them

What makes a good recommendation?

Combination of algorithms and Hadoop makes an effective recommendations platform achievable

Ease of Use

First, the setup...

# run factorization of ratings matrix
$MAHOUT parallelALS --input ${WORK_DIR}/dataset/trainingSet/ --output ${WORK_DIR}/als/out \
    --tempDir ${WORK_DIR}/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065 --numThreadsPerSolver 2

# compute recommendations
$MAHOUT recommendfactorized --input ${WORK_DIR}/als/out/userRatings/ --output ${WORK_DIR}/recommendations/ \
    --userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ \
    --numRecommendations 6 --maxRating 5 --numThreads 2

And then run the results...

1 [845:5.0,550:5.0,546:5.0,25:5.0,531:5.0,529:5.0,527:5.0,31:5.0,515:5.0,514:5.0]
2 [546:5.0,288:5.0,11:5.0,25:5.0,531:5.0,527:5.0,515:5.0,508:5.0,496:5.0,483:5.0]
3 [137:5.0,284:5.0,508:4.832,24:4.82,285:4.8,845:4.75,124:4.7,319:4.703,29:4.67,591:4.6]
4 [748:5.0,1296:5.0,546:5.0,568:5.0,538:5.0,508:5.0,483:5.0,475:5.0,471:5.0,876:5.0]
5 [732:5.0,550:5.0,9:5.0,546:5.0,11:5.0,527:5.0,523:5.0,514:5.0,511:5.0,508:5.0]
6 [739:5.0,9:5.0,546:5.0,11:5.0,25:5.0,531:5.0,528:5.0,527:5.0,526:5.0,521:5.0]

In Datameer, you select the columns... And get the results
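
The raw recommendation output on this slide is one line per user: a user ID followed by a bracketed list of item:score pairs. Turning that into something usable takes a further parsing step, sketched below. The function name `parse_recommendations` is hypothetical, and the sketch assumes whitespace between the user ID and the list, as shown on the slide.

```python
def parse_recommendations(line):
    """Split 'user [item:score,item:score,...]' into (user_id, [(item_id, score), ...])."""
    user, items = line.split(None, 1)
    pairs = []
    for entry in items.strip().strip("[]").split(","):
        if not entry:
            continue  # tolerate an empty recommendation list
        item_id, score = entry.split(":")
        pairs.append((int(item_id), float(score)))
    return int(user), pairs

# The line for user 3, copied from the slide.
line = "3 [137:5.0,284:5.0,508:4.832,24:4.82,285:4.8,845:4.75,124:4.7,319:4.703,29:4.67,591:4.6]"
user, recs = parse_recommendations(line)
print(user, recs[:3])
```

Item IDs still need to be joined back to item names before the results are readable, which is part of the extra work the slide is pointing at.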

Quality Comparison

User | Shawshank | Godfather | Pulp Fiction | Fight Club
Dianna | 4.76 | 4.98 | 1.95 | 2.44
Jon | 1.99 | 2.51 | 2.87 | 4.83
Karen | 3.28 | 4.72 | 1.89 | 2.95
Elliott | 2.92 | 3.64 | 2.97 | 4.83

Same Results

Best Practices

Big Data Analytics Process

Integrate

Prepare and Analyze

Visualize

Define

Deploy

Ad Hoc

Production

Clustering

• Leverage Hierarchies
• If possible, use numbering schemes
• Scale the surrogate key of attributes
• Try different cluster sizes
• Avoid numeric similarities when building your data
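
One possible reading of "scale the surrogate key of attributes" is that numeric keys with very different ranges should be brought onto a common scale before distances are calculated, so that no single attribute dominates. The sketch below shows min-max scaling as one way to do that. This interpretation, the function name `min_max_scale`, and the third sample row are assumptions, not something the deck spells out.

```python
def min_max_scale(rows):
    """Rescale every column of a numeric table to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        # A column with a single distinct value has span 0 and is mapped to 0.0.
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]

# User, Location, Company keys: the first two rows follow the earlier
# numeric example, the third row is made up.
rows = [[1001, 1, 101], [1002, 2, 102], [1005, 3, 101]]
scaled = min_max_scale(rows)
print(scaled)
```

After scaling, a difference of one location key and a difference of one company key contribute comparably to a distance calculation, instead of being swamped by the four-digit user keys.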

Recommendations

• Leverage a combination of algorithms
• Clustering is your friend!
• Treat cold start situations differently
• Think about ranking
• Don’t let recommendations go wild

[Diagram labels: Item-Based, K-Means: Similar, Item Similarity, Best Recommendations]
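
The item similarity idea on this slide can be illustrated with cosine similarity between item rating vectors. The sketch below reuses the four users and four movies from the earlier Quality Comparison table as sample data. Cosine similarity is an assumed choice here, since the deck does not name a similarity measure, and the function names are hypothetical.

```python
import math

# Ratings per user, taken from the Quality Comparison table.
ratings = {
    "Dianna":  {"Shawshank": 4.76, "Godfather": 4.98, "Pulp Fiction": 1.95, "Fight Club": 2.44},
    "Jon":     {"Shawshank": 1.99, "Godfather": 2.51, "Pulp Fiction": 2.87, "Fight Club": 4.83},
    "Karen":   {"Shawshank": 3.28, "Godfather": 4.72, "Pulp Fiction": 1.89, "Fight Club": 2.95},
    "Elliott": {"Shawshank": 2.92, "Godfather": 3.64, "Pulp Fiction": 2.97, "Fight Club": 4.83},
}

def item_vector(item):
    """Ratings for one item across all users, in a fixed user order."""
    return [ratings[user][item] for user in sorted(ratings)]

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_similar(item):
    """Return the other item whose rating vector is closest to `item`."""
    others = {i for user in ratings.values() for i in user if i != item}
    return max(others, key=lambda o: cosine(item_vector(item), item_vector(o)))

print(most_similar("Shawshank"))
print(most_similar("Fight Club"))
```

On this data, Godfather comes out closest to Shawshank and Pulp Fiction closest to Fight Club, the kind of item-to-item relationship that can then be combined with clustering and ranking as the bullets suggest.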

Process Best Practices

Iterate | Map | Chain

Demonstration

Polling Question II

Return on Investment

Use Case | Result
Funnel Optimization | Increase customer conversion by 3x
Behavioral Analytics | Increase revenue by 2x
Fraud Prevention | Identify $2B in potential fraud
EDW Optimization | 98% OpEx savings, $1M+ CapEx savings
Customer Segmentation | Lower customer acquisition costs by 30%

Call to Action

Workshop

Contact
• Elliott Cordo elliott@casertaconcepts.com
• Karen Hsu khsu@datameer.com