Top 3 Considerations for Machine Learning on Big Data

© 2013 Datameer, Inc. All rights reserved.


Top 3 Things to Consider with Machine Learning on Big Data

Karen Hsu, Elliott Cordo


About our Speakers

Karen Hsu

• Karen is Senior Director, Product Marketing at Datameer. With over 15 years of experience in enterprise software, Karen Hsu has co-authored 4 patents and worked in a variety of engineering, marketing and sales roles.

• Most recently she came from Informatica, where she worked with the start-ups Informatica purchased to bring data quality, master data management, B2B and data security solutions to market.

• Karen has a Bachelor of Science degree in Management Science and Engineering from Stanford University.


About our Speakers

Elliott Cordo

• Elliott is a data warehouse and information management expert. He brings more than a decade of experience in implementing data solutions, with hands-on experience in every component of the data warehouse software development lifecycle.

• At Caserta Concepts, Elliott oversees large-scale major technology projects, including those involving business intelligence, data analytics, Big Data and data warehousing.


Drivers & Challenges | Use Cases | Key Criteria | Best Practices | Next Steps

Drivers & Challenges


[Two line charts, quarterly from 12/31/09 to 03/21/13, y-axis $0 to $300: Amazon vs Barnes & Noble; NetFlix vs Blockbuster]

Big Data Analytics Drives Results

Big Data Drives Results


Data Mining
• Hard to use
• Requires PhD experts
• Must write code
• Expensive

Traditional BI
• Fixed DW models
• Must write code for analytics
• Very high IT labor costs
• Not agile

Visualization
• Easy for small teams
• Can’t manage large data volume
• Lack support of advanced analytics

Alternatives Are Lacking


$1M+ in Salaries

Job Title                        Bay Area         New York
IT Project Manager               $140,000.00      $126,000.00
System Administrator             $117,000.00      $105,000.00
Network Administrator            $119,000.00      $107,000.00
Database Administrator           $125,000.00      $119,000.00
IT Security Manager              $116,000.00      $104,000.00
Business Intelligence Analyst    $137,000.00      $133,000.00
Data Scientist                   $138,000.00      $133,000.00
Java Developer                   $136,000.00      $133,000.00
QA Engineer                      $120,000.00      $114,000.00
Total                            $1,148,000.00    $1,074,000.00

$1M+ in Capital

Solution          Cost / 100TB
Teradata EDW      $1,650,000.00
Oracle Exadata    $1,400,000.00
IBM Netezza       $1,000,000.00

Costs of Building Can be $1M+

Use Cases


Use Case: What is Revealed

• Profiling and segmentation: Customer, product, market characteristics and segments

• Acquisition and retention: What leads a person to become a customer or stop being a customer

• Product development and operations optimization: What led to product or network failure

• Campaign management: Patterns of successful campaigns

• Cross-sell / up-sell: Recommendations on services, products, or advisors for a given user/customer profile

Use Cases


Industry: Use Case

Financial Services
• Show correlation between services purchased and investments/trades made
• Identify customer segments
• Recommendations for research articles to drive trading

eCommerce
• Show types of events a person will like
• Decision tree based on likelihood to click through
• Recommendations for a large “cold start” population

Gaming
• Clustering for user profiles
• Correlation between attributes of a game and behavior
• Churn analysis

Healthcare
• Recommend tests or other offerings
• Identify factors/trends that lead to disease

Customer Examples

Polling Question I

Key Criteria


Ease of Use Quality

Clustering


K-Means

• K-means is a popular and versatile general-purpose clustering algorithm.

• Commonly used to group people and objects together to form segments

• Often leveraged to enhance recommendation and search systems

How it works

1. Treats items as coordinates

2. Places a number of random “centroids” and assigns the nearest items

3. Moves the centroids around based on average location

4. Process repeats until the assignments stop changing

*Diagram from Programming Collective Intelligence by Toby Segaran

Clustering Overview
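
The four "How it works" steps on this slide can be sketched in a few lines of plain Python. This is only an illustrative, single-machine sketch, not how Datameer or Mahout implement K-means; the function name `kmeans`, the use of squared Euclidean distance, and the choice to seed centroids from randomly picked items are assumptions made for the example.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Steps 1-2: items are coordinates; start with k centroids taken
    # from randomly chosen items (an assumption of this sketch).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign every item to its nearest centroid
        # (squared Euclidean distance).
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centers = kmeans(data, k=2)
    print(labels, centers)
```

On small, well-separated data like the sample above, the loop typically settles within a few passes; on large data sets the same loop is what distributed implementations spread across a cluster.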


First, the set up...

And then run the results...

In Datameer, you select the columns... And get the results

And the quality of results increases with larger data sets…

Ease of Use

And write additional code to scale...


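    # Project the four iris measurements onto principal components for plotting
    pca <- princomp(iris[1:4])
    # Cluster the same measurements into 3 groups with k-means
    colors <- kmeans(iris[1:4], 3)$cluster
    # Plot the first two components, colored by cluster
    plot(pca$scores[,1], pca$scores[,2], col=colors, pch=5)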

First, you have to set up...



And then write more code to scale...

Ease of Use


Second, you need to create the cluster...

First, select the data...

And then see the results


Ease of Use


1. First, a dataset’s attributes must be converted to numeric representations

User       Location      Company     Favorite Algo
Elliott    New Jersey    Caserta     K-Means
Karen      California    Datameer    K-Means

User       Location      Company     Favorite Algo
1001       1             101         1001
1002       2             102         1001

Ease of Use


2. This numeric dataset is then converted to a sequence file and then to sparse vectors, using seqdirectory and seq2sparse

3. Mahout is called; the number of clusters and the distance calculation are specified

    bin/mahout kmeans \
      -i /user/kmeans/vectors \
      -c /user/kmeans/input \
      -o /user/kmeans/output \
      -k 200 \
      -dm CosineSimilarity \
      -x 20 \
      -ow

4. The sparse vector output is then converted back to a delimited format

5. Textual attributes will be appended back to the record; numeric values are preserved for ad-hoc distance comparison of members within a cluster
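
Steps 1 and 5 (turning text attributes into numbers, then attaching the text back afterwards) can be sketched as follows. This is a minimal illustration, not the presenters' actual pipeline: the helper names `encode_columns` and `decode_row` are hypothetical, and ids are simply assigned from 1 per column, which differs from the numbering shown in the slide's example table.

```python
def encode_columns(rows, columns):
    """Replace each text value with an integer id, per column.

    Returns the numeric rows plus the lookup tables needed to
    translate the ids back to text later (step 5).
    """
    lookups = {col: {} for col in columns}
    encoded = []
    for row in rows:
        numeric_row = {}
        for col in columns:
            ids = lookups[col]
            # First time a value is seen it gets the next unused id.
            numeric_row[col] = ids.setdefault(row[col], len(ids) + 1)
        encoded.append(numeric_row)
    return encoded, lookups


def decode_row(numeric_row, lookups):
    """Append the original text back to a numeric record."""
    decoded = {}
    for col, value_id in numeric_row.items():
        reverse = {i: text for text, i in lookups[col].items()}
        decoded[col] = reverse[value_id]
    return decoded


if __name__ == "__main__":
    people = [
        {"User": "Elliott", "Location": "New Jersey",
         "Company": "Caserta", "Favorite Algo": "K-Means"},
        {"User": "Karen", "Location": "California",
         "Company": "Datameer", "Favorite Algo": "K-Means"},
    ]
    cols = ["User", "Location", "Company", "Favorite Algo"]
    numeric, tables = encode_columns(people, cols)
    print(numeric)
    print(decode_row(numeric[0], tables))
```

Keeping the lookup tables alongside the numeric rows is what makes step 5 possible: the clustering runs on numbers only, and the text is joined back once cluster ids are known.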


Quality Comparison

Column Dependencies


A    B
a    x
b    y
b    y
a    x
c    z
a    y

Column Dependency ~ 0.99

C    D
a    x
b    x
b    y
a    z
c    y
a    y

Column Dependency ~ 0.01

Value
• See how data is related after joining multiple sets of data
• See column dependencies on multiple types of data

Column Dependencies Overview
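
The slides do not say how Datameer computes its column dependency score, so the sketch below swaps in a different, well-known measure purely to illustrate the idea: the uncertainty coefficient (the share of one column's entropy that is explained by another column). The function names are hypothetical, and the values it produces are not expected to match the ~0.99 and ~0.01 figures shown on the slide.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (natural log) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def uncertainty_coefficient(a, b):
    """How much knowing column a reduces uncertainty about column b.

    Returns a value between 0 (no relationship) and 1 (b is fully
    determined by a). This is NOT Datameer's ColumnDependency formula.
    """
    h_b = entropy(b)
    if h_b == 0:
        return 1.0  # b is constant, so it is trivially determined
    groups = defaultdict(list)
    for x, y in zip(a, b):
        groups[x].append(y)
    # Entropy of b inside each group of a, weighted by group size.
    h_b_given_a = sum(len(g) / len(b) * entropy(g) for g in groups.values())
    return (h_b - h_b_given_a) / h_b


if __name__ == "__main__":
    col_a, col_b = list("abbaca"), list("xyyxzy")
    col_c, col_d = list("abbaca"), list("xxyzyy")
    print(uncertainty_coefficient(col_a, col_b))
    print(uncertainty_coefficient(col_c, col_d))
```

Run on the two example tables from the slide, this measure ranks the A/B pair as more dependent than the C/D pair, which is the same ordering the slide reports even though the absolute numbers differ.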


Quality Comparison

[Figure: scatter plots of Column B against Column A. The numeric panels are labeled ColumnDependency(A,B) = 0, ColumnDependency(A,B) = 0.5 and ColumnDependency(A,B) = 1; the panels plotting Column A (NUMBER) against Column B (STRING) include one labeled ColumnDependency(A,B) = 1.]

Decision Tree


Goal: Create a model that predicts the value of a target based on several inputs.

Decision Tree Overview
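
To make the goal concrete, here is a simplified, CART-style sketch of a decision tree for numeric inputs. It is an illustration only and not the algorithm used by Datameer, R's rpart, or Weka: it picks splits by Gini impurity, does no pruning, and the names `build_tree` and `predict` as well as the toy measurements are assumptions of the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Find the (feature, threshold) pair with the lowest weighted Gini."""
    best = None  # (score, feature_index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels, max_depth=3):
    """Return a nested dict of splits; leaves are plain class labels."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels)
    if split is None:
        return majority
    _, f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left], max_depth - 1),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right], max_depth - 1),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


if __name__ == "__main__":
    # Toy (petal length, petal width) style inputs, made up for illustration.
    X = [(1.4, 0.2), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5), (6.0, 2.5), (5.9, 2.1)]
    y = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
    model = build_tree(X, y)
    print(predict(model, (1.0, 0.2)))
```

The target here is the class label and the inputs are the numeric columns, which mirrors the `class ~ sepalLength + ...` formula style used in the R example.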


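    # Install and load the rpart package (recursive partitioning trees)
    install.packages("rpart")
    library(rpart)
    # Read the data and fit a tree that predicts class from the four measurements
    treeInput <- read.csv("/PathToData/iris.csv")
    fit <- rpart(class ~ sepalLength + sepalWidth + petalLength + petalWidth, data = treeInput)
    # Plot the fitted tree and label each node with its class counts
    par(mfrow = c(1, 2), xpd = NA)
    plot(fit)
    text(fit, use.n = TRUE)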

First, you need to code...


And then write more code to scale...


Ease of Use


Second, you configure the settings...

First, select the data...

And then see the results


Ease of Use


Quality Comparison

            Iris      Wine      Breast Cancer Wisconsin
R           92.66%    86.47%    92.86%
Weka        95.33%    89.33%    93.5%
Datameer    93.33%    91.18%    93.04%

Recommendations


Increased revenue

Your customers expect them

What makes a good recommendation?

A combination of algorithms and Hadoop makes an effective recommendations platform achievable

Recommendations Overview


    # run factorization of ratings matrix
    $MAHOUT parallelALS --input ${WORK_DIR}/dataset/trainingSet/ --output ${WORK_DIR}/als/out \
        --tempDir ${WORK_DIR}/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065 --numThreadsPerSolver 2

    # compute recommendations
    $MAHOUT recommendfactorized --input ${WORK_DIR}/als/out/userRatings/ --output ${WORK_DIR}/recommendations/ \
        --userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ \
        --numRecommendations 6 --maxRating 5 --numThreads 2

First, the set up...



    1 [845:5.0,550:5.0,546:5.0,25:5.0,531:5.0,529:5.0,527:5.0,31:5.0,515:5.0,514:5.0]
    2 [546:5.0,288:5.0,11:5.0,25:5.0,531:5.0,527:5.0,515:5.0,508:5.0,496:5.0,483:5.0]
    3 [137:5.0,284:5.0,508:4.832,24:4.82,285:4.8,845:4.75,124:4.7,319:4.703,29:4.67,591:4.6]
    4 [748:5.0,1296:5.0,546:5.0,568:5.0,538:5.0,508:5.0,483:5.0,475:5.0,471:5.0,876:5.0]
    5 [732:5.0,550:5.0,9:5.0,546:5.0,11:5.0,527:5.0,523:5.0,514:5.0,511:5.0,508:5.0]
    6 [739:5.0,9:5.0,546:5.0,11:5.0,25:5.0,531:5.0,528:5.0,527:5.0,526:5.0,521:5.0]

Ease of Use
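
Conceptually, once a ratings matrix has been factorized into user features and item features, a predicted rating is the dot product of the two feature vectors, and recommendations are the highest-scoring items a user has not rated yet. The sketch below illustrates that idea in plain Python; it is not Mahout's code, and the function name `recommend`, the data layout, and the sample vectors are assumptions made for the example.

```python
def recommend(user_vec, item_vecs, already_rated, num_recommendations, max_rating):
    """Score unrated items for one user and return the top ones.

    user_vec:   list of floats, the user's feature vector
    item_vecs:  dict of item id -> feature vector (same length as user_vec)
    """
    scored = []
    for item_id, item_vec in item_vecs.items():
        if item_id in already_rated:
            continue  # only suggest items the user has not rated yet
        # Predicted rating: dot product of user and item features.
        score = sum(u * m for u, m in zip(user_vec, item_vec))
        # Cap the estimate at the highest rating on the scale.
        scored.append((item_id, min(score, max_rating)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_recommendations]


if __name__ == "__main__":
    user = [1.0, 2.0]
    items = {10: [1.0, 1.0], 20: [3.0, 2.0], 30: [0.0, 0.5], 40: [2.0, 0.0]}
    print(recommend(user, items, already_rated={40},
                    num_recommendations=2, max_rating=5.0))
```

The `num_recommendations` and `max_rating` parameters play the same role as the `--numRecommendations` and `--maxRating` options in the command shown on the slide, and the returned (item, score) pairs have the same shape as the `item:score` entries in the sample output.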


Quality Comparison

           Shawshank    Godfather    Pulp Fiction    Fight Club
Dianna     4.76         4.98         1.95            2.44
Jon        1.99         2.51         2.87            4.83
Karen      3.28         4.72         1.89            2.95
Elliott    2.92         3.64         2.97            4.83

Same Results

Best Practices


Big Data Analytics Process

[Diagram labels: Integrate, Prepare and Analyze, Visualize, Define, Deploy; Ad Hoc, Production]


• Leverage Hierarchies

• If possible, use numbering schemes

• Scale the surrogate key of attributes

• Try different cluster sizes

• Avoid numeric similarities when building your data

Clustering
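
One reading of "scale the surrogate key of attributes" is that numeric ids should be brought onto a comparable range before clustering, so that a column with large ids does not dominate the distance calculation. The sketch below shows plain min-max scaling under that assumption; the function name `min_max_scale` and the sample ids are hypothetical, and other scaling or weighting schemes may be what the presenters had in mind.

```python
def min_max_scale(rows):
    """Rescale every column of a numeric table to the range 0..1."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A column with a single distinct value carries no
            # information for distance, so map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled


if __name__ == "__main__":
    # Hypothetical surrogate keys: user id, location id, company id.
    keys = [[1001, 1, 101], [1002, 2, 102], [1003, 3, 103]]
    print(min_max_scale(keys))
```

Scaling only makes the columns comparable; whether two ids being numerically close means anything still depends on how the ids were assigned, which is where the hierarchy and numbering-scheme advice above comes in.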


• Leverage a combination of algorithms

• Clustering is your friend!

• Treat cold start situations differently

• Think about ranking

• Don’t let recommendations go wild

[Diagram labels: Item-Based, K-Means: Similar, Item Similarity, Best Recommendations]

Recommendations


Process Best Practices

Iterate, Map, Chain

Demonstration

Polling Question II


• Funnel Optimization: Increase customer conversion by 3x

• Behavioral Analytics: Increase revenue by 2x

• Fraud Prevention: Identify $2B in potential fraud

• EDW Optimization: 98% OpEx savings, $1M+ CapEx savings

• Customer Segmentation: Lower customer acquisition costs by 30%

Return on Investment


Workshop

Contact

• Elliott Cordo [email protected]

• Karen Hsu [email protected]

Call to Action

Date post:	15-May-2015
Category:	Technology
Upload:	datameer