Date post: | 15-May-2015 |
Category: | Technology |
Upload: | datameer |
© 2013 Datameer, Inc. All rights reserved.
Top 3 Things to Consider with Machine Learning on Big Data
Karen Hsu, Elliott Cordo
About our Speakers: Karen Hsu
• Karen is Senior Director, Product Marketing at Datameer. With over 15 years of experience in enterprise software, she has co-authored 4 patents and worked in a variety of engineering, marketing and sales roles.
• Most recently she came from Informatica, where she worked with the start-ups Informatica purchased to bring data quality, master data management, B2B and data security solutions to market.
• Karen has a Bachelor of Science degree in Management Science and Engineering from Stanford University.
About our Speakers: Elliott Cordo
• Elliott is a data warehouse and information management expert. He brings more than a decade of experience in implementing data solutions, with hands-on experience in every component of the data warehouse software development lifecycle.
• At Caserta Concepts, Elliott oversees large-scale major technology projects, including those involving business intelligence, data analytics, Big Data and data warehousing.
Drivers & Challenges | Use Cases | Key Criteria | Best Practices | Next Steps
Drivers & Challenges
[Chart: Amazon vs Barnes & Noble; y-axis $0 to $300, x-axis quarterly from 12/31/09 to 03/21/13]
[Chart: Netflix vs Blockbuster; y-axis $0 to $300, x-axis quarterly from 12/31/09 to 03/21/13]
Big Data Analytics Drives Results
Big Data Drives Results
Data Mining
• Hard to use
• Requires PhD experts
• Must write code
• Expensive
Traditional BI
• Fixed DW models
• Must write code for analytics
• Very high IT labor costs
• Not agile
Visualization
• Easy for small teams
• Can’t manage large data volume
• Lack support of advanced analytics
Alternatives Are Lacking
$1M+ in Salaries
Job Title | Bay Area | New York
IT Project Manager | $140,000.00 | $126,000.00
System Administrator | $117,000.00 | $105,000.00
Network Administrator | $119,000.00 | $107,000.00
Database Administrator | $125,000.00 | $119,000.00
IT Security Manager | $116,000.00 | $104,000.00
Business Intelligence Analyst | $137,000.00 | $133,000.00
Data Scientist | $138,000.00 | $133,000.00
Java Developer | $136,000.00 | $133,000.00
QA Engineer | $120,000.00 | $114,000.00
Total | $1,148,000.00 | $1,074,000.00
$1M+ in Capital
Solution | Cost / 100TB
Teradata EDW | $1,650,000.00
Oracle Exadata | $1,400,000.00
IBM Netezza | $1,000,000.00
Costs of Building Can be $1M+
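As a quick check of the "$1M+ in Salaries" figure, the nine roles on this slide can be summed per region. The snippet below is only a sketch in Python; the dictionary and variable names are illustrative, and the amounts are the ones listed on the slide.

```python
# Annual salaries as listed on the slide, in US dollars.
bay_area = {
    "IT Project Manager": 140_000,
    "System Administrator": 117_000,
    "Network Administrator": 119_000,
    "Database Administrator": 125_000,
    "IT Security Manager": 116_000,
    "Business Intelligence Analyst": 137_000,
    "Data Scientist": 138_000,
    "Java Developer": 136_000,
    "QA Engineer": 120_000,
}
new_york = {
    "IT Project Manager": 126_000,
    "System Administrator": 105_000,
    "Network Administrator": 107_000,
    "Database Administrator": 119_000,
    "IT Security Manager": 104_000,
    "Business Intelligence Analyst": 133_000,
    "Data Scientist": 133_000,
    "Java Developer": 133_000,
    "QA Engineer": 114_000,
}

# One person per role: the team total is the plain sum of the salaries.
bay_area_total = sum(bay_area.values())
new_york_total = sum(new_york.values())
print(f"Bay Area: ${bay_area_total:,}  New York: ${new_york_total:,}")
```

Both totals land above $1M, which is where the slide's headline number comes from.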
Use Cases
Use Case | What is Revealed
Profiling and segmentation | Customer, product, market characteristics and segments
Acquisition and retention | What leads a person to become a customer or stop being a customer
Product development and operations optimization | What led to product or network failure
Campaign management | Patterns of successful campaigns
Cross-sell / up-sell | Recommendations on services, products, or advisors for a given user/customer profile
Use Cases
Industry | Use Case
Financial Services
• Show correlation between services purchased and investments/trades made
• Identify customer segments
• Recommendations for research articles to drive trading
eCommerce
• Show types of events person will like
• Decision tree based on likelihood to click through
• Recommendations for a large “cold start” population
Gaming
• Clustering for user profiles
• Correlation between attributes of a game and behavior
• Churn analysis
Healthcare
• Recommend tests or other offerings
• Identify factors/trends that lead to disease
Customer Examples
Polling Question I
Key Criteria
Ease of Use | Quality
Clustering
K-Means
• K-means is a popular and versatile general-purpose clustering algorithm.
• Commonly used to group people and objects together to form segments
• Often leveraged to enhance recommendation and search systems
How it works
1. Treats items as coordinates
2. Places a number of random “centroids” and assigns the nearest items
3. Moves the centroids around based on average location
4. Process repeats until the assignments stop changing
*Diagram from Programming Collective Intelligence by Toby Segaran
Clustering Overview
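The four "How it works" steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by R, Mahout or Datameer. It assumes squared Euclidean distance and picks the starting centroids from the items themselves; the names `kmeans`, `max_iters` and `seed` are hypothetical.

```python
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster coordinate tuples into k groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Step 1: every item is already a coordinate tuple.
    # Step 2: place k centroids at randomly chosen items (an assumption of this sketch).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iters):
        # Step 2 (continued): assign each item to its nearest centroid.
        new_assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Made-up sample data: two well-separated groups of three items each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

The cluster numbers themselves are arbitrary; what matters is which items share a number. A production run would also need a strategy for empty clusters and for choosing k.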
First, the set up...
And then run the results...
In Datameer, you select the columns... And get the results
And the quality of results increases with larger data sets…
Ease of Use
And write additional code to scale...
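# Project the four iris measurements onto principal components for plotting
pca <- princomp(iris[1:4])
# Cluster the same four columns into 3 groups and keep the cluster labels
colors <- kmeans(iris[1:4], 3)$cluster
# Plot the first two components, colored by cluster
plot(pca$scores[,1], pca$scores[,2], col=colors, pch=5)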
First, you have to set up...
And then run the results...
In Datameer, you select the columns... And get the results
And then write more code to scale...
Ease of Use
First, select the data...
Second, you need to create the cluster...
And then see the results
In Datameer, you select the columns... And get the results
Ease of Use
User | Location | Company | Favorite Algo
Elliott | New Jersey | Caserta | K-Means
Karen | California | Datameer | K-Means

User | Location | Company | Favorite Algo
1001 | 1 | 101 | 1001
1002 | 2 | 102 | 1001

1. First, a dataset’s attributes must be converted to numeric representations.
2. This numeric dataset is then converted to a sequence file, and then to sparse vectors, using seqdirectory and seq2sparse.
3. Mahout is called, specifying the number of clusters and the distance calculation:
bin/mahout kmeans \
  -i /user/kmeans/vectors \
  -c /user/kmeans/input \
  -o /user/kmeans/output \
  -k 200 \
  -dm CosineSimilarity \
  -x 20 \
  -ow
4. The sparse vector output is then converted back to a delimited format.
5. Textual attributes will be appended back to the record, with numeric values preserved for ad-hoc distance comparison of members within a cluster.
*Diagram from Programming Collective Intelligence by Toby Segaran
In Datameer, you select the columns... And get the results
Ease of Use
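Steps 1 and 5 on this slide (turning text attributes into numbers, then attaching the text back afterwards) can be sketched as a per-column lookup. This is an illustration only, not the speakers' actual pipeline; the helpers `encode_columns` and `decode_row` and the `start_ids` numbering are assumptions, chosen so the output matches the numeric table on the slide.

```python
def encode_columns(rows, columns, start_ids):
    """Replace each distinct text value with an integer ID, column by column."""
    lookups = {col: {} for col in columns}
    encoded = []
    for row in rows:
        numeric_row = []
        for col, value in zip(columns, row):
            lookup = lookups[col]
            if value not in lookup:
                # Next unused ID for this column, counting up from its start ID.
                lookup[value] = start_ids[col] + len(lookup)
            numeric_row.append(lookup[value])
        encoded.append(numeric_row)
    return encoded, lookups


def decode_row(numeric_row, columns, lookups):
    """Map a numeric row back to its original text values (step 5)."""
    reverse = {col: {n: text for text, n in lookups[col].items()} for col in columns}
    return [reverse[col][n] for col, n in zip(columns, numeric_row)]


columns = ["User", "Location", "Company", "Favorite Algo"]
rows = [
    ["Elliott", "New Jersey", "Caserta", "K-Means"],
    ["Karen", "California", "Datameer", "K-Means"],
]
# Hypothetical starting IDs per column, picked to reproduce the slide's numbers.
start_ids = {"User": 1001, "Location": 1, "Company": 101, "Favorite Algo": 1001}

encoded, lookups = encode_columns(rows, columns, start_ids)
print(encoded)
print(decode_row(encoded[1], columns, lookups))
```

Keeping the lookups around is what makes step 5 possible: the clustered numeric records can be joined back to readable names after Mahout has run.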
Quality Comparison
Column Dependencies
A | B
a | x
b | y
b | y
a | x
c | z
a | y
Column Dependency ~ 0.99

C | D
a | x
b | x
b | y
a | z
c | y
a | y
Column Dependency ~ 0.01

Value
• See how data is related after joining multiple sets of data
• See column dependencies on multiple types of data
Column Dependencies Overview
Quality Comparison
[Scatter plots, Column A vs Column B: four panels with ColumnDependency(A,B) = 0, 0.5, 0.5 and 1]
[Plots, Column A (NUMBER) vs Column B (STRING): two panels with ColumnDependency(A,B) = 0.5 and 1]
Decision Tree
Goal: Create a model that predicts the value of a target based on several inputs.
Decision Tree Overview
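To make the goal concrete, here is a sketch of the smallest possible decision tree: a single split (a "decision stump") chosen to misclassify the fewest training rows. It is not how rpart, Weka or Datameer build trees, which use more refined split criteria and multiple levels. The helper names and the sample values are made up for illustration and are not taken from the iris file.

```python
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def best_stump(rows, labels):
    """Find the (feature, threshold) split with the fewest misclassified rows."""
    best = None
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds sit halfway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            left_label, right_label = majority(left), majority(right)
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best["errors"]:
                best = {"feature": f, "threshold": threshold,
                        "left": left_label, "right": right_label, "errors": errors}
    return best


def predict(stump, row):
    """Predict the target for one row of inputs."""
    if row[stump["feature"]] <= stump["threshold"]:
        return stump["left"]
    return stump["right"]


# Made-up inputs (petalLength, petalWidth) and target classes.
rows = [(1.4, 0.2), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5)]
labels = ["setosa", "setosa", "versicolor", "versicolor"]
stump = best_stump(rows, labels)
print(stump["feature"], predict(stump, (1.5, 0.3)), predict(stump, (5.0, 1.7)))
```

A full decision tree repeats this split search recursively on each side until the groups are pure enough or a depth limit is reached.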
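install.packages("rpart")  # one-time install of the rpart package
library(rpart)
treeInput <- read.csv("/PathToData/iris.csv")
# Fit a classification tree predicting class from the four measurements
fit <- rpart(class ~ sepalLength + sepalWidth + petalLength + petalWidth, data = treeInput)
par(mfrow = c(1, 2), xpd = NA)
plot(fit)
text(fit, use.n = TRUE)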
First, you need to code...
And then run the results...
And then write more code to scale...
In Datameer, you select the columns... And get the results
Ease of Use
First, select the data...
Second, you configure the settings...
And then see the results
In Datameer, you select the columns... And get the results
Ease of Use
Quality Comparison
Tool | Iris | Wine | Breast Cancer Wisconsin
R | 92.66% | 86.47% | 92.86%
Weka | 95.33% | 89.33% | 93.5%
Datameer | 93.33% | 91.18% | 93.04%
Recommendations
Increased revenue
Your customers expect them
What makes a good recommendation?
The combination of algorithms and Hadoop makes an effective recommendations platform achievable
Recommendations Overview
# run factorization of ratings matrix
$MAHOUT parallelALS --input ${WORK_DIR}/dataset/trainingSet/ --output ${WORK_DIR}/als/out \
    --tempDir ${WORK_DIR}/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065 --numThreadsPerSolver 2

# compute recommendations
$MAHOUT recommendfactorized --input ${WORK_DIR}/als/out/userRatings/ --output ${WORK_DIR}/recommendations/ \
    --userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ \
    --numRecommendations 6 --maxRating 5 --numThreads 2
First, the set up...
And then run the results...
In Datameer, you select the columns... And get the results
1 [845:5.0,550:5.0,546:5.0,25:5.0,531:5.0,529:5.0,527:5.0,31:5.0,515:5.0,514:5.0]
2 [546:5.0,288:5.0,11:5.0,25:5.0,531:5.0,527:5.0,515:5.0,508:5.0,496:5.0,483:5.0]
3 [137:5.0,284:5.0,508:4.832,24:4.82,285:4.8,845:4.75,124:4.7,319:4.703,29:4.67,591:4.6]
4 [748:5.0,1296:5.0,546:5.0,568:5.0,538:5.0,508:5.0,483:5.0,475:5.0,471:5.0,876:5.0]
5 [732:5.0,550:5.0,9:5.0,546:5.0,11:5.0,527:5.0,523:5.0,514:5.0,511:5.0,508:5.0]
6 [739:5.0,9:5.0,546:5.0,11:5.0,25:5.0,531:5.0,528:5.0,527:5.0,526:5.0,521:5.0]
Ease of Use
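The recommendation output on this slide has one line per user: a user ID followed by a bracketed list of item:score pairs. A small Python sketch for reading such lines is shown below. The function name `parse_recommendation_line` is hypothetical, and the format is assumed from the sample output above rather than from Mahout documentation.

```python
def parse_recommendation_line(line):
    """Split 'user [item:score,item:score,...]' into (user_id, [(item_id, score), ...])."""
    # The user ID and the bracketed list are separated by whitespace.
    user_id, items = line.strip().split(None, 1)
    pairs = [p for p in items.strip("[]").split(",") if p]
    recommendations = []
    for pair in pairs:
        item_id, score = pair.split(":")
        recommendations.append((int(item_id), float(score)))
    return int(user_id), recommendations


# Shortened version of the third output line on the slide.
user, recs = parse_recommendation_line("3 [137:5.0,284:5.0,508:4.832]")
print(user, recs)
```

Once parsed, the item IDs still have to be joined back to item names before the recommendations are usable by a person or an application.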
Quality Comparison
User | Shawshank | Godfather | Pulp Fiction | Fight Club
Dianna | 4.76 | 4.98 | 1.95 | 2.44
Jon | 1.99 | 2.51 | 2.87 | 4.83
Karen | 3.28 | 4.72 | 1.89 | 2.95
Elliott | 2.92 | 3.64 | 2.97 | 4.83
Same Results
Best Practices
Big Data Analytics Process
Integrate
Prepare and Analyze
Visualize
Define
Deploy
Ad Hoc
Production
• Leverage Hierarchies
• If possible, use numbering schemes
• Scale the surrogate key of attributes
• Try different cluster sizes
• Avoid numeric similarities when building your data
Clustering
• Leverage a combination of algorithms
• Clustering is your friend!
• Treat cold start situations differently
• Think about ranking
• Don’t let recommendations go wild
Item-Based K-Means:Similar
Item Similarity
Best Recommendations
Recommendations
Process Best Practices
Iterate, Map, Chain
Demonstration
Polling Question II
Funnel Optimization: Increase customer conversion by 3x
Behavioral Analytics: Increase revenue by 2x
Fraud Prevention: Identify $2B in potential fraud
EDW Optimization: 98% OpEx savings, $1M+ CapEx savings
Customer Segmentation: Lower customer acquisition costs by 30%
Return on Investment
Workshop
Contact
• Elliott Cordo [email protected]
• Karen Hsu [email protected]
Call to Action