A Database-Hadoop Hybrid Approach to Scalable Machine Learning
Makoto Yui, Isao Kojima (AIST, Japan)
June 30, 2013, IEEE BigData Congress 2013, Santa Clara
Outline
1. Motivation & Problem Description
2. Our Hybrid Approach to Scalable Machine Learning
Architecture
Our batch learning scheme on Hive
3. Experimental Evaluation
4. Conclusions and Future Directions
As we saw in the keynote and the second-day panel discussion of this conference, data analytics and machine learning are clearly attracting more attention along with Big Data.

Suppose, then, that you are a developer and your manager is willing to ..

What are the possible choices out there?
Two popular schools of thought for performing large-scale machine learning that does not fit in memory:

In-database analytics
- MADlib (open-source project led by Greenplum)
- Bismarck (project at wisc.edu, SIGMOD'12)
- SAS In-Database Analytics
- Fuzzy Logix (Sybase), and more

Machine learning on Hadoop
- Apache Mahout
- Vowpal Wabbit (open-source project at Microsoft Research)
- In-house analytical tools, e.g., at Twitter (SIGMOD'12)
Four issues need to be considered:

1. Scalability ― scalability is always a problem when handling Big Data.
2. Data movement ― moving data becomes a critical issue as dataset sizes shift from terabytes to petabytes and beyond.
3. Transactions ― considering transactions is important for real-time/online prediction, because most transaction records, which are valuable for prediction, are stored in relational databases.
4. Latency and throughput ― latency and throughput are the key issues for achieving online prediction and/or real-time analytics.
Which is better?

Comparison axes: scalability, data movement, transactions, latency, and throughput.

Machine learning on Hadoop
+ Fault tolerance, straggler-node handling, and scale-out
+ HDFS is useful for append-only and archiving purposes
+ ETL processing (feature engineering)
+ Batch processing (high throughput)
- High-latency bottleneck in the job-submission process

In-database analytics
+ An RDBMS is reliable as a transactional data store
+ Small-fraction updates
+ Index lookup for online prediction
+ Incremental learning for each training instance

It depends on where the data is initially stored and the purposes of using the data.
Idea behind the DB-Hadoop Hybrid Approach

Combine batch learning on Hadoop with incremental learning and prediction in a relational database, so that the two sides together cover scalability, data movement, transactions, latency, and throughput.

(Just an illustration, you know. Next, we will see what happens inside the box.)
Inside the box (an overview) ― how to combine them

[Diagram: OLTP transactions flow into Postgres; training data is trickled from Postgres to the Hadoop cluster (many nodes); batch learning on the cluster produces a prediction model, which is brought back to Postgres for incremental learning, implemented as a database stored procedure.]

Trickle training data to Hadoop HDFS little by little, and bring back prediction models periodically.
The Detailed Architecture ― Data to Prediction Cycle

[Diagram: the source database trickles updates into a staging table in the relational database; the queued updates are pulled periodically into a training-data sink on the Hadoop cluster; a batch learning process builds a prediction model; the prediction model is exported back to the relational database, where the incremental learner selects the latest model, applies transactional updates, and inserts a new up-to-date model used for online prediction.]

Users can control the flow considering their requirements and performance. Real-time prediction is possible using database triggers on the staging table.

The workflow consists of continuous and independent processes.
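The incremental learner applies one update per incoming record. Below is a minimal sketch of such a learner, assuming online logistic regression trained by stochastic gradient descent over binary feature ids; the learning rate, the dict-based sparse model, and the toy records are illustrative assumptions, not the paper's actual stored procedure.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def update(model: dict, features: list, label: int, eta: float = 0.1) -> float:
    """Apply one SGD step for a single training record.

    model    -- sparse weight vector: feature id -> weight (assumed layout)
    features -- ids of the binary features present in the record
    label    -- 1 or 0 (e.g., clicked / not clicked)
    Returns the predicted probability before the update.
    """
    z = sum(model.get(f, 0.0) for f in features)
    p = sigmoid(z)
    grad = label - p                      # gradient of the log-likelihood
    for f in features:
        model[f] = model.get(f, 0.0) + eta * grad
    return p

# Trickle a few records through the learner, mimicking transactional updates.
model = {}
for feats, y in [([1, 2], 1), ([1, 7, 9], 1), ([2, 7, 9], 0)]:
    update(model, feats, y)
```

Because each update touches only the weights of the features in one record, this style of learner fits naturally inside a stored procedure fired per staged row.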
Existing Approach for Parallel Batch Learning ― Machine Learning as User-Defined Aggregates (UDAF)

[Diagram: each mapper runs a train() UDAF over tuple<label, array<features>> rows of the training table, producing a partial model as array<sum of weights> and array<count>; partial results are merged pairwise, and a final merge produces the prediction model as array<weight>.]

Problems observed:
- Bottleneck in the final merge: scalability is limited by the maximum fan-out of the final merge.
- Scalar aggregates computing a large single result are not suitable for shared-nothing settings.
- Parallel aggregation with an aggregate tree to merge prediction models (as in Google Dremel) is not supported in Hadoop/MapReduce.

Even though MPP databases and Hive parallelize user-defined aggregates, the above problems prevent using them.
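The UDAF scheme above can be sketched as follows, assuming each worker's partial state is a per-feature (sum-of-weights, count) pair and that merging simply adds states; the dict layout and toy numbers are illustrative assumptions.

```python
def merge(a: dict, b: dict) -> dict:
    """Combine two partial aggregate states: feature -> (sum, count)."""
    out = dict(a)
    for feature, (s, c) in b.items():
        s0, c0 = out.get(feature, (0.0, 0))
        out[feature] = (s0 + s, c0 + c)
    return out

def final_merge(partials: list) -> dict:
    """The final merge: every partial flows into one node -- the bottleneck."""
    state = {}
    for p in partials:
        state = merge(state, p)
    return {f: s / c for f, (s, c) in state.items()}

# Two workers each trained on a split of the data (toy partial models).
worker1 = {1: (0.4, 2), 2: (0.1, 1)}
worker2 = {1: (0.2, 1), 3: (-0.3, 1)}
model = final_merge([worker1, worker2])   # feature 1 -> ~0.2
```

The sketch makes the scalability limit visible: `final_merge` is a serial loop over all partial models, so its cost grows with the number of workers.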
Purely Relational Approach for Parallel Learning

Key points:
- Implemented the trainer as a set-returning function (UDTF) instead of a UDAF ― a purely relational way that scales on MPP databases and on Hive/Hadoop.
- Run trainers independently on mappers and aggregate the results on reducers, shuffling by feature to the reducers. Embarrassingly parallel, as the number of mappers and reducers is controllable.

[Diagram: each mapper runs a train() UDTF over tuple<label, array<features>> rows of the training table, emitting tuple<feature, weight> rows; the rows are shuffled by feature to reducers, which perform parameter mixing (param-mix) and output the prediction model as Relation<feature, weight>.]

Our solution for parallel machine learning on Hadoop/Hive:

SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT trainLogistic(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

Parameter mixing: K. B. Hall et al., Proc. NIPS Workshop on Learning on Cores, Clusters, and Clouds, 2010.
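The GROUP BY/avg query above can be mimicked in a few lines, assuming map-side trainers emit (feature, weight) rows and the shuffle groups rows by feature before averaging (parameter mixing); the toy trainer outputs are illustrative assumptions.

```python
from collections import defaultdict

def param_mix(rows):
    """GROUP BY feature, avg(weight) -- each feature group is independent,
    so reducers can average their groups in parallel."""
    groups = defaultdict(list)
    for feature, weight in rows:        # shuffle by feature
        groups[feature].append(weight)
    return {f: sum(ws) / len(ws) for f, ws in groups.items()}

# (feature, weight) rows emitted by two independent map-side trainers.
mapper1 = [(1, 0.30), (2, 0.10), (9, -0.20)]
mapper2 = [(1, 0.10), (3, 0.40), (9, -0.40)]
model = param_mix(mapper1 + mapper2)    # feature 1 -> 0.2
```

Unlike the UDAF scheme, no single node ever holds the whole model during aggregation: each reducer sees only the weights for its own features.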
Experimental Evaluation

1. Compared the performance of our batch learning scheme to state-of-the-art machine learning techniques, namely Bismarck and Vowpal Wabbit.
2. Conducted an online prediction scenario to measure the latency and throughput of our incremental learning scheme. Given a prediction model created with 80% of the training data by batch learning, the rest of the data (20%) is supplied for incremental learning.

Dataset: the KDD Cup 2012, Track 2 dataset, one of the largest publicly available datasets for machine learning, provided by a commercial search-engine provider. The task is predicting click-through rates of search-engine ads. The training data is about 235 million records, 23 GB in total.

Experimental environment: 33 in-house commodity servers (32 slave nodes for Hadoop), each equipped with 8 processors and 24 GB of memory.
Performance Evaluation of Batch Learning

Our batch learning scheme on Hive is 5x and 7.65x faster than Vowpal Wabbit and Bismarck, respectively.

[Chart: elapsed times showing the 5x and 7.65x speedups; the AUC value (green bar) represents prediction accuracy.]

Throughput: 2.3 million tuples/sec on 32 nodes. Latency: 96 sec for training 235 million records of 23 GB.

CAUTION: the detailed numbers and settings can be found in our paper.
Performance Analysis in the Evaluation

[Diagram annotated with timings:]
- Incremental learner: low latency (5 s) under moderate updates (70,000 tuples/sec).
- Batch learning: 96 s for training; excellent throughput (2.3 million tuples/sec) on 32 nodes.
Performance Analysis in the Evaluation (continued)

Non-trivial costs in model migration:
- Sqoop required 3 min 32 s (212 s) to migrate the prediction model built from 80% of the training data, containing about 1.56 million records (323 MB).
- Model conversion to a dense format, which is suited for online learning/prediction on Postgres, required 58 s.
Performance Analysis in the Evaluation ― Key Observations

- "Data migration time > training time" justifies the rationale behind in-database analytics.
- The cost of moving data is critical for online prediction as well as in Big Data analysis.
- Model migration costs could be amortized with our approach.
Conclusions

- A DB-Hadoop hybrid architecture for online prediction, in which the prediction model is updated in a low-latency process.
- A design principle for achieving scalable machine learning on Hadoop/Hive.
- Excellent throughput and scalability: our batch learning scheme on Hive is 5x and 7.65x faster than Vowpal Wabbit and Bismarck, respectively.
- Acceptably small latency: possibly less than 5 s under moderate transactional updates.

Going hybrid brings low latency to Big Data analytics.
Directions for Future Work

- Online testing: integrate online testing schemes (e.g., multi-armed bandits and A/B testing) into the prediction pipeline.
- Develop a scheme to select the best prediction model among past models for each user in each session.
Backup slides
Directions for Future Work (continued)

- Take into consideration a common OLTP setting in which the database is partitioned across servers (a.k.a. database sharding).
Evaluation of Incremental Learning

Given a prediction model created with 80% of the training data by batch learning, the rest of the data (20%) is supplied for incremental learning.

Built model with ..    | elapsed time (sec) | Throughput (tuples/sec) | AUC
Batch only (80%)       | 96.33              | 2,067,418.3             | 0.7177
+0.1% updates (80.1%)  | 4.99               | 36,155.4                | 0.7197
+1% updates (81%)      | 25.96              | 69,812.8                | 0.7242
+10% updates (90%)     | 256.03             | 71,278.1                | 0.7291
+20% updates (100%)    | 499.61             | 72,901.4                | 0.7349
Batch only (100%)      | 102.52             | 2,298,010.8             | 0.7356
Special Thanks

Font: Lato by Łukasz Dziedzic

Symbols by the Noun Project:
- Data Analysis designed by Brennan Novak
- Elephant designed by Ted Mitchner
- Scale designed by Laurent Patain
- Heavy Load designed by Olivier Guin
- Receipt designed by Benjamin Orlovski
- Gauge designed by Márcio Duarte
- Stopwatch designed by Ilsur Aptukov
- Box designed by Travis J. Lee
- Sprint Cycle designed by Jeremy J Bristol

Dilbert characters by Scott Adams Inc. (strips of 10-12-10 and 7-29-12)