Hivemall talk@Hadoop summit 2014, San Jose

Page 1: Hivemall talk@Hadoop summit 2014, San Jose

National Institute of Advanced Industrial Science and Technology (AIST), Japan

Makoto YUI

[email protected], @myui

Hivemall: Scalable Machine Learning Library for Apache Hive

Hadoop Summit 2014, San Jose

1 / 43

Page 2: Hivemall talk@Hadoop summit 2014, San Jose

Plan of the talk

• What is Hivemall

• Why Hivemall

• What Hivemall can do

• How to use Hivemall

• How Hivemall works

• How to deal with iterations (compared with Spark)

• Experimental Evaluation

• Conclusion

Hadoop Summit 2014, San Jose

2 / 43

Page 3: Hivemall talk@Hadoop summit 2014, San Jose

What is Hivemall

• A collection of machine learning algorithms implemented as Hive UDFs/UDTFs

• Classification & Regression

• Recommendation

• k-Nearest Neighbor Search

.. and more

• An open source project on GitHub

• Licensed under LGPL

• github.com/myui/hivemall (bit.ly/hivemall)

• 4 contributors

Hadoop Summit 2014, San Jose

3 / 43

Page 4: Hivemall talk@Hadoop summit 2014, San Jose

Reactions to the release

Hadoop Summit 2014, San Jose

4 / 43

Page 5: Hivemall talk@Hadoop summit 2014, San Jose

Reactions to the release

Hadoop Summit 2014, San Jose

5 / 43

Page 6: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Motivation – Why a new ML framework?

Mahout?

Vowpal Wabbit?(w/ Hadoop streaming)

Spark MLlib?

0xdata H2O? Cloudera Oryx?

Machine learning frameworks out there that run with Hadoop

Quick Poll: How many people in this room are using them?

6 / 43

Page 7: Hivemall talk@Hadoop summit 2014, San Jose

Framework                              User interface

Mahout                                 Java API programming
Spark MLlib/MLI                        Scala API programming, Scala shell (REPL)
H2O                                    R programming, GUI
Cloudera Oryx                          HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)    C++ API programming, command line

Hadoop Summit 2014, San Jose

Motivation – Why a new ML framework?

Existing distributed machine learning frameworks are NOT easy to use

7 / 43

Page 8: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Classification with Mahout

org/apache/mahout/classifier/sgd/TrainNewsGroups.java

Find the complete code at bit.ly/news20-mahout

8 / 43

Page 9: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Why Hivemall

1. Ease of use

• No programming

• Every machine learning step is done within HiveQL

• No compilation/packaging overhead

• Easy for existing Hive users

• You can evaluate Hivemall within 5 minutes or so

• Installation is just as follows (see the sketch below)
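A minimal sketch of that installation, assuming the jar and DDL script names used in the Hivemall documentation (hivemall-with-dependencies.jar and define-all.hive; paths are illustrative):

  -- register the Hivemall jar and its UDFs/UDTFs in the current Hive session
  add jar /tmp/hivemall-with-dependencies.jar;
  source /tmp/define-all.hive;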

9 / 43

Page 10: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Why Hivemall

2. Scalable to data

• Scalable to the number of training/testing instances

• Scalable to the number of features

• Built-in support for feature hashing (see the sketch after this list)

• Scalable to the size of the prediction model

• Suppose there are 200 labels * 100 million features ⇒ requires 150 GB

• Hivemall does not need the prediction model to fit in memory, for either training or prediction

• The feature engineering step is also scalable and parallelized using Hive
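A hedged sketch of what feature hashing could look like here, assuming Hivemall's documented mhash UDF and binary features whose array elements are feature names (table and column names are illustrative, not from the slides):

  -- hash each feature name into a bounded integer space
  select rowid, mhash(fv) as hashed_feature
  from train
  LATERAL VIEW explode(features) t as fv;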

10 / 43

Page 11: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Why Hivemall

3. Scalable to computing resources

• Exploits the benefits of Hadoop & Hive

• Provisions the machine learning service on Amazon Elastic MapReduce

• Provides an EMR bootstrap for automated setup

Find an example on bit.ly/hivemall-emr

11 / 43

Page 12: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Why Hivemall

4. Supports state-of-the-art online learning algorithms (for classification)

• Fewer configuration parameters (no learning rate, as in SGD)

• CW, AROW [1], and SCW [2] are not yet supported in other ML frameworks

• Surprisingly fast convergence (a few iterations are enough)

[1] Adaptive Regularization of Weight Vectors (AROW), Crammer et al., NIPS 2009
[2] Exact Soft Confidence-Weighted Learning (SCW), Wang et al., ICML 2012

12 / 43

Page 13: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Why Hivemall

4. Supports state-of-the-art online learning algorithms (for classification)

Classification accuracy on news20.binary (higher is better):

Algorithm                                  Accuracy
Perceptron                                 0.9460
Passive-Aggressive (a.k.a. online SVM)     0.9604
LibLinear                                  0.9636
LibSVM/TinySVM                             0.9643
Confidence Weighted (CW)                   0.9656
AROW [1]                                   0.9660
SCW [2]                                    0.9662

CW variants are very smart online ML algorithms

13 / 43

Page 14: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Why CW variants are so good?

Suppose a binary classification setting: classify sentences as positive or negative → learn a weight for each word (each word is a feature)

"I like this author" → Positive
"I like this author, but found this book dull" → Negative
(label, feature vector)

A naïve update will reduce the weights W_like and W_dull at the same rate; CW variants adjust the weights at different rates.

14 / 43

Page 15: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Why CW variants are so good?

[Figure: a plain online learner adjusts only the weight, while CW variants adjust both the weight and its confidence (covariance); at the shown confidence, the weight is 0.5.]

15 / 43

Page 16: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Why Hivemall

4. Supports state-of-the-art online learning algorithms (for classification)

• Fast convergence properties

• Performs a small update where confidence is sufficient

• Performs a large update where confidence is low (e.g., at the beginning)

• A few iterations are enough

16 / 43

Page 17: Hivemall talk@Hadoop summit 2014, San Jose

Plan of the talk

• What is Hivemall

• Why Hivemall

• What Hivemall can do

• How to use Hivemall

• How Hivemall works

• How to deal with iterations (compared with Spark)

• Experimental Evaluation

• Conclusion

Hadoop Summit 2014, San Jose

17 / 43

Page 18: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

What Hivemall can do

• Classification (both binary and multi-class)
    • Perceptron
    • Passive-Aggressive (PA)
    • Confidence Weighted (CW)
    • Adaptive Regularization of Weight Vectors (AROW)
    • Soft Confidence-Weighted (SCW)

• Regression
    • Logistic regression using Stochastic Gradient Descent (SGD)
    • PA regression
    • AROW regression

• k-Nearest Neighbor & Recommendation
    • MinHash and b-Bit MinHash (LSH variants)
    • Brute-force search using similarity measures (cosine similarity)

• Feature engineering
    • Feature hashing
    • Feature scaling (normalization, z-score)

18 / 43

Page 19: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

How to use Hivemall

[Diagram: machine learning workflow. Training data (label, feature vector) feeds Training, which produces a Prediction Model; test feature vectors plus the model feed Prediction, which outputs labels. This slide highlights the Data preparation step.]

19 / 43

Page 20: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Create external table e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
) ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';

How to use Hivemall - Data preparation

Define a Hive table for training/testing data
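Since the table is EXTERNAL with a fixed LOCATION, placing tab-delimited files under that path (rowid, label, then a comma-separated feature list per row) is enough. A quick sanity check might look like this (illustrative only, not from the slides):

  select count(1) from e2006tfidf_train;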

20 / 43

Page 21: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

How to use Hivemall

[Diagram: the same machine learning workflow; this slide highlights the Feature Engineering step.]

21 / 43

Page 22: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(target, ${min_label}, ${max_label}) as label,
  features
from e2006tfidf_train;

Applying a Min-Max Feature Normalization

How to use Hivemall - Feature Engineering

Transforming a label value to a value between 0.0 and 1.0
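A hedged sketch of how ${min_label} and ${max_label} might be obtained beforehand (a hypothetical helper query, not shown on the slides; the column name follows the view definition above):

  select min(target) as min_label, max(target) as max_label
  from e2006tfidf_train;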

22 / 43

Page 23: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

How to use Hivemall

[Diagram: the same machine learning workflow; this slide highlights the Training step.]

23 / 43

Page 24: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

How to use Hivemall - Training

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t
GROUP BY feature

Training by logistic regression

Map-only tasks learn the prediction model

Map outputs are shuffled to reducers by feature

Reducers perform model averaging in parallel
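An illustrative follow-up query, not from the slides, to peek at the learned relation of features and weights:

  -- the model is an ordinary Hive table, so it can be queried directly
  select feature, weight
  from lr_model
  order by weight desc
  limit 10;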

24 / 43

Page 25: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

How to use Hivemall - Training

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features, label) as (feature, weight)
  FROM news20b_train
) t
GROUP BY feature

Training a Confidence Weighted (CW) classifier

voted_avg votes whether to use the negative or positive weights for the average (e.g., given +0.7, +0.3, +0.2, -0.1, +0.7)
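Under one plausible reading of voted_avg (an assumption; the slide does not spell this out), the positive weights win the vote 4 to 1 in the example above, so the average is taken over the positive weights only: (0.7 + 0.3 + 0.2 + 0.7) / 4 = 0.475.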

25 / 43

Page 26: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select train_multiclass_cw(addBias(features), label) as (label, feature, weight)
  from news20mc_train_x3
  union all
  select train_multiclass_arow(addBias(features), label) as (label, feature, weight)
  from news20mc_train_x3
  union all
  select train_multiclass_scw(addBias(features), label) as (label, feature, weight)
  from news20mc_train_x3
) t
group by label, feature;

Ensemble learning for stable prediction performance

Just stack prediction models by union all

26 / 43

Page 27: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

How to use Hivemall

[Diagram: the same machine learning workflow; this slide highlights the Prediction step.]

27 / 43

Page 28: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

How to use Hivemall - Prediction

CREATE TABLE lr_predict as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid

Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model

No need to load the entire model into memory
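A hedged sketch of how testing_exploded could be derived (assuming binary features, where each array element is just a feature index; the slides do not show this step):

  create view testing_exploded as
  select rowid, fv as feature
  from testing
  LATERAL VIEW explode(features) t as fv;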

28 / 43

Page 29: Hivemall talk@Hadoop summit 2014, San Jose

Plan of the talk

• What is Hivemall

• Why Hivemall

• What Hivemall can do

• How to use Hivemall

• How Hivemall works

• How to deal with iterations (compared with Spark)

• Experimental Evaluation

• Conclusion

Hadoop Summit 2014, San Jose

29 / 43

Page 30: Hivemall talk@Hadoop summit 2014, San Jose

Machine learning algorithms are implemented as User-Defined Table-generating Functions (UDTFs)

Hadoop Summit 2014, San Jose

How Hivemall works in training

[Diagram: rows of the training table, tuples of <label, array<features>> such as (+1, <1,2>) and (-1, <1,3,9>), are fed to train UDTFs running in parallel; their outputs, tuples of <feature, weight>, are shuffled by feature and parameter-mixed into the prediction model, a relation of <feature, weight>.]

• Friendly to the Hive relational query engine: the resulting prediction model is a relation of features and their weights

• Embarrassingly parallel: the numbers of mappers and reducers are configurable

• Bagging-like effect, which helps to reduce the variance of each classifier/partition

30 / 43

Page 31: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Why not a UDAF (as in MADlib)?

[Diagram: machine learning as an aggregate function. train operators process the training table in parallel (4 ops in parallel), their array<weight> outputs are merged as <array<sum of weight>, array<count>> (2 ops in parallel), and a final merge with no parallelism produces the prediction model.]

• Bottleneck in the final merge: throughput is limited by its fan-out

• Going up the merge tree, memory consumption grows and parallelism decreases

31 / 43

Page 32: Hivemall talk@Hadoop summit 2014, San Jose

How to deal with Iterations

Iterations are mandatory to get a good prediction model

• However, MapReduce is not suited for iterations because the input/output of each MR job goes through HDFS

• Spark avoids this by in-memory computation

[Diagram: in MapReduce, each iteration reads its input from HDFS and writes its output back to HDFS; in Spark, the input is loaded once and subsequent iterations run in memory.]

32 / 43

Page 33: Hivemall talk@Hadoop summit 2014, San Jose

Training with Iterations in Spark: Logistic Regression example in Spark

val data = spark.textFile(...).map(readPoint).cache()   // each node loads the data into memory once

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)                                        // repeated MapReduce steps to do gradient descent
  w -= gradient
}

This is just a toy example! Why? The input to the gradient computation should be shuffled for each iteration (without shuffling, more iterations are required).

33 / 43

Page 34: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

What MLlib actually does: Mini-batch Gradient Descent with Sampling

val data = ...

for (i <- 1 to numIterations) {
  val sampled = ...   // sample a subset of the data (partitioned RDD)
  val gradient = ...  // average the subgradients over the sampled data using Spark's MapReduce
  w -= gradient
}

Iterations are mandatory for convergence because each iteration uses only a small fraction of the data.

See GradientDescent.scala (bit.ly/spark-gd)

34 / 43

Page 35: Hivemall talk@Hadoop summit 2014, San Jose

How to deal with Iterations in Hivemall

Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without running several MapReduce steps

SET hivevar:xtimes=3;

CREATE VIEW training_x3
as
SELECT *
FROM (
  SELECT amplify(${xtimes}, *) as (rowid, label, features)
  FROM training
) t
CLUSTER BY rand()
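Training then runs over the amplified view exactly like the earlier logistic regression example, e.g. (illustrative; lr_model_x3 is a hypothetical table name):

  CREATE TABLE lr_model_x3 AS
  SELECT
    feature,
    avg(weight) as weight
  FROM (
    SELECT logress(features, label) as (feature, weight)
    FROM training_x3
  ) t
  GROUP BY feature;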

35 / 43

Page 36: Hivemall talk@Hadoop summit 2014, San Jose

Map-only shuffling and amplifying

rand_amplify UDTF randomly shuffles the input rows for each Map task

CREATE VIEW training_x3
as
SELECT
  rand_amplify(${xtimes}, ${shufflebuffersize}, *) as (rowid, label, features)
FROM training;
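For completeness, the hivevars referenced above would be set beforehand; the values here are illustrative:

  SET hivevar:xtimes=3;
  SET hivevar:shufflebuffersize=1000;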

36 / 43

Page 37: Hivemall talk@Hadoop summit 2014, San Jose

Detailed plan w/ map-local shuffle

[Diagram of the query plan: each Map task runs Table scan → Rand Amplifier → Logress UDTF → Partial aggregate → Map write; the map outputs are shuffled (distributed by feature) to Reduce tasks, each of which runs Merge → Aggregate → Reduce write.]

Scanned entries are amplified and then shuffled; note that this is a pipelined operation.

The Rand Amplifier operator is interleaved between the table scan and the training operator.

37 / 43

Page 38: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Performance effects of amplifiers

Method                                            Elapsed time (sec)   AUC
Plain                                             89.718               0.734805
amplifier + CLUSTER BY (a.k.a. global shuffle)    479.855              0.746214
rand_amplify (a.k.a. map-local shuffle)           116.424              0.743392

With map-local shuffle, prediction accuracy improved with an acceptable overhead.

38 / 43

Page 39: Hivemall talk@Hadoop summit 2014, San Jose

Plan of the talk

• What is Hivemall

• Why Hivemall

• What Hivemall can do

• How to use Hivemall

• How Hivemall works

• How to deal with iterations (compared with Spark)

• Experimental Evaluation

• Conclusion

Hadoop Summit 2014, San Jose

39 / 43

Page 40: Hivemall talk@Hadoop summit 2014, San Jose

Experimental Evaluation

Compared the performance of our batch learning scheme to state-of-the-art machine learning techniques, namely Bismarck and Vowpal Wabbit

• Dataset: the KDD Cup 2012, Track 2 dataset, one of the largest publicly available datasets for machine learning, provided by a commercial search engine provider
    • The training data is about 235 million records in 33 GB
    • The number of feature dimensions is about 54 million

• Task: predicting click-through rates of search engine ads

• Experimental environment: an in-house cluster of 33 commodity servers (32 slave nodes for Hadoop), each equipped with 8 processors and 24 GB of memory

bit.ly/hivemall-kdd-dataset

40 / 43

Page 41: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Performance comparison

Elapsed time (sec) for training (the lower, the better):
Hivemall 116.4, VW1 596.67, VW32 493.81, Bismarck 755.24

[Chart: prediction performance (AUC) of Hivemall, VW1, VW32, and Bismarck; Hivemall's AUC is good.]

Throughput: 2.3 million tuples/sec on 32 nodes
Latency: 96 sec for training 235 million records of 23 GB

41 / 43

Page 42: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

How about Spark 1.0 MLlib?

val training = MLUtils.loadLibSVMFile(sc,
  "hdfs://host:8020/small/training_libsvmfmt",
  multiclass = false)

val model = LogisticRegressionWithSGD.train(training, numIterations) ..

It works fine for small data (10k training examples, about 1.5 MB) on 33 nodes, with 5 GB of memory allocated to each worker. The LoC is small and the code is easy to understand.

However, Spark did not work for the large dataset (235 million training examples with 2^24 feature dimensions, about 33 GB). Further investigation is required.

42 / 43

Page 43: Hivemall talk@Hadoop summit 2014, San Jose

Hadoop Summit 2014, San Jose

Conclusion

Hivemall is an open source library that provides a collection of machine learning algorithms as Hive UDFs/UDTFs

• Easy to use

• Scalable to computing resources

• Runs on Amazon EMR

• Supports state-of-the-art classification algorithms

• Plans to support Shark/Spark SQL

Project site: github.com/myui/hivemall or bit.ly/hivemall

Message of this talk: Please evaluate Hivemall by yourself. 5 minutes is enough for a quick start

Slides available at bit.ly/hivemall-slide

43 / 43

