Hivemall talk @ Hadoop Summit 2014, San Jose


National Institute of Advanced Industrial Science and Technology (AIST), Japan

Makoto YUI

m.yui@aist.go.jp, @myui

Hivemall: Scalable Machine Learning Library for Apache Hive

Hadoop Summit 2014, San Jose

1 / 43

Plan of the talk

• What is Hivemall

• Why Hivemall

• What Hivemall can do

• How to use Hivemall

• How Hivemall works
  • How to deal with iterations (compared with Spark)

• Experimental Evaluation

• Conclusion


2 / 43

What is Hivemall

• A collection of machine learning algorithms implemented as Hive UDFs/UDTFs

• Classification & Regression

• Recommendation

• k-Nearest Neighbor Search

.. and more

• An open source project on GitHub

• Licensed under LGPL

• github.com/myui/hivemall (bit.ly/hivemall)

• 4 contributors


3 / 43

Reactions to the release


4 / 43

Reactions to the release


5 / 43


Motivation – Why a new ML framework?

Mahout?

Vowpal Wabbit? (w/ Hadoop streaming)

Spark MLlib?

0xdata H2O? Cloudera Oryx?

Machine learning frameworks out there that run with Hadoop

Quick Poll: How many people in this room are using them?

6 / 43

Framework                              User interface
Mahout                                 Java API programming
Spark MLlib/MLI                        Scala API programming / Scala shell (REPL)
H2O                                    R programming / GUI
Cloudera Oryx                          HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)    C++ API programming / Command line


Motivation – Why a new ML framework?

Existing distributed machine learning frameworks are NOT easy to use

7 / 43


Classification with Mahout

org/apache/mahout/classifier/sgd/TrainNewsGroups.java

Find the complete code at bit.ly/news20-mahout

8 / 43


Why Hivemall

1. Ease of use

• No programming

• Every machine learning step is done within HiveQL

• No compilation/packaging overhead

• Easy for existing Hive users

• You can evaluate Hivemall within 5 minutes or so

• Installation is just as follows
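The installation commands themselves did not survive in this transcript. As a minimal sketch, assuming the two-step setup described in the project README of the time (the jar and script paths below are illustrative, not taken from the slide):

-- run inside the Hive CLI; file locations are hypothetical
add jar /tmp/hivemall.jar;
source /tmp/define-all.hive;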

9 / 43


Why Hivemall

2. Scalable to data

• Scalable to the # of training/testing instances

• Scalable to the # of features

• Built-in support for feature hashing (see the sketch after this list)

• Scalable to the size of the prediction model

• Suppose there are 200 labels * 100 million features ⇒ requires 150 GB

• Hivemall does not need the prediction model to fit in memory, in either training or prediction

• The feature engineering step is also scalable and parallelized using Hive
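As a hedged illustration of the feature hashing support mentioned above: the mhash UDF shipped with Hivemall maps an arbitrary feature string into a bounded index space (the exact signature and default hash range may differ by version; the table raw_train and column word are hypothetical).

-- turn a raw categorical value into a hashed "index:weight" feature string
SELECT concat(cast(mhash(word) as string), ":1.0") as hashed_feature
FROM raw_train;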

10 / 43


Why Hivemall

3. Scalable to computing resources

• Exploits the benefits of Hadoop & Hive

• Provisioning the machine learning service on Amazon Elastic MapReduce

• Provides an EMR bootstrap for automated setup

Find an example on bit.ly/hivemall-emr

11 / 43


Why Hivemall

4. Supports state-of-the-art online learning algorithms (for classification)

• Fewer configuration parameters (no learning rate such as the one in SGD)

• CW, AROW [1], and SCW [2] are not yet supported in the other ML frameworks

• Surprisingly fast convergence properties (a few iterations are enough)

1. Adaptive Regularization of Weight Vectors (AROW), Crammer et al., NIPS 2009
2. Exact Soft Confidence-Weighted Learning (SCW), Wang et al., ICML 2012

12 / 43


Why Hivemall

4. Supports state-of-the-art online learning algorithms (for classification)

CW variants are very smart online ML algorithms

Classification accuracy on news20.binary (higher is better):

Algorithm                                  Accuracy
Perceptron                                 0.9460
Passive-Aggressive (a.k.a. online SVM)     0.9604
LibLinear                                  0.9636
LibSVM/TinySVM                             0.9643
Confidence Weighted (CW)                   0.9656
AROW [1]                                   0.9660
SCW [2]                                    0.9662

13 / 43


Why are CW variants so good?

Suppose a binary classification setting: classify sentences as positive or negative → learn a weight for each word (each word is a feature)

Label      Feature vector
Positive   I like this author
Negative   I like this author, but found this book dull

A naive update will reduce both W_like and W_dull at the same rate, whereas CW variants adjust the weights at different rates
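For reference, this is the constrained update that CW-style learners solve on each example (x_t, y_t), as formulated in the papers cited on slide 12; the notation here is mine, not from the slide. The weight vector is modeled as a Gaussian N(\mu, \Sigma), where \Sigma encodes the per-feature confidence:

(\mu_{t+1}, \Sigma_{t+1}) = \arg\min_{\mu, \Sigma} D_{\mathrm{KL}}\big( \mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(\mu_t, \Sigma_t) \big)
\quad \text{s.t.} \quad \Pr_{w \sim \mathcal{N}(\mu, \Sigma)}\big[\, y_t (w \cdot x_t) \ge 0 \,\big] \ge \eta

Features with high variance (low confidence) receive large updates, while well-estimated features barely move, which is why only a few passes over the data are needed.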

14 / 43


Why are CW variants so good?

[Figure: a conventional learner adjusts only the weight, while CW variants adjust both the weight and a per-feature confidence (covariance); at the illustrated confidence, the weight is 0.5.]

15 / 43


Why Hivemall

4. Supports state-of-the-art online learning algorithms (for classification)

• Fast convergence properties

• Performs a small update where confidence is sufficiently high

• Performs a large update where confidence is low (e.g., at the beginning)

• A few iterations are enough

16 / 43

Plan of the talk

• What is Hivemall

• Why Hivemall

• What Hivemall can do

• How to use Hivemall

• How Hivemall works
  • How to deal with iterations (compared with Spark)

• Experimental Evaluation

• Conclusion


17 / 43


What Hivemall can do

• Classification (both binary and multi-class)
  Perceptron
  Passive Aggressive (PA)
  Confidence Weighted (CW)
  Adaptive Regularization of Weight Vectors (AROW)
  Soft Confidence Weighted (SCW)

• Regression
  Logistic regression using Stochastic Gradient Descent (SGD)
  PA regression
  AROW regression

• k-Nearest Neighbor & Recommendation
  MinHash and b-Bit MinHash (LSH variants)
  Brute-force search using similarity measures (cosine similarity)

• Feature engineering
  Feature hashing
  Feature scaling (normalization, z-score)
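All of these algorithms consume the same input representation: each training row is (rowid, label, features), where features is an array<string> whose elements are "index" or "index:weight" strings. As a hedged illustration, the query below simply inspects the training table defined two slides later; the feature values mentioned in the comment are made up.

-- peek at a couple of rows to check the feature format,
-- e.g., features = ["10:0.0174", "29:0.2327", ...]
SELECT rowid, label, features
FROM e2006tfidf_train
LIMIT 2;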

18 / 43


How to use Hivemall

[Workflow diagram: training data (label + feature vector) → machine learning → prediction model; the model is then applied to new feature vectors to predict labels. This slide highlights the data preparation step.]

19 / 43


create external table e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
) ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ","
  STORED AS TEXTFILE
  LOCATION '/dataset/E2006-tfidf/train';

How to use Hivemall - Data preparation

Define a Hive table for training/testing data

20 / 43


How to use Hivemall

[Workflow diagram as above; this slide highlights the feature engineering step.]

21 / 43


create view e2006tfidf_train_scaled as
select
  rowid,
  rescale(target, ${min_label}, ${max_label}) as label,
  features
from e2006tfidf_train;

Applying a Min-Max Feature Normalization

How to use Hivemall - Feature Engineering

Transforming a label value to a value between 0.0 and 1.0
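The slide does not show where ${min_label} and ${max_label} come from. A minimal sketch, assuming they are simply the extreme target values of the training table (the column name target follows the rescale() call above):

-- compute the values to substitute for ${min_label} and ${max_label}
SELECT min(target), max(target)
FROM e2006tfidf_train;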

22 / 43


How to use Hivemall

[Workflow diagram as above; this slide highlights the training step.]

23 / 43


How to use Hivemall - Training

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t
GROUP BY feature;

Training by logistic regression

map-only task to learn a prediction model

Shuffle map outputs to reducers by feature

Reducers perform model averaging in parallel

24 / 43


How to use Hivemall - Training

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features, label) as (feature, weight)
  FROM news20b_train
) t
GROUP BY feature;

Training of Confidence Weighted Classifier

Vote on whether to use the negative or the positive weights for the average, e.g.:

+0.7, +0.3, +0.2, -0.1, +0.7

Training for the CW classifier
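One possible reading of this example, under the assumption (not stated on the slide) that voted_avg averages only the weights whose sign wins the vote: the positive sign wins 4 to 1, so the averaged weight would be (0.7 + 0.3 + 0.2 + 0.7) / 4 = 0.475.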

25 / 43


create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select
    train_multiclass_cw(addBias(features), label) as (label, feature, weight)
  from news20mc_train_x3
  union all
  select
    train_multiclass_arow(addBias(features), label) as (label, feature, weight)
  from news20mc_train_x3
  union all
  select
    train_multiclass_scw(addBias(features), label) as (label, feature, weight)
  from news20mc_train_x3
) t
group by label, feature;

Ensemble learning for stable prediction performance

Just stack prediction models by union all

26 / 43


How to use Hivemall

[Workflow diagram as above; this slide highlights the prediction step.]

27 / 43


How to use Hivemall - Prediction

CREATE TABLE lr_predict as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid;

Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model

No need to load the entire model into memory
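The view testing_exploded is not defined on the slide. A minimal sketch of what it could look like, assuming a test table testing(rowid, features array<string>) with binary features (feature names only, no weights); both the table and its columns are assumptions here:

-- explode each test row into one (rowid, feature) pair per feature
CREATE VIEW testing_exploded as
SELECT rowid, feature
FROM testing
LATERAL VIEW explode(features) t AS feature;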

28 / 43

Plan of the talk

• What is Hivemall

• Why Hivemall

• What Hivemall can do

• How to use Hivemall

• How Hivemall works
  • How to deal with iterations (compared with Spark)

• Experimental Evaluation

• Conclusion


29 / 43

Machine learning algorithms are implemented as User-Defined Table-generating Functions (UDTFs)


How Hivemall works in training

[Diagram: the training table of tuple<label, array<features>> rows (e.g., +1, <1,2> … -1, <1,3,9>) is scanned in parallel by train UDTF instances (with parameter mixing); their outputs, tuple<feature, weight> rows, are shuffled by feature into the prediction model, a relation of <feature, weight>.]

Friendly to the Hive relational query engine

• The resulting prediction model is a relation of features and their weights

Embarrassingly parallel

• The # of mappers and reducers is configurable (see the sketch below)

• Bagging-like effect, which helps to reduce the variance of each classifier/partition
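A hedged sketch of how that parallelism could be tuned on the MapReduce engine of the time; these are generic Hadoop/Hive settings rather than anything Hivemall-specific, and the exact property names depend on the Hadoop/Hive version:

-- smaller input splits => more map tasks running the training UDTF
set mapred.max.split.size=67108864;
-- number of reduce tasks performing the GROUP BY (model averaging)
set mapred.reduce.tasks=32;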

30 / 43


Why not a UDAF (as in MADlib)?

Machine learning as an aggregate function

[Diagram: partitions of the training table (tuple<label, array<features>> rows) are trained in parallel, each producing an array<weight>; the partial results (array<sum of weight>, array<count>) are merged pairwise and then in a final merge into the prediction model, so parallelism shrinks from 4 ops to 2 ops to none at the final merge.]

Bottleneck in the final merge: throughput is limited by its fan-out

As the merge proceeds, memory consumption grows and parallelism decreases

31 / 43

How to deal with Iterations

Iterations are mandatory to get a good prediction model

• However, MapReduce is not suited for iterations because the input/output of each MR job goes through HDFS

• Spark avoids this by in-memory computation

[Diagram: in MapReduce, every iteration reads its input from HDFS and writes its output back to HDFS; in Spark, the input is loaded once and iterations 1, 2, … run in memory.]

32 / 43

val data = spark.textFile(...).map(readPoint).cache()

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

Repeated MapReduce steps to do gradient descent

Each node loads the data into memory once

This is just a toy example! Why?

Training with Iterations in Spark

Logistic Regression example of Spark

The input to the gradient computation should be shuffled for each iteration (without shuffling, more iterations are required)

33 / 43


What does MLlib actually do?

val data = ...

for (i <- 1 to numIterations) {
  val sampled = ...   // sample a subset of the data (a partitioned RDD)
  val gradient = ...  // average the subgradients over the sampled data using Spark MapReduce
  w -= gradient
}

Mini-batch Gradient Descent with sampling

Iterations are mandatory for convergence because each iteration uses only a small fraction of the data

GradientDescent.scala: bit.ly/spark-gd

34 / 43

How to deal with Iterations in Hivemall

Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps

SET hivevar:xtimes=3;

CREATE VIEW training_x3 as
SELECT *
FROM (
  SELECT amplify(${xtimes}, *) as (rowid, label, features)
  FROM training
) t
CLUSTER BY RANDOM;

35 / 43

Map-only shuffling and amplifying

rand_amplify UDTF randomly shuffles the input rows for each Map task

CREATE VIEW training_x3 as
SELECT
  rand_amplify(${xtimes}, ${shufflebuffersize}, *) as (rowid, label, features)
FROM training;

36 / 43

Detailed plan w/ map-local shuffle

[Query plan diagram: each Map task runs Table scan → Rand Amplifier → Logress UDTF → Partial aggregate → Map write; the map outputs are shuffled (distributed by feature) to Reduce tasks, which run Merge → Aggregate → Reduce write.]

Scanned entries are amplified and then shuffled; note that this is a pipelined operation.

The Rand Amplifier operator is interleaved between the table scan and the training operator.

37 / 43


Performance effects of amplifiers

Method                                            Elapsed time (sec)   AUC
Plain                                             89.718               0.734805
amplifier + clustered by (a.k.a. global shuffle)  479.855              0.746214
rand_amplifier (a.k.a. map-local shuffle)         116.424              0.743392

With map-local shuffle, prediction accuracy improved with an acceptable overhead

38 / 43

Plan of the talk

• What is Hivemall

• Why Hivemall

• What Hivemall can do

• How to use Hivemall

• How Hivemall works
  • How to deal with iterations (compared with Spark)

• Experimental Evaluation

• Conclusion


39 / 43

Experimental Evaluation

Compared the performance of our batch learning scheme to state-of-the-art machine learning techniques, namely Bismarck and Vowpal Wabbit

• Dataset: KDD Cup 2012, Track 2, one of the largest publicly available datasets for machine learning, provided by a commercial search engine provider
  • The training data is about 235 million records in 33 GB
  • The # of feature dimensions is about 54 million

• Task: predicting click-through rates of search engine ads

• Experimental environment: 33 in-house commodity servers (32 slave nodes for Hadoop), each equipped with 8 processors and 24 GB of memory


bit.ly/hivemall-kdd-dataset

40 / 43


Performance comparison

Elapsed time (sec) for training (the lower, the better):

Method      Elapsed time (sec)
Hivemall    116.4
VW1         596.67
VW32        493.81
Bismarck    755.24

[Chart: AUC of Hivemall, VW1, VW32, and Bismarck (y-axis range 0.64 to 0.76)]

Prediction performance (AUC) is good

Throughput: 2.3 million tuples/sec on 32 nodes
Latency: 96 sec for training 235 million records of 23 GB

41 / 43


val training = MLUtils.loadLibSVMFile(sc, "hdfs://host:8020/small/training_libsvmfmt", multiclass = false)

val model = LogisticRegressionWithSGD.train(training, numIterations)..

How about Spark 1.0 MLlib

Works fine for small data (10k training examples, about 1.5 MB) on 33 nodes, allocating 5 GB of memory to each worker

LoC is small and easy to understand

However, Spark does not work for the large dataset (235 million training examples with 2^24 feature dimensions, about 33 GB)

Further investigation is required

42 / 43


Conclusion

Hivemall is an open source library that provides a collection of machine learning algorithms as Hive UDFs/UDTFs

• Easy to use

• Scalable to computing resources

• Runs on Amazon EMR

• Supports state-of-the-art classification algorithms

• Plans to support Shark/Spark SQL

Project site: github.com/myui/hivemall or bit.ly/hivemall

Message of this talk: please evaluate Hivemall for yourself. 5 minutes is enough for a quick start.

Slides available at bit.ly/hivemall-slide

43 / 43