Machine Learning with Apache Hama

Transcript
Page 1: Machine Learning with Apache Hama

Machine Learning with Apache Hama

Tommaso Teofili
tommaso [at] apache [dot] org

Page 2: Machine Learning with Apache Hama

About me

ASF member having fun with: Lucene / Solr, Hama, UIMA, Stanbol … some others

SW engineer @ Adobe R&D

Page 3: Machine Learning with Apache Hama

Agenda

Apache Hama and BSP
Why machine learning on BSP
Some examples
Benchmarks

Page 4: Machine Learning with Apache Hama

Apache Hama

Bulk Synchronous Parallel computing framework on top of HDFS for massive scientific computations
TLP since May 2012
0.6.0 release out soon
Growing community

Page 5: Machine Learning with Apache Hama

BSP supersteps

A BSP algorithm is composed of a sequence of “supersteps”

Page 6: Machine Learning with Apache Hama

BSP supersteps

Each task:
Superstep 1: do some computation, communicate with other tasks, synchronize
Superstep 2: do some computation, communicate with other tasks, synchronize
…
Superstep N: do some computation, communicate with other tasks, synchronize

Page 7: Machine Learning with Apache Hama

Why BSP

Simple programming model: superstep semantics are easy
Preserve data locality, improve performance
Well suited for iterative algorithms

Page 8: Machine Learning with Apache Hama

Apache Hama architecture

BSP program execution flow

Page 9: Machine Learning with Apache Hama


Apache Hama architecture

Page 10: Machine Learning with Apache Hama

Apache Hama Features

BSP API
M/R like I/O API
Graph API
Job management / monitoring
Checkpoint recovery
Local & (pseudo) distributed run modes
Pluggable message transfer architecture
YARN supported
Running in Apache Whirr

Page 11: Machine Learning with Apache Hama

Apache Hama BSP API

public abstract class BSP&lt;K1, V1, K2, V2, M extends Writable&gt; …

K1, V1 are the key / value types for inputs
K2, V2 are the key / value types for outputs
M is the type of the messages used for task communication

Page 12: Machine Learning with Apache Hama

Apache Hama BSP API

public void bsp(BSPPeer&lt;K1, V1, K2, V2, M&gt; peer) throws ..

public void setup(BSPPeer&lt;K1, V1, K2, V2, M&gt; peer) throws ..

public void cleanup(BSPPeer&lt;K1, V1, K2, V2, M&gt; peer) throws ..
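Putting the API together, here is a minimal sketch (not from the slides) of the superstep pattern: each task sums the doubles in its input split, sends the partial sum to one designated peer, synchronizes, and that peer aggregates. The class name and the input key / value types are illustrative choices, not anything Hama prescribes.

// Minimal superstep-pattern sketch on the Hama BSP API (illustrative only)
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class PartialSumBSP
    extends BSP<Text, DoubleWritable, Text, DoubleWritable, DoubleWritable> {

  @Override
  public void bsp(BSPPeer<Text, DoubleWritable, Text, DoubleWritable, DoubleWritable> peer)
      throws IOException, SyncException, InterruptedException {

    // Superstep 1: local computation over this task's input split
    double localSum = 0d;
    Text key = new Text();
    DoubleWritable value = new DoubleWritable();
    while (peer.readNext(key, value)) {
      localSum += value.get();
    }

    // Communicate: send the partial result to the first peer
    String master = peer.getAllPeerNames()[0];
    peer.send(master, new DoubleWritable(localSum));

    // Synchronize: barrier before the next superstep
    peer.sync();

    // Superstep 2: the designated peer aggregates all partial sums
    if (peer.getPeerName().equals(master)) {
      double total = 0d;
      DoubleWritable msg;
      while ((msg = peer.getCurrentMessage()) != null) {
        total += msg.get();
      }
      peer.write(new Text("sum"), new DoubleWritable(total));
    }
  }
}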

Page 13: Machine Learning with Apache Hama

Machine learning on BSP

Lots (most?) of ML algorithms are inherently iterative
The Hama ML module currently includes:
Collaborative filtering
Clustering
Gradient descent

Page 14: Machine Learning with Apache Hama

Benchmarking architecture

[Architecture diagram: HDFS, Solr / Lucene, a DBMS, Hama and Mahout across the cluster nodes]

Page 15: Machine Learning with Apache Hama

Collaborative filtering

Given user preferences on movies
We want to find users “near” to some specific user
So that that user can “follow” them
And/or see what they like (which he/she could like too)

Page 16: Machine Learning with Apache Hama

Collaborative filtering BSP

Given a specific user
Iteratively (for each task):
Superstep 1*i
Read a new user preference row
Find how near that user is to the current user
That is, find how near their preferences are
Since preferences are given as vectors, we may use vector distance measures like Euclidean, cosine, etc.
Broadcast the measured distance to the other peers
Superstep 2*i
Aggregate the distance outputs
Update the most relevant users
Still to be committed (HAMA-612); a rough sketch of the two supersteps follows below
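A rough, illustrative sketch of how those two supersteps might look (this is not the actual HAMA-612 patch): each task reads “user -> ratings” rows from its split, computes the Euclidean distance to the target user's vector, broadcasts it, and one peer keeps the k smallest distances. The tab-separated Text input and the hard-coded target vector are assumptions made for brevity.

// Illustrative nearest-users sketch (assumed Text key/value input, e.g. "paula<TAB>7,3,8,2,8.5,0,0")
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class NearestUsersBSP extends BSP<Text, Text, Text, Text, Text> {

  private static final double[] TARGET = {7, 3, 8, 2, 8.5, 0, 0}; // "paula", hard-coded for the sketch
  private static final int K = 2;

  @Override
  public void bsp(BSPPeer<Text, Text, Text, Text, Text> peer)
      throws IOException, SyncException, InterruptedException {

    // Superstep 1: measure the distance of every user in this split to the target user
    Text user = new Text();
    Text ratings = new Text();
    while (peer.readNext(user, ratings)) {
      String[] parts = ratings.toString().split(",");
      double sum = 0d;
      for (int i = 0; i < parts.length; i++) {
        double d = Double.parseDouble(parts[i].trim()) - TARGET[i];
        sum += d * d;
      }
      double distance = Math.sqrt(sum);
      // broadcast "user:distance" to every peer
      for (String other : peer.getAllPeerNames()) {
        peer.send(other, new Text(user + ":" + distance));
      }
    }
    peer.sync();

    // Superstep 2: one peer aggregates and keeps the K nearest users
    // (a TreeMap keeps one user per distinct distance, fine for this toy sketch)
    if (peer.getPeerName().equals(peer.getAllPeerNames()[0])) {
      TreeMap<Double, String> nearest = new TreeMap<Double, String>();
      Text msg;
      while ((msg = peer.getCurrentMessage()) != null) {
        String[] kv = msg.toString().split(":");
        nearest.put(Double.valueOf(kv[1]), kv[0]);
        if (nearest.size() > K) {
          nearest.remove(nearest.lastKey()); // drop the farthest
        }
      }
      for (Map.Entry<Double, String> e : nearest.entrySet()) {
        peer.write(new Text(e.getValue()), new Text(String.valueOf(e.getKey())));
      }
    }
  }
}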

Page 17: Machine Learning with Apache Hama

Collaborative filtering BSP

Given user ratings about movies:
"john" -> 0, 0, 0, 9.5, 4.5, 9.5, 8
"paula" -> 7, 3, 8, 2, 8.5, 0, 0
"jim" -> 4, 5, 0, 5, 8, 0, 1.5
"tom" -> 9, 4, 9, 1, 5, 0, 8
"timothy" -> 7, 3, 5.5, 0, 9.5, 6.5, 0

We ask for the 2 nearest users to “paula” and we get “timothy” and “tom”: user recommendation
We can extract movies highly rated by “timothy” and “tom” that “paula” didn’t see: item recommendation
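Assuming plain Euclidean distance between the rating vectors (the slides leave the exact measure open), the numbers do point that way:

d(paula, timothy) = sqrt(0² + 0² + 2.5² + 2² + 1² + 6.5² + 0²) ≈ 7.3
d(paula, tom) = sqrt(2² + 1² + 1² + 1² + 3.5² + 0² + 8²) ≈ 9.1
d(paula, jim) ≈ 9.4
d(paula, john) ≈ 18.7

so “timothy” and “tom” are indeed the two nearest users to “paula”.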

Page 18: Machine Learning with Apache Hama

Benchmarks

Fairly simple algorithm
Highly iterative
Comparing to Apache Mahout:
Behaves better than ALS-WR
Behaves similarly to RecommenderJob and ItemSimilarityJob

Page 19: Machine Learning with Apache Hama

K-Means clustering

We have a bunch of data (e.g. documents)
We want to group those docs in k homogeneous clusters
Iteratively, for each cluster:
Calculate the new cluster center
Add the doc nearest to the new center to the cluster

Page 20: Machine Learning with Apache Hama


K-Means clustering

Page 21: Machine Learning with Apache Hama

K-Means clustering BSP

Iteratively:
Superstep 1*i: assignment phase
Read vector splits
Sum up temporary centers with assigned vectors
Broadcast the sums and the ingested vector counts
Superstep 2*i: update phase
Calculate the total sum over all received messages and average
Replace old centers with new centers and check for convergence (see the sketch below)
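A plain single-process sketch (not Hama's actual K-means implementation) of what one assignment + update pair computes: per-center partial sums and counts, then averages into new centers. In the BSP version the sums and counts are exactly what each task broadcasts before the barrier.

// Illustrative single-iteration K-means step (plain Java, no BSP plumbing)
public class KMeansIteration {

  static double[][] iterate(double[][] vectors, double[][] centers) {
    int k = centers.length, dim = centers[0].length;
    double[][] sums = new double[k][dim];
    int[] counts = new int[k];

    // Assignment phase: add each vector to the running sum of its nearest center
    for (double[] v : vectors) {
      int nearest = 0;
      double best = Double.MAX_VALUE;
      for (int c = 0; c < k; c++) {
        double dist = 0d;
        for (int d = 0; d < dim; d++) {
          double diff = v[d] - centers[c][d];
          dist += diff * diff;
        }
        if (dist < best) { best = dist; nearest = c; }
      }
      for (int d = 0; d < dim; d++) sums[nearest][d] += v[d];
      counts[nearest]++;
    }

    // Update phase: average the (aggregated) sums into the new centers
    double[][] newCenters = new double[k][dim];
    for (int c = 0; c < k; c++) {
      for (int d = 0; d < dim; d++) {
        newCenters[c][d] = counts[c] == 0 ? centers[c][d] : sums[c][d] / counts[c];
      }
    }
    return newCenters; // the caller compares these with the old centers to check convergence
  }
}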

Page 22: Machine Learning with Apache Hama

Benchmarks

One rack (16 nodes, 256 cores) cluster
10G network
On average faster than Mahout’s impl

Page 23: Machine Learning with Apache Hama

Gradient descent

Optimization algorithm
Finds a (local) minimum of some function
Used for:
solving linear systems
solving non linear systems
in machine learning tasks:
linear regression
logistic regression
neural networks backpropagation
…

Page 24: Machine Learning with Apache Hama

Gradient descent

Minimize a given (cost) function
Give the function a starting point (a set of parameters)
Iteratively change the parameters in order to minimize the function
Stop at the (local) minimum
There’s some math, but intuitively: evaluate derivatives at a given point in order to choose where to “go” next
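For reference, this is the standard update rule (not anything Hama-specific): with parameters θ, learning rate α and cost function J(θ), each iteration performs θ_j := θ_j − α · ∂J(θ)/∂θ_j simultaneously for all j; for the linear regression used later, J(θ) = 1/(2m) · Σ_i (h_θ(x_i) − y_i)², which is why the cost should keep decreasing as long as α is small enough.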

Page 25: Machine Learning with Apache Hama

Gradient descent BSP

Iteratively:
Superstep 1*i: each task calculates and broadcasts portions of the cost function with the current parameters
Superstep 2*i: aggregate and update the cost function; check the aggregated cost and the iteration count (cost should always decrease)
Superstep 3*i: each task calculates and broadcasts portions of the (partial) derivatives
Superstep 4*i: aggregate and update the parameters (a sketch of the per-task portions follows below)
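An illustrative sketch of those per-task “portions”, assuming the standard linear-regression cost (this is not Hama's actual gradient descent code): each task computes the squared-error sum and the gradient sum over its own slice of examples, while the surrounding BSP code broadcasts, aggregates and applies them.

// Per-task partial cost and partial gradient for a linear hypothesis (illustrative)
public class PartialGradient {

  // Partial cost: sum of squared errors over this task's examples
  static double partialCost(double[][] x, double[] y, double[] theta) {
    double cost = 0d;
    for (int i = 0; i < x.length; i++) {
      double error = predict(x[i], theta) - y[i];
      cost += error * error;
    }
    return cost; // aggregated and divided by 2m after the barrier
  }

  // Partial derivatives: sum of error * feature over this task's examples
  static double[] partialGradient(double[][] x, double[] y, double[] theta) {
    double[] grad = new double[theta.length];
    for (int i = 0; i < x.length; i++) {
      double error = predict(x[i], theta) - y[i];
      for (int j = 0; j < theta.length; j++) {
        grad[j] += error * x[i][j];
      }
    }
    return grad; // aggregated, scaled by alpha/m, and subtracted from theta
  }

  static double predict(double[] features, double[] theta) {
    double h = 0d;
    for (int j = 0; j < theta.length; j++) {
      h += theta[j] * features[j];
    }
    return h;
  }
}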

Page 26: Machine Learning with Apache Hama

Gradient descent BSP

Simplistic example: linear regression
Given a real estate market dataset
Estimate new houses’ prices given known houses’ size, geographic region and prices
Expected output: actual parameters for the (linear) prediction function

Page 27: Machine Learning with Apache Hama

Gradient descent BSP

Generate a different model for each region
House item vectors: price -> size (e.g. 150k -> 80)
2 dimensional space
~1.3M vectors dataset
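Concretely (our reading of the slide, not something it states): with such vectors the model is a single-feature linear hypothesis, price ≈ θ0 + θ1 · size, so the item 150k -> 80 contributes the training pair (size = 80, price = 150,000) and gradient descent fits θ0 and θ1 separately for each region.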

Page 28: Machine Learning with Apache Hama

Gradient descent BSP

Dataset and model fit

Page 29: Machine Learning with Apache Hama

Gradient descent BSP

Cost checking

Page 30: Machine Learning with Apache Hama

Gradient descent BSP: classification

Logistic regression with gradient descent
Real estate market dataset
We want to find which estate listings belong to agencies (to avoid buying from them)
Same algorithm, with a different cost function and features
Existing items are tagged or not as “belonging to agency”
Create vectors from the items’ text
Sample vector: 1 -> 1 3 0 0 5 3 4 1
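For reference, in standard logistic regression (not Hama-specific) the hypothesis becomes h_θ(x) = 1 / (1 + e^(−θᵀx)), read as the probability that a listing belongs to an agency, and the cost is the log-loss −[y · log h_θ(x) + (1 − y) · log(1 − h_θ(x))] averaged over the examples. In the sample vector above, the leading 1 is presumably the “belonging to agency” label and the remaining numbers the text features, though the slide does not spell that out.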

Page 31: Machine Learning with Apache Hama

Gradient descent BSP: classification

Page 32: Machine Learning with Apache Hama

Benchmarks

Not directly comparable to Mahout’s regression algorithms
Both SGD and CGD are inherently better than plain GD
But Hama GD had on average the same performance as Mahout’s SGD / CGD
Next step is implementing SGD / CGD on top of Hama

Page 33: Machine Learning with Apache Hama

Wrap up

Even if the ML module is still “young” / work in progress, and tools like Apache Mahout have better “coverage”

Apache Hama can be particularly useful in certain “highly iterative” use cases

Interesting benchmarks

Page 34: Machine Learning with Apache Hama


Thanks!

