1
Machine Learning with Apache Hama
Tommaso Teofili
tommaso [at] apache [dot] org
2
About me
ASF member having fun with: Lucene / Solr, Hama, UIMA, Stanbol, … some others
SW engineer @ Adobe R&D
3
Agenda
Apache Hama and BSP
Why machine learning on BSP
Some examples
Benchmarks
4
Apache Hama
Bulk Synchronous Parallel (BSP) computing framework on top of HDFS for massive scientific computations
TLP since May 2012
0.6.0 release out soon
Growing community
5
BSP supersteps
A BSP algorithm is composed of a sequence of “supersteps”
6
BSP supersteps
Each task:
Superstep 1: do some computation, communicate with other tasks, synchronize
Superstep 2: do some computation, communicate with other tasks, synchronize
…
Superstep N: do some computation, communicate with other tasks, synchronize
7
Why BSP
Simple programming model: the superstep semantics are easy to reason about
Preserves data locality, which improves performance
Well suited for iterative algorithms
8
Apache Hama architecture BSP Program execution flow
9
Apache Hama architecture
10
Apache Hama Features
BSP API
M/R-like I/O API
Graph API
Job management / monitoring
Checkpoint recovery
Local & (pseudo-)distributed run modes
Pluggable message transfer architecture
YARN supported
Runs in Apache Whirr
11
Apache Hama BSP API
public abstract class BSP<K1, V1, K2, V2, M extends Writable> …
K1, V1 are the key/value types for input
K2, V2 are the key/value types for output
M is the type of the messages used for task communication
12
Apache Hama BSP API (a minimal example follows)
public void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws ..
public void setup(BSPPeer<K1, V1, K2, V2, M> peer) throws ..
public void cleanup(BSPPeer<K1, V1, K2, V2, M> peer) throws ..
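To make the API concrete, here is a minimal sketch of a BSP task (an illustration, not code from the slides; the class and key/value choices are made up for the example): each peer sums the doubles in its local input split, sends its partial sum to a master peer, and the master aggregates after the barrier.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

// Hypothetical example task: distributed sum in two supersteps.
public class SumBSP extends BSP<LongWritable, DoubleWritable, Text, DoubleWritable, DoubleWritable> {
  @Override
  public void bsp(BSPPeer<LongWritable, DoubleWritable, Text, DoubleWritable, DoubleWritable> peer)
      throws IOException, SyncException, InterruptedException {
    // Superstep 1: sum the local input split, send the partial sum to the master peer.
    double sum = 0d;
    LongWritable key = new LongWritable();
    DoubleWritable value = new DoubleWritable();
    while (peer.readNext(key, value)) {
      sum += value.get();
    }
    String master = peer.getPeerName(0);
    peer.send(master, new DoubleWritable(sum));
    peer.sync(); // barrier: all tasks finish superstep 1 before messages are delivered

    // Superstep 2: the master aggregates all partial sums and writes the output.
    if (peer.getPeerName().equals(master)) {
      double total = 0d;
      DoubleWritable msg;
      while ((msg = peer.getCurrentMessage()) != null) {
        total += msg.get();
      }
      peer.write(new Text("sum"), new DoubleWritable(total));
    }
  }
}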
13
Machine learning on BSP
Lots (most?) of ML algorithms are inherently iterative
The Hama ML module currently includes:
Collaborative filtering
Clustering
Gradient descent
14
Benchmarking architecture
[Diagram: benchmark setup with Hama and Mahout each running on a multi-node cluster over HDFS, alongside Solr / Lucene and a DBMS]
15
Collaborative filtering
Given user preferences on movies
We want to find users “near” a specific user
So that the user can “follow” them
And/or see what they like (which he/she could like too)
16
Collaborative filtering BSP
Given a specific user, iteratively (for each task):
Superstep 1*i (see the sketch below)
Read a new user preference row
Find how near that user is to the current user
That is, find how near their preferences are
Since preferences are given as vectors, we can use vector distance measures such as Euclidean, cosine, etc.
Broadcast the measured output to the other peers
Superstep 2*i
Aggregate the measure outputs
Update the most relevant users
Still to be committed (HAMA-612)
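As a rough illustration of superstep 1*i (not the HAMA-612 code, which is not committed yet; the helper names and Text-encoded message format are assumptions), a task could score one preference row against the target user with cosine similarity and broadcast the score:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class SimilarityStep {
  // Hypothetical helper: score one user's preferences against the target
  // user's and broadcast "user:score" to every peer (superstep 1*i).
  static void scoreAndBroadcast(BSPPeer<Text, Text, Text, Text, Text> peer,
      String user, double[] prefs, double[] targetPrefs)
      throws IOException, SyncException, InterruptedException {
    double score = cosineSimilarity(prefs, targetPrefs);
    for (String other : peer.getAllPeerNames()) {
      peer.send(other, new Text(user + ":" + score));
    }
    peer.sync(); // peers aggregate the scores in superstep 2*i
  }

  // Cosine similarity between two preference vectors.
  static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }
}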
17
Collaborative filtering BSP
Given user ratings about movies:
"john" -> 0, 0, 0, 9.5, 4.5, 9.5, 8
"paula" -> 7, 3, 8, 2, 8.5, 0, 0
"jim" -> 4, 5, 0, 5, 8, 0, 1.5
"tom" -> 9, 4, 9, 1, 5, 0, 8
"timothy" -> 7, 3, 5.5, 0, 9.5, 6.5, 0
We ask for the 2 users nearest to “paula” and we get “timothy” and “tom” (user recommendation)
We can then extract highly rated movies from “timothy” and “tom” that “paula” didn’t see (item recommendation)
18
Benchmarks
Fairly simple algorithm, highly iterative
Compared to Apache Mahout:
Performs better than ALS-WR
Performs similarly to RecommenderJob and ItemSimilarityJob
19
K-Means clustering
We have a bunch of data (e.g. documents)
We want to group those docs into k homogeneous clusters
Iteratively, for each cluster:
Calculate the new cluster center
Add the docs nearest to the new center to the cluster
20
K-Means clustering
21
K-Means clustering BSP
Iteratively (see the sketch below):
Superstep 1*i (assignment phase)
Read the vector splits
Sum up temporary centers with the assigned vectors
Broadcast the sum and the ingested vector count
Superstep 2*i (update phase)
Calculate the total sum over all received messages and average it
Replace the old centers with the new centers and check for convergence
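A plain-Java sketch of one such iteration (an illustration under assumed names; the Hama broadcast/sync plumbing is omitted and noted in comments):

class KMeansIterationSketch {
  // One k-means iteration over a task's local vectors: the assignment phase
  // computes per-center partial sums and counts (what superstep 1*i broadcasts),
  // the update phase averages them into new centers (superstep 2*i).
  static double[][] iterate(double[][] vectors, double[][] centers) {
    int k = centers.length, dim = centers[0].length;
    double[][] sums = new double[k][dim];
    int[] counts = new int[k];

    // Assignment phase: add each vector to its nearest center's partial sum.
    for (double[] v : vectors) {
      int nearest = 0;
      double best = Double.MAX_VALUE;
      for (int c = 0; c < k; c++) {
        double d = 0;
        for (int i = 0; i < dim; i++) {
          d += (v[i] - centers[c][i]) * (v[i] - centers[c][i]);
        }
        if (d < best) { best = d; nearest = c; }
      }
      for (int i = 0; i < dim; i++) sums[nearest][i] += v[i];
      counts[nearest]++;
    }
    // In the BSP version, sums and counts are broadcast here and peer.sync() is called.

    // Update phase: average the (aggregated) sums into the new centers.
    double[][] newCenters = new double[k][dim];
    for (int c = 0; c < k; c++) {
      for (int i = 0; i < dim; i++) {
        newCenters[c][i] = counts[c] > 0 ? sums[c][i] / counts[c] : centers[c][i];
      }
    }
    return newCenters; // caller checks convergence against the old centers
  }
}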
22
Benchmarks
One-rack cluster (16 nodes, 256 cores), 10G network
On average faster than Mahout’s implementation
23
Gradient descent
Optimization algorithm: find a (local) minimum of some function
Used for:
solving linear systems
solving non-linear systems
machine learning tasks: linear regression, logistic regression, neural network backpropagation, …
24
Gradient descent
Minimize a given (cost) function
Give the function a starting point (a set of parameters)
Iteratively change the parameters in order to minimize the function
Stop at the (local) minimum
There’s some math, but intuitively: evaluate the derivatives at the given point in order to choose where to “go” next
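For reference (an addition in standard notation, not from the slides), the parameter update performed at each iteration is:

\theta_j \leftarrow \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta)

where J is the cost function and \alpha the learning rate (step size).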
25
Gradient descent BSP
Iteratively (the per-task work is sketched below):
Superstep 1*i
each task calculates and broadcasts portions of the cost function with the current parameters
Superstep 2*i
aggregate and update the cost function
check the aggregated cost and iteration count (the cost should always decrease)
Superstep 3*i
each task calculates and broadcasts portions of the (partial) derivatives
Superstep 4*i
aggregate and update the parameters
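A plain-Java sketch of the per-task work (illustration only, with made-up names; the broadcast/aggregate plumbing is omitted), using the squared-error cost of linear regression:

class GradientDescentSketch {
  // Superstep 1*i: partial cost over this task's portion of the data.
  static double partialCost(double[][] x, double[] y, double[] theta) {
    double cost = 0;
    for (int i = 0; i < x.length; i++) {
      double err = predict(x[i], theta) - y[i];
      cost += err * err; // broadcast; aggregated and checked in superstep 2*i
    }
    return cost;
  }

  // Superstep 3*i: partial derivatives over this task's portion of the data.
  static double[] partialGradient(double[][] x, double[] y, double[] theta) {
    double[] grad = new double[theta.length];
    for (int i = 0; i < x.length; i++) {
      double err = predict(x[i], theta) - y[i];
      for (int j = 0; j < theta.length; j++) {
        grad[j] += err * x[i][j]; // broadcast; parameters updated in superstep 4*i
      }
    }
    return grad;
  }

  // Linear hypothesis; x[i][0] is assumed to be 1 for the intercept term.
  static double predict(double[] xi, double[] theta) {
    double h = 0;
    for (int j = 0; j < theta.length; j++) h += theta[j] * xi[j];
    return h;
  }
}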
26
Gradient descent BSP
Simplistic example: linear regression
Given a real estate market dataset
Estimate new houses’ prices given known houses’ sizes, geographic regions and prices
Expected output: the actual parameters for the (linear) prediction function
27
Gradient descent BSP
Generate a different model for each region
House item vectors: price -> size, e.g. 150k -> 80
2-dimensional space
~1.3M vector dataset
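In this 2-dimensional setup the prediction function presumably takes the single-feature linear form (an assumption about the model, added for reference):

h_\theta(\mathit{size}) = \theta_0 + \theta_1 \cdot \mathit{size}

with \theta_0, \theta_1 being the parameters the gradient descent BSP job learns.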
28
Gradient descent BSP Dataset and model fit
29
Gradient descent BSP Cost checking
30
Gradient descent BSP Classification
Logistic regression with gradient descent
Real estate market dataset
We want to find which estate listings belong to agencies (to avoid buying from them)
Same algorithm, with a different cost function and features
Existing items are tagged (or not) as “belonging to agency”
Create vectors from the items’ text
Sample vector: 1 -> 1 3 0 0 5 3 4 1
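For logistic regression the prediction function becomes the sigmoid of the linear combination (the standard formulation, added here for reference):

h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}

so h_\theta(x) can be read as the probability that a listing belongs to an agency.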
31
Gradient descent BSP Classification
32
Benchmarks
Not directly comparable to Mahout’s regression algorithms
Both SGD and CGD are inherently better than plain GD
But Hama GD had on average the same performance as Mahout’s SGD / CGD
Next step: implementing SGD / CGD on top of Hama
33
Wrap up
Even if the ML module is still “young” / work in progress, and tools like Apache Mahout have better “coverage”,
Apache Hama can be particularly useful in certain “highly iterative” use cases
Interesting benchmarks
34
Thanks!