Boston hug-2012-07

Post on 10-May-2015

description

Describes the state of Apache Mahout with special focus on the upcoming k-nearest neighbor and k-means clustering algorithms.

transcript

1©MapR Technologies - Confidential

Mahout, New and Improved
Now with Super Fast Clustering

Agenda

What happened in Mahout 0.7
– less bloat
– simpler structure
– general cleanup

To Cut Out Bloat

Bloat is Leaving in 0.7

Lots of abandoned code in Mahout
– average code quality is poor
– no users
– no maintainers
– why do we care?

Examples
– old LDA
– old Naïve Bayes
– genetic algorithms

If you care, get on the mailing list
– oops, too late since 0.7 is already released

Integration of Collections

Nobody Cares about Collections

We need it, math is built on it

Pull it into math

Broke the build (battle of the code expanders)

Fixed now (thanks to Grant)

Pig Vector

What is it?

Supports Pig access to Mahout functions

So far text vectorization

And classification

And model saving

Kind of works (see pigML from Twitter for better functionality)

Compile and Install

Start by compiling and installing mahout in your local repository:

cd ~/Apache
git clone https://github.com/apache/mahout.git
cd mahout
mvn install -DskipTests

Then do the same with pig-vector:

cd ~/Apache
git clone git@github.com:tdunning/pig-vector.git
cd pig-vector
mvn package

Tokenize and Vectorize Text

Tokenization is done using a text encoder
– the dimension of the resulting vectors (typically 100,000-1,000,000)
– a description of the variables to be included in the encoding
– the schema of the tuples that Pig will pass, together with their data types

Example:

define EncodeVector org.apache.mahout.pig.encoders.EncodeVector('10', 'x+y+1', 'x:numeric, y:word, z:text');

You can also add a Lucene 3.1 analyzer in parentheses if you want something fancier
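
A minimal sketch of what such an encoder might do, assuming pig-vector hashes each variable into a fixed-size vector in the style of Mahout's hashed feature encoders. The names `bucket` and `encode` and the dictionary form of the schema are hypothetical, and this is plain Python rather than the actual pig-vector code:

```python
import zlib

def bucket(name, value, dim):
    """Map a (variable, value) pair to a stable slot in [0, dim)."""
    return zlib.crc32(f"{name}:{value}".encode("utf-8")) % dim

def encode(record, schema, dim):
    """Encode one record as a list of length dim.

    schema maps variable name -> 'numeric' | 'word' | 'text'
    (hypothetical; it mirrors 'x:numeric, y:word, z:text').
    """
    v = [0.0] * dim
    v[bucket("intercept", "", dim)] += 1.0          # the '1' in 'x+y+1'
    for name, kind in schema.items():
        value = record[name]
        if kind == "numeric":
            v[bucket(name, "", dim)] += float(value)  # one slot per variable
        elif kind == "word":
            v[bucket(name, value, dim)] += 1.0        # one slot per distinct word
        elif kind == "text":
            for token in str(value).lower().split():  # crude whitespace tokenizer
                v[bucket(name, token, dim)] += 1.0
        else:
            raise ValueError(f"unknown type {kind!r}")
    return v
```

Collisions simply add together, which is why the dimension is usually chosen large.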

The Formula

Not normal arithmetic

Describes which variables to use, whether offset is included

Also describes which interactions to use
– but that doesn’t do anything yet!

Load and Encode Data

Load the data

a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',') as (x1:int, x2:int, x3:int);

And encode it

b = foreach a generate 1 as key, EncodeVector(*) as v;

Note that the true meaning of * is very subtle

Now store it

store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage (
'-c com.twitter.elephantbird.pig.util.IntWritableConverter',
'-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.math.VectorWritable');

Train a Model

Pass previously encoded data to a sequential model trainer

define train org.apache.mahout.pig.LogisticRegression('iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');

Note that the argument is a string with its own syntax

Reservations and Qualms

Pig-vector isn’t done

And it is ugly

And it doesn’t quite work

And it is hard to build

But there seems to be promise

Potential

Add Naïve Bayes Model?

Somehow simplify the syntax?

Try a recent version of elephant-bird?

Switch to pigML?

Large-scale k-Means Clustering

Goals

Cluster very large data sets
Facilitate large nearest neighbor search
Allow very large number of clusters
Achieve good quality
– low average distance to nearest centroid on held-out data

Based on Mahout Math
Runs on Hadoop (really MapR) cluster
FAST – cluster tens of millions in minutes

Non-goals

Use map-reduce (but it is there)
Minimize the number of clusters
Support metrics other than L2

Anti-goals

Multiple passes over original data
Scale as O(k n)

Why?

K-nearest Neighbor with Super Fast k-means

What’s that?

Find the k nearest training examples
Use the average value of the target variable from them

This is easy … but hard
– easy because it is so conceptually simple and you have few knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results, not just single nearest

Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
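
A brute-force sketch of the idea, in Python for brevity. The function name `knn_predict` and the `(vector, target)` data layout are assumptions for illustration, not the prototype's code:

```python
import heapq
import math

def knn_predict(query, examples, k):
    """Average the target over the k training examples nearest to query.

    examples is a list of (vector, target) pairs; distance is Euclidean (L2).
    """
    nearest = heapq.nsmallest(k, examples, key=lambda e: math.dist(query, e[0]))
    return sum(target for _, target in nearest) / len(nearest)
```

Every query scans every example, so the cost is queries x examples distance computations: 3K x 200K is 6 x 10^8, while 20M x 25M is 5 x 10^14, roughly 800,000 times more work. That ratio is in line with a speedup goal on the order of 1,000,000x.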

Modeling with k-nearest Neighbors

Subject to Some Limits

Log Transform Improves Things

Neighbors Depend on Good Presentation

How We Did It

2 week hackathon with 6 developers from customer bank
Agile-ish development
To avoid IP issues
– all code is Apache Licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosting on Github
– open is easier than closed (in this case)

Goal is new open technology to facilitate new closed solutions

Ambitious goal of ~ 1,000,000 x speedup
– well, really only 100-1000x after basic hygiene

What We Did

Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid

Shared memory matrix
– FileBasedMatrix uses mmap to share very large dense matrices

Searcher interface
– Brute, ProjectionSearch, KmeansSearch, LshSearch

Super-fast clustering
– Kmeans, StreamingKmeans

Projection Search

java.util.TreeSet!

Projection Search

Projection onto a line provides a total order on data
Nearby points stay nearby
Some other points also wind up close

Search points just before or just after the query point
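
A rough sketch of projection search, using sorted lists and `bisect` where a Java version would use a TreeSet. The class layout, the number of projections, and the `window` parameter are assumptions for illustration, not the Mahout implementation:

```python
import bisect
import math
import random

class ProjectionSearch:
    """Approximate nearest-neighbor search using random 1-D projections."""

    def __init__(self, points, n_projections=4, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        self.directions = []
        self.orders = []  # one sorted list of (projection, index) per direction
        for _ in range(n_projections):
            d = [rng.gauss(0, 1) for _ in range(dim)]
            self.directions.append(d)
            self.orders.append(
                sorted((self._dot(d, p), i) for i, p in enumerate(points)))

    @staticmethod
    def _dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def search(self, query, k, window=8):
        """Return indices of up to k candidates, closest first."""
        candidates = set()
        for d, order in zip(self.directions, self.orders):
            # Position the query would take in this projection's total order.
            pos = bisect.bisect_left(order, (self._dot(d, query), -1))
            # Look at the points just before and just after that position.
            for _, i in order[max(0, pos - window):pos + window]:
                candidates.add(i)
        # Re-rank the candidates by true distance.
        ranked = sorted(candidates, key=lambda i: math.dist(query, self.points[i]))
        return ranked[:k]
```

Each extra projection adds candidates that happen to be close along a different line, so true neighbors missed by one projection can be picked up by another.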

How Many Projections?

K-means Search

Simple Idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters

Recursive application
– to search a cluster, use a Searcher!
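
A sketch of the pre-cluster-then-search idea, assuming the centroids are already available. The class layout and the `probes` parameter are hypothetical, and the inner scan is brute force here where a recursive version would delegate to another searcher:

```python
import math

class KmeansSearch:
    """Search only the clusters whose centroids are nearest to the query."""

    def __init__(self, points, centroids):
        self.points = points
        self.centroids = centroids
        self.members = [[] for _ in centroids]  # point indices per cluster
        for i, p in enumerate(points):
            c = min(range(len(centroids)),
                    key=lambda j: math.dist(p, centroids[j]))
            self.members[c].append(i)

    def search(self, query, k, probes=2):
        """Return indices of up to k points from the `probes` nearest clusters."""
        nearest_clusters = sorted(
            range(len(self.centroids)),
            key=lambda j: math.dist(query, self.centroids[j]))[:probes]
        candidates = [i for j in nearest_clusters for i in self.members[j]]
        candidates.sort(key=lambda i: math.dist(query, self.points[i]))
        return candidates[:k]
```

Probing more than one cluster guards against a true neighbor sitting just across a cluster boundary.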

But This Requires k-means!

Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– Maybe Pregel clones like Giraph would be better
– Or maybe not

Streaming k-means is
– One pass (through the original data)
– Very fast (20 µs per data point with threads on one node)
– Very parallelizable

Basic Method

Use a single pass of k-means with very many clusters
– output is a bad-ish clustering but a good surrogate

Use weighted centroids from step 1 to do in-memory clustering
– output is a good clustering with fewer clusters
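
The second step could look roughly like this: ordinary Lloyd iterations over the weighted surrogate centroids, with each centroid counting in proportion to its weight. The function name, the seeding rule (heaviest centroids first) and the iteration count are assumptions chosen to keep the sketch short:

```python
import math

def weighted_kmeans(centroids, k, iterations=10):
    """Cluster (vector, weight) pairs into k weighted centers."""
    # Seed with the k heaviest surrogate centroids (a simple, assumed choice).
    seeds = sorted(centroids, key=lambda cw: -cw[1])[:k]
    centers = [list(c) for c, _ in seeds]
    weights = [0.0] * len(centers)
    for _ in range(iterations):
        sums = [[0.0] * len(centers[0]) for _ in centers]
        weights = [0.0] * len(centers)
        for c, w in centroids:
            # Assign each surrogate centroid to its nearest center.
            j = min(range(len(centers)), key=lambda i: math.dist(c, centers[i]))
            weights[j] += w
            for d, value in enumerate(c):
                sums[j][d] += w * value
        for j, total in enumerate(weights):
            if total > 0:  # an empty cluster keeps its previous center
                centers[j] = [s / total for s in sums[j]]
    return list(zip(centers, weights))
```

Because the input is a few thousand weighted centroids rather than the original data, this step fits comfortably in memory.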

Algorithmic Details

For each data point x_n
    compute distance to nearest centroid, ∂
    sample u; if u > ∂/ß add to nearest centroid
    else create new centroid

    if number of centroids > k log n
        recursively cluster centroids
        set ß = 1.5 ß if number of centroids did not decrease
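
A compact Python sketch of the steps above. It assumes u is drawn uniformly from [0, 1), uses a brute-force nearest-centroid scan, and floors log n at 1 so the very first points do not trigger a collapse; all names and the initial ß are illustrative, not Mahout's StreamingKmeans:

```python
import math
import random

def nearest(centroids, x):
    """Return (index, distance) of the centroid closest to x (brute force)."""
    best = min(range(len(centroids)),
               key=lambda j: math.dist(centroids[j][0], x))
    return best, math.dist(centroids[best][0], x)

def add_point(centroids, x, w, beta, rng):
    """Merge weighted point (x, w) into its nearest centroid or start a new one."""
    if centroids:
        j, d = nearest(centroids, x)
        if rng.random() > d / beta:           # close points usually get merged
            c, cw = centroids[j]
            total = cw + w
            centroids[j] = (
                [(a * cw + b * w) / total for a, b in zip(c, x)], total)
            return
    centroids.append((list(x), w))            # far points usually start a centroid

def streaming_kmeans(data, k, beta=1.0, seed=0):
    """One pass over data; returns a list of (centroid, weight) pairs."""
    rng = random.Random(seed)
    centroids = []
    for n, x in enumerate(data, start=1):
        add_point(centroids, x, 1.0, beta, rng)
        if len(centroids) > k * max(1.0, math.log(n)):
            before = len(centroids)
            old, centroids = centroids, []
            rng.shuffle(old)
            for c, w in old:                  # recursively cluster the centroids
                add_point(centroids, c, w, beta, rng)
            if len(centroids) >= before:      # no progress: loosen the threshold
                beta *= 1.5
    return centroids
```

Total weight is conserved at every step, so the output centroids can be fed directly into a weighted in-memory clustering.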

How It Works

Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly

Parallel Speedup?

Warning, Recursive Descent

Inner loop requires finding nearest centroid

With lots of centroids, this is slow

But wait, we have classes to accelerate that!

(Let’s not use k-means searcher, though)

Empirically, projection search beats 64 bit LSH by a bit
– More optimization may change this story

Moving to Ultra Mega Super Scale

Map-reduce implementation nearly trivial

Map: rough-cluster input data, output ß, weighted centroids

Reduce:
– single reducer gets all centroids
– if too many centroids, merge using recursive clustering
– optionally do final clustering in-memory

Combiner possible, but not important

Contact:
– tdunning@maprtech.com
– @ted_dunning

Slides and such:
– http://info.mapr.com/ted-boston-2012-07

Hash tags: #boston-hug #mahout #mapr