Practical Machine Learning with Mahout
whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer and member, Apache Software Foundation
– particularly Mahout, Zookeeper and Drill
(we’re hiring)
• Contact me at
– tdunning@maprtech.com
– tdunning@apache.com
– ted.dunning@gmail.com
– @ted_dunning
Agenda
• What works at scale
• Recommendation
• Unsupervised - Clustering
What Works at Scale
• Logging
• Counting
• Session grouping
• Really. Don’t bet on anything much more complex than these
• These are harder than they look
Recommendations
• Special case of reflected intelligence
• Traditionally “people who bought x also bought y”
• But soooo much more is possible
Examples
• Customers buying books (Linden et al)
• Web visitors rating music (Shardanand and Maes) or movies (Riedl et al), (Netflix)
• Internet radio listeners not skipping songs (Musicmatch)
• Internet video watchers watching >30 s
Dyadic Structure
• Functional
– Interaction: actor -> item*
• Relational
– Interaction ⊆ Actors × Items
• Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions
• Predict missing observations
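For concreteness, a minimal sketch (not from the slides) that builds this actor-by-item count matrix from a toy interaction log; the log contents are made up:

from collections import Counter

# Hypothetical interaction log: (actor, item) pairs.
log = [("u1", "a"), ("u1", "b"), ("u2", "a"),
       ("u2", "c"), ("u3", "b"), ("u1", "a")]

counts = Counter(log)  # value = count of interactions
actors = sorted({u for u, _ in log})
items = sorted({i for _, i in log})

# Dense for illustration; at scale this matrix is sparse.
A = [[counts[(u, i)] for i in items] for u in actors]
for u, row in zip(actors, A):
    print(u, row)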
Recommendations Analysis
• R(x,y) = # people who bought x also bought y
select A.item_id as x, B.item_id as y, count(*)
from (select distinct user_id, item_id from log) A
join (select distinct user_id, item_id from log) B
on A.user_id = B.user_id
group by A.item_id, B.item_id
Fundamental Algorithmic Structure
• Cooccurrence
• Matrix approximation by factoring
• LLR (log-likelihood ratio test)
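The LLR here is the G² log-likelihood ratio test used in Mahout to decide which cooccurrences are anomalous rather than noise; a minimal Python sketch of that score on a 2×2 contingency table (function names are mine):

from math import log

def x_log_x(x):
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized Shannon entropy over raw counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # k11 = both events, k12/k21 = one event only, k22 = neither.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp round-off

# Items that cooccur far more often than chance score high.
print(llr(100, 1000, 1000, 100000))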
But Wait!
• Cooccurrence
• Cross occurrence
For example
• Users enter queries (A)
– (actor = user, item = query)
• Users view videos (B)
– (actor = user, item = video)
• A’A gives query recommendation
– “did you mean to ask for”
• B’B gives video recommendation
– “you might like these videos”
The punch-line
• B’A recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
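A minimal sketch of this cooccurrence / cross-occurrence algebra with sparse matrices; the tiny three-user matrices are made-up stand-ins for real logs:

import numpy as np
from scipy.sparse import csr_matrix

# A: users x queries, B: users x videos (toy data).
A = csr_matrix(np.array([[1, 0],
                         [1, 1],
                         [0, 1]]))
B = csr_matrix(np.array([[1, 0, 1],
                         [1, 1, 0],
                         [0, 1, 1]]))

print("A'A, query-query cooccurrence:\n", (A.T @ A).toarray())
print("B'B, video-video cooccurrence:\n", (B.T @ B).toarray())
print("B'A, video x query cross-occurrence:\n", (B.T @ A).toarray())

In practice the raw counts are filtered with the LLR test above rather than used directly.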
Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
Hypothetical Example
• Want a navigational ontology?
• Just put labels on a web page with traffic
– This gives A = users x label clicks
• Remember viewing history
– This gives B = users x items
• Cross recommend
– B’A = label to item mapping
• After several users click, results are whatever users think they should be
Super-fast k-means Clustering
RATIONALE
What is Quality?
• Robust clustering is not a goal
– we don’t care if the same clustering is replicated
• Generalization is critical
• Agreement with a “gold standard” is a non-issue
An Example
Diagonalized Cluster Proximity
Clusters as Distribution Surrogate
THEORY
For Example
Grouping these two clusters seriously hurts squared distance
ALGORITHMS
Typical k-means Failure
Selecting two seeds here cannot be fixed with Lloyd’s algorithm
Result is that these two clusters get glued together
Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real cluster
• Avoids outliers in centroid computation
initialize centroids randomly with distance-maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer than the closest other cluster
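A minimal runnable sketch of that loop, assuming a k-means++-style seeding and a ball radius of half the distance to the second-nearest centroid (both are simplifications of the published algorithm):

import numpy as np

def ball_kmeans(X, k, iters=3, seed=0):
    rng = np.random.default_rng(seed)
    # Seeding with distance-maximizing tendency (k-means++ style).
    C = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in C], axis=0)
        C.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    C = np.array(C, dtype=float)
    for _ in range(iters):
        D = np.linalg.norm(X[:, None] - C[None, :], axis=2)
        nearest = D.argmin(axis=1)
        srt = np.sort(D, axis=1)
        d1, d2_ = srt[:, 0], srt[:, 1]
        for j in range(k):
            # The "ball": only points much closer to this centroid
            # than to any other centroid contribute to the mean.
            mask = (nearest == j) & (d1 < 0.5 * d2_)
            if mask.any():
                C[j] = X[mask].mean(axis=0)
    return C

X = np.random.default_rng(1).normal(size=(500, 2))
print(ball_kmeans(X, k=4))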
Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops exponentially with k
• Alternative strategy has high probability of success, but takes O(nkd + k³d) time
• But for big data, k gets large
Surrogate Method
• Start with sloppy clustering into lots of clusters, κ = k log n of them
• Use this sketch as a weighted surrogate for the data
• Results are provably good for highly clusterable data
Algorithm Costs
• Surrogate methods
– fast, sloppy single-pass clustering with κ = k log n
– fast, sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids:
O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice
Algorithm Costs
• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 × 10 × 26 ≈ 500,000
– d (log k + log log n) = 10 × (11 + 5) = 160
– roughly 3,000 times faster is a bona fide big deal
How It Works
• For each point
– Find approximately nearest centroid (distance = d)
– If d > threshold, start a new centroid
– Else if u < d / threshold (u uniform on [0, 1]), start a new cluster
– Else add the point to the nearest centroid
• If the number of centroids exceeds κ ≈ C log N
– Recursively cluster the centroids with a higher threshold
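A minimal single-pass sketch of that update rule; the threshold growth factor and the recursive collapse are simplified, and all names are mine:

import numpy as np

def sketch(points, weights, kappa, threshold, rng):
    # One streaming pass over weighted points -> (centroids, weights).
    C, W = [points[0].astype(float)], [float(weights[0])]
    for x, w in zip(points[1:], weights[1:]):
        d = np.linalg.norm(np.array(C) - x, axis=1)
        j = int(d.argmin())
        if d[j] > threshold or rng.random() < d[j] / threshold:
            C.append(x.astype(float))          # start a new cluster
            W.append(float(w))
        else:
            C[j] = (C[j] * W[j] + x * w) / (W[j] + w)  # weighted mean
            W[j] += w
        if len(C) > kappa:
            # Too many centroids: recursively cluster them
            # with a higher threshold.
            Cn, Wn = sketch(np.array(C), np.array(W), kappa,
                            threshold * 1.5, rng)
            C, W = list(Cn), list(Wn)
    return np.array(C), np.array(W)

X = np.random.default_rng(1).normal(size=(1000, 2))
C, W = sketch(X, np.ones(len(X)), kappa=50, threshold=0.5,
              rng=np.random.default_rng(2))
print(len(C), "weighted centroids summarize", len(X), "points")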
IMPLEMENTATION
But Wait, …
• Finding the nearest centroid is the inner loop
• This could take O(d κ) per point, and κ can be big
• Happily, approximate nearest centroid search works fine
Projection Search
• Projecting all points onto one line gives a total ordering, so near neighbors can be found by binary search along that line
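A minimal sketch of projection search for the nearest-centroid lookup; real implementations use several random projections and merge the candidate sets:

import numpy as np

rng = np.random.default_rng(0)
centroids = rng.normal(size=(10000, 20))

# Project all centroids onto one random direction: a total ordering.
u = rng.normal(size=20)
u /= np.linalg.norm(u)
proj = centroids @ u
order = np.argsort(proj)
sorted_proj = proj[order]

def approx_nearest(x, width=32):
    # Search only centroids whose projection lands near x's projection.
    pos = np.searchsorted(sorted_proj, x @ u)
    lo, hi = max(0, pos - width), min(len(order), pos + width)
    cand = order[lo:hi]
    d = np.linalg.norm(centroids[cand] - x, axis=1)
    return cand[d.argmin()]

x = rng.normal(size=20)
print("approximate nearest centroid:", approx_nearest(x))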
LSH Bit-match Versus Cosine
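This slide’s figure compared Hamming match on LSH sign bits with true cosine similarity; a minimal sketch of the standard random-hyperplane construction behind that comparison:

import numpy as np

rng = np.random.default_rng(0)
d, bits = 20, 64
planes = rng.normal(size=(bits, d))  # one random hyperplane per bit

def signature(v):
    return planes @ v > 0  # one sign bit per hyperplane

def bit_match(a, b):
    # Fraction of matching bits estimates 1 - angle(a, b) / pi.
    return np.mean(signature(a) == signature(b))

a, b = rng.normal(size=d), rng.normal(size=d)
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
est = np.cos(np.pi * (1 - bit_match(a, b)))
print("cosine:", round(cos, 3), "bit-match estimate:", round(float(est), 3))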
RESULTS
Parallel Speedup?
Quality
• Ball k-means implementation appears significantly better than simple k-means
• Streaming k-means + ball k-means appears to be about as good as ball k-means alone
• All evaluations on 20 newsgroups with held-out data
• Figure of merit is mean and median squared distance to nearest cluster
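A minimal sketch of that figure of merit, computed on held-out points against the learned centroids (the array shapes are assumptions):

import numpy as np

def merit(held_out, centroids):
    # Squared distance from each held-out point to its nearest centroid.
    D = np.linalg.norm(held_out[:, None] - centroids[None, :], axis=2)
    d2 = D.min(axis=1) ** 2
    return d2.mean(), np.median(d2)

rng = np.random.default_rng(0)
print(merit(rng.normal(size=(200, 5)), rng.normal(size=(10, 5))))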
Contact Me!
• We’re hiring at MapR in US and Europe
• MapR software available for research use
• Get the code as part of Mahout trunk (or 0.8 very soon)
• Contact me at tdunning@maprtech.com or @ted_dunning
• Share news with @apachemahout