Statistical Computing For Big Data
Deepak Agarwal
LinkedIn Applied Relevance Science
[email protected]
ENAR 2014, Baltimore, USA
Main Collaborators: several others at both Y! and LinkedIn
• I wouldn't be here without them; extremely lucky to work with such talented individuals
Bee-Chung Chen Liang Zhang Bo Long
Jonathan Traupman Paul Ogilvie
Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop System
  – Overview of Distributed Computing
  – Introduction to Map-Reduce
  – Some statistical computations using Map-Reduce
    • Bootstrap, Logistic Regression
• Part II: Recommender Systems for Web Applications
  – Introduction
  – Content Recommendation
  – Online Advertising
Big Data Becoming Ubiquitous
• Bioinformatics
• Astronomy
• Internet
• Telecommunications
• Climatology
• …
Big Data: Some Size Estimates
• 1000 human genomes: > 100TB of data (1000 Genomes Project)
• Sloan Digital Sky Survey: 200GB of data per night (> 140TB aggregated)
• Facebook: a billion monthly active users
• LinkedIn: roughly > 280M members worldwide
• Twitter: > 500 million tweets a day
• Over 6 billion mobile phones in the world generating data every day
Big Data: Paradigm Shift
• Classical Statistics
  – Generalize using small data
• Paradigm shift with big data
  – We now have an almost infinite supply of data
  – Easy statistics? Just appeal to asymptotic theory?
• So the issue is mostly computational?
  – Not quite
• More data comes with more heterogeneity
• Need to change our statistical thinking to adapt
  – Classical statistics still invaluable to think about big data analytics
Some Statistical Challenges
• Exploratory Data Analysis (EDA), Visualization
  – Retrospective (on terabytes)
  – More real time (streaming computations every few minutes/hours)
• Statistical Modeling
  – Scale (computational challenge)
  – Curse of dimensionality
    • Millions of predictors, heterogeneity
  – Temporal and spatial correlations
Statistical Challenges (continued)
• Experiments
  – To test new methods, test hypotheses from randomized experiments
  – Adaptive experiments
• Forecasting
  – Planning, advertising
• Many more that I am not fully versed in
Defining Big Data
• How do you know you have a big data problem?
  – Is it only the number of terabytes?
  – What about dimensionality, structured/unstructured data, computations required, …
• No clear definition; different points of view
  – One view: when the desired computation cannot be completed in the stipulated time with the current best algorithm using the cores available on a commodity PC
Distributed Computing for Big Data
• Distributed computing is an invaluable tool to scale computations for big data
• Some distributed computing models
  – Multi-threading
  – Graphics Processing Units (GPU)
  – Message Passing Interface (MPI)
  – Map-Reduce
Evaluating a Method for a Problem
• Scalability
  – Process X GB in Y hours
• Ease of use for a statistician
• Reliability (fault tolerance)
  – Especially in an industrial environment
• Cost
  – Hardware and cost of maintenance
• Good for the computations required?
  – E.g., iterative versus one pass
• Resource sharing
Multithreading
• Multiple threads take advantage of multiple CPUs
• Shared memory
• Threads can execute independently and concurrently
• Can only handle gigabytes of data
• Reliable
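As a minimal illustration of the shared-memory threading model described above (a sketch, not part of the original tutorial; note that CPython's GIL limits true CPU parallelism for pure-Python work, so this shows the programming model rather than a speedup):

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]   # 4 interleaved views of the shared list

def partial_sum(chunk):
    # Each thread reads the shared data directly; no copying or messaging needed.
    return sum(chunk)

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)  # same answer as sum(data)
```

The key contrast with MPI or Map-Reduce is that the threads communicate through the shared list itself rather than through messages or disk.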
Graphics Processing Units (GPU)
• Number of cores:
  – CPU: order of 10
  – GPU: smaller cores, order of 1000
• Can be > 100x faster than CPU
  – Parallel, computationally intensive tasks off-loaded to the GPU
• Good for certain computationally intensive tasks
• Can only handle gigabytes of data
• Not trivial to use; requires a good understanding of low-level architecture for efficient use
  – But things are changing; it is getting more user friendly
Message Passing Interface (MPI)
• Language-independent communication protocol among processes (e.g., computers)
• Most suitable for the master/slave model
• Can handle terabytes of data
• Good for iterative processing
• Fault tolerance is low
Map-Reduce (Dean & Ghemawat, 2004)

Data → Mappers → Reducers → Output

• Computation is split into a Map (scatter) stage and a Reduce (gather) stage
• Easy to use:
  – The user only needs to implement two functions: Mapper and Reducer
• Easily handles terabytes of data
• Very good fault tolerance (failed tasks automatically get restarted)
Comparison of Distributed Computing Methods

                           Multithreading | GPU | MPI | Map-Reduce
Scalability (data size):   Gigabytes | Gigabytes | Terabytes | Terabytes
Fault tolerance:           High | High | Low | High
Maintenance cost:          Low | Medium | Medium | Medium-High
Iterative process cost:    Cheap | Cheap | Cheap | Usually expensive
Resource sharing:          Hard | Hard | Easy | Easy
Easy to implement?         Easy | Needs understanding of low-level GPU architecture | Easy | Easy
Example Problem
• Tabulating word counts in a corpus of documents
• Similar to the table function in R
Word Count Through Map-Reduce

Input 1: "Hello World Bye World" → Mapper 1 → <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Input 2: "Hello Hadoop Goodbye Hadoop" → Mapper 2 → <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

Reducer 1 (words from A-G) → <Bye, 1> <Goodbye, 1>
Reducer 2 (words from H-Z) → <Hello, 2> <World, 2> <Hadoop, 2>
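The flow above can be simulated locally in a few lines. The sketch below is plain Python, not Hadoop code: it maps the two documents to <word, 1> pairs, shuffles the pairs to two "reducers" by key range (A-G vs. H-Z, as on the slide), and sums per key.

```python
from collections import defaultdict

def mapper(document):
    """Emit a <word, 1> pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs, num_reducers=2):
    """Route each key to a reducer; here by first letter (A-G -> 0, H-Z -> 1)."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        r = 0 if key[0].upper() <= 'G' else 1
        buckets[r][key].append(value)
    return buckets

def reducer(key, values):
    """Sum the counts for one key."""
    return (key, sum(values))

documents = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
mapped = [pair for doc in documents for pair in mapper(doc)]
results = {}
for bucket in shuffle(mapped):
    for key in sorted(bucket):        # each reducer sorts its data by key
        k, v = reducer(key, bucket[key])
        results[k] = v
print(results)  # Bye: 1, Goodbye: 1, Hadoop: 2, Hello: 2, World: 2
```

The mapper and reducer functions are exactly the two pieces a Hadoop user would implement; the shuffle is what the framework does for you.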
Key Ideas about Map-Reduce

Big Data → Partition 1, Partition 2, …, Partition N
→ Mapper 1, Mapper 2, …, Mapper N (each emitting <Key, Value> pairs)
→ Reducer 1, Reducer 2, …, Reducer M
→ Output 1, Output 2, …, Output M
Key Ideas about Map-Reduce
• Data are split into partitions and stored on many different machines on disk (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
• Data with the same key are sent to the same reducer; one reducer can receive multiple keys
• Every reducer sorts its data by key
• For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result
Compute Mean for Each Group

ID | Group No. | Score
1 1 0.5
2 3 1.0
3 1 0.8
4 2 0.7
5 2 1.5
6 3 1.2
7 1 0.8
8 2 0.9
9 4 1.3
… … …
Key Ideas about Map-Reduce
• Data are split into partitions and stored on many different machines on disk (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
  – For each row: Key = Group No., Value = Score
• Data with the same key are sent to the same reducer; one reducer can receive multiple keys
  – E.g., 2 reducers: Reducer 1 receives data with key = 1, 2; Reducer 2 receives data with key = 3, 4
• Every reducer sorts its data by key
  – E.g., Reducer 1: <key = 1, values = [0.5, 0.8, 0.8]>, <key = 2, values = [0.7, 1.5, 0.9]>
• For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result
  – E.g., Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)>
What You Need to Implement: Pseudo Code (in R)

Mapper:
  Input: Data
  for (row in Data) {
    groupNo = row$groupNo
    score = row$score
    Output(c(groupNo, score))
  }

Reducer:
  Input: Key (groupNo), Value (a list of scores that belong to the Key)
  count = 0
  sum = 0
  for (v in Value) {
    sum = sum + v
    count = count + 1
  }
  Output(c(Key, sum / count))
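As a sanity check, the same mapper/reducer pair can be run locally. This Python sketch (an illustration, not Hadoop code) reproduces the group means for the table above; the "shuffle" step stands in for what the framework does between Map and Reduce.

```python
from collections import defaultdict

# The (Group No., Score) columns of the slide's table.
rows = [(1, 0.5), (3, 1.0), (1, 0.8), (2, 0.7), (2, 1.5),
        (3, 1.2), (1, 0.8), (2, 0.9), (4, 1.3)]

def mapper(row):
    group_no, score = row
    return (group_no, score)                    # Key = Group No., Value = Score

def reducer(key, values):
    return (key, sum(values) / len(values))     # mean score for the group

# Shuffle: group all values by key, as the framework would.
grouped = defaultdict(list)
for key, value in map(mapper, rows):
    grouped[key].append(value)

means = dict(reducer(k, grouped[k]) for k in sorted(grouped))
print(means)  # group 1 -> mean(0.5, 0.8, 0.8), group 2 -> mean(0.7, 1.5, 0.9), ...
```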
Exercise 1
• Problem: Average height per {Grade, Gender}?
• What should be the mapper output key?
• What should be the mapper output value?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this
Student ID Grade Gender Height (cm)
1 3 M 120
2 2 F 115
3 2 M 116
… … …
• Problem: Average height per {Grade, Gender}?
• What should be the mapper output key?
  – {Grade, Gender}
• What should be the mapper output value?
  – Height
• What is the reducer input?
  – Key: {Grade, Gender}, Value: list of heights
• What is the reducer output?
  – {Grade, Gender, mean(Heights)}
Student ID Grade Gender Height (cm)
1 3 M 120
2 2 F 115
3 2 M 116
… … …
Exercise 2
• Problem: Number of students per {Grade, Gender}?
• What should be the mapper output key?
• What should be the mapper output value?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this
Student ID Grade Gender Height (cm)
1 3 M 120
2 2 F 115
3 2 M 116
… … …
• Problem: Number of students per {Grade, Gender}?
• What should be the mapper output key?
  – {Grade, Gender}
• What should be the mapper output value?
  – 1
• What is the reducer input?
  – Key: {Grade, Gender}, Value: list of 1's
• What is the reducer output?
  – {Grade, Gender, sum(value list)}
  – OR: {Grade, Gender, length(value list)}
Student ID Grade Gender Height (cm)
1 3 M 120
2 2 F 115
3 2 M 116
… … …
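Both exercises follow the same pattern as the group-mean example. The sketch below (plain Python, illustrative only; the student rows beyond the slide's first three are made up) uses the composite key {Grade, Gender} and computes the mean height (Exercise 1) and the count (Exercise 2) per group.

```python
from collections import defaultdict

# (Student ID, Grade, Gender, Height) rows; the last two are hypothetical.
students = [(1, 3, 'M', 120), (2, 2, 'F', 115), (3, 2, 'M', 116),
            (4, 3, 'M', 124), (5, 2, 'F', 113)]

# Map + shuffle: Key = (Grade, Gender), Value = Height.
grouped = defaultdict(list)
for _sid, grade, gender, height in students:
    grouped[(grade, gender)].append(height)

avg_height = {k: sum(v) / len(v) for k, v in grouped.items()}   # Exercise 1
counts = {k: len(v) for k, v in grouped.items()}                # Exercise 2

print(avg_height[(2, 'F')])  # mean of 115 and 113
print(counts[(3, 'M')])      # number of grade-3 males
```

Note that for Exercise 2 the reducer never needs the heights at all; emitting the constant 1 as the value is enough.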
More on Map-Reduce
• Depends on distributed file systems
• Typically mappers run on the data storage nodes
• Map/Reduce tasks automatically get restarted when they fail (good fault tolerance)
• Map and Reduce I/O are all on disk
  – Data transmission from mappers to reducers is through disk copy
• Iterative processing through Map-Reduce
  – Each iteration becomes a Map-Reduce job
  – Can be expensive since Map-Reduce overhead is high
The Apache Hadoop System
• Open-source software for reliable, scalable, distributed computing
• The most popular distributed computing system in the world
• Key modules:
  – Hadoop Distributed File System (HDFS)
  – Hadoop YARN (job scheduling and cluster resource management)
  – Hadoop MapReduce
Major Tools on Hadoop
• Pig
  – A high-level language for Map-Reduce computation
• Hive
  – A SQL-like query language for data querying via Map-Reduce
• HBase
  – A distributed & scalable database on Hadoop
  – Allows random, real-time read/write access to big data
  – Voldemort is similar to HBase
• Mahout
  – A scalable machine learning library
• …
Hadoop Installation
• Setting up Hadoop on your desktop/laptop:
  – http://hadoop.apache.org/docs/stable/single_node_setup.html
• Setting up Hadoop on a cluster of machines:
  – http://hadoop.apache.org/docs/stable/cluster_setup.html
Hadoop Distributed File System (HDFS)
• Master/slave architecture
• NameNode: a single master node that controls which data block is stored where
• DataNodes: slave nodes that store data and do R/W operations
• Clients (Gateway): allow users to log in, interact with HDFS, and submit Map-Reduce jobs
• Big data is split into equal-sized blocks; each block can be stored on different DataNodes
• Disk failure tolerance: data are replicated multiple times
Load the Data into Pig
• A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID: int, groupNo: int, score: float);
  – The path of the data on HDFS comes after LOAD
• USING PigStorage() means the data are delimited by tabs (can be omitted)
• If data are delimited by other characters, e.g. space, use USING PigStorage(' ')
• The data schema is defined after AS
• Variable types: int, long, float, double, chararray, …
Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop System
  – Overview of Distributed Computing
  – Introduction to Map-Reduce
  – Introduction to the Hadoop System
  – Examples of Statistical Computing for Big Data
    • Bag of Little Bootstraps
    • Large-Scale Logistic Regression
Bag of Little Bootstraps
Kleiner et al. 2012

Bootstrap (Efron, 1979)
• A re-sampling based method to obtain the statistical distribution of sample estimators
• Why are we interested?
  – Re-sampling is embarrassingly parallelizable
• For example: standard deviation of the mean of N samples (μ)
  – For i = 1 to r do
    • Randomly sample with replacement N times from the original sample → bootstrap data i
    • Compute the mean of the i-th bootstrap data → μi
  – Estimate of Sd(μ) = Sd([μ1, …, μr])
  – r is usually a large number, e.g. 200
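A minimal sketch of this procedure (NumPy; the sample data and the choice r = 200 are illustrative, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)   # the original sample, N = 10,000
N, r = len(x), 200

# r bootstrap replicates of the mean: sample N times with replacement each round.
boot_means = np.array([rng.choice(x, size=N, replace=True).mean()
                       for _ in range(r)])

sd_mu = boot_means.std(ddof=1)          # bootstrap estimate of Sd(mean)
analytic = x.std(ddof=1) / np.sqrt(N)   # classical estimate, for comparison
print(sd_mu, analytic)                  # the two should roughly agree
```

Each replicate is independent of the others, which is why the loop parallelizes trivially, e.g. one replicate per node.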
Bootstrap for Big Data
• Can have r nodes running in parallel, each sampling one bootstrap data set
• However…
  – N can be very large
  – Data may not fit into memory
  – Collecting N samples with replacement on each node can be computationally expensive
M out of N Bootstrap (Bickel et al. 1997)
• Obtain SdM(μ) by drawing M samples with replacement for each bootstrap, where M < N
• Apply an analytical correction to SdM(μ) to obtain Sd(μ), using prior knowledge of the convergence rate of sample estimates
• However…
  – Prior knowledge is required
  – The choice of M is critical to performance
  – Finding the optimal value of M needs more computation
Bag of Little Bootstraps (BLB)
• Example: standard deviation of the mean
• Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)
• For each data set p = 1 to S do
  – For i = 1 to r do
    • Draw N samples with replacement from the data of size b
    • Compute the mean of the resampled data → μpi
  – Compute Sdp(μ) = Sd([μp1, …, μpr])
• Estimate of Sd(μ) = Avg([Sd1(μ), …, SdS(μ)])
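The procedure above can be sketched as follows (NumPy; the data, b = N^0.7, and the small S and r are illustrative choices). Since a resample of size N drawn from b distinct points is just a vector of multinomial counts, each replicate works with length-b weights rather than N points, which is exactly the efficiency argument made on the "Why is BLB Efficient" slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=20_000)   # full data, size N
N = len(x)
b = int(N ** 0.7)        # subset size b = N^gamma, here gamma = 0.7
S, r = 5, 100            # number of subsets, bootstraps per subset

subset_sds = []
for _ in range(S):
    sub = rng.choice(x, size=b, replace=False)     # one size-b subset
    means = np.empty(r)
    for i in range(r):
        # Resampling N points from b values == drawing multinomial counts.
        counts = rng.multinomial(N, np.full(b, 1.0 / b))
        means[i] = (counts * sub).sum() / N        # weighted mean, O(b) work
    subset_sds.append(means.std(ddof=1))

sd_mu = np.mean(subset_sds)   # average the S per-subset estimates
print(sd_mu)                  # should be close to sigma / sqrt(N)
```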
Bag of Little Bootstraps (BLB)
• Interest: ξ(θ), where θ is an estimate obtained from data of size N
  – ξ is some function of θ, such as the standard deviation, …
• Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)
• For each data set p = 1 to S do
  – For i = 1 to r do
    • Draw N samples with replacement from the data of size b
    • Compute the estimate θpi from the resampled data
  – Compute ξp(θ) = ξ([θp1, …, θpr])
• Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])
(Slide diagram: the BLB steps mapped onto Hadoop roles as Mapper, Reducer, and Gateway)
Why is BLB Efficient?
• Before:
  – N samples with replacement from data of size N is expensive when N is large
• Now:
  – N samples with replacement from data of size b
  – b can be several orders of magnitude smaller than N (e.g. b = N^γ, γ ∈ [0.5, 1))
  – Equivalent to a multinomial sampler with dimension b
  – Storage = O(b), computational complexity = O(b)
Simulation Experiment
• 95% CI of logistic regression coefficients
• N = 20,000, 10 explanatory variables
• Relative error = |estimated CI width − true CI width| / true CI width
• BLB-γ: BLB with b = N^γ
• BOFN-γ: b out of N sampling with b = N^γ
• BOOT: naïve bootstrap
Real Data
• 95% CI of logistic regression coefficients
• N = 6M, 3000 explanatory variables
• Data size = 150GB, r = 50, s = 5, γ = 0.7
Summary of BLB
• A new algorithm for bootstrapping on big data
• Advantages
  – Fast and efficient
  – Easy to parallelize
  – Easy to understand and implement
  – Friendly to Hadoop; makes it routine to perform statistical calculations on big data
Large Scale Logistic Regression

Logistic Regression
• Binary response: Y
• Covariates: X
• Yi ~ Bernoulli(pi)
• log(pi / (1 − pi)) = Xi^T β;  β ~ MVN(0, (1/λ) I)
• Widely used (research and applications)
Large Scale Logistic Regression
• Binary response: Y
  – E.g., click / non-click on an ad on a webpage
• Covariates: X
  – User covariates: age, gender, industry, education, job, job title, …
  – Item covariates: categories, keywords, topics, …
  – Context covariates: time, page type, position, …
  – 2-way interactions: user covariates × item covariates, context covariates × item covariates, …
Computational Challenge
• Hundreds of millions/billions of observations
• Hundreds of thousands/millions of covariates
• Fitting such a logistic regression model on a single machine is not feasible
• Model fitting is iterative, using methods like gradient descent, Newton's method, etc.
  – Multiple passes over the data
Recap on Optimization Methods
• Problem: find x to minimize F(x)
• Iteration n: xn = xn−1 − bn−1 F′(xn−1)
• bn−1 is the step size, which can change every iteration
• Iterate until convergence
• Conjugate gradient, L-BFGS, Newton trust region, … are all of this kind
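The iteration above, sketched for a toy one-dimensional problem (illustrative only; F(x) = (x − 3)^2 and the fixed step size are arbitrary choices):

```python
def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Iterate x_n = x_{n-1} - b * F'(x_{n-1}) until the update is tiny."""
    x = x0
    for _ in range(max_iter):
        x_new = x - step * grad(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Minimize F(x) = (x - 3)^2, whose gradient is F'(x) = 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # converges to 3
```

On big data the expensive part is evaluating F′, which requires a full pass over the data at every iteration; this is what makes the Hadoop cost model discussed next so important.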
Iterative Process with Hadoop

Disk → Mappers → Disk → Reducers → Disk → Mappers → Disk → Reducers → …
Limitations of Hadoop for Fitting a Big Logistic Regression
• The iterative process is expensive and slow
• Every iteration = a Map-Reduce job
• I/O of mappers and reducers is all through disk
• Plus: time spent waiting in the queue
• Q: Can we find a fitting method that scales with Hadoop?
Large Scale Logistic Regression
• Naïve:
  – Partition the data and run logistic regression for each partition
  – Take the mean of the learned coefficients
  – Problem: not guaranteed to converge to the model from a single machine!
• Alternating Direction Method of Multipliers (ADMM)
  – Boyd et al. 2011
  – Set up constraints: each partition's coefficient = global consensus
  – Solve the optimization problem using Lagrange multipliers
  – Advantage: guaranteed to converge to a single-machine logistic regression on the entire data within a reasonable number of iterations
Large Scale Logistic Regression via ADMM

BIG DATA → Partition 1, Partition 2, Partition 3, …, Partition K
→ Logistic regression fit on each partition (in parallel)
→ Consensus computation
(Iteration 1)
(Iterations 2, 3, …: the same per-partition logistic regressions and consensus computation repeat until convergence.)
Details of ADMM

Dual Ascent Method
• Consider a convex optimization problem
• Lagrangian for the problem
• Dual ascent
2 Precursors

In this section, we briefly review two optimization algorithms that are precursors to the alternating direction method of multipliers. While we will not use this material in the sequel, it provides some useful background and motivation.

2.1 Dual Ascent

Consider the equality-constrained convex optimization problem

minimize f(x)
subject to Ax = b,    (2.1)

with variable x ∈ R^n, where A ∈ R^{m×n} and f: R^n → R is convex. The Lagrangian for problem (2.1) is

L(x, y) = f(x) + y^T (Ax − b)

and the dual function is

g(y) = inf_x L(x, y) = −f*(−A^T y) − b^T y,

where y is the dual variable or Lagrange multiplier, and f* is the convex conjugate of f; see [20, §3.3] or [140, §12] for background. The dual
problem is

maximize g(y),

with variable y ∈ R^m. Assuming that strong duality holds, the optimal values of the primal and dual problems are the same. We can recover a primal optimal point x* from a dual optimal point y* as

x* = argmin_x L(x, y*),

provided there is only one minimizer of L(x, y*). (This is the case if, e.g., f is strictly convex.) In the sequel, we will use the notation argmin_x F(x) to denote any minimizer of F, even when F does not have a unique minimizer.

In the dual ascent method, we solve the dual problem using gradient ascent. Assuming that g is differentiable, the gradient ∇g(y) can be evaluated as follows. We first find x+ = argmin_x L(x, y); then we have ∇g(y) = Ax+ − b, which is the residual for the equality constraint. The dual ascent method consists of iterating the updates

x^{k+1} := argmin_x L(x, y^k)    (2.2)
y^{k+1} := y^k + α^k (Ax^{k+1} − b),    (2.3)

where α^k > 0 is a step size, and the superscript is the iteration counter. The first step (2.2) is an x-minimization step, and the second step (2.3) is a dual variable update. The dual variable y can be interpreted as a vector of prices, and the y-update is then called a price update or price adjustment step. This algorithm is called dual ascent since, with appropriate choice of α^k, the dual function increases in each step, i.e., g(y^{k+1}) > g(y^k).

The dual ascent method can be used even in some cases when g is not differentiable. In this case, the residual Ax^{k+1} − b is not the gradient of g, but the negative of a subgradient of −g. This case requires a different choice of α^k than when g is differentiable, and convergence is not monotone; it is often the case that g(y^{k+1}) ≯ g(y^k). In this case, the algorithm is usually called the dual subgradient method [152].

If α^k is chosen appropriately and several other assumptions hold, then x^k converges to an optimal point and y^k converges to an optimal
Augmented Lagrangians
• Bring robustness to the dual ascent method
• Yield convergence without assumptions like strict convexity or finiteness of f
• The value of ρ influences the convergence rate
collected (gathered) in order to compute the residual Ax^{k+1} − b. Once the (global) dual variable y^{k+1} is computed, it must be distributed (broadcast) to the processors that carry out the N individual x_i minimization steps (2.4).

Dual decomposition is an old idea in optimization, and traces back at least to the early 1960s. Related ideas appear in well known work by Dantzig and Wolfe [44] and Benders [13] on large-scale linear programming, as well as in Dantzig's seminal book [43]. The general idea of dual decomposition appears to be originally due to Everett [69], and is explored in many early references [107, 84, 117, 14]. The use of nondifferentiable optimization, such as the subgradient method, to solve the dual problem is discussed by Shor [152]. Good references on dual methods and decomposition include the book by Bertsekas [16, chapter 6] and the survey by Nedic and Ozdaglar [131] on distributed optimization, which discusses dual decomposition methods and consensus problems. A number of papers also discuss variants on standard dual decomposition, such as [129].

More generally, decentralized optimization has been an active topic of research since the 1980s. For instance, Tsitsiklis and his co-authors worked on a number of decentralized detection and consensus problems involving the minimization of a smooth function f known to multiple agents [160, 161, 17]. Some good reference books on parallel optimization include those by Bertsekas and Tsitsiklis [17] and Censor and Zenios [31]. There has also been some recent work on problems where each agent has its own convex, potentially nondifferentiable, objective function [130]. See [54] for a recent discussion of distributed methods for graph-structured optimization problems.
2.3 Augmented Lagrangians and the Method of Multipliers

Augmented Lagrangian methods were developed in part to bring robustness to the dual ascent method, and in particular, to yield convergence without assumptions like strict convexity or finiteness of f. The augmented Lagrangian for (2.1) is

L_ρ(x, y) = f(x) + y^T (Ax − b) + (ρ/2) ||Ax − b||₂²,    (2.6)

where ρ > 0 is called the penalty parameter. (Note that L_0 is the standard Lagrangian for the problem.) The augmented Lagrangian can be viewed as the (unaugmented) Lagrangian associated with the problem

minimize f(x) + (ρ/2) ||Ax − b||₂²
subject to Ax = b.

This problem is clearly equivalent to the original problem (2.1), since for any feasible x the term added to the objective is zero. The associated dual function is g_ρ(y) = inf_x L_ρ(x, y).

The benefit of including the penalty term is that g_ρ can be shown to be differentiable under rather mild conditions on the original problem. The gradient of the augmented dual function is found the same way as with the ordinary Lagrangian, i.e., by minimizing over x, and then evaluating the resulting equality constraint residual. Applying dual ascent to the modified problem yields the algorithm

x^{k+1} := argmin_x L_ρ(x, y^k)    (2.7)
y^{k+1} := y^k + ρ (Ax^{k+1} − b),    (2.8)

which is known as the method of multipliers for solving (2.1). This is the same as standard dual ascent, except that the x-minimization step uses the augmented Lagrangian, and the penalty parameter ρ is used as the step size α^k. The method of multipliers converges under far more general conditions than dual ascent, including cases when f takes on the value +∞ or is not strictly convex.

It is easy to motivate the choice of the particular step size ρ in the dual update (2.8). For simplicity, we assume here that f is differentiable, though this is not required for the algorithm to work. The optimality conditions for (2.1) are primal and dual feasibility, i.e.,

Ax* − b = 0,    ∇f(x*) + A^T y* = 0,

respectively. By definition, x^{k+1} minimizes L_ρ(x, y^k), so

0 = ∇_x L_ρ(x^{k+1}, y^k)
  = ∇f(x^{k+1}) + A^T (y^k + ρ (Ax^{k+1} − b))
  = ∇f(x^{k+1}) + A^T y^{k+1}.
Alternating Direction Method of Multipliers (ADMM)
• Problem
• Augmented Lagrangian
• ADMM updates
3 Alternating Direction Method of Multipliers

3.1 Algorithm

ADMM is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers. The algorithm solves problems in the form

minimize f(x) + g(z)
subject to Ax + Bz = c    (3.1)

with variables x ∈ R^n and z ∈ R^m, where A ∈ R^{p×n}, B ∈ R^{p×m}, and c ∈ R^p. We will assume that f and g are convex; more specific assumptions will be discussed in §3.2. The only difference from the general linear equality-constrained problem (2.1) is that the variable, called x there, has been split into two parts, called x and z here, with the objective function separable across this splitting. The optimal value of the problem (3.1) will be denoted by

p* = inf{f(x) + g(z) | Ax + Bz = c}.

As in the method of multipliers, we form the augmented Lagrangian

L_ρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz − c) + (ρ/2) ||Ax + Bz − c||₂².
ADMM consists of the iterations

x^{k+1} := argmin_x L_ρ(x, z^k, y^k)    (3.2)
z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k)    (3.3)
y^{k+1} := y^k + ρ (Ax^{k+1} + Bz^{k+1} − c),    (3.4)

where ρ > 0. The algorithm is very similar to dual ascent and the method of multipliers: it consists of an x-minimization step (3.2), a z-minimization step (3.3), and a dual variable update (3.4). As in the method of multipliers, the dual variable update uses a step size equal to the augmented Lagrangian parameter ρ.

The method of multipliers for (3.1) has the form

(x^{k+1}, z^{k+1}) := argmin_{x,z} L_ρ(x, z, y^k)
y^{k+1} := y^k + ρ (Ax^{k+1} + Bz^{k+1} − c).

Here the augmented Lagrangian is minimized jointly with respect to the two primal variables. In ADMM, on the other hand, x and z are updated in an alternating or sequential fashion, which accounts for the term alternating direction. ADMM can be viewed as a version of the method of multipliers where a single Gauss-Seidel pass [90, §10.1] over x and z is used instead of the usual joint minimization. Separating the minimization over x and z into two steps is precisely what allows for decomposition when f or g are separable.

The algorithm state in ADMM consists of z^k and y^k. In other words, (z^{k+1}, y^{k+1}) is a function of (z^k, y^k). The variable x^k is not part of the state; it is an intermediate result computed from the previous state (z^{k−1}, y^{k−1}).

If we switch (re-label) x and z, f and g, and A and B in the problem (3.1), we obtain a variation on ADMM with the order of the x-update step (3.2) and z-update step (3.3) reversed. The roles of x and z are almost symmetric, but not quite, since the dual update is done after the z-update but before the x-update.
Large Scale Logistic Regression via ADMM
• Notation
  – (Xi, yi): data in the i-th partition
  – βi: coefficient vector for partition i
  – β: consensus coefficient vector
  – r(β): penalty component such as ||β||₂²
• Optimization problem
min  Σ_{i=1}^{N} l_i(y_i, X_i^T β_i) + r(β)
subject to β_i = β,  i = 1, …, N
ADMM Updates
• Local regressions (one per partition), with shrinkage towards the current best global estimate
• Updated consensus
An Example Implementation
• ADMM for logistic regression model fitting with an L2/L1 penalty
• Each iteration of ADMM is a Map-Reduce job
  – Mapper: partition the data into K partitions
  – Reducer: for each partition, use liblinear/glmnet to fit an L1/L2 logistic regression
  – Gateway: computes the consensus from the results of all reducers, and sends the consensus back to each reducer node
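The scheme above can be sketched in miniature as consensus ADMM for L2-regularized logistic regression (NumPy; illustrative only: synthetic data, plain gradient descent standing in for liblinear/glmnet, and made-up values of λ, ρ, K, and the iteration counts):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data standing in for a partitionable data set (hypothetical).
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(2000, 3))
y = (rng.random(2000) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

lam, rho = 1.0, 1.0            # L2 penalty weight and ADMM penalty parameter
K = 4                          # number of partitions
parts = np.array_split(np.arange(len(y)), K)

def local_fit(Xp, yp, v, n_steps=200, lr=0.1):
    """Reducer step: minimize mean logistic loss + (rho/2)||b - v||^2 by gradient descent."""
    b = v.copy()
    for _ in range(n_steps):
        p = 1 / (1 + np.exp(-Xp @ b))
        grad = Xp.T @ (p - yp) / len(yp) + rho * (b - v)
        b = b - lr * grad
    return b

beta = np.zeros(3)             # consensus coefficients
u = np.zeros((K, 3))           # scaled dual variables, one per partition

for _ in range(30):            # each pass would be one Map-Reduce job
    # Local logistic regressions, shrunk towards the current consensus.
    B = np.array([local_fit(X[idx], y[idx], beta - u[i])
                  for i, idx in enumerate(parts)])
    # Consensus step (gateway): closed form when r(beta) = (lam/2)||beta||^2.
    beta = rho * (B + u).sum(axis=0) / (lam + K * rho)
    u = u + (B - beta)         # dual update
print(beta)  # local fits and the consensus should have (nearly) agreed by now
```

The local fits and the consensus step are embarrassingly parallel, which is what makes each ADMM iteration fit naturally into a single Map-Reduce job.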
KDD Cup 2010 Data
• Bridge to Algebra 2008-2009 data from https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
• Binary response, 20M covariates
• Only keeping covariates with >= 10 occurrences => 2.2M covariates
• Training data: 8,407,752 samples
• Test data: 510,302 samples
Avg Training Log-Likelihood vs. Number of Iterations (figure)
Test AUC vs. Number of Iterations (figure)

Better Convergence Can Be Achieved By
• Better initialization
  – Use results from the naïve method to initialize the parameters
• Adaptively changing the step size (ρ) each iteration based on the convergence status of the consensus
Recommender Problems for Web Applications
Agenda
• Topic of interest
  – Recommender problems for dynamic, time-sensitive applications
    • Content optimization, online advertising, movie recommendation, shopping, …
• Introduction
• Offline components
  – Regression, collaborative filtering (CF), …
• Online components + initialization
  – Time-series, online/incremental methods, explore/exploit (bandit)
• Evaluation methods + multi-objective
• Challenges
Three Components We Will Focus On
• Defining the problem
  – Formulate objectives whose optimization achieves some long-term goals for the recommender system
    • E.g., how to serve content to optimize audience reach and engagement, or to optimize some combination of engagement and revenue?
• Modeling (to estimate some critical inputs)
  – Predict rates of some positive user interaction(s) with items, based on data obtained from historical user-item interactions
    • E.g., click rates, average time spent on page, etc.
    • Could be explicit feedback like ratings
• Experimentation
  – Create experiments to collect data proactively to improve models; helps in converging to the best choice(s) cheaply and rapidly
    • Explore and exploit (continuous experimentation)
    • DOE (testing hypotheses by avoiding bias inherent in data)
Modern Recommendation Systems
• Goal
  – Serve the right item to a user in a given context to optimize long-term business objectives
• A scientific discipline that involves
  – Large scale machine learning & statistics
    • Offline models (capture global & stable characteristics)
    • Online models (incorporate dynamic components)
    • Explore/exploit (active and adaptive experimentation)
  – Multi-objective optimization
    • Click-rates (CTR), engagement, advertising revenue, diversity, etc.
  – Inferring user interest
    • Constructing user profiles
  – Natural language processing to understand content
    • Topics, "aboutness", entities, follow-up of something, breaking news, …
Some examples from content optimization
• Simple version – I have a content module on my page, content inventory is
obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to improve overall click-rate (CTR) on this module
• More advanced – I got X% lift in CTR. But I have additional information on
other downstream utilities (e.g. advertising revenue). Can I increase downstream utility without losing too many clicks?
• Highly advanced – There are multiple modules running on my webpage. How
do I perform a simultaneous optimization?
Recommend applications
Recommend search queries
Recommend news article
Recommend packages: Image Title, summary Links to other pages
Pick 4 out of a pool of K (K = 20 ~ 40); pool is dynamic; routes traffic to other pages
Problems in this example • Optimize CTR on multiple modules
– Today Module, Trending Now, Personal Assistant, News
– Simple solution: Treat modules as independent, optimize separately. May not be the best when there are strong correlations.
• For any single module – Optimize some combination of CTR, downstream
engagement, and perhaps advertising revenue.
Online Advertising
[Diagram] Advertisers → Ad Network → Ads; the publisher's page recommends the best ad(s) to the user; response rates (click, conversion, ad-view) and bids feed an auction that selects argmax f(bid, response rates) using an ML/statistical model
Examples: Yahoo, Google, MSN, …
Ad exchanges (RightMedia, DoubleClick, …)
LinkedIn Today: Content Module
Objective: Serve content to maximize engagement metrics like CTR (or weighted CTR)
LinkedIn Ads: Match ads to users visiting LinkedIn
Right Media Ad Exchange: Unified Marketplace
Match ads to page views on publisher sites
[Auction diagram] The publisher has an ad impression to sell via auction. Bids: $0.50; $0.75 via a network, which becomes a $0.45 bid; $0.60 (bidders include AdSense and Ad.com); the $0.65 bid WINS
Recommender problems in general
USER
Item Inventory: articles, web pages, ads, …
Use an automated algorithm to select item(s) to show
Get feedback (click, time spent, …)
Refine the models
Repeat (large number of times); optimize metric(s) of interest (total clicks, total revenue, …)
Example applications Search: Web, Vertical Online Advertising Content …..
Context query, page, …
• Items: Articles, ads, modules, movies, users, updates, etc.
• Context: query keywords, pages, mobile, social media, etc.
• Metric to optimize (e.g., relevance score, CTR, revenue, engagement) – Currently, most applications are single-objective – Could be multi-objective optimization (maximize X subject to Y, Z,..)
• Properties of the item pool – Size (e.g., all web pages vs. 40 stories) – Quality of the pool (e.g., anything vs. editorially selected) – Lifetime (e.g., mostly old items vs. mostly new items)
Important Factors
Factors affecting Solution (continued)
• Properties of the context – Pull: Specified by explicit, user-driven query (e.g., keywords, a form) – Push: Specified by implicit context (e.g., a page, a user, a session)
• Most applications are somewhere on continuum of pull and push
• Properties of the feedback on the matches made
– Types and semantics of feedback (e.g., click, vote) – Latency (e.g., available in 5 minutes vs. 1 day) – Volume (e.g., 100K per day vs. 300M per day)
• Constraints specifying legitimate matches – e.g., business rules, diversity rules, editorial voice – Multiple objectives
• Available Metadata (e.g., link graph, various user/item attributes)
Predicting User-Item Interactions (e.g. CTR)
• Myth: We have so much data on the web, if we can only process it the problem is solved – Number of things to learn increases with sample size
• Rate of increase is not slow – Dynamic nature of systems make things worse – We want to learn things quickly and react fast
• Data is sparse in web recommender problems – We lack enough data to learn all we want to learn and
as quickly as we would like to learn – Several Power laws interacting with each other
• E.g. User visits power law, items served power law – Bivariate Zipf: Owen & Dyer, 2011
Can Machine Learning help? • Fortunately, there are group behaviors that generalize to
individuals & they are relatively stable – E.g. Users in San Francisco tend to read more baseball news
• Key issue: Estimating such groups – Coarse group : more stable but does not generalize that well. – Granular group: less stable with few individuals – Getting a good grouping structure is to hit the “sweet spot”
• Another big advantage on the web – Intervene and run small experiments on a small population to
collect data that helps rapid convergence to the best choices(s) • We don’t need to learn all user-item interactions, only those that are good.
Predicting user-item interaction rates
Offline (captures stable characteristics at coarse resolutions) (logistic, boosting, …)
Feature construction: Content: IR, clustering, taxonomy, entity, …
User profiles: clicks, views, social, community, …
Near Online (finer-resolution corrections) (item, user level) (quick updates)
Explore/Exploit (adaptive sampling)
(helps rapid convergence to best choices)
Initialize
Post-click: An example in Content Optimization
Recommender EDITORIAL
content Clicks on FP links influence downstream supply distribution
AD SERVER DISPLAY ADVERTISING Revenue
Downstream engagement (Time spent)
Serving Content on Front Page: Click Shaping
• What do we want to optimize? • Current: Maximize clicks (maximize downstream supply from FP) • But consider the following
– Article 1: CTR=5%, utility per click = 5 – Article 2: CTR=4.9%, utility per click=10
• By promoting 2, we lose 1 click/100 visits, gain 5 utils • If we do this for a large number of visits --- lose some clicks but
obtain significant gains in utility? – E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement,
etc)
High level picture
http request
Statistical Models updated in Batch mode: e.g. once every 30 mins
Server
Item Recommendation system: thousands of computations in sub-seconds
User Interacts e.g. click, does nothing
High level overview: Item Recommendation System
User Info
Item Index Id, meta-data
ML/ Statistical Models
Score Items P(Click), P(share), Semantic-relevance score,….
Rank Items: sort by score (CTR,bid*CTR,..) combine scores using Multi-obj optim, Threshold on some scores,….
User-item interaction Data: batch process
Updated in batch: Activity, profile
Pre-filter: SPAM, editorial, … Feature extraction: NLP, clustering, …
ML/Statistical models for scoring
[Diagram] Number of items scored by ML (~100 to ~100M) vs. traffic volume, with item lifetimes from a few hours to several days: LinkedIn Today, Yahoo! Front Page, Right Media Ad Exchange, LinkedIn Ads
Summary of deployments • Yahoo! Front Page Today Module (2008-2011): 300% improvement in click-through rates
– Similar algorithms delivered via a self-serve platform, adopted by several Yahoo! properties (2011): significant improvement in engagement across the Yahoo! network
• Fully deployed on LinkedIn Today Module (2012): Significant improvement in click-through rates (numbers not revealed for reasons of confidentiality)
• Yahoo! RightMedia exchange (2012): Fully deployed algorithms to estimate response rates (CTR, conversion rates). Significant improvement in revenue (numbers not revealed for reasons of confidentiality)
• LinkedIn self-serve ads (2012-2013): Fully deployed • LinkedIn News Feed (2013-2014): Fully deployed • Several others in progress…
Broad Themes • Curse of dimensionality
– Large number of observations (rows), large number of potential features (columns)
– Use domain knowledge and machine learning to reduce the "effective" dimension (constraints on parameters reduce degrees of freedom)
• I will give examples as we move along
• We often assume our job is to analyze "Big Data", but we often have control over what data to collect through clever experimentation – This can fundamentally change solutions
• Think of computation and models together for Big Data • Optimization: What we are trying to optimize is often complex; models must work in harmony with optimization
– Pareto optimality with competing objectives
Statistical Problem • Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest
• Examples of utility functions
– Click-rates (CTR) – Share-rates (CTR × [Share|Click]) – Revenue per page-view = CTR × bid (more complex due to second-price auction)
• CTR is a fundamental measure that opens the door to a more principled approach to rank items
• Converge rapidly to maximum-utility items – Sequential decision-making process (explore/exploit)
[Diagram] User i with user features (e.g., industry, behavioral features, demographic features, …) visits; the algorithm selects item j from a set of candidates; (i, j): response yij (click or not)
Which item should we select?
• The item with the highest predicted CTR (Exploit)
• An item for which we need data to predict its CTR (Explore)
LinkedIn Today, Yahoo! Today Module: Choose Items to maximize CTR This is an “Explore/Exploit” Problem
The Explore/Exploit Problem (to maximize CTR)
• Problem definition: Pick k items from a pool of N for a large number of serves to maximize the number of clicks on the picked items
• Easy!? Pick the items having the highest click-through rates (CTRs)
• But … – The system is highly dynamic:
• Items come and go with short lifetimes • CTR of each item may change over time
– How much traffic should be allocated to explore new items to achieve optimal performance ?
• Too little → Unreliable CTR estimates due to “starvation” • Too much → Little traffic to exploit the high CTR items
Y! Front Page Application
• Simplify: Maximize CTR on first slot (F1)
• Item Pool – Editorially selected for high quality and brand image – Few articles in the pool, but the item pool is dynamic
CTR Curves of Items on LinkedIn Today
Impact of repeat item views on a given user
• Same user is shown an item multiple times (despite not clicking)
Simple algorithm to estimate most popular item with small but dynamic item pool
• Simple Explore/Exploit scheme – ε% explore: with a small probability (e.g. 5%), choose an item at random from the pool
– (100−ε)% exploit: with large probability (e.g. 95%), choose the highest-scoring CTR item
• Temporal Smoothing – Item CTRs change over time; provide more weight to recent data in estimating item CTRs
• Kalman filter, moving average
• Discount item score with repeat views – CTR(item) for a given user drops with repeat views by some "discount" factor (estimated from data)
• Segmented most popular – Perform separate most-popular for each user segment
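The ε-greedy scheme above fits in a few lines of Python. This is an illustrative sketch (not from the slides): Laplace smoothing of the CTR estimate avoids 0/0 for brand-new items, and an exponential-decay update stands in for the temporal smoothing bullet.

```python
import random

def serve(items, eps=0.05):
    """Pick an item id: explore with probability eps, else exploit.

    `items` maps item id -> (clicks, views); CTR estimated as clicks/views.
    """
    if random.random() < eps:
        return random.choice(list(items))  # explore uniformly at random
    # exploit: highest smoothed CTR estimate (add-one smoothing)
    return max(items, key=lambda i: (items[i][0] + 1) / (items[i][1] + 2))

def update(items, item, clicked, decay=0.99):
    """Record feedback, down-weighting old data (temporal smoothing)."""
    c, n = items[item]
    items[item] = (c * decay + clicked, n * decay + 1)
```

Segmented most popular would simply keep one such `items` table per user segment.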
Time-series Model: Kalman filter • Dynamic Gamma-Poisson: click-rate evolves over time in a multiplicative fashion
• Estimated click-rate distribution at time t+1 – Prior mean: µ_{t+1} = µ_{t|t}
– Prior variance: σ²_{t+1} = σ²_{t|t} + η(µ²_{t|t} + σ²_{t|t})
High CTR items are more adaptive
More economical exploration? Better bandit solutions
• Consider a two-armed problem with unknown payoff probabilities p1 and p2, where p1 > p2
The gambler has 1000 plays; what is the best way to experiment (to maximize total expected reward)? This is called the "multi-armed bandit" problem and has been studied for a long time.
Optimal solution: Play the arm that has maximum potential of being good
Optimism in the face of uncertainty
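"Optimism in the face of uncertainty" is what index policies implement. As a sketch, the UCB1 rule (a concrete instance from the bandit literature, not necessarily the slides' own algorithm) scores each arm by its empirical mean plus a confidence bonus that shrinks as the arm accumulates plays, and plays the arm with the highest upper bound:

```python
import math

def ucb1(clicks, views, t):
    """UCB1 index per arm at play t: empirical mean + optimism bonus.

    An under-explored arm gets a large bonus, so it can win the next
    play even when its empirical mean looks worse.
    """
    return [c / n + math.sqrt(2 * math.log(t) / n)
            for c, n in zip(clicks, views)]

# Two-armed example: arm 0 has a lower empirical CTR (0.1 vs 0.3)
# but only 10 plays, so its upper confidence bound is the larger one.
scores = ucb1(clicks=[1, 300], views=[10, 1000], t=1010)
```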
Item Recommendation: Bandits? • Two Items: Item 1 CTR = 2/100; Item 2 CTR = 250/10000
– Greedy: Show Item 2 to all; not a good idea – Item 1 CTR estimate noisy; the item could be potentially better
• Invest in Item 1 for better overall performance on average
– Exploit what is known to be good, explore what is potentially good
[Figure] Probability density of CTR: Item 2 (narrow, well estimated), Item 1 (wide, uncertain)
Next few hours
Columns: Most Popular Recommendation | Personalized Recommendation
Offline Models: | Collaborative filtering (cold-start problem)
Online Models: Time-series models | Incremental CF, online regression
Intelligent Initialization: Prior estimation | Prior estimation, dimension reduction
Explore/Exploit: Multi-armed bandits | Bandits with covariates
Offline Components: Collaborative Filtering in Cold-start
Situations
Problem
[Diagram] User i with user features xi (demographics, browse history, search history, …) visits; the algorithm selects item j with item features xj (keywords, content categories, …); (i, j): response yij (explicit rating, implicit click/no-click)
Predict the unobserved entries based on features and the observed entries
Model Choices • Feature-based (or content-based) approach
– Use features to predict response • (regression, Bayes Net, mixture models, …)
– Limitation: need predictive features • Bias often high, does not capture signals at granular levels
• Collaborative filtering (CF aka Memory based) – Make recommendation based on past user-item interaction
• User-user, item-item, matrix factorization, … • See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08 Tutorial], etc.
– Better performance for old users and old items – Does not naturally handle new users and new items (cold-
start)
Collaborative Filtering (Memory based methods)
• User-user similarity; item-item similarities; incorporating both
• Estimating similarities: Pearson's correlation; optimization-based (Koren et al)
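A minimal sketch of the Pearson-correlation similarity above (illustrative; a memory-based recommender would then predict a user's rating as a similarity-weighted average over that user's nearest neighbors):

```python
import math

def pearson_sim(ra, rb):
    """Pearson correlation over the items rated by both users.

    ra, rb: dicts mapping item id -> rating for two users.
    Returns 0.0 when there is not enough overlap to correlate.
    """
    common = set(ra) & set(rb)
    if len(common) < 2:
        return 0.0
    ma = sum(ra[i] for i in common) / len(common)
    mb = sum(rb[i] for i in common) / len(common)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = math.sqrt(sum((ra[i] - ma) ** 2 for i in common)
                    * sum((rb[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0
```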
How to Deal with the Cold-Start Problem
• Heuristic-based approaches – Linear combination of regression and CF models – Filterbot
• Add user features as pseudo-users and do collaborative filtering - Hybrid approaches
- Use content-based methods to fill up entries, then use CF
• Matrix Factorization – Good performance on Netflix (Koren, 2009)
• Model-based approaches – Bilinear random-effects model (probabilistic matrix factorization)
• Good on Netflix data [Ruslan et al ICML, 2009] – Add feature-based regression to matrix factorization
• (Agarwal and Chen, 2009) – Add topic discovery (from textual items) to matrix factorization
• (Agarwal and Chen, 2009; Chun and Blei, 2011)
Per-item regression models • When tracking users by cookies, the distribution of visit patterns can get extremely skewed – Majority of cookies have 1-2 visits
• Per item models (regression) based on user covariates attractive in such cases
Several per-item regressions: Multi-task learning
Low dimension (5-10),
B estimated from retrospective data
• Agarwal,Chen and Elango, KDD, 2010
Affinity to old items
Per-user, per-item models via bilinear random-effects
model
Motivation • Data measuring k-way interactions pervasive
– Consider k = 2 for all our discussions • E.g. User-Movie, User-content, User-Publisher-Ads,….
– Power law on both user and item degrees
• Classical Techniques – Approximate matrix through a singular value
decomposition (SVD) • After adjusting for marginal effects (user pop, movie pop,..)
– Does not work • Matrix highly incomplete, severe over-fitting
– Key issue • Regularization of eigenvectors (factors) to avoid overfitting
Early work on complete matrices
• Tukey’s 1-df model (1956)
– Rank 1 approximation of small nearly complete matrix
• Criss-cross regression (Gabriel, 1978) • Incomplete matrices: Psychometrics (1-factor
model only; small data sets; 1960s) • Modern day recommender problems
– Highly incomplete, large, noisy.
Latent Factor Models
[Figure] Users and items embedded in a latent space with interpretable directions (e.g. "sporty", "newsy"): user factor u and item factor v give affinity u'v; topic affinity s and topic vector z give affinity s'z
Factorization – Brief Overview
• Latent user factors: (αi, ui = (ui1, …, uin))
• Latent movie factors: (βj, vj = (vj1, …, vjn))
• (Nn + Mm) parameters
• Key technical issue: will overfit for moderate values of n, m → Regularization
• Interaction term: E(yij) = µ + αi + βj + ui' B vj
Latent Factor Models: Different Aspects
• Matrix Factorization – Factors in Euclidean space – Factors on the simplex
• Incorporating features and ratings simultaneously
• Online updates
Maximum Margin Matrix Factorization (MMMF)
• Complete matrix by minimizing loss (hinge,squared-error) on observed entries subject to constraints on trace norm – Srebro, Rennie, Jakkola (NIPS 2004)
– Convex, Semi-definite programming (expensive, not scalable)
• Fast MMMF (Rennie & Srebro, ICML, 2005) – Constrain the Frobenius norm of the left and right eigenvector matrices; not convex, but becomes scalable.
• Other variation: Ensemble MMMF (DeCoste, ICML2005) – Ensembles of partially trained MMMF (some
improvements)
Matrix Factorization for Netflix prize data
• Minimize the objective function
• Simon Funk: Stochastic Gradient Descent
• Koren et al (KDD 2007): Alternate Least Squares – They move to SGD later in the competition
minimize over {ui}, {vj}: Σ_{(i,j) ∈ obs} (rij − ui' vj)² + λ (Σi ||ui||² + Σj ||vj||²)
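Simon Funk-style SGD on this objective can be sketched as follows (an illustrative pure-Python version that omits the bias terms and practical details such as learning-rate schedules):

```python
import random

def sgd_mf(ratings, n_users, n_items, k=2, lam=0.05, lr=0.01, epochs=50):
    """SGD on sum_{(i,j) in obs} (r_ij - u_i'v_j)^2 + lam(||u||^2 + ||v||^2).

    ratings: list of (user_index, item_index, rating) triples.
    Returns the factor matrices u (n_users x k) and v (n_items x k).
    """
    random.seed(0)
    u = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    v = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for i, j, r in ratings:
            err = r - sum(a * b for a, b in zip(u[i], v[j]))
            for f in range(k):
                ui, vj = u[i][f], v[j][f]
                # gradient step on the squared error plus L2 penalty
                u[i][f] += lr * (err * vj - lam * ui)
                v[j][f] += lr * (err * ui - lam * vj)
    return u, v
```

Alternating least squares instead solves each row of u (resp. v) in closed form while holding the other side fixed.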
rij ~ N(ui' vj, σ²), ui ~ MVN(0, au I), vj ~ MVN(0, av I)
Optimization is through Iterated Conditional Modes. Other variations: constraining the mean through a sigmoid, using "who-rated-whom". Combining with Boltzmann Machines also improved performance.
Probabilistic Matrix Factorization (Ruslan & Minh, 2008, NIPS)
Bayesian Probabilistic Matrix Factorization (Ruslan and Minh, ICML 2008)
• Fully Bayesian treatment using an MCMC approach – Significant improvement
• Interpretation as a fully Bayesian hierarchical model shows why that is the case – Failing to incorporate uncertainty leads to bias in
estimates – Multi-modal posterior, MCMC helps in converging to a better one
[Figure] RMSE vs. variance component au
MCEM is also more resistant to over-fitting
Non-parametric Bayesian matrix completion (Zhou et al, SAM, 2010)
• Specify rank probabilistically (automatic rank selection)
zk ~ Ber(πk), πk ~ Beta(a/r, b(r−1)/r)
yij ~ N( Σ_{k=1..r} zk πk uik vjk , σ² )
Marginally, zk ~ Ber(a/(a + b(r−1))); E(#Factors) = ra/(a + b(r−1))
How to incorporate features: Deal with both warm start and cold-start
• Models to predict ratings for new pairs – Warm-start: (user, movie) present in the training data with large
sample size – Cold-start: At least one of (user, movie) new or has small sample
size • Rough definition, warm-start/cold-start is a continuum.
• Challenges – Highly incomplete (user, movie) matrix – Heavy tailed degree distributions for users/movies
• Large fraction of ratings from small fraction of users/movies
– Handling both warm-start and cold-start effectively in the presence of predictive features
Possible approaches • Large scale regression based on covariates
– Does not provide good estimates for heavy users/movies – Large number of predictors to estimate interactions
• Collaborative filtering – Neighborhood based – Factorization
• Good for warm-start; cold-start dealt with separately • Single model that handles cold-start and warm-start
– Heavy users/movies → User/movie specific model – Light users/movies → fallback on regression model – Smooth fallback mechanism for good performance
Add Feature-based Regression into
Matrix Factorization RLFM: Regression-based Latent
Factor Model
Regression-based Factorization Model (RLFM)
• Main idea: Flexible prior, predict factors through regressions
• Seamlessly handles cold-start and warm-start
• Modified state equation to incorporate covariates
RLFM: Model
Rating: yij ~ N(µij, σ²) (Gaussian model); yij ~ Bernoulli(µij) (logistic model, for binary ratings); yij ~ Poisson(µij Nij) (Poisson model, for counts), where user i gives item j
t(µij) = xij' b + αi + βj + ui' vj
Bias of user i: αi = g0' xi + εiα, εiα ~ N(0, σα²)
Popularity of item j: βj = d0' xj + εjβ, εjβ ~ N(0, σβ²)
Factors of user i: ui = G xi + εiu, εiu ~ N(0, σu² I)
Factors of item j: vj = D xj + εjv, εjv ~ N(0, σv² I)
Could use other classes of regression models
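The state equations can be read directly as a prediction rule. A sketch of the mean prediction for a cold-start (user, item) pair, where every factor falls back on its regression mean (the names g0, d0, G, D mirror the regression parameters in the model; for warm users/items the learned residuals would be added to these means):

```python
def rlfm_score(x_ij, x_i, x_j, b, g0, d0, G, D):
    """Cold-start RLFM mean: t(mu_ij) = x_ij'b + alpha_i + beta_j + u_i'v_j,
    with alpha_i = g0'x_i, beta_j = d0'x_j, u_i = G x_i, v_j = D x_j."""
    dot = lambda a, c: sum(p * q for p, q in zip(a, c))
    alpha_i = dot(g0, x_i)                  # + learned residual for warm users
    beta_j = dot(d0, x_j)                   # + learned residual for warm items
    u_i = [dot(row, x_i) for row in G]      # regression prior mean of u_i
    v_j = [dot(row, x_j) for row in D]      # regression prior mean of v_j
    return dot(x_ij, b) + alpha_i + beta_j + dot(u_i, v_j)
```

This makes the "smooth fallback" concrete: with no interaction data the score is purely feature-driven, and residuals move it toward user/item-specific behavior as data accumulates.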
Graphical representation of the model
Advantages of RLFM • Better regularization of factors
– Covariates “shrink” towards a better centroid
• Cold-start: Fallback regression model (FeatureOnly)
RLFM: Illustration of Shrinkage
Plot the first factor value for each user (fitted using Yahoo! FP data)
Model fitting: EM for our class of models
The parameters for RLFM
• Latent parameters
• Hyper-parameters
Δ = ({αi}, {βj}, {ui}, {vj})
Θ = (b, G, D, Au = au I, Av = av I)
The EM algorithm
Computing the E-step
• Often hard to compute in closed form • Stochastic EM (Markov Chain EM; MCEM)
– Compute expectation by drawing samples from
– Effective for multi-modal posteriors but more expensive
• Iterated Conditional Modes algorithm (ICM) – Faster but biased hyper-parameter estimates
Monte Carlo E-step • Through a vanilla Gibbs sampler (conditionals closed form)
• Other conditionals also Gaussian and closed form • Conditionals of users (movies) sampled simultaneously • Small number of samples in early iterations, large numbers in
later iterations
M-step (Why MCEM is better than ICM)
• Update G, optimize
• Update Au=au I
Ignored by ICM, underestimates factor variability Factors over-shrunk, posterior not explored well
Experiment 1: Better regularization
• MovieLens-100K, avg RMSE using pre-specified splits • ZeroMean, RLFM and FeatureOnly (no cold-start
issues) • Covariates:
– Users : age, gender, zipcode (1st digit only) – Movies: genres
Experiment 2: Better handling of Cold-start
• MovieLens-1M; EachMovie • Training-test split based on timestamp • Same covariates as in Experiment 1.
Experiment 4: Predicting click-rate on articles
• Goal: Predict click-rate on articles for a user on F1 position
• Article lifetimes short, dynamic updates important
• User covariates: – Age, Gender, Geo, Browse behavior
• Article covariates – Content Category, keywords
• 2M ratings, 30K users, 4.5 K articles
Results on Y! FP data
Some other related approaches • Stern, Herbrich and Graepel, WWW, 2009
– Similar to RLFM, different parametrization and expectation propagation used to fit the models
• Porteus, Asuncion and Welling, AAAI, 2011 – Non-parametric approach using a Dirichlet process
• Agarwal, Zhang and Mazumdar, Annals of Applied Statistics, 2011 – Regression + random effects per user regularized
through a Graphical Lasso
Add Topic Discovery into Matrix Factorization
fLDA: Matrix Factorization through Latent Dirichlet Allocation
fLDA: Introduction • Model the rating yij that user i gives to item j as the user’s
affinity to the topics that the item has
– Unlike regular unsupervised LDA topic modeling, here the LDA topics are learnt in a supervised manner based on past rating data
– fLDA can be thought of as a “multi-task learning” version of the supervised LDA model [Blei’07] for cold-start recommendation
yij = … + Σk sik zjk, where sik is user i's affinity to topic k
Pr(item j has topic k) estimated by averaging the LDA topic of each word in item j
Old items: zjk’s are Item latent factors learnt from data with the LDA prior New items: zjk’s are predicted based on the bag of words in the items
Φ11, …, Φ1W … Φk1, …, ΦkW … ΦK1, …, ΦKW
Topic 1
Topic k
Topic K
LDA Topic Modeling (1) • LDA is effective for unsupervised topic discovery [Blei’03]
– It models the generating process of a corpus of items (articles) – For each topic k, draw a word distribution Φk = [Φk1, …, ΦkW] ~ Dir(η) – For each item j, draw a topic distribution θj = [θj1, …, θjK] ~ Dir(λ)
– For each word, say the nth word, in item j, • Draw a topic zjn for that word from θj = [θj1, …, θjK] • Draw a word wjn from Φk = [Φk1, …, ΦkW] with topic k = zjn
Item j Topic distribution: [θj1, …, θjK]
Words: wj1, …, wjn, …
Per-word topic: zj1, …, zjn, …
Assume zjn = topic k
Observed
LDA Topic Modeling (2) • Model training:
– Estimate the prior parameters and the posterior topic×word distribution Φ based on a training corpus of items
– EM + Gibbs sampling is a popular method • Inference for new items
– Compute the item topic distribution based on the prior parameters and Φ estimated in the training phase
• Supervised LDA [Blei’07] – Predict a target value for each item based on supervised LDA topics
yj = Σk sk zjk
Target value of item j; zjk = Pr(item j has topic k), estimated by averaging the topic of each word in item j; sk = regression weight for topic k
vs. fLDA: yij = … + Σk sik zjk
One regression per user
Same set of topics across different regressions
fLDA: Model
Rating: yij ~ N(µij, σ²) (Gaussian model); yij ~ Bernoulli(µij) (logistic model, for binary ratings); yij ~ Poisson(µij Nij) (Poisson model, for counts), where user i gives item j
t(µij) = xij' b + αi + βj + Σk sik zjk
Bias of user i: αi = g0' xi + εiα, εiα ~ N(0, σα²)
Popularity of item j: βj = d0' xj + εjβ, εjβ ~ N(0, σβ²)
Topic affinity of user i: si = H xi + εis, εis ~ N(0, σs² I)
Pr(item j has topic k): zjk = Σn 1(zjn = k) / (#words in item j), where zjn is the LDA topic of the nth word in item j
Observed words: wjn ~ LDA(λ, η, zjn), where wjn is the nth word in item j
Model Fitting • Given:
– Features X = {xi, xj, xij} – Observed ratings y = {yij} and words w = {wjn}
• Estimate: – Parameters: Θ = [b, g0, d0, H, σ2, aα, aβ, As, λ, η]
• Regression weights and prior parameters – Latent factors: Δ = {αi, βj, si} and z = {zjn}
• User factors, item factors and per-word topic assignment
• Empirical Bayes approach: – Maximum likelihood estimate of the parameters:
– The posterior distribution of the factors:
Θ̂ = argmaxΘ Pr[y, w | Θ] = argmaxΘ ∫ Pr[y, w, Δ, z | Θ] dΔ dz
Posterior of the factors: Pr[Δ, z | y, Θ̂]
The EM Algorithm • Iterate through the E and M steps until convergence
– Let be the current estimate – E-step: Compute
• The expectation is not in closed form • We draw Gibbs samples and compute the Monte
Carlo mean
– M-step: Find
• It consists of solving a number of regression and optimization problems
fn(Θ) = E_{(Δ, z | y, w, Θ̂(n))} [ log Pr(y, w, Δ, z | Θ) ]
Θ̂(n+1) = argmaxΘ fn(Θ)
where Θ̂(n) is the current estimate
Supervised Topic Assignment
Pr(zjn = k | Rest) ∝ (Zjk^¬jn + λ) · (Zkl^¬jn + η) / (Zk^¬jn + Wη) · Π_{i rated j} f(yij | zjn = k)
The first two factors are the same as in unsupervised LDA; the last factor is the likelihood of the observed ratings by users who rated item j when zjn is set to topic k (the probability of observing yij given the model). zjn is the topic of the nth word in item j.
fLDA: Experimental Results (Movie) • Task: Predict the rating that a user would give a movie • Training/test split:
– Sort observations by time – First 75% → Training data – Last 25% → Test data
• Item warm-start scenario – Only 2% new items in test data
Model         Test RMSE
RLFM          0.9363
fLDA          0.9381
Factor-Only   0.9422
FilterBot     0.9517
unsup-LDA     0.9520
MostPopular   0.9726
Feature-Only  1.0906
Constant      1.1190
fLDA is as strong as the best method; it does not reduce performance in warm-start scenarios
fLDA: Experimental Results (Yahoo! Buzz)
• Task: Predict whether a user would buzz-up an article • Severe item cold-start
– All items are new in test data
Data Statistics 1.2M observations
4K users 10K articles
fLDA significantly outperforms other
models
Experimental Results: Buzzing Topics
Topic: Top Terms (after stemming)
CIA interrogation: bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig, georg, memo, al
Swine flu: mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici, somalia, captain
NFL games: NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman, michael, week
Gay marriage: court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal, ban, legal, allow
Sarah Palin: palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin, sai, gov, alaska
American idol: idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl, michelle_obama
Recession: economi, recess, job, percent, econom, bank, expect, rate, jobless, year, unemploy, month
North Korea issues: north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said, russia
3/4 topics are interpretable; 1/2 are similar to unsupervised topics
fLDA Summary • fLDA is a useful model for cold-start item recommendation • It also provides interpretable recommendations for users
– User’s preference to interpretable LDA topics
• Future directions: – Investigate Gibbs sampling chains and the convergence properties of
the EM algorithm – Apply fLDA to other multi-task prediction problems
• fLDA can be used as a tool to generate supervised features (topics) from text data
Summary • Regularizing factors through covariates effective • Regression based factor model that regularizes better
and deals with both cold-start and warm-start in a single framework in a seamless way looks attractive
• Fitting method scalable; Gibbs sampling for users and
movies can be done in parallel. Regressions in M-step can be done with any off-the-shelf scalable linear regression routine
• Distributed computing on Hadoop: Multiple models and average across partitions (more later)
Online Components: Online Models, Intelligent
Initialization, Explore/Exploit
Why Online Components? • Cold start
– New items or new users come to the system – How to obtain data for new items/users (explore/exploit) – Once data becomes available, how to quickly update the model
• Periodic rebuild (e.g., daily): Expensive • Continuous online update (e.g., every minute): Cheap
• Concept drift – Item popularity, user interest, mood, and user-to-item affinity may
change over time – How to track the most recent behavior
• Down-weight old data – How to model temporal patterns for better prediction
• … may not need to be online if the patterns are stationary
Big Picture
Columns: Most Popular Recommendation | Personalized Recommendation
Offline Models: | Collaborative filtering (cold-start problem)
Online Models (real systems are dynamic): Time-series models | Incremental CF, online regression
Intelligent Initialization (do not start cold): Prior estimation | Prior estimation, dimension reduction
Explore/Exploit (actively acquire data): Multi-armed bandits | Bandits with covariates
Segmented Most Popular Recommendation
Extension:
Online Components for Most Popular Recommendation
Online models, intelligent initialization & explore/exploit
Most popular recommendation: Outline
• Most popular recommendation (no personalization, all users see the same thing) – Time-series models (online models) – Prior estimation (initialization) – Multi-armed bandits (explore/exploit)
– Sometimes hard to beat!!
• Segmented most popular recommendation – Create user segments/clusters based on user
features – Do most popular recommendation for each segment
Most Popular Recommendation • Problem definition: Pick k items (articles) from a
pool of N to maximize the total number of clicks on the picked items
• Easy!? Pick the items having the highest click-through rates (CTRs)
• But … – The system is highly dynamic:
• Items come and go with short lifetimes • CTR of each item changes over time
– How much traffic should be allocated to explore new items to achieve optimal performance
• Too little → Unreliable CTR estimates • Too much → Little traffic to exploit the high CTR items
CTR Curves for Two Days on Yahoo! Front Page
Traffic obtained from a controlled randomized experiment (no confounding). Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories
Each curve is the CTR of an item in the Today Module on www.yahoo.com over time
For Simplicity, Assume … • Pick only one item for each user visit
– Multi-slot optimization later • No user segmentation, no personalization
(discussion later) • The pool of candidate items is predetermined
and is relatively small (≤ 1000) – E.g., selected by human editors or by a first-phase
filtering method – Ideally, there should be a feedback loop – Large item pool problem later
• Effects like user-fatigue, diversity in recommendations, multi-objective optimization not considered (discussion later)
Online Models • How to track the changing CTR of an item • Data: for each item, at time t, we observe
– Number of times the item nt was displayed (i.e., #views) – Number of clicks ct on the item
• Problem Definition: Given c1, n1, …, ct, nt, predict the CTR (click-through rate) pt+1 at time t+1
• Potential solutions: – Observed CTR at t: ct / nt → highly unstable (nt is usually small)
– Cumulative CTR: (∑all i ci) / (∑all i ni) → react to changes very slowly
– Moving window CTR: (∑i∈last K ci) / (∑i∈last K ni) → reasonable • But, no estimation of Var[pt+1] (useful for explore/exploit)
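The moving-window estimator, for instance, is a one-liner (an illustrative sketch; K is the number of most recent time intervals kept):

```python
def moving_window_ctr(history, K):
    """CTR over the last K intervals; history is a list of (clicks, views)."""
    c = sum(ci for ci, _ in history[-K:])
    n = sum(ni for _, ni in history[-K:])
    return c / n if n else 0.0
```

It reacts faster than the cumulative CTR, but as noted it gives no variance estimate for pt+1, which is what explore/exploit needs.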
Online Models: Dynamic Gamma-Poisson
• Model-based approach – (ct | nt, pt) ~ Poisson(nt pt) – pt = pt-1 εt, where εt ~ Gamma(mean=1, var=η)
– Model parameters:
• p1 ~ Gamma(mean=µ0, var=σ0²) is the offline CTR estimate
• η specifies how dynamic/smooth the CTR is over time
– Posterior distribution (pt+1 | c1, n1, …, ct, nt) ~ Gamma(?,?)
• Solve this recursively (online update rule)
Notation: pt = CTR at time t. At each time t, show the item nt times and receive ct clicks.
[Diagram: state-space model — p1 ~ Gamma(mean=µ0, var=σ0²) evolves to p2, … with evolution variance η; observations (n1, c1), (n2, c2), …]
Online Models: Derivation
Let γt = µt / σt² (effective sample size). Prior at time t:
(pt | c1, n1, …, ct-1, nt-1) ~ Gamma(mean = µt, var = σt²)
Posterior after observing (ct, nt):
(pt | c1, n1, …, ct, nt) ~ Gamma(mean = µt|t, var = σt|t²)
γt|t = γt + nt  (effective sample size)
µt|t = (γt · µt + ct) / γt|t
σt|t² = µt|t / γt|t
Prediction for time t+1 (after the evolution step pt+1 = pt εt+1):
(pt+1 | c1, n1, …, ct, nt) ~ Gamma(mean = µt+1, var = σt+1²)
µt+1 = µt|t
σt+1² = σt|t² + η (σt|t² + µt|t²)
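The recursion above fits in a few lines. A minimal sketch (function and variable names are ours, not from any library):

```python
def gamma_poisson_update(mu, var, c, n, eta):
    """One step of the dynamic Gamma-Poisson online update.

    (mu, var) is the Gamma prior on the CTR at time t; we observe
    c clicks out of n views, then apply the evolution step
    p_{t+1} = p_t * eps with Var[eps] = eta.
    """
    gamma = mu / var                        # effective sample size
    gamma_post = gamma + n
    mu_post = (gamma * mu + c) / gamma_post
    var_post = mu_post / gamma_post         # posterior at time t
    # Evolution step inflates the variance; the mean is unchanged
    var_next = var_post + eta * (var_post + mu_post ** 2)
    return mu_post, var_next
```

For example, starting from an offline prior (µ0, σ0²) = (0.05, 0.01), observing 2 clicks in 100 views pulls the mean toward the observed CTR while sharply shrinking the variance.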
Tracking behavior of Gamma-Poisson model
[Figure: estimated CTR distributions at time t and time t+1 for several items]
• High click-rate items → more adaptive tracking
• Low click-rate articles → more temporal smoothing
Intelligent Initialization: Prior Estimation
• Prior CTR distribution: Gamma(mean=µ0, var=σ0²)
– N historical items: • ni = #views of item i in its first time interval • ci = #clicks on item i in its first time interval
– Model • ci ~ Poisson(ni pi) and pi ~ Gamma(µ0, σ0²) ⇒ ci ~ NegBinomial(µ0, σ0², ni)
– Maximum likelihood estimate (MLE) of (µ0, σ0²)
• Better prior: Cluster items and find MLE for each cluster – Agarwal & Chen, 2011 (SIGMOD)
MLE objective (marginal log-likelihood of the Negative-Binomial model):
(µ0, σ0²) = argmax N·(µ0²/σ0²)·log(µ0/σ0²) − N·log Γ(µ0²/σ0²) + ∑i [ log Γ(µ0²/σ0² + ci) − (µ0²/σ0² + ci)·log(µ0/σ0² + ni) ]
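A minimal sketch of this prior estimation, maximizing the Negative-Binomial marginal likelihood over a small grid (`neg_log_lik` and `fit_prior` are our names; terms independent of the parameters are dropped, and a real system would use a proper optimizer instead of a grid):

```python
import math

def neg_log_lik(mu0, var0, clicks, views):
    """Negative marginal log-likelihood of (mu0, var0), dropping terms
    that do not depend on the parameters.  Gamma shape a = mu0^2/var0
    and rate b = mu0/var0; marginalizing p_i gives a Negative Binomial."""
    a = mu0 * mu0 / var0
    b = mu0 / var0
    nll = 0.0
    for c, n in zip(clicks, views):
        nll -= (math.lgamma(a + c) - math.lgamma(a)
                + a * math.log(b) - (a + c) * math.log(b + n))
    return nll

def fit_prior(clicks, views, mu_grid, var_grid):
    """Crude grid-search MLE over candidate (mu0, var0) values."""
    return min(((m, v) for m in mu_grid for v in var_grid),
               key=lambda mv: neg_log_lik(mv[0], mv[1], clicks, views))
```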
Explore/Exploit: Problem Definition
time
Item 1 Item 2 … Item K
x1% page views x2% page views … xK% page views
Determine (x1, x2, …, xK) based on clicks and views observed before t in order to maximize the expected total number of clicks in the future
[Timeline: …, t−2, t−1, t (now) → clicks in the future]
Modeling the Uncertainty, NOT just the Mean
Simplified setting: Two items
[Figure: probability density of CTR for Item A and Item B]
We know the CTR of Item A well (say, shown 1 million times). We are uncertain about the CTR of Item B (shown only 100 times).
If we only make a single decision, give 100% of page views to Item A.
If we make multiple decisions in the future, explore Item B, since its CTR can potentially be higher.
Potential of exploring Item B: Potential = ∫p>q (p − q) f(p) dp
where q = CTR of Item A, p = CTR of Item B, and f(p) = probability density function of Item B's CTR
Multi-Armed Bandits: Introduction (1)
Bandit “arms”
p1 p2 p3 (unknown payoff
probabilities)
“Pulling” arm i yields a reward:
reward = 1 with probability pi (success)
reward = 0 otherwise (failure)
For now, we are attacking the problem of choosing the best article/arm for all users
Multi-Armed Bandits: Introduction (2)
Bandit “arms”
p1 p2 p3 (unknown payoff
probabilities)
Goal: Pull arms sequentially to maximize the total reward
Bandit scheme/policy: Sequential algorithm to play arms (items)
Regret of a scheme = Expected loss relative to the "oracle" optimal scheme that always plays the best arm – "best" means highest success probability – But, the best arm is not known … unless you have an oracle – Regret is the price of exploration – Low regret implies quick convergence to the best
Multi-Armed Bandits: Introduction (3)
• Bayesian approach – Seeks to find the Bayes optimal solution to a Markov
decision process (MDP) with assumptions about probability distributions
– Representative work: Gittins’ index, Whittle’s index – Very computationally intensive
• Minimax approach – Seeks to find a scheme that incurs bounded regret (with no
or mild assumptions about probability distributions) – Representative work: UCB by Lai, Auer – Usually, computationally easy – But, they tend to explore too much in practice (probably
because the bounds are based on worst-case analysis)
Skip details
Multi-Armed Bandits: Markov Decision Process (1)
• Select an arm now at time t=0, to maximize expected total number of clicks in t=0,…,T
• State at time t: Θt = (θ1t, …, θKt) – θit = State of arm i at time t (that captures all we know about arm i at t)
• Reward function Ri(Θt, Θt+1) – Reward of pulling arm i that brings the state from Θt to Θt+1
• Transition probability Pr[Θt+1 | Θt, pulling arm i ] • Policy π: A function that maps a state to an arm (action)
– π(Θt) returns an arm (to pull) • Value of policy π starting from the current state Θ0 with horizon T
Vπ(Θ0, T) = E[ Rπ(Θ0)(Θ0, Θ1) + Vπ(Θ1, T−1) ]
= ∫ Pr[Θ1 | Θ0, π(Θ0)] · ( Rπ(Θ0)(Θ0, Θ1) + Vπ(Θ1, T−1) ) dΘ1
(immediate reward + value of the remaining T−1 time slots if we start from state Θ1)
Multi-Armed Bandits: MDP (2)
• Optimal policy:
• Things to notice: – Value is defined recursively (actually T high-dim integrals) – Dynamic programming can be used to find the optimal policy – But, just evaluating the value of a fixed policy can be very expensive
• Bandit Problem: The pull of one arm does not change the state of other arms and the set of arms do not change over time
Vπ(Θ0, T) = E[ Rπ(Θ0)(Θ0, Θ1) + Vπ(Θ1, T−1) ]
= ∫ Pr[Θ1 | Θ0, π(Θ0)] · ( Rπ(Θ0)(Θ0, Θ1) + Vπ(Θ1, T−1) ) dΘ1
(immediate reward + value of the remaining T−1 time slots if we start from state Θ1)
Optimal policy: argmaxπ Vπ(Θ0, T)
Multi-Armed Bandits: MDP (3) • Which arm should be pulled next?
– Not necessarily what looks best right now, since it might have had a few lucky successes
– Looks like it will be a function of successes and failures of all arms • Consider a slightly different problem setting
– Infinite time horizon, but – Future rewards are geometrically discounted
Rtotal = R(0) + γ·R(1) + γ²·R(2) + …  (0 < γ < 1)
• Theorem [Gittins 1979]: The optimal policy decouples and solves a bandit problem for each arm independently
Policy π(Θt) is a function of (θ1t, …, θKt): one K-dimensional problem
Policy π(Θt) = argmaxi { g(θit) }  (Gittins' index): K one-dimensional problems
Still computationally expensive!!
Multi-Armed Bandits: MDP (4)
Bandit Policy
1. Compute the priority (Gittins’ index) of each arm based on its state
2. Pull arm with max priority, and observe reward
3. Update the state of the pulled arm
Priority 1
Priority 2
Priority 3
Multi-Armed Bandits: MDP (5) • Theorem [Gittins 1979]: The optimal policy decouples
and solves a bandit problem for each arm independently – Many proofs and different interpretations of Gittins’ index
exist • The index of an arm is the fixed charge per pull for a game with two options, whether
to pull the arm or not, so that the charge makes the optimal play of the game have zero net reward
– Significantly reduces the dimension of the problem space – But, Gittins’ index g(θit) is still hard to compute
• For the Gamma-Poisson or Beta-Binomial models θit = (#successes, #pulls) for arm i up to time t
• g maps each possible (#successes, #pulls) pair to a number
– Approximate methods are used in practice – Lai et al. have derived these for exponential family
distributions
Multi-Armed Bandits: Minimax Approach (1)
• Compute the priority of each arm i in a way that the regret is bounded – Lowest regret in the worst case
• One common policy is UCB1 [Auer 2002]:
Priorityi = ci / ni + √(2 · ln n / ni)
where ci = number of successes of arm i, ni = number of pulls of arm i, and n = total number of pulls of all arms. The first term is the observed success rate; the second is a factor representing uncertainty.
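A minimal UCB1 sketch (our own helper; `pull` is a hypothetical callback returning a 0/1 reward for the chosen arm):

```python
import math

def ucb1(pull, num_arms, horizon):
    """UCB1 [Auer 2002]: play the arm maximizing
    observed success rate + sqrt(2 ln n / n_i)."""
    clicks = [0] * num_arms
    pulls = [0] * num_arms
    for arm in range(num_arms):                # initialize: pull each arm once
        clicks[arm] += pull(arm)
        pulls[arm] += 1
    for _ in range(num_arms, horizon):
        n = sum(pulls)
        arm = max(range(num_arms),
                  key=lambda i: clicks[i] / pulls[i]
                  + math.sqrt(2 * math.log(n) / pulls[i]))
        clicks[arm] += pull(arm)
        pulls[arm] += 1
    return pulls, clicks
```

With a large gap between arm payoffs, the sub-optimal arm ends up pulled only a logarithmic number of times.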
Multi-Armed Bandits: Minimax Approach (2)
• As total observations n becomes large: – Observed payoff tends asymptotically towards the
true payoff probability – The system never completely “converges” to one
best arm; only the rate of exploration tends to zero
Priorityi = ci / ni + √(2 · ln n / ni)  (observed payoff + factor representing uncertainty)
Multi-Armed Bandits: Minimax Approach (3)
• Sub-optimal arms are pulled O(log n) times • Hence, UCB1 has O(log n) regret • This is the lowest possible regret (but the constants matter!) • E.g. regret after n plays is bounded by
Priorityi = ci / ni + √(2 · ln n / ni)  (observed payoff + factor representing uncertainty)
Regret after n plays ≤ [ 8 ∑i: µi < µbest (ln n) / Δi ] + (1 + π²/3) ∑j=1..K Δj ,  where Δi = µbest − µi
Classical Multi-Armed Bandits: Summary
• Classical multi-armed bandits – A fixed set of arms with fixed rewards – Observe the reward before the next pull
• Bayesian approach (Markov decision process) – Gittins’ index [Gittins 1979]: Bayes optimal for classical bandits
• Pull the arm currently having the highest index value – Whittle’s index [Whittle 1988]: Extension to a changing reward function – Computationally intensive
• Minimax approach (providing guaranteed regret bounds) – UCB1 [Auer 2002]: Upper bound of a model-agnostic confidence interval
• Index of arm i = ci / ni + √(2 · ln n / ni)
• Heuristics
– ε-Greedy: Random exploration using fraction ε of traffic
– Softmax: Pick arm i with probability exp{µ̂i/τ} / ∑j exp{µ̂j/τ}  (µ̂i = predicted CTR of item i, τ = temperature)
– Posterior draw: Index = drawing from posterior CTR distribution of an arm
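The posterior-draw heuristic is one line per arm; a sketch with Beta posteriors (our own helper, assuming a uniform Beta(1,1) prior):

```python
import random

def thompson_step(successes, failures):
    """Thompson sampling: score each arm by a draw from its Beta
    posterior (Beta(1,1) prior assumed) and play the argmax."""
    draws = [random.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])
```

Arms with little data have wide posteriors and occasionally win the draw, which is exactly the exploration behavior we want.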
Do Classical Bandits Apply to Web Recommenders?
Traffic obtained from a controlled randomized experiment (no confounding). Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories
Each curve is the CTR of an item in the Today Module on www.yahoo.com over time
Characteristics of Real Recommender Systems
• Dynamic set of items (arms) – Items come and go with short lifetimes (e.g., a day) – Asymptotically optimal policies may fail to achieve good performance
when item lifetimes are short • Non-stationary CTR
– CTR of an item can change dramatically over time • Different user populations at different times • Same user behaves differently at different times (e.g., morning, lunch
time, at work, in the evening, etc.) • Attention to breaking news stories decays over time
• Batch serving for scalability – Making a decision and updating the model for each user visit in real time
is expensive – Batch serving is more feasible: Create time slots (e.g., 5 min); for each
slot, decide the fraction xi of the visits in the slot to give to item i [Agarwal et al., ICDM, 2009]
Explore/Exploit in Recommender Systems
time
Item 1 Item 2 … Item K
x1% page views x2% page views … xK% page views
Determine (x1, x2, …, xK) based on clicks and views observed before t in order to maximize the expected total number of clicks in the future
[Timeline: …, t−2, t−1, t (now) → clicks in the future]
Let’s solve this from first principles
Bayesian Solution: Two Items, Two Time Slots (1)
• Two time slots: t = 0 and t = 1 – Item P: We are uncertain about its CTR, p0 at t = 0 and p1 at t = 1 – Item Q: We know its CTR exactly, q0 at t = 0 and q1 at t = 1
• To determine x, we need to estimate what would happen in the future
Question: what fraction x of the N0 views goes to Item P, and (1−x) to Item Q?
[Timeline: Now (t=0, N0 views) → t=1 (N1 views) → End]
Obtain c clicks after serving x (not yet observed; random variable)
Assume we observe c; we can update p1
[Figure: CTR densities. At t = 0: Item Q at known CTR q0, Item P centered at p0. At t = 1: Item Q at q1, Item P at its posterior mean p̂1(x,c).]
If x and c are given, optimal solution: give all views to Item P iff E[p1 | x, c] = p̂1(x,c) > q1
• Expected total number of clicks in the two time slots
N0 x p̂0 + N0 (1−x) q0 + N1 Ec[ max{ p̂1(x,c), q1 } ]
Gain(x, q0, q1) = Expected number of additional clicks if we explore the uncertain item P with fraction x of views in slot 0, compared to a scheme that only shows the certain item Q in both slots
Solution: argmaxx Gain(x, q0, q1)
Bayesian Solution: Two Items, Two Time Slots (2)
= N0 q0 + N1 q1 + N0 x (p̂0 − q0) + N1 Ec[ max{ p̂1(x,c) − q1, 0 } ]
N0 q0 + N1 q1 is the E[#clicks] if we always show Item Q; the remaining terms are Gain(x, q0, q1), the gain of exploring the uncertain Item P using x.
At t = 0, the E[#clicks] comes from Items P and Q; at t = 1, show the item with higher E[CTR]: max{ p̂1(x,c), q1 }
• Approximate by the normal distribution – Reasonable approximation because of the central limit theorem
• Proposition: Using the approximation, the Bayes optimal solution x can be found in time O(log N0)
Gain(x, q0, q1) = N0 x (p̂0 − q0) + N1 [ σ1(x)·φ((q1 − p̂1)/σ1(x)) + (p̂1 − q1)·(1 − Φ((q1 − p̂1)/σ1(x))) ]
where the prior of p1 is Beta(a, b),
p̂1 = Ec[ p̂1(x,c) ] = a / (a + b), and
σ1²(x) = Varc[ p̂1(x,c) ] = ( x N0 / (a + b + x N0) ) · a b / ( (a + b)² (a + b + 1) )
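Under the stated normal approximation, Gain can be evaluated in closed form. A sketch (our own function; the small floor on σ1 avoids division by zero at x = 0):

```python
import math

def gain(x, q0, q1, N0, N1, a, b):
    """Expected additional clicks from exploring Item P (prior Beta(a, b))
    with fraction x of the N0 views in slot 0, vs. always showing Item Q."""
    p_hat = a / (a + b)                       # prior mean CTR of Item P
    m = x * N0                                # views given to Item P
    var1 = (m / (a + b + m)) * a * b / ((a + b) ** 2 * (a + b + 1))
    sigma1 = max(math.sqrt(var1), 1e-12)
    z = (q1 - p_hat) / sigma1
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)   # N(0,1) density
    Phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))          # N(0,1) cdf
    tail = sigma1 * phi + (p_hat - q1) * (1 - Phi)  # E[max{p1_hat - q1, 0}]
    return N0 * x * (p_hat - q0) + N1 * tail
```

With q0 = q1 = 0.05 and an uncertain item whose prior mean equals 0.05 (a = 1, b = 19), exploring with x = 0.2 yields a positive expected gain, while x = 0 yields none.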
Bayesian Solution: Two Items, Two Time Slots (3)
Bayesian Solution: Two Items, Two Time Slots (4)
• Quiz: Is it correct that the more we are uncertain about the CTR of an item, the more we should explore the item?
Uncertainty: Low Uncertainty: High
Different curves are for different prior mean settings (y-axis: fraction of views to give to the item)
– Apply Whittle's Lagrange relaxation (1988) to this problem setting: relax ∑i zi(c) = 1, for all c, to Ec[ ∑i zi(c) ] = 1, and apply Lagrange multipliers (q0 and q1) to enforce the constraints
– We essentially reduce the K-item case to K independent two-item sub-problems (which we have solved)
Bayesian Solution: General Case (1) • From two items to K items
– Very difficult problem:
maxx ∑i N0 xi p̂i0 + N1 Ec[ maxi { p̂i1(xi, ci) } ],  subject to ∑i xi = 1
where Ec[ maxi { p̂i1(xi, ci) } ] = maxz Ec[ ∑i zi(c) p̂i1(xi, ci) ],  subject to ∑i zi(c) = 1 for all possible c
Note: c = [c1, …, cK]; ci is a random variable representing the # clicks on item i we may get
– After Whittle's relaxation and the Lagrange multipliers: minq0,q1 ( N0 q0 + N1 q1 + ∑i maxxi Gain(xi, q0, q1) )
Bayesian Solution: General Case (2)
• From two intervals to multiple time slots – Approximate multiple time slots by two stages
• Non-stationary CTR – Use the Dynamic Gamma-Poisson model to
estimate the CTR distribution for each item
Simulation Experiment: Different Traffic Volume
Simulation with ground truth estimated based on Yahoo! Front Page data. Setting: 16 live items per interval. Scenarios: web sites with different traffic volume (x-axis)
Simulation Experiment: Different Sizes of the Item Pool
Simulation with ground truth estimated based on Yahoo! Front Page data. Setting: 1000 views per interval; average item lifetime = 20 intervals. Scenarios: different sizes of the item pool (x-axis)
Characteristics of Different Explore/Exploit Schemes (1)
• Why the Bayesian solution has better performance • Characterize each scheme by three dimensions:
– Exploitation regret: The regret of a scheme when it is showing the item which it thinks is the best (may not actually be the best)
• 0 means the scheme always picks the actual best • It quantifies the scheme’s ability of finding good
items – Exploration regret: The regret of a scheme when it is exploring the items
which it feels uncertain about
• It quantifies the price of exploration (lower → better) – Fraction of exploitation (higher → better)
• Fraction of exploration = 1 − fraction of exploitation
[Diagram: all traffic to a web site = exploitation traffic + exploration traffic]
Characteristics of Different Explore/Exploit Schemes (2)
Exploitation regret: ability of finding good items (lower → better). Exploration regret: price of exploration (lower → better). Fraction of exploitation (higher → better)
[Figure: exploitation regret vs. exploration regret and vs. exploitation fraction; the "Good" corner is marked in each panel]
Discussion: Large Content Pool • The Bayesian solution looks promising
– ~10% from true optimal for a content pool of 1000 live items
• 1000 views per interval; item lifetime ~20 intervals • Intelligent initialization (offline modeling)
– Use item features to reduce the prior variance of an item • E.g., Var[ item CTR | Sport ] < Var[ item CTR ]
– Require a CTR model that outputs both mean and variance
• Linear regression model • Segmented model: Estimate the CTR distribution of a random
article in an item category – Existing taxonomies, decision tree, LDA topics
• Feature-based explore/exploit – Estimate model parameters, instead of per-item CTR – More later
Discussion: Multiple Positions, Ranking
• Feature-based approach – reward(page) = model(φ(item 1 at position 1, … item k at position k)) – Apply feature-based explore/exploit
• Online optimization for ranked list – Ranked bandits [Radlinski et al., 2008]: Run an
independent bandit algorithm for each position – Dueling bandit [Yue & Joachims, 2009]: Actions are
pairwise comparisons • Online optimization of submodular functions
– ∀ S1, S2 and a, fa(S1 ⊕ S2) ≤ fa(S1), where fa(S) = f(S ⊕ 〈a〉) − f(S)
– Streeter & Golovin (2008)
Discussion: Segmented Most Popular
• Partition users into segments, and then for each segment, provide most popular recommendation
• How to segment users – Hand-created segments: AgeGroup × Gender – Clustering or decision tree based on user features
• Users in the same cluster like similar items • Segments can be organized by taxonomies/hierarchies
– Better CTR models can be built by hierarchical smoothing • Shrink the CTR of a segment toward its parent • Introduce bias to reduce uncertainty/variance
– Bandits for taxonomies (Pandey et al., 2008) • First explore/exploit categories/segments • Then, switch to individual items
Most Popular Recommendation: Summary
• Online model: – Estimate the mean and variance of the CTR of each item over
time – Dynamic Gamma-Poisson model
• Intelligent initialization: – Estimate the prior mean and variance of the CTR of each item
cluster using historical data • Cluster items → Maximum likelihood estimates of the priors
• Explore/exploit: – Bayesian: Solve a Markov decision process problem
• Gittins’ index, Whittle’s index, approximations • Better performance, computation intensive • Thompson sampling: Sample from the posterior (simple)
– Minimax: Bound the regret • UCB1: Easy to compute • Explore more than necessary in practice
– ε-Greedy: Empirically competitive for tuned ε
Online Components for Personalized Recommendation
Online models, intelligent initialization & explore/exploit
Intelligent Initialization for Linear Model (1)
• Linear/factorization model
– How to estimate the prior parameters µj and Σ • Important for cold start: Predictions are made using prior • Leverage available features
– How to learn the weights/factors quickly • High dimensional βj → slow convergence • Reduce the dimensionality
Subscript: user i, item j
yij ~ N(ui′ βj, σ²),  βj ~ N(µj, Σ)
where yij = rating that user i gives item j, ui = feature/factor vector of user i, and βj = factor vector of item j
Feature-based model initialization
• Dimensionality reduction for fast model convergence
βj ~ N(A xj, Σ)  (item factors predicted by features)
FOBFM: Fast Online Bilinear Factor Model
Per-item online model: yij ~ ui′ βj,  βj ~ N(µj, Σ)
⇔ yij ~ ui′ A xj + ui′ vj,  vj ~ N(0, Σ)  (first term predicted by features)
Dimensionality reduction: vj = B θj,  θj ~ N(0, σθ² I)
B is an n×k linear projection matrix (k << n): project high-dim vj → low-dim θj
Low-rank approximation of Var[βj]:  βj ~ N(A xj, σθ² B B′)
Subscripts: user i, item j. Data: yij = rating that user i gives item j; ui = offline factor vector of user i; xj = feature vector of item j
Offline training: determine A, B, σθ² through the EM algorithm (once per day or hour)
Feature-based model initialization
• Dimensionality reduction for fast model convergence
• Fast, parallel online learning
• Online selection of dimensionality (k = dim(θj)) – Maintain an ensemble of models, one for each candidate dimensionality
βj ~ N(A xj, Σ)
FOBFM: Fast Online Bilinear Factor Model
Per-item online model: yij ~ ui′ βj,  βj ~ N(µj, Σ)
⇔ yij ~ ui′ A xj + ui′ vj,  vj ~ N(0, Σ);  vj = B θj,  θj ~ N(0, σθ² I)
B is an n×k linear projection matrix (k << n): project high-dim vj → low-dim θj; low-rank approximation of Var[βj]: βj ~ N(A xj, σθ² B B′)
Online update: yij ~ ui′ A xj + (ui′ B) θj, where θj is updated in an online manner
(ui′ A xj = offset; ui′ B = new low-dimensional feature vector)
Subscripts: user i, item j. Data: yij = rating that user i gives item j; ui = offline factor vector of user i; xj = feature vector of item j
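The per-item online learning step then reduces to Bayesian linear regression in the k-dimensional θj space. A sketch of a single Gaussian update (names are ours; numpy for the linear algebra):

```python
import numpy as np

def online_theta_update(prec, mean, z, resid, noise_var=1.0):
    """One online update of the low-dimensional item factor theta_j.

    prec, mean: Gaussian posterior of theta_j (precision matrix, mean).
    z = B' u_i: projected user vector (the new low-dim features).
    resid = y_ij - u_i' A x_j: rating minus the feature-based offset.
    """
    prec_new = prec + np.outer(z, z) / noise_var   # rank-1 precision update
    b = prec @ mean + z * resid / noise_var        # information-form mean
    mean_new = np.linalg.solve(prec_new, b)
    return prec_new, mean_new
```

Because the update touches only a k×k matrix per item, it is cheap enough to run in parallel across items at serving frequency.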
Experimental Results: My Yahoo! Dataset (1)
• My Yahoo! is a personalized news reading site – Users manually select news/RSS feeds
• ~12M “ratings” from ~3M users on ~13K articles – Click = positive – View without click = negative
Experimental Results: My Yahoo! Dataset (2)
Item-based data split: every item is new in the test data – First 8K articles are in the training data (offline training) – Remaining articles are in the test data (online prediction & learning)
Supervised dimensionality reduction (reduced-rank regression) significantly outperforms other methods
Methods: No-init: Standard online regression with ~1000 parameters for each item
Offline: Feature-based model without online update
PCR, PCR+: Two principal component methods to estimate B
FOBFM: Our fast online method
Experimental Results: My Yahoo! Dataset (3)
• Small number of factors (low dimensionality) is better when the amount of data for online learning is small
• Large number of factors is better when the data for learning becomes large • The online selection method usually selects the best dimensionality
# factors = Number of parameters per item updated online
Intelligent Initialization: Summary
• For online learning, whenever historical data is available, do not start cold
• For linear/factorization models – Use available features to setup the starting point – Reduce dimensionality to facilitate fast learning
• Next – Explore/exploit for personalization – Users are represented by covariates
• Features, factors, clusters, etc – Covariate bandits
Explore/Exploit for Personalized Recommendation
• One extreme problem formulation – One bandit problem per user with one arm per item – Bandit problems are correlated: “Similar” users like similar
items – Arms are correlated: “Similar” items have similar CTRs
• Model this correlation through covariates/features – Input: User feature/factor vector, item feature/factor vector – Output: Mean and variance of the CTR of this (user, item)
pair based on the data collected so far • Covariate bandits
– Also known as contextual bandits, bandits with side observations
– Provide a solution to • Large content pool (correlated arms) • Personalized recommendation (hint before pulling an arm)
Methods for Covariate Bandits • Priority-based methods
– Rank items according to the user-specific “score” of each item; then, update the model based on the user’s response
– UCB (upper confidence bound) • Score of an item = E[posterior CTR] + k StDev[posterior CTR]
– Posterior draw (Thompson sampling) • Score of an item = a number drawn from the posterior CTR distribution
– Softmax • Score of an item = a number drawn with probability exp{µ̂i/τ} / ∑j exp{µ̂j/τ}
• ε-Greedy – Allocate ε fraction of traffic for random exploration (ε may be adaptive) – Robust when the exploration pool is small
• Bayesian scheme – Close to optimal if it can be solved efficiently
Covariate Bandits: Some References
• Just a small sample of papers – Hierarchical explore/exploit (Pandey et al., 2008)
• Explore/exploit categories/segments first; then, switch to individuals – Variants of ε-greedy
• Epoch-greedy (Langford & Zhang, 2007): ε is determined based on the generalization bound of the current model
• Banditron (Kakade et al., 2008): Linear model with binary response • Non-parametric bandit (Yang & Zhu, 2002): ε decreases over time;
example model: histogram, nearest neighbor – Variants of UCB methods
• Linearly parameterized bandits (Rusmevichientong et al., 2008): minimax, based on uncertainty ellipsoid
• LinUCB (Li et al., 2010): Gaussian linear regression model • Bandits in metric spaces (Kleinberg et al., 2008; Slivkins et al., 2009):
– Similar arms have similar rewards: | reward(i) – reward(j) | ≤ distance(i,j)
Online Components: Summary • Real systems are dynamic • Cold-start problem
– Incremental online update (online linear regression) – Intelligent initialization (use features to predict initial factor
values) – Explore/exploit (UCB, posterior draw, softmax, ε-greedy)
• Concept-drift problem – Tracking the most recent behavior (state-space models,
Kalman filter) – Modeling temporal patterns (tensor factorization, spline)
Evaluation Methods and Challenges
Evaluation Methods • Ideal method
– Experimental Design: Run side-by-side experiments on a small fraction of randomly selected traffic with new method (treatment) and status quo (control)
– Limitation • Often expensive and difficult to test large number of methods
• Problem: How do we evaluate methods offline on logged data? – Goal: To maximize clicks/revenue, not prediction
accuracy, on the entire system. The cost of predictive inaccuracy varies across instances.
• E.g. 100% error on a low-CTR article may not matter much because it always co-occurs with a high-CTR article that is predicted accurately
Usual Metrics • Predictive accuracy
– Root Mean Squared Error (RMSE) – Mean Absolute Error (MAE) – Area under the Curve, ROC
• Other rank based measures based on retrieval accuracy for top-k
– Recall in test data • What fraction of items that the user actually liked in the test data were
among the top-k recommended by the algorithm (fraction of hits, e.g. Karypis, CIKM 2001)
• One flaw in several papers – Training and test split are not based on time.
• Information leakage • Even in Netflix, this is the case to some extent
– Time split per user, not per event. For instance, information may leak if models are based on user-user similarity.
Metrics continued.. • Recall per event based on Replay-Match
method – Fraction of clicked events where the top
recommended item matches the clicked one.
• This works well if the logged data was collected from a randomized serving scheme; with biased data this could be a problem – We will be inventing algorithms that provide
recommendations that are similar to the current one
• No reward for novel recommendations
Details on Replay-Match method (Li, Langford, et al)
• x: feature vector for a visit • r = [r1,r2,…,rK]: reward vector for the K items in inventory • h(x): recommendation algorithm to be evaluated • Goal: Estimate expected reward for h(x)
• s(x): recommendation scheme that generated logged-data • x1,..,xT: visits in the logged data • rti: reward for visit t, where i = s(xt)
Replay-Match continued • Estimator
• If importance weights and
– It can be shown estimator is unbiased
• E.g. if s(x) is random serving scheme, importance weights are uniform over the item set
• If s(x) is not random, importance weights have to be estimated through a model
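When the logging policy s(x) is uniformly random, the estimator is just the average reward over matched events. A minimal sketch (our own function; the general importance-weighted case is omitted):

```python
def replay_estimate(logged, h):
    """Replay-Match sketch: average reward of policy h over the logged
    events where h's recommendation matches the served item, assuming
    the logging policy served items uniformly at random."""
    matched = [reward for (x, served, reward) in logged if h(x) == served]
    return sum(matched) / len(matched) if matched else 0.0
```

Events where h disagrees with the logged action carry no information about h's reward, which is why randomized logging (so every action has some chance of matching) is essential.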
Back to Multi-Objective Optimization
[Diagram: Front Page recommender with editorial content; clicks on FP links influence downstream supply distribution; the ad server serves premium display (guaranteed) and a spot market (cheaper); downstream engagement (time spent)]
Serving Content on Front Page: Click Shaping
• What do we want to optimize? • Current: Maximize clicks (maximize downstream supply from FP) • But consider the following
– Article 1: CTR=5%, utility per click = 5 – Article 2: CTR=4.9%, utility per click=10
• By promoting 2, we lose 1 click/100 visits, gain 5 utils • If we do this for a large number of visits --- lose some clicks but
obtain significant gains in utility? – E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement,
etc)
Why call it Click Shaping?
[Figure: supply distribution across Yahoo! properties (autos, buzz, finance, gmy.news, health, hotjobs, movies, new.music, news, omg, realestate, rivals, shine, shopping, sports, tech, travel, tv, video, videogames, other) BEFORE and AFTER click shaping, with per-property changes ranging from −10% to +10%]
Supply distribution changes. SHAPING can happen with respect to any downstream metric (like engagement)
Multi-Objective Optimization
[Diagram: n articles A1, …, An; K properties (news, finance, omg, …); m user segments S1, …, Sm]
CTR of user segment i on article j: pij. Time duration of i on j: dij
Multi-Objective Program
• Scalarization
• Goal programming
Simplex constraints on xij are always applied; the constraints are linear
Every 10 mins, solve for x; use this x as the serving scheme in the next 10 mins
Pareto-optimal solution (more in KDD 2011)
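A toy instance of such a linear program, sketched with scipy's `linprog` (all numbers are made up for illustration): maximize total time spent subject to a floor on expected clicks and a per-segment simplex constraint.

```python
from scipy.optimize import linprog

# p[i][j]: CTR of segment i on article j; d[i][j]: time spent per click
p = [[0.050, 0.049], [0.030, 0.040]]
d = [[5.0, 10.0], [4.0, 8.0]]
min_clicks = 0.0899          # floor on expected clicks per unit of traffic

# Variables x_ij flattened row-major; linprog minimizes, so negate utility
obj = [-p[i][j] * d[i][j] for i in range(2) for j in range(2)]
# Expected clicks >= min_clicks  <=>  -sum p_ij x_ij <= -min_clicks
A_ub = [[-p[i][j] for i in range(2) for j in range(2)]]
b_ub = [-min_clicks]
# Simplex constraint per segment: x_i1 + x_i2 = 1
A_eq = [[1, 1, 0, 0], [0, 0, 1, 1]]
b_eq = [1, 1]
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 4)
```

The solver shifts just enough traffic to the higher-CTR articles to meet the click floor while maximizing time spent; re-solving every few minutes with fresh pij, dij estimates gives the serving scheme for the next interval.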
Summary • Modern recommendation systems on the web crucially depend on
extracting intelligence from massive amounts of data collected on a routine basis
• Lots of data and processing power are not enough; the number of things we need to learn grows with data size
• Extracting grouping structures at coarser resolutions based on similarity (correlations) is important – ML has a big role to play here
• Continuous and adaptive experimentation in a judicious manner crucial to maximize performance – Again, ML has a big role to play
• Multi-objective optimization is often required; the objectives are application-dependent. – ML has to work in close collaboration with
engineering, product & business execs
Challenges
Recall: Some examples • Simple version
– I have an important module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to drive up total CTR on this module
• More advanced – I got X% lift in CTR. But I have additional information on
other downstream utilities (e.g. dwell time). Can I increase downstream utility without losing too many clicks?
• Highly advanced – There are multiple modules running on my website. How
do I take a holistic approach and perform a simultaneous optimization?
For the simple version • Multi-position optimization
– Explore/exploit, optimal subset selection
• Explore/Exploit strategies for large content pool and high dimensional problems – Some work on hierarchical bandits but more needs to be
done • Constructing user profiles from multiple sources with
less than full coverage – Couple of papers at KDD 2011
• Content understanding • Metrics to measure user engagement (other than
CTR)
Other problems • Whole page optimization
– Incorporating correlations
• Incentivizing User generated content
• Incorporating Social information for better recommendation (News Feed Recommendation)
• Multi-context Learning
Case Studies
Recommendations and Advertising on LinkedIn HP
EXAMPLE: DISPLAY AD PLACEMENTS ON LINKEDIN
©2013 LinkedIn Corporation. All Rights Reserved.
Recommendations and Advertising on LinkedIn HP
LinkedIn Advertising: Flow
[Diagram: an ad request (profile: region = US, age = 20; context = profile page, 300×250 ad slot) flows through Filter Campaigns (targeting criteria, frequency cap, budget pacing) → campaigns eligible for auction → Response Prediction Engine → sorted by Bid × CTR → Automatic Format Selection]
Click Cost = Bid3 × CTR3 / CTR2
Serving constraint: < 100 millisec
CTR Prediction Model for Ads • Feature vectors
– Member feature vector: xi (identity, behavioral, network) – Campaign feature vector: cj (text, advertiser-id, …) – Context feature vector: zk (page type, device, …)
• Model:
CTR Prediction Model for Ads (continued)
• Model: cold-start component + warm-start per-campaign component
– Both components can have L2 penalties
Model Fitting • Single machine (well understood)
– conjugate gradient – L-BFGS – trust region – …
• Model training with large-scale data – Cold-start component Θw is more stable
• Weekly/bi-weekly training is good enough • However: difficulty from the need for large-scale logistic regression
– Warm-start per-campaign model Θc is more dynamic • New items can get generated at any time • Big loss if opportunities are missed • Need to update the warm-start component as frequently as possible
Large Scale Logistic Regression
Per-item logistic regression given Θc
Explore/Exploit with Logistic Regression
[Figure: training data (+/− points) with the COLD START separating line, the COLD + WARM START line for an Ad-id, and the posterior of the warm-start coefficients]
E/E: Sample a line from the posterior (Thompson Sampling)
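A sketch of one Thompson-sampling scoring step, assuming a diagonal Gaussian approximation to the posterior of the warm-start coefficients (all names are ours):

```python
import math
import random

def sample_ctr(mean_w, var_w, features):
    """Draw a coefficient vector from its (diagonal) Gaussian posterior
    and score the impression with the sampled logistic model."""
    w = [random.gauss(m, math.sqrt(v)) for m, v in zip(mean_w, var_w)]
    s = sum(wi * xi for wi, xi in zip(w, features))
    return 1.0 / (1.0 + math.exp(-s))
```

Campaigns with wide posteriors get occasional optimistic draws and hence traffic; as data accumulates the posterior narrows and the sampled scores converge to the mean prediction.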
Models Considered
• CONTROL: per-campaign CTR counting model
• COLD-ONLY: only the cold-start component
• LASER: our model (cold-start + warm-start)
• LASER-EE: our model with explore/exploit using Thompson sampling
Metrics
• Model metrics (offline) – Test Log-‐likelihood – AUC/ROC – Observed/Expected ra$o
• Online metrics (Online A/B Test) – CTR – CPM (Revenue per impression) – Unique ads per user (diversity)
Observed / Expected Ratio
• Offline replay is difficult with a large item set (randomization is costly)
• Observed: #clicks in the data; Expected: sum of predicted CTR over all impressions
• Not a "standard" classifier metric, but useful for this application
• What we usually see: observed/expected < 1
– Quantifies the "winner's curse", a.k.a. selection bias in auctions
• When choosing among thousands of candidates, an item with a mistakenly over-estimated CTR may end up winning the auction
• Particularly helpful for spotting inefficiencies by segment
– E.g. by bid, number of impressions in training (warmness), geo, etc.
– Lets us see where the model might be giving too much weight to the wrong campaigns
• High correlation between the O/E ratio and online model performance
Offline: ROC Curves
[Figure: ROC curves, true positive rate vs. false positive rate. AUC — CONTROL: 0.672, COLD-ONLY: 0.757, LASER: 0.778]
Online A/B Test
• Three models
– CONTROL (10%)
– LASER (85%)
– LASER-EE (5%)
• Segmented analysis
– 8 segments by campaign warmness
• Degree of warmness: the number of training samples available for the campaign
• Segment #1: campaigns with almost no data in training
• Segment #8: campaigns served most heavily in the previous batches, so their CTR estimates can be quite accurate
Daily CTR Lift Over Control
[Figure: percentage CTR lift of LASER and LASER-EE over CONTROL for each of 7 days; exact percentages redacted]
Daily CPM Lift Over Control
[Figure: percentage eCPM lift of LASER and LASER-EE over CONTROL for each of 7 days; exact percentages redacted]
CPM Lift by Campaign Warmness Segments
[Figure: percentage CPM lift of LASER and LASER-EE over CONTROL by warmness segment 1–8; exact percentages redacted]
O/E Ratio by Campaign Warmness Segments
[Figure: observed clicks / expected clicks (ranging roughly 0.5–1.0) for CONTROL, LASER, and LASER-EE by warmness segment 1–8]
Number of Campaigns Served: Improvement from E/E
Insights
• Overall performance:
– LASER and LASER-EE are both much better than CONTROL
– LASER and LASER-EE performance is very similar
• Great news! We get exploration without much additional cost
• Exploration has other benefits
– LASER-EE serves significantly more campaigns than LASER
– Provides a healthier marketplace and more ad diversity per user (better experience)
Solutions to Practical Problems
• Rapid model development cycle
– Quick reaction to changes in data and product
– Write once for training, testing, and inference
• Can adapt to changing data
– Integrated Thompson-sampling explore/exploit
– Automatic training
– Multiple training frequencies for different parts of the model
• Good tools yield good models
– Reusable components for feature extraction and transformation
– Very high-performance inference engine for deployment
– Modelers can concentrate on building models, not re-writing common functions or worrying about production issues
Summary
• Reducing dimension through logistic regression, coupled with explore/exploit schemes like Thompson sampling, is an effective mechanism for solving response-prediction problems in advertising
• Partitioning model components into cold-start (stable) and warm-start (non-stationary) parts, with different training frequencies, is an effective way to scale the computations
• ADMM, with a few modifications, is an effective model-training strategy for large data with high dimensionality
• These methods work well for LinkedIn advertising, with significant improvements
©2013 LinkedIn Corporation. All Rights Reserved.
Theory vs. Practice
Textbook:
• Data is stationary
• Training data is clean
• Training is hard; testing and inference are easy
• Models don't change
• Complex algorithms work best
Reality:
• Features and items change constantly
• Fraud, bugs, tracking delays, online/offline inconsistencies, etc.
• All aspects have challenges at web scale
• Never-ending processes of improvement
• Simple models with good features and lots of data win
Current Work: Feed Recommendation
• Network updates: job changes, job anniversaries, connections, endorsements, photo uploads, …
• Content: articles by influencers, shares by friends, content in followed channels, content by followed companies, job recommendations, …
• Sponsored updates: company updates, jobs
Tiered Approach to Ranking
§ A second-pass ranker (BLENDER) blends the disparate results returned by first-pass rankers (jobs, ads, network updates, content) to produce the top k
Challenges
• Personalization
– Viewer-actor affinity by type (depends on the strength of connections in multiple contexts)
– Blending identity and behavioral data
• Frequency discounting, freshness, diversification
• Multiple objectives (revenue, engagement)
• A/B tests with interference
• Engagement metrics
– Functions of various actions that optimize a long-term engagement metric like return visits
• Summarization and adding new content types
Impression Discounting
• How does the response rate vary with past impressions of the same item?
Slide courtesy of Pannaga Shivaswamy
Diversity
• How does the response rate change when an actorId/objectType at a position matches previous items?
Slide courtesy of Pannaga Shivaswamy
Age of an Item
• How does the response rate change with age for different item types?
Slide courtesy of Pannaga Shivaswamy
Parallel Matrix Factorization
Problem Setup
• CTR prediction for a user on an item
• Assumptions:
– There are sufficient data per item to estimate a per-item model
– Serving bias and positional bias are removed by a random serving scheme
– Item popularities are quite dynamic and have to be estimated in real time
• Examples:
– Yahoo! front page Today module
– LinkedIn Today module
Online Logistic Regression (OLR)
§ User i with feature vector xi, article j
§ Binary response y (click/non-click)
§ p(y = 1 | xi, j) = σ(xiᵀβj), with σ(s) = 1/(1 + e^(−s))
§ Prior βj ~ N(μj, Σj)
§ Use a Laplace approximation or variational Bayesian methods to obtain the posterior
§ The posterior becomes the new prior for the next update
§ Can approximate the prior and posterior covariances as diagonal for high-dimensional xi
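A rough sketch of one OLR update under a diagonal Gaussian prior; the coordinate-wise Newton search for the posterior mode is my simplification of the Laplace/variational step:

```python
import numpy as np

def olr_update(mean, var, x, y, n_iter=20):
    """One Bayesian update of a per-article logistic regression with a
    diagonal Gaussian prior N(mean, diag(var)): locate the posterior
    mode by coordinate-wise Newton steps, then use the diagonal of the
    Hessian at the mode for the new (diagonal) posterior variance.
    The returned posterior serves as the prior for the next event."""
    beta = mean.copy()
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-x @ beta))
        grad = (y - p) * x - (beta - mean) / var      # log-posterior grad
        hess_diag = p * (1 - p) * x * x + 1.0 / var   # curvature, per coord
        beta += grad / hess_diag
    p = 1.0 / (1.0 + np.exp(-x @ beta))
    new_var = 1.0 / (p * (1 - p) * x * x + 1.0 / var)
    return beta, new_var
```

A click (y = 1) shifts the mean toward the observed features and shrinks the variance, which is exactly the behavior the Thompson-sampling serving scheme relies on.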
User Features for OLR
• Age, gender, industry, and job position for logged-in users
• General behavioral-targeting (BT) features
– Music? Finance? Politics?
• User profiles from historical view/click behavior on previous items, e.g.
– Item profile: use previously clicked item ids as the user profile
– Category profile: use item-category affinity scores as the profile; the score can simply be the user's historical CTR on each category
– Are there better ways to generate user profiles?
– Yes! By matrix factorization!
Generalized Matrix Factorization (GMF) Framework
• Predicted score for user i on item j:
score(i, j) = bᵀxij + αi + βj + uiᵀvj
with global features xij, user effect αi, item effect βj, user factors ui, and item factors vj
• Bell et al. (2007)
Regression Priors
• αi ~ N(g(xi), σα²), βj ~ N(h(xj), σβ²), ui ~ N(G(xi), σu² I), vj ~ N(H(xj), σv² I), with user covariates xi and item covariates xj
• g(·), h(·), G(·), H(·) can be any regression functions
• Agarwal and Chen (KDD 2009); Zhang et al. (RecSys 2011)
Different Types of Prior Regression Models
• Zero prior mean
– Bilinear random effects (BIRE)
• Linear regression
– Simple regression (RLFM)
– Lasso penalty (LASSO)
• Tree models
– Recursive partitioning (RP)
– Random forests (RF)
– Gradient boosting machines (GB)
– Bayesian additive regression trees (BART)
Model Fitting Using MCEM
• Monte Carlo EM (Booth and Hobert 1999)
• Let Θ denote the prior parameters (the regression functions and variances) and let Δ = (α, β, u, v) denote the latent factors
• E step: obtain N samples from the conditional posterior of Δ given Θ and the data
• M step: maximize the Monte Carlo estimate of the expected complete-data log-likelihood over Θ
Handling Binary Responses
• Gaussian responses: the E step has a closed form
• Binary responses + logistic: no longer closed form
• Variational approximation (VAR)
• Adaptive rejection sampling (ARS)
Simulation Study
• 10 simulated data sets, 100K samples for both training and test
• 1000 users and 1000 items in training
• An extra 500 new users and 500 new items in test, plus the old users/items
• For each user/item, 200 covariates, only 10 of them useful
• Construct a non-linear regression model from 20 Gaussian functions to simulate α, β, u and v, following Friedman (2001)
MovieLens 1M Data Set
• 1M ratings
• 6040 users
• 3706 movies
• Sorted by time; first 75% training, last 25% test
• A lot of new users in the test set
• User features: age, gender, occupation, zip code
• Item features: movie genre
Performance Comparison
However…
• We are working with very large-scale data sets!
• Parallel matrix-factorization methods using Map-Reduce have to be developed!
• Khanna et al. (2012), technical report
Parallel Matrix Factorization
• Partition the data into m partitions
• For each partition, run the MCEM algorithm to get an estimate of the prior parameters (one MapReduce job)
• Ensemble runs: for k = 1, …, n
– Repartition the data into m partitions with a new seed
– Run an E-step-only job for each partition given the fitted prior parameters (each ensemble run is one MapReduce job)
• Average the user/item factors over all partitions and all k to obtain the final estimate
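The partition-fit-average pattern can be sketched in plain Python; `fit_mcem` is a hypothetical stand-in for the real per-partition MCEM job:

```python
import numpy as np

def partition_and_average(events, m, fit_mcem, seed=0):
    """'Divide and conquer': randomly split the events into m partitions,
    run the (expensive) fitting routine independently on each one, and
    average the resulting factor estimates. Each partition's fit maps
    onto one MapReduce job; the averaging is the reduce step.

    fit_mcem: callable taking a list of events and returning a dict of
    factor estimates keyed by user/item id."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(events))
    parts = [[events[i] for i in order[j::m]] for j in range(m)]
    fits = [fit_mcem(part) for part in parts]
    keys = set().union(*[f.keys() for f in fits])
    return {k: np.mean([f[k] for f in fits if k in f], axis=0)
            for k in keys}
```

Repartitioning with a new seed per ensemble run (as on the slide) just means calling this with a different `seed`, so each run sees a different user-item mix.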
Key Points
• Partitioning is tricky! By events? By items? By users?
• Empirically, "divide and conquer" plus averaging the per-partition estimates works well!
• Ensemble runs: after the fitted parameters are obtained, run n E-step-only jobs and take the average, each job using a different user-item mix
Identifiability Issues
• The same log-likelihood can be achieved by:
– Shifting: g(·) → g(·) + r, h(·) → h(·) − r
• Fix: center α, β, u to zero mean every E step
– Sign flips: u → −u, v → −v
• Fix: constrain v to be positive
– Switching factor columns (u·1, v·1) with (u·2, v·2)
• Fix: with ui ~ N(G(xi), I) and vj ~ N(H(xj), λI), constrain the diagonal entries λ1 ≥ λ2 ≥ …
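The centering fix for the first non-identifiability can be sketched as follows; absorbing the removed means into a global intercept `mu` is my assumption about where they go, not something the slide specifies:

```python
import numpy as np

def center_factors(alpha, beta, u, mu):
    """Remove the translation non-identifiability after each E-step:
    shift user effects alpha, item effects beta, and user factors u to
    zero mean. The means of alpha and beta are folded into the global
    intercept mu (an assumed bookkeeping choice).

    alpha: (n_users,), beta: (n_items,), u: (n_users, r)."""
    mu = mu + alpha.mean() + beta.mean()
    alpha = alpha - alpha.mean()
    beta = beta - beta.mean()
    u = u - u.mean(axis=0)               # center each factor column
    return alpha, beta, u, mu
```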
MovieLens 1M Data
• 75% training and 25% test, split by time
• Imbalanced data
– User rating = 1: positive
– User rating = 2, 3, 4, 5: negative
– 5% positive rate
• Balanced data
– User rating = 1, 2, 3: positive
– User rating = 4, 5: negative
– 44% positive rate
Matrix Factorization for User Profiles
• Offline user-profile-building period: obtain the user factor for each user i
• Online modeling using OLR
– If a user has a profile (warm start), use the fitted user factor as the user feature
– If not (cold start), use the regression prior G(xi) as the user feature
Offline Evaluation Metric Related to Clicks
• For model M and J live items (articles) at any time, replay over randomly served events:
S(M) = J × Σt clickt · 1{M's top pick at time t = the item actually served}
• If M is a random (constant) model, E[S(M)] = #clicks
• Unbiased estimate of the expected total clicks (Langford et al. 2008)
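A minimal replay evaluator in the spirit of Langford et al. (2008); the event-tuple layout is my assumption:

```python
def replay_clicks(model_pick, logged_events, n_live):
    """Unbiased offline click estimate on randomly served data: keep
    only events where the model would have shown the item that was
    actually served, then scale the matched clicks up by the number of
    live items n_live (since each item is served with probability
    1/n_live under random serving).

    logged_events: iterable of (context, served_item, click) tuples.
    model_pick: callable mapping a context to the item M would serve."""
    matched_clicks = sum(
        click for context, served, click in logged_events
        if model_pick(context) == served)
    return n_live * matched_clicks
```

A constant model matches roughly 1/n_live of the events, so the scaling makes its estimate average out to the total number of clicks in the log, as the slide notes.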
Experiments on Big Data
• Yahoo! front page Today module data
• Data for building user profiles: 8M users with at least 10 clicks (heavy users) in June 2011, 1B events
• Data for training and testing the OLR model: randomly served data with 2.4M clicks in July 2011
• Heavy users contributed around 30% of the clicks
• User features for OLR:
– Intercept only (MOST POPULAR)
– 124 behavioral-targeting features (BT-ONLY)
– BT + top 1000 clicked article ids (ITEM-PROFILE)
– BT + user profile with CTR on 43 binary content categories (CATEGORY-PROFILE)
– BT + profiles from matrix factorization models
Click Lift Performance for Different User Profiles
Web Advertising
There are lots of ads on the web: hundreds of billions of advertising dollars are spent online per year (eMarketer)
Online Advertising: a 6000-ft. Overview
[Diagram: advertisers supply ads to an ad network; the ad network picks ads to show alongside content from a content provider to the user. Examples: Yahoo, Google, MSN, RightMedia, …]
Web Advertising Comes in Different Flavors
• Sponsored ("paid") search
– Small text links shown in response to a query to a search engine
• Display advertising
– Graphical, banner, rich media; appears in several contexts, like visiting a webpage, checking e-mail, on a social network, …
– Goals of such advertising campaigns differ
• Brand awareness
• Performance (users are targeted to take some action, soon)
– More akin to direct marketing in the offline world
Paid Search: Advertised Text Links
Display Advertising: Examples
• LinkedIn company-follow ad
• Brand ad on Facebook
Paid Search Ads versus Display Ads
Paid search:
• Context (the query) is important
• Small text links
• Performance based (clicks, conversions)
• Advertisers can cherry-pick instances
Display:
• Reaching a desired audience
• Graphical, banner, rich media (text, logos, videos, …)
• Hybrid (brand, performance)
• Bulk buys by marketers, but things are evolving
• Ad exchanges, real-time bidding (RTB)
Display Advertising Models
• Futures market (guaranteed delivery)
– Brand awareness (e.g. Gillette, Coke, McDonald's, GM, …)
• Spot market (non-guaranteed)
– Marketers create targeted campaigns
• Ad exchanges have made this process efficient
– They connect buyers and sellers in a stock-market-style market
• Several portals like LinkedIn and Facebook have self-serve systems to book such campaigns
Guaranteed Delivery (Futures Market)
• Revenue model: cost per ad impression (CPM)
• Ads are bought in bulk, targeted to users based on demographics and other behavioral features
– GM ads on LinkedIn shown to "males above 55"
– Mortgage ads shown to "everybody on Y!"
• Slots are booked in advance and guaranteed
– E.g. "2M targeted ad impressions in January next year"
– Prices significantly higher than the spot market
– Higher-quality inventory delivered to maintain the mark-up
Measuring the Effectiveness of Brand Advertising
§ "Half the money I spend on advertising is wasted; the trouble is, I don't know which half." – John Wanamaker
• Typically
– Number of visits and engagement on the advertiser's website
– Increase in the number of searches for specific keywords
– Increase in offline sales in the long run
• How?
– Randomized design (treatment = ad exposure, control = no exposure)
– Sample surveys
– Covariate shift (propensity score matching)
• Several statistical challenges (experimental design, causal inference from observational data, survey methodology)
Guaranteed Delivery
• Fundamental problem: guarantee impressions (with overlapping inventory)
[Diagram: bipartite graph between supply pools (Young, US, Female, LI Homepage) with supplies si and demand nodes with demands dj, connected by allocations xij]
1. Predict supply
2. Incorporate/predict demand
3. Find the optimal allocation
• subject to supply and demand constraints
Example
[Diagram: supply pools — "US, Y, nF" (supply 2, price 1) and "US, Y, F" (supply 3, price 5) — feeding a demand of 2 impressions for "US & Y"]
How should we distribute impressions from the supply pools to satisfy this demand?
Example (Cherry-Picking)
• Cherry-picking: fulfill demands at least cost
[Diagram: the same supply pools; cherry-picking takes both impressions from the cheaper pool, "US, Y, nF" at price 1]
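The cherry-picking rule is just a greedy fill from the cheapest eligible pools; a toy version (the production system solves a full optimization, not a greedy heuristic):

```python
def cherry_pick(demand, supply_pools):
    """Fulfill a demand at least cost: sort the eligible supply pools
    by price and take impressions greedily from the cheapest.

    supply_pools: list of (name, supply, price) tuples.
    Returns {pool_name: impressions_taken}."""
    allocation = {}
    remaining = demand
    for name, supply, price in sorted(supply_pools, key=lambda p: p[2]):
        take = min(supply, remaining)
        if take > 0:
            allocation[name] = take
        remaining -= take
        if remaining == 0:
            break
    return allocation

# The example above: a demand of 2 impressions
pools = [("US,Y,nF", 2, 1), ("US,Y,F", 3, 5)]
print(cherry_pick(2, pools))  # {'US,Y,nF': 2} — all from the cheap pool
```

This is exactly why cherry-picking can starve the expensive, high-quality pools, which motivates the fairness constraints on the next slide.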
Example (Fairness)
• Cherry-picking: fulfill demands at least cost
• Fairness: equitable distribution across the available supply pools
[Diagram: the same supply pools — "US, Y, nF" (supply 2, cost 1) and "US, Y, F" (supply 3, cost 5); a fair allocation takes 1 impression from each pool]
• Agarwal and Tomlin, INFORMS, 2010
• Ghosh et al., EC, 2011
The Optimization Problem
• Maximize the value of remnant inventory (to be sold in the spot market)
– Subject to "fairness" constraints (to maintain high-quality inventory in the guaranteed market)
– Subject to supply and demand constraints
• Can be solved efficiently through a flow program
• Key statistical input: supply forecasts
Components of a Guaranteed Delivery System
• Offline components:
– Field sales team sells products (segments); contracts signed with advertisers, negotiations involved
– Pricing engine
– Admission control: should the new contract request be admitted? (solved via LP)
– Supply forecasts
– Demand forecasts and booked inventory
• Online serving:
– Online ad serving of the incoming opportunities
– Near-real-time optimization, driven by stochastic supply, stochastic demand, contract statistics, and the allocation plan (from the LP)
High-Dimensional Forecasting
• Supply forecasts are an important input, required both at booking time (admission control) and at serving time
• Problem: given historical time-series data in a high-dimensional space (trillions of combinations), forecast the number of visits for an arbitrary query over a future time horizon
– E.g.: male visits from Hawaii on LinkedIn next January
• Challenging statistical problem
– Curse of dimensionality and massive data
– Arbitrary query subsets
– Latency constraints
• Forecasting High-dimensional Data, Agarwal et al., SIGMOD, 2011
Other Challenges
• 3Ms: multi-response, multi-context modeling to optimize multiple objectives
– Multi-response: clicks, shares, comments, likes, … (preliminary work at CIKM 2012)
– Multi-context: mobile, desktop, email, … (preliminary work at SIGKDD 2011)
– Multi-objective: trade-offs among engagement, revenue, and viral activities (preliminary work at SIGIR 2012, SIGKDD 2011)
• Scaling model computations at run time to avoid latency issues
– Predictive indexing (preliminary work at WSDM 2012)
Bibliography
Agarwal, D. and Chen, B. (2009). Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 19–28. ACM.
Agarwal, D., Chen, B., and Elango, P. (2010). Fast online learning through offline initialization for time-sensitive recommendation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 703–712. ACM.
Bell, R., Koren, Y., and Volinsky, C. (2007). Modeling relationships at multiple scales to improve accuracy of large recommender systems. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 95–104. ACM.
Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(1), 265–285.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
Bickel, P. J., Götze, F., and van Zwet, W. R. (2012). Resampling fewer than n observations: gains, losses, and remedies for losses, 267–297. Springer, New York.
Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1–26.
Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. (2012). The big data bootstrap. arXiv preprint arXiv:1206.6415.
Khanna, R., Zhang, L., Agarwal, D., and Chen, B. (2012). Parallel matrix factorization for binary response. arXiv.org.
Zhang, L., Agarwal, D., and Chen, B. (2011). Generalizing matrix factorization through flexible regression priors. In Proceedings of the Fifth ACM Conference on Recommender Systems, 13–20. ACM.