Statistical Computing For Big Data
Deepak Agarwal
LinkedIn Applied Relevance Science
[email protected]
ENAR 2014, Baltimore, USA
Main Collaborators: several others at both Y! and LinkedIn
• I wouldn't be here without them; extremely lucky to work with such talented individuals
Bee-Chung Chen Liang Zhang Bo Long
Jonathan Traupman Paul Ogilvie
Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop System
  – Overview of Distributed Computing
  – Introduction to Map-Reduce
  – Some statistical computations using Map-Reduce
    • Bootstrap, Logistic Regression
• Part II: Recommender Systems for Web Applications
  – Introduction
  – Content Recommendation
  – Online Advertising
Big Data Becoming Ubiquitous
• Bioinformatics
• Astronomy
• Internet
• Telecommunications
• Climatology
• …
Big Data: Some Size Estimates
• 1000 human genomes: > 100TB of data (1000 Genomes Project)
• Sloan Digital Sky Survey: 200GB of data per night (> 140TB aggregated)
• Facebook: a billion monthly active users
• LinkedIn: roughly > 280M members worldwide
• Twitter: > 500 million tweets a day
• Over 6 billion mobile phones in the world generating data every day
Big Data: Paradigm Shift
• Classical Statistics
  – Generalize using small data
• Paradigm shift with big data
  – We now have an almost infinite supply of data
  – Easy statistics? Just appeal to asymptotic theory?
• So the issue is mostly computational?
  – Not quite
• More data comes with more heterogeneity
• Need to change our statistical thinking to adapt
  – Classical statistics still invaluable to think about big data analytics
Some Statistical Challenges
• Exploratory Data Analysis (EDA), Visualization
  – Retrospective (on terabytes)
  – More real time (streaming computations every few minutes/hours)
• Statistical Modeling
  – Scale (computational challenge)
  – Curse of dimensionality
    • Millions of predictors, heterogeneity
  – Temporal and spatial correlations
Statistical Challenges (continued)
• Experiments
  – To test new methods, test hypotheses from randomized experiments
  – Adaptive experiments
• Forecasting
  – Planning, advertising
• Many more that I am not fully versed in
Defining Big Data
• How do you know you have a big data problem?
  – Is it only the number of terabytes?
  – What about dimensionality, structured/unstructured data, computations required, …
• No clear definition; different points of view
  – One view: when the desired computation cannot be completed in the stipulated time with the current best algorithm using the cores available on a commodity PC
Distributed Computing for Big Data
• Distributed computing is an invaluable tool to scale computations for big data
• Some distributed computing models
  – Multi-threading
  – Graphics Processing Units (GPU)
  – Message Passing Interface (MPI)
  – Map-Reduce
Evaluating a Method for a Problem
• Scalability
  – Process X GB in Y hours
• Ease of use for a statistician
• Reliability (fault tolerance)
  – Especially in an industrial environment
• Cost
  – Hardware and cost of maintenance
• Good for the computations required?
  – E.g., iterative versus one pass
• Resource sharing
Multithreading
• Multiple threads take advantage of multiple CPUs
• Shared memory
• Threads can execute independently and concurrently
• Can only handle gigabytes of data
• Reliable
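As a minimal illustration of the shared-memory threading model described above (a sketch, not part of the original tutorial; note that CPython's GIL limits true CPU parallelism for pure-Python work, so this shows the programming model rather than a speedup):

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]   # 4 interleaved views of the shared list

def partial_sum(chunk):
    # Each thread reads the shared data directly; no copying or messaging needed.
    return sum(chunk)

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)  # same answer as sum(data)
```

The key contrast with MPI or Map-Reduce is that the threads communicate through the shared list itself rather than through messages or disk.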
Graphics Processing Units (GPU)
• Number of cores:
  – CPU: order of 10
  – GPU: smaller cores, order of 1000
• Can be > 100x faster than CPU
  – Parallel, computationally intensive tasks off-loaded to the GPU
• Good for certain computationally intensive tasks
• Can only handle gigabytes of data
• Not trivial to use; requires a good understanding of low-level architecture for efficient use
  – But things are changing; it is getting more user friendly
Message Passing Interface (MPI)
• Language-independent communication protocol among processes (e.g., computers)
• Most suitable for the master/slave model
• Can handle terabytes of data
• Good for iterative processing
• Fault tolerance is low
Map-Reduce (Dean & Ghemawat, 2004)

Data → Mappers → Reducers → Output

• Computation is split into a Map (scatter) stage and a Reduce (gather) stage
• Easy to use:
  – The user only needs to implement two functions: Mapper and Reducer
• Easily handles terabytes of data
• Very good fault tolerance (failed tasks automatically get restarted)
Comparison of Distributed Computing Methods

                           Multithreading | GPU | MPI | Map-Reduce
Scalability (data size):   Gigabytes | Gigabytes | Terabytes | Terabytes
Fault tolerance:           High | High | Low | High
Maintenance cost:          Low | Medium | Medium | Medium-High
Iterative process cost:    Cheap | Cheap | Cheap | Usually expensive
Resource sharing:          Hard | Hard | Easy | Easy
Easy to implement?         Easy | Needs understanding of low-level GPU architecture | Easy | Easy
Example Problem
• Tabulating word counts in a corpus of documents
• Similar to the table function in R
Word Count Through Map-Reduce

Input 1: "Hello World Bye World" → Mapper 1 → <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Input 2: "Hello Hadoop Goodbye Hadoop" → Mapper 2 → <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

Reducer 1 (words from A-G) → <Bye, 1> <Goodbye, 1>
Reducer 2 (words from H-Z) → <Hello, 2> <World, 2> <Hadoop, 2>
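The flow above can be simulated locally in a few lines. The sketch below is plain Python, not Hadoop code: it maps the two documents to <word, 1> pairs, shuffles the pairs to two "reducers" by key range (A-G vs. H-Z, as on the slide), and sums per key.

```python
from collections import defaultdict

def mapper(document):
    """Emit a <word, 1> pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs, num_reducers=2):
    """Route each key to a reducer; here by first letter (A-G -> 0, H-Z -> 1)."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        r = 0 if key[0].upper() <= 'G' else 1
        buckets[r][key].append(value)
    return buckets

def reducer(key, values):
    """Sum the counts for one key."""
    return (key, sum(values))

documents = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
mapped = [pair for doc in documents for pair in mapper(doc)]
results = {}
for bucket in shuffle(mapped):
    for key in sorted(bucket):        # each reducer sorts its data by key
        k, v = reducer(key, bucket[key])
        results[k] = v
print(results)  # Bye: 1, Goodbye: 1, Hadoop: 2, Hello: 2, World: 2
```

The mapper and reducer functions are exactly the two pieces a Hadoop user would implement; the shuffle is what the framework does for you.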
Key Ideas about Map-Reduce

Big Data → Partition 1, Partition 2, …, Partition N
→ Mapper 1, Mapper 2, …, Mapper N (each emitting <Key, Value> pairs)
→ Reducer 1, Reducer 2, …, Reducer M
→ Output 1, Output 2, …, Output M
Key Ideas about Map-Reduce
• Data are split into partitions and stored on many different machines on disk (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
• Data with the same key are sent to the same reducer; one reducer can receive multiple keys
• Every reducer sorts its data by key
• For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result
Compute Mean for Each Group

ID | Group No. | Score
1 1 0.5
2 3 1.0
3 1 0.8
4 2 0.7
5 2 1.5
6 3 1.2
7 1 0.8
8 2 0.9
9 4 1.3
… … …
Key Ideas about Map-Reduce
• Data are split into partitions and stored on many different machines on disk (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
  – For each row: Key = Group No., Value = Score
• Data with the same key are sent to the same reducer; one reducer can receive multiple keys
  – E.g., 2 reducers: Reducer 1 receives data with key = 1, 2; Reducer 2 receives data with key = 3, 4
• Every reducer sorts its data by key
  – E.g., Reducer 1: <key = 1, values = [0.5, 0.8, 0.8]>, <key = 2, values = [0.7, 1.5, 0.9]>
• For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result
  – E.g., Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)>
What You Need to Implement: Pseudo Code (in R)

Mapper:
  Input: Data
  for (row in Data) {
    groupNo = row$groupNo
    score = row$score
    Output(c(groupNo, score))
  }

Reducer:
  Input: Key (groupNo), Value (a list of scores that belong to the Key)
  count = 0
  sum = 0
  for (v in Value) {
    sum = sum + v
    count = count + 1
  }
  Output(c(Key, sum / count))
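As a sanity check, the same mapper/reducer pair can be run locally. This Python sketch (an illustration, not Hadoop code) reproduces the group means for the table above; the "shuffle" step stands in for what the framework does between Map and Reduce.

```python
from collections import defaultdict

# The (Group No., Score) columns of the slide's table.
rows = [(1, 0.5), (3, 1.0), (1, 0.8), (2, 0.7), (2, 1.5),
        (3, 1.2), (1, 0.8), (2, 0.9), (4, 1.3)]

def mapper(row):
    group_no, score = row
    return (group_no, score)                    # Key = Group No., Value = Score

def reducer(key, values):
    return (key, sum(values) / len(values))     # mean score for the group

# Shuffle: group all values by key, as the framework would.
grouped = defaultdict(list)
for key, value in map(mapper, rows):
    grouped[key].append(value)

means = dict(reducer(k, grouped[k]) for k in sorted(grouped))
print(means)  # group 1 -> mean(0.5, 0.8, 0.8), group 2 -> mean(0.7, 1.5, 0.9), ...
```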
Exercise 1
• Problem: Average height per {Grade, Gender}?
• What should be the mapper output key?
• What should be the mapper output value?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this
Student ID Grade Gender Height (cm)
1 3 M 120
2 2 F 115
3 2 M 116
… … …
• Problem: Average height per {Grade, Gender}?
• What should be the mapper output key?
  – {Grade, Gender}
• What should be the mapper output value?
  – Height
• What is the reducer input?
  – Key: {Grade, Gender}, Value: list of heights
• What is the reducer output?
  – {Grade, Gender, mean(Heights)}
Student ID Grade Gender Height (cm)
1 3 M 120
2 2 F 115
3 2 M 116
… … …
Exercise 2
• Problem: Number of students per {Grade, Gender}?
• What should be the mapper output key?
• What should be the mapper output value?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this
Student ID Grade Gender Height (cm)
1 3 M 120
2 2 F 115
3 2 M 116
… … …
• Problem: Number of students per {Grade, Gender}?
• What should be the mapper output key?
  – {Grade, Gender}
• What should be the mapper output value?
  – 1
• What is the reducer input?
  – Key: {Grade, Gender}, Value: list of 1's
• What is the reducer output?
  – {Grade, Gender, sum(value list)}
  – OR: {Grade, Gender, length(value list)}
Student ID Grade Gender Height (cm)
1 3 M 120
2 2 F 115
3 2 M 116
… … …
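Both exercises follow the same pattern as the group-mean example. The sketch below (plain Python, illustrative only; the student rows beyond the slide's first three are made up) uses the composite key {Grade, Gender} and computes the mean height (Exercise 1) and the count (Exercise 2) per group.

```python
from collections import defaultdict

# (Student ID, Grade, Gender, Height) rows; the last two are hypothetical.
students = [(1, 3, 'M', 120), (2, 2, 'F', 115), (3, 2, 'M', 116),
            (4, 3, 'M', 124), (5, 2, 'F', 113)]

# Map + shuffle: Key = (Grade, Gender), Value = Height.
grouped = defaultdict(list)
for _sid, grade, gender, height in students:
    grouped[(grade, gender)].append(height)

avg_height = {k: sum(v) / len(v) for k, v in grouped.items()}   # Exercise 1
counts = {k: len(v) for k, v in grouped.items()}                # Exercise 2

print(avg_height[(2, 'F')])  # mean of 115 and 113
print(counts[(3, 'M')])      # number of grade-3 males
```

Note that for Exercise 2 the reducer never needs the heights at all; emitting the constant 1 as the value is enough.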
More on Map-Reduce
• Depends on distributed file systems
• Typically mappers run on the data storage nodes
• Map/Reduce tasks automatically get restarted when they fail (good fault tolerance)
• Map and Reduce I/O are all on disk
  – Data transmission from mappers to reducers is through disk copy
• Iterative processing through Map-Reduce
  – Each iteration becomes a Map-Reduce job
  – Can be expensive since Map-Reduce overhead is high
The Apache Hadoop System
• Open-source software for reliable, scalable, distributed computing
• The most popular distributed computing system in the world
• Key modules:
  – Hadoop Distributed File System (HDFS)
  – Hadoop YARN (job scheduling and cluster resource management)
  – Hadoop MapReduce
Major Tools on Hadoop
• Pig
  – A high-level language for Map-Reduce computation
• Hive
  – A SQL-like query language for data querying via Map-Reduce
• HBase
  – A distributed & scalable database on Hadoop
  – Allows random, real-time read/write access to big data
  – Voldemort is similar to HBase
• Mahout
  – A scalable machine learning library
• …
Hadoop Installation
• Setting up Hadoop on your desktop/laptop:
  – http://hadoop.apache.org/docs/stable/single_node_setup.html
• Setting up Hadoop on a cluster of machines:
  – http://hadoop.apache.org/docs/stable/cluster_setup.html
Hadoop Distributed File System (HDFS)
• Master/slave architecture
• NameNode: a single master node that controls which data block is stored where
• DataNodes: slave nodes that store data and do R/W operations
• Clients (Gateway): allow users to log in, interact with HDFS, and submit Map-Reduce jobs
• Big data is split into equal-sized blocks; each block can be stored on different DataNodes
• Disk failure tolerance: data are replicated multiple times
Load the Data into Pig
• A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID: int, groupNo: int, score: float);
  – The path of the data on HDFS comes after LOAD
• USING PigStorage() means the data are delimited by tabs (can be omitted)
• If data are delimited by other characters, e.g. space, use USING PigStorage(' ')
• The data schema is defined after AS
• Variable types: int, long, float, double, chararray, …
Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop System
  – Overview of Distributed Computing
  – Introduction to Map-Reduce
  – Introduction to the Hadoop System
  – Examples of Statistical Computing for Big Data
    • Bag of Little Bootstraps
    • Large-Scale Logistic Regression
Bag of Little Bootstraps
Kleiner et al. 2012

Bootstrap (Efron, 1979)
• A re-sampling based method to obtain the statistical distribution of sample estimators
• Why are we interested?
  – Re-sampling is embarrassingly parallelizable
• For example: standard deviation of the mean of N samples (μ)
  – For i = 1 to r do
    • Randomly sample with replacement N times from the original sample → bootstrap data i
    • Compute the mean of the i-th bootstrap data → μi
  – Estimate of Sd(μ) = Sd([μ1, …, μr])
  – r is usually a large number, e.g. 200
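A minimal sketch of this procedure (NumPy; the sample data and the choice r = 200 are illustrative, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)   # the original sample, N = 10,000
N, r = len(x), 200

# r bootstrap replicates of the mean: sample N times with replacement each round.
boot_means = np.array([rng.choice(x, size=N, replace=True).mean()
                       for _ in range(r)])

sd_mu = boot_means.std(ddof=1)          # bootstrap estimate of Sd(mean)
analytic = x.std(ddof=1) / np.sqrt(N)   # classical estimate, for comparison
print(sd_mu, analytic)                  # the two should roughly agree
```

Each replicate is independent of the others, which is why the loop parallelizes trivially, e.g. one replicate per node.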
Bootstrap for Big Data
• Can have r nodes running in parallel, each sampling one bootstrap data set
• However…
  – N can be very large
  – Data may not fit into memory
  – Collecting N samples with replacement on each node can be computationally expensive
M out of N Bootstrap (Bickel et al. 1997)
• Obtain SdM(μ) by drawing M samples with replacement for each bootstrap, where M < N
• Apply an analytical correction to SdM(μ) to obtain Sd(μ), using prior knowledge of the convergence rate of sample estimates
• However…
  – Prior knowledge is required
  – The choice of M is critical to performance
  – Finding the optimal value of M needs more computation
Bag of Little Bootstraps (BLB)
• Example: standard deviation of the mean
• Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)
• For each data set p = 1 to S do
  – For i = 1 to r do
    • Draw N samples with replacement from the data of size b
    • Compute the mean of the resampled data → μpi
  – Compute Sdp(μ) = Sd([μp1, …, μpr])
• Estimate of Sd(μ) = Avg([Sd1(μ), …, SdS(μ)])
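The procedure above can be sketched as follows (NumPy; the data, b = N^0.7, and the small S and r are illustrative choices). Since a resample of size N drawn from b distinct points is just a vector of multinomial counts, each replicate works with length-b weights rather than N points, which is exactly the efficiency argument made on the "Why is BLB Efficient" slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=20_000)   # full data, size N
N = len(x)
b = int(N ** 0.7)        # subset size b = N^gamma, here gamma = 0.7
S, r = 5, 100            # number of subsets, bootstraps per subset

subset_sds = []
for _ in range(S):
    sub = rng.choice(x, size=b, replace=False)     # one size-b subset
    means = np.empty(r)
    for i in range(r):
        # Resampling N points from b values == drawing multinomial counts.
        counts = rng.multinomial(N, np.full(b, 1.0 / b))
        means[i] = (counts * sub).sum() / N        # weighted mean, O(b) work
    subset_sds.append(means.std(ddof=1))

sd_mu = np.mean(subset_sds)   # average the S per-subset estimates
print(sd_mu)                  # should be close to sigma / sqrt(N)
```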
Bag of Little Bootstraps (BLB)
• Interest: ξ(θ), where θ is an estimate obtained from data of size N
  – ξ is some function of θ, such as the standard deviation, …
• Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)
• For each data set p = 1 to S do
  – For i = 1 to r do
    • Draw N samples with replacement from the data of size b
    • Compute the estimate θpi from the resampled data
  – Compute ξp(θ) = ξ([θp1, …, θpr])
• Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])
(Slide diagram: the BLB steps mapped onto Hadoop roles as Mapper, Reducer, and Gateway)
Why is BLB Efficient?
• Before:
  – N samples with replacement from data of size N is expensive when N is large
• Now:
  – N samples with replacement from data of size b
  – b can be several orders of magnitude smaller than N (e.g. b = N^γ, γ ∈ [0.5, 1))
  – Equivalent to a multinomial sampler with dimension b
  – Storage = O(b), computational complexity = O(b)
Simulation Experiment
• 95% CI of logistic regression coefficients
• N = 20,000, 10 explanatory variables
• Relative error = |estimated CI width − true CI width| / true CI width
• BLB-γ: BLB with b = N^γ
• BOFN-γ: b out of N sampling with b = N^γ
• BOOT: naïve bootstrap
Real Data
• 95% CI of logistic regression coefficients
• N = 6M, 3000 explanatory variables
• Data size = 150GB, r = 50, s = 5, γ = 0.7
Summary of BLB
• A new algorithm for bootstrapping on big data
• Advantages
  – Fast and efficient
  – Easy to parallelize
  – Easy to understand and implement
  – Friendly to Hadoop; makes it routine to perform statistical calculations on big data
Large Scale Logistic Regression

Logistic Regression
• Binary response: Y
• Covariates: X
• Yi ~ Bernoulli(pi)
• log(pi / (1 − pi)) = Xi^T β;  β ~ MVN(0, (1/λ) I)
• Widely used (research and applications)
Large Scale Logistic Regression
• Binary response: Y
  – E.g., click / non-click on an ad on a webpage
• Covariates: X
  – User covariates: age, gender, industry, education, job, job title, …
  – Item covariates: categories, keywords, topics, …
  – Context covariates: time, page type, position, …
  – 2-way interactions: user covariates × item covariates, context covariates × item covariates, …
Computational Challenge
• Hundreds of millions/billions of observations
• Hundreds of thousands/millions of covariates
• Fitting such a logistic regression model on a single machine is not feasible
• Model fitting is iterative, using methods like gradient descent, Newton's method, etc.
  – Multiple passes over the data
Recap on Optimization Methods
• Problem: find x to minimize F(x)
• Iteration n: xn = xn−1 − bn−1 F′(xn−1)
• bn−1 is the step size, which can change every iteration
• Iterate until convergence
• Conjugate gradient, L-BFGS, Newton trust region, … are all of this kind
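The iteration above, sketched for a toy one-dimensional problem (illustrative only; F(x) = (x − 3)^2 and the fixed step size are arbitrary choices):

```python
def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Iterate x_n = x_{n-1} - b * F'(x_{n-1}) until the update is tiny."""
    x = x0
    for _ in range(max_iter):
        x_new = x - step * grad(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Minimize F(x) = (x - 3)^2, whose gradient is F'(x) = 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # converges to 3
```

On big data the expensive part is evaluating F′, which requires a full pass over the data at every iteration; this is what makes the Hadoop cost model discussed next so important.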
Iterative Process with Hadoop

Disk → Mappers → Disk → Reducers → Disk → Mappers → Disk → Reducers → …
Limitations of Hadoop for Fitting a Big Logistic Regression
• The iterative process is expensive and slow
• Every iteration = a Map-Reduce job
• I/O of mappers and reducers is all through disk
• Plus: time spent waiting in the queue
• Q: Can we find a fitting method that scales with Hadoop?
Large Scale Logistic Regression
• Naïve:
  – Partition the data and run logistic regression for each partition
  – Take the mean of the learned coefficients
  – Problem: not guaranteed to converge to the model from a single machine!
• Alternating Direction Method of Multipliers (ADMM)
  – Boyd et al. 2011
  – Set up constraints: each partition's coefficient = global consensus
  – Solve the optimization problem using Lagrange multipliers
  – Advantage: guaranteed to converge to a single-machine logistic regression on the entire data within a reasonable number of iterations
Large Scale Logistic Regression via ADMM

BIG DATA → Partition 1, Partition 2, Partition 3, …, Partition K
→ Logistic regression fit on each partition (in parallel)
→ Consensus computation
(Iteration 1)
(Iterations 2, 3, …: the same per-partition logistic regressions and consensus computation repeat until convergence.)
Details of ADMM

Dual Ascent Method
• Consider a convex optimization problem
• Lagrangian for the problem
• Dual ascent
2 Precursors

In this section, we briefly review two optimization algorithms that are precursors to the alternating direction method of multipliers. While we will not use this material in the sequel, it provides some useful background and motivation.

2.1 Dual Ascent

Consider the equality-constrained convex optimization problem

minimize f(x)
subject to Ax = b,    (2.1)

with variable x ∈ R^n, where A ∈ R^{m×n} and f: R^n → R is convex. The Lagrangian for problem (2.1) is

L(x, y) = f(x) + y^T (Ax − b)

and the dual function is

g(y) = inf_x L(x, y) = −f*(−A^T y) − b^T y,

where y is the dual variable or Lagrange multiplier, and f* is the convex conjugate of f; see [20, §3.3] or [140, §12] for background. The dual
problem is

maximize g(y),

with variable y ∈ R^m. Assuming that strong duality holds, the optimal values of the primal and dual problems are the same. We can recover a primal optimal point x* from a dual optimal point y* as

x* = argmin_x L(x, y*),

provided there is only one minimizer of L(x, y*). (This is the case if, e.g., f is strictly convex.) In the sequel, we will use the notation argmin_x F(x) to denote any minimizer of F, even when F does not have a unique minimizer.

In the dual ascent method, we solve the dual problem using gradient ascent. Assuming that g is differentiable, the gradient ∇g(y) can be evaluated as follows. We first find x+ = argmin_x L(x, y); then we have ∇g(y) = Ax+ − b, which is the residual for the equality constraint. The dual ascent method consists of iterating the updates

x^{k+1} := argmin_x L(x, y^k)    (2.2)
y^{k+1} := y^k + α^k (Ax^{k+1} − b),    (2.3)

where α^k > 0 is a step size, and the superscript is the iteration counter. The first step (2.2) is an x-minimization step, and the second step (2.3) is a dual variable update. The dual variable y can be interpreted as a vector of prices, and the y-update is then called a price update or price adjustment step. This algorithm is called dual ascent since, with appropriate choice of α^k, the dual function increases in each step, i.e., g(y^{k+1}) > g(y^k).

The dual ascent method can be used even in some cases when g is not differentiable. In this case, the residual Ax^{k+1} − b is not the gradient of g, but the negative of a subgradient of −g. This case requires a different choice of α^k than when g is differentiable, and convergence is not monotone; it is often the case that g(y^{k+1}) ≯ g(y^k). In this case, the algorithm is usually called the dual subgradient method [152].

If α^k is chosen appropriately and several other assumptions hold, then x^k converges to an optimal point and y^k converges to an optimal
Augmented Lagrangians
• Bring robustness to the dual ascent method
• Yield convergence without assumptions like strict convexity or finiteness of f
• The value of ρ influences the convergence rate
collected (gathered) in order to compute the residual Ax^{k+1} − b. Once the (global) dual variable y^{k+1} is computed, it must be distributed (broadcast) to the processors that carry out the N individual x_i minimization steps (2.4).

Dual decomposition is an old idea in optimization, and traces back at least to the early 1960s. Related ideas appear in well known work by Dantzig and Wolfe [44] and Benders [13] on large-scale linear programming, as well as in Dantzig's seminal book [43]. The general idea of dual decomposition appears to be originally due to Everett [69], and is explored in many early references [107, 84, 117, 14]. The use of nondifferentiable optimization, such as the subgradient method, to solve the dual problem is discussed by Shor [152]. Good references on dual methods and decomposition include the book by Bertsekas [16, chapter 6] and the survey by Nedic and Ozdaglar [131] on distributed optimization, which discusses dual decomposition methods and consensus problems. A number of papers also discuss variants on standard dual decomposition, such as [129].

More generally, decentralized optimization has been an active topic of research since the 1980s. For instance, Tsitsiklis and his co-authors worked on a number of decentralized detection and consensus problems involving the minimization of a smooth function f known to multiple agents [160, 161, 17]. Some good reference books on parallel optimization include those by Bertsekas and Tsitsiklis [17] and Censor and Zenios [31]. There has also been some recent work on problems where each agent has its own convex, potentially nondifferentiable, objective function [130]. See [54] for a recent discussion of distributed methods for graph-structured optimization problems.
2.3 Augmented Lagrangians and the Method of Multipliers

Augmented Lagrangian methods were developed in part to bring robustness to the dual ascent method, and in particular, to yield convergence without assumptions like strict convexity or finiteness of f. The augmented Lagrangian for (2.1) is

L_ρ(x, y) = f(x) + y^T (Ax − b) + (ρ/2) ||Ax − b||₂²,    (2.6)

where ρ > 0 is called the penalty parameter. (Note that L_0 is the standard Lagrangian for the problem.) The augmented Lagrangian can be viewed as the (unaugmented) Lagrangian associated with the problem

minimize f(x) + (ρ/2) ||Ax − b||₂²
subject to Ax = b.

This problem is clearly equivalent to the original problem (2.1), since for any feasible x the term added to the objective is zero. The associated dual function is g_ρ(y) = inf_x L_ρ(x, y).

The benefit of including the penalty term is that g_ρ can be shown to be differentiable under rather mild conditions on the original problem. The gradient of the augmented dual function is found the same way as with the ordinary Lagrangian, i.e., by minimizing over x, and then evaluating the resulting equality constraint residual. Applying dual ascent to the modified problem yields the algorithm

x^{k+1} := argmin_x L_ρ(x, y^k)    (2.7)
y^{k+1} := y^k + ρ (Ax^{k+1} − b),    (2.8)

which is known as the method of multipliers for solving (2.1). This is the same as standard dual ascent, except that the x-minimization step uses the augmented Lagrangian, and the penalty parameter ρ is used as the step size α^k. The method of multipliers converges under far more general conditions than dual ascent, including cases when f takes on the value +∞ or is not strictly convex.

It is easy to motivate the choice of the particular step size ρ in the dual update (2.8). For simplicity, we assume here that f is differentiable, though this is not required for the algorithm to work. The optimality conditions for (2.1) are primal and dual feasibility, i.e.,

Ax* − b = 0,    ∇f(x*) + A^T y* = 0,

respectively. By definition, x^{k+1} minimizes L_ρ(x, y^k), so

0 = ∇_x L_ρ(x^{k+1}, y^k)
  = ∇f(x^{k+1}) + A^T (y^k + ρ (Ax^{k+1} − b))
  = ∇f(x^{k+1}) + A^T y^{k+1}.
Alternating Direction Method of Multipliers (ADMM)
• Problem
• Augmented Lagrangian
• ADMM updates
3 Alternating Direction Method of Multipliers

3.1 Algorithm

ADMM is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers. The algorithm solves problems in the form

minimize f(x) + g(z)
subject to Ax + Bz = c    (3.1)

with variables x ∈ R^n and z ∈ R^m, where A ∈ R^{p×n}, B ∈ R^{p×m}, and c ∈ R^p. We will assume that f and g are convex; more specific assumptions will be discussed in §3.2. The only difference from the general linear equality-constrained problem (2.1) is that the variable, called x there, has been split into two parts, called x and z here, with the objective function separable across this splitting. The optimal value of the problem (3.1) will be denoted by

p* = inf{f(x) + g(z) | Ax + Bz = c}.

As in the method of multipliers, we form the augmented Lagrangian

L_ρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz − c) + (ρ/2) ||Ax + Bz − c||₂².
ADMM consists of the iterations

x^{k+1} := argmin_x L_ρ(x, z^k, y^k)    (3.2)
z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k)    (3.3)
y^{k+1} := y^k + ρ (Ax^{k+1} + Bz^{k+1} − c),    (3.4)

where ρ > 0. The algorithm is very similar to dual ascent and the method of multipliers: it consists of an x-minimization step (3.2), a z-minimization step (3.3), and a dual variable update (3.4). As in the method of multipliers, the dual variable update uses a step size equal to the augmented Lagrangian parameter ρ.

The method of multipliers for (3.1) has the form

(x^{k+1}, z^{k+1}) := argmin_{x,z} L_ρ(x, z, y^k)
y^{k+1} := y^k + ρ (Ax^{k+1} + Bz^{k+1} − c).

Here the augmented Lagrangian is minimized jointly with respect to the two primal variables. In ADMM, on the other hand, x and z are updated in an alternating or sequential fashion, which accounts for the term alternating direction. ADMM can be viewed as a version of the method of multipliers where a single Gauss-Seidel pass [90, §10.1] over x and z is used instead of the usual joint minimization. Separating the minimization over x and z into two steps is precisely what allows for decomposition when f or g are separable.

The algorithm state in ADMM consists of z^k and y^k. In other words, (z^{k+1}, y^{k+1}) is a function of (z^k, y^k). The variable x^k is not part of the state; it is an intermediate result computed from the previous state (z^{k−1}, y^{k−1}).

If we switch (re-label) x and z, f and g, and A and B in the problem (3.1), we obtain a variation on ADMM with the order of the x-update step (3.2) and z-update step (3.3) reversed. The roles of x and z are almost symmetric, but not quite, since the dual update is done after the z-update but before the x-update.
Large Scale Logistic Regression via ADMM
• Notation
  – (Xi, yi): data in the i-th partition
  – βi: coefficient vector for partition i
  – β: consensus coefficient vector
  – r(β): penalty component such as ||β||₂²
• Optimization problem
min  Σ_{i=1}^{N} l_i(y_i, X_i^T β_i) + r(β)
subject to β_i = β,  i = 1, …, N
ADMM Updates
• Local regressions (one per partition), with shrinkage towards the current best global estimate
• Updated consensus
An Example Implementation
• ADMM for logistic regression model fitting with an L2/L1 penalty
• Each iteration of ADMM is a Map-Reduce job
  – Mapper: partition the data into K partitions
  – Reducer: for each partition, use liblinear/glmnet to fit an L1/L2 logistic regression
  – Gateway: computes the consensus from the results of all reducers, and sends the consensus back to each reducer node
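The scheme above can be sketched in miniature as consensus ADMM for L2-regularized logistic regression (NumPy; illustrative only: synthetic data, plain gradient descent standing in for liblinear/glmnet, and made-up values of λ, ρ, K, and the iteration counts):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data standing in for a partitionable data set (hypothetical).
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(2000, 3))
y = (rng.random(2000) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

lam, rho = 1.0, 1.0            # L2 penalty weight and ADMM penalty parameter
K = 4                          # number of partitions
parts = np.array_split(np.arange(len(y)), K)

def local_fit(Xp, yp, v, n_steps=200, lr=0.1):
    """Reducer step: minimize mean logistic loss + (rho/2)||b - v||^2 by gradient descent."""
    b = v.copy()
    for _ in range(n_steps):
        p = 1 / (1 + np.exp(-Xp @ b))
        grad = Xp.T @ (p - yp) / len(yp) + rho * (b - v)
        b = b - lr * grad
    return b

beta = np.zeros(3)             # consensus coefficients
u = np.zeros((K, 3))           # scaled dual variables, one per partition

for _ in range(30):            # each pass would be one Map-Reduce job
    # Local logistic regressions, shrunk towards the current consensus.
    B = np.array([local_fit(X[idx], y[idx], beta - u[i])
                  for i, idx in enumerate(parts)])
    # Consensus step (gateway): closed form when r(beta) = (lam/2)||beta||^2.
    beta = rho * (B + u).sum(axis=0) / (lam + K * rho)
    u = u + (B - beta)         # dual update
print(beta)  # local fits and the consensus should have (nearly) agreed by now
```

The local fits and the consensus step are embarrassingly parallel, which is what makes each ADMM iteration fit naturally into a single Map-Reduce job.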
KDD Cup 2010 Data
• Bridge to Algebra 2008-2009 data from https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
• Binary response, 20M covariates
• Only keeping covariates with >= 10 occurrences => 2.2M covariates
• Training data: 8,407,752 samples
• Test data: 510,302 samples
Avg Training Log-Likelihood vs. Number of Iterations (figure)
Test AUC vs. Number of Iterations (figure)

Better Convergence Can Be Achieved By
• Better initialization
  – Use results from the naïve method to initialize the parameters
• Adaptively changing the step size (ρ) each iteration based on the convergence status of the consensus
Recommender Problems for Web Applications
Agenda
• Topic of interest
  – Recommender problems for dynamic, time-sensitive applications
    • Content optimization, online advertising, movie recommendation, shopping, …
• Introduction
• Offline components
  – Regression, collaborative filtering (CF), …
• Online components + initialization
  – Time-series, online/incremental methods, explore/exploit (bandit)
• Evaluation methods + multi-objective
• Challenges
Three Components We Will Focus On
• Defining the problem
  – Formulate objectives whose optimization achieves some long-term goals for the recommender system
    • E.g., how to serve content to optimize audience reach and engagement, or to optimize some combination of engagement and revenue?
• Modeling (to estimate some critical inputs)
  – Predict rates of some positive user interaction(s) with items, based on data obtained from historical user-item interactions
    • E.g., click rates, average time spent on page, etc.
    • Could be explicit feedback like ratings
• Experimentation
  – Create experiments to collect data proactively to improve models; helps in converging to the best choice(s) cheaply and rapidly
    • Explore and exploit (continuous experimentation)
    • DOE (testing hypotheses by avoiding bias inherent in data)
Modern Recommendation Systems
• Goal
  – Serve the right item to a user in a given context to optimize long-term business objectives
• A scientific discipline that involves
  – Large scale machine learning & statistics
    • Offline models (capture global & stable characteristics)
    • Online models (incorporate dynamic components)
    • Explore/exploit (active and adaptive experimentation)
  – Multi-objective optimization
    • Click-rates (CTR), engagement, advertising revenue, diversity, etc.
  – Inferring user interest
    • Constructing user profiles
  – Natural language processing to understand content
    • Topics, "aboutness", entities, follow-up of something, breaking news, …
Some examples from content optimization
• Simple version – I have a content module on my page, content inventory is
obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to improve overall click-rate (CTR) on this module
• More advanced – I got X% lift in CTR. But I have additional information on
other downstream utilities (e.g. advertising revenue). Can I increase downstream utility without losing too many clicks?
• Highly advanced – There are multiple modules running on my webpage. How
do I perform a simultaneous optimization?
Recommend applications
Recommend search queries
Recommend news article
Recommend packages: Image Title, summary Links to other pages
Pick 4 out of a pool of K (K = 20 ~ 40); pool is dynamic; routes traffic to other pages
Problems in this example • Optimize CTR on multiple modules
– Today Module, Trending Now, Personal Assistant, News
– Simple solution: Treat modules as independent, optimize separately. May not be the best when there are strong correlations.
• For any single module – Optimize some combination of CTR, downstream
engagement, and perhaps advertising revenue.
Online Advertising
[Diagram] Advertisers → Ad Network → Ads; the publisher's page recommends the best ad(s) to the user; response rates (click, conversion, ad-view) and bids feed an auction that selects argmax f(bid, response rates) using an ML/statistical model
Examples: Yahoo, Google, MSN, …
Ad exchanges (RightMedia, DoubleClick, …)
LinkedIn Today: Content Module
Objective: Serve content to maximize engagement metrics like CTR (or weighted CTR)
LinkedIn Ads: Match ads to users visiting LinkedIn
Right Media Ad Exchange: Unified Marketplace
Match ads to page views on publisher sites
[Auction diagram] The publisher has an ad impression to sell via auction. Bids: $0.50; $0.75 via a network, which becomes a $0.45 bid; $0.60 (bidders include AdSense and Ad.com); the $0.65 bid WINS
Recommender problems in general
USER
Item Inventory: articles, web pages, ads, …
Use an automated algorithm to select item(s) to show
Get feedback (click, time spent, …)
Refine the models
Repeat (large number of times); optimize metric(s) of interest (total clicks, total revenue, …)
Example applications Search: Web, Vertical Online Advertising Content …..
Context query, page, …
• Items: Articles, ads, modules, movies, users, updates, etc.
• Context: query keywords, pages, mobile, social media, etc.
• Metric to optimize (e.g., relevance score, CTR, revenue, engagement) – Currently, most applications are single-objective – Could be multi-objective optimization (maximize X subject to Y, Z,..)
• Properties of the item pool – Size (e.g., all web pages vs. 40 stories) – Quality of the pool (e.g., anything vs. editorially selected) – Lifetime (e.g., mostly old items vs. mostly new items)
Important Factors
Factors affecting Solution (continued)
• Properties of the context – Pull: Specified by explicit, user-driven query (e.g., keywords, a form) – Push: Specified by implicit context (e.g., a page, a user, a session)
• Most applications are somewhere on continuum of pull and push
• Properties of the feedback on the matches made
– Types and semantics of feedback (e.g., click, vote) – Latency (e.g., available in 5 minutes vs. 1 day) – Volume (e.g., 100K per day vs. 300M per day)
• Constraints specifying legitimate matches – e.g., business rules, diversity rules, editorial voice – Multiple objectives
• Available Metadata (e.g., link graph, various user/item attributes)
Predicting User-Item Interactions (e.g. CTR)
• Myth: We have so much data on the web, if we can only process it the problem is solved – Number of things to learn increases with sample size
• Rate of increase is not slow – Dynamic nature of systems make things worse – We want to learn things quickly and react fast
• Data is sparse in web recommender problems – We lack enough data to learn all we want to learn and
as quickly as we would like to learn – Several Power laws interacting with each other
• E.g. User visits power law, items served power law – Bivariate Zipf: Owen & Dyer, 2011
Can Machine Learning help? • Fortunately, there are group behaviors that generalize to
individuals & they are relatively stable – E.g. Users in San Francisco tend to read more baseball news
• Key issue: Estimating such groups – Coarse group : more stable but does not generalize that well. – Granular group: less stable with few individuals – Getting a good grouping structure is to hit the “sweet spot”
• Another big advantage on the web – Intervene and run small experiments on a small population to
collect data that helps rapid convergence to the best choices(s) • We don’t need to learn all user-item interactions, only those that are good.
Predicting user-item interaction rates
Offline (captures stable characteristics at coarse resolutions) (logistic, boosting, …)
Feature construction: Content: IR, clustering, taxonomy, entity, …
User profiles: clicks, views, social, community, …
Near Online (finer-resolution corrections) (item, user level) (quick updates)
Explore/Exploit (adaptive sampling)
(helps rapid convergence to best choices)
Initialize
Post-click: An example in Content Optimization
Recommender EDITORIAL
content Clicks on FP links influence downstream supply distribution
AD SERVER DISPLAY ADVERTISING Revenue
Downstream engagement (Time spent)
Serving Content on Front Page: Click Shaping
• What do we want to optimize? • Current: Maximize clicks (maximize downstream supply from FP) • But consider the following
– Article 1: CTR=5%, utility per click = 5 – Article 2: CTR=4.9%, utility per click=10
• By promoting 2, we lose 1 click/100 visits, gain 5 utils • If we do this for a large number of visits --- lose some clicks but
obtain significant gains in utility? – E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement,
etc)
High level picture
http request
Statistical Models updated in Batch mode: e.g. once every 30 mins
Server
Item Recommendation system: thousands of computations in sub-seconds
User Interacts e.g. click, does nothing
High level overview: Item Recommendation System
User Info
Item Index Id, meta-data
ML/ Statistical Models
Score Items P(Click), P(share), Semantic-relevance score,….
Rank Items: sort by score (CTR,bid*CTR,..) combine scores using Multi-obj optim, Threshold on some scores,….
User-item interaction Data: batch process
Updated in batch: Activity, profile
Pre-filter: SPAM, editorial, … Feature extraction: NLP, clustering, …
ML/Statistical models for scoring
[Diagram] Number of items scored by ML (~100 to ~100M) vs. traffic volume, with item lifetimes from a few hours to several days: LinkedIn Today, Yahoo! Front Page, Right Media Ad Exchange, LinkedIn Ads
Summary of deployments • Yahoo! Front Page Today Module (2008-2011): 300% improvement in click-through rates
– Similar algorithms delivered via a self-serve platform, adopted by several Yahoo! properties (2011): significant improvement in engagement across the Yahoo! network
• Fully deployed on LinkedIn Today Module (2012): Significant improvement in click-through rates (numbers not revealed for reasons of confidentiality)
• Yahoo! RightMedia exchange (2012): Fully deployed algorithms to estimate response rates (CTR, conversion rates). Significant improvement in revenue (numbers not revealed for reasons of confidentiality)
• LinkedIn self-serve ads (2012-2013): Fully deployed • LinkedIn News Feed (2013-2014): Fully deployed • Several others in progress…
Broad Themes • Curse of dimensionality
– Large number of observations (rows), large number of potential features (columns)
– Use domain knowledge and machine learning to reduce the "effective" dimension (constraints on parameters reduce degrees of freedom)
• I will give examples as we move along
• We often assume our job is to analyze "Big Data", but we often have control over what data to collect through clever experimentation – This can fundamentally change solutions
• Think of computation and models together for Big Data • Optimization: What we are trying to optimize is often complex; models must work in harmony with optimization
– Pareto optimality with competing objectives
Statistical Problem • Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest
• Examples of utility functions
– Click-rates (CTR) – Share-rates (CTR × [Share|Click]) – Revenue per page-view = CTR × bid (more complex due to second-price auction)
• CTR is a fundamental measure that opens the door to a more principled approach to rank items
• Converge rapidly to maximum-utility items – Sequential decision-making process (explore/exploit)
[Diagram] User i with user features (e.g., industry, behavioral features, demographic features, …) visits; the algorithm selects item j from a set of candidates; (i, j): response yij (click or not)
Which item should we select?
• The item with the highest predicted CTR (Exploit)
• An item for which we need data to predict its CTR (Explore)
LinkedIn Today, Yahoo! Today Module: Choose Items to maximize CTR This is an “Explore/Exploit” Problem
The Explore/Exploit Problem (to maximize CTR)
• Problem definition: Pick k items from a pool of N for a large number of serves to maximize the number of clicks on the picked items
• Easy!? Pick the items having the highest click-through rates (CTRs)
• But … – The system is highly dynamic:
• Items come and go with short lifetimes • CTR of each item may change over time
– How much traffic should be allocated to explore new items to achieve optimal performance ?
• Too little → Unreliable CTR estimates due to “starvation” • Too much → Little traffic to exploit the high CTR items
Y! Front Page Application
• Simplify: Maximize CTR on first slot (F1)
• Item Pool – Editorially selected for high quality and brand image – Few articles in the pool, but the item pool is dynamic
CTR Curves of Items on LinkedIn Today
Impact of repeat item views on a given user
• Same user is shown an item multiple times (despite not clicking)
Simple algorithm to estimate most popular item with small but dynamic item pool
• Simple Explore/Exploit scheme – ε% explore: with a small probability (e.g. 5%), choose an item at random from the pool
– (100−ε)% exploit: with large probability (e.g. 95%), choose the highest-scoring CTR item
• Temporal Smoothing – Item CTRs change over time; provide more weight to recent data in estimating item CTRs
• Kalman filter, moving average
• Discount item score with repeat views – CTR(item) for a given user drops with repeat views by some "discount" factor (estimated from data)
• Segmented most popular – Perform separate most-popular for each user segment
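The ε-greedy scheme above fits in a few lines of Python. This is an illustrative sketch (not from the slides): Laplace smoothing of the CTR estimate avoids 0/0 for brand-new items, and an exponential-decay update stands in for the temporal smoothing bullet.

```python
import random

def serve(items, eps=0.05):
    """Pick an item id: explore with probability eps, else exploit.

    `items` maps item id -> (clicks, views); CTR estimated as clicks/views.
    """
    if random.random() < eps:
        return random.choice(list(items))  # explore uniformly at random
    # exploit: highest smoothed CTR estimate (add-one smoothing)
    return max(items, key=lambda i: (items[i][0] + 1) / (items[i][1] + 2))

def update(items, item, clicked, decay=0.99):
    """Record feedback, down-weighting old data (temporal smoothing)."""
    c, n = items[item]
    items[item] = (c * decay + clicked, n * decay + 1)
```

Segmented most popular would simply keep one such `items` table per user segment.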
Time-series Model: Kalman filter • Dynamic Gamma-Poisson: click-rate evolves over time in a multiplicative fashion
• Estimated click-rate distribution at time t+1 – Prior mean: µ_{t+1} = µ_{t|t}
– Prior variance: σ²_{t+1} = σ²_{t|t} + η(µ²_{t|t} + σ²_{t|t})
High CTR items are more adaptive
More economical exploration? Better bandit solutions
• Consider a two-armed problem with unknown payoff probabilities p1 and p2, where p1 > p2
The gambler has 1000 plays; what is the best way to experiment (to maximize total expected reward)? This is called the "multi-armed bandit" problem and has been studied for a long time.
Optimal solution: Play the arm that has maximum potential of being good
Optimism in the face of uncertainty
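"Optimism in the face of uncertainty" is what index policies implement. As a sketch, the UCB1 rule (a concrete instance from the bandit literature, not necessarily the slides' own algorithm) scores each arm by its empirical mean plus a confidence bonus that shrinks as the arm accumulates plays, and plays the arm with the highest upper bound:

```python
import math

def ucb1(clicks, views, t):
    """UCB1 index per arm at play t: empirical mean + optimism bonus.

    An under-explored arm gets a large bonus, so it can win the next
    play even when its empirical mean looks worse.
    """
    return [c / n + math.sqrt(2 * math.log(t) / n)
            for c, n in zip(clicks, views)]

# Two-armed example: arm 0 has a lower empirical CTR (0.1 vs 0.3)
# but only 10 plays, so its upper confidence bound is the larger one.
scores = ucb1(clicks=[1, 300], views=[10, 1000], t=1010)
```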
Item Recommendation: Bandits? • Two Items: Item 1 CTR = 2/100; Item 2 CTR = 250/10000
– Greedy: Show Item 2 to all; not a good idea – Item 1 CTR estimate noisy; the item could be potentially better
• Invest in Item 1 for better overall performance on average
– Exploit what is known to be good, explore what is potentially good
[Figure] Probability density of CTR: Item 2 (narrow, well estimated), Item 1 (wide, uncertain)
Next few hours
Columns: Most Popular Recommendation | Personalized Recommendation
Offline Models: | Collaborative filtering (cold-start problem)
Online Models: Time-series models | Incremental CF, online regression
Intelligent Initialization: Prior estimation | Prior estimation, dimension reduction
Explore/Exploit: Multi-armed bandits | Bandits with covariates
Offline Components: Collaborative Filtering in Cold-start
Situations
Problem
[Diagram] User i with user features xi (demographics, browse history, search history, …) visits; the algorithm selects item j with item features xj (keywords, content categories, …); (i, j): response yij (explicit rating, implicit click/no-click)
Predict the unobserved entries based on features and the observed entries
Model Choices • Feature-based (or content-based) approach
– Use features to predict response • (regression, Bayes Net, mixture models, …)
– Limitation: need predictive features • Bias often high, does not capture signals at granular levels
• Collaborative filtering (CF aka Memory based) – Make recommendation based on past user-item interaction
• User-user, item-item, matrix factorization, … • See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08 Tutorial], etc.
– Better performance for old users and old items – Does not naturally handle new users and new items (cold-
start)
Collaborative Filtering (Memory based methods)
• User-user similarity; item-item similarities; incorporating both
• Estimating similarities: Pearson's correlation; optimization-based (Koren et al)
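A minimal sketch of the Pearson-correlation similarity above (illustrative; a memory-based recommender would then predict a user's rating as a similarity-weighted average over that user's nearest neighbors):

```python
import math

def pearson_sim(ra, rb):
    """Pearson correlation over the items rated by both users.

    ra, rb: dicts mapping item id -> rating for two users.
    Returns 0.0 when there is not enough overlap to correlate.
    """
    common = set(ra) & set(rb)
    if len(common) < 2:
        return 0.0
    ma = sum(ra[i] for i in common) / len(common)
    mb = sum(rb[i] for i in common) / len(common)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = math.sqrt(sum((ra[i] - ma) ** 2 for i in common)
                    * sum((rb[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0
```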
How to Deal with the Cold-Start Problem
• Heuristic-based approaches – Linear combination of regression and CF models – Filterbot
• Add user features as pseudo-users and do collaborative filtering - Hybrid approaches
- Use content-based methods to fill up entries, then use CF
• Matrix Factorization – Good performance on Netflix (Koren, 2009)
• Model-based approaches – Bilinear random-effects model (probabilistic matrix factorization)
• Good on Netflix data [Ruslan et al ICML, 2009] – Add feature-based regression to matrix factorization
• (Agarwal and Chen, 2009) – Add topic discovery (from textual items) to matrix factorization
• (Agarwal and Chen, 2009; Chun and Blei, 2011)
Per-item regression models • When tracking users by cookies, the distribution of visit patterns can get extremely skewed – Majority of cookies have 1-2 visits
• Per item models (regression) based on user covariates attractive in such cases
Several per-item regressions: Multi-task learning
Low dimension (5-10),
B estimated from retrospective data
• Agarwal,Chen and Elango, KDD, 2010
Affinity to old items
Per-user, per-item models via bilinear random-effects
model
Motivation • Data measuring k-way interactions pervasive
– Consider k = 2 for all our discussions • E.g. User-Movie, User-content, User-Publisher-Ads,….
– Power law on both user and item degrees
• Classical Techniques – Approximate matrix through a singular value
decomposition (SVD) • After adjusting for marginal effects (user pop, movie pop,..)
– Does not work • Matrix highly incomplete, severe over-fitting
– Key issue • Regularization of eigenvectors (factors) to avoid overfitting
Early work on complete matrices
• Tukey’s 1-df model (1956)
– Rank 1 approximation of small nearly complete matrix
• Criss-cross regression (Gabriel, 1978) • Incomplete matrices: Psychometrics (1-factor
model only; small data sets; 1960s) • Modern day recommender problems
– Highly incomplete, large, noisy.
Latent Factor Models
[Figure] Users and items embedded in a latent space with interpretable directions (e.g. "sporty", "newsy"): user factor u and item factor v give affinity u'v; topic affinity s and topic vector z give affinity s'z
Factorization – Brief Overview
• Latent user factors: (αi, ui = (ui1, …, uin))
• Latent movie factors: (βj, vj = (vj1, …, vjn))
• (Nn + Mm) parameters
• Key technical issue: will overfit for moderate values of n, m → Regularization
• Interaction term: E(yij) = µ + αi + βj + ui' B vj
Latent Factor Models: Different Aspects
• Matrix Factorization – Factors in Euclidean space – Factors on the simplex
• Incorporating features and ratings simultaneously
• Online updates
Maximum Margin Matrix Factorization (MMMF)
• Complete matrix by minimizing loss (hinge,squared-error) on observed entries subject to constraints on trace norm – Srebro, Rennie, Jakkola (NIPS 2004)
– Convex, Semi-definite programming (expensive, not scalable)
• Fast MMMF (Rennie & Srebro, ICML, 2005) – Constrain the Frobenius norm of the left and right eigenvector matrices; not convex, but becomes scalable.
• Other variation: Ensemble MMMF (DeCoste, ICML2005) – Ensembles of partially trained MMMF (some
improvements)
Matrix Factorization for Netflix prize data
• Minimize the objective function
• Simon Funk: Stochastic Gradient Descent
• Koren et al (KDD 2007): Alternate Least Squares – They move to SGD later in the competition
minimize over {ui}, {vj}: Σ_{(i,j) ∈ obs} (rij − ui' vj)² + λ (Σi ||ui||² + Σj ||vj||²)
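Simon Funk-style SGD on this objective can be sketched as follows (an illustrative pure-Python version that omits the bias terms and practical details such as learning-rate schedules):

```python
import random

def sgd_mf(ratings, n_users, n_items, k=2, lam=0.05, lr=0.01, epochs=50):
    """SGD on sum_{(i,j) in obs} (r_ij - u_i'v_j)^2 + lam(||u||^2 + ||v||^2).

    ratings: list of (user_index, item_index, rating) triples.
    Returns the factor matrices u (n_users x k) and v (n_items x k).
    """
    random.seed(0)
    u = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    v = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for i, j, r in ratings:
            err = r - sum(a * b for a, b in zip(u[i], v[j]))
            for f in range(k):
                ui, vj = u[i][f], v[j][f]
                # gradient step on the squared error plus L2 penalty
                u[i][f] += lr * (err * vj - lam * ui)
                v[j][f] += lr * (err * ui - lam * vj)
    return u, v
```

Alternating least squares instead solves each row of u (resp. v) in closed form while holding the other side fixed.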
rij ~ N(ui' vj, σ²), ui ~ MVN(0, au I), vj ~ MVN(0, av I)
Optimization is through Iterated Conditional Modes. Other variations: constraining the mean through a sigmoid, using "who-rated-whom". Combining with Boltzmann Machines also improved performance.
Probabilistic Matrix Factorization (Ruslan & Minh, 2008, NIPS)
Bayesian Probabilistic Matrix Factorization (Ruslan and Minh, ICML 2008)
• Fully Bayesian treatment using an MCMC approach – Significant improvement
• Interpretation as a fully Bayesian hierarchical model shows why that is the case – Failing to incorporate uncertainty leads to bias in
estimates – Multi-modal posterior, MCMC helps in converging to a better one
[Figure] RMSE vs. variance component au
MCEM is also more resistant to over-fitting
Non-parametric Bayesian matrix completion (Zhou et al, SAM, 2010)
• Specify rank probabilistically (automatic rank selection)
zk ~ Ber(πk), πk ~ Beta(a/r, b(r−1)/r)
yij ~ N( Σ_{k=1..r} zk πk uik vjk , σ² )
Marginally, zk ~ Ber(a/(a + b(r−1))); E(#Factors) = ra/(a + b(r−1))
How to incorporate features: Deal with both warm start and cold-start
• Models to predict ratings for new pairs – Warm-start: (user, movie) present in the training data with large
sample size – Cold-start: At least one of (user, movie) new or has small sample
size • Rough definition, warm-start/cold-start is a continuum.
• Challenges – Highly incomplete (user, movie) matrix – Heavy tailed degree distributions for users/movies
• Large fraction of ratings from small fraction of users/movies
– Handling both warm-start and cold-start effectively in the presence of predictive features
Possible approaches • Large scale regression based on covariates
– Does not provide good estimates for heavy users/movies – Large number of predictors to estimate interactions
• Collaborative filtering – Neighborhood based – Factorization
• Good for warm-start; cold-start dealt with separately • Single model that handles cold-start and warm-start
– Heavy users/movies → User/movie specific model – Light users/movies → fallback on regression model – Smooth fallback mechanism for good performance
Add Feature-based Regression into
Matrix Factorization RLFM: Regression-based Latent
Factor Model
Regression-based Factorization Model (RLFM)
• Main idea: Flexible prior, predict factors through regressions
• Seamlessly handles cold-start and warm-start
• Modified state equation to incorporate covariates
RLFM: Model
Rating: yij ~ N(µij, σ²) (Gaussian model); yij ~ Bernoulli(µij) (logistic model, for binary ratings); yij ~ Poisson(µij Nij) (Poisson model, for counts), where user i gives item j
t(µij) = xij' b + αi + βj + ui' vj
Bias of user i: αi = g0' xi + εiα, εiα ~ N(0, σα²)
Popularity of item j: βj = d0' xj + εjβ, εjβ ~ N(0, σβ²)
Factors of user i: ui = G xi + εiu, εiu ~ N(0, σu² I)
Factors of item j: vj = D xj + εjv, εjv ~ N(0, σv² I)
Could use other classes of regression models
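The state equations can be read directly as a prediction rule. A sketch of the mean prediction for a cold-start (user, item) pair, where every factor falls back on its regression mean (the names g0, d0, G, D mirror the regression parameters in the model; for warm users/items the learned residuals would be added to these means):

```python
def rlfm_score(x_ij, x_i, x_j, b, g0, d0, G, D):
    """Cold-start RLFM mean: t(mu_ij) = x_ij'b + alpha_i + beta_j + u_i'v_j,
    with alpha_i = g0'x_i, beta_j = d0'x_j, u_i = G x_i, v_j = D x_j."""
    dot = lambda a, c: sum(p * q for p, q in zip(a, c))
    alpha_i = dot(g0, x_i)                  # + learned residual for warm users
    beta_j = dot(d0, x_j)                   # + learned residual for warm items
    u_i = [dot(row, x_i) for row in G]      # regression prior mean of u_i
    v_j = [dot(row, x_j) for row in D]      # regression prior mean of v_j
    return dot(x_ij, b) + alpha_i + beta_j + dot(u_i, v_j)
```

This makes the "smooth fallback" concrete: with no interaction data the score is purely feature-driven, and residuals move it toward user/item-specific behavior as data accumulates.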
Graphical representation of the model
Advantages of RLFM • Better regularization of factors
– Covariates “shrink” towards a better centroid
• Cold-start: Fallback regression model (FeatureOnly)
RLFM: Illustration of Shrinkage
Plot the first factor value for each user (fitted using Yahoo! FP data)
Model fitting: EM for our class of models
The parameters for RLFM
• Latent parameters
• Hyper-parameters
Δ = ({αi}, {βj}, {ui}, {vj})
Θ = (b, G, D, Au = au I, Av = av I)
The EM algorithm
Computing the E-step
• Often hard to compute in closed form • Stochastic EM (Markov Chain EM; MCEM)
– Compute expectation by drawing samples from
– Effective for multi-modal posteriors but more expensive
• Iterated Conditional Modes algorithm (ICM) – Faster but biased hyper-parameter estimates
Monte Carlo E-step • Through a vanilla Gibbs sampler (conditionals closed form)
• Other conditionals also Gaussian and closed form • Conditionals of users (movies) sampled simultaneously • Small number of samples in early iterations, large numbers in
later iterations
M-step (Why MCEM is better than ICM)
• Update G, optimize
• Update Au=au I
Ignored by ICM, underestimates factor variability Factors over-shrunk, posterior not explored well
Experiment 1: Better regularization
• MovieLens-100K, avg RMSE using pre-specified splits • ZeroMean, RLFM and FeatureOnly (no cold-start
issues) • Covariates:
– Users : age, gender, zipcode (1st digit only) – Movies: genres
Experiment 2: Better handling of Cold-start
• MovieLens-1M; EachMovie • Training-test split based on timestamp • Same covariates as in Experiment 1.
Experiment 4: Predicting click-rate on articles
• Goal: Predict click-rate on articles for a user on F1 position
• Article lifetimes short, dynamic updates important
• User covariates: – Age, Gender, Geo, Browse behavior
• Article covariates – Content Category, keywords
• 2M ratings, 30K users, 4.5 K articles
Results on Y! FP data
Some other related approaches • Stern, Herbrich and Graepel, WWW, 2009
– Similar to RLFM, different parametrization and expectation propagation used to fit the models
• Porteus, Asuncion and Welling, AAAI, 2011 – Non-parametric approach using a Dirichlet process
• Agarwal, Zhang and Mazumdar, Annals of Applied Statistics, 2011 – Regression + random effects per user regularized
through a Graphical Lasso
Add Topic Discovery into Matrix Factorization
fLDA: Matrix Factorization through Latent Dirichlet Allocation
fLDA: Introduction • Model the rating yij that user i gives to item j as the user’s
affinity to the topics that the item has
– Unlike regular unsupervised LDA topic modeling, here the LDA topics are learnt in a supervised manner based on past rating data
– fLDA can be thought of as a “multi-task learning” version of the supervised LDA model [Blei’07] for cold-start recommendation
yij = … + Σk sik zjk, where sik is user i's affinity to topic k
Pr(item j has topic k) estimated by averaging the LDA topic of each word in item j
Old items: zjk’s are Item latent factors learnt from data with the LDA prior New items: zjk’s are predicted based on the bag of words in the items
Φ11, …, Φ1W … Φk1, …, ΦkW … ΦK1, …, ΦKW
Topic 1
Topic k
Topic K
LDA Topic Modeling (1) • LDA is effective for unsupervised topic discovery [Blei’03]
– It models the generating process of a corpus of items (articles) – For each topic k, draw a word distribution Φk = [Φk1, …, ΦkW] ~ Dir(η) – For each item j, draw a topic distribution θj = [θj1, …, θjK] ~ Dir(λ)
– For each word, say the nth word, in item j, • Draw a topic zjn for that word from θj = [θj1, …, θjK] • Draw a word wjn from Φk = [Φk1, …, ΦkW] with topic k = zjn
Item j Topic distribution: [θj1, …, θjK]
Words: wj1, …, wjn, …
Per-word topic: zj1, …, zjn, …
Assume zjn = topic k
Observed
LDA Topic Modeling (2) • Model training:
– Estimate the prior parameters and the posterior topic×word distribution Φ based on a training corpus of items
– EM + Gibbs sampling is a popular method • Inference for new items
– Compute the item topic distribution based on the prior parameters and Φ estimated in the training phase
• Supervised LDA [Blei’07] – Predict a target value for each item based on supervised LDA topics
yj = Σk sk zjk
Target value of item j; zjk = Pr(item j has topic k), estimated by averaging the topic of each word in item j; sk = regression weight for topic k
vs. fLDA: yij = … + Σk sik zjk
One regression per user
Same set of topics across different regressions
fLDA: Model
Rating: yij ~ N(µij, σ²) (Gaussian model); yij ~ Bernoulli(µij) (logistic model, for binary ratings); yij ~ Poisson(µij Nij) (Poisson model, for counts), where user i gives item j
t(µij) = xij' b + αi + βj + Σk sik zjk
Bias of user i: αi = g0' xi + εiα, εiα ~ N(0, σα²)
Popularity of item j: βj = d0' xj + εjβ, εjβ ~ N(0, σβ²)
Topic affinity of user i: si = H xi + εis, εis ~ N(0, σs² I)
Pr(item j has topic k): zjk = Σn 1(zjn = k) / (#words in item j), where zjn is the LDA topic of the nth word in item j
Observed words: wjn ~ LDA(λ, η, zjn), where wjn is the nth word in item j
Model Fitting • Given:
– Features X = {xi, xj, xij} – Observed ratings y = {yij} and words w = {wjn}
• Estimate: – Parameters: Θ = [b, g0, d0, H, σ2, aα, aβ, As, λ, η]
• Regression weights and prior parameters – Latent factors: Δ = {αi, βj, si} and z = {zjn}
• User factors, item factors and per-word topic assignment
• Empirical Bayes approach: – Maximum likelihood estimate of the parameters:
– The posterior distribution of the factors:
Θ̂ = argmaxΘ Pr[y, w | Θ] = argmaxΘ ∫ Pr[y, w, Δ, z | Θ] dΔ dz
Posterior of the factors: Pr[Δ, z | y, Θ̂]
The EM Algorithm • Iterate through the E and M steps until convergence
– Let be the current estimate – E-step: Compute
• The expectation is not in closed form • We draw Gibbs samples and compute the Monte
Carlo mean
– M-step: Find
• It consists of solving a number of regression and optimization problems
fn(Θ) = E_{(Δ, z | y, w, Θ̂(n))} [ log Pr(y, w, Δ, z | Θ) ]
Θ̂(n+1) = argmaxΘ fn(Θ)
where Θ̂(n) is the current estimate
Supervised Topic Assignment
Pr(zjn = k | Rest) ∝ (Zjk^¬jn + λ) · (Zkl^¬jn + η) / (Zk^¬jn + Wη) · Π_{i rated j} f(yij | zjn = k)
The first two factors are the same as in unsupervised LDA; the last factor is the likelihood of the observed ratings by users who rated item j when zjn is set to topic k (the probability of observing yij given the model). zjn is the topic of the nth word in item j.
fLDA: Experimental Results (Movie) • Task: Predict the rating that a user would give a movie • Training/test split:
– Sort observations by time – First 75% → Training data – Last 25% → Test data
• Item warm-start scenario – Only 2% new items in test data
Model         Test RMSE
RLFM          0.9363
fLDA          0.9381
Factor-Only   0.9422
FilterBot     0.9517
unsup-LDA     0.9520
MostPopular   0.9726
Feature-Only  1.0906
Constant      1.1190
fLDA is as strong as the best method; it does not reduce performance in warm-start scenarios
fLDA: Experimental Results (Yahoo! Buzz)
• Task: Predict whether a user would buzz-up an article • Severe item cold-start
– All items are new in test data
Data Statistics 1.2M observations
4K users 10K articles
fLDA significantly outperforms other
models
Experimental Results: Buzzing Topics
Topic: Top Terms (after stemming)
CIA interrogation: bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig, georg, memo, al
Swine flu: mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici, somalia, captain
NFL games: NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman, michael, week
Gay marriage: court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal, ban, legal, allow
Sarah Palin: palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin, sai, gov, alaska
American idol: idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl, michelle_obama
Recession: economi, recess, job, percent, econom, bank, expect, rate, jobless, year, unemploy, month
North Korea issues: north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said, russia
3/4 topics are interpretable; 1/2 are similar to unsupervised topics
fLDA Summary • fLDA is a useful model for cold-start item recommendation • It also provides interpretable recommendations for users
– User’s preference to interpretable LDA topics
• Future directions: – Investigate Gibbs sampling chains and the convergence properties of
the EM algorithm – Apply fLDA to other multi-task prediction problems
• fLDA can be used as a tool to generate supervised features (topics) from text data
Summary • Regularizing factors through covariates effective • Regression based factor model that regularizes better
and deals with both cold-start and warm-start in a single framework in a seamless way looks attractive
• Fitting method scalable; Gibbs sampling for users and
movies can be done in parallel. Regressions in M-step can be done with any off-the-shelf scalable linear regression routine
• Distributed computing on Hadoop: Multiple models and average across partitions (more later)
Online Components: Online Models, Intelligent
Initialization, Explore/Exploit
Why Online Components? • Cold start
– New items or new users come to the system – How to obtain data for new items/users (explore/exploit) – Once data becomes available, how to quickly update the model
• Periodic rebuild (e.g., daily): Expensive • Continuous online update (e.g., every minute): Cheap
• Concept drift – Item popularity, user interest, mood, and user-to-item affinity may
change over time – How to track the most recent behavior
• Down-weight old data – How to model temporal patterns for better prediction
• … may not need to be online if the patterns are stationary
Big Picture
Columns: Most Popular Recommendation | Personalized Recommendation
Offline Models: | Collaborative filtering (cold-start problem)
Online Models (real systems are dynamic): Time-series models | Incremental CF, online regression
Intelligent Initialization (do not start cold): Prior estimation | Prior estimation, dimension reduction
Explore/Exploit (actively acquire data): Multi-armed bandits | Bandits with covariates
Segmented Most Popular Recommendation
Extension:
Online Components for Most Popular Recommendation
Online models, intelligent initialization & explore/exploit
Most popular recommendation: Outline
• Most popular recommendation (no personalization, all users see the same thing) – Time-series models (online models) – Prior estimation (initialization) – Multi-armed bandits (explore/exploit)
– Sometimes hard to beat!!
• Segmented most popular recommendation – Create user segments/clusters based on user
features – Do most popular recommendation for each segment
Most Popular Recommendation • Problem definition: Pick k items (articles) from a
pool of N to maximize the total number of clicks on the picked items
• Easy!? Pick the items having the highest click-through rates (CTRs)
• But … – The system is highly dynamic:
• Items come and go with short lifetimes • CTR of each item changes over time
– How much traffic should be allocated to explore new items to achieve optimal performance
• Too little → Unreliable CTR estimates • Too much → Little traffic to exploit the high CTR items
CTR Curves for Two Days on Yahoo! Front Page
Traffic obtained from a controlled randomized experiment (no confounding). Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories
Each curve is the CTR of an item in the Today Module on www.yahoo.com over time
For Simplicity, Assume … • Pick only one item for each user visit
– Multi-slot optimization later • No user segmentation, no personalization
(discussion later) • The pool of candidate items is predetermined
and is relatively small (≤ 1000) – E.g., selected by human editors or by a first-phase
filtering method – Ideally, there should be a feedback loop – Large item pool problem later
• Effects like user-fatigue, diversity in recommendations, multi-objective optimization not considered (discussion later)
Online Models • How to track the changing CTR of an item • Data: for each item, at time t, we observe
– Number of times the item nt was displayed (i.e., #views) – Number of clicks ct on the item
• Problem Definition: Given c1, n1, …, ct, nt, predict the CTR (click-through rate) pt+1 at time t+1
• Potential solutions: – Observed CTR at t: ct / nt → highly unstable (nt is usually small)
– Cumulative CTR: (∑all i ci) / (∑all i ni) → react to changes very slowly
– Moving window CTR: (∑i∈last K ci) / (∑i∈last K ni) → reasonable • But, no estimation of Var[pt+1] (useful for explore/exploit)
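The moving-window estimator, for instance, is a one-liner (an illustrative sketch; K is the number of most recent time intervals kept):

```python
def moving_window_ctr(history, K):
    """CTR over the last K intervals; history is a list of (clicks, views)."""
    c = sum(ci for ci, _ in history[-K:])
    n = sum(ni for _, ni in history[-K:])
    return c / n if n else 0.0
```

It reacts faster than the cumulative CTR, but as noted it gives no variance estimate for pt+1, which is what explore/exploit needs.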
Online Models: Dynamic Gamma-Poisson
• Model-based approach – (ct | nt, pt) ~ Poisson(nt pt) – pt = pt-1 εt, where εt ~ Gamma(mean=1, var=η)
– Model parameters:
• p1 ~ Gamma(mean=µ0, var=σ0²) is the offline CTR estimate
• η specifies how dynamic/smooth the CTR is over time
– Posterior distribution (pt+1 | c1, n1, …, ct, nt) ~ Gamma(?,?)
• Solve this recursively (online update rule)
Notation: pt = CTR at time t. At each time t, show the item nt times and receive ct clicks.
[Diagram: state-space model — p1 ~ Gamma(mean=µ0, var=σ0²) evolves to p2, … with evolution variance η; observations (n1, c1), (n2, c2), …]
Online Models: Derivation
Let γt = µt / σt² (effective sample size). Prior at time t:
(pt | c1, n1, …, ct-1, nt-1) ~ Gamma(mean = µt, var = σt²)
Posterior after observing (ct, nt):
(pt | c1, n1, …, ct, nt) ~ Gamma(mean = µt|t, var = σt|t²)
γt|t = γt + nt  (effective sample size)
µt|t = (γt · µt + ct) / γt|t
σt|t² = µt|t / γt|t
Prediction for time t+1 (after the evolution step pt+1 = pt εt+1):
(pt+1 | c1, n1, …, ct, nt) ~ Gamma(mean = µt+1, var = σt+1²)
µt+1 = µt|t
σt+1² = σt|t² + η (σt|t² + µt|t²)
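The recursion above fits in a few lines. A minimal sketch (function and variable names are ours, not from any library):

```python
def gamma_poisson_update(mu, var, c, n, eta):
    """One step of the dynamic Gamma-Poisson online update.

    (mu, var) is the Gamma prior on the CTR at time t; we observe
    c clicks out of n views, then apply the evolution step
    p_{t+1} = p_t * eps with Var[eps] = eta.
    """
    gamma = mu / var                        # effective sample size
    gamma_post = gamma + n
    mu_post = (gamma * mu + c) / gamma_post
    var_post = mu_post / gamma_post         # posterior at time t
    # Evolution step inflates the variance; the mean is unchanged
    var_next = var_post + eta * (var_post + mu_post ** 2)
    return mu_post, var_next
```

For example, starting from an offline prior (µ0, σ0²) = (0.05, 0.01), observing 2 clicks in 100 views pulls the mean toward the observed CTR while sharply shrinking the variance.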
Tracking behavior of Gamma-Poisson model
[Figure: estimated CTR distributions at time t and time t+1 for several items]
• High click-rate items → more adaptive tracking
• Low click-rate articles → more temporal smoothing
Intelligent Initialization: Prior Estimation
• Prior CTR distribution: Gamma(mean=µ0, var=σ0²)
– N historical items: • ni = #views of item i in its first time interval • ci = #clicks on item i in its first time interval
– Model • ci ~ Poisson(ni pi) and pi ~ Gamma(µ0, σ0²) ⇒ ci ~ NegBinomial(µ0, σ0², ni)
– Maximum likelihood estimate (MLE) of (µ0, σ0²)
• Better prior: Cluster items and find MLE for each cluster – Agarwal & Chen, 2011 (SIGMOD)
MLE objective (marginal log-likelihood of the Negative-Binomial model):
(µ0, σ0²) = argmax N·(µ0²/σ0²)·log(µ0/σ0²) − N·log Γ(µ0²/σ0²) + ∑i [ log Γ(µ0²/σ0² + ci) − (µ0²/σ0² + ci)·log(µ0/σ0² + ni) ]
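A minimal sketch of this prior estimation, maximizing the Negative-Binomial marginal likelihood over a small grid (`neg_log_lik` and `fit_prior` are our names; terms independent of the parameters are dropped, and a real system would use a proper optimizer instead of a grid):

```python
import math

def neg_log_lik(mu0, var0, clicks, views):
    """Negative marginal log-likelihood of (mu0, var0), dropping terms
    that do not depend on the parameters.  Gamma shape a = mu0^2/var0
    and rate b = mu0/var0; marginalizing p_i gives a Negative Binomial."""
    a = mu0 * mu0 / var0
    b = mu0 / var0
    nll = 0.0
    for c, n in zip(clicks, views):
        nll -= (math.lgamma(a + c) - math.lgamma(a)
                + a * math.log(b) - (a + c) * math.log(b + n))
    return nll

def fit_prior(clicks, views, mu_grid, var_grid):
    """Crude grid-search MLE over candidate (mu0, var0) values."""
    return min(((m, v) for m in mu_grid for v in var_grid),
               key=lambda mv: neg_log_lik(mv[0], mv[1], clicks, views))
```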
Explore/Exploit: Problem Definition
time
Item 1 Item 2 … Item K
x1% page views x2% page views … xK% page views
Determine (x1, x2, …, xK) based on clicks and views observed before t in order to maximize the expected total number of clicks in the future
[Timeline: …, t−2, t−1, t (now) → clicks in the future]
Modeling the Uncertainty, NOT just the Mean
Simplified setting: Two items
[Figure: probability density of CTR for Item A and Item B]
We know the CTR of Item A well (say, shown 1 million times). We are uncertain about the CTR of Item B (shown only 100 times).
If we only make a single decision, give 100% of page views to Item A.
If we make multiple decisions in the future, explore Item B, since its CTR can potentially be higher.
Potential of exploring Item B: Potential = ∫p>q (p − q) f(p) dp
where q = CTR of Item A, p = CTR of Item B, and f(p) = probability density function of Item B's CTR
Multi-Armed Bandits: Introduction (1)
Bandit “arms”
p1 p2 p3 (unknown payoff
probabilities)
“Pulling” arm i yields a reward:
reward = 1 with probability pi (success)
reward = 0 otherwise (failure)
For now, we are attacking the problem of choosing the best article/arm for all users
Multi-Armed Bandits: Introduction (2)
Bandit “arms”
p1 p2 p3 (unknown payoff
probabilities)
Goal: Pull arms sequentially to maximize the total reward
Bandit scheme/policy: Sequential algorithm to play arms (items)
Regret of a scheme = Expected loss relative to the "oracle" optimal scheme that always plays the best arm – "best" means highest success probability – But, the best arm is not known … unless you have an oracle – Regret is the price of exploration – Low regret implies quick convergence to the best
Multi-Armed Bandits: Introduction (3)
• Bayesian approach – Seeks to find the Bayes optimal solution to a Markov
decision process (MDP) with assumptions about probability distributions
– Representative work: Gittins’ index, Whittle’s index – Very computationally intensive
• Minimax approach – Seeks to find a scheme that incurs bounded regret (with no
or mild assumptions about probability distributions) – Representative work: UCB by Lai, Auer – Usually, computationally easy – But, they tend to explore too much in practice (probably
because the bounds are based on worst-case analysis)
Skip details
Multi-Armed Bandits: Markov Decision Process (1)
• Select an arm now at time t=0, to maximize expected total number of clicks in t=0,…,T
• State at time t: Θt = (θ1t, …, θKt) – θit = State of arm i at time t (that captures all we know about arm i at t)
• Reward function Ri(Θt, Θt+1) – Reward of pulling arm i that brings the state from Θt to Θt+1
• Transition probability Pr[Θt+1 | Θt, pulling arm i ] • Policy π: A function that maps a state to an arm (action)
– π(Θt) returns an arm (to pull) • Value of policy π starting from the current state Θ0 with horizon T
Vπ(Θ0, T) = E[ Rπ(Θ0)(Θ0, Θ1) + Vπ(Θ1, T−1) ]
= ∫ Pr[Θ1 | Θ0, π(Θ0)] · ( Rπ(Θ0)(Θ0, Θ1) + Vπ(Θ1, T−1) ) dΘ1
(immediate reward + value of the remaining T−1 time slots if we start from state Θ1)
Multi-Armed Bandits: MDP (2)
• Optimal policy:
• Things to notice: – Value is defined recursively (actually T high-dim integrals) – Dynamic programming can be used to find the optimal policy – But, just evaluating the value of a fixed policy can be very expensive
• Bandit Problem: The pull of one arm does not change the state of other arms and the set of arms do not change over time
Vπ(Θ0, T) = E[ Rπ(Θ0)(Θ0, Θ1) + Vπ(Θ1, T−1) ]
= ∫ Pr[Θ1 | Θ0, π(Θ0)] · ( Rπ(Θ0)(Θ0, Θ1) + Vπ(Θ1, T−1) ) dΘ1
(immediate reward + value of the remaining T−1 time slots if we start from state Θ1)
Optimal policy: argmaxπ Vπ(Θ0, T)
Multi-Armed Bandits: MDP (3) • Which arm should be pulled next?
– Not necessarily what looks best right now, since it might have had a few lucky successes
– Looks like it will be a function of successes and failures of all arms • Consider a slightly different problem setting
– Infinite time horizon, but – Future rewards are geometrically discounted
Rtotal = R(0) + γ·R(1) + γ²·R(2) + …  (0 < γ < 1)
• Theorem [Gittins 1979]: The optimal policy decouples and solves a bandit problem for each arm independently
Policy π(Θt) is a function of (θ1t, …, θKt): one K-dimensional problem
Policy π(Θt) = argmaxi { g(θit) }  (Gittins' index): K one-dimensional problems
Still computationally expensive!!
Multi-Armed Bandits: MDP (4)
Bandit Policy
1. Compute the priority (Gittins’ index) of each arm based on its state
2. Pull arm with max priority, and observe reward
3. Update the state of the pulled arm
Priority 1
Priority 2
Priority 3
Multi-Armed Bandits: MDP (5) • Theorem [Gittins 1979]: The optimal policy decouples
and solves a bandit problem for each arm independently – Many proofs and different interpretations of Gittins’ index
exist • The index of an arm is the fixed charge per pull for a game with two options, whether
to pull the arm or not, so that the charge makes the optimal play of the game have zero net reward
– Significantly reduces the dimension of the problem space – But, Gittins’ index g(θit) is still hard to compute
• For the Gamma-Poisson or Beta-Binomial models θit = (#successes, #pulls) for arm i up to time t
• g maps each possible (#successes, #pulls) pair to a number
– Approximate methods are used in practice – Lai et al. have derived these for exponential family
distributions
Multi-Armed Bandits: Minimax Approach (1)
• Compute the priority of each arm i in a way that the regret is bounded – Lowest regret in the worst case
• One common policy is UCB1 [Auer 2002]:
Priorityi = ci / ni + √(2 · ln n / ni)
where ci = number of successes of arm i, ni = number of pulls of arm i, and n = total number of pulls of all arms. The first term is the observed success rate; the second is a factor representing uncertainty.
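A minimal UCB1 sketch (our own helper; `pull` is a hypothetical callback returning a 0/1 reward for the chosen arm):

```python
import math

def ucb1(pull, num_arms, horizon):
    """UCB1 [Auer 2002]: play the arm maximizing
    observed success rate + sqrt(2 ln n / n_i)."""
    clicks = [0] * num_arms
    pulls = [0] * num_arms
    for arm in range(num_arms):                # initialize: pull each arm once
        clicks[arm] += pull(arm)
        pulls[arm] += 1
    for _ in range(num_arms, horizon):
        n = sum(pulls)
        arm = max(range(num_arms),
                  key=lambda i: clicks[i] / pulls[i]
                  + math.sqrt(2 * math.log(n) / pulls[i]))
        clicks[arm] += pull(arm)
        pulls[arm] += 1
    return pulls, clicks
```

With a large gap between arm payoffs, the sub-optimal arm ends up pulled only a logarithmic number of times.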
Multi-Armed Bandits: Minimax Approach (2)
• As total observations n becomes large: – Observed payoff tends asymptotically towards the
true payoff probability – The system never completely “converges” to one
best arm; only the rate of exploration tends to zero
Priorityi = ci / ni + √(2 · ln n / ni)  (observed payoff + factor representing uncertainty)
Multi-Armed Bandits: Minimax Approach (3)
• Sub-optimal arms are pulled O(log n) times • Hence, UCB1 has O(log n) regret • This is the lowest possible regret (but the constants matter!) • E.g. regret after n plays is bounded by
Priorityi = ci / ni + √(2 · ln n / ni)  (observed payoff + factor representing uncertainty)
Regret after n plays ≤ [ 8 ∑i: µi < µbest (ln n) / Δi ] + (1 + π²/3) ∑j=1..K Δj ,  where Δi = µbest − µi
Classical Multi-Armed Bandits: Summary
• Classical multi-armed bandits – A fixed set of arms with fixed rewards – Observe the reward before the next pull
• Bayesian approach (Markov decision process) – Gittins’ index [Gittins 1979]: Bayes optimal for classical bandits
• Pull the arm currently having the highest index value – Whittle’s index [Whittle 1988]: Extension to a changing reward function – Computationally intensive
• Minimax approach (providing guaranteed regret bounds) – UCB1 [Auer 2002]: Upper bound of a model-agnostic confidence interval
• Index of arm i = ci / ni + √(2 · ln n / ni)
• Heuristics
– ε-Greedy: Random exploration using fraction ε of traffic
– Softmax: Pick arm i with probability exp{µ̂i/τ} / ∑j exp{µ̂j/τ}  (µ̂i = predicted CTR of item i, τ = temperature)
– Posterior draw: Index = drawing from posterior CTR distribution of an arm
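The posterior-draw heuristic is one line per arm; a sketch with Beta posteriors (our own helper, assuming a uniform Beta(1,1) prior):

```python
import random

def thompson_step(successes, failures):
    """Thompson sampling: score each arm by a draw from its Beta
    posterior (Beta(1,1) prior assumed) and play the argmax."""
    draws = [random.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])
```

Arms with little data have wide posteriors and occasionally win the draw, which is exactly the exploration behavior we want.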
Do Classical Bandits Apply to Web Recommenders?
Traffic obtained from a controlled randomized experiment (no confounding). Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories
Each curve is the CTR of an item in the Today Module on www.yahoo.com over time
Characteristics of Real Recommender Systems
• Dynamic set of items (arms) – Items come and go with short lifetimes (e.g., a day) – Asymptotically optimal policies may fail to achieve good performance
when item lifetimes are short • Non-stationary CTR
– CTR of an item can change dramatically over time • Different user populations at different times • Same user behaves differently at different times (e.g., morning, lunch
time, at work, in the evening, etc.) • Attention to breaking news stories decays over time
• Batch serving for scalability – Making a decision and updating the model for each user visit in real time
is expensive – Batch serving is more feasible: Create time slots (e.g., 5 min); for each
slot, decide the fraction xi of the visits in the slot to give to item i [Agarwal et al., ICDM, 2009]
Explore/Exploit in Recommender Systems
time
Item 1 Item 2 … Item K
x1% page views x2% page views … xK% page views
Determine (x1, x2, …, xK) based on clicks and views observed before t in order to maximize the expected total number of clicks in the future
[Timeline: …, t−2, t−1, t (now) → clicks in the future]
Let’s solve this from first principles
Bayesian Solution: Two Items, Two Time Slots (1)
• Two time slots: t = 0 and t = 1 – Item P: We are uncertain about its CTR, p0 at t = 0 and p1 at t = 1 – Item Q: We know its CTR exactly, q0 at t = 0 and q1 at t = 1
• To determine x, we need to estimate what would happen in the future
Question: what fraction x of the N0 views goes to Item P, and (1−x) to Item Q?
[Timeline: Now (t=0, N0 views) → t=1 (N1 views) → End]
Obtain c clicks after serving x (not yet observed; random variable)
Assume we observe c; we can update p1
[Figure: CTR densities. At t = 0: Item Q at known CTR q0, Item P centered at p0. At t = 1: Item Q at q1, Item P at its posterior mean p̂1(x,c).]
If x and c are given, optimal solution: give all views to Item P iff E[p1 | x, c] = p̂1(x,c) > q1
• Expected total number of clicks in the two time slots
N0 x p̂0 + N0 (1−x) q0 + N1 Ec[ max{ p̂1(x,c), q1 } ]
Gain(x, q0, q1) = Expected number of additional clicks if we explore the uncertain item P with fraction x of views in slot 0, compared to a scheme that only shows the certain item Q in both slots
Solution: argmaxx Gain(x, q0, q1)
Bayesian Solution: Two Items, Two Time Slots (2)
= N0 q0 + N1 q1 + N0 x (p̂0 − q0) + N1 Ec[ max{ p̂1(x,c) − q1, 0 } ]
N0 q0 + N1 q1 is the E[#clicks] if we always show Item Q; the remaining terms are Gain(x, q0, q1), the gain of exploring the uncertain Item P using x.
At t = 0, the E[#clicks] comes from Items P and Q; at t = 1, show the item with higher E[CTR]: max{ p̂1(x,c), q1 }
• Approximate by the normal distribution – Reasonable approximation because of the central limit theorem
• Proposition: Using the approximation, the Bayes optimal solution x can be found in time O(log N0)
Gain(x, q0, q1) = N0 x (p̂0 − q0) + N1 [ σ1(x)·φ((q1 − p̂1)/σ1(x)) + (p̂1 − q1)·(1 − Φ((q1 − p̂1)/σ1(x))) ]
where the prior of p1 is Beta(a, b),
p̂1 = Ec[ p̂1(x,c) ] = a / (a + b), and
σ1²(x) = Varc[ p̂1(x,c) ] = ( x N0 / (a + b + x N0) ) · a b / ( (a + b)² (a + b + 1) )
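Under the stated normal approximation, Gain can be evaluated in closed form. A sketch (our own function; the small floor on σ1 avoids division by zero at x = 0):

```python
import math

def gain(x, q0, q1, N0, N1, a, b):
    """Expected additional clicks from exploring Item P (prior Beta(a, b))
    with fraction x of the N0 views in slot 0, vs. always showing Item Q."""
    p_hat = a / (a + b)                       # prior mean CTR of Item P
    m = x * N0                                # views given to Item P
    var1 = (m / (a + b + m)) * a * b / ((a + b) ** 2 * (a + b + 1))
    sigma1 = max(math.sqrt(var1), 1e-12)
    z = (q1 - p_hat) / sigma1
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)   # N(0,1) density
    Phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))          # N(0,1) cdf
    tail = sigma1 * phi + (p_hat - q1) * (1 - Phi)  # E[max{p1_hat - q1, 0}]
    return N0 * x * (p_hat - q0) + N1 * tail
```

With q0 = q1 = 0.05 and an uncertain item whose prior mean equals 0.05 (a = 1, b = 19), exploring with x = 0.2 yields a positive expected gain, while x = 0 yields none.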
Bayesian Solution: Two Items, Two Time Slots (3)
Bayesian Solution: Two Items, Two Time Slots (4)
• Quiz: Is it correct that the more we are uncertain about the CTR of an item, the more we should explore the item?
Uncertainty: Low Uncertainty: High
Different curves are for different prior mean settings (y-axis: fraction of views to give to the item)
– Apply Whittle's Lagrange relaxation (1988) to this problem setting: relax ∑i zi(c) = 1, for all c, to Ec[ ∑i zi(c) ] = 1, and apply Lagrange multipliers (q0 and q1) to enforce the constraints
– We essentially reduce the K-item case to K independent two-item sub-problems (which we have solved)
Bayesian Solution: General Case (1) • From two items to K items
– Very difficult problem:
maxx ∑i N0 xi p̂i0 + N1 Ec[ maxi { p̂i1(xi, ci) } ],  subject to ∑i xi = 1
where Ec[ maxi { p̂i1(xi, ci) } ] = maxz Ec[ ∑i zi(c) p̂i1(xi, ci) ],  subject to ∑i zi(c) = 1 for all possible c
Note: c = [c1, …, cK]; ci is a random variable representing the # clicks on item i we may get
– After Whittle's relaxation and the Lagrange multipliers: minq0,q1 ( N0 q0 + N1 q1 + ∑i maxxi Gain(xi, q0, q1) )
Bayesian Solution: General Case (2)
• From two intervals to multiple time slots – Approximate multiple time slots by two stages
• Non-stationary CTR – Use the Dynamic Gamma-Poisson model to
estimate the CTR distribution for each item
Simulation Experiment: Different Traffic Volume
Simulation with ground truth estimated based on Yahoo! Front Page data. Setting: 16 live items per interval. Scenarios: web sites with different traffic volume (x-axis)
Simulation Experiment: Different Sizes of the Item Pool
Simulation with ground truth estimated based on Yahoo! Front Page data. Setting: 1000 views per interval; average item lifetime = 20 intervals. Scenarios: different sizes of the item pool (x-axis)
Characteristics of Different Explore/Exploit Schemes (1)
• Why the Bayesian solution has better performance • Characterize each scheme by three dimensions:
– Exploitation regret: The regret of a scheme when it is showing the item which it thinks is the best (may not actually be the best)
• 0 means the scheme always picks the actual best • It quantifies the scheme’s ability of finding good
items – Exploration regret: The regret of a scheme when it is exploring the items
which it feels uncertain about
• It quantifies the price of exploration (lower → better) – Fraction of exploitation (higher → better)
• Fraction of exploration = 1 − fraction of exploitation
[Diagram: all traffic to a web site = exploitation traffic + exploration traffic]
Characteristics of Different Explore/Exploit Schemes (2)
Exploitation regret: ability of finding good items (lower → better). Exploration regret: price of exploration (lower → better). Fraction of exploitation (higher → better)
[Figure: exploitation regret vs. exploration regret and vs. exploitation fraction; the "Good" corner is marked in each panel]
Discussion: Large Content Pool • The Bayesian solution looks promising
– ~10% from true optimal for a content pool of 1000 live items
• 1000 views per interval; item lifetime ~20 intervals • Intelligent initialization (offline modeling)
– Use item features to reduce the prior variance of an item • E.g., Var[ item CTR | Sport ] < Var[ item CTR ]
– Require a CTR model that outputs both mean and variance
• Linear regression model • Segmented model: Estimate the CTR distribution of a random
article in an item category – Existing taxonomies, decision tree, LDA topics
• Feature-based explore/exploit – Estimate model parameters, instead of per-item CTR – More later
Discussion: Multiple Positions, Ranking
• Feature-based approach – reward(page) = model(φ(item 1 at position 1, … item k at position k)) – Apply feature-based explore/exploit
• Online optimization for ranked list – Ranked bandits [Radlinski et al., 2008]: Run an
independent bandit algorithm for each position – Dueling bandit [Yue & Joachims, 2009]: Actions are
pairwise comparisons • Online optimization of submodular functions
– ∀ S1, S2 and a, fa(S1 ⊕ S2) ≤ fa(S1), where fa(S) = f(S ⊕ 〈a〉) − f(S)
– Streeter & Golovin (2008)
Discussion: Segmented Most Popular
• Partition users into segments, and then for each segment, provide most popular recommendation
• How to segment users – Hand-created segments: AgeGroup × Gender – Clustering or decision tree based on user features
• Users in the same cluster like similar items • Segments can be organized by taxonomies/hierarchies
– Better CTR models can be built by hierarchical smoothing • Shrink the CTR of a segment toward its parent • Introduce bias to reduce uncertainty/variance
– Bandits for taxonomies (Pandey et al., 2008) • First explore/exploit categories/segments • Then, switch to individual items
Most Popular Recommendation: Summary
• Online model: – Estimate the mean and variance of the CTR of each item over
time – Dynamic Gamma-Poisson model
• Intelligent initialization: – Estimate the prior mean and variance of the CTR of each item
cluster using historical data • Cluster items → Maximum likelihood estimates of the priors
• Explore/exploit: – Bayesian: Solve a Markov decision process problem
• Gittins’ index, Whittle’s index, approximations • Better performance, computation intensive • Thompson sampling: Sample from the posterior (simple)
– Minimax: Bound the regret • UCB1: Easy to compute • Explore more than necessary in practice
– ε-Greedy: Empirically competitive for tuned ε
Online Components for Personalized Recommendation
Online models, intelligent initialization & explore/exploit
Intelligent Initialization for Linear Model (1)
• Linear/factorization model
– How to estimate the prior parameters µj and Σ • Important for cold start: Predictions are made using prior • Leverage available features
– How to learn the weights/factors quickly • High dimensional βj → slow convergence • Reduce the dimensionality
Subscript: user i, item j
yij ~ N(ui′ βj, σ²),  βj ~ N(µj, Σ)
where yij = rating that user i gives item j, ui = feature/factor vector of user i, and βj = factor vector of item j
Feature-based model initialization
• Dimensionality reduction for fast model convergence
βj ~ N(A xj, Σ)  (item factors predicted by features)
FOBFM: Fast Online Bilinear Factor Model
Per-item online model: yij ~ ui′ βj,  βj ~ N(µj, Σ)
⇔ yij ~ ui′ A xj + ui′ vj,  vj ~ N(0, Σ)  (first term predicted by features)
Dimensionality reduction: vj = B θj,  θj ~ N(0, σθ² I)
B is an n×k linear projection matrix (k << n): project high-dim vj → low-dim θj
Low-rank approximation of Var[βj]:  βj ~ N(A xj, σθ² B B′)
Subscripts: user i, item j. Data: yij = rating that user i gives item j; ui = offline factor vector of user i; xj = feature vector of item j
Offline training: determine A, B, σθ² through the EM algorithm (once per day or hour)
Feature-based model initialization
• Dimensionality reduction for fast model convergence
• Fast, parallel online learning
• Online selection of dimensionality (k = dim(θj)) – Maintain an ensemble of models, one for each candidate dimensionality
βj ~ N(A xj, Σ)
FOBFM: Fast Online Bilinear Factor Model
Per-item online model: yij ~ ui′ βj,  βj ~ N(µj, Σ)
⇔ yij ~ ui′ A xj + ui′ vj,  vj ~ N(0, Σ);  vj = B θj,  θj ~ N(0, σθ² I)
B is an n×k linear projection matrix (k << n): project high-dim vj → low-dim θj; low-rank approximation of Var[βj]: βj ~ N(A xj, σθ² B B′)
Online update: yij ~ ui′ A xj + (ui′ B) θj, where θj is updated in an online manner
(ui′ A xj = offset; ui′ B = new low-dimensional feature vector)
Subscripts: user i, item j. Data: yij = rating that user i gives item j; ui = offline factor vector of user i; xj = feature vector of item j
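The per-item online learning step then reduces to Bayesian linear regression in the k-dimensional θj space. A sketch of a single Gaussian update (names are ours; numpy for the linear algebra):

```python
import numpy as np

def online_theta_update(prec, mean, z, resid, noise_var=1.0):
    """One online update of the low-dimensional item factor theta_j.

    prec, mean: Gaussian posterior of theta_j (precision matrix, mean).
    z = B' u_i: projected user vector (the new low-dim features).
    resid = y_ij - u_i' A x_j: rating minus the feature-based offset.
    """
    prec_new = prec + np.outer(z, z) / noise_var   # rank-1 precision update
    b = prec @ mean + z * resid / noise_var        # information-form mean
    mean_new = np.linalg.solve(prec_new, b)
    return prec_new, mean_new
```

Because the update touches only a k×k matrix per item, it is cheap enough to run in parallel across items at serving frequency.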
Experimental Results: My Yahoo! Dataset (1)
• My Yahoo! is a personalized news reading site – Users manually select news/RSS feeds
• ~12M “ratings” from ~3M users on ~13K articles – Click = positive – View without click = negative
Experimental Results: My Yahoo! Dataset (2)
Item-based data split: every item is new in the test data – First 8K articles are in the training data (offline training) – Remaining articles are in the test data (online prediction & learning)
Supervised dimensionality reduction (reduced-rank regression) significantly outperforms other methods
Methods: No-init: Standard online regression with ~1000 parameters for each item
Offline: Feature-based model without online update
PCR, PCR+: Two principal component methods to estimate B
FOBFM: Our fast online method
Experimental Results: My Yahoo! Dataset (3)
• Small number of factors (low dimensionality) is better when the amount of data for online learning is small
• Large number of factors is better when the data for learning becomes large • The online selection method usually selects the best dimensionality
# factors = Number of parameters per item updated online
Intelligent Initialization: Summary
• For online learning, whenever historical data is available, do not start cold
• For linear/factorization models – Use available features to setup the starting point – Reduce dimensionality to facilitate fast learning
• Next – Explore/exploit for personalization – Users are represented by covariates
• Features, factors, clusters, etc – Covariate bandits
Explore/Exploit for Personalized Recommendation
• One extreme problem formulation – One bandit problem per user with one arm per item – Bandit problems are correlated: “Similar” users like similar
items – Arms are correlated: “Similar” items have similar CTRs
• Model this correlation through covariates/features – Input: User feature/factor vector, item feature/factor vector – Output: Mean and variance of the CTR of this (user, item)
pair based on the data collected so far • Covariate bandits
– Also known as contextual bandits, bandits with side observations
– Provide a solution to • Large content pool (correlated arms) • Personalized recommendation (hint before pulling an arm)
Methods for Covariate Bandits • Priority-based methods
– Rank items according to the user-specific “score” of each item; then, update the model based on the user’s response
– UCB (upper confidence bound) • Score of an item = E[posterior CTR] + k StDev[posterior CTR]
– Posterior draw (Thompson sampling) • Score of an item = a number drawn from the posterior CTR distribution
– Softmax • Score of an item = a number drawn with probability exp{µ̂i/τ} / ∑j exp{µ̂j/τ}
• ε-Greedy – Allocate ε fraction of traffic for random exploration (ε may be adaptive) – Robust when the exploration pool is small
• Bayesian scheme – Close to optimal if it can be solved efficiently
Covariate Bandits: Some References
• Just a small sample of papers – Hierarchical explore/exploit (Pandey et al., 2008)
• Explore/exploit categories/segments first; then, switch to individuals – Variants of ε-greedy
• Epoch-greedy (Langford & Zhang, 2007): ε is determined based on the generalization bound of the current model
• Banditron (Kakade et al., 2008): Linear model with binary response • Non-parametric bandit (Yang & Zhu, 2002): ε decreases over time;
example model: histogram, nearest neighbor – Variants of UCB methods
• Linearly parameterized bandits (Rusmevichientong et al., 2008): minimax, based on uncertainty ellipsoid
• LinUCB (Li et al., 2010): Gaussian linear regression model • Bandits in metric spaces (Kleinberg et al., 2008; Slivkins et al., 2009):
– Similar arms have similar rewards: | reward(i) – reward(j) | ≤ distance(i,j)
Online Components: Summary • Real systems are dynamic • Cold-start problem
– Incremental online update (online linear regression) – Intelligent initialization (use features to predict initial factor
values) – Explore/exploit (UCB, posterior draw, softmax, ε-greedy)
• Concept-drift problem – Tracking the most recent behavior (state-space models,
Kalman filter) – Modeling temporal patterns (tensor factorization, spline)
Evaluation Methods and Challenges
Evaluation Methods • Ideal method
– Experimental Design: Run side-by-side experiments on a small fraction of randomly selected traffic with new method (treatment) and status quo (control)
– Limitation • Often expensive and difficult to test large number of methods
• Problem: How do we evaluate methods offline on logged data? – Goal: To maximize clicks/revenue, not prediction
accuracy, on the entire system. The cost of predictive inaccuracy varies across instances.
• E.g. 100% error on a low-CTR article may not matter much because it always co-occurs with a high-CTR article that is predicted accurately
Usual Metrics • Predictive accuracy
– Root Mean Squared Error (RMSE) – Mean Absolute Error (MAE) – Area under the Curve, ROC
• Other rank based measures based on retrieval accuracy for top-k
– Recall in test data • What fraction of items that the user actually liked in the test data were
among the top-k recommended by the algorithm (fraction of hits, e.g. Karypis, CIKM 2001)
• One flaw in several papers – Training and test split are not based on time.
• Information leakage • Even in Netflix, this is the case to some extent
– Time split per user, not per event. For instance, information may leak if models are based on user-user similarity.
Metrics continued.. • Recall per event based on Replay-Match
method – Fraction of clicked events where the top
recommended item matches the clicked one.
• This works well if the logged data was collected from a randomized serving scheme; with biased data this could be a problem – We will be inventing algorithms that provide
recommendations that are similar to the current one
• No reward for novel recommendations
Details on Replay-Match method (Li, Langford, et al)
• x: feature vector for a visit • r = [r1,r2,…,rK]: reward vector for the K items in inventory • h(x): recommendation algorithm to be evaluated • Goal: Estimate expected reward for h(x)
• s(x): recommendation scheme that generated logged-data • x1,..,xT: visits in the logged data • rti: reward for visit t, where i = s(xt)
Replay-Match continued • Estimator
• If importance weights and
– It can be shown estimator is unbiased
• E.g. if s(x) is random serving scheme, importance weights are uniform over the item set
• If s(x) is not random, importance weights have to be estimated through a model
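When the logging policy s(x) is uniformly random, the estimator is just the average reward over matched events. A minimal sketch (our own function; the general importance-weighted case is omitted):

```python
def replay_estimate(logged, h):
    """Replay-Match sketch: average reward of policy h over the logged
    events where h's recommendation matches the served item, assuming
    the logging policy served items uniformly at random."""
    matched = [reward for (x, served, reward) in logged if h(x) == served]
    return sum(matched) / len(matched) if matched else 0.0
```

Events where h disagrees with the logged action carry no information about h's reward, which is why randomized logging (so every action has some chance of matching) is essential.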
Back to Multi-Objective Optimization
[Diagram: Front Page recommender with editorial content; clicks on FP links influence downstream supply distribution; the ad server serves premium display (guaranteed) and a spot market (cheaper); downstream engagement (time spent)]
Serving Content on Front Page: Click Shaping
• What do we want to optimize? • Current: Maximize clicks (maximize downstream supply from FP) • But consider the following
– Article 1: CTR=5%, utility per click = 5 – Article 2: CTR=4.9%, utility per click=10
• By promoting 2, we lose 1 click/100 visits, gain 5 utils • If we do this for a large number of visits --- lose some clicks but
obtain significant gains in utility? – E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement,
etc)
Why call it Click Shaping?
[Figure: supply distribution across Yahoo! properties (autos, buzz, finance, gmy.news, health, hotjobs, movies, new.music, news, omg, realestate, rivals, shine, shopping, sports, tech, travel, tv, video, videogames, other) BEFORE and AFTER click shaping, with per-property changes ranging from −10% to +10%]
Supply distribution changes. SHAPING can happen with respect to any downstream metric (like engagement)
Multi-Objective Optimization
[Diagram: n articles A1, …, An; K properties (news, finance, omg, …); m user segments S1, …, Sm]
CTR of user segment i on article j: pij. Time duration of i on j: dij
Multi-Objective Program
• Scalarization
• Goal programming
Simplex constraints on xij are always applied; the constraints are linear
Every 10 mins, solve for x; use this x as the serving scheme in the next 10 mins
Pareto-optimal solution (more in KDD 2011)
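A toy instance of such a linear program, sketched with scipy's `linprog` (all numbers are made up for illustration): maximize total time spent subject to a floor on expected clicks and a per-segment simplex constraint.

```python
from scipy.optimize import linprog

# p[i][j]: CTR of segment i on article j; d[i][j]: time spent per click
p = [[0.050, 0.049], [0.030, 0.040]]
d = [[5.0, 10.0], [4.0, 8.0]]
min_clicks = 0.0899          # floor on expected clicks per unit of traffic

# Variables x_ij flattened row-major; linprog minimizes, so negate utility
obj = [-p[i][j] * d[i][j] for i in range(2) for j in range(2)]
# Expected clicks >= min_clicks  <=>  -sum p_ij x_ij <= -min_clicks
A_ub = [[-p[i][j] for i in range(2) for j in range(2)]]
b_ub = [-min_clicks]
# Simplex constraint per segment: x_i1 + x_i2 = 1
A_eq = [[1, 1, 0, 0], [0, 0, 1, 1]]
b_eq = [1, 1]
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 4)
```

The solver shifts just enough traffic to the higher-CTR articles to meet the click floor while maximizing time spent; re-solving every few minutes with fresh pij, dij estimates gives the serving scheme for the next interval.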
Summary • Modern recommendation systems on the web crucially depend on
extracting intelligence from massive amounts of data collected on a routine basis
• Lots of data and processing power are not enough; the number of things we need to learn grows with data size
• Extracting grouping structures at coarser resolutions based on similarity (correlations) is important – ML has a big role to play here
• Continuous and adaptive experimentation in a judicious manner crucial to maximize performance – Again, ML has a big role to play
• Multi-objective optimization is often required; the objectives are application-dependent. – ML has to work in close collaboration with
engineering, product & business execs
Challenges
Recall: Some examples • Simple version
– I have an important module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to drive up total CTR on this module
• More advanced – I got X% lift in CTR. But I have additional information on
other downstream utilities (e.g. dwell time). Can I increase downstream utility without losing too many clicks?
• Highly advanced – There are multiple modules running on my website. How
do I take a holistic approach and perform a simultaneous optimization?
For the simple version • Multi-position optimization
– Explore/exploit, optimal subset selection
• Explore/Exploit strategies for large content pool and high dimensional problems – Some work on hierarchical bandits but more needs to be
done • Constructing user profiles from multiple sources with
less than full coverage – Couple of papers at KDD 2011
• Content understanding • Metrics to measure user engagement (other than
CTR)
Other problems • Whole page optimization
– Incorporating correlations
• Incentivizing User generated content
• Incorporating Social information for better recommendation (News Feed Recommendation)
• Multi-context Learning
Case Studies
Recommendations and Advertising on LinkedIn HP
EXAMPLE: DISPLAY AD PLACEMENTS ON LINKEDIN
©2013 LinkedIn Corporation. All Rights Reserved.
Recommendations and Advertising on LinkedIn HP
LinkedIn Advertising: Flow
[Diagram: an ad request (profile: region = US, age = 20; context = profile page, 300×250 ad slot) flows through Filter Campaigns (targeting criteria, frequency cap, budget pacing) → campaigns eligible for auction → Response Prediction Engine → sorted by Bid × CTR → Automatic Format Selection]
Click Cost = Bid3 × CTR3 / CTR2
Serving constraint: < 100 millisec
CTR Prediction Model for Ads • Feature vectors
– Member feature vector: xi (identity, behavioral, network) – Campaign feature vector: cj (text, advertiser-id, …) – Context feature vector: zk (page type, device, …)
• Model:
CTR Prediction Model for Ads (continued)
• Model: cold-start component + warm-start per-campaign component
– Both components can have L2 penalties
Model Fitting • Single machine (well understood)
– conjugate gradient – L-BFGS – trust region – …
• Model training with large-scale data – Cold-start component Θw is more stable
• Weekly/bi-weekly training is good enough • However: difficulty from the need for large-scale logistic regression
– Warm-start per-campaign model Θc is more dynamic • New items can get generated at any time • Big loss if opportunities are missed • Need to update the warm-start component as frequently as possible
Large Scale Logistic Regression
Per-item logistic regression given Θc
Explore/Exploit with Logistic Regression
[Figure: training data (+/− points) with the COLD START separating line, the COLD + WARM START line for an Ad-id, and the posterior of the warm-start coefficients]
E/E: Sample a line from the posterior (Thompson Sampling)
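A sketch of one Thompson-sampling scoring step, assuming a diagonal Gaussian approximation to the posterior of the warm-start coefficients (all names are ours):

```python
import math
import random

def sample_ctr(mean_w, var_w, features):
    """Draw a coefficient vector from its (diagonal) Gaussian posterior
    and score the impression with the sampled logistic model."""
    w = [random.gauss(m, math.sqrt(v)) for m, v in zip(mean_w, var_w)]
    s = sum(wi * xi for wi, xi in zip(w, features))
    return 1.0 / (1.0 + math.exp(-s))
```

Campaigns with wide posteriors get occasional optimistic draws and hence traffic; as data accumulates the posterior narrows and the sampled scores converge to the mean prediction.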
Models Considered
• CONTROL: per-campaign CTR counting model
• COLD-ONLY: only the cold-start component
• LASER: our model (cold-start + warm-start)
• LASER-EE: our model with explore/exploit using Thompson sampling
Metrics
• Model metrics (offline) – Test Log-‐likelihood – AUC/ROC – Observed/Expected ra$o
• Online metrics (Online A/B Test) – CTR – CPM (Revenue per impression) – Unique ads per user (diversity)
Observed / Expected Ratio
• Offline replay is difficult with a large item set (randomization is costly)
• Observed: #clicks in the data; Expected: sum of predicted CTR over all impressions
• Not a "standard" classifier metric, but useful for this application
• What we usually see: observed/expected < 1
– Quantifies the "winner's curse", a.k.a. selection bias in auctions
• When choosing among thousands of candidates, an item with a mistakenly over-estimated CTR may end up winning the auction
• Particularly helpful for spotting inefficiencies by segment
– E.g. by bid, number of impressions in training (warmness), geo, etc.
– Lets us see where the model might be giving too much weight to the wrong campaigns
• High correlation between the O/E ratio and online model performance
Offline: ROC Curves
[Figure: ROC curves, true positive rate vs. false positive rate. AUC — CONTROL: 0.672, COLD-ONLY: 0.757, LASER: 0.778]
Online A/B Test
• Three models
– CONTROL (10%)
– LASER (85%)
– LASER-EE (5%)
• Segmented analysis
– 8 segments by campaign warmness
• Degree of warmness: the number of training samples available for the campaign
• Segment #1: campaigns with almost no data in training
• Segment #8: campaigns served most heavily in the previous batches, so their CTR estimates can be quite accurate
Daily CTR Lift Over Control
[Figure: percentage CTR lift of LASER and LASER-EE over CONTROL for each of 7 days; exact percentages redacted]
Daily CPM Lift Over Control
[Figure: percentage eCPM lift of LASER and LASER-EE over CONTROL for each of 7 days; exact percentages redacted]
CPM Lift by Campaign Warmness Segments
[Figure: percentage CPM lift of LASER and LASER-EE over CONTROL by warmness segment 1–8; exact percentages redacted]
O/E Ratio by Campaign Warmness Segments
[Figure: observed clicks / expected clicks (ranging roughly 0.5–1.0) for CONTROL, LASER, and LASER-EE by warmness segment 1–8]
Number of Campaigns Served: Improvement from E/E
Insights
• Overall performance:
– LASER and LASER-EE are both much better than CONTROL
– LASER and LASER-EE performance is very similar
• Great news! We get exploration without much additional cost
• Exploration has other benefits
– LASER-EE serves significantly more campaigns than LASER
– Provides a healthier marketplace and more ad diversity per user (better experience)
Solutions to Practical Problems
• Rapid model development cycle
– Quick reaction to changes in data and product
– Write once for training, testing, and inference
• Can adapt to changing data
– Integrated Thompson-sampling explore/exploit
– Automatic training
– Multiple training frequencies for different parts of the model
• Good tools yield good models
– Reusable components for feature extraction and transformation
– Very high-performance inference engine for deployment
– Modelers can concentrate on building models, not re-writing common functions or worrying about production issues
Summary
• Reducing dimension through logistic regression, coupled with explore/exploit schemes like Thompson sampling, is an effective mechanism for solving response-prediction problems in advertising
• Partitioning model components into cold-start (stable) and warm-start (non-stationary) parts, with different training frequencies, is an effective way to scale the computations
• ADMM, with a few modifications, is an effective model-training strategy for large data with high dimensionality
• These methods work well for LinkedIn advertising, with significant improvements
©2013 LinkedIn Corporation. All Rights Reserved.
Theory vs. Practice
Textbook:
• Data is stationary
• Training data is clean
• Training is hard; testing and inference are easy
• Models don't change
• Complex algorithms work best
Reality:
• Features and items change constantly
• Fraud, bugs, tracking delays, online/offline inconsistencies, etc.
• All aspects have challenges at web scale
• Never-ending processes of improvement
• Simple models with good features and lots of data win
Current Work: Feed Recommendation
• Network updates: job changes, job anniversaries, connections, endorsements, photo uploads, …
• Content: articles by influencers, shares by friends, content in followed channels, content by followed companies, job recommendations, …
• Sponsored updates: company updates, jobs
Tiered Approach to Ranking
§ A second-pass ranker (BLENDER) blends the disparate results returned by first-pass rankers (jobs, ads, network updates, content) to produce the top k
Challenges
• Personalization
– Viewer-actor affinity by type (depends on the strength of connections in multiple contexts)
– Blending identity and behavioral data
• Frequency discounting, freshness, diversification
• Multiple objectives (revenue, engagement)
• A/B tests with interference
• Engagement metrics
– Functions of various actions that optimize a long-term engagement metric like return visits
• Summarization and adding new content types
Impression Discounting
• How does the response rate vary with past impressions of the same item?
Slide courtesy of Pannaga Shivaswamy
Diversity
• How does the response rate change when an actorId/objectType at a position matches previous items?
Slide courtesy of Pannaga Shivaswamy
Age of an Item
• How does the response rate change with age for different item types?
Slide courtesy of Pannaga Shivaswamy
Parallel Matrix Factorization
Problem Setup
• CTR prediction for a user on an item
• Assumptions:
– There are sufficient data per item to estimate a per-item model
– Serving bias and positional bias are removed by a random serving scheme
– Item popularities are quite dynamic and have to be estimated in real time
• Examples:
– Yahoo! front page Today module
– LinkedIn Today module
Online Logistic Regression (OLR)
§ User i with feature vector xi, article j
§ Binary response y (click/non-click)
§ p(y = 1 | xi, j) = σ(xiᵀβj), with σ(s) = 1/(1 + e^(−s))
§ Prior βj ~ N(μj, Σj)
§ Use a Laplace approximation or variational Bayesian methods to obtain the posterior
§ The posterior becomes the new prior for the next update
§ Can approximate the prior and posterior covariances as diagonal for high-dimensional xi
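A rough sketch of one OLR update under a diagonal Gaussian prior; the coordinate-wise Newton search for the posterior mode is my simplification of the Laplace/variational step:

```python
import numpy as np

def olr_update(mean, var, x, y, n_iter=20):
    """One Bayesian update of a per-article logistic regression with a
    diagonal Gaussian prior N(mean, diag(var)): locate the posterior
    mode by coordinate-wise Newton steps, then use the diagonal of the
    Hessian at the mode for the new (diagonal) posterior variance.
    The returned posterior serves as the prior for the next event."""
    beta = mean.copy()
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-x @ beta))
        grad = (y - p) * x - (beta - mean) / var      # log-posterior grad
        hess_diag = p * (1 - p) * x * x + 1.0 / var   # curvature, per coord
        beta += grad / hess_diag
    p = 1.0 / (1.0 + np.exp(-x @ beta))
    new_var = 1.0 / (p * (1 - p) * x * x + 1.0 / var)
    return beta, new_var
```

A click (y = 1) shifts the mean toward the observed features and shrinks the variance, which is exactly the behavior the Thompson-sampling serving scheme relies on.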
User Features for OLR
• Age, gender, industry, and job position for logged-in users
• General behavioral-targeting (BT) features
– Music? Finance? Politics?
• User profiles from historical view/click behavior on previous items, e.g.
– Item profile: use previously clicked item ids as the user profile
– Category profile: use item-category affinity scores as the profile; the score can simply be the user's historical CTR on each category
– Are there better ways to generate user profiles?
– Yes! By matrix factorization!
Generalized Matrix Factorization (GMF) Framework
• Predicted score for user i on item j:
score(i, j) = bᵀxij + αi + βj + uiᵀvj
with global features xij, user effect αi, item effect βj, user factors ui, and item factors vj
• Bell et al. (2007)
Regression Priors
• αi ~ N(g(xi), σα²), βj ~ N(h(xj), σβ²), ui ~ N(G(xi), σu² I), vj ~ N(H(xj), σv² I), with user covariates xi and item covariates xj
• g(·), h(·), G(·), H(·) can be any regression functions
• Agarwal and Chen (KDD 2009); Zhang et al. (RecSys 2011)
Different Types of Prior Regression Models
• Zero prior mean
– Bilinear random effects (BIRE)
• Linear regression
– Simple regression (RLFM)
– Lasso penalty (LASSO)
• Tree models
– Recursive partitioning (RP)
– Random forests (RF)
– Gradient boosting machines (GB)
– Bayesian additive regression trees (BART)
Model Fitting Using MCEM
• Monte Carlo EM (Booth and Hobert 1999)
• Let Θ denote the prior parameters (the regression functions and variances) and let Δ = (α, β, u, v) denote the latent factors
• E step: obtain N samples from the conditional posterior of Δ given Θ and the data
• M step: maximize the Monte Carlo estimate of the expected complete-data log-likelihood over Θ
Handling Binary Responses
• Gaussian responses: the E step has a closed form
• Binary responses + logistic: no longer closed form
• Variational approximation (VAR)
• Adaptive rejection sampling (ARS)
Simulation Study
• 10 simulated data sets, 100K samples for both training and test
• 1000 users and 1000 items in training
• An extra 500 new users and 500 new items in test, plus the old users/items
• For each user/item, 200 covariates, only 10 of them useful
• Construct a non-linear regression model from 20 Gaussian functions to simulate α, β, u and v, following Friedman (2001)
MovieLens 1M Data Set
• 1M ratings
• 6040 users
• 3706 movies
• Sorted by time; first 75% training, last 25% test
• A lot of new users in the test set
• User features: age, gender, occupation, zip code
• Item features: movie genre
Performance Comparison
However…
• We are working with very large-scale data sets!
• Parallel matrix-factorization methods using Map-Reduce have to be developed!
• Khanna et al. (2012), technical report
Parallel Matrix Factorization
• Partition the data into m partitions
• For each partition, run the MCEM algorithm to get an estimate of the prior parameters (one MapReduce job)
• Ensemble runs: for k = 1, …, n
– Repartition the data into m partitions with a new seed
– Run an E-step-only job for each partition given the fitted prior parameters (each ensemble run is one MapReduce job)
• Average the user/item factors over all partitions and all k to obtain the final estimate
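The partition-fit-average pattern can be sketched in plain Python; `fit_mcem` is a hypothetical stand-in for the real per-partition MCEM job:

```python
import numpy as np

def partition_and_average(events, m, fit_mcem, seed=0):
    """'Divide and conquer': randomly split the events into m partitions,
    run the (expensive) fitting routine independently on each one, and
    average the resulting factor estimates. Each partition's fit maps
    onto one MapReduce job; the averaging is the reduce step.

    fit_mcem: callable taking a list of events and returning a dict of
    factor estimates keyed by user/item id."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(events))
    parts = [[events[i] for i in order[j::m]] for j in range(m)]
    fits = [fit_mcem(part) for part in parts]
    keys = set().union(*[f.keys() for f in fits])
    return {k: np.mean([f[k] for f in fits if k in f], axis=0)
            for k in keys}
```

Repartitioning with a new seed per ensemble run (as on the slide) just means calling this with a different `seed`, so each run sees a different user-item mix.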
Key Points
• Partitioning is tricky! By events? By items? By users?
• Empirically, "divide and conquer" plus averaging the per-partition estimates works well!
• Ensemble runs: after the fitted parameters are obtained, run n E-step-only jobs and take the average, each job using a different user-item mix
Identifiability Issues
• The same log-likelihood can be achieved by:
– Shifting: g(·) → g(·) + r, h(·) → h(·) − r
• Fix: center α, β, u to zero mean every E step
– Sign flips: u → −u, v → −v
• Fix: constrain v to be positive
– Switching factor columns (u·1, v·1) with (u·2, v·2)
• Fix: with ui ~ N(G(xi), I) and vj ~ N(H(xj), λI), constrain the diagonal entries λ1 ≥ λ2 ≥ …
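The centering fix for the first non-identifiability can be sketched as follows; absorbing the removed means into a global intercept `mu` is my assumption about where they go, not something the slide specifies:

```python
import numpy as np

def center_factors(alpha, beta, u, mu):
    """Remove the translation non-identifiability after each E-step:
    shift user effects alpha, item effects beta, and user factors u to
    zero mean. The means of alpha and beta are folded into the global
    intercept mu (an assumed bookkeeping choice).

    alpha: (n_users,), beta: (n_items,), u: (n_users, r)."""
    mu = mu + alpha.mean() + beta.mean()
    alpha = alpha - alpha.mean()
    beta = beta - beta.mean()
    u = u - u.mean(axis=0)               # center each factor column
    return alpha, beta, u, mu
```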
MovieLens 1M Data
• 75% training and 25% test, split by time
• Imbalanced data
– User rating = 1: positive
– User rating = 2, 3, 4, 5: negative
– 5% positive rate
• Balanced data
– User rating = 1, 2, 3: positive
– User rating = 4, 5: negative
– 44% positive rate
Matrix Factorization for User Profiles
• Offline user-profile-building period: obtain the user factor for each user i
• Online modeling using OLR
– If a user has a profile (warm start), use the fitted user factor as the user feature
– If not (cold start), use the regression prior G(xi) as the user feature
Offline Evaluation Metric Related to Clicks
• For model M and J live items (articles) at any time, replay over randomly served events:
S(M) = J × Σt clickt · 1{M's top pick at time t = the item actually served}
• If M is a random (constant) model, E[S(M)] = #clicks
• Unbiased estimate of the expected total clicks (Langford et al. 2008)
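A minimal replay evaluator in the spirit of Langford et al. (2008); the event-tuple layout is my assumption:

```python
def replay_clicks(model_pick, logged_events, n_live):
    """Unbiased offline click estimate on randomly served data: keep
    only events where the model would have shown the item that was
    actually served, then scale the matched clicks up by the number of
    live items n_live (since each item is served with probability
    1/n_live under random serving).

    logged_events: iterable of (context, served_item, click) tuples.
    model_pick: callable mapping a context to the item M would serve."""
    matched_clicks = sum(
        click for context, served, click in logged_events
        if model_pick(context) == served)
    return n_live * matched_clicks
```

A constant model matches roughly 1/n_live of the events, so the scaling makes its estimate average out to the total number of clicks in the log, as the slide notes.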
Experiments on Big Data
• Yahoo! front page Today module data
• Data for building user profiles: 8M users with at least 10 clicks (heavy users) in June 2011, 1B events
• Data for training and testing the OLR model: randomly served data with 2.4M clicks in July 2011
• Heavy users contributed around 30% of the clicks
• User features for OLR:
– Intercept only (MOST POPULAR)
– 124 behavioral-targeting features (BT-ONLY)
– BT + top 1000 clicked article ids (ITEM-PROFILE)
– BT + user profile with CTR on 43 binary content categories (CATEGORY-PROFILE)
– BT + profiles from matrix factorization models
Click Lift Performance for Different User Profiles
Web Advertising
There are lots of ads on the web: hundreds of billions of advertising dollars are spent online per year (eMarketer)
Online Advertising: a 6000-ft. Overview
[Diagram: advertisers supply ads to an ad network; the ad network picks ads to show alongside content from a content provider to the user. Examples: Yahoo, Google, MSN, RightMedia, …]
Web Advertising Comes in Different Flavors
• Sponsored ("paid") search
– Small text links shown in response to a query to a search engine
• Display advertising
– Graphical, banner, rich media; appears in several contexts, like visiting a webpage, checking e-mail, on a social network, …
– Goals of such advertising campaigns differ
• Brand awareness
• Performance (users are targeted to take some action, soon)
– More akin to direct marketing in the offline world
Paid Search: Advertised Text Links
Display Advertising: Examples
• LinkedIn company-follow ad
• Brand ad on Facebook
Paid Search Ads versus Display Ads
Paid search:
• Context (the query) is important
• Small text links
• Performance based (clicks, conversions)
• Advertisers can cherry-pick instances
Display:
• Reaching a desired audience
• Graphical, banner, rich media (text, logos, videos, …)
• Hybrid (brand, performance)
• Bulk buys by marketers, but things are evolving
• Ad exchanges, real-time bidding (RTB)
Display Advertising Models
• Futures market (guaranteed delivery)
– Brand awareness (e.g. Gillette, Coke, McDonald's, GM, …)
• Spot market (non-guaranteed)
– Marketers create targeted campaigns
• Ad exchanges have made this process efficient
– They connect buyers and sellers in a stock-market-style market
• Several portals like LinkedIn and Facebook have self-serve systems to book such campaigns
Guaranteed Delivery (Futures Market)
• Revenue model: cost per ad impression (CPM)
• Ads are bought in bulk, targeted to users based on demographics and other behavioral features
– GM ads on LinkedIn shown to "males above 55"
– Mortgage ads shown to "everybody on Y!"
• Slots are booked in advance and guaranteed
– E.g. "2M targeted ad impressions in January next year"
– Prices significantly higher than the spot market
– Higher-quality inventory delivered to maintain the mark-up
Measuring the Effectiveness of Brand Advertising
§ "Half the money I spend on advertising is wasted; the trouble is, I don't know which half." – John Wanamaker
• Typically
– Number of visits and engagement on the advertiser's website
– Increase in the number of searches for specific keywords
– Increase in offline sales in the long run
• How?
– Randomized design (treatment = ad exposure, control = no exposure)
– Sample surveys
– Covariate shift (propensity score matching)
• Several statistical challenges (experimental design, causal inference from observational data, survey methodology)
Guaranteed Delivery
• Fundamental problem: guarantee impressions (with overlapping inventory)
[Diagram: bipartite graph between supply pools (Young, US, Female, LI Homepage) with supplies si and demand nodes with demands dj, connected by allocations xij]
1. Predict supply
2. Incorporate/predict demand
3. Find the optimal allocation
• subject to supply and demand constraints
Example
[Diagram: supply pools — "US, Y, nF" (supply 2, price 1) and "US, Y, F" (supply 3, price 5) — feeding a demand of 2 impressions for "US & Y"]
How should we distribute impressions from the supply pools to satisfy this demand?
Example (Cherry-Picking)
• Cherry-picking: fulfill demands at least cost
[Diagram: the same supply pools; cherry-picking takes both impressions from the cheaper pool, "US, Y, nF" at price 1]
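The cherry-picking rule is just a greedy fill from the cheapest eligible pools; a toy version (the production system solves a full optimization, not a greedy heuristic):

```python
def cherry_pick(demand, supply_pools):
    """Fulfill a demand at least cost: sort the eligible supply pools
    by price and take impressions greedily from the cheapest.

    supply_pools: list of (name, supply, price) tuples.
    Returns {pool_name: impressions_taken}."""
    allocation = {}
    remaining = demand
    for name, supply, price in sorted(supply_pools, key=lambda p: p[2]):
        take = min(supply, remaining)
        if take > 0:
            allocation[name] = take
        remaining -= take
        if remaining == 0:
            break
    return allocation

# The example above: a demand of 2 impressions
pools = [("US,Y,nF", 2, 1), ("US,Y,F", 3, 5)]
print(cherry_pick(2, pools))  # {'US,Y,nF': 2} — all from the cheap pool
```

This is exactly why cherry-picking can starve the expensive, high-quality pools, which motivates the fairness constraints on the next slide.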
Example (Fairness)
• Cherry-picking: fulfill demands at least cost
• Fairness: equitable distribution across the available supply pools
[Diagram: the same supply pools — "US, Y, nF" (supply 2, cost 1) and "US, Y, F" (supply 3, cost 5); a fair allocation takes 1 impression from each pool]
• Agarwal and Tomlin, INFORMS, 2010
• Ghosh et al., EC, 2011
The Optimization Problem
• Maximize the value of remnant inventory (to be sold in the spot market)
– Subject to "fairness" constraints (to maintain high-quality inventory in the guaranteed market)
– Subject to supply and demand constraints
• Can be solved efficiently through a flow program
• Key statistical input: supply forecasts
Components of a Guaranteed Delivery System
• Offline components:
– Field sales team sells products (segments); contracts signed with advertisers, negotiations involved
– Pricing engine
– Admission control: should the new contract request be admitted? (solved via LP)
– Supply forecasts
– Demand forecasts and booked inventory
• Online serving:
– Online ad serving of the incoming opportunities
– Near-real-time optimization, driven by stochastic supply, stochastic demand, contract statistics, and the allocation plan (from the LP)
High-Dimensional Forecasting
• Supply forecasts are an important input, required both at booking time (admission control) and at serving time
• Problem: given historical time-series data in a high-dimensional space (trillions of combinations), forecast the number of visits for an arbitrary query over a future time horizon
– E.g.: male visits from Hawaii on LinkedIn next January
• Challenging statistical problem
– Curse of dimensionality and massive data
– Arbitrary query subsets
– Latency constraints
• Forecasting High-dimensional Data, Agarwal et al., SIGMOD, 2011
Other Challenges
• 3Ms: multi-response, multi-context modeling to optimize multiple objectives
– Multi-response: clicks, shares, comments, likes, … (preliminary work at CIKM 2012)
– Multi-context: mobile, desktop, email, … (preliminary work at SIGKDD 2011)
– Multi-objective: trade-offs among engagement, revenue, and viral activities (preliminary work at SIGIR 2012, SIGKDD 2011)
• Scaling model computations at run time to avoid latency issues
– Predictive indexing (preliminary work at WSDM 2012)
Bibliography
Agarwal, D. and Chen, B. (2009). Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 19–28. ACM.
Agarwal, D., Chen, B., and Elango, P. (2010). Fast online learning through offline initialization for time-sensitive recommendation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 703–712. ACM.
Bell, R., Koren, Y., and Volinsky, C. (2007). Modeling relationships at multiple scales to improve accuracy of large recommender systems. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 95–104. ACM.
Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(1), 265–285.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
Bickel, P. J., Götze, F., and van Zwet, W. R. (2012). Resampling fewer than n observations: gains, losses, and remedies for losses, 267–297. Springer, New York.
Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1–26.
Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. (2012). The big data bootstrap. arXiv preprint arXiv:1206.6415.
Khanna, R., Zhang, L., Agarwal, D., and Chen, B. (2012). Parallel matrix factorization for binary response. arXiv.org.
Zhang, L., Agarwal, D., and Chen, B. (2011). Generalizing matrix factorization through flexible regression priors. In Proceedings of the Fifth ACM Conference on Recommender Systems, 13–20. ACM.