Page 1: ENAR short course

Statistical Computing for Big Data
Deepak Agarwal

LinkedIn Applied Relevance Science
[email protected]

ENAR 2014, Baltimore, USA

Page 2: ENAR short course

Main  Collaborators:  several  others  at  both  Y!  and  LinkedIn  

• I wouldn't be here without them; extremely lucky to work with such talented individuals

Bee-Chung Chen Liang Zhang Bo Long

Jonathan Traupman Paul Ogilvie

Page 3: ENAR short course

Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop System
  – Overview of Distributed Computing
  – Introduction to Map-Reduce
  – Some statistical computations using Map-Reduce
    • Bootstrap, Logistic Regression
• Part II: Recommender Systems for Web Applications
  – Introduction
  – Content Recommendation
  – Online Advertising

Page 4: ENAR short course

Big Data becoming Ubiquitous

• Bioinformatics
• Astronomy
• Internet
• Telecommunications
• Climatology
• …

Page 5: ENAR short course

Big Data: Some size estimates

• 1000 human genomes: > 100TB of data (1000 Genomes Project)
• Sloan Digital Sky Survey: 200GB of data per night (> 140TB aggregated)
• Facebook: a billion monthly active users
• LinkedIn: roughly > 280M members worldwide
• Twitter: > 500 million tweets a day
• Over 6 billion mobile phones in the world generating data every day

Page 6: ENAR short course

Big Data: Paradigm shift
• Classical Statistics
  – Generalize using small data
• Paradigm shift with Big Data
  – We now have an almost infinite supply of data
  – Easy statistics? Just appeal to asymptotic theory?
• So the issue is mostly computational?
  – Not quite
    • More data comes with more heterogeneity
    • Need to change our statistical thinking to adapt
  – Classical statistics is still invaluable for thinking about big data analytics

Page 7: ENAR short course

Some Statistical Challenges

• Exploratory Analysis (EDA), Visualization
  – Retrospective (on Terabytes)
  – More real time (streaming computations every few minutes/hours)
• Statistical Modeling
  – Scale (computational challenge)
  – Curse of dimensionality
    • Millions of predictors, heterogeneity
  – Temporal and spatial correlations

Page 8: ENAR short course

Statistical Challenges continued

• Experiments
  – To test new methods, test hypotheses from randomized experiments
  – Adaptive experiments
• Forecasting
  – Planning, advertising
• Many more that I am not fully versed in

   

Page 9: ENAR short course

Defining Big Data

• How do you know you have a big data problem?
  – Is it only the number of terabytes?
  – What about dimensionality, structured/unstructured, computations required, …?
• No clear definition; different points of view
  – When the desired computation cannot be completed in the stipulated time with the current best algorithm using the cores available on a commodity PC

 

Page 10: ENAR short course

Distributed Computing for Big Data

• Distributed computing is an invaluable tool to scale computations for big data
• Some distributed computing models
  – Multi-threading
  – Graphics Processing Units (GPU)
  – Message Passing Interface (MPI)
  – Map-Reduce

Page 11: ENAR short course

Evaluating a method for a problem
• Scalability
  – Process X GB in Y hours
• Ease of use for a statistician
• Reliability (fault tolerance)
  – Especially in an industrial environment
• Cost
  – Hardware and cost of maintenance
• Good for the computations required?
  – E.g., iterative versus one pass
• Resource sharing

Page 12: ENAR short course

Multithreading

• Multiple threads take advantage of multiple CPUs
• Shared memory
• Threads can execute independently and concurrently
• Can only handle Gigabytes of data
• Reliable

Page 13: ENAR short course

Graphics Processing Units (GPU)
• Number of cores:
  – CPU: order of 10
  – GPU: smaller cores, order of 1000
• Can be >100x faster than CPU
  – Parallel, computationally intensive tasks off-loaded to GPU
• Good for certain computationally-intensive tasks
• Can only handle Gigabytes of data
• Not trivial to use; requires a good understanding of low-level architecture for efficient use
  – But things are changing; it is getting more user friendly

Page 14: ENAR short course

Message Passing Interface (MPI)

• Language-independent communication protocol among processes (e.g., computers)
• Most suitable for the master/slave model
• Can handle Terabytes of data
• Good for iterative processing
• Fault tolerance is low

Page 15: ENAR short course

Map-Reduce (Dean & Ghemawat, 2004)

[Diagram] Data → Mappers → Reducers → Output

• Computation is split into Map (scatter) and Reduce (gather) stages
• Easy to use: the user needs to implement only two functions, a Mapper and a Reducer
• Easily handles Terabytes of data
• Very good fault tolerance (failed tasks automatically get restarted)

Page 16: ENAR short course

Comparison of Distributed Computing Methods

                              Multithreading | GPU                                               | MPI       | Map-Reduce
Scalability (data size)       Gigabytes      | Gigabytes                                         | Terabytes | Terabytes
Fault Tolerance               High           | High                                              | Low       | High
Maintenance Cost              Low            | Medium                                            | Medium    | Medium-High
Iterative Process Complexity  Cheap          | Cheap                                             | Cheap     | Usually expensive
Resource Sharing              Hard           | Hard                                              | Easy      | Easy
Easy to Implement?            Easy           | Needs understanding of low-level GPU architecture | Easy      | Easy

Page 17: ENAR short course

Example  Problem  

• Tabulating word counts in a corpus of documents

• Similar to the table() function in R

Page 18: ENAR short course

Word Count Through Map-Reduce

Input document 1: "Hello World Bye World"
Input document 2: "Hello Hadoop Goodbye Hadoop"

Mapper 1 (document 1) emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Mapper 2 (document 2) emits: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

Reducer 1 (words from A-G) outputs: <Bye, 1> <Goodbye, 1>
Reducer 2 (words from H-Z) outputs: <Hello, 2> <World, 2> <Hadoop, 2>
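As a rough sketch of the two functions behind this diagram, written in the same R-flavored pseudocode used later in this tutorial (emit() is a hypothetical helper standing in for writing a <Key, Value> pair, e.g. to standard output for a streaming framework; it is not a real Hadoop API):

mapper <- function(line) {
  # emit <word, 1> for every word in one line of input
  for (w in strsplit(line, "\\s+")[[1]]) {
    emit(w, 1)
  }
}

reducer <- function(key, values) {
  # values is the list of counts collected for this word across all mappers
  emit(key, sum(unlist(values)))
}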

Page 19: ENAR short course

Key Ideas about Map-Reduce

[Diagram] Big Data → Partition 1, Partition 2, …, Partition N
Each Partition i → Mapper i → <Key, Value> pairs
<Key, Value> pairs → Reducer 1, Reducer 2, …, Reducer M
Each Reducer j → Output j

Page 20: ENAR short course

Key Ideas about Map-Reduce
• Data are split into partitions and stored on many different machines on disk (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
• Data with the same key are sent to the same reducer; one reducer can receive multiple keys
• Every reducer sorts its data by key
• For each key, the reducer processes the corresponding values according to the customized reducer function and outputs the result

Page 21: ENAR short course

Compute Mean for Each Group

ID | Group No. | Score
 1 |     1     |  0.5
 2 |     3     |  1.0
 3 |     1     |  0.8
 4 |     2     |  0.7
 5 |     2     |  1.5
 6 |     3     |  1.2
 7 |     1     |  0.8
 8 |     2     |  0.9
 9 |     4     |  1.3
 … |     …     |   …

Page 22: ENAR short course

Key Ideas about Map-Reduce
• Data are split into partitions and stored on many different machines on disk (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
  – For each row: Key = Group No., Value = Score
• Data with the same key are sent to the same reducer; one reducer can receive multiple keys
  – E.g. 2 reducers: Reducer 1 receives data with key = 1, 2; Reducer 2 receives data with key = 3, 4
• Every reducer sorts its data by key
  – E.g. Reducer 1: <key = 1, values = [0.5, 0.8, 0.8]>, <key = 2, values = [0.7, 1.5, 0.9]>
• For each key, the reducer processes the corresponding values according to the customized reducer function and outputs the result
  – E.g. Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)>

Page 23: ENAR short course

Key Ideas about Map-Reduce

[Same slide as above, with the Mapper's <Key, Value> emission and the customized Reducer function highlighted as what you need to implement]

Page 24: ENAR short course

Pseudo Code (in R)

Mapper:
Input: Data
for (row in Data) {
  groupNo = row$groupNo
  score   = row$score
  Output(c(groupNo, score))      # emit <key = group, value = score>
}

Reducer:
Input: Key (groupNo), Value (a list of scores that belong to the Key)
count = 0
sum   = 0
for (v in Value) {
  sum   = sum + v
  count = count + 1
}
Output(c(Key, sum / count))      # emit <group, group mean>
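For intuition only, the same computation can be done locally in R in one line, echoing the earlier remark that this is similar to R's table() function (the data frame d and its column names are illustrative, chosen to match the pseudocode above):

d <- data.frame(groupNo = c(1, 3, 1, 2, 2, 3, 1, 2, 4),
                score   = c(0.5, 1.0, 0.8, 0.7, 1.5, 1.2, 0.8, 0.9, 1.3))
tapply(d$score, d$groupNo, mean)   # mean score per group, as the reducer would output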

Page 25: ENAR short course

Exercise  1  

• Problem: Average height per {Grade, Gender}?
• What should be the mapper output key?
• What should be the mapper output value?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this.

Student ID | Grade | Gender | Height (cm)
     1     |   3   |   M    |    120
     2     |   2   |   F    |    115
     3     |   2   |   M    |    116
     …     |   …   |   …    |     …

Page 26: ENAR short course

• Problem: Average height per Grade and Gender?
• What should be the mapper output key?
  – {Grade, Gender}
• What should be the mapper output value?
  – Height
• What is the reducer input?
  – Key: {Grade, Gender}, Value: list of Heights
• What is the reducer output?
  – {Grade, Gender, mean(Heights)}
(A sketch of the mapper and reducer follows the table below.)

Student ID | Grade | Gender | Height (cm)
     1     |   3   |   M    |    120
     2     |   2   |   F    |    115
     3     |   2   |   M    |    116
     …     |   …   |   …    |     …
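One possible sketch in the same pseudocode style as before (Output() and the lower-case field names are placeholders, not a real API):

Mapper:
for (row in Data) {
  key = paste(row$grade, row$gender, sep = ",")   # composite key {Grade, Gender}
  Output(c(key, row$height))                      # value = Height
}

Reducer:
Input: Key ({Grade, Gender}), Value (list of heights for that key)
Output(c(Key, mean(unlist(Value))))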

Page 27: ENAR short course

Exercise  2  

• Problem: Number of students per {Grade, Gender}?
• What should be the mapper output key?
• What should be the mapper output value?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this.

Student ID | Grade | Gender | Height (cm)
     1     |   3   |   M    |    120
     2     |   2   |   F    |    115
     3     |   2   |   M    |    116
     …     |   …   |   …    |     …

Page 28: ENAR short course

• Problem: Number of students per {Grade, Gender}?
• What should be the mapper output key?
  – {Grade, Gender}
• What should be the mapper output value?
  – 1
• What is the reducer input?
  – Key: {Grade, Gender}, Value: list of 1's
• What is the reducer output?
  – {Grade, Gender, sum(value list)}
  – OR: {Grade, Gender, length(value list)}

Student ID | Grade | Gender | Height (cm)
     1     |   3   |   M    |    120
     2     |   2   |   F    |    115
     3     |   2   |   M    |    116
     …     |   …   |   …    |     …

Page 29: ENAR short course

More on Map-Reduce
• Depends on distributed file systems
• Typically the mappers are the data storage nodes
• Map/Reduce tasks automatically get restarted when they fail (good fault tolerance)
• Map and Reduce I/O are all on disk
  – Data transmission from mappers to reducers is through disk copy
• Iterative processing through Map-Reduce
  – Each iteration becomes a Map-Reduce job
  – Can be expensive since Map-Reduce overhead is high

Page 30: ENAR short course

The  Apache  Hadoop  System  

• Open-source software for reliable, scalable, distributed computing
• The most popular distributed computing system in the world
• Key modules:
  – Hadoop Distributed File System (HDFS)
  – Hadoop YARN (job scheduling and cluster resource management)
  – Hadoop MapReduce

Page 31: ENAR short course

Major Tools on Hadoop
• Pig
  – A high-level language for Map-Reduce computation
• Hive
  – A SQL-like query language for data querying via Map-Reduce
• HBase
  – A distributed & scalable database on Hadoop
  – Allows random, real-time read/write access to big data
  – Voldemort is similar to HBase
• Mahout
  – A scalable machine learning library

•  …  

Page 32: ENAR short course

Hadoop Installation

• Setting up Hadoop on your desktop/laptop:
  – http://hadoop.apache.org/docs/stable/single_node_setup.html
• Setting up Hadoop on a cluster of machines:
  – http://hadoop.apache.org/docs/stable/cluster_setup.html

Page 33: ENAR short course

Hadoop  Distributed  File  System  (HDFS)  

• Master/Slave architecture
• NameNode: a single master node that controls which data block is stored where
• DataNodes: slave nodes that store data and do R/W operations
• Clients (Gateway): allow users to log in, interact with HDFS, and submit Map-Reduce jobs
• Big data is split into equal-sized blocks; each block can be stored on different DataNodes
• Disk failure tolerance: data is replicated multiple times

Page 34: ENAR short course

Load the Data into Pig
• A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID: int, groupNo: int, score: float);
  – The path of the data on HDFS goes after LOAD
• USING PigStorage() means the data are delimited by tab (can be omitted)
• If data are delimited by other characters, e.g. space, use USING PigStorage(' ')
• The data schema is defined after AS
• Variable types: int, long, float, double, chararray, …

Page 35: ENAR short course

Structure  of  This  Tutorial  

• Part I: Introduction to Map-Reduce and the Hadoop System
  – Overview of Distributed Computing
  – Introduction to Map-Reduce
  – Introduction to the Hadoop System
  – Examples of Statistical Computing for Big Data
    • Bag of Little Bootstraps
    • Large Scale Logistic Regression

Page 36: ENAR short course

Bag of Little Bootstraps

Kleiner  et  al.  2012  

Page 37: ENAR short course

Bootstrap (Efron, 1979)
• A re-sampling based method to obtain the statistical distribution of sample estimators
• Why are we interested?
  – Re-sampling is embarrassingly parallelizable
• For example: standard deviation of the mean of N samples (μ)
  – For i = 1 to r do
    • Randomly sample with replacement N times from the original sample -> bootstrap data i
    • Compute the mean of the i-th bootstrap data -> μi
  – Estimate of Sd(μ) = Sd([μ1, …, μr])
  – r is usually a large number, e.g. 200
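A minimal single-machine R sketch of this loop (the sample x and the value of r here are illustrative):

set.seed(1)
x <- rnorm(1000)                                          # original sample of size N
r <- 200                                                  # number of bootstrap replicates
boot_means <- replicate(r, mean(sample(x, replace = TRUE)))
sd(boot_means)                                            # bootstrap estimate of Sd(mu)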

Page 38: ENAR short course

Bootstrap  for  Big  Data  

• Can have r nodes running in parallel, each sampling one bootstrap data set

• However…
  – N can be very large
  – Data may not fit into memory
  – Collecting N samples with replacement on each node can be computationally expensive

Page 39: ENAR short course

M out of N Bootstrap (Bickel et al. 1997)

• Obtain Sd_M(μ) by sampling M samples with replacement for each bootstrap, where M < N
• Apply an analytical correction to Sd_M(μ) to obtain Sd(μ), using prior knowledge of the convergence rate of sample estimates
• However…
  – Prior knowledge is required
  – The choice of M is critical to performance
  – Finding the optimal value of M needs more computation

Page 40: ENAR short course

Bag of Little Bootstraps (BLB)
• Example: standard deviation of the mean
• Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)
• For each data set p = 1 to S do
  – For i = 1 to r do
    • Draw N samples with replacement from the data of size b
    • Compute the mean of the resampled data -> μpi
  – Compute Sdp(μ) = Sd([μp1, …, μpr])
• Estimate of Sd(μ) = Avg([Sd1(μ), …, SdS(μ)])

Page 41: ENAR short course

Bag of Little Bootstraps (BLB)
• Interest: ξ(θ), where θ is an estimate obtained from data of size N
  – ξ is some function of θ, such as the standard deviation, …
• Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)
• For each data set p = 1 to S do
  – For i = 1 to r do
    • Draw N samples with replacement from the data of size b
    • Compute the estimate from the resampled data -> θpi
  – Compute ξp(θ) = ξ([θp1, …, θpr])
• Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])

Page 42: ENAR short course

Bag of Little Bootstraps (BLB)

[Same algorithm as the previous slide, annotated with which steps run in the Mappers and Reducers and which run on the Gateway]

Page 43: ENAR short course

Why  is  BLB  Efficient  

• Before:
  – N samples with replacement from data of size N is expensive when N is large

• Now:
  – N samples with replacement from data of size b
  – b can be several orders of magnitude smaller than N (e.g. b = N^γ, γ in [0.5, 1))
  – Equivalent to: a multinomial sampler with dim = b
  – Storage = O(b), computational complexity = O(b)
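A compact R sketch of BLB for the standard deviation of the mean, using the multinomial-weight trick just mentioned (all sizes and parameter choices here are illustrative):

set.seed(1)
x <- rnorm(1e5)                        # stand-in for the full data of size N
N <- length(x); b <- floor(N^0.7)      # subset size b = N^gamma
S <- 5; r <- 100
sd_p <- sapply(1:S, function(p) {
  xb <- sample(x, b)                   # subset of size b, sampled without replacement
  mu_pi <- replicate(r, {
    w <- rmultinom(1, N, rep(1/b, b))  # counts of N resampled points over the b values
    sum(w * xb) / N                    # weighted mean = mean of the N resampled points
  })
  sd(mu_pi)                            # Sd_p(mu)
})
mean(sd_p)                             # BLB estimate of Sd(mu)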

Page 44: ENAR short course

Simulation Experiment

• 95% CI of logistic regression coefficients
• N = 20000, 10 explanatory variables
• Relative Error = |Estimated CI width - True CI width| / True CI width
• BLB-γ: BLB with b = N^γ
• BOFN-γ: b out of N sampling with b = N^γ
• BOOT: Naïve bootstrap

Page 45: ENAR short course

Simulation Experiment

Page 46: ENAR short course

Real  Data  

• 95% CI of logistic regression coefficients
• N = 6M, 3000 explanatory variables
• Data size = 150GB, r = 50, s = 5, γ = 0.7

Page 47: ENAR short course

Summary  of  BLB  

•  A  new  algorithm  for  bootstrapping  on  big  data  

• Advantages
  – Fast and efficient
  – Easy to parallelize
  – Easy to understand and implement
  – Friendly to Hadoop; makes it routine to perform statistical calculations on Big Data

Page 48: ENAR short course

Large Scale Logistic Regression

Page 49: ENAR short course

Logistic Regression
• Binary response: Y
• Covariates: X
• Yi ~ Bernoulli(pi)
• log(pi / (1 - pi)) = Xi'β;  β ~ MVN(0, (1/λ) I)
• Widely used (research and applications)

Page 50: ENAR short course

Large Scale Logistic Regression
• Binary response: Y
  – E.g., click / non-click on an ad on a webpage
• Covariates: X
  – User covariates:
    • Age, gender, industry, education, job, job title, …
  – Item covariates:
    • Categories, keywords, topics, …
  – Context covariates:
    • Time, page type, position, …
  – 2-way interactions:
    • User covariates x item covariates
    • Context covariates x item covariates
    • …

Page 51: ENAR short course

Computational Challenge

• Hundreds of millions/billions of observations
• Hundreds of thousands/millions of covariates
• Fitting such a logistic regression model on a single machine is not feasible
• Model fitting is iterative, using methods like gradient descent, Newton's method, etc.
  – Multiple passes over the data

Page 52: ENAR short course

Recap on Optimization Methods

• Problem: find x to minimize F(x)
• Iteration n: xn = xn-1 - bn-1 F'(xn-1)
• bn-1 is the step size, which can change every iteration
• Iterate until convergence
• Conjugate gradient, L-BFGS, Newton trust region, … are all of this kind
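To make the iteration concrete, a tiny R sketch of gradient descent with a fixed step size on a simple quadratic (both the function and the step size are illustrative):

grad_F <- function(x) 2 * (x - 3)      # gradient of F(x) = (x - 3)^2
x <- 0; b <- 0.1                       # starting point and step size
for (n in 1:100) {
  x <- x - b * grad_F(x)               # x_n = x_{n-1} - b_{n-1} * F'(x_{n-1})
}
x                                      # approaches the minimizer, 3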

Page 53: ENAR short course

Iterative Process with Hadoop

[Diagram] Disk → Mappers → Disk → Reducers → Disk → Mappers → Disk → Reducers → … (each iteration reads from and writes to disk)

Page 54: ENAR short course

Limitations of Hadoop for fitting a big logistic regression

• The iterative process is expensive and slow
• Every iteration = a Map-Reduce job
• I/O of mappers and reducers is all through disk
• Plus: time spent waiting in the job queue
• Q: Can we find a fitting method that scales with Hadoop?

Page 55: ENAR short course

Large Scale Logistic Regression
• Naïve:
  – Partition the data and run logistic regression for each partition
  – Take the mean of the learned coefficients
  – Problem: not guaranteed to converge to the model from a single machine!
• Alternating Direction Method of Multipliers (ADMM)
  – Boyd et al. 2011
  – Set up constraints: each partition's coefficient = global consensus
  – Solve the optimization problem using Lagrange multipliers
  – Advantage: guaranteed to converge to the single-machine logistic regression on the entire data in a reasonable number of iterations

 

Page 56: ENAR short course

Large Scale Logistic Regression via ADMM

[Diagram] BIG DATA → Partition 1, Partition 2, Partition 3, …, Partition K
Each partition runs a Logistic Regression; the results feed a Consensus Computation (Iteration 1)

Page 57: ENAR short course

Large Scale Logistic Regression via ADMM

[Diagram repeated: Iteration 1]

Page 58: ENAR short course

Large Scale Logistic Regression via ADMM

[Diagram repeated: Iteration 2]

Page 59: ENAR short course

Details  of  ADMM  

Page 60: ENAR short course

Dual  Ascent  Method  

• Consider a convex optimization problem: minimize f(x) subject to Ax = b
• Lagrangian for the problem: L(x, y) = f(x) + y'(Ax - b)
• Dual Ascent: x^{k+1} := argmin_x L(x, y^k);  y^{k+1} := y^k + α^k (Ax^{k+1} - b)

2 Precursors

In this section, we briefly review two optimization algorithms that are precursors to the alternating direction method of multipliers. While we will not use this material in the sequel, it provides some useful background and motivation.

2.1 Dual Ascent

Consider the equality-constrained convex optimization problem

  minimize f(x)
  subject to Ax = b,                                      (2.1)

with variable x in R^n, where A in R^{m x n} and f : R^n -> R is convex. The Lagrangian for problem (2.1) is

  L(x, y) = f(x) + y'(Ax - b)

and the dual function is

  g(y) = inf_x L(x, y) = -f*(-A'y) - b'y,

where y is the dual variable or Lagrange multiplier, and f* is the convex conjugate of f; see [20, §3.3] or [140, §12] for background. The dual problem is

  maximize g(y),

with variable y in R^m. Assuming that strong duality holds, the optimal values of the primal and dual problems are the same. We can recover a primal optimal point x* from a dual optimal point y* as

  x* = argmin_x L(x, y*),

provided there is only one minimizer of L(x, y*). (This is the case if, e.g., f is strictly convex.) In the sequel, we will use the notation argmin_x F(x) to denote any minimizer of F, even when F does not have a unique minimizer.

In the dual ascent method, we solve the dual problem using gradient ascent. Assuming that g is differentiable, the gradient ∇g(y) can be evaluated as follows. We first find x+ = argmin_x L(x, y); then we have ∇g(y) = Ax+ - b, which is the residual for the equality constraint. The dual ascent method consists of iterating the updates

  x^{k+1} := argmin_x L(x, y^k)                           (2.2)
  y^{k+1} := y^k + α^k (Ax^{k+1} - b),                    (2.3)

where α^k > 0 is a step size, and the superscript is the iteration counter. The first step (2.2) is an x-minimization step, and the second step (2.3) is a dual variable update. The dual variable y can be interpreted as a vector of prices, and the y-update is then called a price update or price adjustment step. This algorithm is called dual ascent since, with appropriate choice of α^k, the dual function increases in each step, i.e., g(y^{k+1}) > g(y^k).

The dual ascent method can be used even in some cases when g is not differentiable. In this case, the residual Ax^{k+1} - b is not the gradient of g, but the negative of a subgradient of -g. This case requires a different choice of the α^k than when g is differentiable, and convergence is not monotone; it is often the case that g(y^{k+1}) > g(y^k) does not hold. In this case, the algorithm is usually called the dual subgradient method [152].

If α^k is chosen appropriately and several other assumptions hold, then x^k converges to an optimal point and y^k converges to an optimal dual point.

Page 61: ENAR short course

Augmented Lagrangians
• Bring robustness to the dual ascent method
• Yield convergence without assumptions like strict convexity or finiteness of f
• L_ρ(x, y) = f(x) + y'(Ax - b) + (ρ/2)||Ax - b||₂²
• The value of ρ influences the convergence rate

… collected (gathered) in order to compute the residual Ax^{k+1} - b. Once the (global) dual variable y^{k+1} is computed, it must be distributed (broadcast) to the processors that carry out the N individual x_i minimization steps (2.4).

Dual decomposition is an old idea in optimization, and traces back at least to the early 1960s. Related ideas appear in well known work by Dantzig and Wolfe [44] and Benders [13] on large-scale linear programming, as well as in Dantzig's seminal book [43]. The general idea of dual decomposition appears to be originally due to Everett [69], and is explored in many early references [107, 84, 117, 14]. The use of nondifferentiable optimization, such as the subgradient method, to solve the dual problem is discussed by Shor [152]. Good references on dual methods and decomposition include the book by Bertsekas [16, chapter 6] and the survey by Nedic and Ozdaglar [131] on distributed optimization, which discusses dual decomposition methods and consensus problems. A number of papers also discuss variants on standard dual decomposition, such as [129].

More generally, decentralized optimization has been an active topic of research since the 1980s. For instance, Tsitsiklis and his co-authors worked on a number of decentralized detection and consensus problems involving the minimization of a smooth function f known to multiple agents [160, 161, 17]. Some good reference books on parallel optimization include those by Bertsekas and Tsitsiklis [17] and Censor and Zenios [31]. There has also been some recent work on problems where each agent has its own convex, potentially nondifferentiable, objective function [130]. See [54] for a recent discussion of distributed methods for graph-structured optimization problems.

2.3 Augmented Lagrangians and the Method of Multipliers

Augmented Lagrangian methods were developed in part to bring robustness to the dual ascent method, and in particular, to yield convergence without assumptions like strict convexity or finiteness of f. The augmented Lagrangian for (2.1) is

  L_ρ(x, y) = f(x) + y'(Ax - b) + (ρ/2)||Ax - b||₂²,      (2.6)

where ρ > 0 is called the penalty parameter. (Note that L_0 is the standard Lagrangian for the problem.) The augmented Lagrangian can be viewed as the (unaugmented) Lagrangian associated with the problem

  minimize f(x) + (ρ/2)||Ax - b||₂²
  subject to Ax = b.

This problem is clearly equivalent to the original problem (2.1), since for any feasible x the term added to the objective is zero. The associated dual function is g_ρ(y) = inf_x L_ρ(x, y).

The benefit of including the penalty term is that g_ρ can be shown to be differentiable under rather mild conditions on the original problem. The gradient of the augmented dual function is found the same way as with the ordinary Lagrangian, i.e., by minimizing over x, and then evaluating the resulting equality constraint residual. Applying dual ascent to the modified problem yields the algorithm

  x^{k+1} := argmin_x L_ρ(x, y^k)                         (2.7)
  y^{k+1} := y^k + ρ(Ax^{k+1} - b),                       (2.8)

which is known as the method of multipliers for solving (2.1). This is the same as standard dual ascent, except that the x-minimization step uses the augmented Lagrangian, and the penalty parameter ρ is used as the step size α^k. The method of multipliers converges under far more general conditions than dual ascent, including cases when f takes on the value +∞ or is not strictly convex.

It is easy to motivate the choice of the particular step size ρ in the dual update (2.8). For simplicity, we assume here that f is differentiable, though this is not required for the algorithm to work. The optimality conditions for (2.1) are primal and dual feasibility, i.e.,

  Ax* - b = 0,   ∇f(x*) + A'y* = 0,

respectively. By definition, x^{k+1} minimizes L_ρ(x, y^k), so

  0 = ∇_x L_ρ(x^{k+1}, y^k)
    = ∇f(x^{k+1}) + A'(y^k + ρ(Ax^{k+1} - b))
    = ∇f(x^{k+1}) + A'y^{k+1}.

Page 62: ENAR short course

Alternating Direction Method of Multipliers (ADMM)

• Problem: minimize f(x) + g(z) subject to Ax + Bz = c
• Augmented Lagrangian: L_ρ(x, z, y) = f(x) + g(z) + y'(Ax + Bz - c) + (ρ/2)||Ax + Bz - c||₂²
• ADMM: x^{k+1} := argmin_x L_ρ(x, z^k, y^k);  z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k);  y^{k+1} := y^k + ρ(Ax^{k+1} + Bz^{k+1} - c)

3 Alternating Direction Method of Multipliers

3.1 Algorithm

ADMM is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers. The algorithm solves problems in the form

  minimize f(x) + g(z)
  subject to Ax + Bz = c                                  (3.1)

with variables x in R^n and z in R^m, where A in R^{p x n}, B in R^{p x m}, and c in R^p. We will assume that f and g are convex; more specific assumptions will be discussed in §3.2. The only difference from the general linear equality-constrained problem (2.1) is that the variable, called x there, has been split into two parts, called x and z here, with the objective function separable across this splitting. The optimal value of the problem (3.1) will be denoted by

  p* = inf { f(x) + g(z) | Ax + Bz = c }.

As in the method of multipliers, we form the augmented Lagrangian

  L_ρ(x, z, y) = f(x) + g(z) + y'(Ax + Bz - c) + (ρ/2)||Ax + Bz - c||₂².

ADMM consists of the iterations

  x^{k+1} := argmin_x L_ρ(x, z^k, y^k)                    (3.2)
  z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k)                (3.3)
  y^{k+1} := y^k + ρ(Ax^{k+1} + Bz^{k+1} - c),            (3.4)

where ρ > 0. The algorithm is very similar to dual ascent and the method of multipliers: it consists of an x-minimization step (3.2), a z-minimization step (3.3), and a dual variable update (3.4). As in the method of multipliers, the dual variable update uses a step size equal to the augmented Lagrangian parameter ρ.

The method of multipliers for (3.1) has the form

  (x^{k+1}, z^{k+1}) := argmin_{x,z} L_ρ(x, z, y^k)
  y^{k+1} := y^k + ρ(Ax^{k+1} + Bz^{k+1} - c).

Here the augmented Lagrangian is minimized jointly with respect to the two primal variables. In ADMM, on the other hand, x and z are updated in an alternating or sequential fashion, which accounts for the term alternating direction. ADMM can be viewed as a version of the method of multipliers where a single Gauss-Seidel pass [90, §10.1] over x and z is used instead of the usual joint minimization. Separating the minimization over x and z into two steps is precisely what allows for decomposition when f or g are separable.

The algorithm state in ADMM consists of z^k and y^k. In other words, (z^{k+1}, y^{k+1}) is a function of (z^k, y^k). The variable x^k is not part of the state; it is an intermediate result computed from the previous state (z^{k-1}, y^{k-1}).

If we switch (re-label) x and z, f and g, and A and B in the problem (3.1), we obtain a variation on ADMM with the order of the x-update step (3.2) and z-update step (3.3) reversed. The roles of x and z are almost symmetric, but not quite, since the dual update is done after the z-update but before the x-update.

Page 63: ENAR short course

Large Scale Logistic Regression via ADMM

• Notation
  – (X_i, y_i): data in the i-th partition
  – β_i: coefficient vector for partition i
  – β: consensus coefficient vector
  – r(β): penalty component such as ||β||₂²
• Optimization problem

  minimize   Σ_{i=1}^{N} l_i(y_i, X_i' β_i) + r(β)
  subject to β_i = β,  i = 1, …, N

Page 64: ENAR short course

ADMM  updates  

LOCAL REGRESSIONS (shrinkage towards the current best global estimate)

UPDATED CONSENSUS

Page 65: ENAR short course

An example implementation

• ADMM for logistic regression model fitting with an L2/L1 penalty
• Each iteration of ADMM is a Map-Reduce job
  – Mapper: partition the data into K partitions
  – Reducer: for each partition, use liblinear/glmnet to fit an L1/L2 logistic regression
  – Gateway: consensus computation from the results of all reducers; sends the consensus back to each reducer node
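A rough single-machine R sketch of these consensus-ADMM updates for L2-penalized logistic regression; optim() stands in for liblinear/glmnet, the data are simulated, and the number of partitions, penalty λ, and ρ are illustrative choices, not the values used in any deployed system:

set.seed(1)
n <- 4000; p <- 5; K <- 4                       # illustrative sizes
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(1, -1, 0.5, 0, 2)
y <- rbinom(n, 1, plogis(X %*% beta_true))
parts <- split(1:n, rep(1:K, length.out = n))   # K data partitions

rho <- 1; lambda <- 1
B <- matrix(0, K, p)                            # per-partition coefficients beta_k
U <- matrix(0, K, p)                            # scaled dual variables u_k
z <- rep(0, p)                                  # consensus coefficients

neg_loglik <- function(b, Xk, yk) {
  eta <- Xk %*% b
  sum(log(1 + exp(eta)) - yk * eta)
}

for (iter in 1:20) {
  # "Reducer" step: each partition fits a local penalized logistic regression
  for (k in 1:K) {
    idx <- parts[[k]]
    obj <- function(b) neg_loglik(b, X[idx, ], y[idx]) +
      (rho / 2) * sum((b - z + U[k, ])^2)
    B[k, ] <- optim(B[k, ], obj, method = "BFGS")$par
  }
  # "Gateway" step: consensus update for the L2 penalty (lambda/2) * ||z||^2
  z <- (K * rho) * colMeans(B + U) / (lambda + K * rho)
  # Dual update
  U <- U + B - matrix(z, K, p, byrow = TRUE)
}
z                                               # consensus estimate of beta

Each pass of the outer loop corresponds to one Map-Reduce job in the Hadoop implementation described above.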

Page 66: ENAR short course

KDD  CUP  2010  Data  

• Bridge to Algebra 2008-2009 data from https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
• Binary response, 20M covariates
• Only keep covariates with >= 10 occurrences => 2.2M covariates
• Training data: 8,407,752 samples
• Test data: 510,302 samples

Page 67: ENAR short course

Avg Training Log-likelihood vs Number of Iterations

Page 68: ENAR short course

Test AUC vs Number of Iterations

Page 69: ENAR short course

Better Convergence Can Be Achieved By

• Better initialization
  – Use results from the Naïve method to initialize the parameters
• Adaptively changing the step size (ρ) for each iteration based on the convergence status of the consensus

Page 70: ENAR short course

Recommender Problems for Web Applications

Page 71: ENAR short course

Agenda
• Topic of Interest
  – Recommender problems for dynamic, time-sensitive applications
    • Content Optimization, Online Advertising, Movie recommendation, shopping, …
• Introduction
• Offline components
  – Regression, Collaborative filtering (CF), …
• Online components + initialization
  – Time-series, online/incremental methods, explore/exploit (bandit)
• Evaluation methods + Multi-Objective
• Challenges

Page 72: ENAR short course

Three components we will focus on
• Defining the problem
  – Formulate objectives whose optimization achieves some long-term goals for the recommender system
    • E.g. How to serve content to optimize audience reach and engagement, or to optimize some combination of engagement and revenue?
• Modeling (to estimate some critical inputs)
  – Predict rates of some positive user interaction(s) with items based on data obtained from historical user-item interactions
    • E.g. click rates, average time spent on page, etc.
    • Could be explicit feedback like ratings
• Experimentation
  – Create experiments to collect data proactively to improve models; helps in converging to the best choice(s) cheaply and rapidly
    • Explore and Exploit (continuous experimentation)
    • DOE (testing hypotheses by avoiding bias inherent in data)

Page 73: ENAR short course

Modern Recommendation Systems

• Goal
  – Serve the right item to a user in a given context to optimize long-term business objectives
• A scientific discipline that involves
  – Large scale Machine Learning & Statistics
    • Offline Models (capture global & stable characteristics)
    • Online Models (incorporate dynamic components)
    • Explore/Exploit (active and adaptive experimentation)
  – Multi-Objective Optimization
    • Click-rates (CTR), engagement, advertising revenue, diversity, etc.
  – Inferring user interest
    • Constructing User Profiles
  – Natural Language Processing to understand content
    • Topics, "aboutness", entities, follow-up of something, breaking news, …

Page 74: ENAR short course

Some examples from content optimization

• Simple version
  – I have a content module on my page; the content inventory is obtained from a third-party source and further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to improve the overall click-rate (CTR) on this module
• More advanced
  – I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. advertising revenue). Can I increase downstream utility without losing too many clicks?
• Highly advanced
  – There are multiple modules running on my webpage. How do I perform a simultaneous optimization?

Page 75: ENAR short course

Recommend applications

Recommend search queries

Recommend news article

Recommend packages: image, title, summary, links to other pages

Pick 4 out of a pool of K; K = 20 ~ 40; dynamic; routes traffic to other pages

Page 76: ENAR short course

Problems in this example
• Optimize CTR on multiple modules
  – Today Module, Trending Now, Personal Assistant, News
  – Simple solution: treat modules as independent, optimize separately. May not be the best when there are strong correlations.
• For any single module
  – Optimize some combination of CTR, downstream engagement, and perhaps advertising revenue.

Page 77: ENAR short course

Online Advertising

[Diagram] Advertisers → Ads → Ad Network → Recommend best ad(s) → Page (Publisher) → User
The ML/statistical model estimates response rates (click, conversion, ad-view); together with bids, the auction selects argmax f(bid, response rates).

Examples: Yahoo, Google, MSN, …
Ad exchanges (RightMedia, DoubleClick, …)

Page 78: ENAR short course

LinkedIn Today: Content Module

Objective: Serve content to maximize engagement metrics like CTR (or weighted CTR)

Page 79: ENAR short course

LinkedIn Ads: Match ads to users visiting LinkedIn

Page 80: ENAR short course

Right Media Ad Exchange: Unified Marketplace

Match ads to page views on publisher sites

[Diagram] A publisher has an ad impression to sell: AUCTIONS
Bids: $0.50; $0.75 via a Network (which becomes a $0.45 bid); $0.60; $0.65 (WINS!)
Participants include AdSense, Ad.com, …

Page 81: ENAR short course

Recommender problems in general

[Diagram] USER visits, with Context (query, page, …) and an Item Inventory (articles, web pages, ads, …)
Use an automated algorithm to select item(s) to show
Get feedback (click, time spent, …)
Refine the models
Repeat (a large number of times); optimize metric(s) of interest (total clicks, total revenue, …)

Example applications: Search (Web, Vertical), Online Advertising, Content, …

Page 82: ENAR short course

Important Factors

• Items: Articles, ads, modules, movies, users, updates, etc.
• Context: query keywords, pages, mobile, social media, etc.
• Metric to optimize (e.g., relevance score, CTR, revenue, engagement)
  – Currently, most applications are single-objective
  – Could be multi-objective optimization (maximize X subject to Y, Z, …)
• Properties of the item pool
  – Size (e.g., all web pages vs. 40 stories)
  – Quality of the pool (e.g., anything vs. editorially selected)
  – Lifetime (e.g., mostly old items vs. mostly new items)

Page 83: ENAR short course

Factors affecting Solution (continued)

• Properties of the context
  – Pull: specified by explicit, user-driven query (e.g., keywords, a form)
  – Push: specified by implicit context (e.g., a page, a user, a session)
• Most applications are somewhere on the continuum of pull and push
• Properties of the feedback on the matches made
  – Types and semantics of feedback (e.g., click, vote)
  – Latency (e.g., available in 5 minutes vs. 1 day)
  – Volume (e.g., 100K per day vs. 300M per day)
• Constraints specifying legitimate matches
  – e.g., business rules, diversity rules, editorial voice
  – Multiple objectives

•  Available Metadata (e.g., link graph, various user/item attributes)

Page 84: ENAR short course

Predicting User-Item Interactions (e.g. CTR)

• Myth: we have so much data on the web that, if we can only process it, the problem is solved
  – Number of things to learn increases with sample size
    • Rate of increase is not slow
  – Dynamic nature of systems makes things worse
  – We want to learn things quickly and react fast
• Data is sparse in web recommender problems
  – We lack enough data to learn all we want to learn, and as quickly as we would like to learn it
  – Several power laws interacting with each other
    • E.g. user-visits power law, items-served power law
  – Bivariate Zipf: Owen & Dyer, 2011

Page 85: ENAR short course

Can Machine Learning help?
• Fortunately, there are group behaviors that generalize to individuals & they are relatively stable
  – E.g. users in San Francisco tend to read more baseball news
• Key issue: estimating such groups
  – Coarse group: more stable but does not generalize that well
  – Granular group: less stable, with few individuals
  – Getting a good grouping structure means hitting the "sweet spot"
• Another big advantage on the web
  – Intervene and run small experiments on a small population to collect data that helps rapid convergence to the best choice(s)
    • We don't need to learn all user-item interactions, only those that are good.

Page 86: ENAR short course

Predicting user-item interaction rates

[Diagram]
Feature construction: Content (IR, clustering, taxonomy, entity, …); User profiles (clicks, views, social, community, …)
Offline models (capture stable characteristics at coarse resolutions; logistic, boosting, …)
  → initialize →
Near Online models (finer-resolution corrections at the item/user level; quick updates)
Explore/Exploit (adaptive sampling; helps rapid convergence to the best choices)

Page 87: ENAR short course

Post-click: An example in Content Optimization

[Diagram] Recommender and EDITORIAL content on the front page; clicks on FP links influence the downstream supply distribution; AD SERVER / DISPLAY ADVERTISING (revenue); downstream engagement (time spent)

Page 88: ENAR short course

Serving Content on Front Page: Click Shaping

• What do we want to optimize?
• Current: maximize clicks (maximize downstream supply from FP)
• But consider the following
  – Article 1: CTR = 5%, utility per click = 5
  – Article 2: CTR = 4.9%, utility per click = 10
• By promoting Article 2, we lose 1 click/100 visits, gain 5 utils
• If we do this for a large number of visits: lose some clicks but obtain significant gains in utility?
  – E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement, etc.)

Page 89: ENAR short course

High level picture

[Diagram] http request → Server → Item Recommendation system (thousands of computations in sub-seconds) → User interacts (e.g. clicks, or does nothing)
Statistical models updated in batch mode: e.g. once every 30 mins

Page 90: ENAR short course

High level overview: Item Recommendation System

[Diagram] Inputs: User Info (updated in batch: activity, profile); Item Index (id, meta-data; pre-filter: SPAM, editorial, …; feature extraction: NLP, clustering, …); user-item interaction data (batch process)
ML/Statistical Models → Score Items: P(click), P(share), semantic-relevance score, …
Rank Items: sort by score (CTR, bid*CTR, …), combine scores using multi-objective optimization, threshold on some scores, …

Page 91: ENAR short course

ML/Statistical models for scoring

[Chart: number of items scored by ML (100 to 100M) against traffic volume and item lifetime (few hours, few days, several days), with LinkedIn Today, Yahoo! Front Page, Right Media Ad exchange, and LinkedIn Ads placed along this spectrum]

Page 92: ENAR short course

Summary of deployments
• Yahoo! Front page Today Module (2008-2011): 300% improvement in click-through rates
  – Similar algorithms delivered via a self-serve platform, adopted by several Yahoo! Properties (2011): significant improvement in engagement across the Yahoo! Network
• Fully deployed on LinkedIn Today Module (2012): significant improvement in click-through rates (numbers not revealed due to reasons of confidentiality)
• Yahoo! RightMedia exchange (2012): fully deployed algorithms to estimate response rates (CTR, conversion rates). Significant improvement in revenue (numbers not revealed due to reasons of confidentiality)
• LinkedIn self-serve ads (2012-2013): fully deployed
• LinkedIn News Feed (2013-2014): fully deployed
• Several others in progress…

Page 93: ENAR short course

Broad Themes
• Curse of dimensionality
  – Large number of observations (rows), large number of potential features (columns)
  – Use domain knowledge and machine learning to reduce the "effective" dimension (constraints on parameters reduce degrees of freedom)
    • I will give examples as we move along
• We often assume our job is to analyze "Big Data", but we often have control over what data to collect through clever experimentation
  – This can fundamentally change solutions
• Think of computation and models together for Big Data
• Optimization: what we are trying to optimize is often complex; models need to work in harmony with the optimization
  – Pareto optimality with competing objectives

Page 94: ENAR short course

Statistical Problem
• Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest
• Examples of utility functions
  – Click-rates (CTR)
  – Share-rates (CTR * P(Share | Click))
  – Revenue per page-view = CTR * bid (more complex due to the second-price auction)
• CTR is a fundamental measure that opens the door to a more principled approach to rank items
• Converge rapidly to maximum-utility items
  – Sequential decision making process (explore/exploit)

 

Page 95: ENAR short course

[Diagram] User i with user features (e.g., industry, behavioral features, demographic features, …) visits; the algorithm selects item j from a set of candidates; response y_ij (click or not)

Which item should we select?
• The item with the highest predicted CTR: Exploit
• An item for which we need data to predict its CTR: Explore

LinkedIn Today, Yahoo! Today Module: choose items to maximize CTR. This is an "Explore/Exploit" problem.

Page 96: ENAR short course

The Explore/Exploit Problem (to maximize CTR)

•  Problem definition: Pick k items from a pool of N for a large number of serves to maximize the number of clicks on the picked items

•  Easy!? Pick the items having the highest click-through rates (CTRs)

•  But … –  The system is highly dynamic:

    • Items come and go with short lifetimes
    • CTR of each item may change over time
  – How much traffic should be allocated to explore new items to achieve optimal performance?
    • Too little → unreliable CTR estimates due to "starvation"
    • Too much → little traffic to exploit the high-CTR items

Page 97: ENAR short course

Y! Front Page Application

• Simplify: maximize CTR on the first slot (F1)

• Item Pool
  – Editorially selected for high quality and brand image
  – Few articles in the pool, but the item pool is dynamic

 

 

Page 98: ENAR short course

CTR Curves of Items on LinkedIn Today


Page 99: ENAR short course

Impact of repeat item views on a given user

• The same user is shown an item multiple times (despite not clicking)

Page 100: ENAR short course

Simple algorithm to estimate the most popular item with a small but dynamic item pool

• Simple Explore/Exploit scheme (see the sketch after this list)
  – ε% explore: with a small probability (e.g. 5%), choose an item at random from the pool
  – (100−ε)% exploit: with large probability (e.g. 95%), choose the highest-scoring CTR item
• Temporal smoothing
  – Item CTRs change over time; provide more weight to recent data in estimating item CTRs
    • Kalman filter, moving average
• Discount item score with repeat views
  – CTR(item) for a given user drops with repeat views by some "discount" factor (estimated from data)
• Segmented most popular
  – Perform a separate most-popular computation for each user segment
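A minimal R sketch of the ε% explore / (100−ε)% exploit rule on a toy pool of CTR estimates (the pool, ε, and CTR values are illustrative):

set.seed(1)
eps     <- 0.05
ctr_hat <- c(item_a = 0.020, item_b = 0.035, item_c = 0.010)  # current CTR estimates

serve_one <- function() {
  if (runif(1) < eps) {
    sample(names(ctr_hat), 1)            # explore: random item from the pool
  } else {
    names(which.max(ctr_hat))            # exploit: highest estimated CTR
  }
}
table(replicate(10000, serve_one()))     # item_b receives roughly 95% of serves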

Page 101: ENAR short course

Time series Model: Kalman filter
• Dynamic Gamma-Poisson: the click-rate evolves over time in a multiplicative fashion
• Estimated click-rate distribution at time t+1
  – Prior mean:
  – Prior variance:

High CTR items are more adaptive
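The prior mean and variance formulas are not reproduced in this transcript; as a hedged illustration of the general idea only (clicks modeled as Poisson given views, a Gamma prior on the click-rate, and a discount factor that down-weights old data so the rate can drift), one possible R sketch is:

alpha <- 1; gamma <- 100               # illustrative Gamma prior: mean CTR = alpha/gamma = 1%
delta <- 0.95                          # discount factor giving more weight to recent data
views  <- c(1000, 1200,  900, 1100)    # toy observation stream
clicks <- c(  12,   18,   10,   25)
for (t in seq_along(views)) {
  alpha <- delta * alpha + clicks[t]   # discounted posterior update
  gamma <- delta * gamma + views[t]
  cat(sprintf("t = %d  estimated click-rate = %.4f\n", t, alpha / gamma))
}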

Page 102: ENAR short course

More economical exploration? Better bandit solutions

• Consider the two-armed problem with unknown payoff probabilities p1 > p2

The gambler has 1000 plays; what is the best way to experiment (to maximize total expected reward)? This is called the "multi-armed bandit" problem and has been studied for a long time.

Optimal solution: play the arm that has the maximum potential of being good. Optimism in the face of uncertainty.

Page 103: ENAR short course

Item Recommendation: Bandits?
• Two items: Item 1 CTR = 2/100; Item 2 CTR = 250/10000
  – Greedy: show Item 2 to all; not a good idea
  – Item 1's CTR estimate is noisy; the item could potentially be better
• Invest in Item 1 for better overall performance on average
  – Exploit what is known to be good, explore what is potentially good

[Figure: probability densities of the CTR estimates for Item 1 and Item 2]

Page 104: ENAR short course

Next few hours

                            Most Popular Recommendation | Personalized Recommendation
Offline Models              Collaborative filtering (cold-start problem)
Online Models               Time-series models          | Incremental CF, online regression
Intelligent Initialization  Prior estimation            | Prior estimation, dimension reduction
Explore/Exploit             Multi-armed bandits         | Bandits with covariates

Page 105: ENAR short course

Offline Components: Collaborative Filtering in Cold-start Situations

Page 106: ENAR short course

Problem

[Diagram] User i with user features x_i (demographics, browse history, search history, …) visits; the algorithm selects item j with item features x_j (keywords, content categories, …); response y_ij (explicit rating, implicit click/no-click)

Predict the unobserved entries based on features and the observed entries

Page 107: ENAR short course

Model Choices
• Feature-based (or content-based) approach
  – Use features to predict response
    • (regression, Bayes Net, mixture models, …)
  – Limitation: need predictive features
    • Bias often high, does not capture signals at granular levels
• Collaborative filtering (CF, aka memory based)
  – Make recommendations based on past user-item interaction
    • User-user, item-item, matrix factorization, …
    • See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD'08 Tutorial], etc.
  – Better performance for old users and old items
  – Does not naturally handle new users and new items (cold-start)

Page 108: ENAR short course

Collaborative Filtering (Memory based methods)

[Diagram] User-User similarity; Item-Item similarities; incorporating both

Estimating similarities: Pearson's correlation, optimization based (Koren et al.)

Page 109: ENAR short course

How to Deal with the Cold-Start Problem

• Heuristic-based approaches
  – Linear combination of regression and CF models
  – Filterbot
    • Add user features as pseudo users and do collaborative filtering
  – Hybrid approaches
    • Use content-based methods to fill up entries, then use CF
• Matrix Factorization
  – Good performance on Netflix (Koren, 2009)
• Model-based approaches
  – Bilinear random-effects model (probabilistic matrix factorization)
    • Good on Netflix data [Ruslan et al ICML, 2009]
  – Add feature-based regression to matrix factorization
    • (Agarwal and Chen, 2009)
  – Add topic discovery (from textual items) to matrix factorization
    • (Agarwal and Chen, 2009; Chun and Blei, 2011)

Page 110: ENAR short course

Per-item regression models
• When tracking users by cookies, the distribution of visit patterns can get extremely skewed
  – The majority of cookies have 1-2 visits
• Per-item models (regression) based on user covariates are attractive in such cases

Page 111: ENAR short course

Several per-item regressions: Multi-task learning

[Diagram] Low dimension (5-10); B estimated from retrospective data; affinity to old items

• Agarwal, Chen and Elango, KDD, 2010

Page 112: ENAR short course

Per-user, per-item models via bilinear random-effects model

Page 113: ENAR short course

Motivation
• Data measuring k-way interactions is pervasive
  – Consider k = 2 for all our discussions
    • E.g. User-Movie, User-content, User-Publisher-Ads, …
  – Power law on both user and item degrees
• Classical Techniques
  – Approximate the matrix through a singular value decomposition (SVD)
    • After adjusting for marginal effects (user popularity, movie popularity, …)
  – Does not work
    • Matrix highly incomplete, severe over-fitting
  – Key issue
    • Regularization of eigenvectors (factors) to avoid overfitting

Page 114: ENAR short course

Early work on complete matrices

•  Tukey’s 1-df model (1956)

– Rank 1 approximation of small nearly complete matrix

• Criss-cross regression (Gabriel, 1978)
• Incomplete matrices: Psychometrics (1-factor model only; small data sets; 1960s)
• Modern day recommender problems
  – Highly incomplete, large, noisy.

Page 115: ENAR short course

Latent Factor Models

[Diagram] Users and items placed in a latent space with dimensions such as "sporty" and "newsy"; user factor u and item factor v give Affinity = u'v; factors s and z give Affinity = s'z

Page 116: ENAR short course

Factorization: Brief Overview
• Latent user factors: (α_i, u_i = (u_i1, …, u_in))
• Latent movie factors: (β_j, v_j = (v_j1, …, v_jm))
• Interaction model: E(y_ij) = μ + α_i + β_j + u_i' B v_j
• (Nn + Mm) parameters
• Key technical issue: will overfit for moderate values of n, m; regularization is needed

Page 117: ENAR short course

Latent Factor Models: Different Aspects

•  Matrix Factorization – Factors in Euclidean space – Factors on the simplex

•  Incorporating features and ratings simultaneously

•  Online updates

Page 118: ENAR short course

Maximum Margin Matrix Factorization (MMMF)
•  Complete matrix by minimizing loss (hinge, squared-error) on observed entries subject to constraints on trace norm
   –  Srebro, Rennie, Jaakkola (NIPS 2004)
   –  Convex, semi-definite programming (expensive, not scalable)
•  Fast MMMF (Rennie & Srebro, ICML, 2005)
   –  Constrain the Frobenius norm of left and right eigenvector matrices; not convex but becomes scalable
•  Other variation: Ensemble MMMF (DeCoste, ICML 2005)
   –  Ensembles of partially trained MMMF (some improvements)

Page 119: ENAR short course

Matrix Factorization for Netflix prize data
•  Minimize the objective function
   ∑_{ij ∈ obs} (rij − ui'vj)²  +  λ ( ∑_i ||ui||² + ∑_j ||vj||² )
•  Simon Funk: Stochastic Gradient Descent
•  Koren et al (KDD 2007): Alternate Least Squares
   –  They moved to SGD later in the competition
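A minimal sketch of Funk-style SGD for this objective (illustrative hyper-parameters and function names; not the actual competition code):

```python
import random
import numpy as np

def sgd_mf(ratings, n_users, n_items, n_factors=10, lam=0.1, lr=0.01, n_epochs=20, seed=0):
    """SGD on sum_{(i,j) in obs} (r_ij - u_i'v_j)^2 + lam*(||u_i||^2 + ||v_j||^2).

    ratings: list of (user_index, item_index, rating) triples -- the observed entries.
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, n_factors))
    V = 0.1 * rng.standard_normal((n_items, n_factors))
    for _ in range(n_epochs):
        random.shuffle(ratings)
        for i, j, r in ratings:
            err = r - U[i] @ V[j]
            u_old = U[i].copy()
            # gradient steps on the squared error plus the L2 penalty
            U[i] += lr * (err * V[j] - lam * U[i])
            V[j] += lr * (err * u_old - lam * V[j])
    return U, V
```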

Page 120: ENAR short course

Probabilistic Matrix Factorization (Ruslan & Minh, 2008, NIPS)
•  Model:  rij ~ N(ui'vj, σ²),   ui ~ MVN(0, au I),   vj ~ MVN(0, av I)
•  Optimization is through Iterated Conditional Modes
•  Other variations: constraining the mean through a sigmoid, using “who-rated-whom”
•  Combining with Boltzmann Machines also improved performance

Page 121: ENAR short course

Bayesian Probabilistic Matrix Factorization (Ruslan and Minh, ICML 2008)
•  Fully Bayesian treatment using an MCMC approach
   –  Significant improvement
•  Interpretation as a fully Bayesian hierarchical model shows why that is the case
   –  Failing to incorporate uncertainty leads to bias in estimates of the variance components (e.g., au)
   –  Multi-modal posterior; MCMC helps in converging to a better one
•  MCEM also more resistant to over-fitting

Page 122: ENAR short course

Non-parametric Bayesian matrix completion (Zhou et al, SAM, 2010)
•  Specify rank probabilistically (automatic rank selection)
   yij ~ N( ∑_{k=1}^{r} zk uik vjk,  σ² )
   zk ~ Ber(πk),   πk ~ Beta( a/r,  b(r−1)/r )
   Marginally,  zk ~ Ber( a / (a + b(r−1)) )  and  E(#Factors) = r·a / (a + b(r−1))

Page 123: ENAR short course

How to incorporate features: Deal with both warm start and cold-start

•  Models to predict ratings for new pairs –  Warm-start: (user, movie) present in the training data with large

sample size –  Cold-start: At least one of (user, movie) new or has small sample

size •  Rough definition, warm-start/cold-start is a continuum.

•  Challenges –  Highly incomplete (user, movie) matrix –  Heavy tailed degree distributions for users/movies

•  Large fraction of ratings from small fraction of users/movies

–  Handling both warm-start and cold-start effectively in the presence of predictive features

Page 124: ENAR short course

Possible approaches •  Large scale regression based on covariates

–  Does not provide good estimates for heavy users/movies –  Large number of predictors to estimate interactions

•  Collaborative filtering –  Neighborhood based –  Factorization

•  Good for warm-start; cold-start dealt with separately •  Single model that handles cold-start and warm-start

–  Heavy users/movies → User/movie specific model –  Light users/movies → fallback on regression model –  Smooth fallback mechanism for good performance

Page 125: ENAR short course

Add Feature-based Regression into

Matrix Factorization RLFM: Regression-based Latent

Factor Model

Page 126: ENAR short course

Regression-based Factorization Model (RLFM)

•  Main idea: Flexible prior, predict factors through regressions

•  Seamlessly handles cold-start and warm-start

•  Modified state equation to incorporate covariates

Page 127: ENAR short course

RLFM: Model
Rating:  yij ~ N(µij, σ²)  (Gaussian model)
         yij ~ Bernoulli(µij)  (Logistic model, for binary rating)
         yij ~ Poisson(Nij µij)  (Poisson model, for counts)
Link:    t(µij) = xij'b + αi + βj + ui'vj   (rating that user i gives item j)
Bias of user i:         αi = g0'xi + εiα,   εiα ~ N(0, σα²)
Popularity of item j:   βj = d0'xj + εjβ,   εjβ ~ N(0, σβ²)
Factors of user i:      ui = Gxi + εiu,   εiu ~ N(0, σu² I)
Factors of item j:      vj = Dxj + εjv,   εjv ~ N(0, σv² I)

Could use other classes of regression models
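A minimal sketch of how these equations produce a prediction, illustrating the fallback to the feature-based priors for cold-start users/items (names and the Gaussian link are illustrative, assuming fitted b, g0, d0, G, D and posterior means of the factors):

```python
import numpy as np

def rlfm_predict(x_ij, x_i, x_j, b, g0, d0, G, D,
                 alpha_i=None, beta_j=None, u_i=None, v_j=None):
    """Predicted (Gaussian) mean under RLFM.

    For a cold-start pair, the latent terms fall back to their regression priors
    g0'x_i, d0'x_j, G x_i, D x_j; for warm-start pairs the posterior means
    (alpha_i, beta_j, u_i, v_j) learned from past ratings are used instead.
    """
    alpha = alpha_i if alpha_i is not None else g0 @ x_i
    beta = beta_j if beta_j is not None else d0 @ x_j
    u = u_i if u_i is not None else G @ x_i
    v = v_j if v_j is not None else D @ x_j
    return x_ij @ b + alpha + beta + u @ v
```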

Page 128: ENAR short course

Graphical representation of the model

Page 129: ENAR short course

Advantages of RLFM •  Better regularization of factors

–  Covariates “shrink” towards a better centroid

•  Cold-start: Fallback regression model (FeatureOnly)

Page 130: ENAR short course

RLFM: Illustration of Shrinkage

Plot the first factor value for each user (fitted using Yahoo! FP data)

Page 131: ENAR short course

Model fitting: EM for our class of models

Page 132: ENAR short course

The parameters for RLFM

•  Latent parameters:   Δ = ( {αi}, {βj}, {ui}, {vj} )
•  Hyper-parameters:    Θ = ( b, G, D, Au = au I, Av = av I )

Page 133: ENAR short course

Computing the mode

Minimized

Page 134: ENAR short course

The EM algorithm

Page 135: ENAR short course

Computing the E-step

•  Often hard to compute in closed form •  Stochastic EM (Markov Chain EM; MCEM)

– Compute expectation by drawing samples from

–  Effective for multi-modal posteriors but more expensive

•  Iterated Conditional Modes algorithm (ICM) –  Faster but biased hyper-parameter estimates

Page 136: ENAR short course

Monte Carlo E-step •  Through a vanilla Gibbs sampler (conditionals closed form)

•  Other conditionals also Gaussian and closed form •  Conditionals of users (movies) sampled simultaneously •  Small number of samples in early iterations, large numbers in

later iterations

Page 137: ENAR short course

M-step (Why MCEM is better than ICM)

•  Update G, optimize

•  Update Au=au I

Ignored by ICM, underestimates factor variability Factors over-shrunk, posterior not explored well

Page 138: ENAR short course

Experiment 1: Better regularization

•  MovieLens-100K, avg RMSE using pre-specified splits •  ZeroMean, RLFM and FeatureOnly (no cold-start

issues) •  Covariates:

–  Users : age, gender, zipcode (1st digit only) –  Movies: genres

Page 139: ENAR short course

Experiment 2: Better handling of Cold-start

•  MovieLens-1M; EachMovie •  Training-test split based on timestamp •  Same covariates as in Experiment 1.

Page 140: ENAR short course

Experiment 4: Predicting click-rate on articles

•  Goal: Predict click-rate on articles for a user on F1 position

•  Article lifetimes short, dynamic updates important

•  User covariates: –  Age, Gender, Geo, Browse behavior

•  Article covariates –  Content Category, keywords

•  2M ratings, 30K users, 4.5 K articles

Page 141: ENAR short course

Results on Y! FP data

Page 142: ENAR short course

Some other related approaches •  Stern, Herbrich and Graepel, WWW, 2009

–  Similar to RLFM, different parametrization and expectation propagation used to fit the models

•  Porteus, Asuncion and Welling, AAAI, 2011 –  Non-parametric approach using a Dirichlet process

•  Agarwal, Zhang and Mazumdar, Annals of Applied Statistics, 2011 –  Regression + random effects per user regularized

through a Graphical Lasso

Page 143: ENAR short course

Add Topic Discovery into Matrix Factorization

fLDA: Matrix Factorization through Latent Dirichlet Allocation

Page 144: ENAR short course

fLDA: Introduction •  Model the rating yij that user i gives to item j as the user’s

affinity to the topics that the item has

–  Unlike regular unsupervised LDA topic modeling, here the LDA topics are learnt in a supervised manner based on past rating data

–  fLDA can be thought of as a “multi-task learning” version of the supervised LDA model [Blei’07] for cold-start recommendation

yij = … + ∑k sik zjk,   where sik = user i’s affinity to topic k

Pr(item j has topic k) estimated by averaging the LDA topic of each word in item j

Old items: zjk’s are Item latent factors learnt from data with the LDA prior New items: zjk’s are predicted based on the bag of words in the items

Page 145: ENAR short course

[Topic-word distributions, one per topic: Topic 1: (Φ11, …, Φ1W), …, Topic k: (Φk1, …, ΦkW), …, Topic K: (ΦK1, …, ΦKW)]

LDA Topic Modeling (1) •  LDA is effective for unsupervised topic discovery [Blei’03]

–  It models the generating process of a corpus of items (articles) –  For each topic k, draw a word distribution Φk = [Φk1, …, ΦkW] ~ Dir(η) –  For each item j, draw a topic distribution θj = [θj1, …, θjK] ~ Dir(λ)

–  For each word, say the nth word, in item j, •  Draw a topic zjn for that word from θj = [θj1, …, θjK] •  Draw a word wjn from Φk = [Φk1, …, ΦkW] with topic k = zjn

Item j Topic distribution: [θj1, …, θjK]

Words: wj1, …, wjn, …

Per-word topic: zj1, …, zjn, …

Assume zjn = topic k

Observed
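A minimal simulation of this generative process (plain unsupervised LDA, not fLDA); the function name and parameterization are illustrative:

```python
import numpy as np

def generate_corpus(n_items, n_words_per_item, K, W, eta, lam, seed=0):
    """Simulate the LDA generative process described above."""
    rng = np.random.default_rng(seed)
    Phi = rng.dirichlet(np.full(W, eta), size=K)           # one word distribution per topic
    corpus = []
    for _ in range(n_items):
        theta = rng.dirichlet(np.full(K, lam))             # topic distribution of the item
        z = rng.choice(K, size=n_words_per_item, p=theta)  # per-word topic assignments
        words = np.array([rng.choice(W, p=Phi[k]) for k in z])  # words drawn from assigned topics
        corpus.append({"theta": theta, "z": z, "words": words})
    return Phi, corpus
```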

Page 146: ENAR short course

LDA Topic Modeling (2) •  Model training:

–  Estimate the prior parameters and the posterior topic×word distribution Φ based on a training corpus of items

–  EM + Gibbs sampling is a popular method •  Inference for new items

–  Compute the item topic distribution based on the prior parameters and Φ estimated in the training phase

•  Supervised LDA [Blei’07] –  Predict a target value for each item based on supervised LDA topics

yj = ∑k sk zjk
   (yj = target value of item j; sk = regression weight for topic k; zjk = Pr(item j has topic k), estimated by averaging the topic of each word in item j)
vs.   yij = … + ∑k sik zjk

One regression per user

Same set of topics across different regressions

Page 147: ENAR short course

fLDA: Model
Rating:  yij ~ N(µij, σ²)  (Gaussian model)
         yij ~ Bernoulli(µij)  (Logistic model, for binary rating)
         yij ~ Poisson(Nij µij)  (Poisson model, for counts)
Link:    t(µij) = xij'b + αi + βj + ∑k sik zjk   (rating that user i gives item j)
Bias of user i:            αi = g0'xi + εiα,   εiα ~ N(0, σα²)
Popularity of item j:      βj = d0'xj + εjβ,   εjβ ~ N(0, σβ²)
Topic affinity of user i:  si = Hxi + εis,   εis ~ N(0, σs² I)
Pr(item j has topic k):    zjk = ∑n 1(zjn = k) / (#words in item j),   where zjn is the LDA topic of the nth word in item j
Observed words:            wjn ~ LDA(λ, η, zjn),   the nth word in item j

Page 148: ENAR short course

Model Fitting •  Given:

–  Features X = {xi, xj, xij} –  Observed ratings y = {yij} and words w = {wjn}

•  Estimate: –  Parameters: Θ = [b, g0, d0, H, σ2, aα, aβ, As, λ, η]

•  Regression weights and prior parameters –  Latent factors: Δ = {αi, βj, si} and z = {zjn}

•  User factors, item factors and per-word topic assignment

•  Empirical Bayes approach:
   –  Maximum likelihood estimate of the parameters:
      Θ̂ = argmaxΘ Pr[y, w | Θ] = argmaxΘ ∫ Pr[y, w, Δ, z | Θ] dΔ dz
   –  The posterior distribution of the factors:
      Pr[Δ, z | y, Θ̂]

Page 149: ENAR short course

The EM Algorithm •  Iterate through the E and M steps until convergence

– Let be the current estimate – E-step: Compute

•  The expectation is not in closed form •  We draw Gibbs samples and compute the Monte

Carlo mean

– M-step: Find

•  It consists of solving a number of regression and optimization problems

fn(Θ) = E_{(Δ, z | y, w, Θ̂(n))} [ log Pr(y, w, Δ, z | Θ) ]
Θ̂(n+1) = argmaxΘ fn(Θ)

Page 150: ENAR short course

Supervised Topic Assignment
Gibbs sampling conditional for zjn, the topic of the nth word in item j:
   Pr(zjn = k | Rest)  ∝  (Zjk^¬jn + λ) · (Zk,wjn^¬jn + η) / (Zk^¬jn + Wη)  ·  ∏_{i rated j} f(yij | zjn = k)
•  The first factor is the same as unsupervised LDA (Zjk = #words in item j assigned to topic k, Zk,w = #times word w is assigned to topic k, Zk = total count for topic k; ¬jn excludes the current word)
•  f(yij | zjn = k) is the probability of observing yij given the model; the product is the likelihood of observed ratings by users who rated item j when zjn is set to topic k

Page 151: ENAR short course

fLDA: Experimental Results (Movie) •  Task: Predict the rating that a user would give a movie •  Training/test split:

–  Sort observations by time –  First 75% → Training data –  Last 25% → Test data

•  Item warm-start scenario –  Only 2% new items in test data

Model          Test RMSE
RLFM           0.9363
fLDA           0.9381
Factor-Only    0.9422
FilterBot      0.9517
unsup-LDA      0.9520
MostPopular    0.9726
Feature-Only   1.0906
Constant       1.1190

fLDA is as strong as the best method It does not reduce the performance in warm-start scenarios

Page 152: ENAR short course

fLDA: Experimental Results (Yahoo! Buzz)

•  Task: Predict whether a user would buzz-up an article •  Severe item cold-start

–  All items are new in test data

•  Data statistics: 1.2M observations, 4K users, 10K articles
•  fLDA significantly outperforms the other models

Page 153: ENAR short course

Experimental Results: Buzzing Topics

Top Terms (after stemming)                                                             | Topic
bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig,    | CIA interrogation
georg, memo, al                                                                        |
mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici,       | Swine flu
somalia, captain                                                                       |
NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman,         | NFL games
michael, week                                                                          |
court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal,     | Gay marriage
ban, legal, allow                                                                      |
palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin,     | Sarah Palin
sai, gov, alaska                                                                       |
idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl,       | American idol
michelle_obama                                                                         |
economi, recess, job, percent, econom, bank, expect, rate, jobless, year,             | Recession
unemploy, month                                                                        |
north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said,       | North Korea issues
russia                                                                                 |

3/4 topics are interpretable; 1/2 are similar to unsupervised topics

Page 154: ENAR short course

fLDA Summary •  fLDA is a useful model for cold-start item recommendation •  It also provides interpretable recommendations for users

–  User’s preference to interpretable LDA topics

•  Future directions: –  Investigate Gibbs sampling chains and the convergence properties of

the EM algorithm –  Apply fLDA to other multi-task prediction problems

•  fLDA can be used as a tool to generate supervised features (topics) from text data

Page 155: ENAR short course

Summary •  Regularizing factors through covariates effective •  Regression based factor model that regularizes better

and deals with both cold-start and warm-start in a single framework in a seamless way looks attractive

•  Fitting method scalable; Gibbs sampling for users and

movies can be done in parallel. Regressions in M-step can be done with any off-the-shelf scalable linear regression routine

•  Distributed computing on Hadoop: Multiple models and average across partitions (more later)

Page 156: ENAR short course

Online Components: Online Models, Intelligent Initialization, Explore / Exploit

Page 157: ENAR short course

Why Online Components? •  Cold start

–  New items or new users come to the system –  How to obtain data for new items/users (explore/exploit) –  Once data becomes available, how to quickly update the model

•  Periodic rebuild (e.g., daily): Expensive •  Continuous online update (e.g., every minute): Cheap

•  Concept drift –  Item popularity, user interest, mood, and user-to-item affinity may

change over time –  How to track the most recent behavior

•  Down-weight old data –  How to model temporal patterns for better prediction

•  … may not need to be online if the patterns are stationary

Page 158: ENAR short course

Big Picture
                                          Most Popular Recommendation     Personalized Recommendation
Offline Models                            —                               Collaborative filtering (cold-start problem)
Online Models (real systems are dynamic)  Time-series models              Incremental CF, online regression
Intelligent Initialization (do not start cold)  Prior estimation          Prior estimation, dimension reduction
Explore/Exploit (actively acquire data)   Multi-armed bandits             Bandits with covariates
Extension: Segmented Most Popular Recommendation

Page 159: ENAR short course

Online Components for Most Popular Recommendation
Online models, intelligent initialization & explore/exploit

Page 160: ENAR short course

Most popular recommendation: Outline

•  Most popular recommendation (no personalization, all users see the same thing) –  Time-series models (online models) –  Prior estimation (initialization) –  Multi-armed bandits (explore/exploit)

–  Sometimes hard to beat!!

•  Segmented most popular recommendation –  Create user segments/clusters based on user

features –  Do most popular recommendation for each segment

Page 161: ENAR short course

Most Popular Recommendation •  Problem definition: Pick k items (articles) from a

pool of N to maximize the total number of clicks on the picked items

•  Easy!? Pick the items having the highest click-through rates (CTRs)

•  But … –  The system is highly dynamic:

•  Items come and go with short lifetimes •  CTR of each item changes over time

–  How much traffic should be allocated to explore new items to achieve optimal performance

•  Too little → Unreliable CTR estimates •  Too much → Little traffic to exploit the high CTR items

Page 162: ENAR short course

CTR Curves for Two Days on Yahoo! Front Page

Each curve is the CTR of an item in the Today Module on www.yahoo.com over time.
Traffic obtained from a controlled randomized experiment (no confounding). Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories

Page 163: ENAR short course

For Simplicity, Assume … •  Pick only one item for each user visit

– Multi-slot optimization later •  No user segmentation, no personalization

(discussion later) •  The pool of candidate items is predetermined

and is relatively small (≤ 1000) – E.g., selected by human editors or by a first-phase

filtering method –  Ideally, there should be a feedback loop –  Large item pool problem later

•  Effects like user-fatigue, diversity in recommendations, multi-objective optimization not considered (discussion later)

Page 164: ENAR short course

Online Models •  How to track the changing CTR of an item •  Data: for each item, at time t, we observe

–  Number of times the item nt was displayed (i.e., #views) –  Number of clicks ct on the item

•  Problem Definition: Given c1, n1, …, ct, nt, predict the CTR (click-through rate) pt+1 at time t+1

•  Potential solutions: –  Observed CTR at t: ct / nt → highly unstable (nt is usually small)

–  Cumulative CTR: (∑all i ci) / (∑all i ni) → react to changes very slowly

–  Moving window CTR: (∑i∈last K ci) / (∑i∈last K ni) → reasonable •  But, no estimation of Var[pt+1] (useful for explore/exploit)

Page 165: ENAR short course

Online Models: Dynamic Gamma-Poisson

•  Model-based approach –  (ct | nt, pt) ~ Poisson(nt pt) –  pt = pt-1 εt, where εt ~ Gamma(mean=1, var=η)

–  Model parameters:
   •  p1 ~ Gamma(mean=µ0, var=σ0²) is the offline CTR estimate
   •  η specifies how dynamic/smooth the CTR is over time

–  Posterior distribution (pt+1 | c1, n1, …, ct, nt) ~ Gamma(?,?)

•  Solve this recursively (online update rule)

Notation: show the item nt times at time t, receive ct clicks; pt = CTR at time t
[State-space diagram: p1 ~ Gamma(mean=µ0, var=σ0²) → p2 → … with drift parameter η; observations (n1, c1), (n2, c2), …]

Page 166: ENAR short course

Online Models: Derivation

Estimated CTR distribution at time t (posterior after observing ct, nt):
   (pt | c1, n1, …, ct, nt) ~ Gamma(mean = µt|t, var = σ²t|t)
   Let γt|t−1 = µt|t−1 / σ²t|t−1   (effective sample size of the prior)
   γt|t = γt|t−1 + nt
   µt|t = (γt|t−1 · µt|t−1 + ct) / γt|t
   σ²t|t = µt|t / γt|t

Estimated CTR distribution at time t+1 (prior for the next interval):
   (pt+1 | c1, n1, …, ct, nt) ~ Gamma(mean = µt+1|t, var = σ²t+1|t)
   µt+1|t = µt|t
   σ²t+1|t = σ²t|t + η (µ²t|t + σ²t|t)

High CTR items are more adaptive (a code sketch of this update follows)
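A minimal sketch of the recursion above as an online update function (assuming the reconstructed update equations; variable names are illustrative):

```python
def gamma_poisson_update(mean, var, c_t, n_t, eta):
    """One step of the dynamic Gamma-Poisson filter.

    (mean, var): Gamma prior on p_t before seeing interval t's data.
    (c_t, n_t):  clicks and views observed in interval t.
    eta:         variance of the multiplicative drift eps_t.
    Returns the Gamma (mean, var) prior for p_{t+1}.
    """
    gamma = mean / var                               # effective sample size of the prior
    gamma_post = gamma + n_t
    mean_post = (gamma * mean + c_t) / gamma_post    # posterior mean at time t
    var_post = mean_post / gamma_post                # posterior variance at time t
    # evolve to t+1: p_{t+1} = p_t * eps, eps ~ Gamma(mean=1, var=eta)
    mean_next = mean_post
    var_next = var_post + eta * (mean_post ** 2 + var_post)
    return mean_next, var_next
```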

Page 167: ENAR short course

Tracking behavior of Gamma-Poisson model

•  Low click rate articles – More temporal smoothing

Page 168: ENAR short course

Intelligent Initialization: Prior Estimation

•  Prior CTR distribution: Gamma(mean=µ0, var=σ0²)
   –  N historical items:
      •  ni = #views of item i in its first time interval
      •  ci = #clicks on item i in its first time interval
   –  Model
      •  ci ~ Poisson(ni pi) and pi ~ Gamma(µ0, σ0²)  ⇒  ci ~ NegBinomial(µ0, σ0², ni)
   –  Maximum likelihood estimate (MLE) of (µ0, σ0²):
      (µ̂0, σ̂0²) = argmax over (µ0, σ0²) of ∑i log NegBinomial(ci | µ0, σ0², ni)
      (the marginal likelihood of the first-interval clicks after integrating out pi)
•  Better prior: Cluster items and find the MLE for each cluster
   –  Agarwal & Chen, 2011 (SIGMOD)
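A hedged sketch of the MLE step above using the negative-binomial marginal likelihood; the shape/rate parameterization, optimizer choice, and dropped additive constants are assumptions, not the exact formulation on the original slide:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def fit_gamma_prior(clicks, views):
    """MLE of the Gamma prior (mu0, sigma0^2) from first-interval (clicks, views).

    Uses the Gamma(shape k, rate theta) parameterization, so mu0 = k/theta and
    sigma0^2 = k/theta^2; marginally c_i is negative binomial after integrating
    out p_i.  Terms constant in the parameters are dropped.
    """
    c = np.asarray(clicks, dtype=float)
    n = np.asarray(views, dtype=float)

    def neg_loglik(log_params):
        k, theta = np.exp(log_params)
        ll = (gammaln(c + k) - gammaln(k)
              + k * np.log(theta) - (c + k) * np.log(theta + n))
        return -np.sum(ll)

    res = minimize(neg_loglik, x0=np.zeros(2), method="Nelder-Mead")
    k, theta = np.exp(res.x)
    return k / theta, k / theta ** 2   # (mu0, sigma0^2)
```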

Page 169: ENAR short course

Explore/Exploit: Problem Definition

time →   … t−2, t−1, t (now), clicks in the future
Item 1: x1% page views;  Item 2: x2% page views;  …  Item K: xK% page views
Determine (x1, x2, …, xK) based on clicks and views observed before t in order to maximize the expected total number of clicks in the future

Page 170: ENAR short course

Modeling the Uncertainty, NOT just the Mean

Simplified setting: two items
[Figure: CTR probability densities of Item A (sharply peaked) and Item B (diffuse)]
We know the CTR of Item A (say, shown 1 million times); we are uncertain about the CTR of Item B (only 100 times)
If we only make a single decision, give 100% of page views to Item A
If we make multiple decisions in the future, explore Item B since its CTR can potentially be higher
   Potential = ∫_{p > q} (p − q) f(p) dp,   where q = CTR of item A, p = CTR of item B, f(p) = probability density of item B’s CTR

Page 171: ENAR short course

Multi-Armed Bandits: Introduction (1)

Bandit “arms”

p1 p2 p3 (unknown payoff

probabilities)

“Pulling” arm i yields a reward:

reward = 1 with probability pi (success)

reward = 0 otherwise (failure)

For now, we are attacking the problem of choosing the best article/arm for all users

Page 172: ENAR short course

Multi-Armed Bandits: Introduction (2)

Bandit “arms”

p1 p2 p3 (unknown payoff

probabilities)

Goal: Pull arms sequentially to maximize the total reward
Bandit scheme/policy: Sequential algorithm to play arms (items)
Regret of a scheme = Expected loss relative to the “oracle” optimal scheme that always plays the best arm
–  “best” means highest success probability
–  But, the best arm is not known … unless you have an oracle
–  Regret is the price of exploration
–  Low regret implies quick convergence to the best

Page 173: ENAR short course

Multi-Armed Bandits: Introduction (3)

•  Bayesian approach –  Seeks to find the Bayes optimal solution to a Markov

decision process (MDP) with assumptions about probability distributions

–  Representative work: Gittins’ index, Whittle’s index –  Very computationally intensive

•  Minimax approach –  Seeks to find a scheme that incurs bounded regret (with no

or mild assumptions about probability distributions) –  Representative work: UCB by Lai, Auer –  Usually, computationally easy –  But, they tend to explore too much in practice (probably

because the bounds are based on worse-case analysis)

Skip  details  

Page 174: ENAR short course

Multi-Armed Bandits: Markov Decision Process (1)

•  Select an arm now at time t=0, to maximize expected total number of clicks in t=0,…,T

•  State at time t: Θt = (θ1t, …, θKt) –  θit = State of arm i at time t (that captures all we know about arm i at t)

•  Reward function Ri(Θt, Θt+1) –  Reward of pulling arm i that brings the state from Θt to Θt+1

•  Transition probability Pr[Θt+1 | Θt, pulling arm i ] •  Policy π: A function that maps a state to an arm (action)

–  π(Θt) returns an arm (to pull) •  Value of policy π starting from the current state Θ0 with horizon T

VT(π, Θ0) = E[ Rπ(Θ0)(Θ0, Θ1) + VT−1(π, Θ1) ]
          = ∫ Pr[Θ1 | Θ0, π(Θ0)] · ( Rπ(Θ0)(Θ0, Θ1) + VT−1(π, Θ1) ) dΘ1
(immediate reward + value of the remaining T−1 time slots if we start from state Θ1)

Page 175: ENAR short course

Multi-Armed Bandits: MDP (2)

•  Optimal policy:

•  Things to notice: –  Value is defined recursively (actually T high-dim integrals) –  Dynamic programming can be used to find the optimal policy –  But, just evaluating the value of a fixed policy can be very expensive

•  Bandit Problem: The pull of one arm does not change the state of other arms and the set of arms do not change over time

VT(π, Θ0) = E[ Rπ(Θ0)(Θ0, Θ1) + VT−1(π, Θ1) ]
          = ∫ Pr[Θ1 | Θ0, π(Θ0)] · ( Rπ(Θ0)(Θ0, Θ1) + VT−1(π, Θ1) ) dΘ1
(immediate reward + value of the remaining T−1 time slots if we start from state Θ1)
Optimal policy:  argmaxπ VT(π, Θ0)

Page 176: ENAR short course

Multi-Armed Bandits: MDP (3) •  Which arm should be pulled next?

–  Not necessarily what looks best right now, since it might have had a few lucky successes

–  Looks like it will be a function of successes and failures of all arms •  Consider a slightly different problem setting

–  Infinite time horizon, but –  Future rewards are geometrically discounted

Rtotal = R(0) + γ.R(1) + γ2.R(2) + … (0<γ<1)

•  Theorem [Gittins 1979]: The optimal policy decouples and solves a bandit problem for each arm independently

Policy π(Θt) is a function of (θ1t, …, θKt)  →  one K-dimensional problem
Policy π(Θt) = argmaxi { g(θit) }  →  K one-dimensional problems (Gittins’ Index); still computationally expensive!!

Page 177: ENAR short course

Multi-Armed Bandits: MDP (4)

Bandit Policy

1.  Compute the priority (Gittins’ index) of each arm based on its state

2.  Pull arm with max priority, and observe reward

3.  Update the state of the pulled arm

Priority  1  

Priority  2  

Priority  3  

Page 178: ENAR short course

Multi-Armed Bandits: MDP (5) •  Theorem [Gittins 1979]: The optimal policy decouples

and solves a bandit problem for each arm independently –  Many proofs and different interpretations of Gittins’ index

exist •  The index of an arm is the fixed charge per pull for a game with two options, whether

to pull the arm or not, so that the charge makes the optimal play of the game have zero net reward

–  Significantly reduces the dimension of the problem space –  But, Gittins’ index g(θit) is still hard to compute

•  For the Gamma-Poisson or Beta-Binomial models θit = (#successes, #pulls) for arm i up to time t

•  g maps each possible (#successes, #pulls) pair to a number

–  Approximate methods are used in practice –  Lai et al. have derived these for exponential family

distributions

Page 179: ENAR short course

Multi-Armed Bandits: Minimax Approach (1)

•  Compute the priority of each arm i in a way that the regret is bounded –  Lowest regret in the worst case

•  One common policy is UCB1 [Auer 2002]
   Priorityi  =  ci / ni  +  sqrt( 2 log n / ni )
   where ci = number of successes of arm i, ni = number of pulls of arm i, n = total number of pulls of all arms;
   the first term is the observed success rate, the second a factor representing uncertainty
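A minimal sketch of the UCB1 rule above (illustrative function name; unpulled arms are given infinite priority so each arm is tried once):

```python
import math

def ucb1_pick(clicks, pulls):
    """Pick the arm with the highest priority c_i/n_i + sqrt(2 ln n / n_i)."""
    n = sum(pulls)
    best, best_priority = None, float("-inf")
    for i, (c_i, n_i) in enumerate(zip(clicks, pulls)):
        priority = float("inf") if n_i == 0 else c_i / n_i + math.sqrt(2 * math.log(n) / n_i)
        if priority > best_priority:
            best, best_priority = i, priority
    return best
```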

Page 180: ENAR short course

Multi-Armed Bandits: Minimax Approach (2)

•  As total observations n becomes large: – Observed payoff tends asymptotically towards the

true payoff probability – The system never completely “converges” to one

best arm; only the rate of exploration tends to zero

Priorityi  =  ci / ni (observed payoff)  +  sqrt( 2 log n / ni ) (factor representing uncertainty)

Page 181: ENAR short course

Multi-Armed Bandits: Minimax Approach (3)

•  Sub-optimal arms are pulled O(log n) times
•  Hence, UCB1 has O(log n) regret
•  This is the lowest possible regret (but the constants matter)
•  E.g., regret after n plays is bounded by
   [ 8 ∑_{i: µi < µbest} (ln n) / Δi ]  +  ( 1 + π²/3 ) ∑_j Δj,    where Δi = µbest − µi

Page 182: ENAR short course

•  Classical multi-armed bandits –  A fixed set of arms with fixed rewards –  Observe the reward before the next pull

•  Bayesian approach (Markov decision process) –  Gittins’ index [Gittins 1979]: Bayes optimal for classical bandits

•  Pull the arm currently having the highest index value –  Whittle’s index [Whittle 1988]: Extension to a changing reward function –  Computationally intensive

•  Minimax approach (providing guaranteed regret bounds) –  UCB1 [Auer 2002]: Upper bound of a model agnostic confidence interval

•  Index of arm i = •  Heuristics

–  ε-Greedy: Random exploration using fraction ε of traffic –  Softmax: Pick arm i with probability

–  Posterior draw: Index = drawing from posterior CTR distribution of an arm

      exp{µ̂i/τ} / ∑j exp{µ̂j/τ},   where µ̂i = predicted CTR of item i and τ = temperature
      (Index of arm i under UCB1:  ci/ni + sqrt(2 log n / ni))

Classical Multi-Armed Bandits: Summary

Page 183: ENAR short course

Do Classical Bandits Apply to Web Recommenders?

Each curve is the CTR of an item in the Today Module on www.yahoo.com over time.
Traffic obtained from a controlled randomized experiment (no confounding). Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories

Page 184: ENAR short course

Characteristics of Real Recommender Systems

•  Dynamic set of items (arms) –  Items come and go with short lifetimes (e.g., a day) –  Asymptotically optimal policies may fail to achieve good performance

when item lifetimes are short •  Non-stationary CTR

–  CTR of an item can change dramatically over time •  Different user populations at different times •  Same user behaves differently at different times (e.g., morning, lunch

time, at work, in the evening, etc.) •  Attention to breaking news stories decays over time

•  Batch serving for scalability –  Making a decision and updating the model for each user visit in real time

is expensive –  Batch serving is more feasible: Create time slots (e.g., 5 min); for each

slot, decide the fraction xi of the visits in the slot to give to item i [Agarwal  et  al.,  ICDM,  2009]  

Page 185: ENAR short course

Explore/Exploit in Recommender Systems

time →   … t−2, t−1, t (now), clicks in the future
Item 1: x1% page views;  Item 2: x2% page views;  …  Item K: xK% page views
Determine (x1, x2, …, xK) based on clicks and views observed before t in order to maximize the expected total number of clicks in the future
Let’s solve this from first principles

Page 186: ENAR short course

Bayesian Solution: Two Items, Two Time Slots (1)

•  Two time slots: t = 0 and t = 1 –  Item P: We are uncertain about its CTR, p0 at t = 0 and p1 at t = 1 –  Item Q: We know its CTR exactly, q0 at t = 0 and q1 at t = 1

•  To determine x, we need to estimate what would happen in the future

Question: What fraction x of N0 views to item P (1-x) to item Q

t=0 t=1

Now

time N0 views N1 views

End  

•  Obtain c clicks after serving x (not yet observed; random variable)
•  Assume we observe c; we can then update the distribution of p1, giving p̂1(x, c) = E[p1 | x, c]
[Figure: at t = 0, densities of p0 (item P, uncertain) and q0 (item Q, known); at t = 1, density of p1(x, c) vs. the known q1]
•  If x and c are given, optimal solution: give all t = 1 views to Item P iff E[p1 | x, c] > q1

Page 187: ENAR short course

Bayesian Solution: Two Items, Two Time Slots (2)
•  Expected total number of clicks in the two time slots:
   x N0 p̂0 + (1−x) N0 q0 + N1 Ec[ max{ p̂1(x, c), q1 } ]
      (E[#clicks] at t = 0 from items P and Q; at t = 1, show the item with higher E[CTR])
   = N0 q0 + N1 q1 + x N0 (p̂0 − q0) + N1 Ec[ max{ p̂1(x, c) − q1, 0 } ]
      (E[#clicks] if we always show item Q, plus the gain of exploring the uncertain item P using x)
•  Gain(x, q0, q1) = expected number of additional clicks if we explore the uncertain item P with fraction x of views in slot 0, compared to a scheme that only shows the certain item Q in both slots
•  Solution: argmaxx Gain(x, q0, q1)

Page 188: ENAR short course

Bayesian Solution: Two Items, Two Time Slots (3)
•  Approximate the distribution of p̂1(x, c) by a normal distribution
   –  Reasonable approximation because of the central limit theorem
•  Proposition: Using the approximation, the Bayes optimal solution x can be found in time O(log N0)
•  With prior p1 ~ Beta(a, b):
   p̂1 = Ec[ p̂1(x, c) ] = a / (a + b)
   σ1²(x) = Varc[ p̂1(x, c) ] = a b x N0 / ( (a + b)² (a + b + 1) (a + b + x N0) )
   Gain(x, q0, q1) ≈ x N0 (p̂0 − q0) + N1 [ σ1(x) φ( (q1 − p̂1)/σ1(x) ) + (p̂1 − q1) ( 1 − Φ( (q1 − p̂1)/σ1(x) ) ) ]
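A small numerical sketch of the gain computation under the normal approximation above; it assumes item P's CTR shares the same Beta(a, b) prior in both slots, and the parameter values in the usage lines are illustrative only:

```python
import numpy as np
from scipy.stats import norm

def gain(x, q0, q1, a, b, N0, N1):
    """Gain(x, q0, q1): item P's CTR has prior Beta(a, b); q0, q1 are known."""
    p_hat = a / (a + b)                                   # prior mean of item P's CTR
    var = a * b * x * N0 / ((a + b) ** 2 * (a + b + 1) * (a + b + x * N0))
    sigma = np.sqrt(var)                                  # StDev of the posterior mean over c
    if sigma == 0.0:
        explore = max(p_hat - q1, 0.0)
    else:
        t = (q1 - p_hat) / sigma
        explore = sigma * norm.pdf(t) + (p_hat - q1) * (1.0 - norm.cdf(t))
    return x * N0 * (p_hat - q0) + N1 * explore

# Simple grid search over x (the slides argue O(log N0) search is enough):
xs = np.linspace(0.0, 1.0, 101)
best_x = max(xs, key=lambda x: gain(x, q0=0.05, q1=0.05, a=2.0, b=40.0, N0=1e4, N1=1e5))
```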

Page 189: ENAR short course

Bayesian Solution: Two Items, Two Time Slots (4)

•  Quiz: Is it correct that the more we are uncertain about the CTR of an item, the more we should explore the item?

[Figure: optimal fraction of views to give to the item, plotted from low to high uncertainty; different curves are for different prior mean settings]

Page 190: ENAR short course

Bayesian Solution: General Case (1)
•  From two items to K items
   –  Very difficult problem:
      maxx ∑i ( xi N0 p̂i0 + N1 Ec[ maxi { p̂i1(xi, ci) } ] ),   subject to ∑i xi = 1
      equivalently, maxz Ec[ ∑i zi(c) p̂i1(xi, ci) ],   subject to ∑i zi(c) = 1 for all possible c
      (Note: c = [c1, …, cK]; ci is a random variable representing the # clicks on item i we may get)
   –  Apply Whittle’s Lagrange relaxation (1988) to this problem setting
      •  Relax ∑i zi(c) = 1, for all c, to Ec[ ∑i zi(c) ] = 1
      •  Apply Lagrange multipliers (q0 and q1) to enforce the constraints:
         min_{q0, q1} ( N0 q0 + N1 q1 + ∑i maxxi Gain(xi, q0, q1) )
   –  We essentially reduce the K-item case to K independent two-item sub-problems (which we have solved)

Page 191: ENAR short course

Bayesian Solution: General Case (2)

•  From two intervals to multiple time slots – Approximate multiple time slots by two stages

•  Non-stationary CTR – Use the Dynamic Gamma-Poisson model to

estimate the CTR distribution for each item

Page 192: ENAR short course

Simulation Experiment: Different Traffic Volume
•  Simulation with ground truth estimated based on Yahoo! Front Page data
•  Setting: 16 live items per interval
•  Scenarios: Web sites with different traffic volume (x-axis)

Page 193: ENAR short course

Simulation Experiment: Different Sizes of the Item Pool

•  Simulation with ground truth estimated based on Yahoo! Front Page data
•  Setting: 1000 views per interval; average item lifetime = 20 intervals
•  Scenarios: Different sizes of the item pool (x-axis)

Page 194: ENAR short course

Characteristics of Different Explore/Exploit Schemes (1)

•  Why the Bayesian solution has better performance •  Characterize each scheme by three dimensions:

–  Exploitation regret: The regret of a scheme when it is showing the item which it thinks is the best (may not actually be the best)

•  0 means the scheme always picks the actual best •  It quantifies the scheme’s ability of finding good

items –  Exploration regret: The regret of a scheme when it is exploring the items

which it feels uncertain about

•  It quantifies the price of exploration (lower → better) –  Fraction of exploitation (higher → better)

•  Fraction of exploration = 1 – fraction of exploitation
   (All traffic to a web site = exploitation traffic + exploration traffic)

Page 195: ENAR short course

Characteristics of Different Explore/Exploit Schemes (2)

•  Exploitation regret: Ability of finding good items (lower → better)
•  Exploration regret: Price of exploration (lower → better)
•  Fraction of exploitation (higher → better)
[Figure: exploitation regret plotted against exploration regret and against exploitation fraction for the different schemes; the “Good” corners mark the desirable region]

Page 196: ENAR short course

Discussion: Large Content Pool •  The Bayesian solution looks promising

–  ~10% from true optimal for a content pool of 1000 live items

•  1000 views per interval; item lifetime ~20 intervals •  Intelligent initialization (offline modeling)

–  Use item features to reduce the prior variance of an item •  E.g., Var[ item CTR | Sport ] < Var[ item CTR ]

–  Require a CTR model that outputs both mean and variance

•  Linear regression model •  Segmented model: Estimate the CTR distribution of a random

article in an item category –  Existing taxonomies, decision tree, LDA topics

•  Feature-based explore/exploit –  Estimate model parameters, instead of per-item CTR –  More later

Page 197: ENAR short course

Discussion: Multiple Positions, Ranking

•  Feature-based approach –  reward(page) = model(φ(item 1 at position 1, … item k at position k)) –  Apply feature-based explore/exploit

•  Online optimization for ranked list –  Ranked bandits [Radlinski et al., 2008]: Run an

independent bandit algorithm for each position –  Dueling bandit [Yue & Joachims, 2009]: Actions are

pairwise comparisons •  Online optimization of submodular functions

–  ∀ S1, S2 and a, fa(S1 ⊕ S2) ≤ fa(S1),  where fa(S) = f(S ⊕ 〈a〉) – f(S)

–  Streeter & Golovin (2008)

Page 198: ENAR short course

Discussion: Segmented Most Popular

•  Partition users into segments, and then for each segment, provide most popular recommendation

•  How to segment users –  Hand-created segments: AgeGroup × Gender –  Clustering or decision tree based on user features

•  Users in the same cluster like similar items •  Segments can be organized by taxonomies/hierarchies

–  Better CTR models can be built by hierarchical smoothing •  Shrink the CTR of a segment toward its parent •  Introduce bias to reduce uncertainty/variance

–  Bandits for taxonomies (Pandey et al., 2008) •  First explore/exploit categories/segments •  Then, switch to individual items

Page 199: ENAR short course

Most Popular Recommendation: Summary

•  Online model: –  Estimate the mean and variance of the CTR of each item over

time –  Dynamic Gamma-Poisson model

•  Intelligent initialization: –  Estimate the prior mean and variance of the CTR of each item

cluster using historical data •  Cluster items → Maximum likelihood estimates of the priors

•  Explore/exploit: –  Bayesian: Solve a Markov decision process problem

•  Gittins’ index, Whittle’s index, approximations •  Better performance, computation intensive •  Thompson sampling: Sample from the posterior (simple)

–  Minimax: Bound the regret •  UCB1: Easy to compute •  Explore more than necessary in practice

–  ε-Greedy: Empirically competitive for tuned ε

Page 200: ENAR short course

Online Components for Personalized Recommendation
Online models, intelligent initialization & explore/exploit

 

Page 201: ENAR short course

Intelligent Initialization for Linear Model (1)

•  Linear/factorization model

–  How to estimate the prior parameters µj and Σ •  Important for cold start: Predictions are made using prior •  Leverage available features

–  How to learn the weights/factors quickly •  High dimensional βj → slow convergence •  Reduce the dimensionality

   yij ~ N(ui'βj, σ²),   βj ~ N(µj, Σ)
   (Subscript: user i, item j.  yij = rating that user i gives item j; ui = feature/factor vector of user i; βj = factor vector of item j)

Page 202: ENAR short course

Feature-based model initialization

•  Dimensionality reduction for fast model convergence

FOBFM: Fast Online Bilinear Factor Model
•  Per-item online model:  yij ~ ui'βj,   βj ~ N(µj, Σ)
•  Prior predicted by features:  βj ~ N(Axj, Σ)
   ⇔  yij ~ ui'Axj + ui'vj,   vj ~ N(0, Σ)
•  Dimensionality reduction:  vj = Bθj,   θj ~ N(0, σθ² I)
   –  B is an n×k linear projection matrix (k << n); project high-dim vj → low-dim θj
   –  Low-rank approximation of Var[βj]:  βj ~ N(Axj, σθ² BB')
•  Offline training: Determine A, B, σθ² through the EM algorithm (once per day or hour)
(Subscript: user i, item j.  Data: yij = rating that user i gives item j; ui = offline factor vector of user i; xj = feature vector of item j)

Page 203: ENAR short course

Feature-based model initialization

•  Dimensionality reduction for fast model convergence

•  Fast, parallel online learning:
   yij ~ ui'Axj + (ui'B)θj,   where θj is updated in an online manner
   (ui'Axj is an offset; ui'B is a new low-dimensional feature vector)
•  Online selection of dimensionality (k = dim(θj))
   –  Maintain an ensemble of models, one for each candidate dimensionality
•  As before:  βj ~ N(Axj, σθ² BB');  vj = Bθj, θj ~ N(0, σθ² I);  B is an n×k projection matrix (k << n)
(Subscript: user i, item j.  Data: yij = rating that user i gives item j; ui = offline factor vector of user i; xj = feature vector of item j)
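A minimal sketch of the per-item online update implied above: with the offset ui'Axj subtracted out, θj has a conjugate Gaussian posterior that can be updated one observation at a time (Gaussian response assumed; class and variable names are illustrative):

```python
import numpy as np

class FOBFMItemModel:
    """Per-item online update of the low-dimensional factor theta_j."""

    def __init__(self, k, sigma_theta2, sigma2):
        self.P = np.eye(k) / sigma_theta2   # posterior precision of theta_j
        self.b = np.zeros(k)                # precision-weighted mean
        self.sigma2 = sigma2                # observation noise variance

    def update(self, reduced_u, residual):
        # reduced_u = B'u_i (k-dim); residual = y_ij - u_i' A x_j (offset removed)
        self.P += np.outer(reduced_u, reduced_u) / self.sigma2
        self.b += reduced_u * residual / self.sigma2

    def theta_mean(self):
        return np.linalg.solve(self.P, self.b)
```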

Page 204: ENAR short course

Experimental Results: My Yahoo! Dataset (1)

•  My Yahoo! is a personalized news reading site – Users manually select news/RSS feeds

•  ~12M “ratings” from ~3M users on ~13K articles – Click = positive – View without click = negative

Page 205: ENAR short course

Experimental Results: My Yahoo! Dataset (2)

•  Item-based data split: Every item is new in the test data
   –  First 8K articles are in the training data (offline training)
   –  Remaining articles are in the test data (online prediction & learning)
•  Methods:
   –  No-init: Standard online regression with ~1000 parameters for each item
   –  Offline: Feature-based model without online update
   –  PCR, PCR+: Two principal component methods to estimate B
   –  FOBFM: Our fast online method
•  Supervised dimensionality reduction (reduced rank regression) significantly outperforms other methods

Page 206: ENAR short course

Experimental Results: My Yahoo! Dataset (3)

•  Small number of factors (low dimensionality) is better when the amount of data for online learning is small
•  Large number of factors is better when the data for learning becomes large
•  The online selection method usually selects the best dimensionality
(# factors = number of parameters per item updated online)

Page 207: ENAR short course

Intelligent Initialization: Summary

•  For online learning, whenever historical data is available, do not start cold

•  For linear/factorization models – Use available features to setup the starting point – Reduce dimensionality to facilitate fast learning

•  Next – Explore/exploit for personalization – Users are represented by covariates

•  Features, factors, clusters, etc – Covariate bandits

Page 208: ENAR short course

Explore/Exploit for Personalized Recommendation

•  One extreme problem formulation –  One bandit problem per user with one arm per item –  Bandit problems are correlated: “Similar” users like similar

items –  Arms are correlated: “Similar” items have similar CTRs

•  Model this correlation through covariates/features –  Input: User feature/factor vector, item feature/factor vector –  Output: Mean and variance of the CTR of this (user, item)

pair based on the data collected so far •  Covariate bandits

–  Also known as contextual bandits, bandits with side observations

–  Provide a solution to •  Large content pool (correlated arms) •  Personalized recommendation (hint before pulling an arm)

Page 209: ENAR short course

Methods for Covariate Bandits •  Priority-based methods

–  Rank items according to the user-specific “score” of each item; then, update the model based on the user’s response

–  UCB (upper confidence bound) •  Score of an item = E[posterior CTR] + k StDev[posterior CTR]

–  Posterior draw (Thompson sampling) •  Score of an item = a number drawn from the posterior CTR distribution

–  Softmax •  Score of an item = a number drawn according to

•  ε-Greedy –  Allocate ε fraction of traffic for random exploration (ε may be adaptive) –  Robust when the exploration pool is small

•  Bayesian scheme –  Close to optimal if can be solved efficiently

      (Softmax probability of picking item i:  exp{µ̂i/τ} / ∑j exp{µ̂j/τ})

Page 210: ENAR short course

Covariate Bandits: Some References

•  Just a small sample of papers –  Hierarchical explore/exploit (Pandey et al., 2008)

•  Explore/exploit categories/segments first; then, switch to individuals –  Variants of ε-greedy

•  Epoch-greedy (Langford & Zhang, 2007): ε is determined based on the generalization bound of the current model

•  Banditron (Kakade et al., 2008): Linear model with binary response •  Non-parametric bandit (Yang & Zhu, 2002): ε decreases over time;

example model: histogram, nearest neighbor –  Variants of UCB methods

•  Linearly parameterized bandits (Rusmevichientong et al., 2008): minimax, based on uncertainty ellipsoid

•  LinUCB (Li et al., 2010): Gaussian linear regression model •  Bandits in metric spaces (Kleinberg et al., 2008; Slivkins et al., 2009):

–  Similar arms have similar rewards: | reward(i) – reward(j) | ≤ distance(i,j)

Page 211: ENAR short course

Online Components: Summary •  Real systems are dynamic •  Cold-start problem

–  Incremental online update (online linear regression) –  Intelligent initialization (use features to predict initial factor

values) –  Explore/exploit (UCB, posterior draw, softmax, ε-greedy)

•  Concept-drift problem –  Tracking the most recent behavior (state-space models,

Kalman filter) –  Modeling temporal patterns (tensor factorization, spline)

Page 212: ENAR short course

Evaluation Methods and Challenges

Page 213: ENAR short course

Evaluation Methods •  Ideal method

–  Experimental Design: Run side-by-side experiments on a small fraction of randomly selected traffic with new method (treatment) and status quo (control)

–  Limitation •  Often expensive and difficult to test large number of methods

•  Problem: How do we evaluate methods offline on logged data? –  Goal: To maximize clicks/revenue and not prediction

accuracy on the entire system. Cost of predictive inaccuracy for different instances vary.

•  E.g. 100% error on a low CTR article may not matter much because it always co-occurs with a high CTR article that is predicted accurately

Page 214: ENAR short course

Usual Metrics •  Predictive accuracy

–  Root Mean Squared Error (RMSE) –  Mean Absolute Error (MAE) –  Area under the Curve, ROC

•  Other rank based measures based on retrieval accuracy for top-k

–  Recall in test data •  What Fraction of items that user actually liked in the test data were

among the top-k recommended by the algorithm (fraction of hits, e.g. Karypis, CIKM 2001)

•  One flaw in several papers –  Training and test split are not based on time.

•  Information leakage •  Even in Netflix, this is the case to some extent

–  Time split per user, not per event. For instance, information may leak if models are based on user-user similarity.

Page 215: ENAR short course

Metrics continued.. •  Recall per event based on Replay-Match

method – Fraction of clicked events where the top

recommended item matches the clicked one.

•  This is good if logged data collected from a randomized serving scheme, with biased data this could be a problem – We will be inventing algorithms that provide

recommendations that are similar to the current one

•  No reward for novel recommendations

Page 216: ENAR short course

Details on Replay-Match method (Li, Langford, et al)

•  x: feature vector for a visit •  r = [r1,r2,…,rK]: reward vector for the K items in inventory •  h(x): recommendation algorithm to be evaluated •  Goal: Estimate expected reward for h(x)

•  s(x): recommendation scheme that generated logged-data •  x1,..,xT: visits in the logged data •  rti: reward for visit t, where i = s(xt)

Page 217: ENAR short course

Replay-Match continued •  Estimator

•  If importance weights and

–  It can be shown estimator is unbiased

•  E.g. if s(x) is random serving scheme, importance weights are uniform over the item set

•  If s(x) is not random, importance weights have to be estimated through a model
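A minimal sketch of the importance-weighted replay estimator described on these slides; the function and weight names are illustrative, and it assumes the logged reward and a propensity (or a model of it) are available per event:

```python
def replay_estimate(logged_events, h, importance_weight):
    """Inverse-propensity (replay) estimate of policy h's expected reward.

    logged_events: list of (x_t, served_item, reward) logged under serving scheme s(x).
    importance_weight(x, item): 1 / Pr[s(x) serves item]; for a uniformly random
    logging scheme this is simply the size of the item pool.
    """
    total = 0.0
    for x, served, reward in logged_events:
        if h(x) == served:                 # keep only events where h matches the logged action
            total += reward * importance_weight(x, served)
    return total / len(logged_events)
```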

Page 218: ENAR short course

Back to Multi-Objective Optimization

[Diagram: the Recommender selects editorial content for the Front Page; clicks on FP links influence the downstream supply distribution, which feeds the ad server (premium/guaranteed display vs. the cheaper spot market) and downstream engagement (time spent)]

Page 219: ENAR short course

Serving Content on Front Page: Click Shaping

•  What do we want to optimize? •  Current: Maximize clicks (maximize downstream supply from FP) •  But consider the following

–  Article 1: CTR=5%, utility per click = 5 –  Article 2: CTR=4.9%, utility per click=10

•  By promoting 2, we lose 1 click/100 visits, gain 5 utils •  If we do this for a large number of visits --- lose some clicks but

obtain significant gains in utility? –  E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement,

etc)

Page 220: ENAR short course

Why call it Click Shaping?
[Figure: supply distribution across Yahoo! verticals (autos, buzz, finance, gmy.news, health, hotjobs, movies, new.music, news, omg, realestate, rivals, shine, shopping, sports, tech, travel, tv, video, videogames, other) BEFORE and AFTER click shaping, with per-vertical changes roughly in the −10% to +10% range]
SHAPING can happen with respect to any downstream metrics (like engagement)

Page 221: ENAR short course

Multi-Objective Optimization
[Setup: n articles A1, …, An belonging to K properties (news, finance, omg, …); m user segments S1, …, Sm]
•  CTR of user segment i on article j: pij
•  Time duration of i on j: dij

Page 222: ENAR short course

Multi-Objective Program
§  Scalarization
§  Goal Programming
•  Simplex constraints on xij are always applied
•  Constraints are linear
•  Every 10 mins, solve for x; use this x as the serving scheme in the next 10 mins (a sketch of the scalarized program follows)
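As a concrete illustration of the scalarization option, here is a hedged sketch of the kind of linear program solved every 10 minutes. The symbols follow the previous slide (pij = CTR, dij = time duration, si = traffic share of segment i); the mixing weight α and the exact constraint set are assumptions, not the precise formulation in the KDD 2011 paper.

```latex
\begin{aligned}
\max_{x}\quad & \alpha \sum_{i,j} s_i\, x_{ij}\, p_{ij}
  \;+\; (1-\alpha) \sum_{i,j} s_i\, x_{ij}\, p_{ij}\, d_{ij}
  && \text{(clicks vs.\ downstream engagement)}\\
\text{s.t.}\quad & \sum_{j} x_{ij} = 1,\qquad x_{ij} \ge 0
  && \text{(simplex constraint per user segment $i$)}\\
& \sum_{i,\;j \in \text{property } k} s_i\, x_{ij}\, p_{ij} \;\ge\; L_k
  && \text{(linear per-property click floors, goal-programming style)}
\end{aligned}
```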

Page 223: ENAR short course

Pareto-optimal solution (more in KDD 2011)


Page 224: ENAR short course

Summary •  Modern recommendation systems on the web crucially depend on

extracting intelligence from massive amounts of data collected on a routine basis

•  Lots of data and processing power not enough, the number of things we need to learn grows with data size

•  Extracting grouping structures at coarser resolutions based on similarity (correlations) is important – ML has a big role to play here

•  Continuous and adaptive experimentation in a judicious manner crucial to maximize performance – Again, ML has a big role to play

•  Multi-objective optimization is often required, the objectives are application dependent. – ML has to work in close collaboration with

engineering, product & business execs

Page 225: ENAR short course

Challenges

Page 226: ENAR short course

Recall: Some examples •  Simple version

–  I have an important module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to drive up total CTR on this module

•  More advanced –  I got X% lift in CTR. But I have additional information on

other downstream utilities (e.g. dwell time). Can I increase downstream utility without losing too many clicks?

•  Highly advanced –  There are multiple modules running on my website. How

do I take a holistic approach and perform a simultaneous optimization?

Page 227: ENAR short course

For the simple version •  Multi-position optimization

–  Explore/exploit, optimal subset selection

•  Explore/Exploit strategies for large content pool and high dimensional problems –  Some work on hierarchical bandits but more needs to be

done •  Constructing user profiles from multiple sources with

less than full coverage –  Couple of papers at KDD 2011

•  Content understanding •  Metrics to measure user engagement (other than

CTR)

Page 228: ENAR short course

Other problems •  Whole page optimization

–  Incorporating correlations

•  Incentivizing User generated content

•  Incorporating Social information for better recommendation (News Feed Recommendation)

•  Multi-context Learning

Page 229: ENAR short course

Case  Studies  

Page 230: ENAR short course

Recommendations and Advertising on LinkedIn HP

Page 231: ENAR short course

EXAMPLE:  DISPLAY  AD  PLACEMENTS  ON  LINKEDIN    

©2013 LinkedIn Corporation. All Rights Reserved.

Page 232: ENAR short course

Recommendations and Advertising on LinkedIn HP

Page 233: ENAR short course

LinkedIn Advertising: Flow
•  Ad request with member profile (e.g., region = US, age = 20) and context (e.g., profile page, 300 x 250 ad slot)
•  Filter campaigns (targeting criteria, frequency cap, budget pacing) → campaigns eligible for auction
•  Response prediction engine scores eligible campaigns; automatic format selection
•  Campaigns sorted by Bid * CTR;  Click Cost = Bid3 x CTR3 / CTR2
•  Serving constraint < 100 millisec

Page 234: ENAR short course

CTR Prediction Model for Ads
•  Feature vectors
   –  Member feature vector: xi (identity, behavioral, network)
   –  Campaign feature vector: cj (text, adv-id, …)
   –  Context feature vector: zk (page type, device, …)
•  Model:

Page 235: ENAR short course

CTR Prediction Model for Ads
•  Feature vectors
   –  Member feature vector: xi
   –  Campaign feature vector: cj
   –  Context feature vector: zk
•  Model:

Cold-start component

Warm-start per-campaign component

Page 236: ENAR short course

CTR Prediction Model for Ads
•  Feature vectors
   –  Member feature vector: xi
   –  Campaign feature vector: cj
   –  Context feature vector: zk
•  Model: cold-start component + warm-start per-campaign component
•  Both the cold-start and warm-start components can have L2 penalties (a hedged sketch of the model follows)
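The model itself appears as a figure in the original deck; the following is a hedged reconstruction of its structure from the surrounding text, where the exact feature map f and any interaction terms are assumptions:

```latex
\log\frac{p(\text{click}\mid i,j,k)}{1-p(\text{click}\mid i,j,k)}
  \;=\; \underbrace{\Theta_w^{\top} f(x_i, c_j, z_k)}_{\text{cold-start component}}
  \;+\; \underbrace{\Theta_{c,j}^{\top}\, x_i}_{\text{warm-start per-campaign component}},
\qquad \text{with L2 (Gaussian) penalties on both } \Theta_w \text{ and } \Theta_{c,j}.
```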

Page 237: ENAR short course

Model Fitting
•  Single machine (well understood)
   –  conjugate gradient
   –  L-BFGS
   –  trust region
   –  …
•  Model training with large-scale data
   –  Cold-start component Θw is more stable
      •  Weekly/bi-weekly training good enough
      •  However: difficulty from the need for large-scale logistic regression
   –  Warm-start per-campaign model Θc is more dynamic
      •  New items can get generated any time
      •  Big loss if opportunities are missed
      •  Need to update the warm-start component as frequently as possible

Page 238: ENAR short course

Model Fitting
•  Single machine (well understood)
   –  conjugate gradient
   –  L-BFGS
   –  trust region
   –  …
•  Model training with large-scale data
   –  Cold-start component Θw is more stable
      •  Weekly/bi-weekly training good enough
      •  However: difficulty from the need for large-scale logistic regression
   –  Warm-start per-campaign model Θc is more dynamic
      •  New items can get generated any time
      •  Big loss if opportunities are missed
      •  Need to update the warm-start component as frequently as possible

Large Scale Logistic Regression

Per-item logistic regression given Θc

Page 239: ENAR short course

Explore/Exploit with Logistic Regression
[Figure: positive/negative responses with the cold-start decision boundary, the cold + warm start boundary for an ad-id, and the posterior over the warm-start coefficients]
E/E: Sample a line from the posterior (Thompson Sampling)
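A minimal sketch of the Thompson sampling step described above, assuming a Gaussian approximation to the posterior of a campaign's warm-start coefficients; function and variable names are illustrative:

```python
import numpy as np

def thompson_score(x_member, cold_score, theta_mean, theta_cov, rng=None):
    """Score one campaign by sampling its warm-start coefficients from the posterior.

    cold_score: the cold-start component's contribution to the log-odds (fixed here).
    (theta_mean, theta_cov): Gaussian approximation of the posterior over the
    campaign's warm-start coefficients ("sample a line from the posterior").
    """
    rng = rng or np.random.default_rng()
    theta = rng.multivariate_normal(theta_mean, theta_cov)
    logit = cold_score + x_member @ theta
    return 1.0 / (1.0 + np.exp(-logit))   # sampled CTR; rank eligible campaigns by bid * score
```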

Page 240: ENAR short course

Models Considered
•  CONTROL: per-campaign CTR counting model
•  COLD-ONLY: only cold-start component
•  LASER: our model (cold-start + warm-start)
•  LASER-EE: our model with Explore-Exploit using Thompson sampling

Page 241: ENAR short course

Metrics  

•  Model metrics (offline)
   –  Test log-likelihood
   –  AUC/ROC
   –  Observed/Expected ratio
•  Online metrics (online A/B test)
   –  CTR
   –  CPM (revenue per impression)
   –  Unique ads per user (diversity)

Page 242: ENAR short course

Observed / Expected Ratio
•  Offline replay is difficult with a large number of items (randomization is costly)
•  Observed: #clicks in the data; Expected: sum of predicted CTR over all impressions
•  Not a “standard” classifier metric, but useful for this application
•  What we usually see: Observed / Expected < 1
–  Quantifies the “winner’s curse”, i.e. selection bias in auctions
•  When choosing from among thousands of candidates, an item with a mistakenly over-estimated CTR may end up winning the auction
•  Particularly helpful in spotting inefficiencies by segment
–  E.g. by bid, number of impressions in training (warmness), geo, etc.
–  Allows us to see where the model might be giving too much weight to the wrong campaigns
•  High correlation between the O/E ratio and online model performance (a minimal computation is sketched below)
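For example, computing the O/E ratio by segment takes only a couple of lines; the column names below (click, predicted_ctr, warmness_segment) are illustrative and not from the deck.

import pandas as pd

def oe_ratio(df, by="warmness_segment"):
    """Observed/Expected clicks per segment: #clicks divided by summed predicted CTR."""
    g = df.groupby(by)
    return g["click"].sum() / g["predicted_ctr"].sum()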

Page 243: ENAR short course

Offline: ROC Curves

[Figure: ROC curves (true positive rate vs. false positive rate). AUC – CONTROL: 0.672, COLD-ONLY: 0.757, LASER: 0.778]

Page 244: ENAR short course

Online A/B Test
•  Three models
–  CONTROL (10%)
–  LASER (85%)
–  LASER-EE (5%)

•  Segmented analysis
–  8 segments by campaign warmness
•  Degree of warmness: the number of training samples available in the training data for the campaign
•  Segment #1: campaigns with almost no data in training
•  Segment #8: campaigns that are served most heavily in the previous batches, so that their CTR estimate can be quite accurate

Page 245: ENAR short course

Daily CTR Lift Over Control

[Figure: percentage CTR lift over CONTROL for LASER and LASER-EE across Day 1 to Day 7; absolute lift values redacted (+%) in the original slide]

Page 246: ENAR short course

Daily CPM Lift Over Control

[Figure: percentage eCPM lift over CONTROL for LASER and LASER-EE across Day 1 to Day 7; absolute lift values redacted (+%) in the original slide]

Page 247: ENAR short course

CPM Lift By Campaign Warmness Segments

[Figure: percentage CPM lift over CONTROL by campaign warmness segment (1–8) for LASER and LASER-EE]

Page 248: ENAR short course

O/E Ratio By Campaign Warmness Segments

[Figure: observed clicks / expected clicks (roughly 0.5–1.0) by campaign warmness segment (1–8) for CONTROL, LASER, and LASER-EE]

Page 249: ENAR short course

Number of Campaigns Served: Improvement from E/E

Page 250: ENAR short course

Insights
•  Overall performance:
–  LASER and LASER-EE are both much better than control
–  LASER and LASER-EE performance are very similar
•  Great news! We get exploration without much additional cost
•  Exploration has other benefits
–  LASER-EE serves significantly more campaigns than LASER
–  Provides a healthier marketplace and more ad diversity per user (better experience)

Page 251: ENAR short course

Solutions to Practical Problems
•  Rapid model development cycle
–  Quick reaction to changes in data and product
–  Write once for training, testing, inference
•  Can adapt to changing data
–  Integrated Thompson sampling explore/exploit
–  Automatic training
–  Multiple training frequencies for different parts of the model
•  Good tools yield good models
–  Reusable components for feature extraction and transformation
–  Very high-performance inference engine for deployment
–  Modelers can concentrate on building models, not re-writing common functions or worrying about production issues

Page 252: ENAR short course

Summary
•  Reducing dimension through logistic regression, coupled with explore/exploit schemes like Thompson sampling, is an effective mechanism for solving response prediction problems in advertising
•  Partitioning model components into cold-start (stable) and warm-start (non-stationary) parts with different training frequencies is an effective mechanism for scaling the computations
•  ADMM with a few modifications is an effective model training strategy for large, high-dimensional data
•  Methods work well for LinkedIn advertising, with significant improvements

©2013 LinkedIn Corporation. All Rights Reserved.

Page 253: ENAR short course

Theory vs. Practice

Textbook
•  Data is stationary
•  Training data is clean
•  Training is hard; testing and inference are easy
•  Models don’t change
•  Complex algorithms work best

Reality
•  Features and items are changing constantly
•  Fraud, bugs, tracking delays, online/offline inconsistencies, etc.
•  All aspects have challenges at web scale
•  Never-ending processes of improvement
•  Simple models with good features and lots of data win

Page 254: ENAR short course

Current Work: Feed Recommendation
•  Network updates: job change, job anniversaries, connections, endorsements, photo uploads, …
•  Content: articles by influencers, shares by friends, content in followed channels, content by followed companies, job recommendations, …
•  Sponsored updates: company updates, jobs

Page 255: ENAR short course

Tiered Approach to Ranking
§  A second-pass ranker that blends disparate results returned by first-pass rankers

[Diagram: first-pass rankers (Jobs, Ads, Network Updates, Content, …) feed a BLENDER, which returns the top k items]

Page 256: ENAR short course

Challenges
•  Personalization
–  Viewer-actor affinity by type (depends on strength of connections in multiple contexts)
–  Blending identity and behavioral data
•  Frequency discounting, freshness, diversification
•  Multi-objectives (revenue, engagement)
•  A/B tests with interference
•  Engagement metrics
–  Function of various actions that optimizes a long-term engagement metric like return visits
•  Summarization and adding new content types

Page 257: ENAR short course

Impression Discounting
•  How does the response rate vary with past impressions of the same item?

Slide courtesy Pannaga Shivaswamy

Page 258: ENAR short course

Diversity
•  How does the response rate change when an actorId/objectType at a position matches previous items?

Slide courtesy Pannaga Shivaswamy

Page 259: ENAR short course

Age of an Item
•  How does the response rate change for different types with age?

Slide courtesy Pannaga Shivaswamy

Page 260: ENAR short course

Parallel Matrix Factorization

Page 261: ENAR short course

Problem Setup
•  CTR prediction for a user on an item
•  Assumptions:
–  There are sufficient data per item to estimate a per-item model
–  Serving bias and positional bias are removed by a random serving scheme
–  Item popularities are quite dynamic and have to be estimated in real time
•  Examples:
–  Yahoo! Front Page Today module
–  LinkedIn Today module

Page 262: ENAR short course

Online Logistic Regression (OLR)
§  User i with feature xi, article j
§  Binary response y (click/non-click)
§  Prior on the per-article coefficients
§  Using Laplace approximation or variational Bayesian methods to obtain the posterior
§  The posterior becomes the new prior for the next update
§  Can approximate the prior and posterior covariances as diagonal for high-dimensional xi (a sketch of the update follows below)
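The update equations on this slide are images in the original deck. Below is a minimal sketch of one such online update under the stated diagonal approximation, assuming a per-coefficient Gaussian prior whose Laplace-approximated posterior becomes the prior for the next batch; the exact update used in practice may differ.

import numpy as np

def olr_update(x, y, mean, var):
    """One online update for logistic regression with a diagonal Gaussian prior.
    x: feature vector, y: +1/-1 response, (mean, var): per-coefficient prior.
    Returns the Laplace-approximated posterior, used as the next prior."""
    m = float(x @ mean)
    p = 1.0 / (1.0 + np.exp(-y * m))      # probability of the observed label at the prior mean
    g = (1.0 - p) * y * x                 # gradient of the log-likelihood at the prior mean
    h = p * (1.0 - p) * x * x             # diagonal of the Hessian (curvature)
    new_var = 1.0 / (1.0 / var + h)       # posterior precision = prior precision + curvature
    new_mean = mean + new_var * g         # one Newton-style step from the prior mean
    return new_mean, new_var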

Page 263: ENAR short course

User Features for OLR
•  Age, gender, industry, job position for logged-in users
•  General behavior targeting (BT) features
–  Music? Finance? Politics?
•  User profiles from historical view/click behavior on previous items in the data, e.g.
–  Item-profile: use previously clicked item ids as the user profile
–  Category-profile: use item-category affinity scores as the profile; the score can simply be the user’s historical CTR on each category
–  Are there better ways to generate user profiles? Yes! By matrix factorization!

Page 264: ENAR short course

Generalized Matrix Factorization (GMF) Framework

•  Response modeled through global features, a user effect, an item effect, and user/item factors (Bell et al. 2007); see the reconstruction below
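The factorization equation on this slide is an image in the original deck. A standard form consistent with the labels above, and with the regression-prior framework of the cited papers, is roughly (the exact notation in the deck may differ):

$$
y_{ij} \sim f\big(s_{ij}\big), \qquad s_{ij} \;=\; x_{ij}^\top b \;+\; \alpha_i \;+\; \beta_j \;+\; u_i^\top v_j
$$

with global feature effects b, user effect α_i, item effect β_j, and user/item factors u_i, v_j; f is a Gaussian or logistic link depending on the response type.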

Page 265: ENAR short course

Regression Priors
•  Priors on the user/item effects and factors are centered at regression functions of user and item covariates
•  g(·), h(·), G(·), H(·) can be any regression functions
•  Agarwal and Chen (KDD 2009); Zhang et al. (RecSys 2011)
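The prior specification is likewise an image in the deck; based on Agarwal and Chen (2009) and Zhang et al. (2011), the regression priors take roughly the form

$$
\alpha_i \sim N\big(g(x_i), \sigma_\alpha^2\big),\quad \beta_j \sim N\big(h(x_j), \sigma_\beta^2\big),\quad u_i \sim N\big(G(x_i), \sigma_u^2 I\big),\quad v_j \sim N\big(H(x_j), \sigma_v^2 I\big)
$$

where x_i are user covariates and x_j are item covariates.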

Page 266: ENAR short course

Different Types of Prior Regression Models
•  Zero prior mean
–  Bilinear random effects (BIRE)
•  Linear regression
–  Simple regression (RLFM)
–  Lasso penalty (LASSO)
•  Tree models
–  Recursive partitioning (RP)
–  Random forests (RF)
–  Gradient boosting machines (GB)
–  Bayesian additive regression trees (BART)

Page 267: ENAR short course

Model Fitting Using MCEM
•  Monte Carlo EM (Booth and Hobert 1999)
•  Let …
•  Let …
•  E Step:
–  Obtain N samples of the conditional posterior
•  M Step: (a sketch of both steps is given below)
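The two “Let …” definitions on this slide are images in the original deck. A hedged reconstruction of the algorithm they describe: let Δ = (α, β, u, v) denote the latent factors and Θ the prior parameters (the regression functions g, h, G, H and the variances). Then one MCEM iteration is

$$
\text{E-step: draw } \Delta^{(1)},\dots,\Delta^{(N)} \sim p\big(\Delta \mid y, \Theta^{(t)}\big), \qquad
\text{M-step: } \Theta^{(t+1)} = \arg\max_{\Theta} \frac{1}{N}\sum_{n=1}^{N} \log p\big(y, \Delta^{(n)} \mid \Theta\big).
$$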

Page 268: ENAR short course

Handling Binary Responses
•  Gaussian responses: the conditional posteriors in the E-step have closed form
•  Binary responses + logistic link: no longer closed form
•  Variational approximation (VAR)
•  Adaptive rejection sampling (ARS)

Page 269: ENAR short course

Simulation Study
•  10 simulated data sets, 100K samples for both training and test
•  1000 users and 1000 items in training
•  Extra 500 new users and 500 new items in test, plus old users/items
•  For each user/item, 200 covariates, only 10 useful
•  Construct a non-linear regression model from 20 Gaussian functions for simulating α, β, u and v, following Friedman (2001)

Page 270: ENAR short course
Page 271: ENAR short course

MovieLens 1M Data Set
•  1M ratings
•  6040 users
•  3706 movies
•  Sort by time, first 75% training, last 25% test
•  A lot of new users in the test data set
•  User features: age, gender, occupation, zip code
•  Item features: movie genre

Page 272: ENAR short course

Performance  Comparison  

Page 273: ENAR short course
Page 274: ENAR short course

However…
•  We are working with very large-scale data sets!
•  Parallel matrix factorization methods using Map-Reduce have to be developed!
•  Khanna et al. (2012), technical report

Page 275: ENAR short course

Model Fitting Using MCEM
•  Monte Carlo EM (Booth and Hobert 1999)
•  Let …
•  Let …
•  E Step:
–  Obtain N samples of the conditional posterior
•  M Step:


Page 277: ENAR short course

Parallel Matrix Factorization
•  Partition the data into m partitions
•  For each partition, run the MCEM algorithm and get its parameter estimate
•  Average the per-partition estimates [the above is one MapReduce job]
•  Ensemble runs: for k = 1, …, n
–  Repartition the data into m partitions with a new seed
–  Run an E-step-only job for each partition, given the averaged parameters [each ensemble run is a MapReduce job]
•  Average over user/item factors for all partitions and k’s to obtain the final estimate (see the sketch below)
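A minimal sketch of this divide-and-conquer recipe in plain Python (the real implementation runs as Hadoop MapReduce jobs; mcem_fit, estep_only, and combine are placeholders for the per-partition fitting job, the E-step-only scoring job, and the averaging step):

import numpy as np

def parallel_mf(events, m, n_ensembles, mcem_fit, estep_only, combine,
                rng=np.random.default_rng()):
    """Fit MCEM independently on m partitions, average the parameter estimates,
    then run E-step-only passes over fresh random partitions and average factors."""
    parts = np.array_split(rng.permutation(events), m)        # partition the data
    theta = combine([mcem_fit(p) for p in parts])             # one MapReduce job

    factor_draws = []
    for _ in range(n_ensembles):                               # each run = one more job
        parts = np.array_split(rng.permutation(events), m)     # repartition with a new seed
        factor_draws += [estep_only(p, theta) for p in parts]  # E-step only, theta fixed
    return theta, combine(factor_draws)                        # averaged user/item factors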

Page 278: ENAR short course

Key Points
•  Partitioning is tricky!
–  By events? By items? By users?
•  Empirically, “divide and conquer” + averaging the per-partition estimates to obtain the final estimate works well!
•  Ensemble runs: after obtaining the averaged parameters, we run n E-step-only jobs and take the average, each job using a different user-item mix.

Page 279: ENAR short course

Identifiability Issues
•  The same log-likelihood can be achieved by
–  Shifting: g(·) = g(·) + r, h(·) = h(·) − r
•  Fix: center α, β, u to zero mean every E-step
–  Sign flips: u = −u, v = −v
•  Fix: constrain v to be positive
–  Switching columns: u.1, v.1 with u.2, v.2
•  Fix: ui ~ N(G(xi), I), vj ~ N(H(xj), λI)
•  Constraint: diagonal entries λ1 ≥ λ2 ≥ …
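A minimal sketch of how such constraints might be enforced after each E-step; the sign convention used here (making each factor column of v positive on average) is illustrative, and the actual implementation may handle column switching differently, e.g. via the ordered λ constraint above.

import numpy as np

def enforce_identifiability(alpha, beta, u, v):
    """Post-E-step adjustments: center the main effects and user factors,
    and flip whole factor columns so each column of v has a positive mean
    (flipping the same columns of u and v leaves u'v unchanged)."""
    alpha = alpha - alpha.mean()
    beta = beta - beta.mean()
    u = u - u.mean(axis=0)
    sign = np.sign(v.mean(axis=0))
    sign[sign == 0] = 1.0
    return alpha, beta, u * sign, v * sign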

Page 280: ENAR short course

MovieLens 1M Data
•  75% training and 25% test, split by time
•  Imbalanced data
–  User rating = 1: positive
–  User rating = 2, 3, 4, 5: negative
–  5% positive rate
•  Balanced data
–  User rating = 1, 2, 3: positive
–  User rating = 4, 5: negative
–  44% positive rate

Page 281: ENAR short course
Page 282: ENAR short course

Matrix Factorization For User Profile
•  Offline user-profile building period: obtain the user factor ui for user i
•  Online modeling using OLR
–  If a user has a profile (warm-start), use ui as the user feature
–  If not (cold-start), use the regression prior mean G(xi) as the user feature

Page 283: ENAR short course

Offline Evaluation Metric Related to Clicks
•  For model M and J live items (articles) at any time
•  If M = random (constant) model, E[S(M)] = #clicks
•  Unbiased estimate of expected total clicks (Langford et al. 2008)
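The definition of S(M) is an image in the original slides. Under the random-serving assumption, the replay-style estimator of Langford et al. (2008) takes roughly the form

$$
S(M) \;=\; J \sum_{t} y_t \, \mathbf{1}\{\,M(x_t) = a_t\,\}
$$

where a_t is the item actually served (uniformly at random among the J live items) at event t and y_t its click indicator; a random (constant) model matches the served item on about 1/J of the events, so E[S(M)] = #clicks, consistent with the bullet above.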

Page 284: ENAR short course

Experiments on Big Data
•  Yahoo! Front Page Today Module data
•  Data for building user profiles: 8M users with at least 10 clicks (heavy users) in June 2011, 1B events
•  Data for training and testing the OLR model: randomly served data with 2.4M clicks in July 2011
•  Heavy users contributed around 30% of clicks
•  User features for OLR:
–  Intercept-only (MOST POPULAR)
–  124 behavior targeting features (BT-ONLY)
–  BT + top 1000 clicked article ids (ITEM-PROFILE)
–  BT + user profile with CTR on 43 binary content categories (CATEGORY-PROFILE)
–  BT + profiles from matrix factorization models

Page 285: ENAR short course

Click Lift Performance For Different User Profiles

Page 286: ENAR short course

Web Advertising

There are lots of ads on the web … 100s of billions of advertising dollars are spent online per year (eMarketer)

Page 287: ENAR short course

Online advertising: 6000 ft. Overview

[Diagram: Advertisers supply ads to an Ad Network, which picks ads to show to the User alongside content from a Content Provider. Examples: Yahoo, Google, MSN, RightMedia, …]

Page 288: ENAR short course

Web Advertising: Comes in Different Flavors
•  Sponsored (“Paid”) Search
–  Small text links in response to a query to a search engine
•  Display Advertising
–  Graphical, banner, rich media; appears in several contexts like visiting a webpage, checking e-mails, on a social network, …
–  Goals of such advertising campaigns differ
•  Brand awareness
•  Performance (users are targeted to take some action, soon)
–  More akin to direct marketing in the offline world

Page 289: ENAR short course

Paid Search: Advertise Text Links

Page 290: ENAR short course

Display Advertising: Examples

Page 291: ENAR short course

Display Advertising: Examples

Page 292: ENAR short course

LinkedIn  company  follow  ad    

Page 293: ENAR short course

           Brand  Ad  on  Facebook  

Page 294: ENAR short course

Paid Search Ads versus Display Ads

Paid Search
•  Context (query) important
•  Small text links
•  Performance based
–  Clicks, conversions
•  Advertisers can cherry-pick instances

Display
•  Reaching a desired audience
•  Graphical, banner, rich media
–  Text, logos, videos, …
•  Hybrid
–  Brand, performance
•  Bulk buy by marketers
–  But things are evolving
•  Ad exchanges, real-time bidders (RTB)

Page 295: ENAR short course

Display Advertising Models
•  Futures Market (Guaranteed Delivery)
–  Brand awareness (e.g. Gillette, Coke, McDonalds, GM, …)
•  Spot Market (Non-guaranteed)
–  Marketers create targeted campaigns
•  Ad exchanges have made this process efficient
–  Connect buyers and sellers in a stock-market-style marketplace
•  Several portals like LinkedIn and Facebook have self-serve systems to book such campaigns

 

Page 296: ENAR short course

Guaranteed Delivery (Futures Market)
•  Revenue model: cost per ad impression (CPM)
•  Ads are bought in bulk, targeted to users based on demographics and other behavioral features
–  GM ads on LinkedIn shown to “males above 55”
–  Mortgage ad shown to “everybody on Y!”
•  Slots booked in advance and guaranteed
–  E.g. “2M targeted ad impressions in January next year”
–  Prices significantly higher than the spot market
–  Higher-quality inventory delivered to maintain the mark-up

Page 297: ENAR short course

Measuring effectiveness of brand advertising
§  "Half the money I spend on advertising is wasted; the trouble is, I don't know which half." – John Wanamaker
•  Typically
–  Number of visits and engagement on the advertiser website
–  Increase in the number of searches for specific keywords
–  Increase in offline sales in the long run
•  How?
–  Randomized design (treatment = ad exposure, control = no exposure)
–  Sample surveys
–  Covariate shift (propensity score matching)
•  Several statistical challenges (experimental design, causal inference from observational data, survey methodology)

Page 298: ENAR short course

Guaranteed delivery
•  Fundamental problem: guarantee impressions (with overlapping inventory)

[Diagram: overlapping audience segments (Young, US, Female, LI Homepage) with impression counts in each region]

1.  Predict supply
2.  Incorporate/predict demand
3.  Find the optimal allocation
•  subject to supply and demand constraints
(supply si, demand dj, allocation xij)

Page 299: ENAR short course

Example

Demand: US & Y (2)
Supply pools:
–  {US, Y, nF}: supply = 2, price = 1
–  {US, Y, F}: supply = 3, price = 5

How should we distribute impressions from the supply pools to satisfy this demand?

Page 300: ENAR short course

Example (Cherry-picking)
•  Cherry-picking: fulfill demands at least cost

Demand: US & Y (2)
Supply pools:
–  {US, Y, nF}: supply = 2, price = 1  → allocate (2)
–  {US, Y, F}: supply = 3, price = 5

How should we distribute impressions from the supply pools to satisfy this demand?

Page 301: ENAR short course

Example (Fairness)
•  Cherry-picking: fulfill demands at least cost
•  Fairness: equitable distribution over the available supply pools
•  Agarwal and Tomlin, INFORMS, 2010
•  Ghosh et al., EC, 2011

Demand: US & Y (2)
Supply pools:
–  {US, Y, nF}: supply = 2, cost = 1  → allocate (1)
–  {US, Y, F}: supply = 3, cost = 5  → allocate (1)

How should we distribute impressions from the supply pools to satisfy this demand?

Page 302: ENAR short course

The optimization problem
•  Maximize the value of remnant inventory (to be sold in the spot market)
–  Subject to “fairness” constraints (to maintain high quality of inventory in the guaranteed market)
–  Subject to supply and demand constraints
•  Can be solved efficiently through a flow program
•  Key statistical input: supply forecasts
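A hedged sketch of the allocation program this slide describes, in the si / dj / xij notation of the earlier slide (the exact objective and fairness terms in Agarwal and Tomlin (2010) differ in detail): let x_ij be the fraction of supply pool i allocated to contract j, s_i the forecast supply, d_j the demand, and r_i the expected spot-market value of pool i. Then

$$
\max_{x \ge 0} \;\; \sum_i r_i \Big(1 - \sum_j x_{ij}\Big) s_i \;-\; \lambda \sum_{i,j} \text{penalty}(x_{ij})
\quad \text{s.t.} \quad \sum_i x_{ij}\, s_i \ge d_j \;\; \forall j, \qquad \sum_j x_{ij} \le 1 \;\; \forall i,
$$

where the first term is the value of remnant inventory and the penalty term encodes the fairness constraints.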


Page 303: ENAR short course

Various components of a Guaranteed Delivery system

Page 304: ENAR short course

[Diagram: OFFLINE COMPONENTS. A field sales team sells products (segments) to advertisers; contracts are signed, with negotiations involved. A Pricing Engine and an Admission Control module decide whether a new contract request should be admitted (solved via LP), using supply forecasts and demand forecasts & booked inventory as inputs.]

Page 305: ENAR short course

ONLINE SERVING

[Diagram: online ad serving. For each opportunity, a near-real-time optimization picks ads using stochastic supply, stochastic demand, contract statistics, and the allocation plan (from the LP).]

Page 306: ENAR short course

High-dimensional Forecasting
•  Supply forecasts are an important input required both at booking time (admission control) and at serving time
•  Problem: given historical time series data in a high-dimensional space (trillions of combinations), forecast the number of visits for an arbitrary query over a future time horizon
–  E.g.: male visits from Hawaii on LinkedIn next year in January
•  Challenging statistical problem
–  Curse of dimensionality & massive data
–  Arbitrary query subsets
–  Latency constraints
•  Forecasting High-dimensional Data, Agarwal et al., SIGMOD, 2011

Page 307: ENAR short course

Other challenges
•  3Ms: Multi-response, Multi-context modeling to optimize Multiple objectives
–  Multi-response: clicks, shares, comments, likes, … (preliminary work at CIKM 2012)
–  Multi-context: mobile, desktop, email, … (preliminary work at SIGKDD 2011)
–  Multi-objective: tradeoff in engagement, revenue, viral activities
•  Preliminary work at SIGIR 2012, SIGKDD 2011
•  Scaling model computations at run-time to avoid latency issues
–  Predictive indexing (preliminary work at WSDM 2012)

Page 308: ENAR short course

Bibliography
Agarwal, D. and Chen, B. (2009). Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 19–28. ACM.
Agarwal, D., Chen, B., and Elango, P. (2010). Fast online learning through offline initialization for time-sensitive recommendation. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 703–712. ACM.
Bell, R., Koren, Y., and Volinsky, C. (2007). Modeling relationships at multiple scales to improve accuracy of large recommender systems. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 95–104. ACM.
Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(1), 265–285.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1), 1–122.
Bickel, P. J., Götze, F., and van Zwet, W. R. (2012). Resampling fewer than n observations: gains, losses, and remedies for losses (pp. 267–297). Springer New York.
Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.

Page 309: ENAR short course

Bibliography
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1–26.
Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. (2012). The big data bootstrap. arXiv preprint arXiv:1206.6415.
Khanna, R., Zhang, L., Agarwal, D., and Chen, B. (2012). Parallel matrix factorization for binary response. arXiv.org.
Zhang, L., Agarwal, D., and Chen, B. (2011). Generalizing matrix factorization through flexible regression priors. In Proceedings of the fifth ACM conference on Recommender systems, 13–20. ACM.

