Microsoft Research
Online Search and Advertising, Future and Present
Chris Burges, Microsoft Research
Saturday, Dec 13, 2008
Text Mining, Search and Navigation 1
Contents
• Search and Advertising – some ideas
• Where are we headed?
• How to begin?
• Some new results on ranking: we can directly learn Information Retrieval measures
• Internet security and RSA: why worry?
Why Search Works…
• Traditional: print, TV, radio, billboards, …
• Only very broadly targeted to demographics (some exceptions)
• Search is monetarily successful because advertising is more precisely targeted
• The Google model is giving, and will continue to give, traditional channels a run for their money
Key Points
• The online experience will be more deeply engaging.
What’s wrong with what we do now?
• Nothing, but… ten blue links + ads, ten years from now?
• Ads are ‘tacked on’ to the user experience.
• Paid Search / Contextual / Banner – all are still largely impersonal.
• But, Behavioral Targeting…
How might ads be targeted better?
• I just bought a car – don’t show me more ads for cars
• I just bought a house – show me ads for furniture
• I like band X, but not Y
• In general, build a model of what I’m in the market for
• Per-user pricing, availability
• User-driven asks (show me all ads for Z)
User Models
• User models can be used to enrich the online experience, not just advertising.
  – Automated teaching: need a model of the user’s understanding.
• Find other users with similar interests
• Tailor news presentation to user’s interests
Key Points
• The online experience will be more deeply engaging.
• We will need rich state models of users: likes, dislikes, ± interests, knowledge
Search: Somewhere in the Near Future
[Diagram: a query is mapped, via structured data, to a distribution over intents (e.g. 84% informational, 12% navigational, 4% transactional; 78% commercial, …); indexed web data and structured data (diversity; popular pages; aiding the transaction) feed the display, with human–computer dialog refining the loop.]
How to get the information we need, to build good models for users?
Ask them!
Key Points
• The online experience will be more deeply engaging.
• We will need rich state models of users: likes, dislikes, ± interests, knowledge, and more.
• Natural Language Processing will be key.
Search Applications: And, Data Changes Everything
• Example: AskMSR (Brill, Dumais, Banko, ACL 2002)
• Commonly used resources for QA:
  • Part-of-speech tagger, parser, named-entity extractor, WordNet or other knowledge bases, passage or sentence retrieval, abduction, etc.
• AskMSR doesn’t use any of them
• Instead, AskMSR focuses on data:
  • There is a lot of data on the web – use it
  • Redundancy is a resource to be exploited
• Data-driven QA: simple techniques, lots of data
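The redundancy idea can be sketched with a toy answer extractor: after dropping stopwords, the most frequent n-gram across search snippets is taken as the candidate answer. This is only a sketch in the AskMSR spirit, not its implementation; the snippets and stopword list here are invented.

```python
from collections import Counter
import re

STOP = {"the", "is", "in", "on", "at", "of", "a", "and"}

def askmsr_style_answer(snippets, n=2):
    """Toy redundancy-based QA: count n-grams across snippets,
    return the most frequent one as the candidate answer."""
    counts = Counter()
    for s in snippets:
        words = [w for w in re.findall(r"[a-z0-9]+", s.lower()) if w not in STOP]
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(1)[0][0]

snippets = [
    "Mount Everest is the tallest mountain on Earth.",
    "The tallest mountain in the world is Mount Everest.",
    "Everest, at 8848 m, is the tallest mountain.",
]
print(askmsr_style_answer(snippets))  # -> "tallest mountain"
```

The point matches the slide: no parser, no knowledge base, just counting over redundant data.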
Data Changes Everything
Banko and Brill, Mitigating…, ICHLTR 2001
Data Changes Everything
Banko and Brill, Scaling…, 2001
Key Points
• The online experience will be more deeply engaging.
• We will need rich state models of users: likes, dislikes, ± interests, knowledge, and more.
• Natural Language Processing will be key.
• “Search” can be the engine under the hood for many different applications.
• It’s better to use tons of data and simple models, versus smaller datasets and complex models.
How to proceed?
• Don’t know. But: Sam, a Search Chatbot.
  – Provide an engaging chat experience
  – Use Search to show images, urls, videos, …
  – Will build persistent user world models
  – Will have its own world model
  – Can show precisely targeted ads
  – Will leverage social networks
The Eliza Effect
• Eliza: J. Weizenbaum, 1966 (!)
• Demonstrated that extremely simple techniques can result in compelling dialog (sometimes, for some users)
• Users tend to anthropomorphize computer behavior
• This gives us an advantage
Our Prime Directive in Building Sam:
Do as little supervision as possible.
Let the Data do the Work
• anarchism, categories: anarchism; political ideologies; political philosophies; social philosophy
• autism, categories: autism; pervasive developmental disorders; childhood psychiatric disorders; communication disorders; neurological disorders
• albedo, categories: electromagnetic radiation; climatology; climate forcing; scattering, absorption and radiative transfer (optics); radiometry
• abu dhabi, categories: abu dhabi; capitals in asia; cities in the united arab emirates; coastal cities
• a, categories: latin letters; vowel letters
Robert Rounthwaite, TMSN
Using Category Graphs to Drive Dialog
– User: I like ferrets.
– Ferret: category: animals people keep as pets
– Animals people keep as pets: rabbits
– Sam: Do you like rabbits, too?
Use ODP and Wikipedia hierarchies to construct graph
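The topic → category → sibling-topic walk behind this exchange can be sketched in a few lines. The miniature graph below is invented for illustration; a real system would populate it from the ODP and Wikipedia hierarchies.

```python
# Hypothetical miniature of a category graph built from ODP / Wikipedia.
topic_to_cats = {
    "ferret": ["animals people keep as pets"],
    "rabbit": ["animals people keep as pets"],
    "goldfish": ["animals people keep as pets"],
}

def suggest_sibling(topic):
    """Walk topic -> category -> sibling topic to drive a follow-up question."""
    cats = set(topic_to_cats.get(topic, []))
    for other, other_cats in topic_to_cats.items():
        if other != topic and cats & set(other_cats):
            return f"Do you like {other}s, too?"
    return None

print(suggest_sibling("ferret"))  # -> "Do you like rabbits, too?"
```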
Use Category Graphs to Build Models
• Attach vector to each node, sparsely:
  – [like/dislike; interested/not; knows about; …]
• Each component has confidence level
• Leverage graph structure to explore
“World model” for both user and for Sam
Other Useful Sources of Data
[Diagram: bipartite query–URL click graph, with query nodes qᵢ, qᵢ₊₁, qᵢ₊₂, … linked to URL nodes uⱼ, uⱼ₊₁, uⱼ₊₂, …; random-walk hitting times on this graph yield query suggestions.]
Q. Mei, D. Zhou, K. Church, Query Suggestion Using Hitting Time, CIKM 2008
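The hitting-time idea can be sketched with a Monte Carlo random walk on a toy click graph. The graph and parameters below are invented; Mei et al. compute hitting times exactly by iteration rather than by simulation.

```python
import random

# Hypothetical bipartite click graph: query nodes <-> URL nodes.
edges = {
    "q:flowers": ["u:1", "u:2"],
    "q:florist": ["u:2", "u:3"],
    "q:plumber": ["u:4"],
    "u:1": ["q:flowers"],
    "u:2": ["q:flowers", "q:florist"],
    "u:3": ["q:florist"],
    "u:4": ["q:plumber"],
}

def mean_hitting_time(start, target, walks=2000, max_steps=50, seed=0):
    """Estimate the expected number of steps for a random walk from
    `start` to first reach `target`; inf if it is never reached."""
    rng = random.Random(seed)
    total, hits = 0, 0
    for _ in range(walks):
        node = start
        for step in range(1, max_steps + 1):
            node = rng.choice(edges[node])
            if node == target:
                total += step
                hits += 1
                break
    return total / hits if hits else float("inf")

# A related query is hit quickly; an unrelated one is never reached.
print(mean_hitting_time("q:flowers", "q:florist"))
print(mean_hitting_time("q:flowers", "q:plumber"))  # inf: disconnected
```

Queries with small hitting time from the user's query are good suggestions.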
Temporal Querying Behavior
[Plot: lift in probability of issuing the queries “resume” and “moving”, as a function of days since the query “interview”, over a ±300-day window; lift ranges roughly from 0.4 to 2.]
M. Richardson, ACM Transactions on the Web 2008
We Are Not Alone
• http://quizbot.trueknowledge.com/
• http://www.jabberwacky.com/
Jab: Do you like computers?
Me: Yes, but only when they are turned on.
Jab: I like ferraris, do you like them also?
Me: Not really. Why do you like them?
Jab: Yes.
Me: Why do you like them?
Jab: No, I like going to mars.
One Possible Sentence Generator
• Inputs:
  – Sentiment
  – Distribution over topics under discussion
  – Features from recent sentences
  – Sentence or phrase database (with statistics)
  – Distributions over user’s likes / interests, etc.
  – Close or popular nodes where bot lacks knowledge of user
  – Topic priors
• Output: ranked sentences
New Challenges for Machine Learning
• How can we teach a chatbot to talk?
  – “Good / bad response” buttons: reinforcement learning?
  – ESP-like games for labeling, for learning to rank sentences?
  – Build natural sentences from phrases?
• How can we learn effective user models?
  – Combine from multiple users to form good priors
  – Use active learning during chat to reduce uncertainty in the user’s model
Demo
Joint work with Scott Imig, Silviu Cucerzan
S. Cucerzan, Large-Scale Named Entity Disambiguation Based on Wikipedia Data, Proc. 2007 Joint Conference on EMNLP and CoNLL
Empirical Optimality of λ-Rank
Joint work with:
– Pinar Donmez (CMU)
– Krysta Svore (MSR)
– Yisong Yue (Cornell)
Some IR Measures
• Precision: fraction of returned documents that are relevant. Recall: fraction of relevant documents that are returned.
• Average Precision: compute precision at each position holding a relevant document, average over those positions
• Mean Average Precision (MAP): average AP over queries
• Mean Reciprocal Rank (TREC QA): 1 / (rank of first relevant document), averaged over queries
• Mean NDCG: NDCG@L ≡ N_L Σ_{r=1..L} (2^{l_r} − 1) / log(1 + r), averaged over queries, where N_L normalizes so a perfect ranking scores 1
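These measures are short to state in code. A sketch (binary labels for AP and MRR, graded labels with the 2^l − 1 gain for NDCG):

```python
import math

def average_precision(labels):
    """labels: binary relevance, in model-ranked order."""
    hits, score = 0, 0.0
    for i, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            score += hits / i  # precision at this relevant position
    return score / max(hits, 1)

def reciprocal_rank(labels):
    for i, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def ndcg(labels, k):
    """labels: graded relevance, in model-ranked order."""
    dcg = sum((2 ** l - 1) / math.log2(1 + r)
              for r, l in enumerate(labels[:k], start=1))
    ideal = sum((2 ** l - 1) / math.log2(1 + r)
                for r, l in enumerate(sorted(labels, reverse=True)[:k], start=1))
    return dcg / ideal if ideal > 0 else 0.0

ranked = [1, 0, 1, 0]             # binary labels, in ranked order
print(average_precision(ranked))  # (1/1 + 2/3) / 2 = 0.833...
print(reciprocal_rank(ranked))    # 1.0
print(ndcg([3, 2, 0, 1], k=4))
```

Note all three depend on the scores only through the sorted order, which is exactly why they are flat or discontinuous in the scores.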
IR Measures, cont.
These measures:
• Depend only on the labels and the sorted order of the documents
• Viewed as a function of the scores output by some model, are everywhere either flat or discontinuous
  – SVM MAP: Yue et al., SIGIR ’07
  – Tao Qin, Tie-Yan Liu, Hang Li, MSR Tech Report 164 (2008)
The RankNet Cost
Modeled posteriors: P_ij ≡ P(doc i ranked above doc j)
Target posteriors: P̄_ij
Define: o_i ≡ f(x_i), o_ij ≡ o_i − o_j
Cross entropy cost: C = −P̄_ij log P_ij − (1 − P̄_ij) log(1 − P_ij)
Model output probabilities using logistic: P_ij = 1 / (1 + e^{−o_ij})
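A sketch of the pair cost defined above, in the form used by RankNet (cross entropy between the target pair posterior and a logistic of the score difference):

```python
import math

def ranknet_cost(s1, s2, p_target):
    """Cross entropy between target posterior p_target = P(doc1 > doc2)
    and modeled posterior P12 = logistic(s1 - s2)."""
    o = s1 - s2
    p_model = 1.0 / (1.0 + math.exp(-o))
    return -p_target * math.log(p_model) - (1 - p_target) * math.log(1 - p_model)

# Equivalent closed form: C = -p_target * o + log(1 + e^o)
print(ranknet_cost(1.5, 0.0, 1.0))   # small: model agrees with target
print(ranknet_cost(-1.5, 0.0, 1.0))  # larger: model disagrees
```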
[Plot: the RankNet cost C(o₁ − o₂) versus o₁ − o₂, for target posteriors P = 0.0, 0.5, 1.0; each curve is convex in the score difference.]
RankNet Cost ~ Pairwise Cost
[Plot: C(o₁ − o₂) versus o₁ − o₂; for large |o₁ − o₂| the RankNet cost becomes linear in the score difference, so it behaves like a pairwise cost.]
Pairwise Cost Revisited
Pairwise cost is fine if there are no errors, but:
[Figure: two example rankings of the same documents, with 13 and 11 pairwise errors respectively.]
LambdaRank
Instead of using a smooth approximation to the cost, and taking derivatives, write down the derivatives directly.
Then use these derivatives to train a model using gradient descent, as usual.
The Lambda Function
NDCG gain from swapping the members of a pair of docs, multiplied by the RankNet cost gradient as a smoother:
Let U_i (D_i) be the set of documents labeled higher (lower) than document i; then
λ_i = Σ_{j∈D_i} |ΔNDCG_ij| / (1 + e^{s_i − s_j}) − Σ_{j∈U_i} |ΔNDCG_ij| / (1 + e^{s_j − s_i})
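One concrete reading of this lambda, as a sketch: the exact weighting and sign conventions below are assumptions in the spirit of the later LambdaRank write-ups, not a transcription of the talk.

```python
import math

def ndcg_delta(labels, i, j, k=10):
    """|change in DCG| from swapping the documents at ranks i and j (0-based)."""
    def gain(l, r):
        return (2 ** l - 1) / math.log2(2 + r) if r < k else 0.0
    return abs(gain(labels[i], i) + gain(labels[j], j)
               - gain(labels[i], j) - gain(labels[j], i))

def lambdas(scores, labels):
    """Per-document lambda: each pair contributes |delta NDCG| times the
    RankNet-style gradient 1/(1 + e^{s_i - s_j}), pushing the more
    relevant document up and the less relevant one down."""
    n = len(scores)
    lam = [0.0] * n
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:  # i should rank above j
                force = ndcg_delta(labels, i, j) / (1 + math.exp(scores[i] - scores[j]))
                lam[i] += force
                lam[j] -= force
    return lam

print(lambdas(scores=[0.1, 0.9], labels=[2, 0]))  # doc 0 pushed up, doc 1 down
```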
Local Optimality
• Check that the gradient vanishes at the solution.
• Get a bound on the probability that we’re not at a local max, using one-sided Monte Carlo
P(we miss an ascent direction despite k trials) ≤ (1 − p)^k, where p is the fraction of directions that are ascent directions.
How large must k be to push this below a desired bound?
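The one-sided Monte Carlo check can be sketched directly: sample random directions and see whether any step increases the objective. The test function, step size, and trial count below are illustrative.

```python
import math, random

def prob_miss_bound(p, k):
    """If a fraction p of random directions are ascent directions,
    P(all k independent samples miss) <= (1 - p)**k."""
    return (1 - p) ** k

def looks_locally_optimal(f, x, trials=500, eps=1e-3, seed=0):
    """One-sided Monte Carlo: sample random unit directions; if none
    increases f, we are at a local max with high probability (never certainty)."""
    rng = random.Random(seed)
    fx = f(x)
    for _ in range(trials):
        d = [rng.gauss(0, 1) for _ in x]
        norm = math.sqrt(sum(c * c for c in d))
        y = [xi + eps * c / norm for xi, c in zip(x, d)]
        if f(y) > fx:
            return False  # found an ascent direction
    return True

f = lambda x: -(x[0] ** 2 + x[1] ** 2)  # max at the origin
print(looks_locally_optimal(f, [0.0, 0.0]))  # True
print(looks_locally_optimal(f, [0.5, 0.0]))  # False
print(prob_miss_bound(p=0.01, k=1000))       # ~4.3e-5
```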
Data Sets
• Artificial: 300 features, 50 urls/query, 10k/5k/10k train/valid/test split
• Web 1: 420 features, 26 urls/query, 10k/5k/10k split
• Web 2: 30k/5k/10k split
Which function to choose?
Lambda Gradient MAP ± SE MRR ± SE
RankNetWeightPairs 0.462 ± 0.0048 0.524 ± 0.0059
LocalGradient 0.435 ± 0.0048 0.515 ± 0.0060
LocalCost 0.427 ± 0.0049 0.512 ± 0.0059
SpringSmooth 0.424 ± 0.0048 0.498 ± 0.0058
DiscreteGradient 0.401 ± 0.0049 0.471 ± 0.0059
• LocalGradient: finite element estimate of gradient, with margin
• LocalCost: estimate local gradient using neighbors + weighted RankNet cost
• SpringSmooth: smoother version of RankNetWeightPairs
• DiscreteGradient: finite element estimate using optimal position
Sample Size Matters
Test measure | Train measure | 10K Train score | 30K Train score
NDCG | NDCG | 0.416 | 0.428
NDCG | MAP | 0.412 | 0.422
NDCG | MRR | 0.396 | 0.406
MAP | NDCG | 0.442 | 0.453
MAP | MAP | 0.439 | 0.456
MAP | MRR | 0.429 | 0.449
MRR | NDCG | 0.519 | 0.532
MRR | MAP | 0.516 | 0.533
MRR | MRR | 0.508 | 0.537
• Number of pairs drops by >2 for MRR and MAP
• For MRR, the number of samples drops much further
IR Measure Optimality - Conclusions
• Typically, IR practitioners would train models with small numbers of ‘smart’ features (~ BM25), and perform grid search
• However, adding many weak features improves performance
• We have shown that the LambdaRank gradients optimize three IR measures directly
Factoring biprimes as optimization
• The security of internet commerce (SSL, RSA) rests on a mathematical conjecture, namely, that factoring biprimes is combinatorially hard.
• Conjectures aren’t necessarily true. If this conjecture is false, there is no simple backup plan.
• The current fastest known factoring method is the general number field sieve. It is slow: its heuristic running time is exp(((64/9)^{1/3} + o(1)) (ln N)^{1/3} (ln ln N)^{2/3}).
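The GNFS heuristic running time can be evaluated for standard key sizes to see the growth; the o(1) term is ignored, so these are rough scale estimates, not operation counts.

```python
import math

def gnfs_ops(bits):
    """Heuristic GNFS cost exp((64/9)^(1/3) (ln N)^(1/3) (ln ln N)^(2/3))
    for N ~ 2**bits, ignoring the o(1) term."""
    ln_n = bits * math.log(2)
    return math.exp((64 / 9) ** (1 / 3) * ln_n ** (1 / 3)
                    * math.log(ln_n) ** (2 / 3))

for bits in (512, 768, 1024, 2048):
    print(bits, f"{gnfs_ops(bits):.2e}")
```

Doubling the key size multiplies the cost by many orders of magnitude, which is why the larger RSA challenge numbers on the next slide remain open.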
Circumstantial Evidence That Factoring Is Not NP-hard
• There are very few known problems that quantum computers could solve exponentially faster than classical computers.
• Factoring is one of them (Shor, ’94). The “discrete logarithm” is one more.
• Much work since then to find a quantum algorithm that solves an NP-complete problem has failed. A quantum computer must use domain knowledge (S. Aaronson, 2008).
• Searching a list of solutions gives only a quadratic, not exponential, speedup (Grover, ’96).
Is This The Best We Can Do?
Even exponential complexity, but with better N dependence, would be interesting.
RSA Challenge
RSA Number | Decimal digits | Binary digits | Cash prize offered | Factored on | Factored by
RSA-100 | 100 | 330 | | April 1991 | Arjen K. Lenstra
RSA-110 | 110 | 364 | | April 1992 | Arjen K. Lenstra and M.S. Manasse
RSA-120 | 120 | 397 | | June 1993 | T. Denny et al.
RSA-129 | 129 | 426 | $100 USD | April 1994 | Arjen K. Lenstra et al.
RSA-130 | 130 | 430 | | April 10, 1996 | Arjen K. Lenstra et al.
RSA-140 | 140 | 463 | | February 2, 1999 | Herman J. J. te Riele et al.
RSA-150 | 150 | 496 | | April 16, 2004 | Kazumaro Aoki et al.
RSA-155 | 155 | 512 | | August 22, 1999 | Herman J. J. te Riele et al.
RSA-160 | 160 | 530 | | April 1, 2003 | Jens Franke et al., University of Bonn
RSA-170 | 170 | 563 | | open |
RSA-576 | 174 | 576 | $10,000 USD | December 3, 2003 | Jens Franke et al., University of Bonn
RSA-180 | 180 | 596 | | open |
RSA-190 | 190 | 629 | | open |
RSA-640 | 193 | 640 | $20,000 USD | November 2, 2005 | Jens Franke et al., University of Bonn
RSA-200 | 200 | 663 | | May 9, 2005 | Jens Franke et al., University of Bonn
RSA-210 | 210 | 696 | | open |
RSA-704 | 212 | 704 | $30,000 USD | open |
RSA-768 | 232 | 768 | $50,000 USD | open |
RSA-896 | 270 | 896 | $75,000 USD | open |
RSA-1024 | 309 | 1024 | $100,000 USD | open |
RSA-1536 | 463 | 1536 | $150,000 USD | open |
RSA-2048 | 617 | 2048 | $200,000 USD | open |
(Wikipedia)
Represent the Problem in Binary

Write the factors with unknown bits, (1 𝑥2 𝑥1 1)₂ × (1 𝑦1 1)₂ = 1011011₂; the long-multiplication rows are (1 𝑥2 𝑥1 1), (𝑦1 𝑥2𝑦1 𝑥1𝑦1 𝑦1), and (1 𝑥2 𝑥1 1). Matching the columns of the product, bit by bit with carry bits 𝑧𝑖:

𝑥1 + 𝑦1 = 1
𝑥2 + 𝑥1𝑦1 + 1 = 0 + 2𝑧1
1 + 𝑥2𝑦1 + 𝑥1 + 𝑧1 = 1 + 2𝑧2
𝑦1 + 𝑥2 + 𝑧2 = 1 + 2𝑧3
1 + 𝑧3 = 0 + 2𝑧4
𝑧4 = 1
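On a toy instance like this, the bit equations can simply be solved by enumeration, recovering the factors directly:

```python
from itertools import product

# Enumerate the unknown bits and keep assignments satisfying all six
# carry equations from the slide.
solutions = []
for x1, x2, y1, z1, z2, z3, z4 in product((0, 1), repeat=7):
    if (x1 + y1 == 1
            and x2 + x1 * y1 + 1 == 0 + 2 * z1
            and 1 + x2 * y1 + x1 + z1 == 1 + 2 * z2
            and y1 + x2 + z2 == 1 + 2 * z3
            and 1 + z3 == 0 + 2 * z4
            and z4 == 1):
        solutions.append((x1, x2, y1))

print(solutions)  # [(0, 1, 1)]: factors 1101b = 13 and 111b = 7, product 91
```

Of course, for cryptographic sizes this enumeration is exactly the exponential search the tricks on the following slides try to avoid.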
First Trick: Linearization
Replace 𝑥𝑖𝑦𝑗 by 𝜂𝑖𝑗 everywhere, and add constraints:
𝑥𝑖 + 𝑦𝑗 ≥ 2𝜂𝑖𝑗
𝑥𝑖 + 𝑦𝑗 − 1 ≤ 𝜂𝑖𝑗
0 ≤ 𝜂𝑖𝑗 ≤ 1

Key trick: for {𝑥, 𝑦, 𝜂} ∈ {0,1}:
𝑥 + 𝑦 − 1 ≤ 𝜂: 𝑥𝑦 = 1 → 𝜂 = 1
𝑥 + 𝑦 ≥ 2𝜂: 𝑥𝑦 = 0 → 𝜂 = 0
𝑥 + 𝑦 ≥ 2𝜂: 𝜂 = 1 → 𝑥𝑦 = 1
𝑥 + 𝑦 − 1 ≤ 𝜂: 𝜂 = 0 → 𝑥𝑦 = 0
so in {0,1}, 𝜂 = 𝑥𝑦.
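The key trick is small enough to verify exhaustively:

```python
from itertools import product

def feasible_etas(x, y):
    """Etas in {0, 1} satisfying the two linearization constraints."""
    return [eta for eta in (0, 1) if x + y >= 2 * eta and x + y - 1 <= eta]

# Over binary points the constraints pin eta to exactly x * y.
for x, y in product((0, 1), repeat=2):
    print((x, y), feasible_etas(x, y))
```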
Linearization, cont.
Theorem: Integer solutions of the {𝑥,𝑦,𝑧,𝜂} equations are in 1-1 correspondence with integer solutions of the {𝑥,𝑦,𝑧} equations. Given two corresponding solutions, the 𝑥,𝑦 and 𝑧 variables take the same values.
(Not immediately obvious: e.g. 𝑥1𝑦1 = 1, 𝑥2𝑦2 = 1 → 𝑥1𝑦2 = 1)
Second Trick: Quantization
Reduce feasible region via Linear Programming.
Maximize 𝑐·𝑥, 𝑥 ∈ 𝐹, with 𝑐 = (1, 0): max𝑥 𝑐·𝑥 = 0.95 → 𝑥1 = 0
Maximize 𝑐·𝑥, 𝑥 ∈ 𝐹, with 𝑐 = (1, 1): 𝑥1ᵐ = 1.0, 𝑥2ᵐ = 0.9, 𝑐·𝑥 = 1.9
Impose 𝑐·𝑥 ≤ 1:
{𝑥 : 𝑐·𝑥 ≤ 1} ∩ 𝐹
Quantization Without LPs
𝑥1 + 𝑥2 ≤ 1,  𝑥2 + 𝑥3 ≤ 1,  𝑥3 + 𝑥1 ≤ 1
→ 𝑥1 + 𝑥2 + 𝑥3 ≤ 1.5
→ 𝑥1 + 𝑥2 + 𝑥3 ≤ 1
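The rounding step can be checked by enumeration; note the tightened cut removes only fractional points of the LP relaxation, never binary points:

```python
from itertools import product

def pairwise_ok(x):
    """The three pairwise constraints from the slide."""
    return x[0] + x[1] <= 1 and x[1] + x[2] <= 1 and x[2] + x[0] <= 1

binary_feasible = [x for x in product((0, 1), repeat=3) if pairwise_ok(x)]
print(binary_feasible)  # only points with at most one coordinate set

# (0.5, 0.5, 0.5) satisfies the pairwise constraints but violates the
# tightened cut x1 + x2 + x3 <= 1, so the cut shrinks the LP relaxation.
print(pairwise_ok((0.5, 0.5, 0.5)), sum((0.5, 0.5, 0.5)) <= 1)
```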
More Simple Tricks
• Checking quantized versions of LP solutions is very fast (do the long division)
• Concentrate on the subspace.
• Work with the smallest dimensional subspaces that give new constraints.
• Randomized algorithms?
5081 * 6007 = 30521567:1001111011001 * 1011101110111 = 1110100011011100011011111
The Geometric View
Distance from origin to simplex: 1/√𝑛.  Volume of ‘corner’: 1/𝑛!.  Longest span inside unit cube: √𝑛.

Lemma: Denote the binary variables corresponding to vertex 𝑣 ∈ 𝒰 by 𝑏𝑖 ∈ {0,1}, 𝑖 = 1, …, 𝑛. Then the (un-normalized) normal to the hyperplane defined by the regular simplex which intersects all vertices which differ from 𝑣 by one (in one coordinate) is 𝑛 = 𝟙 − 2𝑣 (where 𝟙 is the vector of all ones), where the sign has been chosen such that 𝑛 at 𝑣 points into 𝒰. The equation of the corresponding hyperplane is 𝑥·𝑛 = 1 − 𝑝, where 𝑝 is the number of ones in 𝑣, and the corresponding constraint (delimiting the region lying inside 𝒰 but not including 𝑣) is 𝑥·𝑛 ≥ 1 − 𝑝.
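The three quantities quoted above are one-liners to compute:

```python
import math

def corner_geometry(n):
    """Unit-cube 'corner' quantities from the slide."""
    dist_to_simplex = 1 / math.sqrt(n)      # origin to hyperplane sum(x) = 1
    corner_volume = 1 / math.factorial(n)   # volume of {x >= 0, sum(x) <= 1}
    longest_span = math.sqrt(n)             # main diagonal of the unit cube
    return dist_to_simplex, corner_volume, longest_span

d, v, s = corner_geometry(3)
print(d, v, s)  # 0.577..., 1/6, 1.732...
```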
Projections Lose Information
How fast can randomized projections in subspaces find the solution?
Conclusions
• Search (and advertising) are likely to become more ubiquitous and better targeted
• Ranking algorithms are a key tool, and we can directly optimize finicky IR measures
• RSA is probably safe as houses, but we should probe it
A Simple Example

Two documents 𝐷1, 𝐷2, with labels 𝑙1 = 1, 𝑙2 = 0.

Imagine some cost C:
𝜆1 ≡ ∂C(𝑠1, 𝑙1, 𝑠2, 𝑙2) / ∂𝑠1
𝜆2 ≡ ∂C(𝑠1, 𝑙1, 𝑠2, 𝑙2) / ∂𝑠2
Letting 𝑥 ≡ 𝑠1 − 𝑠2, define:
𝑥 < 0:      𝜆1 = −1
0 ≤ 𝑥 < 1:  𝜆1 = 𝑥 − 1
𝑥 ≥ 1:      𝜆1 = 0
(and 𝜆2 = −𝜆1)

Then a cost function exists:
𝑥 < 0:      C(𝑠1, 𝑙1, 𝑠2, 𝑙2) = 𝑠2 − 𝑠1 + ½
0 ≤ 𝑥 < 1:  C(𝑠1, 𝑙1, 𝑠2, 𝑙2) = ½ (𝑠1 − 𝑠2 − 1)²
𝑥 ≥ 1:      C(𝑠1, 𝑙1, 𝑠2, 𝑙2) = 0

…furthermore it’s convex
LambdaRank
• Choose the 𝜆’s to model the desired cost. (Need not use pairs!)
• Very general. Handles multivariate, non-smooth costs.
• But, how to choose the 𝜆’s?
• When will there exist a cost function C for your choice of 𝜆’s?
• When will that C be convex?
Some Multilinear Algebra Basics
• An ‘n-form’ on a manifold M is a totally antisymmetric tensor that lives in the dual of the tangent space of M
• You can apply the differential operator d to an n-form to get an (n+1)-form
• A closed form f is one for which df = 0
• An exact form g is one for which g = dh, for some form h
• dd = 0 (every exact form is closed)
Poincaré’s Lemma

If S ⊆ ℝⁿ is an open set that is star-shaped with respect to the origin, then any closed form defined on S is exact.

Hence on such a set, a form is exact iff it is closed.

Define the 1-form 𝜆 ≡ Σᵢ 𝜆ᵢ d𝑥ᵢ.
Then 𝜆 = dC for some C iff d𝜆 = 0, i.e. iff ∂𝜆ᵢ/∂𝑥ⱼ = ∂𝜆ⱼ/∂𝑥ᵢ for all i, j.
Using classical notation: Jacobian symmetric!
The Jacobian
• Square matrix, of side nDocs
• Family of Jacobians, one for each label set
• Symmetric ⇒ a cost function exists
• Positive semidefinite ⇒ that cost function is convex
• (…like a kernel, but more general: depends on all points!)
A Physical Analogy
• Think of ranked documents as point masses, 𝜆’s as forces
• If d𝜆 = 0, the forces are conservative – they derive from a potential
• E.g. choosing the 𝜆’s to be linear in the scores is equivalent to a spring model
LambdaRank Speedup for RankNet
• Most neural net training is stochastic (update weights after every pattern)
• Here we can compute and increment the 𝜆 gradients for each document (mini-batch)
• Batch them, apply fprop and backprop once per doc, per query; factorize the gradient.