Date post: 29-Sep-2020
Intern Presentation A/B Testing by Interleaving Sida Wang
Page 1: Intern Presentation shortsidaw/projects/nontechnicalslides.pdf · Intern Presentation A/B Testing by Interleaving Sida Wang. My Project •Evaluating search relevance by interleaving

Intern Presentation

A/B Testing by Interleaving

Sida Wang

Page 2:

My Project

• Evaluating search relevance by interleaving results and collecting user data

– Interleaving Framework

• Generic, Extensible

– Experiments to evaluate relevance by interleaving

• Based on the paper How Does Clickthrough Data Reflect Retrieval Quality? by F. Radlinski et al.

Page 3:

Evaluating Search Relevance

• Without Interleaving

- Full time human judges -> precision, recall, NDCG

- Compare Search

Page 4:

Compare Search

Result S1

Result S2

Result S3

Result G1

Result G2

Result G3

Page 5:

Page 6:

Issues

• Aas

• But do Microsoft people pick O14 Search or Google Mini?

• Maybe people tend to pick the left?

• Alters the search experience

– Can never collect a lot of data using this method

Page 7:

By Interleaving

[Figure: two result lists, A1 to A3 (all marked Relevant) and B1 to B3 (all marked Useless), merged into a single interleaved list.]

Page 8:

By Interleaving

[Figure: result lists A1 to A3 and B1 to B3 merged into a single interleaved list, shown without relevance labels.]

Page 9:

Considerations

• Minimize impact to UX

– So no demo, it looks exactly like normal search

• Minimize Bias

– Summary normalization

– Interleaving algorithms

• Reliability / performance / and the usual

Page 10:

Experiments I did

• Automated random clicks

• Automated clicks according to relevance judgments

• Clicks from real people

Page 11:

Random Clicks

[Plot: Control Using Automated Random Clicks. x-axis: Clicks (0 to 4500); y-axis: % of Votes Received (0 to 0.6); series: Betaa, MSW, Ties.]
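A random-click control of this kind can be approximated offline. The sketch below is an illustration under stated assumptions, not the harness used in the experiment: the two rankers are simply called "A" and "B", each pair of result slots holds one result from each ranker in coin-flip order, and every result is clicked independently with a hypothetical probability `click_prob`. An impression votes for whichever side got more clicks, and counts as a tie otherwise.

```python
import random


def simulate_random_clicks(impressions, list_len=10, click_prob=0.2, seed=0):
    """Return the share of impressions won by A, by B, and tied."""
    rng = random.Random(seed)
    votes = {"A": 0, "B": 0, "tie": 0}
    for _ in range(impressions):
        # Assumed layout: each consecutive pair of slots holds one result
        # from each ranker, ordered by a coin flip.
        teams = []
        for _ in range(list_len // 2):
            pair = ["A", "B"]
            rng.shuffle(pair)
            teams.extend(pair)
        # Random clicker: every slot is clicked with the same probability,
        # regardless of which ranker supplied it.
        clicked = [team for team in teams if rng.random() < click_prob]
        clicks_a = clicked.count("A")
        clicks_b = clicked.count("B")
        if clicks_a > clicks_b:
            votes["A"] += 1
        elif clicks_b > clicks_a:
            votes["B"] += 1
        else:
            votes["tie"] += 1  # includes impressions with no clicks at all
    return {side: count / impressions for side, count in votes.items()}


shares = simulate_random_clicks(20000)
print(shares)
```

Because the clicker ignores which ranker supplied a result, the shares for A and B should stay close to each other as the number of impressions grows, which is the behavior a control run is meant to confirm.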

Page 12:

A Lot of Random Clicks

[Plot: Control Using Automated Random Clicks. x-axis: Clicks (0 to 30000); y-axis: % of Votes Received (0 to 1.2); series: ac11, a86f, ties.]

Page 13:

Experiments I did

• Automated random clicks

• Automated clicks according to relevance judgments

• Clicks from real people

Page 14:

O12 vs. O14

[Plot: Automated Clicks Using Relevance Judgments. x-axis: Clicks (0 to 4000); y-axis: % of Votes Received (-0.1 to 0.8); series: Acing05 Degraded, Acing05, Ties.]

Page 15:

Experiments I did

• Automated random clicks

• Automated clicks according to relevance judgments

• Clicks from real people

Page 16:

O12 vs. O14

[Plot: O12 vs. O14 in BSG ALL. x-axis: Clicks (0 to 90); y-axis: % of Votes Received (-0.2 to 1.2); series: O12, O14, Tie.]

Page 17:

Method of Analysis (election)

• Vote by query, by user, by session etc.

• query = person, user = state
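The election analogy can be made concrete with a small tally routine. This is a minimal sketch under assumptions: the click log format (dicts with the hypothetical keys `user`, `session`, `query`, and `credit`) is invented for illustration, and each group simply votes for the ranker that received the majority of its credited clicks.

```python
from collections import Counter, defaultdict


def tally(clicks, level):
    """Count one vote per group, where a group is a query, session, or user.

    `level` names the (assumed) log field to group by. Within a group, the
    ranker credited with more clicks wins that group's single vote.
    """
    per_group = defaultdict(Counter)
    for click in clicks:
        per_group[click[level]][click["credit"]] += 1

    votes = Counter()
    for counts in per_group.values():
        a, b = counts["A"], counts["B"]
        votes["A" if a > b else "B" if b > a else "tie"] += 1
    return votes


# Hypothetical log: each entry is one click credited to ranker A or B.
clicks = [
    {"user": "u1", "session": "s1", "query": "q1", "credit": "A"},
    {"user": "u1", "session": "s1", "query": "q1", "credit": "B"},
    {"user": "u1", "session": "s1", "query": "q2", "credit": "B"},
    {"user": "u1", "session": "s2", "query": "q3", "credit": "B"},
    {"user": "u2", "session": "s3", "query": "q4", "credit": "A"},
]

for level in ("query", "session", "user"):
    print(level, dict(tally(clicks, level)))
```

Grouping at a coarser level keeps one very active user or one long session from dominating the outcome, at the cost of having fewer votes to count.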

Page 18:

Summary of Results

Method of Voting: O12 vs. O14

by queries (direct election): 12 vs. 24

by users (1 vote per state): 4 vs. 9

by sessions (~electoral votes): 5 vs. 11

• The voting system does not seem to matter much, but there are too few clicks (85) to draw a significant conclusion
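The slides do not say how significance was judged. One common choice for win/loss tallies like these is a two-sided sign test (an exact binomial test against a 50/50 split); the sketch below assumes that test, drops ties, and treats votes as independent, none of which is confirmed by the source.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact binomial p-value for wins_a vs. wins_b under p = 0.5."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double it for the two-sided test; cap at 1 for near-even splits.
    return min(1.0, 2 * tail)


# Tallies taken from the summary above (O12 vs. O14).
for label, o12, o14 in [("queries", 12, 24), ("users", 4, 9), ("sessions", 5, 11)]:
    print(label, round(sign_test_p_value(o12, o14), 3))
```

Under these assumptions, none of the three tallies reaches the conventional 0.05 threshold, which is consistent with the caution about drawing a conclusion from so few clicks.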

Page 19:

What Logically Follows

• Google Mini vs. O14 (after fixing Google Mini)

• FAST vs. O14 (after fixing RSS in fssearchoffice)

• I’d love to see the results

Page 20:

What can interleaving do?

• Give relevance team more confidence

• Use interleaving for displaying results

• Use interleaving to automatically tune the search engine

[Side label running along the list: Ambition]

Page 21:

Add Confidence

• In addition to very traditional measures like NDCG, precision, and recall, it is nice to have another independent metric.

• Automatic

– Does not require human judgments

• Scalable

– Small impact to UX

Page 22:

What can interleaving do?

• Give relevance team more confidence

• Use interleaving for displaying results

• Use interleaving to automatically tune the search engine

[Side label running along the list: Ambition]

Page 23:

Display

Page 24:

Display

Page 25:

What can we do?

• Give relevance team more confidence

• Use interleaving for displaying results

• Use interleaving to automatically tune the search engine

[Side label running along the list: Ambition]

Page 26:

Automatic Tuning

• Many relevance models, each good for a particular type of corpus (specs, user data, academic articles, product catalog, websites)

• Use interleaving in 10% of searches

• Use user click data to:

– Automatically and dynamically decide on the best model, or tweak model parameters
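One way the tuning loop could look is sketched below. Everything here is hypothetical: the class name `ModelSelector`, the promotion rule (switch once a challenger leads the current model after `min_votes` head-to-head votes), and the model names are assumptions made for illustration; only the 10% sampling rate comes from the slide.

```python
import random
from collections import Counter


class ModelSelector:
    """Hypothetical sketch: interleave the current model with a challenger
    on a fraction of searches, and switch when the challenger leads."""

    def __init__(self, models, sample_rate=0.10, min_votes=100, seed=None):
        self.models = list(models)
        self.current = self.models[0]
        self.sample_rate = sample_rate
        self.min_votes = min_votes
        self.rng = random.Random(seed)
        self.wins = Counter()  # (winner, loser) -> number of votes

    def plan_search(self):
        """Return (current, challenger); challenger is None for a normal search."""
        others = [m for m in self.models if m != self.current]
        if others and self.rng.random() < self.sample_rate:
            return self.current, self.rng.choice(others)
        return self.current, None

    def record_vote(self, winner, loser):
        """Store one interleaving vote and promote a leading challenger."""
        self.wins[(winner, loser)] += 1
        if loser != self.current:
            return  # the current model won, nothing to change
        ahead = self.wins[(winner, loser)]
        behind = self.wins[(loser, winner)]
        # Assumed rule: require a minimum number of head-to-head votes.
        if ahead + behind >= self.min_votes and ahead > behind:
            self.current = winner


selector = ModelSelector(["specs", "websites"], min_votes=10, seed=1)
for _ in range(6):
    selector.record_vote("websites", "specs")
for _ in range(3):
    selector.record_vote("specs", "websites")
selector.record_vote("websites", "specs")  # tenth vote, 7 to 3
print(selector.current)
```

A production version would likely replace the simple lead check with a significance test, since small vote counts can easily favor the wrong model by chance.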

Page 27:

Thank you!

• Dmitriy, Eugene, Puneet

• Jamie, Jessica, Ping, Victor, Relevance Team

• Russ, Jon

• Search Team

• Hope to see you again in the future!

Page 28:

Extra Slides

Page 29:

Automatic Tuning – Pairwise?

• Pairwise comparisons scale poorly

• But there seems to be “strong stochastic transitivity”

– Given locations A, B, C

– If A > B > C then ΔAC > Max(ΔAB, ΔBC)
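The transitivity condition can be checked mechanically once pairwise margins are measured. In this sketch, Δ is assumed to be the preference margin between two options (for example, the share of votes for the first minus 0.5), and "A > B" is read as ΔAB > 0; the slide does not define either, so both are assumptions. The dictionary values are made-up numbers.

```python
from itertools import permutations


def margin(delta, x, y):
    """Margin of x over y, using the antisymmetry delta[y, x] == -delta[x, y]."""
    if (x, y) in delta:
        return delta[(x, y)]
    return -delta[(y, x)]


def sst_violations(delta, items):
    """Return triples (a, b, c) with a > b > c where the condition
    delta_ac > max(delta_ab, delta_bc) does not hold."""
    bad = []
    for a, b, c in permutations(items, 3):
        d_ab, d_bc = margin(delta, a, b), margin(delta, b, c)
        if d_ab > 0 and d_bc > 0:
            if not margin(delta, a, c) > max(d_ab, d_bc):
                bad.append((a, b, c))
    return bad


# Made-up margins for three options.
delta = {("A", "B"): 0.10, ("B", "C"): 0.15, ("A", "C"): 0.20}
print(sst_violations(delta, ["A", "B", "C"]))
```

If the property holds, the best option can be found without measuring every pair, which is what would make tuning over many models feasible.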

Page 30:

How to Interleave

• Balanced

• Team Draft

Page 31:

Balanced Interleaving
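The slide's figure is not available in this transcript. As a stand-in, here is a minimal sketch of the balanced interleaving merge as it is usually described in the literature: two pointers walk down the rankings, the ranking that is behind contributes next, and a coin flip decides who goes first. The function name and parameters are illustrative, and the click-scoring step is not covered.

```python
import random


def balanced_interleave(a, b, a_first=None, rng=random):
    """Merge rankings a and b so that every prefix of the result draws
    about equally from the top of both rankings."""
    if a_first is None:
        a_first = rng.random() < 0.5  # coin flip for who leads
    merged = []
    ka = kb = 0  # next position to consider in a and in b
    while ka < len(a) and kb < len(b):
        if ka < kb or (ka == kb and a_first):
            if a[ka] not in merged:  # skip results already shown
                merged.append(a[ka])
            ka += 1
        else:
            if b[kb] not in merged:
                merged.append(b[kb])
            kb += 1
    return merged


print(balanced_interleave(["a", "b", "c"], ["b", "d", "a"], a_first=True))
print(balanced_interleave(["a", "b", "c"], ["b", "d", "a"], a_first=False))
```

The loop stops as soon as either ranking is exhausted, so the merged list can be shorter than the union of the two inputs.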

Page 32:

Team Draft

[Figure: draft-pick lists illustrating team-draft interleaving. Names shown: LeBron James, Kobe Bryant, Tim Duncan, John Smith, labeled as 1st, 2nd, and 3rd picks.]
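The draft analogy maps onto code fairly directly. The following is a minimal sketch of team-draft interleaving as it is usually described: the ranker with fewer picks chooses next (coin flip when even), takes its highest-ranked result not yet shown, and later gets credit for clicks on its own picks. Function names and the "A"/"B" labels are illustrative assumptions.

```python
import random


def team_draft_interleave(a, b, rng=random):
    """Return (merged list, mapping from result to the team that picked it)."""
    merged, team_of = [], {}
    picks = {"A": 0, "B": 0}

    def next_unpicked(ranking):
        return next((r for r in ranking if r not in team_of), None)

    while True:
        pick_a, pick_b = next_unpicked(a), next_unpicked(b)
        if pick_a is None or pick_b is None:
            break  # stop once either ranking has nothing new to offer
        a_turn = picks["A"] < picks["B"] or (
            picks["A"] == picks["B"] and rng.random() < 0.5
        )
        team, pick = ("A", pick_a) if a_turn else ("B", pick_b)
        merged.append(pick)
        team_of[pick] = team
        picks[team] += 1
    return merged, team_of


def credit_clicks(clicked, team_of):
    """Vote for the team whose picks received more clicks."""
    a = sum(1 for r in clicked if team_of.get(r) == "A")
    b = sum(1 for r in clicked if team_of.get(r) == "B")
    return "A" if a > b else "B" if b > a else "tie"


merged, team_of = team_draft_interleave(
    ["A1", "A2", "A3"], ["B1", "B2", "B3"], rng=random.Random(0)
)
print(merged)
print(credit_clicks(["A1", "A2"], team_of))
```

Because each team alternates picks, the two rankers hold nearly the same number of slots at every depth, which limits position bias when clicks are credited.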

Page 33:

A/B Testing By Interleaving

[Figure: two result lists, A1 to A3 (all marked Relevant) and B1 to B3 (all marked Useless), merged into a single interleaved list.]

