
In Search of an Unbiased Web Ranking - Stanford...

Date post: 09-Sep-2020
Transcript
Page 1: In Search of an Unbiased Web Ranking - Stanford Universityi.stanford.edu/infoseminar/archive/WinterY2004/cho...UCLA Search Engine Success: Flip Side “If you are not indexed by Google,

UCLA

Search Engines Considered Harmful
In Search of an Unbiased Web Ranking

Junghoo “John” Cho

UCLA

Search Engines Considered Harmful Junghoo “John” Cho 1/45


World-Wide Web

[Figure: 10 years ago vs. with Web]


Information Overload

Too much information, too much junk

Too little time


Search Engines: The Savior


Search Engine Success: Flip Side

“If you are not indexed by Google, you do not exist on the Web”

– News.com article, 10/23/2002

Only a few major players
  75% market share by Google alone

People “discover” pages through search engines
  Top results: many users
  Bottom results: no new users

Big question: Are we biased by search engines?


PageRank: “Secret Ranking Recipe”

Intuition: You are “important” if many other pages link to you

[Figure: a high-PageRank page vs. a low-PageRank page]

Popular pages are returned at the top
  More details later...

“Rich-get-richer” problem?


Outline

Web popularity-evolution experiment
  Is “rich-get-richer” happening?

Impact of search engines
  How much bias do search engines introduce?

New ranking metric
  Can we avoid search-engine bias?


Web Evolution Experiment

Collect Web history data
  Is “rich-get-richer” happening?

From Oct. 2002 until Oct. 2003
  154 sites monitored
  Top sites from each category of Open Directory

Pages downloaded every week
  All pages in each site
  On average, a total of 4M pages every week (65GB)


“Rich-Get-Richer” Problem

Construct weekly Web-link graph
  From the downloaded data

Partition pages into 10 groups
  Based on initial link popularity
  Top 10% group, 10%-20% group, etc.

How many new links to each group after a month?

Rich-get-richer → More new links to top groups
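The grouping step can be sketched in a few lines of Python. This is only an illustration of the idea on this slide, not the code used in the experiment: the function name `new_links_by_group`, the dictionary inputs, and the toy in-link counts are all hypothetical.

```python
from collections import Counter

def new_links_by_group(initial_inlinks, later_inlinks, num_groups=10):
    """Share of new in-links received by each popularity group.

    Group 0 holds the most popular pages (the top 10% when num_groups=10).
    """
    # Rank pages from most to least popular by their initial in-link count.
    ranked = sorted(initial_inlinks, key=initial_inlinks.get, reverse=True)
    gained = Counter()
    for rank, page in enumerate(ranked):
        group = rank * num_groups // len(ranked)
        # Count only increases; a page that lost links contributes zero.
        gained[group] += max(0, later_inlinks.get(page, 0) - initial_inlinks[page])
    total = sum(gained.values())
    return [gained[g] / total if total else 0.0 for g in range(num_groups)]

# Hypothetical toy snapshots (not data from the talk): 10 pages, one per group.
before = {f"page{i}": 100 - 10 * i for i in range(10)}
after = dict(before, page0=130, page1=110, page5=51)
shares = new_links_by_group(before, after)
print([round(s, 3) for s in shares])
```

If rich-get-richer holds, the first entries of `shares` dominate the list.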


Result: Simple Link Count

[Figure: increase in number of in-links (y-axis, up to 4 × 10^6) vs. popularity (x-axis, 20 to 100)]

After 7 months
  70% of new links to top 20% pages
  No new links to bottom 60% pages


Result: PageRank

[Figure: increase in PageRank (y-axis, −0.0010 to 0.0020) vs. popularity (x-axis, 20 to 100)]

After 7 months
  Decrease in PageRank for bottom 50% pages
  Due to normalization of PageRank


Outline

Web popularity-evolution experiment
  “Rich-get-richer” is indeed happening
  Unpopular pages get no attention

Impact of search engines
  How much bias do search engines introduce?

New ranking metric
  Page quality


Search Engine Impact

How much bias do search engines introduce?

What do we mean by bias?

What is the ideal ranking?
How do search engines rank pages?


What is the Ideal Ranking?

What do we mean by page quality?

Very subjective notion

Different quality judgment on the same page

Can there be an “objective” definition?


Page Quality Q(p)

Definition

The probability that an average Web user will like page p enough to create a link to it if he looks at it

Idea: More people will like a higher quality page
  Democratic measure of quality

p1: 10,000 people, 8,000 liked it, Q(p1) = 0.8
p2: 10,000 people, 2,000 liked it, Q(p2) = 0.2
→ Q(p1) > Q(p2)


Page Quality Q(p) Cont.

In principle, we can measure Q(p) by

1. showing p to all Web users and
2. counting how many people like it

When consensus is hard to reach, pick the one that more people like


PageRank: Intuition

A page is “important” if many pages link to it

Not every link is equal

A link from an “important” page matters more than others
  e.g. Link from Yahoo vs Link from a random homepage


PageRank: Detail

PageRank of $p_i$, $PR(p_i)$:

$$PR(p_i) = \frac{PR(p_1)}{c_1} + \cdots + \frac{PR(p_m)}{c_m} \quad \dagger$$

$p_1, \ldots, p_m$: pages with links to $p_i$

$c_j$: number of outgoing links from $p_j$

Links from high PageRank pages have high “weights”

†“Damping factor” is ignored for simplicity
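A minimal sketch of how this recurrence can be iterated to a fixed point, in Python. As on the slide, there is no damping factor, so the sketch assumes every page has at least one outgoing link and that the graph is strongly connected and aperiodic. The three-page graph and the function name `simplified_pagerank` are hypothetical.

```python
def simplified_pagerank(out_links, iterations=200):
    """Iterate PR(p_i) = sum over pages p_j linking to p_i of PR(p_j) / c_j."""
    pages = list(out_links)
    pr = {p: 1.0 / len(pages) for p in pages}  # start from a uniform guess
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for pj, targets in out_links.items():
            share = pr[pj] / len(targets)  # PR(p_j) / c_j
            for pi in targets:
                nxt[pi] += share
        pr = nxt
    return pr

# Hypothetical toy graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(graph)
print({p: round(v, 3) for p, v in ranks.items()})
```

On this toy graph the values settle near 0.4, 0.2, and 0.4 for A, B, and C: B receives only half of A's rank, while C receives the other half plus all of B's.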


PageRank: Random-Surfer Model

Random-Surfer Model

When users follow links randomly, PR(pi) is the probability to reach pi

[Figure: pages p1, p2, and p3 each link to pi]

PR(p1): probability to be at p1

Q: Probability to go from p1 to pi?

A: PR(p1)/3

Q: Probability to be at pi, PR(pi)?

A: PR(p1)/3 + PR(p2) + PR(p3)/2
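The random-surfer reading can be checked with a small simulation: follow random links for many steps and record how often each page is visited. This is a sketch under stated assumptions, not part of the talk. The three-page graph is hypothetical, every page is assumed to have at least one outgoing link, and `surf` is an illustrative name.

```python
import random

def surf(out_links, steps, seed=0):
    """Follow random links for `steps` clicks and return visit frequencies."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    page = next(iter(out_links))  # start at an arbitrary page
    visits = {p: 0 for p in out_links}
    for _ in range(steps):
        page = rng.choice(out_links[page])  # each outgoing link is equally likely
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

# Hypothetical toy graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
freq = surf(graph, 200_000)
print({p: round(f, 2) for p, f in freq.items()})
```

For this graph the recurrence PR(pi) = Σ PR(pj)/cj has the solution 0.4, 0.2, 0.4, and the visit frequencies should land close to those values.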


Page Quality vs PageRank

High PageRank → The page is currently “popular”

PageRank ≈ Page quality if everyone is given equal chance

Before Google, PageRank may have been fair

What about now?
  High PageRank → High Quality?
  Low PageRank → Low Quality?

PageRank is biased against new pages

How to measure the PageRank bias?


Measuring Search-Engine Bias

Ideal experiment: Divide the world into two groups
  The users who do not use search engines
  The users who use search engines very heavily

Compare popularity evolution

Problem: Difficult to conduct in practice


Theoretical Web-User Models

Let us do theoretical experiments!

Random-surfer model
  Users follow links randomly
  Never use search engines

Search-dominant model
  Users always start with a search engine
  Only visit pages returned by the search engine

→ Compare popularity evolution


Basic Definitions for the Models

(Simple) Popularity P(p, t)
  Fraction of Web users that like p at time t
  E.g., 100,000 users, 10,000 like p, P(p, t) = 0.1

Visit Popularity V(p, t)
  Number of users that visit p in a unit time

Awareness A(p, t)
  Fraction of Web users who are aware of p
  E.g., 100,000 users, 30,000 aware of p, A(p, t) = 0.3

P(p, t) = Q(p) · A(p, t)
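The relation P = Q · A can be worked through with the numbers from the two examples. Reading both examples as describing the same page is an assumption made here for illustration. Under it, the implied quality is Q = P / A = 0.1 / 0.3 ≈ 0.33. The function name `popularity` is hypothetical.

```python
def popularity(quality, awareness):
    """P(p, t) = Q(p) * A(p, t): users who like p = users aware of p who like it."""
    return quality * awareness

users = 100_000
aware = 30_000   # users who are aware of p
likes = 10_000   # users who like p

A = aware / users  # awareness A(p, t) = 0.3
P = likes / users  # popularity P(p, t) = 0.1
Q = P / A          # implied quality, about one third

print(round(Q, 3), round(popularity(Q, A), 3))
```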


Random-Surfer Model

Popularity-Equivalence Hypothesis

V(p, t) = r · P(p, t) (or V(p, t) ∝ P(p, t))

PageRank is visit probability under random-surfer model
  Higher popularity → More visitors

Random-Visit Hypothesis

A visit is done by any user with equal probability


Random-Surfer Model: Analysis

Current popularity P(p, t)

→ Number of visitors from V(p, t) = r · P(p, t)

→ Awareness increase ∆A(p, t)

→ Popularity increase ∆P(p, t)

→ New popularity P(p, t + 1)

Formal Analysis: Differential Equation

$$P(p,t) = \left[1 - e^{-\frac{r}{n}\int_0^t P(p,t)\,dt}\right] Q(p)$$
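The feedback loop above (popularity → visitors → awareness → popularity) can be stepped forward numerically. The sketch below is one reading of the two hypotheses, not the talk's derivation. It assumes that in a short interval dt the page gets r · P · dt visits, each by a random user, so awareness grows by (r/n) · P · (1 − A) · dt. The function name, the Euler step size, and the use of the parameter values Q(p) = 1, P(p, 0) = 10^−8, r/n = 1 are choices made for illustration.

```python
def simulate_random_surfer(quality, p0, r_over_n, t_end, dt=0.01):
    """Euler-step the awareness/popularity loop; returns a list of (time, P)."""
    awareness = p0 / quality  # from P = Q * A
    history = [(0.0, p0)]
    steps = round(t_end / dt)
    for k in range(steps):
        pop = quality * awareness
        # Visits arrive at rate r * P; a fraction (1 - A) of them reach
        # users who were not yet aware of the page.
        awareness = min(1.0, awareness + r_over_n * pop * (1.0 - awareness) * dt)
        history.append(((k + 1) * dt, quality * awareness))
    return history

history = simulate_random_surfer(quality=1.0, p0=1e-8, r_over_n=1.0, t_end=35.0)
for t, p in history[::500]:
    print(f"t={t:5.1f}  P={p:.6f}")
```

The printed values stay near zero for a long stretch, rise quickly, and then flatten near Q(p), which is the S-shaped growth the formal analysis describes.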


Random-Surfer Model: Result

Theorem

The popularity of page p evolves over time through the following formula:

$$P(p,t) = \frac{Q(p)}{1 + \left[\frac{Q(p)}{P(p,0)} - 1\right] e^{-\left[\frac{r}{n} Q(p)\right] t}}$$

Q(p): quality of p
P(p, 0): initial popularity of p at time zero
n: total number of Web users
r: normalization constant in V(p, t) = r · P(p, t)
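The closed form is easy to evaluate directly. The sketch below implements the theorem's formula as written. The helper `half_quality_time` is an addition for illustration: it solves the formula for the time at which P reaches Q/2. The function names and the sample parameters are assumptions.

```python
import math

def popularity_at(t, quality, p0, r_over_n):
    """P(p, t) = Q / (1 + (Q / P0 - 1) * exp(-(r/n) * Q * t))."""
    decay = math.exp(-r_over_n * quality * t)
    return quality / (1.0 + (quality / p0 - 1.0) * decay)

def half_quality_time(quality, p0, r_over_n):
    """Time at which P reaches Q / 2, i.e. where (Q / P0 - 1) * decay equals 1."""
    return math.log(quality / p0 - 1.0) / (r_over_n * quality)

Q, P0, R_OVER_N = 1.0, 1e-8, 1.0  # sample parameters
t_half = half_quality_time(Q, P0, R_OVER_N)
print(round(t_half, 2), popularity_at(t_half, Q, P0, R_OVER_N))
```

With these parameters the midpoint falls at roughly t ≈ 18.4: the page spends a long time near zero before the rapid rise.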


Random-Surfer Model: Popularity Graph

[Figure: popularity (y-axis, 0 to 1.0) vs. time (x-axis, 0 to 35), with three stages labeled Infant, Expansion, and Maturity]

$Q(p) = 1$, $P(p, 0) = 10^{-8}$, $\frac{r}{n} = 1$


Comparison with Google Evolution

[Figure: audience reach (y-axis, 0.05 to 0.25) vs. time (x-axis, Jan98 to Jan03)]

Data from Nielsen//NetRatings

$Q(p) = 0.3$, $P(p, 0) = 5 \times 10^{-6}$, $\frac{r}{n} = 8$


Search-Dominant Model

V(p, t) ∼ P(p, t)?

For ith result, how many clicks?

For PageRank P(p, t), what ranking?

Empirical measurement by Lempel et al. and us

New Visit-Popularity Hypothesis

$$V(p,t) = r \cdot P(p,t)^{9/4}$$

Random-Visit Hypothesis

A visit is done by any user with equal probability
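To get a feel for what the exponent 9/4 does, the sketch below compares how long a page needs to reach a target popularity under the two visit hypotheses. It assumes the growth law dP/dt = (r/n) · P^a · (Q − P), with a = 1 for the random-surfer model and a = 9/4 for the search-dominant model. That law is an assumed reading of the two hypotheses, and the parameters are hypothetical ones chosen to keep the numbers small, not the ones used in the talk.

```python
import math

def time_to_reach(target, quality, p0, r_over_n, exponent, steps=20_000):
    """Time for P to grow from p0 to target when
    dP/dt = (r/n) * P**exponent * (Q - P).

    Integrates dt = dP / ((r/n) * P**exponent * (Q - P)) with the midpoint
    rule in u = ln P, which copes with P spanning many orders of magnitude.
    """
    lo, hi = math.log(p0), math.log(target)
    du = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        p = math.exp(lo + (k + 0.5) * du)
        # dP = P du, so the integrand in u is P**(1 - exponent) / ((r/n) * (Q - P)).
        total += p ** (1.0 - exponent) / (r_over_n * (quality - p)) * du
    return total

# Hypothetical parameters: Q = 1, P(p, 0) = 1e-4, r/n = 1, target P = 0.5.
t_random = time_to_reach(0.5, 1.0, 1e-4, 1.0, exponent=1.0)
t_search = time_to_reach(0.5, 1.0, 1e-4, 1.0, exponent=9 / 4)
print(round(t_random, 2), round(t_search))
```

With a = 9/4, a page with tiny popularity receives almost no visits, so nearly all of the waiting time is spent near P(p, 0). The smaller the initial popularity, the wider the gap between the two models.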


Search-Dominant Model: Result

[Figure: popularity (y-axis, 0 to 1.0) vs. time (x-axis, 1500 to 1650)]

$$\sum_{i=1}^{\infty} \frac{[P(p,t)]^{\,i-\frac{9}{4}} - [P(p,0)]^{\,i-\frac{9}{4}}}{\left(i-\frac{9}{4}\right) Q(p)^{i}} = \frac{r}{n}\,t$$

(same parameters as before)


Comparison of Two Models

Time to final popularity
  Random surfer: 25 time units
  Search dominant: 1650 time units
  → 66 times increase!

Expansion stage
  Random surfer: 12 time units
  Search dominant: non-existent

[Figure: random-surfer model, popularity vs. time (0 to 35), with Infant, Expansion, and Maturity stages]
[Figure: search-dominant model, popularity vs. time (1500 to 1650)]


Outline

Web popularity-evolution experiment
  Is “rich-get-richer” happening?

Impact of search engines
  Random-surfer model
  Search-dominant model

New ranking metric
  How to measure page quality?


Measuring Quality: Basic Idea

Quality: probability of link creation by a new visitor

Assuming the same number of visitors
  Q(p) ∝ Number of new links (or popularity increase)

Quality Estimator

$$Q(p) = C \cdot \frac{\Delta P(p)}{P(p)} + P(p)$$


Measuring Quality: Problem 1

Different number of visitors to each page
  More visitors to more popular page

How to account for number of visitors?

Idea: PageRank = visit probability

Quality Estimator

$$Q(p) = C \cdot \frac{\Delta P(p)}{P(p)} + P(p)$$


Measuring Quality: Problem 2

No more new links to very popular pages
  Everyone already knows them
  ∆P(p)/P(p) ≈ 0 for well-known pages

How to account for well-known pages?

Idea: P(p) = Q(p) when everyone knows p
  Use P(p) to measure Q(p) for well-known pages

Quality Estimator

$$Q(p) = C \cdot \frac{\Delta P(p)}{P(p)} + P(p)$$

C: relative weight given to popularity increase
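A minimal sketch of the estimator applied to two popularity snapshots of the same page. The slide does not say which snapshot supplies the P(p) in the denominator. Using the later one is an assumption here, as are the function name `estimate_quality`, the value of C, and the sample popularity values.

```python
def estimate_quality(p_prev, p_curr, c):
    """Q(p) ~= C * dP(p) / P(p) + P(p), from two popularity snapshots.

    Assumption: P(p) is taken from the later snapshot.
    """
    if p_curr <= 0:
        raise ValueError("current popularity must be positive")
    delta = p_curr - p_prev  # popularity increase between the snapshots
    return c * delta / p_curr + p_curr

# Hypothetical values, not measurements from the talk.
rising = estimate_quality(0.010, 0.015, c=0.5)   # young page, still growing
settled = estimate_quality(0.300, 0.300, c=0.5)  # well-known page, no growth
print(round(rising, 4), round(settled, 4))
```

For the well-known page the growth term vanishes and the estimate falls back to its popularity. For the growing page the first term lifts the estimate well above its current popularity.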


Measuring Quality: Theoretical Proof

Theorem

Under the random-surfer model, the quality of page p, Q(p), always satisfies the following equation:

$$Q(p) = \left(\frac{n}{r}\right) \left(\frac{dP(p,t)/dt}{P(p,t)}\right) + P(p,t)$$

Compare it with $Q(p) = C \cdot \frac{\Delta P(p)}{P(p)} + P(p)$
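The identity can be checked numerically against the random-surfer popularity formula P(p, t) = Q / (1 + (Q/P(p, 0) − 1) e^{−(r/n) Q t}). The sketch below approximates dP/dt with a central difference and plugs it into the theorem. The function names, the step size h, and the reuse of the parameter values Q(p) = 0.3, P(p, 0) = 5 × 10^−6, r/n = 8 are choices made for illustration.

```python
import math

def popularity_at(t, quality, p0, r_over_n):
    """Random-surfer popularity: Q / (1 + (Q / P0 - 1) * exp(-(r/n) * Q * t))."""
    decay = math.exp(-r_over_n * quality * t)
    return quality / (1.0 + (quality / p0 - 1.0) * decay)

def quality_from_growth(t, quality, p0, r_over_n, h=1e-4):
    """Evaluate (n/r) * (dP/dt) / P + P, with dP/dt from a central difference."""
    p = popularity_at(t, quality, p0, r_over_n)
    dp_dt = (popularity_at(t + h, quality, p0, r_over_n)
             - popularity_at(t - h, quality, p0, r_over_n)) / (2.0 * h)
    return (1.0 / r_over_n) * dp_dt / p + p

Q, P0, R_OVER_N = 0.3, 5e-6, 8.0
recovered = [quality_from_growth(t, Q, P0, R_OVER_N) for t in (0.5, 2.0, 5.0, 8.0)]
print([round(q, 6) for q in recovered])
```

Each entry should come back as roughly 0.3, whether the page is still obscure (small t) or close to saturation (large t), which is what makes the growth-rate term useful for young pages.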


Is Page Quality Effective?

How to measure its effectiveness?
  Implement it in a major search engine?
  Any other alternatives?

Idea: Pages eventually obtain deserved popularity (however long it may take...)

“Future” PageRank ≈ Q(p)

Page Quality: Evaluation (1)

Q(p) as a predictor of future PageRank
  Compare the correlations of
    “current” Q(p) with “future” PageRank
    “current” PageRank with “future” PageRank
  → Q(p) predicts “future” PageRank better?

Download the Web multiple times with long intervals

[Figure: timeline of four snapshots t1, t2, t3, t4, with interval labels “1 month” and “4 months”; Q(p, t3) and PR(p, t3) are each compared against PR(p, t4)]
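
The comparison of correlations can be sketched as follows. This is a minimal illustration, assuming Pearson correlation as the measure (the slide does not say which one is used). The function name and all numbers are made-up toy data, not results from the talk.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values for four hypothetical pages (not data from the talk).
pr_t3 = [0.10, 0.20, 0.30, 0.40]   # "current" PageRank at t3
q_t3 = [0.35, 0.22, 0.45, 0.41]    # "current" quality estimate at t3
pr_t4 = [0.30, 0.21, 0.42, 0.40]   # "future" PageRank at t4

corr_q = pearson(q_t3, pr_t4)
corr_pr = pearson(pr_t3, pr_t4)
print(f"corr(Q(t3), PR(t4))  = {corr_q:.3f}")
print(f"corr(PR(t3), PR(t4)) = {corr_pr:.3f}")
```

With real crawl data, the two lists at t3 would come from the earlier snapshots and `pr_t4` from the later one; a higher first correlation would support the claim that Q(p) predicts future PageRank better.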

Page Quality: Evaluation (2)

Compare the average relative error

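$$
\mathrm{err}(p) =
\begin{cases}
\left| \dfrac{PR(p, t_4) - Q(p, t_3)}{PR(p, t_4)} \right| & \text{for } Q(p, t_3) \\[1ex]
\left| \dfrac{PR(p, t_4) - PR(p, t_3)}{PR(p, t_4)} \right| & \text{for } PR(p, t_3)
\end{cases}
$$

Result ∗
  For Q(p, t3): average err = 0.32
  For PR(p, t3): average err = 0.78
  Q(p, t3) twice as accurate.

∗ For the pages whose PageRank consistently increased/decreased from t1 through t3.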
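
A minimal sketch of the error computation. The helper names and the three-page toy dataset are hypothetical and chosen only to show the mechanics; they do not reproduce the 0.32 and 0.78 figures reported on the slide.

```python
def relative_error(future_pr: float, estimate: float) -> float:
    """err(p) = |PR(p, t4) - estimate| / PR(p, t4)."""
    if future_pr <= 0:
        raise ValueError("future PageRank must be positive")
    return abs(future_pr - estimate) / future_pr


def average_relative_error(future: list[float], estimates: list[float]) -> float:
    """Mean of err(p) over all pages."""
    if len(future) != len(estimates) or not future:
        raise ValueError("need two non-empty lists of equal length")
    errors = [relative_error(f, e) for f, e in zip(future, estimates)]
    return sum(errors) / len(errors)


# Toy values for three hypothetical pages (not data from the talk).
pr_t4 = [0.40, 0.20, 0.10]   # "future" PageRank at t4
q_t3 = [0.36, 0.22, 0.09]    # quality estimate at t3
pr_t3 = [0.20, 0.30, 0.05]   # PageRank at t3

print("avg err using Q(p, t3): ", average_relative_error(pr_t4, q_t3))
print("avg err using PR(p, t3):", average_relative_error(pr_t4, pr_t3))
```

The same `average_relative_error` call is used for both predictors, so the only thing that differs between the two reported numbers is which t3 value is treated as the estimate of PR(p, t4).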

Quality Evaluation: More Detail

[Figure: distribution of err(p) for Q(p) and PR(p); axes labeled “err(p)” and “Fraction of pages”]

Summary

Web popularity-evolution experiment
  “Rich-get-richer” is indeed happening

Impact of search engines
  Random-surfer model
  Search-dominant model
  → Search engines have worrisome impact

New ranking metric
  Page quality: Based on popularity evolution
  Identify high-quality pages early on

Thank You

For more details, see

A. Ntoulas, J. Cho and C. Olston. What’s New on the Web? In WWW Conference, 2004.

J. Cho and S. Roy. Impact of Web Search Engines on Page Popularity. In WWW Conference, 2004.

J. Cho and R. Adams. Page Quality: In Search of an Unbiased Web Ranking. UCLA CS Department, Nov. 2003.

Any Questions?

Popularity Increase: Relative Link Count

[Figure: relative increase in number of in-links (y-axis, 0.2 to 1) versus popularity (x-axis, 20 to 100)]

Popularity Increase: Relative PageRank

[Figure: relative increase in PageRank (y-axis, −0.015 to 0.005) versus popularity (x-axis, 20 to 100)]

Search-Dominant Model: Result

[Figure: popularity (y-axis, up to 1) versus time (x-axis)]
