World Wide Web
The largest and most widely known repository of hypertext
Hypertext: text + hyperlinks
Comprises billions of documents, authored by millions of diverse people
Brief (non-technical) history
Early keyword-based engines: Altavista, Excite, Infoseek, Inktomi, ca. 1995-1997
Sponsored search ranking: Goto.com (morphed into Overture.com, later acquired by Yahoo!)
Your search ranking depended on how much you paid
Auction for keywords: casino was expensive!
Brief (non-technical) history
1998+: Link-based ranking pioneered by Google
Blew away all early engines
Great user experience in search of a business model
Meanwhile, Goto/Overture’s annual revenues were nearing $1 billion
Result: Google added paid-placement “ads” to the side, independent of search results
Yahoo followed suit
[Screenshot: Google results page for the query miele — “Web Results 1-10 of about 7,310,000 for miele (0.12 seconds)”; algorithmic results (miele.com, miele.co.uk, miele.de, miele.at) on the left, “Sponsored Links” (appliance and vacuum ads) on the right]
Web search basics
[Diagram: the Web is crawled by a web spider; an indexer builds the indexes (plus separate ad indexes); search serves the user from both]
User Needs
Need [Brod02, RL04]
Informational – want to learn about something (~40% / 65%), e.g., low hemoglobin
Navigational – want to go to that page (~25% / 15%), e.g., United Airlines
Transactional – want to do something, web-mediated (~35% / 20%)
  Access a service, e.g., Seattle weather
  Downloads, e.g., Mars surface images
  Shop, e.g., Canon S410
Gray areas: find a good hub (e.g., car rental Brasil); exploratory search – “see what’s there”
How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
Users’ empirical evaluation of results
Quality of pages varies widely; relevance is not enough
Other desirable qualities (non-IR!):
  Content: trustworthy, diverse, non-duplicated, well maintained
  Web readability: display correctly & fast
Precision vs. recall: on the web, recall seldom matters
What matters: precision at 1? Precision above the fold?
Comprehensiveness: must be able to deal with obscure queries
Recall matters when the number of matches is very small
Users’ empirical evaluation of engines
Relevance and validity of results
UI: simple, no clutter, error tolerant
Trust: results are objective
Coverage of topics for polysemous queries
Pre/post-process tools provided
  Mitigate user errors (auto spell check, search assist, …)
  Explicit: search within results, more like this, refine, …
Spam
Search Engine Optimization
The trouble with sponsored search: it costs money. What’s the alternative?
Search Engine Optimization:
“Tuning” your web page to rank highly in the algorithmic search results for select keywords
Alternative to paying for placement
Thus, intrinsically a marketing function
Simplest forms
First generation engines relied heavily on tf/idf
The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s
SEOs responded with dense repetitions of chosen terms, e.g., maui resort maui resort maui resort
Often, the repetitions would be in the same color as the background of the web page
Repeated terms got indexed by crawlers, but were not visible to humans on browsers
Lesson: pure word density cannot be trusted as an IR signal
Cloaking
Serve fake content to the search engine spider
[Diagram: “Is this a search engine spider?” — if yes, serve the SPAM page; if no, serve the real document]
More spam techniques
Doorway pages: pages optimized for a single keyword that redirect to the real target page
Link spamming: mutual admiration societies, hidden links, awards – more on these later
Domain flooding: numerous domains that point or redirect to a target page
More on spam
Web search engines have policies on the SEO practices they tolerate or block
  http://help.yahoo.com/help/us/ysearch/index.html
  http://www.google.com/intl/en/webmasters/
Adversarial IR: the unending (technical) battle between SEOs and web search engines
Research: http://airweb.cse.lehigh.edu/
Crawling the Web
The Web document collection
Distributed content creation, linking, democratization of publishing
Content includes truth, lies, obsolete information, contradictions, …
Unstructured (text, HTML, …), semi-structured (XML, annotated photos), structured (databases), …
Scale much larger than previous text collections
Content can be dynamically generated
Crawling basics
The Web is a collection of billions of documents written in a way that enables them to cite each other using hyperlinks.
Basic principle of crawlers: start from a given set of URLs, progressively fetch and scan them for new URLs (outlinks), and then fetch these pages in turn, in an endless cycle.
There is no guarantee that all accessible web pages will be located in this fashion.
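A minimal sketch of this fetch-scan-enqueue cycle in Python (the regex link extraction and page limit are simplifications; a real crawler adds robots.txt handling, politeness delays, and massive parallelism):

```python
import re
from collections import deque
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

def crawl(seed_urls, max_pages=100):
    """Fetch pages breadth-first, scanning each for new outlinks."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable or ill-formed page: skip it
        fetched += 1
        for href in re.findall(r'href="([^"]+)"', html):
            link, _ = urldefrag(urljoin(url, href))  # resolve relative URLs
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
```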
Engineering Large-Scale Crawlers (1)
Performance is important! Main concerns:
DNS caching, prefetching, and resolution
  Address resolution is a significant bottleneck: a crawler may generate dozens of mapping requests per second
  Many crawlers avoid fetching too many pages from one server, which might overload it; rather, they spread their accesses over many servers at a time
  This, however, lowers the locality of access to the DNS cache
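A sketch of a caching resolver with a prefetch hook (illustrative; production crawlers use a custom asynchronous DNS client rather than the blocking socket call used here):

```python
import socket

class CachingResolver:
    """Cache hostname -> IP so repeated lookups skip the DNS round trip."""
    def __init__(self):
        self.cache = {}

    def resolve(self, host):
        if host not in self.cache:
            self.cache[host] = socket.gethostbyname(host)  # blocking lookup
        return self.cache[host]

    def prefetch(self, hosts):
        """Resolve hosts extracted from outlinks before their pages are
        fetched, overlapping lookup latency with other crawler work."""
        for host in hosts:
            try:
                self.resolve(host)
            except socket.gaierror:
                pass  # unresolvable host; the fetcher will skip it later
```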
Engineering Large-Scale Crawlers (2)
Eliminating already-visited URLs
Before adding a new URL to the work pool, we must check whether it has already been fetched at least once
How to check quickly: a hash function
  Note that the set of seen URLs usually cannot fit in main memory, and random access to disk is expensive!
  Luckily, there is some locality of access on URLs: relative URLs within sites mean that once the crawler starts exploring a site, URLs within that site are checked frequently for a while
  However, a good hash function maps domain strings uniformly over its range, destroying this locality
  To preserve locality of access, a two-level hash function is used
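A sketch of the two-level idea: hash the hostname and the path separately and concatenate, so all URLs from one host share a key prefix (the hash widths are an arbitrary illustrative choice):

```python
import hashlib
from urllib.parse import urlparse

def two_level_key(url):
    """Host hash (leading bytes) + path hash (trailing bytes).
    URLs from the same host share a prefix, so the lookups made while
    crawling one site touch a contiguous, cache-friendly key range."""
    parts = urlparse(url)
    host_hash = hashlib.md5(parts.netloc.encode()).hexdigest()[:8]
    path_hash = hashlib.md5(parts.path.encode()).hexdigest()[:8]
    return host_hash + path_hash
```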
Engineering Large-Scale Crawlers (3)
Spider traps
Commercial crawlers need to protect themselves from crashing on ill-formed HTML or misleading sites
The best policy is to keep regular statistics about the crawl: if a site starts dominating the collection, it can be added to the guard module and excluded from further crawling
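A sketch of such crawl statistics (the thresholds are illustrative assumptions): count pages per host and flag any host contributing a suspicious share of the collection.

```python
from collections import Counter

class TrapGuard:
    """Flag hosts that start dominating the crawled collection."""
    def __init__(self, max_share=0.05, min_pages=1000):
        self.per_host = Counter()   # pages fetched from each host
        self.total = 0              # pages fetched overall
        self.max_share = max_share
        self.min_pages = min_pages

    def record(self, host):
        self.per_host[host] += 1
        self.total += 1

    def is_suspect(self, host):
        n = self.per_host[host]
        return n >= self.min_pages and n / self.total > self.max_share
```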
Engineering Large-Scale Crawlers (4)
Refreshing crawled pages
A search engine’s index should be fresh! But there is no general mechanism of update notifications
General idea: depending on the bandwidth available, a round of crawling may run up to a few weeks
Can we do better? Keep statistics: a score reflecting the probability that each page has been modified
A crawler is run at a smaller scale to monitor fast-changing sites, especially those related to current news and weather
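One simple way to realize such a score (an assumption for illustration, not a method from the slides): estimate each page’s change rate from past visits and recrawl the likeliest-changed pages first.

```python
def change_score(times_changed, times_visited):
    """Fraction of past visits on which the page had changed: a crude
    estimate of the probability it has been modified since last crawl."""
    return times_changed / max(times_visited, 1)

def refresh_order(pages):
    """pages: list of (url, times_changed, times_visited) tuples.
    Returns the pages sorted so likely-modified ones are recrawled first."""
    return sorted(pages, key=lambda p: change_score(p[1], p[2]), reverse=True)
```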
Hyperlink Analysis for the Web
Information Retrieval
• Input: Document collection
• Goal: Retrieve documents or text with information content that is relevant to the user’s information need
• Two aspects:
1. Processing the collection
2. Processing queries (searching)
Classic information retrieval
• Ranking is a function of query term frequency within the document (tf) and across all documents (idf) – a minimal scorer is sketched after this list
• This works because of the following assumptions in classical IR:
– Queries are long and well specified: “What is the impact of the Falklands war on Anglo-Argentinean relations?”
– Documents (e.g., newspaper articles) are coherent, well authored, and are usually about one topic
– The vocabulary is small and relatively well understood
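A minimal sketch of such a tf-idf ranking function (the log-damped tf and raw idf weighting are one common variant, chosen here for illustration):

```python
import math
from collections import Counter

def tfidf_score(query_terms, doc_tokens, doc_freq, num_docs):
    """Score one document as the sum over query terms of tf * idf."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0 or doc_freq.get(term, 0) == 0:
            continue
        idf = math.log(num_docs / doc_freq[term])    # rare terms weigh more
        score += (1 + math.log(tf[term])) * idf      # damped term frequency
    return score
```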
Web information retrieval
• None of these assumptions hold:
– Queries are short: 2.35 terms on average
– Huge variety in documents: language, quality, duplication
– Huge vocabulary: 100s of millions of terms
– Deliberate misinformation
• Ranking is a function of the query terms and of the hyperlink structure
Connectivity-based ranking
• Hyperlink analysis
– Idea: mine the structure of the web graph
– Each web page is a node
– Each hyperlink is a directed edge
• Ranking returned documents
– Query-dependent ranking
– Query-independent ranking
Query dependent ranking
• Assigns a score that measures the quality and relevance of a selected set of pages to a given user query.
• The basic idea is to build a query-specific graph, called a neighborhood graph, and perform hyperlink analysis on it.
Building a neighborhood graph
• A start set of documents matching the query is fetched from a search engine (typically 200-1000 nodes).
• The start set is augmented by its neighborhood: the set of documents that either hyperlink to or are hyperlinked to by documents in the start set (up to 5000 nodes).
• Each document in both the start set and the neighborhood is modeled by a node. There is an edge from node A to node B if and only if document A hyperlinks to document B.
– Hyperlinks between pages on the same Web host can be omitted.
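A sketch of this construction, assuming hypothetical helpers out_links(url) and in_links(url) backed by a link index:

```python
from urllib.parse import urlparse

def build_neighborhood(start_set, out_links, in_links, max_nodes=5000):
    """start_set: URLs returned by a search engine (typically 200-1000).
    Returns the neighborhood graph as (nodes, edges)."""
    nodes = set(start_set)
    for url in start_set:                     # augment with the neighborhood
        for v in out_links(url) + in_links(url):
            if len(nodes) >= max_nodes:
                break
            nodes.add(v)
    # Edge A -> B iff A hyperlinks to B; links within one host are omitted
    edges = {(u, v) for u in nodes for v in out_links(u)
             if v in nodes and urlparse(u).netloc != urlparse(v).netloc}
    return nodes, edges
```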
Neighborhood graph
[Diagram: the query results Result1 … Resultn form the start set; the back set b1 … bm consists of pages linking to them, and the forward set f1 … fs of pages they link to]
• An edge for each hyperlink, but no edges within the same host
• Subgraph associated with each query
Hyperlink-Induced Topic Search (HITS)
• In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
– Hub pages are good lists of links on a subject, e.g., “Bob’s list of cancer-related links.”
– Authority pages occur recurrently on good hubs for the subject.
• Best suited for “broad topic” queries rather than for page-finding queries.
• Gets at a broader slice of common opinion.
Hubs and Authorities
• Thus, a good hub page for a topic points to many authoritative pages for that topic.
• A good authority page for a topic is pointed to by many good hubs for that topic.
• Circular definition - will turn this into an iterative computation.
HITS [K’98]
• Goal: Given a query find:
– Good sources of content (authorities)
– Good sources of links (hubs)
• Authority comes from in-edges. Being a good hub comes from out-edges.
• Better authority comes from in-edges from good hubs. Being a better hub comes from out-edges to good authorities.
Intuition
[Diagram: an authority page A receives in-links from pages q1, q2, …, qk; a hub page H has out-links to pages r1, r2, …, rk]
Distilling hubs and authorities
• Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
• Initialize: for all x, h(x) ← 1; a(x) ← 1
• Iteratively update all h(x), a(x)
• After iterations:
– output pages with highest h() scores as top hubs
– pages with highest a() scores as top authorities.
Iterative update
• Repeat the following updates, for all x:
h(x) ← Σ_{x→y} a(y)    (sum the authority scores of the pages x links to)

a(x) ← Σ_{y→x} h(y)    (sum the hub scores of the pages linking to x)
Scaling
• To prevent the h() and a() values from getting too big, can scale down after each iteration.
• The scaling factor doesn’t really matter: we only care about the relative values of the scores.
HITS details
How many iterations?
• Claim: relative values of scores will converge after a few iterations:
– in fact, suitably scaled, the h() and a() scores settle into a steady state!
• We only require the relative orders of the h() and a() scores, not their absolute values.
• In practice, ~5 iterations get you close to stability.
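A sketch of the whole procedure, combining the update rules above with per-iteration scaling (graph given as adjacency lists):

```python
import math

def hits(out_links, iterations=5):
    """out_links: dict node -> list of successors. Returns (hubs, authorities)."""
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    in_links = {x: [] for x in nodes}         # invert the graph once
    for y, succs in out_links.items():
        for x in succs:
            in_links[x].append(y)
    h = dict.fromkeys(nodes, 1.0)             # initialize h(x) = a(x) = 1
    a = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        a = {x: sum(h[y] for y in in_links[x]) for x in nodes}
        h = {x: sum(a[y] for y in out_links.get(x, [])) for x in nodes}
        for scores in (a, h):                 # scale down; only relative values matter
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for x in scores:
                scores[x] /= norm
    return h, a
```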
Problems with the HITS algorithm(1)
• Only a relatively small part of the Web graph is considered; adding edges to a few nodes can change the resulting hub and authority scores considerably.
– It is relatively easy to manipulate these scores.
Problems with the HITS algorithm(2)
• We often find that the neighborhood graph contains documents not relevant to the query topic. If these nodes are well connected, the topic drift problem arises:
– The most highly ranked authorities and hubs tend not to be about the original topic.
– For example, when running the algorithm on the query “jaguar and car”, the computation drifted to the general topic “car” and returned the home pages of different car manufacturers as top authorities, and lists of car manufacturers as the best hubs.
Query-independent ordering
• First generation: using link counts as simple measures of popularity.
• Two basic suggestions:
– Undirected popularity: each page gets a score = the number of in-links plus the number of out-links (e.g., 3+2=5).
– Directed popularity: score of a page = number of its in-links (e.g., 3).
Query processing
• First retrieve all pages meeting the text query (say venture capital).
• Order these by their link popularity (either variant on the previous page).
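A sketch of this two-step scheme over precomputed adjacency lists (the parameter names are illustrative):

```python
def rank_by_popularity(matches, in_links, out_links, directed=True):
    """matches: pages satisfying the text query.
    Directed score = in-degree; undirected adds the out-degree."""
    def score(page):
        s = len(in_links.get(page, []))
        if not directed:
            s += len(out_links.get(page, []))
        return s
    return sorted(matches, key=score, reverse=True)
```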
Spamming simple popularity
• Exercise: how do you spam each of the following heuristics so your page gets a high score?
Pagerank scoring
• Imagine a browser doing a random walk on web pages:
– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably
• “In the steady state” each page has a long-term visit rate – use this as the page’s score.
[Diagram: a page with three out-links, each followed with probability 1/3]
Not quite enough
• The web is full of dead ends.
– The random walk can get stuck in dead ends.
– Then it makes no sense to talk about long-term visit rates.
Teleporting
• At a dead end, jump to a random web page.
• At any non-dead end, with probability 10%, jump to a random web page.
– With the remaining probability (90%), go out on a random link.
– 10% is a parameter.
Result of teleporting
• Now we cannot get stuck locally.
• There is a long-term rate at which any page is visited (not obvious, will show this).
• How do we compute this visit rate?
Markov chains
• A Markov chain consists of n states, plus an n×n transition probability matrix P.
• At each step, we are in exactly one of the states.
• For 1 ≤ i, j ≤ n, the matrix entry Pij tells us the probability of j being the next state, given we are currently in state i.
[Diagram: states i and j joined by an edge labeled Pij; Pii > 0 is OK.]
Markov chains
• Clearly, for all i, Σ_{j=1}^{n} Pij = 1.
• Markov chains are abstractions of random walks.
• Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain, for this case: [example graph omitted]
Ergodic Markov chains
• A Markov chain is ergodic if:
– there is a path from any state to any other
– for any start state, after a finite transient time T0, the probability of being in any state at a fixed time T > T0 is nonzero.
Ergodic Markov chains
• For any ergodic Markov chain, there is a unique long-term visit rate for each state.
– This is the steady-state probability distribution.
• Over a long time period, we visit each state in proportion to this rate.
• It doesn’t matter where we start.
Probability vectors
• A probability (row) vector x = (x1, …, xn) tells us where the walk is at any point.
• E.g., (0 0 0 … 1 … 0 0 0), with the 1 in position i, means we’re in state i.
• More generally, the vector x = (x1, …, xn) means the walk is in state i with probability xi, so Σ_{i=1}^{n} xi = 1.
Change in probability vector
• If the probability vector is x = (x1, …, xn) at this step, what is it at the next step?
• Recall that row i of the transition probability matrix P tells us where we go next from state i.
• So from x, our next state is distributed as xP.
Steady state example
• The steady state looks like a vector of probabilities a = (a1, …, an):
– ai is the probability that we are in state i.
[Diagram: two states; P11 = 1/4, P12 = 3/4, P21 = 1/4, P22 = 3/4]
For this example, a1 = 1/4 and a2 = 3/4.
How do we compute this vector?
• Let a = (a1, …, an) denote the row vector of steady-state probabilities.
• If our current position is described by a, then the next step is distributed as aP.
• But a is the steady state, so a = aP.
• Solving this matrix equation gives us a.
One way of computing a
• Recall that, regardless of where we start, we eventually reach the steady state a.
• Start with any distribution (say x = (1 0 … 0)).
• After one step, we’re at xP; after two steps at xP², then xP³, and so on.
• “Eventually” means: for “large” k, xP^k = a.
• Algorithm: multiply x by increasing powers of P until the product looks stable.
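A sketch of this power iteration in plain Python, assuming an ergodic chain as above (P as a list of row-stochastic rows; the tolerance is an illustrative choice):

```python
def steady_state(P, tol=1e-10):
    """Repeatedly multiply a start distribution by P until it stabilizes."""
    n = len(P)
    x = [1.0] + [0.0] * (n - 1)   # start in state 0: x = (1, 0, ..., 0)
    while True:
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(nxt[j] - x[j]) for j in range(n)) < tol:
            return nxt
        x = nxt

# The two-state example above converges to its steady state (1/4, 3/4):
print(steady_state([[0.25, 0.75], [0.25, 0.75]]))
```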
Google’s approach
• Assumption: a link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
– Quality of a page is related to its in-degree
• Recursion: quality of a page is related to
– its in-degree, and to
– the quality of pages linking to it
PageRank [BP ‘98]
Definition of PageRank
• Consider the following infinite random walk (surf):
– Initially the surfer is at a random page
– At each step, the surfer proceeds
• to a randomly chosen web page with probability d
• to a randomly chosen successor of the current page with probability 1-d
• The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.
PageRank (cont.)
By the random walk theorem:
• PageRank = stationary probability for this Markov chain, i.e.

PageRank(p) = d/n + (1-d) · Σ_{(q,p)∈E} PageRank(q) / outdegree(q)

where n is the total number of nodes in the graph and E is its set of edges.
PageRank (cont.)
[Diagram: page P has in-links from page A (outdegree 4) and page B (outdegree 3)]
PageRank of P is (1-d) · (1/4 · PageRank(A) + 1/3 · PageRank(B)) + d/n
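A sketch that iterates the formula above to a fixed point (d is the teleport probability as defined earlier; spreading a dangling node’s mass uniformly is a common simplification assumed here):

```python
def pagerank(out_links, d=0.1, iterations=50):
    """out_links: dict node -> list of successors.
    Iterates PR(p) = d/n + (1-d) * sum over edges (q,p) of PR(q)/outdegree(q)."""
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    n = len(nodes)
    pr = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iterations):
        nxt = dict.fromkeys(nodes, d / n)          # teleport share
        for q in nodes:
            succ = out_links.get(q, [])
            if succ:
                share = (1 - d) * pr[q] / len(succ)
                for p in succ:
                    nxt[p] += share
            else:                                  # dangling node: spread evenly
                for p in nodes:
                    nxt[p] += (1 - d) * pr[q] / n
        pr = nxt
    return pr
```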
PageRank
• Used in Google’s ranking function
• Query-independent
• Summarizes the “web opinion” of the page importance
We want top-ranking documents to be both relevant and authoritative
• Relevance is modeled by cosine scores
• Authority is typically a query-independent property of a document
– Assign to each document d a query-independent quality score g(d) in [0,1]
PageRank vs. HITS
PageRank:
• Computation: once for all documents and queries (offline)
• Query-independent – requires combination with query-dependent criteria
• Hard to spam
HITS:
• Computation: required for each query
• Query-dependent
• Relatively easy to spam
• Quality depends on the quality of the start set
References
• M. Henzinger, “Hyperlink Analysis for the Web,” IEEE Internet Computing, 2001.
• J. Cho, H. García-Molina, and L. Page, “Efficient Crawling through URL Ordering,” Proc. Seventh Int’l World Wide Web Conf., Elsevier Science, New York, 1998.
• S. Chakrabarti et al., “Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text,” Proc. Seventh Int’l World Wide Web Conf., Elsevier Science, New York, 1998.
• K. Bharat and M. Henzinger, “Improved Algorithms for Topic Distillation in Hyperlinked Environments,” Proc. 21st Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR 98), ACM Press, New York, 1998.
• L. Page et al., “The PageRank Citation Ranking: Bringing Order to the Web,” Stanford Digital Library Technologies, Working Paper 1999-0120, Stanford Univ., Palo Alto, Calif., 1998.
• I. Varlamis et al., “THESUS, a Closer View on Web Content Management Enhanced with Link Semantics,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 6, June 2004.