1
Announcements
• Research Paper due today
• Research Talks
– Nov. 29 (Monday): Kayatana and Lance
– Dec. 1 (Wednesday): Mark and Jeremy
– Dec. 3 (Friday): Joe and Anton
– Dec. 5 (Monday): Colin and Paul
2
Web Search
Lecture 23
3
Searching the Web
• Only search what is indexed
– 1999: 800 million documents indexed by Northern Light [7]
• Largest index: 16% of the indexable web
– 2004: 800 billion URLs indexed by Google [1]
• Largest index: ?% of the indexable web
4
Visualizing the Web
• View the web as a directed graph of nodes and edges
– set of abstract nodes (the pages)
– joined by directional edges (the hyperlinks)
• Structure provides significant insight about the content
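A minimal sketch of the directed-graph view, assuming a made-up four-page web. The page names and the `backlinks` helper are hypothetical and only illustrate how the link structure can be stored and inverted.

```python
# Hypothetical four-page web: each key is a page (node), each value is the
# list of pages it links to (its outgoing edges).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def backlinks(links):
    """Invert the graph: map each page to the pages that link to it."""
    incoming = {page: [] for page in links}
    for source, targets in links.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    return incoming


incoming = backlinks(links)
print(incoming["C"])  # pages that point at C
print(incoming["D"])  # nothing points at D
```

Page C collects links from three other pages while D collects none, which is the kind of structural signal the following slides build on.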
5
Example Graph [6]
6
Citation Analysis [2]
• Use structure to identify important, or prominent, nodes
• Garfield’s impact factor
– Quantitative “score” for each journal proportional to the average number of citations per paper published in the previous two years
– More heavily cited journals have more overall impact on a field
• Consider it better to receive citations from an important journal
7
Influence Weights
• Pinski and Narin’s notion of influence weights
– strength of the connection from one journal to another
• percentage of citations in the first journal that refer to the second
– equilibrium: the weight of each journal J is equal to the sum of the weights of all journals citing J (scaled by the strengths of the connections)
• If a journal receives regular citations from other journals of large weight, it will acquire large weight
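The equilibrium described above can be approximated by repeated averaging. The sketch below is one possible reading, not Pinski and Narin's exact procedure: the three-journal `strength` table is invented, and `influence_weights` is a hypothetical helper name.

```python
def influence_weights(strength, iterations=100):
    """Iterate towards equilibrium weights.

    strength[i][j] is the fraction of journal i's citations that refer
    to journal j (each row sums to 1 in this example).
    """
    n = len(strength)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        # Weight of j = sum of weights of journals citing j, scaled by strength.
        new = [sum(weights[i] * strength[i][j] for i in range(n)) for j in range(n)]
        total = sum(new)
        weights = [w / total for w in new]  # keep the weights summing to 1
    return weights


# Invented citation fractions for three journals (assumption for illustration).
strength = [
    [0.0, 0.5, 0.5],  # journal 0 cites journals 1 and 2 equally
    [1.0, 0.0, 0.0],  # journal 1 cites only journal 0
    [0.5, 0.5, 0.0],  # journal 2 cites journals 0 and 1 equally
]
weights = influence_weights(strength)
print(weights)
```

Journal 0 ends up with the largest weight because it receives all of journal 1's citations and half of journal 2's.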
8
On the web
• Lots of dead ends in the link structure
– Prominent sites may have no links to the outside world
– Use a “smoothing” operation, giving all pages a small, positive connection strength to every other page
• Compute equilibrium weights with respect to modified connection strengths
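One way the smoothing step might look, as a sketch under stated assumptions: the mixing parameter `epsilon`, the uniform treatment of dead-end rows, and the function name `smooth` are all illustrative choices, not taken from the slides. For simplicity each page also gets a small connection to itself.

```python
def smooth(strength, epsilon=0.1):
    """Give every page a small positive connection strength to every page."""
    n = len(strength)
    smoothed = []
    for row in strength:
        total = sum(row)
        if total == 0:
            # Dead end: no outgoing links, so spread the strength uniformly.
            smoothed.append([1.0 / n] * n)
        else:
            # Keep most of the original strength, add a small uniform share.
            smoothed.append([(1 - epsilon) * s / total + epsilon / n for s in row])
    return smoothed


# Page 0 links only to page 1; page 1 is a dead end.
smoothed = smooth([[0.0, 1.0], [0.0, 0.0]])
print(smoothed)
```

Every entry is now positive and every row still sums to 1, so the equilibrium weights can be computed on the modified strengths.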
9
Different Model on the Web
• Prominent sites do not link to other prominent sites
– Search engines won’t link to other search engines because they are competitors
– Each wants to keep users on its own site
• Large collections of pages link to many prominent sites in a focused manner
– act as resource lists and guides to search engines
10
Hubs and Authorities
• Authorities – most prominent sources of primary content for a topic
• Hubs – high quality guides and resource lists that direct users to recommended authorities
• Each page is assigned a hub weight and an authority weight
– authority weight - proportional to the sum of the hub weights of pages that link to it
– hub weight - proportional to the sum of the authority weights of the pages that it links to
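The two mutually dependent weights can be computed by alternating the two update rules. This is a rough sketch rather than Kleinberg's full algorithm: the page names, the iteration count, and the Euclidean normalisation are assumptions made for illustration.

```python
import math


def normalise(weights):
    """Scale the weights so their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {page: w / norm for page, w in weights.items()}


def hubs_and_authorities(links, iterations=50):
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages that link to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub weight: sum of authority weights of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        auth = normalise(auth)
        hub = normalise(hub)
    return hub, auth


# Hypothetical example: two resource lists pointing at two content sites.
links = {"guide1": ["siteA", "siteB"], "guide2": ["siteA"]}
hub, auth = hubs_and_authorities(links)
print(sorted(auth, key=auth.get, reverse=True))
print(sorted(hub, key=hub.get, reverse=True))
```

In this toy graph `siteA` becomes the strongest authority because both guides recommend it, and `guide1` becomes the strongest hub because it points to both authorities.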
11
Simplified PageRank Algorithm[5]
• Formula used by Google to rank pages
– Let u be a web page
– F_u is the set of pages u points to
– B_u is the set of pages that point to u
– N_u = |F_u|
– c is a factor used for normalization
$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
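A minimal sketch of iterating the simplified formula, assuming every page has at least one outgoing link and every link target appears as a key. The three-page graph and the function name `simplified_pagerank` are hypothetical.

```python
def simplified_pagerank(links, c=1.0, iterations=100):
    """Repeatedly apply R(u) = c * sum over v in B_u of R(v) / N_v."""
    pages = list(links)
    rank = {u: 1.0 / len(pages) for u in pages}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in pages}
        for v, forward in links.items():
            share = rank[v] / len(forward)  # R(v) / N_v
            for u in forward:               # v is in B_u for each such u
                new_rank[u] += c * share
        rank = new_rank
    return rank


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = simplified_pagerank(links)
print(rank)
```

With c = 1 the total rank is conserved, and the values settle so that A and C each hold twice the rank of B.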
12
Simplified PageRank Calculation
[Figure: simplified PageRank calculation example, where c = 1]
13
PageRank Formula
• Account for sinks
• Complete Formula
– d is empirically set to about 0.15 to 0.2 by the system
$$R(u) = d + (1 - d) \sum_{v \in B_u} \frac{R(v)}{N_v}$$
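The complete formula, R(u) = d + (1 - d) times the sum of R(v) / N_v over the pages v that link to u, can be iterated in the same way. In this sketch a sink simply passes on no rank, which is an assumption about how sinks are handled; the graph and the name `pagerank` are illustrative.

```python
def pagerank(links, d=0.15, iterations=50):
    """Iterate R(u) = d + (1 - d) * sum over v in B_u of R(v) / N_v."""
    pages = set(links) | {u for targets in links.values() for u in targets}
    rank = {u: 1.0 for u in pages}
    for _ in range(iterations):
        new_rank = {u: d for u in pages}  # every page receives the base amount d
        for v, forward in links.items():
            for u in forward:             # sinks have no forward links: loop is skipped
                new_rank[u] += (1 - d) * rank[v] / len(forward)
        rank = new_rank
    return rank


# Hypothetical graph with a sink: A -> B, C -> B, and B links nowhere.
links = {"A": ["B"], "B": [], "C": ["B"]}
rank = pagerank(links)
print(rank)
```

Pages A and C have no incoming links, so they keep only the base value d, while the sink B accumulates rank from both of them.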
14
Using Queries to Find Documents
Vector Space Model – Content Relevance
Slide by Mark Levene [3]
15
Term Frequency (TF)
• Count number of occurrences of each term.
• Bag of words approach
• Ignore stopwords such as is, a, of, the, …
• Stemming - computer is replaced by comput, as are its variants: computers, computing, computation, computer and computed.
• Normalise TF by dividing by doc length, byte size of doc or max num of occurrences of a word in the bag.
[Figure: bag of words containing chess, computer, programming, chess, game, chess, game, is, a]
Slide by Mark Levene [3]
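A small sketch of the counting and normalisation steps, assuming whitespace tokenisation and normalisation by the maximum count in the bag. Stemming is left out, and the stopword set and function name are illustrative only.

```python
from collections import Counter

# Tiny stopword list taken from the examples on the slide.
STOPWORDS = {"is", "a", "of", "the"}


def term_frequency(text):
    """Count terms in a bag of words and normalise by the largest count."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: n / max_count for term, n in counts.items()}


tf = term_frequency("chess is a game chess computer programming chess game")
print(tf)
```

Here `chess` occurs most often, so its normalised frequency is 1.0, and the stopwords do not appear in the result at all.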
16
Inverse Document Frequency (IDF)
$$\mathrm{IDF}_i = \log \frac{N}{n_i}$$
• N is number of documents in the corpus.
• n_i is number of docs in which word i appears.
• Log dampens the effect of IDF.
• IDF is also number of bits to represent the term.
Slide by Mark Levene [3]
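A sketch of the IDF computation over a made-up corpus. The base-2 logarithm is an assumption chosen to match the "number of bits" reading; the slide does not fix the base. The function name and documents are hypothetical.

```python
import math


def inverse_document_frequency(documents):
    """Return log2(N / n_i) for every term i in a list of token lists."""
    n_docs = len(documents)
    doc_count = {}
    for tokens in documents:
        for term in set(tokens):  # count each document at most once per term
            doc_count[term] = doc_count.get(term, 0) + 1
    return {term: math.log2(n_docs / n) for term, n in doc_count.items()}


# Invented corpus of four tiny documents.
documents = [
    ["chess", "game"],
    ["chess", "computer"],
    ["computer", "programming"],
    ["chess"],
]
idf = inverse_document_frequency(documents)
print(idf)
```

The rare term `programming` gets the highest IDF, while `chess`, which appears in three of the four documents, gets the lowest.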
17
Ranking with TF-IDF
$$w_{i,j} = \mathrm{TF}_{i,j} \cdot \mathrm{IDF}_i$$
$$\mathrm{score}_j = \sum_{i \in q} w_{i,j}$$
• i – refers to word (or term) i
• j – refers to document j
• q – is the query, which is a sequence of terms
• score_j – is the score for document j given q
• Rank results according to the scoring function.
Slide by Mark Levene [3]
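The scoring function can be sketched end to end as follows, assuming non-empty tokenised documents, TF normalised by the maximum count in each document, and a base-2 logarithm for IDF. All of these are illustrative choices, as are the corpus and the name `rank_documents`.

```python
import math
from collections import Counter


def rank_documents(documents, query):
    """Score each document as the sum of TF * IDF over the query terms."""
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]
    # Number of documents in which each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    scores = []
    for index, c in enumerate(counts):
        max_count = max(c.values())
        score = 0.0
        for term in query:
            if term in c:
                tf = c[term] / max_count
                idf = math.log2(n_docs / doc_freq[term])
                score += tf * idf
        scores.append((score, index))
    # Highest score first; each entry is (score, document index).
    return sorted(scores, reverse=True)


# Invented corpus of three tokenised documents.
documents = [
    ["chess", "game", "chess"],
    ["computer", "programming"],
    ["computer", "chess", "programming", "game"],
]
ranked = rank_documents(documents, ["computer", "chess"])
print(ranked)
```

The third document ranks first because it is the only one that contains both query terms.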
18
Factor in Link Metrics
$$w_{i,j} = \mathrm{TF}_{i,j} \cdot \mathrm{IDF}_i \cdot \mathrm{PR}_j$$
• Multiply by the PageRank of the document (web page); PR_j is the PageRank of document j.
• We do not know exactly how Google factors in the PR; it may be that log(PR) is used.
Slide by Mark Levene [3]
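As a rough illustration of the multiplication, the sketch below re-ranks pages by content score times PageRank. The scores, PageRank values, and function name are invented, and a plain product is only one possibility since the slide notes that the real combination is unknown.

```python
def rerank(content_scores, pagerank):
    """Order pages by content score multiplied by PageRank (plain product)."""
    combined = {page: content_scores[page] * pagerank[page] for page in content_scores}
    return sorted(combined, key=combined.get, reverse=True)


# Invented values: page1 matches the query slightly better, page2 is better linked.
content_scores = {"page1": 1.2, "page2": 1.0}
pagerank = {"page1": 0.2, "page2": 0.5}
order = rerank(content_scores, pagerank)
print(order)
```

In this example the better-linked page overtakes the page with the slightly higher content score.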
19
Rate of change on the Web [4]
• Search engines update their index periodically in order to keep up with the evolving web
– obsolete index leads to irrelevant or “broken” search results
– update both content and link structure
• Source of change
– content of pages changes
– new pages are added
20
What’s new on the Web?
• New pages are created at a rate of 8% a week [4]
– New pages borrow a significant amount of content from old pages
– After one year, 50% of the content on the web is new
• Only 20% of pages available today will be accessible after one year
21
New Link Structure
• After a year, about 80% of links on the Web will be replaced with new ones
• 25% change per week
– week-old rankings may not reflect the current ranking of the pages very well
22
Change in old pages
• After one week
– 30% of the changed pages differ by more than 5%
• After one year
– less than 50% of the changed pages differ by more than 5%
• Creation of new pages is a more significant source of change on the Web
23
Impact on Search Engines
• Need to continually update links
– this data changes more rapidly than content
– most links persist for less than 6 months
• Pages are removed and replaced by new ones at rapid rates
– Sometimes better to use a cached version of the page
• Pages that persist usually do not change very much
– Past change does not predict future change
24
Citations
[1] Google. www.google.com
[2] J. Kleinberg. Hubs, Authorities, and Communities. ACM Computing Surveys, 31(4es), 1999.
[3] M. Levene. Lecture 4: Searching the Web. www.dsc.bbk.ac.uk/~mark/download/lec4_searching_the_web.ppt
[4] A. Ntoulas et al. What’s New on the Web? The Evolution of the Web from a Search Engine Perspective. In Proceedings of The Thirteenth International World Wide Web Conference, New York, May 17-22, 2004.
[5] L. Page et al. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Libraries Working Paper, 1998.
[6] I. Rogers. The Google PageRank Algorithm and How It Works. www.iprcom.com/papers/pagerank, April, 2002.
[7] E. Selberg and O. Etzioni. On the Stability of Web Search Engines. In Proceedings of RIAO 2000 Conference, Paris, April 12-14, 2000.