Announcements

Date post: 11-Jan-2016
Upload: morwen

Announcements

• Research Paper due today

• Research Talks
  – Nov. 29 (Monday) Kayatana and Lance
  – Dec. 1 (Wednesday) Mark and Jeremy
  – Dec. 3 (Friday) Joe and Anton
  – Dec. 5 (Monday) Colin and Paul

Web Search

Lecture 23

Searching the Web

• Only search what is indexed
  – 1999, Northern Light [7]
    • Largest index - 16% of the roughly 800 million documents on the indexable web
  – 2004, about 8 billion URLs indexed by Google [1]
    • Largest index - ?% of indexable web

Visualizing the Web

• View the web as a directed graph of nodes and edges
  – set of abstract nodes (the pages)
  – joined by directional edges (the hyperlinks)

• Structure provides significant insight about the content
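
As a minimal sketch of this graph view, the snippet below stores a tiny web as a mapping from each page to the pages it links to, then derives the incoming links of every page. The four page names and the helper `incoming_links` are made up for illustration.

```python
# Hypothetical four-page web: each key is a page (node), each value is the
# list of pages it links to (its outgoing directed edges).
web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def incoming_links(graph):
    """Reverse the edges: map each page to the pages that link to it."""
    incoming = {page: [] for page in graph}
    for page, targets in graph.items():
        for target in targets:
            incoming.setdefault(target, []).append(page)
    return incoming

backlinks = incoming_links(web)
print(backlinks["C"])  # pages that point to C
print(backlinks["D"])  # no page points to D
```

Keeping both directions available is convenient because link-based ranking looks at who points to a page as well as where the page points.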

Example Graph [6]

Citation Analysis [2]

• Use structure to identify important, or prominent, nodes

• Garfield’s impact factor

– Quantitative “score” for each journal proportional to the average number of citations per paper published in the previous two years

– More heavily cited journals have more overall impact on a field

• Consider it better to receive citations from an important journal

Influence Weights

• Pinski and Narin’s notion of influence weights
  – strength of the connection from one journal to another
    • percentage of citations in the first journal that refer to the second
  – equilibrium: the weight of each journal J equals the sum of the weights of all journals citing J (scaled by the strengths of the connections)

• If a journal receives regular citations from other journals of large weight, it will acquire large weight
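
One way to picture the equilibrium is as a repeated redistribution of weight along citations until the weights stop changing. The sketch below does this for three journals. The citation counts, the function names, and the fixed iteration count are assumptions for illustration, not Pinski and Narin's actual procedure.

```python
# Hypothetical citation counts: citations[i][j] is the number of citations
# in journal i that refer to journal j.
citations = [
    [0, 30, 10],
    [20, 0, 20],
    [10, 10, 0],
]

def connection_strengths(counts):
    """Fraction of each journal's citations that refer to each other journal."""
    return [[c / sum(row) for c in row] for row in counts]

def influence_weights(strengths, iterations=100):
    """Iterate: weight of J = sum of (weight of citing journal * strength)."""
    n = len(strengths)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        weights = [
            sum(weights[i] * strengths[i][j] for i in range(n))
            for j in range(n)
        ]
    return weights

weights = influence_weights(connection_strengths(citations))
print(weights)
```

Because each journal's strengths sum to 1, the total weight stays constant across iterations. Only its distribution among the journals changes.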

On the web

• Lots of dead ends in the link structure
  – Prominent sites may have no links to the outside world
  – Use a “smoothing” operation, giving all pages a small, positive connection strength to every other page

• Compute equilibrium weights with respect to modified connection strengths
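
A minimal sketch of one possible smoothing operation follows. The mixing parameter `epsilon`, its value, and the uniform treatment of dead-end pages are assumptions. The slide does not say how the small positive strengths are chosen, and for simplicity this version also gives each page a small strength to itself.

```python
def smooth(strengths, epsilon=0.1):
    """Return connection strengths in which every entry is positive.

    strengths[i][j] is the connection strength from page i to page j.
    A dead-end page (all-zero row) is spread uniformly over all pages;
    every other row is mixed with a small uniform component.
    """
    n = len(strengths)
    smoothed = []
    for row in strengths:
        if sum(row) == 0:
            smoothed.append([1.0 / n] * n)
        else:
            smoothed.append([(1 - epsilon) * s + epsilon / n for s in row])
    return smoothed

# Hypothetical three-page example; the third page is a dead end.
strengths = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
smoothed = smooth(strengths)
for row in smoothed:
    print(row)
```

With the modified strengths, weight can no longer get trapped in a dead end, so the equilibrium weights are well defined.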

Different Model on the Web

• Prominent sites do not link to other prominent sites
  – Search engines won’t link to other search engines because they are competitors
  – Each wants to keep users on its own site

• A large collection of pages links to many prominent sites in a focused manner
  – these act as resource lists and guides to search engines

Hubs and Authorities

• Authorities – most prominent sources of primary content for a topic

• Hubs – high-quality guides and resource lists that direct users to recommended authorities

• Each page is assigned a hub weight and an authority weight
  – authority weight - proportional to the sum of the hub weights of the pages that link to it
  – hub weight - proportional to the sum of the authority weights of the pages that it links to
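
The two mutually dependent definitions can be computed by alternating the updates and rescaling after each step. The sketch below is one such iteration on a made-up graph. The page names, the iteration count, and the Euclidean normalisation are assumptions rather than details given on the slide.

```python
import math

# Hypothetical link graph: page -> pages it links to.
links = {
    "guide1": ["siteA", "siteB"],
    "guide2": ["siteA"],
    "siteA": [],
    "siteB": [],
}

def normalise(weights):
    """Scale the weights to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights
    return {page: w / norm for page, w in weights.items()}

def hubs_and_authorities(graph, iterations=50):
    hub = {page: 1.0 for page in graph}
    auth = {page: 1.0 for page in graph}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of the pages linking in.
        auth = {page: 0.0 for page in graph}
        for page, targets in graph.items():
            for target in targets:
                auth[target] += hub[page]
        auth = normalise(auth)
        # Hub weight: sum of the authority weights of the pages linked to.
        hub = normalise(
            {page: sum(auth[t] for t in graph[page]) for page in graph}
        )
    return hub, auth

hub, auth = hubs_and_authorities(links)
print(sorted(auth, key=auth.get, reverse=True))
print(sorted(hub, key=hub.get, reverse=True))
```

In this example `siteA` ends up as the strongest authority because both guides link to it, and `guide1` is the strongest hub because it links to both authorities.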

Simplified PageRank Algorithm[5]

• Formula used by Google to rank pages

  – Let $u$ be a web page
  – $F_u$ is the set of pages $u$ points to
  – $B_u$ is the set of pages that point to $u$
  – $N_u = |F_u|$
  – $c$ is a factor used for normalization

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$

Simplified PageRank Calculation

where c = 1
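
A calculation of this kind can be sketched in a few lines of Python. Each page's rank is recomputed as the sum, over the pages v that link to it, of R(v) divided by the number of links leaving v, with c = 1. The three-page graph is a made-up example and not necessarily the one pictured on the slide. The function name and iteration count are also assumptions.

```python
# Hypothetical graph: page -> pages it points to (no page is a dead end).
forward = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def simplified_pagerank(forward, iterations=100, c=1.0):
    pages = list(forward)
    rank = {u: 1.0 / len(pages) for u in pages}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in pages}
        for v, targets in forward.items():
            # Page v splits its rank evenly among the pages it points to.
            for u in targets:
                new_rank[u] += c * rank[v] / len(targets)
        rank = new_rank
    return rank

rank = simplified_pagerank(forward)
print(rank)  # approaches A = 0.4, B = 0.2, C = 0.4
```

A passes half of its rank to B and half to C, B passes everything to C, and C passes everything back to A, which is why A and C settle at the same value.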

PageRank Formula

• Account for sinks

• Complete Formula

$$R(u) = d + (1 - d) \sum_{v \in B_u} \frac{R(v)}{N_v}$$

  – d is empirically set to about 0.15 to 0.2 by the system
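
The sketch below iterates the complete formula, assuming it has the form R(u) = d + (1 - d) Σ R(v)/N_v over the pages v that point to u, with d = 0.15. The graph, the function name, and the iteration count are illustrative assumptions.

```python
# Hypothetical graph: page -> pages it points to. Nothing points to D.
forward = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(forward, d=0.15, iterations=200):
    pages = list(forward)
    rank = {u: 1.0 for u in pages}
    for _ in range(iterations):
        # Every page receives the base amount d regardless of its links.
        new_rank = {u: d for u in pages}
        for v, targets in forward.items():
            for u in targets:
                new_rank[u] += (1 - d) * rank[v] / len(targets)
        rank = new_rank
    return rank

rank = pagerank(forward)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Because every page receives the constant d, a page with no incoming links (D here) still has a non-zero rank, and rank cannot accumulate without bound inside a closed loop of pages.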

Using Queries to Find Documents

Vector Space Model – Content Relevance

Slide by Mark Levene [3]

Term Frequency (TF)

• Count the number of occurrences of each term.

• Bag of words approach

• Ignore stopwords such as is, a, of, the, …

• Stemming - computer is replaced by comput, as are its variants: computers, computing, computation, computer and computed.

• Normalise TF by dividing by doc length, byte size of doc or max num of occurrences of a word in the bag.

[Figure: bag of words containing chess, computer, programming, chess, game, chess, game, is, a]

Slide by Mark Levene [3]
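
As a rough sketch of the counting step, the snippet below builds a bag of words, drops a small stopword list, and normalises by the count of the most frequent term. The stopword set is a tiny stand-in, stemming is left out, and the function name is made up for illustration.

```python
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"is", "a", "of", "the"}

def term_frequencies(text):
    """Count terms, ignoring stopwords, normalised by the largest count."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(words)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

tf = term_frequencies("chess computer programming chess game chess game is a")
print(tf)
```

Dividing by the maximum count is only one of the normalisations listed above. Dividing by document length would work the same way with `len(words)` as the denominator.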

Inverse Document Frequency (IDF)

$$\mathrm{IDF}_i = \log \frac{N}{n_i}$$

• $N$ is the number of documents in the corpus.

• $n_i$ is the number of docs in which word $i$ appears.

• Log dampens the effect of IDF.

• IDF is also the number of bits to represent the term.

Slide by Mark Levene [3]
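
A small sketch of the computation follows, using a base-2 logarithm so that the value can be read as a number of bits. The base is an assumption, since the slide only writes "log". The corpus and the function name are made up.

```python
import math

# Hypothetical corpus: each document is represented as the set of its terms.
corpus = [
    {"chess", "game"},
    {"chess", "computer"},
    {"computer", "programming"},
    {"football", "game", "chess"},
]

def inverse_document_frequency(term, documents):
    """log2(N / n_i): N documents in total, n_i of them contain the term."""
    n_i = sum(1 for doc in documents if term in doc)
    if n_i == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log2(len(documents) / n_i)

for term in ("chess", "computer", "programming"):
    print(term, inverse_document_frequency(term, corpus))
```

The rarer a term is across the corpus, the larger its IDF: "programming" occurs in one of four documents and scores 2 bits, while "chess" occurs in three and scores well under 1.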

Ranking with TF-IDF

$$w_{i,j} = \mathrm{TF}_{i,j} \cdot \mathrm{IDF}_i$$

$$\mathrm{score}_j = \sum_{i \in q} w_{i,j}$$

• i – refers to word (or term) i in doc j

• j – refers to document j

• q – is the query, which is a sequence of terms

• score_j – is the score for document j given q

• Rank results according to the scoring function.

Slide by Mark Levene [3]
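
Putting the pieces together, the sketch below scores each document as the sum, over the query terms, of TF times IDF and then sorts by score. The three documents, the length-based TF normalisation, and the base-2 logarithm are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical corpus: document name -> list of (already tokenised) terms.
docs = {
    "d1": "chess computer programming chess game".split(),
    "d2": "computer programming language".split(),
    "d3": "football game".split(),
}

def tf_idf_scores(docs, query):
    """Score of a document = sum over query terms of TF * IDF."""
    n_docs = len(docs)
    # Number of documents in which each term appears.
    doc_freq = Counter(t for words in docs.values() for t in set(words))
    scores = {}
    for name, words in docs.items():
        counts = Counter(words)
        score = 0.0
        for term in query:
            if counts[term]:
                tf = counts[term] / len(words)
                idf = math.log2(n_docs / doc_freq[term])
                score += tf * idf
        scores[name] = score
    return scores

scores = tf_idf_scores(docs, ["chess", "game"])
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Here `d1` ranks first because it contains both query terms and "chess" is rare in the corpus, `d3` matches only "game", and `d2` matches neither term and scores zero.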

Factor in Link Metrics

$$w_{i,j} = \mathrm{TF}_{i,j} \cdot \mathrm{IDF}_i \cdot \mathrm{PR}_j$$

• Multiply by the PageRank of the document (web page).

• We do not know exactly how Google factors in the PR; it may be that log(PR) is used.

Slide by Mark Levene [3]
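
To illustrate the effect of the extra factor, the sketch below multiplies a content score by a PageRank value for two pages. All numbers and names are invented, and plain multiplication is used only because it is the simplest reading of the slide. How a real search engine combines the two is not known.

```python
# Hypothetical content (TF-IDF) scores and PageRank values for two pages.
content_scores = {"page1": 0.75, "page2": 0.60}
pagerank_values = {"page1": 1.0, "page2": 3.0}

def combined_scores(content, pagerank):
    """Multiply each page's content score by its PageRank."""
    return {page: content[page] * pagerank[page] for page in content}

by_content = sorted(content_scores, key=content_scores.get, reverse=True)
final = combined_scores(content_scores, pagerank_values)
by_combined = sorted(final, key=final.get, reverse=True)

print(by_content)   # order using content relevance only
print(by_combined)  # order after factoring in PageRank
```

In this made-up case the page with the slightly weaker content match overtakes the other once its higher PageRank is factored in.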

Rate of change on the Web [4]

• Search engines update their index periodically in order to keep up with the evolving web
  – an obsolete index leads to irrelevant or “broken” search results
  – update both content and link structure

• Sources of change
  – content of pages changes
  – new pages are added

What’s new on the Web?

• New pages are created at a rate of 8% a week [4]
  – New pages borrow a significant amount of content from old pages
  – After one year, 50% of the content on the web is new

• Only 20% of the pages available today will be accessible after one year

New Link Structure

• After a year, about 80% of links on the Web will be replaced with new ones

• 25% change per week
  – week-old rankings may not reflect the current ranking of the pages very well

Change in old pages

• After one week
  – 30% of the changed pages differ by more than 5%

• After one year
  – less than 50% of the changed pages differ by more than 5%

• Creation of new pages is a more significant source of change on the Web

Impact on Search Engines

• Need to continually update links
  – this data changes more rapidly than content
  – most links persist for less than 6 months

• Pages are removed and replaced by new ones at rapid rates
  – Sometimes better to use the cached version of a page

• Pages that persist usually do not change very much
  – Past change does not predict future change

Citations

[1] Google. Google. www.google.com

[2] J. Kleinberg. Hubs, Authorities, and Communities. ACM Computing Surveys, 31(4es), 1999.

[3] M. Levene. Lecture 4: Searching the Web. www.dsc.bbk.ac.uk/~mark/download/lec4_searching_the_web.ppt

[4] A. Ntoulas et al. What’s New on the Web? The Evolution of the Web from a Search Engine Perspective. In Proceedings of The Thirteenth International World Wide Web Conference, New York, May 17-22, 2004.

[5] L. Page et al. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Libraries Working Paper, 1998.

[6] I. Rogers. The Google PageRank Algorithm and How It Works. www.iprcom.com/papers/pagerank, April, 2002.

[7] E. Selberg and O. Etzioni. On the Instability of Web Search Engines. In Proceedings of RIAO 2000 Conference, Paris, April 12-14, 2000.

