Page 1: Search Engines

Search Engines

CS 186 Guest Lecture

Prof. Marti Hearst

SIMS

Page 2: Search Engines

Web Search Questions

• How do search engines differ from DBMSs?

• What do people search for?

• How do search engines work?
  • Interfaces
  • Ranking
  • Architecture

Page 3: Search Engines

Web Search vs DBMS?

Page 4: Search Engines

A Comparison

Web Search                 DBMS
Imprecise                  Precise
Ranked results             Usually unordered
“Satisficing” results      Complete results
Unedited content           Controlled content
Keyword queries            SQL
Mainly read-only           Reads and writes
Inverted index             B-trees

Page 5: Search Engines

What Do People Search for on the Web?

Page 6: Search Engines

What Do People Search for on the Web?
• Genealogy/Public Figure: 12%
• Computer related: 12%
• Business: 12%
• Entertainment: 8%
• Medical: 8%
• Politics & Government: 7%
• News: 7%
• Hobbies: 6%
• General info/surfing: 6%
• Science: 6%
• Travel: 5%
• Arts/education/shopping/images: 14%

Something is missing…
Study by Spink et al., Oct 98: survey on Excite, 13 questions, data for 316 surveys.
www.shef.ac.uk/~is/publications/infres/paper53.html

Page 7: Search Engines

What Do People Search for on the Web?
Most frequent terms from 50,000 queries (Excite, 1997):

• 4660 sex
• 3129 yahoo
• 2191 internal site admin check from kho
• 1520 chat
• 1498 porn
• 1315 horoscopes
• 1284 pokemon
• 1283 SiteScope test
• 1223 hotmail
• 1163 games
• 1151 mp3
• 1140 weather
• 1127 www.yahoo.com
• 1110 maps
• 1036 yahoo.com
• 983 ebay
• 980 recipes

Page 8: Search Engines

Why do these differ?

• Self-reporting survey

• The nature of language
  • Only a few ways to say certain things
  • Many different ways to express most concepts
    • UFO, Flying Saucer, Space Ship, Satellite
  • How many ways are there to talk about history?

Page 9: Search Engines

Intranet Queries (Aug 2000)
• 3351 bearfacts
• 3349 telebears
• 1909 extension
• 1874 schedule+of+classes
• 1780 bearlink
• 1737 bear+facts
• 1468 decal
• 1443 infobears
• 1227 calendar
• 989 career+center
• 974 campus+map
• 920 academic+calendar
• 840 map
• 773 bookstore
• 741 class+pass
• 738 housing
• 721 tele-bears
• 716 directory
• 667 schedule
• 627 recipes
• 602 transcripts
• 582 tuition
• 577 seti
• 563 registrar
• 550 info+bears
• 543 class+schedule
• 470 financial+aid

Page 10: Search Engines

Intranet Queries
• Summary of sample data from 3 weeks of UCB queries:
  • 13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
  • 6.7% Schedule of classes or final exams (6222)
  • 5.4% Summer Session (5041)
  • 3.2% Extension (2932)
  • 3.1% Academic Calendar (2846)
  • 2.4% Directories (2202)
  • 1.7% Career Center (1588)
  • 1.7% Housing (1583)
  • 1.5% Map (1393)
• Average query length over the last 4 months: 1.8 words
• This suggests what is difficult to find from the home page

Page 11: Search Engines

Different kinds of users; different kinds of data

• Legal and news collections:
  • Professional searchers
  • Paying (by the query or by the minute)
• Online bibliographic catalogs (Melvyl):
  • Scholars searching scholarly literature
• Web:
  • Every type of person with every type of goal
  • No “driving school” for searching

Page 12: Search Engines

Different kinds of information needs; different kinds of queries

• Example: Search on “Mazda”
  – What does this mean on the web?
  – What does this mean on a news collection?

• Example: “Mazda transmissions”

• Example: “Manufacture of Mazda transmissions in the post-cold war world”

Page 13: Search Engines

Web Queries
• Web queries are SHORT
  • ~2.4 words on average (Aug 2000)
  • Has increased; was 1.7 (~1997)
• User expectations
  • Many say “the first item shown should be what I want to see”!
  • This works if the user has the most popular/common notion in mind

Page 14: Search Engines

Recent statistics from Inktomi, August 2000, for one client, one week

• Total # queries: 1315040
• Number of repeated queries: 771085
• Number of queries with repeated words: 12301
• Average words/query: 2.39
• Query type: all words 0.3036; any words 0.6886; some words 0.0078
• Boolean: 0.0015 (0.9777 AND / 0.0252 OR / 0.0054 NOT)
• Phrase searches: 0.198
• URL searches: 0.066
• URL searches w/ http: 0.000
• Email searches: 0.001
• Wildcards: 0.0011 (0.7042 '?'s)
• Fraction '?' at end of query: 0.6753
• Interrogatives when '?' at end: 0.8456

Page 15: Search Engines

How to Optimize for Short Queries?
• Find good starting places
  • User still has to search at the site itself
• Dialogues
  • Build upon a series of short queries
  • Not well understood how to do this for the general case
• Question answering
  • AskJeeves – hand edited
  • Automated approaches are under development
    • Very simple, or domain-specific

Page 16: Search Engines

How to Find Good Starting Points?

• Manually compiled lists
  • Directories, e.g., Yahoo, Looksmart, Open Directory
• Page “popularity”
  • Frequently visited pages (in general)
  • Frequently visited pages as a result of a query
• Link “co-citation”
  • Which sites are linked to by other sites?
• Number of pages in the site
  • Not currently used (as far as I know)

Page 17: Search Engines

Directories vs. Search Engines: An IMPORTANT Distinction
• Directories
  • Hand-selected sites
  • Search over the contents of the descriptions of the pages
  • Organized in advance into categories
• Search Engines
  • All pages in all sites
  • Search over the contents of the pages themselves
  • Organized after the query by relevance rankings or other scores

Page 18: Search Engines

Link Analysis for Starting Points

• Assumptions:
  • If the pages pointing to this page are good, then this is also a good page.
  • The words on the links pointing to this page are useful indicators of what this page is about.

• References: Page et al. 98, Kleinberg 98
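(Page et al. 98 is the PageRank paper; Kleinberg 98 is HITS.) As a rough illustration of the first assumption, here is a minimal PageRank-style power iteration over a toy link graph; the graph, damping factor, and iteration count are illustrative assumptions, not values from the lecture:

```python
# Minimal PageRank-style power iteration over a toy link graph.
# Graph, damping factor, and iteration count are illustrative assumptions.
links = {
    "A": ["B", "C"],   # page A links to B and C
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85

for _ in range(50):  # iterate until the scores settle
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share  # good pages pass their goodness along
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))  # C should rank highest
```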

Page 19: Search Engines

Co-Citation Analysis
• Has been around since the 50’s (Small, Garfield, White & McCain)
• Used to identify core sets of authors, journals, and articles for particular fields
  • Not for general search
• Main idea:
  • Find pairs of papers that cite third papers
  • Look for commonalities
• A nice demonstration by Eugene Garfield at:
  – http://165.123.33.33/eugene_garfield/papers/mapsciworld.html

Page 20: Search Engines

Link Analysis for Starting Points

• Why does this work?
  • The official Toyota site will be linked to by lots of other official (or high-quality) sites
  • The best Toyota fan-club site probably also has many links pointing to it
  • Less high-quality sites do not have as many high-quality sites linking to them

Page 21: Search Engines

Co-citation analysis (From Garfield 98)

Page 22: Search Engines

Link Analysis for Starting Points

• Does this really work?
  • Actually, there have been no rigorous evaluations
  • Seems to work for the primary sites; not clear if it works for the relevant secondary sites
  • One (small) study suggests that sites with many pages are often the same as those with good link co-citation scores (Terveen & Hill, SIGIR 2000)

Page 23: Search Engines

What is Really Being Used?

• Today’s search engines combine these methods in various ways
• Integration of directories
  • Today most web search engines integrate categories into the results listings
  • Lycos, MSN, Google
• Link analysis
  • Google uses it; others are using it or will soon
  • Words on the links seem to be especially useful
• Page popularity
  • Many use DirectHit’s popularity rankings

Page 24: Search Engines

Ranking Algorithms

Page 25: Search Engines

The problem of ranking

Doc 1: cat cat cat / dog dog dog / fish fish fish
Doc 2: cat cat cat / cat cat cat / cat cat cat
Doc 3: orangutang / fish

Query: cat dog fish orangutang

Which is the best match?

Page 26: Search Engines

Assigning Weights to Terms

• Binary weights
• Raw term frequency
• tf x idf
  • Recall the Zipf distribution
  • Want to weight terms highly if they are:
    • frequent in relevant documents … BUT
    • infrequent in the collection as a whole
• Automatically derived thesaurus terms

Page 27: Search Engines

Binary Weights

• Only the presence (1) or absence (0) of a term is included in the vector

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1

Page 28: Search Engines

Raw Term Weights

• The frequency of occurrence for the term in each document is included in the vector

docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1

Page 29: Search Engines

Assigning Weights
• Goal: give more weight to terms that are:
  • Common in THIS document
  • Uncommon in the collection as a whole
• The tf x idf measure:
  • term frequency (tf)
  • inverse document frequency (idf)

Page 30: Search Engines

Document Vectors

• Documents are represented as “bags of words”

• Represented as vectors when used computationally
  • A vector is like an array of floating point numbers
  • Each vector holds a place for every term in the collection
  • Therefore, most vectors are sparse

Page 31: Search Engines

Document Vectors
One location for each word.

      nova  galaxy  heat  h’wood  film  role  diet  fur
A       10       5     3
B        5      10
C                     10       8     7
D                              9    10     5
E                                   10           10
F        9      10
G                                          5     7    9
H                6    10             2                8
I                              7     5           1    3

“Nova” occurs 10 times in text A
“Galaxy” occurs 5 times in text A
“Heat” occurs 3 times in text A
(Blank means 0 occurrences.)

Page 32: Search Engines

Document Vectors
One location for each word. (Same term–frequency matrix as above.)

“Hollywood” occurs 7 times in text I
“Film” occurs 5 times in text I
“Diet” occurs 1 time in text I
“Fur” occurs 3 times in text I

Page 33: Search Engines

Document Vectors
(Same term–frequency matrix as above; the row labels A–I are the document ids.)

Page 34: Search Engines

Vector Space Model

• Documents are represented as vectors in term space
  • Terms are usually stems
  • Documents represented by binary vectors of terms
• Queries represented the same as documents
• Query and document weights are based on the length and direction of their vector
• A vector distance measure between the query and documents is used to rank retrieved documents

Page 35: Search Engines

Documents in 3D Space

Assumption: Documents that are “close together” in space are similar in meaning.

Page 36: Search Engines

tf x idf

  w_ik = tf_ik * log(N / n_k)

where:
  T_k   = term k in document D_i
  tf_ik = frequency of term T_k in document D_i
  idf_k = inverse document frequency of term T_k in collection C
        = log(N / n_k)
  N     = total number of documents in collection C
  n_k   = number of documents in C that contain T_k
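A minimal sketch of this weighting over a toy collection (the documents and the whitespace tokenization are illustrative assumptions; the formula is the one above):

```python
import math
from collections import Counter

# Toy collection, reusing the cat/dog/fish documents from the ranking slide.
docs = {
    "D1": "cat cat cat dog dog dog fish fish fish".split(),
    "D2": "cat cat cat cat cat cat cat cat cat".split(),
    "D3": "orangutang fish".split(),
}

N = len(docs)                                  # total number of documents
tf = {d: Counter(words) for d, words in docs.items()}
# n_k: number of documents that contain term k
n = Counter(term for counts in tf.values() for term in counts)

def tfidf(doc_id: str, term: str) -> float:
    # w_ik = tf_ik * log(N / n_k)
    return tf[doc_id][term] * math.log(N / n[term])

print(tfidf("D1", "fish"))        # frequent in D1, but in 2 of 3 docs
print(tfidf("D3", "orangutang"))  # rare term -> high idf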

Page 37: Search Engines

Computing Similarity Scores

[Figure: a query Q and two documents plotted in two-dimensional term space]

  D1 = (0.8, 0.3)
  D2 = (0.2, 0.7)
  Q  = (0.4, 0.8)

  cos α1 (Q, D1) = 0.74
  cos α2 (Q, D2) = 0.98

D2 points in nearly the same direction as Q, so it receives the higher score.
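A small sketch that reproduces these scores from the vectors in the figure:

```python
import math

def cosine(a, b):
    # cos(angle) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(cosine(Q, D1))  # ~0.73, close to the figure's rounded 0.74
print(cosine(Q, D2))  # ~0.98: D2 points nearly the same way as Q
```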

Page 38: Search Engines

The results of ranking

Doc 1: cat cat cat / dog dog dog / fish fish fish
Doc 2: cat cat cat / cat cat cat / cat cat cat
Doc 3: orangutang / fish

Query: cat dog fish orangutang

What does vector space ranking do?

Page 39: Search Engines

High-Precision Ranking

Proximity search can help get high-precision results if the query has more than one term.
• Hearst ’96 paper:
  • Combine Boolean and passage-level proximity
  • Shows significant improvements when retrieving top 5, 10, 20, 30 documents
  • Results reproduced by Mitra et al. 98
• Google uses something similar

Page 40: Search Engines

What is Really Being Used?

• Lots of variation here
  • Pretty messy in many cases
  • Details usually proprietary and fluctuating
• Combining subsets of:
  • Term frequencies
  • Term proximities
  • Term position (title, top of page, etc.)
  • Term characteristics (boldface, capitalized, etc.)
  • Link analysis information
  • Category information
  • Popularity information

Page 41: Search Engines

Web Spam

• Email spam:
  • Undesired content
• Web spam:
  • Content disguised as something it is not, so that it will:
    • Be retrieved more often than it otherwise would
    • Be retrieved in contexts that it otherwise would not be retrieved in

Page 42: Search Engines

Web Spam
• What are the types of Web spam?
  • Add extra terms to get a higher ranking
    • Repeat “cars” thousands of times
  • Add irrelevant terms to get more hits
    • Put a dictionary in the comments field
    • Put extra terms in the same color as the background of the web page
  • Add irrelevant terms to get different types of hits
    • Put “sex” in the title field in sites that are selling cars
  • Add irrelevant links to boost your link analysis ranking
• There is a constant “arms race” between web search companies and spammers

Page 43: Search Engines

Inverted Index
• This is the primary data structure for text indexes
• Main idea:
  • Invert documents into a big index
• Basic steps:
  • Make a “dictionary” of all the tokens in the collection
  • For each token, list all the docs it occurs in
  • Do a few things to reduce redundancy in the data structure

Page 44: Search Engines

Inverted indexes

• Permit fast search for individual terms

• For each term, you get a list consisting of:

• document ID

• frequency of term in doc (optional)

• position of term in doc (optional)

• These lists can be used to solve Boolean queries

• Also used for statistical ranking algorithms

Page 45: Search Engines

Inverted Indexes
An inverted file is a vector file “inverted” so that rows become columns and columns become rows:

docs  t1  t2  t3          Terms  D1  D2  D3  D4  D5  D6  D7 …
D1     1   0   1          t1      1   1   0   1   1   1   0
D2     1   0   0          t2      0   0   1   0   1   1   1
D3     0   1   1          t3      1   0   1   0   1   0   0
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1

Page 46: Search Engines

How Are Inverted Files Created
• Documents are parsed to extract tokens. These are saved with the document ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term–Doc# pairs in order of occurrence:
(now,1) (is,1) (the,1) (time,1) (for,1) (all,1) (good,1) (men,1) (to,1) (come,1) (to,1) (the,1) (aid,1) (of,1) (their,1) (country,1) (it,2) (was,2) (a,2) (dark,2) (and,2) (stormy,2) (night,2) (in,2) (the,2) (country,2) (manor,2) (the,2) (time,2) (was,2) (past,2) (midnight,2)

Page 47: Search Engines

How Inverted Files are Created
After all documents have been parsed, the inverted file is sorted alphabetically by term:

(a,2) (aid,1) (all,1) (and,2) (come,1) (country,1) (country,2) (dark,2) (for,1) (good,1) (in,2) (is,1) (it,2) (manor,2) (men,1) (midnight,2) (night,2) (now,1) (of,1) (past,2) (stormy,2) (the,1) (the,1) (the,2) (the,2) (their,1) (time,1) (time,2) (to,1) (to,1) (was,2) (was,2)

Page 48: Search Engines

How Inverted Files are Created
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.

(term, doc#, freq):
(a,2,1) (aid,1,1) (all,1,1) (and,2,1) (come,1,1) (country,1,1) (country,2,1) (dark,2,1) (for,1,1) (good,1,1) (in,2,1) (is,1,1) (it,2,1) (manor,2,1) (men,1,1) (midnight,2,1) (night,2,1) (now,1,1) (of,1,1) (past,2,1) (stormy,2,1) (the,1,2) (the,2,2) (their,1,1) (time,1,1) (time,2,1) (to,1,2) (was,2,2)

Page 49: Search Engines

How Inverted Files are Created
The sorted, merged file is then split into a dictionary and a postings file:

Dictionary (term, #docs, total freq)   Postings (doc#, freq)
a         1  1                          (2,1)
aid       1  1                          (1,1)
all       1  1                          (1,1)
and       1  1                          (2,1)
come      1  1                          (1,1)
country   2  2                          (1,1) (2,1)
dark      1  1                          (2,1)
for       1  1                          (1,1)
good      1  1                          (1,1)
in        1  1                          (2,1)
is        1  1                          (1,1)
it        1  1                          (2,1)
manor     1  1                          (2,1)
men       1  1                          (1,1)
midnight  1  1                          (2,1)
night     1  1                          (2,1)
now       1  1                          (1,1)
of        1  1                          (1,1)
past      1  1                          (2,1)
stormy    1  1                          (2,1)
the       2  4                          (1,2) (2,2)
their     1  1                          (1,1)
time      2  2                          (1,1) (2,1)
to        1  2                          (1,2)
was       1  2                          (2,2)
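A compact sketch of these steps applied to Doc 1 and Doc 2; the regex tokenizer is an assumption (the lecture leaves tokenization open):

```python
import re
from collections import Counter, defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}

# postings: term -> {doc_id: within-document frequency}
postings = defaultdict(Counter)
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z]+", text.lower()):  # crude tokenizer (assumption)
        postings[token][doc_id] += 1

# dictionary: term -> (number of docs, total frequency), sorted by term
dictionary = {
    term: (len(freqs), sum(freqs.values()))
    for term, freqs in sorted(postings.items())
}
print(dictionary["the"])      # (2, 4)
print(dict(postings["the"]))  # {1: 2, 2: 2}
```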

Page 50: Search Engines

Inverted indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
  • document ID
  • frequency of term in doc (optional)
  • position of term in doc (optional)
• These lists can be used to solve Boolean queries:
  • country -> d1, d2
  • manor -> d2
  • country AND manor -> d2
• Also used for statistical ranking algorithms

Page 51: Search Engines

How Inverted Files are Used
(Using the dictionary and postings file built above.)

Query: “time” AND “dark”
• “time” appears in 2 docs per the dictionary -> IDs 1 and 2 from the postings file
• “dark” appears in 1 doc per the dictionary -> ID 2 from the postings file
• Therefore, only doc 2 satisfies the query.
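Resolving the Boolean query is then a set intersection over the postings lists; a sketch reusing the postings mapping from the previous example:

```python
def boolean_and(postings, *terms):
    # Intersect the sets of document ids, one set per query term.
    doc_sets = [set(postings.get(t, {})) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

print(boolean_and(postings, "time", "dark"))  # {2}
```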

Page 52: Search Engines

Web Search Architecture

Page 53: Search Engines

Web Search Architecture

• Preprocessing
  • Collection gathering phase
    • Web crawling
  • Collection indexing phase
• Online
  • Query servers

Page 54: Search Engines

An Example Search System: Cha-Cha
• A system for searching complex intranets
• Places retrieval results in context
• Important design goals:
  • Users at any level of computer expertise
  • Browsers at any version level
  • Computers of any speed

Pages 55–58: [figures only; no recoverable text]
Page 59: Search Engines

How Cha-Cha Works

• Crawl the Intranet

• Compute the shortest hyperlink path from a certain root page to every web page

• Index and compute metadata for the pages
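The shortest hyperlink paths from the root can be computed with a breadth-first traversal; a minimal sketch over an in-memory link graph (the dict-of-outlinks representation is an assumption):

```python
from collections import deque

def shortest_paths(root, outlinks):
    """Map each reachable page to its shortest hyperlink path from root.

    outlinks: dict mapping a page URL to the list of URLs it links to.
    """
    paths = {root: [root]}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for link in outlinks.get(page, []):
            if link not in paths:  # first visit in BFS = shortest path
                paths[link] = paths[page] + [link]
                queue.append(link)
    return paths
```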

Page 60: Search Engines

Cha-Cha System Architecture
[Diagram: crawl the web -> store the documents]

Page 61: Search Engines

Cha-Cha System Architecture
[Diagram: crawl the web -> store the documents -> create a keyword index; also create files of metadata (Cheshire II)]

Page 62: Search Engines

Creating a Keyword Index
• For each document:
  • Tokenize the document
    • Break it up into tokens: words, stems, punctuation
    • There are many variations on this
  • Record which tokens occurred in this document
• Called an Inverted Index
  • Dictionary: a record of all the tokens in the collection and their overall frequency
  • Postings file: a list recording, for each token, which documents it occurs in and how often

Page 63: Search Engines

Responding to the User Query

• User searches on “pam samuelson”

• Search Engine looks up documents indexed with one or both terms in its inverted index

• Search Engine looks up titles and shortest paths in the metadata index

• User Interface combines the information and presents the results as HTML

Page 64: Search Engines

Standard Web Search Engine Architecture
[Diagram: crawl the web -> check for duplicates, store the documents -> create an inverted index -> search engine servers; user query -> search engine servers -> DocIds -> show results to user]

Page 65: Search Engines

Inverted Indexes for Web Search Engines
• Inverted indexes for word lists
• Some systems partition the indexes across different machines; each machine handles different parts of the data
• Other systems duplicate the data across many machines; queries are distributed among the machines
• Most do a combination of these
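A toy sketch of the partitioned approach: postings are split by document id across “machines” (plain dicts here; all data is illustrative), each partition answers the query locally, and the partial results are merged:

```python
# Each "machine" holds the postings for a disjoint slice of the documents.
partitions = [
    {"cat": {1: 3, 2: 9}, "dog": {1: 3}},   # machine 0: docs 1-2
    {"cat": {5: 1}, "fish": {3: 1, 5: 2}},  # machine 1: docs 3-5
]

def search(term):
    # Fan the query out to every partition, then merge the partial postings.
    merged = {}
    for machine in partitions:
        merged.update(machine.get(term, {}))
    return merged

print(search("cat"))  # {1: 3, 2: 9, 5: 1}
```

Because the document slices are disjoint, merging is a simple union; replicating whole partitions (the “rows” in the FAST example that follows) adds query throughput without changing this logic.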

Page 66: Search Engines

From description of the FAST search engine, by Knut Risvik

http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm

In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.

Each row can handle 120 queries per second

Each column can handle 7M pages

To handle more queries, add another row.

Page 67: Search Engines

Cascading Allocation of CPUs

• A variation on this that produces a cost savings:
  • Put high-quality/common pages on many machines
  • Put lower-quality/less common pages on fewer machines
  • Query goes to the high-quality machines first
  • If no hits are found there, go to the other machines

Page 68: Search Engines

Web Crawlers

• How do the web search engines get all of the items they index?

• Main idea:
  • Start with known sites
  • Record information for these sites
  • Follow the links from each site
  • Record information found at new sites
  • Repeat

Page 69: Search Engines

Web Crawlers
• How do the web search engines get all of the items they index?
• More precisely (see the sketch below):
  • Put a set of known sites on a queue
  • Repeat the following until the queue is empty:
    • Take the first page off of the queue
    • If this page has not yet been processed:
      – Record the information found on this page (positions of words, links going out, etc.)
      – Add each link on the current page to the queue
      – Record that this page has been processed
• In what order should the links be followed?
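A minimal sketch of this loop (fetching and parsing are stubbed out behind a get_links callable, which is an assumption; a real crawler would also respect robots.txt and handle errors):

```python
from collections import deque

def crawl(seed_urls, get_links):
    """Queue-based crawl. seed_urls seeds the queue;
    get_links(url) returns the outgoing links of a fetched page."""
    queue = deque(seed_urls)
    processed = set()
    while queue:
        url = queue.popleft()        # take the first page off the queue
        if url in processed:
            continue
        # ... record words, positions, and outgoing links for `url` here ...
        processed.add(url)
        queue.extend(get_links(url)) # add each link to the back of the queue
    return processed
```

A FIFO queue makes this crawl breadth-first; swapping in a stack would make it depth-first, which is exactly the ordering question above.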

Page 70: Search Engines

Page Visit Order
Animated examples of breadth-first vs. depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
[Figure: structure to be traversed]

Page 71: Search Engines

Web Crawling Issues
• “Keep-out” signs
  • A file called robots.txt tells the crawler which directories are off limits
• Freshness
  • Figure out which pages change often
  • Recrawl these often
• Duplicates, virtual hosts, etc.
  • Convert page contents with a hash function
  • Compare new pages to the hash table
• Lots of problems
  • Server unavailable
  • Incorrect HTML
  • Missing links
  • Infinite loops
• Web crawling is difficult to do robustly!
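Two of these issues map to standard building blocks; a sketch using only Python's standard library (the URLs and user-agent string are placeholders):

```python
import hashlib
from urllib.robotparser import RobotFileParser

# "Keep-out" signs: parse a site's robots.txt before crawling it.
rp = RobotFileParser("http://example.com/robots.txt")
rp.read()  # fetches and parses the file
print(rp.can_fetch("MyCrawler", "http://example.com/private/page.html"))

# Duplicate detection: hash page contents, compare against pages seen so far.
seen = set()
def is_duplicate(page_bytes: bytes) -> bool:
    digest = hashlib.sha1(page_bytes).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False
```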

Page 72: Search Engines

Commercial Issues

• General internet search is often commercially driven
• The commercial sector sometimes hides things – harder to track than research
• On the other hand, most CTOs of search engine companies used to be researchers, and so help us out
• Commercial search engine information changes monthly
• Sometimes motivations are commercial rather than technical

Page 73: Search Engines

For More Information

• IS213: Information Organization and Retrieval http://www.sims.berkeley.edu/courses/is202/f00/Assignments.html

• Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto, Addison Wesley, 1999. http://www.sims.berkeley.edu/~hearst/irbook

• Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, in the Proceedings of WWW7 / Computer Networks 30(1-7): 107-117, 1998. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

• Jurgen Koenemann and Nicholas J. Belkin, A Case for Interaction: A Study of Interactive Information Retrieval Behavior and Effectiveness, in the Proceedings of ACM/CHI, Vancouver, 1996.

• Marti Hearst, Improving Full-Text Precision on Short Queries using Simple Constraints, Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval (SDAIR), Las Vegas, NV, April 1996. http://www.sims.berkeley.edu/~hearst/publications.shtml
