
20-751 ECOMMERCE TECHNOLOGY

FALL 2003

COPYRIGHT © 2003 MICHAEL I. SHAMOS

Lecture 5: Search Engines


Outline

• Search engines: key tools for ecommerce
– Buyers and sellers must find each other
• How do they work?
• How much do they index?
• How are hits ordered?
• Can the order be changed?


Search Engines

• Tools for finding information on the Web
– Problem: “hidden” databases, e.g. New York Times
• Directory
– A hand-constructed hierarchy of topics (e.g. Yahoo)
• Search engine
– A machine-constructed index (usually by keyword)
• So many search engines, we now need search engines to find them, e.g. searchenginecolossus.com


Indexing

• Arrangement of data (data structure) to permit fast searching
• Which list is easier to search?
    sow fox pig eel yak hen ant cat dog hog
    ant cat dog eel fox hen hog pig sow yak
• Sorting helps. Why?
– Permits binary search. About log2(n) probes into list
    • log2(1 billion) ≈ 30
– Permits interpolation search. About log2(log2(n)) probes on average, for evenly distributed keys
    • log2(log2(1 billion)) ≈ 5
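The two search methods can be sketched as follows. This is a minimal illustration, not a production implementation: the function names and the probe counters are assumptions added here, and the interpolation search assumes numeric keys so that a position can be estimated from the key value.

```python
import math

def binary_search(items, target):
    """Return (index, probes) for target in a sorted list; index is -1 if absent."""
    lo, hi = 0, len(items) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if items[mid] == target:
            return mid, probes
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes

def interpolation_search(keys, target):
    """Like binary search, but guesses the position from the key's value."""
    lo, hi = 0, len(keys) - 1
    probes = 0
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        probes += 1
        if keys[hi] == keys[lo]:
            pos = lo
        else:
            # Estimate where target sits between keys[lo] and keys[hi].
            pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos, probes
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1, probes

animals = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]
print(binary_search(animals, "pig"))          # (7, 2): found after two probes

evenly_spaced = list(range(0, 1000, 10))
print(interpolation_search(evenly_spaced, 370))  # (37, 1): one probe on uniform keys

print(math.log2(10**9))             # about 29.9
print(math.log2(math.log2(10**9)))  # about 4.9
```

The unsorted list offers no such shortcut: finding "pig" there means scanning entries one by one.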


Inverted Files

FILE (left column: position of the first word on each line)

POS 1   A file is a list of words by position
POS 10  – First entry is the word in position 1 (first word)
POS 20  – Entry 4562 is the word in position 4562 (4562nd word)
POS 30  – Last entry is the last word
POS 36  An inverted file is a list of positions by word!

INVERTED FILE (each word with the positions where it occurs)

a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (43)
word (14, 19, 24, 29, 35, 45)
words (7)
4562 (21, 27)
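A minimal sketch of building an inverted file from the slide's own text. The tokenization rule (lowercase, split on anything that is not a letter or digit) and the name `invert` are assumptions made for this illustration.

```python
import re
from collections import defaultdict

TEXT = """A file is a list of words by position
First entry is the word in position 1 (first word)
Entry 4562 is the word in position 4562 (4562nd word)
Last entry is the last word
An inverted file is a list of positions by word!"""

def invert(text):
    """Map each word to the list of positions (1-based) where it occurs."""
    index = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", text.lower())
    for position, word in enumerate(words, start=1):
        index[word].append(position)
    return dict(index)

index = invert(TEXT)
print(index["word"])   # [14, 19, 24, 29, 35, 45]
print(index["entry"])  # [11, 20, 31]
```

Looking up a word is now a single dictionary access instead of a scan through all 45 positions.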


Inverted Files for Multiple Documents

LEXICON (each entry's PTR points to that word's list in the WORD INDEX)

WORD         NDOCS
jezebel      20
jezer        3
jezerit      1
jeziah       1
jeziel       1
jezliah      1
jezoar       1
jezrahliah   1
jezreel      39

WORD INDEX

DOCID  OCCUR  POS 1  POS 2  . . .

Entries for “jezebel”:
34     6      1 118 2087 3922 3981 5002
44     3      215 2291 3010
56     4      5 22 134 992
. . .

“jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .

Entries shown for other words (the figure's pointers to the lexicon are not recoverable):
566    3      203 245 287
67     1      132
. . .
107    4      322 354 381 405
232    6      15 195 248 1897 1951 2192
677    1      481
713    3      42 312 802
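A minimal sketch of the lexicon and word index as in-memory structures, using the three “jezebel” entries from the figure. The names `Posting`, `WORD_INDEX`, `lookup` and `describe` are hypothetical, and a real engine would keep these structures on disk, with the lexicon's PTR giving the offset of each word's entries. Only 3 of the 20 documents listed for “jezebel” are shown on the slide, so only those appear here.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Posting:
    """One row of the word index: a document and the word's positions in it."""
    doc_id: int
    positions: List[int]

    @property
    def occurrences(self) -> int:
        return len(self.positions)

# word -> entries for that word (the dictionary plays the role of the lexicon)
WORD_INDEX: Dict[str, List[Posting]] = {
    "jezebel": [
        Posting(34, [1, 118, 2087, 3922, 3981, 5002]),
        Posting(44, [215, 2291, 3010]),
        Posting(56, [5, 22, 134, 992]),
    ],
}

def lookup(word: str) -> List[Posting]:
    """Return the entries for a word, or an empty list if it is not indexed."""
    return WORD_INDEX.get(word.lower(), [])

def describe(word: str) -> List[str]:
    return [f"{p.occurrences} times in document {p.doc_id}" for p in lookup(word)]

print(describe("jezebel"))
# ['6 times in document 34', '3 times in document 44', '4 times in document 56']
```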


Search Engine Architecture

• Spider
– Crawls the web to find pages. Follows hyperlinks. Never stops
• Indexer
– Produces data structures for fast searching of all words in the pages
• Retriever
– Query interface
– Database lookup to find hits
    • 2 billion documents
    • 4 TB RAM, many terabytes of disk
– Ranking


Crawlers (Spiders, Bots)

• Retrieve web pages for indexing by search engines

• Start with an initial page P0. Find URLs on P0 and add them to a queue
• When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat
• Can be specialized (e.g. only look for email addresses)
• Issues
– Which page to look at next? (Special subjects, recency)
– Avoid overloading a site
– How deep within a site to go (drill-down)?
– How frequently to visit pages?
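The queue-driven loop can be sketched as below. To keep the example self-contained it crawls a hypothetical in-memory "web" (`WEB`) rather than fetching real pages; `fetch_links`, `index_page` and `max_pages` are illustrative names, and a first-in-first-out queue is only one possible answer to "which page next?".

```python
from collections import deque

# Hypothetical web: page name -> URLs found on that page.
WEB = {
    "p0": ["p1", "p2"],
    "p1": ["p2", "p3"],
    "p2": ["p0"],
    "p3": [],
}

def crawl(start, fetch_links, index_page, max_pages=100):
    """Visit pages breadth-first from start, indexing each page once."""
    queue = deque([start])
    seen = {start}              # avoids queueing the same URL twice
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        for url in fetch_links(page):
            if url not in seen:
                seen.add(url)
                queue.append(url)
        index_page(page)        # done with the page: pass it to the indexer
        order.append(page)
    return order

indexed = []
order = crawl("p0", lambda page: WEB.get(page, []), indexed.append)
print(order)  # ['p0', 'p1', 'p2', 'p3']
```

A real crawler would add delays between requests to the same site and a depth limit, which address the "overloading" and "drill-down" issues listed above.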


Query Specification

• Boolean
– AND, OR, NOT, PHRASE “ ”, NEAR ~
– But keyword query is artificial
• Question-answering (simulated)
– “Who offers a master’s degree in ecommerce?”
• Date range
• Relevance specification
– In Altavista, can specify terms by importance (separate from query specification)
• Content
– multimedia, MP3, .PPT files
• Stemming: eat, eats, eaten, eating, eater, (ate!)
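A minimal sketch of how the Boolean operators can be evaluated against an inverted index. The three documents and all function names are made up for illustration; AND, OR and NOT become set operations on document ids, while PHRASE and NEAR also need the word positions.

```python
import re

# Hypothetical document collection: doc id -> text.
DOCS = {
    1: "buy cheap books online",
    2: "sell rare books",
    3: "buy rare coins online today",
}

def build_index(docs):
    """word -> {doc_id: [positions]} with 1-based positions."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"\w+", text.lower()), start=1):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

INDEX = build_index(DOCS)

def postings(word):
    return INDEX.get(word.lower(), {})

def docs_with(word):
    return set(postings(word))

def near(w1, w2, k):
    """Documents where w1 and w2 occur within k words of each other."""
    p1, p2 = postings(w1), postings(w2)
    return {d for d in p1.keys() & p2.keys()
            if any(abs(a - b) <= k for a in p1[d] for b in p2[d])}

def phrase(w1, w2):
    """Documents where w2 immediately follows w1."""
    p1, p2 = postings(w1), postings(w2)
    return {d for d in p1.keys() & p2.keys()
            if any(a + 1 in p2[d] for a in p1[d])}

print(docs_with("buy") & docs_with("online"))   # buy AND online    -> {1, 3}
print(docs_with("books") | docs_with("coins"))  # books OR coins    -> {1, 2, 3}
print(docs_with("buy") - docs_with("books"))    # buy AND NOT books -> {3}
print(phrase("rare", "books"))                  # "rare books"      -> {2}
print(near("buy", "online", 3))                 # buy NEAR online   -> {1, 3}
```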


“Advanced” Query Specification

• Multimedia, e.g. Google
• Date range
• Relevance specification
– In Altavista, can specify terms by importance (separate from query specification)
• Content
– multimedia, MP3, .PPT files
• Stemming
• Language
• Search depth (from site’s front page)


Ranking (Scoring) Hits

• Hits must be presented in some order
• What order?
– Relevance, recency, popularity, reliability?
• Some ranking methods
– Presence of keywords in title of document
– Closeness of keywords to start of document
– Frequency of keyword in document
– Link popularity (how many pages point to this one)
• Can the user control? Can the page owner control?
• Can you find out what order is used?
• Spamdexing: influencing retrieval ranking by altering a web page. (Puts “spam” in the index)
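A minimal sketch combining the first three ranking methods into one score. The weights, the formula and the two sample pages are assumptions chosen for illustration; real engines do not publish their scoring functions.

```python
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def score(keyword, title, body, weights=(3.0, 2.0, 1.0)):
    """Score one page for one keyword; higher is better."""
    w_title, w_early, w_freq = weights
    keyword = keyword.lower()
    words = tokenize(body)
    in_title = 1.0 if keyword in tokenize(title) else 0.0
    if keyword in words:
        first = words.index(keyword)
        earliness = 1.0 - first / len(words)          # 1.0 when the body starts with it
        frequency = words.count(keyword) / len(words)
    else:
        earliness = frequency = 0.0
    return w_title * in_title + w_early * earliness + w_freq * frequency

page_a = score("books", "Cheap books", "books for sale books")
page_b = score("books", "Our store", "we sell many books")
print(page_a, page_b)  # 5.5 0.75
```

A score like this also shows why spamdexing works: repeating a keyword in the title and at the top of the page raises every term of the sum.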


Google’s PageRank Algorithm

• Assumption: A link in page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
• The “quality” of a page is related to the number of links that point to it (its in-degree)
• Apply recursively: Quality of a page is related to
– its in-degree, and to
– the quality of pages linking to it

PageRank Algorithm (Brin & Page, 1998)
SOURCE: GOOGLE


Definition of PageRank

• Consider the following infinite random walk (surfing):
– Initially the surfer is at a random page
– At each step, the surfer proceeds
    • to a randomly chosen web page with probability d
    • to a randomly chosen successor of the current page with probability 1-d
• The PageRank of a page p is the fraction of steps the surfer spends at p as the number of steps approaches infinity

SOURCE: GOOGLE


PageRank Formula

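$$\mathrm{PageRank}(p) = \frac{d}{n} + (1-d) \sum_{(q,p) \in E} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$

where n is the total number of nodes in the graph and E is the set of links, with (q, p) a link from page q to page p

• Google uses a link-following probability of about 0.85, i.e. 1-d ≈ 0.85 (d ≈ 0.15) in this notation; Brin & Page's paper gives the name d to the link-following probability instead
• PageRank is a probability distribution over web pages
• The sum of all PageRanks of all pages is 1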

SOURCE: GOOGLE
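A minimal sketch of computing PageRank by repeated application of the formula (power iteration). Here d is the random-jump probability, as in the random-surfer definition, so the default 0.15 corresponds to following links 85% of the time. The three-page graph, the fixed iteration count and the treatment of pages with no outgoing links (their rank is spread evenly over all pages) are assumptions of this sketch, not part of the slide.

```python
def pagerank(links, d=0.15, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}    # random-jump term d/n
        for q in pages:
            targets = links.get(q, [])
            if targets:
                share = (1 - d) * rank[q] / len(targets)  # PageRank(q)/outdegree(q)
                for p in targets:
                    new_rank[p] += share
            else:
                # Assumption: a page without links passes its rank to every page.
                for p in pages:
                    new_rank[p] += (1 - d) * rank[q] / n
        rank = new_rank
    return rank

# Hypothetical graph: A links to P and B, B links to P, P links back to A.
ranks = pagerank({"A": ["P", "B"], "B": ["P"], "P": ["A"]})
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

In this graph P ends up with the highest rank because both other pages link to it, and the three values sum to 1.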


PageRank Example

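[Figure: pages A and B each link to page P; the formula below implies that A has 4 outgoing links and B has 3]

PageRank of P is

(1-d)[(PageRank of A)/4 + (PageRank of B)/3] + d/n

PAGERANK CALCULATOR
SOURCE: GOOGLE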
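A worked instance of this expression, with made-up inputs. The values of PageRank(A), PageRank(B), d and n below are hypothetical; only the divisors 4 and 3 come from the example.

```python
def pagerank_of_p(rank_a, rank_b, d, n):
    """(1-d)[PageRank(A)/4 + PageRank(B)/3] + d/n"""
    return (1 - d) * (rank_a / 4 + rank_b / 3) + d / n

# Hypothetical inputs: PageRank(A) = 0.2, PageRank(B) = 0.3, d = 0.15, n = 10
# 0.85 * (0.05 + 0.10) + 0.015 = 0.1275 + 0.015 = 0.1425
value = pagerank_of_p(rank_a=0.2, rank_b=0.3, d=0.15, n=10)
print(round(value, 4))  # 0.1425
```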


Link Popularity

• How many pages link to this page?
– on the whole Web
– in our database?
• www.linkpopularity.com
• Link popularity is used for ranking
– Many measures
– Number of links in
– Weighted number of links in (by weight of referring page)
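The two measures named above can be sketched as follows. The link list and the page weights are hypothetical; the weight of a referring page could be any quality score, for example its own link popularity.

```python
from collections import Counter, defaultdict

# Hypothetical links as (referring page, target page) pairs.
LINKS = [("a", "p"), ("b", "p"), ("c", "p"), ("a", "b"), ("p", "c")]

def links_in(links):
    """Number of links pointing to each page."""
    return Counter(target for _source, target in links)

def weighted_links_in(links, weight):
    """Sum of the referring pages' weights for each target page."""
    totals = defaultdict(float)
    for source, target in links:
        totals[target] += weight.get(source, 0.0)
    return dict(totals)

WEIGHTS = {"a": 0.5, "b": 0.25, "c": 0.25, "p": 1.0}

print(links_in(LINKS)["p"])                # 3
print(weighted_links_in(LINKS, WEIGHTS))   # {'p': 1.0, 'b': 0.5, 'c': 1.0}
```

Page c has a single incoming link but scores as high as p under the weighted measure, because that one link comes from a heavily weighted page.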


Search Engine Sizes (Sept. 2, 2003)

SOURCE: SEARCHENGINEWATCH.COM

[Chart: index size of each engine, in BILLIONS OF PAGES]

ATW AllTheWeb
AV Altavista
GG Google
INK Inktomi
TMA Teoma

SEARCHES/DAY (MILLIONS): 250, 80, 18

250 million searches per day is about 2900 per second!


Search Engine Usage

[Charts: SHARE BY SEARCH SITE; SHARE BY ENGINE]

SOURCE: SEARCHENGINEWATCH.COM


Search Engine Disjointness

SOURCE: SEARCHENGINESHOWDOWN

Four searches, 10 engines, total of 141 hits on March 6, 2002


Search Engine EKG

SOURCE: SEARCHENGINEWATCH.COM

Shows activity of the Lycos crawler at one sample site, calafia.com, by number of pages visited during each crawl


Search Engine EKG Comparison

SOURCE: SEARCHENGINEWATCH.COM


Search Engine Differences

• Coverage (number of documents)
• Spidering algorithms (visit SpiderCatcher)
– Frequency, depth of visits
• Indexing policies
• Search interfaces
• Ranking
• One solution: use a metasearcher (search agent)


Metasearchers

• All the engines operate differently. Different
– sizes
– query languages
– crawling algorithms
– storage policies (stop words, punctuation, fonts)
– freshness
– ranking
• Submit the same query to many engines and collect the results
• Metacrawler
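A minimal sketch of the "submit and collect" step. The three engine functions are stand-ins that return fixed result lists, and merging by summed reciprocal rank is an assumption of this sketch; the slide does not say how a metasearcher should combine the lists.

```python
from collections import defaultdict

# Hypothetical engines: each returns its hits for a query, best first.
def engine_a(query):
    return ["x.com", "y.com", "z.com"]

def engine_b(query):
    return ["y.com", "w.com"]

def engine_c(query):
    return ["y.com", "x.com"]

def metasearch(query, engines):
    """Send the query to every engine and merge the ranked lists."""
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank   # a top hit counts more than a low one
    # Highest combined score first; ties broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))

results = metasearch("ecommerce degree", [engine_a, engine_b, engine_c])
print(results)  # ['y.com', 'x.com', 'w.com', 'z.com']
```

Duplicate hits collapse into one entry, and a page found by several engines rises above a page found by only one.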


Clustering

• Viewing large numbers of unstructured hits is not useful
• Answer: cluster them
• Vivisimo
• Kartoo
• iBoogie
• SurfWax


Search Spying

• Peeking at queries as they are being submitted
• AllTheWeb
• Metaspy. Spies on Metacrawler
• AskJeeves
• Epicurious (recipes)
• StockCharts.com
• Yahoo buzz index
• Kanoodle
• IQSeek


Time Spent Per Visitor (minutes) by Search Engine, Jan. 2003

SOURCE: SEARCHENGINEWATCH.COM

AJ Ask Jeeves
AOL America Online
AV Altavista
ELNK EarthLink
GG Google
ISP InfoSpace
LS LookSmart
LY Lycos
MSN Microsoft
NS Netscape
OVR Overture
YH Yahoo

Up 58% in ONE YEAR!


Audience Reach by Search Site, Jan. 2003

AJ Ask Jeeves
AOL America Online
AV Altavista
ELNK EarthLink
GG Google
ISP InfoSpace
LS LookSmart
LY Lycos
MSN Microsoft
NS Netscape
OVR Overture
YH Yahoo

Audience Reach = % of active surfers visiting during month.

Totals exceed 100% because of overlap

SOURCE: SEARCHENGINEWATCH.COM


Robot Exclusion

• You may want certain pages kept out of search indexes but still viewable by browsers, so you can’t simply protect the directory.

• Some crawlers conform to the Robot Exclusion Protocol. Compliance is voluntary. One way to enforce: firewall

• They look for file robots.txt at highest directory level in domain. If domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt

• A specific document can be shielded from a crawler by adding the line: <META NAME="ROBOTS" CONTENT="NOINDEX">


Robots Exclusion Protocol

• Format of robots.txt
– Two fields. User-agent to specify a robot
– Disallow to tell the agent what to ignore
• To exclude all robots from a server:
    User-agent: *
    Disallow: /
• To exclude one robot from two directories:
    User-agent: WebCrawler
    Disallow: /news/
    Disallow: /tmp/
• View the robots.txt specification.
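A well-behaved crawler can check these rules before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module and feeds it the two rule sets above as text instead of downloading a real robots.txt; the page URLs and the robot name "OtherBot" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

EXCLUDE_ALL = """\
User-agent: *
Disallow: /
"""

EXCLUDE_ONE = """\
User-agent: WebCrawler
Disallow: /news/
Disallow: /tmp/
"""

def make_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

everyone_out = make_parser(EXCLUDE_ALL)
one_robot = make_parser(EXCLUDE_ONE)

NEWS = "http://www.ecom.cmu.edu/news/today.html"
HOME = "http://www.ecom.cmu.edu/index.html"

print(everyone_out.can_fetch("WebCrawler", HOME))  # False: all robots excluded
print(one_robot.can_fetch("WebCrawler", NEWS))     # False: /news/ is disallowed
print(one_robot.can_fetch("WebCrawler", HOME))     # True
print(one_robot.can_fetch("OtherBot", NEWS))       # True: rules name WebCrawler only
```

Nothing stops a crawler from skipping this check, which is why compliance is described as voluntary.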


Key Takeaways

• Engines are a critical Web resource
• Very sophisticated, high technology
• They don’t cover the Web completely
• Spamdexing is a problem
• New paradigms needed as Web grows
• What about images, music, video?
– www.corbis.com, Google images


Q&A

