Information Retrieval
CSE 8337 (Part B), Spring 2009

Some material for these slides obtained from:
• Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/
• Data Mining Introductory and Advanced Topics by Margaret H. Dunham, http://www.engr.smu.edu/~mhd/book
• Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze, http://informationretrieval.org
CSE 8337 Outline
• Introduction
• Simple Text Processing
• Boolean Queries
• Web Searching/Crawling
• Indexes
• Vector Space Model
• Matching
• Evaluation
Web Searching TOC
• Web Overview
• Searching
• Ranking
• Crawling
Web Overview
• Size: >11.5 billion pages (2005); grows at more than 1 million pages a day
• Google indexes over 3 billion documents
• Diverse types of data
• http://www.google.com/support/websearch/bin/topic.py?topic=8996
Web Data
• Web pages
• Intra-page structures
• Inter-page structures
• Usage data
• Supplemental data: profiles, registration information, cookies
Zipf's Law Applied to Web
• Describes the distribution of the frequency of occurrence of words in text
• "The frequency of the i-th most frequent word is 1/i^θ times that of the most frequent word" (θ = 1 in the classical formulation)
• http://www.nslij-genetics.org/wli/zipf/
Heaps' Law Applied to Web
• Measures the size of the vocabulary in a text of size n: O(n^β)
• β is normally less than 1
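To make these two laws concrete, here is a minimal Python sketch, assuming a reasonably large plain-text corpus in a file named corpus.txt (a placeholder): it prints rank-frequency ratios, which Zipf's law predicts grow roughly like i^θ, and vocabulary-size samples, whose log-log growth Heaps' law predicts to be linear with slope β < 1.

    # Rough empirical check of Zipf's and Heaps' laws on a token stream.
    from collections import Counter

    tokens = open("corpus.txt").read().lower().split()  # placeholder corpus

    # Zipf: f_1 / f_i should grow roughly like i^theta
    freqs = [f for _, f in Counter(tokens).most_common()]
    for i in (1, 2, 5, 10, 100):
        print(i, freqs[0] / freqs[i - 1])

    # Heaps: vocabulary size V(n) ~ K * n^beta, with beta < 1
    vocab, samples = set(), []
    for n, tok in enumerate(tokens, 1):
        vocab.add(tok)
        if n % 10000 == 0:
            samples.append((n, len(vocab)))
    print(samples)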
Web search basics
[Diagram: the user issues a query to the search interface; a web spider crawls the Web and feeds pages to the indexer, which builds the indexes used by search; ad indexes are maintained separately. The slide shows a sample results page for the query "miele": about 7,310,000 results in 0.12 seconds, ranked web results alongside sponsored links.]
How far do people look for results?
[Chart omitted. Source: iprospect.com, WhitePaper_2006_SearchEngineUserBehavior.pdf]
Users' empirical evaluation of results
• Quality of pages varies widely; relevance is not enough
• Other desirable qualities (non-IR!):
  • Content: trustworthy, diverse, non-duplicated, well maintained
  • Web readability: display correctly and fast
  • No annoyances: pop-ups, etc.
• Precision vs. recall: on the web, recall seldom matters
• What matters: precision at 1? precision above the fold?
• Comprehensiveness: must be able to deal with obscure queries; recall matters when the number of matches is very small
• User perceptions may be unscientific, but are significant over a large aggregate
Users' empirical evaluation of engines
• Relevance and validity of results
• UI: simple, no clutter, error tolerant
• Trust: results are objective
• Coverage of topics for polysemic queries
• Pre/post-process tools provided:
  • Mitigate user errors (auto spell check, search assist, ...)
  • Explicit: search within results, more like this, refine, ...
  • Anticipative: related searches
• Deal with idiosyncrasies:
  • Web-specific vocabulary: impact on stemming, spell-check, etc.
  • Web addresses typed in the search box
Simplest forms
• First-generation engines relied heavily on tf/idf
• The top-ranked pages for the query maui resort were the ones containing the most occurrences of maui and resort
• SEOs (search engine optimizers) responded with dense repetitions of chosen terms, e.g., maui resort maui resort maui resort
• Often the repetitions would be in the same color as the background of the web page: the repeated terms got indexed by crawlers but were not visible to humans in browsers
• Pure word density cannot be trusted as an IR signal
Term frequency tf
• The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d
• Raw term frequency is not what we want:
  • A document with 10 occurrences of the term is more relevant than a document with one occurrence
  • But not 10 times more relevant
• Relevance does not increase proportionally with term frequency
Log-frequency weighting
• The log-frequency weight of term t in d is

    w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
            = 0                     otherwise

• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t appearing in both q and d:

    score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

• The score is 0 if none of the query terms is present in the document
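A minimal sketch of this scoring in Python, on toy tokenized documents (no index):

    import math
    from collections import Counter

    def log_tf_score(query_terms, doc_tokens):
        # sum of (1 + log10 tf) over query terms that occur in the document
        tf = Counter(doc_tokens)
        return sum(1 + math.log10(tf[t]) for t in query_terms if tf[t] > 0)

    print(log_tf_score(["maui", "resort"], "maui resort maui beach".split()))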
Document frequency
• Rare terms are more informative than frequent terms (recall stop words)
• Consider a term in the query that is rare in the collection (e.g., arachnocentric)
• A document containing this term is very likely to be relevant to the query arachnocentric
• → We want a high weight for rare terms like arachnocentric
Document frequency, continued
• Consider a query term that is frequent in the collection (e.g., high, increase, line)
• For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms
• We will use document frequency (df) to capture this in the score
• df_t (≤ N) is the number of documents that contain the term
idf weight
• df_t is the document frequency of t: the number of documents that contain t
• df_t is an inverse measure of the informativeness of t
• We define the idf (inverse document frequency) of t by

    idf_t = log10(N / df_t)

• We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf
• It will turn out that the base of the log is immaterial
idf example, suppose N = 1 million

    term        df_t        idf_t
    calpurnia   1           6
    animal      100         4
    sunday      1,000       3
    fly         10,000      2
    under       100,000     1
    the         1,000,000   0

There is one idf value for each term t in a collection.
Collection vs. Document frequency
• The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences
• Example:

    Word        Collection frequency    Document frequency
    insurance   10440                   3997
    try         10422                   8760

• Which word is a better search term (and should get a higher weight)?
tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight:

    w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

• Best-known weighting scheme in information retrieval
• Note: the "-" in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf, tfidf, tf/idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
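A small sketch of tf-idf scoring over a toy in-memory collection (the document contents are made up for illustration):

    import math
    from collections import Counter

    docs = {1: "maui resort sunny beach".split(),
            2: "cheap resort deals resort resort".split(),
            3: "hiking in maui".split()}
    N = len(docs)

    # document frequency: number of documents containing each term
    df = Counter(t for toks in docs.values() for t in set(toks))

    def tfidf_score(query, doc_tokens):
        tf = Counter(doc_tokens)
        return sum((1 + math.log10(tf[t])) * math.log10(N / df[t])
                   for t in query if tf[t] > 0)

    for d in docs:
        print(d, tfidf_score(["maui", "resort"], docs[d]))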
Search engine optimization (Spam)
• Motives: commercial, political, religious, lobbies; promotion funded by an advertising budget
• Operators: search engine optimizers working for lobbies and companies; web masters; hosting services
• Forums, e.g., WebmasterWorld (www.webmasterworld.com): search-engine-specific tricks; discussions about academic papers
Cloaking
• Serve fake content to the search engine spider
• DNS cloaking: switch IP address; impersonate
• How do you identify a spider?
[Diagram: if the requester is a search engine spider, serve the spam page; otherwise serve the real document.]
More spam techniques
• Doorway pages: pages optimized for a single keyword that redirect to the real target page
• Link spamming: mutual admiration societies, hidden links, awards (more on these later)
• Domain flooding: numerous domains that point or redirect to a target page
• Robots: fake query streams, rank-checking programs
The war against spam
• Quality signals: prefer authoritative pages based on votes from authors (linkage signals) and votes from users (usage signals)
• Policing of URL submissions: anti-robot tests
• Limits on meta-keywords
• Robust link analysis: ignore statistically implausible linkage (or text); use link analysis to detect spammers (guilt by association)
• Spam recognition by machine learning: training set based on known spam
• Family-friendly filters: linguistic analysis, general classification techniques, etc.; for images: flesh-tone detectors, source text analysis, etc.
• Editorial intervention: blacklists, top queries audited, complaints addressed, suspect pattern detection
More on spam
• Web search engines have policies on the SEO practices they tolerate or block:
  http://help.yahoo.com/help/us/ysearch/index.html
  http://www.google.com/intl/en/webmasters/
• Adversarial IR: the unending (technical) battle between SEOs and web search engines
• Research: http://airweb.cse.lehigh.edu
Ranking
• Order documents based on relevance to the query (a similarity measure)
• Ranking has to be performed without accessing the text, just the index
• Details of the ranking algorithms are kept "top secret"; it is almost impossible to measure recall, as the number of relevant pages can be quite large even for simple queries
Ranking
• Some of the newer ranking algorithms also use hyperlink information
• An important difference between the Web and traditional IR databases: the number of hyperlinks that point to a page provides a measure of its popularity and quality
• Links in common between pages often indicate a relationship between those pages
Ranking
Three examples of ranking techniques based on link analysis:
• WebQuery
• HITS (hub/authority pages)
• PageRank
WebQuery
• WebQuery takes a set of Web pages (for example, the answer to a query) and ranks them based on how connected each Web page is
• http://www.cgl.uwaterloo.ca/Projects/Vanish/webquery-1.html
HITS
• Kleinberg's ranking scheme depends on the query and considers the set S of pages that point to, or are pointed at by, pages in the answer
• Pages in S that have many links pointing to them are called authorities
• Pages that have many outgoing links are called hubs
• Better authority pages have incoming edges from good hubs, and better hub pages have outgoing edges to good authorities
Ranking
The hub and authority scores reinforce each other, summing over pages in S:

    H(p) = Σ_{u ∈ S : p → u} A(u)
    A(p) = Σ_{v ∈ S : v → p} H(v)
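A minimal power-iteration sketch of these updates on a made-up toy graph (a real implementation would restrict edges to the base set S built from the query answer):

    # links[p] lists the pages that p points to
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(50):
        # authority of p: sum of hub scores of pages pointing to p
        auth = {p: sum(hub[v] for v in pages if p in links[v]) for p in pages}
        # hub of p: sum of authority scores of pages p points to
        hub = {p: sum(auth[u] for u in links[p]) for p in pages}
        for d in (auth, hub):  # normalize so the scores stay bounded
            norm = sum(x * x for x in d.values()) ** 0.5
            for p in d:
                d[p] /= norm

    print(sorted(pages, key=auth.get, reverse=True))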
PageRank
• Used in Google
• PageRank simulates a user navigating the Web at random, who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 − q
• This process can be modeled with a Markov chain, from which the stationary probability of being at each page can be computed
• Let C(a) be the number of outgoing links of page a, and suppose that page a is pointed to by pages p1 to pn
PageRank (cont'd)

    PR(p) = c (PR(p1)/N1 + … + PR(pn)/Nn)

• PR(pi): PageRank of page pi, which points to the target page p
• Ni: number of links going out of page pi (the C(pi) of the previous slide)
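A minimal sketch of the corresponding power iteration with the random-jump term, on a made-up graph with no dangling pages:

    out = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a", "c"]}
    pages = list(out)
    N = len(pages)
    q = 0.15  # probability of jumping to a random page

    pr = {p: 1.0 / N for p in pages}
    for _ in range(50):
        pr = {p: q / N + (1 - q) * sum(pr[v] / len(out[v])
                                       for v in pages if p in out[v])
              for p in pages}

    print(sorted(pr.items(), key=lambda kv: -kv[1]))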
Conclusion
• Today's search engines basically use the Boolean or vector models and their variations
• Link analysis techniques seem to be the "next generation" of search engines
• Indexes: compression and distributed architecture are key
Crawlers
• A robot (spider) traverses the hypertext structure of the Web, collecting information from visited pages that is used to construct indexes for search engines
• Traditional crawler: visits the entire Web (?) and replaces the index
• Periodic crawler: visits portions of the Web and updates a subset of the index
• Incremental crawler: selectively searches the Web and incrementally modifies the index
• Focused crawler: visits pages related to a particular subject
Crawling the Web
• The order in which URLs are traversed is important
• Under a breadth-first policy, we first look at all the pages linked from the current page, and so on. This matches well with Web sites that are structured by related topics; coverage is wide but shallow, and a Web server can be bombarded with many rapid requests
• In the depth-first case, we follow the first link of a page and do the same on that page, until we cannot go deeper, returning recursively
• Good ordering schemes can make a difference if better pages are crawled first (PageRank)
Crawling the Web
• Because robots can overwhelm a server with rapid requests and can use significant Internet bandwidth, a set of guidelines for robot behavior has been developed
• Crawlers can also have problems with HTML pages that use frames or image maps; in addition, dynamically generated pages and password-protected pages cannot be indexed
Focused Crawler
• Only visit links from a page if that page is determined to be relevant
• Components:
  • Classifier: assigns a relevance score to each page based on the crawl topic; also determines how useful outgoing links are
  • Distiller: identifies hub pages
  • Crawler: visits pages based on the classifier and distiller scores
• Hub pages contain links to many relevant pages; they must be visited even if their own relevance score is not high
Focused Crawler
[Architecture diagram omitted.]
Basic crawler operation
• Begin with known "seed" pages
• Fetch and parse them; extract the URLs they point to
• Place the extracted URLs on a queue
• Fetch each URL on the queue and repeat
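A minimal, self-contained sketch of this loop (the regex link extraction is a crude stand-in for a real HTML parser, and politeness and robots.txt checks are omitted):

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(seeds, max_pages=50):
        frontier = deque(seeds)          # queue of URLs to fetch
        seen = set(seeds)
        while frontier and len(seen) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue                 # skip unreachable pages
            for href in re.findall(r'href="([^"#]+)"', html):
                link = urljoin(url, href)      # normalize relative links
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return seen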
Crawling picture
[Diagram: seed pages start the crawl; URLs are crawled and parsed, feeding new URLs from the unseen Web into the URL frontier.]
Simple picture – complications
• Web crawling isn't feasible with one machine: all of the above steps must be distributed
• Even non-malicious pages pose challenges:
  • Latency and bandwidth to remote servers vary
  • Webmasters' stipulations: how "deep" should you crawl a site's URL hierarchy?
  • Site mirrors and duplicate pages
• Malicious pages: spam pages, spider traps
• Politeness: don't hit a server too often
What any crawler must do
• Be polite: respect implicit and explicit politeness considerations
  • Only crawl allowed pages
  • Respect robots.txt (more on this shortly)
• Be robust: be immune to spider traps and other malicious behavior from web servers
What any crawler should do
• Be capable of distributed operation: designed to run on multiple distributed machines
• Be scalable: designed to increase the crawl rate by adding more machines
• Performance/efficiency: permit full use of available processing and network resources
What any crawler should do
• Fetch pages of "higher quality" first
• Continuous operation: continue fetching fresh copies of previously fetched pages
• Extensible: adapt to new data formats and protocols
Updated crawling picture
[Diagram: multiple crawling threads pull from the URL frontier; pages are crawled and parsed, and newly discovered URLs from the unseen Web join the frontier, which was initialized with the seed pages.]
URL frontier
• Can include multiple pages from the same host
• Must avoid trying to fetch them all at the same time
• Must try to keep all crawling threads busy
Explicit and implicit politeness
• Explicit politeness: specifications from webmasters on what portions of a site can be crawled (robots.txt)
• Implicit politeness: even with no specification, avoid hitting any site too often
Robots.txt
• Protocol for giving spiders ("robots") limited access to a website, originally from 1994: www.robotstxt.org/wc/norobots.html
• A website announces what can(not) be crawled by placing a file named robots.txt at the root of the site's URL hierarchy
• This file specifies the access restrictions
Robots.txt example
No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow:
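Python's standard library can parse and apply such files; a small sketch (www.example.com is a placeholder host):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()   # fetch and parse the file once; cache the parser per site

    # with the rules above, "searchengine" may fetch this URL, others may not
    print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/x.html"))
    print(rp.can_fetch("otherbot", "http://www.example.com/yoursite/temp/x.html"))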
Processing steps in crawling
• Pick a URL from the frontier (which one?)
• Fetch the document at the URL
• Parse the fetched document: extract links from it to other docs (URLs)
• Check if the URL has content already seen; if not, add it to the indexes
• For each extracted URL:
  • Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
  • Check if it is already in the frontier (duplicate URL elimination)
Basic crawl architecture
[Diagram: URL frontier → fetch (resolving hosts via DNS, fetching from the WWW) → parse → "content seen?" test against doc fingerprints → URL filter with robots filters → duplicate URL elimination against the URL set → back into the URL frontier.]
DNS (Domain Name System)
• A lookup service on the internet: given a URL, retrieve the IP address of its host
• The service is provided by a distributed set of servers, so lookup latencies can be high (even seconds)
• Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
• Solutions: DNS caching; a batch DNS resolver that collects requests and sends them out together
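A minimal sketch of the caching idea (single-threaded; a real crawler would add locking, expiry, and batching):

    import socket

    _dns_cache = {}

    def resolve(host):
        # cache results of the blocking lookup so each host is resolved once
        if host not in _dns_cache:
            _dns_cache[host] = socket.gethostbyname(host)
        return _dns_cache[host]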
Parsing: URL normalization
• When a fetched document is parsed, some of the extracted links are relative URLs
• E.g., http://en.wikipedia.org/wiki/Main_Page has a relative link to /wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
• During parsing, such relative URLs must be normalized (expanded)
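In Python this normalization is one standard-library call:

    from urllib.parse import urljoin

    base = "http://en.wikipedia.org/wiki/Main_Page"
    print(urljoin(base, "/wiki/Wikipedia:General_disclaimer"))
    # -> http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer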
Content seen?
• Duplication is widespread on the web
• If the page just fetched is already in the index, do not process it further
• This is verified using document fingerprints or shingles
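A small sketch of the shingle idea: two pages whose word k-shingle sets have high Jaccard overlap are treated as (near-)duplicates.

    def shingles(text, k=4):
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    s1 = shingles("a rose is a rose is a rose")
    s2 = shingles("a rose is a rose is a flower")
    print(jaccard(s1, s2))   # close to 1.0 suggests a near-duplicate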
Filters and robots.txt
• Filters: regular expressions over the URLs to be crawled (or not)
• Once a robots.txt file has been fetched from a site, there is no need to fetch it repeatedly: doing so burns bandwidth and hits the web server
• Cache robots.txt files
Duplicate URL elimination
• For a non-continuous (one-shot) crawl, test whether an extracted and filtered URL has already been passed to the frontier
• For a continuous crawl, see the details of the frontier implementation below
Distributing the crawler
• Run multiple crawl threads, under different processes, potentially at different (geographically distributed) nodes
• Partition the hosts being crawled among the nodes, using a hash for the partition
• How do these nodes communicate?
URL frontier: two main considerations
• Politeness: do not hit a web server too frequently
• Freshness: crawl some pages more often than others, e.g., pages (such as news sites) whose content changes often
These goals may conflict with each other. (E.g., a simple priority queue fails: many links out of a page go to its own site, creating a burst of accesses to that site.)
Politeness – challenges
• Even if we restrict each host to a single fetching thread, that host can still be hit repeatedly
• Common heuristic: insert a time gap between successive requests to a host that is >> the time taken by the most recent fetch from that host
URL frontier: Mercator scheme
[Diagram: incoming URLs pass through a prioritizer into K front queues; a biased front queue selector and a back queue router move URLs into B back queues, with a single host on each; a back queue selector serves crawl threads requesting URLs.]
Mercator URL frontier
• URLs flow in from the top into the frontier
• Front queues manage prioritization
• Back queues enforce politeness
• Each queue is FIFO
• http://mercator.comm.nsdlib.org/
Front queues
[Diagram: the prioritizer feeds front queues 1 through K; the biased front queue selector / back queue router drains them.]
Front queues
• The prioritizer assigns each URL an integer priority between 1 and K and appends the URL to the corresponding queue
• Heuristics for assigning priority:
  • Refresh rate sampled from previous crawls
  • Application-specific rules (e.g., "crawl news sites more often")
Biased front queue selector
• When a back queue requests a URL (in a sequence described below), the selector picks a front queue from which to pull a URL
• This choice can be round robin biased toward queues of higher priority, or some more sophisticated variant; it can be randomized
Back queues
[Diagram: the biased front queue selector / back queue router feeds back queues 1 through B; the back queue selector drains them.]
Back queue invariants
• Each back queue is kept non-empty while the crawl is in progress
• Each back queue contains URLs from only a single host
• Maintain a table mapping each host name to its back queue number (1 through B)
Back queue heap
• One entry for each back queue
• The entry is the earliest time t_e at which the host corresponding to that back queue can be hit again
• This earliest time is determined from the last access to that host and any time-buffer heuristic we choose
Back queue processing
A crawler thread seeking a URL to crawl:
• Extracts the root of the heap
• Fetches the URL at the head of the corresponding back queue q (looked up from the table)
• Checks if queue q is now empty; if so, pulls a URL v from the front queues:
  • If there is already a back queue for v's host, appends v to that queue and pulls another URL from the front queues; repeat
  • Else adds v to q
• When q is non-empty again, creates a new heap entry for it
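A compact sketch of the heap discipline (the per-host gap is a constant here, and refilling an emptied back queue from the front queues is omitted):

    import heapq
    import time

    GAP = 10.0                        # politeness gap per host, in seconds
    back_queues = {1: ["http://a.example/x", "http://a.example/y"],
                   2: ["http://b.example/z"]}
    heap = [(0.0, 1), (0.0, 2)]       # (earliest time host may be hit, queue id)

    def next_url():
        t_e, qid = heapq.heappop(heap)           # soonest-ready back queue
        time.sleep(max(0.0, t_e - time.time()))  # wait out the politeness gap
        url = back_queues[qid].pop(0)
        if back_queues[qid]:                     # re-arm this queue's heap entry
            heapq.heappush(heap, (time.time() + GAP, qid))
        return url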
Number of back queues B
• Keep all threads busy while respecting politeness
• Mercator recommendation: three times as many back queues as crawler threads