IR Winter 2010
14. Webometrics: The Bow-tie model …
Brief history of the Web
• FTP/Gopher
• WWW (1989)
• Archie (1990)
• Mosaic (1993)
• Webcrawler (1994)
• Lycos (1994)
• Yahoo! (1994)
• Google (1998)
Size
• The Web is the largest repository of data and it grows exponentially.
– 320 Million Web pages [Lawrence & Giles 1998]
– 800 Million Web pages, 15 TB [Lawrence & Giles 1999]
– 20 Billion Web pages indexed [as of 2010]
• Amount of data
– roughly 200 TB [Lyman et al. 2003]
Zipfian properties
• In-degree
• Out-degree
• Visits to a page
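A hedged aside (not on the original slide): "Zipfian" here means these quantities follow heavy-tailed power laws. For the in-degree, for instance, Broder et al. (2000) report a distribution of roughly

P(\text{in-degree} = k) \propto k^{-\alpha}, \qquad \alpha \approx 2.1

so a small fraction of pages attracts a very large share of all in-links.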
Bow-tie model of the Web
[Figure: bow-tie structure of the Web graph. SCC (strongly connected core): 56 M pages; IN: 44 M; OUT: 44 M; TENDRILS: 44 M; DISCONNECTED: 17 M. About 24% of pages are reachable from a given page.]
Broder et al. WWW 2000, Dill et al. VLDB 2001
Measuring the size of the web
• Using extrapolation methods
• Random queries and their coverage by different search engines
• Overlap between search engines
• HTTP requests to random IP addresses
Bharat and Broder 1998
• Based on crawls of HotBot, AltaVista, Excite, and InfoSeek
• 10,000 queries in mid and late 1997
• Estimate is 200M pages
• Only 1.4% are indexed by all of them
Example (from Bharat & Broder)
A similar approach by Lawrence and Giles yields 320M pages (Lawrence and Giles 1998).
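The overlap-based estimates above can be read as a capture-recapture argument. The following is only a sketch, assuming the two engines index pages roughly independently; the numbers are illustrative, not figures from either paper. If engine A indexes |A| pages, engine B indexes |B| pages, and their overlap is |A ∩ B| pages, the total size N is estimated as

\hat{N} \approx \frac{|A| \cdot |B|}{|A \cap B|}

For example, two hypothetical engines of 100M pages each that share 50M pages give \hat{N} \approx (100M \cdot 100M) / 50M = 200M pages.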
What makes Web IR different?
• Much bigger
• No fixed document collection
• Users
• Non-human users
• Varied user base
• Miscellaneous user needs
• Dynamic content
• Evolving content
• Spam
• Infinite size – size is whatever can be indexed!
IR Winter 2010
15. Crawling the Web: Hypertext retrieval & Web-based IR, Document closures, Focused crawling …
Web crawling
• The HTTP/HTML protocols
• Following hyperlinks
• Some problems (the first two are sketched below):
– Link extraction
– Link normalization
– Robot exclusion
– Loops
– Spider traps
– Server overload
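Link extraction and normalization can be sketched with standard CPAN modules. The snippet below is a minimal illustration, assuming HTML::LinkExtor and URI are available; the base URL and page content are hypothetical.

use strict;
use warnings;
use HTML::LinkExtor;
use URI;

my $base = 'http://www.example.edu/dir/page.html';   # hypothetical base URL
my $html = '<a href="../a.html#top">one</a> <a href="HTTP://WWW.EXAMPLE.EDU/a.html">two</a>';

# Extract all links from the (already fetched) page.
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

# Normalize: absolutize relative links, lowercase scheme/host, drop fragments.
my %urls;
for my $link ($parser->links) {
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' and defined $attr{href};
    my $u = URI->new_abs($attr{href}, $base)->canonical;
    $u->fragment(undef);
    $urls{$u} = 1;          # both links above collapse to the same URL
}
print "$_\n" for sort keys %urls;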
Example
• U-M’s root robots.txt file:
• http://www.umich.edu/robots.txt
– User-agent: *
– Disallow: /~websvcs/projects/
– Disallow: /%7Ewebsvcs/projects/
– Disallow: /~homepage/
– Disallow: /%7Ehomepage/
– Disallow: /~smartgl/
– Disallow: /%7Esmartgl/
– Disallow: /~gateway/
– Disallow: /%7Egateway/
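A rough sketch of how a crawler honors such a file, assuming the CPAN modules LWP::Simple and WWW::RobotRules; the bot name and the test URL are made up for illustration.

use strict;
use warnings;
use LWP::Simple qw(get);
use WWW::RobotRules;

# The name given here is matched against User-agent lines in robots.txt.
my $rules = WWW::RobotRules->new('ExampleBot/0.1');

# Fetch and parse the site's robots.txt once per host.
my $robots_url = 'http://www.umich.edu/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

# Ask before fetching any page on that host.
my $url = 'http://www.umich.edu/~homepage/index.html';
print $rules->allowed($url) ? "allowed: $url\n" : "disallowed by robots.txt: $url\n";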
Example crawler
• E.g., poacher
– http://search.cpan.org/~neilb/Robot-0.011/examples/poacher
– Included in clairlib
&ParseCommandLine();
&Initialise();
$robot->run($siteRoot);

#=======================================================================
# Initialise() - initialise global variables, contents, tables, etc.
# This function sets up various global variables such as the version
# number for WebAssay, the program name identifier, usage statement, etc.
#=======================================================================
sub Initialise
{
    $robot = new WWW::Robot(
        'NAME'      => $BOTNAME,
        'VERSION'   => $VERSION,
        'EMAIL'     => $EMAIL,
        'TRAVERSAL' => $TRAVERSAL,
        'VERBOSE'   => $VERBOSE,
    );
    $robot->addHook('follow-url-test',     \&follow_url_test);
    $robot->addHook('invoke-on-contents',  \&process_contents);
    $robot->addHook('invoke-on-get-error', \&process_get_error);
}

#=======================================================================
# follow_url_test() - tell the robot module whether it should follow a link
#=======================================================================
sub follow_url_test {}

#=======================================================================
# process_get_error() - hook function invoked whenever a GET fails
#=======================================================================
sub process_get_error {}

#=======================================================================
# process_contents() - process the contents of a URL we've retrieved
#=======================================================================
sub process_contents
{
    run_command($COMMAND, $filename) if defined $COMMAND;
}
Focused crawling
• Topical locality
– Pages that are linked are similar in content (and vice versa: Davison 00, Menczer 02, 04, Radev et al. 04)
• The radius-1 hypothesis
– Given that page i is relevant to a query and that page i points to page j, then page j is also likely to be relevant (at least, more so than a random Web page)
• Focused crawling
– Keeping a priority queue of the most relevant pages (a toy sketch follows)
Challenges in indexing the web
• Page importance varies a lot
• Anchor text
• User modeling
• Detecting duplicates
• Dealing with spam (content-based and link-based)
Duplicate detection
• Shingles, e.g. the 3-word shingles of "to be or not to be":
– TO BE OR
– BE OR NOT
– OR NOT TO
– NOT TO BE
• Then use the Jaccard coefficient (size of intersection / size of union) to determine similarity (sketched below)
• Hashing
• Shingling (separate lecture)
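A minimal sketch of word-level shingling and the Jaccard coefficient described above; the two test strings are illustrative, and a real system would hash the shingles rather than compare raw sets.

use strict;
use warnings;

# Return the set of k-word shingles of a text as a hash reference.
sub shingles {
    my ($text, $k) = @_;
    my @w = split ' ', lc $text;
    my %set;
    for my $i (0 .. $#w - $k + 1) {
        $set{ join ' ', @w[ $i .. $i + $k - 1 ] } = 1;
    }
    return \%set;
}

# Jaccard coefficient: |intersection| / |union| of two shingle sets.
sub jaccard {
    my ($x, $y) = @_;
    my $inter = grep { exists $y->{$_} } keys %$x;
    my %union = (%$x, %$y);
    return %union ? $inter / keys %union : 0;
}

my $s1 = shingles('to be or not to be', 3);              # TO BE OR, BE OR NOT, ...
my $s2 = shingles('to be or not to sleep', 3);
printf "Jaccard similarity: %.2f\n", jaccard($s1, $s2);  # 3 shared / 5 total = 0.60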
Document closures for Q&A
[Figure: two pages P connected by a link L; the terms "capital", "spain", and "Madrid" are split across the two pages, so only their combination (the document closure) covers a question such as "What is the capital of Spain?"]
Document closures for IR
[Figure: two pages P connected by a link L; "Physics Department" and "University of Michigan" appear on different pages, so their combination (the document closure) covers a query such as "Physics Michigan".]
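A hedged sketch of one way to use this idea at indexing time (a simplification, not the full closure construction): index each page under its own terms plus the anchor text of links pointing to it, so a query such as "capital of Spain" can match a page that only mentions "Madrid". The pages, link, and anchor text below are hypothetical.

use strict;
use warnings;

# Hypothetical pages and one hypothetical link with its anchor text.
my %page_text = (
    pageA => 'european capitals: spain, france, italy',
    pageB => 'madrid tourist information',
);
my @links = (
    { from => 'pageA', to => 'pageB', anchor => 'capital of spain' },
);

# Expand each page's indexable text with the anchor text of incoming links.
my %index_text = %page_text;
$index_text{ $_->{to} } .= ' ' . $_->{anchor} for @links;

# "capital" never occurs on pageB itself, but its expanded text now matches.
for my $page (sort keys %index_text) {
    print "$page matches 'capital'\n" if $index_text{$page} =~ /\bcapital\b/;
}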
The link-content hypothesis
• Topical locality: a page is similar in content to the page that points to it.
• Davison (TF*IDF, 100K pages)
– 0.31 same domain
– 0.23 linked pages
– 0.19 sibling
– 0.02 random
• Menczer (373K pages, non-linear least squares fit)
• Chakrabarti (focused crawling): probability of losing the topic
Van Rijsbergen 1979, Chakrabarti et al. WWW 1999, Davison SIGIR 2000, Menczer 2001
Menczer's fitted decay of content similarity with link distance \delta:

\sigma(\delta) \approx \sigma_\infty + (1 - \sigma_\infty)\, e^{-\alpha_1 \delta^{\alpha_2}}, \qquad \sigma_\infty \approx 0.03,\ \alpha_1 = 1.8,\ \alpha_2 = 0.6