Internet Archive & Web Datamining Raymie Stata UC Santa Cruz & Internet Archive.

Date post: 04-Jan-2016
Internet Archive & Web Datamining

Raymie Stata

UC Santa Cruz & Internet Archive

Agenda

• State of the Archive
  – Collections
  – Infrastructure (freecache)

• Internet Analytics
  – Information carnivores

Archive Overview

• Started in 1996

• Transitioned from “Archive of the Internet” to “Archive on the Internet”

• Transitioning to “Digital Library of the Future”

• Funding from private foundations, plus lots of volunteers

Digital Library of the Future

• Information is accessible to anyone from anywhere

• The best and broadest information is available

• We imagine a small network of very large, regional, “mega” digital libraries

Universal Access to Human Knowledge

Web collection

• Over 10B pages, 200TB, 50M sites
  – Broad crawls (20TB snapshot/2 months)
  – Narrow crawls (elections, 9/11)
  – “Heritage crawls”
  – Writing new crawler :-(

• Wayback machine
  – Success! 4M hits/day
  – Have search engine, but hidden!
  – Policy has been tested, remains same

Moving images

• 2500 Movies

• “Open source movies”
  – Upload your movie to the Archive
  – Build a movie at the Archive!

Texts

• Have > 20K books

• Actively involved in “1M Book” and ICDL

• Bookmobile
  – Protest of Eldred
  – Real interest turned out to be overseas
    • India (30!), Egypt, Uganda
  – Spun into separate non-profit

Audio - eTree

• Around 5,000 concerts from 250 bands
  – Growing 30 concerts, 1 band/day

• Largest consumer of bandwidth
  – Consistent 85Mbps (downloads)

• Same policy as Wayback
  – We respect requests

Infrastructure

• Infinite bandwidth and storage
  – Core competency of the Archive
  – Vision, not reality
  – But striving for it makes us better

• Recent challenges
  – Moving from 250TB to 1PB
  – Supporting eTree bandwidth

The Petabyte challenge

• Finally having problems predicted
  – Power, cooling, disk failures dominating
  – Need larger staff, real software engineering

• BUT:
  – Took much longer than anticipated
  – Sticking to our philosophies
    • Commodity hardware
    • Widely used software + simple scripts

The Petabyte architecture

• New datacenter
  – To solve our power and cooling problems

• Better “procurement” process

• File-level mirroring
  – Use basic FS, simple scripts
  – Preparing for geoplexing (vs. file-level RAID)

• Elimination of inter-crawl copies
  – This is currently our “backup”
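
The file-level mirroring idea above (a basic filesystem plus simple scripts) can be sketched roughly as follows. This is only an illustration, not the Archive's actual scripts: the names sha256_of and mirror_tree, the use of SHA-256 checksums to decide what to copy, and the policy of never deleting from the mirror are all assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror_tree(primary: Path, mirror: Path) -> list[str]:
    """Copy every file under `primary` that is missing or different in `mirror`.

    Returns the relative paths that were copied. Files that exist only in
    the mirror are left alone (a hypothetical policy chosen for this sketch).
    """
    copied = []
    for source in sorted(p for p in primary.rglob("*") if p.is_file()):
        relative = source.relative_to(primary)
        target = mirror / relative
        if not target.exists() or sha256_of(source) != sha256_of(target):
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)
            copied.append(relative.as_posix())
    return copied
```

Because each file is mirrored as an ordinary file, either copy can be read with standard tools, which may be what makes this layout easier to spread across sites than block-level schemes.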

The (eTree) bandwidth challenge

• Can we do better than simply buying more bandwidth?

• Yes! Find other people willing to help

• Cooperative/open-source CDN

Freecache.org

• It shouldn’t cost you to give away content

• To distribute using freecache, simply:
  – Replace: href=http://X/Y
  – With: href=http://freecache.org/http://X/Y

• To be a distribution node, simply install a 1K perl-script on your Apache server
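
The link rewrite described above is just a URL prefix, so it can be scripted. A minimal sketch, assuming plain http:// links in an HTML fragment; the helper names to_freecache and rewrite_hrefs are hypothetical and not part of freecache itself.

```python
import re

FREECACHE_PREFIX = "http://freecache.org/"


def to_freecache(url: str) -> str:
    """Prefix an http:// URL with the freecache.org address."""
    if url.startswith(FREECACHE_PREFIX):
        return url  # already rewritten, leave untouched
    return FREECACHE_PREFIX + url


def rewrite_hrefs(html: str) -> str:
    """Rewrite quoted or unquoted href=http://... attributes in a fragment."""
    pattern = re.compile(r'href=(["\']?)(http://[^"\'\s>]+)\1')

    def replace(match: re.Match) -> str:
        quote, url = match.group(1), match.group(2)
        return f"href={quote}{to_freecache(url)}{quote}"

    return pattern.sub(replace, html)
```

For example, rewrite_hrefs('<a href="http://X/Y">file</a>') yields a link to http://freecache.org/http://X/Y, and running it a second time leaves the link unchanged.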

Freecache design

• Content routing done centrally

• Right now, routing is random– Working on “closeness-driven” routing

• LRU eviction policy

• Throttles “cheaters”

• Broken browsers have been a problem
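
Two of the design points above, random routing and LRU eviction, can be illustrated with a short sketch. The slides give no implementation details, so the class LRUCache, the function pick_node, and the byte-based capacity limit are assumptions made for illustration.

```python
import random
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Byte-bounded cache that evicts the least recently used entry first."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # url -> body, oldest first

    def get(self, url: str) -> Optional[bytes]:
        if url not in self._items:
            return None
        self._items.move_to_end(url)  # mark as most recently used
        return self._items[url]

    def put(self, url: str, body: bytes) -> None:
        if url in self._items:
            self.used_bytes -= len(self._items.pop(url))
        if len(body) > self.capacity_bytes:
            return  # too large to cache at all
        self._items[url] = body
        self.used_bytes += len(body)
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._items.popitem(last=False)  # drop oldest entry
            self.used_bytes -= len(evicted)


def pick_node(nodes: list[str], rng: random.Random) -> str:
    """Central routing decision: a uniform random choice among nodes."""
    return rng.choice(nodes)
```

A "closeness-driven" router would replace the body of pick_node with a choice based on some distance estimate between client and node; how that estimate is obtained is not described in the slides.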

“Web scale” datamining

[Figure: layered architecture]

• Apps & Access: use data
  – Wayback, Wayback search
  – Web characterization
  – Story lifecycle analyzer

• Feature Datamarts: access subsets of data fast
  – Full-text index, shingleprints
  – Connectivity, term vectors

• Warehouse: store and access pages
  – Page cache
  – Feature extractor

• Data collection: download web pages
  – Donations, crawling
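
As a rough illustration of one datamart feature named above, shingleprints, here is a sketch of the common w-shingling technique for near-duplicate detection. The slides do not say how the Archive computes shingleprints; the function names, the 4-word window, and the SHA-1 based 64-bit fingerprint are all assumptions.

```python
import hashlib


def shingleprints(text: str, width: int = 4) -> set[int]:
    """Hash every run of `width` consecutive words to a 64-bit fingerprint."""
    words = text.lower().split()
    if len(words) < width:
        windows = [words] if words else []
    else:
        windows = [words[i:i + width] for i in range(len(words) - width + 1)]
    prints = set()
    for window in windows:
        digest = hashlib.sha1(" ".join(window).encode("utf-8")).digest()
        prints.add(int.from_bytes(digest[:8], "big"))
    return prints


def resemblance(a: set[int], b: set[int]) -> float:
    """Jaccard similarity of two fingerprint sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two captures of the same page that differ only in a small region share most of their fingerprints, so their resemblance stays close to 1.0.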

Tools for Web mining

• Very similar to the Astronomy project
  – Need indexes, parallelism
  – Need to move computation to the data
  – Strategies to deal with different result-set sizes

• Current focus is on the “warehouse”

Web datamining using Web Carnivores

The Carnivore Analogy [Etzioni96]

Web pages

The Carnivore Analogy

Web pages

Search engines

The Carnivore Analogy

Web pages

Search engines

Carnivore apps

Carnivores

• Search engines have what you want
  – Google has 3B pages: “It’s in there”
  – No need to crawl anymore

• However, their general-purpose interfaces do not always yield good results for specific information needs

Googlisms: a fun carnivore

Googlism for: scott kirkpatrick

scott kirkpatrick is an associate for ross
scott kirkpatrick is an awesome drummer with many fine credits to his name
scott kirkpatrick is 17 but certified as an adult
scott kirkpatrick is listed as one of the executors in the will of george hankins dated 1 october 1838 in jackson county
scott kirkpatrick is the new chairperson
scott kirkpatrick is joining the flett chiropractic clinic

Googlism for: john kubiatowicz

john kubiatowicz is a professor in computer science at uc berkeley
john kubiatowicz is currently an assistant professor at the university of california at berkeley
john kubiatowicz is designing a
john kubiatowicz is working on oceanstore
john kubiatowicz is a researcher at berkeley exploring the space of introspective computing
john kubiatowicz is a doctoral candidate in the department of electrical engineering and computer science at mit
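
Output like the lists above could plausibly be produced by pattern-matching over search-result snippets. The slides do not say how Googlism works internally, so the following is only a guess at the general technique; the function googlisms and its snippet input are hypothetical.

```python
import re


def googlisms(name: str, snippets: list[str]) -> list[str]:
    """Collect '<name> is ...' clauses from snippets, lowercased, no repeats."""
    pattern = re.compile(rf"\b{re.escape(name.lower())} is [^.!?\n]+")
    seen = set()
    results = []
    for snippet in snippets:
        for match in pattern.finditer(snippet.lower()):
            clause = match.group(0).strip()
            if clause not in seen:
                seen.add(clause)
                results.append(clause)
    return results
```

The carnivore here never crawls anything itself: it sends the name to a search engine, takes the returned snippets as input, and post-processes them.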

A carnivore for genre search

• Genre classifies documents by their intent
  – Why was the document written?

• Search engines search by topic, not genre

• Idea: build a carnivore for genre search

Genre search engine

[Figure: genre search engine pipeline]

• Inputs: Topic (from user), Genre (static)
• Term-vector generation, Google
• Query generation, Google, Filter, Results
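
The pipeline in the figure (query generation, search engine, filter) can be sketched as below. This is an illustrative reading of the diagram, not the authors' implementation: the search engine is passed in as a hypothetical callable named search, and the bag-of-words vectors, cosine filter, and threshold value are assumptions.

```python
import math
from collections import Counter
from typing import Callable


def term_vector(text: str) -> Counter:
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def genre_search(
    topic: str,
    genre_vector: Counter,
    search: Callable[[str], list[str]],
    threshold: float = 0.2,
    top_terms: int = 2,
) -> list[str]:
    """Query generation, then the search engine, then a genre filter."""
    genre_terms = [term for term, _ in genre_vector.most_common(top_terms)]
    query = " ".join([topic, *genre_terms])  # query generation
    candidates = search(query)  # stand-in for the search engine call
    return [
        doc for doc in candidates  # filter: keep documents close to the genre
        if cosine(term_vector(doc), genre_vector) >= threshold
    ]
```

In this sketch the genre vector is built once per genre (the "static" input), while the topic changes with every user request.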

Making it work

• Query templates
  – Details of the query matter

• PMI-IR for genre terms

• Discrimination as well as genre vector
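
PMI-IR scores how strongly a term is associated with a phrase by comparing search-engine hit counts. A minimal sketch of the standard pointwise mutual information estimate from hit counts, assuming a hypothetical hits callable that returns the number of matching pages and an assumed total index size; the exact variant and query syntax used in this work are not given in the slides.

```python
import math
from typing import Callable


def pmi_ir(
    term: str,
    genre_phrase: str,
    hits: Callable[[str], int],
    total_pages: int,
) -> float:
    """Estimate PMI(term, genre phrase) in bits from hit counts."""
    joint = hits(f'{term} "{genre_phrase}"')  # pages containing both
    term_hits = hits(term)
    genre_hits = hits(f'"{genre_phrase}"')
    if min(joint, term_hits, genre_hits) == 0:
        return float("-inf")  # no co-occurrence evidence
    # log2( P(term, genre) / (P(term) * P(genre)) ) with P = hits / total
    return math.log2(joint * total_pages / (term_hits * genre_hits))
```

Terms with a high score for a genre phrase are candidates for the genre term vector, while terms with scores near zero occur with the phrase about as often as chance would predict.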

User study

• Genre: “Buying guides”
  – Education for product selection
  – Lots on the Web, but hard to find
  – (Agreement on what they are)

• Results
  – Topic by itself: 0% P@10 (i.e., none in top 10)
  – Topic + “buying guide”: 33%
  – Our carnivore: 51%
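
P@10 is precision at rank 10: the fraction of the top ten results that a judge marks as belonging to the genre. A small sketch of the metric, with the function name precision_at_k chosen here for illustration:

```python
def precision_at_k(relevant_flags: list[bool], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    Ranks with no result (fewer than k results returned) count as misses.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevant_flags[:k]) / k
```

A query whose top ten results contain no buying guides scores 0.0, matching the "none in top 10" reading above.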

