Common Crawl: enabling machine-scale analysis of web data. Lisa Green, Kurt Bollacker, Jordan...

Post on 15-Jan-2016

Common Crawl: enabling machine-scale analysis of web data

Lisa Green, Kurt Bollacker, Jordan Mendelson

IIPC, 2014-05-19

Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg

Photo license: CC BY-SA http://commons.wikimedia.org/wiki/File:Img20050526_0007_at_tannheim_cumulus.jpg

Photo license: CC-BY-NC https://www.flickr.com/photos/malloreigh/5580160943

Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg

Enable machine-scale access and analysis of web data for everyone

Web Data Commons: “Extracting Structured Data from the Common Crawl”

WikiEntities (Han Xiaogang): In What Context Is a Term Referenced?

WikiEntities Example: Discography. Who are the most popular artists?

How Easily Can Google Analytics Track Our Browsing? (S. Merity, C. Hornbaker)

Data Publica: Finding French Open Data

Commercial Applications: Improved Spell Checking

may be too domain-specific

Photo license: CC-BY-NC-ND https://www.flickr.com/photos/blueforce4116/1398245798

Photo license: CC BY-SA http://commons.wikimedia.org/wiki/File:Img20050526_0007_at_tannheim_cumulus.jpg

Photo license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.

Image license: CC BY-SA https://www.flickr.com/photos/xdxd_vs_xdxd/6829447421

Photo license: CC-BY-SA https://www.flickr.com/photos/hackny/6202775045

Thank You

www.commoncrawl.org

lisa@commoncrawl.org

kurt@commoncrawl.org

jordan@commoncrawl.org