DBpedia ♥ Commons

Post on 22-Apr-2015


description

Extract semi-structured data from Wikimedia Commons to RDF using the DBpedia Extraction Framework

transcript

2nd DBpedia Meeting Leipzig 03.09.2014

DBpedia ♥ Commons

Gaurav Vaidya - Dimitris Kontokostas - Andrea Di Menna - Jim O'Regan


~23M pages like this


A lot of pages like this


Many pages like this


Not very similar to pages like this


DBpedia Extraction Framework

✔ “Wiki agnostic”

✔ Pluggable extractors

✔ Out of the box support for common metadata

✗ Tuned for extraction in the main namespace (not File:)

✗ Many other challenges left


Challenges

✔ File metadata

✔ KML files

✔ Image Galleries

✔ Image Annotations

✔ Mappings Wiki

✔ Bootstrap community mappings

✔ Template Statistics

✔ Licensing

✔ Technical details I'll not go into


Out-of-the-box support

● Categories (skos)

● External links

● Geo-coordinates

● Raw infobox properties

● Labels

● PageIds / Revisions

● Links (internal / external)

● Mappings Wiki (with some tweaking / more on that later)


File metadata

● New Extractor

● New file Class hierarchy

– dbo:File, dbo:Image, dbo:StillImage, dbo:MovingImage and dbo:Sound

Sample Output:

:Aeropetes.JPG a dbo:StillImage, dbo:Image, dbo:Document, dbo:File, dbo:Work ;
    dcterms:type dbo:StillImage ;
    dbo:fileExtension "jpg" ;
    dcterms:format "image/jpeg" ;
    dbo:fileURL commons-path:Aeropetes.JPG ;
    foaf:depiction commons-path:Aeropetes.JPG ;
    dbo:thumbnail commons-path:Aeropetes.JPG?width=300 .
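The sample output above suggests that most file metadata can be derived from the file name alone. A minimal Python sketch of that idea follows; it is not the framework's actual extractor (which is written in Scala), and the `MEDIA_TYPES` table and `file_metadata` function are hypothetical names introduced only for illustration.

```python
import posixpath

# Hypothetical lookup table (an assumption, not taken from the framework):
# lower-case extension -> (MIME type, most specific class from the slide).
MEDIA_TYPES = {
    "jpg": ("image/jpeg", "dbo:StillImage"),
    "jpeg": ("image/jpeg", "dbo:StillImage"),
    "png": ("image/png", "dbo:StillImage"),
    "webm": ("video/webm", "dbo:MovingImage"),
    "oga": ("audio/ogg", "dbo:Sound"),
}


def file_metadata(file_name):
    """Return property/value pairs shaped like the sample output."""
    # "Aeropetes.JPG" -> "jpg": the extension is normalised to lower case.
    ext = posixpath.splitext(file_name)[1].lstrip(".").lower()
    mime, cls = MEDIA_TYPES[ext]
    return {
        "dcterms:type": cls,
        "dbo:fileExtension": ext,
        "dcterms:format": mime,
        "dbo:fileURL": "commons-path:" + file_name,
        "dbo:thumbnail": "commons-path:" + file_name + "?width=300",
    }


if __name__ == "__main__":
    for prop, value in file_metadata("Aeropetes.JPG").items():
        print(prop, value)
```

Unknown extensions raise a `KeyError` in this sketch; a real extractor would more likely fall back to the generic dbo:File class.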


Image Galleries

● Attach each gallery item to the page resource

:Colorado dbo:hasGalleryItem Colorado.JPG, Denver_Colorado_Art.jpg, ColoradoCenter1.jpg.
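As a rough illustration of how gallery items could be collected, the Python sketch below pulls file names out of `<gallery>` blocks in wikitext. It assumes the usual one-file-per-line gallery syntax with an optional `|caption`; the function name `gallery_items` and the sample wikitext are invented for this example and do not come from the framework.

```python
import re

# Matches the body of every <gallery ...> ... </gallery> block.
GALLERY_RE = re.compile(r"<gallery[^>]*>(.*?)</gallery>",
                        re.DOTALL | re.IGNORECASE)
# Optional namespace prefix in front of a file name.
NAMESPACE_RE = re.compile(r"^(File|Image):", re.IGNORECASE)


def gallery_items(wikitext):
    """Return the file names listed in all gallery blocks, in order."""
    items = []
    for body in GALLERY_RE.findall(wikitext):
        for line in body.splitlines():
            # Everything after the first "|" is a caption.
            name = line.split("|", 1)[0].strip()
            if not name:
                continue
            name = NAMESPACE_RE.sub("", name)
            # Page titles use underscores instead of spaces.
            items.append(name.replace(" ", "_"))
    return items


if __name__ == "__main__":
    sample = (
        "<gallery>\n"
        "File:Colorado.JPG|A view of the state\n"
        "File:Denver Colorado Art.jpg\n"
        "</gallery>"
    )
    for item in gallery_items(sample):
        print(":Colorado dbo:hasGalleryItem", item)
```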


Image Annotations

● AnnotationGadget

● Boxes with optional description


Image Annotations

● W3C Media Fragments recommendation

● Embed the box in the URI

– ?width=15130&height=1886#xywh=pixel:10431,324,1670,1208> .

● Add descriptions in the new resource
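The URI scheme on this slide can be sketched in a few lines of Python. The helper names `annotation_uri` and `parse_box` are hypothetical, and `commons-path:Example.jpg` is a placeholder file; only the query-plus-fragment layout (`?width=…&height=…#xywh=pixel:x,y,w,h`) is taken from the slide.

```python
import re

# Spatial fragment as used on the slide: #xywh=pixel:x,y,w,h
XYWH_RE = re.compile(r"xywh=(?:pixel:)?(\d+),(\d+),(\d+),(\d+)$")


def annotation_uri(file_url, full_width, full_height, x, y, w, h):
    """Build a URI for one annotation box inside an image."""
    return "%s?width=%d&height=%d#xywh=pixel:%d,%d,%d,%d" % (
        file_url, full_width, full_height, x, y, w, h)


def parse_box(uri):
    """Return (x, y, w, h) from the fragment, or None if there is no box."""
    if "#" not in uri:
        return None
    match = XYWH_RE.search(uri.split("#", 1)[1])
    return tuple(int(v) for v in match.groups()) if match else None


if __name__ == "__main__":
    uri = annotation_uri("commons-path:Example.jpg",
                         15130, 1886, 10431, 324, 1670, 1208)
    print(uri)
    print(parse_box(uri))
```

Because the box lives in the fragment, every annotation gets its own resource URI while still resolving to the same underlying image.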


Mappings Wiki


Template Statistics


Licensing

● Automatically identified and imported ~360 licence templates

● Use the mappings wiki

● Needed some hacking to make it work

– e.g. {{Self|GFDL|cc-by-sa-3.0,2.5,2.0,1.0}}

:Acraea_circeis.JPG dbo:license <http://creativecommons.org/publicdomain/mark/1.0/> .

:Antepipona_deflenda_-_2012-10-17.webm dbo:license <http://creativecommons.org/licenses/by-sa/3.0/> .
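The slides do not say what the "hacking" for `{{Self|…}}` involved. One plausible reading, sketched below in Python, is that the wrapper template has to be unpacked into individual licences before each can be mapped. Treating `cc-by-sa-3.0,2.5,2.0,1.0` as shorthand for four versioned licences is an assumption of this sketch, and `expand_self_template` and `cc_uri` are hypothetical helpers.

```python
import re

# Assumed shorthand: "cc-by-sa-3.0,2.5,2.0,1.0" lists several versions.
MULTI_VERSION_RE = re.compile(r"^(cc-[a-z-]+?)-(\d\.\d(?:,\d\.\d)+)$",
                              re.IGNORECASE)
SINGLE_CC_RE = re.compile(r"^cc-([a-z-]+)-(\d\.\d)$")


def expand_self_template(template):
    """Split {{Self|...}} into a flat list of licence names."""
    parts = [p.strip() for p in template.strip()[2:-2].split("|")]
    if parts[0].lower() != "self":
        raise ValueError("not a Self template: %r" % template)
    licences = []
    for arg in parts[1:]:
        if "=" in arg:  # named parameters (e.g. author=) are not licences
            continue
        match = MULTI_VERSION_RE.match(arg)
        if match:
            base, versions = match.groups()
            licences.extend("%s-%s" % (base, v) for v in versions.split(","))
        else:
            licences.append(arg)
    return licences


def cc_uri(licence):
    """Map a versioned Creative Commons name to its URI, else None."""
    match = SINGLE_CC_RE.match(licence.lower())
    if not match:
        return None
    return "http://creativecommons.org/licenses/%s/%s/" % match.groups()


if __name__ == "__main__":
    for name in expand_self_template("{{Self|GFDL|cc-by-sa-3.0,2.5,2.0,1.0}}"):
        print(name, cc_uri(name))
```

Licences that are not Creative Commons names (GFDL here) map to `None` in this sketch and would need their own table entries.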


KML Annotations attached to media

Attach raw KML data to resource with custom extractor

Sample Output:

:Yellowstone_1871b.jpg dbo:hasKMLData """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
  <GroundOverlay>
    <name>Yorktown, Indiana (1878)</name>
    <description>An 1878 map of Yorktown in Tippecanoe County, Indiana. Source: Kingman Brothers&apos; Combination Atlas Map of Tippecanoe County, Indiana, 1878.</description>
    <color>99ffffff</color>
    <Icon><href>BIG_LINK_HERE</href><viewBoundScale>0.75</viewBoundScale></Icon>
    <LatLonBox>
      <north>40.26126145890567</north>
      <south>40.25777915632657</south>
      <east>-86.77033439383223</east>
      <west>-86.77398493316619</west>
      <rotation>-1.123009884936565</rotation>
    </LatLonBox>
  </GroundOverlay>
</kml>"""^^rdfs:XMLLiteral .
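Since the KML is attached as a raw literal, a consumer has to parse it to get at the overlay bounds. The Python sketch below shows one way to do that with the standard library; `lat_lon_box` is a hypothetical helper, the sample values are made up, and the namespace is the KML 2.2 one that appears in the sample output.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the sample output on the slide.
KML_NS = {"kml": "http://earth.google.com/kml/2.2"}


def lat_lon_box(kml_text):
    """Return the first LatLonBox as a dict of floats, or None."""
    root = ET.fromstring(kml_text)
    box = root.find(".//kml:LatLonBox", KML_NS)
    if box is None:
        return None
    # Tags look like "{namespace}north"; keep only the local name.
    return {child.tag.split("}", 1)[1]: float(child.text) for child in box}


if __name__ == "__main__":
    sample = (
        '<kml xmlns="http://earth.google.com/kml/2.2"><GroundOverlay>'
        "<LatLonBox><north>40.5</north><south>40.25</south>"
        "<east>-86.5</east><west>-86.75</west>"
        "<rotation>-1.5</rotation></LatLonBox>"
        "</GroundOverlay></kml>"
    )
    print(lat_lon_box(sample))
```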

Remaining TODOs

● Nested templates are commonly used and cannot be handled by the mappings wiki at the moment

– e.g. Media descriptions (although mapped) are missing:

{{Information |Description= {{en|Logo of the [[w:en:DBpedia|DBpedia project]]}} {{fr|Logo du projet [[w:fr:DBpedia|DBpedia]]}}

● Annotation descriptions need some tweaking

– Need to render wikitext

● Put it under a SPARQL Endpoint

● Provide Linked Data

– http://commons.dbpedia.org
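To make the nested-template TODO concrete, here is a small Python sketch of what handling the `{{Information}}` example would involve: finding the inner language templates and flattening their wikilinks to plain text. This is an illustration only, not how the mappings wiki or the framework resolves it; `descriptions` is a hypothetical helper and the regular expressions cover just the simple form shown in the example.

```python
import re

# Inner language templates such as {{en|...}} or {{fr|...}}.
LANG_TEMPLATE_RE = re.compile(r"\{\{([a-z]{2,3})\|(.*?)\}\}", re.DOTALL)
# [[target|label]] or [[target]]; the visible text is kept.
WIKILINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def descriptions(wikitext):
    """Return {language code: plain-text description}."""
    result = {}
    for lang, text in LANG_TEMPLATE_RE.findall(wikitext):
        # A very small stand-in for "rendering" wikitext: links only.
        result[lang] = WIKILINK_RE.sub(r"\1", text).strip()
    return result


if __name__ == "__main__":
    sample = (
        "{{Information |Description= "
        "{{en|Logo of the [[w:en:DBpedia|DBpedia project]]}} "
        "{{fr|Logo du projet [[w:fr:DBpedia|DBpedia]]}}"
    )
    for lang, text in descriptions(sample).items():
        print('"%s"@%s' % (text, lang))
```

Templates nested more deeply than one level, or descriptions containing `}}`, would need a real wikitext parser rather than regular expressions.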


Thank You!

Special thanks to:

● Alexandru Todor (importing the License templates)

● Google Summer of Code for sponsoring this project (Gaurav Vaidya)

Questions?

Dataset: http://nl.dbpedia.org/downloads/commonswiki

Dataset samples: https://github.com/gaurav/commons-extraction