Introducing JSONpedia

Posted on 15-Jan-2015

description

Introduction to JSONpedia, a JSON version of Wikipedia

transcript

JSONpedia: Facilitating consumption of MediaWiki content.

WWW.SPAZIODATI.EU

Michele Mostarda <mostarda@spaziodati.eu>, TW: @micmos

Wednesday, 10 October 2012

What is JSONpedia?

“JSONpedia is a library and a web service meant to read WikiText markup as JSON.”

‣ Initially conceived as a tool to produce data to train Machine Learning models.

‣ The REST service, inspired by Sweeble Crystalball, produces JSON, HTML and (coming soon) RDF data.

‣ Built on a context-dependent, event-based parser, intended to be more performant than a regex-based parser (like the wikiparser) or a DOM-based parser (like Sweeble).

Differences with Sweeble

‣ Lightweight event-based parser.

‣ More tolerant of frequent syntax errors present within WikiText pages.

‣ Serializes to JSON output, which is easier to consume!

Differences with DBpedia

‣ JSONpedia does not add any semantics to the extracted data.

‣ JSONpedia could integrate the current DBpedia regex-based parser.

‣ JSONpedia is not a competitor of DBpedia but rather a complement.

JSONpedia Internals

Architecture

[Architecture diagram. Components shown: Input WikiText, Parser, Structure, Validator, Extractor, Splitter, Linker + DBpedia API/Freebase, Output JSON]

WikiText Parser Events

// Document bounding.
void beginDocument(URL document);
void endDocument();

// Error handling.
void parseWarning(String msg, ParserLocation location);
void parseError(Exception e, ParserLocation location);

// Tag handling.
void beginTag(String node, Attribute[] attributes);
void endTag(String node);
void inlineTag(String node, Attribute[] attributes);
void commentTag(String comment);

// Sections
void section(String title, int level);

// References
void beginReference(String label);
void endReference(String label);

// Links
void beginLink(String url);
void endLink(String url);

// Lists
void beginList();
void listItem();
void endList();

// Templates
void beginTemplate(String name);
void endTemplate(String name);

// Tables
void beginTable();
void headCell(int row, int col);
void bodyCell(int row, int col);
void endTable();

// Generic parameter
void parameter(String param);

// Parameter / text value
void text(String content);

WikiText Processors

‣ Structure

‣ Extractors

‣ Linkers

‣ Splitters

‣ Validator

Processors receive the stream of events generated by the parser and perform data construction and transformation.
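
The fan-out of parser events to several processors might look like the following minimal sketch. It is not JSONpedia's actual code: the `WikiTextEvents` interface is a hypothetical, reduced subset of the parser events, `beginDocument` takes a `String` instead of a `URL` for brevity, and `EventFanOut` and `TraceProcessor` are names invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, reduced subset of the parser events; the real list is longer.
interface WikiTextEvents {
    void beginDocument(String documentUrl); // assumption: String instead of URL
    void section(String title, int level);
    void text(String content);
    void endDocument();
}

// Forwards each event to every registered processor, in registration order.
class EventFanOut implements WikiTextEvents {
    private final List<WikiTextEvents> processors = new ArrayList<>();

    EventFanOut add(WikiTextEvents processor) {
        processors.add(processor);
        return this;
    }

    @Override public void beginDocument(String documentUrl) {
        for (WikiTextEvents p : processors) p.beginDocument(documentUrl);
    }
    @Override public void section(String title, int level) {
        for (WikiTextEvents p : processors) p.section(title, level);
    }
    @Override public void text(String content) {
        for (WikiTextEvents p : processors) p.text(content);
    }
    @Override public void endDocument() {
        for (WikiTextEvents p : processors) p.endDocument();
    }
}

// Example processor: records a readable trace of the events it receives.
class TraceProcessor implements WikiTextEvents {
    final List<String> trace = new ArrayList<>();

    @Override public void beginDocument(String documentUrl) { trace.add("begin: " + documentUrl); }
    @Override public void section(String title, int level) { trace.add("section[" + level + "]: " + title); }
    @Override public void text(String content) { trace.add("text: " + content); }
    @Override public void endDocument() { trace.add("end"); }
}
```

With this shape the parser only needs a single `WikiTextEvents` target, and each processor stays unaware of the others.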

Structure

The Structure Processor receives a stream of WikiText parsing events and builds a one-to-one JSON representation of the document DOM.
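
One common way to build a tree from begin/end events is to keep a stack of open nodes. The sketch below illustrates that idea under stated assumptions: it handles only template and text events, the field names `type`, `name` and `content` are invented here and are not JSONpedia's output schema, and the serializer escapes only quotes and backslashes, so it is not a complete JSON encoder.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: turns begin/end events into a nested tree of maps and lists.
class StructureSketch {
    private final Map<String, Object> root = newNode("document", "");
    private final Deque<Map<String, Object>> open = new ArrayDeque<>();

    StructureSketch() {
        open.push(root); // the document node is always the outermost open node
    }

    // Field names are hypothetical, chosen only for this illustration.
    private static Map<String, Object> newNode(String type, String name) {
        Map<String, Object> node = new LinkedHashMap<>();
        node.put("type", type);
        node.put("name", name);
        node.put("content", new ArrayList<Object>());
        return node;
    }

    @SuppressWarnings("unchecked")
    private List<Object> currentContent() {
        return (List<Object>) open.peek().get("content");
    }

    void beginTemplate(String name) {
        Map<String, Object> node = newNode("template", name);
        currentContent().add(node); // attach to the parent first
        open.push(node);            // then make it the current node
    }

    void endTemplate(String name) {
        if (open.size() > 1) open.pop(); // tolerate an unmatched end event
    }

    void text(String content) {
        currentContent().add(content);
    }

    Map<String, Object> result() {
        return root;
    }

    // Minimal serializer: handles maps, lists and strings only.
    static String toJson(Object value) {
        if (value instanceof Map) {
            StringBuilder sb = new StringBuilder("{");
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                if (sb.length() > 1) sb.append(',');
                sb.append(toJson(e.getKey().toString())).append(':').append(toJson(e.getValue()));
            }
            return sb.append('}').toString();
        }
        if (value instanceof List) {
            StringBuilder sb = new StringBuilder("[");
            for (Object item : (List<?>) value) {
                if (sb.length() > 1) sb.append(',');
                sb.append(toJson(item));
            }
            return sb.append(']').toString();
        }
        return "\"" + value.toString().replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

Because each event maps to exactly one node or text entry, the resulting tree mirrors the event stream one-to-one.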

Extractors

Extractors are specific Processors that collect a certain type of data from the event stream: for example, the SectionsExtractor collects the list of all sections detected in the document stream.
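
A sections extractor can be sketched as a processor that reacts only to `section` events and ignores everything else. The class below is an illustration, not the real SectionsExtractor: the `Section` holder type and the accessor names are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of an extractor that keeps only section events.
class SectionsExtractorSketch {

    // Hypothetical holder for one detected section.
    static final class Section {
        final String title;
        final int level;

        Section(String title, int level) {
            this.title = title;
            this.level = level;
        }
    }

    private final List<Section> sections = new ArrayList<>();

    // Called for each section heading found by the parser.
    void section(String title, int level) {
        sections.add(new Section(title.trim(), level));
    }

    // Other events are received but deliberately ignored.
    void text(String content) {
        // not relevant for this extractor
    }

    // Sections in document order, as a read-only view.
    List<Section> sections() {
        return Collections.unmodifiableList(sections);
    }
}
```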

Linkers

A Linker is a Processor that links the current document entity to other information acquired from external sources. An example of a Linker is the FreebaseLinker, which connects an entity to its corresponding representation in Freebase, if any.
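
The "if any" behaviour can be illustrated with a linker that delegates to a pluggable lookup and leaves the document unchanged when nothing is found. This is a sketch under assumptions: the lookup function stands in for an external service call (no real Freebase API is used), and the `external_id` field name is invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Sketch of a linker: enriches a document with an identifier from an external source.
class LinkerSketch {
    // Hypothetical lookup; a real linker would query an external service here.
    private final Function<String, Optional<String>> lookup;

    LinkerSketch(Function<String, Optional<String>> lookup) {
        this.lookup = lookup;
    }

    // Returns a copy of the document, with "external_id" added only when a match exists.
    Map<String, Object> link(String entityTitle, Map<String, Object> document) {
        Map<String, Object> enriched = new LinkedHashMap<>(document);
        lookup.apply(entityTitle).ifPresent(id -> enriched.put("external_id", id));
        return enriched;
    }
}
```

Passing the lookup in as a function keeps the sketch testable with an in-memory table and makes the external source replaceable.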

Splitters

A Splitter is a Processor able to cut subtrees out of the JSON document built by the Structure processor. An example of a Splitter is the TableSplitter, which extracts the JSON structures representing the tables declared in the document.
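
Cutting out subtrees amounts to a recursive walk that collects matching nodes. The sketch below assumes the document is held as nested `Map` and `List` objects and that table nodes carry a `type` field equal to `"table"`; both are illustrative assumptions, not JSONpedia's actual data model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of a splitter: collects every subtree that represents a table.
class TableSplitterSketch {

    static List<Map<?, ?>> split(Object documentTree) {
        List<Map<?, ?>> tables = new ArrayList<>();
        collect(documentTree, tables);
        return tables;
    }

    private static void collect(Object node, List<Map<?, ?>> out) {
        if (node instanceof Map) {
            Map<?, ?> map = (Map<?, ?>) node;
            if ("table".equals(map.get("type"))) { // hypothetical marker field
                out.add(map);
                return; // nested tables stay inside their parent table
            }
            for (Object child : map.values()) collect(child, out);
        } else if (node instanceof List) {
            for (Object child : (List<?>) node) collect(child, out);
        }
        // strings and other leaf values contain no subtrees
    }
}
```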

Validator

A Validator is a Processor that checks the data structures parsed from a document.

Forthcoming Features

‣ JSONpedia DB (based on MongoDB + ElasticSearch) will be queryable online. JSONpedia dumps will also be available.

‣ Online data model Exporter Tool (CSV).

‣ RDF output.

Release

JSONpedia will be fully released as open source by the end of the year.

Live Demo

http://bit.ly/jsonpedia

or

http://json.it.dbpedia.org/frontend/form.html

Thanks!

Michele Mostarda <mostarda@spaziodati.eu>, TW: @micmos

WWW.SPAZIODATI.EU
