Background | Methodology | Evaluation | Conclusions | References
LODIE: Linked Open Data Information Extraction
Fabio Ciravegna Anna Lisa Gentile Ziqi Zhang
OAK Group, Department of Computer Science, The University of Sheffield, UK
9th October 2012
Semantic Web and Information Extraction (SWAIE) @ EKAW 2012
Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 1 / 19
LODIE: Overview
Linked Open Data for Web-scale Information Extraction
- Web-scale IE
  - number of documents, domains and facts
  - efficient and effective methods required
- Linked Open Data to seed learning
  - "[...] a recommended best practice for exposing, sharing, and connecting data [...] using URIs and RDF" (linkeddata.org)
  - a large KB of typed instances, relations and annotations (e.g., RDFa)
- Adapting to specific user information needs
  - users define specific IE tasks by specifying the types of instances and relations to be learnt
Challenges: user information needs
SoA defines a generic IE task - KnowItAll, StatSnowball, PROSPERA, NELL, ExtremeExtraction [3, 4, 5, 11] - extracting "people, organisation, location", etc., and their generic relations
RQ how to let users define Web-IE tasks tailored to their own needs - e.g. "drugs that treat hayfever"
Challenges: training data
SoA requires a certain amount of training/learning resources to be manually specified
RQ how to automatically obtain these (and filter noise) from the LOD
Challenges: learning strategies
SoA typically semi-supervised, bootstrapping-based learning from unstructured text, prone to propagation of errors
RQ how to combine multi-strategy learning (e.g., from both structured and unstructured content) to avoid drifting away from the learning task
Challenges: publication of triples
SoA no integration with existing KBs
RQ how to integrate with the LOD
LODIE Architecture Overview
Figure: Architecture diagram.
User needs formalisation
Goal: support users in formalising their information needs in a machine-understandable format
Hypothesis:
- Users define information needs in terms of ontologies
- Users use different vocabularies in ontology creation
Methods:
- Baseline: manually identify relevant ontologies on the LOD and define a view on them using tools like neon-toolkit.org
- Ontology Design Patterns: reuse existing Content ODP building blocks and apply re-engineering patterns to bridge the "vocabulary gap"
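A need such as "drugs that treat hayfever" could then be captured in a small machine-readable task definition. The sketch below is purely illustrative (it is not a LODIE structure), and the `drugTreats` property URI it maps the user's "treats" term to is an assumption:

```python
# Hypothetical sketch of a formalised user information need: the user's
# informal description is mapped onto a class and properties drawn from
# existing LOD ontologies. The drugTreats URI is an illustrative assumption.
from dataclasses import dataclass, field

@dataclass
class IETask:
    description: str                  # the need in the user's own words
    target_class: str                 # LOD class of instances to extract
    relation_map: dict = field(default_factory=dict)  # user term -> LOD property

task = IETask(
    description="drugs that treat hayfever",
    target_class="http://dbpedia.org/ontology/Drug",
    relation_map={"treats": "http://dbpedia.org/ontology/drugTreats"},
)
```

Such a structure makes the "vocabulary gap" explicit: the left-hand side of `relation_map` is the user's vocabulary, the right-hand side the LOD's.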
Learning Seed Identification and Filtering - I
Goal 1: automatically identify training data in the form of triples and annotations to seed learning
Hypothesis:
- The LOD can already contain answers to user needs in the form of triples and annotations
- The Web contains additional linguistic realisations of triples
Method:
- From the LOD: SPARQL queries to fetch seed triples (and annotations) matching the user needs
- From the Web: search for linguistic realisations of the triples identified above:
  - co-occurrence of related instances in textual contexts, e.g. sentences
  - structural elements, e.g. tables
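As a rough illustration of the LOD step, the sketch below builds a SPARQL query for seed instances of a task like "drugs that treat hayfever"; the `drugTreats` property URI is a hypothetical example, and a real system would submit the query to a SPARQL endpoint such as DBpedia's:

```python
# Sketch: build a SPARQL SELECT query for seed instances of a user-defined
# task. The drugTreats property URI below is a hypothetical example.
def seed_query(subject_class, relation, obj, limit=100):
    """Return a SPARQL query for instances of subject_class linked to obj."""
    return (
        "SELECT DISTINCT ?s WHERE {\n"
        f"  ?s a <{subject_class}> .\n"
        f"  ?s <{relation}> <{obj}> .\n"
        f"}} LIMIT {limit}"
    )

query = seed_query(
    "http://dbpedia.org/ontology/Drug",
    "http://dbpedia.org/ontology/drugTreats",   # hypothetical relation URI
    "http://dbpedia.org/resource/Hay_fever",
)
```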
Learning Seed Identification and Filtering - II
Goal 2: Filter noisy training data and select the most informativeexamples for learning
Hypothesis:
- Identified learning seeds can contain noise (causing "drifting away")...
- ...and can be redundant (causing unnecessary overheads)
- Good learning examples are consistent w.r.t. the learning task and diverse
Method:
- Consistency measure: cluster seed instances of different classes N times (varying parameters); does instance i always appear in the cluster representing the same class?
- Variability measure: cluster seed instances of the same class; how many clusters are generated, and how dense are they?
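The consistency measure can be sketched as follows. This is not the paper's implementation: the clusterings are supplied as precomputed label lists rather than produced by an actual clustering algorithm, and all data is illustrative.

```python
# Sketch of the consistency measure: run clustering N times (here the
# runs are precomputed label lists for brevity) and score each seed
# instance by how often it lands in the cluster holding the majority of
# its own class. Data below is illustrative, not from the paper.
from collections import Counter

def consistency(clusterings, classes, i):
    """Fraction of runs in which instance i shares a cluster with the
    majority of instances of its own class."""
    hits = 0
    for labels in clusterings:
        same_class = [labels[j] for j, c in enumerate(classes) if c == classes[i]]
        majority_cluster = Counter(same_class).most_common(1)[0][0]
        hits += labels[i] == majority_cluster
    return hits / len(clusterings)

# three clustering runs over five seed instances of classes A, A, A, B, B
runs = [[0, 0, 0, 1, 1],   # instance 2 clusters with its class here...
        [1, 1, 0, 0, 0],   # ...but not in this run
        [0, 0, 0, 1, 1]]
classes = ["A", "A", "A", "B", "B"]
score = consistency(runs, classes, 2)  # consistent in 2 of 3 runs
```

A seed with a low score is a candidate for filtering, since it behaves inconsistently w.r.t. its assigned class.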
Multi-Strategy Learning
Goal: learning from different types of content (e.g., structured, unstructured) with different strategies to improve both recall and precision
Hypothesis:
- The same pieces of knowledge can be repeated in different forms, e.g. in tables vs. sentences (reinforcing evidence: precision)
- Some knowledge may be found only in one form or another (recall)
Method: multi-strategy learning
- Learning from structures such as tables and lists [10, 8]
- Inducing wrappers for regular pages [7]
- Lexical-syntactic pattern learning from free text
- Combining outputs from the different processes
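One simple way to combine the outputs of the different strategies is weighted voting: a candidate fact is accepted when its accumulated support crosses a threshold. The sketch below is an illustrative assumption, not LODIE's combination method; strategy names, weights, and facts are made up.

```python
# Hypothetical sketch of output combination across learning strategies:
# each strategy proposes candidate triples, and a triple is accepted when
# its weighted support reaches a threshold. All names/weights are made up.
from collections import defaultdict

def combine(extractions, weights, threshold=1.0):
    """extractions: {strategy: set of (subject, relation, object) triples}."""
    support = defaultdict(float)
    for strategy, triples in extractions.items():
        for t in triples:
            support[t] += weights.get(strategy, 0.0)
    return {t for t, s in support.items() if s >= threshold}

facts = combine(
    {"tables":   {("Loratadine", "treats", "hayfever")},
     "wrappers": {("Loratadine", "treats", "hayfever"),
                  ("Aspirin", "treats", "hayfever")},
     "patterns": {("Loratadine", "treats", "hayfever")}},
    weights={"tables": 0.5, "wrappers": 0.3, "patterns": 0.4},
)
# only the triple supported by all three strategies passes the threshold
```

This captures the hypothesis above: knowledge repeated across forms reinforces precision, while any single strategy can still contribute recall if its weight alone reaches the threshold.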
Integration with the LOD
Goal: assign unique identifiers to the extracted knowledge
Hypothesis:
Knowledge that already exists in the LOD can be re-extracted andmust be integrated
Method:
- Simple, scalable disambiguation methods, e.g. by feature overlap [2] and string distance metrics [9]
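A disambiguation step of this kind can be sketched as scoring each candidate URI by a string similarity between mention and label plus the overlap of context features. The sketch uses stdlib `difflib` and Jaccard overlap as stand-ins for the metrics in [2, 9], and the candidate data is illustrative:

```python
# Sketch: link an extracted mention to a LOD URI by combining string
# similarity (stdlib difflib, standing in for the metrics of [9]) with
# context-feature overlap (in the spirit of [2]). Data is illustrative.
from difflib import SequenceMatcher

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def link(mention, context, candidates):
    """candidates: {uri: (label, feature_set)}; return best-scoring URI."""
    def score(uri):
        label, features = candidates[uri]
        string_sim = SequenceMatcher(None, mention.lower(), label.lower()).ratio()
        return string_sim + jaccard(context, features)
    return max(candidates, key=score)

uri = link(
    "Sheffield",
    {"university", "computer", "science"},
    {"dbr:Sheffield": ("Sheffield", {"city", "yorkshire"}),
     "dbr:University_of_Sheffield": ("University of Sheffield",
                                     {"university", "computer", "science"})},
)
```

Both components are cheap to compute, which is the point of "simple, scalable" methods at Web scale.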
User Feedback
Goal: integrate user feedback on learning
Hypothesis:
Automatic extraction is imperfect, and user feedback can help improve learning
Method:
Expose extracted knowledge via an interface and collect user feedback; errors can be used as negative examples for re-training
Evaluation
- suitability to formalise the user needs
- suitability of the approach to IE
Evaluation: suitability to formalise the user needs
Task described in natural language → equivalent IE task
- feasibility and efficiency (reasonable time, limited overhead)
- effectiveness
  - Is the result representative of the user needs?
    - users judge the description of the task
    - users judge the resulting instances
  - Is the result suitable for seeding IE?
    - usefulness of triples for learning, using the proposed quality measures
Evaluation: suitability of the approach to IE
definition of a new task: population of sections of the schema.org ontology
- precision
- partial evaluation of recall
  - over a fraction of the available annotated instances
  - checking recall with respect to already-available annotated instances not provided for training
- comparative large-scale IE evaluations
  - TAC Knowledge Base Population [6]
  - TREC Entity track [1]
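The partial recall computation above can be sketched as follows. The instance sets are illustrative; note that for this sketch precision is also measured against the held-out annotations, whereas in practice precision would be judged over all extractions:

```python
# Sketch of partial evaluation: recall is measured only against annotated
# instances that were withheld from training. Instance sets are made up;
# precision here is computed against the held-out gold for simplicity.
def evaluate(extracted, gold_heldout):
    true_pos = extracted & gold_heldout
    precision = len(true_pos) / len(extracted) if extracted else 0.0
    partial_recall = len(true_pos) / len(gold_heldout) if gold_heldout else 0.0
    return precision, partial_recall

# four extracted instances; three held-out annotated instances
p, r = evaluate({"a", "b", "c", "d"}, {"a", "b", "e"})
```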
Impact
LODIE timeliness
LOD: first very large-scale information resource available for IE
covering a growing number of domains
LODIE output
Web-scale IE task corpora, linked resources, etc.
developed code available as open source under the MIT licence
all the data generated will be made available using a licence suchas Open Data Commons (opendatacommons.org)
References
[1] Krisztian Balog and Pavel Serdyukov. Overview of the TREC 2010 Entity Track. In Proceedings of the Nineteenth Text REtrieval Conference (TREC 2010). NIST, 2011.
[2] S. Banerjee and T. Pedersen. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In CICLing '02: Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 136–145, London, UK, 2002. Springer-Verlag.
[3] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. Toward an Architecture for Never-Ending Language Learning. In Proceedings of AAAI 2010, pages 1306–1313.
[4] Oren Etzioni, Ana-Maria Popescu, Daniel S. Weld, Doug Downey, and Alexander Yates. Web-Scale Information Extraction in KnowItAll (Preliminary Results). In Proceedings of WWW 2004, pages 100–110, 2004.
[5] Marjorie Freedman, Lance Ramshaw, Elizabeth Boschee, Ryan Gabbard, Gary Kratkiewicz, Nicolas Ward, and Ralph Weischedel. Extreme Extraction - Machine Reading in a Week. In Proceedings of EMNLP 2011, pages 1437–1446, 2011.
[6] Heng Ji and Ralph Grishman. Knowledge Base Population: Successful Approaches and Challenges. In Proceedings of ACL 2011.
[7] N. Kushmerick. Wrapper Induction for Information Extraction. In Proceedings of IJCAI 1997.
[8] Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. Annotating and Searching Web Tables Using Entities, Types and Relationships. In Proceedings of VLDB 2010.
[9] Vanessa Lopez, Miriam Fernández, Enrico Motta, and Nico Stieler. PowerAqua: Supporting Users in Querying and Exploring the Semantic Web Content.
[10] David Milne and Ian H. Witten. Learning to Link with Wikipedia. In Proceedings of CIKM 2008, pages 509–518.
[11] Ndapandula Nakashole, Martin Theobald, and Gerhard Weikum. Scalable Knowledge Harvesting with High Precision and High Recall. In Proceedings of WSDM 2011, pages 227–236, 2011.