Using the Web of Data for Information Extraction

Post on 10-May-2015


Talk at Insiders Technologies, 21.01.2010. It covers publishing RDF data with D2R Server, linking that data to obtain Linked Data, querying it with SPARQL via SQUIN, and finally annotating text with this data using RDFa in Epiphany.

Benjamin Adrian
http://www.dfki.uni-kl.de/~adrian

Insiders, January 2010

Using the Web of Data for Information Extraction

Keywords: SPARQL, RDF, RDFa, D2R Server, SCOOBIE, Epiphany, SQUIN, Linked Data, OBIE

Are you still surfing ...

… or overloaded?


What are the cities of the universities in Rhineland Palatinate and what is the unemployment rate of these cities?

A simple question ...


PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX eurostat: <http://www4.wiwiss.fu-berlin.de/eurostat/resource/eurostat/>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
PREFIX dbpedia_cat: <http://dbpedia.org/resource/Category:>

SELECT ?dbpcity ?cityName ?ur WHERE {
  ?uni skos:subject dbpedia_cat:Universities_and_colleges_in_Rhineland-Palatinate ;
       dbpedia:city ?dbpcity .
  ?dbpcity owl:sameAs ?statcity .
  ?statcity rdfs:label ?cityName ;
            eurostat:unemployment_rate_total ?ur
}


http://www.w3.org/TR/rdf-sparql-query/

… and its answer.

dbpcity                               cityName   ur
http://dbpedia.org/resource/Koblenz   Koblenz    8.8
http://dbpedia.org/resource/Trier     Trier      7.3

Data Sources:
http://wiki.dbpedia.org
http://epp.eurostat.ec.europa.eu
http://www4.wiwiss.fu-berlin.de/eurostat/

Query Engine: SQUIN - Query the Web of Linked Data
http://squin.sourceforge.net/
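To make the shape of this query concrete, here is a minimal Python sketch that evaluates the same chain of joins (university, city, owl:sameAs, label and rate) over a handful of in-memory triples. The university and eurostat identifiers are hypothetical stand-ins and the prefixes are shortened; only the two city names and rates come from the result shown above. It is an illustration of the join logic, not how SQUIN evaluates queries.

```python
# Toy triple store. University and eurostat identifiers are hypothetical;
# the city names and unemployment rates are the two rows from the slide.
SAME_AS = "owl:sameAs"
RATE = "eurostat:unemployment_rate_total"
TRIPLES = [
    ("ex:UniKoblenz", "skos:subject", "cat:Universities_in_RLP"),
    ("ex:UniKoblenz", "dbpedia:city", "dbpedia:Koblenz"),
    ("ex:UniTrier", "skos:subject", "cat:Universities_in_RLP"),
    ("ex:UniTrier", "dbpedia:city", "dbpedia:Trier"),
    ("dbpedia:Koblenz", SAME_AS, "eurostat:Koblenz"),
    ("dbpedia:Trier", SAME_AS, "eurostat:Trier"),
    ("eurostat:Koblenz", "rdfs:label", "Koblenz"),
    ("eurostat:Koblenz", RATE, 8.8),
    ("eurostat:Trier", "rdfs:label", "Trier"),
    ("eurostat:Trier", RATE, 7.3),
]


def objects(subject, predicate):
    """All objects o with (subject, predicate, o) in the store."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects s with (s, predicate, obj) in the store."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def university_cities(category):
    """Join the triple patterns of the query, one nested loop per pattern."""
    rows = set()
    for uni in subjects("skos:subject", category):
        for city in objects(uni, "dbpedia:city"):
            for stat_city in objects(city, SAME_AS):
                for name in objects(stat_city, "rdfs:label"):
                    for rate in objects(stat_city, RATE):
                        rows.add((city, name, rate))
    return sorted(rows)


for row in university_cities("cat:Universities_in_RLP"):
    print(row)
```

The owl:sameAs step is what bridges the two data sources: the city is named in one dataset and its statistics live in another.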


So much data out there, too much?

What data do you have?

Are you still surfing ...

Agenda

In order to use the Web of Data for information extraction, you have to understand its basics.

● RDF on one slide
● Publish data in RDF with D2R Server
● Publish RDF as Linked Data
● Query Linked Data with SPARQL and Squin
● Use RDF for information extraction
● Bring Linked Data to text via RDFa

Wouldn't this be nice.

[Diagram, built up over four slides. Boxes: Data, Text, Extraction Pipeline, Extraction Results, User-defined Filter, annotated text. Arrow labels: populate, enrich, annotate.]

RDF on one slide

Found at: http://sig.ma/entity/ddcb76b935e91940e5508a460619a2ac.rdf

@prefix dblp_author: <http://dblp.l3s.de/d2r/page/authors/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix acm: <http://acm.rkbexplorer.com/description/> .

dblp_author:Michael_Gillmann
    foaf:name "Michael Gillmann" ;
    rdfs:seeAlso <http://www.bibsonomy.org/uri/author/Michael+Gillmann> ;
    rdf:type foaf:Agent ;
    owl:sameAs acm:person-197117-81d3fccbfd0249fc33c0d00f03a30af4 ;
    foaf:isMakerOf <http://dblp.l3s.de/d2r/resource/publications/conf/icdar/SchulzEGAAD09> .

<http://dblp.l3s.de/d2r/resource/publications/conf/icdar/SchulzEGAAD09>
    dc:creator dblp_author:Michael_Gillmann ;
    dc:creator dblp_author:Markus_Ebbecke ;
    dc:title "Seizing the Treasure: Transferring Knowledge in Invoice Analysis" .

The slide then highlights, in turn, the parts of this example: Vocabularies, URLs / URIs, Subjects, Predicates, Objects.

RDF data is graph data.
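A small Python sketch can show what "graph data" means here: each triple is an edge from subject to object, labelled with the predicate. The triples below are taken from the example on the RDF slide; the short name `pub:SchulzEGAAD09` is a hypothetical abbreviation of the publication URI, used only to keep the lines readable.

```python
from collections import defaultdict

# Hypothetical short name standing in for the full publication URI.
PUB = "pub:SchulzEGAAD09"

# (subject, predicate, object) triples from the slide's example.
TRIPLES = [
    ("dblp_author:Michael_Gillmann", "foaf:name", "Michael Gillmann"),
    ("dblp_author:Michael_Gillmann", "rdf:type", "foaf:Agent"),
    ("dblp_author:Michael_Gillmann", "foaf:isMakerOf", PUB),
    (PUB, "dc:creator", "dblp_author:Michael_Gillmann"),
    (PUB, "dc:creator", "dblp_author:Markus_Ebbecke"),
    (PUB, "dc:title",
     "Seizing the Treasure: Transferring Knowledge in Invoice Analysis"),
]


def to_graph(triples):
    """Adjacency list: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


graph = to_graph(TRIPLES)
predicates = sorted({p for _, p, _ in TRIPLES})

for subject, edges in graph.items():
    for predicate, obj in edges:
        print(f"{subject} --{predicate}--> {obj}")
```

Subjects become nodes with outgoing edges, predicates become edge labels, and objects are either other nodes (URIs) or literal values.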


Publishing relational data in RDF


D2R Server - Publishing Relational Databases on the Semantic Web
http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/

Two small command line calls:

./generate-mapping -o mydatabase.n3 -b http://projects.dfki.uni-kl.de/mydatabase/ jdbc:mysql://localhost:3306/mydatabase

./d2r-server -p 80 -b http://projects.dfki.uni-kl.de/mydatabase/ mydatabase.n3
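The idea behind such a mapping can be sketched in a few lines of Python: every row becomes a resource URI, and every column value becomes a triple about that resource. This is only an illustration of the principle using an in-memory SQLite table. The `employees` table, the `resource/` and `vocab/` URI layout and the function name are assumptions made for the sketch, not the mapping that D2R Server actually generates.

```python
import sqlite3

# Base URI as used in the command line calls above.
BASE = "http://projects.dfki.uni-kl.de/mydatabase/"


def table_to_triples(conn, table, key):
    """Turn each row of `table` into triples (hypothetical URI layout).

    `table` and `key` are trusted identifiers here; a real tool would
    validate them before building the SQL string.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{BASE}resource/{table}/{record[key]}"
        for column, value in record.items():
            if column != key and value is not None:
                predicate = f"{BASE}vocab/{table}_{column}"
                triples.append((subject, predicate, value))
    return triples


# Hypothetical example table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.execute("INSERT INTO employees VALUES (1, 'Jane Doe', 'Kaiserslautern')")

triples = table_to_triples(conn, "employees", "id")
for triple in triples:
    print(triple)
```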


Linked Data: Linking RDF data from different sources

Datasets: Customer DB, Employees DB, Project DB, DBpedia

How to interlink these datasets?


Linked Data Principles (TimBL, 2006)

1. Use URIs as names for things (e.g., http://dbpedia.org/resource/Berlin)
2. Use HTTP URIs so that people can look up those names
3. Provide useful information in RDF when someone looks up a URI
4. Include links to other URIs to enable discovery of more information

Example:

<http://dbpedia.org/resource/Berlin>
    owl:sameAs opencyc:en/CityOfBerlinGermany ;
    owl:sameAs opencyc:en/Berlin_StateGermany ;
    owl:sameAs <http://sws.geonames.org/2950159/> ;
    owl:sameAs <http://www4.wiwiss.fu-berlin.de/eurostat/resource/regions/Berlin> ;
    owl:sameAs freebase:http://dbpedia.org/resource/Berlin .
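Because owl:sameAs is symmetric and transitive, a set of such links partitions URIs into groups that name the same thing. The following Python sketch groups URIs with a small union-find structure. The three Berlin URIs come from the example above; the `example.org` pair is a hypothetical second entity added only to show that separate groups stay separate.

```python
def same_as_clusters(links):
    """Group URIs connected by owl:sameAs links (union-find)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for uri in parent:
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


BERLIN = "http://dbpedia.org/resource/Berlin"
LINKS = [
    (BERLIN, "http://sws.geonames.org/2950159/"),
    (BERLIN, "http://www4.wiwiss.fu-berlin.de/eurostat/resource/regions/Berlin"),
    # Hypothetical second entity, unrelated to Berlin.
    ("http://example.org/a", "http://example.org/b"),
]

clusters = same_as_clusters(LINKS)
for cluster in clusters:
    print(sorted(cluster))
```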


SPARQL: Querying RDF data

SPARQL - the RDF query language.

In contrast to SQL, its data model is graph-oriented rather than set-oriented.

Some examples (both assume PREFIX foaf: <http://xmlns.com/foaf/0.1/>):

Resulting in tuples:

SELECT ?interest ?friend WHERE {
  <http://www.w3.org/People/Berners-Lee/card#i> foaf:knows ?friend .
  ?friend foaf:interest ?interest .
}

Resulting in a graph:

CONSTRUCT { ?friend foaf:interest ?interest } WHERE {
  <http://www.w3.org/People/Berners-Lee/card#i> foaf:knows ?friend .
  ?friend foaf:interest ?interest .
}
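The difference between the two result forms can be sketched in Python with a tiny pattern matcher: both queries solve the same triple patterns, but SELECT projects the variable bindings into tuples while CONSTRUCT instantiates a triple template into a new graph. The data (`ex:alice`, `ex:bob`, `ex:timbl` and their interests) is hypothetical, and the matcher is a simplification that handles only plain triple patterns with string terms.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def solve(patterns, triples):
    """All variable bindings that satisfy every pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings


def select(variables, patterns, triples):
    """SELECT: project each solution onto a tuple of values."""
    return [tuple(b[v] for v in variables) for b in solve(patterns, triples)]


def construct(template, patterns, triples):
    """CONSTRUCT: fill a triple template once per solution."""
    return {
        tuple(b.get(term, term) for term in template)
        for b in solve(patterns, triples)
    }


# Hypothetical data.
DATA = [
    ("ex:timbl", "foaf:knows", "ex:alice"),
    ("ex:alice", "foaf:interest", "ex:rdf"),
    ("ex:bob", "foaf:interest", "ex:sql"),
]
WHERE = [
    ("ex:timbl", "foaf:knows", "?friend"),
    ("?friend", "foaf:interest", "?interest"),
]

print(select(["?interest", "?friend"], WHERE, DATA))
print(construct(("?friend", "foaf:interest", "?interest"), WHERE, DATA))
```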


SPARQL: Query Linked Data from different sources

Datasets: Customer DB, Employees DB, Project DB, DBpedia

How to access these datasets with a single SPARQL query?


[Diagram: D2R Server (four instances); Customer DB, Employees DB, Project DB, DBpedia; SQUIN]

Squin: Query the Web of Linked Data
http://squin.sourceforge.net/

Squin follows a Link Traversal approach over HTTP URIs.

Remember:

SELECT DISTINCT ?c ?cityName ?ur WHERE {
  ?u skos:subject dbpedia_cat:Universities_and_colleges_in_Rhineland-Palatinate ;
     dbpedia:city ?c .
  ?c owl:sameAs [ rdfs:label ?cityName ;
                  eurostat:unemployment_rate_total ?ur ]
}
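The link traversal idea can be illustrated with a simulated web: a dictionary maps each URI to the triples a lookup of that URI would return, and the query only fetches URIs it discovers while following the path. Everything here is a simplified sketch under stated assumptions. The `WEB` dictionary, the shortened identifiers and the `traverse` function are hypothetical and do not reflect SQUIN's actual implementation or API; only the Trier rate of 7.3 comes from the result shown earlier.

```python
# Simulated web: URI -> triples returned when that URI is looked up.
# All identifiers are hypothetical short names.
WEB = {
    "ex:uni": [("ex:uni", "dbpedia:city", "dbpedia:Trier")],
    "dbpedia:Trier": [("dbpedia:Trier", "owl:sameAs", "eurostat:Trier")],
    "eurostat:Trier": [
        ("eurostat:Trier", "eurostat:unemployment_rate_total", "7.3")
    ],
    "ex:unrelated": [("ex:unrelated", "rdfs:label", "never fetched")],
}


def traverse(seed, path):
    """Follow `path` (a list of predicates) from `seed`.

    URIs are dereferenced on demand, so only data reachable along the
    path is ever loaded into the local graph.
    """
    local_graph = set()
    fetched = []
    frontier = {seed}
    for predicate in path:
        next_frontier = set()
        for uri in sorted(frontier):
            if uri in WEB and uri not in fetched:
                fetched.append(uri)  # stands in for an HTTP lookup
                local_graph.update(WEB[uri])
            next_frontier |= {
                o for s, p, o in local_graph if s == uri and p == predicate
            }
        frontier = next_frontier
    return frontier, fetched


values, fetched = traverse(
    "ex:uni",
    ["dbpedia:city", "owl:sameAs", "eurostat:unemployment_rate_total"],
)
print(values, fetched)
```

Note that `ex:unrelated` is never fetched: the query touches only the sources its own intermediate results point to.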


Using RDF and Linked Data for Information Extraction

[Diagram. Boxes: User, Linked Data, Text, Extraction Pipeline, Query, Result Graph. Edge labels: asks question, about, answers to.]


What data do we have?

Example RDF data:

<http://dblp.l3s.de/d2r/resource/publications/dblp_pub:conf/icdar/SchulzEGAAD09>
    rdf:type foaf:Document ;
    dc:creator dblp_author:Markus_Ebbecke ;
    dc:title "Seizing the Treasure: Transferring Knowledge in Invoice Analysis" .

● Classes: foaf:Document, foaf:Person
● Instances: .../SchulzEGAAD09, .../Markus_Ebbecke
● Datatype Properties: dc:title, foaf:name, foaf:firstName, foaf:surName
● Object Properties: dc:creator, foaf:knows
● Literals: "Markus", "Ebbecke", "Seizing the Treasure: Transferring Knowledge in Invoice Analysis"

SCOOBIE: Domain Adaptation

[Diagram labels: Structured Data (Vocabulary Data, Instance Data), Text Corpus Data, Data Preprocessing & Learning (offline), Patterns and Gazetteers, Information Extraction (online), Data]

SCOOBIE: Eco System

[Diagram labels: Training Corpus, Patterns + Gazetteers, Text Corpus, Ontology, Instances, Domain Knowledge Models, Session Data, Tasks, Index, Models. Phases: Pre-process, Train, Extract.]

SCOOBIE: OBIE Pipeline

● Normalization: Text Extraction, Language Detection
● Segmentation: Tokenization, Sentence Extraction, POS-Tagging
● Symbolization: Named Entity Recognition, Structured Entity Recognition, Noun Phrase Chunking, Symbol Recognition
● Instantiation: Instance Recognition, Instance Disambiguation, Chunk Classification
● Contextualization: Fact Extraction, Fact Selection
● Population: Query Answering

Used Machine Learning Models

● Regex matching statistics (Structured Entity Recognition)
● Gazetteer matching statistics (Named Entity Recognition)
● CRF-based Noun Phrase Chunker
● K-Nearest-Neighbor chunk classifier (Chunk Classification)
● Spreading Activation-based fact ranking (Fact Selection)
● TF/IDF-based instance re-ranking (Instance Disambiguation)

Learning types named on the slide: Supervised Learning, Semi-Supervised Learning, Unsupervised or Instance-based Learning.
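As an illustration of gazetteer matching, the following Python sketch builds a lookup table from RDF literals (such as foaf:name values) to instance URIs and finds their mentions in text. The `LABELS` table and the function name are assumptions made for this example; it shows the general technique, not SCOOBIE's implementation, and it omits the matching statistics mentioned above.

```python
import re

# Gazetteer: literal -> instance URI, as could be read from foaf:name
# triples in the RDF data. Entries follow the earlier RDF example.
LABELS = {
    "Markus Ebbecke": "dblp_author:Markus_Ebbecke",
    "Michael Gillmann": "dblp_author:Michael_Gillmann",
}


def gazetteer_matches(text, labels=LABELS):
    """Return (start, end, uri) for every whole-word label occurrence."""
    matches = []
    for label, uri in labels.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for found in re.finditer(pattern, text):
            matches.append((found.start(), found.end(), uri))
    return sorted(matches)


sentence = "Michael Gillmann and Markus Ebbecke wrote the paper."
for start, end, uri in gazetteer_matches(sentence):
    print(sentence[start:end], "->", uri)
```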


Used Machine Learning: Conditional Random Field

CRFs are sequence taggers:

Train it with:
  Bill CAPITALIZED noun
  slept LOWERCASE non-noun
  here LOWERCASE non-noun

Test it with:
  He CAPITALIZED
  visited LOWERCASE
  London CAPITALIZED

CRF results:
  noun
  non-noun
  non-noun

MALLET - MAchine Learning for LanguagE Toolkit
http://mallet.cs.umass.edu/
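The training and test lines above follow a simple "token, features, optional label" layout. A short Python sketch shows how such lines could be produced from tokens. The function names are hypothetical, and the sketch covers only the feature extraction step; training and applying the CRF itself would be left to a toolkit such as MALLET.

```python
def shape_feature(token):
    """Capitalization feature as used in the example lines."""
    return "CAPITALIZED" if token[:1].isupper() else "LOWERCASE"


def to_tagger_lines(tokens, labels=None):
    """One line per token: token, feature, and the label when training."""
    lines = []
    for index, token in enumerate(tokens):
        parts = [token, shape_feature(token)]
        if labels is not None:
            parts.append(labels[index])
        lines.append(" ".join(parts))
    return lines


train = to_tagger_lines(
    ["Bill", "slept", "here"], ["noun", "non-noun", "non-noun"]
)
test = to_tagger_lines(["He", "visited", "London"])
print("\n".join(train))
print("\n".join(test))
```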


Bringing Linked Data to Text

Annotate plain text or HTML with RDF data.

I'm working at DFKI.

RDFa offers an HTML extension:

I'm working at <span about="dbpedia:DFKI" property="rdfs:label">DFKI</span>

Now let's generate RDFa automatically ...
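Generating such markup can be sketched as wrapping recognized mentions in span elements. The Python function below takes character spans with their URIs, for example from a gazetteer match, and emits RDFa in the same form as the example above. The function name and the input format are assumptions made for this sketch; it is not how Epiphany produces its annotations.

```python
from html import escape


def annotate(text, mentions):
    """Wrap each mention in an RDFa span.

    `mentions` is a list of (start, end, uri) with non-overlapping
    character spans into `text`.
    """
    output = []
    position = 0
    for start, end, uri in sorted(mentions):
        output.append(escape(text[position:start], quote=False))
        output.append(
            '<span about="%s" property="rdfs:label">%s</span>'
            % (escape(uri, quote=True), escape(text[start:end], quote=False))
        )
        position = end
    output.append(escape(text[position:], quote=False))
    return "".join(output)


sentence = "I'm working at DFKI."
start = sentence.index("DFKI")
html = annotate(sentence, [(start, start + len("DFKI"), "dbpedia:DFKI")])
print(html)
```

Escaping the text and the URI keeps the output well-formed even when the source text contains characters such as `<` or `&`.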

Do you remember?

[Diagram. Boxes: Data, Text, Extraction Pipeline, Extraction Results, User-defined Filter, annotated text. Arrow labels: populate, enrich, annotate.]

RDF Epiphany

● Epiphany takes the original webpage …
● and SCOOBIE initialized with an RDF Linked Data set …
● It extracts RDF information from text and annotates it as RDFa …
● Clicking on RDFa annotations opens further information from the Linked Data set.

RDF Epiphany: At a glance

● Epiphany is a free web service.

● Epiphany uses SCOOBIE.

● Epiphany can be initialized with any RDF Linked Data set.

● Epiphany generates an RDF document about a web page.

● Epiphany annotates RDF as RDFa in the web page.

http://projects.dfki.uni-kl.de/epiphany/

Summary

[Diagram. Extraction side: Text, Extraction Pipeline, Extraction Results, User-defined Filter, annotated text; arrow labels: populate, enrich, annotate. Data side: D2R Server (four instances); Customer DB, Employees DB, Project DB, DBpedia; SQUIN.]

Outlook

[Diagram. Extraction side: E-Mail, Extraction Pipeline, Extraction Results, User-defined Filter, annotated E-Mail; arrow labels: populate, enrich, annotate. Data side: D2R Server (four instances); Customer DB, Employees DB, Project DB, DBpedia; SQUIN.]

Thank you!