Semantic annotation of text: techniques and applications
Prof. Luis Sanchez-Fernandez
Web Technologies Laboratory
University Carlos III of Madrid
http://webtlab.it.uc3m.es
Outline
Semantic Web
Techniques for semantic annotation of text
An approach to named entity disambiguation using Wikipedia
Short history of the Web
1990: Creation of the World Wide Web infrastructure at CERN by Tim Berners-Lee
  HTTP, HTML, first Web client, first Web server
1993: Mosaic, first graphical Web client
1994: Netscape Navigator
1996: Commercial use of the WWW becomes widespread
1999: Tim Berners-Lee proposes the Semantic Web
2002: Weblogs and RSS, Web 2.0
6th October 2009: at least 8 billion indexable Web pages
23rd September 2010: at least 15 billion indexable Web pages according to http://www.worldwidewebsize.com/
The problem of information overload
The great success of the Web has led to one of its current problems: information overload
Finding and updating relevant information is difficult and time-consuming for people and companies
  Ex.: keeping an up-to-date state of the art
Company employees can spend up to 20% of their working time searching the Web (Outsell Inc, 2002)
The Semantic Web proposal
The goal of the Semantic Web is to automate web tasks by enriching the current Web content with formal representations that enable better cooperation between humans and computers
Semantic Web Stack
Interuniversity Master's in Telematics Engineering
RDF
“Resource Description Framework” (RDF)
Goal of RDF (alternative views):
  Language for resource description in the Web
  Language for formal representation of (parts of) information available in a Web document (metadata)
Formal => machine readable
Vocabulary defined with ontologies
What is a resource?
  Web content: Web pages, images, e-mails, files, …
  Resources mentioned in Web content: persons, locations, organizations, …
RDF basic principles
We want to represent a piece of information available in the Web describing a resource
Each piece of metadata states a property, which can be modelled as a (formal) statement composed of:
  subject: resource being described
  predicate: property of the resource
  object: value of the property for the resource being described
Example: “http://www.example.org has a creator whose value is John Smith”
RDF Model
An RDF model (set of RDF statements) can be represented by means of a graph
For each statement:
  subject is a node
  predicate is an arc
  object is a node
Subject and predicate are resources
Object can be either a resource or a literal
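The graph view of an RDF model can be sketched in a few lines of Python. This is only an illustration: the `Literal` class and the `build_graph` helper are hypothetical names introduced here, not part of any RDF library, and the creator property URI is borrowed from the Dublin Core vocabulary used in the textual notation of this deck.

```python
from collections import defaultdict


class Literal(str):
    """A plain literal value (hypothetical helper, not a library type)."""


CREATOR = "http://purl.org/dc/elements/1.1/creator"

# Each statement is a (subject, predicate, object) tuple.
# Resources are plain URI strings; literals are wrapped in Literal.
statements = [
    ("http://www.example.org", CREATOR, Literal("John Smith")),
]


def build_graph(triples):
    """Return {subject node: [(arc label, object node), ...]}."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


graph = build_graph(statements)
for node, arcs in graph.items():
    for predicate, obj in arcs:
        kind = "literal" if isinstance(obj, Literal) else "resource"
        print(f"{node} --{predicate}--> {obj} ({kind})")
```

Keeping literals in a distinct type mirrors the rule that only the object position may hold a literal, while subjects and predicates are always resources.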
Example
“http://www.example.org has a creator whose value is John Smith”.
Textual notation (triples)
<http://www.example.org/index.html> <http://purl.org/dc/elements/1.1/creator> <http://www.example.org/staffid/85740> .
<http://www.example.org/index.html> <http://www.example.org/terms/creation-date> "August 16, 1999" .
<http://www.example.org/index.html> <http://www.example.org/terms/language> "English" .
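As a rough illustration of how such triples can be read by a program, the sketch below parses the three statements with a regular expression. It is deliberately simplified and is an assumption of this example, not a full parser for the notation: it only accepts URI subjects and predicates, and objects that are either a URI or a plain double-quoted literal without escapes, language tags or datatypes. The names `TRIPLE_RE` and `parse_triple` are hypothetical.

```python
import re

# Simplified pattern: <subject> <predicate> (<object URI> | "plain literal") .
TRIPLE_RE = re.compile(
    r'^\s*<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.\s*$'
)


def parse_triple(line):
    """Return (subject, predicate, object, object_is_literal) or raise ValueError."""
    match = TRIPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a simple triple: {line!r}")
    subject, predicate, uri_obj, literal_obj = match.groups()
    if uri_obj is not None:
        return subject, predicate, uri_obj, False
    return subject, predicate, literal_obj, True


lines = [
    '<http://www.example.org/index.html> <http://purl.org/dc/elements/1.1/creator> <http://www.example.org/staffid/85740> .',
    '<http://www.example.org/index.html> <http://www.example.org/terms/creation-date> "August 16, 1999" .',
    '<http://www.example.org/index.html> <http://www.example.org/terms/language> "English" .',
]
triples = [parse_triple(line) for line in lines]
for subject, predicate, obj, is_literal in triples:
    print(predicate, "->", obj, "(literal)" if is_literal else "(resource)")
```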
Ontologies: goal
An ontology is a formal, explicit specification of a shared conceptualization
An ontology defines the basic terms and relations comprising the vocabulary of a topic area, as well as rules that should be fulfilled by such terms and relations
RDF Schema
RDF vocabulary
  Definition and description of properties
  Definition and description of classes
Can be used to define simple ontologies
Properties in RDF Schema
rdfs:subPropertyOf
rdfs:range
rdfs:domain
rdfs:subClassOf
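To make the meaning of these four properties more concrete, the following sketch applies a handful of RDFS-style inference rules to a tiny set of triples until nothing new can be derived. It is a minimal illustration under stated assumptions: the `ex:` names are hypothetical, triples are plain Python tuples, literals are ignored, and only the rules commented in the code are covered.

```python
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
TYPE = "rdf:type"


def rdfs_closure(triples):
    """Apply a few RDFS entailment rules until no new triple appears."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                # subClassOf and subPropertyOf are transitive
                if p == p2 and p in (SUBCLASS, SUBPROP) and o == s2:
                    new.add((s, p, o2))
                # an instance of a class is an instance of its superclasses
                if p == TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, TYPE, o2))
                # a statement also holds for the superproperties of its property
                if p2 == SUBPROP and p == s2:
                    new.add((s, o2, o))
                # domain types the subject, range types the object
                if p2 == DOMAIN and p == s2:
                    new.add((s, TYPE, o2))
                if p2 == RANGE and p == s2:
                    new.add((o, TYPE, o2))
        if new <= facts:
            return facts
        facts |= new


# Hypothetical vocabulary and data
schema_and_data = [
    ("ex:isGovernor", DOMAIN, "ex:Person"),
    ("ex:isGovernor", RANGE, "ex:State"),
    ("ex:State", SUBCLASS, "ex:Location"),
    ("ex:GaryLocke", "ex:isGovernor", "ex:WST"),
]
inferred = rdfs_closure(schema_and_data) - set(schema_and_data)
for triple in sorted(inferred):
    print(triple)
```

From a single data statement, the domain and range declarations type both ends of the relation, and the class hierarchy then lifts `ex:WST` from `ex:State` to `ex:Location`.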
Sample taxonomy
picture by Ian Ruotsala
OWL
Ontology language
More powerful than RDF-Schema
Examples:
  Existence/cardinality constraints
    all instances of person have a mother that is also a person, or persons have exactly 2 parents
  Transitive, inverse or symmetrical properties
    isPartOf is a transitive property, hasPart is the inverse of isPartOf, touches is symmetrical
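The effect of declaring a property transitive, symmetrical or the inverse of another can be illustrated with a small saturation loop. This is a sketch under assumptions, not an OWL reasoner: the function `infer`, the facts about pistons and countries, and the tuple representation are all hypothetical, and existence/cardinality constraints are not covered.

```python
def infer(facts, transitive=(), symmetric=(), inverse_of=None):
    """Saturate (subject, property, object) facts under the declared characteristics."""
    inverse_of = dict(inverse_of or {})
    # an inverse declaration works in both directions
    inverse_of.update({v: k for k, v in list(inverse_of.items())})
    facts = set(facts)
    while True:
        new = set()
        for s, p, o in facts:
            if p in symmetric:
                new.add((o, p, s))
            if p in inverse_of:
                new.add((o, inverse_of[p], s))
            if p in transitive:
                for s2, p2, o2 in facts:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if new <= facts:
            return facts
        facts |= new


# Hypothetical facts using the property names from the slide
facts = [
    ("Piston", "isPartOf", "Engine"),
    ("Engine", "isPartOf", "Car"),
    ("Spain", "touches", "France"),
]
closure = infer(
    facts,
    transitive={"isPartOf"},
    symmetric={"touches"},
    inverse_of={"hasPart": "isPartOf"},
)
for fact in sorted(closure - set(facts)):
    print(fact)
```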
Semantic Web and Technology Enhanced Learning
Typical applications
Modelling (ontologies)
  learning processes
  learning content
  learning output (competences)
  learning agents (students, teachers)
Adding metadata (annotations) according to the models
Use the models and the metadata in tools to make decisions
  example: personalized, adaptive content and/or problems
Semantic annotation of text
Generalities
Goal: extract semantic annotations from free text
Natural language is complex and ambiguous
Language dependent
Domain dependent applications
  News
  Literature
  E-mail
  Transcriptions of spoken dialogues
Some useful results can be achieved nowadays
Taxonomy of semantic annotations
Content based annotations
  Document categorization
  Named entities
    Named Entity (Washington, location)
Ontology based domain annotations
  Concepts and instances identification
    <rdf:Description rdf:about='WST'> <rdf:type rdf:resource='State'/> </rdf:Description>
    <rdf:Description rdf:about='WDC'> <rdf:type rdf:resource='City'/> </rdf:Description>
  Relations extraction
    isGovernor(GaryLocke,WST)
Basic techniques (i)
Symbolic NLP
  Based on the use of lexicons and grammar rules to process text
Example: “Barack Obama Elected President”
Lexical Analysis
  NP → Barack
  NP → Obama
  VBT → Elect
  VBT → VBT + ‘ed’
  NN → President
Parsing
  S → NP NP* VBT NN
  [Parse tree: S with children NP, NP, VBT, NN]
Semantic Analysis
  S → NP NP*(X) VBT(Elect) NN(Y)
  hasFunction(X, Y)
  hasFunction(BarackObama, President)
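The three stages of this example (lexical analysis, parsing, semantic analysis) can be mimicked with a toy program. The sketch below hard-codes the lexicon and the single grammar rule of the example; the function names `tag` and `analyse` are hypothetical, and a real symbolic NLP system would rely on far larger lexicons and grammars.

```python
# Lexicon from the example: word -> tag
LEXICON = {"Barack": "NP", "Obama": "NP", "Elect": "VBT", "President": "NN"}


def tag(word):
    """Lexical analysis: look the word up, undoing the rule VBT -> VBT + 'ed'."""
    if word in LEXICON:
        return LEXICON[word], word
    if word.endswith("ed") and LEXICON.get(word[:-2]) == "VBT":
        return "VBT", word[:-2]
    raise ValueError(f"unknown word: {word}")


def analyse(sentence):
    """Parse S -> NP NP* VBT NN and emit hasFunction(X, Y) when the verb is Elect."""
    tagged = [tag(w) for w in sentence.split()]
    tags = [t for t, _ in tagged]
    n_np = 0
    while n_np < len(tags) and tags[n_np] == "NP":
        n_np += 1
    if n_np == 0 or tags[n_np:] != ["VBT", "NN"]:
        return None  # the sentence does not match the grammar rule
    x = "".join(lemma for _, lemma in tagged[:n_np])
    verb, y = tagged[n_np][1], tagged[n_np + 1][1]
    if verb != "Elect":
        return None  # no semantic rule for other verbs in this sketch
    return f"hasFunction({x}, {y})"


result = analyse("Barack Obama Elected President")
print(result)
```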
Basic techniques (ii)
Statistical NLP
  Based on counting: finding frequent patterns that make the occurrence of a certain text feature likely
  Use of extensive corpora
Example:
  “Washington”, when appearing in the same document as “Hollywood”, is likely to represent (Denzel Washington, actor), while “Washington”, when appearing in the same document as “Obama”, is likely to represent (Washington D.C., American capital)
  We can count the frequency of the different meanings of “Washington” when appearing in different contexts
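The counting idea can be sketched as follows. The corpus is invented for illustration (five hypothetical annotated documents), and the voting scheme in `disambiguate` is just one simple way to use the counts, not a method prescribed by this presentation.

```python
from collections import Counter, defaultdict

# Hypothetical annotated corpus:
# (context words found in the document, meaning of "Washington" in it)
corpus = [
    ({"Hollywood", "film"}, "Denzel Washington"),
    ({"Hollywood", "Oscar"}, "Denzel Washington"),
    ({"Obama", "Congress"}, "Washington D.C."),
    ({"Obama", "Senate"}, "Washington D.C."),
    ({"Obama", "film"}, "Washington D.C."),
]

# counts[context word][meaning] = number of documents where both appear
counts = defaultdict(Counter)
for context, meaning in corpus:
    for word in context:
        counts[word][meaning] += 1


def disambiguate(context):
    """Pick the meaning seen most often together with the given context words."""
    votes = Counter()
    for word in context:
        votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None


print(disambiguate({"Hollywood"}))
print(disambiguate({"Obama", "Congress"}))
```

With a realistic corpus the raw counts would normally be turned into probabilities and smoothed, but the principle of letting context frequencies decide stays the same.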
An approach to named entity disambiguation with Wikipedia
Introduction
Entity: text + type
Instance: a particular person, location (GPE), organization, ...
Strategy I
Strategy II
Approach
  Find entities in document
  For each entity, identify candidate instances that are compatible with the entity name
  Assign a ranking value to each candidate instance: 0 ≤ r ≤ 1
  Greater ranking values indicate greater likelihood of occurrence
Strategy III
Semantic coherence (in terms of ranking)
  “An instance would have a high ranking value if the instances that typically co-occur with it also have high ranking values”
$$ r(I_i) = \sum_{j \in C_i} \mathrm{Cooc}(I_i, I_j) \, r(I_j) $$
$$ R = A \cdot R $$
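One way to read the matrix form is as a fixed point that can be approached by repeated multiplication. The sketch below does this for three instances with invented co-occurrence counts. Two points are assumptions of this example rather than statements from the presentation: the columns of the co-occurrence matrix are normalised to sum to 1, and the ranking vector is renormalised after each step.

```python
def normalise_columns(cooc):
    """Turn raw co-occurrence counts into a column-stochastic matrix A."""
    n = len(cooc)
    col_sums = [sum(cooc[i][j] for i in range(n)) for j in range(n)]
    return [
        [cooc[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def iterate(A, r, steps=50):
    """Repeat r <- A r, renormalising so that the ranking values sum to 1."""
    n = len(r)
    for _ in range(steps):
        r = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        total = sum(r)
        r = [x / total for x in r]
    return r


# Hypothetical instances and co-occurrence counts (symmetric, zero diagonal)
instances = ["Barack Obama", "Washington D.C.", "Denzel Washington"]
cooc = [
    [0, 8, 1],
    [8, 0, 1],
    [1, 1, 0],
]
A = normalise_columns(cooc)
ranking = iterate(A, [1 / 3, 1 / 3, 1 / 3])
for name, value in zip(instances, ranking):
    print(f"{name}: {value:.3f}")
```

In this toy setting the two instances that co-occur often reinforce each other, while the instance that rarely co-occurs with them ends up with a lower ranking value.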
Strategy IV
We can add a vector E that accounts for other context information
Equation similar to Google PageRank
$$ R = (1 - \lambda) \, A \cdot R + \lambda \, E $$
Instance finder & filter
Alternative instance names extracted by processing a Wikipedia dump
  Page titles, redirects, disambiguation pages, anchors
  Indexed by Lucene
Candidate instances are obtained by querying Lucene
Candidate instances weighted by combining Lucene scores and PageRank values
Filtering limits the maximum number of candidates
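The weighting and filtering steps might look like the sketch below. The presentation only says that Lucene scores and PageRank values are combined; the linear combination used here is an assumption, chosen because the run tables list two finder weights, αL and αPR. The candidate names, the scores (assumed to be already scaled to [0, 1]) and the function `weigh_candidates` are all hypothetical, and no Lucene call is made.

```python
def weigh_candidates(candidates, alpha_l=0.8, alpha_pr=0.2, max_candidates=3):
    """Weigh and filter candidates.

    candidates: list of (instance, lucene_score, pagerank) tuples,
    with both scores assumed to lie in [0, 1].
    """
    weighted = [
        (instance, alpha_l * lucene + alpha_pr * pagerank)
        for instance, lucene, pagerank in candidates
    ]
    # best candidates first, then keep at most max_candidates of them
    weighted.sort(key=lambda pair: pair[1], reverse=True)
    return weighted[:max_candidates]


# Hypothetical candidates for the entity name "Washington"
candidates = [
    ("Washington, D.C.", 0.9, 0.8),
    ("George Washington", 0.7, 0.9),
    ("Denzel Washington", 0.6, 0.3),
    ("Washington (state)", 0.8, 0.6),
]
top = weigh_candidates(candidates)
for instance, weight in top:
    print(f"{instance}: {weight:.2f}")
```

The resulting weights play the role of the vector E that is handed to the instance ranker.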
Instance ranker
$$ R = k_L \, A_L \cdot R + k_C \, A_C \cdot R + k_E \, E $$
E: candidate instance weights passed by the instance filter
A_C: based on instance co-occurrence in Wikipedia pages
A_L: based on direct links
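The ranker equation can be solved approximately by iterating it from R = E. The sketch below uses the weights of run WebTLab1 (kL = 0.55, kC = 0.25, kE = 0.2) as defaults; the matrices and the vector E are invented for three hypothetical candidates. Convergence of this simple iteration relies on an assumption of this example: no column of A_L or A_C sums to more than 1.

```python
def mat_vec(A, r):
    """Multiply matrix A (list of rows) by vector r."""
    return [sum(a * x for a, x in zip(row, r)) for row in A]


def rank(A_L, A_C, E, k_l=0.55, k_c=0.25, k_e=0.2, steps=100):
    """Iterate R <- k_L A_L R + k_C A_C R + k_E E, starting from R = E."""
    R = list(E)
    for _ in range(steps):
        link, cooc = mat_vec(A_L, R), mat_vec(A_C, R)
        R = [k_l * l + k_c * c + k_e * e for l, c, e in zip(link, cooc, E)]
    return R


# Hypothetical candidates: Barack Obama, Washington D.C., Denzel Washington
A_L = [  # direct links: the first two link to each other, the third has none
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
A_C = [  # column-normalised co-occurrence counts
    [0.0, 8 / 9, 0.5],
    [8 / 9, 0.0, 0.5],
    [1 / 9, 1 / 9, 0.0],
]
E = [0.4, 0.3, 0.3]  # weights passed by the instance filter

R = rank(A_L, A_C, E)
print([round(value, 3) for value in R])
```

Although the two Washington candidates start with the same weight in E, the links and co-occurrences with the third instance pull their ranking values apart.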
Results I
Run parameters (αL, αPR: instance finder; kL, kC, kE: instance ranker; σL, σH: instance selector)
Run        αL    αPR   kL     kC     kE    σL     σH
WebTLab1   0.8   0.2   0.55   0.25   0.2   1.2    2.0
WebTLab2   0.8   0.2   0.55   0.25   0.2   1.05   1.5
WebTLab3   0.8   0.2   0.4    0.4    0.2   1.2    2.0
Results
Run        2250 queries   1020 non-NIL   1230 NIL
WebTLab1   0.7698         0.6647         0.8569
WebTLab2   0.7636         0.6098         0.8911
WebTLab3   0.7596         0.6049         0.8878
$$ R = k_L \, A_L \cdot R + k_C \, A_C \cdot R + k_E \, E $$
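The three result columns can be cross-checked with a little arithmetic. Assuming the figures are per-query scores (the metric is not named here, so this is an assumption), the value for all 2250 queries should equal the mean of the non-NIL and NIL values weighted by their query counts. The sketch below recomputes it for each run.

```python
# (non-NIL score, NIL score, reported score over all queries), copied from the table
runs = {
    "WebTLab1": (0.6647, 0.8569, 0.7698),
    "WebTLab2": (0.6098, 0.8911, 0.7636),
    "WebTLab3": (0.6049, 0.8878, 0.7596),
}
N_NON_NIL, N_NIL = 1020, 1230


def weighted_overall(non_nil, nil):
    """Query-weighted mean of the two partial scores."""
    return (non_nil * N_NON_NIL + nil * N_NIL) / (N_NON_NIL + N_NIL)


for name, (non_nil, nil, reported) in runs.items():
    print(name, round(weighted_overall(non_nil, nil), 4), reported)
```

For all three runs the recomputed value agrees with the reported one to four decimal places, which supports reading the first column as the aggregate of the other two.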
Results II
Run parameters as in Results I
Run        ORG      GPE      PER
WebTLab1   0.7613   0.6569   0.8908
WebTLab2   0.7707   0.6262   0.8935
WebTLab3   0.7680   0.6195   0.8908
Conclusions
Approach based on instance co-occurrence
Text from Wikipedia restricted to: titles, anchors
Results considered promising
  Should improve for GPE
Thank You!
Questions?