Ontology Learning from Text
Methods & Tools
18/5/2007
Pervasive Computing Research Group, Communication Networks Laboratory
Department of Informatics and Telecommunications, University of Athens – Greece
Polyxeni Katsiouli
Definition of Ontology
‘A formal, explicit specification of a shared conceptualization’
• formal: must be machine understandable
• explicit: types of concepts and constraints must be clearly defined
• shared: not private to some individual, but accepted by a group
• conceptualization: an abstract model of some phenomenon in the world, formed by identifying the relevant concepts of that phenomenon
Main elements of an ontology
Hierarchy of concepts (is-a relations)
Object property (relation), with a domain and a range, e.g. wasWrittenBy
Datatype property (attribute), with a domain and a datatype range, e.g. hasTitle (range: xsd:string)
Definition of Ontology Learning
The application of a set of methods and techniques used for building an ontology from scratch
Uses distributed and heterogeneous knowledge and information sources
Allows a reduction in the time and effort needed in the ontology development process
Ontology Learning methods from…
Unstructured sources
• Involves NLP techniques, morphological and syntactic analysis, etc.
Semi-structured sources
• Elicits an ontology from sources that have some predefined structure, such as XML Schema
Structured data
• Extracting concepts and relations from knowledge contained in structured data, such as databases
Ontology Learning ‘Layer Cake’
Axioms & Rules: ∀x, y (sufferFrom(x, y) → ill(x))
Relations: cure (domain: Doctor, range: Disease)
Taxonomy (Concept hierarchies): is_a (Doctor, Person)
Concepts: Disease := <I, E, L>
Synonyms: {disease, illness}
Terms: disease, illness, hospital
Part 1 Terms Extraction
Terms
Linguistic realizations of domain-specific concepts
Are the basis of the ontology learning process
Term extraction implies:
• Linguistic processing: part-of-speech tagging, morphological analysis, etc.
• Statistical processing: compares the distribution of terms between corpora
Terms Extraction: Process
Run a Part-Of-Speech (POS) tagger over the domain corpus
Identify possible terms by constructing patterns, such as: Adj-Noun, Noun-Noun, Adj-Noun-Noun, …
Ignore names
Keep only the terms that are relevant to the text by applying statistical metrics
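The pattern-matching step of this process can be sketched as follows. This is a minimal illustration, not the presentation's implementation: it assumes the corpus has already been POS-tagged (the toy sentence is tagged by hand), the tag names ADJ, NOUN, PROPN and VERB are illustrative, and the function name `candidate_terms` is hypothetical. The statistical filtering step is not included.

```python
from typing import List, Tuple

# Candidate-term patterns over coarse POS tags, as listed in the process above.
PATTERNS = [
    ("ADJ", "NOUN", "NOUN"),
    ("ADJ", "NOUN"),
    ("NOUN", "NOUN"),
]


def candidate_terms(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return every token span whose POS tags match one of PATTERNS."""
    found = []
    for start in range(len(tagged)):
        for pattern in PATTERNS:
            window = tagged[start:start + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                found.append(" ".join(word.lower() for word, _ in window))
    return found


# Hand-tagged toy sentence (an assumption); names carry PROPN and never match.
sentence = [
    ("Smith", "PROPN"), ("treats", "VERB"), ("chronic", "ADJ"),
    ("heart", "NOUN"), ("disease", "NOUN"),
]
terms = candidate_terms(sentence)
print(terms)
```

Overlapping candidates are all kept on purpose; deciding which of them are real domain terms is left to the statistical metrics.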
Linguistic Analysis: an example
Tokenization (incl. Named-Entity Rec.): [table] [2005-06-01] [John Smith]
Part of Speech & Semantic Tagging: [table N:ARTIFACT] [table N:furniture]
Morphological Analysis (stemming): [work~ing V]
Phrase Recognition: [[the] [large] [table] NP] [[in] [the] [corner] PP]
Dependency Structure (Phrases): [[the SPEC] [large MOD] [table HEAD] NP]
Dependency Structure (S): [[He SUBJ] [booked PRED] [[this] [table HEAD] NP:DOBJ] S]
Discourse Analysis: [[He SUBJ] [booked PRED] [[this] [table HEAD] NP:DOBJ:X1] …] … [[It SUBJ:X1] [was PRED] still available …]
Statistical Analysis
Statistical metrics used in terms extraction:
Chi-square: $\chi^2 = \sum \frac{(obs - exp)^2}{exp}$
Term weighting (TFIDF): $tfidf(w) = tf(w) \cdot \log\left(\frac{N}{df(w)}\right)$
Mutual Information: $mi(x, y) = \frac{P(x, y)}{P(x)\,P(y)}$
TFIDF
$tfidf(w) = tf(w) \cdot \log\left(\frac{N}{df(w)}\right)$
tf(w): term frequency (number of occurrences of the word in a document)
df(w): document frequency (number of documents containing the word)
N: number of all documents
tfidf(w): relative importance of the word in the document
Most popular weighting scheme
The word is more important if it appears several times in a document
The word is more important if it appears in fewer documents
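The TFIDF weighting can be illustrated with a short sketch. It assumes whitespace tokenization and the natural logarithm (the slide does not state the base), and the three toy documents and the function name `tfidf` are made up for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document, weight = tf(w) * log(N / df(w))."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # df(w): number of documents containing w (each document counts once).
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf(w): occurrences of w in this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


# Toy corpus (an assumption, not taken from the slides).
docs = ["disease illness disease", "hospital disease", "hospital doctor"]
weights = tfidf(docs)
print(weights[0])
```

In the first document, "illness" occurs once but in no other document, so it outweighs "disease", which occurs twice but is spread over two documents.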
Part 2 Synonyms
Synonyms
Identification of terms that share semantics, i.e., potentially refer to the same concept
Methods for extracting synonyms
• Based on WordNet
• Latent Semantic Indexing (LSI)
WordNet
A lexical database for the English language
Nouns, verbs, adjectives & adverbs are grouped into sets of synonyms (synsets)
Synsets are interlinked by means of conceptual-semantic and lexical relations
Adapting WordNet to specific domain
Partition the set of synonymy relations defined in WordNet into three classes:
• Relations irrelevant in the specific domain
• Relations that are relevant but incorrect in the specific domain
• Relations that are relevant and correct in the specific domain
Remove relations from the first two classes and include relations from the third class
Rank the remaining sets according to their frequency in the corpus
Latent Semantic Indexing (LSI)
LSI is an NLP technique for analyzing relationships between a set of documents and the terms they contain
Uses a term-document matrix which describes the occurrences of terms in documents – Vector Space Model
Example: doc1 doc2
database X
computer X X
access X
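A term-document matrix of this kind can be built in a few lines. The sketch below covers only the vector space model step: the two documents are invented for illustration (they are not the slide's table), and the truncated singular value decomposition that LSI applies to the matrix is deliberately left out.

```python
import math

# Two invented documents (assumption); the terms follow the example above.
docs = {
    "doc1": "the database stores records",
    "doc2": "the computer can access the database",
}
terms = ["database", "computer", "access"]

# Rows are terms, columns are documents; each cell is an occurrence count.
matrix = {t: [text.split().count(t) for text in docs.values()] for t in terms}


def cosine(u, v):
    """Cosine similarity between two term vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


for term, row in matrix.items():
    print(term, row)
print(cosine(matrix["computer"], matrix["access"]))
```

Terms with similar rows, such as "computer" and "access" here, are the ones a synonym or topic analysis would group together.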
Part 3 Concepts
Concepts: Intension, Extension, Lexicon
A term may indicate a concept if we can define its:
Intension: (in)formal definition of the set of objects that this concept describes
Example: a disease is an impairment of health or a condition of abnormal functioning
Extension: a set of objects that the definition of this concept describes
Example: influenza, cancer, heart disease
Lexical realizations: the term itself and its multilingual synonyms
Example: disease, illness, maladie
Part 4 Taxonomy Induction
Concept Hierarchy Extraction
Basic methods used for taxonomy extraction:
• With the use of WordNet
• Lexico-syntactic patterns
• Machine Readable Dictionaries
• Co-occurrence Analysis
• Linguistic approaches
Taxonomy Extraction with WordNet
Given two terms t1 and t2, check if they stand in a hypernym relation with regard to WordNet
Normalize the number of hypernym paths by dividing by the number of senses of t1
$isa(t_1, t_2) = \min\left(\frac{|paths(senses(t_1), senses(t_2))|}{|senses(t_1)|}, 1\right)$
path: a sequence of edges connecting the two synsets
Example:
• 4 different hypernym paths between the synsets of ‘country’ and ‘region’
• ‘country’ has 5 senses
• value of isa (country, region) = 4/5 = 0.8
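The normalization can be checked with a small function. This sketch takes the two counts as plain numbers; looking them up in WordNet is outside its scope, and the function name `isa` simply mirrors the formula.

```python
def isa(num_paths: int, num_senses_t1: int) -> float:
    """min(|paths| / |senses(t1)|, 1), with both counts supplied by the caller."""
    if num_senses_t1 <= 0:
        raise ValueError("t1 must have at least one sense")
    return min(num_paths / num_senses_t1, 1.0)


# Counts from the example above: 4 hypernym paths, 5 senses of 'country'.
score = isa(4, 5)
print(score)
```

The cap at 1 matters when there are more hypernym paths than senses, for example `isa(7, 5)`.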
Lexico-syntactic patterns - Hearst
Aim: the acquisition of hyponym lexical relations from text
Uses a set of predefined lexico-syntactic patterns which
• occur frequently and in many text genres
• indicate the relation of interest
• can be recognized with little or no pre-encoded knowledge
Main idea: match these patterns in texts to retrieve is_a relations
Precision with respect to WordNet: 55.45%
Lexico-syntactic patterns - Hearst
NP0 such as {NP1, NP2, …, (and | or)} NPn
‘Vehicles such as cars, trucks and bikes…’
is_a (car, vehicle), is_a (truck, vehicle), is_a (bike, vehicle)
such NP as {NP,}* {(or | and)} NP
‘Such fruits as oranges, nectarines or apples…’
is_a (orange, fruit), is_a (nectarine, fruit), is_a (apple, fruit)
NP {, NP}* {,} {or | and} other NP
‘Swimming, running, or/and other activities…’
is_a (swimming, activity), is_a (running, activity)
NP {,} including {NP,}* {or | and} NP
‘Injuries, including broken bones, wounds and bruises…’
is_a (broken bone, injury), is_a (wound, injury), is_a (bruise, injury)
NP {,} especially {NP,}* {or | and} NP
‘Publications, especially papers and books…’
is_a (paper, publication), is_a (book, publication)
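The first pattern ("such as") can be approximated with a regular expression over raw text. This is a rough sketch under strong assumptions: every noun phrase is a single word, no POS tagging or lemmatization is done (so plural forms are returned as they appear), and the names `SUCH_AS`, `SEPARATORS` and `hearst_such_as` are hypothetical.

```python
import re

# hypernym "such as" hyponym list; each noun phrase is assumed to be one word.
SUCH_AS = re.compile(
    r"(\w+) such as ((?:\w+, )*\w+,? (?:and|or) \w+|\w+)", re.IGNORECASE
)
# Separators inside the hyponym list: ", and", " or ", ", " and so on.
SEPARATORS = re.compile(r",?\s+(?:and|or)\s+|,\s*")


def hearst_such_as(text):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1).lower()
        for hyponym in SEPARATORS.split(match.group(2)):
            if hyponym:
                pairs.append((hyponym.lower(), hypernym))
    return pairs


pairs = hearst_such_as("Vehicles such as cars, trucks and bikes are common.")
print(pairs)
```

A real implementation would match the pattern over chunked noun phrases rather than single words, so that multi-word hyponyms such as "broken bones" are captured.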
Machine Readable Dictionaries
A method for extracting taxonomies which goes back to the 80’s
Main idea: exploit the regularity of dictionary entries to find a suitable hypernym for the defined word
Example:
spring: “the season between winter and summer and in which leaves and flowers appear”
is_a (spring, season)
MRDs: Exceptions
The hypernym can be preceded by an expression such as ‘a kind of’, ‘a sort of’, or ‘a type of’
The problem is solved by keeping an exception list with words such as ‘kind’, ‘sort’, ‘type’ and taking the head of the NP following the preposition ‘of’
Example:
hornbeam: “a type of tree with a hard wood, sometimes used in hedges”
is_a (hornbeam, tree)
The word can be defined in terms of a part-of or membership relation
Example:
republican: “a member of a political party advocating republicanism”
part_of (republican, political party), rather than is_a (republican, political party)
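The exception-list idea for hypernym extraction can be sketched as below. It is a simplification: instead of parsing the definition and taking the head of the noun phrase, it assumes the head is the first word after the leading article, which only holds for definitions without pre-modifiers. The function name `genus_term` is hypothetical, and the membership case ("a member of …") is not handled.

```python
EXCEPTIONS = {"kind", "sort", "type"}
ARTICLES = {"a", "an", "the"}


def genus_term(definition: str) -> str:
    """Guess the hypernym of a dictionary definition (first-word heuristic)."""
    words = definition.lower().replace(",", " ").split()
    if words and words[0] in ARTICLES:
        words = words[1:]
    # 'kind of X', 'sort of X', 'type of X': skip ahead to X.
    if len(words) >= 3 and words[0] in EXCEPTIONS and words[1] == "of":
        words = words[2:]
        if words[0] in ARTICLES:
            words = words[1:]
    if not words:
        raise ValueError("empty definition")
    return words[0]


spring = genus_term("the season between winter and summer and in which leaves and flowers appear")
hornbeam = genus_term("a type of tree with a hard wood, sometimes used in hedges")
print(spring, hornbeam)
```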
Co-occurrence analysis
Document-based subsumption
A term t1 is more specific than a term t2 if t2 also appears in all the documents in which t1 appears.
$P(x \mid y) = \frac{n(x, y)}{n(y)}$
Term x subsumes term y iff P(x | y) = 1, where
n(x, y): the number of documents in which x and y co-occur
n(y): the number of documents that contain y
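Document-based subsumption is easy to test on a toy corpus. In this sketch each document is represented as a set of terms, which is an assumption made to keep the example short; the function name `subsumes` and the three documents are invented.

```python
def subsumes(x, y, docs):
    """True if x subsumes y: every document that contains y also contains x."""
    n_y = sum(1 for doc in docs if y in doc)
    n_xy = sum(1 for doc in docs if x in doc and y in doc)
    # P(x | y) = n(x, y) / n(y); undefined (treated as False) if y never occurs.
    return n_y > 0 and n_xy / n_y == 1.0


# Toy corpus: each document is the set of terms it contains (assumption).
docs = [{"animal", "dog"}, {"animal", "cat"}, {"animal"}]
print(subsumes("animal", "dog", docs))
print(subsumes("dog", "animal", docs))
```

Here "animal" subsumes "dog" because "dog" never appears without it, while the reverse does not hold.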
Linguistic Approaches
Modifiers typically restrict or narrow down the meaning of the modified noun
Example: is_a (international credit card, credit card)
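The modifier heuristic amounts to stripping the leftmost modifier from a multi-word term. The sketch below applies it repeatedly, which goes one step beyond the slide's single example and is an assumption of this illustration; the function name `modifier_is_a` is hypothetical.

```python
def modifier_is_a(term: str):
    """Return (specific, general) pairs obtained by dropping leading modifiers."""
    words = term.split()
    return [
        (" ".join(words[i:]), " ".join(words[i + 1:]))
        for i in range(len(words) - 1)
    ]


relations = modifier_is_a("international credit card")
print(relations)
```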
Part 5 Relations (non-taxonomic)
Extracting relations & attributes
Specific relations
• Part-of
• Qualia (Formal, Constitutive, Telic, Agentive)
General relations
• Exploiting linguistic structure
Attributes
Learning attributes: Introduction
Attributes: relations with a datatype as range
Typically expressed in texts using the preposition of, the verb have or genitive constructs, e.g. ‘the color of the car’, ‘the car’s color’, ‘every car has a color’
Values of attributes are expressed using copula constructs, adjectives or expressions specific to the attribute in question, e.g.,
• ‘the car is red’ (copula + value)
• ‘the red car’ (adjective)
• ‘the baby weighs 3 kg’ (specific expressions)
Classification of attributes
To systematize the learning process attributes are classified according to their range
An approach to learning attributes
Tokenize & part-of-speech tag the corpus
Apply the following patterns to extract adjective/noun pairs
(\w+{DET})? (\w+{NN})+ is{VBZ} \w+{JJ}
(\w+{DET})? \w+{JJ} (\w+{NN})+
JJ: adjective, DET: determiner, NN: noun, VBZ: verb, 3rd person singular present
These pairs are weighted using conditional probability:
$P(a \mid n) = \frac{f(n, a)}{f(n)}$
f(n, a): joint frequency of adjective a and noun n
f(n): the frequency of noun n
For each of the adjectives we look up the corresponding attributes in WordNet
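The two patterns and the conditional-probability weighting can be sketched over a hand-tagged token list. The sketch simplifies the patterns to single-word nouns, takes the weight to be f(n, a) / f(n) as implied by the two frequencies defined above, and skips the WordNet lookup; the function names and the toy corpus are assumptions.

```python
from collections import Counter


def adjective_noun_pairs(tagged):
    """Extract (noun, adjective) pairs with the two patterns, single-word nouns only."""
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        nxt = tagged[i + 1] if i + 1 < len(tagged) else None
        nxt2 = tagged[i + 2] if i + 2 < len(tagged) else None
        # Adjective directly before a noun, e.g. 'the red car'.
        if tag == "JJ" and nxt and nxt[1] == "NN":
            pairs.append((nxt[0], word))
        # Noun + 'is' + adjective, e.g. 'the car is red'.
        if tag == "NN" and nxt == ("is", "VBZ") and nxt2 and nxt2[1] == "JJ":
            pairs.append((word, nxt2[0]))
    return pairs


def pair_weights(tagged):
    """Weight each pair by f(n, a) / f(n)."""
    pair_freq = Counter(adjective_noun_pairs(tagged))
    noun_freq = Counter(word for word, tag in tagged if tag == "NN")
    return {(n, a): f / noun_freq[n] for (n, a), f in pair_freq.items()}


# Hand-tagged toy corpus: 'the car is red', 'the red car', 'the fast car'.
corpus = [
    ("the", "DET"), ("car", "NN"), ("is", "VBZ"), ("red", "JJ"),
    ("the", "DET"), ("red", "JJ"), ("car", "NN"),
    ("the", "DET"), ("fast", "JJ"), ("car", "NN"),
]
weights = pair_weights(corpus)
print(weights)
```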
“meronymy” / “part-of” relations
Given a “seed” word, find parts of that word in a large corpus of text
Patterns (format: type_of_word TAG type_of_word TAG …):
whole NN[-PL] ’s POS part NN[-PL]
e.g. …building’s basement…
part NN[-PL] of PREP {the|a} DET mods [JJ|NN]* whole NN
e.g. …basement of a building…
NN = Noun, NN-PL = Plural Noun, PREP = Preposition, POS = Possessive, JJ = Adjective
55% accuracy
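A rough approximation of the two part-of patterns on untagged text is shown below. It is only a sketch: it uses plain regular expressions instead of POS tags, ignores the optional modifiers, also accepts the article "an", and expects a straight apostrophe in the genitive. The names `GENITIVE`, `OF_PHRASE` and `part_of_candidates` are hypothetical.

```python
import re

# whole's part, e.g. "building's basement"
GENITIVE = re.compile(r"\b(\w+)'s (\w+)")
# part of the/a/an whole, e.g. "basement of a building"
OF_PHRASE = re.compile(r"\b(\w+) of (?:the|a|an) (\w+)")


def part_of_candidates(text, seed):
    """Return candidate parts of the seed word found by the two patterns."""
    text = text.lower()
    parts = [part for whole, part in GENITIVE.findall(text) if whole == seed]
    parts += [part for part, whole in OF_PHRASE.findall(text) if whole == seed]
    return parts


text = "The building's basement was flooded. The roof of the building leaked."
parts = part_of_candidates(text, "building")
print(parts)
```

Without POS tags these patterns also match non-part phrases, so a ranking or filtering step would still be needed on a real corpus.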
Qualia structures
The meaning of a lexical element is described in terms of four roles:
Constitutive: physical properties of an object (e.g., weight, material, parts)
Agentive: typically a verb denoting an action which brings the object into existence
Formal: normally consists of typing information about the object (e.g., hypernym)
Telic: the purpose or function of an object, expressed either by a verb or by a nominal
Example: Qualia structure for knife
Formal: artifact_tool
Constitutive: blade, handle, …
Telic: cut_act
Agentive: make_act
Qualia Structures: Learning Approach
Aim: to automatically learn qualia structures from the WWW
Based on the idea of matching certain lexico-syntactic patterns conveying a standard relation
Clues: search engine queries indicating the relation of interest
Calculate the weight of a candidate qualia element e for the term t using the Jaccard coefficient:
$Jaccard(e, t) = \frac{GoogleHits(e \wedge t)}{GoogleHits(e) + GoogleHits(t) - GoogleHits(e \wedge t)}$
Qualia Structures: Learning Process
Word → Generate Clues → Download Google Abstracts → POS-tagging → Matching regular expressions → Statistical Weighting → Weighted QS
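The Jaccard weighting reduces to simple arithmetic on three hit counts. In this sketch the counts are passed in as numbers; no search engine is queried, and the figures in the example are invented.

```python
def jaccard_weight(hits_e: int, hits_t: int, hits_both: int) -> float:
    """Hits for 'e and t' divided by hits for 'e or t'."""
    union = hits_e + hits_t - hits_both
    return hits_both / union if union > 0 else 0.0


# Hypothetical hit counts for a candidate element e and a term t.
weight = jaccard_weight(hits_e=1000, hits_t=500, hits_both=250)
print(weight)
```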
Qualia Structure: Patterns (1/2)
Formal Role
Telic Role
Qualia Structure: Patterns (2/2)
Constitutive Role
Relations by syntactic analysis
OntoLT: SubjToClass_PredToSlot_DObjToRange
Maps a subject to the domain, the predicate or verb to a slot or relation, and the object to its range.
Example: ‘The player kicked the ball to the net’
relation: kick (domain: player, range: ball)
RelExt: A tool for Relation Extraction
• identifies relevant triples (pairs of concepts connected by a relation) over concepts from an existing ontology
• is based on the fact that verbs express a relation between two classes that specify the domain and range
• extracts relevant verbs & their grammatical arguments and computes corresponding relations through statistical & linguistic processing
• was developed in the context of the SmartWeb project to provide intelligent information services during the FIFA World Cup 2006
RelExt: Linguistic processing
Corpus → Linguistic annotation, NER & Concept Tagging → Annotated corpus
● Linguistic annotation
the SCHUG system was used
provides a multi-layer XML format for a given text: dependency structure, lemmatization, POS
● NER (Named Entity Recognition)
performed to map instances of football players to existing ontology classes
● Concept tagging
maps synonyms for given terms to the corresponding ontology concepts
RelExt: Statistical Processing
Relevance measure
• χ² test used to compute relevance ranking
• frequencies in BNC, NZZ; relevance scores for heads and predicates
Co-occurrence measure
• co-occurrence scores (Heads <> Preds)
Relation Extraction
Part 6 Axioms & Rules
DIRT: Discovery of Inference Rules from Text
An unsupervised method for discovering inference rules from text, such as
• X is author of Y ≈ X wrote Y
• X caused Y ≈ Y is blamed on X
• X manufactures Y ≈ X’s Y factory
Is based on the Distributional Hypothesis: words that occurred in the same contexts tend to be similar
DIRT: Distributional Hypothesis
The Distributional Hypothesis is applied to dependency trees
If two paths tend to link the same sets of words, their meanings are hypothesized to be similar
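The idea that paths linking the same sets of words are similar can be illustrated with a deliberately simplified measure. Note the substitution: this sketch uses plain set overlap of the slot fillers and combines the two slots with a geometric mean, both chosen for illustration, instead of the weighting DIRT itself uses. The filler sets and names are invented.

```python
import math


def overlap(a: set, b: set) -> float:
    """Share of fillers two slots have in common (set overlap, an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def path_similarity(p1: dict, p2: dict) -> float:
    """Combine the overlaps of the X and Y slots of two paths."""
    return math.sqrt(overlap(p1["X"], p2["X"]) * overlap(p1["Y"], p2["Y"]))


# Invented slot fillers for the paths 'X wrote Y' and 'X is author of Y'.
wrote = {"X": {"tolkien", "austen", "orwell"}, "Y": {"novel", "book", "essay"}}
author_of = {"X": {"tolkien", "austen"}, "Y": {"novel", "book"}}
similarity = path_similarity(wrote, author_of)
print(similarity)
```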
DIRT: Dependency trees
The inference rules discovered by DIRT are between paths in dependency trees
Dependency trees are generated by the Minipar parser
Minipar represents its grammar as a network where nodes represent grammatical categories and links represent syntactic relationships
Table: A subset of the dependency relations in Minipar output
DIRT: Dependency trees
“John found a solution to the problem”
found --subj--> John
found --obj--> solution
solution --det--> a
solution --mod--> to
to --pcomp--> problem
problem --det--> the
Links represent dependency relationships
Direction: from the head to the modifier
Labels represent types of dependency relations
Each link between two words represents a direct semantic relationship
Path between “John” and “problem”
N:subj:V find V:obj:N solution N:to:N
meaning “X finds solution to Y”
DIRT: Paths in Dependency Trees
Transformation rule: connect the prepositional complement directly to the words modified by the preposition
Each link between two words represents a direct semantic relationship
A path represents indirect semantic relationships between two content words
Ontology Learning Tools
Text2Onto
• Open source (Java)
• http://ontoware.org/projects/text2onto
OntoLT
• Open source (Protégé plug-in, Java)
• http://olp.dfki.de/OntoLT/OntoLT.htm
OntoGen
• Open source (C++, .NET)
• http://www.textmining.net
Text2Onto: Main Features
Learn primitives independent of a specific KR language (Probabilistic Ontology Model, POM)
System calculates a confidence for each learned object for better user interaction
Updates the learned knowledge each time the corpus is changed and avoids processing it from scratch
Allows for easy
• combination of algorithms,
• execution of algorithms,
• writing new algorithms
Text2Onto: Algorithms used
Concepts
• Statistical measures, e.g. TFIDF, C-value/NC-value,…
Subclass_of relations
• Exploits hypernym relations from WordNet
• Hearst patterns
Mereological relations (part-of)
General relations: extracts the following syntactic frames:
• Transitive, e.g., love(subj, obj)
• Intransitive + PP-complement, e.g., walk(subj, pp(to))
• Transitive + PP-complement, e.g., hit(subj, obj, pp(with))
Instance-of
Equivalence
Text2Onto: screenshot
OntoGen : Techniques used
Linear Dimensionality Reduction (a.k.a. LSI)
• words related to the same topic co-occur together more often than words related to different topics
• Result: clusters of words each describing one topic
K-means clustering algorithm
• Partitions the corpus into k clusters so that two documents within the same cluster are more closely related than two documents from different clusters
OntoGen: screenshot
Onto-LT
A Protégé plug-in with which classes and relations can be extracted from a linguistically annotated text collection
Provides mapping rules that allow for a mapping between linguistic entities and class/slot candidates in Protégé
Onto-LT: Mapping rules
HeadNounToClass_ModToSubClass
Maps a head-noun to a class and, in combination with its modifier(s), to one or more sub-class(es)
SubjToClass_PredToSlot_DObjToRange
Maps a linguistic subject to a class, its predicate to a corresponding slot for this class and the direct object to the “range” of the slot
Onto-LT: System architecture
Onto-LT: screenshot
Conclusions
A detailed methodology that guides the ontology learning process does not exist
Only general guidelines are provided
No complete correspondence between the methods and the tools
Methods are based mainly on NLP techniques complemented with statistical measures
Tools support only some of the steps proposed in the different approaches (except Text2Onto)
Some References…
Cimiano, P., Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer, 2006
Hearst, M.A., Automatic Acquisition of Hyponyms from Large Text Corpora. In: Proceedings of the 14th International Conference on Computational Linguistics, pp. 539-545, 1992
Gómez-Pérez, A., & Manzano-Macho, D., An overview of methods and tools for ontology learning from text. The Knowledge Engineering Review, Vol. 19:3, pp. 187-212, 2005
Cimiano, P., & Wenderoth, J., Automatically Learning Qualia Structures from the Web. In: Proceedings of the ACL Workshop on Deep Lexical Acquisition, pp. 28-37, 2005