Post on 20-Dec-2015
June 19-21, 2006 WMS'06, Chania, Crete 1
Design and Evaluation of Semantic Similarity Measures for Concepts Stemming from the Same or Different Ontologies
Euripides G.M. Petrakis, Giannis Varelas, Angelos Hliaoutakis, Paraskevi Raftopoulou
Semantic Similarity
Relates to computing the conceptual similarity between terms that are not necessarily lexically similar, e.g., “car”-“automobile”-“vehicle”, “drug”-“medicine”
A tool for making knowledge commonly understandable in applications such as information retrieval (IR) and information communication in general
Methodology
Terms from different communicating sources are represented by ontologies
Map two terms to an ontology and compute their relationship in that ontology
Terms from different ontologies: Discover linguistic relationships or affinities between terms in different ontologies
Contributions
We investigate several Semantic Similarity Methods and we evaluate their performance http://www.intelligence.tuc.gr/similarity
We propose a novel semantic similarity measure for comparing concepts from different ontologies
Ontologies
Tools of information representation on a subject
Hierarchical categorization of terms from general to most specific terms, e.g., object > artifact > construction > stadium
Domain ontologies: represent knowledge of a domain, e.g., the MeSH medical ontology
General ontologies: represent common-sense knowledge about the world, e.g., WordNet
WordNet
A vocabulary and a thesaurus offering a hierarchical categorization of natural language terms
More than 100,000 terms
Nouns, verbs, adjectives and adverbs are grouped into synonym sets (synsets)
Synsets represent terms or concepts with similar meaning, e.g., stadium, bowl, arena, sports stadium (a large structure for open-air sports or entertainments)
WordNet Hierarchies
The synsets are also organized into senses
Senses: different meanings of the same term
The synsets are related to other synsets higher or lower in the hierarchy by different types of relationships, e.g.,
Hyponym/Hypernym (Is-A relationships)
Meronym/Holonym (Part-Of relationships)
Nine noun and several verb Is-A hierarchies
MeSH
MeSH: ontology for medical and biological terms by the U.S. National Library of Medicine (NLM)
Organized in IS-A hierarchies
More than 15 taxonomies, more than 22,000 terms
No Part-Of relationships
The terms are organized into synsets called “entry terms”
Semantic Similarity Methods
Map terms to an ontology and compute their relationship in that ontology
Four main categories of methods:
Edge counting: path length between terms
Information content: a function of the terms' probability of occurrence in a corpus
Feature based: similarity between their properties (e.g., definitions) or based on their relationships to other similar terms
Hybrid: combine the above ideas
Example
Edge counting: the distance between “conveyance” and “ceramic” is 2
An information content method would associate the two terms with their common subsumer and with their probabilities of occurrence in a corpus
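The two ideas in this example can be illustrated with a minimal Python sketch. The taxonomy below is a hypothetical toy hierarchy (the node "instrumentality" and all probabilities are assumptions made for illustration, not values taken from WordNet or from a corpus). The information content score shown is the negative log probability of the common subsumer, in the style of Resnik's measure.

```python
import math

# Hypothetical toy taxonomy (child -> parent); an assumption, not WordNet data.
PARENT = {
    "artifact": "object",
    "instrumentality": "artifact",
    "conveyance": "instrumentality",
    "ceramic": "instrumentality",
    "construction": "artifact",
    "stadium": "construction",
}

# Assumed occurrence probabilities; a real system would estimate them from a corpus.
PROB = {
    "object": 1.0, "artifact": 0.5, "instrumentality": 0.2,
    "construction": 0.1, "conveyance": 0.05, "ceramic": 0.01, "stadium": 0.01,
}


def path_to_root(term):
    """Return the Is-A path from the term up to the root, the term itself first."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def common_subsumer(a, b):
    """Return the most specific ancestor shared by both terms."""
    seen = set(path_to_root(a))
    for node in path_to_root(b):
        if node in seen:
            return node
    raise ValueError("terms do not share a hierarchy")


def path_length(a, b):
    """Edge counting: number of Is-A edges on the path between a and b."""
    lcs = common_subsumer(a, b)
    return path_to_root(a).index(lcs) + path_to_root(b).index(lcs)


def information_content_similarity(a, b):
    """Information content of the common subsumer: -log p(subsumer)."""
    return -math.log(PROB[common_subsumer(a, b)])


print(path_length("conveyance", "ceramic"))
print(information_content_similarity("conveyance", "ceramic"))
```

In this toy hierarchy the two terms share a direct parent, so the edge-counting distance is 2, matching the slide. The information content score depends entirely on the assumed probabilities.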
X-Similarity
Relies on matching between synsets and term description sets
A, B: synsets or term description sets
Do the same with all IS-A and Part-Of relationships and take their maximum
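$$S(a,b) = \frac{|A \cap B|}{|A \cup B|}$$

$$S_{neighborhoods}(a,b) = \max_i S_i(a,b)$$

$$Sim(a,b) = \begin{cases} 1, & \text{if } S_{synsets}(a,b) > 0; \\ \max\{S_{neighborhoods}(a,b),\, S_{descriptions}(a,b)\}, & \text{if } S_{synsets}(a,b) = 0. \end{cases}$$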
Example
WordNet term: “Hypothyroidism”
<term> hypothyroidism
<definition> An underactive thyroid gland; a glandular disorder resulting from insufficient production of thyroid hormones. </definition>
<synset> Hypothyroidism </synset>
<hypernyms> glandular disease, disorder, condition, state </hypernyms>
<hyponyms> myxedema, cretinism </hyponyms>
</term>
MeSH term: “Hyperthyroidism”
<term> hyperthyroidism
<definition> Hypersecretion of Thyroid Hormones from Thyroid Gland. Elevated levels of thyroid hormones increase Basal Metabolic Rate. </definition>
<synset> Hyperthyroidism </synset>
<hypernyms> disease, thyroid, Endocrine System Diseases, diseases </hypernyms>
<hyponyms> thyrotoxicosis, thyrotoxicoses </hyponyms>
</term>
S(Hypothyroidism, Hyperthyroidism) = 0.387
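A rough sketch of the X-Similarity matching rule applied to the two records above, in Python. It assumes set overlap is measured as intersection size over union size, that synset matching short-circuits to 1, and that otherwise the maximum of the neighborhood and description overlaps is taken. The tokenizer (lowercased alphabetic words, no stop-word removal or stemming) and the dictionary layout are assumptions for illustration, so the sketch is not expected to reproduce the 0.387 reported on the slide.

```python
import re


def overlap(a, b):
    """Set overlap |A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tokens(text):
    """Assumed tokenizer: lowercase alphabetic words, no stemming or stop list."""
    return set(re.findall(r"[a-z]+", text.lower()))


def x_similarity(t1, t2, relations=("hypernyms", "hyponyms")):
    """Sketch of the matching rule: synsets first, then neighborhoods and descriptions."""
    if overlap(t1["synset"], t2["synset"]) > 0:
        return 1.0
    s_neighborhoods = max(overlap(t1[r], t2[r]) for r in relations)
    s_descriptions = overlap(tokens(t1["definition"]), tokens(t2["definition"]))
    return max(s_neighborhoods, s_descriptions)


# The two records from the slide, lowercased; the dict layout is hypothetical.
hypo = {
    "synset": {"hypothyroidism"},
    "definition": "An underactive thyroid gland; a glandular disorder resulting "
                  "from insufficient production of thyroid hormones.",
    "hypernyms": {"glandular disease", "disorder", "condition", "state"},
    "hyponyms": {"myxedema", "cretinism"},
}
hyper = {
    "synset": {"hyperthyroidism"},
    "definition": "Hypersecretion of Thyroid Hormones from Thyroid Gland. "
                  "Elevated levels of thyroid hormones increase Basal Metabolic Rate.",
    "hypernyms": {"disease", "thyroid", "endocrine system diseases", "diseases"},
    "hyponyms": {"thyrotoxicosis", "thyrotoxicoses"},
}

print(x_similarity(hypo, hyper))
```

With this naive tokenization the synsets and neighborhoods do not overlap, so the score comes entirely from the words shared by the two definitions.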
Evaluation
The most popular methods are evaluated
All methods applied on a set of 38 term pairs
Their similarity values are correlated with scores obtained by humans
The higher the correlation of a method the better the method is
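The evaluation protocol can be sketched as follows. The slides do not say which correlation coefficient was used; this sketch assumes Pearson's coefficient, and the score lists are made-up placeholders, not the study's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five term pairs (placeholders only).
human_scores = [0.9, 0.7, 0.4, 0.2, 0.1]
method_scores = [0.8, 0.75, 0.5, 0.3, 0.05]

print(pearson(method_scores, human_scores))
```

A method whose scores rise and fall with the human judgments gets a coefficient close to 1, which is the sense in which a higher correlation means a better method.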
Evaluation on WordNet
Method            Type            Correlation
Rada 1989         Edge Counting   0.59
Wu 1994           Edge Counting   0.74
Li 2003           Edge Counting   0.82
Leacock 1998      Edge Counting   0.82
Richardson 1994   Edge Counting   0.63
Resnik 1999       Info. Content   0.79
Lin 1993          Info. Content   0.82
Lord 2003         Info. Content   0.79
Jiang 1998        Info. Content   0.83
Tversky 1977      Feature Based   0.73
X-Similarity      Feature Based   0.74
Rodriguez 2003    Hybrid          0.71
Evaluation on MeSH
Method            Type            Correlation
Rada 1989         Edge Counting   0.50
Wu 1994           Edge Counting   0.67
Li 2003           Edge Counting   0.70
Leacock 1998      Edge Counting   0.74
Richardson 1994   Edge Counting   0.64
Resnik 1999       Info. Content   0.71
Lin 1993          Info. Content   0.72
Lord 2003         Info. Content   0.70
Jiang 1998        Info. Content   0.71
Tversky 1977      Feature Based   0.67
X-Similarity      Feature Based   0.71
Rodriguez 2003    Hybrid          0.71
Cross Ontology Measures
We used 40 MeSH term pairs
One of the terms in each pair is also a WordNet term
We measured correlation with scores obtained by experts
Method            Type            Correlation
X-Similarity      Feature Based   0.70
Rodriguez         Hybrid          0.55
Comments
Edge counting and information content methods work by exploiting structure information
Good methods take the position of the terms into account
Higher similarity for terms which are close together but lower in the hierarchy, e.g., [Li et al. 2003]
X-Similarity performs at least as well as other feature-based methods
Outperforms other cross-ontology methods
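The remark about position in the hierarchy can be illustrated with a depth-aware edge-counting measure. The sketch below uses the Wu and Palmer formulation (one of the edge-counting methods in the evaluation tables), not the formula of Li et al. 2003, which is different. The toy taxonomy, including the nodes "organism" and "instrumentality", is a hypothetical assumption, and depth is counted in nodes with the root at depth 1.

```python
# Hypothetical toy taxonomy (child -> parent); an assumption, not WordNet data.
PARENT = {
    "artifact": "object",
    "organism": "object",
    "instrumentality": "artifact",
    "conveyance": "instrumentality",
    "ceramic": "instrumentality",
}


def path_to_root(term):
    """Return the Is-A path from the term up to the root, the term itself first."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(term):
    """Depth counted in nodes, so the root has depth 1 (an assumed convention)."""
    return len(path_to_root(term))


def wu_palmer(a, b):
    """Wu-Palmer style similarity: 2 * depth(subsumer) / (depth(a) + depth(b))."""
    seen = set(path_to_root(a))
    subsumer = next(node for node in path_to_root(b) if node in seen)
    return 2 * depth(subsumer) / (depth(a) + depth(b))


# Both pairs are two edges apart, but the second pair sits lower in the hierarchy.
print(wu_palmer("artifact", "organism"))
print(wu_palmer("conveyance", "ceramic"))
```

Both pairs have the same path length, yet the deeper pair receives the higher score, which is the behavior described above for methods that take term position into account.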
Conclusions
Semantic similarity methods approximate the human notion of similarity, reaching correlation up to 83%
Cross-ontology similarity is a difficult problem that requires further investigation
Work towards integrating semantic similarity within the IntelliSearch information retrieval system for Web documents http://www.intelligence.tuc.gr/intellisearch
Try our system on the Web
http://www.intelligence.tuc.gr/similarity
Implementation: Giannis Varelas, Spyros Argyropoulos