Ontology Alignment


Problem Statement

Given N ontologies (O1, …, On)
◦ In a particular domain
◦ Different levels of coverage

Goal
◦ Evaluate commonality of entities
◦ Rank entities

Challenges & Solutions

• Ontology alignments
◦ Largest Common Subgraph (LCS)
◦ Vector Space Model (TF/IDF)
• Accuracy of entities in aligned concepts
◦ Ranking entities

LCS Algorithm for Multiple Ontologies

• Find the LCS for two ontologies
• Align the LCS with the other ontologies
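A minimal Python sketch of this fold, not from the slides: ontologies are modeled as plain dicts from concept label to neighbor labels, and a naive label intersection stands in for the similarity-based pairwise LCS developed later in the deck.

```python
# Sketch of the multi-ontology LCS strategy above. Graphs are plain dicts
# mapping a concept label to a set of neighbor labels; the pairwise "LCS"
# here is a naive label intersection, a stand-in for the real alignment.

def lcs_pairwise(g1, g2):
    """Naive largest-common-subgraph stand-in: keep shared labels/edges."""
    shared = g1.keys() & g2.keys()
    return {c: (g1[c] & g2[c]) & shared for c in shared}

def lcs_multiple(ontologies):
    """Find the LCS of the first two ontologies, then align it with the rest."""
    common = lcs_pairwise(ontologies[0], ontologies[1])
    for onto in ontologies[2:]:
        common = lcs_pairwise(common, onto)
    return common

o1 = {"Person": {"Parent"}, "Parent": set(), "Pet": set()}
o2 = {"Person": {"Parent"}, "Parent": set(), "Car": set()}
o3 = {"Person": {"Parent"}, "Parent": set()}
print(lcs_multiple([o1, o2, o3]))  # {'Person': {'Parent'}, 'Parent': set()}
```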

Largest Common Subgraph (LCS) Algorithm between Two Ontologies

Data Structure for LCS Algorithm

[Figure: ontology O1 with concepts C1-C7 aligned against ontology O2 with concepts C'1-C'6]

Similarity Measure for Corresponding Entities: Node Similarity + Structural Similarity

C1: (C1,C'1,.95), (C1,C'6,.77), (C1,C'3,.71), (C1,C'4,.65), (C1,C'5,.54), (C1,C'2,.34)

C2: (C2,C'3,.85), (C2,C'2,.67), (C2,C'1,.51), (C2,C'4,.45), (C2,C'5,.24), (C2,C'6,.14)

C3: (C3,C'4,.90), (C3,C'1,.67), (C3,C'3,.51), (C3,C'2,.45), (C3,C'5,.34), (C3,C'6,.24)

C4: (C4,C'2,.95), (C4,C'1,.65), (C4,C'3,.51), (C4,C'4,.45), (C4,C'5,.23), (C4,C'6,.14)

C5: (C5,C'4,.80), (C5,C'1,.67), (C5,C'3,.65), (C5,C'2,.35), (C5,C'5,.34), (C5,C'6,.24)

C6: (C6,C'4,.20), (C6,C'1,.15), (C6,C'3,.12), (C6,C'2,.12), (C6,C'5,.09), (C6,C'6,.08)

C7: (C7,C'4,.31), (C7,C'1,.25), (C7,C'3,.23), (C7,C'2,.15), (C7,C'5,.14), (C7,C'6,.12)
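A small sketch of this data structure, with hypothetical helper names: every O1 concept keeps its O2 candidates sorted by descending similarity, and a greedy pass (one possible selection policy, not necessarily the deck's) picks each concept's best still-unmatched candidate.

```python
# Build the ranked candidate lists shown above; `similarity` is a stand-in
# for the combined node + structural similarity of the following slides.

def rank_candidates(concepts1, concepts2, similarity):
    return {c1: sorted(((c1, c2, similarity(c1, c2)) for c2 in concepts2),
                       key=lambda t: t[2], reverse=True)
            for c1 in concepts1}

def greedy_alignment(ranked):
    """Pick each concept's best still-unmatched candidate (greedy policy)."""
    taken, matches = set(), {}
    for c1, candidates in ranked.items():
        for _, c2, score in candidates:
            if c2 not in taken:
                matches[c1] = (c2, score)
                taken.add(c2)
                break
    return matches
```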

Node Similarity: Instance-Based
Representing Types Using N-grams*

• Node similarity (name-match): find common N-grams (N = 2) for corresponding columns

Table CA (source A):

StrName           FENAME         Status
LOCUST-GROVE DR   LOCUST GROVE   BUILT
LOUISE LN         LOUISE         BUILT

Table CB (source B):

Street           Laddress   Raddress
TRAIL RANGE DR   1600       1798
CR45/MANET CT    2500       2598

N-gram types from A.StrName = {LO, OC, CU, ST, …}
N-gram types from B.Street = {TR, RA, R4, 5/, …}
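A short sketch of the bigram step, using the two columns above; `ngrams` and `column_ngrams` are illustrative helper names.

```python
# Character bigram (N = 2) extraction for the name-match step above.

def ngrams(value, n=2):
    """Set of character n-grams of one instance value."""
    return {value[i:i + n] for i in range(len(value) - n + 1)}

def column_ngrams(column, n=2):
    """Union of n-gram 'types' over all instance values in a column."""
    return set().union(*(ngrams(v, n) for v in column))

a_strname = ["LOCUST-GROVE DR", "LOUISE LN"]
b_street = ["TRAIL RANGE DR", "CR45/MANET CT"]
common = column_ngrams(a_strname) & column_ngrams(b_street)
print(sorted(common))  # bigram types shared by A.StrName and B.Street
```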

*Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani Thuraisingham & Shashi Shekhar, “Content Based Ontology Matching for GIS Datasets“, ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2008), Page: 407-410, Irvine, California, USA, November 2008.

Node Similarity: Instance-Based
Visualizing Entropy and Conditional Entropy

H(C) = −Σ pᵢ log pᵢ, for all x ∈ C1 ∪ C2

H(C | T) = H(C, T) − H(T), for all x ∈ C1 ∪ C2 and t ∈ T
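A hedged Python sketch of this measure, assuming T is the type (n-gram or cluster) of each instance value and C records which column the value came from; the ratio H(C|T) / H(C) is the similarity used on the Step 3 slide below.

```python
# Entropy-based similarity between two columns, per the formulas above.

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of hashable labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def ebd_similarity(pairs):
    """pairs: list of (column_id, type) for every value in C1 U C2."""
    h_c = entropy([c for c, _ in pairs])
    h_t = entropy([t for _, t in pairs])
    h_ct = entropy(pairs)            # joint entropy H(C, T)
    h_c_given_t = h_ct - h_t         # H(C | T) = H(C, T) - H(T)
    return h_c_given_t / h_c if h_c else 0.0

# Toy usage: bigram types drawn from two columns A and B.
pairs = [("A", "Da"), ("A", "al"), ("A", "la"),
         ("B", "Sh"), ("B", "ha"), ("B", "la")]
print(ebd_similarity(pairs))  # ~0.33: partial type overlap between columns
```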

Node Similarity: Faults of This Method

• Semantically similar columns are not guaranteed to have a high similarity score.

A ∈ O1:

City          Country
Dallas        USA
Houston       USA
Kingston      Jamaica
Halifax       Canada
Mexico City   Mexico

B ∈ O2:

ctyName        country
Shanghai       China
Beijing        China
Tokyo          Japan
New Delhi      India
Kuala Lumpur   Malaysia

2-grams extracted from A: {Da, al, la, as, Ho, ou, us, …}
2-grams extracted from B: {Sh, ha, an, ng, gh, ha, ai, Be, ei, ij, …}

Step 1: Extract distinct keywords from the compared columns (C1 ∈ O1, C2 ∈ O2):

C1 (from O1):

roadName      City
Johnson Rd.   Plano
School Dr.    Richardson
Zeppelin St.  Lakehurst

C2 (from O2):

Road         County
Custer Pwy   Collin
15th St.     Collin
Parker Rd.   Collin

Keywords extracted from columns = {Johnson, Rd., School, 15th, …}

Step 2: Group the distinct keywords of C1 ∪ C2 into semantic clusters, e.g. {"Rd.", "Dr.", "St.", "Pwy", …} and {"Johnson", "School", "Dr.", …}

Step 3: Calculate similarity:

Similarity = H(C|T) / H(C)
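A naive end-to-end sketch of Steps 1-3, with a hard-coded suffix/name split standing in for the real semantic clustering of Step 2; Step 3 would feed the result to `ebd_similarity` from the entropy sketch above.

```python
# Illustrative pipeline for Steps 1-3 (helper names are made up).

SUFFIXES = {"Rd.", "Dr.", "St.", "Pwy"}

def extract_keywords(column):                      # Step 1
    return {kw for value in column for kw in value.split()}

def cluster(keyword):                              # Step 2 (naive stand-in)
    return "suffix" if keyword in SUFFIXES else "name"

c1 = ["Johnson Rd.", "School Dr.", "Zeppelin St."]   # roadName column
c2 = ["Custer Pwy", "15th St.", "Parker Rd."]        # Road column
pairs = ([("C1", cluster(k)) for k in extract_keywords(c1)]
         + [("C2", cluster(k)) for k in extract_keywords(c2)])
# Step 3: ebd_similarity(pairs), as defined in the earlier sketch.
```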

Node Similarity: Instance-Based
K-medoid + NGD Instance Similarity
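The slides name NGD without restating it; for reference, the standard Cilibrasi-Vitányi definition is:

```latex
% Normalized Google Distance: f(x) and f(y) are the numbers of pages
% containing x and y, f(x, y) the number containing both, and N the
% total number of indexed pages.
\mathrm{NGD}(x, y) =
  \frac{\max\{\log f(x),\, \log f(y)\} - \log f(x, y)}
       {\log N - \min\{\log f(x),\, \log f(y)\}}
```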

Node Similarity: Instance-Based
Problems with K-medoid + NGD*

It is possible that two different geographic entities (e.g., Dallas, TX and Dallas County) in the same location will have a very low computed NGD value and thus be mistaken as similar:

roadName      City
Johnson Rd.   Plano
School Dr.    Richardson
Zeppelin St.  Lakehurst
Alma Dr.      Richardson
Preston Rd.   Addison
Dallas Pkwy   Dallas

Road                County
Custer Pwy          Cooke
15th St.            Collin
Parker Rd.          Collin
Alma Dr.            Collin
Campbell Rd.        Denton
Harry Hines Blvd.   Dallas

*Jeffrey Partyka, Latifur Khan, Bhavani Thuraisingham, “Semantic Schema Matching Without Shared Instances,” to appear in Third IEEE International Conference on Semantic Computing, Berkeley, CA, USA - September 14-16, 2009.

Node Similarity: Instance-Based
Using Geographic Type Information*

We use a gazetteer to determine the geographic type of an instance.

[Figure: instances from O1 and O2 mapped to a shared set of geotypes]
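A minimal sketch of gazetteer-based typing; the lookup table below is a made-up stand-in for a real gazetteer service.

```python
# Hypothetical gazetteer lookup assigning a geotype to each instance value.

GAZETTEER = {
    "Dallas": "city",
    "Plano": "city",
    "Collin": "county",
    "Denton": "county",   # ambiguous in reality (both a city and a county)
}

def geotype(instance):
    """Geographic type of an instance, or 'unknown' if not in the gazetteer."""
    return GAZETTEER.get(instance, "unknown")

print(geotype("Dallas"))  # city
```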

*Jeffrey Partyka, Latifur Khan, Bhavani Thuraisingham, “Geographically-Typed Semantic Schema Matching,” to appear in ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2009), Seattle, Washington, USA, November 2009.

Node Similarity: Instance-Based
Results of Geographic Matching over Two Separate Road Network Data Sources

Structural Similarity

◦ Structural similarity measurement: I. Neighbor similarity

[Figure: neighbor overlap between concepts C1, C2, C3, C5, C6 of O1 and C'1, C'3, C'4, C'5 of O2]

Structural Similarity

◦ Structural similarity measurement: II. Properties similarity

[Figure: O1 (concepts C1-C7) and O2 (concepts C'1-C'6) connected by edges labeled isA, subClass, hasFlavor, hasColor, hasFood, hasDrink, hasTopping]

RTC1 = [3 isA, 2 subClass, 1 hasFlavor, 1 hasColor, 0 hasFood, 1 hasTopping]
RTC2 = [1 isA, 1 subClass, 2 hasFlavor, 0 hasColor, 1 hasFood]
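The deck does not spell out how the RT vectors are compared; a plausible sketch is cosine similarity over the relation-type counts above (an assumption, not necessarily the exact RTS measure cited on the next slide).

```python
# Cosine similarity over the relation-type (RT) count vectors above.

from math import sqrt

RT_C1 = {"isA": 3, "subClass": 2, "hasFlavor": 1, "hasColor": 1,
         "hasFood": 0, "hasTopping": 1}
RT_C2 = {"isA": 1, "subClass": 1, "hasFlavor": 2, "hasColor": 0, "hasFood": 1}

def cosine(u, v):
    keys = u.keys() | v.keys()
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

print(round(cosine(RT_C1, RT_C2), 3))  # ~0.661 for the vectors above
```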

Similarity: Results of Pairwise Ontology Matching (I3CON Benchmark)

[Chart: matching using name similarity + RTS]
[Chart: matching using name similarity + (RTS and neighbor similarity)]

Ontology Matching: Vector Space Model (VSM)

• Define the VSM for each entity: the collection of words in its label, edge types, comment, and neighbors

[Figure: O1 (concepts C1-C7) and O2 (concepts C'1-C'6) connected by edges labeled isA, subClass, hasFlavor, hasColor, hasFood, hasDrink, hasTopping]

VSM(C1) = [1 C1, 1 C2, 1 C3, 1 C5, 1 C6, 1 isA, 2 subClass, 1 hasFlavor]
VSM(C'1) = [1 C'3, 1 C'4, 1 C'5, 1 isA, 2 hasFlavor]

Ontology Matching: Vector Space Model (VSM)

• Update the VSM with word scores using TF/IDF
• Calculate cosine similarity for corresponding entities: cos(VSM(C1), VSM(C2))
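A self-contained sketch of these two bullets, assuming a '+1'-smoothed IDF (an implementation choice, not stated in the slides); the bags of words below are the VSM vectors from the previous slide.

```python
# TF-IDF weighting plus cosine similarity over the VSM bags of words.

from collections import Counter
from math import log, sqrt

def tfidf(terms, all_docs):
    """TF-IDF weights for one entity's bag of words (IDF smoothed with +1)."""
    tf = Counter(terms)
    n = len(all_docs)
    return {t: c * (log(n / sum(1 for d in all_docs if t in d)) + 1)
            for t, c in tf.items()}

def cosine(u, v):
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in u.keys() | v.keys())
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

vsm_c1 = ["C1", "C2", "C3", "C5", "C6", "isA", "subClass", "subClass", "hasFlavor"]
vsm_c1p = ["C'3", "C'4", "C'5", "isA", "hasFlavor", "hasFlavor"]
docs = [vsm_c1, vsm_c1p]
print(round(cosine(tfidf(vsm_c1, docs), tfidf(vsm_c1p, docs)), 3))
```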

Aligned Concepts

• Aggregate different ontologies
• Example

Aligned Concepts

• Statistical model

[Figure: global ontology (GO) with Entity1-Entity4 aggregating the labels O1:Person, O2:Person, O1:hasFather, O1:hasMaleParent, O2:hasFather, O1:hasMother, O1:hasFemaleParent, O2:hasMother, O1:hasMon, and O1:Harry, annotated with appearance probabilities α1-α4 and β1-β10 (β10 = 1)]

Aligned Concepts

• Calculate the probabilities of appearance of each entity in the GO
• Use maximum likelihood estimation
• Calculate the α and β values
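A hedged sketch of the MLE step, with made-up observation counts: each probability is estimated as the relative frequency of a label among all labels observed for the global ontology.

```python
# Maximum likelihood estimates of label appearance probabilities in the GO.

from collections import Counter

# Illustrative observations of labels across the source ontologies.
observations = [
    "hasFather", "hasMaleParent", "hasFather",     # Entity2 labels
    "hasMother", "hasFemaleParent", "hasMother",   # Entity3 labels
    "Person", "Person",                            # Entity1 labels
]

counts = Counter(observations)
total = sum(counts.values())
mle = {label: n / total for label, n in counts.items()}  # count / total
print(mle)
```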

Reification

Reification can be considered metadata about RDF/OWL statements.

Ontology Alignment approaches rely on probabilistic measures to find matches between concepts in different ontologies.

Reification data can be attached to the alignment information to show the 'match factor' between two concepts in OWL 2.

Advanced analytic algorithms can benefit from reification in establishing the relevance of search results.
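A minimal rdflib sketch of this idea: the alignment statement is reified with rdf:Statement and annotated with a match factor. The ex: names and the matchFactor property are illustrative, not a standard vocabulary.

```python
# Reify an alignment statement and annotate it with a confidence value.

from rdflib import Graph, Namespace, Literal, BNode
from rdflib.namespace import RDF, OWL, XSD

EX = Namespace("http://example.org/align#")
g = Graph()

# The alignment itself: O1's Person is declared equivalent to O2's Person.
g.add((EX.O1_Person, OWL.equivalentClass, EX.O2_Person))

# Reify that statement and attach the 'match factor' as metadata.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.O1_Person))
g.add((stmt, RDF.predicate, OWL.equivalentClass))
g.add((stmt, RDF.object, EX.O2_Person))
g.add((stmt, EX.matchFactor, Literal(0.95, datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```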

OWL 2

OWL 2 is an extension of OWL. Some of the new features in OWL 2 are:
• Syntactic sugar (e.g., disjoint union of classes)
• Property chains
• Richer datatypes and data ranges
• Qualified cardinality restrictions
• New constructs that increase expressivity
• Simple metamodeling capabilities
• Extended annotation capabilities

The following link lists all the new features in OWL 2: http://www.w3.org/TR/2009/REC-owl2-new-features-20091027/
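As one concrete example from this list, a property chain (hasParent ∘ hasParent ⊑ hasGrandparent) can be written as RDF triples with rdflib; the family vocabulary is made up.

```python
# OWL 2 property chain expressed as triples (owl:propertyChainAxiom).

from rdflib import Graph, Namespace, BNode
from rdflib.namespace import OWL
from rdflib.collection import Collection

EX = Namespace("http://example.org/family#")
g = Graph()

# hasParent o hasParent is a subproperty chain of hasGrandparent.
chain = BNode()
Collection(g, chain, [EX.hasParent, EX.hasParent])  # RDF list for the chain
g.add((EX.hasGrandparent, OWL.propertyChainAxiom, chain))

print(g.serialize(format="turtle"))
```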

Ontology Extraction from Text Documents

Problem Statement

Our solution for ontology construction from documents (see the sketch after this list):
◦ Use a hierarchical clustering algorithm to build a hierarchy for the documents:
• Hierarchical Agglomerative Clustering (HAC)
• Modified Self-Organizing Tree (MSOT)
• Hierarchical Growing Self-Organizing Tree (HGSOT)
◦ Assign a concept to each node in the hierarchy using WordNet
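A minimal sketch of the HAC option, using scipy over made-up TF-IDF document vectors:

```python
# Build a document hierarchy with hierarchical agglomerative clustering.

import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy TF-IDF matrix: 4 documents x 5 terms (made-up numbers).
docs = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.2],
    [0.8, 0.2, 0.1, 0.0, 0.1],
    [0.0, 0.1, 0.9, 0.7, 0.0],
    [0.1, 0.0, 0.8, 0.9, 0.1],
])

tree = linkage(docs, method="average", metric="cosine")
print(tree)  # each row merges two clusters; rows form the document hierarchy
```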

Concept Assignment

Concept assignment to documents:

LVQ1: a topic vector (t) is built by training with the training documents.

Clusters in LVQ are predefined. Each topic cluster is represented by a node in the output map, and LVQ uses pre-labeled data for training. Only the best-match node's vector (the winning vector) is updated, not its neighbors. The vector update rule uses the following equations, where δ(i, c(x)) is 1 if node i is the best match c for input x and 0 otherwise:

If data x and the best-match node c belong to the same class:
w_i(t+1) = w_i(t) + α(t) δ(i, c(x)) [x(t) − w_i(t)]

If data x and the best-match node c belong to different classes:
w_i(t+1) = w_i(t) − α(t) δ(i, c(x)) [x(t) − w_i(t)]
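A small NumPy sketch of one LVQ1 update, matching the rule above (the function name is illustrative):

```python
# One LVQ1 step: only the winning node moves, toward x when the classes
# agree and away from x when they differ.

import numpy as np

def lvq1_step(weights, node_labels, x, x_label, alpha):
    """weights: (nodes, dims) array; node_labels: class label per node."""
    c = int(np.argmin(np.linalg.norm(weights - x, axis=1)))  # best-match node
    sign = 1.0 if node_labels[c] == x_label else -1.0
    weights[c] += sign * alpha * (x - weights[c])
    return c
```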

Concept Assignment

◦ Concept sense disambiguation
• One keyword can be associated with more than one concept in WordNet; for example, the keyword "gold" has four senses in WordNet and the keyword "copper" has five.
• For disambiguation of concepts we apply the same technique (i.e., the cosine similarity measure) used in topic tracking.
• To construct a vector for each sense, we use the short description (gloss) that appears in WordNet.
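A sketch of this disambiguation, assuming NLTK's WordNet corpus is installed; the context words are made up.

```python
# Pick the WordNet sense whose gloss vector is closest to the context.
# Assumes nltk.download('wordnet') has been run.

from collections import Counter
from math import sqrt
from nltk.corpus import wordnet as wn

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def disambiguate(keyword, context_words):
    """Return the sense whose gloss best matches the context (bag of words)."""
    ctx = Counter(w.lower() for w in context_words)
    return max(wn.synsets(keyword),
               key=lambda s: cosine(Counter(s.definition().lower().split()), ctx))

print(disambiguate("gold", ["metal", "mining", "yellow", "precious"]))
```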

Concept Assignment

Concept assignment for a leaf node:
◦ If a majority of its documents have the same concept, we assign that concept to the leaf.
◦ If there is no majority, we choose from WordNet a generic concept subsuming all the concepts and assign it to the leaf.

Concept assignment for a non-leaf node:
◦ If a majority of its children have the same concept, we assign that concept to the internal node.
◦ If there is no majority, we choose from WordNet a generic concept subsuming all the concepts and assign it to the internal node.
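A sketch of this assignment rule, using NLTK's WordNet and the lowest common hypernym as the 'generic concept' (one reasonable reading of the rule; noun senses are assumed):

```python
# Majority vote over child concepts, falling back to a common hypernym.
# Assumes nltk.download('wordnet') has been run.

from collections import Counter
from nltk.corpus import wordnet as wn

def assign_concept(child_concepts):
    """child_concepts: list of WordNet synsets from children (or documents)."""
    counts = Counter(child_concepts)
    concept, n = counts.most_common(1)[0]
    if n > len(child_concepts) / 2:       # strict majority wins
        return concept
    generic = child_concepts[0]
    for s in child_concepts[1:]:          # otherwise generalize pairwise
        generic = generic.lowest_common_hypernyms(s)[0]
    return generic

gold = wn.synsets("gold")[0]
copper = wn.synsets("copper")[0]
print(assign_concept([gold, copper]))  # falls back to a shared hypernym
```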