
Ontology-based Clustering in a Peer Data Management System

Ph.D. Thesis

Carlos Eduardo Santos Pires ([email protected])

Advisor: Ana Carolina Salgado ([email protected])

April 27, 2009 - Recife, PE, Brazil

Introduction

Peer Data Management Systems (PDMS)

[Figure: a PDMS with five peers, P1 to P5. Legend: peer, exported schema, data source, local mapping, schema mapping.]

Problem

• Arbitrary approach for connecting peers is inefficient
• Peers sharing
  – Different content (exported schemas)
    • Neighbors in the overlay network
  – Similar content
    • Positioned far from each other or even isolated in the overlay network

Motivation

• Semantic Communities
  – Put together peers with common interests about a specific topic
  – Formulated queries are transmitted among the peers of the community
  – Peers are organized according to a P2P network topology
  – Exported schemas represented by ontologies
  – Creation and maintenance is a challenging issue

Goal

• Main Contribution
  – A process for clustering peers into the semantic communities that compose a PDMS
• To achieve this objective, we propose…
  – Ontology-based PDMS architecture
  – Ontology matching process
    • Global Similarity measure
  – Automatic process for summarizing ontologies
  – Peer Clustering Process

Background

• Ontology Matching and Merging

[Figure: ontology matching takes ontologies oi and oj, an alignment A, parameters (P), and resources (R), and returns an alignment A’; ontology merging produces an ontology ok.]

Background

• Clustering
  – Automatic process of partitioning a finite set of objects into a set of meaningful clusters
  – Exclusive and unsupervised classification
• Clustering Issues (Peer Clustering)
  – Object set availability
  – Sensitivity to input order
• Cluster Validity
  – External and Internal

SPEED: Semantic PEEr Data Management System

[Figure: SPEED architecture. Semantic peers (SP1, SP2, SP3, …, SPi) are connected through a DHT network. Within a semantic community, integration peers (IPi1, IPi2, …, IPij) form an unstructured super-peer network. Each integration peer and its data peers (for example DPij1, DPij2, …, DPijk) form a semantic cluster.]

Ontologies in SPEED

[Figure: ontologies in a semantic community. Levels: semantic peer (SPi), integration peer (IPij), and the data peers (DPij1, DPij2, …, DPijk) of a cluster. Ontologies shown: Community Ontology, Cluster Ontology, Summarized Cluster Ontology, and Local Ontology.]

Other Definitions

• Requesting Peer
  – Peer wishing to join the system
  – Connected as a data peer or integration peer
• Semantic Neighbor (Cluster)
  – Belong to the same community
  – Share semantically similar content
• Semantic Neighborhood
  – Set of semantic neighbors of a cluster

Data Peer

Integration Peer

Semantic Peer

Architectural Considerations

• Why a DHT network?
  – Efficient searches and sensitivity to changes in the structure
  – Semantic Peers
    • High reliability, network bandwidth, and availability
• Why does a peer take part in only one cluster?
  – Avoids duplication of query results

Architectural Considerations

• Why a super-peer network?
  – Provides an environment that is better suited to the establishment of schema mappings
  – Facilitates query routing
  – Avoids multiple successive reformulations
  – Exploits the physical heterogeneity of peers
• Why a semantic index?
  – Avoids starting the search for a semantically similar cluster in an ad-hoc manner

SPEED vs. Related PDMS

| PDMS | Network Topology | Network Population | Domains | Neighborhood Search | Semantic Similarity |
|---|---|---|---|---|---|
| OntSum | Unstructured | Not empty | Predefined | Flooding; short and long distance links | Ontology Matching |
| Sunrise | Unstructured | Empty | Non-existing | Centralized Access Point Structure (APS); SCI | Distance between concepts |
| Helios | Unstructured | Not empty | Non-existing | Flooding | Ontology Matching |
| SPEED | Mixed | Empty | Predefined | Semantic index and flooding | Ontology Matching |

Matching Process

[Figure: matching process in two phases (Phase 1, Phase 2). Inputs: Ontology Oi, Ontology Oj, and a Domain Ontology. Steps, numbered 1 to 5 in the figure: Linguistic-Structural Matching (any matcher), Semantic Matching (Semantic Rules Application), Similarity Combination (with weights), Correspondence Ranking, and Correspondence Selection. Alignments shown: ALS, ASE, ACO, and the final Aij, with cardinalities 1:n or n:m, and 1:1.]

Example (Semiport and UnivBench)

Global Similarity Measure

$$\mathit{WeightedAverage}(O_i, O_j) = \frac{(1.0 + 1.0 + 0.3 + 0.8 + 0.8) + (1.0 + 1.0 + 0.3 + 0.8 + 0.0 + 0.8 + 0.8)}{|6| + |7|} = 0.66$$
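The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch: the function name `weighted_average` and the reading of the denominator as the concept counts of the two ontologies (6 and 7) are assumptions, not the thesis implementation.

```python
def weighted_average(sims_i, sims_j, size_i, size_j):
    """Sum of per-concept similarity values divided by the total concept count.

    sims_i / sims_j: similarity values listed for each ontology (assumed reading).
    size_i / size_j: number of concepts in each ontology (assumed reading of |6| and |7|).
    """
    if size_i + size_j == 0:
        raise ValueError("both ontologies are empty")
    return (sum(sims_i) + sum(sims_j)) / (size_i + size_j)


# Values as they appear on the slide.
sims_oi = [1.0, 1.0, 0.3, 0.8, 0.8]
sims_oj = [1.0, 1.0, 0.3, 0.8, 0.0, 0.8, 0.8]

score = weighted_average(sims_oi, sims_oj, 6, 7)
print(round(score, 2))  # 0.66
```

The result, 8.6 / 13, rounds to the 0.66 shown on the slide.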


Implementation Issues

Experiments

[Figures: recall and precision (0% to 100%) of COMA++, H-Match, and Falcon-AO, each measured with Linguistic + Structural matching and with Linguistic + Structural + Semantic matching.]

$$P(A, R) = \frac{|R \cap A|}{|A|} \qquad R(A, R) = \frac{|R \cap A|}{|R|}$$
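Precision and recall of an alignment A against a reference alignment R reduce to set operations. The sketch below assumes each correspondence is represented as a pair of concept names; the sample correspondences are hypothetical and are not taken from the experiments.

```python
def precision(found, reference):
    """Fraction of found correspondences that are also in the reference."""
    return len(found & reference) / len(found) if found else 0.0


def recall(found, reference):
    """Fraction of reference correspondences that were found."""
    return len(found & reference) / len(reference) if reference else 0.0


# Hypothetical alignments: each correspondence is a (concept_in_Oi, concept_in_Oj) pair.
found = {("Professor", "Faculty"), ("Student", "Student"), ("Course", "Module")}
reference = {
    ("Professor", "Faculty"),
    ("Student", "Student"),
    ("Paper", "Publication"),
    ("Course", "Course"),
}

print(precision(found, reference))  # 2 of 3 found correspondences are correct
print(recall(found, reference))     # 2 of 4 reference correspondences were found
```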


OWLSum: an Ontology Summarization Process

• Main use in Peer Clustering
  – Summarize cluster ontologies (semantic index)
    • A summary does not represent a cluster ontology in its entirety
  – Improve ontology matching

OS = Subontology(O), where O is a cluster ontology

Relevance Measures

• Centrality: relationships (number and type) of a concept with other concepts in an ontology O

• Frequency: occurrences of a concept in local ontologies O1,…,On that compose O

[Equation: centrality(c_n), combining the terms n, w, and max for the subscripts s and ud with nr and |C| − 1.]

$$\mathit{frequency}(c_n) = \frac{|\mathit{correspondences}(c_n)|}{|O_1, \ldots, O_n|}$$

Summarization Process

Example

| Concept | Centrality | Frequency | Relevance |
|---|---|---|---|
| ServerSoftware | 0.231 | 2 | 0.231 |
| Cable | 0.192 | 2 | 0.192 |
| NetworkNode | 0.192 | 2 | 0.192 |
| SwitchEquipment | 0.192 | 2 | 0.192 |
| Computer | 0.192 | 2 | 0.192 |
| Software | 0.192 | 2 | 0.192 |
| Equipment | 0.115 | 2 | 0.115 |
| SecurityEquipment | 0.115 | 2 | 0.115 |
| RoutingComputer | 0.077 | 2 | 0.077 |
| NodePair | 0.077 | 2 | 0.077 |
| … | | | |

Ontology Summary candidates (Group1 and Group2 are the concept groups marked in the figure):
• Group1 + NodePair + Group2: Recall = 100%, Precision = 86%, F-measure = 92.5%, Size = 7, Relevance Average = 0.181
• Group1 + Equipment + Group2: Recall = 100%, Precision = 86%, F-measure = 92.5%, Size = 7, Relevance Average = 0.187
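The two relevance averages in this example can be checked numerically. The sketch below assumes that Group1 and Group2 together contain the six concepts with relevance of at least 0.192, which is consistent with the reported summary size of 7; the slide text itself does not list the group members.

```python
# Relevance values as listed in the example table.
relevance = {
    "ServerSoftware": 0.231,
    "Cable": 0.192,
    "NetworkNode": 0.192,
    "SwitchEquipment": 0.192,
    "Computer": 0.192,
    "Software": 0.192,
    "Equipment": 0.115,
    "SecurityEquipment": 0.115,
    "RoutingComputer": 0.077,
    "NodePair": 0.077,
}

# Assumption: Group1 and Group2 hold the six most relevant concepts.
core = [concept for concept, value in relevance.items() if value >= 0.192]


def average_relevance(concepts):
    """Mean relevance of the concepts in a candidate summary."""
    return sum(relevance[c] for c in concepts) / len(concepts)


# Each candidate joins the two groups through one connecting concept.
for connector in ("NodePair", "Equipment"):
    summary = core + [connector]
    print(connector, len(summary), round(average_relevance(summary), 3))
```

Under that assumption the averages come out to 0.181 for NodePair and 0.187 for Equipment, matching the slide.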


Experiments

| conference.owl | 4-Concept | 8-Concept |
|---|---|---|
| Expert 3 vs. OWLSum | 75% | 75% |
| User agreement vs. OWLSum | 75% | 75% |
| User agreement vs. OntoSum | 50% | 50% |

Peer Clustering in SPEED

Search for a Semantic Community

Clustering Algorithm

• Inspired by the Leader algorithm [Hartigan, 1975]
• Main steps
  – Step 1. Search for Initial Cluster in Semantic Index
  – Step 2. Search for Most Similar Cluster
  – Step 3. Connection of a Requesting Peer

Step 1. Search for Initial Cluster in Semantic Index

• Requesting Peer RPn sends its local ontology LOn to a semantic peer
• Search in the Semantic Index
  – SemMatch (LOn, OSij)
  – For each index entry a global similarity measure is produced
  – Initial cluster: the index entry with the highest global measure
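Step 1 amounts to scoring every semantic index entry against the local ontology and keeping the best one. The sketch below is illustrative only: `sem_match_stub` is a hypothetical stand-in (Jaccard overlap of concept names) for the real SemMatch, and the Cluster08 index entry is made up for the example.

```python
def sem_match_stub(local_concepts, summary_concepts):
    """Hypothetical stand-in for SemMatch: Jaccard overlap of concept names."""
    union = local_concepts | summary_concepts
    return len(local_concepts & summary_concepts) / len(union) if union else 0.0


def find_initial_cluster(local_concepts, semantic_index, match=sem_match_stub):
    """Score each index entry and return (best cluster id, all scores)."""
    scores = {
        cluster_id: match(local_concepts, summary)
        for cluster_id, summary in semantic_index.items()
    }
    if not scores:
        return None, scores  # empty index: no initial cluster exists yet
    best = max(scores, key=scores.get)
    return best, scores


# Semantic index: cluster id -> concepts of its ontology summary (OS).
semantic_index = {
    "Cluster45": {"Exhibition", "Event", "Conference", "Workshop"},
    "Cluster08": {"Course", "Student", "Professor"},  # hypothetical entry
}
local_ontology = {"Conference", "Workshop", "Paper"}

initial, scores = find_initial_cluster(local_ontology, semantic_index)
print(initial, scores)
```

Passing the matcher as a parameter keeps the search logic independent of how the global similarity measure is computed.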


Step 2. Search for Most Similar Cluster

[Figure: requesting peer RPn sends its local ontology LOn to integration peers IPi1 to IPi5 in the unstructured super-peer network. Global similarity measures returned: IPi1: 0.2, IPi2: 0.6, IPi3: 0.7, IPi4: 0.4. Search scope by ConnectTTL: ConnectTTL = 1 reaches the initial cluster; ConnectTTL = 2 reaches the initial cluster and its direct neighbors; ConnectTTL = 3 also reaches indirect neighbors.]
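The effect of ConnectTTL on the search scope can be sketched as a breadth-first traversal that starts at the initial cluster. The similarity values below are the ones shown in the figure; the neighbor links and the choice of IPi2 as the initial cluster are hypothetical, since the figure topology is not recoverable from the text.

```python
from collections import deque


def search_most_similar(initial, neighbors, similarity, connect_ttl):
    """Visit clusters breadth-first from `initial`.

    connect_ttl = 1 visits only the initial cluster, 2 adds its direct
    neighbors, 3 adds their neighbors, and so on.
    Returns the most similar visited cluster and the set of visited clusters.
    """
    visited = {initial}
    frontier = deque([(initial, 1)])
    best = initial
    while frontier:
        cluster, depth = frontier.popleft()
        if similarity[cluster] > similarity[best]:
            best = cluster
        if depth < connect_ttl:
            for nxt in neighbors.get(cluster, ()):
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, depth + 1))
    return best, visited


# Global similarity between LOn and each cluster ontology (from the figure).
similarity = {"IPi1": 0.2, "IPi2": 0.6, "IPi3": 0.7, "IPi4": 0.4}
# Hypothetical neighbor links between integration peers.
neighbors = {"IPi2": ["IPi1", "IPi3"], "IPi3": ["IPi4"]}

for ttl in (1, 2, 3):
    print(ttl, search_most_similar("IPi2", neighbors, similarity, ttl))
```

A larger ConnectTTL costs more messages and matcher executions but can find a better cluster than the one suggested by the semantic index.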


Step 3. Connection of a Requesting Peer

• Case 1: MAX(global measure) ≥ cluster threshold
  – RPn joins chosen cluster (Data Peer)
  – Merge (CLOij, LOn)
  – Semantic index is updated
• Case 2: otherwise
  – RPn creates new cluster (Integration Peer)
  – CLOij = LOn
  – A new entry is added to the semantic index
  – Semantic Neighborhood: Neighbor Threshold

[Figure: similarity scale from 0.0 to 1.0 with the Neighbor Threshold and the Cluster Threshold marked (the 0.3 and 0.8 ticks appear next to the threshold labels) and clusters IPi1, IPi2, IPi3, and IPi4 placed on the scale.]
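The two connection cases can be sketched as a single decision function. The similarity values come from the Step 2 figure; reading 0.3 as the neighbor threshold and 0.8 as the cluster threshold is an assumption, as is the use of "≥" when selecting semantic neighbors.

```python
def connect(similarities, cluster_threshold, neighbor_threshold):
    """Decide how a requesting peer connects, given its similarity to each cluster."""
    best = max(similarities, key=similarities.get) if similarities else None
    if best is not None and similarities[best] >= cluster_threshold:
        # Case 1: join the most similar cluster as a data peer.
        return {"action": "join", "cluster": best, "neighbors": []}
    # Case 2: create a new cluster as an integration peer; clusters that are
    # similar enough become its semantic neighbors (assumed comparison: >=).
    neighbors = sorted(
        c for c, s in similarities.items() if s >= neighbor_threshold
    )
    return {"action": "create", "cluster": None, "neighbors": neighbors}


similarities = {"IPi1": 0.2, "IPi2": 0.6, "IPi3": 0.7, "IPi4": 0.4}

# Assumed thresholds read from the figure: cluster 0.8, neighbor 0.3.
print(connect(similarities, cluster_threshold=0.8, neighbor_threshold=0.3))
# A lower cluster threshold, such as the 0.35 used in the experiments, leads to a join.
print(connect(similarities, cluster_threshold=0.35, neighbor_threshold=0.3))
```

With the assumed thresholds, no cluster reaches 0.8, so the peer creates a new cluster whose neighbors are IPi2, IPi3, and IPi4.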


Maintenance Considerations

• Evolution of cluster ontologies
  – Peer connection and disconnection
• Peer Disconnection
  – Removal of elements and semantic mappings
• Update of Cluster Neighborhood
• Recalculation of Global Similarity Measure
  – Similarity is calculated when a requesting peer joins a cluster

Implementation Issues

• SPEED Simulator
  – Implementation: Java
  – Integrated with SemMatch and OWLSum
  – Ontology Library (Education)
• Ontology Merging
  – Implementation: Java, OWL API
  – String-match

SPEED Simulator

Input File:
LO02-Education.owl
LO45-Education.owl
LO41-Education.owl
LO40-Education.owl
LO01-Education.owl
LO03-Education.owl
LO36-Education.owl
LO05-Education.owl
LO20-Education.owl
LO15-Education.owl
LO06-Education.owl
LO27-Education.owl
LO26-Education.owl
...

Log File:
Tue Mar 24 18:18:45 GMT-03:00 2009
RP45 is now connecting...
RP45 is now a Integration Peer with out semantic neighbors
Semantic Index:
<<Cluster: 45>> Exhibition(1) Event(1) Conference(1) Workshop(1)
Network:
Domain: education (represented by SP: 100)
Cluster45(RP45)
…
Network:
Domain: education (represented by SP: 100)
Cluster45(RP45, RP13, RP36, RP29, RP42)
Cluster08(RP08, RP20, RP02, RP05, RP06, RP27, RP26, RP16, RP30)
Cluster44(RP44, RP38, RP39, RP41, RP22, RP33)
Cluster37(RP37, RP32, RP19, RP40)
Cluster15(RP15, RP11, RP31, RP21, RP07, RP17, RP18, RP03)
Cluster24(RP24, RP14, RP34, RP43)
Cluster28(RP28, RP01, RP23, RP35, RP12, RP04, RP09, RP25, RP10)
Total number of messages: 561
#matchings between OS and LO: 251
#matchings between CLOs: 42
#matchings between CLO and LO: 42
Simulation time: 1161 seconds
External indices: RandIndex=0.942 JaccardCoefficiet=0.646 FMIndex=0.785 Hubbert=0.752

Experiments

• #Requesting Peers: 45
• Search Strategy
  – allClusters vs. limitedClusters
• allClusters
  – Semantic index is discarded
• limitedClusters
  – Multiple executions
  – Different orders of requesting peers

Experiments: limitedClusters

Internal Indices (similarity between peers of the same cluster; dissimilarity between peers of distinct clusters)

| Cluster Threshold | 0.25 | 0.35 | 0.45 | 0.55 | 0.65 |
|---|---|---|---|---|---|
| Global Silhouette | 0.479 | 0.505 | 0.336 | 0.307 | 0.304 |

External Indices (“Golden Standard”: hierarchical clustering algorithm)

| Cluster Threshold | 0.25 | 0.35 | 0.45 | 0.55 | 0.65 |
|---|---|---|---|---|---|
| Rand Index | 0.928 | 0.942 | 0.935 | 0.928 | 0.901 |
| Jaccard Coefficient | 0.629 | 0.649 | 0.530 | 0.454 | 0.246 |
| Fowlkes-Mallows Index | 0.785 | 0.788 | 0.713 | 0.664 | 0.486 |
| Hubert's Statistic | 0.748 | 0.755 | 0.682 | 0.636 | 0.458 |
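Three of the external indices reported here (Rand, Jaccard, and Fowlkes-Mallows) are pair-counting measures: they compare, for every pair of peers, whether the produced clustering and the reference clustering agree on placing the two peers together. The sketch below uses the standard textbook definitions; the sample labelings are hypothetical, and this is not the simulator's code.

```python
from itertools import combinations
from math import sqrt


def pair_counts(produced, reference):
    """Count pairs by whether each clustering puts the two objects together."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(produced)), 2):
        same_p = produced[i] == produced[j]
        same_r = reference[i] == reference[j]
        if same_p and same_r:
            ss += 1  # together in both
        elif same_p:
            sd += 1  # together only in the produced clustering
        elif same_r:
            ds += 1  # together only in the reference clustering
        else:
            dd += 1  # apart in both
    return ss, sd, ds, dd


def external_indices(produced, reference):
    """Return (Rand index, Jaccard coefficient, Fowlkes-Mallows index)."""
    if len(produced) != len(reference) or len(produced) < 2:
        raise ValueError("need two labelings of the same length (>= 2)")
    ss, sd, ds, dd = pair_counts(produced, reference)
    rand = (ss + dd) / (ss + sd + ds + dd)
    jaccard = ss / (ss + sd + ds) if ss + sd + ds else 1.0
    fm = ss / sqrt((ss + sd) * (ss + ds)) if (ss + sd) and (ss + ds) else 0.0
    return rand, jaccard, fm


# Hypothetical cluster labels for six peers.
produced = [0, 0, 0, 1, 1, 2]
reference = [0, 0, 1, 1, 1, 2]
print(external_indices(produced, reference))
```

All three indices equal 1.0 when the two clusterings are identical, which makes them convenient for comparing search strategies against a reference partition.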


limitedClusters vs. allClusters

#Executions of SemMatch by search strategy (Cluster Threshold = 0.35)

| Search Strategy | SemMatch (CLO,CLO) | SemMatch (CLO,LO) | SemMatch (OS,LO) |
|---|---|---|---|
| allClusters | 122 | 212 | 0 |
| limitedClusters | 91 | 155 | 271 |

LO = Local Ontology, CLO = Cluster Ontology, OS = Ontology Summary

External Indices (Cluster Threshold = 0.35)

| Search Strategy | Rand Index | Jaccard Coefficient | Fowlkes-Mallows Index | Hubert's Statistic |
|---|---|---|---|---|
| allClusters | 0.970 | 0.675 | 0.794 | 0.778 |
| limitedClusters | 0.942 | 0.649 | 0.788 | 0.755 |

Experiments

Query Routing (limitedClusters, Cluster Threshold = 0.35)

| Number of hops | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Reached Relevant Peers (%) | 16.4% | 51.7% | 94.8% | 100.0% |

Transmitted Messages (limitedClusters vs. allClusters, Cluster Threshold = 0.35)

| Search Strategy | #Messages |
|---|---|
| allClusters | 2509 |
| limitedClusters | 1962 |

Conclusions

• Incremental process to cluster semantically similar peers in a PDMS

• Peers are organized according to a mixed P2P topology

• Ontologies are used to represent exported schemas

• Peer clustering is assisted by
  – Ontology matching
  – Ontology summarization

Contributions

• Ontology-based PDMS Architecture
• Ontology Matching Process
  – Determine most similar cluster
  – Determine cluster neighborhood
  – Search in the semantic index
• Ontology Summarization Process
  – Cluster Ontologies
• Incremental Peer Clustering Process

Future Work

• SemMatch
  – Consider properties of the concepts
• OWLSum
  – Apply transitivity rules in order to eliminate non-relevant concepts
• Peer Clustering
  – Improve semantic index, e.g. organization and search
  – Consider peer disconnection
• Load Balancing
  – Merging and splitting of clusters
