Ontology-based Clustering in a Peer Data Management System
Ph.D. Thesis
Carlos Eduardo Santos Pires ([email protected])
Advisor: Ana Carolina Salgado ([email protected])
April 27, 2009 - Recife, PE, Brazil
Introduction
Peer Data Management Systems (PDMS)
[Figure: example PDMS with peers P1 to P5. Legend: peer, data source, exported schema, local mapping, schema mapping.]
Problem
• Arbitrary approach for connecting peers is inefficient
• Peers sharing
  – Different content (exported schemas)
    • Neighbors in the overlay network
  – Similar content
    • Positioned far from each other or even isolated in the overlay network
Motivation
• Semantic Communities
  – Put together peers with common interests about a specific topic
  – Formulated queries are transmitted among the peers of the community
  – Peers are organized according to a P2P network topology
  – Exported schemas represented by ontologies
  – Creation and maintenance is a challenging issue
Goal
• Main Contribution
  – A process for clustering peers into the semantic communities that compose a PDMS
• To achieve this objective, we propose…
  – Ontology-based PDMS architecture
  – Ontology matching process
    • Global Similarity measure
  – Automatic process for summarizing ontologies
  – Peer Clustering Process
Background
• Ontology Matching and Merging
[Figure: Ontology Matching takes ontologies oi and oj, an input alignment A, parameters (P), and resources (R), and produces an alignment A'; Ontology Merging produces a merged ontology ok.]
Background
• Clustering
  – Automatic process of partitioning a finite set of objects into a set of meaningful clusters
  – Exclusive and unsupervised classification
• Clustering Issues (Peer Clustering)
  – Object set availability
  – Sensitivity to input order
• Cluster Validity
  – External and Internal
SPEED: Semantic PEEr Data Management System
[Figure: SPEED architecture. Semantic peers (SP1, SP2, SP3, ..., SPi) are connected through a DHT network. Each semantic community is an unstructured super-peer network of integration peers (IPi1, IPi2, ..., IPij). Each semantic cluster groups an integration peer with its data peers (DPij1, DPij2, ..., DPijk). Legend: SP = Semantic Peer, IP = Integration Peer, DP = Data Peer.]
Ontologies in SPEED
[Figure: ontologies in SPEED. Levels: semantic peer SPi (community), integration peer IPij (cluster), data peers DPij1, DPij2, ..., DPijk. Ontology labels: community ontology, summarized cluster ontology, cluster ontology, local ontology.]
Other Definitions
• Requesting Peer
  – Peer wishing to join the system
  – Connected as a data peer or integration peer
• Semantic Neighbor (Cluster)
  – Belong to the same community
  – Share semantically similar content
• Semantic Neighborhood
  – Set of semantic neighbors of a cluster
Data Peer
Integration Peer
Semantic Peer
Architectural Considerations
• Why a DHT network?
  – Efficient searches and sensitivity to changes in the structure
  – Semantic Peers
    • High reliability, network bandwidth, and availability
• Why does a peer take part in only one cluster?
  – Avoid duplication of query results
Architectural Considerations
• Why a super-peer network?
  – Provides an environment that is better suited to the establishment of schema mappings
  – Facilitates query routing
  – Avoids multiple successive reformulations
  – Exploits the physical heterogeneity of peers
• Why a semantic index?
  – Avoids starting the search for a semantically similar cluster in an ad-hoc manner
SPEED vs. Related PDMS
PDMS    | Network Topology | Network Population | Domains      | Neighborhood Search                           | Semantic Similarity
OntSum  | Unstructured     | Not empty          | Predefined   | Flooding; short and long distance links       | Ontology Matching
Sunrise | Unstructured     | Empty              | Non existing | Centralized Access Point Structure (APS); SCI | Distance between concepts
Helios  | Unstructured     | Not empty          | Non existing | Flooding                                      | Ontology Matching
SPEED   | Mixed            | Empty              | Predefined   | Semantic index and flooding                   | Ontology Matching
Matching Process
[Figure: matching process in two phases (Phase 1, Phase 2) with five numbered steps: linguistic-structural matching (any matcher, alignment ALS), semantic matching (semantic rules application with a domain ontology, alignment ASE), weighted similarity combination (alignment ACO), correspondence ranking, and correspondence selection (final alignment Aij). Inputs: Ontology Oi and Ontology Oj. Alignment cardinalities shown: 1:n or n:m, and 1:1.]
Example (Semiport and UnivBench)
Global Similarity Measure
\[ \mathrm{WeightedAverage}(O_i, O_j) = \frac{(1.0 + 1.0 + 0.3 + 0.8 + 0.8) + (1.0 + 1.0 + 0.3 + 0.8 + 0.0 + 0.8 + 0.8)}{|6| + |7|} = 0.66 \]
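The arithmetic of this example can be checked with a minimal sketch. It uses the similarity values and ontology sizes (6 and 7) from the slide; the function name `global_similarity` and the assignment of the two value lists to O_i and O_j are assumptions made for illustration, not SemMatch's actual code.

```python
def global_similarity(sims_i, sims_j, size_i, size_j):
    """Weighted average: sum of the similarity values found for both
    ontologies, divided by the total number of concepts in both."""
    return (sum(sims_i) + sum(sims_j)) / (size_i + size_j)


# Similarity values from the slide's example (which list belongs to which
# ontology is an assumption).
sims_i = [1.0, 1.0, 0.3, 0.8, 0.8]
sims_j = [1.0, 1.0, 0.3, 0.8, 0.0, 0.8, 0.8]

score = global_similarity(sims_i, sims_j, size_i=6, size_j=7)
print(round(score, 2))  # 0.66
```

The sum of all values is 8.6 and the two ontologies hold 13 concepts, which gives the 0.66 shown on the slide.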
Implementation Issues
Experiments
[Figure: two bar charts, Recall and Precision (0% to 100%), for COMA++, H-Match, and Falcon-AO, comparing Linguistic + Structural matching with Linguistic + Structural + Semantic matching.]
\[ \mathrm{Precision}(A, R) = \frac{|R \cap A|}{|A|} \qquad \mathrm{Recall}(A, R) = \frac{|R \cap A|}{|R|} \]
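A minimal sketch of how the two measures on this slide are computed, taking A as the alignment produced by a matcher and R as the reference alignment (the usual reading of these symbols). The correspondences below are hypothetical examples, not data from the experiments.

```python
def precision_recall(found, reference):
    """Precision = |R ∩ A| / |A| and Recall = |R ∩ A| / |R|."""
    found, reference = set(found), set(reference)
    correct = found & reference
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical correspondences, written as (concept in Oi, concept in Oj).
A = {("Professor", "Faculty"), ("Student", "Student"), ("Course", "Publication")}
R = {("Professor", "Faculty"), ("Student", "Student"),
     ("Course", "Course"), ("Paper", "Publication")}

precision, recall = precision_recall(A, R)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three produced correspondences are correct, and two of the four reference correspondences are found.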
OWLSum: an Ontology Summarization Process
• Main use in Peer Clustering
  – Summarize cluster ontologies (semantic index)
    • A summary does not represent a cluster ontology in its entirety
  – Improve ontology matching
[Figure: cluster ontology O and its summary, OS = Subontology(O)]
Relevance Measures
• Centrality: relationships (number and type) of a concept with other concepts in an ontology O
• Frequency: occurrences of a concept in local ontologies O1,…,On that compose O
[Equation: centrality(c_n), expressed in terms of r, n_s, w_s, max_s, n_ud, w_ud, max_ud, and |C| − 1]
\[ \mathrm{frequency}(c_n) = \frac{|\mathrm{correspondences}(c_n)|}{|O_1, \ldots, O_n|} \]
Summarization Process
Example
Concept           | Centrality | Frequency | Relevance
ServerSoftware    | 0.231      | 2         | 0.231
Cable             | 0.192      | 2         | 0.192
NetworkNode       | 0.192      | 2         | 0.192
SwitchEquipment   | 0.192      | 2         | 0.192
Computer          | 0.192      | 2         | 0.192
Software          | 0.192      | 2         | 0.192
Equipment         | 0.115      | 2         | 0.115
SecurityEquipment | 0.115      | 2         | 0.115
RoutingComputer   | 0.077      | 2         | 0.077
NodePair          | 0.077      | 2         | 0.077
…
Group1 + NodePair + Group2: Recall = 100%, Precision = 86%, F-measure = 92.5%, Size = 7, Relevance Average = 0.181
Group1 + Equipment + Group2: Recall = 100%, Precision = 86%, F-measure = 92.5%, Size = 7, Relevance Average = 0.187
[Figure: Ontology Summary, showing the concept groups Group1 and Group2]
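The two relevance averages in this example can be reproduced with a short sketch. It assumes that Group1 and Group2 together hold the six most relevant concepts of the table, which is consistent with Size = 7 and with both reported averages; the slide does not list the members of each group, and the rule of preferring the higher average is likewise an assumption.

```python
# Relevance values taken from the example table.
relevance = {
    "ServerSoftware": 0.231, "Cable": 0.192, "NetworkNode": 0.192,
    "SwitchEquipment": 0.192, "Computer": 0.192, "Software": 0.192,
    "Equipment": 0.115, "SecurityEquipment": 0.115,
    "RoutingComputer": 0.077, "NodePair": 0.077,
}

# Assumption: Group1 and Group2 together contain the six top-ranked concepts.
grouped = ["ServerSoftware", "Cable", "NetworkNode",
           "SwitchEquipment", "Computer", "Software"]


def relevance_average(concepts):
    """Mean relevance of the concepts in a candidate summary."""
    return sum(relevance[c] for c in concepts) / len(concepts)


# Each candidate joins the two groups through one additional concept.
candidates = {link: relevance_average(grouped + [link])
              for link in ("NodePair", "Equipment")}
best = max(candidates, key=candidates.get)

for link, avg in candidates.items():
    print(link, round(avg, 3))
print("higher average:", best)
```

Both candidates have the same size, recall, and precision, so the relevance average is the only value on the slide that separates them.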
Experiments
conference.owl               | 4-Concept | 8-Concept
Expert 1 vs. OWLSum          | 75%       | 88%
Expert 2 vs. OWLSum          | 50%       | 75%
Expert 3 vs. OWLSum          | 75%       | 75%
User agreement vs. OWLSum    | 75%       | 75%
User agreement vs. OntoSum   | 50%       | 50%
Peer Clustering in SPEED
Search for a Semantic Community
Clustering Algorithm
• Inspired by the Leader algorithm [Hartigan, 1975]
• Main steps
  – Step 1. Search for Initial Cluster in Semantic Index
  – Step 2. Search for Most Similar Cluster
  – Step 3. Connection of a Requesting Peer
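For orientation, a Leader-style pass can be sketched as below. This is a generic variant that compares each incoming item with every existing leader and joins the most similar one; it is not SPEED's implementation. The numeric items, the `similarity` function, and the threshold value are illustrative assumptions.

```python
def leader_clustering(items, similarity, threshold):
    """Single-pass clustering: each item joins the cluster of its most
    similar leader if the similarity reaches the threshold; otherwise it
    becomes the leader of a new cluster."""
    clusters = []  # each cluster is a list whose first element is its leader
    for item in items:
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = similarity(item, cluster[0])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best.append(item)
        else:
            clusters.append([item])
    return clusters


def closeness(a, b):
    """Toy similarity in [0, 1] for numbers in [0, 1] (illustrative only)."""
    return 1.0 - abs(a - b)


clusters = leader_clustering([0.1, 0.15, 0.9, 0.2, 0.85], closeness, threshold=0.8)
print(clusters)  # [[0.1, 0.15, 0.2], [0.9, 0.85]]
```

Because items are processed one at a time and leaders are never revised, the result depends on the arrival order, which is the input-order sensitivity noted earlier among the clustering issues.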
Step 1. Search for Initial Cluster in Semantic Index
• Requesting Peer RPn sends its local ontology LOn to semantic peer
• Search in the Semantic Index
  – SemMatch (LOn,OSij)
  – For each index entry a global similarity measure is produced
  – Initial cluster: entry with the highest global measure
Step 2. Search for Most Similar Cluster
[Figure: the requesting peer RPn sends its local ontology LOn through the unstructured super-peer network of integration peers (IPi1 to IPi5). Global similarity measures: IPi1: 0.2, IPi2: 0.6, IPi3: 0.7, IPi4: 0.4. Search scope by TTL: TTL = 1 covers the initial cluster; TTL = 2 covers the initial cluster and its direct neighbors; TTL = 3 also covers indirect neighbors.]
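The TTL-bounded search of Step 2 can be sketched as a breadth-first traversal of the super-peer network. The similarity values come from the slide; the network topology, the choice of IPi2 as the initial cluster, and the function name are hypothetical. In SPEED the measure would be computed by matching LOn against each visited cluster ontology, whereas this sketch simply looks it up in a dictionary.

```python
from collections import deque


def most_similar_cluster(initial, neighbors, similarity, ttl):
    """Breadth-first search bounded by a TTL: ttl=1 visits only the initial
    cluster, ttl=2 adds its direct neighbors, ttl=3 adds their neighbors."""
    visited = {initial}
    queue = deque([(initial, 1)])
    best = initial
    while queue:
        cluster, depth = queue.popleft()
        if similarity[cluster] > similarity[best]:
            best = cluster
        if depth < ttl:
            for nxt in neighbors.get(cluster, []):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, depth + 1))
    return best, similarity[best]


# Global similarity measures from the slide.
similarity = {"IPi1": 0.2, "IPi2": 0.6, "IPi3": 0.7, "IPi4": 0.4}
# Hypothetical topology: IPi2 is the initial cluster, IPi3 is two hops away.
neighbors = {"IPi2": ["IPi1", "IPi4"], "IPi4": ["IPi2", "IPi3"],
             "IPi1": ["IPi2"], "IPi3": ["IPi4"]}

for ttl in (1, 2, 3):
    print(ttl, most_similar_cluster("IPi2", neighbors, similarity, ttl))
```

With this assumed topology, the most similar cluster (IPi3, 0.7) is only found once the TTL reaches 3, which illustrates the trade-off between search scope and number of messages.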
Step 3. Connection of a Requesting Peer
• Case 1: MAX(global measure) ≥ cluster threshold
  – RPn joins chosen cluster (Data Peer)
  – Merge (CLOij,LOn)
  – Semantic index is updated
• Case 2: otherwise
  – RPn creates new cluster (Integration Peer)
  – CLOij = LOn
  – A new entry is added to the semantic index
  – Semantic Neighborhood: Neighbor Threshold
[Figure: similarity scale from 0.0 to 1.0 with IPi1 at 0.2, IPi4 at 0.4, IPi2 at 0.6, and IPi3 at 0.7; Neighbor Threshold marked at 0.3 and Cluster Threshold at 0.8.]
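The two cases of Step 3 can be sketched as a single decision function. The similarity values and the thresholds (0.3 and 0.8) are read off the slide's figure; treating every cluster at or above the neighbor threshold as a semantic neighbor is an interpretation of the slide, and the function name and the returned dictionary are illustrative.

```python
def connect(scores, cluster_threshold, neighbor_threshold):
    """Decide how a requesting peer connects, given the global similarity
    measure obtained for each candidate cluster."""
    best = max(scores, key=scores.get) if scores else None
    if best is not None and scores[best] >= cluster_threshold:
        # Case 1: join the most similar cluster as a data peer.
        return {"role": "data peer", "cluster": best, "neighbors": []}
    # Case 2: create a new cluster as an integration peer; clusters that
    # reach the neighbor threshold become its semantic neighbors (assumed).
    neighbors = sorted(c for c, s in scores.items() if s >= neighbor_threshold)
    return {"role": "integration peer", "cluster": None, "neighbors": neighbors}


scores = {"IPi1": 0.2, "IPi2": 0.6, "IPi3": 0.7, "IPi4": 0.4}
decision = connect(scores, cluster_threshold=0.8, neighbor_threshold=0.3)
print(decision)
```

With these values the highest measure (0.7) stays below the cluster threshold, so the peer starts a new cluster with IPi2, IPi3, and IPi4 as neighbors; a lower cluster threshold such as 0.7 would instead place it in IPi3's cluster.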
Maintenance Considerations
• Evolution of cluster ontologies
  – Peer connection and disconnection
• Peer Disconnection
  – Removal of elements and semantic mappings
• Update of Cluster Neighborhood
• Recalculation of Global Similarity Measure
  – Similarity is calculated when a requesting peer joins a cluster
Implementation Issues
• SPEED Simulator
  – Implementation: Java
  – Integrated with SemMatch and OWLSum
  – Ontology Library (Education)
• Ontology Merging
  – Implementation: Java, OWL API
  – String-match
SPEED Simulator
Input File
LO02-Education.owl
LO45-Education.owl
LO41-Education.owl
LO40-Education.owl
LO01-Education.owl
LO03-Education.owl
LO36-Education.owl
LO05-Education.owl
LO20-Education.owl
LO15-Education.owl
LO06-Education.owl
LO27-Education.owl
LO26-Education.owl
...
Log File
Tue Mar 24 18:18:45 GMT-03:00 2009
RP45 is now connecting...
RP45 is now a Integration Peer with out semantic neighbors
Semantic Index:
<<Cluster: 45>> Exhibition(1) Event(1) Conference(1) Workshop(1)
Network:
Domain: education (represented by SP: 100)
Cluster45(RP45)
…
Network:
Domain: education (represented by SP: 100)
Cluster45(RP45, RP13, RP36, RP29, RP42)
Cluster08(RP08, RP20, RP02, RP05, RP06, RP27, RP26, RP16, RP30)
Cluster44(RP44, RP38, RP39, RP41, RP22, RP33)
Cluster37(RP37, RP32, RP19, RP40)
Cluster15(RP15, RP11, RP31, RP21, RP07, RP17, RP18, RP03)
Cluster24(RP24, RP14, RP34, RP43)
Cluster28(RP28, RP01, RP23, RP35, RP12, RP04, RP09, RP25, RP10)
Total number of messages: 561
#matchings between OS and LO: 251
#matchings between CLOs: 42
#matchings between CLO and LO: 42
Simulation time: 1161 seconds
External indices: RandIndex=0.942 JaccardCoefficiet=0.646 FMIndex=0.785 Hubbert=0.752
Experiments
• #Requesting Peers: 45
• Search Strategy
  – allClusters vs. limitedClusters
• allClusters
  – Semantic index is discarded
• limitedClusters
  – Multiple executions
  – Different orders of requesting peers
Experiments: limitedClusters
Internal Indices (similarity between peers of the same cluster; dissimilarity between peers of distinct clusters)
Cluster Threshold     | 0.25  | 0.35  | 0.45  | 0.55  | 0.65
Global Silhouette     | 0.479 | 0.505 | 0.336 | 0.307 | 0.304

External Indices ("Gold Standard": hierarchical clustering algorithm)
Cluster Threshold     | 0.25  | 0.35  | 0.45  | 0.55  | 0.65
Rand Index            | 0.928 | 0.942 | 0.935 | 0.928 | 0.901
Jaccard Coefficient   | 0.629 | 0.649 | 0.530 | 0.454 | 0.246
Fowlkes-Mallows Index | 0.785 | 0.788 | 0.713 | 0.664 | 0.486
Hubert's Statistic    | 0.748 | 0.755 | 0.682 | 0.636 | 0.458
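Three of the external indices reported here (Rand, Jaccard, Fowlkes-Mallows) are standard pair-counting measures that compare a clustering with a reference partition. The sketch below uses their textbook definitions on hypothetical peer assignments; it is not the simulator's code and it leaves out Hubert's statistic.

```python
from itertools import combinations
from math import sqrt


def external_indices(predicted, reference):
    """Pair-counting indices between two partitions given as
    {object: cluster label} dictionaries over the same objects."""
    a = b = c = d = 0
    for x, y in combinations(sorted(predicted), 2):
        same_pred = predicted[x] == predicted[y]
        same_ref = reference[x] == reference[y]
        if same_pred and same_ref:
            a += 1  # together in both partitions
        elif same_pred:
            b += 1  # together only in the predicted clustering
        elif same_ref:
            c += 1  # together only in the reference partition
        else:
            d += 1  # separated in both partitions
    rand = (a + d) / (a + b + c + d)
    jaccard = a / (a + b + c) if (a + b + c) else 1.0
    fowlkes_mallows = a / sqrt((a + b) * (a + c)) if (a + b) and (a + c) else 0.0
    return rand, jaccard, fowlkes_mallows


# Hypothetical assignments of five peers (not data from the experiments).
predicted = {"RP1": 0, "RP2": 0, "RP3": 0, "RP4": 1, "RP5": 1}
reference = {"RP1": "A", "RP2": "A", "RP3": "B", "RP4": "B", "RP5": "B"}

rand, jaccard, fm = external_indices(predicted, reference)
print(round(rand, 3), round(jaccard, 3), round(fm, 3))
```

All three indices reach 1.0 when the clustering matches the reference partition exactly, which is why values closer to 1 indicate better agreement with the gold standard.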
limitedClusters vs. allClusters
#Executions of SemMatch (search strategy comparison, Cluster Threshold = 0.35)
Search Strategy | SemMatch (CLO,CLO) | SemMatch (CLO,LO) | SemMatch (OS,LO)
allClusters     | 122                | 212               | 0
limitedClusters | 91                 | 155               | 271
LO = Local Ontology, CLO = Cluster Ontology, OS = Ontology Summary

External Indices (Cluster Threshold = 0.35)
Search Strategy | Rand Index | Jaccard Coefficient | Fowlkes-Mallows Index | Hubert's Statistic
allClusters     | 0.970      | 0.675               | 0.794                 | 0.778
limitedClusters | 0.942      | 0.649               | 0.788                 | 0.755
Experiments
Query Routing (limitedClusters, Cluster Threshold = 0.35)
Number of hops             | 1     | 2     | 3     | 4
Reached Relevant Peers (%) | 16.4% | 51.7% | 94.8% | 100.0%

Transmitted Messages (limitedClusters vs. allClusters, Cluster Threshold = 0.35)
Search Strategy | #Messages
allClusters     | 2509
limitedClusters | 1962
Conclusions
• Incremental process to cluster semantically similar peers in a PDMS
• Peers are organized according to a mixed P2P topology
• Ontologies are used to represent exported schemas
• Peer clustering is assisted by
  – Ontology matching
  – Ontology summarization
Contributions
• Ontology-based PDMS Architecture
• Ontology Matching Process
  – Determine most similar cluster
  – Determine cluster neighborhood
  – Search in the semantic index
• Ontology Summarization Process
  – Cluster Ontologies
• Incremental Peer Clustering Process
Future Work
• SemMatch
  – Consider properties of the concepts
• OWLSum
  – Apply transitivity rules in order to eliminate non-relevant concepts
• Peer Clustering
  – Improve semantic index, e.g. organization and search
  – Consider peer disconnection
• Load Balancing
  – Merging and splitting of clusters