Text Based Similarity Metrics and Delta for Semantic Web Graphs
Krishnamurthy Koduvayur Viswanathan
Monday, June 28, 2010
Contributions
• Define text-based similarity metrics that characterize the relationship between semantic web graphs
• Evaluate the similarity metrics for three specific cases of similarity that we defined
• Generate a delta between pairs of SW graphs that may be two versions of the same graph
• Prototype the techniques in a new system called Similis
Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion
Motivation: Near Duplicate Detection for the SW?
Goals
• Explore the different ways in which two SW graphs may be similar to each other
• In particular, evaluate the specific use case of versioning relations between SW graphs
• Additionally, develop techniques to generate a delta between versions
Comparison with near duplicate text document detection
• In a text document:
  – Order of the content is important
  – The meaning of the text is not part of the problem, just the textual encoding of the meaning
• For a SWD, the order is not deterministic, i.e. equivalent SWDs may have different statement orderings
• Blank node identifiers in a SWD are also non-deterministic
Semantic Web Document (SWD)
• RDF representation of a Semantic Web Graph
  – Document based serialization of a SW graph on the web (ontology or data-file)
  – Document based serialization of the result of a SPARQL query on a triple-store
  – Document based serialization of structured metadata extracted from an HTML page using RDFa
Semantic Web Graph Similarity
• The archive of the Swoogle search engine (Ding et al. 2004) shows several examples of how ontologies and RDF documents evolve over time
• Kinds of similarity between two SW graphs:
  – Same classes and properties used; differ only in literal content
  – Differ only in the base-URIs of the entities used
  – Different versions of the same semantic web graph
Similarity in Classes and Properties
• Two semantic web graphs that differ only in the literal content
Different in Literal Content
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "Eric Miller" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#mailbox> "mailto:[email protected]" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Dr" .

<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "John Doe" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#mailbox> "mailto:[email protected]" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Mr" .
Different only in base-URI
<http://www.w3.org/2001/sw/WebOnt/guide-src/wine#ItalianRegion> .
_:g103 <http://www.w3.org/2002/07/owl#onProperty> <http://www.w3.org/2001/sw/WebOnt/guide-src/wine#locatedIn> .
_:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> _:g103 .
_:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> .
_:g105 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Restriction> .
_:g105 <http://www.w3.org/2002/07/owl#hasValue> <http://www.w3.org/2001/sw/WebOnt/guide-src/wine#Dry> .
_:g105 <http://www.w3.org/2002/07/owl#onProperty> <http://www.w3.org/2001/sw/WebOnt/guide-src/wine#hasSugar> .

<http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#ItalianRegion> .
_:g103 <http://www.w3.org/2002/07/owl#onProperty> <http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#locatedIn> .
_:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> _:g103 .
_:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> .
_:g105 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Restriction> .
_:g105 <http://www.w3.org/2002/07/owl#hasValue> <http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#Dry> .
_:g105 <http://www.w3.org/2002/07/owl#onProperty> <http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#hasSugar> .
Versioning Relationship
• Two semantic web documents have a versioning relationship, if they are variants of the same semantic web graph.
• Variants are created due to the dynamic nature of the web, i.e. content keeps getting modified
  – Minor changes: spelling corrections, punctuation, etc.
  – Major changes: affect the semantic content
Problem Definition
• Problem 1: Given a collection of semantic web graphs in the form of RDF documents, characterize the similarity between pairs into one or more of the three cases:
  – Same classes and properties used, but differ only in the literal content
  – Differ only in the base-URI used
  – Are different versions of the same graph, i.e. have a versioning relationship
Problem Definition
• Problem 2: Generate a delta between pairs that have been identified as having a versioning relationship between them.
Approach
Input: Corpus of SWDs
Convert to n-triples format
Convert to canonical form
Generate Reduced Forms
Compute Text-Based Similarity Metrics
Build feature-vectors for each pair
Characterize similarity between pairs
Identify versions
Generate delta between versions
Convert to n-triples
Convert to Canonical Form
• Comparison methods may be affected by blank node identifiers and statement ordering
• Canonicalization assigns consistent IDs to blank nodes and orders the statements lexicographically
• Transforms two semantically equivalent graphs into the same canonical representation
Based on: Carroll, J. J. 2003. Signing RDF graphs. In 2nd ISWC, volume 2870 of LNCS, 5–15. Springer.
Convert to Canonical Form
Original statements:
<person:John> <a:livesIn> _:x .
_:x <a:IsPartOf> "USA" .
<person:John> <a:likes> "cheese" .
_:x <a:hasCapital> _:y .

Blank nodes masked with "~" and statements sorted (original identifiers kept as comments):
"~" <a:hasCapital> "~" . # _:x _:y
"~" <a:IsPartOf> "USA" . # _:x
<person:John> <a:likes> "cheese" .
<person:John> <a:livesIn> "~" . # _:x

BNode Table
Old Blank Node Identifier    New Blank Node Identifier
_:y                          _:g1
_:x                          _:g2

Canonical form:
_:g2 <a:hasCapital> _:g1 .
_:g2 <a:IsPartOf> "USA" .
<person:John> <a:likes> "cheese" .
<person:John> <a:livesIn> _:g2 .
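The masking, sorting and relabelling steps of the canonical form example can be sketched in Python. This is a minimal sketch rather than the Similis implementation: it assumes one statement per string, blank node labels of the form `_:name`, and no `_:` sequences inside literals. The function name `canonicalize`, the case-sensitive sort, and the order in which new labels are numbered are all assumptions, so the labels it assigns need not match the BNode table.

```python
import re

# Assumption: blank node labels look like _:x, _:g103, ...
BNODE = re.compile(r"_:[A-Za-z0-9]+")

def canonicalize(statements):
    """Return (sorted canonical statements, old-to-new blank node table)."""
    statements = set(statements)  # an RDF graph is a set of statements
    # Step 1: mask every blank node with "~", remembering the original labels.
    masked = []
    for stmt in statements:
        labels = BNODE.findall(stmt)
        masked.append((BNODE.sub('"~"', stmt), labels))
    # Statements that become identical once masked are non-distinctive:
    # no unique canonical form exists, so give up.
    with_bnodes = [text for text, labels in masked if labels]
    if len(with_bnodes) != len(set(with_bnodes)):
        raise ValueError("non-distinctive triples: no unique canonical form")
    # Step 2: sort the masked statements lexicographically.
    masked.sort(key=lambda pair: pair[0])
    # Step 3: assign new labels in order of first appearance.
    table = {}
    for _, labels in masked:
        for old in labels:
            if old not in table:
                table[old] = "_:g%d" % (len(table) + 1)
    # Step 4: rewrite the original statements with the new labels and sort.
    relabelled = [BNODE.sub(lambda m: table[m.group(0)], s) for s in statements]
    return sorted(relabelled), table
```

Two serializations of the same graph that differ only in statement order and blank node labels produce the same output, which is the property the later text-based metrics rely on.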
Limitation of the Algorithm: Non-Distinctive Triples
• The algorithm can only deal with graphs that do not have non-distinctive triples
• Non-Distinctive Triples: the triples in the graph that cannot be uniquely identified when all the blank nodes are treated as equal
Graphs with Non-Distinctive Triples
• For a group of n non-distinctive triples, there are n! ways of renaming the blank nodes
• For graphs with non-distinctive triples, a single unique canonical form does not exist
• To compare two graphs, compare each of the possible canonical forms for both graphs
• Number of comparisons: O(m!n!)
• Similis throws an exception when it finds a graph with multiple canonical forms
Graphs with Non-Distinctive Triples
• Only a small percentage of SW graphs (13%) did not have a unique canonical form (sample of 1200 randomly collected SW documents)
Generating Reduced Forms
• The canonical form of each SW graph is broken down into a number of reduced forms
• These reduced forms are used to characterize the relationship between pairs of SW graphs
• The following is the anatomy of an entity in a triple:
Entity URI    <http://www.w3.org/2001/sw/guide-src/wine#hasSugar>
Base URI      <http://www.w3.org/2001/sw/guide-src/wine>
Local Name    <hasSugar>
Only-Literals Reduced Form
• Contains only the literals from the original n-triples file
• Lets us compare only the textual content within a graph, separated from the rest of the graph
No-Literals Reduced Form
• All the literals from the canonical form are replaced by an empty string
• Lets us compare only the classes and properties used, regardless of literal content
Local-Name Reduced Form
• The base-URI of every node in the canonical form is replaced by an empty string
• Lets us compare only the local names of the classes and properties used
Local-Name-No-Literal Reduced Form
• All the literals and the base-URI of every node are replaced by an empty string
• Lets us compare the non-literal content of two SW graphs
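The four reduced forms can be sketched with two regular expressions over the canonical n-triples text. This is an illustrative sketch under stated assumptions, not the Similis code: the function names and dictionary keys are hypothetical, an emptied literal is kept as `""` so the statement stays readable, the local name is taken to be everything after the last `#` or `/`, and literals are assumed not to contain `<...>` sequences.

```python
import re

LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')   # a double-quoted literal
URI = re.compile(r"<([^>]*)>")               # an entity URI in angle brackets

def _local_name(match):
    """Keep only the part of the URI after the last '#' or '/'."""
    uri = match.group(1)
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return "<" + uri[cut + 1:] + ">"

def reduced_forms(canonical):
    """Build the reduced forms of a canonical n-triples string."""
    no_literals = LITERAL.sub('""', canonical)
    return {
        # Only the literals, one per line.
        "only_literals": "\n".join(
            lit for line in canonical.splitlines()
            for lit in LITERAL.findall(line)),
        # Literal content emptied, URIs untouched.
        "no_literals": no_literals,
        # Base-URIs stripped, literals untouched.
        "local_name": URI.sub(_local_name, canonical),
        # Both reductions applied.
        "local_name_no_literal": URI.sub(_local_name, no_literals),
    }
```

With these forms, two graphs that differ only in base-URI have identical local-name forms, and two graphs that differ only in literal content have identical no-literal forms.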
Similarity/Distance Metrics Used
• Cosine Similarity between SWD vectors
• Jaccard and Containment Metrics
• Hamming Distance between Simhash fingerprints
Computation of Pairwise Metrics
• Compute cosine similarity between the canonical and local-name forms of each pair in the collection
  – If cosine similarity < 0.7, remove pair from further consideration
  – Else, compute all other metrics for all the forms (5 forms * 3 metrics = 15 specific metrics)
• Total of 17 metrics computed (15 specific metrics + 2 cosine similarities)
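The count of 17 metrics and the 0.7 filter can be made concrete with a short sketch. The form prefixes `Canonical`, `NoLiteral` and `LocalName` are assumed names chosen to match the feature names listed in the evaluation (such as LocalNameNoLiteralJaccard and OnlyLiteralContainment). The slides do not say which of the two cosine values the filter applies to, so keeping a pair when either value reaches the threshold is also an assumption.

```python
# Assumed names for the five forms and the three per-form metrics.
FORMS = ["Canonical", "OnlyLiteral", "NoLiteral", "LocalName", "LocalNameNoLiteral"]
METRICS = ["Jaccard", "Containment", "Simhash"]
COSINE_METRICS = ["CosineSim", "LocalNameCosineSim"]
THRESHOLD = 0.7

def metric_names():
    """All metric names: 2 cosine similarities + 5 forms * 3 metrics."""
    return COSINE_METRICS + [form + metric for form in FORMS for metric in METRICS]

def is_candidate(cosine_sim, local_name_cosine_sim, threshold=THRESHOLD):
    """Keep a pair if either cosine similarity reaches the threshold (assumption)."""
    return max(cosine_sim, local_name_cosine_sim) >= threshold
```

Filtering on the cheap cosine values first means the fifteen per-form metrics are only computed for pairs that already look related.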
Cosine Similarity Between Term Vectors
• Each SWD containing terms Tj = {t1, t2…tn} is treated as a vector Vj = (γ1t1,γ2t2,… γntn) where each γi is the weight associated with term ti
• Non-blank, non-literal nodes are used as features, and Term Frequency (TF) is used as weight
• Two vectors for each SWD: one uses full entity URIs as features, other uses local-name of terms
• Indicates similarity in classes and properties
SW Document Vectors
Vector using full entity URIs as features:
Term                                             Freq
<http://purl.org/dc/elements/1.1/title>          2
<http://purl.org/dc/elements/1.1/creator>        1
<http://purl.org/dc/elements/1.1/contributor>    1
<http://put-off.org>                             1

Vector using local names as features:
Term             Freq
<title>          2
<creator>        2
<contributor>    1
<put-off.org>    1
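The two term-frequency vectors and their cosine similarity can be sketched as follows. This is a minimal sketch, assuming the input is n-triples text in which every `<...>` token is an entity URI (blank nodes and literals are skipped simply because they are not in angle brackets); the function names are hypothetical.

```python
import math
import re
from collections import Counter

URI = re.compile(r"<([^>]*)>")

def term_vector(ntriples, local_names=False):
    """Term-frequency vector over entity URIs, or over their local names."""
    terms = URI.findall(ntriples)
    if local_names:
        # Assumption: the local name follows the last '#' or '/'.
        terms = [t[max(t.rfind("#"), t.rfind("/")) + 1:] for t in terms]
    return Counter(terms)

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) *
                     sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

For two graphs that differ only in base-URI, the full-URI vectors share few or no terms while the local-name vectors coincide, which is why both cosine values are kept as separate features.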
Jaccard and Containment
• Computed for all forms (five) for a candidate pair of SW graphs (5 * 2 = 10 metrics)
• Construct sets of character 4-grams for each document
• 4-grams are computed by running a four character-wide window over the text representation
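A short sketch of the 4-gram construction and the two set-based metrics follows. It uses the usual definitions, Jaccard as |A ∩ B| / |A ∪ B| and containment of A in B as |A ∩ B| / |A|; the function names are hypothetical, and returning 0.0 for empty sets is an assumption.

```python
def char_ngrams(text, n=4):
    """Set of character n-grams from a sliding window of width n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def containment(a, b):
    """Fraction of A's n-grams that also occur in B; 0.0 when A is empty."""
    return len(a & b) / len(a) if a else 0.0
```

Containment is asymmetric: a small document fully embedded in a larger one has containment 1.0 but a low Jaccard value, which helps when one version mostly adds content.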
Hamming Distance between Simhash Fingerprints
• Simhash fingerprints of similar documents differ in a small number of bit positions
• Tokenize documents into character 3-grams
• Compute simhash fingerprint for each document in pair (we implemented 128-bit fingerprints)
• Find Hamming Distance between the fingerprints
• Computed for all forms (five) for a candidate pair of SW graphs (5 * 1 = 5 metrics)
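A 128-bit simhash over character 3-grams can be sketched as below. The slides do not say which token hash Similis uses; MD5 is chosen here only because its digest is exactly 128 bits, and weighting every 3-gram occurrence by 1 is likewise an assumption.

```python
import hashlib

BITS = 128  # MD5 digests are 128 bits wide

def simhash(text, n=3):
    """128-bit simhash fingerprint over character n-grams."""
    weights = [0] * BITS
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        h = int.from_bytes(hashlib.md5(gram.encode("utf-8")).digest(), "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is set when the positive votes win.
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Identical inputs always give distance 0, and the distance is bounded by 128, so it can be used directly as a numeric feature.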
Classification
• Similarity metrics computed for each candidate pair form the feature vector (FV) passed to the classifiers
• Naïve Bayes classifier: similarity in classes and properties
• Naïve Bayes/SVM classifier: difference only in base-URI
• SVM classifier: versioning relationship
[Figure: example feature vector used for determining versioning relationship]
Computing Delta Between Two Versions
• Input: a pair (Version1, Version2) identified by the SVM classifier as having a versioning relationship
• Subtractive delta: Version1 except Version2
• Additive delta: Version2 except Version1
• Output: Delta (subtractive and additive)
Raw Delta
• Statement-by-statement comparison between canonical forms of the two SWDs
• Only local names of entities are compared
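The raw delta reduces to two set differences over statements. A minimal sketch, assuming each version is given as a list of canonical statements already reduced to local names; the function name and the (additive, subtractive) return order are choices made for this example.

```python
def raw_delta(version1, version2):
    """Return (additive, subtractive) deltas between two statement lists."""
    v1, v2 = set(version1), set(version2)
    additive = sorted(v2 - v1)      # Version2 except Version1
    subtractive = sorted(v1 - v2)   # Version1 except Version2
    return additive, subtractive
```

For example, changing one literal shows up as one statement in each delta:

```python
v1 = ['<me> <fullName> "Eric Miller" .', '<me> <type> <Person> .']
v2 = ['<me> <fullName> "John Doe" .', '<me> <type> <Person> .']
additive, subtractive = raw_delta(v1, v2)
```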
Delta After Deductive Closure
SWGv1 → Compute deductive closure → Canonicalize
SWGv2 → Compute deductive closure → Canonicalize
Both canonical forms → Generate Raw Delta
Delta After Deductive Closure
• If O is a deductively closed set of propositions, p ∈ O and p ⊨ q, then q ∈ O
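To illustrate why the closure is computed before the delta, the sketch below closes a graph under two entailment rules only: transitivity of `rdfs:subClassOf` and propagation of `rdf:type` to superclasses. The choice of these two rules is an assumption made for illustration; the slides do not state which reasoner or rule set is used. Triples are plain `(subject, predicate, object)` tuples.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def deductive_closure(triples):
    """Close a set of triples under two illustrative RDFS rules."""
    closed = set(triples)
    while True:
        subclass_pairs = {(s, o) for s, p, o in closed if p == SUBCLASS}
        inferred = set()
        for s, p, o in closed:
            if p in (SUBCLASS, TYPE):
                # s subClassOf/type o, and o subClassOf sup => s subClassOf/type sup
                for sub, sup in subclass_pairs:
                    if sub == o:
                        inferred.add((s, p, sup))
        if inferred <= closed:      # fixpoint reached
            return closed
        closed |= inferred

def raw_delta(v1, v2):
    """(additive, subtractive) set differences between two triple sets."""
    return v2 - v1, v1 - v2
```

If one version states an entailed triple explicitly and the other leaves it implicit, the raw delta reports a difference, while the delta after closure is empty because both graphs say the same thing.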
Delta at Concept Level
• Works only for ontologies
• Groups of class/property definitions are serialized into individual graphs
• Corresponding graphs in the two versions are compared to each other
Concept Level Delta: example
Detecting Class Renaming
[Figure: class renaming example involving the class Sauterne]
Detecting Class Renaming
Input: Local names of entities in both diffs
Generate 3-gram sets for each entity
Compute 3-gram overlap between sets in additive and subtractive deltas
If overlap > 0.7, add (oldname, newname) to candidate set
Replace oldname in subtractive delta by newname
Check for presence of all modified statements in additive delta
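The renaming steps above can be sketched as follows. Several details are assumptions: statements are `(subject, predicate, object)` tuples of local names, the 3-gram overlap is measured with the Jaccard coefficient (the slides do not name the overlap measure), and only names that occur in one delta but not the other are considered. The rename from `Sauterne` to `Sauternes` in the usage example is hypothetical.

```python
def trigrams(name):
    """Set of character 3-grams of a local name."""
    return {name[i:i + 3] for i in range(len(name) - 2)}

def overlap(a, b):
    """Jaccard overlap between two 3-gram sets (assumed overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def detect_renames(subtractive, additive, threshold=0.7):
    """Return (oldname, newname) pairs confirmed by the additive delta."""
    old_names = {term for stmt in subtractive for term in stmt}
    new_names = {term for stmt in additive for term in stmt}
    additive_set = set(additive)
    renames = []
    for old in sorted(old_names - new_names):
        for new in sorted(new_names - old_names):
            if overlap(trigrams(old), trigrams(new)) > threshold:
                # Replace oldname by newname in the subtractive statements...
                rewritten = {tuple(new if term == old else term for term in stmt)
                             for stmt in subtractive if old in stmt}
                # ...and require every modified statement in the additive delta.
                if rewritten <= additive_set:
                    renames.append((old, new))
    return renames
```

```python
subtractive = [("Sauterne", "subClassOf", "LateHarvest"),
               ("Sauterne", "hasSugar", "Sweet")]
additive = [("Sauternes", "subClassOf", "LateHarvest"),
            ("Sauternes", "hasSugar", "Sweet")]
renames = detect_renames(subtractive, additive)
```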
Data-set: Using Swoogle’s SW Wayback machine
• Swoogle caches multiple snapshots for each indexed semantic web document
• Labeling for versions: We extract such snapshots from Swoogle’s cache and label these pairs as versions
Evaluation: Pairs that Differ in Literal Content
• Features used for classification:
  – LocalNameCosineSim
  – CosineSim
  – LocalNameNoLiteralJaccard
  – LocalNameNoLiteralSimhash
• Training set from Swoogle archive included 806 positive pairs, and 806 negative pairs
Evaluation: Pairs that Differ in Literal Content
• Results of 10-fold stratified cross validation using a Naïve Bayes classifier:
Evaluation: Pairs that Differ in Literal Content
• Results of using a SVM with all of the features, instead of manually selecting features:
• Attribute relevance ranking:
Evaluation: Pairs that Differ in Base-URI
• Features for classification:
  – CosineSim
  – LocalNameCosineSim
  – LocalNameNoLiteralJaccard
  – LocalNameNoLiteralContainment
  – OnlyLiteralContainment
  – OnlyLiteralJaccard
• Training set contained 100 positive examples, and 100 negative examples
Evaluation: Pairs that Differ in Base-URI
• 10-fold cross validation using Naïve Bayes:
• 10-fold cross validation (SVM linear-kernel)
Evaluation: Pairs with a Versioning Relationship
• 124 training instances from Swoogle data-set
• Filtered highly dynamic pairs from consideration
Evaluation: Pairs with a Versioning Relationship
• Test dataset: 160 instances (50% positive, 50% negative)
• Classification results using SVM (linear kernel)
Correctness of Delta Computation
• For any two versions K and K’ of a SW graph, applying the delta Δx(K → K’) to K must yield a graph equivalent to K’: Δx(K → K’)K ≡ K’
• We check this condition programmatically for each delta generated
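The correctness condition can be checked with a few lines of set arithmetic. A minimal sketch, assuming both versions are sets of canonical statements and that applying a delta means removing the subtractive statements and then adding the additive ones; the function names are hypothetical.

```python
def raw_delta(k, k_new):
    """Return (additive, subtractive) deltas from K to K'."""
    k, k_new = set(k), set(k_new)
    return k_new - k, k - k_new

def apply_delta(k, additive, subtractive):
    """Apply a delta to K: drop the subtractive part, add the additive part."""
    return (set(k) - set(subtractive)) | set(additive)

def delta_is_correct(k, k_new):
    """Check that the delta computed from K to K' turns K into K'."""
    additive, subtractive = raw_delta(k, k_new)
    return apply_delta(k, additive, subtractive) == set(k_new)
```

With plain set differences the check always passes; it becomes a useful safeguard once the delta is post-processed, for example after rename detection or concept-level grouping.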
Conclusion
• Define text-based similarity metrics that characterize the relationship between semantic web graphs
• Evaluate the similarity metrics for three specific cases of similarity that we defined
• Generate deltas between pairs of SW graphs that may be two versions of the same graph
• Prototype the techniques in a new system called Similis
Future Directions
• Scalability
• Content of Delta Generated
• Standard Ontologies to:
  – Describe delta
  – Describe the relationship between a pair of SW graphs
• Detecting direction of change between two versions