Post on 20-Mar-2017
DBGroup @ UNIMO
Context Semantic Analysis: a knowledge-based technique for
computing inter-document similarity
Fabio Benedetti, Domenico Beneventano, Sonia Bergamaschi
Department of Engineering “Enzo Ferrari”, University of Modena & Reggio Emilia
The 9th International Conference on Similarity Search and Applications (SISAP 2016)
CSA: a knowledge-based technique for computing inter-document similarity, SISAP 2016
Fabio Benedetti, Dip. Ing. “Enzo Ferrari”, University of Modena e Reggio Emilia
Outline
• State of the art
• Our proposal - CSA
• Performance Evaluation
• Scalability Evaluation
• Conclusions and Future Works
Inter-document similarity
Inter-document similarity techniques compute the similarity between two documents contained in a corpus C.
These similarity measures can be applied:
• In Information Retrieval systems (for ranking documents)
• To identify topics in a corpus
• To cluster documents according to their content
State of the Art
Content Based techniques
Only the textual information contained in the corpus is used for computing the inter-document similarity.
• Vector Space Model [1]
• LSA [2]
Knowledge enriched techniques
Additional resources (knowledge bases, encyclopedic resources) are used for improving the estimation of the similarity
Example
The Rolling Stones with the participation of Roger Daltrey opened the concerts’ season in Trafalgar Square
The bands headed by Mick Jagger with the leader of The Who played in London last week
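Classical text-based similarity techniques do not detect weak relations between concepts, for example that Mick Jagger leads The Rolling Stones, that Roger Daltrey leads The Who, or that Trafalgar Square is in London.
Such relations can be found in knowledge bases.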
Knowledge enriched techniques
Ad-hoc techniques use Wikipedia as a source of knowledge:
• ESA [3]
• WikiWalk [4]
• SSA [5]
Only one technique uses a generic knowledge base:
• GED [6]
The Semantic Web provides standards for describing knowledge bases:
• RDF
• OWL
Linked Open Data
[Schmachtenberg, Max, Christian Bizer, and Heiko Paulheim. "Adoption of the Linked Data Best Practices in Different Topical Domains." The Semantic Web–ISWC 2014. Springer International Publishing, 2014. 245-260]
Our Proposal: CSA
Context Semantic Analysis (CSA) is a novel technique for estimating inter-document similarity that leverages the information contained in a generic knowledge base.
It can be used with any generic RDF knowledge base.
CSA aims to be scalable:
• Each document can be processed independently
• The knowledge extracted from the KB can be embedded in the document’s metadata as a vector
Knowledge Base
A generic KB is a graph composed of a set of triples.
A triple has the form <subject, predicate, object>.
[Figure: example KB graph with the nodes dbr:The_Rolling_Stones, dbr:The_Yes-Men, dbr:Rock_music, dbr:Punk_Rock, dbr:Detroit, dbo:Band, dbo:MusicGenre and dbo:City, connected by rdf:type and dbo:genre edges, for example <dbr:The_Rolling_Stones, dbo:genre, dbr:Rock_music>]
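As a minimal sketch, a KB of this kind can be held in memory as a set of (subject, predicate, object) tuples. The triples below are read off the slides' example graph; the helper name `build_index` is an assumption made for illustration and is not part of CSA.

```python
from collections import defaultdict

# Illustrative triples in (subject, predicate, object) form.
KB = {
    ("dbr:The_Rolling_Stones", "dbo:genre", "dbr:Rock_music"),
    ("dbr:The_Rolling_Stones", "rdf:type", "dbo:Band"),
    ("dbr:Rock_music", "rdf:type", "dbo:MusicGenre"),
    ("dbr:Detroit", "rdf:type", "dbo:City"),
}


def build_index(triples):
    """Map each subject to the list of its outgoing (predicate, object) pairs."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


index = build_index(KB)
# All genres asserted for The Rolling Stones in this toy KB.
genres = [o for p, o in index["dbr:The_Rolling_Stones"] if p == "dbo:genre"]
```

A real RDF store would expose the same information through SPARQL or a triple-pattern API; the dictionary index is only a stand-in for that.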
CSA steps
Given a corpus C of documents and an RDF knowledge graph KB, CSA is composed of three steps:
• Contextual Graph Extraction: a Contextual Graph CG(d) containing the contextual information of a document d is extracted from the KB.
• Semantic Context Vectors Generation: the Semantic Context Vector SCV(d) representing the context of the document d is generated by analyzing its CG(d).
• Context Similarity Evaluation: the Context Similarity is evaluated by comparing the context vectors of two documents.
Contextual Graph Extraction
For each document d we extract the contextual graph CG(d) from the KB:
1. Starting Entities Identification: the entities of the KB that are explicitly mentioned in the document d are identified; they form the set SE(d).
2. Contextual Graph Construction: CG(d) is defined as the subgraph of the KB composed of all the triples that connect, with a path of length l, at least 2 starting entities in SE(d).
3. Contextual Graph Weighting: the edges of CG(d) are weighted according to their importance within the graph.
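The construction step can be sketched as a depth-limited search between starting entities. This is only an illustration under two assumptions that the slides do not state: triples are traversed in both directions, and paths of length up to l are accepted. The `ex:` entities and predicates are hypothetical.

```python
from collections import defaultdict


def contextual_graph(triples, starting_entities, max_len):
    """Return the triples lying on simple paths of at most max_len edges
    that connect two distinct starting entities."""
    # Assumption: every triple can be walked in both directions.
    adjacency = defaultdict(list)
    for triple in triples:
        subject, _, obj = triple
        adjacency[subject].append((obj, triple))
        adjacency[obj].append((subject, triple))

    start = set(starting_entities)
    selected = set()

    def visit(node, visited, path):
        if path and node in start:
            selected.update(path)  # the path links two starting entities
        if len(path) == max_len:
            return
        for neighbour, triple in adjacency[node]:
            if neighbour not in visited:
                visit(neighbour, visited | {neighbour}, path + [triple])

    for entity in start:
        visit(entity, {entity}, [])
    return selected


# Hypothetical toy KB and starting entities for the first example sentence.
kb = [
    ("ex:The_Rolling_Stones", "ex:genre", "ex:Rock_music"),
    ("ex:Roger_Daltrey", "ex:genre", "ex:Rock_music"),
    ("ex:Trafalgar_Square", "ex:locatedIn", "ex:London"),
    ("ex:Rock_music", "rdf:type", "ex:MusicGenre"),
]
se = {"ex:The_Rolling_Stones", "ex:Roger_Daltrey", "ex:Trafalgar_Square"}
cg = contextual_graph(kb, se, 2)
```

With l = 2 only the two `ex:genre` triples survive, because they are the only ones on a path between two starting entities; with l = 1 the contextual graph is empty.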
Contextual Graph weighting
We used different strategies for weighting the edges of a CG.
For each generic edge <s_i, p_i, o_i>:
• Uniform: the weight of each edge is set to 1
• A weighting function taken from [6], defined in terms of the information content of the property
• A second weighting function taken from [6]
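As a rough illustration of the information content of a property, the sketch below assumes the common definition IC(p) = -log P(p), with P(p) estimated as the relative frequency of the property among the triples. The exact formulas of [6] are not reproduced here, and the function name is hypothetical.

```python
import math
from collections import Counter


def property_information_content(triples):
    """Estimate IC(p) = -log P(p) for every property p in the triples."""
    counts = Counter(predicate for _, predicate, _ in triples)
    total = sum(counts.values())
    return {p: -math.log(c / total) for p, c in counts.items()}


toy_triples = [
    ("dbr:The_Rolling_Stones", "rdf:type", "dbo:Band"),
    ("dbr:Rock_music", "rdf:type", "dbo:MusicGenre"),
    ("dbr:Detroit", "rdf:type", "dbo:City"),
    ("dbr:The_Rolling_Stones", "dbo:genre", "dbr:Rock_music"),
]
ic = property_information_content(toy_triples)
```

Under this definition a rare property such as dbo:genre carries more information than a ubiquitous one such as rdf:type, so edges labelled with it receive a larger weight.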
Total Correlation - Example
[Figure: a triple <s_i, p_i, o_i> whose subject s_i and object o_i are linked by rdf:type to the classes SC_i and OC_i]
• P(SC_i) is the probability that the subject of a triple belongs to the class SC_i
• P(OC_i) is the probability that the object of a triple belongs to the class OC_i
• P(p_i) is the probability that the property of a triple is p_i
• P(SC_i, p_i, OC_i) is the probability that the subject and the object of a triple belong to SC_i and OC_i respectively, and that they are connected by the property p_i
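t1: <dbr:The_Yes-Men, dbo:genre, dbr:Detroit>
t2: <dbr:The_Rolling_Stones, dbo:genre, dbr:Rock_music>
Total correlation(t2) > Total correlation(t1)
t1 links a dbo:Band to a dbo:City through dbo:genre, whereas t2 links a dbo:Band to a dbo:MusicGenre, which is the combination one expects for this property.
[Figure: the example KB graph, where dbr:The_Rolling_Stones and dbr:The_Yes-Men have rdf:type dbo:Band, dbr:Rock_music has rdf:type dbo:MusicGenre and dbr:Detroit has rdf:type dbo:City]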
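The comparison between the two triples can be reproduced on toy data. The sketch assumes a pointwise form of total correlation, log( P(SC, p, OC) / (P(SC) · P(p) · P(OC)) ), which may differ from the formula on the original slide, since that formula did not survive extraction. The class-level triples and the use of dbo:hometown are illustrative assumptions.

```python
import math
from collections import Counter


def total_correlation_weights(typed_triples):
    """typed_triples: list of (subject class, property, object class).
    Returns the assumed pointwise total correlation of each distinct combination."""
    n = len(typed_triples)
    subject_classes = Counter(sc for sc, _, _ in typed_triples)
    properties = Counter(p for _, p, _ in typed_triples)
    object_classes = Counter(oc for _, _, oc in typed_triples)
    joint = Counter(typed_triples)

    weights = {}
    for (sc, p, oc), count in joint.items():
        # Probability of the combination if the three parts were independent.
        independent = (subject_classes[sc] / n) * (properties[p] / n) * (object_classes[oc] / n)
        weights[(sc, p, oc)] = math.log((count / n) / independent)
    return weights


# Toy class-level view of a KB: each triple is replaced by the classes of its ends.
typed = [
    ("dbo:Band", "dbo:genre", "dbo:MusicGenre"),
    ("dbo:Band", "dbo:genre", "dbo:MusicGenre"),
    ("dbo:Band", "dbo:genre", "dbo:City"),       # the unexpected pattern of t1
    ("dbo:Band", "dbo:hometown", "dbo:City"),
]
tc = total_correlation_weights(typed)
tc_t1 = tc[("dbo:Band", "dbo:genre", "dbo:City")]
tc_t2 = tc[("dbo:Band", "dbo:genre", "dbo:MusicGenre")]
```

On this toy data the pattern of t2 occurs more often than independence would predict and gets a positive weight, while the pattern of t1 gets a negative one.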
Semantic Context Vectors
A Semantic Context Vector SCV(d_i) of the document d_i synthesizes the information contained in CG(d_i).
SCV(d_i) has one component for each entity e in E, where E is the set of entities belonging to the KB.
Each component is given by a weighting function that defines the importance of the entity e in the contextual graph of the document d_i.
Weighting functions
A weighting function has to spread weight over the entities of a contextual graph according to the importance of these entities within the graph.
• r: Weighted PageRank [7]
• pr: Weighted Personalized PageRank [8]
– It biases the probability of the random surfer being teleported towards a particular group of nodes
– Personalization:
We used different configurations of the damping factor.
• The damping factor defines the probability, at any iteration, that the random surfer will not be teleported
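To make the role of the damping factor and of the personalization concrete, here is a small power-iteration sketch of an edge-weighted PageRank with an optional teleport distribution. It is a generic formulation, not necessarily the exact algorithms of [7] and [8]; the function name, the toy graph and its weights are assumptions.

```python
def weighted_pagerank(edges, personalization=None, damping=0.85, iterations=100):
    """edges: dict {(source, target): positive weight}. Returns {node: score}."""
    nodes = sorted({node for edge in edges for node in edge})
    # Teleport distribution: uniform for plain PageRank, biased when personalized.
    if personalization is None:
        teleport = {n: 1.0 / len(nodes) for n in nodes}
    else:
        total = sum(personalization.values())
        teleport = {n: personalization.get(n, 0.0) / total for n in nodes}

    out_weight = {n: 0.0 for n in nodes}
    for (source, _), weight in edges.items():
        out_weight[source] += weight

    rank = dict(teleport)
    for _ in range(iterations):
        # Nodes without outgoing edges hand their mass back through teleportation.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        new_rank = {n: (1.0 - damping + damping * dangling) * teleport[n] for n in nodes}
        for (source, target), weight in edges.items():
            # With probability `damping` the surfer follows an edge,
            # chosen proportionally to its weight.
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        rank = new_rank
    return rank


# Hypothetical weighted contextual graph.
edges = {
    ("dbr:The_Rolling_Stones", "dbr:Rock_music"): 1.0,
    ("dbr:The_Rolling_Stones", "dbo:Band"): 0.5,
    ("dbr:The_Who", "dbr:Rock_music"): 1.0,
    ("dbr:The_Who", "dbo:Band"): 0.5,
}
r = weighted_pagerank(edges)
pr = weighted_pagerank(edges, personalization={"dbr:The_Rolling_Stones": 1.0})
```

In the personalized run every teleport lands on dbr:The_Rolling_Stones, so that node and its neighbourhood gain weight while dbr:The_Who, which has no incoming edges, drops to zero.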
Context Similarity Evaluation
We have now obtained a Semantic Context Vector for each document in the corpus.
We use the cosine similarity between these vectors to estimate the Context Similarity.
We also computed a linear combination of CSA with text similarity measures:
sim = α · sim_CSA + (1 − α) · sim_text
where α is the weight parameter used for combining the two measures.
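A minimal sketch of this step, assuming that context vectors are stored sparsely as dictionaries from entity to weight. The vectors and their values are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {entity: weight}."""
    dot = sum(weight * v.get(entity, 0.0) for entity, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(sim_csa, sim_text, alpha):
    """sim = alpha * sim_CSA + (1 - alpha) * sim_text."""
    return alpha * sim_csa + (1.0 - alpha) * sim_text


# Hypothetical Semantic Context Vectors of the two example documents.
scv_1 = {"dbr:The_Rolling_Stones": 0.5, "dbr:Rock_music": 0.3, "dbr:London": 0.2}
scv_2 = {"dbr:The_Who": 0.5, "dbr:Rock_music": 0.3, "dbr:London": 0.2}
sim_csa = cosine(scv_1, scv_2)
sim = combined_similarity(sim_csa, sim_text=0.0, alpha=0.5)
```

Even though the two vectors do not share their highest-weighted entity, the context entities they have in common give a non-zero similarity.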
Evaluation – Knowledge Bases
• DBpedia, a knowledge base automatically generated from Wikipedia
– 4.58 million entities with 583 million statements
• Wikidata, a collaborative knowledge base
– 16,411,514 entities and 80,007,001 statements (49,821,734 between entities)
Evaluation – LP50 dataset
We used the LP50 dataset:
• It contains 50 documents, selected from the Australian Broadcasting Corporation’s news mail service, whose pairwise similarity was evaluated by 83 University of Adelaide students (29 males and 54 females)
• The benchmark compares the computed similarity with the human judgments by using the Pearson product-moment correlation
Benchmarks:
• Jaccard similarity on vectors composed of the starting entities
• Cosine similarity by using bag of words
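The Jaccard baseline and the Pearson correlation used for scoring can be sketched in a few lines. The starting-entity sets below are assumed annotations of the two example sentences, not data from LP50.

```python
import math


def jaccard(a, b):
    """Jaccard similarity between two collections of starting entities."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(x, y):
    """Pearson product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Assumed starting entities of the two example sentences.
se_1 = {"dbr:The_Rolling_Stones", "dbr:Roger_Daltrey", "dbr:Trafalgar_Square"}
se_2 = {"dbr:Mick_Jagger", "dbr:The_Who", "dbr:London"}
baseline = jaccard(se_1, se_2)
```

The two sentences share no starting entity, so this baseline scores them as completely dissimilar, which is the kind of weak relation that the contextual graph is meant to recover.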
Results – LP50 dataset
Results – LP50 dataset (2)
Comparison with other systems – LP50 dataset
Evaluation – Reuters 21578 dataset
Reuters is a collection of 1504 manually classified documents, which is commonly used for evaluating hierarchical clustering techniques.
• To build the cluster hierarchy we used a hierarchical clustering algorithm based on a similarity measure and group-average link.
• Similarity measures used as benchmarks:
– GED-based similarity
– LSA
– Jaccard on starting entities
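Group-average-link clustering over a precomputed similarity matrix can be sketched as below. The sketch averages the similarity over cross-cluster document pairs only, which is one common reading of group-average link and an assumption here; the matrix values are invented.

```python
def average_link_clustering(similarity, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    highest average pairwise similarity until n_clusters remain."""
    clusters = [[i] for i in range(len(similarity))]

    def average_similarity(a, b):
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = average_similarity(clusters[x], clusters[y])
                if best is None or score > best[0]:
                    best = (score, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge y into x
        del clusters[y]
    return clusters


# Toy similarity matrix for four documents: 0 and 1 are close, 2 and 3 are close.
similarity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
clusters = average_link_clustering(similarity, 2)
```

Recording each merge instead of stopping at a fixed number of clusters would give the full hierarchy; any of the similarity measures listed above can supply the matrix.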
Results – Reuters 21578 dataset
• Performance is measured in terms of goodness of fit with the existing categories by using the F-measure
• We measured the average execution time obtained by running the clustering algorithm 5 times
Conclusions & Future Works
Conclusions
• CSA is consistent with human judgments and it outperforms standard similarity methods
• We have shown that CSA is a general technique that can be used with different RDF KBs
• We demonstrated its scalability with a large corpus of documents
Future Works
• To test CSA in a specific domain with domain KBs
• To test CSA in an Information Retrieval system
References
• [1] P. D. Turney, P. Pantel, et al. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research, 37(1):141–188, 2010.
• [2] S. T. Dumais. Latent semantic analysis. Annual review of information science and technology, 38(1):188–230, 2004.
• [3] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In IJCAI, volume 7, pages 1606–1611, 2007.
• [4] E. Yeh, D. Ramage, C. D. Manning, E. Agirre, and A. Soroa. WikiWalk: random walks on Wikipedia for semantic relatedness. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, 2009.
• [5] S. Hassan and R. Mihalcea. Semantic relatedness using salient semantic analysis. In AAAI, 2011.
• [6] M. Schuhmacher and S. P. Ponzetto. Knowledge-based graph document modeling. In Proceedings of the 7th ACM international conference on Web search and data mining, pages 543–552. ACM, 2014.
• [7] W. Xing and A. Ghorbani. Weighted pagerank algorithm. In Communication Networks and Services Research, 2004. Proceedings. Second Annual Conference on, pages 305–314. IEEE, 2004.
• [8] T. H. Haveliwala. Topic-sensitive pagerank. In Proceedings of the 11th international conference on World Wide Web, pages 517–526. ACM, 2002.
Thanks for your attention!