Context Semantic Analysis: a knowledge-based technique for computing inter-document similarity


Fabio Benedetti, Domenico Beneventano, Sonia Bergamaschi

Department of Engineering “Enzo Ferrari”, University of Modena & Reggio Emilia

The 9th International Conference on Similarity Search and Applications (SISAP 2016)


Outline

• State of the art

• Our proposal - CSA

• Performance Evaluation

• Scalability Evaluation

• Conclusion and Future works

Inter-document similarity

Inter-document similarity techniques compute the similarity between two documents contained in a corpus C.

These similarity measures can be applied:

• In Information Retrieval systems (for ranking documents)
• To identify topics in a corpus
• To cluster documents according to their content

State of the Art

Content Based techniques

Only the textual information contained in the corpus is used for computing the inter-document similarity.
• Vector Space model [1]
• LSA [2]

Knowledge enriched techniques

Additional resources (knowledge bases, encyclopedic resources) are used for improving the estimation of the similarity

Example

The Rolling Stones with the participation of Roger Daltrey opened the concerts’ season in Trafalgar Square

The bands headed by Mick Jagger with the leader of The Who played in London last week

Example

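Classical text-based similarity techniques do not detect weak relations between concepts, such as the link between Mick Jagger and The Rolling Stones or between Trafalgar Square and London.

Such relations can be found in knowledge bases.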

Ad-hoc techniques use Wikipedia as a source of knowledge:

• ESA [3]
• WikiWalk [4]
• SSA [5]

Only one technique uses a generic knowledge base:

• GED [6]


The Semantic Web provides standards for describing knowledge bases:
• RDF
• OWL

Linked Open Data

[Schmachtenberg, Max, Christian Bizer, and Heiko Paulheim. "Adoption of the Linked Data Best Practices in Different Topical Domains." The Semantic Web–ISWC 2014. Springer International Publishing, 2014. 245-260]

Our Proposal: CSA

Context Semantic Analysis (CSA) is a novel technique for estimating inter-document similarity that leverages the information contained in a generic knowledge base.

It can be used with any generic RDF knowledge base.

CSA aims to be scalable:
• Each document can be processed independently
• The knowledge extracted from the KB can be embedded in the document’s metadata as a vector

Knowledge Base

A generic KB is a graph composed of a set of triples.

A triple has the form <subject, predicate, object>
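
As a minimal sketch of this data model, a KB can be held in memory as a set of (subject, predicate, object) tuples. The `Triple` type and the `outgoing` helper are hypothetical names introduced only for illustration, and the two rdf:type triples are assumed from the example figure rather than checked against DBpedia.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One <subject, predicate, object> statement of the KB."""
    subject: str
    predicate: str
    obj: str


# Toy KB: the first triple appears in the slides, the rdf:type
# triples are assumed from the example figure.
kb = {
    Triple("dbr:The_Rolling_Stones", "dbo:genre", "dbr:Rock_music"),
    Triple("dbr:The_Rolling_Stones", "rdf:type", "dbo:Band"),
    Triple("dbr:Rock_music", "rdf:type", "dbo:MusicGenre"),
}


def outgoing(triples):
    """Index the graph by subject: subject -> {(predicate, object)}."""
    index = defaultdict(set)
    for t in triples:
        index[t.subject].add((t.predicate, t.obj))
    return index


edges = outgoing(kb)
print(sorted(edges["dbr:The_Rolling_Stones"]))
```

Indexing by subject makes the graph view explicit: every triple is a labeled edge from its subject node to its object node.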

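Example triple: <dbr:The_Rolling_Stones, dbo:genre, dbr:Rock_music>

[Figure: example DBpedia subgraph with the entities dbr:The_Rolling_Stones, dbr:The_Yes-Men, dbr:Rock_music, dbr:Punk_Rock and dbr:Detroit, the classes dbo:Band, dbo:MusicGenre and dbo:City, and edges labeled dbo:genre and rdf:type]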

CSA steps

Given a corpus C of documents and an RDF knowledge graph KB, CSA is composed of three steps:

• Contextual Graph Extraction: a Contextual Graph CG(d) containing the contextual information of a document d is extracted from the KB.

• Semantic Context Vectors Generation: the Semantic Context Vector SCV(d) representing the context of the document d is generated analyzing its CG(d).

• Context Similarity Evaluation: the Context Similarity is evaluated by comparing the context vectors of two documents.

Contextual Graph Extraction

For each document d we extract from KB the contextual graph CG(d):

1. Starting Entities Identification: the entities of KB that are explicitly mentioned in the document d are identified, forming the set SE(d)

2. Contextual Graph Construction: CG(d) is defined as the subgraph of KB composed of all the triples that connect, with a path of length l, at least two starting entities in SE(d)

3. Contextual Graph Weighting: the edges of CG(d) are weighted according to their importance within the graph
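
The construction step can be sketched as a bounded path search. This is an illustration under two assumptions that the slides do not confirm: paths of up to `max_len` edges are accepted, and edge direction is ignored. The function name `contextual_graph` and the toy triples are hypothetical.

```python
from collections import defaultdict


def contextual_graph(triples, starting_entities, max_len):
    """Return the triples lying on a simple path of at most max_len
    edges between two distinct starting entities (direction ignored)."""
    adjacency = defaultdict(list)
    for s, p, o in triples:
        adjacency[s].append((o, (s, p, o)))
        adjacency[o].append((s, (s, p, o)))

    kept = set()

    def walk(node, visited, path):
        if path and node in starting_entities:
            kept.update(path)  # this path joins two starting entities
        if len(path) == max_len:
            return
        for neighbor, triple in adjacency[node]:
            if neighbor not in visited:
                walk(neighbor, visited | {neighbor}, path + [triple])

    for start in starting_entities:
        walk(start, {start}, [])
    return kept


# Toy KB (illustrative triples, not checked against DBpedia).
toy_kb = [
    ("dbr:The_Rolling_Stones", "dbo:genre", "dbr:Rock_music"),
    ("dbr:The_Who", "dbo:genre", "dbr:Rock_music"),
    ("dbr:The_Who", "dbo:hometown", "dbr:London"),
    ("dbr:Detroit", "rdf:type", "dbo:City"),
]
cg = contextual_graph(toy_kb, {"dbr:The_Rolling_Stones", "dbr:The_Who"}, 2)
print(sorted(cg))
```

The exhaustive search is exponential in the path length, so it only suits small graphs; a real KB would need an indexed or query-based approach.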

Contextual Graph weighting

We used different strategies for weighting the edges of a CG.

For each generic edge $(s_i, p_i, o_i)$:

• Unweighted: the weight of each edge is set to 1

• A weight from [6], based on the information content of the property $p_i$

• A second weight from [6]

Total Correlation - Example

[Figure: a generic triple $(s_i, p_i, o_i)$ whose subject and object are linked by rdf:type to the classes $SC_i$ and $OC_i$]

• $P(SC_i)$ is the probability that the subject of a triple belongs to the class $SC_i$

• $P(OC_i)$ is the probability that the object of a triple belongs to the class $OC_i$

• $P(p_i)$ is the probability that the property of a triple is $p_i$

• $P(SC_i, p_i, OC_i)$ is the probability that the subject and the object of a triple belong to $SC_i$ and $OC_i$ respectively, and that they are connected by the property $p_i$

Total Correlation - Example

T1: <dbr:The_Yes-Men, dbo:genre, dbr:Detroit>

T2: <dbr:The_Rolling_Stones, dbo:genre, dbr:Rock_music>

Total correlation(T2) > Total correlation(T1): T2 connects a dbo:Band to a dbo:MusicGenre through dbo:genre, whereas T1 connects a dbo:Band to a dbo:City.

[Figure: the two triples, with rdf:type edges linking dbr:The_Rolling_Stones and dbr:The_Yes-Men to dbo:Band, dbr:Rock_music to dbo:MusicGenre, and dbr:Detroit to dbo:City]
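
The slides do not preserve the formula, so the sketch below assumes the pointwise form log(P(SC, p, OC) / (P(SC) · P(p) · P(OC))), with each probability estimated by counting triples. It also assumes every entity has exactly one class. The function name and the toy KB are hypothetical and serve only to reproduce the ordering of T1 and T2.

```python
import math
from collections import Counter


def total_correlation(triples, entity_class):
    """Score each (subject class, property, object class) pattern by
    log(P(SC, p, OC) / (P(SC) * P(p) * P(OC))), using triple counts."""
    n = len(triples)
    subj, prop, obj, joint = Counter(), Counter(), Counter(), Counter()
    for s, p, o in triples:
        sc, oc = entity_class[s], entity_class[o]
        subj[sc] += 1
        prop[p] += 1
        obj[oc] += 1
        joint[(sc, p, oc)] += 1
    return {
        (sc, p, oc): math.log(
            (count / n) / ((subj[sc] / n) * (prop[p] / n) * (obj[oc] / n))
        )
        for (sc, p, oc), count in joint.items()
    }


# Toy data (illustrative only): one class per entity.
entity_class = {
    "dbr:The_Rolling_Stones": "dbo:Band", "dbr:The_Who": "dbo:Band",
    "dbr:The_Yes-Men": "dbo:Band", "dbr:Rock_music": "dbo:MusicGenre",
    "dbr:Punk_Rock": "dbo:MusicGenre", "dbr:Detroit": "dbo:City",
    "dbr:London": "dbo:City",
}
triples = [
    ("dbr:The_Rolling_Stones", "dbo:genre", "dbr:Rock_music"),
    ("dbr:The_Who", "dbo:genre", "dbr:Rock_music"),
    ("dbr:The_Yes-Men", "dbo:genre", "dbr:Punk_Rock"),
    ("dbr:The_Yes-Men", "dbo:genre", "dbr:Detroit"),
    ("dbr:The_Who", "dbo:hometown", "dbr:London"),
    ("dbr:The_Rolling_Stones", "dbo:hometown", "dbr:London"),
]
tc = total_correlation(triples, entity_class)
tc_t1 = tc[("dbo:Band", "dbo:genre", "dbo:City")]
tc_t2 = tc[("dbo:Band", "dbo:genre", "dbo:MusicGenre")]
print(tc_t2 > tc_t1)
```

On this toy KB the Band/genre/MusicGenre pattern occurs more often than independence would predict and gets a positive score, while the Band/genre/City pattern gets a negative one.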

Semantic Context Vectors

A Semantic Context Vector of the document $d_i$, $SCV(d_i)$, synthesizes the information contained in $CG(d_i)$:

$$SCV(d_i) = [\, w(e_1, d_i),\ w(e_2, d_i),\ \ldots,\ w(e_N, d_i) \,]$$

where $e_j \in E$, and $E$ contains the set of entities belonging to KB.

$w(e_j, d_i)$ is a weighting function that defines the importance of the entity $e_j$ in the contextual graph of the document $d_i$.

Weighting functions

The weighting function has to spread weight over the entities of a contextual graph according to the importance of these entities within the graph.

• r: Weighted PageRank [7]
• pr: Weighted Personalized PageRank [8]
– It biases the probability of the random surfer being teleported toward a particular group of nodes
– Personalization:

We used different configurations of the damping factor.
• The probability, at any iteration, that the random surfer will not be teleported is defined by the damping factor
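
The role of the damping factor and of personalization can be sketched with a plain power iteration. This is a generic edge-weighted PageRank, not necessarily the exact formulations of [7] or [8]. The function name, its parameters and the three-node graph are assumptions made for illustration, and edge weights are assumed to be positive.

```python
def weighted_pagerank(edges, damping=0.85, personalization=None, iterations=100):
    """Power iteration on weighted directed edges {(source, target): weight}.
    `personalization` maps nodes to teleport preferences (uniform if None)."""
    nodes = sorted({node for edge in edges for node in edge})
    if personalization is None:
        personalization = {node: 1.0 for node in nodes}
    total = sum(personalization.get(node, 0.0) for node in nodes)
    teleport = {node: personalization.get(node, 0.0) / total for node in nodes}

    out_weight = {node: 0.0 for node in nodes}
    for (src, _), weight in edges.items():
        out_weight[src] += weight

    rank = dict(teleport)
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is teleported too.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        new_rank = {
            n: (1.0 - damping + damping * dangling) * teleport[n] for n in nodes
        }
        for (src, dst), weight in edges.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


# Symmetric chain a - b - c, written as two directed edges per link.
chain = {("a", "b"): 1.0, ("b", "a"): 1.0, ("b", "c"): 1.0, ("c", "b"): 1.0}
uniform = weighted_pagerank(chain)
biased = weighted_pagerank(chain, personalization={"a": 1.0})
print(uniform, biased)
```

With uniform teleportation the two end nodes get the same score. Teleporting only to `a` raises its score above `c`, which is the bias that personalization is meant to introduce.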

Context Similarity Evaluation

We now have a Semantic Context Vector for each document in the corpus.

We use the cosine similarity between these vectors to estimate the Context Similarity.

We also computed a linear combination of CSA with text similarity measures:

$sim = \alpha \cdot sim_{CSA} + (1 - \alpha) \cdot sim_{text}$

where $\alpha$ is the weight parameter used for combining the two measures.
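
A minimal sketch of this step, assuming each Semantic Context Vector is stored sparsely as a dictionary from entity to weight. The function names and the sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (entity -> weight)."""
    dot = sum(weight * v.get(entity, 0.0) for entity, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty context shares nothing with any document
    return dot / (norm_u * norm_v)


def combined_similarity(sim_csa, sim_text, alpha):
    """Linear combination: alpha * sim_csa + (1 - alpha) * sim_text."""
    return alpha * sim_csa + (1.0 - alpha) * sim_text


# Illustrative vectors with made-up weights.
scv_1 = {"dbr:Rock_music": 0.6, "dbr:London": 0.4}
scv_2 = {"dbr:Rock_music": 0.3, "dbr:London": 0.2}
sim_csa = cosine(scv_1, scv_2)
print(combined_similarity(sim_csa, sim_text=0.2, alpha=0.5))
```

Setting alpha to 1 uses only the context similarity and setting it to 0 uses only the text similarity.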

Evaluation – Knowledge Bases

• DBpedia, a knowledge base automatically generated from Wikipedia
– 4.58 million entities with 583 million statements

• Wikidata, a collaborative knowledge base
– 16,411,514 entities and 80,007,001 statements (49,821,734 between entities)

Evaluation – LP50 dataset

We used the LP50 dataset:
• It contains 50 documents, selected from the Australian Broadcasting Corporation's news mail service, evaluated by 83 University of Adelaide students (29 males and 54 females)
• The benchmark compares the computed similarities with the human judgments by using the Pearson product-moment correlation

Benchmarks:
• Jaccard similarity on vectors composed of the starting entities
• Cosine similarity using bag of words
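
The Jaccard baseline and the evaluation metric can be sketched as follows. The entity sets and score lists are made-up values for illustration, not LP50 data, and the function names are hypothetical.

```python
import math


def jaccard(a, b):
    """Jaccard similarity between two sets of starting entities."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pearson(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up example: system scores against averaged human ratings.
system_scores = [0.1, 0.4, 0.5, 0.9]
human_ratings = [1.0, 2.5, 3.0, 4.5]
print(pearson(system_scores, human_ratings))
print(jaccard({"dbr:London", "dbr:The_Who"}, {"dbr:London", "dbr:Mick_Jagger"}))
```

A correlation close to 1 means the system ranks document pairs much as the human judges do.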

Results – LP50 dataset

Results – LP50 dataset (2)


Comparison with other systems – LP50 dataset

Evaluation – Reuters 21578 dataset

Reuters is a collection of 1504 manually classified documents, which is commonly used for evaluating hierarchical clustering techniques.

• To build the cluster hierarchy we used a hierarchical clustering algorithm based on a similarity measure and group-average link.

• Similarity measures used as benchmarks:
– GED-based similarity
– LSA
– Jaccard on starting entities
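
Group-average agglomerative clustering can be sketched as below. The exact variant used in the talk is not stated; this sketch averages the similarity over all cross-cluster pairs, and the function name and the one-dimensional toy data are hypothetical.

```python
def average_link_clustering(items, similarity):
    """Agglomerative clustering with group-average linkage.
    Returns the merge history as (cluster_a, cluster_b, score) tuples."""
    clusters = [frozenset([item]) for item in items]
    merges = []

    def average_similarity(a, b):
        return sum(similarity(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Pick the pair of clusters with the highest average similarity.
        score, i, j = max(
            (
                (average_similarity(a, b), i, j)
                for i, a in enumerate(clusters)
                for j, b in enumerate(clusters)
                if i < j
            ),
            key=lambda candidate: candidate[0],
        )
        merges.append((clusters[i], clusters[j], score))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy documents represented by a single number; closer means more similar.
points = [0.0, 0.1, 5.0]
history = average_link_clustering(points, lambda x, y: 1.0 / (1.0 + abs(x - y)))
for cluster_a, cluster_b, score in history:
    print(sorted(cluster_a), sorted(cluster_b), round(score, 3))
```

Any document similarity can be passed as `similarity`, which is how one clustering procedure can be run with each of the measures listed above.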

Results – Reuters 21578 dataset

• Performance is measured in terms of goodness of fit with the existing categories by using the F-measure
• We measured the average execution time obtained by running the clustering algorithm 5 times

Conclusions & Future works

Conclusions
• CSA is consistent with human judgments and it outperforms standard similarity methods
• We have shown that CSA is a general technique that can be used with different RDF KBs
• We demonstrated its scalability with a large corpus of documents

Future works
• To test CSA in a specific domain with domain KBs
• To test CSA in an Information Retrieval system

References

• [1] P. D. Turney, P. Pantel, et al. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research, 37(1):141–188, 2010.

• [2] S. T. Dumais. Latent semantic analysis. Annual review of information science and technology, 38(1):188–230, 2004.

• [3] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In IJCAI, volume 7, pages 1606–1611, 2007.

• [4] E. Yeh, D. Ramage, C. D. Manning, E. Agirre, and A. Soroa. WikiWalk: random walks on Wikipedia for semantic relatedness. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, pages 41–49. ACL, 2009.

• [5] S. Hassan and R. Mihalcea. Semantic relatedness using salient semantic analysis. In AAAI, 2011.

• [6] M. Schuhmacher and S. P. Ponzetto. Knowledge-based graph document modeling. In Proceedings of the 7th ACM international conference on Web search and data mining, pages 543–552. ACM, 2014.

• [7] W. Xing and A. Ghorbani. Weighted pagerank algorithm. In Communication Networks and Services Research, 2004. Proceedings. Second Annual Conference on, pages 305–314. IEEE, 2004.

• [8] T. H. Haveliwala. Topic-sensitive pagerank. In Proceedings of the 11th international conference on World Wide Web, pages 517–526. ACM, 2002.


Thanks for your attention!
