Text Similarities - PG Pushpin

PUSHPIN TEXT SIMILARITIES

Junaid Surve

6644418

AGENDA
Introduction
Data Retrieval
  TF/IDF
  Document-Term Matrix
  VSM
  LSA
Similarity Measurements
  Cosine Similarity
  SOC-PMI
Applications & Prototype
Summary


INTRODUCTION
WWW – a huge tangled web of information.
Issues faced – duplication, plagiarism, copyright violation, etc.
Aim: To detect and report duplicates.
Method: Compare documents and output their level of similarity, which is the “TEXT SIMILARITY”.

Text Similarity has 2 aspects:
  Content Similarity: Words are compared.
    e.g. “I have a car” and “I have a vehicle” are 75% similar (3 of the 4 words match).
  Expression Similarity: The meaning of the information is considered.
    e.g. “I have a car” and “I have a vehicle” can be considered 100% similar.
Scope – Content Similarity
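As a rough illustration of content similarity, the 75% figure can be reproduced by counting shared words. This is only a sketch: the measure used here (shared words divided by the word count of the longer sentence) and the function name `content_similarity` are assumptions, since the slides do not define how the percentage is computed.

```python
def content_similarity(a: str, b: str) -> float:
    """Share of words two sentences have in common (assumed measure)."""
    words_a = a.lower().split()
    words_b = b.lower().split()
    # Words common to both sentences, ignoring order and repeats.
    shared = set(words_a) & set(words_b)
    longest = max(len(words_a), len(words_b))
    return len(shared) / longest if longest else 0.0


print(content_similarity("I have a car", "I have a vehicle"))  # 0.75
```

A measure like this only sees surface words, which is why "car" and "vehicle" count as a mismatch here.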

2 step process:
STEP 1: Data Retrieval
“The area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web” [1]
STEP 2: Similarity Measurements
To correlate the words or terms of two or more documents or web pages.


DATA RETRIEVAL
Translation of literature to mathematics.
A variety of concrete techniques exist:
  TF/IDF
  Document-Term Matrix
  VSM
  LSA
The corresponding mathematical structure is derived based on the data retrieval technique used.

TF/IDF
Term Frequency / Inverse Document Frequency
Idea: The more common a term is, the less important it is, so it should carry the least weight in a query.
Two linear, independent aspects:
  Term Frequency - frequency of occurrence of a term in a given document.
  Inverse Document Frequency - measure of the general importance of the term.

TF IDF Example [7]
Three Documents –
  D1: “Shipment of gold damaged in a fire”
  D2: “Delivery of silver arrived in a silver truck”
  D3: “Shipment of gold arrived in a truck”
Two steps:
  Calculate the Term Frequency
  Calculate the Inverse Document Frequency

TF IDF Example

Terms     D1  D2  D3  df_i  D/df_i     IDF = log(D/df_i)
a          1   1   1   3    3/3 = 1    0
arrived    0   1   1   2    3/2 = 1.5  0.1761
damaged    1   0   0   1    3/1 = 3    0.4771
delivery   0   1   0   1    3/1 = 3    0.4771
fire       1   0   0   1    3/1 = 3    0.4771
gold       1   0   1   2    3/2 = 1.5  0.1761
in         1   1   1   3    3/3 = 1    0
of         1   1   1   3    3/3 = 1    0
silver     0   2   0   1    3/1 = 3    0.4771
shipment   1   0   1   2    3/2 = 1.5  0.1761
truck      0   1   1   2    3/2 = 1.5  0.1761

D = 3 is the number of documents, df_i is the number of documents containing term i, and log is base 10.
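The IDF column can be recomputed from the three documents. The following is a minimal sketch, assuming whitespace tokenization, lower-casing, and a base-10 logarithm (which matches the values in the example); the function name `inverse_document_frequency` is hypothetical.

```python
import math

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}


def inverse_document_frequency(docs: dict[str, str]) -> dict[str, float]:
    """Return IDF = log10(D / df) for every term in the collection."""
    tokenized = [set(text.lower().split()) for text in docs.values()]
    total_docs = len(tokenized)
    vocabulary = set().union(*tokenized)
    idf = {}
    for term in sorted(vocabulary):
        # df: number of documents that contain the term at least once.
        df = sum(term in tokens for tokens in tokenized)
        idf[term] = math.log10(total_docs / df)
    return idf


idf = inverse_document_frequency(docs)
for term, value in idf.items():
    print(f"{term:10s} {value:.4f}")
```

Terms that occur in every document ("a", "in", "of") get an IDF of 0, so they contribute nothing to later comparisons.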

Document-Term Matrix
“A Document-Term Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.” [2]
Rows – Documents
Columns – Terms
It only records which document contains which term and the number of occurrences of that term in the document.

Document-Term Matrix Example
D1 = “I like databases”
D2 = “I hate hate databases”

      I   like   databases   hate
D1    1   1      1           0
D2    1   0      1           2
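A document-term matrix like the one in this example can be built in a few lines. This is a sketch under simple assumptions: words are split on whitespace, case is kept as written, and columns follow the order in which terms first appear. The function name `document_term_matrix` is hypothetical.

```python
def document_term_matrix(docs: list[str]) -> tuple[list[str], list[list[int]]]:
    """Build a document-term matrix: one row per document, one column per term."""
    vocabulary: list[str] = []
    for text in docs:
        for word in text.split():
            if word not in vocabulary:
                vocabulary.append(word)  # keep first-seen order for the columns
    # Each cell holds how often the column's term occurs in the row's document.
    matrix = [[text.split().count(term) for term in vocabulary] for text in docs]
    return vocabulary, matrix


terms, matrix = document_term_matrix(["I like databases", "I hate hate databases"])
print(terms)   # ['I', 'like', 'databases', 'hate']
print(matrix)  # [[1, 1, 1, 0], [1, 0, 1, 2]]
```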

VSM
“Vector Space Model (VSM) is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, for example, index terms.” [3]
Each document and query is represented as a vector:
  document: d_j = (w_{1,j}, w_{2,j}, ..., w_{n,j})
  query: q = (w_{1,q}, w_{2,q}, ..., w_{n,q})
Terms can be individual words, keywords, or phrases, depending on the type of application.

VSM Example [7]
Three Documents –
  D1: “Shipment of gold damaged in a fire”
  D2: “Delivery of silver arrived in a silver truck”
  D3: “Shipment of gold arrived in a truck”
Query – Gold Silver Truck

VSM Example continued...
Calculating TF-IDF

Terms     Q  D1  D2  D3  IDF_i   Q×IDF_i  D1×IDF_i  D2×IDF_i  D3×IDF_i
a         0   1   1   1  0       0        0         0         0
arrived   0   0   1   1  0.1761  0        0         0.1761    0.1761
damaged   0   1   0   0  0.4771  0        0.4771    0         0
delivery  0   0   1   0  0.4771  0        0         0.4771    0
fire      0   1   0   0  0.4771  0        0.4771    0         0
gold      1   1   0   1  0.1761  0.1761   0.1761    0         0.1761
in        0   1   1   1  0       0        0         0         0
of        0   1   1   1  0       0        0         0         0
silver    1   0   2   0  0.4771  0.4771   0         0.9542    0
shipment  0   1   0   1  0.1761  0        0.1761    0         0.1761
truck     1   0   1   1  0.1761  0.1761   0         0.1761    0.1761
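The weight columns are simply term counts multiplied by IDF. A minimal sketch of that step follows, assuming whitespace tokenization, lower-casing, a base-10 logarithm, and document frequencies taken from the three documents only (the query is not counted). The function name `tfidf_weights` is hypothetical.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"


def tfidf_weights(docs: dict[str, str], query: str) -> dict[str, dict[str, float]]:
    """Weight each term count by IDF = log10(D / df), for documents and query."""
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    terms = sorted(set().union(*counts.values()))
    # df is taken from the documents only; the query does not affect it.
    idf = {
        term: math.log10(len(counts) / sum(term in c for c in counts.values()))
        for term in terms
    }
    counts["Q"] = Counter(query.lower().split())
    return {
        name: {term: c[term] * idf[term] for term in terms}
        for name, c in counts.items()
    }


weights = tfidf_weights(docs, query)
print(round(weights["D2"]["silver"], 4))  # 0.9542
```

Query terms that do not occur in any document are ignored in this sketch, because they have no IDF value.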

LSA
“Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the meaning of words and passages of words.” [4]
Built on the assumption that similar terms tend to appear in close proximity, which makes it easier to identify correlation patterns between documents or terms.
2 step process:
  Construction of the Document-Term Matrix
  Singular Value Decomposition

LSA Example
STEP 1: Constructing the Term-Document Matrix and the Query Matrix
STEP 2: Evaluating the Singular Value Decomposition (SVD)
STEP 3: Reducing the dimensionality to rank k
A similar SVD evaluation and reduction is done for the query vector Q.
At the end we have:
  Reduced SVD Matrix V (for the documents)
  Reduced SVD Matrix Q (for the query)
These can then be supplied to a similarity measurement technique.
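In matrix form, the LSA steps can be summarised as follows. This is a sketch of the standard truncated-SVD formulation rather than a transcription of the slides, which show these steps only as figures. The symbols are assumptions: A is the term-document matrix, q the query vector, U and V the singular vector matrices, S the diagonal matrix of singular values, and k the number of singular values kept.

```latex
\begin{align*}
A   &= U S V^{T}          && \text{(singular value decomposition)} \\
A_k &= U_k S_k V_k^{T}    && \text{(keep only the $k$ largest singular values)} \\
q_k &= q^{T} U_k S_k^{-1} && \text{(project the query into the same $k$-dimensional space)}
\end{align*}
```

Each row of V_k then gives the coordinates of one document, and q_k can be compared against those rows.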


SIMILARITY MEASUREMENTS
Major focus of the “Text Similarities” methodology.
Uses the mathematical structures generated by the data retrieval techniques to evaluate the percentage of likeness between two or more documents or web pages.
Two major techniques in focus here:
  Cosine Similarity
  SOC-PMI

COSINE SIMILARITY
Evaluates the similarity between 2 vectors by measuring the cosine of the angle between them.
The cosine of the angle determines whether the vectors are roughly pointing in the same direction.
In our scope: similarity will range between 0 and 1, since term weights are never negative, i.e. the angle between the two vectors will never exceed 90°.

COSINE Example [7]
Example continued from VSM.
We have calculated the weights using the TF-IDF scheme.
Next step – calculate the cosine similarity: Cosine Θ_Di = (Q · D_i) / (|Q| × |D_i|)
i.e. first calculate the dot product: Q · D_i
Then calculate the product of the vector magnitudes: |Q| × |D_i|

COSINE Example continued...
Dot products: Q · D_i = Σ_j w_{Q,j} × w_{i,j}, where w_{i,j} is the weight of term j in document D_i
  Q · D1 = 0.0310, Q · D2 = 0.4862, Q · D3 = 0.0620
Products of magnitudes: |Q| × |D_i| = sqrt(Σ_j w_{Q,j}^2) × sqrt(Σ_j w_{i,j}^2)
  |Q| × |D1| = 0.3871, |Q| × |D2| = 0.5896, |Q| × |D3| = 0.1896
Cosine similarity:
  Cosine Θ_D1 = 0.0801
  Cosine Θ_D2 = 0.8246
  Cosine Θ_D3 = 0.3271
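The cosine values can be checked with a short script. This is a minimal sketch: the vectors hold only the non-zero TF-IDF weights from the VSM example, stored as dictionaries, and the function name `cosine_similarity` and the zero-vector convention are assumptions.

```python
import math


def cosine_similarity(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


# Non-zero TF-IDF weights of the query and the three documents.
q = {"gold": 0.1761, "silver": 0.4771, "truck": 0.1761}
documents = {
    "D1": {"damaged": 0.4771, "fire": 0.4771, "gold": 0.1761, "shipment": 0.1761},
    "D2": {"arrived": 0.1761, "delivery": 0.4771, "silver": 0.9542, "truck": 0.1761},
    "D3": {"arrived": 0.1761, "gold": 0.1761, "shipment": 0.1761, "truck": 0.1761},
}
scores = {name: cosine_similarity(q, d) for name, d in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['D2', 'D3', 'D1']
```

D2 ranks first because it is the only document containing "silver", the query term with the highest IDF.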

SOC-PMI
“Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI) is a semantic similarity measure using pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus.” [5]
Deriving the formula involves a lot of mathematics.
The similarity measure is normalized at the end so that it ranges between 0 and 1.
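SOC-PMI builds on pointwise mutual information (PMI), so it may help to see that building block alone. The sketch below covers only the PMI score of one word pair, computed from raw corpus counts; it is not the full SOC-PMI measure. The function name `pmi`, its parameter names, and the counts in the usage line are made up for illustration.

```python
import math


def pmi(pair_count: int, count_a: int, count_b: int, total_tokens: int) -> float:
    """Pointwise mutual information (base 2) from raw corpus counts.

    pair_count   -- how often the two words co-occur within the chosen window
    count_a/b    -- how often each word occurs in the corpus
    total_tokens -- number of tokens in the corpus
    """
    if pair_count == 0:
        return float("-inf")  # the words never co-occur in the window
    return math.log2(pair_count * total_tokens / (count_a * count_b))


# Made-up counts for illustration only; they are not taken from the slides.
print(pmi(pair_count=2, count_a=4, count_b=4, total_tokens=32))  # 2.0
```

A positive score means the two words co-occur more often than they would by chance.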

SOC-PMI with an example
A complicated method with many mathematical formulae.
Example [6]: W1 = car, W2 = automobile
m = 70, n = 43
Assumptions: γ = 3, δ = 0.7, window of 11 words
β1 = β2 = 24.88
[Figure: corpus]

SOC-PMI example contd...
[Figure: types and frequencies]
[Figure: bigram frequencies, and the sets X and Y of words with their PMI values]


APPLICATIONS
Plagiarism Detection
  Text similarity plays an important role in plagiarism detection.
Copyright Violation
  Copies of restricted software/data can be detected using text similarities.
Recommender Services

PROTOTYPE
AIM: Finding the degree of similarity between files.
2 steps:
  Data Retrieval
    TF-IDF
  Similarity Measurement
    Cosine
    Pearson Correlation
    Distribution Matrix
    Co-occurrence

Prototype – Data Retrieval
Steps followed to retrieve data using the TF-IDF scheme:
  SequenceFilesFromDirectory
    Converts files into sequence files <Text, Text>
  DocumentProcessor
    Converts the sequence file into <Text, StringTuple>
  DictionaryVectorizer
    Creates TF vectors <Text, VectorWritable>
    Creates dfcount <IntWritable, LongWritable>
    Creates wordcount <Text, LongWritable>
  TFIDFConverter
    Creates TF-IDF vectors <Text, VectorWritable>

Prototype – Similarity Measurement
Intermediate steps:
  Convert the TF-IDF vectors into a Matrix <IntWritable, VectorWritable>
Similarity Measurement:
  Distribution Multiplication
    Matrix * Matrix´
  Cosine, Pearson Correlation and Co-occurrence
    RowSimilarityJob (Similarity Classname)
      SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE
      SIMILARITY_PEARSON_CORRELATION
      SIMILARITY_COOCCURRENCE
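To make the row-wise measures more concrete, here is a small sketch of Pearson correlation and co-occurrence between two matrix rows. These are textbook definitions written in plain Python, not the code behind RowSimilarityJob, and the exact behaviour of the prototype's similarity classes may differ. The function names and the sample rows (the two rows of the earlier document-term matrix example) are chosen for illustration.

```python
import math


def pearson_correlation(u: list[float], v: list[float]) -> float:
    """Pearson correlation: covariance of the rows over the product of their spreads."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # assumed convention for a constant row
    return cov / math.sqrt(var_u * var_v)


def cooccurrence(u: list[float], v: list[float]) -> int:
    """Number of columns (terms) that are non-zero in both rows."""
    return sum(1 for a, b in zip(u, v) if a != 0 and b != 0)


row_d1 = [1, 1, 1, 0]  # "I like databases"
row_d2 = [1, 0, 1, 2]  # "I hate hate databases"
print(cooccurrence(row_d1, row_d2))                   # 2
print(round(pearson_correlation(row_d1, row_d2), 4))  # -0.8165
```

Unlike cosine similarity on non-negative weights, Pearson correlation can be negative, as it is for these two rows.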

Prototype – Similarity Measurement
  Cosine
  Pearson Correlation
  Distribution Matrix
  Co-occurrence


SUMMARY
What is Text Similarity.
Scope - Content Similarity
Steps involved in the process:
  Data Retrieval
    TF/IDF
    Document-Term Matrix
    VSM
    LSA
  Similarity Measurements
    Cosine Similarity
    SOC-PMI
Applications & Prototype


References
[1] Wikipedia: Information retrieval - Wikipedia, the free encyclopedia (2012), http://en.wikipedia.org/wiki/Information_retrieval
[2] Wikipedia: Document-term matrix - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Document-term_matrix
[3] Wikipedia: Vector space model - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Vector_space_model
[4] Wikipedia: Latent semantic indexing - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Latent_semantic_indexing
[5] Wikipedia: Second-order co-occurrence pointwise mutual information - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Second-order_co-occurrence_pointwise_mutual_information
[6] Islam, A. and Inkpen, D. (2006). Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words, in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038.
[7] Dr. E. Garcia. Mi Islita.com - http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html
