TEXT SIMILARITIES
Junaid Surve
6644418

AGENDA
- Introduction
- Data Retrieval
  - TF/IDF
  - Document-Term Matrix
  - VSM
  - LSA
- Similarity Measurements
  - Cosine Similarity
  - SOC-PMI
- Applications & Prototype
- Summary

INTRODUCTION
- The WWW is a huge, tangled web of information.
- Issues faced: duplication, plagiarism, copyright violation, etc.
- Aim: to detect and report duplicates.
- Method: compare documents and output their level of similarity, i.e. their "text similarity".

Text similarity has two aspects:
- Content similarity: the words themselves are compared. E.g. "I have a car" and "I have a vehicle" are 75% similar, because three of the four words match.
- Expression similarity: the meaning of the information is considered. E.g. "I have a car" and "I have a vehicle" can be considered 100% similar.
Scope: content similarity.
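The 75% figure can be reproduced with a very small word-by-word comparison. This is only a minimal sketch: the function name `content_similarity` and the position-by-position matching rule are assumptions made for illustration, not a method stated on the slide.

```python
def content_similarity(a: str, b: str) -> float:
    """Fraction of word positions at which the two sentences agree."""
    words_a = a.lower().split()
    words_b = b.lower().split()
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 1.0  # two empty sentences are treated as identical
    # Compare the sentences position by position and count equal words.
    matches = sum(1 for x, y in zip(words_a, words_b) if x == y)
    return matches / longest


score = content_similarity("I have a car", "I have a vehicle")
print(f"{score:.0%}")  # 75%
```

A position-based comparison is the simplest reading of "words are compared"; a set-based overlap would give a different score for the same pair.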

A two-step process:
- Step 1: Data Retrieval. "The area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web." [1]
- Step 2: Similarity Measurements. Correlate the words or terms of two or more documents or web pages.

DATA RETRIEVAL
- Translates literature (text) into mathematics.
- Several concrete techniques exist:
  - TF/IDF
  - Document-Term Matrix
  - VSM
  - LSA
- The resulting mathematical structure depends on the data retrieval technique used.

TF/IDF
- Term Frequency / Inverse Document Frequency
- Idea: the more common a term is across the collection, the less important it is, so it should carry the least weight in a query.
- Two independent aspects:
  - Term Frequency: how often a term occurs in a given document.
  - Inverse Document Frequency: a measure of the general importance of the term.

TF/IDF Example [7]
- Three documents:
  - D1: "Shipment of gold damaged in a fire"
  - D2: "Delivery of silver arrived in a silver truck"
  - D3: "Shipment of gold arrived in a truck"
- Two steps:
  - Calculate the term frequency.
  - Calculate the inverse document frequency.

TF/IDF Example, continued
(D = 3 is the number of documents; dfi is the number of documents containing term i; the logarithm is base 10.)

Term      D1  D2  D3  dfi  D/dfi       IDF = log10(D/dfi)
a         1   1   1   3    3/3 = 1     0
arrived   0   1   1   2    3/2 = 1.5   0.1761
damaged   1   0   0   1    3/1 = 3     0.4771
delivery  0   1   0   1    3/1 = 3     0.4771
fire      1   0   0   1    3/1 = 3     0.4771
gold      1   0   1   2    3/2 = 1.5   0.1761
in        1   1   1   3    3/3 = 1     0
of        1   1   1   3    3/3 = 1     0
silver    0   2   0   1    3/1 = 3     0.4771
shipment  1   0   1   2    3/2 = 1.5   0.1761
truck     0   1   1   2    3/2 = 1.5   0.1761
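The table can be recomputed in a few lines. The sketch below assumes lower-cased, whitespace-split tokens and a base-10 logarithm (which is what the tabulated values 0.1761 and 0.4771 correspond to); the helper names are illustrative only.

```python
import math

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}

# Lower-case and split on whitespace (an assumed, very simple tokenizer).
tokens = {name: text.lower().split() for name, text in docs.items()}
vocabulary = sorted({word for words in tokens.values() for word in words})


def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for words in tokens.values() if term in words)


def idf(term):
    """Inverse document frequency: log10(D / df)."""
    return math.log10(len(docs) / document_frequency(term))


for term in vocabulary:
    counts = [tokens[name].count(term) for name in docs]
    print(f"{term:<9} tf={counts} df={document_frequency(term)} idf={idf(term):.4f}")
```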

Document-Term Matrix
- "A Document-Term Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents." [2]
- Rows: documents. Columns: terms.
- It records only which document contains which term, and how many times that term occurs in the document.

Document-Term Matrix Example
- D1 = "I like databases"
- D2 = "I hate hate databases"

      I   like   databases   hate
D1    1   1      1           0
D2    1   0      1           2
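As a minimal sketch, the same matrix can be built by counting words per document. Whitespace tokenization and the first-appearance column order are assumptions chosen so that the result lines up with the example.

```python
docs = {
    "D1": "I like databases",
    "D2": "I hate hate databases",
}

# Columns in order of first appearance: I, like, databases, hate.
terms = []
for text in docs.values():
    for word in text.split():
        if word not in terms:
            terms.append(word)

# One row per document, one count per term.
matrix = {
    name: [text.split().count(term) for term in terms]
    for name, text in docs.items()
}

print(terms)
for name, row in matrix.items():
    print(name, row)
```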

VSM
- "Vector Space Model (VSM) is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, for example, index terms." [3]
- Each document and query is represented as a vector:
  - document: $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$
  - query: $q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$
- Terms can be individual words, keywords, or phrases, depending on the application.

VSM Example [7]
- Three documents:
  - D1: "Shipment of gold damaged in a fire"
  - D2: "Delivery of silver arrived in a silver truck"
  - D3: "Shipment of gold arrived in a truck"
- Query: "gold silver truck"

VSM Example, continued: calculating TF-IDF weights

Term      Q  D1  D2  D3  IDFi    Q×IDFi  D1×IDFi  D2×IDFi  D3×IDFi
a         0  1   1   1   0       0       0        0        0
arrived   0  0   1   1   0.1761  0       0        0.1761   0.1761
damaged   0  1   0   0   0.4771  0       0.4771   0        0
delivery  0  0   1   0   0.4771  0       0        0.4771   0
fire      0  1   0   0   0.4771  0       0.4771   0        0
gold      1  1   0   1   0.1761  0.1761  0.1761   0        0.1761
in        0  1   1   1   0       0       0        0        0
of        0  1   1   1   0       0       0        0        0
silver    1  0   2   0   0.4771  0.4771  0        0.9542   0
shipment  0  1   0   1   0.1761  0       0.1761   0        0.1761
truck     1  0   1   1   0.1761  0.1761  0        0.1761   0.1761
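The weight columns are simply term counts multiplied by IDF. A minimal sketch, assuming whitespace tokenization, base-10 logarithms, and document frequencies taken from the three documents only (the query does not contribute to them); the function name `weights` is illustrative:

```python
import math

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"

tokenized = {name: text.lower().split() for name, text in docs.items()}
terms = sorted({word for words in tokenized.values() for word in words})

# IDF per term, computed over the document collection only.
idf = {
    term: math.log10(len(docs) / sum(term in words for words in tokenized.values()))
    for term in terms
}


def weights(text):
    """TF-IDF weight vector of a text over the shared term list."""
    words = text.lower().split()
    return [words.count(term) * idf[term] for term in terms]


query_weights = weights(query)
doc_weights = {name: weights(text) for name, text in docs.items()}

for term, *row in zip(terms, query_weights, *doc_weights.values()):
    print(f"{term:<9}", " ".join(f"{w:.4f}" for w in row))
```

Note that `terms` is sorted alphabetically here, so "shipment" comes before "silver", unlike the row order in the table.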

LSA
- "Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the meaning of words and passages of words." [4]
- Built on the assumption that similar terms tend to appear close to one another, which makes correlation patterns between documents or terms easier to identify.
- A two-step process:
  - Construction of the Document-Term Matrix
  - Singular Value Decomposition

LSA Example
- Three documents:
  - D1: "Shipment of gold damaged in a fire"
  - D2: "Delivery of silver arrived in a silver truck"
  - D3: "Shipment of gold arrived in a truck"
- Query: "gold silver truck"

LSA Example, continued
- Step 1: Construct the term-document matrix and the query matrix.
- Step 2: Compute the Singular Value Decomposition (SVD).
- Step 3: Reduce the dimensionality to rank k.

LSA Example, continued
- The query vector Q is mapped into the same reduced space.
- At the end we have:
  - the reduced SVD matrix V (for the documents)
  - the reduced SVD matrix Q (for the query)
- These can then be supplied to a similarity measurement technique.
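The three LSA steps can be sketched with NumPy. This is an illustration under stated assumptions, not the computation shown on the slides: raw term counts are used as matrix entries, k = 2 is an assumed rank, and the query is folded in as q^T U_k S_k^{-1}, which is one common convention.

```python
import numpy as np

terms = ["a", "arrived", "damaged", "delivery", "fire", "gold",
         "in", "of", "shipment", "silver", "truck"]
docs = [
    "Shipment of gold damaged in a fire",
    "Delivery of silver arrived in a silver truck",
    "Shipment of gold arrived in a truck",
]
query = "gold silver truck"


def count_vector(text):
    """Raw term counts of a text over the fixed term list."""
    words = text.lower().split()
    return np.array([words.count(t) for t in terms], dtype=float)


# Step 1: term-document matrix (terms x documents) and query vector.
A = np.column_stack([count_vector(d) for d in docs])
q = count_vector(query)

# Step 2: singular value decomposition A = U S V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: keep the k largest singular values (k = 2 is an assumption).
k = 2
U_k, s_k = U[:, :k], s[:k]
V_k = Vt[:k, :].T            # one row of reduced coordinates per document
q_k = q @ U_k / s_k          # query folded into the same space


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


sims = [cosine(q_k, V_k[i]) for i in range(len(docs))]
for name, sim in zip(("D1", "D2", "D3"), sims):
    print(name, round(sim, 4))
```

The sign of each singular vector is arbitrary, but it flips the document and query coordinates together, so the cosine values are unaffected.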

SIMILARITY MEASUREMENTS
- The major focus of the "text similarities" methodology.
- Uses the mathematical structures generated by the data retrieval techniques to evaluate the percentage of likeness between two or more documents or web pages.
- Two techniques are in focus here:
  - Cosine Similarity
  - SOC-PMI

COSINE SIMILARITY
- Evaluates the similarity between two vectors by measuring the cosine of the angle between them.
- The cosine of the angle determines whether the vectors point in roughly the same direction.
- In our scope, similarity ranges between 0 and 1, since term weights are never negative; i.e. the angle between two vectors never exceeds 90°.

COSINE Example [7]
- Continued from the VSM example.
- Three documents:
  - D1: "Shipment of gold damaged in a fire"
  - D2: "Delivery of silver arrived in a silver truck"
  - D3: "Shipment of gold arrived in a truck"
- Query: "gold silver truck"
- The weights have already been calculated using the TF-IDF scheme.
- Next step: calculate the cosine similarity, $\cos\theta_{D_i} = \dfrac{Q \cdot D_i}{|Q|\,|D_i|}$
  - First calculate the dot product: $Q \cdot D_i$
  - Then calculate the product of the vector magnitudes: $|Q|\,|D_i|$
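
COSINE Example, continued
- Dot products: $Q \cdot D_i = \sum_j w_{Q,j}\, w_{i,j}$, where j runs over the terms
  - $Q \cdot D_1 = 0.0310$, $Q \cdot D_2 = 0.4862$, $Q \cdot D_3 = 0.0620$
- Products of magnitudes: $|Q|\,|D_i| = \sqrt{\sum_j w_{Q,j}^2}\,\sqrt{\sum_j w_{i,j}^2}$
  - $|Q|\,|D_1| = 0.3871$, $|Q|\,|D_2| = 0.5896$, $|Q|\,|D_3| = 0.1896$
- Cosine similarity:
  - $\cos\theta_{D_1} = 0.0801$
  - $\cos\theta_{D_2} = 0.8246$
  - $\cos\theta_{D_3} = 0.3271$
- D2 is therefore the closest match to the query, followed by D3 and then D1.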
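The figures above can be checked with a short script. The weight vectors are copied from the TF-IDF table of the VSM example (rounded to four decimals), so the results may differ from the slide values in the last digit; the function name `cosine` is illustrative.

```python
import math

# Term order: a, arrived, damaged, delivery, fire, gold, in, of,
#             silver, shipment, truck
Q = [0, 0, 0, 0, 0, 0.1761, 0, 0, 0.4771, 0, 0.1761]
D = {
    "D1": [0, 0, 0.4771, 0, 0.4771, 0.1761, 0, 0, 0, 0.1761, 0],
    "D2": [0, 0.1761, 0, 0.4771, 0, 0, 0, 0, 0.9542, 0, 0.1761],
    "D3": [0, 0.1761, 0, 0, 0, 0.1761, 0, 0, 0, 0.1761, 0.1761],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def magnitude(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Cosine of the angle between two weight vectors."""
    return dot(u, v) / (magnitude(u) * magnitude(v))


for name, vector in D.items():
    print(name,
          f"dot={dot(Q, vector):.4f}",
          f"magnitudes={magnitude(Q) * magnitude(vector):.4f}",
          f"cosine={cosine(Q, vector):.4f}")
```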

SOC-PMI
- "Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI) is a semantic similarity measure using pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus." [5]
- Deriving the formula involves a fair amount of mathematics.
- The final similarity measure is normalized so that it ranges between 0 and 1.

SOC-PMI with an Example [6]
- A complicated method with many mathematical formulae.
- Target words: W1 = car, W2 = automobile
- m = 70, n = 43
- Assumptions: γ = 3, δ = 0.7, window of 11 words
- β1 = β2 = 24.88
- [Figure: corpus]
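Two of the building blocks behind these numbers can be sketched as follows. The formulas are the PMI and β definitions as given in [6], with m taken to be the number of tokens in the corpus and n the number of word types; the word frequency of 6 used below is a hypothetical value chosen for illustration and is not stated on the slide.

```python
import math


def pmi(bigram_freq, freq_t, freq_w, m):
    """PMI of neighbor t and target w: log2(f_b(t, w) * m / (f_t(t) * f_t(w)))."""
    return math.log2(bigram_freq * m / (freq_t * freq_w))


def beta(freq_w, n, delta):
    """Number of top neighbor words kept for a target word:
    (ln f_t(w))^2 * log2(n) / delta."""
    return (math.log(freq_w) ** 2) * math.log2(n) / delta


m, n = 70, 43           # from the example
delta = 0.7             # the 0.7 parameter from the example (delta in [6])
assumed_frequency = 6   # hypothetical frequency of a target word

print(beta(assumed_frequency, n, delta))
```

With the assumed frequency of 6, `beta` comes out within 0.01 of the 24.88 quoted in the example, which suggests that both target words occur about that often in the example corpus.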

SOC-PMI Example, continued
- [Figure: types and frequencies]
- [Figure: bigram frequencies, and the sets X and Y of words with their PMI values]

APPLICATIONS
- Plagiarism detection: text similarity plays an important role in plagiarism detection.
- Copyright violation: copies of restricted software or data can be detected using text similarities.
- Recommender services

PROTOTYPE
- Aim: find the degree of similarity between files.
- Two steps:
  - Data Retrieval
    - TF-IDF
  - Similarity Measurement
    - Cosine
    - Pearson Correlation
    - Distribution Matrix
    - Co-occurrence

Prototype: Data Retrieval
Steps followed to retrieve data using the TF-IDF scheme:
- SequenceFilesFromDirectory
  - Converts files into sequence files <Text, Text>
- DocumentProcessor
  - Converts the sequence file into <Text, StringTuple>
- DictionaryVectorizer
  - Creates TF vectors <Text, VectorWritable>
  - Creates dfcount <IntWritable, LongWritable>
  - Creates wordcount <Text, LongWritable>
- TFIDFConverter
  - Creates TF-IDF vectors <Text, VectorWritable>

Prototype: Similarity Measurement
- Intermediate step: convert the TF-IDF vectors into a matrix <IntWritable, VectorWritable>.
- Similarity measurement:
  - Distribution Multiplication: Matrix * Matrix′ (the matrix times its transpose)
  - Cosine, Pearson Correlation and Co-occurrence: RowSimilarityJob (similarity classname)
    - SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE
    - SIMILARITY_PEARSON_CORRELATION
    - SIMILARITY_COOCCURRENCE
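To make the three row-similarity measures concrete, here is a plain-Python illustration of what each one is assumed to compute between two rows of the matrix. It is not the prototype's code and does not call RowSimilarityJob; in particular, reading co-occurrence as "number of positions where both rows are non-zero" and Pearson correlation as "cosine of the mean-centered rows" are assumptions.

```python
import math


def cosine(u, v):
    """Uncentered cosine; missing entries are treated as zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(u, v):
    """Pearson correlation, computed as the cosine of mean-centered rows."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


def cooccurrence(u, v):
    """Number of columns in which both rows have a non-zero entry."""
    return sum(1 for a, b in zip(u, v) if a != 0 and b != 0)


def row_similarities(matrix, measure):
    """All-pairs similarity between the rows of a matrix."""
    return [[measure(r1, r2) for r2 in matrix] for r1 in matrix]


# Rows taken from the document-term matrix example (D1, D2).
matrix = [
    [1, 1, 1, 0],
    [1, 0, 1, 2],
]

for measure in (cosine, pearson, cooccurrence):
    print(measure.__name__, row_similarities(matrix, measure))
```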

Prototype: Similarity Measurement
- Cosine
- Pearson Correlation
- Distribution Matrix
- Co-occurrence

SUMMARY
- What text similarity is
- Scope: content similarity
- Steps involved in the process:
  - Data Retrieval
    - TF/IDF
    - Document-Term Matrix
    - VSM
    - LSA
  - Similarity Measurements
    - Cosine Similarity
    - SOC-PMI
- Applications & Prototype

References
[1] Wikipedia: Information retrieval. Wikipedia, the free encyclopedia (2012), http://en.wikipedia.org/wiki/Information_retrieval
[2] Wikipedia: Document-term matrix. Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Document-term_matrix
[3] Wikipedia: Vector space model. Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Vector_space_model
[4] Wikipedia: Latent semantic indexing. Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Latent_semantic_indexing
[5] Wikipedia: Second-order co-occurrence pointwise mutual information. Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Second-order_co-occurrence_pointwise_mutual_information
[6] Islam, A. and Inkpen, D. (2006). Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038.
[7] Dr. E. Garcia. Mi Islita.com, http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html