
NDD Project presentation

Date post: 30-Jun-2015
Upload: ahmedmishfaq
Description:
A near duplicate detection method based on random projection. This presentation gives an overview of the existing categories of NDD methods and introduces WSH (Weighted SimHash). It also presents results comparing the original SimHash with WSH and with a cosine similarity based method.
Transcript
Page 1: NDD Project presentation

1

Weighted Simhash: A Random Projection Approach for detecting near duplicate

documents in large collection

Md Mishfaq AhmedGraduate StudentDepartment of CS

University of Memphis

Page 2: NDD Project presentation

2

Introduction

• Near duplicate documents (NDD): identical in core content but differing in small portions of the document
– Harder to detect than exact duplicates
– Exact duplicates: standard methods exist
– Near duplicates: several approaches exist, but no widely accepted method to identify them

Page 3: NDD Project presentation

3

Near Duplicate: main sources

• News articles
• Web documents (web pages) differing only in advertisements and/or timestamps
– As many as 40% of the pages on the web are duplicates of other pages

Page 4: NDD Project presentation

4

Near Duplicate: main sources

• NDD techniques are also useful for sequences that are not documents (such as DNA sequences)
• Replication for reliability
– In file systems, the main content of an important document is replicated and stored in different places

Page 5: NDD Project presentation

5

Earlier Approaches for NDD

• A naive solution: compare a document with all documents in the collection, word by word
– Prohibitively expensive on large datasets
• Another approach: convert documents into canonical forms until they are exact duplicates
• A more viable approach: approximation and probabilistic methods
– Trade-off: precision and recall ↔ manageable speed

Page 6: NDD Project presentation

6

Earlier Approaches for NDD

Page 7: NDD Project presentation

7

Shingling based methods

• A document d = a sequence of tokens
• Encode d as a set of unique k-grams
– k-gram = a sequence of k contiguous tokens
• Measure the overlap or similarity between two sets of k-grams
• The sum of overlaps or similarities across the entire set gives the similarity between two documents
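The pipeline above can be sketched in a few lines of Python; the tokenizer (whitespace split), k = 3, and the example sentences are illustrative assumptions rather than details from the slides:

```python
def shingles(text, k=3):
    """Return the set of unique k-grams (shingles) of contiguous tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox leaps over the lazy dog"
sim = jaccard(shingles(d1), shingles(d2))
```

With k = 3 the two example sentences share 4 of 10 distinct shingles, giving a Jaccard similarity of 0.4.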

Page 8: NDD Project presentation

9

Projection based methods: SimHash

Example:

– d1: word1+word2+word3 – d2: word1+word4
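A toy sketch of SimHash fingerprinting for the example above; the per-token hash (MD5 here) and the 64-bit fingerprint width are assumptions, and an optional per-token weight is included since the weighted variant discussed later only changes these weights:

```python
import hashlib

def simhash(tokens, bits=64, weights=None):
    """Toy SimHash: sum signed hash bits per token, then take the sign of each sum."""
    v = [0.0] * bits
    for t in tokens:
        w = weights.get(t, 1.0) if weights else 1.0
        h = int(hashlib.md5(t.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += w if (h >> i) & 1 else -w   # +w if the token's hash bit is 1
    # Fingerprint bit i is 1 exactly when the intermediate sum v[i] is positive
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Hamming distance between two fingerprints."""
    return bin(a ^ b).count("1")

f1 = simhash("word1 word2 word3".split())
f2 = simhash("word1 word4".split())
```

Near duplicates share most tokens, so their intermediate sums, and hence their fingerprints, agree in most bit positions.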

Page 9: NDD Project presentation

10

SimHash: Example

• Document d1: word1 + word2 + word3

Page 10: NDD Project presentation

11

SimHash: Example

– Document d2: word1+word4

Page 11: NDD Project presentation

12

Projection based methods: Probabilistic Simhash

• Key observations:
– Projection is already probabilistic
– Bits in a fingerprint are mutually independent
– Intermediate values are ignored while generating fingerprints
• Useful for gaining insight into the volatility of a bit

Page 12: NDD Project presentation

13

Projection based methods: Probabilistic Simhash

• Key observations:

– For another document d that is not a near duplicate of d1, the fingerprint of d is most likely to differ from that of d1 at the bit position whose intermediate value is closest to zero

Page 13: NDD Project presentation

14

Projection based methods: Probabilistic Simhash

• Implementation
– A unique data structure per document to rank bits (or sets of bits) by volatility
• Stores bit positions
– When comparing two fingerprints:
• Compare bits with higher volatility first
• Ensures quicker identification of non-duplicates
• Reduces the number of bit comparisons for non-duplicates
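The ranking idea can be sketched as follows (a hypothetical implementation: the per-document structure is simply the list of bit positions sorted by the absolute value of the intermediate sum, smallest, i.e. most volatile, first):

```python
import hashlib

def fingerprint_with_volatility(tokens, bits=64):
    """Return (fingerprint, bit positions ordered most-volatile first).
    Volatility = closeness of the intermediate sum to zero."""
    v = [0] * bits
    for t in tokens:
        h = int(hashlib.md5(t.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = sum(1 << i for i in range(bits) if v[i] > 0)
    order = sorted(range(bits), key=lambda i: abs(v[i]))  # volatile bits first
    return fp, order

def is_near_duplicate(fp1, fp2, order, k):
    """Compare volatile bits first; stop as soon as mismatches exceed k."""
    mismatches = 0
    for i in order:
        if ((fp1 ^ fp2) >> i) & 1:
            mismatches += 1
            if mismatches > k:
                return False  # early exit: most non-duplicates fail quickly
    return True
```

Because volatile bits are the ones most likely to differ for non-duplicates, checking them first lets the comparison bail out after only a few bit tests in the common case.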

Page 14: NDD Project presentation

15

Projection based methods: Probabilistic Simhash

• Drawback:
– Overhead of an extra data structure per document, in addition to the fingerprint

Page 15: NDD Project presentation

16

Our Approach: Weighted SimHash

• Main idea:
– Terms with higher inverse document frequency (IDF) are better at finding NDD
• Consider two documents D1 and D2, and two terms t1 (high IDF) and t2 (low IDF):
– Case I: both D1 and D2 contain t1
– Case II: both D1 and D2 contain t2
– Case III: neither of them contains t1
– Case IV: neither of them contains t2
• D1, D2 are more likely to be NDD in Case I than in Case II
• D1, D2 are more likely to be NDD in Case IV than in Case III

Page 16: NDD Project presentation

18

Weighted SimHash: Key Steps

Page 17: NDD Project presentation

19

Weighted SimHash

• Generation of fingerprint:
– Terms with higher IDFs contribute more to the sums that form the more significant bits (toward the left end of the fingerprint)
– Terms with lower IDFs contribute more to the sums that form the less significant bits (toward the right end of the fingerprint)
– This increases the chance of mismatches in the leading bits for non-duplicates
– How to achieve this? A multiplication factor

Page 18: NDD Project presentation

20

Weighted SimHash

• Multiplication factor (MF) for term t: mf_t = f(IDF_t, bp), where bp is the bit position

[Figure: multiplication factor (Y axis, 0 to 2.5) vs. bit position from MSB to LSB (X axis), with separate curves for a high-IDF, a mid-IDF, and a low-IDF term]
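The slides leave f unspecified; one plausible sketch is a linear interpolation between an upper and a lower bound, sloping down from MSB to LSB for high-IDF terms and up for low-IDF terms (the bounds lo/hi and the IDF cap idf_max are illustrative assumptions, not values from the slides):

```python
def multiplication_factor(idf, bit_pos, bits=64, lo=0.5, hi=1.5, idf_max=10.0):
    """Hypothetical mf_t = f(IDF_t, bp): interpolate linearly between the value
    at the MSB and the value at the LSB. High-IDF terms start at hi and fall to
    lo; low-IDF terms do the reverse; mid-IDF terms stay roughly flat."""
    frac = bit_pos / (bits - 1)      # 0.0 at the MSB, 1.0 at the LSB
    w = min(idf / idf_max, 1.0)      # normalize IDF into [0, 1]
    at_msb = lo + (hi - lo) * w      # high IDF -> hi at the MSB
    at_lsb = hi - (hi - lo) * w      # high IDF -> lo at the LSB
    return at_msb + (at_lsb - at_msb) * frac
```

Multiplying each term's ±1 contribution by this factor makes high-IDF terms dominate the sums for the leading bits, which is exactly what drives the early mismatches for non-duplicates.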

Page 19: NDD Project presentation

23

Weighted SimHash

• Example (generation of fingerprint):
– Document D2: word1 + word4

Page 20: NDD Project presentation

24

Weighted SimHash

• Example (generation of fingerprint):
– Document D2: word1 + word4

Page 21: NDD Project presentation

25

Weighted SimHash

• Finding near duplicates:
– Compare the fingerprint of the query document with that of each document in the collection
• Start the scan from the most significant (left-most) bit
• Count the number of mismatches
• If the number of mismatches exceeds k (the allowed Hamming distance threshold): no near duplicate; stop the scan and go to the next document
• If the number of mismatches is within the threshold after scanning the entire fingerprint: a near duplicate has been found
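The scan with early termination can be sketched as follows (the function is a hypothetical helper, not code from the slides; k and the fingerprint width are parameters):

```python
def scan_for_near_duplicate(fp_query, fp_doc, bits=64, k=3):
    """Scan from the most significant bit, counting mismatches; stop early
    once the count exceeds the allowed Hamming distance threshold k."""
    mismatches = 0
    for i in range(bits - 1, -1, -1):        # most significant bit first
        if ((fp_query ^ fp_doc) >> i) & 1:
            mismatches += 1
            if mismatches > k:
                return False                  # not a near duplicate; stop scanning
    return True                               # within threshold: near duplicate
```

Since WSH pushes the likely mismatches of non-duplicates toward the leading bits, this left-to-right scan usually terminates after only a few comparisons for non-duplicates.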

Page 22: NDD Project presentation

26

Experiment

• Reuters data set: almost 10k documents
– 10 documents randomly selected
– Each of the 10 documents was very slightly modified (at most two word changes per document) to produce 20 documents per selection
– This yields 200 documents, which we consider near duplicates of their respective selections
– The 10 original documents were then used as source queries

Page 23: NDD Project presentation

27

Experiment: Procedure

Page 24: NDD Project presentation

28

Results

[Figure: recall percentage (Y axis, 0 to 100) vs. k (X axis, 0 to 10) for SimHash and WSH]

Figure: Comparison of percentage recall for all the 20 query documents for the SimHash and Weighted SimHash methods, with k (the Hamming distance threshold) on the X axis.

Page 25: NDD Project presentation

29

Results

[Figure: precision percentage (Y axis, 0 to 100) vs. k (X axis, 0 to 16) for SimHash and WSH]

Figure: Precision comparison between random projection (SimHash) and Weighted SimHash; the X axis shows different values of k. The figure shows no real difference between the two methods in terms of precision.

Page 26: NDD Project presentation

30

Results

Figure: Comparison of average execution time per query (in milliseconds) for each of the methods. For cosine similarity the threshold is 0.95.

• Cosine Similarity (0.95): 4656 ms
• SimHash: 3820 ms
• WSH with MF(0.1, 1.0): 3560 ms
• WSH with MF(0.3, 1.0): 3490 ms
• WSH with MF(0.5, 1.0): 3587 ms
• WSH with MF(0.5, 1.5): 3664 ms
• WSH with MF(1.0, 1.3): 3680 ms
• WSH with MF(1.0, 1.6): 3440 ms
• WSH with MF(1.0, 2.0): 3560 ms

Page 27: NDD Project presentation

31

Limitations of WSH

• Dependence on IDF
– Web search: IDF unknown
– Heuristics can be used:
• e.g., IDF estimated from the first 1000 documents

Page 28: NDD Project presentation

32

Limitations of WSH

• Difficulty setting the lower and upper bounds on the multiplication factor
– May vary from collection to collection

Page 29: NDD Project presentation

33

Conclusion

• Batch processing of a document collection:
– Runtime: WSH is better than SimHash
– Precision and recall: WSH and SimHash are comparable

Page 30: NDD Project presentation

35

Conclusion

• Further work on SimHash:
– How much can the fingerprint be allowed to be altered?

Page 31: NDD Project presentation

36

Thank you

