
NDD Project presentation

Date post: 30-Jun-2015
Upload: ahmedmishfaq
Description:
A near duplicate detection method based on random projection. This presentation gives an overview of the existing categories of NDD methods and introduces WSH (Weighted SimHash). It also presents results comparing the original SimHash with WSH and with a cosine similarity based method.
Transcript
Page 1: NDD Project presentation

1

Weighted Simhash: A Random Projection Approach for detecting near duplicate

documents in large collection

Md Mishfaq AhmedGraduate StudentDepartment of CS

University of Memphis

Page 2: NDD Project presentation

2

Introduction

• Near duplicate documents (NDD): identical in core content but differing in small portions of the document
– Harder to detect than exact duplicates
– Exact duplicates: standard methods exist
– Near duplicates: several approaches exist, but no widely accepted method to identify them

Page 3: NDD Project presentation

3

Near Duplicate: main sources

• News articles
• Web documents (web pages) differing only in advertisements and/or timestamps
– As many as 40% of the pages on the web are duplicates of other pages

Page 4: NDD Project presentation

4

Near Duplicate: main sources

• NDD techniques are also useful for sequences that are not documents (such as DNA sequences)
• Replication for reliability
– In file systems, the main content of an important document is replicated and stored in different places

Page 5: NDD Project presentation

5

Earlier Approaches for NDD

• A naive solution: compare a document with all documents in the collection, word by word
– Prohibitively expensive on large datasets
• Another approach: convert documents into canonical forms until they are exact duplicates
• A more viable approach: approximation and probabilistic methods
– Trade-off: precision and recall ↔ manageable speed

Page 6: NDD Project presentation

6

Earlier Approaches for NDD

Page 7: NDD Project presentation

7

Shingling based methods

• A document d = a sequence of tokens
• Encode d as a set of unique k-grams
– k-gram = a sequence of k contiguous tokens
• Measure the overlap or similarity between two sets of k-grams
• The sum of overlaps or similarities across the entire set gives the similarity between two documents
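The pipeline above can be sketched in a few lines of Python; the tokenizer (whitespace split), k = 3, and the example sentences are illustrative assumptions rather than details from the slides:

```python
def shingles(text, k=3):
    """Return the set of unique k-grams (shingles) of contiguous tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox leaps over the lazy dog"
sim = jaccard(shingles(d1), shingles(d2))
```

With k = 3 the two example sentences share 4 of 10 distinct shingles, giving a Jaccard similarity of 0.4.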

Page 8: NDD Project presentation

9

Projection based methods: SimHash

Example:

– d1: word1+word2+word3 – d2: word1+word4
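A toy sketch of SimHash fingerprinting for the example above; the per-token hash (MD5 here) and the 64-bit fingerprint width are assumptions, and an optional per-token weight is included since the weighted variant discussed later only changes these weights:

```python
import hashlib

def simhash(tokens, bits=64, weights=None):
    """Toy SimHash: sum signed hash bits per token, then take the sign of each sum."""
    v = [0.0] * bits
    for t in tokens:
        w = weights.get(t, 1.0) if weights else 1.0
        h = int(hashlib.md5(t.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += w if (h >> i) & 1 else -w   # +w if the token's hash bit is 1
    # Fingerprint bit i is 1 exactly when the intermediate sum v[i] is positive
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Hamming distance between two fingerprints."""
    return bin(a ^ b).count("1")

f1 = simhash("word1 word2 word3".split())
f2 = simhash("word1 word4".split())
```

Near duplicates share most tokens, so their intermediate sums, and hence their fingerprints, agree in most bit positions.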

Page 9: NDD Project presentation

10

SimHash: Example

• Document d1: word1 + word2 + word3

Page 10: NDD Project presentation

11

SimHash: Example

– Document d2: word1+word4

Page 11: NDD Project presentation

12

Projection based methods: Probabilistic Simhash

• Key observations:
– Projection is already probabilistic
– Bits in a fingerprint are mutually independent
– Intermediate values are ignored while generating fingerprints
• Useful for gaining insight into the volatility of a bit

Page 12: NDD Project presentation

13

Projection based methods: Probabilistic Simhash

• Key observations:

– For another document d that is not a near duplicate of d1, the fingerprint of d is most likely to differ from that of d1 at the bit position whose intermediate value is closest to zero

Page 13: NDD Project presentation

14

Projection based methods: Probabilistic Simhash

• Implementation
– A unique data structure per document to rank bits (or sets of bits) by volatility
• Stores bit positions
– When comparing two fingerprints:
• Compare bits with higher volatility first
• Ensures quicker identification of non-duplicates
• Reduces the number of bit comparisons for non-duplicates
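The ranking idea can be sketched as follows (a hypothetical implementation: the per-document structure is simply the list of bit positions sorted by the absolute value of the intermediate sum, smallest, i.e. most volatile, first):

```python
import hashlib

def fingerprint_with_volatility(tokens, bits=64):
    """Return (fingerprint, bit positions ordered most-volatile first).
    Volatility = closeness of the intermediate sum to zero."""
    v = [0] * bits
    for t in tokens:
        h = int(hashlib.md5(t.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = sum(1 << i for i in range(bits) if v[i] > 0)
    order = sorted(range(bits), key=lambda i: abs(v[i]))  # volatile bits first
    return fp, order

def is_near_duplicate(fp1, fp2, order, k):
    """Compare volatile bits first; stop as soon as mismatches exceed k."""
    mismatches = 0
    for i in order:
        if ((fp1 ^ fp2) >> i) & 1:
            mismatches += 1
            if mismatches > k:
                return False  # early exit: most non-duplicates fail quickly
    return True
```

Because volatile bits are the ones most likely to differ for non-duplicates, checking them first lets the comparison bail out after only a few bit tests in the common case.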

Page 14: NDD Project presentation

15

Projection based methods: Probabilistic Simhash

• Drawback:
– Overhead of an extra data structure per document, in addition to the fingerprint

Page 15: NDD Project presentation

16

Our Approach: Weighted SimHash

• Main idea:
– Terms with higher inverse document frequency (IDF) are better at finding NDD
• Consider two documents D1 and D2, and two terms t1 (high IDF) and t2 (low IDF):
– Case I: both D1 and D2 contain t1
– Case II: both D1 and D2 contain t2
– Case III: neither of them contains t1
– Case IV: neither of them contains t2
• D1, D2 are more likely to be NDD in Case I than in Case II
• D1, D2 are more likely to be NDD in Case IV than in Case III

Page 16: NDD Project presentation

18

Weighted SimHash: Key Steps

Page 17: NDD Project presentation

19

Weighted SimHash

• Generation of fingerprint:
– Terms with higher IDFs contribute more to the sums that form the more significant bits (toward the left end of the fingerprint)
– Terms with lower IDFs contribute more to the sums that form the less significant bits (toward the right end of the fingerprint)
– This increases the chance of mismatches in the leading bits for non-duplicates
– How to achieve this? A multiplication factor

Page 18: NDD Project presentation

20

Weighted SimHash

• Multiplication factor (MF) for term t: mf_t = f(IDF_t, bp), where bp is the bit position

[Figure: multiplication factor (Y axis, 0 to 2.5) vs. bit position from MSB to LSB (X axis), with separate curves for a high-IDF, a mid-IDF, and a low-IDF term]
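The slides leave f unspecified; one plausible sketch is a linear interpolation between an upper and a lower bound, sloping down from MSB to LSB for high-IDF terms and up for low-IDF terms (the bounds lo/hi and the IDF cap idf_max are illustrative assumptions, not values from the slides):

```python
def multiplication_factor(idf, bit_pos, bits=64, lo=0.5, hi=1.5, idf_max=10.0):
    """Hypothetical mf_t = f(IDF_t, bp): interpolate linearly between the value
    at the MSB and the value at the LSB. High-IDF terms start at hi and fall to
    lo; low-IDF terms do the reverse; mid-IDF terms stay roughly flat."""
    frac = bit_pos / (bits - 1)      # 0.0 at the MSB, 1.0 at the LSB
    w = min(idf / idf_max, 1.0)      # normalize IDF into [0, 1]
    at_msb = lo + (hi - lo) * w      # high IDF -> hi at the MSB
    at_lsb = hi - (hi - lo) * w      # high IDF -> lo at the LSB
    return at_msb + (at_lsb - at_msb) * frac
```

Multiplying each term's ±1 contribution by this factor makes high-IDF terms dominate the sums for the leading bits, which is exactly what drives the early mismatches for non-duplicates.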

Page 19: NDD Project presentation

23

Weighted SimHash

• Example (generation of fingerprint):
– Document D2: word1 + word4

Page 20: NDD Project presentation

24

Weighted SimHash

• Example (generation of fingerprint):
– Document D2: word1 + word4

Page 21: NDD Project presentation

25

Weighted SimHash

• Finding near duplicates:
– Compare the fingerprint of the query document with that of each document in the collection
• Start the scan from the most significant (left-most) bit
• Count the number of mismatches
• If the number of mismatches exceeds k (the allowed Hamming distance threshold): no near duplicate; stop the scan and go to the next document
• If the number of mismatches is within the threshold after scanning the entire fingerprint: a near duplicate has been found
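The scan with early termination can be sketched as follows (the function is a hypothetical helper, not code from the slides; k and the fingerprint width are parameters):

```python
def scan_for_near_duplicate(fp_query, fp_doc, bits=64, k=3):
    """Scan from the most significant bit, counting mismatches; stop early
    once the count exceeds the allowed Hamming distance threshold k."""
    mismatches = 0
    for i in range(bits - 1, -1, -1):        # most significant bit first
        if ((fp_query ^ fp_doc) >> i) & 1:
            mismatches += 1
            if mismatches > k:
                return False                  # not a near duplicate; stop scanning
    return True                               # within threshold: near duplicate
```

Since WSH pushes the likely mismatches of non-duplicates toward the leading bits, this left-to-right scan usually terminates after only a few comparisons for non-duplicates.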

Page 22: NDD Project presentation

26

Experiment

• Reuters data set: almost 10k documents
– 10 documents randomly selected
– Each of the 10 documents was very slightly modified (at most two word changes per document) to produce 20 documents per selection
– This yields 200 documents, which we consider near duplicates of their respective selections
– The 10 original documents were then used as source queries

Page 23: NDD Project presentation

27

Experiment: Procedure

Page 24: NDD Project presentation

28

Results

[Figure: recall percentage (Y axis, 0 to 100) vs. k (X axis, 0 to 10) for SimHash and WSH]

Figure: Comparison of percentage recall for all the 20 query documents for the SimHash and Weighted SimHash methods, with k (the Hamming distance threshold) on the X axis.

Page 25: NDD Project presentation

29

Results

[Figure: precision percentage (Y axis, 0 to 100) vs. k (X axis, 0 to 16) for SimHash and WSH]

Figure: Precision comparison between random projection (SimHash) and Weighted SimHash; the X axis shows different values of k. The figure shows no real difference between the two methods in terms of precision.

Page 26: NDD Project presentation

30

Results

Figure: Comparison of average execution time per query (in milliseconds) for each of the methods. For cosine similarity the threshold is 0.95.

• Cosine Similarity (0.95): 4656 ms
• SimHash: 3820 ms
• WSH with MF(0.1, 1.0): 3560 ms
• WSH with MF(0.3, 1.0): 3490 ms
• WSH with MF(0.5, 1.0): 3587 ms
• WSH with MF(0.5, 1.5): 3664 ms
• WSH with MF(1.0, 1.3): 3680 ms
• WSH with MF(1.0, 1.6): 3440 ms
• WSH with MF(1.0, 2.0): 3560 ms

Page 27: NDD Project presentation

31

Limitations of WSH

• Dependence on IDF
– Web search: IDF unknown
– Heuristics can be used:
• e.g., IDF estimated from the first 1000 documents

Page 28: NDD Project presentation

32

Limitations of WSH

• Difficulty setting the lower and upper bounds on the multiplication factor
– May vary from collection to collection

Page 29: NDD Project presentation

33

Conclusion

• Batch processing of a document collection:
– Runtime: WSH is better than SimHash
– Precision and recall: WSH and SimHash are comparable

Page 30: NDD Project presentation

35

Conclusion

• Further work on SimHash:
– How much can the fingerprint be allowed to be altered?

Page 31: NDD Project presentation

36

Thank you

