Near-Duplicate Detection for eRulemaking Hui Yang, Jamie Callan
Language Technologies Institute School of Computer Science Carnegie
Mellon University {huiyang, callan}@cs.cmu.edu
Slide 3
Presentation Outline Introduction Problem Definition System
Architecture Feature-based Document Retrieval Similarity-based
Clustering Evaluation and Experimental Results Related Work
Conclusion and Demo
Slide 4
Introduction - I U.S. regulatory agencies are required to
solicit, consider, and respond to public comments before issuing
final regulations. Some popular regulations attract hundreds of
thousands of comments from the general public. In the late 1990s,
the USDA's national organic standard required manually sorting over
250,000 public comments. In 2004, the EPA's proposed Mercury rule
(USEPA-OAR-2002-0056) attracted over 530,000 email messages. Very
labor-intensive.
Slide 5
Introduction - II Things become worse now: many online form
letters are available, written by special interest groups, and
modifying an electronic form letter is extremely easy. Special
interest groups build electronic advocacy groups when there is a
disconnect between broad public opinion and legislative action, and
provide information and tools to help each individual have the
greatest possible impact once a group is assembled. MoveOn.org,
http://www.moveon.org; GetActive, http://www.getactive.org
Slide 6
Introduction - III Public comments will be near-duplicates if
created from the same form letter. Near-duplicates increase the
likelihood of overlooking substantive information that an
individual adds to a form letter. Goals: recognizing near-duplicates
and organizing them; finding the information added by an individual;
finding the unique comments. Our research focused on recognizing and
organizing near-duplicates by text mining and clustering, as well as
handling large amounts of data.
Slide 7
Problem Definition - I What is a near-duplicate? Pugh declared
that two documents are near-duplicates if they have more than r
features in common. Conrad et al. stated that two documents are
near-duplicates if they share more than 80% of the terminology
defined by human experts. Our definition is based on the ways
near-duplicates are created.
Slide 8
Problem Definition - II Sources of public comments: from
scratch (unique comments), or based on a form letter (exact or
near-duplicates). A category-based definition: block edit, key
block, minor change, minor change + block edit, block reordering,
exact.
Slide 9
Block Edit
Slide 10
Key Block
Slide 11
Minor Change
Slide 12
Minor Change + Block Edit
Slide 13
Block Reordering
Slide 14
Exact
Slide 15
Presentation Outline Introduction Problem Definition System
Architecture Feature-based Document Retrieval Similarity-based
Clustering Evaluation and Experimental Results Related Work
Conclusion and Demo
Slide 16
System Architecture
Slide 17
Feature-based Document Retrieval - I Goal: get a duplicate
candidate set for each seed document and avoid working on the
entire dataset. Steps: each seed document is broken into chunks;
the most informative words are selected from each chunk, i.e., a
text span around term t*, the term with minimal document frequency
in the chunk.
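The chunk-feature step above can be sketched as follows; the window size and the document-frequency table are illustrative assumptions, not the system's actual parameters:

```python
from typing import Dict, List

def chunk_features(chunk: List[str], df: Dict[str, int], span: int = 3) -> List[str]:
    """Return a text span around t*, the term with minimal document
    frequency in the chunk (unseen terms count as rarest)."""
    # Position of the term with minimal document frequency.
    star = min(range(len(chunk)), key=lambda i: df.get(chunk[i], 0))
    # Keep a window of `span` terms on each side of t*.
    lo, hi = max(0, star - span), min(len(chunk), star + span + 1)
    return chunk[lo:hi]

chunk = "the proposed mercury rule will harm unborn children".split()
df = {"the": 1000, "proposed": 120, "mercury": 8, "rule": 300,
      "will": 900, "harm": 60, "unborn": 15, "children": 200}
print(chunk_features(chunk, df))
```

Here "mercury" has the lowest document frequency, so the selected span is centered on it.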
Slide 18
Feature-based Document Retrieval - II Metadata extraction by
information extraction: email senders, receivers, signatures,
docket IDs, delivery dates, email relayers.
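A minimal sketch of the extraction step, assuming docket IDs follow the AGENCY-OFFICE-YYYY-NNNN pattern seen in the Mercury docket and senders appear on a From: header line (real comments vary more than this allows):

```python
import re

# Assumed formats, for illustration only.
DOCKET = re.compile(r"\b[A-Z]+-[A-Z]+-\d{4}-\d{4}\b")   # e.g. USEPA-OAR-2002-0056
SENDER = re.compile(r"^From:\s*(\S+@\S+)", re.MULTILINE)

msg = """From: jane.doe@example.org
Re: Docket USEPA-OAR-2002-0056
Please do not weaken the proposed mercury standards."""

print(DOCKET.findall(msg), SENDER.findall(msg))
```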
Slide 19
Feature-based Document Retrieval - III Query formulation:
#AND( docketoar.20020056 router.moveon
  #OR( standards proposed by will harm thousands unborn children
       for coal plants should other cleaner alternative by 90 by
       with national standards available pollution control ) )
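Assembling the structured query from the extracted metadata and chunk features might look like this; the #AND/#OR operators follow the INQUERY-style syntax shown above, and the helper name is hypothetical:

```python
def build_query(metadata, features):
    """Require every metadata term (#AND) and allow any chunk-feature
    span to match (#OR), INQUERY-style."""
    return "#AND( {} #OR({}) )".format(" ".join(metadata), " ".join(features))

q = build_query(["docketoar.20020056", "router.moveon"],
                ["standards proposed by", "harm thousands unborn children"])
print(q)
```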
Slide 20
Slide 21
Presentation Outline Introduction Problem Definition System
Architecture Feature-based Document Retrieval Similarity-based
Clustering Evaluation and Experimental Results Related Work
Conclusion and Demo
Slide 22
Similarity-based Clustering - I Document dissimilarity is based
on Kullback-Leibler (KL) divergence. KL-divergence, a distributional
similarity measure, is one way to measure the similarity of one
document (one unigram distribution) given another. Clustering
algorithm: soft, non-hierarchical clustering (partitioning);
single-pass clustering with carefully selected seed documents each
time; close to K-Means, but with no need to define K beforehand.
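A sketch of the dissimilarity measure, using add-alpha smoothing over the joint vocabulary (the smoothing scheme and constant are assumptions, not the paper's exact estimator):

```python
import math
from collections import Counter

def kl_divergence(doc_p, doc_q, alpha=0.01):
    """KL(P || Q) between smoothed unigram models of two documents:
    0 for identical texts, larger as doc_q explains doc_p worse.
    Add-alpha smoothing is an assumption."""
    p, q = Counter(doc_p.split()), Counter(doc_q.split())
    vocab = set(p) | set(q)
    n_p, n_q, v = sum(p.values()), sum(q.values()), len(vocab)
    kl = 0.0
    for w in vocab:
        pw = (p[w] + alpha) / (n_p + alpha * v)
        qw = (q[w] + alpha) / (n_q + alpha * v)
        kl += pw * math.log(pw / qw)
    return kl

a = "please withdraw the proposed mercury rule"
print(kl_divergence(a, a), kl_divergence(a, "completely different words here"))
```

Note that KL-divergence is asymmetric, which is why the slide phrases it as the similarity of one document given another.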
Slide 23
Similarity-based Clustering - II (figure: duplicate candidates
partitioned into Dup Set 1 and Dup Set 2)
Slide 24
Adaptive Thresholding The cut-off threshold for different
clusters should be different. Documents in a cluster are sorted by
their document-centroid similarity scores, and the sorted scores
are sampled at 10-document intervals. If the score drops by more
than 5% of the initial cut-off threshold within an interval, a new
cut-off threshold is set at the beginning of the interval.
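One reading of the adaptive-thresholding rule, interpreting "5% of the initial cut-off threshold within an interval" as the score drop across a 10-document interval (an assumption about the slide's intent):

```python
def adaptive_threshold(scores, init_cutoff, step=10, frac=0.05):
    """scores: document-centroid similarities, sorted descending.
    Walk the list in `step`-document intervals; at the first interval
    whose score drop exceeds frac * init_cutoff, move the cut-off to
    the score at the start of that interval."""
    for start in range(0, len(scores) - step, step):
        if scores[start] - scores[start + step] > frac * init_cutoff:
            return scores[start]      # new, cluster-specific cut-off
    return init_cutoff                # no sharp drop: keep the default

# 20 near-identical form letters, then a sharp drop to unrelated docs.
scores = [0.99] * 10 + [0.98] * 10 + [0.70, 0.69, 0.68, 0.67]
print(adaptive_threshold(scores, init_cutoff=0.9))
```

The sharp drop after the 20th document moves the cut-off up to 0.98, keeping the form letters and excluding the unrelated tail.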
Slide 25
Does Feature-based Document Retrieval Help? It works fairly
efficiently for large clusters: it cuts the number of documents
that need to be clustered from the size of the entire dataset to a
reasonable number (536,975 -> 10,995 documents). It is bad for
small clusters, especially those containing only a single unique
document. Feature-based retrieval is therefore disabled after most
of the big clusters have been found; we assume that most of the
remaining unclustered documents are unique, and only
similarity-based clustering is used on them.
Slide 26
Presentation Outline Introduction Problem Definition System
Architecture Feature-based Document Retrieval Similarity-based
Clustering Evaluation and Experimental Results Related Work
Conclusion and Demo
Slide 27
Evaluation Methodology - I It is difficult to evaluate
clustering performance due to the lack of manpower to produce
ground truth for a large dataset. Two subsets of 1,000 email
messages each were selected randomly from the Mercury dataset.
Assessors: two graduate research assistants. They manually
organized the documents into clusters of documents that they felt
were near-duplicates, and manually went through one of the
experimental clustering results pair by pair (comparing
document-centroid pairs).
Slide 28
Evaluation Methodology - II Class j vs. cluster i. Let n_ij be
the number of documents of class j in cluster i, n_i the size of
cluster i, n_j the size of class j, and n the total number of
documents.
F-measure: p_ij = n_ij / n_i, r_ij = n_ij / n_j,
F_ij = 2 p_ij r_ij / (p_ij + r_ij), F = sum_j (n_j / n) F_j,
where F_j = max_i {F_ij}
Purity = sum_i (n_i / n) p_i, where p_i = max_j {p_ij}
Pairwise measures (a = document pairs clustered together in both
gold standard and system output, b and c = pairs together in
exactly one, d = pairs together in neither, m = a + b + c + d):
Folkes and Mallows index = a / sqrt((a + b)(a + c));
Kappa = (p(A) - p(E)) / (1 - p(E)), where p(A) = (a + d) / m and
p(E) = ((a + b)(a + c) + (c + d)(b + d)) / m^2
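The pairwise measures can be computed from a 2x2 contingency table of document pairs; a minimal sketch using the standard definitions (Cohen 1960 for kappa, geometric mean of pairwise precision and recall for Folkes and Mallows):

```python
import math

def kappa(a, b, c, d):
    """Cohen's kappa from pairwise counts: a = pairs together in both
    the gold standard and the system clustering, b/c = together in
    exactly one, d = together in neither; m = total pairs."""
    m = a + b + c + d
    p_a = (a + d) / m
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (m * m)
    return (p_a - p_e) / (1 - p_e)

def folkes_mallows(a, b, c):
    """Geometric mean of pairwise precision a/(a+b) and recall a/(a+c)."""
    return a / math.sqrt((a + b) * (a + c))

print(kappa(40, 0, 0, 60), folkes_mallows(40, 10, 10))
```

Perfect agreement (b = c = 0) gives kappa = 1; chance-level agreement gives values near 0.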
Slide 29
Experimental Results - I
Slide 30
Experimental Results - II
Slide 31
Conclusion Large-volume working set; duplicate definition and
automatic evaluation; feature-based duplicate candidate retrieval;
similarity-based clustering; improved efficiency.
Slide 32
Related Work - I Duplicate detection in other domains:
databases [Bilenko and Mooney 2003], to find records referring to
the same entity but possibly in different representations;
electronic publishing [Brin et al. 1995], to detect plagiarism or
to identify different versions of the same document; web search
[Chowdhury et al. 2002] [Pugh 2004], for more efficient web
crawling, effective search-result ranking, and easy archiving of
web documents.
Slide 33
Related Work - II Fingerprinting: compute a compact description
of a document, then do pairwise comparison of document
fingerprints. Shingling [Broder et al.] represents a document as a
series of simple numeric encodings for each n-term window, retains
every mth shingle to produce a document sketch, and can combine
shingles into super-shingles. Selective fingerprinting [Heintze]
selects a subset of the substrings to generate fingerprints. The
statistical approach [Chowdhury et al.] uses n high-idf terms; it
improved accuracy over shingling and is efficient, taking one-fifth
of the time of shingling. Fingerprint reliability in a dynamic
environment [Conrad et al.] considers the time factor on the Web.
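A toy version of the Broder-style shingling sketch described above (crc32 stands in for the shingle hash; the n and m parameters are illustrative):

```python
from zlib import crc32

def sketch(text, n=4, m=5):
    """Hash every n-term window (shingle) and retain only shingles
    whose hash is 0 mod m, giving a compact document sketch."""
    terms = text.lower().split()
    shingles = {crc32(" ".join(terms[i:i + n]).encode())
                for i in range(len(terms) - n + 1)}
    return {s for s in shingles if s % m == 0}

def resemblance(a, b):
    """Jaccard overlap of the two sketches; near-duplicates score near 1."""
    sa, sb = sketch(a), sketch(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0
```

Because only every mth shingle is retained, sketches stay small enough to compare pairwise across a large collection.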
Slide 34
References M. Bilenko and R. Mooney. Adaptive duplicate
detection using learnable string similarity measures. In
Proceedings of the 9th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (SIGKDD-2003), Washington D.C.,
August 2003. S. Brin, J. Davis, and H. Garcia-Molina. Copy
detection mechanisms for digital documents. In Proceedings of the
Special Interest Group on Management of Data (SIGMOD 1995), pages
398-409. ACM Press, May 1995. A. Z. Broder, S. C. Glassman, M. S.
Manasse, and G. Zweig. Syntactic clustering of the web. In
Proceedings of WWW6, pages 391-404. Elsevier Science, April 1997.
J. Callan. E-Rulemaking testbed.
http://hartford.lti.cs.cmu.edu/eRulemaking/Data/, 2004.
A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection
statistics for fast duplicate document detection. ACM Transactions
on Information Systems (TOIS), Volume 20, Issue 2, 2002. J. Cohen.
A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 20, 37-46, 1960. J. Conrad, X. S. Guo,
and C. P. Schriber. Online duplicate document detection: Signature
reliability in a dynamic retrieval environment. In Proceedings of
CIKM '03, pages 443-452. ACM Press, Nov. 2003. N. Heintze. Scalable
document fingerprinting. In Proceedings of the Second USENIX
Electronic Commerce Workshop, pages 191-200, Nov. 1996. W. Pugh. US
Patent 6,658,423. http://www.cs.umd.edu/~pugh/google/Duplicates.pdf,
2003.