Near-Duplicate Detection for eRulemaking
Hui Yang, Jamie Callan
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
{huiyang, callan}@cs.cmu.edu
Presentation Outline
- Introduction
- Problem Definition
- System Architecture
- Feature-based Document Retrieval
- Similarity-based Clustering
- Evaluation and Experimental Results
- Related Work
- Conclusion and Demo
Introduction - I
- U.S. regulatory agencies are required to solicit, consider, and respond to public comments before issuing final regulations.
- Some popular regulations attract hundreds of thousands of comments from the general public.
  - In the late 1990s, the USDA's national organic standard drew over 250,000 public comments, which were sorted manually.
  - In 2004, the EPA's proposed Mercury rule (USEPA-OAR-2002-0056) attracted over 530,000 email messages.
- Processing these comments is very labor-intensive.
Introduction - II
- Things have become worse:
  - Many form letters are now available online, written by special interest groups.
  - Modifying an electronic form letter is extremely easy.
- Special interest groups
  - Build electronic advocacy groups when there is a disconnect between broad public opinion and legislative action.
  - Provide information and tools to help each individual have the greatest possible impact once a group is assembled.
  - MoveOn.org, http://www.moveon.org
  - GetActive, http://www.getactive.org
Introduction - III
- Public comments will be near-duplicates if they are created from the same form letter.
- Near-duplicates increase the likelihood of overlooking substantive information that an individual adds to a form letter.
- Goals:
  - Recognize the near-duplicates and organize them
  - Find the information added by an individual
  - Find the unique comments
- Our research focuses on recognizing and organizing near-duplicates by text mining and clustering, and on handling large amounts of data.
Problem Definition - I
- What is a near-duplicate?
  - Pugh: two documents are near-duplicates if they have more than r features in common.
  - Conrad et al.: two documents are near-duplicates if they share more than 80% of their terminology, as defined by human experts.
  - Our definition is based on the ways near-duplicates are created.
Problem Definition - II
- Sources of public comments
  - Written from scratch (unique comments)
  - Based on a form letter (exact or near-duplicates)
- A category-based definition
  - Block edit
  - Key block
  - Minor change
  - Minor change + block edit
  - Exact
Block Edit
Dear Environmental Protection Agency,
The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-control technology.
Sincerely,
Mary10 Kennaday RoadMendham, NJ 07945
________This is an important constituent communication initiated through MoveOn.org.
Dear Environmental Protection Agency,
I urge the EPA to require controls at all power plants to stop mercury pollution. The health of our air and water, and especially our children is by far more important than any political agenda.
The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-controltechnology.
Sincerely,
JohnPO Box 91067Atqasuk, AK 99791
________This is an important constituent communication initiated through MoveOn.org.
Key Block
Dear Environmental Protection Agency,
The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-control technology.
Sincerely,
Mary10 Kennaday RoadMendham, NJ 07945
________This is an important constituent communication initiated through MoveOn.org.
Dear Environmental Protection Agency,
American citizens need to stand up for their rights. Which means the freedom to pursue life, liberty, health, and happiness. Everybody has the right to wake up each morning and breath the freshest air that this green earth can provide us, not what some government organization says that we need to put up with because they want their standards so lax. This is a democracy, by the people, for the people, not what Bush decides because it suits his mood. It concerns me to see so many people that I care about every day trying so hard to live with the mental and neurological problems that they have acquired, or were born with due to mercury poisoning.
The EPA should require power plants to cut mercury pollution by 90% by 008.These reductions are consistent with national standards for other pollutants and achievable through available pollution-control technology.
Sincerely, Rose!
The EPA should require power plants to cut mercury pollution by 90% by2008. These reductions are consistent with national standards for otherpollutants and achievable through available pollution-controltechnology.
Minor Change
July 31, 2004
Dear Administrator Leavitt,
I am writing to urge you to take prompt action to clean up mercury and other toxic air pollution from power plants. EPA's current proposals permit far ore mercury pollution than what the Clean Air Act allows, while at the same time fail to address over sixty other hazardous air pollutants like dioxin.
Sincerely,
Ken12168 Duke Dr NEBlaine, MN 55434-3111
March 31, 2004
Dear Administrator Michael Leavitt,
I am writing to urge you to take prompt action to clean up mercury and other toxic air pollution from the power plants. EPAs current proposals permit far more mercury pollution than what the Clean Air Act allows, while at the same time fail to address over sixty other hazardous air pollutants like dioxin.
Sincerely,
Sandra H
,USA
Minor Change + Block Edit
As someone who cares about protecting the health of children and ourenvironment, I am deeply concerned about the mercury contamination of our lakes and streams. Mercury descends from polluted air into water and then works its way up the food chain. It is especially dangerous to people and wildlife that consume large amounts of fish.I urge you to reconsider your agencys approach and require power plants to reduce their emissions of mercury to the greatest extent possible. his is what the federal law requires, and also what the people and wildlife of this country deserve.
Thank you for your consideration,
jUDY Berk232 Beech HIll rd.Northport, ME 04849
As someone who cares about protecting our wildlife and wild places, I am deeply upset about the mercury contamination of our lakes and streams. Mercury descends from polluted air into water and then works its way up the food chain. It is especially dangerous to people and wildlife that consume large amounts of fish.
Specifically, I am concerned that EPA is proposing a cap-and-trade system to manage mercury emissions. Under such a system, not all plants would have to reduce their harmful missions of mercury and some could even increase! This approach =s unacceptable for dealing with such a toxic pollutant which is precisely why the Clean Air Act does not allow it. I urge you to reconsider your approach and require power plants to reduce heir emissions of mercury to the greatest extent possible. This is what the federal law requires, and also what the people and wildlife of this country deserve.
Thank you for your consideration.
Sincerely,
jamie duval22 manville hill rdcumberland, Rhode Island 02864
Block Reordering
Dear Environmental Protection Agency,
The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-controltechnology.
I urge the EPA to require controls at all power plants to stop mercury pollution. The health of our air and water, and especially our children is by far more important than any political agenda.
Sincerely,
JohnPO Box 91067Atqasuk, AK 99791
Dear Environmental Protection Agency,
I urge the EPA to require controls at all power plants to stop mercury pollution. The health of our air and water, and especially our children is by far more important than any political agenda.
The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-controltechnology.
Sincerely,
JohnPO Box 91067Atqasuk, AK 99791
Exact
The EPA should require power plants to cut mercury pollution by 90% by2008. These reductions are consistent with national standards for otherpollutants and achievable through available pollution-controltechnology.
The EPA should require power plants to cut mercury pollution by 90% by2008. These reductions are consistent with national standards for otherpollutants and achievable through available pollution-controltechnology.
System Architecture
Feature-based Document Retrieval - I
- Goal: get a duplicate candidate set for each seed document
  - Avoids working on the entire dataset
- Steps:
  - Each seed document is broken into chunks.
  - The most informative words are selected from each chunk: a text span around term t*, the term with the minimal document frequency in the chunk.
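The chunking and term-selection steps can be sketched as follows. This is a minimal illustration, not the system's actual code: the whitespace tokenizer, the `chunk_size` and `window` parameters, and the function names are all assumptions made for the example.

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

def informative_spans(seed, df, chunk_size=8, window=2):
    """For each chunk of the seed, return the text span around its rarest term.

    chunk_size and window are hypothetical parameters (in words).
    """
    words = seed.lower().split()
    spans = []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        # t* is the term with minimal document frequency in the chunk;
        # ties go to the earliest position.
        t_star = min(range(len(chunk)), key=lambda i: df.get(chunk[i], 0))
        lo = max(0, t_star - window)
        spans.append(chunk[lo:t_star + window + 1])
    return spans
```

The spans collected from all chunks of a seed can then serve as the content terms of a retrieval query for that seed.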
Feature-based Document Retrieval - II
- Metadata extraction by information extraction (IE):
  - email senders, receivers, signatures, docket IDs, delivery dates, email relayers
Feature-based Document Retrieval - III
Query Formulation
#AND ( docketoar.20020056 router.moveon #OR(standards proposed by will harm thousands unborn children for coal plants should other cleaner alternative by 90 by with national standards available pollution control) )
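A query of the shape shown above could be assembled from the extracted metadata and the selected text spans roughly as follows. This is a sketch only: the function name and argument names are hypothetical, and the operator layout simply copies the example query rather than any documented query-language API.

```python
def build_query(metadata_terms, span_terms):
    """Require all metadata terms and at least one of the span terms.

    The '#AND ( ... #OR(...) )' layout mirrors the example query above.
    """
    required = " ".join(metadata_terms)
    optional = " ".join(span_terms)
    return f"#AND ( {required} #OR({optional}) )"
```

For example, `build_query(["docketoar.20020056", "router.moveon"], ["standards", "proposed"])` yields a query that requires both metadata terms and either of the two content terms.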
Similarity-based Clustering - I
- Document dissimilarity is based on Kullback-Leibler (KL) divergence.
- KL divergence, a distributional similarity measure, is one way to measure the similarity of one document (one unigram distribution) given another.
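As an illustration, the KL divergence between two documents' unigram distributions can be computed as below. The slides do not say how zero probabilities are handled, so the additive smoothing over the joint vocabulary (parameter `alpha`) is an assumption of this sketch, as are the function names.

```python
import math
from collections import Counter

def unigram(text, vocab, alpha=0.01):
    """Smoothed unigram distribution of text over vocab (alpha is assumed)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(doc_p, doc_q, alpha=0.01):
    """KL(P || Q) between the smoothed unigram distributions of two documents."""
    vocab = set(doc_p.lower().split()) | set(doc_q.lower().split())
    p = unigram(doc_p, vocab, alpha)
    q = unigram(doc_q, vocab, alpha)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
```

KL divergence is zero for identical distributions, grows as the distributions diverge, and is not symmetric, so the order of the two documents matters.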
Clustering Algorithm
- Soft, non-hierarchical clustering (partition)
- Single-pass clustering with carefully selected seed documents each time
- Close to K-means
- No need to define K beforehand
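A seed-driven single-pass clustering loop might look like the sketch below. It is only an illustration of the idea: the function names are hypothetical, the fixed `threshold` stands in for the per-cluster cut-off, and the word-overlap distance is a toy stand-in for the KL-based dissimilarity.

```python
def jaccard_distance(a, b):
    """Toy word-overlap distance, used here in place of KL divergence."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def single_pass_cluster(docs, seed_ids, distance, threshold):
    """Build one cluster per seed; the number of clusters is not fixed in advance."""
    clusters = {}
    assigned = set()
    for s in seed_ids:
        if s in assigned:
            continue  # this seed was already absorbed by an earlier cluster
        # Soft assignment: a document may fall within the threshold of several seeds.
        members = [i for i, d in enumerate(docs)
                   if distance(docs[s], d) <= threshold]
        clusters[s] = members
        assigned.update(members)
    unique = [i for i in range(len(docs)) if i not in assigned]
    return clusters, unique
```

Because clusters are created only when an unassigned seed is reached, K emerges from the data rather than being chosen beforehand.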
Similarity-based Clustering - II
[Figure: Dup Candidates, Dup Set 1, Dup Set 2]
Adaptive Thresholding
- The cut-off threshold should be different for different clusters.
- Documents in a cluster are sorted by their document-centroid similarity scores.
- The sorted scores are sampled at 10-document intervals.
- If the scores change by more than 5% of the initial cut-off threshold within an interval, a new cut-off threshold is set at the beginning of that interval.
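One possible reading of this rule is sketched below. It assumes that higher scores mean greater similarity, that scores are sorted in descending order, and that "within an interval" refers to the drop between two consecutive sampled scores; the function and parameter names are hypothetical.

```python
def adaptive_cutoff(scores, initial_cutoff, step=10, max_change=0.05):
    """Return a per-cluster cut-off under the assumptions stated above."""
    ordered = sorted(scores, reverse=True)
    limit = max_change * initial_cutoff
    # Compare the sampled scores at positions 0, step, 2*step, ...
    for start in range(0, len(ordered) - step, step):
        drop = ordered[start] - ordered[start + step]
        if drop > limit:
            # A sharp drop inside this interval: cut at its beginning.
            return ordered[start]
    return initial_cutoff
```

With this reading, a cluster whose scores stay flat keeps the initial cut-off, while a cluster with a sudden drop in similarity is cut just before the drop.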
Does Feature-based Document Retrieval Help?
- It works efficiently for large clusters:
  - It cuts the number of documents to be clustered from the size of the entire dataset to a reasonable number (536,975 -> 10,995 documents).
- It works poorly for small clusters (especially those containing only a single unique document).
- Feature-based retrieval is disabled after most of the big clusters have been found:
  - Most of the remaining unclustered documents are assumed to be unique.
  - Only similarity-based clustering is used on them.
Evaluation Methodology - I
- Clustering performance is difficult to evaluate:
  - There is not enough manpower to produce ground truth for a large dataset.
- Two subsets of 1,000 email messages each were selected randomly from the Mercury dataset.
- Assessors: two graduate research assistants
  - Manually organized the documents into clusters of documents that they felt were near-duplicates
  - Manually went through one of the experimental clustering results pair by pair (comparing document-centroid pairs)
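Evaluation Methodology - II
- Class j vs. cluster i
  - F-measure: $p_{ij} = n_{ij}/n_i$, $r_{ij} = n_{ij}/n_j$, $F_{ij} = \frac{2\,p_{ij}\,r_{ij}}{p_{ij} + r_{ij}}$, $F_j = \max_i \{F_{ij}\}$
  - Purity $= \sum_i \frac{n_i}{n}\,\rho_i$, $\rho_i = \max_j \{p_{ij}\}$, where $n$ is the total number of documents
- Pairwise measures
  - Fowlkes and Mallows index
  - Kappa $= \frac{p(A) - p(E)}{1 - p(E)}$, $p(A) = (a+d)/m$, $p(E) = \frac{(a+b)(a+c) + (c+d)(b+d)}{m^2}$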
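The class-versus-cluster and pairwise measures can be computed from simple counts, as in the sketch below. It uses the standard forms of the F-measure, purity, and Cohen's kappa; the function names and the contingency-table layout (`n[i][j]` = documents in cluster i that belong to class j) are assumptions of the example.

```python
def f_and_purity(n):
    """n[i][j] = number of documents in cluster i that belong to class j."""
    total = sum(map(sum, n))
    cluster_sizes = [sum(row) for row in n]
    class_sizes = [sum(col) for col in zip(*n)]
    f_per_class = []
    for j, nj in enumerate(class_sizes):
        best = 0.0
        for i, ni in enumerate(cluster_sizes):
            if n[i][j] == 0:
                continue
            p, r = n[i][j] / ni, n[i][j] / nj  # precision and recall
            best = max(best, 2 * p * r / (p + r))
        f_per_class.append(best)  # F_j = max_i F_ij
    # Sum_i (n_i / n) * max_j p_ij simplifies to Sum_i max_j n_ij / n.
    purity = sum(max(row) for row in n) / total
    return f_per_class, purity

def kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 agreement table over document pairs."""
    m = a + b + c + d
    p_a = (a + d) / m                                       # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (m * m)  # chance agreement
    return (p_a - p_e) / (1 - p_e)
```

Here `a` and `d` count the pairs on which the two judgments agree (same cluster in both, or different clusters in both), and `b` and `c` count the disagreements.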
Experimental Results - I
(BC = .74)
AB
1
0.14
0.24
0.08
Baseline2 router
0.28
0.31
3
Metadata-based (router+filesizerange+docket)
0.68
0.34
0.42
0.42
0.56
5
0.70
0.40
0.68
Similarity-based clustering
0.72
0.56
7
0.74
0.55
0.71
Experimental Results - II
[Figure: Effects of disabling document retrieval for small clusters. Axes: number of clusters, time (s). Series: disable document retrieval for small clusters; enable document retrieval for small clusters.]
[Figure: Time vs. Number of Clusters. Axes: number of clusters, time (s). Series: min, med, max.]
Conclusion
- Large-volume working set
- Duplicate definition and automatic evaluation
- Feature-based duplicate candidate retrieval
- Similarity-based clustering
- Improved efficiency
Related Work - I
- Duplicate detection in other domains:
  - Databases [Bilenko and Mooney 2003]
    - finding records that refer to the same entity but possibly in different representations
  - Electronic publishing [Brin et al. 1995]
    - detecting plagiarism or identifying different versions of the same document
  - Web search [Chowdhury et al. 2002] [Pugh 2004]
    - more efficient web crawling
    - effective ranking of search results
    - easy archiving of web documents
Related Work - II
- Fingerprinting
  - builds a compact description of a document, then compares document fingerprints pair-wise
- Shingling [Broder et al.]
  - represents a document as a series of simple numeric encodings for an n-term window
  - retains every mth shingle to produce a document sketch
  - super shingles
- Selective fingerprinting [Heintze]
  - selects a subset of the substrings to generate fingerprints
- Statistical approach [Chowdhury et al.]
  - uses n high-idf terms
  - improved accuracy over shingling
  - efficient: one-fifth of the time of shingling
- Fingerprint reliability in a dynamic environment [Conrad et al.]
  - considers the time factor on the Web
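The shingling idea can be illustrated with a short sketch. It follows the wording of the slide (a numeric encoding per n-term window, keeping every mth shingle by position) and is not Broder et al.'s exact procedure; the use of CRC32 as the encoding and the function names are assumptions.

```python
import zlib

def shingles(text, n=3):
    """Encode every n-term window of the text as a number (CRC32 is assumed)."""
    words = text.lower().split()
    return [zlib.crc32(" ".join(words[i:i + n]).encode("utf-8"))
            for i in range(len(words) - n + 1)]

def sketch(text, n=3, m=2):
    """Keep every m-th shingle to form a smaller document sketch."""
    return set(shingles(text, n)[::m])

def resemblance(a, b, n=3, m=1):
    """Overlap between two sketches: |A & B| / |A | B|."""
    sa, sb = sketch(a, n, m), sketch(b, n, m)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Two documents whose sketches overlap heavily are likely near-duplicates, and comparing sketches is much cheaper than comparing full texts.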
References
M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD-2003), Washington D.C., August 2003.
S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), pages 398-409. ACM Press, May 1995.
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of WWW6 '97, pages 391-404. Elsevier Science, April 1997.
J. Callan. E-Rulemaking testbed. http://hartford.lti.cs.cmu.edu/eRulemaking/Data/. 2004.
A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection statistics for fast duplicate document detection. In ACM Transactions on Information Systems (TOIS), Volume 20, Issue 2, 2002.
J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46, 1960.
J. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: Signature reliability in a dynamic retrieval environment. In Proceedings of CIKM '03, pages 443-452. ACM Press, Nov. 2003.
N. Heintze. Scalable document fingerprinting. In Proceedings of the Second USENIX Electronic Commerce Workshop, pages 191-200, Nov. 1996.
W. Pugh. US Patent 6,658,423. http://www.cs.umd.edu/~pugh/google/Duplicates.pdf. 2003.
Demo
http://hartford.lti.cs.cmu.edu/eRulemaking/Data/USEPA-OAR-2002-0056/
Modifying an electronic form letter is extremely easy. Moreover, a modified letter can better represent the sender's opinions or stylistic preferences.
When there is a disconnect between broad public opinion and legislative action, MoveOn builds electronic advocacy groups. Examples of such issues are campaign finance, environmental and energy issues, and media consolidation. Once a group is assembled, MoveOn provides information and tools to help each individual have the greatest possible impact.
Many of the form letter-based comments are near-duplicates that are similar, but not exactly identical, to a (usually unknown) initial form letter, and to one another.
Recognizing, organizing, and considering near-duplicates are significant challenges for the agencies because near-duplicates increase the likelihood of overlooking substantive information (information that by law the agency must consider) that an individual adds to a form letter.
However, it is infeasible to rely on fully human assessment of huge document collections such as public comments in eRulemaking. Moreover, this definition assumes that the original document can be identified easily.
For example, "excerpt" means that one document takes the first section from another article, and "elaboration" means that one document adds one or more paragraphs to another article.
However, in domains such as eRulemaking it may not be possible to tell, just by looking at the documents, which document is the source and which is the modified copy. In order to have a fully automatic assessment of duplicate and near-duplicate detection without human experts involved, we need a quantitative definition of near-duplicates. That is to say, our definition is a guideline for machines.
In the eRulemaking domain, there are two broad categories of documents: comments that are written more or less from scratch (unique), and comments that are based on a form letter (exact-duplicates and near-duplicates). The near-duplicate category can be further divided into subcategories (Table 1). Our subcategories are similar in spirit to those of Conrad, et al., but reflect typical patterns appearing in public comment datasets.
Our duplicate and near-duplicate detection system includes three modules: text preprocessing, feature-based document retrieval, and similarity-based document clustering (Figure 1). The text preprocessing module consists of information extraction (IE), binning by length, and seed document selection. The IE module identifies the email sender, the receiver, address, salutation, and signature lines; docket IDs mentioned in the text; delivery dates; and the email-relaying organizations identified in email headers. These features are used later in duplicate detection. Documents are then binned based on their content lengths (i.e., their lengths after headers, address, salutation, and signature lines are removed). Binning ensures that exact-duplicates are grouped together. Moreover, even before duplicate clustering starts, the system has a rough guess about which duplicate sets are large. These sets are handled first, to improve efficiency.
Finally, to identify the dominating document within a bin, the document that repeats most (i.e., has the most exact duplicates) is selected to be the initial seed in any bin whose size is greater than 100.
In the feature-based document retrieval module, the system begins near-duplicate detection by going through the size-based bins in a round-robin manner. On each pass a document is selected from a bin, and a feature-based query is generated based on fingerprint and metadata. The system then performs probabilistic document retrieval, using a Boolean query, to get candidate duplicates for the document. The similarity-based clustering module measures the similarity of each document in the candidate set to the current seed; similar documents are grouped together. The next section describes the algorithms for feature-based document retrieval and similarity-based clustering.
during preprocessing; the retrieval system indexes them, and supports their use in queries. Since many of the near duplicates are based on form letters created by special interest groups (mass mailers) or are commercial advertising (spam), duplicate documents are likely to share the same features as the seed document.
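The binning and seed-selection steps described in these notes can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: it assumes content length is measured in characters, that headers, salutations, and signature lines have already been stripped, and the names bin_by_length, pick_seed, and bins_largest_first are hypothetical.

```python
from collections import Counter, defaultdict


def bin_by_length(docs):
    """Group document ids by the length of their (already cleaned) body text.

    docs maps a document id to its body. Character count is an assumption;
    the notes only say that documents are binned by content length.
    """
    bins = defaultdict(list)
    for doc_id, body in docs.items():
        bins[len(body)].append(doc_id)
    return bins


def pick_seed(bin_ids, docs, min_bin_size=100):
    """Return the id of the most-repeated document in a large bin.

    Only bins with more than min_bin_size documents get an initial seed,
    mirroring the "greater than 100" rule in the notes. Smaller bins
    return None.
    """
    if len(bin_ids) <= min_bin_size:
        return None
    counts = Counter(docs[d] for d in bin_ids)
    top_body, _ = counts.most_common(1)[0]
    # Return the first document that carries the most frequent body.
    return next(d for d in bin_ids if docs[d] == top_body)


def bins_largest_first(bins):
    """Order bins so that the likely large duplicate sets are handled first."""
    return sorted(bins.items(), key=lambda item: len(item[1]), reverse=True)
```

Because exact duplicates have identical lengths, they always land in the same bin, so counting identical bodies within a bin is enough to find the dominating document.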
Note that this requires an extra step to first store the similarity scores above a relatively loose initial cut-off threshold.
It works fairly efficiently for large clusters, since it usually cuts the number of documents that need to be clustered from the size of the entire dataset to a reasonable number. One of our test collections contains 536,975 public comments; the average size of candidate duplicate sets is 10,995. This reduction is extremely significant for the big clusters. However, for small clusters, especially those containing only a single unique document, it is inefficient to retrieve and look through the candidate set. Therefore, after most of the big clusters have been found, the feature-based retrieval module is disabled, based on the assumption that most of the remaining unclustered documents are unique. Only similarity-based clustering is used on them.
F-measure
In the context of clustering, the quality of a clustering with respect to the gold standard is assessed in terms of precision (p) and recall (r). p and r compare each cluster i with each class j in the ground truth:
\[ p_{ij} = \frac{n_{ij}}{n_i} \tag{5} \]
\[ r_{ij} = \frac{n_{ij}}{n_j} \tag{6} \]
where $n_{ij}$ is the number of documents from class j in cluster i, $n_i$ is the number of documents in cluster i, and $n_j$ is the number of documents in class j.
The F-measure, widely used in the IR community, tries to capture how well the groups of the investigated clustering best match the groups of the ground truth. The F-measure for a cluster i and class j is:
\[ F_{ij} = \frac{2\, p_{ij}\, r_{ij}}{p_{ij} + r_{ij}} \tag{7} \]
The F-measure of the whole clustering set is:
\[ F = \sum_{j} \frac{n_j}{n} F_j \tag{8} \]
where $F_j = \max_i \{F_{ij}\}$ and n is the total number of documents.
Purity
If the purpose of clustering is to classify clusters of documents rather than single documents, the purity of each cluster tries to capture how well, on average, the groups of the investigated clustering match those of the ground truth. The weighted average purity over all clusters is a measure of quality of the whole clustering.
It is defined as:
\[ \rho = \sum_{i} \frac{n_i}{n} \rho_i \tag{9} \]
where $\rho_i = \max_j \{p_{ij}\}$, n is the total number of documents, and $p_{ij}$ is the precision of cluster i with reference to class j.
Pair-wise Measure
One may also define measures based on the distribution of pairs of texts. Pair measures reflect degrees of overlap and are thus not dependent on the direction of the comparison. The Fowlkes and Mallows index [9] is one such pair measure that has been used for cluster validation:
\[ FM = \frac{a}{\sqrt{(a+b)(a+c)}} \tag{10} \]
where a is the number of pairs in the same group in the ground truth and in the clustering (agreement), b is the number of pairs in the same group in the ground truth but in different groups in the clustering (false negative), c is the number of pairs in different groups in the ground truth but in the same group in the clustering (false positive), and d is the number of pairs in different groups in the ground truth and in the clustering (agreement).
Kappa
Finally, we suggest using Cohen's kappa statistic [7] to assess agreement between clustering results relative to two ground truths given by different human assessors. We consider the investigated clustering results as another assessor. The kappa coefficient is used to compare inter-assessor concordances and is defined as:
\[ \kappa = \frac{p(A) - p(E)}{1 - p(E)} \tag{11} \]
where p(A) is the observed agreement between the two assessments. It can be calculated as $(a+d)/m$ by using a, b, c, and d given in Equation 10 and $m = a+b+c+d$. p(E) is the agreement expected by chance. It can be calculated as $\left[(a+b)(a+c) + (c+d)(b+d)\right]/m^2$.
Our idea is that if the concordance between the clustering results given by our system and one of the assessors is as good as or even better than that between two human assessors, our algorithm could be considered effective. In computational linguistics, $\kappa > 0.67$ has often been required to draw any conclusions of agreement at all [5].
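The four evaluation measures (Equations 5 to 11) can be computed with a short sketch such as the one below. It assumes the standard forms of these measures: F_ij = 2pr/(p+r), weighted sums over class and cluster sizes for F and purity, FM = a/sqrt((a+b)(a+c)), and Cohen's chance agreement computed from the marginals of the pair table. The function names and the dictionary-based input format are illustrative choices, not part of the original system.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def f_measure_and_purity(clusters, classes):
    """Return (F, purity). Both arguments map document id -> group label."""
    n = len(classes)
    n_i = Counter(clusters.values())   # cluster sizes
    n_j = Counter(classes.values())    # class sizes
    n_ij = Counter((clusters[doc], classes[doc]) for doc in classes)
    best_f = {j: 0.0 for j in n_j}     # F_j = max_i F_ij
    best_p = {i: 0.0 for i in n_i}     # purity of cluster i = max_j p_ij
    for (i, j), count in n_ij.items():
        p = count / n_i[i]
        r = count / n_j[j]
        best_f[j] = max(best_f[j], 2 * p * r / (p + r))
        best_p[i] = max(best_p[i], p)
    f_total = sum(n_j[j] / n * best_f[j] for j in n_j)
    purity = sum(n_i[i] / n * best_p[i] for i in n_i)
    return f_total, purity


def pair_counts(clusters, classes):
    """Count document pairs: a, b, c, d as defined for the pair-wise measure."""
    a = b = c = d = 0
    for x, y in combinations(list(classes), 2):
        same_truth = classes[x] == classes[y]
        same_cluster = clusters[x] == clusters[y]
        if same_truth and same_cluster:
            a += 1
        elif same_truth:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fowlkes_mallows(a, b, c, d):
    """Pair-wise index; defined as 0 when no pair agrees on 'same group'."""
    return a / sqrt((a + b) * (a + c)) if a else 0.0


def kappa(a, b, c, d):
    """Cohen's kappa on the pair table; returns 1.0 if chance agreement is 1."""
    m = a + b + c + d
    p_a = (a + d) / m
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (m * m)
    return 1.0 if p_e == 1 else (p_a - p_e) / (1 - p_e)
```

Note that pair_counts is quadratic in the number of documents, so for a collection of several hundred thousand comments it would need to be computed from cluster and class size counts instead of by enumerating pairs.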
However, there have been a number of objections to the $\kappa > 0.67$ standard, and less strict evaluations consider $0.20 < \kappa < 0.40$ as indicating fair agreement and $0.40 < \kappa < 0.60$ as indicating moderate agreement.
Early research on duplicate detection was done mostly in the areas of databases [1][16] and electronic publishing [2][11]. In database research, the goal of duplicate detection is to find records referring to the same entity but possibly in different representations. In electronic publishing, duplicate detection is used to detect plagiarism or to identify different versions of the same document. Most recently, duplicate detection has been studied extensively for web search tasks, for example, to enable more efficient web crawling, more effective ranking of search results, and easier archiving of web documents.
A widespread duplicate detection technique is to generate a document fingerprint, which is a compact description of the document, and then to do pair-wise comparison of document fingerprints. The assumption is that fingerprints can be compared much more quickly than complete documents. A common method of generating fingerprints is to select a set of character sequences from a document, and to generate a fingerprint based on the hash values of these sequences. Similarity between two documents is measured by counting the number of common sequences in the complete fingerprints. Different algorithms are characterized, and their computational costs are determined, by the hash function and how substrings are selected.
The dominant approach in duplicate detection is shingling, which was proposed by Broder, et al. [1]. It represents a document as a series of simple numeric encodings for an n-term window. Filtering techniques were used to retain every mth shingle to produce a document sketch. The authors also presented a super-shingle technique that creates meta-sketches to further reduce the computational complexity. Pairs of documents that have a high shingle match coefficient are asserted to be near-duplicates.
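A rough sketch of the shingling idea follows. It is an illustration under stated assumptions, not Broder et al.'s implementation: shingles are windows of n whitespace-separated terms, each shingle is encoded with the first 8 bytes of an MD5 digest, and "retain every mth shingle" is approximated by keeping hashes divisible by m. The function names are hypothetical.

```python
import hashlib


def shingles(text, n=4):
    """Return the set of n-term windows (shingles) of a document."""
    terms = text.lower().split()
    return {" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)}


def shingle_hash(shingle):
    """Encode a shingle as a 64-bit integer (MD5 prefix, an arbitrary choice)."""
    digest = hashlib.md5(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(text, n=4, m=4):
    """Document sketch: keep only the shingle hashes divisible by m."""
    return {h for h in map(shingle_hash, shingles(text, n)) if h % m == 0}


def resemblance(sketch_a, sketch_b):
    """Shingle match coefficient: overlap of two sketches (0.0 if both empty)."""
    union = sketch_a | sketch_b
    return len(sketch_a & sketch_b) / len(union) if union else 0.0
```

Two comments would then be asserted to be near-duplicates when resemblance(sketch(a), sketch(b)) exceeds a chosen threshold. With m=1 the sketch is the full shingle set; larger m trades accuracy for smaller sketches.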
Heintze studied selective document fingerprinting that scales to large environments and that can identify similarities between documents that have as little as 5% in common [10]. To reduce the size of the fingerprints, he selected a subset of the substrings to generate them. He demonstrated that selective fingerprints normally perform within a factor of two of complete fingerprints.
Pugh, from Google, claimed that duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints matches [13].
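The three steps attributed to Pugh can be illustrated with the sketch below. It is only a toy reading of the description above, not the patented method: the "parts" are assumed to be lowercased words, the assignment to lists uses a CRC32 checksum modulo the number of lists, and each populated list is fingerprinted with SHA-1. All names and parameter values are hypothetical.

```python
import hashlib
import zlib


def fingerprints(text, num_lists=4):
    """Return one (list index, digest) fingerprint per populated list."""
    lists = [[] for _ in range(num_lists)]
    # (i) extract parts from the document: here, simply its words
    for word in text.lower().split():
        # (ii) assign each part to one of a predetermined number of lists
        index = zlib.crc32(word.encode("utf-8")) % num_lists
        lists[index].append(word)
    prints = set()
    for index, parts in enumerate(lists):
        if parts:
            # (iii) generate a fingerprint from each populated list
            joined = " ".join(parts).encode("utf-8")
            prints.add((index, hashlib.sha1(joined).hexdigest()))
    return prints


def near_duplicates(text_a, text_b, num_lists=4):
    """Two documents are near-duplicates if any one fingerprint matches."""
    return bool(fingerprints(text_a, num_lists) & fingerprints(text_b, num_lists))
```

Under this reading, an edit that touches a single word changes only the list that word falls into, so the fingerprints of the remaining lists still match.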
Recent duplicate detection research in the Web environment has focused on issues of computational efficiency and detection effectiveness while relying on collection statistics to consistently recognize document replicas in full-text collections [6][14]. Since a term's inverse document frequency (idf) is a measurement of a word's rareness across a given collection, its value measures how well a term discriminates one document from others. Chowdhury, et al., selected the terms best characterizing a document, i.e., the n highest-idf words. Moreover, to filter out possible misspelled tokens and typos, a term is not considered unless it appears at least five times across the collection. They claimed that in addition to improving accuracy over shingling, their approach executes in one-fifth of the time. They showed that a statistically-based approach (i) is more efficient than a fingerprint-based approach, (ii) scales in the number of documents, and (iii) works well for documents of diverse sizes.
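The collection-statistics idea can be sketched as follows. This is a simplified illustration rather than Chowdhury et al.'s algorithm: it assumes whitespace tokenization, uses document frequency as a stand-in for "appears at least five times across the collection", and hashes the selected terms with SHA-1. The function names and default parameters are hypothetical.

```python
import hashlib
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    return df


def idf_fingerprint(text, df, num_docs, n=5, min_count=5):
    """Hash the n highest-idf terms of a document.

    Terms seen in fewer than min_count documents are ignored, which drops
    likely misspellings and typos before the fingerprint is built.
    """
    terms = {t for t in text.lower().split() if df[t] >= min_count}
    # Rank by descending idf; break ties alphabetically for determinism.
    ranked = sorted(terms, key=lambda t: (-math.log(num_docs / df[t]), t))
    selected = sorted(ranked[:n])
    return hashlib.sha1(" ".join(selected).encode("utf-8")).hexdigest()
```

Documents with the same fingerprint are treated as duplicates, so each document needs only one hash lookup instead of pair-wise sketch comparisons, which is where the efficiency gain over shingling comes from.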