Near-Duplicate Detection for eRulemaking

Hui Yang, Jamie Callan
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
{huiyang, callan}@cs.cmu.edu

Presentation Outline
- Introduction
- Problem Definition
- System Architecture
- Feature-based Document Retrieval
- Similarity-based Clustering
- Evaluation and Experimental Results
- Related Work
- Conclusion and Demo

Introduction - I
- U.S. regulatory agencies are required to solicit, consider, and respond to public comments before issuing final regulations.
- Some popular regulations attract hundreds of thousands of comments from the general public.
  - In the late 1990s, over 250,000 public comments on the USDA's national organic standard had to be sorted manually.
  - In 2004, the EPA's proposed Mercury rule (USEPA-OAR-2002-0056) attracted over 530,000 email messages.
- Processing these comments is very labor-intensive.

Introduction - II
- Things become worse now
  - Many online form letters are available, written by special interest groups.
  - Modifying an electronic form letter is extremely easy.
- Special interest groups
  - Build electronic advocacy groups when there is a disconnect between broad public opinion and legislative action.
  - Provide information and tools to help each individual have the greatest possible impact once a group is assembled.
  - MoveOn.org, http://www.moveon.org
  - GetActive, http://www.getactive.org

Introduction - III
- Public comments will be near-duplicates if they are created from the same form letter.
- Near-duplicates increase the likelihood of overlooking substantive information that an individual adds to a form letter.
- Goals:
  - Recognizing the near-duplicates and organizing them
  - Finding the information added by an individual
  - Finding the unique comments
- Our research focuses on recognizing and organizing near-duplicates by text mining and clustering, as well as on handling large amounts of data.

Problem Definition - I
- What is a near-duplicate?
  - Pugh declared that two documents are considered near-duplicates if they have more than r features in common.
  - Conrad et al. stated that two documents are near-duplicates if they share more than 80% of their terminology, as defined by human experts.
  - Our definition is based on the ways in which near-duplicates are created.

Problem Definition - II
- Sources of public comments
  - Written from scratch (unique comments)
  - Based on a form letter (exact or near-duplicates)
- A category-based definition
  - Block edit
  - Key block
  - Minor change
  - Minor change + block edit
  - Exact

Block Edit

Dear Environmental Protection Agency,

The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-control technology.

Sincerely,

Mary10 Kennaday RoadMendham, NJ 07945

________This is an important constituent communication initiated through MoveOn.org.


I urge the EPA to require controls at all power plants to stop mercury pollution. The health of our air and water, and especially our children is by far more important than any political agenda.

The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-control technology.

Sincerely,

JohnPO Box 91067Atqasuk, AK 99791


Key Block


The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-control technology.

Sincerely,

Mary10 Kennaday RoadMendham, NJ 07945



American citizens need to stand up for their rights. Which means the freedom to pursue life, liberty, health, and happiness. Everybody has the right to wake up each morning and breath the freshest air that this green earth can provide us, not what some government organization says that we need to put up with because they want their standards so lax. This is a democracy, by the people, for the people, not what Bush decides because it suits his mood. It concerns me to see so many people that I care about every day trying so hard to live with the mental and neurological problems that they have acquired, or were born with due to mercury poisoning.

The EPA should require power plants to cut mercury pollution by 90% by 008.These reductions are consistent with national standards for other pollutants and achievable through available pollution-control technology.

Sincerely, Rose!

The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-control technology.

Minor Change

July 31, 2004

Dear Administrator Leavitt,

I am writing to urge you to take prompt action to clean up mercury and other toxic air pollution from power plants. EPA's current proposals permit far ore mercury pollution than what the Clean Air Act allows, while at the same time fail to address over sixty other hazardous air pollutants like dioxin.

Sincerely,

Ken12168 Duke Dr NEBlaine, MN 55434-3111

March 31, 2004

Dear Administrator Michael Leavitt,

I am writing to urge you to take prompt action to clean up mercury and other toxic air pollution from the power plants. EPAs current proposals permit far more mercury pollution than what the Clean Air Act allows, while at the same time fail to address over sixty other hazardous air pollutants like dioxin.

Sincerely,

Sandra H

,USA

Minor Change + Block Edit

As someone who cares about protecting the health of children and ourenvironment, I am deeply concerned about the mercury contamination of our lakes and streams. Mercury descends from polluted air into water and then works its way up the food chain. It is especially dangerous to people and wildlife that consume large amounts of fish.I urge you to reconsider your agencys approach and require power plants to reduce their emissions of mercury to the greatest extent possible. his is what the federal law requires, and also what the people and wildlife of this country deserve.

Thank you for your consideration,

jUDY Berk232 Beech HIll rd.Northport, ME 04849

As someone who cares about protecting our wildlife and wild places, I am deeply upset about the mercury contamination of our lakes and streams. Mercury descends from polluted air into water and then works its way up the food chain. It is especially dangerous to people and wildlife that consume large amounts of fish.

Specifically, I am concerned that EPA is proposing a cap-and-trade system to manage mercury emissions. Under such a system, not all plants would have to reduce their harmful missions of mercury and some could even increase! This approach =s unacceptable for dealing with such a toxic pollutant which is precisely why the Clean Air Act does not allow it. I urge you to reconsider your approach and require power plants to reduce heir emissions of mercury to the greatest extent possible. This is what the federal law requires, and also what the people and wildlife of this country deserve.

Thank you for your consideration.

Sincerely,

jamie duval22 manville hill rdcumberland, Rhode Island 02864

Block Reordering




Sincerely,





Sincerely,


Exact



System Architecture

Feature-based Document Retrieval - I
- Goal: get a duplicate candidate set for each seed, to avoid working on the entire dataset.
- Steps:
  - Each seed document is broken into chunks.
  - The most informative words are selected from each chunk: a text span around term t*, the term with minimal document frequency in the chunk.
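The chunk-and-select step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the fixed-size chunking, the `window` width, and the toy document-frequency table are all assumptions made for the example.

```python
def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def select_features(tokens, doc_freq, chunk_size=20, window=2):
    """For each chunk, return the text span around t*, the rarest term in the chunk."""
    spans = []
    for chunk in chunk_tokens(tokens, chunk_size):
        # t* is the term with minimal document frequency in the chunk.
        # Terms missing from doc_freq are treated as df = 0 (an assumption).
        best = min(range(len(chunk)), key=lambda i: doc_freq.get(chunk[i], 0))
        lo = max(0, best - window)
        hi = min(len(chunk), best + window + 1)
        spans.append(chunk[lo:hi])
    return spans


# Toy document frequencies (hypothetical numbers).
doc_freq = {"the": 1000, "epa": 400, "should": 700, "require": 300,
            "power": 350, "plants": 320, "to": 990, "cut": 200,
            "mercury": 150, "pollution": 180, "by": 900, "90": 60}
tokens = "the epa should require power plants to cut mercury pollution by 90".split()
spans = select_features(tokens, doc_freq, chunk_size=6, window=1)
print(spans)
```

With these toy numbers, the first chunk yields the span around "require" and the second the span around "90"; the selected spans would then feed the query for candidate retrieval.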

Feature-based Document Retrieval - II
- Metadata extraction by information extraction: email senders, receivers, signatures, docket IDs, delivery dates, and email relayers.

Feature-based Document Retrieval - III
Query Formulation

#AND ( docketoar.20020056 router.moveon #OR(standards proposed by will harm thousands unborn children for coal plants should other cleaner alternative by 90 by with national standards available pollution control) )
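A query of this shape could be assembled as in the sketch below. The `#AND` and `#OR` operators and the metadata term spellings are copied from the example query; the helper name `build_query` and the choice to require every metadata term while OR-ing the fingerprint terms are assumptions drawn from that one example, not a documented interface of the retrieval engine.

```python
def build_query(metadata_terms, fingerprint_terms):
    """Combine required metadata terms with an OR-ed bag of fingerprint terms."""
    required = " ".join(metadata_terms)
    optional = " ".join(fingerprint_terms)
    return f"#AND ( {required} #OR({optional}) )"


query = build_query(
    ["docketoar.20020056", "router.moveon"],                     # metadata features
    ["national", "standards", "available", "pollution", "control"],  # fingerprint
)
print(query)
```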

Similarity-based Clustering - I
- Document dissimilarity is based on Kullback-Leibler (KL) divergence.
- KL divergence, a distributional similarity measure, is one way to measure the similarity of one document (one unigram distribution) given another.
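As a rough illustration, KL divergence between two unigram distributions can be computed as below. The additive smoothing with `alpha` is an assumption added here so that no word has zero probability; the slides do not say how the distributions were smoothed.

```python
import math
from collections import Counter


def unigram_dist(tokens, vocab, alpha=0.01):
    """Additively smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) = sum over w of p(w) * log(p(w) / q(w))."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


doc_a = "epa should cut mercury pollution".split()
doc_b = "epa should cut mercury pollution now".split()   # near-duplicate of doc_a
doc_c = "please protect organic food standards".split()  # unrelated comment
vocab = set(doc_a) | set(doc_b) | set(doc_c)

p_a, p_b, p_c = (unigram_dist(d, vocab) for d in (doc_a, doc_b, doc_c))
print(kl_divergence(p_a, p_b), kl_divergence(p_a, p_c))
```

The near-duplicate pair gives a much smaller divergence than the unrelated pair. Note that KL divergence is asymmetric, so the direction (document given centroid, or the reverse) has to be fixed consistently.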

Clustering Algorithm
- Soft, non-hierarchical clustering (partition)
- Single-pass clustering with carefully selected seed documents each time
- Close to K-means
- No need to define K beforehand
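A seed-driven single-pass clustering loop might look like the sketch below. It is only an illustration of the idea: the Jaccard distance stands in for the KL-based dissimilarity to keep the example short, and the function names, the fixed threshold, and the rule for skipping already-clustered seeds are assumptions.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| on token sets (a stand-in for KL-based dissimilarity)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def single_pass_cluster(docs, seeds, dissim, threshold):
    """Grow one cluster per unclustered seed; the number of clusters is not fixed in advance."""
    clusters = []
    assigned = set()
    for seed in seeds:
        if seed in assigned:
            continue  # this seed was already absorbed by an earlier cluster
        # A document may join more than one cluster, which makes the clustering soft.
        members = [d for d in docs if dissim(docs[seed], docs[d]) <= threshold]
        clusters.append(members)
        assigned.update(members)
    return clusters


docs = {
    "d1": {"cut", "mercury", "pollution"},
    "d2": {"cut", "mercury", "pollution", "now"},
    "d3": {"organic", "food", "standards"},
}
clusters = single_pass_cluster(docs, ["d1", "d2", "d3"], jaccard_distance, threshold=0.5)
print(clusters)
```

Unlike K-means, nothing here requires K: a new cluster appears whenever a seed is not yet covered by an existing one.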

Similarity-based Clustering - II
[Figure labels: Dup Candidates, Dup Set 1, Dup Set 2]

Adaptive Thresholding
- The cut-off threshold should differ from cluster to cluster.
- Documents in a cluster are sorted by their document-centroid similarity scores.
- The sorted scores are sampled at 10-document intervals.
- If the score changes by more than 5% of the initial cut-off threshold within an interval, a new cut-off threshold is set at the beginning of that interval.
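One possible reading of this rule is sketched below. The interpretation is an assumption: scores are similarities sorted from high to low, a "jump" is the drop between two consecutive samples taken 10 documents apart, and the new cut-off is the score at the start of the first interval whose drop exceeds 5% of the initial cut-off. The function name and parameters are hypothetical.

```python
def adaptive_cutoff(scores, initial_cutoff, step=10, jump=0.05):
    """Return a per-cluster cut-off based on the first large drop in sorted scores."""
    # Only scores above the loose initial cut-off are considered.
    ranked = sorted((s for s in scores if s >= initial_cutoff), reverse=True)
    for start in range(0, len(ranked) - step, step):
        drop = ranked[start] - ranked[start + step]
        if drop > jump * initial_cutoff:
            return ranked[start]  # new cut-off at the beginning of the interval
    return initial_cutoff  # no large drop found: keep the initial cut-off


# 12 tightly matching documents followed by 10 loosely matching ones.
scores = [0.90] * 12 + [0.60] * 10
cutoff = adaptive_cutoff(scores, initial_cutoff=0.5)
kept = [s for s in scores if s >= cutoff]
print(cutoff, len(kept))
```

In this toy case the cut-off rises from 0.5 to 0.90, so only the 12 tight matches stay in the cluster.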

Does Feature-based Document Retrieval Help?
- It works fairly efficiently for large clusters: it cuts the number of documents that need to be clustered from the size of the entire dataset to a reasonable number (536,975 -> 10,995 documents).
- It is bad for small clusters, especially those containing only a single unique document.
- Feature-based retrieval is disabled after most of the big clusters have been found.
  - We assume that most of the remaining unclustered documents are unique.
  - Only similarity-based clustering is used on them.

Evaluation Methodology - I
- Clustering performance is difficult to evaluate: there is not enough manpower to produce ground truth for a large dataset.
- Two subsets of 1,000 email messages each were selected randomly from the Mercury dataset.
- Assessors: two graduate research assistants
  - Manually organized the documents into clusters of documents that they felt were near-duplicates
  - Manually went through one of the experimental clustering results pair by pair (comparing document-centroid pairs)

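Evaluation Methodology - II
- Class j vs. cluster i
- F-measure: $p_{ij} = n_{ij}/n_i$, $r_{ij} = n_{ij}/n_j$, $F_{ij} = \frac{2\,p_{ij}\,r_{ij}}{p_{ij} + r_{ij}}$, $F_j = \max_i \{F_{ij}\}$
- Purity $= \sum_i \frac{n_i}{n}\,\rho_i$, $\rho_i = \max_j \{p_{ij}\}$
- Pairwise measures
  - Fowlkes and Mallows index $= \frac{a}{\sqrt{(a+b)(a+c)}}$
  - Kappa $= \frac{p(A) - p(E)}{1 - p(E)}$, $p(A) = (a+d)/m$, $p(E) = \frac{(a+b)(a+c) + (b+d)(c+d)}{m^2}$
- Here $n_{ij}$ is the number of documents of class j in cluster i, and n is the total number of documents. Over all m document pairs, a and d count the pairs on which the two groupings agree (together in both, apart in both), and b and c count the pairs on which they disagree.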
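The measures named on this slide can be computed as in the following sketch, which uses the standard textbook definitions of F-measure, purity, the Fowlkes and Mallows index, and Cohen's kappa. The pair counts are assumed to follow the usual 2x2 contingency table: a = pairs together in both groupings, b = together in the classes only, c = together in the clusters only, d = apart in both. The function names and toy labels are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def f_and_purity(classes, clusters):
    """Per-class best F-measure and overall purity for a clustering."""
    n = len(classes)
    n_i = Counter(clusters)                 # cluster sizes
    n_j = Counter(classes)                  # class sizes
    n_ij = Counter(zip(clusters, classes))  # class j members inside cluster i
    best_f, best_p = {}, {}
    for (i, j), nij in n_ij.items():
        p, r = nij / n_i[i], nij / n_j[j]
        best_f[j] = max(best_f.get(j, 0.0), 2 * p * r / (p + r))
        best_p[i] = max(best_p.get(i, 0.0), p)
    purity = sum(n_i[i] / n * best_p[i] for i in n_i)
    return best_f, purity


def pairwise_agreement(classes, clusters):
    """Fowlkes-Mallows index and kappa over all document pairs."""
    a = b = c = d = 0
    for x, y in combinations(range(len(classes)), 2):
        same_class = classes[x] == classes[y]
        same_cluster = clusters[x] == clusters[y]
        if same_class and same_cluster:
            a += 1
        elif same_class:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    m = a + b + c + d
    fm = a / math.sqrt((a + b) * (a + c))
    p_a = (a + d) / m
    p_e = ((a + b) * (a + c) + (b + d) * (c + d)) / m ** 2
    return fm, (p_a - p_e) / (1 - p_e)


classes = ["x", "x", "x", "y", "y"]   # assessor labels (toy data)
clusters = [0, 0, 1, 1, 1]            # system output (toy data)
best_f, purity = f_and_purity(classes, clusters)
fm, kappa = pairwise_agreement(classes, clusters)
print(best_f, purity, fm, kappa)
```

The sketch does not guard against degenerate inputs (for example, no pair sharing a class), where the pairwise measures are undefined.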

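Experimental Results - I

[Table; the cell layout was lost in extraction. Labels and values in extracted order: (BC = .74); AB; 1; 0.14; 0.24; 0.08; Baseline; 2; router; 0.28; 0.31; 3; Metadata-based (router+filesizerange+docket); 0.68; 0.34; 0.42; 0.42; 0.56; 5; 0.70; 0.40; 0.68; Similarity-based clustering; 0.72; 0.56; 7; 0.74; 0.55; 0.71]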

Experimental Results - II

[Figure: Effects of disabling document retrieval for small clusters. Series: disable document retrieval for small clusters; enable document retrieval for small clusters. X-axis: number of clusters. Y-axis: time (s).]

[Figure: Time vs. Number of Clusters. Series: min, med, max. X-axis: number of clusters. Y-axis: time (s).]

Conclusion
- Large Volume Working Set
- Duplicate Definition and Automatic Evaluation
- Feature-based Duplicate Candidate Retrieval
- Similarity-based Clustering
- Improved Efficiency

Related Work - I
- Duplicate detection in other domains:
  - Databases [Bilenko and Mooney 2003]
    - To find records referring to the same entity but possibly in different representations
  - Electronic publishing [Brin et al. 1995]
    - To detect plagiarism or to identify different versions of the same document
  - Web search [Chowdhury et al. 2002] [Pugh 2004]
    - More efficient web crawling
    - Effective ranking of search results
    - Easy archiving of web documents

Related Work - II
- Fingerprinting: compute a compact description of a document, then do pairwise comparison of document fingerprints.
- Shingling [Broder et al.]
  - Represents a document as a series of simple numeric encodings for an n-term window
  - Retains every m-th shingle to produce a document sketch
  - Super shingles
- Selective fingerprinting [Heintze]
  - Selects a subset of the substrings to generate fingerprints
- Statistical approach [Chowdhury et al.]
  - Uses n high-idf terms
  - Improved accuracy over shingling
  - Efficient: one-fifth of the time of shingling
- Fingerprint reliability in a dynamic environment [Conrad et al.]
  - Considers the time factor on the Web
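For readers unfamiliar with shingling, a small sketch follows. It is a simplified illustration rather than Broder et al.'s exact procedure: the SHA-1 based encoding, the "hash divisible by m" sampling rule used to thin the shingles, and all function names are assumptions.

```python
import hashlib


def shingles(tokens, n=3):
    """All contiguous n-term windows of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sketch(tokens, n=3, m=4):
    """Numeric encodings of the shingles, keeping roughly one in m of them."""
    kept = set()
    for s in shingles(tokens, n):
        # Encode each shingle as a 64-bit integer (assumed encoding).
        h = int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        if h % m == 0:  # assumed sampling rule; m = 1 keeps every shingle
            kept.add(h)
    return kept


def resemblance(a, b):
    """Jaccard overlap of two sketches."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_1 = "a b c d e".split()
doc_2 = "a b c d f".split()
full_1, full_2 = sketch(doc_1, m=1), sketch(doc_2, m=1)
print(resemblance(full_1, full_2))
```

With m = 1 every shingle is kept, so the two toy documents share two of four distinct shingles; larger m trades accuracy for a smaller sketch.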

References
- M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD-2003), Washington D.C., August 2003.
- S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), pages 398-409. ACM Press, May 1995.
- A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of WWW6 '97, pages 391-404. Elsevier Science, April 1997.
- J. Callan. E-Rulemaking testbed. http://hartford.lti.cs.cmu.edu/eRulemaking/Data/. 2004.
- A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS), Volume 20, Issue 2, 2002.
- J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46, 1960.
- J. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: Signature reliability in a dynamic retrieval environment. In Proceedings of CIKM '03, pages 443-452. ACM Press, Nov. 2003.
- N. Heintze. Scalable document fingerprinting. In Proceedings of the Second USENIX Electronic Commerce Workshop, pages 191-200, Nov. 1996.
- W. Pugh. US Patent 6,658,423. http://www.cs.umd.edu/~pugh/google/Duplicates.pdf. 2003.

Demo
http://hartford.lti.cs.cmu.edu/eRulemaking/Data/USEPA-OAR-2002-0056/

Modifying an electronic form letter is extremely easy. Moreover, a modified letter can better represent the sender's opinions or stylistic preferences.

When there is a disconnect between broad public opinion and legislative action, MoveOn builds electronic advocacy groups. Examples of such issues are campaign finance, environmental and energy issues, and media consolidation. Once a group is assembled, MoveOn provides information and tools to help each individual have the greatest possible impact.

Many of the form letter-based comments are near-duplicates that are similar, but not exactly identical, to a (usually unknown) initial form letter, and to one another.

Recognizing, organizing, and considering near-duplicates are significant challenges for the agencies because near-duplicates increase the likelihood of overlooking substantive information (information that by law the agency must consider) that an individual adds to a form letter.

However, it is impractical to rely on fully human assessments for huge document collections such as public comments in eRulemaking. Moreover, this definition assumes that the original document can be identified easily.

For example, excerpt means one document takes first section from another article and elaboration means one document adds one or more paragraphs to another article.

However, in domains such as eRulemaking it may not be possible to tell by just looking at the documents which document is the source and which is the modified copy. In order to have a fully automatic assessment of duplicate and near-duplicate detection without human experts involved, we need to have a quantitative definition of near-duplicates. That is to say, our definition is a guideline for machines. In the eRulemaking domain, there are two broad categories of documents: Comments that are written more-or-less from scratch (unique), and comments that are based on a form letter (exact-duplicates and near-duplicates). The near-duplicate category can be further divided into subcategories (Table 1). Our subcategories are similar in spirit to those of Conrad, et al., but reflect typical patterns appearing in public comment datasets Our duplicate and near-duplicate detection system includes three modules: Text preprocessing, feature-based document retrieval and similarity-based document clustering (Figure 1).The text preprocessing module consists of information extraction (IE), binning by length, and seed document selection. The IE module identifies the email sender, the receiver, address, salutation and signature lines; docket IDs mentioned in the text; delivery dates and the email-relaying organizations identified in email headers. These features are used later in duplicate detection. Documents are then binned based on their content lengths (i.e., after removal of headers, address, salutation, and signature lines are removed). Binning insures that exact-duplicates are grouped together. Moreover, even before duplicate clustering starts, the system has a rough guess about what are the large duplicate sets. These sets are handled first, to improve efficiency. 
Finally, to identify the dominating document within a bin, the document that repeats most (i.e., has the most exact duplicates) is selected as the initial seed in any bin whose size is greater than 100.

In the feature-based document retrieval module, the system begins near-duplicate detection by going through the size-based bins in a round-robin manner. On each pass a document is selected from a bin, and a feature-based query is generated from its fingerprint and metadata. The system then performs probabilistic document retrieval, using a Boolean query, to get candidate duplicates for the document. The similarity-based clustering module measures the similarity of each document in the candidate set to the current seed; similar documents are grouped together. The next section describes the algorithms for feature-based document retrieval and similarity-based clustering.

The metadata features are extracted during preprocessing; the retrieval system indexes them and supports their use in queries. Since many of the near-duplicates are based on form letters created by special interest groups (mass mailers) or are commercial advertising (spam), duplicate documents are likely to share the same features as the seed document.
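The binning and seed-selection steps of the preprocessing module can be sketched roughly as follows. This is only an illustration under stated assumptions: bins are keyed on the exact character length of the stripped body (the text does not say how wide a bin is), and the function names `bin_by_length`, `bins_largest_first`, and `pick_seed` are hypothetical. Only the threshold of 100 comes from the text.

```python
from collections import Counter, defaultdict

MIN_BIN_SIZE_FOR_SEED = 100  # from the text: seeds are chosen only in bins larger than 100


def bin_by_length(bodies):
    """Group document ids by body length in characters.

    `bodies` maps doc id -> text with headers, address, salutation and
    signature lines already removed. Exact-length bins are an assumption.
    """
    bins = defaultdict(list)
    for doc_id, body in bodies.items():
        bins[len(body)].append(doc_id)
    return bins


def bins_largest_first(bins):
    """Order bins so that the likely large duplicate sets are handled first."""
    return sorted(bins.values(), key=len, reverse=True)


def pick_seed(doc_ids, bodies, min_bin_size=MIN_BIN_SIZE_FOR_SEED):
    """Return the id of the most-repeated document in a bin.

    Returns None when the bin is not larger than `min_bin_size`.
    """
    if len(doc_ids) <= min_bin_size:
        return None
    counts = Counter(bodies[d] for d in doc_ids)
    dominant_text, _ = counts.most_common(1)[0]
    # Use the first document in the bin that carries the dominant text.
    return next(d for d in doc_ids if bodies[d] == dominant_text)
```

Because exact duplicates have identical length, they always land in the same bin, which is why counting repeated texts inside a bin is enough to find the dominating document.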

Note that this requires an extra step to first store the similarity scores above a relatively loose initial cut-off threshold.

Feature-based retrieval works fairly efficiently for large clusters, since it usually cuts the number of documents that need to be clustered from the size of the entire dataset to a reasonable number. One of our test collections contains 536,975 public comments; the average size of candidate duplicate sets is 10,995. This reduction is extremely significant for the big clusters. However, for small clusters, especially those containing only a single unique document, it is inefficient to retrieve and look through the candidate set. Therefore, after most of the big clusters have been found, the feature-based retrieval module is disabled, based on the assumption that most of the remaining unclustered documents are unique. Only similarity-based clustering is used on them.

F-measure

In the context of clustering, the quality of a clustering with respect to the gold standard is assessed in terms of precision (p) and recall (r). p and r compare each cluster i with each class j in the ground truth:

$$p_{ij} = \frac{n_{ij}}{n_i} \qquad (5)$$

$$r_{ij} = \frac{n_{ij}}{n_j} \qquad (6)$$

where $n_{ij}$ is the number of documents from class j in cluster i, $n_i$ is the number of documents in cluster i, and $n_j$ is the number of documents in class j.

The F-measure, widely used in the IR community, tries to capture how well the groups of the investigated clustering best match the groups of the ground truth. The F-measure for a cluster i and class j is:

$$F_{ij} = \frac{2 \, p_{ij} \, r_{ij}}{p_{ij} + r_{ij}} \qquad (7)$$

The F-measure of the whole clustering is:

$$F = \sum_{j} \frac{n_j}{n} F_j \qquad (8)$$

where $F_j = \max_i \{F_{ij}\}$ and n is the total number of documents.

Purity

If the purpose of clustering is to classify clusters of documents rather than single documents, the purity of each cluster tries to capture how well the groups of the investigated clustering on average match those of the ground truth. The weighted average purity over all clusters is a measure of the quality of the whole clustering.
It is defined as:

$$\rho = \sum_{i} \frac{n_i}{n} \rho_i \qquad (9)$$

where $\rho_i = \max_j \{p_{ij}\}$, n is the total number of documents, and $p_{ij}$ is the precision of cluster i with reference to class j.

Pair-wise Measure

One may also define measures based on the distribution of pairs of texts. Pair measures reflect degrees of overlap and are thus not dependent on the direction of the comparison. The Fowlkes and Mallows index [9] is one such pair measure that has been used for cluster validation:

$$FM = \frac{a}{\sqrt{(a+b)(a+c)}} \qquad (10)$$

where a is the number of pairs in the same group in both the ground truth and the clustering (agreement), b is the number of pairs in the same group in the ground truth but in different groups in the clustering (false negative), c is the number of pairs in different groups in the ground truth but in the same group in the clustering (false positive), and d is the number of pairs in different groups in both the ground truth and the clustering (agreement).

Kappa

Finally, we suggest using Cohen's kappa statistic [7] to assess the agreement between clustering results relative to two ground truths given by different human assessors. We consider the investigated clustering results as another assessor. The kappa coefficient is used to compare inter-assessor concordance and is defined as:

$$\kappa = \frac{p(A) - p(E)}{1 - p(E)} \qquad (11)$$

where p(A) is the observed agreement between the two assessments. It can be calculated as (a+d)/m, using a, b, c, and d from Equation 10 and m = a+b+c+d. p(E) is the agreement expected by chance. It can be calculated as $\frac{(a+b)(a+c) + (c+d)(b+d)}{m^2}$.

Our idea is that if the concordance between the clustering results given by our system and one of the assessors is as good as, or even better than, that between two human assessors, our algorithm can be considered effective. In computational linguistics, $\kappa > 0.67$ has often been required to draw any conclusions of agreement at all [5].
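As a rough illustration of the evaluation measures above, the sketch below computes the clustering F-measure, weighted purity, the pair counts a, b, c, d, the Fowlkes and Mallows index, and kappa from two parallel label lists. The function names are hypothetical, and the chance-agreement term in `kappa` uses the standard two-rater form of Cohen's kappa, which is an assumption about what the original formula contained.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def f_measure_and_purity(clusters, classes):
    """clusters[k] and classes[k] are the cluster and true class of document k."""
    n = len(clusters)
    n_i = Counter(clusters)                 # documents per cluster
    n_j = Counter(classes)                  # documents per class
    n_ij = Counter(zip(clusters, classes))  # documents of class j in cluster i
    best_f, best_p = {}, {}
    for (i, j), count in n_ij.items():
        p = count / n_i[i]
        r = count / n_j[j]
        f = 2 * p * r / (p + r)
        best_f[j] = max(best_f.get(j, 0.0), f)  # F_j = max_i F_ij
        best_p[i] = max(best_p.get(i, 0.0), p)  # purity of cluster i = max_j p_ij
    f_total = sum(n_j[j] / n * best_f[j] for j in n_j)
    purity = sum(n_i[i] / n * best_p[i] for i in n_i)
    return f_total, purity


def pair_counts(clusters, classes):
    """Count document pairs by agreement type; quadratic in the number of documents."""
    a = b = c = d = 0
    for x, y in combinations(range(len(clusters)), 2):
        same_truth = classes[x] == classes[y]
        same_cluster = clusters[x] == clusters[y]
        if same_truth and same_cluster:
            a += 1
        elif same_truth:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fowlkes_mallows(a, b, c, d):
    # Undefined (division by zero) if no pair shares a group in one of the partitions.
    return a / sqrt((a + b) * (a + c))


def kappa(a, b, c, d):
    m = a + b + c + d
    p_a = (a + d) / m
    # Assumed chance agreement: product of marginals for "same" plus "different".
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (m * m)
    return (p_a - p_e) / (1 - p_e)
```

The pair loop is quadratic, so on a collection of several hundred thousand comments the counts would more realistically be derived from the contingency table; the explicit loop is kept here only because it mirrors the definitions of a, b, c, and d directly.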
The 0.67 standard has drawn a number of objections, however, and less strict evaluations consider $0.20 < \kappa < 0.40$ to indicate fair agreement and $0.40 < \kappa < 0.60$ to indicate moderate agreement.

Early research on duplicate detection was done mostly in the areas of databases [1][16] and electronic publishing [2][11]. In database research, the goal of duplicate detection is to find records referring to the same entity but possibly in different representations. In electronic publishing, duplicate detection is used to detect plagiarism or to identify different versions of the same document. Most recently, duplicate detection has been studied extensively for web search tasks, for example, to make web crawling more efficient, search result ranking more effective, and web document archiving easier.

A widespread duplicate detection technique is to generate a document fingerprint, which is a compact description of the document, and then to do pair-wise comparison of document fingerprints. The assumption is that fingerprints can be compared much more quickly than complete documents. A common method of generating fingerprints is to select a set of character sequences from a document, and to generate a fingerprint based on the hash values of these sequences. Similarity between two documents is measured by counting the number of common sequences in the complete fingerprints. Different algorithms are characterized, and their computational costs are determined, by the hash function and how substrings are selected.

The dominant approach in duplicate detection is shingling, which was proposed by Broder, et al. [1]. It represents a document as a series of simple numeric encodings for an n-term window. Filtering techniques were used to retain every mth shingle to produce a document sketch. The authors also presented a super-shingle technique that creates meta-sketches to further reduce the computational complexity. Pairs of documents that have a high shingle match coefficient are asserted to be near-duplicates.
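A minimal sketch of the shingling idea described above, under several assumptions: shingles are windows of n whitespace-separated tokens, each window is encoded with a truncated SHA-1 hash, "every mth shingle" is read as keeping hashes divisible by m, and the match coefficient is taken to be set overlap divided by set union. None of these choices is confirmed by the text, and the function names are hypothetical.

```python
import hashlib


def shingle_hashes(text, n=4):
    """Return the set of numeric encodings of all n-token windows in `text`."""
    tokens = text.lower().split()
    hashes = set()
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n])
        digest = hashlib.sha1(window.encode("utf-8")).digest()
        hashes.add(int.from_bytes(digest[:8], "big"))  # keep 64 bits
    return hashes


def sketch(hashes, m=4):
    """Filter the shingle set down to a smaller document sketch."""
    return {h for h in hashes if h % m == 0}


def resemblance(a, b):
    """Overlap of two shingle sets (or sketches) as a fraction of their union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

For instance, `resemblance(shingle_hashes(doc1), shingle_hashes(doc2))` compares full shingle sets, while passing both sets through `sketch` first trades some accuracy for fewer comparisons.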

Heintze studied selective document fingerprinting that scales to large environments and that can identify similarities between documents which have as little as 5% or less in common [10]. To reduce the size of the fingerprints, he selected a subset of the substrings to generate them. He demonstrated that selective fingerprints normally perform within a factor of two of complete fingerprints.

Pugh, from Google, claimed that duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints matches [13].
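The three steps attributed to Pugh can be illustrated with the following sketch. It is not the patented method: it assumes that the "parts" are lowercase word tokens, that a part is assigned to a list by its hash modulo the number of lists, and that a fingerprint is compared together with its list index. The names `list_fingerprints` and `near_duplicates` are hypothetical.

```python
import hashlib


def _hash64(s):
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")


def list_fingerprints(text, num_lists=4):
    """(i) extract tokens, (ii) assign each to one list, (iii) fingerprint each populated list."""
    lists = [[] for _ in range(num_lists)]
    for token in text.lower().split():
        lists[_hash64(token) % num_lists].append(token)
    return {
        (index, _hash64(" ".join(parts)))
        for index, parts in enumerate(lists)
        if parts
    }


def near_duplicates(text_a, text_b, num_lists=4):
    """Two documents are treated as near-duplicates if any fingerprint matches."""
    return bool(list_fingerprints(text_a, num_lists) & list_fingerprints(text_b, num_lists))
```

Under this assignment scheme, replacing a single word can change at most two lists (the one losing the old word and the one gaining the new word), so the fingerprints of the remaining lists still match.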

Recent duplicate detection research in the Web environment has focused on issues of computational efficiency and detection effectiveness while relying on collection statistics to consistently recognize document replicas in full-text collections [6][14]. Since a term's inverse document frequency (idf) is a measurement of a word's rareness across a given collection, its value measures how well a term discriminates one document from others. Chowdhury, et al., selected the terms best characterizing a document, i.e., the n words with the highest idf. Moreover, to filter out possible misspellings and typos, a term is not considered unless it appears across the collection at least five times. They claimed that, in addition to improving accuracy over shingling, their approach executes in one-fifth of the time. They showed that a statistically-based approach (i) is more efficient than a fingerprint-based approach, (ii) scales in the number of documents, and (iii) works well for documents of diverse sizes.
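The idf-based term selection described above might look roughly like the sketch below. Several details are assumptions rather than facts from the text: idf is computed as log(N/df), "appears at least five times" is read as total occurrences in the collection, and the selected terms are reduced to a single SHA-1 signature for comparison. The function names are hypothetical.

```python
import hashlib
import math
from collections import Counter


def collection_stats(docs):
    """Return (idf, collection_frequency) for a list of tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    coll_freq = Counter()
    for tokens in docs:
        coll_freq.update(tokens)
        doc_freq.update(set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return idf, coll_freq


def signature(tokens, idf, coll_freq, top_n=5, min_coll_freq=5):
    """Hash the top_n highest-idf terms that occur at least min_coll_freq times."""
    candidates = {t for t in tokens if coll_freq[t] >= min_coll_freq}
    # Sort by descending idf, breaking ties alphabetically for determinism.
    chosen = sorted(candidates, key=lambda t: (-idf[t], t))[:top_n]
    return hashlib.sha1(" ".join(sorted(chosen)).encode("utf-8")).hexdigest()
```

In this sketch, a word that an individual adds only once to a form letter falls below the frequency floor and is ignored, so the modified letter still receives the same signature as the unmodified one.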
