Efficient blocking method for a large scale citation matching


Mateusz Fedoryszak & Łukasz Bolikowski
{matfed,bolo}@icm.edu.pl

Interdisciplinary Centre for Mathematical and Computational Modelling

University of Warsaw

Citation matching

• Note: it's an instance of the data linkage problem

References
[1] I. Newton, Philosophiae naturalis...
[2] N. Copernicus, De revolutionibus...

ID | Title | Author
14 | De revolutionibus... | Copernicus
11 | Στοιχεῖα | Εὐκλείδης

Why important?

• Clickable interfaces
• Bibliometrics (think: H-index)
• Further analysis (e.g. similarities)

Why difficult?

• Citation extraction errors (in both digital-born and retro-born docs)

• Countless citation styles used inconsistently

• Typos and other human errors

The Problem

[Figure: a reference list to be linked with a document metadata table (columns: ID, Title, Author)]

Naïve approach

For 1.3M documents and 12M citations it's 15.6 × 10^12 comparisons
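The figure above is simply the product of the two collection sizes. A quick check, with illustrative variable names that are not from the slides:

```python
# Every citation is compared against every document in the naive approach.
documents = 1_300_000
citations = 12_000_000

comparisons = documents * citations
print(f"{comparisons:.3e}")  # 1.560e+13, i.e. 15.6 x 10^12
```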


Select the best candidates

• I'll present a method of candidate selection and how to implement it using Apache Hadoop


Blocking


Fingerprints

[Figure: references and document records (ID, Title, Author) labelled with fingerprints such as AAAA, BBBB, CCCC and AAAA, FFFF]

Workflow

[Diagram: documents are mapped to (hash, document ID) records and citations to (hash, citation ID) records; records are joined on hash to produce (citation ID, document ID) pairs]
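The workflow is essentially a join on the hash value. Below is a minimal in-memory sketch of that idea; it is not the Hadoop implementation from the talk, and the function name and the toy fingerprints are made up for illustration.

```python
from collections import defaultdict

def candidate_pairs(citation_hashes, document_hashes):
    """Join (hash, id) records on hash; return distinct (citation_id, document_id) pairs."""
    buckets = defaultdict(set)  # hash -> IDs of documents that produced it
    for h, doc_id in document_hashes:
        buckets[h].add(doc_id)

    pairs = set()
    for h, cit_id in citation_hashes:
        for doc_id in buckets.get(h, ()):
            pairs.add((cit_id, doc_id))
    return pairs

# Toy input: only the fingerprint AAAA is shared.
citation_hashes = [("AAAA", "c1"), ("FFFF", "c1")]
document_hashes = [("AAAA", "d14"), ("BBBB", "d14"), ("CCCC", "d11")]

print(candidate_pairs(citation_hashes, document_hashes))  # {('c1', 'd14')}
```

Only the pairs returned here would go on to the more expensive similarity assessment.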

Workflow with tuning

• Before:
  • Compute bucket sizes
  • Reject too big ones
  • Use DistributedCache to disseminate
• After:
  • For each citation choose only the most popular candidates

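The two tuning steps can be sketched in memory as follows. This is a hedged illustration, not the Hadoop job itself: it assumes that a bucket's size is the number of documents sharing a hash and that "most popular" means sharing the most hashes with the citation. The function and parameter names are hypothetical.

```python
from collections import Counter, defaultdict

def tuned_candidates(citation_hashes, document_hashes,
                     max_bucket_size=1000, max_candidates=30):
    buckets = defaultdict(set)
    for h, doc_id in document_hashes:
        buckets[h].add(doc_id)

    # "Before": reject buckets that are too big (assumed: too many documents).
    buckets = {h: docs for h, docs in buckets.items()
               if len(docs) <= max_bucket_size}

    # Count how many hashes each document shares with each citation.
    votes = defaultdict(Counter)
    for h, cit_id in set(citation_hashes):
        for doc_id in buckets.get(h, ()):
            votes[cit_id][doc_id] += 1

    # "After": keep only the most popular candidates per citation.
    return {cit_id: [doc for doc, _ in counter.most_common(max_candidates)]
            for cit_id, counter in votes.items()}

# Toy data: "common" is shared by three documents, "x" by two, "y" by one.
document_hashes = [("common", "d1"), ("common", "d2"), ("common", "d3"),
                   ("x", "d1"), ("x", "d2"), ("y", "d1")]
citation_hashes = [("common", "c1"), ("x", "c1"), ("y", "c1")]

print(tuned_candidates(citation_hashes, document_hashes,
                       max_bucket_size=2, max_candidates=1))  # {'c1': ['d1']}
```

The two parameters correspond to the pair of limits quoted with each hash function in the results, e.g. (1000, 30).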

Hash functions

Normalisation
• Lowercase
• Remove
  • diacritics
  • punctuation marks
• Filter out tokens shorter than 3 characters (except numbers)

Normalisation

Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.

pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356
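A minimal sketch of these normalisation rules, reproducing the example above. The small `EXTRA` table is an assumption added here because letters such as "ł" have no Unicode decomposition; the actual implementation may handle them differently.

```python
import re
import unicodedata

# Letters that NFKD cannot decompose (illustrative, incomplete table).
EXTRA = str.maketrans({"ł": "l", "ø": "o", "đ": "d"})

def normalise(text):
    """Lowercase, strip diacritics and punctuation, drop short non-numeric tokens."""
    text = text.lower().translate(EXTRA)
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"[^\W_]+", text)  # runs of letters and digits
    return [t for t in tokens if t.isdigit() or len(t) >= 3]

citation = 'Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.'
print(" ".join(normalise(citation)))
# pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356
```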

Examples

{ author: "Zdzisław Pawlak", year: "1982", title: "Rough sets", journal: "International Journal of Computer & Information Sciences", volume: "11", issue: "5", pages: "341–356"}

Baseline

Citation hashes: pawlak, zdzislaw, 1982, rough, ..., internat
Document hashes: zdzislaw, pawlak, 1982, rough, ..., international, journal

Bigrams

• For document we use only authors and title fields

Citation hashes: pawlak zdzislaw, zdzislaw 1982, 1982 rough, rough sets
Document hashes: zdzislaw pawlak, rough sets
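A short sketch of bigram hashing over normalised tokens. Building the document bigrams per field (authors, then title) is an assumption inferred from the example, which shows no bigram spanning the two fields.

```python
def bigram_hashes(tokens):
    """Return every pair of adjacent tokens as a single hash string."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

# First tokens of the normalised citation.
citation_tokens = ["pawlak", "zdzislaw", "1982", "rough", "sets"]
print(bigram_hashes(citation_tokens))
# ['pawlak zdzislaw', 'zdzislaw 1982', '1982 rough', 'rough sets']

# Document side: authors and title fields only, hashed separately (assumption).
author_tokens = ["zdzislaw", "pawlak"]
title_tokens = ["rough", "sets"]
document_hashes = bigram_hashes(author_tokens) + bigram_hashes(title_tokens)
print(document_hashes)  # ['zdzislaw pawlak', 'rough sets']
```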

name-year
• For citation:
  • name: any of first 4 distinct text tokens
  • year: any number between 1900 and 2050

Citation hashes: pawlak#1982, zdzislaw#1982, rough#1982, sets#1982
Document hashes: zdzislaw#1982, pawlak#1982
+ approximate variant: zdzislaw#1981, pawlak#1981, zdzislaw#1983, pawlak#1983
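The name-year rule can be sketched as below. The `approximate` flag, which also emits the neighbouring years, is a guess at how the approximate variant works based on the 1981 and 1983 hashes in the example; all names are illustrative.

```python
def name_year_hashes(tokens, max_names=4, approximate=False):
    """Combine each of the first distinct text tokens with each plausible year."""
    names = []
    for token in tokens:
        if not token.isdigit() and token not in names:
            names.append(token)
    names = names[:max_names]

    years = [int(t) for t in tokens if t.isdigit() and 1900 <= int(t) <= 2050]
    if approximate:
        # Assumed meaning of "approximate": tolerate a year off by one.
        years = [year + delta for year in years for delta in (-1, 0, 1)]

    return {f"{name}#{year}" for name in names for year in years}

citation_tokens = ("pawlak zdzislaw 1982 rough sets internat comput "
                   "inform sci 11 5 341 356").split()
print(sorted(name_year_hashes(citation_tokens)))
# ['pawlak#1982', 'rough#1982', 'sets#1982', 'zdzislaw#1982']

document_tokens = ["zdzislaw", "pawlak", "1982"]
print(sorted(name_year_hashes(document_tokens, approximate=True)))
```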

name-year-pages
• For citation:
  • pages: any sorted pair of numbers, not year

Citation hashes: pawlak#1982#5#11, pawlak#1982#5#341, pawlak#1982#..., pawlak#1982#341#356, zdzislaw#..., rough#..., sets#...
Document hashes: zdzislaw#1982#341#356, pawlak#1982#341#356

+ approximate & optimistic variant

Intermezzo: citation parsing

token | label
Pawlak | author
, | other
Zdzisław | author
( | other
1982 | year
) | other
. | other

name-year-num^n

• n = 1..3
• For citation:
  • num^n: any sorted tuple of n numbers, not year

Citation hashes (n = 3): pawlak#1982#5#11#341, pawlak#1982#5#341#356, pawlak#1982#5#11#356, pawlak#1982#11#341#356, zdzislaw#..., rough#..., sets#...

+ approximate variant
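A sketch of how the citation-side name-year-num^n hashes could be enumerated with `itertools.combinations`. Reading "not year" as "every number except the one used as the year" is an assumption, and the function name is hypothetical. With n = 2 the same enumeration yields pairs like those shown for name-year-pages.

```python
from itertools import combinations

def name_year_num_hashes(tokens, n, max_names=4):
    """Hashes of the form name#year#num1#...#numn with the numbers in ascending order."""
    names = []
    for token in tokens:
        if not token.isdigit() and token not in names:
            names.append(token)
    names = names[:max_names]

    numbers = {int(t) for t in tokens if t.isdigit()}
    hashes = set()
    for year in (x for x in numbers if 1900 <= x <= 2050):
        others = sorted(numbers - {year})  # assumed meaning of "not year"
        for combo in combinations(others, n):
            for name in names:
                hashes.add("#".join([name, str(year), *map(str, combo)]))
    return hashes

citation_tokens = ("pawlak zdzislaw 1982 rough sets internat comput "
                   "inform sci 11 5 341 356").split()
hashes = name_year_num_hashes(citation_tokens, n=3)
print(sorted(h for h in hashes if h.startswith("pawlak#")))
# ['pawlak#1982#11#341#356', 'pawlak#1982#5#11#341',
#  'pawlak#1982#5#11#356', 'pawlak#1982#5#341#356']
```

The number of hashes grows combinatorially with the count of numbers in a citation, which is one reason the intermediate data column matters in the evaluation.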

Evaluation

Test dataset

<ref id="pone.0052832-Jemal1"><label>2</label><mixed-citation publication-type="journal"><name><surname>Jemal</surname><given-names>A</given-names></name>, <name><surname>Bray</surname><given-names>F</given-names></name>, <name><surname>Center</surname><given-names>MM</given-names></name>, <name><surname>Ferlay</surname><given-names>J</given-names></name>, <name><surname>Ward</surname><given-names>E</given-names></name>, <etal>et al</etal> (<year>2011</year>) <article-title>Global cancer statistics</article-title>. <source>CA Cancer J Clin</source><volume>61</volume>: <fpage>69</fpage>–<lpage>90</lpage><pub-id pub-id-type="pmid">21296855</pub-id></mixed-citation></ref>

Test dataset

2 Jemal A, Bray F, Center MM, Ferlay J, Ward E, et al (2011) Global cancer statistics. CA Cancer J Clin 61: 69–90

Test dataset

• Based on the Open Access Subset of PMC
• Only citations preserving original formatting
• Only citations with PMID assigned
• 528k documents
• 3.6M citations, of which 321k are resolvable

Metrics

• Recall — the percentage of true citation → document links that are maintained by the heuristic

• Precision — the percentage of citation → document links returned by algorithm that are correct

• Intermediate data — total number of hashes and pairs generated (before selecting the most popular ones)

• Candidate pairs — number of pairs returned by heuristic for further assessment

• F-measure not included intentionally
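The first two metrics can be computed from sets of (citation ID, document ID) pairs. A minimal sketch with made-up toy data; the function name is illustrative.

```python
def precision_recall(candidate_pairs, true_links):
    """Precision and recall of candidate citation -> document pairs."""
    candidate_pairs, true_links = set(candidate_pairs), set(true_links)
    correct = candidate_pairs & true_links
    precision = len(correct) / len(candidate_pairs) if candidate_pairs else 0.0
    recall = len(correct) / len(true_links) if true_links else 0.0
    return precision, recall

# Toy example: 4 candidates returned, 3 true links, 2 of them recovered.
candidates = {("c1", "d1"), ("c1", "d2"), ("c2", "d3"), ("c3", "d9")}
truth = {("c1", "d1"), ("c2", "d3"), ("c4", "d4")}

precision, recall = precision_recall(candidates, truth)
print(precision, recall)
```

For a blocking heuristic, low precision is expected and acceptable as long as recall stays high, since the candidates are assessed again afterwards.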

Limits

• Candidate documents per citation
  • 30
  • no limit
• Bucket size
  • 10
  • 100
  • 1000
  • 10000
  • no limit

Recall

hash | precision | recall | intermediate data | to assess
bigrams (10000, 30) | 0.4% | 98.2% | 285,908,900 | 79,329,459
baseline (10000, 30) | 0.3% | 97.9% | 221,212,080 | 114,223,777
bigrams (100, 30) | 2.9% | 92.7% | 94,693,721 | 10,446,883
name-year (approx.) | 0.0% | 92.4% | 928,068,651 | 862,357,212
name-year (strict) | 0.1% | 90.2% | 322,015,088 | 290,940,929
baseline (10000, 10) | 0.9% | 88.7% | 221,212,080 | 49,747,843
name-year-num (approx., 1000, 30) | 1.2% | 88.5% | 170,633,938 | 23,591,933
name-year-num (strict, 1000, 30) | 3.6% | 88.3% | 85,756,601 | 7,864,129
name-year (strict, 1000, 30) | 2.5% | 77.9% | 28,463,067 | 9,940,403
name-year (approx., 1000, 30) | 1.4% | 75.6% | 40,726,102 | 17,098,080
baseline (1000, 30) | 0.9% | 73.2% | 115,822,141 | 26,083,677

Precision

hash | precision | recall | intermediate data | to assess
name-year-pages (strict, optimistic) | 98.4% | 7.3% | 4,787,215 | 23,734
name-year-num^3 (strict) | 84.0% | 43.4% | 257,639,965 | 166,128
name-year-pages (approx., optimistic) | 78.2% | 7.8% | 42,478,742 | 32,182
name-year-pages (strict, pessimistic) | 53.7% | 42.5% | 132,809,210 | 254,208
name-year-num^3 (approx.) | 17.6% | 47.1% | 617,193,035 | 860,314
name-year-num^2 (strict) | 14.8% | 66.6% | 141,885,270 | 1,444,074
bigrams (10, 10) | 11.8% | 65.6% | 84,042,160 | 1,784,228

Recall/intermediate data

hash | precision | recall | intermediate data | to assess
name-year (strict, 1000, 30) | 2.5% | 77.9% | 28,463,067 | 9,940,403
name-year (approx., 1000, 30) | 1.4% | 75.6% | 40,726,102 | 17,098,080
name-year-pages (strict, optimistic, 1000, 30) | 98.4% | 7.3% | 4,787,215 | 23,734
name-year-num (strict, 1000, 30) | 3.6% | 88.3% | 85,756,601 | 7,864,129
bigrams (100, 30) | 2.9% | 92.7% | 94,693,721 | 10,446,883
bigrams (10, 30) | 11.8% | 65.6% | 84,042,160 | 1,793,997
baseline (1000, 30) | 0.9% | 73.2% | 115,822,141 | 26,083,677
name-year-num (approx., 1000, 30) | 1.2% | 88.5% | 170,633,938 | 23,591,933
baseline (100, 30) | 3.2% | 44.0% | 91,175,101 | 4,458,560
name-year-num^2 (strict, 1000, 30) | 18.4% | 66.6% | 141,553,137 | 1,165,181
baseline (10000, 30) | 0.3% | 97.9% | 221,212,080 | 114,223,777

Recall vs. intermediate data

Recall/to assess

hash | precision | recall | intermediate data | to assess
name-year-pages (strict, optimistic, 1000, 30) | 98.4% | 7.3% | 4,787,215 | 23,734
name-year-num^3 (strict, 1000, 30) | 84.0% | 43.4% | 257,637,645 | 165,995
name-year-pages (approx., optimistic, 1000, 30) | 78.5% | 7.8% | 42,478,742 | 32,042
name-year-pages (strict, pessimistic, 1000, 30) | 56.3% | 42.5% | 132,792,590 | 242,261
name-year-num^3 (approx., 1000, 30) | 19.1% | 47.1% | 617,046,925 | 794,284
name-year-num^2 (strict, 1000, 30) | 18.4% | 66.6% | 141,553,137 | 1,165,181
bigrams (10, 30) | 11.8% | 65.6% | 84,042,160 | 1,793,997
name-year-pages (approx., pessimistic, 1000, 30) | 9.9% | 45.8% | 172,447,469 | 1,483,980
name-year-num (strict, 1000, 30) | 3.6% | 88.3% | 85,756,601 | 7,864,129
name-year-num^2 (approx., 1000, 30) | 3.2% | 69.8% | 359,051,798 | 7,023,337
baseline (100, 30) | 3.2% | 44.0% | 91,175,101 | 4,458,560
bigrams (100, 30) | 2.9% | 92.7% | 94,693,721 | 10,446,883

Recall vs. to assess

Combination

Lost citations

hash | lost fraction
name-year (approx., 1000, 30) | 12.4%
name-year-num^2 (approx., 1000, 30) | 12.3%
name-year (strict, 1000, 30) | 9.8%
name-year-pages (approx., pessimistic, 1000, 30) | 9.0%
baseline (10000, 10) | 6.7%
name-year-num (approx., 1000, 30) | 6.0%
name-year (strict) | 5.8%
name-year-num^2 (strict, 1000, 30) | 5.6%
name-year (approx.) | 5.1%
name-year-num (strict, 1000, 30) | 4.4%
name-year-num^3 (approx., 1000, 30) | 4.2%
baseline (1000, 30) | 3.7%

Results

hash sequence | recall | intermediate data | to assess
bigrams (10000, 30) | 98.17% | 285,908,900 | 79,329,459
name-year-pages (strict, optimistic) → name-year (strict, 1000, 30) → name-year (strict, 10000, 30) → bigrams (10000, 30) | 87.64% | 187,394,452 | 41,152,278
name-year-pages (strict, optimistic) → name-year-pages (strict, pessimistic) → bigrams (100, 30) → bigrams (10000, 30) | 96.86% | 333,701,109 | 29,818,635
name-year-pages (strict, optimistic) → bigrams (100, 30) → bigrams (10000, 30) | 97.76% | 202,590,413 | 30,582,488
name-year-pages (strict, optimistic) → name-year-num^3 (strict) → bigrams (10, 10) → bigrams (100, 30) → bigrams (10000, 30) | 97.73% | 398,895,930 | 25,123,164

Future work

• Other combinations
  • After fine-grained assessment
  • Various hash functions at the same time
• Further efficiency tuning
  • Limit number of generated hashes

CoAnSys Project

• An open source framework for mining very large collections of scientific publications

• Contains implementation of the presented workflow

• http://coansys.ceon.pl/

Thank you! Questions?

Mateusz Fedoryszak
matfed@icm.edu.pl

http://coansys.ceon.pl/
http://adalab.icm.edu.pl/