Scaling up the Extraction of Canonical Citations in Classics

Date post: 16-Mar-2018
Upload: matteo-romanello
Scaling up the Extraction of Canonical Citations in Classics. Matteo Romanello (DAI / KCL), @mr56k. Humanités Numériques et Antiquité – 4 Sept. 2015
Transcript
Page 1: Scaling up the Extraction of Canonical Citations in Classics

Scaling up the Extraction of Canonical Citations in Classics

Matteo Romanello (DAI / KCL) @mr56k

Humanités Numériques et Antiquité – 4 Sept. 2015

Prologue
Approach
Evaluation
Outlook

Page 2: Scaling up the Extraction of Canonical Citations in Classics

Prologue

Page 3: Scaling up the Extraction of Canonical Citations in Classics

Centrality of References

Referring as scholarly primitive (Unsworth):

Referring, Discovering, Annotating, Comparing, Sampling, Illustrating, Representing

References in Classics: canonical texts, fragmentary texts, inscriptions, papyri, manuscripts, coins, etc.

Ubiquity of References

journal articles, reviews, monographs, indexes, commentaries

Page 4: Scaling up the Extraction of Canonical Citations in Classics

Classical Commentaries

Page 5: Scaling up the Extraction of Canonical Citations in Classics

Enhanced Reading (1)

Line-by-line Bibliographical Database of Wolfram von Eschenbach’s Parzival, http://wolfram.lexcoll.com/txts/index.htm

Page 6: Scaling up the Extraction of Canonical Citations in Classics

Enhanced Reading (2)

http://labs.jstor.org/shakespeare/

Page 7: Scaling up the Extraction of Canonical Citations in Classics

Enhanced Reading (3)

Segetes, http://segetes.io/aeneid

Page 8: Scaling up the Extraction of Canonical Citations in Classics

Enhanced Reading (4)

Hellespont Project, http://gapvis.hellespont.dainst.org/#book/1/read/113/

Page 9: Scaling up the Extraction of Canonical Citations in Classics

Trends in Text Reception

Neville Morley, Number Crunching, http://thesphinxblog.com/2015/06/25/number-crunching/

Page 10: Scaling up the Extraction of Canonical Citations in Classics

Citation Networks

with applications to:

1. search
2. document clustering
3. formal network analysis

Page 11: Scaling up the Extraction of Canonical Citations in Classics

Approach

Page 12: Scaling up the Extraction of Canonical Citations in Classics

Rationale

- beyond string-based search
- references, not quotations
- scalable approach:
  - language independent
  - applicable to large amounts of documents
  - easily adaptable to different materials and ways of referencing

Examples:

- In Statius’ « Achilleid » (2, 96-102) Achilles describes […]
- e.g. Vergil, Aen. 12, 101-109 ; Lucan 1, 204-212 ; Statius, Th. 12, 736-740 […]

Page 13: Scaling up the Extraction of Canonical Citations in Classics

Named Entity Recognition

Computer Science > Information Extraction > NER

Question Answering System

Q: where did Aaron Swartz die?
A: New York

Two days after the prosecution rejected a counter-offer by Swartz, he was found dead in his Brooklyn, New York apartment, where he had hanged himself.

3-step process:

1. Named Entity Recognition and Classification
2. Relation Extraction
3. Named Entity Disambiguation

Page 14: Scaling up the Extraction of Canonical Citations in Classics

Citation Extraction: Step 1 (NER)

Named Entities (= citation components):

- AAUTHOR = ancient author
- AWORK = ancient work
- REFAUWORK = concise reference to author, work or both (“Pliny, nat.”, “Thuc.”)
- REFSCOPE = indication of the cited passage (“11, 4, 11”)

Page 15: Scaling up the Extraction of Canonical Citations in Classics

Citation Extraction: Step 2 (Relation Detection)

- reference as relation vs. reference as monolithic entity
- binary scope relation between two entities (arguments)
  - arg1: aauthor | awork | refauwork
  - arg2: refscope

examples:

- Ammianus (15, 8, 7)
- Trabajos 159–173
- Pliny, nat. 11, 4, 11
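The binary scope relation can be illustrated with a small sketch. It assumes the entities have already been recognised, and it uses one hypothetical rule (attach each REFSCOPE to the closest preceding arg1 entity). The names `Entity`, `ARG1_LABELS` and `extract_scope_relations` are illustrative only and are not taken from the talk's software.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    label: str  # "AAUTHOR", "AWORK", "REFAUWORK" or "REFSCOPE"
    text: str
    start: int  # character offset in the abstract (illustrative)

# Entity types allowed as first argument of a scope relation
ARG1_LABELS = {"AAUTHOR", "AWORK", "REFAUWORK"}

def extract_scope_relations(entities):
    """Pair every REFSCOPE with the closest preceding arg1 entity.

    Hypothetical rule for illustration; a real rule set would need
    more cases (e.g. a scope that precedes the work it belongs to).
    """
    relations = []
    last_arg1 = None
    for entity in sorted(entities, key=lambda e: e.start):
        if entity.label in ARG1_LABELS:
            last_arg1 = entity
        elif entity.label == "REFSCOPE" and last_arg1 is not None:
            relations.append((last_arg1, entity))
    return relations

# "Pliny, nat. 11, 4, 11"
entities = [
    Entity("REFAUWORK", "Pliny, nat.", 0),
    Entity("REFSCOPE", "11, 4, 11", 12),
]
for arg1, arg2 in extract_scope_relations(entities):
    print(f"scope({arg1.text!r}, {arg2.text!r})")
```

Treating the reference as a relation rather than one monolithic entity means that several scopes can share the same author or work mention.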

Page 16: Scaling up the Extraction of Canonical Citations in Classics

Citation Extraction: Step 3 (Disambiguation)

- assign each author/work/canonical reference a unique ID
- IDs are CTS URNs

Page 17: Scaling up the Extraction of Canonical Citations in Classics

Canonical Text Services (CTS) Uniform Resource Names (URNs)

A machine-readable syntax for canonical references [refs]

Pliny → urn:cts:latinLit:phi0978

Pliny’s NH → urn:cts:latinLit:phi0978.phi001

Pliny, Nat. 11,4,11 → urn:cts:latinLit:phi0978.phi001:11.4.11
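The three examples share one layout: namespace, text group, optional work, optional passage. A minimal sketch of composing such identifiers follows; the function name `make_cts_urn` and its parameters are hypothetical and only cover the parts shown in the examples.

```python
def make_cts_urn(namespace, textgroup, work=None, passage=None):
    """Compose a CTS URN from the components shown in the examples.

    Only author, work and passage levels are handled; other parts of
    the CTS URN syntax (e.g. versions) are left out of this sketch.
    """
    if passage is not None and work is None:
        raise ValueError("a passage can only be cited within a work")
    urn = f"urn:cts:{namespace}:{textgroup}"
    if work is not None:
        urn += f".{work}"
    if passage is not None:
        urn += f":{passage}"
    return urn

print(make_cts_urn("latinLit", "phi0978"))                       # author
print(make_cts_urn("latinLit", "phi0978", "phi001"))             # work
print(make_cts_urn("latinLit", "phi0978", "phi001", "11.4.11"))  # passage
```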

Page 18: Scaling up the Extraction of Canonical Citations in Classics

The Extraction Pipeline

Page 19: Scaling up the Extraction of Canonical Citations in Classics

Evaluation

Page 20: Scaling up the Extraction of Canonical Citations in Classics

L’Année philologique (APh)

http://www.annee-philologique.com/

Page 21: Scaling up the Extraction of Canonical Citations in Classics

APh Example

APh 75-06697 => S. Braund & G. Gilbert. 2004. “An ABC of epic ira: anger, beasts, and cannibalism” Yale Classical Studies 32:250-285

In Statius’ « Achilleid » (2, 96-102) Achilles describes his diet of wild animals in infancy, which rendered him fearless and may indicate another aspect of his character - a tendency toward aggression and anger.

The portrayal of angry warriors in Roman epic is effected for the most part not by direct descriptions but indirectly, by similes of wild beasts (e.g. Vergil, Aen. 12, 101-109; Lucan 1, 204-212; Statius, Th. 12, 736-740; Silius 5, 306-315).

These similes may be compared to two passages from Statius (Th. 1, 395-433 and 8, 383-394) that portray the onset of anger in direct narrative. Analysis of these passages demonstrates that the concept of « ira » in epic takes its moral aspect from the context.

Page 22: Scaling up the Extraction of Canonical Citations in Classics

The Data

APh

- analytical reviews (en, de, fr, es, it)
- 80 volumes (1924-)
- automatically processed vol. 75 (2004)
  - 6,694 abstracts (total = 6,946; errors = 252)
  - 350k tokens
  - 3k citations
- manually corrected ~8 % of vol. 75
  - 366 abstracts
  - 26k tokens
  - 380 citations

Page 23: Scaling up the Extraction of Canonical Citations in Classics

Precision, Recall and F1 Score

https://commons.wikimedia.org/wiki/File:Precisionrecall.svg

By Walber (Own work) [CC BY-SA 4.0]
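For reference, the three measures can be computed from counts of true positives, false positives and false negatives. This is a minimal sketch of the standard definitions; the function names are illustrative.

```python
def precision(true_pos, false_pos):
    """Share of extracted items that are correct."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0

def recall(true_pos, false_neg):
    """Share of correct items that were actually extracted."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Made-up counts, for illustration only
p = precision(true_pos=8, false_pos=2)
r = recall(true_pos=8, false_neg=2)
print(p, r, f1_score(p, r))
```

The harmonic mean penalises a large gap between the two measures: `f1_score(0.6104, 0.9094)` gives about 0.7305, well below the arithmetic mean of roughly 0.76.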

Page 24: Scaling up the Extraction of Canonical Citations in Classics

Evaluation Summary

Task    Precision   Recall    F1 Score
NER     79.24%      69.62%    73.88%
RelEx   93.33%      91.87%    92.60%
NED     61.04%      90.94%    73.05%

methods:

- NER: machine learning-based
- RelEx: rule-based
- NED: rule-based + knowledge base

manually corrected ~8 % of vol. 75

- 366 abstracts
- 26k tokens
- 380 citations

Page 25: Scaling up the Extraction of Canonical Citations in Classics

NER: Method

1. Linguistic Features (PoS Tag, neighbouring words)
2. Word-level Features:
   - punctuation: final_dot, quotation_mark, has_hyphen, bracket
   - case: mixed_caps, all_caps, init_caps, all_lower
   - number: roman, year, range, mixed_alphanum
   - patterns:
     - “Avien.” –> “Aaaaa-” (expanded)
     - “Avien.” –> “Aa-” (compressed)
3. Semantic Features (matches against dictionaries)
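A few of the word-level features can be sketched as follows. The feature names follow the list above, but their exact definitions here are assumptions, as is the symbol used for digits (the slide gives no digit example).

```python
import re

def expanded_pattern(token):
    """Replace each character by a class symbol: A upper, a lower, 0 digit, - other."""
    symbols = []
    for ch in token:
        if ch.isupper():
            symbols.append("A")
        elif ch.islower():
            symbols.append("a")
        elif ch.isdigit():
            symbols.append("0")  # assumption: digit symbol not shown on the slide
        else:
            symbols.append("-")
    return "".join(symbols)

def compressed_pattern(token):
    """Collapse each run of identical symbols in the expanded pattern."""
    return re.sub(r"(.)\1+", r"\1", expanded_pattern(token))

def word_features(token):
    """Return a subset of the word-level features for one token."""
    return {
        "final_dot": token.endswith("."),
        "has_hyphen": "-" in token,
        "all_caps": token.isupper(),
        "all_lower": token.islower(),
        "init_caps": token[:1].isupper() and token[1:] == token[1:].lower(),
        "roman": bool(re.fullmatch(r"[IVXLCDM]+", token)),
        "expanded_pattern": expanded_pattern(token),
        "compressed_pattern": compressed_pattern(token),
    }

print(word_features("Avien."))
```

The compressed pattern lets abbreviations of different lengths ("Avien.", "Thuc.") share one feature value.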

Page 26: Scaling up the Extraction of Canonical Citations in Classics

NER: Evaluation

Task: extraction of entities aauthor, awork, refauwork, refscope

Algorithm   Precision   Recall    F1 Score
CRF         79.24%      69.62%    73.88%
MaxEnt      75.29%      66.75%    70.43%
SVM         74.44%      70.21%    71.93%

Aauthor: P = 91.15%, R = 39.67%, F1 = 54.53%

Page 27: Scaling up the Extraction of Canonical Citations in Classics

RelEx: Evaluation

rule-based method

Precision Recall F1 Score

93.33% 91.87% 92.60%

Missed scope relations:

- du [REFSCOPE chant 4] de l’ [AWORK « Énéide » ]
- Le [REFSCOPE livre 13 ] de la [AWORK « Chronique » ]
- les [REFSCOPE v. 9–12 ] des [AWORK « Acharniens » ]

Page 28: Scaling up the Extraction of Canonical Citations in Classics

NED: Method

Thuc. I 89, 1s.

1. match reference against knowledge base
   - exact and approximate string matching
   - approximate string matching: edit_distance("Virgilio","Virgil") = 2
   - Thuc. → urn:cts:greekLit:tlg0003.tlg001
2. normalise the reference scope, e.g. 1.89.1–1.89.2
   - 1.89.1–2
   - 1, 89, 1–2
   - I 89, 1s.
3. assign unique ID
   - urn:cts:greekLit:tlg0003.tlg001:1.89.1–1.89.2
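Step 2 can be sketched with a few string rules. This is an illustrative reconstruction, not the talk's implementation: it assumes that a trailing "s." means "and the following unit", that Roman numerals denote ordinary levels, and that a shortened range end inherits its leading levels from the start. All function names are hypothetical.

```python
import re

_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XIV' to an integer."""
    total, previous = 0, 0
    for ch in reversed(numeral.upper()):
        value = _ROMAN_VALUES[ch]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total

def _split_levels(text):
    """Split '1, 89, 1' or 'I 89 1' into a list of integers."""
    parts = [p for p in re.split(r"[.,\s]+", text.strip()) if p]
    return [int(p) if p.isdigit() else roman_to_int(p) for p in parts]

def normalise_scope(scope):
    """Bring a reference scope to the dotted form 'a.b.c–a.b.d'."""
    scope = scope.strip()
    following = re.search(r"(\d+)\s*s\.$", scope)
    if following:
        # "1s." is read as: unit 1 and the one that follows it
        start = _split_levels(scope[:following.start()] + following.group(1))
        end = start[:-1] + [start[-1] + 1]
    else:
        parts = re.split(r"\s*[–-]\s*", scope)
        start = _split_levels(parts[0])
        end = _split_levels(parts[1]) if len(parts) > 1 else list(start)
        # a shortened end ("1.89.1–2") inherits the missing leading levels
        end = start[:max(0, len(start) - len(end))] + end
    dotted = lambda levels: ".".join(str(n) for n in levels)
    return dotted(start) if start == end else f"{dotted(start)}–{dotted(end)}"

for variant in ("1.89.1–2", "1, 89, 1–2", "I 89, 1s."):
    print(variant, "->", normalise_scope(variant))
```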

Page 29: Scaling up the Extraction of Canonical Citations in Classics

NED: Knowledge Base

underlying model: HuCit, CIDOC-CRM & FRBRoo

usages:

- extract abbreviations
- resolve implicit refs, e.g. “Herod. 4, 5-7”
- validate citations, e.g. “Thuc. 1.100.9.4”
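Validation against the knowledge base can be illustrated with a depth check. The dictionary below is a hypothetical stand-in for the knowledge base, not its actual model. Its "levels" value is an assumption: three levels for Thucydides (book, chapter, section).

```python
import re

# Hypothetical, hand-written stand-in for knowledge-base entries
KNOWLEDGE_BASE = {
    "urn:cts:greekLit:tlg0003.tlg001": {
        "abbreviations": ["Thuc."],
        "levels": 3,  # assumption: book.chapter.section
    },
}

def is_valid_scope(work_urn, scope, kb=KNOWLEDGE_BASE):
    """Check that every end of a dotted scope fits the work's citation depth."""
    entry = kb.get(work_urn)
    if entry is None:
        return False
    for endpoint in re.split(r"[–-]", scope):
        levels = endpoint.strip().split(".")
        if not all(level.isdigit() for level in levels):
            return False
        if not 1 <= len(levels) <= entry["levels"]:
            return False
    return True

thucydides = "urn:cts:greekLit:tlg0003.tlg001"
print(is_valid_scope(thucydides, "1.89.1–1.89.2"))  # fits three levels
print(is_valid_scope(thucydides, "1.100.9.4"))      # one level too deep
```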

Page 30: Scaling up the Extraction of Canonical Citations in Classics

NED: Evaluation

Matching Type       Precision   Recall    F1 Score
Exact               58.33%      62.88%    60.52%
Approximate (n=4)   61.04%      90.94%    73.05%
Approximate (n=7)   58.94%      94.76%    72.67%
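Exact and approximate matching can be contrasted in a short sketch. It uses the Levenshtein edit distance with a maximum-distance threshold. Reading n in the table as such a threshold is an assumption, since the slide does not define n. The function names and the list of known names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def match_name(mention, known_names, max_distance=0):
    """Return the known names within max_distance edits of the mention, closest first.

    max_distance=0 corresponds to exact matching (ignoring case).
    """
    scored = [(edit_distance(mention.lower(), name.lower()), name) for name in known_names]
    return [name for distance, name in sorted(scored) if distance <= max_distance]

known = ["Virgil", "Lucan"]
print(match_name("Virgilio", known))                   # exact: no match
print(match_name("Virgilio", known, max_distance=4))   # approximate: ['Virgil']
```

A looser threshold accepts more spelling and language variants, but it also admits more wrong candidates. That trade-off would fit the higher recall and lower precision reported for n=7.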

Page 31: Scaling up the Extraction of Canonical Citations in Classics

NED: Error Types

1. abbreviation is highly ambiguous (without context)

But Horace undermines the suggestion that his own poetry will forever represent the Augustan Age. Carm. 4, 15 in fact […]

2. ambiguous author mention

Esame dell’ esegesi papiracea ad Aristofane : permanenza del lavoro degli eruditi alessandrini […]

Page 32: Scaling up the Extraction of Canonical Citations in Classics

NED: Error Types (contd.)

3. implied context (title of reviewed publ.)

Dans son chap. 5 sur le squelette et la respiration, Lactance utilise des sources disparates et arrive aux limites de son savoir médical.

4. ambiguously expressed reference

Analysis of the pederastic poems in the Theocritean corpus (12 ; 23 ; 29 ; 30) reveals that Theocritus reflects on mutuality in a relationship […]

Page 33: Scaling up the Extraction of Canonical Citations in Classics

Outlook

Page 34: Scaling up the Extraction of Canonical Citations in Classics

Future Plans

1. improve overall accuracy
   - test other methods for each processing step
   - more training data (and more specialised)
   - expand knowledge base
2. make software available to others
   - streamline installation
   - improve documentation
   - offer as web service
   - offer as part of a research infrastructure
3. apply on a larger scale
   - improve performance (optimisation)
   - use of high performance and parallel computing

Page 35: Scaling up the Extraction of Canonical Citations in Classics

Thank you for your attention!

Links

[email protected]
github.com/mromanello/CRefEx
https://github.com/mromanello/APh_Corpus

