Date post: | 16-Mar-2018 |
Category: | Education |
Upload: | matteo-romanello |
Scaling up the Extraction of Canonical Citations in Classics
Matteo Romanello (DAI / KCL) @mr56k
Humanités Numériques et Antiquité – 4 Sept. 2015
Outline: Prologue, Approach, Evaluation, Outlook
Prologue
Centrality of References
Referring as scholarly primitive (Unsworth):
  Referring, Discovering, Annotating, Comparing, Sampling, Illustrating, Representing
References in Classics: canonical texts, fragmentary texts, inscriptions, papyri, manuscripts, coins, etc.
Ubiquity of References
journal articles, reviews, monographs, indexes, commentaries
Classical Commentaries
Enhanced Reading (1)
Line-by-line Bibliographical Database of Wolfram von Eschenbach’s Parzival, http://wolfram.lexcoll.com/txts/index.htm
Enhanced Reading (2)
http://labs.jstor.org/shakespeare/
Enhanced Reading (3)
Segetes, http://segetes.io/aeneid
Enhanced Reading (4)
Hellespont Project, http://gapvis.hellespont.dainst.org/#book/1/read/113/
Trends in Text Reception
Neville Morley, Number Crunching, http://thesphinxblog.com/2015/06/25/number-crunching/
Citation Networks
with applications to:
1 search
2 document clustering
3 formal network analysis
Approach
Rationale
beyond string-based search
references, not quotations
scalable approach:
  language independent
  applicable to large amounts of documents
  easily adaptable to different materials and ways of referencing
Examples:
  In Statius’ « Achilleid » (2, 96-102) Achilles describes […]
  e.g. Vergil, Aen. 12, 101-109; Lucan 1, 204-212; Statius, Th. 12, 736-740 […]
Named Entity Recognition
Computer Science > Information Extraction > NER
Question Answering System
  Q: Where did Aaron Swartz die?
  A: New York
  Two days after the prosecution rejected a counter-offer by Swartz, he was found dead in his Brooklyn, New York apartment, where he had hanged himself.
3-step process:
1 Named Entity Recognition and Classification
2 Relation Extraction
3 Named Entity Disambiguation
Citation Extraction: Step 1 (NER)
Named Entities (= citation components):
AAUTHOR = ancient author
AWORK = ancient work
REFAUWORK = concise reference to author, work or both (“Pliny, nat.”, “Thuc.”)
REFSCOPE = indication of the cited passage (“11, 4, 11”)
Citation Extraction: Step 2 (Relation Detection)
reference as relation vs. reference as monolithic entity
binary scope relation between two entities (arguments)
  arg1: aauthor | awork | refauwork
  arg2: refscope
examples:
  Ammianus (15, 8, 7)
  Trabajos 159–173
  Pliny, nat. 11, 4, 11
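A minimal sketch of how such a binary scope relation could be detected with a single rule: attach each REFSCOPE to the closest preceding AAUTHOR, AWORK or REFAUWORK entity. The `Entity` class, the `link_scopes` helper and the rule itself are illustrative assumptions, not the rules used in the actual system.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Entity types that can fill the first argument of a scope relation.
ARG1_LABELS = {"AAUTHOR", "AWORK", "REFAUWORK"}


@dataclass(frozen=True)
class Entity:
    label: str  # one of AAUTHOR, AWORK, REFAUWORK, REFSCOPE
    text: str   # surface form as found in the abstract


def link_scopes(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Pair each REFSCOPE with the nearest preceding arg1 entity (hypothetical rule)."""
    relations: List[Tuple[Entity, Entity]] = []
    last_arg1: Optional[Entity] = None
    for entity in entities:  # entities are assumed to be in text order
        if entity.label in ARG1_LABELS:
            last_arg1 = entity
        elif entity.label == "REFSCOPE" and last_arg1 is not None:
            relations.append((last_arg1, entity))
    return relations


entities = [
    Entity("REFAUWORK", "Pliny, nat."),
    Entity("REFSCOPE", "11, 4, 11"),
    Entity("AAUTHOR", "Ammianus"),
    Entity("REFSCOPE", "15, 8, 7"),
]
for arg1, arg2 in link_scopes(entities):
    print(f"scope({arg1.text!r}, {arg2.text!r})")
```

A rule that only looks backwards finds no relation when the scope comes before the work it belongs to, which is one reason a real rule set needs more than one pattern.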
Citation Extraction: Step 3 (Disambiguation)
assign each author/work/canonical reference a unique ID
IDs are CTS URNs
Canonical Text Services (CTS) Uniform Resource Names (URNs)
A machine-readable syntax for canonical references [refs]
Pliny: urn:cts:latinLit:phi0978
Pliny’s NH: urn:cts:latinLit:phi0978.phi001
Pliny, Nat. 11,4,11: urn:cts:latinLit:phi0978.phi001:11.4.11
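The three URNs above follow one pattern: `urn:cts:<namespace>:<textgroup>[.<work>][:<passage>]`. The sketch below composes and splits URNs of exactly that shape. The `CtsUrn` type and the two helper names are hypothetical, and anything beyond the forms shown on the slide (for example version identifiers) is simply carried along inside the work part.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str          # e.g. "latinLit"
    textgroup: str          # e.g. "phi0978" (Pliny)
    work: Optional[str]     # e.g. "phi001" (Naturalis Historia)
    passage: Optional[str]  # e.g. "11.4.11"


def format_urn(urn: CtsUrn) -> str:
    """Serialise the components into a CTS URN string."""
    work_part = urn.textgroup if urn.work is None else f"{urn.textgroup}.{urn.work}"
    base = f"urn:cts:{urn.namespace}:{work_part}"
    return base if urn.passage is None else f"{base}:{urn.passage}"


def parse_urn(value: str) -> CtsUrn:
    """Split a CTS URN string into its components."""
    parts = value.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {value!r}")
    # Everything after the first dot is kept as the work component.
    textgroup, _, work = parts[3].partition(".")
    passage = parts[4] if len(parts) == 5 else None
    return CtsUrn(parts[2], textgroup, work or None, passage)


pliny_nh_passage = CtsUrn("latinLit", "phi0978", "phi001", "11.4.11")
print(format_urn(pliny_nh_passage))
print(parse_urn("urn:cts:latinLit:phi0978"))
```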
The Extraction Pipeline
Evaluation
L’Année philologique (APh)
http://www.annee-philologique.com/
APh Example
APh 75-06697 => S. Braund & G. Gilbert. 2004. “An ABC of epic ira: anger, beasts, and cannibalism” Yale Classical Studies 32:250-285
In Statius’ « Achilleid » (2, 96-102) Achilles describes his diet of wild animals in infancy, which rendered him fearless and may indicate another aspect of his character - a tendency toward aggression and anger.
The portrayal of angry warriors in Roman epic is effected for the most part not by direct descriptions but indirectly, by similes of wild beasts (e.g. Vergil, Aen. 12, 101-109; Lucan 1, 204-212; Statius, Th. 12, 736-740; Silius 5, 306-315).
These similes may be compared to two passages from Statius (Th. 1, 395-433 and 8, 383-394) that portray the onset of anger in direct narrative. Analysis of these passages demonstrates that the concept of « ira » in epic takes its moral aspect from the context.
The Data
APh
  analytical reviews (en, de, fr, es, it)
  80 volumes (1924-)
automatically processed vol. 75 (2004)
  6,694 abstracts (total = 6,946; errors = 252)
  350k tokens
  3k citations
manually corrected ~8 % of vol. 75
  366 abstracts
  26k tokens
  380 citations
Precision, Recall and F1 Score
https://commons.wikimedia.org/wiki/File:Precisionrecall.svg
By Walber (Own work) [CC BY-SA 4.0]
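As a reminder of how the three measures relate: precision is the share of extracted items that are correct, recall is the share of correct items that were extracted, and F1 is their harmonic mean. The sketch below computes them from counts of true positives, false positives and false negatives. The counts are invented for illustration and are not taken from the evaluation.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of extracted items that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of correct items that were actually extracted."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts: 80 correct extractions, 20 spurious, 40 missed.
tp, fp, fn = 80, 20, 40
p = precision(tp, fp)
r = recall(tp, fn)
print(f"P = {p:.2%}, R = {r:.2%}, F1 = {f1_score(p, r):.2%}")
# prints: P = 80.00%, R = 66.67%, F1 = 72.73%
```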
Evaluation Summary
Task    Precision   Recall    F1 Score
NER     79.24%      69.62%    73.88%
RelEx   93.33%      91.87%    92.60%
NED     61.04%      90.94%    73.05%
methods:
  NER: machine learning-based
  RelEx: rule-based
  NED: rule-based + knowledge base
manually corrected ~8 % of vol. 75
  366 abstracts
  26k tokens
  380 citations
NER: Method
1 Linguistic Features (PoS tag, neighbouring words)
2 Word-level Features:
  punctuation: final_dot, quotation_mark, has_hyphen, bracket
  case: mixed_caps, all_caps, init_caps, all_lower
  number: roman, year, range, mixed_alphanum
  patterns:
    “Avien.” –> “Aaaaa-” (expanded)
    “Avien.” –> “Aa-” (compressed)
3 Semantic Features (matches against dictionaries)
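The word-level features can be sketched as plain string tests. The code below reproduces the two pattern features for “Avien.” and a few of the punctuation and case flags. The function names, the exact definition of each flag, and the mapping of digits to `0` are assumptions made for this illustration.

```python
import re
from typing import Dict, Union


def expanded_pattern(token: str) -> str:
    """Map every character to a class symbol: A (upper), a (lower), 0 (digit), - (other)."""
    symbols = []
    for ch in token:
        if ch.isupper():
            symbols.append("A")
        elif ch.islower():
            symbols.append("a")
        elif ch.isdigit():
            symbols.append("0")  # assumption: digits get their own class
        else:
            symbols.append("-")
    return "".join(symbols)


def compressed_pattern(token: str) -> str:
    """Collapse runs of the same class symbol into a single symbol."""
    return re.sub(r"(.)\1+", r"\1", expanded_pattern(token))


def word_features(token: str) -> Dict[str, Union[bool, str]]:
    """A few of the word-level features, computed for one token."""
    return {
        "final_dot": token.endswith("."),
        "has_hyphen": "-" in token,
        "init_caps": token[:1].isupper() and token[1:].islower(),
        "all_caps": token.isupper(),
        "all_lower": token.islower(),
        "pattern_expanded": expanded_pattern(token),
        "pattern_compressed": compressed_pattern(token),
    }


print(word_features("Avien."))
```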
NER: Evaluation
Task: extraction of entities aauthor, awork, refauwork, refscope
Algorithm   Precision   Recall    F1 Score
CRF         79.24%      69.62%    73.88%
MaxEnt      75.29%      66.75%    70.43%
SVM         74.44%      70.21%    71.93%
Aauthor: P = 91.15%, R = 39.67%, F1 = 54.53%
RelEx: Evaluation
rule-based method
Precision Recall F1 Score
93.33% 91.87% 92.60%
Missed scope relations:
  du [REFSCOPE chant 4 ] de l’ [AWORK « Énéide » ]
  Le [REFSCOPE livre 13 ] de la [AWORK « Chronique » ]
  les [REFSCOPE v. 9–12 ] des [AWORK « Acharniens » ]
NED: Method
Thuc. I 89, 1s.
1 match reference against knowledge base
  exact and approximate string matching
  approximate string matching: edit_distance("Virgilio", "Virgil") = 2
  Thuc. → urn:cts:greekLit:tlg0003.tlg001
2 normalise the reference scope
  e.g. 1.89.1–1.89.2
  1.89.1–2
  1, 89, 1–2
  I 89, 1s.
3 assign unique ID
  urn:cts:greekLit:tlg0003.tlg001:1.89.1–1.89.2
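Step 1 can be sketched as a lookup that first tries an exact match and then falls back to edit distance. The code below uses the standard Levenshtein distance (unit cost for insertion, deletion and substitution). The toy `KNOWLEDGE_BASE`, the `match_reference` helper and the `max_distance` threshold are assumptions for illustration and do not reflect the actual knowledge base or its settings.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb
            ))
        previous = current
    return previous[-1]


# Toy knowledge base (assumption): known surface form -> CTS URN.
KNOWLEDGE_BASE: Dict[str, str] = {
    "Thuc.": "urn:cts:greekLit:tlg0003.tlg001",
    "Pliny, nat.": "urn:cts:latinLit:phi0978.phi001",
}


def match_reference(reference: str, max_distance: int = 2) -> List[Tuple[str, int]]:
    """Return (URN, distance) candidates, closest first; exact matches win outright."""
    if reference in KNOWLEDGE_BASE:
        return [(KNOWLEDGE_BASE[reference], 0)]
    scored = [(urn, edit_distance(reference, label)) for label, urn in KNOWLEDGE_BASE.items()]
    return sorted((c for c in scored if c[1] <= max_distance), key=lambda c: c[1])


print(match_reference("Thuc."))    # exact match
print(match_reference("Thucyd."))  # approximate match at distance 2
```

Raising the threshold lets more variant spellings through but also admits more wrong candidates, so the value has to be tuned on annotated data.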
NED: Knowledge Base
underlying model: HuCit, CIDOC-CRM & FRBRoo
usages:
  extract abbreviations
  resolve implicit refs, e.g. “Herod. 4, 5-7”
  validate citations, e.g. “Thuc. 1.100.9.4”
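One way a knowledge base could support validation is by recording how many levels the citation scheme of each work has, so that a scope with too many levels is rejected. The sketch below assumes the conventional three-level scheme for Thucydides (book, chapter, section). The `CITATION_LEVELS` table and the `is_valid_scope` helper are hypothetical and say nothing about how the actual knowledge base stores this information.

```python
from typing import Dict, Tuple

# Hypothetical excerpt: work URN -> names of its citation levels.
CITATION_LEVELS: Dict[str, Tuple[str, ...]] = {
    "urn:cts:greekLit:tlg0003.tlg001": ("book", "chapter", "section"),  # Thucydides
}


def is_valid_scope(work_urn: str, scope: str) -> bool:
    """Accept a dotted numeric scope only if it is no deeper than the work's scheme."""
    levels = CITATION_LEVELS.get(work_urn)
    if levels is None:
        return False  # unknown work: nothing to validate against
    parts = scope.split(".")
    return len(parts) <= len(levels) and all(part.isdigit() for part in parts)


thucydides = "urn:cts:greekLit:tlg0003.tlg001"
print(is_valid_scope(thucydides, "1.89.1"))     # True: book.chapter.section
print(is_valid_scope(thucydides, "1.100.9.4"))  # False: one level too many
```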
NED: Evaluation
Matching Type       Precision   Recall    F1 Score
Exact               58.33%      62.88%    60.52%
Approximate (n=4)   61.04%      90.94%    73.05%
Approximate (n=7)   58.94%      94.76%    72.67%
NED: Error Types
1 abbreviation is highly ambiguous (without context)
  But Horace undermines the suggestion that his own poetry will forever represent the Augustan Age. Carm. 4, 15 in fact […]
2 ambiguous author mention
  Esame dell’esegesi papiracea ad Aristofane: permanenza del lavoro degli eruditi alessandrini […]
NED: Error Types (contd.)
3 implied context (title of reviewed publ.)
  Dans son chap. 5 sur le squelette et la respiration, Lactance utilise des sources disparates et arrive aux limites de son savoir médical.
4 ambiguously expressed reference
  Analysis of the pederastic poems in the Theocritean corpus (12; 23; 29; 30) reveals that Theocritus reflects on mutuality in a relationship […]
Outlook
Future Plans
1 improve overall accuracy
  test other methods for each processing step
  more training data (and more specialised)
  expand knowledge base
2 make software available to others
  streamline installation
  improve documentation
  offer as web service
  offer as part of a research infrastructure
3 apply on a larger scale
  improve performance (optimisation)
  use of high performance and parallel computing
Thank you for your attention!
Links
[email protected]
https://github.com/mromanello/CRefEx
https://github.com/mromanello/APh_Corpus