Automatic Extraction From and Reasoning About Genealogical Records: A Prototype
By
Charla J. Woodbury
A thesis submitted to the faculty of Brigham Young University
in partial fulfillment of the requirements for the degree of
Master of Science
Digital Images – Human Index
• Large number of competing family history websites
• Digital images
• Human indexes
• Researchers hunting through records and indexes to put families together
Problem
Large amounts of primary genealogical data
Big projects to index and extract records
Two independent indexers and adjudication
Millions of human hours used to index or match records for names and families
Automated Extraction Solution
Create a specialized extraction ontology to interpret and label genealogical data
Add rules and logic that
  Label family roles - husband, daughter, etc.
  Link family relationships
    HUSBAND – WIFE
    PARENT – CHILD
Outline
1. Data Preparation
2. Ontology Extraction System (OntoES)
3. OWL File and SWRL Rules
4. SPARQL Queries
5. Experimental Results
6. Conclusions
1. Data Preparation
Collect machine-readable records from three different countries
Format as HTML for extraction
Prepare lexicons for names, places, etc.
New England Vital Records – Beverly, Massachusetts
1668-1849
Danish Parish – Maglebye, Praesto
1646-1813
English Parish – South Petherton, Somersetshire
1574-1901
SOUTH PETHERTON MARRIAGES (from genuki)
same day 1576 Nicholas Patch and Christian Denman
26 Jan 1605 Richard Patch and Joan Lavor
25-Sep 1613 John Elliott and Joan Woodbery
7-Aug 1615 Thomas Prime and Maria Parry
29-Jan 1616 William Woodbery and Elizabeth Patch
2-May 1620 William Hillerd and Fortu: Patch
17-Sep 1622 Nicholas Patch and Elizabeth Owsley
22-Jan 1627 Richard Patch and Mary White
15-Jan 1630 Andrew Elliott and Joan Patch
12-Feb 1639 Andrew Elliott and Joan Pitts
Transcript of the original record
1576/1577 eodem die Nicholaus Patch Christinam Denman
26 Jan 1605 Richard Patch et Joanna Lavor
1613 Septembris 26 Johannes Elliott et Joanna Woodbery matrimonis cominguntur
1615 Augusti 7 Thoms Prime et Maria Patch matrimonio cominguntur
1616/1617 Januarij 29 Wilhelmus Woodbery et Elizabetha Patch matrimonio cominguntur
1620 Maij 2 : Wilhelmus Hillerd et Fortu: Patch
1622 Septembris 17 Nicholas Patch et Elizabetha Owsley matrimonio cominguntur
1627/1628 Januarij 22 : Richardus Patch et Maria White matrimonio cominguntur
1630/1631 Januarij 15 Andreas Elliott et Joanna Patch matrimonio cominguntur
1639/1640 Februarij 12 Andreas Elliott et Joanna Pittes matrimonio cominguntur
2. Ontology Extraction System
OntoES: automatically interpret and correctly label genealogical data using
  Data frames
  Regular expressions
  Lexicons
  Date conversion methods
Marriage Ontology
Data Frame Editor
Regular expressions
MARDATE
Value expression
  EXAMPLE type: 25-Sep 1613
  (0\d|1\d|2\d|30|31|\d)-{Month}\.?\s*(\d\d\d\d)
Keyword expression
  (\b(md\.?|marry|marriage|married|maried|wed|wedding)\b)
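The two expressions above can be tried outside OntoES with a short Python sketch. It assumes that `{Month}` is a placeholder expanded into an alternation of lexicon entries; the slides do not show how OntoES performs that expansion, and the tiny `MONTH_LEXICON` and the function names here are hypothetical.

```python
import re

# Hypothetical mini-lexicon; the real MONTH lexicon in the slides is much larger.
MONTH_LEXICON = ["jan", "feb", "apr", "may", "aug", "sep", "september", "dec"]

# Expressions copied from the slide; {Month} is left as a placeholder.
VALUE_TEMPLATE = r"(0\d|1\d|2\d|30|31|\d)-{Month}\.?\s*(\d\d\d\d)"
KEYWORD_PATTERN = r"(\b(md\.?|marry|marriage|married|maried|wed|wedding)\b)"


def build_value_regex(template, lexicon):
    """Expand {Month} into an alternation of lexicon entries (longest first)."""
    entries = sorted(lexicon, key=len, reverse=True)
    month_group = "(" + "|".join(re.escape(e) for e in entries) + ")"
    return re.compile(template.replace("{Month}", month_group), re.IGNORECASE)


def find_mardate(text, lexicon=MONTH_LEXICON):
    """Return (day, month, year) strings for the first date found, else None."""
    match = build_value_regex(VALUE_TEMPLATE, lexicon).search(text)
    return match.groups() if match else None


def has_marriage_keyword(text):
    """True if any keyword from the slide's keyword expression occurs in text."""
    return re.search(KEYWORD_PATTERN, text, re.IGNORECASE) is not None


print(find_mardate("25-Sep 1613 John Elliott and Joan Woodbery"))
print(has_marriage_keyword("John and Joan were married"))
```

Sorting the lexicon longest-first keeps a short entry such as "sep" from shadowing a longer one such as "september" when both are present.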
Sample MONTH LEXICON
1Ober 7ber 8ber 9ber apr april aprilis aug august augusti augustus avr avril avrilis dec december
decembr decembre decembri feb febr februari february jan januarij january jul juli julius july jun june
Object Level
CANONICALIZATION METHODS
inside the ontology
Regularize date (Julian format: YYYYddd)
  1620 2-May → 1620093
Display stored Julian format as DD MMM YYYY
  1620093 → 2 MAY 1620
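A minimal sketch of a YYYYddd round trip, assuming that ddd is the conventional day of the year (1 January = 001). Under that assumption 2 May 1620 encodes as 1620123. The slide shows 1620093, which suggests OntoES numbers days differently; this sketch does not try to reproduce that. The function names are hypothetical, and Python's `datetime.date` applies Gregorian leap-year rules.

```python
from datetime import date, timedelta

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def to_yyyyddd(year, month, day):
    """Encode a date as the integer YYYYddd (ddd = day of the year, assumed)."""
    day_of_year = date(year, month, day).timetuple().tm_yday
    return year * 1000 + day_of_year


def from_yyyyddd(value):
    """Decode YYYYddd back into a 'D MMM YYYY' display string."""
    year, day_of_year = divmod(value, 1000)
    d = date(year, 1, 1) + timedelta(days=day_of_year - 1)
    return f"{d.day} {MONTHS[d.month - 1]} {d.year}"


stored = to_yyyyddd(1620, 5, 2)
print(stored, "->", from_yyyyddd(stored))
```

Storing the date as a single integer makes entries sortable and comparable regardless of how the original record spelled the month.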
Feast Dates
Dates expressed as a holy day
Fixed Dates
  Christmas 1720 → 25 DEC 1720
Moveable Dates around Easter (36 possible Easter dates with leap year variation)
  1723 Dnica Septuagesima → 24 JAN 1723
Same day as previous entry
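The feast-date conversions above can be sketched as follows. This is an illustration only: it assumes the Gregorian calendar (the 1723 Septuagesima example on the slide is consistent with Gregorian reckoning), computes Easter with the well-known anonymous Gregorian algorithm, and places Septuagesima 63 days before Easter. The feast tables and function names are hypothetical, and records kept under the Julian calendar would need a different Easter computation.

```python
from datetime import date, timedelta

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Hypothetical feast tables; a real lexicon would hold many more entries.
FIXED_FEASTS = {"christmas": (12, 25)}
MOVEABLE_FEASTS = {"septuagesima": -63, "easter": 0}  # days relative to Easter


def gregorian_easter(year):
    """Easter Sunday in the Gregorian calendar (anonymous Gregorian algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def feast_to_date(feast, year):
    """Resolve a feast name and year to a calendar date."""
    key = feast.lower()
    if key in FIXED_FEASTS:
        month, day = FIXED_FEASTS[key]
        return date(year, month, day)
    return gregorian_easter(year) + timedelta(days=MOVEABLE_FEASTS[key])


def display(d):
    """Format a date as D MMM YYYY, matching the slide's display format."""
    return f"{d.day} {MONTHS[d.month - 1]} {d.year}"


print(display(feast_to_date("Christmas", 1720)))
print(display(feast_to_date("Septuagesima", 1723)))
```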
Run Ontology
Input
  Ontology (Created with OntoES)
  HTML data (Hypertext Markup Language)
Output
  RDF database (Resource Description Framework)
  OWL file (Web Ontology Language)
Ontology Workbench
Extracted Marriages
Columns: BetDate, MarDate, NameM, NameF, NameU
same day 1576   Nicholas Patch     Christian Denman
26 JAN 1605     Richard Patch      Joan Lavor
26 SEP 1613     John Elliott       Joan Woodbery
7 AUG 1615      Thomas Prime       Maria Parry
29 JAN 1616     William Woodbery   Elizabeth Patch
2 MAY 1620      William Hillerd    Fortu: Patch
17 SEP 1622     Nicholas Patch     Elizabeth Owsley
22 JAN 1627     Richard Patch      Mary White
16 JAN 1630     Andrew Elliott     Joan Patch
12 FEB 1639     Andrew Elliott     Joan Pitts
3. OWL File and SWRL Rules
OWL HEADER
<owl:Class rdf:ID="MarriageRecord"/>
<owl:Class rdf:ID="Person"/>
<owl:Class rdf:ID="NameU"/>
<owl:DatatypeProperty rdf:ID="NameUValue">
  <rdfs:domain rdf:resource="#NameU"/>
  <rdfs:range rdf:resource="&xsd;string"/>
</owl:DatatypeProperty>
PERSON - NAMEU
<owl:ObjectProperty rdf:ID="Person-NameU">
  <rdfs:domain rdf:resource="#Person"/>
  <rdfs:range rdf:resource="#NameU"/>
  <owl:inverseOf>
    <owl:ObjectProperty rdf:ID="NameU-Person"/>
  </owl:inverseOf>
</owl:ObjectProperty>
Sample RDF Triples
Person_10 | sameAs | Person_10
Person_10 | type | Thing
Person_10 | type | Person
NameU_0 | NameUValue | "Christian Denman"
NameU_0 | sameAs | NameU_0
NameU_0 | type | Thing
NameU_0 | type | NameU
NameM_4 | NameMValue | "Nicholas Patch"
NameM_4 | sameAs | NameM_4
NameM_4 | type | Thing
NameM_4 | type | NameM
SWRL Rules
Define OWL Class
  Example – Husband
  <owl:Class rdf:ID="Husband"/>
Define Rule
  Example – Person with male name is a Husband
  Person-NameM(?x,?y) -> Husband(?x)
Marriage – Person_10 to Person_4
Person_10
<Person rdf:ID="Person_10">
  <Person-NameU rdf:resource="#NameU_0" />
</Person>
MarriageRecord_7
<MarriageRecord rdf:ID="MarriageRecord_7">
  <MarriageRecord-Person rdf:resource="#Person_4" />
  <MarriageRecord-Person rdf:resource="#Person_10" />
</MarriageRecord>
NameM_4
<NameM rdf:ID="NameM_4">
  <NameMValue>Nicholas Patch</NameMValue>
</NameM>
Person_4
<Person rdf:ID="Person_4">
  <Person-NameM rdf:resource="#NameM_4" />
</Person>
Rule HEAD in OWL file
<swrl:Imp rdf:ID="Def-Husband">
  <swrl:head rdf:parseType="Collection">
    <swrl:ClassAtom>
      <swrl:argument1 rdf:resource="#x"/>
      <swrl:classPredicate rdf:resource="#Husband"/>
    </swrl:ClassAtom>
  </swrl:head>
Rule BODY in OWL file
  <swrl:body>
    <swrl:IndividualPropertyAtom>
      <swrl:propertyPredicate rdf:resource="#Person-NameM"/>
      <swrl:argument1 rdf:resource="#x"/>
      <swrl:argument2 rdf:resource="#y"/>
    </swrl:IndividualPropertyAtom>
  </swrl:body>
</swrl:Imp>
Related Rules
If NameF is populated, then the value in NameU is the Husband
Person-NameF(?w,?v) ∧ MarriageRecord-Person(?z,?w) ∧ MarriageRecord-Person(?z,?x) ∧ Person-NameU(?x,?y)
-> Husband(?x)
HusbandOf Rule
Husband(?x) ∧ Wife(?y) ∧ MarriageRecord-Person(?z,?x) ∧ MarriageRecord-Person(?z,?y)
-> HusbandOf(?x,?y)
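How such rules produce new facts can be illustrated with a small forward-chaining sketch in Python. It is not how an SWRL reasoner is implemented; it only mimics the effect on the Person_4 / Person_10 facts from the marriage example. The Wife rule used here is an assumption, written as the mirror image of the related Husband rule, because the deck does not show it verbatim.

```python
# Facts as (predicate, args) tuples, taken from the Person_4 / Person_10 example.
FACTS = {
    ("Person-NameM", ("Person_4", "NameM_4")),
    ("Person-NameU", ("Person_10", "NameU_0")),
    ("MarriageRecord-Person", ("MarriageRecord_7", "Person_4")),
    ("MarriageRecord-Person", ("MarriageRecord_7", "Person_10")),
}

# Each rule is (body atoms, head atom); arguments starting with "?" are variables.
RULES = [
    ([("Person-NameM", ("?x", "?y"))], ("Husband", ("?x",))),
    # Assumed Wife rule (mirror of the related Husband rule; not in the slides).
    ([("Person-NameM", ("?w", "?v")), ("MarriageRecord-Person", ("?z", "?w")),
      ("MarriageRecord-Person", ("?z", "?x")), ("Person-NameU", ("?x", "?y"))],
     ("Wife", ("?x",))),
    ([("Husband", ("?x",)), ("Wife", ("?y",)),
      ("MarriageRecord-Person", ("?z", "?x")),
      ("MarriageRecord-Person", ("?z", "?y"))],
     ("HusbandOf", ("?x", "?y"))),
]


def match(atom, fact, binding):
    """Unify one body atom with one fact; return the extended binding or None."""
    (pred, args), (fact_pred, fact_args) = atom, fact
    if pred != fact_pred or len(args) != len(fact_args):
        return None
    new = dict(binding)
    for arg, value in zip(args, fact_args):
        if arg.startswith("?"):
            if new.setdefault(arg, value) != value:
                return None
        elif arg != value:
            return None
    return new


def forward_chain(facts, rules):
    """Apply all rules repeatedly until no new fact is derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, (head_pred, head_args) in rules:
            bindings = [{}]
            for atom in body:
                bindings = [b2 for b in bindings for f in facts
                            if (b2 := match(atom, f, b)) is not None]
            for b in bindings:
                derived = (head_pred, tuple(b[a] for a in head_args))
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


inferred = forward_chain(FACTS, RULES) - FACTS
for fact in sorted(inferred):
    print(fact)
```

The loop repeats because HusbandOf depends on Husband and Wife, which must be derived first.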
Auxiliary Name Rules
NameM(?x) -> Name(?x)
NameF(?x) -> Name(?x)
NameU(?x) -> Name(?x)
NameMValue(?x) -> NameValue(?x)
NameFValue(?x) -> NameValue(?x)
NameUValue(?x) -> NameValue(?x)
Person-NameM(?x,?y) -> Person-Name(?x,?y)
Person-NameF(?x,?y) -> Person-Name(?x,?y)
Person-NameU(?x,?y) -> Person-Name(?x,?y)
4. SPARQL Query
Who is Husband of Christian Denman?
PREFIX : <http://www.deg.byu.edu/ontology/Marriage#>
SELECT ?Husband
WHERE
{
  ?X :NameValue "Christian Denman" .
  ?Y :Person-Name ?X .
  ?W :HusbandOf ?Y .
  ?W :Person-Name ?V .
  ?V :NameValue ?Husband
}
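The query is a chain of five triple patterns joined on shared variables. The sketch below walks the same pattern over a handful of in-memory triples in plain Python, purely to show how the variables bind step by step. It is not a SPARQL engine. The triples are written by hand from the example individuals and assume the rule-derived facts (NameValue, Person-Name, HusbandOf) are already present.

```python
# Hand-written triples for the example; the derived facts are assumed present.
TRIPLES = [
    ("NameU_0", "NameValue", "Christian Denman"),
    ("NameM_4", "NameValue", "Nicholas Patch"),
    ("Person_10", "Person-Name", "NameU_0"),
    ("Person_4", "Person-Name", "NameM_4"),
    ("Person_4", "HusbandOf", "Person_10"),
]


def solve(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        matched = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    matched = False
                    break
            elif term != value:
                matched = False
                break
        if matched:
            yield from solve(rest, triples, new)


def husbands_of(name, triples=TRIPLES):
    """Mirror the WHERE clause of the slide's query for a given wife's name."""
    patterns = [
        ("?X", "NameValue", name),
        ("?Y", "Person-Name", "?X"),
        ("?W", "HusbandOf", "?Y"),
        ("?W", "Person-Name", "?V"),
        ("?V", "NameValue", "?Husband"),
    ]
    return [b["?Husband"] for b in solve(patterns, triples)]


print(husbands_of("Christian Denman"))
```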
Query Results
Husband
=======================================
"Nicholas Patch"^^http://www.w3.org/2001/XMLSchema#string
"Nicholas Patch" because:
  NameValue("Nicholas Patch") and Name-NameValue(n1, "Nicholas Patch") and Name(n1) is NameM(n1) and Person-NameM(p1, n1)
  NameValue("Christian Denman") and Name-NameValue(n2, "Christian Denman") and Name(n2) is NameU(n2) and Person-NameU(p2, n2)
Husband(p1) because: Person-NameM(p1, n1)
Wife(p2) because: Person-NameU(p2, n2) and Person-MarriageRecord(p2, r1) and MarriageRecord-Person(r1, p1) and Person-NameM(p1, n1)
HusbandOf(p1, p2) because: Husband(p1) and Wife(p2) and MarriageRecord-Person(r1, p1) and MarriageRecord-Person(r1, p2)
5. Experimental Results
Extraction Results
American Extraction Problem
Rule Results
Extraction Results
            MARRIAGES  ENTITIES  RECALL %       ERRORS  PRECISION
English           188       594   588  99.0%         8      98.7%
American          608      1824  1630  89.4%        34      98.0%
Danish            171       543   538  99.1%        10      98.2%
            BIRTHS
English          3153      9489  9394  99.0%        61      99.4%
American          675      2055  1809  88.0%        33      98.2%
Danish            677      2061  2042  99.1%        15      99.3%
            DEATHS
English          3458      8675  8589  99.0%        83      99.0%
American          510      1305  1148  88.0%        28      97.6%
Danish            833      2113  2093  99.1%        19      99.1%
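The percentages in the table are consistent with recall computed as the count in the RECALL column divided by ENTITIES, and precision computed as that same count divided by the count plus ERRORS. These formulas are inferred from the numbers, not stated in the slides, and the names below are hypothetical. A short check over the marriage rows:

```python
def recall(correct, entities):
    """Percentage of expected entities that were extracted (assumed formula)."""
    return 100.0 * correct / entities


def precision(correct, errors):
    """Percentage of extracted items that were not errors (assumed formula)."""
    return 100.0 * correct / (correct + errors)


# (entities, count in RECALL column, errors) for the MARRIAGES rows.
MARRIAGES = {
    "English": (594, 588, 8),
    "American": (1824, 1630, 34),
    "Danish": (543, 538, 10),
}

for country, (entities, correct, errors) in MARRIAGES.items():
    print(country,
          round(recall(correct, entities), 1),
          round(precision(correct, errors), 1))
```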
American Difficulty
BIRTH
WOODBURY, Charles Henry [Charles William, P. R. 4.], s. Henry [housewright. dup.] and Henrietta (Galloup), Dec. 4, 1845.
Extra information inside brackets & parentheses
  Charles William – twin of Charles Henry
  Henry [housewright – identified as NAME
  Henrietta (Galloup) – identified as NAME
Rules Results
100% Precision and Recall
(Once rules are well-defined, the results are perfect.)
Database Size
(The RDF database is much larger when rule triples are added.)
  NEW PROPERTIES – husband, wife, parent, child
  NEW LINKS
Size Impact of Adding Rules
             MARRIAGE (21 rules)                             EVENT (30 rules)
             Triples  OWL (# lines)  OWL File (kilobytes)    Triples  OWL (# lines)  OWL File (kilobytes)
OWL File         814            498                    14       2232           1405                    15
W/Rules         1009            785                    31       2983           1873                    75
Difference       195            287                    17        751            468                    60
Increase      23.96%         57.63%               121.43%     33.65%         33.31%               400.00%
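The Difference and Increase rows follow directly from the two rows above them. A small sketch (the function name is hypothetical) reproduces them for the triple counts and file sizes:

```python
def growth(before, after):
    """Return (absolute difference, percentage increase rounded to 2 places)."""
    diff = after - before
    return diff, round(100.0 * diff / before, 2)


# Triple counts without and with rules, from the table.
print("MARRIAGE triples:", growth(814, 1009))
print("EVENT triples:", growth(2232, 2983))
# OWL file size in kilobytes for the EVENT ontology.
print("EVENT file size:", growth(15, 75))
```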
6. Conclusions
Speed up data indexing
Make production of a full index easier
Ground the index in original documents
Provide for inferred facts
Simplify as well as augment record search
Help link records and form family groups and ancestral lines