Date post: 29-Mar-2015
Page 1: Automatic Extraction From and Reasoning About Genealogical Records: A Prototype By Charla J. Woodbury A thesis submitted to the faculty of Brigham Young.

Automatic Extraction From and Reasoning About Genealogical Records: A Prototype

By Charla J. Woodbury

A thesis submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of Master of Science

Digital Images – Human Index

• Large number of competing family history websites
  • Digital images
  • Human indexes
• Researchers hunting through records and indexes to put families together

Problem

• Large amounts of primary genealogical data
• Big projects to index and extract records
• Two independent indexers and adjudication
• Millions of human hours used to index or match records for names and families

Automated Extraction Solution

• Create a specialized extraction ontology to interpret and label genealogical data
• Add rules and logic that
  • Label family roles (husband, daughter, etc.)
  • Link family relationships
    • HUSBAND – WIFE
    • PARENT – CHILD

Outline

1. Data Preparation
2. Ontology Extraction System (OntoES)
3. OWL File and SWRL Rules
4. SPARQL Queries
5. Experimental Results
6. Conclusions

1. Data Preparation

• Collect machine-readable records from three different countries
• Format the records as HTML for extraction
• Prepare lexicons for names, places, etc.

New England Vital Records – Beverly, Massachusetts

1668-1849

Danish Parish – Maglebye, Praesto

1646-1813

English Parish – South Petherton, Somersetshire

1574-1901

SOUTH PETHERTON MARRIAGES (from genuki)

same day 1576 Nicholas Patch and Christian Denman
26 Jan 1605 Richard Patch and Joan Lavor
25-Sep 1613 John Elliott and Joan Woodbery
7-Aug 1615 Thomas Prime and Maria Parry
29-Jan 1616 William Woodbery and Elizabeth Patch
2-May 1620 William Hillerd and Fortu: Patch
17-Sep 1622 Nicholas Patch and Elizabeth Owsley
22-Jan 1627 Richard Patch and Mary White
15-Jan 1630 Andrew Elliott and Joan Patch
12-Feb 1639 Andrew Elliott and Joan Pitts

Transcript of the original record

1576/1577 eodem die Nicholaus Patch Christinam Denman
26 Jan 1605 Richard Patch et Joanna Lavor
1613 Septembris 26 Johannes Elliott et Joanna Woodbery matrimonis cominguntur
1615 Augusti 7 Thoms Prime et Maria Patch matrimonio cominguntur
1616/1617 Januarij 29 Wilhelmus Woodbery et Elizabetha Patch matrimonio cominguntur
1620 Maij 2 : Wilhelmus Hillerd et Fortu: Patch
1622 Septembris 17 Nicholas Patch et Elizabetha Owsley matrimonio cominguntur
1627/1628 Januarij 22 : Richardus Patch et Maria White matrimonio cominguntur
1630/1631 Januarij 15 Andreas Elliott et Joanna Patch matrimonio cominguntur
1639/1640 Februarij 12 Andreas Elliott et Joanna Pittes matrimonio cominguntur

2. Ontology Extraction System

OntoES: automatically interpret and correctly label genealogical data using

• Data frames
• Regular expressions
• Lexicons
• Date conversion methods

Marriage Ontology

Data Frame Editor

Regular expressions

MARDATE

Value expression
EXAMPLE type: 25-Sep 1613
(0\d|1\d|2\d|30|31|\d)-{Month}\.?\s*(\d\d\d\d)

Keyword expression
(\b(md\.?|marry|marriage|married|maried|wed|wedding)\b)
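The MARDATE value and keyword expressions can be tried out in a short Python sketch. It assumes that `{Month}` expands to an alternation over the month lexicon; the `MONTHS` mapping, the month numbers attached to each spelling, and the helper name `parse_mardate` are illustrative and are not taken from OntoES.

```python
import re

# Month spellings mapped to month numbers. Most spellings come from the
# sample month lexicon on the slides; "may" and "sep" are added for this
# sketch so that the example dates can be parsed.
MONTHS = {
    "jan": 1, "januarij": 1, "january": 1,
    "feb": 2, "febr": 2, "februari": 2, "february": 2,
    "apr": 4, "april": 4, "aprilis": 4,
    "may": 5,
    "jun": 6, "june": 6,
    "jul": 7, "juli": 7, "julius": 7, "july": 7,
    "aug": 8, "august": 8, "augusti": 8, "augustus": 8,
    "sep": 9, "7ber": 9,
    "dec": 12, "december": 12, "decembr": 12, "decembre": 12, "decembri": 12,
}

# Assumed expansion of {Month}: longest spellings first.
MONTH_ALT = "|".join(sorted(map(re.escape, MONTHS), key=len, reverse=True))

# Value expression from the slide, with {Month} substituted.
MARDATE = re.compile(
    rf"(0\d|1\d|2\d|30|31|\d)-({MONTH_ALT})\.?\s*(\d\d\d\d)", re.IGNORECASE
)

# Keyword expression from the slide.
KEYWORD = re.compile(
    r"\b(md\.?|marry|marriage|married|maried|wed|wedding)\b", re.IGNORECASE
)


def parse_mardate(text):
    """Return (day, month, year) for the first MARDATE match, or None."""
    m = MARDATE.search(text)
    if m is None:
        return None
    day, month, year = m.groups()
    return int(day), MONTHS[month.lower()], int(year)
```

For example, `parse_mardate("25-Sep 1613 John Elliott and Joan Woodbery")` yields the day, month number, and year of the marriage, while a line without a date in this shape yields `None`.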

Sample MONTH LEXICON

1Ober 7ber 8ber 9ber apr april aprilis aug august augusti augustus avr avril avrilis dec december
decembr decembre decembri feb febr februari february jan januarij january jul juli julius july jun june

Object Level

CANONICALIZATION METHODS

inside the ontology

• Regularize date (Julian format: YYYYddd)
    1620 2-May → 1620093
• Display stored Julian format as DD MMM YYYY
    1620093 → 2 MAY 1620
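A minimal Python sketch of the round trip between a calendar date and a YYYYddd code is given here. It assumes that ddd is the ordinary 1-based day of the year and uses Python's proleptic Gregorian `date` type; the function names are hypothetical. Under that assumption 2 May 1620 encodes as 1620123, so the slide's value 1620093 suggests that the prototype numbers days in its own way, which this sketch does not try to reproduce.

```python
from datetime import date, timedelta

MONTH_ABBR = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
              "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def to_yyyyddd(year, month, day):
    """Encode a date as YYYYddd, with ddd the 1-based day of the year.

    Assumption: standard ordinal-day counting, not the prototype's
    internal convention.
    """
    ordinal = (date(year, month, day) - date(year, 1, 1)).days + 1
    return f"{year:04d}{ordinal:03d}"


def from_yyyyddd(code):
    """Decode a YYYYddd string and display it as D MMM YYYY."""
    year, ordinal = int(code[:4]), int(code[4:])
    d = date(year, 1, 1) + timedelta(days=ordinal - 1)
    return f"{d.day} {MONTH_ABBR[d.month - 1]} {d.year}"
```

Storing dates in a single sortable code and formatting them only for display keeps comparisons simple, which is presumably the motivation for canonicalizing inside the ontology.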

Feast Dates

Dates expressed as a holy day

• Fixed Dates
    Christmas 1720 → 25 DEC 1720
• Moveable Dates around Easter (36 possible Easter dates with leap year variation)
    1723 Dnica Septuagesima → 24 JAN 1723
• Same day as previous entry
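Feast-date conversion can be sketched in Python as a lookup for fixed feasts plus an offset from Easter for moveable ones. This sketch assumes the Gregorian calendar and uses the well-known anonymous Gregorian Easter algorithm; records kept under the Julian calendar would need the Julian computus instead. The tables and the function name `resolve_feast` are illustrative, not the prototype's own.

```python
from datetime import date, timedelta


def gregorian_easter(year):
    """Easter Sunday in the Gregorian calendar (anonymous algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


# Hypothetical lookup tables holding only the two feasts named on the slide.
FIXED_FEASTS = {"christmas": (12, 25)}
MOVEABLE_FEASTS = {"septuagesima": -63}  # ninth Sunday before Easter


def resolve_feast(name, year):
    """Turn a feast name and a year into a calendar date."""
    key = name.lower()
    if key in FIXED_FEASTS:
        month, day = FIXED_FEASTS[key]
        return date(year, month, day)
    return gregorian_easter(year) + timedelta(days=MOVEABLE_FEASTS[key])
```

With these tables, `resolve_feast("Septuagesima", 1723)` gives 24 January 1723 and `resolve_feast("Christmas", 1720)` gives 25 December 1720, matching the two conversions on the slide.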

Run Ontology

Input
• Ontology (Created with OntoES)
• HTML data (Hypertext Markup Language)

Output
• RDF database (Resource Description Framework)
• OWL file (Web Ontology Language)

Ontology Workbench

Extracted Marriages

Columns: BetDate, MarDate, NameM, NameF, NameU

same day 1576 | Nicholas Patch | Christian Denman
26 JAN 1605 | Richard Patch | Joan Lavor
26 SEP 1613 | John Elliott | Joan Woodbery
7 AUG 1615 | Thomas Prime | Maria Parry
29 JAN 1616 | William Woodbery | Elizabeth Patch
2 MAY 1620 | William Hillerd | Fortu: Patch
17 SEP 1622 | Nicholas Patch | Elizabeth Owlsey
22 JAN 1627 | Richard Patch | Mary White
16 JAN 1630 | Andrew Elliott | Joan Patch
12 FEB 1639 | Andrew Elliott | Joan Pitts

3. OWL File and SWRL Rules

OWL HEADER

<owl:Class rdf:ID="MarriageRecord"/>
<owl:Class rdf:ID="Person"/>
<owl:Class rdf:ID="NameU"/>
<owl:DatatypeProperty rdf:ID="NameUValue">
  <rdfs:domain rdf:resource="#NameU"/>
  <rdfs:range rdf:resource="&xsd;string"/>
</owl:DatatypeProperty>

PERSON - NAMEU

<owl:ObjectProperty rdf:ID="Person-NameU">
  <rdfs:domain rdf:resource="#Person"/>
  <rdfs:range rdf:resource="#NameU"/>
  <owl:inverseOf>
    <owl:ObjectProperty rdf:ID="NameU-Person"/>
  </owl:inverseOf>
</owl:ObjectProperty>

Sample RDF Triples

Person_10 | sameAs | Person_10
Person_10 | type | Thing
Person_10 | type | Person
NameU_0 | NameUValue | "Christian Denman"
NameU_0 | sameAs | NameU_0
NameU_0 | type | Thing
NameU_0 | type | NameU
NameM_4 | NameMValue | "Nicholas Patch"
NameM_4 | sameAs | NameM_4
NameM_4 | type | Thing
NameM_4 | type | NameM

SWRL Rules

• Define OWL Class
    Example – Husband
    <owl:Class rdf:ID="Husband"/>
• Define Rule
    Example – Person with male name is a Husband
    Person-NameM(?x,?y) -> Husband(?x)

Marriage – Person_10 to Person_4

Person_10
<Person rdf:ID="Person_10">
  <Person-NameU rdf:resource="#NameU_0" />
</Person>

MarriageRecord_7
<MarriageRecord rdf:ID="MarriageRecord_7">
  <MarriageRecord-Person rdf:resource="#Person_4" />
  <MarriageRecord-Person rdf:resource="#Person_10" />
</MarriageRecord>

NameM_4
<NameM rdf:ID="NameM_4">
  <NameMValue>Nicholas Patch</NameMValue>
</NameM>

Person_4
<Person rdf:ID="Person_4">
  <Person-NameM rdf:resource="#NameM_4" />
</Person>

Rule HEAD in OWL file

<swrl:Imp rdf:ID="Def-Husband">
  <swrl:head rdf:parseType="Collection">
    <swrl:ClassAtom>
      <swrl:argument1 rdf:resource="#x"/>
      <swrl:classPredicate rdf:resource="#Husband"/>
    </swrl:ClassAtom>
  </swrl:head>

Rule BODY in OWL file

  <swrl:body>
    <swrl:IndividualPropertyAtom>
      <swrl:propertyPredicate rdf:resource="#Person-NameM"/>
      <swrl:argument1 rdf:resource="#x"/>
      <swrl:argument2 rdf:resource="#y"/>
    </swrl:IndividualPropertyAtom>
  </swrl:body>
</swrl:Imp>

Related Rules

If NameF is populated, then the person with the NameU value is the Husband

Person-NameF(?w,?v)
MarriageRecord-Person(?z,?w)
MarriageRecord-Person(?z,?x)
Person-NameU(?x,?y)
-> Husband(?x)

HusbandOf Rule

Husband(?x)
Wife(?y)
MarriageRecord-Person(?z,?x)
MarriageRecord-Person(?z,?y)
-> HusbandOf(?x,?y)
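How such rules derive new facts can be illustrated with a small forward-chaining sketch in plain Python over (subject, predicate, object) triples. It is not the SWRL reasoner used in the prototype. It encodes the two Husband rules and the HusbandOf rule from the slides, plus a Wife rule mirrored from the reasoning trace shown with the query results (a NameU person whose spouse has a NameM); the function and helper names are hypothetical.

```python
def infer(triples):
    """Apply the role rules repeatedly until no new triples appear."""
    facts = set(triples)

    def has(person, prop):
        return any(s == person and p == prop for s, p, _ in facts)

    def is_a(person, cls):
        return (person, "type", cls) in facts

    while True:
        new = set()
        # Person-NameM(?x,?y) -> Husband(?x)
        for s, p, _ in facts:
            if p == "Person-NameM":
                new.add((s, "type", "Husband"))
        in_record = [(s, o) for s, p, o in facts if p == "MarriageRecord-Person"]
        for rec_x, x in in_record:
            for rec_w, w in in_record:
                if rec_x != rec_w:
                    continue
                # NameF populated on the spouse -> NameU person is the Husband
                if has(x, "Person-NameU") and has(w, "Person-NameF"):
                    new.add((x, "type", "Husband"))
                # Mirror rule (from the reasoning trace): NameM on the spouse
                if has(x, "Person-NameU") and has(w, "Person-NameM"):
                    new.add((x, "type", "Wife"))
                # Husband(?x) Wife(?y) in the same record -> HusbandOf(?x,?y)
                if is_a(x, "Husband") and is_a(w, "Wife"):
                    new.add((x, "HusbandOf", w))
        if new <= facts:
            return facts
        facts |= new


# Extracted facts for MarriageRecord_7, taken from the OWL snippets.
RECORD_7 = [
    ("Person_10", "Person-NameU", "NameU_0"),
    ("Person_4", "Person-NameM", "NameM_4"),
    ("MarriageRecord_7", "MarriageRecord-Person", "Person_4"),
    ("MarriageRecord_7", "MarriageRecord-Person", "Person_10"),
]
INFERRED = infer(RECORD_7)
```

After the loop settles, `INFERRED` contains the role triples for both persons and the link `("Person_4", "HusbandOf", "Person_10")`. The HusbandOf link only appears on the second pass, because it depends on role facts derived in the first.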

Auxiliary Name Rules

NameM(?x) -> Name(?x)
NameF(?x) -> Name(?x)
NameU(?x) -> Name(?x)

NameMValue(?x) -> NameValue(?x)
NameFValue(?x) -> NameValue(?x)
NameUValue(?x) -> NameValue(?x)

Person-NameM(?x,?y) -> Person-Name(?x,?y)
Person-NameF(?x,?y) -> Person-Name(?x,?y)
Person-NameU(?x,?y) -> Person-Name(?x,?y)

4. SPARQL Query

Who is Husband of Christian Denman?

PREFIX : <http://www.deg.byu.edu/ontology/Marriage#>

SELECT ?Husband
WHERE
{
  ?X :NameValue "Christian Denman" .
  ?Y :Person-Name ?X .
  ?W :HusbandOf ?Y .
  ?W :Person-Name ?V .
  ?V :NameValue ?Husband
}
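The way the five triple patterns chain together can be sketched with a naive pattern matcher in plain Python. This is only an illustration of the join, not a SPARQL engine; the `match` function is hypothetical, and the sample triples assume that the auxiliary name rules and the HusbandOf rule have already been applied.

```python
def match(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every triple pattern in turn."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)


# Triples assumed to be present after rule inference.
TRIPLES = [
    ("NameU_0", "NameValue", "Christian Denman"),
    ("NameM_4", "NameValue", "Nicholas Patch"),
    ("Person_10", "Person-Name", "NameU_0"),
    ("Person_4", "Person-Name", "NameM_4"),
    ("Person_4", "HusbandOf", "Person_10"),
]

# The WHERE clause of the query, one tuple per triple pattern.
QUERY = [
    ("?X", "NameValue", "Christian Denman"),
    ("?Y", "Person-Name", "?X"),
    ("?W", "HusbandOf", "?Y"),
    ("?W", "Person-Name", "?V"),
    ("?V", "NameValue", "?Husband"),
]

HUSBANDS = [b["?Husband"] for b in match(QUERY, TRIPLES)]
```

Each pattern narrows the bindings produced by the previous one: the name literal fixes `?X`, which fixes the person `?Y`, whose husband `?W` leads to the name object `?V` and finally to its value.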

Query Results

Husband
=======================================
"Nicholas Patch"^^http://www.w3.org/2001/XMLSchema#string

South Petherton Marriages

same day 1576 Nicholas Patch and Christian Denman
26 Jan 1605 Richard Patch and Joan Lavor
25-Sep 1613 John Elliott and Joan Woodbery
7-Aug 1615 Thomas Prime and Maria Parry
29-Jan 1616 William Woodbery and Elizabeth Patch
2-May 1620 William Hillerd and Fortu: Patch
17-Sep 1622 Nicholas Patch and Elizabeth Owsley
22-Jan 1627 Richard Patch and Mary White
15-Jan 1630 Andrew Elliott and Joan Patch
12-Feb 1639 Andrew Elliott and Joan Pitts

"Nicholas Patch" because:
  NameValue("Nicholas Patch") and Name-NameValue(n1, "Nicholas Patch") and Name(n1) is NameM(n1) and Person-NameM(p1, n1)
  NameValue("Christian Denman") and Name-NameValue(n2, "Christian Denman") and Name(n2) is NameU(n2) and Person-NameU(p2, n2)
Husband(p1) because:
  Person-NameM(p1, n1)
Wife(p2) because:
  Person-NameU(p2, n2) and Person-MarriageRecord(p2, r1) and MarriageRecord-Person(r1, p1) and Person-NameM(p1, n1)
HusbandOf(p1, p2) because:
  Husband(p1) and Wife(p2) and MarriageRecord-Person(r1, p1) and MarriageRecord-Person(r1, p2)

5. Experimental Results

• Extraction Results
• American Extraction Problem
• Rule Results

Extraction Results

            MARRIAGES  ENTITIES  CORRECT  RECALL %  ERRORS  PRECISION
English           188       594      588     99.0%       8      98.7%
American          608      1824     1630     89.4%      34      98.0%
Danish            171       543      538     99.1%      10      98.2%

               BIRTHS  ENTITIES  CORRECT  RECALL %  ERRORS  PRECISION
English          3153      9489     9394     99.0%      61      99.4%
American          675      2055     1809     88.0%      33      98.2%
Danish            677      2061     2042     99.1%      15      99.3%

               DEATHS  ENTITIES  CORRECT  RECALL %  ERRORS  PRECISION
English          3458      8675     8589     99.0%      83      99.0%
American          510      1305     1148     88.0%      28      97.6%
Danish            833      2113     2093     99.1%      19      99.1%
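The recall and precision figures can be recomputed from the counts with a short Python sketch. It assumes that the count following each entity total is the number of correctly extracted entities and that ERRORS counts incorrect extractions; that reading is an inference, chosen because it reproduces the percentages on the slide. The function names are illustrative.

```python
def recall(correct, entities):
    """Share of the entities in the records that were extracted correctly."""
    return correct / entities


def precision(correct, errors):
    """Share of the extracted entities that were correct."""
    return correct / (correct + errors)


# (label, entities, correct, errors) for the three marriage data sets.
MARRIAGE_ROWS = [
    ("English", 594, 588, 8),
    ("American", 1824, 1630, 34),
    ("Danish", 543, 538, 10),
]

for label, entities, correct, errors in MARRIAGE_ROWS:
    r = 100 * recall(correct, entities)
    p = 100 * precision(correct, errors)
    print(f"{label:<9} recall {r:.1f}%  precision {p:.1f}%")
```

Under this reading the lower American recall comes from entities that were missed rather than from wrong extractions, since the American precision stays close to that of the other two data sets.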

American Difficulty

BIRTH
WOODBURY, Charles Henry [Charles William, P. R. 4.], s. Henry [housewright. dup.] and Henrietta (Galloup), Dec. 4, 1845.

Extra information inside brackets & parentheses
• Charles William – twin of Charles Henry
• Henry [housewright – identified as NAME
• Henrietta (Galloup) – identified as NAME

Rules Results

• 100% Precision and Recall
  (Once rules are well-defined, the results are perfect.)
• Database Size
  (The RDF database is much larger when rule triples are added.)
  • NEW PROPERTIES – husband, wife, parent, child
  • NEW LINKS

Size Impact of Adding Rules

            MARRIAGE (21 rules)                                 EVENT (30 rules)
            Triples  OWL (# lines)  OWL File (kilobytes)        Triples  OWL (# lines)  OWL File (kilobytes)
OWL File        814            498                    14           2232           1405                    15
W/Rules        1009            785                    31           2983           1873                    75
Difference      195            287                    17            751            468                    60
Increase     23.96%         57.63%               121.43%         33.65%         33.31%               400.00%

6. Conclusions

• Speed up data indexing
• Make production of a full index easier
• Ground the index in original documents
• Provide for inferred facts
• Simplify as well as augment record search
• Help link records and form family groups and ancestral lines