
Recognizing Records from the Extracted Cells of Genealogical Microfilm Tables

Date post: 08-Jan-2016
Page 1: Recognizing Records from the  Extracted Cells of Genealogical Microfilm Tables

Recognizing Records from the Extracted Cells of Genealogical Microfilm Tables

Kenneth Martin Tubbs Jr.

A Thesis Submitted to the Faculty of Brigham Young University

Page 2

Motivation

• Millions of people want genealogical information
• Acquiring microfilm is expensive and time consuming

Page 3

Extraction Problem

• Searching microfilm by hand is slow, error prone, and tedious
• Extraction by hand requires enormous amounts of time and manpower

Page 4

Difficulties

• Tables have different layouts and styles
• Tables contain different records
• Tables do not use a uniform schema
• Tables lack information and are ambiguous

Page 5

Related Work

• Current work exploits the geometric properties of tables
• Regular expressions, grammars, probabilistic models, and templates
• They ignore the ontological constraints of this information

Page 6

Contributions

• Exploit both ontological and geometric constraints
• Identify complex records
• Work with tables with hand-written values

Page 7

Algorithm

Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology

Method: Generate Confidences, Enforce Constraints, Verify Results

Output: SQL Insert Statements

Page 8

Training Set

• 25 Tables from 5 different microfilm rolls
• Used to:
– Identify relationships between table cells
– Create genealogical ontology
– Define features to extract
– Generate rules (constraints)

Page 9

Input: Microfilm Table

Page 10

Input: Microfilm Table

Page 11

Input: Microfilm Table

Input Features

1. Coordinates of each cell.
2. Printed text for label cells.
3. Whether or not each value cell is empty.

Page 12

Input: Microfilm Table

<index source="0444770/0444770_2.gif" ontology="ontology.xml">
  <cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
  <cell rect="61,132,118,260" printed_text="Families number in order of visitation." empty="0" />
  <cell rect="119,132,436,261" printed_text="The Name of every Person whose usual place of abode on the first day of June, 1840, was in this family." empty="0" />
  <cell rect="62,260,120,295" printed_text="2" empty="0" />
  <cell rect="118,260,436,298" printed_text="3" empty="0" />
  <cell rect="7,458,62,497" printed_text="" empty="1" />
  . . .
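The input format above can be read with a few lines of Python. This is only a sketch, not the thesis tool (which the slides say is written in Java). It assumes that `rect` holds `left,top,right,bottom` pixel coordinates and that a cell with non-empty `printed_text` is a label cell; the `Cell` class and `parse_cells` function are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    # Assumption: rect is "left,top,right,bottom" in pixels.
    left: int
    top: int
    right: int
    bottom: int
    printed_text: str
    empty: bool

    @property
    def is_label(self) -> bool:
        # Assumption: label cells are the ones carrying printed text.
        return bool(self.printed_text)


def parse_cells(xml_text: str) -> list[Cell]:
    """Turn every <cell> element of an <index> document into a Cell."""
    root = ET.fromstring(xml_text)
    cells = []
    for node in root.iter("cell"):
        left, top, right, bottom = (int(v) for v in node.attrib["rect"].split(","))
        cells.append(
            Cell(
                left, top, right, bottom,
                node.attrib.get("printed_text", ""),
                node.attrib.get("empty") == "1",
            )
        )
    return cells


# Two cells copied from the slide, wrapped in a closed <index> element.
SAMPLE = """<index source="0444770/0444770_2.gif" ontology="ontology.xml">
  <cell rect="61,132,118,260" printed_text="Families number in order of visitation." empty="0" />
  <cell rect="7,458,62,497" printed_text="" empty="1" />
</index>"""

cells = parse_cells(SAMPLE)
```

Here `cells[0]` is a label cell and `cells[1]` is an empty value cell, which covers the three input features listed earlier: coordinates, printed text, and the empty flag.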

Page 13

Genealogical Ontology

Page 14

Genealogical Ontology

Page 15

Genealogical Ontology

<Ontology>
  <ObjectSet id="0" name="Person" syn="" lex="0"/>
  <ObjectSet id="1" name="Family" syn="families" lex="0"/>
  <ObjectSet id="2" name="Event" syn="" lex="0"/>
  <ObjectSet id="3" name="Age" syn="age birthday" lex="1"/>
  <ObjectSet id="4" name="Relationship" syn="relationship relation" lex="1"/>
  <ObjectSet id="5" name="Full Name" syn="full name whom who" lex="1"/>
  <ObjectSet id="6" name="First Name" syn="first given christian" lex="1"/>
  <ObjectSet id="7" name="Middle Name(s)" syn="middle initial" lex="1"/>
  <ObjectSet id="8" name="Last Name" syn="last surname" lex="1"/>
  <ObjectSet id="9" name="Title(s)" syn="title" lex="1"/>
  . . .

Page 16

Generate Confidences

• Confidence of relationships between pairs of cells
• Generate confidence values between 0 and 1

Page 17

Relationships

• A label cell describes a value cell
• Value cells in same row or column
• Label cells form a multi-level label
• A label cell maps to an object set
• Identify factoring

Page 18

Label Cell and Value Cell

A continuous path between a label cell and a value cell

Confidence =
1 if a path exists
0 if no path exists

Page 19

Label Cell and Value Cell

Preferences for label – value orientations

Label Orientation    Confidence
Above                1
Left                 .75
Right                .5
Below                .25
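The orientation preferences can be sketched as a lookup table. The slides give only the four confidence values, so how an orientation is decided is an assumption here: the sketch compares cell midpoints, assumes rectangles are `(left, top, right, bottom)` with y growing downward, and uses the hypothetical names `label_orientation` and `orientation_confidence`.

```python
# Confidence values taken from the slide.
ORIENTATION_CONFIDENCE = {"above": 1.0, "left": 0.75, "right": 0.5, "below": 0.25}


def center(rect):
    """Midpoint of a (left, top, right, bottom) rectangle."""
    left, top, right, bottom = rect
    return ((left + right) / 2, (top + bottom) / 2)


def label_orientation(label_rect, value_rect):
    """Where the label sits relative to the value cell.

    Assumption: the dominant axis between the two midpoints decides the
    orientation; the slides do not say how the thesis tool decides it.
    """
    lx, ly = center(label_rect)
    vx, vy = center(value_rect)
    dx, dy = vx - lx, vy - ly
    if abs(dy) >= abs(dx):
        return "above" if dy > 0 else "below"
    return "left" if dx > 0 else "right"


def orientation_confidence(label_rect, value_rect):
    return ORIENTATION_CONFIDENCE[label_orientation(label_rect, value_rect)]
```

For example, a column header at `(7, 131, 62, 261)` and a value cell at `(7, 458, 62, 497)` share a vertical axis, so the label is classified as above the value and the pair receives confidence 1.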

Page 20

Label Cell and Value Cell

Compare the height or width of each label cell with each value cell

Confidence scale: 0 (Not Similar) to 1 (Similar)
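One way to place a label and value pair on a 0 to 1 similarity scale is sketched below. The slides do not give the formula, so the ratio-based measure is purely an assumption, as are the function names and the `(left, top, right, bottom)` rectangle layout.

```python
def ratio(a, b):
    """Smaller length divided by larger length; 0.0 for degenerate sizes."""
    if a <= 0 or b <= 0:
        return 0.0
    return min(a, b) / max(a, b)


def size_similarity(label_rect, value_rect):
    """Similarity in [0, 1] based on height OR width.

    Assumption: a label above its values tends to share their width and a
    label beside its values tends to share their height, so the better of
    the two ratios is used.
    """
    label_w, label_h = label_rect[2] - label_rect[0], label_rect[3] - label_rect[1]
    value_w, value_h = value_rect[2] - value_rect[0], value_rect[3] - value_rect[1]
    return max(ratio(label_w, value_w), ratio(label_h, value_h))
```

With this measure a header of width 55 over a value cell of width 55 scores 1 even when the heights differ, which matches the "height or width" wording of the slide.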

Page 21

Value Cell and Value Cell (Same Row)

A continuous, horizontal path exists between a pair of value cells

Confidence =
1 if a path exists
0 if no path exists

Page 22

Value Cell and Value Cell (Same Column)

A continuous, vertical path exists between a pair of value cells

Confidence =
1 if a path exists
0 if no path exists

Page 23

Value Cell and Value Cell (Geometrically Similar)

Compare height and width

Confidence scale: 0 (Not Similar) to 1 (Similar)

Page 24

Multi-level Labels

• Distance between the midpoints
• A line through the midpoints
• Share a common border

Page 25

Match Label Cells to Object Sets

• Match synonyms of object sets to words in a label
– Location of matched words
– Order that object sets match words

Object Sets: Full Name, Location, Day, Family
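Synonym matching can be sketched as follows. The synonym lists are copied from the ontology slide, but the scoring is an assumption: the slides say only that the location and order of matched words matter, so this sketch simply gives a higher score to a synonym that appears earlier in the label. `SYNONYMS` and `match_label` are hypothetical names.

```python
import re

# Synonyms for a few object sets, copied from the ontology slide.
SYNONYMS = {
    "Family": ["families"],
    "Age": ["age", "birthday"],
    "Full Name": ["full", "name", "whom", "who"],
    "Last Name": ["last", "surname"],
}


def match_label(label_text, synonyms=SYNONYMS):
    """Map object set name to a score in (0, 1] for each set that matches.

    Assumption: the score is 1 when the first matched synonym is the first
    word of the label and decreases linearly with its position.
    """
    words = re.findall(r"[a-z]+", label_text.lower())
    scores = {}
    for object_set, syns in synonyms.items():
        positions = [i for i, word in enumerate(words) if word in syns]
        if positions:
            scores[object_set] = 1.0 - positions[0] / len(words)
    return scores
```

For the census label "Families number in order of visitation." only the Family object set matches, and it matches on the first word.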

Page 26

Enforce Constraints

• A set of rules describes geometric and ontological constraints.
• For example:
– Value cells of the same type have the same dimensions
– A family can’t have 100 members
• The algorithm iterates over the rules

Page 27

1. Similar Value Cells

Page 28

1. Similar Value Cells

Lower Confidence

Page 29

1. Similar Value Cells

Page 30

2. Combine Aggregations

Page 31

3. Multi-level Labels

Page 32

4. Factoring

Check Cardinality Constraints

• Observed cardinality:
– microfilm table
• Expected cardinality:
– genealogy ontology

Page 33

Observed Cardinality

[First Name] per [Family] = 45 / 9 = 4.67

. . .

Page 34

Expected Cardinality

[First Name] per [Family] = 4.8 * 1 * 1 = 4.8
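The cardinality check can be sketched as a comparison of two numbers. The slides show how each number is obtained but not how they are compared, so the agreement score below is an assumption, as is the reading of the factors `4.8 * 1 * 1` as per-relationship averages multiplied along a path in the ontology. All function names are hypothetical.

```python
from math import prod


def observed_cardinality(value_count, group_count):
    """Average number of values per group counted in the microfilm table."""
    return value_count / group_count


def expected_cardinality(path_factors):
    """Product of the average cardinalities along a path in the ontology.

    Assumption: each factor is the expected number of participants for one
    relationship on the path, as in 4.8 * 1 * 1 on the slide.
    """
    return prod(path_factors)


def cardinality_agreement(observed, expected):
    """Score in [0, 1]; 1 means observed and expected are equal.

    Assumption: relative difference, clipped at 0. The slides do not state
    the comparison the thesis tool uses.
    """
    return max(0.0, 1.0 - abs(observed - expected) / expected)
```

A candidate factoring whose observed cardinality is far from the ontology's expectation would receive a low score and could be rejected in favor of an alternative factoring.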

Page 35

5. Ontological Similarity

Increase Confidence of Label to Object Set Mappings

Page 36

6. Same Microfilm Roll

• Microfilm tables from the same roll have the same structure and relationships
• Generate the confidence values for multiple tables from the same roll
• Take the average of the respective confidence values
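Averaging across tables from one roll can be sketched as below. It assumes each table's confidences are stored in a dictionary keyed by a relationship identifier that is comparable across tables (for example a pair of column positions); that keying scheme and the name `average_confidences` are assumptions, not details from the thesis.

```python
def average_confidences(tables):
    """Average each relationship's confidence over tables from one roll.

    `tables` is a list of dicts mapping a relationship key to a confidence
    in [0, 1]. Only keys present in every table are averaged (assumption).
    """
    if not tables:
        return {}
    shared_keys = set.intersection(*(set(table) for table in tables))
    return {
        key: sum(table[key] for table in tables) / len(tables)
        for key in shared_keys
    }


# Hypothetical confidences for the same relationship in two tables.
roll = [
    {("label 0", "value 0"): 1.0, ("label 1", "value 1"): 0.5},
    {("label 0", "value 0"): 0.5, ("label 1", "value 1"): 0.5},
]
averaged = average_confidences(roll)
```

A relationship that is ambiguous in one table but clear in its neighbors ends up with a confidence pulled toward the clear cases.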

Page 37

Verify Results

Page 38

Database

Full Name …

• Create SQL Insert statements to store value cell coordinates

INSERT INTO Person (Full_Name) VALUES ('335,114,521,172')
INSERT INTO Person (Full_Name) VALUES ('335,173,521,231')
…
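Generating statements of this shape can be sketched in a few lines. The function name `insert_statement` is hypothetical, and the column is written as `Full_Name` with an underscore, following the later example slides, so that the statement is valid SQL without quoting the identifier.

```python
def insert_statement(table, column, rect):
    """Build an INSERT that stores a value cell's coordinates as text.

    Assumption: the value stored is the cell rectangle, serialized as a
    comma-separated string, as in the statements shown on the slide.
    """
    coords = ",".join(str(v) for v in rect)
    return f"INSERT INTO {table} ({column}) VALUES ('{coords}')"


statements = [
    insert_statement("Person", "Full_Name", rect)
    for rect in [(335, 114, 521, 172), (335, 173, 521, 231)]
]
```

Because the handwritten values are not recognized, the database holds coordinates that point back into the microfilm image rather than the text itself. A production version would pass values through a parameterized query instead of string formatting.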

Page 39

Algorithm

Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology

Method: Generate Confidences, Enforce Constraints, Verify Results

Output: SQL Insert Statements

Page 40

Training Set Results

Relationship                         Precision   Recall   Accuracy
Label Cell Describes Value Cell      100%        100%     100%
Value Cells in Same Row or Column    100%        100%     100%
Multilevel Labels                    100%        100%     100%
Label Cells – Object Set Matches     74.45%      100%     84.65%
Factoring                            100%        100%     100%
SQL Fields                           99.42%      100%     99.71%

Page 41

Ambiguous Factoring

Page 42

Experiments

• 75 Tables from 15 different microfilm rolls
• Precision, recall, and accuracy
– Populated SQL fields
– Each relationship
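Precision and recall for one relationship type can be computed from sets of predicted and hand-verified pairs, as in this sketch. The standard definitions are used; the slides do not define how accuracy is computed, so it is left out. The function name and the example pairs are hypothetical.

```python
def precision_recall(predicted, actual):
    """Standard precision and recall over two sets of items.

    precision = correct predictions / all predictions
    recall    = correct predictions / all true items
    Empty denominators are treated as a perfect score (assumption).
    """
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(actual) if actual else 1.0
    return precision, recall


# Hypothetical label to object set matches for one table.
predicted = {("label 1", "Family"), ("label 2", "Full Name"), ("label 3", "Age")}
actual = {("label 1", "Family"), ("label 2", "Full Name")}
precision, recall = precision_recall(predicted, actual)
```

In this example every true match is found but one extra match is proposed, so recall is 1 while precision is 2/3.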

Page 43

Test Set Results

Relationship                         Precision   Recall   Accuracy
Label Cell Describes Value Cell      100%        98.12%   98.12%
Value Cells in Same Row or Column    100%        100%     100%
Multilevel Labels                    100%        99.67%   99.82%
Label Cells – Object Set Matches     84.98%      92.76%   88.18%
Factoring                            100%        93.40%   93.47%
SQL Fields                           93.20%      92.41%   92.15%

Page 44

3 Success Examples

1. Specialized Record
2. Ontology Constraints
3. Factoring

Page 45

1. Specialized Records

Page 46

1. Specialized Records

INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Gender, Occupation, Race, Family_Identifier, Birth_Identifier) VALUES (1, '109,455,267,478', '314,456,336,479', '291,456,314,478', '505,457,637,480', '267,456,291,478', 1, 1)
INSERT INTO PERSON (Person_Identifier, Birth_Identifier) VALUES (2, 2)
INSERT INTO PERSON (Person_Identifier, Birth_Identifier) VALUES (3, 3)
INSERT INTO MOTHER_CHILD (Mother_Identifier, Child_Identifier) VALUES (3, 1)
INSERT INTO FATHER_CHILD (Father_Identifier, Child_Identifier) VALUES (2, 1)
INSERT INTO EVENT (Event_Identifier, Location) VALUES (1, '894,460,997,483')
INSERT INTO EVENT (Event_Identifier, Location) VALUES (2, '997,460,1076,483')
INSERT INTO EVENT (Event_Identifier, Location) VALUES (3, '1076,461,1153,484')

Page 47

2. Ontology Constraints

Page 48

2. Ontology Constraints

INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Family_Identifier, Burial_Identifier) VALUES (1, '70,243,331,373', '620,243,687,370', 1, 1)
INSERT INTO FAMILY (Family_Identifier, Location) VALUES (1, '331,243,508,372')
INSERT INTO EVENT (Event_Identifier, Date) VALUES (1, '508,243,620,371')
INSERT INTO PERSON (Person_Identifier, Full_Name) VALUES (2, '687,241,861,372')

Page 49

3. Factoring

Page 50

3 Types of Errors

1. Ambiguous Factoring
2. Long Label Names
3. Ambiguous Columns

Page 51

2. Long Label Names

Page 52

3. Ambiguous Columns

Page 53

Artifacts

• Tool in the Java programming language
• http://www.rdhd.byu.edu/
• Executable Jar File
• Source Code
• Input Files
• Documentation

Page 54

Future Work

• Advanced natural language processing
• Hand-written values
• Machine learning

Page 55

Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm

Kenneth Martin Tubbs Jr.

A Thesis Presented to the Department of Computer Science

Brigham Young University

