Recognizing Records from the Extracted Cells of Microfilm Tables
Kenneth M. Tubbs, David W. Embley
Brigham Young University
Supported by NSF
Motivation
• Millions want microfilm information
– 1880 census on-line, end of October
– 3 million hits per hour on familysearch.org
• Acquiring information from microfilm
– Expensive and time consuming
– 2.5 million rolls, 20,000 extractors, 100 hours per year: requires 104 years
• Finding a way to automate: big win!
Difficulties
• Different layouts and styles
• Different types of data
• Sometimes ambiguous
• Type-written labels (OCR)
• Hand-written data (?)
Objective: Identify Records
Given a zoned image of a microfilm table, exploit:
• Ontological as well as geometric constraints
• Layout of handwritten values
• Layout of empty cells
Output field coordinates (labeled with respect to the ontology) and organized into records
Algorithm
Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology
Method: Generate Confidence, Enforce Constraints, Verify Results
Output: SQL Insert Statements
“Training” Set
• 25 tables from 5 different microfilm rolls
• Used to:
– Identify relationships between table cells
– Create genealogical ontology
– Define features to extract
– Generate rules (constraints)
Input: Microfilm Table
Input Features
1. Coordinates of each cell
2. Printed text for label cells
3. Cell empty or not
Input: Microfilm Table
<index source="0444770/0444770_2.gif" ontology="ontology.xml">
  <cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
  <cell rect="61,132,118,260" printed_text="Families number in order of visitation." empty="0" />
  <cell rect="119,132,436,261" printed_text="The Name of every Person whose usual place of abode on the first day of June, 1840, was in this family." empty="0" />
  <cell rect="62,260,120,295" printed_text="2" empty="0" />
  <cell rect="118,260,436,298" printed_text="3" empty="0" />
  <cell rect="7,458,62,497" printed_text="" empty="1" />
  . . .
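A minimal sketch of reading such an input file might look like the following. It assumes that `rect` holds `left,top,right,bottom` pixel coordinates (the slides do not state the order) and that the full file closes with `</index>`. The `Cell` class and `parse_cells` function are illustrative names, not part of the original system.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Two cells copied from the slide; the closing </index> tag is assumed.
SAMPLE = """<index source="0444770/0444770_2.gif" ontology="ontology.xml">
  <cell rect="62,260,120,295" printed_text="2" empty="0" />
  <cell rect="7,458,62,497" printed_text="" empty="1" />
</index>"""


@dataclass
class Cell:
    left: int
    top: int
    right: int
    bottom: int
    printed_text: str
    empty: bool


def parse_cells(xml_text: str) -> list[Cell]:
    """Turn every <cell> element into a Cell record."""
    root = ET.fromstring(xml_text)
    cells = []
    for node in root.iter("cell"):
        # Assumed coordinate order: left, top, right, bottom.
        left, top, right, bottom = (int(v) for v in node.get("rect").split(","))
        cells.append(Cell(left, top, right, bottom,
                          node.get("printed_text", ""),
                          node.get("empty") == "1"))
    return cells


cells = parse_cells(SAMPLE)
```

Keeping the three input features (coordinates, printed text, empty flag) in one record makes the later pairwise comparisons between cells straightforward.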
Genealogical Ontology
<Ontology>
  <ObjectSet id="0" name="Person" syn="" lex="0"/>
  <ObjectSet id="1" name="Family" syn="families" lex="0"/>
  <ObjectSet id="2" name="Event" syn="" lex="0"/>
  <ObjectSet id="3" name="Age" syn="age birthday" lex="1"/>
  <ObjectSet id="4" name="Relationship" syn="relationship relation" lex="1"/>
  <ObjectSet id="5" name="Full Name" syn="full name whom who" lex="1"/>
  <ObjectSet id="6" name="First Name" syn="first given christian" lex="1"/>
  <ObjectSet id="7" name="Middle Name(s)" syn="middle initial" lex="1"/>
  <ObjectSet id="8" name="Last Name" syn="last surname" lex="1"/>
  <ObjectSet id="9" name="Title(s)" syn="title" lex="1"/>
  . . .
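The ontology file can be loaded in the same spirit. This sketch assumes that `syn` is a space-separated list of keywords and that `lex="1"` marks a lexical object set, one whose values can appear in a table cell; neither reading is stated on the slide. `load_object_sets` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Three object sets copied from the slide; the closing </Ontology> tag is assumed.
ONTOLOGY = """<Ontology>
  <ObjectSet id="1" name="Family" syn="families" lex="0"/>
  <ObjectSet id="3" name="Age" syn="age birthday" lex="1"/>
  <ObjectSet id="5" name="Full Name" syn="full name whom who" lex="1"/>
</Ontology>"""


def load_object_sets(xml_text):
    """Map each object-set name to its id, synonym words, and lexical flag."""
    object_sets = {}
    for node in ET.fromstring(xml_text).iter("ObjectSet"):
        object_sets[node.get("name")] = {
            "id": int(node.get("id")),
            "synonyms": node.get("syn", "").split(),  # assumed space-separated
            "lexical": node.get("lex") == "1",        # assumed meaning of lex
        }
    return object_sets


object_sets = load_object_sets(ONTOLOGY)
```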
Generate Confidence Matrices
• Relationships between pairs of cells
• Confidence values between 0 and 1
Relationships
• Label cell describes value cells
• Value cells in same row or column
• Label cells form a multi-level label
• Label cells correspond to object sets
• Value factoring and nested values
Label Cell and Value Cell
A continuous path between a label cell and a value cell
Confidence =
1 if a path exists
0 if no path exists
Label Cell and Value Cell
Preferences for label – value orientations
Label Orientation    Confidence
Above                1
Left                 .75
Right                .5
Below                .25
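The orientation preferences above can be sketched as a lookup keyed by where the label sits relative to the value cell. The confidence values come from the table; how orientation is decided is an assumption here: rectangles are taken as `(left, top, right, bottom)` with y growing downward, and the label's centre point is tested against the value cell's extent. Both function names are illustrative.

```python
# Confidence values taken from the slide's table.
ORIENTATION_CONFIDENCE = {"above": 1.0, "left": 0.75, "right": 0.5, "below": 0.25}


def label_orientation(label, value):
    """Return where the label sits relative to the value cell, or None.

    Rectangles are (left, top, right, bottom) with y growing downward,
    which is an assumption about the input format.
    """
    l_left, l_top, l_right, l_bottom = label
    v_left, v_top, v_right, v_bottom = value
    cx = (l_left + l_right) / 2
    cy = (l_top + l_bottom) / 2
    if v_left <= cx <= v_right:      # label centre lies within the value's columns
        if cy < v_top:
            return "above"
        if cy > v_bottom:
            return "below"
    if v_top <= cy <= v_bottom:      # label centre lies within the value's rows
        if cx < v_left:
            return "left"
        if cx > v_right:
            return "right"
    return None


def orientation_confidence(label, value):
    """Look up the preference for this orientation; 0.0 if none applies."""
    return ORIENTATION_CONFIDENCE.get(label_orientation(label, value), 0.0)
```

A label that spans several columns may have its centre outside a given value cell's extent, so a fuller version would likely test for overlap rather than the centre point.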
Label Cell and Value Cell
Compare the height or width of each label cell with each value cell
Confidence: 0 (not similar) to 1 (similar)
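One way to turn this comparison into a value between 0 and 1 is a ratio of extents. The slide gives no formula, so the ratio below and the choice to take the better of width and height (reflecting the "height or width" wording) are assumptions; the function names are hypothetical.

```python
def dimension_similarity(a, b):
    """Ratio of the smaller to the larger extent, in [0, 1]."""
    return min(a, b) / max(a, b) if max(a, b) > 0 else 0.0


def label_value_similarity(label, value):
    """Best of the width and height ratios for two (left, top, right, bottom) rects."""
    width_sim = dimension_similarity(label[2] - label[0], value[2] - value[0])
    height_sim = dimension_similarity(label[3] - label[1], value[3] - value[1])
    return max(width_sim, height_sim)
```

For the label cell `61,132,118,260` and the cell `62,260,120,295` from the input example, the widths are 57 and 58 pixels, so the score is 57/58 even though the heights differ greatly.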
Value Cell and Value Cell (Same Row)
A continuous, horizontal path exists between a pair of value cells
Confidence =
1 if a path exists
0 if no path exists
Value Cell and Value Cell (Same Column)
A continuous, vertical path exists between a pair of value cells
Confidence =
1 if a path exists
0 if no path exists
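The slides do not define "continuous path" precisely. One plausible reading, sketched below, treats two cells as neighbours when one cell's right (or bottom) edge meets the other's left (or top) edge within a few pixels and their extents overlap on the other axis; a path then exists if the goal cell can be reached through a chain of neighbours. The 3-pixel tolerance and all function names are assumptions.

```python
from collections import deque


def overlap(a_lo, a_hi, b_lo, b_hi):
    """Length of the overlap between two 1-D intervals (0 if disjoint)."""
    return max(0, min(a_hi, b_hi) - max(a_lo, b_lo))


def adjacent(a, b, horizontal, tol=3):
    """True if rect b touches rect a's right edge (or bottom edge) within tol pixels."""
    if horizontal:
        touching = abs(a[2] - b[0]) <= tol
        shared = overlap(a[1], a[3], b[1], b[3])
    else:
        touching = abs(a[3] - b[1]) <= tol
        shared = overlap(a[0], a[2], b[0], b[2])
    return touching and shared > tol


def path_confidence(cells, start, goal, horizontal=True):
    """1.0 if goal is reachable from start through adjacent cells, else 0.0."""
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return 1.0
        for other in cells:
            if other not in seen and (adjacent(current, other, horizontal)
                                      or adjacent(other, current, horizontal)):
                seen.add(other)
                queue.append(other)
    return 0.0


# Rectangles taken from the input example, as (left, top, right, bottom) tuples.
cells = [(62, 260, 120, 295), (118, 260, 436, 298), (7, 458, 62, 497)]
same_row = path_confidence(cells, cells[0], cells[1])
different_row = path_confidence(cells, cells[0], cells[2])
```

Passing `horizontal=False` runs the same search along columns.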
Value Cell and Value Cell (Geometrically Similar)
Compare height and width
Confidence: 0 (not similar) to 1 (similar)
Multi-level Labels
• Distance between the midpoints
• A line through the midpoints
• Share a common border
Match Label Cells to Object Sets
• Location of matched words
• Order of matched words
Object Sets: Full Name, Location, Day, Family
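A rough sketch of scoring a label against one object set's synonym list follows. The slide names the two signals (location and order of matched words) but not how they are combined, so the weighting here is purely illustrative: earlier matches score higher, and matches that appear out of synonym order are halved. `match_confidence` is a hypothetical name.

```python
import re


def match_confidence(label_text, synonyms):
    """Score a label against one object set's synonym words, in [0, 1]."""
    words = re.findall(r"[a-z]+", label_text.lower())
    # Position in the label of each synonym that occurs, kept in synonym order.
    positions = [words.index(s) for s in synonyms if s in words]
    if not positions:
        return 0.0
    # Location: a match near the start of the label counts for more (assumed).
    location = 1.0 - min(positions) / len(words)
    # Order: matches appearing in synonym order keep full weight (assumed).
    order = 1.0 if positions == sorted(positions) else 0.5
    return location * order


family_score = match_confidence("Families number in order of visitation.", ["families"])
age_score = match_confidence("Families number in order of visitation.", ["age", "birthday"])
```

Whole-word matching matters here: "whose" in a label does not match the synonym "who".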
Enforce Constraints
• Rules for geometric and ontological constraints
• Examples:
– Same-type value cells have the same dimensions.
– A family can’t have 100 members.
• Iterate over the rules, seeking convergence
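Iterating over the rules until convergence can be sketched as a fixed-point loop. Here a rule is any function that takes the confidence mapping and returns an adjusted one with the same keys; the tolerance, the round limit, and the two toy rules are assumptions for illustration, not the rules used in the original system.

```python
def enforce_constraints(confidence, rules, tolerance=1e-6, max_rounds=100):
    """Apply every rule repeatedly until no value changes by more than tolerance."""
    for round_number in range(1, max_rounds + 1):
        previous = dict(confidence)
        for rule in rules:
            confidence = rule(confidence)
        change = max((abs(confidence[k] - previous[k]) for k in confidence),
                     default=0.0)
        if change <= tolerance:
            return confidence, round_number
    return confidence, max_rounds


def clamp(confidence):
    """Toy rule: keep every confidence inside [0, 1]."""
    return {k: min(1.0, max(0.0, v)) for k, v in confidence.items()}


def drop_weak(confidence, threshold=0.1):
    """Toy rule: treat very weak relationships as absent."""
    return {k: (0.0 if v < threshold else v) for k, v in confidence.items()}


result, rounds = enforce_constraints({("a", "b"): 1.2, ("a", "c"): 0.05},
                                     [clamp, drop_weak])
```

The first round changes both values and the second round changes nothing, so the loop stops after two rounds.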
Similar Value Cells
Lower Confidence
Combine Aggregations
Multi-level Labels
Factoring
• Observed cardinality in microfilm table
• Expected cardinality in genealogy ontology
Check Cardinality Constraints
Observed Cardinality
[First Name] per [Family] = 45 / 9 = 4.67
. . .
Expected Cardinality
[First Name] per [Family] = 4.8 * 1 * 1 = 4.8
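A small sketch of the cardinality check: the observed cardinality is the number of value cells of one object set divided by the number of its grouping object set, and it is compared with the ontology's expected figure. The scoring formula is an assumption, since the slides do not say how the comparison becomes a confidence, and the counts in the example (14 and 3) are hypothetical, chosen only for illustration.

```python
def observed_cardinality(value_count, group_count):
    """Average number of values per group, e.g. first names per family."""
    return value_count / group_count


def cardinality_confidence(observed, expected):
    """Score in [0, 1] that falls as observed drifts from expected (assumed scoring)."""
    if expected <= 0:
        return 0.0
    return max(0.0, 1.0 - abs(observed - expected) / expected)


# Hypothetical counts: 14 first-name cells grouped under 3 family cells.
observed = observed_cardinality(14, 3)
plausible = cardinality_confidence(observed, 4.8)   # expected 4.8 from the slide
implausible = cardinality_confidence(100, 4.8)      # "a family can't have 100 members"
```

An interpretation whose observed cardinality sits near the expected value keeps a high score, while one that implies 100 first names per family drops to zero.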
Ontological Similarity
Increase Confidence of Label to Object Set Mappings
Same Microfilm Roll
Average Confidence Values Across Tables
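Averaging across tables from the same roll can be sketched as follows, assuming each table contributes a mapping from a relationship key (for example a label text paired with an object set) to a confidence. The key shape and the function name are assumptions.

```python
def average_confidence(tables):
    """Average each relationship's confidence over the tables that report it."""
    totals, counts = {}, {}
    for table in tables:
        for key, value in table.items():
            totals[key] = totals.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical confidences from two tables on one roll.
roll = [
    {("Name", "Full Name"): 0.9},
    {("Name", "Full Name"): 0.5, ("Age", "Age"): 1.0},
]
averaged = average_confidence(roll)
```

A mapping that is weak in one table is pulled up by stronger evidence from the other tables on the roll.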
Verify Results
Database: Full Name …
…
INSERT INTO Person (Full Name) VALUES ('335,114,521,172')
INSERT INTO Person (Full Name) VALUES ('335,173,521,231')
…
SQL Statements Insert Value Cell Coordinates
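The statements above can be generated and checked with a short script. This sketch assumes a `Person` table with a single text column named `Full Name`; the schema is not shown on the slides. The column name is double-quoted because it contains a space, and `insert_statement` is a hypothetical helper.

```python
import sqlite3


def insert_statement(table, column, rect):
    """Build one INSERT whose value is the cell's coordinate string."""
    coords = ",".join(str(v) for v in rect)
    return f"INSERT INTO {table} (\"{column}\") VALUES ('{coords}')"


statements = [
    insert_statement("Person", "Full Name", (335, 114, 521, 172)),
    insert_statement("Person", "Full Name", (335, 173, 521, 231)),
]

# Run the statements against a throwaway in-memory database (assumed schema).
connection = sqlite3.connect(":memory:")
connection.execute('CREATE TABLE Person ("Full Name" TEXT)')
for statement in statements:
    connection.execute(statement)
rows = connection.execute('SELECT "Full Name" FROM Person').fetchall()
```

Building the statement text by hand is only reasonable here because the values are integers produced by the program; for any other data, parameterized queries are the safer choice.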
“Training” Set Results
Relationship                         Precision   Recall   Accuracy
Label Cell Describes Value Cell      100%        100%     100%
Value Cells in Same Row or Column    100%        100%     100%
Multilevel Labels                    100%        100%     100%
Label Cells – Object Set Matches     100%        100%     100%
Factoring                            74.45%      100%     84.65%
SQL Fields                           99.42%      100%     99.71%
Ambiguous Factoring
Experiments
• 75 tables from 15 different microfilm rolls
• Precision, recall, and accuracy
– Populated SQL fields
– Each relationship
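Precision and recall have standard definitions, sketched below over sets of predicted and ground-truth items (for example populated SQL fields). The slides do not define how their accuracy figure is computed, so it is left out here; the function name and sample data are illustrative.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against the ground truth."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical field identifiers, not data from the experiments.
predicted_fields = {"a", "b", "c", "d"}
actual_fields = {"a", "b", "c", "e", "f"}
precision, recall = precision_recall(predicted_fields, actual_fields)
```

Three of the four predicted fields are correct (precision 0.75), and three of the five true fields were found (recall 0.6).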
Test Set Results
Relationship                         Precision   Recall   Accuracy
Label Cell Describes Value Cell      100%        98.12%   98.12%
Value Cells in Same Row or Column    100%        100%     100%
Multilevel Labels                    100%        99.67%   99.82%
Label Cells – Object Set Matches     84.98%      92.76%   88.18%
Factoring                            100%        93.40%   93.47%
SQL Fields                           93.20%      92.41%   92.15%
Factoring over Several Tables Improved Results
Some Long Label Names Caused Confusion
State here the particular Religion or Religious Denomination, to which each persons belongs. [Members of Protestant Denominations are requested not to describe themselves by the vague term ‘Protestant,’ but to enter the name of the Particular Church, Denomination, or Body, to which they belong.]
Ambiguous Columns Caused Confusion
Full Name
Conclusions
• Identified records in microfilm tables
– Geometric and ontological properties
– Evidence matrices & corroboration rules
• Accuracy: ~92%
http://www.rdhd.byu.edu
http://www.fht.byu.edu