Date post: 23-Dec-2015
Enabling Efficient Chinese Jiapu Information Extraction Stephen W. Liddle BYU Information Systems Department Derek Dobson, David W. Embley, Chuck Liu FamilySearch
Transcript
Page 1

Enabling Efficient Chinese Jiapu Information Extraction

Stephen W. Liddle
BYU Information Systems Department

Derek Dobson, David W. Embley, Chuck Liu
FamilySearch

Page 2

Chinese Jiapu: Clan Family Record

• 12,871,979 Jiapu images
• ~ half a billion fact assertions
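
To give a feel for the scale, the two figures on this slide imply an average density of assertions per image. The sketch below only does that division; the assertion count is the slide's rounded estimate, and the function name is introduced here for illustration.

```python
# Figures quoted on the slide. The assertion count is approximate
# ("~ half billion"), so the result is only a rough average.
JIAPU_IMAGES = 12_871_979
FACT_ASSERTIONS = 500_000_000  # rounded estimate, not an exact count


def assertions_per_image(assertions: int, images: int) -> float:
    """Average number of fact assertions carried by one Jiapu image."""
    return assertions / images


if __name__ == "__main__":
    average = assertions_per_image(FACT_ASSERTIONS, JIAPU_IMAGES)
    print(f"about {average:.1f} fact assertions per image")
```

This works out to a little under 39 assertions per image on average, which suggests why per-image extraction effort matters at this volume.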

Page 3

COMET: Click-Only (or at least Mostly) Extraction Tool

Page 4

COMET with Jiapu

Page 5

COMET with Jiapu

Page 6

COMET with Jiapu

Page 7

Structured Data Stored for Query and Search

Page 8

Requirements for COMET to work well with Jiapu

• Good alignment

• Good OCR

Page 9

Requirements for COMET to work well with Jiapu

• Good alignment

• Good OCR

OK

not OK

Page 10

Requirements for COMET to work well with Jiapu

• Good alignment

• Good OCR

four characters as one

missing OCR

bad OCR

many incorrect characters

??
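
The OCR problems labeled on this slide ("four characters as one", "missing OCR") could be flagged automatically before a user ever clicks. The following is a minimal sketch, not the authors' method. It assumes each OCR result arrives as a bounding box plus recognized text, and that Jiapu glyphs are roughly square, so a box much longer than it is wide probably spans several glyphs. `OcrBox`, `estimate_glyph_span`, and `diagnose` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class OcrBox:
    """Hypothetical OCR result: recognized text and its bounding box in pixels."""
    text: str
    x: float
    y: float
    width: float
    height: float


def estimate_glyph_span(box: OcrBox) -> int:
    """Estimate how many roughly square glyphs the box covers along its long side.

    Returns 0 for a degenerate box with no area.
    """
    short_side = min(box.width, box.height)
    long_side = max(box.width, box.height)
    if short_side <= 0:
        return 0
    return max(1, round(long_side / short_side))


def diagnose(box: OcrBox) -> str:
    """Classify a box as missing OCR, degenerate, a possible merge, or plausible."""
    if not box.text.strip():
        return "missing OCR"
    span = estimate_glyph_span(box)
    if span == 0:
        return "degenerate box"
    if span > len(box.text):
        # The box is large enough for several glyphs but holds fewer characters.
        return f"possible merge: box spans ~{span} glyphs, text has {len(box.text)}"
    return "plausible"


if __name__ == "__main__":
    samples = [
        OcrBox("王", 0, 0, 40, 42),    # one glyph in a near-square box
        OcrBox("張", 0, 50, 40, 160),  # one character reported for a 4-glyph-tall box
        OcrBox("", 0, 220, 40, 40),    # engine returned nothing
    ]
    for sample in samples:
        print(repr(sample.text), "->", diagnose(sample))
```

A heuristic like this cannot detect "many incorrect characters"; that case needs a second opinion, such as another OCR engine or a human reviewer.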

Page 11

OCR Resolution

• Currently insufficient:
– Acrobat Professional
– Tesseract
– Abbyy

• Is there a better OCR engine for Chinese?
• Can we do better with what we have?
– Image enhancement
– Resize glyphs
– Use an ensemble of OCR engines
– Train Tesseract for Jiapu peculiarities
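
One way to picture the "ensemble of OCR engines" idea is per-glyph majority voting. The sketch below is an illustration under stated assumptions, not a description of any engine's API: it assumes the engines' outputs have already been aligned so that position j refers to the same glyph in every reading, and it represents "no output" as `None`. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence


def vote_character(candidates: Sequence[Optional[str]]) -> Optional[str]:
    """Return the character a strict majority of responding engines agree on.

    Engines that produced nothing (None or empty string) do not vote.
    Returns None when no engine responded or when there is no strict majority.
    """
    votes = Counter(c for c in candidates if c)
    if not votes:
        return None
    char, count = votes.most_common(1)[0]
    if count * 2 > sum(votes.values()):
        return char
    return None


def vote_column(readings: Sequence[Sequence[Optional[str]]]) -> List[Optional[str]]:
    """Combine readings[i][j] (engine i, glyph position j) into one reading."""
    return [vote_character(position) for position in zip(*readings)]


if __name__ == "__main__":
    # Three hypothetical engine outputs for the same three-glyph column.
    readings = [
        ["王", "大", "明"],
        ["王", "太", "明"],
        ["王", "大", None],
    ]
    print(vote_column(readings))
```

Positions that come back as `None` are the ones to route to a human, which fits the verify-and-correct step. Voting only helps when the engines make different mistakes, and it depends on the per-glyph alignment that the slides identify as a separate problem.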

Page 12

Alignment Resolution

• Currently
– Open source PDFBox: needs work for Acrobat Pro
– Tesseract & Abbyy interact incorrectly with PDFBox

• Engineering solution (but lots of work)
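
Alignment matters for a click-driven tool because a click on the page image has to be resolved to the OCR text underneath it. The sketch below illustrates that lookup and what happens when the text layer is offset from the image. It is written in Python for illustration and does not use PDFBox; `TextBox`, `text_at_click`, and `shifted` are hypothetical names, and the coordinates are made up.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class TextBox:
    """Hypothetical text-layer entry: a character and its box in image pixels."""
    text: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        # Half-open intervals so adjacent boxes never both claim a point.
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def text_at_click(boxes: Sequence[TextBox], px: float, py: float) -> Optional[str]:
    """Return the text of the first box containing the click, or None."""
    for box in boxes:
        if box.contains(px, py):
            return box.text
    return None


def shifted(boxes: Sequence[TextBox], dx: float, dy: float) -> List[TextBox]:
    """Simulate a misaligned text layer by offsetting every box."""
    return [TextBox(b.text, b.x + dx, b.y + dy, b.width, b.height) for b in boxes]


if __name__ == "__main__":
    # A vertical column of three 40x40 glyphs.
    column = [
        TextBox("王", 0, 0, 40, 40),
        TextBox("大", 0, 40, 40, 40),
        TextBox("明", 0, 80, 40, 40),
    ]
    click = (20, 60)  # the user clicks the middle glyph on the image
    print("aligned:   ", text_at_click(column, *click))
    print("misaligned:", text_at_click(shifted(column, 0, 30), *click))
```

With a well-aligned layer the click returns the middle character; with the layer shifted 30 pixels down, the same click returns the character above it. An error of this kind is silent, since the tool still returns text, just the wrong text.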

Page 13

Automating Jiapu Data Extraction: Future Work

Page 14

Verify & Correct: Future Work

Page 15

Conclusions

• “Failed” experiment?
• A way out
– Alignment engineering
– OCR: find or do R&D

• Potential
– Click-Only (Mostly) Extraction: a win
– Semi-automatic extraction: a big win
