A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System
Alan Wessman
Brigham Young University
MS Thesis Defense
Based in part on research funded by the National Science Foundation.
2
Presentation Overview
Background of legacy Ontos
Assumptions, challenges, concerns
Framework as solution
Explain framework
Explain reference implementation
Evaluation of system
Future work and conclusion
3
Data Extraction
Goals of data extraction
  Find relevant data in unstructured or semi-structured documents
  Map extracted data to a formal structure
Approaches
  Wrappers (ROADRUNNER, TSIMMIS)
  NLP and machine learning (RAPIER, WHISK)
  Ontologies (Ontos)
4
Ontos
Developed by Data Extraction Group (DEG) at BYU
Based on OSM ontologies and data frames
Focuses on multiple-record extraction
Good precision/recall
Resilient to document changes
5
How Ontos Works
6
Ontos Assumptions
OSML ontologies
Single- or multiple-record text documents
Each document/record relevant to domain
Heuristics produce accurate mappings
Output to relational database
7
Some Current Challenges
Challenge | Example
New/evolving ontology features | Enhanced data frames
Variety of documents | PDF, plaintext, XML
Content filtering | Extract from certain HTML attributes (ALT, SRC, HREF)
Locating values | On-the-fly lexicon
Optimizing mappings | Better heuristics; HMM-based mapping
8
Architectural Concerns
Variety of technologies
Different OSM representations
Highly coupled code
Difficult to install elsewhere
Difficult to upgrade or extend
9
Thesis Statement
A framework for data extraction can give us a flexible and configurable platform for conducting data-extraction research.
We can re-implement Ontos under the framework, which will let us adapt the system to particular research needs without ongoing massive rewrites.
10
Frameworks
Abstract architecture
Decouple independent functions
Define interfaces
Use abstract classes, interfaces, declarative configuration files
Allow quick adjustment of system settings without re-coding
Make a system customizable
Image from http://www.mcoe.org
11
Creating an Extraction Framework
Analyze systems
Generalize functionality
Define interfaces
Create supporting code
Document framework
[Figure: framework class diagram. Labels: DataExtractionEngine (public void doExtraction()); ExtractionPlan (execute()); ExtractionAlgorithm; "uses"; config parameters; dynamically loaded components: DocumentRetriever, DocumentStructureRecognizer, DocumentStructureParser, ContentFilter, ValueRecognizer, ValueMapper, OntologyWriter]
12
Managing the Process
DataExtractionEngine
  Main class
  Initialize, perform extraction, finalize
ExtractionPlan
  Defines order of steps in the extraction process
  Can be imperative, declarative, or dynamic (like SQL execution plan)
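The division of labor between the engine and the plan can be illustrated with a minimal sketch. Only the names DataExtractionEngine, doExtraction(), ExtractionPlan, and execute() come from the slides; the parameterless execute() signature, the FixedOrderPlan class, and the logging fields are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Named in the slides; the parameterless signature is an assumption.
interface ExtractionPlan {
    void execute();
}

// Hypothetical imperative plan: the order of steps is fixed in code.
class FixedOrderPlan implements ExtractionPlan {
    final List<String> log = new ArrayList<>();

    @Override
    public void execute() {
        // Placeholder steps; a real plan would invoke the framework components here.
        log.add("retrieve");
        log.add("parse");
        log.add("filter");
        log.add("recognize");
        log.add("map");
        log.add("write");
    }
}

// Main class: initialize, perform extraction, finalize.
class DataExtractionEngine {
    private final ExtractionPlan plan;
    final List<String> lifecycle = new ArrayList<>();

    DataExtractionEngine(ExtractionPlan plan) {
        this.plan = plan;
    }

    public void doExtraction() {
        lifecycle.add("initialize");
        try {
            plan.execute();
            lifecycle.add("extract");
        } finally {
            // Finalization runs even if the plan throws.
            lifecycle.add("finalize");
        }
    }
}
```

Because the engine depends only on the ExtractionPlan interface, a declarative or dynamic plan could presumably be substituted without changing the engine.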
13
Handling Documents
DocumentRetriever
  Responsible for locating relevant documents
  Search engine, local filesystem, CMS
DocumentStructureRecognizer
  Decides which DocumentStructureParser to use
DocumentStructureParser
  Breaks document into individual records or sub-documents
  Record separator, table analyzer
ContentFilter
  Normalizes document text
  Strips out unwanted markup, stopwords, etc.
[Figure: component interfaces in pipeline order]
DocumentRetriever
  public Iterator retrieveDocuments()
  Input: URI
DocumentStructureRecognizer
  public DocumentStructureParser getDocumentParser(Document doc)
  Parser types: Single, Multi, Tabular, Hierarchical, ...
DocumentStructureParser
  public Document parse(Document doc)
ContentFilter
  public Document filterDocument(Document doc)
  Example: <p>Price:<br>$452.00</p> becomes Price:$452.00
ValueRecognizer
  public void findValues(Ontology ont, Document doc)
  Example: with the ontology, Price:$452.00 is recognized as keyword "Price:" and value "$452.00"
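The ContentFilter example from this slide (markup in, plain text out) can be sketched as follows. The filterDocument signature is taken from the slide; the Document class shown here is a stand-in that models only the text, and SimpleTagFilter is a hypothetical implementation, not the framework's own filter.

```java
import java.util.regex.Pattern;

// Stand-in for the framework's Document type; only the text is modeled (assumption).
class Document {
    private final String text;

    Document(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }
}

// Signature as shown on the slide.
interface ContentFilter {
    Document filterDocument(Document doc);
}

// Hypothetical filter that deletes anything shaped like an HTML tag.
// A regular expression covers the slide's example but is not a general HTML parser.
class SimpleTagFilter implements ContentFilter {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    @Override
    public Document filterDocument(Document doc) {
        return new Document(TAG.matcher(doc.getText()).replaceAll(""));
    }
}
```

Applied to the slide's input `<p>Price:<br>$452.00</p>`, this filter yields `Price:$452.00`.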
14
Extracting Values
ValueRecognizer
  Uses matching rules defined in ontology
  Produces set of candidate matches (like data record table)
ValueMapper
  Accepts or rejects candidate matches
  Assigns accepted matches to elements of the ontology (e.g., object sets)
OntologyWriter
  Emits ontology structure and/or extracted data in an output format (e.g., XML, SQL)
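The recognize-then-map split described above can be sketched roughly as follows. All class names, the rule structure, and the keyword-proximity acceptance test are assumptions chosen for illustration; the slides do not specify how candidates are stored or which criteria the mapper applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One row of a candidate-match table: object set, matched text, and position.
class CandidateMatch {
    final String objectSet;
    final String text;
    final int start;

    CandidateMatch(String objectSet, String text, int start) {
        this.objectSet = objectSet;
        this.text = text;
        this.start = start;
    }
}

// Hypothetical matching rule: keyword and value patterns for one object set.
class MatchingRule {
    final String objectSet;
    final Pattern keyword;
    final Pattern value;

    MatchingRule(String objectSet, String keywordRegex, String valueRegex) {
        this.objectSet = objectSet;
        this.keyword = Pattern.compile(keywordRegex);
        this.value = Pattern.compile(valueRegex);
    }
}

class RegexValueRecognizer {
    // Every value-pattern match becomes a candidate; nothing is rejected here.
    static List<CandidateMatch> findValues(MatchingRule rule, String text) {
        List<CandidateMatch> candidates = new ArrayList<>();
        Matcher m = rule.value.matcher(text);
        while (m.find()) {
            candidates.add(new CandidateMatch(rule.objectSet, m.group(), m.start()));
        }
        return candidates;
    }
}

class KeywordProximityMapper {
    // Accept a candidate only if a keyword ends at most maxGap characters before it.
    static List<CandidateMatch> accept(MatchingRule rule, String text,
                                       List<CandidateMatch> candidates, int maxGap) {
        List<Integer> keywordEnds = new ArrayList<>();
        Matcher k = rule.keyword.matcher(text);
        while (k.find()) {
            keywordEnds.add(k.end());
        }
        List<CandidateMatch> accepted = new ArrayList<>();
        for (CandidateMatch c : candidates) {
            for (int end : keywordEnds) {
                int gap = c.start - end;
                if (gap >= 0 && gap <= maxGap) {
                    accepted.add(c);
                    break;
                }
            }
        }
        return accepted;
    }
}
```

Keeping rejection out of the recognizer mirrors the slide's separation of concerns: the recognizer over-generates, and the mapper decides which candidates survive.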
15
Implementing the Framework
[Figure: the framework pipeline with its reference implementation]
Application ontology: object sets, relationship sets, and constraints; value and keyword matching rules
Framework pipeline: source descriptor; DocumentRetriever; Document; DocumentStructureRecognizer; DocumentStructureParser; Documents; ContentFilter; ValueRecognizer; candidate matches; ValueMapper; extracted objects and relationships; OntologyWriter; structure output; data output
Reference implementation: URI; LocalDocumentRetriever; DOMDocument; (no DocumentStructureRecognizer); FanoutRecordSeparator; TextDocument; HTMLFilter; DataFrameMatcher; OSMX ontology; HeuristicBasedMapper; ObjectRelationshipWriter; (no structural output); HTML representation
16
OSMX
Legacy Ontos: OSML
OntologyEditor: OSM.dtd
New standard is OSMX
  XML Schema (better constraints; validation)
  JAXB generates corresponding Java classes
  Common language for DEG tools
  Allows data to be stored inline with model
17
Managing the Process
OntosEngine
  Main class for Ontos system
  Takes parameters from command line or configuration file
OntosExtractionPlan
  Sequentially retrieves, parses, filters, and extracts from individual documents
  Imperative (hard-coded) algorithm
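A sequential, hard-coded plan of the kind described for OntosExtractionPlan might look like the sketch below. The SequentialPlan class and its use of strings and java.util.function.Function in place of the framework's component interfaces are simplifying assumptions, not the actual Ontos code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical imperative plan: each document is parsed into records,
// each record is filtered, and values are extracted from the filtered text.
class SequentialPlan {
    private final Iterable<String> documents;               // stands in for DocumentRetriever output
    private final Function<String, List<String>> parser;    // stands in for DocumentStructureParser
    private final Function<String, String> filter;          // stands in for ContentFilter
    private final Function<String, List<String>> extractor; // stands in for recognizer plus mapper

    SequentialPlan(Iterable<String> documents,
                   Function<String, List<String>> parser,
                   Function<String, String> filter,
                   Function<String, List<String>> extractor) {
        this.documents = documents;
        this.parser = parser;
        this.filter = filter;
        this.extractor = extractor;
    }

    // The step order is fixed here, which is what makes the plan imperative.
    List<String> execute() {
        List<String> extracted = new ArrayList<>();
        for (String doc : documents) {
            for (String record : parser.apply(doc)) {
                extracted.addAll(extractor.apply(filter.apply(record)));
            }
        }
        return extracted;
    }
}
```

Changing the order of steps in such a plan requires editing the code, which is the trade-off the slide labels "imperative (hard-coded)".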
18
Handling Documents
LocalDocumentRetriever
  Retrieves documents from local filesystem
  Filename filter excludes irrelevant files
FanoutRecordSeparator
  Implements DocumentStructureParser
  Locates record boundaries and creates sub-documents
HTMLFilter
  Removes all HTML markup from documents
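To illustrate what "locates record boundaries and creates sub-documents" means in practice, here is a deliberately simplified sketch that splits on a separator tag supplied by the caller. It is not the fan-out heuristic that FanoutRecordSeparator's name refers to, which the slides do not detail; the class name and method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical record splitter: the separator tag is given, not discovered.
class SeparatorTagSplitter {
    static List<String> split(String html, String tagName) {
        // Matches the opening form of the tag, with or without attributes, in any case.
        Pattern separator = Pattern.compile(
                "<" + Pattern.quote(tagName) + "\\b[^>]*>", Pattern.CASE_INSENSITIVE);
        List<String> records = new ArrayList<>();
        for (String piece : separator.split(html)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                records.add(trimmed); // each non-empty piece becomes one sub-document
            }
        }
        return records;
    }
}
```

A real separator would first have to decide which tag marks the boundaries; this sketch covers only the splitting step that follows that decision.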
19
Recognizing Values: DataFrameMatcher
Uses data frame enhancements:
  Keyword affinity (left and right)
  Require context for left, right, or both
  Value phrase-specific keywords
  Link matches back to specific patterns
Other improvements:
  Consistent regular expression handling
  Unlimited recursive macro definition
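Recursive macro definition can be pictured with the following sketch, in which a macro body may itself reference other macros. The `{Name}` reference syntax, the MacroExpander class, and the cycle check are assumptions; the slides do not show how data frames actually write macros.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical expander for regex macros written as {Name}.
class MacroExpander {
    private static final Pattern REFERENCE = Pattern.compile("\\{([A-Za-z_]+)\\}");

    static String expand(String expression, Map<String, String> macros) {
        return expand(expression, macros, new HashSet<String>());
    }

    private static String expand(String expression, Map<String, String> macros,
                                 Set<String> inProgress) {
        Matcher m = REFERENCE.matcher(expression);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String body = macros.get(name);
            if (body == null) {
                // Unknown names are left untouched, so constructs like \p{Alpha} survive.
                m.appendReplacement(out, Matcher.quoteReplacement(m.group()));
                continue;
            }
            if (!inProgress.add(name)) {
                throw new IllegalArgumentException("Cyclic macro definition: " + name);
            }
            String expanded = expand(body, macros, inProgress);
            inProgress.remove(name);
            // Wrap in a non-capturing group so following quantifiers apply to the whole macro.
            m.appendReplacement(out, Matcher.quoteReplacement("(?:" + expanded + ")"));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Nesting depth is bounded only by the call stack, while the in-progress set stops a macro that refers to itself from recursing forever.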
20
Mapping Values: HeuristicBasedMapper
New algorithm
Fully recursive wrt ontology structure
ContextualHeuristic generates objects
Connection-based heuristics (singleton, nested-group, etc.) generate relationships
See paper for additional details
21
Output
Human-readable HTML format
Easier to count correct, partial, incorrect mappings

DeceasedPerson osmx3113
• has DeceasedName Sandoval, Ernesto J.
• has DeathDate October 7, 2004
• has BirthDate November 9, 1923
• has Age 63
• DeceasedPerson has Relationship to RelativeName
  • RelativeName Agullar Sandoval
  • Relationship daughter
• DeceasedPerson has Relationship to RelativeName
  • RelativeName Lalo Sandoval
  • Relationship brother
22
Using the Framework and Reference Implementation
Adding new features
  Create new implementation classes
  Extend (subclass) existing implementations
Switching feature set
  Change class name in config file
  Override class on command line
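Switching an implementation by class name can be sketched with standard Java reflection. The property key, the TextFilter interface, the two filter classes, and the use of a JVM system property as the command-line override are all assumptions for illustration; the slides do not show the framework's actual configuration format.

```java
import java.util.Properties;

// Simplified stand-in for a framework component interface (assumption).
interface TextFilter {
    String filter(String text);
}

class PassThroughFilter implements TextFilter {
    @Override
    public String filter(String text) {
        return text;
    }
}

class UpperCaseFilter implements TextFilter {
    @Override
    public String filter(String text) {
        return text.toUpperCase();
    }
}

// Hypothetical loader: picks the implementation class named in the configuration.
class ComponentLoader {
    static <T> T load(Properties config, String key, Class<T> type)
            throws ReflectiveOperationException {
        // A system property such as -DcontentFilter=SomeClass overrides the config file.
        String className = System.getProperty(key, config.getProperty(key));
        if (className == null) {
            throw new IllegalArgumentException("No class configured for " + key);
        }
        Class<?> implementation = Class.forName(className);
        // Requires a no-argument constructor; cast() fails fast on a wrong type.
        return type.cast(implementation.getDeclaredConstructor().newInstance());
    }
}
```

With a loader like this, replacing one component amounts to editing a single configuration value, for example changing `contentFilter` from one class name to another, with no recompilation of the engine.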
23
Evaluating the Framework
Object set | New Ontos recall | New Ontos precision | Legacy Ontos recall | Legacy Ontos precision
Age | 60% | 50% | 57% | 38%
FuneralDate | 68% | 76% | 63% | 75%
Viewing | 80% | 63% | 93% | 18%
Relationship/RelativeName | 74% | 43% | 73% | 41%

Four of eighteen object sets shown above.
Data from Salt Lake Tribune and Arizona Daily Star
Input:
  Obituaries ontology
  25 obituaries from two newspapers
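For reference, recall and precision are conventionally computed from three counts, as in the sketch below. The class name and the counts in the usage note are hypothetical, and how the evaluation scored partially correct mappings is not stated in the slides, so this sketch counts only fully correct ones.

```java
// Standard definitions: recall = correct / expected, precision = correct / extracted.
class ExtractionScore {
    // Fraction of the values that should have been found that were found correctly.
    static double recall(int correct, int expected) {
        return expected == 0 ? 0.0 : (double) correct / expected;
    }

    // Fraction of the values the system extracted that were correct.
    static double precision(int correct, int extracted) {
        return extracted == 0 ? 0.0 : (double) correct / extracted;
    }
}
```

As a made-up example, a system that extracts 12 values for an object set, 6 of them correct, when 10 were expected, has recall 6/10 = 60% and precision 6/12 = 50%.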
24
Statistics about the System
Component | Files | Lines of code*
Framework | 38 | 2,868
OntologyEditor | 141 | 22,249
OSMX (XML Schema) | 1 | 1,918
OSMX (Java)** | 60 | 6,912
Ontos | 29 | 6,295

* Includes comments and whitespace.
** JAXB-generated classes add 197 files and 62,888 lines of code.
25
Future Work
Algorithm improvements
  On-the-fly lexicons
  Machine learning techniques
  Confidence values
  Canonicalization
  Expected participation cardinality
  Negative-indicator keywords
Integration
  Online search engines
  Semantic Web annotator and query engine
  Web interface to extraction engine
26
Contributions
Design and construction of a data-extraction framework
Reference implementation
  Ontos upgrade
  Pattern for future use of framework
OSMX
  Standardized storage format
  http://www.deg.byu.edu/xml/osmx.xsd
27
Contributions
Uniform codebase and language
OntologyEditor migration
  New graphics classes
  Extended data frame support
Modular heuristic-based mapper
Concept of extraction plans
Flexible research platform
28
Conclusion
Framework gives us the flexibility we need for further data-extraction research
Framework is capable of supporting Ontos functionality
OSMX and reference implementation provide solid base for future research applications