Ontology based Information Extraction
Jin Mao, Postdoc, School of Information, University of Arizona
Oct. 9th, 2015
Outline
Information Extraction
The process of obtaining pertinent information (facts) from documents.
Example: "The forest area in India extended to about 75 million hectares, which in terms of geographical area is approximately 22 percent of the total land."
What is the relationship between forest area and geographical area?
Ontology Based Information Extraction (OBIE)
Ontology Based Information Extraction (Wimalasuriya and Dou, 2010)
Ontology-driven Information Extraction (Yildiz and Miksch, 2007)
The same as Ontology Based Information Extraction
Open question: whether the ontology is part of the system (Yildiz and Miksch, 2007)
Terminology
Ontology Based Information Extraction (OBIE)
Process unstructured or semi-structured natural language text
Present the output using ontologies
Ontology as input (Li and Bontcheva, 2007), released
Use an IE process guided by an ontology
No new IE method: an existing one is oriented to identify the components of an ontology (classes, properties and instances)
Extractors belong to an ontology? Linguistic rules
Key Characteristics
Ontology Based Information Extraction (OBIE)
An ontology helps to clarify a domain’s semantics. E.g., concepts and their relationships
To alleviate a wide variety of natural language ambiguities
Why
Ontology Based Information Extraction (OBIE)
Business Intelligence (BI) in e-business
Social media, e.g., Twitter
Metadata Generation for digital resources.
……
Applications
Common Architectures
Information Extraction: identify instances from the ontology in the text.
Classes, Instances, Mentions, Properties, Property Values
Free texts in natural language.
Example 1: Classical fried egg Mycoplasma-type colonies were not observed on 1% agar medium.
Example 2: The cells are not motile, are not lysed in 1% SDS (wt/vol), and stain Gram positively.
Major Challenges
Common Architectures
Ontology Enhancement / Updating
Upgrade the ontology with new instances to better cover the knowledge in a domain
Not in the common architecture.
Major Challenges
Common Architectures
General Architecture
Common Architectures
First Step
Define the semantic elements to be extracted
An example (Muller et al., 2004)
Concept (C): named entities for every part of the human body, such as heart, lung, kidney…
Name of Disease (N): words or phrases of disease names.
Description (D): any words or phrases that describe Concepts. “Description” refers to any kind of words or phrases that relate semantically to Concepts.
Pair of Concept and Description (P): all possible combinations of Concepts and Descriptions. The combinations carry the full meaning of the relationships between C and D.
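As a small illustration of the Pair (P) element, the candidate pairs can be enumerated as the Cartesian product of the Concept and Description lists. This is only a sketch: the concept names come from the slide, while the description words and variable names are hypothetical.

```python
from itertools import product

# Concept (C): body parts named on the slide.
concepts = ["heart", "lung", "kidney"]
# Description (D): hypothetical words that could describe a Concept.
descriptions = ["enlarged", "inflamed"]

# Pair (P): every possible combination of a Concept with a Description.
pairs = list(product(concepts, descriptions))

for concept, description in pairs:
    print(f"{description} {concept}")
```

With three concepts and two descriptions this yields six candidate pairs, which a later step would have to filter down to the ones actually supported by the text.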
Information Extraction Methods
Using regular expressions/patterns
(watched|seen) <NP>, where <NP> is a noun phrase (part-of-speech tag)
Implemented using finite-state transducers, which consist of a series of finite-state automata
Automatically generate regular rules: “[Ii]nteract(s|ed|ing)?” matches “interact,” “interacts,” “interacted,” “interacting,” “Interact,” “Interacts,” “Interacted,” and “Interacting.”
Simple, surprisingly good results
Linguistic rules
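The two patterns above can be tried directly with Python's `re` module. This is a minimal sketch, not the implementation of any cited system: the word-boundary anchors are added here, and the noun phrase in `(watched|seen) <NP>` is approximated by a run of capitalized words, which is an assumption standing in for a real part-of-speech tagger. The function names are hypothetical.

```python
import re

# The rule from the slide, with word boundaries added so that
# words such as "interaction" are not matched.
INTERACT = re.compile(r"\b[Ii]nteract(s|ed|ing)?\b")

# "(watched|seen) <NP>": the noun phrase is approximated by one or
# more capitalized words (an assumption, not a real NP chunker).
VIEWED = re.compile(r"\b(watched|seen)\s+([A-Z]\w*(?:\s+[A-Z]\w*)*)")


def find_interactions(text):
    """Return every surface form of 'interact' found in the text."""
    return [m.group(0) for m in INTERACT.finditer(text)]


def viewed_title(text):
    """Return the capitalized phrase after 'watched' or 'seen', if any."""
    match = VIEWED.search(text)
    return match.group(2) if match else None


print(find_interactions("A interacts with B; they interacted before."))
print(viewed_title("I have seen Blade Runner twice"))
```

The first call returns `['interacts', 'interacted']` and the second returns `'Blade Runner'`.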
Information Extraction Methods
Automatically mine extraction rules from text
A dictionary inductive learning algorithm (Vargas-Vera et al., 2001)
Finding the longest common subsequence (Romano et al., 2006)
Relational learning (Califf and Mooney, 1999), a bottom-up learning approach
Linguistic rules
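To make the longest common subsequence idea concrete, the sketch below computes the LCS of two tokenized sentences with the textbook dynamic-programming algorithm. It shows only the shared-token skeleton from which a pattern could be generalized; it is not the procedure of Romano et al. (2006), and the two example sentences are hypothetical.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of token lists a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back through the table to recover the tokens themselves.
    result = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


# Hypothetical negated findings, in the spirit of medical narrative reports.
s1 = "no evidence of acute fracture".split()
s2 = "no radiographic evidence of fracture".split()
print(longest_common_subsequence(s1, s2))
```

For these two sentences the shared skeleton is `['no', 'evidence', 'of', 'fracture']`; the gaps between the shared tokens are where a learned pattern would need wildcards.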
Information Extraction Methods
To recognize individual words or phrases
widely used in the named-entity recognition
E.g., to recognize states of the US or countries of the world
Conditions:
Specify exactly what is being identified by the gazetteer.
Specify where the information for the gazetteer lists was obtained from.
Gazetteer Lists
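A gazetteer lookup can be sketched as a greedy longest-match scan over the tokens. The list below is a tiny hypothetical excerpt of US state names, and the function name and `max_len` parameter are assumptions made for this illustration.

```python
# Hypothetical excerpt of a gazetteer of US states (lowercased).
US_STATES = {"arizona", "new york", "new mexico", "texas"}


def match_gazetteer(text, gazetteer, max_len=3):
    """Return gazetteer entries found in text, preferring longer matches."""
    tokens = text.split()
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower().strip(".,;:")
            if candidate in gazetteer:
                matches.append(candidate)
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return matches


print(match_gazetteer("He moved from New York to Arizona.", US_STATES))
```

The call returns `['new york', 'arizona']`. Trying longer phrases first keeps "New York" from being split, and the two conditions on the slide correspond to documenting what `US_STATES` covers and where its entries came from.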
Information Extraction Methods
Linguistic features such as POS tags, capitalization information and individual words
Treat parts of IE as classification problems:
whether a word token is the start/end of an entity (Li et al., 2004)
identify different components of an ontology such as instances (Li and Bontcheva, 2007) and property values (Wu and Weld, 2007)
Classification Techniques
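The start/end framing can be sketched as turning each token into a feature dictionary plus two binary labels, which any standard classifier could then be trained on. The feature set and function names here are hypothetical, POS tags are left out because they would need an external tagger, and no classifier is trained in this sketch.

```python
def token_features(tokens, i):
    """Simple features for the token at position i (no POS tagger used)."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_digit": tok.isdigit(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }


def boundary_labels(tokens, spans):
    """Turn entity spans (start, end), end inclusive, into start/end labels."""
    starts = [False] * len(tokens)
    ends = [False] * len(tokens)
    for start, end in spans:
        starts[start] = True
        ends[end] = True
    return starts, ends


tokens = ["University", "of", "Arizona", "is", "in", "Tucson"]
spans = [(0, 2), (5, 5)]  # hypothetical gold entities
features = [token_features(tokens, i) for i in range(len(tokens))]
starts, ends = boundary_labels(tokens, spans)
```

One classifier would be trained on `(features, starts)` and another on `(features, ends)`; predicted start and end tokens are then paired up to form entity spans.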
Information Extraction Methods
A semantically annotated parse tree for the text as a part of the IE process
Linguistic extraction rules with partial parse trees (Todirascu et al., 2002).
Syntax/Shallow NLP
Ontology Construction
to consider the ontology as an input to the system
to construct an ontology as a part of the OBIE process
Ontology Enhancement
Update the ontology by adding new classes and properties through the IE process, NOT instances and their property values.
Such systems include the implementations by Maedche et al. (2003) and Dung and Kameyama (2007).
Fuzzy Relationship Rule: define rules according to the relationships among semantic elements.
o Generate a suggestion list for the domain experts to extract real semantic elements.
Performance Evaluation
Measure the accuracy of identifying instances and property values.
Most IE systems face a trade-off between improving precision and recall.
When β² < 1, precision (P) is weighted more heavily than recall (R).
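The trade-off can be made concrete with the F-measure. The slide's own formula did not survive extraction, so the sketch below assumes the standard definition F = (1 + β²)·P·R / (β²·P + R), under which β < 1 favours precision and β > 1 favours recall. The function names are hypothetical.

```python
def precision_recall(extracted, gold):
    """Precision and recall of a set of extracted items against a gold set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """Standard F-measure (assumed form): beta < 1 favours precision."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# A system with high precision but low recall.
p, r = 0.8, 0.4
print(round(f_measure(p, r, beta=1.0), 4))  # balanced
print(round(f_measure(p, r, beta=0.5), 4))  # precision weighted more
print(round(f_measure(p, r, beta=2.0), 4))  # recall weighted more
```

For P = 0.8 and R = 0.4 the three scores are 0.5333, 0.6667 and 0.4444, so the same system looks better as β shrinks and precision counts for more.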
Performance Evaluation
Evaluation in different scales (Maynard et al., 2004)
Each answer is categorized as correct or incorrect; however, different degrees of correctness should be allowed.
Learning Accuracy (LA): measures the closeness of the assigned class label to the correct class label based on the hierarchy of the ontology (Cimiano et al., 2005).
Multi-dimensional evaluation beyond Precision and Recall
Performance Evaluation
Cost-based metrics (Maynard et al., 2004)
A cost would typically be associated with a miss and a false alarm (spurious answer)
augmented precision (AP)
augmented recall (AR)
Potentials
Automatically processing the information contained in natural language text
Creating semantic contents for the Semantic Web
automatic metadata generation
semantic annotation
Improving the quality of ontologies
ACKNOWLEDGEMENT
Most of the materials are adapted from:
Wimalasuriya, D. C., & Dou, D. (2010). Ontology-based information extraction: An introduction and a survey of current approaches. Journal of Information Science.
Other References (part):
• Muhammad, A., & Dey, L. (2005). Biological Ontology enhancement with Fuzzy Relation: A Text Mining Framework. In International Conference on Web Intelligence WI (Vol. 5).
• Romano, R., Rokach, L., & Maimon, O. (2006). Automatic discovery of regular expression patterns representing negated findings in medical narrative reports. In Proceedings of the 6th International Workshop on Next Generation Information Technologies and Systems. Springer, Berlin.
• Muller, H. M., Kenny, E. E., & Sternberg, P. W. (2004). Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol, 2(11), e309.
• Dung, T. Q., & Kameyama, W. (2007). Ontology-based information extraction and information retrieval in health care domain. In Data Warehousing and Knowledge Discovery (pp. 323-333). Springer Berlin Heidelberg.
Thank you!