Ontology based Information Extraction Jin Mao Postdoc, School of Information, University of Arizona...

Ontology based Information Extraction

Jin Mao, Postdoc, School of Information, University of Arizona

Oct. 9th, 2015

Outline

Information Extraction

The process of obtaining pertinent information (facts) from documents.

Example: The forest area in India extended to about 75 million hectares, which in terms of geographical area is approximately 22 percent of the total land.

What’s the relationship between forest area and geographical area?

Ontology Based Information Extraction (OBIE)

Terminology

Ontology Based Information Extraction (Wimalasuriya and Dou, 2010)
Ontology-driven Information Extraction (Yildiz and Miksch, 2007)
  The same as Ontology Based Information Extraction
  Whether the ontology part is within the system (Yildiz and Miksch, 2007)


Key Characteristics

Process unstructured or semi-structured natural language text
Present the output using ontologies
  Ontology as input (Li and Bontcheva, 2007), released
Use an IE process guided by an ontology
  No new IE method; an existing one is oriented to identify the components of an ontology (classes, properties and instances)
  Do extractors (e.g., linguistic rules) belong to an ontology?


Why

An ontology helps to clarify a domain’s semantics, e.g., concepts and their relationships
To alleviate a wide variety of natural language ambiguities


Applications

Business Intelligence (BI) in e-business
Social media (Twitter)
Metadata generation for digital resources
……

Common Architectures

Major Challenges

Information Extraction: identify instances from the ontology in the text.
  Classes, Instances, Mentions, Properties, Property Values
  Free texts in natural language.
Example 1: Classical fried egg Mycoplasma-type colonies were not observed on 1% agar medium.
Example 2: The cells are not motile, are not lysed in 1% SDS (wt/vol), and stain Gram positively.


Major Challenges

Ontology Enhancement / Updating
  Upgrade the ontology with new instances to cover the knowledge in a domain better
  Not in the common architecture.


General Architecture


First Step

Define the semantic elements to be extracted
An example (Muller et al., 2004)
  Concept (C): named entities for every part of the human body, such as heart, lung, kidney…
  Name of Disease (N): words or phrases of disease names.
  Description (D): any words or phrases that describe Concepts. “Description” refers to any kind of words or phrases that relate semantically to Concepts.
  Pair of Concept and Description (P): all possible combinations of Concepts and Descriptions. The combinations carry the full meaning of the relationships between C and D.
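As a rough illustration of "all possible combinations of Concepts and Descriptions", the sketch below builds P as a Cartesian product. The word lists and the function name are made up for the example and are not taken from the cited system.

```python
from itertools import product

# Hypothetical elements already extracted from one sentence.
concepts = ["heart", "lung"]          # C: body-part entities
descriptions = ["enlarged", "clear"]  # D: words describing Concepts


def concept_description_pairs(concepts, descriptions):
    """Return every (Concept, Description) combination, i.e. the set P."""
    return list(product(concepts, descriptions))


pairs = concept_description_pairs(concepts, descriptions)
for concept, description in pairs:
    print(concept, description)
```

Every candidate pair is kept at this stage; deciding which pairs express a real relationship is left to a later step.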

Information Extraction Methods

Linguistic rules

Using regular expressions/patterns
  (watched|seen) <NP>, where <NP> is a Part-of-Speech tag
Implemented using finite-state transducers, which consist of a series of finite-state automata
Automatically generate regular rules: “[Ii]nteract(s|ed|ing)?” matches “interact,” “interacts,” “interacted,” “interacting,” “Interact,” “Interacts,” “Interacted,” and “Interacting.”
Simple, surprisingly good results
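A minimal sketch of the two patterns using Python's `re` module. The `[Ii]nteract(s|ed|ing)?` pattern is the one quoted on the slide. The `<NP>` slot is approximated here by a run of capitalized words, which is an assumption made only to keep the example self-contained; a real system would take noun phrases from a tagger or chunker. The function names are hypothetical.

```python
import re

# Pattern from the slide; fullmatch() restricts it to whole tokens.
INTERACT = re.compile(r"[Ii]nteract(s|ed|ing)?")

# Assumed stand-in for <NP>: one or more capitalized words after the verb.
WATCHED_NP = re.compile(
    r"\b(?:watched|seen)\s+([A-Z][\w']*(?:\s+[A-Z][\w']*)*)"
)


def interact_forms(tokens):
    """Keep only tokens that are a form of 'interact'."""
    return [t for t in tokens if INTERACT.fullmatch(t)]


def watched_titles(text):
    """Return the capitalized phrase following 'watched' or 'seen'."""
    return [m.group(1) for m in WATCHED_NP.finditer(text)]


forms = interact_forms("Proteins that interact or Interacted , not interaction".split())
titles = watched_titles("I have seen Blade Runner and watched Alien yesterday.")
print(forms)
print(titles)
```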


Linguistic rules

Automatically mine extraction rules from text
  A dictionary inductive learning algorithm (Vargas-Vera et al., 2001)
  Solving the longest common subsequence problem (Romano et al., 2006)
  Relational learning (Califf and Mooney, 1999), a bottom-up learning approach
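To make the longest common subsequence idea concrete, here is a small sketch of the textbook dynamic-programming LCS over token lists. It shows only the LCS primitive, not the rule-learning procedure of Romano et al.; the two example sentences are invented.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of token lists a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start to recover the shared tokens in order.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


s1 = "no evidence of acute fracture".split()
s2 = "no radiographic evidence of pneumonia".split()
shared = longest_common_subsequence(s1, s2)
print(shared)
```

The shared tokens form a skeleton that could be turned into a pattern by allowing gaps between them.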


Gazetteer Lists

To recognize individual words or phrases
Widely used in named-entity recognition
E.g., to recognize states of the US or countries of the world
Conditions:
  Specify exactly what is being identified by the gazetteer.
  Specify where the information for the gazetteer lists was obtained from.
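A gazetteer lookup can be sketched as a longest-match search against a set of known names. The list below is a deliberately tiny, hypothetical sample of US states; the function name and the `max_len` parameter are also assumptions for the example.

```python
import re

# Hypothetical gazetteer: identifies US states (sample only; a real list
# would be complete and its source documented, per the conditions above).
US_STATES = {"arizona", "new mexico", "new york", "texas"}


def find_gazetteer_mentions(text, gazetteer, max_len=3):
    """Return gazetteer phrases found in text, preferring longer matches."""
    tokens = re.findall(r"[A-Za-z]+", text)
    mentions = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so "New York" wins over "York".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase.lower() in gazetteer:
                mentions.append(phrase)
                i += n
                break
        else:
            i += 1
    return mentions


found = find_gazetteer_mentions(
    "She moved from New York to Arizona, then on to New Mexico.", US_STATES
)
print(found)
```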


Classification Techniques

Linguistic features such as POS tags, capitalization information and individual words
Parts of IE cast as classification problems:
  whether a word token is the start/end of an entity (Li et al., 2004)
  identify different components of an ontology such as instances (Li and Bontcheva, 2007) and property values (Wu and Weld, 2007)
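The start/end formulation can be sketched as follows: each token gets a feature dictionary and a pair of boolean labels, which could then be fed to any classifier. The feature names are illustrative assumptions, and POS tags are left out because they would require a tagger; this is not the feature set of the cited systems.

```python
def token_features(tokens, i):
    """Simple features for token i: the word itself, capitalization, context."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }


def boundary_labels(tokens, spans):
    """Label each token as (is_start, is_end) given inclusive entity spans."""
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    return [(i in starts, i in ends) for i in range(len(tokens))]


tokens = "University of Arizona is in Tucson".split()
spans = [(0, 2), (5, 5)]  # hypothetical gold entities, token indices
labels = boundary_labels(tokens, spans)
features = [token_features(tokens, i) for i in range(len(tokens))]
print(list(zip(tokens, labels)))
```

Two binary classifiers, one for starts and one for ends, can be trained on these pairs; a single-token entity such as "Tucson" is both a start and an end.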


Syntax/Shallow NLP

A semantically annotated parse tree for the text as a part of the IE process
Linguistic extraction rules with partial parse trees (Todirascu et al., 2002).

Ontology Construction

to consider the ontology as an input to the system

to construct an ontology as a part of the OBIE process

Ontology Enhancement

Update the ontology by adding new classes and properties through the IE process, NOT instances and their property values
Such systems include the implementations by Maedche et al. (2003) and Dung and Kameyama (2007).
Fuzzy Relationship Rule: define rules according to the relationships among semantic elements.

o Generate a suggestion list for the domain experts to extract real semantic elements.

Performance Evaluation

Measure the accuracy of identifying instances and property values.

Most IE systems face a trade-off between improving precision and recall.

In the F-measure, when β² < 1, precision (P) is given more importance than recall (R).
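For reference, the standard weighted F-measure is F = (β² + 1)·P·R / (β²·P + R), assumed here to be the formula the remark about β² refers to. The sketch below computes precision, recall and F from sets of extracted and gold-standard items; the item names are placeholders.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted items against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    return p, r


def f_measure(p, r, beta_sq=1.0):
    """Weighted F-measure; beta_sq < 1 favours precision, > 1 favours recall."""
    denom = beta_sq * p + r
    return (beta_sq + 1) * p * r / denom if denom else 0.0


gold = {"a", "b", "c", "d"}    # placeholder gold instances
extracted = {"a", "b", "x"}    # placeholder system output
p, r = precision_recall(extracted, gold)
f1 = f_measure(p, r)
f_precision_heavy = f_measure(p, r, beta_sq=0.25)
print(p, r, f1, f_precision_heavy)
```

Here P = 2/3 and R = 1/2; lowering β² from 1 to 0.25 moves F from about 0.571 to 0.625, that is, toward the precision value.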


Evaluation in different scales (Maynard et al., 2004)
  Each answer is categorized as correct or incorrect; however, different degrees of correctness should be allowed.
Learning Accuracy (LA): measures the closeness of the assigned class label to the correct class label based on the hierarchy of the ontology (Cimiano et al., 2005).

Multi-dimensional evaluation beyond Precision and Recall


Cost-based metrics (Maynard et al., 2004)
  A cost would typically be associated with a miss and a false alarm (spurious answer)

augmented precision (AP)

augmented recall (AR)

Potentials

Automatically processing the information contained in natural language text

Creating semantic contents for the Semantic Web

automatic metadata generation

semantic annotation

Improving the quality of ontologies

ACKNOWLEDGEMENT

Most of the materials are adapted from:

Wimalasuriya, D. C., & Dou, D. (2010). Ontology-based information extraction: An introduction and a survey of current approaches. Journal of Information Science.

Other References (part):
• Muhammad, A., & Dey, L. (2005). Biological ontology enhancement with fuzzy relation: A text mining framework. In International Conference on Web Intelligence WI (Vol. 5).
• Romano, R., Rokach, L., & Maimon, O. (2006). Automatic discovery of regular expression patterns representing negated findings in medical narrative reports. In Proceedings of the 6th International Workshop on Next Generation Information Technologies and Systems. Springer, Berlin.
• Muller, H. M., Kenny, E. E., & Sternberg, P. W. (2004). Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLoS Biol, 2(11), e309.
• Dung, T. Q., & Kameyama, W. (2007). Ontology-based information extraction and information retrieval in health care domain. In Data Warehousing and Knowledge Discovery (pp. 323-333). Springer Berlin Heidelberg.

Thank you!

