Post on 19-Jan-2016
LREC - 2010
Authors
Mithun Balakrishna, Dan Moldovan, Marta Tatu, Marian Olteanu
Presented by
Chris Irwin Davis
Semi-Automatic Domain Ontology Creation from Text Resources
LREC 2010 - Semi-Automatic Domain Ontology Creation from Text Resources 2
Jaguar Overview
• Jaguar: builds ontologies and knowledge bases from the concepts, and the relationships between those concepts, found in text.
• Constituents of a knowledge base
– Concepts/Vocabulary (“weapon”, “WMD”, “launcher”)
– Relations (“anthrax” ISA “biological weapon”, “anthrax” CAU “death”)
• 26 different semantic relation types extracted
– Organization of Relations
• Hierarchical
• Contextual
Types of Knowledge
• Universal (or ontological)
– Represented in Hierarchies
– Simple binary relations between concepts
– “Chemical weapons such as nerve gas, …”
• Contextual
– Represented in individual (semantic) contexts
– Groups of relations centered on a common concept
– “The forces launched a full-scale attack on Monday”
[Figure: hierarchy fragment “nerve gas” ISA “chemical weapon”; context centered on “launch” with AGT “forces”, THM “full-scale attack”, TMP “monday”]
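One way to picture the two kinds of knowledge is as data: a universal fact is a single binary relation, while a contextual fact is a group of relations that share a center concept. The sketch below is illustrative only. The class names `Relation` and `Context`, the `add` helper, and the left-to-right argument order are assumptions, not Jaguar's actual storage format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One binary relation, read as: left LABEL right (order is an assumption)."""
    left: str
    label: str
    right: str


@dataclass
class Context:
    """A group of relations centered on one common concept."""
    center: str
    relations: list = field(default_factory=list)

    def add(self, label, argument):
        # Every relation in this context points at the shared center concept.
        self.relations.append(Relation(argument, label, self.center))


# Universal knowledge: "Chemical weapons such as nerve gas, ..."
universal = Relation("nerve gas", "ISA", "chemical weapon")

# Contextual knowledge: "The forces launched a full-scale attack on Monday"
launch = Context("launch")
launch.add("AGT", "forces")
launch.add("THM", "full-scale attack")
launch.add("TMP", "monday")

print(universal)
for relation in launch.relations:
    print(relation)
```

The universal relation stands on its own and can be placed in a hierarchy, whereas the three contextual relations only make sense together, which is why they are grouped under one `Context`.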
KB Constituents
[Figure: a Knowledge Base consists of a Concept Set (C1, C2, …), Contextual Knowledge (concepts linked by relations R1, R2, … within contexts), and a Hierarchy (the Ontology) built from isa and pw links. Examples shown: “anthrax” isa “biological weapon” in the hierarchy; a context centered on “assassinate” with AGT “rebel”, THM “political leader”, TMP “may 21”.]
Jaguar Overview
• Input: Documents, Seeds
• Output: Ontology (structured knowledge)
– Ontology + pointers to text
– Knowledge Base (ontology + contextual knowledge + pointers to text)
• Functionality
1. Produce ontologies
2. Link concepts & relations to text
3. Visualize ontology
4. Edit ontology
5. Enhance an existing ontology
6. Merge two ontologies into a consistent ontology
7. Ontological search of documents (search documents using ontology)
Knowledge Bases
• Ontology/KB creation overview
– Knowledge Extraction from Text
• Pattern recognition; Semantic Parsing
– Knowledge Representation and Storage
• Contextual vs. Universal
• XML; Relational Database
– Knowledge Base Maintenance
• Conflict Resolution; Ontology Merging
• User Interaction; Ontology Modification
Jaguar – Process & Modules
• Process: Text Processing, Classification, Hierarchy Creation, Knowledge Base Maintenance
• Input: Documents; Seeds (keywords-list or Ontology)
• Output: Ontology + pointers to text; Knowledge Base (ontology + contextual knowledge + pointers to text)
• Text Processing tools
– Chopshop: Tokenization
– Post: Part-of-speech Tagging
– Rose: Named Entity Recognition
– Relu: Syntactic Parsing
– Talbot: Word Sense Disambiguation
– Polaris: Semantic Parsing
– PreProcessor: Text-Extraction from HTML, MS Word & PDF Docs
– ConceptTagger: Concept/Temporal Tagging
Text Processing
Input: Documents, Seeds
• Extract “concepts” of interest
• Extract binary relations (universal)
• Use Semantic Parser to obtain contextual knowledge
Output: Concepts, Contexts, Binary Relations
“The rebels had access to chemical weapons, such as nerve gas and other poisonous gases.”
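To make the "extract binary relations" step concrete, the sketch below applies two classic lexico-syntactic patterns ("X, such as Y" and "Y and other X") to the example sentence. It is a toy illustration, not the Polaris implementation: the function name `isa_from_chunks` and the pre-chunked input format (noun phrases already marked, as an upstream parser might provide) are assumptions made for this example.

```python
def isa_from_chunks(chunks):
    """Extract ISA relations from a chunked sentence.

    chunks is a list of (text, kind) pairs, where kind is "NP" for a noun
    phrase and "O" for anything else. This input format is hypothetical.
    """
    relations = []
    for i in range(1, len(chunks) - 1):
        text, kind = chunks[i]
        before, after = chunks[i - 1], chunks[i + 1]
        # A pattern needs a connector sitting between two noun phrases.
        if kind != "O" or before[1] != "NP" or after[1] != "NP":
            continue
        connector = text.strip(" ,").lower()
        if connector == "such as":
            # "X, such as Y"  ->  Y ISA X
            relations.append((after[0], "ISA", before[0]))
        elif connector == "and other":
            # "Y and other X"  ->  Y ISA X
            relations.append((before[0], "ISA", after[0]))
    return relations


sentence = [
    ("The rebels", "NP"),
    ("had access to", "O"),
    ("chemical weapons", "NP"),
    (", such as", "O"),
    ("nerve gas", "NP"),
    ("and other", "O"),
    ("poisonous gases", "NP"),
]

for relation in isa_from_chunks(sentence):
    print(relation)
```

On this sentence the two patterns yield “nerve gas” ISA “chemical weapons” and “nerve gas” ISA “poisonous gases”. A real system would also normalize plurals and handle longer coordinations, which this sketch leaves out.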
Domain Ontology Creation
• Polaris: Extract semantic relations in text
– Pattern matching and machine learning
– Syntactic parse tree broken down into a number of syntactic patterns
– Syntactic patterns include verbs and their arguments, complex nominals, adjective phrases, adjective clauses, and others.
– There are six primary pattern types discovered within noun phrases:
• N-N and Adj-N (which comprise compound nominals)
• ’s and of (Genitive patterns)
• Adjective Phrases
• Adjective Clauses
• The first five are further subdivided into nominalized and non-nominalized (giving a total of 11 patterns discovered within compound nominals)
– There are also five verb argument level patterns being discovered:
• NP verb
• verb NP
• verb PP
• verb ADVP
• verb S
Domain Ontology Creation
Input: Concepts, Binary Relations
• Classify each concept against every other using defined procedures, obtaining a set of ISA relations
• Add all ISA and other binary relations to the hierarchy using conflict resolution
Output: Hierarchy of relations
“Scud missile” ISA “missile”
“Squadron” PW “Platoon”
“weapons inspection team” ISA “inspection team”
Classification/Hierarchy Creation
Domain Ontology Creation
• Classification Procedures:
– Procedure 1: Classify a concept of the form [word, head] with respect to concept [head]
– Procedure 2: Classify a concept [word1, head1] with respect to another concept [word2, head2]
– Procedure 3: Classify a concept [word1, word2, head]
– Procedure 4: Classify a concept [word1, head] with respect to a concept hierarchy under [head]
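The slides do not spell out the procedures, so the following is only a guess at the simplest case (Procedure 1), based on examples such as “Scud missile” ISA “missile” and “weapons inspection team” ISA “inspection team”: a multi-word concept is classified under a shorter concept that matches its trailing head words. The function name `classify_by_head` and the whitespace tokenization are assumptions.

```python
def classify_by_head(concept, other):
    """Return (concept, "ISA", other) if `other` is the head of `concept`.

    Assumed rule: `other` must match the trailing words of `concept`,
    and `concept` must carry at least one extra modifier word.
    """
    concept_words = concept.lower().split()
    other_words = other.lower().split()
    if not other_words or len(concept_words) <= len(other_words):
        return None
    if concept_words[-len(other_words):] == other_words:
        return (concept, "ISA", other)
    return None


print(classify_by_head("Scud missile", "missile"))
print(classify_by_head("weapons inspection team", "inspection team"))
print(classify_by_head("missile", "Scud missile"))
```

The first two calls produce ISA relations and the third returns `None`, because a head cannot be classified under its own more specific form. Relations such as “Squadron” PW “Platoon” cannot be found this way and would have to come from text extraction.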
Domain Ontology Creation
• Knowledge Base Merging
• Visualization
• Knowledge Base Editing
– User Interaction
– Modifications
Knowledge Base Maintenance
Domain Ontology/KB Creation - Example
Conflict Resolution Algorithm
• Approach Used: Prevention
– Start from an empty hierarchy and an input relation set
– Add a relation from the input set to the hierarchy, if:
• It does not form a cycle
• It is not redundant (does not duplicate a path)
– After the addition of any relation, algorithms (jump link removal) are run to ensure that all jump links are removed
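The prevention approach can be sketched as a small graph routine. This is a minimal illustration under stated assumptions, not Jaguar's code: the class `Hierarchy` and its methods are hypothetical names, relations are stored as child-to-parent links, and a "jump link" is taken to mean a direct link that duplicates a longer path between the same two concepts.

```python
from collections import defaultdict


class Hierarchy:
    def __init__(self):
        self.parents = defaultdict(set)  # child -> set of direct parents

    def reaches(self, start, goal, skip_edge=None):
        """True if goal can be reached from start via child -> parent links."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            for parent in self.parents.get(node, ()):
                if (node, parent) != skip_edge:
                    stack.append(parent)
        return False

    def add(self, child, parent):
        """Add a relation only if it creates no cycle and is not redundant."""
        if child == parent or self.reaches(parent, child):
            return False  # would form a cycle
        if self.reaches(child, parent):
            return False  # duplicates an existing path
        self.parents[child].add(parent)
        self.remove_jump_links()
        return True

    def remove_jump_links(self):
        """Drop direct links that are also implied by a longer path."""
        for child in list(self.parents):
            for parent in list(self.parents[child]):
                if self.reaches(child, parent, skip_edge=(child, parent)):
                    self.parents[child].discard(parent)


h = Hierarchy()
print(h.add("scud missile", "missile"))  # accepted
print(h.add("missile", "weapon"))        # accepted
print(h.add("scud missile", "weapon"))   # rejected: duplicates a path
print(h.add("weapon", "scud missile"))   # rejected: would form a cycle
```

Jump-link removal matters when relations arrive in the other order: if a direct link is added first and an intermediate concept shows up later, the direct link becomes redundant only after the fact and has to be removed.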
Knowledge Base Merging
• Current Approach
– Label the bigger ontology L1, and the other L2
– Merge concepts (from those in L2 into those of L1)
– Copy all contexts (from L2 to L1)
– Add all relations (from the hierarchy of L2 to the hierarchy of L1) using the conflict resolution algorithm
– Additionally, classify all concepts in L1’s hierarchy against concepts in L2’s hierarchy (form relation set R)
– Add relations from R into L1’s hierarchy (conflict resolution)
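The first four merging steps can be sketched as follows. This is a simplified illustration under several assumptions: a knowledge base is a plain dict with `concepts`, `contexts`, and `hierarchy` keys (a hypothetical layout), "bigger" is measured by concept count, and the relation check only rejects cycles and duplicated paths. Jump-link removal and the cross-classification step that forms R are left out, so the result here depends on the order in which relations are added.

```python
def reachable(parents, start, goal):
    """True if goal can be reached from start via child -> parent links."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return False


def add_relation(parents, child, parent):
    """Add child ISA parent unless it forms a cycle or duplicates a path."""
    if reachable(parents, parent, child) or reachable(parents, child, parent):
        return False
    parents.setdefault(child, set()).add(parent)
    return True


def merge(kb_a, kb_b):
    """Merge the smaller knowledge base into the bigger one (modified in place)."""
    if len(kb_a["concepts"]) >= len(kb_b["concepts"]):
        l1, l2 = kb_a, kb_b
    else:
        l1, l2 = kb_b, kb_a
    l1["concepts"] |= l2["concepts"]       # merge concepts
    l1["contexts"].extend(l2["contexts"])  # copy all contexts
    for child, parents in l2["hierarchy"].items():
        for parent in sorted(parents):
            add_relation(l1["hierarchy"], child, parent)
    return l1


# Toy data: the hierarchy links below are invented for illustration.
first = {
    "concepts": {"stock_market", "money_market", "market"},
    "contexts": ["context-1"],
    "hierarchy": {"money_market": {"market"}},
}
second = {
    "concepts": {"capital market", "financial market"},
    "contexts": ["context-2"],
    "hierarchy": {
        "financial market": {"market"},
        "capital market": {"financial market", "market"},
    },
}

merged = merge(first, second)
print(merged["hierarchy"])
```

In this run the relation “capital market” ISA “market” is rejected, because by the time it is considered the path through “financial market” already exists.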
Merging Hierarchies
[Figure: the two input hierarchies, L1 and L2. Node labels as extracted: stock_market, exchange, work_place, money_market, market, industry, stock_exchange, money_market, capital market, financial market.]
Merging Hierarchies
[Figure: merged hierarchy L1 with nodes “stock_market, stock_exchange” (merged into one node), exchange, work_place, money_market, market, industry, capital market, financial market.]
Simulating Classification
• “stock_market” SYN “stock_exchange”
• “stock_market” ISA “capital market”
• “capital market” ISA “financial market”
• “money_market” ISA “financial market”
• “financial market” ISA “market”
• “capital market” ISA “market”
Semantic Relation Evaluation
• Training corpus:
– noun phrase patterns: Wall Street Journal (TreeBank 2), L.A. Times (TREC 9), and XWN 2.0
– verb argument patterns: FrameNet
• Three evaluation corpora to benchmark the Polaris semantic relations:
– TreeBank: we manually annotated 500 random sentences from the Penn Treebank 3 corpus with 5879 semantic relations.
– GlassBox Human: 51 random sentences from the NIMD corpus were manually POS-tagged, syntactically parsed and semantically annotated with 706 semantic relations.
– GlassBox Machine: the same 51 sentences used in the GlassBox Human evaluation corpus were POS-tagged and syntactically parsed by our NLP tools, and then manually annotated with 741 semantic relations.
Semantic Relation Evaluation
• For the Treebank evaluation corpus:
– Polaris discovered 5245 relations
• 2212 exact matches to the human annotations
• 630 partial matches
– partial matches mean that while the relation type was correct and the argument bracketing at least overlapped, there were some extra or missing tokens in the generated arguments
– partial matches are scored using precision, recall, and f-measure on the overlapping tokens
• For the GlassBox Human evaluation corpus:
– Polaris discovered 449 relations
• 311 were perfect matches to the human annotations
• 56 were partial matches
• For the GlassBox Machine evaluation corpus:
– Polaris discovered 464 relations
• 249 were perfect matches to the human annotations
• 71 were partial matches
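Scoring a partial match on overlapping tokens can be sketched as below. The slides do not give the exact formula, so this is an assumed reading: arguments are split on whitespace, overlap is counted as a multiset intersection, and the f-measure is the usual harmonic mean. The function name `overlap_scores` is hypothetical.

```python
from collections import Counter


def overlap_scores(generated, gold):
    """Token-level precision, recall, and f-measure for one argument pair."""
    generated_tokens = Counter(generated.lower().split())
    gold_tokens = Counter(gold.lower().split())
    # Multiset intersection: each token counts at most as often as in both.
    overlap = sum((generated_tokens & gold_tokens).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(generated_tokens.values())
    recall = overlap / sum(gold_tokens.values())
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# The generated argument has two extra tokens compared with the annotation.
precision, recall, f_measure = overlap_scores(
    "a full-scale attack on", "full-scale attack"
)
print(precision, recall, f_measure)
```

Here two of the four generated tokens appear in the annotation, so precision is 0.5 while recall is 1.0; extra tokens lower precision and missing tokens lower recall.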
Domain Ontology Library Creation
• We use Jaguar to create an ontology library for the 33 topics defined in the NIPF (National Intelligence Priorities Framework) and 10 topics from the financial domain
– NIPF is the Director of National Intelligence’s (DNI’s) guidance to the Intelligence Community on the national intelligence priorities approved by the President of the United States of America
– For each topic, we collected 500 documents from the web and manually verified their relevance to the corresponding topic.
– For each topic, Jaguar is provided with an initial seed set containing on average 47 concepts of interest
Domain Ontology/KB Evaluation
• We evaluated the quality of 8 Jaguar ontologies by comparing them against manual gold annotations
• Our evaluations focus on the following levels:
– Lexical Level
– Vocabulary, or Data Layer Level
– Other Semantic Relations Level
• Viewing an ontology as a set of semantic relations between two concepts, the human annotators:
– Labeled an entry as correct if the concepts and the semantic relation were correctly detected by the system, and as incorrect otherwise
– Labeled a correct entry as irrelevant if any of the concepts or the semantic relation is irrelevant to the domain
– Added new entries from the sentences if the concepts and the semantic relation were omitted by Jaguar
NIPF Ontology/KB Evaluation - Metrics
Nj(.) gives the counts from Jaguar’s output
Ng(.) gives the counts from the user annotations
Domain Ontology/KB Evaluation - Results
Conclusions
• We presented a generalized and improved procedure to automatically extract deep semantic information from text resources
• A methodology to rapidly create semantically-rich domain ontologies while keeping the manual intervention to a minimum
• We defined evaluation metrics to assess the quality of the ontologies and presented evaluation results for a subset of the intelligence and financial ontology libraries, semi-automatically created using freely-available textual resources from the Web
• The results show that a decent amount of knowledge can be accurately extracted while keeping the manual intervention in the process to a minimum.