On the Need to Bootstrap Ontology Learning with
Extraction Grammar Learning
Kassel, 22 July 2005
Georgios Paliouras
Software & Knowledge Engineering Lab
Inst. of Informatics & Telecommunications
NCSR “Demokritos”
http://www.iit.demokritos.gr/~paliourg
Kassel, 22/07/2005 ICCS’05 2
Outline
• Motivation and state of the art
• SKEL research
– Vision
– Information integration in CROSSMARC.
– Meta-learning for information extraction.
– Context-free grammar learning.
– Ontology enrichment.
– Bootstrapping ontology evolution with multimedia information extraction.
• Open issues
Motivation
• Practical information extraction requires a conceptual description of the domain, e.g. an ontology, and a grammar.
• Manual creation and maintenance of these resources is expensive.
• Machine learning has been used to:
– Learn ontologies based on extracted instances.
– Learn extraction grammars, given the conceptual model.
• Aim: study how the two processes interact and whether they can be combined.
Information extraction
• Common approach: shallow parsing with regular grammars.
• Limited use of deep analysis to improve extraction accuracy (HPSGs, concept graphs).
• Linking of extraction patterns to ontologies (e.g. information extraction ontologies).
• Initial attempts to combine syntax and semantics (Systemic Functional Grammars).
• Learning simple extraction patterns (regular expressions, HMMs, tree-grammars, etc.)
Ontology learning
• Deductive approach to ontology modification: driven by linguistic rules.
• Inductive identification of new concepts/terms.
• Clustering, based on lexico-syntactic analysis of the text (subcat frames).
• Formal Concept Analysis for term clustering and concept identification.
• Clustering and merging of conceptual graphs (conceptual graph theory).
• Deductive learning of extraction grammars in parallel with the identification of concepts.
SKEL - vision
Research objective: innovative knowledge technologies for reducing the information overload on the Web
Areas of research activity:
– Information gathering (retrieval, crawling, spidering)
– Information filtering (text and multimedia classification)
– Information extraction (named entity recognition and classification, role identification, wrappers, grammar and lexicon learning)
– Personalization (user stereotypes and communities)
– Ontology learning and population
CROSSMARC Objectives
Develop technology for Information Integration that can:
• crawl the Web for interesting Web pages,
• extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
• process Web pages written in several languages,
• be customized semi-automatically to new domains and languages,
• deliver integrated information according to personalized profiles.
CROSSMARC Architecture
[Figure: CROSSMARC architecture diagram; the only surviving label is “Ontology”.]
CROSSMARC Ontology
Ontology
…
<description>Laptops</description>
<features>
  <feature id="OF-d0e5">
    <description>Processor</description>
    <attribute type="basic" id="OA-d0e7">
      <description>Processor Name</description>
      <discrete_set type="open">
        <value id="OV-d0e1041">
          <description>Intel Pentium 3</description>
        </value>
        …
Lexicon
<node idref="OV-d0e1041">
  <synonym>Intel Pentium III</synonym>
  <synonym>Pentium III</synonym>
  <synonym>P3</synonym>
  <synonym>PIII</synonym>
</node>
Greek Lexicon
<node idref="OA-d0e7">
  <synonym>Όνομα Επεξεργαστή</synonym>
</node>
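The lexicon entries above link surface forms to ontology identifiers through idref. A minimal Python sketch of how such a lexicon could be turned into a lookup table is given here; the <lexicon> root element and the function name synonym_index are assumptions made for the example and are not part of the CROSSMARC format shown on the slide.

```python
import xml.etree.ElementTree as ET

# The <lexicon> wrapper is hypothetical; the slide only shows the <node> element.
LEXICON = """<lexicon>
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
</lexicon>"""

def synonym_index(xml_text):
    """Map each lower-cased synonym to the ontology id it refers to."""
    index = {}
    for node in ET.fromstring(xml_text).iter("node"):
        for synonym in node.iter("synonym"):
            index[synonym.text.strip().lower()] = node.get("idref")
    return index

index = synonym_index(LEXICON)
print(index["piii"])
```

With such an index, every surface form found in a page resolves to the same ontology value, regardless of language-specific spelling.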
Meta-learning for Web IE
Motivation:
• There are many different learning methods, producing different types of extraction grammar.
• In CROSSMARC we had four different approaches, with significant differences in the extracted information.
Proposed approach:
• Use meta-learning to combine the strengths of individual learning methods.
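Meta-learning for Web IE
Stacked generalization
[Figure: training and run-time use of stacked generalization.]
• Training: the base-level dataset D is split into Dj and D \ Dj. The learners L1…LN are trained on D \ Dj, giving classifiers C1(j)…CN(j), whose predictions on Dj form MDj, a part of the meta-level dataset MD. The meta-level learner LM is trained on MD to produce CM.
• Run-time: the classifiers C1...CN, learned by L1…LN, map a new vector x to a meta-level vector, which CM maps to the class value y(x).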
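The stacked generalization scheme can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the two base learners (a majority-class learner and a 1-nearest-neighbour learner on one numeric feature), the table-based meta-learner and all function names are made up for the example and are not the learners used in CROSSMARC.

```python
from collections import Counter, defaultdict

def learn_majority(data):
    """Hypothetical base learner: always predict the most frequent class."""
    label = Counter(y for _, y in data).most_common(1)[0][0]
    return lambda x: label

def learn_1nn(data):
    """Hypothetical base learner: 1-nearest-neighbour on a numeric feature."""
    return lambda x: min(data, key=lambda ex: abs(ex[0] - x))[1]

def learn_table(meta_data):
    """Hypothetical meta-level learner LM: map each vector of base
    predictions to the class most often seen with it."""
    votes = defaultdict(Counter)
    for vec, y in meta_data:
        votes[vec][y] += 1
    # Unseen vectors fall back to the first base prediction.
    return lambda vec: votes[vec].most_common(1)[0][0] if vec in votes else vec[0]

def stack(data, base_learners, meta_learner, folds=3):
    meta_data = []                                    # MD
    for j in range(folds):
        held_out = data[j::folds]                     # D_j
        rest = [ex for i, ex in enumerate(data) if i % folds != j]  # D \ D_j
        classifiers = [learn(rest) for learn in base_learners]      # C_1(j)..C_N(j)
        meta_data += [(tuple(c(x) for c in classifiers), y) for x, y in held_out]
    c_meta = meta_learner(meta_data)                  # C_M
    final = [learn(data) for learn in base_learners]  # C_1..C_N trained on all of D
    return lambda x: c_meta(tuple(c(x) for c in final))

data = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (0.3, "a"), (0.4, "a"), (0.5, "a"),
        (1.0, "b"), (1.1, "b"), (1.2, "b")]
predict = stack(data, [learn_majority, learn_1nn], learn_table)
print(predict(0.05), predict(1.15))
```

The meta-level classifier learns when to trust which base classifier: here it learns that the nearest-neighbour prediction should override the majority vote.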
Meta-learning for Web IE
Information Extraction is not naturally a classification task.
In IE we deal with text documents, paired with templates.
…TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB SDRAM up to 1GB…
Template T
t(s,e)                 s, e    Field f
Transport ZX           47, 49  model
15”                    56, 58  screenSize
TFT                    59, 60  screenType
Intel <b> Pentium III  63, 67  procName
600 MHz                67, 69  procSpeed
256 MB                 76, 78  ram
Each template is filled with instances <t(s,e), f>.
Meta-learning for Web IE
Combining Information Extraction systems
…TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB SDRAM up to 1GB…
T1 filled by the IE system E1
t(s, e)                s, e    f
Transport ZX           47, 49  model
15”                    56, 58  screenSize
TFT                    59, 60  screenType
Intel <b> Pentium III  63, 67  procName
600 MHz                67, 69  procSpeed
256 MB                 76, 78  ram
1 GB                   81, 83  ram
T2 filled by the IE system E2
t(s, e)                s, e    f
Transport ZX           47, 49  manuf
TFT                    59, 60  screenType
Intel <b> Pentium      63, 66  procName
600 MHz                67, 69  procSpeed
256 MB                 76, 78  ram
1 GB                   81, 83  HDcapacity
Meta-learning for Web IE
Creating a stacked template
…TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB SDRAM up to 1GB…
Stacked template (ST)
s, e    t(s, e)              Field by E1  Field by E2  Correct field
47, 49  Transport ZX         model        manuf        model
56, 58  15”                  screenSize   -            screenSize
59, 60  TFT                  screenType   screenType   screenType
63, 66  Intel<b>Pentium      -            procName     -
63, 67  Intel<b>Pentium III  procName     -            procName
67, 69  600 MHz              procSpeed    procSpeed    procSpeed
76, 78  256 MB               ram          ram          ram
81, 83  1 GB                 ram          HDcapacity   -
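The construction of the stacked template can be sketched as follows. The representation of a template as a dictionary from (s, e) spans to (text, field) pairs and the function name stack_templates are assumptions made for this example; the slides do not prescribe a data structure.

```python
# Templates T1 and T2 from the slides, keyed by the (s, e) span of each instance.
T1 = {(47, 49): ("Transport ZX", "model"), (56, 58): ("15\"", "screenSize"),
      (59, 60): ("TFT", "screenType"), (63, 67): ("Intel <b> Pentium III", "procName"),
      (67, 69): ("600 MHz", "procSpeed"), (76, 78): ("256 MB", "ram"),
      (81, 83): ("1 GB", "ram")}
T2 = {(47, 49): ("Transport ZX", "manuf"), (59, 60): ("TFT", "screenType"),
      (63, 66): ("Intel <b> Pentium", "procName"), (67, 69): ("600 MHz", "procSpeed"),
      (76, 78): ("256 MB", "ram"), (81, 83): ("1 GB", "HDcapacity")}

def stack_templates(templates):
    """Merge N templates into one row per span: (span, text, field_1..field_N).
    A system that extracted nothing for a span contributes "-"."""
    rows = {}
    for k, template in enumerate(templates):
        for span, (text, field) in template.items():
            row = rows.setdefault(span, [text] + ["-"] * len(templates))
            row[k + 1] = field
    return [(span, *row) for span, row in sorted(rows.items())]

for row in stack_templates([T1, T2]):
    print(row)
```

Each row is a meta-level feature vector; during training it is paired with the correct field taken from the hand-filled template.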
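Meta-learning for Web IE
Training in the new stacking framework
D = set of documents, paired with hand-filled templates
MD = set of meta-level feature vectors
[Figure: the learners L1…LN are trained on D \ Dj, giving IE systems E1(j)…EN(j); the templates they fill for the documents in Dj are combined into stacked templates ST1, ST2, …, which form MDj. The meta-level learner LM is trained on MD to produce CM, and L1…LN produce the IE systems E1…EN.]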
Meta-learning for Web IE
Stacking at run-time
[Figure: a new document d is processed by E1, E2, …, EN, giving templates T1, T2, …, TN; these are merged into a stacked template, from which CM produces the final template T of instances <t(s,e), f>.]
Experimental results
Domain    Best base  Stacking
Courses   65.73      71.93
Projects  61.64      70.66
Laptops   63.81      71.55
Jobs      83.22      85.94
Seminars  86.23      90.03
F1-scores (combined recall and precision) on four benchmark domains and one of the CROSSMARC domains.
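As a reminder of how the F1-score combines precision and recall, here is a small Python sketch. It assumes that an extracted instance counts as correct only when both its span and its field match the hand-filled template exactly; the slides do not state the matching criterion, so this is an assumption, and the example data are the T2 instances and correct fields from the stacked-template slide.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 over sets of (s, e, field) instances."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

gold = [(47, 49, "model"), (56, 58, "screenSize"), (59, 60, "screenType"),
        (63, 67, "procName"), (67, 69, "procSpeed"), (76, 78, "ram")]
e2 = [(47, 49, "manuf"), (59, 60, "screenType"), (63, 66, "procName"),
      (67, 69, "procSpeed"), (76, 78, "ram"), (81, 83, "HDcapacity")]
print(precision_recall_f1(e2, gold))
```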
Learning CFGs
Motivation:
• Provide more complex extraction patterns for less structured text.
• Learn more compact and human-comprehensible grammars.
• Process large corpora containing only positive examples.
Proposed approach:
• Efficient learning of context-free grammars from positive examples, guided by Minimum Description Length.
Learning CFGs
Introducing eg-GRIDS
• Infers context-free grammars.
• Learns from positive examples only.
• Overgeneralisation controlled through a heuristic, based on MDL.
• Two basic/three auxiliary learning operators.
• Two search strategies:
– Beam search.
– Genetic search.
Learning CFGs
Minimum Description Length (MDL)
Model Length (ML) = GDL + DDL
Grammar Description Length (GDL): bits required to encode the grammar G.
Derivations Description Length (DDL): bits required to encode all training examples, as encoded by the grammar G.
[Figure: space of hypotheses between the overly specific grammar and the overly general grammar; labels: GDL, DDL.]
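The trade-off between GDL and DDL can be illustrated with a toy Python sketch. The encoding used here is an assumption chosen for simplicity (a fixed number of bits per grammar symbol, and log2 of the number of alternatives for every rule choice in a derivation); it is not the encoding used by eg-GRIDS, and the derivations are written out by hand instead of being produced by a parser.

```python
from math import log2

def gdl(grammar):
    """Toy GDL: every head and body symbol costs log2(|vocabulary|) bits."""
    vocabulary = set(grammar) | {s for bodies in grammar.values()
                                 for body in bodies for s in body}
    n_symbols = sum(1 + len(body) for bodies in grammar.values() for body in bodies)
    return n_symbols * log2(len(vocabulary))

def ddl(grammar, derivations):
    """Toy DDL: each expansion of a non-terminal costs log2(#alternatives) bits.
    A derivation is the list of non-terminals expanded while deriving an example."""
    return sum(log2(len(grammar[nt])) for d in derivations for nt in d)

# Training examples: "()", "(())", "()()".
specific = {"S": [("(", ")"), ("(", "(", ")", ")"), ("(", ")", "(", ")")]}
specific_derivations = [["S"], ["S"], ["S"]]          # one rule choice per example

general = {"S": [("S", "S"), ("(", "S", ")"), ()]}    # S -> S S | ( S ) | empty
general_derivations = [["S"] * 2, ["S"] * 3, ["S"] * 5]

for name, g, d in [("specific", specific, specific_derivations),
                   ("general", general, general_derivations)]:
    print(name, round(gdl(g), 2), round(ddl(g, d), 2))
```

Under this toy encoding the specific grammar is expensive to describe but encodes its examples cheaply, while the compact grammar shows the opposite pattern; the MDL heuristic searches for a grammar that minimises the sum.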
Learning CFGs
eg-GRIDS Architecture
[Figure: eg-GRIDS architecture. The training examples are turned into an overly specific grammar, which seeds the beam of grammars. Depending on the operator mode, grammars are modified by the learning operators (Merge NT, Create NT, Create Optional NT, Detect Center Embedding, Body Substitution) or by the evolutionary algorithm (Mutation), under search organisation and selection. The test “Any inferred grammar better than those in beam?” (YES/NO) decides between continuing the search and returning the final grammar.]
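To give an intuition for the two basic operators, the following Python sketch shows one plausible reading of them on a grammar stored as a dictionary from non-terminals to lists of rule bodies. The function names, the grammar representation and the exact behaviour are assumptions for illustration and may differ from the eg-GRIDS implementation.

```python
def merge_nt(grammar, a, b):
    """Merge NT (sketch): replace every occurrence of b by a and pool their rules."""
    rename = lambda symbol: a if symbol == b else symbol
    merged = {}
    for head, bodies in grammar.items():
        target = merged.setdefault(rename(head), [])
        for body in bodies:
            new_body = tuple(rename(s) for s in body)
            if new_body not in target:      # drop rules that became duplicates
                target.append(new_body)
    return merged

def create_nt(grammar, pair, new_nt):
    """Create NT (sketch): add new_nt -> pair and substitute the pair in all bodies."""
    def rewrite(body):
        out, i = [], 0
        while i < len(body):
            if body[i:i + 2] == pair:
                out.append(new_nt)
                i += 2
            else:
                out.append(body[i])
                i += 1
        return tuple(out)
    result = {head: [rewrite(b) for b in bodies] for head, bodies in grammar.items()}
    result[new_nt] = [pair]
    return result

g = {"S": [("(", ")"), ("(", "A", ")")], "A": [("(", ")")]}
print(merge_nt(g, "S", "A"))
print(create_nt({"S": [("a", "b", "c"), ("a", "b", "d")]}, ("a", "b"), "X"))
```

In the first call, merging A into S makes the grammar recursive, which is how a merge can generalise beyond the training examples.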
Experimental results
• The Dyck language with k=1: S → S S | ( S ) | ε
Errors of:
• Omission: failures to parse sentences generated from the “correct” grammar (longer test sentences than in the training set).
– Overly specific grammar.
• Commission: failures of the “correct” grammar to parse sentences generated by the inferred grammar.
– Overly general grammar.
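The two error types can be made concrete with a small Python sketch. It enumerates all bracket strings up to a maximum length instead of sampling from the grammars, and the two "inferred" recognisers (too_specific and too_general) are hypothetical stand-ins for learned grammars, so the numbers are purely illustrative.

```python
from itertools import product

def is_dyck(s):
    """Recogniser for the Dyck language with k=1 (balanced brackets)."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def sentences(accepts, max_len):
    """All strings over '(' and ')' up to max_len that the recogniser accepts."""
    return [w for n in range(max_len + 1)
            for w in map("".join, product("()", repeat=n)) if accepts(w)]

def error_rate(generator, parser, max_len):
    """Fraction of the generator's sentences that the parser rejects."""
    sample = sentences(generator, max_len)
    return sum(not parser(w) for w in sample) / len(sample)

# Hypothetical inferred grammars, given directly as recognisers.
too_specific = lambda s: s == "()" * (len(s) // 2)       # only flat sequences ()()...
too_general = lambda s: s.count("(") == s.count(")")     # ignores bracket order

for name, inferred in [("too specific", too_specific), ("too general", too_general)]:
    omission = error_rate(is_dyck, inferred, 6)      # correct sentences not parsed
    commission = error_rate(inferred, is_dyck, 6)    # inferred sentences that are invalid
    print(name, round(omission, 3), round(commission, 3))
```

The overly specific recogniser only makes errors of omission, and the overly general one only errors of commission, matching the association made on the slide.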
Experimental results
[Figure: probability of parsing a valid sentence (1 - errors of omission).]
Experimental results
[Figure: probability of generating a valid sentence (1 - errors of commission).]
Ontology Enrichment
• Highly evolving domain (e.g. laptop descriptions)
– New instances characterize new concepts.
e.g. ‘Pentium 2’ is an instance that denotes a new concept if it doesn’t exist in the ontology.
– New surface appearance of an instance.
e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’
• We concentrate on instances.
• The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain.
Ontology Enrichment
[Figure: ontology enrichment cycle, with the labels: Multi-Lingual Domain Ontology; Annotating Corpus Using Domain Ontology; Corpus; machine learning; Information extraction; Additional annotations; Validation (Domain Expert); Ontology Enrichment / Population.]
Finding synonyms
• The number of instances for validation increases with the size of the corpus and the ontology.
• There is a need for supporting the enrichment of the ‘synonymy’ relationship.
• Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).
• Issues to be handled:
– Synonym: ‘Intel pentium 3’ - ‘Intel pIII’
– Orthographical: ‘Intel p3’ - ‘intell p3’
– Lexicographical: ‘Hewlett Packard’ - ‘HP’
– Combination: ‘Intell Pentium 3’ - ‘P III’
COCLU
• COCLU (COmpression-based CLUstering): a model based algorithm that discovers typographic similarities between strings (sequences of elements-letters) over an alphabet (ASCII characters) employing a new score function CCDiff.
• CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances), when adding a candidate string. Huffman trees are used as models of the clusters.
• COCLU iteratively computes the CCDiff of each new string from each cluster implementing a hill-climbing search. The new string is added to the closest cluster, or a new cluster is created (threshold on CCDiff ).
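The description of COCLU can be turned into a rough Python sketch. It follows the slide only loosely: the code length of a cluster is taken to be the number of bits needed to encode all its characters with a Huffman code built from the cluster's own character frequencies, CCDiff is the increase in that length when a string is added, and strings are processed once in the given order. These details, the function names and the threshold handling are assumptions, not the published algorithm.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} of a Huffman code for freqs."""
    if len(freqs) == 1:
        return {symbol: 1 for symbol in freqs}
    # Heap entries: (weight, tie-breaker, {symbol: depth so far}).
    heap = [(f, i, {symbol: 0}) for i, (symbol, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def code_length(strings):
    """Bits needed to encode all characters of the cluster with its own model."""
    freqs = Counter("".join(strings))
    lengths = huffman_code_lengths(freqs)
    return sum(f * lengths[symbol] for symbol, f in freqs.items())

def cc_diff(cluster, candidate):
    """Increase in the cluster's code length when the candidate is added."""
    return code_length(cluster + [candidate]) - code_length(cluster)

def coclu(strings, threshold):
    """Add each string to the closest cluster, or open a new cluster."""
    clusters = []
    for s in strings:
        best = min(clusters, key=lambda c: cc_diff(c, s), default=None)
        if best is not None and cc_diff(best, s) <= threshold:
            best.append(s)
        else:
            clusters.append([s])
    return clusters

print(cc_diff(["intel p3"], "intell p3"), cc_diff(["intel p3"], "hewlett packard"))
```

A string that reuses the characters of a cluster is cheap to add, whereas a string with many unseen characters forces a larger code and yields a high CCDiff. In this unnormalised form the score also grows with string length, so a practical threshold would have to account for that.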
Experimental results
Discovering lexical synonyms:
Assign an instance to a group, while decreasing proportionally the number of instances available initially in each group.
[Figure: accuracy (%) against instances removed (%).]
Discovering new instances:
Hide part of the known instances.
Evolve ontology and grammars to recover them.
Initial  2nd iter.
15/58    48/58
28/58    56/58
40/58    57/58
BOEMIE (Bootstrapping Ontology Evolution with Multimedia Information Extraction) - motivation
• Multimedia content is growing at an increasing rate in public and proprietary webs.
• Hard to provide semantic indexing of multimedia content.
• Significant advances in automatic extraction of low-level features from visual content.
• Little progress in the identification of high-level semantic features
• Little progress in the effective combination of semantic features from different modalities.
• Great effort in producing ontologies for semantic webs.
• Hard to build and maintain domain-specific multimedia ontologies.
BOEMIE - approach
[Figure: BOEMIE architecture, with the following labelled components.]
• Content collection (crawlers, spiders, etc.): multimedia content
• Semantics extraction: from visual content, from non-visual content, from fused content
• Semantics extraction toolkit: visual extraction tools, text extraction tools, audio extraction tools, information fusion tools
• Semantics extraction results
• Ontology evolution: population & enrichment, coordination
• Ontology evolution toolkit: learning tools, reasoning engine, matching tools, ontology management tool
• Ontologies: initial ontology, intermediate ontology, evolved ontology, other ontologies
KR issues
• Is there a common formalism to capture the necessary semantic + syntactic + lexical knowledge for IE?
• Is that better than having separate representations for different tasks?
• Do we need an intermediate formalism (e.g. grammar + CG + ontology)?
• Do we need to represent uncertainty (e.g. using probabilistic graphical models)?
ML issues
• What types and which aspects of grammars and conceptual structures can we learn?
• What training data do we need? Can we reduce the manual annotation effort?
• What background knowledge do we need and what is the role of deduction?
• What is the role of multi-strategy learning, especially if complex representations are used?
Content-type issues
• What is the role of semantically annotated content in learning, e.g. as training data?
• What is the role of hypertext as a graph?
• Can we extract information from multimedia content?
• How can ontologies and learning help improve extraction from multimedia?
SKEL Introduction
Acknowledgements
• This is research of many current and past members of SKEL.
• CROSSMARC is joint work of the project consortium (NCSR “Demokritos”, Uni of Edinburgh, Uni of Roma ‘Tor Vergata’, Veltinet, Lingway).