
On the Need to Bootstrap Ontology Learning with Extraction Grammar Learning

On the Need to Bootstrap Ontology Learning with Extraction Grammar Learning. Georgios Paliouras, Software & Knowledge Engineering Lab, Inst. of Informatics & Telecommunications, NCSR “Demokritos”, http://www.iit.demokritos.gr/~paliourg. Kassel, 22 July 2005.
Page 1: On the Need to Bootstrap Ontology Learning with Extraction Grammar Learning

Kassel, 22 July 2005

Georgios Paliouras

Software & Knowledge Engineering Lab

Inst. of Informatics & Telecommunications

NCSR “Demokritos”

http://www.iit.demokritos.gr/~paliourg

Page 2: Outline

Kassel, 22/07/2005 ICCS’05 2

• Motivation and state of the art

• SKEL research

– Vision

– Information integration in CROSSMARC.

– Meta-learning for information extraction.

– Context-free grammar learning.

– Ontology enrichment.

– Bootstrapping ontology evolution with multimedia information extraction.

• Open issues

Page 3: Motivation

• Practical information extraction requires a conceptual description of the domain, e.g. an ontology, and a grammar.

• Manual creation and maintenance of these resources is expensive.

• Machine learning has been used to:

– Learn ontologies based on extracted instances.

– Learn extraction grammars, given the conceptual model.

• We study how the two processes interact and whether they can be combined.

Page 4: Information extraction

• Common approach: shallow parsing with regular grammars.

• Limited use of deep analysis to improve extraction accuracy (HPSGs, concept graphs).

• Linking of extraction patterns to ontologies (e.g. information extraction ontologies).

• Initial attempts to combine syntax and semantics (Systemic Functional Grammars).

• Learning simple extraction patterns (regular expressions, HMMs, tree-grammars, etc.)

Page 5: Ontology learning

• Deductive approach to ontology modification: driven by linguistic rules.

• Inductive identification of new concepts/terms.

• Clustering, based on lexico-syntactic analysis of the text (subcat frames).

• Formal Concept Analysis for term clustering and concept identification.

• Clustering and merging of conceptual graphs (conceptual graph theory).

• Deductive learning of extraction grammars in parallel with the identification of concepts.

Page 7: SKEL - vision

Research objective: innovative knowledge technologies for reducing the information overload on the Web.

Areas of research activity:

– Information gathering (retrieval, crawling, spidering)

– Information filtering (text and multimedia classification)

– Information extraction (named entity recognition and classification, role identification, wrappers, grammar and lexicon learning)

– Personalization (user stereotypes and communities)

– Ontology learning and population

Page 9: CROSSMARC Objectives

Develop technology for Information Integration that can:

• crawl the Web for interesting Web pages,

• extract information from pages of different sites without a standardized format (structured, semi-structured, free text),

• process Web pages written in several languages,

• be customized semi-automatically to new domains and languages,

• deliver integrated information according to personalized profiles.

Page 10: CROSSMARC Architecture

[Figure: CROSSMARC architecture diagram. The only surviving label is “Ontology”.]

Page 11: CROSSMARC Ontology

Ontology

…
<description>Laptops</description>
<features>
  <feature id="OF-d0e5">
    <description>Processor</description>
    <attribute type="basic" id="OA-d0e7">
      <description>Processor Name</description>
      <discrete_set type="open">
        <value id="OV-d0e1041">
          <description>Intel Pentium 3</description>
        </value>
        …

Lexicon

<node idref="OV-d0e1041">
  <synonym>Intel Pentium III</synonym>
  <synonym>Pentium III</synonym>
  <synonym>P3</synonym>
  <synonym>PIII</synonym>
</node>

Greek Lexicon

<node idref="OA-d0e7">
  <synonym>Όνομα Επεξεργαστή</synonym>
</node>
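The link between lexicon entries and ontology identifiers on this slide can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the `<lexicon>` root element that wraps the two `<node>` fragments, the merging of the English and Greek lexicons into one string, and the helper name `build_synonym_index` are hypothetical and are not part of CROSSMARC.

```python
import xml.etree.ElementTree as ET

# <node> fragments copied from the slide. The <lexicon> wrapper is a
# hypothetical root element, added only so the fragment parses on its own.
LEXICON_XML = """
<lexicon>
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
  <node idref="OA-d0e7">
    <synonym>Όνομα Επεξεργαστή</synonym>
  </node>
</lexicon>
"""


def build_synonym_index(xml_text):
    """Map each lower-cased surface form to the ontology id it refers to."""
    index = {}
    for node in ET.fromstring(xml_text).iter("node"):
        for synonym in node.iter("synonym"):
            index[synonym.text.strip().lower()] = node.get("idref")
    return index


index = build_synonym_index(LEXICON_XML)
print(index["piii"])  # ontology id of the value described as "Intel Pentium 3"
```

Because every surface form points to a language-independent identifier, adding a language only requires another lexicon; the ontology itself is untouched.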

Page 13: Meta-learning for Web IE

Motivation:

• There are many different learning methods, producing different types of extraction grammar.

• In CROSSMARC we had four different approaches, which differed significantly in the information they extracted.

Proposed approach:

• Use meta-learning to combine the strengths of individual learning methods.

Page 14: Meta-learning for Web IE

Stacked generalization

[Figure: stacked generalization. Labels: base-level dataset D, split into Dj and D \ Dj; base learners L1…LN; classifiers C1(j)…CN(j); meta-level data MDj and meta-level dataset MD; meta-level learner LM; meta-classifier CM. At run time: new vector x, classifiers C1...CN, meta-level vector, class value y(x).]
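The stacked generalization scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the CROSSMARC implementation: the toy learners (`majority_learner`, `nearest_neighbour_learner`, `table_learner`), the interleaved fold assignment and the one-dimensional data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def majority_learner(examples):
    """Toy base learner: always predicts the most frequent training label."""
    label = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda x: label


def nearest_neighbour_learner(examples):
    """Toy base learner: predicts the label of the closest training point."""
    return lambda x: min(examples, key=lambda ex: abs(ex[0] - x))[1]


def table_learner(examples):
    """Toy meta-level learner LM: majority label per meta-level vector."""
    votes = defaultdict(Counter)
    for vector, y in examples:
        votes[vector][y] += 1
    default = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda v: votes[v].most_common(1)[0][0] if v in votes else default


def train_stacking(D, base_learners, meta_learner, folds=3):
    """Build the meta-level dataset MD fold by fold, then train CM on it."""
    MD = []
    for j in range(folds):
        Dj = D[j::folds]                                   # held-out fold
        rest = [ex for i, ex in enumerate(D) if i % folds != j]  # D \ Dj
        Cj = [L(rest) for L in base_learners]              # C1(j)...CN(j)
        MD += [(tuple(C(x) for C in Cj), y) for x, y in Dj]
    CM = meta_learner(MD)                                  # meta-classifier
    C = [L(D) for L in base_learners]                      # final C1...CN
    return lambda x: CM(tuple(c(x) for c in C))


D = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
stacked = train_stacking(D, [majority_learner, nearest_neighbour_learner],
                         table_learner)
print(stacked(0.15), stacked(0.85))
```

The key point is that the meta-level vectors are always produced by classifiers that did not see the example during training, so CM learns how far each base classifier can be trusted.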

Page 15: Meta-learning for Web IE

…TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB SDRAM up to 1GB…

Information Extraction is not naturally a classification task.

In IE we deal with text documents, paired with templates.

Template T

t(s,e) | s, e | Field f
Transport ZX | 47, 49 | model
15” | 56, 58 | screenSize
TFT | 59, 60 | screenType
Intel <b> Pentium III | 63, 67 | procName
600 MHz | 67, 69 | procSpeed
256 MB | 76, 78 | ram

Each template is filled with instances <t(s,e), f>.

Page 16: Meta-learning for Web IE

Combining Information Extraction systems

…TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB SDRAM up to 1GB…

T1 filled by the IE system E1

t(s, e) | s, e | f
Transport ZX | 47, 49 | model
15” | 56, 58 | screenSize
TFT | 59, 60 | screenType
Intel <b> Pentium III | 63, 67 | procName
600 MHz | 67, 69 | procSpeed
256 MB | 76, 78 | ram
1 GB | 81, 83 | ram

T2 filled by the IE system E2

t(s, e) | s, e | f
Transport ZX | 47, 49 | manuf
TFT | 59, 60 | screenType
Intel <b> Pentium | 63, 66 | procName
600 MHz | 67, 69 | procSpeed
256 MB | 76, 78 | ram
1 GB | 81, 83 | HDcapacity

Page 17: Meta-learning for Web IE

Creating a stacked template

…TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB SDRAM up to 1GB…

Stacked template (ST)

s, e | t(s, e) | Field by E1 | Field by E2 | Correct field
47, 49 | Transport ZX | model | manuf | model
56, 58 | 15” | screenSize | - | screenSize
59, 60 | TFT | screenType | screenType | screenType
63, 66 | Intel <b> Pentium | - | procName | -
63, 67 | Intel <b> Pentium III | procName | - | procName
67, 69 | 600 MHz | procSpeed | procSpeed | procSpeed
76, 78 | 256 MB | ram | ram | ram
81, 83 | 1 GB | ram | HDcapacity | -
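How the stacked template is assembled from the outputs of E1 and E2 can be sketched as follows. The tuple layout `(text, s, e, field)` and the function name `stack_templates` are assumptions made for this example; the instance values are the ones shown on the slides.

```python
# Templates copied from the slides: (text t(s,e), start s, end e, field f).
T1 = [
    ("Transport ZX", 47, 49, "model"),
    ('15"', 56, 58, "screenSize"),
    ("TFT", 59, 60, "screenType"),
    ("Intel <b> Pentium III", 63, 67, "procName"),
    ("600 MHz", 67, 69, "procSpeed"),
    ("256 MB", 76, 78, "ram"),
    ("1 GB", 81, 83, "ram"),
]
T2 = [
    ("Transport ZX", 47, 49, "manuf"),
    ("TFT", 59, 60, "screenType"),
    ("Intel <b> Pentium", 63, 66, "procName"),
    ("600 MHz", 67, 69, "procSpeed"),
    ("256 MB", 76, 78, "ram"),
    ("1 GB", 81, 83, "HDcapacity"),
]


def stack_templates(templates):
    """Return one row per distinct span: ((s, e), text, field per system).

    A system that extracted nothing for a span contributes "-".
    """
    text_of = {(s, e): t for T in templates for t, s, e, _ in T}
    rows = []
    for span in sorted(text_of):
        fields = tuple(
            next((f for _, s, e, f in T if (s, e) == span), "-")
            for T in templates
        )
        rows.append((span, text_of[span], fields))
    return rows


stacked = stack_templates([T1, T2])
for span, text, fields in stacked:
    print(span, text, *fields)
```

Each row, together with the correct field taken from the hand-filled template, becomes one meta-level training example, which turns the combination of IE systems into an ordinary classification task.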

Page 18: Meta-learning for Web IE

Training in the new stacking framework

[Figure: training diagram. Labels: D \ Dj and Dj; base learners L1…LN producing IE systems E1(j)…EN(j); stacked templates ST1, ST2, …; meta-level data MDj; meta-level learner LM; meta-classifier CM; final IE systems E1…EN produced by L1…LN.]

D = set of documents, paired with hand-filled templates

MD = set of meta-level feature vectors

Page 19: Meta-learning for Web IE

Stacking at run-time

[Figure: run-time diagram. Labels: new document d; IE systems E1, E2, …, EN; templates T1, T2, …, TN; stacked template; meta-classifier CM; final template of instances <t(s,e), f>.]

Page 20: Experimental results

Domain | Best base | Stacking
Courses | 65.73 | 71.93
Projects | 61.64 | 70.66
Laptops | 63.81 | 71.55
Jobs | 83.22 | 85.94
Seminars | 86.23 | 90.03

F1-scores (combined recall and precision) on four benchmark domains and one of the CROSSMARC domains.
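For reference, the F1-score is the harmonic mean of precision and recall. The short sketch below states that standard definition and computes the absolute gain of stacking over the best base system from the figures in the table. The names `f1_score`, `RESULTS` and `gains` are introduced here for illustration only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 scores copied from the slide: domain -> (best base system, stacking).
RESULTS = {
    "Courses": (65.73, 71.93),
    "Projects": (61.64, 70.66),
    "Laptops": (63.81, 71.55),
    "Jobs": (83.22, 85.94),
    "Seminars": (86.23, 90.03),
}

# Absolute gain of stacking over the best base system, in F1 points.
gains = {domain: round(stacking - best, 2)
         for domain, (best, stacking) in RESULTS.items()}

for domain, gain in gains.items():
    print(f"{domain}: +{gain:.2f}")
```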

Page 22: Learning CFGs

Motivation:

• Provide more complex extraction patterns for less structured text.

• Learn more compact and human-comprehensible grammars.

• Process large corpora containing only positive examples.

Proposed approach:

• Efficient learning of context-free grammars from positive examples, guided by Minimum Description Length.

Page 23: Learning CFGs

Introducing eg-GRIDS

• Infers context-free grammars.

• Learns from positive examples only.

• Overgeneralisation controlled through a heuristic, based on MDL.

• Two basic/three auxiliary learning operators.

• Two search strategies:

– Beam search.

– Genetic search.

Page 24: Learning CFGs

Minimum Description Length (MDL)

Model Length (ML) = GDL + DDL

Grammar Description Length (GDL): bits required to encode the grammar G.

Derivations Description Length (DDL): bits required to encode all training examples, as encoded by the grammar G.

[Figure: space of hypotheses between an overly specific grammar and an overly general grammar. Labels: GDL, DDL.]
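The trade-off ML = GDL + DDL can be made concrete with a toy calculation. The encoding below is an assumption chosen for simplicity and is not the coding scheme used by eg-GRIDS: every symbol occurrence in a rule (head, body symbols and a hypothetical end-of-rule marker) costs a fixed number of bits, and every derivation step costs log2 of the number of alternative rules for the expanded non-terminal.

```python
import math


def grammar_description_length(grammar):
    """GDL under a toy fixed-length code for rule symbols."""
    symbols = set(grammar)
    for bodies in grammar.values():
        for body in bodies:
            symbols.update(body)
    bits_per_symbol = math.log2(len(symbols) + 1)  # +1: end-of-rule marker
    occurrences = sum(1 + len(body) + 1            # head + body + marker
                      for bodies in grammar.values() for body in bodies)
    return occurrences * bits_per_symbol


def derivations_description_length(grammar, derivations):
    """DDL: each step selects one of the rules of the expanded non-terminal."""
    return sum(math.log2(len(grammar[nt]))
               for derivation in derivations for nt in derivation)


def model_length(grammar, derivations):
    return (grammar_description_length(grammar)
            + derivations_description_length(grammar, derivations))


# Training examples: "()" and "(())".
# Overly specific grammar: one rule per example.
specific = {"S": [("(", ")"), ("(", "(", ")", ")")]}
specific_derivations = [["S"], ["S"]]          # one rule choice per example

# More general grammar: S -> ( S ) | ( )
general = {"S": [("(", "S", ")"), ("(", ")")]}
general_derivations = [["S"], ["S", "S"]]      # "(())" needs two steps

print(model_length(specific, specific_derivations))
print(model_length(general, general_derivations))
```

With only two examples the difference is small, but the rule-per-example grammar grows with every new training sentence, whereas the recursive grammar only pays extra derivation bits.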

Page 25: Learning CFGs

eg-GRIDS Architecture

[Figure: eg-GRIDS flowchart. Labels: Training Examples; Overly Specific Grammar; Beam of Grammars; Search Organisation Selection; Operator Mode; Learning Operators; Merge NT Operator; Create NT Operator; Create Optional NT; Detect Center Embedding; Body Substitution; Evolutionary Algorithm; Mutation; decision “Any inferred grammar better than those in beam?” (YES/NO); Final Grammar.]

Page 26: Experimental results

• The Dyck language with k=1: S → S S | ( S ) | ε

Errors of:

• Omission: failures of the inferred grammar to parse sentences generated from the “correct” grammar (longer test sentences than in the training set).

– Overly specific grammar.

• Commission: failures of the “correct” grammar to parse sentences generated by the inferred grammar.

– Overly general grammar.
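The two error types can be illustrated with a small Python sketch for the Dyck language with k=1. Everything apart from the grammar itself is an assumption made for the example: the depth-bounded sampler, the stand-in "inferred" grammars (`memorised_accepts`, which only accepts a few memorised sentences, and `generate_any`, which emits arbitrary bracket strings) and the sample sizes. None of them is output of eg-GRIDS.

```python
import random


def is_dyck(sentence):
    """Recogniser for the "correct" grammar S -> S S | ( S ) | ε."""
    depth = 0
    for ch in sentence:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        else:
            return False
    return depth == 0


def generate_dyck(rng, max_depth=4):
    """Sample a sentence from the grammar, forcing ε at the depth bound."""
    if max_depth == 0:
        return ""
    choice = rng.choice(["SS", "(S)", ""])
    if choice == "SS":
        return generate_dyck(rng, max_depth - 1) + generate_dyck(rng, max_depth - 1)
    if choice == "(S)":
        return "(" + generate_dyck(rng, max_depth - 1) + ")"
    return ""


def error_rate(sentences, accepts):
    """Fraction of the sentences that the given recogniser rejects."""
    return sum(not accepts(s) for s in sentences) / len(sentences)


def memorised_accepts(sentence):
    """Hypothetical overly specific grammar: a memorised training set."""
    return sentence in {"", "()", "()()", "(())"}


def generate_any(rng, length=4):
    """Hypothetical overly general grammar: any string of brackets."""
    return "".join(rng.choice("()") for _ in range(length))


rng = random.Random(0)
sample = [generate_dyck(rng) for _ in range(200)]
# Omission: sentences of the correct grammar that the inferred one rejects.
print("omission:", error_rate(sample, memorised_accepts))
# Commission: sentences of the inferred grammar that the correct one rejects.
print("commission:", error_rate([generate_any(rng) for _ in range(200)], is_dyck))
```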

Page 27: Experimental results

[Figure: probability of parsing a valid sentence (1 - errors of omission).]

Page 28: Experimental results

[Figure: probability of generating a valid sentence (1 - errors of commission).]

Page 30: Ontology Enrichment

• Highly evolving domain (e.g. laptop descriptions)

– New instances characterize new concepts.

e.g. ‘Pentium 2’ is an instance that denotes a new concept if it doesn’t exist in the ontology.

– New surface appearance of an instance.

e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.

• We concentrate on instances.

• The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain.

Page 31: Ontology Enrichment

[Figure: ontology enrichment cycle. Labels: Multi-Lingual Domain Ontology; Annotating Corpus Using Domain Ontology; Corpus; machine learning; Information extraction; Additional annotations; Validation; Domain Expert; Ontology Enrichment / Population.]

Page 32: Finding synonyms

• The number of instances for validation increases with the size of the corpus and the ontology.

• There is a need for supporting the enrichment of the ‘synonymy’ relationship.

• Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).

• Issues to be handled:

Synonym: ‘Intel pentium 3’ - ‘Intel pIII’

Orthographical: ‘Intel p3’ - ‘intell p3’

Lexicographical: ‘Hewlett Packard’ - ‘HP’

Combination: ‘Intell Pentium 3’ - ‘P III’

Page 33: COCLU

• COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements, i.e. letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.

• CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when a candidate string is added. Huffman trees are used as models of the clusters.

• COCLU iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search. The new string is added to the closest cluster, or a new cluster is created, depending on a threshold on CCDiff.
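The idea behind CCDiff can be sketched in Python as follows. This is a simplified reading of the description above, not the published COCLU algorithm: here the Huffman code is rebuilt from the character frequencies of the cluster plus the candidate, a cluster with a single distinct character is charged one bit per character, and the threshold value is arbitrary. The function names are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code_length(strings):
    """Bits needed to encode all characters of `strings` with a Huffman
    code built from their own character frequencies."""
    freq = Counter("".join(strings))
    if not freq:
        return 0
    if len(freq) == 1:
        return sum(freq.values())  # assumption: 1 bit per character
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b  # a merge adds one bit to every character below it
        heapq.heappush(heap, a + b)
    return total


def cc_diff(cluster, candidate):
    """Increase in the cluster's code length when the candidate is added."""
    return huffman_code_length(cluster + [candidate]) - huffman_code_length(cluster)


def coclu_assign(clusters, candidate, threshold):
    """Add the candidate to the closest cluster, or open a new cluster
    when even the closest one exceeds the threshold."""
    best = min(clusters, key=lambda c: cc_diff(c, candidate), default=None)
    if best is not None and cc_diff(best, candidate) <= threshold:
        best.append(candidate)
    else:
        clusters.append([candidate])
    return clusters


clusters = [["abab"], ["cdcd"]]
coclu_assign(clusters, "ab", threshold=3)   # shares characters with "abab"
coclu_assign(clusters, "xyz", threshold=3)  # shares characters with nothing
print(clusters)
```

A string that reuses the characters of a cluster lengthens its encoding only slightly, while a string with unseen characters forces a larger code, which is why the score captures typographic similarity.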

Page 34: Experimental results

Initial | 2nd iter.
15/58 | 48/58
28/58 | 56/58
40/58 | 57/58

Discovering lexical synonyms:

Assign an instance to a group, while decreasing proportionally the number of instances available initially in each group.

[Figure: accuracy (%) against instances removed (%).]

Discovering new instances:

Hide part of the known instances.

Evolve ontology and grammars to recover them.

Page 35: Outline

• Motivation and state of the art

• SKEL research

– Vision

– Information integration in CROSSMARC.

– Meta-learning for information extraction.

– Context-free grammar learning.

– Ontology enrichment.

– BOEMIE: Bootstrapping ontology evolution with multimedia information extraction.

• Open issues

Page 36: BOEMIE - motivation

• Multimedia content grows at increasing rates in public and proprietary webs.

• Hard to provide semantic indexing of multimedia content.

• Significant advances in automatic extraction of low-level features from visual content.

• Little progress in the identification of high-level semantic features.

• Little progress in the effective combination of semantic features from different modalities.

• Great effort in producing ontologies for semantic webs.

• Hard to build and maintain domain-specific multimedia ontologies.

Page 37: BOEMIE - approach

[Figure: BOEMIE architecture. Labels: Content Collection (crawlers, spiders, etc.); MULTIMEDIA CONTENT; SEMANTICS EXTRACTION (from visual content, from non-visual content, from fused content); SEMANTICS EXTRACTION TOOLKIT (visual extraction tools, text extraction tools, audio extraction tools, information fusion tools); SEMANTICS EXTRACTION RESULTS; ONTOLOGY EVOLUTION (population & enrichment, coordination); ONTOLOGY EVOLUTION TOOLKIT (learning tools, reasoning engine, matching tools, ontology management tool); INITIAL ONTOLOGY; INTERMEDIATE ONTOLOGY; EVOLVED ONTOLOGY; OTHER ONTOLOGIES.]

Page 39: KR issues

• Is there a common formalism to capture the necessary semantic, syntactic and lexical knowledge for IE?

• Is that better than having separate representations for different tasks?

• Do we need an intermediate formalism (e.g. grammar + CG + ontology)?

• Do we need to represent uncertainty (e.g. using probabilistic graphical models)?

Page 40: ML issues

• What types and which aspects of grammars and conceptual structures can we learn?

• What training data do we need? Can we reduce the manual annotation effort?

• What background knowledge do we need and what is the role of deduction?

• What is the role of multi-strategy learning, especially if complex representations are used?

Page 41: Content-type issues

• What is the role of semantically annotated content in learning, e.g. as training data?

• What is the role of hypertext as a graph?

• Can we extract information from multimedia content?

• How can ontologies and learning help improve extraction from multimedia?

Page 42: SKEL Introduction

Acknowledgements

• This is the work of many current and past members of SKEL.

• CROSSMARC is joint work of the project consortium (NCSR “Demokritos”, Uni of Edinburgh, Uni of Roma ‘Tor Vergata’, Veltinet, Lingway).

