On the Need to Bootstrap Ontology Learning with Extraction Grammar Learning


Kassel, 22 July 2005

Georgios Paliouras

Software & Knowledge Engineering Lab

Inst. of Informatics & Telecommunications

NCSR “Demokritos”

http://www.iit.demokritos.gr/~paliourg

Kassel, 22/07/2005 ICCS’05 2

Outline

• Motivation and state of the art

• SKEL research

– Vision

– Information integration in CROSSMARC.

– Meta-learning for information extraction.

– Context-free grammar learning.

– Ontology enrichment.

– Bootstrapping ontology evolution with multimedia information extraction.

• Open issues

Motivation

• Practical information extraction requires a conceptual description of the domain, e.g. an ontology, and a grammar.

• Manual creation and maintenance of these resources is expensive.

• Machine learning has been used to:

– Learn ontologies based on extracted instances.

– Learn extraction grammars, given the conceptual model.

• Study how the two processes interact and the possibility of combining them.

Information extraction

• Common approach: shallow parsing with regular grammars.

• Limited use of deep analysis to improve extraction accuracy (HPSGs, concept graphs).

• Linking of extraction patterns to ontologies (e.g. information extraction ontologies).

• Initial attempts to combine syntax and semantics (Systemic Functional Grammars).

• Learning simple extraction patterns (regular expressions, HMMs, tree-grammars, etc.)

Ontology learning

• Deductive approach to ontology modification: driven by linguistic rules.

• Inductive identification of new concepts/terms.

• Clustering, based on lexico-syntactic analysis of the text (subcat frames).

• Formal Concept Analysis for term clustering and concept identification.

• Clustering and merging of conceptual graphs (conceptual graph theory).

• Deductive learning of extraction grammars in parallel with the identification of concepts.


SKEL - vision

Research objective: innovative knowledge technologies for reducing the information overload on the Web.

Areas of research activity:

– Information gathering (retrieval, crawling, spidering)

– Information filtering (text and multimedia classification)

– Information extraction (named entity recognition and classification, role identification, wrappers, grammar and lexicon learning)

– Personalization (user stereotypes and communities)

– Ontology learning and population


CROSSMARC Objectives

Develop technology for Information Integration that can:

• crawl the Web for interesting Web pages,

• extract information from pages of different sites without a standardized format (structured, semi-structured, free text),

• process Web pages written in several languages,

• be customized semi-automatically to new domains and languages,

• deliver integrated information according to personalized profiles.

CROSSMARC Architecture

[Figure: CROSSMARC architecture diagram; the only label recovered from the slide is “Ontology”.]

CROSSMARC Ontology

Ontology:

…
<description>Laptops</description>
<features>
  <feature id="OF-d0e5">
    <description>Processor</description>
    <attribute type="basic" id="OA-d0e7">
      <description>Processor Name</description>
      <discrete_set type="open">
        <value id="OV-d0e1041">
          <description>Intel Pentium 3</description>
        </value>
        …

Lexicon:

<node idref="OV-d0e1041">
  <synonym>Intel Pentium III</synonym>
  <synonym>Pentium III</synonym>
  <synonym>P3</synonym>
  <synonym>PIII</synonym>
</node>

Greek Lexicon (the synonym shown is Greek for “Processor Name”):

…
  <synonym>Όνομα Επεξεργαστή</synonym>
</node>
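The separation between ontology values and lexicon synonyms can be illustrated with a small lookup. This is only a sketch, not CROSSMARC code: the XML is a hand-completed miniature of the lexicon fragment above, the wrapping <lexicon> element is an assumption made so that the fragment parses, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Miniature of the lexicon fragment shown on the slide. The <lexicon>
# wrapper is an assumption of this sketch, added to get a single root.
LEXICON_XML = """
<lexicon>
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
</lexicon>
"""


def build_synonym_index(lexicon_xml):
    """Map each lower-cased synonym to the ontology id it points to."""
    index = {}
    root = ET.fromstring(lexicon_xml)
    for node in root.iter("node"):
        ontology_id = node.get("idref")
        for synonym in node.iter("synonym"):
            index[synonym.text.strip().lower()] = ontology_id
    return index


def resolve(surface_form, index):
    """Return the ontology id for a surface form, or None if unknown."""
    return index.get(surface_form.strip().lower())


SYNONYM_INDEX = build_synonym_index(LEXICON_XML)
```

With this index, resolve("PIII", SYNONYM_INDEX) yields the id of the value described as “Intel Pentium 3”, while an unseen form such as “Pentium 2” yields None, which is the situation the ontology enrichment work later in the talk addresses.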


Meta-learning for Web IE

Motivation:

• There are many different learning methods, producing different types of extraction grammar.

• In CROSSMARC we had four different approaches, with significant differences in the extracted information.

Proposed approach:

• Use meta-learning to combine the strengths of individual learning methods.

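Stacked generalization

[Figure: stacked generalization. Training: the base-level dataset D is split into D \ Dj and Dj; the learners L1…LN, trained on D \ Dj, produce classifiers C1(j)…CN(j), whose predictions on Dj form the meta-level dataset MD. Run-time: the classifiers C1…CN, produced by L1…LN, turn a new vector x into a meta-level vector, from which the class value y(x) is predicted.]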
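The stacking scheme in the figure can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the setup used in the talk: the base learners (1-nearest-neighbour on a single feature), the table-based meta-learner, and the toy data are all hypothetical choices made to keep the example self-contained.

```python
from collections import Counter


def one_nn_on(feature):
    """Factory for a base learner L_i: 1-nearest-neighbour on one feature."""
    def learn(train):
        def classify(x):
            nearest = min(train, key=lambda ex: abs(ex[0][feature] - x[feature]))
            return nearest[1]
        return classify
    return learn


def build_meta_dataset(data, learners, n_folds):
    """Create MD: for each fold D_j, train on the rest of D and predict on D_j."""
    meta = []
    for j in range(n_folds):
        held_out = data[j::n_folds]                                   # D_j
        rest = [ex for i, ex in enumerate(data) if i % n_folds != j]  # D minus D_j
        classifiers = [learn(rest) for learn in learners]             # C_1(j)..C_N(j)
        for x, y in held_out:
            meta.append((tuple(c(x) for c in classifiers), y))
    return meta


def learn_meta(meta):
    """Meta-level learner: majority true class per vector of base predictions."""
    votes = {}
    for vector, y in meta:
        votes.setdefault(vector, Counter())[y] += 1
    table = {v: c.most_common(1)[0][0] for v, c in votes.items()}

    def classify(vector):
        # Unseen prediction vectors fall back to a plain majority vote.
        return table.get(vector, Counter(vector).most_common(1)[0][0])
    return classify


def stack(data, learners, n_folds=3):
    """Return a function x -> y(x) implementing stacked generalization."""
    meta_classifier = learn_meta(build_meta_dataset(data, learners, n_folds))
    final = [learn(data) for learn in learners]  # C_1..C_N trained on all of D
    return lambda x: meta_classifier(tuple(c(x) for c in final))


# Toy data: feature 0 separates the classes, feature 1 is uninformative.
DATA = [((0.1, 5.0), "a"), ((0.2, 1.0), "a"), ((0.3, 4.0), "a"),
        ((0.9, 5.1), "b"), ((0.8, 1.1), "b"), ((0.7, 4.1), "b")]
predict = stack(DATA, [one_nn_on(0), one_nn_on(1)], n_folds=3)
```

On this toy data the meta-level learner ends up trusting the first base classifier whenever the two disagree, which is the kind of behaviour stacking is meant to discover from the held-out predictions.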

Information Extraction is not naturally a classification task

In IE we deal with text documents, paired with templates.

Document excerpt: …TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…

Template T

t(s,e)              s, e      Field f
Transport ZX        47, 49    model
15”                 56, 58    screenSize
TFT                 59, 60    screenType
Intel Pentium III   63, 67    procName
600 MHz             67, 69    procSpeed
256 MB              76, 78    ram

Each template is filled with instances <t(s,e), f>.

Combining Information Extraction systems

T1 filled by the IE system E1

t(s, e)             s, e      f
Transport ZX        47, 49    model
15”                 56, 58    screenSize
Intel Pentium III   63, 67    procName
256 MB              76, 78    ram
1 GB                81, 83    ram

T2 filled by the IE system E2

t(s, e)             s, e      f
Transport ZX        47, 49    manuf
Intel Pentium       63, 66    procName
256 MB              76, 78    ram
1 GB                81, 83    HDcapacity

Creating a stacked template

Stacked template (ST)

s, e     t(s, e)             Field by E1   Field by E2   Correct field
47, 49   Transport ZX        model         manuf         model
56, 58   15”                 screenSize    -             screenSize
59, 60   TFT                 screenType    screenType    screenType
63, 66   Intel Pentium       -             procName      -
63, 67   Intel Pentium III   procName      -             procName
67, 69   600 MHz             procSpeed     procSpeed     procSpeed
76, 78   256 MB              ram           ram           ram
81, 83   1 GB                ram           HDcapacity    -
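Merging the outputs of several IE systems into a stacked template amounts to grouping instances by their span. The sketch below assumes a hypothetical Instance tuple and a hypothetical stack_templates function; it uses only the instances listed for T1 and T2 on the “Combining Information Extraction systems” slide, so it does not reproduce the TFT and 600 MHz rows of the slide's stacked template, and it leaves out the “Correct field” column, which comes from the hand-filled templates at training time.

```python
from collections import namedtuple

# One filled slot of a template: text t(s, e), offsets s and e, field f.
Instance = namedtuple("Instance", "text start end field")

T1 = [  # output of IE system E1
    Instance("Transport ZX", 47, 49, "model"),
    Instance('15"', 56, 58, "screenSize"),
    Instance("Intel Pentium III", 63, 67, "procName"),
    Instance("256 MB", 76, 78, "ram"),
    Instance("1 GB", 81, 83, "ram"),
]
T2 = [  # output of IE system E2
    Instance("Transport ZX", 47, 49, "manuf"),
    Instance("Intel Pentium", 63, 66, "procName"),
    Instance("256 MB", 76, 78, "ram"),
    Instance("1 GB", 81, 83, "HDcapacity"),
]


def stack_templates(templates, missing="-"):
    """Merge templates: one row per distinct span, one field column per system."""
    rows = {}
    for system, template in enumerate(templates):
        for inst in template:
            key = (inst.start, inst.end)
            row = rows.setdefault(
                key, {"text": inst.text, "fields": [missing] * len(templates)})
            row["fields"][system] = inst.field
    # Rows are ordered by span, as in the stacked template on the slide.
    return [(s, e, rows[(s, e)]["text"], rows[(s, e)]["fields"])
            for s, e in sorted(rows)]


STACKED = stack_templates([T1, T2])
```

Each row of STACKED is a meta-level feature vector: the fields proposed by E1 and E2 for one span, with “-” where a system extracted nothing for that span.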

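Training in the new stacking framework

[Figure: training. The learners L1…LN, applied to D \ Dj, produce IE systems E1(j)…EN(j), whose outputs are merged into stacked templates ST1, ST2, …; the learners L1…LN also produce the final IE systems E1…EN.]

D = set of documents, paired with hand-filled templates

MD = set of meta-level feature vectors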

Stacking at run-time

[Figure: run-time pipeline. New document d → stacked template → CM → final template T, consisting of instances <t(s,e), f>.]

Experimental results

Domain     Best base   Stacking
Courses    65.73       71.93
Projects   61.64       70.66
Laptops    63.81       71.55
Jobs       83.22       85.94
Seminars   86.23       90.03

F1-scores (combined recall and precision) on four benchmark domains and one of the CROSSMARC domains.


Learning CFGs

Motivation:

• Wanting to provide more complex extraction patterns for less structured text.

• Wanting to learn more compact and human-comprehensible grammars.

• Wanting to be able to process large corpora containing only positive examples.

Proposed approach:

• Efficient learning of context-free grammars from positive examples, guided by Minimum Description Length.

Introducing eg-GRIDS

• Infers context-free grammars.

• Learns from positive examples only.

• Overgeneralisation controlled through a heuristic, based on MDL.

• Two basic/three auxiliary learning operators.

• Two search strategies:

– Beam search.

– Genetic search.

Minimum Description Length (MDL)

Model Length (ML) = GDL + DDL

• Grammar Description Length (GDL): bits required to encode the grammar G.

• Derivations Description Length (DDL): bits required to encode all training examples, as encoded by the grammar G.

[Figure: the space of hypotheses between the overly specific grammar and the overly general grammar, annotated with GDL and DDL.]
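The trade-off between GDL and DDL can be made concrete with a toy encoding. The sketch below is an assumption-heavy illustration, not the encoding used by eg-GRIDS: every symbol in a rule costs a fixed number of bits, each rule carries one end marker, and each rule choice in a derivation costs log2 of the number of alternatives of the expanded non-terminal. All function names are hypothetical.

```python
import math


def grammar_description_length(grammar):
    """GDL under a toy code: fixed-width symbols plus one end marker per rule."""
    symbols = set(grammar)
    for alternatives in grammar.values():
        for body in alternatives:
            symbols.update(body)
    bits_per_symbol = math.log2(len(symbols) + 1)  # +1 for the end marker
    n_tokens = sum(1 + len(body) + 1               # head + body + end marker
                   for alternatives in grammar.values()
                   for body in alternatives)
    return n_tokens * bits_per_symbol


def derivations_description_length(grammar, derivations):
    """DDL: log2(number of alternatives) bits for every rule choice made."""
    return sum(math.log2(len(grammar[head]))
               for derivation in derivations
               for head in derivation)


def model_length(grammar, derivations):
    """ML = GDL + DDL."""
    return (grammar_description_length(grammar)
            + derivations_description_length(grammar, derivations))


# Training sentences: "()" and "(())".
# Overly specific grammar: one rule per training sentence.
SPECIFIC = {"S": [("(", ")"), ("(", "(", ")", ")")]}
SPECIFIC_DERIVATIONS = [["S"], ["S"]]  # one rule choice per sentence

# More general grammar: S -> ( S ) | empty.
GENERAL = {"S": [("(", "S", ")"), ()]}
GENERAL_DERIVATIONS = [["S", "S"], ["S", "S", "S"]]  # one entry per expansion
```

Under these assumptions the specific grammar costs 20 bits for the grammar and 2 bits for the derivations (ML = 22), while the general grammar costs 14 bits and 5 bits (ML = 19): the grammar gets shorter and the derivations longer, and MDL prefers the grammar with the smaller sum.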

eg-GRIDS Architecture

[Figure: eg-GRIDS architecture. Labelled components: Training Examples, Overly Specific Grammar, Beam of Grammars, Search Organisation Selection, Operator Mode, Merge NT Operator, Create NT Operator, Create Optional NT, Detect Center Embedding, Body Substitution, Mutation, the test “Any inferred grammar better than those in beam?”, and Final Grammar.]

• The Dyck language with k=1: S → S S | ( S ) | ε

Errors of:

• Omission: failures to parse sentences generated from the “correct” grammar (longer test sentences than in the training set).

– Overly specific grammar.

• Commission: failures of the “correct” grammar to parse sentences generated by the inferred grammar.

– Overly general grammar.
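The two error types can be estimated by sampling, as the following sketch shows. It is an illustration under assumptions, not the evaluation code behind the results: the recogniser in_dyck plays the role of the “correct” grammar, sample_dyck draws sentences from it with a hypothetical depth bound so that sampling terminates, and error_rate is a hypothetical helper used for both measures.

```python
import random


def in_dyck(sentence):
    """True if a string over '(' and ')' is in the Dyck language with k=1."""
    depth = 0
    for symbol in sentence:
        if symbol == "(":
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:      # a closing bracket with nothing to close
                return False
        else:
            return False       # symbol outside the alphabet
    return depth == 0


def sample_dyck(rng, max_depth=4):
    """Sample from S -> S S | ( S ) | empty, bounded in depth to terminate."""
    if max_depth == 0:
        return ""
    choice = rng.choice(["concat", "nest", "empty"])
    if choice == "concat":
        return sample_dyck(rng, max_depth - 1) + sample_dyck(rng, max_depth - 1)
    if choice == "nest":
        return "(" + sample_dyck(rng, max_depth - 1) + ")"
    return ""


def error_rate(sentences, accepts):
    """Fraction of the given sentences that the predicate `accepts` rejects."""
    sentences = list(sentences)
    return sum(not accepts(s) for s in sentences) / len(sentences)


# Errors of omission: sentences of the correct grammar that the inferred
# grammar rejects. Here the "inferred grammar" is a toy, overly specific
# acceptor that only knows two sentences.
omission = error_rate(["()", "(())", "()()"], lambda s: s in {"()", "()()"})

# Errors of commission: sentences produced by the inferred grammar that the
# correct grammar rejects. Here they are given as a fixed toy list.
commission = error_rate(["()", ")(", "(("], in_dyck)
```

In a real experiment the first list would come from sample_dyck (for example with random.Random(0) as the generator) and the second from sampling the inferred grammar; the plots that follow report one minus each of these rates.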

[Plot: probability of parsing a valid sentence (1 - errors of omission)]

[Plot: probability of generating a valid sentence (1 - errors of commission)]


Ontology Enrichment

• Highly evolving domain (e.g. laptop descriptions)

– New instances characterize new concepts, e.g. ‘Pentium 2’ is an instance that denotes a new concept if it doesn’t exist in the ontology.

– New surface appearance of an instance, e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.

• We concentrate on instances.

• The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain.

Ontology Enrichment

[Figure: ontology enrichment cycle. Labelled components: Multi-Lingual Domain Ontology, Annotating Corpus Using Domain Ontology, Corpus, Information extraction, machine learning, Additional annotations, Validation, Domain Expert, Ontology Enrichment / Population.]

Finding synonyms

• The number of instances for validation increases with the size of the corpus and the ontology.

• There is a need for supporting the enrichment of the ‘synonymy’ relationship.

• Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).

• Issues to be handled:

– Synonym: ‘Intel pentium 3’ - ‘Intel pIII’

– Orthographical: ‘Intel p3’ - ‘intell p3’

– Lexicographical: ‘Hewlett Packard’ - ‘HP’

– Combination: ‘Intell Pentium 3’ - ‘P III’

COCLU

• COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements-letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.

• CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when a candidate string is added. Huffman trees are used as models of the clusters.

• COCLU iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search. The new string is added to the closest cluster, or a new cluster is created (controlled by a threshold on CCDiff).
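The idea behind CCDiff can be sketched with a character-level Huffman code. This is a simplified reading of the description above, not the COCLU implementation: it assumes the code length of a cluster is the number of bits needed to encode all characters of its strings with a Huffman code built from those same characters, ignores the cost of the tree itself, and assigns one bit per character when only one distinct character occurs. The function names and the threshold handling are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code_lengths(frequencies):
    """Return {symbol: code length in bits} for a Huffman code."""
    if len(frequencies) == 1:
        return {symbol: 1 for symbol in frequencies}
    # Heap items: (total frequency, tie-breaker, {symbol: depth so far}).
    heap = [(freq, i, {symbol: 0})
            for i, (symbol, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]


def code_length(strings):
    """Bits needed to encode every character of `strings` with their own code."""
    frequencies = Counter("".join(strings))
    if not frequencies:
        return 0
    lengths = huffman_code_lengths(frequencies)
    return sum(freq * lengths[symbol] for symbol, freq in frequencies.items())


def ccdiff(cluster, candidate):
    """Increase in the cluster's code length caused by adding the candidate."""
    return code_length(cluster + [candidate]) - code_length(cluster)


def assign(clusters, candidate, threshold):
    """Add the candidate to the closest cluster, or open a new cluster."""
    scores = [ccdiff(cluster, candidate) for cluster in clusters]
    if scores and min(scores) <= threshold:
        clusters[scores.index(min(scores))].append(candidate)
    else:
        clusters.append([candidate])
    return clusters
```

A string that reuses the characters of a cluster, such as ‘intell p3’ against a cluster of ‘Intel p3’ variants, adds few bits, whereas a string with mostly unseen characters forces longer codes for everything and therefore gets a larger CCDiff.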

Initial    2nd iter.
15/58      48/58
28/58      56/58
40/58      57/58

Discovering lexical synonyms:

Assign an instance to a group, while decreasing proportionally the number of instances available initially in each group.

[Plot: x-axis “Instances removed (%)”, ticks from 0 to 80]

Discovering new instances:

Hide part of the known instances.

Evolve ontology and grammars to recover them.


• SKEL research

– Vision

– BOEMIE: Bootstrapping ontology evolution with multimedia information extraction.

• Open issues

BOEMIE - motivation

• Multimedia content grows at an increasing rate in public and proprietary webs.

• Hard to provide semantic indexing of multimedia content.

• Significant advances in automatic extraction of low-level features from visual content.

• Little progress in the identification of high-level semantic features.

• Little progress in the effective combination of semantic features from different modalities.

• Great effort in producing ontologies for semantic webs.

• Hard to build and maintain domain-specific multimedia ontologies.

BOEMIE - approach

[Figure: BOEMIE approach. Labelled components:
Content Collection (crawlers, spiders, etc.); Multimedia Content;
Semantics Extraction (from visual content, from non-visual content, from fused content) and Semantics Extraction Results;
Semantics Extraction Toolkit: visual extraction tools, text extraction tools, audio extraction tools, information fusion tools;
Ontology Evolution: population & enrichment, coordination;
Ontology Evolution Toolkit: learning tools, reasoning engine, matching tools, ontology management tool;
Initial Ontology, Intermediate Ontology, Evolved Ontology, Other Ontologies.]


KR issues

• Is there a common formalism to capture the necessary semantic, syntactic and lexical knowledge for IE?

• Is that better than having separate representations for different tasks?

• Do we need an intermediate formalism (e.g. grammar + CG + ontology)?

• Do we need to represent uncertainty (e.g. using probabilistic graphical models)?

ML issues

• What types and which aspects of grammars and conceptual structures can we learn?

• What training data do we need? Can we reduce the manual annotation effort?

• What background knowledge do we need and what is the role of deduction?

• What is the role of multi-strategy learning, especially if complex representations are used?

Content-type issues

• What is the role of semantically annotated content in learning, e.g. as training data?

• What is the role of hypertext as a graph?

• Can we extract information from multimedia content?

• How can ontologies and learning help improve extraction from multimedia?

Acknowledgements

• This is the work of many current and past members of SKEL.

• CROSSMARC is joint work of the project consortium (NCSR “Demokritos”, University of Edinburgh, University of Rome ‘Tor Vergata’, Veltinet, Lingway).
