Lexical knowledge schemes for modeling words and expressions in communication

Computational Lexicology & Terminology Lab
Wauter Bosma, Isa Maks, Roxane Segers, Hennie van der Vliet, Piek Vossen

LCC meeting, October 9th, 2008, VU University Amsterdam


Overview

• Genre as a knowledge scheme

• What do we do at CLTL?

• How does it relate to genre?

• Projects at CLTL

• Discussion


A view on genre

• Genre is an abstract knowledge scheme that natural language speakers can apply to structure communication effectively.
– How and where is such a scheme stored?
– How is this knowledge activated and applied in a communicative setting?
– How can we benefit from these insights in computerized information and communication systems?


[Diagram: communication as social behaviour. Labels: Participants, Intentions, Attitudes; targets, strategy, form, medium; language (lexicon, grammar); Text: structure & content; genre; World Knowledge and Ontology (entities, objects, relations).]


Focus of Computational Lexicology and Terminology Lab (CLTL)

• Lexicon = model of abstract knowledge to efficiently process and produce natural language in communicative settings

• Symbolic & abstract representation of forms related to concepts:
– forms are variants that can refer to more or less the same semantic content:
  • shoot (V) – shooting (N) – aggression (N) – fight (N) – conflict (N) – war (N) – WOII (Name)
  • pay (V) – exchange (V) – buy (V) – sell (V) – merchandise (N) – trade (N) – business (N)

• Also encode pragmatic aspects of use:
– Sentiment, subjectivity & attitude
– Perspective
– Domain restrictions


Focus of Computational Lexicology and Terminology Lab (CLTL)

• Broad notion of knowledge:
– words & expressions (what is a word, what is a concept?)
– phrases, sentences and text (incorporating grammar)
– genres

• Abstract symbolic representations related to statistical expectation patterns

• Tagged corpus represents an 'experience' of language use:
– "X drinks beer", "Y drinks wine", "Z drinks milk"

• Lexicon is the highest abstraction of these experiences that gives the most effective prediction of how words and expressions behave:
– "XYZ drink beverages"

• Corpus-based lexicon or corpus data represented as a lexicon
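The step from tagged 'experiences' to a lexicon entry can be illustrated with a minimal sketch. It assumes the corpus has already been reduced to (subject, verb, object) triples and that a hypernym table is available; both data structures and the function name `generalize` are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Hypothetical corpus "experiences": (subject, verb, object) triples.
OBSERVATIONS = [
    ("X", "drink", "beer"),
    ("Y", "drink", "wine"),
    ("Z", "drink", "milk"),
]

# Hypothetical hypernym table; a real lexicon would supply these links.
HYPERNYM = {"beer": "beverage", "wine": "beverage", "milk": "beverage"}


def generalize(observations, hypernym):
    """Collapse the observed objects of each verb into a shared hypernym."""
    objects_by_verb = defaultdict(set)
    for _subject, verb, obj in observations:
        objects_by_verb[verb].add(obj)

    lexicon = {}
    for verb, objects in objects_by_verb.items():
        parents = {hypernym.get(obj) for obj in objects}
        # Generalize only when every observed object shares one known parent.
        if len(parents) == 1 and None not in parents:
            lexicon[verb] = parents.pop()
        else:
            lexicon[verb] = sorted(objects)
    return lexicon


LEXICON = generalize(OBSERVATIONS, HYPERNYM)
print(LEXICON)
```

When the observed objects do not share a single known hypernym, the sketch keeps the raw list instead of abstracting, which mirrors the idea that the lexicon should only abstract where the abstraction still predicts the data.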


Focus of Computational Lexicology and Terminology Lab (CLTL)

• Validation of models and databases with lexical knowledge:
– Can we define types of structures (lexical and compositional expressions) that correctly predict their behavior in language use? For example: pluriform-object-count-noun (police), object-count-noun (police officer), group-object-count-noun (eikenbos (oak forest)), mass-object-uncount-noun (bos (forest))
– Can we build a comprehensive database using these types?

• Use the database in corpus research and analysis:
– import corpus data into the lexical database
– apply the database to textual corpora in computer applications:
  • Automatic tagging of corpora with features
  • Automatically mine textual data using the lexicon as a background knowledge resource, e.g. to find facts of causal relations for environmental phenomena


Text corpus with empirical data:
– linear text
– every word occurrence is unique
– domain and genre specific

Term database:
– generic list of terms
– derived from text corpus
– patterns and features that are dominant in domain and genre

Lexical database:
– generic list of words and terms
– abstracts from various text corpora
– differentiation for different domains and genres
– most generic representation in a language community

Ontology:
– concepts instead of words
– identity criteria
– language neutral
– domain and perspective neutral
– no genre dependency
– logically valid
– for inferencing

[Diagram arrows between the four resources: Derive, Map, Validate, Integrate]


Projects at CLTL

• Cornetto (Stevin project: STE05039)

• Kyoto (FP7 ICT Work Programme 2007 under Challenge 4 - Digital libraries and Content, project ICT-211423)

• Camera projects:
– From sentiments and opinions in text to positions of political parties
– The semantics of history

• A term bank for the Belastingdienst (Steunpunt Terminologie)

• DutchSemCor (NWO investment grant)


Cornetto

• COmbinatorial Relational NEtwork voor Taal TOepassingen (Combinatorial Relational Network for Language Applications)

• Goal: to develop a lexical semantic database for Dutch:
– 90K entries: generic and central part of the language
– Rich horizontal and vertical semantic relations
– Combinatoric information
– Ontological information


Lexical Unit & Synsets

• Lexical Unit = form-meaning relation, such that:
– form = abstract representation of certain realizations;
– part-of-speech is the same;
– meaning is the same, where meaning is defined by a reference to a unique Synset.

• Synset = set of synonyms (LUs) that refer to the same entities in most contexts.
– Defined by lexical semantic relations;
– Defined by reference to ontology Terms or logical expressions involving Terms from the ontology.
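The two definitions above can be sketched as a small data structure. This is a minimal illustration, not the Cornetto schema: the class names, field names and the identifier `syn-bicycle` are assumptions made for the example, and the two lexical units are taken from the bicycle synset shown later in the slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexicalUnit:
    """A form-meaning pair: one form, one part of speech, one synset."""
    form: str
    pos: str
    sense: int
    synset_id: str  # the meaning is a reference to exactly one synset


@dataclass
class Synset:
    """A set of lexical units that refer to the same entities in most contexts."""
    synset_id: str
    units: list = field(default_factory=list)
    relations: dict = field(default_factory=dict)       # lexical semantic relations
    ontology_terms: list = field(default_factory=list)  # references into an ontology

    def add(self, unit):
        # A lexical unit may only join the synset it refers to.
        if unit.synset_id != self.synset_id:
            raise ValueError("lexical unit points at a different synset")
        self.units.append(unit)


# Hypothetical identifier; the forms and sense numbers come from the slides.
BICYCLE = Synset("syn-bicycle")
BICYCLE.add(LexicalUnit("fiets", "noun", 1, "syn-bicycle"))
BICYCLE.add(LexicalUnit("rijwiel", "noun", 1, "syn-bicycle"))
print([unit.form for unit in BICYCLE.units])
```

Storing the synset reference on the lexical unit keeps the "unique Synset" constraint explicit: a form with several meanings becomes several lexical units, each pointing at a different synset.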


Data Organization

[Diagram: Cornetto data organization. A Lexical Unit corresponds to a word-meaning pair and records form, morphology, syntax, semantics, pragmatics and usage examples. A Synset groups synonyms and models meaning relations through internal relations. Linked resources: Princeton Wordnet, Wordnet Domains, the Spanish, Czech, German, French, Korean and Arabic wordnets, and SUMO/MILO (a collection of Terms and Axioms).]


Data overview

                      ALL       NOUNS    VERBS    ADJ.     ADV.   Other
Synsets               70,434    52,888   9,053    7,703    220    570
Lexical Units         118,466   85,278   17,363   15,731   73     21
Lemmas (form+pos)     91,991    70,556   9,055    12,307   73     n.a.
Synonyms in synsets   102,572   74,893   14,091   12,899   84     605
CID records           103,668   75,812   14,093   13,089   484    190
Synonyms per synset   1.46      1.42     1.56     1.67     0.38   1.06
Senses per lemma      1.29      1.21     1.92     1.28     1.00   n.a.
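The two ratio rows can be recomputed from the count rows. The sketch below assumes that "synonyms per synset" is synonyms in synsets divided by synsets, and that "senses per lemma" is lexical units divided by lemmas; the slides do not state these formulas, but the published ratios agree with them. The "Other" column is left out because it has no lemma count.

```python
# Counts copied from the data overview, in the column order
# ALL, NOUNS, VERBS, ADJ., ADV.
SYNSETS = [70434, 52888, 9053, 7703, 220]
LEXICAL_UNITS = [118466, 85278, 17363, 15731, 73]
LEMMAS = [91991, 70556, 9055, 12307, 73]
SYNONYMS = [102572, 74893, 14091, 12899, 84]


def ratios(numerators, denominators):
    """Divide column by column and round to two decimals, as in the table."""
    return [round(n / d, 2) for n, d in zip(numerators, denominators)]


# Assumed formula: synonyms in synsets / synsets.
SYNONYMS_PER_SYNSET = ratios(SYNONYMS, SYNSETS)
# Assumed formula: lexical units / lemmas.
SENSES_PER_LEMMA = ratios(LEXICAL_UNITS, LEMMAS)

print(SYNONYMS_PER_SYNSET)
print(SENSES_PER_LEMMA)
```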


[Diagram: four senses of the Dutch noun "band" in Cornetto, each with linked concepts and combinatorics.]

band#1 (band)
– Linked concepts: jazzband (jazz band), popgroep (pop group), muziekgezelschap (music group), gezelschap (group of people), groep (group), muzikant (musician), artiest (artist), muziek (music), musiceren (to make music)
– Combinatorics: in een band spelen (to play in a band), een band oprichten (to start a band), de band speelt (the band plays)

band#2 (tire)
– Linked concepts: ring (ring), voorwerp (object), fietsband (bike tire), buitenband (outer tire), binnenband (inner tire), autoband (car tire), zwemband (tire for swimming)
– Combinatorics: de band oppompen (to pump air in a tire), een band plakken (to fix a hole in a tire), een lekke band (flat tire), de band springt (the tire explodes)

band#3 / geluidsband (audio tape)
– Linked concepts: cassettebandje (audio cassette), geluidsdrager (audio carrier), informatiedrager (data carrier), middel (device), schrijven (write), lezen (read)
– Combinatorics: de band starten (to start a tape), op de band opnemen (to record on a tape), de band afspelen (to play from a tape)

band#5 (bond)
– Linked concepts: verhouding (relation), relatie (relation), toestand (state), bloedband (blood bond), familieband (family bond), moederband (mother bond)
– Combinatorics: een goede/sterke band (a good/strong bond), de banden verbreken (to break all bonds), een band hebben met iemand (to have a bond with s.o.)


Integrating the ontology: Sumo terms and axioms

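Lexicon versus Ontology

[Diagram: an ontology fragment with lexicon entries mapped onto it. Ontology labels: Abstract, Physical, Element (H2O, CO2), Process, PossessionTransaction, Organism, Dog, PoodleDog.]

• NAMES for TYPES: {poodle}EN, {poedel}NL, {pudoru}JP
  ((instance x Poodle))

• LABELS for ROLES: {watchdog}EN, {waakhond}NL, {banken}JP
  ((instance x Canine) (role x GuardingProcess))

• LABELS for ROLES: {bluswater} (water for extinguishing fire), {theewater} (water for tea), {koffiewater} (water for coffee)

• {buy}, {sell}: role labels giver, receiver and goods; grammatical labels subj, obj and ind obj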


Kyoto

• Yielding Ontologies for Transition-Based Organization

• Funded:
– 7th Framework Program-ICT of the European Union: Intelligent Content and Semantics

• Goal:
– Platform for knowledge sharing across languages and cultures
– Enables knowledge transition and information search across different target groups, transgressing linguistic, cultural and geographic boundaries.
– Open text mining and deep semantic search
– Wiki environment that allows people in the field to maintain their knowledge and agree on meaning without knowledge engineering skills

• URL: http://www.kyoto-project.eu/

• Duration: March 2008 – March 2011

• Effort: 364 person months of work


KYOTO (ICT-211423) Overview

• Languages:
– English, Dutch, Italian, Spanish, Basque, Chinese, Japanese

• Domain:
– Environmental domain, BUT usable in any domain

• Global:
– Both European and non-European languages

• Available:
– Free: as open source system and data (GPL)

• Future perspective:
– Content standardization that supports world wide communication
– Global Wordnet Grid


[Diagram: Kyoto overview. Capture and Index: images, docs, URLs, experts. Mining: Tybots (concept mining) and Kybots (fact mining). Knowledge resources: Wikyoto, domain wordnets, Global Wordnet Grid, Universal Ontology with the levels Top, Middle and Domain (labels: Abstract, Physical, Substance, Process, water, CO2, CO2 emission, water pollution). User groups: citizens, governors, companies, environmental organizations. Access: search and dialogue. Example result: "Sudden increase of CO2 emissions in 2008 in Europe".]


User perspective

• Ecosystem services
– nature as a resource: food, transport, recreation, medicine, material
– nature for waste absorption
– economic dependency
– state of nature
– footprint
– poverty


Lexicon versus Ontology

[Diagram: environmental domain terms that qualify ontology concepts. Ontology labels: Abstract, Physical, Element (H2O, CO2), Process, PossessionTransaction, Organism, Artifacts. Domain terms: Ecosystem services (nature as a resource, nature for waste absorption, state of nature, threats to nature), branding rural products, sustainable products, green roof, alien invasive species, species migration, ecosystem-based drinking water production, green house gas, Spider.]


System components

• Wikyoto = wiki environment for a social group:
– to model the terms and concepts of a domain and agree on their meaning, within group, across languages and cultures
– to define the types of knowledge and facts of interest

• Tybots = Term extraction robots, extract term data from text corpus

• Kybots = Knowledge yielding robots, extract facts from a text corpus

• Linguistic processors:
– tokenizers, segmentizers, taggers, grammars
– named entity recognition
– word sense disambiguation
– generate a layered text annotation in Kyoto Annotation Format (KAF)
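The idea of a layered annotation, where each processor adds a layer that points back into the layer below, can be sketched as follows. This is not the KAF schema: the class names, the identifier scheme (`w1`, `t1`) and the toy lemma table are assumptions made for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    token_id: str
    text: str


@dataclass
class Term:
    term_id: str
    lemma: str
    token_ids: list   # pointers into the token layer
    sense: str = ""   # left empty until word sense disambiguation runs


@dataclass
class AnnotatedText:
    tokens: list = field(default_factory=list)
    terms: list = field(default_factory=list)


def tokenize(text):
    """First layer: split on whitespace and number the tokens."""
    doc = AnnotatedText()
    for number, word in enumerate(text.split(), start=1):
        doc.tokens.append(Token(f"w{number}", word))
    return doc


def add_terms(doc, lemmas):
    """Second layer: one term per token found in the toy lemma table."""
    for token in doc.tokens:
        lemma = lemmas.get(token.text.lower())
        if lemma is not None:
            term_id = f"t{len(doc.terms) + 1}"
            doc.terms.append(Term(term_id, lemma, [token.token_id]))
    return doc


LEMMAS = {"emission": "emission", "gases": "gas", "areas": "area"}  # toy data
DOC = add_terms(
    tokenize("the emission of greenhouse gases in agricultural areas"), LEMMAS
)
print([(term.term_id, term.lemma, term.token_ids) for term in DOC.terms])
```

Because each layer only stores references to the layer below, a later processor (for instance word sense disambiguation filling in `sense`) can add information without rewriting the original text.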


[Diagram: system architecture. Components and data stores:]
– Capture Server
– Document Base (linear KAF)
– Tybot server (term extraction)
– Extracted Terms (generic K-TMF)
– Term Editor (Wikyoto)
– Domain Ontology (OWL_DL)
– Domain Wordnet (K-LMF)
– Semantic Annotation
– Document Base (linear generic KAF)
– Kybot Server (fact extraction)
– Kybot Editor
– Kybot Profiles
– Users: Concept User, Fact User


[Diagram: conceptual modeling with Tybot concept miners.]
– Source documents pass through linguistic processors (morpho-syntactic analysis):
  [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP]NP
– English Wordnet senses: emission:2, emission:3, natural process:1, gas:1, greenhouse gas:1, substance:1, area:1, rural area:1, geographical area:1, region:3, location:3, farmland:2
– Term hierarchy: emission (of) gas, greenhouse gas; (in) area, agricultural area
– Ontology: Abstract, Physical, Substance (H2O, CO2), Process, Chemical Reaction, CO2Emission, WaterPollution, GlobalWarming, GreenhouseGas
– Steps: Synthesize, Ontologize, Axiomatize
– Example axiom: (instance s1 Substance) (instance e1 Warming) (katalyist s1 e1)


Fact mining by Kybots

[Diagram: fact mining pipeline.]
– Source documents pass through linguistic processors (morpho-syntactic analysis):
  [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP]NP
– Ontology with logical expressions, at generic and domain level: Abstract, Physical, Substance (H2O, CO2), Process, Chemical Reaction
– Wordnets & linguistic expressions, at generic and domain level: CO2 emission, water pollution
– Fact analysis:
  [the emission]NP             Process: e1
  [of greenhouse gases]PP      Patient: s2
  [in agricultural areas]PP    Location: a3
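The fact analysis step can be sketched as a small rule table applied to the chunked phrase. The rule table below is a hypothetical stand-in for a Kybot profile, whose real format is not shown in the slides; it only reproduces the role assignment of the example (Process, Patient, Location) and its variable names.

```python
# Chunks from the morpho-syntactic analysis of the example phrase.
CHUNKS = [
    ("the emission", "NP"),
    ("of greenhouse gases", "PP"),
    ("in agricultural areas", "PP"),
]

# Hypothetical profile: (chunk label, required first word, role, variable prefix).
# None as required first word means "any".
PROFILE = [
    ("NP", None, "Process", "e"),
    ("PP", "of", "Patient", "s"),
    ("PP", "in", "Location", "a"),
]


def analyse(chunks, profile):
    """Assign a role and a variable to every chunk matched by the profile."""
    facts = []
    for index, (text, label) in enumerate(chunks, start=1):
        first_word = text.split()[0]
        for want_label, want_word, role, prefix in profile:
            if label == want_label and want_word in (None, first_word):
                facts.append((text, role, f"{prefix}{index}"))
                break  # the first matching rule wins
    return facts


FACTS = analyse(CHUNKS, PROFILE)
for text, role, variable in FACTS:
    print(f"[{text}] {role}: {variable}")
```

Chunks that match no rule are simply skipped, so the sketch only yields facts for patterns the profile describes.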


[Diagram: Wikyoto interview dialogue driven by the Kyoto server. Labels: Kyoto Server, Hidden, Shown, Interview.]
– Text evidence (pdf, Wikipedia): "... populations declined ...", "... terrestrial and marine species ... in forests ... declined", "... populations such as terrestrial and marine species ..."
– Simplified term fragment: population, marine species, terrestrial species
– Simplified ontology fragment: ?Population, Group
– Interview questions:
  • Do populations consist of marine species?
  • Do populations always consist of marine species?
  • Are terrestrial species a type of populations?
  • Are terrestrial species never marine species?

[Diagram: Smart Kytext. Components: KAF, Tybots, DE-TN, DE-WN, G-WN, DE-KON, G-KON, SUMO, DOLCE, FRAMENET, FactAF, Kybots, GEO plugin, plugin. Output formats: facts in RDF, wordnets in LMF, ontologies in OWL.]

[Diagram: from text corpus to ontology. Further labels: maximal abstraction & integrity; language neutral integrity.]

Text corpus (linear text):
– "Sudden increase of green house gases in 2003 ...", "CO2 emission in European countries ...", "Green house gases such as CO2, ..."

Term database (generic, text based), filled by concept mining with Tybots:
– gas
– green house gas -> gas
  • increase (AG)
  • in 2003 (TIME)
– CO2 -> green house gas
  • emission (PA)
  • in European countries (LO)

Lexical database: wordnet
– emission:2, emission:3, natural process:1, gas:1, greenhouse gas:1, substance:1, CO2

Ontology:
– Abstract, Physical, Substance (H2O, CO2), Process, ChemicalReaction, CO2Emission, GlobalWarming, GreenhouseGas
– Steps: Synthesize, Ontologize, Axiomatize; text mining by Kybots
– Example axiom: (instance s1 Substance) (instance e1 Warming) (katalyist s1 e1)


From sentiments and opinions in text to positions of political parties

• Most language use does not express facts but personal opinions and positions with respect to facts or issues, often disguised for some communicative or manipulative goal.

• CAMERA project involving 2 AIOs (PhD students) from FdL (Faculty of Arts) and 1 AIO from Political Sciences

• Combines contemporary theories and methods in linguistics and political science to develop an automated research tool for rich text-mining:
– Complexity of language use, the linguistic modeling of subjectivity and the representation of this knowledge in a lexicon.
– Complex dimensionality of competition between political parties.

• Mining tool for language-meaning research can be applied to enhance the Kieskompas (Electoral Compass).

[Diagram: project workflow. Strands: Corpus Linguistics (Political Text Corpus), Lexical Analysis (Lexical database), Political Analysis (Political Database). Activities: quantitative text analysis, concordance search, search, quantitative data analysis, morpho-syntactic parsers, modeling, manual coding, manual coding & tagging, automated tagging & analysis, co-occurrence, lexical acquisition, derivation, linguistic rules, interpretation rules. Roles: aio-1, aio-2, aio-3, system integrator-4. Related project Omstreden democratie (Contested Democracy): Jan Kleinnijenhuis, Wouter van Atteveldt.]


AIO-1: Lexical model and acquisition for sentiment and opinion analysis in Dutch text

• Words & expressions in political text

• Model sentiment, subjectivity, lexical framing and attitudinal implications

• Build a lexicon encoding these layers

• Validate the lexicon in the mining application applied to the text corpus


Levels of subjectivity

• sentiment orientation, e.g.
– small (neutral), splendid (positive), dull (negative)
– funeral (negative), birthday party (positive), meeting (neutral)

• explicit attitudinal and deontic implications
– hate, love, favour, desire, want
– impossible, possible, can, cannot
– demand, beg, hope, wish

• implicit attitudinal and deontic implications
– neutral: describe, cite, quote
– subjective: tell my story, shout, cry out, suggest
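A lexicon that encodes these three levels as separate layers could be sketched as below. The entries are the examples from the slide; the dictionary layout, the layer names and the function `tag` are assumptions made for the illustration, not the project's lexicon format.

```python
# Layer 1: sentiment orientation.
SENTIMENT = {
    "small": "neutral", "splendid": "positive", "dull": "negative",
    "funeral": "negative", "birthday party": "positive", "meeting": "neutral",
}

# Layer 2: expressions with explicit attitudinal or deontic implications.
EXPLICIT = {
    "hate", "love", "favour", "desire", "want",
    "impossible", "possible", "can", "cannot",
    "demand", "beg", "hope", "wish",
}

# Layer 3: implicit attitudinal or deontic implications.
IMPLICIT = {
    "describe": "neutral", "cite": "neutral", "quote": "neutral",
    "tell my story": "subjective", "shout": "subjective",
    "cry out": "subjective", "suggest": "subjective",
}


def tag(expression):
    """Return every subjectivity layer in which the expression is listed."""
    key = expression.lower()
    layers = {}
    if key in SENTIMENT:
        layers["sentiment"] = SENTIMENT[key]
    if key in EXPLICIT:
        layers["explicit"] = True
    if key in IMPLICIT:
        layers["implicit"] = IMPLICIT[key]
    return layers


for example in ("splendid", "hope", "shout", "table"):
    print(example, tag(example))
```

Keeping the layers apart lets one expression carry several kinds of information at once, and an expression that is absent from every layer yields an empty result rather than a default label.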


Some concepts of saying

The reporter expresses attitude towards the subject (is not aware):
– nazeggen:1, herhalen:4, echoën:2
– meesmuilen:1
– herkauwen:2
– toesnauwen:1, aanblaffen:2, sissen:2, toebijten:1, toeblaffen:1
– toesmijten:2, toevoegen:4
– uitputten:3
– verzuchten:1
– pretenderen:1, beweren:1

Subject of speech act has attitude towards (is aware):
– afzeggen:1, cancellen:1
– ontkennen:1, miskennen:1, ontveinzen:1
– toewensen:1, wensen:2
– verbieden:1
– aanzetten:12, beklemtonen:2, hameren:2, tamboereren:2, onderstrepen:2, onderlijnen:1, accentueren:1
– toezeggen:1, beloven:1
– uitlaten:5, beoordelen:1
– distantiëren:1
– erkennen:2, toegeven:1
– opmerken:2, aantekenen:4


Synsets or lexical units

• {brilliant:3, glorious:4, magnificent:1, splendid:2}

• {bus:4, jalopy:1, heap:3}
– has_hyperonym: {car:1, auto:1, automobile:1, machine:4, motorcar:1}

• {fiets:1, brik:7, kar:3, karretje:2, rijwiel:1, velo:1}


The semantics of history

• Camera project involving 1 AIO from FdL and 1 AIO from FEW (Exact Science)

• Goal: an ontology and lexicon for a historical multimedia archive of the Rijksmuseum.

• Applied to an innovative information system for accessing the historical archive.


The semantics of history = semantics of change

• Represent different realities:
– related through causal changes over time
– representing different views or perspectives on the same reality, e.g. from a different historical angle or from different geographical or social parties.

• Changes are typed as events


Events as key notions

• Historical events:
– events considered from a distance in time and abstraction of detail.
– referenced by names (WOII, de Val van Srebrenica), nouns (war) or nominalizations (the violation of human rights)

• News events:
– Reports on (the same) reality but more in the active verbal form: US soldiers shoot Iraqi citizens.
– Close to the actual event
– lacking a historical abstraction and filtering.

• Both news and historical accounts imply subjectivity and perspective on these events but probably make different selections and use different genres to convey this information.

• News becomes history over time, and we therefore expect a smooth transition in the use of language to refer to the same events, adding more and more historical perspective.


“Val van Srebrenica” (Fall of Srebrenica) in Wikipedia

• Headings:
– 1992 ethnic cleansing campaign
– The conflict in eastern Bosnia
– Struggle for Srebrenica

• Text:
– A fierce struggle for territorial control then ensued among the three major groups in Bosnia: Bosniak (commonly known as 'Bosnian Muslims'), Serb and Croat. In the eastern part of Bosnia, close to Serbia, conflict was particularly fierce between Serbs and Bosniaks
– Serb military and paramilitary forces from the area and neighboring parts of eastern Bosnia and Serbia gained control of Srebrenica for several weeks in early 1992, killing and expelling Bosniak civilians. In May 1992, Bosnian government forces under the leadership of Naser Orić recaptured the town
– thus proceeded with the ethnic cleansing of Bosniaks from Bosniak ethnic territories in Eastern Bosnia and Central Podrinje


Letter from the Dutch minister of defense

• (Translated from Dutch) Over the past six months, the execution of these tasks was considerably hampered by the Bosnian Serb refusal to allow sufficient resupply of the enclave. Due to a lack of fuel, patrols had to be carried out on foot. Since May, the Bosnian Serbs also blocked the rotation of Dutchbat personnel, which reduced its strength from 630 to 430 blue helmets. Hostilities gradually increased, so that on 3 June an observation post in the south-eastern part of the enclave had to be abandoned.

• Historical terms: blokkade (blockade), val (fall), opgave (abandonment), overgave (surrender)

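[Diagram: workflow for the semantics of history. Data sources: structured data model, semi-structured data, free text. Steps: data conversion, term extraction, alignment, ontolization, lexicalization, lexical mapping, validation, smart indexing, smart retrieval. Resources: terms & relations, lexicon, ontology (event ontology, historic ontology). Indexed items: objects, events, locations, people. Example event terms: conflict, struggle, ethnic cleansing, ..., killing, expelling, gain control.]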


AIO at FdL

• Lexical framing of events in news reporting and historical descriptions.

• Use historical thesaurus to group all the words and expressions in a lexicon relative to the same events

• Differentiate implications of the lexical variation: packaging of events

• Classification of news

Thank you for your attention

