

23 January 2003 APAN-Fukuoka

Language and Tools for Lexical Resource Management

Asanee Kawtrakul (1), Aree Thunkijjanukij (2), Preeda Lertpongwipusana (1), Poonna Yospanya (1)

(1) Department of Computer Engineering, Faculty of Engineering; (2) Thai National AGRIS Center

Kasetsart University

Acknowledgement

• JIRCUS: Japan International Research Center for Agricultural Sciences

• Organizing committee

• Kasetsart University

Outline

• Background & Motivation

• Problems in Lexical Resource Preparation

• Requirements for Lexical Resource Management

• Proposed Language and tools

• Conclusion and Next steps

Background and Motivation

• Thailand is an agriculture-based country
– It holds a wealth of knowledge and data in the agricultural field

• A great quantity of agricultural information is scattered across unstructured and unrelated text
– Skimming, digesting, and integrating it becomes essential

• Knowledge is spread around the world
– Knowledge discovery without language barriers is also needed

The Basic Idea behind..

[Diagram: system components: Internet, Gathering Module, Agricultural Document Collection, Indexing and Clustering Module, Summarization Module, Translation Module, Data Cube, Graphical User Interface]

Textual Data as an Input

Let us focus on Canada’s agricultural products. In 1998, there were 1,216 registered commercial egg producers in Canada. Ontario produced 39.8% of all eggs in Canada; Quebec was second with 16.6%. The western provinces have a combined egg production of 35.6% and the eastern provinces have a combined production of 8.0%.

Courtesy of Agriculture and Agri-Food Canada, http://www.agr.ca/cb

Summarization and Translation as a Result

Category   Exporter   Year   Month      Price   Unit
Paddy      Thailand   2002   January    300     Dollars/Ton
Paddy      Thailand   2002   February   285     Dollars/Ton

ประเภท       ผู้ส่งออก      ปี      เดือน        ราคา     หน่วย
ข้าวเปลือก    ประเทศไทย    2545   มกราคม      14,340   บาทต่อเกวียน
ข้าวเปลือก    ประเทศไทย    2545   กุมภาพันธ์    13,625   บาทต่อเกวียน

(Thai version of the same table: Category, Exporter, Year, Month, Price, Unit; paddy, Thailand, B.E. 2545, January and February, in baht per kwian.)

The Development of Agricultural System for Knowledge Acquisition and Dissemination

• 5-year project (2001-2005)

• Collaborative work between:
– Thai National AGRIS Center: providing the bilingual thesaurus (AGROVOC)
– Department of Computer Engineering: developing NLP techniques for searching, summarizing, and translation, including tools for lexical resource management

• Funded by Kasetsart University Research and Development Institution

[Diagram: system architecture. Components: Acquisition System, Rules, Thesaurus, Lexicon, Linguist/Domain Expert, Very Large Corpus, Document Indexing & Clustering, Linguistic Knowledge Base, Intelligent Search Engine (with Translation, with Summarization), Document Warehouse, Gathering Module, Internet/Intranet]

Thai Agricultural Thesaurus

• The English vocabulary totals 27,531 terms

• Only 10,280 terms have been translated into Thai (scientific names excluded)

• Scientific names were not translated
– e.g. Oryza (genus) sativa (species) for rice, or family names

Problems in Hand-coded Thesaurus

• Scalability

• Reliability and Coherence

• Rigidity

• Cost

[Figure: fragments of hand-coded hierarchies. Foods: Bakery Product, Dietetic Foods, Frozen Foods, Fermented Foods. Processed Products: Canned Products, Dried Products, Frozen Products, Fermented Products. Other nodes: Alcoholic Beverage, Milk, Local Product, Products. Fermented Fish appears repeatedly, under different parents.]

Commercial Vegetables: The September index, at 107, was up 1.9 percent from last month but 3.6 percent below September 1998. Price increases for lettuce, tomatoes, broccoli, and celery more than offset price decreases for onions, carrots, and cucumbers.

[Figure: keywords found in the text above (tomatoes, broccoli, carrots, cucumbers) linked to thesaurus entries; relations follow the BT/NT/RT labels in the figure]

User Category
Commercial Vegetable: broccoli, carrot, tomato

Keyword Assigned
tomato, tomatoes

Expert Domain
VEGETABLES
  NT BROCCOLI (type=leaf vegetable, color=green)
  NT SWEET PEPPER (type=fruit vegetable, color=red, green, yellow)
  NT TOMATOES (type=fruit vegetable, color=red, yellow)
    NT CHERRY TOMATOES (type=fruit vegetable, color=red)
    RT LYCOPERSICON ESCULENTUM (type=taxonomic)
      BT SOLANACEAE
        NT CAPSICUM
        NT NICOTIANA

Other Major Problems (1)

• Accessing textual information
– Language variation: many ways to express the same idea
  Ex: "thinning flower" is expressed as "deblossoming"
      "thinning branch" is expressed as "pruning"
– How can the computer know that the words a person uses are related to the words found in stored text?
  Ex: user: thinning branch
      computer: pruning

Requirement (1)

• Accessing textual information
– Need intelligent browsing from related concept to related concept, rather than matching occurrences of stemmed character strings
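The difference between concept-based access and string matching can be sketched in a few lines. This is a minimal illustration only: the concept table, the concept IDs, and the function names are assumptions made for the example, not part of the system described in the slides.

```python
# Hypothetical concept table: each concept ID groups expressions naming the same idea.
CONCEPTS = {
    "C_PRUNING": {"pruning", "thinning branch"},
    "C_DEBLOSSOMING": {"deblossoming", "thinning flower"},
}

def expand_query(query: str) -> set[str]:
    """Return the query plus every expression that shares a concept with it."""
    normalized = query.strip().lower()
    expanded = {normalized}
    for expressions in CONCEPTS.values():
        if normalized in expressions:
            expanded |= expressions
    return expanded

def search(query: str, documents: list[str]) -> list[str]:
    """Return documents containing any expression related to the query."""
    terms = expand_query(query)
    return [doc for doc in documents if any(term in doc.lower() for term in terms)]

docs = ["Pruning improves mango yield.", "Irrigation schedules for rice."]
print(search("thinning branch", docs))
```

A plain substring or stem match for "thinning branch" would find nothing in these documents; going through the shared concept is what retrieves the sentence about pruning.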

Other Major Problems (2)

• Transforming unstructured information into structured information

Requirement (2)

• Need an application-based frame about product price
– Knowledge representation in table form
– Consisting of attributes and their values

Attributes   Values
Category     Paddy
Exporter     Thailand
Price        300
Unit         Dollars/Ton
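One way to picture such a frame is as a record with one slot per attribute. The sketch below is an assumption-laden illustration: the class name `PriceFrame` and the helper `missing_slots` are hypothetical, and the slides do not say how frames are actually stored.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class PriceFrame:
    """Product-price frame: one slot per attribute; None marks a value not yet filled."""
    category: Optional[str] = None
    exporter: Optional[str] = None
    price: Optional[float] = None
    unit: Optional[str] = None

    def missing_slots(self) -> list[str]:
        """Names of the slots that still have no value."""
        return [name for name, value in asdict(self).items() if value is None]

frame = PriceFrame(category="Paddy", exporter="Thailand", price=300, unit="Dollars/Ton")
for attribute, value in asdict(frame).items():
    print(f"{attribute.capitalize():<10} {value}")
```

Keeping empty slots explicit makes it easy to see which attributes an extraction step has not yet been able to fill.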

Problems in Translation: Pragmatic and Semantic

• The September All Farm Products Index was 97 percent of its 1990-92 base, down 1.0 percent from the August index and 2.0 percent below the September 1998 Index

Using Ontology

– "97 percent of its 1990-92 base": 0.97 * average price of the years 1990-1992
– "September": of which year?
– "August": year 1997
– "2.0 percent below the September 1998 Index": down 0.02 * price(September 1998)

“Year 1990-1992” meaning

Product   Year
A         1990   1991   1992
B         -      -      -
C         -      -      -
D         -      -      -
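Once the reference points are resolved, the relative expressions in the sentence reduce to simple arithmetic. The sketch below assumes the usual reading that "X percent below R" means current = R * (1 - X/100); the function name is hypothetical.

```python
def reference_value(current: float, percent_below: float) -> float:
    """Recover the reference value that `current` lies `percent_below` percent under."""
    return current / (1 - percent_below / 100)

september_index = 97.0  # percent of the 1990-92 base, i.e. base average = 100

august_index = reference_value(september_index, 1.0)
september_1998_index = reference_value(september_index, 2.0)

print(round(august_index, 2))          # 97.98
print(round(september_1998_index, 2))  # 98.98
```

A translation or summarization step needs this kind of grounding to state the August and September 1998 figures explicitly instead of copying the relative phrases.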

Requirement (3)

• The lexicon should carry semantic constraints between lexical entities, and restrictions on usage categories

Summary of Problems related to lexicon

• In terms of coverage
– Extensional coverage, i.e., the number of entries
– Intensional coverage, i.e., the number of information fields

• In terms of the semantic domain covered by the application
– Meaning interpretation with respect to objects, subject matter, topics of discourse, and pragmatic interpretation

• In terms of the user category, with reference to the intended system users
– Commercial products vs. plant products vs. family products

One Solution

• Encoding world knowledge in the structures attached to each lexical item, which requires both a language and tools

The Design of Lexicon: Requirement Specification

• Macrostructure: lexicon structure in terms of relations between lexical entries
– i.e., hierarchical taxonomies, which are characteristic of thesauri of semantically related word families

• Microstructure: types of information for each entry
– Pronunciation or phonemic transcription
– Syntactic properties
– Meaning
– Pragmatics of their use in real context and language

Microstructure (cont.)

• A lexical entity could contain slots/scripts for each specific domain; this needs an intelligent analyzer with language understanding that
– Supplies information extraction
– Supplies the missing values

Lexical Resource Management Language

• A language that is able to:
– Handle heterogeneity of linguistic knowledge structures.
– Handle exceptions and inconsistencies of natural languages.
– Provide an intuitive means to store and manipulate both linguistic and world knowledge.

Language Features

• The language is designed to enable:
– Support for heterogeneous structures.
– Sufficient provisions to handle exceptions and inconsistencies of natural languages (achieved through the +/- operators).
– Deduction of knowledge from rules.
– Detection and prevention of potential integrity violations.

Language and Tools Specification requirement

• Flexibility – almost any structures can be defined in this model.

• Extensibility – extending a structure is simple.

• Maturability – structure reformation and deformation are supported.

• Integrity – meta-relations help prevent malformed or semantically invalid data entries.

• Dealing with inconsistencies is feasible.

Some Syntactic Elements

• Knowledge manipulations are achieved through these primitives:
– def is used to define structures that do not already exist.
– redef changes aspects of existing structures.
– undef removes specified structures from the knowledge base.
– ret is used to retrieve structures from the knowledge base.

Examples

• Hierarchies: tree structures representing generalization semantics, or classes, of atoms.

thing
  animate
    human
    animal
  inanimate

A semantic tree represented by a hierarchy structure

Usage Examples

• Defining a hierarchy
– def thing(animate(human+animal)+inanimate).

• Adding the ‘plant’ and ‘vehicle’ concepts
– def animate(plant+vehicle).

• Reparenting the ‘vehicle’ concept
– redef animate(vehicle) inanimate(vehicle).

• Removing the ‘human’ concept
– undef human. (provided that there is only a single instance of ‘human’)
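The effect of these four statements can be mimicked with a small tree structure. This is only a sketch of the semantics as they read from the examples, not the language's implementation: the `Hierarchy` class and its method names are hypothetical, and removing descendants on `undefine` is an assumption.

```python
from typing import Optional

class Hierarchy:
    """Toy tree of concepts; each concept has at most one parent."""

    def __init__(self, root: str) -> None:
        self.parent: dict[str, Optional[str]] = {root: None}

    def define(self, parent: str, *children: str) -> None:
        """Like `def parent(a+b)`: add concepts that do not exist yet."""
        if parent not in self.parent:
            raise KeyError(f"unknown parent: {parent}")
        for child in children:
            if child in self.parent:
                raise ValueError(f"already defined: {child}")
            self.parent[child] = parent

    def redefine(self, concept: str, new_parent: str) -> None:
        """Like `redef old(concept) new(concept)`: move a concept under a new parent."""
        if concept not in self.parent or new_parent not in self.parent:
            raise KeyError("both concepts must already exist")
        self.parent[concept] = new_parent

    def undefine(self, concept: str) -> None:
        """Like `undef concept`: remove a concept (assumed: with its descendants)."""
        for child in self.children(concept):
            self.undefine(child)
        del self.parent[concept]

    def children(self, concept: str) -> list[str]:
        """Like `ret`: retrieve the direct children of a concept."""
        return [c for c, p in self.parent.items() if p == concept]

h = Hierarchy("thing")
h.define("thing", "animate", "inanimate")
h.define("animate", "human", "animal")
h.define("animate", "plant", "vehicle")   # def animate(plant+vehicle).
h.redefine("vehicle", "inanimate")        # redef animate(vehicle) inanimate(vehicle).
h.undefine("human")                       # undef human.
print(h.children("animate"))    # ['animal', 'plant']
print(h.children("inanimate"))  # ['vehicle']
```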

Usage Examples (2)

• Defining case frames for verbs
– First, we need to define meta-relations for words belonging to the sub-hierarchy ‘verb’.
– def meta case(verb, sub:thing).
– def meta case(verb, sub:thing, obj:thing).
– Then, we define case frames for several verbs.
– def case(eat, sub:human+animal, obj:food).
– def case(fly, sub:bird-penguin). (here, we emphasize the use of the +/- operators)
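A plausible reading of the +/- operators is "everything under the added concepts, except everything under the subtracted ones". The sketch below illustrates that reading; the mini-hierarchy, the function names, and the inheritance behaviour are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical mini-hierarchy, stored as child -> parent.
PARENT = {
    "animal": "thing", "human": "thing", "bird": "animal",
    "sparrow": "bird", "penguin": "bird", "emperor penguin": "penguin",
}

def covers(concept: str, ancestor: str) -> bool:
    """True if `concept` is `ancestor` itself or lies somewhere below it."""
    node: Optional[str] = concept
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT.get(node)
    return False

def allowed(concept: str, include: list[str], exclude: list[str]) -> bool:
    """Evaluate a slot filler such as bird-penguin: under an included concept, under no excluded one."""
    inside = any(covers(concept, c) for c in include)
    outside = not any(covers(concept, c) for c in exclude)
    return inside and outside

# def case(fly, sub:bird-penguin).
FLY_SUBJECT = {"include": ["bird"], "exclude": ["penguin"]}
# def case(eat, sub:human+animal, obj:food).
EAT_SUBJECT = {"include": ["human", "animal"], "exclude": []}

print(allowed("sparrow", **FLY_SUBJECT))          # True
print(allowed("emperor penguin", **FLY_SUBJECT))  # False
print(allowed("sparrow", **EAT_SUBJECT))          # True
```

Writing the exception as a subtraction keeps the general rule ("birds fly") intact while still recording that it does not hold for one sub-branch.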

Hierarchy & Set

[Figure: a hierarchy rooted at c1, containing c2 and the entries w1-w7 and p1, shown alongside a set c3 with members f1-f4]

Defining a Hierarchy

[Figure: hierarchy rooted at c1, containing c2, w1-w7, and p1]

def c1(“w1”(“w3”)+c2(“w4”)+“w2”).

def “w5”+“w6” under “w4”.

def “p1”(“w7”) under “w2”.

Manipulating the Hierarchy

[Figure: the same hierarchy rooted at c1, before and after manipulation]

redef “w4” under “w2”.

undef “w1”.

Defining a Set

[Figure: set c3 with members f1, f2, f3, f4]

def c3{[f1]+[f2]+[f3]}.

def [f4] in c3.

Defining a Relation

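[Figure: relation r1 between hierarchy node c2 (with w4, w5, w6 below it) and set c3 (members f1-f4); w1 lies outside c2. Labels in the figure: r1’ (template), r1 (relation), inherited]

def meta r1(c2, c3).
Template defined.

def r1(“w4”, [f1]).
Relation defined.

def r1(“w1”, [f3]).
Constraint violated. Definition not allowed.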
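The integrity check in this example can be sketched as follows: a relation instance is accepted only if its first argument lies under the hierarchy node named in the template and its second argument belongs to the named set. The data layout and function names are hypothetical; the hierarchy mirrors the one defined on the earlier slides.

```python
from typing import Optional

# Hierarchy from the earlier slides, stored as child -> parent.
PARENT = {
    "w1": "c1", "c2": "c1", "w2": "c1", "w3": "w1", "w4": "c2",
    "w5": "w4", "w6": "w4", "p1": "w2", "w7": "p1",
}
SETS = {"c3": {"f1", "f2", "f3", "f4"}}
TEMPLATES: dict[str, tuple[str, str]] = {}
RELATIONS: set[tuple[str, str, str]] = set()

def is_under(node: Optional[str], ancestor: str) -> bool:
    """True if `node` is `ancestor` or one of its descendants."""
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT.get(node)
    return False

def def_meta(name: str, hierarchy_node: str, set_name: str) -> str:
    """Register a template (meta-relation) constraining later definitions."""
    TEMPLATES[name] = (hierarchy_node, set_name)
    return "Template defined."

def def_relation(name: str, entry: str, member: str) -> str:
    """Accept the relation only if it satisfies its template."""
    hierarchy_node, set_name = TEMPLATES[name]
    if not is_under(entry, hierarchy_node) or member not in SETS[set_name]:
        return "Constraint violated. Definition not allowed."
    RELATIONS.add((name, entry, member))
    return "Relation defined."

print(def_meta("r1", "c2", "c3"))
print(def_relation("r1", "w4", "f1"))  # w4 is under c2, so the template applies
print(def_relation("r1", "w1", "f3"))  # w1 is not under c2
```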

Synset & Surrogates

• A synset is an unnamed set identified by its unique ID.

• Members of a synset are considered synonymous, with different degrees of synonymy.

• A distance graph is automatically constructed within a synset, with surrogates acting as representatives of synset members.

• Entities with identical features are attached to the same surrogate.

Synset & Surrogates

[Figure: synset#1, in which surrogates s1-s5 stand for the members w1, w2, w3, w4, w6, p1, p2, p3 according to their features f1-f4; the surrogate network is constructed internally]

Synset & Multilingual Lexicon

• Synset members are not confined to a single language; entities from different languages may belong to the same synset.

• The distance matrix is computed from the number of differing features over each pair of surrogates.

• Traversing from a word to its nearest words is handled by the system, so words with potentially the closest semantics can be determined.
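The surrogate and distance ideas can be illustrated with feature sets. In this sketch, "number of differing features" is interpreted as the size of the symmetric difference between two feature sets, which is an assumption; the member names and features are invented for the example.

```python
from itertools import combinations

# Hypothetical synset members and their features.
MEMBERS = {
    "w1": {"f1", "f2"},
    "w2": {"f1", "f2"},
    "w3": {"f1", "f3"},
    "p1": {"f3", "f4"},
}

def build_surrogates(members: dict[str, set[str]]) -> dict[frozenset, list[str]]:
    """Attach members with identical feature sets to the same surrogate."""
    surrogates: dict[frozenset, list[str]] = {}
    for name, features in members.items():
        surrogates.setdefault(frozenset(features), []).append(name)
    return surrogates

def distance_matrix(surrogates: dict[frozenset, list[str]]) -> dict[tuple, int]:
    """Distance per surrogate pair: features present in exactly one of the two."""
    return {(a, b): len(a ^ b) for a, b in combinations(surrogates, 2)}

def nearest_words(word: str, members: dict[str, set[str]]) -> list[str]:
    """Members attached to the surrogate(s) closest to the word's own surrogate."""
    surrogates = build_surrogates(members)
    own = frozenset(members[word])
    others = [s for s in surrogates if s != own]
    best = min(len(own ^ s) for s in others)
    return sorted(w for s in others if len(own ^ s) == best for w in surrogates[s])

print(len(build_surrogates(MEMBERS)))  # 3 surrogates for 4 members
print(nearest_words("w1", MEMBERS))    # ['w3']
```

Because distances are computed between surrogates rather than between individual words, members from different languages that share a feature set cost nothing extra to compare.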

Expected Result

[Figure sequence: keywords are generated step by step from the assigned keyword "tomatoes" by traversing the Expert Domain thesaurus (VEGETABLES, BROCCOLI, SWEET PEPPER, TOMATOES, CHERRY TOMATOES, LYCOPERSICON ESCULENTUM, SOLANACEAE, CAPSICUM, NICOTIANA) along its BT, NT, and RT links]

Keyword Generated for “Fruit vegetable”, red:
– Sweet pepper
– Tomatoes
– Cherry Tomatoes

Keyword Generated for “Plant in same family”:
– Capsicum
– Nicotiana

Keyword Generated for the assigned keyword "tomato" (User Category: Commercial Vegetable, with broccoli, carrot, tomato):
– tomato
– Tomato
– Tomatoes
– Cherry Tomatoes
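The two kinds of keyword generation shown here, by attribute match and by thesaurus traversal, can be sketched as below. The entries and links are taken from the figure as far as it can be read; the dictionary layout and function names are assumptions for illustration.

```python
# Thesaurus entries with attributes, as read from the figure.
ENTRIES = {
    "BROCCOLI": {"type": "leaf vegetable", "color": {"green"}},
    "SWEET PEPPER": {"type": "fruit vegetable", "color": {"red", "green", "yellow"}},
    "TOMATOES": {"type": "fruit vegetable", "color": {"red", "yellow"}},
    "CHERRY TOMATOES": {"type": "fruit vegetable", "color": {"red"}},
}
# BT (broader term) links: term -> broader term.
BT = {
    "LYCOPERSICON ESCULENTUM": "SOLANACEAE",
    "CAPSICUM": "SOLANACEAE",
    "NICOTIANA": "SOLANACEAE",
}
# RT (related term) links from a vegetable to its taxonomic entry.
RT = {"TOMATOES": "LYCOPERSICON ESCULENTUM"}

def by_attributes(entry_type: str, color: str) -> list[str]:
    """Keywords whose entry has the given type and includes the given color."""
    return [name for name, entry in ENTRIES.items()
            if entry["type"] == entry_type and color in entry["color"]]

def same_family(term: str) -> list[str]:
    """Follow RT to the taxon, BT to the family, then list the family's other members."""
    taxon = RT[term]
    family = BT[taxon]
    return sorted(t for t, broader in BT.items() if broader == family and t != taxon)

print(by_attributes("fruit vegetable", "red"))
# ['SWEET PEPPER', 'TOMATOES', 'CHERRY TOMATOES']
print(same_family("TOMATOES"))
# ['CAPSICUM', 'NICOTIANA']
```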

Conclusion and Next steps

• This is a preliminary introduction to the language, showing a few of its many possibilities.

• Structures not described in detail here have not yet been firmly specified. These structures are rules, maps, and contexts, which are incorporated to extend the language's potential for handling deductions, multilingual operations, domain-dependent retrievals, etc.

Next Steps

• Revise the idea

• Continue the implementation
– Aligner tool
– GUI tools for thesaurus maintenance

• Short-term: solutions to language variability problems by exploiting available knowledge sources with available techniques

• Long-range: the approach needs high-quality language understanding, i.e., automatic thesaurus construction
– System of Agricultural Information Summarization and Translation

Thank you