Page 1: Word Association Thesaurus as a Resource for Extending Semantic Networks Anna Sinopalnikova 1, 2, Pavel Smrz 1 1 Faculty of Informatics, Masaryk University.

Word Association Thesaurus as a Resource for Extending Semantic Networks

Anna Sinopalnikova (1, 2), Pavel Smrz (1)

(1) Faculty of Informatics, Masaryk University, Botanicka 68a, 602 00 Brno, Czech Republic

(2) Saint-Petersburg State University, Universitetskaya 11, Saint-Petersburg, Russia

{anna, smrz}@fi.muni.cz

Page 2:

Overview
- Motivation
- Word Association and other notions of psycholinguistics
- WAT vs. Corpus
- Semantic Information from WAT: core concepts, semantic primitives, syntagmatic and paradigmatic relations, domain information

Page 3:

Types of Semantic Resources used in NLP

Corpora:
1. These are primary resources, presenting (more or less) ‘raw’ data on the language in use.
2. Information is given implicitly.
3. Need special extraction procedures and tools.

Dictionaries, thesauri, ontologies, taxonomies:
1. These are ‘derived’ resources, presenting explications of some internal knowledge. They are based on primary resources + researcher’s intuition.
2. Information is given explicitly.

Page 4:

Motivation
- There is still a need for an empirical basis for semantic network construction.
- Semantic Web initiatives.
- WATs are available for many languages.
- Nobody knows what they are good for or how to use them.

Page 5:

Word Association and other notions of psycholinguistics
- Word Association
- Word Association Test
- Word Association Norms
- Word Association Thesaurus

Page 6:

Example

Needle stimulates -> thread: 41, pin: 13, sharp: 6, sew: 5, cotton: 2, dressmaker: 1, fix: 1, prick: 1, sewing: 1, sow: 1, spring: 1, stitch: 1, etc.
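An entry of this kind can be read as a weighted list of responses. The Python sketch below parses the listed counts and turns them into relative frequencies. The function name `parse_entry` and the input string format are illustrative assumptions, and because the entry is truncated with "etc.", the shares are relative to the listed responses only.

```python
def parse_entry(text):
    """Parse 'response: count, response: count' into a dict of counts."""
    counts = {}
    for item in text.split(","):
        response, count = item.split(":")
        counts[response.strip()] = int(count)
    return counts

# Responses to the stimulus "needle" as listed on the slide
# (the original entry continues beyond these items).
NEEDLE = ("thread: 41, pin: 13, sharp: 6, sew: 5, cotton: 2, dressmaker: 1, "
          "fix: 1, prick: 1, sewing: 1, sow: 1, spring: 1, stitch: 1")

counts = parse_entry(NEEDLE)
total = sum(counts.values())  # total over the listed responses only
shares = {response: n / total for response, n in counts.items()}
print(total, round(shares["thread"], 3))
```

Such relative shares are one possible way to turn raw response counts into edge weights of a semantic network.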

Page 7:

WATs explored
- RAT - Russian WAT by Karaulov et al. (1994-1998): 8000 stimuli, 23000 words covered, 1000 subjects
- EAT - Edinburgh WAT by Kiss et al. (1972): 8400 stimuli, 54000 words covered, 1000 subjects
- Czech WAN (Novak et al., 1996): 150 stimuli, 4000 words covered, 250 subjects

Experience gained in projects:
- RussNet (a wordnet-like database for Russian linking lexical semantics with derivational morphology)
- Czech part of the BalkaNet project (multilingual wordnet-like network for 5 Balkan languages and Czech)

Page 8:

WAT vs. Corpus

History: Church & Hanks, 1990; Wettler & Rapp, 1993; Willners, 2001

Corpora:
- Bokrjonok 3.0 - balanced corpus for Russian (16 mln words)
- BNC - British National Corpus (112 mln)
- CNC - Czech National Corpus (160 mln) and its unbalanced version (630 mln words)

Research procedure: 5000 pairs, e.g. cheese – mouse, dark – alley, were extracted from each WAN in random order and then searched for in the corpora. The window span was fixed at -10; +10 words.
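The window search can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the function name `cooccur`, the toy corpus and the exact-token matching (no lemmatization) are all hypothetical simplifications.

```python
def cooccur(tokens, stimulus, response, span=10):
    """Return True if `response` occurs within `span` tokens of `stimulus`."""
    for i, token in enumerate(tokens):
        if token == stimulus:
            # Up to `span` tokens to the left and to the right of the stimulus.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            if response in window:
                return True
    return False

# Toy corpus and pairs, made up for illustration only.
corpus = "the mouse ran off with the cheese while the cat slept in a dark alley".split()
pairs = [("cheese", "mouse"), ("dark", "alley"), ("needle", "thread")]

# Pairs that are never found inside the window count as missing from the corpus.
missing = [pair for pair in pairs if not cooccur(corpus, *pair)]
print(missing)
```

With a real corpus the tokens would normally be lemmatized first, so that inflected forms of stimulus and response still match.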


Page 9:

WAN vs. Corpus: Russian

Quantitative analysis (Sinopalnikova, 2004):
- 64% of word associations do not occur in the corpus,
- 49% when unique associations (those with absolute frequency = 1) are excluded.

Qualitative analysis:
- a high proportion of the absent associations are syntagmatic,
- for verbs this number was up to 84%.
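The two quantitative figures differ only in whether unique associations are counted. A small Python sketch of that computation follows. The record layout, the function name `missing_rate` and all toy records are invented for illustration and do not reproduce the reported 64% and 49%.

```python
def missing_rate(records, min_freq=1):
    """Share of associations with frequency >= min_freq that were not found in the corpus."""
    kept = [r for r in records if r[1] >= min_freq]
    return sum(1 for r in kept if not r[2]) / len(kept)

# (pair, absolute frequency in the WAT, found in corpus?) -- invented toy records.
records = [
    (("a", "b"), 12, True),
    (("a", "c"), 1, False),
    (("d", "e"), 3, False),
    (("d", "f"), 1, False),
    (("g", "h"), 7, True),
]

all_rate = missing_rate(records)               # every association counted
no_unique = missing_rate(records, min_freq=2)  # unique associations excluded
print(all_rate, round(no_unique, 2))
```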

Page 10:

WAN vs. Corpus: Russian (2)

Relation               %       Relation            %
PARADIGMATIC:          21.4    SYNTAGMATIC:        48.7
  antonymy             1.5       Adj+N             7.9
  cause                1.6       N+Adj             4.8
  co-hyponymy          4.9       V+Adv             9.1
  has_subevent         0.8       V+N (agent)       3.5
  hyponymy             2.5       V+N (instrument)  1.4
  is_subevent          2.9       V+N (location)    1.5
  meronymy             0.5       V+N (object)      8.3
  synonymy             2.9       V+N (patient)     9.8
  xpos_near_synonymy   3.6       V+V               1.1
  others               0.2       others            1.3
DOMAIN                 13.4
OTHER                  16.5
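As a quick arithmetic check of the table above, the sub-relation percentages should add up to their category totals, and the four categories to 100%. The short Python sketch below verifies this. The variable names are illustrative and the values are written with decimal points.

```python
# Sub-relation percentages as given in the table, in row order.
paradigmatic = [1.5, 1.6, 4.9, 0.8, 2.5, 2.9, 0.5, 2.9, 3.6, 0.2]
syntagmatic = [7.9, 4.8, 9.1, 3.5, 1.4, 1.5, 8.3, 9.8, 1.1, 1.3]
domain, other = 13.4, 16.5

# Round to one decimal place to absorb floating-point noise.
para_total = round(sum(paradigmatic), 1)
synt_total = round(sum(syntagmatic), 1)
grand_total = round(para_total + synt_total + domain + other, 1)
print(para_total, synt_total, grand_total)
```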

Page 11:

WAN vs. Corpus: English

Quantitative analysis:
- 31% of word associations do not occur in the BNC

Qualitative analysis:
- PARADIGMATIC 57.1
- SYNTAGMATIC 8.4
- DOMAIN 21.7
- OTHER 12.8

Page 12:

WAN vs. Corpus: English (2)

- acquiring synonymy and hyponymy, e.g. sex – fornicate (archaic or humorous), ire (poetic) – anger, cowardly – yellow (slang)
- acquiring information about low-frequency words, e.g. perambulate (N_BNC = 3), fornicate (N_BNC = 6); cf. EAT: perambulate – walk: 30, pram: 17, baby: 9, push: 8, about: 1, dawdle: 1, move: 1, promenade: 1, slowly: 1, stroll: 1, through: 1, wander: 1, etc.
- acquiring domain relations; the portion of them absent from the corpus was surprisingly large for a corpus such as the BNC, e.g. ink-pot – pen: 24, non-violence – peace: 29, offside – soccer: 2

Page 13:

WAN vs. Corpus: Czech

Quantitative analysis:
- 514 associations missing (10.28%)

Qualitative analysis:
- the proportion of syntagmatic and paradigmatic ones among them was similar to that for English

Page 14:

Extracting semantic information from WAT

Associations:
- by form – 10% (e.g. know – no, yellow – mellow)
- by meaning – 90% (e.g. needle – sew, yellow – sun)

Information to extract:
- core concepts,
- semantic primitives,
- syntagmatic and paradigmatic relations,
- domain information

Page 15:

Core concepts

In a WAT one can observe words that have an above-average number of direct links to other words.
- Russian: человек (person), мир (world; peace), дом (house), жизнь (life), есть (to eat; there is), думать (to think), жить (to live), идти (to go), большой (big), хорошо (well), плохо (badly), нет (не) (no (not)), новый (new), дерево (tree) etc. (295 words with more than 100 relations);
- English: man, sex, no (not), love, house; work, eat, think, go, live; good, old, small etc. (586 words with more than 100 relations);
- Czech: člověk (person), dům (house), strom (tree); jíst (to eat), jít (to go), myslet (to think); moc (much; very), starý (old), velký (big), bílý (white), hezký (pretty) etc.

These words determine the fundamental concepts of a particular language system, and thus should be incorporated into an ontology as its core components (e.g., SUMO upper concepts or EWN Base Concepts).
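One way to find such highly connected words is to treat the WAT as a graph and count each word's distinct neighbours. The Python sketch below does this. It is a minimal illustration: counting links in both directions (as stimulus and as response), the function name `core_words` and the toy fragment with its counts are assumptions, not the authors' procedure or data.

```python
from collections import defaultdict

def core_words(wat, min_links=100):
    """Return words whose number of distinct direct neighbours exceeds min_links."""
    neighbours = defaultdict(set)
    for stimulus, responses in wat.items():
        for response in responses:
            if response != stimulus:
                # Assumption: a link counts for both the stimulus and the response.
                neighbours[stimulus].add(response)
                neighbours[response].add(stimulus)
    return sorted(w for w, linked in neighbours.items() if len(linked) > min_links)

# Tiny invented WAT fragment: stimulus -> {response: count}.
toy_wat = {
    "needle": {"thread": 41, "pin": 13, "sew": 5},
    "thread": {"needle": 30, "cotton": 12},
    "sew": {"needle": 20, "thread": 9},
}
print(core_words(toy_wat, min_links=2))
```

On a full thesaurus the default threshold of 100 corresponds to the "more than 100 relations" criterion mentioned above.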

Page 16:

Semantic primitives

A WAT can also provide a list of basic concepts associated with each individual word, thus revealing the semantics of a word (situation) as a list of semantic constituents: separate pieces of information.

Abstract words (verbs, adjectives or nouns denoting complex situations or emotional states) are difficult to decompose by means of logic and intuition.

E.g. Depression could be reduced to:
- its constituents: sad 7, low 5, black 4, manic 4, sadness 3, bored 3, misery 2, tiredness 2, despair 1, gloom 1, grey 1, hopelessness 1, monotony 1, sick 1, mood 1, nerves 1, etc.,
- its probable causes: rain 3, guilt 1, pain 1, unemployment 1,
- its probable effects: suicide 1,
- its antipodes: elation 3, fun 1, happiness 1 etc.

Page 17:

Syntagmatic and paradigmatic relations

“Linguistic substitutes for reality”: word associations reflect the order of events in reality, the way objects are organized in space, and the way human beings experience them.

Associations by contiguity, e.g. cry – baby, may be treated as a manifestation of the syntagmatic relation between a verb and its subject, while take – hand may be treated as a ROLE_INSTRUMENT relation.

Generalization! E.g. drink – water, beer, milk, ale, Coca-cola, coffee, juice, etc. found in WAT should be generalized as a drink ROLE_OBJECT beverage relation and incorporated in the semantic network in that form.
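The generalization step can be sketched as replacing many word-level links by a single link to a shared hypernym. In the Python sketch below, the `HYPERNYM` lookup, the function name `generalize` and the `min_share` threshold are hypothetical. A real system would presumably take hypernyms from a wordnet rather than a hand-written table.

```python
from collections import Counter

# Hypothetical hypernym lookup, hand-written for this illustration.
HYPERNYM = {
    "water": "beverage", "beer": "beverage", "milk": "beverage",
    "ale": "beverage", "Coca-cola": "beverage", "coffee": "beverage",
    "juice": "beverage",
}

def generalize(stimulus, responses, relation, min_share=0.5):
    """Replace many word-level links by one link to their dominant hypernym."""
    classes = Counter(HYPERNYM[r] for r in responses if r in HYPERNYM)
    if not classes:
        return None
    hypernym, n = classes.most_common(1)[0]
    # Generalize only if the hypernym covers a large enough share of responses.
    if n / len(responses) < min_share:
        return None
    return (stimulus, relation, hypernym)

responses = ["water", "beer", "milk", "ale", "Coca-cola", "coffee", "juice"]
print(generalize("drink", responses, "ROLE_OBJECT"))
```

The threshold keeps a few stray responses from blocking the generalization while still refusing it when no single class dominates.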

Page 18:

Syntagmatic and paradigmatic relations (2)

The law of contiguity cannot explain all associations.

Law of similarity, e.g. inanimate – dead: 39 (SYNONYMY), seek – find: 56 (CAUSE relation), buy – sell: 56 (CONVERSIVE relation).

One of the main benefits of WAT: paradigmatic relations are given explicitly, as opposed to other sources of empirical data (e.g. text corpora).

Page 19:

Domain information

WATs explicitly present the way common words are grouped together according to the fragments of reality they describe, e.g. hospital –> nurse, doctor, pain, ill, injury, load…

Types of domain relations:
- name of domain (situation) – domain member, e.g. hospital – nurse: 8, finance – money: 61, football – player: 4, marriage – husband: 2;
- participant – participant, e.g. pepper – salt: 58, tamer – lion: 69, needle – thread: 41, mouse – cat: 22;
- participant – circumstance, e.g. umbrella – rain: 58, actor – stage: 23;
- participant – pointer to its function/role in the situation, e.g. larder – food: 58, envelope – letter: 60, actor – play: 15 etc.

Open question: should the types of domain relations be differentiated within the semantic network, or included as a uniform IS_ASSOCIATED_TO relation?
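Both options can be kept open by storing the subtype on each link and collapsing it only when a uniform relation is wanted. The Python sketch below illustrates this. The class `DomainLink`, the subtype labels and the `IS_ASSOCIATED_TO/<subtype>` naming are hypothetical choices. Only the word pairs and counts come from the examples above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DomainLink:
    source: str
    target: str
    subtype: str  # hypothetical label, e.g. "domain-member"
    count: int    # number of subjects who gave this response

    def as_triple(self, typed=True):
        """Render the link with its subtype, or as a uniform IS_ASSOCIATED_TO relation."""
        relation = f"IS_ASSOCIATED_TO/{self.subtype}" if typed else "IS_ASSOCIATED_TO"
        return (self.source, relation, self.target)

links = [
    DomainLink("hospital", "nurse", "domain-member", 8),
    DomainLink("pepper", "salt", "participant-participant", 58),
    DomainLink("umbrella", "rain", "participant-circumstance", 58),
    DomainLink("envelope", "letter", "participant-function", 60),
]

uniform = [link.as_triple(typed=False) for link in links]
typed = [link.as_triple() for link in links]
print(uniform[0], typed[1])
```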

Page 20:

Conclusions

Advantages of using WAT in constructing a semantic network:
- Simplicity of data acquisition.
- Broad variety of semantic information to acquire.
- Empirical nature of the data extracted (as opposed to theoretical data, cf. conventional ontologies, taxonomies or classification schemes, which involve the researcher’s introspection and intuition and hence lead to over- and under-estimation of the phenomena under consideration).
- Probabilistic nature of the data presented (the data reflect the relative rather than absolute relevance of semantic relations in each particular case).

Page 21:

Thank you...

