Distributional Semantics
Lecture "Computerlinguistische Techniken" (Computational Linguistics Techniques)
Alexander Koller
January 12, 2016
World Knowledge and Word Knowledge
• Semantic inferences require formalized knowledge about the world and about word meanings.
Which genetically caused connective tissue disorder has severe symptoms and complications regarding the aorta and skeletal features, and, very characteristically, ophthalmologic subluxation?
Marfan's is caused by a defect in the gene that determines the structure of fibrillin-1. One of the symptoms is displacement of one or both of the eyes' lenses. The most serious complications affect the cardiovascular system, especially the heart valves and the aorta.
The Knowledge Bottleneck
• The importance of formalized knowledge for CL applications has been accepted for decades. ‣ e.g. Bar-Hillel 1960: how to translate "the box is in the pen"?
• Broad formalization is impractical. ‣ even so, e.g. Cyc: several million facts
‣ world knowledge is extremely extensive
‣ is predicate logic even a suitable formalism?
• Current perspective: formalize lexical knowledge, by hand or automatically.
Query Expansion
searched for this
found that
Lexical Semantics
He's not pining! He's passed on! This parrot is no more! He has ceased to be! He's expired and gone to meet his maker! He's a stiff! Bereft of life, he rests in peace! His metabolic processes are now history! He's off the twig! He's kicked the bucket, he's shuffled off his mortal coil, run down the curtain and joined the bleedin' choir invisible!! THIS IS AN EX-PARROT!!
Relations between the meanings of words: e.g. synonymy
Semantic Relations
• Lexical semantics describes possible semantic relations between words: ‣ Synonymy: the words mean the same thing.
Apfelsine/Orange; Bildschirm/Monitor; etc.
‣ Hyponymy: one word is a more general term for the other. Auto/Fahrzeug (car/vehicle); Blume/Pflanze (flower/plant); etc.
‣ Antonymy: the words describe opposites. gewinnen/verlieren (win/lose); heiß/kalt (hot/cold); etc.
WordNet

entity
  physical object
    artifact
      structure
        building complex
          plant#1, works, industrial plant
    living thing
      organism
        plant#2, flora, plant life

Edge = hyponymy; same node = synonymy. http://wordnet.princeton.edu/
Lexical Ambiguities
• Polysemy: a word has two different meanings that are related to each other. ‣ Schule #1: institution in which pupils learn
‣ Schule #2: building in which Schule #1 operates
• Homonymy: a word has two different meanings that are not related. ‣ Bank #1: financial institution
‣ Bank #2: bench for sitting on
Word sense disambiguation
• Word sense disambiguation is the problem of tagging each word token with its word sense.
• The accuracy of WSD depends on the sense inventory. State of the art: 90% on coarse-grained senses.
• Typical techniques perform supervised training on smaller data sets and extend the model with unsupervised methods.
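A minimal illustration of the WSD task itself is the simplified Lesk heuristic: pick the sense whose dictionary gloss overlaps most with the context. This is only a sketch of the problem setting, not of the supervised systems mentioned above; the sense ids and glosses below are invented toy data.

```python
# Simplified Lesk: choose the sense whose gloss shares the most
# words with the context of the ambiguous token.
# Toy glosses for the two homonymous senses of German "Bank".
SENSES = {
    "Bank#1": "financial institution that accepts deposits and makes loans",
    "Bank#2": "long seat for several people in a park",
}

def lesk(context_words, senses=SENSES):
    """Return the sense id whose gloss has the largest word overlap
    with the context."""
    context = {w.lower() for w in context_words}
    return max(senses, key=lambda s: len(context & set(senses[s].split())))

print(lesk("she sat on the Bank in the park".split()))           # Bank#2 (shares "in", "park")
print(lesk("the Bank approved the loans and deposits".split()))  # Bank#1
```

Real systems replace the raw gloss overlap with learned features, which is where the supervised training comes in.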
Problem
• Hand-built thesauri are far too small. ‣ English WordNet: 117,000 synsets
‣ GermaNet: 85,000 synsets
• The number of English words in the English Google n-gram corpus is > 1 million.
• So this does not solve the query-expansion problem.
• Can we learn semantic relations automatically?
Experiment 1
Doc 1: Guantanamo USA Verschlüsselung Yahoo Enthüllungen rechtsstaatliche → spiegel.de on PRISM
Doc 2: Raumschiff Macht Imperator Todesstern Vater → Wikipedia on "Star Wars"
Doc 3: kontext-freie Algorithmus dynamische Tabelle Chomsky-Normalform → Wikipedia on the CKY parser
Doc 4: Erntebemühungen Anbaufläche Sie Gurken Pflänzchen Zentimeter → www.gartenbau.org
(after slides by Katrin Erk)
Experiment 2
• What is "bardiwac"? In the corpus you find: ‣ He handed her a glass of bardiwac.
‣ Nigel staggered to his feet, face flushed from too much bardiwac.
‣ Malbec, one of the lesser-known bardiwac grapes, responds well to Australia's sunshine.
‣ The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.
→ Bardiwac is a red wine.
(Stefan Evert, tutorial at NAACL 2010)
Distributional Semantics
• An approach for learning the semantic similarity of words from unannotated data. ‣ similarity as an approximation of synonymy
‣ the lexicon can automatically grow arbitrarily large
• Meaning of a word ≈ distribution of the other words that occur together with it.
• Basic idea from the 1950s (Harris 1951): "You shall know a word by the company it keeps." (the quote is from Firth)
Co-occurrence
• What does it mean for two words to "occur together"?
• Simplest approach: count in a corpus how often word w1 occurs in a k-word window around word w2.
see who can grow the biggest flower. Can we buy some fibre, please
Abu Dhabi grow like a hot-house flower, but decided themselves to follow the
as a physical level. The Bach Flower Remedies are prepared from non-poisonous wild
a seed from which a strong tree will grow. This is the finest
(k = 6, British National Corpus)
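The k-word-window count can be sketched in a few lines of Python. This assumes a symmetric window of k words on each side of the target, which the slide leaves open, and uses a one-sentence toy corpus:

```python
from collections import Counter

def cooccurrences(tokens, k=6):
    """Count how often each word w1 occurs within a k-word window
    around each occurrence of each word w2."""
    counts = Counter()
    for i, w2 in enumerate(tokens):
        # k tokens to the left and k tokens to the right of position i
        window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        for w1 in window:
            counts[(w2, w1)] += 1
    return counts

corpus = "see who can grow the biggest flower".split()
counts = cooccurrences(corpus, k=6)
print(counts[("flower", "grow")])  # 1: "grow" lies within 6 words of "flower"
```

Run over a large corpus, these counts fill exactly the kind of matrix shown on the next slide.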
Co-occurrence

             factory  flower  tree  plant  water  fork
grow              15     147   330    517    106     3
garden             5     200   198    316    118    17
worker           279       0     5     84     18     0
production       102       6     9    130     28     0
wild               3     216    35     96     30     0

Figure 108.4: Some co-occurrence vectors from the British National Corpus.

[Figure 108.5: Graphical illustration of co-occurrence vectors (factory, flower, tree, plant).]

through counts of context words occurring in the neighborhood of target word instances. Take, as in the WSD example above, the n (e.g., 2000) most frequent content words in a corpus as the set of relevant context words; then count, for each word w, how often each of these context words occurred in a context window of n before or after each occurrence of w. Fig. 108.4 shows the co-occurrence counts for a number of target words (columns), and a selection of context words (rows) obtained from a 10% portion of the British National Corpus (Clear 1993).

The resulting frequency pattern encodes information about the meaning of w. According to the Distributional Hypothesis, we can model the semantic similarity between two words by computing the similarity between their co-occurrences with the context words. In the example of Fig. 108.4, the target flower co-occurs frequently with the context words grow and garden, and infrequently with production and worker. The target word tree has a similar distribution, but the target factory shows the opposite co-occurrence pattern with these four context words. This is evidence that trees and flowers are more similar to each other than to factories.

Technically, we represent each word w as a vector in a high-dimensional…
Co-occurrence matrix for the BNC, from Koller & Pinkal 2012 (rows: context words)
Vector Space Model
Vectors in a high-dimensional vector space: one dimension per context word (here: 6 dimensions).
The picture is simplified to 2 dimensions and is only schematic.
Similarity
• From the vector space model we can now derive the similarity between words.
• 1st attempt: similar = the Euclidean distance is small
dist(\vec{v}, \vec{w}) = \sqrt{\sum_{i=1}^{n} (v_i - w_i)^2}

→ not particularly useful
Cosine Similarity
• 2nd attempt: similar = the angle is small. ‣ ignores the length of the vectors = absolute word frequencies
(that is good)
‣ context words occur proportionally about equally often
• Easy to compute is the cosine of the angle: ‣ cos = 1 means angle = 0°, i.e. very similar
‣ cos = 0 means angle = 90°, i.e. very dissimilar
\cos(\vec{v}, \vec{w}) = \frac{\sum_{i=1}^{n} v_i \cdot w_i}{\sqrt{\sum_{i=1}^{n} v_i^2} \cdot \sqrt{\sum_{i=1}^{n} w_i^2}}
cos(tree, flower) = 0.75, i.e. 40° cos(tree, factory) = 0.05, i.e. 85°
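These values can be checked directly against the counts in the BNC table, reading each target word column-wise as a vector over the five context words grow, garden, worker, production, wild. Small differences against the slide's figures are expected, since the slide presumably uses more context dimensions:

```python
from math import sqrt

# Column vectors from the BNC co-occurrence table
# (dimensions: grow, garden, worker, production, wild).
flower  = [147, 200, 0, 6, 216]
tree    = [330, 198, 5, 9, 35]
factory = [15, 5, 279, 102, 3]

def cosine(v, w):
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (sqrt(sum(vi * vi for vi in v)) * sqrt(sum(wi * wi for wi in w)))

print(round(cosine(tree, flower), 2))   # 0.75 -> angle of about 41 degrees
print(round(cosine(tree, factory), 2))  # 0.07 -> nearly orthogonal
```

As expected, tree is much closer to flower than to factory.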
What Have We Achieved?
• A measure of semantic similarity ‣ compute the co-occurrence matrix for all word pairs from unannotated
text.
‣ on this basis, a similarity measure, e.g. cosine.
‣ easy to compute for arbitrarily large amounts of text.
• Possible extensions: ‣ more complex features and feature weights
‣ dimensionality reduction
‣ compositionality
Uninformative Dimensions
• Not all context words are equally informative. ‣ co-occurrence with "grow" vs. with "the"
• Simplest approach: list certain frequent words by hand and ignore them when computing similarity. ‣ in information retrieval, such words are called
"stop words".
• More generally: learn the weighting of dimensions automatically.
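One common automatic weighting is positive pointwise mutual information (PPMI), which down-weights dimensions for frequent, uninformative context words like "the". This is a sketch of that general idea, not of a specific scheme endorsed by the slide; the counts are invented:

```python
from math import log2

def ppmi(counts):
    """Positive PMI reweighting of a co-occurrence matrix given as
    {(target, context): count}.  Frequent context words such as "the"
    get low weights because their high marginal count cancels out."""
    total = sum(counts.values())
    t_marg, c_marg = {}, {}
    for (t, c), n in counts.items():
        t_marg[t] = t_marg.get(t, 0) + n
        c_marg[c] = c_marg.get(c, 0) + n
    weights = {}
    for (t, c), n in counts.items():
        pmi = log2((n / total) / ((t_marg[t] / total) * (c_marg[c] / total)))
        weights[(t, c)] = max(0.0, pmi)   # clip negative PMI to zero
    return weights

counts = {("flower", "grow"): 147, ("flower", "the"): 1000,
          ("factory", "grow"): 15, ("factory", "the"): 1000}
w = ppmi(counts)
print(w[("flower", "grow")] > w[("flower", "the")])  # True: "the" is uninformative
```

Here "the" co-occurs with everything, so its PMI with any particular target is low even though its raw count is the highest in the matrix.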
More Complex Features
• Plain word co-occurrence overestimates "joint occurrence".
• One solution: more complex features that also capture syntactic relations between words (Lin 98). ‣ no longer count: does "flower" occur in a window of
7 words around "Abu Dhabi"?
‣ but instead: does "flower" occur as the subject of "grow"?
the Qataris had watched Abu Dhabi grow like a hot-house flower, but decided
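With a dependency parser in place, only the definition of "context" changes: instead of window co-occurrences, one counts syntactic triples. A sketch over hand-written triples; in a real system they would come from parsing the corpus, and the (dependent, relation, head) format is an illustrative assumption loosely following Lin 98:

```python
from collections import Counter

# Hand-written dependency triples (dependent, relation, head);
# in practice these are produced by a parser.
parsed_corpus = [
    ("flower", "subj", "grow"),
    ("flower", "obj", "pick"),
    ("tree", "subj", "grow"),
    ("Qataris", "subj", "watch"),
]

features = Counter()
for dep, rel, head in parsed_corpus:
    features[(dep, (rel, head))] += 1

# "flower" now carries the feature "subject of grow"; mere window
# adjacency to "Abu Dhabi" contributes no feature at all.
print(features[("flower", ("subj", "grow"))])  # 1
```

The resulting feature vectors are then compared with cosine exactly as before.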
Geometric Interpretation
• the row vector x_dog describes the usage of the word dog in the corpus
• it can be seen as the coordinates of a point in n-dimensional Euclidean space
• illustrated for two dimensions: get and use
• x_dog = (115, 10)
[Scatter plot: "Two dimensions of English V-Obj DSM"; axes (get, obj) and (use, obj); points for cat, dog, knife, boat]
© Evert/Baroni/Lenci (CC-by-sa), DSM Tutorial, wordspace.collocations.de
Result
Semantic distances
• the main result of distributional analysis is "semantic" distances between words
• typical applications: nearest neighbours, clustering of related words, constructing semantic maps
[Dendrogram: "Word space clustering of concrete nouns (V-Obj from BNC)"; y-axis: cluster size]
[Scatter plot: "Semantic map (V-Obj from BNC)"; clusters labeled bird, groundAnimal, fruitTree, green, tool, vehicle]
© Evert/Baroni/Lenci (CC-by-sa), DSM Tutorial, wordspace.collocations.de
(Evert, NAACL Tutorial 2010)
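Nearest-neighbour lookup, a typical application of these semantic distances, is just a sort by cosine similarity. A sketch reusing the five-dimensional toy vectors from the BNC co-occurrence table:

```python
from math import sqrt

vectors = {
    # Columns of the BNC co-occurrence table
    # (dimensions: grow, garden, worker, production, wild).
    "factory": [15, 5, 279, 102, 3],
    "flower":  [147, 200, 0, 6, 216],
    "tree":    [330, 198, 5, 9, 35],
    "plant":   [517, 316, 84, 130, 96],
    "water":   [106, 118, 18, 28, 30],
    "fork":    [3, 17, 0, 0, 0],
}

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (sqrt(sum(a * a for a in v)) * sqrt(sum(b * b for b in w)))

def neighbours(word, vectors=vectors):
    """All other words, sorted by decreasing cosine similarity."""
    return sorted((w for w in vectors if w != word),
                  key=lambda w: cosine(vectors[word], vectors[w]),
                  reverse=True)

print(neighbours("flower"))
```

On these five toy dimensions, flower's closest neighbours are water, plant, and tree, with factory last; real systems run the same computation over thousands of dimensions and targets.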
Results
(results from Lin 98, from J&M)
hope (N): optimism 0.141, chance 0.137, expectation 0.136, prospect 0.126, dream 0.119, desire 0.118, fear 0.116, effort 0.111, confidence 0.109, promise 0.108
hope (V): would like 0.158, wish 0.140, plan 0.139, say 0.137, believe 0.135, think 0.133, agree 0.130, wonder 0.130, try 0.127, decide 0.125
brief (N): legal brief 0.139, affidavit 0.103, filing 0.098, petition 0.086, document 0.083, argument 0.083, letter 0.079, rebuttal 0.078, memo 0.077, article 0.076
brief (A): lengthy 0.256, hour-long 0.191, short 0.173, extended 0.163, frequent 0.162, recent 0.158, short-lived 0.155, prolonged 0.149, week-long 0.149, occasional 0.146
Problems
• similarity = synonymy? ‣ synonyms are distributionally very similar.
‣ but antonyms and (to a lesser extent) hyponyms are also distributionally very similar.
• Distributional similarity is not referential similarity. Recognizing antonyms is a notoriously hard problem.
brief (A): lengthy 0.256, hour-long 0.191, short 0.173, extended 0.163, frequent 0.162, recent 0.158, short-lived 0.155, prolonged 0.149, week-long 0.149, occasional 0.146
Compositional Distributional Semantics
• Current trend: compositionally computing representations of larger phrases from the distributional representations of words.
• E.g. Mitchell & Lapata 08: compute the co-occurrence vector of a phrase by adding up the word vectors.
• Seems linguistically dubious, but correlates with human similarity judgments.
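The additive model of Mitchell & Lapata can be stated in one line. The vectors below are invented toy values; Mitchell & Lapata also test other composition functions, such as componentwise multiplication:

```python
# Additive composition (Mitchell & Lapata 2008): the vector of a
# phrase is the componentwise sum of its word vectors.
# The 4-dimensional vectors are invented for illustration.
black = [10, 2, 0, 5]
cat   = [1, 8, 3, 0]

def compose_add(v, w):
    return [a + b for a, b in zip(v, w)]

black_cat = compose_add(black, cat)
print(black_cat)  # [11, 10, 3, 5]
```

The phrase vector can then be compared to word and phrase vectors with cosine, exactly like a single word's vector.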
Compositional Distributional Semantics
• Baroni & Zamparelli (2010): "Nouns are vectors, adjectives are matrices" (= functions). ‣ learns a matrix A for each adjective such that A · N approximates the co-occurrence
vector of "A N" (for all nouns N)
• Cf. the application of adjectives to nouns in Montague grammar.
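The learning step can be sketched with numpy: given training pairs of noun vectors and observed "A N" vectors, the adjective matrix is fit by least squares. The 2-dimensional training data below is invented; in Baroni & Zamparelli the vectors are corpus-derived and high-dimensional:

```python
import numpy as np

# Toy training data: noun vectors and the observed co-occurrence
# vectors of the corresponding "red N" phrases (invented values).
nouns   = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])
red_ANs = np.array([[2.0, 1.0],
                    [0.0, 3.0],
                    [2.0, 4.0]])

# Fit the matrix RED such that RED @ n approximates the vector of
# "red n": solve nouns @ RED.T = red_ANs in the least-squares sense.
RED = np.linalg.lstsq(nouns, red_ANs, rcond=None)[0].T

# Apply the learned adjective to an unseen noun vector.
new_noun = np.array([2.0, 1.0])
print(RED @ new_noun)  # [4. 5.]
```

The learned matrix generalizes to nouns that were never observed with the adjective, which is the point of the approach.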
…related to the definition of the adjective (mental activity, historical event, green colour, quick and little cost for easy N), and so on.

Table 1: Nearest 3 neighbors of centroids of ANs that share the same adjective.
American N:   Am. representative, Am. territory, Am. source
black N:      black face, black hand, black (n)
easy N:       easy start, quick, little cost
green N:      green (n), red road, green colour
historical N: historical, hist. event, hist. content
mental N:     mental activity, mental experience, mental energy
necessary N:  necessary, necessary degree, sufficient
nice N:       nice, good bit, nice break
young N:      youthful, young doctor, young staff

How about the neighbors of specific ANs? Table 2 reports the nearest 3 neighbors of 9 randomly selected ANs involving different adjectives (we inspected a larger random set, coming to similar conclusions to the ones emerging from this table).

Table 2: Nearest 3 neighbors of specific ANs.
bad luck:                 bad, bad weekend, good spirit
electronic communication: elec. storage, elec. transmission, purpose
historical map:           topographical, atlas, hist. material
important route:          important transport, important road, major road
nice girl:                good girl, big girl, guy
little war:               great war, major war, small war
red cover:                black cover, hardback, red label
special collection:       general collection, small collection, archives
young husband:            small son, small daughter, mistress
The nearest neighbors of the corpus-based AN vectors in Table 2 make in general intuitive sense. Importantly, the neighbors pick up the composite meaning rather than that of the adjective or noun alone. For example, cover is an ambiguous word, but the hardback neighbor relates to its "front of a book" meaning that is the most natural one in combination with red. Similarly, it makes more sense that a young husband (rather than an old one) would have small sons and daughters (not to mention the mistress!).

We realize that the evidence presented here is of a very preliminary and intuitive nature. Indeed, we will argue in the next section that there are cases in which the corpus-derived AN vector might not be a good approximation to our semantic intuitions about the AN, and a model-composed AN vector is a better semantic surrogate. One of the most important avenues for further work will be to come to a better characterization of the behaviour of corpus-observed ANs, where they work and where they don't. Still, the neighbors of average and AN-specific vectors of Tables 1 and 2 suggest that, for the bulk of ANs, such corpus-based co-occurrence vectors are semantically reasonable.

6 Study 2: Predicting AN vectors

Having tentatively established that the sort of vectors we can harvest for ANs by directly collecting their corpus co-occurrences are reasonable representations of their composite meaning, we move on to the core question of whether it is possible to reconstruct the vector for an unobserved AN from information about its components. We use nearness to the corpus-observed vectors of held-out ANs as a very direct way to evaluate the quality of model-generated ANs, since we just saw that the observed ANs look reasonable (but see the caveats at the end of this section). We leave it to further work to assess the quality of the generated ANs in an applied setting, for example adapting Mitchell and Lapata's paraphrasing task to ANs. Since the observed vectors look like plausible representations of composite meaning, we expect that the closer the model-generated vectors are to the observed ones, the better they should also perform in any task that requires access to the composite meaning, and thus that the results of the current evaluation should correlate with applied performance.

More in detail, we evaluate here the composition methods (and the adjective and noun baselines) by computing, for each of them, the cosine of the test set AN vectors they generate (the "predicted" ANs) with the 41K vectors representing our extended vocabulary in semantic space, and looking at the position of the corresponding observed ANs (that were not used for training, in the supervised approaches)…
Summary
• The "knowledge bottleneck" is a very serious problem in semantic processing.
• Important topic in current research: distributional methods for the semantic similarity of words.
• Current trend: combination with compositional methods.