Machine Learning: Basic Introduction

Jan Odijk
January 2011

LOT Winter School 2011

1


Overview

• Introduction
• Rule-based Approaches
• Machine Learning Approaches
  – Statistical Approach
  – Memory Based Learning
• Methodology
• Evaluation
• Machine Learning & CLARIN

2


Introduction

• As a scientific discipline
  – Studies algorithms that allow computers to evolve behaviors based on empirical data

• Learning: empirical data are used to improve performance on some tasks

• Core concept: Generalize from observed data

3


Introduction

• Plural Formation
  – Observed: a list of (singular form, plural form) pairs
  – Generalize: predict the plural form, given a singular form, for new words (not in the observed list)

• PoS tagging
  – Observed: a text corpus with PoS-tag annotations
  – Generalize: predict the PoS tag of each token in a new text corpus

4


Introduction

• Supervised Learning
  – Map input into desired output, e.g. classes
  – Requires a training set

• Unsupervised Learning
  – Model a set of inputs (e.g. into clusters)
  – No training set required

5


Introduction

• Many approaches
  – Decision Tree Learning
  – Artificial Neural Networks
  – Genetic Programming
  – Support Vector Machines
  – Statistical Approaches
  – Memory Based Learning

6


Introduction

• Focus here
  – Supervised learning
  – Statistical approaches
  – Memory-based learning

7


Rule-Based Approaches

• Rule-based systems for language
  – Lexicon
    • Lists all idiosyncratic properties of lexical items
      – Unpredictable properties, e.g. man is a noun
      – Exceptions to rules, e.g. past tense(go) = went
    • Hand-crafted
    • In a fully formalized manner

8


Rule-Based Approaches

• Rule-based systems for language (cont.)
  – Rules
    • Specify regular properties of the language
      – E.g. the direct object directly follows the verb (in English)
    • Hand-crafted
    • In a fully formalized manner

9


Rule-Based Approaches

• Problems for rule-based systems
  – Lexicon
    • Very difficult to specify and create
    • Always incomplete
    • Existing dictionaries
      – Were developed for use by humans
      – Do not specify enough properties
      – Do not specify the properties in a formalized manner

10


Rule-Based Approaches

• Problems for rule-based systems (cont.)
  – Rules
    • Extremely difficult to describe a language (or even a significant subset of a language) by rules
    • Rule systems become very large and difficult to maintain
    • (No robustness (‘fail softly’ behavior) for unexpected input)

11


Machine Learning

• Machine Learning
  – A machine learns
    • Lexicon
    • Regularities of the language
  – from a large corpus of observed data

12


Statistical Approach

• Statistical approach
• Goal: get output O given some input I
  – Given a word in English, get its translation in Spanish
  – Given an acoustic signal with speech, get the written transcription of the spoken words
  – Given the preceding tags and the following ambitag, get the tag of the current word

• Work with probabilities P(O|I)

13


Statistical Approach

• P(A): the probability of A

• A: an event (usually modeled by a set)

• Event space Ω: the set of all possible elementary events

• 0 ≤ P(A) ≤ 1

• For a finite event space and a uniform distribution: P(A) = |A| / |Ω|

14


Statistical Approach

• Simple Example

• A fair coin is tossed 3 times
  – What is the probability of (exactly) two heads?

• 2 possibilities for each toss: Heads or Tails

• Solution:
  – Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  – A = {HHT, HTH, THH}
  – P(A) = |A| / |Ω| = 3/8

15
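A minimal Python sketch of this calculation, enumerating the sample space directly:

```python
# Minimal sketch: compute P(exactly two heads) by enumerating the sample space.
from itertools import product

omega = list(product("HT", repeat=3))          # all 8 outcomes of 3 tosses
A = [o for o in omega if o.count("H") == 2]    # event: exactly two heads

p_A = len(A) / len(omega)                      # |A| / |Omega| under a uniform distribution
print(p_A)                                     # 0.375 == 3/8
```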


Statistical Approach

• Conditional Probability

• P(A|B)
  – Probability of event A given that event B has occurred

• P(A|B) = P(A ∩ B) / P(B)   (for P(B) > 0)

[Venn diagram: events A and B, overlapping in A ∩ B]

16


Statistical Approach

• A fair coin is tossed 3 times
  – What is the probability of (exactly) two heads (A) if the first toss has occurred and is H (B)?
  – Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  – A = {HHT, HTH, THH}
  – B = {HHH, HHT, HTH, HTT}
  – A ∩ B = {HHT, HTH}
  – P(A|B) = P(A∩B) / P(B) = (2/8) / (4/8) = 2/4 = ½

17


Statistical Approach

• Given
  – P(A|B) = P(A∩B) / P(B)   (multiply by P(B))
  – P(A∩B) = P(A|B) P(B)
  – P(B∩A) = P(B|A) P(A)
  – P(A∩B) = P(B∩A)
  – P(A∩B) = P(B|A) P(A)

• Bayes’ Theorem:
  – P(A|B) = P(A∩B)/P(B) = P(B|A)P(A) / P(B)

18


Statistical Approach

• Bayes’ Theorem Check
  – Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  – A = {HHT, HTH, THH}
  – B = {HHH, HHT, HTH, HTT}
  – A ∩ B = {HHT, HTH}
  – P(B|A) = P(B∩A) / P(A) = (2/8) / (3/8) = 2/3
  – P(A|B) = P(B|A)P(A) / P(B) = (2/3 · 3/8) / (4/8) = (1/4) / (1/2) = 1/2

19
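A minimal Python sketch verifying both the conditional probability and the Bayes-theorem version of this example:

```python
# Minimal sketch: verify the conditional-probability and Bayes computations above.
from itertools import product

omega = list(product("HT", repeat=3))
A = {o for o in omega if o.count("H") == 2}    # exactly two heads
B = {o for o in omega if o[0] == "H"}          # first toss is heads

def P(event):
    return len(event) / len(omega)             # uniform distribution

p_A_given_B = P(A & B) / P(B)                  # directly from the definition
p_bayes     = P(B & A) / P(A) * P(A) / P(B)    # P(B|A) P(A) / P(B) -- Bayes' theorem
print(p_A_given_B, p_bayes)                    # both 0.5
```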


Statistical Approach

• Statistical approach
  – Using Bayesian inference (noisy channel model)
    • get P(O|I) for all possible O, given I
    • take the O for which P(O|I) is highest, given input I: Ô
    • Ô = argmax_O P(O|I)

20


Statistical Approach

• Statistical approach

• How to obtain P(O|I)?

• Bayes Theorem

• P(O|I) = P(I|O) P(O) / P(I)

21


Statistical Approach

• Did we gain anything?
• Yes!
  – P(O) and P(I|O) are often easier to estimate than P(O|I)
  – P(I) can be ignored: it is independent of O
  – (though then we no longer have real probabilities)

• In particular:
  • argmax_O P(O|I) = argmax_O P(I|O) * P(O)

22
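A minimal Python sketch of this decision rule. The probability tables below are made-up toy values for illustration; in a real system they would come from a language model (the prior) and a channel/translation/acoustic model (the likelihood):

```python
# Minimal sketch of the noisy-channel decision rule: O^ = argmax_O P(I|O) * P(O).
# The tiny tables are illustrative only, not real model estimates.

prior = {"book_noun": 0.7, "book_verb": 0.3}             # P(O)
likelihood = {                                            # P(I|O)
    ("book that flight", "book_noun"): 0.01,
    ("book that flight", "book_verb"): 0.20,
}

def decode(I, candidates):
    # pick the candidate O maximizing P(I|O) * P(O); P(I) is constant and can be ignored
    return max(candidates, key=lambda O: likelihood.get((I, O), 0.0) * prior[O])

print(decode("book that flight", ["book_noun", "book_verb"]))   # book_verb
```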


Statistical Approach

• P(O) (also called the prior probability)
  – Used for the language model in MT and ASR
  – Cannot be computed: must be estimated
  – P(w) estimated using the relative frequency of w in a (representative) corpus
    • Count how often w occurs in the corpus
    • Divide by the total number of word tokens in the corpus
    • = relative frequency; set this as P(w)
  – (ignoring smoothing)

23
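A minimal Python sketch of this estimate on a toy corpus (no smoothing, exactly as on the slide):

```python
# Minimal sketch: estimate P(w) as the relative frequency of w in a small toy corpus.
from collections import Counter

corpus = "the man saw the dog the dog barked".split()   # toy corpus, stands in for a real one
counts = Counter(corpus)
total = len(corpus)                                      # total number of word tokens

def P(w):
    return counts[w] / total                             # relative frequency

print(P("the"), P("dog"))                                # 0.375 0.25
```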


Statistical Approach

• P(I|O) (also called the likelihood)
  – Cannot easily be computed
  – But estimated on the basis of a corpus
  – Speech recognition:
    • Transcribed speech corpus → Acoustic Model
  – Machine translation:
    • Aligned parallel corpus → Translation Model

24


Statistical Approach

• How to deal with sentences instead of words?

• Sentence S = w1 .. wn
  – P(S) = P(w1) * .. * P(wn)?
  – NO: this misses the connections between the words
  – P(S) = P(w1) P(w2|w1) P(w3|w1w2) .. P(wn|w1..wn-1)   (chain rule)

25


Statistical Approach

– N-grams needed (not really feasible)
– Probabilities of n-grams are estimated by the relative frequency of the n-grams in a corpus
  • Frequencies get too low for n-grams with n ≥ 3 to be useful
– In practice: use bigrams, trigrams (4-grams)
– E.g. bigram model:
  • P(S) ≈ P(w1) P(w2|w1) P(w3|w2) .. P(wn|wn-1)

26
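A minimal Python sketch of a bigram model estimated by relative frequency from a toy corpus (no smoothing, no sentence-boundary markers, both of which a real model would add):

```python
# Minimal sketch of a bigram language model: P(S) ~ P(w1) * prod_i P(wi | wi-1).
from collections import Counter

corpus = "the man saw the dog . the dog saw the man .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev, w):
    return bigrams[(w_prev, w)] / unigrams[w_prev]        # P(w | w_prev)

def p_sentence(words):
    p = unigrams[words[0]] / len(corpus)                   # P(w1) as relative frequency
    for w_prev, w in zip(words, words[1:]):
        p *= p_bigram(w_prev, w)
    return p

print(p_sentence("the dog saw the man".split()))           # ~0.042 on this toy corpus
```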


Memory Based Learning

• Classification

• Determine input features

• Determine output classes

• Store observed examples

• Use similarity metrics to classify unseen cases

27


Memory Based Learning

• Example: PP-attachment

• Given an input sequence V .. N .. PP
  – PP attaches to V?, or
  – PP attaches to N?

• Examples
  – John ate crisps with Mary

– John ate pizza with fresh anchovies

– John had pizza with his best friends

28


Memory Based Learning

• Input features (feature vector):
  – Verb
  – Head noun of the complement NP
  – Preposition
  – Head noun of the NP inside the PP

• Output classes (indicated by class labels)
  – Verb (i.e. attaches to the verb)
  – Noun (i.e. attaches to the noun)

29


Memory Based Learning

• Training Corpus:

Id   Verb   Noun1    Prep   Noun2       Class
1    ate    crisps   with   Mary        Verb
2    ate    pizza    with   anchovies   Noun
3    had    pizza    with   friends     Verb
4    has    pizza    with   John        Verb
5    …

30


Memory Based Learning

• MBL: store the training corpus (feature vectors + associated classes) in memory

• For new cases
  – Already stored in memory?
    • Yes: assign the associated class
    • No: use similarity metrics

31


Similarity Metrics

• (actually: distance metrics)

• Input: eats pizza with Liam

• Compare the input feature vector X with each vector Y in memory: Δ(X,Y)

• Comparing vectors: sum the differences for the n individual features:
  Δ(X,Y) = Σ_{i=1..n} δ(x_i, y_i)

32


Similarity Metrics

• δ(f1,f2) =
  – (f1, f2 numeric):
    • |f1 − f2| / (max − min)
      – 12 − 2 = 10 in a range of 0 .. 100: 10/100 = 0.1
      – 12 − 2 = 10 in a range of 0 .. 20: 10/20 = 0.5
  – (f1, f2 not numeric):
    • 0 if f1 = f2   (no difference, distance = 0)
    • 1 if f1 ≠ f2   (difference, distance = 1)

33
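A minimal Python sketch of this per-feature distance δ:

```python
# Minimal sketch of delta(f1, f2): scaled absolute difference for numeric features,
# overlap (0/1) for symbolic features.

def delta(f1, f2, max_val=None, min_val=None):
    if isinstance(f1, (int, float)) and isinstance(f2, (int, float)):
        return abs(f1 - f2) / (max_val - min_val)   # numeric: scale by the feature's range
    return 0 if f1 == f2 else 1                     # symbolic: 0 if equal, 1 otherwise

print(delta(12, 2, max_val=100, min_val=0))          # 0.1
print(delta(12, 2, max_val=20, min_val=0))           # 0.5
print(delta("with", "with"), delta("ate", "eats"))   # 0 1
```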


Similarity Metrics

Id        Verb     Noun1      Prep     Noun2         Class   Δ(X,Y)
New (X)   eats     pizza      with     Liam          ??
Mem 1     ate:1    crisps:1   with:0   Mary:1        Verb    3
Mem 2     ate:1    pizza:0    with:0   anchovies:1   Noun    2
Mem 3     had:1    pizza:0    with:0   friends:1     Verb    2
Mem 4     has:1    pizza:0    with:0   John:1        Verb    2
Mem 5     …

34


Similarity Metrics

• Look at the “k nearest neighbours” (k-NN)
  – (k = 1): look at the ‘nearest’ set of vectors

• The set of feature vectors with ids 2,3,4 has the smallest distance (viz. 2)

• Take the most frequent class occurring in this set: Verb

• Assign this as class to the new example

35
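A minimal Python sketch of the whole procedure, using the overlap distance and the toy PP-attachment memory from the slides:

```python
# Minimal sketch of memory-based (k-NN) classification with the overlap distance.
from collections import Counter

memory = [                                    # (verb, noun1, prep, noun2) -> class
    (("ate", "crisps", "with", "Mary"),      "Verb"),
    (("ate", "pizza",  "with", "anchovies"), "Noun"),
    (("had", "pizza",  "with", "friends"),   "Verb"),
    (("has", "pizza",  "with", "John"),      "Verb"),
]

def distance(x, y):
    return sum(0 if xi == yi else 1 for xi, yi in zip(x, y))   # overlap distance

def classify(x):
    dists = [(distance(x, y), cls) for y, cls in memory]
    nearest_d = min(d for d, _ in dists)                 # k = 1: the single nearest distance
    votes = [cls for d, cls in dists if d == nearest_d]  # all vectors at that distance
    return Counter(votes).most_common(1)[0][0]           # most frequent class among them

print(classify(("eats", "pizza", "with", "Liam")))       # Verb (ids 2,3,4 tie at distance 2)
```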


Similarity Metrics

• With Δ(X,Y) = Σ_{i=1..n} δ(x_i, y_i)
  – every feature is ‘equally important’
  – Perhaps some features are more ‘important’ than others

• Adaptation:
  – Δ(X,Y) = Σ_{i=1..n} w_i · δ(x_i, y_i)
  – where w_i is the weight of feature i

36


Similarity Metrics

• How to obtain the weight of a feature?
  – Can be based on knowledge
  – Can be computed from the training corpus
  – In various ways:
    • Information Gain
    • Gain Ratio
    • χ²

37
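A minimal Python sketch of information-gain weights computed from the toy PP-attachment table above; gain ratio and χ² would be computed analogously:

```python
# Minimal sketch: information gain per feature.
# IG(f) = H(class) - sum_v P(f=v) * H(class | f=v)
from collections import Counter
from math import log2

rows = [                                       # (verb, noun1, prep, noun2), class
    (("ate", "crisps", "with", "Mary"),      "Verb"),
    (("ate", "pizza",  "with", "anchovies"), "Noun"),
    (("had", "pizza",  "with", "friends"),   "Verb"),
    (("has", "pizza",  "with", "John"),      "Verb"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(i):
    h_before = entropy([cls for _, cls in rows])
    h_after = 0.0
    for value in {x[i] for x, _ in rows}:
        subset = [cls for x, cls in rows if x[i] == value]
        h_after += len(subset) / len(rows) * entropy(subset)
    return h_before - h_after

print([round(information_gain(i), 3) for i in range(4)])
```

Note that on this toy table the Noun2 feature gets the maximal gain simply because every value occurs only once; this bias towards many-valued features is what Gain Ratio corrects by normalizing with the entropy of the feature itself.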


Methodology

• Split corpus into
  – Training corpus
  – Test corpus

• Essential to keep the test corpus separate
• (Ideally) keep the test corpus unseen
• Sometimes
  – Development set
  – To do tests while developing

38


Methodology

• Split
  – Training 50%
  – Test 50%

• Pro
  – Large test set

• Con
  – Small training set

39


Methodology

• Split
  – Training 90%
  – Test 10%

• Pro
  – Large training set

• Con
  – Small test set

40


Methodology

• 10-fold cross-validation
  – Split the corpus into 10 equal subsets
  – Train on 9; test on 1 (in all 10 combinations)

• Pro:
  – Large training sets
  – Still independent test sets

• Con:
  – Training set still not maximal
  – Requires a lot of computation

41
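A minimal Python sketch of the 10-fold split (the model training and evaluation themselves are omitted):

```python
# Minimal sketch of 10-fold cross-validation: split the data into 10 equal parts,
# train on 9 and test on the remaining 1, rotating over all 10 combinations.
import random

data = list(range(100))            # stand-in for the annotated corpus
random.shuffle(data)
k = 10
folds = [data[i::k] for i in range(k)]

for i in range(k):
    test = folds[i]
    train = [x for j, fold in enumerate(folds) if j != i for x in fold]
    # train_model(train); evaluate(test)  -- model code omitted in this sketch
    print(f"fold {i}: {len(train)} training items, {len(test)} test items")
```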


Methodology

• Leave One Out
  – Use all examples in the training set except 1
  – Test on that 1 example (in all combinations)

• Pro:
  – Maximal training sets
  – Still independent test sets

• Con: requires a lot of computation

42


Evaluation

                                 True class
                                 Positive (P)            Negative (N)

Predicted     Positive           True Positive (TP)      False Positive (FP)
class         Negative           False Negative (FN)     True Negative (TN)

43


Evaluation

• TP= examples that have class C and are predicted to have class C

• FP = examples that have class ~C but are predicted to have class C

• FN= examples that have class C but are predicted to have class ~C

• TN= examples that have class ~C and are predicted to have class ~C

44


Evaluation

• Precision = TP / (TP+FP)

• Recall = True Positive Rate = TP / P

• False Positive Rate = FP / N

• F-Score = (2*Prec*Rec) / (Prec+Rec)

• Accuracy = (TP+TN)/(TP+TN+FP+FN)

45
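A minimal Python sketch of these measures, applied to made-up confusion-matrix counts:

```python
# Minimal sketch: the evaluation measures above, computed from confusion-matrix counts.

def evaluate(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # TP / P, with P = TP + FN
    false_positive_rate = fp / (fp + tn)     # FP / N, with N = FP + TN
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, false_positive_rate, f_score, accuracy

print(evaluate(tp=40, fp=10, fn=20, tn=30))  # illustrative counts, not real results
```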


Example Applications

• Morphology for Dutch
  – Segmentation into stems and affixes
    • Abnormaliteiten -> abnormaal + iteit + en
  – Map to morphological features (e.g. inflectional)
    • liepen -> lopen + past plural

• Instance for each character

• Features: focus char; 5 preceding and 5 following letters; + class

46
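A minimal Python sketch of how such character instances can be built; the class labels themselves would come from the annotated training data, and the '_' padding symbol is chosen here just for illustration:

```python
# Minimal sketch: one instance per character, with the focus character plus the
# 5 preceding and 5 following letters as features ('_' pads beyond the word edges).

def char_instances(word, window=5, pad="_"):
    padded = pad * window + word + pad * window
    instances = []
    for i, focus in enumerate(word):
        left = padded[i:i + window]                       # 5 preceding letters
        right = padded[i + window + 1:i + 2 * window + 1] # 5 following letters
        instances.append(list(left) + [focus] + list(right))
    return instances

for inst in char_instances("liepen"):
    print(inst)   # the class label per instance would be appended from the training data
```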


Example Applications

• Morphology for Dutch: Results

                  Prec   Rec    F-Score
  Full:           81.1   80.7   80.9
  Typed Seg:      90.3   89.9   90.1
  Untyped Seg:    90.4   90.0   90.2

  – Seg = correctly segmented
  – Typed = assigned correct type
  – Full = typed segmentation + correct spelling changes

47


Example Applications

• Part-of-Speech Tagging
  – Assignment of tags to words in context
  – [word] -> [(word, tag)]
  – [book that flight] ->
    [(book, Verb) (that, Det) (flight, Noun)]
  – Book in isolation is ambiguous between noun and verb: marked by an ambitag: noun/verb

48


Example Applications

• Part-of-Speech Tagging Features
  – Context:
    • preceding tag + following ambitag
  – Word:
    • actual word form for the 1000 most frequent words
    • some features of the word
      – ambitag of the word
      – +/- capitalized
      – +/- with digits
      – +/- hyphen

49


Example Applications

• Part-of-Speech Tagging Results
  – WSJ: 96.4% accuracy
  – LOB Corpus: 97.0% accuracy

50


Example Applications

• Phrase Chunking
  – Marking of major phrase boundaries
  – The man gave the boy the money ->
    [NP the man] gave [NP the boy] [NP the money]
  – Usually encoded with tags per word:
    I-X = inside X; O = outside; B-X = beginning of a new X
  – the_I-NP man_I-NP gave_O the_I-NP boy_I-NP the_B-NP money_I-NP

51
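A minimal Python sketch that decodes such per-word IOB tags back into chunks:

```python
# Minimal sketch: recover bracketed chunks from per-word IOB tags as used above
# (I-X = inside X, O = outside, B-X = start of a new X directly after another X).

def iob_to_chunks(tagged):
    chunks, current = [], []
    for word, tag in tagged:
        if tag == "O" or tag.startswith("B-"):
            if current:
                chunks.append(current)                 # close the running chunk
            current = [word] if tag.startswith("B-") else []
        else:                                          # I-X: continue (or start) a chunk
            current.append(word)
    if current:
        chunks.append(current)
    return chunks

sent = [("the", "I-NP"), ("man", "I-NP"), ("gave", "O"), ("the", "I-NP"),
        ("boy", "I-NP"), ("the", "B-NP"), ("money", "I-NP")]
print(iob_to_chunks(sent))   # [['the', 'man'], ['the', 'boy'], ['the', 'money']]
```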


Example Applications

• Phrase Chunking Features
  – Word form
  – PoS-tags of

• 2 preceding words

• The focus word

• 1 word to the right

52


Example Applications

• Phrase Chunking Results

          Prec   Rec    F-score
  • NP    92.5   92.2   92.3
  • VP    91.9   91.7   91.8
  • ADJP  68.4   65.0   66.7
  • ADVP  78.0   77.9   77.9
  • PP    91.9   92.2   92.0

53


Example Applications

• Coreference Marking
  – COREA project
  – Demo

  – Een 21-jarige dronkenlap[3] besloot maandagnacht zijn[5005=3] roes uit te slapen op de snelweg A19 bij Naarden. De politie[12=9] trof de man[14=5005] slapend aan achter het stuur van zijn[5017=14] auto[18], terwijl de motor nog draaide.
    (“A 21-year-old drunk decided on Monday night to sleep off his intoxication on the A19 motorway near Naarden. The police found the man asleep behind the wheel of his car, while the engine was still running.”)

54


Machine Learning & CLARIN

• Web services in workflow systems have been created for several MBL-based tools
  – Orthographic Normalization

– Morphological analysis

– Lemmatization

– Pos-Tagging

– Chunking

– Coreference assignment

– Semantic annotation (semantic roles, locative and temporal adverbs)

55


Machine Learning & CLARIN

• Web services in workflow systems have been created for statistically based tools, such as
  – Speech recognition

– Audio mining

– All based on SPRAAK

– Tomorrow more on this!

56

