Rutgers CS440, Fall 2003
Introduction to Statistical Learning
Reading: Ch. 20, Sec. 1-4, AIMA 2nd Ed.
Learning under uncertainty
• How to learn probabilistic models such as Bayesian networks, Markov models, HMMs, …?
• Examples:
– Class confusion example: how did we come up with the CPTs?
– Earthquake-burglary network structure?
– How do we learn HMMs for speech recognition?
– Kalman model (e.g., mass, friction) parameters?
– User models encoded as Bayesian networks for HCI?
Hypotheses and Bayesian theory
• Problem:
– Two kinds of candy, lemon and chocolate
– Packed in five types of unmarked bags:
(100% C, 0% L) 10% of the time, (75% C, 25% L) 20% of the time, (50% C, 50% L) 40% of the time, (25% C, 75% L) 20% of the time, (0% C, 100% L) 10% of the time
– Task: open a bag, (unwrap candy, observe it, …), then predict what the next one will be
• Formulation:
– H (Hypothesis): h1 (100,0) or h2 (75,25) or h3 (50,50) or h4 (25,75) or h5 (0,100)
– di (Data): i-th opened candy: L (lemon) or C (chocolate)
– Goal:
• predict di+1 after seeing D = { d0, d1, …, di }
• P( di+1 | D )
Bayesian learning
• Bayesian solution
Estimate probabilities of hypotheses (candy bag types), then predict data (candy type):
– Hypothesis posterior:  P( hi | D ) ∝ P( D | hi ) P( hi )   (data likelihood × hypothesis prior)
– Prediction:  P( di+1 | D ) = Σ_hi P( di+1 | hi ) P( hi | D )
– Data likelihood, with i.i.d. (independently, identically distributed) data points:  P( D | hi ) = P( d0 | hi ) × … × P( di | hi )
Example
• P( hi ) = ?
– P( hi ) = (0.1, 0.2, 0.4, 0.2, 0.1)
• P( di | hi ) = ?
– P( chocolate | h1 ) = 1, P( lemon | h3 ) = 0.5
• P( C, C, C, C, C | h4 ) = ?
– P( C, C, C, C, C | h4 ) = 0.25^5 ≈ 0.001
• P( h5 | C, C, C, C, C ) = ?
– P( h5 | C, C, C, C, C ) ∝ P( C, C, C, C, C | h5 ) P( h5 ) = 0^5 × 0.1 = 0
• P( lemon | C, C, C, C, C ) = ?
– P( lemon | h1 ) P( h1 | C, C, C, C, C ) + … + P( lemon | h5 ) P( h5 | C, C, C, C, C ) =
0*0.6244 + 0.25*0.2963 + 0.50*0.0780 + 0.75*0.0012 + 1*0 = 0.1140
• P( chocolate | C, C, C, C, … ) = ?
– P( chocolate | C, C, C, C, … ) → 1 (the posterior concentrates on h1; see the sketch below)
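A minimal Python sketch of this Bayesian update for the candy problem (function and variable names are our own, not from the slides); running it reproduces the posterior weights and the 0.1140 prediction above.

```python
# Bayesian prediction for the candy problem (a sketch; encoding of candies as 'C'/'L' is ours).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]           # P(h1)..P(h5)
p_choc = [1.0, 0.75, 0.5, 0.25, 0.0]         # P(chocolate | h_i)

def posterior(data):
    """P(h_i | D) for a sequence of observations like ['C', 'C', 'L']."""
    post = []
    for prior, pc in zip(priors, p_choc):
        lik = 1.0
        for d in data:
            lik *= pc if d == 'C' else (1.0 - pc)
        post.append(prior * lik)
    z = sum(post)
    return [p / z for p in post]

def predict_lemon(data):
    """P(next candy is lemon | D) = sum_i P(lemon | h_i) P(h_i | D)."""
    return sum((1.0 - pc) * p for pc, p in zip(p_choc, posterior(data)))

print(posterior(['C'] * 5))      # ~ [0.624, 0.296, 0.078, 0.001, 0.0]
print(predict_lemon(['C'] * 5))  # ~ 0.114
```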
Bayesian prediction properties
• True hypothesis eventually dominates
• Bayesian prediction is optimal (minimizes prediction error)
• Comes at a price: usually many hypotheses, intractable summation
Approximations to Bayesian prediction
• MAP – Maximum a posteriori
P( d | D ) = P( d | hMAP ),  hMAP = arg max_hi P( hi | D )
(easier to compute)
• Role of prior, P(hi): penalizes complex hypotheses
• ML – Maximum likelihood
P( d | D ) = P( d | hML ),  hML = arg max_hi P( D | hi )
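A short sketch contrasting these approximations with the full Bayesian average on the candy example above (hypothesis indexing is ours): after five chocolates both hMAP and hML are h1, so their prediction of lemon is 0, while the Bayesian prediction is about 0.114.

```python
# MAP/ML vs. full Bayesian prediction after observing five chocolates (a sketch).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]                        # P(h1)..P(h5)
p_choc = [1.0, 0.75, 0.5, 0.25, 0.0]                      # P(chocolate | h_i)
post = [pr * pc ** 5 for pr, pc in zip(priors, p_choc)]   # unnormalized P(h_i | C,C,C,C,C)
z = sum(post)
post = [p / z for p in post]

h_map = max(range(5), key=lambda i: post[i])              # MAP hypothesis: index 0 (h1)
h_ml = max(range(5), key=lambda i: p_choc[i] ** 5)        # ML hypothesis: also h1 (ignores the prior)
print(1.0 - p_choc[h_map])                                # MAP/ML prediction of lemon: 0.0
print(sum((1 - pc) * p for pc, p in zip(p_choc, post)))   # Bayesian prediction: ~0.114
```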
Learning from complete data
• Learn parameters of Bayesian models from data
– e.g., learn probabilities of C & L for a bag of candy whose proportions of C & L are unknown by observing opened candy from that bag
• Candy problem parameters:
– θ_u – probability of C in bag u
– π_u – probability of bag u

Candy   Parameter
C       θ_u
L       1 − θ_u

Bag   Parameter
1     π_1
2     π_2
3     π_3
4     π_4
5     π_5

(Figure: bag u with its observed candies)
ML Learning from complete data
• ML approach: select model parameters to maximize likelihood of seen data
1. Assume a distribution model that determines how the samples (of candy) are distributed in a bag
2. Select parameters of the model that maximize the likelihood of the seen data
Likelihood (i.i.d. samples):
P( D | h_θu ) = Π_{i=1..N} P( d_i | h_θu )

Log-likelihood:
L( D | h_θu ) = log P( D | h_θu ) = Σ_{i=1..N} log P( d_i | h_θu )

Model: binomial:
P( d_i = C | h_θu ) = θ_u ,  P( d_i = L | h_θu ) = 1 − θ_u ,  i.e.  P( d_i | h_θu ) = θ_u^{1(d_i=C)} (1 − θ_u)^{1(d_i=L)}
Maximum likelihood learning (binomial distribution)
• How to find a solution to the above problem?
θ_u* = arg max_θu P( D | h_θu ) = arg max_θu L( D | h_θu )
     = arg max_θu Σ_{i=1..N} log [ θ_u^{1(d_i=C)} (1 − θ_u)^{1(d_i=L)} ]
     = arg max_θu Σ_{i=1..N} [ 1(d_i=C) log θ_u + 1(d_i=L) log(1 − θ_u) ]
Maximum likelihood learning (cont’d)
• Take the first derivative of (log) likelihood and set it to zero
∂L( D | h_θu ) / ∂θ_u = Σ_{i=1..N} [ 1(d_i=C) / θ_u* − 1(d_i=L) / (1 − θ_u*) ] = 0

(1 − θ_u*) Σ_{i=1..N} 1(d_i=C) = θ_u* Σ_{i=1..N} 1(d_i=L)

θ_u* = Σ_{i=1..N} 1(d_i=C) / [ Σ_{i=1..N} 1(d_i=C) + Σ_{i=1..N} 1(d_i=L) ] = (1/N) Σ_{i=1..N} 1(d_i=C)
• Counting!
Sample   d=C   d=L
1        1     0
2        0     1
…        …     …
N        1     0
Total:   Σ_{i=1..N} 1(d_i=C)   Σ_{i=1..N} 1(d_i=L)
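A tiny sketch of this counting recipe, assuming the observed candies are encoded as a list of 'C'/'L' symbols (our encoding, not the slides'):

```python
# ML estimate of theta_u = P(C | bag u): just the fraction of chocolates observed.
def ml_theta(candies):
    """candies: list of observed candies from one bag, e.g. ['C', 'L', 'C', 'C']."""
    n_c = sum(1 for d in candies if d == 'C')     # sum_i 1(d_i = C)
    return n_c / len(candies)                     # N = count(C) + count(L)

print(ml_theta(['C', 'L', 'C', 'C']))             # 0.75
```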
Naïve Bayes model
• One set of causes, multiple independent sources of evidence
(Figure: Naïve Bayes graph: class node C with children E1, E2, …, EN)

• Example: C ∈ { spam, not spam }, Ei ∈ { token i present, token i absent }

P( E1, E2, …, EN, C ) = [ Π_{i=1..N} P( Ei | C ) ] P( C )
• Limiting assumption, often works well in practice
Inference & Decision in NB model
• Inference
P( C | E1, E2, …, EN ) ∝ [ Π_{i=1..N} P( Ei | C ) ] P( C )
log P( C | E1, E2, …, EN ) ∝ Σ_{i=1..N} log P( Ei | C ) + log P( C )
(hypothesis (class) score = evidence score + prior score)
• Decision
P( C = SPAM | E1, E2, …, EN )  ≷  P( C = NOT_SPAM | E1, E2, …, EN )

log P( C = SPAM | E1, E2, …, EN ) − log P( C = NOT_SPAM | E1, E2, …, EN )  ≷  0

Σ_{i=1..N} log [ P( Ei | C = SPAM ) / P( Ei | C = NOT_SPAM ) ] + log [ P( C = SPAM ) / P( C = NOT_SPAM ) ]  ≷  0   (log odds ratio)
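A minimal sketch of this log-odds decision rule; the token CPTs and prior below are made-up illustrative numbers, not values from the slides.

```python
import math

# Illustrative (made-up) CPTs: P(token i present | class), for 3 tokens.
p_token_spam = [0.8, 0.6, 0.1]
p_token_not_spam = [0.2, 0.3, 0.4]
p_spam = 0.4                                                  # P(C = SPAM)

def log_odds(evidence):
    """evidence: list of 0/1 token indicators; returns log posterior odds for SPAM."""
    score = math.log(p_spam) - math.log(1.0 - p_spam)         # prior score
    for e, ps, pn in zip(evidence, p_token_spam, p_token_not_spam):
        ps_e = ps if e else 1.0 - ps                          # P(E_i = e | SPAM)
        pn_e = pn if e else 1.0 - pn                          # P(E_i = e | NOT_SPAM)
        score += math.log(ps_e) - math.log(pn_e)              # evidence score
    return score

msg = [1, 1, 0]
print("SPAM" if log_odds(msg) > 0 else "NOT_SPAM")
```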
Learning in NB models
• Example:
Given a set of K email messages, each with tokens D = { dj = (e1j, …, eNj) }, eij ∈ {0,1}, and labels C = { cj } (SPAM or NOT_SPAM), find the best set of CPTs P(Ei|C) and P(C)
Assume: P(Ei|C=c) is binomial with parameter θ_{i,c}, P(C) is binomial with parameter π_C
P( Ei = ei | C = c ) = θ_{i,c}^{ei} (1 − θ_{i,c})^{1 − ei}
P( C = c ) = π_C^{c} (1 − π_C)^{1 − c}
• ML learning: maximize likelihood of K messages, each one in one of the two classes
max over { θ_{1,0}, …, θ_{N,1}, π_C } (2N+1 parameters) of the log-likelihood of (D, C):
Σ_{j=1..K} [ Σ_{i=1..N} ( eij log θ_{i,cj} + (1 − eij) log(1 − θ_{i,cj}) ) + cj log π_C + (1 − cj) log(1 − π_C) ]
where cj is the label of message j and eij indicates whether token i in message j is present/absent.

ML solution (counting):
π_C* = (1/K) Σ_{j=1..K} cj
θ_{i,c}* = Σ_{j=1..K} 1(cj = c) eij / Σ_{j=1..K} 1(cj = c)
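A sketch of this ML learning as counting, with messages encoded as 0/1 token-indicator vectors and labels as 0/1 (the encoding is ours; it assumes both classes appear in the training set).

```python
# ML estimates for the Naive Bayes spam model from labelled messages.
# messages: list of 0/1 token-indicator vectors e_j; labels: list of 0/1 class labels c_j.
def nb_ml_fit(messages, labels):
    K = len(messages)
    N = len(messages[0])
    pi = sum(labels) / K                                      # P(C = 1)
    theta = [[0.0, 0.0] for _ in range(N)]                    # theta[i][c] = P(E_i = 1 | C = c)
    for c in (0, 1):
        idx = [j for j, cj in enumerate(labels) if cj == c]   # messages of class c
        for i in range(N):
            theta[i][c] = sum(messages[j][i] for j in idx) / len(idx)
    return theta, pi

theta, pi = nb_ml_fit([[1, 0, 1], [1, 1, 0], [0, 0, 1]], [1, 1, 0])
print(theta, pi)
```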
Learning of Bayesian network parameters
• Naïve Bayes learning can be extended to BNs! How?
• Model each CPT as a binomial/multinomial distribution. Maximize likelihood of data given the BN.
(Figure: earthquake–burglary network: earthquake and burglary are parents of alarm, with newscast and call as evidence nodes)
P( E = e ) = θ_E^{e} (1 − θ_E)^{1 − e}
P( B = b ) = θ_B^{b} (1 − θ_B)^{1 − b}
P( A = a | B = b, E = e ) = θ_{A|be}^{a} (1 − θ_{A|be})^{1 − a}
Sample E B A N C
1 1 0 0 1 0
2 1 0 1 1 0
3 1 1 0 0 1
4 0 1 0 1 1
θ_{A|be}* = Σ_{i=1..N} 1( Ai = 1, Bi = b, Ei = e ) / Σ_{i=1..N} 1( Bi = b, Ei = e )
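A small sketch that reproduces this counting estimate from the complete-data table above (the tuple encoding of samples is ours):

```python
# ML estimate of P(A=1 | B=b, E=e) from fully observed samples (E, B, A, N, C).
samples = [
    (1, 0, 0, 1, 0),
    (1, 0, 1, 1, 0),
    (1, 1, 0, 0, 1),
    (0, 1, 0, 1, 1),
]

def theta_A_given(b, e):
    match = [s for s in samples if s[1] == b and s[0] == e]   # samples with B=b, E=e
    if not match:
        return None                                           # no data for this parent setting
    return sum(s[2] for s in match) / len(match)              # fraction of them with A=1

print(theta_A_given(b=0, e=1))   # from the table: samples 1 and 2 -> 1/2
```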
BN Learning (cont’d)
• Issues:
1. Priors on parameters. What if Σ_{i=1..N} 1( Ai = 1, Bi = b, Ei = e ) = 0? Should we trust it?
Maybe always add some small pseudo-count α_{be} > 0 to the count:  Σ_{i=1..N} 1( Ai = 1, Bi = b, Ei = e ) + α_{be}  (a smoothing sketch follows this list)
2. How do we learn a BN graph (structure)?
Test all possible structures, then pick the one with the highest data likelihood?
3. What if we do not observe some nodes (evidence not on all nodes)?
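A sketch of the pseudo-count fix from issue 1, which amounts to Laplace/Dirichlet-style smoothing (the default α = 1 is an arbitrary choice, not from the slides):

```python
# Smoothed estimate of P(A=1 | B=b, E=e): add a pseudo-count alpha to both outcomes.
def theta_A_given_smoothed(samples, b, e, alpha=1.0):
    match = [s for s in samples if s[1] == b and s[0] == e]   # samples with B=b, E=e
    n1 = sum(s[2] for s in match)                             # how many of them have A=1
    return (n1 + alpha) / (len(match) + 2 * alpha)            # never exactly 0, even with no data

print(theta_A_given_smoothed([], b=1, e=0))   # 0.5: with no data we fall back to the pseudo-counts
```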
Learning from incomplete data
• Example:
– In the alarm network, we received data where we only know Newscast, Call, Earthquake, Burglary, but have no idea what the Alarm state is.
– In the SPAM model, we do not know if a message is spam or not (missing label).

Sample  E  B  A    N  C
1       1  0  N/A  1  0
2       1  0  N/A  1  0
3       1  1  N/A  0  1
4       0  1  N/A  1  1

• Solution?
We can still try to find network parameters that maximize the likelihood of the incomplete data.
θ* = arg max_θ L( D | θ ) = arg max_θ Σ_{i=1..N} log P( d_i | θ ) = arg max_θ Σ_{i=1..N} log Σ_h P( d_i, h | θ )
(h: hidden variable)
Completing the data
• Maximizing incomplete data likelihood is tricky.
• If we could, somehow, complete the data, we would know how to select model parameters that maximize the completed-data likelihood.
• How do we complete the missing data?
1. Randomly complete?
2. Estimate missing data from evidence, P( h | Evidence ).
Sample  E  B  A                               N  C
1.0     1  0  P( a=0 | E=1, B=0, N=1, C=0 )   1  0
1.1     1  0  P( a=1 | E=1, B=0, N=1, C=0 )   1  0
2       1  0  …                               1  0
3       1  1  …                               0  1
4       0  1  …                               1  1
EM Algorithm
• With completed data, Dc, maximize completed (log)likelihood by weighting contribution from each sample with P(h|d)
θ* = arg max_θ L( Dc | θ ) = arg max_θ Σ_{i=1..N} Σ_h P( h | d_i ) log P( d_i, h | θ )

For the alarm network with hidden A, for example:
θ_{A|be}* = Σ_{i=1..N} 1( Bi = b, Ei = e ) P( Ai = 1 | Ei = ei, Bi = bi, Ni = ni, Ci = ci )
            / Σ_{i=1..N} Σ_{a ∈ {0,1}} 1( Bi = b, Ei = e ) P( Ai = a | Ei = ei, Bi = bi, Ni = ni, Ci = ci )
• E(xpectation) M(aximization) algorithm (a minimal loop sketch follows below):
1. Pick initial parameter estimates θ_0.
2. error = Inf;
3. While (error > max_error)
1. E-step: Complete data, Dc, based on θ_{k−1}.
2. M-step: Compute new parameters θ_k that maximize the completed-data likelihood.
3. error = L( D | θ_k ) − L( D | θ_{k−1} )
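A minimal Python skeleton of this loop, assuming hypothetical model-specific callbacks e_step, m_step, and log_likelihood (these names are ours, not from the slides):

```python
# Generic EM loop (a sketch). The callbacks supply the model-specific computations.
def em(data, theta0, e_step, m_step, log_likelihood, tol=1e-6, max_iter=100):
    theta = theta0
    prev_ll = float("-inf")
    for _ in range(max_iter):
        weights = e_step(data, theta)        # E-step: P(h | d_i, theta) for each sample
        theta = m_step(data, weights)        # M-step: maximize the completed-data likelihood
        ll = log_likelihood(data, theta)
        if ll - prev_ll < tol:               # the likelihood is non-decreasing under EM
            break
        prev_ll = ll
    return theta
```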
EM Example
• Candy problem, but now we do not know which bag the candy came from (bag label missing).
• E-step: compute the posterior of each bag type for every observed candy
P( u | d_i, θ^{k−1}, π^{k−1} ) ∝ P( d_i | u, θ^{k−1}_u ) P( u | π^{k−1}_u )

• M-step: maximize the expected completed-data log-likelihood
L_c(θ_u) = Σ_{i=1..N} Σ_{u=1..5} P( u | d_i, θ^{k−1}, π^{k−1} ) [ 1(d_i = C) log θ_u + 1(d_i = L) log(1 − θ_u) ]

∂L_c/∂θ_u = Σ_{i=1..N} P( u | d_i, θ^{k−1}, π^{k−1} ) [ 1(d_i = C)/θ_u − 1(d_i = L)/(1 − θ_u) ] = 0

⇒ θ^k_u = Σ_{i=1..N} 1(d_i = C) P( u | d_i, θ^{k−1}, π^{k−1} ) / Σ_{i=1..N} P( u | d_i, θ^{k−1}, π^{k−1} )   (candy (C) probability in bag u)

π^k_u = (1/N) Σ_{i=1..N} P( u | d_i, θ^{k−1}, π^{k−1} )   (prior probability of bag u)
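A concrete Python sketch of these E- and M-steps for the candy problem with hidden bag labels (the initial parameter values and the 'C'/'L' encoding are our own choices):

```python
# EM for the candy problem when the bag label is hidden (a sketch).
def em_candy(candies, n_iter=50):
    theta = [0.9, 0.7, 0.5, 0.3, 0.1]                 # initial P(C | bag u), arbitrary
    pi = [0.2] * 5                                    # initial P(bag u), arbitrary
    for _ in range(n_iter):
        # E-step: responsibilities P(u | d_i, theta, pi) for every candy.
        resp = []
        for d in candies:
            w = [pi[u] * (theta[u] if d == 'C' else 1 - theta[u]) for u in range(5)]
            z = sum(w)
            resp.append([wi / z for wi in w])
        # M-step: re-estimate parameters from the weighted (completed) data.
        for u in range(5):
            w_u = sum(r[u] for r in resp)                                   # expected # from bag u
            w_c = sum(r[u] for r, d in zip(resp, candies) if d == 'C')      # expected # of C from bag u
            theta[u] = w_c / w_u
            pi[u] = w_u / len(candies)
    return theta, pi

theta, pi = em_candy(['C', 'C', 'L', 'C', 'L', 'C'])
print(theta, pi)
```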
EM Learning of HMM parameters
• HMM needs EM for parameter learning (unless we know exactly the hidden states at every time instance)
– Need to learn transition and emission parameters.
• E.g.:
– Learning of HMMs for speech modeling.
1. Assume a general (word/language) model.
2. E-step: Recognize (your own) speech using this model (Viterbi decoding).
3. M-step: Tweak parameters to recognize your speech a bit better (ML parameter fitting).
4. Go to 2.