Rutgers CS440, Fall 2003
Introduction to Statistical Learning
Reading: Ch. 20, Sec. 1-4, AIMA 2nd Ed.
Learning under uncertainty
• How to learn probabilistic models such as Bayesian networks, Markov models, HMMs, …?
• Examples:
– Class confusion example: how did we come up with the CPTs?
– Earthquake-burglary network structure?
– How do we learn HMMs for speech recognition?
– Kalman model (e.g., mass, friction) parameters?
– User models encoded as Bayesian networks for HCI?
Hypotheses and Bayesian theory
• Problem:
– Two kinds of candy, lemon and chocolate
– Packed in five types of unmarked bags:
(100% C, 0% L) 10% of the time, (75% C, 25% L) 20% of the time, (50% C, 50% L) 40% of the time, (25% C, 75% L) 20% of the time, (0% C, 100% L) 10% of the time
– Task: open a bag, (unwrap candy, observe it, …), then predict what the next one will be
• Formulation:
– H (Hypothesis): h1 (100,0) or h2 (75,25) or h3 (50,50) or h4 (25,75) or h5 (0,100)
– di (Data): i-th opened candy: L (lemon) or C (chocolate)
– Goal:
• predict di+1 after seeing D = { d0, d1, …, di }
• P( di+1 | D )
Bayesian learning
• Bayesian solution
Estimate probabilities of hypotheses (candy bag types), then predict data (candy type):
– Hypothesis posterior:  P( hi | D ) ∝ P( D | hi ) P( hi )   (data likelihood × hypothesis prior)
– Prediction:  P( di+1 | D ) = Σ_hi P( di+1 | hi ) P( hi | D )
– Data likelihood, with i.i.d. (independently, identically distributed) data points:  P( D | hi ) = P( d0 | hi ) × … × P( di | hi )
Example
• P( hi ) = ?
– P( hi ) = (0.1, 0.2, 0.4, 0.2, 0.1)
• P( di | hi ) = ?
– P( chocolate | h1 ) = 1, P( lemon | h3 ) = 0.5
• P( C, C, C, C, C | h4 ) = ?
– P( C, C, C, C, C | h4 ) = 0.25^5 ≈ 0.001
• P( h5 | C, C, C, C, C ) = ?
– P( h5 | C, C, C, C, C ) ∝ P( C, C, C, C, C | h5 ) P( h5 ) = 0^5 × 0.1 = 0
• P( lemon | C, C, C, C, C ) = ?
– P( lemon | h1 ) P( h1 | C, C, C, C, C ) + … + P( lemon | h5 ) P( h5 | C, C, C, C, C ) =
0*0.6244 + 0.25*0.2963 + 0.50*0.0780 + 0.75*0.0012 + 1*0 = 0.1140
• P( chocolate | C, C, C, C, … ) = ?
– P( chocolate | C, C, C, C, … ) → 1 (the posterior concentrates on h1; see the sketch below)
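A minimal Python sketch of this Bayesian update for the candy problem (function and variable names are our own, not from the slides); running it reproduces the posterior weights and the 0.1140 prediction above.

```python
# Bayesian prediction for the candy problem (a sketch; encoding of candies as 'C'/'L' is ours).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]           # P(h1)..P(h5)
p_choc = [1.0, 0.75, 0.5, 0.25, 0.0]         # P(chocolate | h_i)

def posterior(data):
    """P(h_i | D) for a sequence of observations like ['C', 'C', 'L']."""
    post = []
    for prior, pc in zip(priors, p_choc):
        lik = 1.0
        for d in data:
            lik *= pc if d == 'C' else (1.0 - pc)
        post.append(prior * lik)
    z = sum(post)
    return [p / z for p in post]

def predict_lemon(data):
    """P(next candy is lemon | D) = sum_i P(lemon | h_i) P(h_i | D)."""
    return sum((1.0 - pc) * p for pc, p in zip(p_choc, posterior(data)))

print(posterior(['C'] * 5))      # ~ [0.624, 0.296, 0.078, 0.001, 0.0]
print(predict_lemon(['C'] * 5))  # ~ 0.114
```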
Bayesian prediction properties
• True hypothesis eventually dominates
• Bayesian prediction is optimal (minimizes prediction error)
• Comes at a price: usually many hypotheses, intractable summation
Approximations to Bayesian prediction
• MAP – Maximum a posteriori
P( d | D ) = P( d | hMAP ),  hMAP = arg max_hi P( hi | D )
(easier to compute)
• Role of prior, P(hi): penalizes complex hypotheses
• ML – Maximum likelihood
P( d | D ) = P( d | hML ),  hML = arg max_hi P( D | hi )
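A short sketch contrasting these approximations with the full Bayesian average on the candy example above (hypothesis indexing is ours): after five chocolates both hMAP and hML are h1, so their prediction of lemon is 0, while the Bayesian prediction is about 0.114.

```python
# MAP/ML vs. full Bayesian prediction after observing five chocolates (a sketch).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]                        # P(h1)..P(h5)
p_choc = [1.0, 0.75, 0.5, 0.25, 0.0]                      # P(chocolate | h_i)
post = [pr * pc ** 5 for pr, pc in zip(priors, p_choc)]   # unnormalized P(h_i | C,C,C,C,C)
z = sum(post)
post = [p / z for p in post]

h_map = max(range(5), key=lambda i: post[i])              # MAP hypothesis: index 0 (h1)
h_ml = max(range(5), key=lambda i: p_choc[i] ** 5)        # ML hypothesis: also h1 (ignores the prior)
print(1.0 - p_choc[h_map])                                # MAP/ML prediction of lemon: 0.0
print(sum((1 - pc) * p for pc, p in zip(p_choc, post)))   # Bayesian prediction: ~0.114
```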
Learning from complete data
• Learn parameters of Bayesian models from data
– e.g., learn probabilities of C & L for a bag of candy whose proportions of C & L are unknown by observing opened candy from that bag
• Candy problem parameters:
– θ_u – probability of C in bag u
– π_u – probability of bag u

Candy   Parameter
C       θ_u
L       1 − θ_u

Bag   Parameter
1     π_1
2     π_2
3     π_3
4     π_4
5     π_5

(Figure: bag u with its observed candies)
ML Learning from complete data
• ML approach: select model parameters to maximize likelihood of seen data
1. Assume a distribution model that determines how the samples (of candy) are distributed in a bag
2. Select parameters of the model that maximize the likelihood of the seen data
Likelihood (i.i.d. samples):
P( D | h_θu ) = Π_{i=1..N} P( d_i | h_θu )

Log-likelihood:
L( D | h_θu ) = log P( D | h_θu ) = Σ_{i=1..N} log P( d_i | h_θu )

Model: binomial:
P( d_i = C | h_θu ) = θ_u ,  P( d_i = L | h_θu ) = 1 − θ_u ,  i.e.  P( d_i | h_θu ) = θ_u^{1(d_i=C)} (1 − θ_u)^{1(d_i=L)}
Maximum likelihood learning (binomial distribution)
• How to find a solution to the above problem?
θ_u* = arg max_θu P( D | h_θu ) = arg max_θu L( D | h_θu )
     = arg max_θu Σ_{i=1..N} log [ θ_u^{1(d_i=C)} (1 − θ_u)^{1(d_i=L)} ]
     = arg max_θu Σ_{i=1..N} [ 1(d_i=C) log θ_u + 1(d_i=L) log(1 − θ_u) ]
Maximum likelihood learning (cont’d)
• Take the first derivative of (log) likelihood and set it to zero
∂L( D | h_θu ) / ∂θ_u = Σ_{i=1..N} [ 1(d_i=C) / θ_u* − 1(d_i=L) / (1 − θ_u*) ] = 0

(1 − θ_u*) Σ_{i=1..N} 1(d_i=C) = θ_u* Σ_{i=1..N} 1(d_i=L)

θ_u* = Σ_{i=1..N} 1(d_i=C) / [ Σ_{i=1..N} 1(d_i=C) + Σ_{i=1..N} 1(d_i=L) ] = (1/N) Σ_{i=1..N} 1(d_i=C)
• Counting!
Sample   d=C   d=L
1        1     0
2        0     1
…        …     …
N        1     0
Total:   Σ_{i=1..N} 1(d_i=C)   Σ_{i=1..N} 1(d_i=L)
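A tiny sketch of this counting recipe, assuming the observed candies are encoded as a list of 'C'/'L' symbols (our encoding, not the slides'):

```python
# ML estimate of theta_u = P(C | bag u): just the fraction of chocolates observed.
def ml_theta(candies):
    """candies: list of observed candies from one bag, e.g. ['C', 'L', 'C', 'C']."""
    n_c = sum(1 for d in candies if d == 'C')     # sum_i 1(d_i = C)
    return n_c / len(candies)                     # N = count(C) + count(L)

print(ml_theta(['C', 'L', 'C', 'C']))             # 0.75
```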
Naïve Bayes model
• One set of causes, multiple independent sources of evidence
(Figure: Naïve Bayes graph: class node C with children E1, E2, …, EN)

• Example: C ∈ { spam, not spam }, Ei ∈ { token i present, token i absent }

P( E1, E2, …, EN, C ) = [ Π_{i=1..N} P( Ei | C ) ] P( C )
• Limiting assumption, often works well in practice
Inference & Decision in NB model
• Inference
P( C | E1, E2, …, EN ) ∝ [ Π_{i=1..N} P( Ei | C ) ] P( C )
log P( C | E1, E2, …, EN ) ∝ Σ_{i=1..N} log P( Ei | C ) + log P( C )
(hypothesis (class) score = evidence score + prior score)
• Decision
P( C = SPAM | E1, E2, …, EN )  ≷  P( C = NOT_SPAM | E1, E2, …, EN )

log P( C = SPAM | E1, E2, …, EN ) − log P( C = NOT_SPAM | E1, E2, …, EN )  ≷  0

Σ_{i=1..N} log [ P( Ei | C = SPAM ) / P( Ei | C = NOT_SPAM ) ] + log [ P( C = SPAM ) / P( C = NOT_SPAM ) ]  ≷  0   (log odds ratio)
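A minimal sketch of this log-odds decision rule; the token CPTs and prior below are made-up illustrative numbers, not values from the slides.

```python
import math

# Illustrative (made-up) CPTs: P(token i present | class), for 3 tokens.
p_token_spam = [0.8, 0.6, 0.1]
p_token_not_spam = [0.2, 0.3, 0.4]
p_spam = 0.4                                                  # P(C = SPAM)

def log_odds(evidence):
    """evidence: list of 0/1 token indicators; returns log posterior odds for SPAM."""
    score = math.log(p_spam) - math.log(1.0 - p_spam)         # prior score
    for e, ps, pn in zip(evidence, p_token_spam, p_token_not_spam):
        ps_e = ps if e else 1.0 - ps                          # P(E_i = e | SPAM)
        pn_e = pn if e else 1.0 - pn                          # P(E_i = e | NOT_SPAM)
        score += math.log(ps_e) - math.log(pn_e)              # evidence score
    return score

msg = [1, 1, 0]
print("SPAM" if log_odds(msg) > 0 else "NOT_SPAM")
```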
Learning in NB models
• Example:
Given a set of K email messages, each with tokens D = { dj = (e1j, …, eNj) }, eij ∈ {0,1}, and labels C = { cj } (SPAM or NOT_SPAM), find the best set of CPTs P(Ei|C) and P(C)
Assume: P(Ei|C=c) is binomial with parameter θ_{i,c}, P(C) is binomial with parameter π_C
P( Ei = ei | C = c ) = θ_{i,c}^{ei} (1 − θ_{i,c})^{1 − ei}
P( C = c ) = π_C^{c} (1 − π_C)^{1 − c}
• ML learning: maximize likelihood of K messages, each one in one of the two classes
max over { θ_{1,0}, …, θ_{N,1}, π_C } (2N+1 parameters) of the log-likelihood of (D, C):
Σ_{j=1..K} [ Σ_{i=1..N} ( eij log θ_{i,cj} + (1 − eij) log(1 − θ_{i,cj}) ) + cj log π_C + (1 − cj) log(1 − π_C) ]
where cj is the label of message j and eij indicates whether token i in message j is present/absent.

ML solution (counting):
π_C* = (1/K) Σ_{j=1..K} cj
θ_{i,c}* = Σ_{j=1..K} 1(cj = c) eij / Σ_{j=1..K} 1(cj = c)
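A sketch of this ML learning as counting, with messages encoded as 0/1 token-indicator vectors and labels as 0/1 (the encoding is ours; it assumes both classes appear in the training set).

```python
# ML estimates for the Naive Bayes spam model from labelled messages.
# messages: list of 0/1 token-indicator vectors e_j; labels: list of 0/1 class labels c_j.
def nb_ml_fit(messages, labels):
    K = len(messages)
    N = len(messages[0])
    pi = sum(labels) / K                                      # P(C = 1)
    theta = [[0.0, 0.0] for _ in range(N)]                    # theta[i][c] = P(E_i = 1 | C = c)
    for c in (0, 1):
        idx = [j for j, cj in enumerate(labels) if cj == c]   # messages of class c
        for i in range(N):
            theta[i][c] = sum(messages[j][i] for j in idx) / len(idx)
    return theta, pi

theta, pi = nb_ml_fit([[1, 0, 1], [1, 1, 0], [0, 0, 1]], [1, 1, 0])
print(theta, pi)
```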
Learning of Bayesian network parameters
• Naïve Bayes learning can be extended to BNs! How?
• Model each CPT as a binomial/multinomial distribution. Maximize likelihood of data given the BN.
(Figure: earthquake–burglary network: earthquake and burglary are parents of alarm, with newscast and call as evidence nodes)
P( E = e ) = θ_E^{e} (1 − θ_E)^{1 − e}
P( B = b ) = θ_B^{b} (1 − θ_B)^{1 − b}
P( A = a | B = b, E = e ) = θ_{A|be}^{a} (1 − θ_{A|be})^{1 − a}
Sample E B A N C
1 1 0 0 1 0
2 1 0 1 1 0
3 1 1 0 0 1
4 0 1 0 1 1
θ_{A|be}* = Σ_{i=1..N} 1( Ai = 1, Bi = b, Ei = e ) / Σ_{i=1..N} 1( Bi = b, Ei = e )
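A small sketch that reproduces this counting estimate from the complete-data table above (the tuple encoding of samples is ours):

```python
# ML estimate of P(A=1 | B=b, E=e) from fully observed samples (E, B, A, N, C).
samples = [
    (1, 0, 0, 1, 0),
    (1, 0, 1, 1, 0),
    (1, 1, 0, 0, 1),
    (0, 1, 0, 1, 1),
]

def theta_A_given(b, e):
    match = [s for s in samples if s[1] == b and s[0] == e]   # samples with B=b, E=e
    if not match:
        return None                                           # no data for this parent setting
    return sum(s[2] for s in match) / len(match)              # fraction of them with A=1

print(theta_A_given(b=0, e=1))   # from the table: samples 1 and 2 -> 1/2
```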
BN Learning (cont’d)
• Issues:
1. Priors on parameters. What if Σ_{i=1..N} 1( Ai = 1, Bi = b, Ei = e ) = 0? Should we trust it?
Maybe always add some small pseudo-count α_{be} > 0 to the count:  Σ_{i=1..N} 1( Ai = 1, Bi = b, Ei = e ) + α_{be}  (a smoothing sketch follows this list)
2. How do we learn a BN graph (structure)?
Test all possible structures, then pick the one with the highest data likelihood?
3. What if we do not observe some nodes (evidence not on all nodes)?
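A sketch of the pseudo-count fix from issue 1, which amounts to Laplace/Dirichlet-style smoothing (the default α = 1 is an arbitrary choice, not from the slides):

```python
# Smoothed estimate of P(A=1 | B=b, E=e): add a pseudo-count alpha to both outcomes.
def theta_A_given_smoothed(samples, b, e, alpha=1.0):
    match = [s for s in samples if s[1] == b and s[0] == e]   # samples with B=b, E=e
    n1 = sum(s[2] for s in match)                             # how many of them have A=1
    return (n1 + alpha) / (len(match) + 2 * alpha)            # never exactly 0, even with no data

print(theta_A_given_smoothed([], b=1, e=0))   # 0.5: with no data we fall back to the pseudo-counts
```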
Learning from incomplete data
• Example:
– In the alarm network, we received data where we only know Newscast, Call, Earthquake, Burglary, but have no idea what the Alarm state is.
– In the SPAM model, we do not know if a message is spam or not (missing label).

Sample  E  B  A    N  C
1       1  0  N/A  1  0
2       1  0  N/A  1  0
3       1  1  N/A  0  1
4       0  1  N/A  1  1

• Solution?
We can still try to find network parameters that maximize the likelihood of the incomplete data.
θ* = arg max_θ L( D | θ ) = arg max_θ Σ_{i=1..N} log P( d_i | θ ) = arg max_θ Σ_{i=1..N} log Σ_h P( d_i, h | θ )
(h: hidden variable)
Completing the data
• Maximizing incomplete data likelihood is tricky.
• If we could, somehow, complete the data, we would know how to select model parameters that maximize the completed-data likelihood.
• How do we complete the missing data?
1. Randomly complete?
2. Estimate missing data from evidence, P( h | Evidence ).
Sample  E  B  A                               N  C
1.0     1  0  P( a=0 | E=1, B=0, N=1, C=0 )   1  0
1.1     1  0  P( a=1 | E=1, B=0, N=1, C=0 )   1  0
2       1  0  …                               1  0
3       1  1  …                               0  1
4       0  1  …                               1  1
EM Algorithm
• With completed data, Dc, maximize completed (log)likelihood by weighting contribution from each sample with P(h|d)
θ* = arg max_θ L( Dc | θ ) = arg max_θ Σ_{i=1..N} Σ_h P( h | d_i ) log P( d_i, h | θ )

For the alarm network with hidden A, for example:
θ_{A|be}* = Σ_{i=1..N} 1( Bi = b, Ei = e ) P( Ai = 1 | Ei = ei, Bi = bi, Ni = ni, Ci = ci )
            / Σ_{i=1..N} Σ_{a ∈ {0,1}} 1( Bi = b, Ei = e ) P( Ai = a | Ei = ei, Bi = bi, Ni = ni, Ci = ci )
• E(xpectation) M(aximization) algorithm (a minimal loop sketch follows below):
1. Pick initial parameter estimates θ_0.
2. error = Inf;
3. While (error > max_error)
1. E-step: Complete data, Dc, based on θ_{k−1}.
2. M-step: Compute new parameters θ_k that maximize the completed-data likelihood.
3. error = L( D | θ_k ) − L( D | θ_{k−1} )
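A minimal Python skeleton of this loop, assuming hypothetical model-specific callbacks e_step, m_step, and log_likelihood (these names are ours, not from the slides):

```python
# Generic EM loop (a sketch). The callbacks supply the model-specific computations.
def em(data, theta0, e_step, m_step, log_likelihood, tol=1e-6, max_iter=100):
    theta = theta0
    prev_ll = float("-inf")
    for _ in range(max_iter):
        weights = e_step(data, theta)        # E-step: P(h | d_i, theta) for each sample
        theta = m_step(data, weights)        # M-step: maximize the completed-data likelihood
        ll = log_likelihood(data, theta)
        if ll - prev_ll < tol:               # the likelihood is non-decreasing under EM
            break
        prev_ll = ll
    return theta
```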
EM Example
• Candy problem, but now we do not know which bag the candy came from (bag label missing).
• E-step: compute the posterior of each bag type for every observed candy
P( u | d_i, θ^{k−1}, π^{k−1} ) ∝ P( d_i | u, θ^{k−1}_u ) P( u | π^{k−1}_u )

• M-step: maximize the expected completed-data log-likelihood
L_c(θ_u) = Σ_{i=1..N} Σ_{u=1..5} P( u | d_i, θ^{k−1}, π^{k−1} ) [ 1(d_i = C) log θ_u + 1(d_i = L) log(1 − θ_u) ]

∂L_c/∂θ_u = Σ_{i=1..N} P( u | d_i, θ^{k−1}, π^{k−1} ) [ 1(d_i = C)/θ_u − 1(d_i = L)/(1 − θ_u) ] = 0

⇒ θ^k_u = Σ_{i=1..N} 1(d_i = C) P( u | d_i, θ^{k−1}, π^{k−1} ) / Σ_{i=1..N} P( u | d_i, θ^{k−1}, π^{k−1} )   (candy (C) probability in bag u)

π^k_u = (1/N) Σ_{i=1..N} P( u | d_i, θ^{k−1}, π^{k−1} )   (prior probability of bag u)
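A concrete Python sketch of these E- and M-steps for the candy problem with hidden bag labels (the initial parameter values and the 'C'/'L' encoding are our own choices):

```python
# EM for the candy problem when the bag label is hidden (a sketch).
def em_candy(candies, n_iter=50):
    theta = [0.9, 0.7, 0.5, 0.3, 0.1]                 # initial P(C | bag u), arbitrary
    pi = [0.2] * 5                                    # initial P(bag u), arbitrary
    for _ in range(n_iter):
        # E-step: responsibilities P(u | d_i, theta, pi) for every candy.
        resp = []
        for d in candies:
            w = [pi[u] * (theta[u] if d == 'C' else 1 - theta[u]) for u in range(5)]
            z = sum(w)
            resp.append([wi / z for wi in w])
        # M-step: re-estimate parameters from the weighted (completed) data.
        for u in range(5):
            w_u = sum(r[u] for r in resp)                                   # expected # from bag u
            w_c = sum(r[u] for r, d in zip(resp, candies) if d == 'C')      # expected # of C from bag u
            theta[u] = w_c / w_u
            pi[u] = w_u / len(candies)
    return theta, pi

theta, pi = em_candy(['C', 'C', 'L', 'C', 'L', 'C'])
print(theta, pi)
```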
EM Learning of HMM parameters
• HMM needs EM for parameter learning (unless we know exactly the hidden states at every time instance)
– Need to learn transition and emission parameters.
• E.g.:
– Learning of HMMs for speech modeling.
1. Assume a general (word/language) model.
2. E-step: Recognize (your own) speech using this model (Viterbi decoding).
3. M-step: Tweak parameters to recognize your speech a bit better (ML parameter fitting).
4. Go to 2.