©Jan-19 Christopher W. Clifton
Materials adapted from Profs. Jennifer Neville and Dan Goldwasser
CS37300:
Data Mining & Machine Learning
Naïve Bayes
Prof. Chris Clifton
13 February 2020
Classification as probability estimation
Example model: Naive Bayes classifiers
• Instead of learning a function f that assigns labels, learn a conditional probability distribution over the output of function f:
• P( f(x) = y | x ) = P( f(x) = y | x1, x2, …, xp )
• Can use probabilities for other tasks
– Classification
– Ranking
Knowledge representation and model space:
Bayes rule for probabilistic classifier

Bayes rule:

P(y|x) = P(x|y) P(y) / P(x)

Denominator P(x): normalizing factor to make the probabilities sum to 1
(can be computed from the numerators by summing over y)
• P(y): the prior probability of a label y. Reflects background knowledge, before any data is observed. If there is no information, use a uniform distribution.
• P(x): the probability that this sample x of the data is observed (with no knowledge of the label).
• P(x|y): the probability of observing the sample x, given that the label y is the target (the likelihood).
• P(y|x): the posterior probability of y. The probability that y is the target, given that x has been observed.
Bayes rule for probabilistic classifier
Check your intuition:
P(y|x) increases with P(y) and with P(x|y)
P(y|x) decreases with P(x)
Bayes rule for probabilistic classifier
• The learner considers a set of candidate labels and attempts to find the most probable one y ∈ Y, given the observed data.
• Such a maximally probable assignment is called the maximum a posteriori (MAP) assignment; Bayes theorem is used to compute it:

y_MAP = argmax_{y∈Y} P(y|x) = argmax_{y∈Y} P(x|y) P(y) / P(x)
      = argmax_{y∈Y} P(x|y) P(y)

since P(x) is the same for all y ∈ Y.
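A minimal sketch of the MAP decision rule in Python (the priors and likelihoods below are made-up numbers, not from the slides):

```python
# MAP decision: pick the label y maximizing P(x|y) * P(y).
# P(x) is ignored since it is the same for every candidate label.
priors = {"spam": 0.3, "ham": 0.7}           # P(y), assumed values
likelihoods = {"spam": 0.020, "ham": 0.001}  # P(x|y) for one observed x

y_map = max(priors, key=lambda y: likelihoods[y] * priors[y])
print(y_map)  # "spam": 0.020*0.3 = 0.006 > 0.001*0.7 = 0.0007
```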
Bayes rule for probabilistic classifier
• How can we compute P(y|x)?
– Basic idea: represent the instance as a set of features x1, x2, …, xn

y_MAP = argmax_{y∈Y} P(y|x) = argmax_{y∈Y} P(y | x1, x2, …, xn)
      = argmax_{y∈Y} P(x1, x2, …, xn | y) P(y) / P(x1, x2, …, xn)
      = argmax_{y∈Y} P(x1, x2, …, xn | y) P(y)
Bayes rule for probabilistic classifier
• Given training data, we can estimate the two terms.
• Estimating P(y) is easy: for each label value y, count how many times it appears in the training data.
• However, it is not feasible to estimate P(x1, …, xn | y) directly
– we would have to estimate, for each target value, the probability of every possible instance (most of which never occur in the data).
• In order to use a Bayesian classifier in practice, we need to make assumptions that will allow us to estimate these quantities.

y_MAP = argmax_{y∈Y} P(x1, x2, ..., xn | y) P(y)

Question: Assume binary xi's. How many parameters does the model require?
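A quick sanity check on that question (a sketch, assuming binary features and a binary label; the full joint P(x1, …, xn | y) needs 2^n − 1 free parameters per class, while the naive Bayes count anticipates the assumption introduced on the next slide):

```python
n = 20  # number of binary features

# Full joint P(x1,...,xn | y): (2**n - 1) free parameters per class, plus the prior
full_joint = 2 * (2**n - 1) + 1

# Naive Bayes: one Bernoulli parameter per feature per class, plus the prior
naive_bayes = 2 * n + 1

print(full_joint, naive_bayes)  # 2097151 vs. 41
```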
NB: Independence Assumptions
Conditional Independence:
Assume the feature probabilities are independent given the label (the naive assumption):

P(xi | x1, …, xi−1, yj) = P(xi | yj), so P(x1, …, xn | yj) = ∏_{i=1..n} P(xi | yj)

Combining this with Bayes rule gives the naive Bayes classifier:

y_NB = argmax_{y∈Y} P(y) ∏_{i=1..n} P(xi | y)

Question: How many parameters do we need to estimate now?
Is assuming independence a problem?

X1   X2 | P(Y=0|X1,X2)   P(Y=1|X1,X2)
 0    0 |      1               0
 0    1 |      0               1
 1    0 |      0               1
 1    1 |      1               0

Y = XOR(X1, X2)
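To see the problem concretely, here is a small sketch that fits naive Bayes by counting on the four XOR rows above; every conditional comes out 0.5, so the model scores both classes identically on every instance:

```python
from collections import Counter

# The four XOR cases above, equally likely
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

label_counts = Counter(y for _, y in data)
p_y = {y: c / len(data) for y, c in label_counts.items()}  # P(Y=y) = 0.5 each

# MLE by counting: P(X_i = v | Y = y)
p_x = {}
for i in range(2):
    for v in (0, 1):
        for y in (0, 1):
            match = sum(1 for x, lab in data if x[i] == v and lab == y)
            p_x[(i, v, y)] = match / label_counts[y]

def nb_score(x, y):
    """Un-normalized naive Bayes score: P(y) * prod_i P(x_i | y)."""
    s = p_y[y]
    for i, v in enumerate(x):
        s *= p_x[(i, v, y)]
    return s

for x, _ in data:
    print(x, nb_score(x, 0), nb_score(x, 1))  # always 0.125 vs 0.125: XOR is lost
```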
• Let’s consider the spam classification problem
• Is NB an appropriate model to use?
– Does the conditional independence assumption hold for this
problem?
• However, NB is frequently (and successfully) used for spam detection!
• Why does it succeed?
NBC learning
NBC parameters = CPDs + prior
CPDs: P(A | BC), P(I | BC), P(S | BC), P(CR | BC)
Prior: P(BC)
(A = age, I = income, S = student, CR = credit_rating, BC = buys_computer)
Score Function: Likelihood

• Let D = {x(1), x(2), …, x(N)} be the observed data and Θ the model parameters.
• Assume the data D are independently sampled from the same distribution.
• The likelihood function represents the probability of the data as a function of the model parameters:

L(Θ) = P(D | Θ) = ∏_{i=1..N} P(x(i) | Θ)

If instances are independent, the likelihood is the product of the individual probabilities.
Likelihood (cont’)
• Likelihood is not a probability distribution– Gives relative probability of data given a parameter
– Numerical value of L is not relevant, only the ratio of two scores is relevant, e.g.,:
• Likelihood function: allows us to determine unknown parameters based on known outcomes
• Probability distribution: allows us to predict unknown outcomes based on known parameters
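A small illustration of the "only ratios matter" point (a sketch; the 7-heads-in-10-flips data is made up):

```python
# Likelihood of a coin's heads-probability theta after seeing 7 heads in 10 flips.
def likelihood(theta, heads=7, flips=10):
    return theta**heads * (1 - theta)**(flips - heads)

# The absolute values are tiny and not meaningful on their own...
print(likelihood(0.7), likelihood(0.5))   # ~0.00222 vs ~0.00098
# ...but their ratio says theta=0.7 explains the data ~2.3x better than theta=0.5.
print(likelihood(0.7) / likelihood(0.5))
```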
NBCs: Likelihood

• The NBC likelihood uses the NBC probabilities for each data instance (i.e., the probability of the class given the attributes):

L(Θ) = ∏_{i=1..N} P(y(i) | x(i))                                  (general likelihood)
     = ∏_{i=1..N} P(x(i) | y(i)) P(y(i)) / P(x(i))                (Bayes rule)
     = ∏_{i=1..N} P(y(i)) ∏_{j=1..n} P(xj(i) | y(i)) / P(x(i))    (naive assumption)
Search: Maximum likelihood estimation

• Most widely used method of parameter estimation
• "Learn" the best parameters by finding the values of Θ that maximize the likelihood:

Θ_MLE = argmax_Θ L(Θ)

• Often easier to work with the log-likelihood:

ℓ(Θ) = log L(Θ) = Σ_{i=1..N} log P(x(i) | Θ)
Likelihood surface

[Figure: likelihood surface L(Θ) over the parameter space]

If the likelihood surface is convex, we can often determine the parameters that maximize the function analytically.
MLE for multinomials

• Let X ∈ {1, ..., k} be a discrete random variable with k values, where P(X = j) = θj
• Then P(X) is a multinomial distribution:

P(X = x) = ∏_{j=1..k} θj^I(x=j)

where I(X = j) is an indicator function
• The likelihood for a data set D = [x1, ..., xN] is:

L(θ) = ∏_{i=1..N} ∏_{j=1..k} θj^I(xi=j) = ∏_{j=1..k} θj^Nj, where Nj = Σ_i I(xi = j)

• The maximum likelihood estimates for each parameter (derived using Lagrange multipliers to enforce Σj θj = 1) are:

θ̂j = Nj / N

In this case, the MLE can be determined analytically by counting.
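A minimal sketch of the counting estimate θ̂j = Nj / N on a made-up sample with k = 3 values:

```python
from collections import Counter

D = [1, 3, 2, 3, 3, 1, 2, 3]          # toy data set, X in {1, 2, 3}
N = len(D)
N_j = Counter(D)                       # N_j = number of times value j occurs

theta_hat = {j: N_j[j] / N for j in sorted(N_j)}
print(theta_hat)                       # {1: 0.25, 2: 0.25, 3: 0.5}
```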
Learning CPDs from examples
X1:       Low   Medium   High
Y = Yes:   10     13      17
Y = No:     2     13       0

P[ X1 = Low | Y = Yes ] =
P[ Y = No ] =
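Filling in the exercise numerically (a sketch, using the counts as read from the table above):

```python
counts = {"Yes": {"Low": 10, "Medium": 13, "High": 17},
          "No":  {"Low": 2,  "Medium": 13, "High": 0}}
n_yes = sum(counts["Yes"].values())    # 40 examples with Y=Yes
n_no = sum(counts["No"].values())      # 15 examples with Y=No

print(counts["Yes"]["Low"] / n_yes)    # P[X1=Low | Y=Yes] = 10/40 = 0.25
print(n_no / (n_yes + n_no))           # P[Y=No] = 15/55 ~ 0.273
```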
NBC learning
• Estimate the prior P(BC) and the conditional probability distributions P(A | BC), P(I | BC), P(S | BC), P(CR | BC) independently with maximum likelihood estimation

[Figure: estimated CPD tables P(A | BC), P(I | BC), P(S | BC), P(CR | BC) and the prior P(BC)]
NBC prediction

• What is the probability that a new person will buy a computer?

New instance: age = 31..40, income = high, student = no, credit_rating = excellent, buys_computer = ?

[Figure: the learned CPDs P(A | BC), P(I | BC), P(S | BC), P(CR | BC) and the prior P(BC) are used for the prediction]
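For comparison, a sketch of the same pipeline using scikit-learn's CategoricalNB; the training rows and the integer encodings of the attribute values below are made up for illustration:

```python
from sklearn.naive_bayes import CategoricalNB

# Assumed encodings: age {<=30:0, 31..40:1, >40:2}, income {low:0, medium:1, high:2},
# student {no:0, yes:1}, credit_rating {fair:0, excellent:1}; y = buys_computer.
X = [[0, 2, 0, 0], [1, 2, 0, 0], [2, 1, 0, 0], [2, 0, 1, 0], [1, 0, 1, 1], [0, 1, 1, 0]]
y = [0, 1, 1, 1, 0, 1]

clf = CategoricalNB(alpha=1.0)   # alpha=1.0 is exactly the Laplace correction below
clf.fit(X, y)

# The query instance from the slide: 31..40, high income, not a student, excellent credit
print(clf.predict([[1, 2, 0, 1]]), clf.predict_proba([[1, 2, 0, 1]]))
```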
Zero counts are a problem

• If an attribute value does not occur in the training examples, we assign zero probability to that value
• How does that affect the conditional probability P[ f(x) | x ]?
• It equals 0!!! A single unseen attribute value wipes out the entire product.
• Why is this a problem?
• Adjust for zero counts by "smoothing" the probability estimates
Smoothing: Laplace correction (adds a uniform prior)

X1:       Low   Medium   High
Y = Yes:   10     13      17
Y = No:     2     13       0

P[ X1 = High | Y = No ] =

Laplace correction:
Numerator: add 1
Denominator: add k, where k = number of possible values of X

P[ X = v | Y = y ] = ( N(X = v, Y = y) + 1 ) / ( N(Y = y) + k )
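A one-line version of the correction (a sketch; the example counts mirror the zero-count cell in the table above):

```python
def laplace_estimate(n_value_and_label, n_label, k):
    """P[X = v | Y = y] with the Laplace correction:
    add 1 to the numerator and k (number of possible values of X) to the denominator."""
    return (n_value_and_label + 1) / (n_label + k)

# X1=High never occurs with Y=No (0 of 15 examples), but with k=3 values of X1
# the smoothed estimate is small rather than zero:
print(laplace_estimate(0, 15, 3))   # 1/18 ~ 0.056
```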
Naive Bayes classifier

• Simplifying (naive) assumption: attributes are conditionally independent given the class
• Strengths:
– Easy to implement
– Often performs well even when the assumption is violated
– Can be learned incrementally
• Weaknesses:
– The class-conditional independence assumption produces skewed probability estimates
– Dependencies among variables cannot be modeled
NBC learning

• Model space
– Parametric model with a specific form (i.e., based on Bayes rule and the assumption of conditional independence)
– Models vary based on the parameter estimates in the CPDs
• Search algorithm
– MLE optimization of parameters (convex optimization results in an exact solution)
• Scoring function
– Likelihood of the data given the NBC model form
Compare NBC to DT to KNN
• Hypothesis space
– What type of functions are used? Which one is more expressive?
• Scoring function
– How is each model scored?
• Search
– What search procedure is used?
– Are we guaranteed to find the optimal model?
– What is the complexity of the search procedure?
Numerical Stability

• Recall that the NB classifier multiplies one probability per feature:
– Multiplying many probabilities can get us into trouble!
– Imagine computing the probability of 2000 independent coin flips
– In most programming environments: (0.5)^2000 = 0 (underflow)
Numerical Stability

• Our problem: underflow prevention
• Recall: log(xy) = log(x) + log(y)
• It is better to sum logs of probabilities than to multiply the probabilities themselves.
• The class with the highest final un-normalized log probability score is still the most probable:

c_NB = argmax_{cj∈C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]
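The 2000-coin-flips example, verified in Python (a sketch):

```python
import math

probs = [0.5] * 2000

product = 1.0
for p in probs:
    product *= p
print(product)    # 0.0 -- the true value ~8.7e-603 underflows double precision

# Summing logs instead keeps the score finite and preserves the argmax:
log_score = sum(math.log(p) for p in probs)
print(log_score)  # -1386.29...
```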
Naïve Bayes: Two classes

• Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier:

y_NB = argmax_{y∈Y} P(y) ∏_{i=1..n} P(xi | y)

• In the case of two classes, y ∈ {0, 1}, we predict y = 1 iff:

[ P(y=1) ∏_{i=1..n} P(xi | y=1) ] / [ P(y=0) ∏_{i=1..n} P(xi | y=0) ] > 1

• Denote pi = P(xi = 1 | y = 1) and qi = P(xi = 1 | y = 0). For binary features xi, the condition becomes:

[ P(y=1) ∏_{i=1..n} pi^xi (1−pi)^(1−xi) ] / [ P(y=0) ∏_{i=1..n} qi^xi (1−qi)^(1−xi) ] > 1
Naïve Bayes: Two classes

Take the logarithm; we predict y = 1 iff:

log[ P(y=1) / P(y=0) ] + Σ_{i=1..n} log[ (1−pi) / (1−qi) ] + Σ_{i=1..n} ( log[ pi/(1−pi) ] − log[ qi/(1−qi) ] ) xi > 0

• We get that naive Bayes is a linear separator with weights:

wi = log[ pi/(1−pi) ] − log[ qi/(1−qi) ] = log[ pi(1−qi) / (qi(1−pi)) ]

If pi = qi then wi = 0 and the feature is irrelevant.
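A short sketch of the weights derived above, with made-up pi, qi values; note the second feature gets weight 0 because p2 = q2:

```python
import math

def nb_linear_separator(p, q, prior1, prior0):
    """Weights w_i and bias b of the linear separator induced by two-class
    naive Bayes with binary features: predict y=1 iff b + sum_i w_i*x_i > 0."""
    w = [math.log(pi / (1 - pi)) - math.log(qi / (1 - qi)) for pi, qi in zip(p, q)]
    b = math.log(prior1 / prior0) + sum(
        math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))
    return w, b

w, b = nb_linear_separator(p=[0.8, 0.5], q=[0.2, 0.5], prior1=0.5, prior0=0.5)
print(w, b)   # w = [~2.77, 0.0], b = ~-1.39; feature 2 is irrelevant since p2 == q2
```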