Page 1: Probabilistic Models

• Produce probabilistic outcomes
• Given a feature vector F = (f1 = v1 ^ … ^ fn = vn)
• Output the probability P(category = positive | F), where “|” is read “given”

© Jude Shavlik 2006, David Page 2010
CS 760 – Machine Learning (UW-Madison)

Page 2: Brute-Force Estimation

• Observe the output many times for each possible setting of the features in F
• Estimate P(Category = ci | F = {v1, …, vn}) by a big lookup table – i.e., just count
• Problems with this approach?
  • Too many possible settings for F
  • The training set is only a sample (new settings for F will appear in the test set)
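The lookup-table idea can be made concrete in a few lines. This is a minimal sketch of my own, not code from the course, and it also shows the failure mode the last bullet describes: an unseen feature setting has no counts at all.

```python
from collections import Counter, defaultdict

def fit_lookup_table(examples):
    """examples: list of (feature_tuple, category) pairs."""
    counts = defaultdict(Counter)            # feature setting -> category counts
    for features, category in examples:
        counts[features][category] += 1
    return counts

def predict_proba(counts, features, category):
    row = counts.get(tuple(features))
    if row is None:                          # unseen setting: no estimate available
        return None
    return row[category] / sum(row.values())

# Toy usage
data = [(("red", "big"), "+"), (("red", "big"), "+"), (("red", "big"), "-")]
table = fit_lookup_table(data)
print(predict_proba(table, ("red", "big"), "+"))     # 2/3
print(predict_proba(table, ("blue", "small"), "+"))  # None: never observed
```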

Page 3: Bayesian Networks

• Bayesian networks (BNs) compactly represent the full joint probability distribution
• We will cover Bayes-net learning in more detail in the next lecture, but we begin today with the simplest Bayes net
• Take CS 731, CS 776, and/or CS 838 (Zhu) for more on statistical ML (also consider classes in the Stats Dept)

Page 4: Bayes’ Rule

• Bayes’ Rule (in the context of supervised ML):

  P(category | features) = P(features | category) * P(category) / P(features)

Page 5: Derivation of Bayes’ Rule

• Definitions:

  P(A ^ B) = P(A|B) * P(B)
  P(A ^ B) = P(B|A) * P(A)

• So (assuming P(B) > 0):

  P(A|B) * P(B) = P(B|A) * P(A)
  P(A|B) = P(B|A) * P(A) / P(B)

Page 6: Conditional Probabilities

• Note the difference (e.g., when event A covers only a small part of event B, but A implies B):
  • P(A|B) is small
  • P(B|A) is large

Page 7: Marginalizing

• Assume random variable A has possible values a1, …, an
• We can “marginalize out” (“sum out”) A:

  P(B) = Σ (i = 1 to n) P(B | A = ai) * P(A = ai)
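A small sketch (mine, with made-up numbers) that ties pages 4-7 together: the denominator P(features) in Bayes' rule is obtained by marginalizing out the category.

```python
# Toy numbers, chosen only for illustration.
prior = {"pos": 0.6, "neg": 0.4}            # P(category)
likelihood = {"pos": 0.09, "neg": 0.01}     # P(features | category)

# Marginalize out the category: P(features) = sum_c P(features | c) * P(c)
p_features = sum(likelihood[c] * prior[c] for c in prior)

# Bayes' rule: P(category | features)
posterior = {c: likelihood[c] * prior[c] / p_features for c in prior}
print(posterior)   # {'pos': ~0.931, 'neg': ~0.069}
```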

Page 8: Bayesian Learning

• Use training data to estimate (for Naïve Bayes):
  • P(fi = vj | category = POS) for each i, j
  • P(fi = vj | category = NEG) for each i, j
  • P(category = POS)
  • P(category = NEG)

• Apply Bayes’ rule to find:
  • P(category = POS | test example features)
  • P(category = NEG | test example features)
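A minimal sketch, not the course's code, of the estimation step just listed: count feature values per class in the training set to get the conditional probabilities and the class priors (no smoothing yet; that comes on later slides).

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_dict, label). Returns (priors, conditionals)."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)      # (label, feature) -> value counts
    for features, label in examples:
        class_counts[label] += 1
        for f, v in features.items():
            value_counts[(label, f)][v] += 1
    n = sum(class_counts.values())
    priors = {c: class_counts[c] / n for c in class_counts}
    conditionals = {
        key: {v: cnt / class_counts[key[0]] for v, cnt in counter.items()}
        for key, counter in value_counts.items()
    }
    return priors, conditionals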

Page 9: Independence of Events

• If A and B are independent, then
  P(A|B) = P(A)
  P(B|A) = P(B)

• And therefore
  P(A ^ B) = P(A) * P(B)

Page 10: Conditional Independence

• If A and B are conditionally independent given C, then
  P(A|B,C) = P(A|C)
  P(B|A,C) = P(B|C)

• And therefore
  P(A ^ B | C) = P(A|C) * P(B|C)

Page 11: Learning in Probabilistic Setting (Simple Example)

• Suppose we have a thumbtack (with a round flat head and a sharp point) that, when flipped, can land either with the point up (tails) or with the point touching the ground (heads).
• Suppose we flip the thumbtack 100 times, and 70 times it lands on heads. Then we estimate that the probability of heads the next time is 0.7. This is the maximum likelihood estimate.
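A quick numeric check (illustrative, not from the slides) that h/n is the maximizer: scan candidate values of p and compare binomial log-likelihoods for 70 heads out of 100 flips.

```python
import math

h, n = 70, 100

def log_likelihood(p):
    # log of the binomial likelihood, dropping the constant C(n, h) term
    return h * math.log(p) + (n - h) * math.log(1 - p)

candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=log_likelihood)
print(best)   # 0.7, the maximum likelihood estimate h / n
```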

Page 12: The General Maximum Likelihood Setting

• We had a binomial distribution b(n, p) for n = 100, and we wanted a good guess at p.
• We chose the p that would maximize the probability of our observation of 70 heads.
• In general we have a parameterized distribution and want to estimate a parameter (or several parameters): choose the value that maximizes the probability of the data.

Page 13: Frequentist-Bayesian Debate

• The preceding seems backwards: we want to maximize the probability of p, not necessarily of the data (we already have it).
• A frequentist will say this is the best we can do: we can’t talk about the probability of p; it is fixed (though unknown).
• A Bayesian says the probability of p is the degree of belief we assign to it…

Page 14: Fortunately the Two Agree (Almost)

• It turns out that for Bayesians, if our prior belief is that all values of p are equally likely, then after observing the data we’ll assign the highest probability to the maximum likelihood estimate for p.
• But what if our prior belief is different? How do we merge the prior belief with the data to get the best new belief?

Page 15: Encode Prior Beliefs as a Beta Distribution

• (Plot of beta densities; the distribution is parameterized by a and b.)

Page 16: Any Intuition for This?

• For any positive integer y, Γ(y) = (y - 1)!.
• Suppose we use this, and we also replace
  • x with p
  • a with x
  • a + b with n
• Then we get:

  [(n + 1)! / (x! (n - x)!)] * p^x * (1 - p)^(n - x)

• The beta(a, b) is just the binomial(n, p) where n = a + b and p becomes the variable. With the change of variable, we need a different normalizing constant so the sum (integral) is 1. Hence (n + 1)! replaces n!.

Page 18: Incorporating a Prior

• We assume a beta distribution as our prior distribution over the parameter p.
• Nice properties: it is unimodal, we can choose the mode to reflect the most probable value, and we can choose the variance to reflect our confidence in this value.
• Best property: a beta distribution is parameterized by two positive numbers, a and b…

Page 19: Beta Distribution (Continued)

• (Continued)… Higher values of a relative to b shift the peak of the distribution toward 1, and higher values of both a and b make the distribution more peaked (lower variance). We might, for example, take a to be the number of heads and b to be the number of tails. At any time, the mean of…

Page 20: Beta Distribution (Continued)

• (Continued)… the beta distribution (the expectation for p) is a/(a + b), and as we get more data the distribution becomes more peaked, reflecting higher confidence in our expectation. So we can specify our prior belief for p by choosing initial values for a and b such that a/(a + b) = p, and we can specify our confidence in this belief with high…

Page 21: Beta Distribution (Continued)

• (Continued)… initial values for a and b. Updating our prior belief based on data to obtain a posterior belief simply requires incrementing a for every heads outcome and incrementing b for every tails outcome.
• So after h heads out of n flips, our posterior distribution says P(heads) = (a + h) / (a + b + n).
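A tiny sketch of the update just described (my own illustration): start from pseudo-counts a and b, add the observed heads and tails, and read off (a + h) / (a + b + n).

```python
def beta_posterior_estimate(a, b, heads, tails):
    """Posterior estimate of P(heads) after observing data, starting from Beta(a, b)."""
    return (a + heads) / (a + b + heads + tails)

# Prior belief: P(heads) ~ 0.5, held with the confidence of 10 imaginary flips.
print(beta_posterior_estimate(5, 5, heads=70, tails=30))   # (5 + 70) / (10 + 100) ≈ 0.68
```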

Page 22: Dirichlet Distributions

• What if our variable is not Boolean but can take on more values? (Let’s still assume our variables are discrete.)
• Dirichlet distributions are an extension of beta distributions to the multi-valued case (corresponding to the extension from binomial to multinomial distributions).
• A Dirichlet distribution over a variable with n values has n parameters rather than 2.
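By analogy with the beta update, a sketch of the multi-valued case (my assumption of the standard Dirichlet update, consistent with this slide): one pseudo-count per value, incremented by the observed counts.

```python
def dirichlet_posterior_estimates(pseudo_counts, observed_counts):
    """Posterior mean estimate of each value's probability under a Dirichlet prior."""
    total = sum(pseudo_counts.values()) + sum(observed_counts.values())
    return {v: (pseudo_counts[v] + observed_counts.get(v, 0)) / total
            for v in pseudo_counts}

# Three-valued feature with a uniform prior of one pseudo-count per value.
print(dirichlet_posterior_estimates({"red": 1, "blue": 1, "green": 1},
                                    {"red": 6, "blue": 3}))
# {'red': 7/12, 'blue': 4/12, 'green': 1/12}
```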

Page 23: Back to Frequentist-Bayes Debate

• Recall that under the frequentist view we estimate each parameter p by taking the ML estimate (maximum likelihood estimate: the value for p that maximizes the probability of the data).
• Under the Bayesian view, we now have a prior distribution over values of p. If this prior is a beta, or more generally a Dirichlet…

Page 24: Frequentist-Bayes Debate (Continued)

• (Continued)… then we can update it to a posterior distribution quite easily using the data, as illustrated in the thumbtack example. The result yields a new value for the parameter p we wish to estimate (e.g., the probability of heads), called the MAP (maximum a posteriori) estimate.
• If our prior distribution was uniform over values for p, then ML and MAP agree.

Page 25: Some Predictive Models

• Maximum likelihood: the model that makes the data most probable
• MAP: the most probable hypothesis given the data
• Bayes optimal: the prediction with the highest expected value (sum the probabilities of all models making this prediction)
• Selective model averaging: Bayes optimal using just a subset of the “best” models

Page 26: Naïve Bayes: Motivation

• Let’s do maximum likelihood prediction
• We will see many draws of X1, …, Xn and the response (class) Y
• We want the ML estimate of P(Y | X1, …, Xn)
• What difficulty arises?
  • Exponentially many settings for X1, …, Xn
  • The next case probably has not been seen

Page 27: One Approach: Assume Conditional Independence

• By Bayes’ rule (with normalization):
  P(Y | X1, …, Xn) = α P(X1, …, Xn | Y) P(Y)
  • Normalization: compute the above for each value of Y, then normalize so the values sum to 1
• Recall conditional independence:
  P(X1, …, Xn | Y) = P(X1 | Y) … P(Xn | Y)
• So
  P(Y | X1, …, Xn) = α P(X1 | Y) … P(Xn | Y) P(Y)
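A compact sketch (not the course's code) of the rule just derived: multiply the per-feature conditionals and the prior for each class, then normalize. It assumes `priors` and `conditionals` shaped like the training sketch shown after page 8.

```python
def predict_naive_bayes(priors, conditionals, features):
    """Return normalized P(Y | features) under the conditional-independence assumption."""
    scores = {}
    for y, prior in priors.items():
        score = prior
        for f, v in features.items():
            score *= conditionals[(y, f)].get(v, 0.0)   # P(f = v | Y = y)
        scores[y] = score
    alpha = sum(scores.values())
    return {y: s / alpha for y, s in scores.items()} if alpha > 0 else scores
```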

Page 28: Is Simple Bayes Naïve?

• Surprisingly, the assumption of conditional independence, although most likely violated, is not too harmful!
• Naïve Bayes works quite well
  • Very successful in text categorization (“bag-of-words” representation)
  • Used in printer diagnosis in Win 95, the Office Assistant, spam filtering, etc.
• Recent resurgence of research activity in Naïve Bayes
  • Many “dead” ML algorithms resuscitated by the availability of large datasets (KISS principle)

Page 29: Naïve Bayes

• Makes a “naïve” assumption, namely conditional independence of the features:

  P(A ^ B | category) = P(A | category) * P(B | category)

• Hence we avoid estimating the probability of compound features (A ^ B)

Page 30: Naïve Bayes in Practice

• Empirically, it estimates relative probabilities more reliably than absolute ones:

  P(POS | features) / P(NEG | features) = [P(features | POS) * P(POS)] / [P(features | NEG) * P(NEG)]

Page 31: Naïve Bayes Technical Detail: Underflow

• If we have, say, 100 features, we are multiplying 100 numbers in [0, 1].
• If many probabilities are small, we could “underflow” the minimum positive float/double in our computer.
• Trick:
  - Sum the logs of the probabilities: product of prob’s = e^(Σ log(prob’s))
  - Subtract logs for the ratio, since log( P(+) / P(-) ) = log P(+) - log P(-)
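A short sketch of the trick (mine, with contrived probabilities small enough that the direct product would underflow a double): accumulate log-probabilities and compare the two classes by the difference of their log scores.

```python
import math

def log_score(prior, cond_probs):
    """log( P(class) * product of P(feature | class) ), computed as a sum of logs."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)

pos = log_score(0.6, [1e-150, 1e-120, 1e-80])   # direct product 1e-350 would underflow to 0.0
neg = log_score(0.4, [1e-140, 1e-130, 1e-85])
print("predict +" if pos - neg > 0 else "predict -")
```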

Page 32: Log Odds

  Odds = [ P(f1 | pos) * … * P(fn | pos) * P(pos) ] / [ P(f1 | neg) * … * P(fn | neg) * P(neg) ]

  log(Odds) = Σ log{ P(fi | pos) / P(fi | neg) } + log( P(pos) / P(neg) )

• Notice: if a feature value is more likely in a positive example, its log term is positive; if more likely in a negative example, its log term is negative (0 if tied).
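A sketch of the log-odds decision rule with illustrative numbers (the dictionaries are hypothetical, not from the slides): sum the per-feature log ratios plus the log prior ratio and predict POS when the total is positive.

```python
import math

def log_odds(features, cond_pos, cond_neg, p_pos, p_neg):
    """sum_i log(P(fi|pos)/P(fi|neg)) + log(P(pos)/P(neg)); positive means predict POS."""
    total = math.log(p_pos / p_neg)
    for f, v in features.items():
        total += math.log(cond_pos[f][v] / cond_neg[f][v])
    return total

cond_pos = {"color": {"red": 2/3, "blue": 1/3}}
cond_neg = {"color": {"red": 1/2, "blue": 1/2}}
print(log_odds({"color": "red"}, cond_pos, cond_neg, p_pos=0.6, p_neg=0.4))  # > 0, predict POS
```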

Page 33: Naïve Bayes Example

  Color   Shape   Size    Category
  red     ●       big     +
  blue            small   +
  red             small   +
  red             big     -
  blue    ●       small   -
  red             small   ?   (new example to classify)

  (Some shape symbols on the original slide did not survive extraction.)

Page 34: Naïve Bayes Example

• For the new example (red, <shape>, small):

  P(+ | F’s) / P(- | F’s) = [P(red | +) * P(<shape> | +) * P(small | +) * P(+)] / [P(red | -) * P(<shape> | -) * P(small | -) * P(-)]
                          = (2/3 * 1/3 * 2/3 * 3/5) / (1/2 * 1/2 * 1/2 * 2/5) ≈ 1.78

• So it is most likely a POS example
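A quick check (mine) of the arithmetic on this slide, using the ratio form from page 30.

```python
# Counts read off the training table: 3 positive and 2 negative examples.
pos = (2/3) * (1/3) * (2/3) * (3/5)   # P(red|+) * P(shape|+) * P(small|+) * P(+)
neg = (1/2) * (1/2) * (1/2) * (2/5)   # P(red|-) * P(shape|-) * P(small|-) * P(-)
print(pos / neg)                       # ≈ 1.78 > 1, so predict POS
```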

Page 35: Naïve Bayes as a Special Case of Bayes-Net Learning

• Naïve Bayes has a fixed structure reflecting the conditional-independence assumption
• Learn the parameters and then label a test case by inference… the same for Naïve Bayes as for Bayes-net learning
• Methods for dealing with zeroes, incorporating prior knowledge, and smoothing are all the same as in BN learning

Page 36: Dealing with Zeroes (and Small Samples)

• If we never see something (e.g., in the training set), should we assume its probability is zero?
• If we only see 3 positive examples and 2 are red, do we really think P(red | pos) = 2/3?

Page 37: m-Estimates (Eq. 6.22 in Mitchell; Eq. 7 in the draft chapter)

• Imagine we had m hypothetical positive examples
• Assume p is the probability that these examples are red
• Improved estimate:

  P(red | pos) = (2 + p * m) / (3 + m)

  (In general, red is some feature value, and 2 and 3 are the actual counts in the training set)

Page 38: m-Estimates More Generally

  P(fi = vi) = [ (# times fi = vi) + m * (initial guess for P(fi = vi)) ] / [ (# train ex’s) + m ]

  where m is the “equivalent sample size”: the first term in the numerator is the estimate based on the data, and the second is the estimate based on prior knowledge (“priors”).

Page 39: Mitchell’s Eq. 6.22

  Prob = (nc + m * p) / (n + m)

  where nc = # of examples with fi = vi, n = # of actual examples,
  m = equivalent sample size used in the guess, and p = the prior guess.

• Example: of 10 examples, 8 have color = red. With m = 100 and p = 0.5:

  Prob(color = red) = (8 + 100 * 0.5) / (10 + 100) = 58 / 110 ≈ 0.53
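A small helper (my own sketch) implementing the m-estimate, checked against the worked number on this slide; the second call plugs illustrative values p = 0.5 and m = 2 into the page-37 expression.

```python
def m_estimate(count, n, prior_guess, m):
    """(count + m * prior_guess) / (n + m): blend observed counts with a prior guess."""
    return (count + m * prior_guess) / (n + m)

print(m_estimate(8, 10, prior_guess=0.5, m=100))   # 58/110 ≈ 0.53 (page 39)
print(m_estimate(2, 3, prior_guess=0.5, m=2))      # (2 + 1)/5 = 0.6 (page 37, illustrative p and m)
```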

Page 40: Putting Things Together

• mp and mn are our parameters a and b of the beta distribution (the numbers of hypothetical positive and negative examples, respectively)
• If we start with beta(a, b): after seeing actual positive count a’ and actual negative count b’ in our data, the posterior is beta(a + a’, b + b’)
• This distribution will be more peaked, and shifted according to a’/b’

Page 41: Laplace Smoothing

• A special case of m-estimates
  • Let m = # colors, p = 1/m
  • I.e., assume one hypothetical positive example of each color
• Implementation trick
  • Start all counters at 1 instead of 0
  • E.g., initialize count(pos, feature(i), value(i, j)) = 1:
    count(pos, color, red), count(neg, color, red), count(pos, color, blue), …
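A sketch (not the course's code) of the implementation trick: seed every (class, feature, value) counter at 1, then add the real counts from the training data.

```python
from collections import defaultdict

def laplace_counts(examples, feature_values, classes):
    """feature_values: dict feature -> list of possible values."""
    counts = defaultdict(int)
    for c in classes:                         # start every counter at 1 instead of 0
        for f, values in feature_values.items():
            for v in values:
                counts[(c, f, v)] = 1
    for features, label in examples:          # then add the real counts
        for f, v in features.items():
            counts[(label, f, v)] += 1
    return counts

counts = laplace_counts([({"color": "red"}, "pos")],
                        {"color": ["red", "blue"]}, ["pos", "neg"])
print(counts[("pos", "color", "red")], counts[("pos", "color", "blue")])   # 2 1
```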

Page 42: Methods for Dealing with Real-Valued Features

• Discretize such features into intervals
• Compute the mean and standard deviation for each feature, conditioned on (i.e., given) the output category
• Model as a mixture of Gaussians
  • Produces multi-modal distributions rather than uni-modal ones (e.g., a single Gaussian)

Page 43: Discretizing Numeric Features - Some Possibilities

• Uniformly divide [min, max]
• Use information theory (Fayyad & Irani, ’92)
• Put the same number of examples in each bin
  - makes sense, since P(X | +) and P(X | -) will likely differ in each bin

  (Figure: positive and negative examples spread along the feature axis X, with the density P(X).)
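A brief sketch (mine) of the equal-frequency option: choose boundaries at quantiles so each bin receives roughly the same number of training examples.

```python
def equal_frequency_bins(values, n_bins):
    """Return boundary values that put (roughly) the same number of examples in each bin."""
    ordered = sorted(values)
    step = len(ordered) / n_bins
    return [ordered[int(round(i * step))] for i in range(1, n_bins)]

data = [0.1, 0.3, 0.4, 0.9, 1.2, 1.5, 2.0, 3.3, 4.1, 7.7, 8.0, 9.5]
print(equal_frequency_bins(data, 3))   # [1.2, 4.1]: splits the 12 values 4/4/4
```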

Page 44: Using a Gaussian Feature-Value Model on a Test Example

• Assume we estimated on the train set:
  • Mean = 5 and stdDev = 3 for positive examples
  • Mean = 4 and stdDev = 2 for negative examples
• Assume the test example’s value for this feature is 3
• What do we use in the Naïve Bayes calculation? The Gaussian density for each class:
  • Prob(f = 3 | pos) = (1 / (3 * √(2π))) * exp{ -(3 - 5)² / (2 * 3²) }
  • Prob(f = 3 | neg) = (1 / (2 * √(2π))) * exp{ -(3 - 4)² / (2 * 2²) }
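A sketch (mine) of the calculation, using the full Gaussian density for each class.

```python
import math

def gaussian_pdf(x, mean, std):
    """Gaussian density N(x; mean, std^2)."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

print(gaussian_pdf(3, mean=5, std=3))   # P(f = 3 | pos) ≈ 0.106
print(gaussian_pdf(3, mean=4, std=2))   # P(f = 3 | neg) ≈ 0.176
```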

Page 45: Naïve Bayes as a Bayesian Network (One Type of Graphical Model)

• (Figure: the class node POS/NEG with an arc to each feature node Fi.)
• Node i stores P(Fi | POS) and P(Fi | NEG)

Page 46: Using Generative Models

• Make a model that generates positives
• Make a model that generates negatives
• Classify a test example based on which model is more likely to generate it
  • The Naïve Bayes ratio does this

Page 47: Discriminant Models

• Directly classify instances into categories
• Capture differences between categories
• May not describe all features
• Examples: decision trees and neural nets
  • E.g., what differentiates birds and mammals? Breast feeding (?) Has feathers?
• Typically more efficient and simpler

Page 48: Final Comments on Naïve Bayes

• An effective algorithm on real-world tasks
  • The basis of the best spam filters, so maybe the most-used ML algorithm in practice?
• Fast, simple
• Gives confidence in its class predictions (i.e., probabilities)
• Makes simplifying assumptions (extensions exist; coming next…)
