Post on 18-Dec-2015
transcript
Bayesian Learning, Regression-based learning
Overview Bayesian Learning
Full
MAP learning
Maximum Likelihood Learning
Learning Bayesian Networks (Fully observable)
Regression and Logistic Regression
Full Bayesian Learning
In Decision Trees (and in all non-Bayesian learning methods) the idea is always to find the single best model that explains some observations. In contrast, full Bayesian learning sees learning as Bayesian updating of a probability distribution over the hypothesis space, given the data
H is the hypothesis variable
• Possible hypotheses (values of H): h1, …, hn
• P(H) = prior probability distribution over the hypothesis space
• The jth observation dj gives the outcome of random variable Dj
• Training data d = d1, …, dk
Given the data so far, each hypothesis hi has a posterior probability:
• P(hi|d) = αP(d|hi)P(hi)   (Bayes theorem)
where P(d|hi) is called the likelihood of the data under hypothesis hi
Predictions over a new entity X are a weighted average over the predictions of each hypothesis:
• P(X|d) = ∑i P(X, hi|d) = ∑i P(X|hi, d) P(hi|d) = ∑i P(X|hi) P(hi|d) ∝ ∑i P(X|hi) P(d|hi) P(hi)
The weights are given by the data likelihood and the prior of each hypothesis
No need to pick one best-guess hypothesis!
Note that P(X|hi, d) = P(X|hi): the data does not add anything to a prediction once a hypothesis is given
Full Bayesian Learning
Suppose we have 5 types of candy bags
• 10% are 100% cherry candies (h100, P(h100) = 0.1)
• 20% are 75% cherry + 25% lime candies (h75, P(h75) = 0.2)
• 40% are 50% cherry + 50% lime candies (h50, P(h50) = 0.4)
• 20% are 25% cherry + 75% lime candies (h25, P(h25) = 0.2)
• 10% are 100% lime candies (h0, P(h0) = 0.1)
• Then we observe candies drawn from some bag
Let's call θ the parameter that defines the fraction of cherry candies in a bag, and hθ the corresponding hypothesis
• Which of the five kinds of bag has generated my 10 observations? P(hθ|d)
• What flavour will the next candy be? Prediction P(X|d)
Example
If we re-wrap each candy and return it to the bag, our 10 observations are independent and identically distributed (i.i.d.), so
• P(d|hθ) = ∏j P(dj|hθ) for j = 1, …, 10
For a given hθ, the value of P(dj|hθ) is
• P(dj = cherry|hθ) = θ;  P(dj = lime|hθ) = (1 − θ)
And given N observations, of which c are cherry and l = N − c lime
• P(d|hθ) = ∏j P(dj|hθ) = θ^c (1 − θ)^l
• Binomial distribution: probability of # of successes in a sequence of N independent trials with binary outcome, each of which yields success with probability θ.
For instance, after observing 3 lime candies in a row:
• P([lime, lime, lime]|h50) = 0.5³ = 0.125, because the probability of seeing lime for each observation is 0.5 under this hypothesis
Initially, the hypotheses with higher priors dominate (h50 with prior = 0.4)
As data comes in, the true hypothesis (h0 ) starts dominating, as the probability of seeing this data given the other hypotheses gets increasingly smaller
• After seeing three lime candies in a row, the probability that the bag is the all-lime one starts taking off
[Figure: posteriors P(h100|d), P(h75|d), P(h50|d), P(h25|d), P(h0|d) as a function of the number of observations]
P(hi |d) = αP(d| hi) P(hi)
Posterior Probability of H
Prediction Probability
The probability that the next candy is lime increases with the probability that the bag is an all-lime one
∑i P(next candy is lime| hi) P(hi |d)
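The candy example above can be worked through in a short Python sketch (an illustration, not part of the original slides; the priors are those of the five bags):

```python
# Full Bayesian learning on the five candy-bag hypotheses;
# theta is the fraction of cherry candies in a bag.
priors = {1.00: 0.1, 0.75: 0.2, 0.50: 0.4, 0.25: 0.2, 0.00: 0.1}

def posteriors(observations, priors):
    """P(h_theta|d) = alpha * P(d|h_theta) * P(h_theta)  (Bayes theorem)."""
    unnorm = {}
    for theta, prior in priors.items():
        likelihood = 1.0
        for candy in observations:
            likelihood *= theta if candy == "cherry" else (1 - theta)
        unnorm[theta] = likelihood * prior
    alpha = 1.0 / sum(unnorm.values())
    return {theta: alpha * p for theta, p in unnorm.items()}

def predict_lime(observations, priors):
    """P(next is lime|d) = sum_i P(lime|h_i) * P(h_i|d)."""
    return sum((1 - theta) * p
               for theta, p in posteriors(observations, priors).items())

post = posteriors(["lime"] * 3, priors)
print(round(post[0.0], 3))                           # posterior of the all-lime bag
print(round(predict_lime(["lime"] * 3, priors), 3))  # prediction for the next candy
```

After three lime candies the all-lime posterior is about 0.42 and the lime prediction about 0.8, matching the "0.8" Bayesian prediction mentioned later in these slides.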
Overview Full Bayesian Learning
MAP learning
Maximum Likelihood Learning
Learning Bayesian Networks
• Fully observable
• With hidden (unobservable) variables
MAP approximation
Full Bayesian learning seems like a very safe bet, but unfortunately it does not work well in practice
• Summing over the hypothesis space is often intractable (e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes)
Very common approximation: Maximum a posteriori (MAP) learning:
• Instead of doing prediction by considering all possible hypotheses, as in
P(X|d) = ∑i P(X| hi) P(hi |d)
• Make predictions based on hMAP that maximises P(hi |d)
I.e., maximize P(d| hi) P(hi)
P(X|d) ≈ P(X|hMAP)
MAP approximation
MAP is a good approximation when P(X|d) ≈ P(X|hMAP)
• In our example, hMAP is the all-lime bag after only 3 candies, predicting that the next candy will be lime with p = 1
• The Bayesian learner gave a prediction of 0.8, safer after seeing only 3 candies
Bias
As more data arrive, MAP and Bayesian prediction become closer, as MAP's competing hypotheses become less likely
Often easier to find MAP (optimization problem) than deal with a large summation problem
P(H) plays an important role in both MAP and Full Bayesian Learning
• Defines the learning bias, i.e. which hypotheses are favoured
Used to define a tradeoff between model complexity and its ability to fit the data
• More complex models can explain the data better => higher P(d|hi), but danger of overfitting
• But simpler models have higher prior probability
• I.e., a common learning bias is to penalize complexity
Overview Full Bayesian Learning
MAP learning
Maximum Likelihood Learning
Learning Bayesian Networks
• Fully observable
• With hidden (unobservable) variables
Maximum Likelihood (ML) Learning
Further simplification over full Bayesian and MAP learning
• Assume a uniform prior over the space of hypotheses
• MAP learning (maximize P(d| hi) P(hi)) reduces to maximize P(d| hi)
When is ML appropriate?
• Used in statistics as the standard (non-Bayesian) statistical learning method by those who distrust the subjective nature of hypothesis priors
• When the competing hypotheses are indeed equally likely (e.g., have the same complexity)
• With very large datasets, for which P(d|hi) tends to overcome the influence of P(hi)
Overview Bayesian Learning
Full
MAP learning
Maximum Likelihood Learning
Learning Bayesian Networks (Fully observable)
Learning BNets: Complete Data
We will start by applying ML to the simplest type of BNet learning:
• known structure
• data containing observations for all variables
All variables are observable, no missing data
The only things that we need to learn are the network's parameters
ML learning: example
Back to the candy example:
• New candy manufacturer that does not provide any information on the composition of its bags, i.e., on the fraction θ of cherry candies
• Any θ is possible: a continuum of hypotheses hθ
• Reasonable to assume that all θ are equally likely (we have no evidence to the contrary): uniform distribution P(hθ)
• θ is a parameter for this simple family of models, which we need to learn
Simple network to represent this problem
• Flavor represents the event of drawing a cherry vs. lime candy from the bag
• P(F=cherry), or P(cherry) for brevity, is equivalent to the fraction θ of cherry candies in the bag
We want to infer θ by unwrapping N candies from the bag
Unwrap N candies, c cherries and l = N − c lime (returning each candy to the bag after observing its flavor)
As we saw earlier, this is described by a binomial distribution
• P(d|hθ) = ∏j P(dj|hθ) = θ^c (1 − θ)^l
With ML we want to find the θ that maximizes this expression, or equivalently its log likelihood (L)
• L(P(d|hθ))
= log(∏j P(dj|hθ))
= log(θ^c (1 − θ)^l)
= c log θ + l log(1 − θ)
ML learning: example (cont’d)
To maximise, we differentiate L(P(d|hθ)) with respect to θ and set the result to 0
ML learning: example (cont’d)
dL/dθ = c/θ − l/(1 − θ)
Setting this to 0 and doing the math gives
c(1 − θ) = lθ  =>  θ = c/(c + l) = c/N
i.e., the ML proportion of cherries in the bag is equal to the proportion (frequency) of cherries in the data
General ML procedure
Express the likelihood of the data as a function of the parameters to be learned
Take the derivative of the log likelihood with respect to each parameter
Find the parameter value that makes the derivative equal to 0
The last step can be computationally very expensive in real-world learning tasks
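The procedure above can be sanity-checked numerically. The sketch below (not from the original slides; the counts are hypothetical) verifies that the closed-form value θ = c/N beats every alternative on a fine grid:

```python
import math

# Log likelihood from the slides: L(theta) = c*log(theta) + l*log(1 - theta)
def log_likelihood(theta, c, l):
    return c * math.log(theta) + l * math.log(1 - theta)

c, l = 7, 3              # hypothetical counts: 7 cherry, 3 lime, N = 10
theta_ml = c / (c + l)   # closed form from setting the derivative to 0

# The closed-form estimate maximizes L over a fine grid of alternatives
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, c, l))
print(theta_ml, best)
```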
There is one more example for you to look at in the next few slides
Another example
The manufacturer chooses the color of the wrapper probabilistically for each candy, based on its flavor, following an unknown distribution
• If the flavour is cherry, it chooses a red wrapper with probability θ1
• If the flavour is lime, it chooses a red wrapper with probability θ2
The Bayesian network for this problem includes 3 parameters to be learned
• θ, θ1, θ2
Another example (cont’d)
P(W = green, F = cherry|hθθ1θ2) = (*)
= P(W = green|F = cherry, hθθ1θ2) P(F = cherry|hθθ1θ2)
= θ (1 − θ1)
We unwrap N candies
• c are cherry and l are lime
• rc cherry with red wrapper, gc cherry with green wrapper
• rl lime with red wrapper, gl lime with green wrapper
• every trial is a combination of wrapper and candy flavor similar to event (*) above, so
P(d|hθθ1θ2) = ∏j P(dj|hθθ1θ2) = θ^c (1 − θ)^l (θ1)^rc (1 − θ1)^gc (θ2)^rl (1 − θ2)^gl
Another example (cont’d)
I want to maximize the log of this expression
• c log θ + l log(1 − θ) + rc log θ1 + gc log(1 − θ1) + rl log θ2 + gl log(1 − θ2)
Take the derivative with respect to each of θ, θ1, θ2
• The terms not containing the differentiation variable disappear
Setting each derivative to 0 gives θ = c/(c + l), θ1 = rc/(rc + gc), θ2 = rl/(rl + gl)
Frequencies again!
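The wrapper example reduces to three frequency computations, as this sketch shows (the counts below are hypothetical, chosen only for illustration):

```python
# Hypothetical counts for the wrapper example; the ML estimates
# are just the observed frequencies.
c, l = 6, 4        # cherry / lime candies unwrapped (N = 10)
rc, gc = 4, 2      # red / green wrappers among the cherry candies
rl, gl = 1, 3      # red / green wrappers among the lime candies

theta = c / (c + l)        # P(F = cherry)
theta1 = rc / (rc + gc)    # P(W = red | F = cherry)
theta2 = rl / (rl + gl)    # P(W = red | F = lime)
print(theta, theta1, theta2)
```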
ML parameter learning in Bayesian nets
With complete data and ML approach:
• Parameter learning decomposes into a separate learning problem for each parameter (CPT), because the log likelihood decomposes into a sum of terms, one per parameter
• Each parameter is given by the frequency of the desired child value in the presence of the relevant parents' values
P(Y=yi|X=xj) = count(Y=yi, X=xj) / count(X=xj)
Frequencies are used to learn the relevant probabilities (e.g., transition and observation models) in HMMs and more complex Bayesian networks
See C&G island exercise and papers in presentation list
ML is the theoretical justification of this approach
Very Popular Application
Naïve Bayes models: very simple Bayesian networks for classification
• Class variable (to be predicted) is the root node
• Attribute variables Xi (observations) are the leaves
Naïve because it assumes that the attributes are conditionally independent of each other given the class
Deterministic prediction can be obtained by picking the most likely class
Scales up really well: with n boolean attributes we just need 2n+1 parameters
[Figure: naïve Bayes network with the class C as root and attribute variables X1, X2, …, Xi, … as leaves]
P(C|x1, …, xn) = P(C) P(x1, …, xn|C) / P(x1, …, xn) ∝ P(C) ∏i P(xi|C)
Example
Naïve Classifier for the newsgroup reading example
Problem with ML parameter learning
With small datasets, some of the frequencies may be 0 just because we have not observed the relevant data
This generates very strong (and incorrect) zero-probability predictions:
• Simple common fix: initialize the count of every relevant event to 1 before counting the observations
• For more sophisticated strategies see textbook (p. 296)
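The simple fix above (start every count at 1) can be sketched as follows; the event names are illustrative:

```python
from collections import Counter

def smoothed_probs(observations, values):
    """Estimate P(v) by frequency, but initialize every count to 1
    so that no probability is ever exactly 0."""
    counts = Counter({v: 1 for v in values})  # pseudocount of 1 per value
    counts.update(observations)
    total = sum(counts.values())
    return {v: counts[v] / total for v in values}

# "red" is never observed, yet it still gets a small nonzero probability:
probs = smoothed_probs(["green", "green", "green"], ["red", "green"])
print(probs)
```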
Probability from Experts
As we mentioned in previous lectures, an alternative to learning probabilities from data is to get them from experts
Problems
• Experts may be reluctant to commit to specific probabilities that cannot be refined
• How to represent the confidence in a given estimate
• Getting the experts and their time in the first place
One promising approach is to leverage both sources when they are available
• Get initial estimates from experts
• Refine them with data
Combining Experts and Data
Get the expert to express her belief on event A as the pair <n,m>, i.e., how many observations of A she has seen (or expects to see) in m trials
Combine the pair with actual data
• If A is observed, increment both n and m
• If ¬A is observed, increment m alone
The absolute values in the pair can be used to express the expert's level of confidence in her estimate
• Small values (e.g., <2,3>) represent low confidence, as they are quickly dominated by data
• The larger the values, the higher the confidence, as it takes more and more data to dominate the initial estimate (e.g., <2000,3000>)
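The update rule above is easy to simulate; the sketch below (not from the slides, with invented data in which A occurs 25% of the time) shows how a low-confidence pair is quickly dominated by data while a high-confidence pair barely moves:

```python
def refine(n, m, observations):
    """Refine an expert pair <n, m> with data.
    observations: booleans, True when event A is observed."""
    for a in observations:
        m += 1          # every observation is a trial
        if a:
            n += 1      # A was observed
    return n, m

data = [True] * 5 + [False] * 15   # A observed in 25% of 20 trials
for n0, m0 in [(2, 3), (2000, 3000)]:
    n, m = refine(n0, m0, data)
    print((n0, m0), "->", round(n / m, 3))
```

The <2,3> estimate (initially 2/3) moves toward 0.25 after just 20 observations, while <2000,3000> stays near 2/3.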
Overview Bayesian Learning
Full
MAP learning
Maximum Likelihood Learning
Learning Bayesian Networks (Fully observable)
Regression and Logistic Regression
One more set of techniques for supervised learning
• Naïve Bayes and Decision Trees (in their basic incarnation) provide classification in terms of discrete labels y:
y = f(attr1, attr2, …, attrn)
• We will briefly see two techniques that go beyond this:
• Regression: y is continuous
• Logistic Regression: y is continuous but reduced to a binary classification
Linear Regression
hw(x) = w1x + w0
problem of fitting a linear function to a set of training examples: input/output pairs with numeric values
Regression
Linear regression: problem of fitting a linear function hw(x) to a set of training examples (xj, yj): input/output pairs with numeric values
• y = m x + b
• hw(x) = w1 x + w0
Find best values for parameters that
• “maximize goodness of fit” or “minimize loss”
The most probable values of the parameters are then found by minimizing the squared-error loss:
Loss(hw) = Σj (yj − hw(xj))²
Regression: Minimizing Loss
Choose weights to minimize the sum of squared errors
Regression: Minimizing Loss
y = w1 x + w0
Loss(hw) = Σj (yj − hw(xj))²
∂Loss(hw)/∂wi = ∂/∂wi Σj (yj − (w1 xj + w0))² = 0
Algebra gives an exact solution to the minimization problem
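The exact solution for the univariate case can be sketched directly (an illustration, not from the slides; the points below are hypothetical and lie exactly on y = 2x + 1):

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of h_w(x) = w1*x + w0:
    w1 = cov(x, y) / var(x),  w0 = ybar - w1*xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    w1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    w0 = ybar - w1 * xbar
    return w1, w0

# Points generated from y = 2x + 1 are recovered exactly:
w1, w0 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(w1, w0)
```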
Multivariate Regression
Suppose we have features X1, …, Xn. A linear function of these features is a function of the form
fw(X1, …, Xn) = w0 + w1 × X1 + … + wn × Xn
Given a set E of examples, where each example e ∈ E has
• values xi for each feature Xi
• observed value oe
The predicted value is thus
hwe = w0 + w1 × x1 + … + wn × xn = ∑i=0..n wi × xi,
where x0 is defined to be 1.
The sum-of-squares error on examples E for target Y is
ErrorE(w) = ∑e∈E (oe − hwe)²
• In this linear case, as for the univariate version, the weights that minimize the error can be computed analytically, by equating the derivative with respect to each wi to 0
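The analytic solution amounts to solving the normal equations XᵀXw = Xᵀo. A self-contained sketch (not from the slides; the example data are generated from the hypothetical target o = 1 + 2·x1 − x2):

```python
def solve(A, b):
    """Tiny Gaussian elimination with partial pivoting (enough for a sketch)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit(examples):
    """examples: list of (features, observed value). Prepends x0 = 1 and
    solves the normal equations X^T X w = X^T o for the minimizing weights."""
    X = [[1.0] + list(f) for f, _ in examples]
    o = [y for _, y in examples]
    n = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    Xto = [sum(r[i] * y for r, y in zip(X, o)) for i in range(n)]
    return solve(XtX, Xto)

# Data generated from o = 1 + 2*x1 - x2 is recovered exactly:
ex = [((0, 0), 1), ((1, 0), 3), ((0, 1), 0), ((1, 1), 2), ((2, 1), 4)]
w = fit(ex)
print([round(v, 6) for v in w])
```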
Squashed linear functions for classification
We can use a linear function for discrete classification. Let's consider a binary classification task, where the domain of the target variable is {0, 1}. For classification, use a squashed linear function of the form
fw(X1, …, Xn) = f(w0 + w1 × X1 + … + wn × Xn),
where f is an activation function, mapping real numbers into [0, 1].
Activation Functions
Commonly used ones:
• f(w0 + w1*X1 + w2*X2 + … + wk*Xk)
Examples:
• Sigmoid function: f(in) = 1/(1 + e^(−in))
• Step function: f(in) = 1 when in > t, 0 otherwise
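The two activation functions can be compared in a couple of lines (an illustration, not from the slides; the step threshold t defaults to 0):

```python
import math

def sigmoid(z):
    """Soft squashing into (0, 1): f(in) = 1 / (1 + e^(-in))."""
    return 1 / (1 + math.exp(-z))

def step(z, t=0.0):
    """Hard threshold: 1 when in > t, else 0."""
    return 1 if z > t else 0

print(step(-2), step(2))        # hard 0/1 decision
print(round(sigmoid(-2), 3), round(sigmoid(0), 3), round(sigmoid(2), 3))
```

Unlike the step function, the sigmoid changes smoothly around the threshold, which is what makes it differentiable.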
Step Function
A step function represents a linear classifier with a hard threshold
• it implements a linear decision boundary (aka linear separator) that separates the classes involved in the classification
Decision boundary
A step function was the basis for the perceptron, one of the first methods for supervised classification and the basis for Neural Networks
We can't use the derivative to compute the weights that minimize the loss, because the step function is not differentiable
Expressiveness of Step Function
A step function can only represent linearly separable functions.
[Figure: examples of Boolean functions with TRUE/FALSE classes; several are linearly separable, including Majority(I1, I2, I3), while one is non-linearly separable]
Logistic Regression
Classification that uses a sigmoid function as the activation function
Learning with logistic regression: we still want to minimize the sum of squared errors over the training set of examples E
ErrE(w) = ∑e∈E (oe − hwe)² = ∑e∈E (oe − f(∑i wi × xi))²
Because f(x) is a non-linear function, it is hard to find a solution analytically:
• Iterative computation of the weights using Gradient Descent
Gradient Descent Search
Alter each weight by an amount proportional to the slope in that direction
Repeat until a termination condition is met (error is "small enough")
Look at the partial derivative of the error term on each example e with respect to each weight wj:
∂Erre(w)/∂wj = ∂((oe − hwe)²)/∂wj
Each weight is updated as
wj ← wj − η (∂Erre(w)/∂wj)
where η is a constant called the learning rate that determines how fast the algorithm moves toward the minimum of the error function
Gradient Descent Search
Each set of weights defines a point on the error surface
Search toward the minimum of the surface that describes the sum-of-squares error as a function of all the weights in the logistic regression
Given a point on the surface, look at the slope of the surface along the axis formed by each weight
• i.e., the partial derivative of the surface Err with respect to each weight wj
Weight Update for Logistic Regression
The predicted value for example e is hwe = f(w0X0 + w1X1 + … + wkXk); the actual observed value is oe
Partial derivative with respect to weight wj:
∂Erre(w)/∂wj = ∂((oe − hwe)²)/∂wj = ∂((oe − f(w0X0 + … + wkXk))²)/∂wj
= −2 (oe − hwe) f′(in) Xj,  where in = ∑i wi Xi
Weight Update for Logistic Regression
From this result and the derivative of the sigmoid activation function
f(x) = 1/(1 + e^(−x)),  f′(x) = f(x)(1 − f(x))
(which is easy to compute analytically, via the chain rule), each weight wj in gradient descent for Logistic Regression is updated as follows:
wj ← wj + η (oe − hwe) f′(in) xj
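The full update loop can be sketched in Python (an illustration, not from the slides; the AND-like training data, learning rate and epoch count are arbitrary choices):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_logistic(examples, eta=0.5, epochs=2000):
    """examples: list of (features, label in {0, 1}). Gradient descent on the
    sum-of-squares error; x0 = 1 is prepended for the bias weight w0."""
    n = len(examples[0][0]) + 1
    w = [0.0] * n
    for _ in range(epochs):
        for feats, o in examples:
            x = [1.0] + list(feats)
            inn = sum(wi * xi for wi, xi in zip(w, x))
            h = sigmoid(inn)                     # predicted value h_we
            for j in range(n):
                # w_j <- w_j + eta * (o_e - h_we) * f'(in) * x_j,
                # with f'(in) = f(in) * (1 - f(in)) for the sigmoid
                w[j] += eta * (o - h) * h * (1 - h) * x[j]
    return w

# Learn a linearly separable AND-like function:
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_logistic(data)
predict = lambda f: sigmoid(w[0] + w[1] * f[0] + w[2] * f[1])
print(round(predict((1, 1)), 2), round(predict((0, 0)), 2))
```

After training, only the (1, 1) input is predicted with probability above 0.5, i.e., the learned decision boundary separates the two classes.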
Expressiveness of Logistic Regression
The sigmoid activation function has limitations in expressiveness similar to those of the step function
• Output of a two-input logistic regression
• Adjusting the weights changes the orientation, location and steepness of the cliff, but it still can't learn functions like XOR
Learning Goals for Bayesian learning and regression-based learning
For each of Bayesian learning, MAP and ML learning:
• Define how it works
• Apply it to compute P(h|data) and P(X|data)
Explain how parameters can be learned from data in fully observable Bayesian networks
Compute the parameters for a specific network
Explain how parameters can be learned in partially observable Bayesian networks (see slides from next class)