Machine Learning
Lecture 01-1: Basics of Probability Theory
Nevin L. Zhang ([email protected])
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Outline
1 Basic Concepts in Probability Theory
2 Interpretation of Probability
3 Univariate Probability Distributions
4 Multivariate Probability and Bayes' Theorem
5 Parameter Estimation
Basic Concepts in Probability Theory
Random Experiments
Probability is associated with a random experiment, a process with uncertain outcomes.
Often kept implicit
In machine learning, we often assume that data are generated by a hypothetical process (or a model), and the task is to determine the structure and parameters of the model from data.
Sample Space
Sample space (aka population) Ω: the set of possible outcomes of a random experiment.
Example: rolling two dice. Ω = {(i, j) : i, j ∈ {1, . . . , 6}}, so |Ω| = 36.
Elements in a sample space are outcomes.
Events
Event: A subset of the sample space.
Example: The two results add to 4.
Probability Weight Function
A probability weight P(ω) is assigned to each outcome. For example, with two fair dice, P(ω) = 1/36 for each of the 36 outcomes.
In machine learning, we often need to determine the probability weights, or related parameters, from data. This task is called parameter learning.
Probability measure
Probability P(E) of an event E: P(E) = ∑_{ω∈E} P(ω)
A probability measure is a mapping from the set of events to [0, 1]
P : 2^Ω → [0, 1]
that satisfies Kolmogorov's axioms:
1 P(Ω) = 1.
2 P(A) ≥ 0 for all A ⊆ Ω.
3 Additivity: P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅.
In a more advanced treatment of Probability Theory, we would start with the concept of probability measure, instead of probability weights.
Random Variables
A random variable is a function over the sample space.
Example: X = sum of the two results. X((2, 5)) = 7; X((3, 1)) = 4.
Why is it random? Because the outcome of the underlying experiment is random.
Domain of a random variable: Set of all its possible values.
ΩX = {2, 3, . . . , 12}
Random Variables and Events
A random variable X taking a specific value x is an event:
Ω_{X=x} = {ω ∈ Ω | X(ω) = x}
Ω_{X=4} = {(1, 3), (2, 2), (3, 1)}.
Probability Mass Function (Distribution)
Probability mass function P(X ): ΩX → [0, 1]
P(X = x) = P(ΩX=x)
P(X = 4) = P({(1, 3), (2, 2), (3, 1)}) = 3/36.
If X is continuous, we have a density function p(X ).
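To make these definitions concrete, here is a small illustrative Python sketch (not part of the original slides) that builds the sample space for two fair dice, assigns uniform probability weights, and evaluates the probability mass function of X = sum of the two results.

```python
from itertools import product
from fractions import Fraction

# Sample space: all 36 ordered outcomes of rolling two dice.
omega = list(product(range(1, 7), repeat=2))

# Uniform probability weights: P(w) = 1/36 for every outcome.
P = {w: Fraction(1, 36) for w in omega}

# Random variable X: sum of the two results.
def X(w):
    return w[0] + w[1]

# Probability mass function: P(X = x) = sum of weights of outcomes with X(w) = x.
def pmf(x):
    return sum(P[w] for w in omega if X(w) == x)

print(pmf(4))                              # 1/12, i.e. 3/36
print(sum(pmf(x) for x in range(2, 13)))   # 1, as Kolmogorov's axioms require
```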
Interpretation of Probability
Frequentist interpretation
Probabilities are long term relative frequencies.
Example:
X is the result of a coin toss, ΩX = {H, T}. P(X = H) = 1/2 means that the relative frequency of getting heads will almost surely approach 1/2 as the number of tosses goes to infinity.
Justified by the Law of Large Numbers:
Xi: result of the i-th toss (1 for H, 0 for T). Law of Large Numbers:
lim_{n→∞} (1/n) ∑_{i=1}^{n} Xi = 1/2   with probability 1
The frequentist interpretation is meaningful only when the experiment can be repeated under the same conditions.
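A quick simulation illustrates the frequentist reading and the Law of Large Numbers. This sketch is not from the slides; the exact numbers will vary from run to run.

```python
import random

random.seed(0)

def relative_frequency_of_heads(n):
    """Toss a fair coin n times and return the fraction of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    return heads / n

# The relative frequency approaches 1/2 as the number of tosses grows.
for n in [10, 100, 10_000, 1_000_000]:
    print(n, relative_frequency_of_heads(n))
```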
Bayesian interpretation
Probabilities are logically consistent degrees of beliefs.
Applicable when experiment not repeatable.
Depends on a person’s state of knowledge.
Example: "probability that the Suez canal is longer than the Panama canal".
It does not make sense under the frequentist interpretation.
Subjectivist: degree of belief based on one's state of knowledge.
Primary school student: 0.5; me: 0.8; a geographer: 1 or 0.
Arguments such as the Dutch book argument are used to explain why one's probability beliefs must satisfy Kolmogorov's axioms.
Interpretations of Probability
Now both interpretations are accepted. In practice, subjective beliefs and statistical data complement each other.
We rely on subjective beliefs (prior probabilities) when data are scarce.
As more and more data become available, we rely less and less on subjective beliefs.
Often, we also use prior probabilities to impose some bias on the kind of results we want from a machine learning algorithm.
The subjectivist interpretation makes concepts such as conditional independence easy to understand.
Univariate Probability Distributions
Binomial and Bernoulli Distributions
Suppose we toss a coin n times. At each time, the probability of getting a head is θ.
Let X be the number of heads. Then X follows the binomial distribution, written as X ∼ Bin(n, θ):
Bin(X = k | n, θ) = C(n, k) θ^k (1 − θ)^{n−k} if 0 ≤ k ≤ n, and 0 if k < 0 or k > n, where C(n, k) is the binomial coefficient.
If n = 1, then X follows the Bernoulli distribution, written as X ∼ Ber(θ):
Ber(X = x | θ) = θ if x = 1, and 1 − θ if x = 0.
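As a sanity check, the binomial and Bernoulli pmfs can be evaluated directly from these formulas. This is an illustrative sketch, not code from the lecture; the function names are my own.

```python
from math import comb

def binomial_pmf(k, n, theta):
    """Bin(X = k | n, theta): probability of k heads in n tosses."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def bernoulli_pmf(x, theta):
    """Ber(X = x | theta): the n = 1 special case."""
    return theta if x == 1 else 1 - theta

print(binomial_pmf(3, 10, 0.5))                           # 0.1171875
print(sum(binomial_pmf(k, 10, 0.3) for k in range(11)))   # 1.0
print(bernoulli_pmf(1, 0.3), binomial_pmf(1, 1, 0.3))     # both 0.3
```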
Multinomial Distribution
Suppose we toss a K-sided die n times. At each time, the probability of getting result j is θ_j. Let θ = (θ1, . . . , θK)^⊤.
Let x = (x1, . . . , xK) be a random vector, where x_j is the number of times side j of the die occurs. Then x follows the multinomial distribution, written as x ∼ Multi(n, θ):
Multi(x | n, θ) = (n choose x1, . . . , xK) ∏_{j=1}^{K} θ_j^{x_j},
where (n choose x1, . . . , xK) = n! / (x1! · · · xK!) is the multinomial coefficient.
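The multinomial pmf is easy to compute from the formula. The sketch below is illustrative only; the example counts and probabilities are made up.

```python
from math import factorial

def multinomial_pmf(x, theta):
    """Multi(x | n, theta) with n = sum(x); x[j] counts occurrences of side j."""
    n = sum(x)
    coeff = factorial(n)
    for xj in x:
        coeff //= factorial(xj)     # multinomial coefficient n! / (x1! ... xK!)
    p = 1.0
    for xj, tj in zip(x, theta):
        p *= tj**xj
    return coeff * p

# A fair 3-sided die tossed 5 times, observing the sides (2, 2, 1) times.
print(multinomial_pmf((2, 2, 1), (1/3, 1/3, 1/3)))   # 30 * (1/3)**5 ≈ 0.1235
```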
Categorical Distribution
In the previous slide, if n = 1, x = (x1, . . . , xK) has one component being 1 and the others being 0. In other words, it is a one-hot vector.
In this case, x follows the categorical distribution, written as x ∼ Cat(θ):
Cat(x | θ) = ∏_{j=1}^{K} θ_j^{1(x_j = 1)},
where 1(x_j = 1) is the indicator function, whose value is 1 when x_j = 1 and 0 otherwise.
Gaussian (Normal) Distribution
The most widely used distribution in statistics and machine learning is the Gaussian or normal distribution.
Its probability density is given by
N(x | µ, σ²) = (1 / √(2πσ²)) exp[ −(x − µ)² / (2σ²) ]
Here µ = E[X] is the mean (and mode), and σ² = var[X] is the variance.
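Here is a direct, illustrative transcription of the density formula (not from the slides), together with a crude numerical check that it integrates to roughly 1.

```python
import math

def normal_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2): Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(normal_pdf(0.0, 0.0, 1.0))   # ~0.3989, the standard normal density at its mean

# Riemann-sum check that the density integrates to approximately 1.
xs = [i * 0.001 for i in range(-8000, 8001)]
print(sum(normal_pdf(x, 0.0, 1.0) * 0.001 for x in xs))   # ~1.0
```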
Multivariate Probability
Joint probability mass function
Probability mass function of a random variable X :
P(X ) : ΩX → [0, 1]
P(X = x) = P(ΩX=x).
Suppose there are n random variables X1, X2, . . . , Xn. A joint probability mass function P(X1, X2, . . . , Xn) over those random variables is a function defined on the Cartesian product of their state spaces:
∏_{i=1}^{n} Ω_{Xi} → [0, 1]
P(X1 = x1, X2 = x2, . . . , Xn = xn) = P(Ω_{X1=x1} ∩ Ω_{X2=x2} ∩ · · · ∩ Ω_{Xn=xn}).
Joint probability mass function
Example:
Population: apartments in the Hong Kong rental market.
Random variables (of a randomly selected apartment):
Monthly Rent: low (≤ 1k), medium ((1k, 2k]), upper medium ((2k, 4k]), high (≥ 4k)
Type: public, private, others
Joint probability distribution P(Rent,Type):
               public   private   others
low            .17      .01       .02
medium         .44      .03       .01
upper medium   .09      .07       .01
high           .00      .14       .01
Multivariate Gaussian Distributions
For continuous variables, the most commonly used joint distribution is the multivariate Gaussian distribution N(µ, Σ):
N(x | µ, Σ) = (1 / √((2π)^D |Σ|)) exp[ −(1/2)(x − µ)^⊤ Σ^{−1} (x − µ) ]
D: dimensionality
x: vector of D random variables, representing data
µ: vector of means
Σ: covariance matrix; |Σ| denotes the determinant of Σ
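The density above can be transcribed directly into numpy, as in the illustrative sketch below (numpy is an assumed dependency). In practice one would prefer a numerically stabler routine, e.g. one based on a Cholesky factorization or scipy.stats.multivariate_normal.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of the multivariate Gaussian N(x | mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    norm_const = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    quad = diff @ np.linalg.inv(Sigma) @ diff
    return np.exp(-0.5 * quad) / norm_const

# A 2-D example with correlated components.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.0, 0.0]), mu, Sigma))   # density at the mean
```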
Multivariate Gaussian Distributions
A 2-D Gaussian distribution.
µ: center of contours
Σ: orientation and size of contours
Marginal probability
What is the probability of a randomly selected apartment being a public one? (Law of total probability)
P(Type=public) = P(Type=public, Rent=low) + P(Type=public, Rent=medium) + P(Type=public, Rent=upper medium) + P(Type=public, Rent=high) = .7
P(Type=private) = P(Type=private, Rent=low) + P(Type=private, Rent=medium) + P(Type=private, Rent=upper medium) + P(Type=private, Rent=high) = .25
               public   private   others   P(Rent)
low            .17      .01       .02      .2
medium         .44      .03       .01      .48
upper medium   .09      .07       .01      .17
high           .00      .14       .01      .15
P(Type)        .7       .25       .05
These are called marginal probabilities because they are written on the margins of the table.
Conditional probability
For events A and B:
P(A | B) = P(A, B) / P(B)   ( = P(A ∩ B) / P(B) )
Meaning:
P(A): my probability on A (without any knowledge about B)
P(A | B): my probability on event A assuming that I know event B is true
What is the probability of a randomly selected private apartment having "low" rent?
P(Rent=low | Type=private) = P(Rent=low, Type=private) / P(Type=private) = .01/.25 = .04
In contrast:
P(Rent=low) = 0.2.
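The following illustrative sketch (not from the slides; numpy is an assumed dependency) ties together the joint table, the marginals, and the conditional probability computed above.

```python
import numpy as np

rents = ["low", "medium", "upper medium", "high"]
types = ["public", "private", "others"]

# Joint distribution P(Rent, Type) from the table above.
joint = np.array([
    [0.17, 0.01, 0.02],
    [0.44, 0.03, 0.01],
    [0.09, 0.07, 0.01],
    [0.00, 0.14, 0.01],
])

print(joint.sum())            # 1.0: a valid joint distribution

# Marginals (law of total probability): sum out the other variable.
p_rent = joint.sum(axis=1)    # [0.2, 0.48, 0.17, 0.15]
p_type = joint.sum(axis=0)    # [0.7, 0.25, 0.05]

# Conditional probability P(Rent = low | Type = private).
p_low_given_private = (joint[rents.index("low"), types.index("private")]
                       / p_type[types.index("private")])
print(p_low_given_private)    # ~0.04
```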
Marginal independence
Two random variables X and Y are marginally independent, written X ⊥ Y, if
for any state x of X and any state y of Y,
P(X=x | Y=y) = P(X=x), whenever P(Y=y) ≠ 0.
Meaning: learning the value of Y does not give me any information about X, and vice versa. Y contains no information about X and vice versa.
Equivalent definition:
P(X=x ,Y=y) = P(X=x)P(Y=y)
Shorthand for the equations:
P(X |Y ) = P(X ),P(X ,Y ) = P(X )P(Y ).
Marginal independence
Examples:
X: result of tossing a fair coin for the first time; Y: result of the second toss of the same coin.
X: result of the US election; Y: your grades in this course.
Counter example: X: oral presentation grade, Y: project report grade.
Conditional independence
Two random variables X and Y are conditionally independent given a third variable Z, written X ⊥ Y | Z, if
P(X=x | Y=y, Z=z) = P(X=x | Z=z) whenever P(Y=y, Z=z) ≠ 0
Meaning:
If I already know the state of Z, then learning the state of Y does not give me additional information about X. Y might contain some information about X. However, all the information about X contained in Y is also contained in Z.
Shorthand for the equation:
P(X |Y ,Z ) = P(X |Z )
Equivalent definition:
P(X ,Y |Z ) = P(X |Z )P(Y |Z )
Example of Conditional Independence
There is a bag of 100 coins. 10 coins were made by a malfunctioning machine and are biased toward heads: tossing such a coin results in heads 80% of the time. The other coins are fair.
Randomly draw a coin from the bag and toss it a few times.
Xi: result of the i-th toss; Y: whether the coin was produced by the malfunctioning machine.
The Xi's are not marginally independent of each other:
If I get 9 heads in the first 10 tosses, then the coin is probably a biased coin. Hence the next toss will be more likely to result in a head than a tail. Learning the value of Xi gives me some information about whether the coin is biased, which in turn gives me some information about Xj.
Example of Conditional Independence
However, they are conditionally independent given Y :
If the coin is not biased, the probability of getting a head in one toss is 1/2 regardless of the results of other tosses.
If the coin is biased, the probability of getting a head in one toss is 80% regardless of the results of other tosses.
If I already know whether the coin is biased or not, learning the value of Xi does not give me additional information about Xj.
Here is how the variables are related pictorially. We will return to thispicture later.
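A small simulation makes the contrast visible. This is an illustrative sketch, not part of the slides; the seed, sample size, and variable names are my own choices.

```python
import random

random.seed(1)

def draw_and_toss(n_tosses):
    """Draw a coin from the bag (10% biased) and toss it n_tosses times."""
    biased = random.random() < 0.10          # Y: is the coin from the bad machine?
    p_head = 0.8 if biased else 0.5
    tosses = [int(random.random() < p_head) for _ in range(n_tosses)]
    return biased, tosses

n = 200_000
x2_all = []
x2_given_x1 = []          # X2 values when X1 = H
x2_given_x1_fair = []     # X2 values when X1 = H and the coin is fair
for _ in range(n):
    biased, (x1, x2) = draw_and_toss(2)
    x2_all.append(x2)
    if x1 == 1:
        x2_given_x1.append(x2)
        if not biased:
            x2_given_x1_fair.append(x2)

print(sum(x2_all) / len(x2_all))                      # ~0.53 = 0.9*0.5 + 0.1*0.8
print(sum(x2_given_x1) / len(x2_given_x1))            # ~0.55 > 0.53: X1 informs X2
print(sum(x2_given_x1_fair) / len(x2_given_x1_fair))  # ~0.50: given Y, X1 adds nothing
```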
Bayes' Theorem
Prior, posterior, and likelihood
Three important concepts in Bayesian inference.
With respect to a piece of evidence: E
Prior probability P(H): belief about a hypothesis before observing the evidence.
Example: Suppose 10% of people suffer from Hepatitis B. A doctor's prior probability that a new patient suffers from Hepatitis B is 0.1.
Posterior probability P(H | E): belief about a hypothesis after obtaining the evidence.
If the doctor finds that the eyes of the patient are yellow, his belief that the patient suffers from Hepatitis B would be > 0.1.
Prior, posterior, and likelihood
Suppose a patient is observed to have yellow eyes (E ).
Consider two possible explanations:
1 The patient has Hepatitis B (H1).
2 The patient does not have Hepatitis B (H2).
Obviously, H1 is a better explanation because P(E | H1) > P(E | H2). To state it another way, we say that H1 is more likely than H2 given E.
In general, the likelihood of a hypothesis H given evidence E is a measure of how well H explains E. Mathematically, it is
L(H | E) = P(E | H)
In machine learning, we often talk about the likelihood of a model M given data D. It is a measure of how well the model M explains the data D. Mathematically, it is
L(M | D) = P(D | M)
Bayes’ Theorem/Bayes Rule
Bayes' Theorem relates the prior probability, the likelihood, and the posterior probability:
P(H | E) = P(H)P(E | H) / P(E) ∝ P(H)L(H | E)
where P(E) is a normalization constant ensuring ∑_{h∈Ω_H} P(H = h | E) = 1.
That is: posterior ∝ prior× likelihood
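Here is an illustrative application of Bayes' theorem to the coin-bag example from the earlier slides (prior P(biased) = 0.1, and the evidence is 9 heads in the first 10 tosses). The sketch and its variable names are my own, not from the lecture.

```python
from math import comb

def binomial_likelihood(k, n, theta):
    """P(E | H): probability of k heads in n tosses given head probability theta."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

prior = {"biased": 0.1, "fair": 0.9}
theta = {"biased": 0.8, "fair": 0.5}

# Evidence E: 9 heads in the first 10 tosses.
likelihood = {h: binomial_likelihood(9, 10, theta[h]) for h in prior}

# Bayes' theorem: posterior ∝ prior × likelihood, then normalize.
unnorm = {h: prior[h] * likelihood[h] for h in prior}
Z = sum(unnorm.values())                  # P(E), the normalization constant
posterior = {h: unnorm[h] / Z for h in prior}

print(posterior)   # roughly {'biased': 0.75, 'fair': 0.25}
```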
Parameter Estimation
A Simple Problem
Let X be the result of tossing a thumbtack, with ΩX = {H, T}.
Data instances: D1 = H, D2 = T, D3 = H, . . . , Dm = H
Data set: D = {D1, D2, D3, . . . , Dm}
Task: to estimate the parameter θ = P(X=H).
Likelihood
Data: D = {H, T, H, T, T, H, T}
As possible values of θ, which of the following is the most likely? Why?
θ = 0, θ = 0.01, θ = 0.5
θ = 0 contradicts the data because P(D | θ = 0) = 0. It cannot explain the data at all.
θ = 0.01 almost contradicts the data. It does not explain the data well. However, it is more consistent with the data than θ = 0 because P(D | θ = 0.01) > P(D | θ = 0).
θ = 0.5 is more consistent with the data than θ = 0.01 because P(D | θ = 0.5) > P(D | θ = 0.01). It explains the data the best, and is hence the most likely.
Maximum Likelihood Estimation
In general, the larger P(D | θ) is, the more likely the value θ is.
Likelihood of parameter θ given the data set:
L(θ | D) = P(D | θ)
The maximum likelihood estimate (MLE) θ∗ is
θ∗ = arg max_θ L(θ | D).
The MLE best explains, or best fits, the data.
i.i.d and Likelihood
Assume the data instances D1, . . . , Dm are independent given θ:
P(D1, . . . , Dm | θ) = ∏_{i=1}^{m} P(Di | θ)
Assume the data instances are identically distributed:
P(Di = H) = θ,P(Di = T ) = 1−θ for all i
(Note: i.i.d means independent and identically distributed)
Then
L(θ | D) = P(D | θ) = P(D1, . . . , Dm | θ) = ∏_{i=1}^{m} P(Di | θ) = θ^{mh} (1 − θ)^{mt}    (1)
where mh is the number of heads and mt is the number of tails. This is the binomial likelihood.
Example of Likelihood Function
Example: D = {D1 = H, D2 = T, D3 = H, D4 = H, D5 = T}
L(θ | D) = P(D | θ)
         = P(D1 = H | θ)P(D2 = T | θ)P(D3 = H | θ)P(D4 = H | θ)P(D5 = T | θ)
         = θ(1 − θ)θθ(1 − θ)
         = θ^3(1 − θ)^2.
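The following illustrative sketch (not from the slides) evaluates this binomial likelihood for several values of θ, echoing the earlier comparison of θ = 0, 0.01, 0.5, and shows that a crude grid search already lands on the MLE mh/(mh + mt) = 0.6 for this data set.

```python
def likelihood(theta, mh, mt):
    """Binomial likelihood L(theta | D) = theta^mh * (1 - theta)^mt."""
    return theta**mh * (1 - theta)**mt

# D = {H, T, H, H, T}: mh = 3, mt = 2.
for theta in [0.0, 0.01, 0.5, 0.6, 0.9]:
    print(theta, likelihood(theta, 3, 2))

# Grid search over theta: the maximizer is mh / (mh + mt) = 0.6.
grid = [i / 1000 for i in range(1001)]
print(max(grid, key=lambda t: likelihood(t, 3, 2)))   # 0.6
```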
Sufficient Statistic
A sufficient statistic is a function s(D) of the data that summarizes the relevant information for computing the likelihood. That is,
s(D) = s(D′) ⇒ L(θ | D) = L(θ | D′)
Sufficient statistics tell us all there is to know about the data.
Since L(θ | D) = θ^{mh}(1 − θ)^{mt}, the pair (mh, mt) is a sufficient statistic.
Loglikelihood
Loglikelihood:
ℓ(θ | D) = log L(θ | D) = log[θ^{mh}(1 − θ)^{mt}] = mh log θ + mt log(1 − θ)
Maximizing the likelihood is the same as maximizing the loglikelihood. The latter is easier.
Taking the derivative dℓ(θ | D)/dθ and setting it to zero, we get
θ∗ = mh / (mh + mt) = mh / m
MLE is intuitive.
It also has nice properties:
E.g., consistency: θ∗ approaches the true value of θ with probability 1 as m goes to infinity.
Drawback of MLE
Thumbtack tossing:
(mh, mt) = (3, 7). MLE: θ = 0.3.
Reasonable. The data suggest that the thumbtack is biased toward tails.
Coin tossing:
Case 1: (mh, mt) = (3, 7). MLE: θ = 0.3.
Not reasonable. Our experience (prior) strongly suggests that coins are fair, hence θ = 1/2. The size of the data set is too small to convince us that this particular coin is biased. The fact that we get (3, 7) instead of (5, 5) is probably due to randomness.
Case 2: (mh, mt) = (30,000, 70,000). MLE: θ = 0.3.
Reasonable. The data suggest that the coin is after all biased, overshadowing our prior.
MLE does not differentiate between these two cases. It does not take prior information into account.
Two Views on Parameter Estimation
MLE:
Assumes that θ is an unknown but fixed parameter.
Estimates it using θ∗, the value that maximizes the likelihood function
Makes prediction based on the estimation: P(Dm+1 = H|D) = θ∗
Bayesian Estimation:
Treats θ as a random variable.
Assumes a prior probability of θ: p(θ)
Uses data to get posterior probability of θ: p(θ|D)
Two Views on Parameter Estimation
Bayesian Estimation:
Predicting Dm+1
P(Dm+1 = H | D) = ∫ P(Dm+1 = H, θ | D) dθ
                = ∫ P(Dm+1 = H | θ, D) p(θ | D) dθ
                = ∫ P(Dm+1 = H | θ) p(θ | D) dθ
                = ∫ θ p(θ | D) dθ.
Full Bayesian: Take expectation over θ.
Bayesian MAP:
P(Dm+1 = H | D) = θ∗ = arg max_θ p(θ | D)
Calculating Bayesian Estimation
Posterior distribution:
p(θ | D) ∝ p(θ)L(θ | D) = θ^{mh}(1 − θ)^{mt} p(θ)
where the second equality follows from (1).
To facilitate analysis, assume the prior follows a Beta distribution B(αh, αt):
p(θ) ∝ θ^{αh−1}(1 − θ)^{αt−1}
Then
p(θ | D) ∝ θ^{mh+αh−1}(1 − θ)^{mt+αt−1}    (2)
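The conjugate update is a one-liner on the hyperparameters. The sketch below is illustrative; the prior counts and data are made up, and scipy is an assumed dependency used only to evaluate the Beta density.

```python
from scipy.stats import beta

alpha_h, alpha_t = 2, 2        # prior hyperparameters ("imaginary" counts)
mh, mt = 3, 7                  # observed heads and tails

# Per equation (2), the posterior is again a Beta distribution.
post_h, post_t = mh + alpha_h, mt + alpha_t
posterior = beta(post_h, post_t)

print(posterior.mean())        # (mh + alpha_h) / (m + alpha) = 5/14 ≈ 0.357
print(posterior.pdf(0.5))      # posterior density at theta = 0.5
```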
Beta Distribution
The normalization constant for the Beta distribution B(αh, αt) is
Γ(αh + αt) / (Γ(αh)Γ(αt))
where Γ(·) is the Gamma function. For any positive integer α, Γ(α) = (α − 1)!. It is also defined for non-integers.
Density function of the prior Beta distribution B(αh, αt):
p(θ) = [Γ(αh + αt) / (Γ(αh)Γ(αt))] θ^{αh−1}(1 − θ)^{αt−1}    (3)
The hyperparameters αh and αt can be thought of as "imaginary" counts from our prior experiences.
Their sum α = αh + αt is called the equivalent sample size.
The larger the equivalent sample size, the more confident we are in our prior.
Conjugate Families
Binomial likelihood: θ^{mh}(1 − θ)^{mt}
Beta prior: θ^{αh−1}(1 − θ)^{αt−1}
Beta posterior: θ^{mh+αh−1}(1 − θ)^{mt+αt−1}
Beta distributions are hence called a conjugate family for the binomial likelihood.
Conjugate families give a closed-form posterior distribution over the parameters and a closed-form solution for prediction.
Calculating Prediction
We have
P(Dm+1 = H | D) = ∫ θ p(θ | D) dθ
                = c ∫ θ · θ^{mh+αh−1}(1 − θ)^{mt+αt−1} dθ
                = (mh + αh) / (m + α)
where c is the normalization constant, m = mh + mt, and α = αh + αt.
Consequently,
P(Dm+1 = T | D) = (mt + αt) / (m + α)
After taking the data D into consideration, our updated belief in X = T is (mt + αt)/(m + α).
MLE and Bayesian estimation
As m goes to infinity, P(Dm+1 = H | D) approaches the MLE mh/(mh + mt), which approaches the true value of θ with probability 1.
Coin tossing example revisited:
Suppose αh = αt = 100. Equivalent sample size: 200.
In case 1,
P(Dm+1 = H | D) = (3 + 100)/(10 + 100 + 100) ≈ 0.5
Our prior prevails.
In case 2,
P(Dm+1 = H | D) = (30,000 + 100)/(100,000 + 100 + 100) ≈ 0.3
Data prevail.
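An illustrative sketch of the predictive formula reproduces both cases; the helper name predict_head is my own, not from the slides.

```python
def predict_head(mh, mt, alpha_h, alpha_t):
    """Bayesian predictive probability P(D_{m+1} = H | D) = (mh + alpha_h) / (m + alpha)."""
    return (mh + alpha_h) / (mh + mt + alpha_h + alpha_t)

# Prior pseudo-counts for a coin believed to be fair: equivalent sample size 200.
alpha_h = alpha_t = 100

print(predict_head(3, 7, alpha_h, alpha_t))             # ≈ 0.49: the prior prevails
print(predict_head(30_000, 70_000, alpha_h, alpha_t))   # ≈ 0.30: the data prevail

# Compare with the MLE, which ignores the prior and gives 0.3 in both cases.
print(3 / 10, 30_000 / 100_000)
```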
MLE vs Bayesian Estimation
Much of Machine Learning is about parameter estimation.
In all cases, both MLE and Bayesian estimation can be used, although the latter is mathematically harder.
In this course, we will focus on MLE.