Post on 04-Jul-2018

Probability Primer

Seungjin Choi

Department of Computer Science and Engineering, Pohang University of Science and Technology

77 Cheongam-ro, Nam-gu, Pohang 790-784,∼seungjin

Why Probability?

Machine learning deals with uncertain or stochastic quantities.

Sources of uncertainty:

I Inherent stochasticity in the system being modeled

I Incomplete observability

I Incomplete modeling

[Source: Chapter 3 in Deep Learning book by Goodfellow-Bengio-Courville, 2016]

Probabilistic Models in Machine Learning

I A probabilistic model is a joint distribution,

p(x, z),

over observed variables x and hidden variables z.

I Inference about unknowns is carried out by calculating the posterior distribution over hidden variables:

$$p(z \mid x) = \frac{p(x, z)}{p(x)}.$$

I The evidence p(x) is intractable in most models of interest, so we resort to approximate inference. (This is NOT covered in this class; it will be handled in CSED 515.)

Sets, Fields, Events

I A set is a collection of objects. The objects are called elements of the set.

I A sample space Ω is the set of all outcomes of an experiment.

I A subset of Ω is called an event; a collection of subsets of Ω is a collection of events.

I Consider a universal set Ω and a collection of subsets of Ω. Let E, F, . . . denote subsets in this collection. This collection of subsets of Ω forms a field M if

1. ∅ ∈ M and Ω ∈ M.
2. If E ∈ M and F ∈ M, then E ∪ F ∈ M and E ∩ F ∈ M.
3. If E ∈ M, then E^c ∈ M.

I A σ-field ℱ is a field that is closed under any countable set of unions, intersections, and combinations.

Probability Measure

Given a sample space Ω, a function P defined on the subsets of Ω is a probability measure if the following four axioms are satisfied:

1. P(A) ≥ 0 for any event A ∈ ℱ.

2. P(∅) = 0.

3. P(Ω) = 1.

4. $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$ if A_1, A_2, . . . are events that are mutually exclusive (pairwise disjoint).

The probability measure P : ℱ → [0, 1] is a function on ℱ that assigns to each event A ∈ ℱ a number in [0, 1] such that the above axioms are satisfied.

Probability Space

Definition. A probability space is a triplet (Ω, ℱ, P) where Ω is a set, ℱ is a σ-algebra, and P is a probability measure on (Ω, ℱ).

A probability space (Ω, ℱ, P) is a mathematical model of a random experiment, an experiment whose exact outcome cannot be told in advance. The set Ω stands for the collection of all possible outcomes of the experiment. A subset F is said to occur if the outcome of the experiment happens to belong to F. Given our capabilities to measure, detect, and discern, and given the nature of the answers we seek, only certain subsets F are distinguished enough to be of concern whether they occur. The σ-algebra ℱ is the collection of all such subsets whose occurrence is noteworthy and decidable; the elements of ℱ are called events. From this point of view, the conditions for ℱ to be a σ-algebra are logical consequences of the interpretation of the term 'event'. Finally, for each event F, the chance that F occurs is modeled to be the number P(F), called the probability that F occurs.

Consider an experiment of tossing a coin twice.

I The sample space is Ω = {HH, HT, TH, TT}.

I An event is a subset of Ω. For instance, the event "at least one head" is {HH, HT, TH}, and the event "no more than one head" is {HT, TH, TT}.

I A trivial σ-algebra is {∅, Ω}; another example of a σ-algebra is 2^Ω.
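The coin-toss example can be checked numerically. The sketch below assumes a fair coin, so every outcome is equally likely (an assumption that is not stated on the slide), and the names `omega` and `prob` are chosen here for illustration only.

```python
from fractions import Fraction
from itertools import product

# Sample space for two coin tosses: {HH, HT, TH, TT}.
omega = {"".join(t) for t in product("HT", repeat=2)}

def prob(event):
    """Uniform probability measure (fair-coin assumption): |event| / |omega|."""
    if not event <= omega:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(omega))

at_least_one_head = {w for w in omega if "H" in w}
no_more_than_one_head = {w for w in omega if w.count("H") <= 1}

print(prob(at_least_one_head))       # 3/4
print(prob(no_more_than_one_head))   # 3/4
print(prob(set()), prob(omega))      # 0 1

# Additivity for disjoint events.
print(prob({"HH"}) + prob({"TT"}) == prob({"HH", "TT"}))  # True
```

Exact fractions are used instead of floats so that the axioms can be compared with `==`.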

In practice, rather than specifying a particular σ-algebra from scratch, there is usually a class of events of interest, C, which we want to be included in the σ-algebra. Hence, we wish to "complete" C by adding events to it so that we get a σ-algebra. For example, consider the 2-coin toss example again. We find the smallest σ-algebra containing (HH), (HT), (TH), (TT); we call this the σ-algebra "generated" by the fundamental events (HH), (HT), (TH), (TT). Formally, let C be a collection of subsets of Ω. The minimal σ-field generated by C, denoted σ(C), satisfies: (i) C ⊂ σ(C); (ii) if B is any other σ-algebra containing C, then σ(C) ⊂ B.
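For a finite sample space, the generated σ-algebra can be computed by brute force: keep adding complements and pairwise unions until nothing new appears. The following is a minimal sketch under that finiteness assumption; the function name `generate_sigma_algebra` is hypothetical and not taken from any library.

```python
from itertools import combinations

def generate_sigma_algebra(omega, collection):
    """Smallest family containing `collection` that is closed under complement
    and union. On a finite sample space this is the generated sigma-algebra."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(c) for c in collection}
    changed = True
    while changed:
        changed = False
        current = list(family)
        # Close under complement.
        for a in current:
            complement = omega - a
            if complement not in family:
                family.add(complement)
                changed = True
        # Close under pairwise union (finite unions follow by repetition).
        for a, b in combinations(current, 2):
            union = a | b
            if union not in family:
                family.add(union)
                changed = True
    return family

omega = {"HH", "HT", "TH", "TT"}

# Generated by the four fundamental events: the full power set, 2^4 = 16 sets.
singletons = [{"HH"}, {"HT"}, {"TH"}, {"TT"}]
print(len(generate_sigma_algebra(omega, singletons)))   # 16

# Generated by "first toss is heads" alone: only 4 sets.
first_heads = [{"HH", "HT"}]
print(len(generate_sigma_algebra(omega, first_heads)))  # 4
```

Closure under intersection is not enforced explicitly because it follows from complements and unions by De Morgan's laws.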

Joint, Marginal, Conditional, Total Probabilities, Independence

I Joint probability: P(A,B)

I Marginal probability: $P(A) = \sum_{B} P(A, B)$

I Conditional probability: $P(A \mid B) = \frac{P(A, B)}{P(B)}$

I Total probability: $P(B) = \sum_{i=1}^{n} P(B \mid A_i) P(A_i)$, where $\bigcup_{i=1}^{n} A_i = \Omega$ (the A_i's are mutually exclusive)

I Independence: P(A,B) = P(A)P(B)

[Figure: joint distribution P(X, Y) and marginal P(Y). Source: Murphy's]
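The definitions above can be exercised on a small joint table. The numbers in `joint` are made up for illustration, and the helper names are hypothetical; this is a sketch, not a reference implementation.

```python
from fractions import Fraction

# Hypothetical joint distribution P(A, B); the values are illustrative only.
joint = {
    ("a0", "b0"): Fraction(4, 10), ("a0", "b1"): Fraction(1, 10),
    ("a1", "b0"): Fraction(2, 10), ("a1", "b1"): Fraction(3, 10),
}
A_VALUES = ("a0", "a1")
B_VALUES = ("b0", "b1")

def marginal_a(a):
    """P(A = a), obtained by summing the joint over B."""
    return sum(joint[(a, b)] for b in B_VALUES)

def marginal_b(b):
    """P(B = b), obtained by summing the joint over A."""
    return sum(joint[(a, b)] for a in A_VALUES)

def cond_a_given_b(a, b):
    """P(A = a | B = b) = P(a, b) / P(b)."""
    return joint[(a, b)] / marginal_b(b)

def cond_b_given_a(b, a):
    """P(B = b | A = a) = P(a, b) / P(a)."""
    return joint[(a, b)] / marginal_a(a)

def total_probability_b(b):
    """P(B = b) recomputed as the sum over a of P(b | a) P(a)."""
    return sum(cond_b_given_a(b, a) * marginal_a(a) for a in A_VALUES)

def independent():
    """True if P(a, b) = P(a) P(b) for every pair (a, b)."""
    return all(joint[(a, b)] == marginal_a(a) * marginal_b(b)
               for a in A_VALUES for b in B_VALUES)

print(marginal_a("a0"))             # 1/2
print(marginal_b("b0"))             # 3/5
print(cond_a_given_b("a0", "b0"))   # 2/3
print(total_probability_b("b0"))    # 3/5
print(independent())                # False
```

In this table A and B are dependent, because P(a0, b0) = 2/5 while P(a0) P(b0) = 3/10.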

Bayes Theorem

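Theorem (Bayes' theorem). Let A_i, i = 1, . . . , n, be a set of disjoint and exhaustive events, that is, $\bigcup_{i=1}^{n} A_i = \Omega$ and $A_i \cap A_j = \emptyset$ for i ≠ j. For any event B with P(B) > 0, and with P(A_i) ≠ 0 for all i,

$$P(A_j \mid B) = \frac{P(B \mid A_j) P(A_j)}{P(B)} = \frac{P(B \mid A_j) P(A_j)}{\sum_{i=1}^{n} P(B \mid A_i) P(A_i)}.$$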
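As a worked instance of the theorem, the sketch below computes all posteriors P(A_j | B) from priors P(A_i) and likelihoods P(B | A_i). The three-hypothesis numbers are invented for illustration, and `posterior` is a hypothetical helper name.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return [P(A_j | B)] given priors [P(A_i)] and likelihoods [P(B | A_i)].

    The denominator is the total probability P(B) = sum_i P(B | A_i) P(A_i).
    """
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("P(B) must be positive")
    return [j / evidence for j in joint]

# Illustrative numbers: three disjoint, exhaustive hypotheses A_1, A_2, A_3.
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]

post = posterior(priors, likelihoods)
print(post)        # [Fraction(5, 17), Fraction(6, 17), Fraction(6, 17)]
print(sum(post))   # 1
```

Although A_1 has the largest prior, its low likelihood leaves it with the smallest posterior once B is observed.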


Random Variables

A random variable X(ω) (X for short) is a measurable function from a set of possible outcomes Ω to a measurable space E (usually E = ℝ):

X : Ω → E.

The probability that X takes a value in a measurable set S ⊂ E is

P(X ∈ S) = P({ω ∈ Ω | X(ω) ∈ S}).

Definition (Measure-theoretic definition). Let (Ω, ℱ, P) be a probability space and (E, ℰ) a measurable space. Then an (E, ℰ)-valued random variable is a measurable function X : Ω → E, which means that, for every subset B ∈ ℰ, its preimage satisfies

X⁻¹(B) = {ω : X(ω) ∈ B} ∈ ℱ.

This definition enables us to measure any subset B ∈ ℰ in the target space by looking at its preimage, which by assumption is measurable.

[Source: Wikipedia]

Probability Distribution and Density

I Discrete random variables
  I Probability mass function (PMF)

p(x) = P(X = x).

I Continuous random variables
  I Cumulative distribution function (CDF)

F(x) = P(X ≤ x).

  I Probability density function (PDF)

$$p(x) = \frac{dF(x)}{dx}.$$


Gaussian Distribution

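Univariate (x ∈ ℝ)

$$p(x) = \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}$$

[Figure: (a) PDF, (b) CDF. Source: Wikipedia]

Multivariate (x ∈ ℝ^D)

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$$

[Figure source: Wikipedia]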
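The two densities can be evaluated directly from their formulas. The sketch below uses only the Python standard library and, to avoid a matrix library, restricts the multivariate case to D = 2 with the 2x2 inverse written out by hand; the function names are chosen here for illustration.

```python
import math
from statistics import NormalDist

def gaussian_pdf(x, mu, sigma2):
    """Univariate N(x | mu, sigma^2); note sigma2 is the variance."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def gaussian_pdf_2d(x, mu, cov):
    """Multivariate N(x | mu, Sigma) for D = 2, with cov = [[a, b], [c, d]]."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), using the closed-form
    # 2x2 inverse Sigma^{-1} = (1 / det) * [[d, -b], [-c, a]].
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    # Normaliser (2 pi)^{D/2} |Sigma|^{1/2} with D = 2.
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

# Cross-check the univariate formula against the standard library
# (NormalDist takes the standard deviation, not the variance).
print(math.isclose(gaussian_pdf(1.3, 0.5, 4.0), NormalDist(0.5, 2.0).pdf(1.3)))  # True

# With a diagonal covariance the 2-D density factorises into two 1-D densities.
p2 = gaussian_pdf_2d([0.2, -1.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 4.0]])
print(math.isclose(p2, gaussian_pdf(0.2, 0.0, 1.0) * gaussian_pdf(-1.0, 0.0, 4.0)))  # True
```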

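Correlation coefficient

$$\rho = \frac{E[XY] - E[X]E[Y]}{\sigma_x \sigma_y}$$

Covariance matrix

$$\Sigma = \begin{bmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{bmatrix} = \begin{bmatrix} E[X^2] - E^2[X] & E[XY] - E[X]E[Y] \\ E[XY] - E[X]E[Y] & E[Y^2] - E^2[Y] \end{bmatrix}$$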
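The entries of the covariance matrix and the correlation coefficient can be computed from a joint distribution using exactly these moment expressions. The joint PMF below is a made-up example on {0, 1} x {0, 1}, and `expect` is a hypothetical helper.

```python
import math
from fractions import Fraction

# Hypothetical joint PMF p(x, y); the numbers are illustrative only.
joint = {
    (0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 5),
}

def expect(f):
    """E[f(X, Y)] under the joint PMF."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: x * x) - e_x ** 2    # E[X^2] - E^2[X]
var_y = expect(lambda x, y: y * y) - e_y ** 2    # E[Y^2] - E^2[Y]
cov_xy = expect(lambda x, y: x * y) - e_x * e_y  # E[XY] - E[X]E[Y]

# Covariance matrix with exact rational entries.
cov_matrix = [[var_x, cov_xy], [cov_xy, var_y]]

# Correlation coefficient rho = Cov(X, Y) / (sigma_x * sigma_y).
rho = float(cov_xy) / math.sqrt(float(var_x * var_y))

print(var_x, var_y, cov_xy)  # 1/4 1/4 3/20
print(round(rho, 6))         # 0.6
```

The off-diagonal entry 3/20 equals ρ σ_x σ_y = 0.6 × 0.5 × 0.5, matching the first form of the matrix.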


Note that the CDF F is a monotonically increasing function, so it has an inverse. Denote by F⁻¹ the inverse of the CDF F.

The α quantile of F is F⁻¹(α), the value x_α such that P(X ≤ x_α) = α.

[Figure: density centered at 0 with the central region between Φ⁻¹(α/2) and Φ⁻¹(1 − α/2) and a tail of mass α/2 on each side. Source: Murphy's]
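For the standard normal, the quantiles in the figure can be computed with the standard library's `statistics.NormalDist`, whose `inv_cdf` method is the inverse CDF. The choice α = 0.05 below is an illustrative assumption.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)
alpha = 0.05  # illustrative choice

lower = std_normal.inv_cdf(alpha / 2)        # Phi^{-1}(alpha / 2)
upper = std_normal.inv_cdf(1 - alpha / 2)    # Phi^{-1}(1 - alpha / 2)

print(round(lower, 4), round(upper, 4))      # -1.96 1.96

# The central interval (lower, upper) carries probability 1 - alpha.
central_mass = std_normal.cdf(upper) - std_normal.cdf(lower)
print(round(central_mass, 6))                # 0.95
```

By symmetry of the standard normal, the two quantiles differ only in sign.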

Product and Sum Rules

I Product rule

p(x , y) = p(x |y)p(y).

I Sum rule

$$p(x) = \sum_{y} p(x, y) = \sum_{y} p(x \mid y) \, p(y).$$

Mean, Variance, Moments

I Mean (ensemble average, statistical average, expected value)

$$\mu_x = E[x] = \sum_{x \in S_x} x \, p(x) \quad \text{(discrete)},$$

$$\mu_x = E[x] = \int x \, p(x) \, dx \quad \text{(continuous)}.$$

I Variance

$$\sigma_x^2 = E\left[(x - \mu_x)^2\right] = E[x^2] - \mu_x^2.$$

I Moment

$$m_k = E[x^k].$$
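These definitions are easy to verify on a discrete example. The sketch below uses a fair six-sided die (an illustrative choice) and checks that the two expressions for the variance agree; `moment` is a hypothetical helper name.

```python
from fractions import Fraction

# PMF of a fair six-sided die (illustrative example).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def moment(pmf, k):
    """k-th moment m_k = E[x^k] = sum_x x^k p(x)."""
    return sum((x ** k) * p for x, p in pmf.items())

mean = moment(pmf, 1)
var_from_moments = moment(pmf, 2) - mean ** 2                     # E[x^2] - mu^2
var_direct = sum(((x - mean) ** 2) * p for x, p in pmf.items())   # E[(x - mu)^2]

print(mean)                            # 7/2
print(var_from_moments)                # 35/12
print(var_direct == var_from_moments)  # True
```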

Random Vector and Covariance

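A random vector x ∈ ℝ^n is a collection of n random variables, {x_i}_{i=1}^n. The probability density function of the random vector x is defined by the joint density function, i.e.,

p(x) = p(x_1, . . . , x_n).

A mean vector and a covariance matrix are defined by

$$\boldsymbol{\mu} = E[\mathbf{x}] = \sum_{\mathbf{x}} \mathbf{x} \, p(\mathbf{x}),$$

$$\boldsymbol{\Sigma} = E\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{\top}\right] = \sum_{\mathbf{x}} (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{\top} p(\mathbf{x}).$$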

Bernoulli & Categorical Distributions

I Bernoulli distribution is the distribution for a single binary random variable x ∈ {0, 1} parameterized by µ = P(x = 1).

$$\mathrm{Bern}(x \mid \mu) = \mu^{x} (1 - \mu)^{1 - x}.$$

I Categorical distribution is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible elementary events, x ∈ {1, 2, . . . , K}.

$$p(x) = \prod_{i=1}^{K} P(x = i)^{\mathbb{I}[x = i]}.$$

Or, using 1-of-K encoded random vectors x = [x_1, . . . , x_K]^⊤ of dimension K,

$$p(\mathbf{x}) = \prod_{i=1}^{K} P(x = i)^{x_i}.$$
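The product forms above look more complicated than a table lookup, but they evaluate to the same thing: every factor whose exponent is 0 contributes 1. The sketch below implements the three formulas literally; the function names and the parameter values are illustrative assumptions.

```python
import math

def bernoulli_pmf(x, mu):
    """Bern(x | mu) = mu^x (1 - mu)^(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return (mu ** x) * ((1.0 - mu) ** (1 - x))

def categorical_pmf(x, probs):
    """p(x) = prod_i P(x = i)^I[x = i] for x in {1, ..., K}; probs[i - 1] = P(x = i)."""
    if not 1 <= x <= len(probs):
        raise ValueError("x must be in {1, ..., K}")
    result = 1.0
    for i, p in enumerate(probs, start=1):
        result *= p ** (1 if x == i else 0)
    return result

def categorical_pmf_one_hot(x_vec, probs):
    """p(x) = prod_i P(x = i)^{x_i} for a 1-of-K encoded vector x_vec."""
    if len(x_vec) != len(probs) or sorted(x_vec) != [0] * (len(probs) - 1) + [1]:
        raise ValueError("x_vec must be a 1-of-K vector of length K")
    result = 1.0
    for x_i, p in zip(x_vec, probs):
        result *= p ** x_i
    return result

probs = [0.2, 0.5, 0.3]  # illustrative parameters, K = 3

print(math.isclose(bernoulli_pmf(1, 0.3), 0.3))                       # True
print(math.isclose(categorical_pmf(2, probs), 0.5))                   # True
print(math.isclose(categorical_pmf_one_hot([0, 1, 0], probs), 0.5))   # True
```

With K = 2 and the encoding [x, 1 − x], the 1-of-K form reduces to the Bernoulli formula.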

