Post on 26-Jun-2020
Review of Probability Axioms and Laws
Reference: D. P. Bertsekas and J. N. Tsitsiklis, Introduction to Probability, Athena Scientific, 2008.
Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University
What is “Probability” ?
• Probability was developed to describe phenomena that cannot be predicted with certainty
– Frequency of occurrences
– Subjective beliefs
• Everyone accepts that the probability (of a certain thing happening) is a number between 0 and 1 (?)
• Measures deduced from probability axioms and theories (laws/rules) can help us deal with and quantify “information”
Sets (1/2)
• A set is a collection of objects, which are the elements of the set
– If x is an element of set S, we write x ∈ S
– Otherwise we write x ∉ S
• A set that has no elements is called the empty set and is denoted by Ø
• Set specification
– Countably finite: {1, 2, 3, 4, 5, 6}
– Countably infinite: {0, 2, −2, 4, −4, ...}
– With a certain property: {x | x satisfies P}, e.g., {k | k/2 is an integer} or {x | 0 ≤ x ≤ 1}
Sets (2/2)
• If every element of a set S is also an element of a set T, then S is a subset of T
– Denoted by S ⊂ T or T ⊃ S
• If S ⊂ T and T ⊂ S, then the two sets are equal
– Denoted by S = T
• The universal set, denoted by Ω, contains all objects of interest in a particular context
– After specifying the context in terms of the universal set Ω, we only consider sets that are subsets of Ω
Set Operations (1/3)
• Complement
– The complement of a set S, with respect to the universe Ω, is the set {x ∈ Ω | x ∉ S} of all elements of Ω that do not belong to S, denoted by Sᶜ
– The complement of the universe is the empty set: Ωᶜ = Ø
• Union
– The union of two sets S and T is the set of all elements that belong to S or T, denoted by S ∪ T:
S ∪ T = {x | x ∈ S or x ∈ T}
• Intersection
– The intersection of two sets S and T is the set of all elements that belong to both S and T, denoted by S ∩ T:
S ∩ T = {x | x ∈ S and x ∈ T}
Set Operations (2/3)
• The union or the intersection of several (or even infinitely many) sets:
∪_n S_n = S_1 ∪ S_2 ∪ ⋯ = {x | x ∈ S_n for some n}
∩_n S_n = S_1 ∩ S_2 ∩ ⋯ = {x | x ∈ S_n for all n}
• Disjoint
– Two sets S and T are disjoint if their intersection is empty (i.e., S ∩ T = Ø)
• Partition
– A collection of sets is said to be a partition of a set S if the sets in the collection are disjoint and their union is S
Set Operations (3/3)
• Visualization of set operations with Venn diagrams
The Algebra of Sets
• The following identities are elementary consequences of the set definitions and operations:
S ∪ T = T ∪ S (commutative)
S ∪ (T ∪ U) = (S ∪ T) ∪ U (associative)
S ∩ (T ∪ U) = (S ∩ T) ∪ (S ∩ U) (distributive)
S ∪ (T ∩ U) = (S ∪ T) ∩ (S ∪ U) (distributive)
(Sᶜ)ᶜ = S
S ∩ Sᶜ = Ø
S ∪ Ω = Ω
S ∩ Ω = S
• De Morgan's laws:
(∪_n S_n)ᶜ = ∩_n S_nᶜ
(∩_n S_n)ᶜ = ∪_n S_nᶜ
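The identities above can be spot-checked on concrete finite sets. The sets `S`, `T`, `U` and the universe `omega` below are arbitrary illustrative choices:

```python
# Checking the algebra-of-sets identities on small concrete sets.
omega = set(range(10))            # universal set (illustrative choice)
S, T, U = {1, 2, 3}, {3, 4, 5}, {5, 6}

def comp(A):
    """Complement with respect to the universe omega."""
    return omega - A

# Distributive laws
assert S & (T | U) == (S & T) | (S & U)
assert S | (T & U) == (S | T) & (S | U)

# De Morgan's laws
assert comp(S | T) == comp(S) & comp(T)
assert comp(S & T) == comp(S) | comp(T)
```

Python's built-in `set` operators (`|`, `&`, `-`) map directly onto union, intersection, and set difference, so each identity becomes a one-line check.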
Probabilistic Models (1/2)
• A probabilistic model is a mathematical description of an uncertain situation
– It has to be in accordance with a fundamental framework, to be discussed shortly
• Elements of a probabilistic model
– The sample space Ω
• The set of all possible outcomes of an experiment
– The probability law
• Assigns to a set A of possible outcomes (also called an event) a nonnegative number P(A) (called the probability of A) that encodes our knowledge or belief about the collective "likelihood" of the elements of A
Probability Axioms
• Nonnegativity: P(A) ≥ 0 for every event A
• Additivity: if A and B are disjoint events, then P(A ∪ B) = P(A) + P(B) (and similarly for any sequence of disjoint events)
• Normalization: P(Ω) = 1
Probabilistic Models (2/2)
• The main ingredients of a probabilistic model
Sample Spaces and Events (1/2)
• Each probabilistic model involves an underlying process, called the experiment
– The experiment produces exactly one out of several possible outcomes
– The set of all possible outcomes is called the sample space of the experiment, denoted by Ω
– A subset of the sample space (a collection of possible outcomes) is called an event
• Examples of experiments
– A single toss of a coin (finite outcomes)
– Three tosses of two dice (finite outcomes)
– An infinite sequence of tosses of a coin (infinite outcomes)
– Throwing a dart on a square (infinite outcomes), etc.
Sample Spaces and Events (2/2)
• Properties of the sample space
– Elements of the sample space must be mutually exclusive
– The sample space must be collectively exhaustive
– The sample space should be at the "right" granularity (avoiding irrelevant details)
Probability Laws
• Discrete Probability Law
– If the sample space consists of a finite number of possible outcomes, then the probability law is specified by the probabilities of the events that consist of a single element. In particular, the probability of any event {s_1, s_2, ..., s_n} is the sum of the probabilities of its elements:
P({s_1, s_2, ..., s_n}) = P({s_1}) + P({s_2}) + ⋯ + P({s_n})
• Discrete Uniform Probability Law
– If the sample space consists of n possible outcomes which are equally likely (i.e., all single-element events have the same probability), then the probability of any event A is given by
P(A) = (number of elements of A) / n
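The discrete uniform law reduces probability to counting. A minimal sketch, using two rolls of a fair 6-sided die as an illustrative experiment (the event "sum is 7" is an arbitrary choice):

```python
from fractions import Fraction

# Discrete uniform law: P(A) = |A| / n over equally likely outcomes.
sample_space = [(i, j) for i in range(1, 7) for j in range(1, 7)]
n = len(sample_space)                        # 36 equally likely outcomes

# Event A: the sum of the two rolls is 7
A = [w for w in sample_space if w[0] + w[1] == 7]
p_A = Fraction(len(A), n)
assert p_A == Fraction(1, 6)                 # 6 favorable outcomes out of 36
```

Using `Fraction` keeps the probabilities exact, which makes the counting argument easy to verify.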
Continuous Models
• Probabilistic models with continuous sample spaces
– It is inappropriate to assign probability to each single-element event (?)
– Instead, it makes sense to assign probability to any interval (one-dimensional) or area (two-dimensional) of the sample space
• Example: Wheel of Fortune — P(x = 0.3)? P(x = 0.33)? P(x = 0.333)? What about P(a ≤ x ≤ b)?
Properties of Probability Laws
• Probability laws have a number of properties, which can be deduced from the axioms. Some of them are summarized below:
– If A ⊂ B, then P(A) ≤ P(B)
– P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
– P(A ∪ B) ≤ P(A) + P(B)
Conditional Probability (1/2)
• Conditional probability provides us with a way to reason about the outcome of an experiment, based on partial information
– Suppose that the outcome is within some given event B; we wish to quantify the likelihood that the outcome also belongs to some other given event A
– Using a new probability law, we have the conditional probability of A given B, denoted by P(A|B), which is defined as:
P(A|B) = P(A ∩ B) / P(B)
• If B has zero probability, P(A|B) is undefined
• We can think of P(A|B) as, out of the total probability of the elements of B, the fraction that is assigned to possible outcomes that also belong to A
Conditional Probability (2/2)
• When all outcomes of the experiment are equally likely, the conditional probability can also be defined as
P(A|B) = (number of elements of A ∩ B) / (number of elements of B)
• Some examples having to do with conditional probability
1. In an experiment involving two successive rolls of a die, you are told that the sum of the two rolls is 9. How likely is it that the first roll was a 6?
2. In a word guessing game, the first letter of the word is a "t". What is the likelihood that the second letter is an "h"?
3. How likely is it that a person has a disease given that a medical test was negative?
4. A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
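Example 1 above can be settled by direct counting, since all 36 outcomes of the two rolls are equally likely:

```python
from fractions import Fraction

# P(first roll is 6 | sum is 9) = |A ∩ B| / |B|, by counting.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
B = [w for w in outcomes if w[0] + w[1] == 9]        # sum is 9
A_and_B = [w for w in B if w[0] == 6]                # ... and first roll is 6

p = Fraction(len(A_and_B), len(B))
assert len(B) == 4                # (3,6), (4,5), (5,4), (6,3)
assert p == Fraction(1, 4)
```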
Conditional Probabilities Satisfy the Three Axioms
• Nonnegativity: P(A|B) ≥ 0
• Normalization:
P(Ω|B) = P(Ω ∩ B) / P(B) = P(B) / P(B) = 1
• Additivity: if A_1 and A_2 are two disjoint events,
P(A_1 ∪ A_2 | B) = P((A_1 ∪ A_2) ∩ B) / P(B)
= P((A_1 ∩ B) ∪ (A_2 ∩ B)) / P(B)   (distributive law)
= [P(A_1 ∩ B) + P(A_2 ∩ B)] / P(B)   (A_1 ∩ B and A_2 ∩ B are disjoint sets)
= P(A_1|B) + P(A_2|B)
Multiplication (Chain) Rule
• Assuming that all of the conditioning events have positive probability, we have
P(A_1 ∩ A_2 ∩ ⋯ ∩ A_n) = P(A_1) P(A_2|A_1) P(A_3|A_1 ∩ A_2) ⋯ P(A_n | A_1 ∩ ⋯ ∩ A_{n−1})
– The above formula can be verified by writing
P(A_1 ∩ ⋯ ∩ A_n) = P(A_1) · [P(A_1 ∩ A_2) / P(A_1)] · [P(A_1 ∩ A_2 ∩ A_3) / P(A_1 ∩ A_2)] ⋯ [P(A_1 ∩ ⋯ ∩ A_n) / P(A_1 ∩ ⋯ ∩ A_{n−1})]
so that adjacent numerators and denominators cancel
– For the case of just two events, the multiplication rule is simply the definition of conditional probability:
P(A_1 ∩ A_2) = P(A_1) P(A_2|A_1)
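A sketch of the chain rule on a standard textbook-style example (not taken from this slide): draw 3 cards from a 52-card deck without replacement, and ask for the probability that none of them is a heart:

```python
from fractions import Fraction
from math import comb

# P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1 ∩ A2)
# A_k = "the k-th card drawn is not a heart" (39 non-hearts in the deck).
p = Fraction(39, 52) * Fraction(38, 51) * Fraction(37, 50)

# Cross-check by direct counting: C(39,3) / C(52,3)
assert p == Fraction(comb(39, 3), comb(52, 3))
```

The conditional factors shrink the counts (39 → 38 → 37 non-hearts, 52 → 51 → 50 cards) exactly because each draw is conditioned on the previous ones.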
Total Probability Theorem
• Let A_1, ..., A_n be disjoint events that form a partition of the sample space and assume that P(A_i) > 0 for all i. Then, for any event B, we have
P(B) = P(A_1 ∩ B) + ⋯ + P(A_n ∩ B) = P(A_1) P(B|A_1) + ⋯ + P(A_n) P(B|A_n)
– Note that each possible outcome of the experiment (sample space) is included in one and only one of the events A_1, ..., A_n
Bayes’ Rule
• Let A_1, A_2, ..., A_n be disjoint events that form a partition of the sample space, and assume that P(A_i) > 0 for all i. Then, for any event B such that P(B) > 0, we have
P(A_i|B) = P(A_i ∩ B) / P(B)
= P(A_i) P(B|A_i) / P(B)   (multiplication rule)
= P(A_i) P(B|A_i) / [P(A_1) P(B|A_1) + ⋯ + P(A_n) P(B|A_n)]   (total probability theorem)
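A hedged numerical sketch of Bayes' rule on a disease-testing scenario like example 3 earlier; all the numbers are hypothetical, chosen only to illustrate the computation:

```python
# Partition: A1 = has disease, A2 = does not; evidence B = positive test.
p_disease = 0.01                   # prior P(A1) (hypothetical)
p_pos_given_disease = 0.95         # P(B | A1) (hypothetical sensitivity)
p_pos_given_healthy = 0.10         # P(B | A2) (hypothetical false-positive rate)

# Total probability theorem gives the denominator P(B)
p_pos = p_disease * p_pos_given_disease + (1 - p_disease) * p_pos_given_healthy

# Bayes' rule
p_disease_given_pos = p_disease * p_pos_given_disease / p_pos
print(round(p_disease_given_pos, 3))   # → 0.088
```

Even with a fairly accurate test, the posterior stays small because the prior is small, which is exactly the kind of reasoning Bayes' rule makes explicit.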
Independence (1/2)
• Recall that the conditional probability P(A|B) captures the partial information that event B provides about event A
• A special case arises when the occurrence of B provides no such information and does not alter the probability that A has occurred:
P(A|B) = P(A)
– In this case A is independent of B, and equivalently
P(A ∩ B) = P(A|B) P(B) = P(A) P(B)
so B is also independent of A (by symmetry, P(B|A) = P(B))
Independence (2/2)
• A and B are independent => A and B are disjoint (?)
– No! Why?
• If A and B are disjoint, then P(A ∩ B) = 0
• However, independence requires P(A ∩ B) = P(A) P(B), and if P(A) > 0 and P(B) > 0, then P(A) P(B) > 0
• Two disjoint events A and B with P(A) > 0 and P(B) > 0 are never independent
Conditional Independence (1/2)
• Given an event C, the events A and B are called conditionally independent if
P(A ∩ B | C) = P(A|C) P(B|C)   (1)
– We also know that, by the multiplication rule,
P(A ∩ B ∩ C) = P(C) P(B|C) P(A | B ∩ C)
so that
P(A ∩ B | C) = P(A ∩ B ∩ C) / P(C) = P(B|C) P(A | B ∩ C)   (2)
– If P(B ∩ C) > 0, combining (1) and (2) gives an alternative way to express conditional independence:
P(A | B ∩ C) = P(A|C)   (3)
Conditional Independence (2/2)
• Notice that independence of two events A and B with respect to the unconditional probability law does not imply conditional independence, and vice versa:
P(A ∩ B) = P(A) P(B) does not imply P(A ∩ B | C) = P(A|C) P(B|C)
• If A and B are independent, the same holds for
(i) A and Bᶜ
(ii) Aᶜ and B
(iii) Aᶜ and Bᶜ
Independence of a Collection of Events
• We say that the events A_1, A_2, ..., A_n are independent if
P(∩_{i∈S} A_i) = Π_{i∈S} P(A_i), for every subset S of {1, 2, ..., n}
(2ⁿ − n − 1 nontrivial conditions in total)
• For example, the independence of three events A_1, A_2, A_3 amounts to satisfying the four conditions
P(A_1 ∩ A_2) = P(A_1) P(A_2)
P(A_1 ∩ A_3) = P(A_1) P(A_3)
P(A_2 ∩ A_3) = P(A_2) P(A_3)
P(A_1 ∩ A_2 ∩ A_3) = P(A_1) P(A_2) P(A_3)
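The fourth condition is not implied by the first three. A classic counterexample (an illustrative choice, not from the slide): toss a fair coin twice and check that the three events below are pairwise independent but not independent as a collection:

```python
from fractions import Fraction

space = ["HH", "HT", "TH", "TT"]                 # equally likely outcomes

def P(event):
    return Fraction(len(event), len(space))

A1 = {w for w in space if w[0] == "H"}           # first toss is heads
A2 = {w for w in space if w[1] == "H"}           # second toss is heads
A3 = {w for w in space if w[0] != w[1]}          # the two tosses differ

# Pairwise conditions all hold ...
assert P(A1 & A2) == P(A1) * P(A2)
assert P(A1 & A3) == P(A1) * P(A3)
assert P(A2 & A3) == P(A2) * P(A3)
# ... but the three-way condition fails (A1 ∩ A2 ∩ A3 is empty)
assert P(A1 & A2 & A3) != P(A1) * P(A2) * P(A3)
```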
Random Variables
• Given an experiment and the corresponding set of possible outcomes (the sample space), a random variable associates a particular number with each outcome
– This number is referred to as the (numerical) value of the random variable
– We can say a random variable is a real-valued function of the experimental outcome: X: w ↦ x
Random Variables: Example
• An experiment consists of two rolls of a 4-sided die, and the random variable is the maximum of the two rolls
– If the outcome of the experiment is (4, 2), the value of this random variable is 4
– If the outcome of the experiment is (3, 3), the value of this random variable is 3
– The mapping can be one-to-one or many-to-one
Discrete/Continuous Random Variables
• A random variable is called discrete if its range (the set of values that it can take) is finite or at most countably infinite
(finite: e.g., {1, 2, 3, 4}; countably infinite: e.g., {1, 2, ...})
• A random variable is called continuous (not discrete) if its range (the set of values that it can take) is uncountably infinite
– E.g., in the experiment of choosing a point a from the interval [−1, 1], a random variable that associates the numerical value a² with the outcome a is not discrete
Concepts Related to Discrete Random Variables
• For a probabilistic model of an experiment
– A discrete random variable is a real-valued function of the outcome of the experiment that can take a finite or countably infinite number of values
– A (discrete) random variable has an associated probability mass function (PMF), which gives the probability of each numerical value that the random variable can take
– A function of a random variable defines another random variable, whose PMF can be obtained from the PMF of the original random variable
Probability Mass Function
• A (discrete) random variable X is characterized through the probabilities of the values that it can take, which are captured by the probability mass function (PMF) of X, denoted p_X(x):
p_X(x) = P(X = x)
– The sum of the probabilities of all outcomes that give rise to a value of X equal to x
– Upper case characters (e.g., X) denote random variables, while lower case ones (e.g., x) denote the numerical values of a random variable
• The summation of the PMF of a random variable over all its possible numerical values is equal to one:
Σ_x p_X(x) = 1
since the events {X = x} are disjoint and form a partition of the sample space
Calculation of the PMF
• For each possible value x of a random variable X:
1. Collect all the possible outcomes that give rise to the event {X = x}
2. Add their probabilities to obtain p_X(x)
• An example: the PMF of the random variable X = maximum roll in two independent rolls of a fair 4-sided die
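The two steps above can be sketched directly on the slide's example (16 equally likely outcomes):

```python
from fractions import Fraction
from collections import Counter

# X = max of two independent rolls of a fair 4-sided die.
outcomes = [(i, j) for i in range(1, 5) for j in range(1, 5)]
counts = Counter(max(w) for w in outcomes)              # step 1: collect outcomes
p_X = {x: Fraction(c, 16) for x, c in counts.items()}   # step 2: add probabilities

assert p_X == {1: Fraction(1, 16), 2: Fraction(3, 16),
               3: Fraction(5, 16), 4: Fraction(7, 16)}
assert sum(p_X.values()) == 1                           # the PMF sums to one
```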
Expectation
• The expected value (also called the expectation or the mean) of a random variable X, with PMF p_X(x), is defined by
E[X] = Σ_x x p_X(x)
– Can be interpreted as the center of gravity of the PMF: c = E[X] satisfies Σ_x (x − c) p_X(x) = 0
(Or a weighted average, in proportion to probabilities, of the possible values of X)
• The expectation is well-defined if
Σ_x |x| p_X(x) < ∞
– That is, Σ_x x p_X(x) converges to a finite value
Expectations for Functions of Random Variables
• Let X be a random variable with PMF p_X(x), and let g(X) be a function of X. Then, the expected value of the random variable g(X) is given by
E[g(X)] = Σ_x g(x) p_X(x)
• To verify the above rule
– Let Y = g(X), and therefore p_Y(y) = Σ_{x: g(x)=y} p_X(x)
E[g(X)] = E[Y] = Σ_y y p_Y(y) = Σ_y y Σ_{x: g(x)=y} p_X(x)
= Σ_y Σ_{x: g(x)=y} g(x) p_X(x) = Σ_x g(x) p_X(x)
Variance
• The variance of a random variable X is the expected value of the random variable (X − E[X])²:
var(X) = E[(X − E[X])²] = Σ_x (x − E[X])² p_X(x)
– The variance is always nonnegative (why?)
– The variance provides a measure of dispersion of X around its mean E[X]
– The standard deviation is another measure of dispersion, defined as the square root of the variance:
σ_X = √var(X)
• Easier to interpret, because it has the same units as X
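Mean and variance can be computed directly from a PMF. A minimal sketch, using the slide's 4-sided-die maximum (p_X(x) = 1/16, 3/16, 5/16, 7/16 for x = 1, ..., 4):

```python
from fractions import Fraction

p_X = {1: Fraction(1, 16), 2: Fraction(3, 16),
       3: Fraction(5, 16), 4: Fraction(7, 16)}

mean = sum(x * p for x, p in p_X.items())               # E[X] = Σ x p_X(x)
var = sum((x - mean) ** 2 * p for x, p in p_X.items())  # E[(X − E[X])²]

assert mean == Fraction(25, 8)      # E[X] = 3.125
assert var == Fraction(55, 64)      # var(X) ≈ 0.859
```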
Properties of Mean and Variance
• Let X be a random variable and let
Y = aX + b
where a and b are given scalars (Y is a linear function of X). Then,
E[Y] = a E[X] + b,  var(Y) = a² var(X)
• If g(X) is a linear function of X, then E[g(X)] = g(E[X]). How to verify it?
(In general, E[g(X)] ≠ g(E[X]).)
Joint PMF of Random Variables
• Let X and Y be random variables associated with the same experiment (also the same sample space and probability law); the joint PMF of X and Y is defined by
p_{X,Y}(x, y) = P(X = x, Y = y)
• If event A is the set of all pairs (x, y) that have a certain property, then the probability of A can be calculated by
P((X, Y) ∈ A) = Σ_{(x,y)∈A} p_{X,Y}(x, y)
– Namely, A can be specified in terms of X and Y
Marginal PMFs of Random Variables
• The PMFs of random variables X and Y can be calculated from their joint PMF:
p_X(x) = Σ_y p_{X,Y}(x, y),  p_Y(y) = Σ_x p_{X,Y}(x, y)
– p_X(x) and p_Y(y) are often referred to as the marginal PMFs
– The above two equations can be verified by
p_X(x) = P(X = x) = Σ_y P(X = x, Y = y) = Σ_y p_{X,Y}(x, y)
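A sketch of building a joint PMF and recovering the marginals by summation, using X = first roll and Y = maximum of two independent rolls of a fair 4-sided die (an illustrative choice that reuses the slide's die example):

```python
from fractions import Fraction
from collections import defaultdict

p_XY = defaultdict(Fraction)                  # joint PMF p_{X,Y}(x, y)
for i in range(1, 5):
    for j in range(1, 5):
        p_XY[(i, max(i, j))] += Fraction(1, 16)

# Marginals: sum the joint PMF over the other variable
p_X = defaultdict(Fraction)
p_Y = defaultdict(Fraction)
for (x, y), p in p_XY.items():
    p_X[x] += p
    p_Y[y] += p

assert all(p == Fraction(1, 4) for p in p_X.values())   # X (first roll) is uniform
assert p_Y[4] == Fraction(7, 16)                        # Y matches the max-roll PMF
assert sum(p_XY.values()) == 1
```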
Conditioning
• Recall that conditional probability provides us with a way to reason about the outcome of an experiment, based on partial information
• In the same spirit, we can define conditional PMFs, given the occurrence of a certain event or given the value of another random variable
Conditioning a Random Variable on an Event (1/2)
• The conditional PMF of a random variable X, conditioned on a particular event A with P(A) > 0, is defined by (where X and A are associated with the same experiment)
p_{X|A}(x) = P(X = x | A) = P({X = x} ∩ A) / P(A)
• Normalization Property
– Note that the events {X = x} ∩ A are disjoint for different values of x, and their union is A. By the total probability theorem,
Σ_x P({X = x} ∩ A) = P(A)
so that
Σ_x p_{X|A}(x) = 1
Conditioning a Random Variable on an Event (2/2)
• A graphical illustration
– p_{X|A}(x) is obtained by adding the probabilities of the outcomes that give rise to X = x and belong to the conditioning event A, and then normalizing by P(A)
Conditioning a Random Variable on Another (1/2)
• Let X and Y be two random variables associated with the same experiment. The conditional PMF p_{X|Y} of X given Y is defined as
p_{X|Y}(x|y) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y) = p_{X,Y}(x, y) / p_Y(y)
• Normalization Property: for any fixed value y,
Σ_x p_{X|Y}(x|y) = 1
• The conditional PMF is often convenient for the calculation of the joint PMF, via the multiplication (chain) rule:
p_{X,Y}(x, y) = p_Y(y) p_{X|Y}(x|y) = p_X(x) p_{Y|X}(y|x)
Conditioning a Random Variable on Another (2/2)
• The conditional PMF can also be used to calculate the marginal PMFs:
p_X(x) = Σ_y p_{X,Y}(x, y) = Σ_y p_Y(y) p_{X|Y}(x|y)
• Visualization of the conditional PMF: for each fixed y, p_{X|Y}(x|y) is the slice p_{X,Y}(x, y) of the joint PMF, renormalized so that it sums to one
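A minimal sketch of computing p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y) from a joint PMF, again using X = first roll and Y = maximum of two independent rolls of a fair 4-sided die as an illustrative choice:

```python
from fractions import Fraction

# Joint PMF of (X, Y) = (first roll, max of both rolls).
p_XY = {}
for i in range(1, 5):
    for j in range(1, 5):
        key = (i, max(i, j))
        p_XY[key] = p_XY.get(key, Fraction(0)) + Fraction(1, 16)

def p_Y(y):
    return sum(p for (x, yy), p in p_XY.items() if yy == y)

def p_X_given_Y(x, y):
    return p_XY.get((x, y), Fraction(0)) / p_Y(y)

# Given Y = 4, the first roll is 4 with probability (4/16) / (7/16) = 4/7
assert p_X_given_Y(4, 4) == Fraction(4, 7)
# Normalization: the conditional PMF sums to one for fixed y
assert sum(p_X_given_Y(x, 4) for x in range(1, 5)) == 1
```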
Independence of a Random Variable from an Event
• A random variable X is independent of an event A if
P(X = x and A) = P(X = x) P(A) = p_X(x) P(A), for all x
– This requires the two events {X = x} and A to be independent, for all x
• If a random variable X is independent of an event A and P(A) > 0, then
p_{X|A}(x) = P(X = x and A) / P(A) = P(X = x) P(A) / P(A) = p_X(x), for all x
Independence of Random Variables (1/2)
• Two random variables X and Y are independent if
p_{X,Y}(x, y) = p_X(x) p_Y(y), for all x, y
or, equivalently, P(X = x, Y = y) = P(X = x) P(Y = y), for all x, y
• If a random variable X is independent of a random variable Y, then
p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y) = p_X(x) p_Y(y) / p_Y(y) = p_X(x), for all x and all y with p_Y(y) > 0
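The factorization condition can be checked mechanically from a joint PMF. A sketch on two fair 4-sided dice (the helper names `joint` and `independent` are our own): the two rolls are independent, while the first roll and the maximum are not:

```python
from fractions import Fraction

outcomes = [(i, j) for i in range(1, 5) for j in range(1, 5)]

def joint(f, g):
    """Joint PMF of the random variables f(w) and g(w)."""
    p = {}
    for w in outcomes:
        key = (f(w), g(w))
        p[key] = p.get(key, Fraction(0)) + Fraction(1, 16)
    return p

def independent(f, g):
    """Check p_{X,Y}(x, y) = p_X(x) p_Y(y) for all x, y."""
    p = joint(f, g)
    p_f, p_g = {}, {}
    for (x, y), q in p.items():
        p_f[x] = p_f.get(x, Fraction(0)) + q
        p_g[y] = p_g.get(y, Fraction(0)) + q
    return all(p.get((x, y), Fraction(0)) == p_f[x] * p_g[y]
               for x in p_f for y in p_g)

assert independent(lambda w: w[0], lambda w: w[1])        # the two rolls
assert not independent(lambda w: w[0], lambda w: max(w))  # first roll vs. max
```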
Independence of Random Variables (2/2)
• Random variables X and Y are said to be conditionally independent, given a positive probability event A, if
p_{X,Y|A}(x, y) = p_{X|A}(x) p_{Y|A}(y), for all x and y
– Or equivalently,
p_{X|Y,A}(x|y) = p_{X|A}(x), for all x and all y with p_{Y|A}(y) > 0
• Note here that, as in the case of events, conditional independence may not imply unconditional independence, and vice versa
Entropy (1/2)
• Three interpretations for quantity of information
1. The amount of uncertainty before seeing an event
2. The amount of surprise when seeing an event
3. The amount of information after seeing an event
• The definition of information:
I(x_i) = log₂(1 / P(x_i)) = −log₂ P(x_i)
– P(x_i) is the probability of the event x_i
• Entropy: the average amount of information
H(X) = E[I(X)] = −Σ_i P(x_i) log₂ P(x_i), where X takes the values x_1, x_2, ..., x_i, ...
(define 0 log₂ 0 = 0)
– The entropy attains its maximum value when the probability (mass) function is a uniform distribution
Entropy (2/2)
• For Boolean classification (0 or 1), where p_X(1) = p_1 and p_X(0) = p_2 = 1 − p_1:
Entropy(X) = −p_1 log₂ p_1 − p_2 log₂ p_2
• Entropy can be expressed as the minimum number of bits of information needed to encode the classification of an arbitrary number of examples
– If c classes are generated, the maximum of the entropy can be
Entropy(X) = log₂ c
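A minimal sketch of the entropy formula (the function name `entropy` is our own), with 0·log₂ 0 treated as 0 as defined above:

```python
from math import log2

def entropy(pmf):
    """H(X) = -Σ p log2(p), skipping zero-probability terms (0·log2(0) := 0)."""
    return -sum(p * log2(p) for p in pmf if p > 0)

assert entropy([0.5, 0.5]) == 1.0       # fair coin: 1 bit (the Boolean maximum)
assert entropy([0.25] * 4) == 2.0       # uniform over 4 classes: log2(4)
assert entropy([1.0]) == 0.0            # certain outcome: no information
assert entropy([0.9, 0.1]) < 1.0        # biased coin: below the uniform maximum
```

The checks confirm the two claims above: the uniform distribution maximizes entropy, and with c equally likely classes the entropy is log₂ c.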