Fundamental Probability Concepts
• Probabilistic Interpretation of Random Experiments (P)
  – Outcomes: sample space
  – Events: collections of outcomes (set theoretic)
  – Probability Measure: assigns a number ("probability") P ∈ [0,1] to each event
• Dfn#1 - Sample Space (S): fine-grained enumeration (atomic - parameters)
  – List of all possible outcomes of a random experiment
  – ME - Mutually Exclusive - disjoint "atomic" outcomes
  – CE - Collectively Exhaustive - covers all outcomes
• Dfn#2 - Event Space (E): coarse-grained enumeration (re-group outcomes into sets)
  – An ME & CE list of events
[Figure: Venn diagram of the sample space S (all outcomes) built from atomic outcomes (disjoint by definition). Events A, B, C alone are ME but not CE; adding event D makes A, B, C, D both ME & CE.]
Discrete parameters uniquely define the coordinates of the Sample Space (S), and the collection of all parameter coordinate values defines all the atomic outcomes. As such, atomic outcomes are mutually exclusive (ME) and collectively exhaustive (CE) and constitute a fundamental representation of the Sample Space S. By taking ranges of the parameters, such as A, B, C, and D, one can define a more useful Event Space, which should consist of ME and CE events that cover all outcomes in S without overlap, as shown in the figure.
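The ME and CE checks described above are mechanical enough to verify in code. This is a minimal sketch (not from the course materials); the particular partition of S by ranges of d1 is an arbitrary illustrative choice:

```python
from itertools import product

# Atomic outcomes for a pair of 6-sided dice: ME & CE by construction.
S = set(product(range(1, 7), repeat=2))

# Hypothetical events A..D: regroup outcomes into ranges of d1.
events = {
    "A": {(d1, d2) for (d1, d2) in S if d1 == 1},
    "B": {(d1, d2) for (d1, d2) in S if d1 in (2, 3)},
    "C": {(d1, d2) for (d1, d2) in S if d1 in (4, 5)},
    "D": {(d1, d2) for (d1, d2) in S if d1 == 6},
}

# ME: pairwise disjoint; CE: union covers all of S.
names = list(events)
mutually_exclusive = all(events[a].isdisjoint(events[b])
                         for i, a in enumerate(names) for b in names[i + 1:])
collectively_exhaustive = set().union(*events.values()) == S
print(mutually_exclusive, collectively_exhaustive)  # True True for this partition
```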
Fair Dice Event Space Representations
• Coordinate Representation:
  – Pair of 6-sided dice
  – S = {(d1,d2): d1,d2 = 1,2,…,6}
  – 36 outcomes (ordered pairs)
• Matrix Representation:
  – Cartesian product {d1} × {d2}: the 6×6 matrix of ordered pairs

    (1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
    (2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
    (3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
    (4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
    (5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
    (6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

• Tree Representation:
  – Start → d1 branch (1,…,6) → d2 branch (1,…,6), yielding 36 leaves (1,1), (1,2), …, (6,6): 36 outcomes, ordered pairs
• Polynomial Generator for Sum:

  (x¹ + x² + x³ + x⁴ + x⁵ + x⁶)² = x² + 2x³ + 3x⁴ + 4x⁵ + 5x⁶ + 6x⁷ + 5x⁸ + 4x⁹ + 3x¹⁰ + 2x¹¹ + x¹²

  – One die: exponents represent the 6 face numbers
  – Two dice (squared polynomial): exponents represent pair sums; coefficients represent # ways

[Figure: 6×6 coordinate grid of (d1,d2) outcomes and the corresponding two-level tree.]
It is helpful to have simple visual representations of Sample and Event Spaces. For a pair of 6-sided dice, the coordinate, matrix, and tree representations are all useful. The polynomial generator for the sum of a pair of 6-sided dice immediately gives probabilities for each sum. Squaring the polynomial (x¹+x²+x³+x⁴+x⁵+x⁶)² yields a generator polynomial whose exponents represent all possible sums for a pair of 6-sided dice, S = {2,3,4,5,6,7,8,9,10,11,12}, and whose coefficients C = {1,2,3,4,5,6,5,4,3,2,1} represent the number of ways each sum can occur. Dividing the coefficients C by the total # of outcomes 6² = 36 yields the probability "distribution" for the pair of dice. Venn diagrams for two or three events are useful; for example, the coordinate representation in the top figure can be used to visualize the following events: A = {d1 = 3, d2 arbitrary}, B = {d1 + d2 = 7}, and C = {d1 = d2}.
Once we display these events on the coordinate diagram, their intersection properties are obvious, viz., both A & B and A & C intersect, albeit at different points, while B & C do not intersect (no point corresponds to sum = 7 with equal dice values). More than three intersecting sets become problematic for Venn diagrams, as the advantage of visualization is muddled somewhat by the increasing number of overlapping regions in these cases (see next two slides).
A: d1=3, d2 =arb.
B: d1+d2 =7
C: d1=d2
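The generating-polynomial idea above is just polynomial multiplication, i.e., convolution of coefficient lists. A small sketch (illustrative, not part of the course materials):

```python
# One 6-sided die as a coefficient list: index = exponent (face value).
die = [0, 1, 1, 1, 1, 1, 1]

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (convolution)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

two_dice = poly_mul(die, die)                 # coefficient of x^s = # ways sum s occurs
ways = {s: c for s, c in enumerate(two_dice) if c}
probs = {s: c / 36 for s, c in ways.items()}  # divide by 6^2 = 36 total outcomes
print(ways)   # {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 5, 9: 4, 10: 3, 11: 2, 12: 1}
```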
Venn Diagram for 4 Sets
[Figure: Venn diagram of 4 intersecting sets A, B, C, D with its 15 regions labeled: the 4 singles A, B, C, D; the 6 pairs AB, AC, AD, BC, BD, CD; the 4 triples ABC, ABD, ACD, BCD; and the 1 quadruple ABCD.]
4C0 = 4C1 (4 singles) − 4C2 (6 pairs) + 4C3 (4 triples) − 4C4 (1 quadruple), i.e., 1 = 4 − 6 + 4 − 1
As we go to Venn diagrams with more than 3 sets, the labeling of regions becomes a practical limitation to their use. In this case of 4 sets A, B, C, D, the labeling is still straightforward and usable. The 4 singles A, B, C, D are labeled in an obvious manner at the edge of each circle. The 6 pairs AB, AC, AD, BC, BD, CD are labeled at the intersections of two circles. The 4 triples ABC, ABD, ACD, BCD are labeled within "curved triangular areas" corresponding to the intersections of three circles. The 1 quadruple ABCD is labeled within the unique "curved quadrilateral area" corresponding to the intersection of all four circles.
Trivial Computation of Probabilities of Events: sum = d1 + d2

Ex#1 Pair of Dice: S = {(d1,d2): d1,d2 = 1,2,…,6}
  E1 = {(d1,d2): d1 + d2 ≥ 10} → P(E1) = 6/36 = 1/6
  E2 = {(d1,d2): d1 + d2 = 7} → P(E2) = 6/36 = 1/6
[Figure: 6×6 (d1,d2) grid showing the corner region E1 and the anti-diagonal E2.]

Ex#2 Two Spins on Calibrated Wheel: S = {(s1,s2): s1,s2 ∈ [0,1]}
  E1 = {(s1,s2): s1 + s2 ≥ 1.5} → P(E1) = (.5)²/2 = 1/8
  E2 = {(s1,s2): s2 ≤ .25} → P(E2) = 1(.25)/1 = .25
  E3 = {(s1,s2): s1 = .85, s2 = .35} → P(E3) = 0/1 = 0
[Figure: unit square in the (s1,s2)-plane showing the triangular region E1, the horizontal strip E2, and the single point E3.]
For equally likely atomic events, the probability of any Event is easily computed as (# atomic outcomes in Event)/(total # outcomes). For a pair of dice, the total # of outcomes is 6·6 = 36, and hence simply counting the # points in E and dividing by 36 yields P(E). Two spins on a calibrated wheel [0, 1) can be represented by the unit square in the (s1, s2)-plane, and an analogous calculation obtains the probability for an event E by dividing the area covered by the event by the area of the event space ("1"): P(E) = area(E)/1.
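A quick Monte Carlo check of the geometric probabilities for the wheel-spin events (an illustrative sketch, not from the course materials; the sample size and seed are arbitrary):

```python
import random

random.seed(0)
N = 200_000

# E1: s1 + s2 >= 1.5 is a corner triangle of area 0.5^2 / 2 = 1/8.
hits_E1 = sum(1 for _ in range(N)
              if random.random() + random.random() >= 1.5)
# E2: s2 <= 0.25 is a strip of area 1 * 0.25 = 0.25.
hits_E2 = sum(1 for _ in range(N) if random.random() <= 0.25)

print(hits_E1 / N, hits_E2 / N)   # close to 0.125 and 0.25
```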
De Morgan's Formulas - Finite Unions and Intersections
i) Compl(Union) = Intersec(Compls):  (A ∪ B)^c = A^c ∩ B^c
ii) Compl(Intersec) = Union(Compls):  (A ∩ B)^c = A^c ∪ B^c

Useful forms (take complements of the above):
i') Union expressed as an intersection:  A ∪ B = (A^c ∩ B^c)^c
ii') Intersection expressed as a union:  A ∩ B = (A^c ∪ B^c)^c

Finite unions and intersections:
  (E1 ∪ E2 ∪ ··· ∪ En)^c = E1^c ∩ E2^c ∩ ··· ∩ En^c
  (E1 ∩ E2 ∩ ··· ∩ En)^c = E1^c ∪ E2^c ∪ ··· ∪ En^c

Visualization: [Figure: A^c and B^c are shaded grey; intersecting the grey areas yields one grey area with A and B excluded, i.e., (A ∪ B)^c = A^c ∩ B^c; taking its complement yields the white area A ∪ B.]
De Morgan's Laws for the complement of finite unions and intersections state that:
i) The complement of a union equals the intersection of the complements, and
ii) The complement of an intersection equals the union of the complements.
The alternate forms obtained by taking the complements of the original equations are often more useful because they give a direct decomposition of the union and the intersection of two or more sets:
i') The union equals the complement of the (intersection of complements)
ii') The intersection equals the complement of the (union of complements)
A graphical construction of A ∪ B = (A^c ∩ B^c)^c is also shown in the figure. A^c and B^c are the two shaded areas in the middle planes, which exclude the A and B (white) ovals respectively. Intersecting these two shaded areas and taking the complement leaves the white oval area, which is A ∪ B.
Set Algebra Summary Graphic
Union:  A ∪ B = A ∪ (A^c B) = B ∪ (B^c A)
Intersection:  A ∩ B = A·B = AB ; x ∈ AB iff x ∈ A & x ∈ B
Differences:  "A−B" ≡ A ∩ B^c = AB^c ; x ∈ A−B iff x ∈ A and x ∉ B (similarly "B−A" = A^c B)
De Morgan's:  (A ∪ B)^c = A^c B^c ; (AB)^c = A^c ∪ B^c
  complement of (at least one) means (not any)

[Figure: two overlapping ovals A and B showing the regions "A−B", AB, "B−A", and the union A ∪ B.]
This summary graphic illustrates the set algebra for two sets A, B and their union, intersection, and difference. De Morgan's Law can be interpreted as saying "the complement of ('at least one') is ('not any')." Associativity and commutativity of the two operations allow extension to more than two sets.
Basic Counting Principles
Principle #0: Solve the case n = 3–4 first; then generalize to n
Principle #1: Product Rule for Sub-experiments - generate a "tree" of outcomes:

  n = n1 · n2 ··· nm = ∏(k=1..m) nk

  Examples (repetitions allowed):
  – 16 binary digits (16 bins): 2·2·2·…·2 = 2¹⁶ = 65,536
  – Licenses (6 bins: 3 letters, 3 digits): 26·26·26·10·10·10 = 26³·10³
  – "Fill k-bins" tree: a single card draw branches into 13 "number" branches, each with 4 "suit" branches: # ways = 13·4 = 52

Principle #2: Permutations of n distinguishable objects taken k at a time (no repetitions):

  nPk = n!/(n−k)!

  Example (k < n): 11 travel books in 5 bins: 11·10·9·8·7

Principle #3: Permutations of n objects taken n at a time with r groups of indistinguishable objects:

  # Distinguishable sequences = n!/(n1! n2! ··· nr!),  r groups

  Examples (k = n):
  – Arrange all books, keeping subjects together: 11 travel, 5 cooking, 4 garden → 11!·5!·4!·3! (permute within each group, times 3! to permute the groups)
  – Arrange letters {4 "r", 3 "s", 2 "o", 1 "t"}: 10!/(4!·3!·2!·1!) = 12,600
  – "TOOL": 4!/(2!·1!·1!) = 12

Principle #4: Combinations of n objects taken k at a time (order not important!):

  nCk = n!/(k!(n−k)!),  k ≤ n

  = Principle #3 with the two groups {taken, not taken} not counted. Examples:
  – Committee of 3 {2M, 1F} from {6M, 3F}: 6C2 · 3C1 = 15·3 = 45
  – Committee of 4 from 22 people: 22C4 = 22!/((22−4)!·4!) = 22!/(18!·4!) = 7315
  – Binomial expansion: (a+b)³, (a+b)ⁿ
Outcomes must be distinguished by labels. They are characterized by either i) distinct orderings or ii) distinct groupings. A grouping consists of objects with distinct labels; changing order within a group is not a new group, but it is a new permutation. The four basic counting principles for groups of distinguishable objects are summarized, and examples of each are displayed in the table. Principle #0: This is practical advice to solve a problem with n = 2, 3, 4 objects first and then generalize the "solution pattern" to general n. Principle #1: This product rule is best understood in terms of the multiplicative nature of outcomes as we "branch out" on a tree. For a single draw from a deck of cards there are 13 "number" branches and, in turn, each of these has 4 "suit" branches, yielding 13·4 = 52 distinguishable cards or outcomes. Principle #2: Permutation (ordering) of n objects taken k at a time is best understood by setting up k containers, putting one of "n" in the first, one of "n−1" in the second, …, and finally one of "n−k+1" in the kth container. The total # ways is obtained by the product rule as n·(n−1)·…·(n−k+1) = n!/(n−k)!. Principle #3: Permutation of all n objects consisting of r groups of indistinguishable objects, e.g., {3 t's, 4 s's, 5 u's}. If all objects were distinguishable, the result would be n! permutations; however, permutations within the r groups do not create new outcomes, and therefore we divide by the factorials of the numbers in each group to obtain n!/(n1! n2! ··· nr!). Principle #4: Combination of n objects taken k at a time is related to Principles #2 and #3. There are n! permutations; ignoring permutations within the r = 2 groups {"taken", "not taken"} yields n!/(k!(n−k)!).
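The worked examples above can be re-derived in a few lines with the standard library's counting helpers (an illustrative check, not part of the course materials):

```python
from math import comb, factorial, perm

assert 2 ** 16 == 65_536                       # Principle #1: 16 binary digits
assert 26 ** 3 * 10 ** 3 == 17_576_000         # Principle #1: licenses
assert perm(11, 5) == 11 * 10 * 9 * 8 * 7      # Principle #2: 11 books take 5
# Principle #3: letters {4 r, 3 s, 2 o, 1 t} and "TOOL"
assert factorial(10) // (factorial(4) * factorial(3) * factorial(2)) == 12_600
assert factorial(4) // factorial(2) == 12
# Principle #4: committees
assert comb(6, 2) * comb(3, 1) == 45           # {2M, 1F} from {6M, 3F}
assert comb(22, 4) == 7315                     # 4 from 22 people
print("all counting examples check out")
```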
Counting with Replacement: select "B" from the alphabet and replace it - you always have 26 letters to choose from
[Figure: candy-machine analogy - after each draw of A, B, …, Z an identical refill drops down.]

Permutation of n objects with replacement taken k at a time:
  Each of the k bins has n choices (# replaceable objects n, # draws k):
  n · n · n ··· n = n^k, written with an over-slash on the permutation symbol: /nPk = n^k

Combination of n objects with replacement taken k at a time (note: k can be larger than n):
  /nCk = (n+k−1)Ck = (n+k−1)! / (k!(n−1)!)
  effective # objects = n + (k−1)

Example: From 2 objects {A, B} choose 3 with replacement (the only way k > n makes sense!). After each draw of an A or B, "drop down a replacement": add 1 after each draw except the last, so the effective # objects = 2 + (3−1) = 4:
  /2C3 = (2+3−1)C3 = 4C3 = 4!/(3!·1!) = 4
The 4 outcomes (distinct groupings) are {AAA}, {AAB}, {ABB}, {BBB}, while there are 2³ = 8 distinct orderings.

[Figure: tree for n = 2, k = 3 with the 8 leaves {AAA}, {AAB}, {ABA}, {ABB}, {BAA}, {BAB}, {BBA}, {BBB}, grouped as 3 "A"; 2 "A" & 1 "B" (three ways); 2 "B" & 1 "A" (three ways); 3 "B".]
Counting permutations and combinations with replacement is analogous to a candy machine purchase in which a new object drops down to replace the one that has been drawn, thus giving the same number of choices in each draw. Permutation of n objects taken k at a time with replacement: each of the k draws has the same number of outcomes n because of replacement; the result is n·n·…·n = n^k and is written /nPk, with an "over-slash" on the permutation symbol. The case n = 2, k = 3 of 3 draws with 2 replaceable objects {A, B} shows the /2P3 = 2³ = 8 permutations that result. Combination of n objects taken k at a time with replacement: for n = 2, k = 3, "2 take 3" does not make any sense without replacement. With replacement it does, since each draw except the last drops down an identical item, and hence the number of items to choose from becomes n + (k−1) and /nCk = (n+k−1)Ck. The tree verifies this formula and explicitly shows that there are 4 distinct groupings {3A, 3B, 2A1B, 1A2B}, exactly the number of combinations with replacement given by the general formula /2C3 = (2+3−1)C3 = 4C3 = 4.
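The n = 2, k = 3 example can be enumerated directly with the standard library (a quick illustrative check, not part of the course materials):

```python
from itertools import combinations_with_replacement, product
from math import comb

n, k = 2, 3
orderings = list(product("AB", repeat=k))                  # n^k distinct orderings
groupings = list(combinations_with_replacement("AB", k))   # (n+k-1)Ck distinct groupings

print(len(orderings), len(groupings))   # 8 orderings, 4 groupings
assert len(orderings) == n ** k == 8
assert len(groupings) == comb(n + k - 1, k) == 4           # {AAA},{AAB},{ABB},{BBB}
```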
II) Fundamentals of Probability
1. Axioms
2. Formulations: Classical, Frequentist, Bayesian, Ad Hoc
3. Adding Probabilities: Inclusion/Exclusion, CE & ME
4. Application of Venn Diagrams & Trees
5. Conditional Probability & Bayes' "Inverse Probability"
6. Independent versus Disjoint Events
7. System Reliability Analysis
As a theory, probability is based on a small set of axioms which set forth fundamental properties of construction. In practice, probability may be formulated theoretically, experimentally, or subjectively, but it must always obey the basic axioms. Evaluating probabilities for events is naturally developed in terms of their unions and intersections using Venn diagrams, trees, and inclusion/exclusion techniques. Conditional probabilities, their inverses (Bayes' theorem), and the dependence between two or more events flow naturally from the basic axioms of probability. System reliability analysis utilizes all these fundamental concepts.
Inclusion / Exclusion Ideas
ME Events A, B - Disjoint: AB = φ, no intersections - "add probabilities":
  P(A ∪ B) = P(A) + P(B)

Not Disjoint: AB ≠ φ:
  P(A ∪ B) ≠ P(A) + P(B) - the intersection "AB" is counted twice!
"Recast" as a disjoint union ("CE & ME"): subtract P(AB) from the sum so it is counted only once:
  B = B·S = B·(A ∪ A^c) = BA ∪ BA^c, with P(BA^c) = P(B) − P(AB)
  P(A ∪ B) = P(A) + P(B−A) = P(A) + P(BA^c) = P(A) + P(B) − P(AB)

Generalization by induction (let D = B ∪ C):
  P(A ∪ B ∪ C) = P(A ∪ D) = P(A) + P(D) − P(AD) = P(A) + P(B ∪ C) − P(A·(B ∪ C))
    = P(A) + {P(B) + P(C) − P(BC)} − {P(AB) + P(AC) − P(ABC)}
  P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(AB) − P(AC) − P(BC) + P(ABC)
  (add singles, subtract pairs, add triples)

[Figure: Venn diagrams of disjoint A, B; overlapping A, B with region AB; and the decomposition into A and B−A.]
It is important to realize that although probabilities are simply numbers that add, the probability of the union of two events P(A ∪ B) is not in general equal to the sum of the individual probabilities P(A) + P(B). This is because points in the overlap region AB are counted twice; to correct for this one needs to subtract out "once" the double-counted points in the overlap, yielding P(A ∪ B) = P(A) + P(B) − P(AB). Only in the case of non-intersection, AB = φ, does the simple sum of probabilities hold. The generalization for a union of three or more sets alternates inclusion and exclusion; for A, B, C the probability P(A ∪ B ∪ C) adds the singles, subtracts the pairs, and adds the triple, as shown.
Venn Diagram Application: Inclusion/Exclusion
Club data: 36 play T, 28 play S, 18 play B; overlaps: TS = 22, TB = 12, SB = 9, TSB = 4.
Let N = total # members (unknown). Write the probabilities as
  P(T) = 36/N ; P(S) = 28/N ; P(B) = 18/N ; etc.
Given this information, find how many club members play at least one sport (T or S or B).

Method 1: Substitute into the formula for the union:
  P(T ∪ S ∪ B) = P(T) + P(S) + P(B) − P(TS) − P(TB) − P(BS) + P(TSB)
    = (36 + 28 + 18 − 22 − 12 − 9 + 4)/N = 43/N

Method 2: Disjoint union - graphical:
  T ∪ S ∪ B = T ∪ ST^c ∪ BT^cS^c
  P(T ∪ S ∪ B) = P(T) + P(ST^c) + P(BT^cS^c) = (36 + 6 + 1)/N = 43/N

[Figure: Venn diagram of T (36), S (28), B (18) with intersections TS (22), TB (12), SB (9), TSB (4); the disjoint decomposition gives ST^c = 6 and BT^cS^c = 1.]

Thus 43 of the N club members play at least one sport. (N is irrelevant.)
INDEX
This example illustrates the ease with which a Venn diagram can display the probabilities associated with the various intersections of 3 sets T, S, and B. The number of elements in each of the 7 distinct regions is easily read off the figure; they are required to establish the total number in the union T ∪ S ∪ B via the inclusion/exclusion formula. Another method of finding P(T ∪ S ∪ B) is to decompose the union T ∪ S ∪ B into a union of disjoint sets T* ∪ S* ∪ B* for which the probability is additive, i.e., P(T* ∪ S* ∪ B*) = P(T*) + P(S*) + P(B*).
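Both methods reduce to simple arithmetic on the region counts, which can be cross-checked in a few lines (an illustrative sketch, not part of the course materials):

```python
# Region counts from the club example.
T, S, B = 36, 28, 18
TS, TB, SB, TSB = 22, 12, 9, 4

# Method 1: inclusion/exclusion for the union.
at_least_one = T + S + B - TS - TB - SB + TSB
print(at_least_one)   # 43 members play at least one sport

# Method 2: disjoint pieces T, S·T^c, B·T^c·S^c.
S_not_T = S - TS                       # 6
B_not_T_not_S = B - TB - SB + TSB      # 1 (inclusion/exclusion within B)
assert T + S_not_T + B_not_T_not_S == at_least_one == 43
```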
Matching Problem - 1: "N" men throw hats onto the floor; each man in turn randomly draws a hat
Let event Ei = "the i-th man chooses his own hat"; compute:
a) No Matches - find the probability that no man draws his own hat:
  P(0 matches) = 1 − P(E1 ∪ E2 ∪ ··· ∪ EN)

Probability that men i1, i2, …, in all draw their own hats, irrespective of what the other (N−n) men draw (matched or not matched):
  P(E_i1 E_i2 ··· E_in) = (# perms)/(total # perms) = (N−n)!/N!

Sum the joint probabilities over all "n-tuples" (all equally likely); the total # of n-tuple selections from N is NCn:
  Σ(n-tuples) P(E_i1 E_i2 ··· E_in) = [N!/(n!(N−n)!)] · (N−n)!/N! = 1/n!

Hence
  P(0 matches) = 1 − P(E1 ∪ E2 ∪ ··· ∪ EN)
    = 1 − {1/1! − 1/2! + 1/3! − ··· + (−1)^(N+1)/N!}
    = 1 − 1 + 1/2! − 1/3! + 1/4! − ··· + (−1)^N/N!  → e^(−1) as N → ∞

b) k Matches:
  P(k matches) = (1/k!)·{1/2! − 1/3! + 1/4! − ··· + (−1)^(N−k)/(N−k)!}  → (1/k!)·e^(−1) as N → ∞

This limit is Poisson with success rate λ = 1/N and "time interval" t = N samples: a = λ·t = (1/N)·N = 1.

[Figure: hats 1 | 2 | 3 | … | N laid out; men i1 | i2 | … | in choose their own hats, while the remaining i(n+1), …, iN do not matter (matched or not matched).]
Here is an example that requires the inclusion/exclusion expansion for a large number of intersecting sets. Since it becomes increasingly difficult to use Venn diagrams for a large number of intersecting sets, we must use the set-theoretic expansion to compute the probability. We shall spend some time on this problem as it is very rich in probability concepts. The problem statement is simple enough: "N men throw their hats onto the floor; each man in turn randomly draws a hat." a) What is the probability that no man draws his own hat? b) What is the probability of exactly k matches? Key idea: define the event Ei = "the i-th man selects his own hat,"
then take the union of the N sets E1 ∪ E2 ∪ ··· ∪ EN and P(no matches) = 1 − P(E1 ∪ E2 ∪ ··· ∪ EN).
The expansion of P(E1 ∪ E2 ∪ ··· ∪ EN) involves addition and subtraction of P(singles), P(pairs), P(triples), etc. (The events Ei are not ME - they intersect - so you cannot simply sum up the P(Ei) for k singles to obtain an answer to part b).) This slide shows a key part of the proof, which establishes the very simple result that the sum over singles is P(singles) = 1/1!; the sum over pairs is P(pairs) = 1/2!; the sum over triples is P(triples) = 1/3!; the sum over 4-tuples is P(4-tuples) = 1/4!; …; the sum over N-tuples is P(N-tuple) = 1/N!. The limit for large N approaches a Poisson distribution with success rate for each draw λ = 1/N and data length t = N, i.e., parameter a = λt = 1.
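The 1/e limit can be checked by brute force for small N, counting derangements directly (an illustrative sketch, not part of the course materials; feasible only for small N since it enumerates all N! permutations):

```python
from itertools import permutations
from math import e

def p_no_match(N):
    """Exact P(no man draws his own hat) by enumerating all N! drawings."""
    perms = list(permutations(range(N)))
    derangements = sum(1 for p in perms if all(p[i] != i for i in range(N)))
    return derangements / len(perms)

for N in (3, 5, 7):
    print(N, p_no_match(N))
print(1 / e)   # the N -> infinity limit, ~0.3679; convergence is rapid
```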
Man-Hat Problem, n = 3: Tree/Table Counting
From Table (all 3! = 6 equally likely hat assignments):

  M#1  M#2  M#3 | # Matches
   1    2    3  |    3
   1    3    2  |    1
   2    1    3  |    1
   2    3    1  |    0
   3    1    2  |    0
   3    2    1  |    1

  Prob[0 matches] = 2/6
  Prob[1 match] = 3/6
  Prob[2 matches] = 0/6 = 0
  Prob[3 matches] = 1/6

From Tree:
  Prob[Sgls] = P[E1] = P[E2] = P[E3] = 1/3
  Prob[Dbls] = P[E1E2] = (1/3)(1/2) = 1/6 ; alternate trees yield P[E1E3] = P[E2E3] = 1/6
  Prob[Trpls] = P[E1E2E3] = (1/3)(1/2) = 1/6
  Prob[0 matches] = 1 − Pr[E1 ∪ E2 ∪ E3]
    = 1 − {Sum[Sngls] − Sum[Dbls] + Sum[Trpls]}
    = 1 − {3(1/3) − 3(1/6) + 1(1/6)} = 2/6

[Figure: probability tree for the three draws (M#1 Drw#1, M#2 Drw#2, M#3 Drw#3) with branch probabilities 1/3, then 1/2, then 1; the ME leaves are the composite states {E1E2E3}, {E1E2^cE3^c}, {E1^cE2E3^c}, {E1^cE2^cE3}, {E1^cE2^cE3^c}, …, labeled by their match outcomes (triple, single, no-match). Connection: matches & events.]
This slide shows the complete tree and associated table for the Man-Hat problem in which n = 3 men throw their hats in the center of a room and then randomly select a hat. The drawing order is fixed as Man#1, Man#2, Man#3, and the 1st column of nodes, labeled as circled 1, 2, 3, shows the event E1 in which Man#1 draws his own hat, and the complementary event E1^c, i.e., Man#1 does not draw his own hat. The 2nd column of nodes corresponds to the remaining two hats in each branch and shows the event E2 in which Man#2 draws his own hat; note that E2 has two contributions of 1/6 summing to 1/3. Similarly, the 3rd draw results in the event E3 in two positions, again summing to 1/3. The tree yields ME & CE outcomes expressed as composite states such as {E1E2E3}, {E1E2^cE3^c}, etc., or equivalently in terms of the number of matches in the next column. The nodal sequence in the tree can be translated into the table on the right, which is analogous to the table used on the previous slide. The number of matches can be counted directly from the table as shown. The lower half of the slide compares the "# of matches" events with the "compound events" formed from the Ei's {no-matches, singles, pairs, and triples}. The connection between these two types of events is based on the common event "no-matches," i.e., the inclusion/exclusion expansion of the expression 1 − P(E1 ∪ E2 ∪ E3) in terms of singles, doubles, and triples yields P(0 matches).
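The table's match counts are just fixed-point counts over the 3! permutations, which can be tallied directly (an illustrative sketch, not part of the course materials):

```python
from collections import Counter
from itertools import permutations

# Each permutation of hats (0,1,2) is one equally likely drawing order;
# a "match" is a fixed point: man i draws hat i.
counts = Counter(sum(1 for i, h in enumerate(p) if i == h)
                 for p in permutations(range(3)))
total = sum(counts.values())                  # 6 equally likely outcomes
probs = {k: counts[k] / total for k in sorted(counts)}
print(probs)   # {0: 2/6, 1: 3/6, 3: 1/6}; exactly 2 matches is impossible
```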
Conditional Probability - Definition & Properties
• Definition of Conditional Probability:
  P(A|B) ≡ P(AB)/P(B) = fraction of AB over B
  P(B|A) ≡ P(AB)/P(A) = fraction of AB over A

• Asymmetry of Conditional Probability - not symmetrical! In general P(A|B) ≠ P(B|A).
[Figure: Venn diagrams "Given A" and "Given B" showing the same intersection BA renormalized by the two different conditioning sets.]

• In terms of atomic events si (with A = ∪(si∈A) si) we can formally write:
  P(A|Ŝ) ≡ P(AŜ)/P(Ŝ) = Σ(si ∈ A∩Ŝ) P(si) / P(Ŝ) = (# pts in Ŝ & A)/(# pts in Ŝ)  [equally likely atomic events]

• Note that in the case Ŝ = S it reduces to P(A), as it must.
The formal definition of conditional probability follows directly from the renormalization concept discussed on the previous slide. It is simply the joint probability defined on the intersection of the set A and Ŝ, P(AŜ), divided by the normalizing probability P(Ŝ). It can also be written explicitly in terms of a sum over atomic events, as given in the second equation. Conditional probability is not symmetric because the joint probability on the intersection of A and B is divided by the probability of the conditioning set, which is P(A) in one case and P(B) in the other. This is also easily visualized using Venn diagrams, where the "shape divisions" are obviously different in the two cases.
Examples - Coin Flips, 4-Sided Dice
Example #1: Three Coin Flips. Given the first flip is H, find Prob(#H > #T):
  P(Ŝ) = 4/8 ; P(HHH) = P(HHT) = P(HTH) = 1/8
  P(nH > nT | H) = [P(HHH) + P(HHT) + P(HTH)]/P(Ŝ) = (3/8)/(4/8) = 3/4
[Figure: tree for Flip#1, Flip#2, Flip#3 with the 8 leaves {HHH}, {HHT}, {HTH}, {HTT}, {THH}, {THT}, {TTH}, {TTT}; the reduced sample space Ŝ (first flip = H) is the upper branch.]

Example #2: 4-Sided Dice. Given the first die d1 = 4, find the probability of event A: "d2 = 4", i.e., P(d2 = 4 | d1 = 4) = ?
  P(Ŝ) = P(d1 = 4) = 4/16 ; P(4,4) = 1/16
  P(d2 = 4 | d1 = 4) = P(4,4)/P(Ŝ) = (1/16)/(4/16) = 1/4
[Figure: tree with branches d1 = 1,2,3,4 then d2, with the d1 = 4 branch leading to (4,1), (4,2), (4,3), (4,4); and a 4×4 (d1,d2) coordinate grid with the event A = {(4,4)} and the reduced sample space Ŝ = {d1 = 4} highlighted.]
Here are two examples illustrating conditional probability. The first involves a series of three coin flips, and a tree shows all possible outcomes for the original space S. The reduced set of outcomes conditions on the statement "the 1st flip is a head (red circle)"; Ŝ takes only the upper branch of the tree and leads to a reduced set of outcomes. The conditional probability is computed either by considering outcomes in this conditioning space Ŝ or by computing the probability for S (the whole tree) and then renormalizing by the probability for Ŝ (the upper branch). The second example involves the throw of a pair of 4-sided dice and asks for the probability that d2 = 4 given that d1 = 4, P(d2 = 4 | d1 = 4). The answer is obtained directly from the definition of conditional probability and is illustrated using a tree and a coordinate representation of the dice sample space, with a Venn diagram overlay for the event (d1, d2) = (4,4) (green) and the subspace Ŝ = {d1 = 4} (red rectangle).
Probability of Winning in the "Game of Craps"

Rules for the "Game of Craps":
First throw - dice sum (d1+d2):
  2, 3, 12 - "Lose" (L)
  7, 11 - "Win" (W)
  Other (O) - the first such sum defines your "Point" (= "5", say)
Subsequent throws - dice sum (d1+d2):
  "Point" - "Win" (W)
  7 - "Lose" (L)
  Other (O) - "Throw Again"

Table (sum S = d1+d2, # ways, probability):
  2, 12: 1 way, 1/36
  3, 11: 2 ways, 2/36
  4, 10: 3 ways, 3/36
  5, 9: 4 ways, 4/36
  6, 8: 5 ways, 5/36
  7: 6 ways, 6/36

Given the point "5", each subsequent throw yields W ("5") with probability 4/36, L ("7") with probability 6/36, and O (other) with probability 26/36. Summing all winning paths on the infinite tree:
  P(W|5) = 4/36 + (26/36)(4/36) + (26/36)²(4/36) + ··· = (4/36)/(1 − 26/36) = 4/10 = 2/5

Total probability of winning:
  P(W) = P(7) + P(11) + Σ(Points) P(W|Point)·P(Point)
    = 6/36 + 2/36 + 2·{P(W|4)·(3/36) + P(W|5)·(4/36) + P(W|6)·(5/36)}
    = 8/36 + 2·{(1/3)(3/36) + (2/5)(4/36) + (5/11)(5/36)}
    = .4929
where P(W|4) = 3/9 = 1/3, P(W|5) = 4/10 = 2/5, P(W|6) = 5/11, and the factor of 2 accounts for the symmetric points 10, 9, and 8.

[Figure: tree of Throw #1, #2, #3, #4, … - the first throw branches over sums 2-12 into L (2, 3, 12), W (7, 11), or a Point; given point "5", each later throw branches into W (4/36), L (6/36), or O (26/36).]
Here we compute the probability of winning the game of craps previously described, with the rules for the 1st and subsequent throws given in the box and illustrated by the tree. Since there are 36 equally likely outcomes, the probability of the two dice summing to either 2 or 12 is obviously 1/36, for 3 or 11 it is 2/36, and the remaining sums can be read directly off the sum-axis coordinate representation; all are displayed in the table on the right. We have labeled the partial tree "given the point 5" by the conditional probabilities derived from the table. The probabilities for the three outcomes W ("5"), L ("7"), and Other (not "5 or 7") can be read off the table as P(5) = 4/36, P(7) = 6/36, P(Other) = 1 − (4+6)/36 = 26/36. Note that these are actually conditional probabilities; but since the throws are independent, the conditionals are the same as the a prioris taken from the table. P(W|5) is obtained by summing all paths that lead to a win on this "infinite tree." Thus the 2nd throw yields W with probability 4/36, the 3rd throw yields W with probability (26/36)(4/36), and the 4th throw yields W with probability (26/36)²(4/36), …, leading to an infinite geometric series which sums to (4/36)·1/(1 − 26/36) = 2/5. The total probability of winning is the sum of winning on the 1st throw ("7" or "11") plus winning on the subsequent throws for each possible "point." The infinite sum for the other points is obtained in a manner similar to that for "5" (taking points by pairs in the table leads to the factor of two), and the final result is shown to be .4929, i.e., a 49.3% chance of winning!
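The whole calculation, including the geometric-series step, can be done exactly with rational arithmetic (an illustrative sketch, not part of the course materials):

```python
from fractions import Fraction as F

# Number of ways each dice sum occurs: 1,2,3,4,5,6,5,4,3,2,1 for sums 2..12.
ways = {s: 6 - abs(s - 7) for s in range(2, 13)}

p_win = F(ways[7] + ways[11], 36)          # win outright on the first throw
for point in (4, 5, 6, 8, 9, 10):
    w, l = ways[point], ways[7]
    # Geometric series collapses to: P(make point before 7) = w / (w + l)
    p_win += F(ways[point], 36) * F(w, w + l)

print(p_win, float(p_win))   # 244/495, ~0.4929
```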
Visualization of Joint, Conditional, & Total Probability
Binary Comm Signal - 2 levels {0,1}; Binary Decision - {R0, R1} = {"0" rcvd, "1" rcvd}

Joint Probability (symmetric):
  P(0,R0) = P(R0,0) : "0" sent & R0 ("0" rcvd) = R0 ("0" rcvd) & "0" sent

Conditional Probability (non-symmetric):
  P(0|R0) ≠ P(R0|0) : "0" sent given R0 ("0" rcvd) vs. R0 ("0" rcvd) given "0" sent

Conditional probability requires the total probabilities P(0), P(R0), etc. (re-normalize the joint probability):
  P(R0|0) ≡ P(R0,0)/P(0) = P(R0,0)/[P(R0,0) + P(R1,0)]
  P(0|R0) ≡ P(0,R0)/P(R0) = P(0,R0)/[P(0,R0) + P(1,R0)]

Total Probability:
  P(0) = P(0,R0) + P(0,R1) - sum the joint over R0, R1
  P(R0) = P(R0,0) + P(R0,1) - sum the joint across 0, 1

[Figure: overlay of the Signal Plane (x = 0, 1) and the Detection Plane (y = R0, R1) giving an Outcome Plane with the four joint regions 0R0, 0R1, 1R0, 1R1.]
Another way to visualize the communication channel is in terms of an overlay of a Signal Plane, divided (equally) into "0"s and "1"s, and a Detection Plane, which characterizes how the "0"s and "1"s are detected. When we overlay the two planes we obtain an Outcome Plane with four distinct regions whose areas represent the probabilities of the four product (joint) states {0R0, 0R1, 1R0, 1R1} (similar to the tree outputs). In this representation the total probability of a "0", P(0), can be thought of as decomposed into two parts summed vertically over the "0"-half of the bottom plane, shown by the break arrow: P(0) = P(0,R0) + P(0,R1). [Note: summing on the "1"-half of the bottom plane yields P(1) = P(1,R0) + P(1,R1).] Similarly, the total probability P(R0) can be thought of as decomposed into two parts summed horizontally over the "R0"-portion of the bottom plane, shown by the break arrow: P(R0) = P(R0,0) + P(R0,1); likewise P(R1) = P(R1,0) + P(R1,1). The Total Probability of a given state is obtained by performing such sums over all joint states.
Log-Odds Ratio - Add & Subtract Measurement Information
Revisit the Binary Comm Channel:
  P(R0|0) = .95 ; P(R1|0) = .05 ; P(R0|1) = .10 ; P(R1|1) = .90 ; P(0) = P(1) = .5
  (Note: here E = "1", E^c = "0".)

Relation between L1 and P(1|R1):
  L1 ≡ ln[ P(1|R1)/(1 − P(1|R1)) ]  ⇒  e^L1 = P(1|R1)/(1 − P(1|R1))  ⇒  P(1|R1) = e^L1/(1 + e^L1)

Additive measurement updates for L:
  L1 ≡ ln[ P(1|R1)/(1 − P(1|R1)) ] = ln[ P(1)/(1 − P(1)) ] + ln[ P(R1|1)/P(R1|0) ] ≡ L_old + ΔL_R1
  with L_old = ln[P(1)/P(0)] and ΔL_R1 = ln[P(R1|1)/P(R1|0)] ; updates: L_new = L_old + ΔL

Meas#1: R1:
  L_old = ln(.5/.5) = 0 ; ΔL_R1 = ln(.90/.05) = 2.8903
  L_new = 0 + 2.8903 = 2.8903 ; P(1|R1) = e^2.8903/(1 + e^2.8903) = .947

Meas#2: R0:
  ΔL_R0 = ln[P(R0|1)/P(R0|0)] = ln(.10/.95) = −2.25129
  L_new = L_old + ΔL_R0 = 2.8903 − 2.25129 = .63901
  P(1|R1R0) = e^.63901/(1 + e^.63901) = .655

Alternate Meas#2: R1:
  ΔL_R1 = ln[P(R1|1)/P(R1|0)] = ln(.90/.05) = 2.8903
  L_new = L_old + ΔL_R1 = 2.8903 + 2.8903 = 5.7806
  P(1|R1R1) = e^5.7806/(1 + e^5.7806) = .997
Revisiting the binary communication channel, we now compute updates using the log-odds ratio, for which the updates are additive. The update equation simply starts from the initial log-odds ratio, which is L_old = ln[P(1)/P(0)] = ln(.5/.5) = 0 for this channel. There are two measurement types, R1 and R0, and each adds an increment ΔL determined by its measurement statistics, viz., R1: ΔL_R1 = ln[P(R1|1)/P(R1|0)] = ln(.90/.05) = +2.8903 (positive, "confirming"); R0: ΔL_R0 = ln[P(R0|1)/P(R0|0)] = ln(.10/.95) = −2.25129 (negative, "refuting").
The table illustrates how easy it is to accumulate the results of two measurements: R1 followed by R0 by just adding the two ΔLs to obtain L_new = 0 + 2.8903 − 2.25129 = .63901, or alternately R1 followed by R1 to obtain L_new = 0 + 2.8903 + 2.8903 = 5.7806. These log-odds ratios are converted to actual probabilities by computing P = e^L_new/(1 + e^L_new), yielding .655 and .997 for the above two cases. If we want to find the number of R1 measurements needed to give a .99999 probability of "1", we need only convert .99999 to L = ln[.99999/(1 − .99999)] = 11.51 and divide the result by 2.8903 to find 3.98, so that 4 R1 measurements are sufficient.
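The additive update and the conversion back to probability are compact enough to sketch directly (illustrative, not part of the course materials; the printed values follow the slide's channel statistics):

```python
from math import exp, log

def prob_from_L(L):
    # Convert log-odds back to probability: P = e^L / (1 + e^L)
    return exp(L) / (1 + exp(L))

# Channel likelihoods from the slide.
P_R1_1, P_R1_0 = 0.90, 0.05   # P(R1|"1"), P(R1|"0")
P_R0_1, P_R0_0 = 0.10, 0.95   # P(R0|"1"), P(R0|"0")

L = log(0.5 / 0.5)             # prior log-odds L_old = 0

L += log(P_R1_1 / P_R1_0)      # Meas#1: R1, ΔL ≈ +2.8903 (confirming)
p_after_R1 = prob_from_L(L)    # ≈ 0.947

L += log(P_R0_1 / P_R0_0)      # Meas#2: R0, ΔL ≈ -2.2513 (refuting)
p_after_R1_R0 = prob_from_L(L) # ≈ 0.655

print(p_after_R1, p_after_R1_R0)
```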
Discrete Random Variables (RV) - Key Concepts
• Discrete RVs: a series of measurements of random events
• Characteristics: "moments" - mean and standard deviation
• Prob Mass Fcn (PMF); Joint, Marginal, Conditional PMFs
• Cumulative Distr Fcn (CDF): i) between 0 and 1, ii) non-decreasing
• Independence of two RVs
• Transformations - Derived RVs
• Expected Values (for a given PMF)
• Relationships between two RVs: Correlations
• Common PMFs Table
• Applications of Common PMFs
• Sums & Convolution: Polynomial Multiplication
• Generating Function: Concept & Examples
This slide gives a glossary of some of the key concepts involving random variables (RVs), which we shall discuss in detail in this section. Physical phenomena are always subject to some random components, so that RVs must appear in any realistic model; hence their statistical properties provide a framework for analysis of multiple experiments using the same model. These concepts provide the rich environment that allows analysis of complex random systems with several RVs, by defining the distributions associated with their sums and the transformations of these distributions inherent in the mathematical equations that are used to model the system.

At any instant, an RV takes on a single random value and represents one sample from the underlying RV distribution defined by its probability mass function (PMF). Often we need to know the probability for some range of values of an RV, and this is found by summing the individual probability values of the PMF; thus a cumulative distribution function (CDF) is defined to handle such sums. The CDF formally characterizes the discrete RV in terms of a quasi-continuous function that ranges over [0,1] and has a unique inverse.

Distributions can also be characterized by single numbers rather than PMFs or CDFs, and this leads to the concepts of mean values, standard deviations, correlations between pairs of RVs, and expected values. There are a number of fundamental PMFs used to describe physical phenomena, and these common PMFs will be compared and illustrated through examples. Finally, the relationship between the sum of two RVs and the concept of convolution, and the generating function for RVs, will be discussed.
125
Transformation of Sample Space: Sum & Difference - 4-Sided Dice

A fair 4-sided die is thrown twice. The RVs D1, D2 have the uniform joint PMF pD1D2(d1,d2) = 1/16 for d1, d2 = 1, 2, 3, 4.
Define the Sum S = d1 + d2 and the Absolute Difference D = |d2 - d1|.
Find the new joint PMF pSD(s,d) = ?

[Figure: the 4x4 grid of (d1,d2) outcomes, each point labeled by its "D/S" values; e.g., the opposite corners (1,4) and (4,1) are both labeled 3/5. Taking the absolute difference folds the grid over the S-axis and doubles the values above it. Rotated to (S,D) coordinates, the joint PMF pSD(s,d) has 2/16 at points with d > 0 and 1/16 at points with d = 0, with certain coordinate points, e.g. (s,d) = (3,0), "missing" (unoccupied). Collapsing along the d-axis yields pS(s), e.g. pS(6); collapsing along the s-axis yields pD(d), e.g. pD(1), pD(3).]
In the game with 4-sided dice we are interested in the distribution of the sum random variable S = D1 + D2, namely pS(s), and not in the joint distribution pD1,D2(d1,d2). This slide and several to follow illustrate the procedure for obtaining the desired "marginal" (or collapsed) distribution pS(s). In the process, we shall develop the relationship between distributions under a transformation of coordinates and define conditional and marginal distributions involving a pair of RVs {D1, D2}. We start with the 2- and 3-dimensional dice representations of equally likely outcomes of 1/16, as shown on the left. Recall that the points (d1, d2) for dice outcomes may alternately be expressed by the points (s, d) of their sum and difference coordinates, where s = d1 + d2 and d = d2 - d1. These coordinate axes are shown in the top left figure, where the sum and difference each take on 7 values: s = {2,3,4,5,6,7,8} and d = {-3,-2,-1,0,1,2,3}. We consider a slightly different transformation, s = d1 + d2 and |d| = |d2 - d1|; the absolute difference |d| now takes on only 4 values, {0,1,2,3}. This has the effect of doubling the probability values for |d| = {1,2,3} by folding the negative difference values onto the positive ones. If we label each point in this figure by its "|d|/s" values, we see, for example, that the points (d1,d2) = (1,4) and (d1,d2) = (4,1) at opposite corners of the grid are both now labeled with |d|/s = 3/5. Labeling all points in this manner and rotating the figure clockwise 90° so that D is up and S is to the right (central figure), we have found the new joint distribution pSD(s,|d|), as illustrated in the two right figures, where points are now labeled by (s,|d|) values. Note that the new distribution has doubled the positive-d values to 2/16 each, and that certain coordinate points, e.g. (s,|d|) = (3,0), are not occupied (green).
The marginal distribution pS(s), defined as the sum of the joint distribution pSD(s,|d|) over all |d| values, is easily read off the upper right figure by collapsing values down along the s-axis. Similarly, the distribution pD(|d|) is defined as the sum of the joint distribution pSD(s,|d|) over all s values. The table shows the results.
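The collapse procedure described above can be checked by brute-force enumeration; this is a minimal sketch of the slide's 4-sided dice construction.

```python
from fractions import Fraction
from itertools import product

# Sketch of the slide's construction: two fair 4-sided dice with uniform joint
# PMF p(d1,d2) = 1/16, transformed to (S, D) = (sum, |difference|) and then
# collapsed along each axis to get the marginals pS(s) and pD(d).
joint = {}
for d1, d2 in product(range(1, 5), repeat=2):
    key = (d1 + d2, abs(d2 - d1))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 16)

# Folding d2-d1 onto |d2-d1| doubles the mass at every d > 0 point: (1,4) and
# (4,1) both land on (s,d) = (5,3), the "3/5" label of the slide.
assert joint[(5, 3)] == Fraction(2, 16)

pS, pD = {}, {}
for (s, d), p in joint.items():
    pS[s] = pS.get(s, Fraction(0)) + p   # collapse along the d-axis
    pD[d] = pD.get(d, Fraction(0)) + p   # collapse along the s-axis

assert sum(pS.values()) == 1 and sum(pD.values()) == 1
print(pS[5], pD[0])                      # both 1/4 (= 4/16)
```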
137
Common PMFs and Properties - 1   (Independent Bernoulli Trials)

General moment formulas: E[X] = Σx x·pX(x);  var(X) = E[X²] - E[X]².

Bernoulli ("atomic" RV): 1 trial, X = "0" or "1" successes
  PMF:      pX(1) = p (success);  pX(0) = q = 1 - p (failure)
  Mean:     E[X] = 0·(1-p) + 1·p = p
  Variance: var(X) = E[X²] - E[X]² = p - p² = p(1-p) = pq

Binomial: n trials, X = x successes ("how many successes x in n trials?")
  PMF:      pX(x) = nCx p^x q^(n-x),  x = 0, 1, ..., n
  Mean:     E[X] = Σx x·nCx p^x q^(n-x) = np
  Variance: var(X) = npq

Geometric: X = x trials for 1 success ("how many trials x for 1 success?")
  PMF:      pX(x) = q^(x-1) p,  x = 1, 2, ...;  0 otherwise
  Mean:     E[X] = Σ_{x=1}^∞ x q^(x-1) p = p (d/dq) Σ_{x=1}^∞ q^x = p (d/dq)[q/(1-q)] = 1/p
  Variance: var(X) = q/p²

Negative Binomial: X = x trials, r successes
  PMF:      pX(x) = (x-1)C(r-1) p^(r-1) q^(x-r) · p,  x = r, r+1, r+2, ..., ∞
            [(r-1) successes in (x-1) trials, then a success on the next trial]
  Mean:     E[X] = Σ_{x=r}^∞ x (x-1)C(r-1) p^r q^(x-r) = r/p
  Variance: var(X) = r·q/p²

Notes: As p decreases, the expected number of trials x for 1 success (Geometric) or for r successes (Negative Binomial) must increase. Geometric RV = Negative Binomial for r = 1 success; the Geometric corresponds to one sequence, the Negative Binomial to many sequences.
[Figure: bar plots of each PMF.]
This table and the one to follow compare some common probability distributions and explore their fundamental properties and how they relate to one another. A brief description is given under the "RV Name" column, followed by the PMF formula and figure in column 2; formulas for the mean and variance are shown in the last two columns. The Bernoulli RV X answers the question "what is the result of a single Bernoulli trial?" It takes on only two values, namely "1" = Success with probability p and "0" = Fail with probability q = 1 - p. The Binomial RV X answers the question "how many successes X in n Bernoulli trials?" It takes on values corresponding to the number of successes in n independent Bernoulli trials; the sum RV X = X1 + X2 + ... + Xn of n Bernoulli RVs has nCx tree paths for X = x successes, yielding the PMF nCx p^x q^(n-x) as shown. The Geometric RV X answers the question "how many Bernoulli trials X for 1 success?" It takes on values from 1 to infinity and corresponds to x-1 failed Bernoulli trials followed by one successful trial; there is only one tree path with X = x trials yielding 1 success, so the PMF is q^(x-1) p as shown. The Negative Binomial RV X answers the question "how many Bernoulli trials X for r successes?" It takes on values from r to infinity and is the sum of r Geometric random variables, X = G1 + G2 + ... + Gr; there are (x-1)C(r-1) tree paths with (r-1) successes in the first x-1 trials followed by one final success, so the PMF is (x-1)C(r-1) p^(r-1) q^(x-r) p with x = r, r+1, ..., ∞, as shown.
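The table's mean and variance formulas can be verified numerically; this sketch truncates the infinite Geometric and Negative Binomial sums at a point where the tails are negligible (p = 0.3 is an arbitrary choice).

```python
from math import comb, isclose

p = 0.3
q = 1 - p

# Binomial(n, p): mean np, variance npq.
n = 10
binom = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]
mean = sum(x * pr for x, pr in enumerate(binom))
var = sum(x * x * pr for x, pr in enumerate(binom)) - mean**2
assert isclose(mean, n * p) and isclose(var, n * p * q)

# Geometric (trials until 1st success): pmf q^(x-1) p, mean 1/p, variance q/p^2.
geom = {x: q**(x - 1) * p for x in range(1, 2000)}
mean_g = sum(x * pr for x, pr in geom.items())
var_g = sum(x * x * pr for x, pr in geom.items()) - mean_g**2
assert isclose(mean_g, 1 / p, rel_tol=1e-9)
assert isclose(var_g, q / p**2, rel_tol=1e-6)

# Negative Binomial (trials until r-th success): pmf (x-1)C(r-1) p^r q^(x-r),
# mean r/p.
r = 3
nbin = {x: comb(x - 1, r - 1) * p**r * q**(x - r) for x in range(r, 2000)}
mean_nb = sum(x * pr for x, pr in nbin.items())
assert isclose(mean_nb, r / p, rel_tol=1e-9)
```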
138
Bernoulli/Binomial Tree Structures   (Independent Bernoulli Trials)

Bernoulli ("atomic" RV): 1 trial, X = "0" or "1" successes
  PMF: pX(1) = p (success);  pX(0) = q = 1 - p (failure)
  [Figure: Bernoulli tree; from START the branch F (prob q) leads to x = 0 and the branch S (prob p) leads to x = 1. Algebraically: (q + p).]

Binomial: 2 trials, X = x successes ("how many successes x in 2 trials?")
  PMF: pX(x) = 2Cx p^x q^(2-x),  x = 0, 1, 2
  [Figure: two-stage tree; from START the four paths {FF}, {FS}, {SF}, {SS} have probabilities q², qp, pq, p² and end at x = 0, 1, 1, 2. Algebraically: (q + p)².]

(q + p)² = q² + 2pq + p²
         = 2C0 p^0 q² + 2C1 p^1 q^1 + 2C2 p² q^0
The RVs of the last slide are grouped into the pairs {Bernoulli, Binomial} and {Geometric, Negative Binomial} for a reason. The sum of many independent Bernoulli trials generates a Binomial distribution, and similarly the sum of many independent Geometric trials generates the Negative Binomial distribution. This slide and the next give a graphical construction of the trees for these two pairs of distributions by repeatedly applying the basic tree structure of the underlying Bernoulli or Geometric tree, as appropriate. In the first panel we show the PMF properties for the Bernoulli on the left, and on the right we display the Bernoulli tree structure, where the upper branch q = Pr[Fail] goes to the state X = 0 and the lower branch p = Pr[Success] goes to the state X = 1. In the second panel we show the PMF properties for a simple n = 2 trial Binomial. The corresponding tree structure is obtained by appending a second Bernoulli tree to each output node of the first trial, yielding the 4 output states {FF}, {FS}, {SF}, {SS}. We see that there is 2C0 = 1 tree path leading to p^0 q² (state {FF}), there are 2C1 = 2 tree paths leading to p^1 q^1 (states {FS} and {SF}), and there is 2C2 = 1 tree path leading to p² q^0 (state {SS}), which is precisely as expected from the Binomial PMF for n = 2. This can be continued for n = 3, 4, ... by repeatedly appending a Bernoulli tree to each new node. Further, we see that this structure for n = 2 is represented algebraically by (q + p)², inasmuch as the direct expansion gives 1 = q² + 2qp + p²; expanding the expression (q + p)^n corresponding to n Bernoulli trials obviously yields the appropriate Binomial expansion for general exponent n. Thus the Binomial is represented by the repetitive tree structure, or equivalently by the repeated multiplication of the algebraic structure 1 = (q + p) by itself n times to obtain 1^n = (q + p)^n.
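The correspondence between the tree construction and the algebraic expansion of (q + p)^n can be checked mechanically; this sketch multiplies out (q + p) one factor (one Bernoulli trial) at a time and compares the resulting coefficients with nCx p^x q^(n-x). The values p = 0.3, n = 5 are arbitrary.

```python
from math import comb, isclose

# Appending a Bernoulli tree to every node corresponds to multiplying by
# (q + p); after n multiplications the weight attached to x successes is the
# Binomial term nCx p^x q^(n-x), with nCx counting the tree paths.
p, q, n = 0.3, 0.7, 5

poly = {0: 1.0}                 # probability by number of successes so far
for _ in range(n):              # one factor of (q + p) per Bernoulli trial
    nxt = {}
    for x, c in poly.items():
        nxt[x] = nxt.get(x, 0.0) + c * q          # Fail branch
        nxt[x + 1] = nxt.get(x + 1, 0.0) + c * p  # Success branch
    poly = nxt

for x in range(n + 1):
    assert isclose(poly[x], comb(n, x) * p**x * q**(n - x))
assert isclose(sum(poly.values()), 1.0)           # (q + p)^n = 1^n = 1
```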
139
Geometric/NegBinomial Tree Structures

Geometric: X = x trials for 1 success ("how many trials x for 1 success?")
  PMF: pX(x) = q^(x-1) p,  x = 1, 2, ...;  0 otherwise
  One infinite sequence; algebraically: [(1-q)^(-1) p]
  [Figure: infinite tree; each failure node F (prob q) spawns another Bernoulli trial, and each branch ends at its first success S (prob p).]

Negative Binomial: X = x trials, 2 successes
  PMF: pX(x) = (x-1)C1 p^(2-1) q^(x-2) · p,  x = 2, 3, 4, ..., ∞
       [(2-1) = 1 success in (x-1) trials, then a success on the next trial]
  Many infinite sequences; algebraically: [(1-q)^(-1) p]²
  [Figure: doubly infinite tree; a Geometric tree is appended to every first-success node.]

p² (1-q)^(-2) = p {1 + (-2) 1^(-3) (-q)^1 + [(-2)(-3)/2] 1^(-4) (-q)² + [(-2)(-3)(-4)/((2)(3))] 1^(-5) (-q)³ + ...} p
              = {1C1 p + 2C1 p q^1 + 3C1 p q² + 4C1 p q³ + ...} p
This slide first gives a graphical construction of the Geometric tree from an infinite number of Bernoulli trials and then shows how the Negative Binomial tree results from appending the Geometric tree to itself, in a manner similar to that of the last slide. In the first panel we repeat the PMF properties for the Geometric RV. On the right side of this panel we display the Geometric tree structure, whose branches each end in a single success. This tree has a Bernoulli trial appended to each failure node and is constructed from an infinite number of Bernoulli trials. The 1st Bernoulli trial yields X = 1 with p = Pr[Success], and this ends the lower branch; its upper branch yields X = 0 with q = Pr[Fail]; this failure node spawns a 2nd Bernoulli trial, which again leads to success or failure, and this process continues indefinitely. It accurately describes the probabilities for a single success in 1, 2, 3, ..., ∞ trials and is algebraically represented by the expression 1 = [(1-q)^(-1) p], which expands to [1 + q^1 + q² + q³ + ...]·p, corresponding to exactly 0, 1, 2, 3, ... failures before a single success. In the second panel we show the PMF properties for an r = 2 Negative Binomial; on the right we display the Negative Binomial tree structure obtained by applying the basic Geometric tree to each node (an infinite number of them) corresponding to a 1st success. This leads to a doubly infinite tree structure for the r = 2 Negative Binomial, which gives the number of trials X = x required for r = 2 successes. We can verify the first few terms in the Negative Binomial expansion given under PMF in the lower panel using the tree. This process may be extended to r = 3, 4, ... successes by repeatedly applying the Geometric tree to each success node. For r = 2, direct expansion of the algebraic identity 1² = [(1-q)^(-1) p]² yields {1C1 p + 2C1 p q^1 + 3C1 p q² + 4C1 p q³ + ...}·p, in agreement with the r = 2 Negative Binomial terms in the table. In an analogous fashion, expansion of 1^r = [(1-q)^(-1) p]^r yields results for the r-success Negative Binomial. Note that the "Negative" modifier to Binomial is a natural designation in view of the (1-q)^(-1) term in the algebraic structure.
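The claim that the Negative Binomial is the sum of r Geometric RVs can be checked by discrete convolution; a sketch with p = 0.4 and r = 3 chosen arbitrarily (the support is truncated at XMAX, so only the early terms are compared).

```python
from math import comb, isclose

# Convolving the Geometric PMF with itself r-1 times (sum of r independent
# Geometric RVs) should reproduce the Negative Binomial PMF
#   (x-1)C(r-1) p^r q^(x-r).
p, q, r, XMAX = 0.4, 0.6, 3, 60

geom = {x: q**(x - 1) * p for x in range(1, XMAX + 1)}

dist = dict(geom)
for _ in range(r - 1):                     # one convolution per added RV
    nxt = {}
    for x1, p1 in dist.items():
        for x2, p2 in geom.items():
            if x1 + x2 <= XMAX:
                nxt[x1 + x2] = nxt.get(x1 + x2, 0.0) + p1 * p2
    dist = nxt

for x in range(r, 20):
    assert isclose(dist[x], comb(x - 1, r - 1) * p**r * q**(x - r))
```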
140
Bernoulli, Geometric, Binomial & Negative Binomial PMFs

• The Bernoulli RV serves as a probability "indicator" for the outcomes of a series of experiments, representing two different event types:
  E1: "Success in 1 trial":  X = Bernoulli RV
  E2: "N1 is the # trials for the 1st success":  N1 = Geometric RV

Bernoulli Process (1 Bernoulli trial for event E1; single RV, two outcomes; n = 1 trial, x = 0 or 1 successes):
  pX(1) = p (success);  pX(0) = q = 1 - p (failure)
  E[X] = μX = p;  var(X) = σX² = pq

Binomial b(k; n, p): sum of n independent Bernoulli RVs X
  K = Σ_{i=1}^{n} Xi = # successes for n trials  (n = # trials, k = # successes)
  pK(k) = nCk p^k q^(n-k)
  E[K] = μK = np;  var(K) = σK² = npq

Geometric Process (n1 Bernoulli trials for event E2):
  pN1(n1) = q^(n1 - 1) p

Negative Binomial bn(nr; r, p): sum of r independent Geometric RVs N1
  Nr = Σ_{i=1}^{r} (N1)_i = # trials for r successes  (nr = # trials for r successes)
  pNr(nr) = (nr - 1)C(r-1) p^r q^(nr - r)
  E[Nr] = μNr = r·E[N1] = r/p;  var(Nr) = σNr² = r·var(N1) = r·q/p²
The Bernoulli RV X is the basic building block for other RVs (the "atomic" RV) and has a PMF with only two outcomes: X = 1 with probability p and X = 0 with probability q = 1 - p. We have seen that n such Bernoulli variables, when added, yield a Binomial PMF {b(x; n, p), x = 0, 1, 2, ..., n}, which gives the number of successes x for n trials. We have also seen that this Binomial PMF can be understood by repeatedly appending the Bernoulli tree graph to each of its nodes (repeated independent trials), thereby constructing a tree with 2^n outcomes corresponding to the n Bernoulli trials, each with two possible outcomes. Alternately, the Geometric PMF can be constructed by repeatedly appending a Bernoulli tree graph, but this time only to the failure node, an infinite number of times, thereby constructing a tree with an infinite number of outcomes, all of which correspond to x-1 failures and exactly 1 success for x = 1, 2, ..., ∞. Just as the Bernoulli tree graph is a building block for the Binomial tree graph, the infinite Geometric tree graph is a building block for the Negative Binomial. The Negative Binomial tree graph for r = 2 successes is constructed by appending a Geometric tree graph to itself, but this time only to the success nodes, resulting in a doubly infinite tree graph corresponding to exactly x-2 failures and exactly 2 successes for x = 2, 3, ..., ∞. Repeating this process r times yields the r-fold infinite tree graph corresponding to exactly x-r failures and exactly r successes for x = r, r+1, ..., ∞. The mathematical transformations relating the Bernoulli, Binomial, Geometric, and Negative Binomial are shown on this slide.
141
Common PMFs and Properties - 2

General moment formulas: E[X] = Σx x·pX(x);  var(X) = E[X²] - E[X]².

Hypergeometric: X = x successes; N = fixed population; m = tagged; n = test sample; without replacement (n ≤ m ≤ N)
  PMF:  pX(x) = mCx · (N-m)C(n-x) / NCn,  with m ∈ [1, N], n ∈ [1, N], and n-(N-m) ≤ x ≤ min(m, n);  0 otherwise
        [x drawn from the m marked items, (n-x) from the (N-m) unmarked]
  The PMF derives from the Binomial identity:
  NCn = mC0·(N-m)Cn + mC1·(N-m)C(n-1) + ... + mCx·(N-m)C(n-x) + ... + mCm·(N-m)C(n-m)
  Mean:     E[X] = n·(m/N) = n·p, where p = m/N is the "initial" probability of drawing a marked item
  Variance: var(X) = n·(m/N)·((N-m)/N)·((N-n)/(N-1)) = n·p·q·(N-n)/(N-1)

Poisson (limit of the Binomial): X = x successes
  a = lim_{n→∞, p→0} n·p = λ·t = (average arrival rate) × time
  PMF:  pX(x) = (a^x / x!)·e^(-a),  x = 0, 1, 2, ..., ∞;  0 otherwise
  Mean: E[X] = a;  Variance: var(X) = a

Zeta (Zipf):
  PMF:  pX(x; s) = C·x^(-s) = x^(-s)/ζ(s),  x = 1, 2, ...;  s > 1;  0 otherwise,
        where ζ(s) = Σ_{x=1}^∞ x^(-s) is the Riemann Zeta function and C = 1/ζ(s)
  Mean:     E[X; s] = Σ_{x=1}^∞ x·x^(-s)/ζ(s) = ζ(s-1)/ζ(s)
  Variance: Var(X; s) = ζ(s-2)/ζ(s) - (ζ(s-1)/ζ(s))²
  Example (s = 3.5):
    E[X; 3.5] = ζ(2.5)/ζ(3.5) = 1.191
    Var(X; 3.5) = ζ(1.5)/ζ(3.5) - (ζ(2.5)/ζ(3.5))² = 0.901
This second part of the Common PMFs table shows the Hypergeometric, Poisson, and Riemann Zeta (or Zipf) PMFs. The Hypergeometric RV X answers the question "how many successes (defectives) X are obtained with n test samples (trials without replacement) from a production run (sample space) that contains m defective and N-m working items?" X takes on values corresponding to the number of successes (defectives) in n dependent Bernoulli trials; the distribution is best understood in terms of the Binomial identity NCn = mC0·(N-m)Cn + ... + mCx·(N-m)C(n-x) + ... + mCm·(N-m)C(n-m), which, when divided by NCn, yields the distribution mCx·(N-m)C(n-x)/NCn, where X takes on values x = [xmin, xmax], with xmin = n-(N-m) (or 0) and xmax = min(n, m), as allowed by the combinations without replacement. The Poisson RV X answers the question "how many successes X in n Bernoulli trials with n very large?" We shall discuss this in more detail in the second part of the course, where we pair it with a continuous distribution. For now it is sufficient to know that it represents the limiting behavior of the Binomial PMF as n → ∞, and that its terms represent single terms in the expansion of e^a, where a = λ·t is called the Poisson parameter, λ is a "rate," and t is the time interval for the data run. The PMF is therefore the ratio of a single term in the expansion of e^a to e^a itself: pX(x) = (a^x / x!)/e^a for x = 0, 1, 2, 3, ... The Poisson RV has many applications in physics and engineering. The Riemann Zeta RV X has applications to language processing and prime number theory, and its properties are given in the table. Note that the exponent must satisfy s > 1 in order to avoid the harmonic series, which does not converge and therefore cannot satisfy the sum-to-unity condition on the PMF.
2/24/2012 3
Chapter 5 – Continuous RVs
Probability Density Function (PDF)

Event E = {x: a ≤ x ≤ b}:
  Pr[x ∈ E] = ∫_E fX(x) dx = ∫_a^b fX(x) dx
  [Figure: density fX(x) with the shaded area Pr[a ≤ x ≤ b] between a and b.]

Probability at a point = 0, except for a δ-function at that point:
  Pr[x = 2.0] = ∫_{x=2.0} fX(x) dx = 0

Mixed Continuous & Discrete Outcomes - the Dirac δ-function:
  fX(x) = β/(b - a) + α·δ(x - x0)    (uniform part on [a, b] plus a point mass α at x0)
  ∫_a^b α·δ(x - x0) dx = ∫_{x0-ε}^{x0+ε} α·δ(x - x0) dx = α
  [Figure: uniform density of height β/(b - a) on [a, b] with a δ-function arrow of weight α at x0.]

Sampled Continuous Function g(x):
  fX(x) = Σ_{k=0}^{n} αk·δ(x - xk),  with  αk = ∫_a^b g(x)·δ(x - xk) dx = g(xk)
  [Figure: continuous curve g(x) with δ-function arrows αk·δ(x - xk) at the sample points x0, x1, ..., xn.]
In discrete probability an RV is characterized by its probability mass function (PMF) pX(x), which specifies the amount of probability associated with each point in the discrete sample space. Continuous probability generalizes this concept to a probability density function (PDF) fX(x) defined over a continuous sample space. Just as the sum of pX(x) over the whole sample space must be unity, the integral of fX(x) over the whole sample space must also be unity. An event E is defined by a sum or integral over a portion of the sample space, as shown by the shaded area in the upper figure between x = a and x = b. The middle panel gives an example of a mixed distribution containing a continuous uniform density β/(b-a) and a Dirac δ-function α·δ(x-x0) corresponding to a discrete contribution at the point x0. The uniform distribution is shown as a continuous horizontal line between a and b, and the Dirac δ-function is shown with an arrow corresponding to a probability mass α accumulated at the single point x = x0. The integral over the continuous part gives (b-a)·β/(b-a) = β, and the integral of the Dirac δ-function α·δ(x-x0) over any interval containing x0 yields α. Thus, in order for this expression to be a valid probability density function, we require the sum of the two contributions to be unity: α + β = 1.
Consider the continuous curve fX(x) = g(x) in the bottom panel and take the sum of products αk·δ(x-xk). Is this a valid discrete "PMF"? In order for this to be so, the sum of the contributions αk must be unity. Does it represent a digital sampling of g(x)? No: in order to write down an appropriate "sampled" version of g(x), we need to develop a "sampling" transformation Yk = Yk(X) for k = 0, 1, 2, ..., n so as to transform the original continuous fX(x) to a discrete fY(yk) (see slide #26).
7
Cumulative Distribution Function (CDF)

  FX(x) = Pr[X ≤ x] = ∫_{x'=-∞}^{x} fX(x') dx'

Properties:
  Boundary values:       FX(-∞) = 0;  FX(+∞) = 1
  Monotone non-decr.:    FX(b) ≥ FX(a) if b ≥ a
  Prob. interpretation:  Pr[a ≤ x ≤ b] = FX(b) - FX(a)
  Density (PDF):         fX(x) = d/dx FX(x), or  dFX(x) = FX(x + dx) - FX(x) = fX(x) dx

The probability density (PDF) integrates to yield the CDF.
[Figure, case (i): fX(x) = two unit-height boxes on [0, 1/2] and [1, 3/2]; FX(x) ramps from 0 to 1/2, stays flat at 1/2 over [1/2, 1], then ramps from 1/2 to 1.
Figure, case (ii): fX(x) = constant 1/2 on [0, 3/2] plus a spike (1/4)·δ(x - 1); FX(x) ramps to 1/2 at x = 1, jumps by 1/4, then ramps from 3/4 to 1.]
The cumulative distribution function (CDF) for a continuous probability density function fX(x) is defined in a manner similar to that for discrete distributions pX(x), except that the cumulative sum over a discrete set is replaced by an integral over all X less than or equal to a value x. This integral yields a function of x, FX(x) = Pr[X ≤ x], which has the following important properties: (i) FX(x) always starts at 0 and ends at 1; (ii) FX(x) is continuous; (iii) FX(x) is non-decreasing; (iv) FX(x) is invertible, i.e., FX^-1(x) exists; and (v) the density fX(x) = d/dx{FX(x)} (since the exact differential dFX(x) = FX(x+dx) - FX(x) = fX(x)dx). It is important to note all five properties of FX(x), as they have important consequences. The figure shows the relationship between the density fX(x) and the cumulative distribution FX(x) for two cases: (i) two regions of constant density (two "boxes") and (ii) one region of constant density plus a delta function (one "box" and an arrow "spike"). In case (i), FX(x) ramps from 0 to 1/2 over the region [0, 1/2] from the 1st constant-density box, remains constant at 1/2 over the region [1/2, 1], and finally ramps from 1/2 to 1 over the 2nd constant-density box. Note that the slopes of the two ramps are both "1" in this case and that the total area under the density curve is 1·[1/2 - 0] + 1·[3/2 - 1] = 1. In case (ii), FX(x) ramps from 0 to 1/2 over the region [0, 1] by virtue of the constant "1/2" density box, then jumps by 1/4 because of the delta function, and finally continues its ramp from 3/4 to 1. Note that this is simply the superposition of a constant density of 1/2 plus a delta function (1/4)·δ(x-1), and again the total area under the density curves is (1/2)·[3/2 - 0] + 1/4 = 1.
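The two-box case (i) can be checked with a crude numerical integration; the tolerances reflect the Riemann-sum step size.

```python
from math import isclose

# Sketch of case (i): density of two unit-height "boxes" on [0, 1/2] and
# [1, 3/2]; the CDF ramps to 1/2, plateaus, then ramps to 1.
def f(x):
    return 1.0 if (0 <= x <= 0.5) or (1 <= x <= 1.5) else 0.0

def F(x, dx=1e-4):
    # Crude midpoint Riemann-sum CDF: integrate the density up to x.
    n = int(x / dx)
    return sum(f((k + 0.5) * dx) for k in range(n)) * dx

assert isclose(F(0.5), 0.5, abs_tol=1e-3)   # end of first ramp
assert isclose(F(1.0), 0.5, abs_tol=1e-3)   # flat over [1/2, 1]
assert isclose(F(1.5), 1.0, abs_tol=1e-3)   # total probability
assert F(0.25) <= F(0.75) <= F(1.25)        # non-decreasing
```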
14
Transformations of Continuous RVs

• Transformation of densities (PDFs) in 1 dimension
• Transformation of joint densities (PDFs) in 2 or more dimensions
• Two methods:

1) CDF Method:
   Step 1) First find the CDF FX(x) by integrating fX(x).
   Step 2) Invert the transformation y = g(x):  x = g^-1(y),
           and use it to write FY(y) = Pr[Y ≤ y] in terms of the known FX(x).
           (Note: y = g(x) may not be one-to-one: "multiplicity".)
   Step 3) Differentiate with respect to y:
           fY(y) = d/dy FY(y) = d/dy ∫_{y'=-∞}^{y} fY(y') dy'

2) Jacobian Method: transform the PDF using derivatives, expressing everything in terms of the variable y:
   fY(y) dy = fX(x) dx,  with y = g(x)
   fY(y) = fX(x) / |dy/dx| = fX(g^-1(y)) / |g'(g^-1(y))|
   (Note the absolute value.)
It is very important to understand how probability densities change under a transformation of coordinates y = g(x). We have seen several examples of such coordinate transformations for discrete variables, namely: (i) Dice: transform from individual dice coordinates (d1, d2) to the sum and difference coordinates (s, d), corresponding to a 90° rotation of coordinates; and (ii) Dice: transform from individual dice coordinates (d1, d2) to the minimum and maximum coordinates (z, w), corresponding to corner-shaped surfaces of constant minimum or maximum values. There are two methods for transforming the densities of RVs, namely (i) the CDF method and (ii) the Jacobian method. While both are quite useful for 1-dimensional PDFs fX(x), the Jacobian method is best for transforming joint RVs. The CDF method involves three distinct steps, as indicated on the slide: (i) compute the CDF FX(x); (ii) relate FY(y) = Pr[Y ≤ y] to FX(x), then invert the transformation, x = g^-1(y), and substitute to find FY(y) with a redefined y domain; and (iii) differentiate with respect to y to obtain the transformed probability density fY(y) for the RV Y. Note that if the function is multi-valued and therefore not invertible, it must be broken up into intervals on which it is invertible, and appropriate "fold-over" multiplicities must be accounted for. The Jacobian method uses derivatives of the transformation to transfer densities from the original set of RVs to the new one; the Jacobian accounts for linear, areal, and volume changes between the coordinates. In one dimension the Jacobian is simply a derivative, obtained by transferring the probability fX(x)dx in the interval x to x+dx to the probability fY(y)dy in the interval y to y+dy. Equating the two expressions yields fY(y) = fX(x)/|dy/dx| = fX(g^-1(y))/|dy/dx|. Note that the absolute value is necessary since fY(y) must always be greater than or equal to zero.
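The two methods can be compared on a toy monotone transformation (not the slides' example, assumed purely for illustration): X ~ Uniform(0,1) with Y = X², where both methods are two lines of algebra.

```python
import random
from math import isclose, sqrt

# CDF method:      F_Y(y) = Pr[X^2 <= y] = Pr[X <= sqrt(y)] = sqrt(y)
# Jacobian method: f_Y(y) = f_X(x)/|dy/dx| = 1/(2x) = 1/(2 sqrt(y))
# (g is monotone on [0,1], so no multiplicity factor is needed here.)
F_Y = lambda y: sqrt(y)
f_Y = lambda y: 1 / (2 * sqrt(y))

# The derivative of the CDF-method result should match the Jacobian-method
# density (central-difference check at an interior point).
y, h = 0.25, 1e-6
assert isclose((F_Y(y + h) - F_Y(y - h)) / (2 * h), f_Y(y), rel_tol=1e-6)

# Monte Carlo check of the CDF.
random.seed(0)
samples = [random.random() ** 2 for _ in range(200_000)]
emp = sum(s <= 0.25 for s in samples) / len(samples)
assert isclose(emp, F_Y(0.25), abs_tol=0.01)
```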
15
Transformation of Continuous RV - CDF Method (Method 1)

Resistance X = R;  Conductance Y = 1/R

Step 1: Compute FR(r)
  fR(r) = 1/200 for 900 ≤ r ≤ 1100;  0 otherwise
  FR(r) = Pr[R ≤ r] = ∫_{r'=-∞}^{r} fR(r') dr'
        = 0               for r < 900
        = (r - 900)/200   for 900 ≤ r ≤ 1100
        = 1               for r > 1100
  [Figure: PDF box of height 1/200 on [900, 1100] and the ramp CDF FR(r).]

Step 2: Transform to FY(y)
  FY(y) = Pr[Y ≤ y] = Pr[R ≥ 1/y] = 1 - Pr[R ≤ 1/y] = 1 - FR(1/y)
        = 1 - 0 = 1              for 1/y < 900
        = 1 - (1/y - 900)/200    for 900 ≤ 1/y ≤ 1100
        = 1 - 1 = 0              for 1/y > 1100

Step 3: Differentiate FY(y)
  fY(y) = d/dy FY(y)
        = 0            for y < 1/1100
        = 1/(200 y²)   for 1/1100 ≤ y ≤ 1/900
        = 0            for y > 1/900
  [Figure: fY(y) falls from 6050 at y = 1/1100 to 4050 at y = 1/900, with the corresponding CDF FY(y) rising from 0 to 1.]
The resistance X = R of a circuit has a uniform probability density function fR(r) = 1/200 between 900 and 1100 ohms, as shown in the top panel; the corresponding CDF FR(r) is the ramp function starting at 0 for r ≤ 900 and reaching 1 at r = 1100 and beyond, as shown. The detailed analytic function is given on the slide and represents the result of Step 1 of the CDF method. The problem is to find the PDF for the conductance Y = 1/X = 1/R. We first write down the definition of FY(y) for a given value Y = y and then re-express it as a function of R = 1/Y: FY(y) = Pr[Y ≤ y] = Pr[R ≥ 1/y] = 1 - Pr[R ≤ 1/y] = 1 - FR(1/y). This last expression is evaluated in the lower panel of the slide by substituting r = 1/y into the expression for FR(r) of the upper panel. Note that the resulting expression has been written down by direct substitution and the intervals have been left in terms of 1/y (this constitutes Step 2 of the method). Finally, differentiating FY(y) with respect to y, we find (Step 3) the desired PDF fY(y); we have also "flipped" the 1/y interval specifications and reordered the resulting y intervals in the customary increasing order. As seen in this example, the CDF method requires careful attention to the definition of FY(y) in terms of the cumulative probability of the variable Y. Since Y = 1/R, this leads to FY(y) = 1 - FR(1/y) and a reverse ordering of the inequalities for the intervals.
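The result can be sanity-checked by simulation; a sketch of the resistance-conductance example:

```python
import random
from math import isclose

# R ~ Uniform(900, 1100) ohms, Y = 1/R. The CDF method gives
# F_Y(y) = 1 - F_R(1/y), and differentiating yields f_Y(y) = 1/(200 y^2) on
# [1/1100, 1/900]. Check the CDF against a Monte Carlo simulation.
def F_Y(y):
    if y < 1 / 1100:
        return 0.0
    if y > 1 / 900:
        return 1.0
    return 1 - (1 / y - 900) / 200     # = 1 - F_R(1/y)

random.seed(1)
ys = [1 / random.uniform(900, 1100) for _ in range(200_000)]

for y in (1 / 1050, 1 / 1000, 1 / 950):
    emp = sum(v <= y for v in ys) / len(ys)
    assert isclose(emp, F_Y(y), abs_tol=0.01)

# Density endpoint values quoted on the slide: f_Y(y) = 1/(200 y^2).
f_Y = lambda y: 1 / (200 * y * y)
assert isclose(f_Y(1 / 1100), 6050)    # = 1100^2 / 200
assert isclose(f_Y(1 / 900), 4050)     # = 900^2 / 200
```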
16
Transformation of Continuous RV - Derivative (Jacobian) Method (Method 2)

  fR(r) = 1/200 for 900 ≤ r ≤ 1100;  0 otherwise.  Find fY(y) from fY(y) dy = fR(r) dr.

  fY(y) = fR(r) / |dy/dr|
        = fR(r) / |-1/r²|
        = (1/200)·r² = 1/(200 y²)   for 1/1100 ≤ y ≤ 1/900

  y = 1/R  (hyperbola xy = 1, with slope dy/dx);  x = R;  fX(x) = 1/200;  fY(y) = 1/(200 y²)

[3-D figure: the uniform density fX(x) = 1/200 in the x-z plane, the hyperbola y = 1/x in the x-y plane, and the transformed density fY(y) in the z-y plane, with endpoint values 4050 (at R = 900, y = 1/900) and 6050 (at R = 1100, y = 1/1100). Note: fY(y) is large where the slope is small and vice versa; the same differential area (probability) is mapped via the hyperbola to yield the tall-thin and short-fat strip areas shown for fY(y).]
The Jacobian method is much more straightforward and, moreover, has a very intuitive visualization in the 3-dimensional plot shown on this slide. The uniform probability density function fR(r) = 1/200 between 900 and 1100 ohms is written explicitly in the first boxed equation. The Jacobian method simply takes the constant fR(r) = 1/200 and divides it by the magnitude of the derivative |dy/dr| = |-1/r²| = y² to yield directly fY(y) = 1/(200y²) for y ∈ [1/1100, 1/900]. The 3-dimensional plot shows exactly what is going on: i) the original uniform distribution fX(x) = 1/200 is displayed as a vertical rectangle in the x-z plane; ii) sample strips at either end with width dx have the same small probability dP = fX(x)dx. At R = 900, the density fX(x) is divided by the large slope |dy/dx|, yielding a smaller magnitude for fY(y) as illustrated, but this is compensated by a proportionately larger dy, and thus the same small probability dP = fY(y)dy is transferred. iii) Conversely, the strip at R = 1100 is divided by a small slope |dy/dx| and yields a larger magnitude for fY(y), which is compensated by a proportionately smaller dy, again transferring the same dP. iv) The endpoint values of the transformed density fY(y) are illustrated in the figure. The strip width dx cuts the x-y transformation curve at two red points whose dy width is small at x = 1100 and large at x = 900, as determined by the slope of the curve. The shape between these endpoints is a result of the smoothly varying slope of the transformation hyperbola shown in the x-y plane. Thus the slope of the transformation curve (the hyperbola xy = constant in this case) in the x-y plane determines how each dx strip of the uniform distribution fX(x) = 1/200 in the x-z plane transfers to the new density fY(y) shown in the z-y plane. This 3-dimensional representation de-mystifies the nature of the transformation of probability densities and makes it quite natural and intuitive for 1-dimensional density functions. It is easily extended to two-dimensional joint distributions.
18
Transformation of Continuous RV - Example 3: "Multiplicity Factor"

General Rule:
  fY(y) = α · fX(x) / |dy/dx|,  where α = multiplicity ("fold-over") factor

Gaussian PDF:
  fX(x) = (1/√(2π)) e^(-x²/2),  -∞ < x < +∞
  Find the PDF of Y = X²:
  fY(y) = α · fX(x) / |dy/dx| = 2 · (1/√(2π)) e^(-y/2) / (2√y)
        = (1/√(2πy)) e^(-y/2),  0 < y < +∞

Not a 1-1 mapping: the parabola y = x² maps (-∞, ∞) → (0, ∞); folding over across x = 0 gives two equal contributions from -x and +x (double-density points), so the density is doubled (α = 2).
[3-D figure: the Gaussian (1/√(2π))e^(-x²/2) in the x-z plane, the parabola y = x² in the x-y plane, and the resulting density (1/√(2πy))e^(-y/2) in the y-z plane; pairs of double-density points at ±x map to the same y.]
The transformation of a Gaussian PDF under Y = X² is easily computed using the Jacobian method, provided one incorporates a multiplicity factor α, as shown in the boxed density equation. The multiplicity factor arises because there are two contributions to the same y-value, one from -x and the other from +x, as illustrated in the upper figure; folding the parabola across the x = 0 symmetry line yields twice the density on positive x, and this corresponds to a multiplicity factor α = 2 in the boxed density transformation equation. The 3-D plot shows the original Gaussian density function (grey) in the x-z plane, the transformation y = x² in the x-y plane, and the resulting distribution as a dashed curve in the y-z plane. The two thin vertical slices at -x and +x are mapped to the same y-value and hence double the density contribution to fY(y) as shown.
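A Monte Carlo sketch of this fold-over result (the transformed density is the chi-square density with one degree of freedom, a standard identification not stated on the slide):

```python
import random
from math import exp, isclose, pi, sqrt

# X standard Gaussian, Y = X^2. Fold-over at x = 0 gives multiplicity 2, so
#   f_Y(y) = 2 * f_X(sqrt(y)) / |dy/dx| = e^{-y/2} / sqrt(2 pi y),  y > 0.
f_Y = lambda y: exp(-y / 2) / sqrt(2 * pi * y)

# Compare Pr[a < Y < b] from simulation with the numeric integral of f_Y.
random.seed(2)
ys = [random.gauss(0, 1) ** 2 for _ in range(200_000)]

a, b, n = 0.5, 1.5, 10_000
dx = (b - a) / n
integral = sum(f_Y(a + (k + 0.5) * dx) for k in range(n)) * dx
emp = sum(a < y < b for y in ys) / len(ys)
assert isclose(emp, integral, abs_tol=0.01)
```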
26
Analog to Digital (A/D) Converter - Series of Step Functions

A/D converter mapping function:  Y = g(X) = k + 1  for  k < x ≤ k + 1
[Figure: staircase y = g(x) over x ∈ [-3, 3], with steps at y = -2, -1, 0, 1, 2, 3.]

Mapped density (continuous representation of discrete "sampled" distributions):
  fY(y) = Σ_k αk · δ(y - yk),  where αk is the integral of fX(x) over the k-th step interval.

a) Exponential:
  fX(x) = a e^(-ax) for x ≥ 0;  0 for x < 0
  αk = ∫_{k-1}^{k} a e^(-ax) dx = e^(-a(k-1)) - e^(-ak) = e^(-ak)(e^a - 1),  k = 1, 2, ...
  fY(y) = Σ_{k=1}^{∞} e^(-ak)(e^a - 1) · δ(y - k)
  For a = 0.1:  αk = e^(-0.1k)(e^0.1 - 1) = e^(-0.1k)·(0.105...)
  Table:  k = 1: α1 = 0.095;  k = 2: α2 = 0.086;  k = 3: α3 = 0.078;  ...  k = 11: α11 = 0.035

b) Gaussian:
  fX(x) = (1/√(2π)) e^(-x²/2),  -∞ < x < ∞
  αk = ∫_{k-1}^{k} (1/√(2π)) e^(-x²/2) dx = φ(k) - φ(k - 1),
  where φ(x) ≡ ∫_{-∞}^{x} (1/√(2π)) e^(-x'²/2) dx',  x ∈ (-∞, ∞)

c) Uniform:
  fX(x) = 1/10 for 0 ≤ x ≤ 10;  0 otherwise
  αk = ∫_{k-1}^{k} (1/10) dx = 1/10,  k = 1, 2, ..., 10
  fY(y) = (1/10) Σ_{k=1}^{10} δ(y - k)

[Figures: δ-function "arrow" plots of fY(y) for each case.]
In discussing the half-wave rectifier on the last slide, we found that the effect of a zero-slope transformation function was to pile up all the probability in the x-interval into a single δ-function at the constant y = 0 value associated with that part of the transformation. Here we extend that concept to a "sample & hold" type of mapping function typical of an Analog-to-Digital (A/D) converter. The specific mapping function y = g(x) = k+1 for k < x ≤ k+1 is illustrated in the grey box as a series of horizontal steps over the entire range x ∈ [-3, 3]; the y-values for these steps range from y = -2 to y = +3. Each horizontal (zero-slope) line accumulates the integral of fX(x) from x = k to k+1 onto its associated y-value, shown as a red circle at the base of a δ-function arrow pointing up out of the page, with an amplitude given by the integral over that interval, denoted by the symbol αk. The table shows several examples of a digitally sampled representation for a) Exponential, b) Gaussian, and c) Uniform distributions in the three columns. The rows of the table give the specific continuous density for each, the computations for the amplitudes αk of the discrete digital samples, the resulting sum of δ-functions, and finally a plot showing arrows of different lengths to represent the δ-functions of the sampled distributions.
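The exponential column's amplitudes can be reproduced directly; this sketch checks the closed form e^(-ak)(e^a - 1) against numerical integration and against the table values for a = 0.1.

```python
from math import exp, isclose

# Exponential density f_X(x) = a e^{-ax}, x >= 0, with the A/D map collecting
# the mass of each unit interval onto an integer y-value:
#   alpha_k = integral_{k-1}^{k} a e^{-ax} dx = e^{-ak} (e^a - 1).
a = 0.1
alpha = lambda k: exp(-a * k) * (exp(a) - 1)

# Closed form vs direct midpoint numerical integration of the density.
def numeric(k, n=100_000):
    dx = 1.0 / n
    return sum(a * exp(-a * ((k - 1) + (j + 0.5) * dx)) * dx for j in range(n))

for k in (1, 2, 3, 11):
    assert isclose(alpha(k), numeric(k), rel_tol=1e-6)

# Values quoted in the slide's table (to 3 decimals).
assert round(alpha(1), 3) == 0.095
assert round(alpha(2), 3) == 0.086
assert round(alpha(11), 3) == 0.035

# The delta amplitudes over all k sum to the total probability 1.
assert isclose(sum(alpha(k) for k in range(1, 2000)), 1.0, rel_tol=1e-9)
```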
2/24/2012 48
Order Statistics – General Case: n Random Variables

• General case, n RVs: X1, X2, ..., Xn
• Assume the RVs are Independent and Identically Distributed (IID) with common PDF fX(x):
fX1X2···Xn(x1, x2, ..., xn) = fX(x1)·fX(x2)···fX(xn)
• Reorder {X1, X2, ..., Xn} as follows:
– Y1 = smallest {X1, X2, ..., Xn}
– Y2 = next smallest {X1, X2, ..., Xn}
– Yj = jth smallest {X1, X2, ..., Xn} — the "jth order statistic"
– Yn = largest {X1, X2, ..., Xn}
so that Y1 < Y2 < ... < Yj < ... < Yn, each with the same PDF in the variable "y", fX(y).
• Find the PDF of the jth order statistic:
Pr[y ≤ Yj ≤ y + dy] = fYj(y) dy ; j = 1, 2, ..., n
• Differential probability for "one sequence" [Y1 | Y2 | ... | Yj−1] Yj [Yj+1 | ... | Yn] about the point y:
[FX(y)]^(j−1) · fX(y) dy · [1 − FX(y)]^(n−j)
– (j−1) RVs below y (all Yk < y), each IID with P[Yk ≤ y] = FX(y)
– (n−j) RVs above y (all Yk > y), each with P[Yk > y] = 1 − FX(y)
• Case n = 3, {Min, Mdl, Max}; Y2 = "Mdl" statistic; Y2 could be any one of {X1, X2, X3}. There are 3! = 6 orderings; however, we partition into 3 groups and permutations within a group are irrelevant:
– j = 1 (Min): [φ | Y1 | Y2 Y3] → 3!/(0!·1!·2!) = 3: [φ | X1 | X2 X3], [φ | X2 | X1 X3], [φ | X3 | X1 X2]
– j = 2 (Mdl): [Y1 | Y2 | Y3] → 3!/(1!·1!·1!) = 6: [X2 | X1 | X3], [X3 | X1 | X2], [X1 | X2 | X3], [X3 | X2 | X1], [X1 | X3 | X2], [X2 | X3 | X1]
– j = 3 (Max): [Y1 Y2 | Y3 | φ] → 3!/(2!·1!·0!) = 3: [X2 X3 | X1 | φ], [X1 X3 | X2 | φ], [X1 X2 | X3 | φ]
Order Statistics for the general case of n IID Random Variables is detailed on this slide. The n IID RVs {X1, X2,..., Xn} are re-ordered from the smallest Y1 to the largest Yn and the jth Y in the sequence Yj is called the “jth order statistic”. Again we fix a value Y=y and consider the continuous range of re-ordered Y-values illustrated in the figure: the small interval from y to y+dy contains the differential probability for the jth order statistic Yj given by fX(y)dy; all Y-values less than this belong to the Y1 through Yj-1 and those greater belong to Yj+1 through Yn as shown in the inset figure. Now for each of the Ys on the left we have the probability Pr[Y1 ≤ y] = FX(y), Pr[Y2 ≤ y] = FX(y), ... Pr[Yj-1 ≤ y] = FX(y), and because they are IID the total probability of those on the left is Pr[Yleft ≤ y] = [FX(y) ]j-1; similarly on the right we find Pr[Yright ≤ y] = [1-FX(y) ]n-j. So for the reordered Ys the differential probability is just the product of these three terms multiplied by a multiplicity factor α, viz.,
dP = Pr[y ≤ Yj ≤ y+dy] = fYj(y) dy = α·[FX(y)]^(j−1)·fX(y)·[1−FX(y)]^(n−j) dy
The multiplicity factor α results from the number of re-orderings of {X1, X2,..., Xn} for each order statistic Yj; arguments for n=3 and n=4 are illustrated on this slide and the next. These arguments look (in turn) at each order statistic min, middle(s), and max and compute in each case the number of distinct arrangements of {X1, X2,..., Xn} that yield the three groups relative to the "separation point" Y=y, arriving at multinomial forms dependent upon the orderings for each statistic. The specific multiplicity factors for the cases n=3, 4 are easily found to be
α = 3!/[(j−1)!·1!·(3−j)!] (n=3) ; α = 4!/[(j−1)!·1!·(4−j)!] (n=4)
and the final results for the PDF of the jth order statistic fYj(y) in these cases are
fYj(yj) = 3!/[(j−1)!·1!·(3−j)!] · [FX(yj)]^(j−1)·fX(yj)·[1−FX(yj)]^(3−j) for j=1,2,3 (n=3)
fYj(yj) = 4!/[(j−1)!·1!·(4−j)!] · [FX(yj)]^(j−1)·fX(yj)·[1−FX(yj)]^(4−j) for j=1,2,3,4 (n=4)
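The multiplicity factors and the order-statistic PDF can be verified numerically. A small sketch for the Uniform(0,1) case, where FX(y) = y and fX(y) = 1 (the helper names and the midpoint-rule normalization check are ours, not from the slide):

```python
import math

def order_stat_pdf(y, j, n, F, f):
    """PDF of the j-th order statistic of n IID RVs with CDF F and PDF f."""
    mult = math.factorial(n) // (math.factorial(j - 1) * math.factorial(n - j))
    return mult * F(y) ** (j - 1) * f(y) * (1.0 - F(y)) ** (n - j)

# Uniform(0,1): F(y) = y, f(y) = 1 on [0, 1]
F = lambda y: y
f = lambda y: 1.0

# Multiplicity factors for n = 3 should be 3, 6, 3 as derived on the slide
print([math.factorial(3) // (math.factorial(j - 1) * math.factorial(3 - j))
       for j in (1, 2, 3)])  # [3, 6, 3]

# Each order-statistic PDF must integrate to 1 (midpoint rule on [0, 1])
m = 10000
for j in (1, 2, 3):
    total = sum(order_stat_pdf((i + 0.5) / m, j, 3, F, f) for i in range(m)) / m
    assert abs(total - 1.0) < 1e-3
```

For j = 2, n = 3 this gives the familiar median density 6y(1 − y) on [0, 1].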
2/24/2012 61
Random Processes – Introduction – Lec#4
• Time Series Data = physical measurements in time
• Random Process = sequence of random variable realizations
– Geiger Counter: sequence of "detections" – Poisson Process
– Communication: binary bit stream – Bernoulli Process "01001…"
– E&M Propagation: phase (I-Q components) – Gaussian Process
• Arrival Event: success = "arrival" (of an event in time)
• Interarrival Times for Random Processes
– Not only interested in how many successes K ("arrivals") there are
– But also interested in the "specific time of arrivals," e.g., TK = time of kth arrival
– DSP Chip Interrupts: time between interrupts is used for data processing
– Waiting on Telephone: "you are the 10th customer in line and your wait will be approximately 7 minutes"

Random Process | Number of Arrivals | Interarrival Times
Geiger Counter | Poisson | Exponential
Binary Bit Stream | Bernoulli | Geometric
Observations of physical processes produce measurements over time which almost always have components described by a random process. Some examples are Geiger counter detections (Poisson Process), binary bit streams (Bernoulli Process), and electromagnetic wave I, Q phase components (Gaussian Process). Because these processes take place over time, the notion of a "success" is translated to an "arrival" at a specific time. Moreover, we are not only interested in how many successes K there are, but also their specific arrival times, i.e., we would like to know the time of the kth arrival Tk. This has application to many physical processes such as the timing of DSP chip interrupts relative to their "clock cycles" and the queuing of customers in a telephone answering system. In both cases you want to make sure the system can handle the "load" in an appropriate manner; for the DSP chip you need to minimize the number of times you are near the leading or trailing "edge" of the timing pulse in order to avoid errors, while for the telephone answering service, the 10th customer would like to know how long he must wait in the queue before being served.
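The Bernoulli-process/geometric-interarrival pairing in the table can be illustrated with a short simulation (the per-slot probability p = 0.25, the slot count, and the seed are illustrative choices, not from the slide):

```python
import random

random.seed(1)
p = 0.25                      # per-slot "arrival" probability (illustrative)
n_slots = 200000

# Simulate a Bernoulli bit stream and record interarrival times (slot gaps)
gaps, since_last = [], 0
for _ in range(n_slots):
    since_last += 1
    if random.random() < p:   # success = "arrival" in this slot
        gaps.append(since_last)
        since_last = 0

# Geometric interarrival: the mean gap should be close to 1/p = 4 slots
mean_gap = sum(gaps) / len(gaps)
print(round(mean_gap, 2))
```

The measured mean gap converging to 1/p is exactly the geometric-interarrival property the table asserts for a Bernoulli process.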
2/24/2012 78
Multi-User Digital Communication "CDMA" Arrival Slots

• Two signals s1, s2; decode s1 or s2 in a given time slot
• a priori prob: P[s1] = 3/4 ; P[s2] = 1/4
• Decoding statistics:
– decoded "1": P[1|s1] = 2/3 ; P[1|s2] = 2/3
– not decoded "0": P[0|s1] = 1/3 ; P[0|s2] = 1/3
• Tree for a single time slot (a priori × decode):
– P[s1,1] = P[1|s1]·P[s1] = (2/3)(3/4) = 1/2
– P[s1,0] = 1/4
– P[s2,1] = P[1|s2]·P[s2] = (2/3)(1/4) = 1/6
– P[s2,0] = 1/12
• s1 decoded "success": p1 = 1/2 ; s1 not decoded "failure": q1 = 1/2
• Nr time slots ("trials"), r decodes of s1, p1 = q1 = 1/2; Negative Binomial PMF:
pNr(n) = C(n−1, r−1)·p1^r·q1^(n−r)

1) Pr[1st decode in 4th slot]:
pN1(4) = q1³·p1 = (1/2)³(1/2) = 1/16
2) Pr[4th decode in 10th slot | 3 decodes in 1st 6 time slots]:
no memory of the earlier slots — the process "renews" at slot 7, so this equals Pr[1st decode in 4 slots (#7–10)]:
pN1(4) = q1³·p1 = (1/2)⁴ = 1/16
3) Pr[2nd decode in 4th slot]:
pN2(4) = C(3,1)·p1²·q1² = 3·(1/2)⁴ = 3/16
4) Pr[2nd decode in 4th slot | no decodes in 1st 2 time slots] {"means" N2 > 2}:
Pr[N2 = 4 | N2 > 2] = Pr[N2 = 4, N2 > 2]/Pr[N2 > 2] = pN2(4)/(1 − pN2(2)) = (3/16)/(1 − 1/4) = 1/4
equivalently, by "renewal" with slots 3 and 4: two successes in two trials, pN2(2) = p1² = (1/2)² = 1/4

[Figure: 10 time slots with three "1" decode events marked; after conditioning, "no memory" lets the slot count restart ("renewal").]
This example illustrates renewal properties and time slot arrivals of the Geometric and Negative Binomial RV distributions. In a multiuser environment the digital signals from multiple transmitters can occupy the same signal processing time slot so long as they can be distinguished by their modulation characteristics. Code Division Multiple Access (CDMA) uses a pseudorandom code that is unique to each user to “decode” the proper signal source. Consider two signals s1 and s2 being processed in the same time slot with a priori “system usage” given by P[s1] = ¾ and P[s2] = ¼ ; further let “1” denote successful and “0” denote unsuccessful decodes respectively. Given that each signal has the same 2/3 probability of a successful decode P[1|s1] = P[1|s2] = 2/3, we can use the tree to find the single trial probability of success for decoding each signal. For signal s1 we see that the end state {s1, 1} represents a successful decode and has p1=1/2 ; all other states {s1, 0}, {s2 1}, {s2, 0} represent failure to decode signal s1 with probability q1 = 1/4+1/6 + 1/12 = 1/2. Similarly for signal s2 we see that the end state {s2, 1} represents a successful decode of s2 and has p2 =1/6 ; all other states {s2, 0}, {s1 1}, {s1, 0} represent failure to decode signal s2 with probability q2 = 1/12+1/2 + 1/4 = 10/12 =5/6. We consider successive decodes of s1 as independent trials with probability of success p1=1/2 . Thus, the probability of having r- successful decodings of s1 in Nr signal processing slots “trials” is given by the Negative Binomial PMF
pNr(n) = C(n−1, r−1)·p1^r·q1^(n−r) with n = r, r+1, r+2, ..., and p1 = q1 = 1/2.
1) Pr of 1st decode (r=1) in 4th slot (N1=4) is pN1(4) = C(3,0)·p1¹·q1³ = 1·(1/2)⁴ = 1/16
2) Pr of 4th decode (r=4) in 10th slot (N4=10) given 3 previous decodes in the 1st 6 slots is found by "restarting" the process with slots #7, 8, 9, 10, so we need only one decode (r=1) in 4 slots, i.e., N1=4, which is identical to part 1) and yields Pr[N4=10 | N3=6] = pN1(4) = C(3,0)·p1¹·q1³ = 1·(1/2)⁴ = 1/16
3) Pr of 2nd decode (r=2) in 4th slot (N2=4) is pN2(4) = C(3,1)·p1²·q1² = 3·(1/2)⁴ = 3/16
4) Pr of 2nd decode (r=2) in 4th slot given the 1st two slots were not decoded is found by "restarting" the process with slots #3, 4, so we need r=2 in the two remaining slots, N2=2, which means two successes in two trials: pN2(2) = C(1,1)·p1²·q1⁰ = 1·(1/2)² = 1/4
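All four slot-arrival answers can be reproduced directly from the Negative Binomial PMF; a brief sketch (the function name is ours):

```python
from math import comb

def neg_binom_pmf(n, r, p):
    """Pr[the r-th success occurs on trial n] = C(n-1, r-1) p^r q^(n-r)."""
    q = 1.0 - p
    return comb(n - 1, r - 1) * p**r * q**(n - r)

p1 = 0.5  # single-slot probability of decoding s1

ans1 = neg_binom_pmf(4, 1, p1)   # 1) 1st decode in slot 4
ans2 = neg_binom_pmf(4, 1, p1)   # 2) renewal: identical to part 1)
ans3 = neg_binom_pmf(4, 2, p1)   # 3) 2nd decode in slot 4
ans4 = neg_binom_pmf(2, 2, p1)   # 4) renewal: 2 successes in 2 slots
# Cross-check 4) by direct conditioning: Pr[N2=4 | N2>2] = Pr[N2=4]/(1 - Pr[N2=2])
ans4_check = neg_binom_pmf(4, 2, p1) / (1 - neg_binom_pmf(2, 2, p1))

print(ans1, ans2, ans3, ans4, ans4_check)  # 0.0625 0.0625 0.1875 0.25 0.25
```

The agreement of `ans4` and `ans4_check` is the renewal property in action: conditioning on two initial failures just restarts the trial count.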
2/24/2012 97
Binary Communication with Noise

Gaussian under linear transformation Y = eX + f:
X ≡ N(μX, σX²) → Y = eX + f ≡ N(eμX + f, e²σX²)

Binary Generator → Modulator → Channel → Threshold Detector
• Modulator: "1" → +a ; "0" → −a
• Channel adds noise X: N(0,1):
– Y1 = a + X = N(a, 1)
– Y0 = −a + X = N(−a, 1)
• Threshold Detector:
– Y > c → detect "+a" or "1" (d1 = detect "1")
– Y ≤ c → detect "−a" or "0" (d0 = detect "0")

Prob of an error for detecting a "1":
P(Er "1") = P(Y ≤ c | +a)·P(+a) + P(Y > c | −a)·P(−a)
• Type I error "Missed Detection": does not exceed threshold but belongs to the "+a" distribution
• Type II error "False Positive": exceeds threshold but belongs to the "−a" distribution

[Figure: conditional densities fY|A(y|−a) and fY|A(y|+a) centered at −a and +a; the threshold y = c (dashed) separates "detect 0" (left) from "detect 1" (right); hatched tails mark the Type I and Type II error areas.]
Consider the binary communication channel depicted in the upper sketch: a binary sequence of "1"s and "0"s is generated and then amplitude modulated by a positive amplitude +a for "1" and −a for "0" as illustrated by the "square wave pulse train" at the modulator. Zero mean unit variance Gaussian noise N(0,1) is added by the "channel" and the (signal + noise) outputs are two distinct Gaussian RVs, Y1 = a + X ~ N(+a, 1) and Y0 = −a + X ~ N(−a, 1), about two different means as shown in the probability density plot. This output is presented to a threshold detector which attempts to detect the original sequence of "1"s and "0"s by setting a threshold Y = c (vertical dashed line) and assigning a "1" to Y-values to the right and a "0" to Y-values to the left of the threshold. Considering the detection of "1" we see that two types of error can occur, as follows:
Type I Missed Detection: P(Y≤c | +a) — the larger hatched area on the left with Y≤c, which belongs to the N(+a,1) curve but is rejected because it does not exceed the threshold "c".
Type II False Positive: P(Y>c | −a) — the smaller hatched area on the right with Y>c, which belongs to the "0" N(−a,1) curve but is falsely detected as "1" because it exceeds the threshold "c".
The total probability for an error in detecting a "1" is the sum of each conditional multiplied by its a priori as shown in the bottom equation. The total probability for an error in detecting a "0" is written down in an analogous fashion as a sum of conditionals multiplied by their a prioris (not shown).
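The total error probability is easy to evaluate with the standard normal CDF. A hedged sketch (the amplitude a = 1, threshold c = 0, and equal priors are illustrative choices, not fixed by the slide):

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_error(a, c, p1=0.5):
    """Total detection-error probability for the +/-a binary channel with N(0,1) noise.
    Type I (missed detection): Y <= c although the signal was +a.
    Type II (false positive):  Y  > c although the signal was -a."""
    p_miss  = Phi(c - a)          # P(Y <= c | +a), since Y ~ N(+a, 1)
    p_false = 1.0 - Phi(c + a)    # P(Y  > c | -a), since Y ~ N(-a, 1)
    return p_miss * p1 + p_false * (1.0 - p1)

# Symmetric case: equal priors, threshold c = 0, amplitude a = 1
print(round(prob_error(a=1.0, c=0.0), 4))  # 0.1587 = Phi(-1)
```

For equal priors the symmetric threshold c = 0 balances the two hatched error areas, and the total error collapses to a single tail probability Φ(−a).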
2/24/2012 101
Common PDFs – "Continuous" – and Properties

Mean = ∫ x·fX(x) dx ; var(X) = E[X²] − E[X]² ; Generating Fcn φX(s) = E[e^(sX)]

RV Name | PDF | Generating Fcn | Mean | Variance
Uniform | fX(x) = 1/(b−a) for a ≤ x ≤ b; 0 otherwise | (e^(sb) − e^(sa))/(s(b−a)) | (a+b)/2 | (b−a)²/12
Exponential | fT(t) = λ·e^(−λt) for t ≥ 0; 0 for t < 0 (arrival rate λ > 0) | λ/(λ−s) | 1/λ | 1/λ²
Gamma / r-Erlang (r = integer) | fT(t) = λ·(λt)^(r−1)·e^(−λt)/(r−1)! for t ≥ 0; 0 for t < 0 (λ > 0) | [λ/(λ−s)]^r | r/λ | r/λ²
Normal N(μ, σ²) | fX(x) = (1/(√(2π)·σ))·e^(−(x−μ)²/(2σ²)), −∞ < x < ∞ | e^(μs + σ²s²/2) | μ | σ²
Rayleigh | fX(x) = a²x·e^(−a²x²/2) for x > 0; 0 otherwise (a > 0) | 1 + √(π/2)·(s/a)·e^((s/a)²/2)·[1 + erf(s/(a√2))] | (1/a)·√(π/2) | (2 − π/2)/a²

Notes on the figures:
• r-Erlang: for r = 1 this is the Exponential ("exponential wait"); the curves fTr(t) for r = 1, 2, 3 have means E[T1] = 1/λ, E[T2] = 2/λ, E[T3] = 3/λ; for r = 3 there are three "exponential waits", E[T3] = 1/λ + 1/λ + 1/λ; each density peaks at t_max = (r−1)/λ.
• Gaussian vs Rayleigh: the Gaussian N(μ, σ²) sketched with μ = 0 peaks at x = 0; the Rayleigh peaks at x = 1/a.
This table compares some common continuous probability distributions and explores their fundamental properties and how they relate to one another. A brief description is given under the "RV Name" column followed by the PDF formula and figure in col#2, the generating function in col#3, and formulas for the mean and variance in the last two columns. The Uniform Distribution has a constant magnitude 1/(b−a) over the interval [a,b]; the mean is at the center of the distribution, (a+b)/2, and the variance is (b−a)²/12. The Exponential Distribution decays exponentially with time from an initial probability density λ at t=0. The mean time for an arrival is E[T] = 1/λ, which equals the e-folding time of the exponential. Its variance is 1/λ². The cumulative exponential distribution is the probability that the first arrival T1 occurs outside a fixed time interval [0,t]; it equals the probability that the discrete number of Poisson arrivals K(t)=0 occurs within the interval [0,t], that is, Pr(T1>t) = Pr(K(t)=0). The r-Erlang / Gamma Distributions for r>1 all rise from zero to reach a maximum at (r−1)/λ and then decay almost exponentially, ~t^(r−1)·e^(−λt), to zero. The mean is one exponential mean wait time 1/λ for r=1, two 1/λ waits for r=2, and r waits of 1/λ for any r. The variance is r times the exponential variance 1/λ². The cumulative r-Erlang distribution is the probability that the rth arrival time Tr occurs outside a fixed time interval [0,t]; this equals the probability that the discrete number of Poisson arrivals K(t) ≤ (r−1), i.e., Pr(Tr>t) = Pr(K(t) ≤ r−1). The Gamma density is a generalization of the rth Erlang density obtained by replacing (r−1)! with Γ(r), making it valid for non-integer values of r.
The Gaussian (Normal) Distribution is the most universal distribution in the sense that the Central Limit Theorem requires that sums of many IID RVs approach the Gaussian distribution. The Rayleigh Distribution results from the product of two independent Gaussian densities when expressed in polar coordinates and integrated over the angular coordinate. The probability density is zero at x=0 and peaks at x=1/a before it drops towards zero with a "Gaussian-like" shape for x>0. It is compared with the Gaussian, which is symmetric about x=0.
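The less familiar Rayleigh entries in the table can be spot-checked by numerical integration; a small sketch (the value a = 2 and the integration grid are illustrative choices):

```python
import math

a = 2.0  # Rayleigh parameter (illustrative)

def rayleigh_pdf(x, a):
    """f(x) = a^2 x exp(-a^2 x^2 / 2), x > 0."""
    return a * a * x * math.exp(-a * a * x * x / 2.0)

# Midpoint-rule moments on [0, 10/a]; the tail beyond is negligible
m, hi = 200000, 10.0 / a
dx = hi / m
xs = [(i + 0.5) * dx for i in range(m)]
mean = sum(x * rayleigh_pdf(x, a) for x in xs) * dx
ex2  = sum(x * x * rayleigh_pdf(x, a) for x in xs) * dx
var  = ex2 - mean * mean

print(round(mean, 4), round((1 / a) * math.sqrt(math.pi / 2), 4))  # mean = (1/a)*sqrt(pi/2)
print(round(var, 4),  round((2 - math.pi / 2) / a**2, 4))          # var  = (2 - pi/2)/a^2
```

The same midpoint-rule loop can be pointed at any of the table's densities to confirm its mean and variance column.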
2/24/2012 109
Consequences of Central Limit Theorem

• Sum of n uniform variates Xi
• Generate a uniform sequence of N = 1000 points {Xi}
• Discrete Uniform PMF (11 points spaced 0.1 apart):
pX(x) = (1/11)·δ(x − xi) ; xi = −.5, …, 0, …, .5
• Sum variable: Zn = Σ(i=1..n) Xi ; n = 2, 4, 8, 12
• Central Limit Thm ⇒ generates a Gaussian as n = 2, 4, 8, 12, … becomes large

Example data run {Xi}: .2 | .5 | −.1 | .3 | −.2 | −.1 | −.1 | .4 | −.3 | .1 | −.5 | −.1 | …
– n = 2 sums: .7, .2, −.3, .5, −.2, −.6, …
– n = 4 sums: .9, .2, −.8, … ; n = 12 sum: 1.1, …

Plot the frequency of occurrence fZn(z) ≈ pZn(z) for n = 2, 4, 12 against the original uniform pX(x) = 1/11.
Note: the curves give the "shape" of the frequency of occurrence for discrete points spaced 0.1 apart.
The Discrete Uniform PMF with values at 11 discrete points x = {−.5, −.4, −.3, −.2, −.1, 0, .1, .2, .3, .4, .5} can be expressed as a sum of 11 δ-functions with magnitude 1/11 at each of these points as shown in the figure. This can also be thought of as the result of a "sample and hold" transform (see Slide#26) of a Continuous Uniform PDF fY(y) = 1/1.1 ranging along the y-axis from y=−.6 to y=+.5; for example, the term (1/11)·δ(x−(−.5)) is the δ-function located at x=−.5, generated by integrating the continuous PDF from y=−.6 to y=−.5, which gives an accumulated probability of .1/(.5−(−.6)) = 1/11 at the correct x-location. Suppose that a sequence of 1000 numbers from the discrete set {−.5, −.4, −.3, −.2, −.1, 0, .1, .2, .3, .4, .5} is randomly generated on a computer to create the data run notionally illustrated in the 2nd panel. Now we can create sum variables Zn consisting of the sum of n = 2, 4, 8, or 12 of these samples from the discrete uniform PMF. According to the CLT, as we increase n, the resulting frequency distribution of the sum variables Zn should approach a Gaussian. The notional illustration shows what we should expect. The dashed rectangle shows the bounds of the original uniform discrete PMF and the other curves show the march towards a Gaussian. Note that unlike a Gaussian, all these distributions are zero outside a finite interval determined by the number of variables that are summed. The triangle shape is the sum of two RVs and obviously the min and max are [−1, 1] for Z2; the Z12 RV, on the other hand, covers the range [−6, 6]; the range increases as we sum more variables, but only as n → ∞ does the sum variable fully capture the small Gaussian "tails" for large |x| as required by the CLT. This result can also be thought of in terms of an n-fold convolution of the IID RVs Xk, k=1,2,...,n, which also spreads out with each new convolution in the sequence.
The next slide shows the results of a MatLab simulation of this CLT approach to a Gaussian and a plot of the results confirming the notional sketch shown on this slide. (The MatLab script is given on the notes page of the next slide.)
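The notional experiment is easy to run directly (here in Python rather than the MatLab script mentioned above; the seed and trial count are arbitrary choices). Since Var(X) = 0.1 for the 11-point PMF, the sums should show Var(Zn) ≈ 0.1·n while the mean stays at 0:

```python
import random

random.seed(0)
support = [round(-0.5 + 0.1 * i, 1) for i in range(11)]  # {-.5, ..., 0, ..., .5}
var_x = sum(x * x for x in support) / 11                 # = 0.1

def sample_sum(n):
    """Z_n = sum of n draws from the discrete uniform PMF."""
    return sum(random.choice(support) for _ in range(n))

trials = 20000
for n in (2, 4, 8, 12):
    zs = [sample_sum(n) for _ in range(trials)]
    mean = sum(zs) / trials
    var = sum((z - mean) ** 2 for z in zs) / trials
    # CLT scaling: Var(Z_n) = n * Var(X) = 0.1 n ; mean stays 0
    print(n, round(mean, 3), round(var, 3))
```

Histogramming the `zs` for each n reproduces the march from the triangle (n = 2) towards the Gaussian shape shown in the notional sketch.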
2/24/2012 121
Examples Using Markov & Chebyshev Bounds

Markov: P[X ≥ r·μX] ≤ 1/r, or equivalently P[X ≥ c] ≤ E[X]/c = μX/c
– Prob the "value" of RV X exceeds r times its mean is at most 1/r
Chebyshev: P[|X − μX| ≥ r·σX] ≤ 1/r², or equivalently P[|X − μX| ≥ k] ≤ σX²/k²
– Prob the "deviation" of RV X exceeds r times its std dev σX is at most 1/r²
Note that for r = 1 the bound is "1" or 100%; thus useful bounds require r > 1.

Examples:
1) Kindergarten class, mean height = 42". Find a bound on the prob of a student being taller than 63":
μX = 42, 63 = 1.5·42 ⇒ r = 1.5 ⇒ Pr[X ≥ 1.5·42] ≤ 1/1.5 = 66.7% (Markov)

2) Ross Ex. 7-2a) Factory production
a) Given mean = 50, find a bound on the prob that production exceeds 75, i.e., Pr[X ≥ 75]:
Pr[X ≥ 75] ≤ E[X]/c = 50/75 = .667 (Markov) — note an upper bound: at most 66.7%
b) Given also variance = 25, find a bound on the prob that production is between 40 and 60:
Pr[|X − 50| ≥ 10] ≤ 25/10² = .25 (Chebyshev)
⇒ Pr[40 < X < 60] ≥ 1 − .25 = .75 — note a lower bound: at least 75%
Here are two examples of the application of the Markov and Chebyshev bounds. The two forms for each are stated on the LHS of the slide for reference purposes. The decision to use one or the other of these bounds depends upon what type of information we have about the distribution. Thus if the RV X takes on only positive values and we only know its mean μX, then we must use the Markov bound. On the other hand, if the RV X takes on both positive and negative values and we know the mean μX and variance σX², then we must use the Chebyshev bound. If in the latter case the RV X takes on only positive values, then we could use either the Chebyshev or Markov bound, but we would choose Chebyshev over Markov because it uses more of the information and hence will always be a tighter upper bound. Neither of these bounds is very tight because the information about the distribution is very limited; knowing the actual distribution itself always yields the best bounds.
1) The mean height in a kindergarten class is μX = 42" and we are asked "what is the probability of a student being taller than 63"?" Short of knowing the actual distribution, the best we can do is use the Markov inequality to find an upper bound Pr[X>63] ≤ 42/63 = .67 or 67%. This is also easily computed if we realize that the tail is the region beyond 63" = 1.5·(42"), so r = 1.5 and the answer is 1/1.5 = 2/3 = .67.
2) The factory production has a mean output μX = 50 units and we are asked (a) "what is the probability of an output exceeding 75 units?" This again involves a positive quantity X, the number of units, and we choose the Markov bound for 1.5·(50) = 75 units, so again r = 1.5 and the resulting probability bound is 67%. (b) If we are also given the variance of the production, σX² = 25, the additional information allows us to use the Chebyshev bound to find the probability in the tails on either side of the mean of 50. Thus we find the probability in the 2-sigma tails (r=2), to the left of 50−10 and to the right of 50+10, as Pr[Tails] ≤ 1/2² = 25%. Hence the production within the bounds [40,60] has the complementary probability Pr[40 ≤ X ≤ 60] = 1 − Pr[Tails] ≥ 1 − .25 = .75, or at least 75%.
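Both worked examples reduce to one-line bound computations; a minimal sketch (function names are ours):

```python
def markov_bound(mean, c):
    """Upper bound on P[X >= c] for a nonnegative RV X: E[X]/c."""
    return mean / c

def chebyshev_bound(var, k):
    """Upper bound on P[|X - mu| >= k]: var / k^2."""
    return var / (k * k)

# Kindergarten: mean height 42", P[height >= 63"] <= 42/63 = 2/3
print(round(markov_bound(42, 63), 3))   # 0.667

# Factory: mean 50 => P[X >= 75] <= 50/75 (Markov)
print(round(markov_bound(50, 75), 3))   # 0.667

# Factory with variance 25: P[|X - 50| >= 10] <= 25/100 (Chebyshev)
tails = chebyshev_bound(25, 10)
print(1 - tails)                        # P[40 < X < 60] >= 0.75
```

Note how the Chebyshev tail bound flips into a lower bound on the central interval by complementation, exactly as in part (b) above.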
2/24/2012 132
Transformation of Variables & General Bivariate Normal Distribution

X a bivariate normal with independent components, each N(0,1):
mX = E[X] = [0, 0]ᵀ ; KXX = E[X·Xᵀ] = I = [[1, 0], [0, 1]]

Linear Xform to Y: Y = AX + b

Computation of mY:
mY = E[Y] = E[AX + b] = A·E[X] + b = b = [b1, b2]ᵀ   (since E[X] = 0)

Computation of KYY:
KYY = E[(Y − mY)(Y − mY)ᵀ] = E[(Y − b)(Y − b)ᵀ] = E[(AX)(AX)ᵀ] = A·E[XXᵀ]·Aᵀ = A·Aᵀ   (since E[XXᵀ] = I)
(No longer independent components or zero means & unit variances.)

Determinant of KYY:
det KYY = det(A·Aᵀ) = det A · det Aᵀ = (det A)² ⇒ det A = √(det KYY)

A is the Jacobian: ∂yi/∂xj = aij ⇒ J = det{∂yi/∂xj} = det(A) = √(det KYY)

New prob density (substitute x = A⁻¹(y − b) and divide by |J|):
fY(y1, y2) = fX(x1, x2)/|J| = exp[−½·(A⁻¹(y − b))ᵀ·(A⁻¹(y − b))] / (2π·√(det KYY))
with (A⁻¹)ᵀ·A⁻¹ = (A·Aᵀ)⁻¹ = KYY⁻¹

General Bivariate Normal Distribution:
fY(y) = exp[−½·(y − mY)ᵀ·KYY⁻¹·(y − mY)] / (2π·√(det KYY))
We introduced the Bivariate Gaussian distribution for the case of two independent N(0,1) Gaussians (with the same variance = 1) and arrived at a zero mean vector mX and a diagonal covariance matrix KXX = diag(1,1) corresponding to a pair of uncorrelated Gaussian RVs, displayed in the first line of the table. The second line of the table shows the results of making a linear transformation of variables Y = AX + b from the X1, X2 coordinates to the new Y1, Y2 coordinates; note that the vector b = [b1, b2]ᵀ represents the displaced origin of the Y1, Y2 coordinates relative to X = [0, 0]ᵀ. We see that the new mean vector is no longer zero but rather mY = b, and the new covariance KYY = AAᵀ no longer has unit variances along the diagonal but, in general, now has non-zero off-diagonal elements as well. The fact that this linear transformation yields non-zero off-diagonal elements in the covariance matrix means that the new RVs Y1, Y2 are no longer uncorrelated. The computations supporting these table entries are straightforward. The new mean is obtained by taking the expectation E[Y] = E[AX + b] and using the fact that the original mean E[X] is zero to give mY = E[Y] = b. Substituting this value b for mY in the covariance expression KYY = E[(Y−b)(Y−b)ᵀ] yields KYY = E[(AX)(AX)ᵀ] = A·E[XXᵀ]·Aᵀ = AAᵀ, since E[XXᵀ] = KXX = I (i.e., the identity matrix diag(1,1)). In order to find the new bivariate density fY1,Y2(y1,y2) we need to divide fX1,X2(x1,x2) by the Jacobian determinant J(Y,X) and replace X by A⁻¹(Y−b). This Jacobian is found by differentiating the transformation Y = AX + b to find J = det[∂Y/∂X] = det(A); note that this is easily verified by writing out the two equations explicitly and differentiating y1 and y2 with respect to x1 and x2 to obtain the partials ∂yi/∂xj = aij and then taking the determinant to find the Jacobian. Taking det(KYY) = det(AAᵀ) and using the fact that det(A) = det(Aᵀ), we find that det A = det(KYY)^½. Finally, substituting this and X = A⁻¹(Y−b) yields the general Bivariate Normal Distribution fY(y) given in the grey boxed equation at the bottom of the slide. Be careful to note that the inverse KYY⁻¹ occurs in the exponential quadratic form while the matrix KYY itself occurs in the denominator det(KYY)^½; also observe the "shorthand" vector notation for the bivariate density fY(y) in place of the more explicit fY1,Y2(y1,y2).
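The computations mY = b and KYY = AAᵀ, along with det KYY = (det A)², can be checked on a concrete matrix; a stdlib-only sketch (the particular A and b are illustrative choices):

```python
# 2x2 helpers (kept stdlib-only so the sketch is self-contained)
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def det(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

# Illustrative transformation Y = AX + b of independent N(0,1) components
A = [[2.0, 1.0],
     [0.0, 1.0]]
b = [3.0, -1.0]

m_Y = b                         # new mean: m_Y = A*0 + b = b
K_YY = matmul(A, transpose(A))  # new covariance: K_YY = A A^T

print(m_Y)                      # [3.0, -1.0]
print(K_YY)                     # [[5.0, 1.0], [1.0, 1.0]]
print(det(K_YY), det(A) ** 2)   # both 4.0: det K_YY = (det A)^2
```

The non-zero off-diagonal entry of `K_YY` shows concretely how the linear transformation turns uncorrelated components into correlated ones.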
2/24/2012 135
Bivariate Gaussian Distribution & Level Surfaces

K = [[σ1², ρσ1σ2], [ρσ1σ2, σ2²]] ; det(K) = σ1²·σ2²·(1 − ρ²) ≥ 0

fY1Y2(y1, y2) = exp[−½·yᵀ·KYY⁻¹·y] / (2π·√(det KYY))

The level surfaces of the Gaussian probability surface fY1Y2(y1, y2) are 2d ellipses in the y1–y2 plane:

• −1 < ρ < +1: ellipse in y1–y2 space; y1 & y2 are dependent: fY1Y2(y1, y2) ≠ fY1(y1)·fY2(y2)
– positive correlation (ρ > 0): principal axes rotated +45° to the y1-axis
– negative correlation (ρ < 0): principal axes rotated −45° to the y1-axis
• ρ = 0 (NO correlation): diagonal terms only; either ellipse or circle with principal axes along y1 & y2; independent: fY1Y2(y1, y2) = fY1(y1)·fY2(y2)
– σ1 > σ2: ellipse along the principal axes
– σ1 = σ2: circle, arbitrary orientation
• ρ = ±1 (degenerate ellipses): the ellipse areas collapse to a straight line along one of the "principal axes", y2 = ±ρ·y1 = ±y1; y1 & y2 are "extremely dependent", completely correlated (ρ = +1, +45°) or anti-correlated (ρ = −1, −45°)
The bivariate density fY(y) = fY1,Y2(y1,y2) is completely determined by its mean vector mY and its covariance matrix KYY as given by the equations on the upper right. Consider the Bivariate Gaussian density, which is plotted as a 2d surface relative to its mean vector components mY1 and mY2 taken as the origin. The level surfaces represented by cuts parallel to the y1–y2 plane are the ellipses given by the quadratic form equation of the last slide. The structure of these ellipses is shown in the tableau consisting of 3 columns for positive, negative, and zero correlation coefficient ρ, and 2 rows corresponding to general (top row) and degenerate cases. The general cases in the top row have unequal sigmas σ1 > σ2, and as we go across the row we have an ellipse with positive correlation (ρ > 0), one with negative correlation (ρ < 0), and an ellipse along its principal axes with no correlation (ρ = 0). The (red) arrows show the directions of the principal axes of the ellipse in each case; the zero correlation case on the extreme right has the principal axes coinciding with y1 and y2, while the negative correlation case has its principal axes rotated at −45° to the y1-axis and the positive correlation case has its principal axes rotated at +45° to the y1-axis. The bottom row illustrates the two degenerate cases ρ = +1 and ρ = −1, in which the ellipse "collapses" to a straight line corresponding to complete correlation or anti-correlation (opposite variations of Y1 and Y2) respectively, and the degenerate uncorrelated case ρ = 0 in which the principal-axis ellipse above it degenerates into a circle because the two sigmas are equal (σ1 = σ2).
2/24/2012 138
Ellipses of Concentration

1D Gaussian Distribution described by two scalars: mean μX & Var(X) — intuitive.
• Normalized & centered RV: Y = (X − μX)/σX — the standardized distribution
• Tabulation of the CDF:
Φ(y) = ∫(−∞, y] (1/√(2π))·e^(−t²/2) dt

2D Gaussian Distributions described by a vector & matrix: mean vector mX & covariance KXX.
Vector mX and matrix KXX are not very intuitive!
fX1X2(x1, x2) = exp[−½·xᵀ·KXX⁻¹·x] / (2π·√(det KXX))

"Level curves" of the zero mean 2D Gaussian surface with covariance KXX — tabulate the area:
xᵀ·KXX⁻¹·x = (1/(1−ρ²))·[x1²/σ1² − 2ρ·x1·x2/(σ1σ2) + x2²/σ2²] = c² = const.
The 1-dimensional Gaussian distribution is completely described by two scalars, the mean μX and the variance σX². The tabulation of a single integral for the cumulative distribution function FY(y) shown in the left box is sufficient to characterize all Gaussians X: N(μX, σX²) if we first transform to a standardized Gaussian RV Y via Y = (X − μX)/σX. The Gaussian integral representing the probability distribution for the standardized RV, Pr[Y≤y] = FY(y), is used so often it is denoted the "Normal Integral" Φ(y). We would like to extend this concept of a single tabulated integral to describe all 2-dimensional Gaussian distributions; however, as we have seen, the Bivariate Gaussian distribution requires more than just the means and variances of two Gaussians, as we must also characterize their "co-variation" by specifying their correlation coefficient ρ. Thus we must specify the two elements of the mean vector mX and all three elements of the (symmetric) covariance matrix KXX in order to completely characterize a Bivariate Gaussian fX1X2(x1,x2), given in the right box of the slide. We have seen that the level "surfaces" (actually curves) of the Gaussian PDF are ellipses centered about the mean vector coordinates μX1 and μX2 and described by the quadratic form xᵀKXX⁻¹x in the exponent of the PDF. The explicit equation for the level curves with zero mean is obtained by setting this term equal to an arbitrary positive constant c², as given by the equation in the slide. These ellipses are called ellipses of concentration because the area contained within them measures the concentration of probability for the specific "cut through" the PDF surface. In the next few slides we will show how this leads to a single tabulated function for the Bivariate Gaussian that is analogous to Φ(y) for the Normal Distribution.
2/24/2012 141
Gaussian & Bivariate (2d) Gaussian Distributions Compared

Probability for x to be within an ellipse "scaled by c":
Prob(xᵀ·KXX⁻¹·x < c²) = FC(c) = 1 − e^(−c²/2) = α
Note: the inverse covariance KXX⁻¹ determines the ellipse.

Scale factor c in terms of % concentration:
c = √(−2·ln(1 − α))

Equivalent 1d sigma table:
1d sigma | α (%) | c
1-σ | 68.3 | 1.52
2-σ | 95.4 | 2.48
3-σ | 99.7 | 3.41

[Figure: the c = 1.52 "slice" through the 2d Gaussian probability surface yields the 2d ellipse containing the α = 68.3% probability region — the analogue of the ±1-σ (68.3%) area under the 1d density fX(x).]
On the last slide we found that the 2d probabilities are described in terms of ellipses of concentration specified by the axis scale parameter c, which is related to the percentage of events contained within the ellipse by the expression shown in the slide. This CDF is in fact a Rayleigh distribution with "radial distance r" replaced by the ellipse scale parameter "c". Setting this probability within the ellipse (parameterized by the value "c") equal to α allows us to solve for the value of c in the boxed equation. Using this equation, we compute the table which displays the values of the ellipse scaling parameter "c" corresponding to the standard values of 1-σ (68.3%), 2-σ (95.4%), and 3-σ (99.7%) associated with a 1-dimensional Gaussian distribution. These ellipses are used to specify equivalent "standard deviations" for the Bivariate Gaussian, and extending this tabulation to all probabilities allows us to define a standard Bivariate Normal function Ψ(c) similar to Φ(x) for the Normal Gaussian. The two figures illustrate this equivalence by showing the c = 1.52 cut through the Bivariate Gaussian surface yielding an equivalent "1-σ" ellipse containing α = 68.3% of the probability, and then notionally comparing the ellipse with the "1-σ" area under the standard Gaussian curve.
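The table values follow directly from c = √(−2 ln(1 − α)); a one-function sketch reproducing them:

```python
import math

def c_from_alpha(alpha):
    """Ellipse scale c containing probability alpha: alpha = 1 - exp(-c^2/2)."""
    return math.sqrt(-2.0 * math.log(1.0 - alpha))

# Reproduce the equivalent 1d-sigma table
for sigma, alpha in ((1, 0.683), (2, 0.954), (3, 0.997)):
    print(f"{sigma}-sigma  alpha={alpha:.1%}  c={c_from_alpha(alpha):.2f}")
```

Inverting the same relation, α = 1 − e^(−c²/2) recovers the concentration for any chosen slice c, which is how an extended tabulation of Ψ(c) would be built.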
2/24/2012 151
Closure Under Bayesian Updates – Summary

Summary: started with a pair of N(0,1) RVs X & Y with correlation ρ:
r = [X, Y]ᵀ ; μ ≡ E[r] = [0, 0]ᵀ ; KXY = [[1, ρ], [ρ, 1]]

1) The joint distribution is a correlated Gaussian in X and Y:
fXY(x, y) = exp[−(x² − 2ρxy + y²)/(2(1−ρ²))] / (2π·√(1−ρ²))
2) The marginal fY(y) is found to be N(0,1):
fY(y) = (1/√(2π))·e^(−y²/2)
3) The Bayes' update fX|Y(x|y) is Gaussian, namely N(ρy, 1−ρ²):
fX|Y(x|y) = exp[−(x − ρy)²/(2(1−ρ²))] / √(2π(1−ρ²))
4) Pick off the "conditional" mean & variance from fX|Y(x|y):
μX|Y ≡ E[X|Y] = ρy ; Var(X|Y) = 1 − ρ²
The conditional mean represents an "estimate of X given the measurement Y", with Var(X|Y) obtained from the Bayes'-updated Gaussian.

Generalize: start with a general Gaussian vector with non-zero means & variances:
μ = [μX, μY]ᵀ ; K = [[σX², ρσXσY], [ρσXσY, σY²]]
The conditional mean and variance represent the Bayes' update equation:
μX|Y ≡ E[X|Y] = μX + ρ·(σX/σY)·(y − μY)
Var(X|Y) = σX²·(1 − ρ²) ; σX|Y = σX·√(1 − ρ²)

Note 1: in the "Gaussian Arena" we do not need to work with distributions directly, since both 1) linear Xforms & 2) the Bayes' update equation yield Gaussian vector results (surrogates for the joint and conditional distributions respectively).
Note 2: Y is irrelevant for ρ = 0; X & Y indep ⇒ the conditionals do not depend upon the value of y: μX|Y = μX & Var(X|Y) = σX².
Closure Under Bayesian Updates started with a pair of correlated N(0,1) Gaussian RVs with correlation coefficient ρ and resulted in a Gaussian conditional distribution fX|Y(x|y) whose conditional mean is µX|Y = E[X|Y] = ρy and whose conditional variance is Var(X|Y) = σX|Y² = 1 − ρ². If instead we start with a pair of correlated Gaussian RVs having different means and variances, given by the mean vector µ and covariance matrix K shown in the middle panel of the slide, we obtain the general result for a Gaussian with a conditional mean E[X|Y] = µX|Y = µX + ρσX(y − µY)/σY and conditional variance Var(X|Y) = σX|Y² = σX²(1 − ρ²), given in the boxed equation. The lower panel interprets these results in terms of a two-dimensional "Gaussian Arena" in which the input and output are related by the underlying joint Gaussian distribution, which remains Gaussian for all possible linear coordinate transformations and even maintains its Gaussian character when one of the variables is conditioned on the other. Thus the Gaussian vector remains Gaussian under both linear transformations and Bayes' updates. Also note that if the correlation is zero (ρ = 0), then the input and output variables are independent, as is evident in the boxed equations, which reduce to statements that the conditional mean equals the mean (µX|Y = µX) and the conditional variance equals the variance (σX|Y² = σX²).
We note in passing that because the quadratic form in the joint Gaussian is symmetric in the X and Y variables, we could just as well have computed the output Y conditioned on the input X to find analogous results with X and Y interchanged, corresponding to the forward Bayesian relation. A visual interpretation of this result will be given in the next slide, and further insight into the role of the communication channel and its inverse will be given in the slides after that.
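The boxed update equations above can be exercised with a few lines of code; a minimal sketch, in which all the parameter values are made-up numbers for illustration (not from the slide):

```python
# a priori parameters for X and the measured variable Y (values are illustrative)
muX, muY = 10.0, 5.0
sX, sY = 2.0, 1.0
rho = 0.8
y = 6.5                                        # the observed "measurement"

mu_post = muX + rho * (sX / sY) * (y - muY)    # conditional mean E[X|Y=y]
var_post = sX**2 * (1 - rho**2)                # conditional variance Var(X|Y)

print(round(mu_post, 2), round(var_post, 2))   # 12.4 1.44
```

The measurement pulls the mean from 10.0 up to 12.4, while the variance shrinks from 4.0 to 1.44 by the factor (1 − ρ²), independent of the observed value y.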
152
Visualization of Conditional Mean
µX|Y = µX + ρ(σX/σY)(y − µY) ;  ρ = E[XY]/(σX·σY)

Recall the Covariance Ellipse Construction ("origin at" (µX, µY)):
x̃² − 2ρx̃ỹ + ỹ² = (1 − ρ²)·c² ;  x̃ = (x − µX)/σX , ỹ = (y − µY)/σY

Extremum: Choose an arbitrary y0 "slice"; the line y = y0 is tangent to the ellipse whose maximum is ymax = y0 = +c. The tangency condition x̃0 = ρỹ0, i.e., (x0 − µX)/σX = ρ·(y0 − µY)/σY, gives the corresponding x-value

x0 = µX + ρ·σX·(y0 − µY)/σY = µX|Y=y0

so x0 is the mean "conditioned on the y0-slice". Along the "y0-slice" the distribution fX|Y(x) is Gaussian with conditional mean µX|Y and conditional variance σX|Y².

General Case: given a priori µX ; σX², the Bayesian Update conditions X on Y and yields a posteriori

µX|Y = µX + ρ(σX/σY)(y − µY) ;  σX|Y² = (1 − ρ²)σX²

Special Cases:
If ρ = 0: independent (Y is irrelevant), µX|Y = µX
If ρ = +1: direct correlation, µX|Y = µX + (σX/σY)(y − µY)
If ρ = −1: inverse correlation, µX|Y = µX − (σX/σY)(y − µY)
For ρ = ±1 the ellipse is degenerate (a straight line): on the slice y = y0 the distribution is a single unique point with zero variance!
The results for the conditional mean and variance can be understood graphically as follows. Starting with the Bivariate Gaussian density, we draw the elliptical contours corresponding to horizontal cuts through the density surface, centered at the mean coordinates µX and µY indicated by the black dot at the center. If we choose a fixed value y = y0, the line parallel to the x-axis is tangent to one of the ellipses, and hence y0 represents the maximum y-value for that ellipse, as shown by the red dot. This line also results from a vertical plane y = y0 cutting through the distribution, and the Gaussian cut through the distribution is shown above the contours. The x-coordinate corresponding to this maximum is obtained by dropping a perpendicular onto the x-axis at a value x0 = µX|Y=y0, as shown in the figure. Recalling the calculation used for the covariance ellipse construction, the x0-value corresponding to this maximum at y = y0 is given in standardized coordinates by x0 = ρy0, which is converted to the coordinates of the figure by letting x0 -> (x0 − µX)/σX and y0 -> (y0 − µY)/σY to yield (x0 − µX)/σX = ρ(y0 − µY)/σY, or x0 = µX + ρσX(y0 − µY)/σY, which is exactly the statement that x0 is the conditional mean µX|Y=y0. The three special cases ρ = 0, +1, −1 shown in the bottom panel are:
(i) ρ = 0, no correlation: corresponds to a coordinate system along the principal axes of the ellipse, for which a constant y = y0 cut will always yield a conditional mean µX|Y=y0 = µX
(ii) ρ = +1, complete positive correlation: corresponds to the case where the ellipse collapses to a straight line; the conditional distribution is a single point with zero variance on the line with slope (σY/σX), as shown, and yields a conditional mean µX|Y=y0 = µX + σX(y0 − µY)/σY
(iii) ρ = −1, complete negative correlation: corresponds to the case where the ellipse collapses to a straight line; the conditional distribution is a single point with zero variance on the line with slope (−σY/σX) (not shown) and yields a conditional mean µX|Y=y0 = µX − σX(y0 − µY)/σY
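The tangency construction can be checked numerically: scanning the joint density along a fixed y0-slice, its maximum lands at exactly x0 = µX + ρσX(y0 − µY)/σY. A brute-force sketch (all parameter values are mine, chosen only for illustration):

```python
import math

# illustrative bivariate Gaussian parameters (not from the slide)
muX, muY, sX, sY, rho = 1.0, 2.0, 1.5, 0.5, 0.6
y0 = 2.8                                   # the fixed "y0-slice"

def joint_unnorm(x, y):
    """Bivariate Gaussian density up to a constant factor
    (constants do not move the location of the maximum)."""
    xt = (x - muX) / sX
    yt = (y - muY) / sY
    return math.exp(-(xt*xt - 2*rho*xt*yt + yt*yt) / (2*(1 - rho*rho)))

# brute-force scan of the slice y = y0 for the maximizing x
grid = [muX + i * 0.001 for i in range(-5000, 5001)]
x_best = max(grid, key=lambda x: joint_unnorm(x, y0))

x_formula = muX + rho * (sX / sY) * (y0 - muY)   # conditional mean at the slice
print(round(x_best, 3), round(x_formula, 3))     # 2.44 2.44
```

The scanned maximum and the closed-form conditional mean agree to the grid resolution, which is the graphical statement made in the figure.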
155
Rationale for “Inverse Channel” & Generating Correlated RVs
Given Y: an N(0,1) RV, generate X: N(0,1) correlated to Y with coefficient ρ.

Inverse Channel Method: X = ρY + V
input Y = N(0,1)  →  multiply by ρ  →  add noise V = N(0, 1−ρ²)  →  output X = N(0,1)

(i) Generate samples of RV "Y" using a standard method (e.g., sum 12 uniform variates on [−0.5, 0.5]) to yield N(0,1).
(ii) Generate zero-mean Gaussian noise "V" with variance 1 − ρ² to yield N(0, 1−ρ²).
(iii) Multiply each RV sample "Y" by the desired correlation coefficient ρ.
(iv) Add the noise sample "V" to obtain the output "X", which is N(0,1) and has the desired correlation coefficient correl(X,Y) = ρ.

Rationale: "X = ρY + V"
(i) If noise is not added, X = ρY: Var(X) = Var(ρY) = ρ² Var(Y) = ρ² ≠ 1
(ii) If uncorrelated noise is added, X = ρY + V, with the appropriate Var(V) = (1 − ρ²) to complement the correlated contribution to Var(X), then
Var(X) = Var(ρY + V) = ρ² Var(Y) + Var(V) + 2ρ Cov(Y,V)
       = ρ²·1 + (1 − ρ²) + 0 = 1

Special Cases: "X = ρY + V" ; −1 ≤ ρ ≤ +1
ρ = 0: No correlation between X & Y.
  0·Y + N(0, 1−0²) = N(0,1) → X
  X is simply the uncorrelated noise sample N(0,1).
ρ = ±1: Full correlation/anti-correlation (degenerate ellipse or straight line)
  (±1)·Y + N(0, 1−(±1)²) = ±Y → X
  X is simply the ±Y value.
−1 < ρ < 1: General correlation
  ρ·Y + N(0, 1−ρ²) → X
  X results from multiplying Y by the correlation ρ and adding noise with variance (1 − ρ²).
The last couple of slides considered the inverse channel and its relation to a Bayesian update, which starts with an a priori value of the mean µX and variance σX² and then updates their values as a result of an actual "measurement Y". The conditional mean and variance formulas that we found comported both with the Bayesian update equation for conditional probability densities and with those obtained by constructing an inverse channel which creates an input X from an output Y. In this slide and the next we consider this important "coincidence" in some detail. The box on the left uses the inverse channel model as a computer program flow diagram to actually generate a RV X ~ N(0,1) from a linear combination of Y ~ N(0,1) and noise V ~ N(0, 1−ρ²). Note that the input and output RVs are both N(0,1) Gaussians with unit variance, yet the noise must have a variance that is less than unity for this to work. The rationale is simple enough, for consider what might be your first impulse to generate a pair of correlated RVs by setting Y = ρX (upper right box); taking the expectations E[Y] and E[Y²] we find µY = ρµX = ρ·0 = 0 and σY² = ρ²σX² = ρ² ≠ 1, which does not agree with the assumption that both X and Y are N(0,1). Agreement is possible only if we add zero-mean noise with variance (1−ρ²), because when added to ρ² it yields the desired unit variance for the RV Y. The special cases of no correlation (ρ = 0) and full positive and negative correlation (ρ = ±1) are explicitly shown to be in agreement with this model. For no correlation the model gives X as just N(0,1) random noise, which takes on values completely independent of the y-values. On the other hand, for full positive (or negative) correlation the model gives X as N(0,1) which takes on values that are exactly the same as those for Y (or −Y). In the general case −1 < ρ < +1 the model gives X as an N(0,1) RV which tracks Y more closely for correlations near +1 and tracks the noise more closely for correlations nearer to zero, thus giving the expected intermediate behavior.
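The flow diagram in the left box translates almost line for line into code. A sketch of steps (i)-(iv), using a library Gaussian generator in place of the 12-uniform-sum trick (the helper names are mine):

```python
import random
import math

def correlated_pair(rho, n=200_000, seed=1):
    """Inverse-channel method: X = rho*Y + V with V ~ N(0, 1 - rho^2),
    so that X is N(0,1) and correl(X, Y) = rho."""
    rng = random.Random(seed)
    sigma_v = math.sqrt(1.0 - rho * rho)    # noise std chosen so Var(X) = 1
    xs, ys = [], []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)             # (i)   Y ~ N(0,1)
        v = rng.gauss(0.0, sigma_v)         # (ii)  V ~ N(0, 1 - rho^2)
        x = rho * y + v                     # (iii)+(iv) scale Y and add noise
        xs.append(x)
        ys.append(y)
    return xs, ys

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

xs, ys = correlated_pair(0.7)
print(round(corr(xs, ys), 2))               # sample correlation ≈ 0.7
```

With 200,000 samples the sample correlation lands within a percent or so of the requested ρ, and the sample variance of X stays near 1, as the rationale box argues it must.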
157
Multilinear Gaussian Distribution
n-dimensional Gaussian Vector X= [ X1, X2,... Xn]T
fX(x) = (2π)^(−n/2) (det KXX)^(−1/2) exp[ −½ (x − µX)^T KXX^(−1) (x − µX) ]

Matrix components: Krc = E[(Xr − µr)(Xc − µc)] ;  r, c = 1, 2, …, n

      [ K11  K12  K13  …  K1n ]
K  =  [ K21  K22  K23  …  K2n ]
      [  ⋮    ⋮    ⋮        ⋮ ]
      [ Kn1  Kn2  Kn3  …  Knn ]

Moment Generating Fcn: φX(t) = E[e^(t^T X)] = e^(µX^T t + ½ t^T KXX t) ;  t = [t1, t2, …, tn]^T

1st and 2nd Moments (mean vector µX & covariance KXX) uniquely define the Multivariate Gaussian.

Still Gaussian After Linear Transformation: Y = AX + b  (See Next Slide =>)

Details:
µY = E[Y] = E[AX + b] = AµX + b
Y − µY = (AX + b) − (AµX + b) = A(X − µX)
KYY = E[(Y − µY)(Y − µY)^T] = E[A(X − µX)(X − µX)^T A^T] = A E[(X − µX)(X − µX)^T] A^T = A KXX A^T

Hence µY = AµX + b ,  KYY = A KXX A^T , and

fY(y) = (2π)^(−n/2) (det KYY)^(−1/2) exp[ −½ (y − µY)^T KYY^(−1) (y − µY) ]
The extension to Multilinear Gaussian distributions or vectors is straightforward; taking the product of "n" independent N(µ, σ²) Gaussians, symbolized by the vector X = [X1, X2, ..., Xn]T, yields an n-dimensional Gaussian characterized by an n-dimensional mean vector µX and an n x n covariance matrix KXX whose diagonal elements equal the variances of the individual RVs and whose off-diagonal elements are all zero. Even if we start with independent RVs, a linear transformation of the form Y = AX + b produces correlations, and the off-diagonal terms of the new covariance matrix are no longer zero. The transformation leaves the Gaussian structure the same, but the mean and covariance become µY = AµX + b and KYY = A KXX AT respectively. The Gaussian always has the form fX(x) = (2π)^(-n/2)(det KXX)^(-1/2) exp(-½ q) with the scalar quadratic q = [x − µX]T KXX^(-1)[x − µX]. The row-column components of the covariance matrix are determined by the expected values of the "row-col" pair products of centered deviations. The moment generating function generalizes to φX(t) = E[exp(tT X)] = exp(½ tT KXX t + µX^T t) with t = [t1, t2, ..., tn]T. Note that we have reverted to the old notation in which the components of the Gaussian vectors are labeled by indexed quantities Xi and the new components under a coordinate transformation are Yi. This is temporary, however, because we shall want to consider communication channels with a number of inputs and a number of outputs and partition the n-dimensional Gaussian vector into these two distinct types of components in order to define the conditional distribution µX|Y in a useful manner.
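A small 2 x 2 numerical check of µY = AµX + b and KYY = A KXX A^T, written with plain lists so it stays self-contained (all the matrix values are made up for illustration):

```python
# tiny matrix helpers so the example stays self-contained
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

# made-up 2-D Gaussian parameters and affine map Y = AX + b
muX = [[1.0], [2.0]]
KXX = [[2.0, 0.6],
       [0.6, 1.0]]
A = [[1.0, 1.0],
     [0.0, 2.0]]
b = [[0.5], [-1.0]]

AmuX = matmul(A, muX)
muY = [[AmuX[i][0] + b[i][0]] for i in range(2)]        # muY = A muX + b
KYY = matmul(matmul(A, KXX), transpose(A))              # KYY = A KXX A^T

print(muY)                                              # [[3.5], [3.0]]
print([[round(v, 6) for v in row] for row in KYY])      # [[4.2, 3.2], [3.2, 4.0]]
```

Note that the transformed covariance comes out symmetric, as the sandwich form A KXX A^T guarantees, and the off-diagonal terms are non-zero even though one could have started from a diagonal KXX.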
159
Partitioned Multivariate Gaussian & Xfm to Block Diagonal
Partition: [X(1) | X(2)]^T  {Comm Channel with multiple inputs "X" = X(1) & outputs "Y" = X(2)}

2 x 1 Partitioned Vectors:
x = [ x(1) | x(2) ]^T ,  x(1) = [x1, …, xk]^T ,  x(2) = [xk+1, …, xn]^T ;  likewise µ = [ µ(1) | µ(2) ]^T

2 x 2 Partitioned Matrix:
KXX = [ K(1)(1)   K(1)(2) ]   with block dims  [ k x k       k x (n−k)     ]
      [ K(2)(1)   K(2)(2) ]                    [ (n−k) x k   (n−k) x (n−k) ]

Perform the Linear Xfm in "partitioned form" (now drop the parentheses notation for partitioned components!):
[ y(1) ]   [ A11   A12 ] [ x(1) ]
[ y(2) ] = [ A21   A22 ] [ x(2) ]

Find the "B" matrix so that the new KYY is block diagonal; choose
A = [ I(k x k)       B(k x (n−k))     ]
    [ 0((n−k) x k)   I((n−k) x (n−k)) ]

Then
KYY = A KXX A^T = [ I  B ] [ K11  K12 ] [ I    0 ]
                  [ 0  I ] [ K21  K22 ] [ B^T  I ]

                = [ K11 + B K21 + K12 B^T + B K22 B^T    K12 + B K22 ]
                  [ K21 + K22 B^T                        K22         ]

Forcing the off-diagonal blocks to vanish yields the two conditions
K21 + K22 B^T = 0  and  K12 + B K22 = 0
Consider a multi-dimensional communication channel partitioned into two sets as follows: "X": k inputs X(1) = [X1, X2, ..., Xk]T and "Y": (n−k) outputs X(2) = [Xk+1, Xk+2, ..., Xn]T. The mean vector and covariance matrix are partitioned in the same manner to yield a 2 x 1 partitioned vector X(I) and a 2 x 2 partitioned covariance matrix K(I)(J). Note that the partition dimensions of K(I)(J) are specifically as follows: Row#1 [K11 : K12] = [k x k : k x (n−k)]; Row#2 [K21 : K22] = [(n−k) x k : (n−k) x (n−k)]. Now let's perform a linear transformation to a new coordinate system according to the equation Y = AX + b, where it is now understood that Y(I), X(I), and b(I) are all partitioned in the same manner as 2 x 1 column vectors and the matrix A(I)(J) is partitioned into a 2 x 2 matrix which corresponds to the partitioning of the original covariance matrix K(I)(J), as shown in detail on the slide. The transformed covariance matrix KYY is defined by the product of n x n matrices A KXX AT; in partitioned form we instead have a product of three 2 x 2 matrices. The sub-matrices in the partition of A(I)(J) are chosen as follows: A(1)(J) = [Ik,k : Bk,(n−k)] and A(2)(J) = [0(n−k),k : I(n−k),(n−k)] (labeled by their dimensions). The problem is to find the matrix B such that the new covariance matrix KYY is block diagonal; taking the product of the three partitioned matrices A KXX AT results in the 2 x 2 partitioned matrix shown at the bottom of the slide. Forcing the two "off-diagonal" partitions (circled) to be zero yields two conditions on the matrix B and its transpose BT as follows:
(1) K21 + K22 BT = 0 ; (2) K12 + B K22 = 0
Note that the partitioned components are those of the original matrix KXX, so for example K21 is the 2,1 partition component (KXX)21. On the next slide we formally solve for B and BT and write down the explicit form of the block-diagonal matrix KYY with just two non-zero blocks, namely (KYY)11 and (KYY)22. This will allow us to factor the multivariate Gaussian and prove a very elegant generalization of Bayes' Update for the conditional mean and conditional covariance known as the Gauss-Markov Theorem.
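The two conditions solve to B = −K12 K22^(−1). A small numerical sketch with k = 2 and n − k = 1, so that K22 is a scalar and its inverse is a simple reciprocal (the matrix values are mine, chosen only for illustration):

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

# made-up covariance with k = 2 inputs and n - k = 1 output
KXX = [[4.0, 1.0, 2.0],
       [1.0, 3.0, 1.0],
       [2.0, 1.0, 2.0]]
K12 = [KXX[0][2], KXX[1][2]]                 # k x (n-k) block, flattened
K22inv = 1.0 / KXX[2][2]                     # 1x1 block inverse

B = [-K12[0] * K22inv, -K12[1] * K22inv]     # B = -K12 K22^{-1}
A = [[1.0, 0.0, B[0]],
     [0.0, 1.0, B[1]],
     [0.0, 0.0, 1.0]]

KYY = matmul(matmul(A, KXX), transpose(A))
print([[round(v, 6) for v in row] for row in KYY])
# off-diagonal blocks vanish: [[2.0, 0.0, 0.0], [0.0, 2.5, 0.0], [0.0, 0.0, 2.0]]
```

The surviving upper-left block is exactly K11 − K12 K22^(−1) K21, the conditional covariance that reappears on the Gauss-Markov slide.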
163
Gauss-Markov Theorem: Updating Gaussian Vectors under Bayes' Rule

Given X and Y are jointly Gaussian random input and output vectors with dim k and (n−k) respectively, combine them to form an n-dim vector with partitioned mean and covariance as follows:

r = [ X | Y ]^T  (dims k and n−k) ;  µ = [ µX | µY ]^T

K (n x n) = [ KXX   KXY ]   with block dims  [ k x k       k x (n−k)     ]
            [ KYX   KYY ]                    [ (n−k) x k   (n−k) x (n−k) ]

The Gauss-Markov Theorem states that the conditional PDF of "X given Y" is also Gaussian, with conditional mean & covariance given by

µX|Y (k x 1) = µX + KXY KYY⁻¹ (y − µY)
KX|Y (k x k) = KXX − KXY KYY⁻¹ KYX

Note: Although the covariance K is symmetric, the blocks themselves are not, i.e., KXY (k x (n−k)) ≠ KYX ((n−k) x k). Symmetry of K requires the following relationship for the off-diagonal blocks: KXY^T = KYX.
The results of the last section for the n-dimensional Multivariate Gaussian are now cast in a form more suitable for a communication channel. We introduce new notation in which the 1st partition of the Gaussian vector consists of the k inputs X = [X1, ..., Xk]T and the 2nd partition consists of the n−k outputs Y = [Y1, ..., Yn−k]T. The mean vector µ and covariance matrix K are partitioned in a natural manner as shown on the slide. In this notation, the Gauss-Markov Theorem states that the conditional PDF of "vector X given vector Y" is also a Gaussian with conditional mean and covariance given by the two boxed equations. This is identical to the results of the previous slide, but in a new notation. Note that a possible source of confusion is to equate the partitions X and Y (whose dimensions k + (n−k) add to the "partition" of n) with the transformation of coordinates Y = AX used to transform between two n-dimensional coordinate systems from X to the canonical coordinates Y. Also note that even though the full n x n covariance matrix is symmetric, Krc = Kcr with respect to its indices (i.e., K = KT), this is no longer true for the partitioned components, K(R)(C) ≠ K(C)(R), as evidenced by the fact that KXY ≠ KYX; they usually do not even have the same dimensions. The symmetry of the full matrix requires that blocks with transposed partition indices be transposes of one another, i.e., KXY^T = KYX, which is possible because these two matrices do have the same dimensions. The Gauss-Markov Theorem is the basis for using the conditional mean estimator µX|Y to update the a priori mean value µX = E[X] of a k-dimensional state vector X by using an (n−k)-dimensional measurement vector Y. The state and measurement vectors must be part of the same multivariate Gaussian distribution, or equivalently they must be components of a partitioned Gaussian vector whose means, variances, and correlations are given by the partitioned n-dimensional mean vector and covariance matrix shown at the top of the slide. They indeed form a Gaussian "Arena".
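For a k = 2 state and a single scalar measurement, the two boxed equations reduce to elementary arithmetic, since KYY⁻¹ is just 1/KYY. A sketch with made-up numbers:

```python
# k = 2 state components X and one scalar measurement Y, so that
# K_YY^{-1} is simply 1/K_YY; all numerical values are illustrative
muX = [1.0, 2.0]
KXX = [[4.0, 1.0],
       [1.0, 3.0]]
KXY = [2.0, 1.0]           # k x 1 cross-covariance block, flattened
KYY = 2.0
muY = 0.5
y = 1.5                    # the observed measurement

gain = [kxy / KYY for kxy in KXY]                          # K_XY K_YY^{-1}
mu_cond = [muX[i] + gain[i] * (y - muY) for i in range(2)] # conditional mean
K_cond = [[KXX[i][j] - gain[i] * KXY[j] for j in range(2)] # conditional covariance
          for i in range(2)]

print(mu_cond)   # [2.0, 2.5]
print(K_cond)    # [[2.0, 0.0], [0.0, 2.5]]
```

As the theorem promises, the conditional covariance is reduced relative to the a priori KXX and does not depend on the observed value y; only the conditional mean moves with the measurement.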
164
Gauss-Markov Estimator
New RVs: The conditional mean µX|Y is evaluated at a specific realization Y = "y"; promoting the realization to the RV Y defines

Estimator RV:  X̂ = µX + KXY KYY⁻¹ (Y − µY)
Error RV:      e = X − X̂ = (X − µX) − KXY KYY⁻¹ (Y − µY)

Note: The "Estimator" and the "Error" depend upon the specific values of X = "x" and Y = "y" and hence generate samples of two new random variables X̂ and e, whose statistics can be inferred from those of X and Y.

The following remarkable properties can be shown for these RVs:
1) e ⊥ X̂ & e ⊥ Y :  E[eX̂] = 0 & E[eY] = 0, i.e., the error e is uncorrelated with ("orthogonal to") both the estimator X̂ and the data Y
2) The Estimator and the RV X have the same correlation with the measurements Y:  KX̂Y = KXY
3) The distributions for X̂ and e satisfy a "Pythagorean Right Triangle Relationship" as shown:  X = X̂ + e , with
   X : N(µX, KXX)
   X̂ : N(µX, Q) ,  Q ≡ KXY KYY⁻¹ KYX
   e : N(0, P) ,   P ≡ KXX − KXY KYY⁻¹ KYX

Gaussian Means & Variances Add:  N(µX, KXX) = N(µX, Q) + N(0, P)

Recall for scalar X & Y (Y = ρX + V):  N(0,1) = N(0, ρ²) + N(0, 1 − ρ²)
The conditional mean is evaluated for a specific "realization" of the Gaussian RVs X = "x" and Y = "y", and hence looking at many realizations allows us to consider the conditional mean µX|Y as a random variable itself. Thus we replace the specific realizations µX|Y and "y" in the update equation by RVs, denoted respectively X-hat and Y, as shown in the first equation. The difference between the true state X and the conditional mean estimate of that state, X-hat, is a RV that represents the estimation error e = X − X-hat, as shown in the second equation. These two equations can be shown to have the following remarkable properties: 1) the error is uncorrelated with either the estimator X-hat or the data Y, 2) the estimator X-hat and the true state X correlate with the measurements in the same way, and 3) the distributions for the RVs X-hat and e satisfy a "Pythagorean Right Triangle Relationship" between their Gaussian designations. Looking at the figure, the true state X ~ N(µX, KXX) lies on the hypotenuse, the estimator X-hat ~ N(µX, Q), where Q = KXY KYY⁻¹ KYX, lies in the plane, and the error e ~ N(0, P), where P = KXX − KXY KYY⁻¹ KYX, is perpendicular to the plane. The vector relation X = X-hat + e forms the right triangle, and the means and variances add so that µX = µX + 0 and KXX = P + Q = (KXX − KXY KYY⁻¹ KYX) + (KXY KYY⁻¹ KYX). For the normal distributions this may be written in the suggestive form
N(µX, KXX) = N(µX, Q) + N(0, P). Also recall that this relationship showed up for the scalar case of a single input X and single output Y in the form Y = ρX + V (where V = e is the noise, and solving for the error gives e = Y − ρX):
N(0,1) = N(0, ρ²) + N(0, 1 − ρ²)
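The Pythagorean decomposition is easy to see by simulation in the scalar case: generate correlated X and Y by the inverse-channel model, form the estimator X-hat = ρY and the error e = X − X-hat, and check that the variances add (Q + P = 1) and the error is orthogonal to the estimator. A sketch (sample size and seed are arbitrary choices of mine):

```python
import random
import math

# scalar Gauss-Markov decomposition by simulation: X = X-hat + e
rho = 0.6
rng = random.Random(7)
n = 200_000
xhat, err = [], []
for _ in range(n):
    y = rng.gauss(0.0, 1.0)
    x = rho * y + rng.gauss(0.0, math.sqrt(1 - rho * rho))  # X ~ N(0,1), correl(X,Y)=rho
    h = rho * y                       # estimator X-hat = E[X|Y] = rho*Y
    xhat.append(h)
    err.append(x - h)                 # error e = X - X-hat (here just the noise V)

# means are zero, so raw second moments estimate the variances/covariance
var = lambda s: sum(v * v for v in s) / len(s)
cov = lambda a, b: sum(u * v for u, v in zip(a, b)) / len(a)

print(round(var(xhat), 2), round(var(err), 2))  # ≈ Q = rho^2 = 0.36 and P = 1 - rho^2 = 0.64
print(abs(cov(err, xhat)) < 0.01)               # True: error orthogonal to estimator
```

The two sample variances sum to approximately 1 = Var(X), which is the scalar statement N(0,1) = N(0, ρ²) + N(0, 1 − ρ²) of the right-triangle relationship.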