Foundations of Probabilities
and Information Theory
for Machine Learning
0.
Random Variables
Some proofs
1.
$E[X+Y] = E[X] + E[Y]$, where $X$ and $Y$ are random variables of the same type (i.e., either discrete or continuous).

The discrete case:
$$E[X+Y] = \sum_{\omega\in\Omega} (X(\omega) + Y(\omega)) \cdot P(\omega) = \sum_{\omega} X(\omega)\, P(\omega) + \sum_{\omega} Y(\omega)\, P(\omega) = E[X] + E[Y]$$

The continuous case:
$$E[X+Y] = \int_x \int_y (x+y)\, p_{XY}(x,y)\, dy\, dx = \int_x \int_y x\, p_{XY}(x,y)\, dy\, dx + \int_x \int_y y\, p_{XY}(x,y)\, dy\, dx$$
$$= \int_x x \int_y p_{XY}(x,y)\, dy\, dx + \int_y y \int_x p_{XY}(x,y)\, dx\, dy = \int_x x\, p_X(x)\, dx + \int_y y\, p_Y(y)\, dy = E[X] + E[Y]$$
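The linearity property above holds for any joint distribution, independent or not, and can be sanity-checked numerically. A minimal Python sketch (the joint distribution and the particular functions `X`, `Y` below are illustrative choices, not from the slides):

```python
# Check E[X + Y] = E[X] + E[Y] on a small discrete sample space Omega;
# X and Y need NOT be independent for linearity to hold.
outcomes = [(i, j) for i in range(1, 4) for j in range(1, 4)]
weights = [i * j + 1 for (i, j) in outcomes]     # arbitrary unnormalized weights
total = sum(weights)
P = [w / total for w in weights]                 # P(omega), normalized

X = lambda omega: omega[0]
Y = lambda omega: omega[1] ** 2                  # some arbitrary function of omega

E = lambda f: sum(f(omega) * p for omega, p in zip(outcomes, P))

lhs = E(lambda omega: X(omega) + Y(omega))
rhs = E(X) + E(Y)
print(abs(lhs - rhs) < 1e-12)  # True: linearity holds for any joint P
```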
2.
$X$ and $Y$ independent $\Rightarrow E[XY] = E[X] \cdot E[Y]$, with $X$ and $Y$ being random variables of the same type (i.e., either discrete or continuous).

The discrete case:
$$E[XY] = \sum_{x\in Val(X)} \sum_{y\in Val(Y)} xy\, P(X=x, Y=y) = \sum_{x\in Val(X)} \sum_{y\in Val(Y)} xy\, P(X=x) \cdot P(Y=y)$$
$$= \sum_{x\in Val(X)} x\, P(X=x) \sum_{y\in Val(Y)} y\, P(Y=y) = \sum_{x\in Val(X)} x\, P(X=x)\, E[Y] = E[X] \cdot E[Y]$$

The continuous case:
$$E[XY] = \int_x \int_y xy\, p(X=x, Y=y)\, dy\, dx = \int_x \int_y xy\, p(X=x) \cdot p(Y=y)\, dy\, dx$$
$$= \int_x x\, p(X=x) \left( \int_y y\, p(Y=y)\, dy \right) dx = \int_x x\, p(X=x)\, E[Y]\, dx = E[Y] \cdot \int_x x\, p(X=x)\, dx = E[X] \cdot E[Y]$$
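Similarly, the product rule for independent variables can be checked exactly on small discrete marginals (the particular marginals below are arbitrary illustrative choices):

```python
from itertools import product

# Exact check: for independent X and Y, E[XY] equals E[X] * E[Y].
# The joint P(x, y) is built as the product of two arbitrary marginals.
pX = {1: 0.2, 2: 0.5, 3: 0.3}
pY = {0: 0.1, 4: 0.6, 7: 0.3}

E_X = sum(x * p for x, p in pX.items())
E_Y = sum(y * p for y, p in pY.items())
E_XY = sum(x * y * pX[x] * pY[y] for x, y in product(pX, pY))

assert abs(E_XY - E_X * E_Y) < 1e-12
print("E[XY] = E[X] * E[Y] for independent X, Y")
```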
3.
Binomial distribution: $b(r; n, p) \stackrel{def.}{=} C_n^r\, p^r (1-p)^{n-r}$

Significance: $b(r; n, p)$ is the probability of drawing $r$ heads in $n$ independent flips of a coin having head probability $p$.

$b(r; n, p)$ indeed represents a probability distribution:
• $b(r; n, p) = C_n^r\, p^r (1-p)^{n-r} \geq 0$ for all $p \in [0,1]$, $n \in \mathbb{N}$ and $r \in \{0, 1, \ldots, n\}$;
• $\sum_{r=0}^{n} b(r; n, p) = 1$:
$$(1-p)^n + C_n^1\, p(1-p)^{n-1} + \cdots + C_n^{n-1}\, p^{n-1}(1-p) + p^n = [p + (1-p)]^n = 1$$
4.
Binomial distribution: calculating the mean
$$E[b(r; n, p)] \stackrel{def.}{=} \sum_{r=0}^{n} r \cdot b(r; n, p) = 1 \cdot C_n^1\, p(1-p)^{n-1} + 2 \cdot C_n^2\, p^2(1-p)^{n-2} + \cdots + (n-1) \cdot C_n^{n-1}\, p^{n-1}(1-p) + n \cdot p^n$$
$$= p\, \left[ C_n^1 (1-p)^{n-1} + 2\, C_n^2\, p(1-p)^{n-2} + \cdots + (n-1)\, C_n^{n-1}\, p^{n-2}(1-p) + n\, p^{n-1} \right]$$
$$\stackrel{(1)}{=} np\, \left[ (1-p)^{n-1} + C_{n-1}^1\, p(1-p)^{n-2} + \cdots + C_{n-1}^{n-2}\, p^{n-2}(1-p) + C_{n-1}^{n-1}\, p^{n-1} \right] = np\, [p + (1-p)]^{n-1} = np$$

For the (1) equality we used the following property:
$$k\, C_n^k = k\, \frac{n!}{k!\,(n-k)!} = \frac{n!}{(k-1)!\,(n-k)!} = \frac{n\,(n-1)!}{(k-1)!\,(n-1-(k-1))!} = n\, C_{n-1}^{k-1}, \quad \forall k = 1, \ldots, n.$$
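A quick way to check the closed form $E[X] = np$ is to evaluate the defining sum directly; a small sketch using Python's `math.comb`:

```python
from math import comb

# Check E[X] = n*p for the binomial b(r; n, p) by summing
# r * C(n, r) * p^r * (1-p)^(n-r) over r = 0..n.
def binom_mean(n, p):
    return sum(r * comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1))

for n, p in [(10, 0.3), (25, 0.5), (7, 0.9)]:
    assert abs(binom_mean(n, p) - n * p) < 1e-12
print("mean = n*p confirmed")
```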
5.
Binomial distribution: calculating the variance
following www.proofwiki.org/wiki/Variance_of_Binomial_Distribution, which cites
"Probability: An Introduction", by Geoffrey Grimmett and Dominic Welsh,
Oxford Science Publications, 1986

We will make use of the formula $Var[X] = E[X^2] - E^2[X]$. By denoting $q = 1-p$, it follows:
$$E[b^2(r; n, p)] \stackrel{def.}{=} \sum_{r=0}^{n} r^2\, C_n^r\, p^r q^{n-r} = \sum_{r=0}^{n} r^2\, \frac{n(n-1)\cdots(n-r+1)}{r!}\, p^r q^{n-r}$$
$$= \sum_{r=1}^{n} r\, \frac{n(n-1)\cdots(n-r+1)}{(r-1)!}\, p^r q^{n-r} = \sum_{r=1}^{n} r\, n\, C_{n-1}^{r-1}\, p^r q^{n-r} = np \sum_{r=1}^{n} r\, C_{n-1}^{r-1}\, p^{r-1} q^{(n-1)-(r-1)}$$
6.
Binomial distribution: calculating the variance (cont'd)

By denoting $j = r-1$ and $m = n-1$, we get:
$$E[b^2(r; n, p)] = np \sum_{j=0}^{m} (j+1)\, C_m^j\, p^j q^{m-j} = np \left[ \sum_{j=0}^{m} j\, C_m^j\, p^j q^{m-j} + \sum_{j=0}^{m} C_m^j\, p^j q^{m-j} \right]$$
$$= np \left[ \sum_{j=0}^{m} j\, \frac{m \cdots (m-j+1)}{j!}\, p^j q^{m-j} + (\underbrace{p+q}_{1})^m \right] = np \left[ \sum_{j=1}^{m} m\, C_{m-1}^{j-1}\, p^j q^{m-j} + 1 \right]$$
$$= np \left[ mp \sum_{j=1}^{m} C_{m-1}^{j-1}\, p^{j-1} q^{(m-1)-(j-1)} + 1 \right] = np\, \left[ (n-1)\, p\, (\underbrace{p+q}_{1})^{m-1} + 1 \right] = np\,[(n-1)p + 1] = n^2p^2 - np^2 + np$$

Finally,
$$Var[X] = E[b^2(r; n, p)] - (E[b(r; n, p)])^2 = n^2p^2 - np^2 + np - n^2p^2 = np(1-p)$$
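The variance formula can be verified the same way, computing $E[X^2] - (E[X])^2$ from the defining sums:

```python
from math import comb

# Check Var[X] = n*p*(1-p) via E[X^2] - (E[X])^2 for the binomial distribution.
def binom_moment(n, p, k):
    return sum(r**k * comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1))

for n, p in [(10, 0.3), (25, 0.5), (7, 0.9)]:
    var = binom_moment(n, p, 2) - binom_moment(n, p, 1) ** 2
    assert abs(var - n * p * (1 - p)) < 1e-9
print("variance = n*p*(1-p) confirmed")
```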
7.
Binomial distribution: calculating the variance
Another solution
• se demonstreaza relativ usor ca orice variabila aleatoare urmanddistributia binomiala b(r;n, p) poate fi vazuta ca o suma de n vari-abile independente care urmeaza distributia Bernoulli de parametrup;a
• stim (sau, se poate dovedi imediat) ca varianta distributiei Bernoullide parametru p este p(1− p);
• tinand cont de proprietatea de liniaritate a variantelor — Var[X1 +X2 . . . + Xn] = Var[X1] + Var[X2] . . . + Var[Xn], daca X1, X2, . . . , Xn suntvariabile independente —, rezulta ca Var[X ] = np(1− p).
aVezi www.proofwiki.org/wiki/Bernoulli Process as Binomial Distribution, care citeaza de asemenea ca sursa “Proba-bility: An Introduction” de Geoffrey Grimmett si Dominic Welsh, Oxford Science Publications, 1986.
8.
The Gaussian distribution: $p(X=x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Calculating the mean: $E[N_{\mu,\sigma}(x)] \stackrel{def.}{=} \int_{-\infty}^{\infty} x\, p(x)\, dx = \dfrac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} x \cdot e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx$

Using the variable transformation $v = \dfrac{x-\mu}{\sigma}$, which implies $x = \sigma v + \mu$ and $dx = \sigma\, dv$:
$$E[X] = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} (\sigma v + \mu)\, e^{-\frac{v^2}{2}}\, (\sigma\, dv) = \frac{1}{\sqrt{2\pi}} \left[ \sigma \int_{-\infty}^{\infty} v\, e^{-\frac{v^2}{2}}\, dv + \mu \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \right]$$
$$= \frac{1}{\sqrt{2\pi}} \left[ -\sigma \int_{-\infty}^{\infty} (-v)\, e^{-\frac{v^2}{2}}\, dv + \mu \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \right] = \frac{1}{\sqrt{2\pi}} \left[ \underbrace{-\sigma\, e^{-\frac{v^2}{2}} \Big|_{-\infty}^{\infty}}_{=0} + \mu \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \right]$$
$$= \frac{\mu}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \quad \text{(see the next slide for the computation of this last integral)}$$
$$= \frac{\mu}{\sqrt{2\pi}} \cdot \sqrt{2\pi} = \mu$$
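The result $E[X] = \mu$ can also be confirmed by brute-force numerical integration; a midpoint-rule sketch (the grid half-width and step count below are arbitrary choices):

```python
from math import exp, pi, sqrt

# Numerically integrate x * N(x; mu, sigma) with a simple midpoint rule
# on a wide grid, to confirm E[X] = mu.
def gauss_mean(mu, sigma, half_width=10.0, steps=100000):
    lo, hi = mu - half_width * sigma, mu + half_width * sigma
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += x * exp(-(x - mu) ** 2 / (2 * sigma**2)) * h
    return total / (sqrt(2 * pi) * sigma)

assert abs(gauss_mean(3.0, 2.0) - 3.0) < 1e-6
print("E[X] = mu confirmed numerically")
```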
9.
The Gaussian distribution: calculating the mean (cont'd)
$$\left( \int_{v=-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \right)^2 = \left( \int_{x=-\infty}^{\infty} e^{-\frac{x^2}{2}}\, dx \right) \cdot \left( \int_{y=-\infty}^{\infty} e^{-\frac{y^2}{2}}\, dy \right) = \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} e^{-\frac{x^2+y^2}{2}}\, dy\, dx = \iint_{\mathbb{R}^2} e^{-\frac{x^2+y^2}{2}}\, dy\, dx$$

By switching from $x, y$ to polar coordinates $r, \theta$ (see the Note below), it follows:
$$\left( \int_{v=-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \right)^2 = \int_{r=0}^{\infty} \int_{\theta=0}^{2\pi} e^{-\frac{r^2}{2}}\, (r\, dr\, d\theta) = \int_{r=0}^{\infty} r e^{-\frac{r^2}{2}} \left( \int_{\theta=0}^{2\pi} d\theta \right) dr = \int_{r=0}^{\infty} r e^{-\frac{r^2}{2}}\, \theta \Big|_0^{2\pi}\, dr$$
$$= 2\pi \int_{r=0}^{\infty} r e^{-\frac{r^2}{2}}\, dr = 2\pi \left( -e^{-\frac{r^2}{2}} \right) \Big|_0^{\infty} = 2\pi\, (0 - (-1)) = 2\pi \;\Rightarrow\; \int_{v=-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv = \sqrt{2\pi}.$$

Note: $x = r\cos\theta$ and $y = r\sin\theta$, with $r \geq 0$ and $\theta \in [0, 2\pi)$. Therefore, $x^2 + y^2 = r^2$, and the Jacobian determinant is
$$\frac{\partial(x,y)}{\partial(r,\theta)} = \begin{vmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \theta} \\[2pt] \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \theta} \end{vmatrix} = \begin{vmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{vmatrix} = r\cos^2\theta + r\sin^2\theta = r \geq 0. \quad \text{So, } dx\, dy = r\, dr\, d\theta.$$
10.
Bivariate Gaussian
11.
The Gaussian distribution: calculating the variance

We will make use of the formula $Var[X] = E[X^2] - E^2[X]$.
$$E[X^2] = \int_{-\infty}^{\infty} x^2 p(x)\, dx = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} x^2 \cdot e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx$$

Again, using the transformation $v = \dfrac{x-\mu}{\sigma}$, which implies $x = \sigma v + \mu$ and $dx = \sigma\, dv$. Therefore,
$$E[X^2] = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} (\sigma v + \mu)^2\, e^{-\frac{v^2}{2}}\, (\sigma\, dv) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} (\sigma^2 v^2 + 2\sigma\mu v + \mu^2)\, e^{-\frac{v^2}{2}}\, dv$$
$$= \frac{1}{\sqrt{2\pi}} \left[ \sigma^2 \int_{-\infty}^{\infty} v^2 e^{-\frac{v^2}{2}}\, dv + 2\sigma\mu \int_{-\infty}^{\infty} v\, e^{-\frac{v^2}{2}}\, dv + \mu^2 \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \right]$$

Note that we have already computed $\int_{-\infty}^{\infty} v\, e^{-\frac{v^2}{2}}\, dv = 0$ and $\int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv = \sqrt{2\pi}$.
12.
The Gaussian distribution: calculating the variance (cont'd)

Therefore, we only need to compute
$$\int_{-\infty}^{\infty} v^2 e^{-\frac{v^2}{2}}\, dv = \int_{-\infty}^{\infty} (-v) \left( -v\, e^{-\frac{v^2}{2}} \right) dv = \int_{-\infty}^{\infty} (-v) \left( e^{-\frac{v^2}{2}} \right)' dv$$
$$= (-v)\, e^{-\frac{v^2}{2}} \Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} (-1)\, e^{-\frac{v^2}{2}}\, dv = 0 + \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv = \sqrt{2\pi}.$$

Here above we used the fact that
$$\lim_{v\to\infty} v\, e^{-\frac{v^2}{2}} = \lim_{v\to\infty} \frac{v}{e^{\frac{v^2}{2}}} \stackrel{\text{l'Hopital}}{=} \lim_{v\to\infty} \frac{1}{v\, e^{\frac{v^2}{2}}} = 0 = \lim_{v\to-\infty} v\, e^{-\frac{v^2}{2}}$$

So, $E[X^2] = \dfrac{1}{\sqrt{2\pi}} \left( \sigma^2 \sqrt{2\pi} + 2\sigma\mu \cdot 0 + \mu^2 \sqrt{2\pi} \right) = \sigma^2 + \mu^2$.

And, finally, $Var[X] = E[X^2] - (E[X])^2 = (\sigma^2 + \mu^2) - \mu^2 = \sigma^2$.
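As with the mean, $Var[X] = \sigma^2$ can be confirmed numerically by integrating $x^2 p(x)$ with a midpoint rule (the grid parameters below are arbitrary choices):

```python
from math import exp, pi, sqrt

# Midpoint-rule integration of x^2 * N(x; mu, sigma) to confirm
# E[X^2] = sigma^2 + mu^2, hence Var[X] = sigma^2.
def gauss_second_moment(mu, sigma, half_width=10.0, steps=100000):
    lo, hi = mu - half_width * sigma, mu + half_width * sigma
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += x * x * exp(-(x - mu) ** 2 / (2 * sigma**2)) * h
    return total / (sqrt(2 * pi) * sigma)

mu, sigma = 1.5, 0.7
m2 = gauss_second_moment(mu, sigma)
assert abs(m2 - (sigma**2 + mu**2)) < 1e-6
print("Var[X] = sigma^2 confirmed numerically")
```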
13.
Vectors of random variables.
A property:
The covariance matrix Σ corresponding to such a vector is symmetric and positive semi-definite
Chuong Do, Stanford University, 2008
[adapted by Liviu Ciortuz]
14.
Let $X_1, \ldots, X_n$ be random variables, with $X_i : \Omega \to \mathbb{R}$ for $i = 1, \ldots, n$. The covariance matrix of the vector of random variables $X = (X_1, \ldots, X_n)$ is a square matrix of size $n \times n$, whose elements are defined as $[Cov(X)]_{ij} \stackrel{def.}{=} Cov(X_i, X_j)$, for all $i, j \in \{1, \ldots, n\}$.

Show that $\Sigma \stackrel{not.}{=} Cov(X)$ is a symmetric and positive semi-definite matrix, the latter property meaning that for any vector $z \in \mathbb{R}^n$ the inequality $z^\top \Sigma z \geq 0$ holds. (The vectors $z \in \mathbb{R}^n$ are considered column vectors, and the symbol $\top$ denotes matrix transposition.)
15.
$[Cov(X)]_{ij} \stackrel{def.}{=} Cov(X_i, X_j)$ for all $i, j \in \{1, \ldots, n\}$, and
$$Cov(X_i, X_j) \stackrel{def.}{=} E[(X_i - E[X_i])(X_j - E[X_j])] = E[(X_j - E[X_j])(X_i - E[X_i])] = Cov(X_j, X_i),$$
therefore $Cov(X)$ is a symmetric matrix.

We will show that $z^T \Sigma z \geq 0$ for any $z \in \mathbb{R}^n$ (seen as a column vector):
$$z^T \Sigma z = \sum_{i=1}^{n} z_i \left( \sum_{j=1}^{n} \Sigma_{ij} z_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{n} z_i\, \Sigma_{ij}\, z_j = \sum_{i=1}^{n} \sum_{j=1}^{n} z_i\, Cov[X_i, X_j]\, z_j$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{n} z_i\, E[(X_i - E[X_i])(X_j - E[X_j])]\, z_j = E\left[ \sum_{i=1}^{n} \sum_{j=1}^{n} z_i\, (X_i - E[X_i])(X_j - E[X_j])\, z_j \right]$$
$$= E\left[ \left( \sum_{i=1}^{n} (X_i - E[X_i])\, z_i \right) \left( \sum_{j=1}^{n} (X_j - E[X_j])\, z_j \right) \right] = E\left[ \left( (X - E[X])^T \cdot z \right)^2 \right] \geq 0$$
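The two properties proved above can be observed on an empirical covariance matrix; a small sketch with deliberately correlated, hand-picked sample coordinates (the dimensions and sample sizes are arbitrary):

```python
import random

random.seed(42)

# Draw samples of a random vector X = (X1, X2, X3), estimate its covariance
# matrix, and check symmetry and z^T Sigma z >= 0 for many random z.
n, dim = 5000, 3
samples = []
for _ in range(n):
    u = random.gauss(0, 1)
    samples.append((u, 2 * u + random.gauss(0, 1), random.uniform(0, 1)))

means = [sum(s[i] for s in samples) / n for i in range(dim)]
Sigma = [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / n
          for j in range(dim)] for i in range(dim)]

# symmetry
assert all(abs(Sigma[i][j] - Sigma[j][i]) < 1e-12
           for i in range(dim) for j in range(dim))

# z^T Sigma z >= 0 for random z (up to floating-point error)
for _ in range(1000):
    z = [random.uniform(-1, 1) for _ in range(dim)]
    quad = sum(z[i] * Sigma[i][j] * z[j] for i in range(dim) for j in range(dim))
    assert quad >= -1e-9
print("empirical covariance matrix is symmetric and PSD")
```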
16.
Multi-variate Gaussian distributions:
A property:
When the covariance matrix of a multi-variate (d-dimensional) Gaussian
distribution is diagonal, then the p.d.f. (probability density function) of
the respective multi-variate Gaussian is equal to the product of d
independent uni-variate Gaussian densities.
Chuong Do, Stanford University, 2008
[adapted by Liviu Ciortuz]
17.
Let's consider $X = [X_1 \ldots X_d]^T$, $\mu \in \mathbb{R}^d$ and $\Sigma \in S_+^d$, where $S_+^d$ is the set of symmetric positive definite matrices (which implies $|\Sigma| \neq 0$ and $(x-\mu)^T \Sigma^{-1} (x-\mu) > 0$, therefore $-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) < 0$, for any $x \in \mathbb{R}^d$, $x \neq \mu$).

The probability density function of a multi-variate Gaussian distribution of parameters $\mu$ and $\Sigma$ is:
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)$$

Notation: $X \sim \mathcal{N}(\mu, \Sigma)$.

Show that when the covariance matrix $\Sigma$ is diagonal, the p.d.f. (probability density function) of the respective multi-variate Gaussian is equal to the product of $d$ independent uni-variate Gaussian densities.

We will make the proof for $d = 2$ (the generalization to $d > 2$ is easy):
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \qquad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \qquad \Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}$$

Note: It is easy to show that if $\Sigma \in S_+^d$ is diagonal, the elements on the principal diagonal of $\Sigma$ are indeed strictly positive. (It is enough to take $z = (1, 0)$ and respectively $z = (0, 1)$ in the positive-definiteness condition for $\Sigma$.) This is why we wrote these diagonal elements as $\sigma_1^2$ and $\sigma_2^2$.
18.
19.
$$p(x; \mu, \Sigma) = \frac{1}{2\pi \begin{vmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{vmatrix}^{\frac{1}{2}}} \exp\left( -\frac{1}{2} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix}^T \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}^{-1} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix} \right)$$
$$= \frac{1}{2\pi\, \sigma_1 \sigma_2} \exp\left( -\frac{1}{2} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix}^T \begin{bmatrix} \frac{1}{\sigma_1^2} & 0 \\ 0 & \frac{1}{\sigma_2^2} \end{bmatrix} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix} \right)$$
$$= \frac{1}{2\pi\, \sigma_1 \sigma_2} \exp\left( -\frac{1}{2} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix}^T \begin{bmatrix} \frac{1}{\sigma_1^2}(x_1-\mu_1) \\ \frac{1}{\sigma_2^2}(x_2-\mu_2) \end{bmatrix} \right)$$
$$= \frac{1}{2\pi\, \sigma_1 \sigma_2} \exp\left( -\frac{1}{2\sigma_1^2}(x_1-\mu_1)^2 - \frac{1}{2\sigma_2^2}(x_2-\mu_2)^2 \right) = p(x_1; \mu_1, \sigma_1^2)\; p(x_2; \mu_2, \sigma_2^2).$$
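The factorization can be checked pointwise for $d = 2$ (the particular means and variances below are arbitrary illustrative values):

```python
from math import exp, pi, sqrt

# For a diagonal covariance, check pointwise that the bivariate Gaussian pdf
# equals the product of the two univariate pdfs.
def uni_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def bi_pdf_diag(x1, x2, mu1, mu2, var1, var2):
    det = var1 * var2
    quad = (x1 - mu1) ** 2 / var1 + (x2 - mu2) ** 2 / var2
    return exp(-0.5 * quad) / (2 * pi * sqrt(det))

mu1, mu2, var1, var2 = 0.5, -1.0, 2.0, 0.5
for x1 in [-2.0, 0.0, 1.3]:
    for x2 in [-1.5, 0.2, 3.0]:
        lhs = bi_pdf_diag(x1, x2, mu1, mu2, var1, var2)
        rhs = uni_pdf(x1, mu1, var1) * uni_pdf(x2, mu2, var2)
        assert abs(lhs - rhs) < 1e-12
print("diagonal-covariance pdf factorizes")
```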
20.
Bi-variate Gaussian distributions. A property:
The conditional distributions X1|X2 and X2|X1 are also Gaussian.
The calculation of their parameters
Duda, Hart and Stork, Pattern Classification, 2001,Appendix A.5.2
[adapted by Liviu Ciortuz]
21.
Let $X$ be a random variable following a bi-variate Gaussian distribution of parameters $\mu$ (the mean vector) and $\Sigma$ (the covariance matrix). So, $\mu = (\mu_1, \mu_2) \in \mathbb{R}^2$, and $\Sigma \in M_{2\times 2}(\mathbb{R})$.

By definition, $\Sigma = Cov(X, X)$, where $X \stackrel{not.}{=} (X_1, X_2)$, so $\Sigma_{ij} = Cov(X_i, X_j)$ for $i, j \in \{1, 2\}$. Also, $Cov(X_i, X_i) = Var[X_i] \stackrel{not.}{=} \sigma_i^2 \geq 0$ for $i \in \{1, 2\}$, while for $i \neq j$ we have $Cov(X_i, X_j) = Cov(X_j, X_i) \stackrel{not.}{=} \sigma_{ij}$.

Finally, if we introduce the "correlation coefficient" $\rho \stackrel{def.}{=} \dfrac{\sigma_{12}}{\sigma_1 \sigma_2}$, it follows that we can write the covariance matrix as:
$$\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}. \qquad (2)$$
22.
Prove that the hypothesis $X \sim \mathcal{N}(\mu, \Sigma)$ implies that the conditional distribution $X_2|X_1$ is Gaussian, namely
$$X_2|X_1 = x_1 \sim \mathcal{N}(\mu_{2|1}, \sigma_{2|1}^2),$$
with $\mu_{2|1} = \mu_2 + \rho\dfrac{\sigma_2}{\sigma_1}(x_1 - \mu_1)$ and $\sigma_{2|1}^2 = \sigma_2^2(1 - \rho^2)$.

Remark: For $X_1|X_2$ the result is similar: $X_1|X_2 = x_2 \sim \mathcal{N}(\mu_{1|2}, \sigma_{1|2}^2)$, with $\mu_{1|2} = \mu_1 + \rho\dfrac{\sigma_1}{\sigma_2}(x_2 - \mu_2)$ and $\sigma_{1|2}^2 = \sigma_1^2(1 - \rho^2)$.

Source:
Pattern Classification, Appendix A.5.2,
Duda, Hart and Stork, 2001
23.
Answer
$$p_{X_2|X_1}(x_2|x_1) \stackrel{def.}{=} \frac{p_{X_1,X_2}(x_1, x_2)}{p_{X_1}(x_1)}, \qquad (3)$$
where
$$p_{X_1,X_2}(x_1, x_2) = \frac{1}{(\sqrt{2\pi})^2 \sqrt{|\Sigma|}} \exp\left( -\frac{1}{2} (x-\mu)^\top \Sigma^{-1} (x-\mu) \right)$$
and
$$p_{X_1}(x_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left( -\frac{1}{2\sigma_1^2} (x_1 - \mu_1)^2 \right). \qquad (4)$$

From (2) it follows that $|\Sigma| = \sigma_1^2\sigma_2^2(1-\rho^2)$. In order that $\sqrt{|\Sigma|}$ and $\Sigma^{-1}$ be defined, it follows that $\rho \in (-1, 1)$. Moreover, since $\sigma_1, \sigma_2 > 0$, we will have $\sqrt{|\Sigma|} = \sigma_1\sigma_2\sqrt{1-\rho^2}$.
$$\Sigma^{-1} = \frac{1}{\sigma_1^2\sigma_2^2(1-\rho^2)}\, \Sigma^* = \frac{1}{\sigma_1^2\sigma_2^2(1-\rho^2)} \begin{bmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{bmatrix} = \frac{1}{1-\rho^2} \begin{bmatrix} \frac{1}{\sigma_1^2} & -\frac{\rho}{\sigma_1\sigma_2} \\[4pt] -\frac{\rho}{\sigma_1\sigma_2} & \frac{1}{\sigma_2^2} \end{bmatrix}$$
24.
So,
$$p_{X_1,X_2}(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)}\, (x_1-\mu_1,\; x_2-\mu_2) \begin{bmatrix} \frac{1}{\sigma_1^2} & -\frac{\rho}{\sigma_1\sigma_2} \\[4pt] -\frac{\rho}{\sigma_1\sigma_2} & \frac{1}{\sigma_2^2} \end{bmatrix} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix} \right)$$
$$= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \cdot \exp\left( -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \right] \right) \qquad (5)$$
25.
By substituting (4) and (5) in the definition (3), we get:
$$p(x_2|x_1) = \frac{p_{X_1,X_2}(x_1, x_2)}{p_{X_1}(x_1)}$$
$$= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \cdot \exp\left( -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \right] \right) \cdot \sqrt{2\pi}\,\sigma_1 \exp\left( \frac{1}{2}\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 \right)$$
$$= \frac{1}{\sqrt{2\pi}\,\sigma_2\sqrt{1-\rho^2}} \exp\left[ -\frac{1}{2(1-\rho^2)} \left( \frac{x_2-\mu_2}{\sigma_2} - \rho\,\frac{x_1-\mu_1}{\sigma_1} \right)^2 \right]$$
$$= \frac{1}{\sqrt{2\pi}\,\sigma_2\sqrt{1-\rho^2}} \exp\left[ -\frac{1}{2} \left( \frac{x_2 - \left[\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1)\right]}{\sigma_2\sqrt{1-\rho^2}} \right)^2 \right]$$

Therefore,
$$X_2|X_1 = x_1 \sim \mathcal{N}(\mu_{2|1}, \sigma_{2|1}^2) \text{ with } \mu_{2|1} \stackrel{not.}{=} \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1) \text{ and } \sigma_{2|1}^2 \stackrel{not.}{=} \sigma_2^2(1-\rho^2).$$
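The result can be verified pointwise: dividing the joint density by the marginal of $X_1$ must reproduce the univariate Gaussian with the stated parameters (the parameter values below are arbitrary illustrative choices):

```python
from math import exp, pi, sqrt

# Check that p(x1, x2) / p(x1) equals the univariate Gaussian density with
# mean mu2 + rho*(s2/s1)*(x1 - mu1) and variance s2^2*(1 - rho^2).
def uni_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def joint_pdf(x1, x2, mu1, mu2, s1, s2, rho):
    z = ((x1 - mu1) / s1) ** 2 \
        - 2 * rho * ((x1 - mu1) / s1) * ((x2 - mu2) / s2) \
        + ((x2 - mu2) / s2) ** 2
    return exp(-z / (2 * (1 - rho**2))) / (2 * pi * s1 * s2 * sqrt(1 - rho**2))

mu1, mu2, s1, s2, rho = 1.0, -0.5, 1.5, 0.8, 0.6
for x1 in [-1.0, 0.7, 2.2]:
    for x2 in [-2.0, 0.0, 1.1]:
        cond = joint_pdf(x1, x2, mu1, mu2, s1, s2, rho) / uni_pdf(x1, mu1, s1**2)
        mu_c = mu2 + rho * (s2 / s1) * (x1 - mu1)
        var_c = s2**2 * (1 - rho**2)
        assert abs(cond - uni_pdf(x2, mu_c, var_c)) < 1e-12
print("conditional X2|X1 is Gaussian with the stated parameters")
```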
26.
Using the Central Limit Theorem (the i.i.d. version)
to compute the real error of a classifier
CMU, 2008 fall, Eric Xing, HW3, pr. 3.3
27.
Chris recently adopted a new (binary) classifier to filter email spam. He wants to quantitatively evaluate how good the classifier is.

He has a small dataset of 100 emails on hand which, you can assume, are randomly drawn from all emails.

He tests the classifier on the 100 emails and gets 83 classified correctly, so the error rate on the small dataset is 17%.

However, the number obtained on 100 samples could be either higher or lower than the real error rate just by chance.

With a confidence level of 95%, what is likely to be the range of the real error rate? Please write down all important steps.

(Hint: You need some approximation in this problem.)
28.
Notations:

Let $X_i$, $i = 1, \ldots, n = 100$ be defined as: $X_i = 1$ if email $i$ was incorrectly classified, and 0 otherwise;
$$E[X_i] \stackrel{not.}{=} \mu \stackrel{not.}{=} e_{real}; \qquad Var(X_i) \stackrel{not.}{=} \sigma^2; \qquad e_{sample} \stackrel{not.}{=} \frac{X_1 + \ldots + X_n}{n} = 0.17$$
$$Z_n = \frac{X_1 + \ldots + X_n - n\mu}{\sqrt{n}\,\sigma} \quad \text{(the standardized form of } X_1 + \ldots + X_n\text{)}$$

Key insight:
Calculating the real error of the classifier (more exactly, a symmetric interval around the real error $p \stackrel{not.}{=} \mu$) with a "confidence" of 95% amounts to finding $a > 0$ such that $P(|Z_n| \leq a) \geq 0.95$.
29.
Calculus:
$$|Z_n| \leq a \Leftrightarrow \left| \frac{X_1 + \ldots + X_n - n\mu}{\sqrt{n}\,\sigma} \right| \leq a \Leftrightarrow \left| \frac{X_1 + \ldots + X_n - n\mu}{n\sigma} \right| \leq \frac{a}{\sqrt{n}}$$
$$\Leftrightarrow \left| \frac{X_1 + \ldots + X_n - n\mu}{n} \right| \leq \frac{a\sigma}{\sqrt{n}} \Leftrightarrow \left| \frac{X_1 + \ldots + X_n}{n} - \mu \right| \leq \frac{a\sigma}{\sqrt{n}}$$
$$\Leftrightarrow |e_{sample} - e_{real}| \leq \frac{a\sigma}{\sqrt{n}} \Leftrightarrow |e_{real} - e_{sample}| \leq \frac{a\sigma}{\sqrt{n}}$$
$$\Leftrightarrow -\frac{a\sigma}{\sqrt{n}} \leq e_{real} - e_{sample} \leq \frac{a\sigma}{\sqrt{n}} \Leftrightarrow e_{sample} - \frac{a\sigma}{\sqrt{n}} \leq e_{real} \leq e_{sample} + \frac{a\sigma}{\sqrt{n}}$$
$$\Leftrightarrow e_{real} \in \left[ e_{sample} - \frac{a\sigma}{\sqrt{n}},\; e_{sample} + \frac{a\sigma}{\sqrt{n}} \right]$$
30.
Important facts:

The Central Limit Theorem: $Z_n \to \mathcal{N}(0, 1)$

Therefore, $P(|Z_n| \leq a) \approx P(|X| \leq a) = \Phi(a) - \Phi(-a)$, where $X \sim \mathcal{N}(0, 1)$ and $\Phi$ is the cumulative distribution function of $\mathcal{N}(0, 1)$.

Calculus:
$$\Phi(-a) + \Phi(a) = 1 \Rightarrow P(|Z_n| \leq a) = \Phi(a) - \Phi(-a) = 2\Phi(a) - 1$$
$$P(|Z_n| \leq a) = 0.95 \Leftrightarrow 2\Phi(a) - 1 = 0.95 \Leftrightarrow \Phi(a) = 0.975 \Leftrightarrow a \cong 1.96 \ (\text{see the } \Phi \text{ table})$$

$\sigma^2 \stackrel{not.}{=} Var_{real} = e_{real}(1 - e_{real})$, because the $X_i$ are Bernoulli variables.

Furthermore, we can approximate $e_{real}$ with $e_{sample}$, because $E[e_{sample}] = e_{real}$ and $Var_{sample} = \frac{1}{n} Var_{real} \to 0$ for $n \to +\infty$, cf. CMU, 2011 fall, T. Mitchell, A. Singh, HW2, pr. 1.ab.

Finally:
$$\frac{a\sigma}{\sqrt{n}} \approx 1.96 \cdot \frac{\sqrt{0.17\,(1-0.17)}}{\sqrt{100}} \cong 0.07$$
$$|e_{real} - e_{sample}| \leq 0.07 \Leftrightarrow |e_{real} - 0.17| \leq 0.07 \Leftrightarrow -0.07 \leq e_{real} - 0.17 \leq 0.07 \Leftrightarrow e_{real} \in [0.10, 0.24]$$
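The whole computation fits in a few lines, using $a = 1.96$, the 0.975 quantile of $\mathcal{N}(0,1)$ (the table value is sometimes rounded to 1.97; both give the same interval to two decimals):

```python
from math import sqrt

# 95% interval: e_sample ± a * sqrt(e(1-e)) / sqrt(n),
# approximating the unknown e_real by e_sample inside the square root.
n, e_sample, a = 100, 0.17, 1.96
half_width = a * sqrt(e_sample * (1 - e_sample)) / sqrt(n)
lo, hi = e_sample - half_width, e_sample + half_width
print(f"e_real in [{lo:.2f}, {hi:.2f}]")  # -> e_real in [0.10, 0.24]
```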
31.
Exemplifying
a mixture of categorical distributions;
how to compute its expectation and variance
CMU, 2010 fall, Aarti Singh, HW1, pr. 2.2.1-2
32.
Suppose that I have two six-sided dice; one is fair and the other one is loaded, having:
$$P(x) = \begin{cases} \dfrac{1}{2} & x = 6 \\[6pt] \dfrac{1}{10} & x \in \{1, 2, 3, 4, 5\} \end{cases}$$

I will toss a coin to decide which die to roll. If the coin flip is heads I will roll the fair die, otherwise the loaded one. The probability that the coin flip is heads is $p \in (0, 1)$.

a. What is the expectation of the die roll (in terms of $p$)?

b. What is the variance of the die roll (in terms of $p$)?
33.
Solution:

a.
$$E[X] = \sum_{i=1}^{6} i \cdot [P(i|\text{fair}) \cdot p + P(i|\text{loaded}) \cdot (1-p)]$$
$$= \left[ \sum_{i=1}^{6} i \cdot P(i|\text{fair}) \right] p + \left[ \sum_{i=1}^{6} i \cdot P(i|\text{loaded}) \right] (1-p) = \frac{7}{2}\, p + \frac{9}{2}\, (1-p) = \frac{9}{2} - p$$
34.
b. Recall that we may write $Var(X) = E[X^2] - (E[X])^2$, therefore:
$$E[X^2] = \sum_{i=1}^{6} i^2 \cdot [P(i|\text{fair}) \cdot p + P(i|\text{loaded}) \cdot (1-p)]$$
$$= \left[ \sum_{i=1}^{6} i^2 \cdot P(i|\text{fair}) \right] p + \left[ \sum_{i=1}^{6} i^2 \cdot P(i|\text{loaded}) \right] (1-p)$$
$$= \frac{91}{6}\, p + \left( \frac{36}{2} + \frac{55}{10} \right) (1-p) = \frac{47}{2} - \frac{25}{3}\, p$$

Combining this with the result of the previous question yields:
$$Var(X) = E[X^2] - (E[X])^2 = \frac{141}{6} - \frac{50}{6}\, p - \left( \frac{9}{2} - p \right)^2 = \frac{141}{6} - \frac{50}{6}\, p - \left( \frac{81}{4} - 9p + p^2 \right)$$
$$= \left( \frac{141}{6} - \frac{81}{4} \right) - \left( \frac{50}{6} - 9 \right) p - p^2 = \frac{13}{4} + \frac{2}{3}\, p - p^2$$
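Both closed forms can be checked exactly with rational arithmetic; a sketch using Python's `fractions` module (the sample values of $p$ are arbitrary):

```python
from fractions import Fraction as F

# Exact check of E[X] = 9/2 - p and Var[X] = 13/4 + (2/3)p - p^2
# for the fair/loaded dice mixture.
fair = {i: F(1, 6) for i in range(1, 7)}
loaded = {i: F(1, 10) for i in range(1, 6)}
loaded[6] = F(1, 2)

def moment(p, k):
    return sum(i**k * (fair[i] * p + loaded[i] * (1 - p)) for i in range(1, 7))

for p in [F(1, 4), F(1, 2), F(9, 10)]:
    assert moment(p, 1) == F(9, 2) - p
    assert moment(p, 2) - moment(p, 1) ** 2 == F(13, 4) + F(2, 3) * p - p**2
print("closed forms for mean and variance confirmed")
```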
35.
Elements of Information Theory: some examples and then some useful proofs
36.
Computing entropies and specific conditional entropies
for discrete random variables
CMU, 2012 spring, R. Rosenfeld, HW2, pr. 2
37.
On the roll of two six-sided fair dice,

a. Calculate the distribution of the sum ($S$) of the total.

b. The amount of information (or surprise) when seeing the outcome $x$ for a random variable $X$ is defined as $\log_2 \frac{1}{P(X=x)} = -\log_2 P(X=x)$. How surprised are you (in bits) to observe $S = 2$, $S = 11$, $S = 5$, $S = 7$?

c. Calculate the entropy of $S$ [as the expected value of the random variable $-\log_2 P(S=s)$].

d. Let's say you throw the dice one by one, and the first die shows 4. What is the entropy of $S$ after this observation? Was any information gained or lost in the process? If so, calculate how much information (in bits) was lost or gained.
38.
a.

S      2     3     4     5     6     7     8     9     10    11    12
P(S)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

b.

Information(S = 2) = $-\log_2(1/36) = \log_2 36 = 2\log_2 6 = 2(1 + \log_2 3)$ = 5.169925001 bits
Information(S = 11) = $-\log_2(2/36) = \log_2 18 = 1 + 2\log_2 3$ = 4.169925001 bits
Information(S = 5) = $-\log_2(4/36) = \log_2 9 = 2\log_2 3$ = 3.169925001 bits
Information(S = 7) = $-\log_2(6/36) = \log_2 6 = 1 + \log_2 3$ = 2.584962501 bits
39.
c.
$$H(S) = -\sum_i p_i \log_2 p_i$$
$$= -\left( 2 \cdot \frac{1}{36} \log_2 \frac{1}{36} + 2 \cdot \frac{2}{36} \log_2 \frac{2}{36} + 2 \cdot \frac{3}{36} \log_2 \frac{3}{36} + 2 \cdot \frac{4}{36} \log_2 \frac{4}{36} + 2 \cdot \frac{5}{36} \log_2 \frac{5}{36} + \frac{6}{36} \log_2 \frac{6}{36} \right)$$
$$= \frac{1}{36} \left( 2 \log_2 36 + 4 \log_2 18 + 6 \log_2 12 + 8 \log_2 9 + 10 \log_2 \frac{36}{5} + 6 \log_2 6 \right)$$
$$= \frac{1}{36} \left( 2 \log_2 6^2 + 4 \log_2 (6 \cdot 3) + 6 \log_2 (6 \cdot 2) + 8 \log_2 3^2 + 10 \log_2 \frac{6^2}{5} + 6 \log_2 6 \right)$$
$$= \frac{1}{36} \left( 40 \log_2 6 + 20 \log_2 3 + 6 - 10 \log_2 5 \right) = \frac{1}{36} \left( 60 \log_2 3 + 46 - 10 \log_2 5 \right) = 3.274401919 \text{ bits.}$$
40.
d.

S                     2  3  4  5    6    7    8    9    10   11  12
P(S | first die = 4)  0  0  0  1/6  1/6  1/6  1/6  1/6  1/6  0   0

$$H(S\,|\,\text{First-die-shows-4}) = -6 \cdot \frac{1}{6} \log_2 \frac{1}{6} = \log_2 6 = 2.58 \text{ bits},$$
$$IG(S; \text{First-die-shows-4}) = H(S) - H(S\,|\,\text{First-die-shows-4}) = 3.27 - 2.58 = 0.69 \text{ bits.}$$
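Parts a, c and d above can be reproduced in a few lines:

```python
from math import log2
from itertools import product

# Distribution of S = sum of two fair dice, its entropy, and the entropy
# remaining after observing that the first die shows 4.
counts = {}
for d1, d2 in product(range(1, 7), repeat=2):
    counts[d1 + d2] = counts.get(d1 + d2, 0) + 1
P = {s: c / 36 for s, c in counts.items()}

H_S = -sum(p * log2(p) for p in P.values())
H_S_given_4 = log2(6)  # S is uniform on {5, ..., 10} once the first die shows 4

assert abs(H_S - 3.2744) < 1e-3
assert abs(H_S - H_S_given_4 - 0.69) < 5e-3
print(f"H(S) = {H_S:.4f} bits, information gained = {H_S - H_S_given_4:.2f} bits")
```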
41.
Computing entropies and mean conditional entropies
for discrete random variables
CMU, 2012 spring, R. Rosenfeld, HW2, pr. 3
42.
A doctor needs to diagnose a person having a cold (C). The primary factor he considers in his diagnosis is the outside temperature (T). The random variable C takes two values, yes / no, and the random variable T takes 3 values: sunny, rainy, snowy. The joint distribution of the two variables is given in the following table.

          T = sunny   T = rainy   T = snowy
C = no    0.30        0.20        0.10
C = yes   0.05        0.15        0.20
a. Calculate the marginal probabilities P (C), P (T ).
Hint: Use $P(X=x) = \sum_y P(X=x, Y=y)$. For example,
$$P(C=\text{no}) = P(C=\text{no}, T=\text{sunny}) + P(C=\text{no}, T=\text{rainy}) + P(C=\text{no}, T=\text{snowy}).$$
b. Calculate the entropies H(C), H(T ).
c. Calculate the mean conditional entropies H(C|T ), H(T |C).
43.
a. $P_C = (0.6, 0.4)$ and $P_T = (0.35, 0.35, 0.30)$.

b.
$$H(C) = 0.6 \log \frac{5}{3} + 0.4 \log \frac{5}{2} = \log 5 - 0.6 \log 3 - 0.4 = 0.971 \text{ bits}$$
$$H(T) = 2 \cdot 0.35 \log \frac{20}{7} + 0.3 \log \frac{10}{3} = 0.7\,(2 + \log 5 - \log 7) + 0.3\,(1 + \log 5 - \log 3)$$
$$= 1.7 + \log 5 - 0.7 \log 7 - 0.3 \log 3 = 1.581 \text{ bits.}$$
44.
c.
$$H(C|T) \stackrel{def.}{=} \sum_{t \in Val(T)} P(T=t) \cdot H(C|T=t)$$
$$= P(T=\text{sunny}) \cdot H(C|T=\text{sunny}) + P(T=\text{rainy}) \cdot H(C|T=\text{rainy}) + P(T=\text{snowy}) \cdot H(C|T=\text{snowy})$$
$$= 0.35 \cdot H\left( \frac{0.30}{0.30 + 0.05}, \frac{0.05}{0.30 + 0.05} \right) + 0.35 \cdot H\left( \frac{0.20}{0.20 + 0.15}, \frac{0.15}{0.20 + 0.15} \right) + 0.30 \cdot H\left( \frac{0.10}{0.10 + 0.20}, \frac{0.20}{0.10 + 0.20} \right)$$
$$= \frac{7}{20} \cdot H\left( \frac{6}{7}, \frac{1}{7} \right) + \frac{7}{20} \cdot H\left( \frac{4}{7}, \frac{3}{7} \right) + \frac{3}{10} \cdot H\left( \frac{1}{3}, \frac{2}{3} \right)$$
$$= \frac{7}{20} \left( \frac{6}{7} \log \frac{7}{6} + \frac{1}{7} \log 7 \right) + \frac{7}{20} \left( \frac{4}{7} \log \frac{7}{4} + \frac{3}{7} \log \frac{7}{3} \right) + \frac{3}{10} \left( \frac{1}{3} \log 3 + \frac{2}{3} \log \frac{3}{2} \right)$$
$$= \frac{7}{20} \left( \log 7 - \frac{6}{7} - \frac{6}{7} \log 3 \right) + \frac{7}{20} \left( \log 7 - \frac{8}{7} - \frac{3}{7} \log 3 \right) + \frac{3}{10} \left( \log 3 - \frac{2}{3} \right)$$
$$= \frac{7}{10} \log 7 - \left( \frac{3}{10} + \frac{4}{10} + \frac{2}{10} \right) - \left( \frac{6}{20} + \frac{3}{20} - \frac{3}{10} \right) \log 3 = \frac{7}{10} \log 7 - \frac{3}{20} \log 3 - \frac{9}{10} = 0.82740 \text{ bits.}$$
45.
$$H(T|C) \stackrel{def.}{=} \sum_{c \in Val(C)} P(C=c) \cdot H(T|C=c) = P(C=\text{no}) \cdot H(T|C=\text{no}) + P(C=\text{yes}) \cdot H(T|C=\text{yes})$$
$$= 0.60 \cdot H\left( \frac{0.30}{0.30 + 0.20 + 0.10}, \frac{0.20}{0.30 + 0.20 + 0.10}, \frac{0.10}{0.30 + 0.20 + 0.10} \right) + 0.40 \cdot H\left( \frac{0.05}{0.05 + 0.15 + 0.20}, \frac{0.15}{0.05 + 0.15 + 0.20}, \frac{0.20}{0.05 + 0.15 + 0.20} \right)$$
$$= \frac{3}{5} \cdot H\left( \frac{1}{2}, \frac{1}{3}, \frac{1}{6} \right) + \frac{2}{5} \cdot H\left( \frac{1}{8}, \frac{3}{8}, \frac{1}{2} \right)$$
$$= \frac{3}{5} \left( \frac{1}{2} + \frac{1}{3} \log 3 + \frac{1}{6}(1 + \log 3) \right) + \frac{2}{5} \left( \frac{1}{8} \cdot 3 + \frac{3}{8}(3 - \log 3) + \frac{1}{2} \right)$$
$$= \frac{3}{5} \left( \frac{2}{3} + \frac{1}{2} \log 3 \right) + \frac{2}{5} \left( 2 - \frac{3}{8} \log 3 \right) = \frac{6}{5} + \frac{3}{20} \log 3 = 1.43774 \text{ bits.}$$
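These conditional entropies can be cross-checked numerically via the identity $H(Y|X) = H(X,Y) - H(X)$; the computation gives $H(C|T) \approx 0.82740$ bits:

```python
from math import log2

# Joint table P(C, T) from the slide; compute H(C), H(T), H(C|T), H(T|C).
joint = {("no", "sunny"): 0.30, ("no", "rainy"): 0.20, ("no", "snowy"): 0.10,
         ("yes", "sunny"): 0.05, ("yes", "rainy"): 0.15, ("yes", "snowy"): 0.20}

P_C = {c: sum(v for (c2, t), v in joint.items() if c2 == c)
       for c in ("no", "yes")}
P_T = {t: sum(v for (c, t2), v in joint.items() if t2 == t)
       for t in ("sunny", "rainy", "snowy")}

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

H_joint = H(joint)
H_C_given_T = H_joint - H(P_T)   # H(C|T) = H(C,T) - H(T)
H_T_given_C = H_joint - H(P_C)   # H(T|C) = H(C,T) - H(C)

assert abs(H(P_C) - 0.971) < 1e-3
assert abs(H(P_T) - 1.581) < 1e-3
assert abs(H_T_given_C - 1.4377) < 1e-3
print(f"H(C|T) = {H_C_given_T:.5f} bits, H(T|C) = {H_T_given_C:.5f} bits")
```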
46.
Computing the entropy of the exponential distribution
CMU, 2011 spring, R. Rosenfeld,
HW2, pr. 2.c
[Figure: the exponential p.d.f. p(x), plotted for x in [0, 5] and p(x) in [0, 1.5], for θ = 0.5, θ = 1 and θ = 1.5.]
47.
For a continuous probability distribution $P$, the entropy is defined as follows:
$$H(P) = \int_{-\infty}^{+\infty} P(x) \log_2 \frac{1}{P(x)}\, dx$$

Calculate the entropy of the continuous exponential distribution of parameter $\lambda > 0$. The definition of this distribution is the following:
$$P(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } x \geq 0; \\ 0, & \text{if } x < 0. \end{cases}$$

Hint: Whenever $P(x) = 0$, you shall assume that $-P(x) \log_2 P(x) = 0$.
48.
Answer
$$H(P) = \int_{-\infty}^{0} P(x) \log_2 \frac{1}{P(x)}\, dx + \int_{0}^{\infty} P(x) \log_2 \frac{1}{P(x)}\, dx \stackrel{def.\,P}{=} \underbrace{\int_{-\infty}^{0} 0 \cdot \log_2 0\, dx}_{0} + \int_{0}^{\infty} \lambda e^{-\lambda x} \log_2 \frac{1}{\lambda e^{-\lambda x}}\, dx$$
$$\Rightarrow H(P) = \frac{1}{\ln 2} \int_{0}^{\infty} \lambda e^{-\lambda x} \ln \frac{1}{\lambda e^{-\lambda x}}\, dx = \frac{1}{\ln 2} \int_{0}^{\infty} \lambda e^{-\lambda x} \left( \ln \frac{1}{\lambda} + \ln \frac{1}{e^{-\lambda x}} \right) dx$$
$$= \frac{1}{\ln 2} \int_{0}^{\infty} \lambda e^{-\lambda x} \left( -\ln\lambda + \ln e^{\lambda x} \right) dx = \frac{1}{\ln 2} \int_{0}^{\infty} \lambda e^{-\lambda x} \left( -\ln\lambda + \lambda x \right) dx$$
$$= \frac{1}{\ln 2} \int_{0}^{\infty} \lambda e^{-\lambda x} (-\ln\lambda)\, dx + \frac{1}{\ln 2} \int_{0}^{\infty} \lambda e^{-\lambda x}\, \lambda x\, dx = \frac{-\ln\lambda}{\ln 2} \int_{0}^{\infty} \lambda e^{-\lambda x}\, dx + \frac{\lambda}{\ln 2} \int_{0}^{\infty} \lambda e^{-\lambda x}\, x\, dx$$
$$= \frac{\ln\lambda}{\ln 2} \int_{0}^{\infty} \left( e^{-\lambda x} \right)' dx - \frac{\lambda}{\ln 2} \int_{0}^{\infty} \left( e^{-\lambda x} \right)' x\, dx$$
49.
The first integral is solved very easily:
$$\int_{0}^{\infty} \left( e^{-\lambda x} \right)' dx = e^{-\lambda x} \Big|_0^{\infty} = e^{-\infty} - e^0 = 0 - 1 = -1$$

To solve the second integral one can use the integration-by-parts formula:
$$\int_{0}^{\infty} \left( e^{-\lambda x} \right)' x\, dx = e^{-\lambda x} x \Big|_0^{\infty} - \int_{0}^{\infty} e^{-\lambda x}\, x'\, dx = e^{-\lambda x} x \Big|_0^{\infty} - \int_{0}^{\infty} e^{-\lambda x}\, dx$$

The term $e^{-\lambda x} x \big|_0^{\infty}$ cannot be evaluated directly (because of the $0 \cdot \infty$ conflict that occurs when $x$ is assigned the limit value $\infty$); it is computed using l'Hopital's rule:
$$\lim_{x\to\infty} x e^{-\lambda x} = \lim_{x\to\infty} \frac{x}{e^{\lambda x}} = \lim_{x\to\infty} \frac{x'}{(e^{\lambda x})'} = \lim_{x\to\infty} \frac{1}{\lambda e^{\lambda x}} = \frac{1}{\lambda} \lim_{x\to\infty} e^{-\lambda x} = 0,$$
so
$$e^{-\lambda x} x \Big|_0^{\infty} = 0 - 0 = 0.$$
50.
The integral $\int_0^\infty e^{-\lambda x}\, dx$ is computed easily:
$$\int_{0}^{\infty} e^{-\lambda x}\, dx = -\frac{1}{\lambda} \int_{0}^{\infty} \left( e^{-\lambda x} \right)' dx = -\frac{1}{\lambda}\, e^{-\lambda x} \Big|_0^{\infty} = -\frac{1}{\lambda}(0 - 1) = \frac{1}{\lambda}$$

Therefore,
$$\int_{0}^{\infty} \left( e^{-\lambda x} \right)' x\, dx = 0 - \frac{1}{\lambda} = -\frac{1}{\lambda},$$
which leads to the final result:
$$H(P) = \frac{\ln\lambda}{\ln 2}\,(-1) - \frac{\lambda}{\ln 2}\left( -\frac{1}{\lambda} \right) = -\frac{\ln\lambda}{\ln 2} + \frac{1}{\ln 2} = \frac{1 - \ln\lambda}{\ln 2}.$$
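The closed form can be confirmed by numerically integrating $-P(x)\log_2 P(x)$; a midpoint-rule sketch (the truncation bound and step count are arbitrary choices):

```python
from math import exp, log, log2

# Midpoint-rule integration of -P(x) log2 P(x) for P(x) = lam * exp(-lam * x),
# compared against the closed form (1 - ln(lam)) / ln(2).
def exp_entropy_numeric(lam, hi=60.0, steps=200000):
    h = hi / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        p = lam * exp(-lam * x)
        total -= p * log2(p) * h
    return total

for lam in [0.5, 1.0, 1.5]:
    closed = (1 - log(lam)) / log(2)
    assert abs(exp_entropy_numeric(lam) - closed) < 1e-4
print("H(P) = (1 - ln lambda) / ln 2 confirmed numerically")
```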
51.
Derivation of entropy definition,
starting from a set of desirable properties
CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2.2
52.
Remark: The definition we gave for entropy, $-\sum_{i=1}^n p_i \log p_i$, is not very intuitive.

Theorem:
If $\psi_n(p_1, \ldots, p_n)$ satisfies the following axioms

A0. [LC:] $\psi_n(p_1, \ldots, p_n) \geq 0$ for any $n \in \mathbb{N}^*$ and $p_1, \ldots, p_n$, since we view $\psi_n$ as a measure of disorder; also, $\psi_1(1) = 0$ because in this case there is no disorder;

A1. $\psi_n$ should be continuous in $p_i$ and symmetric in its arguments;

A2. if $p_i = 1/n$ then $\psi_n$ should be a monotonically increasing function of $n$; (If all events are equally likely, then having more events means being more uncertain.)

A3. if a choice among $N$ events is broken down into successive choices, then the entropy should be the weighted sum of the entropy at each stage;

then $\psi_n(p_1, \ldots, p_n) = -K \sum_i p_i \log p_i$, where $K$ is a positive constant.
(As we'll see, $K$ depends however on $\psi_s\left( \frac{1}{s}, \ldots, \frac{1}{s} \right)$ for a certain $s \in \mathbb{N}^*$.)

Remark: We will prove the theorem firstly for uniform distributions ($p_i = 1/n$) and secondly for the case $p_i \in \mathbb{Q}$ (only!).
53.
Example for axiom A3:

[Figure: two encodings of the distribution (1/2, 1/3, 1/6) over {a, b, c}. Encoding 1 chooses among a, b, c directly, with probabilities 1/2, 1/3, 1/6. Encoding 2 first chooses between a and (b, c) with probabilities 1/2, 1/2, and then between b and c with probabilities 2/3, 1/3.]

$$H\left( \frac{1}{2}, \frac{1}{3}, \frac{1}{6} \right) = \frac{1}{2} \log 2 + \frac{1}{3} \log 3 + \frac{1}{6} \log 6 = \left( \frac{1}{2} + \frac{1}{6} \right) \log 2 + \left( \frac{1}{3} + \frac{1}{6} \right) \log 3 = \frac{2}{3} + \frac{1}{2} \log 3$$
$$H\left( \frac{1}{2}, \frac{1}{2} \right) + \frac{1}{2}\, H\left( \frac{2}{3}, \frac{1}{3} \right) = 1 + \frac{1}{2} \left( \frac{2}{3} \log \frac{3}{2} + \frac{1}{3} \log 3 \right) = 1 + \frac{1}{2} \left( \log 3 - \frac{2}{3} \right) = \frac{2}{3} + \frac{1}{2} \log 3$$

The next 3 slides:
Case 1: $p_i = 1/n$ for $i = 1, \ldots, n$; proof steps:
54.
a. $A(n) \stackrel{not.}{=} \psi(1/n, 1/n, \ldots, 1/n)$ implies
$$A(s^m) = m\,A(s) \text{ for any } s, m \in \mathbb{N}^*. \qquad (1)$$

b. If $s, m \in \mathbb{N}^*$ (fixed), $s \neq 1$, and $t, n \in \mathbb{N}^*$ are such that $s^m \leq t^n \leq s^{m+1}$, then
$$\left| \frac{m}{n} - \frac{\log t}{\log s} \right| \leq \frac{1}{n}. \qquad (2)$$

c. For $s^m \leq t^n \leq s^{m+1}$ as above, due to A2 it follows (immediately) that
$$\psi_{s^m}\left( \frac{1}{s^m}, \ldots, \frac{1}{s^m} \right) \leq \psi_{t^n}\left( \frac{1}{t^n}, \ldots, \frac{1}{t^n} \right) \leq \psi_{s^{m+1}}\left( \frac{1}{s^{m+1}}, \ldots, \frac{1}{s^{m+1}} \right),$$
i.e. $A(s^m) \leq A(t^n) \leq A(s^{m+1})$. Show that
$$\left| \frac{m}{n} - \frac{A(t)}{A(s)} \right| \leq \frac{1}{n} \text{ for } s \neq 1. \qquad (3)$$

d. Combining (2) + (3) immediately gives
$$\left| \frac{A(t)}{A(s)} - \frac{\log t}{\log s} \right| \leq \frac{2}{n} \text{ for } s \neq 1. \qquad (4)$$
Show that this inequality implies
$$A(t) = K \log t \text{ with } K > 0 \text{ (due to A2)}. \qquad (5)$$
55.
Proof
a.

[Figure: an m-level tree in which each node branches into s equiprobable children (probability 1/s each), so that the s^m leaves each have probability 1/s^m.]

Applying axiom A3 on the encoding above gives:
$$A(s^m) = A(s) + s \cdot \frac{1}{s}\, A(s) + s^2 \cdot \frac{1}{s^2}\, A(s) + \ldots + s^{m-1} \cdot \frac{1}{s^{m-1}}\, A(s) = \underbrace{A(s) + A(s) + \ldots + A(s)}_{m \text{ times}} = m\, A(s)$$
56.
Proof (cont'd)

b.
$$s^m \leq t^n \leq s^{m+1} \Rightarrow m \log s \leq n \log t \leq (m+1) \log s \Rightarrow \frac{m}{n} \leq \frac{\log t}{\log s} \leq \frac{m}{n} + \frac{1}{n}$$
$$\Rightarrow 0 \leq \frac{\log t}{\log s} - \frac{m}{n} \leq \frac{1}{n} \Rightarrow \left| \frac{\log t}{\log s} - \frac{m}{n} \right| \leq \frac{1}{n}$$

c.
$$A(s^m) \leq A(t^n) \leq A(s^{m+1}) \stackrel{(1)}{\Rightarrow} m\, A(s) \leq n\, A(t) \leq (m+1)\, A(s) \stackrel{s \neq 1}{\Rightarrow} \frac{m}{n} \leq \frac{A(t)}{A(s)} \leq \frac{m}{n} + \frac{1}{n}$$
$$\Rightarrow 0 \leq \frac{A(t)}{A(s)} - \frac{m}{n} \leq \frac{1}{n} \Rightarrow \left| \frac{A(t)}{A(s)} - \frac{m}{n} \right| \leq \frac{1}{n}$$

d. Consider again $s^m \leq t^n \leq s^{m+1}$ with $s, t$ fixed. If $m \to \infty$ then $n \to \infty$, and from
$$\left| \frac{A(t)}{A(s)} - \frac{\log t}{\log s} \right| \leq \frac{2}{n}$$
it follows that $\left| \frac{A(t)}{A(s)} - \frac{\log t}{\log s} \right| \to 0$. Therefore $\left| \frac{A(t)}{A(s)} - \frac{\log t}{\log s} \right| = 0$ and so $\frac{A(t)}{A(s)} = \frac{\log t}{\log s}$.

Finally, $A(t) = \dfrac{A(s)}{\log s} \log t = K \log t$, where $K = \dfrac{A(s)}{\log s} > 0$ (if $s \neq 1$).
57.
Case 2: $p_i \in \mathbb{Q}$ for $i = 1, \ldots, n$

Let's consider a set of $N \geq 2$ equiprobable random events, and $P = (S_1, S_2, \ldots, S_k)$ a partition of this set. Let's denote $p_i = |S_i|/N$.

A "natural" two-step encoding (as shown in the nearby figure) leads to $A(N) = \psi_k(p_1, \ldots, p_k) + \sum_i p_i A(|S_i|)$, based on axiom A3.

[Figure: a two-level tree; at level 1 a block among $S_1, S_2, \ldots, S_k$ is chosen with probabilities $|S_1|/N, \ldots, |S_k|/N$; at level 2, within $S_i$, each element is chosen with probability $1/|S_i|$.]

Finally, using the result $A(t) = K \log t$ gives:
$$K \log N = \psi_k(p_1, \ldots, p_k) + K \sum_i p_i \log |S_i|$$
$$\Rightarrow \psi_k(p_1, \ldots, p_k) = K \left[ \log N - \sum_i p_i \log |S_i| \right] = K \left[ \log N \sum_i p_i - \sum_i p_i \log |S_i| \right] = -K \sum_i p_i \log \frac{|S_i|}{N} = -K \sum_i p_i \log p_i$$
58.
Entropy, joint entropy,
conditional entropy, information gain:
definitions and immediate properties
CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2
59.
Definitions

• The entropy of variable $X$:
$$H(X) \stackrel{def.}{=} -\sum_i P(X=x_i) \log P(X=x_i) \stackrel{not.}{=} E_X[-\log P(X)].$$

• The specific conditional entropy of variable $Y$ with respect to the value $x_k$ of variable $X$:
$$H(Y \mid X=x_k) \stackrel{def.}{=} -\sum_j P(Y=y_j \mid X=x_k) \log P(Y=y_j \mid X=x_k) \stackrel{not.}{=} E_{Y|X=x_k}[-\log P(Y \mid X=x_k)].$$

• The mean conditional entropy of variable $Y$ with respect to variable $X$:
$$H(Y \mid X) \stackrel{def.}{=} \sum_k P(X=x_k)\, H(Y \mid X=x_k) \stackrel{not.}{=} E_X[H(Y \mid X)].$$

• The joint entropy of variables $X$ and $Y$:
$$H(X, Y) \stackrel{def.}{=} -\sum_i \sum_j P(X=x_i, Y=y_j) \log P(X=x_i, Y=y_j) \stackrel{not.}{=} E_{X,Y}[-\log P(X, Y)].$$

• The mutual information of variables $X$ and $Y$, also called the information gain of variable $X$ with respect to variable $Y$ (or vice versa):
$$MI(X, Y) \stackrel{not.}{=} IG(X, Y) \stackrel{def.}{=} H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$

(Remark: the last equality above holds due to the result at point c below.)
60.
a. $H(X) \geq 0$.
$$H(X) = -\sum_i P(X=x_i) \log P(X=x_i) = \sum_i \underbrace{P(X=x_i)}_{\geq 0}\, \underbrace{\log \frac{1}{P(X=x_i)}}_{\geq 0} \geq 0$$

Moreover, $H(X) = 0$ if and only if the variable $X$ is constant:

"⇒" Suppose that $H(X) = 0$, i.e. $\sum_i P(X=x_i) \log \frac{1}{P(X=x_i)} = 0$. Since each term of this sum is greater than or equal to 0, it follows that $H(X) = 0$ only if for all $i$, $P(X=x_i) = 0$ or $\log \frac{1}{P(X=x_i)} = 0$, i.e. if for all $i$, $P(X=x_i) = 0$ or $P(X=x_i) = 1$. But since $\sum_i P(X=x_i) = 1$, it follows that there is a single value $x_1$ for $X$ such that $P(X=x_1) = 1$, and $P(X=x) = 0$ for any $x \neq x_1$. In other words, the discrete random variable $X$ is constant.

"⇐" Suppose that the variable $X$ is constant, which means that $X$ takes a single value $x_1$, with probability $P(X=x_1) = 1$. Therefore, $H(X) = -1 \cdot \log 1 = 0$.
61.
b.
$$H(Y \mid X) = -\sum_i \sum_j P(X=x_i, Y=y_j) \log P(Y=y_j \mid X=x_i)$$

Indeed,
$$H(Y \mid X) = \sum_i P(X=x_i)\, H(Y \mid X=x_i) = \sum_i P(X=x_i) \left[ -\sum_j P(Y=y_j \mid X=x_i) \log P(Y=y_j \mid X=x_i) \right]$$
$$= -\sum_i \sum_j \underbrace{P(X=x_i)\, P(Y=y_j \mid X=x_i)}_{=P(X=x_i, Y=y_j)} \log P(Y=y_j \mid X=x_i) = -\sum_i \sum_j P(X=x_i, Y=y_j) \log P(Y=y_j \mid X=x_i)$$
62.
c. $H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$
$$H(X, Y) = -\sum_i \sum_j p(x_i, y_j) \log p(x_i, y_j) = -\sum_i \sum_j p(x_i) \cdot p(y_j \mid x_i) \log [p(x_i) \cdot p(y_j \mid x_i)]$$
$$= -\sum_i \sum_j p(x_i) \cdot p(y_j \mid x_i) [\log p(x_i) + \log p(y_j \mid x_i)]$$
$$= -\sum_i \sum_j p(x_i) \cdot p(y_j \mid x_i) \log p(x_i) - \sum_i \sum_j p(x_i) \cdot p(y_j \mid x_i) \log p(y_j \mid x_i)$$
$$= -\sum_i p(x_i) \log p(x_i) \cdot \underbrace{\sum_j p(y_j \mid x_i)}_{=1} - \sum_i p(x_i) \sum_j p(y_j \mid x_i) \log p(y_j \mid x_i)$$
$$= H(X) + \sum_i p(x_i)\, H(Y \mid X=x_i) = H(X) + H(Y \mid X)$$
63.
More generally (the chain rule):
$$H(X_1, \ldots, X_n) = H(X_1) + H(X_2 \mid X_1) + \ldots + H(X_n \mid X_1, \ldots, X_{n-1})$$

Indeed,
$$H(X_1, \ldots, X_n) = E\left[ \log \frac{1}{p(x_1, \ldots, x_n)} \right] = -E_{p(x_1,\ldots,x_n)}[\log p(x_1, \ldots, x_n)]$$
$$= -E_{p(x_1,\ldots,x_n)}[\log p(x_1) + \log p(x_2 \mid x_1) + \ldots + \log p(x_n \mid x_1, \ldots, x_{n-1})]$$
$$= -E_{p(x_1)}[\log p(x_1)] - E_{p(x_1,x_2)}[\log p(x_2 \mid x_1)] - \ldots - E_{p(x_1,\ldots,x_n)}[\log p(x_n \mid x_1, \ldots, x_{n-1})]$$
$$= H(X_1) + H(X_2 \mid X_1) + \ldots + H(X_n \mid X_1, \ldots, X_{n-1})$$
64.
An upper bound for the entropy of a discrete distribution
CMU, 2003 fall, T. Mitchell, A. Moore, HW1, pr. 1.1
65.
Let $X$ be a discrete random variable that takes $n$ values and follows the probability distribution $P$. By definition, the entropy of $X$ is
$$H(X) = -\sum_{i=1}^{n} P(X=x_i) \log_2 P(X=x_i).$$
Show that $H(X) \leq \log_2 n$.

Hint: You may use the inequality $\ln x \leq x - 1$, which holds for any $x > 0$.
66.
Answer
$$H(X) = \frac{1}{\ln 2} \left( -\sum_{i=1}^{n} P(X=x_i) \ln P(X=x_i) \right)$$
Therefore,
$$H(X) \leq \log_2 n \;\Leftrightarrow\; \frac{1}{\ln 2} \left( -\sum_{i=1}^{n} P(X=x_i) \ln P(X=x_i) \right) \leq \log_2 n \;\Leftrightarrow\; -\sum_{i=1}^{n} P(x_i) \ln P(x_i) \leq \ln n$$
$$\Leftrightarrow \sum_{i=1}^{n} P(x_i) \ln \frac{1}{P(x_i)} - \underbrace{\left( \sum_{i=1}^{n} P(x_i) \right)}_{1} \ln n \leq 0 \;\Leftrightarrow\; \sum_{i=1}^{n} P(x_i) \ln \frac{1}{P(x_i)} - \sum_{i=1}^{n} P(x_i) \ln n \leq 0$$
$$\Leftrightarrow \sum_{i=1}^{n} P(x_i) \left( \ln \frac{1}{P(x_i)} - \ln n \right) \leq 0 \;\Leftrightarrow\; \sum_{i=1}^{n} P(x_i) \ln \frac{1}{n P(x_i)} \leq 0$$
67.
Applying the inequality $\ln x \leq x - 1$ for $x = \dfrac{1}{n P(x_i)}$, we have:
$$\sum_{i=1}^{n} P(x_i) \ln \frac{1}{n P(x_i)} \leq \sum_{i=1}^{n} P(x_i) \left( \frac{1}{n P(x_i)} - 1 \right) = \sum_{i=1}^{n} \frac{1}{n} - \underbrace{\sum_{i=1}^{n} P(x_i)}_{1} = 1 - 1 = 0$$

Remark: This upper bound is actually "attained". For example, when a discrete random variable $X$ having $n$ values follows the uniform distribution, one can immediately verify that $H(X) = \log_2 n$.
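The bound, and the fact that the uniform distribution attains it, can be spot-checked on random distributions:

```python
from math import log2
import random

random.seed(7)

# For random distributions over n values, check H(X) <= log2(n),
# with equality for the uniform distribution.
def H(ps):
    return -sum(p * log2(p) for p in ps if p > 0)

for n in [2, 5, 16]:
    for _ in range(100):
        w = [random.random() for _ in range(n)]
        s = sum(w)
        ps = [x / s for x in w]
        assert H(ps) <= log2(n) + 1e-12
    assert abs(H([1 / n] * n) - log2(n)) < 1e-12
print("H(X) <= log2 n, with equality for the uniform distribution")
```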
68.
Relative entropy, a.k.a. the Kullback-Leibler divergence,
and the [relationship to] information gain;
some basic properties
CMU, 2007 fall, C. Guestrin, HW1, pr. 1.2 [adapted by Liviu Ciortuz]
69.
The relative entropy — also known as the Kullback-Leibler (KL) divergence — from a distribution $p$ to a distribution $q$ is defined as
$$KL(p||q) \stackrel{def.}{=} -\sum_{x \in X} p(x) \log \frac{q(x)}{p(x)}$$

From an information theory perspective, the KL-divergence specifies the number of additional bits required on average to transmit values of $X$ if the values are distributed with respect to $p$ but we encode them assuming the distribution $q$.
70.
Notes

1. KL is not a distance measure, since it is not symmetric (i.e., in general $KL(p||q) \neq KL(q||p)$).
Another measure, which is defined as $JSD(p||q) = \frac{1}{2}(KL(p||q) + KL(q||p))$ and is called the Jensen-Shannon divergence, is symmetric.

2. The quantity
$$d(X, Y) \stackrel{def.}{=} H(X, Y) - IG(X, Y) = H(X) + H(Y) - 2\, IG(X, Y) = H(X \mid Y) + H(Y \mid X),$$
known as the variation of information, is a distance metric, i.e., it is non-negative, symmetric, implies indiscernibility, and satisfies the triangle inequality.
71.
a. Show that $KL(p||q) \geq 0$, and $KL(p||q) = 0$ iff $p(x) = q(x)$ for all $x$.
(More generally, the smaller the KL-divergence, the more similar the two distributions.)

Hint:
To prove this point you can use Jensen's inequality:
If $\varphi : \mathbb{R} \to \mathbb{R}$ is a convex function, then for any $t \in [0, 1]$ and any $x_1, x_2 \in \mathbb{R}$ it follows that $\varphi(t x_1 + (1-t) x_2) \leq t\, \varphi(x_1) + (1-t)\, \varphi(x_2)$.
If $\varphi$ is a strictly convex function, then equality holds only if $x_1 = x_2$.

More generally, for any $a_i \geq 0$, $i = 1, \ldots, n$ with $\sum_i a_i \neq 0$ and any $x_i \in \mathbb{R}$, $i = 1, \ldots, n$, we have
$$\varphi\left( \frac{\sum_i a_i x_i}{\sum_j a_j} \right) \leq \frac{\sum_i a_i\, \varphi(x_i)}{\sum_j a_j}.$$
If $\varphi$ is strictly convex, then equality holds only if $x_1 = \ldots = x_n$.

Obviously, similar results can be formulated for concave functions.
72.
Answer

We will prove the inequality $KL(p||q) \geq 0$ using Jensen's inequality, in whose expression we substitute $\varphi$ with the convex function $-\log_2$, $a_i$ with $p(x_i)$, and $x_i$ with $\dfrac{q(x_i)}{p(x_i)}$.
(For convenience, in what follows we drop the index of the variable $x$.) We have:
$$KL(p\,||\,q) \stackrel{def.}{=} -\sum_x p(x) \log \frac{q(x)}{p(x)} \stackrel{\text{Jensen}}{\geq} -\log\left( \sum_x p(x)\, \frac{q(x)}{p(x)} \right) = -\log\Big( \underbrace{\sum_x q(x)}_{1} \Big) = -\log 1 = 0$$

Therefore, $KL(p\,||\,q) \geq 0$, whatever the (discrete) distributions $p$ and $q$.
73.
We now prove that $KL(p||q) = 0 \Leftrightarrow p = q$.

"⇐" The equality $p(x) = q(x)$ implies $\dfrac{q(x)}{p(x)} = 1$, hence $\log \dfrac{q(x)}{p(x)} = 0$ for all $x$, from which $KL(p||q) = 0$ follows immediately.

"⇒" We know that in Jensen's inequality equality holds only when $x_i = x_j$ for all $i$ and $j$.
In the present case, this condition translates into the fact that the ratio $\dfrac{q(x)}{p(x)}$ is the same for every value of $x$.
Taking into account that $\sum_x p(x) = 1$ and $\sum_x p(x) \dfrac{q(x)}{p(x)} = \sum_x q(x) = 1$, it follows that $\dfrac{q(x)}{p(x)} = 1$ or, in other words, $p(x) = q(x)$ for all $x$, which means that the distributions $p$ and $q$ are identical.
74.
b. We can define the information gain as the KL divergence from the observed joint distribution of X and Y to the product of their observed marginals:

IG(X, Y) def.= KL(pX,Y || (pX pY)) = −∑_x ∑_y pX,Y(x, y) log (pX(x) pY(y) / pX,Y(x, y)) not.= −∑_x ∑_y p(x, y) log (p(x) p(y) / p(x, y))

Prove that this definition of information gain is equivalent to the one given in problem CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2. That is, show that IG(X, Y) = H[X] − H[X | Y] = H[Y] − H[Y | X], starting from the definition in terms of KL divergence.
Remark:
It follows that

IG(X, Y) = ∑_y p(y) ∑_x p(x | y) log (p(x | y) / p(x)) = ∑_y p(y) KL(pX|Y || pX) = EY[KL(pX|Y || pX)]
75.
Answer
By making use of the multiplication rule, namely p(x, y) = p(x | y) p(y), we will have:

KL(pXY || (pX pY)) def.= −∑_x ∑_y p(x, y) log (p(x) p(y) / p(x, y))
= −∑_x ∑_y p(x, y) log (p(x) p(y) / (p(x | y) p(y)))
= −∑_x ∑_y p(x, y) [log p(x) − log p(x | y)]
= −∑_x ∑_y p(x, y) log p(x) − (−∑_x ∑_y p(x, y) log p(x | y))
= −∑_x log p(x) ∑_y p(x, y) − H[X | Y]        (with ∑_y p(x, y) = p(x))
= H[X] − H[X | Y] = IG(X, Y)
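The identity just derived can be confirmed numerically. The sketch below (the 3×2 joint table is an arbitrary illustrative example, not from the source) computes IG(X, Y) once as KL(pXY || pX pY) and once as H[X] − H[X | Y] and checks that the two agree:

```python
from math import log2

# an arbitrary joint distribution p(x, y): rows = values of X, columns = values of Y
pxy = [[0.10, 0.20],
       [0.25, 0.05],
       [0.15, 0.25]]

px = [sum(row) for row in pxy]            # marginal of X
py = [sum(col) for col in zip(*pxy)]      # marginal of Y

# IG as KL(p_XY || p_X p_Y)
ig_kl = -sum(pxy[i][j] * log2(px[i] * py[j] / pxy[i][j])
             for i in range(3) for j in range(2))

# IG as H[X] - H[X|Y]
hx = -sum(p * log2(p) for p in px)
hx_given_y = -sum(pxy[i][j] * log2(pxy[i][j] / py[j])
                  for i in range(3) for j in range(2))

assert abs(ig_kl - (hx - hx_given_y)) < 1e-12   # the two definitions coincide
assert ig_kl >= 0                               # and the result is non-negative
```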
76.
c. A direct consequence of parts a. and b. is that IG(X, Y) ≥ 0 (and therefore H(X) ≥ H(X | Y) and H(Y) ≥ H(Y | X)) for any discrete random variables X and Y.
Prove that IG(X, Y) = 0 iff X and Y are independent.
Answer:
This is also an immediate consequence of parts a. and b. already proven:
IG(X, Y) = 0 ⇔(b) KL(pXY || pX pY) = 0 ⇔(a) pXY = pX pY ⇔ X and Y are independent.
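A numeric check of this equivalence (a sketch; the marginals and the dependent joint table below are arbitrary illustrative values): building a joint as the product of its marginals yields IG ≈ 0, while a joint that is not a product gives IG strictly positive.

```python
from math import log2

def ig(pxy):
    # information gain computed as KL(p_XY || p_X p_Y), base-2 logarithm
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    return sum(pxy[i][j] * log2(pxy[i][j] / (px[i] * py[j]))
               for i in range(len(px)) for j in range(len(py))
               if pxy[i][j] > 0)

px, py = [0.3, 0.7], [0.6, 0.4]
independent = [[a * b for b in py] for a in px]   # p(x, y) = p(x) p(y)
assert abs(ig(independent)) < 1e-12               # independence => IG = 0

dependent = [[0.30, 0.00],
             [0.30, 0.40]]                        # X and Y clearly dependent
assert ig(dependent) > 0                          # dependence => IG > 0
```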
77.
Remark
We can also prove the inequality IG(X, Y) ≥ 0 in a direct manner, using the result from part b. and applying Jensen's inequality in its generalized form, with the following "amendments":
− instead of a single index, we consider two indices (so instead of ai and xi we will have aij and xij, respectively);
− we take ϕ = − log2, with aij ← p(xi, yj) and xij ← p(xi) p(yj) / p(xi, yj);
− finally, we take into account that ∑_i ∑_j p(xi, yj) = 1.
Therefore,

IG(X, Y) = ∑_i ∑_j p(xi, yj) log (p(xi, yj) / (p(xi) · p(yj))) = ∑_i ∑_j p(xi, yj) [− log (p(xi) · p(yj) / p(xi, yj))]
≥ − log (∑_i ∑_j p(xi, yj) · p(xi) · p(yj) / p(xi, yj)) = − log (∑_i ∑_j p(xi) · p(yj))
= − log (∑_i p(xi) · ∑_j p(yj)) = − log (1 · 1) = − log 1 = 0

In conclusion, IG(X, Y) ≥ 0.
78.
Remark (cont’d)
If X and Y are independent variables, then p(xi, yj) = p(xi) p(yj) for all i and j.
Consequently, all the logarithms on the right-hand side of the first equality in the computation above are 0, and IG(X, Y) = 0 follows.
Conversely, assuming IG(X, Y) = 0, we take into account that the information gain can be expressed by means of the KL divergence, and we apply an argument similar to the one in part a.
It follows that p(xi) p(yj) / p(xi, yj) = 1, and hence p(xi) p(yj) = p(xi, yj) for all i and j.
This is equivalent to saying that the variables X and Y are independent.
79.
Proving [in a direct manner] that the Information Gain is always non-negative
(an indirect proof was made at CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2)
Liviu Ciortuz, 2017
80.
The definition of the information gain (or: mutual information) of a random variable X with respect to another random variable Y is

IG(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X).

At CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2 it was proven, for the case in which X and Y are discrete, that IG(X, Y) = KL(PX,Y || PX PY), where KL denotes the relative entropy (or: Kullback-Leibler divergence), PX and PY are the distributions of the variables X and Y, respectively, and PX,Y is the joint distribution of these variables. Also at CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2 it was shown that the KL divergence is always non-negative. Consequently, IG(X, Y) ≥ 0 for any X and Y.
In this exercise we ask you to prove the inequality IG(X, Y) ≥ 0 in a direct manner, starting from the first definition given above, without resorting [again] to the Kullback-Leibler divergence.
81.
Hint: You may use the following form of Jensen's inequality:

∑_{i=1}^{n} ai log xi ≤ log (∑_{i=1}^{n} ai xi)

where the base of the logarithm is greater than 1, ai ≥ 0 for i = 1, . . . , n, and ∑_{i=1}^{n} ai = 1.
Remark: The advantage of this problem, compared to CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2.a, is that here we work with a single distribution (p), not with two distributions (p and q). However, the proof here will be more laborious.
Answer
Assume that the values of the variable X are x1, x2, . . . , xn, and the values of the variable Y are y1, y2, . . . , ym. We have:

IG(X, Y) def.= H(X) − H(X | Y) def.= ∑_{i=1}^{n} −P(xi) log2 P(xi) − ∑_{j=1}^{m} P(yj) ∑_{i=1}^{n} (−P(xi | yj) log2 P(xi | yj))
82.
−IG(X, Y) = ∑_{i=1}^{n} P(xi) log2 P(xi) − ∑_{j=1}^{m} P(yj) ∑_{i=1}^{n} P(xi | yj) log2 P(xi | yj)

= [def. of marginal prob.] ∑_{i=1}^{n} (∑_{j=1}^{m} P(xi, yj)) log2 P(xi) − ∑_{j=1}^{m} P(yj) ∑_{i=1}^{n} P(xi | yj) log2 P(xi | yj)

= [distributivity of · over +] ∑_{i=1}^{n} ∑_{j=1}^{m} P(xi, yj) log2 P(xi) − ∑_{j=1}^{m} ∑_{i=1}^{n} P(yj) P(xi | yj) log2 P(xi | yj)

= [def. of conditional prob.] ∑_{i=1}^{n} ∑_{j=1}^{m} P(xi, yj) log2 P(xi) − ∑_{j=1}^{m} ∑_{i=1}^{n} P(xi, yj) log2 P(xi | yj)

= [distributivity of · over +] ∑_{i=1}^{n} ∑_{j=1}^{m} P(xi, yj) (log2 P(xi) − log2 P(xi | yj))

= [property of log] ∑_{i=1}^{n} ∑_{j=1}^{m} P(xi, yj) log2 (P(xi) / P(xi | yj))

= [multiplication rule] ∑_{i=1}^{n} ∑_{j=1}^{m} P(xi | yj) P(yj) log2 (P(xi) / P(xi | yj))

= [distributivity of · over +] ∑_{j=1}^{m} P(yj) ∑_{i=1}^{n} P(xi | yj) log2 (P(xi) / P(xi | yj)),    with ai = P(xi | yj)
83.
Since, on the one hand, P(xi | yj) ≥ 0 and, on the other hand, ∑_{i=1}^{n} P(xi | yj) = 1 for each value yj of Y, we can apply Jensen's inequality to the second sum in the last expression above (more precisely, for each value of the index j separately) and obtain:

−IG(X, Y) ≤ ∑_{j=1}^{m} P(yj) log2 (∑_{i=1}^{n} P(xi | yj) · P(xi) / P(xi | yj)) = ∑_{j=1}^{m} P(yj) log2 (∑_{i=1}^{n} P(xi)) = ∑_{j=1}^{m} P(yj) log2 1 = 0

Therefore, IG(X, Y) ≥ 0.
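The per-j application of Jensen's inequality above can be verified numerically. In the sketch below (the conditional table P(xi | yj) and the marginal P(yj) are arbitrary illustrative values, not from the source), each inner sum ∑_i P(xi | yj) log2 (P(xi) / P(xi | yj)) is checked to be at most log2 (∑_i P(xi)) = 0:

```python
from math import log2

# arbitrary example: P(y_j) and conditional columns P(x_i | y_j)
p_y = [0.4, 0.6]
p_x_given_y = [[0.5, 0.2],    # entry [i][j] is P(x_i | y_j)
               [0.3, 0.3],
               [0.2, 0.5]]

# marginal P(x_i) = sum_j P(x_i | y_j) P(y_j)
p_x = [sum(p_x_given_y[i][j] * p_y[j] for j in range(2)) for i in range(3)]

neg_ig = 0.0
for j in range(2):
    inner = sum(p_x_given_y[i][j] * log2(p_x[i] / p_x_given_y[i][j])
                for i in range(3))
    assert inner <= 1e-12      # Jensen: each inner sum <= log2(sum_i P(x_i)) = 0
    neg_ig += p_y[j] * inner   # accumulating -IG(X, Y)

assert -neg_ig >= -1e-12       # hence IG(X, Y) >= 0
```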
84.