
Econ 325

Notes on Mathematical Expectation, Variance, and Covariance¹

By Hiro Kasahara

Probability mass function, cumulative distribution function, and mathematical expectation

Example 1 Toss 2 fair coins. Denoting the heads by "H" and the tails by "T", the sample space is S = {TT, TH, HT, HH}, where P(TT) = P(TH) = P(HT) = P(HH) = 1/4. Now, define a random variable X by the number of heads when we toss 2 coins. Then, the event X = 0 happens if and only if TT happens. Also, the event X = 1 happens if either TH or HT happens. Finally, the event X = 2 happens if HH happens. Now, we may define a random variable X as a mapping from the sample space S = {TT, TH, HT, HH} to the space of possible values of X, {0, 1, 2}, as:

X(s) =
  0 if s = TT
  1 if s = TH or s = HT
  2 if s = HH.

Accordingly, we may assign a probability to each possible value of X in {0, 1, 2} as P(X = 0) = P({TT}) = 1/4, P(X = 1) = P({TH, HT}) = 1/2, and P(X = 2) = P({HH}) = 1/4.

Definition 1 (Random variable) Consider a random experiment whose sample space is given by S. A random variable X is a function that assigns one value X(s) = x to each element s ∈ S:

X : S → the space of X.

We denote the space of X by S_X, which is the set of all possible values of X, i.e., S_X = {x : X(s) = x, s ∈ S}.

When a random variable X takes a finite number of values or a countable number of values, the random variable X is called a discrete random variable. Suppose that X can take n possible values given by {x_1, x_2, ..., x_n}. Then, we may define the space of possible values of X by S_X = {x_1, x_2, ..., x_n}. A mapping from S_X to the real values between 0 and 1 that represents the value of P(X = x) defines the probability mass function.

Definition 2 (Probability mass function) The probability mass function (pmf) of a discrete random variable X, denoted by f_X(x), is a function that satisfies the following properties:

1. 0 ≤ f_X(x) ≤ 1 for any x ∈ S_X.

2. ∑_{x∈S_X} f_X(x) = 1.

3. P(X ∈ A) = ∑_{x∈A} f_X(x) for any A ⊂ S_X.

Taking A = {x}, the third property implies that f_X(x) = P(X = x). The second property must hold because ∑_{x∈S_X} P(X = x) = 1.

The cumulative probability defined by P(X ≤ x_0), i.e., the probability that X takes a value less than or equal to some value x_0, is often of interest.

¹ © Hiroyuki Kasahara. Not to be copied, used, revised, or distributed without explicit permission of the copyright owner.


Definition 3 (Cumulative distribution function) The cumulative distribution function (cdf) is a function defined by

F_X(x_0) = P(X ≤ x_0) = ∑_{x ≤ x_0} f_X(x).

We may formally define the mathematical expectation of X as the weighted average of the values X can take, where the weights are given by the values of the probability mass function.

Definition 4 If f(x) is the pmf of a random variable X, then the mathematical expectation, or the expected value, of X is defined by

E[X] = ∑_{x∈S_X} x f_X(x).

We often denote the expected value of X using the Greek letter µ (“mu”).

Definition 5 If f(x) is the pmf of a random variable X, then the variance σ² and the standard deviation σ ("sigma") of X are defined by

σ² = ∑_{x∈S_X} (x − µ)² f(x)   and   σ = √( ∑_{x∈S_X} (x − µ)² f(x) ),

respectively, where µ is the expected value of X, i.e., µ = E[X].

The following properties of expectation and variance are often used in different contexts.

Proposition 1 For any constants a and b and any random variable X,

E[a + bX] = a + bE[X].

Proposition 2 For any constants a and b and any random variable X,

Var[a + bX] = b² Var[X].

The proofs of these two propositions are given below.

Example 2 Define a random variable X by the total number of heads when you toss 2 coins. The space of possible values of X is S_X = {0, 1, 2}. If both coins are heads, then X = 2, which happens with probability 1/4; if one of the coins is a tail and the other is a head, then X = 1, which happens with probability 1/2; if both are tails, then X = 0, which happens with probability 1/4. Therefore, the probability mass function is given by

f_X(x) =
  1/4 if x = 0
  1/2 if x = 1
  1/4 if x = 2.

The cumulative distribution function is given by

F_X(x) =
  1/4 if x = 0
  3/4 if x = 1
  1   if x = 2.

Taking A = {0, 1} in the third property of the probability mass function, the probability of X ≤ 1 can be computed as Pr(X ≤ 1) = ∑_{x∈{0,1}} f_X(x) = f_X(0) + f_X(1) = 1/4 + 1/2 = 3/4. The expected value of X is given by

E[X] = 0 × (1/4) + 1 × (1/2) + 2 × (1/4) = 1.

The variance and the standard deviation of X are given by

σ² = (0 − 1)² × (1/4) + (1 − 1)² × (1/2) + (2 − 1)² × (1/4) = 1/2   and   σ = √σ² = 1/√2.
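The computations in Example 2 can be checked numerically. The following is a minimal Python sketch (not part of the original notes) that computes the mean, variance, standard deviation, and F_X(1) from the pmf above; the dictionary name pmf and the variable names are illustrative choices.

```python
# Minimal sketch: expectation, variance, and a cdf value from the pmf of Example 2.
from math import sqrt

pmf = {0: 1/4, 1: 1/2, 2: 1/4}   # f_X(x) for the number of heads in 2 tosses

mean = sum(x * p for x, p in pmf.items())                # E[X] = sum_x x f_X(x)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())   # Var(X) = sum_x (x - mu)^2 f_X(x)
cdf_at_1 = sum(p for x, p in pmf.items() if x <= 1)      # F_X(1) = P(X <= 1)

print(mean, var, sqrt(var), cdf_at_1)   # 1.0 0.5 0.707... 0.75
```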


Example 3 Roll a die twice. Let X be the number of times 4 comes up. X takes three possible values, 0, 1, or 2. X = 0 when both dice take a value in {1, 2, 3, 5, 6}, so that P(X = 0) = (5/6) × (5/6) = 25/36. X = 1 either when the first die takes one of the values in {1, 2, 3, 5, 6} and the second die takes the value 4, or when the first die takes the value 4 and the second die takes one of the values in {1, 2, 3, 5, 6}, so that P(X = 1) = (5/6) × (1/6) + (1/6) × (5/6) = 10/36. Finally, X = 2 when both dice take the value 4, so that P(X = 2) = (1/6) × (1/6) = 1/36. Note that P(X = 0) + P(X = 1) + P(X = 2) = 1. Therefore, the probability mass function of X is given by

f_X(x) =
  25/36 if x = 0
  10/36 if x = 1
  1/36  if x = 2.

The cumulative distribution function is given by

F_X(x) =
  25/36 if x = 0
  35/36 if x = 1
  1     if x = 2.

The expected value of X is given by

E(X) = ∑_{x=0,1,2} x f(x) = 0 × (25/36) + 1 × (10/36) + 2 × (1/36) = 12/36 = 1/3.

Example 4 Toss a coin 3 times. Let X be the number of heads. There are 8 possible outcomes: {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH}, where H indicates "Head" and T indicates "Tail." X takes four possible values, 0, 1, 2, and 3, with probabilities P(X = 0) = 1/8, P(X = 1) = 3/8, P(X = 2) = 3/8, and P(X = 3) = 1/8. Therefore,

f_X(x) =
  1/8 if x = 0
  3/8 if x = 1
  3/8 if x = 2
  1/8 if x = 3,

and

F_X(x) =
  1/8 if x = 0
  1/2 if x = 1
  7/8 if x = 2
  1   if x = 3.

The expected value of X is

E(X) = ∑_{x=0,1,2,3} x f(x) = 0 × (1/8) + 1 × (3/8) + 2 × (3/8) + 3 × (1/8) = (0 + 3 + 6 + 3)/8 = 12/8 = 3/2.

Example 5 (Bernoulli distribution) Define a random variable X which takes a value of 1 with probability p and 0 with probability 1 − p. Its probability mass function is given by

f_X(x) =
  1 − p if x = 0
  p     if x = 1.

This random variable is called a Bernoulli random variable. The expected value and the variance of X are computed as E[X] = ∑_{x=0,1} x f_X(x) = 0 × (1 − p) + 1 × p = p and Var[X] = ∑_{x=0,1} (x − E[X])² f_X(x) = (0 − p)² × (1 − p) + (1 − p)² × p = p(1 − p).


Example 6 (Binomial distribution) Given n independent Bernoulli random variables X_i with P(X_i = 1) = p and P(X_i = 0) = 1 − p for i = 1, 2, ..., n, we may define a random variable by the sum of the n independent Bernoulli random variables as Y = ∑_{i=1}^n X_i. This random variable Y has a binomial distribution, of which the probability mass function is given by

f(y) = [n! / (y!(n − y)!)] p^y (1 − p)^{n−y},

where y! = y · (y − 1) · ... · 2 · 1. The expected value of Y is computed as E[Y] = E[∑_{i=1}^n X_i] = ∑_{i=1}^n E[X_i] = (p + p + ... + p) = np because E[X_i] = p. The variance of Y is computed as Var[Y] = Var[∑_{i=1}^n X_i] = ∑_{i=1}^n Var(X_i) + 2 ∑_{i=1}^{n−1} ∑_{j=i+1}^n Cov(X_i, X_j) = ∑_{i=1}^n Var(X_i) + 0 = (p(1 − p) + p(1 − p) + ... + p(1 − p)) = np(1 − p), where Cov(X_i, X_j) = 0 follows from the independence of X_i and X_j when i ≠ j.
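As a quick numerical check of Example 6, here is a short Python sketch (not part of the original notes; the values n = 10 and p = 0.3 are chosen only for illustration) that builds the binomial pmf and verifies that its mean and variance equal np and np(1 − p).

```python
# Build the binomial pmf f(y) = C(n, y) p^y (1-p)^(n-y) and check E[Y] and Var[Y].
from math import comb

n, p = 10, 0.3   # assumed values for illustration
pmf = {y: comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)}

mean = sum(y * f for y, f in pmf.items())
var = sum((y - mean) ** 2 * f for y, f in pmf.items())

print(sum(pmf.values()))       # ~1.0 (property 2 of a pmf)
print(mean, n * p)             # ~3.0 and 3.0
print(var, n * p * (1 - p))    # ~2.1 and 2.1
```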

Example 7 (The average of n independent Bernoulli random variables) Consider a random variable defined by the average of n independent Bernoulli random variables, i.e., X̄ = (1/n) ∑_{i=1}^n X_i, where P(X_i = 1) = p and P(X_i = 0) = 1 − p for i = 1, 2, ..., n. Then, X̄ is related to Y = ∑_{i=1}^n X_i as X̄ = Y/n. Using this relationship, we may compute the expected value and the variance of X̄ from those of Y as E[X̄] = E[Y]/n = p and Var[X̄] = Var[Y]/n² = p(1 − p)/n, where we use the property of variance that Var(bY) = b² Var(Y) with b = 1/n in this case.

Poisson distribution

The Poisson probability distribution is often used to describe the probability of a number of events occurring in a fixed interval of time or space. To be concrete, consider a random variable X defined by the number of customers to arrive at a checkout aisle in a local grocery store from 2pm to 3pm. Assume that the one-hour interval between 2pm and 3pm is divided into a very large number of "very short" subintervals with equal length h. Suppose that the following three conditions are satisfied:

1. The numbers of occurrences in different subintervals are independent.

2. The probability of exactly one occurrence in a subinterval of length h is approximately λh.

3. The probability of two or more occurrences approaches zero as the length h approaches zero.

For example, you can think of h = 1 second in the above example. Then, the one-hour interval can be divided into 60 seconds × 60 minutes = 3600 time intervals of length h = 1 second = 1/3600 hour. If the average number of customers arriving at a checkout aisle within 1 hour is λ = 10, then the probability that one customer arrives at a checkout within 1 second is λ × h = 10 × (1/3600) = 1/360, while the probability of two or more customers checking out within 1 second is almost zero.

When these three conditions are satisfied, the probability distribution of a random variable X defined by the number of customers arriving at a checkout aisle from 2pm to 3pm is given by a Poisson distribution.

The probability mass function of a Poisson distribution (i.e., the probability that X = x) is given by

f(x) = λ^x e^{−λ} / x!,

where λ represents the average number of occurrences (in this case, the number of customers) within one unit of time (in this case, 1 hour).

The mean and the variance for the Poisson distribution are given by

E[X] = λ   and   Var(X) = λ.
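The checkout example can be made concrete in code. Below is a small Python sketch (not part of the original notes) that evaluates the Poisson pmf with λ = 10, the value used in the text, and checks that a truncated version of ∑_x x f(x) is approximately λ.

```python
# Poisson pmf f(x) = lambda^x e^(-lambda) / x! for the checkout example (lambda = 10).
from math import exp, factorial

lam = 10

def poisson_pmf(x):
    return lam**x * exp(-lam) / factorial(x)

print(poisson_pmf(0), poisson_pmf(10))                      # P(X = 0), P(X = 10)
approx_mean = sum(x * poisson_pmf(x) for x in range(200))   # truncating at 200 loses a negligible tail
print(approx_mean)                                          # ~10.0 = lambda
```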


We may also obtain a Poisson distribution as a limit of the sum of Bernoulli trials. To see this, divide one unit of time into a very large number n of short intervals with length h; in the above example, one unit of time is one hour, where n = 3600 and h = 1/3600. Then, a random variable defined by the occurrence or nonoccurrence in each subinterval of length h = 1/n (e.g., whether an event of customer checkout occurs or not within each subinterval of length h = 1/3600) follows a Bernoulli distribution with probability

p = λh = λ(1/n). (1)

Because the total number of occurrences is the sum of n independent Bernoulli random variables, we may approximate a Poisson distribution by the Binomial probability

Pr(X = x) ≈ [n! / (x!(n − x)!)] p^x (1 − p)^{n−x} = [n! / (x!(n − x)!)] (λ/n)^x (1 − λ/n)^{n−x}.

We may show that this Binomial probability mass function approaches the Poisson probability mass function as n increases to infinity, i.e.,²

lim_{n→∞} [n! / (x!(n − x)!)] (λ/n)^x (1 − λ/n)^{n−x} = λ^x e^{−λ} / x!. (2)

This also implies that, when n is large, the Binomial probability can be well approximated by the Poisson distribution. In fact, because p = λ/n in (1) implies λ = np, substituting λ = np into the right-hand side of (2) gives

Pr(X = x) = [n! / (x!(n − x)!)] p^x (1 − p)^{n−x} ≈ (np)^x e^{−np} / x!

for a Binomial random variable X when n is large.
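The limit in (2) can be illustrated numerically. The following Python sketch (not part of the original notes) compares the Binomial probabilities with p = λ/n to the Poisson probabilities, using the λ = 10 and n = 3600 from the checkout example.

```python
# Compare the Binomial pmf with p = lambda/n to the Poisson pmf for small x.
from math import comb, exp, factorial

lam, n = 10, 3600
p = lam / n                      # equation (1): p = lambda * h = lambda / n

for x in range(5):
    binom = comb(n, x) * p**x * (1 - p)**(n - x)
    pois = lam**x * exp(-lam) / factorial(x)
    print(x, binom, pois)        # the two columns are close when n is large
```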

Bivariate distribution

Let X and Y be two random variables defined on a discrete space. To be concrete, X can take n possible values given by {x_1, x_2, ..., x_n} and Y can take m possible values given by {y_1, y_2, ..., y_m}. Then, we may define the space of (X, Y) by S = {(x_i, y_j) for i = 1, ..., n and j = 1, ..., m}, where each possible pair of values represents a basic outcome. A mapping from the space of (X, Y), S, to the real values between 0 and 1 that represents the value of P(X = x, Y = y) defines the joint probability mass function.

Definition 6 (Joint probability mass function) The joint probability mass function (pmf) of discrete random variables X and Y, denoted by f(x, y), is a function that satisfies the following properties:

1. 0 ≤ f(x, y) ≤ 1 for any (x, y) ∈ S.

2. ∑_{(x,y)∈S} f(x, y) = 1.

3. P((X, Y) ∈ A) = ∑_{(x,y)∈A} f(x, y) for any A ⊂ S.

Taking A = {(x, y)}, the third property implies that f(x, y) = P(X = x, Y = y).

From the joint probability mass function of X and Y, f(x, y), we may derive the marginal probability mass function of X or Y.

² This can be shown as follows. Write

[n! / (x!(n − x)!)] (λ/n)^x (1 − λ/n)^{n−x} = [n! / ((n − x)! n^x)] (λ^x / x!) (1 − λ/n)^n (1 − λ/n)^{−x}.

Then, the result follows from

lim_{n→∞} n! / ((n − x)! n^x) = 1,   lim_{n→∞} (1 − λ/n)^n = e^{−λ},   and   lim_{n→∞} (1 − λ/n)^{−x} = 1.

Definition 7 (The marginal probability mass function of X or Y) Let X and Y be two random variables with the joint pmf f(x, y). Then, the marginal probability mass function of X is defined by

f_X(x) = ∑_y f(x, y) = P(X = x),

where the summation is taken over all possible values of y. Similarly, the marginal probability mass function of Y is defined by

f_Y(y) = ∑_x f(x, y) = P(Y = y).

Given the marginal distributions of X and Y, we may compute the expected values of X and Y as E_X[X] = ∑_x x f_X(x) and E_Y[Y] = ∑_y y f_Y(y). We are often interested in the linear relationship between X and Y. The covariance and the correlation coefficient of X and Y are important concepts to describe the linear relationship between X and Y.

Definition 8 (Covariance and Correlation Coefficient) The covariance of X and Y is defined as

Cov(X, Y) = ∑_{x,y} (x − E_X[X])(y − E_Y[Y]) f(x, y).

The correlation coefficient of X and Y is defined as

ρ = Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)) = Cov(X, Y) / (σ_X σ_Y),

where σ_X and σ_Y are the standard deviations of X and Y, respectively.

We denote the correlation coefficient by ρ ("rho"). The correlation coefficient takes a value between −1 and 1:³

−1 ≤ ρ ≤ 1.

The sign of the correlation coefficient indicates whether X and Y have a positive or negative linear relationship, while its magnitude represents the strength of the linear relationship between X and Y; we have ρ = 1 when all pairs of values (X, Y) lie on a straight line with positive slope. When ρ = 0 or, equivalently, Cov(X, Y) = 0, X and Y do not have any linear relationship.
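Definition 8 translates directly into code. The following Python sketch (not part of the original notes; the joint pmf is a hypothetical example chosen only to illustrate the formulas) computes the marginals, the covariance, and the correlation coefficient from a joint pmf.

```python
# Marginals, covariance, and correlation from a (hypothetical) joint pmf f(x, y).
from math import sqrt

joint = {(0, 1): 0.2, (0, 2): 0.2, (1, 1): 0.1, (1, 2): 0.5}   # assumed values, sum to 1

fx, fy = {}, {}
for (x, y), p in joint.items():
    fx[x] = fx.get(x, 0) + p      # f_X(x) = sum_y f(x, y)
    fy[y] = fy.get(y, 0) + p      # f_Y(y) = sum_x f(x, y)

ex = sum(x * p for x, p in fx.items())
ey = sum(y * p for y, p in fy.items())
cov = sum((x - ex) * (y - ey) * p for (x, y), p in joint.items())
var_x = sum((x - ex) ** 2 * p for x, p in fx.items())
var_y = sum((y - ey) ** 2 * p for y, p in fy.items())
rho = cov / sqrt(var_x * var_y)

print(cov, rho)                   # rho always lies between -1 and 1
```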

Definition 9 Random variables X and Y are said to be uncorrelated if and only if

Cov(X,Y ) = 0.

³ We may prove this by applying a version of the Cauchy-Schwarz inequality: |Cov(X, Y)|² ≤ Var(X) Var(Y). Taking the square root of this Cauchy-Schwarz inequality and dividing both sides by √(Var(X) Var(Y)), we have

|Cov(X, Y)| / √(Var(X) Var(Y)) ≤ 1.


It is important to emphasize that the notion of uncorrelatedness is about a lack of linear relationship. Even when X and Y are uncorrelated, X and Y could have a relationship in higher-order moments. For example, even when Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = 0, we could have E[(X − E[X])²(Y − E[Y])²] ≠ 0.

The stochastic independence of two random variables X and Y is defined by the property that P(X = x, Y = y) = P(X = x)P(Y = y).

Definition 10 (Stochastic independence) The discrete random variables X and Y are independent if and only if

P(X = x, Y = y) = P(X = x)P(Y = y) for every (x, y) ∈ S.

Equivalently, f(x, y) = f_X(x) f_Y(y).

Stochastic independence implies uncorrelatedness, but the converse is not true.

Consider two events A = {X = x} and B = {Y = y}. The conditional probability of A given B is defined as P(A|B) = P(A ∩ B)/P(B). In terms of probability mass functions, we have P(A ∩ B) = P(X = x, Y = y) = f(x, y) and P(B) = f_Y(y), so that we may define the conditional probability mass function of X given Y = y by f(x|y) (= P(A|B)) = f(x, y)/f_Y(y).

Definition 11 (Conditional probability) The conditional probability mass function of X given Y = y is defined by

f_{X|Y}(x|y) = f(x, y) / f_Y(y)

provided that f_Y(y) > 0.

The conditional probability mass function of X given Y = y cannot be defined when f_Y(y) = 0. Similarly, we may define the conditional probability mass function of Y given X = x by f_{Y|X}(y|x) = f(x, y)/f_X(x) whenever f_X(x) > 0.

Definition 12 (Conditional mean and variance) The conditional mean of X given Y = y is defined as

E_{X|Y}[X|Y = y] = ∑_x x f_{X|Y}(x|y).

The conditional variance of X given Y = y is defined as

Var_{X|Y}[X|Y = y] = ∑_x (x − E_{X|Y}[X|Y = y])² f_{X|Y}(x|y).

The conditional mean of Y given X can be written as E_Y[Y|X] without specifying a value of X. If we view X as a random variable, then E_Y[Y|X] is a random variable because the value of E_Y[Y|X] depends on a realization of X. Viewing E_Y[Y|X] as a function of X, E_Y[Y|X] is also called the conditional expectation function.

If we take the expectation of the random variable E_Y[Y|X] over all possible values of X, we obtain the expected value of Y, as the following proposition states.

Proposition 3 (Law of Iterated Expectation) Let X and Y be two random variables. Then,

E_X[E_Y[Y|X]] = E_Y[Y].


If the conditional expectation of Y given X does not depend on X, E_Y[Y|X] is not a random variable anymore and is identically equal to the expected value of Y.

Definition 13 A random variable Y is said to be mean independent of a random variable X if and only if

E_Y[Y|X = x] = E_Y[Y] for all x such that f_X(x) ≠ 0.

When X and Y are stochastically independent, then f_{Y|X}(y|x) := f(x, y)/f_X(x) = f_X(x) f_Y(y)/f_X(x) = f_Y(y), and therefore the conditional probability mass function of Y given X = x is the same as the (unconditional) probability mass function of Y. In this case, E_Y[Y|X = x] = ∑_y y f_{Y|X}(y|x) = ∑_y y f_Y(y) = E_Y[Y] for all values of x. Therefore, stochastic independence implies mean independence. Now, is it true that, if the conditional mean of Y given X = x is the same as the unconditional mean of Y for any x ∈ S_X, then X and Y are always stochastically independent? The answer is no. For example, the conditional variance of Y given X can depend on X (i.e., Var_{Y|X}(Y|X) ≠ Var_Y(Y)) even when E_Y[Y|X] = E_Y[Y].

The stochastic independence defined by the property that f(X, Y) = f_X(X) f_Y(Y) is the strongest form of independence. Stochastic independence implies both uncorrelatedness and mean independence, but neither uncorrelatedness nor mean independence implies stochastic independence. How about the relationship between uncorrelatedness and mean independence? It can be shown that mean independence implies uncorrelatedness, but the converse is not true. In sum,

Stochastic independence (f(X, Y) = f_X(X) f_Y(Y))
  ⇒ Mean independence (E_Y[Y|X] = E_Y[Y] and E_X[X|Y] = E_X[X])
  ⇒ Uncorrelatedness (Cov(X, Y) = 0).

Example 8 In a household survey, we ask whether the household head has completed a 4-year university degree and ask the household's annual income. Define the random variables X and Y as follows: X = 1 if the household head has completed a 4-year university degree; X = 0 otherwise. Y is the annual income. For simplicity, assume that Y takes three values: 30, 60, and 100 in thousand $. The sample space is S = {(x, y) : x = 0, 1 and y = 30, 60, 100}. The joint probability of X and Y is given in Table 1. The joint probability mass function, f(x, y), is given by

f(x, y) =
  0.24 if (x, y) = (0, 30)
  0.12 if (x, y) = (0, 60)
  0.04 if (x, y) = (0, 100)
  0.12 if (x, y) = (1, 30)
  0.36 if (x, y) = (1, 60)
  0.12 if (x, y) = (1, 100).

The marginal probability mass function of X is computed as f_X(x) = ∑_{y=30,60,100} f(x, y), so that f_X(0) = f(0, 30) + f(0, 60) + f(0, 100) = 0.24 + 0.12 + 0.04 = 0.4 and f_X(1) = 0.12 + 0.36 + 0.12 = 0.6. Take x = 0 and y = 30. Because f(0, 30) = 0.24 and f_X(0) f_Y(30) = 0.4 × 0.36 = 0.144, f(0, 30) ≠ f_X(0) f_Y(30). Therefore, X and Y are not stochastically independent in this example.⁴

⁴ Here, recall that X and Y are stochastically independent if and only if f(x, y) = f_X(x) f_Y(y) for every pair of values (x, y). Therefore, we can prove that X and Y are not stochastically independent by showing that f(x, y) = f_X(x) f_Y(y) does not hold for at least one pair of values (x, y) (in this case, (x, y) = (0, 30)). On the other hand, to prove that X and Y are stochastically independent, we need to show that f(x, y) = f_X(x) f_Y(y) holds for every possible pair of values (x, y).


The conditional probability mass function of Y given X = 1 is

f_{Y|X}(y|1) =
  0.12/0.6 if y = 30
  0.36/0.6 if y = 60
  0.12/0.6 if y = 100.

The conditional mean of Y given X = 1 is computed as E_Y[Y|X = 1] = ∑_y y f_{Y|X}(y|1) = 30 × (0.12/0.6) + 60 × (0.36/0.6) + 100 × (0.12/0.6) = 62. On the other hand, the conditional mean of Y given X = 0 is E_Y[Y|X = 0] = 30 × (0.24/0.4) + 60 × (0.12/0.4) + 100 × (0.04/0.4) = 46.

We may also compute the conditional probability mass function of X given Y = 30 as

f_{X|Y}(x|30) =
  0.24/0.36 if x = 0
  0.12/0.36 if x = 1,

and the conditional mean of X given Y = 30 is computed as E_{X|Y}[X|Y = 30] = ∑_{x=0,1} x f_{X|Y}(x|30) = 0 × (0.24/0.36) + 1 × (0.12/0.36) = 1/3. We may compute the conditional variance of X given Y = 30 as Var_{X|Y}[X|Y = 30] = ∑_{x=0,1} (x − E_{X|Y}[X|Y = 30])² f_{X|Y}(x|30) = (0 − 1/3)² (2/3) + (1 − 1/3)² (1/3) = 2/9.

The conditional expectation of Y given X = 0 or 1 is computed as:

E_Y[Y|X = 0] = ∑_{j=1}^3 y_j P(Y = y_j|X = 0) = 30 × 0.6 + 60 × 0.3 + 100 × 0.1 = 46,

E_Y[Y|X = 1] = ∑_{j=1}^3 y_j P(Y = y_j|X = 1) = 30 × 0.2 + 60 × 0.6 + 100 × 0.2 = 62.

Therefore, viewing E_Y[Y|X] as a function of X, the conditional expectation function E_Y[Y|X] is given by

E_Y[Y|X] =
  46 if X = 0
  62 if X = 1.

Viewing E_Y[Y|X] as a random variable whose realized value depends on the realized value of X, define a new random variable Z as Z := E_Y[Y|X]. Then, the probability mass function of Z = E_Y[Y|X] is given by

f_Z(z) =
  0.4 if z = 46,
  0.6 if z = 62,
  0 otherwise.

Note that the probability mass function of Z takes the same probability values as that of X. The expected value of Z = E_Y[Y|X] is computed as E_Z[Z] = E_X[E_Y[Y|X]] = ∑_{x=0,1} E_Y[Y|X = x] f_X(x) = 0.4 × 46 + 0.6 × 62 = 55.6. We may verify that the law of iterated expectation holds. In fact, the expected value of Y is computed as E_Y[Y] = ∑_{y=30,60,100} y f_Y(y) = 30 × 0.36 + 60 × 0.48 + 100 × 0.16 = 55.6. Therefore, E_Y[Y] = E_X[E_Y[Y|X]] holds. This is an example of the Law of Iterated Expectation.
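These conditional means and the law of iterated expectation can be verified with a short Python sketch (not part of the original notes), using the joint pmf from Table 1.

```python
# Conditional means E[Y|X = x] and the law of iterated expectation for Example 8.
joint = {(0, 30): 0.24, (0, 60): 0.12, (0, 100): 0.04,
         (1, 30): 0.12, (1, 60): 0.36, (1, 100): 0.12}

fx = {0: 0.0, 1: 0.0}
for (x, y), p in joint.items():
    fx[x] += p                                   # marginal pmf of X

def cond_mean_y(x):
    # E[Y | X = x] = sum_y y f(x, y) / f_X(x)
    return sum(y * p for (xi, y), p in joint.items() if xi == x) / fx[x]

print(cond_mean_y(0), cond_mean_y(1))                        # 46.0 62.0
print(sum(cond_mean_y(x) * px for x, px in fx.items()))      # 55.6 = E[Y]
```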

Example 9 Suppose that X and Y are stochastically independent. In this case, is it true that E[X|Y = y] = E[X], i.e., that the conditional mean of X given Y = y is the same as the unconditional mean of X? To answer this question, recall that stochastic independence is defined by the property that f(x, y) = f_X(x) f_Y(y). Then, the conditional probability mass function of X given Y = y becomes

f_{X|Y}(x|y) = f(x, y) / f_Y(y) = f_X(x) f_Y(y) / f_Y(y) = f_X(x),

where the second equality uses the definition of stochastic independence, i.e., f(x, y) = f_X(x) f_Y(y). Therefore, E_{X|Y}[X|Y = y] = ∑_x x f_{X|Y}(x|y) = ∑_x x f_X(x) = E_X[X].


Table 1: Joint Distribution of University Degree and Annual Incomes (in thousand $)

                       Y = 30   Y = 60   Y = 100   Marginal Prob. of X
X = 0                  0.24     0.12     0.04      0.40
X = 1                  0.12     0.36     0.12      0.60
Marginal Prob. of Y    0.36     0.48     0.16      1.00

We are often interested in a random variable defined by a linear combination of two random variables. The following formula shows how to compute the variance of a linear combination of two random variables X and Y from the variances and covariance of X and Y.

Proposition 4 For any constants a and b and any two random variables X and Y,

Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y).

The proof of this proposition is given below. The following example shows how we may apply this formula in practice.

Example 10 In the lecture, we consider the investment returns of holding a portfolio W = aX + bY, where X is the share price of a bond fund and Y is the share price of an aggressive investment fund. The joint distribution of (∆X, ∆Y) in this example is given in Table 2. There are three possible pairs of values for (∆X, ∆Y): (∆X, ∆Y) = (7, −20) represents "Recession," (∆X, ∆Y) = (4, 6) represents "Stable Economy," and (∆X, ∆Y) = (2, 35) represents the third state.

Now, consider the portfolio of holding 40 percent of bond X and 60 percent of aggressive investment fund Y, so that the return from this portfolio is given by ∆W = 0.4∆X + 0.6∆Y, setting a = 0.4 and b = 0.6. What are the expected value and the standard deviation of ∆W?

We may compute the expected value as E[∆W] = 0.4 E[∆X] + 0.6 E[∆Y], where E[∆X] = 0.3 × 2 + 0.5 × 4 + 0.2 × 7 = 4 and E[∆Y] = 0.2 × (−20) + 0.5 × 6 + 0.3 × 35 = 9.5. Therefore, E[∆W] = 0.4 × 4 + 0.6 × 9.5 = 7.3.

To compute the variance of ∆W, we use the formula Var(a∆X + b∆Y) = a² Var(∆X) + b² Var(∆Y) + 2ab Cov(∆X, ∆Y), where a = 0.4, b = 0.6,

Var(∆X) = ∑_{x=2,4,7} (x − E[∆X])² f_X(x) = (2 − 4)² × 0.3 + (4 − 4)² × 0.5 + (7 − 4)² × 0.2 = 3,

Var(∆Y) = ∑_{y=−20,6,35} (y − E[∆Y])² f_Y(y) = (−20 − 9.5)² × 0.2 + (6 − 9.5)² × 0.5 + (35 − 9.5)² × 0.3 = 375.25,

Cov(∆X, ∆Y) = ∑_{(x,y)=(2,35),(4,6),(7,−20)} (x − E[∆X])(y − E[∆Y]) f(x, y)
            = (2 − 4)(35 − 9.5) × 0.3 + (4 − 4)(6 − 9.5) × 0.5 + (7 − 4)(−20 − 9.5) × 0.2 = −33.

Therefore,

Var(0.4∆X + 0.6∆Y) = (0.4)² × 3 + (0.6)² × 375.25 + 2 × 0.4 × 0.6 × (−33) = 119.73,

and the standard deviation of ∆W is given by √119.73 ≈ 10.94.
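The portfolio calculation in Example 10 can be reproduced in a few lines of Python. The sketch below (not part of the original notes) computes Var(∆W) both with the formula of Proposition 4 and directly from the joint pmf in Table 2, which gives the same number.

```python
# Portfolio mean and variance for Example 10, computed two ways.
from math import sqrt

joint = {(2, 35): 0.3, (4, 6): 0.5, (7, -20): 0.2}   # f(dx, dy) from Table 2
a, b = 0.4, 0.6

ex = sum(dx * p for (dx, dy), p in joint.items())     # E[dX] = 4
ey = sum(dy * p for (dx, dy), p in joint.items())     # E[dY] = 9.5
var_x = sum((dx - ex) ** 2 * p for (dx, dy), p in joint.items())
var_y = sum((dy - ey) ** 2 * p for (dx, dy), p in joint.items())
cov = sum((dx - ex) * (dy - ey) * p for (dx, dy), p in joint.items())

var_w_formula = a**2 * var_x + b**2 * var_y + 2 * a * b * cov          # Proposition 4
var_w_direct = sum((a * dx + b * dy - (a * ex + b * ey)) ** 2 * p      # definition of variance
                   for (dx, dy), p in joint.items())

print(a * ex + b * ey)                  # 7.3
print(var_w_formula, var_w_direct)      # 119.73 and 119.73
print(sqrt(var_w_formula))              # ~10.94
```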

Properties of Mathematical Expectation

Let X be a random variable and suppose that the mathematical expectation of X, E(X), exists.

1. If a is a constant, then E(a) = a.


Table 2: Joint Distribution of (∆X, ∆Y)

                        ∆Y = −20   ∆Y = 6   ∆Y = 35   Marginal Prob. of ∆X
∆X = 2                  0          0        0.3       0.3
∆X = 4                  0          0.5      0         0.5
∆X = 7                  0.2        0        0         0.2
Marginal Prob. of ∆Y    0.2        0.5      0.3       1.00

2. If b is a constant, then E(bX) = bE(X).

3. If a and b are constants, then

E(a + bX) = a + bE(X). (3)

4. Let g(x) be a function of x. In general,

E(g(X)) ≠ g(E(X))

unless g(x) is a linear function of the form g(x) = a + bx.

Proof: Let X be a discrete random variable, where the possible values for X are {x_1, ..., x_n}, with the probability mass function of X given by

p^X_i = P(X = x_i), i = 1, ..., n.

For the proof of 1, we have

E(a) = ∑_{i=1}^n a p^X_i
     = (a p^X_1 + a p^X_2 + ... + a p^X_n)
     = a × (p^X_1 + p^X_2 + ... + p^X_n)
     = a ∑_{i=1}^n p^X_i
     = a,

where the last equality holds because ∑_{i=1}^n p^X_i = 1.

For the proof of 2, we have

E(bX) = ∑_{i=1}^n b x_i p^X_i
      = (b x_1 p^X_1 + b x_2 p^X_2 + ... + b x_n p^X_n)
      = b × (x_1 p^X_1 + x_2 p^X_2 + ... + x_n p^X_n)
      = b ∑_{i=1}^n x_i p^X_i
      = b E(X).


For the proof of 3, we have

E(a + bX) = ∑_{i=1}^n (a + b x_i) p^X_i
          = (a + b x_1) p^X_1 + (a + b x_2) p^X_2 + ... + (a + b x_n) p^X_n
          = (a p^X_1 + a p^X_2 + ... + a p^X_n) + (b x_1 p^X_1 + b x_2 p^X_2 + ... + b x_n p^X_n)
          = a × (p^X_1 + p^X_2 + ... + p^X_n) + b × (x_1 p^X_1 + x_2 p^X_2 + ... + x_n p^X_n)
          = a ∑_{i=1}^n p^X_i + b ∑_{i=1}^n x_i p^X_i
          = a + b E(X).

For 4,

E(g(X)) = ∑_{i=1}^n g(x_i) p^X_i, (4)

g(E(X)) = g( ∑_{i=1}^n x_i p^X_i ). (5)

When g(x) = a + bx, we have E(g(X)) = E(a + bX) = a + bE(X) = g(E(X)) from 3. However, this is not necessarily the case for nonlinear functions g(x). For example, for g(x) = x², we have E(g(X)) = E(X²) and g(E(X)) = {E(X)}². As we show below, Var(X) = E(X²) − {E(X)}² > 0 in general (unless X is constant). Therefore, E(X²) ≠ {E(X)}² in general.
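A tiny numerical illustration of point 4 (not part of the original notes; the two-point pmf is an assumed example): for g(x) = x², the gap between E[g(X)] and g(E[X]) is exactly Var(X).

```python
# E[X^2] versus (E[X])^2 for an assumed two-point distribution.
pmf = {0: 0.5, 2: 0.5}

e_x = sum(x * p for x, p in pmf.items())        # E[X] = 1
e_x2 = sum(x**2 * p for x, p in pmf.items())    # E[X^2] = 2
print(e_x2, e_x**2, e_x2 - e_x**2)              # 2.0, 1.0, and the difference 1.0 = Var(X)
```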

Variance and Covariance

Let X and Y be two discrete random variables. The set of possible values for X is {x_1, ..., x_n}, and the set of possible values for Y is {y_1, ..., y_m}. The joint probability mass function is given by

p^{X,Y}_{ij} = P(X = x_i, Y = y_j), i = 1, ..., n; j = 1, ..., m.

The marginal probability function of X is

p^X_i = P(X = x_i) = ∑_{j=1}^m p^{X,Y}_{ij}, i = 1, ..., n,

and the marginal probability function of Y is

p^Y_j = P(Y = y_j) = ∑_{i=1}^n p^{X,Y}_{ij}, j = 1, ..., m.

In Example 8, we have n = 2 and m = 3, where x_1 = 0, x_2 = 1, y_1 = 30, y_2 = 60, and y_3 = 100. The joint distribution of X and Y in Example 8 is given in Table 1.

What would be the values of p^{X,Y}_{ij}, p^X_i, and p^Y_j for i = 1, 2 and j = 1, 2, 3 in Table 1? Answer: Compare Table 1 with Table 3.

It is also intuitive to understand why p^X_i = P(X = x_i) = ∑_{j=1}^m p^{X,Y}_{ij} holds for this example. What is the marginal probability of having a 4-year university degree in Example 8? This is the sum of the probabilities P(X = 1, Y = 30), P(X = 1, Y = 60), and P(X = 1, Y = 100), so that P(X = 1, Y = 30) + P(X = 1, Y = 60) + P(X = 1, Y = 100) = P(X = 1), or 0.12 + 0.36 + 0.12 = 0.60 in Table 1.


Table 3: Joint Distribution of X and Y

                        Y = y_1        Y = y_2        Y = y_3        Marginal Dist. of X
X = x_1                 p^{X,Y}_{11}   p^{X,Y}_{12}   p^{X,Y}_{13}   p^X_1
X = x_2                 p^{X,Y}_{21}   p^{X,Y}_{22}   p^{X,Y}_{23}   p^X_2
Marginal Dist. of Y     p^Y_1          p^Y_2          p^Y_3          1.00

1. E[X + Y] = E[X] + E[Y]. (6)

Proof:

E(X + Y) = ∑_{i=1}^n ∑_{j=1}^m (x_i + y_j) p^{X,Y}_{ij}
         = ∑_{i=1}^n ∑_{j=1}^m (x_i p^{X,Y}_{ij} + y_j p^{X,Y}_{ij})
         = ∑_{i=1}^n ∑_{j=1}^m x_i p^{X,Y}_{ij} + ∑_{i=1}^n ∑_{j=1}^m y_j p^{X,Y}_{ij} (7)
         = ∑_{i=1}^n x_i (∑_{j=1}^m p^{X,Y}_{ij}) + ∑_{j=1}^m y_j (∑_{i=1}^n p^{X,Y}_{ij}) (8)
           (because we can take x_i out of ∑_{j=1}^m, since x_i does not depend on j)
         = ∑_{i=1}^n x_i p^X_i + ∑_{j=1}^m y_j p^Y_j
           (because p^X_i = ∑_{j=1}^m p^{X,Y}_{ij} and p^Y_j = ∑_{i=1}^n p^{X,Y}_{ij})
         = E(X) + E(Y).

Equation (7): To understand ∑_{i=1}^n ∑_{j=1}^m (x_i p^{X,Y}_{ij} + y_j p^{X,Y}_{ij}) = ∑_{i=1}^n ∑_{j=1}^m x_i p^{X,Y}_{ij} + ∑_{i=1}^n ∑_{j=1}^m y_j p^{X,Y}_{ij}, consider the case of n = m = 2. Then,

∑_{i=1}^2 ∑_{j=1}^2 (x_i p^{X,Y}_{ij} + y_j p^{X,Y}_{ij})
  = (x_1 p^{X,Y}_{11} + y_1 p^{X,Y}_{11}) + (x_1 p^{X,Y}_{12} + y_2 p^{X,Y}_{12}) + (x_2 p^{X,Y}_{21} + y_1 p^{X,Y}_{21}) + (x_2 p^{X,Y}_{22} + y_2 p^{X,Y}_{22})
  = (x_1 p^{X,Y}_{11} + x_1 p^{X,Y}_{12} + x_2 p^{X,Y}_{21} + x_2 p^{X,Y}_{22}) + (y_1 p^{X,Y}_{11} + y_2 p^{X,Y}_{12} + y_1 p^{X,Y}_{21} + y_2 p^{X,Y}_{22})
  = ∑_{i=1}^2 ∑_{j=1}^2 x_i p^{X,Y}_{ij} + ∑_{i=1}^2 ∑_{j=1}^2 y_j p^{X,Y}_{ij}.

Equation (8): To understand ∑_{i=1}^n ∑_{j=1}^m x_i p^{X,Y}_{ij} = ∑_{i=1}^n x_i (∑_{j=1}^m p^{X,Y}_{ij}), consider the case of n = m = 2. Then,

∑_{i=1}^2 ∑_{j=1}^2 x_i p^{X,Y}_{ij} = x_1 p^{X,Y}_{11} + x_1 p^{X,Y}_{12} + x_2 p^{X,Y}_{21} + x_2 p^{X,Y}_{22}
  = x_1 (p^{X,Y}_{11} + p^{X,Y}_{12}) + x_2 (p^{X,Y}_{21} + p^{X,Y}_{22})
  = ∑_{i=1}^2 x_i (p^{X,Y}_{i1} + p^{X,Y}_{i2})
  = ∑_{i=1}^2 x_i (∑_{j=1}^2 p^{X,Y}_{ij}).

Similarly, we may show that ∑_{i=1}^2 ∑_{j=1}^2 y_j p^{X,Y}_{ij} = ∑_{j=1}^2 y_j (∑_{i=1}^2 p^{X,Y}_{ij}).

2. If c is a constant, then Cov(X, c) = 0.

Proof: According to the definition of covariance,

Cov(X, c) = E[(X − E(X))(c − E(c))].

Since the expectation of a constant is itself, i.e., E(c) = c,

Cov(X, c) = E[(X − E(X))(c − c)]
          = E[(X − E(X)) · 0]
          = E[0]
          = ∑_{i=1}^n 0 × p^X_i
          = 0 + 0 + ... + 0
          = 0.

3. Cov(X, X) = Var(X).

Proof: According to the definition of covariance, we can expand Cov(X, X) as follows:

Cov(X, X) = E[(X − E(X))(X − E(X))]
          = ∑_{i=1}^n [x_i − E(X)][x_i − E(X)] · P(X = x_i), where E(X) = ∑_{i=1}^n x_i p^X_i
          = ∑_{i=1}^n [x_i − E(X)][x_i − E(X)] · p^X_i
          = ∑_{i=1}^n [x_i − E(X)]² · p^X_i
          = E[(X − E(X))²] (by def. of the expected value)
          = Var(X).


4. Cov(X, Y) = Cov(Y, X).

Proof: According to the definition of covariance, we can expand Cov(X, Y) as follows:

Cov(X, Y) = E[(X − E(X))(Y − E(Y))]
          = ∑_{i=1}^n ∑_{j=1}^m [x_i − E(X)][y_j − E(Y)] · p^{X,Y}_{ij}, where E(X) = ∑_{i=1}^n x_i p^X_i and E(Y) = ∑_{j=1}^m y_j p^Y_j
          = ∑_{j=1}^m ∑_{i=1}^n [y_j − E(Y)][x_i − E(X)] · p^{X,Y}_{ij}
          = E[(Y − E(Y))(X − E(X))] (by def. of the expected value)
          = Cov(Y, X). (by def. of the covariance)

5. Cov(X, Y) = E[XY] − E[X]E[Y] and Var(X) = E[X²] − {E(X)}².

Proof: According to the definition of covariance, we can expand Cov(X, Y) as follows:

Cov(X, Y) = E[(X − E(X))(Y − E(Y))]
          = E[XY − E(Y)X − E(X)Y + E(X)E(Y)]
          = ∑_{i=1}^n ∑_{j=1}^m [x_i y_j − E(Y)x_i − E(X)y_j + E(X)E(Y)] · p^{X,Y}_{ij}
          = ∑_{i=1}^n ∑_{j=1}^m x_i y_j p^{X,Y}_{ij} − E(Y) ∑_{i=1}^n ∑_{j=1}^m x_i p^{X,Y}_{ij} − E(X) ∑_{i=1}^n ∑_{j=1}^m y_j p^{X,Y}_{ij} + E(X)E(Y) ∑_{i=1}^n ∑_{j=1}^m p^{X,Y}_{ij}, where the last double sum equals 1
          = E[XY] − E(Y) ∑_{i=1}^n x_i (∑_{j=1}^m p^{X,Y}_{ij}) − E(X) ∑_{j=1}^m y_j (∑_{i=1}^n p^{X,Y}_{ij}) + E(X)E(Y), where the inner sums equal p^X_i and p^Y_j
          = E[XY] − E(Y)E(X) − E(X)E(Y) + E(X)E(Y)
          = E[XY] − E(X)E(Y).

Note that we may also prove that Var(X) = E[X²] − {E[X]}² by letting Y = X in the above proof. We may also prove that, for any functions g_1(x) and g_2(x), we have Var(g_1(X) + g_2(X)) = Var(g_1(X)) + Var(g_2(X)) + 2Cov(g_1(X), g_2(X)) (try to prove it yourself as an exercise!).
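Property 5 is easy to check numerically. The following Python sketch (not part of the original notes) computes Cov(X, Y) from the definition and from the shortcut formula E[XY] − E[X]E[Y], using the joint pmf of Table 1; the two numbers coincide.

```python
# Check that Cov(X, Y) = E[XY] - E[X]E[Y] for the joint pmf of Table 1.
joint = {(0, 30): 0.24, (0, 60): 0.12, (0, 100): 0.04,
         (1, 30): 0.12, (1, 60): 0.36, (1, 100): 0.12}

e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov_def = sum((x - e_x) * (y - e_y) * p for (x, y), p in joint.items())

print(cov_def, e_xy - e_x * e_y)    # the two values are equal
```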

6. Cov(a_1 + b_1 X, a_2 + b_2 Y) = b_1 b_2 Cov(X, Y), where a_1, a_2, b_1, and b_2 are some constants.

Proof: Using E(a_1 + b_1 X) = a_1 + b_1 E(X) and E(a_2 + b_2 Y) = a_2 + b_2 E(Y), we can expand Cov(a_1 + b_1 X, a_2 + b_2 Y) as follows:

Cov(a_1 + b_1 X, a_2 + b_2 Y) = E[(a_1 + b_1 X − E(a_1 + b_1 X))(a_2 + b_2 Y − E(a_2 + b_2 Y))]
  = E[(a_1 + b_1 X − (a_1 + b_1 E(X)))(a_2 + b_2 Y − (a_2 + b_2 E(Y)))]
  = E[(a_1 − a_1 + b_1 X − b_1 E(X))(a_2 − a_2 + b_2 Y − b_2 E(Y))]
  = E[(b_1 X − b_1 E(X))(b_2 Y − b_2 E(Y))]
  = E[b_1 (X − E(X)) · b_2 (Y − E(Y))]
  = E[b_1 b_2 (X − E(X))(Y − E(Y))]
  = ∑_{i=1}^n ∑_{j=1}^m b_1 b_2 (x_i − E(X))(y_j − E(Y)) · p^{X,Y}_{ij}
  = b_1 b_2 ∑_{i=1}^n ∑_{j=1}^m [x_i − E(X)][y_j − E(Y)] · p^{X,Y}_{ij} (by using (3))
  = b_1 b_2 Cov(X, Y).

7. If X and Y are independent, then Cov(X, Y) = 0.

Proof: If X and Y are independent, by definition of stochastic independence, P(X = x_i, Y = y_j) = P(X = x_i)P(Y = y_j) = p^X_i p^Y_j for any i = 1, ..., n and j = 1, ..., m. Then, we may expand Cov(X, Y) as follows:

Cov(X, Y) = E[(X − E(X))(Y − E(Y))]
  = ∑_{i=1}^n ∑_{j=1}^m [x_i − E(X)][y_j − E(Y)] · P(X = x_i, Y = y_j)
  = ∑_{i=1}^n ∑_{j=1}^m [x_i − E(X)][y_j − E(Y)] p^X_i p^Y_j
    (because X and Y are independent)
  = ∑_{i=1}^n ∑_{j=1}^m {[x_i − E(X)] p^X_i}{[y_j − E(Y)] p^Y_j}
  = ∑_{i=1}^n [x_i − E(X)] p^X_i (∑_{j=1}^m [y_j − E(Y)] p^Y_j) (9)
    (because we can move [x_i − E(X)] p^X_i outside of ∑_{j=1}^m, since [x_i − E(X)] p^X_i does not depend on the index j)
  = (∑_{j=1}^m [y_j − E(Y)] p^Y_j) {∑_{i=1}^n [x_i − E(X)] p^X_i} (10)
    (because we can move ∑_{j=1}^m [y_j − E(Y)] p^Y_j outside of ∑_{i=1}^n, since it does not depend on the index i)
  = {∑_{i=1}^n x_i p^X_i − ∑_{i=1}^n E(X) p^X_i} {∑_{j=1}^m y_j p^Y_j − ∑_{j=1}^m E(Y) p^Y_j}
  = {E(X) − ∑_{i=1}^n E(X) p^X_i} {E(Y) − ∑_{j=1}^m E(Y) p^Y_j}
    (by definition of E(X) and E(Y))
  = {E(X) − E(X) ∑_{i=1}^n p^X_i} {E(Y) − E(Y) ∑_{j=1}^m p^Y_j}
    (because we can move E(X) and E(Y) outside of ∑_{i=1}^n and ∑_{j=1}^m, respectively)
  = {E(X) − E(X) · 1} · {E(Y) − E(Y) · 1}
  = 0 · 0 = 0.

Equation (10): This is similar to equation (8). Please consider the case of n = m = 2 and convince yourself that (10) holds.

8. Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).

Proof: By the definition of variance,

Var(X + Y) = E[(X + Y − E(X + Y))²].

Then,

Var(X + Y) = E[(X + Y − E(X + Y))²]
  = E[((X − E(X)) + (Y − E(Y)))²]
  = E[(X − E(X))² + (Y − E(Y))² + 2(X − E(X))(Y − E(Y))]
    (because for any a and b, (a + b)² = a² + b² + 2ab)
  = E[(X − E(X))²] + E[(Y − E(Y))²] + 2E[(X − E(X))(Y − E(Y))] (by using (6))
  = Var(X) + Var(Y) + 2Cov(X, Y)
    (by definition of variance and covariance).

9. Var(X − Y) = Var(X) + Var(Y) − 2Cov(X, Y).

Proof: The proof of Var(X − Y) = Var(X) + Var(Y) − 2Cov(X, Y) is similar to the proof of Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y). First, we may show that E(X − Y) = E(X) − E(Y). Then,

Var(X − Y) = E[(X − Y − E(X − Y))²]
  = E[((X − E(X)) − (Y − E(Y)))²]
  = E[(X − E(X))² + (Y − E(Y))² − 2(X − E(X))(Y − E(Y))]
  = E[(X − E(X))²] + E[(Y − E(Y))²] − 2E[(X − E(X))(Y − E(Y))] (by using (6))
  = Var(X) + Var(Y) − 2Cov(X, Y).

10. Define W = (X − E(X))/√Var(X) and Z = (Y − E(Y))/√Var(Y). Show that Cov(W, Z) = Corr(X, Y).

Proof: Expanding Cov(W, Z), we have

Cov(W, Z) = E[(W − E(W))(Z − E(Z))]
  = E[WZ] (because E[W] = E[Z] = 0)
  = E[ (X − E(X))/√Var(X) · (Y − E(Y))/√Var(Y) ]
    (by definition of W and Z)
  = E[ (1/√Var(X)) · (1/√Var(Y)) · (X − E(X))(Y − E(Y)) ]
  = (1/√Var(X)) · (1/√Var(Y)) · E[(X − E(X))(Y − E(Y))] (by using (3))
    (because both 1/√Var(X) and 1/√Var(Y) are constants)
  = E[(X − E(X))(Y − E(Y))] / (√Var(X) √Var(Y))
  = Cov(X, Y) / (√Var(X) √Var(Y)) (by definition of covariance)
  = Corr(X, Y). (by definition of the correlation coefficient)


11. Show that Corr(X, Y) = −1 or 1 if Y = a + bX.

12. Let b be a constant. Show that E[(X − b)²] = E(X²) − 2bE(X) + b². What is the value of b that gives the minimum value of E[(X − b)²]?

13. Let a and b be two constants. What are the values of a and b that give the minimum value of E[(Y − a − bX)²]?

14. Let {x_i : i = 1, ..., n} and {y_i : i = 1, ..., n} be two sequences. Define the averages

x̄ = (1/n) ∑_{i=1}^n x_i,   ȳ = (1/n) ∑_{i=1}^n y_i.

(a) ∑_{i=1}^n (x_i − x̄) = 0.

Proof:

∑_{i=1}^n (x_i − x̄) = ∑_{i=1}^n x_i − ∑_{i=1}^n x̄
  = ∑_{i=1}^n x_i − n x̄
    (because ∑_{i=1}^n x̄ = x̄ + x̄ + ... + x̄ = n x̄)
  = n (∑_{i=1}^n x_i / n) − n x̄
    (because ∑_{i=1}^n x_i = (n/n) ∑_{i=1}^n x_i = n · ∑_{i=1}^n x_i / n)
  = n x̄ − n x̄
    (because x̄ = ∑_{i=1}^n x_i / n)
  = 0.

(b) ∑_{i=1}^n (x_i − x̄)² = ∑_{i=1}^n x_i (x_i − x̄).


Proof: We use the result of part (a) above.

∑_{i=1}^n (x_i − x̄)² = ∑_{i=1}^n (x_i − x̄)(x_i − x̄)
  = ∑_{i=1}^n x_i (x_i − x̄) − ∑_{i=1}^n x̄ (x_i − x̄)
  = ∑_{i=1}^n x_i (x_i − x̄) − x̄ ∑_{i=1}^n (x_i − x̄)
    (because x̄ is a constant and does not depend on i)
  = ∑_{i=1}^n x_i (x_i − x̄) − x̄ · 0
    (because ∑_{i=1}^n (x_i − x̄) = 0, as shown above)
  = ∑_{i=1}^n x_i (x_i − x̄).

(c) ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) = ∑_{i=1}^n y_i (x_i − x̄) = ∑_{i=1}^n x_i (y_i − ȳ).

Proof: The proof is similar to the proof of part (b) above.

∑_{i=1}^n (x_i − x̄)(y_i − ȳ) = ∑_{i=1}^n (x_i − x̄) y_i − ∑_{i=1}^n (x_i − x̄) ȳ
  = ∑_{i=1}^n (x_i − x̄) y_i − ȳ ∑_{i=1}^n (x_i − x̄)
  = ∑_{i=1}^n (x_i − x̄) y_i − ȳ · 0
  = ∑_{i=1}^n y_i (x_i − x̄).

Also,

∑_{i=1}^n (x_i − x̄)(y_i − ȳ) = ∑_{i=1}^n x_i (y_i − ȳ) − ∑_{i=1}^n x̄ (y_i − ȳ)
  = ∑_{i=1}^n x_i (y_i − ȳ) − x̄ ∑_{i=1}^n (y_i − ȳ)
  = ∑_{i=1}^n x_i (y_i − ȳ) − x̄ · 0
  = ∑_{i=1}^n x_i (y_i − ȳ).
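These sample identities in 14(a)-(c) are easy to confirm numerically. The Python sketch below (not part of the original notes; the two sequences are arbitrary assumed values) checks each identity.

```python
# Numeric check of the deviation-from-the-mean identities in item 14.
x = [1.0, 4.0, 2.0, 7.0]     # assumed sequence
y = [3.0, -1.0, 0.0, 5.0]    # assumed sequence
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

print(sum(xi - xbar for xi in x))                                # ~0, identity (a)
print(sum((xi - xbar) ** 2 for xi in x),
      sum(xi * (xi - xbar) for xi in x))                         # equal, identity (b)
print(sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)),
      sum(yi * (xi - xbar) for xi, yi in zip(x, y)),
      sum(xi * (yi - ybar) for xi, yi in zip(x, y)))             # all equal, identity (c)
```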

Conditional Mean and Conditional Variance

Let X and Y be two discrete random variables. The set of possible values for X is {x_1, ..., x_n}, and the set of possible values for Y is {y_1, ..., y_m}. We may define the conditional probability function of Y given X as

p^{Y|X}_{ij} = P(Y = y_j | X = x_i) = P(X = x_i, Y = y_j) / P(X = x_i) = p^{X,Y}_{ij} / p^X_i,

where p^{X,Y}_{ij} = P(X = x_i, Y = y_j) and p^X_i = P(X = x_i). The conditional mean of Y given X = x_i is given by

E_Y[Y|X = x_i] = ∑_{j=1}^m y_j P(Y = y_j | X = x_i) = ∑_{j=1}^m y_j p^{Y|X}_{ij},

where the symbol E_Y indicates that the expectation is taken treating Y as a random variable. The conditional variance of Y given X = x_i is given by

Var(Y|X = x_i) = E[(Y − E[Y|X = x_i])² | X = x_i] = ∑_{j=1}^m (y_j − E[Y|X = x_i])² p^{Y|X}_{ij}.

Table 4: Joint Distribution and Conditional Distribution of University Degree and Annual Incomes (in thousand $)

                       Y = 30   Y = 60   Y = 100   Marginal Dist. of X
X = 0                  0.24     0.12     0.04      0.40
X = 1                  0.12     0.36     0.12      0.60
Marginal Dist. of Y    0.36     0.48     0.16      1.00

Table 5: Conditional Distribution of Annual Incomes given University Degree

                       Y = 30            Y = 60            Y = 100           Marginal Dist. of X
P(Y|X = 0)             0.24/0.40 = 0.6   0.12/0.40 = 0.3   0.04/0.40 = 0.1   0.40
P(Y|X = 1)             0.12/0.60 = 0.2   0.36/0.60 = 0.6   0.12/0.60 = 0.2   0.60
Marginal Dist. of Y    0.36              0.48              0.16              1.00

Consider Example 8 to understand these notations. Table 4 replicates Table 1. The conditional distribution of Y given X = 0 or X = 1 in Example 8 is given in Table 5.

The conditional expectation of Y given X = 0 or 1 is computed as:

E_Y[Y|X = 0] = ∑_{j=1}^3 y_j P(Y = y_j|X = 0) = 30 × 0.6 + 60 × 0.3 + 100 × 0.1 = 46,

E_Y[Y|X = 1] = ∑_{j=1}^3 y_j P(Y = y_j|X = 1) = 30 × 0.2 + 60 × 0.6 + 100 × 0.2 = 62.

The conditional mean of Y given X can be written as E_Y[Y|X] without specifying a value of X. If we view X as a random variable, then E_Y[Y|X] is a random variable because the value of E_Y[Y|X] depends on a realization of X. The following shows that the unconditional mean of Y is equal to the expected value of E_Y[Y|X], where the expectation is taken with respect to X.


1. Law of Iterated Expectation: E_Y[Y] = E_X[E_Y[Y|X]].

Proof: Because E_Y[Y|X = x_i] = ∑_{j=1}^m y_j p^{Y|X}_{ij}, we have

E_X[E_Y[Y|X]] = ∑_{i=1}^n E_Y[Y|X = x_i] p^X_i
  = ∑_{i=1}^n (∑_{j=1}^m y_j p^{Y|X}_{ij}) p^X_i
  = ∑_{i=1}^n ∑_{j=1}^m y_j (p^{X,Y}_{ij} / p^X_i) p^X_i
  = ∑_{i=1}^n ∑_{j=1}^m y_j p^{X,Y}_{ij}
  = ∑_{j=1}^m y_j ∑_{i=1}^n p^{X,Y}_{ij}
  = ∑_{j=1}^m y_j p^Y_j = E_Y[Y].

We can confirm E_Y[Y] = E_X[E_Y[Y|X]] in Example 8 as follows. The mean of Y is

E_Y[Y] = ∑_{j=1}^m y_j p^Y_j = 30 × 0.36 + 60 × 0.48 + 100 × 0.16 = 55.6.

Now, we can compute E_X[E_Y[Y|X]] as the weighted average of E_Y[Y|X = 0] and E_Y[Y|X = 1]:

E_X[E_Y[Y|X]] = E_Y[Y|X = 0] × P(X = 0) + E_Y[Y|X = 1] × P(X = 1) = 46 × 0.4 + 62 × 0.6 = 55.6.

Therefore, E_Y[Y] = E_X[E_Y[Y|X]] in Example 8. The above proof shows that this is always the case.

2. Var(Y|X) = E[Y²|X] − {E[Y|X]}².

Proof:

Var(Y|X) = E_Y[(Y − E_Y[Y|X])² | X]
  = E_Y[Y² − 2Y E_Y[Y|X] + (E_Y[Y|X])² | X]
  = E_Y[Y²|X] − 2E_Y[Y|X] E_Y[Y|X] + (E_Y[Y|X])²
  = E_Y[Y²|X] − (E_Y[Y|X])².

3. For constants a and b, Var(a + bY|X) = b² Var(Y|X).

Proof:

Var(a + bY|X) = E_Y[(a + bY − {a + b E_Y[Y|X]})² | X]
  = E_Y[b² (Y − E_Y[Y|X])² | X]
  = b² E_Y[(Y − E_Y[Y|X])² | X]
  = b² Var(Y|X).


4. For any functions g_1(x) and g_2(x), we have E[g_1(X) + g_2(X)Y | X] = g_1(X) + g_2(X)E[Y|X].

5. E[X(Y − E[Y|X])] = 0.

6. If Y is mean independent of X, i.e., E_Y[Y|X] = E_Y[Y], then Cov(Y, X) = 0.
