Notes on Set Theory and Probability Theory

Michelle Alexopoulos

August 2003


0.1. Set Theory and Probability Theory

Before we talk about probability, it is useful to review some basic definitions and

theorems from set theory.

Definition 1. A set is a collection of objects.

Definition 2. If every element in a set A is also a member of set B then A is a

subset of B, i.e., A ⊂ B.

Definition 3. Two sets, A and B, are equal, denoted A = B, if and only if every

element in A belongs to the set B and every element in B belongs to set A, i.e.,

A ⊆ B and A ⊇ B.

Definition 4. B is a proper subset of A if B is a subset of A, but B does not

equal A.

Definition 5. The empty set, or null set, is a set which contains no elements,

and is denoted by the symbol ∅.

Definition 6. Suppose that A ⊂ S. The complement of set A, denoted Ā or

Aᶜ, is the set containing all elements in S that are not in A, i.e.,

Aᶜ = {γ : γ ∈ S and γ ∉ A}


Definition 7. The union of sets A and B, denoted A ∪ B, is the set containing

all elements in either A or B or both. i.e.,

A ∪ B = {γ : γ ∈ A or γ ∈ B}.

Definition 8. The intersection of sets A and B, denoted A ∩ B, is the set

containing all elements in both A and B, i.e.,

A ∩ B = {γ : γ ∈ A and γ ∈ B}.

Definition 9. Two sets, A and B, are called disjoint or mutually exclusive if they

contain no common elements, i.e., if A ∩ B = ∅.

Definition 10. The set of all possible outcomes of a random experiment is called

the sample space (or universal set) and is denoted by U .

Some Theorems involving sets:

Theorem 11. A ∪B = B ∪A (commutative law for unions)

Theorem 12. (A ∪B) ∪ C = A ∪ (B ∪ C) (associative law for unions)

Theorem 13. A ∩B = B ∩A (commutative law for intersections)

Theorem 14. (A ∩B) ∩ C = A ∩ (B ∩ C) (associative law for intersections)


Theorem 15. A ∩ (B ∪ C) = (A ∩B) ∪ (A ∩ C) (First distributive law)

Theorem 16. A ∪ (B ∩ C) = (A ∪B) ∩ (A ∪ C) (Second distributive law)

Theorem 17. If A ⊂ B then Aᶜ ⊃ Bᶜ

Theorem 18. A ∪ ∅ = A and A ∩ ∅ = ∅

Theorem 19. (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ (De Morgan's first law)

Theorem 20. (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ (De Morgan's second law)
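These identities are easy to check mechanically on small finite sets. The sketch below uses Python's built-in set type; the particular sets S, A, B, and C are arbitrary choices for illustration.

```python
# Verify the set-theory laws above on small finite sets.
S = set(range(10))          # universal set (arbitrary choice)
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {4, 6, 8}

comp = lambda X: S - X      # complement relative to S

# Commutative and associative laws (Theorems 11-14)
assert A | B == B | A and (A | B) | C == A | (B | C)
assert A & B == B & A and (A & B) & C == A & (B & C)

# Distributive laws (Theorems 15-16)
assert A & (B | C) == (A & B) | (A & C)
assert A | (B & C) == (A | B) & (A | C)

# De Morgan's laws (Theorems 19-20)
assert comp(A | B) == comp(A) & comp(B)
assert comp(A & B) == comp(A) | comp(B)
```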

0.2. Probability Space

Basic Probability theory is defined using a triple (S,F , P ) where S is the sample

space, F is the collection of events, and P is a function that maps F into the

interval [0,1]. P is the probability measure and intuitively F is the set of all

events that can be verified to have occurred or not occurred. I will discuss these

objects in more detail below. However, for the most part, I will follow the notation

used in standard textbooks like Greene’s Econometric Analysis.

Definition 21. Sample space S: a set of elements of interest.


In elementary probability theory, we usually associate these elements with

outcomes of an experiment. For example, consider a simple experiment of a single

toss of a coin. The sample space of this experiment is

S = {H, T},

where H stands for “head” and T for “tail”. For an experiment of three tosses of

a coin, then the sample space is

S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

More generally, the sample space of an experiment of n tosses of a coin is

S = {ω : ω = (a1, ..., an), ai = H or T}.

An alternate example is one where an individual rolls a die once. The sample

space of this experiment is:

S = {1, 2, 3, 4, 5, 6}


and if we rolled the die n times, the sample space of this experiment would be:

S = {ω : ω = (a1, ..., an), ai = 1, 2, 3, 4, 5 or 6}.

In modern probability theory, the sample space can be fairly general and abstract.

For example, it can be the collection of all real numbers, R, or the collection of

all n-dimensional vectors, Rn, or any subset of these collections.

The Axioms of Probability: Suppose we have a sample space S. If S is discrete,

then all subsets correspond to events, but if S is continuous, only measurable

subsets correspond to events.

To each event A in the class of events C, we associate a real number, P(A), i.e.,

P is a real-valued function defined on C. Then P is the probability function and

P(A) is the probability of the event A, if the following axioms are satisfied:

Axiom 1: For every event A, P (A) ≥ 0.

Axiom 2: For the sure or certain event S, P (S) = 1.

Axiom 3: For any countable collection of mutually exclusive events A1, A2, ...,

P(A1 ∪ A2 ∪ ...) = P(A1) + P(A2) + ...
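As a concrete illustration of the axioms, consider a single roll of a fair die. The sketch below (an illustrative example, not from the notes) builds a probability function P from the outcome probabilities and checks each axiom by enumeration.

```python
from fractions import Fraction

# Sample space for one roll of a fair die; each outcome gets probability 1/6.
S = {1, 2, 3, 4, 5, 6}
p = {s: Fraction(1, 6) for s in S}

def P(event):
    """Probability of an event (any subset of S)."""
    return sum(p[s] for s in event)

# Axiom 1: nonnegativity of every event's probability
assert all(P({s}) >= 0 for s in S)

# Axiom 2: the sure event has probability 1
assert P(S) == 1

# Axiom 3: additivity over mutually exclusive events
odd, even = {1, 3, 5}, {2, 4, 6}
assert P(odd | even) == P(odd) + P(even)
```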

Some Theorems on Probability:


Theorem 22. If A1 ⊆ A2, then P (A1) ≤ P (A2) and P (A2−A1) = P (A2)−P (A1)

Theorem 23. For every event A, 0 ≤ P (A) ≤ 1.

Theorem 24. P (∅) = 0 (i.e., the impossible event has probability 0)

Theorem 25. If Aᶜ is the complement of A, then P(Aᶜ) = 1 − P(A)

Theorem 26. If A = A1 ∪ A2 ∪ ... ∪ An, where A1, ..., An are mutually exclusive

events, then P(A) = P(A1) + P(A2) + ... + P(An).

Theorem 27. If A and B are any two events, then P (A ∪B) = P (A) + P (B)−

P (A ∩B).

Conditional Probability: Let A and B be two events such that P (A) > 0.

Let P (B|A) denote the probability of B given that A has occurred. Since A has

already occurred, it becomes the new sample space. From this, we are led to the

definition of P (B|A):

P(B|A) = P(A ∩ B)/P(A)

or P(A ∩ B) = P(A)P(B|A)

Definition 28. If P(B|A)=P(B), then A and B are independent events.

This is equivalent to P(A ∩B) = P (A)P (B).
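Conditional probability can be illustrated by enumerating a finite sample space. The example below (two rolls of a fair die, chosen for illustration) computes P(B|A) from the definition and checks an independence statement.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs from two rolls of a fair die.
S = list(product(range(1, 7), repeat=2))

def P(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in S if event(w)), len(S))

A = lambda w: w[0] == 6          # first roll is a six
B = lambda w: w[0] + w[1] >= 10  # total is at least ten

P_A_and_B = P(lambda w: A(w) and B(w))

# Definition of conditional probability: P(B|A) = P(A ∩ B)/P(A)
P_B_given_A = P_A_and_B / P(A)
assert P_B_given_A == Fraction(1, 2)   # given a first six, need second roll >= 4

# The two rolls themselves are independent events...
assert P(lambda w: w[0] == 6 and w[1] == 6) == P(lambda w: w[0] == 6) * P(lambda w: w[1] == 6)
# ...but A and B are not: P(A ∩ B) != P(A)P(B)
assert P_A_and_B != P(A) * P(B)
```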


Theorem 29. Bayes Rule: Suppose that A1, A2, ..., An are mutually exclusive

events whose union is the sample space, S. Then if A is any event:

P(Ak|A) = P(Ak)P(A|Ak) / [P(A1)P(A|A1) + ... + P(An)P(A|An)]
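A small worked example of Bayes Rule; the two-machine setup and all the numbers in it are hypothetical, chosen only to illustrate the formula.

```python
from fractions import Fraction

# Hypothetical setup (not from the notes): a factory has two machines.
# A1 = item came from machine 1, A2 = from machine 2; these partition S.
P_A = {1: Fraction(3, 5), 2: Fraction(2, 5)}          # P(A1), P(A2)

# A = the item is defective; the conditional defect rates are assumed.
P_defect_given = {1: Fraction(1, 100), 2: Fraction(3, 100)}

# Bayes Rule: P(Ak|A) = P(Ak)P(A|Ak) / sum_i P(Ai)P(A|Ai)
denominator = sum(P_A[i] * P_defect_given[i] for i in (1, 2))
posterior = {k: P_A[k] * P_defect_given[k] / denominator for k in (1, 2)}

# The posterior probabilities over the partition still sum to one.
assert posterior[1] + posterior[2] == 1
```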

Definition 30. Two events A and B are said to be independent if and only if

P (A ∩B) = P (A)P (B)

Counting: Suppose that we are given n distinct objects

Definition 31. nPr is the number of permutations of n objects taken r at a time:

nPr = n!/(n − r)!

Definition 32. nCr is the number of combinations of n objects taken r at a time:

nCr = (n choose r) = n!/[r!(n − r)!]

nPr is used when order matters. For example if we want to find out how many

different permutations consisting of three letters each can be formed from the 4


letters A,B, C, and D, the answer is given by

4P3 = 4!/(4 − 3)! = 4 · 3 · 2 · 1 = 24

In this case order matters, i.e., ABC is a different permutation than ACB, or

BAC, or CAB, etc.

If we only want to know how many ways three letters can be chosen from the

set of letters A,B, C and D, then the answer is given by

4C3 = 4!/[3!(4 − 3)!] = 4

In this case order does not matter, so there are only 4 possibilities, i.e., (1)

A, B, and C, (2) A, C, and D, (3) A, B, and D, and (4) B, C, and D.
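Python's standard library exposes these counting functions directly (math.perm and math.comb, available since Python 3.8), which makes the 4P3 and 4C3 examples above easy to check:

```python
from math import comb, factorial, perm

# nPr = n!/(n - r)!, computed from the definition
def nPr(n, r):
    return factorial(n) // factorial(n - r)

assert nPr(4, 3) == perm(4, 3) == 24   # ordered choices of 3 letters from {A, B, C, D}
assert comb(4, 3) == 4                 # unordered choices of 3 letters

# Each combination of r objects can be ordered r! ways, so nPr = nCr * r!
assert nPr(4, 3) == comb(4, 3) * factorial(3)
```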

Other Commonly Used Definitions: If you are reading more advanced books

on probability theory, you will often find terms like sigma-field and sigma-algebra

and references to measurable functions and measure spaces. I will now briefly

turn to these.

Partition and Information: A partition of a set S is a finite collection A =

{A1, A2, ..., AN} of disjoint subsets of S whose union is S.

Examples: S = {−0.05, −0.01, 0, 0.01, 0.05}, and

A0 = {S},

A1 = {{−0.05, −0.01}, {0}, {0.01, 0.05}},

A2 = {{−0.05}, {−0.01}, {0}, {0.01}, {0.05}}.

Ai, i = 0, 1, 2, can each be thought of as representing the information an agent

may have. Suppose that the numbers in S represent all possible returns in the

stock market in a day. Then, after the return is realized, an agent with information

partition A0 has effectively no information about the return, while an agent with

information A1 can tell whether the return is positive, zero, or negative, and an

agent with information A2 knows exactly what the return is. So, these three

partitions represent progressively more information.

Given a partition A = {A1, A2, ..., AN}, an agent may assign different

probabilities to each of the events in the partition, P1, ..., PN. Based on these probabilities,

the agent should also be able to deduce the probabilities of events such as A1 ∪ A2

or A1ᶜ:

P(A1 ∪ A2) = P(A1) + P(A2), P(A1ᶜ) = 1 − P(A1)


This motivates the following definition of measurable sets, which can be thought of

as all events that can be assigned a probability.

Above, I mentioned that F includes all outcomes on the sample space that

can be verified to have occurred or not occurred. Basically this means that if

set A is an event, then its complement, A0 (i.e., not A) must also be an event.

Furthermore, if A and B are events then we need to be able to determine: (a)

if both A and B happened and (b) if either A or B (or both) occurred. Thus,

A ∩ B and A ∪ B are also events. An algebra (or field) is a collection of subsets

that is closed under complementation, intersection and union. For our purposes,

we will also require F to be closed under countable unions/intersections, and we

will refer to F as a σ-algebra (σ-field). The next definition states these ideas

more formally.

Definition 33. A σ-field F is a collection of subsets of a sample space S with

the following properties:

(i) The empty set ∅ ∈ F.

(ii) If A ∈ F, then the complement Aᶜ ∈ F.

(iii) If Ai ∈ F, i = 1, 2, ..., then their union ∪Ai ∈ F.


Note that if Ai ∈ F for i = 1, 2, then Aiᶜ ∈ F, which implies that A1ᶜ ∪ A2ᶜ ∈ F.

Thus, (A1ᶜ ∪ A2ᶜ)ᶜ ∈ F. However,

(A1ᶜ ∪ A2ᶜ)ᶜ ≡ {ω ∈ S : ω ∉ A1ᶜ and ω ∉ A2ᶜ}

= {ω ∈ S : ω ∈ A1 and ω ∈ A2}

≡ A1 ∩ A2.

So, A1 ∩ A2 ∈ F.

Definition 34. A pair (S, F) is called a measurable space, and any subset in F

is called a measurable set or event.

Examples:

(i) F = {∅, S}

(ii) F = {∅, A, Aᶜ, S} = {∅, {−0.05, −0.01}, {0, 0.01, 0.05}, S}

Definition 35. σ(C): smallest σ-field that contains the collection of subsets, C.

Examples:

(i) σ(A) = {∅, A, Aᶜ, S}.

(ii) C = {A1, A2}, σ(C) = {∅, A1, A2, A1ᶜ, A2ᶜ, A1 ∪ A2, A1ᶜ ∪ A2ᶜ, A1 ∩ A2, A1ᶜ ∩ A2, A1 ∩ A2ᶜ, A1 ∪ A2ᶜ, A1ᶜ ∪ A2, ..., S}


(iii) B, the σ-field generated by all the open intervals in R. We call all the

subsets in B Borel sets.

Let A and A′ be two partitions of S. We say that the information represented by

A is finer than that represented by A′ if σ(A′) ⊂ σ(A).

Definition 36. A measure is a set function v defined on F such that:

(i) 0 ≤ v(A) ≤ ∞ for any A ∈ F.

(ii) v(∅) = 0.

(iii) If Ai ∈ F, i = 1, 2, ..., and Ai ∩ Aj = ∅ for any i ≠ j, then

v(∪_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} v(Ai).

Examples:

Counting measure: S = {a1, a2, a3, ...}, F contains all the subsets of S, and

v(A) = number of elements in subset A.

Lebesgue measure: S = R, F = B, and

v((a, b)) = b − a.


Proposition 37. For a measure space (S,F , v), we have

(i) If A ⊂ B, then v(A) ≤ v(B)

(ii) For any sequence A1, A2, ...,

v(∪_{i=1}^{∞} Ai) ≤ Σ_{i=1}^{∞} v(Ai)

(iii) If A1 ⊂ A2 ⊂ A3 ⊂ ... (or A1 ⊃ A2 ⊃ A3 ⊃ ...), then

v(lim_{n→∞} An) = v(∪_{i=1}^{∞} Ai) = lim_{n→∞} v(An)

(or

v(lim_{n→∞} An) = v(∩_{i=1}^{∞} Ai) = lim_{n→∞} v(An)

if v(A1) < ∞).

Proof: (i) Let C = B ∩ Aᶜ; then C ∈ F and

v(B) = v(A ∪ C) = v(A) + v(C) ≥ v(A)

because v(C) ≥ 0.

(ii) Let C1 = A1, C2 = A2 ∩ C1ᶜ, C3 = A3 ∩ C2ᶜ ∩ C1ᶜ, .... Then Ci, i = 1, 2, ... is

a sequence of disjoint sets such that ∪_{i=1}^{∞} Ai = ∪_{i=1}^{∞} Ci and, by (i), v(Ci) ≤ v(Ai).

Thus, we have

v(∪_{i=1}^{∞} Ai) = v(∪_{i=1}^{∞} Ci) = Σ_{i=1}^{∞} v(Ci) ≤ Σ_{i=1}^{∞} v(Ai).

(iii) If An is an increasing sequence, let A0 = ∅, and Dn = An − An−1 ≡

An ∩ An−1ᶜ for n ≥ 1. Then Dn, n = 1, 2, ... is a sequence of disjoint sets such

that ∪_{n=1}^{∞} An = ∪_{n=1}^{∞} Dn. By the definition of a measure, we have

v(∪_{n=1}^{∞} An) = v(∪_{n=1}^{∞} Dn) = Σ_{n=1}^{∞} v(Dn)

= lim_{n→∞} Σ_{i=1}^{n} v(Di)

= lim_{n→∞} Σ_{i=1}^{n} [v(Ai) − v(Ai−1)]

= lim_{n→∞} v(An).

Now, if An is a decreasing sequence such that v(A1) < ∞, then Bn = A1 − An

is an increasing sequence. From what we just proved, we have

v(∪_{n=1}^{∞} Bn) = lim_{n→∞} v(Bn) = v(A1) − lim_{n→∞} v(An).


However, ∩_{n=1}^{∞} An = A1 − (∪_{n=1}^{∞} Bn), so

v(∩_{n=1}^{∞} An) = v(A1) − v(∪_{n=1}^{∞} Bn) = lim_{n→∞} v(An).

Q.E.D.

If v(S) = 1, then v is called a probability measure.

Proposition 38. Let (S, F) be a measurable space. (Here, a function f : S → R is called measurable if {ω ∈ S : f(ω) ≤ x} ∈ F for every real number x.)

• (i) If f and g are measurable, then so are fg and af + bg, where a and b are

two real numbers; also, f/g is measurable provided g(ω) ≠ 0 for any ω ∈ S.

(ii) If f1, f2, ... are measurable, then so are sup_n fn and inf_n fn. Furthermore, if

lim_{n→∞} fn exists, then it is also measurable.

(iii) Suppose that f is a measurable function on (S, F) and g a measurable

function on (R, B); then the composite function g ∘ f defined by (g ∘ f)(ω) =

g(f(ω)) is also a measurable function.

(iv) If f is a continuous function on (R,B), then f is measurable.

Proposition 39. Let f and g be measurable functions on a measure space (S, F, v).

• (i) ∫(af + bg)dv = a∫f dv + b∫g dv.

(ii) If f = g a.e., then ∫f dv = ∫g dv.

(iii) If f ≤ g a.e., then ∫f dv ≤ ∫g dv.

(iv) If f ≥ 0 a.e. and ∫f dv = 0, then f = 0 a.e.

(v) If f ≥ 0 a.e. and ∫f dv = 1, then the set function

P(B) = ∫_B f dv

is a probability measure on (S,F). The function f is called the probability

density function (p.d.f.) of P with respect to measure v.

(vi) If fn → f a.e., |fn| ≤ g, and ∫g dv < ∞, then

lim_{n→∞} ∫fn dv = ∫f dv.

(vii) If |∂f(ω, θ)/∂θ| ≤ g(ω) a.e., and ∫g dv < ∞, then

d/dθ [∫f(ω, θ)dv] = ∫ [∂f(ω, θ)/∂θ] dv.


0.3. Random Variables and Probability Distribution

Definition 40. Consider a random experiment with sample space S. A random

variable X(ξ) is a single valued real function that assigns a real number to each

sample point ξ of S. Often we use a single letter X for this function in place of

X(ξ).

Probability Distribution:

Definition 41. A listing of the values x taken by a random variable X and their

associated probabilities is a probability distribution, f(x).

Definition 42. The distribution function [or cumulative distribution function

(c.d.f)] of X is the function defined by :

FX(x) ≡ P (X ≤ x), −∞ < x <∞.

Properties of FX(x) :

1. 0 ≤ FX(x) ≤ 1

2. FX(x1) ≤ FX(x2) if x1 < x2 (i.e., non-decreasing)

3. limx→∞FX(x) = FX(∞) = 1


4. limx→−∞FX(x) = FX(−∞) = 0

5. lim_{x→a+} FX(x) = FX(a+) = FX(a), where a+ = lim_{ε→0, ε>0} (a + ε) (i.e., FX is right

continuous)

**Note that a distribution may not be left continuous.

Definition 43. Let X be a random variable with cdf FX(x). X is a discrete

random variable only if its range contains a finite or countably infinite number of

points. Alternatively, if FX(x) changes values only in jumps (at most a countable

number of them) and is constant between jumps, then X is called a discrete random

variable.

Definition 44. Suppose that jumps in FX(x) of a discrete random variable X

occur at the points x1, x2, ... where the sequence may be either finite or countably

infinite, and we assume xi < xj if i < j, then:

FX(xi)− FX(xi−1) = P (X ≤ xi)− P (X ≤ xi−1) = P (X = xi)

Let pX(x) = P (X = x). The function pX(x) is called a probability mass

function (pmf) of the discrete random variable X.


Properties of pX(x) :

1. 0 ≤ pX(xk) ≤ 1 for k = 1, 2, ...

2. pX(x) = 0 if x 6= xk for k = 1, 2, ...

3. Σ_k pX(xk) = 1

The probability distribution for a discrete random variable is f(x) = pX(x) =

P (X = x) and the c.d.f. FX(x) of a discrete random variable X can be obtained

by:

FX(x) = P(X ≤ x) = Σ_{xk ≤ x} pX(xk)
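The relation between a pmf and its cdf can be sketched as follows; the particular pmf used (number of heads in two fair coin tosses) is an illustrative choice.

```python
from fractions import Fraction

# pmf of a discrete random variable: number of heads in two fair coin tosses
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
assert sum(pmf.values()) == 1          # property 3 of a pmf

def F(x):
    """cdf: F(x) = P(X <= x) = sum of pX(xk) over points xk <= x."""
    return sum(p for xk, p in pmf.items() if xk <= x)

assert F(-1) == 0                      # below the support
assert F(0) == Fraction(1, 4)
assert F(1.5) == Fraction(3, 4)        # F is constant between jumps
assert F(2) == 1
```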

Definition 45. Let X be a random variable with cdf FX(x). X is a continuous

random variable only if its range contains an interval (either finite or infinite)

of real numbers. Alternatively, if FX(x) is continuous and also has a derivative

dFX(x)/dx which exists everywhere except at possibly a finite number of points

and is piecewise continuous, then X is called a continuous random variable.

For the case of a continuous random variable, the probability associated with

any particular point is zero, (i.e., P(X=x)=0). However, we can assign a positive

probability to intervals in the range of x.


Definition 46. Let f(x) = dFX(x)/dx. The function f(x) is called the probability

density function (pdf) of the continuous random variable X.

Properties of f(x) :

1. f(x) ≥ 0

2. ∫_{−∞}^{∞} f(x)dx = 1

3. f(x) is piecewise continuous

4. P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b) = ∫_{a}^{b} f(x)dx

The cumulative distribution function for the continuous random variable X is

FX(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t)dt

Furthermore, from the definition of the cdf we know that

P(a < X ≤ b) = FX(b) − FX(a)

**Note that many books write FX(x) as F (x).


0.4. Expectation of a Random Variable

Definition 47. Mean of a Random Variable: The mean, or expected value, of a

random variable is:

E[X] = Σ_x x f(x) if X is discrete, or E[X] = ∫_{−∞}^{∞} x f(x)dx if X is continuous.

The mean is normally denoted by µ.

Proposition 48. Let g(x) be a function of x. The expected value of g(X) is:

E[g(X)] = Σ_x g(x)f(x) if X is discrete, or E[g(X)] = ∫_{−∞}^{∞} g(x)f(x)dx if X is continuous.

Definition 49. Variance of a Random Variable: The variance of a random

variable is:

Var[X] = E[(X − µ)²] = Σ_x (x − µ)² f(x) if X is discrete, or ∫_{−∞}^{∞} (x − µ)² f(x)dx if X is continuous,

where µ = E(X).


The variance is usually denoted by σ2.

The variance is conveniently computed according to the following equation:

Var(X) = σ² = E(X²) − µ²
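A quick check of the shortcut formula on a fair die roll (an illustrative choice of distribution):

```python
from fractions import Fraction

# X = outcome of one fair die roll; f(x) = 1/6 for x = 1, ..., 6
f = {x: Fraction(1, 6) for x in range(1, 7)}

mu = sum(x * p for x, p in f.items())                 # E[X]
EX2 = sum(x**2 * p for x, p in f.items())             # E[X^2]
var_direct = sum((x - mu)**2 * p for x, p in f.items())
var_shortcut = EX2 - mu**2                            # Var(X) = E(X^2) - mu^2

assert mu == Fraction(7, 2)
assert var_direct == var_shortcut == Fraction(35, 12)
```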

Properties of expectations and variances: Let a and b be constants, and

let X and Y be random variables.

1. E(a) = a and V ar(a) = 0

2. E(aX)=aE(X)

3. E(X+Y)=E(X)+E(Y)

4. E(XY)=E(X)E(Y) if X and Y are independent random variables.

5. Var(aX) = a²Var(X)

6. If X and Y are independent random variables,

V ar(X + Y ) = V ar(X) + V ar(Y )

V ar(X − Y ) = V ar(X) + V ar(Y )


The Normal distribution: In econometrics you will often use the Normal

distribution. The general form of a normal distribution with mean µ and variance

σ² is

f(x|µ, σ²) = [1/√(2πσ²)] exp{−(1/2)[(x − µ)²/σ²]}

We usually denote the fact that x has a normal distribution by writing x ∼

N[µ, σ²], which reads "x is normally distributed with mean µ and variance σ²."

Properties of a normal distribution:

1. If x ∼ N[µ, σ²], then a + bx ∼ N[a + bµ, b²σ²] where a and b are constants.

If a = −µ/σ and b = 1/σ, then letting z = a + bx we find that z ∼ N[0, 1]. N[0, 1]

is called the standard normal distribution and has the density function

φ(z) = [1/√(2π)] exp{−z²/2}

The notation φ(z) is often used to denote the standard normal distribution,

and Φ(z) is often used for its cdf.
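The standard normal cdf Φ(z) has no closed form, but it can be written in terms of the error function in Python's math module; a minimal sketch:

```python
from math import erf, exp, pi, sqrt

def phi(z):
    """Standard normal density: phi(z) = (1/sqrt(2*pi)) exp(-z^2/2)."""
    return exp(-z * z / 2) / sqrt(2 * pi)

def Phi(z):
    """Standard normal cdf, expressed via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

assert abs(Phi(0) - 0.5) < 1e-12           # symmetry about zero
assert abs(Phi(1.96) - 0.975) < 1e-3       # the familiar 95% two-sided cutoff
assert abs(Phi(1) + Phi(-1) - 1) < 1e-12   # Phi(-z) = 1 - Phi(z)
```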

2. If z ∼ N[0, 1] then z² ∼ χ²[1], where χ²[1] is the chi-squared distribution

with one degree of freedom.


3. If z1, z2, ..., zn are independent random variables and zi ∼ N[0, 1] for all i,

then Σ_{i=1}^{n} zi² ∼ χ²[n]

You will also find that the t-distribution converges to the standard normal

distribution as its degrees of freedom grow.

0.5. Distribution and Expectation of Random Vectors

Above we discussed the case where we have one random variable (i.e., the

univariate case). However, in many instances we will want to consider the case where

we have multiple random variables. The good news is that the concepts described

above can be fairly easily extended to the case of n random variables (the

multivariate case).

Definition 50. Given an experiment, the n-tuple of random variables (X1, X2, ...,Xn)

is referred to as an n-dimensional random vector (or n-variate random variable),

if each Xi associates a real number with every sample point ξ in S.


Let X denote a random vector in Rn. The vector X = (X1,X2, ..., Xn) takes

on the values in Rn according to the following joint probability distribution (cdf):

F (x) = P (X ≤ x) where this equality is given by

FX1X2...Xn(x1, x2, ..., xn) = P (X1 ≤ x1,X2 ≤ x2, ..., Xn ≤ xn)

For this case we have that FX1X2...Xn(∞,∞, ...,∞) = 1.

The marginal joint cdfs are obtained from this one by setting the appropriate

Xi's to ∞. For example, the bivariate distribution for x1 and x2 is given by

FX1X2(x1, x2) = FX1X2...Xn(x1, x2,∞,∞, ...,∞)

For the discrete n-variate random variable, the joint pmf is defined by:

pX1X2...Xn(x1, x2, ..., xn) = P (X1 = x1,X2 = x2, ..., Xn = xn)

Properties of pX1X2...Xn(x1, x2, ..., xn) :

1. 0 ≤ pX1X2...Xn(x1, x2, ..., xn) ≤ 1

2. Σ_{x1} Σ_{x2} ... Σ_{xn} pX1X2...Xn(x1, x2, ..., xn) = 1


3. The marginal pmf of one random variable (or set of random variables) is

found by summing pX1X2...Xn(x1, x2, ..., xn) over the ranges of the other

variables, e.g.,

pX1X2...Xn−k(x1, x2, ..., xn−k) = Σ_{xn−k+1} Σ_{xn−k+2} ... Σ_{xn} pX1X2...Xn(x1, x2, ..., xn)

4. Conditional pmfs are then defined in a straightforward manner. For example:

pXn|X1,X2...Xn−1(xn|x1, x2, ..., xn−1) = pX1X2...Xn(x1, x2, ..., xn) / pX1X2...Xn−1(x1, x2, ..., xn−1)

If we are dealing with a continuous n-variate random variable, then if FX has

a pdf f , that is,

FX(x1, ..., xn) = ∫_{−∞}^{x1} ... ∫_{−∞}^{xn} f(z1, ..., zn)dz1...dzn

for some function f , then, we can generally find the joint pdf for a continuous

n-variate random variable by:

fX1X2...Xn(x1, ..., xn) = ∂ⁿFX1X2...Xn(x1, ..., xn) / (∂x1∂x2...∂xn)


If we know the joint distribution function of a random vectorX, then we also know

the joint distribution of any subvector of X. For example, the joint distribution

of X(k) = (X1, ..., Xk)′, k < n, is

FX(k)(x1, ..., xk) = ∫_{−∞}^{x1} ... ∫_{−∞}^{xk} [∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f(z1, ..., zn)dzk+1...dzn] dz1...dzk.

Properties of fX1X2...Xn(x1, x2, ..., xn) :

1. fX1X2...Xn(x1, x2, ..., xn) ≥ 0

2. ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} fX1X2...Xn(x1, x2, ..., xn)dx1...dxn = 1

3. The marginal pdf of one random variable (or set of random variables) is

found by integrating fX1X2...Xn(x1, x2, ..., xn) over the ranges of the other

variables, e.g.,

fX1X2...Xn−k(x1, x2, ..., xn−k) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} fX1X2...Xn(x1, x2, ..., xn)dxn−k+1dxn−k+2...dxn

4. Conditional pdfs are then easily computed. For example:

fXn|X1,X2...Xn−1(xn|x1, x2, ..., xn−1) = fX1X2...Xn(x1, x2, ..., xn) / fX1X2...Xn−1(x1, x2, ..., xn−1)


Proposition 51. If X and Y ’s joint distribution function has a p.d.f. f(x, y),

then, the conditional distribution function of X given Y has a p.d.f. f(x|y) given

by the following:

f(x|y) = f(x, y) / fY(y)

where fY(y) ≡ ∫ f(x, y)dx is the marginal density of the random

variable Y.

0.5.1. Expectations:

The definition of an expectation is also easily generalized to the case where we

have multiple random variables:

E[g(X)] = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} g(x1, ..., xn)f(x1, ..., xn)dx1...dxn if the variables are continuous, or

E[g(X)] = Σ ... Σ g(x1, ..., xn)f(x1, ..., xn) if the variables are discrete.

Let EX(g(X)) and EX,Y(g(X)) denote the expectation of the function g(X)

with respect to the marginal and the joint distributions respectively. It is easy to

show that

EX(g(X)) = EX,Y (g(X))


since EX,Y(g(X)) = Σ_{i,j} g(xi)f(xi, yj) = Σ_i g(xi)[Σ_j f(xi, yj)] = Σ_i g(xi)f(xi) =

EX(g(X)).

Theorem 52. If X and Y are independent, then EX,Y(g(X)h(Y)) = EX(g(X))EY(h(Y))

Definition 53. Cov(X,Y)=E[(X-E(X))(Y-E(Y))]

Definition 54. The conditional expectation of Y given X = x is defined as:

E(Y|X = x) = Σ_i yi f(yi|x) in the discrete case, or ∫ y f(y|x)dy in the continuous case.

Definition 55. The conditional expectation of g(X, Y) given X = x is:

E(g(X, Y)|X = x) = Σ_i g(x, yi)f(yi|X = x) in the discrete case, or ∫ g(x, y)f(y|X = x)dy in the continuous case.

Theorem 56. Law of iterated expectations. E[y]=Ex[E[y|x]] where the notation

Ex[·] indicates the expectation over the values of x.
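The law of iterated expectations can be verified by direct enumeration on a small joint pmf; the particular probabilities below are arbitrary illustrative choices.

```python
from fractions import Fraction

# A joint pmf f(x, y) over a small discrete support (values chosen for illustration).
f = {(0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
     (1, 0): Fraction(3, 8), (1, 1): Fraction(1, 8)}

# Marginal pmf of X, by summing over y
fX = {x: sum(p for (xi, _), p in f.items() if xi == x) for x in (0, 1)}

def E_Y_given(x):
    # E(Y|X=x) = sum_y y f(y|x), with f(y|x) = f(x, y)/fX(x)
    return sum(y * f[(x, y)] / fX[x] for y in (0, 1))

E_Y = sum(y * p for (_, y), p in f.items())

# Law of iterated expectations: E[Y] = E_X[ E[Y|X] ]
assert E_Y == sum(fX[x] * E_Y_given(x) for x in (0, 1))
```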


Properties of Conditional Expectation: The conditional expectation can be

defined given any σ−field A such that A ⊂ F or any random variable Y . The

following proposition gives some of the properties of the conditional expectation

using the formal language.

Proposition 57. Let X and Y be two integrable random variables on a proba-

bility space (Ω,F , P ) and F1 is a σ−field such that F1 ⊂ F .

• (i) If X = c a.s. for some real number c, then, E[X|F1] = c a.s.

(ii) If X ≤ Y a.s., then, E[X|F1] ≤ E[Y |F1] a.s.

(iii) If a and b are real numbers, then, E[aX+bY |F1] = aE[X|F1]+bE[Y |F1]

a.s.

(iv) E[E[X|F1]] = E[X].

(v) If F0 ⊂ F1, then, E[E[X|F1]|F0] = E[X|F0] = E[E[X|F0]|F1] a.s.

(vi) If σ(Y ) ⊂ F1 and E[|XY |] <∞, then E[XY |F1] = Y E[X|F1] a.s.

(vii) If E[|g(X,Y )|] <∞, then E[g(X,Y )|Y = y] = E[g(X, y)|Y = y] a.s.

Variance-Covariance Matrix of A Random Vector:


The expectation of a random vector is defined as a vector which consists of

the expected value of each individual random variable:

E[X] = (E[X1], ..., E[Xn])′.

The variance-covariance matrix of a random vector X is defined as

Var(X) = E[(X − E[X])(X − E[X])′].

Here, the expectation is taken element by element.

Proposition 58. Let X be a random vector.

• (i) For any vector c, E[c′X] = c′E[X], and Var(c′X) = c′Var(X)c.

(ii) The variance-covariance matrix of X is positive semi-definite.

Proof: (i) For any vector c, we have

E[c′X] ≡ E[c1X1 + ... + cnXn]

= c1E[X1] + ... + cnE[Xn]

≡ c′E[X].


Var(c′X) ≡ E[(c′X − E[c′X])(c′X − E[c′X])′]

= E[c′(X − E[X])(c′(X − E[X]))′]

= E[c′(X − E[X])(X − E[X])′c]

= c′E[(X − E[X])(X − E[X])′]c

≡ c′Var(X)c.

(ii) For any vector c,

(c′X − E[c′X])(c′X − E[c′X])′ = (c′X − E[c′X])² ≥ 0.

So,

c′Var(X)c = E[(c′X − E[c′X])(c′X − E[c′X])′] ≥ 0.

Q.E.D.

Transformation of random variables: Let X be a continuous random variable

with pdf fX(x). If the transformation y = g(x) is one-to-one and has the inverse

transformation x = g⁻¹(y) = h(y), then the pdf of Y is given by

fY(y) = fX(x)|dx/dy| = fX[h(y)]|dh(y)/dy|.
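The one-variable change-of-variables formula can be sketched for a concrete one-to-one transformation; here X is taken to be Uniform(0, 1) and Y = −ln X (an illustrative choice), which should reproduce the Exponential(1) density.

```python
from math import exp

# X ~ Uniform(0, 1), so fX(x) = 1 on (0, 1).
def fX(x):
    return 1.0 if 0 < x < 1 else 0.0

# Y = g(X) = -ln(X); the inverse transformation is x = h(y) = exp(-y).
def fY(y):
    h = exp(-y)                    # h(y) = g^{-1}(y)
    dh_dy = -exp(-y)               # dh(y)/dy
    return fX(h) * abs(dh_dy)      # fY(y) = fX[h(y)] |dh(y)/dy|

# The result is the Exponential(1) density, e^{-y}, for y > 0.
for y in (0.5, 1.0, 2.0):
    assert abs(fY(y) - exp(-y)) < 1e-12
```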

Let Z = g(X, Y) and W = h(X, Y), where X and Y are random variables and

fX,Y(x, y) is the joint pdf of X and Y. If the transformation z = g(x, y) and w = h(x, y)

is one-to-one and has the inverse transformation x = q(z, w) and y = r(z, w), then the

joint pdf for Z and W is given by:

fZ,W(z, w) = fX,Y(x, y)|J(x, y)|⁻¹, where x = q(z, w) and y = r(z, w) and

J(x, y) = det[∂g/∂x, ∂g/∂y; ∂h/∂x, ∂h/∂y] = det[∂z/∂x, ∂z/∂y; ∂w/∂x, ∂w/∂y]

which is the Jacobian of the transformation z = g(x, y) and w = h(x, y). If we then

define

J(z, w) = det[∂q/∂z, ∂q/∂w; ∂r/∂z, ∂r/∂w] = det[∂x/∂z, ∂x/∂w; ∂y/∂z, ∂y/∂w]

then |J(z, w)| = |J(x, y)|⁻¹ and fZ,W(z, w) = fX,Y[q(z, w), r(z, w)]|J(z, w)|.

The multivariate normal distribution: Let x be a set of random variables,

x = (x1, ..., xn), with mean vector µ and covariance matrix Σ. The general

form of the joint density for the multivariate normal is

f(x) = (2π)^{−n/2} |Σ|^{−1/2} exp{(−1/2)(x − µ)′Σ⁻¹(x − µ)}


Properties of the multivariate normal Let x1 be any subset of the variables

including a single variable, and let x2 be the remaining variables. Partition µ and

Σ likewise so

µ = (µ1, µ2)′ and Σ = [Σ11, Σ12; Σ21, Σ22]

1. If [x1, x2] have a joint multivariate normal distribution, then the marginal

distributions are

x1 ~ N(µ1, Σ11) and

x2 ~ N(µ2, Σ22)

2. If [x1, x2] have a joint multivariate normal distribution, then the conditional

distribution of x1 given x2 is also normal:

x1|x2 ~ N(µ1.2, Σ11.2), where

µ1.2 = µ1 + Σ12Σ22⁻¹(x2 − µ2)

Σ11.2 = Σ11 − Σ12Σ22⁻¹Σ21
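The conditional-normal formulas translate directly into matrix code; the sketch below uses NumPy with an assumed 3-variate example (the mean vector and covariance matrix are arbitrary illustrative numbers).

```python
import numpy as np

# Assumed numbers for illustration: a 3-variate normal, partitioned so that
# x1 = (first two variables) and x2 = (the third variable).
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.3, 0.5],
                  [0.3, 1.0, 0.2],
                  [0.5, 0.2, 1.5]])

mu1, mu2 = mu[:2], mu[2:]
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

x2 = np.array([3.0])                       # observed value of x2

# Conditional moments of x1 | x2:
mu_1_2 = mu1 + S12 @ np.linalg.inv(S22) @ (x2 - mu2)    # mu1 + S12 S22^{-1} (x2 - mu2)
S_11_2 = S11 - S12 @ np.linalg.inv(S22) @ S21           # S11 - S12 S22^{-1} S21

# Conditioning never increases uncertainty: conditional variances <= marginal ones.
assert np.all(np.diag(S_11_2) <= np.diag(S11))
```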


0.6. Markov Process

Stochastic Process:

A stochastic process is a sequence of random variables {Zt, t = 0, 1, ...} on

a fixed probability space (S, F, P) such that Zt is measurable with respect to a

σ-field Ft for all t = 0, 1, ..., and the sequence of σ-fields {Ft, t = 0, 1, ...}

is a filtration, i.e., Ft ⊂ Ft+1 for any t = 0, 1, ....

Note that for any sequence of random variables, we can always set Ft =

σ(Z0, ..., Zt). Then clearly {Ft, t = 0, 1, ...} is a filtration and Zt is measurable

with respect to Ft. Most of the time this is the natural filtration that we work

with for stochastic processes. However, there may be other filtrations such that

{Zt, t = 0, 1, ...} is also a stochastic process. Which filtration to use depends on

the problem we want to study.

Sample Path:

For any given ω ∈ S, we call the sequence of real numbers {Zt(ω), t = 0, 1, ...}

a sample path of the stochastic process.

Markov Process:

A stochastic process is called a Markov process if for any A ∈ F , t ≥ 1, and


1 ≤ n ≤ t, we have

P (A|σ(Zt, Zt−1, ..., Zt−n)) = P (A|Zt).

Another way to put this is that a random process {X(t), t ∈ T} is a Markov

process if

P (X(tn+1) ≤ xn+1|X(t1) = x1,X(t2) = x2, ...,X(tn) = xn) = P (X(tn+1) ≤ xn+1|X(tn) = xn)

whenever t1 < t2 < ... < tn < tn+1.

This type of process has a memoryless property since the future state of the

process depends only on the present state and not on the past history.

A discrete-state Markov process is called a Markov chain.
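A Markov chain is fully described by its one-step transition matrix; the sketch below (a two-state chain with an assumed transition matrix) iterates the distribution forward and shows how, thanks to the memoryless property, the chain forgets its starting state.

```python
import numpy as np

# A two-state Markov chain with an assumed transition matrix P,
# where P[i, j] = P(X_{t+1} = j | X_t = i).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
assert np.allclose(P.sum(axis=1), 1.0)     # each row is a conditional distribution

# The distribution over states evolves by left-multiplication: pi_{t+1} = pi_t P.
pi = np.array([1.0, 0.0])                  # start in state 0 with certainty
for _ in range(100):
    pi = pi @ P

# After many steps the distribution settles at a point satisfying pi = pi P
# (the stationary distribution), regardless of the starting state.
assert np.allclose(pi, pi @ P)
assert np.allclose(pi, [5/6, 1/6])         # stationary distribution of this P
```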

Acknowledgements: The notes for this section have been based on materials

provided by Prof. Xiaodong Zhu, Prof. Angelo Melino, Prof. Bruce Hansen, and

on chapters in Econometric Analysis by W. Greene, and many of the definitions are

taken from Probability, random variables and random processes by Hsu. Please

do not circulate these notes without permission of the author.
