Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference

PHIL 6334 - Probability/Statistics Lecture Notes 1:Introduction to Probability and Statistical Inference

Aris Spanos [Spring 2014]

1 Introduction

1.1 Important distinctions

1. The first distinction that needs to be kept in mind when dealing with probability

and inference is that between:(i) the real world of actual data,

(ii) Plato’s world of mathematics

Conflating the two can easily lead to serious category mistakes.

2. The second distinction pertains to two different types of information in empir-

ical modeling:

[i] statistical information reflected in chance regularity patterns,

[ii] substantive subject matter information

This distinction is important because the primary objective of empirical modeling is

‘to learn from data’ about observable stochastic phenomena of interest using statis-

tical models. A statistical model aims to account for all the systematic statistical

information in the data and functions as a mediator between the data and the phe-

nomenon of interest. Substantive subject matter information sharpens and enhances

this learning from data when it does not contradict the statistical information. The

ultimate aim is to blend the two sources of information after validating that informa-

tion vis-a-vis the data.

3. The third distinction within probability theory pertains to the basic concepts

defining the mathematical set up:

(a) (F P()): events and probabilities, e.g. P(|) ∈F ∈F (b) {(; ) ∈Θ ∈R} : random variables and probabilities

In statistical modeling the appropriate set up to be used is (b), because real-world

data are usually expressed in terms of numbers. However, people who have been

trained to think in terms of (a) often conflate events with random variables and

statistical hypotheses, leading to serious muddiness in discussions of inference.

Example 1: "what is the probability that Joan has the disease given that she tested

positive?" is an event, not a legitimate hypothesis for frequentist testing.

Example 2: be suspicious of notation like (|) or (|) where and

denote evidence and a hypothesis; is never an event in frequentist inference; see

Spanos, A. (2010), “Is Frequentist Testing Vulnerable to the Base-Rate Fallacy?”

Philosophy of Science, 77: 565-583.

1

1.2 Dual role of probability theory

Probability theory is part of mathematics that provides the foundation and the over-

arching framework for:

(1) Statistical modeling: the mathematical concepts needed to model (de-

scribe) chance regularity patterns in the form of a statistical model M(x) : a set

of probabilistic assumptions specifying a generating mechanism that could have given

rise to the data x0:= (1 2 ), and

(2) Statistical Inference: the framework for model-based (M(x)) statistical

inference - estimation (point and interval), testing, prediction and policy analysis.

Probability theory provides ways to ‘calibrate’ the reliability and optimality of infer-

ence procedures using error probabilities.

2 Phenomena amenable to statistical modeling

Stochastic: observable phenomena which exhibit chance (non-deterministic) reg-

ularities are amenable to statistical modeling. Such phenomena vary from simple

games of chance (tossing coins, casting dice, playing the roulette, etc.), to highly

complicated experiments in physics and chemistry, as well as observable phenomena

in economics, astronomy, geology, the biological and social sciences.

• Example 1. Tossing a coin and noting the outcome: Heads (H) or Tails (T).• Example 2. Sampling with replacement from an urn which contains red (R)

and black (B) balls.

• Example 3. Observing the gender (B or G) of newborns during a certain

period (a month) in NY city.

• Example 4. Casting two dice and adding the dots of the two sides facing up.• Example 5. Tossing a coin twice and noting the outcome.• Example 6. Tossing a coin until the first "Heads" occurs.• Example 7. Counting the number of emergency calls to a regional hospitalduring a certain period (a week).

––—

• Example 8. Sampling without replacement from an urn which contains red

(R) and black (B) balls.

• Example 9. Observing the daily changes of the Dow Jones (D-J) index duringa certain period.

• Example 10. Observing the annual average global surface air temperature1856-2011.

2

2.1 Chance regularity vs. deterministic regularity

Deterministic regularity can be fully described using mathematical functions

among variables, e.g. = cos(), = 1 2

5 0 04 5 04 0 03 5 03 0 02 5 02 0 01 5 01 0 05 01

1 .0

0 .5

0 .0

- 0 .5

- 1 .0

In d e x

y

T i m e S e r i e s P l o t o f y

Fig. 1: t-plot of = cos()

Stochastic regularity: data x0:=(1 2 ) (table 1) shown in fig.2.

In d e x

x

1 0 09 08 07 06 05 04 03 02 01 01

1 2

1 0

8

6

4

2

P lo t o f x

Fig. 2: t-plot of data that exhibit chance regularities

Table 1 - Observed data on –––

3 10 11 5 6 7 10 8 5 11 2 9 9 6 8 4 7 6 5 12

7 8 5 4 6 11 7 10 5 8 7 5 9 8 10 2 7 3 8 10

11 8 9 5 7 3 4 9 10 4 7 4 6 9 7 6 12 8 11 9

10 3 6 9 7 5 8 6 2 9 6 4 7 8 10 5 8 7 9 6

5 7 7 6 12 9 10 4 8 6 5 4 7 8 6 7 11 7 8 3

Just glancing through the numbers one cannot easily detect chance regularity pat-

terns, hence the various graphical techniques designed to bring out such patterns.

Chance regularities: [D] Distribution, [M] Dependence, [H] Heterogeneity

3

[D] Distribution chance regularities

Thought experiment 1. Think of the observations as little squares with equal

area and rotate the figure 2 clockwise by 90◦ and let the squares representing theobservations fall vertically creating a pile on the -axis.

[1] Distribution: when is large enough the data form a perceptible shape

(histogram).

Fig. 3: A typical realization of a NIID process

Fig. 4: Assessing the Distribution assumption

4

Let us return to the data in table 1.

In d e x

x

1 0 09 08 07 06 05 04 03 02 01 01

1 2

1 0

8

6

4

2

P lo t o f x


Applying thought experiment 1 to the data depicted in figure 2 yields the histogram

given below.

x

Freq

uenc

y

1 2108642

1 8

1 6

1 4

1 2

1 0

8

6

4

2

0

H is to g r a m o f x

Fig. 5: The histogram of the data in fig. 2

As argued below, the above data x0:= (1 2 ) could be viewed as a typical

realization of an Independent and Identically Distributed (IID) process {1 2 }generically denoted by { ∈N = (1 2 )} whose distribution appears to besymmetrically triangular. How do we assess IID using graphical techniques?

Assessing [M] Dependence chance regularities

Thought experiment 2. Hide the observations following a certain value of

the index, say = 40, and try to guess the next outcome. Repeat this along the

observation index axis and if it turns out that it is impossible to use the previous

observations to guess the value of the next observation, excluding the extreme cases 2

and 12, then the chance regularity pattern we call independence is present. Intuitively,

we can summarize this notion:

5

[2] Independence: in any sequence of trials the outcome of any one trial does

not influence and is not influenced by that of any other.

Fig. 6: A typical realization of a positively dependent process

Fig. 7: A typical realization of a negatively dependent process

1 0 09 08 07 06 05 04 03 02 01 01

1 .0

0 .8

0 .6

0 .4

0 .2

0 .0

I n d e x

x

T y p ic a l re a l iz a t io n o f a B e rn o u l l i I I D p ro c e s s

6

1 0 09 08 07 06 05 04 03 02 01 01

1 .0

0 .8

0 .6

0 .4

0 .2

0 .0

I n d e x

z

T y p ic a l re a l iz a t io n o f a d e p e n d e n t B e rn o u ll p ro c e s s

What can one say about dependence for the data in table 1 (fig. 2)? They exhibit

no dependence patterns.

Assessing [H] Heterogeneity chance regularities

Thought experiment 3. Take a wide frame (to cover the spread of the fluctua-

tions in a t-plot such as figure 2) that is also long enough (roughly less than half the

length of the horizontal axis) and let it slide from left to right along the horizontal

axis looking at the picture inside the frame as it slides along. In the case where the

picture does not change significantly, the data exhibit homogeneity.

Another way to view this pattern is in terms of the arithmetic average and the

variation around this average of the numbers as we move from left to right. It appears

as though this sequential average and its variation are relatively constant around 7.

The variation around this constant average value appears to be within constant bands.

This chance regularity can be intuitively summarized by the following notion:

[3] Homogeneity: the probabilities associated with the various outcomes remain

identical for all trials.

Fig. 8: A typical realization of a NIID process

7

Fig. 9: A typical realization of a mean-heterogeneous process

What can one say about heterogeneity for the data in table 1 (fig. 2)? The data

exhibit no heterogeneity patterns.

In d e x

x

1 0 09 08 07 06 05 04 03 02 01 01

1 2

1 0

8

6

4

2

P lo t o f x


The data in fig. 2 should be contrasted with the data in figures 10-11.

2 2 7 72 0 2 41 7 7 11 5 1 81 2 6 51 0 1 27 5 95 0 62 5 31

1 2 0 0 0

1 0 0 0 0

8 0 0 0

6 0 0 0

4 0 0 0

2 0 0 0

t im e

Dow

Jon

es In

dex

t - P l o t o f D o w J o n e s I n d e x

Fig. 10: t-plot of the Dow Jones Index

8

Y e a r

y

2 0 0 01 9 7 61 9 5 21 9 2 81 9 0 41 8 8 01 8 5 6

0 .5 0

0 .2 5

0 .0 0

- 0 .2 5

- 0 .5 0

T e m p e r a tu r e d a ta

Fig. 11: Surface air temperature, annual series

The data sets depicted in fig. 10-11 exhibit departures from the IID assumptions;

they indicate both temporal dependence and heterogeneity.

2.2 Empirical regularities → probabilitiesThe question that naturally arises is whether the available substantive information

pertaining to the mechanism that gave rise to the data in fig. 2 would affect the

choice of a statistical model. Common sense suggests that it should, but it is not

clear what the extent of that influence should be. Let us discuss that in stages.

2.2.1 The Data Generating Mechanism (DGM)

The data in table 1 (fig. 2) were generated by a sequence of =100 trials of casting

two dice and adding the dots of the two sides facing up. This game of chance was very

popular in medieval times with soldiers waiting for weeks on end to assail European

cities they had under siege. After thousands of trials these illiterate soldiers learned

empirically that the number 7 occurs more often than any other number and that 6

occurs less often than 7 but more often than 5 etc.

Historically, the formalization process from chance regularity patterns to

probabilities has taken several centuries to come together.

[a] All possible distinct outcomes are known a priori. The first crucial feature of

the DGM is its stochastic nature: at each trial (the casting of two dice) the outcome

(the sum of the dots of the sides) cannot be predicted with any certainty. The only

thing one can say for sure is that the result of each trial will be one of the events

(numbers): {2 3 4 5 6 7 8 9 10 11 12}

9

[b] in any particular trial the outcome is not known a priori, but there exists

a perceptible regularity of occurrence associated with these outcomes.The second key feature of the DGM is that, under certain conditions, these events

do not occur equally often in this game of chance. It was obvious to the medieval

soldiers that certain events like 7 occur more often than others like 12.

What are these conditions? The third feature of the generating mechanism is:

[c] it can be repeated under (roughly) identical conditions.

These conditions are of paramount importance in modeling stochastic phenomena

because they pertain to the validity of the premises of inference. In this case they

pertain to the physical symmetry of the two dice and the homogeneity of the replica-

tion process. In the actual experiment the dice were cast in the same wooden box to

secure a certain form of nearly identical circumstances for each trial.

The question that naturally arose at the time was whether one could explain

(account for) the differences in the empirical relative frequency of occurrence of these

results shown by the histogram (fig. 5) using some kind of mathematical argument.

If one could figure out the odds for different events, one could make money! The

first systematic account of the underlying mathematical reasoning behind fig. 5 was

given byGirolamo Cardano, a notorious gambler, during the 16th century in Italy.

2.2.2 The probabilities behind the empirical frequencies

Cardano was able to get to the probabilities behind the observed relative frequencies

in four steps.

Step 1. He reasoned that since each die has 6 faces (1 2 6) casting two

dice will give rise to 36 possible (elementary) outcomes associated with the different

pairings of these numbers (1 2) (table 2), where denotes the number of dots of

the −th die, =1 2Table 2 - Casting two dice:

set of elementary outcomes

1 2 3 4 5 6

1 (1,1) (1,2) (1,3) (1,4) (1,5) (1,6)

2 (2,1) (2,2) (2,3) (2,4) (2,5) (2,6)

3 (3,1) (3,2) (3,3) (3,4) (3,5) (3,6)

4 (4,1) (4,2) (4,3) (4,4) (4,5) (4,6)

5 (5,1) (5,2) (5,3) (5,4) (5,5) (5,6)

6 (6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

Step 2. He reasoned that if the casting is repeated under similar conditions and

the two dice are symmetric and homogeneous, then all elementary outcomes (1 2)

in table 2 are equally likely to occur, and thus one can attach a probability 136.

Step 3. He reasoned that the events of interest {2 3 12} can be seen as arisingfrom adding 1+2 for each pair in table 2. He collected the different ways the events

10

1 + 2 can arise from elementary outcomes and formed table 3.

(4,3)

(3,3) (3,4) (4,4)

(2,3) (2,4) (5,2) (5,3) (4,5)

(2,2) (3,2) (4,2) (2,5) (3,5) (4,5) (5,5)

(2,1) (3,1) (4,1) (5,1) (6,1) (6,2) (6,3) (6,4) (6,5)

(1,1) (1,2) (1,3) (1,4) (1,5) (1,6) (2,6) (3,6) (4,6) (5,6) (6,6)

2 3 4 5 6 7 8 9 10 11 12

Table 3: Distribution of events (Plato’s world)

Step 4. He realized that if each of the elementary outcomes in table 2 has

probability of occurrence 136 then one can evaluate the probabilities associated with

events {2 3 12} by simply adding these probabilities.For instance, we know that the number 2 can arise as the sum of a single combi-

nation of faces: {1 1} - each die comes up 1 hence (1+2=2)= ({1 1})= 136

the same applies to the number 12: (1+2=12)= ({6 6})= 136 On the other

hand the number 3 can arise as the sum of two sets of faces: {(1 2) (2 1)} hence (1+2=3)= ({(1 2) (2 1)})= (1 2) + (2 1)= 2

36 The same calculation applies

to the number 11: (1+2=11)= ({(6 5) (5 6)})= 236 Similarly, the calculation of

probability for number 7 is: (1+2=7)= ({(1 6) (6 1) (2 5) (5 2) (3 4) (4 3)})=

(1 6)+ (6 1)+ (2 5)+ (5 2)+ (3 4)+ (4 3)= 636

Continuing this line of reasoning Cardano was able to construct the probability

distribution in table 4 that relates each event of interest in {2 3 12} with itsprobability of occurrence.

Events 2 3 4 5 6 7 8 9 10 11 12

Probabilities 136

236

336

436

536

636

536

436

336

236

136

Table 4 - Probability distribution for the sum of two dice

2.3 Probabilities → empirical regularities

The probability distribution in table 4 represents a probabilistic concept (Plato’s

world) formulated by mathematicians in order to capture the distribution chance

regularity exhibited by the data in fig. 2 and summarized by the histogram in fig. 5.

A direct comparison between figures 5 and 12 confirms the soldiers’ intuition in the

sense that the empirical frequencies in fig. 5 are close to the theoretical probabilities

shown in table 4 (fig. 12). Moreover, if we were to repeat the experiment, not

100 but 1000 times, these frequencies would have been even closer to the theoretical

probabilities. In this sense we can think of the histogram in fig. 5 as an empirical

instantiation of the probability distribution in table 4 (fig. 12).

11

x

Freq

uenc

y

12108642

18

16

14

12

10

8

6

4

2

0

His togram of x

Fig. 5: The histogram (real world)

¥¥ ¥ ¥

¥ ¥ ¥ ¥ ¥¥ ¥ ¥ ¥ ¥ ¥ ¥

¥ ¥ ¥ ¥ ¥ ¥ ¥ ¥ ¥¥ ¥ ¥ ¥ ¥ ¥ ¥ ¥ ¥ ¥ ¥2 3 4 5 6 7 8 9 10 11 12

Fig. 12: Probability distribution (Plato’s)

The probability distribution in table 4 represents an oversimplified simple sta-

tistical model where one assumed certain conditions of symmetry, homogeneity and

sameness to evaluate the relevant probabilities without the presence of any unknown

parameters. In practice one needs to assess some of these preconditions before impos-

ing them and model validation is the first step before the statistical model is used for

inference purposes. An important component of model validation is formally test the

relationship between fig. 5 and table 4 to ensure that, indeed, the former could have

been generated by a statistical model with the latter as the assumed distribution.

The step from the observed regularities to their formalization (mathematization)

was prompted by the distributional regularity pattern of the data in fig. 2 as exem-

plified in fig. 5 in the form of a histogram. The formalization itself was initially very

slow, taking centuries to materialize, beginning with figuring out simple combinator-

ial (combinations and permutations) arguments. We can capture the essence of this

early formalization by taking the dice casting example one step further.

2.4 Same experiment, different questions of interest

When playing the casting of two dice, the medieval soldiers used to gamble on whether

the outcome is an odd or an even number (the Greeks introduced these concepts at

around 300 B.C.). That is, the soldiers would bet on one of two events:

= {3 5 7 9 11} = {2 4 6 8 10 12}

12

At first sight it looks as though betting on will be a sure winner because there

are more even than odd numbers. The medieval soldiers, however, knew by empirical

observation that this was not true, it was a fair bet! We can confirm their intuition

using the distribution in table 4:

() = (3) + (5) + (7) + (9) + (11) = 236+ 436+ 636+ 436=12

() = (2)+ (4)+ (6)+ (8)+ (10)+ (12)= 136+ 336+ 536+ 536+ 336+ 136=12

This gives rise to the distribution in table 5.

Events = {3 5 7 9 11} = {2 4 6 8 10 12}

Probabilities 12

12

Table 5- The probability distribution of odd (A) and even (B)

This is an important aspect of modeling because it determines the particular

statistical model that can be used to answer the questions of interest.

3 Stochastic phenomenon → a probability model

Example 1. Tossing a coin and noting the outcome: Heads (H) or Tails (T).

Example 2. Sampling with replacement from an urn which contains red (R) and

black (B) balls.

Example 3. Observing the gender (B or G) of newborns during a certain period

(a month) in NY city.

Example 4. Casting two dice and adding the dots of the two sides facing up.

Example 5. Tossing a coin twice and noting the outcome.

Example 6. Tossing a coin until the first "Heads" occurs.

Example 7. Counting the number of emergency calls to a regional hospital during

a certain period (a week).

Examples 1-7 constitute the simplest form of a stochastic phenomenon, which can

be both experimental (1-2, 4-6) and observational (3,7), described as

a Random Experiment (E):[a] All possible distinct outcomes are known a priori,

[b] in any particular trial the outcome is not known a priori, but there exists

a perceptible regularity of occurrence associated with these outcomes,

[c] it can be repeated under (largely) identical conditions.

What renders them simple is condition [c] which give rise to what are known as

random samples.

Examples 8-10 (section 2) constitute more complicated stochastic phenomena,

both experimental (8) and observational (9-10), whose outcomes cannot be realisti-

cally viewed as random (IID) samples.

13

3.1 Mathematical framing I: Events and Probabilities

[a] Formalizing "all possible distinct outcomes"

This is achieved by defining the set of all possible outcomes.

Examples:

1 = {} 2 = {} 3 = {} 4 = {2 3 12}5 = {() ( ) () ( )} 6 = { }7 = {0 1 2 } 8 = {} 9 = R:=(−∞∞) 10 = R+:=(0∞)[b] Formalizing the "perceptible regularity"

The formalization begins with the crucial distinction:

outcome: ∈ vs. event: ⊂

That is, an outcome is an element of but an event is a subset of i.e. an event is

any combination of outcomes in

Step 1: We define the set F of events of interest and related events associated

with the experiment comprising a set of subsets of .

Examples:

F1:={∅ } where denotes the sure event, ∅ the impossible event,F2:={∅ } F3:={∅ } F4:={∅ } ={3 5 7 9 11}={2 4 6 8 10 12} assuming one is interested in an odd or an even number.F5:={∅ } ={() ( )} ={( ) ()}F8:={∅ }.Mathematically F is a field : a set of subsets of which is closed under the set

theoretic operations of union (∪), intersection (∩), and complementation (−).Examples:

In the case ofF4:={∅ } ∪=∈F4 ∩=∅∈F4 =∈F4 =∈F4Step 2: We assign probabilities to all the elements of F using:

P(): F → [0 1] a set function which satisfies the axioms:

[A1] P() = 1,[A2] P() ≥ 0, for any event ∈F ,[A3] For and in F such that ∩=∅, P( ∪)=P()+P()

The probability space (F P()) provides an idealized description of the sto-chastic mechanism that gives rise to the events of interest and related events F , withP() assigning probabilities to events in F Intuitively, an event is something thatmight or might not occur in a specific trial, and thus the set theoretic operations ∪,∩, − give rise to related events.Example 1. In the case of Example 1 [Tossing a coin and noting the outcome],

the probabilities assigned are:

P()=1 P(∅)=0 P()= P( )=1− 0 ≤ ≤ 1 (1)

14

3.2 Mathematical framing II: Random Variables and Probabilities

The above mathematical formulation based on (F P()) is highly abstract and notparticularly convenient for modeling purposes because data often come in the form

of numbers!

For modeling purposes we need recast the above mathematical formalization into

a formulation that relates the events of interest to real numbers. The key to such a

recasting is provided by the notion of a random variable.

A random variable is a real-valued function from the set of outcomes to the

real line:(): → R

that upholds the event structure of interest in F . That is, each event defined by{ : ()=} belongs to F for all ∈RExample 4. In light of the fact that F4:={∅ } ={3 5 7 9 11}={2 4 6 8 10 12} the real-valued function:

()=1 ()=0

defines a random variable with respect to F4 because the events generated by

are always elements of F4 since: {: ()=1}=∈F4 {: ()=0}=∈F4 {:()=}=∅∈F4 for all 6= 0 or 6= 1Example 5. In light of the fact that 5={() ( ) () ( )} and

F :={∅ } ={() ( )} ={( ) ()}, the real-valued function :

()= ( )=0 ( )=1 ()=2

is not a random variable with respect to F (above) because we can show that some

events generated by () do not belong to F5 e.g. { : ()=0}={() ( )}∈F5However, this does not preclude the possibility of defining a different field relative to

which constitutes a random variable. How? First, generate the primary events of

interest associated with :

{: =0}={() ( )}:= {: =1}={( )}= {: =2}=()}:=

and, second, use these primary events (), to define the related events specifying

the field generated by :

( )={∅ ∪ ∪ ∪ }

Verify that ( ) is closed under the set theoretic operations of union (∪), intersection(∩), and complementation (−)!In the where () is a random variable relative to F , all events {: ()=} for

all ∈R belongs to F , thus one can attach probabilities via:

P({ : =})=() for all ∈R

15

where now we traded the set function P(): F →[0 1] with the density function ():

R→[01] which is a point to point numerical function.In Example 4 above, the density function is the Bernoulli:

(; )=

½1− for =0

for =1 or (; )=(1− )1− =0 1

3.2.1 The notion of a Probability model

Using the concept of a random variable one transforms the original formalization

(F P()) into something more data friendly and richer in mathematical structure:

(F P())()

=⇒ {(;θ) θ∈Θ ∈R}

where (;θ) ∈R denotes a density function and θ its unknown parameters.

This transformed probability set up defines what is known as the probability model

Φ= {(;θ) θ∈Θ ∈R} which is one of two components defining a statisticalmodel, the other is known as the sample model.

Particular examples of probability models.

Normal: Φ=n(;θ)= 1

√2exp

n− (−)2

22

o θ:=( 2)∈R×R+ ∈R

oBeta: Φ=

n(;θ)=

−1(1−)−1[]

θ:=( ) 0 0 0 ≤ ≤ 1o

Bernoulli: Φ= {(; ) = (1− )1− 0 1 = 0 1}Poisson: Φ=

n(; ) = −

! 0 = 0 1 2 3

oExample 1. Let us return to Example 4, where the random variable()=1 ()=0

then the relevant probability is the Bernoulli as given above. That is, we can use this

to model the experiment of casting two dice and observing whether the sum of the

dots is even or odd, without having to assume that we have perfectly symmetric and

homogeneous dice because we can leave as an unknown.

4 Useful probability concepts

4.1 Discrete vs. continuous random variables

The above definition of a random variables is confined to functions that take only

discrete values.

(a) For a countable the function () : → R is a random variable (r.v.)

relative to a particular event space of interest F if:{ : () = }∈F for all ∈R.

In words, if all the events { : = } for any real number belong to F then ()

is a r.v. relative to this F Intuitively, is a numbering mapping that preserves the

event structure of interest F

16

(b) For an uncountable the function() : → R is a random variable relativeto a particular event space of interest F if:

{ : () ≤ }∈F for all ∈R.That is, in this case the events of interest are defined by { : ≤ } In this casethe connection between P() and () is more complicated because it goes through

the cumulative distribution function ():

P({ : () ≤ })= ()= R −∞ () for all ∈R.Note that under certain regularity conditions: ()=

()

4.2 Density function

Properties of a density function

Continuous r.v. Discrete r.v.

f1 () ≥ 0 for all ∈R () ≥ 0 for all ∈R

f2R∞−∞ ()=1

P∈R

()=1

f3 ()− ()=R () ()−()=

P=

()

However, there is one crucial difference that these properties do not bring out:

discrete: () : R → [0 1]

continuous: () : R → [0∞)

That is, in the discrete r.v. case one can think of a density function as assigning

probabilities to all events of the form { = } ∈R but in the continuous r.v.

case it does not because () can be bigger than one! Hence, in the continuous case

we can think of the assigning of probabilities over a small interval ( ≤ ≤ + ):

P( ≤ ≤ + ) ' () for some small 0,

where → 0 is the infinitesimal interval we use in calculus.

Example. Consider the function: ()=2 0 1 This function satisfies

f1 if 0 but to satisfy f2 we need to ensure thatR 10()=1 For that to be the

case we solveR 102=1 for Since

R 102=1

3→ =3 i.e. ()=32 0 1

4.3 Moments of distributions

For modeling and inference purposes we need to relate the unknown parameter(s) θ

to certain features of the assumed density function (; ) ∈R and the best way

to do that is to relate to the moments of (; ) because it makes modeling and

inference easier to handle.

17

(a)Rawmoments: ()=P

∈R (; ) (discrete), or()=R(; )

(continuous), =1 2

Mean. The most widely used raw moment is the mean:

()=P

∈R (; ) ()=R∈R (; )

Example 1. In the case of the simple Bernoulli distribution:

(; )

0 1−

1

()=P

∈R (; )=(0)(1−) + (1) =

Example 2. Consider the sum of two dice =(1+2) distribution:

2 3 4 5 6 7 8 9 10 11 12

() 136

236

336

436

536

636

536

436

336

236

136

Table 4 - Probability distribution for the sum of two dice

() = 2( 136) + 3( 2

36) + 4( 3

36) + 5( 4

36) + 6( 5

36) + 7( 6

36)+

+8( 536) + 9( 4

36) + 10( 3

36) + 11( 2

36) + 12( 1

36) = 70

That is, =7 is not just the number with the highest probability, it is also the mean

of the distribution.

Example 3. For the density function ()=32 0 1 :

()=R 10(32) = 75

(b) Central moments: ([ −()])=P

∈R[−()](; ),

or ([−()])= R∈R [−()](; ), =2

(i) Variance. The most widely used central moment is the variance:

()=([ −()]2)=P

∈R [−()]2(; )

Example 1. In the case of the simple Bernoulli distribution:

()=P

∈R (−)2(; )=(1−)2 + (0−)2(1−)=(1−)Example 2. In the case of the sum of two dice =(1+2) distribution:

() =(2-7)2( 136)+(3-7)

2( 236)+(4-7)

2( 336)+(5-7)

2( 436)+(6-7)

2( 536)+(7-7)

2( 636)+

+(8-7)2( 536)+(9-7)

2( 436)+(10-7)

2( 336)+(11-7)

2( 236)+(12-7)

2( 136)=5.833

Example 3. For the density function ()=32 01 :

()=R 10[−()]2()= R 1

0(−75)2(32)=0375

18

(ii) Standard deviation:p () This measure can be used to render a

random variable unit-less in the sense that the ratio √ ()

does not depend on

the units is measured!

Warning: Statistics textbooks often flout the distinction between Plato’s worldand the real world by using the same term to refer to two very different concepts, as

shown below.

Plato’s world Real world

mean: =P

∈R (; ) =1

P=1

variance: 2=P

∈R (− )2(; ) 2= 1

P=1

(−)2

5 Important concepts

Deterministic vs. stochastic (chance) regularities

Chance regularities: Distributional, Dependence and Heterogeneity

Random Experiment (E)Set of all possible outcomes ()

Set of events of interest and related events (F)Probability set function (P)

Random variables (())

Field generated by a random variable

Density function (())

Probability model ({(;θ) θ∈Θ ∈R})Mean (()), variance ( ())

19

Date post:	29-Nov-2014
Category:	Education
Upload:	jemille6
View:	4,726 times
Download:	0 times

Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference

Education