Copyright (c) Bani K. Mallick. STAT 651 Lecture #15.

Post on 21-Dec-2015



STAT 651

Lecture #15


Topics in Lecture #15

Some basic probability

The binomial distribution

Inference about a single population proportion


Book Sections Covered in Lecture #15

Chapters 4.7-4.8

Chapter 10.2


Lecture 14 Review: Nonparametric Methods

Replace each observation by its rank in the pooled data

Do the usual ANOVA F-test

Kruskal-Wallis


Lecture 14 Review: Nonparametric Methods

Once you have decided that the populations are different in their means, there is no version of a LSD

You simply have to do each comparison in turn

This is a bit of a pain in SPSS, because you physically must do each 2-population comparison, defining the groups as you go


Categorical Data

Not all experiments are based on numerical outcomes

We will deal with categorical outcomes, i.e., outcomes that, for each individual, fall into a category

The simplest categorical variable is binary:

Success or failure

Male or female


Categorical Data

For example, consider flipping a fair coin, and let

X = 0 means “tails”

X = 1 means “heads”


Categorical Data

The fraction of the population who are “successes” will be denoted by the Greek symbol $\pi$

Note that because it is a Greek symbol, it represents something to do with a population

For coin flipping, if you flipped all the fair coins in the world (the population), the fraction of the times they turn up heads equals $\pi$ = 1/2


Categorical Data

The fraction of the population who are “successes” will be denoted by the Greek symbol $\pi$

The fraction of the sample of size n who are “successes” is going to be denoted by $\hat{\pi}$

We want to relate $\hat{\pi}$ to $\pi$

Let X = number of successes in the sample. The fraction $\hat{\pi}$ = (# successes)/n = X / n


Categorical Data

Suppose you flip a coin 10 times, and get 6 heads.

The proportion of heads = 0.60

The percentage of heads = 60%


Categorical Data

The number of successes X in n experiments, each with probability of success $\pi$, is called a binomial random variable

There is a formula for this:

$$\Pr(X = k) = \Pr(\hat{\pi} = k/n) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k}$$

0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.
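As a rough illustration of the binomial formula, here is a minimal Python sketch. The function name `binomial_pmf` and the parameter name `pi` (standing for the population fraction of successes) are illustrative choices, not part of the lecture.

```python
from math import comb


def binomial_pmf(k: int, n: int, pi: float) -> float:
    """Pr(X = k) = n! / (k! (n-k)!) * pi^k * (1 - pi)^(n-k)."""
    # comb(n, k) computes n! / (k! (n-k)!) exactly for integers.
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


# Chance of exactly 6 heads in 10 flips of a fair coin (pi = 0.5).
print(binomial_pmf(6, 10, 0.5))
```

The probabilities over k = 0, 1, ..., n add up to 1, which is a quick sanity check on any implementation of the formula.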


Categorical Data

0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.

The idea is to relate the sample fraction to the population fraction using this formula

Key Point: if we knew $\pi$, then we could entirely characterize the fraction of experiments that have k successes

$$\Pr(X = k) = \Pr(\hat{\pi} = k/n) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k}$$


Categorical Data

The probability that the coin lands on heads will be denoted by the Greek symbol $\pi$

Suppose you flip a coin 2 times, and count the number of heads.

So here, X = number of heads that arise when you flip a coin 2 times

X takes on the values 0, 1 and 2

$\hat{\pi}$ takes on the values 0/2, 1/2, 2/2


Categorical Data: What the binomial formula does

The experiment results in 4 equally likely outcomes: each occurs ¼ of the time

                     Tails on toss #1    Heads on toss #1
Tails on toss #2            ¼                   ¼
Heads on toss #2            ¼                   ¼


Categorical Data

Heads = “success”:

                     Tails on toss #1    Heads on toss #1
Tails on toss #2            ¼                   ¼
Heads on toss #2            ¼                   ¼

$\Pr(X = 0) = \Pr(\hat{\pi} = 0/2) = 1/4$

$\Pr(X = 1) = \Pr(\hat{\pi} = 1/2) = 1/2$

$\Pr(X = 2) = \Pr(\hat{\pi} = 2/2) = 1/4$

The binomial formula can be used to give these results without thinking
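The table-counting argument for two tosses can be checked by brute force. The sketch below is only an illustration: it codes tails as 0 and heads as 1 (as in the earlier coin example) and lists all four equally likely outcomes; the variable names are made up for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 4 equally likely outcomes of two tosses (0 = tails, 1 = heads).
outcomes = list(product([0, 1], repeat=2))

# Count how many outcomes give each number of heads X.
counts = Counter(sum(outcome) for outcome in outcomes)

# Each outcome has probability 1/4, so Pr(X = k) = (# outcomes with k heads) / 4.
probs = {k: Fraction(c, len(outcomes)) for k, c in sorted(counts.items())}
print(probs)
```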


Categorical Data

0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.

$$\Pr(X = k) = \Pr(\hat{\pi} = k/n) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k}$$

n=2, k=1, k! = 1, n! = 2, (n-k)! = 1

$\pi^{k} = .5$, and $(1-\pi)^{n-k} = .5$

The binomial formula gives the answer (2 / (1 x 1)) x .5 x .5 = ½, which we know to be correct


Categorical Data

Roll a fair die

First Die:      1     2     3     4     5     6
Probability:   1/6   1/6   1/6   1/6   1/6   1/6

Every outcome is equally likely, so what are the probabilities?

What is the chance of rolling a 1 or a 2? 2/6 = 1/3


Categorical Data

Roll two fair dice

                            First Die
Second Die      1      2      3      4      5      6
    1         1/36   1/36   1/36   1/36   1/36   1/36
    2         1/36   1/36   1/36   1/36   1/36   1/36
    3         1/36   1/36   1/36   1/36   1/36   1/36
    4         1/36   1/36   1/36   1/36   1/36   1/36
    5         1/36   1/36   1/36   1/36   1/36   1/36
    6         1/36   1/36   1/36   1/36   1/36   1/36

Every combination is equally likely, so what are the probabilities?


Categorical Data

Roll two fair dice: each of the 36 (first die, second die) combinations has probability 1/36

Define a success as rolling a 1 or a 2. What is the chance of two successes? Both dice must show a 1 or a 2, which covers 2 x 2 = 4 of the 36 combinations: 4/36 = 1/9


Categorical Data

Roll two fair dice: each of the 36 (first die, second die) combinations has probability 1/36

Define a success as rolling a 1 or a 2. What is the chance of two failures? Both dice must show a 3, 4, 5 or 6, which covers 4 x 4 = 16 of the 36 combinations: 16/36 = 4/9
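The two-dice counts can be reproduced by listing all 36 combinations. This is a small sketch for illustration only; the helper names `is_success` and `chance` are invented here and do not come from the lecture.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) combinations.
rolls = list(product(range(1, 7), repeat=2))


def is_success(face: int) -> bool:
    """A success is rolling a 1 or a 2."""
    return face in (1, 2)


def chance(num_successes: int) -> Fraction:
    """Fraction of the 36 combinations with exactly num_successes successes."""
    hits = sum(1 for roll in rolls
               if sum(is_success(face) for face in roll) == num_successes)
    return Fraction(hits, len(rolls))


print("two successes:", chance(2))
print("two failures: ", chance(0))
print("one of each:  ", chance(1))
```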


Categorical Data

So, a success occurs when you roll a 1 or a 2

Pr(success on a single die) = 2/6 = 1/3 = $\pi$

Pr(2 successes) = 1/3 x 1/3 = 1/9

Use the binomial formula: Pr(X = k) when n = 2 and k = 2

k! = 2, n! = 2, (n-k)! = 0! = 1,

$\pi^{k} = 1/9$, and $(1-\pi)^{n-k} = 1$

$$\Pr(X = k) = \Pr(\hat{\pi} = k/n) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k} = 1/9$$


Categorical Data

In other words, the binomial formula works in these simple cases, where we can draw nice tables

Now think of rolling 4 dice, and ask for the chance that 3 of the 4 dice show a 1 or a 2

Too big a table: need a formula
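For the four-dice question, the formula and a brute-force count can be compared directly. The sketch below assumes the same definition of success (a 1 or a 2, so the chance of success on one die is 1/3); the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb

pi = Fraction(1, 3)  # chance that a single die shows a 1 or a 2
n, k = 4, 3          # 4 dice, exactly 3 successes

# Binomial formula: n! / (k! (n-k)!) * pi^k * (1 - pi)^(n-k)
by_formula = comb(n, k) * pi**k * (1 - pi) ** (n - k)

# Brute-force check over all 6^4 = 1296 equally likely rolls.
rolls = list(product(range(1, 7), repeat=n))
hits = sum(1 for roll in rolls if sum(face <= 2 for face in roll) == k)
by_counting = Fraction(hits, len(rolls))

print(by_formula, by_counting)
```

Both routes give the same fraction, but the formula avoids writing out a table with 1296 cells.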


Categorical Data

Does it matter what you call a “success” and what you call a “failure”?

No, as long as you keep track

For example, in a class experiment many years ago, men were asked whether they preferred to wear boxers or briefs

This is binary, because there are only 2 outcomes

“success” = ?????


Categorical Data

Binary experiments have sampling variability, just like sample means, etc.

Experiment: “success” = being under 5’10” in height

First 6 men with SSN < 5

First 6 men with SSN > 5

Note how the number of “successes” was not the same! (I might have to do this a few times)


Categorical Data

The sample fraction is a random variable

This means that if I do the experiment over and over, I will get different values.

These different values have a standard deviation.


Categorical Data

The sample fraction $\hat{\pi}$ has a standard error

Its standard error is

$$\sigma_{\hat{\pi}} = \sqrt{\frac{\pi(1-\pi)}{n}}$$

Note how if you have a bigger sample, the standard error decreases

The standard error is biggest when $\pi$ = 0.50.
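Both remarks about the standard error (it shrinks as n grows, and it peaks at a population fraction of 0.50) are easy to see numerically. This is a minimal sketch; the function name `standard_error` and the sample sizes tried are arbitrary choices for illustration.

```python
from math import sqrt


def standard_error(pi: float, n: int) -> float:
    """Standard error of the sample fraction: sqrt(pi * (1 - pi) / n)."""
    return sqrt(pi * (1 - pi) / n)


# Bigger samples give smaller standard errors (pi held at 0.5).
for n in (25, 100, 400):
    print("n =", n, "SE =", round(standard_error(0.5, n), 4))

# For fixed n, the standard error is largest at pi = 0.5.
for pi in (0.1, 0.3, 0.5, 0.7, 0.9):
    print("pi =", pi, "SE =", round(standard_error(pi, 100), 4))
```

Quadrupling the sample size halves the standard error, because n sits under a square root.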


Categorical Data

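The sample fraction $\hat{\pi}$ has a standard error

Its standard error is

$$\sigma_{\hat{\pi}} = \sqrt{\frac{\pi(1-\pi)}{n}}$$

The estimated standard error based on the sample is

$$\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$$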


Categorical Data

It is possible to make confidence intervals for the population fraction if the number of successes > 5, and the number of failures > 5

If this is not satisfied, consult a statistician

Under these conditions, the Central Limit Theorem says that the sample fraction is approximately normally distributed (in repeated experiments)


Categorical Data

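$(1-\alpha)100\%$ CI for the population fraction $\pi$:

$$\hat{\pi} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{\pi}}, \qquad \hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$$

$z_{\alpha/2}$ is found by looking up $1 - \alpha/2$ in Table 1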
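If no normal table is at hand, the same critical value can be computed. The sketch below uses `NormalDist` from Python's standard `statistics` module in place of the table lookup; the helper name `z_alpha_over_2` is made up for this example.

```python
from statistics import NormalDist


def z_alpha_over_2(confidence: float) -> float:
    """Standard normal value with area 1 - alpha/2 to its left."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(level, round(z_alpha_over_2(level), 3))
```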


Categorical Data

Often, you will only know the sample proportion/percentage and the sample size

Computing the confidence interval for the population proportion: two ways

By hand

By SPSS (this is a pain if you do not have the data entered already)

Because you may need to do this by hand, I will make you do this.


Categorical Data

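$(1-\alpha)100\%$ CI for the population fraction: $\hat{\pi} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{\pi}}$

95% CI, $z_{\alpha/2}$ = 1.96

n = 25, $\hat{\pi}$ = 0.30

$$\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}} = \sqrt{\frac{.3(1-.3)}{25}} = 0.09165$$

$$\hat{\pi} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{\pi}} = 0.30 \pm 1.96 \times 0.09165$$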


Categorical Data

$(1-\alpha)100\%$ CI for the population fraction

$$\hat{\pi} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{\pi}} = 0.30 \pm 1.96 \times 0.09165 = 0.30 \pm 0.18 = [0.12, 0.48]$$

Interpretation? The proportion of successes in the population is from 0.12 to 0.48 (12% to 48%) with 95% confidence
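The by-hand calculation can be mirrored in a few lines of code. This is a sketch under the lecture's normal approximation; the function name `proportion_ci` and its arguments are illustrative and not taken from SPSS or any other package.

```python
from math import sqrt


def proportion_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    se_hat = sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    margin = z * se_hat
    return p_hat - margin, p_hat + margin


# The worked example: n = 25, sample fraction 0.30, 95% confidence.
low, high = proportion_ci(0.30, 25)
print(round(low, 2), round(high, 2))
```

Passing a different `z` (for example the value for 90% or 99% confidence) changes only the width of the interval, not its center.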


Categorical Data

You can use SPSS as long as the number of successes and the number of failures both exceed 5

To get the confidence intervals, you first have to define a numeric version of your variable that classifies whether an observation is a success or failure.

You then compute the 1-sample confidence interval from “descriptives” “Explore”: Demo


Categorical Data

If you set up your data in SPSS, the “mean” will be the proportion/fraction/percentage of 1’s

Data = 0 1 1 1 0 0 0 1 0 0

n = 10

Mean = 4/10 = .40

$\hat{\pi}$ = .40
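The same bookkeeping in code, as a small illustration: with successes coded 1 and failures coded 0, the mean of the data is the sample fraction. The check on the counts uses the rule stated earlier in the lecture (more than 5 successes and more than 5 failures); the variable names are made up here.

```python
data = [0, 1, 1, 1, 0, 0, 0, 1, 0, 0]  # 1 = success, 0 = failure

n = len(data)
successes = sum(data)
failures = n - successes

# The mean of 0/1 data is the fraction of 1's.
p_hat = successes / n
print(n, successes, p_hat)

# Rule from the lecture: the CI needs more than 5 successes and more than 5 failures.
ci_conditions_met = successes > 5 and failures > 5
print("CI conditions met:", ci_conditions_met)
```

With only 4 successes, this toy data set does not meet that rule, so it illustrates the coding but not the confidence interval.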


Boxers versus briefs for males

Case Processing Summary

                                             Cases
                                Valid            Missing           Total
                                N     Percent    N     Percent     N     Percent
Boxers or Briefs Preference     188   100.0%     0     .0%         188   100.0%

In this output, boxers = 1 and briefs = 0


Boxers versus briefs for males: what % prefer boxers? In the sample, 46.81%. In the population???

Descriptives: Boxers or Briefs Preference

                                                   Statistic    Std. Error
Mean                                               .4681        3.649E-02
95% Confidence Interval for Mean    Lower Bound    .3961
                                    Upper Bound    .5401
5% Trimmed Mean                                    .4645
Median                                             .0000
Variance                                           .250
Std. Deviation                                     .5003
Minimum                                            .00
Maximum                                            1.00
Range                                              1.00
Interquartile Range                                1.0000
Skewness                                           .129         .177
Kurtosis                                           -2.005       .353

In this output, boxers = 1 and briefs = 0. The proportion of 1’s is the mean


Boxers versus briefs for males: what % prefer boxers? Between 39.61% and 54.01%

Descriptives: Gender = Male; Numeric Boxers (0 = Briefs, 1 = Boxers)

                                                   Statistic    Std. Error
Mean                                               .4681        3.649E-02
95% Confidence Interval for Mean    Lower Bound    .3961
                                    Upper Bound    .5401
5% Trimmed Mean                                    .4645
Median                                             .0000
Variance                                           .250
Std. Deviation                                     .5003
Minimum                                            .00
Maximum                                            1.00
Range                                              1.00
Interquartile Range                                1.0000
Skewness                                           .129         .177
Kurtosis                                           -2.005       .353


Boxers versus briefs

In the sample, 46.81% of the men preferred boxers to briefs: 53.19% preferred briefs.

Between 39.61% and 54.01% of men prefer boxers to briefs (95% CI)

Is there enough evidence to conclude that men generally prefer briefs?


No, since 50% is in the CI! This means that it is plausible (at 95% confidence) that 50% prefer boxers and 50% prefer briefs, i.e., $\pi$ = 0.50.
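The boxers-versus-briefs interval can be approximated by hand with the lecture's formula. The sketch below assumes 88 of the 188 men chose boxers; that count is not stated in the output and is inferred from 0.4681 x 188. The SPSS limits (.3961 and .5401) are slightly wider than what this gives, which appears to be because SPSS Explore reports a t-based interval for a mean rather than the z-based interval used here.

```python
from math import sqrt

n = 188
boxers = 88  # assumed count: 0.4681 x 188 rounds to 88
p_hat = boxers / n

# Normal-approximation 95% CI: p_hat +/- 1.96 * sqrt(p_hat * (1 - p_hat) / n)
se_hat = sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - 1.96 * se_hat, p_hat + 1.96 * se_hat

print(round(p_hat, 4), round(low, 4), round(high, 4))
print("0.50 inside the interval:", low <= 0.50 <= high)
```

Either version of the interval contains 0.50, so the conclusion about boxers and briefs is the same.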


Sample Size Calculations

The standard error of the sample fraction is

$$\sigma_{\hat{\pi}} = \sqrt{\frac{\pi(1-\pi)}{n}}$$

If you want a $(1-\alpha)100\%$ CI to be $\hat{\pi} \pm E$, you should set

$$E = z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}$$


Sample Size Calculations

Start from

$$E = z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}$$

This means that

$$n = z_{\alpha/2}^{2}\,\frac{\pi(1-\pi)}{E^{2}}$$


Sample Size Calculations

The small problem is that you do not know $\pi$. You have two choices:

Make a guess for $\pi$

Set $\pi$ = 0.50 and calculate (most conservative, since it results in the largest sample size)

Most polling operations make the latter choice, since it is most conservative

$$n = z_{\alpha/2}^{2}\,\frac{\pi(1-\pi)}{E^{2}}$$


Sample Size Calculations: Examples

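Set E = 0.04, 95% CI, you guess that $\pi$ = 0.30

$$n = z_{\alpha/2}^{2}\,\frac{\pi(1-\pi)}{E^{2}} = 1.96^{2}\,\frac{.3(1-.3)}{.04^{2}} = 504.2$$

That is about 504 (505 if you always round up to guarantee the margin E)

You have no good guess: set $\pi$ = 0.50

$$n = z_{\alpha/2}^{2}\,\frac{\pi(1-\pi)}{E^{2}} = 1.96^{2}\,\frac{.5(1-.5)}{.04^{2}} = 600.25$$

That is rounded up to 601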
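The two sample-size examples can be reproduced with a short function. This is a sketch only: the name `sample_size` and its default arguments are illustrative, and rounding up with `ceil` is a common convention assumed here, not something stated in the slides.

```python
from math import ceil


def sample_size(margin: float, pi: float = 0.5, z: float = 1.96) -> float:
    """Unrounded n = z^2 * pi * (1 - pi) / E^2 for a CI of half-width E."""
    return z**2 * pi * (1 - pi) / margin**2


# E = 0.04 at 95% confidence, with a guess of 0.30 and with the conservative 0.50.
for guess in (0.30, 0.50):
    n_raw = sample_size(0.04, guess)
    print(guess, round(n_raw, 2), ceil(n_raw))
```

Leaving `pi` at its default of 0.5 gives the conservative (largest) sample size that polling operations typically use.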