Date posted: 24-Dec-2015
Sampling and Sample Size, Part 2
Cally Ardington
Lecture Outline
• Standard deviation and standard error
• Detecting impact: background
• Hypothesis testing
• Power
• The ingredients of power
Case 2: Remedial Education in India. Evaluating the Balsakhi Program
• We implement the Balsakhi program
• Incorporating random assignment into the program
Post-test: control & treatment
Is this impact statistically significant?
A. Yes
B. No
C. Don’t know
[Figure: distribution of post-test scores (x-axis: test scores, y-axis: frequency) for the control and treatment groups, with the control μ and treatment μ marked]
• The Law of Large Numbers and the Central Limit Theorem allow us to do hypothesis testing to determine whether our findings are statistically significant
Hypothesis Testing
• In criminal law, most institutions follow the rule: “innocent until proven guilty”
• The presumption is that the accused is innocent and the burden is on the prosecutor to show guilt
• The jury or judge starts with the “null hypothesis” that the accused person is innocent
• The prosecutor has a hypothesis that the accused person is guilty
Hypothesis Testing
• In program evaluation, instead of “presumption of innocence,” the rule is: “presumption of insignificance”
• The “null hypothesis” (H0) is that there was no (zero) impact of the program
• The burden of proof is on the evaluator to show a significant difference
Hypothesis Testing
• If it is very unlikely (less than a 5% probability) that the difference is solely due to chance:
  • We “reject our null hypothesis”
  • We may now say: “our program has a statistically significant impact”
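As a sketch of what this looks like in practice, the snippet below simulates post-test scores for a control and a treatment group and runs a simple two-sample z-test at the 5% level. All of the numbers (means, standard deviation, sample sizes) are invented for illustration; this is not the lecture's own code.

```python
import math
import random

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def two_sample_z_test(treat, control):
    """Difference in means and a two-sided p-value
    (normal approximation; fine for reasonably large samples)."""
    nt, nc = len(treat), len(control)
    mt, mc = sum(treat) / nt, sum(control) / nc
    vt = sum((x - mt) ** 2 for x in treat) / (nt - 1)
    vc = sum((x - mc) ** 2 for x in control) / (nc - 1)
    se = math.sqrt(vt / nt + vc / nc)
    z = (mt - mc) / se
    p = 2 * (1 - normal_cdf(abs(z)))
    return mt - mc, p

random.seed(1)
control = [random.gauss(50, 10) for _ in range(200)]  # no program
treat = [random.gauss(55, 10) for _ in range(200)]    # hypothetical 5-point impact
diff, p = two_sample_z_test(treat, control)
print(f"estimated impact = {diff:.1f}, p = {p:.4f}")
if p < 0.05:
    print("reject H0: statistically significant impact")
```

With samples this large relative to the noise, the 5-point impact is comfortably distinguishable from chance.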
Hypothesis Testing: Conclusions
Type I and II errors

                YOU CONCLUDE
THE TRUTH       Effective        No Effect
Effective       (correct)        Type II Error
No Effect       Type I Error     (correct)
What is the significance level?
• Type I error: rejecting the null hypothesis even though it is true (false positive)
• Significance level: The probability that we will reject the null hypothesis even though it is true
Theoretical Sampling Distribution
[Figure: sampling distribution of the control group mean, centered on H0]
Impose Significance Level of 5%
[Figure: the 95% confidence interval spans 1.96 SD on either side of H0; estimates outside it lead us to reject H0]
• Type II Error: failing to reject the null hypothesis (concluding there is no difference) when the null hypothesis is in fact false
• Power: if there is a measurable effect of our intervention (the null hypothesis is false), the probability that we will detect it (reject the null hypothesis)
• Power = 1 − Probability of Type II Error
What is Power?
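Under the normal approximation, power at a two-sided 5% significance level can be computed directly from the standardized effect size and the sample size. This is a generic textbook formula for a two-group comparison with equal arms, a sketch rather than anything from the lecture itself:

```python
import math

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(delta, n_per_arm, z_crit=1.96):
    """Probability of rejecting H0 when the true standardized effect
    is delta, for a two-group design with equal arms."""
    se = math.sqrt(2.0 / n_per_arm)  # SE of the difference, in SD units
    return normal_cdf(delta / se - z_crit)

print(f"{power(0.2, 100):.2f}")  # small effect: low power
print(f"{power(0.5, 100):.2f}")  # larger effect: much higher power
```

The same function makes the "ingredients" below concrete: power rises with the effect size delta and with the sample size n_per_arm.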
[Figure: sampling distributions under H0 (control) and Hβ (treatment)]
Impose significance level of 5%
[Figure: anything between the critical lines cannot be distinguished from 0]
Can we distinguish Hβ from H0?
[Figure: the shaded area shows the % of time we would find Hβ true if it was; this shaded share of the Hβ distribution is the power]
Type I and II errors

                YOU CONCLUDE
THE TRUTH       Effective                                         No Effect
Effective       (correct, probability = power)                    Type II Error
No Effect       Type I Error (probability = significance level)   (correct)
Before the experiment
• Assume two effects: no effect and treatment effect β
[Figure: sampling distributions under H0 and Hβ]
Impose significance level of 5%
[Figure: anything between the critical lines cannot be distinguished from 0]
Can we distinguish Hβ from H0?
[Figure: the shaded area shows the % of time we would find Hβ true if it was; this is the power]
What influences power?
• What factors change the proportion of the research hypothesis distribution that is shaded, i.e. the proportion that falls to the right (or left) of the null hypothesis curve's critical value?
• Understanding this helps us design more powerful experiments
Lecture Outline Standard deviation and standard error
•Detecting impact Background
Hypothesis testing Power
The ingredients of power
Power: Main Ingredients
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Clustering
Effect Size: 1*SD
• Hypothesized effect size determines distance between means
[Figure: H0 and Hβ sampling distributions one standard deviation apart]
Effect Size = 1*SD
If the true impact was 1*SD, the null hypothesis would be rejected only 26% of the time: Power = 26%
[Figure: shaded power region when H0 and Hβ are 1 SD apart]
Effect Size: 3*SD
• A bigger hypothesized effect size puts the distributions farther apart
Effect size 3*SD: Power = 91%
• A bigger effect size means more power
[Figure: with a 3 SD effect, the shaded power region covers 91% of the Hβ distribution]
What effect size should you use when designing your experiment?
A. Smallest effect size that is still cost effective
B. Largest effect size you expect your program to produce
C. Both
D. Neither
Picking an effect size
• What is the smallest effect that would justify adopting the program?
  • Cost of this program vs. the benefits it brings
  • Cost of this program vs. the alternative use of the money
• If the effect is smaller than that, it might as well be zero: we are not interested in proving that a very small effect is different from zero
• In contrast, any effect larger than that would justify adopting the program: we want to be able to distinguish it from zero
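One way to operationalise this reasoning: fix the significance level and a target power, then compute the minimum detectable effect (MDE) your design can distinguish from zero, and check that it is no larger than the smallest effect that would justify adoption. This uses the standard normal-approximation formula for an equal-split two-group design; it is a sketch, not the lecture's own calculation:

```python
import math

def mde(n_per_arm, z_alpha=1.96, z_power=0.84):
    """Minimum detectable effect in SD units at 5% two-sided
    significance and 80% power, equal treatment/control arms."""
    se = math.sqrt(2.0 / n_per_arm)  # SE of the difference, SD units
    return (z_alpha + z_power) * se

# With roughly 393 people per arm we can just detect a 0.2 SD effect
print(f"{mde(393):.2f}")
```

If the smallest effect worth acting on is larger than the MDE, the design is adequately powered; if not, the sample needs to grow.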
Effect size and take-up
• Let’s say we believe the impact on our participants is “3”
• What happens if take-up is 1/3?
• Let’s show this graphically
Effect Size: 3*SD
[Figure: Hβ centered 3 SD from H0: the impact on our participants is “3”]
Take-up is 33%: the effect size is 1/3rd
• The hypothesized effect size determines the distance between the means
[Figure: with 33% take-up the distributions are only 1 standard deviation apart. Back to: Power = 26%]
Take-up is reflected in the effect size
[Figure: the shaded power region shrinks to match the diluted 1 SD effect]
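The dilution above is just arithmetic: with partial take-up, the average effect across everyone assigned to treatment (the intent-to-treat effect) is the effect on actual participants scaled by the take-up rate. A minimal sketch with the lecture's numbers:

```python
impact_on_participants = 3.0  # effect in SD units on those who take up
take_up = 1 / 3               # only a third of the treatment group participates

# Average effect across everyone assigned to treatment
itt_effect = impact_on_participants * take_up
print(itt_effect)  # back to a 1 SD effect size
```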
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Imperfect compliance
• Clustering
Power: Main Ingredients
By increasing sample size you increase…
A. Accuracy
B. Precision
C. Both
D. Neither
E. Don’t know
Power: Effect size = 1 SD, Sample size = 4
[Figure: Power = 64%]
Power: Effect size = 1 SD, Sample size = 9
[Figure: Power = 91%]
Sample Size
• A larger sample shrinks the standard error, making both sampling distributions narrower and increasing power
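The effect of sample size on power can be traced with the same normal approximation: the standard error shrinks with the square root of n, so each doubling of the sample buys less than the last. The n values below are illustrative, and the percentages come from a generic equal-arms formula, so they need not match the slides' own standard-error setup exactly:

```python
import math

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

effect = 1.0  # hypothesized effect of 1 SD
for n_per_arm in (4, 9, 16, 25):
    se = math.sqrt(2.0 / n_per_arm)      # SE of the difference, SD units
    pw = normal_cdf(effect / se - 1.96)  # power at 5% two-sided significance
    print(f"n per arm = {n_per_arm:2d}: power = {pw:.0%}")
```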
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Imperfect compliance
• Clustering
Power: Main Ingredients
• How large an effect you can detect with a given sample depends on how variable the outcome is
• Example: if all children have very similar learning levels without the program, even a very small impact will be easy to detect
• We can try to “absorb” variance:
  • Using a baseline
  • Controlling for other variables
• In practice, controlling for other variables (besides the baseline outcome) buys you very little
Variance
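A quick simulation of the "absorb variance with a baseline" idea: if the endline outcome is largely the baseline plus noise, differencing out the baseline leaves much less variance for the treatment effect to stand out against. The score distributions here are invented for illustration:

```python
import random
import statistics

random.seed(0)

# Hypothetical test scores: endline = baseline + noise
baseline = [random.gauss(50, 10) for _ in range(1000)]
endline = [b + random.gauss(0, 4) for b in baseline]

raw_sd = statistics.stdev(endline)  # variation in raw endline scores
gain_sd = statistics.stdev(e - b for e, b in zip(endline, baseline))  # after absorbing the baseline

print(f"SD of endline scores:             {raw_sd:.1f}")
print(f"SD of gains (endline - baseline): {gain_sd:.1f}")
```

The residual spread after differencing is close to the noise SD alone, so a given effect size is far easier to detect.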
[Figure: histograms of an outcome with mean 50 (control) and mean 60 (treatment) under low, medium, and high standard deviations. The higher the standard deviation, the more the distributions overlap and the less precise our estimates]
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Clustering
Power: Main Ingredients
Sample split: 50% C, 50% T
• An equal split gives distributions that are the same “fatness”
[Figure: Power = 91%]
What if it’s not a 50-50 split?
• What happens to the relative fatness if the split is not 50-50?
• Say 25-75?
Sample split: 25% C, 75% T
• Uneven split: the smaller group’s distribution is fatter, which is not efficient, i.e. less power
[Figure: Power = 83%]
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Clustering
Power: Main Ingredients
Clustered design: definition
• In sampling: clusters of individuals (e.g. schools, communities) are randomly selected from the population before selecting individuals for observation
• In randomized evaluation: clusters of individuals are randomly assigned to different treatment groups
Reasons for adopting cluster randomization
• Need to minimize or remove contamination
  • Example: in the deworming program, the school was chosen as the unit because worms are contagious
• Basic feasibility considerations
  • Example: the PROGRESA program would not have been politically feasible if some families were included and not others
• Only natural choice
  • Example: any education intervention that affects an entire classroom (e.g. flipcharts, teacher training)
Clustered design: intuition
• You want to know how close the upcoming national elections will be
• Method 1: Randomly select 50 people from entire Indian population
• Method 2: Randomly select 5 families, and ask ten members of each family their opinion
[Figure: illustrations of low vs. HIGH intra-cluster correlation (ICC), aka ρ (rho)]
All uneducated people live in one village. People with only primary education live in another. College grads live in a third, etc. The ICC (ρ) on education will be…
A. High
B. Low
C. No effect on rho
D. Don’t know
Clustered Design: Intuition
• The outcomes within a family are likely correlated; similarly for children within a school, families within a village, etc.
• Each additional individual does not bring entirely new information
• At the limit, imagine all outcomes within a cluster are exactly the same: the effective sample size is the number of clusters, not the number of individuals
• Precision will depend on the number of clusters, the sample size within clusters, and the within-cluster correlation
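These three ingredients are often summarized by the design effect, 1 + (m − 1)ρ, where m is the cluster size and ρ the intra-cluster correlation: it measures how much clustering inflates the variance, or equivalently how far the effective sample size falls below the raw headcount. The formula is standard; the cluster counts below are made up:

```python
def design_effect(m, rho):
    """Variance inflation from clusters of size m with
    intra-cluster correlation rho."""
    return 1 + (m - 1) * rho

def effective_n(n_total, m, rho):
    """Sample size worth of independent observations."""
    return n_total / design_effect(m, rho)

n, m = 500, 10  # 50 clusters of 10 people each
print(f"rho = 0.05: effective n = {effective_n(n, m, 0.05):.0f}")
print(f"rho = 0.80: effective n = {effective_n(n, m, 0.80):.0f}")
print(f"rho = 1.00: effective n = {effective_n(n, m, 1.00):.0f}")  # = number of clusters
```

At ρ = 1 the effective sample size collapses to the number of clusters, exactly the limiting case described above, which is why adding clusters usually buys more power than adding people within clusters when ρ is high.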
If ICC (ρ) is high, what is a more efficient way of increasing power?
A. Include more clusters in the sample
B. Include more people in clusters
C. Both
D. Don’t know
Standardized Effect Sizes
• The standardized effect size is the effect size divided by the standard deviation of the outcome
• δ = effect size / standard deviation
• An effect size of 0.2 is considered modest: the average member of the treatment group had a better outcome than the 58th percentile of the control group. Required N under 50% treatment: 786
• An effect size of 0.5 is considered large: the average member of the treatment group had a better outcome than the 69th percentile of the control group. Required N: 126
• An effect size of 0.8 is considered VERY large: the average member of the treatment group had a better outcome than the 79th percentile of the control group. Required N: 50
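These numbers can be reproduced from two standard formulas: the percentile is the normal CDF evaluated at δ, and the required N (for 80% power at 5% two-sided significance with 50% treatment) comes from n per arm = 2(z_α + z_β)²/δ². A sketch that recovers the figures above:

```python
import math

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def required_total_n(delta, z_alpha=1.96, z_beta=0.8416):
    """Total N for 80% power at 5% two-sided significance, 50% treatment."""
    n_per_arm = 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    return 2 * math.ceil(n_per_arm)

for delta in (0.2, 0.5, 0.8):
    percentile = round(normal_cdf(delta) * 100)
    print(f"delta = {delta}: beats the {percentile}th percentile "
          f"of control, required N = {required_total_n(delta)}")
```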
Conclusion
• Even with a perfectly valid experiment, the ability to make inferences depends on the SIZE OF THE SAMPLE.
• In designing an evaluation, you need to balance tradeoffs to ensure that your sample is large enough, given:
  • Desired power and significance levels
  • Anticipated effect size
  • The amount of “noise” (underlying variance in the outcome variable)
  • Treatment-control size ratio (feasibility and cost)
  • Take-up of treatment
  • Clustering
The Important Stuff
How confident are we of our results?
• We have a sample, not the population. The Central Limit Theorem and the Law of Large Numbers tell us important things about the sampling distribution that allow for HYPOTHESIS TESTING.
• Hypothesis testing enables us to establish whether our results are statistically significant.
• There are two kinds of errors we can make in hypothesis testing:
  > Type 1: the intervention is not effective and we find it to be effective. We FIX this at 5%.
  > Type 2: the intervention is effective and we fail to find an impact. The smaller the probability of this occurring, the higher our power.
• Power depends on five things:
  > Sample size
  > The size of the effect
  > The proportion of your sample in the treatment group vs. the control group
  > The variance
  > Clustering