Date posted: 24-Dec-2015
Sampling and Sample Size, Part 2
Cally Ardington
Lecture Outline
• Standard deviation and standard error
• Detecting impact: background
• Hypothesis testing
• Power
• The ingredients of power
Case 2: Remedial Education in India. Evaluating the Balsakhi Program
• We implement the Balsakhi program
• Incorporating random assignment into the program
Post-test: control & treatment
Is this impact statistically significant?
A. Yes
B. No
C. Don’t know
[Figure: distribution of post-test scores (x-axis: test scores, y-axis: frequency) for the control and treatment groups, with the control μ and treatment μ marked]
• The Law of Large Numbers and the Central Limit Theorem allow us to do hypothesis testing to determine whether our findings are statistically significant
Hypothesis Testing
• In criminal law, most institutions follow the rule: “innocent until proven guilty”
• The presumption is that the accused is innocent and the burden is on the prosecutor to show guilt
• The jury or judge starts with the “null hypothesis” that the accused person is innocent
• The prosecutor has a hypothesis that the accused person is guilty
Hypothesis Testing
• In program evaluation, instead of “presumption of innocence,” the rule is: “presumption of insignificance”
• The “null hypothesis” (H0) is that there was no (zero) impact of the program
• The burden of proof is on the evaluator to show a significant difference
Hypothesis Testing
• If it is very unlikely (less than a 5% probability) that the difference is solely due to chance:
  • We “reject our null hypothesis”
  • We may now say: “our program has a statistically significant impact”
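As a sketch of what this looks like in practice, the snippet below simulates post-test scores for a control and a treatment group and runs a simple two-sample z-test at the 5% level. All of the numbers (means, standard deviation, sample sizes) are invented for illustration; this is not the lecture's own code.

```python
import math
import random

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def two_sample_z_test(treat, control):
    """Difference in means and a two-sided p-value
    (normal approximation; fine for reasonably large samples)."""
    nt, nc = len(treat), len(control)
    mt, mc = sum(treat) / nt, sum(control) / nc
    vt = sum((x - mt) ** 2 for x in treat) / (nt - 1)
    vc = sum((x - mc) ** 2 for x in control) / (nc - 1)
    se = math.sqrt(vt / nt + vc / nc)
    z = (mt - mc) / se
    p = 2 * (1 - normal_cdf(abs(z)))
    return mt - mc, p

random.seed(1)
control = [random.gauss(50, 10) for _ in range(200)]  # no program
treat = [random.gauss(55, 10) for _ in range(200)]    # hypothetical 5-point impact
diff, p = two_sample_z_test(treat, control)
print(f"estimated impact = {diff:.1f}, p = {p:.4f}")
if p < 0.05:
    print("reject H0: statistically significant impact")
```

With samples this large relative to the noise, the 5-point impact is comfortably distinguishable from chance.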
Hypothesis Testing: Conclusions
Type I and II errors

                YOU CONCLUDE
THE TRUTH       Effective        No Effect
Effective       (correct)        Type II Error
No Effect       Type I Error     (correct)
What is the significance level?
• Type I error: rejecting the null hypothesis even though it is true (false positive)
• Significance level: The probability that we will reject the null hypothesis even though it is true
Theoretical Sampling Distribution
[Figure: sampling distribution of the control group mean, centered on H0]
Impose Significance Level of 5%
[Figure: the 95% confidence interval spans 1.96 SD on either side of H0; estimates outside it lead us to reject H0]
• Type II Error: failing to reject the null hypothesis (concluding there is no difference) when the null hypothesis is in fact false
• Power: if there is a measurable effect of our intervention (the null hypothesis is false), the probability that we will detect it (reject the null hypothesis)
• Power = 1 − Probability of Type II Error
What is Power?
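Under the normal approximation, power at a two-sided 5% significance level can be computed directly from the standardized effect size and the sample size. This is a generic textbook formula for a two-group comparison with equal arms, a sketch rather than anything from the lecture itself:

```python
import math

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(delta, n_per_arm, z_crit=1.96):
    """Probability of rejecting H0 when the true standardized effect
    is delta, for a two-group design with equal arms."""
    se = math.sqrt(2.0 / n_per_arm)  # SE of the difference, in SD units
    return normal_cdf(delta / se - z_crit)

print(f"{power(0.2, 100):.2f}")  # small effect: low power
print(f"{power(0.5, 100):.2f}")  # larger effect: much higher power
```

The same function makes the "ingredients" below concrete: power rises with the effect size delta and with the sample size n_per_arm.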
[Figure: sampling distributions under H0 (control) and Hβ (treatment)]
Impose significance level of 5%
[Figure: anything between the critical lines cannot be distinguished from 0]
Can we distinguish Hβ from H0?
[Figure: the shaded area shows the % of time we would find Hβ true if it was; this shaded share of the Hβ distribution is the power]
Type I and II errors

                YOU CONCLUDE
THE TRUTH       Effective                                         No Effect
Effective       (correct, probability = power)                    Type II Error
No Effect       Type I Error (probability = significance level)   (correct)
Before the experiment
• Assume two effects: no effect and treatment effect β
[Figure: sampling distributions under H0 and Hβ]
Impose significance level of 5%
[Figure: anything between the critical lines cannot be distinguished from 0]
Can we distinguish Hβ from H0?
[Figure: the shaded area shows the % of time we would find Hβ true if it was; this is the power]
What influences power?
• What factors change the proportion of the research hypothesis distribution that is shaded, i.e. the proportion that falls to the right (or left) of the null hypothesis curve's critical value?
• Understanding this helps us design more powerful experiments
Lecture Outline Standard deviation and standard error
•Detecting impact Background
Hypothesis testing Power
The ingredients of power
Power: Main Ingredients
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Clustering
Effect Size: 1*SD
• Hypothesized effect size determines distance between means
[Figure: H0 and Hβ sampling distributions one standard deviation apart]
Effect Size = 1*SD
If the true impact was 1*SD, the null hypothesis would be rejected only 26% of the time: Power = 26%
[Figure: shaded power region when H0 and Hβ are 1 SD apart]
Effect Size: 3*SD
• A bigger hypothesized effect size puts the distributions farther apart
Effect size 3*SD: Power = 91%
• A bigger effect size means more power
[Figure: with a 3 SD effect, the shaded power region covers 91% of the Hβ distribution]
What effect size should you use when designing your experiment?
A. Smallest effect size that is still cost effective
B. Largest effect size you expect your program to produce
C. Both
D. Neither
Picking an effect size
• What is the smallest effect that would justify adopting the program?
  • Cost of this program vs. the benefits it brings
  • Cost of this program vs. the alternative use of the money
• If the effect is smaller than that, it might as well be zero: we are not interested in proving that a very small effect is different from zero
• In contrast, any effect larger than that would justify adopting the program: we want to be able to distinguish it from zero
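One way to operationalise this reasoning: fix the significance level and a target power, then compute the minimum detectable effect (MDE) your design can distinguish from zero, and check that it is no larger than the smallest effect that would justify adoption. This uses the standard normal-approximation formula for an equal-split two-group design; it is a sketch, not the lecture's own calculation:

```python
import math

def mde(n_per_arm, z_alpha=1.96, z_power=0.84):
    """Minimum detectable effect in SD units at 5% two-sided
    significance and 80% power, equal treatment/control arms."""
    se = math.sqrt(2.0 / n_per_arm)  # SE of the difference, SD units
    return (z_alpha + z_power) * se

# With roughly 393 people per arm we can just detect a 0.2 SD effect
print(f"{mde(393):.2f}")
```

If the smallest effect worth acting on is larger than the MDE, the design is adequately powered; if not, the sample needs to grow.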
Effect size and take-up
• Let’s say we believe the impact on our participants is “3”
• What happens if take-up is 1/3?
• Let’s show this graphically
Effect Size: 3*SD
[Figure: Hβ centered 3 SD from H0: the impact on our participants is “3”]
Take-up is 33%: the effect size is 1/3rd
• The hypothesized effect size determines the distance between the means
[Figure: with 33% take-up the distributions are only 1 standard deviation apart. Back to: Power = 26%]
Take-up is reflected in the effect size
[Figure: the shaded power region shrinks to match the diluted 1 SD effect]
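The dilution above is just arithmetic: with partial take-up, the average effect across everyone assigned to treatment (the intent-to-treat effect) is the effect on actual participants scaled by the take-up rate. A minimal sketch with the lecture's numbers:

```python
impact_on_participants = 3.0  # effect in SD units on those who take up
take_up = 1 / 3               # only a third of the treatment group participates

# Average effect across everyone assigned to treatment
itt_effect = impact_on_participants * take_up
print(itt_effect)  # back to a 1 SD effect size
```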
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Imperfect compliance
• Clustering
Power: Main Ingredients
By increasing sample size you increase…
A. Accuracy
B. Precision
C. Both
D. Neither
E. Don’t know
Power: Effect size = 1 SD, Sample size = 4
[Figure: Power = 64%]
Power: Effect size = 1 SD, Sample size = 9
[Figure: Power = 91%]
Sample Size
• A larger sample shrinks the standard error, making both sampling distributions narrower and increasing power
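The effect of sample size on power can be traced with the same normal approximation: the standard error shrinks with the square root of n, so each doubling of the sample buys less than the last. The n values below are illustrative, and the percentages come from a generic equal-arms formula, so they need not match the slides' own standard-error setup exactly:

```python
import math

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

effect = 1.0  # hypothesized effect of 1 SD
for n_per_arm in (4, 9, 16, 25):
    se = math.sqrt(2.0 / n_per_arm)      # SE of the difference, SD units
    pw = normal_cdf(effect / se - 1.96)  # power at 5% two-sided significance
    print(f"n per arm = {n_per_arm:2d}: power = {pw:.0%}")
```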
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Imperfect compliance
• Clustering
Power: Main Ingredients
• How large an effect you can detect with a given sample depends on how variable the outcome is
• Example: if all children have very similar learning levels without the program, even a very small impact will be easy to detect
• We can try to “absorb” variance:
  • Using a baseline
  • Controlling for other variables
• In practice, controlling for other variables (besides the baseline outcome) buys you very little
Variance
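A quick simulation of the "absorb variance with a baseline" idea: if the endline outcome is largely the baseline plus noise, differencing out the baseline leaves much less variance for the treatment effect to stand out against. The score distributions here are invented for illustration:

```python
import random
import statistics

random.seed(0)

# Hypothetical test scores: endline = baseline + noise
baseline = [random.gauss(50, 10) for _ in range(1000)]
endline = [b + random.gauss(0, 4) for b in baseline]

raw_sd = statistics.stdev(endline)  # variation in raw endline scores
gain_sd = statistics.stdev(e - b for e, b in zip(endline, baseline))  # after absorbing the baseline

print(f"SD of endline scores:             {raw_sd:.1f}")
print(f"SD of gains (endline - baseline): {gain_sd:.1f}")
```

The residual spread after differencing is close to the noise SD alone, so a given effect size is far easier to detect.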
[Figure: histograms of an outcome with mean 50 (control) and mean 60 (treatment) under low, medium, and high standard deviations. The higher the standard deviation, the more the distributions overlap and the less precise our estimates]
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Clustering
Power: Main Ingredients
Sample split: 50% C, 50% T
• An equal split gives distributions that are the same “fatness”
[Figure: Power = 91%]
What if it’s not a 50-50 split?
• What happens to the relative fatness if the split is not 50-50?
• Say 25-75?
Sample split: 25% C, 75% T
• Uneven split: the smaller group’s distribution is fatter, which is not efficient, i.e. less power
[Figure: Power = 83%]
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Clustering
Power: Main Ingredients
Clustered design: definition
• In sampling: clusters of individuals (e.g. schools, communities) are randomly selected from the population before selecting individuals for observation
• In randomized evaluation: clusters of individuals are randomly assigned to different treatment groups
Reasons for adopting cluster randomization
• Need to minimize or remove contamination
  • Example: in the deworming program, the school was chosen as the unit because worms are contagious
• Basic feasibility considerations
  • Example: the PROGRESA program would not have been politically feasible if some families were included and not others
• Only natural choice
  • Example: any education intervention that affects an entire classroom (e.g. flipcharts, teacher training)
Clustered design: intuition
• You want to know how close the upcoming national elections will be
• Method 1: Randomly select 50 people from entire Indian population
• Method 2: Randomly select 5 families, and ask ten members of each family their opinion
[Figure: illustrations of low vs. HIGH intra-cluster correlation (ICC), aka ρ (rho)]
All uneducated people live in one village. People with only primary education live in another. College grads live in a third, etc. The ICC (ρ) on education will be…
A. High
B. Low
C. No effect on rho
D. Don’t know
Clustered Design: Intuition
• The outcomes within a family are likely correlated; similarly for children within a school, families within a village, etc.
• Each additional individual does not bring entirely new information
• At the limit, imagine all outcomes within a cluster are exactly the same: the effective sample size is the number of clusters, not the number of individuals
• Precision will depend on the number of clusters, the sample size within clusters, and the within-cluster correlation
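These three ingredients are often summarized by the design effect, 1 + (m − 1)ρ, where m is the cluster size and ρ the intra-cluster correlation: it measures how much clustering inflates the variance, or equivalently how far the effective sample size falls below the raw headcount. The formula is standard; the cluster counts below are made up:

```python
def design_effect(m, rho):
    """Variance inflation from clusters of size m with
    intra-cluster correlation rho."""
    return 1 + (m - 1) * rho

def effective_n(n_total, m, rho):
    """Sample size worth of independent observations."""
    return n_total / design_effect(m, rho)

n, m = 500, 10  # 50 clusters of 10 people each
print(f"rho = 0.05: effective n = {effective_n(n, m, 0.05):.0f}")
print(f"rho = 0.80: effective n = {effective_n(n, m, 0.80):.0f}")
print(f"rho = 1.00: effective n = {effective_n(n, m, 1.00):.0f}")  # = number of clusters
```

At ρ = 1 the effective sample size collapses to the number of clusters, exactly the limiting case described above, which is why adding clusters usually buys more power than adding people within clusters when ρ is high.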
If ICC (ρ) is high, what is a more efficient way of increasing power?
A. Include more clusters in the sample
B. Include more people in clusters
C. Both
D. Don’t know
Standardized Effect Sizes
• The standardized effect size is the effect size divided by the standard deviation of the outcome
• δ = effect size / standard deviation
• An effect size of 0.2 is considered modest: the average member of the treatment group had a better outcome than the 58th percentile of the control group. Required N under 50% treatment: 786
• An effect size of 0.5 is considered large: the average member of the treatment group had a better outcome than the 69th percentile of the control group. Required N: 126
• An effect size of 0.8 is considered VERY large: the average member of the treatment group had a better outcome than the 79th percentile of the control group. Required N: 50
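These numbers can be reproduced from two standard formulas: the percentile is the normal CDF evaluated at δ, and the required N (for 80% power at 5% two-sided significance with 50% treatment) comes from n per arm = 2(z_α + z_β)²/δ². A sketch that recovers the figures above:

```python
import math

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def required_total_n(delta, z_alpha=1.96, z_beta=0.8416):
    """Total N for 80% power at 5% two-sided significance, 50% treatment."""
    n_per_arm = 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    return 2 * math.ceil(n_per_arm)

for delta in (0.2, 0.5, 0.8):
    percentile = round(normal_cdf(delta) * 100)
    print(f"delta = {delta}: beats the {percentile}th percentile "
          f"of control, required N = {required_total_n(delta)}")
```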
Conclusion
• Even with a perfectly valid experiment, the ability to make inferences depends on the SIZE OF THE SAMPLE.
• In designing an evaluation, you need to balance tradeoffs to ensure that your sample is large enough, given:
  • Desired power and significance levels
  • Anticipated effect size
  • The amount of “noise” (underlying variance in the outcome variable)
  • Treatment-control size ratio (feasibility and cost)
  • Take-up of treatment
  • Clustering
The Important Stuff
How confident are we of our results?
• We have a sample, not the population. The Central Limit Theorem and the Law of Large Numbers tell us important things about the sampling distribution that allow for HYPOTHESIS TESTING.
• Hypothesis testing enables us to establish whether our results are statistically significant.
• There are two kinds of errors we can make in hypothesis testing:
  > Type 1: the intervention is not effective and we find it to be effective. We FIX this at 5%.
  > Type 2: the intervention is effective and we fail to find an impact. The smaller the probability of this occurring, the higher our power.
• Power depends on five things:
  > Sample size
  > The size of the effect
  > The proportion of your sample in the treatment group vs. the control group
  > The variance
  > Clustering