Not completely in FPP but good stuff anyway
Inference when considering two populations
Inference for the difference of two parameters
Often we are interested in comparing the population average or the population proportion/percentage for two groups.
We can do these types of comparisons using CI’s and hypothesis tests.
General ideas and equations don’t change:
CI: estimate ± multiplier*SE
Test statistic: (observed – expected)/SE
Inference for P1 – P2
Let’s just jump right into an example
CI for P1 – P2
Estimate ± multiplier*SE
Multiplier comes from the z-table
Everything else we know about confidence intervals is the same:
Interpretation
What does 95% confidence mean
$$\hat{p}_1 - \hat{p}_2 \pm \text{multiplier} \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
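The interval for P1 – P2 can be sketched in a few lines of Python. This is only an illustration: the function name `two_proportion_ci` and the counts used in the call are hypothetical, and the default multiplier 1.96 is the usual z-table value for 95% confidence.

```python
import math

def two_proportion_ci(x1, n1, x2, n2, multiplier=1.96):
    """CI for p1 - p2: estimate ± multiplier * SE."""
    p1, p2 = x1 / n1, x2 / n2
    # SE of the difference of two independent sample proportions
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    estimate = p1 - p2
    return estimate - multiplier * se, estimate + multiplier * se

# Hypothetical counts, for illustration only: 120 of 400 vs. 90 of 450
low, high = two_proportion_ci(120, 400, 90, 450)
print(f"95% CI for p1 - p2: ({low:.3f}, {high:.3f})")
```

If the resulting interval excludes 0, the data point to a difference between the two population proportions at that confidence level.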
Inference for difference of two population means μ1 – μ2
Two possibilities in collecting data on two variables here:
Design 1: units are matched in pairs. Use “matched pairs inference”
Design 2: units are not matched in pairs. Use “two sample inferences”
Typical study designs
Matched pairs
A) two treatments given to each unit
B) units paired before treatments are assigned, then treatments are assigned randomly within pairs
Two samples
A) some units assigned to get only treatment a, and other units assigned to get only treatment b. Assignment is completely at random
B) units in two different groups compared on some survey variable
Matched pairs vs two samples
Data collected in two independent samples:
No matching, so creating values of some “difference” is meaningless.
A “matched pairs” analysis is mathematically wrong and gives incorrect CI’s and p-values.
Data collected in matched pairs:
Matching, when effective, reduces the SE.
A two sample analysis artificially inflates the SE, resulting in excessively wide CI’s and unreliable p-values.
An example towards the end of these slides will demonstrate this.
Inference in μ1 – μ2: matched pairs
General idea with matched pairs design is to compute the difference for each pair of observations and treat the differences as a single variable.
Measure y1 and y2 on each unit. Then for each unit compute d = y1 – y2.
Then find a confidence interval for the difference:
difference estimate ± multiplier*SE
average of differences ± t-table value * SD of differences/√n
Inference in μ1 – μ2: matched pairs
Do people perform better on tests when smelling flowers versus smelling nothing?
Hirsch and Johnston (1996) asked 21 subjects to work a maze while wearing a mask. The mask was either unscented or carried a floral scent. Each subject worked both mazes. The order of the masks was randomized to ensure a fair comparison of the two treatments. The response is the difference in completion times for the unscented and scented masks.
Example: Person 1 completed the maze in 30.60 seconds while wearing the unscented mask, and in 37.97 seconds while wearing the scented mask.
So, this person’s data value is –7.37 (30.60 – 37.97).
JMP output for odor example
The differences appear to follow the normal curve. There are no outliers.
Sample average difference is 0.96, suggesting people do better with the scented mask.
[Figure: distribution of the differences with a normal quantile plot; horizontal axis runs from -30 to 30]

Distributions: Difference

Moments
Mean                 0.9566667
Std Dev              12.547882
Std Err Mean         2.7381723
upper 95% Mean       6.6683939
lower 95% Mean       -4.755061
N                    21

Test Mean=value
Hypothesized Value   0
Actual Estimate      0.95667
df                   20
Std Dev              12.5479

t Test
Test Statistic       0.3494
Prob > |t|           0.7305
Prob > t             0.3652
Prob < t             0.6348
Conclusions from odors example
The 95% CI ranges from -4.76 to 6.67, which is too wide a range to determine whether floral odors help or hurt performance for these mazes. In other words, the data suggest that any effect of scented masks is small enough that we cannot estimate it with reasonable accuracy using these 21 subjects. We should collect more data to estimate the effect of the odor more precisely.
We also note that this study was very specific. The results may not be easily generalized to other populations, other tests, or other treatments.
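As a check on the matched pairs recipe (average of differences ± t-table value * SD of differences/√n), the short sketch below recomputes the odor interval from the summary statistics in the JMP output. The multiplier 2.086 is an assumption taken from a standard t-table (95% confidence, 20 df); it is not printed in the output itself.

```python
import math

# Summary statistics from the JMP output for the odor example
mean_diff = 0.9566667   # average of the (unscented - scented) differences
sd_diff = 12.547882     # SD of the differences
n = 21                  # number of subjects (pairs)

# Assumption: t-table multiplier for 95% confidence with n - 1 = 20 df
t_multiplier = 2.086

se = sd_diff / math.sqrt(n)            # SE of the average difference
lower = mean_diff - t_multiplier * se
upper = mean_diff + t_multiplier * se
print(f"SE = {se:.4f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The SE and the interval endpoints should agree with the "Std Err Mean" and the lower/upper 95% Mean lines of the JMP output up to rounding.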
Inference in μ1 – μ2: two samples
Pygmalion study
Researchers gave an IQ test to elementary school kids.
They randomly picked six kids and told teachers the test predicts these kids have high potential for accelerated growth.
They randomly picked a different six kids and told teachers the test predicts these kids have no potential for growth.
At end of school year, they gave IQ test again to all students.
They recorded the change in IQ scores of each student.
Let’s see what they found…
EDA for pygmalion study
It looks like being labeled “accelerated” leads to larger improvements than being labeled “no growth”
Let’s make a 99% CI to confirm this
[Figure: Improvement (axis from 0 to 20) by Growth Group (accelerated, none)]
Sample means and SD’s
Level         Number   Mean    SD      SE
accelerated   6        15.17   4.708   1.92
none          6        6.17    3.656   1.49
Sample difference is 9.00. The SE of this difference:
$$SE = \sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}} = \sqrt{SE_1^2 + SE_2^2} = 2.43$$
Pygmalion confidence interval
99% CI for difference in mean scores (accel – none):
Estimate ± multiplier*SE
Estimate is mean1 – mean2
Multiplier comes from the t-table (we will talk about df in a sec.)
SE of difference from the previous slide
$$\bar{x}_1 - \bar{x}_2 \pm \text{multiplier} \times (\text{SE of difference}) = 15.17 - 6.17 \pm 3.17 \times 2.43 \approx (1.30,\ 16.70)$$
Conclusions from the pygmalion study
The 99% CI ranges from 1.30 to 16.70, which lies entirely above zero. The data provide evidence that students labeled “accelerated” have higher avg. improvements in IQ than students labeled “no growth.” We are 99% confident the difference in averages is between 1.3 and 16.7 IQ points.
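The arithmetic behind this interval can be retraced with a small sketch. All numbers come from the slides (group means, group SEs, and the t-table multiplier 3.17 for 99% confidence with 10 df); the variable names are illustrative.

```python
import math

# Group summaries from the slides
mean_accel, se_accel = 15.17, 1.92
mean_none, se_none = 6.17, 1.49

# Multiplier used on the slide: t-table value for 99% confidence, 10 df
t_multiplier = 3.17

# SE of a difference of two independent sample means
se_diff = math.sqrt(se_accel**2 + se_none**2)
estimate = mean_accel - mean_none

lower = estimate - t_multiplier * se_diff
upper = estimate + t_multiplier * se_diff
print(f"SE of difference = {se_diff:.2f}")
print(f"99% CI = ({lower:.2f}, {upper:.2f})")
```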
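Degrees of Freedom
Use the Welch-Satterthwaite degrees of freedom formula
This is typically what a computer will give you
Conservative approach: use the smaller of n1-1 and n2-1
$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2-1}\left(\frac{s_2^2}{n_2}\right)^2}$$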
Oneway Analysis of Improvement By Growth Group
[Figure: Improvement (axis from 0 to 20) by Growth Group (accelerated, none)]
Excluded Rows 6

Means and Std Deviations
Level         Number   Mean      Std Dev   Std Err Mean   Lower 95%   Upper 95%
accelerated   6        15.1667   4.70815   1.9221         10.226      20.108
none          6        6.1667    3.65605   1.4926         2.330       10.003

t Test
accelerated-none
Assuming unequal variances
Difference     9.0000     t Ratio      3.698283
Std Err Dif    2.4336     DF           9.422114
Upper CL Dif   14.4677    Prob > |t|   0.0046
Lower CL Dif   3.5323     Prob > t     0.0023
Confidence     0.95       Prob < t     0.9977
[Figure: t distribution plot; horizontal axis runs from -10 to 10]
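The DF reported by JMP can be reproduced from the two sample SDs and sample sizes. The sketch below is a direct transcription of the Welch-Satterthwaite formula; the function name `welch_df` is illustrative.

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom."""
    v1 = s1**2 / n1   # squared SE of the first sample mean
    v2 = s2**2 / n2   # squared SE of the second sample mean
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# SDs and sample sizes from the JMP output for the Pygmalion study
df = welch_df(4.70815, 6, 3.65605, 6)

# Conservative alternative: the smaller of n1 - 1 and n2 - 1
conservative_df = min(6 - 1, 6 - 1)

print(f"Welch-Satterthwaite df = {df:.3f}")
print(f"Conservative df = {conservative_df}")
```

The Welch value is not a whole number, which is why computer output shows a fractional DF while a t-table lookup needs it rounded down or replaced by the conservative choice.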
Hypothesis tests for difference of two parameters
The main ideas of hypothesis tests remain the same
1) specify hypothesis
2) compute test statistic: (observed – expected)/SE
3) calculate p-value
4) make conclusions
Hypothesis test for p1 – p2
Herson (1971) examined whether men or women are more likely to suffer from nightmares. He asked a random sample of 160 men and 192 women whether they experienced nightmares “often” (at least once a month) or “seldom” (less than once a month).
In the sample, 55 men (34.4%) and 60 women (31.3%) said they suffered nightmares often. Is this 3.1 percentage point difference sufficient evidence of a sex-related difference in nightmare suffering?
Hypothesis test for p1 – p2
Step 1: Claim is that men and women suffer at different rates
Ho: p1 = p2 vs Ha: p1 ≠ p2, the same as
Ho: p1 – p2 = 0 vs Ha: p1 – p2 ≠ 0
Step 2: Compute test statistic
$$z = \frac{(\hat{p}_m - \hat{p}_f) - 0}{\sqrt{\frac{\hat{p}_m(1-\hat{p}_m)}{n_m} + \frac{\hat{p}_f(1-\hat{p}_f)}{n_f}}}$$
Hypothesis test for p1 – p2
Step 2 continued
$$z = \frac{(\hat{p}_m - \hat{p}_f) - 0}{\sqrt{\frac{\hat{p}_m(1-\hat{p}_m)}{n_m} + \frac{\hat{p}_f(1-\hat{p}_f)}{n_f}}} = \frac{(.344 - .313) - 0}{\sqrt{\frac{.344(1-.344)}{160} + \frac{.313(1-.313)}{192}}} = 0.62$$
Notice that the test statistic is simply the # of SE’s the sample difference in proportions is from 0 (the hypothesized difference).
Step 3: Compute the p-value
Since we are dealing with a two-sided alternative, we want the area under the normal curve to the left of -0.62 and to the right of 0.62.
P-value ≈ 0.55
Hypothesis test for p1 – p2
Step 4: Make a conclusion
This is a large p-value. We do not reject the null hypothesis. The data do not provide sufficient evidence to conclude that the proportion of men who have nightmares is different from that of women.
As a reminder, how do we interpret the value 0.55?
Assuming the null hypothesis is true (i.e. men and women are equally likely to have nightmares), there is a 55% chance of getting a sample difference of 3.1 percentage points or more (in either direction).
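The four steps for the nightmare data can be retraced with Python's standard library. This sketch follows the slides in using the unpooled SE, with counts 55 of 160 men and 60 of 192 women. The variable names are illustrative. Because it uses the exact normal curve area instead of a rounded table lookup, the p-value it prints may differ slightly from the slide's approximate 0.55.

```python
import math
from statistics import NormalDist

# Sample counts from the Herson (1971) example
x_m, n_m = 55, 160   # men reporting nightmares "often"
x_f, n_f = 60, 192   # women reporting nightmares "often"

p_m, p_f = x_m / n_m, x_f / n_f

# SE of the difference in sample proportions (unpooled, as on the slides)
se = math.sqrt(p_m * (1 - p_m) / n_m + p_f * (1 - p_f) / n_f)

# Test statistic: (observed - expected) / SE, expected difference is 0
z = ((p_m - p_f) - 0) / se

# Two-sided p-value: area in both tails beyond |z|
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, two-sided p-value = {p_value:.2f}")
```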
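Hypothesis test for μ1 – μ2: matched pairs
Claim: smelling flowers helps you complete the maze faster
Ho: μf = μh vs. Ha: μf < μh
Ho: μf - μh = 0 vs. Ha: μf - μh < 0
Ho: D = 0 vs. Ha: D < 0
Test statistic (the differences here are scented – unscented, so the sample mean has the opposite sign from the JMP output for the odor example)
$$t = \frac{\bar{d} - 0}{\text{SE of } d\text{'s}} = \frac{-0.95667}{2.738} = -0.349$$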
Conclusions about odor
Using a t-distribution with 20 (21 – 1) degrees of freedom, the p-value is Pr(T < -0.349) = 0.3652.
Assuming there is no difference in average completion times when wearing either mask, there is a 36.52% chance of getting a sample mean difference of 0.957 seconds or more favoring the scented mask. This is a non-trivial chance. Therefore, we do not have enough evidence to conclude that wearing a scented mask improves performance on the maze.
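The one-sided p-value can be checked without a t-table. The sketch below is an illustration only: `t_pdf` and `t_cdf` are hypothetical helper names, and the tail area is obtained by numerically integrating the t density with Simpson's rule, which is a stand-in for whatever method JMP uses internally.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T < x), via Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # area under the density from 0 to |x|
    return 0.5 + area if x >= 0 else 0.5 - area

# Odor example, with differences taken as scented - unscented
d_bar = -0.95667     # sample mean difference
se = 2.7381723       # SE of the mean difference (SD / sqrt(21))
df = 21 - 1

t_stat = (d_bar - 0) / se
p_value = t_cdf(t_stat, df)   # lower-tail area, matching Ha: D < 0

print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
```

The printed p-value should agree with the 0.3652 tail probability in the JMP output.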
Inference in μ1 – μ2: Two independent samples
Pygmalion study revisited
Step 1: Ho: μa = μn vs. Ha: μa > μn
Step 2: Compute the test statistic
$$t = \frac{obs - exp}{SE} = \frac{(\bar{y}_a - \bar{y}_n) - 0}{\sqrt{SE_a^2 + SE_n^2}} = \frac{(15.17 - 6.17) - 0}{\sqrt{1.92^2 + 1.49^2}} = 3.70$$
Step 3: Find the p-value. We use the t-table with how many degrees of freedom? Use 10, as in the CI. The p-value is smaller than 0.01.
Step 4: We will reject Ho. There is strong evidence in the data to conclude that those labeled “accelerated” have larger improvements in IQ scores than those labeled “no growth”
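The two-sample test statistic can be recomputed from the group means and SEs in the JMP output. In this sketch the critical value 2.764 is an assumption taken from a standard t-table (one-sided 0.01 level, 10 df); comparing against it mirrors the slide's statement that the p-value is below 0.01.

```python
import math

# Group summaries from the JMP output for the Pygmalion study
mean_accel, se_accel = 15.1667, 1.9221
mean_none, se_none = 6.1667, 1.4926

# SE of the difference of two independent sample means
se_diff = math.sqrt(se_accel**2 + se_none**2)

# Test statistic: (observed - expected) / SE, expected difference is 0
t_stat = ((mean_accel - mean_none) - 0) / se_diff

# Assumption: t-table critical value, one-sided 0.01 level, 10 df
critical_value = 2.764
reject = t_stat > critical_value

print(f"t = {t_stat:.2f}, reject Ho at the 0.01 level: {reject}")
```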
Matched pairs analysis
MPG for 10 cars collected after similar drives using each of two different types of gas additive

Matched pairs analysis
Variable       N    Mean    SD     SE
diff (a – b)   10   -0.82   0.61   0.19
95% CI for mean difference: (-1.256, -0.384)
P-value = 0.002

Two sample analysis
Variable   N    Mean   SD     SE
Mpg a      10   20.6   14.1   4.4
Mpg b      10   21.4   14.2   4.4
95% CI for difference: (-14.14, 12.50)
P-value = 0.898
Conclusions from previous example
Right analysis (matched pairs) has a narrow CI and a tiny p-value. We are able to see that additive b yields more miles per gallon.
Wrong analysis (two independent samples) has a very wide CI and a large p-value. Using this analysis we’d incorrectly conclude that additives a and b are equally effective.
Here’s why
Variation in mpg across cars is much higher than variation in mpg within cars. By matching we eliminate this across-car variation. The two-independent-samples analysis ignores this elimination of across-car variance.
Moral of the story: use the analysis that corresponds to how the data are collected
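The gap between the two analyses comes almost entirely from the SE. The sketch below recomputes both SEs from the summary statistics in the mpg example. The multiplier 2.262 is an assumption taken from a standard t-table (95% confidence, 9 df), and the variable names are illustrative.

```python
import math

n = 10  # cars; each car was driven with both additives

# Matched pairs: work with the per-car differences (a - b)
mean_d, sd_d = -0.82, 0.61
se_paired = sd_d / math.sqrt(n)
t9 = 2.262  # assumption: t-table value for 95% confidence, 9 df
paired_ci = (mean_d - t9 * se_paired, mean_d + t9 * se_paired)

# Two-sample analysis (wrong for these data): treats the columns as independent
sd_a, sd_b = 14.1, 14.2
se_two_sample = math.sqrt(sd_a**2 / n + sd_b**2 / n)

# How many times larger the two-sample SE is than the matched pairs SE
ratio = se_two_sample / se_paired

print(f"Matched pairs SE = {se_paired:.3f}, "
      f"95% CI = ({paired_ci[0]:.3f}, {paired_ci[1]:.3f})")
print(f"Two-sample SE = {se_two_sample:.2f} (about {ratio:.0f} times larger)")
```

The car-to-car SDs of about 14 mpg enter the two-sample SE directly, while the matched pairs SE only sees the 0.61 mpg spread of the within-car differences.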
Matched pairs cont.
Why not always use matched pairs?
Matching increases the possibility of imbalance in background variables. Matching on irrelevant variables can make inferences less precise because of imbalance in causally-relevant background variables.
Guidance for using matched pairs?
Match on variables that have substantial effect on response. This can make inferences more precise.
Determining a sample size
We will use a method that is sometimes called the “margin of error method”
Suppose we want a 95% CI for the percentage of people who show symptoms of clinical depression
Furthermore, we want the CI to be fairly precise: we want a margin of error of 1%
Therefore we want
$$0.01 = 1.96\sqrt{p(1-p)/n}$$
Determining sample size
Using
$$0.01 = 1.96\sqrt{p(1-p)/n}$$
we can solve for n:
$$n = \frac{(1.96^2)\,p(1-p)}{(0.01)^2}$$
Now you just plug in your best guess for P and you have the sample size required for a 1% margin of error
Ex: say that P = 0.3
$$n = \frac{(1.96^2)\,0.3(1-0.3)}{(0.01)^2} \approx 8067, \text{ or roughly } 8100$$
If this sample size is too expensive, either decrease the level of confidence or accept a larger margin of error
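The margin of error method for a percentage translates directly into a small function. This is a sketch; the name `sample_size_for_proportion` is illustrative, and the result is rounded up because a sample size must be a whole number.

```python
import math

def sample_size_for_proportion(p_guess, margin, z=1.96):
    """Smallest n with z * sqrt(p(1-p)/n) <= margin, for a guessed p."""
    return math.ceil(z**2 * p_guess * (1 - p_guess) / margin**2)

# Example from the slides: guess P = 0.3, margin of error 1%, 95% confidence
n = sample_size_for_proportion(0.3, 0.01)
print(f"Required sample size: {n}")
```

Raising `margin` or lowering `z` (a lower confidence level) shrinks the required sample size, which is the trade-off described above.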
Determining sample size
The same ideas apply when you desire a CI for a mean
Suppose that we want to estimate the average weight of men in the U.S.
Further suppose that we want the margin of error to be 8 pounds
We need to guess at the SD for weight. Let’s guess it to be around 20 pounds
$$8 = 1.96 \times SD/\sqrt{n}$$
Then solving for n we get
$$n = 1.96^2(20^2)/(8^2) = 24.01$$
Round up and take a sample of 25
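The same calculation for a mean, as a sketch with an illustrative function name. The guessed SD of 20 pounds and the 8 pound margin of error are the values from the slides.

```python
import math

def sample_size_for_mean(sd_guess, margin, z=1.96):
    """Smallest n with z * SD / sqrt(n) <= margin, for a guessed SD."""
    return math.ceil((z * sd_guess / margin) ** 2)

# Example from the slides: SD guessed at 20 pounds, margin of error 8 pounds
n = sample_size_for_mean(20, 8)
print(f"Required sample size: {n}")
```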
Determining sample sizes for differences in % and avg.
Same logic applies:
Write down expression of SE
Decide on a desired margin of error
Solve for sample size
Guess p1 and p2 for difference in two percentages.
Guess SD1 and SD2 for differences in two means.
Set n1 = n2. Sample size in each group is n1
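With n1 = n2 = n, the margin of error for a difference is multiplier × √((p1(1−p1) + p2(1−p2))/n) for percentages and multiplier × √((SD1² + SD2²)/n) for means, so solving for n gives the two functions sketched below. The function names and the guesses used in the calls are hypothetical and only illustrate the arithmetic.

```python
import math

def n_per_group_two_proportions(p1, p2, margin, z=1.96):
    """Per-group n so that z * sqrt((p1(1-p1) + p2(1-p2)) / n) <= margin."""
    return math.ceil(z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / margin**2)

def n_per_group_two_means(sd1, sd2, margin, z=1.96):
    """Per-group n so that z * sqrt((sd1^2 + sd2^2) / n) <= margin."""
    return math.ceil(z**2 * (sd1**2 + sd2**2) / margin**2)

# Hypothetical guesses: p1 = 0.3, p2 = 0.4, margin of error 5 percentage points
n_prop = n_per_group_two_proportions(0.3, 0.4, 0.05)

# Hypothetical guesses: SD of 20 pounds in each group, margin of error 8 pounds
n_mean = n_per_group_two_means(20, 20, 8)

print(f"Per-group n for the difference in percentages: {n_prop}")
print(f"Per-group n for the difference in means: {n_mean}")
```

Note that the per-group size for the difference in means is about twice the single-sample size for the same SD and margin, because two sample means contribute variability.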