Page 1:

Class 23: Thursday, Dec. 2nd

• Today: One-way analysis of variance, multiple comparisons.
• Next week: Two-way analysis of variance.
• I will e-mail the final homework, Homework 9, to you this weekend.
• All of the final project ideas look good. I have e-mailed some of you my comments already and will e-mail the rest of you my comments by tomorrow.
• Schedule:
  – Thurs., Dec. 9th – Final class
  – Mon., Dec. 13th (5 pm) – Preliminary results from final project due
  – Tues., Dec. 14th (5 pm) – Homework 9 due
  – Tues., Dec. 21st (Noon) – Final project due.

Page 2:

Individual vs. Familywise Error Rate

• When several tests are considered simultaneously, they constitute a family of tests.

• Individual Type I error rate: Probability for a single test that the null hypothesis will be rejected assuming that the null hypothesis is true.

• Familywise Type I error rate: Probability for a family of tests that at least one null hypothesis will be rejected assuming that all of the null hypotheses are true.

• When we consider a family of tests, we want to make the familywise error rate small, say 0.05, to protect against falsely rejecting a null hypothesis.

Page 3:

Why Control the Familywise Error Rate?

• Five children in a particular school got leukemia last year. Is that a coincidence, or does the clustering of cases suggest the presence of an environmental toxin that caused the disease?

• Individual Type I error rate: Calculate the probability that five children at this particular school would all get leukemia this particular year. If this is small, say smaller than 0.05, become alarmed.

• Familywise Type I error rate: Calculate the probability that five children in any school would develop the same severe disease in the same year. If this is small, say smaller than 0.05, become alarmed.

• If we control the individual type I error rate, then we will locate many disease “clusters” that are not caused by an environmental toxin but are just coincidences.

Page 4:

Bonferroni Method

• General method for doing multiple comparisons for any family of k tests.
• Denote the familywise Type I error rate we want by p*, say p* = 0.05.
• Compute the p-values for each individual test: p_1, ..., p_k.
• Reject the null hypothesis for the ith test if p_i <= p*/k.
• This guarantees that the familywise Type I error rate is at most p*.
• Why Bonferroni works: If we do k tests and all null hypotheses are true, then using Bonferroni with p* = 0.05, each test has probability 0.05/k of a Type I error, so we expect to make k*(0.05/k) = 0.05 errors in total; since the chance of at least one error is no larger than the expected number of errors, the familywise error rate is at most 0.05.
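A minimal Python sketch of the Bonferroni rule (the p-values below are hypothetical, for illustration only):

    # Bonferroni: reject the ith null hypothesis only if p_i <= p*/k
    p_star = 0.05                             # desired familywise Type I error rate
    p_values = [0.003, 0.20, 0.011, 0.047]    # hypothetical p-values for a family of k = 4 tests
    k = len(p_values)

    for i, p in enumerate(p_values, start=1):
        reject = p <= p_star / k              # here p*/k = 0.0125
        print(f"test {i}: p = {p:.3f}, reject H0: {reject}")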

Page 5:

Multiplicity

• A news report says, "A 15 year study of more than 45,000 Swedish soldiers revealed that heavy users of marijuana were six times more likely than nonusers to develop schizophrenia."

• Were the investigators only looking for a difference in schizophrenia rates between heavy users and nonusers of marijuana?

• Key question: What is their family of tests? If they were actually looking for a difference among 100 outcomes (e.g., blood pressure, lung cancer), Bonferroni should be used to control the familywise Type I error rate, i.e., only consider a difference significant if p-value is less than .05/100=.0005.

• The best way to deal with the multiple comparisons problem is to design a study to search specifically for a pattern that was suggested by an exploratory data analysis. Then there is only one comparison.

Page 6:

Bonferroni method on Milgram’s data

• If we want to test whether each of the four groups has a mean different from the mean of all four groups, we have four tests. Bonferroni method: Check whether p-value of each test is <0.05/4=0.0125.

• There is strong evidence that the remote group has a mean higher than the mean of the four groups and the touch-proximity group has a mean lower than the mean of the four groups.

Expanded Estimates (nominal factors expanded to all levels)
Term                          Estimate   Std Error   t Ratio   Prob>|t|
Intercept                      338.250    9.067431     37.30     <.0001
Condition[Proximity]           -26.250   15.70525      -1.67     0.0966
Condition[Remote]               66.750   15.70525       4.25     <.0001
Condition[Touch-Proximity]     -70.125   15.70525      -4.47     <.0001
Condition[Voice-Feedback]       29.625   15.70525       1.89     0.0611

Page 7:

Multiple Comparison Simulation

• In multiplecomp.JMP, 50 groups are compared with sample sizes of ten for each group.
• The observations for each group are simulated from a standard normal distribution. Thus, in fact, mu_1 = mu_2 = ... = mu_50 = 0.
• Bonferroni approach to deciding which groups have means different than average: Reject the null hypothesis that a group's mean is the average mean of all groups only if the p-value for the t-test is less than .05/50 = .001.
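A rough Python analogue of the simulation (not the multiplecomp.JMP file itself; the test of each group's mean against the overall average is simplified here to a one-sample t-test):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_groups, n_per_group = 50, 10

    # all 50 groups simulated from a standard normal, so every null hypothesis is true
    data = rng.standard_normal((n_groups, n_per_group))
    grand_mean = data.mean()

    # p-value for each group's mean compared to the overall average
    p_values = np.array([stats.ttest_1samp(group, grand_mean).pvalue for group in data])

    print("groups with p < 0.05:    ", np.sum(p_values < 0.05))
    print("groups with p < 0.05/50: ", np.sum(p_values < 0.05 / 50))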

Page 8:

Multiple Comparison Simulation

Iteration                             1    2    3    4    5
# of Groups with p-value < 0.05
# of Groups with p-value < .0025

Page 9:

Pairwise Comparisons

• We are interested not just in what groups have means that are different than the average mean, but in pairwise comparisons between the groups.

• For a pairwise comparison between group i and group j, we want to test the null hypothesis that group i and group j have the same means versus the alternative that group i and group j have different means, i.e., H_0: mu_i = mu_j vs. H_a: mu_i ≠ mu_j.


Page 10:

Pairwise Comparisons Cont.

• For Milgram’s obedience data, there are six pairwise comparisons:

(1) Proximity vs. Remote; (2) Proximity vs. Touch-Proximity; (3) Proximity vs. Voice-Feedback; (4) Remote vs. Touch-Proximity; (5) Remote vs. Voice-Feedback; (6) Touch-Proximity vs. Voice-Feedback

• Multiple comparisons situation with a family of six tests. We want to control the familywise error rate at .05 rather than the individual type I error rate.

• Could use Bonferroni to do this, but there is a method called Tukey's HSD (which stands for "Honestly Significant Difference") that is specially designed to control the familywise Type I error rate for pairwise comparisons in ANOVA.

Page 11:

LSMeans Differences Tukey HSD, Alpha = 0.050, Q = 2.59695
(The JMP output lists each comparison in both directions in a 4x4 matrix; only the six distinct pairs are shown here.)

Pair (i vs. j)                        Mean[i]-Mean[j]   Std Err Dif   Lower CL Dif   Upper CL Dif
Remote vs. Proximity                        93.000        25.6466        26.3972       159.603
Remote vs. Touch-Proximity                 136.875        25.6466        70.2722       203.478
Remote vs. Voice-Feedback                   37.125        25.6466       -29.478        103.728
Voice-Feedback vs. Proximity                55.875        25.6466       -10.728        122.478
Voice-Feedback vs. Touch-Proximity          99.750        25.6466        33.1472       166.353
Proximity vs. Touch-Proximity               43.875        25.6466       -22.728        110.478

Level                    Least Sq Mean
Remote            A          405.00000
Voice-Feedback    A  B       367.87500
Proximity            B  C    312.00000
Touch-Proximity         C    268.12500
Levels not connected by same letter are significantly different.

In the JMP output, comparisons shown in red are those for which the null hypothesis that the group means are the same is rejected using the Tukey HSD procedure, which controls the familywise Type I error rate at 0.05. A confidence interval for the difference in group means that adjusts for multiple comparisons is given by the Lower CL Dif and Upper CL Dif columns.

Page 12:

More on Tukey's HSD

• Using Tukey's HSD, the pairs for which there is strong evidence of a difference in means, adjusting for multiple comparisons, are: remote is higher than proximity, remote is higher than touch-proximity, and voice-feedback is higher than touch-proximity.

• For confidence intervals for differences in the means of each pair of groups, if we use the usual confidence intervals, there is a good chance that at least one of the intervals will not contain the true difference in means between the groups.

• When making a family of confidence intervals, we want confidence intervals that have a 95% chance of all intervals in the family containing their true values. The confidence intervals produced by the Tukey HSD procedure have this property.

• 95% confidence interval for difference in mean of remote group vs. mean of proximity group using Tukey’s HSD: (26.40, 159.60).

• 95% confidence interval for difference in mean of remote group vs. mean of proximity group assuming that this is the only confidence interval being formed (family of one confidence interval): (42.34, 143.66). Tukey's HSD confidence interval is wider because, in order for a family of CIs to all contain their true values when multiple CIs are formed, each CI must be wider than it would be if only one CI were being formed.
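The two intervals can be reproduced approximately from the JMP output (standard error of the difference 25.6466, four groups, 160 - 4 = 156 error degrees of freedom); a sketch using scipy (studentized_range requires scipy 1.7 or later):

    import numpy as np
    from scipy import stats

    diff, se_diff = 93.0, 25.6466   # Remote minus Proximity estimate and its standard error
    k, df = 4, 156                  # number of groups, error degrees of freedom

    # ordinary 95% CI for a single comparison
    t_mult = stats.t.ppf(0.975, df)
    print("single CI:", diff - t_mult * se_diff, diff + t_mult * se_diff)   # about (42.3, 143.7)

    # Tukey HSD CI: multiplier is q_{0.05, k, df} / sqrt(2), which is JMP's Q = 2.59695
    q_mult = stats.studentized_range.ppf(0.95, k, df) / np.sqrt(2)
    print("Tukey CI: ", diff - q_mult * se_diff, diff + q_mult * se_diff)   # about (26.4, 159.6)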

Page 13:

Tukey HSD in JMP

• Use Analyze, Fit Model to do the analysis of variance by making the X variable the categorical variable denoting the group.

• After Fit Model, click red triangle next to group variable (Condition in the Milgram study) and click LS Means Differences Tukey HSD. Clicking LS Means Differences Student’s t gives CIs that do not adjust for multiple comparisons.
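Outside of JMP, the same pairwise procedure is available in Python's statsmodels; a minimal sketch on made-up stand-in data (not Milgram's actual 160 observations):

    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # hypothetical voltage readings, three per condition, for illustration only
    voltage = np.array([450, 435, 420, 300, 315, 330, 150, 240, 255, 345, 360, 375])
    condition = np.array(["Remote"] * 3 + ["Proximity"] * 3 +
                         ["Touch-Proximity"] * 3 + ["Voice-Feedback"] * 3)

    # Tukey HSD pairwise comparisons controlling the familywise error rate at 0.05
    result = pairwise_tukeyhsd(endog=voltage, groups=condition, alpha=0.05)
    print(result.summary())   # mean differences, simultaneous CIs, and reject decisions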

Page 14:

Assumptions in one-way ANOVA

• Assumptions needed for validity of one-way analysis of variance p-values and CIs:
  – Linearity: automatically satisfied.
  – Constant variance: Spread within each group is the same.
  – Normality: Distribution within each group is normally distributed.
  – Independence: Sample consists of independent observations.

Page 15:

Rule of thumb for checking constant variance

• Constant variance: Look at standard deviation of different groups by using Fit Y by X and clicking Means and Std Dev.

• Check whether (highest group standard deviation / lowest group standard deviation)^2 is greater than 3. If greater than 3, then constant variance is not reasonable and a transformation should be considered. If less than 3, then constant variance is reasonable.

• (Highest group standard deviation/lowest group standard deviation)^2 =(131.874/63.640)^2=4.29. Thus, constant variance is not reasonable for Milgram’s data.

Means and Std Deviations
Level              Number      Mean    Std Dev   Std Err Mean
Proximity              40   312.000    129.979         20.552
Remote                 40   405.000     63.640         10.062
Touch-Proximity        40   268.125    131.874         20.851
Voice-Feedback         40   367.875    119.518         18.897
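The rule of thumb is easy to compute directly from the group standard deviations in the table above; a short Python sketch:

    # group standard deviations from the Means and Std Deviations table
    std_devs = {"Proximity": 129.979, "Remote": 63.640,
                "Touch-Proximity": 131.874, "Voice-Feedback": 119.518}

    ratio = (max(std_devs.values()) / min(std_devs.values())) ** 2
    print(f"(max SD / min SD)^2 = {ratio:.2f}")   # about 4.29

    # rule of thumb: a ratio greater than 3 means constant variance is not reasonable
    if ratio > 3:
        print("constant variance not reasonable; consider a transformation")
    else:
        print("constant variance reasonable")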

Page 16:

Transformations to correct for nonconstant variance

• If the standard deviation is highest for groups with high means, try transforming Y to log Y or sqrt(Y). If the standard deviation is highest for groups with low means, try transforming Y to Y^2.

• The SD is particularly low for the group with the highest mean. Try transforming to Y^2. To make the transformation, right-click in a new column, click New Column, then right-click again in the created column, click Formula, and enter the appropriate formula for the transformation.


Page 17:

Transformation of Milgram’s data to Squared Voltage Level

• Check of constant variance for transformed data: (Highest group standard deviation/lowest group standard deviation)^2 = 2.67. Constant variance assumption is reasonable for voltage squared.

• Analysis of variance tests are approximately valid for voltage squared data; reanalyzed data using voltage squared.

Means and Std Deviations
Level              Number     Mean    Std Dev   Std Err Mean
Proximity              40   113816    78920.2          12478
Remote                 40   167974    48541.4           7675
Touch-Proximity        40    88847    79291.3          12537
Voice-Feedback         40   149259    74053.6          11709

Page 18:

Analysis using Voltage Squared

Response: Voltage Squared
Effect Tests
Source      Nparm   DF   Sum of Squares   F Ratio   Prob > F
Condition       3    3       1.50737e11    9.8735     <.0001

LSMeans Differences Tukey HSD, Alpha = 0.050, Q = 2.59695
(Only the six distinct pairs from the 4x4 JMP matrix are shown.)

Pair (i vs. j)                        Mean[i]-Mean[j]   Std Err Dif   Lower CL Dif   Upper CL Dif
Remote vs. Proximity                       54157.5        15951.4        12732.6        95582.4
Remote vs. Touch-Proximity                 79126.9        15951.4        37701.9         120552
Remote vs. Voice-Feedback                  18714.4        15951.4       -22710.6        60139.3
Voice-Feedback vs. Proximity               35443.1        15951.4        -5981.8        76868.1
Voice-Feedback vs. Touch-Proximity         60412.5        15951.4        18987.6         101837
Proximity vs. Touch-Proximity              24969.4        15951.4       -16455.6        66394.3

Strong evidence that the group mean voltage squared levels are not all the same.

Strong evidence that remote has a higher mean voltage squared level than proximity and touch-proximity, and that voice-feedback has a higher mean voltage squared level than touch-proximity, taking into account the multiple comparisons.

Page 19:

Rule of Thumb for Checking Normality in ANOVA

• The normality assumption for ANOVA is that the distribution in each group is normal. Can be checked by looking at the boxplot, histogram and normal quantile plot for each group.

• If there are more than 30 observations in each group, then the normality assumption is not important; ANOVA p-values and CIs will still be approximately valid even for nonnormal data.

• If there are fewer than 30 observations per group, then we can check normality by clicking Analyze, Distribution and then putting the Y variable in the Y, Columns box and the categorical variable denoting the group in the By box. We can then create normal quantile plots for each group and check that, for each group, the points in the normal quantile plot are within the confidence bands. If there is nonnormality, we can try a transformation such as log Y and see if the transformed data are approximately normally distributed in each group.
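A Python sketch of the per-group normal quantile check on hypothetical data (scipy's probplot plays the role of JMP's normal quantile plot, though it does not draw confidence bands):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # hypothetical data: 20 observations in each of three groups
    rng = np.random.default_rng(1)
    y = rng.normal(loc=[0.0, 1.0, 2.0], scale=1.0, size=(20, 3)).T.ravel()
    group = np.repeat(["A", "B", "C"], 20)

    # one normal quantile plot per group; roughly straight points suggest normality
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, g in zip(axes, ["A", "B", "C"]):
        stats.probplot(y[group == g], dist="norm", plot=ax)
        ax.set_title(f"Group {g}")
    plt.tight_layout()
    plt.show()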

Page 20:

One-way Analysis of Variance: Steps in Analysis

1. Check assumptions (constant variance, normality, independence). If constant variance is violated, try transformations.

2. Use the effect test (commonly called the F-test) to test whether all group means are the same.

3. If the effect test finds that at least two group means differ, use Tukey's HSD procedure to investigate which groups are different, taking into account the fact that multiple comparisons are being made.
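A compact Python sketch of steps 2 and 3 on hypothetical data (scipy's f_oneway for the overall F-test, then Tukey's HSD as in the earlier sketch):

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # hypothetical data: three groups, one of which has a shifted mean
    rng = np.random.default_rng(2)
    g1, g2, g3 = rng.normal(0.0, 1, 30), rng.normal(0.8, 1, 30), rng.normal(0.1, 1, 30)

    # step 2: F-test that all group means are the same
    f_stat, p_value = stats.f_oneway(g1, g2, g3)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

    # step 3: if the F-test rejects, follow up with Tukey's HSD pairwise comparisons
    if p_value < 0.05:
        y = np.concatenate([g1, g2, g3])
        labels = np.repeat(["group 1", "group 2", "group 3"], 30)
        print(pairwise_tukeyhsd(y, labels, alpha=0.05).summary())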

Page 21:

Example: Discrimination against the Handicapped

• Study of how physical handicaps affect people’s perception of employment qualifications.

• Researchers prepared five videotaped job interviews, using the same two male actors for each. The tapes differed only in that the applicant appeared with a different handicap in each: (i) wheelchair; (ii) on crutches; (iii) hearing impaired; (iv) one leg amputated; (v) no handicap.

• Each tape was shown to 14 students from a U.S. university. Students rated the qualifications of the candidate on a 0 to 10 point scale based on the tape.

• Questions of interest: Do subjects systematically evaluate qualifications differently according to candidate’s handicap? If so, which handicaps produce different evaluations?

Page 22:

Checking Assumptions

[Oneway Analysis of SCORE By HANDICAP: side-by-side plot of SCORE (roughly 1 to 9) for the five HANDICAP groups AMPUTEE, CRUTCHES, HEARING, NONE, WHEELCHAIR]

Means and Std Deviations
Level         Number      Mean   Std Dev   Std Err Mean   Lower 95%   Upper 95%
AMPUTEE           14   4.42857   1.58572        0.42380      3.5130      5.3441
CRUTCHES          14   5.92143   1.48178        0.39602      5.0659      6.7770
HEARING           14   4.05000   1.53259        0.40960      3.1651      4.9349
NONE              14   4.90000   1.79358        0.47935      3.8644      5.9356
WHEELCHAIR        14   5.34286   1.74828        0.46725      4.3334      6.3523

Constant variance is reasonable: (largest standard deviation / smallest standard deviation)^2 = (1.79/1.48)^2 = 1.46. There are fewer than 30 observations per group, so we need to check normality, but a check of the normal quantile plot for each group indicates that normality is OK.

Page 23:

Do all videotapes have the same mean?

Response: SCORE
Effect Tests
Source     Nparm   DF   Sum of Squares   F Ratio   Prob > F
HANDICAP       4    4        30.521429    2.8616     0.0301

Expanded Estimates (nominal factors expanded to all levels)
Term                      Estimate   Std Error   t Ratio   Prob>|t|
Intercept                4.9285714    0.195173     25.25     <.0001
HANDICAP[AMPUTEE]        -0.500000    0.390347     -1.28     0.2048
HANDICAP[CRUTCHES]       0.9928572    0.390347      2.54     0.0134
HANDICAP[HEARING]        -0.878571    0.390347     -2.25     0.0278
HANDICAP[NONE]           -0.028571    0.390347     -0.07     0.9419
HANDICAP[WHEELCHAIR]     0.4142857    0.390347      1.06     0.2925

Test of H_0: the mean of all five videotapes is the same vs. H_A: at least two of the videotapes have different means has p-value 0.0301. This is evidence that there is some difference in the means of the videotapes.

Page 24:

How do the videotapes compare?

LSMeans Differences Tukey HSD, Alpha = 0.050, Q = 2.80582
(Only the ten distinct pairs from the 5x5 JMP matrix are shown; the matrix lists each comparison in both directions.)

Pair (i vs. j)                Mean[i]-Mean[j]   Std Err Dif   Lower CL Dif   Upper CL Dif
CRUTCHES vs. AMPUTEE                1.49286       0.61719        -0.2389        3.22459
CRUTCHES vs. HEARING                1.87143       0.61719         0.1397        3.60316
CRUTCHES vs. NONE                   1.02143       0.61719        -0.7103        2.75316
CRUTCHES vs. WHEELCHAIR             0.57857       0.61719        -1.1532        2.31030
AMPUTEE vs. HEARING                 0.37857       0.61719        -1.3532        2.11030
NONE vs. AMPUTEE                    0.47143       0.61719        -1.2603        2.20316
WHEELCHAIR vs. AMPUTEE              0.91429       0.61719        -0.8174        2.64602
NONE vs. HEARING                    0.85000       0.61719        -0.8817        2.58173
WHEELCHAIR vs. HEARING              1.29286       0.61719        -0.4389        3.02459
WHEELCHAIR vs. NONE                 0.44286       0.61719        -1.2889        2.17459

The only conclusion we can make about how the videotapes compare, taking account of the fact that we are making multiple comparisons, is that Crutches has a higher mean than Hearing.

