
stat301 independent two samples - Dept. of Statistics ...

Date post: 19-Jul-2018
Upload: vonhu

Objectives

Inference for comparing means of two populations where the samples are independent

- Two-sample t significance test (we give three examples)
- Two-sample t confidence interval
- http://onlinestatbook.com/2/tests_of_means/difference_means.html
- Further reading: IPS: Chapter 7.2, OS3: Chapter 5.3 (pages 230-238)

Topics: Independent two sample t-test

- Be able to construct the appropriate hypothesis for comparing two populations based on what researchers want to "prove".
- When given a data set and the story behind it, be able to identify when to use an independent two sample t-test.
- Understand the Statcrunch output for an independent two sample t-test and confidence interval.
- Be able to construct confidence intervals and hypothesis tests based on a part of the output.
- Understand the standard error for the independent two sample t-test and confidence interval.
- Understand what combination of the sample sizes yields the smallest standard error and why.
- Be able to check the validity (accuracy) of the p-values and confidence intervals.

Standard errors

- We have learnt that standard errors are crucial both for constructing confidence intervals and for statistical testing. Do not mix up the standard error of an estimator with the standard deviation of the observations.
- The amount of variation in the sample mean (roughly, the typical distance between the estimate and the population mean) is measured by its standard error, which is

$$\text{s.e.} = \frac{s}{\sqrt{n}} = \frac{\text{amount of variation in the sample}}{\text{square root of the sample size}}$$

  - You can imagine that the unknown population mean should be in some proximity to the known sample mean. This "proximity" is measured by the standard error.
  - The sample mean tends to get closer to the population mean (it becomes more precise) as you increase the sample size.
  - As we continue with the course, the standard errors will become more complex, but the underlying ideas are the same.
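The formula above can be tried out in a few lines of Python. This is only a minimal sketch: the function name `standard_error` and the measurements are made up for illustration and do not come from the lecture data.

```python
import math
import statistics


def standard_error(sample):
    """Standard error of the sample mean: s / sqrt(n)."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    return s / math.sqrt(n)


# Hypothetical measurements, invented purely for illustration.
small_sample = [2.0, 4.0, 6.0, 8.0]
large_sample = small_sample * 4  # same values repeated: 16 observations

print("s.e. with n = 4: ", standard_error(small_sample))
print("s.e. with n = 16:", standard_error(large_sample))
```

The spread of the observations is similar in both samples, but the standard error of the larger sample is noticeably smaller, which is the point made in the bullets above.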

Comparisons everywhere!

- You will see comparisons being made all over the place. Just look at some of the products you have at home:
- Dentex floss sticks are clinically proven to remove more plaque than regular floss.
- What does this mean? How on earth do they prove this?
- This is an example of where they prove the results statistically.
- It is done via clinical trials, by collecting data:
- Aim: to see if it is possible to prove that on average the amount of plaque removed using floss sticks is more than the average amount of plaque removed using regular floss.
  - We state the hypothesis as H0: μFP - μF ≤ 0 against HA: μFP - μF > 0, where μFP = mean plaque removed using floss sticks and μF = mean plaque removed using regular floss.

Designing the floss study

- There are two ways the data could have been collected:
- Either one simple random sample of individuals is taken. Each individual (on separate days) is asked to use a floss stick and regular floss, and the amount of plaque removed for each treatment is measured. This is an example of a matched pairs study, where the same individual is used in both treatments. In this case a matched pairs t-test is done, which we covered in the previous lectures.
  - The advantage of this design is that it avoids confounding because the same individual is used for both treatments.
  - The disadvantage is that it takes time and effort because we need to do it over several days.
- Alternatively, a group of volunteers is randomly split into two groups. Some are asked to use floss and others are asked to use floss sticks. The individuals in the two groups are completely independent of each other and there isn't any matching.
  - The advantage is that it is quick to do this experiment.
  - Disadvantage: larger standard errors.

Independent samples inference

- The purpose of most studies is to compare the effects of different treatments or conditions.
- Using matching to design an experiment is a very useful way to make comparisons between populations, since it tends to reduce confounding factors (such as the ability of a person to floss).
- If we have reason to believe that there is matching between subjects, then we should use a matched pairs t-test.
- However, in many situations it is impossible to have any matching between the samples.
- If we want to see whether a drug works, we need to compare an SRS (simple random sample) of patients treated with the drug with an SRS of patients given the placebo.

Example 1: Floss

- A simple random sample of individuals is chosen and the amount of plaque removed using either regular floss or floss sticks is measured. Specifically, 50 individuals were given floss and another 50 were given floss sticks. The average amount of plaque removed in the group using floss sticks is 3.16mg, whereas the average amount of plaque removed in the group using regular floss is 2.99mg.

The data is on the left. Note that each row does not correspond to the same individual. These are two different individuals with no pairing/matching. They are two independent samples. It is hard to understand these numbers, so the data is summarized in the table below.

- Does the difference of 3.16 vs 2.99 automatically prove that floss sticks are better?
- No, it doesn't. The product's makers cannot use this as proof. They need to show that, given the data, it is implausible that over the entire population there is no difference between floss and floss sticks (they need to show that the null is not plausible).
- Technically, this means calculating the chance of observing a difference at least as large as 3.16mg - 2.99mg by fluke, that is, when the null is true.
- This is the p-value.
- If the p-value is large (over a pre-determined significance level, say 5%), then we cannot reject the null. This means that there isn't any evidence in the collected data that floss sticks are better than regular floss.
- If the p-value is small (below a pre-determined significance level), then the data collected suggests that floss sticks function better than floss.

Fortunately, computer software does the calculation.

- However, the usual assumptions of normality of the sample means still apply. So these still have to be checked.

The independent sample t-test in Statcrunch

We test the hypothesis H0: μFP - μF ≤ 0 against HA: μFP - μF > 0. Remember this needs to match what is given in the box.

Selecting "Dotplot with mean marker" gives the dot plot on the next page, which allows a visual comparison.

Analysis of the floss data set

The green lines in the dot plot correspond to the sample means calculated from the data. Just looking at the two data sets we observe there is a large overlap in numbers. This makes it hard to discriminate visually between the two data sets and to guess the p-value.

We see that the sample mean for FlossSticks is greater than the sample mean of regular floss. We test the hypothesis H0: μFP - μF ≤ 0 against HA: μFP - μF > 0.

The output tells us that under normality of the sample means, there is a 3.15% chance of observing a difference of 0.16 or greater if the null were true.

- The p-value of 3.15% (which is less than 5%) tells us that there is some evidence for the claim that floss sticks remove more plaque than floss.
- The number of degrees of freedom is 94.5! A very strange number. Why it is so is beyond this class. We simply treat it like any other DF.

- To find out how much more plaque floss sticks remove on average compared with regular floss, we calculate the 95% confidence interval. This can be done using the output and the t-calculator in Statcrunch.
- Using Statcrunch we can obtain the critical value corresponding to an upper-tail area of 2.5%. We just need to enter 94.5 as the DF. This gives 1.98, the critical value to use in constructing the 95% confidence interval for the mean difference:

$$[0.16 - 1.98 \times 0.085,\ 0.16 + 1.98 \times 0.085] = [-0.0088,\ 0.329]$$

- Thus with 95% confidence the mean difference lies somewhere between -0.0088 and 0.329. Note that zero lies in this interval. This is because we can reject the null using a one-sided test at the 5% level, but not using a two-sided one.

- Alternatively, we can use Statcrunch to give the confidence interval directly.
- Observe that these numbers are identical to the calculation we made.
- Using statistical software avoids cumbersome calculations!
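The hand calculation of the interval can also be reproduced in Python. This is a minimal sketch that takes the difference (0.16), standard error (0.085) and critical value (1.98) quoted in the text as given; the helper name `t_interval` is made up for illustration. Because these inputs are rounded, the endpoints may differ from the Statcrunch output in the third or fourth decimal place.

```python
def t_interval(estimate, standard_error, t_star):
    """Confidence interval: estimate +/- t_star * standard_error."""
    margin = t_star * standard_error
    return estimate - margin, estimate + margin


# Rounded values quoted in the floss example.
low, high = t_interval(estimate=0.16, standard_error=0.085, t_star=1.98)
print(f"95% CI: [{low:.4f}, {high:.4f}]")

# Zero inside the interval means the two-sided test at the 5% level
# cannot reject "no difference".
contains_zero = low <= 0.0 <= high
print("Interval contains zero:", contains_zero)
```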

- Since Dentex found a statistically significant difference (at the 5% significance level), they can make the claim that floss sticks remove more plaque than regular floss.
- Checking the normality assumption of the sample means:

The sample size is relatively large and the distributions of the two sample means (look at the green plots) look quite normal. Therefore we can say the 3.15% p-value (based on normality of the sample means) is close to the true p-value.

Question Time

Summary

- In the situation discussed above the samples are completely independent of each other. There isn't any matching. In this situation we need to use an independent two sample t-test.
- In general, subjects are often observed separately under the different conditions, resulting in samples that are independent. That is, the subjects of each sample are obtained and observed independently from the subjects of the other samples.
- As in the matched pairs design, subjects should be sampled using a simple random sample from the population of interest.
- By the end of the class you should be able to identify which test to apply given the situation.
- You should look to see if there is any matching in the data. If there is matching, never do an independent sample t-test (this will give the wrong standard errors and can lead to unreliable results).
  - If the samples appear to be completely independent of each other, use an independent sample t-test.

Example 2: Heights

- Consider the following problem that we already know the answer to:
- In general, do male students tend to be taller than female students?
- In terms of a hypothesis test, we want to see if there is evidence to support: H0: μM - μF ≤ 0 against HA: μM - μF > 0.
  - A matched design is possible, by randomly sampling male and female student siblings. Such data is not readily available. In addition it can induce a bias, since we exclude the subpopulation of people with same-sex or no siblings.
  - Instead, a random sample of students was drawn and an independent sample t-test is done.
  - Statcrunch instructions: Stat -> T-stat -> Two Sample -> With data. Then place the relevant columns in each box. You have the option of doing a test (one or two sided) or constructing a confidence interval.

In this sample there were 27 males and 37 females, so there is clearly no matching. The difference in sample means is 0.45.

Visually, there seems to be a large difference between the data sets, meaning that it is unlikely they share the same mean (small p-value).

We see that the p-value is less than 0.01%. We do the test at the 5% level. Since 0.01% < 5%, there is strong evidence to suggest that males are on average taller than females.

Observe that the t-value in the Statcrunch output is calculated by simply taking the distance between the difference in sample means and the null value (which is zero), divided by the standard error:

$$t\text{-value} = \frac{0.466 - 0}{0.064} = 7.27$$

We can use the same output to construct a 99% confidence interval for the mean difference. The only unusual feature is the degrees of freedom, which is 48.29. However, we do exactly the same as before: we either look up t-tables (provided by me in the exam paper) or use the Statcrunch distribution calculator.

$$[0.466 \pm 2.68 \times 0.064] = [0.29,\ 0.64]$$

With 99% confidence we believe the mean difference in height lies between 0.29 and 0.64 feet.

Example 3: Diets

- We want to know whether there is a difference between two different diets. 20 randomly sampled people are randomly placed into two groups of 10. The first group goes on Diet 1 and the second group on Diet 2. The weight loss for each group (after dieting for one month) is given below.

We need to use an independent two sample t-procedure (no matching between individuals). As we have no reason to believe one diet is better than another, our hypothesis of interest is: H0: μ1 - μ2 = 0 against HA: μ1 - μ2 ≠ 0.

The sample means are different, but there is a large overlap in the dots.

The 95% confidence interval is [-2.23, 0.598]. This tells us that with 95% confidence the mean difference between the diets is somewhere in this interval. As the interval contains a mean difference of 0, we cannot reject the null (for the two-sided test; refer to Chapter 7, page 85). The p-value for testing H0: μ1 - μ2 = 0 against HA: μ1 - μ2 ≠ 0 is therefore greater than 5%.

To calculate the precise p-value we use the t-transform

$$t\text{-value} = \frac{-0.82 - 0}{0.6692} = -1.22$$

Using Statcrunch, the area to the LEFT of -1.22 is 12%. Thus the p-value for the two-sided test is 24%. From the data there is no evidence to suggest there is any difference between the means of the diets.

Example 4: Does calcium interact with iron absorption?

- It is believed that too much calcium in a diet can reduce the absorption of iron. To test this, 20 randomly sampled people were put into two groups of 10. One group was given a high-calcium diet and their iron absorption was recorded. The other group was given a low-calcium diet and their iron absorption was recorded. The differences from their previous levels are given below (this is why you see some negative numbers).
- The data and summary statistics are given below:

We observe that in this sample those in the low-calcium group absorb more iron. Is this statistically significant?

- The hypothesis of interest is H0: μCH - μCL ≥ 0 against HA: μCH - μCL < 0.

The hypothesis given in the output above is the opposite of what we want to test.

However, from the output we immediately see that the p-value for H0: μCH - μCL ≥ 0 against HA: μCH - μCL < 0 is the area to the LEFT of -3.19, which is 1 - 0.9974 = 0.26%.

As this p-value is less than 5%, there is evidence to reject the null and conclude that high calcium decreases iron absorption (compared with low calcium).

The 95% confidence interval for the mean difference is

$$[-1.991 \pm 0.623 \times 2.1] \approx [-3.30,\ -0.68]$$

Example 5: Calf treatments

- Comparing the weights of calves under different treatments

Treatment A vs B

Is there evidence in the data to suggest there is a difference between treatments A and B? This means we are testing H0: μA - μB = 0 against HA: μA - μB ≠ 0.

Eyeballing the numbers, we see it is hard to visually discriminate between the two data sets.

- The p-value for H0: μA - μB = 0 against HA: μA - μB ≠ 0 is 93%.
- This tells us that obtaining a difference like the one seen between the two groups when there is no difference in the treatments (in terms of weight) is highly likely. Thus there is no evidence to reject the null.

Note: To analyze the calf data in Statcrunch you need to split each group into its own column. To do this go to Data -> Arrange -> Split -> select the column of data you want to analyze (for example Wt 8) and select the grouping column you want (for example TRT).

Treatment A vs D

- From the summary statistics, the difference between treatments A and D appears quite large (7.7). Can this difference be explained by random chance? We test the hypothesis H0: μA - μD = 0 against HA: μA - μD ≠ 0.

There is a 7.7 point difference in the treatments but a large overlap in the data sets (both have a large standard deviation).

- The mean difference may be -7.7 but the p-value is 34%. This tells us there is over a 1/3 chance of observing a difference of 7.7 in the sample means when there is in fact no difference in the treatments. This is quite large, well over the 5% significance level, so there is no evidence to reject the null.
- We now construct a 95% confidence interval. To do this we use Statcrunch to find the critical value of a t-distribution with 19.09 df.

The 95% confidence interval for the difference in mean weights for the treatments is [-7.7 - 2.09×8.09, -7.7 + 2.09×8.09] = [-24.6, 9.2]. This is an interval where we believe the mean difference should lie, and it explains why we were not able to reject the null, despite 7.7 being subjectively large.

The idea behind the test

- So far we have just described the output. Here we state the theory.
- We illustrate the idea using the female and male height example.
- The difference in sample means $\bar{X}_M - \bar{X}_F$ will vary from sample to sample.
- If the sample size is large enough, $\bar{X}_M - \bar{X}_F$ will have a normal distribution (this is the central limit theorem).
- The normal distribution will be centered about the true mean μM - μF (population male mean minus population female mean), but it will have a complicated standard error:

$$\sqrt{\frac{\sigma_M^2}{27} + \frac{\sigma_F^2}{34}}$$

  - where σM = standard deviation of male heights and σF = standard deviation of female heights.
- Just like in the one-sample case, in order to do the test we simply take the z-transform under the null that the mean male and female heights are the same (μM - μF = 0):

$$z = \frac{5.91 - 5.46}{\sqrt{\frac{\sigma_M^2}{27} + \frac{\sigma_F^2}{34}}}$$

  - At this point we encounter a problem. We do not know the population standard deviations σM and σF.
  - But we see from the summary statistics that we do have estimates for them. Thus we can replace the true population standard deviations by their estimates and obtain the transformation:

$$t = \frac{5.91 - 5.46}{\sqrt{\frac{0.27^2}{27} + \frac{0.21^2}{34}}}$$

The distribution of this ratio?

- Having exchanged the unknown true standard deviations with their estimators (calculated from the data), it seems reasonable to suppose that extra variability has been added to this ratio, and we need to correct for it by changing from a normal distribution to another distribution. Previously, for the one-sample case, the new distribution which took this variability into account was the t-distribution.
- In the two-sample case, the ratio is

$$t = \frac{\bar{X}_M - \bar{X}_F}{\sqrt{\frac{s_M^2}{27} + \frac{s_F^2}{34}}}$$

This ratio has approximately a t-distribution with a very strange number of degrees of freedom:

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}$$

This is why using software is important, you don't want to calculate this stuff!
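The degrees-of-freedom formula looks intimidating on paper but is short in code. The sketch below is illustrative only: the function name `welch_df` is made up, and the inputs are the rounded summary values used in the height formulas above (standard deviations 0.27 and 0.21, sample sizes 27 and 34), so the result will only approximately match the software output.

```python
def welch_df(s1, n1, s2, n2):
    """Approximate degrees of freedom for the two-sample t statistic."""
    v1 = s1 ** 2 / n1  # estimated variance of the first sample mean
    v2 = s2 ** 2 / n2  # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


# Rounded summary values from the height example.
df = welch_df(s1=0.27, n1=27, s2=0.21, n2=34)
print(f"degrees of freedom: {df:.2f}")
```

The result always lies between the smaller of n1 - 1 and n2 - 1 and the total n1 + n2 - 2, which is a quick sanity check on any reported value.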

- We are testing H0: μM - μF = 0 against HA: μM - μF > 0 and have the t-transform

$$t = \frac{5.91 - 5.46}{\sqrt{\frac{0.27^2}{27} + \frac{0.21^2}{34}}} = 7.27$$

which we know has 48.045 degrees of freedom. Now we go to Statcrunch -> Stat -> Calculators -> T.

The area to the right of 7.11 for a t-distribution with 48.045 degrees of freedom is tiny. So at both the 5% and 1% significance levels we would reject the null. This means there is plenty of evidence to reject the null and conclude the mean height of males is greater than that of females.

Remember: If the sample sizes are both over 15, and the data not too skewed, using the t-distribution is reasonable.
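The calculator step can be imitated in plain Python. This is a rough sketch, not the method Statcrunch uses: it assumes Simpson's rule numerical integration of the t density, and the function names `t_pdf` and `t_upper_tail` are made up for illustration. Evaluating the ratio with the rounded summary values shown gives roughly 7.1, and the resulting tail area is tiny either way.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """P(T > t), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3          # P(0 < T < |t|)
    tail = max(0.0, 0.5 - central)   # P(T > |t|)
    return tail if t >= 0 else 1.0 - tail


# Rounded summary values from the height example.
se = math.sqrt(0.27 ** 2 / 27 + 0.21 ** 2 / 34)
t_value = (5.91 - 5.46) / se
p_value = t_upper_tail(t_value, df=48.045)
print(f"t = {t_value:.2f}, one-sided p-value = {p_value:.2e}")
```

The same function can be used for a left-tail area, since the area to the left of a negative t equals the area to the right of its absolute value.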

Summary of Analysis: Significant effect

Remember: Significance means the evidence of the data is sufficient to reject the null hypothesis (at our stated level α). Only data, and the statistics we calculate from the data, can be statistically "significant".

We can say that the sample means are "significantly different" or that the observed effect is "significant". But the conclusion about the population means is simply "they are different."

The observed effect of 0.46 between male and female height is significant, so we conclude that the true effect μM - μF is greater than zero.

Having made this conclusion, or even if we have not, we can always estimate the difference using the confidence interval [0.33, 0.58].

Standard errors

- In the one-sample case the standard error is

$$\frac{s}{\sqrt{n}} = \sqrt{\frac{s^2}{n}}$$

where s is the (estimated) standard deviation of the population and n is the sample size.

  - In the independent two-sample case the standard error is

$$\sqrt{\frac{s_1^2}{n} + \frac{s_2^2}{m}}$$

where s1² is the (estimated) variance of population one, s2² is the (estimated) variance of population two, and n and m are the two sample sizes.

These two standard errors are for different situations but the ideas are the same. Remember that a smaller standard error leads to more reliable estimators. Therefore, if we are designing the experiment to decrease the standard error, we observe that:

  - For the one-sample case, we can decrease the standard error by increasing the sample size (it is usually impossible to decrease the standard deviation).
  - For the two-sample case, we can decrease the standard error by increasing the size of both samples (again, it is usually impossible to decrease the standard deviations of the populations).

Choosing the sample size

- We now consider how to distribute the sample sizes in the case that the standard deviations for both samples are about the same. In this case the standard error is:

$$\sqrt{\frac{s^2}{n} + \frac{s^2}{m}} = s\sqrt{\frac{1}{n} + \frac{1}{m}}$$

  - Remember the standard deviation is fixed. We cannot change this value.
  - Suppose that we only have enough funds to include 200 subjects in our experiment. How should we distribute them amongst the two groups?
  - It makes no sense to have one subject in group 1 and 199 in group 2. For example, if we are comparing male and female heights, this would be using one male height to estimate the mean height of males and 199 female heights to estimate the mean height of females. Clearly this is wrong, and we can understand why from the standard error, which is

$$s\sqrt{\frac{1}{1} + \frac{1}{199}} \approx 1.0025\,s$$

  - On the other hand, if we distributed them evenly, 100 and 100, the standard error is a lot smaller:

$$s\sqrt{\frac{1}{100} + \frac{1}{100}} = 0.141\,s$$
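The comparison of allocations can be checked for every possible split of the 200 subjects. This is a small sketch under the same assumption as the text (both groups share the same standard deviation s); the function name `se_factor` is made up for illustration.

```python
import math


def se_factor(n, m):
    """Multiplier of s in the two-sample standard error: sqrt(1/n + 1/m)."""
    return math.sqrt(1 / n + 1 / m)


total = 200
for n in (1, 50, 100, 150, 199):
    m = total - n
    print(f"n = {n:3d}, m = {m:3d}: standard error = {se_factor(n, m):.4f} * s")

# Search every split with at least one subject per group.
best_n = min(range(1, total), key=lambda n: se_factor(n, total - n))
print("Smallest standard error at n =", best_n)
```

The even split gives the smallest standard error, and the factor grows slowly at first and then sharply as the split becomes more lopsided.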

Which type of test? One sample, paired samples or two independent samples?

- Comparing vitamin content of bread immediately after baking vs. 3 days later (the same loaves are used on day one and 3 days later).
- Comparing vitamin content of bread immediately after baking vs. 3 days later (tests made on independent loaves).
- Average fuel efficiency for 2005 vehicles is 21 miles per gallon. Is average fuel efficiency higher in the new generation "green vehicles"?
- Is blood pressure altered by use of an oral contraceptive? Comparing a group of women not using an oral contraceptive with a group taking it.
- Review insurance records for dollar amount paid after fire damage in houses equipped with a fire extinguisher vs. houses without one. Was there a difference in the average dollar amount paid?

Cautions about the two sample t-test or interval

- Using the correct standard error and degrees of freedom is critical.
- As in the one sample t-test, the method assumes simple random samples.
- Likewise, it also assumes the populations have normal distributions.
- Skewness and outliers can make the methods inaccurate (that is, having a confidence/significance level other than what it is supposed to be).
- The larger the sample sizes, the less this is a problem.
- It is also less of a problem if the populations have similar skewness and the two samples are close to the same size.
- "Significant effect" merely means we have sufficient evidence to say the two true means are different. It does not explain why they are different or how meaningful/important the difference is.
- A confidence interval is needed to determine how big the effect is.

Summary: Distribution of two sample means

- In order to do statistical inference, we must know a few things about the sampling distribution of our statistic.
- The sampling distribution of $\bar{x}_1 - \bar{x}_2$ has standard deviation

$$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$

(Mathematically, the variance of the difference is the sum of the variances of the two sample means.)

- This is estimated by the standard error

$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}.$$

If the sample sizes are both over 15, and the data not too skewed, using the t-distribution is reasonable.

- Then the two-sample t statistic is

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}.$$

- This statistic has an approximate t-distribution on which we will base our inferences. But the degrees of freedom is complicated ...

Two-sample t confidence interval

Recall that we have two independent samples and we use the difference between the sample averages ($\bar{x}_1 - \bar{x}_2$) to estimate (μ1 − μ2).

This estimate has standard error

$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}.$$

- The margin of error for a confidence interval of μ1 − μ2 is

$$m = t^* \times \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = t^* \times SE.$$

- t* is found using the computer. The confidence interval is then computed as

$$(\bar{x}_1 - \bar{x}_2) \pm m.$$

The interpretation of "confidence" is the same as before: it is the proportion of possible samples for which the method leads to a true statement about the parameters.

Two-sample t significance test

The null hypothesis is that both population means μ1 and μ2 are equal, thus their difference is equal to zero.

H0: μ1 = μ2 ⇔ H0: μ1 − μ2 = 0.

Either a one-sided or a two-sided alternative hypothesis can be tested.

Using the value (μ1 − μ2) = 0 given in H0, the test statistic becomes

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}.$$

To find the P-value, we look up the appropriate probability of the t-distribution using the df given by Statcrunch or me.
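For readers who want to see the whole calculation on raw data, the pieces above can be combined into one short function. This is only a sketch: the function name `welch_t` and the two small data sets are invented for illustration and are not any of the lecture data sets. It returns the t statistic, the standard error and the approximate degrees of freedom, which are the three numbers read off the software output in the examples.

```python
import math
import statistics


def welch_t(sample1, sample2, null_diff=0.0):
    """Two-sample t statistic, standard error and approximate df."""
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1) / n1  # s1^2 / n1
    v2 = statistics.variance(sample2) / n2  # s2^2 / n2
    se = math.sqrt(v1 + v2)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    t = (diff - null_diff) / se
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, se, df


# Hypothetical data, invented purely for illustration.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t, se, df = welch_t(group_a, group_b)
print(f"t = {t:.3f}, SE = {se:.3f}, df = {df:.2f}")
```

The P-value would then be looked up from a t-distribution with the returned df, exactly as described above.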

Statistics in the media

- Look at this article and the data they describe:
- http://www.economist.com/news/science-and-technology/21676754-curious-result-hints-possibility-dementia-caused-fungal
- What is the data that Dr. Carrasco has?
- If we did an independent sample t-test to see whether those with Alzheimer's had more fungal cells than those who did not have Alzheimer's, what would be the p-value (give a rough estimate)?

Accompanying problems associated with this Chapter

- Quiz 14
- Homework 7 (Questions 5, 6 and 7)

