
The Analysis of Variance

One-Way ANOVA

We use ANOVA when we want to look at statistical relationships (difference in means for example) between more than 2 populations or samples

ANOVA is a natural extension of ideas used in 2-pop t-tests and other methods we have explored

Trouble on the School Board!

Despite the school board’s best efforts – sensitive test score data for a large urban school district was leaked to the press!

The issue is a long standing argument that children in the inner city do not receive the same quality of education as do children in the suburban parts of the city. This could be very embarrassing for both the board and the mayor!

Here’s the data

A school board official states: “The data is roughly normally distributed and is what you would expect for a random sample of 90 students – 30 from each of the East, Central and West districts”

NOT SO FAST! Take a closer look at the data – check for “structure”

Our investigative reporter took Stats 300 in college! Here is what she did:

Sort the data into East, Central and West “bins”

The box plot suggests a cover-up!

Digging further… the full set becomes

Further tests… Thanks, StatsMan!

Summary of the 3 data sets:

Is there a statistical hypothesis lurking about?

The Hypotheses

Let μ1, μ2, and μ3 be the mean scores for the three populations:

Pop 1 = East, Pop 2 = Central, Pop 3 = West

Ho: μ1 = μ2 = μ3

Ha: ?

The null hypothesis is pretty straightforward

Why is this a problem?

Could we do this with pairwise t-tests?

YES!

What does this imply?

We have good evidence to reject the null hypothesis – the central district scores are statistically lower than the other two districts.

Could we just use pairwise t-tests?

If we had 12 school districts that we were testing in the same way as the previous case, how would the analysis change? How many pairs would we test? How many false positives would we expect at a 95% confidence level?
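To make these questions concrete, here is a quick sketch (standard-library Python only; the 12 districts and the 0.05 level are the values posed above) counting the pairs and the false positives we would expect if every null hypothesis were actually true:

```python
from math import comb

districts = 12
alpha = 0.05                           # per-test significance level

pairs = comb(districts, 2)             # number of distinct district pairs
expected_fp = pairs * alpha            # expected false positives under Ho
familywise = 1 - (1 - alpha) ** pairs  # chance of at least one false positive

print(pairs)                  # 66
print(round(expected_fp, 1))  # 3.3
print(round(familywise, 2))   # 0.97
```

So with 12 districts we would run 66 pairwise tests, expect roughly three false positives by chance alone, and be almost certain (about 97%) of at least one – which is why the pairwise approach breaks down as the number of groups grows.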

Why we can’t use multiple pairs of t-tests or why we should consider the entire set:

1. As the number of pairs increases, the chance of a false positive or an erroneous conclusion about the null hypothesis increases

2. By pooling all of the information (not just pairs) we get a much more precise value for the standard deviation in the population

3. By treating all of the data we can potentially detect interesting correlations between subgroups – these could easily be overlooked if we approached the data in a pair-wise fashion.

• Decreases the chance of false positives

• Pooling gives more precision in statistics

• Detect interesting correlations

Setting up for ANOVA

You guessed it – yet more terminology!

In 12.1 and 12.2 we will introduce:

A method to get an estimate for the standard deviation for the entire population (Pooled Estimator)

A new spin on degrees of freedom (df)

A new test for significance – the F-test

Pooled Estimator for σ

This is a generalization of the pooled-variance method we used in two-sample t-tests:

sp² = [(n1 − 1)s1² + (n2 − 1)s2² + … + (nI − 1)sI²] / [(n1 − 1) + (n2 − 1) + … + (nI − 1)]

This expression begins to measure the total variation in a population. Each si² term measures variation within a given sample. I represents the total number of independent SRS’s.

Sigma Rule…

If the largest standard deviation in a set of I SRS’s is less than twice as large as the smallest then we can approximate the standard deviation by using the pooled estimator.

Example: What is the pooled estimate for sigma for the 3 school districts?

I = 3 (East, Central, West are SRS’s), n1 = n2 = n3 = 30

sp² = [(30 − 1)(35.04)² + (30 − 1)(33.56)² + (30 − 1)(26.13)²] / [(30 − 1) + (30 − 1) + (30 − 1)]

sp² = 1012.28, so sp = 31.8
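This arithmetic is easy to check with a short script (standard-library Python; the sample sizes and standard deviations are the ones in the summary of the three data sets):

```python
# Summary statistics for the three districts (from the slides).
ns = [30, 30, 30]              # sample sizes
sds = [35.04, 33.56, 26.13]    # sample standard deviations

# Sigma rule: pooling is reasonable if the largest sd is
# less than twice the smallest.
assert max(sds) < 2 * min(sds)  # 35.04 < 52.26, so pooling is OK

# Pooled estimator: sp^2 = sum (ni-1)*si^2 / sum (ni-1)
sp2 = sum((n - 1) * s**2 for n, s in zip(ns, sds)) / sum(n - 1 for n in ns)
sp = sp2 ** 0.5

print(round(sp2, 2))  # 1012.28
print(round(sp, 1))   # 31.8
```

Note that the denominator, Σ(ni − 1) = 90 − 3 = 87, is the pooled degrees of freedom.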

Part II – Developing the F-Test

Conceptual Model

A collection of SRS’s drawn from a larger population illustrates two different kinds of variation:

Internal variation around a sample mean within a given SRS

Variation of the SRS means about the overall population mean

ANOVA compares the two kinds of variability

The null hypothesis often is equivalent to saying that the populations overlap (have the same mean for example)

Another way of saying this is that the SRS’s share the “grand mean” of the entire population

This could happen if the individual SRS’s have large variation internally but not externally

We need a way to quantify this

Ways of quantifying variation

The F-Value

We can compare variation between samples with the variation within samples by calculating the Mean Square of the error in both cases.

This is expressed as:

F = MS(between) / MS(within)

We will get to F-distributions in a few moments

Mean Square Error – MSE(within)

This is what the pooled estimator determines:

sp² = MSE(within)

This means that our school board data has an internal MSE of (31.8)²

Mean Square Error – MSE(between)

To determine this we need the “grand mean” for all of the data:

x̄grand = (30·649 + 30·548 + 30·635) / 90 ≈ 611

Mean Square Error – MSE(between)

Define as:

MS(between) = Σ ni(x̄i − x̄grand)² / (I − 1), where df(between) = I − 1

A new application of the idea of degrees of freedom

Example – school board data:

MS(between) = [30(649 − 611)² + 30(548 − 611)² + 30(635 − 611)²] / (3 − 1) = 89835

We can now determine the “F-Value” for this data:

F = MSb / MSw = 89835 / (31.8)² = 88.8
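As a check, the whole F-ratio can be reproduced from the summary statistics alone (standard-library Python; the means 649, 548, 635 and the standard deviations come from the slides – the small differences from 89835 and 88.8 arise because the slides round the grand mean to 611):

```python
ns = [30, 30, 30]
means = [649, 548, 635]        # East, Central, West sample means
sds = [35.04, 33.56, 26.13]    # sample standard deviations

I = len(ns)
N = sum(ns)
grand = sum(n * m for n, m in zip(ns, means)) / N  # grand mean, about 610.7

# MS(between): variation of the sample means about the grand mean.
ms_between = sum(n * (m - grand) ** 2
                 for n, m in zip(ns, means)) / (I - 1)

# MS(within): the pooled variance sp^2, denominator N - I = 87.
ms_within = sum((n - 1) * s**2 for n, s in zip(ns, sds)) / (N - I)

F = ms_between / ms_within
print(round(ms_between))  # 89830 (the slides get 89835 with grand mean 611)
print(round(F, 1))        # 88.7 (the slides' 88.8, to rounding)
```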

I Don’t Get It!

Confused? We are almost there. We now know how to quantify the variation within SRS’s (MSw) and the variation between the means of the SRS’s (MSb)

The “F-ratio” can be compared against tables just like we did for z-tests and t-tests

How to Use an “F-ratio”

You need to know some important numbers:

The number of SRS’s (I): from this we form the degrees of freedom for the MSb term: dfb = I − 1

The total number of data points (the pooled data) = N; the degrees of freedom for the MSw term are dfw = N − I (not N − 1, since one mean is estimated for each of the I samples)

The F-ratio tests the null hypothesis (i.e., that the means are equal)

If Ho is true then F ≈ 1

F = MSb / MSw (numerator: MSb, denominator: MSw)
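The claim that F ≈ 1 under Ho is easy to verify by simulation (standard-library Python; the population mean 611 and standard deviation 32 are placeholder values chosen to echo the school data – any shared mean and sd would do):

```python
import random

random.seed(1)

def f_ratio(groups):
    """One-way ANOVA F-ratio: MSb / MSw, with dfb = I-1 and dfw = N-I."""
    I = len(groups)
    N = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand = sum(sum(g) for g in groups) / N
    msb = sum(len(g) * (m - grand) ** 2
              for g, m in zip(groups, means)) / (I - 1)
    msw = sum(sum((x - m) ** 2 for x in g)
              for g, m in zip(groups, means)) / (N - I)
    return msb / msw

# Ho is true by construction: all three groups of 30 are drawn
# from the SAME normal population.
sims = [f_ratio([[random.gauss(611, 32) for _ in range(30)] for _ in range(3)])
        for _ in range(2000)]
print(round(sum(sims) / len(sims), 2))  # close to 1, as the slide claims
```

The average simulated F hovers near 1 (the exact expectation of an F(2, 87) variable is 87/85 ≈ 1.02), while the school data’s F of 88.8 is nowhere near that.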

Testing the School Board’s Claim

The school board’s claim was that there was no difference between the three districts’ mean test scores.

Since there were 90 students (N = 90) and 3 groups (I = 3) we should use the F(I−1, N−I) = F(2,87) distribution

So … use Table E with F = 88.8 on F(2,87). Since these degrees of freedom are not listed we need to approximate: you should be able to bracket the p-value between an upper and a lower tabulated value.

With an F-ratio as big as 88.8 you really don’t need to look it up – you know Ho is false!

Use Minitab or EXCEL

Life is short! ANOVA is a complex (number-intensive) process. Let’s look at two approaches: Minitab and Excel.

Next lecture …

We will spend next lecture working through several examples of ANOVA

When doing this keep in mind what it is that you are calculating

Don’t get overwhelmed by the detail!