
Agenda for Week 7, Hour 1

ANOVA and R-squared revisited.

Multiple regression and r-squared.

Week 7, Hour 2

Multiple regression: co-linearity, perturbations,

correlation matrix


Consider this made-up dataset on silicon wafers, wafers.csv. It’s based on a very common type of quality control analysis in manufacturing.

A factory manager is interested in reducing the number of bad wafers the factory produces in a batch.

She sets the factory to make 6 batches of wafers at each combination of 3 levels of cooking temperature and 3 levels of spin speed, for 54 batches of wafers in total. The response variable is the number of bad wafers (in a batch of 1000).


Here are select rows from the dataset.

'cooktemp' is the cooking temperature in Celsius

'spinrpm' is the spin rate while cooling, in RPM

'bad' is the number of bad wafers in the batch


Note that even though we can describe temperature and speed as continuous variables, we are treating them as categories here.

Essentially we are calling them ‘low’, ‘medium’, and ‘high’ settings.

# Treat spin rate and cooking temperature as categorical factors
wafers$spinrpm = as.factor(wafers$spinrpm)
wafers$cooktemp = as.factor(wafers$cooktemp)


Here is the one-way ANOVA of 'bad' using cooking temperature as an explanatory variable.

mod = lm(bad ~ cooktemp, data=wafers)
anova(mod)
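The sums of squares can also be pulled from the anova() object directly, rather than read off the printed table (a minimal sketch; the two values are the ones used below):

aov_table = anova(mod)    # a data frame with one row per source of variation
aov_table[["Sum Sq"]]     # SSgroup and SSresid; about 727 and 2934 here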


The p-value is small, so we have strong evidence that cooking temperature matters.

Without the p-value, we could compare the obtained F to a critical value for F.

(Recall: the F test is one-tailed; we only reject for large values of F.)
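For example (a sketch, assuming the usual 5% significance level): with 3 temperature levels and 54 batches, the F statistic has 2 and 51 degrees of freedom, so the critical value comes from qf().

qf(0.95, df1 = 2, df2 = 51)   # upper critical value; reject if the observed F exceeds this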


A hypothesis test tells us that some of the variance in bad wafer count is explained by cooktemp.

It doesn't tell us how much of the variance is explained.

For that we need the Sum of Squares total,

which is SSgroup + SSresid = 727 + 2934 = 3661


Proportion of variance explained, or R-squared

= SSgroup / SStotal

= 727 / 3661

= 0.1986, or 19.86% of variation explained.
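The same arithmetic in R, using the sums of squares from the ANOVA table:

SSgroup = 727
SSresid = 2934
SStotal = SSgroup + SSresid   # 3661
SSgroup / SStotal             # about 0.1986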


We can also get this information from the summary of the lm() object that we used to get the ANOVA in the first place.

There's no such thing as a correlation in an ANOVA, but sometimes the ANOVA is described as having an R-squared because of this variance-explained connection.
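For example, a minimal sketch (the second line extracts the r.squared component that summary() of an lm object returns):

summary(mod)              # the 'Multiple R-squared' line should show about 0.1986
summary(mod)$r.squared    # the same value, extracted directly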


A two-armed bird needs a two-way ANOVA


Here is the two-way ANOVA using both cooking temperature and spin rate to explain 'bad'.

mod = lm(bad ~ spinrpm + cooktemp, data=wafers)   # model with both factors
anova(mod)


First, do we have evidence that the number of bad wafers changes with temperature?

What about by spin speed?

Yes to both.

The p-values associated with both factors are small.


So both factors are explaining a significant proportion of the variance. But how much?

We need the sum of squares total. This is 3661, the sum of the sums of squares from all sources: temperature, spin speed, and residuals.


SStotal = SSspin + SStemp + SSresid

= 1840 + 727 + 1094

= 3661 (The same as in the one-way ANOVA)

Of this total, spin speed explains

SSspin / SStotal = 1840 / 3661 = 50.26%

of the variation, and temperature explains

SStemp / SStotal = 727 / 3661 = 19.86% of it.
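The same proportions, computed in R from the sums of squares in the two-way ANOVA table:

SSspin  = 1840
SStemp  = 727
SSresid = 1094
SStotal = SSspin + SStemp + SSresid   # 3661, the same total as in the one-way ANOVA
SSspin / SStotal                      # about 0.5026
SStemp / SStotal                      # about 0.1986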


Both grouping variables together explain

2567 / 3661 = 70.12% of the variation

We can confirm this by looking at the linear model summary.

The multiple R-squared should match the variance explained by the model (i.e., everything but the residuals).
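A quick check (assuming mod is the two-way model fitted above): the hand calculation and the summary output should agree up to rounding.

(1840 + 727) / 3661       # about 0.7012
summary(mod)$r.squared    # the 'Multiple R-squared' reported for the two-way model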


Are they evolving, or are we regressing?


Recall that in Assignment 1, we looked at some National Hockey League data. We made a model of wins as a function of goals against (GA).

This is a simple regression model. The regression equation is

Wins = 78.83 – 0.163*GA + error


Wins = 78.83 – 0.163*GA + error

…means that a team with 0 goals against it is expected to win 78.83 of their 82 games, and that every goal against the team costs it 0.163 wins.

In this model, goals against explained 42.21% of the variation in the number of wins.
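A minimal sketch of fitting such a model, assuming the assignment's data sit in a data frame called nhl with columns Wins and GA (these names are assumptions, not the assignment's actual names):

nhl_mod = lm(Wins ~ GA, data = nhl)   # simple regression of wins on goals against
summary(nhl_mod)                      # intercept near 78.83, GA slope near -0.163, R-squared near 0.4221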


We can expand this from a simple regression into a multiple regression model by incorporating a second explanatory variable, Goals For (GF)

The regression equation is

Wins = 37.95 – 0.163*GA + 0.177*GF + error


The regression equation is

Wins = 37.95 – 0.163*GA + 0.177*GF + error

…meaning that a team with both 0 goals against and 0 goals for will win 37.95 games (a bit fewer than half).

Every goal against will reduce this win count by 0.163 (holding “goals for” constant)

Every goal for will increase the win count by 0.177

(holding “goals against” constant)
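A sketch of the corresponding fit, using the same assumed data frame as before and adding a GF (goals for) column:

nhl_mod2 = lm(Wins ~ GA + GF, data = nhl)   # multiple regression with two explanatory variables
coef(nhl_mod2)                              # intercept near 37.95, GA slope near -0.163, GF slope near 0.177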


When doing a multiple regression, the slope coefficient associated with each variable is implicitly interpreted as “holding the other variables constant”.

That means we take each slope effect separately, even if they often appear together.

Example:

If the team makes a change (e.g. a trade or a coaching change) such that it will score 5 more goals in a season, but also allow 3 more goals, then:

Adding 5 'goals for' and 3 'goals against'.

The effect of the additional goals against is to earn

0.163 * 3 = 0.489 fewer wins per season.

The effect of the increase in goals for is to earn

0.177 * 5 = 0.885 more wins per season.

The total effect is the sum of each separate effect, so with the change, we expect an increase of

(-0.489) + 0.885 = 0.396 wins this season.
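The same bookkeeping as a quick calculation:

ga_effect = -0.163 * 3    # 3 more goals against: about 0.489 fewer wins
gf_effect =  0.177 * 5    # 5 more goals for: about 0.885 more wins
ga_effect + gf_effect     # net change of about +0.396 wins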


In this multiple regression model, goals for and against together explain 82.93% of the variation in wins.

In this model, it’s not surprising that including both 'goals for' and 'goals against' is better than including only one.

However, when including additional explanatory variables, the R-squared can only increase, never decrease, all the way up toward 100% with enough variables.


Even if the new variables are completely random noise, the r-squared will increase by a little bit.

We use the ‘multiple r-squared’ in the model summary because it’s easy to interpret, but the adjusted r-squared is also useful: it is always a little less than the multiple r-squared, to account for the amount that r-squared would increase from random noise alone.
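Both quantities appear in the summary() printout and can be extracted directly (a sketch, assuming nhl_mod2 is the two-variable model sketched earlier):

summary(nhl_mod2)$r.squared       # multiple R-squared, about 0.8293
summary(nhl_mod2)$adj.r.squared   # adjusted R-squared, slightly smaller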


Question: Don’t goals for and goals against determine wins entirely? If you score more goals than your opponent, you win. End of story. Right?

Answer: For a single game, that’s true. But we don’t have data of this resolution.

We have the total goals for and against for the entire season, but not for individual games.


When we aggregate data (e.g. add together the goals from different games in a season), we lose some information.

Winning a game by 1 goal or winning it by 50 goals both count as a single win.

That’s where the remaining 17% unexplained variance is: in the differences between individual games.


Question: Could there be a team so terrible that the model predicts it to have fewer than 0 wins?

Answer: Yes. However, such a team would be an extreme outlier in the data.

We shouldn’t extrapolate and apply the model to cases far outside the data we have observed.


Break time
