
Residuals Outliers and influential points. Correlation vs ...jackd/Stat203_2011/Wk08_3_Full.pdf ·...

Date post: 18-Aug-2018
Upload: hatuong
Residuals Outliers and influential points. Correlation vs. causation
Transcript

Residuals

Outliers and influential points.

Correlation vs. causation


When we use correlation, we make certain assumptions

about the data:

1. A straight-line relationship.

2. Interval data

3. Random sampling

4. Normally distributed characteristics (approximately normal is OK)

Today we’re going to look at ways these assumptions can

be violated.


First, a tool for finding problems in correlation: Residuals.


One way to show a correlation is to fit a line through the

middle of the data. (Line of best fit)

If the line clearly slopes up or down and keeps close to the data,

you have a correlation.


Since a line won’t perfectly describe the relationship between

two variables, especially when randomness is involved,

there’s some error left over.

These leftovers are called residuals. (as in “left behind”)
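
To make "leftovers" concrete, here is a minimal sketch. It assumes the line of best fit is the ordinary least-squares line (the slides don't say which fitting method was used), and the five data points are made up for illustration.

```python
# Made-up (x, y) pairs, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line (an assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def residuals(xs, ys):
    """Observed y minus the y predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

res = residuals(xs, ys)
print([round(r, 2) for r in res])
```

A positive residual means the point sits above the line; a negative one means it sits below. With a least-squares fit the residuals add up to zero.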


Looking at a graph of the residuals can magnify patterns

that were not immediately obvious in the data before.

In this case, the points dip below the line and then come

back above it.


If the relationship between two variables really is linear,

then any other patterns should be random noise.

That means if we see any obvious pattern in the residuals,

including this one, a correlation coefficient isn’t going to tell

you the whole story.
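
A small sketch of why a pattern in the residuals matters. The data here are made up (y is exactly x squared, so the true relationship is curved), and the line is fitted by least squares, which is an assumption. The correlation comes out very high, yet the residuals start above the line, dip below it, and come back above it.

```python
import math

# Made-up curved data: y = x squared, for x = 1..10.
xs = [float(x) for x in range(1, 11)]
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Pearson correlation coefficient.
r = sxy / math.sqrt(sxx * syy)

# Least-squares line and its residuals.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
res = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(round(r, 3))
print(["+" if e > 0 else "-" for e in res])  # signs show the dip in the middle
```

The correlation alone would suggest a nearly perfect straight line. The run of same-sign residuals in the middle is what gives the curve away.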


Sometimes people try to correlate interval data to

something ordinal or nominal. This is dumb.

These are residuals from trying to correlate a yes/no

response to something interval.


Using ordinal data will leave huge jumps from one level to

the next. Nominal data simply won’t fit on a scatterplot.

Both cases violate the assumption of interval data.


Sometimes the pattern isn’t a trend in the center of the

data; it can also be a trend in the spread of the data.

If the variation in y changes as x changes, the relationship

between x and y is called heteroscedastic.


Hetero means “different” and Scedastic means “scattered”.

Heteroscedastic means there is a different amount of

scatter at different data points. If you encounter it, it could

lower your correlation so it’s worth mentioning. (Look for

fan shapes)

If the variation in y is the same everywhere, we call that

homoscedastic, meaning “same scatter”.
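
One rough way to look for a fan shape without a graph is to compare the size of the residuals at low x against the size at high x. The sketch below is an illustration only: the data are simulated with noise that grows with x, the seed is arbitrary, the line is a least-squares fit (an assumption), and splitting the data into two halves is a crude check, not a formal test.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the made-up data is reproducible

# Made-up data: the noise in y grows in proportion to x (a fan shape).
xs = [float(x) for x in range(1, 101)]
ys = [2.0 * x + rng.gauss(0.0, 0.5 * x) for x in xs]

# Least-squares line.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x
res = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

def rms(values):
    """Root mean square: the typical size of the residuals."""
    return math.sqrt(sum(v * v for v in values) / len(values))

spread_low = rms(res[:50])   # residuals for x = 1..50
spread_high = rms(res[50:])  # residuals for x = 51..100
print(round(spread_low, 1), round(spread_high, 1))
```

If the two numbers are similar, the scatter is roughly the same everywhere. A much larger second number is the fan shape showing up in the arithmetic.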


If the Toronto Stock Exchange Index were correlated to

something linearly, the residual graph would resemble this.

As the index numbers get higher, they tend to jump up and

down more. Going from 10,000 to 10,100 is no big deal (a 1% move).

Going from 500 to 600 is a big deal (a 20% move).


Residuals should look like this: a horizontal band of noise.

There should be no obvious trends or patterns.

The occasional point can be outside the band without issue.


But how far out is too far?

What happens when you get there?


Outliers are a violation of the assumption of normality, and

correlation can be sensitive to outliers.

A value that is far from the rest of the data can throw off

the correlation, or create a false correlation.


Example: In the 1960s, a survey was done to get various

facts about TV stars.

Intelligence Quotient (IQ) was found to be positively

correlated with shoe size. (r = 0.503, n = 30)

(This story is true; the exact data has been made up.)


Could this be a fluke of the data? Did they falsely find a

correlation? (r=.503, n=30)


t-score = 3.08

t* = 2.048 for df=28, .05 significance (2 tailed)

t* = 2.763 for df=28, .01 sig

So p < 0.01. (By computer: p=.0046)

That means it’s possible, but highly unlikely we’ll see a

correlation of this strength in uncorrelated data by chance

alone.
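
The t-score can be checked from r and n alone. The sketch below assumes the usual test statistic for a correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom; the slides don't print the formula, but it gives the same 3.08. The critical values are the ones quoted above, and the function name is made up for this example.

```python
import math

def t_from_r(r, n):
    """t statistic for testing a correlation r from n pairs (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = t_from_r(0.503, 30)

# Critical values quoted on the slide for df = 28, two-tailed.
T_CRIT_05 = 2.048
T_CRIT_01 = 2.763

print(round(t, 2))
print("significant at .05:", abs(t) > T_CRIT_05)
print("significant at .01:", abs(t) > T_CRIT_01)
```

Because the t-score is beyond even the .01 critical value, the p-value has to be below .01.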


Standard practice is to visualize the data when possible.

There’s no obvious trend, except...


What is that?

There’s one person with very high IQ and very large shoes.

In other words, an outlier.


It’s Bozo the Clown.

He had huge clown shoes, and he was a verified genius.

We can’t assume normality with that Bozo in the way.


So what now?

We could remove Bozo from the dataset, but if we remove

data points we don’t like, we could come to almost any

“conclusion” we wanted.

That’s why we have the assumption of random selection.

If Bozo can’t be in the sample, then his chance of being

selected is zero (no longer equal chance in population).


We can remove him and keep randomization, but it implies

that Bozo was not in the population of interest. (Equal

chance of selection among non-clowns?)

Most respondents wear shoes that fit their feet. Bozo wore

absurdly large shoes, much larger than his feet, for

entertainment.

So dismissing the Bozo data as an outlier is reasonable: his

shoes are fundamentally different.


Let’s try the analysis again without including Bozo’s data.

r = -.006, n=29


Is there still a significant correlation?

...not even close.

t* = 1.314 for df=27, .20 significance (2 tailed),

so p-value > .20 (actually p-value = 0.9975)


So removing Bozo the Clown from the dataset completely

changed our results.

Bozo wasn’t just an outlier, he was an influential outlier

because he alone influenced the results noticeably.

Not every outlier is influential, and

Not every influential point is an outlier.
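
A quick way to see whether a point is influential is to compute the correlation with it and without it. The sketch below uses a tiny made-up dataset, not the survey data: eight ordinary people whose shoe size and IQ are unrelated, plus one extreme point standing in for Bozo. All names and numbers are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: shoe size and IQ for eight ordinary people (no relationship).
shoe = [8.0, 9.0, 10.0, 11.0, 8.0, 9.0, 10.0, 11.0]
iq = [100.0, 110.0, 110.0, 100.0, 105.0, 95.0, 95.0, 105.0]

# One hypothetical extreme point: huge shoes and a very high IQ.
extreme_shoe, extreme_iq = 20.0, 150.0

r_without = pearson_r(shoe, iq)
r_with = pearson_r(shoe + [extreme_shoe], iq + [extreme_iq])

print(round(r_without, 3), round(r_with, 3))
```

One point out of nine is enough to turn no correlation into a strong one, which is exactly what makes it influential.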


Outliers are points that don’t fit the pattern. Correlation

assumes a linear pattern.

r = .032, p-value = .866

An outlier is anything outside the linear trend.


For a point to be influential, it just has to change the linear

trend. If it’s far enough from the mean in the x-direction, it

doesn’t have to be far from the trend to change the results.

Changing this one point from IQ 100 to IQ 110

changes the correlation from .016 to .155.


More formally, an outlier is anything with a large residual.

Since normality, and hence symmetry, is assumed, the 3

standard deviation rule applies.

Anything with a residual more than 3 standard deviations above or

below zero is considered an outlier.
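
The 3 standard deviation rule is easy to apply once you have the residuals. The sketch below is a simplified illustration: the residuals are made up (not from a real fit), and the standard deviation is measured about zero as a plain root mean square, which is an assumption about how the spread is computed. The function name `flag_outliers` is hypothetical.

```python
import math

def flag_outliers(residuals, k=3.0):
    """Return the positions of residuals more than k SDs from zero."""
    # SD measured about zero (root mean square), a simplifying assumption.
    sd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [i for i, r in enumerate(residuals) if abs(r) > k * sd]

# Made-up residuals: twenty small ones and one large one at position 20.
residuals = [1.0, -1.0] * 10 + [12.0]

flagged = flag_outliers(residuals)
print(flagged)
```

Note that the large residual inflates the standard deviation it is being compared against, so in very small datasets no point can reach 3 standard deviations at all.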


If residuals show heteroscedasticity, outliers are more

likely to show up, and in greater numbers.

Page 31: Residuals Outliers and influential points. Correlation vs ...jackd/Stat203_2011/Wk08_3_Full.pdf · Outliers are a violation of the assumption of normality, and correlation can be

Look at your data closely. Get right in its face.

Page 32: Residuals Outliers and influential points. Correlation vs ...jackd/Stat203_2011/Wk08_3_Full.pdf · Outliers are a violation of the assumption of normality, and correlation can be

You can use statistics and graphs to intimately know a

dataset, but numbers and pictures aren’t a substitute for

reasoning.

Just because two things happen together (or one after

another) doesn’t mean that one of them causes the other.

A correlation between two things doesn’t imply causation.

Page 33: Residuals Outliers and influential points. Correlation vs ...jackd/Stat203_2011/Wk08_3_Full.pdf · Outliers are a violation of the assumption of normality, and correlation can be

Consider this crime and sales data of a large city over five

years (one point = one month)

Homicide rates are strongly positively correlated with ice

cream sales. (r = .652, n=60)

Page 34: Residuals Outliers and influential points. Correlation vs ...jackd/Stat203_2011/Wk08_3_Full.pdf · Outliers are a violation of the assumption of normality, and correlation can be

Jumping from correlation to causation, we find that

availability of ice cream is driving people to kill each other.

But correlation works both ways. Ice cream sales are

correlated with homicide rates.

That also must mean that nothing builds an appetite for

cold, cold ice cream like...

cold, cold murder.

Page 35: Residuals Outliers and influential points. Correlation vs ...jackd/Stat203_2011/Wk08_3_Full.pdf · Outliers are a violation of the assumption of normality, and correlation can be

Causation works in one direction. Correlation works both

ways. That alone should be enough not to make that leap.

Often there’s a common explanation for increases in both

variables. In this case it’s heat. Both increase in summer.
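
A simulation makes the common-cause idea concrete. In the sketch below, temperature drives both ice cream sales and homicides, and neither one appears in the other's formula. Everything here is made up for illustration: the 60 monthly temperatures, the coefficients, the noise levels, and the seed.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the made-up data is reproducible

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# 60 made-up monthly temperatures (degrees C).
temps = [rng.uniform(0.0, 30.0) for _ in range(60)]

# Each variable depends on temperature plus its own noise, and on nothing else.
ice_cream = [10.0 * t + rng.gauss(0.0, 30.0) for t in temps]
homicides = [0.5 * t + rng.gauss(0.0, 3.0) for t in temps]

r = pearson_r(ice_cream, homicides)
print(round(r, 3))
```

The two series end up clearly correlated even though, by construction, ice cream has no effect on homicides and homicides have no effect on ice cream.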

Page 36: Residuals Outliers and influential points. Correlation vs ...jackd/Stat203_2011/Wk08_3_Full.pdf · Outliers are a violation of the assumption of normality, and correlation can be

Simple, right? Then how do mistakes like this get made?

Study: Mercury can cause NY loon population drop. (source: Wall Street Journal June 28, 2:21pm)

“A 10-year study of Adirondack loons shows mercury contamination can lead to

population declines because birds with elevated mercury levels produce fewer chicks

than those with low levels”

Page 37: Residuals Outliers and influential points. Correlation vs ...jackd/Stat203_2011/Wk08_3_Full.pdf · Outliers are a violation of the assumption of normality, and correlation can be

“But how can we ever tell causation with statistics?”

Short answer: You can’t.

Good answer: You can’t with statistics alone, because

dealing with numbers after the fact is observational.

But you can use it in combination with other fields

(Experimental Design) to manipulate variables.

Indoor greenhouses can manipulate soil type, moisture, and

light directly, but plants still have randomness.

Page 38: Residuals Outliers and influential points. Correlation vs ...jackd/Stat203_2011/Wk08_3_Full.pdf · Outliers are a violation of the assumption of normality, and correlation can be

Better answer: (for interest only)

Google books for a preview, or look up the term

“Contrapositive”

Page 39: Residuals Outliers and influential points. Correlation vs ...jackd/Stat203_2011/Wk08_3_Full.pdf · Outliers are a violation of the assumption of normality, and correlation can be

Next time, we expand to multiple correlations and partial

correlations. We may finish chapter 10 early.

ASSIGNMENTS DUE AT 4:30PM!!!!

