Copyright © 2010 Pearson Education, Inc. Slide 7 - 1

Lauren is enrolled in a very large college calculus class. On the first exam, the class mean was 75 and the standard deviation was 10. On the second exam, the class mean was 70 and the standard deviation was 15. Lauren scored 85 on both exams. Assuming the scores on each exam were approximately normally distributed, on which exam did Lauren score better relative to the rest of the class?

a. She scored much better on the first exam.

b. She scored much better on the second exam.

c. She scored about equally well on both exams.

d. It is impossible to tell because the class size is not given.

e. It is impossible to tell because the correlation between the two sets of exam scores is not given.
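One way to compare the two scores is to standardize each one against its own exam. The sketch below does this in Python; the helper name `z_score` is ours, not part of the slides.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above the mean."""
    return (value - mean) / sd

# Lauren scored 85 on both exams.
z_exam1 = z_score(85, mean=75, sd=10)  # first exam: mean 75, SD 10
z_exam2 = z_score(85, mean=70, sd=15)  # second exam: mean 70, SD 15
print(z_exam1, z_exam2)  # 1.0 1.0
```

Both scores sit one standard deviation above the class mean, which points to choice c.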



Chapter 3: Scatterplots, Correlation and Least Squares Regression


Intro Vocab

Response Variable: Measures an outcome of a study

Explanatory Variable: Attempts to explain the observed outcome

Independent Variable: X (explanatory)

Dependent Variable: Y (response)



Looking at Scatterplots

Scatterplots may be the most common and most effective display for data.

In a scatterplot, you can see patterns, trends, relationships, and even the occasional extraordinary value sitting apart from the others.

Scatterplots are the best way to start observing the relationship and the ideal way to picture associations between two quantitative variables.

Always plot the explanatory variable on the horizontal axis.


Looking at Scatterplots (cont.)

When looking at scatterplots, we will look for direction, form, strength, and unusual features.

Direction: A pattern that runs from the upper left to the lower right is said to have a negative direction. A trend running the other way has a positive direction.



The figure shows a negative direction between the year since 1970 and the prediction errors made by NOAA.

As the years have passed, the predictions have improved (errors have decreased).

Can the NOAA predict where a hurricane will go?



As the central pressure increases, the maximum wind speed decreases.

We can call this a linear relationship.



Form: If there is a straight line (linear) relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form.



Form: If the relationship isn’t straight, but curves gently, while still increasing or decreasing steadily, we can often find ways to make it more nearly straight.



Form: If the relationship curves sharply, the methods cannot really help us.



Strength: At one extreme, the points appear to follow a single stream (whether straight, curved, or bending all over the place).



Strength: At the other extreme, the points appear as a vague cloud with no discernable trend or pattern:

Note: we will quantify the amount of scatter soon.



Unusual features: Look for the unexpected. Often the most interesting thing to see in a scatterplot is the thing you never thought to look for.

One example of such a surprise is an outlier standing away from the overall pattern of the scatterplot.

Clusters or subgroups should also raise questions.


Roles for Variables

It is important to determine which of the two quantitative variables goes on the x-axis and which on the y-axis.

This determination is made based on the roles played by the variables.

When the roles are clear, the explanatory or predictor variable (independent) goes on the x-axis, and the response variable (variable of interest) (dependent) goes on the y-axis.


Roles for Variables (cont.)

The roles that we choose for variables are more about how we think about them rather than about the variables themselves.

Just placing a variable on the x-axis doesn’t necessarily mean that it explains or predicts anything. And the variable on the y-axis may not respond to it in any way.


Examples: Determine which variable goes on the x-axis and y-axis:

a. Do baseball teams that score more runs sell more tickets to their games?

b. Do older houses sell for less than new ones of comparable size and quality?

c. Do students who score higher on their SAT tests have higher grade point averages in college?

d. Can we estimate a person’s percent body fat more simply by just measuring waist or wrist sizes?


TI-Tips Scatterplots

First let me show you how to name a list …

Enter the years 1990 to 2000 as 0, 1, 2, …, 10 in L1.

In the next column enter the following values: 6546, 6996, 6996, 7350, 7500, 7978, 8377, 8710, 9110, 9411, 9800.

STATPLOT, choose the scatterplot icon.

Identify which Xlist and Ylist (L1 and L2).

Choose a symbol for displaying the points.

ZoomStat.

(If you ever get an ERR:DIM MISMATCH, it means you don’t have the same number of x’s as y’s, or you may have another STATPLOT on.)

TRACE will show you the value of each point.


Correlation

Data collected from students in Statistics classes included their heights (in inches) and weights (in pounds):

Here we see a positive association and a fairly straight form, although there seems to be a high outlier.


Correlation (cont.)

How strong is the association between weight and height of Statistics students?

If we had to put a number on the strength, we would not want it to depend on the units we used.

A scatterplot of heights (in centimeters) and weights (in kilograms) doesn’t change the shape of the pattern:


Correlation (cont.)

Since the units don’t matter, why not remove them altogether?

We could standardize both variables and write the coordinates of a point as $(z_x, z_y)$.

Here is a scatterplot of the standardized weights and heights:


Correlation (cont.)

Note that the underlying linear pattern seems steeper in the standardized plot than in the original scatterplot.

That’s because we made the scales of the axes the same.

Equal scaling gives a neutral way of drawing the scatterplot and a fairer impression of the strength of the association.


Correlation (cont.)

Some points (those in green) strengthen the impression of a positive association between height and weight.

Other points (those in red) tend to weaken the positive association.

Points with z-scores of zero (those in blue) don’t vote either way.


Correlation (cont.)

The correlation coefficient (r) gives us a numerical measurement of the strength of the linear relationship between the explanatory and response variables.

$r = \frac{\sum z_x z_y}{n - 1}$
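The formula r = (sum of z_x · z_y) / (n − 1) can be sketched directly in Python. The heights and weights below are hypothetical illustration values, not the class data from the slides, and the function name `correlation` is ours.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1), using sample standard deviations."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    sd_x, sd_y = stdev(xs), stdev(ys)
    zx = [(x - mean_x) / sd_x for x in xs]  # standardized x-values
    zy = [(y - mean_y) / sd_y for y in ys]  # standardized y-values
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Hypothetical heights (inches) and weights (pounds).
heights = [62, 65, 67, 70, 72]
weights = [120, 140, 150, 165, 190]
r = correlation(heights, weights)
print(round(r, 3))
```

Points where both z-scores share a sign add to the sum, and points where the signs differ subtract from it, which is the "voting" picture from the previous slide.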


Correlation (cont.)

For the students’ heights and weights, the correlation is 0.644.

What does this mean in terms of strength? We’ll address this shortly.


Correlation Conditions

Correlation measures the strength of the linear association between two quantitative variables.

Before you use correlation, you must check several conditions:

Quantitative Variables Condition

Straight Enough Condition

Outlier Condition


Correlation Conditions (cont.)

Quantitative Variables Condition:

Correlation applies only to quantitative variables.

Don’t apply correlation to categorical data masquerading as quantitative.

Check that you know the variables’ units and what they measure.



Straight Enough Condition:

You can calculate a correlation coefficient for any pair of variables.

But correlation measures the strength only of the linear association, and will be misleading if the relationship is not linear.



Outlier Condition:

Outliers can distort the correlation dramatically.

An outlier can make an otherwise small correlation look big or hide a large correlation.

It can even give an otherwise positive association a negative correlation coefficient (and vice versa).

When you see an outlier, it’s often a good idea to report the correlations with and without the point.
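As a rough illustration of reporting the correlation with and without a suspect point, here is a small Python sketch. The data and the outlier (10, 0) are made up for the example, and `pearson_r` is our own helper name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient; the (n - 1) factors cancel in this form."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data with a clear positive association.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r_without = pearson_r(xs, ys)

# Same data plus one hypothetical outlier at (10, 0).
r_with = pearson_r(xs + [10], ys + [0])

print(round(r_without, 3), round(r_with, 3))
```

In this made-up case a single point is enough to turn a positive correlation into a negative one.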


TI Tips – Finding Correlations

You must first tell your calculator to find correlations. Do this once and it should stay set until you change your batteries: 2nd CATALOG. Scroll down until you find DiagnosticOn. Hit ENTER. It should say DONE.

Check the conditions first by looking at a scatterplot. Does the association look linear? Are there outliers?

STAT CALC, select 8:LinReg(a+bx), ENTER.

Add the names of your x and y lists (2nd STAT), separated by a comma. Then hit ENTER.

You will see lots of numbers. For now we will use only r.

What does it mean? Let’s find out!


Correlation Properties

The sign of a correlation coefficient gives the direction of the association.

Correlation is always between –1 and +1.

Correlation can be exactly equal to –1 or +1, but these values are unusual in real data because they mean that all the data points fall exactly on a single straight line.

A correlation near zero corresponds to a weak linear association.



Correlation Properties (cont.)

Correlation treats x and y symmetrically: The correlation of x with y is the same as the correlation of y with x.

Correlation has no units.

Correlation is not affected by changes in the center or scale of either variable. Correlation depends only on the z-scores, and they are unaffected by changes in center or scale.
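These properties can be checked numerically. The sketch below uses hypothetical heights and weights (not the class data) and our own helper `pearson_r`; it converts inches to centimeters and pounds to kilograms and swaps the roles of x and y.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical heights (inches) and weights (pounds).
heights_in = [62, 65, 67, 70, 72]
weights_lb = [120, 140, 150, 165, 190]

# Change of scale: 1 inch = 2.54 cm, 1 pound = 0.45359237 kg.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_original = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)
r_swapped = pearson_r(weights_lb, heights_in)
print(round(r_original, 6), round(r_metric, 6), round(r_swapped, 6))
```

All three values agree up to floating-point rounding.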


Correlation Properties (cont.)

Correlation measures the strength of the linear association between the two variables.

Variables can have a strong association but still have a small correlation if the association isn’t linear.

Correlation is sensitive to outliers. A single outlying value can make a small correlation large or make a large one small.


Correlation ≠ Causation

Whenever we have a strong correlation, it is tempting to explain it by imagining that the predictor variable has caused the response to change.

Scatterplots and correlation coefficients never prove causation.

A hidden variable that stands behind a relationship and determines it by simultaneously affecting the other two variables is called a lurking variable.


Correlation Tables

It is common in some fields to compute the correlations between each pair of variables in a collection of variables and arrange these correlations in a table.


Straightening Scatterplots

Straight line relationships are the ones that we can measure with correlation.

When a scatterplot shows a bent form that consistently increases or decreases, we can often straighten the form of the plot by re-expressing one or both variables.


What Can Go Wrong?

Don’t say “correlation” when you mean “association.”

More often than not, people say correlation when they mean association.

The word “correlation” should be reserved for measuring the strength and direction of the linear relationship between two quantitative variables.


What Can Go Wrong?

Don’t correlate categorical variables. Be sure to check the Quantitative Variables Condition.

Don’t confuse “correlation” with “causation.”

Scatterplots and correlations never demonstrate causation.

These statistical tools can only demonstrate an association between variables.


What Can Go Wrong? (cont.)

Be sure the association is linear.

There may be a strong association between two variables that have a nonlinear association.
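A small made-up example shows how a perfect but nonlinear association can still have a correlation of zero. Here y is exactly x squared; the helper `pearson_r` is our own.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# y is completely determined by x, but the pattern is a curve, not a line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r = pearson_r(xs, ys)
print(r)
```

The rising and falling halves of the curve cancel each other, so r says nothing about how tightly y follows x here.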



Don’t assume the relationship is linear just because the correlation coefficient is high.

Here the correlation is 0.979, but the relationship is actually bent.



Beware of outliers.

Even a single outlier can dominate the correlation value.

Make sure to check the Outlier Condition.


3.3 Least Squares Regression



Fat Versus Protein: An Example

The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu with a correlation of 0.83:


The Linear Model

The linear model (line of best fit, least squares line, regression line) is just an equation of a straight line through the data to show us how the values are associated.

Using this line we will be able to predict values. Predicted values are denoted as $\hat{y}$ (also called y-hat).

The hat tells you they are predicted values.

The difference between the observed value and the predicted value is called the residual.

residual = observed – predicted = y – y(hat)


A negative residual means the predicted value’s too big (an overestimate).

A positive residual means the predicted value’s too small (an underestimate).

In the figure, the estimated fat of the BK Broiler chicken sandwich is 36 g, while the true value of fat is 25 g, the residual=?
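The residual for the BK Broiler can be worked out from the two values given on the slide. A short Python sketch (variable names are ours):

```python
observed_fat = 25   # grams, the true value from the slide
predicted_fat = 36  # grams, the value estimated by the line

# residual = observed - predicted
residual = observed_fat - predicted_fat
print(residual)  # -11
```

The residual is negative, so the line overestimates the fat content of this sandwich by 11 g.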


“Best Fit” Means Least Squares

Some residuals are positive, others are negative, and, on average, they cancel each other out. To calculate how well the line fits the data we square the residuals (to eliminate the negatives) then find the sum of the squares.

The smaller the sum, the better the fit. That is why another name is least squares line.
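To make "smaller sum of squares means better fit" concrete, the sketch below compares two candidate lines on a tiny hypothetical data set. The data and the function name `sum_squared_residuals` are ours; for these four points the least squares line happens to be y-hat = 0.5 + 1.4x.

```python
def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of (observed - predicted)^2 for the line y-hat = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

# A line drawn by eye: y-hat = 1 + 1x.
sse_guess = sum_squared_residuals(xs, ys, intercept=1.0, slope=1.0)

# The least squares line for this data: y-hat = 0.5 + 1.4x.
sse_least_squares = sum_squared_residuals(xs, ys, intercept=0.5, slope=1.4)

print(round(sse_guess, 3), round(sse_least_squares, 3))
```

No other straight line gives a smaller sum for these points than the least squares line.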


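If the variables are standardized (z-scores, measured in standard deviations):

The equation of the line of best fit is:

$\hat{z}_y = r \cdot z_x$

That is, predicted y = correlation × x, with both measured in standard deviations from their means.

Correlation (also called r) is the same for x and y because it is standardized. Therefore:

$\hat{z}_x = r \cdot z_y$

That is, predicted x = correlation × y, again in standard deviations.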


Example: A scatterplot of house prices vs. house size for houses shows a relationship that is straight, with only moderate scatter and no outliers. The correlation between house price and house size is 0.77.

a. You go to an open house and find the house is 1 standard deviation above the mean in size. What would you guess about its price?

b. You read an ad for a house priced 2 standard deviations below the mean. What would you guess about its size?

c. A friend tells you about a house whose size in square meters (he’s European) is 1.5 standard deviations above the mean. What would you guess about its size in square feet?
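Parts a and b can be sketched with the standardized prediction rule (predicted z-score = r × known z-score). The function name `predict_z` is ours.

```python
r = 0.77  # correlation between house price and house size, from the example

def predict_z(r, z_known):
    """Predicted z-score of one variable from the z-score of the other."""
    return r * z_known

# a. Size is 1 SD above the mean: predict the price in SDs from its mean.
price_z = predict_z(r, 1.0)

# b. Price is 2 SDs below the mean: predict the size in SDs from its mean.
size_z = predict_z(r, -2.0)

print(round(price_z, 2), round(size_z, 2))  # 0.77 -1.54
```

Part c needs no prediction at all: converting square meters to square feet is only a change of scale, so the house is still 1.5 standard deviations above the mean in size.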


Sometimes we are given the regression line in REAL UNITS!!!

The regression line for the Burger King data fits the data well: The equation is

Example: What is predicted fat content for a BK Broiler chicken sandwich (with 30 g of protein)?


To find the regression line (in real units):

You may be given the standard deviations, correlation and means.

OR …You may be given raw data.


First make sure a regression is appropriate: Since regression and correlation are closely related, we need to check the same conditions for regressions as we did for correlations:

Quantitative Variables Condition

Straight Enough Condition (look at scatterplot)

Outlier Condition (look at scatterplot)


To create the Regression Line in Real Units given the standard deviations, correlation (r), and means:

You know the equation of a line: $y = mx + b$

In Statistics we use a slightly different notation: $\hat{y} = b_0 + b_1 x$

We write b1 and b0 for the slope and intercept of the line. (slope is always in units of y per unit of x)

$b_1 = r \frac{s_y}{s_x}$

$b_0 = \bar{y} - b_1 \bar{x}$
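The two formulas, slope b1 = r · s_y / s_x and intercept b0 = (mean of y) − b1 · (mean of x), can be sketched as a small Python function. The summary statistics below are hypothetical round numbers chosen for easy checking, and the function name is ours.

```python
def regression_from_summary(r, mean_x, sd_x, mean_y, sd_y):
    """Return (b0, b1) for the line y-hat = b0 + b1 * x."""
    b1 = r * sd_y / sd_x        # slope, in units of y per unit of x
    b0 = mean_y - b1 * mean_x   # intercept: the line passes through (mean_x, mean_y)
    return b0, b1

# Hypothetical summary statistics.
b0, b1 = regression_from_summary(r=0.5, mean_x=10, sd_x=2, mean_y=20, sd_y=4)
print(b0, b1)  # 10.0 1.0
```

With these numbers the line is y-hat = 10 + 1x, and plugging in x = 10 returns the mean of y, 20.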


To find a regression line (linear model) with raw data:

Use your calculator! First, be sure to check:

Quantitative

Linear (scatterplot)

No outliers (scatterplot)

If it is not quantitative, not linear or it has outliers, you will NOT be able to model the data with a linear model.


TI Tips: Equation of the Regression Line

STAT, CALC, Choose LinReg(a + bx) (Not the first one … the second one … scroll down!)

Specify that x and y are YR and TUIT (we put these in our calculator before.)


TI Tips: Equation of the Regression Line Graphed on the Scatterplot

STAT, CALC, Choose LinReg(a + bx) (Not the first one … the second one … scroll down!)

Specify that x and y are YR and TUIT (we put these in our calculator before.)

We want the screen to say: LinReg(a+bx) YR, TUIT, Y1 (this will send the equation to Y1 and then we will see it on our graph)

To add Y1 to the end: VARS, Y-VARS, 1:Function and choose Y1

ENTER. See the equation. It has also been placed in Y1. Hit GRAPH.
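For readers without a TI calculator, the same least squares line can be sketched in Python. We assume here that YR and TUIT are the two columns entered in the earlier scatterplot tip (years 0 to 10 and the eleven values starting at 6546); the function name `least_squares` is ours.

```python
# Assumed to be the YR and TUIT lists from the scatterplot tip.
years = list(range(11))
tuition = [6546, 6996, 6996, 7350, 7500, 7978, 8377, 8710, 9110, 9411, 9800]

def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope (equivalent to r * s_y / s_x)
    a = mean_y - b * mean_x  # intercept
    return a, b

a, b = least_squares(years, tuition)
print(f"y-hat = {a:.2f} + {b:.2f} * x")
```

For these values the line works out to roughly y-hat = 6440 + 326x, which is what LinReg(a+bx) should report as a and b.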


Example: Using the relationship between house price (in thousands of dollars) and house size (in thousands of square feet) the regression model is:

$\widehat{Price} = -3.117 + 94.454 \cdot Size$

a. What is the slope and what does it mean?

b. What are the units of the slope?

c. Your house is 2000 square feet bigger than your neighbor’s house. How much more do you expect it to be worth?

d. Is the y-intercept of -3.117 meaningful? Explain.
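Part c only needs the slope, since the intercept cancels when two predictions are subtracted. A short sketch (variable names are ours; the model is predicted price = -3.117 + 94.454 × size):

```python
slope = 94.454  # thousands of dollars per thousand square feet

extra_size = 2000 / 1000            # 2000 sq ft expressed in thousands of sq ft
extra_price = slope * extra_size    # expected difference, in thousands of dollars
print(round(extra_price, 3))  # 188.908
```

So the model predicts the larger house to be worth about $188,908 more.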


Example: The linear model relating hurricanes’ wind speeds to their central pressures was:

Predicted MaxWindSpeed = 955.27-(.897)CentralPressure

Hurricane Katrina had a central pressure measured at 920 millibars. What does our regression model predict for her maximum wind speed? How good is that prediction, given that Katrina’s actual wind speed was measured at 110 knots?
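The Katrina prediction and its residual can be computed directly from the model on the slide. A minimal Python sketch (names are ours):

```python
def predicted_max_wind_speed(central_pressure):
    """Model from the slide: 955.27 - 0.897 * CentralPressure (knots)."""
    return 955.27 - 0.897 * central_pressure

predicted = predicted_max_wind_speed(920)  # Katrina: 920 millibars
residual = 110 - predicted                 # observed - predicted
print(round(predicted, 2), round(residual, 2))  # 130.03 -20.03
```

The model predicts about 130 knots, roughly 20 knots above the measured 110 knots, so the residual is negative (an overestimate).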


More about Residuals

A scatterplot of the residuals should look completely random! It should show no bends and should have no outliers.


Draw examples of a residual graph that is not random.


TI Tips – Residual Plots

You look at the scatterplot to make sure it is linear. Sometimes it is hard to tell. After you do a regression do a residual plot. If the residual plot is completely random, you know your scatterplot was linear.

The calculator automatically stores the residuals in a list named RESID after you run a regression. To look at them … STAT EDIT cursor over to RESID.

To create the residual plot … STAT PLOT, Plot2, Xlist:YR and Ylist: RESID

Y= may still have your regression line in it. You can either turn it off or remove it.

ZoomStat. Do you see a curve?


Example: Our linear model for homes uses the model: predicted price = -3.117 + (94.454)(size)

a. Would you prefer to find a home with a negative or a positive residual? Explain.

b. You plan to look for a home of about 3000 square feet. How much should you expect to have to pay?

c. You find a nice home that size selling for $300,000. What’s the residual?
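Parts b and c can be sketched as follows, remembering that the model measures size in thousands of square feet and price in thousands of dollars (function and variable names are ours):

```python
def predicted_price(size):
    """Model from the slide; size in thousands of sq ft, price in thousands of dollars."""
    return -3.117 + 94.454 * size

# b. A home of about 3000 square feet is size = 3.
expected = predicted_price(3)

# c. The home sells for $300,000, i.e. 300 in thousands of dollars.
residual = 300 - expected  # observed - predicted

print(round(expected, 3), round(residual, 3))  # 280.245 19.755
```

The expected price is about $280,245, and the $300,000 home has a positive residual of about $19,755, meaning it costs more than the model predicts.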


The Residual Standard Deviation

The standard deviation of the residuals, se, measures how much the points spread around the regression line.

How to read se: “Errors in predictions based on this model have a standard deviation of se (in y units).”

We estimate the SD of the residuals using:

$s_e = \sqrt{\frac{\sum e^2}{n - 2}}$
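The estimate s_e = sqrt(sum of e² / (n − 2)) can be sketched on a tiny hypothetical data set. The data and helper names below are ours, chosen only for illustration.

```python
from math import sqrt

def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

# Hypothetical data.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

b0, b1 = least_squares(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Divide by n - 2, not n - 1, when estimating the SD of the residuals.
s_e = sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))
print(round(s_e, 4))
```

The result is in the same units as y, so it can be read as a typical size of a prediction error.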


R2—The Variation Accounted For (cont.)

All regression analyses include this statistic, although by tradition, it is written R2 (pronounced “R-squared”). An R2 of 0 means that none of the variance in the data is in the model; all of it is still in the residuals.

When interpreting a regression model you need to tell what R2 means: “The % of variability in y that is explained by x is R2.”

R2 is always between 0% and 100%. What makes a “good” R2 value depends on the kind of data you are analyzing and on what you want to do with it.

Always report slope and intercept for a regression and R2 so that readers can judge for themselves how successful the regression is at fitting the data.


Assumptions and Conditions

Quantitative Variables Condition: Regression can only be done on two quantitative variables (and not two categorical variables).

Straight Enough Condition: The linear model assumes that the relationship between the variables is linear. (check by scatterplot)


Assumptions and Conditions (cont.)

If the scatterplot is not straight enough, stop here. You can only use a linear model on two variables that are related linearly!

Some nonlinear relationships can be saved by re-expressing the data to make the scatterplot more linear.



It’s a good idea to check linearity again after computing the regression when we can examine the residuals.

Does the Plot Thicken? Condition: Residual plots should be scattered.

Don’t confuse this with Normal Probability Plots from unit one, which should be a straight line (to see if the data follow a normal curve).



Outlier Condition:

Watch out for outliers.

Outlying points can dramatically change a regression model.

Outliers can even change the sign of the slope, misleading us about the underlying relationship between the variables.


What Can Go Wrong?

Don’t fit a straight line to a nonlinear relationship.

Beware extraordinary points (y-values that stand off from the linear pattern or extreme x-values).

Don’t extrapolate beyond the data; the linear model may no longer hold outside of the range of the data.

Don’t infer that x causes y just because there is a good linear model for their relationship—association is not causation.

Don’t choose a model based on R2 alone.


A few IMPORTANT things to remember:

“The percentage of variability in y that is explained by x is: r2” (an example of this will be homework problem #7)

Correlation = r = +/- square root of r2 (you need to decide if it is + or – for a positive or negative correlation)

residual = observed – predicted = y – y(hat)

R2 tells you how well the actual data fits the model (1 is perfect, zero is no correlation)

1 – r2 is the fraction of the original variance left in the residuals

Be careful not to use a regression to extrapolate (predict values beyond the scope/time frame of the model)
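The link between r, R2 and the variance left in the residuals can be checked numerically. The sketch below uses a small hypothetical data set and our own variable names; it computes R2 as 1 minus (residual variation / total variation) and compares it with the square of r.

```python
from math import sqrt

# Hypothetical data.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)  # total variation in y

r = sxy / sqrt(sxx * syy)  # correlation
b1 = sxy / sxx             # least squares slope
b0 = mean_y - b1 * mean_x  # least squares intercept

# Variation left over in the residuals after fitting the line.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - sse / syy
print(round(r, 4), round(r_squared, 4), round(sse / syy, 4))
```

Here R2 comes out to 0.98, which matches r squared, and the remaining 0.02 is the fraction of the original variation still in the residuals. The sign of r has to come from the direction of the association (positive in this example).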

Date post:	31-Dec-2015
Category:	Documents
Upload:	nelson-lynch