Chapter 3: Examining Relationshipsdavidsongstat.weebly.com/uploads/2/2/1/1/2211249/chapter...Chapter...

Chapter 3: Examining Relationships

Introduction: Most statistical studies involve more than one variable. In Chapters 1 and 2 we used boxplots, stem-and-leaf plots, and histograms to analyze one-variable distributions. In this chapter we will examine studies with two or more variables.

Exploration: SAT Activity - Verbal vs. Math Scores. Consider the following: Is there an obvious pattern? Can you describe the pattern? Can you see any obvious association between SAT Math and SAT Verbal?

I. Variables –

A. Response – measures an outcome of a study

B. Explanatory – attempts to explain the observed outcome

Example 3.1 Effect of Alcohol on Body Temperature

Alcohol has many effects on the body. One effect is a drop in body

temperature. To study this effect, researchers give several different amounts of

alcohol to mice, then measure the change in each mouse’s body temperature in

the 15 minutes after taking the alcohol.

Response Variable – change in body temperature

Explanatory Variable – amount of alcohol given

Note: When attempting to predict future outcomes, we are interested in

the response variable.

• In Example 3.1 alcohol actually causes a change in body

temperature. Sometimes there is no cause-and-effect

relationship between variables. As long as scores are closely

related, we can use one to predict the other.

• Correlation and Causation

Exercise 3.1 Explanatory and Response Variables

In each of the following situations, is it more reasonable to simply explore the

relationship between the two variables or to view one of the variables as an

explanatory variable and the other as a response variable? In the latter case,

which is the explanatory variable and which is the response variable?

(a) The amount of time spent studying for a stat test and the grade on the test

(b) The weight and height of a person

(c) The amount of yearly rainfall and the yield of a crop

(d) A student’s grades in statistics and French

(e) The occupational class of a father and of a son

• (a) Time spent studying is explanatory; the grade is the response variable.


• (b) Explore the relationship; there is no reason to

view one or the other as explanatory.

• (c) Rainfall is explanatory; crop yield is the response

variable.

• (d) Explore the relationship.

• (e) The father’s class is explanatory; the son’s class is the response variable.


Exercise 3.2 Quantitative and Categorical Variables

How well does a child’s height at age 6 predict height at

age 16? To find out, measure the heights of a large

group of children at age 6, wait until they reach age 16,

then measure their heights again. What are the

explanatory and response variables here? Are these

variables categorical or quantitative?

Height at age six is explanatory, and height at age 16 is the

response variable. Both are quantitative.

3.1 Scatterplots

Scatterplots show the relationship between two quantitative variables measured on the same

individuals.

Exercise 3.7 Are Jet Skis Dangerous?

An article in the August 1997 issue of the Journal of the American Medical Association

reported on a survey that tracked emergency room visits at randomly selected hospitals

nationwide. Here are data on the number of jet skis in use, the number of accidents, and

the number of fatalities for the years 1987 – 1996:

Year    Number in use    Accidents    Fatalities

1987           92,756          376             5
1988          126,881          650            20
1989          178,510          844            20
1990          241,376        1,162            28
1991          305,915        1,513            26
1992          372,283        1,650            34
1993          454,545        2,236            35
1994          600,000        3,002            56
1995          760,000        4,028            68
1996          900,000        4,010            55

Interpreting Scatterplots

Look for an overall pattern and for striking deviations from that

pattern.

In particular, look for the direction, form, and strength of the relationship

between the two variables.

a) direction – negative, positive

b) form – clustering of data, linear, quadratic, etc.

c) strength – how closely the points follow a clear form

Positive correlation (direction) – when above average values of one

variable tend to accompany above average values of the other variable.

Also, the same for below average.

Negative correlation (direction) – when above average values of

one variable tend to accompany below average values of the other

variable.
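The definitions of positive and negative direction can be checked informally in Python. The sketch below is an illustration only, not part of the original notes: it uses the jet ski data from Exercise 3.7 (number in use vs. accidents), and the helper name `count_sides` is hypothetical. It counts how many points fall on the same side of both averages and how many fall on opposite sides.

```python
# Jet ski data from Exercise 3.7 (1987-1996).
in_use = [92756, 126881, 178510, 241376, 305915,
          372283, 454545, 600000, 760000, 900000]
accidents = [376, 650, 844, 1162, 1513, 1650, 2236, 3002, 4028, 4010]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def count_sides(x, y):
    """Count points on the same side vs. opposite sides of the two means."""
    mx, my = mean(x), mean(y)
    # A positive product means both values are above, or both below, average.
    same = sum(1 for xi, yi in zip(x, y) if (xi - mx) * (yi - my) > 0)
    # A negative product means one value is above and the other below average.
    opposite = sum(1 for xi, yi in zip(x, y) if (xi - mx) * (yi - my) < 0)
    return same, opposite


same_side, opposite_side = count_sides(in_use, accidents)
print("points on the same side of both means:", same_side)
print("points on opposite sides of the means:", opposite_side)
```

When most points land on the same side of both averages, the direction is positive; when most land on opposite sides, it is negative. This count is only a rough check of direction and says nothing about form or strength.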

Note: The more students who took the test, the lower the state average; this is the “open enrollment” concept.

No correlation

Linear Relationship – when points lie in a straight line pattern.

Outliers – in this case, deviations from the overall scatterplot patterns.

III. Drawing Scatterplots –

Step 1. Scale the horizontal and vertical axes. The intervals on each axis MUST be

uniform (equally spaced).

Step 2. Label both axes.

Step 3. Adopt a scale that uses the whole grid.

Step 4. Add a categorical variable by using a different plotting symbol

or color.

A scatterplot with categorical data included.

III. Example 3.4 pg 127: Heating Degree-Days

end Sec. 3.1

Sec. 3.2 Correlation ( r ) - is a measure of the direction and strength

of the linear relationship between two

quantitative variables.

I. Characteristics of “r”-

A. Indicates the direction of a linear relationship by its sign

B. Always satisfies $-1 \le r \le 1$

C. Ignores the distinction between explanatory and response

variables

D. Requires that both variables be quantitative

E. Is not resistant*

* In this respect, correlation joins the mean and standard deviation, which are also not resistant to outliers.
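The characteristics listed above can be illustrated with a short Python sketch. It assumes the usual textbook formula, r = (1/(n - 1)) times the sum of the products of standardized x and y values, which these notes do not state explicitly. The data are the jet ski figures from Exercise 3.7, and the extra point used to show non-resistance is hypothetical.

```python
from math import sqrt

# Jet ski data from Exercise 3.7 (1987-1996).
in_use = [92756, 126881, 178510, 241376, 305915,
          372283, 454545, 600000, 760000, 900000]
accidents = [376, 650, 844, 1162, 1513, 1650, 2236, 3002, 4028, 4010]


def mean(values):
    return sum(values) / len(values)


def std_dev(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(x, y):
    """r as the averaged product of standardized values (divisor n - 1)."""
    mx, my = mean(x), mean(y)
    sx, sy = std_dev(x), std_dev(y)
    products = (((xi - mx) / sx) * ((yi - my) / sy) for xi, yi in zip(x, y))
    return sum(products) / (len(x) - 1)


r = correlation(in_use, accidents)

# C. Swapping the roles of x and y does not change r.
r_swapped = correlation(accidents, in_use)

# Changing units (jet skis in thousands) does not change r either.
r_rescaled = correlation([v / 1000 for v in in_use], accidents)

# E. One extreme, made-up point (hypothetical, not from the survey) moves r.
r_with_outlier = correlation(in_use + [1_000_000], accidents + [0])

print(f"r = {r:.3f}")
print(f"r with x and y swapped = {r_swapped:.3f}")
print(f"r with x in thousands = {r_rescaled:.3f}")
print(f"r after adding one outlier = {r_with_outlier:.3f}")
```

The last line is the point of item E: a single unusual observation is enough to pull r away from the value the other ten points suggest.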

II. Graphical Interpretations of “r”

II. Least-Squares Regression-

Summary:

The method of least squares is a standard approach to the approximate

solution of overdetermined systems (sets of equations in which there are

more equations than unknowns). "Least-squares" means that the overall

solution minimizes the sum of the squares of the errors made in solving

every single equation.

The most important application is in data fitting. The best fit in the

least-squares sense minimizes the sum of squared residuals, a residual being

the difference between an observed value and the fitted value provided by a

model.

A regression line is a straight line that describes how a response

variable y changes as an explanatory variable x changes. A

regression line is often used to predict the value of y for a given

value of x. Regression, unlike correlation, requires that we have an

explanatory variable and a response variable.

III. Recall: Correlation (r) makes no use of the distinction between explanatory

and response variables. That is, it is not necessary that we define one of the

quantitative variables as a response variable and the second as the explanatory

variable. For the calculation of r, it does not matter which one

is the y (D.V.) and which one is x (I.V.).

In least-squares regression we are required to designate

a response variable (y) and an explanatory variable (x).

$\hat{y} = a + bx$

IV. Regression line – a line that describes how a response variable (y)

changes as an explanatory variable (x) changes.

A. Regression lines are often used to predict the value of “y” for a

given value of “x”.

equation of the regression line:

$\hat{y} = a + bx$

where

$b = r \dfrac{s_y}{s_x}$ is the slope,

$a = \bar{y} - b\bar{x}$ is the intercept, and

$\hat{y}$ is the predicted rather than an actual "y" for any "x".

“y” is the observed value; $\hat{y}$ is the predicted value of “y”.

B. Least-squares regression line of “y” on “x” is the line that makes

the sum of the squares of the vertical distances (or, “y’s”) of the

data points from the line as small as possible. This minimizes the

error in predicting y.

$\hat{y} = a + bx$

C. Prediction errors are errors in “y”, the vertical direction in the

scatterplot, and are calculated as

error = observed value - predicted value
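The idea of making the sum of squared prediction errors as small as possible can be sketched in Python. This is an illustration under stated assumptions, not the notes' own method: it fits a line to the jet ski data from Exercise 3.7 (accidents on number in use), and the "alternative" lines are hypothetical nudges of the fitted intercept and slope chosen only for comparison.

```python
# Jet ski data from Exercise 3.7 (1987-1996).
in_use = [92756, 126881, 178510, 241376, 305915,
          372283, 454545, 600000, 760000, 900000]
accidents = [376, 650, 844, 1162, 1513, 1650, 2236, 3002, 4028, 4010]


def mean(values):
    return sum(values) / len(values)


def fit_least_squares(x, y):
    """Return (intercept a, slope b) of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx        # algebraically the same as r * s_y / s_x
    a = my - b * mx      # forces the line through (x-bar, y-bar)
    return a, b


def sum_squared_errors(x, y, a, b):
    """Sum of (observed value - predicted value)^2 for the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a, b = fit_least_squares(in_use, accidents)
best = sum_squared_errors(in_use, accidents, a, b)

# Hypothetical alternative lines, nudged away from the fitted one.
alternatives = [(a + 100, b), (a - 100, b), (a, b * 1.05), (a, b * 0.95)]
others = [sum_squared_errors(in_use, accidents, a2, b2)
          for a2, b2 in alternatives]

print(f"least-squares line: y-hat = {a:.1f} + {b:.5f} x")
print(f"sum of squared errors for that line: {best:.0f}")
for (a2, b2), sse in zip(alternatives, others):
    print(f"  alternative a = {a2:.1f}, b = {b2:.5f}: {sse:.0f}")
```

Every alternative line has a larger sum of squared errors than the fitted one, which is what "least squares" means.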

D. To determine the equation of a least-squares line, (1) solve for the intercept

“a” and the slope “b”; thus we have two unknowns. However, it can be

shown that every least-squares regression line passes through the point

$(\bar{x}, \bar{y})$. Next, (2) the slope of the least-squares line is equal to the product

of the correlation (r) and the quotient of the s.d.’s:

$b = r \dfrac{s_y}{s_x}$

E. Constructing the least-squares equation-

Given some explanatory and response variable with the following stat

summary: $\bar{x} = 17.222$, $\bar{y} = 161.111$, $s_x = 19.696$, $s_y = 33.479$, $r = 0.997$

Even though we don’t know the actual data we can still construct the

equation and use it to make predictions. The slope and intercept can be

calculated as

$b = r \dfrac{s_y}{s_x} = 0.997 \times \dfrac{33.479}{19.696} = 1.695$

$a = \bar{y} - b\bar{x} = 161.111 - 1.695 \times 17.222 = 131.920$

So that the least-squares line has equation $\hat{y} = 131.920 + 1.695x$
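The same construction can be sketched in Python using only the five summary statistics given in part E. The function name `predict` and the example input of 20 are hypothetical, added for illustration. Because the code carries the unrounded slope into the intercept, the intercept may differ from the hand calculation in the third decimal place.

```python
# Summary statistics from part E (the raw data are not needed).
x_bar, y_bar = 17.222, 161.111
s_x, s_y = 19.696, 33.479
r = 0.997

# Slope: correlation times the ratio of standard deviations.
b = r * s_y / s_x

# Intercept: chosen so the line passes through (x_bar, y_bar).
a = y_bar - b * x_bar


def predict(x):
    """Predicted response y-hat for a given explanatory value x."""
    return a + b * x


print(f"slope b = {b:.3f}")
print(f"intercept a = {a:.3f}")
print(f"prediction at x = 20: {predict(20):.1f}")
```

Calling `predict(x_bar)` returns `y_bar` (up to floating-point rounding), which reflects the fact that the least-squares line passes through the point of means.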

IV. The Role of $r^2$ in Regression Analysis

We know that the correlation r is the slope of the least-squares regression

line when we measure both x and y in standardized units. The square of

the correlation, $r^2$, is the fraction of the variance of one variable that is

explained by least-squares regression on the other variable.

When you report a regression, give r-square as a measure of how

successful the regression was in explaining the response.

While correlation coefficients are normally reported as r = (a value

between -1 and +1), squaring them makes them easier to understand. The

square of the coefficient (or r-square), expressed as a percent, is the share of the

variation in one variable that is related to the variation in the other.

An r of .5 means 25% of the

variation is related (.5 squared = .25). An r value of .7 means 49% of

the variation is related (.7 squared = .49).
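The interpretation of r-square as a fraction of variation can be checked numerically. The sketch below is an illustration, not part of the original notes: it uses the jet ski data from Exercise 3.7 and compares r squared with 1 minus (sum of squared residuals) / (total variation in y), a standard identity for least-squares lines.

```python
from math import sqrt

# Jet ski data from Exercise 3.7 (1987-1996).
in_use = [92756, 126881, 178510, 241376, 305915,
          372283, 454545, 600000, 760000, 900000]
accidents = [376, 650, 844, 1162, 1513, 1650, 2236, 3002, 4028, 4010]


def mean(values):
    return sum(values) / len(values)


mx, my = mean(in_use), mean(accidents)
sxx = sum((xi - mx) ** 2 for xi in in_use)
syy = sum((yi - my) ** 2 for yi in accidents)   # total variation in y
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(in_use, accidents))

r = sxy / sqrt(sxx * syy)       # correlation
b = sxy / sxx                   # least-squares slope
a = my - b * mx                 # least-squares intercept

# Variation left over after the regression (sum of squared residuals).
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(in_use, accidents))
explained = 1 - sse / syy       # fraction of variation explained

print(f"r = {r:.3f}, r^2 = {r * r:.3f}, explained fraction = {explained:.3f}")

# The two numerical examples from the text.
for r_example in (0.5, 0.7):
    print(f"r = {r_example} -> {r_example ** 2:.0%} of the variation")
```

The two quantities agree up to floating-point rounding, which is why r-square is reported as the share of variation in y accounted for by the regression on x.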

V. Summary of Least-Squares Regression-

A. The distinction between explanatory and response variables is essential

in regression.

B. There is a close relationship between correlation and the slope of the

least-squares line. The slope is

$b = r \dfrac{s_y}{s_x}$

This means that along the regression line, a change of one s.d. in x corresponds to

a change of r standard deviations in y.

C. The LSRL always passes through the point $(\bar{x}, \bar{y})$ on the graph of y

against x.

D. The correlation describes the strength of the straight-line relationship. In

the regression setting, this description takes a specific form: the square of

the correlation is the fraction of the variation in the values of y that is

explained by the least-squares regression of y on x.
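Summary points B and C can be verified with a small Python sketch. It is an illustration under stated assumptions: the data are the jet ski figures from Exercise 3.7, r is computed from standardized values (the usual textbook formula, not spelled out in these notes), and the variable names are hypothetical.

```python
from math import sqrt

# Jet ski data from Exercise 3.7 (1987-1996).
in_use = [92756, 126881, 178510, 241376, 305915,
          372283, 454545, 600000, 760000, 900000]
accidents = [376, 650, 844, 1162, 1513, 1650, 2236, 3002, 4028, 4010]


def mean(values):
    return sum(values) / len(values)


def std_dev(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


x_bar, y_bar = mean(in_use), mean(accidents)
s_x, s_y = std_dev(in_use), std_dev(accidents)

# Correlation from standardized values, computed independently of the slope.
n = len(in_use)
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(in_use, accidents)) / (n - 1)

# Least-squares slope and intercept from the summary formulas.
b = r * s_y / s_x
a = y_bar - b * x_bar

# C. The line passes through the point of means.
predicted_at_x_bar = a + b * x_bar

# B. Moving one s.d. in x changes y-hat by r standard deviations of y.
change_in_y_hat = (a + b * (x_bar + s_x)) - (a + b * x_bar)
change_in_sds = change_in_y_hat / s_y

print(f"y-hat at x-bar = {predicted_at_x_bar:.1f}, y-bar = {y_bar:.1f}")
print(f"change in y-hat per s.d. of x = {change_in_sds:.3f} s.d. of y")
print(f"r = {r:.3f}")
```

Since r is never larger than 1 in absolute value, the predicted response moves by at most one standard deviation of y for each standard deviation of x.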

VI. Residuals - a residual is the difference between an observed value of the

response variable and the value predicted by the regression line. That is,

residual = observed y – predicted y = $y - \hat{y}$

A. A residual plot is a scatterplot of the regression residuals against the explanatory variable. These plots assist us in assessing the fit of a regression line.

B. Residual example:

A study was conducted to test whether the age at which a child begins

to talk can be reliably used to predict later scores on a test (taken much

later in years) of mental ability—in this instance, the “Gesell Adaptive

Score” test. The data appear on the next slide.

Scatterplot of Gesell Adaptive Scores

vs the age at first word for 21 kids.

The line is the LSRL for predicting

Gesell score from age at first word.

The residual plot for the regression

shows how far the data fall from the

regression line. Child 19 is an outlier

and child 18 is an “influential”

observation that does not have a large

residual.

The plot shows a negative relationship that is

moderately linear, with r = -.640

Uniform scatter indicates that the

regression line fits the data well.

The curved pattern means that a

straight line is an inappropriate

model.

The response variable y has more

spread for larger values of the

explanatory variable x, so the

prediction will be less accurate

when x is larger.

C. The residuals from the least-squares line have the special property:

the mean of the least-squares residuals is always zero.
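The zero-mean property of least-squares residuals can be checked with a short Python sketch. The Gesell data are not reproduced in these notes, so the sketch uses the jet ski data from Exercise 3.7 instead; that substitution is an assumption made only for illustration.

```python
# Jet ski data from Exercise 3.7 (1987-1996).
years = list(range(1987, 1997))
in_use = [92756, 126881, 178510, 241376, 305915,
          372283, 454545, 600000, 760000, 900000]
accidents = [376, 650, 844, 1162, 1513, 1650, 2236, 3002, 4028, 4010]


def mean(values):
    return sum(values) / len(values)


# Least-squares line of accidents on number in use.
mx, my = mean(in_use), mean(accidents)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(in_use, accidents))
sxx = sum((xi - mx) ** 2 for xi in in_use)
b = sxy / sxx
a = my - b * mx

# residual = observed y - predicted y
residuals = [yi - (a + b * xi) for xi, yi in zip(in_use, accidents)]
mean_residual = mean(residuals)

# These (x, residual) pairs are what a residual plot would display.
for year, xi, res in zip(years, in_use, residuals):
    print(f"{year}: x = {xi:>7}, residual = {res:8.1f}")
print(f"mean of residuals = {mean_residual:.6f}")
```

The printed mean is zero up to floating-point rounding, even though individual residuals are both positive and negative.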

VI. Influential Observations – an observation is influential for a statistical

calculation if removing it would markedly change the result of the calculation.

A. Outlier- is an observation that lies outside the overall pattern of other

observations.

B. Points that are outliers in the x direction of a scatterplot are often

influential for the LSRL but need not have large residuals.
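A small Python sketch can show what "influential" means in practice. All of the numbers below are hypothetical, invented for illustration only; they are not the Gesell data. Eight points follow a steep line, and one extra point sits far out in the x direction.

```python
def mean(values):
    return sum(values) / len(values)


def fit_least_squares(x, y):
    """Return (intercept a, slope b) of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
extra_x, extra_y = 20, 10.0     # an outlier in the x direction

a_without, b_without = fit_least_squares(x, y)
a_with, b_with = fit_least_squares(x + [extra_x], y + [extra_y])

# Residual of the extra point from the line fitted WITH it included.
residual_extra = extra_y - (a_with + b_with * extra_x)

# Largest residual (in absolute value) among the original eight points.
largest_other = max(abs(yi - (a_with + b_with * xi)) for xi, yi in zip(x, y))

print(f"slope without the extra point: {b_without:.3f}")
print(f"slope with the extra point:    {b_with:.3f}")
print(f"residual of the extra point:   {residual_extra:.2f}")
print(f"largest other residual:        {largest_other:.2f}")
```

Removing the one point changes the slope markedly, so it is influential. Yet its own residual is not the largest in the data set, because it pulls the line toward itself.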

end Sec. 3.3 and

Chpt. 3

Date post:	22-Mar-2021