Date post: | 29-Dec-2015 |
Category: |
Documents |
Upload: | alexia-morton |
BPS - 3rd Ed. Chapter 4 1
Chapter 4
Scatterplots and Correlation
Variable (X) and Variable (Y)
Prior chapters: one variable at a time
This chapter: relationship between two variables
One variable is an “outcome”: response variable (Y)
The other variable is a “predictor”: explanatory variable (X)
Are X and Y related? X → Y?
Question
A study investigates whether there is a relationship between gross domestic product and life expectancy:
Which is the explanatory variable (X)?
Which is the response variable (Y)?
All other variables that may influence life expectancy are “lurking” and may confound the relation between X and Y. Are there lurking variables in this analysis?
Scatterplot
This chapter considers the case in which both X and Y are quantitative variables
Bivariate data points (xi, yi) are plotted on graph paper to form a scatterplot
Example of a scatterplot
X = percent of students taking SAT
Y = mean SAT verbal score
What is the relationship between X and Y?
Interpreting scatterplots
Form: Can the data be described by a straight line? [Linearity]
Direction: Does the line slope upward or downward?
Positive association = above-average values of Y accompany above-average values of X (and vice versa)
Negative association = above-average values of Y accompany below-average values of X (and vice versa)
Strength: Do the data points adhere to an imaginary line?
Form [discuss]
[Figure: four scatterplots. Health Status Measure versus Income; Health Status Measure versus Age; Education Level versus Age; Mental Health Score versus Physical Health Score]
Strength and direction
Direction: positive, negative or flat
Strength: How closely does a non-horizontal straight line fit the points of a scatterplot?
Close fitting = strong
Loose fitting = weak
Strength cannot be reliably judged visually
These two scatterplots are of the same data (they have the exact same correlation)
The second scatter plot looks like a stronger correlation, but this is an artifact of the axis scaling
Correlation coefficient (r)
Let r denote the correlation coefficient
r is always between -1 and +1, inclusive
Sign of r denotes direction of association
Special values for r:
r = +1: all points on upward sloping line
r = -1: all points on downward sloping line
r = 0: no line or horizontal line
The closer r is to +1 or -1, the better the fit of points to the line
Examples of Correlations
Husband’s versus Wife’s ages: r = .94
Husband’s versus Wife’s heights: r = .36
Professional Golfer’s Putting Success (distance of putt in feet versus percent success): r = -.94
Correlation Coefficient r
Data on variables X and Y for n individuals: x1, x2, … , xn and y1, y2, … , yn
Each variable has a mean and std dev: $(\bar{x}, s_x)$ and $(\bar{y}, s_y)$ (see ch. 2)
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$
Correlation coefficient r
The formula for r can be understood by converting data points to standardized scores:
$$r = \frac{1}{n-1}\sum_{i=1}^{n} z_X z_Y$$
where
$$z_X = \frac{x_i-\bar{x}}{s_x}, \qquad z_Y = \frac{y_i-\bar{y}}{s_y}$$
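The standardized-score form of r translates almost line for line into code. The following is a minimal sketch in Python using only the standard library; the function name `correlation` and the small test data are illustrative assumptions, not part of the slides.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Return r as the sum of z-score products divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with n >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample std devs (n - 1 divisor)
    z_x = [(x - x_bar) / s_x for x in xs]  # standardized x values
    z_y = [(y - y_bar) / s_y for y in ys]  # standardized y values
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (len(xs) - 1)


# Hypothetical data: points lying exactly on a straight line.
xs = [1, 2, 3, 4, 5]
r_up = correlation(xs, [2 * x + 1 for x in xs])     # upward sloping line
r_down = correlation(xs, [10 - 3 * x for x in xs])  # downward sloping line
print(r_up, r_down)
```

For points that fall exactly on a sloped line, the result should be +1 or -1 up to floating-point rounding. If all x values (or all y values) are identical, the standard deviation is zero and this sketch raises a division error, since r cannot be computed from the formula in that case.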
Illustrative example (gdp_life.sav)
Per Capita Gross Domestic Product and Average Life Expectancy for Countries in Western Europe
Does GDP predict life expectancy?
Illustrative example (gdp_life.sav)
Country Per Capita GDP (X) Life Expectancy (Y)
Austria 21.4 77.48
Belgium 23.2 77.53
Finland 20.0 77.32
France 22.7 78.63
Germany 20.8 77.17
Ireland 18.6 76.39
Italy 21.5 78.51
Netherlands 22.0 78.15
Switzerland 23.8 78.99
United Kingdom 21.2 77.37
Illustrative example (gdp_life.sav): Scatterplot
[Figure: scatterplot of LIFE_EXP (vertical axis, 76.0 to 79.5) versus GDP (horizontal axis, 18 to 24)]
Illustrative example (gdp_life.sav)
Standardized scores: $z_x = (x_i - \bar{x})/s_x$ and $z_y = (y_i - \bar{y})/s_y$
x y z_x z_y z_x·z_y
21.4 77.48 -0.078 -0.345 0.027
23.2 77.53 1.097 -0.282 -0.309
20.0 77.32 -0.992 -0.546 0.542
22.7 78.63 0.770 1.102 0.849
20.8 77.17 -0.470 -0.735 0.345
18.6 76.39 -1.906 -1.716 3.271
21.5 78.51 -0.013 0.951 -0.012
22.0 78.15 0.313 0.498 0.156
23.8 78.99 1.489 1.555 2.315
21.2 77.37 -0.209 -0.483 0.101
$\bar{x} = 21.52$, $\bar{y} = 77.754$, sum of products = 7.285
$s_x = 1.532$, $s_y = 0.795$
Illustrative example (gdp_life.sav)
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{1}{10-1}(7.285) = 0.809$$
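The hand calculation for the GDP and life expectancy data can be checked step by step. This is a sketch in Python (standard library only) that uses the ten data pairs from the table; the variable names are illustrative choices. The printed values should agree with the slide figures up to rounding.

```python
from statistics import mean, stdev

# Data from the gdp_life.sav table: per capita GDP (X) and life expectancy (Y).
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

n = len(gdp)
x_bar, s_x = mean(gdp), stdev(gdp)    # mean and sample std dev of X
y_bar, s_y = mean(life), stdev(life)  # mean and sample std dev of Y

# One product of standardized scores per country.
products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(gdp, life)]
total = sum(products)  # the "sum" row of the worked table
r = total / (n - 1)    # divide by n - 1 = 9

print(round(x_bar, 2), round(s_x, 3))
print(round(y_bar, 3), round(s_y, 3))
print(round(total, 3), round(r, 3))
```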
Interpretation of r
Direction of association: positive or negative
Strength of association: the closer |r| is to 1, the stronger the correlation. Here are guidelines:
0.0 ≤ |r| < 0.3 weak correlation
0.3 ≤ |r| < 0.7 moderate correlation
0.7 ≤ |r| < 1.0 strong correlation
|r| = 1.0 perfect correlation
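These guidelines amount to a simple lookup on |r|. A small sketch in Python follows; the function name `describe_r` and the label "none" for r = 0 are assumptions made for illustration, while the cutoffs are the ones listed above.

```python
def describe_r(r):
    """Return (strength, direction) for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1, inclusive")
    size = abs(r)
    if size == 1.0:
        strength = "perfect"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.3:
        strength = "moderate"
    else:
        strength = "weak"
    # Direction comes from the sign of r; "none" is this sketch's label for r = 0.
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    return strength, direction


# Values of r quoted elsewhere in these slides.
for r in (0.94, 0.36, -0.94, 0.809, -0.06):
    print(r, describe_r(r))
```

The cutoffs are rules of thumb, so a value such as r = 0.69 versus r = 0.71 should not be read as a sharp change in the strength of the relationship.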
Interpretation of r
For GDP / life expectancy example, r = 0.809. This indicates a strong positive correlation
[Figure: scatterplot of LIFE_EXP (vertical axis, 76.0 to 79.5) versus GDP (horizontal axis, 18 to 24)]
Problems with Correlations
Not all relations are linear
Outliers can have large influence on r
Lurking variables confound relations
Not all Relationships are Linear
Miles per Gallon versus Speed
r ≈ 0 (flat line), but there is a non-linear relation
Fitted line: y = -0.013x + 26.9, r = -0.06
[Figure: scatterplot of miles per gallon (vertical axis, 0 to 35) versus speed (horizontal axis, 0 to 100) with fitted straight line]
Not all Relationships are Linear
Miles per Gallon versus Speed
[Figure: scatterplot of miles per gallon (vertical axis, 0 to 35) versus speed (horizontal axis, 0 to 100)]
Curved relationship. r was misleading.
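A perfectly predictable curved relationship can still give r near zero. The sketch below, in Python, uses made-up speed and mileage values (not the data behind the slide's figure) in which mileage rises and then falls symmetrically around 50; `pearson_r` is a helper written for this illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Correlation r via sums of deviations (equivalent to the z-score formula)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: mileage peaks at speed 50 and falls off on both sides.
speed = [10, 20, 30, 40, 50, 60, 70, 80, 90]
mpg = [30 - (s - 50) ** 2 / 100 for s in speed]

r_curve = pearson_r(speed, mpg)
print(r_curve)
```

Here mileage is an exact function of speed, yet r comes out at essentially zero because the upward and downward halves of the curve cancel. This is why a scatterplot should be inspected before r is interpreted.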
Outliers and Correlation
The outlier in the above graph decreases r
If we remove the outlier, the relation is strong
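The effect of a single outlier on r can be seen by computing r with and without it. The sketch below, in Python, uses invented data (eight points close to a line plus one point far below it), not the data from the slide's graph; `pearson_r` is a helper written for this illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Correlation r via sums of deviations (equivalent to the z-score formula)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: eight points close to the line y = x ...
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 1.9, 3.1, 4.0, 5.2, 5.8, 7.1, 8.0]
# ... plus one outlier with a large x and a very small y.
outlier_x, outlier_y = 9, 1.0

r_without = pearson_r(xs, ys)
r_with = pearson_r(xs + [outlier_x], ys + [outlier_y])
print(round(r_without, 3), round(r_with, 3))
```

With these made-up numbers, one added point is enough to move r from the "strong" range into the "moderate" range, which matches the slide's warning that outliers can have a large influence on r.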
Exercise 4.15: Calories and sodium content of hot dogs
(a) What are the lowest and highest calorie counts? …lowest and highest sodium levels?
(b) Positive or negative association?
(c) Any outliers? If we ignore the outlier, is the relation still linear? Does the correlation become stronger?
Exercise 4.13: IQ and school grades
(a) Positive or negative association?
(b) Is form linear? Does it appear strong?
(c) What are the IQ and GPA for the outlier at the bottom of the plot?