Date post: | 19-Dec-2015 |
Chapter 4 1
Chapter 4
Describing Relationships: Scatterplots and Correlation
Objectives (BPS chapter 4)
Relationships: Scatterplots and correlation
Explanatory and response variables
Displaying relationships: scatterplots
Interpreting scatterplots
Adding categorical variables to scatterplots
Measuring linear association (correlation)
Facts about correlation
Scatterplot
A scatterplot is a graph in which paired (x, y) data (usually collected on the same individuals) are plotted with one variable represented on the horizontal (x) axis and the other variable represented on the vertical (y) axis. Each individual pair (x, y) is plotted as a single point.
Example:
Student   Number of Beers   Blood Alcohol Level
1         5                 0.1
2         2                 0.03
3         9                 0.19
4         8                 0.12
5         3                 0.04
6         7                 0.095
7         3                 0.07
8         5                 0.06
9         3                 0.02
10        5                 0.05
11        4                 0.07
12        6                 0.1
13        5                 0.085
14        7                 0.09
15        1                 0.01
16        4                 0.05
Here we have two quantitative variables for each of 16 students:
1. How many beers they drank
2. Their blood alcohol level (BAC)
We are interested in the relationship between the two variables: How is one affected by changes in the other one?
Scatterplots
In a scatterplot one axis is used to represent each of the variables, and the data are plotted as points on the graph.
[Figure: scatterplot of the beer data. x axis: explanatory (independent) variable, number of beers. y axis: response (dependent) variable, blood alcohol content.]
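As a rough illustration of how paired data become points, the sketch below draws a crude text scatterplot of the beer data using only the Python standard library. The function name `text_scatterplot` and the 10-row grid are assumptions made for this example; a plotting library would normally be used instead.

```python
# (beers, BAC) pairs for students 1..16, taken from the table above.
data = [(5, 0.10), (2, 0.03), (9, 0.19), (8, 0.12), (3, 0.04), (7, 0.095),
        (3, 0.07), (5, 0.06), (3, 0.02), (5, 0.05), (4, 0.07), (6, 0.10),
        (5, 0.085), (7, 0.09), (1, 0.01), (4, 0.05)]


def text_scatterplot(points, rows=10):
    """Return a crude text scatterplot: x runs left to right, y bottom to top.

    Assumes integer x values (one column per whole unit of x).
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    width = x_max - x_min + 1
    grid = [[" "] * width for _ in range(rows)]
    for x, y in points:
        col = x - x_min
        # Scale y into 0..rows-1; the largest y lands in the top row.
        level = round((y - y_min) / (y_max - y_min) * (rows - 1))
        grid[rows - 1 - level][col] = "*"
    lines = ["|" + " ".join(row) for row in grid]
    lines.append("+" + "-" * (2 * width - 1))
    return "\n".join(lines)


plot = text_scatterplot(data)
print(plot)
```

The explanatory variable (beers) runs along the horizontal axis and the response (BAC) up the vertical axis, so the points should drift from the lower left to the upper right.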
Explanatory and response variables
A response variable measures or records an outcome of a study. An
explanatory variable explains changes in the response variable.
Typically, the explanatory or independent variable is plotted on the x
axis and the response or dependent variable is plotted on the y axis.
Some plots don’t have clear explanatory and response variables. Do calories explain sodium amounts? Does percent return on Treasury bills explain percent return on common stocks?
Examining a Scatterplot
You can describe the overall pattern of a scatterplot by its:
Form – linear, non-linear (quadratic, exponential, etc.) or no correlation.
Direction – positive or negative.
Strength – very strong, strong, moderately strong, weak, etc.
Look for outliers and how they affect the correlation.
Scatterplot
Example: Draw a scatterplot for the data below. What is the nature of the relationship between x and y?
x    1    2    3    4    5
y   -4   -2    1    0    2
[Figure: scatterplot of the five (x, y) points.]
Answer: Strong, positive and linear.
Examining a Scatterplot
Two variables are positively correlated when high values of the variables tend to occur together and low values of the variables tend to occur together. The scatterplot slopes upwards from left to right.
Two variables are negatively correlated when high values of one of the variables tend to occur with low values of the other and vice versa. The scatterplot slopes downwards from left to right.
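One way to make "high values tend to occur together" concrete is to look at the sign of the summed products of deviations from the means. The sketch below is a minimal illustration under that assumption; the helper name `direction` and the second data set are hypothetical, while the beer data come from the earlier table.

```python
def direction(xs, ys):
    """Classify the direction of association from the summed products of
    deviations: positive when above-average x tends to pair with
    above-average y, negative when it tends to pair with below-average y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "none"


# Beer data from the example table (students 1..16).
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

# Hypothetical data in which y falls as x rises.
x_down = [1, 2, 3, 4, 5]
y_down = [10, 8, 7, 4, 1]

print(direction(beers, bac))     # positive
print(direction(x_down, y_down)) # negative
```

The sign only gives direction; it says nothing about form or strength, which still have to be read from the scatterplot.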
Types of Correlation
[Figure: four scatterplot panels, each with axes x and y.]
Negative Linear Correlation: As x increases, y tends to decrease.
Positive Linear Correlation: As x increases, y tends to increase.
No Correlation
Non-linear Correlation
Examples of Relationships
[Figure: four scatterplots. Health Status Measure vs. Income; Health Status Measure vs. Age; Education Level vs. Age; Mental Health Score vs. Physical Health Score.]
Caution: Relationships require that both variables be quantitative (thus the order of the data points is defined entirely by their value). Correspondingly, relationships between categorical data are meaningless.
Example: Beetles trapped on boards of different colors
[Figure: two charts of beetles trapped by board color, one with the colors ordered Blue, White, Green, Yellow and one ordered Blue, Green, White, Yellow.]
What association? What relationship? Because the order of the categories is arbitrary, describe one category at a time.
Thought Question 1
What type of association would the following pairs of variables have: positive, negative, or none?
1. Temperature during the summer and electricity bills
2. Temperature during the winter and heating costs
3. Number of years of education and height (Elementary School)
4. Frequency of brushing and number of cavities
5. Number of churches and number of bars in cities
6. Height of husband and height of wife
Thought Question 2
Consider the two scatterplots below. How does the outlier impact the correlation for each plot?
– does the outlier increase the correlation, decrease the correlation, or have no impact?
Strength of the association
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.
With a strong relationship, you can get a pretty good estimate of y if you know x.
With a weak relationship, for any x you might get a wide range of y values.
How to scale a scatterplot
Using an inappropriate scale for a scatterplot can give an incorrect impression.
Both variables should be given a similar amount of space:
• Plot roughly square
• Points should occupy all the plot space (no blank space)
[Figure: four plots of the same data drawn with different scales.]
Adding categorical variables to scatterplots
Often, things are not simple and one-dimensional. We need to group the data into categories to reveal trends.
What may look like a positive linear relationship is in fact a series of negative linear associations. Plotting different habitats in different colors allowed us to make that important distinction.
Comparison of men’s and women’s racing records over time: Each group shows a very strong negative linear relationship that would not be apparent without the gender categorization.
Relationship between lean body mass and metabolic rate in men and women: While both men and women follow the same positive linear trend, women show a stronger association. As a group, males typically have larger values for both variables.
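The habitat example can be mimicked numerically. The sketch below uses made-up data (three hypothetical groups labelled A, B and C, not the data behind the slide's figure) in which the pooled points show a positive correlation even though every group on its own slopes downward. The `correlation` helper is an assumption of this example; it computes the usual Pearson r.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r, computed from sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical groups: within each, y falls as x rises,
# but each successive group sits higher and further right.
groups = {
    "A": ([1, 2, 3], [3, 2, 1]),
    "B": ([4, 5, 6], [6, 5, 4]),
    "C": ([7, 8, 9], [9, 8, 7]),
}

all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]

overall = correlation(all_x, all_y)
within = {name: correlation(xs, ys) for name, (xs, ys) in groups.items()}

print(round(overall, 2))                          # pooled data: positive
print({k: round(v, 2) for k, v in within.items()})  # each group: negative
```

Ignoring the grouping variable would suggest a positive relationship here, which is why marking categories with different colors or symbols on the scatterplot matters.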
Measuring Strength & Direction of a Linear Relationship
How closely does a non-horizontal straight line fit the points of a scatterplot?
The correlation coefficient (often referred to as just correlation): r
– measure of the strength of the relationship: the stronger the relationship, the larger the magnitude of r.
– measure of the direction of the relationship: positive r indicates a positive relationship, negative r indicates a negative relationship.
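Correlation Coefficient
$$ r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right) = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n-1)\, s_x s_y} $$
The Greek capital letter sigma ($\Sigma$) denotes summation or addition.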
Example: Find the correlation between x and y.
x    1    2    3    4    5
y   -4   -2    1    0    2

x    x - x̄    y     y - ȳ    (x - x̄)(y - ȳ)
1    -2       -4    -3.4     6.8
2    -1       -2    -1.4     1.4
3     0        1     1.6     0
4     1        0     0.6     0.6
5     2        2     2.6     5.2
Sum of products: 14

$\bar{x} = 3$, $\bar{y} = -0.6$
$s_x = 1.58$, $s_y = 2.41$
$$ r = \frac{14}{4 \, (1.58)(2.41)} = 0.9192 $$
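The hand calculation above can be checked with a few lines of Python. This is a minimal sketch that follows the z-score form of the formula; the function name `correlation` is an assumption of this example, and `statistics.stdev` is used because it computes the sample standard deviation with an n - 1 divisor.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of the z-scores of x and y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations (n - 1)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)


x = [1, 2, 3, 4, 5]
y = [-4, -2, 1, 0, 2]

r = correlation(x, y)
print(round(r, 3))  # about 0.919
```

Working with unrounded standard deviations gives r of about 0.919; the 0.9192 in the worked example comes from rounding s_x and s_y to two decimals first.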
Correlation Coefficient
The range of the correlation coefficient is -1 to 1.
If r = -1 there is a perfect negative correlation.
If r = 1 there is a perfect positive correlation.
If r is close to 0 there is no linear correlation.
Linear Correlation
[Figure: four scatterplot panels, each with axes x and y.]
Strong negative correlation: r = -0.91
Strong positive correlation: r = 0.88
Weak positive correlation: r = 0.42
Non-linear correlation: r = 0.07
Correlation Coefficient
Special values for r:
• A perfect positive linear relationship would have r = +1.
• A perfect negative linear relationship would have r = -1.
• If there is no linear relationship, or if the scatterplot points are best fit by a horizontal line, then r = 0.
Note: r must be between -1 and +1, inclusive.
r > 0: as one variable changes, the other variable tends to change in the same direction
r < 0: as one variable changes, the other variable tends to change in the opposite direction
Correlation Coefficient
Because r uses the z-scores for the observations, it does not change when we change the units of measurement of x, y or both.
Correlation ignores the distinction between explanatory and response variables.
r measures the strength of only linear association between variables.
A large value of r does not necessarily mean that there is a strong linear relationship between the variables – the relationship might not be linear; always look at the scatterplot.
When r is close to 0, it does not mean that there is no relationship between the variables; it means there is no linear relationship.
Outliers can inflate or deflate correlations.
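Several of these facts can be checked numerically. The sketch below reuses the small x, y data set from the worked example; the factor of 12 (as in converting years to months) and the added outlier (10, -10) are hypothetical choices made only for illustration, and `correlation` is a helper defined here.

```python
from math import isclose, sqrt


def correlation(xs, ys):
    """Pearson correlation r, computed from sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y = [-4, -2, 1, 0, 2]
r = correlation(x, y)

# Fact 1: changing units (here every x is multiplied by 12) leaves r unchanged.
r_rescaled = correlation([12 * v for v in x], y)

# Fact 2: swapping the roles of x and y leaves r unchanged.
r_swapped = correlation(y, x)

# Fact 3: a single outlier can change r a great deal.
# The extra point (10, -10) is hypothetical.
r_with_outlier = correlation(x + [10], y + [-10])

print(isclose(r, r_rescaled), isclose(r, r_swapped))
print(round(r, 3), round(r_with_outlier, 3))
```

In this made-up case one extra point is enough to turn a strong positive correlation into a negative one, which is why the scatterplot should always be inspected alongside r.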
Not all Relationships are Linear
Miles per Gallon versus Speed
Curved relationship (r is misleading)
Speed chosen for each subject varies from 20 mph to 60 mph
MPG varies from trial to trial, even at the same speed
Statistical relationship
[Figure: scatterplot of miles per gallon versus speed; r = -0.06.]
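To see how a curved pattern can produce r near 0, the sketch below uses invented mileage figures (not the data behind the slide's plot) that rise and then fall symmetrically around 40 mph. The formula for `mpg` and the helper `correlation` are assumptions of this example.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r, computed from sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: mileage peaks at 40 mph and falls off on either side.
speed = [20, 30, 40, 50, 60]
mpg = [30 - (s - 40) ** 2 / 40 for s in speed]   # 20, 27.5, 30, 27.5, 20

r = correlation(speed, mpg)
print(mpg)
print(r)
```

Here mpg is completely determined by speed, yet r comes out as 0 because the rising and falling halves cancel; r measures only the linear part of a relationship.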
Common Errors Involving Correlation
1. Causation: It is wrong to conclude that correlation implies causality.
2. Averages: Averages suppress individual variation and may inflate the correlation coefficient.
3. Linearity: There may be some relationship between x and y even when there is no linear correlation.
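The point about averages can be illustrated with a small invented data set: nine individual observations at three x values, compared with the three group means. All numbers below are hypothetical, and `correlation` is a helper defined for this sketch.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r, computed from sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical individual observations: three y values at each x.
individual_x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
individual_y = [0, 2, 4, 1, 3, 5, 2, 4, 6]

# The same data reduced to one average y per x value.
mean_x = [1, 2, 3]
mean_y = [2, 3, 4]

r_individual = correlation(individual_x, individual_y)
r_means = correlation(mean_x, mean_y)

print(round(r_individual, 3))  # moderate correlation among individuals
print(round(r_means, 3))       # much higher once variation is averaged away
```

Averaging removes the scatter within each x value, so the correlation between averages overstates how well x predicts y for an individual.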
Example
A survey of the world’s nations in 2004 shows a strong positive correlation between the percentage of each country’s population using cell phones and life expectancy in years at birth.
a) Does this mean that cell phones are good for your health?
No. It simply means that in countries where cell phone use is high, the life expectancy tends to be high as well.
b) What might explain the strong correlation?
The economy could be a lurking variable. Richer countries generally have more cell phone use and better health care.
Example
The correlation between Age and Income as measured on 100 people is r = 0.75. Explain whether or not each of these conclusions is justified.
a) When Age increases, Income increases as well.
b) The form of the relationship between Age and Income is linear.
c) There are no outliers in the scatterplot of Income vs. Age.
d) Whether we measure Age in years or months, the correlation will still be 0.75.
Example
Explain the mistakes in the statements below:
a) “My correlation of -0.772 between GDP and Infant Mortality Rate shows that there is almost no association between GDP and Infant Mortality Rate.”
b) “There was a correlation of 0.44 between GDP and Continent.”
c) “There was a very strong correlation of 1.22 between Life Expectancy and GDP.”
Key Concepts
Strength of Linear Relationship
Direction of Linear Relationship
Correlation Coefficient
Common Problems with Correlations
r can only be calculated for quantitative data.