Objectives
2.1 Scatterplots
§ Scatterplots
§ Explanatory and response variables
§ Interpreting scatterplots
§ Outliers
Adapted from authors’ slides © 2012 W.H. Freeman and Company
Relationships
• A very important aspect of statistics is the study of relationships between two variables.
• We have already partly studied this problem when we were doing two-sample procedures:
  • Relationship between location and level of student debt
  • Relationship between gender and height
• We have also looked at relationships between categorical variables:
  • Binge drinking and gender.
• In this section we start to "quantify" and model these relationships.
• There are situations when the relationship is so clear that we do not need any form of statistical analysis:
  • For example, suppose we want to buy a latte at a coffee shop. The barista explains that the latte comes in three sizes, small, medium and large, and that the prices are $3.50, $4.00 and $4.50 respectively.
Clearly in this example, knowing the size tells you exactly the price of the coffee. However, in many situations the relationship is not so clear cut. This is where statistical tools become useful.
Relationship of two numerical variables
Most statistical studies involve more than one variable and the primary questions are about their relationships. Questions one can ask:
• Which variable(s) are explanatory and which are responses?
• Do we want to know how one variable affects the value of another?
• Or do we simply want to measure their association?
• How is the relationship best described?
• Is the association positive or negative?
• How can we predict one variable from the value of the other(s)?
• Can a straight line be used effectively or is the relationship more complex?
• How well (close) do the data fit the relationship we describe?
• How strong (or weak) is the relationship?
• Is the relationship “significant”? (Can we reject H0: no association?)
• How do the data deviate from the overall pattern?
Student Beers BAC
1 5 0.1
2 2 0.03
3 9 0.19
6 7 0.095
7 3 0.07
9 3 0.02
11 4 0.07
13 5 0.085
4 8 0.12
5 3 0.04
8 5 0.06
10 5 0.05
12 6 0.1
14 7 0.09
15 1 0.01
16 4 0.05
Looking at relationships: Scatterplots
In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.
We look for an overall pattern and for deviations from the pattern.
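As a rough illustration of how a scatterplot is built, the sketch below places each student from the Beers/BAC table on a character grid, with beers across and BAC up. It is only a minimal sketch: the function name `text_scatter` and the choice of 10 rows are assumptions made for this example, and the code assumes whole-number x values.

```python
# Beers and BAC for the 16 students in the table above (same row order).
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]


def text_scatter(xs, ys, rows=10):
    """Return the lines of a crude scatterplot: x across, y up.

    Hypothetical helper for illustration; assumes integer x values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    cols = x_max - x_min + 1  # one column per whole unit of x
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in zip(xs, ys):
        col = x - x_min
        # Scale y onto levels 0..rows-1, then flip so large y is at the top.
        level = round((y - y_min) / (y_max - y_min) * (rows - 1))
        grid[rows - 1 - level][col] = "*"
    return ["|" + "".join(row) for row in grid]


plot = text_scatter(beers, bac)
for line in plot:
    print(line)
print("+" + "-" * (len(plot[0]) - 1))
```

The points drift from the lower left to the upper right, which is the overall pattern we look for. Students who share the same grid cell are drawn as a single `*`, so a real plotting tool is preferable for actual analysis.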
Example: Relationships in weight gain
• A study was done to investigate why some people do not gain weight even when they overeat. One theory is that these people tend to do "non-exercise" activity (such as fidgeting and twitching), which prevents their weight gain.
• To investigate this issue, researchers overfed 16 healthy volunteers for a period of 8 weeks. Before the study they measured the average amount of NEA (non-exercise activity) each volunteer did per day (measured in calories). Then during the study they also measured the amount of NEA that each volunteer did. The difference in NEA (before and after the study) and the weight gain are given on my website.
Scatterplot NEA against weight gain
From the plot it is clear that the people with larger increases in non-exercise activity gained the least weight. How to quantify the strength of this relationship?
Positive association: High values of the response variable tend to occur together with high values of the explanatory variable.
Negative association: High values of the response variable tend to occur together with low values of the explanatory variable.
Flat (no) association: The values of the response variable are similarly distributed for all values of the other variable. There is no information about the response variable that can be predicted from the explanatory variable.
Complex association: For some values of the explanatory variable the variables appear to be positively associated, but for other values of that variable they appear to be negatively associated (curvature).
Or information other than the general (average) level of the response variable can be predicted from the explanatory variable.
Positive or Negative?
Form and direction of an association
[Figure: example scatterplots showing a straight-line relationship, a curved relationship, and no relationship, with the direction of each labeled as negative, positive, or neither.]
Example: Negative association for weights
From the plot it is clear that the people with larger increases in non-exercise activity gained the least weight. This means the association is negative.
Example: Positive association for temp and CO2
This is a scatterplot of average global yearly temperatures against the yearly man-made CO2 emissions. There are 150 points, each corresponding to one year from 1855 to 2005. We can see a clear positive association. Large CO2 values tend to correspond to higher temperatures.
Strength of the association
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.
This is a very strong positive relationship. The daily amount of gas consumed can be predicted quite accurately for a given temperature value. Y varies very little for a given X.
This is a weak positive relationship. For a particular median household income (X), you cannot predict the state per capita income (Y) very well. Y varies widely for a given X.
Issues: How to scale a scatterplot
Using an inappropriate scale for a scatterplot will give an incorrect impression and interpretation of the data. Both variables should be given a similar amount of space:
• The plot is roughly square.
• Space cannot be reduced without removing some points.
Same data in all four plots. There is a negative relationship between swim time and pulse rate.
Issues: Outliers
An outlier is a data point that is exceptionally unusual or unexpected. Outliers fall outside of the overall pattern of the relationship.
This point is not in line with the others. It is an outlier of the relationship.
This point is unusual in its values but it is not an outlier of the relationship.
Review: Interpreting scatterplots
• After plotting two variables on a scatterplot, we describe the relationship by examining the direction, form, and strength of the association. We look for an overall pattern …
  • Direction: positive, negative, no direction.
  • Form: straight line, curved, clusters, no pattern.
  • Strength: how closely the points fit the “form”.
• … and for deviations from that pattern.
  • Do the points fit more closely for one part of the form than for another?
  • Are there outliers?
• Would it be appropriate to extrapolate the relationship we see?
Objectives
2.2 Correlation
§ The correlation coefficient r
§ Properties of the correlation coefficient
Measuring the strength of a linear relationship
• We recall that in the previous section:
  • The midterm grades appeared to be positively associated, but the strength of the association was weak. In particular, the association between midterm 1 and the other midterms seemed very weak. The association between midterms 2 and 3 appeared to be stronger.
  • By contrast, weight gain and NEA appeared to have a strong negative association.
• How can we quantify and compare these associations?
  • How can we compare the associations between the midterms?
• The linear association between two numerical variables can be measured using the notion of correlation.
• The correlation coefficient is a number which lies between -1 and 1.
  • 1 = complete positive association (no spread)
  • -1 = complete negative association (no spread)
  • 0 = no linear association, but there could be other, nonlinear types of association.
Measuring relationship: correlation
• The correlation r is calculated using the standardized values (z-scores) of both the x and y variables.
• r is positive if the relationship is positive and negative if the relationship is negative.
• r is always between −1 and 1. The closer it is to −1 or 1, the stronger the relationship. But close to 0 does not necessarily mean no relationship.
• r has no units of measurement and does not depend on the units for x and y.
• It does not matter whether you plot x against y or y against x, the correlation coefficient will be the same.
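\[
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
\]
The first factor in the sum is the z-score for x and the second factor is the z-score for y.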
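To make the z-score definition of r concrete, here is a minimal sketch that applies it to the Beers/BAC data from the earlier table. The helper names `mean`, `sample_sd` and `correlation` are chosen for this example only. The standard deviations use the n − 1 divisor, which is what the 1/(n − 1) factor in the definition assumes.

```python
from math import sqrt

# Beers and BAC for the 16 students in the earlier table (same row order).
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    z_products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys))
    return sum(z_products) / (n - 1)


r = correlation(beers, bac)
print(round(r, 3))  # prints 0.894
```

The value is positive and fairly close to 1, which matches a strong positive straight-line pattern between number of beers and BAC.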
Weight gain and NEA
The correlation for the weight gain example is -0.778. It is negative because large NEA corresponds to smaller weight gain and it is close to -1, because there is not much spread about the line.
Yearly temperature and man-made CO2
This is a scatterplot of average global yearly temperatures against the yearly man-made CO2 emissions. There are 150 points, each corresponding to one year from 1855 to 2005. The correlation between temperature and CO2 is 0.799. The correlation is positive because large amounts of CO2 emissions tend to correspond to higher temperatures. The correlation is relatively close to one, since there is some spread about the line, but not a huge amount.
The correlation coefficient r
Time to swim: \(\bar{x} = 35\), \(s_x = 0.70\)
Pulse rate: \(\bar{y} = 140\), \(s_y = 9.5\)
Correlation: \(r = -0.75\). This indicates a moderately strong negative relationship.
"Time to Swim" is the explanatory variable here, and belongs on the x axis. However, the value of r is the same regardless of how we label or plot the variables.
The value of r would be the same if, for example, “Time to Swim” was measured in seconds and “Pulse Rate” was measured in beats per hour.
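The two claims above (r does not depend on which variable is called x, nor on the units) can be checked numerically. The sketch below uses made-up swim times and pulse rates, not the data behind the plot, and a hypothetical helper `correlation`. The helper uses the form r = S_xy / sqrt(S_xx S_yy), which is algebraically the same as the z-score formula because the n − 1 factors cancel.

```python
from math import isclose, sqrt


def correlation(xs, ys):
    """r = S_xy / sqrt(S_xx * S_yy), equivalent to the z-score formula."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data for illustration only.
time_min = [34.1, 35.7, 34.2, 36.0, 34.8, 35.3]   # swim time in minutes
pulse_bpm = [152, 124, 148, 132, 140, 136]        # pulse in beats per minute

r_original = correlation(time_min, pulse_bpm)

# Change of units: minutes -> seconds, beats per minute -> beats per hour.
time_sec = [t * 60 for t in time_min]
pulse_bph = [p * 60 for p in pulse_bpm]
r_rescaled = correlation(time_sec, pulse_bph)

# Swap the roles of x and y.
r_swapped = correlation(pulse_bpm, time_min)

print(isclose(r_original, r_rescaled))  # True
print(isclose(r_original, r_swapped))   # True
```

Rescaling multiplies each deviation and each standard deviation by the same constant, so the z-scores, and hence r, are unchanged.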
r ranges from −1 to +1
The correlation coefficient r quantifies the strength and direction of a linear relationship between two quantitative variables.
Strength: how closely the points follow a straight line.
Direction: is positive when individuals with higher X values tend to have higher values of Y, and is negative when individuals with higher X values tend to have lower values of Y.
Automobiles in Albuquerque were randomly selected (at a shopping center) in 1974 and given an emissions test. Total hydrocarbon emissions level and model year were observed.
Direction: Negative
Form: Straight line?
Strength: Weak
r = −0.483
Pollutants were observed over a 28 day period. The carbon pollutants and the ozone level are to be related.
Direction: Positive
Form: Straight line
Strength: Moderate
r = 0.687
The efficiency of an industrial biofilter is tested at different temperature levels.
Direction: Positive
Form: Straight line
Strength: Moderate to strong
r = 0.891
The nickel-to-iron ratio was measured in oat plants and the plant age (in days after emergence) was also recorded.
Direction: Complex (positive until 50 days, then negative)
Form: Curved
Strength: Strong (if curve is taken into account)
r = 0.479
The correlation measures the degree to which the points fit a straight line, not a curve.
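A small made-up example shows how extreme this can be: when y is determined exactly by x through a symmetric curve, the correlation can be zero. The points below are hypothetical (they are not the oat plant data), and `correlation` is a helper written for this sketch.

```python
from math import sqrt


def correlation(xs, ys):
    """r = S_xy / sqrt(S_xx * S_yy), equivalent to the z-score formula."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical points lying exactly on the curve y = x^2.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
print(r)  # prints 0.0
```

Here y is perfectly predictable from x, yet r = 0, because the positive association on the right half of the curve cancels the negative association on the left half.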
What’s wrong with the statement?
• In my genetics class there is a perfect correlation (correlation coefficient = 1) between midterm 2 and midterm 3. Both midterms were out of 15, so if a student scored 12 in midterm 2 then he scored 12 in midterm 3 too.
  • A perfect (or high) correlation does not mean that the numbers for both variables are the same. For example, the students could have scored less in midterm 2 than in midterm 3 and there could still be a perfect correlation (this is most easily seen with a graph).
• There is a high correlation between the age of American workers and their occupation.
  • Occupation is a categorical variable (teacher, lorry driver, miner, etc.), so it is impossible to define a correlation between age and occupation. The article probably means a strong association between age (where age was grouped, e.g. 20-29, 30-39, ...) and occupation, which is assessed by comparing conditional probabilities (see previous lectures). But the word correlation makes no sense here: how can a higher age correspond to a higher occupation?
• We found a correlation of 1.19 between students’ ratings of faculty teaching and ratings made by other faculty.
  • A correlation can only lie between -1 and 1!
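The answer to the first statement can also be checked numerically rather than with a graph. In the sketch below the scores are invented for illustration: every student scores exactly one point more in midterm 3 than in midterm 2, so no student has the same score twice, yet the points lie exactly on a straight line. The helper `correlation` is written for this example.

```python
from math import isclose, sqrt


def correlation(xs, ys):
    """r = S_xy / sqrt(S_xx * S_yy), equivalent to the z-score formula."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical scores out of 15.
midterm2 = [8, 10, 12, 9, 14]
midterm3 = [score + 1 for score in midterm2]  # one point higher for everyone

r = correlation(midterm2, midterm3)
print(isclose(r, 1.0))  # True: r is 1 up to floating-point rounding
```

A perfect correlation only says that the points fall exactly on some increasing straight line, not that the line is y = x.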