
Objectives*suhasini/teaching301/stat301_correlation.pdf · scored less than in midterm 3, but there...

Date post: 24-Jun-2020

Objectives  

2.1 Scatterplots

§  Scatterplots

§  Explanatory and response variables

§  Interpreting scatterplots

§  Outliers

Adapted from authors’ slides © 2012 W.H. Freeman and Company


Relationships

• A very important aspect of statistics is the study of relationships between two variables.
• We have already partly studied this problem when we were doing two-sample procedures:
    • Relationship between location and level of student debt.
    • Relationship between gender and height.
• We have also looked at relationships between categorical variables:
    • Binge drinking and gender.
• In this section we start to "quantify" and model these relationships.
• There are situations when the relationship is so clear that we do not need any form of statistical analysis:
    • For example, suppose we want to buy a latte at a coffee shop. The barista explains that the latte comes in three sizes, small, medium and large, and that the prices are $3.50, $4.00 and $4.50 respectively.

Clearly, in this example, knowing the size tells you exactly the price of the coffee. However, in many situations the relationship is not so clear cut. This is where statistical tools become useful.


Relationship of two numerical variables

Most statistical studies involve more than one variable and the primary questions are about their relationships. Questions one can ask:

• Which variable(s) are explanatory and which are responses?
    • Do we want to know how one variable affects the value of another?
    • Or do we simply want to measure their association?
• How is the relationship best described?
    • Is the association positive or negative?
    • How can we predict one variable from the value of the other(s)?
    • Can a straight line be used effectively or is the relationship more complex?
• How well (close) do the data fit the relationship we describe?
    • How strong (or weak) is the relationship?
    • Is the relationship “significant”? (Can we reject H0: no association?)
    • How do the data deviate from the overall pattern?


Looking at relationships: Scatterplots

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

We look for an overall pattern and for deviations from the pattern.

Student   Beers   BAC
1         5       0.1
2         2       0.03
3         9       0.19
4         8       0.12
5         3       0.04
6         7       0.095
7         3       0.07
8         5       0.06
9         3       0.02
10        5       0.05
11        4       0.07
12        6       0.1
13        5       0.085
14        7       0.09
15        1       0.01
16        4       0.05


Example: Relationships in weight gain

• A study was done to investigate why some people do not gain weight even when they overeat. One theory is that these people tend to do "non-exercise" activity (such as fidgeting and twitching), which prevents their weight gain.
• To investigate this issue, researchers overfed 16 healthy volunteers for a period of 8 weeks. Before the study they measured the average amount of NEA (non-exercise activity) each volunteer did per day (measured in calories). Then during the study they also measured the amount of NEA that each volunteer did. The difference in the NEA (before and after the study) and the weight gain is given on my website.


Scatterplot of NEA against weight gain

From the plot it is clear that the people with larger increases in non-exercise activity gained the least weight. How can we quantify the strength of this relationship?


Positive or Negative?

Positive association: High values of the response variable tend to occur together with high values of the explanatory variable.

Negative association: High values of the response variable tend to occur together with low values of the explanatory variable.

Flat (no) association: The values of the response variable are similarly distributed for all values of the other variable. There is no information about the response variable that can be predicted from the explanatory variable.

Complex association: For some values of the explanatory variable the variables appear to be positively associated, but for other values of that variable they appear to be negatively associated (curvature). Or information other than the general (average) level of the response variable can be predicted from the explanatory variable.


Form and direction of an association

[Figure: example scatterplots grouped by form (straight line relationship, curved relationship, no relationship), with panels labeled by direction (negative, positive, neither).]


Example: Negative association for weights

From the plot it is clear that the people with larger increases in non-exercise activity gained the least weight. This means the association is negative.


Example: Positive association for temp and CO2

This is a scatterplot of average global yearly temperatures against the yearly man-made CO2 emissions. There are 150 points, each corresponding to one year between 1855 and 2005. We can see a clear positive association. Larger CO2 values tend to correspond to higher temperatures.


Strength of the association

The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.

This is a very strong positive relationship. The daily amount of gas consumed can be predicted quite accurately for a given temperature value. Y varies very little for a given X.

This is a weak positive relationship. For a particular median household income (X), you cannot predict the state per capita income (Y) very well. Y varies widely for a given X.


Issues: How to scale a scatterplot

Using an inappropriate scale for a scatterplot will give an incorrect impression and interpretation of the data. Both variables should be given a similar amount of space:

• The plot is roughly square.
• Space cannot be reduced without removing some points.

Same data in all four plots. There is a negative relationship between swim time and pulse rate.


Issues: Outliers

An outlier is a data point that is exceptionally unusual or unexpected.

Outliers fall outside of the overall pattern of the relationship.

This point is not in line with the others. It is an outlier of the relationship.

This point is unusual in its values but it is not an outlier of the relationship.


Review: Interpreting scatterplots

• After plotting two variables on a scatterplot, we describe the relationship by examining the direction, form, and strength of the association. We look for an overall pattern …
    • Direction: positive, negative, no direction.
    • Form: straight line, curved, clusters, no pattern.
    • Strength: how closely the points fit the “form”.
• … and for deviations from that pattern.
    • Do the points fit more closely for one part of the form than for another?
    • Are there outliers?
    • Would it be appropriate to extrapolate the relationship we see?


Objectives  

2.2 Correlation

§  The correlation coefficient r

§  Properties of the correlation coefficient



Measuring the strength of a linear relationship

• We recall that in the previous section:
    • The midterm grades appeared to be positively associated, but the strength of the association was weak. In particular, the association between midterm 1 and the other midterms seemed very weak. The association between midterms 2 and 3 appeared to be stronger.
    • The weight gain and NEA, in contrast, appeared to have a strong negative association.
• How can we quantify and compare these associations?
    • How can we compare the associations between the midterms?
• The linear association between two numerical variables can be measured using the notion of correlation.
• The correlation coefficient is a number which lies between -1 and 1.
    • 1 = complete positive association (no spread)
    • -1 = complete negative association (no spread)
    • 0 = no linear association, but there could still be a nonlinear association.


Measuring relationship: correlation

• The correlation coefficient r is calculated using the standardized values (z-scores) of both the x and y variables.
• r is positive if the relationship is positive and negative if the relationship is negative.
• r is always between −1 and 1. The closer it is to −1 or 1, the stronger the relationship. But close to 0 does not necessarily mean no relationship.
• r has no units of measurement and does not depend on the units for x and y.
• It does not matter whether you plot x against y or y against x, the correlation coefficient will be the same.

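$$
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \underbrace{\left( \frac{x_i - \bar{x}}{s_x} \right)}_{\text{z-score for } x}
    \underbrace{\left( \frac{y_i - \bar{y}}{s_y} \right)}_{\text{z-score for } y}
$$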
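The z-score formula for r can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original slides: the function name `correlation` is made up for the example, and the data are the beers and BAC values from the student table earlier in these notes, listed in student order.

```python
import statistics


def correlation(xs, ys):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    # statistics.stdev is the sample standard deviation (divisor n - 1)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Beers and BAC for students 1 to 16 (from the table in these notes)
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.1, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.1, 0.085, 0.09, 0.01, 0.05]

r = correlation(beers, bac)
print(f"r = {r:.3f}")
```

Because each term is a product of two z-scores, points that are above average in both variables (or below average in both) push r up, while points that are above average in one and below in the other pull it down.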


Weight gain and NEA

The correlation for the weight gain example is -0.778. It is negative because a larger increase in NEA corresponds to a smaller weight gain, and it is close to -1 because there is not much spread about the line.


Yearly temperature and man-made CO2

This is a scatterplot of average global yearly temperatures against the yearly man-made CO2 emissions. There are 150 points, each corresponding to one year between 1855 and 2005. The correlation between temperature and CO2 is 0.799. The correlation is positive because large amounts of CO2 emissions tend to correspond to high temperatures. The correlation is relatively close to one, since there is some spread about the line, but not a huge amount.


The correlation coefficient r

Time to swim: $\bar{x} = 35$; $s_x = 0.70$

Pulse rate: $\bar{y} = 140$; $s_y = 9.5$

Correlation: $r = -0.75$. This indicates a moderately strong negative relationship.

"Time to Swim" is the explanatory variable here, and belongs on the x axis. However, the value of r is the same regardless of how we label or plot the variables.

The value of r would be the same if, for example, “Time to Swim” was measured in seconds and “Pulse Rate” was measured in beats per hour.
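A quick numerical check of these two properties is sketched below. The swim times and pulse rates are hypothetical values invented for illustration (they are not the data behind the slide), and `correlation` is a helper written for this example.

```python
import statistics


def correlation(xs, ys):
    """Correlation r from z-scores (sample standard deviations, divisor n - 1)."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Hypothetical data: swim time in minutes, pulse rate in beats per minute
time_min = [34.1, 35.7, 34.2, 34.1, 34.5, 35.6, 34.0, 35.1]
pulse_bpm = [152, 124, 140, 152, 146, 128, 136, 144]

# Change of units: minutes -> seconds, beats per minute -> beats per hour
time_sec = [60 * t for t in time_min]
pulse_bph = [60 * p for p in pulse_bpm]

r_original = correlation(time_min, pulse_bpm)
r_rescaled = correlation(time_sec, pulse_bph)   # same r in different units
r_swapped = correlation(pulse_bpm, time_min)    # same r with x and y exchanged

print(r_original, r_rescaled, r_swapped)
```

The three printed values agree up to floating-point rounding, because standardizing each variable removes its units and the product of z-scores does not care which variable is called x.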


r ranges from −1 to +1

The correlation coefficient r quantifies the strength and direction of a linear relationship between two quantitative variables.

Strength: how closely the points follow a straight line.

Direction: r is positive when individuals with higher X values tend to have higher values of Y, and is negative when individuals with higher X values tend to have lower values of Y.


Automobiles in Albuquerque were randomly selected (at a shopping center) in 1974 and given an emissions test. Total hydrocarbon emissions level and model year were observed.

Direction? Negative
Form? Straight line?
Strength? Weak, r = −0.483


Pollutants were observed over a 28 day period. The carbon pollutants and the ozone level are to be related.

Direction? Positive
Form? Straight line
Strength? Moderate, r = 0.687


The efficiency of an industrial biofilter is tested at different temperature levels.

Direction? Positive
Form? Straight line
Strength? Moderate to strong, r = 0.891


The nickel-to-iron ratio was measured in oat plants and the plant age (in days after emergence) was also recorded.

Direction? Complex (positive until 50 days, then negative)
Form? Curved
Strength? Strong (if curve is taken into account), r = 0.479

The correlation measures the degree to which the points fit a straight line, not a curve.
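To see why r can understate a strong curved relationship, here is a small sketch with made-up data. It is an idealized stand-in, not the oat plant measurements: the response rises until x = 50 and then falls symmetrically, so y is determined exactly by x, yet r over the whole range is essentially zero.

```python
import statistics


def correlation(xs, ys):
    """Correlation r from z-scores (sample standard deviations, divisor n - 1)."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Hypothetical curved relationship: peak at x = 50, symmetric on both sides
age = [10, 20, 30, 40, 50, 60, 70, 80, 90]
response = [2500 - (a - 50) ** 2 for a in age]

r_all = correlation(age, response)             # whole curve
r_rising = correlation(age[:5], response[:5])  # rising part only (x <= 50)

print(f"whole curve: r = {r_all:.3f}")
print(f"rising part: r = {r_rising:.3f}")
```

On the rising part alone the points are close to a line and r is strongly positive; over the whole curve the positive and negative parts cancel. Real data are rarely this symmetric, which is why the oat plant example still gives a moderate r rather than zero.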


What’s wrong with the statement?

• In my genetics class there is a perfect correlation (correlation coefficient = 1) between midterm 2 and midterm 3. Both midterms were out of 15, so if a student scored 12 in midterm 2 then he scored 12 in midterm 3 too.
    • A perfect (or high) correlation does not mean that the numbers for both variables are the same. For example, in midterm 2 the students could have scored less than in midterm 3, but there can still be a perfect correlation (this is most easily seen with a graph).
• There is a high correlation between the age of American workers and their occupation.
    • Occupation is a categorical variable (Teacher, Lorry driver, Miner, etc.), so it is impossible to define a correlation between age and occupation. The article probably means a strong association between age (where age was grouped, e.g. 20-29, 30-39, ...) and occupation, which is assessed by comparing conditional probabilities (see previous lectures). But the word correlation makes no sense here: how can a higher age correspond to a "higher" occupation?
• We found a correlation of 1.19 between students' ratings of faculty teaching and ratings made by other faculty.
    • Correlation can only lie between -1 and 1!
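The first statement can be checked numerically. The scores below are invented for illustration: every student scores exactly 2 points more in midterm 3 than in midterm 2, so no student has the same score twice, yet the points lie on a straight line and the correlation is 1.

```python
import statistics


def correlation(xs, ys):
    """Correlation r from z-scores (sample standard deviations, divisor n - 1)."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Hypothetical scores out of 15
midterm2 = [6, 8, 9, 11, 12, 13]
midterm3 = [score + 2 for score in midterm2]  # everyone improves by 2 points

r = correlation(midterm2, midterm3)
same_score = [a == b for a, b in zip(midterm2, midterm3)]

print(f"r = {r:.3f}")
print("any student with identical scores?", any(same_score))
```

Any exact straight-line relationship with positive slope behaves the same way, since r measures how closely the points follow a line, not whether the two variables take equal values.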

