Correlation & Regression
Association & Prediction
Measuring association
Editorial and letter to the editor, Indianapolis Star re CDC data
Differing opinions regarding degree of association
How to quantify the association between two variables
• e.g., smoking deaths & tax
• e.g., smoking percent & tax
• e.g., smoking percent & smoking deaths
Lots of Anecdotal & Clinical Relationships
Breast feeding & IQ
Smoking & Criminal Behavior
Abortion & Crime
Is there a relationship?
Student   SAT-V   GPA
John      333     1.0
Janet     756     3.8
Thomas    444     1.9
Scotty    629     3.2
Diana     501     2.3
Hilary    245     0.4
Plot out the data: The Scattergram
[Figure: empty scattergram axes. X-axis: SAT_V, 200 to 800; Y-axis: GPA, 0.0 to 4.0]
Plot out the data: The Scattergram
[Figure: same axes with the first points labeled: John and Janet (756, 3.8)]
Plot out the data: The Scattergram
[Figure: scattergram of GPA (Y-axis, 0.0 to 4.0) against SAT_V (X-axis, 200 to 800)]
Each point represents a pair of scores from a single subject (case)
The Scattergram
[Figure: scattergram of the six students. X-axis: SAT_V (Mean = 484.67), 200 to 800; Y-axis: GPA (Mean = 2.1), 0.0 to 4.0]
Add 2 more students
Student    SAT-V   GPA
John       333     1.0
Janet      756     3.8
Thomas     444     1.9
Scotty     629     3.2
Diana      501     2.3
Hilary     245     0.4
Joe        630     0.9
Patricia   404     3.1
The Scattergram
[Figure: scattergram of the eight students. X-axis: SAT_V (Mean = 492.75), 200 to 800; Y-axis: GPA (Mean = 2.08), 0.0 to 4.0]
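The means shown on the scattergram axes can be checked directly from the student table. The following is a minimal Python sketch; the names `students`, `added`, and `means` are illustrative and not part of the lecture. The GPA mean for eight students works out to 2.075, which the slide rounds to 2.08.

```python
# Scores copied from the student table on the slides: name -> (SAT-V, GPA).
students = {
    "John": (333, 1.0),
    "Janet": (756, 3.8),
    "Thomas": (444, 1.9),
    "Scotty": (629, 3.2),
    "Diana": (501, 2.3),
    "Hilary": (245, 0.4),
}
# The two students added on the later slide.
added = {"Joe": (630, 0.9), "Patricia": (404, 3.1)}


def means(records):
    """Return (mean SAT-V, mean GPA) for a dict of name -> (sat_v, gpa)."""
    n = len(records)
    mean_sat = sum(sat for sat, _ in records.values()) / n
    mean_gpa = sum(gpa for _, gpa in records.values()) / n
    return mean_sat, mean_gpa


six = means(students)
eight = means({**students, **added})
print(f"6 students: SAT-V mean = {six[0]:.2f}, GPA mean = {six[1]:.3f}")
print(f"8 students: SAT-V mean = {eight[0]:.2f}, GPA mean = {eight[1]:.3f}")
```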
Quantifying Relationships
Pearson developed the technique: Pearson r
• Pearson correlation coefficient
• Pearson product-moment correlation coefficient (PMCC)
• r
Correlation
Correlation: how the score on one variable is related to the score on another variable
More specifically
• How relative performance on one variable is related to relative performance on another variable
• i.e., how each score relates to its mean and variability
Quantify relationship to the mean: Deviation Score
X = independent variable
Y = dependent variable
$X - \bar{X}$ (score on one variable related to its mean; deviation score of X; x)
$Y - \bar{Y}$ (score on another variable related to its mean; deviation score of Y; y)
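As a quick illustration of deviation scores, the sketch below subtracts the mean from each SAT-V score of the six students in the earlier table. The variable names are illustrative. It also shows a property worth remembering: deviation scores always sum to zero (up to floating-point rounding).

```python
# SAT-V scores of the six students from the earlier table.
sat_v = [333, 756, 444, 629, 501, 245]

mean_x = sum(sat_v) / len(sat_v)

# Deviation score x = X - mean(X); positive above the mean, negative below it.
deviations = [x - mean_x for x in sat_v]

for score, dev in zip(sat_v, deviations):
    print(f"X = {score}: x = {dev:+.2f}")

# Deviations around the mean cancel out, apart from rounding error.
print("deviations sum to zero:", abs(sum(deviations)) < 1e-9)
```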
Calculation of r : deviation score method
$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \cdot \sum (Y_i - \bar{Y})^2}}$
Calculation of r : deviation score method
$(X_i - \bar{X})$
Deviation score of X (x)
Note: will be + or - for each case
Calculation of r : deviation score method
$(Y_i - \bar{Y})$
Deviation score of Y (y)
Note: will be + or - for each case
Calculation of r : deviation score method
$(X_i - \bar{X})(Y_i - \bar{Y})$
Product of paired deviation scores
Product of x and y (xy)
Note: product will be + or - for each case
Calculation of r : deviation score method
$\sum (X_i - \bar{X})(Y_i - \bar{Y})$
Sum of products of paired deviation scores
Sum of xy
Proportional to the covariance (the covariance is this sum divided by n - 1)
Note: will be + or - depending on ALL of the individual cases!
Calculation of r : deviation score method
$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \cdot \sum (Y_i - \bar{Y})^2}}$
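The deviation score method can be sketched in a few lines of Python. This is an illustrative implementation, not code from the lecture; the function name `pearson_r` and the variable names are assumptions. It is applied to the SAT-V and GPA table from the earlier slides, first with six students and then with Joe and Patricia added, whose scores fall away from the trend and pull r down noticeably.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r by the deviation score method."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 cases")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviation scores: each score minus its own mean.
    x_dev = [x - mean_x for x in xs]
    y_dev = [y - mean_y for y in ys]
    # Numerator: sum of products of paired deviation scores.
    sum_xy = sum(x * y for x, y in zip(x_dev, y_dev))
    # Denominator: square root of the product of the sums of squares.
    sum_x2 = sum(x * x for x in x_dev)
    sum_y2 = sum(y * y for y in y_dev)
    if sum_x2 == 0 or sum_y2 == 0:
        raise ValueError("r is undefined when a variable has no spread")
    return sum_xy / sqrt(sum_x2 * sum_y2)


# Six students from the first table.
sat_v = [333, 756, 444, 629, 501, 245]
gpa = [1.0, 3.8, 1.9, 3.2, 2.3, 0.4]

r_six = pearson_r(sat_v, gpa)
# Same data with Joe (630, 0.9) and Patricia (404, 3.1) added.
r_eight = pearson_r(sat_v + [630, 404], gpa + [0.9, 3.1])

print(f"r with 6 students: {r_six:.3f}")
print(f"r with 8 students: {r_eight:.3f}")
```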
Calculate r : T1&T2, T1&T3, T1&T4
Test 1 Test 2 Test 3 Test 4
Mike 11 11 5 9
Sue 9 9 7 5
Jan 7 7 9 11
Bob 5 5 11 7
r by deviation score method
Name   T1(X)   T2(Y)    x    y   x^2   y^2   xy
Mike    11      11      3    3    9     9     9
Sue      9       9      1    1    1     1     1
Jan      7       7     -1   -1    1     1     1
Bob      5       5     -3   -3    9     9     9
Mean     8       8
Sum                              20    20    20
$r = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}} = \frac{20}{\sqrt{20 \cdot 20}} = \frac{20}{20} = 1.00$
r (T1 & T2) = 1.00
Perfect positive relationship (see scattergram on next slide)
Name   Test 1 (X)   Test 2 (Y)   Test 3 (Y)   Test 4 (Y)
Mike      11           11            5            9
Sue        9            9            7            5
Jan        7            7            9           11
Bob        5            5           11            7
Graphical presentation of the data: perfect + relationship
[Figure: scattergram of T2 (Y-axis, 4 to 12) against T1 (X-axis, 4 to 12)]
Name   Test 1 (X)   Test 2 (Y)   Test 3 (Y)   Test 4 (Y)
Mike      11           11            5            9
Sue        9            9            7            5
Jan        7            7            9           11
Bob        5            5           11            7
• T1 & T2: r = 1.00 (perfect positive)
• T1 & T3: r = -1.00 (perfect negative)
• T1 & T4: r = 0.00 (no relationship)
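The three results above can be reproduced with a short Python sketch of the deviation score method. The function and variable names are illustrative, not from the lecture; the scores are the four test columns from the table.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r by the deviation score method."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    x_dev = [x - mean_x for x in xs]
    y_dev = [y - mean_y for y in ys]
    sum_xy = sum(x * y for x, y in zip(x_dev, y_dev))
    sum_x2 = sum(x * x for x in x_dev)
    sum_y2 = sum(y * y for y in y_dev)
    return sum_xy / sqrt(sum_x2 * sum_y2)


# Scores for Mike, Sue, Jan, Bob (in that order) from the slide table.
test1 = [11, 9, 7, 5]
other_tests = {
    "T2": [11, 9, 7, 5],
    "T3": [5, 7, 9, 11],
    "T4": [9, 5, 11, 7],
}

# Correlate Test 1 with each of the other tests.
results = {name: pearson_r(test1, scores) for name, scores in other_tests.items()}
for name, r in results.items():
    print(f"T1 & {name}: r = {r:.2f}")
```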
Possible values of r
r ranges from -1.00 to +1.00 and can take any value in between
• the closer the value is to -1.00, the stronger the negative relationship between the two variables
• the closer the value is to +1.00, the stronger the positive relationship between the two variables
Guess the correlation game
Possible values of r
Just what does an r value of +0.25 mean?
Factors limiting a PMCC
1. Homogeneous group
• subjects very similar on the variables
2. Unreliable measurement instrument/technique
• measurements bounce all over the place
3. Nonlinear relationship
• Pearson's r is based on linear relationships
4. Ceiling or floor effect in the measurement
• lots of scores clumped at the top or bottom, so there is little spread, which creates a problem similar to the homogeneous group [skewed data set(s)]
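The effect of a homogeneous group (restricted range) can be illustrated with made-up numbers. In the sketch below, the scores are hypothetical and chosen only for illustration; the same underlying pattern is correlated once over the full range of X and once over a narrow middle slice of it. The function and variable names are assumptions, not from the lecture.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r by the deviation score method."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    x_dev = [x - mean_x for x in xs]
    y_dev = [y - mean_y for y in ys]
    sum_xy = sum(x * y for x, y in zip(x_dev, y_dev))
    sum_x2 = sum(x * x for x in x_dev)
    sum_y2 = sum(y * y for y in y_dev)
    return sum_xy / sqrt(sum_x2 * sum_y2)


# Hypothetical data: Y follows X with a fixed +1 / -1 wobble.
x_full = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_full = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

# Homogeneous subgroup: keep only the cases with X between 4 and 7.
pairs = [(x, y) for x, y in zip(x_full, y_full) if 4 <= x <= 7]
x_restricted = [x for x, _ in pairs]
y_restricted = [y for _, y in pairs]

r_full = pearson_r(x_full, y_full)
r_restricted = pearson_r(x_restricted, y_restricted)

print(f"full range:       r = {r_full:.3f}")
print(f"restricted range: r = {r_restricted:.3f}")
```

The wobble is the same size in both cases, but in the narrow subgroup it is large relative to the spread of X, so r comes out lower.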
Assumptions of the PMCC
1. Measures are approximately normally distributed
• check with frequency distribution
2. The spread of scores on one measure is similar across the range of the other (homoscedasticity)
• check with scatterplot
3. The relationship is linear
• check with scatterplot
4. The sample represents the population
5. Variables are measured on an interval or ratio scale
Not Causation
Only Association
Correlations and causality
Correlations only describe the relationship; they do not prove cause and effect
Correlation is a necessary, but not a sufficient, condition for determining causality
There are Three Requirements to Infer a Causal Relationship…
Correlations and causality
1. A statistically significant relationship between the variables
2. The causal variable occurred prior to the other variable
3. There are no other factors that could account for the relationship
Correlation studies do not meet the last requirement and may not meet the second requirement
Correlations and causality
If there is a relationship between A and B, it could be because
• A -> B
• A <- B
• A <- C -> B
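The third case (A <- C -> B) can be made concrete with hypothetical numbers. In the sketch below, A and B are each built from a common factor C plus their own unrelated offsets; neither is computed from the other, yet they still correlate. All values and names are made up for illustration and are not from the lecture.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r by the deviation score method."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    x_dev = [x - mean_x for x in xs]
    y_dev = [y - mean_y for y in ys]
    sum_xy = sum(x * y for x, y in zip(x_dev, y_dev))
    sum_x2 = sum(x * x for x in x_dev)
    sum_y2 = sum(y * y for y in y_dev)
    return sum_xy / sqrt(sum_x2 * sum_y2)


# Hypothetical common cause C.
c = [1, 2, 3, 4, 5, 6, 7, 8]
# A and B each depend on C plus a separate offset pattern; B is never
# computed from A, and A is never computed from B.
a = [ci + off for ci, off in zip(c, [1, -1, 1, -1, 1, -1, 1, -1])]
b = [ci + off for ci, off in zip(c, [1, 1, -1, -1, 1, 1, -1, -1])]

r_ab = pearson_r(a, b)
print(f"r between A and B: {r_ab:.3f}")
```

A sizeable r appears here only because both variables track C, which is why a correlation alone cannot rule out a third factor.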
Smoking & LBP
[Diagram: Smoking and Low Back Pain, r = 0.45]
Smoking & LBP
[Diagram: Smoking and Low Back Pain, r = 0.45; or does Low Back Pain lead to Smoking?]
Smoking & LBP
[Diagram: Smoking and Low Back Pain, r = 0.45; or do lifestyle factors (e.g., strength) account for both?]
Interpreting r
r is not a proportion.
• r = 0.25 does not mean one quarter similarity between the variables
• r = 0.50 does not mean one half similarity between the variables
r describes the co-variability of the variables
Coefficient of Determination
$r^2$: simply square the r value
What percentage of the variance in each variable is explained by knowledge of the variance of the other variable
• what percentage of the variance within Y is predicted by the variance within X?
Coefficient of Determination
(Shared Variation)
Correlation Coefficient Squared
Percentage of the variability among scores on one variable that can be attributed to differences in the scores on the other variable
The coefficient of determination is useful because it gives the proportion of the variance of one variable that is predictable from the other variable
Notes about $r^2$
Coefficient of determination explains shared variance
• therefore $1 - r^2$ is unexplained
r = 0.70 gives about 50% explained variance (why? because $0.70^2 = 0.49$)
Always calculate $r^2$ to evaluate the extent of the correlation
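Squaring a few r values shows how quickly the shared variance falls off as r gets smaller. The sketch below is illustrative only; the function name `coefficient_of_determination` is an assumption, and the r values are examples (0.25 and 0.70 appear on the slides).

```python
def coefficient_of_determination(r):
    """Return r squared, the proportion of shared variance."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1.00 and +1.00")
    return r ** 2


# Example r values; each line reports shared and unexplained variance.
for r in (0.25, 0.50, 0.70, 0.90):
    r_squared = coefficient_of_determination(r)
    print(
        f"r = {r:.2f}  r^2 = {r_squared:.4f}  "
        f"unexplained = {1 - r_squared:.4f}"
    )
```

Note that the sign of r is lost when squaring, so $r^2$ describes the extent of the relationship but not its direction.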
Use of Correlation
Reliability of a test/measure
• relate test-retest scores
• relate tester 1 to tester 2
Validity of a test
• HR and fitness (aerobic capacity)
Relate multiple dependent variables (do all measure the same construct?)
Cautions concerning r
Appropriate only for linear relationships (use Anxiety&Performance.sav)
Sensitive to range of talent
• smaller range, lower r
Sensitive to sampling variation
• smaller samples, more unstable
The r calculated from a sample is not the population r
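The first caution, that r is appropriate only for linear relationships, can be shown with a made-up inverted-U pattern. The scores below are hypothetical and are not taken from Anxiety&Performance.sav; performance is fully determined by anxiety here, yet Pearson r comes out at zero because the relationship is not linear. The function and variable names are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r by the deviation score method."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    x_dev = [x - mean_x for x in xs]
    y_dev = [y - mean_y for y in ys]
    sum_xy = sum(x * y for x, y in zip(x_dev, y_dev))
    sum_x2 = sum(x * x for x in x_dev)
    sum_y2 = sum(y * y for y in y_dev)
    return sum_xy / sqrt(sum_x2 * sum_y2)


# Hypothetical inverted-U: performance peaks at a moderate anxiety level.
anxiety = [1, 2, 3, 4, 5, 6, 7]
performance = [16 - (a - 4) ** 2 for a in anxiety]

r_curve = pearson_r(anxiety, performance)
print("performance scores:", performance)
print(f"r = {r_curve:.2f}")
```

The rising half of the curve contributes positive products and the falling half contributes equal negative ones, so they cancel in the numerator. A scatterplot would reveal the relationship that r misses.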
Anxiety & Skill Performance
Meyer et al., 2002. MSSE, 34:7, 1065-1070.
Adachi et al, 2002. Mechanoreceptors in the ACL contribute to the joint position sense. Acta Orthop Scand, 73:2:330-334.
Click here for a web site to review correlation concepts introduced in this lecture