Post on 05-Jan-2016
Cal State Northridge, 427, Ainsworth: Correlation and Regression
Chapter 9 Correlation
Major Points - Correlation
  Questions answered by correlation
  Scatterplots
  An example
  The correlation coefficient
  Other kinds of correlations
  Factors affecting correlations
  Testing for significance

The Question
  Are two variables related?
    Does one increase as the other increases? e.g. skills and income
    Does one decrease as the other increases? e.g. health problems and nutrition
  How can we get a numerical measure of the degree of relationship?

Scatterplots
  AKA scatter diagram or scattergram.
  Graphically depicts the relationship between two variables in two-dimensional space.
Direct Relationship
[Figure: Scatterplot: Video Games and Alcohol Consumption. X axis: Average Hours of Video Games Per Week. Y axis: Average Number of Alcoholic Drinks Per Week. 30 plotted cases; the embedded worksheet reports r = 0.738.]
Inverse Relationship
[Figure: Scatterplot: Video Games and Test Score. X axis: Average Hours of Video Games Per Week. Y axis: Exam Score. 30 plotted cases; the embedded worksheet reports r = -0.642.]
An Example
  Does smoking cigarettes increase systolic blood pressure?
  Plotting number of cigarettes smoked per day against systolic blood pressure
    Fairly moderate relationship
    Relationship is positive

Trend?

Smoking and BP
  Note relationship is moderate, but real.
  Why do we care about relationship?
    What would we conclude if there were no relationship?
    What if the relationship were near perfect?
    What if the relationship were negative?
Heart Disease and Cigarettes
  Data on heart disease and cigarette smoking in 21 developed countries (Landwehr and Watkins, 1987)
  Data have been rounded for computational convenience.
    The results were not affected.

The Data
  Surprisingly, the U.S. is the first country on the list: the country with the highest consumption and highest mortality.
Country   Cigarettes (X)   CHD (Y)   X - Xbar   Y - Ybar   (X - Xbar)(Y - Ybar)
1         11               26         5.05       11.48      57.97
2          9               21         3.05        6.48      19.76
3          9               24         3.05        9.48      28.91
4          9               21         3.05        6.48      19.76
5          8               19         2.05        4.48       9.18
6          8               13         2.05       -1.52      -3.12
7          8               19         2.05        4.48       9.18
8          6               11         0.05       -3.52      -0.18
9          6               23         0.05        8.48       0.42
10         5               15        -0.95        0.48      -0.46
11         5               13        -0.95       -1.52       1.44
12         5                4        -0.95      -10.52       9.99
13         5               18        -0.95        3.48      -3.31
14         5               12        -0.95       -2.52       2.39
15         5                3        -0.95      -11.52      10.94
16         4               11        -1.95       -3.52       6.86
17         4               15        -1.95        0.48      -0.94
18         4                6        -1.95       -8.52      16.61
19         3               13        -2.95       -1.52       4.48
20         3                4        -2.95      -10.52      31.03
21         3               14        -2.95       -0.52       1.53
Mean       5.95            14.52
SD         2.33             6.69
Sum                                                         222.44
Scatterplot of Heart Disease
  CHD Mortality goes on ordinate (Y axis)
    Why?
  Cigarette consumption on abscissa (X axis)
    Why?
  What does each dot represent?
  Best fitting line included for clarity
{X = 6, Y = 11}
What Does the Scatterplot Show?
  As smoking increases, so does coronary heart disease mortality.
  Relationship looks strong
  Not all data points on line.
    This gives us "residuals" or "errors of prediction"
    To be discussed later

Correlation
  Co-relation
  The relationship between two variables
  Measured with a correlation coefficient
  Most popularly seen correlation coefficient: Pearson Product-Moment Correlation

Types of Correlation
  Positive correlation
    High values of X tend to be associated with high values of Y.
    As X increases, Y increases
  Negative correlation
    High values of X tend to be associated with low values of Y.
    As X increases, Y decreases
  No correlation
    No consistent tendency for values on Y to increase or decrease as X increases

Correlation Coefficient
  A measure of degree of relationship.
  Between 1 and -1
  Sign refers to direction.
  Based on covariance
    Measure of degree to which large scores on X go with large scores on Y, and small scores on X go with small scores on Y
    Think of it as variance, but with 2 variables instead of 1 (What does that mean??)
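Covariance
  Remember that variance is:
$$ s_X^2 = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{\sum (X - \bar{X})(X - \bar{X})}{N - 1} $$
  The formula for covariance is:
$$ cov_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1} $$
  How this works, and why?
  When would covXY be large and positive? Large and negative?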
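The covariance definition can be checked numerically on the 21-country data. The following is a minimal sketch in plain Python; the names `cigs`, `chd`, `mean`, and `covariance` are illustrative choices, not part of the original slides.

```python
# Cigarettes per adult per day (X) and CHD mortality per 10,000 (Y), 21 countries.
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance: sum of cross-products of deviations, divided by N - 1."""
    x_bar, y_bar = mean(x), mean(y)
    cross_products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
    return sum(cross_products) / (len(x) - 1)

cov_xy = covariance(cigs, chd)
var_x = covariance(cigs, cigs)  # variance is the covariance of X with itself

print(f"cov_XY = {cov_xy:.3f}, var_X = {var_x:.3f}")
```

The covariance comes out close to the slides' 11.12; the small difference arises because the slides sum deviations that were already rounded to two decimals.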
Example
[Table: the 21-country data with columns Country, X (Cig.), Y (CHD), X - Xbar, Y - Ybar, and the cross-product (X - Xbar)(Y - Ybar). Mean X = 5.95, Mean Y = 14.52; SD X = 2.33, SD Y = 6.69; sum of cross-products = 222.44.]
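Example
$$ cov_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1} = \frac{222.44}{21 - 1} = 11.12 $$
  What the heck is a covariance? I thought we were talking about correlation?

Correlation Coefficient
  Pearson's Product Moment Correlation
  Symbolized by r
  Covariance ÷ (product of the 2 SDs)
$$ r = \frac{cov_{XY}}{s_X s_Y} $$
  Correlation is a standardized covariance

Calculation for Example
  CovXY = 11.12
  sX = 2.33
  sY = 6.69
$$ r = \frac{cov_{XY}}{s_X s_Y} = \frac{11.12}{(2.33)(6.69)} = \frac{11.12}{15.59} = .713 $$

Example
  Correlation = .713
  Sign is positive
    Why?
  If sign were negative
    What would it mean?
    Would not alter the degree of relationship.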
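To see correlation as a standardized covariance, the sketch below divides the sample covariance by the product of the two sample standard deviations. It uses the cigarette and CHD data from the slides; the helper names are illustrative assumptions.

```python
import math

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

def covariance(x, y):
    """Sample covariance with N - 1 in the denominator."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)

def std_dev(x):
    """Sample standard deviation: square root of the covariance of x with itself."""
    return math.sqrt(covariance(x, x))

s_x, s_y = std_dev(cigs), std_dev(chd)
r = covariance(cigs, chd) / (s_x * s_y)

print(f"s_X = {s_x:.2f}, s_Y = {s_y:.2f}, r = {r:.3f}")
```

Because r is a ratio, multiplying either variable by a positive constant changes the covariance and the SD by the same factor and leaves r unchanged.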
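Other calculations
  Z-score method
$$ r = \frac{\sum z_X z_Y}{N - 1} $$
  Computational (Raw Score) Method
$$ r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{\left[N \sum X^2 - (\sum X)^2\right]\left[N \sum Y^2 - (\sum Y)^2\right]}} $$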
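As a hedged illustration, the two alternative calculations can be run side by side on the cigarette and CHD data. The sketch assumes the standard textbook forms of each method: the z-score method averages products of z-scores over N - 1, and the raw-score method works only from sums of X, Y, XY, X squared, and Y squared. Function names are illustrative.

```python
import math

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
n = len(cigs)

def z_scores(values):
    """Standardize using the sample SD (N - 1 in the denominator)."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values]

# Z-score method: sum of z-score products divided by N - 1.
r_z = sum(zx * zy for zx, zy in zip(z_scores(cigs), z_scores(chd))) / (n - 1)

# Raw-score method: only running sums are needed, no deviations.
sum_x, sum_y = sum(cigs), sum(chd)
sum_xy = sum(x * y for x, y in zip(cigs, chd))
sum_x2 = sum(x * x for x in cigs)
sum_y2 = sum(y * y for y in chd)
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_raw = numerator / denominator

print(f"z-score method: {r_z:.3f}, raw-score method: {r_raw:.3f}")
```

Both routes are algebraic rearrangements of covariance over the product of the SDs, so they agree up to floating-point error.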
Other Kinds of Correlation
  Spearman Rank-Order Correlation Coefficient (r_sp)
    used with 2 ranked/ordinal variables
    uses the same Pearson formula

Attractiveness   Symmetry
3                2
4                6
1                1
2                3
5                4
6                5
r_sp = 0.77
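The claim that Spearman's coefficient uses the same Pearson formula can be sketched by ranking each variable and then applying Pearson's r to the ranks. The example below uses the six attractiveness and symmetry values from the slide; the simple `ranks` helper is an illustrative assumption and does not handle tied values.

```python
import math

attractiveness = [3, 4, 1, 2, 5, 6]
symmetry = [2, 6, 1, 3, 4, 5]

def pearson(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sp = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    ss_x = sum((a - x_bar) ** 2 for a in x)
    ss_y = sum((b - y_bar) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

def ranks(values):
    """Rank from 1 (smallest) upward; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

r_sp = pearson(ranks(attractiveness), ranks(symmetry))
print(f"r_sp = {r_sp:.2f}")
```

The slide's values are already ranks from 1 to 6, so ranking leaves them unchanged and the result matches the 0.77 shown in the worksheet.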
Other Kinds of Correlation
  Point biserial correlation coefficient (r_pb)
    used with one continuous scale and one nominal or ordinal or dichotomous scale.
    uses the same Pearson formula

Attractiveness   Date?
3                0
4                0
1                1
2                1
5                1
6                0
r_pb = -0.49
Other Kinds of Correlation
  Phi coefficient (φ)
    used with two dichotomous scales.
    uses the same Pearson formula

Attractiveness   Date?
0                0
1                0
1                1
1                1
0                0
1                1
φ = 0.71
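Since the point biserial and phi coefficients both use the same Pearson formula, one helper is enough to reproduce the two worksheet values. This is a minimal sketch using the slide data, with dichotomous variables coded 0/1; the variable names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sp = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    ss_x = sum((a - x_bar) ** 2 for a in x)
    ss_y = sum((b - y_bar) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

# Point biserial: one scaled variable, one dichotomy (date: 1 = yes, 0 = no).
attract_scale = [3, 4, 1, 2, 5, 6]
date_a = [0, 0, 1, 1, 1, 0]
r_pb = pearson(attract_scale, date_a)

# Phi: both variables are dichotomies coded 0/1.
attract_binary = [0, 1, 1, 1, 0, 1]
date_b = [0, 0, 1, 1, 0, 1]
phi = pearson(attract_binary, date_b)

print(f"r_pb = {r_pb:.2f}, phi = {phi:.2f}")
```

The only thing that changes between these coefficients is the type of data fed to the formula, not the arithmetic.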
Factors Affecting r
  Range restrictions
    Looking at only a small portion of the total scatter plot (looking at a smaller portion of the scores' variability) decreases r.
    Reducing variability reduces r
  Nonlinearity
    The Pearson r (and its relatives) measure the degree of linear relationship between two variables
    If a strong non-linear relationship exists, r will provide a low, or at least inaccurate measure of the true relationship.

Factors Affecting r
  Heterogeneous subsamples
    Everyday examples (e.g. height and weight using both men and women)
  Outliers
    Overestimate Correlation
    Underestimate Correlation
Countries With Low Consumptions
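Range restriction can be illustrated with the cigarette and CHD data by recomputing r on low-consumption countries only. The cutoff of 6 cigarettes per adult per day used below is an assumption chosen for illustration; the slide does not state which cutoff its figure used.

```python
import math

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

def pearson(x, y):
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sp = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    ss_x = sum((a - x_bar) ** 2 for a in x)
    ss_y = sum((b - y_bar) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

CUTOFF = 6  # hypothetical threshold for "low consumption"
low = [(x, y) for x, y in zip(cigs, chd) if x <= CUTOFF]
low_x = [x for x, _ in low]
low_y = [y for _, y in low]

r_all = pearson(cigs, chd)
r_low = pearson(low_x, low_y)

print(f"all {len(cigs)} countries: r = {r_all:.2f}")
print(f"{len(low)} low-consumption countries: r = {r_low:.2f}")
```

With the range of X cut down, the same underlying relationship yields a much weaker correlation, which is the point of the range-restriction warning.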
Truncation
Non-linearity
Heterogeneous samples
Outliers
Testing Correlations
  So you have a correlation. Now what?
  In terms of magnitude, how big is big?
    Small correlations in large samples are big.
    Large correlations in small samples aren't always big.
  Depends upon the magnitude of the correlation coefficient
  AND
  The size of your sample.

Testing r
  Population parameter = ρ
  Null hypothesis H0: ρ = 0
    Test of linear independence
    What would a true null mean here?
    What would a false null mean here?
  Alternative hypothesis (H1): ρ ≠ 0
    Two-tailed
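Tables of Significance
  We can convert r to t and test for significance:
$$ t = r \sqrt{\frac{N - 2}{1 - r^2}} $$
  Where DF = N - 2

Tables of Significance
  In our example r was .71
  N - 2 = 21 - 2 = 19
$$ t = r \sqrt{\frac{N - 2}{1 - r^2}} = .71 \sqrt{\frac{19}{1 - .71^2}} = 4.39 $$
  t-crit (19) = 2.09
  Since 4.39 is larger than 2.09, reject ρ = 0.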
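The conversion from r to t can be sketched in a few lines. The example assumes the usual formula t = r * sqrt((N - 2) / (1 - r^2)) with N - 2 degrees of freedom, takes r = .71 and N = 21 from the slides, and uses the tabled critical value 2.09 quoted above rather than computing it.

```python
import math

def t_from_r(r, n):
    """Convert a Pearson r from n pairs into a t statistic with n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = 0.71
n = 21
t_crit = 2.09  # two-tailed critical t for 19 df, as quoted on the slide

t = t_from_r(r, n)
df = n - 2
reject_null = abs(t) > t_crit

print(f"t({df}) = {t:.2f}, reject H0: {reject_null}")
```

With a single predictor the squared t equals the overall F for the regression, so this value should be close to the square root of the F of 19.594 reported in the regression section.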
Computer Printout
  Printout gives test of significance.
Regression
What is regression?
  How do we predict one variable from another?
  How does one variable change as the other changes?
  Influence

Linear Regression
  A technique we use to predict the most likely score on one variable from those on another variable
  Uses the nature of the relationship (i.e. correlation) between two variables to enhance your prediction

Linear Regression: Parts
  Y - the variable you are predicting
    i.e. dependent variable
  X - the variable you are using to predict
    i.e. independent variable
  Ŷ - your predictions (also known as Y')

Why Do We Care?
  We may want to make a prediction.
  More likely, we want to understand the relationship.
    How fast does CHD mortality rise with a one unit increase in smoking?
    Note: we speak about predicting, but often don't actually predict.

An Example
  Cigarettes and CHD Mortality again
  Data repeated on next slide
  We want to predict level of CHD mortality in a country averaging 10 cigarettes per day.

The Data
  Based on the data we have, what would we predict the rate of CHD to be in a country that smoked 10 cigarettes on average?
  First, we need to establish a prediction of CHD from smoking
[Table repeated: cigarette consumption (X) and CHD mortality (Y) for the 21 countries, with deviations and cross-products. Mean X = 5.95, Mean Y = 14.52; SD X = 2.33, SD Y = 6.69; sum of cross-products = 222.44.]
Regression Line
  For a country that smokes 6 cigarettes per adult per day (C/A/D)
  We predict a CHD rate of about 14
Regression Line
  Formula
$$ \hat{Y} = bX + a $$
  Ŷ = the predicted value of Y (e.g. CHD mortality)
  X = the predictor variable (e.g. average cig./adult/country)

Regression Coefficients
  Coefficients are a and b
  b = slope
    Change in predicted Y for one unit change in X
  a = intercept
    value of Ŷ when X = 0

Calculation
  Slope
$$ b = \frac{cov_{XY}}{s_X^2} $$
  Intercept
$$ a = \bar{Y} - b\bar{X} $$

For Our Data
  CovXY = 11.12
  s²X = 2.33² = 5.447 (computed from the unrounded SD, 2.334)
  b = 11.12/5.447 = 2.042
  a = 14.524 - 2.042*5.952 = 2.37
  See SPSS printout on next slide
  Answers are not exact due to rounding error and desire to match SPSS.
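The slope and intercept can be recomputed from the raw data as a check on the hand calculation. This sketch assumes slope = covariance / variance of X and intercept = mean of Y minus slope times mean of X, consistent with the figures above; variable names are illustrative.

```python
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
n = len(cigs)

x_bar = sum(cigs) / n
y_bar = sum(chd) / n

# Sample covariance of X and Y, and sample variance of X (both over N - 1).
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(cigs, chd)) / (n - 1)
var_x = sum((x - x_bar) ** 2 for x in cigs) / (n - 1)

b = cov_xy / var_x        # slope: change in predicted CHD per extra cigarette
a = y_bar - b * x_bar     # intercept: predicted CHD when X = 0

print(f"b = {b:.3f}, a = {a:.3f}")
```

Working from unrounded values gives an intercept near 2.37, in line with the regression equation Y' = (2.04*X) + 2.37 shown with the prediction table.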
SPSS Printout

Note:
  The values we obtained are shown on printout.
  The intercept is the value in the B column labeled "constant"
  The slope is the value in the B column labeled by name of predictor variable.
Making a Prediction
  Second, once we know the relationship we can predict
$$ \hat{Y} = bX + a = 2.04(10) + 2.37 = 22.77 $$
  We predict 22.77 people/10,000 in a country with an average of 10 C/A/D will die of CHD

Accuracy of Prediction
  Finnish smokers smoke 6 C/A/D
  We predict:
$$ \hat{Y} = bX + a = 2.042(6) + 2.367 = 14.619 $$
  They actually have 23 deaths/10,000
  Our error (residual) = 23 - 14.619 = 8.38
    a large error
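A prediction and its residual can be sketched as a small function. The coefficients are re-estimated from the data rather than typed in, so results may differ from the slides in the second decimal because the slides round b and a; `predict_chd` is an illustrative name.

```python
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
n = len(cigs)
x_bar, y_bar = sum(cigs) / n, sum(chd) / n

# Least-squares slope and intercept (the N - 1 terms cancel in the ratio).
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(cigs, chd))
     / sum((x - x_bar) ** 2 for x in cigs))
a = y_bar - b * x_bar

def predict_chd(cigarettes_per_day):
    """Predicted CHD mortality per 10,000 for a given consumption level."""
    return b * cigarettes_per_day + a

pred_10 = predict_chd(10)          # country averaging 10 C/A/D
pred_6 = predict_chd(6)            # Finland's consumption level
residual_finland = 23 - pred_6     # observed minus predicted

print(f"predicted at 10 C/A/D: {pred_10:.2f}")
print(f"predicted at 6 C/A/D: {pred_6:.2f}, residual: {residual_finland:.2f}")
```

A positive residual means the country's observed mortality lies above the regression line.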
[Figure: scatterplot with regression line. X axis: Cigarette Consumption per Adult per Day. Y axis: CHD Mortality per 10,000. Annotations mark the Prediction and the Residual.]
Residuals
  When we predict Ŷ for a given X, we will sometimes be in error.
  Y - Ŷ for any X is an error of estimate
  Also known as: a residual
  We want Σ(Y - Ŷ) as small as possible.
  BUT, there are infinitely many lines that can do this.
  Just draw ANY line that goes through the mean of the X and Y values.
  Minimize Errors of Estimate. How?

Minimizing Residuals
  Again, the problem lies with this definition of the mean:
$$ \sum (X - \bar{X}) = 0 $$
  So, how do we get rid of the 0's?
  Square them.

Regression Line: A Mathematical Definition
  The regression line is the line which when drawn through your data set produces the smallest value of:
$$ \sum (Y - \hat{Y})^2 $$
  Called the Sum of Squared Residual or SSresidual
  Regression line is also called a least squares line.
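The argument for squaring can be demonstrated numerically: every line through the point of means has residuals that sum to zero, yet only the least-squares slope minimizes the sum of squared residuals. The two comparison slopes (1.0 and 3.0) are arbitrary values chosen for illustration.

```python
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
n = len(cigs)
x_bar, y_bar = sum(cigs) / n, sum(chd) / n

def residuals_for_slope(slope):
    """Residuals for the line with this slope forced through (x_bar, y_bar)."""
    intercept = y_bar - slope * x_bar
    return [y - (slope * x + intercept) for x, y in zip(cigs, chd)]

b_ls = (sum((x - x_bar) * (y - y_bar) for x, y in zip(cigs, chd))
        / sum((x - x_bar) ** 2 for x in cigs))

results = {}
for label, slope in [("least squares", b_ls), ("too flat", 1.0), ("too steep", 3.0)]:
    res = residuals_for_slope(slope)
    results[label] = (sum(res), sum(e * e for e in res))
    print(f"{label:14s} sum of residuals = {results[label][0]:7.3f}, "
          f"sum of squares = {results[label][1]:8.2f}")
```

The plain sum cannot tell the three lines apart, while the sum of squares singles out the least-squares line.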
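Summarizing Errors of Prediction
  Residual variance
    The variability of the observed values around the predicted values
$$ s_{Y - \hat{Y}}^2 = \frac{\sum (Y - \hat{Y})^2}{N - 2} = \frac{SS_{residual}}{N - 2} $$

Standard Error of Estimate
  Standard error of estimate
    The standard deviation of the observed values around the predicted values
$$ s_{Y - \hat{Y}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}} = \sqrt{\frac{SS_{residual}}{N - 2}} $$
  A common measure of the accuracy of our predictions
  We want it to be as small as possible.

Example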
Country   X (Cig.)   Y (CHD)   Y'       (Y - Y')   (Y - Y')²   (Y' - Ybar)²   (Y - Ybar)²
1         11         26        24.829    1.171       1.371     106.193        131.699
2          9         21        20.745    0.255       0.065      38.701         41.939
3          9         24        20.745    3.255      10.595      38.701         89.795
4          9         21        20.745    0.255       0.065      38.701         41.939
5          8         19        18.703    0.297       0.088      17.464         20.035
6          8         13        18.703   -5.703      32.524      17.464          2.323
7          8         19        18.703    0.297       0.088      17.464         20.035
8          6         11        14.619   -3.619      13.097       0.009         12.419
9          6         23        14.619    8.381      70.241       0.009         71.843
10         5         15        12.577    2.423       5.871       3.791          0.227
11         5         13        12.577    0.423       0.179       3.791          2.323
12         5          4        12.577   -8.577      73.565       3.791        110.755
13         5         18        12.577    5.423      29.409       3.791         12.083
14         5         12        12.577   -0.577       0.333       3.791          6.371
15         5          3        12.577   -9.577      91.719       3.791        132.803
16         4         11        10.535    0.465       0.216      15.912         12.419
17         4         15        10.535    4.465      19.936      15.912          0.227
18         4          6        10.535   -4.535      20.566      15.912         72.659
19         3         13         8.493    4.507      20.313      36.373          2.323
20         3          4         8.493   -4.493      20.187      36.373        110.755
21         3         14         8.493    5.507      30.327      36.373          0.275
Mean       5.952     14.524
SD         2.334      6.690
Sum                                      0.04      440.757     454.307        895.247
Check: 440.757 + 454.307 = 895.06
Y' = (2.04*X) + 2.37
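The residual variance and standard error of estimate for this example can be sketched directly from the data, assuming the usual definitions: sum of squared residuals divided by N - 2, and its square root. Names are illustrative, and the coefficients are re-estimated rather than taken from the rounded table.

```python
import math

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
n = len(cigs)
x_bar, y_bar = sum(cigs) / n, sum(chd) / n

b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(cigs, chd))
     / sum((x - x_bar) ** 2 for x in cigs))
a = y_bar - b * x_bar

predicted = [b * x + a for x in cigs]
ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(chd, predicted))

residual_variance = ss_residual / (n - 2)      # two parameters (a, b) estimated
std_error_estimate = math.sqrt(residual_variance)

print(f"SS_residual = {ss_residual:.3f}")
print(f"residual variance = {residual_variance:.3f}")
print(f"standard error of estimate = {std_error_estimate:.3f}")
```

The standard error of estimate is in the units of Y, so here it reads as a typical prediction error in deaths per 10,000.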
Regression and Z Scores
  When your data are standardized (linearly transformed to z-scores), the slope of the regression line is called β
  DO NOT confuse this β with the β associated with type II errors. They're different.
  When we have one predictor, r = β
  $\hat{Z}_Y = \beta Z_X$, since a now equals 0
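Partitioning Variability
  Sums of squared deviations
  Total
$$ SS_{total} = \sum (Y - \bar{Y})^2 $$
  Regression
$$ SS_{regression} = \sum (\hat{Y} - \bar{Y})^2 $$
  Residual (we already covered)
$$ SS_{residual} = \sum (Y - \hat{Y})^2 $$
  SStotal = SSregression + SSresidual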
Partitioning Variability
  Degrees of freedom
    Total: dftotal = N - 1
    Regression: dfregression = number of predictors
    Residual: dfresidual = dftotal - dfregression
    dftotal = dfregression + dfresidual

Partitioning Variability
  Variance (or Mean Square)
    Total Variance: s²total = SStotal / dftotal
    Regression Variance: s²regression = SSregression / dfregression
    Residual Variance: s²residual = SSresidual / dfresidual
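The partition of sums of squares, degrees of freedom, and mean squares can be traced on the cigarette and CHD data. This is a minimal sketch with one predictor; the dictionary layout is an illustrative choice.

```python
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
n = len(cigs)
x_bar, y_bar = sum(cigs) / n, sum(chd) / n

b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(cigs, chd))
     / sum((x - x_bar) ** 2 for x in cigs))
a = y_bar - b * x_bar
predicted = [b * x + a for x in cigs]

ss = {
    "total": sum((y - y_bar) ** 2 for y in chd),
    "regression": sum((y_hat - y_bar) ** 2 for y_hat in predicted),
    "residual": sum((y - y_hat) ** 2 for y, y_hat in zip(chd, predicted)),
}
df = {"total": n - 1, "regression": 1}          # one predictor
df["residual"] = df["total"] - df["regression"]

mean_square = {source: ss[source] / df[source] for source in ss}

for source in ("regression", "residual", "total"):
    print(f"{source:10s} SS = {ss[source]:8.3f}  df = {df[source]:2d}  "
          f"MS = {mean_square[source]:8.3f}")
```

The sums of squares and the degrees of freedom each add up across the regression and residual rows, but the mean squares do not.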
Example
[Table repeated: predicted values and squared deviations for the 21 countries. Sums: Σ(Y - Y')² = 440.757, Σ(Y' - Ybar)² = 454.307, Σ(Y - Ybar)² = 895.247. Regression equation: Y' = (2.04*X) + 2.37.]
Example
Coefficient of Determination
  It is a measure of the percent of predictable variability
$$ r^2 = \frac{SS_{regression}}{SS_{total}} $$
  The percentage of the total variability in Y explained by X

r² for our example
  r = .713
  r² = .713² = .508
  or
$$ r^2 = \frac{SS_{regression}}{SS_{total}} = \frac{454.307}{895.247} = .507 $$
  Approximately 50% of the variability in incidence of CHD mortality is associated with variability in smoking.
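Coefficient of Alienation
  It is defined as 1 - r² or
$$ 1 - r^2 = \frac{SS_{residual}}{SS_{total}} $$
  Example
  1 - .508 = .492

r², SS and s(Y - Ŷ)
  r² * SStotal = SSregression
  (1 - r²) * SStotal = SSresidual
  We can also use r² to calculate the standard error of estimate as:
$$ s_{Y - \hat{Y}} = s_Y \sqrt{(1 - r^2)\frac{N - 1}{N - 2}} $$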
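The links between r², the sums of squares, and the standard error of estimate can be verified numerically. The sketch assumes the shortcut s_Y * sqrt((1 - r²)(N - 1)/(N - 2)) for the standard error of estimate and compares it with the direct calculation from the residuals.

```python
import math

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
n = len(cigs)
x_bar, y_bar = sum(cigs) / n, sum(chd) / n

sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(cigs, chd))
ss_x = sum((x - x_bar) ** 2 for x in cigs)
ss_total = sum((y - y_bar) ** 2 for y in chd)

r_squared = sp_xy ** 2 / (ss_x * ss_total)   # coefficient of determination
alienation = 1 - r_squared                   # coefficient of alienation

ss_regression = r_squared * ss_total
ss_residual = alienation * ss_total

# Standard error of estimate two ways: directly and via r squared.
see_direct = math.sqrt(ss_residual / (n - 2))
s_y = math.sqrt(ss_total / (n - 1))
see_from_r2 = s_y * math.sqrt(alienation * (n - 1) / (n - 2))

print(f"r^2 = {r_squared:.3f}, 1 - r^2 = {alienation:.3f}")
print(f"SS_regression = {ss_regression:.2f}, SS_residual = {ss_residual:.2f}")
print(f"standard error of estimate = {see_direct:.3f} (direct), {see_from_r2:.3f} (via r^2)")
```

The two standard-error calculations agree because s_Y squared times (N - 1) is SStotal, so the shortcut is the direct formula rewritten.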
Testing Overall Model
  We can test for the overall prediction of the model by forming the ratio:
$$ F = \frac{s_{regression}^2}{s_{residual}^2} $$
  If the calculated F value is larger than a tabled value (F-Table) we have a significant prediction

Testing Overall Model
  Example
$$ F = \frac{s_{regression}^2}{s_{residual}^2} = 19.594 $$
  F-Table: F critical is found using 2 things: dfregression (numerator) and dfresidual (denominator)
  F-Table: our Fcrit (1,19) = 4.38
  19.594 > 4.38, significant overall
  Should all sound familiar
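The overall F ratio can be recomputed from the data as the regression mean square over the residual mean square. The critical value 4.38 is taken from the slide rather than computed; small differences from 19.594 in the third decimal are expected from rounding.

```python
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
n = len(cigs)
x_bar, y_bar = sum(cigs) / n, sum(chd) / n

b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(cigs, chd))
     / sum((x - x_bar) ** 2 for x in cigs))
a = y_bar - b * x_bar
predicted = [b * x + a for x in cigs]

ss_regression = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(chd, predicted))

df_regression = 1                  # number of predictors
df_residual = n - 1 - df_regression

f_ratio = (ss_regression / df_regression) / (ss_residual / df_residual)
f_crit = 4.38                      # tabled F(1, 19) at alpha = .05, from the slide

print(f"F({df_regression}, {df_residual}) = {f_ratio:.2f}, "
      f"significant: {f_ratio > f_crit}")
```

With one predictor this F is the square of the t for the slope, which is why the test should all sound familiar.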
SPSS output

Testing Slope and Intercept
  The regression coefficients can be tested for significance
  Each coefficient divided by its standard error equals a t value that can also be looked up in a t-table
  Each coefficient is tested against 0

Testing the Slope
  With only 1 predictor, the standard error for the slope is:
$$ se_b = \frac{s_{Y - \hat{Y}}}{s_X \sqrt{N - 1}} $$
  For our Example:
$$ se_b = \frac{4.816}{2.334\sqrt{21 - 1}} = .461, \qquad t = \frac{b}{se_b} = \frac{2.042}{.461} = 4.43 $$
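The slope test can be sketched from the raw data, assuming the one-predictor formula in which the standard error of the slope is the standard error of estimate divided by s_X times the square root of N - 1. The names `se_b` and `t_b` are illustrative.

```python
import math

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
n = len(cigs)
x_bar, y_bar = sum(cigs) / n, sum(chd) / n

ss_x = sum((x - x_bar) ** 2 for x in cigs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(cigs, chd)) / ss_x
a = y_bar - b * x_bar

ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(cigs, chd))
std_error_estimate = math.sqrt(ss_residual / (n - 2))
s_x = math.sqrt(ss_x / (n - 1))

se_b = std_error_estimate / (s_x * math.sqrt(n - 1))  # standard error of the slope
t_b = b / se_b                                        # tested against 0, df = n - 2

print(f"se_b = {se_b:.3f}, t({n - 2}) = {t_b:.2f}")
```

The resulting t can be compared with the tabled critical value for 19 degrees of freedom, 2.09 in the earlier slide on testing r.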
Testing Slope and Intercept
  These are given in computer printout as a t test.

Testing
  The t values in the second from right column are tests on slope and intercept.
  The associated p values are next to them.
  The slope is significantly different from zero, but not the intercept.
  Why do we care?

Testing
  What does it mean if slope is not significant?
    How does that relate to test on r?
  What if the intercept is not significant?
  Does significant slope mean we predict quite well?
Landwehr, J.M. & Watkins, A.E. (1987). Exploring Data: Teacher's Edition. Palo Alto, CA: Dale Seymour Publications.