Chapter 5: Correlation and Regression
1. Correlation
2. Partial Correlation
3. Linear Regression
4. Residual Plots
5. Quadratic Regression
6. Transformations
This material covers sections 5ABCDEFHIJK. Omit 3N, 5GL.
Correlation
• The CORRelation procedure gives sample correlation
coefficients between pairs of variables.
• These include:
1. Pearson's product-moment r
2. Spearman's rank ρ
3. Kendall's τ
Correlation Properties:
1. Correlation coefficients lie in [−1,1].
2. Pearson’s r measures the linear relation between two
continuous variables.
3. Positive values indicate that one variable increases lin-
early with the other variable.
4. Negative values indicate that the variables are inversely
related.
5. A correlation of 0 means that the variables are not lin-
early related.
6. Spearman’s rank and Kendall’s tau apply to ordinal
data.
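To see the arithmetic behind the three coefficients, here is a minimal sketch in Python (illustration only; the course software is SAS), computed on the small X/Y sample used in the simple example later in this chapter:

```python
# Illustration (Python, not SAS) of Pearson's r, Spearman's rho, Kendall's tau.
import math

x = [22, 23, 26, 24, 31, 27, 25]
y = [12, 11, 15, 14, 20, 18, 16]

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return num / math.sqrt(sum((u - ma) ** 2 for u in a) *
                           sum((v - mb) ** 2 for v in b))

def ranks(a):
    # average ranks (1-based), with ties sharing their mean rank
    order = sorted(range(len(a)), key=lambda i: a[i])
    r = [0.0] * len(a)
    i = 0
    while i < len(a):
        j = i
        while j + 1 < len(a) and a[order[j + 1]] == a[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    # Spearman's rho is Pearson's r applied to the ranks
    return pearson(ranks(a), ranks(b))

def kendall(a, b):
    # tau-a: (concordant - discordant) / number of pairs (no ties here)
    n = len(a)
    conc = sum((a[i] - a[j]) * (b[i] - b[j]) > 0
               for i in range(n) for j in range(i))
    disc = sum((a[i] - a[j]) * (b[i] - b[j]) < 0
               for i in range(n) for j in range(i))
    return (conc - disc) / (n * (n - 1) / 2)
```

All three coefficients lie in [−1, 1]; on this sample they are close to each other because the relationship is nearly monotone and linear.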
Correlation Syntax:
PROC CORR DATA=MYDATA;
Gives the Pearson r between all pairs of numeric variables
in MYDATA.
Simple Example:
Compute the correlation between the variables X and Y ,
given in the following table:
X Y
22 12
23 11
26 15
24 14
31 20
27 18
25 16
Simple Example: Cont’d
DATA CORR_EG;
INPUT X Y;
DATALINES;
22 12
23 11
26 15
24 14
31 20
27 18
25 16
;
PROC CORR DATA=CORR_EG;
PROC PLOT; /* Not necessary, but recommended
as an aid in interpreting the correlation. */
PLOT Y*X;
RUN; QUIT;
Default Output:
1. Simple statistics for each numeric variable in MYDATA.
2. Correlations between all pairs of variables in MYDATA.
3. P-values for testing significance.
Significance Tests:
1. Null hypothesis: the true population correlation is 0.
2. Test statistic: the sample correlation, r.
3. PROB = p-value, the probability of observing a sample
correlation at least as large in magnitude as r, under the
null hypothesis.
4. A small p-value is evidence that the population correlation
is nonzero.
Assumptions:
• the variables are normally distributed, and
• the observations are independent of each other.
If data are not normal, use nonparametric tests based on
Spearman’s ρ or Kendall’s τ .
Computing Spearman’s ρ:
PROC CORR DATA=MYDATA SPEARMAN;
Alternative Interpretation of Pearson’s r:
r² = the proportion of variance in one of the variables
that can be explained by variation in the other variable.
1 − r² = the proportion of variance left unexplained.
e.g.
• Heights are measured for 20 father and (adult) son
pairs.
• The correlation is estimated to be r = .6.
• r² = .36, so 36% of the variation in the height of sons
is attributable to variation in the height of the fathers.
• 64% of the variance in sons’ heights is left unexplained.
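The claim that r² is the proportion of variance explained can be checked numerically. The sketch below (Python, for illustration only; the data are the X/Y pairs from the earlier simple example) fits the least-squares line and compares SS_model/SS_total with r²:

```python
# Numerical check: r^2 equals the fraction of the response variance
# captured by the fitted least-squares line.
x = [22, 23, 26, 24, 31, 27, 25]
y = [12, 11, 15, 14, 20, 18, 16]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
sxx = sum((u - mx) ** 2 for u in x)
syy = sum((v - my) ** 2 for v in y)

b = sxy / sxx                  # least-squares slope
a = my - b * mx                # least-squares intercept
ss_model = sum((a + b * u - my) ** 2 for u in x)   # "explained" sum of squares
r2 = sxy ** 2 / (sxx * syy)    # square of Pearson's r
```

Here `ss_model / syy` and `r2` agree exactly (up to floating-point rounding), which is the identity the slide states in words.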
Correlation Cont’d
• Causality: It must be emphasized that a linear rela-
tionship between variables does not imply a cause-and-
effect relationship between the variables.
• Correlation Matrices: A matrix of correlations between
all pairs of numeric variables in the SAS data set can
be computed, using the VAR statement.
• Example: Information on waste output in a region of
California is stored in the file waste.dat. The data set
contains 40 observations on 10 variables:
Correlation Cont’d
• ZONE - the area in which the data was collected.
• WASTE - the amount of waste output in the area.
• Predictor variables: each gives the percentage of the
zone devoted to
– IND - industry.
– MET - fabricated metals.
– WHS - wholesale and trucking.
– RET - retail trade.
– RES - restaurants and hotels.
– FIN - finance and insurance.
– MSC - miscellaneous activities.
– HOM - residential dwellings.
Find the pairwise correlation matrix for the variables IND,
MET, WHS.
Correlation matrix
DATA WASTE;
INFILE ’waste.dat’;
INPUT ZONE WASTE IND MET WHS RET RES
FIN MSC HOM;
PROC CORR DATA=WASTE NOSIMPLE NOPROB;
/* NOSIMPLE suppresses the printing
of summary statistics */
/* NOPROB suppresses the significance
tests */
VAR IND MET WHS;
RUN; QUIT;
The output window then contains:

Pearson Correlation Coefficients

         IND      MET      WHS
IND   1.00000  0.39315  0.41971
MET   0.39315  1.00000  0.88869
WHS   0.41971  0.88869  1.00000
Correlation: WITH
Using the WITH statement:
PROC CORR DATA=MYDATA;
VAR X Y;
WITH Z1 Z2 Z3;
computes correlations between the following pairs of vari-
ables:
• X, Z1
• X, Z2
• X, Z3
• Y, Z1
• Y, Z2
• Y, Z3
WITH example
DATA OZONE;
INFILE ’ozone.dat’;
OPTIONS PAGESIZE = 40;
/* Daily Zonal means: OZONE, in units DOBSON;
SOURCE: NASA*/
/* 100 DOBSON UNITS = 1MM THICKNESS (IF OZONE
LAYER WERE BROUGHT TO EARTH’S SURFACE) */
/* Each observation contains ozone thickness
measurements averaged over 288 longitudes at
latitudes separated by 5 degrees;
e.g. M875 = average ozone thickness at latitude 87.5 */
/* SH = average over the southern hemisphere;
NH = average over the northern hemisphere */
/* 0 = MISSING VALUE */
WITH example Cont’d
INPUT YRFRAC M875 M825 M775 M725 M675 M625 M575 M525 M475 M425
M375 M325 M275 M225 M175 M125 M75 M25 P25 P75 P125 P175
P225 P275 P325 P375 P425 P475 P525 P575 P625 P675 P725
P775 P825 P875 SH NH;
/* IT IS VERY IMPORTANT TO SET
THE MISSING VALUES TO .;
OTHERWISE, THE 0’s WILL BE ENTERED
INTO THE CORRELATION COMPUTATION
AND GIVE MISLEADING RESULTS.*/
IF M875 = 0 THEN M875 = .;
IF M825 = 0 THEN M825 = .;
IF M775 = 0 THEN M775 = .;
IF M725 = 0 THEN M725 = .;
IF M675 = 0 THEN M675 = .;
IF M625 = 0 THEN M625 = .;
IF M575 = 0 THEN M575 = .;
IF M525 = 0 THEN M525 = .;
WITH example Cont’d
IF M475 = 0 THEN M475 = .;
IF M425 = 0 THEN M425 = .;
IF M375 = 0 THEN M375 = .;
IF M325 = 0 THEN M325 = .;
IF M275 = 0 THEN M275 = .;
IF M225 = 0 THEN M225 = .;
IF M175 = 0 THEN M175 = .;
IF M125 = 0 THEN M125 = .;
IF M75 = 0 THEN M75 = .;
IF M25 = 0 THEN M25 = .;
IF P875 = 0 THEN P875 = .;
IF P825 = 0 THEN P825 = .;
IF P775 = 0 THEN P775 = .;
IF P725 = 0 THEN P725 = .;
IF P675 = 0 THEN P675 = .;
IF P625 = 0 THEN P625 = .;
IF P575 = 0 THEN P575 = .;
WITH example Cont’d
IF P525 = 0 THEN P525 = .;
IF P475 = 0 THEN P475 = .;
IF P425 = 0 THEN P425 = .;
IF P375 = 0 THEN P375 = .;
IF P325 = 0 THEN P325 = .;
IF P275 = 0 THEN P275 = .;
IF P225 = 0 THEN P225 = .;
IF P175 = 0 THEN P175 = .;
IF P125 = 0 THEN P125 = .;
IF P75 = 0 THEN P75 = .;
IF P25 = 0 THEN P25 = .;
SASDATE = FLOOR(365.25*(YRFRAC-1960));
MONTH = MONTH(SASDATE);
SEASONAL = ABS(7-MONTH);
PROC CORR NOSIMPLE NOPROB;
VAR M875 M475 M75; WITH P875 P475 P75;
/* This program gives correlations between various southern
and northern ozone layer averages */
RUN; QUIT;
Partial Correlations:
Often, two variables appear to be highly correlated but are
both highly correlated with another variable. Computation
of partial correlations between variables removes the effects
of other variables.
Syntax:
PROC CORR DATA=MYDATA;
VAR X;
WITH Y;
PARTIAL V;
This computes partial correlations between X and Y, after
eliminating effects due to correlation with V.
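For two variables and a single control variable, the first-order partial correlation has a closed form: r_xy·v = (r_xy − r_xv r_yv) / √((1 − r_xv²)(1 − r_yv²)). A Python sketch (illustration only; the data here are hypothetical, not from the ozone file):

```python
# Illustration (Python, not SAS) of the first-order partial correlation.
# The data are hypothetical: x and y both trend strongly with v.
import math

v = list(range(1, 11))
x = [i + e for i, e in zip(v, [0.5, -1.2, 0.3, 0.8, -0.4,
                               1.1, -0.9, 0.2, -0.6, 0.7])]
y = [2 * i + f for i, f in zip(v, [-0.3, 0.9, -1.1, 0.4, 0.6,
                                   -0.8, 1.2, -0.2, 0.5, -0.7])]

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((u - w) * (s - t) for u, s, w, t in
              ((ai, bi, ma, mb) for ai, bi in zip(a, b)))
    return num / math.sqrt(sum((u - ma) ** 2 for u in a) *
                           sum((s - mb) ** 2 for s in b))

def partial_corr(x, y, v):
    # r_{xy.v}: correlation of x and y after removing the linear effect of v
    rxy, rxv, ryv = pearson(x, y), pearson(x, v), pearson(y, v)
    return (rxy - rxv * ryv) / math.sqrt((1 - rxv ** 2) * (1 - ryv ** 2))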
Partial Correlations: Example
Ozone measurements are correlated with temperature. The
resulting seasonal effect can be removed as follows:
PROC CORR DATA = OZONE NOSIMPLE NOPROB;
VAR M875 M475 M75;
WITH P475 P75;
PARTIAL SEASONAL;
We see that the magnitude of the correlations has been
reduced after taking into account the seasonal effects.
Exercise:
Refer to the Winnipeg Climate data set.
1. For each month, compute correlations between maxi-
mum temperature and each of minimum temperature,
minimum and maximum pressure, and minimum and
maximum wind speed.
2. Compute the partial correlations, taking into account
minimum temperature.
Simple Linear Regression
• The equation for a straight line relating the (nonran-
dom) variables y and x is
y = α + βx
β is the slope.
α is the intercept.
The dependent variable y is the response variable.
The independent variable x is the predictor or explana-
tory variable.
‘Simulation’ Example:
Suppose the intercept is 2.0 and the slope is -1.5. Compute
y values corresponding to x = 1,2,4,5,8,10,13, and obtain
a scatterplot of the paired observations.
DATA NONRAN;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
Y = ALPHA + BETA*X;
‘Simulation’ Example: Cont’d
DATALINES;
1
2
4
5
8
10
13
;
PROC PLOT;
PLOT Y*X;
RUN;
QUIT;
If BETA is positive, say BETA = 2.5, then the graph increases.
Simple linear model Adding Noise:
• Even if variables are related to each other by a straight
line, experimental observations usually contain some
kind of (unobservable) noise or random errors which
cause small distortions.
• The simplest way of modelling this kind of noise is to
assume it is normally distributed and to add it to y, i.e.
Y = y(x) + ε
or
Y = α + βx + ε
• For each observation, ε is a normal random variable
having mean 0 and variance σ2.
Simple linear model: Cont’d
• If there are n observations, there must be n random
errors. We assume that they are independent of each
other.
• This model is called the simple linear model.
• Note that since E[ε] = 0, we have
E[Y] = α + βx
i.e. the mean of the Y variable is a linear function of x.
Simulating the simple linear model:
Add noise to the above data. Assume σ = 0.2.
DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA=0.2;
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5
Simulating the simple linear model: Cont'd
8
10
13
;
PROC PLOT;
PLOT Y*X;
RUN;
QUIT;
Repeating this simulation experiment with σ = 1.0, σ = 1.5,
and σ = 2.0, we see that as σ increases, the graph appears
less 'linear'.
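The same experiment is easy to replicate outside SAS. A Python sketch (illustrative only; the sample size and seed are arbitrary choices) showing that the sample correlation between X and Y weakens as σ grows:

```python
# Python analogue of the SAS simulation: Y = 2 - 1.5*X + sigma*N(0,1).
# As sigma increases, the sample correlation moves away from -1.
import math
import random

random.seed(1)
xs = [random.uniform(1, 13) for _ in range(300)]

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return num / math.sqrt(sum((u - ma) ** 2 for u in a) *
                           sum((v - mb) ** 2 for v in b))

corr = {}
for sigma in (0.2, 1.0, 1.5, 2.0):
    ys = [2.0 - 1.5 * x + sigma * random.gauss(0, 1) for x in xs]
    corr[sigma] = pearson(xs, ys)
```

Every correlation is negative (the slope is negative), and its magnitude shrinks toward 0 as the noise level σ increases.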
Estimation:
Unlike in our simulation study, α and β are unknown and
must be estimated. Least-squares estimates are computed
by the REGression procedure, PROC REG.
Syntax:
PROC REG DATA = MYDATA;
MODEL Y = X;
Estimation: Cont’d
Estimating the parameters for the simulated data.
DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA=0.2;
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5
Estimation: Cont’d
8
10
13
;
PROC REG;
MODEL Y = X;
RUN; QUIT;
Note that the intercept estimate is near 2.0 and the slope
estimate is near -1.5.
Example:
Ten fields were planted in wheat, and i kg/acre of nitrate
was applied to the ith field, for i = 1, 2, . . . , 10. We want
to model the relationship between mean wheat yield Y and
amount of nitrate X. That is, we assume
Y = α + βX + ε
where ε is the unobserved error random variable.
Example: Cont’d
To estimate α and β we use
DATA WHTYIELD;
INPUT NITRATE YIELD;
DATALINES;
1 15
2 13
3 16
4 12
5 14
6 18
7 17
8 19
9 16
10 20
;
Example: Cont’d
PROC PLOT;
PLOT YIELD*NITRATE;
/* One should first plot YIELD
against NITRATE to see whether
a linear model is appropriate. */
PROC REG DATA=WHTYIELD;
MODEL YIELD = NITRATE;
RUN;
Output
The output window lists
• an ANOVA table
• estimates of some statistics
• a table of parameter estimates and their standard er-
rors.
The estimates of α and β are
• a = 12.7 (standard error = 1.31)
• b = 0.606 (standard error = .212)
and the fitted model is
y = 12.7 + .606x.
The estimated standard deviation of the errors ε is given
by ROOT-MSE and is 1.93.
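These numbers can be reproduced by hand from the least-squares formulas b = Sxy/Sxx and a = ȳ − b·x̄. A Python sketch (illustration only; PROC REG is the course tool):

```python
# Reproducing PROC REG's estimates for the wheat data by hand
# (ordinary least squares for the simple linear model).
import math

nitrate = list(range(1, 11))
yld = [15, 13, 16, 12, 14, 18, 17, 19, 16, 20]
n = len(nitrate)
mx, my = sum(nitrate) / n, sum(yld) / n

sxx = sum((x - mx) ** 2 for x in nitrate)
sxy = sum((x - mx) * (y - my) for x, y in zip(nitrate, yld))

b = sxy / sxx              # slope estimate
a = my - b * mx            # intercept estimate
resid = [y - a - b * x for x, y in zip(nitrate, yld)]
mse = sum(e ** 2 for e in resid) / (n - 2)
root_mse = math.sqrt(mse)          # estimates sigma (ROOT-MSE)
se_b = math.sqrt(mse / sxx)        # standard error of the slope
t = b / se_b                       # test statistic for H0: beta = 0
```

Rounded as in the SAS output, this gives a = 12.7, b = 0.606, ROOT-MSE = 1.93, s(b) = .212, and T = 2.86, matching the slide.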
Tests
We can also test whether the slope of the regression line
is 0:
H0 : β = 0
versus
H1 : β ≠ 0.
Test statistic:
T = b/s(b) = 2.86.
p-value: 0.0212
Conclusion: Reject the null hypothesis at α = .05. The
true slope is not 0.
Tests Cont’d
The intercept α can also be tested. This time, we have a
p-value of .0001 which is very strong evidence against the
hypothesis that α = 0.
These tests are based on the assumption that the obser-
vations are independent (i.e. the errors ε are independent
of each other). They are exact if the errors are normally
distributed, and approximately correct if the true error dis-
tribution has a finite variance.
The ANOVA table:
• This table demonstrates how the variance in the Y or
response variable decomposes into variance explained
by the regression with X (SSModel) plus variance left un-
explained (SSError).
• SSModel + SSError = C Total
• MSModel = SSModel/DFModel (DFModel = number of pa-
rameters estimated - 1)
The ANOVA table: Cont’d
• MSError = SSError/DFError (DFError = n − DFModel − 1)
Note that MSE is the square of the ROOT-MSE
• The test statistic F=MSModel/MSError is the square of the
T statistic used to test whether the slope is 0.
• If F is large, we have reason to reject the null hypothesis
in favor of the alternative that β ≠ 0. Note that the p-
value for this test is identical to the earlier p-value.
Other Statistics
• R-square = the coefficient of determination R² = .505.
It is the square of the correlation between YIELD and
NITRATE.
• Dep Mean = average of the dependent variable values =
16.0.
Predicted Values:
• The predicted values can be calculated from the fitted
regression equation for the given values of the explana-
tory variable X.
• For the wheat yield example, we found that
y = 12.7 + .606x
Therefore, if x = 3, we predict y = 12.7 + .606(3) = 14.5.
This is the predicted value corresponding to x = 3.
Plotting Predicted Values:
SAS can plot the predicted values versus the explanatory
variable:
PROC REG DATA = WHTYIELD;
MODEL YIELD = NITRATE;
PLOT PREDICTED.*NITRATE;
RUN;
Plotting Predicted Values: Overlay
It is possible to overlay this plot on the plot of the original
data:
PROC REG DATA = WHTYIELD;
MODEL YIELD = NITRATE;
PLOT PREDICTED.*NITRATE='P'
YIELD*NITRATE='*' / OVERLAY;
RUN;
/* PLOT X*Y=’!’ causes the plotting symbol
to be ’!’ */
Residual Plots:
• Residuals are the differences between the response val-
ues and predicted values:
y − ŷ = y − a − bx
• They are 'estimates' of the errors ε:
ε = y − α − βx
• Examine plots of the residuals:
1. look for outliers - indications that the linear model
may not be adequate or that the error distribution is
not close enough to normal. Tests are not trustwor-
thy in this case.
Residual Plots: Cont’d
2. Patterns can
• indicate the need to transform the data or to add a
quadratic term to the linear model, or
• indicate that the error variance is not constant. If
so, weighted least-squares should be used.
To get a feel for what the residual plots should look like if
the linear model is appropriate, use simulation:
Residual Plots: Cont’d
DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA=0.2;
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5
7
8
Residual Plots: Cont’d
9
10
12
13
;
PROC REG;
MODEL Y = X;
PLOT RESIDUAL.*X;
RUN; QUIT;
A quadratic example:
DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA=0.2;
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*(X - 5)**2 + EPSILON;
DATALINES;
1
2
4
5
7
8
A quadratic example: Cont'd
9
10
12
13
;
PROC REG;
MODEL Y = X;
PLOT RESIDUAL.*X;
RUN; QUIT;
Outlier example:
DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA=0.2; SIGMA2 = 1.8;
U = UNIFORM(0);
IF U < .8 THEN EPSILON = SIGMA*RANNOR(0);
ELSE EPSILON = SIGMA2*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5
7
Outlier example: Cont’d
8
9
10
12
13
;
PROC REG; MODEL Y = X;
PLOT RESIDUAL.*X;
RUN; QUIT;
An increasing variance example
DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA=0.2*SQRT(X);
/* SIGMA increases with the square
root of X */
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5
7
8
An increasing variance example Cont’d
9
10
12
13
;
PROC REG;
MODEL Y = X;
PLOT RESIDUAL.*X;
RUN; QUIT;
Exercise: Examine a plot of the residuals for the wheat
yield data.
Adding a Quadratic Term:
To fit the quadratic model
Y = α + βX + β2X² + ε
use the line
X2 = X**2;
in the DATA step, and use the following MODEL state-
ment in PROC REG:
MODEL Y = X X2;
Example:
DATA WHTYIELD;
INPUT NITRATE YIELD;
NITRATE2 = NITRATE**2;
DATALINES;
1 15
2 13
3 16
4 12
5 14
6 18
7 17
8 19
9 16
10 20
;
Example : Cont’d
PROC REG DATA=WHTYIELD;
MODEL YIELD = NITRATE NITRATE2;
RUN;
QUIT;
The output includes a test of the hypothesis that β2 = 0.
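The same fit can be reproduced by solving the normal equations of the two-predictor model directly. A Python sketch (illustration only; the checks at the end use the fact that ordinary least-squares residuals are orthogonal to every column of the design matrix, which PROC REG exploits internally):

```python
# Sketch: fitting Y = b0 + b1*X + b2*X^2 by least squares via the
# 3x3 normal equations (X'X) beta = X'y, solved by Gaussian elimination.
nitrate = list(range(1, 11))
yld = [15.0, 13.0, 16.0, 12.0, 14.0, 18.0, 17.0, 19.0, 16.0, 20.0]

# design matrix columns: 1, x, x^2
cols = [[1.0] * len(nitrate),
        [float(x) for x in nitrate],
        [float(x * x) for x in nitrate]]

A = [[sum(ci * cj for ci, cj in zip(cols[i], cols[j])) for j in range(3)]
     for i in range(3)]                      # X'X
rhs = [sum(c * y for c, y in zip(cols[i], yld)) for i in range(3)]  # X'y

# Gaussian elimination with partial pivoting
for k in range(3):
    p = max(range(k, 3), key=lambda r: abs(A[r][k]))
    A[k], A[p] = A[p], A[k]
    rhs[k], rhs[p] = rhs[p], rhs[k]
    for r in range(k + 1, 3):
        f = A[r][k] / A[k][k]
        for c in range(k, 3):
            A[r][c] -= f * A[k][c]
        rhs[r] -= f * rhs[k]

beta = [0.0] * 3
for k in (2, 1, 0):                          # back substitution
    beta[k] = (rhs[k] - sum(A[k][c] * beta[c]
                            for c in range(k + 1, 3))) / A[k][k]

fitted = [beta[0] + beta[1] * x + beta[2] * x * x for x in nitrate]
resid = [y - f for y, f in zip(yld, fitted)]
```

If the fit is correct, the residuals sum to zero and are uncorrelated with both X and X², which is a useful internal check on any hand-rolled regression.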
Transformations:
Sometimes an appropriate transformation (like a log or a
square root) is sufficient to linearize a relationship between
two variables. Sometimes, such a transformation can cor-
rect for a variance that is not constant. (N.B. If the re-
sponse variable is a count, it is almost always the case that
a nonconstant variance can be corrected for by taking a
square root of the response variable.)
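A small illustration of why transformations help: if y depends on x exponentially, taking logs makes the relationship exactly linear. Python sketch with hypothetical data (the constants 3.0 and 0.4 are arbitrary; illustration only, the course software is SAS):

```python
# Log-transforming an exponential relation y = c*exp(b*x) yields an
# exactly linear relation log(y) = log(c) + b*x, so Pearson's r becomes 1.
import math

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.0 * math.exp(0.4 * x) for x in xs]   # hypothetical exponential data
log_ys = [math.log(y) for y in ys]           # log-transform the response

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return num / math.sqrt(sum((u - ma) ** 2 for u in a) *
                           sum((v - mb) ** 2 for v in b))
```

On the raw data the correlation is below 1 because the curve is convex; on the transformed data it is exactly 1 (up to rounding).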
Summary:
PROC REG;
MODEL Y = X; /* FITS LINEAR MODEL RELATING
RESPONSE Y TO
EXPLANATORY VARIABLE X */
PLOT PREDICTED.*X; /* PLOTS PREDICTED VALUES */
PLOT RESIDUAL.*X = ’Y’; /* PLOTS RESIDUALS
WITH PLOTTING SYMBOL ’Y’ */