RACE 615 Introduction to Medical Statistics
Correlation & simple linear regression analysis
Assist.Prof.Dr.Sasivimol Rattanasiri
Doctor of Philosophy Program in Clinical Epidemiology and Master of Science Program in Medical Epidemiology Section for Clinical Epidemiology & Biostatistics
Faculty of Medicine Ramathibodi Hospital Mahidol University www.ceb-rama.org
Academic year 2015 semester I
CONTENTS
1. CORRELATION ................................................................................................................ 3
1.1 Pearson’s correlation coefficient ..................................................................................... 3
1.2 Spearman’s rank correlation coefficient ....................................................................... 7
2. SIMPLE LINEAR REGRESSION ................................................................................. 10
2.1 The linear regression model ........................................................................................ 11
2.2 Determining the best fit for the simple linear regression line ..................................... 12
2.3 Evaluating the regression model ................................................................................. 14
2.4 Using the regression model ......................................................................................... 24
2.5 Dummy tables ............................................................................................................. 30
Assignment IV ................................................................................................................. 31
OBJECTIVES
This module will help you to be able to:
• Estimate Pearson correlation coefficient & interpret results
• Fit a regression model, estimate regression coefficients, make statistical
inferences, and interpret the results
• Check the model’s assumptions
REFERENCES
1. Pagano M, Gauvreau K. Principles of Biostatistics. California: Duxbury Press
1993; 379-424.
2. Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied regression analysis
and other multivariable methods. Washington: Duxbury Press 1998; 39-212.
3. Neter J, Wasserman W, Kutner MH. Applied linear statistical models, third
edition. Boston: IRWIN 1990; 21-433.
4. Altman DG. Practical statistics for medical research. London: Chapman &
Hall 1991; 277-361.
READING SECTION
Read chapters 2, 3, & 4 in Neter J et al.
ASSIGNMENT IV (20%)
P. 31 (Due on: October 1, 2015)
1. CORRELATION
The measures of the strength of association between two categorical variables were
described in the preceding module. This module begins with investigation of the
association between two continuous variables. For example, we may want to determine the
association between age and percentage of body fat, between serum cholesterol level and
blood pressure, between age and calcium score, or between age and bone mineral density.
The statistical method
which is used to measure an association between two continuous variables is known as
correlation. The degree of the linear association is measured by the correlation coefficient.
1.1 Pearson’s correlation coefficient
If X and Y are continuous random variables, μ_X and μ_Y are the means of X and Y,
and σ_X and σ_Y are the standard deviations of X and Y, then the population
correlation coefficient (ρ) is defined as:

ρ = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y) .    (1)
The sample Pearson’s correlation coefficient, r, estimates ρ and is
defined as:
r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √[ Σ_{i=1}^{n} (x_i − x̄)² × Σ_{i=1}^{n} (y_i − ȳ)² ] ,    (2)

where x_i is the ith observation of the variable X,
y_i is the ith observation of the variable Y,
x̄ is the sample mean of the variable X,
ȳ is the sample mean of the variable Y.
Interpretation of the sample Pearson’s correlation coefficient
• The correlation coefficient lies between -1 and 1.
• If the correlation coefficient is near +1 or -1, it indicates that there is strong linear
association between two continuous variables.
• If the correlation coefficient is near 0, it indicates that there is no linear association
between two continuous variables. However, a nonlinear relationship may exist.
• If the correlation coefficient is greater than 0, the correlation between the two
continuous variables is positive: as the value of one variable increases, the value of
the other tends to increase, and as the value of one variable decreases, the value of
the other tends to decrease.
• If the correlation coefficient is less than 0, the correlation between the two continuous
variables is negative: as the value of one variable increases, the value of the other
tends to decrease, and as the value of one variable decreases, the value of the other
tends to increase.
The hypothesis test for the correlation coefficient may be performed under the null
hypothesis that there is no association between two continuous variables. This test has the t
distribution with n-2 degrees of freedom. It is defined as:
t = r √(n − 2) / √(1 − r²)    (3)
where r is the sample correlation coefficient,
n is the number of subjects.
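Equations (2) and (3) translate directly into code. The module’s examples use Stata; the following is a minimal Python sketch with made-up illustrative data, not the module’s dataset:

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient, Equation (2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t test of H0: rho = 0 with n - 2 degrees of freedom, Equation (3)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Perfectly linear data gives r = 1; a decreasing line gives r = -1.
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # 1.0
```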
Suggestion: A first step that is useful in assessing the association between two continuous
variables is to create a two-way scatter plot of the data. This plot suggests the pattern of the
relationship between the two variables and therefore helps us choose an appropriate
statistical analysis to measure the degree of that relationship. An example of this
plot is presented in Example 1-1.
Example 1-1 Researchers wanted to assess the strength of association between body mass
index (BMI) and systolic blood pressure (SBP) in 32 subjects. The scatter plot of the
relationship between BMI and SBP is shown in Figure 1-1. We can see that the points
themselves varied widely, but the overall pattern suggested that SBP tends to become
higher as BMI increases. This impression suggests that there is a strong linear relationship
between the BMI and SBP and the correlation between these two variables is positive. The
Pearson’s correlation coefficient is 0.74, confirming the visual impression.
To test the hypothesis for correlation coefficient, use the following procedure:
1. Generate the null and alternative hypotheses
- The null hypothesis is that there is no linear association between BMI and SBP.
- The alternative hypothesis is that there is linear association between BMI and SBP.
H_0: ρ = 0
H_A: ρ ≠ 0
2. Compute the Pearson’s correlation coefficient (r)
a) The mean BMI is

x̄ = (1/32) Σ_{i=1}^{32} x_i = 3.44

b) The mean SBP is

ȳ = (1/32) Σ_{i=1}^{32} y_i = 144.53
c) In addition,

Σ_{i=1}^{32} (x_i − x̄)(y_i − ȳ) = Σ_{i=1}^{32} (x_i − 3.44)(y_i − 144.53) = 164.62 ,

Σ_{i=1}^{32} (x_i − x̄)² = Σ_{i=1}^{32} (x_i − 3.44)² = 7.66 ,

Σ_{i=1}^{32} (y_i − ȳ)² = Σ_{i=1}^{32} (y_i − 144.53)² = 6425.97 .

d) The correlation coefficient is

r = 164.62 / √(7.66 × 6425.97) = 0.74
3. Compute the statistical test
t = 0.74 √(32 − 2) / √(1 − 0.74²) = 6.03
4. Compute the associated p value by the STATA program
. disp tprob(30,6.03)
Then we get the associated p value < 0.001.
5. Draw a conclusion
The p value for this example is less than 0.001 which is less than the level of significance.
As a result, we reject the null hypothesis and conclude that there is linear association
between BMI and SBP. The correlation between these two variables is 0.74, and the
relation is clearly positive, that is the SBP tends to be higher for higher BMI.
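As a check, the published summary quantities for Example 1-1 reproduce r and t with plain arithmetic. This is a Python sketch using the rounded sums quoted above (the raw 32 observations are not listed in this module); note that carrying the unrounded r forward gives t ≈ 6.06, matching the Stata regression output, while the text’s 6.03 comes from rounding r to 0.74 first:

```python
import math

# Summary quantities quoted in Example 1-1
sxy, sxx, syy, n = 164.62, 7.66, 6425.97, 32

r = sxy / math.sqrt(sxx * syy)               # Equation (2)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # Equation (3)
print(round(r, 2), round(t, 2))  # 0.74 6.06
```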
To do the correlation analysis using the STATA program, we would type:
. pwcorr sbp bmi, sig obs

             |      sbp      bmi
-------------+------------------
         sbp |   1.0000
             |
             |       32
             |
         bmi |   0.7420   1.0000
             |   0.0000
             |       32       32
Figure 1-1 Scatter plot of systolic blood pressure against body mass index
Limitations of the sample Pearson’s correlation coefficient
• Inference for the correlation coefficient is valid only if both of the continuous
variables are normally distributed. If the variables are not normally distributed, the
interpretation may not be correct.
• If either or both of the continuous variables do not have a normal distribution, a
nonparametric correlation coefficient, Spearman’s rank correlation coefficient, is
required.
• The correlation coefficient is the measure of the strength of the linear relationship
between two continuous variables. If they have a nonlinear relationship, then the
correlation coefficient is not a valid measure of this association.
• It must be kept in mind that a high correlation between two continuous variables
does not in itself imply a cause-and-effect relationship.
1.2 Spearman’s rank correlation coefficient
Ranks can be used to assess the relationship between two continuous
variables when either or both variables do not have a normal distribution.
One approach is to rank the two sets of data separately and calculate a
coefficient of rank correlation. This approach is known as Spearman’s rank correlation
coefficient.
The Spearman’s rank correlation coefficient is denoted by r_s. It is simply Pearson’s
correlation coefficient calculated on the ranked values of the two continuous variables. It is
defined as:

r_s = Σ_{i=1}^{n} (x_ri − x̄_r)(y_ri − ȳ_r) / √[ Σ_{i=1}^{n} (x_ri − x̄_r)² × Σ_{i=1}^{n} (y_ri − ȳ_r)² ]    (4)
where x_ri and y_ri are the ranks associated with the ith subject rather than the actual
observations. The interpretation of the Spearman’s rank correlation coefficient is exactly the
same as the Pearson’s correlation coefficient. The hypothesis test for the correlation
coefficient is performed by using the same procedure that we used for the Pearson’s
correlation. It can be defined as:
t_s = r_s √(n − 2) / √(1 − r_s²)    (5)
Advantages of the Spearman’s rank correlation coefficient
• The Spearman’s rank correlation coefficient is much less sensitive to outlying
values than the Pearson’s correlation coefficient.
• The Spearman’s rank correlation coefficient can be used when one or both of the
variables are ordinal.
• The Spearman’s rank correlation coefficient may be used when assessing general
association. It has the advantage of not specifically assessing linear association.
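Equation (4) amounts to ranking each variable and then applying Pearson’s formula to the ranks. A minimal Python sketch (illustrative data; ties receive average ranks, the usual convention):

```python
import math

def ranks(values):
    """Rank observations, assigning average ranks to ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

# A monotone but nonlinear relationship (y = x**3): Spearman's rho is exactly 1.
print(spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))  # 1.0
```

This illustrates the advantage noted above: Pearson’s r would be below 1 for these data, but the rank correlation captures the perfect monotone association.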
Example 1-2 Researchers would like to assess the strength of association between age and
amount of calcium intake in 80 adults. The scatter plot of the relationship between age and
calcium intake is shown in Figure 1-2. It is clearly seen that there is considerable scatter
with no obvious underlying relationship between age and calcium intake. This impression
suggests that there is a nonlinear relationship between the age and calcium intake.
Additionally, the calcium intake data do not have a normal distribution. Therefore, the
Spearman’s rank correlation coefficient is a more appropriate measure of the association
for this example.
Figure 1-2 Scatter plot of calcium intake (mg/day) against age (years)
The Spearman’s correlation coefficient for this example is 0.17 and the correlation is
positive. These results suggest a weak relationship between age and calcium intake. To test
the hypothesis for correlation coefficient, use the following procedure:
1. Generate the null and alternative hypotheses
- The null hypothesis is that there is no association between age and calcium intake.
- The alternative hypothesis is that there is an association between age and calcium
intake.

H_0: ρ_s = 0
H_A: ρ_s ≠ 0
2. Compute the Spearman’s rank correlation coefficient and test statistic by using the
STATA program. The output is presented as:

. spearman ca_intake age

 Number of obs =      80
Spearman's rho =  0.1655

Test of Ho: ca_intake and age are independent
    Prob > |t| =      0.1423
3. Draw a conclusion
The p value for this example is 0.14 which is greater than the level of significance. As a
result, we fail to reject the null hypothesis and conclude that there is no association
between age and calcium intake.
2. SIMPLE LINEAR REGRESSION

Other questions may arise when we want to explore the relationship between two
continuous variables. In particular, when researchers would like to predict or estimate the
value of one variable corresponding to a given value of another variable. For example,
rather than investigating the relationship between SBP and BMI as in Example 1-1,
researchers may be interested in predicting the change in SBP that corresponds to a given
change in BMI. Clearly, correlation analysis cannot serve this purpose; it only
indicates the strength of the association as a single number. The appropriate statistical
method for predicting one continuous variable from other variables is known as
regression analysis.
Variables involved in regression analysis are not treated symmetrically, as they are in
correlation analysis. Regression distinguishes a dependent (response or outcome) variable
from an independent (explanatory or predictor) variable. The dependent variable is the one being
explained. The independent variable is the one used to explain the variation in the
dependent variable. For the question in Example 1-1, if researchers want to predict SBP
from BMI, SBP is treated as the response variable and BMI is treated as the explanatory
variable.
If only one explanatory variable is used to explain the variation in a response variable, it is
called simple regression analysis. If two or more explanatory variables are used to explain
the variation in a response variable, this is called multiple regression analysis.
Linear regression applies when each value of the response variable is independent of the
others. When repeated observations are measured on the same subject at different times,
special methods are needed to find the best-fitting model; these will be discussed later in
longitudinal data analysis.
There are many choices of the functional form of the regression relation between response
and explanatory variables, such as linear, quadratic, or other curvilinear forms. However, the regression
function is not known in advance and must be decided upon once the data have been
collected and analyzed. The linear and quadratic regression functions are often used as
satisfactory first approximations for regression functions of unknown nature.
In this module, we consider a basic regression model, where there is only one explanatory
variable and the regression function is linear. The multiple regression analysis is explained
later in Module V.
2.1 The linear regression model
The linear regression model is used to predict the change in response variable that
corresponds to a given change in the explanatory variable. This model gives a straight-line
relationship between these two variables. The simple linear regression model for
population is defined as:
Y = α + βX ,    (6)

where Y is referred to as the response variable and X is referred to as the explanatory
variable; α is the y-intercept of the line and β is its slope, and together they are called the
population regression coefficients. The y-intercept is defined as the mean value of the response Y when
X is equal to zero. In most medical applications, the y-intercept has no practical meaning
because the independent variable cannot be anywhere near zero, for example, blood
pressure, weight, or height. The slope is interpreted as the change in Y that corresponds to a
one-unit change in X.
Although the relationship between two variables is described by a perfect straight line, the
relationship between individual values of two variables is not. Thus, an error term (ε ),
which represents the variation of the response variable with a given explanatory variable, is
introduced into the model. The full linear regression model takes the following form:

Y = α + βX + ε .    (7)
Therefore, the sample linear regression model is defined as:
y_i = a + bx_i + e_i    (8)

where y_i and x_i are the observed values of the response and explanatory variables,
ŷ_i = a + bx_i is the estimated or predicted value of y_i for a particular value of x_i,
a and b are the unbiased estimators of the population regression coefficients, and
e_i is the random error, which is the difference between the observed and predicted
values of y_i. The technical term for this distance is a residual.
2.2 Determining the best fit for the simple linear regression line
The next question is how to fit a regression model. One mathematical technique for
determining the best fitting straight line to a set of data is known as the method of least
squares. The concept of this method is to produce the line that minimizes the distance
between the observed data and the fitted line.
If y_i is the observed value for a particular value of x_i, and ŷ_i is the predicted value of y_i
given by the regression line, then the distance between the observed and
predicted values, which is called the random error or residual, can be defined as:

e_i = y_i − ŷ_i .    (9)
Because errors of opposite sign cancel one another, many different lines make the sum of
the errors equal to zero, so minimizing the sum of the errors cannot identify a unique
best-fitting line. Instead, we minimize the sum of the squares of the errors, which can be
defined as:

Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − ŷ_i)² .    (10)
This quantity is also called the error sum of squares, or the residual sum of squares. Thus,
the regression coefficients a and b in the model y_i = a + bx_i + e_i are those values
which give the minimum residual sum of squares,

Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − a − bx_i)² ,

therefore,

b = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)² ,    (11)

and

a = ȳ − bx̄ .    (12)
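Equations (11) and (12) translate directly into code. A minimal Python sketch (illustrative data, not the module’s dataset):

```python
def least_squares(x, y):
    """Least-squares slope b (Eq. 11) and intercept a (Eq. 12)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    return a, b

# Data lying exactly on y = 1 + 2x are recovered exactly.
a, b = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```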
The regression coefficients for the prediction of SBP from BMI in Example 1-1 can be
calculated as:

Σ_{i=1}^{32} (x_i − x̄)(y_i − ȳ) = 164.62 ,

Σ_{i=1}^{32} (x_i − x̄)² = 7.66 ,

b = 164.62 / 7.66 = 21.49 ,

a = 144.53 − 21.49 × 3.44 = 70.58 .
To fit the regression model using STATA, we type "regress y x", which carries out
least-squares regression. In this example, we want to regress SBP on BMI, so we type:

. regress sbp quet

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =   36.75
       Model |  3537.94585     1  3537.94585           Prob > F      =  0.0000
    Residual |   2888.0229    30  96.2674299           R-squared     =  0.5506
-------------+------------------------------           Adj R-squared =  0.5356
       Total |  6425.96875    31  207.289315           Root MSE      =  9.8116

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        quet |   21.49167   3.545147     6.06   0.000     14.25151    28.73182
       _cons |   70.57641   12.32187     5.73   0.000      45.4118    95.74102
------------------------------------------------------------------------------
The output from STATA indicates that the estimate of the population intercept (_cons)
is 70.58 and the estimate of the population slope (quet) is 21.49.
Therefore, the linear regression model for the prediction of SBP from BMI in Example 1-1
is:

ŷ_i = 70.58 + 21.49x_i

The y-intercept of the regression line is 70.58. This is the predicted value of SBP that
corresponds to a BMI of 0. In this example, a BMI of 0 has no real meaning. The slope of
the line is 21.49, implying that for each one-unit increase in BMI, SBP increases by
21.49 mm Hg on average.
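With the fitted intercept 70.58 and slope 21.49 quoted above, prediction is a one-line computation. A Python sketch (the BMI value 3.5 is an arbitrary illustration, not from the dataset):

```python
a, b = 70.58, 21.49   # fitted intercept and slope from the Stata output

def predict_sbp(bmi):
    """Predicted SBP (mm Hg) from the fitted regression line."""
    return a + b * bmi

print(predict_sbp(3.5))  # approximately 145.8 mm Hg
```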
2.3 Evaluating the regression model
After the least squares regression line is constructed, it must be evaluated to determine
whether it adequately describes the relationship between the response and explanatory
variables, and whether it can be used effectively for prediction purposes. This can be done
by consideration of the confidence intervals of the regression coefficients, tests of hypotheses
about the regression coefficients, and estimation of the coefficient of determination.
2.3.1 Estimate confidence interval for regression coefficients
The uncertainty of the regression coefficients is considered by the confidence intervals of
the population regression coefficients.
The t distribution is used to make confidence intervals for the regression coefficients. The
standard deviation of the residual can be defined as:
s_{y|x} = √[ Σ_{i=1}^{n} (y_i − ŷ_i)² / (n − 2) ] ,    (13)
the standard error of the intercept can be defined as:
se(a) = s_{y|x} × √[ 1/n + x̄² / Σ_{i=1}^{n} (x_i − x̄)² ] ,    (14)
and the standard error of the slope can be defined as:
se(b) = s_{y|x} / √[ Σ_{i=1}^{n} (x_i − x̄)² ] .    (15)
Therefore, the confidence interval for the population intercept is given as:
a − (t_{1−α/2} × se(a))  to  a + (t_{1−α/2} × se(a)) ,    (16)

and the confidence interval for the population slope is given as:

b − (t_{1−α/2} × se(b))  to  b + (t_{1−α/2} × se(b)) ,    (17)

where t_{1−α/2} is the appropriate value from the t distribution with n − 2 degrees of freedom
associated with a confidence level of (1 − α)100%.
For the regression model of SBP on BMI, the 95% CIs for the regression coefficients can
be estimated as follows:
s_{y|x} = √(2888.02 / 30) = 9.81 ,

this estimate is used to compute

se(a) = 9.81 × √(1/32 + 3.44²/7.66) = 12.32

and

se(b) = 9.81 / √7.66 = 3.55 .

For a 95% confidence level with 30 degrees of freedom, the value of t_{1−α/2} = 2.042. Recall
that a and b are 70.58 and 21.49, respectively. Therefore, the 95% CI for α is

70.58 − (2.042 × 12.32) to 70.58 + (2.042 × 12.32), that is, from 45.41 to 95.74,

and the 95% CI for β is

21.49 − (2.042 × 3.55) to 21.49 + (2.042 × 3.55), that is, from 14.25 to 28.73.
Conclusion:
The 95% CI of the slope β lies between 14.25 and 28.73, which does not include zero. We
are 95% confident that the true population slope lies between 14.25 and 28.73, and because
the interval excludes zero, the slope differs significantly from zero at the 5% level.
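Equations (13)–(17) with the summary numbers quoted in this section (SSE = 2888.02, Σ(x_i − x̄)² = 7.66, x̄ = 3.44, n = 32, t = 2.042) reproduce the confidence intervals. A Python sketch; because the inputs are rounded, the last digit can differ slightly from Stata’s:

```python
import math

n, sse, sxx, xbar = 32, 2888.02, 7.66, 3.44
a, b, tcrit = 70.58, 21.49, 2.042

s = math.sqrt(sse / (n - 2))                   # Eq. (13): about 9.81
se_a = s * math.sqrt(1 / n + xbar ** 2 / sxx)  # Eq. (14): about 12.32
se_b = s / math.sqrt(sxx)                      # Eq. (15): about 3.55
ci_a = (a - tcrit * se_a, a + tcrit * se_a)    # Eq. (16)
ci_b = (b - tcrit * se_b, b + tcrit * se_b)    # Eq. (17)
print([round(v, 2) for v in ci_b])  # [14.25, 28.73]
```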
2.3.2 Tests of hypotheses about regression coefficients
This section is concerned with tests of hypotheses about the population regression
coefficients. The slope is usually more important in the regression model
than the intercept. It quantifies the average change in y that corresponds to each unit
change in x. We can test the null hypothesis that the population slope is equal to zero. If the
population slope is zero, it means that there is no linear relationship between explanatory
and response variables. The test of hypothesis can be done either by using analysis of
variance and the F statistic or by using the t statistic.
1) Hypothesis testing with F statistic
The hypothesis testing with F statistic is based upon the ANOVA table. The principle of
this test is to partition the total variation into two components: explained variation and
unexplained variation.
The total variation of the observed values of y is measured by the total sum of squares
(SST). It is the sum of the squares of the differences between each observation and the
overall mean. This is calculated as:
SST = Σ_{i=1}^{n} (y_i − ȳ)²    (18)
The unexplained variation of the observed values of y is measured by the error sum of
squares (SSE). It is the sum of the squares of the differences between each observation and
its predicted values. This is calculated as:
SSE = Σ_{i=1}^{n} (y_i − ŷ_i)²    (19)
The explained variation of the observed values of y is measured by the sum of squares due
to linear regression (SSR). It is the sum of the squares of the differences between each
predicted value and the overall mean. This is calculated as:
SSR = Σ_{i=1}^{n} (ŷ_i − ȳ)²    (20)
Thus, SSR is the portion of SST that is explained by the use of the regression model, and
SSE is the portion of SST that is not explained by the use of the regression model.
The value of the F test can be defined as:
F = MSR / MSE = (SSR / 1) / (SSE / (n − 2))    (21)
The general form of ANOVA table for simple linear regression is presented in Table 2-1.
Table 2-1 ANOVA table for simple linear regression
Source of variation    Degrees of freedom    Sums of squares    Mean squares    F
Linear regression              1                   SSR              MSR          MSR/MSE
Residual                     n - 2                 SSE              MSE
Total                        n - 1                 SST              MST
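Using the sums of squares reported in the Stata output for Example 1-1 (Model SS = 3537.95, Residual SS = 2888.02), the ANOVA quantities and the F statistic of Equation (21) can be reproduced in a few lines. A Python sketch with rounded inputs:

```python
n = 32
ssr, sse = 3537.95, 2888.02      # model and residual sums of squares (Stata output)
sst = ssr + sse                  # total sum of squares: 6425.97

msr = ssr / 1                    # mean square for regression (1 df)
mse = sse / (n - 2)              # mean square error (n - 2 df)
f = msr / mse                    # Equation (21)
print(round(mse, 2), round(f, 2))  # 96.27 36.75
```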
2) Hypothesis testing with t statistic
The statistical test for testing the hypothesis H_0: β = 0 using the t statistic can be
defined as:

t = (b − 0) / se(b)    (22)
Under the null hypothesis, this ratio has a t distribution with n-2 degrees of freedom.
For the regression model of SBP on BMI, the hypothesis testing of the slope of the
regression line can be performed as follows:
1. Generate null and alternative hypotheses
- The null hypothesis is that the slope is equal to zero
- The alternative hypothesis is that the slope is not equal to zero
H_0: β = 0
H_A: β ≠ 0
2. Compute the statistical test
t = (21.49 − 0) / 3.55 = 6.06
3. Compute the associated p value by the STATA program
. disp tprob(30,6.06)
Then we get the associated p value < 0.001.
4. Draw a conclusion
For the hypothesis H_0: β_BMI = 0, the calculated statistic is t = 6.06 with
32 − 2 = 30 degrees of freedom. The associated p-value is <0.001. We therefore reject the null
hypothesis that the population slope is equal to zero and conclude that there is a statistically
significant straight-line association between SBP and BMI. That is, for each one-unit
increase in BMI, SBP increases by about 21.5 mm Hg.
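Equation (22) with the unrounded Stata estimates reproduces the reported t. A Python sketch:

```python
b, se_b = 21.4917, 3.5451   # coefficient and standard error from the Stata output
t = (b - 0) / se_b          # Equation (22)
print(round(t, 2))  # 6.06
```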
If we are interested in testing the null hypothesis that the intercept is equal to a specified
value α_0, we use calculations that are analogous to those for the slope. The statistical test
can be defined as:
t = (a − α_0) / se(a)    (23)
However, the intercept is not usually of interest. There is very little practical value in
making inferences about the intercept.
2.3.3 The coefficient of determination
The coefficient of determination is denoted by R². It is the square of the Pearson
correlation coefficient r; consequently, R² = r². R² represents the proportion of the total
variation of the observed values of y that is explained by the linear regression model. In
other words, R² represents how well the independent variable explains the
dependent variable in the regression model. R² is calculated as:

R² = SSR / SST    (24)
The STATA output of simple linear regression:

. regress sbp quet

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =   36.75
       Model |  3537.94585     1  3537.94585           Prob > F      =  0.0000
    Residual |   2888.0229    30  96.2674299           R-squared     =  0.5506
-------------+------------------------------           Adj R-squared =  0.5356
       Total |  6425.96875    31  207.289315           Root MSE      =  9.8116

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        quet |   21.49167   3.545147     6.06   0.000     14.25151    28.73182
       _cons |   70.57641   12.32187     5.73   0.000      45.4118    95.74102
------------------------------------------------------------------------------
From the STATA output of simple linear regression analysis, the R2 is 0.5506. This
implies that 55.06% of the variation among the observed values of the SBP is explained by
its linear relationship with BMI. The remaining 44.94% (100-55.06%) of variation is
unexplained or, in other words, is explained by other variables that are not
considered here.
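Equation (24) with the sums of squares from the Stata output confirms both R² and its relation to r². A Python sketch with rounded inputs:

```python
ssr, sst = 3537.95, 6425.97   # model and total sums of squares (Stata output)
r2 = ssr / sst                # Equation (24)
r = 0.7420                    # Pearson's r from Example 1-1
print(round(r2, 4))  # 0.5506, which also equals r squared up to rounding
```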
2.3.4 Assumptions underlying simple linear regression
It is important to consider assumptions that underlie the simple linear regression model
before using a regression analysis. The assumptions of the simple linear regression model
are described below:
1) Normality
The values of the response variable Y have a normal distribution for each value of the
explanatory variable X. If this assumption holds, the residuals should also
have a normal distribution; they are assumed to be independent normal random variables
with mean 0 and constant variance (σ²). Violation of this assumption results in an invalid
model, so the assumption needs to be checked after fitting the simple linear regression
model. It can be assessed graphically by a normal probability plot of the residuals and
confirmed formally by a statistical test. Checking this assumption can be performed as
follows:
1. Estimate the residuals
After fitting the regression model with the regress command, the residuals can be estimated
with the predict command and the resid option:
predict res, resid
The STATA stores the results of predicting residuals in the variable "res" as follows:

. list sbp bmi res in 1/5

     +-------------------------+
     | sbp     bmi         res |
     |-------------------------|
  1. | 152   4.116   -7.036115 |
  2. | 164    4.01    7.242001 |
  3. | 135   2.876    2.613559 |
  4. | 130     3.1   -7.200574 |
  5. | 137   3.296   -4.412943 |
     +-------------------------+
2. Describe residuals distribution
. sum res, detail

                          Residuals
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -19.23081      -19.23081
 5%    -18.44582      -18.44582
10%    -10.51667       -11.5153       Obs                  32
25%    -7.160689      -10.51667       Sum of Wgt.          32
50%    -1.603864                      Mean           4.47e-08
                        Largest       Std. Dev.      9.652048
75%     8.117427       11.79569
90%     11.79569        12.1004       Variance       93.16203
95%     12.59216       12.59216       Skewness       .1254792
99%     22.53132       22.53132       Kurtosis       2.598663
The output from the STATA program indicates that the mean (effectively 0) is slightly
greater than the median of -1.60, and the skewness of 0.13 is small.
3. Construct a normal probability plot
Construct a normal probability plot of residuals by using pnorm command as:
pnorm res
Figure 2-2 Standardized normal probability plot of residuals
Figure 2-2 shows that the residuals have little departure from the diagonal line, suggesting
that they are approximately normally distributed.
4. Hypothesis testing about normality
The assumption of normality can be confirmed by a statistical test known as the
Shapiro-Wilk test. We can do this by typing:
. swilk res

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V          z    Pr > z
-------------+-------------------------------------------------
         res |     32    0.97006      0.999     -0.003   0.50106
The test is performed under the null hypothesis that the residuals are normally distributed.
The Shapiro-Wilk statistic equals 0.97006 and its associated p-value is 0.50106. We
therefore fail to reject the null hypothesis and conclude that the residuals are normally
distributed.
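The skewness and kurtosis figures reported by Stata’s summarize are simple moment computations. A Python sketch of the moment-based definitions (illustrative data, not the module’s residuals):

```python
def skewness_kurtosis(e):
    """Moment-based skewness and kurtosis, as reported by Stata's summarize.
    Skewness 0 and kurtosis near 3 are consistent with normality."""
    n = len(e)
    m = sum(e) / n
    m2 = sum((x - m) ** 2 for x in e) / n
    m3 = sum((x - m) ** 3 for x in e) / n
    m4 = sum((x - m) ** 4 for x in e) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# A symmetric sample has skewness exactly 0.
sk, ku = skewness_kurtosis([-2, -1, 0, 1, 2])
print(sk)  # 0.0
```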
2) Homoskedasticity
The variability of the response variable Y, which is assessed by the variance, is the same
for each value of the explanatory variable X. This assumption indicates that the spread of
points around the regression line is similar for all x values. Considering in terms of the
residuals, the variance of residuals should be the same for each value of the explanatory
variable X. The assumption of homoskedasticity can be assessed after fitting the regression
model by plotting the residuals against the independent variable x or the predicted values
of the dependent variable y. If the residuals lie in a band parallel to the horizontal line, the
variance is constant over the values of x. On the other hand, if the residuals increase or
decrease as the value of x or the predicted value of y becomes larger, then the variance of
the residuals (σ²_{y|x}) is not constant across the values of x or the predicted values of y.
Violation of this assumption is called heteroskedasticity. We can examine this assumption
as follows:

1. Estimate the predicted values of y
After fitting a regression model with the regress command, the predicted values
of y can be estimated with the predict command and the xb option:
predict yhat, xb
The STATA stores the predicted values of y in the variable "yhat" as follows:

. list sbp bmi yhat in 1/5

     +------------------------+
     | sbp     bmi       yhat |
     |------------------------|
  1. | 152   4.116   159.0361 |
  2. | 164    4.01    156.758 |
  3. | 135   2.876   132.3864 |
  4. | 130     3.1   137.2006 |
  5. | 137   3.296   141.4129 |
     +------------------------+
2. Plot the residuals against the predicted values of y
To plot the residuals against the predicted values of y, use the rvfplot
command as:
rvfplot, yline(0)
Figure 2-3 Residuals from the regression model, plotted against the fitted values of SBP
Figure 2-3 shows that the residuals are randomly scattered below and above the horizontal line and do not show any distinct trend. That is, the assumption of a straight-line relationship between SBP and BMI appears reasonable, which corresponds with Figure 2-1. The residuals do not appear to increase or decrease as the predicted values increase. This indicates that the assumption of homoskedasticity is valid.
3. Hypothesis testing about homoskedasticity
In addition, constant variance can be checked by using Breusch-Pagan / Cook-Weisberg
test as below.
estat hettest res
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: res

         chi2(1)      =     0.25
         Prob > chi2  =   0.6157
The test is performed under the null hypothesis that the variance is constant across the predicted values of y. The Cook-Weisberg statistic equals 0.25 and its p-value is 0.6157. We therefore fail to reject the null hypothesis and conclude that the variance is constant across the predicted values of y.
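The mechanics of the test can also be sketched outside STATA. The version below is the studentized (Koenker) n·R² variant, which regresses the squared residuals on the fitted values and refers n·R² to a chi-squared(1) distribution; STATA's default hettest uses a slightly different scaling, and the data here are simulated rather than the handout's:

```python
# Sketch of a Breusch-Pagan-type test (Koenker's n*R^2 variant) on
# simulated data; not the handout's dataset or STATA's exact scaling.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(2.5, 4.5, 32)                     # hypothetical predictor
y = 70.6 + 21.5 * x + rng.normal(0.0, 9.8, 32)    # homoskedastic errors

# Least-squares fit of y on x; residuals and fitted values.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Auxiliary regression: squared residuals on the fitted values.
Z = np.column_stack([np.ones_like(fitted), fitted])
gamma, *_ = np.linalg.lstsq(Z, resid ** 2, rcond=None)
sq = resid ** 2
r2 = 1.0 - np.sum((sq - Z @ gamma) ** 2) / np.sum((sq - sq.mean()) ** 2)

lm = len(y) * r2                                  # test statistic
p_value = stats.chi2.sf(lm, df=1)                 # large p: variance constant
print(round(lm, 2), round(p_value, 3))
```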
3) Outliers
Outliers are extreme values and can result in an invalid model. They can be checked with a plot of the standardized residuals against the values of the independent variable X. The standardized residual for each observation is defined as:

z_i = e_i / S,

where S refers to the standard deviation of the residuals, which is defined as:

S = √[ Σ_{i=1}^{n} e_i² / (n − 2) ].

If a residual lies a long distance from the rest of the observations, usually four or more standard deviations from zero, it is called an outlier.
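This computation is simple enough to sketch directly; the residuals below are hypothetical, not the handout's data:

```python
# Standardized residuals z_i = e_i / S, with S = sqrt(sum(e_i^2) / (n - 2)).
import numpy as np

e = np.array([5.2, -3.1, 24.9, 4.4, -6.0, 1.3])   # hypothetical residuals
n = len(e)
S = np.sqrt(np.sum(e ** 2) / (n - 2))             # residual standard deviation
z = e / S
# Flag observations four or more standard deviations from zero.
print(np.round(z, 2), bool(np.any(np.abs(z) >= 4)))
```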
If outliers are present, the data should be checked to make sure that there was no error during data entry or measurement. If the error came from measurement, that observation must be omitted, because outliers can result in an invalid model. Steps for checking for outliers are as follows:
1. Estimate the standardized residuals
After fitting the regression model with the regress command, the standardized residuals can be estimated with the predict command and the rstandard option:
predict rstd, rstandard
The STATA stores the results of the standardized residual in the variable “rstd” as follows:
STATA stores the standardized residuals in the variable “rstd” as follows:

. list person sbp bmi rstd in 1/5

     +----------------------------------+
     | person   sbp     bmi       rstd |
     |----------------------------------|
  1. |     11   166   3.877   1.269367 |
  2. |      6   129    2.79  -.1640324 |
  3. |      9   144   2.368   2.538403 |
  4. |      7   162   3.668   1.308478 |
  5. |      5   146   2.979   1.197833 |
     +----------------------------------+
2. Create the standardized residual plot
After estimating the standardized residuals, the next step is to plot the standardized residuals versus the values of the independent variable, using the STATA command:
twoway (scatter rstd quet, sort yline(0))
Figure 2-5 Outliers: Standardized residuals versus BMI
This figure suggests that there is one relatively large residual, but it does not exceed 4 standard deviations from zero. Therefore, this observation is not an outlier of concern in this case.
2.4 Using the regression model
There are two ways in which the model can be used. First, it can be used to predict the mean value of y for a given value of x. Second, it can be used to predict an individual value of y for a new member of the population. Although the predicted mean and predicted individual values of y are numerically equivalent for any particular value of x, the confidence interval for a predicted individual value is wider than that for a predicted mean value. The details of the computations are presented as follows:
2.4.1 The predicted mean values
The predicted mean value of y can be defined as:

ŷ = a + bx.     (25)
For our example on the SBP and Quetelet index in Example 1-1, the linear regression
model is:
ŷ = 70.58 + 21.49x.
Using this linear regression model, we can find the predicted mean value of y for any
specific value of x. For example, we want to estimate the mean SBP for all subjects with
Quetelet index of 3.8 (x=3.8), the predicted mean value of SBP for this Quetelet index is
ŷ = 70.58 + (21.49 × 3.8) = 152.24 mm Hg.
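This arithmetic is easy to verify by hand or in any language; a short Python check:

```python
# Predicted mean SBP at Quetelet index 3.8 from the fitted line.
a, b = 70.58, 21.49        # intercept and slope of the fitted model
y_hat = a + b * 3.8
print(round(y_hat, 2))     # 152.24
```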
This value of ŷ can also be interpreted as a point estimate of the mean value of y when the Quetelet index = 3.8. Thus, we can state that, on average, the SBP is 152.24 mmHg for all subjects whose Quetelet index = 3.8. An example of using the regression line to predict SBP for any specific value of the Quetelet index is presented in Figure 2-6.
Figure 2-6 Scatter plot and regression line of relationship between BMI and SBP
To construct a confidence interval for predicted mean values, we must know the predicted
mean, the standard error of the predicted mean, and the distribution of the predicted mean.
The t distribution is also used to make confidence intervals for the predicted mean values.
The standard error of ŷ_i for a given value of x, say x_0, is estimated by:

se(ŷ_i) = s_{y|x} √[ 1/n + (x_0 − x̄)² / Σ_{i=1}^{n} (x_i − x̄)² ].     (26)
Therefore, the confidence interval is given by:

ŷ_i − ( t_{1−α/2} × se(ŷ_i) )  to  ŷ_i + ( t_{1−α/2} × se(ŷ_i) ),     (27)

where t_{1−α/2} is the appropriate value from the t distribution with n−2 degrees of freedom associated with a confidence level of (1−α)100%.
Referring to Example 1-1 on SBP and Quetelet index, the standard error of the predicted mean is thus:

se(ŷ_i) = 9.81 √[ 1/32 + (3.80 − 3.44)² / 7.66 ] = 2.15.
For a 95% confidence level with 30 degrees of freedom, the value of t_{1−α/2} = 2.042.
Therefore, 95% CI for the predicted mean value of SBP for all subjects with Quetelet
index=3.8 is
152.24 − (2.042 × 2.15) to 152.24 + (2.042 × 2.15), that is, from 147.85 to 156.64.
Thus, with 95% confidence, we can state that the mean SBP for all subjects with Quetelet
index of 3.8 is between 147.85 and 156.64 mmHg. The procedures for construction of the
95% confidence intervals for predicted mean values using STATA are described as
follows:
1. Estimate the predicted values of y
After fitting the regression model with the regress command, the predicted values of y can be obtained with the predict command and the xb option:
predict yhat, xb
STATA calculates the predicted mean values of y and stores the results in the variable
“yhat” as follows:
list sbp quet yhat in 1/5

     +------------------------+
     | sbp    quet      yhat |
     |------------------------|
  1. | 152   4.116  159.0361 |
  2. | 164    4.01   156.758 |
  3. | 135   2.876  132.3864 |
  4. | 130     3.1  137.2006 |
  5. | 137   3.296  141.4129 |
     +------------------------+
2. Estimate the standard errors of predicted mean values
The estimation of the standard errors of the predicted values of y can be done by predict
command with option stdp as:
predict se_m, stdp
The STATA calculates the standard errors of the predicted mean value of SBP for any
given value of Quetelet index and stores the results in variable “se_m” as follows:
list sbp quet yhat se_m in 1/5
     +-----------------------------------+
     | sbp    quet      yhat       se_m |
     |-----------------------------------|
  1. | 152   4.116  159.0361   2.955181 |
  2. | 164    4.01   156.758   2.660088 |
  3. | 135   2.876  132.3864   2.649855 |
  4. | 130     3.1  137.2006   2.114377 |
  5. | 137   3.296  141.4129   1.809128 |
     +-----------------------------------+
3. Estimate the value of the t distribution
To get the value of the t distribution, we use the STATA function invttail(df, p). For this example, we need the value of the t distribution with df = 32 − 2 = 30, so we would type:
disp invttail(30,0.025)
2.0422725
The value of t distribution for this example is 2.042.
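The same critical value can be cross-checked in Python with scipy; t.ppf(0.975, 30) is the quantile that invttail(30, 0.025) returns:

```python
# 97.5th percentile of the t distribution with 30 degrees of freedom,
# i.e. the two-sided 95% critical value used above.
from scipy import stats

t_crit = stats.t.ppf(0.975, df=30)
print(round(t_crit, 4))    # 2.0423
```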
4. Construct the 95% CIs for predicted mean values
To construct the 95% CIs for predicted mean values of SBP, we would type:
gen lower_m=yhat-(2.042*se_m)
gen upper_m=yhat+(2.042*se_m)
The lower and upper values of the 95% CI for the predicted mean values of SBP are
stored in the variables “lower_m” and “upper_m”, respectively.
list sbp quet yhat se_m lower_m upper_m in 1/5

     +---------------------------------------------------------+
     | sbp    quet      yhat       se_m    lower_m    upper_m |
     |---------------------------------------------------------|
  1. | 152   4.116  159.0361   2.955181   153.0016   165.0706 |
  2. | 164    4.01   156.758   2.660088   151.3261   162.1899 |
  3. | 135   2.876  132.3864   2.649855   126.9754   137.7975 |
  4. | 130     3.1  137.2006   2.114377    132.883   141.5181 |
  5. | 137   3.296  141.4129   1.809128   137.7187   145.1072 |
     +---------------------------------------------------------+
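As a cross-check of the hand computation above, equations (26) and (27) can be evaluated directly with the summary figures quoted in this handout (s_{y|x} = 9.81, n = 32, x̄ = 3.44, Σ(x_i − x̄)² = 7.66); this is an illustrative sketch, not a re-analysis of the data:

```python
# Standard error and 95% CI of the predicted mean SBP at x0 = 3.8,
# using summary figures from the handout's worked example.
import math

s_yx, n, xbar, sxx = 9.81, 32, 3.44, 7.66
x0, y_hat, t_crit = 3.8, 152.24, 2.042

se_mean = s_yx * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)   # eq. (26)
lower = y_hat - t_crit * se_mean                             # eq. (27)
upper = y_hat + t_crit * se_mean
print(round(se_mean, 2), round(lower, 2), round(upper, 2))
```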
2.4.2 The predicted individual values
Sometimes, instead of predicting the mean value of y for a given value of x, we would prefer to predict an individual value of y for a new member of the population. The individual outcome of y is denoted by ỹ. The predicted individual value of y is identical to the predicted mean value of y; in particular,

ỹ = a + bx.     (28)

However, the standard error of ỹ is not the same. We incorporate an extra source of variability, the dispersion of the individual y values about their mean, into the expression for the standard error of ỹ_i. This term is not included in the expression for the standard error of ŷ_i. Therefore, the confidence interval for a predicted individual value is much wider than the confidence interval for a predicted mean value. The standard error of ỹ_i for a given value of x, say x_0, is estimated by
of x, say 0x , is estimated by
∑=
∧
−
−++= n
ii
xyi
xx
xxn
syse
1
2
20
|
)(
)(11)~( , (29)
and the confidence interval is given by:

ỹ_i − ( t_{1−α/2} × se(ỹ_i) )  to  ỹ_i + ( t_{1−α/2} × se(ỹ_i) ),     (30)

where t_{1−α/2} is the appropriate value from the t distribution with n−2 degrees of freedom associated with a confidence level of (1−α)100%.
Return once again to the SBP and Quetelet index data. When the Quetelet index is 3.8 (x = 3.8), the standard error of the predicted individual value is thus:

se(ỹ_i) = 9.81 √[ 1 + 1/32 + (3.80 − 3.44)² / 7.66 ] = 10.04.
For a 95% confidence level with 30 degrees of freedom, the value of t_{1−α/2} = 2.042.
Therefore, 95% CI for the predicted individual value of SBP for a Quetelet index=3.8 is
152.24 − (2.042 × 10.04) to 152.24 + (2.042 × 10.04), that is, from 131.73 to 172.76,
which is considerably wider than the 95% confidence interval for the predicted mean.
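The widening is easy to see numerically: equation (29) differs from equation (26) only by the extra 1 under the square root. Using the same summary figures as in the worked example (a sketch, not a re-analysis of the data):

```python
# Comparing the standard errors for the predicted mean (eq. 26) and the
# predicted individual value (eq. 29) at x0 = 3.8, using the handout's
# worked-example figures (9.81, 32, 3.44, 7.66).
import math

s_yx, n, xbar, sxx, x0 = 9.81, 32, 3.44, 7.66, 3.8

se_mean = s_yx * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)       # eq. (26)
se_indiv = s_yx * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)  # eq. (29)
print(round(se_mean, 2), round(se_indiv, 2))                     # 2.15 10.04
```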
The procedures for construction of the 95% confidence intervals for predicted individual values using STATA are described as follows:
1. Estimate the predicted values of y
After fitting the regression model with the regress command, the predicted values of y can be obtained with the predict command and the xb option:
predict yhat, xb
STATA calculates the predicted mean values of y and stores the results in the variable
“yhat” as follows:
list sbp quet yhat in 1/5

     +------------------------+
     | sbp    quet      yhat |
     |------------------------|
  1. | 152   4.116  159.0361 |
  2. | 164    4.01   156.758 |
  3. | 135   2.876  132.3864 |
  4. | 130     3.1  137.2006 |
  5. | 137   3.296  141.4129 |
     +------------------------+
2. Estimate the standard errors of predicted individual values
The estimation of the standard error of the predicted values of y can be done by predict
command with option stdf as:
predict se_i, stdf
STATA calculates the standard error of the predicted individual value of SBP for any given value of the Quetelet index and stores the results in the variable “se_i” as follows:
list sbp quet yhat se_i in 1/5

     +-----------------------------------+
     | sbp    quet      yhat       se_i |
     |-----------------------------------|
  1. | 152   4.116  159.0361   10.24698 |
  2. | 164    4.01   156.758    10.1658 |
  3. | 135   2.876  132.3864   10.16313 |
  4. | 130     3.1  137.2006   10.03683 |
  5. | 137   3.296  141.4129   9.976993 |
     +-----------------------------------+
3. Estimate the value of the t distribution
To get the value of the t distribution, we use the STATA function invttail(df, p). For this example, we need the value of the t distribution with df = 32 − 2 = 30, so we would type:

disp invttail(30,0.025)
2.0422725
The value of t distribution for this example is 2.042.
4. Construct the 95% CIs for predicted individual values
To construct the 95% CIs for predicted individual values of SBP, we would type:
gen lower_i=yhat-(2.042*se_i)
gen upper_i=yhat+(2.042*se_i)
The lower and upper values of the 95% CI for the predicted individual values of SBP are stored in the variables “lower_i” and “upper_i”, respectively.
list sbp quet yhat se_i lower_i upper_i in 1/5

     +---------------------------------------------------------+
     | sbp    quet      yhat       se_i    lower_i    upper_i |
     |---------------------------------------------------------|
  1. | 152   4.116  159.0361   10.24698   138.1118   179.9604 |
  2. | 164    4.01   156.758    10.1658   135.9994   177.5166 |
  3. | 135   2.876  132.3864   10.16313   111.6333   153.1396 |
  4. | 130     3.1  137.2006   10.03683   116.7054   157.6958 |
  5. | 137   3.296  141.4129   9.976993   121.0399    161.786 |
     +---------------------------------------------------------+
2.5 Dummy tables
Before analyzing the data, you have to plan how to present the results by using dummy tables. A dummy table is a table that helps you visualize how your output will appear after analysis. This is a way of planning out your work. The dummy table for determining the association between patients' characteristics and changes of systolic blood pressure is presented in Table 2-2.
Table 2-2 Association between patients' characteristics and changes of systolic blood pressure

Characteristics    Coefficient    95% CI    P value
Age
  ≥ 60
  45-59
  < 45                  0*
BMI

*Coefficient is equal to zero for the reference group.
Assignment IV Correlation & simple linear regression (20%)
Due date: Oct 1, 2015

From the randomized controlled trial of calcium supplements, researchers wanted to predict the change of total left femur from age. The total left femur and age were stored under the variables ‘total’ and ‘age’, respectively.
The data are given in the data set cross-sectional_BMD_&_risk_factor.dta
a. Construct a two-way scatter plot of ‘total left femur’ versus ‘age’.
b. Does the graph suggest anything about the relationship between these variables?
c. Using ‘total left femur’ as the response variable and ‘age’ as the explanatory
variable, construct the least-squares regression line. Interpret the slope of this
equation.
d. At the significance level of .05, test the null hypothesis that the slope = 0.
e. What is the estimated mean of total left femur for the population of subjects whose
age is 65 years?
f. For the subjects whose age is 65 years, construct a 95% CI of the true mean
predicted total left femur.
g. For the subjects whose age is 65 years, construct a 95% CI of the true individual
predicted total left femur.
h. Does the regression model seem to fit well with the observed data? Comment on the
coefficient of determination and check assumptions.