RACE 615 Introduction to Medical Statistics
Correlation & simple linear regression analysis
Assist.Prof.Dr.Sasivimol Rattanasiri
Doctor of Philosophy Program in Clinical Epidemiology and Master of Science Program in Medical Epidemiology Section for Clinical Epidemiology & Biostatistics
Faculty of Medicine Ramathibodi Hospital Mahidol University www.ceb-rama.org
Academic year 2015 semester I
CONTENTS
1. CORRELATION ................................................................................................................ 3
1.1 Pearson’s correlation coefficient ..................................................................................... 3
1.2 Spearman’s rank correlation coefficient ....................................................................... 7
2. SIMPLE LINEAR REGRESSION ................................................................................. 10
2.1 The linear regression model ........................................................................................ 11
2.2 Determining the best fit for the simple linear regression line ..................................... 12
2.3 Evaluating the regression model ................................................................................. 14
2.4 Using the regression model ......................................................................................... 24
2.5 Dummy tables ............................................................................................................. 30
Assignment IV ................................................................................................................. 31
OBJECTIVES
This module will help you to be able to:
• Estimate Pearson correlation coefficient & interpret results
• Fit a regression model, estimate regression coefficients, make statistical
inferences, and interpret the results
• Check the model’s assumptions
REFERENCES
1. Pagano M, Gauvreau K. Principles of Biostatistics. California: Duxbury Press
1993; 379-424.
2. Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied regression analysis
and other multivariable methods. Washington: Duxbury Press 1998; 39-212.
3. Neter J, Wasserman W, Kutner MH. Applied linear statistical models, third
edition. Boston: IRWIN 1990; 21-433.
4. Altman DG. Practical statistics for medical research. London: Chapman &
Hall 1991; 277-361.
READING SECTION
Read chapters 2, 3, & 4 in Neter J et al.
ASSIGNMENT IV (20%)
P. 31 (Due on: October 1, 2015)
1. CORRELATION
The measures of the strength of association between two categorical variables were
described in the preceding module. This module begins with investigation of the
association between two continuous variables. For example, we may want to determine the
association between age and percentage of body fat, between serum cholesterol level and
blood pressure, between age and calcium score, or between age and bone mineral density.
The statistical method
which is used to measure an association between two continuous variables is known as
correlation. The degree of the linear association is measured by the correlation coefficient.
1.1 Pearson’s correlation coefficient
If X and Y are continuous random variables, μ_X and μ_Y are the means of X and Y,
and σ_X and σ_Y are the standard deviations of X and Y, then the population
correlation coefficient (ρ) is defined as:

ρ = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y) .    (1)
The sample Pearson’s correlation coefficient, r, estimates ρ and is
defined as:
r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √[ Σ_{i=1}^{n} (x_i − x̄)² × Σ_{i=1}^{n} (y_i − ȳ)² ] ,    (2)

where x_i is the ith observation of the variable X,
y_i is the ith observation of the variable Y,
x̄ is the sample mean of the variable X,
ȳ is the sample mean of the variable Y.
Interpretation of the sample Pearson’s correlation coefficient
• The correlation coefficient lies between -1 and 1.
• If the correlation coefficient is near +1 or -1, it indicates that there is strong linear
association between two continuous variables.
• If the correlation coefficient is near 0, it indicates that there is no linear association
between two continuous variables. However, a nonlinear relationship may exist.
• If the correlation coefficient is greater than 0, the correlation between the two
continuous variables is positive: as the value of one variable increases, the value of
the other tends to increase, and as the value of one variable decreases, the value of
the other tends to decrease.
• If the correlation coefficient is less than 0, the correlation between the two continuous
variables is negative: as the value of one variable increases, the value of the other
tends to decrease, and as the value of one variable decreases, the value of the other
tends to increase.
The hypothesis test for the correlation coefficient may be performed under the null
hypothesis that there is no association between two continuous variables. This test has the t
distribution with n-2 degrees of freedom. It is defined as:
t = r √(n − 2) / √(1 − r²)    (3)
where r is the sample correlation coefficient,
n is the number of subjects.
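Equations (2) and (3) translate directly into code. The module’s examples use Stata; the following is a minimal Python sketch with made-up illustrative data, not the module’s dataset:

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient, Equation (2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t test of H0: rho = 0 with n - 2 degrees of freedom, Equation (3)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Perfectly linear data gives r = 1; a decreasing line gives r = -1.
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # 1.0
```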
Suggestion: A first step that is useful in assessing the association between two continuous
variables is to create a two-way scatter plot of the data. This plot suggests the pattern of the
relationship between the two variables and therefore helps us choose an appropriate
statistical analysis to measure the degree of that relationship. An example of this
plot is presented in Example 1-1.
Example 1-1 Researchers wanted to assess the strength of association between body mass
index (BMI) and systolic blood pressure (SBP) in 32 subjects. The scatter plot of the
relationship between BMI and SBP is shown in Figure 1-1. We can see that the points
themselves varied widely, but the overall pattern suggested that SBP tends to become
higher as BMI increases. This impression suggests that there is a strong linear relationship
between the BMI and SBP and the correlation between these two variables is positive. The
Pearson’s correlation coefficient is 0.74, confirming the visual impression.
To test the hypothesis for correlation coefficient, use the following procedure:
1. Generate the null and alternative hypotheses
- The null hypothesis is that there is no linear association between BMI and SBP.
- The alternative hypothesis is that there is linear association between BMI and SBP.
H_0: ρ = 0
H_A: ρ ≠ 0
2. Compute the Pearson’s correlation coefficient (r)
a) The mean BMI is

x̄ = (1/32) Σ_{i=1}^{32} x_i = 3.44

b) The mean SBP is

ȳ = (1/32) Σ_{i=1}^{32} y_i = 144.53
c) In addition,

Σ_{i=1}^{32} (x_i − x̄)(y_i − ȳ) = Σ_{i=1}^{32} (x_i − 3.44)(y_i − 144.53) = 164.62 ,

Σ_{i=1}^{32} (x_i − x̄)² = Σ_{i=1}^{32} (x_i − 3.44)² = 7.66 ,

Σ_{i=1}^{32} (y_i − ȳ)² = Σ_{i=1}^{32} (y_i − 144.53)² = 6425.97 .

d) The correlation coefficient is

r = 164.62 / √(7.66 × 6425.97) = 0.74
3. Compute the statistical test
t = 0.74 √(32 − 2) / √(1 − 0.74²) = 6.03
4. Compute the associated p value by the STATA program
. disp tprob(30,6.03)
Then we get the associated p value < 0.001.
5. Draw a conclusion
The p value for this example is less than 0.001 which is less than the level of significance.
As a result, we reject the null hypothesis and conclude that there is linear association
between BMI and SBP. The correlation between these two variables is 0.74, and the
relation is clearly positive, that is the SBP tends to be higher for higher BMI.
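As a check, the published summary quantities for Example 1-1 reproduce r and t with plain arithmetic. This is a Python sketch using the rounded sums quoted above (the raw 32 observations are not listed in this module); note that carrying the unrounded r forward gives t ≈ 6.06, matching the Stata regression output, while the text’s 6.03 comes from rounding r to 0.74 first:

```python
import math

# Summary quantities quoted in Example 1-1
sxy, sxx, syy, n = 164.62, 7.66, 6425.97, 32

r = sxy / math.sqrt(sxx * syy)               # Equation (2)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # Equation (3)
print(round(r, 2), round(t, 2))  # 0.74 6.06
```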
To do the correlation analysis using the STATA program, we would type:
. pwcorr sbp bmi, sig obs

             |      sbp      bmi
-------------+------------------
         sbp |   1.0000
             |
             |       32
             |
         bmi |   0.7420   1.0000
             |   0.0000
             |       32       32
Figure 1-1 Scatter plot of systolic blood pressure against body mass index
Limitations of the sample Pearson’s correlation coefficient
• Inference for the correlation coefficient is valid only if both of the continuous
variables are normally distributed. If the variables are not normally distributed, the
interpretation may not be correct.
• If either or both of the continuous variables do not have a normal distribution, a
nonparametric correlation coefficient, Spearman’s rank correlation coefficient, is
required.
• The correlation coefficient is the measure of the strength of the linear relationship
between two continuous variables. If they have a nonlinear relationship, then the
correlation coefficient is not a valid measure of this association.
• It must be kept in mind that a high correlation between two continuous variables
does not in itself imply a cause-and-effect relationship.
1.2 Spearman’s rank correlation coefficient
Ranks can be used to assess the relationship between two continuous
variables when either or both variables do not have a normal distribution.
One approach is to rank the two sets of data separately and calculate a
coefficient of rank correlation. This approach is known as Spearman’s rank correlation
coefficient.
The Spearman’s rank correlation coefficient is denoted by r_s. It is simply Pearson’s
correlation coefficient calculated on the ranked values of the two continuous variables. It is
defined as:

r_s = Σ_{i=1}^{n} (x_ri − x̄_r)(y_ri − ȳ_r) / √[ Σ_{i=1}^{n} (x_ri − x̄_r)² × Σ_{i=1}^{n} (y_ri − ȳ_r)² ]    (4)
where x_ri and y_ri are the ranks associated with the ith subject rather than the actual
observations. The interpretation of the Spearman’s rank correlation coefficient is exactly the
same as the Pearson’s correlation coefficient. The hypothesis test for the correlation
coefficient is performed by using the same procedure that we used for the Pearson’s
correlation. It can be defined as:
t_s = r_s √(n − 2) / √(1 − r_s²)    (5)
Advantages of the Spearman’s rank correlation coefficient
• The Spearman’s rank correlation coefficient is much less sensitive to outlying
values than the Pearson’s correlation coefficient.
• The Spearman’s rank correlation coefficient can be used when one or both of the
variables are ordinal.
• The Spearman’s rank correlation coefficient may be used when assessing general
association. It has the advantage of not specifically assessing linear association.
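Equation (4) amounts to ranking each variable and then applying Pearson’s formula to the ranks. A minimal Python sketch (illustrative data; ties receive average ranks, the usual convention):

```python
import math

def ranks(values):
    """Rank observations, assigning average ranks to ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

# A monotone but nonlinear relationship (y = x**3): Spearman's rho is exactly 1.
print(spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))  # 1.0
```

This illustrates the advantage noted above: Pearson’s r would be below 1 for these data, but the rank correlation captures the perfect monotone association.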
Example 1-2 Researchers would like to assess the strength of association between age and
amount of calcium intake in 80 adults. The scatter plot of the relationship between age and
calcium intake is shown in Figure 1-2. It is clearly seen that there is considerable scatter
with no obvious underlying relationship between age and calcium intake. This impression
suggests that there is a nonlinear relationship between the age and calcium intake.
Additionally, the calcium intake data do not have a normal distribution. Therefore, the
Spearman’s rank correlation coefficient is a more appropriate measure of the association
for this example.
Figure 1-2 Scatter plot of calcium intake (mg/day) against age (years)
The Spearman’s correlation coefficient for this example is 0.17 and the correlation is
positive. These results suggest a weak relationship between age and calcium intake. To test
the hypothesis for correlation coefficient, use the following procedure:
1. Generate the null and alternative hypotheses
- The null hypothesis is that there is no association between age and calcium intake.
- The alternative hypothesis is that there is an association between age and calcium
intake.

H_0: ρ_s = 0
H_A: ρ_s ≠ 0
2. Compute the Spearman’s rank correlation coefficient and test statistic by using the
STATA program. The output is presented as:

. spearman ca_intake age

 Number of obs =      80
Spearman's rho =  0.1655

Test of Ho: ca_intake and age are independent
    Prob > |t| =      0.1423
3. Draw a conclusion
The p value for this example is 0.14 which is greater than the level of significance. As a
result, we fail to reject the null hypothesis and conclude that there is no association
between age and calcium intake.
2. SIMPLE LINEAR REGRESSION

Other questions may arise when we want to explore the relationship between two
continuous variables. In particular, when researchers would like to predict or estimate the
value of one variable corresponding to a given value of another variable. For example,
rather than investigating the relationship between SBP and BMI as in Example 1-1,
researchers may be interested in predicting the change in SBP that corresponds to a given
change in BMI. Clearly, correlation analysis cannot serve this purpose; it only
indicates the strength of the association as a single number. The appropriate statistical
method for predicting one continuous variable from other variables is known as
regression analysis.
Variables involved in regression analysis are not treated symmetrically, as they are in
correlation analysis. Regression distinguishes a dependent (response or outcome) variable
from an independent (explanatory or predictor) variable. The dependent variable is the one being
explained. The independent variable is the one used to explain the variation in the
dependent variable. For the question in Example 1-1, if researchers want to predict SBP
from BMI, SBP is treated as the response variable and BMI is treated as the explanatory
variable.
If only one explanatory variable is used to explain the variation in a response variable, it is
called simple regression analysis. If two or more explanatory variables are used to explain
the variation in a response variable, this is called multiple regression analysis.
Linear regression applies when each value of the response variable is independent of the
others. When repeated observations are measured on the same subject at different times,
special methods are needed to find the best-fitting model; these will be discussed later in
longitudinal data analysis.
There are many choices of the functional form of the regression relation between response
and explanatory variables, such as linear, quadratic, or other curvilinear forms. However, the regression
function is not known in advance and must be decided upon once the data have been
collected and analyzed. The linear and quadratic regression functions are often used as
satisfactory first approximations for regression functions of unknown nature.
In this module, we consider a basic regression model, where there is only one explanatory
variable and the regression function is linear. The multiple regression analysis is explained
later in Module V.
2.1 The linear regression model
The linear regression model is used to predict the change in response variable that
corresponds to a given change in the explanatory variable. This model gives a straight-line
relationship between these two variables. The simple linear regression model for
population is defined as:
Y = α + βX ,    (6)

where Y is referred to as the response variable and X is referred to as the explanatory
variable; α is the y-intercept of the line and β is its slope, and together they are called the
population regression coefficients. The y-intercept is defined as the mean value of the response Y when
X is equal to zero. In most medical applications, the y-intercept has no practical meaning
because the independent variable cannot be anywhere near zero, for example, blood
pressure, weight, or height. The slope is interpreted as the change in Y that corresponds to a
one-unit change in X.
Although the relationship between two variables is described by a perfect straight line, the
relationship between individual values of two variables is not. Thus, an error term (ε ),
which represents the variation of the response variable with a given explanatory variable, is
introduced into the model. The full linear regression model takes the following form:

Y = α + βX + ε .    (7)
Therefore, the sample linear regression model is defined as:
y_i = a + bx_i + e_i    (8)

where y_i and x_i are the observed values of the response and explanatory variables,
ŷ_i = a + bx_i is the estimated or predicted value of y_i for a particular value of x_i,
a and b are the unbiased estimators of the population regression coefficients, and
e_i is the random error, which is the difference between the observed and predicted
values of y_i. The technical term for this distance is a residual.
2.2 Determining the best fit for the simple linear regression line
The next question is how to fit a regression model. One mathematical technique for
determining the best fitting straight line to a set of data is known as the method of least
squares. The concept of this method is to produce the line that minimizes the distance
between the observed data and the fitted line.
If y_i is the observed value for a particular value of x_i, and ŷ_i is the predicted value of y_i
given by the regression line, then the distance between the observed and
predicted values, which is called the random error or residual, can be defined as:

e_i = y_i − ŷ_i .    (9)
Because errors of opposite sign cancel one another, many different lines make the sum of
the errors equal to zero, so minimizing the sum of the errors cannot identify a unique
best-fitting line. Instead, we minimize the sum of the squares of the errors, which can be
defined as:

Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − ŷ_i)² .    (10)
This quantity is also called the error sum of squares, or the residual sum of squares. Thus,
the regression coefficients a and b in the model y_i = a + bx_i + e_i are those values
which give the minimum residual sum of squares,

Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − a − bx_i)² ,

therefore,

b = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)² ,    (11)

and

a = ȳ − bx̄ .    (12)
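Equations (11) and (12) translate directly into code. A minimal Python sketch (illustrative data, not the module’s dataset):

```python
def least_squares(x, y):
    """Least-squares slope b (Eq. 11) and intercept a (Eq. 12)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    return a, b

# Data lying exactly on y = 1 + 2x are recovered exactly.
a, b = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```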
The regression coefficients for the prediction of SBP from BMI in Example 1-1 can be
calculated as:

Σ_{i=1}^{32} (x_i − x̄)(y_i − ȳ) = 164.62 ,

Σ_{i=1}^{32} (x_i − x̄)² = 7.66 ,

b = 164.62 / 7.66 = 21.49 ,

a = 144.53 − 21.49 × 3.44 = 70.58 .
To fit the regression model using STATA, we type "regress y x", which carries out
least-squares regression. In this example, we want to regress SBP on BMI, so we type:

. regress sbp quet

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =   36.75
       Model |  3537.94585     1  3537.94585           Prob > F      =  0.0000
    Residual |   2888.0229    30  96.2674299           R-squared     =  0.5506
-------------+------------------------------           Adj R-squared =  0.5356
       Total |  6425.96875    31  207.289315           Root MSE      =  9.8116

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        quet |   21.49167   3.545147     6.06   0.000     14.25151    28.73182
       _cons |   70.57641   12.32187     5.73   0.000      45.4118    95.74102
------------------------------------------------------------------------------
The output from STATA indicates that the estimate of the population intercept (_cons)
is 70.58 and the estimate of the population slope (quet) is 21.49.
Therefore, the linear regression model for the prediction of SBP from BMI in Example 1-1
is:

ŷ_i = 70.58 + 21.49x_i

The y-intercept of the regression line is 70.58. This is the predicted value of SBP that
corresponds to a BMI of 0. In this example, a BMI of 0 has no real meaning. The slope of
the line is 21.49, implying that for each one-unit increase in BMI, SBP increases by
21.49 mm Hg on average.
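With the fitted intercept 70.58 and slope 21.49 quoted above, prediction is a one-line computation. A Python sketch (the BMI value 3.5 is an arbitrary illustration, not from the dataset):

```python
a, b = 70.58, 21.49   # fitted intercept and slope from the Stata output

def predict_sbp(bmi):
    """Predicted SBP (mm Hg) from the fitted regression line."""
    return a + b * bmi

print(predict_sbp(3.5))  # approximately 145.8 mm Hg
```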
2.3 Evaluating the regression model
After the least squares regression line is constructed, it must be evaluated to determine
whether it adequately describes the relationship between the response and explanatory
variables, and whether it can be used effectively for prediction purposes. This can be done
by consideration of the confidence intervals of the regression coefficients, tests of hypotheses
about the regression coefficients, and estimation of the coefficient of determination.
2.3.1 Estimate confidence interval for regression coefficients
The uncertainty of the regression coefficients is considered by the confidence intervals of
the population regression coefficients.
The t distribution is used to make confidence intervals for the regression coefficients. The
standard deviation of the residual can be defined as:
s_{y|x} = √[ Σ_{i=1}^{n} (y_i − ŷ_i)² / (n − 2) ] ,    (13)
the standard error of the intercept can be defined as:
se(a) = s_{y|x} × √[ 1/n + x̄² / Σ_{i=1}^{n} (x_i − x̄)² ] ,    (14)
and the standard error of the slope can be defined as:
se(b) = s_{y|x} / √[ Σ_{i=1}^{n} (x_i − x̄)² ] .    (15)
Therefore, the confidence interval for the population intercept is given as:
a − (t_{1−α/2} × se(a))  to  a + (t_{1−α/2} × se(a)) ,    (16)

and the confidence interval for the population slope is given as:

b − (t_{1−α/2} × se(b))  to  b + (t_{1−α/2} × se(b)) ,    (17)

where t_{1−α/2} is the appropriate value from the t distribution with n − 2 degrees of freedom
associated with a confidence level of (1 − α)100%.
For the regression model of SBP on BMI, the 95% CIs for the regression coefficients can
be estimated as follows:
s_{y|x} = √(2888.02 / 30) = 9.81 ,

this estimate is used to compute

se(a) = 9.81 × √(1/32 + 3.44²/7.66) = 12.32

and

se(b) = 9.81 / √7.66 = 3.55 .

For a 95% confidence level with 30 degrees of freedom, the value of t_{1−α/2} = 2.042. Recall
that a and b are 70.58 and 21.49, respectively. Therefore, the 95% CI for α is

70.58 − (2.042 × 12.32) to 70.58 + (2.042 × 12.32), that is, from 45.41 to 95.74,

and the 95% CI for β is

21.49 − (2.042 × 3.55) to 21.49 + (2.042 × 3.55), that is, from 14.25 to 28.73.
Conclusion:
The 95% CI of the slope β lies between 14.25 and 28.73, which does not include zero. We
are 95% confident that the true population slope lies between 14.25 and 28.73, and because
the interval excludes zero, the slope differs significantly from zero at the 5% level.
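Equations (13)–(17) with the summary numbers quoted in this section (SSE = 2888.02, Σ(x_i − x̄)² = 7.66, x̄ = 3.44, n = 32, t = 2.042) reproduce the confidence intervals. A Python sketch; because the inputs are rounded, the last digit can differ slightly from Stata’s:

```python
import math

n, sse, sxx, xbar = 32, 2888.02, 7.66, 3.44
a, b, tcrit = 70.58, 21.49, 2.042

s = math.sqrt(sse / (n - 2))                   # Eq. (13): about 9.81
se_a = s * math.sqrt(1 / n + xbar ** 2 / sxx)  # Eq. (14): about 12.32
se_b = s / math.sqrt(sxx)                      # Eq. (15): about 3.55
ci_a = (a - tcrit * se_a, a + tcrit * se_a)    # Eq. (16)
ci_b = (b - tcrit * se_b, b + tcrit * se_b)    # Eq. (17)
print([round(v, 2) for v in ci_b])  # [14.25, 28.73]
```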
2.3.2 Tests of hypotheses about regression coefficients
This section is concerned with tests of hypotheses about the population regression
coefficients. The slope is usually more important in the regression model
than the intercept. It quantifies the average change in y that corresponds to each unit
change in x. We can test the null hypothesis that the population slope is equal to zero. If the
population slope is zero, it means that there is no linear relationship between explanatory
and response variables. The test of hypothesis can be done either by using analysis of
variance and the F statistic or by using the t statistic.
1) Hypothesis testing with F statistic
The hypothesis testing with F statistic is based upon the ANOVA table. The principle of
this test is to partition the total variation into two components: explained variation and
unexplained variation.
The total variation of the observed values of y is measured by the total sum of squares
(SST). It is the sum of the squares of the differences between each observation and the
overall mean. This is calculated as:
SST = Σ_{i=1}^{n} (y_i − ȳ)²    (18)
The unexplained variation of the observed values of y is measured by the error sum of
squares (SSE). It is the sum of the squares of the differences between each observation and
its predicted values. This is calculated as:
SSE = Σ_{i=1}^{n} (y_i − ŷ_i)²    (19)
The explained variation of the observed values of y is measured by the sum of squares due
to linear regression (SSR). It is the sum of the squares of the differences between each
predicted value and the overall mean. This is calculated as:
SSR = Σ_{i=1}^{n} (ŷ_i − ȳ)²    (20)
Thus, SSR is the portion of SST that is explained by the use of the regression model, and
SSE is the portion of SST that is not explained by the use of the regression model.
The value of the F test can be defined as:
F = MSR / MSE = (SSR / 1) / (SSE / (n − 2))    (21)
The general form of ANOVA table for simple linear regression is presented in Table 2-1.
Table 2-1 ANOVA table for simple linear regression
Source of variation    Degrees of freedom    Sums of squares    Mean squares    F
Linear regression              1                   SSR              MSR          MSR/MSE
Residual                     n - 2                 SSE              MSE
Total                        n - 1                 SST              MST
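Using the sums of squares reported in the Stata output for Example 1-1 (Model SS = 3537.95, Residual SS = 2888.02), the ANOVA quantities and the F statistic of Equation (21) can be reproduced in a few lines. A Python sketch with rounded inputs:

```python
n = 32
ssr, sse = 3537.95, 2888.02      # model and residual sums of squares (Stata output)
sst = ssr + sse                  # total sum of squares: 6425.97

msr = ssr / 1                    # mean square for regression (1 df)
mse = sse / (n - 2)              # mean square error (n - 2 df)
f = msr / mse                    # Equation (21)
print(round(mse, 2), round(f, 2))  # 96.27 36.75
```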
2) Hypothesis testing with t statistic
The statistical test for testing the hypothesis H_0: β = 0 using the t statistic can be
defined as:

t = (b − 0) / se(b)    (22)
Under the null hypothesis, this ratio has a t distribution with n-2 degrees of freedom.
For the regression model of SBP on BMI, the hypothesis testing of the slope of the
regression line can be performed as follows:
1. Generate null and alternative hypotheses
- The null hypothesis is that the slope is equal to zero
- The alternative hypothesis is that the slope is not equal to zero
H_0: β = 0
H_A: β ≠ 0
2. Compute the statistical test
t = (21.49 − 0) / 3.55 = 6.06
3. Compute the associated p value by the STATA program
. disp tprob(30,6.06)
Then we get the associated p value < 0.001.
4. Draw a conclusion
For the hypothesis H_0: β_BMI = 0, the calculated statistic is t = 6.06 with
32 − 2 = 30 degrees of freedom. The associated p-value is <0.001. We therefore reject the null
hypothesis that the population slope is equal to zero and conclude that there is a statistically
significant straight-line association between SBP and BMI. That is, for each one-unit
increase in BMI, SBP increases by about 21.5 mm Hg.
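Equation (22) with the unrounded Stata estimates reproduces the reported t. A Python sketch:

```python
b, se_b = 21.4917, 3.5451   # coefficient and standard error from the Stata output
t = (b - 0) / se_b          # Equation (22)
print(round(t, 2))  # 6.06
```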
If we are interested in testing the null hypothesis that the intercept is equal to a specified
value α_0, we use calculations that are analogous to those for the slope. The statistical test
can be defined as:
t = (a − α_0) / se(a)    (23)
However, the intercept is not usually of interest. There is very little practical value in
making inferences about the intercept.
2.3.3 The coefficient of determination
The coefficient of determination is denoted by R². It is the square of the Pearson
correlation coefficient r; consequently, R² = r². R² represents the proportion of the total
variation of the observed values of y that is explained by the linear regression model. In
other words, R² represents how well the independent variable explains the
dependent variable in the regression model. R² is calculated as:

R² = SSR / SST    (24)
The STATA output of simple linear regression:

. regress sbp quet

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =   36.75
       Model |  3537.94585     1  3537.94585           Prob > F      =  0.0000
    Residual |   2888.0229    30  96.2674299           R-squared     =  0.5506
-------------+------------------------------           Adj R-squared =  0.5356
       Total |  6425.96875    31  207.289315           Root MSE      =  9.8116

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        quet |   21.49167   3.545147     6.06   0.000     14.25151    28.73182
       _cons |   70.57641   12.32187     5.73   0.000      45.4118    95.74102
------------------------------------------------------------------------------
From the STATA output of simple linear regression analysis, the R2 is 0.5506. This
implies that 55.06% of the variation among the observed values of the SBP is explained by
its linear relationship with BMI. The remaining 44.94% (100-55.06%) of variation is
unexplained or, in other words, is explained by other variables that are not
considered here.
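Equation (24) with the sums of squares from the Stata output confirms both R² and its relation to r². A Python sketch with rounded inputs:

```python
ssr, sst = 3537.95, 6425.97   # model and total sums of squares (Stata output)
r2 = ssr / sst                # Equation (24)
r = 0.7420                    # Pearson's r from Example 1-1
print(round(r2, 4))  # 0.5506, which also equals r squared up to rounding
```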
2.3.4 Assumptions underlying simple linear regression
It is important to consider assumptions that underlie the simple linear regression model
before using a regression analysis. The assumptions of the simple linear regression model
are described below:
1) Normality
The values of the response variable Y have a normal distribution for each value of the
explanatory variable X. If this assumption holds, the residuals should also
have a normal distribution; they are assumed to be independent normal random variables
with mean 0 and constant variance (σ²). Violation of this assumption results in an invalid
model, so the assumption needs to be checked after fitting the simple linear regression
model. It can be assessed graphically by a normal probability plot of the residuals and
confirmed formally by a statistical test. Checking this assumption can be performed as
follows:
1. Estimate the residuals
After fitting the regression model with the regress command, the residuals can be estimated
with the predict command and the resid option:
predict res, resid
The STATA stores the results of predicting residuals in the variable "res" as follows:

. list sbp bmi res in 1/5

     +-------------------------+
     | sbp     bmi         res |
     |-------------------------|
  1. | 152   4.116   -7.036115 |
  2. | 164    4.01    7.242001 |
  3. | 135   2.876    2.613559 |
  4. | 130     3.1   -7.200574 |
  5. | 137   3.296   -4.412943 |
     +-------------------------+
2. Describe residuals distribution
. sum res, detail

                          Residuals
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -19.23081      -19.23081
 5%    -18.44582      -18.44582
10%    -10.51667       -11.5153       Obs                  32
25%    -7.160689      -10.51667       Sum of Wgt.          32
50%    -1.603864                      Mean           4.47e-08
                        Largest       Std. Dev.      9.652048
75%     8.117427       11.79569
90%     11.79569        12.1004       Variance       93.16203
95%     12.59216       12.59216       Skewness       .1254792
99%     22.53132       22.53132       Kurtosis       2.598663
The output from the STATA program indicates that the mean (effectively 0) is slightly
greater than the median of -1.60, and the skewness of 0.13 is small.
3. Construct a normal probability plot
Construct a normal probability plot of residuals by using pnorm command as:
pnorm res
Figure 2-2 Standardized normal probability plot of residuals
Figure 2-2 shows that the residuals have little departure from the diagonal line, suggesting
that they are approximately normally distributed.
4. Hypothesis testing about normality
The assumption of normality can be confirmed by a statistical test known as the
Shapiro-Wilk test. We can do this by typing:
. swilk res

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V          z    Pr > z
-------------+-------------------------------------------------
         res |     32    0.97006      0.999     -0.003   0.50106
The test is performed under the null hypothesis that the residuals are normally distributed.
The Shapiro-Wilk statistic equals 0.97006 and its associated p-value is 0.50106. We
therefore fail to reject the null hypothesis and conclude that the residuals are normally
distributed.
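The skewness and kurtosis figures reported by Stata’s summarize are simple moment computations. A Python sketch of the moment-based definitions (illustrative data, not the module’s residuals):

```python
def skewness_kurtosis(e):
    """Moment-based skewness and kurtosis, as reported by Stata's summarize.
    Skewness 0 and kurtosis near 3 are consistent with normality."""
    n = len(e)
    m = sum(e) / n
    m2 = sum((x - m) ** 2 for x in e) / n
    m3 = sum((x - m) ** 3 for x in e) / n
    m4 = sum((x - m) ** 4 for x in e) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# A symmetric sample has skewness exactly 0.
sk, ku = skewness_kurtosis([-2, -1, 0, 1, 2])
print(sk)  # 0.0
```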
2) Homoskedasticity
The variability of the response variable Y, which is assessed by the variance, is the same
for each value of the explanatory variable X. This assumption indicates that the spread of
points around the regression line is similar for all x values. Considering in terms of the
residuals, the variance of residuals should be the same for each value of the explanatory
variable X. The assumption of homoskedasticity can be assessed after fitting the regression
model by plotting the residuals against the independent variable x or the predicted values
of the dependent variable y. If the residuals lie in a band parallel to the horizontal line, the
variance is constant over the values of x. On the other hand, if the residuals increase or
decrease as the value of x or the predicted value of y becomes larger, then the variance of
the residuals (σ²_{y|x}) is not constant across the values of x or the predicted values of y.
Violation of this assumption is called heteroskedasticity. We can examine this assumption
as follows:

1. Estimate the predicted values of y
After fitting a regression model with the regress command, the predicted values
of y can be estimated with the predict command and the xb option:
predict yhat, xb
The STATA stores the predicted values of y in the variable "yhat" as follows:

. list sbp bmi yhat in 1/5

     +------------------------+
     | sbp     bmi       yhat |
     |------------------------|
  1. | 152   4.116   159.0361 |
  2. | 164    4.01    156.758 |
  3. | 135   2.876   132.3864 |
  4. | 130     3.1   137.2006 |
  5. | 137   3.296   141.4129 |
     +------------------------+
2. Plot the residuals against the predicted values of y
To plot the residuals against the predicted values of y, use the rvfplot
command as:
rvfplot, yline(0)
Figure 2-3 Residuals from the regression model, plotted against the fitted values of SBP
Figure 2-3 shows that the residuals are randomly scattered below and above the horizontal line and do not show any distinct trend. That is, the assumption of a straight-line relationship between SBP and BMI appears reasonable, which corresponds with Figure 2-1. The residuals do not appear to increase or decrease as the predicted values increase. This indicates that the assumption of homoskedasticity is valid.
3. Hypothesis testing about homoskedasticity
In addition, constant variance can be checked by using Breusch-Pagan / Cook-Weisberg
test as below.
estat hettest res
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: res

         chi2(1)      =     0.25
         Prob > chi2  =   0.6157
The test is performed under the null hypothesis that the variance is constant across the predicted values of y. The Cook-Weisberg statistic equals 0.25 and its p-value is 0.6157. We therefore fail to reject the null hypothesis and conclude that the variance is constant across the predicted values of y.
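The mechanics of the test can also be sketched outside STATA. The version below is the studentized (Koenker) n·R² variant, which regresses the squared residuals on the fitted values and refers n·R² to a chi-squared(1) distribution; STATA's default hettest uses a slightly different scaling, and the data here are simulated rather than the handout's:

```python
# Sketch of a Breusch-Pagan-type test (Koenker's n*R^2 variant) on
# simulated data; not the handout's dataset or STATA's exact scaling.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(2.5, 4.5, 32)                     # hypothetical predictor
y = 70.6 + 21.5 * x + rng.normal(0.0, 9.8, 32)    # homoskedastic errors

# Least-squares fit of y on x; residuals and fitted values.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Auxiliary regression: squared residuals on the fitted values.
Z = np.column_stack([np.ones_like(fitted), fitted])
gamma, *_ = np.linalg.lstsq(Z, resid ** 2, rcond=None)
sq = resid ** 2
r2 = 1.0 - np.sum((sq - Z @ gamma) ** 2) / np.sum((sq - sq.mean()) ** 2)

lm = len(y) * r2                                  # test statistic
p_value = stats.chi2.sf(lm, df=1)                 # large p: variance constant
print(round(lm, 2), round(p_value, 3))
```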
3) Outliers
Outliers are extreme values and can result in an invalid model. They can be checked with a plot of the standardized residuals against the values of the independent variable X. The standardized residual for each observation is defined as:

z_i = e_i / S,

where S refers to the standard deviation of the residuals, which is defined as:

S = √[ Σ_{i=1}^{n} e_i² / (n − 2) ].

If a residual lies a long distance from the rest of the observations, usually four or more standard deviations from zero, it is called an outlier.
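This computation is simple enough to sketch directly; the residuals below are hypothetical, not the handout's data:

```python
# Standardized residuals z_i = e_i / S, with S = sqrt(sum(e_i^2) / (n - 2)).
import numpy as np

e = np.array([5.2, -3.1, 24.9, 4.4, -6.0, 1.3])   # hypothetical residuals
n = len(e)
S = np.sqrt(np.sum(e ** 2) / (n - 2))             # residual standard deviation
z = e / S
# Flag observations four or more standard deviations from zero.
print(np.round(z, 2), bool(np.any(np.abs(z) >= 4)))
```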
If outliers are present, the data should be checked to make sure that there was no error during data entry or measurement. If the error came from measurement, that observation must be omitted, because outliers can result in an invalid model. Steps for checking for outliers are as follows:
1. Estimate the standardized residuals
After fitting the regression model with the regress command, the standardized residuals can be estimated with the predict command and the rstandard option:
predict rstd, rstandard
The STATA stores the results of the standardized residual in the variable “rstd” as follows:
STATA stores the standardized residuals in the variable “rstd” as follows:

. list person sbp bmi rstd in 1/5

     +----------------------------------+
     | person   sbp     bmi       rstd |
     |----------------------------------|
  1. |     11   166   3.877   1.269367 |
  2. |      6   129    2.79  -.1640324 |
  3. |      9   144   2.368   2.538403 |
  4. |      7   162   3.668   1.308478 |
  5. |      5   146   2.979   1.197833 |
     +----------------------------------+
2. Create the standardized residual plot
After estimating the standardized residuals, the next step is to plot the standardized residuals versus the values of the independent variable, using the STATA command:
twoway (scatter rstd quet, sort yline(0))
Figure 2-5 Outliers: Standardized residuals versus BMI
This figure suggests that there is one relatively large residual, but it does not exceed 4 standard deviations from zero. Therefore, this observation is not an outlier of concern in this case.
2.4 Using the regression model
There are two ways in which the model can be used. First, it can be used to predict the mean value of y for a given value of x. Second, it can be used to predict an individual value of y for a new member of the population. Although the predicted mean and predicted individual values of y are numerically equivalent for any particular value of x, the confidence interval for a predicted individual value is wider than that for a predicted mean value. The details of the computations are presented as follows:
2.4.1 The predicted mean values
The predicted mean value of y can be defined as:

ŷ = a + bx.     (25)
For our example on the SBP and Quetelet index in Example 1-1, the linear regression
model is:
ŷ = 70.58 + 21.49x.
Using this linear regression model, we can find the predicted mean value of y for any
specific value of x. For example, we want to estimate the mean SBP for all subjects with
Quetelet index of 3.8 (x=3.8), the predicted mean value of SBP for this Quetelet index is
ŷ = 70.58 + (21.49 × 3.8) = 152.24 mm Hg.
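This arithmetic is easy to verify by hand or in any language; a short Python check:

```python
# Predicted mean SBP at Quetelet index 3.8 from the fitted line.
a, b = 70.58, 21.49        # intercept and slope of the fitted model
y_hat = a + b * 3.8
print(round(y_hat, 2))     # 152.24
```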
This value of ŷ can also be interpreted as a point estimate of the mean value of y when the Quetelet index = 3.8. Thus, we can state that, on average, the SBP is 152.24 mmHg for all subjects whose Quetelet index = 3.8. An example of using the regression line to predict SBP for any specific value of the Quetelet index is presented in Figure 2-6.
Figure 2-6 Scatter plot and regression line of relationship between BMI and SBP
To construct a confidence interval for predicted mean values, we must know the predicted
mean, the standard error of the predicted mean, and the distribution of the predicted mean.
The t distribution is also used to make confidence intervals for the predicted mean values.
The standard error of ŷ_i for a given value of x, say x_0, is estimated by:

se(ŷ_i) = s_{y|x} √[ 1/n + (x_0 − x̄)² / Σ_{i=1}^{n} (x_i − x̄)² ].     (26)
Therefore, the confidence interval is given by:

ŷ_i − ( t_{1−α/2} × se(ŷ_i) )  to  ŷ_i + ( t_{1−α/2} × se(ŷ_i) ),     (27)

where t_{1−α/2} is the appropriate value from the t distribution with n−2 degrees of freedom associated with a confidence level of (1−α)100%.
Referring to Example 1-1 on SBP and Quetelet index, the standard error of the predicted mean is thus:

se(ŷ_i) = 9.81 √[ 1/32 + (3.80 − 3.44)² / 7.66 ] = 2.15.
For a 95% confidence level with 30 degrees of freedom, the value of t_{1−α/2} = 2.042.
Therefore, 95% CI for the predicted mean value of SBP for all subjects with Quetelet
index=3.8 is
152.24 − (2.042 × 2.15) to 152.24 + (2.042 × 2.15), that is, from 147.85 to 156.64.
Thus, with 95% confidence, we can state that the mean SBP for all subjects with Quetelet
index of 3.8 is between 147.85 and 156.64 mmHg. The procedures for construction of the
95% confidence intervals for predicted mean values using STATA are described as
follows:
1. Estimate the predicted values of y
After fitting the regression model with the regress command, the predicted values of y can be obtained with the predict command and the xb option:
predict yhat, xb
STATA calculates the predicted mean values of y and stores the results in the variable
“yhat” as follows:
list sbp quet yhat in 1/5

     +------------------------+
     | sbp    quet      yhat |
     |------------------------|
  1. | 152   4.116  159.0361 |
  2. | 164    4.01   156.758 |
  3. | 135   2.876  132.3864 |
  4. | 130     3.1  137.2006 |
  5. | 137   3.296  141.4129 |
     +------------------------+
2. Estimate the standard errors of predicted mean values
The estimation of the standard errors of the predicted values of y can be done by predict
command with option stdp as:
predict se_m, stdp
The STATA calculates the standard errors of the predicted mean value of SBP for any
given value of Quetelet index and stores the results in variable “se_m” as follows:
list sbp quet yhat se_m in 1/5
     +-----------------------------------+
     | sbp    quet      yhat       se_m |
     |-----------------------------------|
  1. | 152   4.116  159.0361   2.955181 |
  2. | 164    4.01   156.758   2.660088 |
  3. | 135   2.876  132.3864   2.649855 |
  4. | 130     3.1  137.2006   2.114377 |
  5. | 137   3.296  141.4129   1.809128 |
     +-----------------------------------+
3. Estimate the value of the t distribution
To get the value of the t distribution, we use the STATA function invttail(df, p). For this example, we need the value of the t distribution with df = 32 − 2 = 30, so we would type:
disp invttail(30,0.025)
2.0422725
The value of t distribution for this example is 2.042.
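The same critical value can be cross-checked in Python with scipy; t.ppf(0.975, 30) is the quantile that invttail(30, 0.025) returns:

```python
# 97.5th percentile of the t distribution with 30 degrees of freedom,
# i.e. the two-sided 95% critical value used above.
from scipy import stats

t_crit = stats.t.ppf(0.975, df=30)
print(round(t_crit, 4))    # 2.0423
```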
4. Construct the 95% CIs for predicted mean values
To construct the 95% CIs for predicted mean values of SBP, we would type:
gen lower_m=yhat-(2.042*se_m)
gen upper_m=yhat+(2.042*se_m)
The lower and upper values of the 95% CI for the predicted mean values of SBP are
stored in the variables “lower_m” and “upper_m”, respectively.
list sbp quet yhat se_m lower_m upper_m in 1/5

     +---------------------------------------------------------+
     | sbp    quet      yhat       se_m    lower_m    upper_m |
     |---------------------------------------------------------|
  1. | 152   4.116  159.0361   2.955181   153.0016   165.0706 |
  2. | 164    4.01   156.758   2.660088   151.3261   162.1899 |
  3. | 135   2.876  132.3864   2.649855   126.9754   137.7975 |
  4. | 130     3.1  137.2006   2.114377    132.883   141.5181 |
  5. | 137   3.296  141.4129   1.809128   137.7187   145.1072 |
     +---------------------------------------------------------+
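As a cross-check of the hand computation above, equations (26) and (27) can be evaluated directly with the summary figures quoted in this handout (s_{y|x} = 9.81, n = 32, x̄ = 3.44, Σ(x_i − x̄)² = 7.66); this is an illustrative sketch, not a re-analysis of the data:

```python
# Standard error and 95% CI of the predicted mean SBP at x0 = 3.8,
# using summary figures from the handout's worked example.
import math

s_yx, n, xbar, sxx = 9.81, 32, 3.44, 7.66
x0, y_hat, t_crit = 3.8, 152.24, 2.042

se_mean = s_yx * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)   # eq. (26)
lower = y_hat - t_crit * se_mean                             # eq. (27)
upper = y_hat + t_crit * se_mean
print(round(se_mean, 2), round(lower, 2), round(upper, 2))
```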
2.4.2 The predicted individual values
Sometimes, instead of predicting the mean value of y for a given value of x, we would prefer to predict an individual value of y for a new member of the population. The individual outcome of y is denoted by ỹ. The predicted individual value of y is identical to the predicted mean value of y; in particular,

ỹ = a + bx.     (28)

However, the standard error of ỹ is not the same. We incorporate an extra source of variability, the dispersion of the individual y values about their mean, into the expression for the standard error of ỹ_i. This term is not included in the expression for the standard error of ŷ_i. Therefore, the confidence interval for a predicted individual value is much wider than the confidence interval for a predicted mean value. The standard error of ỹ_i for a given value of x, say x_0, is estimated by
of x, say 0x , is estimated by
∑=
∧
−
−++= n
ii
xyi
xx
xxn
syse
1
2
20
|
)(
)(11)~( , (29)
and the confidence interval is given by:

ỹ_i − ( t_{1−α/2} × se(ỹ_i) )  to  ỹ_i + ( t_{1−α/2} × se(ỹ_i) ),     (30)

where t_{1−α/2} is the appropriate value from the t distribution with n−2 degrees of freedom associated with a confidence level of (1−α)100%.
Return once again to the SBP and Quetelet index data. When the Quetelet index is 3.8 (x = 3.8), the standard error of the predicted individual value is thus:

se(ỹ_i) = 9.81 √[ 1 + 1/32 + (3.80 − 3.44)² / 7.66 ] = 10.04.
For a 95% confidence level with 30 degrees of freedom, the value of t_{1−α/2} = 2.042.
Therefore, 95% CI for the predicted individual value of SBP for a Quetelet index=3.8 is
152.24 − (2.042 × 10.04) to 152.24 + (2.042 × 10.04), that is, from 131.73 to 172.76,
which is considerably wider than the 95% confidence interval for the predicted mean.
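The widening is easy to see numerically: equation (29) differs from equation (26) only by the extra 1 under the square root. Using the same summary figures as in the worked example (a sketch, not a re-analysis of the data):

```python
# Comparing the standard errors for the predicted mean (eq. 26) and the
# predicted individual value (eq. 29) at x0 = 3.8, using the handout's
# worked-example figures (9.81, 32, 3.44, 7.66).
import math

s_yx, n, xbar, sxx, x0 = 9.81, 32, 3.44, 7.66, 3.8

se_mean = s_yx * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)       # eq. (26)
se_indiv = s_yx * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)  # eq. (29)
print(round(se_mean, 2), round(se_indiv, 2))                     # 2.15 10.04
```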
The procedures for construction of the 95% confidence intervals for predicted individual values using STATA are described as follows:
1. Estimate the predicted values of y
After fitting the regression model with the regress command, the predicted values of y can be obtained with the predict command and the xb option:
predict yhat, xb
STATA calculates the predicted mean values of y and stores the results in the variable
“yhat” as follows:
list sbp quet yhat in 1/5

     +------------------------+
     | sbp    quet      yhat |
     |------------------------|
  1. | 152   4.116  159.0361 |
  2. | 164    4.01   156.758 |
  3. | 135   2.876  132.3864 |
  4. | 130     3.1  137.2006 |
  5. | 137   3.296  141.4129 |
     +------------------------+
2. Estimate the standard errors of predicted individual values
The estimation of the standard error of the predicted values of y can be done by predict
command with option stdf as:
predict se_i, stdf
STATA calculates the standard error of the predicted individual value of SBP for any given value of the Quetelet index and stores the results in the variable “se_i” as follows:
list sbp quet yhat se_i in 1/5

     +-----------------------------------+
     | sbp    quet      yhat       se_i |
     |-----------------------------------|
  1. | 152   4.116  159.0361   10.24698 |
  2. | 164    4.01   156.758    10.1658 |
  3. | 135   2.876  132.3864   10.16313 |
  4. | 130     3.1  137.2006   10.03683 |
  5. | 137   3.296  141.4129   9.976993 |
     +-----------------------------------+
3. Estimate the value of the t distribution
To get the value of the t distribution, we use the STATA function invttail(df, p). For this example, we need the value of the t distribution with df = 32 − 2 = 30, so we would type:

disp invttail(30,0.025)
2.0422725
The value of t distribution for this example is 2.042.
4. Construct the 95% CIs for predicted individual values
To construct the 95% CIs for predicted individual values of SBP, we would type:
gen lower_i=yhat-(2.042*se_i)
gen upper_i=yhat+(2.042*se_i)
The lower and upper values of the 95% CI for the predicted individual values of SBP are stored in the variables “lower_i” and “upper_i”, respectively.
list sbp quet yhat se_i lower_i upper_i in 1/5

     +---------------------------------------------------------+
     | sbp    quet      yhat       se_i    lower_i    upper_i |
     |---------------------------------------------------------|
  1. | 152   4.116  159.0361   10.24698   138.1118   179.9604 |
  2. | 164    4.01   156.758    10.1658   135.9994   177.5166 |
  3. | 135   2.876  132.3864   10.16313   111.6333   153.1396 |
  4. | 130     3.1  137.2006   10.03683   116.7054   157.6958 |
  5. | 137   3.296  141.4129   9.976993   121.0399    161.786 |
     +---------------------------------------------------------+
2.5 Dummy tables
Before analyzing the data, you have to plan how to present the results by using dummy tables. A dummy table is a table that helps you visualize how your output will appear after analysis. This is a way of planning out your work. The dummy table for determining the association between patients' characteristics and changes of systolic blood pressure is presented in Table 2-2.
Table 2-2 Association between patients' characteristics and changes of systolic blood pressure

Characteristics    Coefficient    95% CI    P value
Age
  ≥ 60
  45-59
  < 45                  0*
BMI

*Coefficient is equal to zero for the reference group.
Assignment IV Correlation & simple linear regression (20%)
Due date: Oct 1, 2015

From the randomized controlled trial of calcium supplements, researchers wanted to predict the change of total left femur from age. The total left femur and age were stored under the variables ‘total’ and ‘age’, respectively.
The data are given in the data set cross-sectional_BMD_&_risk_factor.dta
a. Construct a two-way scatter plot of ‘total left femur’ versus ‘age’.
b. Does the graph suggest anything about the relationship between these variables?
c. Using ‘total left femur’ as the response variable and ‘age’ as the explanatory
variable, construct the least-squares regression line. Interpret the slope of this
equation.
d. At the significance level of .05, test the null hypothesis that the slope = 0.
e. What is the estimated mean of total left femur for the population of subjects whose
age is 65 years?
f. For the subjects whose age is 65 years, construct a 95% CI of the true mean
predicted total left femur.
g. For the subjects whose age is 65 years, construct a 95% CI of the true individual
predicted total left femur.
h. Does the regression model seem to fit well with the observed data? Comment on the
coefficient of determination and check assumptions.