Linear regression
Brian Healy, PhD
BIO203
Previous classes
- Hypothesis testing
  - Parametric
  - Nonparametric
- Correlation
What are we doing today?
- Linear regression
  - Continuous outcome with continuous, dichotomous or categorical predictor
  - Equation: E(Y|X=x) = β0 + β1x
- Interpretation of coefficients
- Connection between regression and
  - correlation
  - t-test
  - ANOVA
Big picture
- Linear regression is the most commonly used statistical technique. It allows the comparison of dichotomous, categorical and continuous predictors with a continuous outcome.
- Extensions of linear regression allow
  - Dichotomous outcomes: logistic regression
  - Survival analysis: Cox proportional hazards regression
  - Repeated measures
- Amazingly, many of the analyses we have learned can be completed using linear regression.
Example
- Yesterday, we investigated the association between age and BPF using a correlation coefficient.
- Can we fit a line to this data?

[Scatter plot: BPF (0.75-0.95) vs. age (20-60)]
Quick math review
- As you remember from high school math, the basic equation of a line is given by y = mx + b, where m is the slope and b is the y-intercept.
- One definition of m is that for every one unit increase in x, there is an m unit increase in y.
- One definition of b is that it is the value of y when x is equal to zero.

[Plot of the line y = 1.5x + 4]
Picture
- Look at the data in this picture.
- Does there seem to be a correlation (linear relationship) in the data?
- Is the data perfectly linear?
- Could we fit a line to this data?

[Scatter plot of example data]
What is linear regression?
- Linear regression tries to find the best line (curve) to fit the data.
- The method of finding the best line (curve) is least squares, which minimizes the sum of the squared distances from the line for each of the points.

[Scatter plot with fitted line y = 1.5x + 4]
How do we find the best line?
- Let's look at three candidate lines.
- Which do you think is the best?
- What is a way to determine the best line to use?
Residuals
- The actual observations, yi, may be slightly off the population line because of variability in the population. The equation is yi = β0 + β1xi + εi, where εi is the deviation from the population line (see picture).
- This is called the residual.

[Picture: the residual is the distance from the line for patient 1]
Least squares
- The method employed to find the best line is called least squares. This method finds the values of β0 and β1 that minimize the squared vertical distance from the line to each of the points. This is the same as minimizing the sum of the squared εi:

  Σ(i=1..n) εi² = Σ(i=1..n) (yi − β0 − β1xi)²
Estimates of regression coefficients
- Once we have solved the least squares equation, we obtain estimates for the β's, which we refer to as β̂0 and β̂1:

  β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
  β̂0 = ȳ − β̂1·x̄

- The final least squares equation is ŷ = β̂0 + β̂1·x1, where ŷ is the mean value of y for a value of x1.
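The closed-form estimates above can be sketched in Python on hypothetical data (not the course's BPF data); `np.polyfit` minimizes the same least-squares criterion, so it serves as a cross-check:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.6, 7.1, 8.4, 10.2, 11.4])

# Closed-form least-squares estimates from the slide:
#   beta1_hat = sum (xi - xbar)(yi - ybar) / sum (xi - xbar)^2
#   beta0_hat = ybar - beta1_hat * xbar
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

# np.polyfit fits by minimizing the same sum of squared residuals
b1_check, b0_check = np.polyfit(x, y, 1)
```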
Assumptions of linear regression
- Linearity
  - Linear relationship between outcome and predictors
  - E(Y|X=x) = β0 + β1x1 + β2x2² is still a linear regression equation because each of the β's is to the first power.
- Normality of the residuals
  - The residuals, εi, are normally distributed, N(0, σ²).
- Homoscedasticity of the residuals
  - The residuals, εi, have the same variance.
- Independence
  - All of the data points are independent.
  - Correlated data points can be taken into account using multivariate and longitudinal data methods.
Linearity assumption
- One of the assumptions of linear regression is that the relationship between the predictors and the outcomes is linear.
- We call this the population regression line: E(Y|X=x) = μy|x = β0 + β1x
- This equation says that the mean of y given a specific value of x is defined by the β coefficients.
- The coefficients act exactly like the slope and y-intercept from the simple equation of a line from before.
Normality and homoscedasticity assumptions
- Two other assumptions of linear regression are related to the εi's:
  - Normality: the distribution of the residuals is normal.
  - Homoscedasticity: the variance of y given x is the same for all values of x.

[Picture: the distribution of y-values at each value of x is normal with the same variance]
Example
- Here is a regression equation for the comparison of age and BPF:

  BPFi = β0 + β1·agei + εi
  E(BPF|age) = β0 + β1·age

[Scatter plot: BPF (0.75-0.95) vs. age (20-60)]
Results
- The estimated regression equation: BPF̂ = 0.957 − 0.0029·age

[Scatter plot with fitted line (predval), BPF vs. age (20-60)]

. regress bpf age

      Source |       SS       df       MS              Number of obs =      29
-------------+------------------------------           F(  1,    27) =   13.48
       Model |  .022226034     1  .022226034           Prob > F      =  0.0010
    Residual |  .044524108    27  .001649041           R-squared     =  0.3330
-------------+------------------------------           Adj R-squared =  0.3083
       Total |  .066750142    28  .002383934           Root MSE      =  .04061

         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0028799   .0007845    -3.67   0.001    -.0044895   -.0012704
       _cons |    .957443    .035037    27.33   0.000     .885553    1.029333

(_cons = estimated intercept; age = estimated slope)
Interpretation of regression coefficients
- The final regression equation is BPF̂ = 0.957 − 0.0029·age
- The coefficients mean:
  - The estimate of the mean BPF for a patient with an age of 0 is 0.957 (β̂0).
  - An increase of one year in age leads to an estimated decrease of 0.0029 in mean BPF (β̂1).
Unanswered questions
- Is the estimate of β1 (β̂1) significantly different than zero? In other words, is there a significant relationship between the predictor and the outcome?
- Have the assumptions of regression been met?
Estimate of variance for the β̂'s
- In order to determine if there is a significant association, we need an estimate of the variance of β̂0 and β̂1:

  se(β̂1) = sy|x / √( Σ(xi − x̄)² )
  se(β̂0) = sy|x · √( 1/n + x̄² / Σ(xi − x̄)² )

- sy|x is the residual standard deviation of y after accounting for x (the standard deviation from regression, or root mean square error).
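A minimal sketch of these standard-error formulas, using the same hypothetical data as before (not the slide's BPF data):

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.6, 7.1, 8.4, 10.2, 11.4])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# s_y|x: residual standard deviation, with n - 2 degrees of freedom
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))
Sxx = np.sum((x - x.mean()) ** 2)

se_b1 = s_yx / np.sqrt(Sxx)                       # se of the slope
se_b0 = s_yx * np.sqrt(1 / n + x.mean() ** 2 / Sxx)  # se of the intercept
```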
Test statistic
- For both regression coefficients, we use a t-statistic to test any specific hypothesis:

  t = (β̂1 − β1) / se(β̂1)
  t = (β̂0 − β0) / se(β̂0)

- Each has n − 2 degrees of freedom (this is the sample size minus the number of parameters estimated).
- What is the usual null hypothesis for β1?
Hypothesis test
1) H0: β1 = 0
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = -3.67 (27 dof)
5) p-value = 0.0011
6) Since the p-value is less than 0.05, we reject the null hypothesis.
7) We conclude that there is a significant association between age and BPF.
. regress bpf age

         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0028799   .0007845    -3.67   0.001    -.0044895   -.0012704
       _cons |    .957443    .035037    27.33   0.000     .885553    1.029333

(age row: estimated slope and the p-value for the slope; _cons row: estimated intercept)
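The slope t-test can be sketched on the same hypothetical data (scipy is assumed available for the t-distribution):

```python
import numpy as np
from scipy import stats  # assumed available

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.6, 7.1, 8.4, 10.2, 11.4])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))
se_b1 = s_yx / np.sqrt(np.sum((x - x.mean()) ** 2))

# t-statistic for H0: beta1 = 0, with n - 2 degrees of freedom
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
```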
Comparison to correlation
- In this example, we found a relationship between age and BPF. We also investigated this relationship using correlation.
- We get the same p-value!!
- Our conclusion is exactly the same!!
- There are other relationships we will see later.

  Method              p-value
  Correlation         0.0010
  Linear regression   0.0010
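This equivalence can be checked numerically on hypothetical data (scipy is assumed available): the test for Pearson's correlation and the t-test for the regression slope return identical p-values.

```python
import numpy as np
from scipy import stats  # assumed available

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.1, 3.9, 3.2, 4.8, 4.4])

# Test of H0: rho = 0
r, p_corr = stats.pearsonr(x, y)

# Simple linear regression; reg.pvalue tests H0: beta1 = 0
reg = stats.linregress(x, y)
```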
Confidence interval for β1
- As we have done previously, we can construct a confidence interval for the regression coefficients.
- Since we are using a t-distribution, we do not automatically use 1.96. Rather, we use the cut-off from the t-distribution.
- Interpretation of the confidence interval is the same as we have seen previously.

  ( β̂1 − t(1−α/2, dof)·se(β̂1),  β̂1 + t(1−α/2, dof)·se(β̂1) )
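A sketch of the 95% confidence interval for the slope on the hypothetical data, using the t cut-off rather than 1.96 (scipy is assumed available):

```python
import numpy as np
from scipy import stats  # assumed available for the t cut-off

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.6, 7.1, 8.4, 10.2, 11.4])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))
se_b1 = s_yx / np.sqrt(np.sum((x - x.mean()) ** 2))

t_cut = stats.t.ppf(0.975, df=n - 2)   # t cut-off with n - 2 dof, not 1.96
ci = (b1 - t_cut * se_b1, b1 + t_cut * se_b1)
```

With only n − 2 = 3 degrees of freedom, the cut-off (about 3.18) is much larger than 1.96, so the interval is noticeably wider than a naive normal-based one.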
Intercept
- STATA also provides a test statistic and p-value for the estimate of the intercept.
- This is for H0: β0 = 0, which is often not a hypothesis of interest because this corresponds to testing whether the BPF is equal to zero at age 0.
- Since BPF can't be 0 at age 0, this test is not really of interest.
- We can center covariates to make this test important.
Prediction
- Beyond determining if there is a significant association, linear regression can also be used to make predictions.
- Using the regression equation, we can predict the BPF for patients with specific age values.
  - Ex. a patient with age = 40:

  BPF̂ = 0.957 − 0.0029·40 = 0.841

- The expected BPF for a patient of age 40 based on our experiment is 0.841.
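The point prediction is just the fitted equation evaluated at the chosen age, using the estimates from the slide:

```python
# Fitted equation from the slide: BPF^ = 0.957 - 0.0029 * age
b0_hat, b1_hat = 0.957, -0.0029

# Predicted mean BPF for a patient aged 40
bpf_at_40 = b0_hat + b1_hat * 40
```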
Extrapolation
- Can we predict the BPF for a patient with age 80? What assumption would we be making?

[Scatter plot with fitted line (predval), BPF (0.75-0.95) vs. age (20-60)]
Confidence interval for prediction
- We can place a confidence interval around our predicted mean value.
- This corresponds to the plausible values for the mean BPF at a specific age.
- To calculate a confidence interval for the predicted mean value, we need an estimate of the variability in the predicted mean:

  se(ŷ) = sy|x · √( 1/n + (x − x̄)² / Σ(xi − x̄)² )
Confidence interval
- Note that the standard error equation has a different magnitude based on the x value. In particular, the magnitude is least when x equals the mean of x.
- Since the test statistic is based on the t-distribution, our confidence interval is

  ( ŷ − t(1−α/2, df)·se(ŷ),  ŷ + t(1−α/2, df)·se(ŷ) )

- This confidence interval is rarely used for hypothesis testing.

[Scatter plot with fitted line and confidence band, BPF vs. age (20-60)]
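The claim that se(ŷ) is smallest at x = x̄ can be checked directly on the hypothetical data:

```python
import numpy as np

# Hypothetical data, for illustration only (xbar = 3.0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.6, 7.1, 8.4, 10.2, 11.4])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))
Sxx = np.sum((x - x.mean()) ** 2)

def se_mean(x0):
    """Standard error of the estimated mean of y at x0 (slide's formula)."""
    return s_yx * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)
```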
Prediction interval
- A confidence interval for a mean provides information regarding the accuracy of an estimated mean value for a given sample size.
- Often, we are interested in how accurate our prediction would be for a single observation, not the mean of a group of observations. This is called a prediction interval.
- What would you estimate as the value for a single new observation?
- Do you think a prediction interval is narrower or wider?
Prediction interval
- Confidence intervals are always tighter than prediction intervals.
- The variability in the prediction of a single observation contains two types of variability:
  - Variability of the estimate of the mean (confidence interval)
  - Variability around the estimate of the mean (residual variability)

  s̃e(y) = √( sy|x² + se(ŷ)² )
  ( ŷ − t(1−α/2, df)·s̃e(y),  ŷ + t(1−α/2, df)·s̃e(y) )
[Scatter plot with fitted line and prediction band, BPF (0.7-1.0) vs. age (20-60)]
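The two variance components can be sketched directly; since the prediction standard error adds the residual variance, it is always larger than the standard error of the mean:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.6, 7.1, 8.4, 10.2, 11.4])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))
Sxx = np.sum((x - x.mean()) ** 2)

x0 = 2.5
se_mean = s_yx * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)  # CI component
se_pred = np.sqrt(s_yx ** 2 + se_mean ** 2)                   # PI component
```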
Conclusions
- The prediction interval is always wider than the confidence interval.
  - It is common to find significant differences between groups but not be able to predict very accurately for an individual.
  - To predict accurately for a single patient, we need limited overlap of the distributions. The benefit of an increased sample size decreasing the standard error does not help here.
Model checking

How good is our model?
- Although we have found a relationship between age and BPF, linear regression also allows us to assess how well our model fits the data.
- R² = coefficient of determination = proportion of variance in the outcome explained by the model.
  - When we have only one predictor, it is the proportion of the variance in y explained by x:

  R² = (sy² − sy|x²) / sy²
R²
- What if all of the variability in y was explained by x?
  - What would R² equal?
  - What does this tell you about the correlation between x and y?
  - What if the correlation between x and y is negative?
- What if none of the variability in y is explained by x?
  - What would R² equal?
  - What is the correlation between x and y in this case?
r vs. R²
- R² = (Pearson's correlation coefficient)² = r²
- Since r is between -1 and 1, R² is always less than or equal to |r|.
  - r = 0.1, R² = 0.01
  - r = 0.5, R² = 0.25

  Method   Estimate
  r        -0.577
  R²       0.333
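The identity R² = r² can be verified on hypothetical data by computing R² from the residual sum of squares and comparing it to the squared Pearson correlation:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.1, 3.9, 3.2, 4.8, 4.4])

r = np.corrcoef(x, y)[0, 1]   # Pearson's correlation coefficient

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# R^2 = proportion of the variance in y explained by x
r_sq = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
```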
Evaluation of model
- Linear regression required several assumptions:
  - Linearity
  - Homoscedasticity
  - Normality
  - Independence (usually from study design)
- We must determine if the model assumptions were reasonable or if a different model may have been needed.
- Statistical research has investigated relaxing each of these assumptions.
Scatter plot
- A good first step in any regression is to look at the x vs. y scatter plot. This allows us to see:
  - Are there any outliers?
  - Is the relationship between x and y approximately linear?
  - Is the variance in the data approximately constant for all values of x?
Tests for the assumptions
- There are several different ways to test the assumptions of linear regression:
  - Graphical
  - Statistical
- Many of the tests use the residuals, which are the distances between the outcomes and the fitted line:

  ε̂i = yi − ŷi = yi − β̂0 − β̂1xi
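Computing the residuals from a fit is a one-liner; a useful sanity check (on hypothetical data) is that least-squares residuals from a model with an intercept sum to zero and are uncorrelated with x:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.6, 7.1, 8.4, 10.2, 11.4])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residuals: observed minus fitted, as in the slide's equation
resid = y - (b0 + b1 * x)
```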
Residual plot

[Plot of residuals (-0.1 to 0.1) vs. fitted values (0.8-0.9)]

If the assumptions of linear regression are met, we will observe a random scatter of points.
Investigating linearity
- Scatter plot of predictor vs. outcome.
- What do you notice here?
- One way to handle this is to transform the predictor to include a quadratic or other term.

[Scatter plot showing a non-linear relationship]
Aging
- Research has shown that the decrease in BPF in normal people is pretty slow up until age 65, and then there is a steeper drop.

[Scatter plot: BPF (0.65-0.85) vs. age (40-80)]
Fitted line

[Scatter plot with fitted straight line, BPF (0.65-0.85) vs. age (40-80)]

Note how the majority of the values are above the fitted line in the middle and below the fitted line on the two ends.
What if we fit a line for this?
- The residual plot shows a non-random scatter because the relationship is not really linear.

[Plot of residuals (-0.05 to 0.05) vs. fitted values (0.72-0.82)]
What can we do?
- If the relationship between x and y is not linear, we can try a transformation of the values.
- Possible transformations:
  - Add a quadratic term.
  - Fit a spline. This is when there is one slope for a certain part of the curve and a different slope for the rest of the curve.
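Adding a quadratic term can be sketched on simulated (hypothetical) data. Note that the model is still a *linear* regression, because it remains linear in the β's; the quadratic fit can only lower the residual sum of squares relative to the straight line, since the line is a special case:

```python
import numpy as np

# Simulated hypothetical data with a genuinely quadratic trend
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 1.0 + 0.5 * x + 0.3 * x ** 2 + rng.normal(0, 0.5, x.size)

X_lin = np.column_stack([np.ones_like(x), x])           # beta0 + beta1*x
X_quad = np.column_stack([np.ones_like(x), x, x ** 2])  # ... + beta2*x^2

c_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
c_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)

sse_lin = np.sum((y - X_lin @ c_lin) ** 2)
sse_quad = np.sum((y - X_quad @ c_quad) ** 2)
```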
Adding a quadratic term

[Scatter plot with fitted quadratic curve, BPF (0.65-0.85) vs. age (40-80)]
Residual plot

[Plot of residuals (-0.05 to 0.05) vs. fitted values (0.7-0.8) after adding the quadratic term]
Checking linearity
- A plot of residuals vs. the predictor is also used to detect departures from linearity.
- These plots allow you to investigate each predictor separately, which becomes important in multiple regression.
- If linearity holds, we anticipate a random scatter of the residuals on both types of residual plot.
Homoscedasticity
- The second assumption is equal variance across the values of the predictor.
- The top plot shows the assumption is met, while the bottom plot shows that there is a greater amount of variance for larger fitted values.
Example

[Scatter plot: expression level (0-300,000) vs. lipid number (1-6)]
Example
- In this example, we can fit a linear regression model assuming that there is a linear increase in expression with lipid number, but here is the residual plot from this analysis.
- What is wrong?

[Plot of residuals (-100,000 to 200,000) vs. fitted values (-50,000 to 100,000)]
Transform the y-value
- Clearly, the residuals showed that we did not have equal variance.
- What if we log-transform our y-value?

[Scatter plot: log expression level (6-14) vs. lipid number (1-6)]
New regression equation
- By transforming the outcome variable, we have changed our regression equation:
  - Original: Expressioni = β0 + β1·lipidi + εi
  - New: ln(Expressioni) = β0 + β1·lipidi + εi
- What is the interpretation of β1 from the new regression model?
  - For every one unit increase in lipid number, there is a β1 unit increase in ln(Expression) on average.
  - The interpretation has changed due to the transformation.
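Why the log transform helps can be sketched on simulated (hypothetical) data: with multiplicative noise, the spread of y grows with its mean, while on the log scale it is roughly constant across groups:

```python
import numpy as np

# Simulated hypothetical data: expression grows multiplicatively with lipid number
rng = np.random.default_rng(1)
lipid = np.repeat(np.arange(1, 7), 50)                  # lipid numbers 1..6
expr = np.exp(1.0 + 0.5 * lipid + rng.normal(0, 0.4, lipid.size))

# Within-group standard deviations on the raw and log scales
sd_raw = np.array([expr[lipid == k].std() for k in range(1, 7)])
sd_log = np.array([np.log(expr)[lipid == k].std() for k in range(1, 7)])
```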
Residual plot
- On the log scale, the assumption of equal variance appears much more reasonable.

[Plot of residuals (-3 to 2) vs. fitted values (7-11)]
Checking homoscedasticity
- If we do not appear to have equal variance, a transformation of the outcome variable can be used.
  - Most common are the log transformation or the square root transformation.
- Other approaches involving weighted least squares can also be used if a transformation does not work.
Normality
- Regression requires that the residuals are normally distributed.
- To test if the residuals are normal:
  - Histogram of residuals
  - Normal probability plot
- Several statistical tests for normality of residuals are also available.

[Histogram of residuals (-0.1 to 0.05)]
What if normality does not hold?
- Transformations of the outcome can often help.
- We can change to another type of regression that does not require normality of the residuals:
  - Logistic regression
  - Poisson regression
Outliers
- Investigating the residuals also provides information regarding outliers.
- If a value is extreme in the vertical direction, the residual will be extreme as well.
  - You will see this in lab.
- If a value is extreme in the horizontal direction, this value can have too much influence (leverage).
  - This is beyond the scope of this class.
Example
- Another measure of disease burden in MS is the T2 lesion volume in the brain.
  - Over the course of the disease, patients accumulate brain lesions that they do not recover from.
  - This is a measure of the disease burden in the brain.
- Is there a significant linear relationship between T2 lesion volume and age?

[Scatter plot: lesion volume (0-30) vs. age (20-60)]
Linear model
- Our initial linear model:
  - LVi = β0 + β1·agei + εi
  - What is the interpretation of β1?
  - What is the interpretation of β0?
- Using STATA, we get the following regression equation:

  LV̂ = 3.70 + 0.062·age

- Is there a significant relationship between age and lesion volume?
Hypothesis test
1) H0: β1 = 0
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = 0.99 (102 dof)
5) p-value = 0.32
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis.
7) We conclude that there is no significant association between age and lesion volume.

. regress lv_entry age

      Source |       SS       df       MS              Number of obs =     104
-------------+------------------------------           F(  1,   102) =    0.98
       Model |  33.1886601     1  33.1886601           Prob > F      =  0.3236
    Residual |  3440.84404   102  33.7337651           R-squared     =  0.0096
-------------+------------------------------           Adj R-squared = -0.0002
       Total |   3474.0327   103  33.7284729           Root MSE      =  5.8081

    lv_entry |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0623605   .0628706     0.99   0.324   -.0623429     .187064
       _cons |   3.699857   2.742369     1.35   0.180   -1.739618    9.139333

(age row: estimated coefficient and its p-value)
[Plot of residuals (-10 to 30) vs. fitted values (5-7.5)]
Linear model
- Our new linear model:
  - ln(LVi) = β0 + β1·agei + εi
  - What is the interpretation of β1?
  - What is the interpretation of β0?
- Using STATA, we get the following regression equation:

  ln(LV)̂ = 1.36 + 0.0034·age

- Is there a significant relationship between age and lesion volume?
Hypothesis testHypothesis test
1)1) HH00: : 11=0=0
2)2) Continuous outcome, continuous predictorContinuous outcome, continuous predictor
3)3) Linear regressionLinear regression
4)4) Test statistic: t=0.38 (102 dof)Test statistic: t=0.38 (102 dof)
5)5) p-value=0.71p-value=0.71
6)6) Since the p-value is more than 0.05, we fail Since the p-value is more than 0.05, we fail to reject the null hypothesisto reject the null hypothesis
7)7) We conclude that there is no significant We conclude that there is no significant association between age and lesion volumeassociation between age and lesion volume
. regress lnlv age

      Source |       SS       df       MS              Number of obs =     104
-------------+------------------------------           F(  1,   102) =    0.14
       Model |  .100352931     1  .100352931           Prob > F      =  0.7059
    Residual |  71.4750773   102  .700736052           R-squared     =  0.0014
-------------+------------------------------           Adj R-squared = -0.0084
       Total |  71.5754302   103   .69490709           Root MSE      =   .8371

        lnlv |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0034291   .0090613     0.38   0.706     -.014544    .0214022
       _cons |   1.355875   .3952489     3.43   0.001     .5719006    2.139849
(The age row gives the estimated coefficient and its p-value.)
[Scatterplot: residuals vs. fitted values for the log-transformed model]
Histograms of residuals

[Histograms of the residuals: untransformed values vs. transformed values]
Conclusions for model checking

Checking model assumptions for linear regression is needed to ensure inferences are correct
– If you have the wrong model, your inference will be wrong as well
The majority of model checking is based on the residuals
If the model fit is bad, a different model should be used
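Since model checking rests on the residuals, it helps to know which residual properties the fit guarantees automatically. A minimal sketch with made-up values (not the course data): whenever the model contains an intercept, OLS residuals sum to zero and are orthogonal to the predictor, so diagnostics should focus on their shape (normality, constant spread), not their average.

```python
import statistics as st

# Hypothetical data (not the course values)
x = [25.0, 31.0, 38.0, 44.0, 52.0, 60.0]
y = [1.30, 1.42, 1.38, 1.51, 1.47, 1.58]

# OLS fit of y = b0 + b1*x + error
xbar, ybar = st.mean(x), st.mean(y)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum(
    (a - xbar) ** 2 for a in x
)
b0 = ybar - b1 * xbar
resid = [b - (b0 + b1 * a) for a, b in zip(x, y)]

# Both properties hold by construction, whatever the data look like
assert abs(sum(resid)) < 1e-9                            # residuals sum to zero
assert abs(sum(r * a for r, a in zip(resid, x))) < 1e-9  # orthogonal to x
```

This is why a histogram of residuals is centered near zero even for a badly misspecified model: the residual mean carries no diagnostic information, but skewness, outliers, or a funnel shape against the fitted values do.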
Dichotomous predictors

Linear regression with dichotomous predictor

Linear regression can also be used for dichotomous predictors, like sex
To do this, we use an indicator variable, which equals 1 for male and 0 for female. The resulting regression equation for BPF is
BPF_i = β0 + β1*sex_i + ε_i
E(BPF | sex) = β0 + β1*sex
Graph

[Scatterplot: BPF vs. sex (0 = female, 1 = male)]
The regression equation can be rewritten separately for the two groups (below)
The meaning of the coefficients in this case:
– β0 is the mean BPF when sex=0, i.e., in the female group
– β0 + β1 is the mean BPF when sex=1, i.e., in the male group
What is the interpretation of β1?
– For a one-unit increase in sex, there is a β1 increase in the mean BPF
– The difference in mean BPF between the males and females
BPF_i = β0 + ε_i (females)
BPF_i = β0 + β1 + ε_i (males)
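The algebra above can be verified numerically: regressing on a 0/1 indicator exactly reproduces the two group means. A minimal sketch with made-up BPF values (not the real course data):

```python
import math
import statistics as st

# Hypothetical BPF values (not the real course data)
bpf_female = [0.81, 0.83, 0.80, 0.85]
bpf_male = [0.86, 0.88, 0.84]

sex = [0.0] * len(bpf_female) + [1.0] * len(bpf_male)
bpf = bpf_female + bpf_male

# OLS fit of BPF = b0 + b1*sex + error
xbar, ybar = st.mean(sex), st.mean(bpf)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(sex, bpf)) / sum(
    (x - xbar) ** 2 for x in sex
)
b0 = ybar - b1 * xbar

# Intercept = female mean; slope = male mean minus female mean
assert math.isclose(b0, st.mean(bpf_female))
assert math.isclose(b1, st.mean(bpf_male) - st.mean(bpf_female))
```

So fitting this regression is the same computation as comparing two group means, which is why the next slides can match it to a t-test.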
Interpretation of results

The final regression equation is
– BPF-hat = 0.823 + 0.037*sex
The meaning of the coefficients in this case:
– 0.823 is the estimate of the mean BPF in the female group
– 0.037 is the estimate of the difference in mean BPF between the males and females
– What is the estimated mean BPF in the males?
How could we test if the difference between the groups is statistically significant?
Hypothesis test

1) H0: There is no difference based on gender (β1=0)
2) Continuous outcome, dichotomous predictor
3) Linear regression
4) Test statistic: t=1.82 (27 dof)
5) p-value=0.079
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant difference in the mean BPF in males compared to females
. regress bpf sex

      Source |       SS       df       MS              Number of obs =      29
-------------+------------------------------           F(  1,    27) =    3.33
       Model |  .007323547     1  .007323547           Prob > F      =  0.0792
    Residual |  .059426595    27  .002200985           R-squared     =  0.1097
-------------+------------------------------           Adj R-squared =  0.0767
       Total |  .066750142    28  .002383934           Root MSE      =  .04691

         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |   .0371364   .0203586     1.82   0.079     -.004636    .0789087
       _cons |   .8228636   .0100022    82.27   0.000     .8023407    .8433865
(The sex row gives the estimated difference between groups and its p-value.)
[Scatterplot: residuals vs. fitted values for the BPF-on-sex model]
T-test

As hopefully you remember, you could have tested this same null hypothesis using a two-sample t-test
Linear regression makes an equal variance assumption, so let's use the same assumption for our t-test
Hypothesis test

1) H0: There is no difference based on gender
2) Continuous outcome, dichotomous predictor
3) t-test
4) Test statistic: t=-1.82 (27 dof)
5) p-value=0.079
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant difference in the mean BPF in males compared to females
. ttest bpf, by(sex)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      22    .8228636    .0096717    .0453645    .8027502    .8429771
       1 |       7         .86    .0196457    .0519775    .8119288    .9080712
---------+--------------------------------------------------------------------
combined |      29    .8318276    .0090667    .0488255    .8132553    .8503998
---------+--------------------------------------------------------------------
    diff |           -.0371364    .0203586               -.0789087     .004636
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -1.8241
Ho: diff = 0                                     degrees of freedom =       27

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0396         Pr(|T| > |t|) = 0.0792          Pr(T > t) = 0.9604
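The agreement is not a coincidence of this dataset: with a 0/1 predictor, the regression t-statistic for the slope is algebraically identical to the pooled (equal-variance) two-sample t-statistic. A sketch computing both from first principles, with made-up BPF values (not the course data); note that STATA's ttest reports diff = mean(0) - mean(1), so its t carries the opposite sign from the slope's t:

```python
import math
import statistics as st

# Hypothetical BPF values (not the real course data)
female = [0.81, 0.83, 0.80, 0.85, 0.79]
male = [0.86, 0.88, 0.84, 0.87]

n0, n1 = len(female), len(male)
n = n0 + n1
x = [0.0] * n0 + [1.0] * n1
y = female + male

# Regression route: t-statistic for the slope in y = b0 + b1*x + error
xbar, ybar = st.mean(x), st.mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
mse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
t_reg = b1 / math.sqrt(mse / sxx)

# t-test route: pooled-variance two-sample t-statistic on the same groups
sp2 = ((n0 - 1) * st.variance(female) + (n1 - 1) * st.variance(male)) / (n - 2)
t_pooled = (st.mean(male) - st.mean(female)) / math.sqrt(sp2 * (1 / n0 + 1 / n1))

assert math.isclose(t_reg, t_pooled)
```

The key link is that the regression mean squared error equals the pooled variance: both divide the within-group sum of squares by n - 2 degrees of freedom.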
Amazing!!!

We get the same result using both approaches!!
Linear regression has the advantages of:
– Allowing multiple predictors (tomorrow)
– Accommodating continuous predictors (relationship to correlation)
– Accommodating categorical predictors (tomorrow)
Very flexible approach
Conclusion

Indicator variables can be used to represent dichotomous variables in a regression equation
Interpretation of the coefficient for an indicator variable is the same as for a continuous variable
– Provides a group comparison
Tomorrow we will see how to use regression to match ANOVA results