7/25/2019 6- Single Variable Regression (Part II)
Single Variable Regression (Part II)
7. Residual Plots
After the curve is fit, it is important to examine if the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e., the residual from fitting a straight line is found as:

residual_i = Y_i - predicted Y_i = Y_i - (b0 + b1 X_i)
There are several standard residual plots:
plot of residuals vs. predicted values;
plot of residuals vs. X;
plot of residuals vs. time ordering.
In all cases, the residual plots should show random scatter around zero with no obvious pattern. Don't plot residuals vs. Y; this will lead to odd-looking plots which are an artifact of the plot and don't mean anything.
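As a concrete sketch, the residuals behind these plots can be computed as follows (Python; the data values and fitted coefficients here are made up for illustration, not taken from the notes):

```python
# Residuals from a fitted straight line: e_i = Y_i - (b0 + b1 * X_i).
# Hypothetical data and fitted coefficients, for illustration only.
x = [5, 8, 11, 14, 18]
y = [18.2, 21.9, 25.1, 28.3, 32.7]

b0, b1 = 12.9, 1.10          # assumed fitted intercept and slope
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# The (predicted, residual) and (x, residual) pairs are exactly what the
# standard residual plots display; they should scatter randomly about zero.
for xi, pi, ei in zip(x, predicted, residuals):
    print(f"x={xi:5.1f}  predicted={pi:6.2f}  residual={ei:+6.2f}")
```

Plotting these pairs with any graphing tool gives the residual-vs-predicted and residual-vs-X plots described above.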
8. Probability Plots
The probability plot is a graphical technique for assessing whether or not a data set follows a given distribution, such as the normal distribution. The data are plotted against a theoretical normal distribution in such a way that the points should form approximately a straight line. Departures from this straight line indicate departures from the specified distribution.
The points on this plot form a nearly linear pattern, which indicates that the normal distribution is a good model for this data set.

The normal probability plot is formed by:
Vertical axis: ordered response values
Horizontal axis: normal order statistic medians

The observations are plotted as a function of the corresponding normal order statistic quantiles. In addition, a straight line can be fit to the points and added as a reference line. The further the points vary from this line, the greater the indication of departures from normality. The correlation coefficient of the points on the normal probability plot can be compared to a table of critical values to provide a formal test of the hypothesis that the data come from a normal distribution.
            Significance level
  n      0.10     0.05     0.01
  4     .8951    .8734    .8318
  5     .9033    .8804    .8320
 10     .9347    .9180    .8804
 15     .9506    .9383    .9110
 20     .9600    .9503    .9290
 25     .9662    .9582    .9408
 30     .9707    .9639    .9490
 40     .9767    .9715    .9597
 50     .9807    .9764    .9664
 60     .9835    .9799    .9710
 75     .9865    .9835    .9757
The normal probability plot is used to answer the following questions:
1. Are the data (meaning the residuals) normally distributed?
2. What is the nature of the departure from normality (data skewed, shorter than expected tails, longer than expected tails)?
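The formal test can then be sketched as a table lookup (only a few rows of the critical-value table above are included here; normality is rejected when the plot correlation r falls below the tabled value):

```python
# Critical values for the normal probability plot correlation coefficient,
# keyed by sample size n; inner keys are significance levels.
# (A few rows copied from the table above.)
CRITICAL = {
    10: {0.10: 0.9347, 0.05: 0.9180, 0.01: 0.8804},
    20: {0.10: 0.9600, 0.05: 0.9503, 0.01: 0.9290},
    30: {0.10: 0.9707, 0.05: 0.9639, 0.01: 0.9490},
}

def normality_rejected(r, n, alpha=0.05):
    """Reject normality when r falls below the critical value."""
    return r < CRITICAL[n][alpha]

# Example: r = 0.96 with n = 20 exceeds 0.9503, so normality is not rejected;
# r = 0.91 is below the 0.01 critical value 0.9290, so it is rejected.
print(normality_rejected(0.96, 20))
print(normality_rejected(0.91, 20, 0.01))
```

In practice the sample size rarely matches a table row exactly; interpolating between rows (or using software) is the usual fix.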
Source: http://www.itl.nist.gov/div898/handbook/eda/section3/eda3676.htm
!y"i#al$or%al Probability Plot& $or%ally 'istributed 'ata
+ormal;robabilit&
;lot
The following normal probabilit& plot is from a heat flowmeterdata.
Normal Probability Plot for Data with Short Tails: The following is a normal probability plot for 500 random numbers generated from a Tukey-Lambda distribution with the parameter equal to 1.1.
!y"i#al $or%al Probability Plot& 'ata ae +ong !ails
+ormal
;robabilit&
;lot for !ata
with >ong Tails
The following is a normal probabilit& plot of /-- numbers
generated from adouble exponentialdistribution. The double
exponential distribution is s&mmetric, but relative to the
normal it declines rapidl& and has longer tails.
The non-linearity of the normal probability plot can show up in two ways. First, the middle of the data may show an S-like pattern. This is common for both short and long tails. In this particular case, the S pattern in the middle is fairly mild. Second, the first few and the last few points show marked departure from the reference fitted line. In the plot above, this is most noticeable for the first few data points. In comparing this plot to the short-tail example in the previous section, the important difference is the direction of the departure from the fitted line for the first few and the last few points. For long tails, the first few points show increasing departure from the fitted line below the line and the last few points show increasing departure from the fitted line above the line. For short tails, this pattern is reversed.

In this case we can reasonably conclude that the normal distribution can be improved upon as a model for these data. For probability plots that indicate long-tailed distributions, the next step might be to generate a Tukey Lambda PPCC plot of the data.
Discussion: This quadratic pattern in the normal probability plot is the signature of a significantly right-skewed data set. Similarly, if all the points on the normal probability plot fell above the reference line connecting the first and last points, that would be the signature pattern for a significantly left-skewed data set.

In this case we can quite reasonably conclude that we need to model these data with a right-skewed distribution such as the Weibull or lognormal.
9. Example: Yield and Fertilizer
We wish to investigate the relationship between yield (liters) and fertilizer (kg/ha) for tomato plants. An experiment was conducted in the Schwarz household one summer on 11 plots of land where the amount of fertilizer was varied and the yield measured at the end of the season.

The amount of fertilizer applied to each plot was chosen between 5 and 18 kg/ha. While the levels were not systematically chosen (e.g. they were not evenly spaced between the highest and lowest values), they represent commonly used amounts based on a preliminary survey of producers.

Interest also lies in predicting the yield when 16 kg/ha are assigned. The levels of fertilizer were randomly assigned to each plot. At the end of the experiment, the yields were measured and the following data were obtained.
Sources: http://www.itl.nist.gov/div898/handbook/eda/section3/eda3668.htm and http://www.itl.nist.gov/div898/handbook/eda/section3/eda3669.htm
In this study, it is quite clear that the fertilizer is the predictor (X) variable, while the response variable (Y) is the yield.

The population consists of all possible field plots with all possible tomato plants of this type grown under all possible fertilizer levels between about 5 and 18 kg/ha.

If all of the population could be measured (which it can't), you could find a relationship between the yield and the amount of fertilizer applied. This relationship would have the form:

yield = β0 + β1 (fertilizer) + ε

where β0 and β1 represent the true population intercept and slope respectively. The term ε represents random variation that is always present, i.e. even if the same plot was grown twice in a row with the same amount of fertilizer, the yield would not be identical (why?).

The population parameters to be estimated are β0, the true average yield when the amount of fertilizer is 0, and β1, the true average change in yield per unit change in the amount of fertilizer. These are taken over all plants in all possible field plots of this type. The values of β0 and β1 are impossible to obtain, as the entire population could never be measured.
KyPlot Analysis
Here is the data entered into a KyPlot data sheet. Note the scale of both variables (continuous). The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset.
Use the Statistics -> Regression Analysis -> Simple Regression platform to start the analysis. Specify the Y and X variables as needed. Then click OK. A new spreadsheet will be created that contains the regression results.
At this stage, it would also be useful to draw a scatter plot of the data (refer to previous KyPlot tutorials).

The relationship looks approximately linear; there don't appear to be any outliers or influential points; the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.
The Fit menu item allows you to fit the least-squares line. The actual fitted line is drawn on the scatter plot, and the straight-line equation coefficients (here called A1 for the intercept and A2 for the slope) of the fitted line are printed below the fit spreadsheet.
The estimated regression line is:

estimated yield = b0 + b1 (fertilizer) = 12.856 + 1.10137 (fertilizer)

In terms of estimates, b0 = 12.856 is the estimated intercept, and b1 = 1.10137 is the estimated slope.

The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. In this case, the yield is expected to increase (why?) by 1.10137 L when the fertilizer amount is increased by 1 kg/ha. NOTE that the slope is the change per unit change in X. In this particular case the intercept has a meaningful interpretation, but I'd be worried about extrapolating outside the range of the observed X values.
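The estimates the package reports can be reproduced from the closed-form least-squares formulas. A minimal pure-Python sketch (the three data points are made up so the answer is obvious; they are not the tomato data):

```python
def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals:
    b1 = Sxy / Sxx,  b0 = ybar - b1 * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# On points lying exactly on y = 2x, the fit recovers b0 = 0 and b1 = 2.
b0, b1 = least_squares([1, 2, 3], [2, 4, 6])
print(b0, b1)
```

Running the same function on the actual fertilizer/yield columns would reproduce the package's 12.856 and 1.10137.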
Once again, these are the results from a single experiment. If the experiment was repeated, you would obtain different estimates (b0 and b1 would change). The sampling distribution over all possible experiments would describe the variation in b0 and b1 over all possible experiments. The standard deviation of b0 and b1 over all possible experiments is again referred to as the standard error of b0 and b1.

The formulae for the standard errors of b0 and b1 are messy, and hopeless to compute by hand. And just like inference for a mean or a proportion, we can obtain estimates of the standard errors from KyPlot (from the regression results sheet created earlier).
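Messy by hand, but short in code: the estimated standard error of the slope is sqrt(MSE / Sxx), with MSE = SSE / (n - 2). A sketch with a small made-up data set (b0 and b1 must be the least-squares estimates for the data):

```python
import math

def slope_se(x, y, b0, b1):
    """Estimated standard error of the slope: sqrt(MSE / Sxx),
    where MSE = SSE / (n - 2) is the residual mean square."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    return math.sqrt(mse / sxx)

# Made-up example: y = 0.5 + 0.6x is the least-squares fit to these
# four points, with SSE = 0.2, so se(b1) = sqrt(0.1 / 5).
se = slope_se([1, 2, 3, 4], [1, 2, 2, 3], 0.5, 0.6)
print(se)
```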
The estimated standard error for b1 (the estimated slope) is 0.132 L/kg. This is an estimate of the standard deviation of b1 over all possible experiments. Normally, the intercept is of limited interest, but a standard error can also be found for it as shown in the above table.
Using exactly the same logic as when we found a confidence interval for the population mean, a confidence interval for the population slope (β1) is found (approximately) as b1 ± 2 (estimated se). In the above example, an approximate confidence interval for β1 is found as

1.10137 ± 2 × (0.132) = 1.10137 ± 0.264 = (0.837 to 1.365) L/kg

of fertilizer applied. An "exact" confidence interval can be computed by KyPlot as shown above. The "exact" confidence interval is based on the t-distribution and is slightly wider than our approximate confidence interval because the total sample size (11 pairs of points) is rather small.

We interpret this interval as 'being 95% confident that the true increase in yield when the amount of fertilizer is increased by one unit is somewhere between 0.837 and 1.365 L/kg.'
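The arithmetic of the approximate interval can be checked in a few lines (b1 = 1.10137 and se = 0.132 are the estimates from the regression output above; the factor 2 is the rough normal-based multiplier, not the exact t critical value):

```python
# Approximate 95% CI for the slope: b1 +/- 2 * se(b1).
b1, se = 1.10137, 0.132          # estimates from the regression output
lower, upper = b1 - 2 * se, b1 + 2 * se
print(f"({lower:.3f}, {upper:.3f}) L/kg")
```

This prints (0.837, 1.365) L/kg, matching the interval above; replacing 2 by the t critical value with 9 df would give the slightly wider "exact" interval.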
Be sure to carefully distinguish between β1 and b1. Note that the confidence interval is computed using b1, but is a confidence interval for β1, the population parameter that is unknown.

In linear regression problems, one hypothesis of interest is if the true slope is zero. This would correspond to no linear relationship between the response and predictor variable (why?). In many cases, a confidence interval tells the entire story.

KyPlot produces a test of the hypothesis that each of the parameters (the slope and the intercept in the population) is zero. The output is reproduced again below:

The test of hypothesis about the intercept is not of interest (why?).
Let
β1 be the true (unknown) slope.
b1 be the estimated slope. In this case b1 = 1.10137.

The hypothesis testing proceeds as follows. Again note that we are interested in the population parameters and not the sample statistics:

1. Specify the null and alternate hypotheses:

H0: β1 = 0
H1: β1 ≠ 0

Notice that the null hypothesis is in terms of the population parameter β1. This is a two-sided test as we are interested in detecting differences from zero in either direction.

2. Find the test statistic and the p-value. The test statistic is computed as:

T = (b1 - 0) / se(b1) = 1.10137 / 0.132 = 8.34

In other words, the estimate is over 8 standard errors away from the hypothesized value! This will be compared to a t-distribution with n - 2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001).
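Step 2 can be sketched in code. The standard library has no t-distribution, so a normal approximation is used here for the two-sided tail area (scipy.stats.t would give the exact p-value with 9 df); the numbers are the estimates from the output above:

```python
from statistics import NormalDist

b1, se, hypothesized = 1.10137, 0.132, 0.0
t = (b1 - hypothesized) / se     # test statistic, about 8.34

# Two-sided p-value via a normal approximation; the exact test would
# use the t-distribution with n - 2 = 9 degrees of freedom.
p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
print(t, p_approx)
```

With a statistic this far from zero, the approximate p-value is effectively zero, consistent with the "less than 0.0001" reported above.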
3. Make a decision. Because the p-value is very small, there is strong evidence that the true slope is not zero.
First, the experimenter may be interested in predicting a single FUTURE individual response at a particular X. Second, the experimenter may be interested in predicting the average of ALL FUTURE responses at a particular X. This would correspond to the average yield for all future plots when 16 kg/ha of fertilizer is added.

The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.
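The distinction shows up directly in the formulas: the interval for the mean response at x0 uses s·sqrt(1/n + (x0 - xbar)²/Sxx), while the prediction interval for a single future response adds 1 under the square root and is therefore always wider. A pure-Python sketch with made-up data (a proper interval would multiply by the t critical value; 2 is used here as a rough factor):

```python
import math

def interval_halfwidths(x, y, x0):
    """Rough +/- halfwidths at x0 for (mean response, single future response)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    s = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                      for xi, yi in zip(x, y)) / (n - 2))
    lev = 1 / n + (x0 - xbar) ** 2 / sxx
    ci = 2 * s * math.sqrt(lev)        # interval for the MEAN response at x0
    pi = 2 * s * math.sqrt(1 + lev)    # interval for ONE future response at x0
    return ci, pi

ci, pi = interval_halfwidths([1, 2, 3, 4], [1, 2, 2, 3], x0=2.5)
print(ci, pi)
```

The prediction halfwidth exceeds the confidence halfwidth at every x0, because a single future plot carries its own random variation ε on top of the uncertainty in the fitted line.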
(To be continued).