Date post: | 11-Jun-2018 |
Category: |
Documents |
Upload: | duongthien |
Chapter 11 Exercises
1. A geyser is a hot spring that periodically erupts, throwing water into the air. Geysers are extremely rare. There are only about 50 known geyser fields around the world. One of the largest fields is the Valley of Geysers in Kamchatka, Russia. Every year, a lot of tourists visit the valley.
Every time a geyser erupts, the eruption may last for minutes. For tourists and scientists, it is of interest to determine the waiting time to an eruption. The following table gives the data of duration (X) and waiting time (Y), both in minutes, from a random sample of 25 eruptions:
Duration (X)   Waiting time (Y)
2.8            103
5.3            44
4.8            63
3.2            65
4.6            66
3.6            61
5.5            58
4.9            74
2.0            93
4.2            74
3.5            66
3.0            79
3.1            71
2.9            74
2.4            96
5.0            63
3.7            62
3.7            62
3.1            73
4.0            60
2.1            68
5.2            60
2.9            82
2.4            79
4.2            56
Some relevant summaries are:
$$\bar{X} = 3.684, \quad \bar{Y} = 70.08, \quad \sum_{i=1}^{n} X_i^2 = 365.51, \quad \sum_{i=1}^{n} Y_i^2 = 127022, \quad \sum_{i=1}^{n} X_i Y_i = 6224.2.$$
(a) Draw a scatter plot of the data. Does it justify a linear regression analysis? Identify any outliers and if so, remove the outliers and redraw the figure.
(b) Fit a simple linear regression model using duration (X) as the predictor and waiting time (Y) as the outcome variable, i.e.,
Y = a+ bX
and superimpose the regression line on the plot you created for (a).
(c) Determine whether the model in (b) can be generalised. That is, consider the hypotheses:
$H_0: b = 0$ vs. $H_1: b \neq 0$
Test the hypotheses using both a critical value of 1.96 (i.e., assume n is large) and using the appropriate critical value based on the following “t-table”:
t-table
df               22     23     24     25     26     120    >120
critical value   2.074  2.069  2.064  2.060  2.056  1.98   1.96
(d) Find the correlation between X and Y and hence determine the goodness-of-fit of your model. Comment on your results.
(e) Estimate the waiting time to the next eruption if the duration of the last eruption was 3.5 minutes, using the model you have created.
2. The waters off the coast of the Kamchatka peninsula are home to six species of Pacific salmon. Salmon are an important part of the local diet and a major source of income to the community.
The salmon breed in local streams and rivers and their young swim out to the ocean. Those young salmon that survive to reach spawning age return to spawn in the same location where they were born. This cycle repeats for every generation of salmon. The total number of mature salmon returning in a given year is called a run. The portion of a salmon run that survives to reach the spawning grounds is often called the escapement. The relationship between the size of runs and escapements is crucial to ecologists and policy makers.
An ecologist collected the following data on the sizes (in thousands of fish) of the escapement and run at a particular location between the years 2000 and 2007:
Year   Run (Y)   Escapement (X)
2000   4627      821
2001   2224      652
2002   4693      932
2003   1287      237
2004   4011      749
2005   4280      1006
2006   1779      218
2007   4011      457
(a) Draw a scatter plot of the data. Does it justify a linear regression analysis? Identify any outliers and if so, remove the outliers and redraw the figure.
(b) Fit a simple linear regression model using escapement (X) as the predictor and run (Y) as the outcome variable, i.e.,
Y = a+ bX
and superimpose the regression line on the plot you created in (a).
(c) Determine whether the model in (b) can be generalised. That is, consider the hypotheses:
$H_0: b = 0$ vs. $H_1: b \neq 0$
Test the hypotheses using both a critical value of 1.96 (i.e., assume n is large) and using the appropriate critical value based on the following “t-table”:
t-table
df               5      6      7      8      9      10     11
critical value   2.571  2.447  2.365  2.306  2.262  2.228  2.201
(d) What is the meaning of b in this context? Find a 95% confidence interval for b and use one or two sentences to summarize the results.
(e) Find the correlation between X and Y and hence determine the goodness-of-fit of the model. Comment on your results.
(f) Estimate the size of the run if the escapement is 630,000 fish, using the model you have created.
3. Other than threats from overfishing, the salmon population is under threat from another source. Geologists recently discovered a huge undersea oil field in the region. The oil field alone has been estimated to hold a reserve of 3.7 billion barrels of oil. To put a value on the oil field, the government needs data from previous transactions of similar oil fields. The following table records the value in billion $ (Y) and reserve size in billion barrels (X) from 12 recent transactions:
Transaction   Size of field (X)   Price (Y)
1             2.1                 18.9
2             3                   14.5
3             2.7                 23.6
4             2.7                 27.2
5             5.4                 36.3
6             3                   19.8
7             3.9                 36.4
8             4.8                 120
9             3.3                 35.7
10            4.4                 48
11            4.6                 29.6
12            3.2                 24.9
(a) A regression model using X as the predictor and Y as the outcome variable is to be determined using the method of least squares. Describe the method in one or two sentences.
(b) Plot the data and identify any outliers and if so, remove the outliers and redraw the figure. Fit a simple linear regression model using X as the predictor and Y as the outcome variable, i.e.,
Y = a+ bX
and superimpose the regression line on the plot you created. Comment on whether a linear regression is suitable.
(c) Determine whether the model in (b) can be generalised. That is, consider the hypotheses:
$H_0: b = 0$ vs. $H_1: b \neq 0$
Test the hypotheses using both a critical value of 1.96 (i.e., assume n is large) and using the appropriate critical value based on the following “t-table”:
t-table
df               6      7      8      9      10     11     12     13
critical value   2.447  2.365  2.306  2.262  2.228  2.201  2.179  2.160
(d) What is the meaning of b in this context? Find a 95% confidence interval for b and use one or two sentences to summarize the results.
(e) Find the correlation between X and Y and hence determine the goodness-of-fit of the model. Comment on your results.
(f) Can you use the regression model to estimate the value of the Kamchatka oil field? If you answered “yes”, write down your point estimate. If you answered “no”, explain why not.
4. To study the impact of these explorations on the local ecology, a group of biologists has collected some data to determine the connection between toxic waste exposure and pre-spawn mortality in the salmon population. The data record an index of exposure and pre-spawn mortality rate (mortality per 100,000) at nine different locations:
Location   Exposure (X)   Mortality (Y)
1          2.49           147.1
2          2.57           130.1
3          3.41           129.9
4          1.25           113.5
5          1.62           137.5
6          3.83           162.3
7          11.64          207.5
8          6.41           177.9
9          8.34           210.3
Some summary statistics are given:
$$\sum_{i=1}^{n} X_i = 41.56, \quad \sum_{i=1}^{n} Y_i = 1416.1, \quad \sum_{i=1}^{n} X_i^2 = 289.42, \quad \sum_{i=1}^{n} Y_i^2 = 232499, \quad \sum_{i=1}^{n} X_i Y_i = 7439.37$$
(a) Fit a simple linear regression model using X as the predictor and Y as the outcome variable, i.e.,
Y = a + bX.
(b) Is there a linear relationship between mortality and exposure at the 5% significance level? It is known that $\sqrt{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 / (n-2)} = 14$. You may find the attached “t-table” useful.
t-table
df               6      7      8      9      10     11     12     13
critical value   2.447  2.365  2.306  2.262  2.228  2.201  2.179  2.160
(c) The following figure gives a scatter plot of the data. Identify the lines on the figure and comment on whether a linear regression is suitable. By placing marks on the figure, estimate the average mortality for locations with an exposure level of 5 and validate your estimate using the regression model obtained in (a).
[Figure: scatter plot of Mortality against Exposure.]
(d) The biologists are particularly concerned about a location that has an exposure level of 7 and mortality rate of 250. Using the figure in (c), determine whether there is statistical evidence that the location has elevated mortality (Elevated vs. Not elevated) compared to other areas with the same exposure. State your reason.
5. Geothermal energy has provided the region with an important source of income. A company is exploring the possibility of identifying undiscovered geothermal resources. Using data from n = 25 known geothermal resources, the company wants to study the relationship between the size of geothermal activity and a number of possible factors. The data consist of the following variables: Y = Activity (geothermal activity measured in percentage from the average) and four factors: X1 (Recent magmatic activity), X2 (Recency of an earthquake in weeks), X3 (Heat Flow) and X4 (Distance from a fault line in km). Summary statistics are given in the following table.
Variable       $\sum x_i$   $\sum x_i^2$   $\sum x_i y_i$   $\sum y_i$   $\sum y_i^2$
Y (Activity)                                                82           3846
Factors:
X1             8173         2751705        27274
X2             592          14638          1514
X3             1224         69794          1928
X4             63.1         308.2          -221
(All sums run over i = 1, ..., n.)
The next table gives the simple linear regression results between Y and Xj, j = 1, ..., 4, where columns a and b give the estimates of a and b in a linear regression Y = a + bXj + e. The numbers in brackets are SD(a) and SD(b). The test statistic refers to the test $H_0: b = 0$ vs. $H_1: b \neq 0$.
Variable   a                 b                 Test Statistic
X1         – A – (14.46)     – B – (0.044)     – C –
X2         19.88 (11.69)     -0.70 (0.48)      – D –
X3         -31.45 (13.11)    0.58 (0.23)       – E –
X4         10.53 (2.91)      -2.87 (0.83)      – F –
(a) Find the values A and B in the table.
(b) Whether a variable can be used for predicting Y can be evaluated using a “test statistic” of the relevant regression coefficient. Find the test statistics in the table for all the variables. For each statistic, mark an asterisk next to it to indicate significance at the 5% level, and write a sentence about each variable’s relationship to the outcome.
(c) The following figure shows the scatter plot of the data, superimposed with 95% confidence and prediction intervals, based on the linear regression between X4 and Y. The dotted grey line is drawn for ease of referencing.
[Figure: scatter plot of Activity against X4.]
The company is interested in a site that is 2 km from a fault line. Using the figure, estimate the geothermal activity for that location and compare your estimate to the value obtained using information from the table.
(d) Can the company be 95% confident that at that location, the geothermal activity would not be more than 15% below average?
(e) A second site, 1 km from a fault line, is also being considered. The company wants to choose the site with the lower margin of error in its estimate of geothermal activity. Which site should the company choose and why?
ANSWERS
(1a) Looking at the scatter plot, there is no clear evidence of departure from a linear relationship between X and Y; therefore, we can attempt to fit a linear regression line to the data.
[Figure: scatter plot of Waiting time against Duration.]
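(b)
$$n = 25, \quad \bar{X} = 3.684, \quad \bar{Y} = 70.08, \quad \sum_{i=1}^{n} X_i^2 = 365.51, \quad \sum_{i=1}^{n} Y_i^2 = 127022, \quad \sum_{i=1}^{n} X_i Y_i = 6224.2$$
$$b = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{6224.2 - 25 \times 3.684 \times 70.08}{365.51 - 25 \times 3.684^2} = -8.78,$$
$$a = \bar{Y} - b\bar{X} = 70.08 - (-8.78 \times 3.684) = 102.43.$$
Therefore $\hat{Y} = 102.43 - 8.78X$. Since b < 0, the model suggests an inverse relationship between duration and waiting time.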
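The arithmetic above can be reproduced from the summary statistics alone. The following is a minimal Python sketch; the variable names (`s_xy`, `s_xx`) are illustrative choices and not notation from the exercise.

```python
# Least-squares slope and intercept from the summary statistics of Question 1.
n = 25
x_bar, y_bar = 3.684, 70.08
sum_x2 = 365.51    # sum of X_i^2
sum_xy = 6224.2    # sum of X_i * Y_i

s_xy = sum_xy - n * x_bar * y_bar    # numerator of the slope formula
s_xx = sum_x2 - n * x_bar ** 2       # denominator of the slope formula

b = s_xy / s_xx          # slope
a = y_bar - b * x_bar    # intercept

print(f"b = {b:.2f}, a = {a:.2f}")
```

Working from the sums rather than the raw data mirrors the hand calculation, so each intermediate quantity can be checked against the fractions written out above.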
[Figure: scatter plot of Waiting time against Duration with the regression line superimposed.]
(c) The test statistic is
$$Z^* = \frac{b}{\sigma \big/ \sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}}, \quad \text{where } \sigma = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}.$$
To find $Z^*$, we need to determine $\sigma$. We need:
Observation   X     Y     Ŷ = a + bX   Y − Ŷ
1             2.8   103   77.8419      25.1581
2             5.3   44    55.8907      -11.8907
3             4.8   63    60.281       2.719
4             3.2   65    74.3298      -9.3298
5             4.6   66    62.0371      3.9629
6             3.6   61    70.8176      -9.8176
7             5.5   58    54.1346      3.8654
8             4.9   74    59.4029      14.5971
9             2     93    84.8663      8.1337
10            4.2   74    65.5493      8.4507
11            3.5   66    71.6956      -5.6956
12            3     79    76.0858      2.9142
13            3.1   71    75.2078      -4.2078
14            2.9   74    76.9639      -2.9639
15            2.4   96    81.3541      14.6459
16            5     63    58.5249      4.4751
17            3.7   62    69.9395      -7.9395
18            3.7   62    69.9395      -7.9395
19            3.1   73    75.2078      -2.2078
20            4     60    67.3054      -7.3054
21            2.1   68    83.9883      -15.9883
22            5.2   60    56.7688      3.2312
23            2.9   82    76.9639      5.0361
24            2.4   79    81.3541      -2.3541
25            4.2   56    65.5493      -9.5493
The last column gives $\sigma = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2} \approx 9.826$ and
$$Z^* = \frac{-8.78}{9.826 \big/ \sqrt{365.51 - 25 \times 3.684^2}} = -4.57.$$
Therefore, at the 5% significance level, if we use 1.96 as the critical value, then clearly $|Z^*| > 1.96$ and $H_0: b = 0$ can be rejected.
If we use a critical value based on df, then since n = 25 and df = n − 2 = 23, the critical value is 2.069 from the “t-table”. Since $|Z^*| > 2.069$, we arrive at the same conclusion.
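The residual table and the test statistic can also be checked from the raw data. The sketch below is one way to do that in plain Python; it recomputes a and b without rounding, so its results may differ from the hand calculation in the last digit.

```python
import math

# Duration (X) and waiting time (Y) for the 25 eruptions in Question 1.
x = [2.8, 5.3, 4.8, 3.2, 4.6, 3.6, 5.5, 4.9, 2.0, 4.2, 3.5, 3.0, 3.1,
     2.9, 2.4, 5.0, 3.7, 3.7, 3.1, 4.0, 2.1, 5.2, 2.9, 2.4, 4.2]
y = [103, 44, 63, 65, 66, 61, 58, 74, 93, 74, 66, 79, 71,
     74, 96, 63, 62, 62, 73, 60, 68, 60, 82, 79, 56]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx
a = y_bar - b * x_bar

# Residual standard deviation, dividing by n - 2 as in the definition of sigma.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sigma = math.sqrt(sse / (n - 2))

# Test statistic for H0: b = 0.
z_star = b / (sigma / math.sqrt(s_xx))

print(f"sigma = {sigma:.3f}, Z* = {z_star:.2f}")
for crit in (1.96, 2.069):   # large-n value and the t-table value for df = 23
    print(f"critical value {crit}: reject H0 = {abs(z_star) > crit}")
```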
(d) $R^2$ can be calculated by first finding r and then squaring:
$$r = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}\,\sqrt{\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2}} = -0.69.$$
Therefore,
$$R^2 = r^2 = (-0.69)^2 = 0.476.$$
Based on $R^2$, which is a moderate value, the regression model fit seems adequate. Furthermore, the percent variation explained is
0.476 × 100% = 47.6%,
so 47.6% of the differences in waiting time can be explained by duration, which relates to the volume of spring water ejected.
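As a check on the correlation, the same formula can be evaluated from the summary statistics. This is a small sketch; the names `s_xx`, `s_yy` and `s_xy` are illustrative shorthand for the three corrected sums in the formula.

```python
import math

# Summary statistics for Question 1.
n = 25
x_bar, y_bar = 3.684, 70.08
sum_x2, sum_y2, sum_xy = 365.51, 127022, 6224.2

s_xx = sum_x2 - n * x_bar ** 2
s_yy = sum_y2 - n * y_bar ** 2
s_xy = sum_xy - n * x_bar * y_bar

r = s_xy / math.sqrt(s_xx * s_yy)   # correlation coefficient
r_squared = r ** 2                  # proportion of variation explained

print(f"r = {r:.2f}, R^2 = {r_squared:.3f}")
```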
(e) For X = 3.5, $\hat{Y} = 102.43 - 8.78 \times 3.5 = 71.7$; therefore, the predicted waiting time is 71.7 minutes.
(2a) Looking at the scatter plot, there is no clear evidence of departure from a linear relationship between X and Y; therefore, we can attempt to fit a linear regression line to the data.
[Figure: scatter plot of Run against Escapement.]
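(b)
$$n = 8, \quad \bar{X} = 634, \quad \bar{Y} = 3364, \quad \sum_{i=1}^{n} X_i^2 = 3853348, \quad \sum_{i=1}^{n} Y_i^2 = 103695406, \quad \sum_{i=1}^{n} X_i Y_i = 19458478$$
$$b = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{19458478 - 8 \times 634 \times 3364}{3853348 - 8 \times 634^2} = 3.757676,$$
$$a = \bar{Y} - b\bar{X} = 3364 - 3.757676 \times 634 = 981.633.$$
Therefore $\hat{Y} = 981.6 + 3.758X$.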
[Figure: scatter plot of Run against Escapement with the regression line superimposed.]
(c) The test statistic is
$$Z^* = \frac{b}{\sigma \big/ \sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}}, \quad \text{where } \sigma = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}.$$
To find $Z^*$, we need to determine $\sigma$. We need:
Observation   X      Y      Ŷ = a + bX   Y − Ŷ
1             821    4627   4066.6854    560.3146
2             652    2224   3431.6382    -1207.6382
3             932    4693   4483.7875    209.2125
4             237    1287   1872.2026    -585.2026
5             749    4011   3796.1327    214.8673
6             1006   4280   4761.8555    -481.8555
7             218    1779   1800.8068    -21.8068
8             457    4011   2698.8913    1312.1087
The last column gives $\sigma = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2} \approx 832.6$ and
$$Z^* = \frac{3.758}{832.6 \big/ \sqrt{3853348 - 8 \times 634^2}} = 3.6.$$
Therefore, at the 5% significance level, if we use 1.96 as the critical value, then clearly $|Z^*| > 1.96$ and $H_0: b = 0$ can be rejected.
If we use a critical value based on df, then since n = 8 and df = n − 2 = 6, the critical value is 2.447 from the “t-table”. Since $|Z^*| > 2.447$, we arrive at the same conclusion.
(d) The coefficient b can be interpreted as the number (ratio) of offspring that survived from all causes (sometimes called recruits) for each spawning adult.
A 95% confidence interval for b is
$$b \pm 1.96\,SD(b) = b \pm 1.96 \frac{\sigma}{\sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}} = 3.758 \pm 1.96 \frac{832.6}{\sqrt{3853348 - 8 \times 634^2}} = 3.758 \pm 1.96 \times 1.043 = (1.71, 5.80)$$
Therefore, the ecologist can be 95% confident that the ratio of recruits to spawners is between about 1.71 and 5.80.
Alternatively, we could use the small n formula, so that 1.96 is replaced by the corresponding value from a “t-table”. To do this, we find df = n − 2 = 8 − 2 = 6, which gives a value of 2.447. The confidence interval becomes:
$$3.758 \pm 2.447 \times 1.043 = (1.21, 6.31)$$
In this case, the interval estimate gives a wider range, which means it is a more conservative estimate. The small n formula always produces a more conservative estimate than its large sample (using 1.96) counterpart.
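The two intervals differ only in the critical value, which the following short Python sketch makes explicit. The helper name `interval` is hypothetical; the inputs are the rounded values of b and σ quoted in the answer.

```python
import math

# Rounded values from the answer to Question 2.
n = 8
b = 3.758
sigma = 832.6
s_xx = 3853348 - n * 634 ** 2    # sum of X_i^2 minus n times the squared mean

sd_b = sigma / math.sqrt(s_xx)   # standard deviation of the slope estimate

def interval(critical_value):
    """Confidence interval b +/- critical_value * SD(b)."""
    margin = critical_value * sd_b
    return b - margin, b + margin

large_n = interval(1.96)     # large-sample critical value
small_n = interval(2.447)    # t-table value for df = n - 2 = 6

print(f"SD(b) = {sd_b:.3f}")
print(f"large n: ({large_n[0]:.2f}, {large_n[1]:.2f})")
print(f"small n: ({small_n[0]:.2f}, {small_n[1]:.2f})")
```

Because every t-table value listed is larger than 1.96, the small n interval is always the wider of the two.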
(e) $R^2$ can be calculated by first finding r and then squaring:
$$r = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}\,\sqrt{\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2}} = 0.827.$$
Therefore,
$$R^2 = r^2 = 0.827^2 = 0.684.$$
Based on $R^2$, which is a high value, the regression model fit is good. Furthermore, the percent variation explained is
0.684 × 100% = 68.4%,
so 68.4% of the differences in run size can be explained by the escapement size.
(f) For X = 630, $\hat{Y} = 981.6 + 3.758 \times 630 = 3349.14$; therefore, the predicted run size is 3349.14 × 1000, or about 3,349,000 fish.
(3a) The least squares method finds the best fitting line to a set of data by minimizing the sum of the squared deviations between the regression line and the observed values of the outcome (dependent) variable.
(b) Looking at the scatter plot, there seems to be an outlier (X = 4.8, Y = 120). Removing that gives a picture suggesting a linear relationship between X and Y; therefore, we can attempt to fit a linear regression line to the data.
[Figure: two scatter plots of Price against Reserve size; the second panel is titled "Outlier removed".]
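$$n = 11, \quad \bar{X} = 3.4818, \quad \bar{Y} = 28.6273, \quad \sum_{i=1}^{n} X_i^2 = 143.01, \quad \sum_{i=1}^{n} Y_i^2 = 9973.61, \quad \sum_{i=1}^{n} X_i Y_i = 1162.58$$
$$b = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{1162.58 - 11 \times 3.4818 \times 28.6273}{143.01 - 11 \times 3.4818^2} = 6.851,$$
$$a = \bar{Y} - b\bar{X} = 28.6273 - 6.851 \times 3.4818 = 4.773.$$
Therefore $\hat{Y} = 4.773 + 6.851X$.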
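To see how much the flagged transaction affects the fit, the line can be fitted with and without it. The sketch below uses a hypothetical helper `fit_line` and simply drops the point identified as an outlier in the answer (X = 4.8, Y = 120); it is not an automatic outlier-detection rule.

```python
# Oil-field transactions from Question 3:
# (size in billion barrels, price in billion $).
data = [(2.1, 18.9), (3.0, 14.5), (2.7, 23.6), (2.7, 27.2), (5.4, 36.3),
        (3.0, 19.8), (3.9, 36.4), (4.8, 120.0), (3.3, 35.7), (4.4, 48.0),
        (4.6, 29.6), (3.2, 24.9)]

def fit_line(points):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Drop the transaction flagged as an outlier in the answer (Y = 120).
cleaned = [(x, y) for x, y in data if y != 120.0]

a_all, b_all = fit_line(data)
a, b = fit_line(cleaned)

print(f"all 12 points:   Y = {a_all:.3f} + {b_all:.3f} X")
print(f"outlier removed: Y = {a:.3f} + {b:.3f} X")
```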
[Figure: scatter plot of Price against Reserve size with the regression line superimposed.]
(c) The test statistic is
$$Z^* = \frac{b}{\sigma \big/ \sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}}, \quad \text{where } \sigma = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}.$$
To find $Z^*$, we need to determine $\sigma$. We need:
Observation   X     Y      Ŷ = a + bX   Y − Ŷ
1             2.1   18.9   19.1605      -0.2605
2             3     14.5   25.3264      -10.8264
3             2.7   23.6   23.2711      0.3289
4             2.7   27.2   23.2711      3.9289
5             5.4   36.3   41.7687      -5.4687
6             3     19.8   25.3264      -5.5264
7             3.9   36.4   31.4922      4.9078
8             3.3   35.7   27.3816      8.3184
9             4.4   48     34.9177      13.0823
10            4.6   29.6   36.2879      -6.6879
11            3.2   24.9   26.6965      -1.7965
The last column gives $\sigma = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2} = 7.496$ and
$$Z^* = \frac{6.851}{7.496 \big/ \sqrt{143.01 - 11 \times 3.4818^2}} = 2.84.$$
Therefore, at the 5% significance level, if we use 1.96 as the critical value, then clearly $|Z^*| > 1.96$ and $H_0: b = 0$ can be rejected.
If we use a critical value based on df, then since n = 11 and df = n − 2 = 9, the critical value is 2.262 from the “t-table”. Since $|Z^*| > 2.262$, we arrive at the same conclusion.
(d) The coefficient b can be interpreted as the price per barrel of reserve oil.
A 95% confidence interval for b is
$$b \pm 1.96\,SD(b) = b \pm 1.96 \frac{\sigma}{\sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}} = 6.851 \pm 1.96 \frac{7.496}{\sqrt{143.01 - 11 \times 3.4818^2}} = 6.851 \pm 1.96 \times 2.412 = (2.12, 11.58)$$
Therefore, the government can be 95% confident that the price per barrel is between about 2.12 and 11.58 dollars.
Alternatively, we could use the small n formula, so that 1.96 is replaced by the corresponding value from a “t-table”. To do this, we find df = n − 2 = 11 − 2 = 9, which gives a value of 2.262. The confidence interval becomes:
$$6.851 \pm 2.262 \times 2.412 = (1.39, 12.31)$$
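(e) $R^2$ can be calculated by first finding r and then squaring:
$$r = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}\,\sqrt{\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2}} = 0.6875.$$
Therefore,
$$R^2 = r^2 = 0.6875^2 = 0.4726.$$
Based on $R^2$, which is a moderate value, the regression model fit seems adequate. Furthermore, the percent variation explained is
0.4726 × 100% = 47.26%,
so 47.26% of the differences in transaction price can be explained by the reserve size.
(f) Yes. The reserve of 3.7 billion barrels lies within the range of field sizes in the data (2.1 to 5.4), so the model can be used. For X = 3.7, $\hat{Y} = 4.773 + 6.851 \times 3.7 = 30.12$; therefore, the predicted price is 30.12 billion dollars.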
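(4a)
$$n = 9, \quad \bar{X} = 41.56/9, \quad \bar{Y} = 1416.1/9, \quad \sum_{i=1}^{n} X_i^2 = 289.42, \quad \sum_{i=1}^{n} Y_i^2 = 232499, \quad \sum_{i=1}^{n} X_i Y_i = 7439.37$$
$$b = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{7439.37 - 9 \times 41.56/9 \times 1416.1/9}{289.42 - 9 \times (41.56/9)^2} = 9.23176,$$
$$a = \bar{Y} - b\bar{X} = 1416.1/9 - 9.23176 \times 41.56/9 = 114.715.$$
Therefore $\hat{Y} = 114.71 + 9.231X$.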
(b) The test statistic is
$$Z^* = \frac{b}{\sigma \big/ \sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}}, \quad \text{where } \sigma = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}.$$
To find $Z^*$, we use the given value $\sigma = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2} = 14$, so that
$$Z^* = \frac{9.231}{14 \big/ \sqrt{289.42 - 9 \times (41.56/9)^2}} = 6.51.$$
Using the attached table, we choose a critical value based on df. Since n = 9 and df = n − 2 = 7, the critical value is 2.365 from the “t-table”. Since $|Z^*| = 6.51$, which is much bigger than 2.365, we conclude that there is a significant relationship between exposure and pre-spawn mortality.
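The test in (b) needs only the summary statistics and the given value of σ. A minimal Python sketch of the calculation follows; the slope is recomputed without rounding, so it may differ from the value quoted above in the third decimal place.

```python
import math

# Summary statistics for Question 4 and the given residual standard deviation.
n = 9
sum_x, sum_y = 41.56, 1416.1
sum_x2, sum_xy = 289.42, 7439.37
sigma = 14.0

s_xx = sum_x2 - sum_x ** 2 / n       # same as sum of X_i^2 minus n * mean^2
s_xy = sum_xy - sum_x * sum_y / n

b = s_xy / s_xx                          # slope
z_star = b / (sigma / math.sqrt(s_xx))   # test statistic for H0: b = 0

t_crit = 2.365                       # t-table value for df = n - 2 = 7
significant = abs(z_star) > t_crit

print(f"b = {b:.3f}, Z* = {z_star:.2f}, significant at 5%: {significant}")
```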
(c) The gray line is the regression model. Between the sets of coloured curves, the red dashed pair are the prediction bands and the blue solid ones are the confidence bands, because we know that the confidence bands are always narrower than the prediction bands.
The scatter plot shows no evidence of any violations of the linear regression model.
Using the figure, we can estimate the mortality at exposure = 5 by finding 5 on the horizontal axis and drawing a vertical line until it intersects the regression line. We then extend from the point of intersection to the vertical axis to obtain an estimate of the mortality, which is approximately 160. These lines are shown as green dotted lines on the attached figure. We can validate this estimate by comparing the answer to that obtained by the regression model, which is
$$\hat{Y} = 114.71 + 9.231X \Rightarrow \hat{Y} = 114.71 + 9.231(5) = 160.87$$
[Figure: scatter plot of Mortality against Exposure, with legend entries "Prediction" and "Confidence".]
(d) Using the figure, the site can be compared to other areas with the same level of exposure (X = 7) by using the prediction bands (red dashed curves). We mark the prediction bands at X = 7, shown in green boxes, and those points represent the ends of our prediction interval for pre-spawn mortality. Since a 95% confidence level is used throughout this course, the ends of this interval are the upper and lower 95% prediction limits of the pre-spawn mortality with exposure at X = 7. Since the upper limit is much lower than 250, the biologists can conclude, with 95% confidence, that the site in question has elevated pre-spawn mortality.
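(5a)
$$n = 25, \quad \bar{X} = \frac{8173}{25}, \quad \bar{Y} = \frac{82}{25}, \quad \sum_{i=1}^{n} X_i^2 = 2751705, \quad \sum_{i=1}^{n} Y_i^2 = 3846, \quad \sum_{i=1}^{n} X_i Y_i = 27274$$
$$b = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{27274 - 25 \times 8173/25 \times 82/25}{2751705 - 25 \times (8173/25)^2} = 0.005847,$$
$$a = \bar{Y} - b\bar{X} = 82/25 - 0.005847 \times 8173/25 = 1.3685.$$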
Therefore A = 1.369, B = 0.00585 and
$$\hat{Y} = 1.369 + 0.00585X_1.$$
(b) The test statistic is simply b/SD(b), so for example
$$C = \frac{0.00585}{0.044} = 0.13.$$
Statistical significance can be determined by comparing the test statistic to a critical value from the t-table. For n = 25, df = n − 2 = 23, hence the critical value is 2.069. Since C = 0.13 < 2.069, C is not significant. Hence there is not sufficient evidence to suggest there is an association between recent magmatic activity and geothermal activity.
The remaining test statistics can be obtained in the same way. There is also not sufficient evidence to support an association between the recency of an earthquake and geothermal activity. However, there is evidence to support that higher heat flow and a shorter distance to a fault line are both associated with higher geothermal activity.
Variable   a                 b                  Test Statistic
X1         1.369 (14.46)     0.00585 (0.044)    0.13
X2         19.88 (11.69)     -0.70 (0.48)       -1.45
X3         -31.45 (13.11)    0.58 (0.23)        2.52*
X4         10.53 (2.91)      -2.87 (0.83)       -3.45*
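The significance marks in the table can be checked mechanically. The sketch below divides each rounded estimate by its standard deviation, so the printed statistics may differ from the table in the last digit; the dictionary layout is just one convenient way to hold the values.

```python
# (estimate of b, SD(b)) for each factor, as given in the table for Question 5.
estimates = {
    "X1": (0.00585, 0.044),
    "X2": (-0.70, 0.48),
    "X3": (0.58, 0.23),
    "X4": (-2.87, 0.83),
}

T_CRIT = 2.069   # t-table critical value for df = n - 2 = 23

results = {}
for name, (b, sd_b) in estimates.items():
    stat = b / sd_b                      # test statistic for H0: b = 0
    significant = abs(stat) > T_CRIT
    results[name] = (stat, significant)
    flag = "*" if significant else ""
    print(f"{name}: {stat:.2f}{flag}")
```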
(c) Using the figure, we can estimate the activity at the location that is 2 km from a fault line by finding 2 on the horizontal axis and drawing a vertical line until it intersects the regression line. We then extend from the point of intersection to the vertical axis to obtain an estimate of the activity, which is approximately 5. These lines are shown as green dotted lines on the attached figure. We can compare this estimate to that obtained by the regression model, which is
$$\hat{Y} = 10.53 - 2.87X_4 \Rightarrow \hat{Y} = 10.53 - 2.87(2) = 4.79$$
so the two estimates are similar.
[Figure: scatter plot of Activity against X4.]
(d) Using the figure, the geothermal activity at the location can be evaluated by using the 95% prediction bands (red dashed curves). We mark the prediction bands at X = 2, shown in green boxes, and those points represent the ends of our prediction interval for geothermal activity. Since the lower limit is below −15, our prediction does not rule out geothermal activity more than 15 percent below average, so the company cannot be 95% confident.
(e) The margin of error for X = x, at the 95% confidence level, has the following form:
$$1.96\,\sigma \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}.$$
Hence, all things equal, the further x is from $\bar{X}$, the bigger the margin of error. In this problem $\bar{X} = \frac{63.1}{25} \approx 2.52$. The first site (at X = 2) is closer to $\bar{X}$ than the second site, and hence, the estimate at site 1 has a smaller margin of error.
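Since 1.96 and σ are common to both sites, comparing the margins of error reduces to comparing the square-root term. The following sketch evaluates that term from the summary statistics for X4; the function name `width_factor` is hypothetical, and σ is left out because its value is not given in the exercise.

```python
import math

n = 25
sum_x, sum_x2 = 63.1, 308.2      # summary statistics for X4 from the table

x_bar = sum_x / n
s_xx = sum_x2 - n * x_bar ** 2   # equals the sum of (X_i - x_bar)^2

def width_factor(x):
    """Square-root term that multiplies 1.96 * sigma in the margin of error."""
    return math.sqrt(1 + 1 / n + (x - x_bar) ** 2 / s_xx)

site_1 = width_factor(2.0)   # site 2 km from a fault line
site_2 = width_factor(1.0)   # site 1 km from a fault line

print(f"mean of X4 = {x_bar:.3f}")
print(f"site 1 factor = {site_1:.4f}, site 2 factor = {site_2:.4f}")
```

The factor is smallest at x equal to the mean of X4 and grows as x moves away from it in either direction.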