Chapter 11 Exercises - mysmu.edu · Chapter 11 Exercises 1. ... both in minutes, from a random...

Date post: 11-Jun-2018
Upload: duongthien
Chapter 11 Exercises

1. A geyser is a hot spring that periodically erupts, throwing water into the air. Geysers are extremely rare. There are only about 50 known geyser fields around the world. One of the largest fields is the Valley of Geysers in Kamchatka, Russia. Every year, a lot of tourists visit the valley.

Every time a geyser erupts, the eruption may last for minutes. For tourists and scientists, it is of interest to determine the waiting time to an eruption. The following table gives the data of duration (X) and waiting time (Y), both in minutes, from a random sample of 25 eruptions:

Duration (X)   Waiting time (Y)
2.8            103
5.3            44
4.8            63
3.2            65
4.6            66
3.6            61
5.5            58
4.9            74
2.0            93
4.2            74
3.5            66
3.0            79
3.1            71
2.9            74
2.4            96
5.0            63
3.7            62
3.7            62
3.1            73
4.0            60
2.1            68
5.2            60
2.9            82
2.4            79
4.2            56

Some relevant summaries are:

$$\bar{X} = 3.684, \quad \bar{Y} = 70.08, \quad \sum_{i=1}^{n} X_i^2 = 365.51, \quad \sum_{i=1}^{n} Y_i^2 = 127022, \quad \sum_{i=1}^{n} X_i Y_i = 6224.2.$$


(a) Draw a scatter plot of the data. Does it justify a linear regression analysis? Identify any outliers and, if there are any, remove them and redraw the figure.

(b) Fit a simple linear regression model using duration (X) as the predictor and waiting time (Y) as the outcome variable, i.e.,

$$Y = a + bX$$

and superimpose the regression line on the plot you created for (a).

(c) Determine whether the model in (b) can be generalised. That is, consider the hypotheses:

$$H_0: b = 0 \quad \text{vs.} \quad H_1: b \neq 0$$

Test the hypotheses using both a critical value of 1.96 (i.e., assume n is large) and the appropriate critical value based on the following "t-table":

df               22      23      24      25      26      120     >120
critical value   2.074   2.069   2.064   2.060   2.056   1.98    1.96

(d) Find the correlation between X and Y and hence determine the goodness-of-fit of your model. Comment on your results.

(e) Estimate the waiting time to the next eruption if the duration of the last eruption was 3.5 minutes, using the model you have created.

2. The waters off the coast of the Kamchatka peninsula are home to six species of Pacific salmon. Salmon are an important part of the local diet and a major source of income to the community.

The salmon breed in local streams and rivers and their young swim out to the ocean. Those young salmon that survive to reach spawning age return to spawn in the same location where they were born. This cycle repeats for every generation of salmon. The total number of mature salmon returning in a given year is called a run. The portion of a salmon run that survives to reach the spawning grounds is often called the escapement. The relationship between the size of runs and escapements is crucial to ecologists and policy makers.

An ecologist collected the following data on the sizes (in thousands) of the escapement and run at a particular location between years 2000 and 2007:

Year   Run (Y)   Escapement (X)
2000   4627      821
2001   2224      652
2002   4693      932
2003   1287      237
2004   4011      749
2005   4280      1006
2006   1779      218
2007   4011      457

(a) Draw a scatter plot of the data. Does it justify a linear regression analysis? Identify any outliers and, if there are any, remove them and redraw the figure.

(b) Fit a simple linear regression model using escapement (X) as the predictor and run (Y) as the outcome variable, i.e.,

$$Y = a + bX$$

and superimpose the regression line on the plot you created in (a).

(c) Determine whether the model in (b) can be generalised. That is, consider the hypotheses:

$$H_0: b = 0 \quad \text{vs.} \quad H_1: b \neq 0$$

Test the hypotheses using both a critical value of 1.96 (i.e., assume n is large) and the appropriate critical value based on the following "t-table":

df               5       6       7       8       9       10      11
critical value   2.571   2.447   2.365   2.306   2.262   2.228   2.201

(d) What is the meaning of b in this context? Find a 95% confidence interval for b and use one or two sentences to summarize the results.

(e) Find the correlation between X and Y and hence determine the goodness-of-fit of the model. Comment on your results.

(f) Estimate the size of the run if the escapement is 630,000 fish, using the model you have created.


3. Other than threats from overfishing, the salmon population is under threat from another source. Geologists recently discovered a huge undersea oil field in the region. The oil field alone has been estimated to hold a reserve of 3.7 billion barrels of oil. To value the oil field, the government needs data from previous transactions of similar oil fields. The following table records the value in billion $ (Y) and reserve size in billion barrels (X) from 12 recent transactions:

Transaction   Size of field (X)   Price (Y)
1             2.1                 18.9
2             3                   14.5
3             2.7                 23.6
4             2.7                 27.2
5             5.4                 36.3
6             3                   19.8
7             3.9                 36.4
8             4.8                 120
9             3.3                 35.7
10            4.4                 48
11            4.6                 29.6
12            3.2                 24.9

(a) A regression model using X as the predictor and Y as the outcome variable is to be determined using the method of least squares. Describe the method in one or two sentences.

(b) Plot the data and identify any outliers; if there are any, remove them and redraw the figure. Fit a simple linear regression model using X as the predictor and Y as the outcome variable, i.e.,

$$Y = a + bX$$

and superimpose the regression line on the plot you created. Comment on whether a linear regression is suitable.

(c) Determine whether the model in (b) can be generalised. That is, consider the hypotheses:

$$H_0: b = 0 \quad \text{vs.} \quad H_1: b \neq 0$$

Test the hypotheses using both a critical value of 1.96 (i.e., assume n is large) and the appropriate critical value based on the following "t-table":

df               6       7       8       9       10      11      12      13
critical value   2.447   2.365   2.306   2.262   2.228   2.201   2.179   2.160

(d) What is the meaning of b in this context? Find a 95% confidence interval for b and use one or two sentences to summarize the results.

(e) Find the correlation between X and Y and hence determine the goodness-of-fit of the model. Comment on your results.

(f) Can you use the regression model to estimate the value of the Kamchatka oil field? If you answered "yes", write down your point estimate. If you answered "no", explain why not.

4. To study the impact of these explorations on the local ecology, a group of biologists has collected some data to determine the connection between toxic waste exposure and pre-spawn mortality in the salmon population. The data record an index of exposure and pre-spawn mortality rate (mortality per 100,000) at nine different locations:

Location   Exposure (X)   Mortality (Y)
1          2.49           147.1
2          2.57           130.1
3          3.41           129.9
4          1.25           113.5
5          1.62           137.5
6          3.83           162.3
7          11.64          207.5
8          6.41           177.9
9          8.34           210.3

Some summary statistics are given:

$$\sum_{i=1}^{n} X_i = 41.56, \quad \sum_{i=1}^{n} Y_i = 1416.1, \quad \sum_{i=1}^{n} X_i^2 = 289.42, \quad \sum_{i=1}^{n} Y_i^2 = 232499, \quad \sum_{i=1}^{n} X_i Y_i = 7439.37$$

(a) Fit a simple linear regression model using X as the predictor and Y as the outcome variable, i.e.,

$$Y = a + bX.$$

(b) Is there a linear relationship between mortality and exposure at the 5% significance level? It is known that $\sqrt{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 / (n - 2)} = 14$. You may find the attached "t-table" useful.

df               6       7       8       9       10      11      12      13
critical value   2.447   2.365   2.306   2.262   2.228   2.201   2.179   2.160

(c) The following figure gives a scatter plot of the data. Identify the lines on the figure and comment on whether a linear regression is suitable. By placing marks on the figure, estimate the average mortality for locations with an exposure level of 5 and validate your estimate using the regression model obtained in (a).

[Figure: scatter plot of Mortality (vertical axis) against Exposure (horizontal axis), with superimposed lines.]

(d) The biologists are particularly concerned about a location that has an exposure level of 7 and a mortality rate of 250. Using the figure in (c), determine whether there is statistical evidence that the location has elevated mortality (Elevated vs. Not elevated) compared to other areas with the same exposure. State your reason.

5. Geothermal energy has provided the region with an important source of income. A company is exploring the possibility of identifying undiscovered geothermal resources. Using data from n = 25 known geothermal resources, the company wants to study the relationship between the size of geothermal activity and a number of possible factors. The data consist of the following variables: Y = Activity (geothermal activity measured in percentage from the average) and four factors: X1 (Recent magmatic activity), X2 (Recency of an earthquake in weeks), X3 (Heat Flow) and X4 (Distance from a fault line in km). Summary statistics are given in the following table.

Variable        $\sum x_i$   $\sum x_i^2$   $\sum x_i y_i$   $\sum y_i$   $\sum y_i^2$
Y (Activity)                                                 82           3846
X1              8173         2751705        27274
X2              592          14638          1514
X3              1224         69794          1928
X4              63.1         308.2          -221

(All sums run over i = 1, ..., n.)

The next table gives the simple linear regression results between Y and $X_j$, j = 1, ..., 4, where columns a and b give the estimates of a and b in a linear regression $Y = a + bX_j + e$. The numbers in brackets are SD(a) and SD(b). The test statistic refers to the test $H_0: b = 0$ vs. $H_1: b \neq 0$.

Variable   a                  b                  Test statistic
X1         – A –   (14.46)    – B –   (0.044)    – C –
X2         19.88   (11.69)    -0.70   (0.48)     – D –
X3         -31.45  (13.11)    0.58    (0.23)     – E –
X4         10.53   (2.91)     -2.87   (0.83)     – F –

(a) Find the values A and B in the table.

(b) Whether a variable can be used for predicting Y can be evaluated using a "test statistic" of the relevant regression coefficient. Find the test statistics in the table for all the variables. Mark each statistic that is significant at the 5% level with an asterisk, and write a sentence about each variable's relationship to the outcome.

(c) The following figure shows the scatter plot of the data, superimposed with 95% confidence and prediction intervals, based on the linear regression between X4 and Y. The dotted grey line is drawn for ease of referencing.

[Figure: scatter plot of Activity (vertical axis) against X4 (horizontal axis), with 95% confidence and prediction intervals and a dotted grey reference line.]

The company is interested in a site that is 2 km from a fault line. Using the figure, estimate the geothermal activity for that location and compare your estimate to the value obtained using information from the table.

(d) Can the company be 95% confident that at that location, the geothermal activity would not be more than 15% below average?

(e) A second site, 1 km from a fault line, is also being considered. The company wants to choose, between the two sites, the one with the lower margin of error in its estimate of geothermal activity. Which site should the company choose and why?


ANSWERS

(1a) Looking at the scatter plot, there is no clear evidence of departure from a linear relationship between X and Y; therefore, we can attempt to fit a linear regression line to the data.

[Figure: scatter plot of Waiting time (vertical axis) against Duration (horizontal axis).]

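(b)

$$n = 25, \quad \bar{X} = 3.684, \quad \bar{Y} = 70.08, \quad \sum_{i=1}^{n} X_i^2 = 365.51, \quad \sum_{i=1}^{n} Y_i^2 = 127022, \quad \sum_{i=1}^{n} X_i Y_i = 6224.2$$

$$b = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{6224.2 - 25 \times 3.684 \times 70.08}{365.51 - 25 \times 3.684^2} = -8.78,$$

$$a = \bar{Y} - b\bar{X} = 70.08 - (-8.78 \times 3.684) = 102.43.$$

Therefore $\hat{Y} = 102.43 - 8.78X$. Since b < 0, the model suggests an inverse relationship between duration and waiting time.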
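The arithmetic in (1b) can be checked with a short Python sketch. It only uses the summary statistics quoted above; the variable names (`s_xy`, `s_xx`) are illustrative and not part of the original exercise.

```python
# Summary statistics quoted in the answer to (1b).
n = 25
x_bar, y_bar = 3.684, 70.08
sum_x2, sum_xy = 365.51, 6224.2

# Corrected sum of cross-products and corrected sum of squares of X.
s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_x2 - n * x_bar ** 2

b = s_xy / s_xx          # slope estimate
a = y_bar - b * x_bar    # intercept estimate

print(f"b = {b:.2f}, a = {a:.2f}")
```

Working from the sums rather than the raw table mirrors the hand calculation, so any slip in the substitution shows up immediately.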

[Figure: scatter plot of Waiting time against Duration with the fitted regression line superimposed.]

(c) The test statistic is

$$Z^* = \frac{b}{\hat{\sigma} \Big/ \sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}}, \quad \text{where } \hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}}.$$

To find $Z^*$, we need to determine $\hat{\sigma}$. We need:

Observation   X     Y     $\hat{Y} = a + bX$   $Y - \hat{Y}$
1             2.8   103   77.8419              25.1581
2             5.3   44    55.8907              -11.8907
3             4.8   63    60.281               2.719
4             3.2   65    74.3298              -9.3298
5             4.6   66    62.0371              3.9629
6             3.6   61    70.8176              -9.8176
7             5.5   58    54.1346              3.8654
8             4.9   74    59.4029              14.5971
9             2     93    84.8663              8.1337
10            4.2   74    65.5493              8.4507
11            3.5   66    71.6956              -5.6956
12            3     79    76.0858              2.9142
13            3.1   71    75.2078              -4.2078
14            2.9   74    76.9639              -2.9639
15            2.4   96    81.3541              14.6459
16            5     63    58.5249              4.4751
17            3.7   62    69.9395              -7.9395
18            3.7   62    69.9395              -7.9395
19            3.1   73    75.2078              -2.2078
20            4     60    67.3054              -7.3054
21            2.1   68    83.9883              -15.9883
22            5.2   60    56.7688              3.2312
23            2.9   82    76.9639              5.0361
24            2.4   79    81.3541              -2.3541
25            4.2   56    65.5493              -9.5493

The last column gives $\hat{\sigma} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2} \approx 9.826$ and

$$Z^* = \frac{-8.78}{9.826 \Big/ \sqrt{365.51 - 25 \times 3.684^2}} = -4.57$$

Therefore, using a 5% significance test with 1.96 as the critical value, clearly $|Z^*| > 1.96$ and $H_0: b = 0$ can be rejected.

If we use a critical value based on df, then since n = 25, df = n − 2 = 23 and the critical value is 2.069, from the "t-table". Since $|Z^*| > 2.069$, we arrive at the same conclusion.
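As a cross-check on the residual table and the test statistic in (1c), the following Python sketch recomputes everything from the raw data of Exercise 1. It is a plain-Python illustration, assuming the divisor n − 2 for the residual standard deviation as in the definition above; the variable names are ours.

```python
import math

# Raw data from Exercise 1: duration (X) and waiting time (Y), in minutes.
duration = [2.8, 5.3, 4.8, 3.2, 4.6, 3.6, 5.5, 4.9, 2.0, 4.2, 3.5, 3.0, 3.1,
            2.9, 2.4, 5.0, 3.7, 3.7, 3.1, 4.0, 2.1, 5.2, 2.9, 2.4, 4.2]
waiting = [103, 44, 63, 65, 66, 61, 58, 74, 93, 74, 66, 79, 71,
           74, 96, 63, 62, 62, 73, 60, 68, 60, 82, 79, 56]

n = len(duration)
x_bar = sum(duration) / n
y_bar = sum(waiting) / n

s_xx = sum((x - x_bar) ** 2 for x in duration)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(duration, waiting))

b = s_xy / s_xx
a = y_bar - b * x_bar

# Residuals Y - Y_hat, as in the last column of the table.
residuals = [y - (a + b * x) for x, y in zip(duration, waiting)]

# Residual standard deviation with n - 2 degrees of freedom.
sigma_hat = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Standard deviation of the slope and the test statistic for H0: b = 0.
sd_b = sigma_hat / math.sqrt(s_xx)
z_star = b / sd_b

print(f"sigma_hat = {sigma_hat:.3f}, Z* = {z_star:.2f}")
```

The statistic is then compared in absolute value with 1.96 or with the tabulated critical value for df = 23.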

(d) $R^2$ can be calculated by first finding r and then squaring:

$$r = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}\,\sqrt{\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2}} = -0.69.$$

Therefore,

$$R^2 = r^2 = (-0.69)^2 = 0.476.$$

Based on $R^2$, which is a moderate value, the regression model fit seems adequate. Furthermore, the percent variation explained is

$$0.476 \times 100\% = 47.6\%,$$

so 47.6% of the differences in waiting time can be explained by duration, which relates to the volume of spring water ejected.
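The correlation in (1d) follows from the same summary statistics. The sketch below is one way to reproduce it in Python; the names `s_xx`, `s_yy` and `s_xy` are illustrative shorthand for the corrected sums.

```python
import math

# Summary statistics for Exercise 1.
n = 25
x_bar, y_bar = 3.684, 70.08
sum_x2, sum_y2, sum_xy = 365.51, 127022, 6224.2

s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_x2 - n * x_bar ** 2
s_yy = sum_y2 - n * y_bar ** 2

r = s_xy / math.sqrt(s_xx * s_yy)   # sample correlation
r_squared = r ** 2                  # proportion of variation explained

print(f"r = {r:.2f}, R^2 = {r_squared:.3f}")
```

Note that the sign of r matches the sign of the slope b, while squaring removes the sign.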

(e) For X = 3.5,

$$\hat{Y} = 102.43 - 8.78 \times 3.5 = 71.7,$$

therefore, the predicted waiting time is 71.7 minutes.

(2a) Looking at the scatter plot, there is no clear evidence of departure from a linear relationship between X and Y; therefore, we can attempt to fit a linear regression line to the data.

[Figure: scatter plot of Run (vertical axis) against Escapement (horizontal axis).]

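(b)

$$n = 8, \quad \bar{X} = 634, \quad \bar{Y} = 3364, \quad \sum_{i=1}^{n} X_i^2 = 3853348, \quad \sum_{i=1}^{n} Y_i^2 = 103695406, \quad \sum_{i=1}^{n} X_i Y_i = 19458478$$

$$b = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{19458478 - 8 \times 634 \times 3364}{3853348 - 8 \times 634^2} = 3.757676,$$

$$a = \bar{Y} - b\bar{X} = 3364 - 3.757676 \times 634 = 981.633.$$

Therefore $\hat{Y} = 981.6 + 3.758X$.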

[Figure: scatter plot of Run against Escapement with the fitted regression line superimposed.]

(c) The test statistic is

$$Z^* = \frac{b}{\hat{\sigma} \Big/ \sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}}, \quad \text{where } \hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}}.$$

To find $Z^*$, we need to determine $\hat{\sigma}$. We need:

Observation   X      Y      $\hat{Y} = a + bX$   $Y - \hat{Y}$
1             821    4627   4066.6854            560.3146
2             652    2224   3431.6382            -1207.6382
3             932    4693   4483.7875            209.2125
4             237    1287   1872.2026            -585.2026
5             749    4011   3796.1327            214.8673
6             1006   4280   4761.8555            -481.8555
7             218    1779   1800.8068            -21.8068
8             457    4011   2698.8913            1312.1087

The last column gives $\hat{\sigma} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2} \approx 832.6$ and

$$Z^* = \frac{3.758}{832.6 \Big/ \sqrt{3853348 - 8 \times 634^2}} = 3.6$$

Therefore, using a 5% significance test with 1.96 as the critical value, clearly $|Z^*| > 1.96$ and $H_0: b = 0$ can be rejected.

If we use a critical value based on df, then since n = 8, df = n − 2 = 6 and the critical value is 2.447, from the "t-table". Since $|Z^*| > 2.447$, we arrive at the same conclusion.

(d) The coefficient b can be interpreted as the number (ratio) of offspring that survived from all causes (sometimes called recruits) for each spawning adult.

A 95% confidence interval for b is

$$b \pm 1.96\,SD(b) = b \pm 1.96\,\frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}} = 3.758 \pm 1.96\,\frac{832.6}{\sqrt{3853348 - 8 \times 634^2}} = 3.758 \pm 1.96 \times 1.043 = (1.71, 5.80)$$

Therefore, the ecologist can be 95% certain that the ratio of recruits to spawners is between 1.71 and about 5.80.

Alternatively, we could use the small n formula, so that 1.96 is replaced by a corresponding value from a "t-table". To do this, we find df = n − 2 = 8 − 2 = 6, which gives a value of 2.447. The confidence interval becomes:

$$3.758 \pm 2.447 \times 1.043 = (1.21, 6.31)$$

In this case, the interval estimate gives a wider range, which means it is a more conservative estimate. The small n formula always produces a more conservative estimate than its large sample (using 1.96) counterpart.
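The two intervals can be reproduced with a small Python sketch. It assumes the rounded values b = 3.758 and σ̂ = 832.6 from the answer above; `slope_interval` is a hypothetical helper written for this illustration, not something from the original exercise.

```python
import math

# Values taken from the answer to Exercise 2.
n = 8
x_bar = 634
sum_x2 = 3853348
b = 3.758          # rounded slope estimate
sigma_hat = 832.6  # residual standard deviation

# Standard deviation of the slope estimate.
sd_b = sigma_hat / math.sqrt(sum_x2 - n * x_bar ** 2)


def slope_interval(estimate, sd, critical):
    """Return (lower, upper) for estimate +/- critical * sd."""
    margin = critical * sd
    return (estimate - margin, estimate + margin)


large_n = slope_interval(b, sd_b, 1.96)    # large-sample critical value
small_n = slope_interval(b, sd_b, 2.447)   # t-table value for df = 6

print(f"SD(b) = {sd_b:.3f}")
print(f"1.96  interval: ({large_n[0]:.2f}, {large_n[1]:.2f})")
print(f"2.447 interval: ({small_n[0]:.2f}, {small_n[1]:.2f})")
```

Because the tabulated critical value exceeds 1.96, the second interval always contains the first.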

(e) $R^2$ can be calculated by first finding r and then squaring:

$$r = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}\,\sqrt{\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2}} = 0.827.$$

Therefore,

$$R^2 = r^2 = 0.827^2 = 0.684.$$

Based on $R^2$, which is a high value, the regression model fit is good. Furthermore, the percent variation explained is

$$0.684 \times 100\% = 68.4\%,$$

so 68.4% of the differences in run size can be explained by the escapement size.

(f) For X = 630,

$$\hat{Y} = 981.6 + 3.758 \times 630 = 3349.14,$$

therefore, the predicted run size is 3349.14 × 1000, i.e., about 3,349,000 fish.

(3a) The least squares method finds the best fitting line to a set of data by minimizing the sum of the squared deviations between the regression line and the observed values of the outcome (dependent) variable.

(b) Looking at the scatter plot, there seems to be an outlier (X = 4.8, Y = 120). Removing it gives a picture suggesting a linear relationship between X and Y; therefore, we can attempt to fit a linear regression line to the data.

[Figure: two scatter plots of Price (vertical axis) against Reserve size (horizontal axis): the full data, and the data with the outlier removed.]

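$$n = 11, \quad \bar{X} = 3.4818, \quad \bar{Y} = 28.6273, \quad \sum_{i=1}^{n} X_i^2 = 143.01, \quad \sum_{i=1}^{n} Y_i^2 = 9973.61, \quad \sum_{i=1}^{n} X_i Y_i = 1162.58$$

$$b = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{1162.58 - 11 \times 3.4818 \times 28.6273}{143.01 - 11 \times 3.4818^2} = 6.851,$$

$$a = \bar{Y} - b\bar{X} = 28.6273 - 6.851 \times 3.4818 = 4.773.$$

Therefore $\hat{Y} = 4.773 + 6.851X$.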
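To see how the fit in (3b) depends on dropping the outlier, the sketch below fits the line from the raw table in Exercise 3, once with all 12 transactions and once without the (4.8, 120) point. The helper `fit_line` is a hypothetical name introduced here, and selecting the outlier by its price of 120 is simply a convenient shortcut for this data set.

```python
# Raw data from Exercise 3: reserve size (billion barrels) and price (billion $).
size = [2.1, 3, 2.7, 2.7, 5.4, 3, 3.9, 4.8, 3.3, 4.4, 4.6, 3.2]
price = [18.9, 14.5, 23.6, 27.2, 36.3, 19.8, 36.4, 120, 35.7, 48, 29.6, 24.9]


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


# Fit with every transaction, including the suspected outlier.
a_all, b_all = fit_line(size, price)

# Drop the transaction identified as an outlier in the answer (Y = 120).
kept = [(x, y) for x, y in zip(size, price) if y != 120]
xs_kept = [x for x, _ in kept]
ys_kept = [y for _, y in kept]
a, b = fit_line(xs_kept, ys_kept)

print(f"all 12 points:   a = {a_all:.3f}, b = {b_all:.3f}")
print(f"outlier removed: a = {a:.3f}, b = {b:.3f}")
```

Comparing the two fitted lines gives a feel for how much a single extreme observation can pull the regression line.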

[Figure: scatter plot of Price against Reserve size (outlier removed) with the fitted regression line superimposed.]

(c) The test statistic is

$$Z^* = \frac{b}{\hat{\sigma} \Big/ \sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}}, \quad \text{where } \hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}}.$$

To find $Z^*$, we need to determine $\hat{\sigma}$. We need:

Observation   X     Y      $\hat{Y} = a + bX$   $Y - \hat{Y}$
1             2.1   18.9   19.1605              -0.2605
2             3     14.5   25.3264              -10.8264
3             2.7   23.6   23.2711              0.3289
4             2.7   27.2   23.2711              3.9289
5             5.4   36.3   41.7687              -5.4687
6             3     19.8   25.3264              -5.5264
7             3.9   36.4   31.4922              4.9078
8             3.3   35.7   27.3816              8.3184
9             4.4   48     34.9177              13.0823
10            4.6   29.6   36.2879              -6.6879
11            3.2   24.9   26.6965              -1.7965

The last column gives $\hat{\sigma} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2} = 7.496$ and

$$Z^* = \frac{6.851}{7.496 \Big/ \sqrt{143.01 - 11 \times 3.4818^2}} = 2.84$$

Therefore, using a 5% significance test with 1.96 as the critical value, clearly $|Z^*| > 1.96$ and $H_0: b = 0$ can be rejected.

If we use a critical value based on df, then since n = 11, df = n − 2 = 9 and the critical value is 2.262, from the "t-table". Since $|Z^*| > 2.262$, we arrive at the same conclusion.

(d) The coefficient b can be interpreted as the price per barrel of reserve oil.

A 95% confidence interval for b is

$$b \pm 1.96\,SD(b) = b \pm 1.96\,\frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}} = 6.851 \pm 1.96\,\frac{7.496}{\sqrt{143.01 - 11 \times 3.4818^2}} = 6.851 \pm 1.96 \times 2.412 = (2.12, 11.58)$$

Therefore, the government can be 95% certain that the price per barrel is between about 2.12 and 11.58 dollars.

Alternatively, we could use the small n formula, so that 1.96 is replaced by a corresponding value from a "t-table". To do this, we find df = n − 2 = 11 − 2 = 9, which gives a value of 2.262. The confidence interval becomes:

$$6.851 \pm 2.262 \times 2.412 = (1.39, 12.31)$$


(e) $R^2$ can be calculated by first finding r and then squaring:

$$r = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}\,\sqrt{\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2}} = 0.6875.$$

Therefore,

$$R^2 = r^2 = 0.6875^2 = 0.4726.$$

Based on $R^2$, which is a moderate value, the regression model fit seems adequate. Furthermore, the percent variation explained is

$$0.4726 \times 100\% = 47.26\%,$$

so 47.26% of the differences in transaction price can be explained by the reserve size.

(f) For X = 3.7,

$$\hat{Y} = 4.773 + 6.851 \times 3.7 = 30.12,$$

therefore, the predicted price is 30.12 billion dollars.

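(4a)

$$n = 9, \quad \bar{X} = 41.56/9, \quad \bar{Y} = 1416.1/9, \quad \sum_{i=1}^{n} X_i^2 = 289.42, \quad \sum_{i=1}^{n} Y_i^2 = 232499, \quad \sum_{i=1}^{n} X_i Y_i = 7439.37$$

$$b = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{7439.37 - 9 \times 41.56/9 \times 1416.1/9}{289.42 - 9 \times (41.56/9)^2} = 9.23176,$$

$$a = \bar{Y} - b\bar{X} = 1416.1/9 - 9.23176 \times 41.56/9 = 114.715.$$

Therefore $\hat{Y} = 114.71 + 9.231X$.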

(b) The test statistic is

$$Z^* = \frac{b}{\hat{\sigma} \Big/ \sqrt{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}}, \quad \text{where } \hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}}.$$

To find $Z^*$, we use the given value $\hat{\sigma} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2} = 14$, so that

$$Z^* = \frac{9.231}{14 \Big/ \sqrt{289.42 - 9 \times (41.56/9)^2}} = 6.51$$

Using the attached table, we choose a critical value based on df. Since n = 9, df = n − 2 = 7 and the critical value is 2.365, from the "t-table". Since $|Z^*| = 6.51$, which is much bigger than 2.365, we conclude that there is a significant relationship between exposure and pre-spawn mortality.
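The fit in (4a) and the test in (4b) use only the five sums given in the question plus the stated value of 14 for the residual standard deviation. A minimal Python sketch of the same calculation, with variable names of our own choosing:

```python
import math

# Summary statistics given in Exercise 4.
n = 9
sum_x, sum_y = 41.56, 1416.1
sum_x2, sum_xy = 289.42, 7439.37
sigma_hat = 14       # residual standard deviation given in part (b)
critical = 2.365     # t-table value for df = n - 2 = 7

x_bar = sum_x / n
y_bar = sum_y / n

s_xx = sum_x2 - n * x_bar ** 2
s_xy = sum_xy - n * x_bar * y_bar

b = s_xy / s_xx
a = y_bar - b * x_bar

# Test statistic for H0: b = 0.
z_star = b / (sigma_hat / math.sqrt(s_xx))
significant = abs(z_star) > critical

print(f"a = {a:.2f}, b = {b:.3f}, Z* = {z_star:.2f}, significant: {significant}")
```

Carrying full precision through the calculation can change the last printed digit of b slightly compared with the hand-rounded value.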

(c) The gray line is the regression model. Between the sets of coloured curves, the red dashed pair are the prediction bands and the blue solid ones are the confidence bands, because we know that the confidence bands are always narrower than the prediction bands.

The scatter plot shows no evidence of any violations of the linear regression model.

By using the figure, we can estimate the mortality at exposure = 5 by finding 5 on the horizontal axis and drawing a vertical line until it intersects the regression line. We can then extend from the point of intersection to the vertical axis to obtain an estimate of the mortality, which is approximately 160. These lines are shown as green dotted lines on the attached figure. We can validate this estimate by comparing the answer to that obtained by the regression model, which is

$$\hat{Y} = 114.71 + 9.231X \Rightarrow \hat{Y} = 114.71 + 9.231(5) = 160.87$$

[Figure: scatter plot of Mortality against Exposure with the regression line and two pairs of bands; legend: Prediction, Confidence.]

(d) Using the figure, the site can be compared to other areas with the same level of exposure (X = 7) by using the prediction bands (red dashed curves). We mark the prediction bands at X = 7, shown in green boxes, and those points represent the ends of our prediction interval for pre-spawn mortality. Since a 95% confidence level is used throughout this course, the ends of this interval are the upper and lower 95% prediction limits of the pre-spawn mortality with exposure at X = 7. Since the upper limit is much lower than 250, the biologists can conclude that, with 95% confidence, the site in question has elevated pre-spawn mortality.

(5a)

$$n = 25, \quad \bar{X} = \frac{8173}{25}, \quad \bar{Y} = \frac{82}{25}, \quad \sum_{i=1}^{n} X_i^2 = 2751705, \quad \sum_{i=1}^{n} Y_i^2 = 3846, \quad \sum_{i=1}^{n} X_i Y_i = 27274$$

$$b = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{27274 - 25 \times 8173/25 \times 82/25}{2751705 - 25 \times (8173/25)^2} = 0.005847,$$

$$a = \bar{Y} - b\bar{X} = 82/25 - 0.005847 \times 8173/25 = 1.3685.$$

Therefore A = 1.369, B = 0.00585 and

$$\hat{Y} = 1.369 + 0.00585X.$$

(b) The test statistic is simply b/SD(b), so for example

$$C = \frac{0.00585}{0.044} = 0.13.$$

Statistical significance can be determined by comparing the absolute value of the test statistic to a critical value from the t-table. For n = 25, df = n − 2 = 23, hence the critical value is 2.069. Since C = 0.13 < 2.069, C is not significant. Hence there is not sufficient evidence to suggest there is an association between recent magmatic activity and geothermal activity.

The remaining test statistics can be obtained in the same way. There is also not sufficient evidence to support an association between time of earthquake activity and geothermal activity. However, there is evidence to support that higher heat flow and a shorter distance to a fault line are both associated with higher geothermal activity.

Variable   a                  b                    Test statistic
X1         1.369   (14.46)    0.00585  (0.044)     0.13
X2         19.88   (11.69)    -0.70    (0.48)      -1.45
X3         -31.45  (13.11)    0.58     (0.23)      2.52*
X4         10.53   (2.91)     -2.87    (0.83)      -3.45*
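The test statistics in the table can be recomputed with a few lines of Python. This sketch divides each rounded slope by its rounded SD, so the last digit may differ slightly from the tabulated values (for example for X2 and X4), which were presumably computed from unrounded estimates. The dictionary layout is our own choice.

```python
# Slope estimates and their standard deviations, as shown in the table.
estimates = {
    "X1": (0.00585, 0.044),
    "X2": (-0.70, 0.48),
    "X3": (0.58, 0.23),
    "X4": (-2.87, 0.83),
}

critical = 2.069  # t-table value for df = 25 - 2 = 23

results = {}
for name, (b, sd_b) in estimates.items():
    statistic = b / sd_b
    # Two-sided test: compare the absolute value with the critical value.
    results[name] = (statistic, abs(statistic) > critical)

for name, (statistic, significant) in results.items():
    flag = "*" if significant else ""
    print(f"{name}: {statistic:.2f}{flag}")
```

Only X3 and X4 clear the critical value, matching the asterisks in the table.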

(c) By using the figure, we can estimate the activity at the location that is 2 km from a fault line by finding 2 on the horizontal axis and drawing a vertical line until it intersects the regression line. We can then extend from the point of intersection to the vertical axis to obtain an estimate of the activity, which is approximately 5. These lines are shown as green dotted lines on the attached figure. We can compare this estimate to that obtained by the regression model, which is

$$\hat{Y} = 10.53 - 2.87X \Rightarrow \hat{Y} = 10.53 - 2.87(2) = 4.79$$

so the two estimates are similar.

[Figure: scatter plot of Activity (vertical axis) against X4 (horizontal axis).]

(d) Using the figure, the geothermal activity at the location can be evaluated by using the 95% prediction bands (red dashed curves). We mark the prediction bands at X = 2, shown in green boxes, and those points represent the ends of our prediction interval for geothermal activity. Since the lower limit is below −15, our prediction does not rule out geothermal activity more than 15 percent below average.

(e) The margin of error for X = x, at the 95% confidence level, has the following form:

$$1.96\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}.$$

Hence, all things equal, the further x is from $\bar{X}$, the bigger the margin of error. In this problem $\bar{X} = \frac{63.1}{25} \approx 2.52$. The first site (at X = 2) is closer to $\bar{X}$ than the second site (at X = 1), and hence, the estimate at site 1 has a smaller margin of error.
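The comparison in (5e) can be made concrete with the summary statistics for X4. Since σ̂ is not given for this regression, the sketch below compares only the square-root factor of the margin of error, which is enough to rank the two sites. The function name `relative_margin` is an illustrative assumption, not something from the original.

```python
import math

# Summary statistics for X4 (distance from a fault line, km).
n = 25
sum_x, sum_x2 = 63.1, 308.2

x_bar = sum_x / n
s_xx = sum_x2 - n * x_bar ** 2   # equals the sum of (X_i - x_bar)^2


def relative_margin(x):
    """Margin of error at X = x, divided by the common factor 1.96 * sigma_hat."""
    return math.sqrt(1 + 1 / n + (x - x_bar) ** 2 / s_xx)


site_2km = relative_margin(2)   # first site
site_1km = relative_margin(1)   # second site

print(f"x_bar = {x_bar:.3f}")
print(f"relative margin at 2 km: {site_2km:.4f}")
print(f"relative margin at 1 km: {site_1km:.4f}")
```

The factor grows with the squared distance from the mean of X4, so the site nearer to 2.52 km has the tighter prediction.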

[Figure: scatter plot of Activity (vertical axis) against X4 (horizontal axis).]