MATH1725 Introduction to Statistics: Exercisessta6ajb/math1725/1725ex.pdf · Paleoichnologists...

Date post: 22-Jul-2020
Transcript
Page 1: MATH1725 Introduction to Statistics: Exercisessta6ajb/math1725/1725ex.pdf · Paleoichnologists study dinosaur tracks. The book Dinosaur Tracks and Traces (editors D.D.Gillette and

MATH1725 Introduction to Statistics: Exercises

Exercises I – start after lecture 2, complete by lecture 5 – TUTOR to mark

Q1. The following data, collected one early October, give the age since registration of one hundred cars in the University of Leeds Orange Zone Car Park.

Age (years)   0    0–1   1–2   2–3   3–4   4–5   5–6   6–7   7–8   8–9   9–10   10–11   11–12
Frequency     4    15    16    5     12    12    7     10    6     4     4      3       2

Determine the mean age, median age, standard deviation and semi-interquartile range for this sample of one hundred cars.

The ages of a second sample of fifty cars, parked in the then BBC Leeds staff car park, are shown below.

Age (years)   0    0–1   1–2   2–3   3–4   4–5   5–6   6–7   7–8   8–9   9–10   10–11   11–12
Frequency     3    6     3     7     5     6     8     3     3     2     2      2       0

For this second data set the summary statistics are: sample mean = 4.25 years, sample standard deviation = 2.91 years, sample median = 4.17 years, semi-interquartile range = 1.93 years.

Do the ages of cars in the University of Leeds car park look significantly different from those in the flesh-pots of BBC Leeds? (Don’t do a test of hypothesis.)
Hint: Cars aged “0” were newly registered. Those aged “0–1” were registered in the previous twelve months and can be assumed to have class mid-point 0.5 years.

Q2. Mercer and Hall reported the yields of grain from 500 small plots. The data were recorded to the nearest tenth of a pound and grouped into class intervals of width 0.4 lbs. as follows.

Yield     Frequency   Yield     Frequency   Yield     Frequency   Yield     Frequency
2.8–3.1   19          3.6–3.9   141         4.4–4.7   94          5.2–5.5   4
3.2–3.5   67          4.0–4.3   157         4.8–5.1   18

Determine the mean and standard deviation of these values.

Q3. Paleoichnologists study dinosaur tracks. The book Dinosaur Tracks and Traces (editors D.D.Gillette and M.G.Lockley, Cambridge University Press, 1989) is one of many to study the interesting problems posed by such tracks. A key question is whether dinosaurs in any region travelled along certain preferred routes. Scientists want to know about the average direction travelled by the dinosaurs and the spread of these values.

Five such dinosaur tracks make angles of 36°, 72°, 44°, 88°, and 23°. What is the average direction the dinosaurs were travelling in?

[Figure: a unit vector at angle θ to the 0° axis, showing the direction of travel, with components cos(θ) and sin(θ).]

Hint: You cannot just average the angles. For example, averaging 359° and 1° gives 180°, not the sensible 0°! You must invent a measure of location for these directional values. To do this imagine each angle as represented by a unit vector in the appropriate direction. How might you find the average of these unit vectors?


Exercises II – start after lecture 6, complete by lecture 9 – student to mark

Q1. Ten tins of a brand of tomato juice were found to have vitamin C concentrations (in mg per 100 gms.)

16, 22, 21, 20, 23, 21, 19, 15, 13, 22.

(a) Find a point estimate for µ, the mean vitamin C concentration for this brand.
(b) Assuming that vitamin C concentration is normally distributed with known variance $\sigma^2 = 16$, construct a 95% confidence interval for µ.
(c) Is a sample of size n = 10 large enough to ensure that 95% of the time µ will lie in the interval $\bar{x} \pm 1.8$? If not, what should n be?
(d) Test whether µ = 18 against the alternative that µ > 18. This can be interpreted as testing whether the mean is no more than 18 mg per 100 gms.

Q2. According to Donnelly’s Estate Agents on Woodhouse Lane, the average weekly rent payable by students in private accommodation is £55. A random sample of ten such weekly rents obtained from Donnelly’s were

£50, £55, £52.50, £52.50, £62, £55, £50, £57.50, £50, £54.50

(a) Find the mean $\bar{x}$ and variance $s^2$ of this sample.
(b) Assuming that rents are normally distributed with both mean µ and variance $\sigma^2$ unknown, construct a 95% confidence interval for µ.
(c) Does your answer support Donnelly’s assertion that the mean is £55?

Q3. In the following cases write down the null and alternative hypotheses, an appropriate test statistic and its distribution under the null hypothesis, the critical region and the decision rule.
(a) A random sample of five years chosen from the last century is used to test the claim at the 1% significance level that average annual rainfall in Leeds is 29″. The standard deviation of annual rainfall is σ = 2.2″.
(b) A random sample of eight students is used to test the claim at the 5% significance level that, on average, students spend at least 18 hours studying, in addition to time spent in lectures and tutorials.
(c) A random sample of 66 middle school children within the Leeds Education Authority area provides information to test the claim at the 1% significance level that a school lunch costs on average no more than £2.
Hint: Recall the critical region represents values of the test statistic which lead to rejection of the null hypothesis. Also, compare parts (b) and (c) with Q1(d).

Q4. In a medical study of patients given the drug sulphasalazine and a placebo, sixteen patients were paired up. One of each pair received the drug and the other the placebo. The response score for each patient was found.

Pair number        1      2      3      4      5      6      7      8
Drug response      0.16   0.97   1.57   0.55   0.62   1.12   0.68   1.69
Placebo response   0.11   0.13   0.77   1.19   0.46   0.41   0.40   1.28

By considering the differences between the responses, test whether the two treatments are significantly different at the 5% level.

Q5. Consider a 95% confidence interval for the mean µ with known variance $\sigma^2$.
(a) “Increasing the sample size n makes a confidence interval wider”. True or false? Why?
(b) “Greater variability in data makes a confidence interval wider”. True or false? Why?


Exercises III – start after lecture 10, complete by lecture 13 – TUTOR to mark

Q1. The data below give the obesity and blood pressure of seven Mexican-American adult females (aged 35–60) in a small Californian town.

Obesity (actual weight/ideal weight)   1.50   1.59   1.43   1.63   2.39   1.50   0.95
Systolic blood pressure (mmHg)         140    150    130    132    150    112    138

Is the correlation coefficient, to two decimal places, between these two quantities: (i) 0.27,(ii) 0.32, (iii) 0.37, (iv) 0.42, (v) 0.47, (vi) 0.52? Show your calculations.

Q2. The table below gives the ages of husband and wife for marriages in England and Wales in 1986, with frequencies given in hundreds. A dash marks an empty cell.

                                  Age of husband (years)
Age of wife (years)   15–24   25–34   35–44   45–54   55–64   65–84   Total
15–24                 1094    682     53      6       1       -       1836
25–34                 150     682     201     31      4       1       1069
35–44                 9       75      160     80      18      2       344
45–54                 -       6       28      62      37      8       141
55–64                 -       -       1       8       27      19      55
65–84                 -       -       -       1       6       28      35
Totals                1253    1445    443     188     93      58      3480

(a) By using the mid-points 20, 30, 40, 50, 60, 75 for the age intervals, calculate the correlation coefficient between age of husband and age of wife at marriage. Is your answer, to two decimal places, (i) 0.79, (ii) 0.80, (iii) 0.81, (iv) 0.82, (v) 0.83, (vi) 0.84? Show your calculations.
(b) Comment upon your result and the data in general.
Hint: Don’t forget that the 65–84 class has mid-point 75!

Q3. The weight x and fuel consumption y for seven cars are given below.

Weight (lbs)                           3400   3800   4100   2200   2600   2900   2000
Fuel consumption (gallons/100 miles)   5.5    5.9    6.5    3.3    3.6    4.6    3.0

(a) Obtain the regression line for fuel consumption given weight.
(b) Calculate the residual sum of squares about the fitted line.

Q4. Consider the least squares regression line of y given x based on n pairs of data $(x_k, y_k)$ $(k = 1, 2, \ldots, n)$ where the sample variances of $x_k$ and of $y_k$ satisfy $s_X^2 > 0$ and $s_Y^2 > 0$. The fitted Y-value $\hat{Y}_k$ at $x = x_k$ satisfies

\[ \hat{Y}_k = \bar{y} + \hat{\beta}(x_k - \bar{x}). \]

The deviation of the observed value $y_k$ from the fitted value $\hat{Y}_k$ is thus $(y_k - \hat{Y}_k)$.
By recalling from your notes what $\hat{\beta}$ equals, show that

\[ \sum_{k=1}^{n} (y_k - \hat{Y}_k)^2 = \sum_{k=1}^{n} (y_k - \bar{y})^2 - \hat{\beta}^2 \sum_{k=1}^{n} (x_k - \bar{x})^2, \]

and hence show

\[ \sum_{k=1}^{n} (y_k - \hat{Y}_k)^2 = (1 - r_{XY}^2) \sum_{k=1}^{n} (y_k - \bar{y})^2 \qquad (\star) \]

where $r_{XY}$ is the sample correlation coefficient between the $x_k$ and the $y_k$.

Q5. Use the result (⋆) in question 4 to deduce that −1 ≤ rXY ≤ +1.


Exercises IV: Start after lecture 13, complete by lecture 17 – TUTOR to mark

Q1. A regression line passing through the origin, $y = \beta x$, might serve as a model for the road distance y between two points whose straight-line distance apart is x. For n pairs of observed values $(x_k, y_k)$ $(k = 1, 2, \ldots, n)$ the least squares estimate for β is found by minimising

\[ S = \sum_{k=1}^{n} (y_k - \beta x_k)^2. \]

Show that the least squares estimate for β is

\[ \hat{\beta} = \sum_{k=1}^{n} x_k y_k \Big/ \sum_{k=1}^{n} x_k^2. \]

Q2. The data below give the straight-line distance x and road distance y between ten pairs of points around Sheffield.

x (miles)   9.5    9.8    5.0   19.0   23.0   14.6   15.2   8.3   11.4   21.6
y (miles)   10.7   11.7   6.5   25.6   29.4   16.3   17.2   9.5   18.4   28.8

(a) Draw a scatter diagram to show it is reasonable to fit a straight line through the origin to these data.
(b) Calculate the regression line of road distance for given straight-line distance of the form $y = \beta x$. Is the slope coefficient: (i) 0.784, (ii) 1.329, (iii) 1.276, (iv) 1.130, (v) 1.267, (vi) 1.239?
(c) Use your line to estimate the road distance between two points which are a straight-line distance apart of 10.0 miles.

Q3. The weight x and fuel consumption y for seven cars are given below.

Weight (lbs)                           3400   3800   4100   2200   2600   2900   2000
Fuel consumption (gallons/100 miles)   5.5    5.9    6.5    3.3    3.6    4.6    3.0

The least squares regression line for fuel consumption given weight was found in Exercises III, question 3, to be $y = -0.500 + 0.001709x$. Test whether the slope 0.001709 is significantly different from zero. What do you conclude from your test?
Hint: Read lecture 10.

Q4. Three fair coins are tossed and the number X of heads is recorded. If x heads occur, then the first x coins are turned over and the number Y of heads now showing is recorded.
(a) Obtain the joint probability function p(x, y).
(b) Obtain the marginal probability functions $p_X(x)$ and $p_Y(y)$.
(c) Obtain the mean and variance of X and Y.
(d) Calculate the correlation between X and Y. Is your answer: (i) 0.500, (ii) 0.658, (iii) 0.000, (iv) 0.750, (v) 0.866, (vi) 0.911?
Hint: Outcome (H,T,H) gives X = 2. Turning over the first two coins gives (T,H,H) so Y = 2.

Q5. A regular six-sided die is thrown and its score X is observed. If X = x, then x unbiased coins are tossed and the number of heads Y is observed.
(a) By considering the distribution of Y when X = x, explain why the conditional probabilities for Y given X = x satisfy

\[ \mathrm{pr}\{Y = y \mid X = x\} = \binom{x}{y} \left(\tfrac{1}{2}\right)^{x} \quad \text{for } y = 0, 1, \ldots, x \text{ and } x = 1, 2, 3, 4, 5, 6. \]

(b) Obtain the joint probability function p(x, y) of X and Y.
(c) Obtain the marginal probability function $p_Y(y)$ of Y.
(d) Obtain the conditional probability distribution of X given that Y = 3. (Write your probabilities as multiples of 1/16.)
Hints: For part (b), use the result $p(x, y) = \mathrm{pr}\{Y = y \mid X = x\} \times p_X(x)$. Deduce a similar expression for part (d) to give $\mathrm{pr}\{X = x \mid Y = 3\}$ for x = 1, 2, 3, . . . , 6.


Exercises V: start after lecture 16, complete by lecture 21 – TUTOR to mark

Q1. The random variables $X_1$ and $X_2$ have means $\mu_1$ and $\mu_2$ respectively, with common variance $\sigma^2$ and correlation coefficient ρ.
(a) What is the mean and variance of $Y = X_1 - 3X_2$?
(b) If $Z = gX_1 + X_2$ where g is a known constant, what is the value of cov(Y, Z)?
(c) What value of the constant g makes Z uncorrelated with Y?
(d) If ρ = 0.2, then g = 7. In this case evaluate $\mathrm{corr}(X_1, Y)$ and $\mathrm{corr}(X_1, Z)$.
Hints: Recall lecture 14. In part (a) recall the results for $E[a_1X_1 + a_2X_2]$ and $\mathrm{Var}[a_1X_1 + a_2X_2]$. In part (b) recall the result for $\mathrm{cov}(a_1X_1 + a_2X_2, b_1X_1 + b_2X_2)$. In part (c), corr(Y, Z) = 0. What then does cov(Y, Z) equal?

Q2. Four components of types A, B, C, and D are fitted together lengthwise to make an electrical junction. The mean length and standard deviation of length in stable production of each type of component are given below.

Component type              A        B        C        D
Mean (cms.)                 5.7      10.8     10.8     6.3
Standard deviation (cms.)   0.0056   0.0180   0.0180   0.0092

Components of types B and C have to be matched; as a result of this there is a correlation of 0.75 between the lengths of components of these types used in the same junction. Assuming all other correlations between lengths of components of different types are zero, and there is no space between successive components, calculate the mean length of a junction and its standard deviation.

Q3. A piston of circular cross-section has to fit into a similarly shaped cylinder. The distributions of diameters of pistons and cylinders are known to be normal with parameters as given below.

Piston diameters     Mean = 10.42 cms.   Standard deviation = 0.03 cms.
Cylinder diameters   Mean = 10.43 cms.   Standard deviation = 0.04 cms.

A piston and cylinder are selected at random for assembly. Is the probability that the piston will not fit into the cylinder: (i) 0.023, (ii) 0.421, (iii) 0.443, (iv) 0.579, (v) 0.557, (vi) 0.977?
Show your calculations.
Hint: Mind the gap.

Q4. In an experiment, sixteen rats were divided at random into two groups of eight rats. In group A the rats were reared in an atmosphere which contained a known percentage of dust. In group B the rats were reared in a dust free atmosphere. After three months the rats were killed and the lung weights (in gms.) were found.

Group A:   5.79   5.57   6.52   4.78   5.91   7.02   6.06   6.38
Group B:   4.20   4.06   5.81   3.63   2.80   5.10   3.64   4.53

From earlier studies it is known the two groups have variances $\sigma_A^2 = 0.50$ and $\sigma_B^2 = 0.80$ respectively. Test at the 1% level whether there is a significant difference in the mean lung weight for the two groups.
Hint: Recall lecture 15.

Q5. Random samples of 14 and 11 chicks respectively were given, from birth, a protein supplement, either oil meal or meat meal based. The weights (in gms.) of the chicks when six weeks old are given below.

Oil meal:    240 230 270 320 250 330 270 200 330 250 180 160 190 250
Meat meal:   150 260 380 260 240 320 210 340 300 310 260

Assuming that the variances of the weights for the two diets are equal, test at the 5% level whether the mean weight of 6-week old chicks on the two diet supplements is the same.


Exercises VI: start after lecture 20, not to be handed in

Q1. In a random sample of 1000 people, 500 indicate that they support political party A. Obtain a 95% confidence interval for the proportion of the population who support party A.

Q2. The data below from two independent surveys of American homes give the number of households in which at least one adult was at home during the working day (between 9 a.m. and 4 p.m. on a weekday).

Year of survey   Number of households in survey   Number of houses with at least one adult at home during working day
1971             500                              296
1976             1000                             463

Test whether the proportion of houses with at least one adult at home during the working day has remained constant over the five year period.

Q3. The following data give the frequency distribution of the size of casual groups of people on a Spring afternoon in Portland, Oregon.

Size of group   1      2     3     4    5    6
Frequency       1486   694   195   37   10   1

A suggested model for the probability $p_r$ of a group of size r is

\[ p_r = \frac{\mu^r e^{-\mu}}{r!\,(1 - e^{-\mu})} \]

for r = 1, 2, 3, . . ., where µ is estimated from the data to be 0.8925. Does this model give a good fit to these data?

Q4. Macdonell (1901) collected the following data about the Gloucester smallpox epidemic of 1895-96.

               Recoveries   Deaths   Totals
Vaccinated     1091         120      1211
Unvaccinated   454          314      768
Totals         1545         434      1979

Are the two factors vaccinated/unvaccinated and recovery/death independent?


Solutions to exercises

Solutions I

Q1.

The calculations for sample mean and standard deviation can be put into a table.

Age of car (years)   Class mark x_i   Frequency f_i   f_i x_i   f_i x_i^2
0                    0.0              4               0.0       0.00
0–1                  0.5              15              7.5       3.75
1–2                  1.5              16              24.0      36.00
2–3                  2.5              5               12.5      31.25
3–4                  3.5              12              42.0      147.00
4–5                  4.5              12              54.0      243.00
5–6                  5.5              7               38.5      211.75
6–7                  6.5              10              65.0      422.50
7–8                  7.5              6               45.0      337.50
8–9                  8.5              4               34.0      289.00
9–10                 9.5              4               38.0      361.00
10–11                10.5             3               31.5      330.75
11–12                11.5             2               23.0      264.50
Totals                                n = 100         415.0     2678.00

\[ \text{Sample mean } \bar{x} = \frac{1}{n} \sum_{i=1}^{k} f_i x_i = \frac{415}{100} = 4.15 \approx 4.2 \text{ years.} \]

\[ s^2 = \frac{1}{n-1} \left\{ \sum_{i=1}^{k} f_i x_i^2 - n\bar{x}^2 \right\} = \frac{1}{99} \left\{ 2678.00 - 100(4.15)^2 \right\} = 9.654040 \approx 9.65. \]

\[ \text{Sample standard deviation } s = \sqrt{s^2} = \sqrt{9.654040} = 3.107095 \approx 3.11 \text{ years.} \]

The sample mean is 4.15 years. Since the data are given as whole years it does not make sense to give the mean to two decimal places, equivalent to an accuracy of ±0.005 years or ±1.8 days! One decimal place is more than ample.
To find the median and quartiles we need to determine the cumulative frequencies.

Class limits   Class frequency   Class upper boundary   Cumulative frequency at upper boundary
0              4                 0.0                    4
0–1            15                1.0                    19
1–2            16                2.0                    35
2–3            5                 3.0                    40
3–4            12                4.0                    52
4–5            12                5.0                    64
5–6            7                 6.0                    71
6–7            10                7.0                    81
7–8            6                 8.0                    87
8–9            4                 9.0                    91
9–10           4                 10.0                   95
10–11          3                 11.0                   98
11–12          2                 12.0                   100


The median class is the “3–4” class. The median is the value M having cumulative frequency 50 and is found using interpolation.

\[ M = 3.0 + \frac{(50 - 40)}{(52 - 40)} \times (4.0 - 3.0) = 3.0 + \frac{10}{12} \times (4.0 - 3.0) = 3.0 + 0.83 \approx 3.8 \text{ years.} \]

The lower quartile has cumulative frequency 25 and lies in the “1–2” class. By interpolation,

\[ Q_1 = 1.0 + \frac{(25 - 19)}{16} \times (2.0 - 1.0) = 1.0 + 0.375 = 1.375 \approx 1.38 \text{ years.} \]

The upper quartile has cumulative frequency 75 and lies in the “6–7” class. By interpolation,

\[ Q_3 = 6.0 + \frac{(75 - 71)}{10} \times (7.0 - 6.0) = 6.40 \text{ years.} \]

Semi-interquartile range $= \frac{1}{2}(Q_3 - Q_1) = \frac{1}{2}(6.400 - 1.375) = 2.51$ years.
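The grouped-data calculations above can be checked with a short script. This is only a sketch: the variable names and the helper `grouped_quantile` are illustrative choices, not part of the course notes, and the interpolation follows the cumulative-frequency method used in this solution.

```python
from math import sqrt

# University car park sample: class marks, class upper boundaries, frequencies.
marks = [0.0] + [k + 0.5 for k in range(12)]    # 0, 0.5, 1.5, ..., 11.5
uppers = [0.0] + [k + 1.0 for k in range(12)]   # 0, 1, 2, ..., 12
freqs = [4, 15, 16, 5, 12, 12, 7, 10, 6, 4, 4, 3, 2]

n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, marks)) / n
var = (sum(f * x * x for f, x in zip(freqs, marks)) - n * mean ** 2) / (n - 1)
sd = sqrt(var)


def grouped_quantile(target):
    """Value whose cumulative frequency equals `target`, found by linear
    interpolation inside the class that contains it (hypothetical helper)."""
    cum, lower = 0, 0.0
    for f, upper in zip(freqs, uppers):
        if f > 0 and cum + f >= target:
            return lower + (target - cum) / f * (upper - lower)
        cum, lower = cum + f, upper
    raise ValueError("target exceeds the sample size")


median = grouped_quantile(n / 2)
q1 = grouped_quantile(n / 4)
q3 = grouped_quantile(3 * n / 4)
siqr = (q3 - q1) / 2

print(f"mean={mean:.2f}  sd={sd:.2f}  median={median:.2f}")
print(f"Q1={q1:.3f}  Q3={q3:.3f}  SIQR={siqr:.4f}")
```

Replacing `freqs` with the BBC frequencies gives the corresponding summary for the second sample.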

The summary statistics for the two data sets can now be compared.

                            University Car Park   BBC Leeds Car Park
Sample mean                 4.2                   4.3
Sample median               3.8                   4.2
Sample standard deviation   3.11                  2.91
Lower quartile              1.38                  2.07
Upper quartile              6.40                  5.94
Semi-interquartile range    2.51                  1.93

The cars in the two car parks have similar mean ages. The variability of ages is very slightly greater amongst the cars in the University car park than in the other car park. On the whole there is not much difference between the two car parks.
The calculations for the BBC North data are given below.

Age of car (years)   Class mark x_i   Frequency f_i   f_i x_i   f_i x_i^2
0                    0.0              3               0.0       0.00
0–1                  0.5              6               3.0       1.50
1–2                  1.5              3               4.5       6.75
2–3                  2.5              7               17.5      43.75
3–4                  3.5              5               17.5      61.25
4–5                  4.5              6               27.0      121.50
5–6                  5.5              8               44.0      242.00
6–7                  6.5              3               19.5      126.75
7–8                  7.5              3               22.5      168.75
8–9                  8.5              2               17.0      144.50
9–10                 9.5              2               19.0      180.50
10–11                10.5             2               21.0      220.50
Totals                                n = 50          212.5     1317.75

\[ \text{Sample mean } \bar{x} = \frac{1}{n} \sum_{i=1}^{k} f_i x_i = \frac{212.5}{50} = 4.25 \text{ years.} \]

\[ s^2 = \frac{1}{n-1} \left\{ \sum_{i=1}^{k} f_i x_i^2 - n\bar{x}^2 \right\} = \frac{1}{49} \left\{ 1317.75 - 50(4.25)^2 \right\} = 8.461735 \approx 8.46. \]


\[ \text{Sample standard deviation } s = \sqrt{s^2} = \sqrt{8.461735} = 2.90891 \approx 2.91 \text{ years.} \]

Q2.

The class boundaries are (2.75, 3.15), (3.15, 3.55), (3.55, 3.95), and so on, with class mid-points 2.95, 3.35, 3.75, and so on. A suitable coding to use is $z = (x - m)/c = (x - 4.15)/0.4$.

Class     Class mid-point x   Frequency f   z = (x − 4.15)/0.4   fz     fz^2
2.8–3.1   2.95                19            −3                   −57    171
3.2–3.5   3.35                67            −2                   −134   268
3.6–3.9   3.75                141           −1                   −141   141
4.0–4.3   4.15                157           0                    0      0
4.4–4.7   4.55                94            1                    94     94
4.8–5.1   4.95                18            2                    36     72
5.2–5.5   5.35                4             3                    12     36
Totals                        n = 500                            −190   782

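Sample mean of coded values is

\[ \bar{z} = \frac{1}{n} \sum_{i=1}^{k} f_i z_i = \frac{-190}{500} = -0.38. \]

Sample variance of coded values is

\[ s_z^2 = \frac{1}{n-1} \left\{ \sum_{i=1}^{k} f_i z_i^2 - n\bar{z}^2 \right\} = \frac{1}{499} \left\{ 782 - 500(-0.38)^2 \right\} = \frac{709.8}{499} = 1.42244. \]

For the original units of measurement we have,

\[ \text{Sample mean } \bar{x} = m + c\bar{z} = 4.15 + 0.4 \times (-0.38) = 3.998 \approx 4.0 \text{ lbs.} \]

\[ \text{Sample variance } s_x^2 = c^2 s_z^2 = (0.4)^2 \times 1.42244 = 0.227591 \approx 0.228 \text{ lbs}^2. \]

\[ \text{Sample standard deviation } s_x = \sqrt{s_x^2} = \sqrt{0.227591} = 0.477065 \approx 0.48 \text{ lbs.} \]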
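As a check on the coding method, the same steps can be scripted. The sketch below assumes the coding $z = (x - 4.15)/0.4$ used above; the variable names are illustrative only.

```python
from math import sqrt

# Mercer and Hall grain yields: class mid-points (lbs) and frequencies.
mids = [2.95, 3.35, 3.75, 4.15, 4.55, 4.95, 5.35]
freqs = [19, 67, 141, 157, 94, 18, 4]

m, c = 4.15, 0.4                             # coding origin and class width
z = [round((x - m) / c) for x in mids]       # coded values -3, -2, ..., 3

n = sum(freqs)
z_bar = sum(f * zi for f, zi in zip(freqs, z)) / n
var_z = (sum(f * zi * zi for f, zi in zip(freqs, z)) - n * z_bar ** 2) / (n - 1)

# Undo the coding: x = m + c z, so the mean shifts and scales and the
# variance scales by c squared.
x_bar = m + c * z_bar
var_x = c ** 2 * var_z
sd_x = sqrt(var_x)

print(f"z_bar={z_bar:.2f}  var_z={var_z:.5f}")
print(f"x_bar={x_bar:.3f}  var_x={var_x:.6f}  sd_x={sd_x:.6f}")
```

Coding keeps the arithmetic in small integers, which mattered for hand calculation; on a computer the uncoded formulae give the same mean and standard deviation.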

Q3.
Each track direction, such as 36°, can be regarded as a unit vector with the appropriate angle. In polar co-ordinates the direction 36° represents the vector (r, θ) = (1, 36°). In Cartesian co-ordinates this unit vector is given by (x, y) = (cos 36°, sin 36°) = (0.8090, 0.5878).

[Figure: a unit vector at angle θ to the 0° axis, showing the direction of travel, with components cos(θ) and sin(θ).]

For each of the five directions $\theta_i$ we can calculate the corresponding x and y increments, $x_i = \cos(\theta_i)$ and $y_i = \sin(\theta_i)$.


Angle θ_i   cos(θ_i)   sin(θ_i)
36          0.8090     0.5878
72          0.3090     0.9511
44          0.7193     0.6947
88          0.0348     0.9994
23          0.9205     0.3907
Totals      2.7926     3.6237

The sum of the x-increments of the five vectors is 2.7926 and the sum of the y-increments is 3.6237. The average x-increment is $2.7926/5 = 0.55852$, and the average y-increment is $3.6237/5 = 0.72474$. We can imagine the “average vector” as given by $(\bar{x}, \bar{y}) = (0.55852, 0.72474)$. In polar co-ordinates this is the vector $(R, \Theta)$ where $\bar{x}^2 + \bar{y}^2 = R^2$ and $\tan(\Theta) = \bar{y}/\bar{x}$.
For these data,

\[ \tan(\Theta) = \frac{\bar{y}}{\bar{x}} = \frac{0.72474}{0.55852} = 1.2976 \quad \text{so } \Theta = 52.38^\circ \approx 52^\circ. \]

The “mean angle”, the measure of location we were looking for, is Θ = 52◦.

The signs of $\bar{x}$ and $\bar{y}$ can be used to check which quadrant Θ lies in. For example, if $\bar{y} = -0.72474$ and $\bar{x} = -0.55852$ then tan(Θ) = 1.2976 but clearly Θ = 232°.

Notice that if all the five track directions had been the same we would have had R = 1 so (1 − R) = 0. If the five directions had been spread throughout 360°, say 0°, 72°, 144°, 216°, 288°, then the “average” vector will have length R = 0 so (1 − R) = 1.
Clearly, for angles closely clustered together the quantity (1 − R) is small, whereas for angles widely scattered (1 − R) is large. Not only have we found a measure of location Θ, but we also have a measure of concentration or spread given by (1 − R).
For these data,

\[ R^2 = \bar{x}^2 + \bar{y}^2 = 0.55852^2 + 0.72474^2 = 0.83719 \]

so R = 0.915 and (1 − R) = (1 − 0.915) = 0.085. This is small, suggesting that the five dinosaur directions are closely clustered together. Generally they were going in the same direction, wherever that was!
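The mean direction and the spread measure (1 − R) can be computed directly from the unit vectors. The sketch below is one possible implementation; the function name `circular_mean` is an illustrative choice, and `atan2` is used so that the quadrant check described above happens automatically.

```python
from math import atan2, cos, degrees, hypot, radians, sin


def circular_mean(angles_deg):
    """Return (mean direction in degrees, reduced modulo 360;
    length R of the average unit vector)."""
    n = len(angles_deg)
    x_bar = sum(cos(radians(a)) for a in angles_deg) / n
    y_bar = sum(sin(radians(a)) for a in angles_deg) / n
    # atan2 uses the signs of both components to pick the right quadrant.
    theta = degrees(atan2(y_bar, x_bar)) % 360.0
    return theta, hypot(x_bar, y_bar)


theta, R = circular_mean([36, 72, 44, 88, 23])
print(f"mean direction = {theta:.2f} degrees, R = {R:.3f}, 1 - R = {1 - R:.3f}")

# The example from the hint: 359 and 1 degrees average to (about) 0, not 180.
theta_hint, R_hint = circular_mean([359, 1])
print(f"hint example: {theta_hint:.6f} degrees (modulo 360), R = {R_hint:.4f}")
```

Directions spread evenly round the circle, such as 0°, 72°, 144°, 216°, 288°, give R close to zero, in which case the mean direction is not meaningful.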


Solutions II

Q1.

(a) Sample mean $\bar{x} = 192/10 = 19.2$ mg per 100 gms is a suitable point estimate for µ.
(b) Let X denote the vitamin C concentration of a tin with $X \sim N(\mu, \sigma^2 = 16)$. The 95% confidence interval for µ is

\[ \bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}} = 19.2 \pm 1.96 \frac{4}{\sqrt{10}} = 19.2 \pm 2.48 = (16.72, 21.68). \]

(c) For n = 10 above, we had $\bar{x} \pm 2.48$. To reduce ±2.48 to ±1.8 we need to increase n so that

\[ 1.96 \frac{\sigma}{\sqrt{n}} = 1.8 \;\Rightarrow\; n = \left( \frac{1.96\sigma}{1.8} \right)^2 = \left( \frac{1.96 \times 4}{1.8} \right)^2 = 18.97. \]

We should use n = 19 to ensure the criterion is satisfied.
(d) Since the variance is known, the test statistic is

\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{\bar{X} - 18}{4/\sqrt{10}} \]

where $Z \sim N(0, 1)$ if $H_0$ is true.
For α = 0.05 with a one-sided test, $z_\alpha = 1.645$. Critical region is Z > 1.645.
Decision rule: reject $H_0$ if observed z > 1.645.
Here z = (19.2 − 18)/1.265 = 0.949 so accept $H_0$.
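Parts (a) to (d) can be reproduced with the Python standard library. The sketch below takes the normal quantiles from `statistics.NormalDist` rather than from tables, so the last decimal place may differ slightly from hand working; the variable names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean

data = [16, 22, 21, 20, 23, 21, 19, 15, 13, 22]
sigma = 4.0                               # known standard deviation, sigma^2 = 16
n = len(data)

# (a) point estimate of mu
x_bar = mean(data)

# (b) 95% confidence interval with known variance
z975 = NormalDist().inv_cdf(0.975)        # about 1.96
half_width = z975 * sigma / sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)

# (c) smallest n whose half-width is at most 1.8
n_needed = ceil((z975 * sigma / 1.8) ** 2)

# (d) one-sided test of H0: mu = 18 against H1: mu > 18 at the 5% level
z = (x_bar - 18) / (sigma / sqrt(n))
z_crit = NormalDist().inv_cdf(0.95)       # about 1.645
reject_h0 = z > z_crit

print(f"x_bar={x_bar}  CI=({ci[0]:.2f}, {ci[1]:.2f})  n_needed={n_needed}")
print(f"z={z:.3f}  critical value={z_crit:.3f}  reject H0: {reject_h0}")
```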

Q2.

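(a)

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{539}{10} = £53.90 \text{ per week.} \]

\[ s^2 = \frac{1}{n-1} \left\{ \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right\} = \frac{1}{9}(29183 - 29052.1) = \frac{130.9}{9} = 14.54. \]

(b) Estimate the population variance $\sigma^2$ by $s^2$. Then

\[ T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}. \]

The 95% confidence interval for µ is

\[ \bar{x} \pm t_9(2.5\%) \frac{s}{\sqrt{10}} = 53.9 \pm 2.262 \times \frac{3.813}{\sqrt{10}} = 53.90 \pm 2.73 = (51.17, 56.63). \]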

(c) Since £55 lies inside the confidence interval, there is support for Donnelly’s assertion.
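The sample statistics and the interval in part (b) can be checked as follows. This is a sketch only: the critical value $t_9(2.5\%) = 2.262$ is copied from the tables quoted above rather than computed, because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

rents = [50, 55, 52.50, 52.50, 62, 55, 50, 57.50, 50, 54.50]
n = len(rents)

x_bar = mean(rents)
s2 = variance(rents)          # sample variance with divisor n - 1
s = sqrt(s2)

t_crit = 2.262                # t_9(2.5%), taken from tables
half_width = t_crit * s / sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)

# (c) is the claimed mean of 55 inside the interval?
claim_supported = ci[0] <= 55 <= ci[1]

print(f"x_bar={x_bar:.2f}  s2={s2:.2f}  CI=({ci[0]:.2f}, {ci[1]:.2f})")
print(f"55 inside interval: {claim_supported}")
```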

Q3.

(a) Let X denote annual rainfall in inches in Leeds.
Assume years are independent and $\bar{X} \sim N(\mu, \sigma^2/n)$. Here $\sigma^2 = 2.2^2 = 4.84$, n = 5, so $\bar{X} \sim N(\mu, 0.968)$.
Test $H_0 : \mu = 29$ vs. $H_1 : \mu \neq 29$. Test statistic is

\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{\bar{X} - 29}{0.984} \]

where $Z \sim N(0, 1)$ if $H_0$ is true.
For α = 0.01 with a two-sided test, $z_{\alpha/2} = 2.576$. Critical region is Z < −2.576 or Z > 2.576.
Decision rule: reject $H_0$ if observed z < −2.576 or observed z > 2.576, i.e., if |z| > 2.576.


(b) Let X denote number of study hours.
Assume the students are independent and $\bar{X} \sim N(\mu, \sigma^2/n)$ with n = 8. Here $\sigma^2$ is unknown so is estimated using the sample variance $s^2$.
Test $H_0 : \mu = 18$ vs. $H_1 : \mu < 18$. Test statistic is

\[ T = \frac{\bar{X} - \mu}{s/\sqrt{n}} = \frac{\bar{X} - 18}{s/\sqrt{8}} \]

where $T \sim t_7$ if $H_0$ is true.
For α = 0.05 with a one-sided test, $t_7(5\%) = 1.895$. Critical region is T < −1.895. Decision rule: reject $H_0$ if observed t < −1.895.

(c) Let X denote cost of a school lunch.
Assume lunches are priced independently and $\bar{X} \sim N(\mu, \sigma^2/n)$ with n = 66. Here $\sigma^2$ is unknown so is estimated using the sample variance $s^2$.
Test $H_0 : \mu = 2$ vs. $H_1 : \mu > 2$. Test statistic is

\[ T = \frac{\bar{X} - \mu}{s/\sqrt{n}} = \frac{\bar{X} - 2}{s/\sqrt{66}} \]

where $T \sim t_{65}$ if $H_0$ is true. Since the degrees of freedom are large, notice that $t_{65} \approx N(0, 1)$.
For α = 0.01 with a one-sided test, $t_{65}(1\%) \approx z_{0.01} = 2.326$. Critical region is T > 2.326.
Decision rule: reject $H_0$ if observed t > 2.326.

Q4.

This is a “matched-pair” type problem.

Pair number                     1      2      3      4       5      6      7      8
Drug response x_1i              0.16   0.97   1.57   0.55    0.62   1.12   0.68   1.69
Placebo response x_2i           0.11   0.13   0.77   1.19    0.46   0.41   0.40   1.28
Difference d_i = x_1i − x_2i    0.05   0.84   0.80   −0.64   0.16   0.71   0.28   0.41

Model differences $d_i$ as a random sample from a $N(\mu_d, \sigma_d^2)$ distribution.

\[ n = 8, \quad \sum_{k=1}^{8} d_k = 2.61, \quad \sum_{k=1}^{8} d_k^2 = 2.5339, \quad \bar{d} = 0.326, \quad s_d^2 = 0.2403. \]

If $\mu_d = 0$ then the drug and placebo give the same mean response score.
Testing $H_0 : \mu_d = 0$ vs. $H_1 : \mu_d \neq 0$ at the 5% level:

\[ t = \frac{\bar{d}}{\sqrt{s_d^2/n}} = \frac{0.326}{\sqrt{0.2403/8}} = 1.882, \qquad t_7(2.5\%) = 2.365. \]

Hence accept the null hypothesis at the 5% level. There is no significant difference in mean response between the two treatments at the 5% level.
Since $t_7(5\%) = 1.895$ we also just accept $H_0$ at the 10% level.
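The matched-pair calculation can be verified numerically. In this sketch the critical value $t_7(2.5\%) = 2.365$ is taken from the tables quoted above, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

drug = [0.16, 0.97, 1.57, 0.55, 0.62, 1.12, 0.68, 1.69]
placebo = [0.11, 0.13, 0.77, 1.19, 0.46, 0.41, 0.40, 1.28]

# Work with the within-pair differences only.
d = [a - b for a, b in zip(drug, placebo)]
n = len(d)

d_bar = mean(d)
s2_d = variance(d)                 # sample variance with divisor n - 1
t = d_bar / sqrt(s2_d / n)

t_crit = 2.365                     # t_7(2.5%), two-sided 5% test, from tables
reject_h0 = abs(t) > t_crit

print(f"d_bar={d_bar:.3f}  s2_d={s2_d:.4f}  t={t:.3f}  reject H0: {reject_h0}")
```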

Q5.

A 95% confidence interval is of the form $\bar{x} \pm 1.96(\sigma/\sqrt{n})$ so its width is $3.92(\sigma/\sqrt{n})$.
(a) False. Increasing n reduces the width of the interval.
(b) True. Increasing σ increases the width of the interval.


Solutions III

Q1.

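Answer (iii). Correlation = 0.37.
Let X denote obesity and Y blood pressure.

\[ \sum_{k=1}^{n} x_k = 1.50 + 1.59 + \cdots + 0.95 = 10.99 \;\Rightarrow\; \bar{x} = 10.99/7 = 1.57. \]

\[ \sum_{k=1}^{n} y_k = 140 + 150 + \cdots + 138 = 952 \;\Rightarrow\; \bar{y} = 952/7 = 136. \]

\[ \sum_{k=1}^{n} x_k^2 = 1.50^2 + 1.59^2 + \cdots + 0.95^2 = 18.3445. \]

\[ \sum_{k=1}^{n} y_k^2 = 140^2 + 150^2 + \cdots + 138^2 = 130512. \]

\[ \sum_{k=1}^{n} x_k y_k = 1.50 \times 140 + 1.59 \times 150 + \cdots + 0.95 \times 138 = 1507.16. \]

\[ s_{XY} = \frac{1}{n-1} \left( \sum_{k=1}^{n} x_k y_k - n\bar{x}\bar{y} \right) = \frac{1}{6}(1507.16 - 1494.64) = 2.08666. \]

\[ s_X^2 = \frac{1}{n-1} \left( \sum_{k=1}^{n} x_k^2 - n\bar{x}^2 \right) = \frac{1}{6}(18.3445 - 17.2543) = 0.1817 \;\Rightarrow\; s_X = 0.4263. \]

\[ s_Y^2 = \frac{1}{n-1} \left( \sum_{k=1}^{n} y_k^2 - n\bar{y}^2 \right) = \frac{1}{6}(130512 - 129472) = 173.3333 \;\Rightarrow\; s_Y = 13.1656. \]

\[ \text{Correlation } r_{XY} = \frac{s_{XY}}{s_X s_Y} = \frac{2.0866}{0.4263 \times 13.1656} = 0.372 \approx 0.37. \]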

This is not a very strong correlation.
In evaluating $s_X^2$, $s_Y^2$, and $s_{XY}$ don’t round $\bar{x}$ and $\bar{y}$ too soon. Thus, $s_X^2 = \frac{1}{6}(18.3445 - 7 \times 1.57^2) = 0.1817$ while $s_X^2 = \frac{1}{6}(18.3445 - 7 \times 1.6^2) = 0.0708$. Better to use

\[ s_X^2 = \left( \sum_{k=1}^{n} x_k^2 - \frac{\left( \sum x_k \right)^2}{n} \right) \Big/ (n - 1), \]

and similarly for $s_Y^2$ and $s_{XY}$.
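The advice about not rounding the means too soon is easy to follow in code, because the sums can be carried at full precision. The sketch below uses the sums-of-squares form recommended above; the names `sxx`, `syy` and `sxy` are illustrative and denote corrected sums, before division by n − 1.

```python
from math import sqrt

obesity = [1.50, 1.59, 1.43, 1.63, 2.39, 1.50, 0.95]
pressure = [140, 150, 130, 132, 150, 112, 138]
n = len(obesity)

# Corrected sums of squares and products, using (sum x)^2 / n rather than
# a rounded mean.
sxx = sum(x * x for x in obesity) - sum(obesity) ** 2 / n
syy = sum(y * y for y in pressure) - sum(pressure) ** 2 / n
sxy = sum(x * y for x, y in zip(obesity, pressure)) - sum(obesity) * sum(pressure) / n

# The divisor n - 1 cancels in the correlation coefficient.
r = sxy / sqrt(sxx * syy)

print(f"s_XY={sxy / (n - 1):.5f}  s2_X={sxx / (n - 1):.4f}  s2_Y={syy / (n - 1):.4f}")
print(f"r={r:.3f}")
```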

Q2.

(a) Answer (i). Correlation = 0.79.
Let X denote age of husband and Y age of wife. Recall that frequencies are in hundreds!

\[ \sum_i f_{i\cdot} x_i = 125300 \cdot 20 + 144500 \cdot 30 + \cdots + 9300 \cdot 60 + 5800 \cdot 75 = 10546000. \]

\[ \sum_j f_{\cdot j} y_j = 183600 \cdot 20 + 106900 \cdot 30 + \cdots + 3500 \cdot 75 = 9552500. \]

\[ \bar{x} = 10546000/348000 = 30.3046 \text{ years}, \qquad \bar{y} = 9552500/348000 = 27.4497 \text{ years}. \]


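Notice if we forget frequencies are in hundreds, it would not matter for the sample means.

\[ \sum_i f_{i\cdot} x_i^2 = 125300 \cdot 20^2 + 144500 \cdot 30^2 + \cdots + 5800 \cdot 75^2 = 364155000. \]

\[ \sum_j f_{\cdot j} y_j^2 = 183600 \cdot 20^2 + 106900 \cdot 30^2 + \cdots + 3500 \cdot 75^2 = 299427500. \]

\[ \sum_i \sum_j f_{ij} x_i y_j = 109400 \cdot 20 \cdot 20 + 68200 \cdot 30 \cdot 20 + \cdots + 2800 \cdot 75 \cdot 75 = 321810000. \]

\[ s_{XY} = \frac{1}{n-1} \left( \sum_i \sum_j f_{ij} x_i y_j - n\bar{x}\bar{y} \right) = \frac{1}{347999}(321810000 - 289484670) = 92.8891. \]

\[ s_X^2 = \frac{1}{n-1} \left( \sum_i f_{i\cdot} x_i^2 - n\bar{x}^2 \right) = \frac{1}{347999}(364155000 - 319592290) = 128.0541 \;\Rightarrow\; s_X = 11.3161. \]

\[ s_Y^2 = \frac{1}{n-1} \left( \sum_j f_{\cdot j} y_j^2 - n\bar{y}^2 \right) = \frac{1}{347999}(299427500 - 262213380) = 106.9374 \;\Rightarrow\; s_Y = 10.3411. \]

\[ \text{Correlation } r_{XY} = \frac{s_{XY}}{s_X s_Y} = \frac{92.8891}{11.3161 \times 10.3411} = 0.7938 \approx 0.79. \]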

Notice that n − 1 = 347999 here, not 3479, though it does not make a large difference to youranswers for sX , sY , and sXY .

(b) Quite a large positive correlation.Data suggest that age of men at marriage is on average slightly greater than the age of women.Men prefer to marry women younger than themselves.Or do women prefer to marry men slightly older than themselves?
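The remark about $n - 1$ can be illustrated from the totals in part (a). The sketch below is hypothetical in its naming (`summary_stats` and `scale` are not from the notes); `scale = 100` mimics forgetting that the frequencies are in hundreds.

```r
# Totals from part (a), with frequencies counted in single marriages.
n      <- 348000
sum_x  <- 10546000
sum_y  <- 9552500
sum_x2 <- 364155000
sum_y2 <- 299427500
sum_xy <- 321810000

# scale = 1 uses the true counts; scale = 100 treats the "hundreds" as counts,
# so every total and n shrink by a factor of 100.
summary_stats <- function(scale) {
  m   <- n / scale
  sxy <- (sum_xy - sum_x * sum_y / n) / scale / (m - 1)
  s2x <- (sum_x2 - sum_x^2 / n) / scale / (m - 1)
  s2y <- (sum_y2 - sum_y^2 / n) / scale / (m - 1)
  c(s_xy = sxy, s_x = sqrt(s2x), s_y = sqrt(s2y), r = sxy / sqrt(s2x * s2y))
}

full     <- summary_stats(1)    # divisor n - 1 = 347999
hundreds <- summary_stats(100)  # divisor n - 1 = 3479
print(rbind(full, hundreds))
```

The covariance and standard deviations change only slightly, and the correlation does not change at all because the divisor cancels in $s_{XY}/(s_X s_Y)$.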

Q3.

(a)

Weight (x)   Fuel consumption (y)   x^2        xy       y^2

3400         5.5                    11560000   18700    30.25
3800         5.9                    14440000   22420    34.81
4100         6.5                    16810000   26650    42.25
2200         3.3                     4840000    7260    10.89
2600         3.6                     6760000    9360    12.96
2900         4.6                     8410000   13340    21.16
2000         3.0                     4000000    6000     9.00

Totals
21000        32.4                   66820000   103730   161.32

$$\bar{x} = \frac{1}{n}\sum_{k=1}^{n} x_k = \frac{21000}{7} = 3000 \text{ lbs.} \qquad \bar{y} = \frac{1}{n}\sum_{k=1}^{n} y_k = \frac{32.4}{7} = 4.629 \text{ gallons per 100 miles.}$$

$$s_X^2 = \frac{1}{n-1}\left(\sum_{k=1}^{n} x_k^2 - n\bar{x}^2\right) = \frac{1}{6}\left\{66820000 - 7(3000)^2\right\} = 636666.7.$$

$$s_Y^2 = \frac{1}{n-1}\left(\sum_{k=1}^{n} y_k^2 - n\bar{y}^2\right) = \frac{1}{6}\left\{161.32 - 7(4.629)^2\right\} = 1.89238.$$

$$s_{XY} = \frac{1}{n-1}\left(\sum_{k=1}^{n} x_k y_k - n\bar{x}\bar{y}\right) = \frac{1}{6}\left\{103730 - 7(3000)(4.629)\right\} = 1088.33.$$

Regression line for $Y$ given $X = x$ is

$$y = \bar{y} + (x - \bar{x})\frac{s_{XY}}{s_X^2} = 4.629 + (x - 3000)\,0.001709 \;\Rightarrow\; y = -0.500 + 0.001709x.$$

(b) Residual sum of squares is

$$(n-1)\left(s_Y^2 - \frac{s_{XY}^2}{s_X^2}\right) = 6\left(1.89238 - \frac{1088.33^2}{636666.7}\right) = 0.1917.$$

Why measure fuel consumption in gallons per 100 miles?
Plotting y against x with y measured in units of gallons per 100 miles gives an approximate straight line relationship between y and x. Plotting y against x with y measured in units of miles per gallon gives a curvilinear relationship between y and x, something like y ∝ 1/x.

[Figure: two scatter plots against weight (2000 to 4000 lbs.). Left: fuel consumption in gallons per 100 miles (roughly linear). Right: fuel consumption in miles per gallon (curved).]

> weight=c(3400,3800,4100,2200,2600,2900,2000)  # input weights
> fuel=c(5.5,5.9,6.5,3.3,3.6,4.6,3.0)           # input fuel consumptions
> summary(lm(fuel~weight))

Residuals:
        1         2         3         4         5         6         7
 0.187659 -0.096111 -0.008938  0.038968 -0.344802  0.142371  0.080853

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.4997008  0.3095645  -1.614    0.167
weight       0.0017094  0.0001002  17.061 1.27e-05 ***

Residual standard error: 0.1958 on 5 degrees of freedom
Multiple R-Squared: 0.9831, Adjusted R-squared: 0.9797
F-statistic: 291.1 on 1 and 5 DF, p-value: 1.266e-05

> anova(lm(fuel~weight))

Analysis of Variance Table
Response: fuel

          Df  Sum Sq Mean Sq F value    Pr(>F)
weight     1 11.1625 11.1625  291.08 1.266e-05 ***
Residuals  5  0.1917  0.0383

The value 0.1917 above is the residual (or error) sum of squares as given by R.
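As a cross-check, the hand formulas of parts (a) and (b) can be evaluated directly and compared with `lm`. This is a sketch using the seven data pairs from the table; the names `slope`, `intercept` and `rss` are illustrative.

```r
weight <- c(3400, 3800, 4100, 2200, 2600, 2900, 2000)
fuel   <- c(5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 3.0)
n      <- length(weight)

# Slope and intercept from the sample covariance and variance.
slope     <- cov(weight, fuel) / var(weight)
intercept <- mean(fuel) - slope * mean(weight)

# Residual sum of squares from the formula in part (b).
rss <- (n - 1) * (var(fuel) - cov(weight, fuel)^2 / var(weight))

# The same quantities from the fitted linear model.
fit <- lm(fuel ~ weight)
print(c(intercept = intercept, slope = slope, rss = rss))
print(c(coef(fit), rss_lm = sum(resid(fit)^2)))
```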


Q4.

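At $x = x_k$ the observed value is $y_k$ and the fitted value is $Y_k = \bar{y} + (x_k - \bar{x})\hat{\beta}$. Thus¹

$$\begin{aligned}
\sum_{k=1}^{n} (y_k - Y_k)^2 &= \sum_{k=1}^{n} \{y_k - \bar{y} - \hat{\beta}(x_k - \bar{x})\}^2 \quad \text{on substituting for } Y_k, \\
&= \sum_{k=1}^{n} \{(y_k - \bar{y})^2 - 2\hat{\beta}(x_k - \bar{x})(y_k - \bar{y}) + \hat{\beta}^2 (x_k - \bar{x})^2\} \\
&= \sum_{k=1}^{n} (y_k - \bar{y})^2 - 2\hat{\beta}\sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y}) + \hat{\beta}^2 \sum_{k=1}^{n} (x_k - \bar{x})^2.
\end{aligned}$$

As $\hat{\beta} = \dfrac{s_{XY}}{s_X^2}$, then

$$\sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y}) = (n-1)s_{XY} = \hat{\beta}(n-1)s_X^2 = \hat{\beta}\sum_{k=1}^{n} (x_k - \bar{x})^2.$$

Thus

$$\begin{aligned}
\sum_{k=1}^{n} (y_k - Y_k)^2 &= \sum_{k=1}^{n} (y_k - \bar{y})^2 - 2\hat{\beta}^2 \sum_{k=1}^{n} (x_k - \bar{x})^2 + \hat{\beta}^2 \sum_{k=1}^{n} (x_k - \bar{x})^2 \\
&= \sum_{k=1}^{n} (y_k - \bar{y})^2 - \hat{\beta}^2 \sum_{k=1}^{n} (x_k - \bar{x})^2 \quad \text{as required.}
\end{aligned}$$

$$\begin{aligned}
\sum_{k=1}^{n} (y_k - Y_k)^2 &= \sum_{k=1}^{n} (y_k - \bar{y})^2 - \hat{\beta}^2 \sum_{k=1}^{n} (x_k - \bar{x})^2 \\
&= \sum_{k=1}^{n} (y_k - \bar{y})^2 - \left(\frac{s_{XY}}{s_X^2}\right)^2 (n-1)s_X^2 \quad \text{as } \hat{\beta} = \frac{s_{XY}}{s_X^2}, \\
&= \sum_{k=1}^{n} (y_k - \bar{y})^2 - (n-1)\left(\frac{s_{XY}^2}{s_X^2}\right) \\
&= \sum_{k=1}^{n} (y_k - \bar{y})^2 - (n-1)s_Y^2\left(\frac{s_{XY}^2}{s_X^2 s_Y^2}\right) \\
&= \sum_{k=1}^{n} (y_k - \bar{y})^2 - r_{XY}^2 \sum_{k=1}^{n} (y_k - \bar{y})^2 \quad \text{as } r_{XY} = \frac{s_{XY}}{s_X s_Y}, \\
&= (1 - r_{XY}^2)\sum_{k=1}^{n} (y_k - \bar{y})^2 \quad \text{as required.} \qquad (\star)
\end{aligned}$$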

Q5.

Since the sums of squared terms in (⋆) above can never be negative it follows that

$$\sum_{k=1}^{n} (y_k - Y_k)^2 \ge 0 \quad \text{and} \quad \sum_{k=1}^{n} (y_k - \bar{y})^2 > 0.$$

Hence $(1 - r_{XY}^2) \ge 0$, giving $r_{XY}^2 \le 1$ so that $-1 \le r_{XY} \le 1$ as required.

¹ Notice that the proof breaks down if all the $x_k$ values are the same. In this case $s_{XY} = 0$ and $s_X^2 = 0$ so $\hat{\beta}$ and $r_{XY}$ are not defined. Similarly, if all the $y_k$ are the same, then $s_{XY} = 0$ and $s_Y^2 = 0$ so $\hat{\beta} = 0$ and $r_{XY}$ is not defined.
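The identity (⋆) is easy to check numerically. The sketch below borrows the weight and fuel data from Q3 purely as a convenient example; any data set with non-constant x and y values would do.

```r
# Example data (weight and fuel consumption from Q3).
x <- c(3400, 3800, 4100, 2200, 2600, 2900, 2000)
y <- c(5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 3.0)

beta_hat <- cov(x, y) / var(x)
fitted_y <- mean(y) + (x - mean(x)) * beta_hat   # fitted values Y_k

lhs <- sum((y - fitted_y)^2)                     # residual sum of squares
rhs <- (1 - cor(x, y)^2) * sum((y - mean(y))^2)  # right-hand side of the identity

print(c(lhs = lhs, rhs = rhs, r = cor(x, y)))
```

Both sides agree up to floating-point error, and since the left-hand side is a sum of squares the correlation must lie between −1 and 1.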


Solutions IV

Q1.

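Require line $y = \beta x$. Estimate slope $\beta$ by minimizing the sum of squared deviations

$$S = \sum_{k=1}^{n} (y_k - \beta x_k)^2.$$

$$\frac{dS}{d\beta} = -2\sum_{k=1}^{n} x_k (y_k - \beta x_k) \quad \text{so} \quad \frac{dS}{d\beta} = 0 \;\Rightarrow\; \sum_{k=1}^{n} x_k y_k - \beta \sum_{k=1}^{n} x_k^2 = 0 \quad \text{so} \quad \hat{\beta} = \sum_{k=1}^{n} x_k y_k \Big/ \sum_{k=1}^{n} x_k^2.$$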

Q2.

(a)

[Figure: scatter plot of road distance (miles, 0 to 30) against straight-line distance (miles, 0 to 25).]

A straight line through the origin seems to give a reasonable model for these data. Slope of line is approximately 30/25 = 1.2 and provides a rough check on answer to (b) below.

(b) Answer (iii). Slope = 1.276.

x_k       9.5     9.8     5.0    19.0    23.0    14.6    15.2    8.3    11.4    21.6
y_k      10.7    11.7     6.5    25.6    29.4    16.3    17.2    9.5    18.4    28.8
x_k^2    90.25   96.04   25.00  361.00  529.00  213.16  231.04  68.89  129.96  466.56
x_k y_k 101.65  114.66   32.50  486.40  676.20  237.98  261.44  78.85  209.76  622.08

$$\sum_{k=1}^{n} x_k y_k = 2821.52 \quad \text{and} \quad \sum_{k=1}^{n} x_k^2 = 2210.90 \;\Rightarrow\; \hat{\beta} = \frac{\sum x_k y_k}{\sum x_k^2} = \frac{2821.52}{2210.90} = 1.27619 \approx 1.276.$$

(c) If $x = 10$ miles, predict $y = 10\hat{\beta} = 10 \times 1.276 = 12.76 \approx 12.8$ miles.
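The slope through the origin can be reproduced in R, either from the formula in Q1 or with `lm` and the intercept removed. A short sketch using the ten data pairs from the table:

```r
x <- c(9.5, 9.8, 5.0, 19.0, 23.0, 14.6, 15.2, 8.3, 11.4, 21.6)     # straight-line miles
y <- c(10.7, 11.7, 6.5, 25.6, 29.4, 16.3, 17.2, 9.5, 18.4, 28.8)   # road miles

# Least-squares slope for the model y = beta * x.
beta_hat <- sum(x * y) / sum(x^2)

# Same fit via lm: "0 +" removes the intercept term.
fit <- lm(y ~ 0 + x)

# Predicted road distance for a straight-line distance of 10 miles.
prediction <- 10 * beta_hat
print(c(beta_hat = beta_hat, lm_slope = unname(coef(fit)), prediction = prediction))
```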

Q3.

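$n = 7$, $s_X^2 = 636666.7$, $(n-1)s_X^2 = 3820000$, $s_{XY} = 1088.33$, $s_Y^2 = 1.892381$, $\hat{\beta} = 0.001709$.

$$\text{Residual sum of squares (SS)} = (n-1)\left(s_Y^2 - \frac{s_{XY}^2}{s_X^2}\right) = 0.191746, \qquad \hat{\sigma}^2 = \frac{\text{Residual SS}}{n-2} = \frac{0.191746}{5} = 0.038349.$$

To test whether the slope $\beta$ equals zero, use $t = \hat{\beta}\big/\sqrt{\mathrm{Var}[\hat{\beta}]}$, where $t \sim t_5$ if $\beta = 0$. Here

$$\sqrt{\mathrm{Var}[\hat{\beta}]} \approx \sqrt{\frac{\hat{\sigma}^2}{(n-1)s_X^2}} = \sqrt{\frac{0.0383}{3820000}} = 0.000100195 \;\Rightarrow\; t = \frac{0.001709}{0.000100195} = 17.1.$$

Since $t_5(0.5\%) = 4.032$, reject the hypothesis that $\beta = 0$ at the 1% level.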

> weight=c(3400,3800,4100,2200,2600,2900,2000)  # input weights
> fuel=c(5.5,5.9,6.5,3.3,3.6,4.6,3.0)           # input fuel consumptions
> summary(lm(fuel~weight))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.4997008  0.3095645  -1.614    0.167
weight       0.0017094  0.0001002  17.061 1.27e-05 ***

Residual standard error: 0.1958 on 5 degrees of freedom
Multiple R-Squared: 0.9831, Adjusted R-squared: 0.9797
F-statistic: 291.1 on 1 and 5 DF, p-value: 1.266e-05

> anova(lm(fuel~weight))

Analysis of Variance Table
Response: fuel

          Df  Sum Sq Mean Sq F value    Pr(>F)
weight     1 11.1625 11.1625  291.08 1.266e-05 ***
Residuals  5  0.1917  0.0383

Above shows β̂ = 0.0017094 with estimated standard error 0.0001002, and the t-value to test whether β = 0 is t = 17.061. The residual sum of squares equals 0.1917, so σ̂² = 0.0383 and σ̂ = 0.1958.
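The standard error and t-value reported by R can also be rebuilt step by step from the formulas in this solution. A sketch, with illustrative variable names:

```r
weight <- c(3400, 3800, 4100, 2200, 2600, 2900, 2000)
fuel   <- c(5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 3.0)
n      <- length(weight)

beta_hat   <- cov(weight, fuel) / var(weight)
rss        <- (n - 1) * (var(fuel) - cov(weight, fuel)^2 / var(weight))
sigma2_hat <- rss / (n - 2)                                  # residual mean square
se_beta    <- sqrt(sigma2_hat / ((n - 1) * var(weight)))     # estimated standard error of slope
t_stat     <- beta_hat / se_beta

# Two-sided 1% critical value from the t distribution on n - 2 degrees of freedom.
t_crit <- qt(0.995, df = n - 2)
print(c(se_beta = se_beta, t_stat = t_stat, t_crit = t_crit))
```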

Q4.

List possible outcomes of this experiment in a table.

Original outcomes                       HHH  HHT  HTH  HTT  THH  THT  TTH  TTT
Number X of heads showing                3    2    2    1    2    1    1    0
Result of turning over first x coins    TTT  TTT  THH  TTT  HTH  HHT  HTH  TTT
Number Y of heads now showing            0    0    2    0    2    2    2    0

(a) Eight equally likely outcomes in total, so each has probability 1/8. Joint probability function p(x, y) found by adding up the probabilities for the different outcomes, so, for example,

p(3, 0) ≡ pr{X = 3 and Y = 0} = 1/8,   p(2, 2) ≡ pr{X = 2 and Y = 2} = 2/8 = 1/4.

Display joint probability function p(x, y) in a table.

                                      Y                Marginal probability
                               0     1     2     3     of X, pX(x)

        0                     1/8    0     0     0     1/8
X       1                     1/8    0    1/4    0     3/8
        2                     1/8    0    1/4    0     3/8
        3                     1/8    0     0     0     1/8

Marginal probs. for Y, pY(y)  1/2    0    1/2    0     Total = 1

(b) Marginal probabilities of X and Y found by taking row and column sums respectively of p(x, y), i.e.,

$$p_X(x) = \sum_y p(x, y) \quad \text{and} \quad p_Y(y) = \sum_x p(x, y).$$

Probabilities $p_X(x)$ could also be obtained by noting that $X \sim \mathrm{Bin}(n = 3, \pi = \tfrac{1}{2})$.

X        0    1    2    3          Y        0    1    2    3
pX(x)   1/8  3/8  3/8  1/8         pY(y)   1/2   0   1/2   0


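(c)

$$E[X] = \sum_x x\, \mathrm{pr}\{X = x\} = 0 \times (1/8) + 1 \times (3/8) + 2 \times (3/8) + 3 \times (1/8) = 1.5.$$

$$\mathrm{Var}[X] = E[X^2] - \{E[X]\}^2 = \sum_x x^2\, \mathrm{pr}\{X = x\} - \{E[X]\}^2 = 3 - 1.5^2 = 0.75.$$

Mean and variance of $X$ could be obtained by noting that $X \sim \mathrm{Bin}(n = 3, \pi = \tfrac{1}{2})$.

$$E[Y] = \sum_y y\, \mathrm{pr}\{Y = y\} = 0 \cdot \tfrac{1}{2} + 2 \cdot \tfrac{1}{2} = 1.$$

$$\mathrm{Var}[Y] = E[Y^2] - \{E[Y]\}^2 = \sum_y y^2\, \mathrm{pr}\{Y = y\} - \{E[Y]\}^2 = 0^2 \cdot \tfrac{1}{2} + 2^2 \cdot \tfrac{1}{2} - 1^2 = 1.$$

(d) Answer (iii). Correlation = 0.000.

$$\mathrm{cov}(X, Y) = \sum_x \sum_y xy\, p(x, y) - E[X]E[Y] = 1.5 - 1.5 = 0 \;\Rightarrow\; \mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{Var}[X]\mathrm{Var}[Y]}} = 0.$$

$X$ and $Y$ are uncorrelated but they are not independent (since, for example, $p(0, 2) \ne p_X(0)\,p_Y(2)$).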
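The table of outcomes can be generated by brute-force enumeration. The following sketch codes heads as 1 and tails as 0 (a choice made here for convenience), turns over the first x coins in each outcome, and then rebuilds the joint distribution and the covariance.

```r
# All 8 equally likely outcomes for three coins (1 = head, 0 = tail).
outcomes <- as.matrix(expand.grid(c1 = 0:1, c2 = 0:1, c3 = 0:1))
x <- rowSums(outcomes)                       # heads showing originally

# Turn over the first x coins of each outcome and count heads now showing.
y <- sapply(seq_len(nrow(outcomes)), function(i) {
  coins <- outcomes[i, ]
  k <- x[i]
  if (k > 0) coins[seq_len(k)] <- 1 - coins[seq_len(k)]
  sum(coins)
})

# Joint probability function p(x, y) and its marginals.
joint <- table(X = factor(x, levels = 0:3), Y = factor(y, levels = 0:3)) / 8
p_x <- rowSums(joint)
p_y <- colSums(joint)

e_x    <- sum(0:3 * p_x)
e_y    <- sum(0:3 * p_y)
e_xy   <- sum(outer(0:3, 0:3) * joint)
cov_xy <- e_xy - e_x * e_y
print(joint)
print(c(E_X = e_x, E_Y = e_y, cov = cov_xy))
```

The covariance is zero even though, for instance, `joint["0", "2"]` differs from `p_x["0"] * p_y["2"]`, which is the dependence noted in part (d).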

Q5.

Let $X$ denote score of regular die. Thus $\mathrm{pr}\{X = x\} = 1/6$ for $x = 1, 2, \ldots, 6$.

(a) If $X = x$, then toss $x$ unbiased coins and observe $Y$ heads. Thus, conditional on $X = x$, $Y \sim \mathrm{Bin}(x, \tfrac{1}{2})$, so that,

$$\mathrm{pr}\{Y = y \mid X = x\} = \binom{x}{y}\left(\tfrac{1}{2}\right)^y \left(\tfrac{1}{2}\right)^{x-y} = \binom{x}{y}\left(\tfrac{1}{2}\right)^x, \quad \text{for } y = 0, 1, \ldots, x, \; x = 1, 2, 3, 4, 5, 6.$$

(b) Joint probabilities $p(x, y)$ are given by,

$$\mathrm{pr}\{X = x, Y = y\} = \mathrm{pr}\{X = x\}\, \mathrm{pr}\{Y = y \mid X = x\} = \frac{1}{6}\binom{x}{y}\left(\tfrac{1}{2}\right)^x, \quad (y = 0, 1, \ldots, x; \; x = 1, 2, \ldots, 6).$$

Display joint probabilities in a table.

                                          Y                                        Marginal probability
              0        1        2        3        4        5        6              for X

     1       1/12     1/12      0        0        0        0        0              1/6
     2       1/24     2/24     1/24      0        0        0        0              1/6
X    3       1/48     3/48     3/48     1/48      0        0        0              1/6
     4       1/96     4/96     6/96     4/96     1/96      0        0              1/6
     5       1/192    5/192   10/192   10/192    5/192    1/192     0              1/6
     6       1/384    6/384   15/384   20/384   15/384    6/384    1/384           1/6

Marginal    63/384  120/384   99/384   64/384   29/384    8/384    1/384           Total = 1
prob. for Y

(c) Marginal probabilities for $Y$ are found by taking column sums above.

Y         0        1        2        3        4        5        6
pY(y)   63/384  120/384   99/384   64/384   29/384    8/384    1/384

(d) Obtain conditional probabilities for $X$ given $Y = 3$ using the definition,

$$\mathrm{pr}\{X = x \mid Y = 3\} = \frac{\mathrm{pr}\{X = x, Y = 3\}}{\mathrm{pr}\{Y = 3\}} = \frac{\mathrm{pr}\{X = x, Y = 3\}}{(64/384)}.$$

x                    1    2    3     4     5     6
pr{X = x | Y = 3}    0    0   2/16  4/16  5/16  5/16
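The whole table can be generated from the binomial probabilities in part (a). A minimal sketch using `dbinom` and `outer`; the object names are illustrative.

```r
x <- 1:6   # die score
y <- 0:6   # number of heads

# p(x, y) = (1/6) * Bin(x, 1/2) probability of y heads; dbinom gives 0 when y > x.
joint <- outer(x, y, function(x, y) dbinom(y, size = x, prob = 0.5) / 6)
dimnames(joint) <- list(X = as.character(x), Y = as.character(y))

p_y <- colSums(joint)                    # marginal distribution of Y
cond_x_given_3 <- joint[, "3"] / p_y["3"]  # pr{X = x | Y = 3}

print(round(p_y * 384))          # numerators over 384
print(cond_x_given_3 * 16)       # numerators over 16
```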


Solutions V

Q1.

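Recall

$$E[a_1 X_1 + a_2 X_2] = a_1 E[X_1] + a_2 E[X_2],$$

$$\mathrm{Var}[a_1 X_1 + a_2 X_2] = a_1^2 \mathrm{Var}[X_1] + 2 a_1 a_2\, \mathrm{cov}(X_1, X_2) + a_2^2 \mathrm{Var}[X_2].$$

Also $\mathrm{Var}[X_1] = \sigma^2$, $\mathrm{Var}[X_2] = \sigma^2$, $\mathrm{cov}(X_1, X_2) = \rho\, \mathrm{Stdev}[X_1]\, \mathrm{Stdev}[X_2] = \rho\sigma^2$.

(a) For $Y = X_1 - 3X_2$, put $a_1 = 1$ and $a_2 = -3$.

$$E[X_1 - 3X_2] = E[X_1] - 3E[X_2] = \mu_1 - 3\mu_2.$$

$$\mathrm{Var}[X_1 - 3X_2] = \mathrm{Var}[X_1] - 6\,\mathrm{cov}(X_1, X_2) + 9\,\mathrm{Var}[X_2] = \sigma^2 - 6\rho\sigma^2 + 9\sigma^2 = (10 - 6\rho)\sigma^2.$$

(b)

$$\begin{aligned}
\mathrm{cov}(Y, Z) = \mathrm{cov}(X_1 - 3X_2,\; gX_1 + X_2) &= g\,\mathrm{Var}[X_1] + (1 - 3g)\,\mathrm{cov}(X_1, X_2) - 3\,\mathrm{Var}[X_2] \\
&= g\sigma^2 + (1 - 3g)\rho\sigma^2 - 3\sigma^2 = \{g + (1 - 3g)\rho - 3\}\sigma^2.
\end{aligned}$$

(c) $\mathrm{corr}(Y, Z) = 0$ only if $\mathrm{cov}(Y, Z) = 0$. This occurs if $g + (1 - 3g)\rho - 3 = 0$ so that

$$g = \frac{3 - \rho}{1 - 3\rho}.$$

(d) Case $g = 7$, $\rho = 0.2$.

$$\mathrm{cov}(X_1, Y) = \mathrm{cov}(X_1, X_1 - 3X_2) = \mathrm{Var}[X_1] - 3\,\mathrm{cov}(X_1, X_2) = (1 - 3\rho)\sigma^2 = 0.4\sigma^2.$$

Also $\mathrm{Var}[Y] = (10 - 6\rho)\sigma^2 = 8.8\sigma^2$. Hence

$$\mathrm{corr}(X_1, Y) = \frac{\mathrm{cov}(X_1, Y)}{\sqrt{\mathrm{Var}[X_1]\mathrm{Var}[Y]}} = \frac{0.4\sigma^2}{\sqrt{\sigma^2 \times 8.8\sigma^2}} = \frac{0.4}{\sqrt{8.8}} = 0.135.$$

Similarly

$$\mathrm{cov}(X_1, Z) = \mathrm{cov}(X_1, 7X_1 + X_2) = 7\,\mathrm{Var}[X_1] + \mathrm{cov}(X_1, X_2) = (7 + \rho)\sigma^2 = 7.2\sigma^2,$$

$$\mathrm{Var}[Z] = \mathrm{Var}[7X_1 + X_2] = 49\,\mathrm{Var}[X_1] + 14\,\mathrm{cov}(X_1, X_2) + \mathrm{Var}[X_2] = (50 + 14\rho)\sigma^2 = 52.8\sigma^2.$$

Hence

$$\mathrm{corr}(X_1, Z) = \frac{\mathrm{cov}(X_1, Z)}{\sqrt{\mathrm{Var}[X_1]\mathrm{Var}[Z]}} = \frac{7.2\sigma^2}{\sqrt{\sigma^2 \times 52.8\sigma^2}} = \frac{7.2}{\sqrt{52.8}} = 0.991.$$

Notice that even though $\mathrm{corr}(X_1, Y) \ne 0$ and $\mathrm{corr}(X_1, Z) \ne 0$ we can have $\mathrm{corr}(Y, Z) = 0$.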
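These covariance calculations are all instances of $\mathrm{cov}(a^T X, b^T X) = a^T \Sigma b$, which gives a quick numerical check. In this sketch $\sigma^2 = 1$ is an arbitrary choice (it cancels in every correlation), and `cov_lin` and `corr_lin` are hypothetical helper names.

```r
rho    <- 0.2
sigma2 <- 1                                   # arbitrary; cancels in correlations
Sigma  <- sigma2 * matrix(c(1, rho, rho, 1), nrow = 2)

g <- (3 - rho) / (1 - 3 * rho)                # value of g from part (c)

# Coefficient vectors of X1, Y = X1 - 3*X2 and Z = g*X1 + X2.
a_x1 <- c(1, 0)
a_y  <- c(1, -3)
a_z  <- c(g, 1)

cov_lin  <- function(a, b) drop(t(a) %*% Sigma %*% b)
corr_lin <- function(a, b) cov_lin(a, b) / sqrt(cov_lin(a, a) * cov_lin(b, b))

print(c(g = g,
        cov_YZ   = cov_lin(a_y, a_z),
        corr_X1Y = corr_lin(a_x1, a_y),
        corr_X1Z = corr_lin(a_x1, a_z)))
```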

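Q2.

Let component lengths be $L_A$, $L_B$, $L_C$, $L_D$. Total junction length is $L = L_A + L_B + L_C + L_D$.

$$E[L] = E[L_A + L_B + L_C + L_D] = E[L_A] + E[L_B] + E[L_C] + E[L_D] = 5.7 + 10.8 + 10.8 + 6.3 = 33.6 \text{ cms.}$$

$$\begin{aligned}
\mathrm{Var}[L] = \mathrm{Var}[L_A + L_B + L_C + L_D] &= \mathrm{Var}[L_A] + \mathrm{Var}[L_B] + \mathrm{Var}[L_C] + \mathrm{Var}[L_D] + 2\,\mathrm{cov}(L_B, L_C) \\
&= (0.0056)^2 + (0.0180)^2 + (0.0180)^2 + (0.0092)^2 + 2(0.75)(0.0180)^2 = 12.5 \times 10^{-4},
\end{aligned}$$

where $\mathrm{cov}(L_B, L_C) = \mathrm{corr}(L_B, L_C) \times \mathrm{Stdev}(L_B) \times \mathrm{Stdev}(L_C) = 0.75 \times 0.0180 \times 0.0180 = 0.000243$.

Hence $\mathrm{Stdev}[L] = \sqrt{12.5 \times 10^{-4}} = 0.035$ cms.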
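The variance of a sum is the sum of all entries of the covariance matrix, which gives a compact check. A sketch assuming, as in the solution, that only components B and C are correlated:

```r
# Standard deviations of the four component lengths (cms).
sds <- c(A = 0.0056, B = 0.0180, C = 0.0180, D = 0.0092)

# Correlation matrix: only B and C are correlated (0.75).
corr_mat <- diag(4)
corr_mat[2, 3] <- corr_mat[3, 2] <- 0.75

# Covariance matrix, then variance and standard deviation of the total length.
Sigma <- diag(sds) %*% corr_mat %*% diag(sds)
var_L <- sum(Sigma)
sd_L  <- sqrt(var_L)
print(c(var_L = var_L, sd_L = sd_L))
```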


Q3.

Answer (ii). Probability = 0.421.

Piston diameter $P \sim N(\text{mean} = 10.42, \text{variance} = 0.0009)$.
Cylinder diameter $C \sim N(\text{mean} = 10.43, \text{variance} = 0.0016)$.
The gap between cylinder and piston is $\tfrac{1}{2}(C - P)$. Piston does not fit into the cylinder if $C - P < 0$.

$$E[C - P] = 10.43 - 10.42 = 0.01.$$

Since cylinder and piston diameter can be regarded as independent, having been chosen at random,

$$\mathrm{Var}[C - P] = \mathrm{Var}[C] + \mathrm{Var}[P] = 0.0016 + 0.0009 = 0.0025.$$

Hence $C - P \sim N(\text{mean} = 0.01, \text{variance} = 0.0025)$. Required probability is,

$$\mathrm{pr}\{C - P < 0\} = \mathrm{pr}\left\{Z < \frac{0 - 0.01}{\sqrt{0.0025}}\right\} = \Phi(-0.2) = 1 - \Phi(0.2) = 0.4207 \approx 0.421,$$

where $Z \sim N(0, 1)$ and $\Phi(z)$ is the standard normal distribution function.
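In R the normal probability can be read off with `pnorm` rather than tables. A short sketch under the same independence assumption:

```r
mean_gap <- 10.43 - 10.42           # E[C - P]
sd_gap   <- sqrt(0.0016 + 0.0009)   # Stdev[C - P], assuming independence

# Probability that the piston does not fit, pr{C - P < 0}.
p_no_fit <- pnorm(0, mean = mean_gap, sd = sd_gap)
print(p_no_fit)
```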

Q4.

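Assume data from two independent normal distributions, variances known.
Group A: $\bar{x}_A = 6.00$ gms., $\sigma_A^2 = 0.50$, $n_A = 8$.
Group B: $\bar{x}_B = 4.22$ gms., $\sigma_B^2 = 0.80$, $n_B = 8$.
To test $H_0: \mu_A = \mu_B$ vs. $H_1: \mu_A \ne \mu_B$ at 1% level, reject $H_0$ if

$$\left|\frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{\sigma_A^2}{n_A} + \dfrac{\sigma_B^2}{n_B}}}\right| \ge z_{0.005}.$$

Here

$$\left|\frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{\sigma_A^2}{n_A} + \dfrac{\sigma_B^2}{n_B}}}\right| = \frac{6.00 - 4.22}{\sqrt{\dfrac{0.50}{8} + \dfrac{0.80}{8}}} = \frac{1.78}{\sqrt{0.1625}} = 4.42, \qquad z_{0.005} = 2.576.$$

Hence reject $H_0$ at 1% level. The mean lung weights for the two groups are significantly different.
Whether the difference is due to accumulated dust particles or a physiological change in the lung size is another question.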
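The test statistic and critical value can be reproduced from the summary figures. A sketch with illustrative names; the p-value line is an addition not computed in the solution.

```r
xbar_a <- 6.00; var_a <- 0.50; n_a <- 8
xbar_b <- 4.22; var_b <- 0.80; n_b <- 8

# Two-sample z statistic with known variances.
z_stat <- (xbar_a - xbar_b) / sqrt(var_a / n_a + var_b / n_b)

z_crit  <- qnorm(0.995)             # two-sided 1% critical value
p_value <- 2 * pnorm(-abs(z_stat))  # two-sided p-value
print(c(z_stat = z_stat, z_crit = z_crit, p_value = p_value))
```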

Worried about animal experimentation? To make a start on examining any benefits and ethics of animal research you could refer to the following:
Editorial. Ban chimp testing. Scientific American, October 2011, p.6.
Goldberg, A.M. and Hartung, T. Protecting more than animals. Scientific American, January 2006.
Goldberg, A.M. and Frazier, J.M. Alternatives to animals in toxicity testing. Scientific American, August 1989, p.16-22.
Rowan, A.N. The benefits and ethics of animal research. Scientific American, February 1997, p.63.
Barnard, N.D. and Kaufman, S.R. Animal research is wasteful and misleading. Scientific American, February 1997, p.64-66.
Botting, J.H. and Morrison, A.R. Animal research is vital to medicine. Scientific American, February 1997, p.67-69.

Q5.

Data from two independent normal distributions with unknown variances.


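Oil meal data: $\bar{x}_1 = 247.9$ gms., $s_1^2 = 2925.8$, $n_1 = 14$.

Meat meal data: $\bar{x}_2 = 275.5$ gms., $s_2^2 = 4087.3$, $n_2 = 11$.

You need more than one decimal place accuracy in $\bar{x}_1$ and $\bar{x}_2$ to get $s_1^2$ and $s_2^2$ correct!

Assume $\sigma_1^2 = \sigma_2^2 = \sigma^2$ (unknown). Estimate $\sigma^2$ using

$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{13 s_1^2 + 10 s_2^2}{23} = 3430.8.$$

To test $H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 \ne \mu_2$ at 5% level, reject $H_0$ if

$$\left|\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}\right| \ge t_{23}(2.5\%).$$

Here

$$\left|\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}\right| = \left|\frac{247.9 - 275.5}{\sqrt{3430.8\left(\dfrac{1}{14} + \dfrac{1}{11}\right)}}\right| = 1.17$$

and $t_{23}(2.5\%) = 2.069$ so accept $H_0$ at 5% level.
The diet supplements do not have a significantly different effect at 5% level.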

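What about a confidence interval for $\mu_1 - \mu_2$?
Recall that

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim t_{23}$$

so that

$$\Pr\left\{-t_{23}(2.5\%) < \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \le t_{23}(2.5\%)\right\} = 0.95.$$

Re-arranging this gives a 95% confidence interval for $\mu_1 - \mu_2$ of the form

$$\left(\bar{x}_1 - \bar{x}_2 - t_{23}(2.5\%)\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} < \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + t_{23}(2.5\%)\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}\right).$$

Since

$$\bar{x}_1 - \bar{x}_2 = 247.9 - 275.5 = -27.6, \qquad \sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = 23.60, \qquad t_{23}(2.5\%) = 2.069$$

the 95% confidence interval for $\mu_1 - \mu_2$ is

$$(-27.6 - 2.069 \times 23.60, \; -27.6 + 2.069 \times 23.60) = (-76.4, 21.2).$$

The confidence interval includes zero (where $\mu_1 - \mu_2 = 0$ so $\mu_1 = \mu_2$) so the diet supplements do not have a significantly different effect at 5% level.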
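The pooled-variance test and the confidence interval can both be reproduced from the summary statistics alone. A sketch with illustrative variable names (the raw weights are not needed):

```r
n1 <- 14; xbar1 <- 247.9; s2_1 <- 2925.8   # oil meal
n2 <- 11; xbar2 <- 275.5; s2_2 <- 4087.3   # meat meal

df        <- n1 + n2 - 2
s2_pooled <- ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / df
se_diff   <- sqrt(s2_pooled * (1 / n1 + 1 / n2))

t_stat <- (xbar1 - xbar2) / se_diff
t_crit <- qt(0.975, df = df)               # two-sided 5% critical value

# 95% confidence interval for mu1 - mu2.
ci <- (xbar1 - xbar2) + c(-1, 1) * t_crit * se_diff
print(c(s2_pooled = s2_pooled, se_diff = se_diff, t_stat = t_stat, t_crit = t_crit))
print(ci)
```

With the raw data available, `t.test(x1, x2, var.equal = TRUE)` would perform the same pooled analysis.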


Solutions VI

Q1.

Let $\pi$ denote the population proportion supporting party A.
Approximate 95% confidence interval for $\pi$ is

$$p \pm 1.96\sqrt{\frac{p(1 - p)}{n}}.$$

Here $x = 500$, $n = 1000$, so $p = \dfrac{x}{n} = 0.5$ and $1.96\sqrt{\dfrac{p(1 - p)}{n}} = 1.96 \times 0.0158 = 0.03099$.

95% confidence interval for $\pi$ is $0.5 \pm 0.03099 = (0.469, 0.531)$.

Notice that even with 1000 people asked the 95% confidence interval has width 0.062.
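The interval is quick to compute directly. A sketch that uses `qnorm(0.975)` in place of the rounded value 1.96:

```r
x <- 500
n <- 1000
p_hat <- x / n

# Half-width of the approximate 95% confidence interval.
half_width <- qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)
ci <- p_hat + c(-1, 1) * half_width
print(c(half_width = half_width, lower = ci[1], upper = ci[2]))
```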

Q2.

This is a two independent sample problem.

        Number       Number of houses      Proportion of houses   Population
Year    in survey    with adult at home    with adult at home     proportion

1971    n1 = 500     x1 = 296              p1 = 0.592             π1
1976    n2 = 1000    x2 = 463              p2 = 0.463             π2

5% level test of $H_0: \pi_1 = \pi_2 (= \pi)$ vs. $H_1: \pi_1 \ne \pi_2$ rejects $H_0$ if

$$\frac{|p_1 - p_2|}{\sqrt{\pi(1 - \pi)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \ge 1.96.$$

Estimate $\pi$ by

$$\hat{\pi} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{759}{1500} = 0.506.$$

Then

$$\frac{|p_1 - p_2|}{\sqrt{\hat{\pi}(1 - \hat{\pi})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{|0.592 - 0.463|}{\sqrt{0.24996\left(\dfrac{1}{500} + \dfrac{1}{1000}\right)}} = \frac{0.129}{0.02738} = 4.71.$$

Hence, reject $H_0$ at 5% level.
Also reject $H_0$ at 1% level (on comparison with critical value 2.576).

There is significant evidence of a shift in the proportion of homes with at least one adult at home during the working day.
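The pooled two-proportion statistic can be computed as follows. This is a sketch of the hand calculation with illustrative names, not a call to a packaged test.

```r
x1 <- 296; n1 <- 500    # 1971 survey
x2 <- 463; n2 <- 1000   # 1976 survey

p1 <- x1 / n1
p2 <- x2 / n2
pi_hat <- (x1 + x2) / (n1 + n2)   # pooled estimate under H0

z_stat <- abs(p1 - p2) / sqrt(pi_hat * (1 - pi_hat) * (1 / n1 + 1 / n2))
print(c(pi_hat = pi_hat, z_stat = z_stat,
        crit_5pc = qnorm(0.975), crit_1pc = qnorm(0.995)))
```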

Q3.

Total number of groups is 1486 + 694 + 195 + 37 + 10 + 1 = 2423.
With $\mu$ estimated by 0.8925, fitted model is

$$p_r = 0.6939\,\frac{0.8925^r}{r!}.$$

Expected frequency for group of size $r$ is thus

$$2423\,p_r = 1681.2\,\frac{0.8925^r}{r!}.$$

Size of group         1        2       3      4      5     ≥ 6

Observed frequency    1486     694     195    37     10    1
Expected frequency    1500.5   669.6   199.2  44.4   7.9   1.3


Combine groups “5” and “≥ 6” to make the expected frequency 9.3 and the corresponding observed frequency 11. There are now 5 groups.
Number of degrees of freedom = 5 groups – 1 constraint – 1 estimated parameter µ = 3.

$$\chi^2_{\mathrm{obs}} = \frac{(1486 - 1500.5)^2}{1500.5} + \frac{(694 - 669.6)^2}{669.6} + \cdots + \frac{(37 - 44.4)^2}{44.4} + \frac{(11 - 9.3)^2}{9.3} = 2.66.$$

Test $H_0$: probability model is a good fit vs. $H_1$: probability model is not a good fit.
5% level test is reject $H_0$ if $\chi^2_{\mathrm{obs}} \ge \chi^2_3(5\%)$.

From tables, $\chi^2_3(5\%) = 7.815$, so do not reject $H_0$ at 5% level.

Also do not reject $H_0$ at 10% level since $\chi^2_3(10\%) = 6.251$. The given model provides a good fit to the observed data.
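The goodness-of-fit calculation can be sketched in R as below. It assumes the truncated Poisson form $p_r = \mu^r e^{-\mu}/\{r!(1 - e^{-\mu})\}$ with $\mu = 0.8925$, and works with unrounded expected frequencies, so the statistic may differ in the second decimal place from the hand calculation.

```r
mu       <- 0.8925
observed <- c(1486, 694, 195, 37, 10, 1)    # group sizes 1, 2, 3, 4, 5, >= 6
n_groups <- sum(observed)

# Truncated Poisson probabilities for sizes 1..5, remainder for size >= 6.
p <- dpois(1:5, mu) / (1 - exp(-mu))
p <- c(p, 1 - sum(p))
expected <- n_groups * p

# Combine the last two cells so that no expected frequency is very small.
obs_c <- c(observed[1:4], sum(observed[5:6]))
exp_c <- c(expected[1:4], sum(expected[5:6]))

chisq_obs <- sum((obs_c - exp_c)^2 / exp_c)
df        <- length(obs_c) - 1 - 1           # one constraint, one estimated parameter
print(round(expected, 1))
print(c(chisq_obs = chisq_obs, df = df, crit_5pc = qchisq(0.95, df)))
```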

The probability model

$$p_r = \frac{\mu^r e^{-\mu}}{r!\,(1 - e^{-\mu})}$$

for $r = 1, 2, 3, \ldots$, is known as the truncated Poisson distribution. Can you think why?

How did we obtain the estimate $\mu = 0.8925$?
If $X$ denotes the size of the casual groups, having the truncated Poisson distribution above, then it can be shown that

$$E[X] = \frac{\mu}{1 - e^{-\mu}}.$$

The sample mean for these data is $\bar{x} = \dfrac{3663}{2423} = 1.51176$. Thus we might expect that

$$1.51176 \approx \frac{\mu}{1 - e^{-\mu}}.$$

The value of $\mu$ can be found using an iterative method, so that on the $n$th iteration,

$$\mu_n \approx 1.51176\,(1 - e^{-\mu_{n-1}}).$$

For an initial estimate of $\mu$, notice that $2p_2/p_1 = \mu$ so that $\mu_0 \approx 2\left(\tfrac{694}{2423}\right)\big/\left(\tfrac{1486}{2423}\right) = 0.9341$.
A better estimate is $\mu_1 = 1.51176\,(1 - e^{-0.9341}) = 0.9177$.
Re-iterating gives $\mu_2 = 1.51176\,(1 - e^{-0.9177}) = 0.9079$.
After a number of steps this procedure converges to give $\mu = 0.8925$. A much quicker procedure would use the Newton-Raphson method. Have you met this?
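The iteration described above takes only a few lines of R. A sketch; the tolerance and the cap of 1000 iterations are arbitrary choices.

```r
xbar <- 3663 / 2423          # sample mean group size
mu   <- 2 * 694 / 1486       # initial estimate mu_0 = 2 * p2 / p1

# Fixed-point iteration mu_n = xbar * (1 - exp(-mu_{n-1})).
for (i in 1:1000) {
  mu_new <- xbar * (1 - exp(-mu))
  converged <- abs(mu_new - mu) < 1e-10
  mu <- mu_new
  if (converged) break
}
print(c(mu = mu, iterations = i))
```

The first two passes through the loop give the values $\mu_1$ and $\mu_2$ quoted above; convergence is linear, which is why a Newton-Raphson scheme would need fewer steps.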

Q4.

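Let $\pi_A$ = proportion vaccinated, $\pi_B$ = proportion recovering, $\pi_{AB}$ = proportion both vaccinated and recovering.
We want to test whether the two factors A and B are independent, $H_0: \pi_{AB} = \pi_A \pi_B$.
Estimate $\pi_A$ by

$$\hat{\pi}_A = \frac{\text{Number vaccinated}}{\text{Number in study}} = \frac{1211}{1979} = 0.612.$$

Estimate $\pi_B$ by

$$\hat{\pi}_B = \frac{\text{Number recovering}}{\text{Number in study}} = \frac{1545}{1979} = 0.781.$$

Under $H_0$, expected frequency in cell (A,B) is $1979\,\hat{\pi}_A \hat{\pi}_B = 1979 \times 0.612 \times 0.781 = 945.9$, and similarly for the other cells. (These values use the rounded estimates; the unrounded value $\frac{1211 \times 1545}{1979}$ is 945.4, which makes no practical difference to the test.) Table of observed (expected) frequencies is:

                Recoveries      Deaths         Totals

Vaccinated      1091 (945.9)    120 (265.2)    1211
Unvaccinated     454 (599.7)    314 (168.2)     768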


Number of degrees of freedom = 4 − 1 − 2 = 1 so use a continuity correction.
The degrees of freedom can also be obtained from the formula (2 − 1) × (2 − 1) for this 2 × 2 table.

$$\chi^2_{\mathrm{obs}} = \frac{(|1091 - 945.9| - \tfrac{1}{2})^2}{945.9} + \cdots + \frac{(|314 - 168.2| - \tfrac{1}{2})^2}{168.2} = 261.7.$$

5% level test is reject $H_0$ if $\chi^2_{\mathrm{obs}} \ge \chi^2_1(5\%)$. From tables, $\chi^2_1(5\%) = 3.841$, so reject $H_0$ at 5% level.

Also reject $H_0$ at 1% level since $\chi^2_1(1\%) = 6.635$.

Very strong evidence that the two factors are not independent. Is this what you expected?
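R's built-in `chisq.test` applies the same continuity correction to a 2 × 2 table. In this sketch the statistic comes out very slightly different from 261.7 because R uses unrounded expected frequencies.

```r
counts <- matrix(c(1091, 454, 120, 314), nrow = 2,
                 dimnames = list(c("Vaccinated", "Unvaccinated"),
                                 c("Recoveries", "Deaths")))

# Chi-squared test of independence with Yates' continuity correction.
test <- chisq.test(counts, correct = TRUE)
print(round(test$expected, 1))
print(c(statistic = unname(test$statistic), df = unname(test$parameter),
        crit_1pc = qchisq(0.99, df = 1)))
```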
