STATISTICAL METHODS FOR THE SOCIAL SCIENCES · 2017-05-08 · STATISTICAL METHODS FOR THE SOCIAL...

INSTRUCTOR’S MANUAL

to accompany

STATISTICAL METHODS FOR THE SOCIAL SCIENCES

Fourth Edition

Alan Agresti and Barbara Finlay

published by Pearson Education

Manual prepared by:

Jackie Miller

404 Cockins Hall

Department of Statistics

The Ohio State University

Columbus, OH 43210

Instructors: Please notify Alan Agresti of any errors in this manual or the text so they can be

corrected for future printings. Please send e-mail to [email protected].


Table of Contents

1. Introduction

2. Sampling and Measurement

3. Descriptive Statistics

4. Probability Distributions

5. Statistical Inference: Estimation

6. Statistical Inference: Significance Tests

7. Comparison of Two Groups

8. Analyzing Association Between Categorical Variables

9. Linear Regression and Correlation

10. Introduction to Multivariate Relationships

11. Multiple Regression and Correlation

12. Comparing Groups: Analysis of Variance (ANOVA) Methods

13. Combining Regression and ANOVA: Quantitative and Categorical Predictors

14. Model Building with Multiple Regression

15. Logistic Regression: Modeling Categorical Responses

16. An Introduction to Advanced Methodology


Chapter 1

1.1. (a) An individual Prius (automobile). (b) All Prius automobiles used in the EPA tests. (c) All

Prius automobiles that are or may be manufactured.

1.2. (a) All 7 million voters. (b) A statistic is the 56.5% who voted for Schwarzenegger from the

exit poll sample of size 2705; a parameter is the 55.9% who actually voted for Schwarzenegger.

1.3. (a) All students at the University of Wisconsin. (b) A statistic, since it’s calculated only for

the 100 sampled students.

1.4. A statistic, since it is based on the approximately 1200 Floridians in the sample.

1.5. (a) All adult Americans. (b) Proportion of all adult Americans who would answer definitely

or probably true. (c) The sample proportion 0.523 estimates the population proportion. (d) No, it

is a prediction of the population value but will not equal it exactly, because the sample is only a

very small subset of the population.

1.6. (a) The most common response was 2 hours per day. (b) This is a descriptive statistic because

it describes the results of a sample.

1.7. (a) A total of 85.7% said "yes, definitely" or "yes, probably." (b) In 1998, a total of 85.8%

said "yes, definitely" or "yes, probably." (c) A total of 74.4% said "yes, definitely" or "yes,

probably." The percentages of yes responses were higher for HEAVEN than for HELL.

1.8. (a) Statistics, since they’re based on a sample of 60,000 households, rather than all

households. (b) Inferential, predicting for a population using sample information.

1.9. (a)

1.10.

Race      Age  Sentence  Felony?  Prior Arrests  Prior Convictions
white     19   2         no       2              1
black     23   1         no       0              0
white     38   10        yes      8              3
Hispanic  20   2         no       1              1
white     41   5         yes      5              4

1.14. (a) A statistic is a numerical summary of the sample data, while a parameter is a numerical

summary of the population. For example, consider an exit poll of voters on election day. The

proportion voting for a particular candidate is a statistic. Once all of the votes have been counted,

the proportion of voters who voted for that candidate would be known (and is the parameter). (b)

Description deals with describing the available data (sample or population), whereas inference

deals with making predictions about a population using information in the sample. For example,


consider a sample of voters on election day. One could use descriptive statistics to describe the

voters in terms of gender, race, party, etc., and inferential statistics to predict the winner of the

election.

1.15. If you have a census, you do not need to use the information from a sample to describe the

population since you have information from the population as a whole.

1.16. (a) The descriptive part of this example is that the average age in the sample is 24.1 years.

(b) The inferential part of this example is that the sociologist estimates the average age of brides

at marriage for the population to be between 23.5 and 24.7 years. (c) The population of interest is

women in New England in the early eighteenth century.

1.17. (a) A statistic is the 45% of the sample of subjects interviewed in the UK who said yes. (b)

A parameter is the true percent of the 48 million adults in the UK who would say yes. (c) A

descriptive analysis is that the percentage of yes responses in the survey varied from 10% (in

Bulgaria) to 60% (in Luxembourg). (d) An inferential analysis is that the percentage of adults in

the UK who would say yes falls between 41% and 49%.

Chapter 2

2.1. (a) Discrete variables take a finite set of values (or possibly all nonnegative integers), and we

can enumerate them all. Continuous variables take an infinite continuum of values. (b)

Categorical variables have a scale that is a set of categories; for quantitative variables, the

measurement scale has numerical values that represent different magnitudes of the variable. (c)

Nominal variables have a scale of unordered categories, whereas ordinal variables have a scale of

ordered categories. The distinctions among types of variables are important in determining the

appropriate descriptive and inferential procedures for a statistical analysis.

2.2. (a) Quantitative (b) Categorical (c) Categorical (d) Quantitative (e) Categorical (f)

Quantitative (g) Categorical (h) Quantitative (i) Categorical

2.3. (a) Nominal (b) Nominal (c) Interval (d) Nominal (e) Nominal (f) Ordinal (g) Interval (h)

Ordinal (i) Nominal (j) Interval (k) Nominal

2.4. (a) Nominal (b) Nominal (c) Ordinal (d) Interval (e) Interval (f) Interval (g) Ordinal (h)

Interval (i) Nominal (j) Interval

2.5. (a) Interval (b) Ordinal (c) Nominal

2.6. (a) State of residence. (b) Number of siblings. (c) Social class (high, medium, low). (d)

Student status (full time, part time). (e) Number of cars owned. (f) Time (in minutes) needed to

complete an exam. (g) Number of siblings.


2.7. (a) Ordinal, since there is a sense of order to the categories. (b) Discrete. (c) These values are

statistics since they come from a sample.

2.8. Ordinal.

2.9. (b), (c), (d)

2.10. (a), (c), (e), (f)

2.11. Students numbered 10, 22, 24.

2.12. Number names 00001 to 52000. First five that are selected are 15011, 46573, 48360, 39975,

06907.

2.13. (a) Observational study (b) Experiment (c) Observational study (d) Experiment

2.14. (a) Experimental study, since the researchers are assigning subjects to treatments. (b) An

observational study could look at those who grew up in nonsmoking or smoking environments and

examine incidence of lung cancer.

2.15. (a) Sample-to-sample variability causes the results to vary. (b) The sampling error for the

Gallup poll is –2.4% for Gore, 0.1% for Bush, and 1.3% for Nader.

2.16. (a) This is a volunteer sample because viewers chose whether to call in. (b) Randomly

sample the population.

2.17. The first question is confusing in its wording. The second question has clearer wording.

2.18. (a) Skip number is k = 52,000/5 = 10,400. Randomly select one of the first 10,400 names

and then skip 10,400 names to get each of the next names. For example, if the first name picked is

01536, the other four names are 01536 + 10400 = 11936, 11936 + 10400 = 22336, 22336 +

10400 = 32736, 32736 + 10400 = 43136. (b) We could treat the pages as clusters. We would

select a random sample of pages, and then sample every name on the pages selected. Its

advantage is that it is much easier to select the sample than it is with random sampling. A

disadvantage is as follows: Suppose there are 100 "Martinez" listings in the directory, all falling

on the same page. Then with cluster sampling, either all or none of the Martinez families would

end up in the sample. If they are all sampled, certain traits which they might have in common

(perhaps, e.g., religious affiliation) might be over-represented in the sample.
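
The systematic sampling arithmetic in 2.18(a) can be sketched in a few lines of Python. The function name systematic_sample and its parameters are illustrative assumptions, not part of the text. The starting name 01536 is the example start used in part (a); in practice it would be drawn at random from 1 to k.

```python
def systematic_sample(population_size, sample_size, start):
    """Return the positions chosen by a systematic sample.

    The skip number is k = population_size // sample_size; `start` is
    assumed to have been picked at random from 1..k.
    """
    k = population_size // sample_size
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and the skip number k")
    return [start + i * k for i in range(sample_size)]


# 52,000 names, sample of 5, first name picked is 01536 (as in part (a)).
picked = systematic_sample(52000, 5, 1536)
print([f"{p:05d}" for p in picked])
```

Each selected position is the previous one plus the skip number, which is why two adjacent names can never both be chosen.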

2.19. Draw a systematic sample from the student directory, using skip number k = 5000/100 = 50.

2.20. (a) This is not a simple random sample since the sample will necessarily have 40 women

and 40 men. A simple random sample may or may not have exactly 40 men and 40 women. (b)

This is stratified random sampling. You ensure that neither men nor women are over-sampled.


2.21. (a) The clusters. (b) The subjects within every stratum. (c) The main difference is that a

stratified random sample uses every stratum, and we want to compare the strata. By contrast, we

have a sample of clusters, and not all clusters are represented—the goal is not to compare the

clusters but to use them to obtain a sample.

2.22. (a) Categorical are GE, VE, AB, PI, PA, RE, LD, AA; quantitative are AG, HI, CO, DH,

DR, NE, TV, SP, AH. (b) Nominal are GE, VE, AB, PA, LD, AA; ordinal are PI and RE; interval

are AG, HI, CO, DH, DR, NE, TV, SP, AH.

2.24. (a) Draw a systematic sample from the student directory, using skip number k = N/100,

where N = number of students on the campus. (b) High school GPA on a 4-point scale, treated as

quantitative, interval, continuous; math and verbal SAT on a 200 to 800 scale, treated as

quantitative, interval, continuous; whether work to support study (yes, no), treated as categorical,

nominal, discrete; time spent studying in average day, on scale (none, less than 2 hours, 2-4

hours, more than 4 hours), treated as quantitative, ordinal, discrete.

2.25. This is nonprobability sampling; certain segments may be over- or under-represented,

depending on where the interviewer stands, time of day, etc. Quota sampling fails to incorporate

randomization into the selection method.

2.26. Responses can be highly dependent on nonsampling errors such as question wording.

2.27. (a) This is a volunteer sample, so results are unreliable; e.g., there is no way of judging how

close 93% is to the actual percentage of the population who believe that benefits should be reduced. (b) This is a

volunteer sample; perhaps an organization opposing gun control laws has encouraged members to

send letters, resulting in a distorted picture for the congresswoman. The results are completely

unreliable as a guide to views of the overall population. She should take a probability sample of

her constituents to get a less biased reaction to the issue. (c) The physical science majors who

take the course might tend to be different from the entire population of physical science majors

(perhaps more liberal minded on sexual attitudes, for example). Thus, it would be better to take

random samples of students of the two majors from the population of all social science majors

and all physical science majors at the college. (d) There would probably be a tendency for

students within a given class to be more similar than students in the school as a whole. For

example, if the chosen first period class consists of college-bound seniors, the members of the

class will probably tend to be less opposed to the test than would be a class of lower achievement

students planning to terminate their studies with high school. The design could be improved by

taking a simple random sample of students, or a larger random sample of classes with a random

sample of students then being selected from each of those classes (a two-stage random sample).

2.28. A systematic sample with a skip number of 7 (or a multiple of 7) would be problematic

since the sampled editions would all be from the same day of the week (e.g., Friday). The day of

the week may be related to the percentage of newspaper space devoted to news about

entertainment.

2.29. Because of skipping names, two subjects listed next to each other on the list cannot both be

in the sample, so not all samples are equally likely.


2.30. If we do not take a disproportional stratified random sample, we might not have enough

Native Americans in our sample to compare their views to those of other Americans.

2.31. If a subject is in one of the clusters that is not chosen, then this subject can never be in the

sample. Not all samples are equally likely.

2.33. The nursing homes can be regarded as clusters. A systematic random sample is taken of the

clusters, and then a simple random sample is taken of residents from within the selected clusters.

2.34. (b)

2.35. (c)

2.36. (c)

2.37. (a)

2.38. False. This is a convenience sample.

2.39. False. This is a voluntary response sample.

2.40. An annual income of $40,000 is twice the annual income of $20,000. However, 70 degrees

Fahrenheit is not twice as hot as 35 degrees Fahrenheit. (Note that income has a meaningful zero

and temperature does not.) IQ is not a ratio-scale variable.

Chapter 3

3.1. (a)

Place of Birth Relative Frequency

Europe 13.7%

Asia 25.4%

Caribbean 9.6%

Central America 37.6%

South America 6.1%

Other 7.6%


(b) [Bar graph: percent (vertical axis, 0 to 40) by place of birth (C America, Asia, Europe, Caribbean, Other, S America).]

(c) "Place of birth" is categorical. (d) The mode is Central America.

3.2. (a)

Religion Relative Frequency

Christianity 41.2%

Islam 25.5%

Hinduism 17.6%

Confucianism 7.8%

Buddhism 7.8%


(b) [Bar graph: relative frequency (vertical axis, 0 to 50) by religion (Christianity, Islam, Hinduism, Buddhism, Confucianism).]

(c) The mode of these five religions is Christianity. Christianity is also the mode of all religions.

3.3. (a) There are 33 students. The minimum score is 65, and the maximum score is 98.

(b) [Histogram of Midterm_Score: frequency (0 to 12.5) against score (60 to 100); mean = 82.88, std. dev. = 8.947, N = 33.]


3.4. (a)

Number Persons Relative Frequency

1 27.1%

2 33.3%

3 16.0%

4 13.8%

5 or more 9.8%

(b) [Bar graph: relative frequency (vertical axis, 0 to 40) by number of persons (1, 2, 3, 4, 5 or more).]

(c) The median household size is 2 persons, and the mode is also 2 persons.

3.5. (a)

Value Frequency Percent Cumulative Percent

1 3 6.0 6.0

2 8 16.0 22.0

3 9 18.0 40.0

4 2 4.0 44.0

5 8 16.0 60.0

6 9 18.0 78.0

7 5 10.0 88.0

8 2 4.0 92.0

9 2 4.0 96.0

10 1 2.0 98.0

13 1 2.0 100.0

Total 50 100.0
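
As a check on the summary statistics reported with this table, the frequencies can be expanded back into individual observations and summarized with Python's standard statistics module. This is only a sketch; the variable names are illustrative assumptions, and the standard deviation uses the sample (n - 1) divisor.

```python
import statistics

# Frequency table from 3.5(a): value -> number of observations.
frequencies = {1: 3, 2: 8, 3: 9, 4: 2, 5: 8, 6: 9, 7: 5, 8: 2, 9: 2, 10: 1, 13: 1}

# Expand the table back into the 50 individual observations.
observations = [value for value, count in frequencies.items() for _ in range(count)]

n = len(observations)
mean = statistics.mean(observations)
median = statistics.median(observations)
std_dev = statistics.stdev(observations)  # sample standard deviation (divisor n - 1)

print(n, mean, median, round(std_dev, 3))
```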


(b) [Histogram of MU_noDC: frequency (0 to 10) against value (0 to 12.5); mean = 4.8, std. dev. = 2.571, N = 50.]

The distribution appears to be bimodal and skewed to the right.

(c) Stem Leaves

1 000

2 00000000

3 000000000

4 00

5 00000000

6 000000000

7 00000

8 00

9 00

10 0

11

12

13 0

The stem-and-leaf plot shows the same bimodality and right skew that the histogram does.

3.6. (a) GDP is rounded to the nearest thousand.

Stem (10 thousands) Leaves (thousands)

2 023

2 58899

3 00011122233

3 8

4

4

5

5

6

6

7 0


(b) [Histogram of RoundGDP: frequency (0 to 12.5) against GDP in thousands (20 to 70); mean = 32.00, std. dev. = 9.615, N = 23.]

(c) The outlier in each plot is Luxembourg.

3.7. (a) The mean is (26 + 17 + 236 + 2 + 6)/5 = 287/5 = 57.4 abortions per 1000 women 15 to 41

years of age. (b) The median is 17 abortions per 1000 women 15 to 41 years of age. The mean

and median are so different because California is an extreme outlier in this small data set.

3.8. (a) The mean is (0.3 + 1.8 + 2.3 + 1.2 + 1.4 + 0.7 + 9.9 + 20.1)/8 = 37.7/8 = 4.7 metric tons

per person. The median is 1.6 metric tons per person. (b) The United States appears to be an

outlier, since it is far greater than any other data value. (Without the United States, the mean is

2.5 and the median is 1.4.)
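
The effect of the outlier in 3.8 on the mean and the median can be reproduced with a short Python sketch. The variable names are illustrative assumptions; the value 20.1 is taken to be the United States, since the answer identifies it as far greater than any other value.

```python
import statistics

# Per-capita emissions listed in 3.8(a), in metric tons per person.
emissions = [0.3, 1.8, 2.3, 1.2, 1.4, 0.7, 9.9, 20.1]

# Drop the largest value (20.1, the United States) to see its influence.
without_largest = sorted(emissions)[:-1]

for label, data in [("all eight", emissions), ("without 20.1", without_largest)]:
    print(label, round(statistics.mean(data), 1), round(statistics.median(data), 1))
```

The mean moves much more than the median when the largest value is dropped, which is the sense in which the median is resistant to outliers.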

3.9. (a) The response "not far enough" is the mode. (b) We cannot compute a mean or median

with these data since they are categorical.

3.10. (a) Stem Leaves

0 4679

1 133

2 0

3 9

4 4

(b) The mean is 16.6 days, and the median is 12 days.

(c) Leaves

25 years ago Stem Leaves

0 4679

875 1 133

440 2 0

21 3 9

0 4 4

5 5

For the data from 25 years ago, the mean was 27.6 days, and the median was 24 days. The mean

has decreased by 11 days, and the median has decreased by 12 days since 25 years ago. (d) Of the


11 observations, the median is 13 days. We cannot calculate the mean, but substituting 40 for the

censored observation gives a mean of 18.7 days.

3.11. (a)

TV Hours Frequency Relative Frequency (%)

0 79 4.0

1 422 21.2

2 577 29.0

3 337 17.0

4 226 11.4

5 136 6.8

6 99 5.0

7 23 1.2

8 34 1.7

9 4 0.2

10 23 1.2

12 14 0.7

13 1 0.1

14 7 0.4

15 2 0.1

18 2 0.1

24 1 0.1

Total 1987 100.0

(b) The distribution is unimodal and right skewed. (c) The median is the 994th data value, which

is 2. (d) The mean is larger than 2 because the distribution is skewed to the right by a few high values.
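
The position argument in 3.11(c) can be sketched as a cumulative count over the frequency table. The helper name median_from_frequencies is an illustrative assumption, and the sketch handles only an odd number of observations, which is the case here (n = 1987).

```python
# Frequency table from 3.11(a): hours of TV per day -> number of respondents.
tv_hours = {0: 79, 1: 422, 2: 577, 3: 337, 4: 226, 5: 136, 6: 99, 7: 23, 8: 34,
            9: 4, 10: 23, 12: 14, 13: 1, 14: 7, 15: 2, 18: 2, 24: 1}


def median_from_frequencies(table):
    """Median of an odd-sized sample given as {value: count}."""
    n = sum(table.values())
    if n % 2 == 0:
        raise ValueError("this sketch only handles an odd number of observations")
    middle = (n + 1) // 2  # position of the median in the sorted data
    running = 0
    for value in sorted(table):
        running += table[value]
        if running >= middle:
            return value


n = sum(tv_hours.values())
median = median_from_frequencies(tv_hours)
mean = sum(value * count for value, count in tv_hours.items()) / n
print(n, (n + 1) // 2, median, mean > median)
```

The last printed value compares the mean with the median, in line with part (d).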

3.12.

Central America   Stem   Western Europe
           8540     4
          85210     5    488
             82     6    003678
                    7    1268
                    8    567
                    9    0

Female economic activity seems greater, on average, in Western Europe than in Central America.

Most of the values in Western Europe exceed the highest value in Central America. There appear

to be more women in the labor force (per 100 men) in Western Europe than in Central America.

3.13. Since the mean is much greater than the median, the distribution of 2000 household income

in Canada is most likely skewed to the right.

3.14. (a) The median is "2 or 3 times a month." The mode is "not at all." The data are centered

around the respondents having sex about 2 or 3 times a month in the past 12 months. The most frequent

answer to the question is "not at all." (b) The sample mean is 4.1, which means that, on average,

the respondents had sex about 4 times a month in the past 12 months.

3.15. (a) The mode is "every day." The median is "a few times a week." (b) The mean is 3.7

times per week, which is lower than the 4.4 times a week in 1994.


3.16. (a) For each gender, the distribution of earnings is skewed to the right, since each mean is

greater than its respective median. (b) The overall mean income is ($39,890×73.8 +

$56,724×83.4)/(73.8 + 83.4) = $7,674,663.6/157.2 = $48,821.
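
The overall mean in 3.16(b) is a weighted mean of the two group means, with the group sizes as weights. A minimal Python sketch follows; the function name weighted_mean is an illustrative assumption, and the weights are used exactly as they appear in the calculation, without assuming their units.

```python
def weighted_mean(values, weights):
    """Mean of group means, weighting each by its group size."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Group means and group sizes as they appear in 3.16(b).
group_means = [39890, 56724]
group_sizes = [73.8, 83.4]

overall = weighted_mean(group_means, group_sizes)
print(round(overall))
```

Simply averaging the two group means would ignore the difference in group sizes and give a different answer.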

3.17. (a) The response variable is median family income, and the explanatory variable is race. (b)

We cannot find the median income for the combined groups since we do not know how many

families are in each group. (c) We would need to know how many families were in each group.

3.18. (a) The distribution is skewed to the right. (b) The Empirical Rule only applies to bell-

shaped distributions, so it does not apply here. (c) The median is 0. If the 500 observations were

to shift from 0 to 6, the median would remain zero, since half of the data values fall below 0 and

half fall above 0. This illustrates the resistance of the median to skewness and extreme values.

3.19. (a) Median: $10.13; mean: $10.18; range: $0.46; standard deviation: $0.22. (b) Median:

$10.01; mean: $9.17; range: $5.31; standard deviation: $2.26. The median is resistant to outliers,

but the mean, range, and standard deviation are highly impacted by outliers.

3.20. (a) Mean: 30; standard deviation: 9.0. (b) Minimum: 13; lower quartile: 25.5; median: 31;

upper quartile: 36; maximum: 42.

3.21. The mean is 28.7, and the standard deviation is 12.5. The 2006 HDI ratings for the top 10

nations vary greatly.

3.22. (a) The life expectancies in Africa vary more than the life expectancies in Western Europe,

because the life expectancies for the African countries are more spread out than those for the

Western European countries. (b) The standard deviation is 1.1 for the Western European nations

and 7.1 for the African nations.

3.23. (a) (i) $40,000 to $60,000; (ii) $30,000 to $70,000; (iii) $20,000 to $80,000. (b) A salary of

$100,000 would be unusual because it is 5 standard deviations above the mean.
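
The intervals in 3.23 follow the Empirical Rule pattern of mean plus or minus 1, 2, and 3 standard deviations. The sketch below assumes a mean of $50,000 and a standard deviation of $10,000, values inferred here from the intervals in part (a) rather than quoted from the exercise; the function names are also illustrative assumptions.

```python
def empirical_rule_intervals(mean, std_dev):
    """Intervals holding roughly 68%, 95%, and nearly all of a bell-shaped distribution."""
    return [(mean - k * std_dev, mean + k * std_dev) for k in (1, 2, 3)]


def z_score(value, mean, std_dev):
    """Number of standard deviations that `value` lies from the mean."""
    return (value - mean) / std_dev


# Mean and standard deviation implied by the intervals in 3.23(a) (an inference, see above).
mean_salary, sd_salary = 50_000, 10_000

print(empirical_rule_intervals(mean_salary, sd_salary))
print(z_score(100_000, mean_salary, sd_salary))
```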

3.24. (a) Approximately 68% of the values are contained in the interval 32 to 38 days;

approximately 95% of the values are contained in the interval 29 to 41 days; all or nearly all of

the values are contained in the interval 26 to 44 days. (b) (i) The mean would decrease if the

observation for the U.S. was included. (ii) The standard deviation would increase if the

observation for the U.S. was included. (c) The U.S. observation is 5.3 standard deviations below

the mean.

3.25. (a) 88.8% of the observations fall within one standard deviation of the mean. (b) The

Empirical Rule is not appropriate for this variable, since the data are highly skewed to the right.

3.26. 10 is realistic; –20 is impossible since the standard deviation cannot be negative; 0 implies

that every student scored 76 on the exam, which is highly improbable; 50 is too large (it is half of

the possible range of scores).

3.27. (a) The most realistic value is 0.4, because the range is 5 times the length of this value. (b)

The value of –10.0 is impossible since the standard deviation cannot be negative.

3.28. (d)


3.29. (a) Since the range is 43.5 standard deviations above the mean, the distribution is most

likely skewed to the right. (b) The distribution probably has outliers (take the maximum usage,

for example).

3.30. The distribution is most likely skewed to the right since the minimum water consumption (0

thousands of gallons) is less than one standard deviation below the mean.

3.31. (a) The range is $28,700, which is the difference between the mean salary for secondary

school teachers in Illinois (highest mean) and in South Dakota (lowest mean). (b) The

interquartile range is $9600 and represents the spread of the mean salaries for the middle 50% of

the states.

3.32. (a) [Box plot of Salary; axis marked at 40000, 50000, and 60000.]

(b) The box plot suggests that the data are skewed to the right. (c) 7000 is the most plausible

standard deviation, since the range of the data is about 4 standard deviations. The values 100 and

1000 are too small for the spread that we see, and 25,000 is just slightly over the value for the

range.

3.33. The mean, standard deviation, maximum, and range all decrease, because the observation

for D.C. was a high outlier. Note that these statistics are not resistant to outliers. On the other

hand, the median, Q3, Q1, the interquartile range, and the mode remain the same, as these are all

resistant to outliers. The minimum remains the same since D.C. was a high outlier and not a low

outlier.

3.34. (a) The Empirical Rule does not apply to this distribution because the standard deviation is

much larger than the mean, suggesting a right-skewed distribution. (b) The five-number summary

confirms that the distribution is skewed to the right, since the distance between Q3 and the

median is larger than the distance between the median and Q1, and the maximum is so large. (c)

IQR = Q3 – Q1 = 1105 – 256 = 849. Low outliers would be observations less than Q1 – 1.5(IQR)

= 256 – 1.5(849) = –1017.5. There are no values that are low outliers. High outliers would be

observations greater than Q3 + 1.5(IQR) = 1105 + 1.5(849) = 2378.5. At least the maximum is a

high outlier.
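
The 1.5(IQR) cutoffs used in 3.34(c) can be written as a small Python helper. The function name outlier_fences and the multiplier parameter are illustrative assumptions.

```python
def outlier_fences(q1, q3, multiplier=1.5):
    """Lower and upper cutoffs under the 1.5(IQR) criterion."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


# Quartiles from 3.34(c).
low, high = outlier_fences(256, 1105)
print(low, high)
```

Any observation below the first cutoff or above the second would be flagged as a potential outlier.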


3.35. (a) The sketch should show a right-skewed distribution. (b) The sketch should show a right-

skewed distribution. (c) The sketch should show a left-skewed distribution. (d) The sketch should

show a right-skewed distribution. (e) The sketch should show a left-skewed distribution.

3.36. (a) Skewed to the left (b) Bell shaped (c) Skewed to the right (d) Skewed to the left (e)

Skewed to the left (f) Skewed to the right (g) U shaped

3.38. A box plot using only the five-number summary follows:

[Box plot of EU unemployment; axis marked from 4.0 to 12.0.]

This box plot shows us that the maximum is an outlier. Since the mean and the median are the

same, the distribution may be slightly more symmetric than the box plot implies. It is important to

note that only the five-number summary was used to produce this box plot.

3.39. (a) Minimum = 0, Q1 = 2.0, median = 3.0, Q3 = 5.0, maximum = 14. (b) Same as part (a). (c)

The observations with values 12 and 14 are outliers. (d) The standard deviation is 3. The value

0.3 is too small for the distribution, the value 13 is almost equal to the range, and the value 23

exceeds the range.

3.40. Side-by-side box plots are not very informative as shown below:


The infant mortality rates in Africa are much higher than the infant mortality rates in Western

Europe. In addition, since Q1, the median, and Q3 are all equal for Western Europe, we do not

actually see a "box" in the box plot. Infant mortality rates in Africa are skewed to the right, while

infant mortality rates in Western Europe are symmetric.

3.41. (a) [Box plot of No health insurance; axis marked from 10.0 to 25.0.]

(b) The distribution appears to be skewed to the right.

3.42. (a) The range is 92.3 – 78.3 = 14. The interquartile range is 88.8 – 83.6 = 5.2. (b) Low

outliers would be observations less than Q1 – 1.5(IQR) = 83.6 – 1.5(5.2) = 75.8. There are no

values that are low outliers. High outliers would be observations greater than Q3 + 1.5(IQR) =

88.8 + 1.5(5.2) = 96.6. There are no values that are high outliers.

3.43. (a) Minimum = 1, Q1 = 3, median = 5, Q3 = 6, maximum = 13. (b)

[Box plot of MU_noDC; axis marked from 2.5 to 12.5.]

Louisiana appears to be a mild outlier. (c) Minimum = 1, Q1 = 3, median = 5, Q3 = 6, maximum

= 44.


[Box plot of MU; axis marked from 0 to 40.]

Louisiana is still a mild outlier, and D.C. is an extreme outlier. Only the maximum changes in the

five-number summary when the observation for D.C. is added to the data set.
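
The five-number summaries in 3.43 can be checked with a short Python sketch. It assumes that the data are the 50 observations in the frequency table of 3.5(a), an assumption based on the matching MU_noDC label and summary values, and it adds 44 as the D.C. value from part (c). Quartile conventions differ among software packages; the default method of statistics.quantiles happens to agree with the values reported here.

```python
import statistics

# Assumed to be the 3.43 data: the frequency table from 3.5(a).
frequencies = {1: 3, 2: 8, 3: 9, 4: 2, 5: 8, 6: 9, 7: 5, 8: 2, 9: 2, 10: 1, 13: 1}
without_dc = sorted(v for v, count in frequencies.items() for _ in range(count))


def five_number_summary(data):
    """Minimum, Q1, median, Q3, maximum (quartiles via statistics.quantiles)."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)


print(five_number_summary(without_dc))
print(five_number_summary(without_dc + [44]))  # add the D.C. value from part (c)
```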

3.44. The IQR is most likely 350. The value of 1500 is the range. The value of –10 is impossible

for an IQR, and the 0 would only occur if all data values were the same. Given the minimum,

median, and maximum, the value 10 is too small to be the IQR.

3.45. (a) Luxembourg’s observation is 3.9 standard deviations above the mean. (b) Sweden’s

observation is 0.8 standard deviations below the mean. (c) (i) Canada’s observation is 2.5

standard deviations above the mean of the EU countries. (ii) The U.S. observation is 3.6 standard

deviations above the mean of the EU countries (but not as high as Luxembourg).

3.46. (a) Italy’s observation is 0.4 standard deviations below the mean. (b) The U.S. observation

is 3.4 standard deviations above the mean of the EU countries. (c) We expect almost all of the

values in a bell shaped distribution to be within 3 standard deviations of the mean. Thus, the U.S.

observation would be considered a high outlier.

3.47. (a) Response variable: opinion about national health insurance (favor, oppose); explanatory

variable: political party (Democrat, Republican). (b) The data could be summarized in a

contingency table with political party as the rows and opinion about national health insurance as

the columns.

3.48. (a) Response variable: happiness; explanatory variable: religious attendance. (b) For those

who attend religious services nearly every week or more, 44.5% reported being very happy. For

those who attend religious services never or less than once a year, 23.2% reported being very

happy. (c) There appears to be an association between happiness and religious attendance since

the percentages that reported being very happy differed greatly by attendance at religious

services.

3.49. (a) United States: predicted fertility = 3.2 – 0.04(50) = 1.2; Yemen: predicted fertility = 3.2

– 0.04(0) = 3.2. (b) The negative value implies that the fertility rate decreases as Internet use

increases.
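
The predictions in 3.49(a) come from plugging Internet use into the prediction equation. A minimal Python sketch follows; the function name predicted_fertility is an illustrative assumption.

```python
def predicted_fertility(internet_use):
    """Prediction equation from 3.49(a): fertility = 3.2 - 0.04 * (Internet use)."""
    return 3.2 - 0.04 * internet_use


# Internet-use values plugged in above: 50 for the United States, 0 for Yemen.
for country, internet_use in [("United States", 50), ("Yemen", 0)]:
    print(country, round(predicted_fertility(internet_use), 1))
```

The negative slope means each one-unit increase in Internet use lowers the predicted fertility by 0.04.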


3.50. (a) Points in a scatterplot for these data should have a negative association and be fairly

tightly clustered in a linear pattern. (b) Contraceptive use is more strongly associated with fertility

than is Internet use because –0.89 is a stronger linear association than is –0.55.

3.51. (a) Based on the plot (see next page), the correlation should be positive, since higher values

of GDP tend to go with higher values of CO2 (and vice versa). (b) Luxembourg has a GDP of

69,961 and CO2 of 22.0, both of which are extreme values.

[Scatterplot of CO2 (vertical axis, 5.0 to 25.0) against GDP (horizontal axis, 10,000 to 70,000).]

3.52. The number of physicians is more strongly correlated with carbon dioxide emissions than is

female economic activity, since the absolute value of the correlation for number of physicians and

CO2 emissions is closer to 1 than is the correlation for female economic activity and CO2

emissions.

3.53. (a) ȳ is a sample statistic (sample mean) used to estimate the population mean µ. (b) s is a

sample statistic (sample standard deviation) used to estimate the population standard deviation σ.

3.54. (a) The mean is 1232.2 miles, with standard deviation 1681.7 miles. The median is 640

miles. The minimum distance from home is 0 miles, while the maximum distance is 8000 miles.

The histogram shows that the distribution of the distance from home is skewed to the right.


[Histogram of DHome: frequency (0 to 40) against distance (0 to 8000); mean = 1232.2, std. dev. = 1681.748, N = 60.]

(b) The mean is 7.3 hours, with standard deviation 6.7 hours. The median is 6 hours. The

minimum is 0 hours, while the maximum is 37 hours. The histogram shows that the

distribution of the hours of watching television is skewed to the right.

[Histogram of TVhours: frequency (0 to 20) against hours (0.0 to 40.0); mean = 7.27, std. dev. = 6.717, N = 60.]

3.56. Report should include graphical displays and summary statistics. The summary statistics

are: mean = 2.7, standard deviation = 2.1, minimum = 0.1, Q1 = 1.5, median = 2.1, Q3 = 3.6,

maximum = 9.4. The U.S. is an outlier with 9.4 gun deaths per 100,000 people.

3.57. Report should state that the explanatory variable is the percentage with income below the

poverty level and the response variable is the violent crime rate. The correlation coefficient is

0.496. D.C. appears to be an outlier.
