Date post: 20-Oct-2019
Chapter 7

Scatterplots, Association, and Correlation

Regularly, since 1937, the Gallup Poll has asked likely U.S. voters whether they would vote for a qualified woman for president if their preferred political party nominated one. Are people more likely to say yes to this question now than they were 70 years ago? If so, has the increase been consistent, or do you think there might have been periods when the “yes’s” didn’t increase at all, or even decreased? Here’s a plot of the percentage saying they would vote for a woman, plotted against the year in which the survey took place:

WHO U.S. voters

WHAT Percentage saying they would vote for a woman for president

WHEN 1937–1999

WHERE United States

HOW Gallup Poll

[Figure 7.1 plot: % Responding Yes (45 to 90) vs. Year (since 1900) (45 to 90)]

Clearly, attitudes have changed. The plot shows fairly steady growth since 1937, reaching a level of 90% of voters saying “yes” by the year 1999. We can also see that there was a period of no growth in the 1960s and early 1970s.

This timeplot is an example of a more general kind of display called a scatterplot. Scatterplots may be the most common and most effective display for data. By just looking at them, you can see patterns, trends, relationships, and even the occasional extraordinary value sitting apart from the others. As the great philosopher Yogi Berra¹ once said, “You can observe a lot by watching.”² Scatterplots are the best way to start observing the relationship between two quantitative variables.

¹ Hall of Fame catcher and manager of the New York Mets.
² But then he also said “I really didn’t say everything I said.” So we can’t really be sure.

Figure 7.1: A scatterplot of percentage saying they would vote for a woman plotted against the year of the survey. Has the increase in willingness to vote for a woman been constant over the entire time period? What features in the trend do you see?

3339 Deveaux ch07_113-135 3/25/03 3:16 PM Page 114

Relationships between variables are often at the heart of what we’d like to learn from data:

• Are grades actually higher now than they used to be?
• Do people tend to reach puberty at a younger age than in previous generations?
• Does applying strong magnets to parts of the body relieve pain? If so, are stronger magnets more effective?
• Do students learn better with the use of computer technology?

Questions such as these relate two quantitative variables and ask whether there is an association between them. Scatterplots are the ideal way to picture such associations.

Looking at Scatterplots

The Texas Transportation Institute studies the mobility provided by the nation’s transportation system. It issues an annual report on traffic congestion and its costs. Here’s a scatterplot of the annual cost per person of traffic delays (in dollars) in 70 cities in the United States against the peak period freeway speed (mph).

WHO 70 U.S. cities

WHAT Cost per person of traffic delays and peak period freeway speed

UNITS $ per person per year and miles per hour

WHEN 2000

WHY Annual report from the Texas Transportation Institute

[Figure 7.2 plot: Cost per Person ($ per person per year) (0 to 600) vs. Peak Period Freeway Speed (mph) (44 to 60)]

Figure 7.2: Cost per person ($ per year) of traffic delays vs. peak period freeway speed (mph) for 70 U.S. cities.

Everyone looks at scatterplots. But, if asked, most people would find it hard to say what to look for in a scatterplot. What do you see? Try to describe the scatterplot of congestion cost against freeway speed.

Look for: Direction

Probably, you would say that the direction of the relationship is important. As the peak freeway speed goes up, the cost of congestion goes down. A pattern that runs from the upper left to the lower right is said to have a negative direction. A trend running the other way has a positive direction.

The second thing to look for in a scatterplot is its form. If there is a straight line relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form. For example, the scatterplot of traffic congestion has such an underlying linear form with some points that stray away from it.


116 Part II Exploring Relationships Between Variables

Descartes was a philosopher, famous for his statement cogito ergo sum: I think, therefore I am.

³ The axes are also called the “ordinate” and the “abscissa” (often by people who want to impress you). We can’t remember which is which, so we won’t expect you to remember it either. In Statistics (and in all statistics computer programs), the axes are always called “y” and “x.”

Look for: Form (especially straightness)

Look for: Scatter

If the relationship isn’t straight, but curves gently, while still increasing or decreasing steadily, we can often find ways to make it more nearly straight. But if it curves sharply (up and then down, for example), there is much less we can say about it with the methods of this book.

The third thing to look for in a scatterplot is how much scatter it has. At one extreme, do the points appear to follow a single stream (whether straight, curved, or bending all over the place)? Or, at the other extreme, does the swarm of points seem to form a vague cloud through which we can barely discern any trend or pattern? We are especially interested in the scatter when the underlying form is straight. We’ll develop tools for quantifying the amount of scatter soon, but for now we just want to be aware of whether there seems to be a large or small amount of scatter. The traffic congestion plot shows little scatter, so we conclude that there’s a strong relationship between cost and speed.

Finally, you should always look for the unexpected. Often the most interesting thing to see in a scatterplot is the thing you never thought to look for. One example of such a surprise is an outlier standing away from the overall pattern of the scatterplot. Such a point is almost always interesting and always deserves special attention. Clusters or subgroups that stand away from the rest of the plot or show a trend in a different direction than the rest of the plot should raise questions about why they are different. They may be a clue that you should split the data into subgroups rather than looking at it all together.

Scatterplot Details

Scatterplots were among the first modern mathematical displays. The idea of using two axes at right angles to define a field on which to display values can be traced back to René Descartes (1596–1650), and the playing field he defined in this way is formally called a Cartesian plane in his honor.

The two axes Descartes specified characterize the scatterplot. The axis that runs up and down is, by convention, called the y-axis, and the one that runs from side to side is called the x-axis. You can count on these names. If someone refers to the y-axis, you may be sure they mean the vertical, up-and-down axis, and similarly with the x-axis.³


NOTATION ALERT: So x and y are reserved letters as well, but not just for labeling the axes of a scatterplot. In Statistics, the assignment of variables to the x- and y-axes (and choice of notation for them in formulas) often conveys information about their roles as predictor or response.

[Diagram: a point (x, y) located by its x value on the horizontal axis and its y value on the vertical axis]

To make a scatterplot of two quantitative variables, assign one to the y-axis and the other to the x-axis. Be sure to label the axes clearly and indicate the scale of the axes with numbers. Scatterplots display quantitative variables. Each variable has units, which should appear with the display to define what it’s showing.

Each point is placed on a scatterplot at a position that corresponds to values on these two variables. Its horizontal location is specified by its value on the x-axis variable and its vertical location is specified by its value on the y-axis variable. Together, these are known as coordinates and written (x, y).

Scatterplots made by computer programs (such as the two we’ve seen in this chapter) often do not (and usually should not) show the origin: the point at x = 0, y = 0 where the axes meet. If both variables have values that are near to or on both sides of zero, then the origin will be part of the display. If the values are far from zero, though, there’s no reason to include the origin. In fact, it’s far better to focus on the part of the Cartesian plane that contains the data. (For example, we’re not interested in the likely cost of delays if the freeways had a peak period speed of 0 mph.) Often, programs indicate this choice by drawing the axes so that they don’t quite meet.

Roles for Variables

Which variable should go on the x-axis and which on the y-axis? What we want to know about the relationship can tell us how to make the plot. We often have questions such as:

• Are people who smoke heavily more likely to get lung cancer?
• Is birth order an important factor in predicting future income?
• Can we estimate a person’s % body fat more simply by just measuring girth or wrist size?

In each of these examples, the two variables play different roles. One plays the role of the explanatory or predictor variable, while the other takes on the role of the response variable. When the roles are clear, we place the explanatory variable on the x-axis and the response variable on the y-axis. When you make a scatterplot, you can assume that those who view it will think this way, so take care in choosing which variables to assign to which axes.

The roles that we choose for variables are more about how we think about them than about the variables themselves. Just placing a variable on the x-axis doesn’t necessarily mean that it explains or predicts anything. And the variable on the y-axis may not respond to it in any way. We plotted cost per person against peak freeway speed, thinking that the slower you go, the more it costs in delays. But maybe spending $500 per person in freeway improvement would increase speed. If we were examining that option, we might choose to plot cost per person as the explanatory variable and speed as the response.

Older textbooks, and disciplines other than statistics, sometimes refer to the x- and y-variables as the independent and dependent variables, respectively. The idea was that the y-variable depended on the x-variable and the x-variable acted independently to make y respond. But these names conflict with other uses of the same terms in Statistics. We’ll use the terms “explanatory” and “response” when we’re thinking about a relationship in those terms, but we’ll often just say x-variable and y-variable.


Correlation

Data collected from students in Statistics classes included their height (in inches) and weight (in pounds). It’s no great surprise to discover that there is a positive association between the two. As you might suspect, taller students tend to weigh more. (If we had reversed the roles and chosen weight as the explanatory variable, we might say that heavier students tend to be taller.) And the form of the scatterplot is fairly linear as well, although there seems to be a high outlier, as Figure 7.3 shows.

There is clearly a positive association, but how strong is it? If you had to put a number (say, between 0 and 1) on the strength, what would it be? Clearly, whatever measure we use shouldn’t depend on the units of the variables. After all, if we had measured heights and weights in different units, it wouldn’t change the direction, form, or scatter, so it shouldn’t change the strength.

Since the units don’t matter, why not just remove them altogether? If we standardize both variables, we’ll turn the coordinates of each point into a pair of z-scores. Now the center of the scatterplot is at the origin and the axes are in standard deviation units. (Look at Figure 7.5.)
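The standardizing step can be sketched in a few lines of Python. The heights and weights below are made-up values for illustration (not the class data), and `z_scores` is a hypothetical helper name. The sketch standardizes the same measurements in two unit systems and shows that the z-scores agree, which is why a measure built from z-scores cannot depend on units.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize: subtract the mean, then divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical heights (inches) and weights (pounds); NOT the class data.
heights_in = [64, 66, 68, 70, 72, 74]
weights_lb = [120, 150, 145, 170, 190, 185]

# The same measurements expressed in centimeters and kilograms.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

z_in, z_cm = z_scores(heights_in), z_scores(heights_cm)
z_lb, z_kg = z_scores(weights_lb), z_scores(weights_kg)

# Each row prints the same z-score twice: once from inches, once from centimeters.
for a, b in zip(z_in, z_cm):
    print(f"{a:7.3f} {b:7.3f}")
```

After standardizing, each variable has mean 0 and standard deviation 1, so the standardized plot is centered at the origin with both axes in standard deviation units.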

Is this the only difference between these plots? Well, no. The underlying linear pattern seems steeper in the standardized plot. That’s because we made the scales of the axes the same. Now the length of one standard deviation is the same vertically and horizontally. When we worked in the original units, we were free to make the plot as tall and thin or as squat and wide as we wanted to. Equal scaling gives a neutral way of drawing the scatterplot and a fairer impression of the strength of the association.⁴

WHO Students

WHAT Height (inches), weight (pounds)

WHERE Ithaca, NY

WHY Data for class

HOW Survey

[Figure 7.3 plot: Weight (lb) (120 to 280) vs. Height (in.) (64 to 76)]

Figure 7.3: Weight vs. height of Statistics students.

We could have measured the weight in stones; a stone is a measure in the now outdated UK system of measures equal to 14 pounds. And we could measure your height in hands; hands are still commonly used to measure the heights of horses. A hand is 4 inches. But no matter what units we use to measure the two variables, the correlation stays the same.

[Figure 7.4 plot: Weight (kg) (40 to 120) vs. Height (cm) (160 to 190)]

Figure 7.4: Plotting weight vs. height in different units doesn’t change the shape of the pattern.

⁴ When we’re free to choose how to draw a scatterplot, what often looks best is to make the range of the x-axis slightly larger than the range of the y-axis. This is an aesthetic choice and is probably related to the Golden Ratio of the Greeks.


Now we are plotting standardized values, so we should label them as z-scores. Since we have z-scores for each variable, we can distinguish them by calling them $z_x$ and $z_y$. We can write the coordinates of a point as $(z_x, z_y)$.

Which points in the scatterplot of the z-scores give the impression of a positive association? The points in the upper right and lower left colored green strengthen the impression of a pattern from lower left to upper right. For points in these quadrants, $z_x$ and $z_y$ have the same sign. If we multiplied them together, every point would have a positive product. Points far from the origin (which make the association look more positive) have a bigger product.

The red points in the upper left and lower right quadrants tend to weaken the positive association. For these points, $z_x$ and $z_y$ have opposite signs. So the product $z_x z_y$ for these points is always negative. Now points far from the origin (which make the association look more negative) have an even more negative product.

Points with z-scores of zero on either variable don’t vote either way, and $z_x z_y = 0$. We’ve colored them blue.

We can turn these products into a measure of the strength of the association. We just add up all the $z_x z_y$ for every point in the scatterplot:

$$\sum z_x z_y$$

This summarizes the direction and strength of the association for all the points. If most of the points are in the green quadrants, the sum will tend to be positive. If most are in the red quadrants, it will tend to be negative.
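The “voting” idea can be sketched as follows. The five data pairs are made up for illustration, and `z_scores` and `vote` are hypothetical helper names. Each point is labeled green, red, or blue by the sign of its product $z_x z_y$, and the products are then summed.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize a list of values using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def vote(product):
    """Color a point by the sign of its z-score product."""
    if product > 0:
        return "green"   # same signs: supports a positive association
    if product < 0:
        return "red"     # opposite signs: works against it
    return "blue"        # a zero z-score: no vote either way

# Hypothetical data pairs, chosen so that all three colors appear.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

products = [zx * zy for zx, zy in zip(z_scores(x), z_scores(y))]
labels = [vote(p) for p in products]
total = sum(products)

for xi, yi, p, lab in zip(x, y, products, labels):
    print(f"({xi}, {yi})  product = {p:6.3f}  {lab}")
print(f"sum of products = {total:.3f}")
```

Here three points vote green, one votes red, and one abstains, so the sum comes out positive.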

[Figure 7.5 plot: z Weights (−1 to 5) vs. z Heights (−1 to 2)]

Figure 7.5: A scatterplot of standardized heights and weights with points colored by how they affect the association: green for positive, red for negative, and blue for neutral.

Finding the correlation coefficient by hand

To find the correlation coefficient by hand, start with the summary statistics for both variables: $\bar{x}$, $\bar{y}$, $s_x$, and $s_y$. Then find the deviations as we did for the standard deviation, but now in both x and y: $(x - \bar{x})$ and $(y - \bar{y})$. For each data pair, multiply these deviations together: $(x - \bar{x})(y - \bar{y})$. Add the products up for all data pairs. Finally, divide the sum by the product $(n - 1) \times s_x \times s_y$ to get the correlation coefficient.

Here we go. Suppose the data pairs are:

x    6   10   14   19   21
y    5    3    7    8   12

Then $\bar{x} = 14$, $\bar{y} = 7$, $s_x = 6.20$, and $s_y = 3.39$.

Deviations in x    Deviations in y    Product
6 − 14 = −8        5 − 7 = −2         −8 × −2 = 16
10 − 14 = −4       3 − 7 = −4         16
14 − 14 = 0        7 − 7 = 0          0
19 − 14 = 5        8 − 7 = 1          5
21 − 14 = 7        12 − 7 = 5         35

Add the products up: 16 + 16 + 0 + 5 + 35 = 72.
Finally, we divide by $(n - 1) \times s_x \times s_y = (5 - 1) \times 6.20 \times 3.39 = 84.07$.
The ratio is the correlation coefficient: $r = 72/84.07 = 0.856$.
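The hand calculation above can be checked with a short Python sketch that follows the same steps on the same five data pairs. The variable names are illustrative choices. Because the code keeps the standard deviations unrounded, it gives about 0.855 rather than the 0.856 obtained from the rounded values 6.20 and 3.39.

```python
from math import sqrt

x = [6, 10, 14, 19, 21]
y = [5, 3, 7, 8, 12]
n = len(x)

# Step 1: summary statistics.
x_bar = sum(x) / n
y_bar = sum(y) / n

# Step 2: deviations from the means in both x and y.
dev_x = [xi - x_bar for xi in x]
dev_y = [yi - y_bar for yi in y]

# Step 3: multiply the deviations pair by pair and add the products up.
products = [dx * dy for dx, dy in zip(dev_x, dev_y)]
total = sum(products)

# Sample standard deviations (divide by n - 1).
s_x = sqrt(sum(dx ** 2 for dx in dev_x) / (n - 1))
s_y = sqrt(sum(dy ** 2 for dy in dev_y) / (n - 1))

# Step 4: divide the sum by (n - 1) * s_x * s_y.
r = total / ((n - 1) * s_x * s_y)

print(f"means: {x_bar}, {y_bar}")
print(f"s_x = {s_x:.2f}, s_y = {s_y:.2f}")
print(f"sum of products = {total}")
print(f"r = {r:.3f}")
```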


But the size of this sum gets bigger the more data we have. To adjust for this, we divide the sum by n − 1.⁵ The ratio is the famous correlation coefficient:

$$r = \frac{\sum z_x z_y}{n - 1}$$

For the students’ heights and weights, the correlation comes out to 0.644. There are a number of alternative formulas for the correlation coefficient, using x and y in their original units. You may find them written elsewhere.⁶ They can be more convenient when you want to compute correlation by hand. But the form given here is best for understanding what it means.

Correlation Conditions

Correlation measures the strength of the linear association between two quantitative variables. Before you use correlation, you must check several conditions:

• Quantitative variables condition. Correlation applies only to quantitative variables. Don’t apply correlation to categorical data masquerading as quantitative. Check that you know the variables’ units and what they measure.

• Straight enough condition. Sure, you can calculate a correlation coefficient for any pair of variables. But correlation measures the strength only of the linear association, and will be misleading if the relationship is not linear.

• Outlier condition. Outliers can distort the correlation dramatically. An outlier can make an otherwise small correlation look big or hide a large correlation. It can even give an otherwise positive association a negative correlation coefficient (and vice versa). When you see an outlier, it’s often a good idea to report the correlations with and without the point.

Each of these conditions is easy to check with a scatterplot. Many correlations are reported without supporting data or plots. You should still think about the conditions. And you should be cautious in interpreting (or accepting others’ interpretations of) the correlation when you can’t check the conditions for yourself.
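The advice in the outlier condition, to report the correlation with and without the suspect point, can be sketched like this. The data are made up for illustration, and `correlation` is a hypothetical helper that implements the z-score formula for r.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (sum of z_x * z_y) / (n - 1), using sample standard deviations."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical data: a loose cloud of six points...
x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 4, 2, 5, 3]

# ...plus one point far from the rest.
outlier = (20, 20)

r_without = correlation(x, y)
r_with = correlation(x + [outlier[0]], y + [outlier[1]])

print(f"r without the outlier: {r_without:.3f}")
print(f"r with the outlier:    {r_with:.3f}")
```

In this made-up case a single distant point turns a modest correlation into one that looks very strong, which is why both numbers are worth reporting.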


Looking at Association

When your blood pressure is measured, it is reported as two values, systolic blood pressure and diastolic blood pressure. How are these variables related to each other? Do they tend to both be high or low? Let’s examine their relationship with a scatterplot.

Variables: Identify two quantitative variables whose relationship we wish to examine. Report the W’s and be sure both variables are recorded for the same individuals.

The variables are systolic and diastolic blood pressure (SBP and DBP) recorded (in millimeters of mercury, or mmHg) for each of 1406 participants in a famous health study in Framingham, MA.

NOTATION ALERT: The letter r is always used for correlation, so you can’t use it for anything else in Statistics. Whenever you see an “r,” it’s safe to assume it’s a correlation.

⁵ Yes, the same n − 1 we saw for the standard deviation. And we offer the same promise to explain it later.

⁶ Like here, for example: $r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\sqrt{s_x^2 s_y^2}}$.


Plan: Check the conditions.

✓ Quantitative variables condition: Both SBP and DBP are quantitative and measured in mmHg.

✓ Straight enough condition: The scatterplot looks straight.

✓ Outlier condition: There are a few straggling points, but none far enough from the body of the data to be called outliers.

We have two quantitative variables that satisfy our conditions, so a correlation would be a suitable measure of association.

Make the scatterplot. Use a computer program or graphing calculator if you can.

[Scatterplot: Systolic BP vs. Diastolic BP]

✔ Reality Check: Looks like a strong positive linear association. We shouldn’t be surprised if the correlation coefficient is positive and fairly large.

Mechanics: We usually calculate correlations with technology. Here we have 1406 cases, so we’d never try it by hand.

The correlation coefficient is 0.792.

Interpretation: Describe the direction, form, and scatter you see in the plot, along with any unusual points or features.

The scatterplot shows a positive direction, with higher SBP going with higher DBP. The plot is generally straight with a moderate amount of scatter. The correlation of 0.792 indicates a strong linear association. A few cases stand out with unusually high SBP compared with their DBP. It seems far less common for the DBP to be high by itself.

Correlation Properties

Here’s a useful list of facts about the correlation coefficient:

• The sign of a correlation coefficient gives the direction of the association.

• Correlation is always between −1 and +1. Correlation can be exactly equal to −1.0 or +1.0, but these values are unusual in real data because they mean that all the data points fall exactly on a single straight line. A correlation near zero corresponds to a weak linear association.

• Correlation treats x and y symmetrically. The correlation of x with y is the same as the correlation of y with x.

• Correlation has no units. This fact can be especially appropriate when the data’s units are somewhat vague to begin with (IQ score, personality index, socialization, and so on). Correlation is sometimes given as a percentage, but we discourage that because it suggests a percentage of something, and correlation, lacking units, has no “something” of which to be a percentage.

• Correlation is not affected by changes in the center or scale of either variable. Changing the units or baseline of either variable has no effect on the correlation coefficient. Correlation depends only on the z-scores, and they are unaffected by changes in center or scale.

• Correlation measures the strength of the linear association between the two variables. Variables can have a strong association but still have a small correlation if the association isn’t linear.

• Correlation is sensitive to outliers. A single outlying value can make a small correlation large or make a large one small.
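Several of the properties in this list can be checked numerically. The sketch below reuses the five data pairs from the hand calculation earlier in the chapter; `correlation` is a hypothetical helper implementing the z-score formula, and the shift and scale constants are arbitrary choices.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (sum of z_x * z_y) / (n - 1), using sample standard deviations."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

x = [6, 10, 14, 19, 21]
y = [5, 3, 7, 8, 12]
r = correlation(x, y)

# Symmetry: swapping the roles of x and y leaves r unchanged.
r_swapped = correlation(y, x)

# Center and scale: a new baseline and new units for x leave r unchanged.
x_rescaled = [2.54 * xi + 10 for xi in x]
r_rescaled = correlation(x_rescaled, y)

# Sign: reversing the direction of one variable flips the sign of r.
r_flipped = correlation([-xi for xi in x], y)

# Extremes: points lying exactly on a rising straight line give r = +1.
r_line = correlation(x, [2 * xi + 1 for xi in x])

print(f"r = {r:.3f}, swapped = {r_swapped:.3f}, rescaled = {r_rescaled:.3f}")
print(f"flipped = {r_flipped:.3f}, exact line = {r_line:.3f}")
```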

Correlation Tables

It is common in some fields to compute the correlations between each pair of variables in a collection of variables and arrange these correlations in a table. The rows and columns of the table name the variables, and the cells hold the correlations.

Correlation tables are compact, and give a lot of summary information at a glance. They can be an efficient way to start to look at a large data set, but a dangerous one. By presenting all of these correlations without any checks for linearity and outliers, the correlation table risks showing truly small correlations that have been inflated by outliers, truly large correlations that are hidden by outliers, and correlations of any size that may be meaningless because the underlying form is not linear.

⁷ A table of scatterplots arranged just like a correlation table is sometimes called a scatterplot matrix, sometimes abbreviated to SPLOM. You might see these terms in a statistics package.

              Assets  Sales  Market Value  Profits  Cash Flow  Employees
Assets        1.000
Sales         0.746   1.000
Market Value  0.682   0.879  1.000
Profits       0.602   0.814  0.968         1.000
Cash Flow     0.641   0.855  0.970         0.989    1.000
Employees     0.594   0.924  0.818         0.762    0.787      1.000

Table 7.1: A correlation table of data reported by Forbes magazine for large companies. From this table, can you be sure that the variables are linearly associated and free from outliers?

The diagonal cells of a correlation table always show correlations of exactly 1 (can you see why?). Correlation tables are commonly offered by statistics packages on computers. These same packages often offer simple ways to make all the scatterplots you need to look at.⁷
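A correlation table is just the correlation of every pair of variables, arranged in a grid. The sketch below builds one for three made-up variables (the numbers are hypothetical, not the Forbes data) and prints the lower triangle in the layout of Table 7.1. The `correlation` helper is an illustrative implementation of the z-score formula.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (sum of z_x * z_y) / (n - 1), using sample standard deviations."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical company figures in arbitrary units, for illustration only.
data = {
    "Assets": [10, 25, 31, 48, 60],
    "Sales": [8, 20, 35, 40, 66],
    "Profits": [1, 4, 3, 7, 9],
}
names = list(data)

# One correlation for every (row, column) pair of variables.
table = {a: {b: correlation(data[a], data[b]) for b in names} for a in names}

# Print the lower triangle; the upper triangle would repeat the same values.
for i, a in enumerate(names):
    cells = "  ".join(f"{table[a][b]:.3f}" for b in names[: i + 1])
    print(f"{a:<8} {cells}")
```

Each diagonal cell is the correlation of a variable with itself, so every one of its z-score products is a square and the sum divided by n − 1 comes out to 1.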

*Straightening Scatterplots

Straight line relationships are the ones that we can measure with correlation. When a scatterplot shows a bent form that consistently increases or decreases, we can often straighten the form of the plot by re-expressing one or both variables.


Some camera lenses have an adjustable aperture, the hole that lets the light through. The size of the aperture is expressed in a mysterious number called the f/stop. Each increase of one f/stop number corresponds to a halving of the light that is allowed to come through. The f/stops of one 35-mm camera are

f/stop: 2.8 4 5.6 8 11 16 22 32

When you increase the f/stop one notch, you cut down the light, so you have to increase the time the shutter is open. We could experiment to find the best shutter speed for each f/stop value. A table of recommended shutter speeds and f/stops for a camera lists the relationship like this:

f/stop:         2.8     4      5.6    8      11    16    22    32
Shutter speed:  1/1000  1/500  1/250  1/125  1/60  1/30  1/15  1/8

The correlation of these f/stops and shutter speeds is 0.979. That sounds pretty high. You might assume that there must be a strong linear relationship. But when we check the scatterplot (we always check the scatterplot) it shows that something is not quite right:

[Figure 7.6 plot: f/stops (7.5 to 30.0) vs. Shutter Speed (sec) (0.025 to 0.100)]

Figure 7.6: A scatterplot of f/stop vs. shutter speed shows a bent relationship.

We can see that the f/stop is not linearly related to the shutter speed. Can we find a transformation of f/stop that straightens out the line? What if we look at the square of the f/stop against the shutter speed?

[Figure 7.7 plot: (f/stop)² (0 to 1200) vs. Shutter Speed (sec) (0.03 to 0.15)]

Figure 7.7: Re-expressing f/stop speed by squaring straightens the plot.


The second plot looks much more nearly straight. In fact, the correlation is now 0.998, but the increase in correlation is not important. (The original value of 0.979 should please almost anyone who sought a large correlation.) What is important is that the form of the plot is now straight, so the correlation is now an appropriate measure of association.⁸

We can often find transformations that straighten out lines. Here, we found the square. Chapter 10 discusses simple ways to find a good re-expression.
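Both correlations can be reproduced from the table of f/stops and shutter speeds. The sketch below uses a hypothetical `correlation` helper implementing the z-score formula and computes r for the raw f/stops and again for their squares.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (sum of z_x * z_y) / (n - 1), using sample standard deviations."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Values from the table of recommended settings.
f_stop = [2.8, 4, 5.6, 8, 11, 16, 22, 32]
shutter_sec = [1/1000, 1/500, 1/250, 1/125, 1/60, 1/30, 1/15, 1/8]

# Correlation in the original scale.
r_raw = correlation(shutter_sec, f_stop)

# Re-express f/stop by squaring, then recompute.
f_stop_squared = [f ** 2 for f in f_stop]
r_squared_scale = correlation(shutter_sec, f_stop_squared)

print(f"f/stop vs. shutter speed:     r = {r_raw:.3f}")
print(f"(f/stop)^2 vs. shutter speed: r = {r_squared_scale:.3f}")
```

The numbers alone would not reveal the bend: both correlations are high, and only the scatterplot shows which form is straight.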

How often have you heard the word “correlation”? Chances are pretty good thatwhen you’ve heard the term, it’s been misused. When people want to sound sci-entific, they often say “correlation” when talking about the relationship betweentwo variables. It’s one of the most widely misused Statistics terms, and given howoften statistics are misused, that’s saying a lot. One of the problems is that manypeople use the specific term correlation when they really mean the more generalterm association. Association is a deliberately vague term describing the relation-ship between two variables.

Don’t fall into the trap of misusing the term correlation yourself. And watch outfor the other common problems discussed in the following paragraphs.

Check the Conditions● Don’t correlate categorical variables. People who misuse “correlation” to

mean “association” often fail to notice whether the variables they discuss arequantitative.

● Be sure the association is linear. A student project evaluating the quality ofbrownies baked at different temperatures reports a correlation of 20.05 be-tween judges’ scores and baking temperature. That seems to say there is norelationship until we look at the scatterplot:

Did you know that there’s a strong correlation between playing an instrument and drinking coffee? No? One reason might be that the statement doesn’t make sense. Correlation is valid only for quantitative variables.

What Can Go Wrong?

8 Sometimes we can do a “reality check” on our choice of re-expression. In this case, a bit of research reveals that f/stops are related to the diameter of the open shutter. Since the amount of light that enters is determined by the area of the open shutter, which is related to the diameter by squaring, the square re-expression seems reasonable. Not all re-expressions have such nice explanations, but it’s a good idea to think about them.

[Scatterplot: Score (0 to 10) vs. Baking temperature (150 to 600 °F)]

The relationship between brownie taste score and baking temperature is strong, but not at all linear. Figure 7.8

There is a strong association, but the relationship is not linear.
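A small Python sketch shows how a strong but curved relationship can produce a correlation near zero. The scores below are hypothetical, not the student project's data: we assume taste peaks at 300 °F and falls off symmetrically on either side. The helper name `pearson_r` is our own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the score is completely determined by temperature,
# but the pattern rises and then falls instead of following a line.
temps = [150, 200, 250, 300, 350, 400, 450]
scores = [10 - ((t - 300) / 50) ** 2 for t in temps]

r = pearson_r(temps, scores)
print(f"r = {r:.3f}")  # essentially zero despite the perfect curved pattern
```

Because the rising half and the falling half cancel each other, the correlation says nothing useful here. Only the scatterplot reveals the association.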


● Beware of outliers. You can’t interpret a correlation coefficient safely without a background check for outliers. Here’s a silly example:

The relationship between IQ and shoe size among comedians shows a surprisingly strong positive correlation of 0.50. To check assumptions, we look at the scatterplot.

[Scatterplot: IQ (100 to 175) vs. Shoe Size (7.5 to 22.5)]

A scatterplot of IQ scores vs. shoe size. From this “study,” what is the relationship between the two? The correlation is 0.50. Who does that point in the upper right-hand corner belong to? Figure 7.9

[Scatterplot: Human Population (55000 to 75000) vs. # of Storks (150 to 250)]

A scatterplot of the number of storks in Oldenburg, Germany, plotted against the population of the town for 7 years in the 1930s. The association is clear. How about the causation? (Ornithologische Monatsberichte, 44, no. 2) Figure 7.10

The outlier is Bozo the Clown, known for his large shoes, and widely acknowledged to be a comic “genius.” Without Bozo the correlation is near zero.

Even a single outlier can dominate the correlation value.
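The effect of one extreme point is easy to reproduce. The following Python sketch uses hypothetical numbers, not the values behind Figure 7.9: eight points with no linear relationship, plus one point far out in the upper right-hand corner. The helper name `pearson_r` is our own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical comedians: shoe size and IQ are unrelated in this group.
shoe_sizes = [8, 9, 10, 11, 8, 9, 10, 11]
iq_scores = [100, 110, 110, 100, 110, 100, 100, 110]
r_without = pearson_r(shoe_sizes, iq_scores)

# Add a single hypothetical outlier with very large shoes and a very high IQ.
r_with = pearson_r(shoe_sizes + [22], iq_scores + [170])

print(f"without the outlier: r = {r_without:.2f}")
print(f"with the outlier:    r = {r_with:.2f}")
```

In this made-up data set the correlation jumps from essentially zero to strongly positive when the one extra point is added, which is why a scatterplot check should come before any interpretation of r.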

Don’t Confuse Correlation with Causation

Once we have a strong correlation, it’s tempting to try to explain it by imagining that the predictor variable has caused the response to change. Humans are like that; we tend to see causes and effects in everything.

Sometimes we can play with this tendency. Here’s a scatterplot that shows the human population (y) of Oldenburg, Germany, in the beginning of the 1930s plotted against the number of storks nesting in the town (x).

Anyone who has seen the beginning of the movie Dumbo remembers Mrs. Jumbo anxiously awaiting the arrival of the stork to bring her new baby. Even though you know it’s silly, you can’t help but think for a minute that this plot shows that storks are the culprits. The two variables are obviously related to each other (the correlation is 0.97!), but that doesn’t prove that storks bring babies.


It turns out that storks nest on house chimneys. More people means more houses, more nesting sites, and so more storks. The causation is actually in the opposite direction, but you can’t tell from the scatterplot or correlation. You need additional information, not just the data, to determine the real mechanism.

Does cancer cause smoking?

Even if the correlation of two variables is due to a causal relationship, the correlation itself cannot tell us what causes what.

Sir Ronald Aylmer Fisher (1890–1962) was one of the greatest statisticians of the 20th century. Fisher testified in court (paid by the tobacco companies) that a causal relationship might underlie the correlation of smoking and cancer:

“Is it possible, then, that lung cancer . . . is one of the causes of smoking cigarettes? I don’t think it can be excluded . . . the pre-cancerous condition is one involving a certain amount of slight chronic inflammation . . .

A slight cause of irritation . . . is commonly accompanied by pulling out a cigarette, and getting a little compensation for life’s minor ills in that way. And . . . is not unlikely to be associated with smoking more frequently.”

Ironically, the proof that smoking indeed is the cause of many cancers came from experiments conducted following the principles of experiment design and analysis that Fisher himself developed, and that we’ll see in Chapter 13.


Scatterplots and Correlation on the Computer

Statistics packages generally make it easy to look at a scatterplot to check whether the correlation is appropriate. Some packages make this easier than others.

Many packages allow you to modify or enhance a scatterplot, altering the axis labels, the axis numbering, the plot symbols, or the colors used. Some options, such as color and symbol choice, can be used to display additional information on the scatterplot.

Scatterplots and correlation coefficients never prove causation. This is, for example, one part of the story of why it took so long for the U.S. Surgeon General to get warning labels on cigarettes. Although there was plenty of evidence that increased smoking was associated with increased levels of lung cancer, it took years to provide evidence that smoking actually causes lung cancer. (The tobacco companies used this to great advantage.)

Watch Out for Lurking Variables

A scatterplot of the damage (in dollars) caused to a house by fire would show a strong correlation with the number of firefighters at the scene. Surely the damage doesn’t cause firefighters. And firefighters do seem to cause damage, spraying water all around and chopping holes. Does that mean we shouldn’t call the fire department? Of course not. There is an underlying variable that leads to both damage and firefighters: the size of the blaze.

A hidden variable that stands behind a relationship and determines it by simultaneously affecting both variables is called a lurking variable. You can often debunk claims made about data by finding the lurking variable behind the scenes.
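A short simulation can make the idea concrete. In this Python sketch every number is invented: we assume the size of the blaze drives both the number of firefighters and the damage, and that neither of those two variables influences the other. The helper name `pearson_r`, the coefficients, and the noise levels are all our own assumptions.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(7)  # fixed seed so the sketch is repeatable

# The lurking variable: size of the blaze on an arbitrary 1-to-10 scale.
blaze_size = [random.uniform(1, 10) for _ in range(200)]

# Each variable depends on blaze size plus its own independent noise.
firefighters = [2 * s + random.gauss(0, 1) for s in blaze_size]
damage = [1000 * s + random.gauss(0, 500) for s in blaze_size]

r = pearson_r(firefighters, damage)
print(f"correlation between firefighters and damage: {r:.2f}")
```

The simulated correlation is strong even though, by construction, firefighters have no effect on damage. The association comes entirely from the shared dependence on blaze size.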


DATA DESK

To make a scatterplot of two variables, select one variable as Y and the other as X and choose Scatterplot from the Plot menu. Then find the correlation by choosing Correlation from the scatterplot’s HyperView menu.

Alternatively, select the two variables and choose Pearson Product-Moment from the Correlations submenu of the Calc menu.

Comments
We prefer that you look at the scatterplot first and then find the correlation. But if you’ve found the correlation first, click on the correlation value to drop down a menu that offers to make the scatterplot.

EXCEL

To make a Scatterplot with the Excel Chart Wizard:
• Click on the Chart Wizard Button in the menu bar. Excel opens the Chart Wizard’s Chart Type Dialog window.
• Make sure the Standard Types tab is selected, and select XY (Scatter) from the choices offered.
• Specify the scatterplot without lines from the choices offered in the Chart sub-type selections. The Next button takes you to the Chart Source Data dialog.
• If it is not already frontmost, click on the Data Range tab, and enter the data range in the space provided.
• By convention, we always represent variables in columns. The Chart Wizard refers to variables as Series. Be sure the Column option is selected.
• Excel places the leftmost column of those you select on the x-axis of the scatterplot. If the column you wish to see on the x-axis is not the leftmost column in your spreadsheet, click on the Series tab and edit the specification of the individual axis series.
• Click the Next button. The Chart Options dialog appears.
• Select the Titles tab. Here you specify the title of the chart and names of the variables displayed on each axis.
• Type the chart title in the Chart title: edit box.
• Type the x-axis variable name in the Value (X) Axis: edit box. Note that you must name the columns correctly here. Naming another variable will not alter the plot, only mislabel it.
• Type the y-axis variable name in the Value (Y) Axis: edit box.
• Click the Next button to open the chart location dialog.
• Select the As new sheet: option button.
• Click the Finish button.

Often, the resulting scatterplot will not be useful. By default, Excel includes the origin in the plot even when the data are far from zero. You can adjust the axis scales.

To change the scale of a plot axis in Excel:
• Double-click on the axis. The Format Axis Dialog appears.
• If the scale tab is not the frontmost, select it.
• Enter new minimum or new maximum values in the spaces provided. You can drag the dialog box over the scatterplot as a straightedge to help you read the maximum and minimum values on the axes.
• Click the OK button to view the rescaled scatterplot.
• Follow the same steps for the x-axis scale.

Compute a correlation in Excel with the CORREL function from the drop-down menu of functions. If CORREL is not on the menu, choose More Functions and find it among the statistical functions in the browser.

In the dialog that pops up, enter the range of cells holding one of the variables in the space provided.

Enter the range of cells for the other variable in the space provided.

JMP

To make a scatterplot and compute correlation, choose Fit Y by X from the Analyze menu.

In the Fit Y by X dialog, drag the Y variable into the “Y, Response” box, and drag the X variable into the “X, Factor” box. Click the OK button.

Once JMP has made the scatterplot, click on the red triangle next to the plot title to reveal a menu of options. Select Density Ellipse and select .95. JMP draws an ellipse around the data and reveals the Correlation tab. Click the blue triangle next to Correlation to reveal a table containing the correlation coefficient.


Connections

Scatterplots are the basic tool for examining the relationship between two quantitative variables. We start with a picture when we want to understand the distribution of a single variable, and we always make a scatterplot to begin to understand the relationship between two quantitative variables.

We used z-scores as a way to measure the statistical distance of data values from their means. Now we’ve seen the z-scores of x and y working together to build the correlation coefficient. Correlation is a summary statistic like the mean and standard deviation, only it summarizes the strength of a linear relationship. And we interpret it as we did z-scores, using the standard deviations as our rulers in both x and y.
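The way z-scores combine into a correlation can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual definition in which the products of paired z-scores are summed and divided by n − 1. The function names and the height and weight values are hypothetical.

```python
import math

def z_scores(values):
    """Standardize values using the mean and the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """Sum the products of paired z-scores and divide by n - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical measurements for six people.
heights_in = [61, 64, 66, 68, 70, 73]
weights_lb = [105, 130, 128, 155, 160, 190]
r = correlation(heights_in, weights_lb)

# Changing units rescales the data but leaves every z-score unchanged,
# so the correlation is unchanged too.
heights_cm = [2.54 * h for h in heights_in]
weights_kg = [w / 2.2 for w in weights_lb]
r_metric = correlation(heights_cm, weights_kg)

print(f"r (inches, pounds):          {r:.4f}")
print(f"r (centimeters, kilograms):  {r_metric:.4f}")
```

Because z-scores have no units, the two results agree. This is one way to see why the correlation has no units and does not change when the center or scale of either variable changes.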

SPSS

To make a scatterplot, choose Interactive from the Graphs menu. From the Interactive Graphs submenu, choose Scatterplot. In the Create Scatterplot dialog, drag variable names from the source list into the targets on the right. Each target describes a specific part or aspect of the plot. For example, drag a variable name to the y-axis target to specify the variable to display on the y-axis.

Similarly, the x-axis target gets the name of the variable to display on the x-axis.

To compute a correlation coefficient, choose Correlate from the Analyze menu. From the Correlate submenu, choose Bivariate. In the Bivariate Correlations dialog, use the arrow button to move variables between the source and target lists.

Make sure the Pearson option is selected in the Correlation Coefficients field.

TI-83

To create a scatterplot, set up the STAT PLOT by choosing the scatterplot icon (the first option). Specify the lists where the data are stored as Xlist and Ylist. Set the graphing WINDOW to the appropriate scale and GRAPH (or take the easy way out and just ZoomStat!).

To find the correlation, go to STAT CALC menu and select 8:LinReg(a+bx). Then specify the lists where the data are stored. The final command you will enter should look like LinReg(a+bx) L1, L2.

Comments
Notice that if you TRACE the scatterplot the calculator will tell you the X and Y value at each point.

If the calculator does not tell you the correlation after you enter a LinReg command, try this. Hit 2nd CATALOG. You now see a list of everything the calculator knows how to do. Scroll down until you find DiagnosticOn. Hit ENTER twice. (It should say Done.) Now and forevermore (or until you change batteries) you can find a correlation using your calculator.

MINITAB

To make a scatterplot, choose Plot from the Graph menu. In the Plot dialog, click on the Y cell for Graph 1 in the Graph variables box to specify the y-variable. Then assign the y-variable from the Variable list box. Click on the X cell for the graph and assign the x-variable from the Variable list box. Click the OK button to view the scatterplot.

To compute a correlation coefficient, choose Basic Statistics from the Stat menu.

From the Basic Statistics submenu, choose Correlation. Click on the Variables box and assign variables from the Variable List box. Click the OK button to compute the correlation table.


Skills

When you complete this lesson you should:

• Recognize when interest in the pattern of possible relationship between two quantitative variables suggests making a scatterplot.

• Know how to identify the roles of the variables and to place the response variable on the y-axis and the explanatory variable on the x-axis.

• Know the conditions for correlation and how to check them.

• Know that correlations are between −1 and +1, and that each extreme indicates a perfect linear association.

• Understand how the magnitude of the correlation reflects the strength of a linear association as viewed in a scatterplot.

• Know that the correlation has no units.

• Know that the correlation coefficient is not changed by changing the center or scale of either variable.

• Understand that causation cannot be demonstrated by a scatterplot or correlation.

• Know how to make a scatterplot by hand (for a small set of data) or with technology.

• Know how to compute the correlation of two variables.

• Know how to read a correlation table produced by a statistics program.

• Be able to describe the direction, form, and scatter of a scatterplot.

• Be prepared to identify and describe points that deviate from the overall pattern.

• Be able to use correlation as part of the description of a scatterplot.

• Be alert to misinterpretations of correlation.

• Understand that finding a correlation between two variables does not indicate a causal relationship between them. Beware the dangers of suggesting causal relationships when describing correlations.

Key Concepts

Scatterplots
A scatterplot shows the relationship between two quantitative variables measured on the same cases.

Looking at scatterplots
• direction
• form
• scatter

Explanatory variable, response variable, x-variable, y-variable
In a scatterplot, you must choose a role for each variable. Assign to the y-axis the variable that you hope to predict or explain. Assign to the x-axis the variable that accounts for, explains, predicts, or is otherwise responsible for the y-variable.

Correlation
Correlation is a numerical measure of the direction and strength of a linear association.


1. Association. Suppose you were to collect data for each pair of variables. You want to make a scatterplot. Which variable would you use as the explanatory variable, and which as the response variable? Why? What would you expect to see in the scatterplot? Discuss the likely direction, form, and scatter.
a) Apples: weight in grams, weight in ounces
b) Apples: circumference (inches), weight (ounces)
c) College freshmen: shoe size, grade point average
d) Gasoline: number of miles you drove since filling up, gallons remaining in your tank

2. Association. Suppose you were to collect data for each pair of variables. You want to make a scatterplot. Which variable would you use as the explanatory variable, and which as the response variable? Why? What would you expect to see in the scatterplot? Discuss the likely direction, form, and scatter.
a) T-shirts at a store: price each, number sold
b) Skin diving: depth, water pressure
c) Skin diving: depth, visibility
d) All elementary school students: weight, score on a reading test

3. Association. Suppose you were to collect data for each pair of variables. You want to make a scatterplot. Which variable would you use as the explanatory variable, and which as the response variable? Why? What would you expect to see in the scatterplot? Discuss the likely direction, form, and scatter.
a) When climbing a mountain: altitude, temperature
b) For each week: Ice cream cone sales, air conditioner sales
c) People: age, grip strength
d) Drivers: blood alcohol level, reaction time

4. Association. Suppose you were to collect data for each pair of variables. You want to make a scatterplot. Which variable would you use as the explanatory variable, and which as the response variable? Why? What would you expect to see in the scatterplot? Discuss the likely direction, form, and scatter.
a) Long-distance calls: time (minutes), cost
b) Lightning strikes: Distance from lightning, time delay of the thunder
c) A streetlight: its apparent brightness, your distance from it
d) Cars: weight of car, age of owner

5. Scatterplots. Which of the scatterplots below show
a) little or no association?
b) a negative association?
c) a linear association?
d) a moderately strong association?
e) a very strong association?

Exercises

(1) (2)

(3) (4)

6. Scatterplots. Which of the scatterplots below show
a) little or no association?
b) a negative association?
c) a linear association?
d) a moderately strong association?
e) a very strong association?

(1) (2)

(3) (4)

7. Performance IQ scores vs. brain size. A study examined brain size (measured as pixels counted in a digitized magnetic resonance image [MRI] of a cross-section of the brain) and IQ (4 Performance scales of the Wechsler IQ test) for college students. The scatterplot shows the Performance IQ scores vs. the brain size.


Comment on the association between brain size and IQ as seen in this scatterplot.

[Scatterplot: Performance IQ (80 to 140) vs. Brain Size (pixels)]

8. Kentucky Derby. The fastest horse in Kentucky Derby history was Secretariat in 1973. The scatterplot shows speed (in miles per hour) of the winning horses each year. What do you see? In most sporting events, performances have improved and continue to improve, so surely we anticipate a positive direction. But what of the form? Has the performance increased at the same rate throughout the last 125 years?

[Scatterplot: Speed (MPH) (32 to 36) vs. Year (1890 to 1980)]

9. Firing pottery. A ceramics factory can fire eight large batches of pottery a day. Sometimes in the process a few of the pieces break. In order to understand the problem better, the factory records the number of broken pieces in each batch for 3 days and then creates the scatterplot shown.
a) Make a histogram showing the distribution of the number of broken pieces in the 24 batches of pottery examined.
b) Describe the distribution as shown in the histogram. What feature of the problem is more apparent in the histogram than in the scatterplot?
c) What aspect of the company’s problem is more apparent in the scatterplot?

[Scatterplot: # of Broken Pieces (0 to 6) vs. Batch Number (1 to 8)]

10. Coffee sales. Owners of a new coffee shop tracked sales for the first 20 days, and displayed the data in a scatterplot (by day):

[Scatterplot: Sales (in $100) (1 to 5) vs. Day (4 to 16)]

a) Make a histogram of the daily sales since the shop has been in business.
b) State one fact that is obvious from the scatterplot, but not from the histogram.
c) State one fact that is obvious from the histogram, but not from the scatterplot.

11. Matching. Here are several scatterplots. The calculated correlations are −0.923, −0.487, 0.006, and 0.777. Which is which?

(a) (b)

(c) (d)


12. Matching. Here are several scatterplots. The calculated correlations are −0.977, −0.021, 0.736, and 0.951. Which is which?

(a) (b)

(c) (d)

13. Lunchtime. Does how long children remain at the lunch table help predict how much they eat? The table gives data on 20 toddlers observed over several months at a nursery school. “Time” is the average number of minutes a child spent at the table when lunch was served. “Calories” is the average number of calories the child consumed during lunch, calculated from careful observation of what the child ate each day.

Calories   Time
472        21.4
498        30.8
465        37.7
456        33.5
423        32.8
437        39.5
508        22.8
431        34.1
479        33.9
454        43.8
450        42.4
410        43.1
504        29.2
437        31.3
489        28.6
436        32.9
480        30.6
439        35.1
444        33.0
408        43.7

[Scatterplot: Calories (425 to 500) vs. Time (min) (25 to 40)]

a) Find the correlation for these data.
b) Suppose we were to record time at the table in hours rather than in minutes. How would the correlation change? Why?
c) Write a sentence or two explaining what this correlation means for these data. Remember to write about food consumption in toddlers rather than about correlation coefficients.
d) One analyst concluded, “It is clear from this correlation that toddlers who spend more time at the table eat less. Evidently something about being at the table causes them to lose their appetites.” Explain why this explanation is not an appropriate conclusion from what we know about the data.

14. Vehicle weights. The Minnesota Department of Transportation hoped that they could measure the weights of big trucks without actually stopping the vehicles by using a newly developed “weigh-in-motion” scale. To see if the new device was accurate, they conducted a calibration test. They weighed several trucks when stopped (static weight), assuming that this weight was correct. Then they weighed them again while the trucks were moving to see how well the new scale could estimate the actual weight. Their data are given in the table.

WEIGHT OF A TRUCK (THOUSANDS OF POUNDS)

Weight-in-Motion   Static Weight
26                 27.9
29.9               29.1
39.5               38
25.1               27
31.6               30.3
36.2               34.5
25.1               27.8
31                 29.6
35.6               33.1
40.2               35.5

a) Make a scatterplot for these data.
b) Describe the direction, form and scatter of the plot.
c) Write a few sentences telling what the plot says about the data. (Note: The sentences should be about weighing trucks, not about scatterplots.)
d) Find the correlation.
e) If the trucks were weighed in kilograms, how would this change the correlation? (1 kilogram = 2.2 pounds)
f) Do any points deviate from the overall pattern? What does the plot say about a possible recalibration of the weigh-in-motion scale?


15. Fuel economy. Here are advertised horsepower ratings and expected gas mileage for several 2001 vehicles.

Vehicle           Horsepower (hp)   Mileage (mpg)
Audi A4           170               22
Buick LeSabre     205               20
Chevy Blazer      190               15
Chevy Prizm       125               31
Ford Excursion    310               10
GMC Yukon         285               13
Honda Civic       127               29
Hyundai Elantra   140               25
Lexus 300         215               21
Lincoln LS        210               23
Mazda MPV         170               18
Olds Alero        140               23
Toyota Camry      194               21
VW Beetle         115               29

a) Make a scatterplot for these data.
b) Describe the direction, form, and scatter of the plot.
c) Find the correlation between horsepower and miles per gallon.
d) Write a few sentences telling what the plot says about fuel economy.

16. Drug abuse. A survey was conducted in the United States and 10 countries of Western Europe to determine the percentage of teenagers who had used marijuana and other drugs. The results are summarized in the table.

              Percent Who Have Used
Country       Marijuana   Other Drugs
Czech Rep.    22          4
Denmark       17          3
England       40          21
Finland       5           1
Ireland       37          16
Italy         19          8
No. Ireland   23          14
Norway        6           3
Portugal      7           3
Scotland      53          31
USA           34          24

a) Create a scatterplot.
b) What is the correlation between the percent of teens who have used marijuana and the percent who have used other drugs?
c) Write a brief description of the association.
d) Do these results confirm that marijuana is a “gateway drug,” that is, that marijuana use leads to the use of other drugs? Explain.

17. Burgers. Fast food is often considered unhealthy because much fast food is high in both fat and sodium. But are the two related? Here are the fat and sodium contents of several brands of burgers. Create a scatterplot and find the correlation between fat content and sodium content. Write a description of the association.

Fat (g)       19    31     34     35    39     39    43
Sodium (mg)   920   1500   1310   860   1180   940   1260

18. Burgers. In the previous exercise you examined association between the amounts of fat and sodium in fast food hamburgers. What about fat and calories? Here are data for the same burgers.

Fat (g)    19    31    34    35    39    39    43
Calories   410   580   590   570   640   680   660

a) Create a scatterplot.
b) Find the correlation.
c) Describe the association.

19. Attendance. American League baseball games are played under the designated hitter rule, meaning that weak-hitting pitchers do not come to bat. Baseball owners believe that the designated hitter rule means more runs scored, which in turn means higher attendance. Is there evidence that more fans attend games if the teams score more runs? Data collected midway through the 2001 season indicate a correlation of 0.74 between runs scored and the number of people at the game.

[Scatterplot: Attendance (22500 to 37500) vs. Runs (360 to 440)]

a) Does the scatterplot indicate that it’s appropriate to calculate a correlation? Explain.
b) Describe the association between attendance and runs scored.
c) Does this prove that the owners are right, that more fans will come to games if the teams score more runs?


20. Second inning. Perhaps fans are just more interested in teams that win? Are the teams that win necessarily those that score the most runs?

[Scatterplot: Attendance (22500 to 37500) vs. Wins (30 to 60)]

Correlation
         Wins    Runs    Attend
Wins     1.000
Runs     0.680   1.000
Attend   0.557   0.740   1.000

a) Do winning teams generally enjoy greater attendance at their home games? Describe the association.
b) Is attendance more strongly associated with winning or scoring runs? Explain.
c) How strongly is scoring more runs associated with winning more games?

21. Politics. A candidate for office claims that “there is a correlation between television watching and crime.” Criticize this statement in statistical terms.

22. Association. A researcher investigating the association between two variables collected some data and was surprised when he calculated the correlation. He had expected to find a fairly strong association, yet the correlation was near 0. Discouraged, he didn’t bother making a scatterplot. Explain to him how the scatterplot could still reveal the strong association he anticipated.

23. Correlation errors. Your economics professor assigns your class to investigate factors associated with the gross domestic product (GDP) of nations. Each student examines a different factor (such as life expectancy, literacy rate, etc.) for a few countries and reports to the class. Apparently some of your classmates do not understand Statistics very well because you know several of their conclusions are incorrect. Explain the mistakes in their statements below.
a) “My correlation of −0.772 shows that there is almost no association between GDP and infant mortality rate.”
b) “There was a correlation of 0.44 between GDP and continent.”
c) “There was a very strong correlation of 1.22 between life expectancy and GDP.”
d) “The correlation between literacy rate and GDP was 0.83. This shows that countries wanting to increase their standard of living should invest heavily in education.”

24. Sample survey. A polling organization is checking its database to see if the two data sources they used sampled the same zip codes. The variable datasource = 1 if the data source is MetroMedia, 2 if the data source is DataQwest, and 3 if it’s RollingPoll. The organization finds that the correlation between five digit zip code and datasource is −0.0229. It concludes that the correlation is low enough to state that there is no dependency between zip code and source of data. Comment.

25. Baldness and heart disease. Medical researchers followed 1435 middle-aged men for a period of 5 years, measuring the amount of baldness present (none = 1, little = 2, some = 3, much = 4, extreme = 5) and presence of heart disease (No = 0, Yes = 1). They found a correlation of 0.089 between the two variables. Comment on their conclusion that this shows that baldness is not a possible cause of heart disease.

26. Oil production. The following table shows the oil production of the United States from 1949 to 2000 (in millions of barrels per year).
a) Find the correlation between year and production.
b) A reporter concludes that a low correlation between year and production shows that oil production has remained steady over the 50-year period. Do you agree with this interpretation? Explain.

Year   Oil         Year   Oil
1949   1,841,940   1967   3,215,742
1950   1,973,574   1968   3,329,042
1951   2,247,711   1969   3,371,751
1952   2,289,836   1970   3,517,450
1953   2,357,082   1971   3,453,914
1954   2,314,988   1972   3,455,368
1955   2,484,428   1973   3,360,903
1956   2,617,283   1974   3,202,585
1957   2,616,901   1975   3,056,779
1958   2,448,987   1976   2,976,180
1959   2,574,590   1977   3,009,265
1960   2,574,933   1978   3,178,216
1961   2,621,758   1979   3,121,310
1962   2,676,189   1980   3,146,365
1963   2,752,723   1981   3,128,624
1964   2,786,822   1982   3,156,715
1965   2,848,514   1983   3,170,999
1966   3,027,763   1984   3,249,696

(continued)


Year   Oil         Year   Oil
1985   3,274,553   1993   2,499,033
1986   3,168,252   1994   2,431,476
1987   3,047,378   1995   2,394,268
1988   2,979,123   1996   2,366,017
1989   2,778,773   1997   2,354,831
1990   2,684,687   1998   2,281,919
1991   2,707,039   1999   2,146,732
1992   2,624,632   2000   2,135,062

*27. Planets. Is there any pattern to the locations of the planets in our solar system? The table shows the average distance of each of the nine planets from the sun.

Planet    Position Number   Distance from Sun (million miles)
Mercury   1                 36
Venus     2                 67
Earth     3                 93
Mars      4                 142
Jupiter   5                 484
Saturn    6                 887
Uranus    7                 1784
Neptune   8                 2796
Pluto     9                 3666

a) Make a scatterplot and describe the association. (Remember: direction, form, scatter!)
b) Why would you not want to talk about the correlation between planet position and distance from the sun?
c) Make a scatterplot showing the logarithm of distance vs. position. What is better about this scatterplot?

*28. Internet journals. The rapid growth of Internet publishing is seen in a number of electronic academic journals made available during the last decade.

Year   Number of Journals
1991   27
1992   36
1993   45
1994   181
1995   306
1996   1093
1997   2459

a) Make a scatterplot and describe the trend.
b) Re-express the data in order to make the association more nearly linear.
