
PART A INTRODUCTION TO STATISTICS: AN OVERVIEW · A statistical investigation requires gathering...

Date post: 09-Jul-2020

PART A

INTRODUCTION TO STATISTICS:

AN OVERVIEW


Chapter 1

Introduction to Statistical Thinking

Table of Contents

Section 1.1 — What is Statistics?
    Introduction to the Uses of Statistics
    Introduction to the Nature of Statistics

Section 1.2 — A Student Survey
    A Survey
    Talking About the Data
    Some Questions About the Survey

Section 1.3 — Units and Variables
    Units and Variables
    Verbal Templates for Naming Variables
    Yes-No Variables
    Missing, Ambiguous, and Incorrect Values
    Possible Values of a Variable and Types of Variables

Section 1.4 — Categorical Variables
    Categorical Variables
    Creating Categories
    Dichotomous Categorical Variables
    Ordered vs. Unordered Categorical Variables

Section 1.5 — Numerical Variables
    Continuous vs. Discrete Numerical Variables
    Almost Continuous Numerical Variables
    Grouped Numerical Variables
    Intrinsic Type of a Variable

Section 1.6 — Summarizing Categorical Variables
    The Frequency of a Characteristic
    Talking about Percentages
    Comparing Subgroups With Respect to a Characteristic
    Association Between Two Categorical Variables

Section 1.7 — Summarizing Numerical Variables
    The “Average”
    The Mean and the Median
    Talking About Means and Medians
    Association Between One Categorical and One Numerical Variable

Section 1.8 — Populations, Samples, Parameters, and Statistics
    Populations and Samples
    Describing a Population Clearly
    Parameters and Statistics
    Frequencies and Percentages
    Means and Medians
    Estimating An Unknown Parameter with a Known Statistic
    Talking About Parameters and Statistics

Section 1.9 — Well-Defined Variables
    Well-Defined Variables
    When the Unit Itself is a Group
    Describing a Group Using a Count
    Describing a Group Using a Percentage
    Describing a Group Using an Average

Section 1.10 — Template for the Basic Concepts

Chapter 1 Appendix
    Glossary
    List of Symbols
    List of Formulas

Section 1.1 — What is Statistics?

Introduction to the Uses of Statistics

Statistics is about information, expressed in numbers, symbols, words, and pictures (graphs, for example), that describes things we want to know about our world.

The term “statistics” comes from the Latin word for “state.” It originally referred to the study of political facts and figures; literally, “state-istics.” The statistics about the American people obtained in the U.S. census conducted every ten years illustrate this meaning of the word.

Statisticians today, however, deal with a wide-ranging variety of issues. The following are some examples:

Sports: How great is the “home-field advantage” in baseball?
Health and medicine: Will a low-cholesterol diet reduce your chances of heart disease?
Politics and public policy: Was the 1970 national military draft lottery conducted fairly?
Education: Is a pretest an effective predictor of performance in a course?
Business and industry: Does an employer practice racial or sexual discrimination in hiring or salary policies?
Entertainment: How many television viewers watch the Super Bowl?
Science and technology: What are the chances of a major earthquake in the San Francisco Bay Area in the next 30 years?

Introduction to the Nature of Statistics

Not only is there a wide variety in the uses of statistics; the nature of the subject itself can be described in a variety of ways.

Statisticians can infer valuable conclusions by studying many pieces of data, when any one data value by itself would yield little information.

The health history of one individual is not very useful. From the health history of thousands of smokers and non-smokers, we can estimate the health risks of smoking.

Statisticians explore the world using numbers.

By counting the number of people who developed skin cancer in different parts of the country, statisticians found that areas which naturally receive more ultraviolet radiation from the sun (because of their geography or climate) had higher skin cancer rates. Partly based on this evidence, scientists concluded that as the hole in the protective ozone layer around the earth grew, the resulting increase in ultraviolet radiation would mean that more people will get skin cancer.

Statisticians investigate patterns, associations, and relationships, as well as their causes.

Is there an association between the number of hours of sleep a student gets and his or her grade point average? Is there a relationship between cigarette smoking and lung cancer? Does cigarette smoking cause lung cancer?

Statisticians practice the art and science of making decisions in the presence of uncertainty.

At election time, the Gallup poll takes samples of about a thousand people to try to predict the outcome.

Section 1.2 — A Student Survey

A Survey

A statistical investigation requires gathering data: the information the statistician uses to assemble a picture. Each datum is one piece of information, one “piece of the puzzle.” [Note: “Data” is plural; “datum” is singular.] Later in this section, you will find a survey given to the 50 students in an introductory statistics course at the start of the Fall 1997 semester, together with their responses. We will begin our statistical exploration by describing this survey and its results.

Talking About the Data

Putting the information into words is a skill crucial to your ability to understand data. This is done below for student #1. Each phrase in italics describes the information requested; in each case, the student's response is underlined.

Responses of Student Number 1

     (1)     (2)   (3)        (4a)  (4b)   (4c)    (5)   (6)         (7)   (8)       (9)       (10)
No.  Gender  Race  Height     Age   Age    21 or   Hand  Expected    GPA   Hours of  Math      Major
                   in inches        group  older?        difficulty        sleep     feelings
1    F       B/AA  65         18    16-19  no      R     4           2.50  7.0       C         SS

(1) The gender of student #1 is female.
(2) Her race/ethnicity is Black/African-American.
(3) 65 is her height in inches to the nearest inch.
(4a) Her age in years at her last birthday is 18.
(4b) She is in the 16-19 age group.
(4c) She is not 21 or older.
(5) Her writing hand is her right hand.
(6) The expected difficulty of the course for her is 4 (between “moderate” and “hard”).
(7) Her college GPA to the nearest hundredth is 2.50.
(8) The number of hours of sleep she gets each night is 7.0.
(9) Her feelings about mathematics are C (“neutral”).
(10) Her major is social science.

Some Questions About the Survey

The responses to the survey provide us with information about the students in the class. Using that information, we can answer questions that concern a single question in the survey, such as the following:

• Are there more men than women in the class? [Question 1: gender]
• What percentage of the class have a GPA of 3.00 or higher? [Question 7: GPA]
• How old is the oldest student in the class? [Question 4a: age]

However, we can also ask questions that concern two or more survey questions, such as:

• Do older students tend to have higher GPAs than younger students? [Questions 4a, 4b, or 4c and Question 7: age vs. GPA]
• Do men tend to have more positive feelings about mathematics than women? [Question 1 and Question 9: gender vs. feelings about math]
• Do students majoring in the social sciences or arts and humanities tend to expect the course to be more difficult than students in the other majors? [Question 10 and Question 6: major vs. expected difficulty]

Issues regarding two or more survey questions require more involved statistical methods than those which concern only a single survey question. But they also tend to be more interesting and intriguing than issues dealing with just one question because they ask us to compare different groups (older vs. younger students; males vs. females; students in one group of majors vs. students in another group of majors) in order to see whether they differ, and if so, how they differ.


Survey of Statistics Students

1. Gender:   [ ] Female   [ ] Male

2. Race/Ethnicity:
   [ ] Asian
   [ ] Black/African-American
   [ ] Hispanic/Latino/Chicano
   [ ] White/Caucasian
   [ ] Other: Please identify: ____________________

3. Height to the nearest inch: _______

4a. Age in years at your last birthday: _______

4b. Age group (mark the appropriate box):   [ ] 16-19   [ ] 20-24   [ ] 25-29   [ ] 30-39   [ ] 40-69

4c. Are you 21 years old or older?   [ ] Yes   [ ] No

5. Which hand do you write with?   [ ] Left   [ ] Right

6. How easy or hard do you expect this course to be?
   [ ] 1 (Easy)   [ ] 2   [ ] 3 (Moderate)   [ ] 4   [ ] 5 (Hard)

7. What is your college GPA (grade point average) to the nearest hundredth? ________

8. On average, to the nearest half hour, how many hours of sleep do you get per night? _______

9. Your feelings about mathematics:
   [ ] A (awful/bored/scared)   [ ] B   [ ] C (neutral)   [ ] D   [ ] E (excited/happy)

10. Check the box showing your major (from the following list):
   [ ] AH   [ ] BF   [ ] CESM   [ ] HE   [ ] SS   [ ] Other: _______

AH (Arts and Humanities): art, art history, communication, dance, English, foreign language, journalism, liberal studies, library studies, linguistics, literature, music, speech

BF (Business and Finance): accounting, business, business administration, economics, marketing

CESM (Computers, Engineering, Science, and Mathematics): aeronautics, biology, chemistry, computer science, engineering, geology, mathematics, physics

HE (Health and the Environment): ecology, environmental science, forestry, geography, natural resources, nursing, nutrition, pharmacy, physical therapy, pre-med, public health, veterinary medicine

SS (Social Science): Black studies, counseling, developmental studies, education, history, law, political science, psychology, public policy, religion, social welfare, sociology, women’s studies


The raw data: The responses of the 50 students

No.  Gender  Race/      Height     Age  Age    Age  Hand  Expected     GPA      Hours of  Math      Major
             Ethnicity  in inches  (a)  (b)    (c)        difficulty            sleep     feelings
1    F       B/AA       65         18   16-19  no   R     4            2.50     7.0       C         SS
2    F       Filipino   65         19   16-19  no   R     4            2.89     6.5       C         SS
3    F       A          56         19   16-19  no   R     5            2.00     8.5       A         AH
4    F       A          60         19   16-19  no   R     3            3.00     7.0       C         HE
5    F       A          63         20   20-24  no   R     3            2.87     6 - 7     C         HE
6    F       A          63         20   20-24  no   R     4            3.80     7.0       A         AH
7    F       H          65         20   20-24  no   R     3            3.00     8.0       B         AH
8    F       H          60         20   20-24  no   R     4            3.00     5.0       B         SS
9    F       H          61         20   20-24  no   L     4            3.57     5.0       D         AH
10   F       A          64         21   20-24  yes  R     4            3.02     6.0       B         SS
11   F       B/AA       54         21   20-24  yes  R     5            3.60     7.0       B         AH
12   F       H          64         22   20-24  yes  R     5            2.70     7.5       C         SS
13   F       A          64         22   20-24  yes  R     4            low      7.0       A         BF
14   F       B/AA       66         22   20-24  yes  R     5            2.30     8.0       B         AH
15   F       W          65         22   20-24  yes  R     4            3.00     8.5       B         SS
16   F       W          69         23   20-24  yes  R     betw. 4 & 5  3.90     7.5       D         HE
17   F       A          64         23   20-24  yes  R     5            3.25     5.5       B         AH/SS
18   F       B/AA       61.5       23   20-24  yes  R     4            3.05     5 or 6    B         AH
19   F       B/AA       66         23   20-24  yes  R     4            3.00     7.5       C         SS
20   F       H          64         24   20-24  yes  R     4            3.69     8.0       C         AH
21   F       B/AA       64         24   20-24  yes  R     3            3.5 – 4  6 - 8     C         HE
22   F       H          67         27   25-29  yes  R     4            2.10     7.0       C         SS
23   F       A          63         27   25-29  yes  R     2            3.85     7.5       E         BF
24   F       A          66         27   25-29  yes  R     3            3.05     5.0       C         SS
25   F       H          68.5       28   25-29  yes  R     4            2.50     7.5       C         SS
26   F       B/AA       64         33   30-39  yes  R     4            3.37     10.0      B         BF
27   F       B/AA       64.5       36   30-39  yes  L     3            3.00     8.0       A         BF
28   F       B/AA       57         38   30-39  yes  L     3            3.25     9.0       B         SS
29   F       B/AA       65         38   30-39  yes  R     5            2.50     6.0       B         BF
30   F       H          62         40   40-69  yes  R     5            3.25     5.0       A         SS
31   F       B/AA       72         42   40-69  yes  R     3            3.40     5.5       D         BF
32   F       B/AA       61         52   40-69  yes  R     4            2.97     7 – 8     C         AH
33   M       A          62         18   16-19  no   R     3            2.00     8.0       C         AH
34   M       W          69         19   16-19  no   R     2            3.33     8.0       D         HE
35   M       Maltease   68         19   16-19  no   R     4            2.80     7.0       E         fire science
36   M       A          67         19   16-19  no   R     3            3.00     8.5       E         CESM
37   M       B/AA       74         20   20-24  no   L     5            3.40     7.0       B         SS
38   M       H          71         20   20-24  no   R     3            3.86     7.0       D         SS
39   M       B/AA       68         20   20-24  no   L     3            2.83     5.0       C         SS
40   M       W          69         20   20-24  no   R     5            3.60     8.0       C         undecided
41   M       H          64         20   20-24  no   R     3            2.60     7.0       D         SS
42   M       A          67         21   20-24  yes  R     3            3.20     6.0       D         BF
43   M       A          65         21   20-24  yes  R     3            3+       blank     E         electronics
44   M       W          74         22   20-24  yes  R     4            2.60     8.0       A         SS
45   M       Mixed      69         25   25-29  yes  R     5            2.40     8.0       B         SS
46   M       B/AA       67         27   25-29  yes  R     5            3.53     4.0       C         SS
47   M       African    72         31   30-39  yes  R     3            3.00     5.0       C         CESM
48   M       B/AA       71         31   30-39  yes  R     4            3.00     7.0       B         BF
49   M       W          74         33   30-39  no   L     5            2.83     6.5       D         HE
50   M       W          70         41   40-69  yes  R     5            3.20     7.0       A         SS
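As a rough illustration of how two of the single-variable questions from this section (men vs. women, and the oldest student) could be answered from the raw data, here is a minimal Python sketch. The lists are transcribed from the Gender and Age (a) columns of the responses; the names `genders`, `ages`, `gender_counts`, and `oldest_age` are our own choices and not part of the survey.

```python
from collections import Counter

# Gender column: students 1-32 answered F, students 33-50 answered M.
genders = ["F"] * 32 + ["M"] * 18

# Age (a) column, in student order 1-50.
ages = [
    18, 19, 19, 19, 20, 20, 20, 20, 20, 21,
    21, 22, 22, 22, 22, 23, 23, 23, 23, 24,
    24, 27, 27, 27, 28, 33, 36, 38, 38, 40,
    42, 52, 18, 19, 19, 19, 20, 20, 20, 20,
    20, 21, 21, 22, 25, 27, 31, 31, 33, 41,
]

# Question 1: are there more men than women in the class?
gender_counts = Counter(genders)
more_men_than_women = gender_counts["M"] > gender_counts["F"]

# Question 4a: how old is the oldest student in the class?
oldest_age = max(ages)

print(gender_counts, more_men_than_women, oldest_age)
```

The GPA question is left out on purpose: several GPA responses are not plain numbers, so they need a decision about how to treat them before any percentage can be computed.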

Section 1.3 — Units and Variables

Units and Variables

Every field has its fundamental entities: arithmetic has numbers and their properties; language has words and their meanings; chemistry has atoms and their bonds. The field of statistics has its own fundamental entities: units and variables.

Definition 1: A unit (or case) is an individual person or thing about whom (or about which) information is desired. When the unit is a person, he or she is sometimes referred to as a subject.

In our survey, a unit (subject) is a student. The survey was returned by 50 subjects; we have information about 50 units.

[Figure: A unit or subject: one individual person or thing]

[Figure: A unit or subject and his/her value of the variable GPA (3.18)]

Definition 2: A variable is general information about units, expressed in words as a noun or a noun phrase.

Definition 3: The value of a variable is the specific information about one particular unit.

Examples:
• Variable: gender                               Value: male
• Variable: number of siblings                   Value: 2
• Variable: city of residence                    Value: Oakland
• Variable: whether or not registered to vote    Value: no
• Variable: annual income in dollars             Value: 43,575

Listed below are the variables in the survey and the values of the variables for student #1. Notice that the variables were in italics and the values were underlined when we looked at her responses in Section 1.2.

The variables in the survey and their values for student #1

Variable                                        Value
(1) gender                                      female
(2) race/ethnicity                              Black/African-American
(3) height in inches (to the nearest inch)      65
(4a) age at last birthday                       18
(4b) age group                                  16-19
(4c) whether or not the student is 21 or older  no
(5) writing hand                                right
(6) expected difficulty of the course           4
(7) college GPA (to the nearest hundredth)      2.50
(8) number of hours of sleep per night          7.0
    (to the nearest half hour)
(9) feelings about mathematics                  C
(10) major                                      social science
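One way to picture the relationship between a unit, its variables, and their values is as a simple lookup table. The sketch below stores student #1 as a Python dictionary; the shortened key names are our own labels for the survey variables, not wording from the survey itself.

```python
# One unit (student #1), stored as a mapping from variable name to value.
student_1 = {
    "gender": "female",
    "race/ethnicity": "Black/African-American",
    "height in inches": 65,
    "age at last birthday": 18,
    "age group": "16-19",
    "21 or older": "no",
    "writing hand": "right",
    "expected difficulty": 4,
    "college GPA": 2.50,
    "hours of sleep per night": 7.0,
    "feelings about mathematics": "C",
    "major": "social science",
}

# The keys are the variables: general information asked of every unit.
variables = list(student_1.keys())

# Looking up a key gives the value: specific information about this one unit.
value_of_gpa = student_1["college GPA"]

print(len(variables), value_of_gpa)
```

Every other student would have the same keys (the same variables) but, in general, different values.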

A question asked of a unit somehow includes the variable. The answer somehow includes the value of the variable. For example:

Question: How old are you? (Variable: age in years)
Answer: I'm 23 years old. (Value of the variable: 23)

Question: Are you employed? (Variable: whether or not employed)
Answer: Yes. (Value of the variable: yes)

Question: Whom do you work for? (Variable: employer)
Answer: I work for Safeway. (Value of the variable: Safeway)

A unit does not have to be a person. For example:

Unit: a car
Question 1: How many cylinders does the car have?
Variable 1: Number of cylinders
Question 2: Is it an American made car?
Variable 2: Whether or not it is American made

Unit: a house
Question 1: What was the house’s appraised value in December 2008?
Variable 1: Appraised value in December 2008
Question 2: How many bedrooms does the house have?
Variable 2: Number of bedrooms

Verbal Templates for Naming Variables

To express a variable in words, use a noun (for example, “gender”) or a noun phrase (for example, “city of residence”). As a general rule, you can tell if you have correctly stated a variable in words by seeing if your proposed name for the variable can complete a sentence that has one of the following forms (or a form similar to one of these):

“We want to know the unit’s ____________________” or
“We want to know the _________________ of/by the unit” or
“We want to know the _________________ that the unit has.”

These are some examples of variables stated correctly:

• “We want to know the student’s age in years.” The variable is: age in years.
• “We want to know the student’s feelings about mathematics.” The variable is: feelings about mathematics.
• We want to know how many hours of television the person watched in the past week. This can be rephrased using the above pattern: “We want to know the number of hours of television watched in the past week by the person.” The variable is: the number of hours of television watched in the past week.
• We want to know the number of times the person has visited a doctor in the past year. This can be rephrased using the above pattern: “We want to know the number of doctor visits in the past year that the person had.” The variable is: number of doctor visits in the past year.

These are examples of some common errors:

• We want to know the person’s city of residence.
  Proposed variable: What city do you live in?
  --> This is incorrect: A variable can not be a question.
  Proposed variable: The city in which the person lives
  --> This is better, but still not the best: This phrase won’t fit properly in the above templates. It is not correct to say: “We want to know the person’s city in which the person lives.”
  Proposed variable: City of residence.
  --> This is correct. It is correct to say: “We want to know the person’s city of residence.”

Yes-No Variables

If a variable deals with a “yes” or “no” situation, a different verbal template is used. The phrase “whether or not” is very handy for stating the variable, and the verbal template to assist you has the following form:

“We want to know whether or not the unit ____________________.”

Some examples:

• We want to know whether or not the student is 21 or older. The variable is: whether or not the student is 21 or older.
• We want to know whether or not the car has front and side air bags. The variable is: whether or not the car has front and side air bags.
• We want to know whether or not the house has more than two bathrooms. The variable is: whether or not the house has more than two bathrooms.

Missing, Ambiguous, and Incorrect Values

Definition 4: A missing value is a value of a variable which is not given for a particular unit.

In the survey responses in Section 1.2, for example, student #43 did not give his number of hours of sleep. When data are analyzed, the fact that the value is missing must be taken into account.

Some data values are problematic even though they are not missing. Student #43 gave his GPA as “3+.” This is ambiguous: if we want to use this number in our computations, do we consider it to be 3.1, or 3.5, or what? Is his GPA between 3 and 3.5? We don’t know.
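A minimal Python sketch of how such problematic GPA responses might be flagged before any computation is shown below. The helper `parse_gpa` is hypothetical (our own name), and the rule it applies, treating anything that is not a plain number as needing review, is an assumption rather than the only reasonable choice. The sample responses are copied from the GPA column for students 11, 12, 13, 21, and 43.

```python
def parse_gpa(raw):
    """Return the GPA as a float, or None when the response is not a plain number."""
    try:
        return float(raw)
    except ValueError:
        return None


# GPA responses exactly as written on the survey, keyed by student number.
responses = {11: "3.60", 12: "2.70", 13: "low", 21: "3.5 – 4", 43: "3+"}

parsed = {student: parse_gpa(raw) for student, raw in responses.items()}

# Students whose GPA cannot be used in arithmetic without a further decision.
needs_review = sorted(student for student, gpa in parsed.items() if gpa is None)

print(parsed)
print(needs_review)
```

Flagging the values does not resolve them; it only makes explicit which responses someone still has to decide how to handle.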

Three students (#18, 25, and 27) did not give their heights to the nearest inch; the values they gave are 61.5, 68.5, and 64.5. If we want to include these numbers in the computations involving the heights of the other students (all of which were given to the nearest inch), should we leave them as given or should we round them off as the other students did? If we round them, should we round all of them up to the next highest inch (62, 69, 65), or maybe round some up and some down by rounding instead to the nearest even number of inches (62, 68, 64)? The decision we make might affect conclusions we ultimately reach. For example, if we always round up, it might tend to make the class, as a group, seem to be taller than it actually is.
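The two rounding rules can be compared directly. The sketch below uses Python's standard `decimal` module, whose `ROUND_HALF_UP` and `ROUND_HALF_EVEN` modes correspond to "always round a half up" and "round a half to the nearest even number"; the helper name `round_with` is our own.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Heights reported by students 18, 25 and 27, kept as text to avoid float surprises.
heights = ["61.5", "68.5", "64.5"]


def round_with(value, mode):
    """Round a decimal string to a whole number of inches using the given mode."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=mode))


always_up = [round_with(h, ROUND_HALF_UP) for h in heights]
to_even = [round_with(h, ROUND_HALF_EVEN) for h in heights]

print(always_up, sum(always_up))  # every half inch is pushed upward
print(to_even, sum(to_even))      # some halves go up, some go down
```

For these three heights the totals differ by two inches, which is the kind of systematic upward shift the text warns about when halves are always rounded up.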

Student #17’s major is “AH/SS” (i.e., “Arts and Humanities/Social Science”). Consider the question: “How many students in the class are social science majors?” Do we count student #17? Should student #17 count as one-half of a social science major? It is unclear how to handle this case.

Student #43 gave his major as “electronics.” Should we classify this student into the category “CESM” (Computers, Engineering, Science, and Mathematics) or in the “Other” category? How about student #35, who listed his major as “fire science?” Should we count him in the category HE (Health and the Environment), which includes such majors as environmental science and forestry? Or is he an “Other”? We will need to resolve these questions before we can deal with the data, and different people might plausibly make different decisions.

Finally, student #49 gave his age as 33 and his age group as 30-39, but replied “No” to the question “Are you 21 years old or older?” Presumably, this last answer is incorrect; it is probably appropriate for us to change the “No” to a “Yes.”
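An inconsistency like this one can be found mechanically, because questions 4b and 4c can both be derived from the age given in question 4a. The following Python sketch checks three rows copied from the raw data (students 48, 49, and 50); the function names `age_group` and `is_21_or_older` are our own, and the check is only one possible way to screen the responses.

```python
def age_group(age):
    """Map an age in years to the survey's age groups (question 4b)."""
    for low, high in [(16, 19), (20, 24), (25, 29), (30, 39), (40, 69)]:
        if low <= age <= high:
            return f"{low}-{high}"
    return None  # the age falls outside every group offered on the survey


def is_21_or_older(age):
    """Derive the answer to question 4c from the age in question 4a."""
    return "yes" if age >= 21 else "no"


# (student number, age, reported age group, reported answer to 4c)
rows = [
    (48, 31, "30-39", "yes"),
    (49, 33, "30-39", "no"),
    (50, 41, "40-69", "yes"),
]

# Students whose reported 4b or 4c answer disagrees with their reported age.
inconsistent = [
    number
    for number, age, group, older in rows
    if age_group(age) != group or is_21_or_older(age) != older
]

print(inconsistent)
```

The check only reports that the answers disagree; deciding which answer to correct, as the text does for student #49, remains a judgment call.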

Possible Values of a Variable and Types of Variables

Definition 5: The possible values of a variable are all the values the variable could have.

Definition 6: The type of variable is a description of the possible values of a variable.

For some variables, the possible values are numbers and for others, they are descriptions of categories. For each of the 12 variables in the survey in Section 1.2, the following table gives the possible values of the variable and the type of variable. Sections 1.4 and 1.5 will explain the types in detail.

Variable                                        Possible Values of the Variable       Type of Variable
(1) gender                                      female, male                          dichotomous categorical
(2) race/ethnicity                              Asian, Black, Hispanic, white, other  unordered categorical
(3) height in inches (to the nearest inch)      55, 56, 57, 58, 59, …, 76, 77, 78     discrete numerical
(4a) age at last birthday                       16, 17, 18, 19, …, 55, 56, 57, …      discrete numerical
(4b) age group                                  16-19, 20-24, 25-29, 30-39, 40-69     grouped numerical
(4c) whether or not the student is 21 or older  yes, no                               dichotomous categorical
(5) writing hand                                left, right                           dichotomous categorical
(6) expected difficulty of the course           1, 2, 3, 4, 5                         probably ordered categorical,
                                                                                      but perhaps discrete numerical
(7) college GPA (to the nearest hundredth)      2.00, 2.01, …, 3.99, 4.00             almost continuous numerical
(8) number of hours of sleep per night          4, 4.5, 5, …, 9, 9.5, 10              discrete numerical
    (to the nearest half hour)
(9) feelings about mathematics                  A, B, C, D, E                         ordered categorical
(10) major                                      AH, BF, CESM, HE, SS, other           unordered categorical

Section 1.4 — Categorical Variables

Categorical Variables

In our survey in Section 1.2, the response to the variable race/ethnicity places a student in one of five groups or categories. Therefore race/ethnicity is a categorical variable.

Definition 7: A categorical variable is one in which each possible value is a group or category.

The categorical variables in our survey are gender, race/ethnicity, whether or not the student is 21 or older, writing hand, expected difficulty of the course, feelings about mathematics, and major.

[Figure: An illustration of a categorical variable. Each “bag” represents one category.]

Creating Categories

Sometimes we have control over the creation of the categories and sometimes we don’t. How the categories are created can affect how we understand and analyze the data.

In the case of the variable gender, it is commonly agreed that there are only two possible values: male and female.* By contrast, the five categories for the variable major (Arts and Humanities, Business and Finance, Health and the Environment, and the two others) were created by the author because they seemed convenient to him, but it is easy to see that someone else might have created different categories, and even more or fewer categories.

We don’t have much choice over the categories for the variable writing hand; it’s either “left” or “right” (although there are rare exceptions for whom it is “either” or “none”). But the variable feelings about mathematics could have had only three categories instead of five as possible values: “positive,” “negative,” and “neutral.” Or, it could have had more than the five categories given in the survey, say a range of 10 possible choices from “extremely positive” to “extremely negative.”

Should the variable expected difficulty of the course be considered to be categorical, despite the fact that its categories are the numbers 1, 2, 3, 4, and 5? These numbers simply name the categories, just as the categories A, B, C, D and E are the values for feelings about mathematics. Is it appropriate to perform arithmetic on these numbers? For example, student #16 checked a box between 4 and 5 for expected difficulty; should we call this reply 4.5? If we can perform the operations of arithmetic on the values, for example taking their average, then perhaps they should not be considered to be merely categories, but true numbers instead. You will learn about numerical variables (variables whose values are numbers) in Section 1.5. But for now keep in mind the following caution: A variable whose values are numbers can sometimes be considered to be a categorical variable.

Dichotomous Categorical Variables

If we ignore the exceptions, then the variable gender and the

variable writing hand has only two possible values each. Definition 8: A dichotomous (or binary) variable is a variable

with two possible values. [”Dichotomous” comes from Greek words meaning “cut in two”; “binary” derives from the Latin for “two.”]

An illustration of a dichotomous variable. There

are exactly two categories. Yes-no variables (typically phrased using “whether or not”) are common examples of dichotomous variables.

Consider the question: “Are you currently taking classes at a four-year college?” The yes-no variable is whether or not the person is currently taking classes at a four-year college.

The variable (4c) in the survey, whether or not the student is 21 or older, is a dichotomous yes-no variable created from the variable age, which itself is not dichotomous. Dichotomous variables can easily be created from variables that are not dichotomous by using the “whether or not” phraseology. The variables whether or not the student has positive feelings about mathematics and whether or not the student is a social science major are two such examples.

Ordered vs. Unordered Categorical Variables

The values of the survey variable feelings about mathematics can be arranged in order from A (very negative) to E (very positive). It makes sense to ask whether one student’s feelings about mathematics are more positive than another’s. Therefore this categorical variable is said to be ordered.

* However, this simple dichotomy ignores people who, either through genetics or choice, do not fit conveniently into either category. Some people say that gender can’t be represented by just two categories.


Definition 9: An ordered categorical variable is one in which the categories can be arranged in a natural order from low to high.

An illustration of an ordered categorical variable. The categories can be arranged in order from low to high.

Definition 10: An unordered categorical variable is one whose values cannot be arranged from low to high.

The values of the variables major and ethnicity cannot be arranged in order. It makes no sense, for example, to ask whether a student majoring in social science has a “higher” major than a student majoring in health and the environment.

If the values 1, 2, 3, 4, and 5 of the variable expected difficulty of the course are considered to be categories, then this variable is certainly an ordered categorical variable; the categories are ordered from one extreme to the other.

Section 1.5 — Numerical Variables

Continuous vs. Discrete Numerical Variables

The values of variables such as number of hours of sleep, height in inches, and college GPA are numbers which result from counting, measuring, or computing. Such variables are called numerical variables.

Definition 11: A numerical variable is one whose values are numbers on a number line on which arithmetical calculations can be performed.

In the case of a numerical variable, it makes sense to carry out certain arithmetic computations using the data. For example, average heights or average GPAs can be computed by adding and dividing: If the GPAs of two students are 3.1 and 3.3, then their average GPA is 3.2.

Furthermore, for a numerical variable, pairs of values with the same difference are equally spaced. Consider the variable height to the nearest inch. Two people 61 and 63 inches tall have the same height difference as two people who are 72 and 74 inches tall; also, someone who is 65 inches tall is exactly halfway between two people who are 64 and 66 inches tall.

Notice that for the categorical variable expected difficulty, whose values are 1, 2, 3, 4, and 5, it is not clear that the remarks above apply. Is category 2 the average of categories 1 and 3? Is category 4 exactly halfway between categories 3 and 5? Do categories 1 and 2 differ by the same amount that categories 2 and 3 differ? If so, then perhaps expected difficulty could be considered to be numerical. If not, then it should be treated as an ordered categorical variable.

The values of a numerical variable can sometimes be negative. Consider the numerical variable change in weight. For an individual whose weight dropped by three pounds, the value of the variable change in weight is –3.

The values of a numerical variable typically have a natural multiplicative relationship, but this isn’t true for every numerical variable. The issue is this: Does multiplying a value by 2 or 3, for example, double or triple what we are measuring? “Yes” when a police officer measures a driver’s blood alcohol: .08 is twice as high as .04. Also, “yes” when we count a motorist’s speeding tickets: 6 tickets is three times as many as 2 tickets. But “no” when we look at the day’s temperature: it’s not appropriate to say that 70 degrees is twice as hot as 35 degrees or that 90 degrees is three times as hot as 30 degrees. Temperatures can be compared by subtracting (for example, we can say that 60 degrees is 20 degrees hotter than 40 degrees) but not by dividing (we can’t say that 60 degrees is one-and-a-half times as hot as 40 degrees).
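One way to see why temperature ratios are not meaningful is to change the unit of measurement. The sketch below assumes the temperatures in the text are in degrees Fahrenheit (the text does not say) and converts them to Celsius; the function name is ours, chosen for illustration. The ratio changes with the unit, while equal differences stay equal.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9

# A count has a natural zero, so its ratios mean something:
# 6 tickets really is three times as many as 2 tickets.
ticket_ratio = 6 / 2

# The "twice as hot" ratio depends entirely on the unit chosen.
ratio_in_fahrenheit = 70 / 35
ratio_in_celsius = fahrenheit_to_celsius(70) / fahrenheit_to_celsius(35)

# Differences behave consistently: two 20-degree Fahrenheit gaps
# are still equal to each other after conversion to Celsius.
gap_low = fahrenheit_to_celsius(60) - fahrenheit_to_celsius(40)
gap_high = fahrenheit_to_celsius(90) - fahrenheit_to_celsius(70)

print("ratio in Fahrenheit:", ratio_in_fahrenheit)
print("ratio in Celsius:", round(ratio_in_celsius, 2))
print("gaps in Celsius:", round(gap_low, 2), round(gap_high, 2))
```

The same pair of temperatures gives a ratio of 2 in one unit and a much larger ratio in the other, so "twice as hot" is not a property of the temperatures themselves.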


Two basic types of numerical variables will interest us: continuous and discrete.

Definition 12: A continuous numerical variable is a numerical variable whose possible values include every decimal number on the number line within some range of values.

Definition 13: A discrete numerical variable is a numerical variable whose possible values are separated on the number line by intervals containing no possible values; that is, the possible values come in distinct steps.

To understand the difference between a continuous and a discrete variable, think of a speedometer on a car. The indicator needle on some older cars can point (at least theoretically) to any point on a scale (from 0 to 120 miles per hour, for example) and therefore displays speed in miles per hour as a continuous variable. Any decimal number in that range, for example 57.384, could be the speed displayed, and therefore the value of the variable, at some moment.

On some newer cars, however, there is a digital display giving speed in miles per hour only as a whole number: 50 or 51 or 52, etc. It displays speed as a discrete variable; 50.3 is not a possible value of this variable, nor is any other decimal number between 50 and 51. Similarly, on an “old-fashioned” watch with hands, time is a continuous variable: the watch (if read precisely enough) could show a time of 3 minutes and 51.389 seconds after 4 o’clock, for example. A digital watch displays time discretely to the nearest minute or second. For a watch displaying time to the nearest second, 4:03:51 and 4:03:52 are possible values, but not any value in between.

[Figure: CONTINUOUS DISPLAYS (a speedometer with a needle, scaled in MPH, and a clock with hands) and DISCRETE DISPLAYS (a digital speedometer reading 51 MPH and a digital clock reading 8:19:53).]

On a number line, a continuous variable can take on any value in some interval, but the possible values of a discrete variable are separated by spaces. A continuous variable moves smoothly over its possible values but a discrete variable jumps from value to value.

Entire interval of values (highlighted): possible values of a continuous variable

Separated values (highlighted): possible values of a discrete variable

On recommendation forms, teachers are sometimes asked to assess a student in regard to preparedness for college by placing an “x” at the appropriate place on a scale like the one shown. The variable preparedness for college, when measured in this fashion, is continuous.

[Figure: A CONTINUOUS SCALE running from 0 (“completely unprepared”) to 10 (“outstanding preparation”), marked with an “x.” The “x” could conceivably be placed, for example, at the number 3.928615.]

By contrast, consider a scale such as the following in which the teacher is asked to rank the student’s preparedness by circling a whole number from 0 to 10: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10. In this case, the variable preparedness for college is discrete.

Similarly, the variable number of hours of sleep per night to the nearest half hour can only have the values suggested by the following diagram (5, 5.5, 6, etc.) so it is discrete.


Almost Continuous Numerical Variables

In practice, there can be limits to the accuracy of a measurement or a computation. Height might be rounded to the nearest tenth of an inch, GPA to the nearest hundredth of a grade point, and age to the day. Variables measured or computed in that fashion are not perfectly continuous because there would be no permitted values of height between 67.2 and 67.3 inches, of GPA between 3.76 and 3.77, or of age between 23 years, 7 months, 4 days and 23 years, 7 months, 5 days.

Technically speaking, such rounded variables are discrete. However, the distance between successive values is so small (0.1 inch, 0.01 grade point, or 1 day) that it is reasonable to call these variables almost continuous.

Definition 14: A numerical variable is almost continuous if the difference between successive values is negligible.

Possible values of an almost continuous variable (highlighted)

The variable annual income to the nearest dollar is almost continuous because the $1 difference between successive values (for example, $45,723 and $45,724) is negligible. By contrast, the variable hourly wage to the nearest dollar is not almost continuous, because the $1 difference between a $5 and a $6 wage, for example, is definitely not negligible.
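Whether a step is “negligible” is a judgment call, and the text gives no formula for it. One informal way to compare the two cases, offered here only as an assumption of ours, is to express the step as a fraction of a typical value of the variable. The helper name `relative_step` is hypothetical.

```python
def relative_step(step, typical_value):
    """Size of one step as a fraction of a typical value of the variable."""
    return step / typical_value

# Annual income to the nearest dollar, near $45,723.
income_step = relative_step(1, 45_723)

# Hourly wage to the nearest dollar, near $5.
wage_step = relative_step(1, 5)

print(f"annual income: one step is {income_step:.4%} of the value")
print(f"hourly wage:   one step is {wage_step:.0%} of the value")
```

A $1 step is a tiny fraction of an annual income but a fifth of a $5 wage, which matches the text's verdict on which variable is almost continuous.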

Grouped Numerical Variables

Definition 15: A grouped numerical variable is a numerical variable each of whose values has been grouped into a range of numbers.

One of our survey’s three age variables (see survey question 4b on page 6) places the student in one of the following age groups: 16-19, 20-24, 25-29, 30-39, or 40-69. Each group represents a range of ages; this variable is a grouped numerical variable.
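As a minimal sketch of how grouping works, the code below places an age into one of the five survey groups listed above. The function name and the sample ages are hypothetical; they are not taken from the survey data.

```python
# The five age groups from survey question 4b, as (low, high) pairs.
AGE_GROUPS = [(16, 19), (20, 24), (25, 29), (30, 39), (40, 69)]

def age_group(age):
    """Return the label of the group that contains the given age."""
    for low, high in AGE_GROUPS:
        if low <= age <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age} falls outside every group")

# Hypothetical ages, used only to show the grouping.
sample_ages = [18, 22, 27, 35, 52]
print([age_group(age) for age in sample_ages])
```

Grouping discards detail: once an age is recorded as “20-24,” the exact age can no longer be recovered.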

Notice that such a variable is closely related to an ordered categorical variable, because we could take the viewpoint that the student is simply placed in one of five ordered age categories. However, the fact that each “category” is a range of numbers on a number line makes this situation different. You will learn in Chapter 5 (Univariate Numerical Data) how to work with the special properties of such variables.

Intrinsic Type of a Variable

Definition 16: The intrinsic type of a variable is the type the variable would be if its values were measured or computed as precisely as possible without any rounding or grouping.

When the values of a variable are rounded or grouped, they become less precise, shifting the variable along a scale of precision from continuous numerical, to almost continuous numerical, to discrete numerical, to categorical, and finally to dichotomous. The following examples illustrate the progression.

The variable gender has only two possible values (ignoring any exceptions), so gender is intrinsically dichotomous.


Let the unit be a car, and consider the variable country of manufacture. When the possible values are grouped into the categories “U.S.” or “foreign,” country of manufacture becomes dichotomous. However, the values can be made more specific: U.S., Japan, Germany, Sweden, etc., and when they are, the variable is unordered categorical. Thus, country of manufacture is intrinsically unordered categorical because that is the most precise way to measure it.

A student’s test grade on a 100-point test can be recorded as “pass” or “fail.” When it is recorded in that way, test grade is dichotomous. However, the same variable test grade can be measured more precisely as “A, B, C, D, or F,” in which case the variable is ordered categorical. If the values are given with even more detail as “90-100, 80-89, 70-79, 60-69, 0-59,” then the variable test grade is grouped numerical. Finally, recording the variable test grade as precisely as possible, we would use the actual scores: “100, 99, 98, 97, …, 2, 1, 0.” In this case, test grade is discrete numerical. Assuming test grade cannot be measured any more precisely than this, it is intrinsically discrete numerical. If half-points are given, then the test grade could be any of these values: “100, 99.5, 99, 98.5, ..., 1.5, 1, 0.5, 0,” and test grade would then be intrinsically almost continuous.
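The progression can be sketched by recording one score at each level of precision. Two details below are assumptions of ours, not statements from the text: that the letters A through F line up with the ranges 90-100 through 0-59, and that 60 is the passing score. The function name is hypothetical.

```python
# (low, high, letter): the ranges come from the text; pairing each range
# with a letter grade is an assumption made for this illustration.
RANGES = [(90, 100, "A"), (80, 89, "B"), (70, 79, "C"), (60, 69, "D"), (0, 59, "F")]

def record_test_grade(score, passing_score=60):
    """Record a whole-number score at four levels of precision.

    passing_score=60 is an assumed cutoff, not one given in the text.
    """
    for low, high, letter in RANGES:
        if low <= score <= high:
            return {
                "dichotomous": "pass" if score >= passing_score else "fail",
                "ordered categorical": letter,
                "grouped numerical": f"{low}-{high}",
                "discrete numerical": score,
            }
    raise ValueError("score must be a whole number from 0 to 100")

print(record_test_grade(87))
```

Each step up the list keeps less information: the discrete score determines the range, the range determines the letter, and the letter determines pass or fail, but never the other way around.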

Annual income can be classified simply as “above poverty level” or “at or below poverty level,” in which case it is dichotomous. Measured with increasing degrees of precision, annual income can be ordered categorical (with values “low,” “middle,” and “high”), grouped numerical (values: $0-$9,999, $10,000-$19,999, $20,000-$29,999, etc.), or discrete numerical (rounded values: $20,000, $21,000, $22,000, etc.). Measured as precisely as possible (with values in one-cent increments: $20,000.00, $20,000.01, $20,000.02, etc.), annual income is almost continuous. Thus, the variable annual income is intrinsically almost continuous.

Human body temperature in degrees Fahrenheit can be classified as dichotomous (“normal,” “not normal”), or more precisely as ordered categorical (“low,” “normal,” “high”), discrete numerical (values: 98.5, 98.6, 98.7, etc.), or almost continuous (values: 98.60, 98.61, 98.62, 98.63, etc.). However, human body temperature is intrinsically continuous because temperature theoretically can take on any value within a range of values on a number line.

Section 1.6 — Summarizing Categorical Variables

The Frequency of a Characteristic

A categorical variable places each unit into a category. Valuable information can be obtained by counting the number of units in each category. It is convenient to think of the categories as characteristics of the units. For example, in our student survey, some of the characteristics of interest are: “being female,” “writing with the left hand,” and “expecting the course to be hard.”

Definition 17: The frequency of a characteristic is the number of units that have the characteristic.

The following are examples of statements about frequencies in our student survey (see page 7):

• The number of students in the class who are female is 32.
• The number of students in the class who write with their left hand is 5.
• The number of students in the class who expect the course to be hard (response “5”) is 13.

Notice that the words in boldface describe the group; the words in italics give the characteristic.

Talking about Percentages

Percentages are one of the most important ways of describing data and will be used often. Consider the following example:


There are 50 students in the class; 32 of them are female.

Because 32/50 = 64%, we can make the following percentage statements. In each case, the verbal “percent pattern” is given.

Percent Pattern #1: 64% of the students in the class are female.

Percent Pattern #2: 64% is the percentage of the students in the class who are female.

Notice that in the statements above:

(1) What follows the word “of” is a description of the group (in boldface). The size of this group is the denominator.

(2) Following the word “who” or “which” in the second statement is the characteristic (in italics). The number of individuals having the characteristic (in other words, the frequency of the characteristic) is the numerator.

Thus in the above example, the numerator is 32 (the number of students who are female) and the denominator is 50 (the number of students in the class): 32/50 = 64%.
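A short sketch of the same computation: count the frequency of each category, then divide by the size of the group. The list of responses is rebuilt from the counts reported in the text (32 females among 50 students); the variable and function names are ours.

```python
from collections import Counter

# One entry per student, rebuilt from the reported counts.
genders = ["female"] * 32 + ["male"] * 18

frequencies = Counter(genders)   # frequency of each characteristic
group_size = len(genders)        # the denominator

def percentage(frequency, size):
    """Frequency of a characteristic as a percentage of the group."""
    return 100 * frequency / size

for category, frequency in frequencies.items():
    share = percentage(frequency, group_size)
    print(f"{share:.0f}% of the students in the class are {category}.")
```

The printed sentences follow Percent Pattern #1: the group comes after the word “of,” and the characteristic closes the sentence.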

Comparing Subgroups With Respect to a Characteristic

In addition to specifying characteristics, categories play another role: they split the group into subgroups. The variable gender, for example, divides the class into two subgroups: “males” (a subgroup of 18 students) and “females” (a subgroup of 32 students). The variable major creates several subgroups: those majoring in Arts and Humanities (AH), those majoring in Business and Finance (BF), those majoring in Health and the Environment (HE), etc.

Computing percentages of the entire group can be interesting (for example, we notice that 64% of the students in the class are female but only 36% of them are male). However, it can be even more interesting to compare two or more subgroups with respect to the same characteristic.

Example 6.1 Comparing the subgroups males vs. females with respect to the characteristic “has positive feelings about mathematics”

Consider the two subgroups “male students” and “female students” (formed from the variable gender) and the characteristic “has positive feelings about mathematics (responses ‘D’ or ‘E’)” formed from the variable feelings about mathematics. How do males and females compare with regard to this characteristic? The answer is contained in two percentage statements.

First we must determine the frequency of this characteristic within each subgroup. Using the raw data on page 7, we can count that 4 females and 8 males gave the response “D or E: excited/happy.” Thus:


12.5% (4/32) of the females in the class have positive feelings about mathematics.

By contrast, 44.4% (8/18) of the males in the class have positive feelings about mathematics.

Both of these statements conform to Percent Pattern #1 on page 17, except that they now apply to a subgroup rather than to the entire group.

Notice that the two percentages differ: 12.5% is much less than 44.4%. We conclude that the males and females answering the survey differ with regard to having positive feelings about mathematics. A greater percentage of males than females have positive feelings about mathematics.

This conclusion, of course, only applies to this one group of 50 students. We are not in a position to make any generalization about college students in general, or even about any statistics students other than those in the survey.

Example 6.2 Comparing the subgroups of those 21 or older vs. those under 21 with respect to the characteristic “has a GPA of 3.00 or higher”

The variable whether or not the student is 21 or older splits the class into two subgroups: those who are 21 or older (there are 32 students in this subgroup) and those who are under 21 (the remaining 18 students).

The variable GPA, which is an almost continuous variable, can be made dichotomous by considering the characteristic “has a GPA of 3.00 or higher.”

Of the 32 students who are 21 or older, 22 have a GPA of 3.00 or higher. Of those 18 students who are under 21, 10 have a GPA of 3.00 or higher. We thus get these percentage statements:

68.8% (22/32) of the students 21 or older have a GPA of 3.00 or higher.

By contrast, 55.6% (10/18) of the students under 21 have a GPA of 3.00 or higher.

Notice once again that the two percentages differ: 68.8% is not the same as 55.6%. The students 21 or older differ from the students under 21 with regard to having a GPA of 3.00 or higher. A greater percentage of the older students than the younger ones have that high a GPA.

Association Between Two Categorical Variables

In each of the examples above, we examined the relationship between two categorical variables, one of which (gender in Example 6.1; age* in Example 6.2) was used to create subgroups, and the other of which (feelings about mathematics in Example 6.1; GPA** in Example 6.2) was used to create a characteristic. The observations we made concern what is called the association between the two variables.

* To be technically correct, we should say that the variable actually was whether or not the student was 21 or older.

** Recall that although GPA is a numerical variable, we considered it as the dichotomous categorical variable: whether or not the student has a GPA of 3.00 or higher.


Definition 18: There is an association between two categorical variables (or we could say: “two categorical variables are associated”) if the subgroups (formed from the first variable) differ in the percentage which have the characteristic (derived from the second variable). If the percentages are the same, there is no association between the variables (or we could say: “the variables are not associated”).

A brief way to say this is: “Variables are associated if subgroups differ with respect to a characteristic.”

Every question about association of categorical variables can also be stated in terms of subgroups differing. Thus, in Example 6.1 (page 17) the question:

“Are gender and feelings about mathematics associated?”

is the same question as:

“Do male students and female students differ with respect to feelings about mathematics?”

In the first question (about association) the two variables are stated (“gender” and “feelings about mathematics”). By contrast, the second question (about whether things differ) states subgroups from the first variable (“male students” and “female students”) but does not state the first variable (“gender”); it also states the second variable (“feelings about mathematics”).

Similarly, the question in Example 6.2 (page 18):

“Is there an association between age and GPA?”

is the same question as:

“Do older and younger students differ with respect to GPA?”

Thus, we can say that the variables gender and feelings about mathematics are associated because the percentages in Example 6.1 (12.5% and 44.4%) are different (in fact, very different), which means that the subgroups “male students” and “female students” who took the survey differ a lot with respect to feelings about mathematics.

The variables in Example 6.2, age and GPA, are also associated because the percentages 68.8% and 55.6% differ, although the difference is not nearly as great as in the prior example. That is, students 21 or older differ from those under 21 in this regard (having GPA of 3.00 or higher), but only somewhat.

When the variables are associated (in other words, when the subgroups differ), we can speak of the direction and the strength of the association.

Definition 19: The direction of the association between two categorical variables is the information on which subgroup has more or less of the characteristic.

Definition 20: The strength of the association between two categorical variables is the information on the size of the difference between the subgroups. Words like “weak,” “moderate,” “strong,” and “very strong” are typically used to describe the strength of the association.

In Example 6.1, to describe the direction of the association, we would say:

“A higher percentage of males than females in the class have positive feelings about mathematics.”

As for the strength of the association, we would say:

“The association is very strong, because the percentages 12.5% and 44.4% differ greatly.”

As for Example 6.2, the direction and strength of the association can be described in this way:


The percentage of those 21 and older who have a GPA of 3.00 or higher is greater than the percentage of those under 21 who have that high a GPA.

However, the association is weak; the percentages (68.8% and 55.6%) do not differ by very much.

Stating the strength of the association requires a judgment call. Association varies from none (identical percentages), to weak, moderate, strong, very strong, and finally perfect association, the extreme case in which, for example, all men have positive feelings and all women negative feelings. Knowledgeable people can reasonably disagree in their assessments of the strength of an association. What is “moderate” to one person might be “fairly strong” to another. However, it would be very unusual if an association which looks “very strong” to one observer looks “very weak” to another.

This is a guide to understanding the concept of association between categorical variables:

TO ANSWER THIS QUESTION: Are the categorical variables associated?
ANSWER THIS QUESTION: Do the subgroups (formed from one variable) differ with respect to the percentage having a characteristic (formed from the other variable)?

TO ANSWER THIS QUESTION: What is the direction of the association?
ANSWER THIS QUESTION: For which subgroup is the percentage greater?

TO ANSWER THIS QUESTION: What is the strength of the association?
ANSWER THIS QUESTION: How big is the difference between the percentages?
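The three questions in the guide can be sketched as one small function applied to the counts from Examples 6.1 and 6.2. The function name and the layout of its result are hypothetical. The code reports the size of the difference in percentage points and leaves the label (“weak,” “strong,” and so on) to the reader, since the text treats that as a judgment call.

```python
from fractions import Fraction

def describe_association(subgroups):
    """Compare subgroups given as name -> (frequency, subgroup size).

    Exact fractions are used so that "identical percentages" is tested
    without rounding error.
    """
    shares = {name: Fraction(freq, size) for name, (freq, size) in subgroups.items()}
    higher = max(shares, key=shares.get)
    lower = min(shares, key=shares.get)
    gap = shares[higher] - shares[lower]
    return {
        "percentages": {name: round(float(100 * s), 1) for name, s in shares.items()},
        "associated": gap > 0,                   # do the subgroups differ?
        "higher": higher if gap > 0 else None,   # direction
        "difference": round(float(100 * gap), 1),  # basis for judging strength
    }

example_6_1 = describe_association({"females": (4, 32), "males": (8, 18)})
example_6_2 = describe_association({"21 or older": (22, 32), "under 21": (10, 18)})

print(example_6_1)
print(example_6_2)
```

The difference is far larger in Example 6.1 than in Example 6.2, which is the basis for calling the first association very strong and the second weak.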

Example 6.3 The association between age and GPA

We have already considered this association in the discussion of Example 6.2, but let’s take another look. We focused on the characteristic “has a GPA of 3.00 or higher,” and we found that the older students (21 or older) did somewhat better than the younger students.

However, why draw the line at 3.00? After all, GPA is an almost continuous numerical variable, and we can divide the data anywhere we choose. What if we look instead at the characteristic “has a GPA of 3.50 or higher”?

Of the 18 younger students, 4 have a GPA of 3.50 or higher; the percentage is 4/18 = 22.2%. Of the 32 older students, one of them gave his GPA as “3+”. Because we don’t know if this is above or below 3.50, we have no choice but to ignore this student for now. Of the remaining 31 older students, 6 have GPAs of 3.50 or above; the percentage is 6/31 = 19.4%.

To summarize:

22.2% of the students under 21 have a GPA of 3.50 or above.

19.4% of the students aged 21 or older have a GPA of 3.50 or above.

Conclusion:

Are the variables associated? (In other words, do younger and older students differ with respect to GPA?) Yes; the variables age and GPA are associated because the percentages for the two subgroups (younger students; older students) differ.

What is the direction of the association? A greater percentage of the younger students than the older students have a high GPA (3.50 or above).

What is the strength of the association? The association is very weak; the percentages are almost equal.


Were you surprised? The direction of the association in this example is exactly the opposite of what it was when we considered this situation in Example 6.2 (page 18) and at the top of page 20! Now, the younger students are doing better. Previously we had decided that the older students were doing better!

This type of anomaly (an anomaly is something that is abnormal or peculiar or that deviates from what is expected) can potentially occur whenever we are free to “draw the line” to turn a numerical variable into a dichotomous one, and you should be aware of this. Given a simplified, dichotomous description of a variable (such as GPA) which is not intrinsically dichotomous, you should wonder how things might look if the dividing line were drawn somewhere else: it is possible that the story would change.
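The reversal can be checked directly from the counts reported in Examples 6.2 and 6.3. The sketch below uses only those counts; the dictionary layout and names are ours. The student whose GPA was given as “3+” is left out of the 3.50 comparison, as in the text.

```python
# cutoff -> subgroup -> (number at or above the cutoff, subgroup size)
counts = {
    "3.00": {"under 21": (10, 18), "21 or older": (22, 32)},
    "3.50": {"under 21": (4, 18), "21 or older": (6, 31)},  # "3+" response omitted
}

direction = {}
for cutoff, subgroups in counts.items():
    percents = {name: 100 * freq / size for name, (freq, size) in subgroups.items()}
    direction[cutoff] = max(percents, key=percents.get)
    summary = ", ".join(f"{name}: {pct:.1f}%" for name, pct in percents.items())
    print(f"GPA of {cutoff} or higher -> {summary}; higher: {direction[cutoff]}")
```

The subgroup with the greater percentage switches when the dividing line moves from 3.00 to 3.50, which is the anomaly described above.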

Because GPA is in fact intrinsically almost continuous numerical, perhaps it is a good idea not to make it dichotomous at all (by forming characteristics like “3.00 or higher” or “3.50 or higher”), and instead to leave it in its numerical form, working directly with the numbers. The next section deals with this issue: how to handle numerical variables.

Section 1.7 — Summarizing Numerical Variables

The “Average”

You have seen that categorical data is often summarized using percentages. By contrast, numerical data is often summarized with an “average.” We commonly encounter such statements as the following:

Statement (1): Bay Area drivers get an average of 1.3 parking tickets per year.

Statement (2): The average new home in the Bay Area has a purchase price of $350,000.

In Statement (1), the term “average” refers to a concept called the mean, but “average” in Statement (2) could very well refer to a different concept: the median.

The Mean and the Median

The mean is what most people think of as the average.

Definition 21: The mean of a collection of data values equals the sum of the data values divided by the number of data values.

Example 7.1 Test grades

If your test grades are 80, 93, 76, 88, and 91, then your mean test grade is 85.6 because

$$\frac{80 + 93 + 76 + 88 + 91}{5} = \frac{428}{5} = 85.6$$
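The same calculation, written as a minimal sketch that follows Definition 21 directly (sum of the data values divided by the number of data values):

```python
grades = [80, 93, 76, 88, 91]

total = sum(grades)          # sum of the data values
count = len(grades)          # number of data values
mean_grade = total / count

print(total, count, mean_grade)
```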

The word “average” can also refer to the median of the numerical data values.

Definition 22: The median of a collection of numerical data values is the middle number when the data values are written in order from smallest to largest. If there are two middle numbers (which occurs when there is an even number of data values), then the median is the value which is exactly halfway between the two middle numbers. To find this halfway value, add the two middle numbers and divide by 2.

Example 7.2 Test grades

If your test grades are 80, 93, 76, 88, and 91, then your median test grade is 88, because 88 is in the middle of the list of ordered scores.


If your test grades are 80, 93, 76, 88, 91, and 72, then your median test grade is 84, because 84 is halfway between the two middle values (80 and 88) in the list of ordered scores.
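Definition 22 translates into a few lines of code: sort the values, then take the middle one, or the value halfway between the two middle ones. This is a sketch for illustration; the function name is ours, and Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Middle value of the sorted data (Definition 22)."""
    if not values:
        raise ValueError("median of an empty collection is undefined")
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: a single middle number.
        return ordered[middle]
    # Even number of values: halfway between the two middle numbers.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([80, 93, 76, 88, 91]))
print(median([80, 93, 76, 88, 91, 72]))
```

The two calls reproduce Example 7.2: the first list has a single middle value, and the second needs the halfway value between its two middle numbers.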

In Chapter 5 (Univariate Numerical Data), you will learn the advantages and disadvantages of using each of these (the mean and the median) to give an average. For now, it is enough for you to know how to compute them and that the word “average” is ambiguous because it could refer to either one.

Talking About Means and Medians

The two statements at the beginning of this section can be rewritten using this useful verbal pattern:

The ________ [insert the word “MEAN” or “MEDIAN”] ________ [state the VARIABLE] of / by / for (etc.) ________ [state the GROUP] is ________ [state the VALUE].

Assume that Statement (1) at the beginning of this section (“Bay Area drivers get an average of 1.3 parking tickets per year”) refers to the mean. Also, assume that Statement (2) (“The average new home in the Bay Area has a purchase price of $350,000”) refers to the median. (You will learn in Section 5.4, page 166, why the “average” in Statement (2) probably is not a mean. But for now, do you see why the “average” in Statement (1) cannot possibly be a median?)

When rewritten using the above pattern, the two statements become:

Statement (1): The mean [MEAN] number of parking tickets received each year [VARIABLE] by Bay Area drivers [GROUP] is 1.3 [VALUE].

Statement (2): The median [MEDIAN] purchase price [VARIABLE] of new homes in the Bay Area [GROUP] is $350,000 [VALUE].

Notice that these two sentences are constructed according to the pattern above. Following the word “mean” or “median” in each sentence is the variable (“number of parking tickets received each year” or “purchase price”). Then comes a description of the units in the group (“Bay Area drivers” or “new homes in the Bay Area”). Each sentence ends with the value (“1.3” or “$350,000”).

Association Between One Categorical and One Numerical Variable

We now consider the question of whether or not there is an association between a categorical and a numerical variable. You will see that the concept is quite similar to the concept of association between two categorical variables discussed in Section 1.6 (page 18), and that the question here, as there, can be restated the same way: Do the subgroups differ with respect to a characteristic?

Example 7.3 The association between gender and height

Among the 50 students who answered our student survey, is there an association between gender and height?


This question concerns two variables: gender (categorical) and height (numerical). As you learned in Section 1.6 (page 17), the variable gender divides the group of students into two subgroups: males and females. The above question about association can thus be restated like this:

Among the 50 students who answered the survey, do the males and females differ with respect to height?

The survey responses give the heights of the 32 females and 18 males making up the two subgroups. One way to compare the two subgroups is to compute the mean height of the males and the mean height of the females.

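The mean height of the females is:

$$\frac{\text{sum of the 32 female heights}}{32} = \frac{2033}{32} \approx 63.5 \text{ inches}$$

The mean height of the males is:

$$\frac{\text{sum of the 18 male heights}}{18} = \frac{1241}{18} \approx 68.9 \text{ inches}$$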

Notice that the mean height of the males is greater (in fact, much greater) than the mean height of the females.

Using the following definition, we conclude that there is an association between gender and height—in other words, gender and height are associated.

Definition 23: There is an association between a categorical variable and a numerical variable (or we could say: “a categorical variable and a numerical variable are associated”) if the subgroups (formed from the categorical variable) differ in regard to some numerical measure (based on the numerical variable). If the measures are the same, then the subgroups do not differ, and there is no association between the variables (or we could say: “the variables are not associated”).

In brief: “Variables are associated if the subgroups differ.” Note that in this example, the measure being used to compare the subgroups is “mean height.” As in Section 1.6 (page 19), we can also speak of the direction and strength of the association.

Definition 24: To describe the direction of the association between a categorical and a numerical variable, state for which subgroup the measure being used is greater or less.

Definition 25: To describe the strength of the association between a categorical and a numerical variable, describe the size of the difference between the measures in the two subgroups. Words like “weak,” “moderate,” “strong,” and “very strong” are typically used.

In this example, the direction of the association is that the males, on average, are taller than the females. The strength of the association can be described as strong; the difference in mean heights (68.9 − 63.5 = 5.4 inches) is quite large.

Note that in saying that males tend to be taller than the females, we are not saying that every one of the men is taller than every one of the women. (In fact, the tallest woman is 72 inches; the shortest man is 62 inches.) What we are saying is that, when we look at the two groups (actually, subgroups) in their entirety, the collection of males is taller than the collection of females.

Example 7.4 The association between gender and height (see Example 7.3, page 22)

Another way to examine this association is to use the median, not the mean, and to compare the median heights of the males and females in the class. From the ordered lists of the heights, we find that the median height of the 32 women is 64 inches:

54 56 57 60 60 61 61 62 62 63 63 63 64 64 64 [64] [64] 64 64 64 65 65 65 65 65 66 66 66 67 68 69 72

and the median height of the 18 men is 69 inches:

62 64 65 67 67 67 68 68 [69] [69] 69 70 71 71 72 74 74 74

(The two middle numbers in each subgroup are shown in square brackets.)


Using the median, our conclusions are the same as when the mean was used:

(1) Is there an association between gender and height for the students in the class? Do males and females differ with respect to height? Yes, there is an association between gender and height for the students in the class; males and females differ with respect to height.

(2) What is the direction of the association? The males in the class tend to be taller than the females.

(3) What is the strength of the association? The association between gender and height is quite strong.
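The comparisons in Examples 7.3 and 7.4 can be checked with a short computation. The sketch below uses Python's standard `statistics` module on the two ordered lists of heights given above; the variable names are chosen only for this illustration.

```python
from statistics import mean, median

# Heights in inches, copied from the ordered lists in Example 7.4.
female_heights = [54, 56, 57, 60, 60, 61, 61, 62, 62, 63, 63, 63,
                  64, 64, 64, 64, 64, 64, 64, 64, 65, 65, 65, 65, 65,
                  66, 66, 66, 67, 68, 69, 72]
male_heights = [62, 64, 65, 67, 67, 67, 68, 68, 69, 69, 69,
                70, 71, 71, 72, 74, 74, 74]

# One measure per subgroup, rounded to one decimal place as in the text.
female_mean = round(mean(female_heights), 1)
male_mean = round(mean(male_heights), 1)
female_median = median(female_heights)
male_median = median(male_heights)

# The subgroups differ on both measures, so gender and height are associated.
mean_difference = round(male_mean - female_mean, 1)
median_difference = male_median - female_median

print("means:", female_mean, male_mean, "difference:", mean_difference)
print("medians:", female_median, male_median, "difference:", median_difference)
```

Both measures point in the same direction (males taller), which is why the conclusions from the mean and from the median agree.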

Example 7.5 The association between age and GPA (revisited)

Recall the discussion in Examples 6.2 and 6.3 (pages 18 and 20) of the association between age and GPA. We considered both variables to be dichotomous, and compared the older and younger students with regard to the characteristics: “has a GPA of 3.00 or higher” and “has a GPA of 3.50 or higher.”

What if we leave GPA as a number and do not make it dichotomous, as we did earlier? We will then have a categorical variable (age) and a numerical variable (GPA), and we can use the methods of this section.

The question is: Is there an association between age and GPA? This can be rephrased in the usual way: “Do the subgroups differ?” That is, do the older and younger students differ with respect to GPA? We need a numerical measure of GPA. Let’s first use the mean, and then repeat the process using the median.

There are 18 younger students (under 21) whose GPAs range from a low of 2.00 to a high of 3.86. Their mean GPA is 3.00.

There are 32 older students whose GPAs range from 2.10 to 3.90. Unfortunately, three of their GPAs were given as “3+,” “3.5 – 4,” and “low,” which are values that we are unable to use because we want to perform arithmetic on the values to compute a mean. Looking only at the remaining 29 students, we can compute that their mean GPA is 3.05.

(Note: If we make the judgment call that “3+” is probably about 3.25, that “3.5 – 4” is probably about 3.75, and that “low” is probably about 2.5, and if we include these three estimated values, then the mean GPA becomes 3.06, which is not a big change.)
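The effect of including the three estimated values can be checked with a little arithmetic. The sketch below starts from the rounded mean of 3.05 for the 29 usable GPAs, since the individual GPAs of the older students are not listed here, so the result is only approximate.

```python
# Known from the text: 29 usable GPAs with a mean of about 3.05.
n_usable = 29
mean_usable = 3.05

# Judgment-call estimates for "3+", "3.5 - 4", and "low".
estimates = [3.25, 3.75, 2.5]

# Recover the (approximate) total, add the estimates, and divide by all 32.
total = n_usable * mean_usable + sum(estimates)
adjusted_mean = total / (n_usable + len(estimates))

print(round(adjusted_mean, 2))
```

Three values out of 32 can move the mean only slightly unless they are extreme, which is why the change is so small.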

To summarize, the mean GPA of the younger students is 3.00; the mean GPA of the older students is 3.05. Our conclusions:

(1) Is there an association between age and GPA for the students in the class? Do older and younger students differ with respect to GPA? Yes, there is an association between age and GPA for the students in the class; the GPAs of the older and younger students differ.

(2) What is the direction of the association? The older students tend to have slightly higher GPAs than the younger students.

(3) What is the strength of the association? The association between age and GPA is extremely weak; the difference between older and younger students is very slight. In fact, it is so slight as to be almost negligible.

What if we use the median instead of the mean as our measure of GPA? Arranging the GPAs of the 18 younger students in order (the two middle values are shown in square brackets), we find that their median GPA is 3.00:

2.00 2.00 2.50 2.60 2.80 2.83 2.87 2.89 [3.00] [3.00] 3.00 3.00 3.33 3.40 3.57 3.60 3.80 3.86


As for the 32 older students, notice that two of the three problematic values (“3.5 – 4” and “low”) can be included this time (assuming “low” is under 3), because all we need to do is put the values in order and find the middle value. (The third problematic value, 3+, is still too vague to include.) The median GPA of the remaining subgroup of 31 older students is 3.02.

Our conclusions:

(1) Is there an association between age and GPA for the students in the class? Do older and younger students differ with respect to GPA? Yes, there is an association between age and GPA for the students in the class; the older and younger students differ.

(2) What is the direction of the association? The older students tend to have GPAs which are higher (but just barely) than those of the younger students.

(3) What is the strength of the association? The association between age and GPA is so weak as to be essentially negligible; there is basically no difference at all between older and younger students in regard to their GPAs.

In this section and the previous one, we have examined the association between age (“under 21” vs. “21 or older”) and GPA in several ways. In Example 6.2 (page 18), we found that a greater percentage of older students than younger ones have a GPA of 3.00 or higher. In Example 6.3 (page 20), we saw that a slightly greater percentage of younger students have a GPA of 3.50 or higher. Finally, in this example, we found that the means and medians of the two subgroups (older and younger students) just barely differ, with the older students higher by a scant margin.

Notice that, by using all of this information, we now have a much fuller picture of the relationship between age and GPA for this group of students. There is still more information we could obtain on this relationship, however. We have used the techniques of the past two sections, which have introduced you first to association between two categorical variables, and then to association between one categorical and one numerical variable. (You will return to these concepts in greater detail in Chapter 4 [Categorical Data] and Chapter 5 [Univariate Numerical Data].) A third important concept, that of association between two numerical variables (also commonly called “correlation”), will be covered in Chapter 6 (Bivariate Numerical Data). There you will see how to study the relationship between age and GPA when both variables are left in their original numerical form.

Section 1.8 — Populations, Samples, Parameters, and Statistics

Populations and Samples

We are often interested in a group—usually large—called a population and are trying to learn about it by gathering data on some of it—called a sample—which is often just a small part of the population.

Definition 26: A population is the entire group of units about which information is desired.

Definition 27: The population size is the number of units in the population; it is denoted by the symbol capital N.

[Figure: the population (population size N) and the sample (sample size n)]

Definition 28: Sample. A sample is the collection of units within the population from which information is actually obtained.

Definition 29: Sample size. The sample size is the number of units in the sample; it is denoted by the symbol small n.

HELPFUL ADVICE

It is a good idea to use the word “all” when you describe a population to emphasize that the population is the entire group of units.

Example 8.1 Smoking habits

In order to learn about the smoking habits of all adult Americans 18 and over, a polling organization selects 1560 adults and interviews them about their smoking habits.

Unit: a person
Population: all adult Americans 18 or older
Population size: N is unknown and large (perhaps N is about 200,000,000)
Sample: the people selected and interviewed
Sample size: n = 1560

Example 8.2 Predictions from a pretest

We are interested in seeing how well the pretest can predict the performance of students taking the introductory statistics class at Laney College, so we gather data on the students in the current semester's class of 45 students; specifically, we record each student’s pretest score and final grade.

Unit: a person
Population: all students (past, present, and future) taking the introductory statistics class at Laney College
Population size: N is unknown
Sample: the students in this term's class
Sample size: n = 45

CAUTION ! — A COMMON ERROR

When asked to “state the population” or to “state the sample,” some students make the mistake of giving the population size (the value of the number N) or the sample size (the value of the number n). Be sure you understand that stating the population or the sample requires a description in words, not numbers, of the units which make up the group.

Describing a Population Clearly

In Chapters 2, 3, 11, 12 and 13, you will learn how a sample can be wisely chosen and what can be learned about a population based on information in such a sample. Our concern now is simply to describe the population clearly and precisely. The main guidelines to follow are these:


GUIDELINES FOR DESCRIBING A POPULATION

(1) A description of a population should allow a reader to decide whether any given unit is or is not a member of the population.
(2) Avoid obvious ambiguities, but remember that total clarity is difficult to achieve.

A perfect description of a population is sometimes difficult to create, so we must often settle for a description which is reasonably specific but not perfect. The following examples illustrate the difference between clear and unclear descriptions of a population.

Population (unclear): all adult Bay Area residents

This is unclear because the term “adult” is unclear. Let’s assume that we have in mind those who are age 18 or older. Then we would write:

Population (clearer, but still ambiguous): all Bay Area residents who are 18 or older

But what makes a person a “Bay Area resident?” Is someone who was temporarily transferred to San Francisco from New York to work on a six-month project a resident of the Bay Area? A further definition of “resident” is needed.

And what is the “Bay Area?” Does it include Santa Rosa and Fairfield, which are outlying cities? Where do you draw the line?

Or, consider this example:

Population (unclear): all college students

But what is a “college student?” That term needs to be defined precisely. Suppose we say:

Population (clearer): all students who are taking at least 50% of a full course load at a 2-year or 4-year college in the United States

But what is a “full course load?” And what is a “college?” Is a business college a college? Is a barber college a college?

The following astonishing description of a particular population, “employed persons,” is taken from a United States Department of Labor publication about unemployment rates. Notice that even though incredible detail is used to define this population, you can probably still find ambiguity in this definition.

Employed persons are those who did any work at all for pay or profit in the survey reference week (the week including the 12th of the month) or worked 15 hours or more without pay in a family business or farm, plus those not working who have a job from which they were temporarily absent, whether or not paid, for such reasons as labor-management dispute, illness, or vacation.

Parameters and Statistics

In Sections 1.6 and 1.7, you learned about numbers which describe groups:

(1) frequencies and percents for categorical data
(2) means and medians for numerical data.


In this section, you have met two important groups: populations and samples. We now put those two ideas together:

Definition 30: A parameter is a number which describes a population.

Definition 31: A statistic is a number which describes a sample.

MEMORY AID

Population Parameter

Sample Statistic

Notice that the word “statistics” has two different meanings:

Meaning #1: The name of the subject.

Example: “The field of statistics is remarkably useful.”

Meaning #2: The plural of “statistic.”

Example: “Data was gathered on a sample of college students, and the value of the statistic ‘mean age’—one of several statistics computed to describe the students—was a surprise to the researchers.”

Frequencies, percentages, means, and medians are among the most common and useful parameters and statistics. They are denoted by particular symbols which are shown in this table and explained below:

                                         Population parameter   Sample statistic
For categorical variables   Frequency    F                      f
                            Percentage   p                      p̂
For numerical variables     Mean         µx                     x̄
                            Median       MDx                    mdx

Helpful hint: Memorize these eight symbols and their meanings (which are given below) right away! They are used throughout the rest of the book, and the sooner you have memorized them, the easier it will be for you to understand the material that follows. To help you remember the symbols in this table, keep in mind that the following contrasts are sometimes used to distinguish parameters from statistics:

Population parameter                 vs.   Sample statistic
Capital letters (Examples: F, MDx)   vs.   Small letters (Examples: f, mdx)
Plain (Example: p)                   vs.   Adorned with a “hat” or “bar” (Examples: p̂, x̄)
Greek letter (Example: µ)            vs.   English letter (Example: x̄)

Frequencies and Percentages

The following four parameters and statistics are relevant when there is a categorical variable.

Definition 32: The population frequency is the number of units in a population which have a characteristic. It is denoted by the symbol F.

Definition 33: The sample frequency is the number of units in a sample which have a characteristic. It is denoted by the symbol f.

Definition 34: The population percentage (or population proportion) is the percentage of a population which has a characteristic; it is denoted by the symbol p. The formula for p is

$$p = \frac{F}{N} \times 100\%$$

In words, the population percentage is the population frequency divided by the population size, expressed as a percentage.

Definition 35: The sample percentage (or sample proportion) is the percentage of a sample which has a characteristic. It is denoted by the symbol p̂ (pronounced “p-hat”). The formula for p̂ is

$$\hat{p} = \frac{f}{n} \times 100\%$$

In words, the sample percentage is the sample frequency divided by the sample size, expressed as a percentage.

Example 8.3 Smoking habits

To learn about the smoking habits of adult Americans 18 or over, a polling organization selects 1560 adults, interviews them about their smoking habits, and finds that 230 of them currently smoke at least one pack of cigarettes per day. Suppose that there are in fact a total of 180,000,000 adult Americans 18 or over, and that 29,500,000 of them currently smoke at least one pack a day.

Population: all adult Americans 18 or older (Recall that it is a good idea to use the word “all” when describing a population.)

Parameters (numbers which describe the population):

in words: the number of American adults 18 or older who are currently smoking at least one pack of cigarettes per day
symbol: F
value: 29,500,000

in words: the percentage of American adults 18 or older who are currently smoking at least one pack of cigarettes per day
symbol: p
value: 16.4% (29,500,000/180,000,000)

Sample: the people selected and interviewed

Statistics:

in words: the number of adults sampled who are currently smoking at least one pack of cigarettes per day
symbol: f
value: 230

in words: the percentage of adults sampled who are currently smoking at least one pack of cigarettes per day
symbol: p̂
value: 14.7% (230/1560)
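The two percentages in Example 8.3 follow directly from the frequencies and sizes. The sketch below repeats that arithmetic in Python, using the numbers supposed in the example; the variable name `p_hat` stands in for the sample percentage symbol.

```python
# Population values supposed in Example 8.3.
N = 180_000_000   # population size
F = 29_500_000    # population frequency (smoke at least one pack a day)

# Sample values from the poll.
n = 1560          # sample size
f = 230           # sample frequency

# Percentage = frequency / size, expressed as a percentage.
p = 100 * F / N        # population percentage (a parameter)
p_hat = 100 * f / n    # sample percentage (a statistic)

print(round(p, 1), round(p_hat, 1))
```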


Means and Medians

The following four parameters and statistics are relevant in the case of a numerical variable.

Definition 36: The population mean is the mean of a numerical variable in a population. It is denoted by the symbol µx (which is read “mu sub x”; µ is the Greek letter mu and is pronounced “mew” rhyming with “few,” not “moo” rhyming with “too”). The formula for µx is

$$\mu_x = \frac{\sum x}{N}$$

where $\sum x$ is the sum of the values of the variable for all units in the population.

Definition 37: The sample mean is the mean of a numerical variable in a sample. It is denoted by the symbol x̄ (pronounced “x-bar”). The formula for x̄ is

$$\bar{x} = \frac{\sum x}{n}$$

where $\sum x$ is the sum of the values of the variable for all units in the sample.

Definition 38: The population median is the median of a numerical variable in a population. It is denoted by the symbol MDx (which is read “M–D–sub–x”).

Definition 39: The sample median is the median of a numerical variable in a sample. It is denoted by the symbol mdx (which is also read “m–d–sub–x”).

Example 8.4 Pretest scores and final grades

We will learn about the pretest scores of the students taking the introductory statistics class at Laney College this term, and we take the point of view that they constitute a sample of all Laney statistics students (past, present, and future). We gather data on the students in this term's class of 45 students. Their mean pretest score is 6.8 (out of 10). Their median final grade is 82.3 (out of 100).

Population: all students (past, present, and future) taking the introductory statistics course at Laney College

Parameters:

in words: the mean pretest score of all students taking the introductory statistics course at Laney College
symbol: µx
value: unknown

in words: the median final grade of all students taking the introductory statistics course at Laney College
symbol: MDx
value: unknown

Sample: the students in this term's class

Statistics:

in words: the mean pretest score of the students in this term's class
symbol: x̄
value: 6.8

in words: the median final grade of the students in this term's class
symbol: mdx
value: 82.3


Estimating An Unknown Parameter with a Known Statistic

Typically, the value of a population parameter is unknown but the value of the corresponding sample statistic is known. This is because we usually do not have access to all the information about the entire population, but we do have all the data from the sample. One of the tasks of a statistician is to do his or her job so well (that is, to select the sample using valid and appropriate statistical methodology) that the following “Key Fact” will be true about the three statistic/parameter pairs p̂ and p, x̄ and µx, and mdx and MDx:

KEY FACT: “PROBABLY CLOSE”

If the sample is sufficiently large and is properly chosen from the population, then:

The known value of the sample statistic

is probably close to the unknown value of the population parameter.

In other words, the goal is to select a sample from a population in such a way that:

• p̂ is probably close to p
• x̄ is probably close to µx
• mdx is probably close to MDx

In Chapter 2 (Introduction to Probability and Sampling) and Chapter 3 (Introduction to Surveys, Confidence Intervals and Hypothesis Testing), you will learn some methods for properly selecting the sample from the entire population.
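The “probably close” idea can be explored informally with a simulation. The following is only a sketch under stated assumptions: it borrows the population percentage of 16.4% and the sample size of 1560 from Example 8.3, and it treats each sampled person as having the characteristic independently with that probability, which is a simplification of drawing a proper sample from a very large population. The seed, the 200 repetitions, and the 3-point cutoff are arbitrary illustrative choices.

```python
import random


def sample_percentage(p, n, rng):
    """Simulate one sample of n people and return the sample percentage.

    p is the population percentage (0 to 100). Each simulated person has
    the characteristic with probability p / 100 (an assumed simplification).
    """
    f = sum(1 for _ in range(n) if rng.random() < p / 100)
    return 100 * f / n


rng = random.Random(2020)   # fixed seed so the run is repeatable
p = 16.4                    # population percentage from Example 8.3
n = 1560                    # sample size from Example 8.3

# Draw 200 simulated samples and see how often p-hat lands within 3 points of p.
p_hats = [sample_percentage(p, n, rng) for _ in range(200)]
within_3_points = sum(1 for ph in p_hats if abs(ph - p) <= 3)
share_close = within_3_points / len(p_hats)

print("smallest p-hat:", round(min(p_hats), 1))
print("largest p-hat:", round(max(p_hats), 1))
print("share of samples within 3 points of p:", share_close)
```

In a real survey only one sample is taken and p is unknown; the simulation simply shows why a large, properly chosen sample usually gives a value of p̂ near p.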

The words “probably” and “close” will be made precise in Section 3.2 (Confidence Intervals). Turning the word “probably” into something precise and measurable (like “a 95% chance”) will require knowing something about the mathematical theory of probability. Being specific about how close is “close” (such as, “within 3%”) will require knowing something about the mathematical theory of sampling distributions. But for now, you can use these ideas in a loose intuitive sense.

Example 8.3 (continued from page 29) Smoking habits

If the sample was chosen properly from the population, then we can say that:

The known percentage p̂ of the adults in the sample who are currently smoking at least one pack of cigarettes per day, 14.7%,

is probably close to

the unknown percentage p of all American adults 18 or older who are currently smoking at least one pack of cigarettes per day.

Example 8.4 (continued from page 30) Pretest scores and final grades

We would like to be able to say that:

The known mean pretest score x̄ of the students in this term's class, 6.8,

is probably close to

the unknown mean pretest score µx of all students (past, present, and future) taking the introductory statistics course at Laney College.

The known median final grade mdx of the students in this term's class, 82.3,

is probably close to

the unknown median final grade MDx of all students (past, present, and future) taking the introductory statistics course at Laney College.


However, as you will learn in Section 3.1 (pages 89-90), there are good reasons to believe that the goal, a properly-chosen sample, was not achieved in Example 8.4 (page 30) and that we cannot validly make these statements in this case.

Talking About Parameters and Statistics

Each verbal statement of a parameter or a statistic in Examples 8.3 and 8.4 (pages 29 and 30) contains within it a verbal description of some of the other main concepts introduced in this chapter. The following tables display these terms, which can be used to construct the proper sentences. When you express parameters or statistics in words, you should be sure to use these verbal patterns so that you state them correctly.

Parameters and Statistics Associated with Categorical Variables

Notice that for categorical variables, the statement of the parameter or statistic in words contains a “keyword” (either the word “number” or the word “percentage”), a statement of the population or sample, and a statement of the characteristic. These patterns are shown here:

FREQUENCY (from Example 8.3, page 31)

Parameter or statistic   Symbol   Keyword         Population or sample          Characteristic
Parameter                F =      the number of   American adults 18 or older   who currently smoke at least one pack a day
Statistic                f =      the number of   adults in the sample          who currently smoke at least one pack a day

PERCENTAGE (from Example 8.3, page 31)

Parameter or statistic   Symbol   Keyword             Population or sample          Characteristic
Parameter                p =      the percentage of   American adults 18 or older   who currently smoke at least one pack a day
Statistic                p̂ =      the percentage of   adults in the sample          who currently smoke at least one pack a day


Parameters and Statistics Associated with Numerical Variables

Notice that for numerical variables, the statement of the parameter or statistic in words contains a “keyword” (either the word “mean” or the word “median”), a statement of the variable, and a statement of the population or sample. These patterns are shown here:

MEAN (from Example 8.4, page 31)

Parameter or statistic   Symbol   Keyword    Variable        Population or sample
Parameter                µx =     the mean   pretest score   of all the past, present, and future statistics students taking the introductory course at Laney
Statistic                x̄ =      the mean   pretest score   of the students in this term's class

MEDIAN (from Example 8.4, page 31)

Parameter or statistic   Symbol   Keyword      Variable      Population or sample
Parameter                MDx =    the median   final grade   of all the past, present, and future statistics students taking the introductory course at Laney
Statistic                mdx =    the median   final grade   of the students in this term's class

Section 1.9 — Well-Defined Variables

Well-Defined Variables

For each unit, a variable must have a single value.

Definition 40: A well-defined variable is a variable for which there is exactly 1 possible value of the variable for each unit. In brief: “1 unit plus 1 variable yields 1 value”:

1 unit + 1 variable → 1 value

People sometimes propose variables which are not well-defined.

Example 9.1
Unit: an elementary school child
Unacceptable “variable”: parent's age

This “variable” is not well-defined because the parent is not specified; is it the mother or the father? The variable mother’s age is more precise, but still has drawbacks. What if the mother is deceased? Do we want the mother's age at death, what her age would be today, or simply the reply “deceased”? If the child lives with a step-mother, who is the “mother?” Is a mother who turns 38 tomorrow “37” or “38”?

A more precise statement of the variable is the following:


“Age in years at the last birthday of the living female adult who has legal custody of the child, if there is such a person and her age is known, ‘unknown’ if her age is unknown, and ‘none’ if there is no such woman or if she is deceased.”

Unfortunately, this sounds like legal jargon—or perhaps like instructions on an Internal Revenue Service tax form—and a careful reader could find objections even to this definition.
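The rule “1 unit plus 1 variable yields 1 value” can be checked mechanically on recorded data. The sketch below is a hypothetical illustration: the children, the ages, and the helper name `units_without_single_value` are all invented for this example. It flags every unit for which a proposed variable produced more than one distinct value.

```python
from collections import defaultdict


def units_without_single_value(records):
    """Return the units for which the recorded values are not a single value.

    records is a list of (unit, value) pairs for one proposed variable.
    """
    values_by_unit = defaultdict(set)
    for unit, value in records:
        values_by_unit[unit].add(value)
    return sorted(unit for unit, values in values_by_unit.items()
                  if len(values) != 1)


# Hypothetical answers to "parent's age": child A has two parents of
# different ages, so the variable gives two values for one unit.
parents_age = [("child A", 37), ("child A", 41), ("child B", 35)]

# Hypothetical answers to "mother's age in years at her last birthday".
mothers_age = [("child A", 37), ("child B", 35)]

print(units_without_single_value(parents_age))
print(units_without_single_value(mothers_age))
```

A check like this catches only duplicate values in the data; the deeper ambiguities discussed above (a deceased mother, a step-mother) still have to be settled in the wording of the variable.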

Keep in mind the following when you define variables.

GUIDELINES FOR DEFINING A VARIABLE

(1) A description of a variable should yield a single value for each unit in the population.
(2) Avoid obvious ambiguities, but remember that total clarity is difficult to achieve.
(3) If the variable is a measurement, state the unit of measurement (for example, inches, pounds, or years).

To illustrate point (3), if the variable is age and the value of the variable is 15, is this 15 years, 15 months, or 15 weeks? A better statement of the variable is age in years. If the variable is weight and the value of the variable is 95, is this 95 pounds or 95 kilograms? A better statement of the variable is weight in pounds.

When the Unit Itself is a Group

If we are studying the classes in an elementary school, the households in a community, the colleges in a state, or the cities in a country, the unit can be thought of as a group:

• the unit “class” is a group of children
• the unit “household” is a group of people
• the unit “college” is a group of students
• the unit “city” is a group of residents

Common variables used to provide information about a unit which is a group include counts, percentages, and “averages” (that is, means and medians).

Describing a Group Using a Count

When some of the members of the unit which is a group have a particular characteristic, the variable of interest might be the number of individuals in that unit having that characteristic. This yields variables such as those in the following examples.

Example 9.2
Unit: a household
Variable: the number of people in the household who are under 18

Data gathered on the households in a city might look like this:

Household (the unit)   Number of people under 18 in the household (the variable)
Household #1           4
Household #2           0
Household #3           1
etc.                   etc.


Example 9.3
Unit: a college
Variable: the number of students in the college who are receiving some form of financial assistance

Data gathered on the colleges in a state might look like this:

College (the unit)        Number of students in the college receiving financial aid (the variable)
Laney College             5,317
City College of S.F.      16,428
College of Marin          1,590
etc.                      etc.

Describing a Group Using a Percentage

Because units which are groups vary in size, it is often better to consider not the number but rather the percentage of the unit which have that characteristic. After all, in Example 9.3, above, the number of students in a college who are receiving financial aid can be expected to be greater in a larger college. Forming the percentage will take this into account. This yields such variables as the following.

Example 9.4
Unit: a class in an elementary school
Variable: the percentage of students in the class who are receiving remedial help in reading

Data gathered on the classes in the school might look like this:

Class (the unit)         Percentage of students in the class who are receiving remedial help in reading (the variable)
First grade class #1     10.3%
First grade class #2     5.8%
Second grade class #1    20.4%
etc.                     etc.

Example 9.5
Unit: a city
Variable: the percentage of adults in the city who smoke cigarettes

Data gathered on the cities in a state might look like this:

City (the unit)    Percentage of adults in the city who smoke cigarettes (the variable)
Oakland            24.3%
Fresno             29.4%
Sacramento         26.8%
etc.               etc.
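The reason for preferring a percentage to a count when groups differ in size can be seen in a small calculation. The Python sketch below uses two hypothetical colleges (the names "College A" and "College B" and all enrollment figures are invented for illustration, as is the helper function percentage): the larger college has the larger count of students receiving aid, but the smaller percentage.

```python
# Hypothetical colleges: total enrollment and number of students receiving aid.
colleges = {
    "College A": {"enrolled": 20000, "receiving_aid": 5000},
    "College B": {"enrolled": 2000, "receiving_aid": 1000},
}


def percentage(part, whole):
    """Return part as a percentage of whole."""
    if whole == 0:
        raise ValueError("a group with no members has no percentage")
    return 100.0 * part / whole


# One percentage for each unit (each college).
aid_percentages = {
    name: percentage(data["receiving_aid"], data["enrolled"])
    for name, data in colleges.items()
}

for name, pct in aid_percentages.items():
    print(f"{name}: {colleges[name]['receiving_aid']} students, {pct:.1f}%")
```

Comparing the counts alone would suggest that financial aid is more common at College A; the percentages show the opposite.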

Describing a Group Using an Average

If a numerical value is defined for every member of the unit that is a group, then the mean or median is sometimes a useful description of the unit.


Example 9.6
Unit: a college class
Variable: the mean GPA of the students in the class

Data gathered on the classes in a college might look like this:

Class (the unit)           Mean GPA of the students in the class (the variable)
Statistics (9 - 10 am)     3.25
Statistics (10 - 11 am)    3.14
English 1A (12 - 1 pm)     2.87
etc.                       etc.

Example 9.7
Unit: a county
Variable: the median income of all adults living in the county

Data gathered on the counties in a state might look like this:

County (the unit)    Median income of the adult residents (the variable)
Alameda              $20,470
Contra Costa         $23,565
Marin                $28,940
etc.                 etc.
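The mean and the median of a group can both be computed from the numerical values of its members, using the definitions given in the Glossary. The Python sketch below relies on the standard library's statistics module; the group name "County X" and its five income values are hypothetical and serve only to illustrate the calculation.

```python
from statistics import mean, median

# Hypothetical incomes (in dollars) of the adults in one small group.
incomes = {
    "County X": [20000, 22000, 25000, 30000, 250000],
}

# For each unit (each county), summarize its members with a mean and a median.
summaries = {
    name: {"mean": mean(values), "median": median(values)}
    for name, values in incomes.items()
}

for name, summary in summaries.items():
    print(name, "mean:", summary["mean"], "median:", summary["median"])
```

In this made-up group the mean (69,400) is far above the median (25,000) because of a single very large income, which is one reason the choice between the two "averages" matters when describing a group.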


Section 1.10 — Template for the Basic Concepts

The following template provides an outline and a review covering many of the basic concepts introduced in this chapter. When you fill in the blanks for a specific instance of data-gathering, you will be identifying the key elements of the statistical context. See page 38 for two completed examples using this template.

Template for the Basic Concepts

The unit is ___________________________________________________________________________.
AN INDIVIDUAL ABOUT WHOM (OR ABOUT WHICH) INFORMATION IS DESIRED

The population is ____________________________________________________________________.
THE ENTIRE GROUP OF UNITS ABOUT WHICH INFORMATION IS DESIRED

The population size N is _______________________________________________.
THE NUMBER OF UNITS IN THE POPULATION

The sample is ______________________________________________________________________.
THE GROUP OF UNITS FROM WHICH INFORMATION IS OBTAINED

The sample size n is __________________________________________.
THE NUMBER OF UNITS IN THE SAMPLE

The variable is __________________________________________________________.
GENERAL INFORMATION ABOUT ALL UNITS

The possible values of the variables are _____________________________________________.
ALL THE VALUES THE VARIABLE COULD HAVE

The type of variable is __________________________________________________________.
DESCRIPTION OF THE POSSIBLE VALUES OF THE VARIABLE

The parameter:
in symbols: ______
in words: ________________________________________
numerical value, if known (if not, state “unknown”): _______

The statistic:
in symbols: ______
in words: ________________________________________
numerical value, if known (if not, state “unknown”): _______

Key Fact: If the sample is chosen using valid and appropriate statistical methodology, then:
____________________________________________________________
THE KNOWN VALUE ____ OF THE SAMPLE STATISTIC ____ IS PROBABLY CLOSE TO ____ , THE UNKNOWN VALUE OF THE POPULATION PARAMETER


Example 10.1 High school security

A researcher wishes to learn how Bay Area high school students feel about security in their high schools. Students are asked whether they think security is: (a) too strict; (b) about right; (c) not strict enough. The researcher employs a respected company knowledgeable about proper statistical methodology to select a sample of 2500 high school students in the Bay Area. Of the students sampled, 800 responded: “(c) not strict enough.” Based on this information, the template is completed as follows.

The unit is: a high school student (or: a Bay Area high school student; or: a student; or: a person).
The population is: all Bay Area high school students.
The population size N is: unknown.
The sample is: the students selected.
The sample size n is: 2500.
The variable is: opinion on high school security.
The possible values of the variables are: “too strict,” “about right,” “not strict enough.”
The type of variable is: ordered categorical.

The parameter:
in symbols: p
in words: the percentage of all Bay Area high school students who believe security is not strict enough
numerical value: unknown

The statistic:
in symbols: p̂
in words: the percentage of the students in the sample who think security is not strict enough
numerical value: 32% (800/2500)

Key Fact: Since the sample was chosen using appropriate statistical methodology, the percentage of the students sampled who think security is not strict enough, 32%, is probably close to p, the percentage of all Bay Area high school students who think security is not strict enough.
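The value of the statistic in Example 10.1 follows from the sample frequency and the sample size. As a minimal sketch, the Python lines below repeat that arithmetic with the figures given in the example (800 of 2500 students); the variable names are chosen here for illustration and do not come from the text.

```python
# Figures taken from Example 10.1.
sample_size = 2500       # n: number of students in the sample
sample_frequency = 800   # f: number who answered "(c) not strict enough"

# Sample percentage: the share of the sample with the characteristic.
sample_percentage = 100.0 * sample_frequency / sample_size

print(f"{sample_percentage:.0f}% of the sampled students chose 'not strict enough'")
```

The population percentage p cannot be computed this way, because neither the population size N nor the population frequency F is known.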

Example 10.2 Lowering cholesterol levels

One part of a large medical study looks at the decrease in the cholesterol level of 20 patients with heart disease when they are put on a special 4-week program of diet and exercise. The mean decrease in the cholesterol level of these 20 patients is 12.6. Ultimately, the researchers hope to learn what the decrease in the cholesterol level would be for all American adults who might participate in such a diet and exercise program.

The unit is: a person (or: an adult).
The population is: all American adults.
The population size N is: unknown and large.
The sample is: the 20 heart patients studied.
The sample size n is: 20.
The variable is: decrease in cholesterol level.
The possible values of the variables are: a range of positive and negative values on the number line.
The type of variable is: continuous numerical (or almost continuous numerical).

The parameter:
in symbols: μ
in words: the mean decrease in cholesterol level for all American adults participating in this diet and exercise program
numerical value: unknown

The statistic:
in symbols: x̄
in words: the mean decrease in cholesterol level for the 20 heart patients in the study
numerical value: 12.6

Key Fact: The sample is quite small. Even more importantly, it was not chosen properly from among all American adults. Therefore, we cannot say in this case that the mean decrease in cholesterol level (12.6) for the heart patients in the study participating in this diet and exercise program is probably close to μ, the mean decrease in cholesterol level for all American adults who would participate in this diet and exercise program.


Glossary [page number is in brackets]

Almost continuous numerical variable [15] — a numerical variable for which the difference between successive values is negligible
Association between two categorical variables [19] — the relationship between the subgroups formed from one of the variables and their percentage breakdowns based on the other variable
Association between a categorical variable and a numerical variable [23] — the relationship between the subgroups formed from one of the variables and the mean, median, or other summary values of the numerical variable
Binary variable [12] — a dichotomous variable
Case [8] — a unit
Categorical variable [11] — a variable in which each possible value is a class or group
Continuous numerical variable [14] — a numerical variable whose values can be any number in some interval on the number line
Data [5] — pieces of information (plural)
Datum [5] — one piece of information (singular)
Dichotomous variable [12] — a variable with two possible values
Direction of the association between two variables [19] — specification of which subgroup based on one variable has more or less of what the other variable measures
Discrete numerical variable [14] — a numerical variable whose values are separated on the number line by jumps containing no possible values
Frequency [16] — the number of units in a group or subgroup having a characteristic
Grouped numerical variable [15] — a numerical variable each of whose values has been grouped into a range of numbers
Intrinsic type of a variable [15] — the type the variable would be if its values were measured or computed as precisely as possible without rounding or grouping
Mean (of a numerical variable) [21] — the sum of the data values divided by the number of data values
Median (of a numerical variable) [21] — the data value in the middle of the list (or the mean of the two middle data values) when the values are listed in order from smallest to largest
Missing value [10] — a value of a variable which is not given for a particular unit
Numerical variable [13] — a variable whose values are numbers on a number line on which arithmetic can be performed
Ordered categorical variable [12] — a categorical variable in which the categories can be arranged in a natural order from lowest to highest
Parameter [28] — a number which describes a population
Population [25] — the entire group of units about which information is desired
Population frequency [28] — the number of units in a population that have a particular characteristic
Population mean [30] — the mean of a numerical variable in a population
Population median [30] — the median of a numerical variable in a population
Population percentage (or proportion) [29] — the percentage of a population that has a particular characteristic
Population size [25] — the number of units in the population
Possible values of a variable [11] — all the values the variable could have
Sample [26] — the group of units about which information is obtained
Sample frequency [28] — the number of units in a sample that have a particular characteristic
Sample mean [30] — the mean of a numerical variable in a sample
Sample median [30] — the median of a numerical variable in a sample
Sample percentage (or proportion) [29] — the percentage of a sample that has a particular characteristic
Sample size [26] — the number of units in the sample
Statistic [28] — a number that describes a sample
Statistics [4, 28] — (1) the subject of this book; (2) the plural of “statistic”
Strength of the association between two variables [19] — a measure of how much the subgroups based on one variable differ with respect to the other variable
Subgroup [17] — a part of a group
Subject [8] — a unit that is a person


Type of variable [11] — a description of the kinds of values of the variable
Unit [8] — an individual person or thing about whom or which information is desired
Unordered categorical variable [12] — a categorical variable in which the categories cannot be arranged from lowest to highest
Value of a variable [8] — the specific information about one particular unit
Variable [8] — a general description of information about a unit, expressed in words as a noun or a noun phrase
Well-defined variable [33] — a variable for which there is exactly 1 possible value of the variable for each unit
Yes-no variable [10] — a dichotomous variable whose values are “yes” and “no”

List of Symbols

N — population size [25]
n — sample size [26]
F — population frequency [28]
f — sample frequency [28]
p — population percentage [29]
p̂ — sample percentage [29]
μ — population mean [30]
x̄ — sample mean [30]
MDx — population median [30]
mdx — sample median [30]

List of Formulas

[29]

[29]

[29]

[29]

[30]

[30]

