Copyright © 2013, 2010 and 2007 Pearson Education, Inc. Chapter 3: Numerically Summarizing Data



Section 3.1

Measures of Central Tendency

3-4

The arithmetic mean of a variable is computed by adding all the values in the data set and then dividing by the number of observations: “N” or “n”.

3-5

The population mean, μ (mew), is computed using all the individuals in a population. The population mean is a parameter.

The sample mean, x̄ (x-bar), is computed using sample data. It is a statistic.

3-6

If x1, x2, …, xN are the N observations of a variable from a population,

then the population mean, µ, is:

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$$

3-7

If x1, x2, …, xn are the n observations of a variable from a sample,

then the sample mean, x̄, is:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

3-8

EXAMPLE Computing a Population Mean and a Sample Mean

The following data represent the travel times (min) to work for all seven employees of a company.

23, 36, 23, 18, 5, 26, 43

(a) Compute the population mean of this data.

(b) Then, take a simple random sample of n = 3 employees. Compute the sample mean.

(c) Then, take a second simple random sample of n = 3 employees. Again compute the sample mean.

3-9

EXAMPLE Computing a Population Mean

(a)

$$\mu = \frac{\sum x_i}{N} = \frac{x_1 + x_2 + \cdots + x_7}{7} = \frac{23 + 36 + 23 + 18 + 5 + 26 + 43}{7} = \frac{174}{7} \approx 24.9 \text{ minutes}$$

3-10

EXAMPLE Computing a Sample Mean

(b) Obtain a simple random sample of size n = 3 from the population of seven employees. Use this simple random sample to determine a sample mean. Find a second simple random sample and determine that sample mean.

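Employee: 1 2 3 4 5 6 7

Time (min): 23, 36, 23, 18, 5, 26, 43

First sample:

$$\bar{x} = \frac{5 + 36 + 26}{3} \approx 22.3$$

Second sample:

$$\bar{x} = \frac{36 + 23 + 26}{3} \approx 28.3$$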

Hint: Use Calc RandInt

Recall, pop mean = 24.9 min
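The calculations in this example can be reproduced with a short Python sketch. The function name `mean` and the variable names are illustrative choices, not part of the slides, and `random.sample` stands in for the calculator's random-integer approach to picking employees.

```python
import random

# Travel times (min) for all seven employees: this is the whole population.
travel_times = [23, 36, 23, 18, 5, 26, 43]


def mean(values):
    """Arithmetic mean: add the values, then divide by how many there are."""
    return sum(values) / len(values)


population_mean = mean(travel_times)

# The two samples of size n = 3 used on the slide.
sample_1 = [5, 36, 26]
sample_2 = [36, 23, 26]

# A fresh simple random sample of three employees (drawn without replacement).
random_sample = random.sample(travel_times, 3)

print(f"population mean: {population_mean:.1f} min")
print(f"sample 1 mean:   {mean(sample_1):.1f} min")
print(f"sample 2 mean:   {mean(sample_2):.1f} min")
print(f"random sample {random_sample} has mean {mean(random_sample):.1f} min")
```

Each run draws a different random sample, so the last line changes from run to run while the population mean stays fixed. That is the distinction between a statistic and a parameter.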

3-11

The median of a variable is the value that lies in the middle of the data when arranged (sorted) in ascending order (Calc:Stat:SortA).

It is the value for which there are an equal number of actual data pieces above and below it.

The median may or may not be an actual piece of data.

3-12

Finding the Median of a Data Set

1. Enter the data into the Calc as a List (Stat).

2. Sort the data in ascending order. Assign each data piece a Rank starting at Min.

3. Determine the number of observations, n.

4. Determine the observation in the middle of the data set: Rank = (n+1)/2

3-13

Steps in Finding the Median of a Data Set

•If n is odd, then the median is the actual data value in the middle of the data set.

•If n is even, then the median is the mean of the two middle observations.
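The sort-then-pick-the-middle procedure can be sketched in Python as follows. The function name `median` is an illustrative choice; Python lists are indexed from 0, so rank (n+1)/2 corresponds to index n // 2 when n is odd.

```python
def median(values):
    """Median by sorting and taking the middle observation(s)."""
    ordered = sorted(values)          # ascending order
    n = len(ordered)                  # number of observations
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the observation with rank (n + 1) / 2 is the median.
        return ordered[middle]
    # Even n: average the two observations on either side of rank (n + 1) / 2.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([23, 36, 23, 18, 5, 26, 43]))      # seven travel times (odd n)
print(median([23, 36, 23, 18, 5, 26, 43, 70]))  # eight travel times (even n)
```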

3-14

EXAMPLE Computing a Median of a Data Set

The following data represent the travel times (in minutes) to work for all seven employees of a start-up web development company.

23, 36, 23, 18, 5, 26, 43

Determine the median of this data.

Step 1: Sort (A): 5, 18, 23, 23, 26, 36, 43

Step 2:

$$\frac{n + 1}{2} = \frac{7 + 1}{2} = 4$$

So, the median is the data value with Rank = 4:

5, 18, 23, 23, 26, 36, 43

M = 23 minutes

3-15


Suppose the start-up company hires a new employee. The travel time of the new employee is 70 minutes. Determine the median of the “new” data set.

23, 36, 23, 18, 5, 26, 43, 70

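$$\frac{n + 1}{2} = \frac{8 + 1}{2} = 4.5$$

5, 18, 23, 23, 26, 36, 43, 70

The median has Rank 4½, so it falls halfway between the 4th and 5th pieces of data:

M = (23 + 26)/2 = 24.5 minutes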

3-16


The following data represent the travel times (in minutes) to work for all seven employees of a start-up company.

23, 36, 23, 18, 5, 26, 43

Suppose a new employee is hired who has a 130 min commute.

How does this impact the value of the mean and median?

Mean before new hire: 24.9 minutes

Mean after new hire: 38 minutes

Median before new hire: 23 minutes

Median after new hire: 24.5 minutes

3-17

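A numerical summary of data (mean, median, etc.) is said to be resistant if “extreme” data values (very large or very small) do not affect its value significantly. In the travel-time example, the 130-minute commute moved the median only from 23 to 24.5 minutes but moved the mean from 24.9 to 38 minutes, so the median is resistant and the mean is not.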
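Resistance can be checked numerically with the travel-time data and the 130-minute commute from the preceding example. This sketch uses Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

before = [23, 36, 23, 18, 5, 26, 43]   # original seven travel times (min)
after = before + [130]                 # add the new hire's extreme commute

mean_shift = statistics.mean(after) - statistics.mean(before)
median_shift = statistics.median(after) - statistics.median(before)

print(f"mean:   {statistics.mean(before):.1f} -> {statistics.mean(after):.1f}")
print(f"median: {statistics.median(before):.1f} -> {statistics.median(after):.1f}")
print(f"the mean moved {mean_shift:.1f} min, the median only {median_shift:.1f} min")
```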

3-18

3-19

EXAMPLE Describing the Shape of the Distribution

The following data represent the asking price ($) of homes for sale in Lincoln, NE.

Source: http://www.homeseekers.com

79,995 128,950 149,900 189,900

99,899 130,950 151,350 203,950

105,200 131,800 154,900 217,500

111,000 132,300 159,900 260,000

120,000 134,950 163,300 284,900

121,700 135,500 165,000 299,900

125,950 138,500 174,850 309,900

126,900 147,500 180,000 349,900

3-20

The mean asking price is $168,320 and the median asking price is $148,700.

Because the mean is larger than the median, we would conjecture (estimate) that the distribution is skewed right.

3-21

[Figure: histogram “Asking Price of Homes in Lincoln, NE”; horizontal axis: Asking Price ($100,000 to $350,000); vertical axis: Frequency (0 to 12)]

3-22

The mode of a variable is the most frequently occurring observation of the variable (if there is one).

If no data piece occurs more than once, we say the data have no mode.

A set of data can have no mode, one mode, or more than one mode.
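A tally-based sketch of the mode in Python, covering the three cases named above (no mode, one mode, several modes). The function name `modes` and the choice to return a list are illustrative assumptions; the mode also works for qualitative data such as state names.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)               # tally each distinct observation
    highest = max(counts.values())
    if highest == 1:
        return []                          # no value occurs more than once
    return [value for value, count in counts.items() if count == highest]


print(modes([23, 36, 23, 18, 5, 26, 43]))   # one mode
print(modes([1, 2, 3]))                     # no mode
print(modes(["NY", "NY", "TX", "TX", "OH"]))  # two modes (bimodal)
```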

3-23

EXAMPLE Finding the Mode of a Data Set

The data on the next slide represent the Vice Presidents of the United States and their state of birth.

Find the mode, if there is one.

3-24

Joe Biden

Pennsylvania

3-25

The mode is New York.

3-26

Tally data to determine most frequent observation


Section 3.2

Measures of Dispersion (Variation)

3-29

The range, R, of a variable is the difference between the largest data value and the smallest data value.

Range = R = max value – min value

3-30

EXAMPLE Finding the Range of a Set of Data

The following data represent the travel times (min) to work for all seven employees of a start-up company:

23, 36, 23, 18, 5, 26, 43

Find the range of the data.

Range = (max – min)

43 – 5 = 38 minutes

3-31

The population standard deviation of a variable is the square root of the mean of the squared deviations from the population mean.

The population standard deviation is symbolically represented by “σ” (lowercase Greek letter sigma).

3-32

Computing Population Standard Deviation

The following data represent the travel times (min) to work for all seven employees of a start-up company.

23, 36, 23, 18, 5, 26, 43

Compute the population standard deviation of this data.

Hint: First, put the data into a TI-84 List, then find the mean = 24.85714 min

3-33

xi    μ           xi – μ       (xi – μ)²

23    24.85714    -1.85714     3.44898

36    24.85714    11.14286     124.1633

23    24.85714    -1.85714     3.44898

18    24.85714    -6.85714     47.02041

5     24.85714    -19.8571     394.3061

26    24.85714    1.142857     1.306122

43    24.85714    18.14286     329.1633

Sum of squared deviations: Σ(xi – μ)² = 902.8571

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} = \sqrt{\frac{902.8571}{7}} \approx 11.36 \text{ minutes}$$
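The table's column-by-column calculation can be mirrored in Python. This is a from-scratch sketch with an illustrative function name; the standard library's `statistics.pstdev` performs the same computation.

```python
import math


def population_std_dev(values):
    """Square root of the mean of the squared deviations from the mean."""
    n = len(values)
    mu = sum(values) / n                                  # population mean
    squared_deviations = [(x - mu) ** 2 for x in values]  # the (xi - mu)^2 column
    return math.sqrt(sum(squared_deviations) / n)         # divide by N, then root


travel_times = [23, 36, 23, 18, 5, 26, 43]
sigma = population_std_dev(travel_times)
print(f"sigma = {sigma:.2f} minutes")
```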

3-34

The sample standard deviation, “s”, of a variable uses the same computation, except that we use the sample mean and divide by (n – 1) instead of N.

“n” is the sample size and “N” is the population size of the data set.

3-35

EXAMPLE Computing a Sample Standard Deviation

Here are the results of a random sample of three times taken from the travel times (min) to work for all seven employees of a start-up company:

5, 26, 36

Find the sample standard deviation “s”.

3-36

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{500.66667}{2}} \approx 15.82 \text{ minutes}$$

Recall, the Pop std dev (using all 7 times) was 11.36 min.
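The sample computation differs from the population one only in the divisor. A minimal Python sketch, with an illustrative function name, checked against the standard library's `statistics.stdev`:

```python
import math
import statistics


def sample_std_dev(values):
    """Like the population formula, but divide by n - 1 instead of N."""
    n = len(values)
    x_bar = sum(values) / n                                # sample mean
    sum_of_squares = sum((x - x_bar) ** 2 for x in values)
    return math.sqrt(sum_of_squares / (n - 1))


sample = [5, 26, 36]   # the three sampled travel times (min)
s = sample_std_dev(sample)
print(f"s = {s:.2f} minutes")
print(f"statistics.stdev gives {statistics.stdev(sample):.2f} minutes")
```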

If you have final exam grades for two different math classes (A & B), and if the data is more dispersed for one class (A) than the other, then the standard deviation of that class (A) will be larger than the other. A larger Range usually signals more dispersion, but the Range depends only on the two most extreme grades.

3-38

The variance of a variable is the square of the standard deviation.

The math symbol for the population variance is σ², and for the sample variance is s².

3-39

EXAMPLE Computing a Population Variance

The following data represent the travel times (min) to work for all seven employees of a start-up web development company.

23, 36, 23, 18, 5, 26, 43

Compute the population and sample variance of this data.

3-40

EXAMPLE Computing a Population Variance

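Recall from earlier that the pop standard deviation was σ = 11.36 min, so the pop variance is

σ² = 11.36² ≈ 129.05

From before, the sample std dev was s = 15.82 min, so the sample variance is s² = 15.82² ≈ 250.27

(Squaring the rounded standard deviations introduces a small rounding error: the unrounded variances are 902.8571/7 ≈ 128.98 and 500.66667/2 ≈ 250.33.)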
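Since the variance is the square of the standard deviation, it can also be computed directly from the data without rounding first. A short sketch using Python's standard `statistics` module (the variable names are illustrative):

```python
import statistics

travel_times = [23, 36, 23, 18, 5, 26, 43]   # population of seven employees
sample = [5, 26, 36]                         # the sample of three times

pop_variance = statistics.pvariance(travel_times)   # divides by N
sample_variance = statistics.variance(sample)       # divides by n - 1

print(f"population variance = {pop_variance:.2f}")
print(f"sample variance     = {sample_variance:.2f}")
```

These unrounded results differ slightly from values obtained by squaring a standard deviation that was already rounded to two decimals.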

TI-84

The calculator will give two std devs: “σ” for a Pop and “s” for a sample.

It will not give any variance or range statistics.

3-42

The Empirical Rule (68 – 95 – 99.7)

If a distribution is roughly bell shaped, then:

1. Approx 68% of the data will lie within 1 standard deviation of the mean.

2. Approx 95% of the data will lie within 2 standard deviations of the mean.

3. Approx 99.7% of the data will lie within 3 standard deviations of the mean.

3-43

3-44

EXAMPLE Using the Empirical Rule

The following data represent the blood HDL cholesterol levels of all 54 female patients of Dr. Dracula.

41 48 43 38 35 37 44 44 44

62 75 77 58 82 39 85 55 54

67 69 69 70 65 72 74 74 74

60 60 60 61 62 63 64 64 64

54 54 55 56 56 56 57 58 59

45 47 47 48 48 50 52 52 53

3-45

Using a TI-84 we find: mean μ = 57.4 and standard deviation σ = 11.7

3-46

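[Figure: bell-shaped curve with the horizontal axis marked at μ – 3σ = 22.3, μ – 2σ = 34.0, μ – σ = 45.7, μ = 57.4, μ + σ = 69.1, μ + 2σ = 80.8, μ + 3σ = 92.5]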

Actually, 45 of the 54, or 83.3% of his patients have HDL between 34.0 and 69.1.

According to the Empirical Rule, 99.7% of the patients will have HDL within 3 standard deviations of the mean.

13.5% + 34% + 34% = 81.5% of all patients will have HDL between 34.0 and 69.1 according to the Empirical Rule.
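The comparison between the Empirical Rule's prediction and the actual data can be sketched in Python. The 54 HDL values are typed in from the example (the row breaks are a reconstruction), the helper name `share_within` is illustrative, and `statistics.pstdev` is used because the data cover all of the doctor's female patients.

```python
import statistics

hdl = [
    41, 48, 43, 38, 35, 37, 44, 44, 44,
    62, 75, 77, 58, 82, 39, 85, 55, 54,
    67, 69, 69, 70, 65, 72, 74, 74, 74,
    60, 60, 60, 61, 62, 63, 64, 64, 64,
    54, 54, 55, 56, 56, 56, 57, 58, 59,
    45, 47, 47, 48, 48, 50, 52, 52, 53,
]

mu = statistics.mean(hdl)
sigma = statistics.pstdev(hdl)   # population standard deviation


def share_within(values, low, high):
    """Fraction of the values that lie between low and high (inclusive)."""
    inside = [x for x in values if low <= x <= high]
    return len(inside) / len(values)


print(f"mean = {mu:.1f}, std dev = {sigma:.1f}")

# Actual share between mu - 2 sigma (34.0) and mu + 1 sigma (69.1),
# where the Empirical Rule predicts 13.5% + 34% + 34% = 81.5%.
actual = share_within(hdl, 34.0, 69.1)
print(f"actual share between 34.0 and 69.1: {100 * actual:.1f}%")

# Actual share within k standard deviations, next to the rule's prediction.
for k, predicted in [(1, 68.0), (2, 95.0), (3, 99.7)]:
    share = share_within(hdl, mu - k * sigma, mu + k * sigma)
    print(f"within {k} SD: actual {100 * share:.1f}%, rule says about {predicted}%")
```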

3-47

Chebychev’s Theorem

For any type of distribution (Normal or not), at least

$$\left(1 - \frac{1}{k^2}\right) \cdot 100\%$$

of the data lie within “k” std devs of the mean (k > 1).

3-48

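EXAMPLE Using Chebychev’s Theorem

Using the data from the HDL blood example, determine:

(a) by Chebychev’s Theorem, the percentage of patients that have HDL levels within 3 std dev (SD) of the mean:

at least

$$\left(1 - \frac{1}{3^2}\right) \cdot 100\% \approx 88.9\%$$

(b) the actual percentage of his patients that had HDL between 34 and 80.8 (within 2 SD of the mean, since 34.0 = μ – 2σ and 80.8 = μ + 2σ):

52/54 ≈ 0.96 ≈ 96%

This agrees with the theorem, which for k = 2 guarantees at least (1 – 1/2²) · 100% = 75%.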
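Chebychev's bound depends only on k, so it is easy to tabulate. A minimal Python sketch (the function name is an illustrative choice):

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the theorem only applies for k > 1")
    return 1 - 1 / k ** 2


for k in (1.5, 2, 3):
    print(f"k = {k}: at least {100 * chebyshev_lower_bound(k):.1f}% of the data")
```

The bound is a guaranteed minimum for any distribution, which is why it is lower than the Empirical Rule's percentages for bell-shaped data.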


Section 3.3

Measures of Central Tendency / Dispersion from Grouped Data

3-50

We have discussed computing statistics from raw data, but often the only available data have already been summarized into frequency distributions called “grouped data”.

We cannot find exact values of the mean/std dev without raw data, but we can approximate these measures using the following techniques….

3-51

1. Approximate the Mean of a Variable from a Frequency Distribution

$$\bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_n f_n}{f_1 + f_2 + \cdots + f_n}$$

where
xi is the midpoint or value of the ith class
fi is the frequency of the ith class
n is the number of classes

3-52

EXAMPLE Approximating the Mean from a Frequency Distribution

The National Survey of Student Engagement is a 2007 survey asking freshman college students how much time they spend preparing for class each week.

Hours 0 1-5 6-10 11-15 16-20 21-25 26-30 31-35

Freq 0 130 250 230 180 100 60 50

Approximate the mean number of hours spent preparing for class each week.

3-53

Time (hr) Frequency MP = xi xi fi

0 0 0 0

1 - 5 130 3 390

6 - 10 250 8 2000

11 - 15 230 13 2990

16 - 20 180 18 3240

21 - 25 100 23 2300

26 - 30 60 28 1680

31 - 35 50 33 1650

Σ fi = 1000 and Σ xi fi = 14,250

$$\bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{14{,}250}{1000} = 14.25 \text{ hours}$$
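The midpoint-times-frequency table can be built programmatically. In this Python sketch each class is stored as a hypothetical `(lower, upper, frequency)` tuple, and the midpoint is taken as the average of the class limits, as in the table above.

```python
# Each class: (lower limit, upper limit, frequency), from the study-hours survey.
classes = [
    (0, 0, 0),
    (1, 5, 130),
    (6, 10, 250),
    (11, 15, 230),
    (16, 20, 180),
    (21, 25, 100),
    (26, 30, 60),
    (31, 35, 50),
]


def grouped_mean(class_list):
    """Approximate mean: sum of midpoint * frequency over total frequency."""
    total_frequency = sum(freq for _, _, freq in class_list)
    weighted_sum = sum(((low + high) / 2) * freq for low, high, freq in class_list)
    return weighted_sum / total_frequency


print(f"approximate mean = {grouped_mean(classes):.2f} hours")
```

The result is only an approximation because every observation in a class is treated as if it sat at the class midpoint.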

3-54

2. The weighted mean, x̄w, of a variable is found by multiplying each value of the variable by its corresponding weight, adding these products, and dividing by the sum of the weights.

$$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$$

where
wi is the weight of the ith observation
xi is the value of the ith observation

3-55

EXAMPLE Computing a Weighted Mean

Kayla goes to the Nut store and creates her own snack mix. She combines 1 pound of raisins, 2 pounds of chocolate-covered peanuts, and 1.5 pounds of cashews. The raisins cost $1.25 per pound, the chocolate covered peanuts cost $3.25 per pound, and the cashews cost $5.40 per pound.

What is the mean cost per pound of this mix?

$$\bar{x}_w = \frac{1(\$1.25) + 2(\$3.25) + 1.5(\$5.40)}{1 + 2 + 1.5} = \frac{\$15.85}{4.5} \approx \$3.52$$
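The snack-mix calculation as a Python sketch: the pounds act as the weights and the prices per pound are the values. The function and variable names are illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


price_per_pound = [1.25, 3.25, 5.40]   # raisins, chocolate peanuts, cashews
pounds = [1, 2, 1.5]                   # how much of each went into the mix

cost = weighted_mean(price_per_pound, pounds)
print(f"mean cost per pound = ${cost:.2f}")
```

Note that the unweighted mean of the three prices would be $3.30, which understates the cost because the mix contains more of the expensive ingredients than of the raisins.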

3-56

Approximate the Standard Deviation of a Variable from a Frequency Distribution

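Population Standard Deviation:

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2 f_i}{\sum f_i}}$$

Sample Standard Deviation:

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\left(\sum f_i\right) - 1}}$$

where
xi is the midpoint of the ith class
fi is the frequency of the ith class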
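The grouped formulas can be applied to the study-hours frequency distribution from the earlier example. This Python sketch stores each class as a hypothetical `(midpoint, frequency)` pair; the slides do not work this example, so no result is quoted from them.

```python
import math

# (class midpoint, frequency) pairs from the study-hours frequency table.
grouped = [(0, 0), (3, 130), (8, 250), (13, 230), (18, 180),
           (23, 100), (28, 60), (33, 50)]


def grouped_std_dev(pairs, sample=False):
    """Approximate standard deviation from class midpoints and frequencies."""
    total = sum(f for _, f in pairs)
    mean = sum(x * f for x, f in pairs) / total
    sum_of_squares = sum((x - mean) ** 2 * f for x, f in pairs)
    divisor = total - 1 if sample else total   # n - 1 for a sample, N otherwise
    return math.sqrt(sum_of_squares / divisor)


print(f"population form: {grouped_std_dev(grouped):.2f} hours")
print(f"sample form:     {grouped_std_dev(grouped, sample=True):.2f} hours")
```

With 1000 observations the two divisors are nearly equal, so the two forms differ only in the third decimal place.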


Section 3.4

Measures of Position and Outliers

3-58

A z-score describes how many std dev’s a data piece is from the mean (above or below). There is both a population z-score and a sample z-score:

Population z-score:

$$z = \frac{x - \mu}{\sigma}$$

Sample z-score:

$$z = \frac{x - \bar{x}}{s}$$

The z-score has mean of 0 and standard deviation of 1.

3-59

EXAMPLE Using Z-Scores

The mean height of adult males is 69.1 inches with a standard deviation of 2.8 inches. The mean height of adult females is 63.7 inches with a standard deviation of 2.7 inches.

Who is relatively taller: a man whose height is 83 inches, or a woman whose height is 76 inches?

(In other words, which one is further from the mean of their gender?)

3-60

The man’s height is 4.96 standard deviations above the male mean. The woman’s height is 4.56 standard deviations above the female mean.

The man is relatively taller because his height is further above the mean height of men than hers is above that of women.

Man:

$$z_{kg} = \frac{83 - 69.1}{2.8} \approx 4.96$$

Woman:

$$z_{cp} = \frac{76 - 63.7}{2.7} \approx 4.56$$
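The height comparison as a Python sketch. The function name `z_score` and the variable names are illustrative; the means and standard deviations are the ones stated in the example.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


z_man = z_score(83, mean=69.1, std_dev=2.8)
z_woman = z_score(76, mean=63.7, std_dev=2.7)

print(f"man:   z = {z_man:.2f}")
print(f"woman: z = {z_woman:.2f}")
print("relatively taller:", "man" if z_man > z_woman else "woman")
```

Because z-scores are unitless, they allow values from two different distributions to be compared on a common scale.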

3-61

The kth percentile, denoted Pk, of a set of data is a value such that “k” percent of the observations are less than or equal to the value.

3-62

EXAMPLE Interpret a Percentile

The Graduate Record Examination (GRE) is an exam required for admission to many U.S. graduate schools.

The University of Pittsburgh Graduate School of Public Health requires a GRE score no less than the 70th percentile for admission into their Human Genetics Master of Science program.

Interpret this P70 admissions requirement.

3-63

EXAMPLE Interpret a Percentile

The 70th percentile is the score such that 70% of the individuals who took the exam scored worse, and 30% of the individuals scored the same or better.

In order to be admitted to this program, an applicant must score higher than 70% of the people who take the GRE.

Or, the individual’s score must be in the top 30%.

3-64

EXAMPLE Percentile

The following are scores on a Stats exam:

42,50,59,62,68,73,76,81,86,90,94,100

What is the percentile value of the 81 score?

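Seven of the 12 scores are below 81: 7/12 ≈ 0.58, so the 81 score is at the 58th percentile, P58.

… and 94 is the score at P83, since 10 of the 12 scores are below it (10/12 ≈ 0.83).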
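The percentile arithmetic in this example counts the scores strictly below the score of interest and divides by the total number of scores. A Python sketch of that convention follows; the function name is illustrative, and other textbooks and software use slightly different percentile definitions.

```python
def percentile_rank(values, score):
    """Percent of the values strictly below the given score, rounded to a whole number."""
    below = sum(1 for v in values if v < score)
    return round(100 * below / len(values))


exam_scores = [42, 50, 59, 62, 68, 73, 76, 81, 86, 90, 94, 100]

print(f"81 is at P{percentile_rank(exam_scores, 81)}")
print(f"94 is at P{percentile_rank(exam_scores, 94)}")
```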

3-65

“Quartiles” divide data into fourths, or four equal parts.

• The 1st quartile, Q1, divides the bottom 25% of the data from the top 75%. The 1st quartile is equivalent to the 25th percentile.

• The 2nd quartile, Q2, divides the bottom 50% of the data from the top 50%. It is equivalent to the 50th percentile, which is equivalent to the median.

• The 3rd quartile, Q3, divides the bottom 75% of the data from the top 25%. It is equivalent to the 75th percentile.

3-66

Finding Quartiles

Step 1 Arrange the data in ascending order.

Step 2 Determine the median, M, or second quartile, Q2 .

Step 3 Divide the data set into halves: the observations below (to the left of) M and the observations above M. The first quartile, Q1 , is the median of the bottom half, and the third quartile, Q3 , is the median of the top half.

3-67

EXAMPLE Finding and Interpreting Quartiles

A group of Brigham Young University students collected data on the speed of vehicles traveling through a construction zone on a state highway, where the posted speed was 25 mph.

The recorded speed of 14 randomly selected vehicles is given below:

20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40

Find and interpret the quartiles for speed in the construction zone.

3-68

EXAMPLE Finding and Interpreting Quartiles

n = 14 observations, so the median is the mean of the 7th and 8th observations: Median = Q2 = (32 + 33)/2 = 32.5 mph

The median of the bottom half of the data is Q1:

20, 24, 27, 28, 29, 30, 32

The median of these seven observations is 28. Therefore, Q1 = 28.

The median of the top half of the data (33, 34, 36, 38, 39, 40, 40) is the third quartile, Q3, so Q3 = 38.
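The median-of-each-half procedure can be sketched in Python. The function names are illustrative, and the sketch assumes the convention from the steps above: when n is odd, the median itself is left out of both halves. Other software packages use different quartile conventions and may give slightly different answers.

```python
def median(ordered):
    """Median of a list that is already sorted in ascending order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # observations below M
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # observations above M
    return median(lower), median(ordered), median(upper)


speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
q1, q2, q3 = quartiles(speeds)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {q3 - q1}")
```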

3-69

Interpretation:

• 25% of the speeds are less than or equal to 28 mph and 75% of the speeds are greater than 28 mph.

• 50% of the speeds are less than or equal to the median, 32.5 mph, and 50% of the speeds are greater.

• 75% of the speeds are less than or equal to 38 mph, and 25% of the speeds are greater.

3-70

The Interquartile range, IQR, is the range of the middle 50% of the data observations.

The IQR is the difference between the third and first quartiles: IQR = Q3 – Q1

In the vehicle speed problem, the IQR = 38 – 28 = 10 mph, and 50% of the observed speeds lie between 28 and 38 mph.

3-71

EXAMPLE Determining and Interpreting the Interquartile Range

Determine and interpret the interquartile range of the speed data.

Q1 = 28 Q3 = 38

$$IQR = Q_3 - Q_1 = 38 - 28 = 10 \text{ mph}$$

The range of the middle 50% of car speeds traveling through the construction zone is 10 miles per hour.

3-72

Suppose a 15th car travels through the construction zone at 100 mph. How does this value impact the mean, median, standard deviation, and interquartile range?

Without 15th car With 15th car

Mean 32.1 mph 36.7 mph

Median 32.5 mph 33 mph

Standard deviation 6.2 mph 18.5 mph

IQR 10 mph 11 mph

3-73

Checking for “Outliers” by Using Quartiles

1. Compute the interquartile range.

2. Determine the fences. Fences serve as cutoff points for determining outliers.

Lower Fence = Q1 – 1.5(IQR): 28-15 = 13 mph

Upper Fence = Q3 + 1.5(IQR): 38+15 = 53 mph

3. Any data value outside the fence is called an “outlier” (asterisked) and does not qualify as a min/max value.
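The fence check can be sketched in Python using the quartiles already found for the 14 vehicle speeds (Q1 = 28, Q3 = 38). The function names are illustrative.

```python
def fences(q1, q3):
    """Lower and upper fences: 1.5 IQRs beyond the first and third quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Values that fall outside the fences."""
    lower, upper = fences(q1, q3)
    return [x for x in values if x < lower or x > upper]


speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
lower_fence, upper_fence = fences(28, 38)

print(f"fences: {lower_fence} mph and {upper_fence} mph")
print("outliers among the 14 speeds:", find_outliers(speeds, 28, 38))
print("is 100 mph beyond the upper fence?", 100 > upper_fence)
```

None of the original 14 speeds is an outlier, while a 100 mph observation would lie far beyond the upper fence.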


Section 3.5

The 5-Number Summary and Boxplots

3-75

The Five-number summary of a data set consists of the Min data value, Q1, the Median, Q3, and the Max data value as follows:

Min   Q1   M   Q3   Max

3-76

EXAMPLE Obtaining the Five-Number Summary

Every six months, the US Federal Reserve Board conducts a survey of credit card bank plans in the U.S.

The following data are the interest rates charged by 10 randomly selected banks who issue credit cards for the July 2005 survey.

Determine the Five-number summary of the data.

3-77


Institution Rate

Pulaski Bank and Trust Company 6.5%

Rainier Pacific Savings Bank 12.0%

Wells Fargo Bank NA 14.4%

Firstbank of Colorado 14.4%

Lafayette Ambassador Bank 14.3%

Infibank 13.0%

United Bank, Inc. 13.3%

First National Bank of The Mid-Cities 13.9%

Bank of Louisiana 9.9%

Bar Harbor Bank and Trust Company 14.5%

Source: http://www.federalreserve.gov/pubs/SHOP/survey.htm

3-78


Enter the % data into a TI-84 List:

12.0, 13.3, 13.9, 14.3,14.4, 14.5, 9.9, 6.5, 13.0,14.4

1-Var Stats will give the 5-Number Summary as follows:

Five-number Summary:

6.5% 12.0% 13.6% 14.4% 14.5%
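For readers without a calculator, the five-number summary can be sketched in Python. The function names are illustrative, and the quartiles use the median-of-halves method from Section 3.4, which matches the summary shown above for these data.

```python
def median(ordered):
    """Median of a list that is already sorted in ascending order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (Min, Q1, Median, Q3, Max)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


# Credit card interest rates (%) for the 10 sampled banks.
rates = [12.0, 13.3, 13.9, 14.3, 14.4, 14.5, 9.9, 6.5, 13.0, 14.4]
summary = five_number_summary(rates)

labels = ["Min", "Q1", "Median", "Q3", "Max"]
for label, value in zip(labels, summary):
    print(f"{label:>6}: {value:.1f}%")
```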

3-79

The interquartile range (IQR) is 14.4% - 12% = 2.4%

The lower and upper fences are:

Lower Fence = Q1 – 1.5(IQR) = 12 – 1.5(2.4) = 8.4%

Upper Fence = Q3 + 1.5(IQR) = 14.4 + 1.5(2.4) = 18.0%

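5-N: 6.5% 12.0% 13.6% 14.4% 14.5%

The Min, 6.5%, lies below the lower fence of 8.4%, so it is an outlier (asterisked on the boxplot). The Max, 14.5%, lies inside the upper fence of 18.0%.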

3-80

The bank credit card rate boxplot indicates that the distribution is skewed left.

Use a boxplot and quartiles to describe the shape of a distribution.

END

CHAP 3

Summarizing Numerical Data

Date post:	12-Jan-2016
Category:	Documents
Upload:	deirdre-shields