Date post: | 12-Jan-2016 |
Upload: | deirdre-shields |
Copyright © 2013, 2010 and 2007 Pearson Education, Inc.
Chapter 3: Numerically Summarizing Data
Section 3.1: Measures of Central Tendency
3-4
The arithmetic mean of a variable is computed by adding all the values in the data set and then dividing by the number of observations (N for a population, n for a sample).
3-5
The population mean, μ (pronounced "mew"), is computed using all the individuals in a population. The population mean is a parameter.
The sample mean, x̄ (x-bar), is computed using sample data. It is a statistic.
3-6
If x1, x2, …, xN are the N observations of a variable from a population,
then the population mean, µ, is:
$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$
3-7
If x1, x2, …, xn are the n observations of a variable from a sample,
then the sample mean, x̄, is:
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$
3-8
EXAMPLE Computing a Population Mean and a Sample Mean
The following data represent the travel times (min) to work for all seven employees of a company.
23, 36, 23, 18, 5, 26, 43
(a) Compute the population mean of this data.
(b) Then, take a simple random sample of n = 3 employees. Compute the sample mean.
(c) Then, take a second simple random sample of n = 3 employees. Again compute the sample mean.
3-9
EXAMPLE Computing a Population Mean
(a) $\mu = \frac{\sum x_i}{N} = \frac{x_1 + x_2 + \cdots + x_7}{7} = \frac{23 + 36 + 23 + 18 + 5 + 26 + 43}{7} = \frac{174}{7} \approx 24.9$ minutes
3-10
EXAMPLE Computing a Sample Mean
(b) Obtain a simple random sample of size n = 3 from the population of seven employees. Use this simple random sample to determine a sample mean. Find a second simple random sample and determine that sample mean.
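Employee:   1   2   3   4   5   6   7
Time (min): 23, 36, 23, 18, 5, 26, 43
First sample: $\bar{x} = \frac{5 + 36 + 26}{3} \approx 22.3$ minutes
Second sample: $\bar{x} = \frac{36 + 23 + 26}{3} \approx 28.3$ minutes
Hint: Use the calculator's randInt function to pick the employees.
Recall, the population mean is 24.9 min.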
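As a rough software counterpart to the calculator steps in this example, the sketch below computes the population mean and then draws two simple random samples of size 3. The seed value and variable names are illustrative assumptions, so the samples drawn will generally differ from the ones shown on the slide.

```python
import random
from statistics import mean

# Travel times (min) for all seven employees: this is the whole population.
travel_times = [23, 36, 23, 18, 5, 26, 43]

population_mean = mean(travel_times)  # 174 / 7
print(f"population mean = {population_mean:.1f} min")

# The seed is arbitrary (an assumption) and only makes the run repeatable.
rng = random.Random(2016)
for label in ("first", "second"):
    sample = rng.sample(travel_times, 3)  # simple random sample, n = 3
    print(f"{label} sample {sample}: x-bar = {mean(sample):.1f} min")
```

Each sample mean is a statistic that estimates the parameter μ, and it typically changes from one sample to the next.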
3-11
The median of a variable is the value that lies in the middle of the data when arranged (sorted) in ascending order (Calc:Stat:SortA).
It is the value for which there are an equal number of data values above and below it.
The median may or may not be an actual data value.
3-12
Finding the Median of a Data Set
1. Enter the data into the calculator as a list (Stat).
2. Sort the data in ascending order. Assign each data value a rank, starting at the minimum.
3. Determine the number of observations, n.
4. Determine the observation in the middle of the data set: Rank = (n+1)/2
3-13
Steps in Finding the Median of a Data Set
• If n is odd, then the median is the actual data value in the middle of the data set.
• If n is even, then the median is the mean of the two middle observations.
3-14
EXAMPLE Computing a Median of a Data Set
The following data represent the travel times (in minutes) to work for all seven employees of a start-up web development company.
23, 36, 23, 18, 5, 26, 43
Determine the median of this data.
Step 1: Sort (A): 5, 18, 23, 23, 26, 36, 43
Step 2: $\frac{n + 1}{2} = \frac{7 + 1}{2} = 4$, so the median is the value with Rank = 4
5, 18, 23, [23], 26, 36, 43
The median is 23 minutes.
3-15
EXAMPLE Computing a Median of a Data Set
Suppose the start-up company hires a new employee. The travel time of the new employee is 70 minutes. Determine the median of the “new” data set.
23, 36, 23, 18, 5, 26, 43, 70
$\frac{n + 1}{2} = \frac{8 + 1}{2} = 4.5$
5, 18, 23, 23, 26, 36, 43, 70
The median is at rank 4½, halfway between the 4th and 5th data values: M = (23 + 26)/2 = 24.5 minutes.
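The rank rule (n + 1)/2 from the steps above can be sketched in a few lines. The function name is a hypothetical choice for illustration; Python's standard library already offers `statistics.median`, which gives the same results.

```python
def median_by_rank(values):
    """Median found from the rank (n + 1) / 2 of the sorted data."""
    data = sorted(values)
    n = len(data)
    rank = (n + 1) / 2                 # ranks count from 1
    if n % 2 == 1:
        return data[int(rank) - 1]     # odd n: the middle value itself
    lower = data[int(rank) - 1]        # even n: the rank ends in .5,
    upper = data[int(rank)]            # so average the two neighbours
    return (lower + upper) / 2

print(median_by_rank([23, 36, 23, 18, 5, 26, 43]))      # 23
print(median_by_rank([23, 36, 23, 18, 5, 26, 43, 70]))  # 24.5
```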
3-16
EXAMPLE Computing a Median of a Data Set
The following data represent the travel times (in minutes) to work for all seven employees of a start-up company.
23, 36, 23, 18, 5, 26, 43
Suppose a new employee is hired who has a 130 min commute.
How does this impact the value of the mean and median?
Mean before new hire: 24.9 minutes
Mean after new hire: 38 minutes
Median before new hire: 23 minutes
Median after new hire: 24.5 minutes
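The before-and-after comparison can be reproduced with the standard library. This is only a sketch; the list names are illustrative.

```python
from statistics import mean, median

before = [23, 36, 23, 18, 5, 26, 43]
after = before + [130]  # the new hire with a 130-minute commute

for label, data in (("before", before), ("after", after)):
    print(f"{label}: mean = {mean(data):.1f} min, median = {median(data)} min")
```

The single extreme value pulls the mean up by about 13 minutes while the median moves by only 1.5 minutes.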
3-17
A numerical summary of data (mean, median, etc) is said to be resistant if “extreme” data values (very large or very small) do not affect its value significantly.
3-18
3-19
EXAMPLE Describing the Shape of the Distribution
The following data represent the asking price ($) of homes for sale in Lincoln, NE.
Source: http://www.homeseekers.com
79,995 128,950 149,900 189,900
99,899 130,950 151,350 203,950
105,200 131,800 154,900 217,500
111,000 132,300 159,900 260,000
120,000 134,950 163,300 284,900
121,700 135,500 165,000 299,900
125,950 138,500 174,850 309,900
126,900 147,500 180,000 349,900
3-20
The mean asking price is $168,320 and the median asking price is $148,700.
Because the mean is noticeably larger than the median, we would conjecture (estimate) that the distribution is skewed right.
3-21
[Figure: histogram "Asking Price of Homes in Lincoln, NE"; horizontal axis: Asking Price (100,000 to 350,000); vertical axis: Frequency (0 to 12)]
3-22
The mode of a variable is the most frequently occurring observation of the variable (if there is one).
If no observation occurs more than once, we say the data have no mode.
A set of data can have no mode, one mode, or more than one mode.
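The three cases (no mode, one mode, several modes) can be illustrated with a small tally. The helper name `modes` is an assumption made for this sketch; it returns an empty list when no value repeats, which follows the "no mode" convention stated above rather than the behaviour of `statistics.multimode`.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if nothing repeats."""
    counts = Counter(values)           # tally each observation
    highest = max(counts.values())
    if highest == 1:
        return []                      # every value occurs once: no mode
    return [v for v, c in counts.items() if c == highest]

print(modes([23, 36, 23, 18, 5, 26, 43]))  # [23]
print(modes([1, 1, 2, 2, 3]))              # [1, 2]
print(modes([5, 18, 26]))                  # []
```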
3-23
EXAMPLE Finding the Mode of a Data Set
The data on the next slide represent the Vice Presidents of the United States
and their state of birth.
Find the mode, if there is one.
3-24
Joe Biden
Pennsylvania
3-25
The mode is New York.
3-26
Tally data to determine most frequent observation
Section 3.2: Measures of Dispersion (Variation)
3-29
The range, R, of a variable is the difference between the largest data value and the smallest data value.
Range = R = max value – min value
3-30
EXAMPLE Finding the Range of a Set of Data
The following data represent the travel times (min) to work for all seven employees of a start-up company:
23, 36, 23, 18, 5, 26, 43
Find the range of the data.
Range = max – min = 43 – 5 = 38 minutes
3-31
The population standard deviation of a variable is the square root of the mean of the squared
deviations from the population mean.
The population standard deviation is symbolically represented by “σ” (lowercase Greek letter sigma).
3-32
Computing Population Standard Deviation
The following data represent the travel times (min) to work for all seven employees of a start-up
company.
23, 36, 23, 18, 5, 26, 43
Compute the population standard deviation of this data.
Hint: First, put the data into a TI-84 List, then find the mean = 24.85714 min
3-33
xi    μ          xi – μ      (xi – μ)²
23    24.85714   -1.85714    3.44898
36    24.85714   11.14286    124.1633
23    24.85714   -1.85714    3.44898
18    24.85714   -6.85714    47.02041
5     24.85714   -19.8571    394.3061
26    24.85714   1.142857    1.306122
43    24.85714   18.14286    329.1633
Σ(xi – μ)² = 902.8571
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} = \sqrt{\frac{902.8571}{7}} \approx 11.36$ minutes
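The deviation table can be rebuilt step by step in code. This is a minimal sketch with illustrative variable names; `statistics.pstdev` would return the same value directly.

```python
from math import sqrt

travel_times = [23, 36, 23, 18, 5, 26, 43]
N = len(travel_times)
mu = sum(travel_times) / N  # population mean, about 24.85714

# One squared deviation per row of the table.
squared_deviations = [(x - mu) ** 2 for x in travel_times]
sigma = sqrt(sum(squared_deviations) / N)

print(f"sum of squared deviations = {sum(squared_deviations):.4f}")
print(f"sigma = {sigma:.2f} min")
```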
3-34
The sample standard deviation, “s”, of a variable uses the same computation, except that the deviations are taken from the sample mean x̄ and we divide by (n – 1) instead of N.
“n” is the sample size and “N” is the population size of the data set.
3-35
EXAMPLE Computing a Sample Standard Deviation
Here are the results of a random sample of three times taken from the travel times (min) to work for all seven employees of a start-up company:
5, 26, 36
Find the sample standard deviation “s”.
3-36
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{500.66667}{2}} \approx 15.82$ minutes
Recall, the Pop std dev (using all 7 times) was 11.36 min.
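For comparison with the hand calculation, the sample standard deviation can be checked as follows. The sketch writes out the n – 1 divisor explicitly and then confirms it against `statistics.stdev`, which uses the same divisor.

```python
from math import sqrt
from statistics import mean, stdev

sample = [5, 26, 36]
n = len(sample)
x_bar = mean(sample)

sum_sq = sum((x - x_bar) ** 2 for x in sample)  # about 500.66667
s = sqrt(sum_sq / (n - 1))                      # divide by n - 1, not n

print(f"s = {s:.2f} min")
print(f"statistics.stdev gives {stdev(sample):.2f} min")
```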
If you have final exam grades for two different math classes (A & B), and if the data is more dispersed (larger Range) for one class (A) than the other, then the standard deviation of that class (A) will be larger than the other.
3-38
The variance of a variable is the square of the standard deviation.
The math symbol for the population variance is σ²,
and for the sample variance is s².
3-39
EXAMPLE Computing a Population Variance
The following data represent the travel times (min) to work for all seven employees of a start-up web development company.
23, 36, 23, 18, 5, 26, 43
Compute the population and sample variance of this data.
3-40
EXAMPLE Computing a Population Variance
Recall from earlier that the pop standard deviation was σ = 11.36 min, so the pop variance is
σ² = 11.36² ≈ 129.05 min²
From before, the sample std dev (for the sample 5, 26, 36) was s = 15.82 min, so the sample variance is s² = 15.82² ≈ 250.27 min²
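The values above come from squaring the rounded standard deviations. As a hedged cross-check, the sketch below computes both variances directly from the data; the results differ slightly in the last digits because no intermediate rounding takes place.

```python
from statistics import pvariance, variance

population = [23, 36, 23, 18, 5, 26, 43]  # all seven travel times
sample = [5, 26, 36]                      # the sample used earlier

pop_var = pvariance(population)  # divides by N
samp_var = variance(sample)      # divides by n - 1

print(f"population variance = {pop_var:.2f} min^2")  # 128.98
print(f"sample variance     = {samp_var:.2f} min^2")  # 250.33
```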
TI-84
The calculator will give two std devs: “σ” for a Pop and “s” for a sample.
It will not give any variance or range statistics.
3-42
The Empirical Rule: 68 – 95 – 99.7
If a distribution is roughly bell shaped, then:
1. Approx 68% of the data will lie within 1 standard deviation of the mean.
2. Approx 95% of the data will lie within 2 standard deviations of the mean.
3. Approx 99.7% of the data will lie within 3 standard deviations of the mean.
3-43
3-44
EXAMPLE Using the Empirical Rule
The following data represent the blood HDL cholesterol levels of all 54 female patients of Dr. Dracula.
41 48 43 38 35 37 44 44 44
62 75 77 58 82 39 85 55 54
67 69 69 70 65 72 74 74 74
60 60 60 61 62 63 64 64 64
54 54 55 56 56 56 57 58 59
45 47 47 48 48 50 52 52 53
3-45
Using a TI-84 we find: μ = 57.4 and σ = 11.7
3-46
HDL levels at μ – 3σ, μ – 2σ, μ – σ, μ, μ + σ, μ + 2σ, μ + 3σ:
22.3  34.0  45.7  57.4  69.1  80.8  92.5
According to the Empirical Rule, 99.7% of the patients will have HDL within 3 standard deviations of the mean (between 22.3 and 92.5).
According to the Empirical Rule, 13.5% + 34% + 34% = 81.5% of all patients will have HDL between 34.0 (μ – 2σ) and 69.1 (μ + σ).
Actually, 45 of the 54, or 83.3%, of his patients have HDL between 34.0 and 69.1.
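The counts quoted in this example can be verified directly from the 54 HDL values. The sketch below treats the patients as a population (so it uses `pstdev`) and counts how many values fall inside each interval; the variable names are illustrative.

```python
from statistics import mean, pstdev

hdl = [
    41, 48, 43, 38, 35, 37, 44, 44, 44,
    62, 75, 77, 58, 82, 39, 85, 55, 54,
    67, 69, 69, 70, 65, 72, 74, 74, 74,
    60, 60, 60, 61, 62, 63, 64, 64, 64,
    54, 54, 55, 56, 56, 56, 57, 58, 59,
    45, 47, 47, 48, 48, 50, 52, 52, 53,
]
mu, sigma = mean(hdl), pstdev(hdl)
print(f"mu = {mu:.1f}, sigma = {sigma:.1f}")

# Share of patients within k standard deviations of the mean.
for k in (1, 2, 3):
    inside = sum(mu - k * sigma <= x <= mu + k * sigma for x in hdl)
    print(f"within {k} SD: {inside}/54 = {100 * inside / 54:.1f}%")

# The interval used in the example: from mu - 2*sigma up to mu + sigma.
between = sum(mu - 2 * sigma <= x <= mu + sigma for x in hdl)
print(f"between 34.0 and 69.1: {between}/54 = {100 * between / 54:.1f}%")
```

The observed percentages are close to, but not exactly, the 68, 95 and 99.7 figures, which is expected for a distribution that is only roughly bell shaped.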
3-47
Chebychev’s Theorem
For any type of distribution (Normal or not), at least
$\left(1 - \frac{1}{k^2}\right) \cdot 100\%$
of the data lie within “k” std devs of the mean, for any k > 1.
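The bound is simple enough to tabulate for a few values of k. The function name below is a hypothetical choice for this sketch.

```python
def chebyshev_minimum_percent(k):
    """Minimum percentage of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return (1 - 1 / k**2) * 100

for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_minimum_percent(k):.1f}%")
```

For k = 2 the guarantee is 75% and for k = 3 it is about 88.9%, both weaker than the Empirical Rule figures because the theorem makes no assumption about shape.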
3-48
EXAMPLE Using Chebychev’s Theorem
Using the data from the HDL blood example, determine:
(a) the minimum percentage of patients that have HDL levels within 3 std dev (SD) of the mean, according to Chebychev’s Theorem:
at least $\left(1 - \frac{1}{3^2}\right) \cdot 100\% \approx 88.9\%$
(b) the actual percentage of his patients that had HDL between 34 and 80.8 (within 2 SD of the mean):
52/54 ≈ 0.96 ≈ 96%, well above the 75% that the theorem guarantees for k = 2.
Section 3.3: Measures of Central Tendency/Dispersion from Grouped Data
3-50
We have discussed computing statistics from raw data, but often the only available data have already been summarized into frequency distributions called “grouped data”.
We cannot find exact values of the mean/std dev without raw data, but we can approximate these measures using the following techniques….
3-51
1. Approximate the Mean of a Variable from a Frequency Distribution
$\bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_n f_n}{f_1 + f_2 + \cdots + f_n}$
where x_i is the midpoint or value of the ith class, f_i is the frequency of the ith class, and n is the number of classes.
3-52
EXAMPLE Approximating the Mean from a Frequency Distribution
The National Survey of Student Engagement is a 2007 survey asking freshman college students how much time they spend preparing for class each week.
Hours  0    1-5   6-10   11-15   16-20   21-25   26-30   31-35
Freq   0    130   250    230     180     100     60      50
Approximate the mean number of hours spent preparing for class each week.
3-53
Time (hr)   Frequency (fi)   Midpoint (xi)   xi fi
0           0                0               0
1 - 5       130              3               390
6 - 10      250              8               2000
11 - 15     230              13              2990
16 - 20     180              18              3240
21 - 25     100              23              2300
26 - 30     60               28              1680
31 - 35     50               33              1650
Σfi = 1000, Σxi fi = 14,250
$\bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{14{,}250}{1000} = 14.25$ hours
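The table can be reproduced programmatically. In this sketch each class is stored as a (lower limit, upper limit, frequency) tuple, which is a representation chosen here for illustration; the midpoint of each class stands in for every observation in it.

```python
# (lower limit, upper limit, frequency) for each class of study hours
classes = [
    (0, 0, 0), (1, 5, 130), (6, 10, 250), (11, 15, 230),
    (16, 20, 180), (21, 25, 100), (26, 30, 60), (31, 35, 50),
]

midpoints = [(low + high) / 2 for low, high, _ in classes]
freqs = [f for _, _, f in classes]

total_xf = sum(x * f for x, f in zip(midpoints, freqs))  # sum of xi * fi
total_f = sum(freqs)                                     # sum of fi
approx_mean = total_xf / total_f

print(f"sum xi*fi = {total_xf:.0f}, sum fi = {total_f}")
print(f"approximate mean = {approx_mean:.2f} hours")
```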
3-54
2. The weighted mean, x̄w, of a variable is found by multiplying each value of the variable by its corresponding weight, adding these products, and dividing by the sum of the weights.
$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$
where w_i is the weight of the ith observation and x_i is the value of the ith observation.
3-55
EXAMPLE Computing a Weighted Mean
Kayla goes to the Nut store and creates her own snack mix. She combines 1 pound of raisins, 2 pounds of chocolate-covered peanuts, and 1.5 pounds of cashews. The raisins cost $1.25 per pound, the chocolate covered peanuts cost $3.25 per pound, and the cashews cost $5.40 per pound.
What is the mean cost per pound of this mix?
$\bar{x}_w = \frac{1(\$1.25) + 2(\$3.25) + 1.5(\$5.40)}{1 + 2 + 1.5} = \frac{\$15.85}{4.5} \approx \$3.52$ per pound
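The snack-mix calculation is a direct application of the weighted mean formula, with pounds as the weights and price per pound as the values. The dictionary layout below is just one illustrative way to hold the data.

```python
# item: (weight in pounds, price in dollars per pound)
mix = {
    "raisins": (1.0, 1.25),
    "chocolate-covered peanuts": (2.0, 3.25),
    "cashews": (1.5, 5.40),
}

total_cost = sum(w * x for w, x in mix.values())   # sum of wi * xi
total_weight = sum(w for w, _ in mix.values())     # sum of wi
weighted_mean = total_cost / total_weight

print(f"total cost = ${total_cost:.2f} for {total_weight} lb")
print(f"mean cost = ${weighted_mean:.2f} per pound")
```

Note that the unweighted mean of the three prices is $3.30, which understates the cost because the cheapest ingredient has the smallest weight.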
3-56
Approximate the Standard Deviation of a Variable from a Frequency Distribution
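Population Standard Deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2 f_i}{\sum f_i}}$
Sample Standard Deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\left(\sum f_i\right) - 1}}$
where x_i is the midpoint of the ith class and f_i is the frequency of the ith class.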
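As an illustration, these formulas can be applied to the study-hours frequency distribution from the earlier example (class midpoints 3, 8, ..., 33). The slides do not work this case, so the numbers printed here are a computed sketch rather than values taken from the source.

```python
from math import sqrt

midpoints = [0, 3, 8, 13, 18, 23, 28, 33]          # xi
freqs = [0, 130, 250, 230, 180, 100, 60, 50]       # fi

total_f = sum(freqs)
grouped_mean = sum(x * f for x, f in zip(midpoints, freqs)) / total_f

# Weighted sum of squared deviations from the grouped mean.
weighted_ss = sum((x - grouped_mean) ** 2 * f for x, f in zip(midpoints, freqs))

sigma = sqrt(weighted_ss / total_f)        # population version
s = sqrt(weighted_ss / (total_f - 1))      # sample version

print(f"mean = {grouped_mean:.2f}, sigma = {sigma:.2f}, s = {s:.2f} hours")
```

With 1000 observations the two versions are nearly identical; the choice of divisor matters mainly for small data sets.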
Section 3.4: Measures of Position and Outliers
3-58
A z-score describes how many std devs a data value is from the mean (above or below). There is both a population z-score and a sample z-score:
Population z-score: $z = \frac{x - \mu}{\sigma}$
Sample z-score: $z = \frac{x - \bar{x}}{s}$
The z-scores of a data set have a mean of 0 and a standard deviation of 1.
3-59
EXAMPLE Using Z-Scores
The mean height of adult males is 69.1 inches with a standard deviation of 2.8 inches. The mean height
of adult females is 63.7 inches with a standard deviation of 2.7 inches.
Who is relatively taller: a man whose height is 83 inches, or a woman whose height is 76 inches?
(In other words, which one is further from the mean of their gender?)
3-60
Man: $z = \frac{83 - 69.1}{2.8} \approx 4.96$
Woman: $z = \frac{76 - 63.7}{2.7} \approx 4.56$
The man’s height is 4.96 standard deviations above the male mean. The woman’s height is 4.56 standard deviations above the female mean.
The man is relatively taller because his height is more standard deviations above the mean height of men than hers is above that of women.
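The comparison amounts to two applications of the population z-score formula. The function and variable names in this sketch are illustrative.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

z_man = z_score(83, mean=69.1, std_dev=2.8)
z_woman = z_score(76, mean=63.7, std_dev=2.7)

print(f"man:   z = {z_man:.2f}")
print(f"woman: z = {z_woman:.2f}")
print("relatively taller:", "man" if z_man > z_woman else "woman")
```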
3-61
The kth percentile, denoted Pk, of a set of data is a value such that “k” percent of the observations are less than or equal to the value.
3-62
EXAMPLE Interpret a Percentile
The Graduate Record Examination (GRE) is an exam required for admission to many U.S. graduate schools.
The University of Pittsburgh Graduate School of Public Health requires a GRE score no less than the 70th
percentile for admission into their Human Genetics Master of Science program.
Interpret this P70 admissions requirement.
3-63
EXAMPLE Interpret a Percentile
The 70th percentile is the score such that 70% of the individuals who took the exam scored worse, and 30% of the individuals scored the same or better.
In order to be admitted to this program, an applicant must score higher than 70% of the people who take the GRE.
Or, the individual’s score must be in the top 30%.
3-64
EXAMPLE Percentile
The following are scores on a Stats exam:
42,50,59,62,68,73,76,81,86,90,94,100
What is the percentile value of the 81 score?
7 of the 12 scores are below 81: 7/12 ≈ 0.58, so 81 is the score P58
… and 94 is the score P83 (10 of the 12 scores are below 94: 10/12 ≈ 0.83)
“Quartiles” divide data into fourths, or four equal parts.
• The 1st quartile, Q1, divides the bottom 25% of the data from the top 75%. The 1st quartile is equivalent to the 25th percentile.
• The 2nd quartile, Q2, divides the bottom 50% of the data from the top 50%. It is equivalent to the 50th percentile, which is equivalent to the median.
• The 3rd quartile, Q3, divides the bottom 75% of the data from the top 25%. It is equivalent to the 75th percentile.
3-66
Finding Quartiles
Step 1 Arrange the data in ascending order.
Step 2 Determine the median, M, or second quartile, Q2 .
Step 3 Divide the data set into halves: the observations below (to the left of) M and the observations above M. The first quartile, Q1 , is the median of the bottom half, and the third quartile, Q3 , is the median of the top half.
3-67
EXAMPLE Finding and Interpreting Quartiles
A group of Brigham Young University students collected data on the speed of vehicles traveling through a construction zone on a state highway, where the posted speed was 25 mph.
The recorded speed of 14 randomly selected vehicles is given below:
20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40
Find and interpret the quartiles for speed in the construction zone.
3-68
EXAMPLE Finding and Interpreting Quartiles
n = 14 observations
Median = Q2 = (32 + 33)/2 = 32.5 mph
The median of the bottom half of the data is Q1:
20, 24, 27, 28, 29, 30, 32
The median of these seven observations is 28. Therefore, Q1 = 28.
The median of the top half of the data (33, 34, 36, 38, 39, 40, 40) is the third quartile, Q3, so Q3 = 38.
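The median-of-halves procedure can be sketched as below. The function name is hypothetical, and the sketch assumes at least two observations; when n is odd, the median itself is left out of both halves. Note that `statistics.quantiles` and many software packages use interpolation methods that can give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Q1, M, Q3 by taking medians of the lower and upper halves (needs n >= 2)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]     # observations below the median
    upper = data[-half:]    # observations above the median
    return median(lower), median(data), median(upper)

speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
q1, m, q3 = quartiles(speeds)
print(f"Q1 = {q1}, M = {m}, Q3 = {q3}")
```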
3-69
Interpretation:
• 25% of the speeds are less than or equal to 28 mph and 75% of the speeds are greater than 28 mph.
• 50% of the speeds are less than or equal to the median, 32.5 mph, and 50% of the speeds are greater.
• 75% of the speeds are less than or equal to 38 mph, and 25% of the speeds are greater.
3-70
The Interquartile range, IQR, is the range of the middle 50% of the data
observations.
The IQR is the difference between the third and first quartiles: IQR = Q3 – Q1
In the vehicle speed problem, the IQR = 38-28 = 10 mph, and 50% of
the observed speeds lie between 28 and 38 mph.
3-71
EXAMPLE Determining and Interpreting the Interquartile Range
Determine and interpret the interquartile range of the speed data.
Q1 = 28, Q3 = 38
IQR = Q3 – Q1 = 38 – 28 = 10 mph
The range of the middle 50% of car speeds traveling through the construction zone is 10 miles per hour.
3-72
Suppose a 15th car travels through the construction zone at 100 mph. How does this value impact the mean, median, standard deviation, and interquartile range?
Statistic            Without 15th car   With 15th car
Mean                 32.1 mph           36.7 mph
Median               32.5 mph           33 mph
Standard deviation   6.2 mph            18.5 mph
IQR                  10 mph             11 mph
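The four rows of this comparison can be recomputed as a check. The sketch uses the sample standard deviation and an illustrative `iqr` helper based on medians of the lower and upper halves; both choices are assumptions that happen to reproduce the values in the table.

```python
from statistics import mean, median, stdev

def iqr(values):
    """Interquartile range using medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[-half:]) - median(data[:half])

speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
with_15th = speeds + [100]   # the 15th car at 100 mph

for label, data in (("without", speeds), ("with", with_15th)):
    print(f"{label:8s} mean = {mean(data):.1f}  median = {median(data)}  "
          f"s = {stdev(data):.1f}  IQR = {iqr(data)}")
```

The mean and standard deviation change substantially, while the median and IQR barely move, which is what resistance means in practice.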
3-73
Checking for “Outliers” by Using Quartiles
1. Compute the interquartile range.
2. Determine the fences. Fences serve as cutoff points for determining outliers.
Lower Fence = Q1 – 1.5(IQR): 28-15 = 13 mph
Upper Fence = Q3 + 1.5(IQR): 38+15 = 53 mph
3. Any data value outside the fence is called an “outlier” (asterisked) and does not qualify as a min/max value.
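The fence check above can be written as a short routine. This sketch plugs in the quartiles from the 14-car data (Q1 = 28, Q3 = 38), as the slide does, and then tests the speeds including the hypothetical 15th car at 100 mph; the function name is illustrative.

```python
def fences(q1, q3):
    """Lower and upper fences: 1.5 IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40, 100]
lower, upper = fences(28, 38)   # quartiles taken from the 14-car example

outliers = [x for x in speeds if x < lower or x > upper]
print(f"fences: {lower} mph to {upper} mph")
print(f"outliers: {outliers}")
```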
Section 3.5: The 5-Number Summary and Boxplots
3-75
The five-number summary of a data set consists of the Min data value, Q1, the Median, Q3, and the Max data value, listed in that order:
Min   Q1   M   Q3   Max
3-76
EXAMPLE Obtaining the Five-Number Summary
Every six months, the US Federal Reserve Board conducts a survey of credit card bank plans in the
U.S.
The following data are the interest rates charged by 10 randomly selected banks who issue credit
cards for the July 2005 survey.
Determine the Five-number summary of the data.
3-77
EXAMPLE Obtaining the Five-Number Summary
Institution Rate
Pulaski Bank and Trust Company 6.5%
Rainier Pacific Savings Bank 12.0%
Wells Fargo Bank NA 14.4%
Firstbank of Colorado 14.4%
Lafayette Ambassador Bank 14.3%
Infibank 13.0%
United Bank, Inc. 13.3%
First National Bank of The Mid-Cities 13.9%
Bank of Louisiana 9.9%
Bar Harbor Bank and Trust Company 14.5%
Source: http://www.federalreserve.gov/pubs/SHOP/survey.htm
3-78
EXAMPLE Obtaining the Five-Number Summary
Enter the % data into a TI-84 List:
12.0, 13.3, 13.9, 14.3, 14.4, 14.5, 9.9, 6.5, 13.0, 14.4
1-Var Stats will give the 5-Number Summary as follows:
Five-number Summary:
Min = 6.5%, Q1 = 12.0%, M = 13.6%, Q3 = 14.4%, Max = 14.5%
3-79
The interquartile range (IQR) is 14.4% - 12% = 2.4%
The lower and upper fences are:
Lower Fence = Q1 – 1.5(IQR) = 12 – 1.5(2.4) = 8.4%
Upper Fence = Q3 + 1.5(IQR) = 14.4 + 1.5(2.4) = 18.0%
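5-N: 6.5% 12.0% 13.6% 14.4% 14.5%
On the boxplot, 6.5% lies below the lower fence of 8.4%, so it is an outlier and is marked with an asterisk (*).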
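For readers without a TI-84, the five-number summary and fences for the bank rates can be reproduced as follows. The quartiles are taken as medians of the lower and upper halves of the sorted data, which is an assumption that matches the values shown above; variable names are illustrative.

```python
from statistics import median

rates = [12.0, 13.3, 13.9, 14.3, 14.4, 14.5, 9.9, 6.5, 13.0, 14.4]  # percent

data = sorted(rates)
half = len(data) // 2
q1, m, q3 = median(data[:half]), median(data), median(data[-half:])
five_number = (data[0], q1, m, q3, data[-1])

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("five-number summary:", ", ".join(f"{v:.1f}%" for v in five_number))
print(f"IQR = {iqr:.1f}%, fences = {lower_fence:.1f}% to {upper_fence:.1f}%")
print(f"outliers: {outliers}")
```

The only rate outside the fences is 6.5%, which is why the long tail sits on the low side of the boxplot.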
3-80
The bank credit card rate boxplot indicates that the distribution is skewed left.
Use a boxplot and quartiles to describe the shape of a distribution.
END
CHAP 3
Summarizing Numerical Data