Distance Learning Today is Friday, March 27th
AP Review Unit: Displaying & Describing Data Volume 1 Note Spiral pgs. 39-71
College Board Video Links:
• Link to All College Board AP Review Videos
Note Videos: The Describing & Displaying videos are not ready. I will notify you via Google Classroom as soon as they are complete.
Materials:
• Displaying & Describing Data Vocabulary Card
• AP Review Notes: Displaying & Describing Data
Assignments:
• Unit 1 Progress Checks: MCQ Part A & MCQ Part B Due by 11:00pm Wednesday, April 1st
• Unit 1 Progress Check: FRQ Due by 11:00pm Wednesday, April 1st
• Unit 2 Progress Check: MCQ Part A Due by 11:00pm Thursday, April 2nd
The above assignments can be found on the College Board website.
Highly Suggested (If we were in class, we would do all of these)
• 2005 Question #1
• 2005 Question #2
• 2005B Question #1 & 2
• 2004B Question # 5
• 2014 Question #4
• 2010 Question #6
Link to FRQ Questions & Solutions
Displaying & Describing Data-Vocabulary Statistics: The study of how to collect, organize, analyze, interpret and draw conclusions from data
Displaying & Describing Data Created by: Loren L. Spencer Q2.1
Statistics: The study of how to collect, organize, analyze, interpret and draw conclusions from data.
Describing Data-C.U.S.S. Center Mean-(μ, x̄)-the sum of all the data divided by
the number of data values-the expected value.
Significantly impacted by outliers.
Median-the middle most data value in a line of
data arranged from least to greatest. 50% of the
data are greater and 50% of the data are smaller.
Not significantly impacted by outliers.
Note: Place data values in STAT EDIT
Press STAT CALC 1-Var STATS
• When a constant is added to each data value,
the center increases by that constant.
• When each data value is multiplied by a
constant, the center is multiplied by that
constant.
Unusual Features Gaps-Places in a distribution with no data values.
Clusters-groups of data points-give the value of
the centers of the groups and the ranges.
Outliers-data points that are significantly larger
or smaller than the remaining data values:
more than 1.5 x IQR below Q1 or above Q3, or
more than 3 standard deviations from the mean.
Shape Unimodal-one mound-the most common occurring
data value-highest probability of occurring.
Bimodal-two mounds or high points-the two most
common-the two values most likely to occur.
Multimodal-more than two mounds
Uniform-all values are equally likely to occur. The
bars of a histogram are of equal height.
Mound Shaped- The mean equals the median
and mode. Use this description for bell curves,
t-distributions and normal curves.
Skewed Right-the mean is to the right/greater
than the median. The tail is to the right.
All Chi-squared distributions are skewed right
Skewed Left-the mean is to the left/smaller
than the median. The tail is to the left.
Symmetric-The mean equals the median.
The mode may differ-bimodal or inverse bell.
Describing Data-C.U.S.S.-cont’d. Spread
• When a constant is added to each data value,
the spread does not change.
• When each data value is multiplied by a
constant, each measure of spread is multiplied
by the absolute value of that constant. Exception:
variance is multiplied by the constant’s square.
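The rules for adding and multiplying by a constant can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the data set and the helper name `summarize` are made up for illustration and are not part of the notes.

```python
import statistics as st

# Hypothetical data set, invented for illustration only.
data = [1, 2, 3, 4, 10]

def summarize(values):
    """Return center and spread measures for a list of numbers."""
    return {
        "mean": st.mean(values),
        "median": st.median(values),
        "range": max(values) - min(values),
        "sd": st.pstdev(values),          # population standard deviation
        "variance": st.pvariance(values),  # population variance
    }

original = summarize(data)
shifted = summarize([x + 10 for x in data])  # add a constant to every value
scaled = summarize([x * 3 for x in data])    # multiply every value by a constant

for name, summary in [("original", original), ("+10", shifted), ("x3", scaled)]:
    print(name, summary)
```

Adding 10 moves the mean and median up by 10 and leaves the range, standard deviation, and variance alone. Multiplying by 3 triples the mean, median, range, and standard deviation, and multiplies the variance by 9.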
Range-the largest data value minus the smallest.
Significantly impacted by outliers.
Interquartile Range: Q3 – Q1. The data value at
the 1st quartile subtracted from the data value at
the 3rd quartile.
The range of the Middle 50% of the data
Not significantly impacted by outliers.
Standard Deviation-(𝝈, s)-The square root of
the average of the squared data differences from
the mean. The square root of the variance.
Z-scores are a measure of standard deviations
from the mean and are related to percentile ranks.
Variance- The standard deviation squared. The
average of the squared differences from the mean.
Univariate Data Displays • Label the graph’s axes
• Title the graph
Categorical Displays-Displays that have one
axis which is comprised of a list of qualitative
values rather than quantitative values and measure
frequency or relative frequency (percent).
Note: There is no description of spread, shape, or
center for categorical data. The Mode measures
frequency & is a valid measure for categorical data.
DotPlots: Measure the frequency of categorical
variables and provide the exact counts of the data.
Barcharts/Bargraphs-Provide frequencies or
relative frequencies as (percents) and represent
the categorical data counts as areas.
Note: Bars should not touch because the data
categories are not sequential/numerical in nature.
PieCharts/Graphs- Represent categorical data
as areas & relative frequencies as (percents) .
Quantitative Displays-Displays in which one
axis is a listing of sequential quantitative values.
• Label the graph’s axes
• Title the graph
Histogram: a frequency display in which each
class is drawn as a bar whose height is proportional
to the frequency of the values in that class.
• Useful for large data sets
• Provides shape and an idea of spread
• Individual data values are usually lost
• The area of a bar is proportional to the frequency of its class
• Vertical axis can be either frequencies or
relative frequencies/percents.
Note: Bars should touch because the data is
sequential/numerical in nature.
Stemplot- a graphical display that separates
the ones digits from the remaining digits.
• Provides shape
• Maintains all data values so exact summary
statistics can be calculated
• Unwieldy for large data sets.
Note: must provide a legend
Cumulative Relative Frequency Plot (ogive)- Plots cumulative frequencies for data from left to
right such that the largest data point furthest to
the right will be at 100%.
• Skewed right data will increase rapidly at first
then more slowly later. Begins convex
• Skewed left data will increase slowly at first
and more rapidly later. Begins concave
Box Plot (box and whisker plot)-a visual
representation of the five number summary.
The smallest value, the 1st quartile value, the
median, the 3rd quartile value and the largest value.
Note: Each section contains 25% of the data
Remember: The mean is not shown & the variance
cannot be calculated. The range can be computed.
Calculator:
• Enter Data into STAT Edit
• Press 2nd STAT PLOT
• Select the Box plot with outliers
• Press GRAPH
• Press ZOOM 9
Do not forget to label your axes
Comparing Distributions • Be Specific
• Compare and contrast the centers
• Compare and contrast the spread
• Compare and contrast the shape
Bivariate Data Scatterplot: Each axis represents a variable: the
explanatory variable (input) on the horizontal axis and the
response variable (output) on the vertical axis.
Note: Neither axis represents a frequency
Empirical Rule • Applies to symmetric mound/bell shaped
curves.
• 68% of the data is within ± 1 standard
deviations of the mean
• 95% of the data is within ± 2 standard
deviations of the mean
• 99.7% of the data is within ± 3 standard
deviations of the mean
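The three percentages in the Empirical Rule can be recovered from the standard normal curve. This is a small sketch using `NormalDist` from Python's standard `statistics` module; the helper name `area_within` is an illustrative choice, not something from the notes.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within +/-{k} SD: {area_within(k):.4f}")
```

The printed areas are close to 0.68, 0.95, and 0.997, which is where the rounded values in the rule come from.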
Quick Hints: Percentile ranks are the areas under the
distributions and should be read left to right.
Z-scores-measure the distance in number of
standard deviations a value is from the mean. The
related area under the curve tells us the chance
that an event will happen.
Remember: we are always calculating the likelihood
that a value is either greater than or less
than a specific value. In essence we are calculating
the chance of a range of values.
We do not calculate the probability of a specific
data value when working with continuous data.
Standard Normal-the distribution that results
after each value has been converted to a z-score.
𝝁 = 0 and 𝝈 = 1
Before the z-score is computed the mean is the
mean of the population and the standard deviation
is the standard deviation of the population.
Displaying & Describing Data Note Spiral pgs 39-71
I. Types of Variables
A. Categorical vs. Quantitative
1. Categorical variables represent types of data which may be divided into groups or
categories.
Examples of categorical variables are race, sex, age group, color, and educational level.
With Categorical Data we keep track of the counts. Because Categorical data is not
numerical, we do not calculate the mean or median or standard deviation. We also do not talk
about shape or unusual features. We do talk about the Mode, that is to say which category
occurs the most.
• Example: The following table shows data for the 8 longest roller coasters in the world as of 2015.
Length (ft)  Type   Speed (mph)  Height (ft)  Drop (ft)  Continent
8,133        Steel  95           318          310        Asia
7,442        Steel  50           107          102        Europe
7,359        Wood   65           110          135        North America
6,709        Steel  81           259          230        Asia
6,602        Steel  95           325          320        North America
6,595        Steel  93           310          300        North America
6,562        Steel  149          171          168        Asia
6,442        Wood   67           163          154        North America
Which of the following variables is categorical?
A. Length
B. Type
C. Speed
D. Height
E. Drop
2. Quantitative variables express a certain quantity, amount or range. Usually, there are
measurement units associated with the data, such as meters, hours, pounds…
Examples: Height, weight, length and time, the number of eggs, how many of something
• Example: Data were collected on 100 United States coins minted in 2018. Which of the following
represents a quantitative variable for the data collected?
A. The type of metal used in the coin
B. The value of the coin
C. The color of the coin
D. The person depicted on the face of the coin
E. The location where the coin was minted
B. Discrete vs. Continuous
1. Discrete data is Numerical Data that can be counted.
Example: There are 13 Girls and 15 Guys; 18 people voted Yes, the number of likes, the
number of eggs, the number of days etc.
2. Continuous data is Numerical Data that can take on any real number on the number line.
Example: Weight, height, time, temperature and distance
Discrete vs. Continuous-Continued
• Example: The following list shows the number of video games sold at a game store each day for one
week.
15, 43, 50, 39, 22, 16, 20
Which of the following is the best classification of the data in the list?
A. Categorical and continuous
B. Quantitative and continuous
C. Categorical and discrete
D. Quantitative and discrete
E. Neither categorical nor quantitative, and neither discrete nor continuous
C. Bivariate Data: bi-variate data is two variable data. (Typically the variables are related or we
believe them to be so we say that the variables are paired together.)
A. Categorical Bivariate Data: Compares the counts of 2 categorical variables.
Example: number of speeding tickets vs Gender or Mechanical Delays vs. Weather Delays
We would normally display the counts of this data in a 2-way table
B. Quantitative Bivariate Data: Compares 2 quantitative variables.
Example: strength & age; electricity generation & hours of sunlight; blood pressure & weight.
All of these have numerical measurements that are more than just counts.
We would normally display Bivariate data in a scatterplot.
II. Categorical Displays Displays that have one axis which is comprised of a list of qualitative values
rather than quantitative values and measure frequency or relative frequency (percent).
Note: There is no description of spread, shape, or center for categorical data. The Mode measures
frequency & is a valid measure for categorical data.
A. Tables: A table lists the number of times an outcome occurs.
• Example: The following table summarizes the number of pies sold at a booth one day at a local farmers
market.
Type of Pie  Frequency
Apple        18
Blueberry    14
Cherry       16
Key Lime     12
Peach        12
Pumpkin      18
Which of the following statements is supported by the table?
A. More Cherry Pies were sold than any other type of pie
B. Twice as many apple pies as key lime pies were sold
C. More than half the pies sold were apple
D. Fewer than 50 pies were sold at the booth that day
E. The combined percentage of key lime pies and pumpkin pies sold was less than 50%
B. Barcharts/Bargraphs-Provide frequencies or relative frequencies as (percents) and
represent the categorical data counts as areas.
Note: Bars should not touch because the data categories are not sequential/numerical in nature.
• Example: In a certain school district, students from grade 6 through grade 12 can participate in a
school-sponsored community service activity. The following bar chart shows the relative frequencies of
students from each grade who participate in the community service activity.
Which of the following statements is supported by the bar chart?
A. The greatest number of participating students was in grade 9.
B. The number of participating students in grade 6 was equal to the number of participating
students in grade 7.
C. The relative frequency of all participating students in grades 6 and 7 combined was 0.60.
D. Grade 12 had the least relative frequency of participating students.
E. Grade 11 had the greatest relative frequency of participating students.
• Example: The following bar chart shows the relative frequency of days of rain for 30 days in four
regions of a certain state.
Which of the following statements is not supported by the bar chart?
A. Region D had the greatest percentage of days of rain.
B. Region B had the least percentage of days of rain.
C. Region A had more than 15 days of rain.
D. Region C had more than 25 days of rain.
E. Region D had less than 23 days of rain.
C. Segmented bar chart: A stacked bar chart in which each column is divided into segments which
are proportional in size to that segment’s representation within the population. Please be aware
that these are relative frequencies (proportions) and do not provide insight as to the number/size
of the sample.
D. PieCharts/Graphs- Represent categorical data as areas & relative frequencies as (percents).
E. Cumulative Frequency Graphs: Provides a running total of frequencies.
• Example: 2006 Form B Question 1
A large regional real estate company keeps records of home sales for each of its sales agents. Each
month, the company publishes the sales volume for each agent. Monthly sales volume is defined as
the total sales price of all homes sold by the agent during a month. The figure below displays the
cumulative relative frequency plot of the most recent monthly sales volume (in hundreds of
thousands of dollars) for these agents.
(a) In the context of this question, explain what information is conveyed by the circled point.
(b) What proportion of sales agents achieved monthly sales volume between $700,000 and
$800,000?
(c) For values between 10 and 11 on the horizontal axis, the cumulative relative frequency plot is
flat. In the context of this question, explain what this means.
(d) A bonus is to be given to 20 percent of the sales agents. Those who achieved the highest
monthly sales volume during the preceding month will receive a bonus. What is the minimum
monthly sales volume an agent must have achieved to qualify for the bonus?
F. 2-way Tables: Display the counts of 2 categorical variables (this is bivariate data). In the problem
below the sample size is 200 and the row totals and column totals represent the marginal
frequencies. The numbers in the table that are not totals are the joint frequencies, or intersections:
each one counts the number of times that two particular outcomes (this and that) both occurred.
The key word for a joint frequency is AND.
G. Conditional Probability- With conditional probability, we know something about our sample. As a
consequence, we are no longer looking at the entire sample; we are looking only at a portion or
subset of the sample (a particular category as opposed to the whole).
Conditional Probability = Joint Frequency / Marginal Frequency = ("and", "both") / Given = intersection / Given
In symbols: P(A given B) = P(A∩B) / P(B)
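The ratio of a joint frequency to a marginal frequency can be sketched in Python. The two-way table below is hypothetical (invented counts with a sample size of 200), and the function names `marginal` and `conditional` are illustrative only.

```python
# Hypothetical two-way table of counts, invented for illustration only.
# First label: whether a student studied. Second label: test result.
counts = {
    ("studied", "pass"): 60,
    ("studied", "fail"): 20,
    ("did not study", "pass"): 40,
    ("did not study", "fail"): 80,
}

def marginal(table, position, category):
    """Marginal frequency: total count for one category of one variable."""
    return sum(n for key, n in table.items() if key[position] == category)

def conditional(table, row, column, given="row"):
    """P(column | row) when given='row', or P(row | column) when given='column'."""
    joint = table[(row, column)]  # the "and" count
    if given == "row":
        return joint / marginal(table, 0, row)
    return joint / marginal(table, 1, column)

p_pass_given_studied = conditional(counts, "studied", "pass", given="row")
p_studied_given_pass = conditional(counts, "studied", "pass", given="column")
print(p_pass_given_studied, p_studied_given_pass)
```

The same joint count (60) gives two different answers because the "given" group changes the denominator: 80 students studied, while 100 students passed.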
• Example: 2003 Form B Problem 2: A simple random sample of adults living in a suburb of a large city
was selected. The age and annual income of each adult in the sample were recorded. The resulting data
are summarized in the table below.
                              Annual Income
Age Category  $25,000-$35,000  $35,001-$50,000  Over $50,000  Total
21-30         8                15               27            50
31-45         22               32               35            89
46-60         12               14               27            53
Over 60       5                3                7             15
Total         47               64               96            207
(a) What is the probability that a person chosen at random from those in this sample will be in the
31-45 age category?
(b) What is the probability that a person chosen at random from those in this sample whose incomes are
over $50,000 will be in the 31-45 age category? Show your work.
(c) Based on your answers to parts (a) and (b), is annual income independent of age category for those in
this sample? Explain.
H. Independence: Two events are considered to be independent if the probability of one does not
impact the probability of the other. If categories are independent then they are proportional. In
addition, if the variables are independent, then the marginal probability of one times the marginal
probability of the second will equal their joint probability. Remember the multiplication rule:
If independent, P(A∩B) = P(A) × P(B), or P(A and B) = the probability of A times the probability of B
1. Tables- If the variables are independent then the outcomes of the variables will be
proportional.
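One way to check a table for perfect independence is to compare each observed count with row total × column total ÷ n, which is the multiplication rule written in counts. This is a minimal sketch under that assumption; both tables and the function names are hypothetical.

```python
import math

def expected_counts(table):
    """Counts each cell would have if the two variables were perfectly independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def is_perfectly_independent(table):
    """True when every observed count equals row total x column total / n."""
    expected = expected_counts(table)
    return all(
        math.isclose(obs, exp)
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )

# Hypothetical tables of joint frequencies, invented for illustration only.
proportional = [[30, 50], [45, 75]]
not_proportional = [[60, 20], [40, 80]]

print(is_perfectly_independent(proportional))
print(is_perfectly_independent(not_proportional))
```

In the first table each row splits in the same 3-to-5 ratio, so the rows are proportional. In the second table the rows split 3-to-1 and 1-to-2, so the variables are associated.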
• Example:
Find the missing table value that results in perfect independence.
133 95
105
• Example:
Job No Job Total
Juniors 13 5 18
Seniors 13 26 39
Total 26 31 57
A survey of 57 students was conducted to determine whether or not they held jobs outside of school. The
two-way table above shows the numbers of students by employment status (job, no job) and class (juniors,
seniors). Which of the following best describes the relationship between employment status and class?
(A) There appears to be no association, since the same number of juniors and seniors have jobs.
(B) There appears to be no association, since close to half of the students have jobs.
(C) There appears to be an association, since there are more seniors than juniors in the survey.
(D) There appears to be an association, since the proportion of juniors having jobs is much larger than
the proportion of seniors having jobs.
• Example: 2010 Form B Number 5
An advertising agency in a large city is conducting a survey of adults to investigate whether there is an
association between highest level of education achievement and primary source for news. The company
takes a random sample of 2,500 adults in the city. The results are shown in the table below.
a) If an adult is to be selected at random from this sample, what is the probability that the selected adult
is a college graduate or obtains news primarily from the internet?
b) If an adult who is a college graduate is to be selected at random from this sample, what is the
probability that the selected adult obtains news primarily from the internet?
c) When selecting an adult at random from the sample of 2,500 adults, are the events “is a college
graduate” and “obtains news primarily from the internet” independent? Justify your answer.
d) The company wants to conduct a statistical test to investigate whether there is an association between
educational achievement and primary source for news for adults in the city. What is the name of the
statistical test that should be used? What are the appropriate degrees of freedom for this test?
2. Independence of Segmented Bar Charts-If the variables are independent then the outcomes will
be proportional and as such the corresponding segments will be of equal size.
• Scenario: 2011 Problem 2 The table below shows the political party registration by gender of all 500
registered voters in Franklin Township.
PARTY REGISTRATION – FRANKLIN TOWNSHIP
Party W Party X Party Y Total
Female 60 120 120 300
Male 28 124 48 200
Total 88 244 168 500
(a) Given that a randomly selected registered voter is a male, what is the probability that he is
registered for Party Y?
(b) Among the registered voters of Franklin Township, are the events “is a male” and “is registered for
Party Y” independent? Justify your answer based on Probabilities calculated from the table above.
(c) In Lawrence Township, the proportions of all registered voters for Parties W, X, and Y are the same
as for Franklin Township, and party registration is independent of gender. Complete the graph below
to show the distributions of party registration by gender in Lawrence Township
Party W = 88/500 = .176
Party X = 244/500 = .488
Party Y = 168/500 = .336
3. Simpson’s Paradox: when the results from a combined grouping appear to contradict the results
from the individual groupings. Simpson’s Paradox arises when two or more sub-groups are
combined to form a single group, the sub-groups differ significantly in size, and the
proportions in each group differ.
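A small numerical illustration may make the reversal easier to see. The counts below are invented for this sketch (they do not come from any AP problem), and "Method A" and "Method B" are hypothetical labels.

```python
# Hypothetical pass counts, invented for illustration only: (passed, total).
results = {
    "Method A": {"group 1": (9, 10), "group 2": (30, 100)},
    "Method B": {"group 1": (80, 100), "group 2": (2, 10)},
}

def rate(passed, total):
    """Proportion that passed within one sub-group."""
    return passed / total

def combined_rate(groups):
    """Pool the sub-groups, then compute one overall proportion."""
    passed = sum(p for p, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return passed / total

for method, groups in results.items():
    by_group = {name: round(rate(*counts), 3) for name, counts in groups.items()}
    print(method, by_group, "combined:", round(combined_rate(groups), 3))
```

Method A has the higher pass rate inside each group, yet Method B has the higher combined rate. The reversal happens because most of Method A's cases sit in the low-rate group while most of Method B's cases sit in the high-rate group.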
• Example:
Administrators at a state university computed the mean GPA (grade point average) for juniors and
seniors majoring in either physics or chemistry. The results are displayed in the table below. When
juniors and seniors are grouped together, could physics majors have a higher mean GPA than chemistry
majors?
Physics Chemistry
Juniors 2.8 3.0
Seniors 3.2 3.6
Overall ? ?
(A) No. The physics majors’ mean GPA for juniors and seniors must be 3.0, while the chemistry majors’
mean GPA for juniors and seniors must be 3.3.
(B) No. There is not enough information to determine the mean GPA for each major, but it must be
higher for chemistry majors than for physics majors.
(C) Yes. It could happen. Whether it does happen depends on the number of juniors and seniors in each
major.
(D) Yes. It could happen. Whether it does happen depends on the variability of the GPAs within each of
the four groups of students.
(E) Yes. It could happen. Whether it does happen depends on the shapes of the distributions of the
GPAs for each of the four groups of students.
III. Quantitative Variable Displays:
A. Stem and Leaf Plots: This type of data display maintains all of the original data values and it
provides an idea of center, shape, spread and any unusual features in a data set.
• Example: 2010 Form B Problem 1 (part b.)
Twenty concentrations of aldrin for River X are given below.
3.4 4.0 5.6 3.7 8.0 5.5 5.3 4.2 4.3 7.3
8.6 5.1 8.7 4.6 7.5 5.3 8.2 4.7 4.8 4.6
Construct a stemplot that displays the concentrations of aldrin for River X.
B. Histograms: In a histogram, each class is drawn as a bar whose height is proportional to the
frequency of the values in that class. Histograms are useful for large data sets. They
provide an idea of center, shape and spread and show unusual features of the data sets.
However, individual data values are not included in a histogram. Please note that depending on the
width of the bars, histograms will look different for the same data.
C. Box Plots: Boxplots provide a visual representation of the 5 number summary. Each section of
a boxplot contains 25% of the data. The exact shape of a distribution, as well as the mean and
variance, cannot be determined from a boxplot display. Box plots do show outliers.
The five number summary consists of the minimum value, the 1st quartile, the median, the
3rd quartile and the maximum value. Because the minimum and maximum are given the range
can be calculated. The IQR can be calculated by using the formula: IQR = Q3 – Q1
• Example: 2000 Problem 3 Five hundred randomly selected middle-aged men and five hundred randomly
selected young adult men were rated on a scale from 1 to 10 on their physical flexibility, with 10 being
the most flexible. Their ratings appear in the frequency table below. For example, 17 middle-aged men
had a flexibility rating of 1.
(a) Display these data graphically so that the flexibility of middle-aged men and young adult men can be
easily compared.
(b) Based on an examination of your graphical display, write a few sentences comparing the flexibility of
middle-aged men with the flexibility of young adult men.
The median flexibility scores differ by 1: the young men’s median is higher at 7 while the middle-aged
men’s median is 6, which suggests that young men have more flexibility than middle-aged men.
The distribution for middle-aged men is reasonably symmetric while that for young men is skewed to the
left. The upper 50% of the young men have flexibility ratings higher than 75% of the middle-aged men.
Middle-aged men’s and young men’s flexibility scores both have a range of 9 and fall within the
same values. Though the values ar
gher the interquartile range for both
Physical Flexibility Rating   Frequency of Middle-Aged Men   Frequency of Young Adult Men
1 17 4
2 31 17
3 49 29
4 71 39
5 70 54
6 87 69
7 78 83
8 54 93
9 34 73
10 9 39
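The medians quoted in the comparison can be reproduced from the frequency table. The sketch below expands each frequency column into a full list and takes a five-number summary; it assumes quartiles are found as the medians of the lower and upper halves of the data (other quartile conventions can give slightly different values), and the function names are illustrative.

```python
import statistics as st

ratings = list(range(1, 11))
middle_aged = [17, 31, 49, 71, 70, 87, 78, 54, 34, 9]   # frequencies from the table
young_adult = [4, 17, 29, 39, 54, 69, 83, 93, 73, 39]

def expand(values, frequencies):
    """Turn a frequency table into the full sorted list of observations."""
    return [v for v, f in zip(values, frequencies) for _ in range(f)]

def five_number_summary(sorted_data):
    """Min, Q1, median, Q3, max, with quartiles as medians of the lower and upper halves."""
    n = len(sorted_data)
    lower = sorted_data[: n // 2]
    upper = sorted_data[(n + 1) // 2:]
    return (sorted_data[0], st.median(lower), st.median(sorted_data),
            st.median(upper), sorted_data[-1])

for label, freqs in [("middle-aged", middle_aged), ("young adult", young_adult)]:
    print(label, five_number_summary(expand(ratings, freqs)))
```

Under this quartile convention the middle-aged summary is 1, 4, 6, 7, 10 and the young adult summary is 1, 5, 7, 8, 10, so the young men's median of 7 matches the middle-aged men's third quartile.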
D. Scatter Plots: The above displays are univariate displays, that is to say they keep track of the
counts of a single variable. Scatterplots are bi-variate (measure two variables, X & Y and
represent them as a single point on a graph). Scatterplots are used to show the paired
relationship between two variables (X & Y). Scatterplots allow us to determine if a
correlation between two variables exists.
• Example: 1998 Question 2 A plot of the number of defective items produced during 20 consecutive
days at a factory is shown below.
(a) Draw a histogram that shows the frequencies of the number of defective items.
(b) Give one fact that is obvious from the histogram but is not obvious from the scatterplot.
(c) Give one fact that is obvious from the scatterplot but is not obvious from the histogram.
IV. Describing Data
A. Describing Categorical Displays-We talk about the counts and we can talk about the
mode, as in which category had the most observations. One axis lists the categories and the other
axis shows the counts. Because the data is categorical in nature, and one axis does not
follow a number line, order does not matter.
(Example: Which comes first: basketball, soccer, swimming, football, golf, baseball, volleyball,
cheer, etc.? Maybe it's alphabetical, or least to greatest, or greatest to least. The order depends on who
is putting together the display.)
When it comes to Categorical Displays, because order does not matter, we do not C.U.S.S.
B. Describing Quantitative Displays- For Quantitative Data Displays, both axes are number lines.
As a consequence, order does matter and we must C.U.S.S.
1. Calculating Summary Statistics:
• Go to STAT Edit on the calculator
• Input the values into L1
• Input the frequencies into L2
• Press STAT CALC
• Select 1-Var STATS & Press Enter
• Make certain your List is L1
• Make Certain the FreqList is L2
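For readers without the calculator at hand, the same idea (a list of values plus a list of frequencies) can be sketched in Python. This is not the calculator's own routine; the function name `one_var_stats` and the two short lists are hypothetical, and the sketch assumes whole-number frequencies.

```python
import math

def one_var_stats(values, frequencies):
    """Mean, population SD, and sample SD for values weighted by frequencies."""
    n = sum(frequencies)
    mean = sum(v * f for v, f in zip(values, frequencies)) / n
    squared_diffs = sum(f * (v - mean) ** 2 for v, f in zip(values, frequencies))
    return {
        "n": n,
        "mean": mean,
        "population_sd": math.sqrt(squared_diffs / n),
        "sample_sd": math.sqrt(squared_diffs / (n - 1)),
    }

# Hypothetical lists, invented for illustration only.
L1 = [1, 2, 3]   # data values
L2 = [2, 5, 3]   # frequency of each value

print(one_var_stats(L1, L2))
```

When the second list holds probabilities that sum to 1 instead of counts, only the mean and the population standard deviation are meaningful, because dividing by n − 1 would mean dividing by zero.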
• Example: A television game show has three payoffs with the following probabilities:
Payoff ($) 0 500 5,000
Probability .7 .25 .05
What are the mean and standard deviation of the payoff variable?
(A) 𝜇 = 375, 𝜎 = 361
(B) 𝜇 = 375, 𝜎 = 1,083
(C) 𝜇 = 1,833, 𝜎 = 1,816
(D) 𝜇 = 1,833, 𝜎 = 2,248
(E) None of the above gives a set of correct answers.
• Example: The number of hybrid cars a dealer sells weekly has the following probability distribution:
Number of hybrids 0 1 2 3 4 5
Probability .32 .28 .15 .11 .08 .06
The dealer purchases the cars for $21,000 and sells them for $24,500. What is the expected
weekly profit from selling hybrid cars?
(A) $2,380
(B) $3,500
(C) $5,355
(D) $8,109
(E) $37,485
2. Center-The measure of the middle
a) Mean: The average of all of the values (the mean is impacted by outliers and non-
symmetric distributions). The mean is the balance point on a scale or see-saw.
• Example: Suppose the starting salaries of a graduating class are as follows:
Number of Students Starting Salary ($)
10 15,000
17 20,000
25 25,000
38 30,000
27 35,000
21 40,000
12 45,000
b) Median: the middle most data value in a line of data arranged from least to greatest.
50% of the data are greater and 50% of the data are smaller. (The median is not
impacted by outliers and is useful for non-symmetric distributions)
• Example: The following list shows the selling prices of 8 houses in a certain town.
House Price House Price
A $302,100 E $275,000
B $275,800 F $295,000
C $305,400 G $281,000
D $250,600 H $284,700
What is the median selling price of the houses in the list?
A. $263,200
B. $283,300
C. $284,700
D. $288,450
E. $290,600
3. Unusual Features
i. Gaps: Places in a distribution with no data values (give the location of the gaps)
Just because there is a large gap doesn’t mean there is an outlier
ii. Clusters: Groups of data points-Provide the value of the center and the range of
all clusters/groups.
iii. Outliers: Data points that are significantly larger or smaller than the other data
values.
a) Standard Deviation Method: Any data value that is more than 2 or 3
standard deviations away from the mean.
• Example (continuing the starting salary table in 2(a) above): What is the mean starting salary?
(A) $30,000
(B) $30,533
(C) $32,500
(D) $32,533
(E) $35,000
• Example: A statistician at a metal manufacturing plant is sampling the thickness of metal plates. If
an outlier occurs within a particular sample, the statistician must check the configuration of the
machine. The distribution of metal thickness has mean 23.5 millimeters (mm) and standard deviation
1.4 mm. Based on the two-standard deviations rule for outliers, of the following, which is the
greatest thickness that would require the statistician to check the configuration of the machine?
A. 19.3
B. 20.6
C. 22.1
D. 23.5
E. 24.9
b) 5 Number Summary Method (the 1.5×IQR rule): A data value X is an outlier if
X < Q1 − 1.5×(IQR) or
X > Q3 + 1.5×(IQR)
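The 1.5×IQR rule amounts to computing a lower cutoff below Q1 and an upper cutoff above Q3, then flagging anything outside them. A minimal sketch, with made-up data and quartiles and illustrative function names:

```python
def outlier_fences(q1, q3):
    """Lower and upper cutoffs from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Return the values that fall outside the fences."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]

# Hypothetical data with Q1 = 13 and Q3 = 19, invented for illustration only.
data = [3, 12, 14, 15, 17, 18, 20, 41]
print(outlier_fences(13, 19))
print(find_outliers(data, 13, 19))
```

Here the IQR is 6, so the cutoffs sit 9 units below Q1 and 9 units above Q3, at 4 and 28. The values 3 and 41 fall outside them.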
• Example:
A random sample of golf scores gives the following summary statistics: n = 20, x̄ = 84.5, Sx = 11.5,
minX = 68, Q1 = 78, Med = 86, Q3 = 91, maxX = 112. What can be said about the number of outliers?
(A) 0
(B) 1
(C) 2
(D) At least 1
(E) At least 2
• Example: The following boxplot summarizes the heights of a sample of 100 trees growing on a tree
farm.
Emily claims that a tree height of 43 inches is an outlier for the distribution. Based on
the 1.5×IQR rule for outliers, is there evidence to support the claim?
A. Yes, because (max−Q3) is greater than (Q1−min).
B. Yes, because 43 is greater than (Q3+IQR).
C. Yes, because 43 is greater than (Q1−1.5×IQR).
D. No, because 43 is not greater than (Q3+1.5×IQR).
E. No, because 43 is greater than (Q1−1.5×IQR).
• Example: 2001 Question 1: The summary statistics for the number of inches of rainfall in Los Angeles
for 117 years, beginning in 1877, are shown below:
N    MEAN    MEDIAN  TRMEAN  STDEV  SE MEAN
117  14.941  13.070  14.416  6.747  0.624
MIN    MAX     Q1     Q3
4.850  38.180  9.680  19.250
(a) Describe a procedure that uses these summary statistics to determine whether there are
outliers.
(b) Are there outliers in these data? _______
Justify your answer based on the procedure that you described in part (a).
(c) The news media reported that in a particular year, there were only 10 inches of rainfall. Use the
information provided to comment on this reported statement.
4. Spread
i. Range: The range is a single value, is never negative, and is calculated by the
formula: Range = Max – Min. Because the range uses the maximum and minimum
values, it is greatly impacted by outliers.
ii. Interquartile Range: IQR gives an idea as to how spread out the data is by
focusing on the middle 50% of the data and is computed by subtracting the value
of the 1st quartile from that of the 3rd quartile. We express it in this manner
IQR= Q3 –Q1. Because the interquartile range focuses on the middle 50% of the
data it is not impacted by outliers. IQR is very useful for skewed distributions.
iii. Variance: The variance finds the difference of each value from the mean,
squares each difference, and sums them. The sum of those squared differences is then
divided by n for a population or by n − 1 for a sample. Because variance is based
on using the mean, the variance is greatly impacted by outliers and non-symmetric
distributions. The formulas are as follows:
Population Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$, where $\mu$ is the true population mean
Sample Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the mean of the sample
iv. Standard Deviation: The standard deviation is just the square root of the
variance. Because standard deviation is based on the mean, it is greatly
impacted by outliers and non-symmetric distributions. The formulas are as
follows:
Population Standard Deviation: $\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{n}}$, where $\mu$ is the true population mean
Sample Standard Deviation: $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$, where $\bar{x}$ is the mean of the sample
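The variance and standard deviation formulas above can be checked by hand with a short script. This is a minimal sketch: the data list is hypothetical (it is not from the notes), and the standard library `statistics` functions are used only to cross-check the hand computation.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values, chosen for easy arithmetic

n = len(data)
mean = sum(data) / n
sum_sq_diff = sum((x - mean) ** 2 for x in data)   # sum of squared differences

pop_var = sum_sq_diff / n          # population variance: divide by n
samp_var = sum_sq_diff / (n - 1)   # sample variance: divide by n - 1
pop_sd = math.sqrt(pop_var)        # standard deviation is the square root
samp_sd = math.sqrt(samp_var)

print(f"mean = {mean}, population variance = {pop_var}, population SD = {pop_sd}")
print(f"sample variance = {samp_var:.3f}, sample SD = {samp_sd:.3f}")

# Cross-check against the standard library.
print(math.isclose(pop_sd, statistics.pstdev(data)),
      math.isclose(samp_sd, statistics.stdev(data)))
```

Dividing by n − 1 always gives a slightly larger result than dividing by n, which is why the sample standard deviation here is a little above the population value.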
• Example: Of the following dotplots, which represents the set of data that has the greatest
standard deviation?
[Dotplots A through E are not reproduced here.]
v. Z-scores: A z-score is a ratio that measures how far a value is from the
mean, taking into account both the center and the dispersion of the data.
Z-scores act as a ruler and can be used to compare values from differently
shaped distributions. The basic z-score formula is $z = \dfrac{x - \mu}{\sigma}$,
where z is the number of standard deviations a value lies from the mean.
• Example:
Gina's doctor told her that the standardized score (z-score) for her systolic blood pressure, as
compared to the blood pressure of other women her age, is 1.50. Which of the following is the best
interpretation of this standardized score?
(A) Gina's systolic blood pressure is 150.
(B) Gina's systolic blood pressure is 1.50 standard deviations above the average systolic blood pressure of
women her age.
(C) Gina's systolic blood pressure is 1.50 above the average systolic blood pressure of women her age.
(D) Gina's systolic blood pressure is 1.50 times the average systolic blood pressure of women her age.
(E) Only 1.5% of women Gina's age have a higher systolic blood pressure than she does.
• Example: The weights of a population of adult male gray whales are approximately normally distributed
with a mean weight of 18,000 kilograms and a standard deviation of 4,000 kilograms. The weights of a
population of adult male humpback whales are approximately normally distributed with a mean weight of
30,000 kilograms and a standard deviation of 6,000 kilograms. A certain adult male gray whale weighs
24,000 kilograms. This whale would have the same (z-score) as an adult male humpback whale with what
weight?
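The z-score formula can be applied to the whale question in two steps: standardize the gray whale, then convert that z-score back to a weight on the humpback scale. The helper names `z_score` and `value_from_z` below are made up for this sketch.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd


def value_from_z(z, mean, sd):
    """Invert the z-score formula: x = mean + z * sd."""
    return mean + z * sd


# Values taken from the whale example above (kilograms).
gray_z = z_score(24_000, mean=18_000, sd=4_000)
humpback_weight = value_from_z(gray_z, mean=30_000, sd=6_000)

print(f"Gray whale z-score: {gray_z}")
print(f"Humpback weight with the same z-score: {humpback_weight} kg")
```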
Example:
Some descriptive statistics for a set of test scores are shown above. For this test, a certain student has a
standardized score of z = -1.2. What score did this student receive on the test?
A. 266.28
B. 779.42
C. 1008.02
D. 1083.38
E. 1311.98
• Example: 2011 Question 1: A professional sports team evaluates potential players for a certain
position based on two main characteristics, speed and strength.
(a) Speed is measured by the time required to run a distance of 40 yards, with smaller times indicating
more desirable (faster) speeds. From previous speed data for all players in this position, the times
to run 40 yards have a mean of 4.60 seconds and a standard deviation of 0.15 seconds, with a
minimum time of 4.40 seconds, as shown in the table below.
Mean Standard deviation Minimum
Time to run 40 yards 4.60 seconds 0.15 seconds 4.40 seconds
Based on the relationship between the mean, standard deviation, and minimum time, is it reasonable to
believe that the distribution of 40-yard running times is approximately normal? Explain.
(b) Strength is measured by the amount of weight lifted, with more weight indicating more desirable
(greater) strength. From previous strength data for all players in this position, the amount of weight
lifted has a mean of 310 pounds and a standard deviation of 25 pounds, as shown in the table below.
Mean Standard deviation
Amount of weight lifted 310 pounds 25 pounds
Calculate and interpret the z-score for a player in this position who can lift a weight of 370 pounds.
(c) The characteristics of speed and strength are considered to be of equal importance to the team in
selecting a player for the position. Based on the information about the means and standard
deviations of the speed and strength data for all players and the measurements listed in the table
below for Players A and B, which player should the team select if the team can only select one of the
two players? Justify your answer.
Player A Player B
Time to run 40 yards 4.42 seconds 4.57 seconds
Amount of weight lifted 370 pounds 375 pounds
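The z-scores needed for parts (a) through (c) can be computed as sketched below. This is only a calculation aid under the values given in the question, not a full free-response answer; note that for running times a more negative z-score is the more desirable one.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd


# Means and standard deviations from the tables above.
SPEED_MEAN, SPEED_SD = 4.60, 0.15        # seconds
STRENGTH_MEAN, STRENGTH_SD = 310, 25     # pounds

# Part (a): how many standard deviations below the mean is the minimum time?
min_time_z = z_score(4.40, SPEED_MEAN, SPEED_SD)

# Part (b): z-score for a 370-pound lift.
lift_370_z = z_score(370, STRENGTH_MEAN, STRENGTH_SD)

# Part (c): z-scores for each player on each characteristic.
players = {"A": (4.42, 370), "B": (4.57, 375)}
player_z = {
    name: (z_score(time, SPEED_MEAN, SPEED_SD),
           z_score(lift, STRENGTH_MEAN, STRENGTH_SD))
    for name, (time, lift) in players.items()
}

print(f"Minimum time z = {min_time_z:.2f}")
print(f"370-pound lift z = {lift_370_z:.2f}")
for name, (speed_z, strength_z) in player_z.items():
    print(f"Player {name}: speed z = {speed_z:.2f}, strength z = {strength_z:.2f}")
```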
5. Shape
i. Symmetric: A vertical line can divide the distribution into two matching mirror
images. If the data are symmetric, the mean and median will be approximately equal.
Examples:
ii. Mound Shape: The graph is symmetric with most of the data in the center of the
graph. The data density diminishes as you move towards each tail. The mean,
median and mode are all approximately equal. The normal distribution and the
t-distributions are mound shaped.
Example:
iii. Uniform: A graph whose bars are all approximately the same height.
Example:
iv. Bi-modal: The distribution has two distinct peaks (modes). The peaks may not be
equal in height.
Examples:
v. Skewed Left: The tail is to the left and the mean is less than the median
(the mean is pulled to the left, toward the tail).
vi. Skewed Right: The tail is to the right and the mean is greater than the median.
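The effect of skew on the mean and median can be seen with a tiny data set. The numbers below are hypothetical and chosen only to illustrate a long right tail.

```python
import statistics

# Hypothetical right-skewed data: most values are small, one is very large.
right_skewed = [20, 25, 30, 35, 40, 45, 300]

# Mirror the data around 320 to get a left-skewed set with the tail on the left.
left_skewed = [320 - x for x in right_skewed]

for label, values in [("Skewed right", right_skewed), ("Skewed left", left_skewed)]:
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f"{label}: mean = {mean:.2f}, median = {median}")
```

In the right-skewed set the single large value pulls the mean well above the median, while the median barely notices it; the mirrored set shows the opposite.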
• Example: Consider the shape of the graph of individual incomes in the United States.
What can be said about the ratio $\dfrac{\text{Mean Individual Income}}{\text{Median Individual Income}}$?
(A) Approximately zero
(B) Less than one, but definitely above zero
(C) Approximately one
(D) Greater than one
(E) Cannot be answered without knowing the standard deviation
• Example:
      A | Stem | B
    5 1 |  12  | 7
  7 3 0 |  13  | 2
8 7 5 2 |  14  | 1 5
8 5 3 1 |  15  | 3 6 9
  7 7 4 |  16  | 0 2 3 4 6 9
    6 2 |  17  | 2 5 6 6 8
Given the back-to-back stemplot above, which of the following is true?
(A) The Empirical Rule applies to both sets A and B.
(B) The median of each is approximately (120 + 170)/2.
(C) In one set the mean and median should be about the same, while in the other the mean appears
to be less than the median.
(D) The ranges of the two sets are equal.
(E) The variances of the two sets are approximately the same
• Example: For a sample of 42 rabbits, the mean weight is 5 pounds and the standard deviation of weights
is 3 pounds. Which of the following is most likely true about the weights for the rabbits in this sample?
A. The distribution of weights is approximately normal because the sample size is 42, and therefore
the central limit theorem applies.
B. The distribution of weights is approximately normal because the standard deviation is less than
the mean.
C. The distribution of weights is skewed to the right because the least possible weight is within 2
standard deviations of the mean.
D. The distribution of weights is skewed to the left because the least possible weight is within 2
standard deviations of the mean.
E. The distribution of weights has a median that is greater than the mean.
1 Transforming Data
A. Measures of Center:
1. Effects of adding or subtracting a constant: Measures of center (mean & median) will
shift by the amount added or subtracted from each data point
2. Effects of multiplying each data value by a constant: Measures of center (mean &
median) are multiplied by that constant.
B. Measures of Spread (range, IQR, standard deviation, and variance):
1. Effects of adding or subtracting a constant to each data value: Measures of spread
(Range, IQR, Variance, and Standard Deviation) will not shift when a constant is added or
subtracted from each data point.
2. Effects of multiplying each data value by a constant:
a) Range, IQR and standard deviation are multiplied by the (absolute) value of the
constant.
b) Variance is multiplied by the square of the constant. This is because variance
is the square of the standard deviation.
C. Z-scores: Z-scores do not change when you add a constant to each data value or
multiply each data value by a positive constant.
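The transformation rules above can be verified numerically. The sketch below uses a hypothetical list of scores (not from the notes) and a made-up helper `z_scores`, then adds 25 to every value and, separately, multiplies every value by 1.25.

```python
import math
import statistics as st


def z_scores(values):
    """Standardize each value using the population mean and SD of the list."""
    mean, sd = st.mean(values), st.pstdev(values)
    return [(x - mean) / sd for x in values]


scores = [400, 450, 500, 550, 600]          # hypothetical data
shifted = [x + 25 for x in scores]          # add a constant
scaled = [x * 1.25 for x in scores]         # multiply by a constant

for label, values in [("original", scores), ("+25", shifted), ("x1.25", scaled)]:
    print(f"{label:>8}: mean = {st.mean(values):.2f}, "
          f"SD = {st.pstdev(values):.2f}, variance = {st.pvariance(values):.2f}")

# Z-scores are unchanged by either transformation (positive multiplier).
same_after_shift = all(math.isclose(a, b, abs_tol=1e-9)
                       for a, b in zip(z_scores(scores), z_scores(shifted)))
same_after_scale = all(math.isclose(a, b, abs_tol=1e-9)
                       for a, b in zip(z_scores(scores), z_scores(scaled)))
print("z-scores unchanged:", same_after_shift, same_after_scale)
```

Adding 25 moves the mean but leaves the standard deviation and variance alone; multiplying by 1.25 multiplies the mean and standard deviation by 1.25 and the variance by 1.25 squared.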
• Example:
      A | Stem | B
    5 4 |  10  |
  8 6 1 |  11  | 4 5
    7 5 |  12  | 7 6 8
      7 |  13  | 5 7
9 7 7 3 |  14  | 7
  8 4 0 |  15  | 3 7 7 9
      5 |  16  | 0 4 8
        |  17  | 5
Given this back-to-back stemplot, which of the following is incorrect?
(A) The distributions have the same mean.
(B) The distributions have the same range.
(C) The distributions have the same interquartile range.
(D) The distributions have the same standard deviation
(E) The distributions have the same variance.
• Example: Consider the following back-to-back stemplot:
0 3 4 8
1 0 1 2 5 6
8 4 3 2 2 9
6 5 2 1 0 3 2 5 5 7
9 2 4
7 5 5 2 5 6
6 1 4 5 8
6 7 0 9
8 5 4 1 8
9 0 9
Which of the following is a correct statement?
(A) The distributions have the same mean.
(B) The distributions have the same median.
(C) The interquartile range of the distribution to the left is 20 greater than the interquartile range of
the distribution to the right.
(D) The distributions have the same variance.
(E) None of the above is correct.
• Example: Suppose the average score on a national test is 500 with a standard deviation of 100. If
each score is increased by 25% what are the new mean and standard deviation?
(A) 500, 100
(B) 525, 100
(C) 625, 100
(D) 625, 105
(E) 625, 125
• Example: Suppose the average score on a national test is 500 with a standard deviation of 100. If
each score is increased by 25 points what are the new mean and standard deviation?
(A) 500, 100
(B) 500, 125
(C) 525, 100
(D) 525, 105
(E) 525, 125
• Example: The number of hybrid cars a dealer sells weekly has the following probability distribution:
Number of hybrids 0 1 2 3 4 5
Probability .07 .28 .23 .18 .13 .11
The dealer purchases the cars for $22,000 and sells them for $24,500. There is a fixed cost of 500
dollars for the showroom. The profit Y, in dollars, for a particular week can be described by
Y = 2500X – 500, where X is the number of hybrids sold. What is the standard deviation of Y?
(A) $3,145
(B) $3,645
(C) $4,970.
(D) $5,375
(E) $5,875
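The hybrid-car question combines two ideas: the standard deviation of a discrete random variable and the effect of a linear transformation. A minimal sketch of the arithmetic, using the probabilities from the table above, is:

```python
import math

# Probability distribution copied from the table above.
counts = [0, 1, 2, 3, 4, 5]
probs = [0.07, 0.28, 0.23, 0.18, 0.13, 0.11]

# Mean and variance of X (number of hybrids sold in a week).
mean_x = sum(x * p for x, p in zip(counts, probs))
var_x = sum((x - mean_x) ** 2 * p for x, p in zip(counts, probs))
sd_x = math.sqrt(var_x)

# Profit = 2500 * X - 500.
# Multiplying by 2500 scales the SD; subtracting 500 does not affect it.
mean_y = 2500 * mean_x - 500
sd_y = 2500 * sd_x

print(f"E(X) = {mean_x:.2f}, SD(X) = {sd_x:.4f}")
print(f"E(Y) = ${mean_y:,.0f}, SD(Y) = ${sd_y:,.0f}")
```

Small differences from the printed answer choices come from rounding the standard deviation of X before multiplying.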
A. Combining Random Variables.
A. Means: Means can be added or subtracted directly. Note: it is okay to have a negative value when you
subtract. Note: the variables do not have to be independent to combine means.
B. Variances: Variances can be added directly. They are never subtracted: the variance of a difference
is still the sum of the variances. Note: the variables do have to be independent to combine variances.
C. Standard Deviations: Standard deviations must be converted to variances by squaring; the variances are
then summed and the square root of the total is taken. We do not subtract them. Note: the variables do
have to be independent to combine standard deviations.
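The rules above can be written as a small helper. This is a sketch that assumes the two variables are independent; the function name `combine` is made up for illustration, and the numbers are the officer heights from the example that follows.

```python
import math


def combine(mean_1, sd_1, mean_2, sd_2, subtract=False):
    """Mean and SD of X + Y (or X - Y) for INDEPENDENT X and Y."""
    mean = mean_1 - mean_2 if subtract else mean_1 + mean_2
    # Variances add whether the variables are added or subtracted.
    variance = sd_1 ** 2 + sd_2 ** 2
    return mean, math.sqrt(variance)


# Difference in heights: male officer minus female officer (inches).
diff_mean, diff_sd = combine(71, 4, 66, 3, subtract=True)
print(f"Difference: mean = {diff_mean} in, SD = {diff_sd} in")

# For comparison, the sum of the two heights has the same SD.
sum_mean, sum_sd = combine(71, 4, 66, 3)
print(f"Sum: mean = {sum_mean} in, SD = {sum_sd} in")
```

The standard deviation is the same for the sum and the difference because variances are added in both cases.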
• Example:
Suppose the average height of policemen is 71 inches with a standard deviation of 4 inches, while the
average for policewomen is 66 inches with a standard deviation of 3 inches. If a committee looks at all
ways of pairing up one male officer with one female officer, what will be the mean and standard
deviation for the difference in heights for the set of possible partners?
(A) Mean of 5 inches with a standard deviation of 1 inch.
(B) Mean of 5 inches with a standard deviation of 3.5 inches.
(C) Mean of 5 inches with a standard deviation of 5 inches
(D) Mean of 68.5 inches with a standard deviation of 1 inch.
(E) Mean of 68.5 inches with a standard deviation of 3.5 inches.
• Example: A company that makes fleece clothing uses fleece produced from two farms, Northern Farm and
Western Farm. Let the random variable X represent the weight of fleece produced by a sheep from
Northern Farm. The distribution of X has mean 14.1 pounds and standard deviation 1.3 pounds. Let the
random variable Y represent the weight of fleece produced by a sheep from Western Farm. The distribution
of Y has mean 6.7 pounds and standard deviation 0.5 pound. Assume X and Y are independent. Let W equal
the total weight of fleece from 10 randomly selected sheep from Northern Farm and 15 randomly selected
sheep from Western Farm. Calculate the mean and standard deviation of W.
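Because W is a total over 25 independently selected sheep, the means add once per sheep and so do the variances. A minimal sketch of that calculation with the values given in the question:

```python
import math

# Per-sheep parameters from the question (pounds).
north_mean, north_sd, north_n = 14.1, 1.3, 10
west_mean, west_sd, west_n = 6.7, 0.5, 15

# Mean of a total: add one mean per sheep.
mean_w = north_n * north_mean + west_n * west_mean

# Variance of a total of independent weights: add one variance per sheep,
# then take the square root to get the standard deviation.
var_w = north_n * north_sd ** 2 + west_n * west_sd ** 2
sd_w = math.sqrt(var_w)

print(f"Mean of W = {mean_w:.1f} lb")
print(f"Variance of W = {var_w:.2f}, SD of W = {sd_w:.3f} lb")
```

Note that the variance uses 10 and 15 as multipliers, not 10 squared and 15 squared, because W is a sum of different sheep rather than one sheep's weight multiplied by a constant.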
• Example: 2013 Question 3 Each full carton of Grade A eggs consists of 1 randomly selected empty
cardboard container and 12 randomly selected eggs. The weights of such full cartons are approximately
normally distributed with a mean of 840 grams and a standard deviation of 7.9 grams.
(a) What is the probability that a randomly selected full carton of Grade A eggs will weigh more than
850 grams?
(b) The weights of the empty cardboard containers have a mean of 20 grams and a standard deviation
of 1.7 grams. It is reasonable to assume independence between the weights of the empty cardboard
containers and the weights of the eggs. It is also reasonable to assume independence among the
weights of the 12 eggs that are randomly selected for a full carton.
Let the random variable X be the weight of a single randomly selected Grade A egg.
i) What is the mean of X ?
ii) What is the standard deviation of X ?
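The arithmetic behind parts (a) and (b) can be sketched as follows. It treats a full carton as one container plus 12 independent eggs, as the question states, and uses `statistics.NormalDist` from the Python standard library for the normal probability; this is a calculation aid rather than a complete free-response answer.

```python
import math
from statistics import NormalDist

# Part (a): P(full carton weighs more than 850 grams).
carton = NormalDist(mu=840, sigma=7.9)
p_over_850 = 1 - carton.cdf(850)

# Part (b): carton = container + 12 eggs, all independent.
container_mean, container_sd = 20, 1.7
n_eggs = 12

# Means add: 840 = 20 + 12 * mean_egg
mean_egg = (840 - container_mean) / n_eggs

# Variances add: 7.9^2 = 1.7^2 + 12 * var_egg
var_egg = (7.9 ** 2 - container_sd ** 2) / n_eggs
sd_egg = math.sqrt(var_egg)

print(f"P(carton > 850 g) = {p_over_850:.4f}")
print(f"Mean egg weight = {mean_egg:.2f} g, SD of egg weight = {sd_egg:.2f} g")
```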