Distance Learning - Mr. Spencer's AP Statistics 2019-2020... · Displaying & Describing Data...

Distance Learning Today is Friday, March 27th

AP Review Unit: Displaying & Describing Data Volume 1 Note Spiral pgs. 39-71

College Board Video Links:

• Link to All College Board AP Review Videos

Note Videos: The Displaying & Describing Data videos are not ready. I will notify you via Google Classroom as soon as they are complete.

Materials:

• Displaying & Describing Data Vocabulary Card

• AP Review Notes: Displaying & Describing Data

Assignments:

• Unit 1 Progress Checks: MCQ Part A & MCQ Part B Due by 11:00pm Wednesday, April 1st

• Unit 1 Progress Check: FRQ Due by 11:00pm Wednesday, April 1st

• Unit 2 Progress Check: MCQ Part A Due by 11:00pm Thursday, April 2nd

The above assignments can be found on the College Board website.

Highly Suggested (If we were in class, we would do all of these)

• 2005 Question #1


• 2005B Question #1 & 2

• 2004B Question # 5



Link to FRQ Questions & Solutions

https://www.youtube.com/user/advancedplacement

https://apcentral.collegeboard.org/courses/ap-statistics/exam/past-exam-questions?course=ap-statistics

Displaying & Describing Data-Vocabulary

Created by: Loren L. Spencer Q2.1

Statistics: The study of how to collect, organize, analyze, interpret and draw conclusions from data.

Describing Data-C.U.S.S.

Center

Mean (μ, x̄)-the sum of all the data divided by the number of data values; the expected value.

Significantly impacted by outliers.

Median-the middle most data value in a line of

data arranged from least to greatest. 50% of the

data are greater and 50% of the data are smaller.

Not significantly impacted by outliers.

Note: Place data values in STAT EDIT

Press STAT CALC 1-Var STATS

• When a constant is added to each data value, the center increases by that constant.

• When each data value is multiplied by a constant, the center is multiplied by that constant.

Unusual Features Gaps-Places in a distribution with no data values.

Clusters-groups of data points-give the value of

the centers of the groups and the ranges.

Outliers-data points that are significantly larger or smaller than the remaining data values:

more than 1.5 x IQR below Q1 or above Q3, or more than 3 standard deviations from the mean.

Shape Unimodal-one mound-the most common occurring

data value-highest probability of occurring.

Bimodal-two mounds or high points-the two most

common-the two values most likely to occur.

Multimodal-more than two mounds

Uniform-all values are equally likely to occur. The

bars of a histogram are of equal height.

Mound Shaped- The mean equals the median

and mode. Use this description for bell curves,

t-distributions and normal curves.

Skewed Right-the mean is to the right/greater

than the median. The tail is to the right.

All Chi-squared distributions are skewed right

Skewed Left-the mean is to the left/smaller

than the median. The tail is to the left.

Symmetric-The mean equals the median.

The mode may be different (for example, bimodal or inverse bell distributions).

Describing Data-C.U.S.S.-cont’d.

Spread

• When a constant is added to each data value, the spread does not change.

• When each data value is multiplied by a constant, each measure of spread is multiplied by the absolute value of that constant. Exception: the variance is multiplied by the constant’s square.
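The two transformation rules for center and spread can be checked numerically. The sketch below uses a small made-up data set (the values, the added constant 10 and the multiplier 3 are all assumptions chosen for illustration) and Python's standard `statistics` module.

```python
from statistics import mean, median, pstdev, pvariance

data = [2, 4, 4, 6, 9]             # hypothetical data values

shifted = [x + 10 for x in data]   # add a constant to each value
scaled = [x * 3 for x in data]     # multiply each value by a constant

# Adding 10 moves the mean and median up by 10; the spread is unchanged.
center_shift = mean(shifted) - mean(data)
spread_change = pstdev(shifted) - pstdev(data)

# Multiplying by 3 multiplies the center and the standard deviation by 3,
# and multiplies the variance by 3 squared.
sd_ratio = pstdev(scaled) / pstdev(data)
var_ratio = pvariance(scaled) / pvariance(data)

print(center_shift, spread_change, sd_ratio, var_ratio)
```

Because of floating-point rounding, the printed ratios may differ from 3 and 9 in the last decimal place.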

Range-the largest data value minus the smallest.

Significantly impacted by outliers.

Interquartile Range:Q3 – Q1. The data value at

the 1st quartile subtracted from the data value at

the 3rd quartile.

The range of the Middle 50% of the data

Not significantly impacted by outliers.

Standard Deviation-(𝝈, s)-The square root of

the average of the squared data differences from

the mean. The square root of the variance.

Z-scores are a measure of standard deviations

from the mean and are related to percentile ranks.

Variance- The standard deviation squared. The

average of the squared differences from the mean.

Univariate Data Displays • Label the graph’s axes

• Title the graph

Categorical Displays-Displays that have one

axis which is comprised of a list of qualitative

values rather than quantitative values and measure

frequency or relative frequency (percent).

Note: There is no description of spread, shape, or

center for categorical data. The Mode measures

frequency & is a valid measure for categorical data.

DotPlots: Measure the frequency of categorical

variables and provide the exact counts of the data.

Barcharts/Bargraphs-Provide frequencies or

relative frequencies as (percents) and represent

the categorical data counts as areas.

Note: Bars should not touch because the data

categories are not sequential/numerical in nature.

PieCharts/Graphs- Represent categorical data

as areas & relative frequencies as (percents) .


Quantitative Displays-Displays in which one

axis is a listing of sequential quantitative values.

• Label the graph’s axes

• Title the graph

Histogram: a frequency distribution in which the height of each class/bar is proportional

to the frequency of the values in that class.

• Useful for large data sets

• Provides shape and an idea of spread

• Individual data values are usually lost

• The area of a bar is proportional to the frequency of its class

• Vertical axis can be either frequencies or

relative frequencies/percents.

Note: Bars should touch because the data is

sequential/numerical in nature.

Stemplot- a graphical display that separates

the ones digits from the remaining digits.

• Provides shape

• Maintains all data values so exact summary

statistics can be calculated

• Unwieldy for large data sets.

Note: must provide a legend

Cumulative Relative Frequency Plot (ogive)- Plots cumulative frequencies for data from left to

right such that the largest data point furthest to

the right will be at 100%.

• Skewed right data will increase rapidly at first

then more slowly later. Begins convex

• Skewed left data will increase slowly at first

and more rapidly later. Begins concave

Box Plot (box and whisker plot)-a visual

representation of the five number summary.

The smallest value, the 1st quartile value, the

median, the 3rd quartile value and the largest value.

Note: Each section contains 25% of the data

Remember: The mean is not shown & the variance

cannot be calculated. The range can be computed.

Calculator:

• Enter Data into STAT Edit

• Press 2nd STAT PLOT

• Select the Box plot with outliers & Press

graph Press Zoom 9

Do Not Forget to label your axis

Comparing Distributions • Be Specific

• Compare and contrast the centers

• Compare and contrast the spread

• Compare and contrast the shape

Bivariate Data Scatterplot: Both axes represent variables either

the response variable (output) or the explanatory

variable (input).

Note: Neither axis represents a frequency

Empirical Rule • Applies to symmetric mound/bell shaped

curves.

• 68% of the data is within ± 1 standard deviation of the mean

• 95% of the data is within ± 2 standard deviations of the mean

• 99.7% of the data is within ± 3 standard deviations of the mean
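The 68-95-99.7 percentages come from areas under the normal curve. As a rough check, the sketch below computes those areas with the error function from Python's `math` module; the helper name `within_k_sd` is made up for this illustration.

```python
from math import erf, sqrt

def within_k_sd(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

# Areas within 1, 2 and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
```

The results round to roughly 68%, 95% and 99.7%, which is why the rule only applies to mound/bell shaped distributions.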


Quick Hints: Percentile ranks are the areas under the

distributions and should be read left to right.

Z-scores-measure the distance in number of

standard deviations a value is from the mean. The

related area under the curve tells us the chance

that an event will happen.

Remember: we are always calculating the likelihood that an outcome is either greater than or less than a specific value. In essence we are calculating the chance of a range of values.

We do not calculate the probability of a specific

data value when working with continuous data.

Standard Normal-the distribution which is

created after z-scores have been computed.

𝝁 = 0 and 𝝈 = 1

Before the z-score is computed the mean is the

mean of the population and the standard deviation

is the standard deviation of the population.

Displaying & Describing Data Note Spiral pgs 39-71

I. Types of Variables

A. Categorical vs. Quantitative

1. Categorical variables represent types of data which may be divided into groups or

categories.

Examples of categorical variables are race, sex, age group, color, and educational level.

With Categorical Data we keep track of the counts. Because Categorical data is not

numerical, we do not calculate the mean or median or standard deviation. We also do not talk

about shape or unusual features. We do talk about the Mode, that is to say which category

occurs the most.

• Example: The following table shows data for the 8 longest roller coasters in the world as of 2015.

Length (feet)  Type   Speed (mph)  Height (feet)  Drop (feet)  Continent
8,133          Steel  95           318            310          Asia
7,442          Steel  50           107            102          Europe
7,359          Wood   65           110            135          North America
6,709          Steel  81           259            230          Asia
6,602          Steel  95           325            320          North America
6,595          Steel  93           310            300          North America
6,562          Steel  149          171            168          Asia
6,442          Wood   67           163            154          North America

Which of the following variables is categorical?

A. Length

B. Type

C. Speed

D. Height

E. Drop

2. Quantitative variables express a certain quantity, amount or range. Usually, there are

measurement units associated with the data, such as meters, hours, pounds…

Examples: Height, weight, length and time, the number of eggs, how many of something

• Example: Data were collected on 100 United States coins minted in 2018. Which of the following

represents a quantitative variable for the data collected?

A. The type of metal used in the coin

B. The value of the coin

C. The color of the coin

D. The person depicted on the face of the coin

E. The location where the coin was minted

B. Discrete vs. Continuous

1. Discrete data is Numerical Data that can be counted.

Example: There are 13 Girls and 15 Guys; 18 people voted Yes, the number of likes, the

number of eggs, the number of days etc.

2. Continuous data is Numerical Data that can take on any real number on the number line.

Example: Weight, height, time, temperature and distance

Discrete vs. Continuous-Continued

• Example: The following list shows the number of video games sold at a game store each day for one

week.

15, 43, 50, 39, 22, 16, 20

Which of the following is the best classification of the data in the list?

A. Categorical and continuous

B. Quantitative and continuous

C. Categorical and discrete

D. Quantitative and discrete

E. Neither categorical nor quantitative, and neither discrete nor continuous

C. Bivariate Data: bi-variate data is two variable data. (Typically the variables are related or we

believe them to be so we say that the variables are paired together.)

A. Categorical Bivariate Data: Compares the counts of 2 categorical variables.

Example: number of speeding tickets vs Gender or Mechanical Delays vs. Weather Delays

We would normally display the counts of this data in a 2-way table

B. Quantitative Bivariate Data: Compares 2 quantitative variables.

Example: strength & age; electricity generation & hours of sunlight; blood pressure & weight.

All of these have numerical measurements that are more than just counts.

We would normally display quantitative bivariate data in a scatterplot.

II. Categorical Displays Displays that have one axis which is comprised of a list of qualitative values

rather than quantitative values and measure frequency or relative frequency (percent).

Note: There is no description of spread, shape, or center for categorical data. The Mode measures

frequency & is a valid measure for categorical data.

A. Tables: A table lists the number of times an outcome occurs.

• Example: The following table summarizes the number of pies sold at a booth one day at a local farmers market.

Type of Pie Frequency

Apple 18

Blueberry 14

Cherry 16

Key Lime 12

Peach 12

Pumpkin 18

Which of the following statements is supported by the table?

A. More Cherry Pies were sold than any other type of pie

B. Twice as many apple pies as key lime pies were sold

C. More than half the pies sold were apple

D. Fewer than 50 pies were sold at the booth that day

E. The combined percentage of key lime pies and pumpkin pies sold was less than 50%

B. Barcharts/Bargraphs-Provide frequencies or relative frequencies as (percents) and

represent the categorical data counts as areas.

Note: Bars should not touch because the data categories are not sequential/numerical in nature.

• Example: In a certain school district, students from grade 6 through grade 12 can participate in a

school-sponsored community service activity. The following bar chart shows the relative frequencies of

students from each grade who participate in the community service activity.

Which of the following statements is supported by the bar chart?

A. The greatest number of participating students was in grade 9.

B. The number of participating students in grade 6 was equal to the number of participating

students in grade 7.

C. The relative frequency of all participating students in grades 6 and 7 combined was 0.60.

D. Grade 12 had the least relative frequency of participating students.

E. Grade 11 had the greatest relative frequency of participating students.

• Example: The following bar chart shows the relative frequency of days of rain for 30 days in four

regions of a certain state.

Which of the following statements is not supported by the bar chart?

A. Region D had the greatest percentage of days of rain.

B. Region B had the least percentage of days of rain.

C. Region A had more than 15 days of rain.

D. Region C had more than 25 days of rain.

E. Region D had less than 23 days of rain.

C. Segmented bar chart: A stacked bar chart in which each column is divided into segments which

are proportional in size to that segment’s representation within the population. Please be aware

that these are relative frequencies (proportions) and do not provide insight as to the number/size

of the sample.

D. PieCharts/Graphs- Represent categorical data as areas & relative frequencies as (percents).

E. Cumulative Frequency Graphs: Provides a running total of frequencies.

• Example: 2006 Form B Question 1

A large regional real estate company keeps records of home sales for each of its sales agents. Each

month, the company publishes the sales volume for each agent. Monthly sales volume is defined as

the total sales price of all homes sold by the agent during a month. The figure below displays the

cumulative relative frequency plot of the most recent monthly sales volume (in hundreds of

thousands of dollars) for these agents.

(a) In the context of this question, explain what information is conveyed by the circled point.

(b) What proportion of sales agents achieved monthly sales volume between $700,000 and

$800,000?

(c) For values between 10 and 11 on the horizontal axis, the cumulative relative frequency plot is

flat. In the context of this question, explain what this means.

(d) A bonus is to be given to 20 percent of the sales agents. Those who achieved the highest

monthly sales volume during the preceding month will receive a bonus. What is the minimum

monthly sales volume an agent must have achieved to qualify for the bonus?

F. 2-way Tables: Display the counts of 2 categorical variables. (this is bivariate data) In the problem

below the sample size is 200 and the row totals and column totals represent the marginal

frequencies. The numbers in the table that are not totals are the joint frequencies or intersections

and represent the number of times that two separate outcomes (this

and that) both occurred. The key word for a joint frequency is AND.

G. Conditional Probability- With conditional probability, we know something about our sample. As a

consequence, we are no longer looking at the entire sample; we are only looking at a portion or

subset of the sample (we are looking at a particular category as opposed to the whole).

$\text{Conditional Probability} = \dfrac{\text{Joint Frequency}}{\text{Marginal Frequency}} = \dfrac{\text{"and", "both"}}{\text{Given}} = \dfrac{\text{intersection}}{\text{Given}}$

• Example: 2003 Form B Problem 2: A simple random sample of adults living in a suburb of a large city

was selected. The age and annual income of each adult in the sample were recorded. The resulting data

are summarized in the table below.

                              Annual Income
Age Category   $25,000-$35,000   $35,001-$50,000   Over $50,000   Total
21-30          8                 15                27             50
31-45          22                32                35             89
46-60          12                14                27             53
Over 60        5                 3                 7              15
Total          47                64                96             207

(a) What is the probability that a person chosen at random from those in this sample will be in the

31-45 age category?

(b) What is the probability that a person chosen at random from those in this sample whose incomes are

over $50,000 will be in the 31-45 age category? Show your work.

(c) Based on your answers to parts (a) and (b), is annual income independent of age category for those in

this sample? Explain.
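Parts (a) and (b) are the marginal and conditional probabilities described in section G. A minimal sketch, using only the totals and the one joint count read from the table above (the variable names are made up for this illustration):

```python
total = 207        # everyone in the sample
age_31_45 = 89     # marginal frequency: row total for ages 31-45
over_50k = 96      # marginal frequency: column total for income over $50,000
both = 35          # joint frequency: ages 31-45 AND income over $50,000

# Part (a): probability of the 31-45 age category for the whole sample.
p_age = age_31_45 / total

# Part (b): the condition "income over $50,000" shrinks the sample to that column.
p_age_given_income = both / over_50k

print(round(p_age, 3), round(p_age_given_income, 3))
```

If income and age category were independent in this sample, the two printed values would be equal, which is the comparison part (c) asks for.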

H. Independence : Two events are considered to be independent, if the probability of one does not

impact the probability of the other. If categories are independent then they are proportional. In

addition, if the variables are independent, then the marginal probability of one times the marginal

probability of the second will equal their joint probability. Remember the multiplication rule:

If independent, P(A∩B) = P(A) × P(B), or P(A and B) = the probability of A times the probability of B

1. Tables- If the variables are independent then the outcomes of the variables will be

proportional.

• Example:

Find the missing table value that results in perfect independence.

133 95

105
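Perfect independence means the rows are proportional, so the missing value can be found by setting up a proportion. The sketch below assumes the blank cell is the bottom-right entry of the table (the original layout is not fully recoverable, so this placement is an assumption).

```python
from fractions import Fraction

# Assumed layout (blank cell taken to be the bottom-right entry):
#   133    95
#   105     ?
a, b, c = 133, 95, 105

# Proportional rows: a / b = c / missing, so missing = b * c / a.
missing = Fraction(b * c, a)
print(missing)
```

Using `Fraction` keeps the arithmetic exact, so it is easy to see whether the answer is a whole number.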

• Example:

Job No Job Total

Juniors 13 5 18

Seniors 13 26 39

Total 26 31 57

A survey of 57 students was conducted to determine whether or not they held jobs outside of school. The

two-way table above shows the numbers of students by employment status (job, no job) and class (juniors,

seniors). Which of the following best describes the relationship between employment status and class?

(A) There appears to be no association, since the same number of juniors and seniors have jobs.

(B) There appears to be no association, since close to half of the students have jobs.

(C) There appears to be an association, since there are more seniors than juniors in the survey.

(D) There appears to be an association, since the proportion of juniors having jobs is much larger than

the proportion of seniors having jobs.

• Example:2010 Form B Number 5

An advertising agency in a large city is conducting a survey of adults to investigate whether there is an

association between highest level of education achievement and primary source for news. The company

takes a random sample of 2,500 adults in the city. The results are shown in the table below.

a) If an adult is to be selected at random from this sample, what is the probability that the selected adult

is a college graduate or obtains news primarily from the internet?

b) If an adult who is a college graduate is to be selected at random from this sample, what is the

probability that the selected adult obtains news primarily from the internet?

c) When selecting an adult at random from the sample of 2,500 adults, are the events “is a college

graduate” and “obtains news primarily from the internet” independent? Justify your answer.

d) The company wants to conduct a statistical test to investigate whether there is an association between

educational achievement and primary source for news for adults in the city. What is the name of the

statistical test that should be used? What are the appropriate degrees of freedom for this test?

2. Independence of Segmented Bar Charts-If the variables are independent then the outcomes will

be proportional and as such the corresponding segments will be the same size in each bar.

• Scenario: 2011 Problem 2 The table below shows the political party registration by gender of all 500

registered voters in Franklin Township.

PARTY REGISTRATION – FRANKLIN TOWNSHIP

Party W Party X Party Y Total

Female 60 120 120 300

Male 28 124 48 200

Total 88 244 168 500

(a) Given that a randomly selected registered voter is a male, what is the probability that he is

registered for Party Y?

(b) Among the registered voters of Franklin Township, are the events “is a male” and “is registered for

Party Y” independent? Justify your answer based on Probabilities calculated from the table above.

(c) In Lawrence Township, the proportions of all registered voters for Parties W, X, and Y are the same

as for Franklin Township, and party registration is independent of gender. Complete the graph below

to show the distributions of party registration by gender in Lawrence Township

$\text{Party W} = \frac{88}{500} = .176$

$\text{Party X} = \frac{244}{500} = .488$

$\text{Party Y} = \frac{168}{500} = .336$
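The Franklin Township questions can be worked directly from the counts in the table. The sketch below is one way to organize that arithmetic; the dictionary layout and variable names are assumptions made for this illustration.

```python
# Franklin Township counts, in the order Party W, Party X, Party Y.
counts = {"Female": [60, 120, 120], "Male": [28, 124, 48]}
total = sum(sum(row) for row in counts.values())

p_male = sum(counts["Male"]) / total                            # marginal
p_party_y = (counts["Female"][2] + counts["Male"][2]) / total   # marginal
p_male_and_y = counts["Male"][2] / total                        # joint
p_y_given_male = counts["Male"][2] / sum(counts["Male"])        # part (a)

# Part (b): multiplication rule check, P(A and B) compared with P(A) x P(B).
independent = abs(p_male * p_party_y - p_male_and_y) < 1e-9
print(p_y_given_male, p_party_y, independent)

# Part (c): under independence every bar uses the overall party proportions.
party_totals = [f + m for f, m in zip(counts["Female"], counts["Male"])]
lawrence_bar = [t / total for t in party_totals]
print(lawrence_bar)
```

In a segmented bar chart for Lawrence Township, both the female bar and the male bar would be split using the same three proportions in `lawrence_bar`.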

3. Simpson’s Paradox: when the results from a combined grouping appear to contradict the results from the individual groupings. Simpson’s Paradox arises when two or more sub-groups are combined to form a single group, the sizes of the sub-groups differ significantly, and the proportions in each group differ.

• Example:

Administrators at a state university computed the mean GPA (grade point average) for juniors and

seniors majoring in either physics or chemistry. The results are displayed in the table below. When

juniors and seniors are grouped together, could physics majors have a higher mean GPA than chemistry

majors?

Physics Chemistry

Juniors 2.8 3.0

Seniors 3.2 3.6

Overall ? ?

(A) No. The physics majors’ mean GPA for juniors and seniors must be 3.0, while the chemistry majors’

mean GPA for juniors and seniors must be 3.3.

(B) No. There is not enough information to determine the mean GPA for each major, but it must be

higher for chemistry majors than for physics majors.

(C) Yes. It could happen. Whether it does happen depends on the number of juniors and seniors in each

major.

(D) Yes. It could happen. Whether it does happen depends on the variability of the GPAs within each of

the four groups of students.

(E) Yes. It could happen. Whether it does happen depends on the shapes of the distributions of the

GPAs for each of the four groups of students.
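Whether the combined means can reverse depends on how many students are in each group. The sketch below uses the GPA values from the table with made-up group sizes (the counts 10 and 90 are assumptions, not data from the problem) to show one way the reversal could happen.

```python
def overall_mean(groups):
    """groups: list of (number of students, mean GPA) pairs."""
    total_points = sum(n * gpa for n, gpa in groups)
    total_students = sum(n for n, _ in groups)
    return total_points / total_students

# Hypothetical group sizes: physics is mostly seniors, chemistry mostly juniors.
physics = [(10, 2.8), (90, 3.2)]      # (juniors, seniors)
chemistry = [(90, 3.0), (10, 3.6)]

physics_overall = overall_mean(physics)
chemistry_overall = overall_mean(chemistry)
print(round(physics_overall, 2), round(chemistry_overall, 2))
```

With these sizes the physics majors have the higher combined mean even though chemistry majors have the higher mean within each class year.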

III. Quantitative Variable Displays:

A. Stem and Leaf Plots: This type of data display maintains all of the original data values and it

provides an idea of center, shape, spread and any unusual features in a data set.

• Example: 2010 Form B Problem 1 (part b.)

Twenty concentrations of aldrin for River X are given below.

3.4 4.0 5.6 3.7 8.0 5.5 5.3 4.2 4.3 7.3

8.6 5.1 8.7 4.6 7.5 5.3 8.2 4.7 4.8 4.6

Construct a stemplot that displays the concentrations of aldrin for River X.

B. Histograms: In a histogram, the height of each class/bar is proportional to the frequency of the values in that class. Histograms are useful for large data sets, and they provide an idea of center, shape and spread and show unusual features of the data set. However, individual data values are not shown in a histogram. Please note that depending on the width of the bars, histograms will look different for the same data.
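For the twenty River X aldrin concentrations listed in the stemplot example above, a stemplot could be sketched as follows. The choice of the ones digit as the stem and the tenths digit as the leaf is an assumption about how the plot is meant to be drawn.

```python
from collections import defaultdict

aldrin = [3.4, 4.0, 5.6, 3.7, 8.0, 5.5, 5.3, 4.2, 4.3, 7.3,
          8.6, 5.1, 8.7, 4.6, 7.5, 5.3, 8.2, 4.7, 4.8, 4.6]

stems = defaultdict(list)
for value in sorted(aldrin):
    tenths = round(value * 10)              # 3.4 becomes 34
    stems[tenths // 10].append(tenths % 10)

# Print every stem from smallest to largest, including stems with no leaves,
# so that gaps in the distribution stay visible.
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
print("Key: 3 | 4 means 3.4")
```

The last printed line is the legend that a stemplot must always include.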

C. Box Plots: Boxplots provide a visual representation of the 5 number summary. Each section of

a boxplot contains 25% of the data. The exact shape of a distribution, as well as the mean and

variance, cannot be determined from a boxplot display. Box plots do show outliers.

The five number summary consists of the minimum value, the 1st quartile, the median, the

3rd quartile and the maximum value. Because the minimum and maximum are given the range

can be calculated. The IQR can be calculated by using the formula: IQR = Q3 – Q1

• Example: 2000 Problem 3 Five hundred randomly selected middle-aged men and five hundred randomly

selected young adult men were rated on a scale from 1 to 10 on their physical flexibility, with 10 being

the most flexible. Their ratings appear in the frequency table below. For example, 17 middle-aged men

had a flexibility rating of 1.

(a) Display these data graphically so that the flexibility of middle-aged men and young adult men can be

easily compared.

(b) Based on an examination of your graphical display, write a few sentences comparing the flexibility of

middle-aged men with the flexibility of young adult men.

The median flexibility scores differ by 1: the young men’s median is higher at 7 while the middle-aged men’s median is 6, which suggests that young men have more flexibility than middle-aged men.

The distribution for middle-aged men is reasonably symmetric while that for young men is skewed to the left. The upper 50% of the young men have flexibility ratings higher than 75% of the middle-aged men.

The flexibility scores of middle-aged men and young men both run from 1 to 10 (a range of 9) and fall within the

same values. Though the values ar

gher the interquartile range for both

Physical Flexibility Rating    Frequency of Middle-Aged Men    Frequency of Young Adult Men

1 17 4

2 31 17

3 49 29

4 71 39

5 70 54

6 87 69

7 78 83

8 54 93

9 34 73

10 9 39
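The medians and quartiles quoted in the comparison can be read off the frequency table by keeping a running total. The sketch below does this in Python; it assumes the convention that the quartiles are the medians of the lower and upper halves of the data, and the helper names are made up for this illustration.

```python
ratings = range(1, 11)
middle_aged = list(zip(ratings, [17, 31, 49, 71, 70, 87, 78, 54, 34, 9]))
young = list(zip(ratings, [4, 17, 29, 39, 54, 69, 83, 93, 73, 39]))

def value_at(freqs, position):
    """Rating found at a 1-based position in the sorted data."""
    running = 0
    for rating, count in freqs:
        running += count
        if position <= running:
            return rating

def middle(freqs, lo, hi):
    """Median of the sorted values in positions lo through hi."""
    left, right = (lo + hi) // 2, (lo + hi + 1) // 2
    return (value_at(freqs, left) + value_at(freqs, right)) / 2

def summary(freqs):
    """Return (Q1, median, Q3) for an even number of observations."""
    n = sum(count for _, count in freqs)
    half = n // 2
    return middle(freqs, 1, half), middle(freqs, 1, n), middle(freqs, half + 1, n)

print(summary(middle_aged), summary(young))
```

Subtracting the first value from the third in each result gives the interquartile range for that group.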

D. Scatter Plots: The above displays are univariate displays, that is to say they keep track of the

counts of a single variable. Scatterplots are bi-variate (measure two variables, X & Y and

represent them as a single point on a graph). Scatterplots are used to show the paired

relationship between two variables (X & Y). Scatterplots allow us to determine if a

correlation between two variables exists.

• Example: 1998 Question 2 A plot of the number of defective items produced during 20 consecutive

days at a factory is shown below.

(a) Draw a histogram that shows the frequencies of the number of defective items.

(b) Give one fact that is obvious from the histogram but is not obvious from the scatterplot.

(c) Give one fact that is obvious from the scatterplot but is not obvious from the histogram.

IV. Describing Data

A. Describing Categorical Displays-We talk about the number of counts and we can talk about the mode, as in which category had the most observations. One axis shows the categories and the other axis shows the number of counts. Because the data is categorical in nature, and one axis does not follow a number line, order does not matter.

(Example: Which comes first: basketball, soccer, swimming, football, golf, baseball, volleyball, cheer, etc.? Maybe it's alphabetical, or least to greatest, or greatest to least. The order depends on who is putting together the display.)

When it comes to Categorical Displays, because order does not matter, we do not C.U.S.S.

B. Describing Quantitative Displays- For Quantitative Data Displays, both axes are number lines.

As a consequence, order does matter and we must C.U.S.S.

1. Calculating Summary Statistics:

• Go to STAT Edit on the calculator

• Input the values into L1

• Input the frequencies into L2

• Press STAT CALC

• Select 1-Var STATS & Press Enter

• Make certain your List is L1

• Make Certain the FreqList is L2

• Example: A television game show has three payoffs with the following probabilities:

Payoff ($) 0 500 5,000

Probability .7 .25 .05

What are the mean and standard deviation of the payoff variable?

(A) 𝜇 = 375, 𝜎 = 361

(B) 𝜇 = 375, 𝜎 = 1,083

(C) 𝜇 = 1,833, 𝜎 = 1,816

(D) 𝜇 = 1,833, 𝜎 = 2,248

(E) None of the above gives a set of correct answers.
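The calculator steps above (values in L1, probabilities in L2) correspond to the weighted formulas for a discrete random variable. A minimal sketch of the same computation for the game show payoffs:

```python
from math import sqrt

payoffs = [0, 500, 5000]
probs = [0.7, 0.25, 0.05]

# Mean: each payoff weighted by its probability.
mu = sum(x * p for x, p in zip(payoffs, probs))

# Variance: squared distance from the mean, weighted by probability.
variance = sum((x - mu) ** 2 * p for x, p in zip(payoffs, probs))
sigma = sqrt(variance)

print(round(mu), round(sigma))
```

Comparing the printed values with the answer choices identifies the matching option.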

• Example: The number of hybrid cars a dealer sells weekly has the following probability distribution:

Number of hybrids 0 1 2 3 4 5

Probability .32 .28 .15 .11 .08 .06

The dealer purchases the cars for $21,000 and sells them for $24,500. What is the expected

weekly profit from selling hybrid cars?

(A) $2,380

(B) $3,500

(C) $5,355

(D) $8,109

(E) $37,485
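The expected weekly profit is the expected number of cars sold times the profit on each car. A short sketch of that calculation, with variable names chosen for this illustration:

```python
cars = [0, 1, 2, 3, 4, 5]
probs = [0.32, 0.28, 0.15, 0.11, 0.08, 0.06]

# Expected number of hybrids sold in a week.
expected_cars = sum(x * p for x, p in zip(cars, probs))

# Profit per car is the selling price minus the purchase price.
profit_per_car = 24_500 - 21_000
expected_profit = expected_cars * profit_per_car

print(round(expected_cars, 2), round(expected_profit))
```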

2. Center-The measure of the middle

a) Mean: The average of all of the values (the mean is impacted by outliers and non-

symmetric distributions). The mean is the balance point on a scale or see-saw.

• Example: Suppose the starting salaries of a graduating class are as follows:

Number of Students Starting Salary ($)

10 15,000

17 20,000

25 25,000

38 30,000

27 35,000

21 40,000

12 45,000
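When data come as a frequency table like the one above, the mean is a weighted average: each salary counts once for every student who earns it. A minimal sketch of that calculation:

```python
# (number of students, starting salary in dollars) from the table above
salary_table = [(10, 15_000), (17, 20_000), (25, 25_000), (38, 30_000),
                (27, 35_000), (21, 40_000), (12, 45_000)]

total_students = sum(n for n, _ in salary_table)
total_salary = sum(n * salary for n, salary in salary_table)

mean_salary = total_salary / total_students
print(total_students, round(mean_salary))
```

This is the same result the calculator gives when the salaries are entered in L1 and the student counts in L2 as the FreqList.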

b) Median: the middle most data value in a line of data arranged from least to greatest.

50% of the data are greater and 50% of the data are smaller. (The median is not

impacted by outliers and is useful for non-symmetric distributions)

• Example: The following list shows the selling prices of 8 houses in a certain town.

House Price House Price

A $302,100 E $275,000

B $275,800 F $295,000

C $305,400 G $281,000

D $250,600 H $284,700

What is the median selling price of the houses in the list?

A. $263,200

B. $283,300

C. $284,700

D. $288,450

E. $290,600

3. Unusual Features

i. Gaps: Places in a distribution with no data values (give the location of the gaps)

Just because there is a large gap doesn’t mean there is an outlier

ii. Clusters: Groups of data points-Provide the value of the center and the range of

all clusters/groups.

iii. Outliers: Data points that are significantly larger or smaller than the other data

values.

a) Standard Deviation Method: Any data value that is more than 2 or 3

standard deviations away from the mean.

• Example (continued from the starting salary table in the Mean section above): What is the mean starting salary?

(A) $30,000

(B) $30,533

(C) $32,500

(D) $32,533

(E) $35,000

• Example: A statistician at a metal manufacturing plant is sampling the thickness of metal plates. If

an outlier occurs within a particular sample, the statistician must check the configuration of the

machine. The distribution of metal thickness has mean 23.5 millimeters (mm) and standard deviation

1.4 mm. Based on the two-standard deviations rule for outliers, of the following, which is the

greatest thickness that would require the statistician to check the configuration of the machine?

A. 19.3

B. 20.6

C. 22.1

D. 23.5

E. 24.9
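The two-standard-deviations rule marks any value more than 2 standard deviations from the mean as an outlier. The sketch below applies that rule to the answer choices in the metal plate example; the list name `choices` is made up for this illustration.

```python
mean_mm, sd_mm = 23.5, 1.4

# Cutoffs two standard deviations below and above the mean.
low = mean_mm - 2 * sd_mm
high = mean_mm + 2 * sd_mm

choices = [19.3, 20.6, 22.1, 23.5, 24.9]
flagged = [x for x in choices if x < low or x > high]

print(round(low, 1), round(high, 1), flagged)
```

The largest flagged value is the greatest listed thickness that would require a check of the machine.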

b) 5 Number Summary Method:

A data value X < Q1 − 1.5×(IQR)

A data value X > Q3 + 1.5×(IQR)

• Example:

A random sample of golf scores gives the following summary statistics: n = 20, x̄ = 84.5, Sx = 11.5,

minX = 68, Q1 = 78, Med = 86, Q3 = 91, maxX = 112. What can be said about the number of outliers?

(A) 0

(B) 1

(C) 2

(D) At least 1

(E) At least 2
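The 1.5×IQR rule builds a lower fence below Q1 and an upper fence above Q3. The sketch below applies it to the golf score summary; the function name `fences` is an assumption made for this illustration.

```python
def fences(q1, q3):
    """Lower and upper outlier cutoffs from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

low, high = fences(78, 91)

min_is_outlier = 68 < low      # the minimum golf score
max_is_outlier = 112 > high    # the maximum golf score

print(low, high, min_is_outlier, max_is_outlier)
```

A five number summary only shows the two extreme values, so this check can establish that at least one outlier exists but not the exact number of outliers.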

• Example: The following boxplot summarizes the heights of a sample of 100 trees growing on a tree

farm.

Emily claims that a tree height of 43 inches is an outlier for the distribution. Based on

the 1.5×IQR rule for outliers, is there evidence to support the claim?

A. Yes, because (max−Q3) is greater than (Q1−min).

B. Yes, because 43 is greater than (Q3+IQR).

C. Yes, because 43 is greater than (Q1−1.5×IQR).

D. No, because 43 is not greater than (Q3+1.5×IQR).

E. No, because 43 is greater than (Q1−1.5×IQR).

• Example: 2001 Question 1: The summary statistics for the number of inches of rainfall in Los Angeles

for 117 years, beginning in 1877, are shown below:

N    MEAN    MEDIAN  TRMEAN  STDEV  SE.MEAN
117  14.941  13.070  14.416  6.747  0.624

MIN    MAX     Q1     Q3
4.850  38.180  9.680  19.250

(a) Describe a procedure that uses these summary statistics to determine whether there are

outliers.

(b) Are there outliers in these data? _______

Justify your answer based on the procedure that you described in part (a).

(c) The news media reported that in a particular year, there were only 10 inches of rainfall. Use the

information provided to comment on this reported statement.

4. Spread

i. Range: The range is a single value, is never negative, and is calculated by the

formula: Range = Max – Min. Because the range uses the maximum and minimum

values, it is greatly impacted by outliers.

ii. Interquartile Range: The IQR describes how spread out the data are by

focusing on the middle 50% of the data. It is computed by subtracting the value

of the 1st quartile from that of the 3rd quartile:

IQR = Q3 – Q1. Because the interquartile range uses only the middle 50% of the

data, it is not significantly impacted by outliers. The IQR is very useful for skewed distributions.

iii. Variance: The variance finds the difference of each value from the mean,

squares those differences, and sums them. The sum of the squared differences is then

divided by n for a population or by n − 1 for a sample. Because variance is based on the mean, the

variance is greatly impacted by outliers and non-symmetric distributions. The

formulas are as follows:

Population Variance: $\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{n}$, where $\mu$ is the true population mean

Sample Variance: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the mean of the sample

iv. Standard Deviation: The standard deviation is just the square root of the

variance. Because standard deviation is based on using the mean, the standard

deviation is greatly impacted by outliers and non-symmetric distributions. The

formula is as follows:

Population Standard Deviation: $\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{n}}$, where $\mu$ is the true population mean

Sample Standard Deviation: $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$, where $\bar{x}$ is the mean of the sample
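
The measures of spread above can be worked through on a small data set. The sketch below uses hypothetical data values chosen only so the arithmetic is easy to follow; the variable names are illustrative.

```python
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data values

n = len(data)
mean = sum(data) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data)   # sum of squared deviations

data_range = max(data) - min(data)   # Range = Max - Min
pop_var = sum_sq_dev / n             # population variance: divide by n
samp_var = sum_sq_dev / (n - 1)      # sample variance: divide by n - 1
pop_sd = sqrt(pop_var)               # standard deviation = square root of variance
samp_sd = sqrt(samp_var)

print(data_range, pop_var, pop_sd)
print(samp_var, samp_sd)
```

For the same data, dividing by n − 1 always gives a slightly larger value than dividing by n, which is why the sample standard deviation is a bit larger than the population version.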

• Example: Of the following dotplots, which represents the set of data that has the greatest

standard deviation?

A.

B.

C.

D.

E.

v. Z-scores: A z-score is a ratio that measures how far a value is

from the mean, taking into account both the center and the dispersion of the

data. Z-scores act as a ruler and can be used to compare values from differently shaped

distributions. The basic z-score formula is $z = \dfrac{x - \mu}{\sigma}$, where z is the number of

standard deviations a value lies from the mean.

• Example:

Gina's doctor told her that the standardized score (z-score) for her systolic blood pressure, as

compared to the blood pressure of other women her age, is 1.50. Which of the following is the best

interpretation of this standardized score?

(A) Gina's systolic blood pressure is 150.

(B) Gina's systolic blood pressure is 1.50 standard deviations above the average systolic blood pressure of

women her age.

(C) Gina's systolic blood pressure is 1.50 above the average systolic blood pressure of women her age.

(D) Gina's systolic blood pressure is 1.50 times the average systolic blood pressure of women her age.

(E) Only 1.5% of women Gina's age have a higher systolic blood pressure than she does.

• Example: The weights of a population of adult male gray whales are approximately normally distributed

with a mean weight of 18,000 kilograms and a standard deviation of 4,000 kilograms. The weights of a

population of adult male humpback whales are approximately normally distributed with a mean weight of

30,000 kilograms and a standard deviation of 6,000 kilograms. A certain adult male gray whale weighs

24,000 kilograms. This whale would have the same (z-score) as an adult male humpback whale with what

weight?
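
One way to set up the whale comparison is to compute the gray whale's z-score and then convert that z-score back into a weight on the humpback scale. The helper names `z_score` and `value_from_z` below are assumptions made for this sketch.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Data value that sits z standard deviations from the mean."""
    return mean + z * sd

# Gray whale: mean 18,000 kg, SD 4,000 kg, observed weight 24,000 kg.
gray_z = z_score(24_000, mean=18_000, sd=4_000)

# Humpback whale with the same z-score: mean 30,000 kg, SD 6,000 kg.
humpback_weight = value_from_z(gray_z, mean=30_000, sd=6_000)

print(gray_z)            # 1.5
print(humpback_weight)   # 39000.0
```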

• Example:

Some descriptive statistics for a set of test scores are shown above. For this test, a certain student has a

standardized score of z = -1.2. What score did this student receive on the test?

A. 266.28

B. 779.42

C. 1008.02

D. 1083.38

E. 1311.98

• Example: 2011 Question 1: A professional sports team evaluates potential players for a certain

position based on two main characteristics, speed and strength.

(a) Speed is measured by the time required to run a distance of 40 yards, with smaller times indicating

more desirable (faster) speeds. From previous speed data for all players in this position, the times

to run 40 yards have a mean of 4.60 seconds and a standard deviation of 0.15 seconds, with a

minimum time of 4.40 seconds, as shown in the table below.

Mean Standard deviation Minimum

Time to run 40 yards 4.60 seconds 0.15 seconds 4.40 seconds

Based on the relationship between the mean, standard deviation, and minimum time, is it reasonable to

believe that the distribution of 40-yard running times is approximately normal? Explain.

(b) Strength is measured by the amount of weight lifted, with more weight indicating more desirable

(greater) strength. From previous strength data for all players in this position, the amount of weight

lifted has a mean of 310 pounds and a standard deviation of 25 pounds, as shown in the table below.

Mean Standard deviation

Amount of weight lifted 310 pounds 25 pounds

Calculate and interpret the z-score for a player in this position who can lift a weight of 370 pounds.

(c) The characteristics of speed and strength are considered to be of equal importance to the team in

selecting a player for the position. Based on the information about the means and standard

deviations of the speed and strength data for all players and the measurements listed in the table

below for Players A and B, which player should the team select if the team can only select one of the

two players? Justify your answer.

Player A Player B

Time to run 40 yards 4.42 seconds 4.57 seconds

Amount of weight lifted 370 pounds 375 pounds

5. Shape

i. Symmetric: A vertical line can divide the graph into two matching mirror images.

If the data are symmetric, the mean and median will be approximately equal.

Examples:

ii. Mound Shape: The graph is symmetric with most of the data in the center of the

graph. The data density diminishes as you move towards each tail. The mean,

median and mode are all approximately equal. The normal distribution and the t-distributions

are mound shaped.

Example:

iii. Uniform: A graph that is approximately the same height.

Example:

iv. Bi-modal: The distribution has two distinct peaks (modes). The peaks may not be

equal in height.

Examples:

v. Skewed Left: The tail is to the left and the mean is less than the median, so the

mean lies to the left of the median. (The mean is being pulled to the left by the tail.)

vi. Skewed Right: The tail is to the right and the mean is greater than the median.
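
The relationship between skew and the mean and median can be seen on a small data set. The values below are hypothetical, chosen so that one large value creates a long right tail; the left-skewed list is simply its mirror image.

```python
from statistics import mean, median

# Hypothetical right-skewed data: one large value forms a long right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

# Mirror image of the same data, giving a long left tail.
left_skewed = [21 - x for x in right_skewed]

print(mean(right_skewed), median(right_skewed))  # 4.75 3.0 (mean > median)
print(mean(left_skewed), median(left_skewed))    # 16.25 18.0 (mean < median)
```

In both cases the median stays near the bulk of the data, while the mean is pulled toward the tail.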

• Example: Consider the shape of the graph of individual incomes in the United States.

What can be said about the ratio $\dfrac{\text{Mean Individual Income}}{\text{Median Individual Income}}$?

(A) Approximately zero

(B) Less than one, but definitely above zero

(C) Approximately one

(D) Greater than one

(E) Cannot be answered without knowing the standard deviation

• Example:

        A |    | B
      5 1 | 12 | 7
    7 3 0 | 13 | 2
  8 7 5 2 | 14 | 1 5
  8 5 3 1 | 15 | 3 6 9
    7 7 4 | 16 | 0 2 3 4 6 9
      6 2 | 17 | 2 5 6 6 8

Given the back-to-back stemplot above, which of the following is true?

(A) The Empirical Rule applies to both sets A and B.

(B) The median of each is approximately (120 + 170)/2.

(C) In one set the mean and median should be about the same, while in the other the mean appears

to be less than the median.

(D) The ranges of the two sets are equal.

(E) The variances of the two sets are approximately the same

• Example: For a sample of 42 rabbits, the mean weight is 5 pounds and the standard deviation of weights

is 3 pounds. Which of the following is most likely true about the weights for the rabbits in this sample?

A. The distribution of weights is approximately normal because the sample size is 42, and therefore

the central limit theorem applies.

B. The distribution of weights is approximately normal because the standard deviation is less than

the mean.

C. The distribution of weights is skewed to the right because the least possible weight is within 2

standard deviations of the mean.

D. The distribution of weights is skewed to the left because the least possible weight is within 2

standard deviations of the mean.

E. The distribution of weights has a median that is greater than the mean.

1 Transforming Data

A. Measures of Center:

1. Effects of adding or subtracting a constant: Measures of center (mean & median) will

shift by the amount added or subtracted from each data point

2. Effects of multiplying each data value by a constant: Measures of center (mean &

median) will be multiplied by that constant.

B. Measures of Spread:

1. Effects of adding or subtracting a constant to each data value: Measures of spread

(Range, IQR, Variance, and Standard Deviation) will not shift when a constant is added or

subtracted from each data point.

2. Effects of multiplying each data value by a constant:

a) Range, IQR, and standard deviation are multiplied by the absolute value of the

constant.

b) Variance is multiplied by the constant squared. This is because variance itself

is the square of the standard deviation.

C. Z-scores: Z-scores will not change when a constant is added to each data value or when each

data value is multiplied by a positive constant.
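
The transformation rules above can be checked numerically. The sketch below uses hypothetical data values, adds 10 to each value, and separately multiplies each value by 3; the constants and names are chosen only for illustration.

```python
from statistics import mean, median, pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical data values
shifted = [x + 10 for x in data]       # add a constant to each value
scaled = [3 * x for x in data]         # multiply each value by a constant

def z_of_first(values):
    """Z-score of the first value in the list (population SD)."""
    return (values[0] - mean(values)) / pstdev(values)

for label, values in [("original", data), ("x + 10", shifted), ("3x", scaled)]:
    print(label, mean(values), median(values),
          pstdev(values), pvariance(values), z_of_first(values))
```

Adding 10 moves the mean and median up by 10 and leaves the standard deviation and variance alone. Multiplying by 3 triples the mean, median, and standard deviation and multiplies the variance by 9. The z-score of the first value is the same in all three lists.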

• Example:

        A |    | B
      5 4 | 10 |
    8 6 1 | 11 | 4 5
      7 5 | 12 | 7 6 8
        7 | 13 | 5 7
  9 7 7 3 | 14 | 7
    8 4 0 | 15 | 3 7 7 9
        5 | 16 | 0 4 8
          | 17 | 5

Given this back-to-back stemplot, which of the following is incorrect?

(A) The distributions have the same mean.

(B) The distributions have the same range.

(C) The distributions have the same interquartile range.

(D) The distributions have the same standard deviation

(E) The distributions have the same variance.

• Example: Consider the following back-to-back stemplot:

0 3 4 8

1 0 1 2 5 6

8 4 3 2 2 9

6 5 2 1 0 3 2 5 5 7

9 2 4

7 5 5 2 5 6

6 1 4 5 8

6 7 0 9

8 5 4 1 8

9 0 9

Which of the following is a correct statement?

(A) The distributions have the same mean.

(B) The distributions have the same median.

(C) The interquartile range of the distribution to the left is 20 greater than the interquartile range of

the distribution to the right.

(D) The distributions have the same variance.

(E) None of the above is correct.

• Example: Suppose the average score on a national test is 500 with a standard deviation of 100. If

each score is increased by 25% what are the new mean and standard deviation?

(A) 500, 100

(B) 525, 100

(C) 625, 100

(D) 625, 105

(E) 625, 125

• Example: Suppose the average score on a national test is 500 with a standard deviation of 100. If

each score is increased by 25 points what are the new mean and standard deviation?

(A) 500, 100

(B) 500, 125

(C) 525, 100

(D) 525, 105

(E) 525, 125

• Example: The number of hybrid cars a dealer sells weekly has the following probability distribution:

Number of hybrids 0 1 2 3 4 5

Probability .07 .28 .23 .18 .13 .11

The dealer purchases the cars for $22,000 and sells them for $24,500. There is a fixed cost of 500

dollars for the showroom. The profit Y, in dollars, for a particular week can be described by

Y = 2500X – 500, where X is the number of hybrids sold. What is the standard deviation of Y?

(A) $3,145

(B) $3,645

(C) $4,970

(D) $5,375

(E) $5,875
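
A possible way to work the hybrid-car example is to find the standard deviation of X from its probability distribution and then apply the transformation rules to Y = 2500X – 500. The variable names are illustrative; the probabilities come from the table above.

```python
from math import sqrt

# Probability distribution of X, the number of hybrids sold in a week.
dist = {0: 0.07, 1: 0.28, 2: 0.23, 3: 0.18, 4: 0.13, 5: 0.11}

mean_x = sum(x * p for x, p in dist.items())
var_x = sum((x - mean_x) ** 2 * p for x, p in dist.items())
sd_x = sqrt(var_x)

# Y = 2500X - 500: multiplying by 2500 scales the SD by 2500,
# while subtracting 500 does not change the SD.
sd_y = 2500 * sd_x

print(round(mean_x, 2), round(sd_x, 3), round(sd_y))
```

The result is a standard deviation for Y of about $3,646, which rounds closest to the answer choice of $3,645.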

A. Combining Random Variables.

A. Means can be added or subtracted directly. Note: it is okay to have a negative value when you

subtract. Note: the variables do not have to be independent to combine means

B. Variances: Variances can be added directly. They are never subtracted, even when the variables

themselves are subtracted. Note: the variables do have to be independent to combine variances.

C. Standard Deviations: Standard deviations must be converted to variances by squaring, and then

the variances may be summed. Take the square root of the sum to get the combined standard

deviation. We do not subtract them. Note: the variables do have to be independent

to combine standard deviations.
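
The rules above can be sketched as a small helper. The function name `combine_independent` and the numbers passed to it are hypothetical; the only assumption about the variables is that they are independent.

```python
from math import sqrt

def combine_independent(mean_x, sd_x, mean_y, sd_y, subtract=False):
    """Mean and SD of X + Y (or X - Y) for independent X and Y."""
    mean = mean_x - mean_y if subtract else mean_x + mean_y
    sd = sqrt(sd_x ** 2 + sd_y ** 2)   # variances add either way
    return mean, sd

print(combine_independent(50, 6, 30, 8))                 # (80, 10.0)
print(combine_independent(50, 6, 30, 8, subtract=True))  # (20, 10.0)
```

The means change depending on whether the variables are added or subtracted, but the standard deviation of the result is the same in both cases because the variances are always added.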

• Example:

Suppose the average height of policemen is 71 inches with a standard deviation of 4 inches, while the

average for policewomen is 66 inches with a standard deviation of 3 inches. If a committee looks at all

ways of pairing up one male with one female officer, what will be the mean and standard deviation for

the difference in heights for the set of possible partners?

(A) Mean of 5 inches with a standard deviation of 1 inch.

(B) Mean of 5 inches with a standard deviation of 3.5 inches.

(C) Mean of 5 inches with a standard deviation of 5 inches.

(D) Mean of 68.5 inches with a standard deviation of 1 inch.

(E) Mean of 68.5 inches with a standard deviation of 3.5 inches.

• Example: A company that makes fleece clothing uses fleece produced from two farms, Northern Farm and

Western Farm. Let the random variable X represent the weight of fleece produced by a sheep from

Northern Farm. The distribution of X has mean 14.1 pounds and standard deviation 1.3 pounds. Let the

random variable Y represent the weight of fleece produced by a sheep from Western Farm. The distribution

of Y has mean 6.7 pounds and standard deviation 0.5 pound. Assume X and Y are independent. Let W equal

the total weight of fleece from 10 randomly selected sheep from Northern Farm and 15 randomly selected

sheep from Western Farm. Calculate the mean and standard deviation of W.
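
One way to set up the fleece example is to treat W as a sum of 25 independent fleece weights: 10 from Northern Farm and 15 from Western Farm. The sketch below follows that reading; the variable names are illustrative.

```python
from math import sqrt

# Per-sheep fleece weight (pounds) for each farm, from the example above.
n_north, mean_x, sd_x = 10, 14.1, 1.3
n_west, mean_y, sd_y = 15, 6.7, 0.5

# Means add directly across all 25 sheep.
mean_w = n_north * mean_x + n_west * mean_y

# Each sheep is a separate independent draw, so each one contributes
# its own variance: 10 copies of sd_x^2 plus 15 copies of sd_y^2.
var_w = n_north * sd_x ** 2 + n_west * sd_y ** 2
sd_w = sqrt(var_w)

print(round(mean_w, 1), round(sd_w, 2))   # 241.5 4.54
```

Note that the variance of the Northern Farm total is 10 × 1.3², not (10 × 1.3)², because the total is a sum of 10 different sheep and not one sheep's weight multiplied by 10.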

• Example: 2013 Question 3 Each full carton of Grade A eggs consists of 1 randomly selected empty

cardboard container and 12 randomly selected eggs. The weights of such full cartons are approximately

normally distributed with a mean of 840 grams and a standard deviation of 7.9 grams.

(a) What is the probability that a randomly selected full carton of Grade A eggs will weigh more than

850 grams?

(b) The weights of the empty cardboard containers have a mean of 20 grams and a standard deviation

of 1.7 grams. It is reasonable to assume independence between the weights of the empty cardboard

containers and the weights of the eggs. It is also reasonable to assume independence among the

weights of the 12 eggs that are randomly selected for a full carton.

Let the random variable X be the weight of a single randomly selected Grade A egg.

i) What is the mean of X ?

ii) What is the standard deviation of X ?
