
FOA/Algebra 1 Unit 6: Describing Data Notes


Unit 6: Describing Data

Calculating Measures of Central Tendency & Spread

Measures of Central Tendency: Mean, Median, Mode

Measures of Spread: Range, Interquartile Range (IQR), Mean Absolute Deviation (MAD)

Types of Data

There are several different classifications of data: univariate versus bivariate, categorical versus quantitative.

Univariate – Involves a single variable

Bivariate – Involves two variables

Categorical – Places an individual into one of several groups or categories (gender, hair color, eye color, etc.)

Quantitative – Numerical values (test scores, age, grade point average, etc.)

Measures of Central Tendency

Measures of Central Tendency are used to generalize data sets and identify common values.

Mean

Definition: Average of a numerical data set, denoted as $\bar{x}$ (read “x-bar”)

Calculation: Add up all the data values and divide by the number of data values

Useful When: - Data values do not vary greatly

- No outliers

- Distribution is symmetric

Example: Find the mean of the following numbers.

a. 76, 77, 79, 80, 82, 88, 90, 92, 95

b. 15, 10, 12, 18, 10, 22
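The mean calculation described above can be sketched in a few lines of Python. This is only an illustration of "add up and divide"; the function name `mean` and the variable names are choices made here, not part of the notes.

```python
def mean(values):
    """Add up all the data values and divide by the number of data values."""
    return sum(values) / len(values)

data_a = [76, 77, 79, 80, 82, 88, 90, 92, 95]
data_b = [15, 10, 12, 18, 10, 22]

mean_a = mean(data_a)  # 759 / 9
mean_b = mean(data_b)  # 87 / 6
print(round(mean_a, 2), mean_b)  # 84.33 14.5
```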

Mode

Definition: Value that occurs most frequently. A data set can have no mode, one mode, or several modes.

Calculation: Find the numbers that are repeated

o NO MODE (No numbers repeat)

Say “no mode”

o ONE MODE (One number repeats)

State the number that repeats

o MORE THAN ONE MODE (Several numbers repeat the same number of times, more often than any other value)

State the numbers that repeat.

Useful When: - Data set contains categorical data

Example: Find the mode of the following numbers.

a. 76, 77, 79, 80, 82, 88, 90, 92, 95

b. 15, 10, 12, 18, 10, 22
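One way to check the three mode cases is to count how often each value appears. The sketch below uses Python's standard `collections.Counter`; the helper name `modes` and the choice to return an empty list for "no mode" are assumptions made for this illustration.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] when no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no number repeats, so say "no mode"
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([76, 77, 79, 80, 82, 88, 90, 92, 95]))  # []
print(modes([15, 10, 12, 18, 10, 22]))              # [10]
```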



Median

Definition: The middle number when the values are written in numerical order

Calculation: Rewrite your data values in numerical order to find the middle number.

o If the number of data values is ODD, then the median will be the number that falls directly in the middle.

o If the number of data values is EVEN, then the median is the average of the two middle numbers.

Useful When: - Distribution is skewed

- Data values contain an outlier

First and Third Quartiles

Definition: Quartiles are values that divide a list of numbers into quarters

First (Q1) Quartile: Median of the lower half of a data set

o Calculation: Find the middle number of the values to the left of the median

Third (Q3) Quartile: Median of the upper half of a data set

o Calculation: Find the middle number of the values to the right of the median

Example: Find the median and the lower and upper quartiles of the following numbers.

a. 76, 77, 79, 80, 82, 88, 90, 92, 95

b. 15, 10, 12, 18, 10, 22
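The median and quartile rules above can be sketched in Python as follows. The sketch follows the method in these notes (Q1 and Q3 are the medians of the lower and upper halves, leaving out the middle value when the count is odd). Calculators and software sometimes use other quartile conventions, so results may differ slightly elsewhere. The function names are illustrative, and the sketch assumes at least two data values.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values to the left of the median
    upper = ordered[-half:]   # values to the right of the median
    return median(lower), median(ordered), median(upper)

print(quartiles([76, 77, 79, 80, 82, 88, 90, 92, 95]))  # (78.0, 82, 91.0)
print(quartiles([15, 10, 12, 18, 10, 22]))              # (10, 13.5, 18)
```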

Outliers

Definition: A data value that is much greater than or much less than the rest of the data in a data set

If an outlier is present, you would use the median to describe the data, NOT the mean!

Example: Identify any outliers in each data set. Then determine whether the median or the mean better represents each data set.

a. 15, 10, 12, 18, 10, 22

b. 128, 152, 170, 41, 161

Measures of Spread

Measures of Spread describe the “diversity” of the values in a data set. Measures of spread are used to help

explain whether data values are very similar or very different.

Range

Definition: Difference between the greatest and least values in the set

Calculation: Subtract the smallest data value from the biggest data value

Range = Biggest # - Smallest #

Example: Find the range of the following numbers.

a. 76, 77, 79, 80, 82, 88, 90, 92, 95

b. 15, 10, 12, 18, 10, 22



Interquartile Range (IQR)

Definition: The difference between the third and first quartiles (Q3 – Q1). It finds the distance

between two data values that represent the middle 50% of the data.

Calculation: Subtract the first quartile value from the third quartile value (Q3 – Q1).

Example: Find the interquartile range of the following numbers. (Use Q1 and Q3 from the previous page)

a. 76, 77, 79, 80, 82, 88, 90, 92, 95

b. 15, 10, 12, 18, 10, 22
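As a rough check on the range and IQR calculations, the sketch below uses `statistics.median` from Python's standard library and the same median-of-halves quartile method as these notes. The helper names `data_range` and `iqr` are made up for this illustration.

```python
from statistics import median

def data_range(values):
    """Biggest value minus smallest value."""
    return max(values) - min(values)

def iqr(values):
    """Q3 - Q1, where Q1 and Q3 are the medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    return q3 - q1

data_a = [76, 77, 79, 80, 82, 88, 90, 92, 95]
data_b = [15, 10, 12, 18, 10, 22]
print(data_range(data_a), iqr(data_a))  # 19 13.0
print(data_range(data_b), iqr(data_b))  # 12 8
```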

Mean Absolute Deviation (MAD)

Definition: Average absolute value of the difference between each data point and the

mean. It essentially takes the average distance of the data points from the mean.

The greater the mean absolute deviation, the more the data is spread out.

The formula for mean absolute deviation is:

$$\text{MAD} = \frac{\sum_{i=1}^{N} \left| x_i - \bar{x} \right|}{N}$$

$x_i$ = data value

$\bar{x}$ = mean

$\sum$ = sum

$N$ = number of data values

Calculation: - Find the mean of the set of numbers

- Subtract the mean from each number in the set and take the absolute value of each difference (each new number will be positive or zero)

- Find the sum of the new numbers and divide by the number of data values

Example: Find the MAD of the following numbers.

a. 15, 10, 12, 18, 10, 22
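The three MAD steps (find the mean, take the absolute distance of each value from the mean, average those distances) can be sketched in Python as shown below. The function name is an illustrative choice.

```python
def mean_absolute_deviation(values):
    """Average distance of the data values from their mean."""
    mean = sum(values) / len(values)                  # step 1: the mean
    distances = [abs(x - mean) for x in values]       # step 2: |value - mean|
    return sum(distances) / len(values)               # step 3: average distance

mad = mean_absolute_deviation([15, 10, 12, 18, 10, 22])
print(round(mad, 2))  # 3.83
```

For this data set the mean is 14.5, the distances are 0.5, 4.5, 2.5, 3.5, 4.5, and 7.5, and their sum is 23, so the MAD is 23/6.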



Homework: Putting Measures of Center and Spread Together

Use the data set below to answer the following questions:

5, 2, 9, 10, 3, 7, 2, 18, 12, 15, 1, 6, 9, 5, 2, 7

1.) Find the mean.

2.) Find the median (Q2).

3.) Find the mode.

4.) Find the range.

5.) Find Q1.

6.) Find Q3.

7.) Find the IQR.

8.) Find the MAD.



Dot Plots & Histograms

A dot plot is a data representation that uses a number line and x’s, dots, or other symbols to show frequency. The

number of times a value is repeated corresponds to the number of dots above that value. A dot plot also shows the

size of the data set. Dot plots are also called line plots. An example of a dot plot is below:

Advantages of Dot Plots: Simple to make

Shows each individual data point

Disadvantages of Dot Plots: Can be time consuming with lots of data points

Have to count to get exact total

Fractions are hard to display
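To see how a dot plot is built from raw data, here is a minimal text-only sketch in Python. It is turned on its side (one row per value, with an x for each occurrence) and assumes whole-number data. The function name `text_dot_plot` is made up for this illustration.

```python
from collections import Counter

def text_dot_plot(values):
    """Build a sideways dot plot: one row per value, one x per occurrence."""
    counts = Counter(values)
    rows = []
    for value in range(min(values), max(values) + 1):
        row = f"{value:>3} | " + "x " * counts[value]
        rows.append(row.rstrip())
    return "\n".join(rows)

plot = text_dot_plot([15, 10, 12, 18, 10, 22])
print(plot)
```

Counting every x in the output gives the size of the data set, which is one of the properties of dot plots listed above.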

Types of Dot Plot Distributions

TYPE – DESCRIPTION

SYMMETRIC – When graphed, a vertical line drawn at the center will form mirror images. A common example is the bell-shaped curve, or normal curve. The mean is approximately equal to the median.

SKEWED LEFT (NEGATIVE SKEW) – Fewer data points are found to the left of the graph (towards the smaller data values). The “tail” of the graph is to the left. Typically, the mean is less than (to the left of) the median.

SKEWED RIGHT (POSITIVE SKEW) – Fewer data points are found to the right of the graph (towards the bigger data values). The “tail” of the graph is to the right. Typically, the mean is greater than (to the right of) the median.

UNIFORM – The data is spread equally (or very close to equally) across the range. Uniform distributions are a type of symmetric distribution.

Minutes 0 1 2 3 4 5 6 7 8 9 10 11 12

People 6 2 3 5 2 5 0 0 2 3 7 4 1



Practice 1: Identify the type of distribution of the following dot plots.

a. b.

Practice 2: Find the following values:

Describe the following:

Mean: Mean: Mean:

Median: Median: Median:

Mode: Mode: Mode:

Range: Range: Range:

Distribution: Distribution: Distribution:

Practice 3: The following dot plot represents gold medals won at the Special Olympics:

a. How many participants are represented in the dot plot?

b. How many participants won 10 or more medals?

c. How many participants won fewer than 4 medals?

d. Describe the data distribution and interpret its meaning in terms of this problem situation.



Histograms

A histogram is a bar graph used to display the frequency of data divided into equal intervals, called bins. The bars

must be of equal width and should touch, but not overlap. The height of each bar gives the frequency of the data.

An example of a histogram is below:

How many students read 4-7 books?

How many more students read 4-7 books than 12-15 books?

Advantages of Histograms: Good for determining the shape of data

Convenient for representing large quantities of data

Disadvantages of Histograms: Cannot read exact values because data is grouped into categories

More difficult to compare two data sets because measures of center and

spread cannot be determined
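Grouping data into equal intervals (bins) can be sketched in Python as follows. The example reuses the homework data set from earlier in these notes. The bin width of 4 and the starting value of 0 are assumptions chosen for this illustration, and `bin_counts` is a made-up helper name.

```python
def bin_counts(values, start, width, bins):
    """Count how many values fall in each equal-width interval."""
    counts = [0] * bins
    for v in values:
        index = (v - start) // width
        if 0 <= index < bins:
            counts[index] += 1
    return counts

data = [5, 2, 9, 10, 3, 7, 2, 18, 12, 15, 1, 6, 9, 5, 2, 7]
counts = bin_counts(data, start=0, width=4, bins=5)
print(counts)  # [5, 5, 3, 2, 1]

# Print a sideways histogram: one row per bin, one # per data value.
for i, count in enumerate(counts):
    low = i * 4
    print(f"{low:>2}-{low + 3:<2} | " + "#" * count)
```

Once the data is binned, only the counts remain, which is why exact values cannot be read back from a histogram.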


Types of Histogram Distributions

SYMMETRIC – When graphed, a vertical line drawn at the center will form mirror images. A common example is the bell-shaped curve, or normal curve. The median will be in or close to the center of the number line.

SKEWED LEFT (NEGATIVE SKEW) – Fewer data points are found to the left of the graph (towards the smaller data values). The “tail” of the graph is to the left. The median will be shifted right, with the “tail” on the left. Typically, the mean is less than (to the left of) the median.

SKEWED RIGHT (POSITIVE SKEW) – Fewer data points are found to the right of the graph (towards the bigger data values). The “tail” of the graph is to the right. The median will be shifted left, with the “tail” on the right. Typically, the mean is greater than (to the right of) the median.

UNIFORM – The data is spread equally (or very close to equally) across the range. Uniform distributions are a type of symmetric distribution. The median will be in or close to the center of the number line.



Practice 1: Describe the distribution of each histogram and state whether the mean is less than, greater than, or equal to the median. Then decide which would be the better measure of center: the median or the mean.

a. b.

Practice 2: Use the histogram to answer the following questions about how long it takes students to get ready.

a. How many students answered the question?

b. How many students take less than 40 minutes to get ready?

c. Based on the info given, could you redraw the current histogram with

intervals half their current size? Why or why not?

Practice 3: Analyze the given histogram which displays the ACT composite score of several randomly chosen

students.

a. Describe the distribution and explain what it means in terms of the

problem situation.

b. How many students had an ACT score of at least 20?

c. How many students had an ACT score less than 30?

d. How many students had an ACT score of exactly 25?



Box Plots

A box plot (also called a box-and-whisker plot) is used to show how data values are distributed. It is created using five important numbers: the minimum, maximum, median, lower quartile, and upper quartile.

In a box plot, a rectangle is drawn starting at the first quartile and ending at the third quartile. The rectangle shows

the middle 50% of the data set. The median is represented by a line. Whiskers are drawn from the rectangle to the

minimum and maximum data values. An example of a box plot is below:

Types of Box Plot Distributions


SYMMETRIC – When graphed, a vertical line drawn at the center will form mirror images. A common example is the bell-shaped curve, or normal curve. The median and mean will be approximately equal.

SKEWED LEFT (NEGATIVE SKEW) – Fewer data points are found to the left of the graph. The “tail” of the graph is to the left. The box (the interquartile range) will be shifted to the right of the number line, and the mean will be less than the median.

SKEWED RIGHT (POSITIVE SKEW) – Fewer data points are found to the right of the graph. The “tail” of the graph is to the right. The box (the interquartile range) will be shifted to the left of the number line, and the mean will be greater than the median.

UNIFORM – The data is spread equally (or very close to equally) across the range. Uniform distributions are a type of symmetric distribution. The median and mean will be approximately equal.

Outliers: A data value that lies on the outside of all the other data values. It is denoted by an asterisk (*) or dot.



Identifying Distributions

Identify the type of distribution of the following box plots.

a.

b.

c.

Calculating the Parts of a Box Plot

Before you can even create a box plot, you have to know how to calculate the “five number summary”, which

consists of the minimum, maximum, median, lower quartile, and upper quartile.

Using the following data set, find the five number summary:

{15, 10, 12, 18, 10, 22, 11, 17, 13}

Minimum: Smallest number of the data set _________

Maximum: Largest number of the data set _________

Median: Middle number of the data set _________

Lower Quartile: Median of the lower half of the data set (Q1 or First Quartile) _________

Upper Quartile: Median of the upper half of the data set (Q3 or Third Quartile) _________
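The five number summary can be checked with a short Python sketch. It uses `statistics.median` from the standard library and the median-of-halves quartile method described in these notes. The function name and the dictionary keys are illustrative choices.

```python
from statistics import median

def five_number_summary(values):
    """Minimum, Q1, median, Q3, and maximum of a data set."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return {
        "minimum": ordered[0],
        "Q1": median(ordered[:half]),    # median of the lower half
        "median": median(ordered),
        "Q3": median(ordered[-half:]),   # median of the upper half
        "maximum": ordered[-1],
    }

summary = five_number_summary([15, 10, 12, 18, 10, 22, 11, 17, 13])
for name, value in summary.items():
    print(name, value)
```

Sorting first matters: in order, the data is 10, 10, 11, 12, 13, 15, 17, 18, 22, so the middle (fifth) value is the median and the four values on each side form the halves.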



Interpreting Box Plots

Practice with Box Plots

Example 1: Analyze the box plot below about the cost, in dollars, of 12 CDs. Answer the questions.

A. Which cost is the upper quartile?

B. What is the range?

C. What is the median?

D. Which cost represents the 100th percentile?

E. How many CDs cost between $14.50 and $26.00?

F. How many CDs cost less than $14.50?

List the data values that fall below 25%:

List the data values that fall above 75%:

List the data values that fall above 50%:

Calculate the IQR:



Example 2: Analyze the box plot below and answer the following questions:

A. What is the height range of the middle 50 percent of the surveyed adults?

B. How many of the surveyed adults are between 72 and 79 inches?

C. What percent of the surveyed adults are 72 inches or shorter?

D. What is the height of the tallest adult surveyed?

E. About 10 people have a height below what amount?

F. About 20 people have a height above what amount?

G. How many of the surveyed adults are at least 58 inches tall?

H. Describe the distribution. Does the median or the mean best describe the data?



Comparing Measures of Center and Spread


Center

Mean – Data is symmetric; no outliers

Median – Skewed data; outliers

(Skewed left – mean < median)

(Skewed right – mean > median)

Spread

More Spread – Data values are spread out; greater MAD

Less Spread – Data values are close together; smaller MAD

Example 1: Which data set will have the greater mean absolute deviation? Why?

Example 2: The following data represents test scores from Unit 11 test.

Unit 11 Test Scores: 81, 41, 89, 92, 80, 86, 77, 66, 84, 92, 97, 88, 77, 38

a. Compare the mean and median.

b. What type of distribution does the data create? What does this mean?

c. Are there any outliers?

d. What measure of center best describes the grades and why?
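A quick way to compare the mean and median of the test scores is sketched below, using `statistics.mean` and `statistics.median` from Python's standard library. The shape label applies the rule of thumb from these notes (mean less than median suggests a left skew). That rule is typical rather than guaranteed, so treat the label as a hint to confirm with a plot.

```python
from statistics import mean, median

scores = [81, 41, 89, 92, 80, 86, 77, 66, 84, 92, 97, 88, 77, 38]

score_mean = mean(scores)      # 1088 / 14
score_median = median(scores)  # average of the 7th and 8th sorted scores

if score_mean < score_median:
    shape = "skewed left"
elif score_mean > score_median:
    shape = "skewed right"
else:
    shape = "roughly symmetric"

print(round(score_mean, 1), score_median, shape)  # 77.7 82.5 skewed left
```

The two low scores (41 and 38) pull the mean down much more than they move the median.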



Example 3: The histograms below show the scores of Mrs. Smith’s first and second block class at Red Rock High

School.

1. How many students are in her 1st and 2nd block class?

2. How many students failed the test in each class?

3. Which measure of center best describes the data and why?

4. Which class seemed to do better overall?



Example 4: Each girl in Mrs. Washington’s class and Mrs. Wheaton’s class measured their own height. The heights

were plotted on the dot plots below. Use the dot plots to compare the heights of the girls in the two classes.

Mrs. Washington Mrs. Wheaton

a. Describe the distribution for each class.

b. Which teacher’s girls appear to be taller and why?

c. What are the mean and median for each class?

d. How tall are the majority of the girls in each class?

Example 5: The following box plots show the average monthly high temperatures for Milwaukee and Honolulu. Use

the box plots to answer the following questions.

Honolulu          Milwaukee

A. What was the median temperature for each city?

B. What was the range for each city?

C. Which city has more spread in its data and why?

D. Interpret what the 1st and 3rd quartiles mean for each city.



Frequency Tables

A relative frequency is the frequency that an event occurs divided by the total number of events.

Example: If your team has won 9 games from a total of 12 games played:

The frequency of winning is 9.

The relative frequency of winning is 9/12 = 0.75 = 75%.

A two-way table is a useful way to organize data that can be categorized by two variables (bivariate). The following table shows the results of a poll of randomly selected high school students and their preference for either math or English. Joint frequencies are the number of times a response was given for a combination of two characteristics (for example, 9th graders who prefer math); they are found in the body of the table. Marginal frequencies are the total number of times a response is given for a single characteristic. Marginal frequencies are found in the margins of the table.

           9th Grade   10th Grade   11th Grade   12th Grade   Total
Math          10           12           11            8
English       12           11            8            8
Total

1. How many students are in 11th grade?

2. How many students are in 9th grade and prefer math?

3. How many students prefer English and are in 12th grade?

4. How many students are there total?
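The marginal frequencies (row and column totals) for the math/English poll can be sketched in Python as follows. The dictionary layout and variable names are illustrative choices; the counts are the joint frequencies from the table above.

```python
grades = ["9th", "10th", "11th", "12th"]
table = {
    "Math":    [10, 12, 11, 8],
    "English": [12, 11, 8, 8],
}

# Row totals: how many students prefer each subject.
row_totals = {subject: sum(counts) for subject, counts in table.items()}

# Column totals: how many students are in each grade.
column_totals = [sum(column) for column in zip(*table.values())]

grand_total = sum(row_totals.values())

print(row_totals)                        # {'Math': 41, 'English': 39}
print(dict(zip(grades, column_totals)))  # {'9th': 22, '10th': 23, '11th': 19, '12th': 16}
print(grand_total)                       # 80
```

The row totals and the column totals each add up to the same grand total, which is a handy check when filling in a two-way table.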

Example 1: Fill in the missing values into the table below and then answer the following questions:

9th Grader’s School Transportation Survey

a. How many students are there total?

b. How many 9th boys walk to school?

c. How many 9th girls ride their bike to school?

d. How many males took the survey?



Joint and Marginal Relative Frequencies

The joint relative frequencies are the values in each category divided by the total number of values and

written as percents (or decimals).

The marginal relative frequencies are found by adding the joint relative frequencies in each row and column (totals) and are written as percents (or decimals). Marginal relative frequencies are written in the MARGINS.

The marginal relative frequencies in the Total row should add up to 1 (100%), and so should the marginal relative frequencies in the Total column.

Calculate the joint and marginal relative frequencies for the table:

9th Grade 10th Grade 11th Grade 12th Grade Total

Math

English

Total

a. What percent of students are 10th graders & like English?

b. What percent of students like Math and are 12th graders?

c. What percent of students like Math?

d. What percent of those surveyed were seniors?
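Joint and marginal relative frequencies can be sketched in Python as shown below. The sketch assumes the table in question is the math/English poll from the Frequency Tables page (the counts 10, 12, 11, 8 and 12, 11, 8, 8). The variable names are illustrative.

```python
table = {
    "Math":    [10, 12, 11, 8],   # 9th, 10th, 11th, 12th grade
    "English": [12, 11, 8, 8],
}
grand_total = sum(sum(counts) for counts in table.values())  # 80 students

# Joint relative frequency: each cell divided by the grand total.
joint = {
    subject: [count / grand_total for count in counts]
    for subject, counts in table.items()
}

# Marginal relative frequency: each row or column total divided by the grand total.
marginal_rows = {subject: sum(counts) / grand_total for subject, counts in table.items()}
marginal_columns = [sum(column) / grand_total for column in zip(*table.values())]

print(joint["English"][1])    # 10th graders who like English: 11/80 = 0.1375
print(marginal_rows["Math"])  # students who like Math: 41/80 = 0.5125
print(marginal_columns[3])    # seniors: 16/80 = 0.2
```

Multiply any of these decimals by 100 to write it as a percent.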

Practice with Joint and Marginal Relative Frequencies

Example 3: One hundred people who frequently get migraine headaches were chosen to participate in a study of

new anti-headache medicine. Some of the participants were given the medicine; others were not. After one

week, the participants were asked if they got a headache during the week. The two way frequency table

summarizes the results. Fill in the missing value and then create a joint and marginal relative frequency table.

               Took Medicine   Did NOT Take Medicine   TOTAL
Headache            12               ____                27
No Headache         48                25                ____
TOTAL              ____               40                ____

Joint and Marginal Relative Frequencies


               Took Medicine   Did NOT Take Medicine   TOTAL
Headache
No Headache
TOTAL



Conditional Frequencies

A conditional frequency is restricted to a particular group (or subgroup).

Conditional frequencies are typically identified by the words “given that” or “if” or “what percent of (insert

condition)”. They do NOT come from the total data, but from a row or column total.

To calculate a conditional frequency, divide the joint frequency by the marginal frequency of the condition (the row or column total for the given group). You can use counts or relative frequencies (percents/decimals), as long as both numbers are the same kind. Conditional frequencies are used to find conditional probabilities.


               Took Medicine   Did NOT Take Medicine   TOTAL
Headache            12                15                 27
No Headache         48                25                 73
TOTAL               60                40                100
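The conditional frequency rule (joint frequency divided by the total for the given group) can be sketched in Python with the medicine table. The dictionary layout and variable names are illustrative choices. The last calculation repeats the first one with relative frequencies to show that counts and decimals lead to the same answer.

```python
from math import isclose

# Joint frequencies from the table: (headache status, medicine status) -> count
counts = {
    ("headache", "medicine"): 12,
    ("headache", "no medicine"): 15,
    ("no headache", "medicine"): 48,
    ("no headache", "no medicine"): 25,
}
total = sum(counts.values())  # 100 participants

# Marginal totals for the groups used as conditions.
took_medicine = counts[("headache", "medicine")] + counts[("no headache", "medicine")]    # 60
no_headache = counts[("no headache", "medicine")] + counts[("no headache", "no medicine")]  # 73

# P(no headache | took medicine): restrict to the "took medicine" column.
p_no_headache_given_medicine = counts[("no headache", "medicine")] / took_medicine

# P(took medicine | no headache): restrict to the "no headache" row.
p_medicine_given_no_headache = counts[("no headache", "medicine")] / no_headache

# Same first answer using relative frequencies (joint / marginal).
joint_relative = counts[("no headache", "medicine")] / total
marginal_relative = took_medicine / total
same_answer = joint_relative / marginal_relative

print(round(p_no_headache_given_medicine, 3))  # 0.8
print(round(p_medicine_given_no_headache, 3))  # 0.658
print(isclose(same_answer, p_no_headache_given_medicine))  # True
```

Swapping the condition changes the denominator (60 versus 73), which is why the two probabilities differ even though they share the same joint count of 48.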

1. What is the probability that a participant did not get a headache if they took the medicine?

2. What is the probability that a participant took medicine given they did not have a headache?

3. What is the probability that a participant took medicine given they did have a headache?

4. Calculate the joint and marginal relative frequencies from the table above.

               Took Medicine   Did NOT Take Medicine   TOTAL
Headache
No Headache
TOTAL

5. What is the probability that a participant who took the medicine did not get a headache?

6. What is the probability that a participant took medicine given they did not have a headache?

7. What is the probability that a participant took medicine given they did have a headache?

8. What do you notice about the answers from problems 1 – 3 and problems 5 – 7?



Example 5: Students were surveyed about whether or not they have a pet and if they are allergic or not to animals.

The results are below:

a. What percent of those surveyed who are allergic to animals have a pet?

b. What percent of those surveyed who are not allergic to animals have a pet?

c. What percent of those who have a pet are allergic to animals?

d. What percent of those who have a pet are not allergic to animals?

Example 6: The following contains the scores of the latest math project. Use the table to answer the following

questions:

a. What percentage of males earned a score of an “A”?

b. What percentage of those who earned an “A” were male?

c. What percentage of females earned a score of a “B”?

d. What percentage of those who earned an “F” were female?



Scatterplots

A scatterplot is a graph of data pairs (x, y).

Scatterplots are typically used to describe relationships, called correlations, between two variables (bivariate).

The correlation coefficient describes how well a line fits the data. A trend line can be drawn to help

determine correlation.

Correlation Coefficients

 0.70 to 1.00   Strong Positive            -0.70 to -1.00   Strong Negative
 0.30 to 0.69   Moderate Positive          -0.30 to -0.69   Moderate Negative
 0.00 to 0.29   None to Weak Positive       0.00 to -0.29   None to Weak Negative
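The table above can be turned into a small lookup, sketched here in Python. The function name `describe_correlation` is made up for this illustration. It treats 0.30 and 0.70 as the cutoffs, so a value such as 0.295 (which falls between the table's rows) is labeled "none to weak"; that choice is an assumption.

```python
def describe_correlation(r):
    """Label a correlation coefficient using the strength table in these notes."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must be between -1 and 1")
    if r == 0:
        return "no correlation"
    size = abs(r)
    if size >= 0.70:
        strength = "strong"
    elif size >= 0.30:
        strength = "moderate"
    else:
        strength = "none to weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

for r in [0.45, 0.94, 0.07, -0.39, -0.89]:
    print(r, describe_correlation(r))
```

The sign of r gives the direction and the size of r (its absolute value) gives the strength, so the two are decided separately.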

Example: Determine if the following graphs have positive, negative, or no correlations. Then tell if the correlation

coefficient is strong, moderate, or weak positive or negative.

a. b. c. d. e.

Positive Correlation

As x values increase, y values increase

Correlation Coefficient is close to 1

Positive Slope

Negative Correlation

As x values increase, y values decrease

Correlation Coefficient is close to -1

Negative Slope

No Correlation

No relationship between x and y

Correlation Coefficient is close to 0

No line



Example: Describe the scatterplot that best describes the scenario below and explain why:

The relationship between the number of days since a sunflower seed was planted and the height of the plant.

Example: Describe the correlation you would expect to see between each pair of data sets. Explain your choice:

a. The number of hours you work vs the amount of money in your bank account:

b. The number of hours workers receive safety training vs the number of accidents on the job:

c. The number of students at Hillgrove vs the number of dogs in Atlanta:

d. The number of heaters sold versus the months in order from April to September:

e. The number of rice dishes eaten vs the number of cars on I-75 throughout the day:

f. The number of calories burned/lost vs the number of hours you worked out:



Correlation vs Causation

Correlation: implies a mutual relationship between two or more things. It is very IMPORTANT to understand that just

because two variables are strongly correlated does NOT imply a cause and effect relationship. A strong

relationship between two variables could be a coincidence or caused by additional factors. Typically, correlations

use the words noticed and showed.

Correlations only show relationships; on their own, they cannot be used to conclude that one thing causes another!

Causation: implies a relationship in which one action or event is the direct consequence of another (cause and

effect).

Correlation

Smoking is correlated with alcoholism (but it doesn’t cause it).

The more ice cream consumed on a beach, the greater the number of people who go in the water (eating ice cream doesn’t cause you to go in the water more).

Causation

The more you smoke, the greater your chances of developing lung cancer. (Does smoking cause lung cancer?)

The fewer calories you eat, the more weight you lose. (Does eating less cause you to lose weight?)

Example: Determine if the following relationships show a correlation or causation:

A. A recent study showed that college students were more likely to vote than their peers who were not in

school.

B. Dr. Shaw noticed that there was more trash in the hallways after 2nd period than 1st period.

C. You hit your little sister and she cries.

D. The number of miles driven and the amount of gas used on your trip to Disneyworld.

E. The age of a child and his/her shoe size.

F. The number of cars a salesman sells and the amount of commission he makes during the month of July.



Steps for Calculating the Correlation Coefficient & Creating a Model

1. Once your data is entered into a list, press [STAT] [CALC] and choose your regression.

4: LinReg – Linear Regression y = mx + b (a = m)

5: QuadReg – Quadratic Regression y = ax^2 + bx + c

0: ExpReg – Exponential Regression y = ab^x

2. If you want your graphing calculator to automatically input the equation into Y=, do the following:

On the 2nd screen, hit ENTER until STORE REGEQ is highlighted. Hit VARS, Y-VARS, 1: Function, 1: Y1.

3. Hit ENTER until CALCULATE is highlighted. You should see your variables (a, b, and possibly c) along with r^2 and r.

r: correlation coefficient – this tells you how strong the correlation in your data is and whether it is positive or negative

r^2: this tells you how well the equation fits your data. The closer to 1, the better the fit.
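For readers without a graphing calculator, the values a, b, r, and r^2 reported by a linear regression can be computed with the standard least-squares formulas, sketched below in Python. This is not a description of the calculator's internal method. The data (hours after sunrise and temperature) is hypothetical, invented only for this illustration.

```python
from math import sqrt

def linear_regression(xs, ys):
    """Least-squares line y = a*x + b and the correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # y-intercept
    r = sxy / sqrt(sxx * syy)  # correlation coefficient
    return a, b, r

# Hypothetical data: hours after sunrise and temperature in degrees Fahrenheit.
hours = [1, 2, 3, 4, 5]
temps = [52, 55, 59, 60, 64]

a, b, r = linear_regression(hours, temps)
print(round(a, 2), round(b, 2), round(r, 3))  # 2.9 49.3 0.989
print(round(r ** 2, 3))                       # r^2 for a linear model
```

Here the slope of 2.9 would mean the temperature rises about 2.9 degrees per hour, and r close to 1 indicates a strong positive correlation.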

Practice Predicting with Scatter Plots

1. What can be concluded from the scatterplot below?

A. The older a person gets, the more television they watch.

B. As a person gets older, their taste in television changes.

C. The older a person gets, the less television they watch.

D. There is no relationship between age and television watching.

2. The scatterplot shows the grams of fat in a restaurant sandwich and the number of calories.

a. How many grams of fat would you predict to be in a sandwich that contains 650 calories?

b. How many calories would you predict to be in a sandwich with 20 grams of fat?



3. Make a scatterplot for each data set. Then find the correlation coefficient using your calculator.

a. b.

4. Match the graph with its correlation coefficient.

Choices

A. r = 0.45

B. r = 0.94

C. r = 0.07

D. r = -0.39

E. r = -0.89

Linear Regression

Yesterday, we drew trend lines to help us see if a scatter plot had any type of correlation. A trend line is a line that closely models the data. A line of best fit is the line that comes closest to all of the points in the data set. The line of best fit provides the predicted values for a set of data.

If a line is a good line of best fit, it will have data points above and below the line.



Example: Draw a line of best fit for each graph:

Example: The table shows test averages of eight students. The equation that best models the data is

y = 0.77x + 18.12 and the correlation coefficient is 0.87. Discuss correlation and causation for the data set.

Example: Eight adults were surveyed about their education and earnings. The table shows the survey results. The

equation that models the data is y = 0.59x + 30.28 and the correlation coefficient is 0.86. Discuss correlation and

causation for the data set.

Calculating a Line of Best Fit

Scenario 1: A weather team records the weather each hour after sunrise one morning in May. The hours after

sunrise and the temperature in degrees Fahrenheit are in the table below. Create a graph to represent the data

and calculate a linear equation to represent the table.



a. Interpret what the slope of each equation means in terms of the problem context.

b. Interpret what the y-intercept of each equation means in terms of the problem context.

Calculate by Hand

Step 1: Pick two points on the trend line and calculate the slope:

Step 2: Estimate/determine the y-intercept:

Step 3: Enter into y = mx + b

Calculate using Regression

Step 1: Enter data into a list (STAT, EDIT)

Step 2: Calculate a regression (STAT, CALC, 4: LinReg)

a:

b:

r:

Step 3: Enter into y = mx + b
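The by-hand method (slope from two points on the trend line, then the y-intercept, then y = mx + b) can be sketched in Python as follows. The two points are hypothetical, and solving b = y1 - m*x1 is one way to determine the y-intercept instead of estimating it from the graph.

```python
def line_through(point_1, point_2):
    """Slope and y-intercept of the line through two points on the trend line."""
    (x1, y1), (x2, y2) = point_1, point_2
    m = (y2 - y1) / (x2 - x1)  # step 1: slope = change in y / change in x
    b = y1 - m * x1            # step 2: solve y1 = m*x1 + b for b
    return m, b

# Hypothetical points read off a trend line: (hours after sunrise, temperature).
m, b = line_through((1, 52), (5, 64))
print(m, b)  # 3.0 49.0

def predict(x):
    """Step 3: use y = mx + b to predict a value."""
    return m * x + b

print(predict(3))  # 58.0
```

A line found this way depends on which two points are picked, so it will usually differ a little from the calculator's regression line.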

Date post:	01-Aug-2020