Post on 21-Sep-2020

Statistics 12, Chapter 1 – Exploring Data

Dr. John Lo, Royal Canadian College

2020-2021

0. Real-life Case Study

RCC @ 2020 CHAPTER 1 - EXPLORING DATA 2

› Question: Do pets or friends help reduce stress?

1. Introduction: Making sense of data

› Statistics: The science of data

› In order to understand what data tell us, we need to perform data analysis.

› Data analysis: The process of organizing, displaying, summarizing, and asking questions about data.

› Example: Imagine that we are developing a student database for RCC.

› Questions to ask:

1. What are the individuals?

› The students of RCC

2. What are the variables?

› For example, gender, age, grade level, address, phone numbers.

› Depending on the nature of variables, we can group them into two major types:

› Examples:

• Categorical – gender, race, occupation

• Quantitative – grade point average, age

› Example: Is there anything suspicious?

› Another key description that we should know regarding data is distribution.

› By definition:

› Can you describe the distribution of data in the previous example?

› Practice: The following table includes data for 10 people chosen at random from more than 1 million people in households.

2. Analyzing categorical data

A. Distribution of a single categorical variable

› Recall that categorical variables place individuals into one of several groups or categories.

➢ Note that the values of a categorical variable are labels for the different categories.

➢ Note also that the distribution of a categorical variable lists the count or percent of individuals who fall into each category.
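
› A minimal Python sketch of such a distribution, listing the count and percent for each category. The grade-level data and the function name frequency_table are hypothetical and are not taken from the slides:

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

# Hypothetical grade levels for ten students (not data from the slides).
grades = ["10", "11", "12", "12", "10", "12", "11", "12", "10", "12"]
table = frequency_table(grades)
for category in sorted(table):
    count, percent = table[category]
    print(f"Grade {category}: count = {count}, percent = {percent:.0f}%")
```

The percents always add up to 100, which is a quick check that no individual was missed or counted twice.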

› Example: A survey of audience ratings for US radio stations

[Frequency table annotated with the variable, its values, and the counts and percents]

› Frequency tables are not always easy to read and analyze.

› To make the analysis easier, we can display the distribution with a pie chart or a bar graph.

i. Bar graphs

› A bar graph displays the distribution of a categorical variable, showing the counts for each category for easy comparison.

› Two ways of showing bar charts: horizontal vs. vertical

› Question: Should a bar chart be horizontal or vertical?

› Answer: It depends on whether the variable is nominal or ordinal.

› Bar charts may be deceiving if not well designed.

• Example: How many people were in each class on the Titanic?

› The best data displays observe a fundamental principle of graphing data called the area principle.

› This principle states that the area occupied by a part of the graph should correspond to the magnitude of the value it represents.

› Violations of the area principle are therefore a common way to lie with statistics, whether intentionally or not.

› Question: What’s wrong with this bar chart?

› Question: 500 random customers who bought the new iMac computer were asked if their previous computer had been another Mac or a Windows computer. The results are found in this table:

› Why is the pictograph misleading?

› Two possible bar graphs of the data are shown below. Which one could be considered deceptive? Why?

ii. Pie charts

› A pie chart displays all the cases as a circle whose slices have areas proportional to each category’s fraction of the whole.

› Pie charts give a quick impression of the distribution, and are particularly good for seeing relative frequencies of ½, ¼, or ⅛.

› There are many varieties of pie charts:

› Note that pie charts may be attractive, but it can be hard to see patterns in them.

› Can you tell the differences in distributions depicted by these three pie charts?

› The bar charts of the same values look like:

› Bar charts are almost always better than pie charts for comparing the relative frequencies of categories.

B. Distribution with more than one categorical variable

› We have learned how to analyze the distribution of a single categorical variable using frequency tables, bar charts and pie charts.

› But what should we do if the data involve two categorical variables?

› We can make use of a two-way table!

› Example: A survey of 4826 randomly selected young adults (aged 19 to 25) asked, “What do you think the chances are you will have much more than a middle-class income at age 30?” The table below shows the responses.

› To analyze data presented in a two-way table, we look at the marginal distributions, which are defined as follows:

› In typical two-way tables, these distributions are listed on the right and bottom margins.

› The marginal distributions in the two-way table shown previously can be examined by converting the counts into percents:

› Each marginal distribution from a two-way table is the distribution of a single categorical variable, so it tells us nothing about the relationship between the two variables.

› The opinions of women and men can be analyzed separately by looking only at the “Female” or the “Male” column, respectively.

› The resulting distributions are called conditional distributions:

› The conditional distributions of opinions among women and men:

› Conditional distributions can be displayed in the form of a segmented bar graph or a side-by-side bar graph:

› Both graphs provide evidence of an association between gender and opinion about future wealth in this sample of young adults.

› The concept of association is important in statistics, but we should be careful not to overstate it!
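
› A short Python sketch of how marginal and conditional distributions can be computed from a two-way table. The counts and category labels below are hypothetical (they are not the survey results), and the helper names are illustrative:

```python
# Hypothetical two-way table of counts: rows are opinions, columns are gender.
table = {
    "Almost no chance": {"Female": 10, "Male": 10},
    "A 50-50 chance": {"Female": 50, "Male": 30},
    "Almost certain": {"Female": 40, "Male": 60},
}

def marginal_of_rows(table):
    """Percent of all individuals falling in each row category."""
    grand_total = sum(sum(row.values()) for row in table.values())
    return {name: 100 * sum(row.values()) / grand_total
            for name, row in table.items()}

def conditional_on_column(table, column):
    """Distribution of the row variable among individuals in one column."""
    column_total = sum(row[column] for row in table.values())
    return {name: 100 * row[column] / column_total
            for name, row in table.items()}

print(marginal_of_rows(table))
print(conditional_on_column(table, "Female"))
print(conditional_on_column(table, "Male"))
```

In this made-up table the two conditional distributions differ (40% of women but 60% of men answer "Almost certain"), which is the kind of difference that suggests an association between the two variables.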

3. Displaying quantitative data with graphs

› Bar graphs or pie charts are good at showing the distributions of categorical variables. However, they can’t be used for quantitative variables.

› The following types of graphs are commonly used to display the distributions of quantitative variables:

a) Dotplots

b) Stem-and-leaf plots

c) Histograms

d) Density plots

› Always remember that the purpose of a graph is to help us understand the data. To interpret a graph of quantitative data, we follow the SOCS strategy (shape, outliers, center, spread) described below:

A. Dotplots

› Each data value is shown as a dot above its location on a number line.

Dotplot of the Kentucky Derby race winning times

› Example: The following table displays the US Environmental Protection Agency (EPA) estimates of highway gas mileage in miles per gallon (mpg) for a sample of 24 model year 2012 midsize cars.

› A dotplot of the data is shown below:

› Can you describe the shape, center, and spread of the data? Are there any outliers?

› The shape of a distribution can be described using the following terms:

• Symmetric: data clustered at the center

• Skewed to the left (left-skewed): data clustered on the right

• Skewed to the right (right-skewed): data clustered on the left

› Besides skewness, the shape of a distribution can be described in terms of modes (or local maxima):

› Example: Compare the distributions of household size for UK and South Africa.

B. Stem-and-leaf plots

› Also known as stemplots, they give a quick picture of the shape of a distribution while including the actual numerical values in the graph.

› Example:

› A stem-and-leaf plot can be made by following the steps below:

1. Separate each observation into a stem (i.e., all but the final digit) and a leaf (i.e., the last digit).

2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right. Don’t skip any stem.

3. Write each leaf in the row to the right of its stem. Arrange the leaves in ascending order.

4. Provide a key that explains the meaning of the stems and leaves.
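
› The four steps above can be sketched in Python as follows. This is only an illustration for non-negative whole numbers; the function name stemplot and the sample values are hypothetical and are not the shoe data from the slides:

```python
from collections import defaultdict

def stemplot(data):
    """Return the lines of a stem-and-leaf plot for non-negative integers."""
    leaves = defaultdict(list)
    for value in sorted(data):          # sorting puts leaves in ascending order
        stem, leaf = divmod(value, 10)  # stem = all but the final digit
        leaves[stem].append(leaf)
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):  # do not skip any stem
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    lines.append("Key: 2 | 3 means 23")  # step 4: explain stems and leaves
    return lines

# Hypothetical data, chosen so that one stem (4) has no leaves.
shoes = [13, 26, 26, 38, 50, 22, 13, 15, 51]
lines = stemplot(shoes)
print("\n".join(lines))
```

The empty stem 4 is still printed, which keeps gaps in the data visible in the plot.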

› Example: 20 female students from a school were randomly chosen and asked how many pairs of shoes they have. Here are the data:

› Present the results in a stem-and-leaf plot.

› Problem: 20 male students from the same school were also selected randomly for the survey. The data are shown below. Construct the corresponding stemplot.

› Sometimes, for clarity, we construct stem-and-leaf plots with split stems.

[Figure annotations: a cluster of data; a large gap between 22 and 35]

› The data for males and females can be compared using a back-to-back stem-and-leaf plot with common stems.

› Some tips to consider when making a stemplot:

1. Stemplots do not work well for large data sets, where each stem must hold a large number of leaves.

2. There is no magic number of stems to use, but five is a good minimum.

3. If you split stems, be sure that each stem is assigned an equal number of possible leaf digits (two stems, each with five possible leaves; or five stems, each with two possible leaves).

4. Round the data so that the final digit after rounding is suitable as a leaf. Do this when the data have too many digits.

› Practice: Here are the numbers of points scored by teams in the California Division I-AAA high school basketball playoffs in a single day’s games:

Construct a stemplot for the data.

C. Histograms

› Quantitative variables often take many values. A graph of the distribution is clearer if nearby values are grouped together. The resulting graph is called a histogram.

› The following are the steps in making a histogram:

1. Divide the data into classes of equal width.

2. Find the count (frequency) or percent (relative frequency) of individuals in each class.

3. Label and scale the axes and draw the histogram.
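
› The first two steps can be sketched in Python as below, with a rough text histogram standing in for step 3. The values, the class width of 5, and the function name class_counts are assumptions made for illustration and are not the state data from the slides:

```python
def class_counts(data, start, width, n_classes):
    """Count observations in n_classes equal-width classes [lo, lo + width)."""
    counts = [0] * n_classes
    for x in data:
        index = int((x - start) // width)
        if 0 <= index < n_classes:
            counts[index] += 1
    return counts

# Hypothetical percents (not the values from the slides).
values = [1.2, 3.5, 4.9, 5.0, 7.3, 8.8, 12.1, 13.4, 19.9, 27.2]
width = 5
counts = class_counts(values, start=0, width=width, n_classes=6)
percents = [100 * c / len(values) for c in counts]

# One row per class: class limits, a bar of stars, count and percent.
for i, (count, percent) in enumerate(zip(counts, percents)):
    lo = i * width
    print(f"{lo} to <{lo + width}: {'*' * count} ({count}, {percent:.0f}%)")
```

A value that falls exactly on a class boundary (5.0 here) is placed in the upper class, so every observation belongs to exactly one class.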

› Example: The following table presents, for each of the 50 states, the percent of the state’s residents who were born outside the US.

› The range is 1.2 to 27.2. Hence

› The frequency and relative frequency tables:

› Using these data, the histograms can be made:

› To see more subtle details of the distribution, more classes (i.e., smaller width) can be used:

› What is the difference between these histograms and the previous ones?

› Be careful when making histograms:

i. Don’t confuse histograms with bar graphs.

ii. Use percents instead of counts on the vertical axis when comparing distributions with different numbers of observations.

› Why is the first one misleading?

iii. Just because a graph looks nice doesn’t make it a meaningful display of data.

› Which one is a better display?

D. Density plots

› The size of the bins in a histogram can influence both its appearance and the interpretation of the distribution.

› Density plots smooth the bins in a histogram to reduce the effect of the choice of the size of bins.
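
› One common way to do this smoothing is a kernel density estimate. The sketch below uses a Gaussian kernel; the slides do not say which method their density plots use, so the kernel, the bandwidth of 5, and the ages are all assumptions made for illustration:

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function estimating the density of `data` at any point x."""
    n = len(data)

    def density(x):
        total = 0.0
        for xi in data:
            z = (x - xi) / bandwidth
            # Standard normal curve centred on each observation.
            total += math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        return total / (n * bandwidth)

    return density

# Hypothetical ages (not the Titanic data).
ages = [2, 19, 22, 24, 25, 28, 30, 31, 36, 45, 58]
density = gaussian_kde(ages, bandwidth=5.0)
for age in range(0, 71, 10):
    print(f"age {age:>2}: {'#' * round(density(age) * 1000)}")
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely, so the choice of bin size is replaced by a choice of bandwidth rather than removed entirely.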

› Example: Ages of those aboard the Titanic.

4. Describing quantitative data with numbers

› Consider the following survey of travel times to work (in minutes) for 15 randomly chosen workers in North Carolina:

› How do we describe the distribution?

› The stemplot of these data:

› Conclusions:

▪ Unimodal and right-skewed

▪ Center around 20 (Where is the exact center?)

▪ A possible outlier at 60 (Is it really an outlier?)

▪ Wide spread, from 5 to 60 (Is the spread reasonable?)

A. Measuring center

i. Mean

› The arithmetic average that measures the center of data

› Example: Here is the stemplot of the travel times to work for the sample of 15 North Carolinians.

a) Find the mean travel time for all 15 workers.

b) Calculate the mean again excluding the person who reported a 60-minute travel time to work. What do you notice?

ii. Median

› Another common measure of center is the median, which describes the midpoint of a distribution.

› Example: Here are the travel times in minutes of 20 randomly chosen New York workers:

a) Make a stemplot of the data. Be sure to include the key.

b) Find the median. Show your work.

› What is the difference between mean and median if they both measure the center of data?

• Mean: Arithmetic average of a set of data. It gives the “average” value of a variable.

• Median: Midpoint of a set of data. It gives the “typical” value of a variable.

› Comparing the mean and median of a distribution:

› Note that in a skewed distribution, the mean is usually farther out in the long tail than the median is. Therefore, the median is often reported for strongly skewed distributions such as incomes.
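
› A small Python illustration of this point, using Python's standard statistics module. The travel times are hypothetical (they are not the North Carolina or New York data) and were chosen to have one large value in the right tail:

```python
from statistics import mean, median

# Hypothetical right-skewed travel times in minutes (not the slide data).
times = [5, 10, 10, 15, 20, 20, 25, 30, 90]

print("With the 90-minute value:   ", mean(times), median(times))

# Drop the largest value and recompute both measures of center.
trimmed = times[:-1]
print("Without the 90-minute value:", mean(trimmed), median(trimmed))
```

Removing the single extreme value moves the mean from 25 to 16.875 but the median only from 20 to 17.5, which is why the median is described as the more resistant measure.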

B. Measuring spread

i. Interquartile range (IQR)

› A measure of center alone can be misleading

› A useful numerical description of a distribution requires both a measure of center and a measure of spread.

› How can we measure spread?

› Example: Calculate the quartiles for 15 workers’ travel times sampled in North Carolina.

› Arrange the times in increasing order:

› Example: Find and interpret IQR for the data on travel times to work for 20 randomly selected New Yorkers.

› Rewrite the list of values in increasing order:

› The first quartile is: $Q_1 = \frac{15 + 15}{2} = 15$

› The third quartile is: $Q_3 = \frac{40 + 45}{2} = 42.5$

› Therefore the interquartile range is: $\text{IQR} = Q_3 - Q_1 = 42.5 - 15 = 27.5$

› Interpretation: The range of the middle half of travel times for New Yorkers in the sample is 27.5 min.

› A useful application of IQR is to identify possible outliers in a data set.

› By definition:

› In the previous example, the IQR is 27.5. Hence, observations are flagged as outliers if they do not fall between:

$Q_1 - 1.5 \times \text{IQR} = 15 - 1.5 \times 27.5 = -26.25$

$Q_3 + 1.5 \times \text{IQR} = 42.5 + 1.5 \times 27.5 = 83.75$
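
› The same calculation can be sketched in Python. The 20 values below are illustrative only: they are an assumption, chosen so that the quartiles agree with the ones in this example, and are not copied from the slides. Quartiles are taken as the medians of the lower and upper halves of the ordered data, and the function names are hypothetical:

```python
from statistics import median

def quartiles(data):
    """Q1 and Q3 as the medians of the lower and upper halves of the data."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]  # leaves out the middle value when n is odd
    return median(lower), median(upper)

def outlier_fences(data):
    """Lower and upper cut-offs given by the 1.5 x IQR rule."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Illustrative travel times in minutes (assumed values, see the note above).
times = [5, 10, 10, 15, 15, 15, 15, 20, 20, 20,
         25, 30, 30, 40, 40, 45, 60, 60, 65, 85]

q1, q3 = quartiles(times)
low, high = outlier_fences(times)
outliers = [t for t in times if t < low or t > high]
print(q1, q3, q3 - q1, low, high, outliers)
```

With these assumed values, only an observation above 83.75 minutes is flagged, so the 85-minute time would count as a possible outlier.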

› The center and spread of a data set can be described well using a five-number summary:

› Note that approximately 25% of data fall between each pair of successive numbers in the summary.

› The five-number summary is usually represented by the boxplot (or box-and-whisker plot) which is constructed based on the following steps:

› Example: The following are the data on the number of home runs that Barry Bonds hit in each of his 21 complete seasons before his retirement in 2007.

› Make a boxplot for these data.

› Based on the data, we have

› The IQR is 45 − 25.5 = 19.5.

› The 1.5×IQR rule gives the range from −3.75 to 74.25. Hence, there are no outliers in the data set.
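
› Written out, the arithmetic behind this range uses the quartiles implied by the IQR calculation above ($Q_1 = 25.5$ and $Q_3 = 45$):

$Q_1 - 1.5 \times \text{IQR} = 25.5 - 1.5 \times 19.5 = 25.5 - 29.25 = -3.75$

$Q_3 + 1.5 \times \text{IQR} = 45 + 1.5 \times 19.5 = 45 + 29.25 = 74.25$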

› The resulting boxplot is thus:

› Technically, what we are doing when making a boxplot is as shown:

ii. Standard deviation and variance

› Another way of measuring the spread of data is to use standard deviation and its close relative, variance.

› The value of standard deviation can be determined by the following procedures:

› Example: Consider the following data on the number of pets owned by a group of 9 children. Determine the variance and standard deviation.

› We can construct the dotplot for the data set:

› The deviation of each observation from the mean and the corresponding squared value are listed below:

› Based on this table, we can compute the variance:

$s_x^2 = \frac{16 + 4 + 1 + 1 + 1 + 0 + 4 + 9 + 16}{9 - 1} = 6.5$

› The standard deviation is thus given by:

$s_x = \sqrt{6.5} \approx 2.55$

› It means that the number of pets typically varies from the average (i.e., 5 pets) by about 2.55 pets.
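
› The same steps in Python. The list of pet counts is an assumption: it is reconstructed from the squared deviations and the mean of 5 given in this example (taking the observations in increasing order), not copied from the slides:

```python
import math

# Assumed data: one set of 9 values consistent with a mean of 5 and the
# squared deviations 16, 4, 1, 1, 1, 0, 4, 9, 16 used in the example.
pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]

n = len(pets)
mean = sum(pets) / n
squared_deviations = [(x - mean) ** 2 for x in pets]
variance = sum(squared_deviations) / (n - 1)  # divide by n - 1, not n
std_dev = math.sqrt(variance)

print(f"mean = {mean}, variance = {variance}, standard deviation = {std_dev:.2f}")
```

Dividing by n − 1 rather than n matches the denominator 9 − 1 in the formula for the sample variance above.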

› There are some properties of standard deviations that are worth noting:

1. Standard deviation measures spread about the mean and should be used only when the mean is chosen as the measure of center.

2. Standard deviation is always greater than or equal to 0. It gets larger when observations become more spread out about the mean.

3. Standard deviation has the same units of measurement as the original observations.

4. Standard deviation is not resistant. A few outliers can make it very large. It is even more sensitive than the mean to a few extreme observations.

› There are two choices that can be used to describe center and spread: median/IQR and mean/standard deviation. Which one should we choose?

• Median and IQR: They are resistant to extreme values. Therefore, they are suitable for skewed distributions and distributions with strong outliers.

• Mean and standard deviation: They are more sensitive to the presence of outliers and strong skewness. Therefore, they are suitable for symmetric distributions.

› A rule of thumb that explains how standard deviation measures the variation in a data set or the spread in a relative frequency distribution is called the empirical rule or 68-95-99.7 rule.

• For a roughly bell-shaped distribution, about 68% of the measurements lie within one standard deviation of the mean, about 95% within two, and about 99.7% within three.
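
› These percentages can be checked against a normal model with Python's standard statistics module. This is a sketch that assumes the data follow a normal distribution exactly, which real data only approximate; the function name share_within is hypothetical:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * share_within(k):.1f}%")
```

The rule's 68, 95 and 99.7 are rounded versions of the values this calculation gives.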

› Practice: The following are the results of a survey on the texting habits of males and females. A random sample of students was asked to record the number of text messages sent and received over a two-day period.

› What conclusion can you draw? Give appropriate evidence to support your answer.
