
Numerically Summarizing Data

Learning Objectives

1. Understand the difference between a parameter and a statistic
2. Describe and compute measures of central tendency
3. Describe and compute measures of dispersion
4. Compute measures of location
5. Learn to read box plots and check for outliers

A parameter is a descriptive measure of a population.

In most real-world cases, the population parameter is not known. For example, the average gas price across the whole nation is a parameter whose value is unknown.

A statistic is a descriptive measure of a sample. We use a statistic to estimate the corresponding parameter. For example, the average gas price of the nation is not known. However, we can take a random sample of 100 stations, compute the sample average gas price, and use that sample average to estimate the unknown population average.

Measures of Central Tendency

(Mean, Median and Mode)

The population mean, μ, is computed using all the individuals in a population. If the total number of individuals is N, then μ = (x1 + x2 + ... + xN) / N.

The population mean is a parameter.

The sample mean, x̄, is computed using sample data. For a sample of n observations, x̄ = (x1 + x2 + ... + xn) / n.

The sample mean is a statistic that is an unbiased estimator of the population mean.

NOTE: In real-world applications, the population mean μ is usually not known and is estimated using the sample mean x̄.


The median of a variable is the value that lies in the middle of the data when arranged in ascending order. That is, half the data is below the median and half the data is above the median. We use m to represent the median.

Median

Steps in Computing the Median of a Data Set

1. Arrange the data in ascending order.
2. Determine the number of observations (n).
3. Determine the observation in the middle of the data set. The position is (n+1)/2.
(a) If (n+1)/2 is an integer, locate the data value at the (n+1)/2 position. This is the median. (NOTE: in this situation, the number of data values, n, is odd.)
(b) If (n+1)/2 is NOT an integer, the median is the average of the two data values on either side of the (n+1)/2 position. (NOTE: in this situation, n is even.)


EXAMPLE Computing the Median of Data

Find the mean and median of the following pulse rates from a sample of 8 individuals {NOTE: n = 8 in this case}

80, 76, 65, 68, 72, 73, 65, 80

Arrange them in ascending order: 65, 65, 68, 72, 73, 76, 80, 80

Find the position: (n+1)/2 = (8+1)/2=4.5

Position is not an integer: Median = (72+73)/2 = 72.5

Now add one additional pulse rate of 100 and find the median again {NOTE: n = 9 in this case}:

80, 76, 65, 68, 72, 73, 65, 80, 100

Ascending order: 65, 65, 68, 72, 73, 76, 80, 80, 100

Position: (9+1)/2 = 5, so the median is 73 (the value in the 5th position)
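The median steps can be sketched in a few lines of Python. This is only an illustration of the (n+1)/2 position rule; the function name `median` and the variable names are chosen here for the example and are not part of the course material.

```python
def median(values):
    """Return the median using the (n + 1) / 2 position rule."""
    data = sorted(values)        # Step 1: arrange in ascending order
    n = len(data)                # Step 2: number of observations
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n is odd, so (n + 1) / 2 is an integer position.
        # Position (n + 1) / 2 counted from 1 is index mid counted from 0.
        return data[mid]
    # n is even, so (n + 1) / 2 falls between two positions:
    # average the two data values on either side of it.
    return (data[mid - 1] + data[mid]) / 2


pulse_8 = [80, 76, 65, 68, 72, 73, 65, 80]
pulse_9 = pulse_8 + [100]

median_8 = median(pulse_8)
median_9 = median(pulse_9)
print(median_8, median_9)  # 72.5 73
```

Adding the extreme pulse rate of 100 moves the median only from 72.5 to 73.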

The mode of a variable is the most frequent observation of the variable in the data set.

If two values occur with the same highest frequency, we say the data is bimodal.

Exercise: Find the mode of the following pulse rate data

80, 76, 65, 68, 72, 73, 65, 80,100, 80, 74, 65, 66, 70, 74, 65, 80,98

Modes are: 65 and 80
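One possible way to check the modes is to count how often each value occurs. The sketch below uses Python's standard `collections.Counter`; the helper name `modes` is an assumption made for this example.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, in ascending order."""
    counts = Counter(values)              # value -> number of occurrences
    top = max(counts.values())            # the highest frequency in the data
    return sorted(v for v, c in counts.items() if c == top)


pulse = [80, 76, 65, 68, 72, 73, 65, 80, 100, 80, 74, 65, 66, 70, 74, 65, 80, 98]
pulse_modes = modes(pulse)
print(pulse_modes)  # [65, 80]
```

Both 65 and 80 occur four times, so the data is bimodal.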

Comparing Mean and Median: How does an extreme observation affect the mean and median?

[similar exam questions]

Example: The following are the quiz scores of 10 students in class A:

5, 5, 5, 5, 5, 7, 7, 7, 7, 7

Find mean = ______, find median = ______

The following are the quiz scores of 10 students in class B:

5, 5, 5, 5, 5, 7, 7, 7, 7, 30

Find mean = ______, find median = ______

Fact: The mean is sensitive to extreme data values. The median is robust to extreme data values.
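After working the two classes by hand, the answers can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are chosen for this example.

```python
from statistics import mean, median

class_a = [5, 5, 5, 5, 5, 7, 7, 7, 7, 7]
class_b = [5, 5, 5, 5, 5, 7, 7, 7, 7, 30]   # same as class A except for one extreme score

for name, scores in (("A", class_a), ("B", class_b)):
    print(f"Class {name}: mean = {mean(scores)}, median = {median(scores)}")

# How much did the single extreme score move each measure?
mean_shift = mean(class_b) - mean(class_a)
median_shift = median(class_b) - median(class_a)
print(f"mean shift = {mean_shift:.1f}, median shift = {median_shift:.1f}")
```

The only difference between the two classes is the last score, so any change in a summary measure is due to that one extreme value.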

How do unusual cases affect the average, the median, and the shape of the histogram?

Compare Histograms with/without the ‘outlier’ case, 5000 miles

[Figure: Histogram of Miles, with the case of 5000 miles. x-axis: Miles (0 to 4500); y-axis: Frequency (0 to 90).]

[Figure: Histogram of Miles, without the case of 5000 miles. x-axis: Distance from Home (0 to 600); y-axis: Frequency (0 to 50).]

Descriptive Statistics: Miles for 148 cases (with the case of 5000 miles)

Variable  N    Mean   SE Mean  TrMean  StDev  Min  Q1  Median  Q3   Max
Miles     148  151.5  33.6     111.8   409.4  1.0  75  120     150  5000

Descriptive Statistics: Miles for 147 cases (without the case of 5000 miles)

Variable  N    Mean    SE Mean  TrMean  StDev  Min  Q1  Median  Q3   Maximum
Miles     147  ______  6.71     111.02  81.37  1.0  75  ______  150  600

Shape is _________          Shape is _________


When data sets have unusually large or small values relative to the entire set of data or when the distribution of the data is skewed, the median is the preferred measure of central tendency over the arithmetic mean because it is more representative of the typical observation.

Descriptive Statistics: Miles for 147 cases (without the case of 5000 miles)

Variable  N    Mean    Min  Median  Maximum
Miles     147  118.52  1    120     600

NOTE: The median remains unchanged. Why? The median uses only the middle one or two data values, whereas the mean uses every data value. So one unusually large value makes the mean larger (151.5 with the 5000-mile case versus 118.52 without it) but leaves the median at 120.

Comparison of Mean, Median, and Mode for different shapes of distributions

[Similar exam question]

Shape         Order from left to right   Mean vs. Median
Left-Skewed   Mean, Median, Mode         Mean < Median
Symmetric     Mean = Median = Mode       Mean ~ Median
Right-Skewed  Mode, Median, Mean         Mean > Median

NOTE: In real-world applications, the distribution of sample data is never perfectly symmetric. The shape can only be approximately symmetric.

IF MEAN IS CLOSE TO MEDIAN (NOT NECESSARILY EXACTLY MEAN = MEDIAN), WE WOULD SAY THE DISTRIBUTION IS APPROXIMATELY SYMMETRIC.

Exercise: A sample of 50 gas prices is recorded and summarized. The average price is $3.15 and the median price is $3.13. Is the shape of the price distribution more likely to be skewed to the left, approximately symmetric, or skewed to the right?

ANS:

Measures of dispersion describe how spread out the data values are. The more spread out the data values, the larger the variation.

Example: Scores of 5 students in class A: 60,60,70,80,80

Scores of 5 students in class B: 40,60,70,80,100

Scores of 5 students in class C: 70,70,70,70,70

Q: Scores in Class _____ have the largest variation.

Scores in Class _____ have zero variation.

Measures of Dispersion

Four different measures of dispersion: Range, Variance, Standard Deviation, Interquartile Range (IQR)

Visualizing Variability using Histogram

[Figure: three histograms labeled A, B, and C]

Which one shows the largest variation:

Which one shows the smallest variation:

How to measure the variation?

• Range = R = Largest Data Value – Smallest Data Value

• The sample variance is: s² = Σ(xi − x̄)² / (n − 1)

• The sample standard deviation is: s = √(s²)

The population variance is symbolically represented by lower case Greek sigma squared: σ² = Σ(xi − μ)² / N

The population standard deviation is: σ = √(σ²)

NOTE: the divisor (n − 1) in the sample variance is called the degrees of freedom.
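To make the computation concrete, here is a minimal sketch that applies the sample variance (dividing by n − 1) and the sample standard deviation to the three classes of five scores from the earlier example. The function names are assumptions made for this illustration.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Square root of the sample variance (same unit as the data)."""
    return math.sqrt(sample_variance(values))


classes = {
    "A": [60, 60, 70, 80, 80],
    "B": [40, 60, 70, 80, 100],
    "C": [70, 70, 70, 70, 70],
}

for name, scores in classes.items():
    value_range = max(scores) - min(scores)
    print(f"Class {name}: range = {value_range}, "
          f"s^2 = {sample_variance(scores):.1f}, s = {sample_sd(scores):.2f}")
```

All three classes have the same mean of 70, so the differences in the output come entirely from how spread out the scores are.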

NOTE: As mentioned before, in real-world problems the population mean, population variance, and population standard deviation are NOT KNOWN.

Like the sample mean, the sample variance and sample standard deviation are obtained from sample data. They are used to estimate the unknown population variance and population standard deviation. This estimation is a major part of inferential statistics, which will be dealt with later.

In this Chapter, we are learning how to compute and interpret these sample descriptive summaries to understand the sample data.

Notation:

s²: sample variance; s: sample standard deviation

σ²: population variance; σ: population standard deviation

NOTE: If the original measurement unit is (ft), the variance s² has measurement unit (ft)², since if x has unit (ft), then (x − x̄)² has the unit (ft)(ft), which is (ft)².

The measurement unit of s² is (ft)². The measurement unit of s is (ft).

Some important Tips

NOTE: Sample statistics, such as the sample mean x̄, sample median, s, and s², will be different for different samples.

Population parameters, such as the population mean μ, population variance σ², and population s.d. σ, are fixed constants for a given population. They do not change for different samples.

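Exercise

Comparing Variation: Quiz Scores of 40 students

[similar exam questions]

[Figure: frequency plots of quiz scores (0 to 10) for three classes of 40 students each. Class A: two bars of 20 each (axis ticks 0, 5, 10). Class B: two bars of 20 each (axis ticks 0, 4, 5, 6, 10). Class C: bars with frequencies 1, 3, 2, 4, 9, 10, 5, 3, 1, 2 (axis ticks 0, 5, 10).]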

Variation:

Which one has smallest s.d.?

Which has largest s.d.?

Answer

Class B has smallest standard deviation

Class A has largest standard deviation

Points to remember about variance and standard deviation and the relationship with histogram:

- The values of s and s² are always greater than or equal to zero.

- The larger the value of s² or s, the greater the variability of the data set.

- If s² or s is equal to zero, all measurements must have the same value.

- The standard deviation s is computed in order to have a measure of variability having the same unit as the observations.

- The larger the s.d., the more spread the data, the flatter the histogram.

- The smaller the s.d., the more clustered the data around the mean, the taller the peak of the histogram.

Exercise (Similar Exam questions)

1. The gas price is a concern for people. A random sample of 40 stations gives the following data summary:

Sample mean = $2.15, Median = $2.12, s = $0.15

Q: Is the distribution of the gas prices more likely to be (a) symmetric, (b) skewed to the right, or (c) skewed to the left? And WHY?

2. The following two data sets are prices of milk from 6 stores, one from January 2004 and one from a year later.

Store:                  A     B     C     D     E     F
Price in January 2004   1.85  1.95  1.85  2.00  1.78  1.97
Price in January 2005   2.05  2.15  2.05  2.20  1.98  2.17

True or False for each of the following statements:
(a) The average price remains the same between the two years.
(b) The price range remains the same between the two years.
(c) The median remains the same between the two years.
(d) The standard deviation (s) remains the same between the two years.
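One way to check the True/False answers for the milk prices is to compute each summary for both years and compare. The sketch below uses Python's standard `statistics` module; the helper `summary` and the variable names are assumptions made for this example.

```python
import math
from statistics import mean, median, stdev

jan_2004 = [1.85, 1.95, 1.85, 2.00, 1.78, 1.97]
jan_2005 = [2.05, 2.15, 2.05, 2.20, 1.98, 2.17]


def summary(prices):
    """Mean, median, range, and sample standard deviation of a list of prices."""
    return {
        "mean": mean(prices),
        "median": median(prices),
        "range": max(prices) - min(prices),
        "sd": stdev(prices),          # sample s.d. (divides by n - 1)
    }


s04 = summary(jan_2004)
s05 = summary(jan_2005)

for key in s04:
    change = s05[key] - s04[key]
    same = math.isclose(change, 0.0, abs_tol=1e-9)   # tolerate floating-point rounding
    print(f"{key:6s} 2004 = {s04[key]:.3f}  2005 = {s05[key]:.3f}  unchanged: {same}")
```

Every store's price went up by the same $0.20, so the output shows which summaries shift with the data and which measure only spread.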

Descriptive Statistics: distance

Descriptive Summary for the 56 distances

Variable  N   Mean   Median  TrMean  StDev  SE Mean
distance  56  142.0  140.0   128.3   112.2  15.0

Variable  Minimum  Maximum  Q1    Q3
distance  5.0      800.0    92.5  160.0

Reading the output:

- Mean is x̄, the sample mean; Median is m.
- TrMean is the trimmed mean: the mean after excluding the lowest 5% and the highest 5% of the data.
- StDev is s, the sample standard deviation, so s² = (112.2)².
- Minimum is the smallest distance and Maximum is the largest.
- 25% of the distances are lower than Q1, the first quartile, or 25th percentile.
- 75% of the distances are lower than Q3, the third quartile, or 75th percentile.

If we add a new maximum, 6000, to the data so that we have 57 cases, what is the effect of the 6000 on the following summary statistics: Increase? Decrease? The same?

(a) the average distance:
(b) the median distance:
(c) the standard deviation:
(d) the range:

Answer

After adding 6000 miles to the data:

• Average distance is increased.

• Median distance for this example is the same (in general, it will be almost the same).

• Standard deviation is increased.

• Range is increased.

Empirical Rule and Applications

What is the meaning of variation and how is it used in solving real-world problems?

For symmetric, mound-shaped (bell-shaped) data, approximately:

- 68% of the data is within ± 1 s of the mean
- 95% of the data is within ± 2 s of the mean
- 99.7% of the data is within ± 3 s of the mean

An important application of the Empirical Rule is identifying rare (unusual, extreme) observations. If an observation falls outside the two-s.d. range, it has only about a 5% chance of occurring. Therefore, it is considered a rare (or unusual) case.

[Figure: bell curve divided at 1 and 2 s.d. from the mean. On each side of the mean: 34% within 1 s.d., 13.5% between 1 and 2 s.d., 2.5% beyond 2 s.d.]

NOTE: If you add the percentages on each side of the center line, they add up to 50%. A mound-shaped distribution is symmetric about the mean.

Applying Empirical Rule to identify Rare Events

A simple and powerful tool for identifying outliers, extremes, or unusual or rare events. We will use this rule very often throughout the entire semester.

Consider the 2010 ACT test: the average was 21 and the standard deviation was 4. The distribution of the ACT scores is mound-shaped.

Q1: A student received a score of 25. Is this an unusually high score?

Q2: If CMU sets its minimum ACT score for admission at one standard deviation below the mean, what is the minimum ACT score for CMU admission?

Q3: A student received an ACT of 30. Is this an unusually high score?

(Similar questions in the test)

ANSWER:

Q1: 25 = 21 + 4, which is one s.d. above the mean. It is within two s.d. of the mean, so it is NOT an unusually high score.

Q2: The score at one s.d. below the mean = 21 – 4 = 17.

Q3: The score 30 > 21 + 2(4) = 29, so 30 is more than two s.d. above the mean. Only about 2.5% of scores are higher than 29. Hence, 30 is an unusually high score.
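The two-s.d. check used in these answers can be written as a small helper. This is only a sketch of the rule of thumb for mound-shaped data; the function name `is_unusual` and the parameter `k` are assumptions made for this example.

```python
def is_unusual(value, mean, sd, k=2):
    """True if value lies more than k standard deviations away from the mean."""
    return abs(value - mean) > k * sd


act_mean, act_sd = 21, 4

q1_unusual = is_unusual(25, act_mean, act_sd)   # 25 is one s.d. above the mean
q2_minimum = act_mean - act_sd                  # one s.d. below the mean
q3_unusual = is_unusual(30, act_mean, act_sd)   # 30 is beyond 21 + 2(4) = 29

print(q1_unusual, q2_minimum, q3_unusual)  # False 17 True
```

The default `k=2` matches the "outside two s.d." rule; a score exactly at 29 would not be flagged because the comparison is strict.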

Exercise: Estimating average, standard deviation and applying Empirical Rule when distribution is mound-shaped

We collect a sample of weekly spending from 40 students. Suppose the spending has a mound-shaped distribution. We only know the min = $20 and max = $80. As you can see, weekly spending varies.

(a) Give a good estimate of the average and the standard deviation of the weekly spending based on the 40 students' data.
(b) Approximately what percentage of students would spend $35 or more per week?

ANS (a): Since the distribution is mound-shaped, we can use the midpoint (20+80)/2 = $50 to estimate the average spending. Since about 95% of mound-shaped data falls within two s.d. of the mean, the range covers roughly four s.d., so we use s = range/4 to estimate the s.d., which gives (80-20)/4 = $15.

ANS (b): Using these estimates, $35 is one s.d. below the mean ($50 - $15). Hence, the percentage spending $35 or more = 34% + 50% = 84%. Approximately 84% of students spend $35 or more per week.
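The arithmetic in this answer can be laid out as a short sketch. It assumes, as the exercise does, that the data is mound-shaped so the midpoint and range/4 are reasonable rough estimates; the function name `estimate_from_range` is made up for this example.

```python
def estimate_from_range(minimum, maximum):
    """Rough estimates for mound-shaped data when only min and max are known."""
    center = (minimum + maximum) / 2      # midpoint as an estimate of the mean
    spread = (maximum - minimum) / 4      # range / 4 as an estimate of s
    return center, spread


est_mean, est_sd = estimate_from_range(20, 80)

# How many estimated s.d. is $35 away from the estimated mean?
z = (35 - est_mean) / est_sd

# Empirical rule areas: 34% between (mean - 1 s) and the mean, 50% above the mean.
pct_35_or_more = 34 + 50

print(est_mean, est_sd, z, pct_35_or_more)  # 50.0 15.0 -1.0 84
```

These are rough estimates only; with the full sample of 40 values, the sample mean and s would be computed directly.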


Five Number Summary; Box plots

The Five-Number Summary

MINIMUM Q1 Median Q3 MAXIMUM

IQR (Inter-quartile Range) = Q3 – Q1


Steps for Drawing a Box plot

Step 1: Determine the lower and upper fence:

Lower fence = Q1 – 1.5(IQR)

Upper fence = Q3 + 1.5(IQR)

Step 2: Draw vertical lines at Q1, M and Q3. Enclose these vertical lines in a box.

Step 3: Label the lower and upper fences.

Step 4: Draw a line from Q1 to the smallest data value that is larger than the lower fence. Draw a line from Q3 to the largest data value that is smaller than the upper fence.

Step 5: Any data values less than the lower fence or greater than the upper fence are outliers and are marked with an asterisk (*).


EXAMPLE Drawing a Boxplot

The five-number summary of the serum HDL data:

Min  Q1  M   Q3  Max  IQR
28   38  48  56  73   Q3 - Q1 = 56 - 38 = 18

Compute the lower and upper fences and draw a boxplot for the serum HDL.

[Figure: Boxplot of HDL. Axis: HDL (30 to 70); Q1, the median, Q3, and the mean are marked.]
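The fence computation for the HDL example can be sketched as follows, using Lower fence = Q1 – 1.5(IQR) and Upper fence = Q3 + 1.5(IQR). The function name `fences` is an assumption for this illustration; only the five-number summary is used, since the raw HDL values are not listed here.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Five-number summary of the serum HDL data
minimum, q1, m, q3, maximum = 28, 38, 48, 56, 73

lower, upper = fences(q1, q3)

# If even the smallest and largest values are inside the fences, there are no outliers.
has_outliers = minimum < lower or maximum > upper

print(lower, upper, has_outliers)  # 11.0 83.0 False
```

Because the minimum (28) and maximum (73) both fall inside the fences, the lines of the boxplot extend all the way to the minimum and maximum and no asterisks are drawn.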

Relationship between Distribution Shape and Boxplot (Similar questions in the test)

1. If the median is near the center of the box and the two horizontal lines are approximately equal in length, then the distribution is roughly symmetric.

2. If the median is left of the center of the box and/or the right line is substantially longer than the left line, the distribution is right skewed.

3. If the median is right of the center of the box and/or the left line is substantially longer than the right line, the distribution is left skewed.

[Figures: three examples labeled Symmetric, Skewed Right, and Skewed Left]

Distance data – 100 distances

[Figure: Histogram of Miles. x-axis: Miles (0 to 1000); y-axis: Frequency (0 to 50).]

[Figure: Boxplot of Miles. Axis: Miles (0 to 900).]

[Figure: Histogram of Miles, paneled by Gender (female, male). x-axis: Miles (0 to 800); y-axis: Frequency (0 to 35).]

[Figure: Boxplot of Miles, paneled by Gender (female, male). Axis: Miles (0 to 800).]


EXAMPLE Comparing Two Data Sets Using Boxplots

The following boxplots represent the birth rate for women 15 - 44 years of age in 1990 and 1997 for each state.

What conclusion can you make?

Date post:	17-Dec-2015
Category:	Documents
Upload:	jonah-reynold-howard