
Week21 Describing distributions with numbers A large number or numerical methods are available for...

Date post: 02-Jan-2016
Upload: antonia-dean

Describing distributions with numbers

• A large number of numerical methods are available for describing quantitative data sets. Most of these methods measure one of two data characteristics:

The central tendency of the set of observations – it is the tendency of the data to cluster, or center, about certain numerical values.

The variability of the set of observations – it is the spread of the data.


Measuring Center

• The mode is the observation that occurs most frequently.

• The mode for a categorical variable is the label of the category with the highest count.

• Two common measures of center are the mean and the median. These two measures behave differently: the mean is the “average value” and the median is the “middle value”.


Measuring center: the median

• The median M is the midpoint of the distribution, the number such that half the observations are smaller than it and the other half are larger.

• To find the median of a distribution:

1. Arrange the observations in order of size, from smallest to largest.

2. If the number of observations n is odd, the median is the center observation in the ordered list.

3. If the number of observations n is even, the median is the average of the two center observations in the ordered list.


Example

The annual salaries (in thousands of $) of a random

sample of five employees of a company are:

40, 30, 25, 200, 28

Arranging the values in increasing order:

25 28 30 40 200

median = 30

Excluding 200, median = (28 + 30)/2 = 29.
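The three-step procedure for the median can be sketched in Python as follows. The function name `median` and the variable names are illustrative choices, not part of the course material; the data are the five salaries from the example.

```python
def median(values):
    """Return the median by sorting and picking the middle of the list."""
    ordered = sorted(values)          # step 1: arrange from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                    # step 2: odd n, take the center observation
        return ordered[mid]
    # step 3: even n, average the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


salaries = [40, 30, 25, 200, 28]      # thousands of dollars
median_all = median(salaries)
median_without_200 = median([s for s in salaries if s != 200])
print(median_all, median_without_200)
```

Dropping the value 200 moves the median only from 30 to 29, which is what makes it a resistant measure.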


• MINITAB commands Stat > Basic Statistics > Display Descriptive Statistics

• MINITAB output for the data in the example above is given below:

Variable  N  Median
salary    5  30.0


Measuring center: mean

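• To find the mean of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, \ldots, x_n$, their mean is given by

$$\text{mean} = \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

• Example

Find the mean of the following observations: 4, 5, 9, 3, 5.

Solution:

$$\bar{x} = \frac{4 + 5 + 9 + 3 + 5}{5} = \frac{26}{5} = 5.2$$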


Example

• The annual salaries (in thousands of $) of a random sample of five employees of a company are: 40, 30, 25, 200, 28.

$$\bar{x} = \frac{40 + 30 + 25 + 200 + 28}{5} = \frac{323}{5} = 64.6$$

If we exclude 200 as an outlier,

$$\bar{x} = \frac{40 + 30 + 25 + 28}{4} = \frac{123}{4} = 30.75$$

• The mean is sensitive to the influence of a few extreme observations. Because the mean cannot resist the influence of extreme values, we say that it is NOT a resistant measure of center.
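A short sketch using Python's standard `statistics` module shows how differently the two measures react when the salary of 200 is dropped. The variable names are illustrative only.

```python
from statistics import mean, median

salaries = [40, 30, 25, 200, 28]                 # thousands of dollars
trimmed = [s for s in salaries if s != 200]      # same data without the outlier

mean_all, mean_trimmed = mean(salaries), mean(trimmed)
median_all, median_trimmed = median(salaries), median(trimmed)

# The mean moves a long way when one extreme value is removed;
# the median barely moves.
print("mean:  ", mean_all, "->", mean_trimmed)
print("median:", median_all, "->", median_trimmed)
```

The mean shifts by more than 30 (thousand dollars) while the median shifts by 1.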


Example - Calculation for grouped data

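Determine the mean of the data represented by the following frequency table.

Class Interval  Frequency
10-20           2
20-30           4
30-40           7
40-50           6
50-60           1

Solution: Treating every observation as if it sat at its class midpoint (15, 25, 35, 45, 55), with n = 2 + 4 + 7 + 6 + 1 = 20,

$$\bar{x} \approx \frac{2(15) + 4(25) + 7(35) + 6(45) + 1(55)}{20} = \frac{700}{20} = 35$$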


Mean versus median

• The median and mean are the most common measures of the center of a distribution.

• If the distribution is exactly symmetric, the mean and median are exactly the same.

• The median is less influenced by extreme values than the mean.

• If the distribution is skewed to the right, then typically mode < median < mean.

• If the distribution is skewed to the left, then typically mean < median < mode.


Trimmed mean

• The trimmed mean is a measure of center that is more resistant than the mean but uses more of the available information than the median.

• To compute the 10% trimmed mean, discard the highest 10% and the lowest 10% of the observations and compute the mean of the remaining 80%. Similarly, we can compute 5%, 20% etc. trimmed mean.

• Trimming eliminates the effect of a small number of outliers.


Example

• Compute the 10% trimmed mean of the data given below.

20 40 22 22 21 21 20 10 20 20 20 13 18 50 20 18 15 8 22 25

• Solution:

- Arrange the values in increasing order: 8 10 13 15 18 18 20 20 20 20 20 20 21 21 22 22 22 25 40 50

- There are 20 observations and 10% of 20 = 2. Hence, discard the first 2 and the last 2 observations in the ordered data and compute the mean of the remaining 16 values.

Variable  N   Mean
C2        16  19.812

Exercise 1.77 on page 63 in IPS.
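The trimming procedure can be sketched in Python as below. The function name `trimmed_mean` is hypothetical, and rounding the number of discarded values down when the percentage of n is not a whole number is an assumption; the slides only cover the case where it is a whole number.

```python
def trimmed_mean(values, percent):
    """Mean after discarding `percent`% of the observations from each end."""
    ordered = sorted(values)
    n = len(ordered)
    k = n * percent // 100            # observations to drop per end (rounded down)
    kept = ordered[k:n - k]
    if not kept:
        raise ValueError("trimming removed every observation")
    return sum(kept) / len(kept)


data = [20, 40, 22, 22, 21, 21, 20, 10, 20, 20,
        20, 13, 18, 50, 20, 18, 15, 8, 22, 25]
result = trimmed_mean(data, 10)       # drops 8, 10 and 40, 50
print(result)
```

With `percent=0` the function reduces to the ordinary mean, which makes it easy to compare the two on the same data.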


Questions

1. You are asked to recommend a measure of center to characterize the following data: 0.6, 0.2, 0.1, 0.2, 0.2, 0.3, 0.7, 0.1, 0.0, 22.5, 0.4. What is your recommendation and why?

2. The mean is ____ sensitive to extreme values than the median.
(a) more (b) less (c) equally (d) can’t say without data

3. Changing the value of a single score in a data set will necessarily cause the mean to change. (T/F)

4. Changing the value of a single score in a data set will necessarily cause the median to change. (T/F)


Percentiles

• The simplest useful numerical description of a distribution consists of both a measure of center and a measure of spread.

• We can describe the spread or variability of a distribution by giving several percentiles.

• The pth percentile of a distribution is the value such that p percent of the observations are smaller than or equal to it.

• The median is the 50th percentile.

• If a data set contains n observations, then the pth percentile is the $\frac{p}{100}(n+1)$th value in the ordered data set.


Example

• Find the 20th percentile of the data represented by the following stem-and-leaf plot.

Stem-and-leaf of Rural  N = 29
Leaf Unit = 1.0  N* = 7

   1    2  1
   5    3  3589
 (12)   4  122333456788
  12    5  112467
   6    6  7
   5    7  04
   3    8  48
   1    9
   1   10  8


Solution
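As a hedged check of this example, the sketch below reads the 29 values off the Rural stem-and-leaf plot and applies the rule that the pth percentile is the p(n+1)/100-th ordered value. When that position is not a whole number the code interpolates linearly between the two neighbouring values; that interpolation is an assumption, since the slides do not say how to handle fractional positions. The function name `percentile` is illustrative.

```python
def percentile(values, p):
    """pth percentile as the p(n+1)/100-th value of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1) / 100      # 1-based position in the ordered list
    lower = int(position)             # whole part of the position
    fraction = position - lower
    if lower < 1:                     # position falls before the first value
        return ordered[0]
    if lower >= n:                    # position falls at or after the last value
        return ordered[-1]
    # assumption: linear interpolation for fractional positions
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


rural = [21, 33, 35, 38, 39,
         41, 42, 42, 43, 43, 43, 44, 45, 46, 47, 48, 48,
         51, 51, 52, 54, 56, 57, 67, 70, 74, 84, 88, 108]
p20 = percentile(rural, 20)           # position 20 * 30 / 100 = 6
print(p20)
```

Here n = 29, so the 20th percentile sits exactly at the 6th ordered value and no interpolation is needed.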


Quartiles

• The 25th percentile is called the first quartile (Q1).

• The first quartile (Q1) is the median of the observations whose position in the ordered list is to the left of the location of the overall median.

• The 75th percentile is called the third quartile (Q3).

• The third quartile (Q3) is the median of the observations whose position in the ordered list is to the right of the location of the overall median.

NOTE: The median is the second quartile Q2 .


Example

The highway mileages of 20 cars, arranged in increasing order are:

13 15 16 16 17 19 20 22 23 23 | 23 24 25 25 26 28 28 28 29 32.

The median is …

The first quartile Q1 is …

The third quartile Q3 is…

• Exercise: Find

(a) the 10th percentile.

(b) the 90th percentile of the above data set.


Measuring Spread

• The range (max-min) is a measure of spread but it is very sensitive to the influence of extreme values.

• The distance between the first and third quartiles is called the interquartile range (IQR), i.e. IQR = Q3 – Q1.

• The IQR is another measure of spread that is less sensitive to the influence of extreme values.


The five-number summary

• The five-number summary of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile and the largest observation.

• These five numbers give a reasonably complete description of both the center and the spread of the distribution.

• MINITAB commands: Stat > Basic Statistics > Display Descriptive Statistics


Example

• The highway mileages of 20 cars, arranged in increasing order, are: 13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32. Give the five-number summary.

• Answer: From example 1.14 p45 we have min = 13, first quartile = 18, median = 23, third quartile = 27, max = 32.

The MINITAB output using the above commands is as follows:

Variable  N   Minimum  Q1     Median  Q3     Maximum
mileage   20  13.00    17.50  23.00   27.50  32.00
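The hand answer (Q1 = 18, Q3 = 27) and the MINITAB output (17.50, 27.50) differ because quartiles can be computed by more than one convention. The sketch below implements the "median of each half" rule given in the quartile definitions, leaving the overall median out of both halves when n is odd. The function names are illustrative. The MINITAB figures are consistent with interpolating at position p(n+1)/100 instead, though that is an inference from the numbers rather than something stated in the slides.

```python
def _median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """(min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]                 # observations left of the median
    upper = ordered[half + n % 2:]         # observations right of the median
    return (ordered[0], _median(lower), _median(ordered),
            _median(upper), ordered[-1])


mileage = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23,
           23, 24, 25, 25, 26, 28, 28, 28, 29, 32]
summary = five_number_summary(mileage)
print(summary)
```

For these 20 values the rule reproduces the hand answer: min 13, Q1 18, median 23, Q3 27, max 32.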


Box-plot

• A box-plot is a graph of the five-number summary.

• Example:

Make a box-plot for the data in the above example.

• MINITAB commands: Graph > Boxplot

[Figure: Boxplot of Mileages; vertical axis Mileages, ticks at 15, 20, 25, 30]


Exercise

• The dot-plot for a set of 20 observations is given below:

[Figure: Dotplot for Score; horizontal axis Score, ticks at 90, 100, 110, 120]

• Draw a box-plot for the data (use the same scale).


Exercise

The stemplot for a set of 50 observations is given below:

Stem-and-leaf of Fees  N = 50
Leaf Unit = 1.0

   2   0  89
   7   1  00234
  15   1  55558899
 (28)  2  0000000000000111112222222223
   7   2  59
   5   3  0
   4   3  5
   3   4  00
   1   4
   1   5  0

Draw a box-plot for the data. Compute the 40% trimmed mean.


Exercise

• The box-plot, histogram and stem-and-leaf plot for a data set are given below. Describe the distribution.

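Stem-and-leaf of C2  N = 50
Leaf Unit = 1.0

 (29)  0  00011111111122222222233444444
  21   0  55555666788
  10   1  0234
   6   1  66
   4   2  1
   3   2  88
   1   3
   1   3  8

[Figure: box-plot of C2; axis C2, ticks at 0, 10, 20, 30, 40]

[Figure: histogram of C2; horizontal axis C2 from 0 to 40 in steps of 4, vertical axis Frequency, ticks at 0, 10, 20]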


Exercise

• Consider the following Minitab-generated box-plots of coagulation times in seconds for samples of blood drawn from animals receiving three different diets, denoted 1, 2, and 3:

[Figure: box-plots of coagtimes (vertical axis, ticks at 60, 65, 70) by Diet (1, 2, 3)]

• State whether the following statements are true or false.

a) The animal that had the longest coagulation time was given diet 3.

b) The greatest variability occurs with diet 2.

c) Diet 1 shows evidence of right (positive) skewness but diet 2 shows evidence of left (negative) skewness.

d) Approximately 25% of animals on diet 2 had coagulation times less than 63.

e) The smallest upper (third) quartile is for diet 3.

f) We can see that the mean for diet 1 is less than 62 seconds.


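Measuring spread: Standard deviation

• The variance ($s^2$) of a set of n observations $x_1, x_2, \ldots, x_n$ is

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

• The standard deviation (s) is the square root of the variance ($s^2$), i.e.

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

• It can be shown that

$$s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1}$$

This formula is usually quicker.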


• The deviations $x_i - \bar{x}$ display the spread of the values $x_i$ about their mean. Some of these deviations will be positive and some negative because the observations fall on each side of the mean.

• The sum of the deviations of the observations from their mean will always be zero.

• Squaring the deviations makes them all non-negative, so that observations far from the mean in either direction have large positive squared deviations.

• The variance is the average of the squared deviations, with the sum divided by n − 1 rather than n.

• The variance, $s^2$, and the standard deviation, s, will be large if the observations are widely spread about their mean, and small if the observations are all close to the mean.


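Example

• Find the standard deviation of the following data set: 4, 8, 2, 9, 7.

• Solution: n = 5, $\bar{x} = 30/5 = 6$.

• Using the second formula we have

$$s^2 = \frac{(4^2 + 8^2 + 2^2 + 9^2 + 7^2) - 5(6)^2}{5 - 1} = \frac{214 - 180}{4} = 8.5$$

and so $s = \sqrt{8.5} \approx 2.92$.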
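As a numerical cross-check, the sketch below computes the sample variance of this data set two ways: from the squared deviations about the mean, and from the shortcut form (sum of squares minus n times the squared mean, divided by n − 1). The function names are illustrative; both should agree up to floating-point rounding.

```python
from math import sqrt


def variance_from_deviations(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def variance_shortcut(values):
    """(sum of x^2 - n * xbar^2) / (n - 1), the quicker hand formula."""
    n = len(values)
    xbar = sum(values) / n
    return (sum(x * x for x in values) - n * xbar ** 2) / (n - 1)


data = [4, 8, 2, 9, 7]
s2_dev = variance_from_deviations(data)
s2_short = variance_shortcut(data)
s = sqrt(s2_dev)
print(s2_dev, s2_short, round(s, 2))
```

The shortcut saves work by hand, but in floating-point arithmetic it can lose accuracy when the mean is large relative to the spread, so the deviation form is usually the safer choice in code.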


• MINITAB commands Stat > Basic Statistics > Display Descriptive Statistics

• MINITAB output for the above data is given below:

Variable N StDev

C1 5 2.92

• Exercise: Find the standard deviation of the following data set: 5, 8, 7, 9, 7, 11.


Example - Calculation from grouped data

• Determine the standard deviation of the data represented by the following frequency distribution.

Class Interval  Frequency
10-20           2
20-30           4
30-40           7
40-50           6
50-60           1


Solution
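One common way to work this out, sketched below, is to treat every observation in a class as if it sat at the class midpoint and then apply the usual formulas with each midpoint weighted by its frequency. The midpoint approach is an assumption here, since the worked solution did not survive in this transcript, and the function name `grouped_mean_sd` is hypothetical.

```python
from math import sqrt


def grouped_mean_sd(table):
    """Approximate mean and standard deviation from (low, high, freq) rows."""
    midpoints = [(low + high) / 2 for low, high, _ in table]
    freqs = [freq for _, _, freq in table]
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    # shortcut form of the variance, weighted by class frequency
    sum_sq = sum(f * m * m for f, m in zip(freqs, midpoints))
    variance = (sum_sq - n * mean ** 2) / (n - 1)
    return mean, sqrt(variance)


table = [(10, 20, 2), (20, 30, 4), (30, 40, 7), (40, 50, 6), (50, 60, 1)]
g_mean, g_sd = grouped_mean_sd(table)
print(g_mean, round(g_sd, 2))
```

Because the raw values inside each class are unknown, both results are approximations to the mean and standard deviation of the original data.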


Properties of standard deviation (s)

• s measures the spread about the mean and should be used only when the mean is chosen as the measure of center.

• s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise, s > 0.

• s, like the mean, is not resistant to extreme values. A few outliers can make s very large.


Ballpark approximation for s

• The ballpark approximation for the standard deviation s is Range/4 (divide by 3 if there are fewer than 10 observations, divide by 5 if there are more than 100 observations).

• For the data set 4, 8, 2, 9, 7, range = 9 – 2 = 7 and there are fewer than 10 observations, so $s \approx 7/3 \approx 2.33$.


The empirical (68-95-99.7) rule

• With a bell-shaped distribution, about 68% of the data fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations of the mean, and about 99.7% fall within 3 standard deviations of the mean.

• What if the distribution is not bell-shaped? Another rule, Chebyshev's Rule, tells us that at least 75% of the data must lie within 2 standard deviations of the mean, regardless of the shape, and at least 8/9 (about 89%) within 3 standard deviations.
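The 75% and 89% figures come from Chebyshev's bound 1 − 1/k² for k standard deviations. The sketch below evaluates that bound and, as an illustration, counts the share of a small data set lying within k sample standard deviations of its mean. The function names and the choice of data are assumptions made for the example.

```python
from statistics import mean, stdev


def chebyshev_lower_bound(k):
    """Minimum share of data within k standard deviations (for k > 1)."""
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Observed share of values within k sample standard deviations of the mean."""
    centre, spread = mean(values), stdev(values)
    inside = [x for x in values if abs(x - centre) <= k * spread]
    return len(inside) / len(values)


data = [4, 8, 2, 9, 7]
for k in (2, 3):
    print(k, chebyshev_lower_bound(k), fraction_within(data, k))
```

The bound is only a guaranteed minimum, so the observed share is often well above it; bell-shaped data follow the tighter 68-95-99.7 pattern.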


Outliers

An outlier is an observation that is unusually large or small relative to the other values in a data set. Outliers are typically attributable to one of the following causes:

1. The observation is observed, recorded, or entered incorrectly.

2. The observation comes from a different population.

3. The observation is correct but represents a rare event.


The 1.5×IQR Criterion for outliers

• Call an observation a suspected outlier if it falls more than 1.5×IQR above the 3rd quartile or below the 1st quartile.

• Example

Consider the data given in example 1.13 on page 43 in IPS (mileage data with an extra observation of 66).

Variable  N   Mean   Min  Q1  Median  Q3  Max
Mileages  21  24.67  13   18  23      28  66

The IQR = 28 - 18 = 10, so 1.5×IQR = 15. The largest observation, 66, is greater than Q3 + 1.5×IQR = 28 + 15 = 43 and is therefore a suspected outlier.
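The criterion can be sketched in Python as below. The quartiles are passed in directly (Q1 = 18 and Q3 = 28, taken from the summary table) because different software computes quartiles slightly differently. The function name `suspected_outliers` is illustrative.

```python
def suspected_outliers(values, q1, q3):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


mileages = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23,
            23, 24, 25, 25, 26, 28, 28, 28, 29, 32, 66]
flagged = suspected_outliers(mileages, q1=18, q3=28)
print(flagged)
```

With these quartiles the fences are 3 and 43, so only the extra observation of 66 is flagged.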


Choosing a summary

• The five-number summary is usually better than the mean and the standard deviation for describing skewed distributions or distributions with strong outliers.

• Use mean and standard deviation for reasonably symmetric distributions that are free of outliers.


Questions

1. How do the mean, median, and mode compare, usually, when a distribution is positively skewed? negatively skewed? Draw a picture and try to estimate the locations of these measures.

2. Which type of display is the most useful type for clear direct comparisons of the key characteristics of several data sets (e.g. blood cholesterol changes for several different treatments) ?

3. In a frequency table of 300 scores, the mean is reported as 80 and the median as 65. One would expect this distribution to be

a. positively skewed. b. negatively skewed. c. symmetrical d. rectangular.


4. Find the median of the following frequency distribution.

Score  Frequency
1      2
2      9
3      5
4      3
5      1

5. On the sta220 term test, John scored at the 78th percentile, and Jack scored at the 63rd. State whether the following statements are true or false.

a. John is 15 times better than Jack.

b. John scored 15 more points than Jack.

c. 15% of those taking the test got scores ranging between John's and Jack's scores.

d. 62 students scored less than John.


6. Estimate the mean and standard deviation of the distribution represented by the following histogram.

[Figure: histogram; horizontal axis Rate, ticks at 1, 3, 5, 7, 9, 11, 13, 15; vertical axis Frequency, ticks at 0, 5, 10]

