Date post: | 17-Dec-2015 |
Upload: | jonah-reynold-howard |
Numerically Summarizing Data
Learning Objectives
1. Understand the difference between a parameter and a statistic
2. Describe and compute measures of central tendency
3. Describe and compute measures of dispersion
4. Compute measures of location
5. Learn to read box plots and check for outliers
A parameter is a descriptive measure of a population.
In most real-world cases, the population parameter is not known. An example is the average gas price across the whole nation.
A statistic is a descriptive measure of a sample. We use a statistic to estimate the corresponding parameter. For example, the average gas price of the nation is not known. However, we can take a random sample of 100 stations, compute the sample average gas price, and use the sample average to estimate the unknown population average.
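The idea of estimating a parameter with a statistic can be sketched in a few lines of Python. The population here is synthetic: the 5,000 stations, the $3.00 center, and the $0.30 spread are made-up values chosen only for illustration, not real gas price data.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: prices at 5,000 stations (synthetic values).
population = [random.gauss(3.00, 0.30) for _ in range(5000)]

# Parameter: computed from every individual in the population.
population_mean = statistics.mean(population)

# Statistic: computed from a random sample of 100 stations.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)

print(f"population mean (parameter): {population_mean:.3f}")
print(f"sample mean (statistic):     {sample_mean:.3f}")
```

In practice only the sample line is available; the population mean is printed here just to show how close the estimate tends to be. Changing the seed gives a different sample and a slightly different sample mean, while the population mean stays the same for a given population.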
Measures of Central Tendency
(Mean, Median and Mode)
The population mean, $\mu = \frac{\sum x_i}{N}$, is computed using all the individuals in a population; the total number of individuals is N.
The population mean is a parameter.
The sample mean, $\bar{x} = \frac{\sum x_i}{n}$, is computed using sample data of n observations.
The sample mean is a statistic that is an unbiased estimator of the population mean.
NOTE: In real-world applications, the population mean $\mu$ is usually not known and is estimated by the sample mean $\bar{x}$.
Median
The median of a variable is the value that lies in the middle of the data when arranged in ascending order. That is, half the data are below the median and half the data are above the median. We use m to represent the median.
Steps in Computing the Median of a Data Set
1. Arrange the data in ascending order.
2. Determine the number of observations (n).
3. Determine the observation in the middle of the data set. The position is (n+1)/2.
(a) If (n+1)/2 is an integer, locate the data value at the (n+1)/2 position. This is the median. (NOTE: in this situation, the number of data values, n, is odd.)
(b) If (n+1)/2 is NOT an integer, the median is the average of the two data values on either side of the (n+1)/2 position. (NOTE: in this situation, n is even.)
EXAMPLE Computing the Median of Data
Find the mean and median of the following pulse rates from a sample of 8 individuals (NOTE: n = 8 in this case):
80, 76, 65, 68, 72, 73, 65, 80
Arrange them in ascending order: 65, 65, 68, 72, 73, 76, 80, 80
Find the position: (n+1)/2 = (8+1)/2 = 4.5
The position is not an integer: Median = (72+73)/2 = 72.5
Now add one more pulse rate of 100 and find the median of the new data (NOTE: n = 9 in this case):
80, 76, 65, 68, 72, 73, 65, 80, 100
Ascending order: 65, 65, 68, 72, 73, 76, 80, 80, 100
Position: (9+1)/2 = 5. The median is 73 (the value in the 5th position).
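The position rule can be written out directly as a short Python sketch. The function name `median_by_position` is made up for this illustration, and the sketch assumes a non-empty list of numbers.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 position rule."""
    ordered = sorted(values)            # step 1: ascending order
    n = len(ordered)                    # step 2: number of observations
    position = (n + 1) / 2              # step 3: 1-based middle position
    if position.is_integer():           # (a) n is odd
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]  # (b) n is even: value just below
    upper = ordered[int(position)]      #     and just above the position
    return (lower + upper) / 2


pulse_8 = [80, 76, 65, 68, 72, 73, 65, 80]
pulse_9 = pulse_8 + [100]

mean_8 = sum(pulse_8) / len(pulse_8)
median_8 = median_by_position(pulse_8)
median_9 = median_by_position(pulse_9)
print(mean_8, median_8, median_9)
```

Python's built-in `statistics.median` gives the same medians; the hand-written version is shown only because it mirrors the steps one by one.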
The mode of a variable is the most frequent observation of the variable that occurs in the data set.
If two values tie for the highest frequency, we say the data set is bimodal.
Exercise: Find the mode of the following pulse rate data
80, 76, 65, 68, 72, 73, 65, 80, 100, 80, 74, 65, 66, 70, 74, 65, 80, 98
The modes are 65 and 80.
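One way to check a mode by machine is to count how often each value occurs and keep every value that reaches the highest count. This is a minimal sketch using Python's standard `collections.Counter`.

```python
from collections import Counter

pulse = [80, 76, 65, 68, 72, 73, 65, 80, 100, 80, 74, 65, 66, 70, 74, 65, 80, 98]

counts = Counter(pulse)                  # value -> frequency
top = max(counts.values())               # highest frequency in the data
modes = sorted(v for v, c in counts.items() if c == top)
print(modes, top)
```

Keeping every value tied at the top frequency, rather than only the first one found, is what lets the sketch report a bimodal data set correctly.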
Comparing Mean and Median: How does an extreme observation affect the mean and median?
[similar exam questions]
Example: The following are the quiz scores of 10 students in class A:
5, 5, 5, 5, 5, 7, 7, 7, 7, 7
Find mean = ______, find median = ______
The following are the quiz scores of 10 students in class B:
5, 5, 5, 5, 5, 7, 7, 7, 7, 30
Find mean = ______, find median = ______
Fact: The mean is sensitive to extreme data values. The median is robust to extreme data values.
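A quick way to check your answers is to compute both summaries for the two classes. This sketch uses Python's standard `statistics` module; the variable names are chosen for this illustration.

```python
import statistics

class_a = [5, 5, 5, 5, 5, 7, 7, 7, 7, 7]
class_b = [5, 5, 5, 5, 5, 7, 7, 7, 7, 30]  # same scores, but the last 7 becomes 30

for name, scores in (("A", class_a), ("B", class_b)):
    print(name, statistics.mean(scores), statistics.median(scores))
```

Only one score differs between the two classes, so any change in the printed mean comes entirely from the single extreme value.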
How do unusual cases affect the average, the median, and the shape of the histogram?
Compare the histograms with and without the ‘outlier’ case of 5000 miles.
[Figure: Histogram of Miles, with the case of 5000 miles (x-axis: Miles; y-axis: Frequency)]
[Figure: Histogram of Miles, without the case of 5000 miles (x-axis: Distance from Home; y-axis: Frequency)]
Descriptive Statistics: Miles for 148 cases (with the case of 5000 miles)
Variable    N   Mean    SE Mean  TrMean  StDev  Min  Q1  Median  Q3   Max
Miles      148  151.5   33.6     111.8   409.4  1.0  75  120     150  5000
Shape is _________
Descriptive Statistics: Miles for 147 cases (without the case of 5000 miles)
Variable    N   Mean    SE Mean  TrMean  StDev  Min  Q1  Median  Q3   Max
Miles      147  ______  6.71     111.02  81.37  1.0  75  ______  150  600
Shape is _________
When data sets have unusually large or small values relative to the entire set of data or when the distribution of the data is skewed, the median is the preferred measure of central tendency over the arithmetic mean because it is more representative of the typical observation.
Descriptive Statistics: Miles for 147 cases (without the case of 5000 miles)
Variable    N   Mean    Min  Median  Max
Miles      147  118.52  1    120     600
NOTE: The median remains unchanged. Why? Because the median uses only the middle data point (or the middle two), while the average uses every data value. So one very large, unusual value makes the average larger, but not the median.
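The drop in the mean can be checked from the summary numbers alone, without the raw data. The sketch below rebuilds the total from the reported mean of 151.5, which is rounded to one decimal, so the result is approximate.

```python
n_with = 148          # cases including the 5000-mile one
mean_with = 151.5     # reported mean, rounded to one decimal
outlier = 5000

total_with = n_with * mean_with                      # approximate sum of all 148 distances
mean_without = (total_with - outlier) / (n_with - 1)  # mean of the remaining 147
print(round(mean_without, 2))
```

The same trick does not work for the median, because the median depends on the order of the data and not on their sum.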
Comparison of Mean, Median, and Mode for different shapes of distributions
[Similar exam question]
Shape          Order on the axis (left to right)   Relationship
Left-Skewed    Mean, Median, Mode                  Mean < Median
Symmetric      Mean = Median = Mode                Mean ~ Median
Right-Skewed   Mode, Median, Mean                  Mean > Median
Exercise
NOTE: In real-world applications, the distribution of sample data is never perfectly symmetric. The shape can only be approximately symmetric.
IF THE MEAN IS CLOSE TO THE MEDIAN (NOT NECESSARILY EXACTLY MEAN = MEDIAN), WE WOULD SAY THE DISTRIBUTION IS APPROXIMATELY SYMMETRIC.
Exercise: A sample of 50 gas prices is recorded and summarized. The average price is $3.15 and the median price is $3.13. Is the shape of the price distribution more likely to be skewed-to-left, approximately symmetric, or skewed-to-right?
ANS:
Measures of dispersion measure the degree to which the data values spread. The more the data values spread out, the larger the variation of the data values.
Example: Scores of 5 students in class A: 60, 60, 70, 80, 80
Scores of 5 students in class B: 40, 60, 70, 80, 100
Scores of 5 students in class C: 70, 70, 70, 70, 70
Q: Scores in Class ____ have the largest variation.
Scores in Class ____ have zero variation.
Measures of Dispersion
Four different measures of dispersion: Range, Variance, Standard Deviation, and Interquartile Range (IQR)
Visualizing Variability using Histogram
[Figure: three histograms labeled A, B, and C]
Which one shows the largest variation:
Which one shows the smallest variation:
How to measure the variation?
• Range = R = Largest Data Value – Smallest Data Value
• The sample variance is: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
• The sample standard deviation is: $s = \sqrt{s^2}$
• The population variance is symbolically represented by lowercase Greek sigma squared: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
• The population standard deviation is: $\sigma = \sqrt{\sigma^2}$
NOTE: the divisor (n-1) is called the degrees of freedom.
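To see these measures at work, the sketch below computes the range, the sample variance (with the n-1 divisor), and the sample standard deviation for the three small classes of scores used earlier in this section. The helper name `sample_variance` is chosen for this illustration.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

classes = {
    "A": [60, 60, 70, 80, 80],
    "B": [40, 60, 70, 80, 100],
    "C": [70, 70, 70, 70, 70],
}

summary = {}
for name, scores in classes.items():
    data_range = max(scores) - min(scores)
    s2 = sample_variance(scores)
    summary[name] = (data_range, s2, math.sqrt(s2))  # (range, s^2, s)
    print(name, summary[name])
```

All three classes have the same mean of 70, so the differences in the printed values reflect spread only.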
NOTE: As mentioned before, in real-world problems the population mean, population variance, and population standard deviation are NOT KNOWN.
Like the sample mean, the sample variance and sample standard deviation are obtained from sample data. They are used to estimate the unknown population variance and population standard deviation. This is a major part of inferential statistics, which will be dealt with later.
In this chapter, we are learning how to compute and interpret these sample descriptive summaries to understand the sample data.
Notation:
$s^2$: sample variance. $s$: sample standard deviation.
$\sigma^2$: population variance. $\sigma$: population standard deviation.
NOTE: If the original measurement unit is (ft), the variance $s^2$ has measurement unit (ft)^2: if x has unit (ft), then $(x - \bar{x})^2$ has the unit (ft)(ft), which is (ft)^2.
The measurement unit of $s^2$ is (ft)^2. The measurement unit of $s$ is (ft).
Some important Tips
NOTE: Sample statistics, such as the sample mean $\bar{x}$, the sample median, $s$, and $s^2$, will be different for different samples.
Population parameters, such as the population mean $\mu$, the population variance $\sigma^2$, and the population s.d. $\sigma$, are fixed constants for a given population. They do not change for different samples.
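Exercise
Comparing Variation: Quiz Scores of 40 students
[similar exam questions]
[Figure: three histograms of quiz scores. Class A: axis ticks 0, 5, 10, with two bars of frequency 20 each. Class B: axis ticks 0, 4, 5, 6, 10, with two bars of frequency 20 each. Class C: axis ticks 0, 5, 10, with bars of frequency 1, 3, 2, 4, 9, 10, 5, 3, 1, 2.]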
Variation:
Which one has smallest s.d.?
Which has largest s.d.?
Answer
Class B has smallest standard deviation
Class A has largest standard deviation
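The answer can be checked numerically, but only under assumptions, because the original histograms are not available here. The sketch assumes Class A has 20 scores at 0 and 20 at 10, Class B has 20 scores at 4 and 20 at 6, and Class C's ten bars sit at consecutive scores 1 through 10. These positions are guesses from the axis labels, not values stated in the exercise.

```python
import statistics

# ASSUMPTION: bar positions are guessed from the axis labels, not given in the text.
class_a = [0] * 20 + [10] * 20
class_b = [4] * 20 + [6] * 20
class_c_freq = [1, 3, 2, 4, 9, 10, 5, 3, 1, 2]
class_c = [score
           for score, freq in enumerate(class_c_freq, start=1)
           for _ in range(freq)]

sd = {name: statistics.stdev(data)
      for name, data in (("A", class_a), ("B", class_b), ("C", class_c))}
print(sd)
```

Shifting all of Class C's bars left or right by the same amount would not change its standard deviation, so the ordering does not depend on exactly where the ten consecutive bars start.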
Points to remember about variance and standard deviation and the relationship with histogram:
- The values of $s$ and $s^2$ are always greater than or equal to zero.
- The larger the value of $s^2$ or $s$, the greater the variability of the data set.
- If $s^2$ or $s$ is equal to zero, all measurements must have the same value.
- The standard deviation s is computed in order to have a measure of variability having the same unit as the observations.
- The larger the s.d., the more spread the data, the flatter the histogram.
- The smaller the s.d., the more clustered the data around the mean, the taller the peak of the histogram.
Exercise (Similar Exam questions)
1. The gas price is a concern for people. A random sample of 40 stations gives the following data summary:
Sample mean = $2.15   Median = $2.12   s = $0.15
Q: Is the distribution of the gas prices more likely to be (a) symmetric, (b) skewed-to-right, or (c) skewed-to-left? And WHY?
2. The following two data sets are prices of milk from 6 stores, one from January 2004 and one from a year later.
Store:                  A     B     C     D     E     F
Price in January 2004   1.85  1.95  1.85  2.00  1.78  1.97
Price in January 2005   2.05  2.15  2.05  2.20  1.98  2.17
True or False for each of the following statements:
(a) The average price remains the same between the two years.
(b) The price range remains the same between the two years.
(c) The median remains the same between the two years.
(d) The standard deviation (s) remains the same between the two years.
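For the milk price question, one way to check each statement is to compute the four summaries for both years and compare them. This is a minimal sketch; the helper name `describe` is made up, and `math.isclose` is used because decimal prices are not stored exactly as floating-point numbers.

```python
import math
import statistics

price_2004 = [1.85, 1.95, 1.85, 2.00, 1.78, 1.97]
price_2005 = [2.05, 2.15, 2.05, 2.20, 1.98, 2.17]

def describe(prices):
    """Mean, median, range, and sample standard deviation of a list."""
    return {
        "mean": statistics.mean(prices),
        "median": statistics.median(prices),
        "range": max(prices) - min(prices),
        "s": statistics.stdev(prices),
    }

d04, d05 = describe(price_2004), describe(price_2005)
for key in d04:
    same = math.isclose(d04[key], d05[key], abs_tol=1e-9)
    print(f"{key:6s} 2004={d04[key]:.3f} 2005={d05[key]:.3f} same={same}")
```

Every store's price rose by the same $0.20, which is why measures of center shift while measures of spread do not.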
Descriptive Summary for the 56 distances
Descriptive Statistics: distance
Variable   N   Mean   Median  TrMean  StDev  SE Mean
distance   56  142.0  140.0   128.3   112.2  15.0
Variable   Minimum  Maximum  Q1    Q3
distance   5.0      800.0    92.5  160.0
Mean: the sample mean, $\bar{x}$. Median: m.
TrMean: the mean after excluding the lowest 5% and the highest 5% of the data, called the Trimmed Mean.
StDev: s, the sample standard deviation; $s^2 = (112.2)^2$.
Minimum: the smallest value. Maximum: the largest value.
Q1: 25% of the distances are lower than Q1, the first quartile, or 25th percentile.
Q3: 75% of the distances are lower than Q3, the third quartile, or 75th percentile.
If we add a new maximum, 6000, to the data, so that we have 57 cases, what is the effect of 6000 on the following summary statistics: Increase? Decrease? The same?
(a) the average distance:
(b) the median distance:
(c) the standard deviation:
(d) the range:
Answer
After adding 6000 miles to the data:
•Average distance is increased.
•Median distance for this example is the same. (In general, it will be almost the same.)
•Standard deviation is increased.
•Range is increased.
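Two of these effects can be quantified from the summary table alone. The sketch below recomputes the mean and the range after adding 6000, using the reported (rounded) values for the 56 distances; the median and standard deviation would need the raw data, which is not given here.

```python
n, mean, minimum, maximum = 56, 142.0, 5.0, 800.0   # from the summary for 56 distances
new_value = 6000.0

new_mean = (n * mean + new_value) / (n + 1)          # mean of the 57 cases
old_range = maximum - minimum
new_range = max(maximum, new_value) - min(minimum, new_value)
print(round(new_mean, 1), old_range, new_range)
```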
Empirical Rule and Applications
What is the meaning of variation and how is it used in solving real-world problems?
For symmetric, mound-shaped (bell-shaped) data, approximately
68% of the data are within 1 s of the mean ($\bar{x} \pm 1s$)
95% of the data are within 2 s of the mean ($\bar{x} \pm 2s$)
99.7% (nearly 100%) of the data are within 3 s of the mean ($\bar{x} \pm 3s$)
An important application of the Empirical Rule is identifying rare (unusual, extreme) observations. Only about 5% of observations fall more than two s.d. from the mean, so an observation outside the two-s.d. range is considered a rare (or unusual) case.
[Figure: bell curve divided at the mean and at 1 and 2 s.d. on each side; each half contains 34% within 1 s.d., 13.5% between 1 and 2 s.d., and 2.5% beyond 2 s.d.]
NOTE: If you add the percentages on each side of the center line, they add to 50%. A mound-shaped distribution is symmetric about the mean.
Applying the Empirical Rule to Identify Rare Events
A simple and powerful tool for identifying outliers, extremes, or unusual or rare events. We will use this rule very often throughout the entire semester.
Consider the 2010 ACT test: the average was 21 and the standard deviation was 4. The distribution of the ACT scores is mound-shaped.
Q1: A student received a score of 25. Is this an unusually high score?
Q2: If CMU admits students with a minimum ACT score of one standard deviation below the mean, what is the minimum ACT score for CMU admission?
Q3: A student received an ACT score of 30. Is this an unusually high score?
(Similar questions in the test)
ANSWER:
Q1: 25 = 21 + 4, which is one s.d. above the mean. It is within two s.d. of the mean, so it is NOT an unusually high score.
Q2: The score at one s.d. below the mean = 21 – 4 = 17.
Q3: The score 30 > 21 + 2(4) = 29, so 30 is more than two s.d. above the mean. Only about 2.5% of scores are higher than 29. Hence, 30 is an unusually high score.
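The same reasoning can be expressed as a small Python sketch that measures how many standard deviations a score lies from the mean and flags it when it is more than two away. The function names `sd_from_mean` and `is_unusual` are made up for this illustration.

```python
mean, sd = 21, 4   # ACT summary from the example

def sd_from_mean(score):
    """How many standard deviations a score lies from the mean."""
    return (score - mean) / sd

def is_unusual(score, cutoff=2):
    """Empirical-rule check: more than `cutoff` s.d. away from the mean."""
    return abs(sd_from_mean(score)) > cutoff

for score in (25, 30):
    print(score, sd_from_mean(score), is_unusual(score))

minimum_admission = mean - 1 * sd   # one s.d. below the mean
print(minimum_admission)
```

The check uses the absolute value, so it would flag unusually low scores as well as unusually high ones.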
Exercise: Estimating the average and standard deviation, and applying the Empirical Rule when the distribution is mound-shaped
We collect a sample of weekly spending from 40 students. Suppose the spending has a mound-shaped distribution. We only know the min = $20 and the max = $80. As you can see, weekly spending varies from student to student.
(a) Give a good estimate of the average and the standard deviation of the weekly spending based on the 40 students' data.
(b) Approximately what percentage of students would spend $35 or more per week?
ANS (a): Since the distribution is mound-shaped, we can use (20+80)/2 = $50 to estimate the average spending. For the s.d., we use the rule of thumb s = range/4 (about 95% of mound-shaped data lie within two s.d. of the mean, so the range spans roughly four s.d.), which gives (80-20)/4 = $15.
ANS (b): Using the estimated average and s, $35 is one s.d. below the mean. Hence, the percentage spending $35 or more = 34% + 50% = 84%. Approximately 84% of students spend $35 or more per week.
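The estimate can be traced step by step in code. This is a rough sketch: the lookup table `pct_above` is built from the Empirical Rule bands (34%, 13.5%, 2.5% on each side) and only covers points that fall a whole number of s.d. from the mean.

```python
minimum, maximum = 20, 80

est_mean = (minimum + maximum) / 2     # midpoint of the range
est_sd = (maximum - minimum) / 4       # range / 4 rule of thumb

# Percent of data ABOVE the point k s.d. from the mean (Empirical Rule bands).
pct_above = {-2: 97.5, -1: 84.0, 0: 50.0, 1: 16.0, 2: 2.5}

z = (35 - est_mean) / est_sd           # s.d. units from the estimated mean
pct_35_or_more = pct_above[int(z)] if z.is_integer() else None
print(est_mean, est_sd, z, pct_35_or_more)
```

Both the mean and the s.d. here are rough estimates from just two numbers, so the final percentage should be read as approximate.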
Five Number Summary; Box plots
The Five-Number Summary
MINIMUM Q1 Median Q3 MAXIMUM
IQR (Inter-quartile Range) = Q3 – Q1
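As an illustration, the five-number summary and IQR can be computed for the 8 pulse rates from the earlier median example. The slides do not say which quartile rule to use, so the sketch assumes the (n+1)p position rule, which Python's `statistics.quantiles` calls `method="exclusive"`; other rules and software packages can give slightly different Q1 and Q3.

```python
import statistics

pulse = [80, 76, 65, 68, 72, 73, 65, 80]

# ASSUMPTION: quartiles located by the (n + 1) * p position rule.
q1, median, q3 = statistics.quantiles(pulse, n=4, method="exclusive")
five_number = (min(pulse), q1, median, q3, max(pulse))
iqr = q3 - q1
print(five_number, iqr)
```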
Steps for Drawing a Box plot
Step 1: Determine the lower and upper fence:
Lower fence = Q1 – 1.5(IQR)
Upper fence = Q3 + 1.5(IQR)
Step 2: Draw vertical lines at Q1, M and Q3. Enclose these vertical lines in a box.
Step 3: Label the lower and upper fences.
Step 4: Draw a line from Q1 to the smallest data value that is larger than the lower fence. Draw a line from Q3 to the largest data value that is smaller than the upper fence.
Step 5: Any data values less than the lower fence or greater than the upper fence are outliers and are marked with an asterisk (*).
EXAMPLE Drawing a Boxplot
Min  Q1  M   Q3  Max  IQR
28   38  48  56  73   Q3 - Q1 = 56 - 38 = 18
Draw a boxplot for the serum HDL.
Compute the lower and upper fence and draw a boxplot.
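The fence calculation for the HDL summary can be sketched as follows. Only the five-number summary is available, so the code can tell whether any outliers exist at all (by comparing the minimum and maximum to the fences) but cannot list individual outliers.

```python
q1, q3 = 38, 56
minimum, maximum = 28, 73

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# If even the extremes sit inside the fences, no value can be an outlier.
has_low_outlier = minimum < lower_fence
has_high_outlier = maximum > upper_fence
print(lower_fence, upper_fence, has_low_outlier, has_high_outlier)
```

When neither flag is set, the whiskers in Step 4 simply run out to the minimum and the maximum.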
[Figure: Boxplot of HDL (axis: HDL, 30 to 70), with Q1, the median, the mean, and Q3 labeled]
Relationship between Distribution Shape and Boxplot (Similar questions in the test)
1. If the median is near the center of the box and the two horizontal lines are approximately equal in length, then the distribution is roughly symmetric.
2. If the median is left of the center of the box and/or the right line is substantially longer than the left line, the distribution is right skewed.
3. If the median is right of the center of the box and/or the left line is substantially longer than the right line, the distribution is left skewed.
[Figure: Symmetric]
[Figure: Skewed Right]
[Figure: Skewed Left]
Distance data: 100 distances
[Figure: Histogram of Miles (x-axis: Miles; y-axis: Frequency)]
[Figure: Boxplot of Miles]
[Figure: Histogram of Miles, panel variable: Gender (panels: female, male)]
[Figure: Boxplot of Miles, panel variable: Gender (panels: female, male)]
EXAMPLE Comparing Two Data Sets Using Boxplots
The following boxplots represent the birth rate for women 15 - 44 years of age in 1990 and 1997 for each state.
What conclusion can you make?