Date post: | 19-Dec-2015 |
Descriptive Statistics: Numerical Measures
Measures of Shape of a Distribution, Relative Location and Outliers
Measures of Association between Two Variables
Weighted Mean and Grouped Data
Shape of a Distribution, Relative Location, and Outliers
Shape of a Distribution
z-Scores (Standardized Values)
Chebyshev’s Theorem
Empirical Rule
Detecting Outliers
Distribution Shape: Skewness
An important measure of the shape of a distribution is called skewness.
The formula for computing the skewness of a data set is somewhat complex.
$$S = \frac{E\left[(X - \mu)^3\right]}{\sigma^3}$$
Skewness (S)
Skewness is a measure of the asymmetry of a probability distribution.
• S = 0: the distribution is symmetric.
• S > 0: the distribution is right (positively) skewed.
• S < 0: the distribution is left (negatively) skewed.
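As a rough illustration of the moment-based definition, the sketch below computes S for a small data set by treating the data as the whole population (dividing by N). The function name `skewness` and the data values are made up for illustration; statistical packages often apply a small-sample correction and may report slightly different numbers.

```python
import math

def skewness(data):
    """Moment-based skewness E[(X - mu)^3] / sigma^3, dividing by N."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    third_moment = sum((x - mu) ** 3 for x in data) / n
    return third_moment / sigma ** 3

symmetric = [1, 2, 3, 4, 5]        # hypothetical values, symmetric about 3
right_tail = [1, 1, 2, 2, 3, 10]   # hypothetical values with a long right tail

print(skewness(symmetric))   # zero for symmetric data
print(skewness(right_tail))  # positive for right-skewed data
```

Mirroring the second data set (negating every value) flips the sign of S, which matches the left-skewed case.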
Distribution Shape: Skewness
Symmetric (not skewed)
• Skewness is zero.
• Mean and median are equal.
[Figure: relative frequency histogram of a symmetric distribution; vertical axis "Relative Frequency" from 0 to .35; skewness = 0]
Distribution Shape: Skewness
Moderately Skewed Left
• Skewness is negative.
• Mean will usually be less than the median.
[Figure: relative frequency histogram of a moderately left-skewed distribution; vertical axis "Relative Frequency" from 0 to .35; skewness = -.31]
Distribution Shape: Skewness
Highly Skewed Right
• Skewness is positive (often above 1.0).
• Mean will usually be more than the median.
[Figure: relative frequency histogram of a highly right-skewed distribution; vertical axis "Relative Frequency" from 0 to .35; skewness = 1.25]
z-Score (Standardized Value)
The z-score is a measure of the relative location of an observation in a data set.
$$z_i = \frac{x_i - \bar{x}}{s}$$
The z-score denotes the number of standard deviations a given data value x_i lies from the mean.
For this reason, the z-score is also called a standardized value.
A data value less than the sample mean will always have a z-score less than zero.
A data value greater than the sample mean will always have a z-score greater than zero.
A data value equal to the sample mean will always have a z-score of zero.
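The three sign rules can be checked on a small example. This is a minimal sketch using Python's standard `statistics` module; the helper name `z_scores` and the five sample values are illustrative assumptions, not data from the slides.

```python
import statistics

def z_scores(data):
    """Return z_i = (x_i - mean) / s for each value, with s the sample standard deviation."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    return [(x - mean) / s for x in data]

values = [46, 54, 42, 46, 32]  # hypothetical sample: mean 44, s = 8
for x, z in zip(values, z_scores(values)):
    print(x, round(z, 2))
```

Values below the mean of 44 (42 and 32) get negative z-scores, and values above it get positive ones.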
Chebyshev’s Theorem
Chebyshev’s theorem gives the minimum proportion of observations in any data set that lie within a given number of standard deviations of the mean, that is, within a given range of z-scores.
The theorem states that, for any data set:
• At least 75% of the data values must be within z = 2 standard deviations of the mean.
• At least 89% of the data values must be within z = 3 standard deviations of the mean.
• At least 94% of the data values must be within z = 4 standard deviations of the mean.
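These percentages follow from the general form of the theorem: at least 1 - 1/z² of the data values lie within z standard deviations of the mean, for any z greater than 1. A minimal sketch (the function name `chebyshev_bound` is an illustrative choice):

```python
def chebyshev_bound(z):
    """Minimum proportion of values within z standard deviations of the mean (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem is informative only for z > 1")
    return 1 - 1 / z ** 2

for z in (2, 3, 4):
    print(z, f"{chebyshev_bound(z):.1%}")
```

The bounds for z = 3 and z = 4 are 8/9 and 15/16, which the slides round to 89% and 94%.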
Empirical Rule
For data with a bell-shaped distribution:
• 68.26% of the values of a normal random variable are within +/- 1 standard deviation of its mean.
• 95.44% of the values of a normal random variable are within +/- 2 standard deviations of its mean.
• 99.72% of the values of a normal random variable are within +/- 3 standard deviations of its mean.
[Figure: normal curve with the x-axis marked at μ - 3σ, μ - 2σ, μ - 1σ, μ, μ + 1σ, μ + 2σ, μ + 3σ, showing the 68.26%, 95.44%, and 99.72% regions]
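The proportions can be recomputed from the normal distribution itself. The sketch below uses the identity P(|Z| < k) = erf(k / √2) with `math.erf` from the standard library; the function name `normal_within` is an illustrative choice.

```python
import math

def normal_within(k):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, f"{normal_within(k):.2%}")
```

The computed values are about 68.27%, 95.45%, and 99.73%. They differ from the slide figures in the last digit, presumably because the slide figures were read from a normal table rounded to four decimals.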
Z-Scores allow us to Detect Outliers
An outlier is an unusually small or unusually large value in a data set.
A data value with a z-score less than -3 or greater than +3 might be considered an outlier.
It might be the result of:
• an incorrect recording
• an incorrectly included value in the data set
• a correctly recorded data value but an unusual occurrence
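The z-score rule of thumb can be sketched as a simple filter. The helper name `flag_outliers`, the default threshold of 3, and the readings are assumptions for illustration; a flagged value still needs to be reviewed against the three explanations listed above before it is removed.

```python
import statistics

def flag_outliers(data, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation
    return [x for x in data if abs((x - mean) / s) > threshold]

# Hypothetical readings: thirty values near 50 plus one suspicious value.
readings = [50] * 10 + [52] * 10 + [48] * 10 + [120]
print(flag_outliers(readings))
```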
Exploratory Data Analysis
Five-Number Summary and Box Plot
Five-Number Summary
1. Smallest Value
2. First Quartile
3. Median
4. Third Quartile
5. Largest Value
Box Plot
A box is drawn with its ends located at the first and third quartiles.
A vertical line is drawn in the box at the location of the median (second quartile).
[Figure: box plot on an axis running from 375 to 625, with Q1 = 445, Q2 = 475, Q3 = 525]
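The quartiles shown correspond to the 70 monthly rents listed in the grouped-data part of these slides. The sketch below computes the five-number summary with `statistics.quantiles`; the function name `five_number_summary` is an illustrative choice. Quartile conventions differ between textbooks and libraries, so other data sets may give slightly different quartiles depending on the method.

```python
import statistics

# Monthly rents ($) for the 70 efficiency apartments given in the slides.
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

def five_number_summary(data):
    """Return (smallest, Q1, median, Q3, largest)."""
    q1, q2, q3 = statistics.quantiles(data, n=4)  # default "exclusive" method
    return min(data), q1, q2, q3, max(data)

print(five_number_summary(rents))
```

For this data the result is 425, 445, 475, 525, 615, which agrees with the quartiles on the box plot.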
Measures of Association between Two Variables
Covariance
Correlation Coefficient
Covariance
Covariance between two random variables (X and Y) is computed as follows:
For samples:
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
For populations:
$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$
The covariance is a measure of the direction of movement and linear association between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.
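A minimal sketch of the sample and population covariance follows. The function names and the five (x, y) pairs are made up for illustration.

```python
def sample_covariance(xs, ys):
    """s_xy: sum of (x_i - x_bar)(y_i - y_bar) divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def population_covariance(xs, ys):
    """sigma_xy: the same sum of cross-products divided by N."""
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    return sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n

x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]  # hypothetical data

print(sample_covariance(x, y))      # 6 / 4 = 1.5, a positive relationship
print(population_covariance(x, y))  # 6 / 5 = 1.2
```

The size of the covariance depends on the units of x and y, so its sign is easier to interpret than its magnitude.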
Correlation Coefficient
Correlation is a measure of the degree of linear association between two variables.
However, it does not indicate causation. That is, just because two variables are highly correlated, it does not mean that one variable is the cause of the other.
The correlation coefficient is computed as follows:
For samples:
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
For populations:
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
The coefficient can take on values between -1 and +1.
Values near +1 indicate a strong positive linear relationship.
Values near -1 indicate a strong negative linear relationship.
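The sample formula can be sketched directly from its parts: the sample covariance divided by the product of the two sample standard deviations. The function name and the data are illustrative assumptions.

```python
import math

def sample_correlation(xs, ys):
    """r_xy = s_xy / (s_x * s_y), all computed with n - 1 denominators."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

x = [1, 2, 3, 4, 5]  # hypothetical data

print(round(sample_correlation(x, [2, 4, 5, 4, 5]), 4))  # moderately strong positive
print(sample_correlation(x, [2, 4, 6, 8, 10]))            # exact line, upward: near +1
print(sample_correlation(x, [10, 8, 6, 4, 2]))            # exact line, downward: near -1
```

Unlike the covariance, the result does not change if x or y is rescaled to different units.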
Weighted Mean
Mean, Variance, and Standard Deviation of Grouped Data
You are taking five courses. The following table lists the credit hours for each course and your grades. Compute your GPA for the semester.
Course       Credit Hours   Grade
Calculus     4              B
Psychology   3              A
Marketing    3              C
Economics    3              D
Stat         2              A
Weighted Mean
$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$
where: x_i = value of observation i
w_i = weight for observation i
Using the grade points A = 4, B = 3, C = 2, D = 1 as the values x_i and the credit hours as the weights w_i:
Course       Credit Hours (w_i)   Grade (x_i)   w_i x_i
Calculus     4                    B (3)         12
Psychology   3                    A (4)         12
Marketing    3                    C (2)          6
Economics    3                    D (1)          3
Stat         2                    A (4)          8
Total        15                                 41
GPA = 41/15 ≈ 2.73
Weighted Mean
When the mean is computed by giving each data value a weight that reflects its importance, it is referred to as a weighted mean.
When data values vary in importance, the analyst must choose the weight that best reflects the importance of each value.
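The GPA example can be worked through in a few lines. The grade points (A = 4, B = 3, C = 2, D = 1) and the credit hours come from the course table; the function and variable names are illustrative. Note that the credit hours sum to 4 + 3 + 3 + 3 + 2 = 15.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of w_i."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

grade_points = {"A": 4, "B": 3, "C": 2, "D": 1}
courses = [  # (course, credit hours, grade)
    ("Calculus", 4, "B"),
    ("Psychology", 3, "A"),
    ("Marketing", 3, "C"),
    ("Economics", 3, "D"),
    ("Stat", 2, "A"),
]

credits = [hours for _, hours, _ in courses]
points = [grade_points[grade] for _, _, grade in courses]

gpa = weighted_mean(points, credits)
print(round(gpa, 2))  # 41 / 15, about 2.73
```

An unweighted mean of the five grade points would be 14/5 = 2.8; the weighted mean is lower because the highest-credit course carries a B rather than an A.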
Working with Grouped Data
Given below is a sample of monthly rents for 70 efficiency apartments.
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Grouped Data
Rent ($)   Frequency
420-439     8
440-459    17
460-479    12
480-499     8
500-519     7
520-539     4
540-559     2
560-579     4
580-599     2
600-619     6
Computing the mean and variance of grouped data
To compute the weighted mean from grouped data, we treat the midpoint of each class as though it were the mean of all items in the class.
We compute a weighted mean of the data using class midpoints and class frequencies as weights.
Similarly, in computing the variance and standard deviation, the class frequencies are used as weights.
Mean for Grouped Data
Sample Data:
$$\bar{x} = \frac{\sum f_i M_i}{n}$$
Population Data:
$$\mu = \frac{\sum f_i M_i}{N}$$
where: f_i = frequency of class i
M_i = midpoint of class i
Sample Mean for Grouped Data
Rent ($)   f_i   M_i      f_i M_i
420-439     8   429.5     3436.0
440-459    17   449.5     7641.5
460-479    12   469.5     5634.0
480-499     8   489.5     3916.0
500-519     7   509.5     3566.5
520-539     4   529.5     2118.0
540-559     2   549.5     1099.0
560-579     4   569.5     2278.0
580-599     2   589.5     1179.0
600-619     6   609.5     3657.0
Total      70            34525.0
$$\bar{x} = \frac{\sum f_i M_i}{n} = \frac{34{,}525}{70} = 493.21$$
This approximation differs by $2.41 from the actual sample mean of $490.80.
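The grouped-mean calculation can be reproduced from the frequency distribution alone. In this sketch each class is stored as (lower limit, upper limit, frequency), taken from the rent table; the names `rent_classes` and `grouped_mean` are illustrative.

```python
# (lower limit, upper limit, frequency) for each rent class
rent_classes = [
    (420, 439, 8), (440, 459, 17), (460, 479, 12), (480, 499, 8),
    (500, 519, 7), (520, 539, 4), (540, 559, 2), (560, 579, 4),
    (580, 599, 2), (600, 619, 6),
]

def grouped_mean(classes):
    """Mean of grouped data: class midpoints weighted by class frequencies."""
    n = sum(f for _, _, f in classes)
    total = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total / n

print(round(grouped_mean(rent_classes), 2))  # 34525 / 70, about 493.21
```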
Variance for Grouped Data
For sample data:
$$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$$
For population data:
$$\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$$
Sample Variance for Grouped Data
Rent ($)   f_i   M_i     M_i - x̄   (M_i - x̄)^2   f_i (M_i - x̄)^2
420-439     8   429.5    -63.7      4058.96       32471.71
440-459    17   449.5    -43.7      1910.56       32479.59
460-479    12   469.5    -23.7       562.16        6745.97
480-499     8   489.5     -3.7        13.76         110.11
500-519     7   509.5     16.3       265.36        1857.55
520-539     4   529.5     36.3      1316.96        5267.86
540-559     2   549.5     56.3      3168.56        6337.13
560-579     4   569.5     76.3      5820.16       23280.66
580-599     2   589.5     96.3      9271.76       18543.53
600-619     6   609.5    116.3     13523.36       81140.18
Total      70                                    208234.29
where x̄ = 34,525/70 = 493.21
Sample Variance
s^2 = 208,234.29/(70 - 1) = 3,017.89
Sample Standard Deviation
$$s = \sqrt{3{,}017.89} = 54.94$$
This approximation differs by only $.20 from the actual standard deviation of $54.74.
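The variance and standard deviation for the grouped rent data can be recomputed the same way, with the class frequencies as weights on the squared deviations of the midpoints. The names `rent_classes` and `grouped_sample_variance` are illustrative; the class limits and frequencies are those of the rent table.

```python
import math

# (lower limit, upper limit, frequency) for each rent class
rent_classes = [
    (420, 439, 8), (440, 459, 17), (460, 479, 12), (480, 499, 8),
    (500, 519, 7), (520, 539, 4), (540, 559, 2), (560, 579, 4),
    (580, 599, 2), (600, 619, 6),
]

def grouped_sample_variance(classes):
    """s^2 = sum of f_i * (M_i - x_bar)^2 divided by n - 1, with M_i the class midpoints."""
    n = sum(f for _, _, f in classes)
    midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
    freqs = [f for _, _, f in classes]
    x_bar = sum(f * m for f, m in zip(freqs, midpoints)) / n
    squared_dev = sum(f * (m - x_bar) ** 2 for f, m in zip(freqs, midpoints))
    return squared_dev / (n - 1)

variance = grouped_sample_variance(rent_classes)
print(round(variance, 2))             # about 3017.89
print(round(math.sqrt(variance), 2))  # about 54.94
```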