Post on 04-Jun-2018
8/14/2019 Rsch STATISTICS.docx
Measures of Variability
Author(s): David M. Lane
Prerequisites
Percentiles, Distributions, Measures of Central Tendency
Learning Objectives
1. Determine the relative variability of two distributions
2. Compute the range
3. Compute the inter-quartile range
4. Compute the variance in the population
5. Estimate the variance from a sample
6. Compute the standard deviation from the variance
WHAT IS VARIABILITY?
Variability refers to how "spread out" a group of scores is.
To see what we mean by spread out, consider the graphs in
Figure 1. These graphs represent the scores on two quizzes.
The mean score for each quiz is 7.0. Despite the equality of
means, you can see that the distributions are quite
different. Specifically, the scores on Quiz 1 are more
densely packed and those on Quiz 2 are more spread out.
The differences among students were much greater on Quiz
2 than on Quiz 1.
Figure 1. Bar charts of two quizzes (Quiz 1 and Quiz 2).
The terms variability, spread, and dispersion are synonyms,
and refer to how spread out a distribution is. Just as in the
section on central tendency where we discussed measures
of the center of a distribution of scores, in this chapter we
will discuss measures of the variability of a distribution.
There are four frequently used measures of variability: the
range, interquartile range, variance, and standard
deviation. In the next few paragraphs, we will look at each
of these four measures of variability in more detail.
RANGE
The range is the simplest measure of variability to
calculate, and one you have probably encountered many
times in your life. The range is simply the highest score
minus the lowest score. Let's take a few examples. What is
the range of the following group of numbers: 10, 2, 5, 6, 7,
3, 4? Well, the highest number is 10, and the lowest
number is 2, so 10 - 2 = 8. The range is 8. Let's take
another example. Here's a dataset with 10 numbers: 99,
45, 23, 67, 45, 91, 82, 78, 62, 51. What is the range? The
highest number is 99 and the lowest number is 23, so 99 -
23 equals 76; the range is 76. Now consider the two
quizzes shown in Figure 1. On Quiz 1, the lowest score is 5
and the highest score is 9. Therefore, the range is 4. The
range on Quiz 2 was larger: the lowest score was 4 and the
highest score was 10. Therefore the range is 6.
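The two worked examples above can be checked with a few lines of Python. This is only a minimal sketch; the helper name `score_range` is an illustrative choice and does not come from the text.

```python
def score_range(scores):
    """Return the highest score minus the lowest score."""
    return max(scores) - min(scores)


first = [10, 2, 5, 6, 7, 3, 4]
second = [99, 45, 23, 67, 45, 91, 82, 78, 62, 51]

print(score_range(first))   # 8
print(score_range(second))  # 76
```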
INTERQUARTILE RANGE
The interquartile range (IQR) is the range of the middle
50% of the scores in a distribution. It is computed as
follows:
IQR = 75th percentile - 25th percentile
For Quiz 1, the 75th percentile is 8 and the 25th percentile
is 6. The interquartile range is therefore 2. For Quiz 2,
which has greater spread, the 75th percentile is 9, the 25th
percentile is 5, and the interquartile range is 4. Recall that
in the discussion of box plots, the 75th percentile was called
the upper hinge and the 25th percentile was called the
lower hinge. Using this terminology, the interquartile range
is referred to as the H-spread.
A related measure of variability is called the semi-interquartile
range. The semi-interquartile range is defined
simply as the interquartile range divided by 2. If a
distribution is symmetric, the median plus or minus the
semi-interquartile range contains half the scores in the
distribution.
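As a rough check on the Quiz 1 figures, the sketch below computes the quartiles with Python's standard `statistics.quantiles`. The scores are the Quiz 1 values listed in Table 1 of this document. Percentile definitions differ slightly between textbooks and software, so the interpolation rule used here is an assumption; for these particular scores it happens to give the same values as the text.

```python
import statistics

# Quiz 1 scores as listed in Table 1.
quiz1 = [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5]

# quantiles(..., n=4) returns the three cut points: 25th, 50th, 75th percentile.
q1, _median, q3 = statistics.quantiles(quiz1, n=4)

iqr = q3 - q1        # interquartile range (H-spread)
semi_iqr = iqr / 2   # semi-interquartile range

print(q1, q3, iqr, semi_iqr)  # 6.0 8.0 2.0 1.0
```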
VARIANCE
Variability can also be defined in terms of how close the
scores in the distribution are to the middle of the
distribution. Using the mean as the measure of the middle
of the distribution, the variance is defined as the average
squared difference of the scores from the mean. The data
from Quiz 1 are shown in Table 1. The mean score is 7.0.
Therefore, the column "Deviation from Mean" contains the
score minus 7. The column "Squared Deviation" is simply
the previous column squared.
Table 1. Calculation of Variance for Quiz 1 scores.
Scores Deviation from Mean Squared Deviation
9 2 4
9 2 4
9 2 4
8 1 1
8 1 1
8 1 1
8 1 1
7 0 0
7 0 0
7 0 0
7 0 0
7 0 0
6 -1 1
6 -1 1
6 -1 1
6 -1 1
6 -1 1
6 -1 1
5 -2 4
5 -2 4
Means
7 0 1.5
One thing that is important to notice is that the mean
deviation from the mean is 0. This will always be the case.
The mean of the squared deviations is 1.5. Therefore, the
variance is 1.5. Analogous calculations with Quiz 2 show
that its variance is 6.7. The formula for the variance is

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

where $\sigma^2$ is the variance, $\mu$ is the mean, and N is the
number of numbers. For Quiz 1, $\mu$ = 7 and N = 20.
If the variance in a sample is used to estimate the
variance in a population, then the previous formula
underestimates the variance and the following formula
should be used:

$$s^2 = \frac{\sum (X - M)^2}{N - 1}$$

where $s^2$ is the estimate of the variance and M is the
sample mean. Note that M is the mean of a sample taken
from a population with a mean of $\mu$. Since, in practice, the
variance is usually computed in a sample, this formula is
most often used. The simulation "estimating variance"
illustrates the bias in the formula with N in the
denominator.
Let's take a concrete example. Assume the scores 1, 2,
4, and 5 were sampled from a larger population. To
estimate the variance in the population you would compute
$s^2$ as follows:

$$M = \frac{1 + 2 + 4 + 5}{4} = \frac{12}{4} = 3$$

$$s^2 = \frac{(1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2}{4 - 1} = \frac{4 + 1 + 1 + 4}{3} = \frac{10}{3} = 3.333$$
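The difference between dividing by N and dividing by N - 1 can be seen in a short Python sketch. The function names are illustrative only; the data are the Quiz 1 scores from Table 1 and the four sampled scores from the example above.

```python
def population_variance(scores):
    """Average squared deviation from the mean (divides by N)."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores) / len(scores)


def sample_variance(scores):
    """Estimate of the population variance from a sample (divides by N - 1)."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores) / (len(scores) - 1)


quiz1 = [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5]
sample = [1, 2, 4, 5]

print(population_variance(quiz1))                   # 1.5
print(round(population_variance(quiz1) ** 0.5, 3))  # 1.225
print(round(sample_variance(sample), 3))            # 3.333
```

The standard library functions `statistics.pvariance` and `statistics.variance` should give the same two quantities.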
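There are alternate formulas that can be easier to use if you
are doing your calculations with a hand calculator. You
should note that these formulas are subject to rounding
error if your values are very large and/or you have an
extremely large number of observations.

$$\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}$$

and

$$s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$$

For this example,

$$\sum X^2 = 1^2 + 2^2 + 4^2 + 5^2 = 46, \qquad \frac{(\sum X)^2}{N} = \frac{12^2}{4} = 36$$

$$\sigma^2 = \frac{46 - 36}{4} = 2.5, \qquad s^2 = \frac{46 - 36}{3} = 3.333$$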
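As a hedged illustration of the calculator-style approach, the sketch below estimates the variance from only the sum of the scores and the sum of the squared scores. The function name is hypothetical, and the formula is the standard computational form, assumed here to be the one the text refers to.

```python
def sample_variance_shortcut(scores):
    """Estimate the variance using only sum(X) and sum(X^2)."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x2 = sum(x * x for x in scores)
    # (sum of squares - (sum)^2 / N) / (N - 1)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


sample = [1, 2, 4, 5]
print(round(sample_variance_shortcut(sample), 3))  # 3.333
```

With very large values, the subtraction of two nearly equal large numbers is where the rounding error mentioned above comes from.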
STANDARD DEVIATION
The standard deviation is simply the square root of the
variance. This makes the standard deviations of the two
quiz distributions 1.225 and 2.588. The standard deviation
is an especially useful measure of variability when the
distribution is normal or approximately normal (see Chapter
on Normal Distributions) because the proportion of the
distribution within a given number of standard deviations
from the mean can be calculated. For example, 68% of the
distribution is within one standard deviation of the mean
and approximately 95% of the distribution is within two
standard deviations of the mean. Therefore, if you had a
normal distribution with a mean of 50 and a standard
deviation of 10, then 68% of the distribution would be
between 50 - 10 = 40 and 50 + 10 = 60. Similarly, about
95% of the distribution would be between 50 - 2 x 10 = 30
and 50 + 2 x 10 = 70. The symbol for the population
standard deviation is $\sigma$; the symbol for an estimate
computed in a sample is s. Figure 2 shows two normal
distributions. The red distribution has a mean of 40 and a
standard deviation of 5; the blue distribution has a mean of
60 and a standard deviation of 10. For the red distribution,
68% of the distribution is between 35 and 45; for the blue
distribution, 68% is between 50 and 70.
Figure 2. Normal distributions with standard deviations
of 5 and 10.
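The 68% and 95% figures for the example distribution (mean 50, standard deviation 10) can be checked with `statistics.NormalDist` from Python's standard library. This is a small sketch, not part of the original material.

```python
from statistics import NormalDist

dist = NormalDist(mu=50, sigma=10)

# Proportion of the distribution between 40 and 60 (one standard deviation).
within_one = dist.cdf(60) - dist.cdf(40)
# Proportion of the distribution between 30 and 70 (two standard deviations).
within_two = dist.cdf(70) - dist.cdf(30)

print(round(within_one, 4))  # 0.6827
print(round(within_two, 4))  # 0.9545
```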
Levels of Measurement
There are four levels of measurement. They are:
Nominal, Ordinal, Interval, and Ratio
Associated with each level are acceptable statistical methods. The nominal
level is simplest, while Ratio measures are the most sophisticated.
Nominal Level (Grouping)
Nominal level data is generally referred to as the "lowest" level of
measure. Data is limited to groups and categories. No numerical data is
ever provided.
Example:
Male Female
Catholic Protestant Jewish Muslim Other
Ordinal Level (Grouping and Ranking)
Ordinal level data can be grouped and ranked. With Ordinal data, you
can say that a measure is higher or lower than another measure. But, you
may not say how much higher or lower.
Example:
Preferred Flavors of Ice Cream
1. Vanilla
2. Chocolate
3. Strawberry
4. Cherry
Interval Level (Grouping, ranking, and includes the exact distance
between measures)
Interval level data can be grouped, ranked, and include the exact distance
between measures. Note: Interval measures never contain an absolute zero ( 0 )
starting point.
Example:
Sam is 2" taller than Bill and 3" taller than Steve.
Bill is 2" shorter than Sam and 1" taller than Steve.
Steve is 3" shorter than Sam and 1" shorter than Bill.
Note: We do not know how tall Sam, Bill or Steve are; we only know
exactly the difference in their heights when compared to one another.
Ratio Level (Grouping, ranking, exact distance between measurement,
and contains an absolute "0")
Ratio level data are said to be at the highest level and can be grouped,
ranked, and the exact distance between measures determined. Also, Ratio
level measures contain an absolute "0". By having an absolute "0" in your
measurement "scale", you are able to describe data in terms of ratios.
Example:
You could say that Jack, who weighs 200 lbs., is twice as heavy as Mary,
who weighs 100 lbs. (twice as heavy is a ratio statement).
It should be noted that with social science data, there are rarely any outside
standard scales, such as a yardstick to measure height. Therefore, social
research rarely generates data that goes beyond Interval level measures.
Sample Size
The question "How large should my sample be?" is a common one, and
one with no simple answer. While there are a number of elegant approaches
to answering this question, for our purposes, several "rules of thumb" will
serve us better.
Rule of Thumb #1
Use sample groups larger than 30 for interval level measures.
Rule of Thumb #2
If the total population that you are examining is less than 30, use all of
them.
Rule of Thumb #3
You should have a sample size of 30 for every relationship you measure.
30+ people ----------------> compared against "X" OK
15 Women ----------------> compared against "X" Not OK
15 Men -------------------> compared against "X" Not OK
30+ Women --------------> compared against "X" OK
30+ Men -----------------> compared against "X" OK
Rule of Thumb #4
Consult this table: http://www.internetraining.com/Statkit/SampleTable.htm
Selecting Samples
To select a random sample, use a table of Random Numbers or use a
computerized random number generator.
Note: For a more detailed discussion of sample size, see pages 385-3your textbook.
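A computerized random number generator can be as simple as the sketch below, which draws 30 members without replacement from a numbered population. The population size of 250 and the seed are hypothetical values chosen only for illustration; they do not come from the text.

```python
import random

population_size = 250   # hypothetical: members are numbered 1..250
sample_size = 30        # in line with the rules of thumb above

rng = random.Random(2019)  # fixed seed only so the sketch is repeatable
member_ids = range(1, population_size + 1)

# sample() draws without replacement, so no member is selected twice.
selected = sorted(rng.sample(member_ids, sample_size))
print(selected)
```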
Measures of Central Tendency
If you want to find a Yak-Yak bird, the first question you
might ask yourself is, "Where do most Yak-Yak birds
live?" In other words, "where would I have to go to have
the greatest chance of finding a Yak-Yak bird?" Measures
of Central Tendency tell you where most of whatever you
are measuring can be found.
Mean = Average
All scores are added up and divided by the number of scores.
Median = Middle score
Count the total number of scores. The one in the middle is the median.
Note: If there are an even number of scores, select the middle two and
average them. This will give you the median.
Mode = Most common score
The mode is the score that occurs most often.
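The three definitions above map directly onto functions in Python's standard `statistics` module. The two score lists below are hypothetical and were chosen only to show the odd-count and even-count cases of the median.

```python
import statistics

odd_scores = [3, 5, 5, 6, 9]        # hypothetical, odd number of scores
even_scores = [3, 5, 5, 6, 9, 10]   # hypothetical, even number of scores

print(statistics.mean(odd_scores))     # 5.6  (28 / 5)
print(statistics.median(odd_scores))   # 5    (the middle score)
print(statistics.median(even_scores))  # 5.5  (average of the middle two, 5 and 6)
print(statistics.mode(odd_scores))     # 5    (occurs most often)
```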
Range
Range is the difference between the highest and lowest scores. You should
only use the range to describe interval or ratio level data. To calculate the
range, subtract the lowest score from the highest score.
Example:
Note: In some statistics books, they will define range as the High Score
minus the Low Score, plus one (1). This is an inclusive measure of range,
rather than a measure of the difference between two scores. For example,
the inclusive range for data ranging from 6 to 10 would be 5.
For our purposes, we will define the range as the difference between
the highest and lowest scores.
Skewness
When the mean, median and mode are equal, you will have a normal or bell
shaped distribution of scores.
Example:
Scores: 7, 8, 9, 9, 10, 10, 10, 11, 11, 12, 13
Mean: 10
Median: 10
Mode: 10
Range: 6
We should note at this point that a normal distribution (Bell Curve) is an
important concept for statisticians because it gives them a "theoretical
standard" by which to compare data that may not form a perfect bell curve.
If you have data where the mean, median and mode are quite different, the
scores are said to be skewed.
Example:
Scores: 7, 8, 9, 10, 11, 11, 12, 12, 12, 13, 13
Mean: 10.7
Median: 11
Mode: 12
Range: 6
Scores that are "bunched" at the right or high end of the scale are said to
have a negative skew.
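The summary figures for the two example score sets can be reproduced with the short sketch below. The helper name `summarize` is illustrative only. In the second set, the mean falls below the median, which falls below the mode, which is the pattern the text describes as a negative skew.

```python
import statistics


def summarize(scores):
    """Return the mean (one decimal), median, mode, and range of the scores."""
    return {
        "mean": round(statistics.mean(scores), 1),
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
        "range": max(scores) - min(scores),
    }


symmetric = [7, 8, 9, 9, 10, 10, 10, 11, 11, 12, 13]
skewed = [7, 8, 9, 10, 11, 11, 12, 12, 12, 13, 13]

print(summarize(symmetric))
print(summarize(skewed))
```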
On the other hand, if my standard deviation is small, it indicates that my
scores are close to the Mean, and therefore, the Mean is a good indicator of
the "average" score.
To calculate the standard deviation of a set of scores, use the following
formula.
The following should help. Let's say the data below represents the test
scores for 10 trainees. (The top score anyone could make was 50).
Scores (n) = 10
Mean = 400/10 = 40
Median = 40.5
Mode = 41
Range = 16
Let me try it on the Internet: http://www.physics.csbsju.edu/stats/descriptive2.html
So what does the standard deviation tell you? It tells you that most trainees
made 40, give or take 5 points. And, that in this case, you can say with
confidence that the Mean is a very good indicator of the "average" score.
By establishing the standard deviation for a set of scores, you will be able
to describe accurately how various scores differ from one another and
from the Mean (average).
Note: In a Normal Distribution:
approximately 68% of all scores will fall within (+/-) 1 standard
deviation of the Mean;
approximately 95% of all scores will fall within (+/-) 2 standard
deviations of the Mean;
approximately 99.7% of all scores will fall within (+/-) 3 standard
deviations of the Mean.
When we know the standard deviation for a set of scores, it is possible to
compare our data at a glance with a Normal Distribution to determine the
degree of dispersion.
In the above example, we can see that all of our scores fell within two
standard deviations of the Mean, which again reaffirms that our Mean is a
very good indicator of the "average" score.
A Quick and Dirty Way for Determining the Standard Deviation of a
Set of Scores
If your Mean, Median and Mode are very close to one another, you can
estimate the Standard Deviation as follows:
1. Determine the Range
2. Divide the Range by four
3. The resulting number will approximately equal the true standard
deviation
Example:
Scores: 7, 8, 9, 10, 10, 10, 11, 11, 12, 13
Mean: 10
Median: 10
Mode: 10
Range: 6
True Standard Deviation: 1.67
Q&D Standard Deviation: 6/4 = 1.5
CAUTION: Only use this technique IF your Mean, Median and Mode
are very similar.
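The three steps above can be sketched in Python as follows. The function name `quick_sd` is hypothetical. The two standard library calls are printed only for comparison; which divisor (N or N - 1) the handout used for its "true" value is not stated, so both are shown without assuming either.

```python
import statistics


def quick_sd(scores):
    """Quick-and-dirty estimate: the range divided by four."""
    return (max(scores) - min(scores)) / 4


scores = [7, 8, 9, 10, 10, 10, 11, 11, 12, 13]

print(quick_sd(scores))           # 1.5
print(statistics.pstdev(scores))  # standard deviation dividing by N
print(statistics.stdev(scores))   # standard deviation dividing by N - 1
```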
Measures of Association
Pearson's r
Interval Level Measure
There will be many times when you will want to know if two variables are
"related," and to what extent. For example, you may want to find out the
relationship between Yearly Income and Education. To do this, you would
randomly select 10 individuals. To help you "sort out" your data, you
would construct the following table.
With this data, you can now use the Pearson's r or product moment
correlation coefficient formula.
Let me try it on the Internet: http://faculty.vassar.edu/lowry/corr_stats.html
Using the following table, you can see that there is a very high correlation
between Income and Years of Education.
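Since the data table did not survive in this copy, the sketch below uses made-up figures for 10 individuals to show how a product moment correlation can be computed. Both lists are hypothetical and are not the handout's data; the formula is the standard definition of Pearson's r.

```python
import math


def pearson_r(xs, ys):
    """Product moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical data for 10 individuals (not from the original table).
years_education = [10, 12, 12, 14, 16, 16, 18, 18, 20, 21]
income_thousands = [22, 28, 30, 35, 44, 41, 52, 55, 61, 68]

print(round(pearson_r(years_education, income_thousands), 3))
```

A value near +1 indicates a strong positive relationship, a value near -1 a strong negative one, and a value near 0 little or no linear relationship.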
Measures of Association
Spearman's Rank Order
Ordinal Level Measure
Once in a while, it is useful to compare rankings between two people or
groups. For example, let's say a group of employees and a group of
managers want to find out if there is a difference in the workplace values
held by each group. In this example, each group ranks 10 workplace values.
In order to calculate a Spearman Rank Order, you must first construct the
following table.
Let me try it on the Internet: http://faculty.vassar.edu/lowry/corr_rank.html
Again, using the following table, you can see that this time there is ncorrelation between the two rankings.
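The ranking table is missing from this copy, so the sketch below uses two hypothetical sets of rankings to show the usual Spearman calculation. It assumes the common shortcut formula, 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which applies when there are no tied ranks; the handout's own formula is not shown, so this is an assumption.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank order correlation for two rankings without ties."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


# Hypothetical rankings of the same 10 items by two groups.
employees = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
managers = [3, 1, 4, 2, 6, 5, 9, 7, 10, 8]

print(round(spearman_rho(employees, managers), 3))
```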
Chi-Square
Test of Significance
Nominal Level Measure
Let's say you wanted to know if there was any significant difference
between the production rates for departments which had trained supervisors
versus those departments whose supervisors were not trained.
Step 1
Let's begin by collecting some data.
9 departments produced above standard
11 departments produced at standard
10 departments produced below standard
17 departments are supervised by trained supervisors
13 departments are supervised by untrained supervisors
Step 2
From this data, you can construct the following cross-tabs table.
Note: We have included the totals for rows and columns in our table.
Step 3
We must now construct a table which describes what would have happened
if training did not impact production. (We call this table an Expectancy
Table).
A = 9 departments with above standard production
B = 11 departments with standard production
C = 10 departments with below standard production
M = Number of departments with trained supervisors (17)
N = Number of departments with untrained supervisors (13)
T = Total number of supervisors (30)
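One common way to fill in an expectancy table is to multiply each row total by each column total and divide by the grand total. The sketch below applies that rule to the totals A, B, C, M, and N listed above; the rule itself is assumed here, since the handout's own table is not shown in this copy.

```python
column_totals = {"above": 9, "standard": 11, "below": 10}   # A, B, C
row_totals = {"trained": 17, "untrained": 13}               # M, N
total = sum(row_totals.values())                            # T

# Expected count for a cell = (row total * column total) / T
expected = {
    row: {col: row_n * col_n / total for col, col_n in column_totals.items()}
    for row, row_n in row_totals.items()
}

for row, cells in expected.items():
    print(row, {col: round(value, 2) for col, value in cells.items()})
```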
Step 4
We are now ready to calculate the Chi-square using the following formula.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where O = the actual production records for each cell in the table
Where E = the expected production record for each cell in the table
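A minimal sketch of the calculation follows. The cell counts of the original cross-tabs table are not available in this copy, so the observed counts below are hypothetical; they were chosen only to match the row and column totals given in Steps 1 and 3, and the result is therefore not expected to equal the value quoted in Step 5.

```python
def chi_square(observed):
    """Sum of (O - E)^2 / E over every cell of a cross-tabs table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count
            statistic += (o - e) ** 2 / e
    return statistic


# Hypothetical observed counts: columns are above / at / below standard.
hypothetical_observed = [
    [7, 8, 2],   # trained supervisors (row total 17)
    [2, 3, 8],   # untrained supervisors (row total 13)
]

print(round(chi_square(hypothetical_observed), 3))
```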
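Let me try it on the Internet v.1 - Chi Square: http://graphpad.com/quickcalcs/chisquared1.cfm
Let me try it on the Internet v.2 - Contingency Table: http://www.physics.csbsju.edu/stats/contingency_NROW_NCOLUMN_form.html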
Step 5
A Chi-square by itself is not of much value. You have to use a Chi-square
table that you can find in most statistics books. However, before you can
use the table, you must first determine the degrees of freedom of your table.
To do this, blot out one row and one column. The remaining number of
cells will be your degrees of freedom.
In our example table, we can see that we have 2 cells that are not filled,
which means we have 2 degrees of freedom (df).
Now, using a Chi-square table, find where the 8.416 falls on the line of
numbers for df 2. (Online Chi Square Table: http://www.richland.cc.il.us/james/lecture/m170/tbl-chi.html)
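The table lookup can also be approximated in code. Blotting out one row and one column of a table leaves (rows - 1) x (columns - 1) cells, and for exactly 2 degrees of freedom the upper-tail probability of a Chi-square value x is exp(-x / 2). The sketch below relies on that special case only; the function names are illustrative.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """Cells left after blotting out one row and one column."""
    return (n_rows - 1) * (n_cols - 1)


def p_value_df2(chi_square_value):
    """Upper-tail probability of a Chi-square value with exactly 2 df."""
    return math.exp(-chi_square_value / 2)


df = degrees_of_freedom(2, 3)   # 2 supervisor groups, 3 production levels
p = p_value_df2(8.416)

print(df)           # 2
print(round(p, 4))  # 0.0149
```

A probability below .02 corresponds to the "98%+" confidence described in this step.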
What this tells you is that you can say, with 98%+* confidence (probability), that
supervisory training is responsible for increased production.
Note: You can use Chi-square with cross-tab tables that have up to 30
degrees of freedom. (The maximum for most Chi-square tables).
__________________
* .02 = 2%
100% - 2% = 98%