Rsch STATISTICS.docx


8/14/2019 Rsch STATISTICS.docx


Measures of Variability

Author(s): David M. Lane

Prerequisites

Percentiles, Distributions, Measures of Central Tendency

Learning Objectives

1. Determine the relative variability of two distributions
2. Compute the range
3. Compute the interquartile range
4. Compute the variance in the population
5. Estimate the variance from a sample
6. Compute the standard deviation from the variance

WHAT IS VARIABILITY?

Variability refers to how "spread out" a group of scores is.

To see what we mean by spread out, consider graphs in

Figure 1. These graphs represent the scores on two quizzes.

The mean score for each quiz is 7.0. Despite the equality of

means, you can see that the distributions are quite

different. Specifically, the scores on Quiz 1 are more

densely packed and those on Quiz 2 are more spread out.

The differences among students were much greater on Quiz

2 than on Quiz 1.

Figure 1. Bar charts of two quizzes (panels: Quiz 1 and Quiz 2).

The terms variability, spread, and dispersion are synonyms,

and refer to how spread out a distribution is. Just as in the

section on central tendency where we discussed measures

of the center of a distribution of scores, in this chapter we

will discuss measures of the variability of a distribution.

There are four frequently used measures of variability: the

range, interquartile range, variance, and standard

deviation. In the next few paragraphs, we will look at each

of these four measures of variability in more detail.

RANGE

The range is the simplest measure of variability to calculate, and one you have probably encountered many times in your life. The range is simply the highest score minus the lowest score. Let's take a few examples. What is the range of the following group of numbers: 10, 2, 5, 6, 7, 3, 4? Well, the highest number is 10, and the lowest number is 2, so 10 - 2 = 8. The range is 8. Let's take another example. Here's a dataset with 10 numbers: 99, 45, 23, 67, 45, 91, 82, 78, 62, 51. What is the range? The highest number is 99 and the lowest number is 23, so 99 - 23 equals 76; the range is 76. Now consider the two quizzes shown in Figure 1. On Quiz 1, the lowest score is 5 and the highest score is 9. Therefore, the range is 4. The range on Quiz 2 was larger: the lowest score was 4 and the highest score was 10. Therefore the range is 6.
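The two numeric examples in this section can be checked with a few lines of Python. This is only a minimal sketch; the function name score_range is an illustrative choice, not something defined in the text.

```python
def score_range(scores):
    """Return the highest score minus the lowest score."""
    return max(scores) - min(scores)


# The two example datasets given in the text.
first = [10, 2, 5, 6, 7, 3, 4]
second = [99, 45, 23, 67, 45, 91, 82, 78, 62, 51]

print(score_range(first))   # 8
print(score_range(second))  # 76
```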

INTERQUARTILE RANGE

The interquartile range (IQR) is the range of the middle 50% of the scores in a distribution. It is computed as follows:

IQR = 75th percentile - 25th percentile

For Quiz 1, the 75th percentile is 8 and the 25th percentile is 6. The interquartile range is therefore 2. For Quiz 2, which has greater spread, the 75th percentile is 9, the 25th percentile is 5, and the interquartile range is 4. Recall that in the discussion of box plots, the 75th percentile was called the upper hinge and the 25th percentile was called the lower hinge. Using this terminology, the interquartile range is referred to as the H-spread.

A related measure of variability is called the semi-interquartile range. The semi-interquartile range is defined simply as the interquartile range divided by 2. If a distribution is symmetric, the median plus or minus the semi-interquartile range contains half the scores in the distribution.
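As a rough illustration, the interquartile range and semi-interquartile range for Quiz 1 can be computed with Python's standard statistics module. The scores are the Quiz 1 scores listed in Table 1. Percentile conventions differ between textbooks and software, so this sketch assumes the default method of statistics.quantiles, which happens to reproduce the values quoted above for this dataset; other methods may give slightly different cut points.

```python
import statistics

# Quiz 1 scores as listed in Table 1.
quiz1 = [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5]

# quantiles(n=4) returns three cut points: the 25th, 50th and 75th percentiles.
q25, q50, q75 = statistics.quantiles(quiz1, n=4)

iqr = q75 - q25        # interquartile range (H-spread)
semi_iqr = iqr / 2     # semi-interquartile range

print(q25, q75)   # 6.0 8.0
print(iqr)        # 2.0
print(semi_iqr)   # 1.0
```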

VARIANCE

Variability can also be defined in terms of how close the scores in the distribution are to the middle of the distribution. Using the mean as the measure of the middle of the distribution, the variance is defined as the average squared difference of the scores from the mean. The data from Quiz 1 are shown in Table 1. The mean score is 7.0. Therefore, the column "Deviation from Mean" contains the score minus 7. The column "Squared Deviation" is simply the previous column squared.

Table 1. Calculation of Variance for Quiz 1 scores.

Scores    Deviation from Mean    Squared Deviation
  9               2                     4
  9               2                     4
  9               2                     4
  8               1                     1
  8               1                     1
  8               1                     1
  8               1                     1
  7               0                     0
  7               0                     0
  7               0                     0
  7               0                     0
  7               0                     0
  6              -1                     1
  6              -1                     1
  6              -1                     1
  6              -1                     1
  6              -1                     1
  6              -1                     1
  5              -2                     4
  5              -2                     4

Means
  7               0                     1.5

One thing that is important to notice is that the mean deviation from the mean is 0. This will always be the case. The mean of the squared deviations is 1.5. Therefore, the variance is 1.5. Analogous calculations with Quiz 2 show that its variance is 6.7. The formula for the variance is

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

where $\sigma^2$ is the variance, $\mu$ is the mean, and N is the number of numbers. For Quiz 1, $\mu$ = 7 and N = 20.
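The Table 1 calculation can be reproduced step by step in Python. This is a minimal sketch using the Quiz 1 scores from Table 1; the variable names are illustrative only.

```python
# Quiz 1 scores as listed in Table 1.
quiz1 = [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5]

n = len(quiz1)
mu = sum(quiz1) / n                       # population mean

deviations = [x - mu for x in quiz1]      # "Deviation from Mean" column
squared = [d ** 2 for d in deviations]    # "Squared Deviation" column

mean_deviation = sum(deviations) / n      # the deviations always average to 0
variance = sum(squared) / n               # population variance (divide by N)

print(mu)              # 7.0
print(mean_deviation)  # 0.0
print(variance)        # 1.5
```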

If the variance in a sample is used to estimate the variance in a population, then the previous formula underestimates the variance and the following formula should be used:

$$s^2 = \frac{\sum (X - M)^2}{N - 1}$$

where $s^2$ is the estimate of the variance and M is the sample mean. Note that M is the mean of a sample taken from a population with a mean of $\mu$. Since, in practice, the variance is usually computed in a sample, this formula is most often used. The simulation "estimating variance" illustrates the bias in the formula with N in the denominator.

Let's take a concrete example. Assume the scores 1, 2, 4, and 5 were sampled from a larger population. To estimate the variance in the population you would compute $s^2$ as follows:

M = (1 + 2 + 4 + 5)/4 = 12/4 = 3.

$$s^2 = \frac{(1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2}{4-1} = \frac{4 + 1 + 1 + 4}{3} = \frac{10}{3} = 3.333$$
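The same worked example can be checked with Python's standard statistics module, where variance divides by N - 1 and pvariance divides by N. This is only a sketch for comparison; it is not part of the original calculation.

```python
import statistics

sample = [1, 2, 4, 5]

m = statistics.mean(sample)             # sample mean M
s2 = statistics.variance(sample)        # estimate of the variance (N - 1 in the denominator)
biased = statistics.pvariance(sample)   # same data with N in the denominator

print(m)             # 3
print(round(s2, 3))  # 3.333
print(biased)        # 2.5
```

The smaller value from the N-denominator formula illustrates why it underestimates the population variance when applied to a sample.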

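There are alternate formulas that can be easier to use if you are doing your calculations with a hand calculator. You should note that these formulas are subject to rounding error if your values are very large and/or you have an extremely large number of observations.

$$\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}$$

and

$$s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$$

For this example,

$$\sum X^2 = 1^2 + 2^2 + 4^2 + 5^2 = 46$$

$$\frac{(\sum X)^2}{N} = \frac{(1 + 2 + 4 + 5)^2}{4} = \frac{144}{4} = 36$$

$$\sigma^2 = \frac{46 - 36}{4} = 2.5$$

$$s^2 = \frac{46 - 36}{3} = 3.333$$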
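The rounding-error warning can be made concrete with a small Python sketch. It compares the definitional formula with the sum-of-squares shortcut on the scores 1, 2, 4, 5 and on the same scores shifted by one billion (a shift chosen here purely for illustration; shifting does not change the true variance). The function names are hypothetical, and the behaviour shown assumes ordinary double-precision floating point.

```python
def definitional_variance(xs):
    """Sample variance from squared deviations about the mean."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)


def shortcut_variance(xs):
    """Sample variance from the sum of squares and the squared sum."""
    n = len(xs)
    sum_sq = sum(x * x for x in xs)
    sq_sum_over_n = sum(xs) ** 2 / n
    return (sum_sq - sq_sum_over_n) / (n - 1)


small = [1.0, 2.0, 4.0, 5.0]
big = [x + 1e9 for x in small]   # same spread, very large values

print(round(definitional_variance(small), 3))  # 3.333
print(round(shortcut_variance(small), 3))      # 3.333
print(round(definitional_variance(big), 3))    # 3.333
print(shortcut_variance(big))                  # far from 3.333 due to rounding
```

For the large values, the two huge terms in the shortcut formula can no longer be represented exactly, so their small difference is lost.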

STANDARD DEVIATION

The standard deviation is simply the square root of the variance. This makes the standard deviations of the two quiz distributions 1.225 and 2.588. The standard deviation is an especially useful measure of variability when the distribution is normal or approximately normal (see Chapter on Normal Distributions) because the proportion of the distribution within a given number of standard deviations from the mean can be calculated. For example, 68% of the distribution is within one standard deviation of the mean and approximately 95% of the distribution is within two standard deviations of the mean. Therefore, if you had a normal distribution with a mean of 50 and a standard deviation of 10, then 68% of the distribution would be between 50 - 10 = 40 and 50 + 10 = 60. Similarly, about 95% of the distribution would be between 50 - 2 x 10 = 30 and 50 + 2 x 10 = 70. The symbol for the population standard deviation is $\sigma$; the symbol for an estimate computed in a sample is s. Figure 2 shows two normal distributions. The red distribution has a mean of 40 and a standard deviation of 5; the blue distribution has a mean of 60 and a standard deviation of 10. For the red distribution, 68% of the distribution is between 35 and 45; for the blue distribution, 68% is between 50 and 70.

Figure 2. Normal distributions with standard deviations of 5 and 10.
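The figures quoted in this section can be checked with a short Python sketch. It takes the two quiz variances given earlier (1.5 and 6.7) and uses NormalDist from the standard statistics module for the normal distribution with mean 50 and standard deviation 10; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Standard deviation = square root of the variance.
sd_quiz1 = math.sqrt(1.5)
sd_quiz2 = math.sqrt(6.7)
print(round(sd_quiz1, 3))  # 1.225
print(round(sd_quiz2, 3))  # 2.588

# Proportion of a normal distribution (mean 50, SD 10) within 1 and 2 SDs.
dist = NormalDist(mu=50, sigma=10)
within_one = dist.cdf(60) - dist.cdf(40)
within_two = dist.cdf(70) - dist.cdf(30)
print(round(within_one, 3))  # 0.683
print(round(within_two, 3))  # 0.954
```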

Levels of Measurement

There are four levels of measurement. They are:

Nominal, Ordinal, Interval, and Ratio

Associated with each level are acceptable statistical methods. The nominal

level is simplest, while Ratio measures are the most sophisticated.

Nominal Level (Grouping)

Nominal level data is generally referred to as the "lowest" level of

measure. Data is limited to groups and categories. No numerical data is

ever provided.

Example:

Male Female

Catholic Protestant Jewish Muslim Other

Ordinal Level (Grouping and Ranking)

Ordinal level data can be grouped and ranked. With ordinal data, you can say that a measure is higher or lower than another measure. But, you may not say how much higher or lower.

Example:

Preferred Flavors of Ice Cream

1. Vanilla

2. Chocolate

3. Strawberry

4. Cherry

Interval Level (Grouping, ranking, and includes the exact distance between measures)

Interval level data can be grouped, ranked, and include the exact distance between measures. Note: Interval measures never contain an absolute zero ( 0 ) starting point.

Example:

Sam is 2" taller than Bill and 3" taller than Steve.

Bill is 2" shorter than Sam and 1" taller than Steve.

Steve is 3" shorter than Sam and 1" shorter than Bill.

Note: We do not know how tall Sam, Bill or Steve are, we only know

exactly the difference in their heights when compared to one another.


Ratio Level (Grouping, ranking, exact distance between measurement,

and contains an absolute "0")

Ratio level data are said to be at the highest level and can be grouped, ranked, and the exact distance between measures determined. Also, Ratio level measures contain an absolute "0". By having an absolute "0" in your measurement "scale", you are able to describe data in terms of ratios.

Example:

You could say that Jack, who weighs 200 lbs., is twice as heavy as Mary, who weighs 100 lbs. (twice as heavy is a ratio statement).

It should be noted that with social science data, there are rarely any outside standard scales, such as a yardstick to measure height. Therefore, social research rarely generates data that goes beyond Interval level measures.

Sample Size

The question "how large should my sample be?" is a common one, and one with no simple answer. While there are a number of elegant approaches to answer this question, for our purposes, several "rules of thumb" will serve us better.

Rule of Thumb #1

Use sample groups larger than 30 for interval level measures.

Rule of Thumb #2

If the total population that you are examining is less than 30, use all of them.

Rule of Thumb #3

You should have a sample size of 30 for every relationship you measure.

30+ people ----------------> compared against "X" OK
15 Women ------------------> compared against "X" Not OK
15 Men --------------------> compared against "X" Not OK
30+ Women -----------------> compared against "X" OK
30+ Men -------------------> compared against "X" OK

Rule of Thumb #4

Consult this table: http://www.internetraining.com/Statkit/SampleTable.htm

Selecting Samples

To select a random sample, use a table of Random Numbers or use a

computerized random number generator.
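As one possible illustration of a computerized random number generator, the sketch below draws a simple random sample with Python's standard random module. The population of 200 numbered IDs, the sample size of 30, and the seed are all hypothetical values chosen for the example.

```python
import random

# Hypothetical population: 200 individuals identified by ID numbers 1..200.
population = list(range(1, 201))

# A seeded generator makes the draw repeatable; the seed value is arbitrary.
rng = random.Random(42)

# Draw 30 distinct individuals, each equally likely to be chosen.
sample = rng.sample(population, k=30)

print(sorted(sample))
```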

Note: For a more detailed discussion of sample size, see pages 385-3your textbook.

Measures of Central Tendency

If you want to find a Yak-Yak bird, the first question you might ask yourself is, "Where do most Yak-Yak birds live?" In other words, "where would I have to go to have the greatest chance of finding a Yak-Yak bird?" Measures of Central Tendency tell you where most of whatever you are measuring can be found.

Mean = Average

All scores are added up and divided by the number of scores.

Median = Middle score

Arrange the scores in order and count them. The one in the middle is the median.

Note: If there is an even number of scores, select the middle two and average them. This will give you the median.

Mode = Most common score

The mode is the score that occurs most often.
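The three definitions can be tried out with Python's standard statistics module. This sketch uses the first set of example scores given under Skewness in this handout, plus a short even-length list (chosen for illustration) to show the "average the middle two" rule.

```python
import statistics

scores = [7, 8, 9, 9, 10, 10, 10, 11, 11, 12, 13]

mean = statistics.mean(scores)      # sum of scores / number of scores
median = statistics.median(scores)  # middle score after sorting
mode = statistics.mode(scores)      # most common score

print(mean, median, mode)  # 10 10 10

# With an even number of scores, the two middle scores are averaged.
even_median = statistics.median([1, 2, 4, 5])
print(even_median)  # 3.0
```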

Range

Range is the difference between the highest and lowest scores. You should only use the range to describe interval or ratio level data. To calculate the range, subtract the lowest score from the highest score.

Example:

Note: In some statistics books, they will define range as the High Score minus the Low Score, Plus one (1). This is an inclusive measure of range, rather than a measure of the difference between two scores. For example, the inclusive range for data ranging from 6 to 10 would be 5.

For our purposes, we will define the range as the difference between the highest and lowest scores.

Skewness

When the mean, median and mode are equal, you will have a normal or bell

shaped distribution of scores.

Example:

Scores: 7, 8, 9, 9, 10, 10, 10, 11, 11, 12, 13

Mean: 10

Median: 10

Mode: 10

Range: 6

We should note at this point that a normal distribution (Bell Curve) is an important concept for statisticians because it gives them a "theoretical standard" by which to compare data that may not form a perfect bell curve.

If you have data where the mean, median and mode are quite different, the scores are said to be skewed.

Example:

Scores: 7, 8, 9, 10, 11, 11, 12, 12, 12, 13, 13

Mean: 10.7

Median: 11

Mode: 12

Range: 6

Scores that are "bunched" at the right or high end of the scale are said to have a negative skew.
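A rough way to see the direction of skew is to compare the mean with the median: when the mean is pulled below the median, the long tail is on the low side. The sketch below applies that comparison to the two example score sets in this section. The function name skew_direction is hypothetical, and the mean-versus-median comparison is a simple heuristic assumed here, not a formal skewness statistic.

```python
import statistics


def skew_direction(scores):
    """Classify skew by comparing the mean with the median (a heuristic)."""
    mean = statistics.mean(scores)
    median = statistics.median(scores)
    if mean < median:
        return "negative"
    if mean > median:
        return "positive"
    return "none"


balanced = [7, 8, 9, 9, 10, 10, 10, 11, 11, 12, 13]
bunched_high = [7, 8, 9, 10, 11, 11, 12, 12, 12, 13, 13]

print(skew_direction(balanced))                   # none
print(round(statistics.mean(bunched_high), 1))    # 10.7
print(skew_direction(bunched_high))               # negative
```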



On the other hand, if my standard deviation is small, it indicates that my scores are close to the Mean, and therefore, the Mean is a good indicator of

the "average" score.

To calculate the standard deviation of a set of scores, use the following

formula.

The following should help. Let's say the data below represents the test

scores for 10 trainees. (The top score anyone could make was 50).

Scores (n) = 10

Mean = 400/10 = 40

Median = 40.5

Mode = 41

Range = 16

Let me try it on the Internet: http://www.physics.csbsju.edu/stats/descriptive2.html

So what does the standard deviation tell you? It tells you that most trainees made 40, give or take 5 points. And, that in this case, you can say with confidence that the Mean is a very good indicator of the "average" score.

By establishing the standard deviation for a set of scores, you will be able to describe accurately how various scores differ from one another and from the Mean (average).

Note: In a Normal Distribution:

approximately 68% of all scores will fall within (+/-) 1 standard deviation of the Mean;

approximately 95% of all scores will fall within (+/-) 2 standard deviations of the Mean;

approximately 99.7% of all scores will fall within (+/-) 3 standard deviations of the Mean.

When we know the standard deviation for a set of scores, it is possible to compare our data at a glance with a Normal Distribution to determine the degree of dispersion.

In the above example, we can see that all of our scores fell within two standard deviations of the Mean, which again reaffirms that our Mean is a very good indicator of the "average" score.

A Quick and Dirty Way for Determining the Standard Deviation of a Set of Scores

If your Mean, Median and Mode are very close to one another, you can estimate the Standard Deviation by:

1. Determine the Range

2. Divide the Range by four

3. The resulting number will approximately equal the true standard

deviation

Example:


Scores: 7, 8, 9, 10, 10, 10, 11, 11, 12, 13

Mean: 10

Median: 10

Mode: 10

Range: 6

True Standard Deviation: 1.67

Q&D Standard Deviation: 6/4 = 1.5

CAUTION: Only use this technique IF your Mean, Median and Mode are very similar.
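The three steps of the shortcut can be sketched in Python and compared with a directly computed standard deviation for the example scores. The comparison uses pstdev and stdev from the standard statistics module; the exact "true" figure depends on whether N or N - 1 is used in the denominator, so the computed values may differ slightly from the one quoted in the example.

```python
import statistics

scores = [7, 8, 9, 10, 10, 10, 11, 11, 12, 13]

# Step 1: determine the range.
score_range = max(scores) - min(scores)

# Step 2: divide the range by four.
quick_sd = score_range / 4

# Step 3: compare with directly computed standard deviations.
population_sd = statistics.pstdev(scores)  # N in the denominator
sample_sd = statistics.stdev(scores)       # N - 1 in the denominator

print(score_range)              # 6
print(quick_sd)                 # 1.5
print(round(population_sd, 2))
print(round(sample_sd, 2))
```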

Measures of Association

Pearson's r

Interval Level Measure

There will be many times when you will want to know if two variables are "related," and to what extent. For example, you may want to find out the relationship between Yearly Income and Education. To do this, you would randomly select 10 individuals. To help you "sort out" your data, you would construct the following table.

With this data, you can now use the Pearson's r or product moment

correlation coefficient formula.
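As a hedged illustration, the product moment correlation can be computed from deviations about the two means. The handout's data table is not reproduced in this text, so the ten income and education values below are hypothetical numbers invented for the sketch, and the function name pearson_r is likewise illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# HYPOTHETICAL data for 10 individuals (not the handout's table).
years_of_education = [8, 10, 12, 12, 14, 16, 16, 18, 20, 21]
yearly_income = [18, 22, 30, 28, 35, 42, 40, 50, 58, 60]  # thousands of dollars

r = pearson_r(years_of_education, yearly_income)
print(round(r, 3))
```

Values of r near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or none.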




Using the following table, you can see that there is a very high correlation

between Income and Years of Education.

Measures of Association

Spearman's Rank Order

Ordinal Level Measure

Once in a while, it is useful to compare rankings between two people or groups. For example, let's say a group of employees and a group of managers want to find out if there is a difference in the workplace values held by each group. In this example, each group ranks 10 workplace values.
http://faculty.vassar.edu/lowry/corr_stats.html

In order to calculate a Spearman Rank Order, you must first construct the

following table.


Again, using the following table, you can see that this time there is no correlation between the two rankings.
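A minimal sketch of a rank order correlation is shown below. It uses the common rank-difference formula, 1 - 6 * (sum of squared rank differences) / (n * (n^2 - 1)), which assumes there are no tied ranks. The handout's ranking table is not reproduced in this text, so the two rankings here are hypothetical, as is the function name spearman_rho.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank order correlation for two rankings without ties."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


# HYPOTHETICAL rankings of 10 workplace values by two groups.
employees = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
managers = [6, 3, 9, 1, 10, 4, 7, 2, 8, 5]

rho = spearman_rho(employees, managers)
print(round(rho, 3))
```

Identical rankings give +1, exactly reversed rankings give -1, and unrelated rankings give a value near 0.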
http://faculty.vassar.edu/lowry/corr_rank.html

Chi-Square

Test of Significance

Nominal Level Measure

Let's say you wanted to know if there was any significant difference

between the production rates for departments which had trained supervisors

versus those departments whose supervisors were not trained.

Step 1

Let's begin by collecting some data.

9 departments produced above standard
11 departments produced at standard
10 departments produced below standard
17 departments are supervised by trained supervisors
13 departments are supervised by untrained supervisors

Step 2

From this data, you can construct the following cross-tabs table.

Note: We have included the totals for the rows and columns in our table.

Step 3

We must now construct a table which describes what would have happened if training did not impact production. (We call this table an Expectancy Table).

A = 9 departments with above standard production

B = 11 departments with standard production

C = 10 departments with below standard production

M = Number of departments with trained supervisors (17)

N = Number of departments with untrained supervisors (13)

T = Total number of supervisors

Step 4

We are now ready to calculate the Chi-square using the following formula.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where O = the actual production records for each cell in the table

Where E = the expected production record for each cell in the table
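Steps 3 and 4 can be sketched in Python. The expected counts follow from the row and column totals given in Step 1 (each expected cell is its row total times its column total, divided by the grand total). The handout's observed cross-tabs table is not reproduced in this text, so the observed cell counts below are HYPOTHETICAL values chosen only to match those totals; the resulting Chi-square will therefore not equal the handout's figure.

```python
# Totals from Step 1.
row_totals = [9, 11, 10]   # A, B, C: above, at, below standard
col_totals = [17, 13]      # M, N: trained, untrained supervisors
total = sum(col_totals)    # T = 30

# Expectancy table: what each cell would be if training had no effect.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# HYPOTHETICAL observed counts (rows: above, at, below; columns: trained, untrained).
observed = [[7, 2], [8, 3], [2, 8]]

# Chi-square: sum of (O - E)^2 / E over every cell.
chi_square = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print([[round(e, 2) for e in row] for row in expected])
print(round(chi_square, 3))
print(df)  # 2
```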

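Let me try it on the Internet v.1 - Chi Square: http://graphpad.com/quickcalcs/chisquared1.cfm

Let me try it on the Internet v.2 - Contingency Table: http://www.physics.csbsju.edu/stats/contingency_NROW_NCOLUMN_form.html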

Step 5

A Chi-square by itself is not of much value. You have to use a Chi-square table that you can find in most statistics books. However, before you use the table, you must first determine the degrees of freedom of your table. To do this, blot out one row and one column. The remaining number of cells will be your degrees of freedom.

In our example table, we can see that we have 2 cells that are not filled in, which means we have 2 degrees of freedom (df).

Now, using a Chi-square table, find where the 8.416 falls on the line of numbers for df 2. (Online Chi Square Table: http://www.richland.cc.il.us/james/lecture/m170/tbl-chi.html)

What this tells you is that you can say with 98%+* confidence (probability) that supervisory training is responsible for increased production.

Note: You can use Chi-square with cross-tab tables that have up to 30

degrees of freedom. (The maximum for most Chi-square tables).

__________________

* .02 = 2%

100% - 2% = 98%
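The table lookup for the handout's value of 8.416 can be cross-checked numerically. For the special case of 2 degrees of freedom, the probability of a Chi-square at least as large as x is exp(-x / 2); this shortcut does not apply to other degrees of freedom. The sketch below is a check on the footnote's arithmetic, not a replacement for the table.

```python
import math

chi_square = 8.416   # value used in the handout's example

# Upper-tail probability for a Chi-square with exactly 2 degrees of freedom.
p_value = math.exp(-chi_square / 2)

print(round(p_value, 4))          # 0.0149
print(p_value < 0.02)             # True, consistent with the 98%+ statement
```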
