
The Mean, Median, And Mode

Date post: 13-Apr-2018
Upload: janet-tal-udan

  • 7/26/2019 The Mean, Median, And Mode

    1/25

    Lesson 1

    Measures of Central Tendency: The Mean, Median, and Mode

    One of the most basic purposes of statistics is simply to enable us to make sense of large numbers. For example, if you want to know how the students in your school are doing in the statewide achievement test, and somebody gives you a list of all 600 of their scores, that's useless. This everyday problem is even more obvious and staggering when you're dealing, let's say, with the population data for the nation.

    We've got to be able to consolidate and synthesize large numbers to reveal their collective characteristics and interrelationships, and transform them from an incomprehensible mass to a set of useful and enlightening indicators.

    The Mean

    One of the most useful and widely used techniques for doing this, one which you already know, is the average, or, as it is known in statistics, the mean. And you know how to calculate the mean: you simply add up a set of scores and divide by the number of scores. Thus we have our first and perhaps the most basic statistical formula:

    $$\bar{X} = \frac{\sum X}{N}$$

    Where:

    $\bar{X}$ (sometimes called the X-bar) is the symbol for the mean.

    $\Sigma$ (the Greek letter sigma) is the symbol for summation.

    X is the symbol for the scores.

    N is the symbol for the number of scores.

    So this formula simply says you get the mean by summing up all the scores and dividing the total by the number of scores: the old average, which in this case we're all familiar with, so it's a good place to begin.

    This is pretty simple when you have only a few numbers. For example, if you have just 6 numbers (3, 9, 10, 8, 6, and 5), you insert them into the formula for the mean, and do the math:

    $$\bar{X} = \frac{3 + 9 + 10 + 8 + 6 + 5}{6} = \frac{41}{6} \approx 6.83$$
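    The same arithmetic can be checked in a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and `statistics.mean` from the standard library is used as a cross-check on the hand calculation.

```python
from statistics import mean

scores = [3, 9, 10, 8, 6, 5]

# Sum the scores and divide by the number of scores.
by_hand = sum(scores) / len(scores)

print(round(by_hand, 2))       # 6.83
print(round(mean(scores), 2))  # 6.83
```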

    But we usually have many more numbers to deal with, so let's do a couple of examples where the numbers are larger, and show how the calculations should be done. In our first example, we're going to compute the mean salary of 36 people. Column A of Table 1 shows the salaries (ranging from $20K to $70K), and column B shows how many people earned each of the salaries.

    Table 1
    Example 1 of Method for Computing the Mean

    A             B                C
    Salary (X)    Frequency (f)    fX
    $20K          1                20
    $25K          2                50
    $30K          3                90
    $35K          4                140
    $40K          5                200
    $45K          6                270
    $50K          5                250
    $55K          4                220
    $60K          3                180
    $65K          2                130
    $70K          1                70
    Sum           36               1,620

    To get the $\sum X$ for our formula, we multiply the number of people in each salary category by the salary for that category (e.g., 1 x 20, 2 x 25, etc.), and then total those numbers (the ones in column C). Thus we have:

    $$\bar{X} = \frac{\sum fX}{N} = \frac{1{,}620}{36} = 45$$

    That is, the mean salary is $45K.
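    The column C method can be sketched in Python as follows. The dictionary layout (salary in $K mapped to frequency) and the variable names are assumptions made for illustration; the numbers are those of Table 1.

```python
# Salaries in $K (column A) mapped to frequencies (column B), from Table 1.
table_1 = {20: 1, 25: 2, 30: 3, 35: 4, 40: 5, 45: 6,
           50: 5, 55: 4, 60: 3, 65: 2, 70: 1}

n = sum(table_1.values())                        # total number of people
sum_fx = sum(x * f for x, f in table_1.items())  # total of column C (fX)
mean_salary = sum_fx / n

print(n, sum_fx, mean_salary)  # 36 1620 45.0
```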

    And this is how the distribution of these salaries looks:

    Figure 1. Distribution of Example 1 Salaries

    The scores in this distribution are said to be normally distributed, i.e., clustered around a central value, with decreasing numbers of cases as you move to the extreme ends of the range. Thus the term "normal curve."

    So, computing the mean is pretty simple. Piece of cake, right? Not so fast.

    In our second example, let's look at what happens if we change just six people's salaries in Table 1. Let's suppose that the three people who made $60K actually made $220K, that the two who made $65K made $205K, and that the one person who made $70K made $210K. The revised salary table is the same except for these changes.

    Table 2
    Example 2 of Method for Computing the Mean

    A             B                C
    Salary (X)    Frequency (f)    fX
    $20K          1                20
    $25K          2                50
    $30K          3                90
    $35K          4                140
    $40K          5                200
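    The effect of the six changed salaries on the mean can be worked out directly from the changes described above. The sketch below is an illustration rather than the lesson's own calculation; the helper name `frequency_mean` and the dictionary layout are assumptions.

```python
def frequency_mean(table):
    """Mean of a frequency table that maps each value to its count."""
    return sum(x * f for x, f in table.items()) / sum(table.values())

# Table 1: salary in $K mapped to frequency.
original = {20: 1, 25: 2, 30: 3, 35: 4, 40: 5, 45: 6,
            50: 5, 55: 4, 60: 3, 65: 2, 70: 1}

# Revised table: keep everything below $60K, then replace the top six
# salaries with the values given in the text.
revised = {x: f for x, f in original.items() if x < 60}
revised.update({220: 3, 205: 2, 210: 1})

print(frequency_mean(original))  # 45.0
print(frequency_mean(revised))   # 70.0
```

    Six changed salaries out of 36 are enough to move the mean from $45K to $70K, which is why the mean can mislead when a distribution has a few extreme values.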


    The median is the point in the distribution above which and below which 50% of the scores lie. In other words, if we list the scores in order from highest to lowest (or lowest to highest) and find the middle-most score, that's the median.

    For example, suppose we have the following scores: 2, 12, 4, 11, 3, 7, 10, 5, 9, 6. The next step is to array them in order from lowest to highest.

    2, 3, 4, 5, 6, 7, 9, 10, 11, 12

    Since we have 10 scores, and 50% of 10 is 5, we want the point above which and below which there are five scores. Careful. If you count up from the bottom, you might think the median is 6. But that's not right because there are 4 scores below 6 and 5 above it. So how do we deal with that problem? We deal with it by understanding that in statistics, a measurement or a score is regarded not as a point but as an interval ranging from half a unit below to half a unit above the value. The score 6 therefore occupies the interval from 5.5 to 6.5, and its upper limit, 6.5, has five scores below it and five above it. So in this case, the actual midpoint or median of this distribution, the point above which and below which 50% of the scores lie, is 6.5.
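    A short Python check of this example follows. It is a sketch that uses the common shortcut of averaging the two middle scores when the count is even, which gives the same answer here as the interval reasoning above; `statistics.median` is the standard-library equivalent.

```python
from statistics import median

scores = [2, 12, 4, 11, 3, 7, 10, 5, 9, 6]

ordered = sorted(scores)
middle = len(ordered) // 2

# With an even number of scores, take the point halfway between
# the two middle scores (here 6 and 7).
by_hand = (ordered[middle - 1] + ordered[middle]) / 2

print(ordered)          # [2, 3, 4, 5, 6, 7, 9, 10, 11, 12]
print(by_hand)          # 6.5
print(median(scores))   # 6.5
```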

    As we saw with the mean, when we have only a few numbers, it's pretty simple. But how do we find the median when we have larger numbers and more than one person with the same score? It's not difficult. Let's use the salary data in Table 1.

    Table 3
    Example 1 of Method for Computing the Median

    Salary    Range              Frequency
    $20K      $19.5K-$20.5K      1
    $25K      $24.5K-$25.5K      2
    $30K      $29.5K-$30.5K      3
    $35K      $34.5K-$35.5K      4
    $40K      $39.5K-$40.5K      5
    $45K      $44.5K-$45.5K      6
    $50K      $49.5K-$50.5K      5
    $55K      $54.5K-$55.5K      4
    $60K      $59.5K-$60.5K      3
    $65K      $64.5K-$65.5K      2
    $70K      $69.5K-$70.5K      1
    Sum                          36

    The salaries are already in order from lowest to highest, so the next step in finding the median is to determine how many individuals (ratings, scores, or whatever) we have. Those are shown in the frequency column, and the total is 36. So our N = 36, and we want to find the salary point above which and below which 50%, or 18, of the individuals fall. If we count up from the bottom through the $40K level, we have 15, and we need three more. But if we include the $45K level (in which there are 6), we have 21, three more than we need. Thus, we need 3, or 50%, of the 6 cases in the $45K category. We add this value (.5) to the lower limit of the interval in which we know the median lies ($44.5K-$45.5K), and this gives us a value of $45K.
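    The counting procedure just described can be sketched as a small function. The name `grouped_median`, the `width` parameter, and the dictionary layout are assumptions made for illustration; the logic follows the steps in the text: count up to the interval containing the halfway case, then add the needed fraction of that interval to its lower limit.

```python
def grouped_median(table, width=1.0):
    """Interpolated median for a frequency table mapping value -> count.

    Each value is treated as an interval running from value - width/2
    to value + width/2, as in the Range column of Table 3.
    """
    half = sum(table.values()) / 2   # the halfway case: 18 of 36
    below = 0                        # cases counted so far
    for value in sorted(table):
        f = table[value]
        if below + f >= half:
            lower_limit = value - width / 2
            # Fraction of this interval's cases still needed to reach half.
            return lower_limit + (half - below) / f * width
        below += f
    raise ValueError("empty frequency table")

# Table 1 salaries in $K mapped to frequencies.
table_1 = {20: 1, 25: 2, 30: 3, 35: 4, 40: 5, 45: 6,
           50: 5, 55: 4, 60: 3, 65: 2, 70: 1}

print(grouped_median(table_1))  # 45.0
```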

    In this case, the mean and the median are the same, as they always are in normal distributions. So in situations like this, the mean is the preferred measure.

    But things aren't always so neat and tidy. Let's now compute the median for the salary data in Table 2, which we know (from Figure 2) are not normally distributed.


    deviation on either side of the mean. If we go out to two standard deviations on either side of the mean, we will include 95.44% of the scores; and if we go out three standard deviations, that will encompass 99.74% of the scores; and so on.

    Another way to think about this is to realize that in this distribution, if you have a score that's within one standard deviation of the mean, i.e., between 80 and 120, that's pretty average: two thirds of the people are concentrated in that range. But if you have a score that's two or three standard deviations away from the mean, that is clearly a deviant score, i.e., very high or very low. Only a small percent of the cases lie that far out from the mean.
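    These percentages can be computed rather than looked up. The sketch below assumes an exactly normal distribution and uses the standard identity that the share of cases within k standard deviations of the mean is erf(k/√2). The function name is illustrative, the mean of 100 and standard deviation of 20 are inferred from the 80-to-120 range above, and the computed values may differ from printed-table figures in the last digit.

```python
from math import erf, sqrt

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

mean_score, sd = 100, 20  # inferred from the 80-to-120 range in the text

for k in (1, 2, 3):
    low, high = mean_score - k * sd, mean_score + k * sd
    print(k, low, high, round(100 * share_within(k), 2))
# 1 80 120 68.27
# 2 60 140 95.45
# 3 40 160 99.73
```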

    This is valuable to understand in its own right, and will become useful when we take up determining the significance of difference between means, which we're going to do next in Lesson 3.

    Figure 6. Normal Curve Showing the Percent of Cases Lying Within 1, 2, and 3 Standard Deviations From the Mean

    Testing the Difference Between Means: The t-Test

    This is one of the most important parts of this course in basic statistics. Here we're going to learn about testing the significance of difference between means. What does that mean?

    Suppose you're the superintendent, and one of your principals bursts into your office enthusiastically and says, "I know you'll be happy to learn that after our big effort this year in reading, my third graders improved from 187 to 195 on the state reading test!"

    You immediately ask her, "Is the 8-point difference between those means statistically significant?" When her eyes glaze over and she says, "Huh?" you smile, forbearingly (because you've taken this course in basic statistics, and she hasn't), and you patiently explain to her that simply because there is a numerical difference between last year's and this year's mean scores doesn't mean that there is a real difference. It could be due to chance variation in the scores.

    So how do we know when the difference between two means is probably a real difference, not one due to chance? We have to say "probably" because nothing in statistics is absolutely certain (as is the case with most things in life). But there are statistical tests which can tell us how likely it is that a difference between two means is due to chance.

    One of the most widely used statistical methods for testing the difference between means, and the one we're going to get you up to speed on, is called the t-test.

    Let's go back to the salary data we worked with in Table 1 of Lesson 1, but now let's compare the mean salary of that group with another group, and ask whether the mean salaries of the two groups are significantly different.


    First, let's look at the formula for the t-test, and determine what we need to make the computation:

    Where:

    $\bar{X}_1$ is the mean for Group 1.

    $\bar{X}_2$ is the mean for Group 2.

    $n_1$ is the number of people in Group 1.

    $n_2$ is the number of people in Group 2.

    $s_1^2$ is the variance for Group 1.

    $s_2^2$ is the variance for Group 2.

    The only thing in this formula you're not familiar with is the symbol $s^2$, which stands for the variance. The variance is the same as the standard deviation without the square root, i.e., it's nothing more than the sum of the squared deviations of all the scores from the mean divided by n - 1.
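    As an illustration of that definition, the sketch below computes the variance and standard deviation of the Table 1 salaries from Lesson 1. The dictionary layout and variable names are assumptions; the calculation is the sum of squared deviations from the mean, weighted by frequency, divided by n - 1.

```python
from math import sqrt

# Table 1 of Lesson 1: salary in $K mapped to frequency.
table_1 = {20: 1, 25: 2, 30: 3, 35: 4, 40: 5, 45: 6,
           50: 5, 55: 4, 60: 3, 65: 2, 70: 1}

n = sum(table_1.values())
mean_salary = sum(x * f for x, f in table_1.items()) / n

# Sum of squared deviations from the mean, each weighted by its frequency.
sum_fd2 = sum(f * (x - mean_salary) ** 2 for x, f in table_1.items())

variance = sum_fd2 / (n - 1)
standard_deviation = sqrt(variance)

print(sum_fd2, variance, round(standard_deviation, 3))  # 5250.0 150.0 12.247
```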

    The formula above is for testing the significance of difference between two independent samples, i.e., groups of different people. If we wanted to test the difference between, say, the pre-test and post-test means of the same group of people, we would use a different formula for dependent samples. That formula is:

    Where:

    $\sum D$ is the sum of all the individuals' pre-post score differences.

    $\sum D^2$ is the sum of all the individuals' pre-post score differences squared.

    $N$ is the number of paired observations.

    But for now, we'll test the significance of difference between the mean salary of two different groups. You can try the one for dependent samples on your own. (I knew you'd welcome that opportunity.)

    Tables 8 and 9 provide the numbers we need to compute the t-test for the difference in mean salaries of the two groups.

    Table 8
    Salaries and t-Test Calculation Data for Group 1

    A             B                C               D      E
    Salary (X)    Frequency (f)    X - Mean (d)    fd     fd²
    20            1                25              25     625
    25            2                20              40     800
    30            3                15              45     675
    35            4                10              40     400
    40            5                5               25     125



    In order to use this table, we enter it with our t value (.222) and something called "degrees of freedom." The degrees of freedom is simply $(n_1 - 1) + (n_2 - 1)$ or, in our case, 70. Note that there are two columns of t values, one labeled .05, and the other labeled .01. If we go down to the degrees of freedom nearest to ours, which would be 70, we find that both the .05 and the .01 t values are substantially larger than our .222. So we didn't achieve a large enough t value to reject the null hypothesis, i.e., to be able to conclude that the difference wasn't due to chance.

    Why do we have two columns, one labeled .05 and the other .01? Because those are the two levels of significance commonly used in statistical analysis. The t values in the .05 column are likely to occur by chance 5 percent of the time, whereas the t values in the .01 column are likely to occur by chance only 1 percent of the time.
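    The table lookup can be sketched in code. Several things here are assumptions and not taken from the lesson: the pooled-variance form of the independent-samples t formula (chosen because it matches the degrees of freedom given above), the Group 2 summary figures (hypothetical, so the t value will not reproduce the .222 in the text), and the two-tailed critical values for 70 degrees of freedom, which are approximate figures from a standard t table.

```python
from math import sqrt

def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Independent-samples t using a pooled variance; returns (t, df)."""
    df = (n1 - 1) + (n2 - 1)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df

# Group 1 figures follow from Table 1 of Lesson 1 (mean 45, variance 150, n 36).
# Group 2 figures are hypothetical, for illustration only.
t, df = pooled_t(45.0, 150.0, 36, 44.0, 160.0, 36)

# Approximate two-tailed critical t values for 70 degrees of freedom.
CRITICAL_T = {0.05: 1.994, 0.01: 2.648}

print(df, round(t, 3))
for level, critical in CRITICAL_T.items():
    verdict = "reject" if abs(t) > critical else "do not reject"
    print(level, verdict, "the null hypothesis")
```

    With these illustrative inputs the t value falls well short of both critical values, so, as in the text, the null hypothesis is not rejected at either level.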

    Type I and Type II Errors

    The choice of what significance level to use (.05, .01, or lower or higher) is the difficult choice that you as the researcher must make. If you decide to accept the .05 level of significance, which requires a smaller t value, you can more easily reject the null hypothesis and declare that there is a statistically significant difference between the means than if you select the .01 level, but when there is in fact no real difference you will be wrong 5 percent of the time. This is the Type I error.

    On the other hand, if you select the .01 value, you will be wrong only 1 percent of the time. But since the .01 value requires a larger t value, you will less often be able to reject the null hypothesis and say that there is a statistically significant difference between the means when in fact that is the case. This is the Type II error. It is crucially important to an understanding of even basic statistics that we have a clear understanding of these two errors. If you spend a little time with Table 11, it will help you achieve this understanding.

    Table 11. Accepting and Rejecting Null Hypotheses and the Making of Type I and Type II Errors*

    A. The null hypothesis is really true, i.e., there is not a real difference between the means of the two groups.

        1. Decision: Accept the null hypothesis. You accepted the null hypothesis when it is true, i.e., you concluded that there is not a real difference between the means of the two groups which, in fact, is the case. That was a good decision.

        2. Decision: Reject the null hypothesis. You rejected the null hypothesis when it is true, i.e., you concluded that there is a real difference between the means of the two groups when, in fact, there is not a difference. That was a bad decision. You made the Type I error.

    B. The null hypothesis is really false, i.e., there is a real difference between the means of the two groups.

        3. Decision: Accept the null hypothesis. You accepted the null hypothesis when it is false, i.e., you concluded that there is not a real difference between the means of the two groups when in fact there is a real difference. That was a bad decision. You made the Type II error.

        4. Decision: Reject the null hypothesis. You rejected the null hypothesis when it is false, i.e., you concluded that there is a real difference between the means of the two groups which, in fact, is the case. That was a good decision.
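    Cell 2 of the table can be made concrete with a small simulation. This is a sketch under stated assumptions, not part of the lesson: both groups are drawn from the same normal population (so the null hypothesis is true by construction), the population mean of 45 and standard deviation of 12 are illustrative, the pooled-variance t formula is assumed, and 1.994 is the approximate two-tailed .05 critical value for 70 degrees of freedom. The share of trials that reject the null hypothesis should come out near 5 percent.

```python
import random
from math import sqrt

def pooled_t(a, b):
    """Independent-samples t statistic with pooled variance."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - m1) ** 2 for x in a)
    ss2 = sum((x - m2) ** 2 for x in b)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var * (1 / n1 + 1 / n2))

random.seed(0)
CRITICAL_T_05 = 1.994  # approximate two-tailed .05 value for 70 df
TRIALS = 2000

false_alarms = 0
for _ in range(TRIALS):
    # Same population for both groups: any "significant" result is a Type I error.
    group_1 = [random.gauss(45, 12) for _ in range(36)]
    group_2 = [random.gauss(45, 12) for _ in range(36)]
    if abs(pooled_t(group_1, group_2)) > CRITICAL_T_05:
        false_alarms += 1

type_1_rate = false_alarms / TRIALS
print(type_1_rate)  # expected to land close to 0.05
```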


    X            Y       X²        Y²        XY
    115          112     13225     12544     12880
    110          108     12100     11664     11880
    Total (Σ)
    4381         4227    664583    616805    639987

    So, we plug the numbers from this table into the formula, and do the math:

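    The worked steps for this calculation are not recoverable here, but the totals are. The sketch below assumes the usual raw-score form of the Pearson correlation, r = (NΣXY - ΣXΣY) / sqrt((NΣX² - (ΣX)²)(NΣY² - (ΣY)²)), which is the form that uses exactly the five totals in the table, and takes N = 30 from the sample size mentioned in the text. Variable names are illustrative.

```python
from math import sqrt

n = 30  # sample size stated in the text
sum_x, sum_y = 4381, 4227
sum_x2, sum_y2, sum_xy = 664583, 616805, 639987

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 3))  # 0.989
```

    The result is about .99, in line with the roughly .98 correlation the text goes on to discuss.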

    In this case, the correlation between reading and math scores is remarkably high (because I concocted the numbers so it would turn out that way). With real scores, it would be high, but not that high. If you glance over the numbers in Table 12, even before we've computed the correlation you can easily see (in this small sample of 30) that high scores in reading tend to go with high scores in math, low reading scores tend to go with low math scores, and so on. But, of course, you wouldn't be able to see that pattern if you had a sample of 500.

    Positive and Negative Correlations

    I pointed out above that a correlation can vary from +1.00 to -1.00. The correlation we just computed is a positive correlation. That is, high reading scores go with high math scores, low with low, and so on. However, we could have a negative correlation. This is not something bad; it simply denotes an association in which high scores on one variable go with low scores on the other. For example, if we were computing a correlation between, say, amount of time students watch television and their achievement score, we would find a negative correlation: high TV watching is associated with lower achievement scores, and vice versa. Such a correlation might be something like -.71.

    Determining Statistical Significance

    OK, so we have a correlation coefficient. What precisely does it mean, and how do we interpret it? It's not a percent, as many people mistakenly think.

    First, we can determine its statistical significance in the same way we did with the t-test. We can look it up in a table in the appendices of any statistical text. In the case of our .98 correlation between reading and math scores, if we look that up in the table for correlations, we find that the value needed to reject the null hypothesis at the .01 level of significance (and declare that the correlation is statistically significant, or unlikely due to chance) for our sample of 30 is .45 (in this case using the one-tailed test because the samples are dependent).


