  • IE241 Introduction to Mathematical Statistics

  • Topics

    Probability: a priori definition; set theory; axiomatic definition; marginal probability; conditional probability; independent events; Bayes formula

    Discrete sample spaces: permutations; combinations

    Statistical distributions: random variables; the binomial distribution

    Moments: the moment generating function

    Other discrete distributions: Poisson; hypergeometric; negative binomial

    Continuous distributions: normal; normal approximation to the binomial; uniform (rectangular); gamma; beta; log normal

    Cumulative distributions: normal cdf; binomial cdf

    Empirical distributions: random sampling; estimates of the mean and variance; degrees of freedom; the KAIST sample; percentiles and quartiles

    Sampling distributions: of the mean; the Central Limit Theorem

    Confidence intervals: for the mean; Student's t; for the variance; the chi-square distribution

    Coefficient of variation

    Properties of estimators: unbiased; consistent; minimum variance unbiased; maximum likelihood

    Statistical Process Control

    Linear functions of random variables

    Multivariate distributions: bivariate normal; correlation coefficient; covariance

    Regression functions: method of least squares; multiple regression

    General multivariate normal; multinomial; marginal distributions; conditional distributions

  • Statistics is the discipline that permits you to make decisions in the face of uncertainty. Probability, a division of mathematics, is the theory of uncertainty. Statistics is based on probability theory, but is not strictly a division of mathematics. However, in order to understand statistical theory and procedures, you must have an understanding of the basics of probability.

  • Probability arose in the 17th century out of the study of games of chance. Its definition at the time was an a priori one:

    If there are n mutually exclusive, equally likely outcomes and if nA of these outcomes have attribute A, then the probability of A is nA/n.

  • This definition of probability seems reasonable for certain situations. For example, if one wants the probability of a diamond in a selection from a card deck, then A = diamond, nA = 13, n = 52, and the probability of a diamond = 13/52 = 1/4. As another example, consider the probability of an even number on one roll of a die. In this case, A = even number on roll, n = 6, nA = 3, and the probability of an even number = 3/6 = 1/2. As a third example, you are interested in the probability of a particular card, say the jack of spades, on one draw from a card deck. Then A = jack of spades, n = 52, and nA = 1, so the probability = 1/52.
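  • These a priori probabilities can be checked by direct enumeration. A minimal sketch in Python; since the slide's third event has nA = 1, it is assumed here to be one particular card, the jack of spades.

```python
# Counting equally likely outcomes with attribute A and dividing by n.
from fractions import Fraction

ranks = "A23456789TJQK"
suits = "SHDC"  # spades, hearts, diamonds, clubs
deck = [r + s for r in ranks for s in suits]

# P(diamond) = nA / n
p_diamond = Fraction(sum(card[1] == "D" for card in deck), len(deck))
print(p_diamond)  # 1/4

# P(even number on one roll of a die)
die = [1, 2, 3, 4, 5, 6]
p_even = Fraction(sum(x % 2 == 0 for x in die), len(die))
print(p_even)  # 1/2

# P(one particular card, assumed to be the jack of spades)
p_js = Fraction(sum(card == "JS" for card in deck), len(deck))
print(p_js)  # 1/52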

  • The conditions of equally likely and mutually exclusive are critical to this a priori approach. For example, suppose you want the probability of the event A, where A is either a king or a spade drawn at random from a new deck. Now when you figure the number of ways you can achieve the event A, you count 13 spades and 4 kings, which seems to give nA = 17, for a probability of 17/52.

    But one of the kings is a spade, so kings and spades are not mutually exclusive. This means that you are double counting. The correct answer is nA = 16, for a probability of 16/52.

  • As another example, suppose the event A is 2 heads in two tosses of a fair coin. Now the outcomes are 2H, 2T, or 1 of each. This would seem to give a probability of 1/3. But the last outcome really has twice the probability of each of the others because the right way to list the outcomes is: HH, TT, HT, TH. Now we see that 1 head and 1 tail can occur in either of two ways and the correct probability of 2H is 1/4.

  • But there are some problems with the a priori approach.

    Suppose you want the probability that a positive integer drawn at random is even. You might assume that it would be 1/2, but since there are infinitely many integers and they need not be ordered in any given way, there is no way to prove that the probability of an even integer = 1/2.

    The integers can even be ordered so that the ratio of evens to odds oscillates and never approaches any definite value as n increases.
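  • The sketch below illustrates this with one such reordering: emit odds until the running fraction of evens drops to 1/4, then evens until it climbs to 3/4, and repeat, so the ratio swings forever and never converges.

```python
# A sketch of one reordering of the positive integers under which the
# running fraction of evens oscillates forever between 1/4 and 3/4,
# so "P(even)" has no limiting value under this ordering.
odds = iter(range(1, 10**18, 2))
evens = iter(range(2, 10**18, 2))

n = n_even = 0
emit_evens = False
switches = []
while n < 1_000_000:
    n += 1
    if emit_evens:
        next(evens)
        n_even += 1
        if n_even / n >= 0.75:      # swung up to 3/4: switch back to odds
            emit_evens = False
            switches.append((n, n_even / n))
    else:
        next(odds)
        if n_even / n <= 0.25:      # swung down to 1/4: switch to evens
            emit_evens = True
            switches.append((n, n_even / n))

print(switches[:8])  # the ratio hits 0.25 and 0.75 alternately at ever larger n
```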

  • Besides the difficulty of an infinite number of possible outcomes, there is also another problem with the a priori definition. Suppose the outcomes are not equally likely.

    As an example, suppose that a coin is biased in favor of heads. Now it is clearly not correct to say that the probability of a head = the probability of a tail = 1/2 in a given toss of a coin.

  • Because of these difficulties, another definition of probability arose which is based on set theory. Imagine a conceptual experiment that can be repeated under similar conditions. Each outcome of the experiment is called a sample point s. The totality of all sample points resulting from this experiment is called a sample space S. An example is two tosses of a coin. In this case, there are four sample points in S: (H,H), (H,T), (T,H), (T,T).

  • Some definitions

    If s is an element of S, then s ∈ S.

    Two sets are equal if every element of one is also an element of the other.

    If every element of S1 is an element of S, but not conversely, then S1 is a subset of S, denoted S1 ⊂ S.

    The universal set is S where all other sets are subsets of S.

  • More definitions

    The complement of a set A with respect to the sample space S is the set of points in S but not in A. It is usually denoted Ā.

    If a set contains no sample points, it is called the null set, ∅.

    If S1 and S2 are two subsets of S, then all sample points in S1 or S2 or both are called the union of S1 and S2, which is denoted S1 ∪ S2.

  • More definitions

    If S1 and S2 are two subsets of S, then the event consisting of points in both S1 and S2 is called the intersection of S1 and S2, which is denoted S1 ∩ S2.

    S is called a continuous sample space if S contains a continuum of points.

    S is called a discrete sample space if S contains a discrete number of points or a countable infinity of points which can be put in one-to-one correspondence with the positive integers.

  • Now we can proceed with the axiomatic definition of probability. Let S be a sample space where A is an event in S. Then P is a probability function on S if the following three axioms are satisfied:

    Axiom 1. P(A) is a real nonnegative number for every event A in S.

    Axiom 2. P(S) = 1.

    Axiom 3. If S1, S2, …, Sn is a sequence of mutually exclusive events in S, that is, if Si ∩ Sj = ∅ for all i, j where i ≠ j, then P(S1 ∪ S2 ∪ … ∪ Sn) = P(S1) + P(S2) + … + P(Sn)

  • Some theorems that follow from this definition

    If A is an event in S, then the probability that A does not happen = 1- P(A).

    If A is an event in S, then 0 ≤ P(A) ≤ 1.

    P(∅) = 0.

    If A and B are any two events in S, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B), where A ∩ B represents the joint occurrence of both A and B. P(A ∩ B) is also called P(A,B).

  • As an illustration of this last theorem: in S there are many points, but the events A and B overlap. If we didn't subtract the P(A ∩ B) portion, we would be counting it twice for P(A ∪ B).

  • Marginal probability is the term used when one or more criteria of classification are ignored.

    Let's say we have a sample of 60 people, each of whom is either male or female and also either rich, middle-class, or poor.

  • In this case, we have the cross-tabulation of gender and financial status: of the 60 people, 34 are male and 48 are middle-class, each count ignoring the other classification.

    The marginal probability of male is 34/60 and the marginal probability of middle-class is 48/60.
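  • The slide's cross-tabulation is not reproduced in this transcript, so the inner cell counts in the sketch below are hypothetical, chosen only so the margins match the stated 34/60 and 48/60; summing across the ignored classification gives the marginals.

```python
# Marginal probabilities from a hypothetical cross-tab whose margins
# match the slide: 34 of 60 male, 48 of 60 middle-class.
from fractions import Fraction

#                 rich  middle  poor
table = {"male":   (4,   27,    3),    # row sums to 34
         "female": (5,   21,    0)}    # row sums to 26

n = sum(sum(row) for row in table.values())                    # 60
p_male = Fraction(sum(table["male"]), n)                       # 34/60
p_middle = Fraction(sum(row[1] for row in table.values()), n)  # 48/60
print(p_male, p_middle)  # 17/30 4/5 (i.e., 34/60 and 48/60 reduced)
```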

  • More theorems

    If A and B are two events in S such that P(B) > 0, the conditional probability of A given that B has happened is P(A|B) = P(A ∩ B) / P(B).

    Then it follows that the joint probability P(A ∩ B) = P(A|B) P(B).

  • More theorems

    If A and B are two events in S, A and B are independent of one another if any of the following is satisfied:
    P(A|B) = P(A)
    P(B|A) = P(B)
    P(A ∩ B) = P(A) P(B)

  • P(A ∪ B) is the probability that either the event A or the event B happens. When we talk about either/or situations, we are always adding probabilities. P(A ∪ B) = P(A) + P(B) − P(A,B)

    P(A ∩ B) or P(A,B) is the probability that both the event A and the event B happen. When we talk about both/and situations, we are always multiplying probabilities. P(A ∩ B) = P(A) P(B) if A and B are independent, and P(A ∩ B) = P(A|B) P(B) if A and B are not independent.

  • As an example of conditional probability, consider an urn with 6 red balls and 4 black balls. If two balls are drawn without replacement, what is the probability that the second ball is red if we know that the first was red? Let B be the event that the first ball is red and A the event that the second ball is red. P(A ∩ B) is the probability that both balls are red. There are 10C2 = 45 ways of drawing two balls and 6C2 = 15 ways of getting two red balls.

    So P(A ∩ B) = 15/45 = 1/3. P(B), the probability that the first ball is red, is 6/10 = 3/5. Therefore, P(A|B) = (1/3) / (3/5) = 5/9.

  • This probability could be computed from the sample space directly because once the first red ball has been drawn, there remain only 5 red balls and 4 black balls. So the probability of drawing red the second time is 5/9. The idea of conditional probability is to reduce the total sample space to that portion of the sample space in which the given event has happened. All possible probabilities computed in this reduced sample space must sum to 1. So the probability of drawing black the second time = 4/9.
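  • The reduced-sample-space reasoning can be checked by brute-force enumeration of the ordered draws:

```python
# Enumeration check of the urn example: 6 red, 4 black, two draws
# without replacement, treating the draws as ordered pairs.
from fractions import Fraction
from itertools import permutations

balls = ["R"] * 6 + ["B"] * 4
draws = list(permutations(range(10), 2))          # 90 ordered pairs

both_red = sum(balls[i] == "R" and balls[j] == "R" for i, j in draws)
first_red = sum(balls[i] == "R" for i, j in draws)

p_both = Fraction(both_red, len(draws))           # P(A and B) = 1/3
p_first = Fraction(first_red, len(draws))         # P(B) = 3/5
print(p_both, p_first, p_both / p_first)          # 1/3 3/5 5/9
```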

  • Another example involves a test for detecting cancer which has been developed and is being tested in a large hospital. It was found that 98% of cancer patients reacted positively to the test, while only 4% of non-cancer patients reacted positively.

    If 3% of the patients in the hospital have cancer, what is the probability that a patient selected at random from the hospital who reacts positively will have cancer?

  • Given: P(reaction | cancer) = .98
    P(reaction | no cancer) = .04
    P(cancer) = .03
    P(no cancer) = .97

    Needed: P(cancer | reaction)

  • P(r & c) = P(r|c) P(c) = (.98)(.03) = .0294
    P(r & nc) = P(r|nc) P(nc) = (.04)(.97) = .0388
    P(r) = P(r & c) + P(r & nc) = .0294 + .0388 = .0682

  • Now we have the information we need to solve the problem: P(c | r) = P(r & c) / P(r) = .0294 / .0682 ≈ .43.

  • Conditional probability led to the development of Bayes formula, which is used to determine the likelihood of a hypothesis Hi, given an outcome D:

    P(Hi | D) = P(D | Hi) P(Hi) / Σj P(D | Hj) P(Hj)

    This formula gives the likelihood of Hi given the data D you actually got versus the total likelihood of every hypothesis given the data you got. So the Bayes strategy is a likelihood ratio test.

    Bayes formula is one way of dealing with questions like the last one: if we find a reaction, what is the probability that it was caused by cancer?

  • Now let's cast Bayes formula in the context of our cancer situation, where there are two possible hypotheses that might cause the reaction, cancer and other:

    P(cancer | r) = (.98)(.03) / [(.98)(.03) + (.04)(.97)] = .0294 / .0682 ≈ .43

    which confirms what we got with the classic conditional probability approach.
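  • Both routes can be verified numerically; a minimal sketch:

```python
# The cancer-test computation written both ways: the conditional
# probability route and Bayes' formula. They agree, as the slide notes.
p_r_given_c, p_r_given_nc = 0.98, 0.04
p_c, p_nc = 0.03, 0.97

# conditional probability route
p_r = p_r_given_c * p_c + p_r_given_nc * p_nc      # 0.0682
p_c_given_r = (p_r_given_c * p_c) / p_r            # 0.0294 / 0.0682

# Bayes' formula with two hypotheses, cancer and "other"
posterior = (p_r_given_c * p_c) / (p_r_given_c * p_c + p_r_given_nc * p_nc)

print(round(p_c_given_r, 3), round(posterior, 3))  # 0.431 0.431
```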

  • Consider another simple example where there are two identical boxes. Box 1 contains 2 red balls and box 2 contains 1 red ball and 1 white ball. Now a box is selected by chance and 1 ball is drawn from it, which turns out to be red. What is the probability that Box 1 was the one that was selected?

    Using conditional probability, we would find

    P(Box1 | R) = P(Box1, R) / P(R)

    and get the numerator by

    P(Box1, R) = P(Box1) P(R | Box1) = (1/2)(1) = 1/2. Then we get the denominator by P(R) = P(Box1, R) + P(Box2, R) = 1/2 + 1/4 = 3/4.

  • Putting these in the formula, P(Box1 | R) = (1/2) / (3/4) = 2/3.

    If we use the sample space method, we have four equally likely outcomes: B1R1, B1R2, B2R, B2W. The condition R restricts the sample space to the first three of these, each with probability 1/3. Then P(Box1 | R) = 2/3.

  • Now let's try it with Bayes formula. There are only two hypotheses here, so H1 = Box1 and H2 = Box2. The data, of course, = R. So we can find

    P(Box1 | R) = (1/2)(1) / [(1/2)(1) + (1/2)(1/2)] = 2/3

    And we can find

    P(Box2 | R) = (1/2)(1/2) / [(1/2)(1) + (1/2)(1/2)] = 1/3

    So we can see that the odds of the data favoring Box1 to Box2 are 2:1.
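  • As a sanity check, the 2/3 can also be reproduced by simulation (a Monte Carlo sketch, not part of the original slides):

```python
# Pick a box at random, draw a ball, and among the draws that came up
# red, see how often Box 1 was the source. The frequency should
# approach P(Box1 | R) = 2/3.
import random

random.seed(0)
boxes = {1: ["R", "R"], 2: ["R", "W"]}
red_draws = box1_given_red = 0

for _ in range(100_000):
    box = random.choice([1, 2])
    ball = random.choice(boxes[box])
    if ball == "R":
        red_draws += 1
        box1_given_red += box == 1

print(box1_given_red / red_draws)  # ~0.667
```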

  • Discrete sample spaces with a finite number of points

    Let s1, s2, s3, …, sn be n sample points in S which are equally likely. Then P(s1) = P(s2) = P(s3) = … = P(sn) = 1/n. If nA of these sample points are in the event A, then P(A) = nA/n, which is the same as the a priori definition. Clearly this definition satisfies the axiomatic conditions because the sample points are mutually exclusive and equally likely.

  • Now we need to know how many arrangements of a set of objects there are. Take as an example the number of arrangements of the three letters a, b, c.

    In this case, the answer is easy: abc, acb, bac, bca, cab, cba. But if the number of arrangements were much larger, it would be nice to have a formula that figures out how many for us. This formula is the number of arrangements or permutations of N things = N!.

    Now we can find the number of permutations of N things if we take only x of them at a time. This formula is NPx = N! / (N-x)!

  • Next we want to know how many combinations of a set of N objects there are if we take x of them at a time. This is different from the number of permutations because we don't care about the ordering of the objects, so abc and cab count as one combination though they represent two permutations. The formula for the number of combinations of N things taking x at a time is NCx = N! / (x!(N−x)!).

  • How many pairs of cards can be drawn from a deck, where we don't care about the order in which they are drawn? The solution is 52C2 = 1326

    ways that two cards can be drawn. Now suppose we want to know the probability that both cards will be spades. Since there are 13 spades in the deck and we are drawing 2 cards, the number of ways that 2 spades can be drawn from the 13 available is 13C2 = 78. So the probability that two spades will be drawn is 78/1326.
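  • Python's math module has these counts built in (math.perm and math.comb), which makes the arithmetic easy to verify:

```python
# Permutation and combination counts for the examples above.
import math

print(math.perm(3))        # 6 arrangements of a, b, c
print(math.comb(52, 2))    # 1326 ways to draw a pair of cards
print(math.comb(13, 2))    # 78 ways to draw two spades
print(math.comb(13, 2) / math.comb(52, 2))  # ~0.0588, i.e., 78/1326
```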

  • Statistical Distributions

    Now we begin the study of statistical distributions. If there is a distribution, then something must be being distributed. This something is a random variable. You are familiar with variables in functions like a linear form: y = ax + b. In this case, a and b are constants for any given linear function and x and y are variables. In the equation for the circumference of a circle, we have C = πd, where C and d are variables and π is a constant.

  • A random variable is different from a mathematical variable because it has a probability function associated with it.

    More precisely, a random variable is a real-valued function defined on a probability space, where the function transforms points of S into values on the real axis.

  • For example, the number of heads in two tosses of a fair coin can be transformed as: X(TT) = 0, X(HT) = X(TH) = 1, X(HH) = 2. Now X(s) is real-valued and can be used in a distribution function.

  • Because a probability is associated with each element in S, this probability is now associated with each corresponding value of the random variable. There are two kinds of random variables: discrete and continuous.

    A random variable is discrete if it assumes only a finite (or denumerable) number of values. A random variable is continuous if it assumes a continuum of values.

  • We begin with discrete random variables. Consider a random experiment where four fair coins are tossed and the number of heads is recorded. In this case, the random variable X takes on the five values: 0, 1, 2, 3, 4. The probability associated with each value of the random variable X is called its probability function p(X) or probability mass function, because the probability is massed at each of a discrete number of points.

  • One of the most frequently used discrete distributions in applications of statistics is the binomial. The binomial distribution is used for n repeated trials of a given experiment, such as tossing a coin. In this case, the random variable X has the probability function: P(x) = nCx p^x q^(n−x), where p + q = 1 and x = 0, 1, 2, 3, …, n

  • In one toss of a coin, this reduces to p^x q^(1−x) and is called the point binomial or Bernoulli distribution. p = the probability that an event will occur and, of course, q = the probability that it will not occur. p and n are called parameters of this family of distributions. Each time either p or n changes, we have a new member of the binomial family of distributions, just as each time a or b changed in the linear function we had a new member of the family of linear functions.

    The binomial distribution for 10 tosses of a fair coin is shown below. The actual values are shown in the accompanying table. Note the symmetry of the distribution. This always happens when p = .5.

    [Chart: Binomial distribution for 10 tosses of a fair coin; x-axis: number of heads, y-axis: P(x)]

    x:    0      1      2      3      4      5      6      7      8      9      10
    P(x): .0010  .0098  .0439  .1172  .2051  .2461  .2051  .1172  .0439  .0098  .0010
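  • The table can be reproduced directly from the formula; a minimal sketch:

```python
# P(x) = C(n, x) p^x q^(n-x) with n = 10 and p = 0.5.
import math

n, p = 10, 0.5
for x in range(n + 1):
    px = math.comb(n, x) * p**x * (1 - p)**(n - x)
    print(x, px)
# x = 5 has the largest probability, 0.24609375, and the values are
# symmetric about it, as the chart shows.
```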

  • The probability of 5 heads is highest so 5 is called the mode of x. The mode of any distribution is its most frequently occurring value. The mode is a measure of central tendency.

    5 is also the mean of X, which in general for the binomial = np. The mean of any distribution is the most important measure of central tendency. It is the measure of location on the x-axis.

  • Every distribution has a set of moments. Moments for theoretical distributions are expected values of powers of the random variable. The rth moment is E(X − a)^r, where E is the expectation operator and a is an origin.

    The expected value of a random variable is defined as E(X) = μ, where μ is Greek because it is the theoretical mean or average of the random variable.

    μ is the first moment about 0.

  • The second moment is about μ itself, E(X − μ)², and is called the variance σ² of the random variable.

    The third moment E(X − μ)³ is also about μ and is a measure of skewness or non-symmetry of the distribution.

  • The mean of the distribution is a measure of its location on the x axis. The mean is the only point such that the sum of the deviations from it = 0. The mean is the most important measure of centrality of the distribution.

    The variance is a measure of the spread of the distribution or the extent of its variability. The mean and variance are the two most important moments.

  • Every distribution has a moment generating function (mgf), which for a discrete distribution is

    Mx(θ) = E(e^(θX)) = Σ e^(θx) p(x), summed over all values x.

  • The way this works is

    Assume that p(x) is a function such that the series above converges. Then

    Mx(θ) = Σ e^(θx) p(x) = Σ [1 + θx + (θx)²/2! + (θx)³/3! + …] p(x) = 1 + θE(X) + (θ²/2!)E(X²) + (θ³/3!)E(X³) + …

  • In this expression, the coefficient of θ^k/k! is the kth moment about the origin. To evaluate a particular moment, it may be convenient to compute the proper derivative of Mx(θ) at θ = 0, since repeated differentiation of this moment generating function will show that the kth derivative of Mx(θ) at θ = 0 is the kth moment about the origin, E(X^k).

  • From the mgf, we can find the first moment about the origin (a = 0), which is the mean. The mean of the binomial = np.

    We can also find the second moment about the mean (a = μ), the variance. The variance of the binomial = npq.

    The mgf enables us to find all the moments of a distribution.
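  • As an illustration of the machinery (a sketch using sympy, with θ written as t): the binomial mgf has the standard closed form (q + pe^t)^n, which is not derived on these slides, and differentiating it at 0 recovers np and npq.

```python
# Deriving the binomial mean and variance from its mgf with sympy.
import sympy as sp

t, n, p = sp.symbols("t n p", positive=True)
q = 1 - p
M = (q + p * sp.exp(t)) ** n          # binomial moment generating function

m1 = sp.diff(M, t, 1).subs(t, 0)      # first moment about 0: E(X)
m2 = sp.diff(M, t, 2).subs(t, 0)      # second moment about 0: E(X^2)

print(sp.simplify(m1))                # n*p
print(sp.expand(m2 - m1**2))          # n*p - n*p**2, i.e., np(1 - p) = npq
```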

  • Now suppose in our binomial we change p to .7. Then a different binomial distribution function results, as shown in the next graph and the table of data accompanying it. This makes sense because with a probability of .7 that you will get heads, you should see more heads.

    [Chart: Binomial distribution for 10 tosses of a coin with p = .7; x-axis: number of heads, y-axis: P(x)]

    x:    0       1       2       3       4       5       6       7       8       9       10
    P(x): .00001  .00014  .00145  .00900  .03676  .10292  .20012  .26683  .23347  .12106  .02825

  • This distribution is called a skewed distribution because it is not symmetric.

    Skewing can be in either the positive or the negative direction. The skew is named by the direction of the long tail of the distribution. The measure of skew is the third moment about the mean μ.

    So the binomial with p = .7 is negatively skewed.

  • The mean of this binomial = np = 10(.7) = 7. So you will expect more heads when the probability of heads is greater than that of tails.

    The variance of this binomial is npq =10(.7)(.3) = 2.1.

  • Another discrete distribution that comes in handy when p is very small is the Poisson distribution. Its distribution function is

    p(x) = λ^x e^(−λ) / x!,  x = 0, 1, 2, …

    where λ > 0

    In the Poisson distribution, the parameter is λ, which is the mean value of x in this distribution.

  • The Poisson distribution is an approximation to the binomial distribution when n is large and p is small, with λ = np moderate. Because it does not involve n, it is particularly useful when n is unknown.

    As an example of the Poisson, assume that a volume V of some fluid contains a large number n of some very small organisms. These organisms have no social instincts and therefore are just as likely to appear in one part of the liquid as in any other part.

    Now take a drop D of the liquid to examine under a microscope. Then the probability that any one of the organisms appears in D is D/V.

  • The probability that x of them are in D is

    p(x) = nCx (D/V)^x (1 − D/V)^(n−x)

    The Poisson is an approximation to this expression, which is simply a binomial in which p = D/V is very small. The above binomial can be transformed to the Poisson

    p(x) = λ^x e^(−λ) / x!

    where d = n/V is the density of organisms and λ = Dd.
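  • A numerical comparison (with assumed values n = 1000 and p = .003, so λ = 3) shows how close the approximation is:

```python
# Exact binomial probabilities versus the Poisson approximation when
# n is large and p is small.
import math

n, p = 1000, 0.003
lam = n * p
for x in range(6):
    binom = math.comb(n, x) * p**x * (1 - p)**(n - x)
    poisson = lam**x * math.exp(-lam) / math.factorial(x)
    print(x, round(binom, 5), round(poisson, 5))
# the two columns agree to about three decimal places
```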

  • Another discrete distribution is the hypergeometric distribution, which is used when there is no replacement after each experiment. Because there is no replacement, the value of p changes from one trial to the next. In the binomial, p is always constant from trial to trial.

  • Suppose that 20 applicants appear for a job interview and only 5 will be selected. The value of p for the first selection is 1/20.

    After the first applicant is selected, p changes from 1/20 to 1/19 because the one selected is not thrown back in to be selected again.

    For the 5th selection, p has moved to 1/16, which is quite different from the original 1/20.

  • Now if there had been 1000 applicants and only 2 were going to be selected, p would change from 1/1000 to 1/999, which is not enough of a change to be important.

    So the binomial could be used here with little error arising from the assumptions that the trials are independent and p is constant.

  • The hypergeometric distribution is

    p(x) = [kCx][(N−k)C(n−x)] / [NCn]

    where N is the number of items in the population, k of them are of the type of interest, and n are drawn without replacement.
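  • A sketch of the pmf written out with binomial coefficients; as a check, it reproduces the 1/3 from the earlier urn example (6 red and 4 black balls, 2 drawn):

```python
# Hypergeometric pmf: N items, k "successes", n drawn without
# replacement, x successes observed.
import math

def hypergeom_pmf(x, N, k, n):
    return math.comb(k, x) * math.comb(N - k, n - x) / math.comb(N, n)

print(hypergeom_pmf(2, N=10, k=6, n=2))  # 0.333..., i.e., 15/45
```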

  • Another discrete distribution is the negative binomial. The negative binomial distribution is used for the question: on which trial will the first (and later) successes come?

    Let p be the probability of success and let p(x) be the probability that exactly x + r trials will be needed to produce r successes.

  • The negative binomial is: p(x) = (x+r−1 C r−1) p^r q^x

    where x = 0, 1, 2, … and p + q = 1. Notice that this turns the binomial on its head because instead of the number of successes in n trials, it gives the number of trials to r successes. This is why it is called the negative binomial.

  • The binomial is the most important of the discrete distributions in applications, but you should have a passing familiarity with the others.

    Now we move on to distributions of continuous random variables.

  • Because a continuous random variable has a nondenumerable number of values, its probability function is a density function. A probability density function is abbreviated pdf.

    There is a logical problem associated with assigning probabilities to the infinity of points on the x-axis and still having the density sum to 1. So what we do is deal with intervals instead of with points. Hence P(x=a) = 0 for any number a.

  • By far, the most important distribution in statistics is the normal or Gaussian distribution. Its formula is

    f(x) = [1 / (σ√(2π))] e^(−(x−μ)² / (2σ²)),  −∞ < x < ∞

  • The normal distribution is characterized by only two parameters, its mean μ and its standard deviation σ.

    The mgf for a continuous distribution is

    Mx(θ) = E(e^(θX)) = ∫ e^(θx) f(x) dx

  • This mgf is of the same form as that for discrete distributions shown earlier, and it generates moments in the same manner.

    A normal distribution with μ = 1.5 and σ = .9 is shown next.

    [Chart: Normal distribution (μ = 1.5, σ = 0.9); x-axis: random variable, y-axis: f(x)]

  • This is the familiar bell curve. If the standard deviation σ were smaller, the curve would be tighter. And if σ were larger, the curve would be flatter and more spread out.

    Any normal distribution may be transformed into the standard normal distribution with μ = 0 and σ = 1. The transformation is

    z = (x − μ) / σ

    In this case, z is called the standard normal variate or random variable.

  • If we use the transformed variable z, the normal density becomes

    f(z) = [1 / √(2π)] e^(−z² / 2)

  • The area under the curve for any normal distribution from μ to μ + 1σ = .34, and the area from μ to μ − 1σ = .34. So from μ − 1σ to μ + 1σ is 68% of the area, which means that the values of the random variable X falling between those two limits use up .68 of the total probability.

    The area from μ to μ + 1.96σ = .475, and because the normal curve is symmetric, it is the same from μ to μ − 1.96σ. So from μ − 1.96σ to μ + 1.96σ = 95% of the area under the curve, and the values of the random variable in that range use up .95 of the total probability.
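  • These areas can be verified from the closed form of the standard normal cdf, Φ(z) = (1 + erf(z/√2))/2:

```python
# Areas under the standard normal curve via the error function.
import math

def phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(round(phi(1) - phi(0), 3))         # 0.341, area from mu to mu + 1 sigma
print(round(phi(1) - phi(-1), 3))        # 0.683, within 1 sigma
print(round(phi(1.96) - phi(-1.96), 3))  # 0.95, within 1.96 sigma
```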

  • .34.34.135.135

    Chart2

    0.0044318484

    0.0059525324

    0.0079154516

    0.0104209348

    0.0135829692

    0.0175283005

    0.0223945303

    0.0283270377

    0.0354745928

    0.043983596

    0.0539909665

    0.0656158148

    0.0789501583

    0.0940490774

    0.1109208347

    0.1295175957

    0.1497274656

    0.171368592

    0.194186055

    0.217852177

    0.2419707245

    0.2660852499

    0.2896915528

    0.3122539334

    0.3332246029

    0.3520653268

    0.37

    0.38

    0.39

    0.4

    0.4

    0.3969525475

    0.391042694

    0.3813878155

    0.3682701403

    0.3520653268

    0.3332246029

    0.3122539334

    0.2896915528

    0.2660852499

    0.2419707245

    0.217852177

    0.194186055

    0.171368592

    0.1497274656

    0.1295175957

    0.1109208347

    0.0940490774

    0.0789501583

    0.0656158148

    0.0539909665

    0.043983596

    0.0354745928

    0.0283270377

    0.0223945303

    0.0175283005

    0.0135829692

    0.0104209348

    0.0079154516

    0.0059525324

    0.0044318484

    random variable

    Standard normal distribution

    Sheet1

    00.11251724661.50.049493965900.0494939659-33.7-3.00.00

    0.10.13408980190.90921211310.06180520120.10.0618052012-2.753.6-2.90.01

    0.20.15787699080.07638586930.20.0763858693-2.53.5-2.80.01

    0.30.18364891570.09344688740.30.0934468874-2.253.4-2.70.01

    0.40.21105922290.11317039670.40.1131703967-23.3-2.60.01

    0.50.23964409860.13569800880.50.1356980088-1.753.2-2.50.02

    0.60.26882866810.16111931470.60.1611193147-1.53.1-2.40.02

    0.70.29794140390.18946143140.70.1894614314-1.253.0-2.30.03

    0.80.32623652080.22068039190.80.2206803919-12.9-2.20.04

    0.90.35292362530.25465514590.90.2546551459-0.752.8-2.10.04

    10.37720316160.291184831510.2911848315-0.52.7-2.00.05

    1.10.39830554690.32998979491.10.3299897949-0.252.6-1.90.07

    1.20.41553137640.37071660111.20.370716601102.5-1.80.08

    1.30.42828979670.41294699761.30.41294699760.252.4-1.70.09

    1.40.43613211910.45621050221.40.45621050220.52.3-1.60.11

    1.50.43877800860.51.50.50.752.2-1.50.13

    1.60.43613211910.54378949781.60.543789497812.1-1.40.15

    1.70.42828979670.58705300241.70.58705300241.252.0-1.30.17

    1.80.41553137640.62928339891.80.62928339891.51.9-1.20.19

    1.90.39830554690.67001020511.90.67001020511.751.8-1.10.22

    20.37720316160.708815168520.708815168521.7-1.00.24

    2.10.35292362530.74534485412.10.74534485412.251.6-0.90.27

    2.20.32623652080.77931960812.20.77931960812.51.5-0.80.29

    2.30.29794140390.81053856862.30.81053856862.751.4-0.70.31

    2.40.26882866810.83888068532.40.838880685331.3-0.60.33

    2.50.23964409860.86430199122.50.86430199121.2-0.50.35

    2.60.21105922290.88682960332.60.88682960331.1-0.40.37

    2.70.18364891570.90655311262.70.90655311261.0-0.30.38

    2.80.15787699080.92361413072.80.92361413070.9-0.20.39

    2.90.13408980190.93819479882.90.93819479880.8-0.10.40

    30.11251724660.950506034130.95050603410.70.00.40

    0.60.10.40

    00.50.20.39

    0.40.30.38

    0.30.40.37

    0.20.50.35

    0.10.60.33

    0.00.70.31

    0.10.80.29

    0.20.90.27

    0.31.00.24

    0.41.10.22

    0.51.20.19

    0.61.30.17

    0.71.40.15

    0.81.50.13

    0.91.60.11

    1.01.70.09

    1.11.80.08

    1.21.90.07

    1.32.00.05

    1.42.10.04

    1.52.20.04

    1.62.30.03

    1.72.40.02

    1.82.50.02

    1.92.60.01

    2.02.70.01

    2.12.80.01

    2.22.90.01

    2.33.00.00

    2.4

    2.5

    2.6

    2.7

    2.8

    2.9

    3.0

    3.1

    3.2

    3.3

    3.4

    3.5

    3.6

    3.7

    3.8

    3.9

    Sheet1

    random variable

    Normal distribution (1.5, 0.9)

    Sheet2

    random variable

    Standard normal distribution

    Sheet3

  • The normal distribution is very important for statisticians because it is a mathematically manageable distribution with wide ranging applicability, but it is also important on its own merits.

    For one thing, many populations in various scientific or natural fields have a normal distribution to a good degree of approximation. To make inferences about these populations, it is necessary to know the distributions for various functions of the sample observations.

    The normal distribution may be used as an approximation to the binomial for large n.

  • Theorem:

    If X represents the number of successes in n independent trials of an event for which p is the probability of success on a single trial, then the variable (X − np)/√(npq) has a distribution that approaches the normal distribution with mean = 0 and variance = 1 as n becomes increasingly large.

  • Corollary:

    The proportion of successes X/n will be approximately normally distributed with mean p and standard deviation √(pq/n) if n is sufficiently large.

    Consider the following illustration of the normal approximation to the binomial.

  • In Mendelian genetics, certain crosses of peas should give yellow and green peas in a ratio of 3:1. In an experiment that produced 224 peas, 176 turned out to be yellow and only 48 were green.

    The 224 peas may be considered 224 trials of a binomial experiment where the probability of a yellow pea = 3/4. Given this, the average number of yellow peas should be 224(3/4) = 168 and σ = √(224(3/4)(1/4)) = 6.5.

  • Is the theory wrong? Or is the finding of 176 yellow peas just normal variation?

    To save the laborious computation required by the binomial, we can use the normal approximation to get a region around the mean of 168 which encompasses 95% of the values that would be found in the normal distribution: 168 ± 1.96(6.5), or about 155 to 181.

    Since the 176 yellow peas found in this experiment is within this interval, there is no reason to reject Mendelian inheritance.
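  • The arithmetic, for verification:

```python
# The pea experiment by the numbers: n = 224 trials with p = 3/4.
import math

n, p = 224, 0.75
mean = n * p                      # 168
sd = math.sqrt(n * p * (1 - p))   # ~6.48, the slide's 6.5

lo, hi = mean - 1.96 * sd, mean + 1.96 * sd
print(round(sd, 2), round(lo, 1), round(hi, 1))  # 6.48 155.3 180.7
print(lo <= 176 <= hi)            # True: 176 yellow peas is not unusual
```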

  • The normal distribution will be re-visited later, but for now we'll move on to some other interesting continuous distributions.

  • The first of these is the uniform or rectangular distribution.

    f(x) = 1/(β − α) for α ≤ x ≤ β
    f(x) = 0 elsewhere

    This is an important distribution for selecting random samples and computers use it for this purpose.

  • Another important continuous distribution is the gamma distribution, which is used for the length of time it takes to do something or for the time between events.

    The gamma is a two-parameter family of distributions, with α and β as the parameters. Given β > 0 and α > −1, the gamma density is:

    f(x) = x^α e^(−x/β) / [Γ(α+1) β^(α+1)],  x > 0

  • Another important continuous distribution is the beta distribution, which is used to model proportions, such as the proportion of lead in paint or the proportion of time that the FAX machine is under repair. This is a two-parameter family of distributions with parameters α and β, which both must be greater than −1. The beta density is:

    f(x) = x^α (1 − x)^β / B(α+1, β+1),  0 < x < 1

  • The log normal distribution is another interesting continuous distribution. Let x be a random variable. If loge(x) is normally distributed, then x has a log normal distribution. The log normal has two parameters, μ and σ, with σ > 0. For x > 0,

    f(x) = [1 / (xσ√(2π))] e^(−(loge x − μ)² / (2σ²))

  • As with the discrete distributions, most of the continuous distributions are of passing interest. Only the normal distribution at this point is critically important. You will come back to it again and again in statistical study.

  • Now one kind of distribution we haven't covered so far is the cumulative distribution. Whereas the distribution of the random variable is denoted p(x) if it is discrete and f(x) if it is continuous, the cumulative distribution is denoted P(x) and F(x) for discrete and continuous distributions, respectively. The cumulative distribution or cdf is the probability that X ≤ Xc, and thus it is the area under the p(x) or f(x) function up to and including the point Xc.

  • The most interesting cumulative distribution function or cdf is the normal one, often called the normal ogive.

    [Chart: Cumulative normal (1.5, .9); x-axis: random variable, y-axis: F(x)]

  • The points in a continuous cdf like the normal F(x) are obtained by integrating f(x) up to the point in question.

  • The cdf can be used to find the probability that a random variable X is at or below some value of interest because the cdf gives probabilities directly.

    In the normal distribution shown earlier with μ = 1.5 and σ = 0.9, the probability that X ≤ 2 is given by the cdf as .71. Also the probability that 1 ≤ X ≤ 2 is given by F(2) − F(1) = .71 − .29 = .42.

  • Now you know from this normal cdf that the probability that X ≤ 2 is .71.

    Suppose you want the probability that X ≥ 2. Well, if P(X ≤ 2) = .71, then P(X ≥ 2) = 1 − .71 = .29.

    Note that you can ignore the fact that P(X = 2) is included in both probabilities, because P(X = 2) = 0 in a continuous pdf.

  • For the binomial distribution, a point on the cumulative distribution function P(x) is obtained by summing probabilities of the p(x) up to the point in question. Then P(xi) = P(X ≤ xi) = Σ p(x), summed over all x ≤ xi.

    [Chart: Binomial CDF with p = .5 and n = 10; x-axis: number of heads, y-axis: probability X ≤ Xc]

    x:      0      1      2      3      4      5      6      7      8      9      10
    P(X≤x): .0010  .0107  .0547  .1719  .3770  .6230  .8281  .9453  .9893  .9990  1

  • From this cdf, we can see that the probability that the number of heads will be ≤ 2 = .05.

    And the probability that the number of heads will be ≤ 6 = .82.

    But the probability that the number of heads will be between two numbers is tricky here because the cdf includes the probability of x, not just the values < x. So if you want the probability that 2 ≤ x ≤ 6, you need to use P(6) − P(1), because if you subtracted P(2) from P(6), you would exclude the value 2 heads.

    So P(2 ≤ x ≤ 6) = P(6) − P(1) = .82 − .01 = .81.
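  • A quick check of these cdf values by direct summation:

```python
# Binomial cdf for n = 10, p = 0.5, summed from the pmf.
import math

def binom_cdf(x, n=10, p=0.5):
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(x + 1))

print(binom_cdf(2))                 # 0.0546875, the slide's .05
print(binom_cdf(6))                 # 0.828125, the slide's .82
print(binom_cdf(6) - binom_cdf(1))  # 0.8173..., the slide's .82 - .01 = .81
```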

  • So if you are given a point on the binomial cdf, say, (4, .38), then the probability that X ≤ 4 = .38.

    But suppose you want the probability that X > 4. Then 1 − P(X ≤ 4) = 1 − .38 = .62 is the answer.

    But if you want the probability that X ≥ 4, you can't get it from the information given because P(X = 4) is included in the binomial cdf.

  • Now we have covered the major distributions of interest. But so far, we have been dealing with theoretical distributions, where the unknown parameters are given in Greek.

    Since we don't know the parameters, we have to estimate them. This means we have to develop empirical distributions and estimate the parameters.

  • To think about empirical distributions, we must first consider the topic of sampling.

    We need a sample to develop the empirical distribution, but the sample must be selected randomly. Only random samples are valid for statistical use. If any other sample is used, say, because it is conveniently available, the information gained from it is useless except to describe the sample itself.

  • Now how can you tell if a sample is random? Can you tell by looking at the data you got from your sample?

    Does a random sample have to be representative of the group from which it was obtained?

    The answer to these questions is a resounding NO.

  • Now let's develop what a random sample really is.

    First, there is a population with a variable of interest. The population is all elements of concern, for example, all males from age 18 to age 30 in Korea. Maybe the variable of interest is height.

    The population is always very large and often infinite. Otherwise, we would just measure the entire population on the variable of interest and not bother with sampling.

  • Since we can never measure every element (person, object, manufactured part, etc.) in the population, we draw a sample of these elements to measure some variable of interest. This variable is the random variable.

  • The sample may be taken from some portion of the population, and not from the entire population. The portion of the population from which the sample is drawn is called the sampling frame.

    Maybe the sample was taken from males between 18 and 30 in Seoul, not in all of Korea. Then although Korea is the population of interest, Seoul is the sampling frame. Any conclusions reached from the Seoul sample apply only to the set of 18 to 30 year-old males in Seoul, not in all of Korea.

  • To show how far astray you can go when you dont pay attention to the sampling frame, consider the US presidential election of 1948.

    Harry Truman was running against Tom Dewey. All the polling agencies were sure Dewey would win, and the morning paper after the election carried the headline DEWEY WINS. There is a famous picture of the victorious Truman holding up the morning paper for all to see.

  • How did the pollsters go so wrong? It was in their sampling frame. It turns out that they had used the phone directories all over the US to select their sample. But the phone directories all over the US do not contain all the US voters. At that time, many people didn't have phones and many others were unlisted. This is a glaring and very famous example of just how wrong you can be when you don't follow the sampling rules.

  • Now assuming youve got the right sampling frame, the next requirement is a random sample. The sample must be taken randomly for any conclusions to be valid. All conclusions apply only to the sampling frame, not to the entire population.

    A random sample is one in which each and every element in the sampling frame has an equal chance of being selected for the sample.

    This means that you can get some random samples that are quite unrepresentative of the sampling frame. But the larger the random sample is, the more representative it tends to be.

  • Suppose you want to estimate the height of males in Chicago between the ages of 18 and 30.

    If you were looking for a random sample of size 12 in order to estimate the height, you might end up with the Chicago Bulls basketball team. This sample of 12 is just as likely as any other sample of 12 particular males. But it certainly isn't representative of the height of Chicago young males.

  • But you must take a random sample to have any justification for your conclusions.

    Now the ONLY way you can know that a sample is random is if it was selected by a legitimate random sampling procedure.

    Today, most random selections are done by computer. But there are other methods, such as drawing names out of a container if the container was appropriately shaken.

  • The lottery in the US is done by putting a set of numbered balls in a machine. The machine stirs them up and selects 5 numbered balls, one at a time. These numbers are the lottery winners.

    Anyone who bought a lottery ticket which has the same 5 numbers as were drawn will win the lottery.

    Because this equipment was designed as lottery equipment, it is fair to say that the sample of 5 balls drawn is a random sample.

  • Formally, in statistics, a random sample is thought of as n independent and identically distributed (iid) random variables, that is, x1, x2, x3, …, xn.

    In this case, xi is the random variable from which the ith value in the sample was obtained.

    When we want to speak of a random sample, we say: Let {xi} be a set of n iid random variables.

  • Once you get the random sample, you can get the distribution of the variable of interest for the sample.

    Then you can use the empirical sample distribution to estimate the parameters in the sampling frame, but not in the entire population. Most of what we estimate are the two most important moments, μ and σ².

  • Since we don't know the theoretical mean μ and variance σ², we can estimate them from our sample.

    The mean estimate is

    x̄ = Σ xi / n

    where n is the sample size.

  • The estimate of the second moment, the variance, is

    s² = Σ (xi − x̄)² / (n − 1)

    Although the variance is a measure of the spread or variability of the distribution around the mean, usually we take the square root of the variance, the standard deviation, to get the measure in the same scale as the mean. The standard deviation is also a measure of variability.

  • Now two questions arise. First, if we are going to take the square root anyway, why do we bother to square the estimate in the first place?

    The answer is simple if you look at the formula carefully.

  • Clearly, if you didn't square the deviations in the numerator, they would always sum to 0, because the mean is the value such that the deviations around it always sum to 0.

  • Now for the second question. Why is it that when we estimate the mean, we divide by n, but when we estimate the variance, we divide by n -1?

    The answer is in the concept of degrees of freedom.

    When we estimate the mean, each value of x is free to be whatever it is. Thus, there are no constraints on any value of X so there are n degrees of freedom because there are n observations in the sample.

  • But when we estimate the variance, we use the mean estimate in the formula. Once we know the mean, which we must to compute the variance, we lose one degree of freedom.

    Suppose we have 5 observations and their mean = 6. If the values 4, 5, 6, 7 are 4 of these 5 observations, the 5th observation is not free to be anything but 8.

    So when we use the estimated mean in a formula we always lose a degree of freedom.

  • In the formula for the variance, only n − 1 of the (xi − x̄)² terms are free to vary. The nth one is not free to vary. That's why we divide by n − 1.

    One last point:

    The sample mean and the sample variance for normal distributions are independent of one another.

  • Now let's take a random sample of size 18 of the height of Korean male students at KAIST. Let's say the height measurements are:

    165, 166, 168, 168, 172, 172, 172, 175, 175, 175, 175, 178, 178, 178, 182, 182, 184, 185, all in cm.

    Now the mean of these is 175 cm. The standard deviation is 6 cm. And the distribution is symmetric, as shown next.

    [Chart: Height of sample of 18 KAIST male students; x-axis: Height (cm), y-axis: frequency]

    Height (cm): 165  166  168  172  175  178  182  184  185
    Frequency:   1    1    2    3    4    3    2    1    1
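  • Computed directly from the sample:

```python
# Sample statistics for the 18 KAIST height measurements.
import statistics

heights = [165, 166, 168, 168, 172, 172, 172, 175, 175, 175,
           175, 178, 178, 178, 182, 182, 184, 185]

print(statistics.mean(heights))    # 175
print(statistics.stdev(heights))   # 6.0 (n - 1 in the denominator)
print(statistics.median(heights))  # 175.0
print(statistics.mode(heights))    # 175, the most frequent value
```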

  • The distribution would be much closer to normal if the sample were larger, but with 18 observations, it still is symmetric.

    The median of the distribution is 175, the same as the mean. The median is a measure of central tendency such that half of the observations fall below and half above.

    The mode of this distribution is also 175.

  • For normal distributions, the mean, median, and mode are all equal. In fact for all unimodal symmetric distributions, the mean, median, and mode are all equal.

    The mth percentile is the point below which m% of the observations fall. The 10th percentile is the point below which are 10% of the observations. The 60th percentile is the point below which are 60% of the observations.

    The 1st quartile is the point below which are 25% of the observations. The 3rd quartile is the point below which are 75% of the observations.

    The median is the 50th percentile and the 2nd quartile.

  • This is our first empirical distribution. We know its mean, its standard deviation, and its general shape. The estimates of the mean and standard deviation are called statistics and are shown in roman type.

    Now assume that the sample we used was indeed a random sample of male students at KAIST. Then we can ask how good our estimate of the true mean of all KAIST male students is.

  • In order to answer this question, assume that you did this study -- selecting 18 male students at KAIST and measuring their height -- infinitely often. After each study, you record the sample mean and variance.

    Now you have infinitely many sample means from samples of n = 18, and they must have a distribution, with a mean and variance. Note that now we are getting the distribution of a statistic, not a fundamental measurement.

    Distributions of statistics are called sampling distributions.

  • So far, we have had theoretical population distributions of the random variable X and empirical sample distributions of the random variable X.

    Now we move into sampling distributions, where the random variable is not X but a function of X called a statistic.

  • The first sampling distribution we will consider is that of the sample mean so we can see how good our estimate of the population mean is.

    Because we don't really do the experiment infinitely often (we just imagine that it is possible to do so), we need to know the distribution of the sample mean.

  • This is where an amazing theorem comes to our rescue: the Central Limit Theorem.

    Let x̄ be the mean and s² the variance of a random sample of size n from f(x), which has mean μ and variance σ². Now define

    y = (x̄ − μ) / (σ/√n)

    Then y is distributed normally with mean = 0 and variance = 1 as n increases without bound.

    Note that y here is just the standardized version of the statistic x̄.

  • This theorem holds for means of samples of any size n where f(x) is normal.

    But the really amazing thing is that it also holds for means of any distributional form of f(x) for large n. Of course, the more the distribution differs from normality, the larger n must be.
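    A small simulation sketch makes the theorem concrete; here the parent f(x) is deliberately non-normal (an exponential, our arbitrary choice), yet the standardized means come out approximately standard normal:

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 18, 100_000                 # sample size and number of repetitions

    # draw from a decidedly non-normal f(x): an exponential distribution
    x = rng.exponential(scale=2.0, size=(reps, n))
    mu, sigma = 2.0, 2.0                  # mean and sd of this exponential

    # standardize each sample mean: y = (xbar - mu) / (sigma / sqrt(n))
    y = (x.mean(axis=1) - mu) / (sigma / np.sqrt(n))
    print(y.mean(), y.std())              # close to 0 and 1; a histogram of y looks normal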

  • Now we're back to our original question: how good is our sample estimate of the mean of the population?

    We know that x̄ is distributed normally with mean μ, thanks to the CLT. The standard deviation of x̄ is σ/√n.

    The standard deviation of x̄ is often called the standard error, because x̄ is an estimate of μ and any variation of x̄ around μ is error of estimate. By contrast, the standard deviation of X is just the natural variation of X and is not error.

  • So now we can define a confidence interval for our estimate of the mean:

    x̄ ± z (σ/√n)

    where z is the standard normal deviate which leaves α/2 in each tail of the normal distribution. If z = 1.96, then the confidence interval will contain the parameter 95% of the time. Hence, this is called a 95% confidence interval and its two end points are called confidence limits.

  • If σ/√n is small, the interval will be very tight, so the estimate is a precise one. On the other hand, if σ/√n is large, the interval will be wide, so the estimate is not so precise.

    Now it is important to get the interpretation of a confidence interval clear. It does NOT mean that the population mean has a 95% probability of falling within the interval.

  • That would be tantamount to saying that μ is a random variable that has a probability function associated with it.

    But μ is a parameter, not a random variable, so its value is fixed. It is unknown, but fixed.

  • So the proper interpretation for a 95% confidence interval is this. Imagine that you have taken zillions (zillions means infinitely often) of random samples of n =18 KAIST male students and obtained the mean and standard deviation of their height for each sample.

    Now imagine that you can form the 95% confidence interval for each sample estimate as we have done above. Then 95% of these zillions of confidence intervals will contain the parameter μ.

  • It may seem counter-intuitive to say that we have 95% confidence that our 95% confidence interval contains μ, but that there is not 95% probability that μ falls in the interval.

    But if you understand the proper interpretation, you can see the difference. The idea is that 95% of the intervals formed in this way will capture μ. This is why they are called confidence intervals, not probability intervals.
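    This frequency interpretation can be checked by simulation. The sketch below (which assumes a known σ for simplicity, so the z interval applies) counts how many intervals capture μ:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n, reps = 175.0, 6.0, 18, 100_000

    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    half = 1.96 * sigma / np.sqrt(n)              # half-width of the 95% interval
    covered = (xbar - half <= mu) & (mu <= xbar + half)
    print(covered.mean())                         # about 0.95 of the intervals capture mu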

  • Now we can also form 99% confidence intervals simply by changing the 1.96 in the formula to 2.58. Of course, this will widen the interval, but you will have greater confidence.

    90% confidence intervals can be formed by using 1.65 in the formula. This will narrow the interval, but you will have less confidence.

  • But when we try to find a confidence interval, we run into a problem. How can we find the confidence interval when we don't know the parameter σ?

    Of course, we could substitute the estimate s for σ, but then our confidence statement would be inexact, and especially so for small samples. The way out was shown by W.S. Gosset, who wrote under the pseudonym Student. His classic paper introducing the t distribution has made him the founder of the modern theory of exact statistical inference.

  • Student's t is

    t = (x̄ − μ) / (s/√n)

    t involves only the one unknown parameter μ and has the t distribution with n − 1 degrees of freedom, which involves no unknown parameters.

  • The t distribution is

    f(t) = [Γ((k+1)/2) / (√(kπ) Γ(k/2))] (1 + t²/k)^(−(k+1)/2)

    where k is the only parameter and k = the number of degrees of freedom.

    Student's t distribution is symmetric like the normal but with heavier, longer tails for small k. The t distribution approaches the normal as k → ∞, as can be seen in the t table on the following page.

  • Now we can solve the problem of computing confidence intervals for the mean:

    x̄ ± t.975 (s/√n)

    This formula is correct only if s is computed with n − 1 in the denominator.

    t is tabled so that its extreme points (to get 95%, 99% confidence intervals, etc.) are given by t.975 and t.995, respectively. There is also a TDIST function in Excel which gives the tail probability for any value.

  • In our sample of 18 KAIST males, the estimated mean = 175 cm and the estimated standard deviation = 6 cm. So our 95% confidence interval is 175 ± 2.110 (6/√18), or (172, 178), where 2.110 is the tabled value of t.975 for 17 df. This interval isn't very tight, but then we had only 18 observations.
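    The same interval can be reproduced with scipy, which supplies the tabled t value (a sketch, not part of the original computation):

    import numpy as np
    from scipy import stats

    n, xbar, s = 18, 175.0, 6.0
    t975 = stats.t.ppf(0.975, df=n - 1)   # about 2.110 for 17 df
    half = t975 * s / np.sqrt(n)
    print(xbar - half, xbar + half)       # roughly (172, 178)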

  • Technically, we always have to use the t distribution for confidence intervals for the mean, even for large samples, because the value of σ is always unknown.

    But it turns out that when the sample size is over 30, the t distribution and the normal distribution give the same values within at least two decimal points, that is, z.975 ≈ t.975, because the t distribution approaches the normal distribution as df → ∞.

  • What about the distribution of s², the estimate of σ²?

    The statistic (n − 1)s²/σ² has a chi-square distribution with n − 1 df. Chi-square is a new distribution for us, but it is the distribution of the quantity

    Σ(xᵢ − x̄)² / σ²

  • or, if we convert to a standard normal deviate, where

    zᵢ = (xᵢ − μ) / σ

    then

    Σzᵢ²

    has a chi-square distribution with n df. So the sample variance has a chi-square distribution.

  • What about a confidence interval for σ²? In our KAIST sample, n = 18, s = 6, and s² = 36. The formula for the confidence interval is

    (n − 1)s² / χ².975 ≤ σ² ≤ (n − 1)s² / χ².025

    This is a 95% confidence interval for σ² and it is very wide because we had only 18 observations. The two χ² values are those for .975 and .025 with n − 1 = 17 df. Confidence intervals for variances are rarely of interest.
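    Assuming the formula above, the two χ² points and the resulting interval can be computed with scipy:

    from scipy import stats

    n, s2 = 18, 36.0
    lower = (n - 1) * s2 / stats.chi2.ppf(0.975, df=n - 1)
    upper = (n - 1) * s2 / stats.chi2.ppf(0.025, df=n - 1)
    print(lower, upper)   # roughly (20, 81): very wide, as the text notes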

  • Much more common is the problem of comparing two variances where the two random variables are of different orders of magnitude.

    For example, which is more variable, the weight of elephants or the weight of mice?

    Now we know that elephants have a very large mean weight and mice have a very small mean weight. But is their variability around their mean very different?

  • The only way we can answer this is to take their variability relative to their average weight. To do so, we use the standard deviation as the measure of variability.

    The quantity σ/μ, estimated by s/x̄, is a measure of relative variability called the coefficient of variation.

  • Now if you had a random sample of elephant weights and a random sample of mouse weights, you could compare the coefficient of variation of elephant weight with the coefficient of variation of mouse weight and answer the question.
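    A sketch of such a comparison in Python; the weights below are made-up illustrative numbers, not real data:

    import numpy as np

    def cv(x):
        """Coefficient of variation: sample standard deviation over sample mean."""
        x = np.asarray(x, dtype=float)
        return x.std(ddof=1) / x.mean()

    elephants_kg = [4800, 5200, 5500, 6100, 4900]   # hypothetical weights
    mice_kg = [0.021, 0.024, 0.019, 0.025, 0.022]   # hypothetical weights
    print(cv(elephants_kg), cv(mice_kg))            # now directly comparable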

  • What are the properties of an estimator that make it good?

    1. Unbiased 2. Consistent 3. Best unbiased

  • Let's look at each of these in turn.

    1. An unbiased estimator is one where E(θ̂) = θ.

    The sample mean is an unbiased estimator of μ because

    E(x̄) = E(Σxᵢ / n) = ΣE(xᵢ) / n

    and since E(xᵢ) = μ and there are n such terms in the sum, we have E(x̄) = nμ/n = μ.

  • Is s² an unbiased estimator of σ²? It is, provided s² is computed with n − 1 (rather than n) in the denominator.
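    A quick simulation shows why the n − 1 divisor matters (the normal parent and the seed are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)
    sigma2, n, reps = 1.0, 5, 200_000

    x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    print(x.var(axis=1, ddof=1).mean())  # ~1.00: the n-1 divisor is unbiased
    print(x.var(axis=1, ddof=0).mean())  # ~0.80: the n divisor underestimates sigma^2 by (n-1)/n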

  • 2. A consistent estimator is one for which the estimator gets closer and closer to the parameter value as n increases without limit.

    3. A best unbiased estimator, also called a minimum variance unbiased estimator, is one which is first of all unbiased and has the minimum variance among all unbiased estimators.

  • How can we get estimates of parameters?

    One way is the method of moments, which comes from the moment generating function.

    Another very important way is the method of maximum likelihood.

  • A maximum likelihood estimator (MLE) of the parameter θ in the density function f(x; θ) is an estimator that maximizes the likelihood function L(x1, x2, …, xn; θ), where the xi are the n sample values of X and θ is the parameter to be estimated.

    If the {xi} are treated as fixed, the likelihood function becomes a function of θ only.

  • In the discrete case, the likelihood function is

    L({xi}; θ) = p(x1; θ)p(x2; θ)···p(xn; θ)

    where p(x; θ) is the frequency function, for a sample of n observations and the parameter θ.

    L({xi}; θ) gives the probability of obtaining the particular sample values that were obtained, given the parameter θ. The value of θ which maximizes this likelihood function is called the maximum likelihood estimate (MLE) of θ.

  • In the continuous case, the likelihood function

    L({xi}; θ) = f(x1; θ)f(x2; θ)···f(xn; θ)

    gives the probability density at the sample point (x1, x2, …, xn), where the sample space is thought of as being n-dimensional.

    Again, the value of θ which maximizes this likelihood function is called the maximum likelihood estimate (MLE) of θ.

  • Let's look at an example of maximum likelihood estimation. Consider the density function:

    f(x; θ) = θe^(−θx), x > 0

    where θ is a parameter that depends on the experimental conditions.

  • The likelihood function is:

    L = θ^n e^(−θΣxᵢ)

    Differentiating this with respect to θ, we get

    dL/dθ = θ^(n−1) e^(−θΣxᵢ) (n − θΣxᵢ)

    and setting this equal to 0, either θ = 0 or the expression in parentheses = 0. Since the density doesn't exist when θ = 0, the only nontrivial solution for this equation is

    θ̂ = n / Σxᵢ

  • Assume that we have 5 experimental observations for this density: x1 = .9, x2 = 1.7, x3 = .4, x4 = .3, x5 = 2.4.

    Then from the previous result,

    θ̂ = 5 / (0.9 + 1.7 + 0.4 + 0.3 + 2.4) = 5 / 5.7 = .88

    So .88 is the MLE for θ.
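    The arithmetic checks out in a two-line Python verification:

    import numpy as np

    x = np.array([0.9, 1.7, 0.4, 0.3, 2.4])
    theta_hat = len(x) / x.sum()   # MLE of theta in f(x; theta) = theta * exp(-theta x)
    print(theta_hat)               # 5 / 5.7 = 0.877..., i.e. about .88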

  • Let's look at an application of mean and standard deviation estimates in manufacturing.

    The approach is called Statistical Process Control (SPC) and it was developed in the 1920s by Walter Shewhart.

    It became very popular after another statistician, W. Edwards Deming, showed the Japanese how to use it after WWII. Now it is used everywhere in the developed world.

  • At that time (1950s), everything that came to America from Japan was cheap, but junk.

    It sold for a while because it was so cheap, but eventually, people caught on that it was just junk so they stopped buying. It was at this point that Deming went to Japan.

  • The general practice in manufacturing during Shewhart's time was to run an assembly line all day. Then at the end of the day, an inspector inspected all the parts produced by the process that day.

    If a part was good, it was passed on to the next step. If it was bad, it was either discarded or reworked, at significant cost to the business. Sometimes inspection did not occur until the product was finished. Then if a product did not meet specifications, the entire product was discarded. Imagine the cost in this case.

  • The idea of SPC is to get rid of all that waste in materials and manpower by eliminating bad parts as soon as the process starts to produce them.

    The problem was to find the point where the parts started going bad, so you could stop the process and fix the problem. Shewhart was the one who solved this problem by SPC.

  • The idea is to examine periodically a few (usually 3 or 5) parts produced by an assembly line and determine if the process is still running properly.

    In any process, there is variation. If the process is very good, the variation is small. The natural variation of the process is called system variation or common variation.

  • After some preliminary running of the process to determine its location and variation, a chart is made with control limits on it.

    [Chart: Mean Chart; x-axis: Time of day; y-axis: Mean. Center line at 52, lower control limit at 50, upper control limit at 54.]

  • The control chart was developed for a process that would select 5 parts every two hours.

    The green line is the expected mean line that was found in preliminary work.

    The upper red line is called the upper control limit (UCL) and the lower red line is called the lower control limit (LCL). The control limits reflect the system variation around the overall mean line. They are usually 95% confidence limits.

    Then the process is run.

  • [Chart: Mean Chart; x-axis: Time of day; y-axis: Mean. Sample means 52, 53, 51, 53, 52, 58 at 2, 4, 6, 8, 10, and 12 o'clock, plotted against the center line at 52 with control limits at 50 and 54.]

  • Each point on the SPC chart is the mean of the measurement X on 5 parts.

    As you can see, the points are staying within the control limits (red) and generally staying slightly above or below the overall mean line (green) from 2 pm to 10 pm.

    At midnight, the point jumps out of control to a value of 58. Variation like this is called special cause variation.
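    The operator's rule is simple enough to sketch in Python, using the values read off the chart above:

    times = [2, 4, 6, 8, 10, 12]                # time of day
    means = [52, 53, 51, 53, 52, 58]            # subgroup means of 5 parts each
    center, lcl, ucl = 52, 50, 54               # limits from the preliminary study

    for t, m in zip(times, means):
        flag = "in control" if lcl <= m <= ucl else "OUT OF CONTROL"
        print(f"{t:2d} o'clock: mean = {m}  ({flag})")
    # only the midnight point (58) falls outside the limits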

  • This alerts the operator to a problem with his process. His job is to stop the process and find and fix the problem.

    Now he knows the problem happened between 10pm and 12 midnight because everything was OK at 10 pm. So he holds back the parts produced between 10 and 12 for inspection to make sure no bad part goes to the next step.

  • Once he fixes the problem, the process starts up again and the chart continues.

    Now SPC has cut all the losses that would have occurred between midnight and 8 am, when the parts go to the next step.

    There is also a range chart or a standard deviation chart to accompany the mean chart, but that is another story.

  • When Deming told the story of his experiences in Japan, he said,

    I told them that they could go from being the junk manufacturer of the world to producing the best quality products in the world in five years if they used the SPC system. But I made a mistake. They did it in two years.

  • This is an example of how useful statistics can be in a manufacturing setting.

    Actually there are a number of variations of control charts, and an entire field of technology has developed surrounding this idea.

  • Let's look at linear functions of random variables.

    We know that E(X) = μ. But suppose we are interested in a function of X, like, say, aX, where a is a constant. Now what is E(aX)?

    Because E is a linear operator, E(aX) = aE(X) = aμ.

  • This means that when we estimate the mean of aX, we get ax̄.

    How about the E(X + Y - Z) where X, Y, Z are all random variables?

    Again, because E is a linear operator, E(X + Y - Z) = E(X) + E(Y) - E(Z)

    So we can estimate the mean of the sum or difference of random variables by the sum or difference of their means.

  • What about the variance of functions of random variables?

    For aX, how is the variance affected? Let's go to the definition of variance. Then

    Var(aX) = E[(aX − aμ)²] = a²E[(X − μ)²] = a²σ²

  • So if we want to estimate the variance of aX, we can simply multiply the estimated variance of X by a² to get a²s².

    Now what about the variance of X + Y or of X − Y, where X and Y are independent?

    The variance of the sum or difference of independent random variables is the sum of the separate variances: Var(X ± Y) = σ²X + σ²Y.

  • In general, the variance of X + Y, where X and Y are random variables, whether independent or not, is

    Var(X + Y) = σ²X + σ²Y + 2σX,Y

    If X and Y are independent, the covariance term σX,Y drops out.
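    This identity holds exactly for the sample versions as well, as a short numpy sketch shows (the dependence between X and Y here is constructed arbitrarily):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(size=100_000)
    y = 0.5 * x + rng.normal(size=100_000)   # correlated with x by construction

    lhs = np.var(x + y, ddof=1)
    rhs = np.var(x, ddof=1) + np.var(y, ddof=1) + 2 * np.cov(x, y)[0, 1]
    print(lhs, rhs)   # equal; dropping the covariance term would not match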

  • Now what about the variance of the sum or difference of two independent means? The variances of x̄ and ȳ are σ²x/nx and σ²y/ny.

  • So the estimated variance of the difference between the means of two independent random variables is

    s²x̄−ȳ = s²x/nx + s²y/ny

    The square root of this is the standard deviation or standard error of the difference between two independent means.

  • So far we have been talking about distributions of a single random variable. But we now turn to distributions of multiple random variables, which may or may not be related to one another.

    Let's begin with the bivariate case. Now we have two random variables, X and Y, which have a joint normal density.

  • For 1-dimensional random variables, the distribution can be drawn on a piece of paper, where the x-axis is the variate and the y-axis is the ordinate of the distribution.

    For two random variables, one variate X is on the x-axis, the other variate Y is on the y-axis, and the ordinate is the third dimension.

  • So now we imagine a bell sitting on a table. One edge of the table is the x-axis and the other edge is the y-axis.

    The distribution is the bell itself, which represents the ordinates for a set of (x,y) points on the table.

  • This density is shown below, where the only new parameter is ρ:

    f(x,y) = [1 / (2πσxσy√(1 − ρ²))] exp{ −[1 / (2(1 − ρ²))] [((x − μx)/σx)² − 2ρ((x − μx)/σx)((y − μy)/σy) + ((y − μy)/σy)²] }

  • If both X and Y are in standard normal form, their bivariate density simplifies to

    f(x,y) = [1 / (2π√(1 − ρ²))] exp{ −(x² − 2ρxy + y²) / (2(1 − ρ²)) }

  • What is ρ?

    ρ is a measure of the relationship between the two random variables X and Y. It is called the correlation coefficient, where

    −1 ≤ ρ ≤ +1

    When ρ = 0, there is no relationship between X and Y and thus

    f(x,y) = f(x) f(y)

  • ρ is defined through the covariance of X and Y. The covariance is a measure of how the two variables X and Y vary together. It is defined as

    Cov(X,Y) ≡ σX,Y = E[(X − μX)(Y − μY)]

    and is estimated by

    sx,y = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

  • The correlation coefficient ρ is estimated by r,

    r = sx,y / (sx sy) = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² Σ(yᵢ − ȳ)²]

    and is thus a standardized version of the covariance.

  • The correlation is a measure of the linear relationship of two variables. There is no cause-effect implication. The two variables simply vary together.

    Consider the following example of the scores of 30 students on a language test X and a science test Y.

  • [Scattergram of language and science scores; x-axis: language score; y-axis: science score.]

    The 30 (language, science) score pairs are:

    (34,37) (37,37) (36,34) (32,34) (32,33) (36,40) (35,39) (34,37) (29,36) (35,35)
    (28,30) (30,34) (32,30) (41,37) (38,40) (36,42) (37,40) (33,36) (32,31) (33,31)
    (39,36) (33,29) (30,29) (33,40) (43,42) (31,29) (38,40) (34,31) (36,38) (34,32)

  • As the scattergram shows, there is a tendency for the language and science scores to vary together. The degree of linear relationship is not perfect and r = .66 for this situation.

    Note that the relationship is a linear one and the best fitting line can be drawn through the points. If the relationship had been perfect, r = 1 and all of the points would fall on the line.
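    Using the score pairs tabulated above, r can be computed directly with numpy:

    import numpy as np

    language = [34, 37, 36, 32, 32, 36, 35, 34, 29, 35, 28, 30, 32, 41, 38,
                36, 37, 33, 32, 33, 39, 33, 30, 33, 43, 31, 38, 34, 36, 34]
    science  = [37, 37, 34, 34, 33, 40, 39, 37, 36, 35, 30, 34, 30, 37, 40,
                42, 40, 36, 31, 31, 36, 29, 29, 40, 42, 29, 40, 31, 38, 32]

    r = np.corrcoef(language, science)[0, 1]
    print(round(r, 2))   # 0.66, the value quoted above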

  • If the relationship had been negative, then the line would have a negative slope and r would be negative.

    In general, r = 0 if the points show no linear relationship at all. If the relationship is perfect, then r = 1 or -1, depending on whether the best-fitting line through the points would have a positive or negative slope.

  • For weak relationships, r is usually in the .3 to .4 range. For moderate relationships, r is usually in the .5 to .7 range. And for strong relationships, r is usually about .8 to .95.

    Of course, if the direction of the relationship were negative, each r above would be negative.

  • As another example, consider the following data on the heights and weights of 12 college students.

    Are these two variables correlated? Let's first look at the scattergram.

  • [Scattergram: Relationship of height and weight; x-axis: Height; y-axis: Weight.]

    The 12 (height, weight) pairs are:

    (63,124) (72,184) (70,161) (68,164) (66,140) (69,154) (74,210) (70,164) (63,126) (72,172) (65,133) (71,150)

  • It certainly does appear that height and weight are correlated. In fact, the correlation coefficient r = .91.

    But what if you found out that four of the points were for college women and the other eight for college men? Now what would you conclude? Well, let's look at the scattergrams for men and women separately.

  • [Scattergram: Relationship of height and weight for college men; x-axis: Height; y-axis: Weight. Pairs: (68,164) (69,154) (70,161) (70,164) (71,150) (72,184) (72,172) (74,210).]

    [Scattergram: Relationship of height and weight for college women; x-axis: Height; y-axis: Weight. Pairs: (63,124) (63,126) (65,133) (66,140).]

  • Now it doesn't seem that height and weight are so strongly correlated within either group. The important thing to note here is that the degree of correlation can be strongly enhanced by including extreme values.

    In this case, the women were extremely low both in height and weight, compared to the men.
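    The effect shows up numerically; a numpy sketch using the pairs tabulated above compares the pooled correlation with the within-group correlation for the men:

    import numpy as np

    men_h = [68, 69, 70, 70, 71, 72, 72, 74]
    men_w = [164, 154, 161, 164, 150, 184, 172, 210]
    women_h = [63, 63, 65, 66]
    women_w = [124, 126, 133, 140]

    r_all = np.corrcoef(men_h + women_h, men_w + women_w)[0, 1]
    r_men = np.corrcoef(men_h, men_w)[0, 1]
    print(round(r_all, 2), round(r_men, 2))   # about .91 pooled, but about .78 for the men alone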

  • [Scattergram: a strong but purely curvilinear relationship, in which Y rises and then falls as X increases.]

  • In the preceding scattergram, the relationship is just about perfect, but r = 0 because there is no linear relationship.

    There are ways to deal with measuring the strength of nonlinear relationships, but we will not deal with them here.
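    A tiny numpy example makes the point: a perfect parabola, sampled symmetrically, gives an r of essentially zero.

    import numpy as np

    x = np.linspace(-3, 3, 61)
    y = x ** 2                         # a perfect, but purely nonlinear, relationship
    print(np.corrcoef(x, y)[0, 1])     # essentially 0 (up to floating-point noise)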

  • The correlation coefficient is used to describe the linear relationship between two random variables.

    It is also possible to use the relationship between two variables where the independent one, X, is not a random variable and the dependent one, Y, is a random variable. In such a case, we would be interested in predicting Y from X.

  • In order to predict, we must have the best-fit line. So how do we get the best-fit line to a set of data? What makes a line the best-fit line?

    The answer is in the method of least squares. The line of least-squares best fit is the line for which

    Σ(yᵢ − ŷᵢ)²

    is minimized. Note that yᵢ is the actual point and ŷᵢ is the corresponding point on the line of best fit.

  • The least squares line of best fit is

    ŷ = a + bx

    Then the intercept a is given by

    a = ȳ − bx̄

    and the slope b is given by

    b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

  • A study of the effect of water irrigation on hay yield produced the 7 observations which are shown in the following table:

    water: 12, 18, 24, 30, 36, 42, 48
    yield: 5.27, 5.68, 6.25, 7.21, 8.02, 8.71, 8.42

  • From the formulas for a and b, the best-fit line is shown as well. The best-fit line is called the regression line. The least-squares line of best fit is

    y = .10x + 4.0

    This is the line that minimizes the sum of squared errors.

  • [Chart: Best fitting line for water-yield data; x-axis: water; y-axis: Yield. The seven observed yields are plotted together with the fitted line, whose values at water = 12, 18, 24, 30, 36, 42, 48 are 5.2, 5.8, 6.4, 7.0, 7.6, 8.2, 8.8.]
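    The fit can be reproduced with numpy's polyfit (a sketch based on the table above):

    import numpy as np

    water = np.array([12, 18, 24, 30, 36, 42, 48])
    hay_yield = np.array([5.27, 5.68, 6.25, 7.21, 8.02, 8.71, 8.42])

    b, a = np.polyfit(water, hay_yield, 1)   # slope and intercept of the least-squares line
    print(round(b, 2), round(a, 1))          # 0.1 and 4.0, i.e. y = .10x + 4.0
    print(4.0 + 0.10 * water)                # fitted points: 5.2 5.8 6.4 7.0 7.6 8.2 8.8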

  • What this means is that if we take the vertical distance of

