njc sampling lecture notes

H2 Sampling National Junior College Mathematics Department 2011


National Junior College 2011 H2 Mathematics (Senior High 2) Statistics 4 - Sampling (Lecture Notes)

_______________________________________________________________

Objectives
At the end of this chapter, students should be able to

• understand the concepts of population, sample and random sample;
• describe how to gather a sample using any of the four sampling methods: random sampling, stratified sampling, systematic sampling and quota sampling;
• state the advantages and disadvantages of each of the four sampling methods;
• explain in simple terms why a given sampling method may be unsatisfactory, and suggest possible improvements in the context of the question;
• understand the use of and calculate the unbiased estimates of the population mean and variance, including cases where the data is given in summarised form $\sum x$ and $\sum x^2$;
• understand that the sample mean $\bar{X}$ is a random variable with $E(\bar{X}) = \mu$ and $\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$;
• use the fact that $\bar{X}$ has a normal distribution if X has a normal distribution;
• apply the sampling distribution $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ to solve statistical problems in real-world situations;
• use the Central Limit Theorem to treat $\bar{X}$ as having a normal distribution when X is not normally distributed and the sample size is sufficiently large.

§1 Introduction

When we need to gather information about a population, we can very rarely afford in practice to examine the complete population (e.g. all Asians, European women who weigh more than 50 kg, etc.). The two obvious reasons are that the cost is too high and that the population is dynamic: the individuals making up the population may change over time. Almost invariably, sample observations are used to make inferences about the population probability distribution. This is known as statistical inference. Sampling theory is the study of the relationship between a population and the samples drawn from that population. It provides a systematic way of gauging the reliability or accuracy of this inference.

A population is the whole set of items that people are interested in. A sample is a representative part of a population.


Examples
1. In doing project work, a group of SH1 students wants to find out the mean number of hours Junior College (JC) students spend watching TV daily. The population is all the JC students in Singapore. A sample can be taken by selecting 20 students from each JC.

2. Suppose that in Neotopia, 60% of the population are males and 40% are females. An investigation is made to find out the number of vegetarians. A sample of 50 people consisting of 30 men and 20 women is taken.

However, for the statistical analysis to be valid, the sample must be carefully gathered (e.g. it must not be biased in any way, and the larger the sample, the better the representation, etc.). Hence this chapter shall focus on discussing several sampling methods.

§2 Sampling Methods

Sampling methods are classified as either probability or nonprobability. In probability sampling, each member of the population has a known non-zero probability of being selected. Probability sampling methods include random sampling, systematic sampling and stratified sampling. In nonprobability sampling, members are selected from the population in some non-random manner. These methods include convenience sampling, judgment sampling, quota sampling and snowball sampling.

The advantage of probability sampling is that sampling error can be calculated. Sampling error is the degree to which a sample might differ from the population. When inferring to the population, results are reported plus or minus the sampling error. In nonprobability sampling, the degree to which the sample differs from the population remains unknown.

In this chapter, we will cover four different types of sampling, namely, random sampling, systematic sampling, stratified sampling and quota sampling.

2.1 Sampling Frame

This is the list of all members within a population that can be sampled, and may include individuals, households, institutions, etc. A sampling frame should be as accurate and up-to-date as possible. Ideally the sampling frame should list the entire target population.

2.2 Random Sampling

A random sample is one in which each member of the population has an equal chance of being selected. Two methods which can be carried out for random sampling are:

• drawing lots: for instance, to select a sample of size n from a population of N people, we may write the names of all the people on slips of paper and put the slips in a container. We then draw n slips from the container. This method would be relatively difficult if the population is large;

• random number sampling: random numbers can be generated by computer programs such as Excel, by calculators, or obtained from random number tables.

To carry out random sampling,

• the sampling frame must be complete and updated;
• the population size should be manageable.
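As an illustration of random number sampling, the following is a minimal Python sketch. It assumes a hypothetical sampling frame of 700 members labelled 1 to 700; the function name `simple_random_sample` and the seed value are illustrative choices, not part of the syllabus.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct members from the sampling frame.

    Every member (and every subset of size n) is equally likely to be chosen.
    """
    rng = random.Random(seed)      # a seed only makes the draw repeatable
    return rng.sample(frame, n)    # sampling without replacement

# Hypothetical sampling frame: 700 members labelled 1..700.
frame = list(range(1, 701))
sample = simple_random_sample(frame, 20, seed=2011)
print(sorted(sample))
```

The computer plays the role of the container of slips: the list `frame` must still be a complete and updated sampling frame for the sample to be valid.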

Advantages of Random Sampling

• The analysis of data is relatively easy.
• The data collected is generally free from bias.

Disadvantages of Random Sampling

• It is difficult or impossible to identify every member of the population, especially if the population is large.
• We may not be able to get access to some members who have been chosen for the sample.

Strictly speaking, the method described above is known as simple random sampling, as systematic and stratified sampling methods are also random sampling methods which yield random samples. However, a special property of simple random sampling is that each sample of the same size has an equal probability of being selected.

2.3 Systematic Sampling

(Simple) random sampling from a very large population is very cumbersome. An alternative procedure is to use systematic sampling. Systematic sampling yields a random sample in which a starting point is randomly chosen and the items are then picked at regular intervals from the population.

Suppose we can line up all members of a population and move down the line one member at a time. To draw a sample, we simply select every kth member (e.g. every 5th member) of the population. Usually the first member selected is chosen randomly from the first k members (see footnote 1), and if the sample is of size n, then $k = \frac{N}{n}$, where N is the size of the population.

Example 2.1
A free gift is to be given to each of 8 households from an estate of 120 households. The 120 households are listed in a certain order and every 15th household is chosen since $k = \frac{120}{8} = 15$. If the 7th household is randomly chosen from the first 15 households as the starting point, then household numbers 7, 22, 37, 52, 67, 82, 97 and 112 will get a free gift.

1 The first member can actually be randomly selected from any point in the population. However, we will need to cycle back to the start of the population list upon reaching the end of the list if there are insufficient samples. For instance in Example 2.1, we could have randomly selected the 37th household as the first household, then the household numbers 37, 52, 67, 82, 97, 112, 7 and 22 will get a free gift.


Note: This is no longer simple random sampling, because some combinations of 8 households have a larger selection probability than others. For instance, {15, 30, 45, …, 120} has a $\frac{1}{15}$ chance of being selected, while {1, 2, 3, ..., 8} can never be selected under this method.

Example 2.2
A survey is to be conducted on 250 ticket holders to a carnival out of 5300 ticket holders. To draw a sample of 250 ticket holders from a population of 5300 ticket holders, we may take $k = \frac{5300}{250} = 21.2 \approx 21$. We can then select every 21st ticket holder commencing with a randomly selected first ticket holder (say the 8th ticket holder).

When using systematic sampling, the researcher must ensure that the chosen sampling interval does not hide a pattern. Any pattern might result in a biased sample (see Disadvantages of Systematic Sampling).

To carry out systematic sampling,

• the starting point must be randomly selected;
• use systematic sampling only when the population is likely to be homogeneous;
• the sampling frame must be complete and updated.

Advantages of Systematic Sampling

• The sample is more evenly spread out over the population.
• It is easier to conduct than other types of sampling.

Disadvantages of Systematic Sampling

• There may be bias caused by the effect of periodicity of the population, i.e. there may be a periodic or cyclic pattern within the frame itself. Consider the following:

  o Suppose the use of a particular MRT station over a period of time is to be investigated. If every seventh day is selected to do the testing, then the same day of the week would be tested on each occasion. If that day happened to be Sunday, this may result in a very inaccurate picture of the use of the station.

• It is not always possible to arrange the members of the population in order, especially if the population is large.
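The selection rule of Example 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `systematic_sample` is hypothetical, members are assumed to be labelled 1 to N in list order, and the interval is taken as N divided by n, rounded down (as in Example 2.2).

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the positions (1-based) picked by systematic sampling.

    The interval is k = N // n; the starting point is chosen at random
    from the first k members unless one is supplied.
    """
    k = population_size // sample_size
    if start is None:
        start = random.Random(seed).randint(1, k)   # random start in 1..k
    return [start + i * k for i in range(sample_size)]

# Example 2.1: 8 households out of 120, starting from the 7th household.
households = systematic_sample(120, 8, start=7)
print(households)   # [7, 22, 37, 52, 67, 82, 97, 112]
```

Only the starting point is random; once it is fixed, the whole sample is determined, which is why a periodic pattern in the list can bias the result.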




Example 2.3
The Head of Mathematics Department of Holistics Junior College decides to take a survey of opinions of 700 graduating students regarding the quality of the teaching of Mathematics.
(i) What is the sampling frame in the context of the question?
(ii) Describe clearly how a systematic sample of size 140 can be obtained.

Solution
(i) It is the list of all the 700 graduating students.
(ii) Arrange the list of all 700 students in some order (can be by surname, class or any other reasonable category).

Calculate the sampling interval needed to obtain a sample of 140, i.e. $k = \frac{700}{140} = 5$.

From the first group of 5 students, select the first student randomly. Then select every 5th student thereafter.

2.4 Stratified Sampling

The population is divided into non-overlapping representative groups or strata according to one or more criteria. Items are selected randomly from each stratum, with the sample size being proportional to the relative size of the stratum. Examples of strata are age groups, genders and income groups. Again, we can see that stratified sampling yields a random sample.

For instance, in 2001, the Infocomm Development Authority of Singapore (IDA) conducted a national survey of 3000 Singaporeans aged 10 years or older to determine the broadband market size in Singapore while gaining an understanding of the demographic and usage profiles of these users. In order to obtain a random sample of 3000 Singaporeans of the correct racial proportion, IDA chose the sample size of each group to match the actual racial proportions of Singapore as closely as possible.

| Group | Sample Size |
|---|---|
| Chinese | 78% × 3000 = 2340 |
| Malays | 13% × 3000 = 390 |
| Indian/Others | 9% × 3000 = 270 |

Source: Survey on Broadband Usage in Singapore in 2001 Summary Report, Infocomm Development Authority of Singapore (IDA)

To carry out stratified sampling,

• strata must not overlap (no member of the population can belong to two strata at the same time);
• strata must be exhaustive (every member in the population belongs to one and only one stratum);
• each of the strata can be treated separately, and so the sampling is convenient and more accurate;
• the sampling frame must be complete and updated.


Advantages of Stratified Sampling

• The results of each stratum can be analysed separately, so it is convenient and usually gives more accurate estimates.
• It ensures better coverage of the population than simple random sampling as it represents the population proportionately.

Disadvantages of Stratified Sampling

• It is more difficult to conduct as compared with simple random sampling.
• It may be difficult to identify appropriate strata.
• It is time consuming (difficult to organise).

Example 2.4
An airline wishes to assess its in-flight service for a specific flight and employs a marketing research company to administer a survey. The seats on this flight are divided into classes as follows:

| Class Type | First Class | Business Class | Economy Class | Total |
|---|---|---|---|---|
| No. of Seats | 20 | 80 | 300 | 400 |

(i) An employee of the company proposes using the stratified sampling method to select 80 seats and ask the passengers occupying these seats to complete a questionnaire. Describe how this sample can be obtained. Suggest a practical difficulty which may be encountered in carrying out this proposal.

(ii) Another employee suggests using the simple random sampling method to choose a sample of 80 passengers from the list of passengers who have checked in. Explain why it may not be appropriate to use this sampling method.

Solution
(i) Based on the breakdown of the class types, the sample should take $\frac{80}{400} = \frac{1}{5}$ of the seats in each class, so the company can randomly select 4, 16 and 60 seats from the first, business and economy class respectively for the survey.

Some passengers may have booked a flight ticket but did not turn up or changed flight, so some of the seats in the sample may not have a passenger.
OR The flight is not fully booked, so the chosen seat could be empty.
OR The passenger may ignore the questionnaire.

(ii) It is not appropriate to use simple random sampling as passengers from different classes experience different in-flight service and hence may have very different opinions on the service.
OR The number of passengers in the first class is very small, so the passengers from the first class may not be chosen at all using the random sampling method.
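The proportional allocation in part (i) of Example 2.4 can be sketched in Python as follows. This is a rough illustration only: the seat labels 1, 2, 3, … within each class and the seed are assumptions made for the sketch, and the integer division is exact here only because the class sizes divide evenly.

```python
import random

seats = {"First": 20, "Business": 80, "Economy": 300}   # seats per class
sample_size = 80
total = sum(seats.values())                              # 400 seats

# Each stratum contributes in proportion to its size: n * (stratum size) / N.
allocation = {cls: sample_size * count // total for cls, count in seats.items()}
print(allocation)   # {'First': 4, 'Business': 16, 'Economy': 60}

# Within each stratum, the seats are then chosen by simple random sampling.
rng = random.Random(4)   # arbitrary seed, only to make the draw repeatable
chosen = {
    cls: sorted(rng.sample(range(1, count + 1), allocation[cls]))
    for cls, count in seats.items()
}
print(chosen["First"])   # four hypothetical seat labels from First Class
```

When the proportions do not work out to whole numbers, the allocations have to be rounded so that they still add up to the required sample size.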



2.5 Quota Sampling

Quota sampling is commonly used in market research and opinion polls. It works in the same way as stratified sampling in that the population is divided into mutually exclusive subgroups, but the sample is non-random.

For example, a surveyor is asked to interview 20 people in the market concerning the use of a certain brand of detergent. The interviewer is given a list of required criteria concerning the sample of 20 people. Based on this requirement, he selects his sample (in a non-random manner).

Specific subgroups required:

| Category | Subgroup | Quota |
|---|---|---|
| Sex | Male | 8 |
| Sex | Female | 12 |
| Sex | Total | 20 |
| Age | 20 – 29 | 6 |
| Age | 30 – 39 | 8 |
| Age | 40 – 49 | 4 |
| Age | > 50 | 2 |
| Age | Total | 20 |
| Social Class | High | 7 |
| Social Class | Middle | 10 |
| Social Class | Low | 3 |
| Social Class | Total | 20 |

Based on this quota, a possible selection is as follows:

| Social Class | 20-29 | 30-39 | 40-49 | > 50 | Total |
|---|---|---|---|---|---|
| High | 2 male | 2 male | 2 female | 1 female | 7 |
| Middle | 2 male | 6 female | 2 female | -- | 10 |
| Low | 2 male | -- | -- | 1 female | 3 |
| Total | 6 | 8 | 4 | 2 | 20 |
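As a quick check, the possible selection above can be tallied against the three sets of quotas. The short Python sketch below is only an illustration; the data structure and the helper name `tally` are hypothetical.

```python
from collections import Counter

quotas = {
    "sex": {"male": 8, "female": 12},
    "age": {"20-29": 6, "30-39": 8, "40-49": 4, ">50": 2},
    "class": {"High": 7, "Middle": 10, "Low": 3},
}

# Each entry: (social class, age group, sex, number of people selected).
selection = [
    ("High", "20-29", "male", 2), ("High", "30-39", "male", 2),
    ("High", "40-49", "female", 2), ("High", ">50", "female", 1),
    ("Middle", "20-29", "male", 2), ("Middle", "30-39", "female", 6),
    ("Middle", "40-49", "female", 2),
    ("Low", "20-29", "male", 2), ("Low", ">50", "female", 1),
]

def tally(selection):
    """Count the selected people by sex, by age group and by social class."""
    counts = {"sex": Counter(), "age": Counter(), "class": Counter()}
    for social_class, age, sex, number in selection:
        counts["sex"][sex] += number
        counts["age"][age] += number
        counts["class"][social_class] += number
    return {key: dict(value) for key, value in counts.items()}

print(tally(selection) == quotas)   # True: every quota is met exactly
```

Many different selections satisfy the same quotas, which is exactly where the interviewer's free choice (and hence possible bias) enters.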

The interviewer is given a free choice in picking the people to fill the quotas.

To carry out quota sampling,

• no sampling frame is required;
• information gathered from this type of sample must be treated with caution.

Advantages of Quota Sampling

• The cost is low.
• It is faster to gather the information.
• It does not require a sampling frame.

Disadvantages of Quota Sampling

• It is not a good representation of the population as a whole as compared with other types of sampling.
• The sample is non-random, thus making it impossible to assess the sampling error.
• This method is biased as the interviewer may choose those who are easier and more willing to be interviewed.



Example 2.5
(i) Give a real-life example of a situation in which quota sampling could be used. Explain why quota sampling would be appropriate in this situation, and describe briefly any disadvantage that quota sampling has.

(ii) Explain briefly whether it would be possible to use stratified sampling in the situation you have described in part (i).

Solution
(i) Example: Interview shoppers in a supermarket to find out their liking for a special product. Quota sampling is appropriate as the shoppers can be easily divided into different age groups, e.g. below 20, 20–35, 36–50, 51–65 and above 65.

Disadvantage: The sample could be biased since the interviewer may tend to select shoppers who are more approachable and this results in a non-random sample.

(ii) For stratified sampling, a sampling frame must be available. In the above situation, it is not easy to obtain the list of all shoppers. Hence stratified sampling is not possible.

Note:
(i) Your answer should include the following (with reference to the situation):

• What is the population described in the context? (Shoppers at a supermarket)
• What is/are the issue(s) stated? (To find out the liking for a special product)
• What are the sub-populations to be identified? (The different age groups, i.e. the strata)
• How do you discuss the disadvantage relevant to the context? (Biased, non-random)

(ii) Whether it is possible or not depends on whether the situation in (i) has a sampling frame. Your answer should clearly imply consideration of whether the list (sampling frame) exists or is accessible.




§3 Parameters and Statistics

Suppose that a population has an unknown parameter (e.g. mean, variance, proportion, etc.). An estimate of this unknown parameter can be made from the information supplied by a random sample (or samples) taken from the population. The quantity computed from the sample, from which the parameter is inferred, is called a statistic. For instance, the mean of a sample gives us an estimate of the mean of a population. Hence, the sample mean is a statistic and the population mean is a parameter.

For example, someone wants to find out the mean number of hours a JC student in Singapore spends watching TV. The mean here is a population parameter, and it is often denoted by μ. In a survey of 20 JC students, the mean number of hours from this sample is a statistic. It can be used to infer the population mean. The sample mean is often denoted by $\bar{X}$.

Formally, let X be the number of hours a JC student in Singapore spends watching TV. Then $E(X) = \mu$. Surveying one student is equivalent to making one observation of X, whose outcome we can denote by $X_1$. Hence, a survey of 20 students is equivalent to making 20 observations of X, i.e. $X_1, X_2, X_3, \dots, X_{20}$, and the sample mean is therefore

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_{20}}{20}.$$

Note that $\bar{X}$ is a random variable, since $X_1, X_2, X_3, \dots, X_{20}$ are all random variables. The experiment is the survey of 20 students on the number of hours spent watching TV. The outcome is the mean value of the number of hours for the sample. This outcome will be different with every sampling of 20 students.

§4 Estimators

In this section, we will discuss point estimation.

A statistic is a numeric quantity of some characteristic in the sample.

Any statistic, T, derived from a random sample and used to estimate an unknown population parameter $\theta$, is known as an estimator.

For example, $\bar{X}$, the sample mean, is an estimator of the population mean $\mu$.

A point estimate of a population parameter $\theta$ is a single value t of a statistic T that is used to estimate $\theta$.

For example, a single value $\bar{x}$ of the statistic $\bar{X}$, the sample mean, is a point estimate of the population mean $\mu$.



Suppose from a particular sample, we find that the sample mean is 3. Then this sample mean is a point estimate of the population mean. Point estimates are represented by lower case letters, i.e. $\bar{x} = 3$.

4.1 Unbiased Estimators

Consider a population with an unknown parameter $\theta$.

If T is an estimator (from a sample) for the parameter $\theta$, then T is said to be an unbiased estimator if and only if $E(T) = \theta$.

We shall see that $\bar{X}$ is an unbiased estimator of µ.

** See Appendix A1 for the definition of a “best estimator”.

4.1.1 Unbiased Estimator of Population Mean

Let $X_1, X_2, X_3, \dots, X_n$ be a random sample from a population with UNKNOWN mean µ. Then we have

$\bar{X}$ is an unbiased estimator of $\mu$, where $\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n}$.

Proof:
$$E(\bar{X}) = E\left(\frac{X_1 + X_2 + X_3 + \dots + X_n}{n}\right) = \frac{1}{n}\left[n\,E(X)\right] = E(X) = \mu.$$

4.1.2 Unbiased Estimator of Population Variance

Let $X_1, X_2, X_3, \dots, X_n$ be a random sample of size n from a population with UNKNOWN variance $\sigma^2$. By definition, the sample variance is

$$\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2.$$

$s^2 = \frac{n}{n-1}\,\sigma_x^2$ is an unbiased estimator of the population variance $\sigma^2$,

i.e. $E(s^2) = \sigma^2$, where

$$s^2 = \frac{n}{n-1}\,\sigma_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 \quad \text{(in MF15)}$$

$$\phantom{s^2} = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}\right] \quad \text{(in MF15)}$$

** See Appendix A2 for proof.


Example 4.1
Changi Airport handles thousands of pieces of luggage per day. A random sample of 20 pieces of luggage is taken, and the weights (in kg) of the pieces are as follows:

38 64 50 32 44 25 49 57 46 58
40 47 36 48 52 44 68 26 38 76

Obtain unbiased estimates for the population mean and variance.

Solution
Let X denote the weight (in kg) of a randomly chosen piece of luggage. From the sample of 20 pieces of luggage, we have

n = 20, $\sum x = 938$, $\sum x^2 = 47324$.

Unbiased estimate for the population mean, $\bar{x} = \frac{1}{20}\sum x = 46.9$ kg.

Sample variance, $\sigma_x^2 = \frac{1}{n}\sum (x_i - \bar{x})^2 = \frac{1}{20}(47324) - 46.9^2 = 166.59$.

Unbiased estimate for the population variance, $s^2 = \frac{20}{19}\,\sigma_x^2 = \frac{20}{19}(166.59) = 175$ (3 sig. fig.).

OR
We can use the graphic calculator to help us.

Action / Keystroke (screen shots omitted):
1. Enter data in List 1 using STAT > Edit.
2. Press STAT > CALC and select 1-Var Stats.
3. Press L1. Press Enter.

This shows you the unbiased estimates for the population mean $\bar{x}$ and variance $s^2$. Therefore, $s^2 = 13.2422768^2 = 175.36$. ($\sigma_x$ denotes the sample standard deviation.)

OR
$s^2$ can also be calculated directly without finding $\sigma_x^2$ first.

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right] = \frac{1}{20-1}\left[47324 - \frac{938^2}{20}\right] = 175 \text{ (3 sig. fig.)}$$
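For readers without a graphic calculator, the figures in Example 4.1 can also be checked with Python's standard `statistics` module. This is a sketch for checking the arithmetic only: `pvariance` divides by n (the sample variance $\sigma_x^2$ in these notes) and `variance` divides by n − 1 (the unbiased estimate $s^2$).

```python
import statistics

weights = [38, 64, 50, 32, 44, 25, 49, 57, 46, 58,
           40, 47, 36, 48, 52, 44, 68, 26, 38, 76]

n = len(weights)
x_bar = statistics.mean(weights)             # unbiased estimate of the mean
sample_var = statistics.pvariance(weights)   # divisor n
s2 = statistics.variance(weights)            # divisor n - 1

print(n, sum(weights), sum(w * w for w in weights))   # 20 938 47324
print(x_bar, round(sample_var, 2), round(s2, 2))      # 46.9 166.59 175.36
```

The last line also confirms the relationship $s^2 = \frac{n}{n-1}\,\sigma_x^2$, since 166.59 × 20/19 ≈ 175.36.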

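Example 4.2
A random sample of 110 tables is examined, and the masses x (kg) are summarised by $\sum x = 1759$ and $\sum x^2 = 198020$. Calculate the unbiased estimates of µ, the population mean, and $\sigma^2$, the population variance.

Solution
Given that n = 110, $\sum x = 1759$ and $\sum x^2 = 198020$, we get

$$\bar{x} = \frac{1759}{110} = 15.99$$

and

$$s^2 = \frac{1}{n-1}\left[\sum x^2 - \frac{\left(\sum x\right)^2}{n}\right] = \frac{1}{109}\left[198020 - \frac{(1759)^2}{110}\right] = 1558.64.$$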

Example 4.3 (Example 4.2 re-visited)
A random sample of 110 tables is examined, and the masses x (kg) are summarised by $\sum (x - 15) = 1759$ and $\sum (x - 15)^2 = 198020$. Calculate the unbiased estimates of µ, the population mean, and $\sigma^2$, the population variance.

Solution:
Let $y = x - 15$. Then we have $\sum y = 1759$ and $\sum y^2 = 198020$.

Also, $\bar{y} = \bar{x} - 15 \;\Rightarrow\; \bar{x} = \bar{y} + 15 = \frac{1759}{110} + 15 = 31.0$ (3 sig. fig.).

The variance of x will be the same as the variance of y.

$$s^2 = \frac{1}{n-1}\left[\sum y^2 - \frac{\left(\sum y\right)^2}{n}\right] = \frac{1}{109}\left[198020 - \frac{(1759)^2}{110}\right] = 1558.64$$

Question: Why do you think the values of $s^2$ in Examples 4.2 and 4.3 are the same (i.e. why is the variance of x the same as the variance of y)?

Answer: Translating all data points by 15 units does not affect the spread of the data. Hence, the variance of x will be the same as the variance of y.
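The calculations in Examples 4.2 and 4.3 follow the same pattern, which can be sketched as a small Python function. The function name `unbiased_estimates` and its `shift` parameter are hypothetical conveniences for this sketch: `shift` is the constant a in summaries of the form $\sum (x - a)$ and $\sum (x - a)^2$.

```python
def unbiased_estimates(n, sum_y, sum_y2, shift=0.0):
    """Unbiased estimates of mean and variance from summarised data.

    sum_y and sum_y2 are the sums of y and y^2, where y = x - shift.
    The mean is shifted back; the variance is unaffected by the shift.
    """
    mean = sum_y / n + shift
    variance = (sum_y2 - sum_y ** 2 / n) / (n - 1)
    return mean, variance

# Example 4.2: sums of x and x^2.
mean_42, var_42 = unbiased_estimates(110, 1759, 198020)
# Example 4.3: sums of (x - 15) and (x - 15)^2.
mean_43, var_43 = unbiased_estimates(110, 1759, 198020, shift=15)

print(round(mean_42, 2), round(var_42, 2))   # 15.99 1558.64
print(round(mean_43, 2), round(var_43, 2))   # 30.99 1558.64
```

Only the mean moves when the data are translated; the variance estimate is identical in both calls.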




§5 Sample Mean, $\bar{X}$

Although we can obtain the unbiased estimates of a parameter, we also see that for every sample that we draw from the population, the estimates will most likely be different. How then do we know which one is a better indication of the population parameter? In order to resolve this, we need to know the distribution of the statistic which provides information about the parameter. More specifically in our case, we shall determine the distribution of the sample mean.

** See Appendix A3 for a detailed example of what is meant by the sampling distribution of the sample mean.

5.1 Mean and Variance of Sample Mean

Let $X_1, X_2, \dots, X_n$ be n independent observations of a random variable X, taken from any infinite population (or finite population if sampling is with replacement) with mean µ and variance $\sigma^2$. Notice that each of $X_1, X_2, \dots, X_n$ will have the same distribution as X. Then, $X_1, X_2, \dots, X_n$ is known as a random sample of size n taken from the population.

The sample mean is $\bar{X} = \frac{1}{n}\left(X_1 + X_2 + \dots + X_n\right)$.

Then, $E(\bar{X}) = \mu$ and $\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$.

Proof:

$$E(\bar{X}) = E\left(\frac{X_1 + X_2 + X_3 + \dots + X_n}{n}\right) = \frac{1}{n}\left[n\,E(X)\right] = E(X) = \mu$$

$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\left(X_1 + X_2 + \dots + X_n\right)\right) = \frac{1}{n^2}\left[\mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \dots + \mathrm{Var}(X_n)\right] = \frac{1}{n^2}\left[n\,\mathrm{Var}(X)\right] = \frac{\sigma^2}{n}$$
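The two results can be checked exactly on a tiny population by listing every possible sample. The Python sketch below assumes the four-number population {2, 3, 4, 5} used in Appendix A3 and samples of size 2 drawn with replacement; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

population = [2, 3, 4, 5]
n = 2                                   # sample size
N = len(population)

mu = Fraction(sum(population), N)                          # population mean
sigma2 = sum((x - mu) ** 2 for x in population) / N        # population variance

# All N**n equally likely ordered samples (sampling with replacement).
means = [Fraction(sum(s), n) for s in product(population, repeat=n)]

mean_of_means = sum(means) / len(means)
var_of_means = sum((m - mean_of_means) ** 2 for m in means) / len(means)

print(mu, sigma2)                       # 7/2 5/4
print(mean_of_means, var_of_means)      # 7/2 5/8
```

The mean of the 16 sample means equals µ, and their variance equals $\sigma^2 / n$, in agreement with the proof above.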

5.2 Distribution of the Sample Mean

(A) From a Normal Population with Known Population Variance, $\sigma^2$

Let $X_1, X_2, \dots, X_n$ be a random sample of size n taken from a population with Normal distribution having mean µ and variance $\sigma^2$, where $X_1, X_2, \dots, X_n$ are independent observations of X. Then $\bar{X} = \frac{1}{n}\left(X_1 + X_2 + \dots + X_n\right)$ follows a Normal distribution where $E(\bar{X}) = \mu$ and $\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$, i.e.

$$X \sim N\left(\mu, \sigma^2\right) \;\Rightarrow\; \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

Note: The distribution of $\bar{X}$ is exactly normal when X is known to be normal, regardless of the size of n.

Example 5.1
A machine produces cylindrical rods whose diameters are normally distributed with mean 1 cm and standard deviation 0.01 cm.
(i) If 4 of these rods are selected at random, calculate the probability that the mean of their diameters will exceed 0.99 cm.
(ii) Find the least sample size, n, required to ensure that there is a probability of more than 0.95 that the sample mean differs from the population mean diameter by less than 0.001 cm.

Solution

Let X be the diameter (in cm) of a cylindrical rod. Then $X \sim N\left(1, 0.01^2\right)$.

(i) The mean of the diameters of the 4 selected rods, $\bar{X} \sim N\left(1, \frac{0.01^2}{4}\right)$.

Using GC, $P(\bar{X} > 0.99) = 0.977$.

(ii) Now $\bar{X} \sim N\left(1, \frac{0.01^2}{n}\right)$ and since we need $P(-0.001 < \bar{X} - 1 < 0.001) > 0.95$,

$$P\left(\frac{-0.001}{0.01/\sqrt{n}} < Z < \frac{0.001}{0.01/\sqrt{n}}\right) > 0.95, \quad \text{where } Z = \frac{\bar{X} - 1}{0.01/\sqrt{n}}$$

$$P\left(-\frac{\sqrt{n}}{10} < Z < \frac{\sqrt{n}}{10}\right) > 0.95$$

$$P\left(Z < \frac{\sqrt{n}}{10}\right) > 0.975$$

$$\frac{\sqrt{n}}{10} > 1.96$$

$$n > 384.16$$

Therefore, the minimum sample size required is 385.

In the above example, (ii) shows us that we can achieve the level of “reliability” that we want by varying the size of the sample. We would also like to point out that when an unknown quantity such as the population mean, the variance or the sample size has to be found, a good approach is to standardise the normal random variable.
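The GC steps in Example 5.1 can be mirrored with `NormalDist` from Python's standard `statistics` module (available from Python 3.8). This is an illustrative sketch rather than a prescribed method; the variable names are arbitrary.

```python
from math import floor
from statistics import NormalDist

mu, sigma = 1.0, 0.01          # population mean and standard deviation (cm)

# (i) Mean of 4 rods: X-bar ~ N(1, 0.01^2 / 4), standard deviation sigma / sqrt(4).
x_bar = NormalDist(mu, sigma / 4 ** 0.5)
p = 1 - x_bar.cdf(0.99)
print(round(p, 3))             # 0.977

# (ii) Need P(|Z| < 0.001 * sqrt(n) / sigma) > 0.95, i.e. sqrt(n) / 10 > z,
# where z is the 97.5th percentile of N(0, 1).
z = NormalDist().inv_cdf(0.975)
bound = (z * sigma / 0.001) ** 2    # n must exceed this value
n_min = floor(bound) + 1            # least integer strictly greater than bound
print(n_min)                        # 385
```

The bound comes out slightly below 384.16 because the sketch uses the unrounded percentile instead of 1.96, but the least sample size is the same.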



(B) From Any Distribution with Known Population Variance, $\sigma^2$

Central Limit Theorem (CLT)
Let $X_1, X_2, \dots, X_n$ be a random sample of size n taken from a population with any non-normal distribution having mean µ and variance $\sigma^2$, where $X_1, X_2, \dots, X_n$ are independent observations of X. Then when n is large (n > 30), by Central Limit Theorem,

(i) the distribution of the sample mean $\bar{X} = \frac{1}{n}\left(X_1 + X_2 + \dots + X_n\right)$ is approximately normal and so

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \text{ approximately.}$$

(ii) the distribution of the sum $X_1 + X_2 + \dots + X_n$ is also approximately normal and so

$$X_1 + X_2 + \dots + X_n \sim N\left(n\mu, n\sigma^2\right) \text{ approximately.}$$

1. The Central Limit Theorem applies to any non-normal distribution, discrete or continuous, as long as n is sufficiently large.

2. Sometimes a smaller sample size (e.g. 25) may be considered large. It is important to note that the approximation gets better as n gets larger.

3. In theory, continuity correction (c.c.) is needed if X is a discrete random variable. But due to the negligible difference in the value of probability obtained with and without considering the c.c. $\left(\frac{1}{2n}\right)$, it is not necessary to apply c.c. even if X is discrete.

Example 5.2
A random sample of 40 observations is taken from a binomial distribution, $X \sim B(10, 0.6)$. Find the probability that the sample mean exceeds 5.5.

Solution

$X \sim B(10, 0.6)$

$$\mu = E(X) = 10 \times 0.6 = 6$$

$$\sigma^2 = \mathrm{Var}(X) = 10 \times 0.6 \times 0.4 = 2.4$$

Since n = 40 is large, then by Central Limit Theorem, $\bar{X} \sim N\left(6, \frac{2.4}{40}\right)$ approximately.

Therefore, $P(\bar{X} > 5.5) = 0.979$.

Here, we can see that X follows a binomial distribution. However, by CLT, $\bar{X} = \frac{X_1 + X_2 + \dots + X_{40}}{40}$ follows a normal distribution approximately.
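The normal approximation in Example 5.2 can be reproduced with `NormalDist` from Python's standard `statistics` module. This is a sketch of the CLT calculation only; as in the notes, no continuity correction is applied.

```python
from statistics import NormalDist

trials, prob = 10, 0.6      # X ~ B(10, 0.6)
n = 40                      # number of observations in the sample

mean = trials * prob                    # E(X) = 6
var = trials * prob * (1 - prob)        # Var(X) = 2.4

# By CLT, X-bar ~ N(6, 2.4 / 40) approximately.
x_bar = NormalDist(mean, (var / n) ** 0.5)
p = 1 - x_bar.cdf(5.5)
print(round(p, 3))          # 0.979
```

The same three steps (find E(X) and Var(X), divide the variance by n, then use the normal distribution) apply to any population distribution once n is large.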


Example 5.3
An electrical circuit contains components, one of which is faulty, and tests are conducted to locate the fault. The random variable X denotes the number of tests required to locate the fault, and it is given that $E(X) = 11$ and $\mathrm{Var}(X) = 2$. The cost C (in suitable units) of locating a fault depends in part on the number of tests required and is given by the formula $C = 5 + 2X$. Find the expectation and variance of C. Given that 100 circuits of the above type are to be tested, find the probability that the total cost will exceed 2750 units.

Solution

$$E(C) = E(5 + 2X) = 5 + 2E(X) = 27;$$

$$\mathrm{Var}(C) = \mathrm{Var}(5 + 2X) = 4\,\mathrm{Var}(X) = 8$$

Since n = 100 is large, then by Central Limit Theorem, we have

total cost, $T = C_1 + \dots + C_{100} \sim N(27 \times 100,\ 8 \times 100)$ approximately.

Therefore, $P(T > 2750) = 0.0385$ (3 sig. fig.).

Remark
In this example, although the population means and population variances of both X and C are known, their distributions are unknown.

Important Note: In the above sections (A) and (B), notice that we have considered problems where both the means and variances are known values. However, it must be noted that in most practical situations, both the mean and variance of the distribution are unknown. The next section (C) shall discuss the situation when the population variance is unknown.




(C) From a Population with Unknown Population Variance, $\sigma^2$

Let $X_1, X_2, \dots, X_n$ be a random sample of size n taken from a population with any distribution having mean µ and variance $\sigma^2$, where $X_1, X_2, \dots, X_n$ are independent observations of X.

Then we can have $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ either due to X being normally distributed or by CLT if X is not normally distributed but n is large. Upon standardising, we get $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$ exactly if X is normally distributed or approximately if CLT is used.

If $\sigma^2$ is unknown, we need to find a replacement for it. The obvious choice will be to use its unbiased estimator, $s^2$.

Recall that:
$$s^2 = \frac{n}{n-1}\,\sigma_x^2 = \frac{1}{n-1}\left[\sum x^2 - \frac{\left(\sum x\right)^2}{n}\right]$$

Hence, we can use this point estimate in place of the population variance. However, as the unbiased estimator is also a statistic and follows a certain distribution, this introduces error into the distribution of Z.

Thus, as opposed to $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$, we have instead $T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_\nu$, where $\nu = n - 1$.

$t_\nu$ denotes the t-distribution with ν degrees of freedom.

Another result tells us that $t_\nu \to N(0, 1)$ when n is large (> 30). Hence when n is large,

$$\frac{\bar{X} - \mu}{s/\sqrt{n}} \sim N(0, 1) \text{ approximately, i.e. } \bar{X} \sim N\left(\mu, \frac{s^2}{n}\right) \text{ approximately.}$$

For now, we shall only consider the case where n is large. The situation when n is small shall be dealt with in greater detail in the next topic.



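Example 5.4
A random sample of 110 tables is examined, and the masses x (kg) are summarised by $\sum x = 1759$ and $\sum x^2 = 198020$. Calculate the unbiased estimates of µ, the population mean, and $\sigma^2$, the population variance. Find the value of d given that there is a probability of 0.95 that the sample mean differs from the population mean by less than d kg.

Solution

From Example 4.2, $\bar{x} = \frac{1759}{110} = 15.99$ and $s^2 = \frac{1}{n-1}\left[\sum x^2 - \frac{\left(\sum x\right)^2}{n}\right] = 1558.64$.

Since X is of an unknown distribution and n = 110 is large, by CLT, $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ approximately. Since $\sigma$ is unknown, we use s in place of $\sigma$ and we have $\bar{X} \sim N\left(\mu, \frac{s^2}{n}\right)$ approximately.

Given that $P\left(\left|\bar{X} - \mu\right| < d\right) = 0.95$, we have

$$P\left(\frac{\left|\bar{X} - \mu\right|}{s/\sqrt{110}} < \frac{d}{s/\sqrt{110}}\right) = 0.95$$

$$P\left(|Z| < \frac{d}{s/\sqrt{110}}\right) = 0.95, \quad \left(\text{Since } n \text{ is large, } Z = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim N(0, 1) \text{ approx.}\right)$$

$$\frac{d}{s/\sqrt{110}} = z_{0.975} = 1.96$$

$$d = 7.38$$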
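The value of d in Example 5.4 can be checked with a short Python sketch using `NormalDist` from the standard `statistics` module. The variable names are arbitrary, and the sketch uses the unrounded 97.5th percentile instead of 1.96.

```python
from math import sqrt
from statistics import NormalDist

n, sum_x, sum_x2 = 110, 1759, 198020

# Unbiased estimate of the population variance from the summarised data.
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# P(|Z| < d / (s / sqrt(n))) = 0.95  =>  d = z * s / sqrt(n), with z = z_0.975.
z = NormalDist().inv_cdf(0.975)
d = z * sqrt(s2 / n)

print(round(s2, 2), round(d, 2))   # 1558.64 7.38
```

Because d is proportional to $1/\sqrt{n}$, halving d would require a sample about four times as large.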



Example 5.5
In a survey of 50 student leaders, the number of hours each of them spent on CCA per week, x, was obtained. The summarised data are as follows:

$\sum (x - 5) = 32$ and $\sum (x - 5)^2 = 400$.

(i) Find the unbiased estimates of the population mean and population variance of the number of hours each student leader spent on CCA per week.
(ii) Based on the above sample, estimate the probability that the average number of hours each student leader spent on CCA per week exceeds 6 hours.

Solution

(i) Let $y = x - 5$. Then $\sum y = 32$ and $\sum y^2 = 400$.

$$\bar{y} = \frac{1}{n}\sum y = \frac{32}{50} = 0.64.$$

$$s_y^2 = \frac{1}{n-1}\left[\sum y^2 - \frac{1}{n}\left(\sum y\right)^2\right] = \frac{1}{49}\left[400 - \frac{1}{50}\left(32\right)^2\right] = \frac{379.52}{49} = \frac{9488}{1225} = 7.7453 \text{ (5 sig. fig.)}$$

Let $\bar{x}$ and $s_x^2$ be the unbiased estimates of the required mean and variance respectively.

$\bar{x} = \bar{y} + 5 = 0.64 + 5 = 5.64$.

$s_x^2 = s_y^2 = \frac{9488}{1225} = 7.75$ (3 sig. fig.)

(ii) Since n = 50 is large, by Central Limit Theorem, $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{50}\right)$ approximately.

Since μ and σ² are unknown, we estimate them using $\bar{x}$ and $s^2$ respectively. Hence, we get the following estimates:

Mean of $\bar{X}$ = 5.64 and Variance of $\bar{X}$ = $\frac{7.7453}{50} = 0.15491$.

Hence, $P(\bar{X} > 6) = 0.180$ (3 sig. fig.)
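Both parts of Example 5.5 can be reproduced with the following Python sketch, again using `NormalDist` from the standard `statistics` module. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, shift = 50, 5
sum_y, sum_y2 = 32, 400            # sums of (x - 5) and (x - 5)^2

# (i) Unbiased estimates: shift the mean back, the variance is unchanged.
x_bar = sum_y / n + shift
s2 = (sum_y2 - sum_y ** 2 / n) / (n - 1)
print(round(x_bar, 2), round(s2, 4))   # 5.64 7.7453

# (ii) By CLT, X-bar ~ N(mu, sigma^2 / 50) approximately, with mu and sigma^2
# replaced by their estimates.
sample_mean_dist = NormalDist(x_bar, sqrt(s2 / n))
p = 1 - sample_mean_dist.cdf(6)
print(round(p, 3))                     # 0.18
```

The probability is an estimate only, since both parameters of the normal distribution were themselves estimated from the sample.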



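SUMMARY OF DISTRIBUTION OF $\bar{X}$

We consider 2 main cases: population variance known and unknown.

Case 1: X has mean $\mu$ and known variance $\sigma^2$

• X follows a Normal Distribution: $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ exactly, regardless of n.
• Distribution of X is unknown: $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ approximately, by CLT for large n.

Case 2: X has mean $\mu$ and unknown variance $\sigma^2$

• X follows a Normal Distribution: $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ regardless of n.
• Distribution of X is unknown: $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ by CLT for large n.

Variance unknown, so replace $\sigma^2$ with $s^2$, which causes a change in the distribution of the standardised $\bar{X}$:

$$\frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$

• n small (X normal): no change, $\frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$.
• n large: $t_{n-1} \to N(0, 1)$, so $\frac{\bar{X} - \mu}{s/\sqrt{n}} \sim N(0, 1)$ approximately.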



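Appendix

A1 Best Estimator

Example A1.1

If $X_1, X_2, X_3$ is a random sample taken from a population with mean $\mu$ and variance $\sigma^2$, determine which of the following estimators for $\mu$ are unbiased, and the best estimator.

$$T_1 = \frac{X_1 + X_2 + X_3}{3}; \qquad T_2 = \frac{X_1 + 2X_2 + 3X_3}{3}; \qquad T_3 = \frac{X_1 + 2X_2}{3}$$

Solution

$$E(T_1) = E\left(\frac{X_1 + X_2 + X_3}{3}\right) = \frac{1}{3}\left[3E(X)\right] = \mu$$

$$E(T_2) = E\left(\frac{X_1 + 2X_2 + 3X_3}{3}\right) = \frac{1}{3}\left[6E(X)\right] = 2\mu$$

$$E(T_3) = E\left(\frac{X_1 + 2X_2}{3}\right) = \frac{1}{3}\left[3E(X)\right] = \mu$$

Hence, $T_1$ and $T_3$ are unbiased estimators of $\mu$.

$$\mathrm{Var}(T_1) = \mathrm{Var}\left(\frac{X_1 + X_2 + X_3}{3}\right) = \frac{1}{9}\left[3\mathrm{Var}(X)\right] = \frac{\sigma^2}{3}$$

$$\mathrm{Var}(T_3) = \mathrm{Var}\left(\frac{X_1 + 2X_2}{3}\right) = \frac{1}{9}\left[\mathrm{Var}(X) + 4\mathrm{Var}(X)\right] = \frac{5\sigma^2}{9}$$

Since $\mathrm{Var}(T_1) < \mathrm{Var}(T_3)$, $T_1$ is the best estimator.

An estimator is said to be the best estimator if it is unbiased and has the smallest variance.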
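
The comparison in Example A1.1 can be sketched in code. For an estimator of the form $T = w_1X_1 + w_2X_2 + w_3X_3$ built from independent observations, $E(T) = (w_1 + w_2 + w_3)\mu$ and $\mathrm{Var}(T) = (w_1^2 + w_2^2 + w_3^2)\sigma^2$. The block below is an illustration only: the weights are as read from the example (with $T_3 = (X_1 + 2X_2)/3$), and the helper names `mean_coeff` and `var_coeff` are our own.

```python
from fractions import Fraction as F

# Weights (w1, w2, w3) of each estimator T = w1*X1 + w2*X2 + w3*X3
estimators = {
    "T1": (F(1, 3), F(1, 3), F(1, 3)),
    "T2": (F(1, 3), F(2, 3), F(3, 3)),
    "T3": (F(1, 3), F(2, 3), F(0)),
}

def mean_coeff(w):
    """Coefficient of mu in E(T): the sum of the weights."""
    return sum(w)

def var_coeff(w):
    """Coefficient of sigma^2 in Var(T): the sum of squared weights."""
    return sum(wi * wi for wi in w)

# Unbiased estimators have E(T) = mu, i.e. weights summing to 1
unbiased = {name: w for name, w in estimators.items() if mean_coeff(w) == 1}
# The best estimator is the unbiased one with the smallest variance
best = min(unbiased, key=lambda name: var_coeff(unbiased[name]))

for name, w in estimators.items():
    print(name, "E(T) =", mean_coeff(w), "* mu;", "Var(T) =", var_coeff(w), "* sigma^2")
print("best:", best)
```

Exact fractions are used so that the variance coefficients can be compared without rounding error.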



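A2 Unbiased Estimator of Population Variance

Proof

The sample variance is given by

$$\begin{aligned}
\sigma_x^2 &= \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}\left(X_i^2 - 2X_i\bar{X} + \bar{X}^2\right) \\
&= \frac{1}{n}\sum_{i=1}^{n}X_i^2 - 2\bar{X}\cdot\frac{1}{n}\sum_{i=1}^{n}X_i + \frac{1}{n}\sum_{i=1}^{n}\bar{X}^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}X_i^2 - 2\bar{X}\cdot\bar{X} + \bar{X}^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}X_i^2 - \bar{X}^2.
\end{aligned}$$

Then, using $E(X_i^2) = \mathrm{Var}(X_i) + [E(X_i)]^2 = \sigma^2 + \mu^2$ and $E(\bar{X}^2) = \mathrm{Var}(\bar{X}) + [E(\bar{X})]^2 = \frac{\sigma^2}{n} + \mu^2$,

$$\begin{aligned}
E\left(\sigma_x^2\right) &= E\left(\frac{1}{n}\sum_{i=1}^{n}X_i^2 - \bar{X}^2\right) \\
&= \frac{1}{n}E\left(\sum_{i=1}^{n}X_i^2\right) - E\left(\bar{X}^2\right) \\
&= \frac{1}{n}E\left(X_1^2 + \dots + X_n^2\right) - E\left(\bar{X}^2\right) \\
&= \frac{1}{n}\left[E\left(X_1^2\right) + \dots + E\left(X_n^2\right)\right] - E\left(\bar{X}^2\right) \\
&= \frac{1}{n}\cdot n\left(\sigma^2 + \mu^2\right) - \left(\frac{\sigma^2}{n} + \mu^2\right) \\
&= \sigma^2 + \mu^2 - \frac{\sigma^2}{n} - \mu^2 \\
&= \frac{n-1}{n}\sigma^2.
\end{aligned}$$

We can see from the workings above that the sample variance $\sigma_x^2$ is not an unbiased estimator of the population variance $\sigma^2$, since $E\left(\sigma_x^2\right) \neq \sigma^2$.

But we observe that

$$E\left(\frac{n}{n-1}\sigma_x^2\right) = \frac{n}{n-1}E\left(\sigma_x^2\right) = \frac{n}{n-1}\cdot\frac{n-1}{n}\sigma^2 = \sigma^2.$$

In other words, a multiple of the sample variance, denoted by $s^2 = \frac{n}{n-1}\sigma_x^2$, is an unbiased estimator of the population variance $\sigma^2$.

$s^2 = \frac{n}{n-1}\sigma_x^2$ is an unbiased estimator of population variance $\sigma^2$.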
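
The result of A2 can be checked exactly on a small case by averaging the sample variance over every possible sample. The sketch below is our own illustration, not part of the proof; the population `[2, 3, 4, 5]`, the sample size `n = 2`, and the helper name `biased_var` are assumptions chosen to keep the enumeration small.

```python
from fractions import Fraction
from itertools import product

population = [2, 3, 4, 5]  # small illustrative population (an assumption)
n = 2                      # sample size (an assumption)

def biased_var(values):
    """Variance with divisor len(values): mean of squares minus square of mean."""
    m = Fraction(sum(values), len(values))
    return Fraction(sum(v * v for v in values), len(values)) - m * m

sigma2 = biased_var(population)  # population variance

# Every ordered sample drawn with replacement is equally likely
samples = list(product(population, repeat=n))
expected_biased = sum(biased_var(s) for s in samples) / len(samples)
expected_unbiased = Fraction(n, n - 1) * expected_biased

print(sigma2, expected_biased, expected_unbiased)
```

Under these assumptions the average of $\sigma_x^2$ comes out as $\frac{n-1}{n}\sigma^2$, and rescaling by $\frac{n}{n-1}$ recovers $\sigma^2$.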


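A3 Sampling Distribution of the Sample Mean

A distribution of the sample mean $\bar{X}$ is a “distribution of a statistic (in this case a sample mean) over repeated sampling from a specified population” based on all possible random samples of size $n$ from a population. It can inform us of the degree of sample-to-sample variability we should expect due to chance.

Example A3.1

Suppose we have a population consisting of four numbers: 2, 3, 4, 5.

Here, the population mean $\mu = \frac{2 + 3 + 4 + 5}{4} = 3.5$.

Let’s take all possible samples of size $n = 2$ from this population (repetitions allowed) where for each sample, $X_1$ is the first observation and $X_2$ is the second observation and $\bar{X} = \frac{X_1 + X_2}{2}$.

| Sample No. | $X_1$ | $X_2$ | $\bar{x}$ | Sample No. | $X_1$ | $X_2$ | $\bar{x}$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | 2 | $\bar{x}_1 = 2$ | 9 | 4 | 2 | $\bar{x}_9 = 3$ |
| 2 | 2 | 3 | $\bar{x}_2 = 2.5$ | 10 | 4 | 3 | $\bar{x}_{10} = 3.5$ |
| 3 | 2 | 4 | $\bar{x}_3 = 3$ | 11 | 4 | 4 | $\bar{x}_{11} = 4$ |
| 4 | 2 | 5 | $\bar{x}_4 = 3.5$ | 12 | 4 | 5 | $\bar{x}_{12} = 4.5$ |
| 5 | 3 | 2 | $\bar{x}_5 = 2.5$ | 13 | 5 | 2 | $\bar{x}_{13} = 3.5$ |
| 6 | 3 | 3 | $\bar{x}_6 = 3$ | 14 | 5 | 3 | $\bar{x}_{14} = 4$ |
| 7 | 3 | 4 | $\bar{x}_7 = 3.5$ | 15 | 5 | 4 | $\bar{x}_{15} = 4.5$ |
| 8 | 3 | 5 | $\bar{x}_8 = 4$ | 16 | 5 | 5 | $\bar{x}_{16} = 5$ |

$\bar{x}_1, \bar{x}_2, \bar{x}_3, \dots, \bar{x}_{16}$ are different possible observations of $\bar{X}$.

By studying how the value of $\bar{x}_1, \bar{x}_2, \bar{x}_3, \dots, \bar{x}_{16}$ changes, we are studying the distribution of the sample mean $\bar{X}$. Observe that

- $\bar{x}$ is rarely exactly $\mu$. However most are close to (or cluster around) $\mu$.
- Extreme values of $\bar{x}$ are uncommon.
- You can determine the exact probability of obtaining a particular $\bar{X}$. For instance, $P(\bar{X} = 3) = \frac{3}{16}$.

The sampling distribution of $\bar{X}$ is as follows:

| $\bar{x}$ | 2 | 2.5 | 3 | 3.5 | 4 | 4.5 | 5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $P(\bar{X} = \bar{x})$ | $\frac{1}{16}$ | $\frac{2}{16}$ | $\frac{3}{16}$ | $\frac{4}{16}$ | $\frac{3}{16}$ | $\frac{2}{16}$ | $\frac{1}{16}$ |

Also, $E(\bar{X}) = 2\left(\frac{1}{16}\right) + 2.5\left(\frac{2}{16}\right) + 3\left(\frac{3}{16}\right) + 3.5\left(\frac{4}{16}\right) + 4\left(\frac{3}{16}\right) + 4.5\left(\frac{2}{16}\right) + 5\left(\frac{1}{16}\right) = 3.5$.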
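
The enumeration for $n = 2$ can be reproduced in a few lines of code. This is a sketch for illustration only, using Python's standard library; the variable names are our own, and probabilities are held as exact fractions (so $\frac{2}{16}$ is displayed in lowest terms as 1/8).

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [2, 3, 4, 5]

# Each ordered pair (X1, X2) is equally likely when repetitions are allowed
pairs = list(product(population, repeat=2))
counts = Counter(Fraction(x1 + x2, 2) for x1, x2 in pairs)

# Sampling distribution: P(X-bar = value) for each attainable value
dist = {value: Fraction(c, len(pairs)) for value, c in sorted(counts.items())}
expected = sum(value * p for value, p in dist.items())

for value, p in dist.items():
    print(float(value), p)
print("E(X-bar) =", expected)
```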



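If all possible samples of size $n = 3$ are taken from the population, the observations will be as follows:

| Sample No. | $X_1$ | $X_2$ | $X_3$ | $\bar{x}$ | Sample No. | $X_1$ | $X_2$ | $X_3$ | $\bar{x}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | 2 | 2 | 2 | 33 | 4 | 2 | 2 | 2.667 |
| 2 | 2 | 2 | 3 | 2.333 | 34 | 4 | 2 | 3 | 3 |
| 3 | 2 | 2 | 4 | 2.667 | 35 | 4 | 2 | 4 | 3.333 |
| 4 | 2 | 2 | 5 | 3 | 36 | 4 | 2 | 5 | 3.667 |
| 5 | 2 | 3 | 2 | 2.333 | 37 | 4 | 3 | 2 | 3 |
| 6 | 2 | 3 | 3 | 2.667 | 38 | 4 | 3 | 3 | 3.333 |
| 7 | 2 | 3 | 4 | 3 | 39 | 4 | 3 | 4 | 3.667 |
| 8 | 2 | 3 | 5 | 3.333 | 40 | 4 | 3 | 5 | 4 |
| 9 | 2 | 4 | 2 | 2.667 | 41 | 4 | 4 | 2 | 3.333 |
| 10 | 2 | 4 | 3 | 3 | 42 | 4 | 4 | 3 | 3.667 |
| 11 | 2 | 4 | 4 | 3.333 | 43 | 4 | 4 | 4 | 4 |
| 12 | 2 | 4 | 5 | 3.667 | 44 | 4 | 4 | 5 | 4.333 |
| 13 | 2 | 5 | 2 | 3 | 45 | 4 | 5 | 2 | 3.667 |
| 14 | 2 | 5 | 3 | 3.333 | 46 | 4 | 5 | 3 | 4 |
| 15 | 2 | 5 | 4 | 3.667 | 47 | 4 | 5 | 4 | 4.333 |
| 16 | 2 | 5 | 5 | 4 | 48 | 4 | 5 | 5 | 4.667 |
| 17 | 3 | 2 | 2 | 2.333 | 49 | 5 | 2 | 2 | 3 |
| 18 | 3 | 2 | 3 | 2.667 | 50 | 5 | 2 | 3 | 3.333 |
| 19 | 3 | 2 | 4 | 3 | 51 | 5 | 2 | 4 | 3.667 |
| 20 | 3 | 2 | 5 | 3.333 | 52 | 5 | 2 | 5 | 4 |
| 21 | 3 | 3 | 2 | 2.667 | 53 | 5 | 3 | 2 | 3.333 |
| 22 | 3 | 3 | 3 | 3 | 54 | 5 | 3 | 3 | 3.667 |
| 23 | 3 | 3 | 4 | 3.333 | 55 | 5 | 3 | 4 | 4 |
| 24 | 3 | 3 | 5 | 3.667 | 56 | 5 | 3 | 5 | 4.333 |
| 25 | 3 | 4 | 2 | 3 | 57 | 5 | 4 | 2 | 3.667 |
| 26 | 3 | 4 | 3 | 3.333 | 58 | 5 | 4 | 3 | 4 |
| 27 | 3 | 4 | 4 | 3.667 | 59 | 5 | 4 | 4 | 4.333 |
| 28 | 3 | 4 | 5 | 4 | 60 | 5 | 4 | 5 | 4.667 |
| 29 | 3 | 5 | 2 | 3.333 | 61 | 5 | 5 | 2 | 4 |
| 30 | 3 | 5 | 3 | 3.667 | 62 | 5 | 5 | 3 | 4.333 |
| 31 | 3 | 5 | 4 | 4 | 63 | 5 | 5 | 4 | 4.667 |
| 32 | 3 | 5 | 5 | 4.333 | 64 | 5 | 5 | 5 | 5 |

Observe that

- No $\bar{x}$ is exactly $\mu$. However most are close to (or cluster around) $\mu$.

In this case, $P(\bar{X} = 3) = \frac{10}{64} = \frac{5}{32}$. Can you write down the sampling distribution of $\bar{X}$?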
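
One way to check an answer to the question above is to enumerate the samples by computer. The sketch below is an illustration only and not part of the original notes; the function names `sampling_distribution` and `mean_and_variance` are our own. It also compares the variance of $\bar{X}$ with $\frac{\sigma^2}{n}$ for this population.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sampling_distribution(population, n):
    """Exact distribution of the sample mean over all ordered samples of
    size n drawn with replacement, each sample being equally likely."""
    samples = list(product(population, repeat=n))
    counts = Counter(Fraction(sum(s), n) for s in samples)
    return {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

def mean_and_variance(dist):
    """Mean and variance of a discrete distribution {value: probability}."""
    mean = sum(v * p for v, p in dist.items())
    var = sum((v - mean) ** 2 * p for v, p in dist.items())
    return mean, var

population = [2, 3, 4, 5]
# Population mean and variance: each value has probability 1/4
mu, sigma2 = mean_and_variance({x: Fraction(1, len(population)) for x in population})

dist3 = sampling_distribution(population, 3)
mean3, var3 = mean_and_variance(dist3)

for value, p in dist3.items():
    print(value, p)
print("E(X-bar) =", mean3, " Var(X-bar) =", var3, " sigma^2 / 3 =", sigma2 / 3)
```

Changing the second argument of `sampling_distribution` gives the distribution for other sample sizes drawn from the same population.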

