Fundamentals of Statistical
Inference
compiled by
Srilakshminarayana, G., M.Sc., Ph.D.
Shri Dharmasthala Manjunatheswara Institute
for Management Development
#1 Chamundi Hill Road, Siddhartha Nagar, Mysore-570011
(Private Circulation Only, September 2012)
Table of Contents

Important note about the material  1

1 Estimation  2
  1.1 Importance of estimation in management  2
  1.2 Key terms in estimation  7
  1.3 Determination of sample size  9
  1.4 Point estimator for population mean  9
      1.4.1 Steps in obtaining an estimate of population mean  11
  1.5 Point estimator for population variance  11
      1.5.1 Steps in calculating an estimate of population variance  11
  1.6 Role of sampling  12
  1.7 Sampling distribution of a Statistic  12
  1.8 Sampling error  13
  1.9 Point estimator for a population proportion  14
  1.10 Finding the best estimator  14
  1.11 Drawback of point estimate  15
  1.12 Interval estimation  15
  1.13 Probability of the true population parameter falling within the interval estimate  16
  1.14 Interval estimates and confidence intervals  18
  1.15 Relationship between confidence level and confidence interval  19
  1.16 Using sampling and confidence interval estimation  19
  1.17 Interval estimation of population mean (σ known)  20
  1.18 Using the Z statistic for estimating population mean  20
  1.19 Using finite correction factor for the finite population  25
  1.20 Interval estimation for difference of two means  27
  1.21 Confidence interval estimation of the population mean (σ unknown)  28
  1.22 Checking the assumptions  29
  1.23 Concept of degrees of freedom  30
  1.24 Confidence interval estimation for population proportion  32
  1.25 Estimation of the sample size  33
  1.26 Sample size for estimating population mean  35
  1.27 Sample size for estimating population proportion  40
  1.28 Sample size for an interval estimate of a population proportion  41
  1.29 Further discussion of sample size determination for a proportion  42

2 Testing of Hypothesis-Fundamentals  44
  2.1 Introduction  44
  2.2 Formats of Hypothesis  47
  2.3 The rationale for hypothesis testing  48
  2.4 Steps in hypothesis testing  49
  2.5 One tail and two tail tests  55
      2.5.1 One tailed test  56
      2.5.2 Two tailed test  56
  2.6 Critical region and non-critical region  56
  2.7 Errors in hypothesis testing  57
  2.8 Test for single mean  59
      2.8.1 Z-test for single mean (σ known)  59
      2.8.2 Testing Using Excel  60
      2.8.3 t-test for single mean (σ unknown)  61
      2.8.4 Testing Using Excel  62
  2.9 Test for single proportion  63
      2.9.1 Testing Using Excel  64
  2.10 Comparison and conclusion  65

3 Testing of Hypothesis-Two Sample Problem  67
  3.1 Introduction  67
  3.2 Assumptions  67
  3.3 Test for difference of means: Z-test  68
      3.3.1 Testing Using Excel: σ₁² = σ₂² = σ² (known)  69
      3.3.2 Testing Using Excel: Unequal Variances (Known)  70
  3.4 Test for difference of means: t-test  71
      3.4.1 Testing Using Excel: σ₁² = σ₂² = σ² (unknown)  72
      3.4.2 Testing Using Excel: Unequal Variances (Unknown)  73
  3.5 Test for difference of two proportions  74
      3.5.1 Testing Using Excel: Test for Difference of Proportions  75
  3.6 Test for dependent samples  76
      3.6.1 Testing Using Excel  77
  3.7 Test for difference of variances: F-test  78
  3.8 Comparison and conclusion  78

4 Chi-Square Tests  80
  4.1 Introduction  80
      4.1.1 Chi-square test for significance of a population variance  81
      4.1.2 Chi-square test for goodness of fit  81
      4.1.3 Chi-square test for independence of attributes  82
  4.2 Comparison and conclusion  82

5 Analysis of Variance (ANOVA)  84
  5.1 Introduction  84
  5.2 One way ANOVA  85
      5.2.1 Assumptions  85
      5.2.2 Steps for computing the F test value for ANOVA  86
  5.3 Two-Way Analysis of Variance  88
      5.3.1 Assumptions for the Two-Way ANOVA  90
  5.4 The Scheffe Test and the Tukey Test  91
      5.4.1 Scheffe Test  91
  5.5 Tukey Test  91

6 Correlation and Regression  93
  6.1 Testing significance of correlation: ρ = 0  93
  6.2 Testing significance of correlation: ρ = ρ₀  94
  6.3 Testing significance of correlation: ρ₁ = ρ₂  95
  6.4 Testing significance of regression model  96

References  97
Important note about the material
This material is for internal circulation only and is not a substitute for a textbook.
It contains only the fundamental steps to be followed when inferential tools are used
to analyze data. It is restricted to the needs of the present batch and does not contain
complete information about each topic. The complete information can be found in the
prescribed textbook and in the other references.
Chapter 1
Estimation
1.1 Importance of estimation in management
Everyone makes estimates. When you are ready to cross a street, you estimate the
speed of any car that is approaching, the distance between you and that car, and
your own speed. Having made these quick estimates, you decide whether to wait,
walk, or run. All managers must make quick estimates too. The outcome of these
estimates can affect their organizations as seriously as the outcome of your decision as
to whether to cross the street. University department heads make estimates of next
session's enrollment in Statistics. Credit managers estimate whether a purchaser will
eventually pay his bills. Prospective home buyers make estimates concerning the
behavior of interest rates in the mortgage market. All these people make estimates
without worrying about whether they are scientific, but with the hope that the estimates
bear a reasonable resemblance to the outcome. Managers use estimates because, in all
but the most trivial decisions, they must make rational decisions without complete
information and with a great deal of uncertainty about what the future will bring.
How do managers use sample statistics to estimate population parameters? The
department head attempts to estimate enrollments next fall from current enrollments
in the same courses. The credit manager attempts to estimate the creditworthiness of
prospective customers from a sample of their past payment habits. The home buyer
attempts to estimate the future course of interest rates by observing the current
behavior of those rates. In each case, somebody is trying to infer something about a
population from information taken from a sample. This chapter introduces methods
that enable us to estimate with reasonable accuracy the population proportion (the
proportion of the population that possesses a given characteristic) and population
mean. To calculate the exact proportion or the exact mean would be an impossible
goal. Even so, we will be able to make an estimate, make a statement about the error
that will probably accompany the estimate, and implement some controls to avoid as much
of the error as possible. As decision makers, we will be forced at times to rely on
blind hunches. Yet in other situations, in which information is available and we apply
statistical concepts, we can do better than that.
Let us start with a small discussion on why a management student should study
statistical methods to estimate unknown quantities. Estimates are made at all levels
of management: low, middle, and high. At any level, we first understand the present
carefully, then look into the past to see what has happened, list all the options the
past suggests, and choose the best among them. The option that best suits the present
is taken as the solution.
For example, the manager of a production unit wishes to estimate the items to be
produced for the current year and depending on his estimate he wishes to place an
order for the raw materials. He consults his records and looks at the items produced
and the raw materials used to produce them. Finally, he uses his experience, decides
on the items to be produced for the current year and, depending on the estimate,
prepares an order for the raw materials. But what is the guarantee that
the value he estimated is free of error? How can he justify that the actual requirement
is close to the value he estimated? There is a chance that the value he estimated using
his experience may be an overestimate or an underestimate. How can he convince his
boss that the value he chose will yield the organization better profits? If everything
goes fine, no one will blame him. It would have been better if life were free of
uncertainty, but it is not. The manager must therefore account for the uncertainty
associated with an estimate obtained from experience. At this stage one can argue
that, being experienced, he can specify a range instead of a single value. The statement
could be, "Maybe this time the requirement will lie between 10000 and 15000." Even
now there is uncertainty, because the word "maybe" signals it. One can continue the
argument, but what we finally want is a statement of the form: the requirement for
the current year lies between 10000 and 15000, and the chance that it lies outside this
range is 0.05. How do we get this chance of 0.05? From the systematic procedures
available in statistics. The manager needs such a statement because he has to report
to his boss that the requirement for the current year lies between limits A and B. The
boss is most concerned with satisfying the needs of the customers, and if anything
goes wrong it is the boss who will be targeted first. To avoid this, the manager can
use the statistical techniques available and provide the range along with the associated
chance. This is again done by taking the past data into consideration. Systematic
construction requires understanding the past carefully and choosing a tool appropriate
for the given situation. The technique must be chosen based on the study variable
under consideration, because the tools used for a quantitative variable cannot be
applied to a qualitative variable without proper adjustment.
The example discussed above is from a production unit. Similarly, let us consider
marketing. The sales executive has to report to his boss the number of packets of
oil he will sell this month. What will he do when his boss asks him about this? He
immediately says that he will sell 150 packets of oil this month. How did he say this?
He didn't use statistics to do this; he used his experience. There is no point in
opening an Excel sheet and running a statistical procedure to produce this number.
He used his common sense, past experience, and knowledge of market conditions. He
is sure that the market needs at least 150 packets this month, and he already sold 145
packets last month. He also has complete knowledge of his competitors' sales in the
market. Taking all these factors into consideration, he could easily estimate the current month's
sales. Let us consider another example. Suppose that this time the sales executive
has been promoted to sales manager. Now he has to estimate the sales of the entire
region. The problem is that he is now a manager, not a sales executive. He has to
take the data from the sales executives of the entire region and then estimate the
sales for the current year. Depending on this estimate, he has to build a strategy
to increase sales. In the previous case he could get by on experience and common
sense. But now he is a manager and cannot take that risk. He can still use his
experience, but this time only to develop a proper strategy. He should take the help
of statistical methods to give a proper estimate and to construct a better strategy.
What will he do? He will take the data from the sales executives, take the average
of all the values, adjust it according to the market conditions, and finally give an
estimate. Is it a good estimate? What adjustment should he make to the average so
that the estimate convinces his boss? The answer is very simple.
According to the statistical theory, the sample average best estimates the population
average. Here the population average is the sales for the current year and the sample
average is the value he calculated after obtaining the data from his executives. What
about the adjustment? The adjustment is to construct an interval associated with a
probability value, take a value within the interval and consider it as an estimate.
Let us consider the case in Human resource management (HRM). Suppose that
the HR manager wishes to know about the performance of the new appraisal system
developed to appraise the employees. Since the organization has thousands of
employees, it is apparent that she cannot take the opinion of all of them. She has
to take a sample of employees and consider their opinions. Here the statistic is the
sample proportion of employees who are against the system, an estimator of the
population proportion. The variable under consideration is a qualitative variable and
the appropriate estimator is the sample proportion.
Management is a discipline which uses statistical tools to support the decisions
relating to various business situations. Most of the time the decision maker is left
with some amount of data relating to the given situation, on which he is supposed
to take a decision. It is always desirable to use the data obtained and take an
appropriate decision. One important aspect of decision making is estimation, which
is a part of statistical inference. Estimation is a systematic way of understanding
the behavior of unknown population characteristics based on a sample. These
characteristics include all the descriptive measures of a properly defined population,
but most of the time we are interested in the population mean and variance. These
are the characteristics that play an important role in making decisions. It is very
important to note that the mean should always be accompanied by the variance: the
mean measures central tendency and the variance measures dispersion. To estimate
these characteristics, we use the sample data gathered from the defined population.
The sample is selected as a true representative of the population chosen for the study;
note that care has to be taken while selecting the sample. Coming back to estimation,
we use the sample characteristics to estimate the population characteristics: the
sample mean and variance are used to estimate the population mean and variance.
Two types of estimation have been studied formally by researchers: point estimates and interval
estimates. A point estimate is the value of the statistic for a given sample. We
use sample statistics as estimators to estimate the population parameters. These
estimators are functions of the sample i.e., they produce different values for different
samples. Each value is considered as the estimate of the parameter. Point estimates
obtained for different samples put together constitute sampling distribution of the
statistic. The usual understanding in estimation is that, for sufficiently large samples,
these sample means, when plotted, produce a normal curve. This basic result is
very important for constructing an interval estimate. Another important aspect of
point estimation is the associated sampling error of the statistic. When we obtain
the point estimate from a sample, it is equally important to obtain the sampling
error or standard deviation of the statistic. This sampling error gives the amount of
fluctuation that can be allowed below and above the estimate.
The purpose of any random sample is to estimate properties of a population
from the data observed in the sample. The mathematical procedures appropriate
for performing this estimation depend on which properties are of interest and which
type of random sampling scheme is used. Note that the sampling scheme has to be
selected appropriately for a given situation. The decision maker has to take care of
the assumptions made at the time of selecting the sampling scheme. This is very im-
portant because the assumptions of the mathematical model that will be used in the
later stages should coincide with the assumptions made at the time of selecting the
sample. If this is not taken care of, the results obtained may not be reliable. Along
with this, another aspect that plays an important role is sampling error. Sampling
error is the inevitable result of basing an inference on a random sample rather than
on the entire population.
1.2 Key terms in estimation

1. Population: Group of objects or individuals that possess the assumed charac-
teristics under study. This group can be finite or infinite.
2. Sample: Group of objects or individuals that possess the same characteristics
as that of population, taken for enumeration and further analysis. This group
is considered as the true representative of the entire population under study.
3. Parameter: Unknown characteristics of the population under study such as
population mean, median, mode, standard deviation etc.
4. Statistic: Characteristics of the sample such as sample mean, median, mode,
standard deviation etc.
5. Estimator: Any statistic, which is a function of sample values, used to estimate
a population parameter.
6. Estimate: An estimate is a specific value of the estimator for a given sample.
7. Point estimate: A point estimate is a numerical value, a best guess of a
population parameter, based on the data in a sample.
8. Estimation error: The estimation error is the difference between the point
estimate and the true value of the population parameter being estimated.
9. Interval estimate: An interval estimate is an interval around the point esti-
mate, calculated from the sample data, where we strongly believe the true value
of the population parameter lies.
10. Unbiased estimate: An unbiased estimate is a point estimate such that the
mean of its sampling distribution is equal to the true value of the population
parameter being estimated.
11. Efficiency: Another desirable property of a good estimator is that it be ef-
ficient. Efficiency refers to the size of the standard error of the statistic. If
we compare two statistics computed from samples of the same size and try to decide
which one is the more efficient estimator, we would pick the statistic that has
the smaller standard error, or standard deviation of the sampling distribution.
12. Sufficiency: An estimator is sufficient if it makes so much use of the infor-
mation in the sample that no other estimator could extract from the sample
additional information about the population parameter being estimated.
13. Consistency: A point estimator is said to be consistent if its value tends to
become closer to the population parameter as the sample size increases.
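The unbiasedness and consistency properties above can be illustrated with a short simulation. This is only a sketch: the population (10,000 normal values) and the sample sizes are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(42)

# Hypothetical population of 10,000 values (arbitrary, for illustration only).
population = [random.gauss(50, 10) for _ in range(10_000)]
pop_mean = statistics.mean(population)

def sample_means(n, reps=2000):
    """Sample means of `reps` repeated random samples of size n."""
    return [statistics.mean(random.sample(population, n)) for _ in range(reps)]

# Unbiasedness: the sampling distribution of the mean is centred on the
# population mean.
center = statistics.mean(sample_means(30))

# Consistency: the sample mean concentrates around the parameter as n grows,
# so its spread (standard error) shrinks.
spread_small_n = statistics.stdev(sample_means(10))
spread_large_n = statistics.stdev(sample_means(200))

print(round(pop_mean, 1), round(center, 1))
print(round(spread_small_n, 2), round(spread_large_n, 2))
```

The same simulation, run with the biased 1/n variance formula instead of the mean, would show its centre falling below the population variance.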
1.3 Determination of sample size
There are several ways to estimate an unknown characteristic of a population. In
this compiled work we discuss only parametric estimation; interested readers can look
into standard books for other methods such as non-parametric estimation, robust
estimation, etc. In parametric estimation we mainly talk about population
characteristics such as the mean, variance/standard deviation, and proportion. We
first discuss determination of sample size in detail and then proceed to estimation
procedures. At an intermediate stage, i.e. after collecting the sample from the
population under study, we look forward to understanding the behavior of the
population through the characteristics estimated from the sample. Hence one has
to note at this point that the sample taken plays an important role in studying the
population. Now the question is: what should the sample size be? This is an
interesting question, which does not have a ready-made answer, and it is an important
step before the survey. Note that sampling error decreases as the sample size increases.
We also use the fact that the larger the population variance, the larger the sample
size needed to achieve a given degree of accuracy.
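A sketch of how this plays out numerically (the population standard deviation σ and the margin of error below are assumed values, not taken from the text):

```python
import math

sigma = 12.0  # assumed population standard deviation (illustrative)

# The standard error of the mean, sigma / sqrt(n), halves each time n quadruples.
for n in (25, 100, 400):
    print(n, sigma / math.sqrt(n))

# Sample size needed so that a 95% margin of error (1.96 * SE) is at most e:
e = 1.5
n_needed = math.ceil((1.96 * sigma / e) ** 2)
print(n_needed)
```

Halving the margin of error quadruples the required sample size, which is why accuracy targets drive survey cost so quickly.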
Determining the best sample size is not just a statistical decision. Statisticians
can tell you how the standard error behaves as you increase or decrease the sample
size, and the market researchers can tell you what the cost of taking more or larger
samples will be. But it is the decision maker who must use judgement to combine
these two inputs and make a sound managerial decision.
1.4 Point estimator for population mean
Definition 1. Point estimator: A sample statistic that is calculated using sample
data to estimate the most likely value of
the corresponding unknown population parameter is termed as point estimator, and the
numerical value of the estimator is termed as point estimate. A point estimate consists
of a single sample statistic that is used to estimate the true value of a population
parameter.
For example, the sample mean X̄ is a point estimate of the population mean μ,
and the sample variance S² is a point estimate of the population variance σ². On
many occasions estimating the population mean is useful in business research. For
example:

1. The manager of human resources in a company might want to estimate the
average number of days of work an employee misses per year because of illness.
If the firm has thousands of employees, direct calculation of a population mean
such as this may be practically impossible. Instead, a random sample of em-
ployees can be taken, and the sample mean number of sick days can be used to
estimate the population mean.
2. Suppose that another company developed a new process for prolonging the
shelf life of a loaf of bread. The company wants to be able to date each loaf for
freshness, but company officials do not know exactly how long the bread will
stay fresh. By taking a random sample and determining the sample mean shelf
life, they can estimate the average shelf life for the population of bread.
3. As the cellular telephone industry matures, a cellular telephone company is
rethinking its pricing structure. Users appear to be spending more time on
the phone and are shopping around for the best deals. To do better planning,
the cellular company wants to ascertain the average number of minutes of time
used per month by each of its residential users but does not have the resources
available to examine all monthly bills and extract the information. The company
decides to take a random sample of customer bills and estimate the population
mean from sample data. A researcher for the company takes a random sample
of 85 bills for a recent month and from these bills computes a sample mean
of 510 min. This sample mean, which is a statistic, is used to estimate the
population mean, which is a parameter. If the company uses the sample mean
of 510 min as an estimate for the population mean, then the sample mean is
used as a point estimate.
4. A tire manufacturer developed a new tire designed to provide an increase in
mileage over the firm's current line of tires. To estimate the mean number of
miles provided by the new tires, the manufacturer selected a sample of 120 new
tires and observed a sample mean of 36,500 miles.
In all the above examples, note that the statistic (the sample mean) is a function of the
sample drawn from the population under study and the numerical value assumed by
this statistic is an estimate of the population mean. (Observe the difference between
an estimator and an estimate).
1.4.1 Steps in obtaining an estimate of population mean
1. Draw a sample from the population under study.
2. Find the total of all the observations in the sample.
3. Divide the total by the number of observations.
4. The resultant value is the sample mean, which is taken as the estimate of the
population mean.
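The four steps can be sketched in a few lines of Python (the sample values here are made up for illustration):

```python
sample = [12, 15, 11, 14, 13, 15, 10, 14]   # step 1: the drawn sample
total = sum(sample)                          # step 2: total of all observations
mean_estimate = total / len(sample)          # step 3: divide by the number of observations
print(mean_estimate)                         # step 4: point estimate of the population mean
```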
1.5 Point estimator for population variance

The estimation of population variance is an important step in analyzing the sample
drawn from the population under study. We use sample variance to estimate the
population variance. But sample variance is not an unbiased estimator of population
variance. So we modify the formula used to calculate the sample variance. The
formula to calculate the sample variance is given by

    s² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)²

In order to get an unbiased estimator, one has to change 1/n to 1/(n − 1). The
resultant is called the mean square error, which gives an unbiased estimator of the
population variance.
1.5.1 Steps in calculating an estimate of population variance
1. Calculate the mean of the sample drawn.
2. Compute the deviation of all the observations from the mean.
3. Square the deviations and obtain the total.
4. Divide the total obtained in step 3 by n − 1.
Note 1. The above formulae for the mean and variance apply when individual
observations are available. If one is working with a frequency distribution, then the
frequencies have to be included in calculating the mean and variance.
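The steps above, with the n − 1 divisor, can be sketched as follows (the sample values are hypothetical; the result agrees with `statistics.variance`, which also divides by n − 1):

```python
import statistics

sample = [12, 15, 11, 14, 13, 15, 10, 14]      # hypothetical sample
n = len(sample)

mean = sum(sample) / n                          # step 1: sample mean
deviations = [x - mean for x in sample]         # step 2: deviations from the mean
total_sq = sum(d * d for d in deviations)       # step 3: sum of squared deviations
variance_estimate = total_sq / (n - 1)          # step 4: divide by n - 1 (unbiased)

print(variance_estimate)
```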
1.6 Role of sampling
In order to understand the population characteristics like mean, variance etc. it is
very important to draw a sample which is a true representative of the population.
A proper sampling design has to be adopted before drawing a sample. A sampling
frame should be constructed and then checked against the population. Care should
be taken to decrease the non-response rate. It should be noted that a random sample
will estimate the population parameters better than a non-random sample. In order
to get a better estimate, it is also important to ensure that the sample is free of any
sort of bias. The questionnaire framed to collect the responses should be tested using
a pilot survey before the actual survey. One has to note that the pilot survey has to
be framed in such a way that it resembles the actual survey, and it should give better
insight into the resources needed to conduct the actual survey. An interesting point is
that a smaller sample that is a true representative gives more satisfactory results than
a larger sample that is not a true representative of the population. Another interesting
aspect of sampling is that the belief that larger populations need larger samples is not
always valid. The sample should be taken depending on the situation and the
objectives.
1.7 Sampling distribution of a Statistic
Sampling distribution is the underlying probability distribution of the statistic used
for the study. This is constructed by taking several samples from the population.
For example, a sampling distribution of sample mean is constructed by taking as
many samples as possible from the population and by calculating sample mean for
all the samples. The set of all these values constitutes the sampling distribution of
the sample mean. Theoretically, it has been shown that the sampling distribution of
the mean is either normal (by the central limit theorem: finite known variance, larger
sample sizes) or a t-distribution (when the assumption of normality is satisfied: small
sample sizes). When the assumption of normality is not satisfied, the sampling
distribution of the sample mean can be approximated by the normal law, using the
central limit theorem, for sufficiently large sample sizes. The sampling distribution
of the sample variance, or mean square error, is related to the chi-square distribution
(discussed in detail in Chapter 4).
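A small simulation illustrates the central limit theorem claim: even for a markedly non-normal (exponential) population, the sampling distribution of the mean is centred on the population mean with spread close to σ/√n. The population and sizes here are assumptions made for illustration.

```python
import random
import statistics

random.seed(7)

# Exponential population with mean 1 and standard deviation 1.
def sample_mean(n):
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

# Build the sampling distribution of the mean from 5,000 samples of size 50.
means = [sample_mean(50) for _ in range(5000)]

center = statistics.mean(means)    # close to the population mean, 1.0
spread = statistics.stdev(means)   # close to sigma / sqrt(n) = 1 / sqrt(50)

print(round(center, 3), round(spread, 3))
```

Plotting `means` as a histogram would show the familiar bell shape despite the skewed parent population.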
1.8 Sampling error
After drawing a sample, it is important to study the sampling error. For this, the
decision maker has to find the standard error of the estimator used to estimate the
population parameter. What, then, is the relation between the standard error and the
sampling error? Note that the sample is drawn to understand the behaviour of
population characteristics (the mean, median, etc.), which are studied through their
estimators computed from the sample. Obviously, if the sampling error is large, this
will be reflected in the standard error of the estimator. Also note that the reciprocal
of the standard error gives the precision of the estimator. This is because it is expected
that the absolute difference between the true population characteristic and the sample
estimator is less than ε, where ε depends on the standard error. Refer to the section
on determination of sample size to understand this better.
Sampling variation is the price we pay for working with a sample rather than the
population.
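As a sketch, the standard error of the sample mean is estimated by s/√n. The data below are hypothetical monthly usage figures in the spirit of the cellular-bills example earlier in the chapter:

```python
import math
import statistics

sample = [510, 480, 530, 495, 520, 505, 490, 515]   # hypothetical monthly minutes
n = len(sample)

s = statistics.stdev(sample)        # sample standard deviation (n - 1 divisor)
standard_error = s / math.sqrt(n)   # estimated sampling error of the mean

print(round(statistics.mean(sample), 2), round(standard_error, 2))
```

The point estimate would then be reported together with this standard error, which quantifies the fluctuation to be allowed above and below it.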
1.9 Point estimator for a population proportion
When the underlying variable is a qualitative variable, one is interested in studying
the proportion of individuals who satisfy a particular attribute. For example, the
sales manager may be interested in studying the proportion of individuals who give
more importance to quality than cost. Here, he may confine to the customers who
are regular in purchasing from his store. For this properly defined population, the
parameter is the proportion (denoted by P) and the sample proportion (denoted by p̄)
is its unbiased estimator. To calculate the sample proportion, one has to define
the random variable under study properly. Then, count the individuals who satisfy
the attribute (denote this count by X) and take the ratio of X to n, the sample size, to
get the estimate. Note that the sampling distribution of the sample proportion can be
approximated by the normal distribution. But the exact probability distribution used
to model the number of individuals who fall under a particular category is the binomial
distribution.
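A minimal sketch of this calculation; the counts X and n below are made-up values for illustration:

```python
# Point estimate of a population proportion: X individuals in the sample
# satisfy the attribute, out of n sampled. Both counts are assumptions.
X = 132          # number who value quality over cost (assumed)
n = 400          # sample size (assumed)

p_hat = X / n    # unbiased point estimate of the population proportion P
print(p_hat)     # 0.33
```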
1.10 Finding the best estimator
A given sample statistic is not always the best estimator of its analogous population
parameter. Consider a symmetrically distributed population in which the values of
the median and the mean coincide. In this instance, the sample mean would be
an unbiased estimator of the population median. Also, the sample mean would be a
consistent estimator of the population median because, as the sample size increases,
the value of the sample mean tends to come very close to the population median.
And the sample mean would be a more efficient estimator of the population median
than the sample median itself because, in large samples, the sample mean has a smaller
standard error than the sample median. At the same time, the sample median in a
symmetrically distributed population would be an unbiased and consistent estimator
of the population mean, but not the most efficient estimator, because in large samples
its standard error is larger than that of the sample mean.
1.11 Drawback of point estimate
The drawback of a point estimate is that no information is available regarding its
reliability, i.e., how close it is to the true population parameter. In fact, the probability
that a single sample statistic exactly equals the population parameter is extremely
small. For this reason, point estimates are rarely used alone to estimate population
parameters. It is better to offer a range of values within which the population
parameter is expected to fall, so that the reliability (probability) of the estimate can be
measured. This is the purpose of interval estimation.
1.12 Interval estimation
In most cases, a point estimate does not provide information about how close
the estimate is to the population parameter unless it is accompanied by a statement of
the possible sampling error involved, based on the sampling distribution of the statistic.
It is therefore important to know the precision of an estimate before depending on
it to make a decision. Thus, decision-makers prefer to use an interval estimate (i.e.,
a range of values defined around a sample statistic) that is likely to contain the
population parameter value.

Interval estimation is a rule for calculating two numerical values, say L and U, that
create an interval [L, U] intended to contain the population parameter of interest. The
probability that this interval contains the parameter is commonly referred to as the
confidence coefficient and denoted by (1 − α). It is also important to state how
confident one should be that the interval estimate contains the parameter value. Hence
an interval estimate of a population parameter is a confidence interval with a statement
of confidence (probability) that the interval contains the parameter value. In other
words, a confidence interval estimate is an interval of values computed from sample
data that is likely to contain the true population parameter value.
Suppose the marketing research director needs an estimate of the average life in
months of car batteries his company manufactures. We select a random sample of
200 batteries, record the car owners' names and addresses as listed in store records,
and interview these owners about the battery life they have experienced. Our sample
of 200 users has a mean battery life of 36 months. If we use the point estimate of the
sample mean X̄ as the best estimator of the population mean μ, we would report that
the mean life of the company's batteries is 36 months. But the director also asks for a
statement about the uncertainty likely to accompany this estimate, that
is, a statement about the range within which the unknown population mean is likely
to lie. To provide such a statement, we need to find the standard error of the mean.
The general form of an interval estimate is as follows:

Point estimate ± Margin of error

The purpose of an interval estimate is to provide information about how close the
point estimate is to the value of the population parameter. The general form of an
interval estimate of a population mean is

X̄ ± Margin of error

The general form of an interval estimate of a population proportion is

p̄ ± Margin of error

The sampling distributions of X̄ and p̄ play key roles in computing these interval
estimates.
1.13 Probability of the true population parameter
falling within the interval estimate

To begin to solve this problem, we should review the relevant concepts of the normal
probability distribution: specific portions of the area under the normal curve are
located between plus and minus any given number of standard deviations from the
mean. Fortunately, we can apply these properties to the standard error of the mean
and make a statement about the range of values used to
make an interval estimate. Note that if we select and plot a large number of sample
means from a population, the distribution of these means will approximate the normal
curve. Furthermore, the mean of the sample means will be the same as the population
mean. Our sample size of 200 (in the battery example) is large enough that we can apply
the central limit theorem. To measure the spread, or dispersion, in our distribution
of sample means, we can use the following formula and calculate the standard error
of the mean:

Standard error of the mean for an infinite population:

σX̄ = σ/√n, where σ is the standard deviation of the population.

Suppose we have already estimated the standard deviation of the population of the
batteries and reported that it is 10 months. Using this standard deviation, we can
calculate the standard error of the mean:

σX̄ = σ/√n = 10/√200 = 0.707 month

We could now report to the director that our estimate of the life of the company's
batteries is 36 months, and the standard error that accompanies this estimate is
0.707. In other words, the actual mean life for all the batteries may lie somewhere
in the interval estimate of 35.293 to 36.707 months. This is helpful but insufficient
information for the director. Next we need to calculate the chance that the actual life
will lie in this interval or in other intervals of different widths that we might choose,
such as ±2σX̄ (2 × 0.707), ±3σX̄ (3 × 0.707), and so on.

The probability is 0.955 that the mean of a sample of size 200 will be within ±2
standard errors of the population mean. Stated differently, 95.5 percent of all the
sample means are within ±2 standard errors of μ, and hence μ is within ±2 standard
errors of 95.5 percent of all the sample means. Theoretically, if we select
1,000 samples at random from a given population and then construct an interval of
±2 standard errors around the mean of each of these samples, about 955 of these
intervals will include the population mean. Similarly, the probability is 0.683 that
the mean of the sample will be within ±1 standard error of the population mean, and
so forth. This theoretical concept is basic to interval construction and statistical
inference. Applying this to the battery example, we can now report to the director
that our best estimate of the life of the company's batteries is 36 months, and we are
68.3 percent confident that the life lies in the interval from 35.293 to 36.707 months
(36 ± 1σX̄). Similarly, we are 95.5 percent confident that the life falls within the
interval of 34.586 to 37.414 months (36 ± 2σX̄), and we are 99.7 percent confident
that battery life falls within the interval of 33.879 to 38.121 months (36 ± 3σX̄).
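The arithmetic of the battery example can be checked directly; the figures 36 months, σ = 10, and n = 200 come from the text above:

```python
# Reproducing the battery example: standard error and the 68.3%, 95.5%,
# and 99.7% interval estimates around the sample mean of 36 months.
x_bar, sigma, n = 36.0, 10.0, 200

se = sigma / n ** 0.5          # standard error of the mean, 10 / sqrt(200)
print(round(se, 3))            # 0.707

for k in (1, 2, 3):            # +/- 1, 2, and 3 standard errors
    lower, upper = x_bar - k * se, x_bar + k * se
    print(round(lower, 3), round(upper, 3))
```

The three printed pairs match the intervals quoted in the text: 35.293 to 36.707, 34.586 to 37.414, and 33.879 to 38.121 months.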
1.14 Interval estimates and confidence intervals
In using interval estimates, we are not confined to ±1, ±2, and ±3 standard errors. For
example, ±1.64 standard errors include about 90 percent of the area under the curve:
0.4495 of the area lies on either side of the mean in a normal distribution. Similarly,
±2.58 standard errors include 99 percent of the area, or 49.51 percent on each side of
the mean.
In statistics, the probability that we associated with an interval estimate is called
the confidence level. This probability indicates how confident we are that the interval
estimate will include the population parameter. A higher probability means more
confidence. In estimation, the most commonly used confidence levels are 90 percent,
95 percent, and 99 percent, but we are free to apply any confidence level.
The confidence interval is the range of the estimate we are making. If we report
that we are 90 percent confident that the mean of the population of incomes of people
in a certain community will lie between Rs. 8,000 and Rs. 24,000, then the range
Rs. 8,000 to Rs. 24,000 is our confidence interval. Often, however, we will express the
confidence interval in
standard errors rather than in numerical values. Thus, we will often express confidence
intervals like this: X̄ ± 1.64σX̄, where

X̄ + 1.64σX̄ = Upper limit of the confidence interval
X̄ − 1.64σX̄ = Lower limit of the confidence interval

Thus, confidence limits are the upper and lower limits of the confidence interval. In
this case, X̄ + 1.64σX̄ is called the upper confidence limit (UCL) and X̄ − 1.64σX̄ is
the lower confidence limit (LCL).
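A short check of these ±1.64 standard-error (approximately 90 percent) limits, reusing the battery example's figures of 36 months and a 0.707 standard error:

```python
# Upper and lower confidence limits at +/- 1.64 standard errors.
# 36.0 and 0.707 are the battery example's mean and standard error.
x_bar = 36.0
se = 0.707

lcl = x_bar - 1.64 * se   # lower confidence limit (LCL)
ucl = x_bar + 1.64 * se   # upper confidence limit (UCL)
print(round(lcl, 3), round(ucl, 3))
```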
1.15 Relationship between confidence level and
confidence interval
You may think that we should use a high confidence level, such as 99%, in all
estimation problems. After all, a high confidence level seems to signify a high degree
of accuracy in the estimate. In practice, however, high confidence levels produce
large confidence intervals, and such large intervals are not precise; they give very fuzzy
estimates.
1.16 Using sampling and confidence interval
estimation
We have described samples being drawn repeatedly from a given population in order
to estimate a population parameter. We also mentioned selecting a large number of
sample means from a population. In practice, however, it is often difficult or expensive
to take more than one sample from a population. Based on just one sample, we
estimate the population parameter. We must be careful, then, about interpreting the
results of such a process.
Suppose we calculate from one sample in our battery example the following con-
fidence interval and confidence level: We are 95 percent confident that the mean
battery life of the population lies within 30 and 42 months. This statement does not
mean that the chance is 0.95 that the mean life of all our batteries falls within the
interval established from this one sample. Instead, it means that if we select many
random samples of the same size and calculate a confidence interval for each of these
samples, then in about 95 percent of these cases, the population mean will lie within
that interval.
1.17 Interval estimation of population mean
(σ known)
In order to develop an interval estimate of a population mean, either the population
standard deviation or the sample standard deviation must be used to compute the
margin of error. Although the population standard deviation is rarely known exactly,
historical data or other information available in some applications permit us to obtain
a good estimate of it prior to sampling. In such cases, the population standard
deviation can, for all practical purposes, be considered known. We refer to such cases
as the σ known case.
1.18 Using the Z statistic for estimating population mean

Note that a complete census is neither a feasible nor a practical option. In order to
draw an inference about the population, a researcher has to take a sample and has
to apply statistical techniques to estimate population parameter on the basis of the
sample statistics. For example, a researcher can use two methods to find out the rate
of absenteeism in a manufacturing company with 500,000 employees. The first method
is to go in for a census and calculate the rate of absenteeism based on information from
all the 500,000 employees. This would be extremely difficult in terms of execution
and would be time-consuming and costly. Instead of this, a researcher can take a
sample of any size (keeping in mind the definition of small- and large-sized samples)
and can make an estimate based on the information obtained from the sample. The
possibility of committing non-sampling errors will also be minimized if this method
is used. We need to develop a statistical tool that provides a good estimate of the
population parameter on the basis of the sample statistic. The Z statistic can be
used for estimating the population parameter on the basis of the sample statistic.
According to the central limit theorem, the sample means for sufficiently large
samples (n ≥ 30) are approximately normally distributed, regardless of the shape
of the population distribution. For a normally distributed population, sample means
are normally distributed for any size of the sample.

Suppose the population mean μ is unknown and the true population standard
deviation σ is known. Then for a large sample size (n ≥ 30), the sample mean X̄ is
the best point estimator for the population mean μ. Since the sampling distribution
is approximately normal, it can be used to compute a confidence interval for the
population mean as follows:

X̄ ± Zα/2 σ/√n, or equivalently X̄ − Zα/2 σ/√n ≤ μ ≤ X̄ + Zα/2 σ/√n,

where Zα/2 is the Z-value representing an area of α/2 in the right tail of the standard
normal probability distribution, and (1 − α) is the level of confidence.

Alternative approach:
A (1 − α)100% large-sample confidence interval for a population mean can also be
found by using the statistic

Z = (X̄ − μ) / (σ/√n),

which has a standard normal distribution (i.e., Z ~ N(0, 1)). This formula can be
rearranged algebraically for the population mean:

μ = X̄ ± Z σ/√n

The sample mean can be greater than or less than the population mean; hence the
± sign in the formula. Here α is the area under the normal curve that lies outside the
confidence interval, in the tails of the curve. The confidence interval is the range
within which we can say, with some confidence, that the population mean is located.
We can say this with some confidence, but we are not absolutely sure that the
population mean is within the confidence interval. To be 100% sure, the confidence
interval would have to be indefinitely wide, which would be meaningless. We therefore
use the concept of probability to define some certainty: we can assign some
probability that the population mean is located within the confidence interval.
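This (1 − α)100% interval can be computed for any confidence level using the standard normal quantile for Zα/2. A sketch reusing the battery example's figures (X̄ = 36, σ = 10, n = 200) with α = 0.05:

```python
# (1 - alpha) * 100% confidence interval for the mean with sigma known.
# The sample figures reuse the battery example; alpha = 0.05 is assumed.
from statistics import NormalDist

x_bar, sigma, n = 36.0, 10.0, 200
alpha = 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{alpha/2}, about 1.96
margin = z * sigma / n ** 0.5             # margin of error
print(round(x_bar - margin, 3), round(x_bar + margin, 3))
```

Note that the exact 95% interval (z ≈ 1.96) is slightly wider than the rule-of-thumb ±2-standard-error interval quoted earlier only because 1.96 < 2; the construction is the same.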
If Zα/2 is the Z-value with an area of α/2 in the right tail of the normal curve, then we
can write

P(−Zα/2 ≤ Z ≤ Zα/2) = 1 − α.

The alternative hypothesis is the logical opposite of the null hypothesis. In other
words, when the null
hypothesis is found to be true, the alternative hypothesis must be false or when
null hypothesis is found to be false, the alternative hypothesis must be true. The
alternative hypothesis represents the conclusion reached by rejecting the null
hypothesis if there is sufficient evidence from the sample information to decide
that the null hypothesis is unlikely to be true. Hypothesis-testing methodology
is designed so that the rejection of the null hypothesis is based on evidence from
the sample that the alternative hypothesis is far more likely to be true. However,
failure to reject the null hypothesis is not proof that it is true. One can never
prove that the null hypothesis is correct because the decision is based only on the
sample information, not on the entire population. Therefore, if you fail to reject
the null hypothesis, you can only conclude that there is insufficient evidence to
warrant its rejection. A summary of the null and alternative hypotheses is
presented below:
The null and alternative hypotheses:
(a) The null hypothesis H0 represents the status quo or the current belief in
a situation.
(b) The alternative hypothesis H1 is the opposite of the null hypothesis and
represents a research claim or specific inference you would like to prove.
(c) If you reject the null hypothesis, you have statistical proof that the alter-
native hypothesis is correct.
(d) If you do not reject the null hypothesis, then you have failed to prove the
alternative hypothesis. The failure to prove the alternative, however, does
not mean that you have proven the null hypothesis.
Testing of Hypothesis 52
(e) The null hypothesis H0 always refers to a specified value of the population
parameter (such as μ), not a sample statistic (such as X̄).

(f) The statement of the null hypothesis always contains an equal sign
regarding the specified value of the population parameter (e.g. H0 : μ = 368 grams).

(g) The statement of the alternative hypothesis never contains an equal sign
regarding the specified value of the population parameter (e.g. H1 : μ ≠ 368 grams).
Each of the following statements is an example of a null hypothesis and alter-
native hypothesis:
H0 : μ = μ0    H1 : μ ≠ μ0
H0 : μ ≤ μ0    H1 : μ > μ0
H0 : μ ≥ μ0    H1 : μ < μ0
(I) Directional hypotheses

(a) H0 : There is no difference between the average pulse rates of men and
women.
H1 : Men have lower average pulse rates than women do.

(b) H0 : There is no relationship between exercise intensity and the resulting
aerobic benefit.
H1 : Increasing exercise intensity increases the resulting aerobic benefit.

(c) H0 : The defendant is innocent.
H1 : The defendant is guilty.
(II) Non-directional hypotheses

(a) H0 : Men and women have the same verbal abilities.
H1 : Men and women have different verbal abilities.
(b) H0 : The average monthly salary for management graduates with 4 years
of experience is Rs.75,000.
H1 : The average monthly salary is not Rs.75,000.

(c) H0 : Older workers are more loyal to a company.
H1 : Older workers may not be loyal to a company.
3. Determine the appropriate statistical test:
After setting the hypothesis, the researcher has to decide on an appropriate sta-
tistical test that will be used for statistical analysis. The tests of significance or
test statistic are classified into two categories: parametric and non-parametric
tests. Parametric tests are more powerful because their data are derived from
interval and ratio measurements. Nonparametric tests are used to test hypothe-
ses with nominal and ordinal data. Parametric techniques are the tests of choice
provided certain assumptions are met. Assumptions for parametric tests are as
follows:
i. The selection of any element (or member) from the population should not
affect the chance of any other being included in the sample drawn
from the population.
ii. The samples should be drawn from normally distributed population.
iii. Populations under study should have equal variances.
Non-parametric tests have few assumptions and do not specify normally dis-
tributed populations or homogeneity of variance.
Selection of a test:
For choosing a particular test of significance following three factors are consid-
ered:
a. Whether the test involves one sample, two samples or k samples?
b. Whether samples used are independent or related?
c. Is the measurement scale nominal, ordinal, interval, or ratio?
Further, it is also important to know: (i) the sample size, (ii) the number of samples
and their sizes, and (iii) whether the data have been weighted. Such questions help
in selecting an appropriate test statistic. One-sample tests are used for a single
sample and to test the hypothesis that it comes from a specified population.
The following questions need to be answered before using one-sample tests:

a. Is there a difference between observed frequencies and the expected frequencies
based on a statistical theory?
b. Is there a difference between observed and expected proportions?
c. Is it reasonable to conclude that a sample is drawn from a population with
some specific distribution (normal, Poisson, and so on)?
d. Is there a significant difference between some measure of central tendency
and its population parameter?
The value of the test statistic is calculated from the sampling distribution of the
sample statistic by using the following formula:

Test statistic = (Value of sample statistic − Value of hypothesized population
parameter) / (Standard error of the sample statistic)

The choice of the probability distribution of a sample statistic is guided by the
sample size n and whether the population standard deviation σ is known, as shown
below:

Sample size    σ known                σ unknown
n > 30         Normal distribution    Normal distribution
n ≤ 30         Normal distribution    t-distribution
4. Level of significance: This is the admissible level of error at which we test the null
hypothesis. The level of significance, generally denoted by α, is the probability
of rejecting the null hypothesis even when it is
true. The level of significance is also known as the size of the rejection region
or the size of the critical region. It is very important to note that the level
of significance must be determined before drawing the samples, so that the
obtained result is free from the choice bias of the decision maker. The levels of
significance generally applied by researchers are 0.01, 0.05, and 0.10. It
is specified as the probability of wrongly rejecting the null hypothesis H0. In
other words, the level of significance defines the likelihood of rejecting a null
hypothesis when it is true, i.e. it is the risk a decision maker takes of rejecting
the null hypothesis when it is really true. The guide provided by the statistical
theory is that this probability must be small.
5. Test statistic: This is constructed using the statistic that estimates the population
parameter on which the hypothesis is being tested. The value of the test
statistic decides whether or not to reject the null hypothesis.
6. Critical value: After constructing the test statistic, we need to obtain the critical
value. This critical value divides the entire region into critical and non-critical
region.
7. Conclusion: At this stage, the calculated value of the test statistic is compared
with the critical value and a conclusion is drawn accordingly. In recent times,
the p-value approach has become prominent; these two methods will be discussed
in detail in a later section.
8. Power of the test: This measures the strength of the test in correctly rejecting the
null hypothesis. Its calculation will be discussed for each test separately using
an example.
2.5 One tail and two tail tests
The form of the alternative hypothesis can be either one-tailed or two-tailed, depend-
ing on what the analyst is trying to prove.
2.5.1 One tailed test
One tailed tests are further classified as right tailed and left tailed tests. The
alternative hypothesis decides whether a test is right tailed or left tailed. If the
alternative hypothesis is of the type >, the test is classified as a right-tailed test, and
if it is of the type <, the test is classified as a left-tailed test. Note that the equality
sign always belongs in the null hypothesis. This is because the test statistic is
calculated under the assumption that the null hypothesis is true.
2.5.2 Two tailed test

When the alternative hypothesis is of the type ≠, the test is classified as a two-tailed
test.
2.6 Critical region and non-critical region
The sampling distribution of the test statistic is divided into two regions, a region of
rejection (sometimes called the critical region) and a region of non-rejection. If the
test statistic falls into the region of non-rejection, you do not reject the null hypothesis.
If the test statistic falls into the rejection region, you reject the null hypothesis.

The region of rejection consists of the values of the test statistic that are unlikely
to occur if the null hypothesis is true. These values are much more likely to occur
if the null hypothesis is false. Therefore, if a value of the test statistic falls into this
rejection region, you reject the null hypothesis because that value is unlikely if the
null hypothesis is true. To make a decision concerning the null hypothesis, you first
determine the critical value of the test statistic. The critical value divides the
non-rejection region from the rejection region. Determining the critical value depends on
the size of the rejection region. The size of the rejection region is directly related to
the risks involved in using only sample evidence to make decisions about a population
parameter.
2.7 Errors in hypothesis testing
A Type I error occurs if you reject the null hypothesis, H0, when it is true and should
not be rejected. A Type I error is a false alarm. The probability of a Type I error
occurring is α.
A Type II error occurs if you do not reject the null hypothesis, H0, when it is
false and should be rejected. A Type II error represents a missed opportunity to take
some corrective action. The probability of a Type II error occurring is β.
Whenever we reject a null hypothesis, there is a chance that we have made a
mistake, i.e., that we have rejected a true statement. Rejecting a true null hypothesis
is referred to as a Type I error, and our probability of making such an error is
represented by the Greek letter alpha (). This probability, which is referred to as
the significance level of the test, is of primary concern in hypothesis testing.
On the other hand, we can also make the mistake of failing to reject a false null
hypothesis; this is a Type II error. Our probability of making it is represented by the
Greek letter beta (β). Naturally, if we either fail to reject a true null hypothesis or
reject a false null hypothesis, we have acted correctly. The probability of rejecting
a false null hypothesis is called the power of the test. The four possibilities are shown
in the table below.
                                 Actual Situation
Statistical decision   H0 true                                  H0 false
Do not reject H0       Correct decision, Confidence = (1 − α)   Type II error, P(Type II error) = β
Reject H0              Type I error, P(Type I error) = α        Correct decision, Power = (1 − β)
In hypothesis testing, there is a necessary trade-off between Type I and Type II
errors: For a given sample size, reducing the probability of a Type I error increases the
probability of a Type II error, and vice versa. The only sure way to avoid accepting
false claims is to never accept any claims. Likewise, the only sure way to avoid
rejecting true claims is to never reject any claims. Of course, each of these extreme
approaches is impractical, and we must usually compromise by accepting a reasonable
risk of committing either type of error.
Complements of Type-I and Type-II Errors
The confidence coefficient, 1 − α, is the probability that you will not reject the null
hypothesis when it is true and should not be rejected.

The power of a statistical test, 1 − β, is the probability that you will reject the
null hypothesis when it is false and should be rejected.
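These quantities can be computed for a concrete test. A sketch for a right-tailed Z test of H0: μ = μ0 against H1: μ > μ0 with σ known; the values of μ0, μ1, σ, n, and α are illustrative assumptions, not figures from the text:

```python
# Type II error probability (beta) and power for a right-tailed Z test.
# mu1 is an assumed true mean under H1; all numbers are illustrative.
from statistics import NormalDist

mu0, mu1, sigma, n, alpha = 100.0, 103.0, 10.0, 36, 0.05

se = sigma / n ** 0.5
z_crit = NormalDist().inv_cdf(1 - alpha)          # rejection cutoff in Z units

# beta = P(do not reject H0 | true mean is mu1); power = 1 - beta
beta = NormalDist().cdf(z_crit - (mu1 - mu0) / se)
power = 1 - beta
print(round(beta, 3), round(power, 3))
```

Shifting μ1 further from μ0, increasing n, or increasing α all raise the power, which is the trade-off described above.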
2.8 Test for single mean
In this section, we discuss the two tests that are most common in testing a hypothesis
about the population mean μ: the Z test and the t test. We discuss these two tests
in detail using appropriate examples. The selection of the test depends on the sample
size of the study and on whether the population standard deviation σ is known or
unknown.

Assumptions

1. The variable under study is measured on a ratio or interval scale.
2. The population follows a normal distribution.
3. Population variance σ²: known (Z-test), unknown (t-test).
4. Responses are independent within the samples.
2.8.1 Z-test for single mean (σ known case)

The procedure to use a Z-test is as follows:

1. Null hypothesis: H0 : μ = μ0 (or μ ≤ μ0, or μ ≥ μ0).
2. Alternative hypothesis: H1 : μ ≠ μ0 (or μ > μ0, or μ < μ0).
3. Level of significance: α = 0.05 (or 0.01, 0.02, 0.10).
4. Test Statistic: Under H0,

Two tailed test:

Z = |X̄ − μ0| / (σ/√n) ~ N(0, 1)

One tailed test:

Z = (X̄ − μ0) / (σ/√n) ~ N(0, 1)

5. Comparison and Conclusion.
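The steps above can be sketched in Python's standard library; the sample figures are made-up, and the test shown is two-tailed with α = 0.05:

```python
# Two-tailed Z test for a single mean with sigma known.
# mu0, sigma, n, x_bar, and alpha are assumed figures for illustration.
from statistics import NormalDist

mu0, sigma, n, x_bar, alpha = 50.0, 8.0, 64, 52.4, 0.05

z = (x_bar - mu0) / (sigma / n ** 0.5)        # test statistic
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value

decision = "Reject H0" if abs(z) > z_crit else "Do not reject H0"
print(round(z, 2), round(p_value, 4), decision)
```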
2.8.2 Testing Using Excel

A1 Null Hypothesis: μ = μ0
A2 Level of Significance (α): 0.05
A3 Population Standard Deviation: σ
A4 Sample Size: n
A5 Sample Mean: X̄
A6 Intermediate Calculations
A7 Standard Error of the Mean, σ/√n: =A3/SQRT(A4)
A8 Z Test Statistic, Z = (X̄ − μ0)/(σ/√n) ~ N(0, 1): =(A5-A1)/A7
A9 Two Tailed Test, Alternative Hypothesis H1: μ ≠ μ0
A10 Lower Critical Value: =NORM.S.INV(A2/2)
A11 Upper Critical Value: =NORM.S.INV(1-A2/2)
A12 p-Value: =2*(1-NORM.S.DIST(ABS(A8), TRUE))
A13 Left Tailed Test, Alternative Hypothesis H1: μ < μ0
A14 Lower Critical Value: =NORM.S.INV(A2)
A15 p-Value: =NORM.S.DIST(A8, TRUE)
A16 Right Tailed Test, Alternative Hypothesis H1: μ > μ0
A17 Upper Critical Value: =NORM.S.INV(1-A2)
A18 p-Value: =1-NORM.S.DIST(A8, TRUE)
A19 Conclusion
A20 Reject or Do not reject H0: =IF(A12<A2, "Reject H0", "Do not reject H0")
2.8.3 t-test for single mean (σ unknown case)

The procedure to use a t-test is as follows:

1. Null hypothesis: H0 : μ = μ0 (or μ ≤ μ0, or μ ≥ μ0).
2. Alternative hypothesis: H1 : μ ≠ μ0 (or μ > μ0, or μ < μ0).
3. Level of significance: α = 0.05 (or 0.01, 0.02, 0.10).
4. Test Statistic: Under H0,

Two tailed test:

t = |X̄ − μ0| / (S/√n) ~ t with (n − 1) d.f.

One tailed test:

t = (X̄ − μ0) / (S/√n) ~ t with (n − 1) d.f.

where

S = √( Σ (Xi − X̄)² / (n − 1) ), with the sum taken over i = 1 to n.

5. Comparison and Conclusion.
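A sketch of the steps with made-up data; because the standard library has no t distribution, the two-tailed critical value t with 9 d.f. at α = 0.05 (2.262) is taken from standard t tables:

```python
# Two-tailed t test for a single mean with sigma unknown.
# The data list and mu0 are made-up figures for illustration.
import statistics

data = [48.2, 51.3, 49.8, 52.6, 50.1, 47.9, 53.0, 50.7, 49.4, 51.5]
mu0 = 48.0

n = len(data)
x_bar = statistics.mean(data)
s = statistics.stdev(data)        # sample standard deviation, n - 1 divisor

t = (x_bar - mu0) / (s / n ** 0.5)
t_crit = 2.262                    # two-tailed critical value, 9 d.f., alpha = 0.05

decision = "Reject H0" if abs(t) > t_crit else "Do not reject H0"
print(round(t, 3), decision)
```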
2.8.4 Testing Using Excel

A1 Null Hypothesis: μ = μ0
A2 Level of Significance (α): 0.05
A3 Sample Standard Deviation: S
A4 Sample Size: n
A5 Degrees of Freedom (d.f.): n − 1
A6 Sample Mean: X̄
A7 Intermediate Calculations
A8 Standard Error of the Mean, S/√n: =A3/SQRT(A4)
A9 t Test Statistic, t = (X̄ − μ0)/(S/√n) ~ t with (n − 1) d.f.: =(A6-A1)/A8
A10 Two Tailed Test, Alternative Hypothesis H1: μ ≠ μ0
A11 Lower Critical Value: =T.INV(A2/2, A5)
A12 Upper Critical Value: =T.INV(1-A2/2, A5)
A13 p-Value: =2*(1-T.DIST(ABS(A9), A5, TRUE))
A14 Left Tailed Test, Alternative Hypothesis H1: μ < μ0
A15 Lower Critical Value: =T.INV(A2, A5)
A16 p-Value: =T.DIST(A9, A5, TRUE)
A17 Right Tailed Test, Alternative Hypothesis H1: μ > μ0
A18 Upper Critical Value: =T.INV(1-A2, A5)
A19 p-Value: =1-T.DIST(A9, A5, TRUE)
A20 Conclusion
A21 Reject or Do not reject H0: =IF(A13<A2, "Reject H0", "Do not reject H0")
2.9 Test for single proportion
In this section, we discuss the procedure used to test the significance of a single
proportion.

Assumptions

1. The sampling distribution of the sample proportion is approximately normal.
2. The condition np ≥ 5, n(1 − p) ≥ 5 is satisfied. This condition is necessary to
approximate the sampling distribution of the statistic by the normal law.

Steps in using the test

1. Null hypothesis: H0 : P = P0 (or P ≤ P0, or P ≥ P0).
2. Alternative hypothesis: H1 : P ≠ P0 (or P > P0, or P < P0).
3. Level of significance: α = 0.05 (or 0.01, 0.02, 0.10).
4. Test Statistic: Under H0,

Two tailed test:

Z = |p̄ − P0| / √(P0(1 − P0)/n) ~ N(0, 1)

One tailed test:

Z = (p̄ − P0) / √(P0(1 − P0)/n) ~ N(0, 1)

5. Comparison and Conclusion.
2.9.1 Testing Using Excel
A1 Null Hypothesis P = P0
A2 Level of Significance (α) 0.05
A3 Number of Items of Interest X
A4 Sample Size n
A5 Intermediate Calculations
A6 Sample Proportion p̂ = X/n =A3/A4
A7 Standard Error √(P0(1 − P0)/n) =SQRT((A1*(1-A1))/A4)
A8 Z test Statistic Z = (p̂ − P0)/√(P0(1 − P0)/n) ~ N(0, 1) =(A6-A1)/A7
A9 Two Tailed Test Alternative Hypothesis H1: P ≠ P0
A10 Lower Critical Value =NORM.S.INV(A2/2)
A11 Upper Critical Value =NORM.S.INV(1-A2/2)
A12 p-Value =2*(1-NORM.S.DIST(ABS(A8), TRUE))
A13 Left Tailed Test Alternative Hypothesis H1: P < P0
A14 Lower Critical Value =NORM.S.INV(A2)
A15 p-Value =NORM.S.DIST(A8, TRUE)
A16 Right Tailed Test Alternative Hypothesis H1: P > P0
A17 Upper Critical Value =NORM.S.INV(1-A2)
A18 p-Value =1-NORM.S.DIST(A8, TRUE)
A19 Conclusion
A20 Reject or Do not reject H0 =IF(A12 < A2, "Reject H0", "Do not reject H0")
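The same single-proportion Z-test can be sketched in Python with the standard library; the counts below are hypothetical, and the `norm_cdf` helper plays the role of NORM.S.DIST.

```python
from math import sqrt, erf

# Hypothetical example: 64 of 100 customers prefer the new pack; test H0: P = 0.5
X, n, P0, alpha = 64, 100, 0.50, 0.05

p_hat = X / n                         # sample proportion (row A6)
se = sqrt(P0 * (1 - P0) / n)          # standard error under H0 (row A7)
z = (p_hat - P0) / se                 # Z statistic (row A8)

def norm_cdf(x):
    """Standard normal CDF via the error function (what NORM.S.DIST computes)."""
    return 0.5 * (1 + erf(x / sqrt(2)))

p_value = 2 * (1 - norm_cdf(abs(z)))  # two-tailed p-value (row A12)
decision = "Reject H0" if p_value < alpha else "Do not reject H0"
print(round(z, 3), round(p_value, 4), decision)
```

With these numbers Z = 2.8 and the two-tailed p-value is about 0.005, so H0: P = 0.5 is rejected at the 5% level.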
2.10 Comparison and conclusion
Two tailed test:
1. Critical value approach:
Find the table value at the chosen level of significance α. Compare this value with
the calculated value.
(a) If |cal| ≤ tab then, do not reject the null hypothesis.
(b) If |cal| > tab then, reject the null hypothesis.
2. p-value approach:
Compute the p-value and compare it with the chosen level of significance α.
(a) If p ≥ α then, do not reject the null hypothesis.
(b) If p < α then, reject the null hypothesis.
One tailed test:
1. Right tailed test:
(a) Critical value approach:
Find the table value at the chosen level of significance α. Compare this value
with the calculated value.
i. If cal ≤ tab then, do not reject the null hypothesis.
ii. If cal > tab then, reject the null hypothesis.
(b) p-value approach:
Compute the p-value and compare it with α.
i. If p ≥ α then, do not reject the null hypothesis.
ii. If p < α then, reject the null hypothesis.
2. Left tailed test
(a) Critical value approach:
Find the table value at the chosen level of significance α. Compare this value
with the calculated value. (Here the table value is the negative lower critical value.)
i. If cal > tab then, do not reject the null hypothesis.
ii. If cal ≤ tab then, reject the null hypothesis.
(b) p-value approach: Compute the p-value and compare it with α.
i. If p ≥ α then, do not reject the null hypothesis.
ii. If p < α then, reject the null hypothesis.
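The p-value decision rule is the same in every case above, so it can be captured in one small helper. This is a sketch; the function name is our own choice.

```python
def p_value_decision(p_value: float, alpha: float = 0.05) -> str:
    """Apply the p-value rule: reject H0 when p < alpha, otherwise do not reject."""
    if not (0.0 <= p_value <= 1.0):
        raise ValueError("p-value must lie in [0, 1]")
    return "Reject H0" if p_value < alpha else "Do not reject H0"

print(p_value_decision(0.012))         # rejected at alpha = 0.05
print(p_value_decision(0.012, 0.01))   # not rejected at the stricter alpha = 0.01
```

Note how the same p-value can lead to different conclusions at different levels of significance.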
Chapter 3
Testing of hypothesis: Two sample problem
3.1 Introduction
In this chapter, we discuss the testing procedures used to test the significance of the
difference between parameters belonging to two independent populations. The several
cases that arise are discussed in the following sections.
3.2 Assumptions
In this section, we list some important assumptions underlying the testing procedures
used in the two sample problem.
1. The variable under study is measured on a ratio or interval scale.
2. The populations follow the normal distribution.
3. The population variances are equal, i.e. σ1² = σ2².
4. The samples are independent.
5. The responses are independent within the samples.
3.3 Test for difference of means: Z-test
1. Null hypothesis: H0: μ1 = μ2 (μ1 ≥ μ2, μ1 ≤ μ2).
2. Alternative hypothesis: H1: μ1 ≠ μ2 (μ1 < μ2, μ1 > μ2).
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test Statistic: Under H0,
Two tailed test:
Z = |X̄1 − X̄2| / S.E.(X̄1 − X̄2) ~ N(0, 1)
One tailed test:
Z = (X̄1 − X̄2) / S.E.(X̄1 − X̄2) ~ N(0, 1)
where S.E.(X̄1 − X̄2) = σ√(1/n1 + 1/n2) when σ1² = σ2² = σ² (known), and
S.E.(X̄1 − X̄2) = √(σ1²/n1 + σ2²/n2) when the variances are unequal (known).
5. Comparison and Conclusion.
3.3.1 Testing Using Excel: σ1² = σ2² = σ² (known)
A1 Null Hypothesis μ1 = μ2
A2 Level of Significance (α) 0.05
A3 Sample Mean 1 X̄1
A4 Sample Mean 2 X̄2
A5 Sample Size 1 n1
A6 Sample Size 2 n2
A7 Population Standard Deviation σ
A8 Intermediate Calculations
A9 S.E. of Difference of Means σ√(1/n1 + 1/n2) =A7*SQRT((1/A5)+(1/A6))
A10 Z test Statistic Z = (X̄1 − X̄2)/(σ√(1/n1 + 1/n2)) ~ N(0, 1) =(A3-A4)/A9
A11 Two Tailed Test Alternative Hypothesis H1: μ1 ≠ μ2
A12 Lower Critical Value =NORM.S.INV(A2/2)
A13 Upper Critical Value =NORM.S.INV(1-A2/2)
A14 p-Value =2*(1-NORM.S.DIST(ABS(A10), TRUE))
A15 Left Tailed Test Alternative Hypothesis H1: μ1 < μ2
A16 Lower Critical Value =NORM.S.INV(A2)
A17 p-Value =NORM.S.DIST(A10, TRUE)
A18 Right Tailed Test Alternative Hypothesis H1: μ1 > μ2
A19 Upper Critical Value =NORM.S.INV(1-A2)
A20 p-Value =1-NORM.S.DIST(A10, TRUE)
A21 Conclusion Reject or Do not reject H0
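The known-σ case above translates directly into a short Python sketch; the sample means, sizes, and σ below are hypothetical numbers chosen for illustration.

```python
from math import sqrt, erf

# Hypothetical data: mean daily sales at two outlets, common sigma known to be 4.0
xbar1, xbar2 = 52.3, 50.1
n1, n2 = 40, 50
sigma, alpha = 4.0, 0.05

se = sigma * sqrt(1 / n1 + 1 / n2)    # S.E. of the difference of means (row A9)
z = (xbar1 - xbar2) / se              # Z statistic (row A10)

def norm_cdf(x):
    """Standard normal CDF (what NORM.S.DIST computes)."""
    return 0.5 * (1 + erf(x / sqrt(2)))

p_value = 2 * (1 - norm_cdf(abs(z)))  # two-tailed p-value (row A14)
decision = "Reject H0" if p_value < alpha else "Do not reject H0"
print(round(z, 3), round(p_value, 4), decision)
```

Here Z ≈ 2.59 with a two-tailed p-value below 0.05, so the difference in means is significant at the 5% level.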
3.3.2 Testing Using Excel: Unequal Variances (Known)
A1 Null Hypothesis μ1 = μ2
A2 Level of Significance (α) 0.05
A3 Sample Mean 1 X̄1
A4 Sample Mean 2 X̄2
A5 Sample Size 1 n1
A6 Sample Size 2 n2
A7 Population Standard Deviation 1 σ1
A8 Population Standard Deviation 2 σ2
A9 Intermediate Calculations
A10 S.E. of Difference of Means √(σ1²/n1 + σ2²/n2) =SQRT((A7^2/A5)+(A8^2/A6))
A11 Z test Statistic Z = (X̄1 − X̄2)/√(σ1²/n1 + σ2²/n2) ~ N(0, 1) =(A3-A4)/A10
A12 Two Tailed Test Alternative Hypothesis H1: μ1 ≠ μ2
A13 Lower Critical Value =NORM.S.INV(A2/2)
A14 Upper Critical Value =NORM.S.INV(1-A2/2)
A15 p-Value =2*(1-NORM.S.DIST(ABS(A11), TRUE))
A16 Left Tailed Test Alternative Hypothesis H1: μ1 < μ2
A17 Lower Critical Value =NORM.S.INV(A2)
A18 p-Value =NORM.S.DIST(A11, TRUE)
A19 Right Tailed Test Alternative Hypothesis H1: μ1 > μ2
A20 Upper Critical Value =NORM.S.INV(1-A2)
A21 p-Value =1-NORM.S.DIST(A11, TRUE)
A22 Conclusion Reject or Do not reject H0
3.4 Test for difference of means: t-test
1. Null hypothesis: H0: μ1 = μ2 (μ1 ≥ μ2, μ1 ≤ μ2).
2. Alternative hypothesis: H1: μ1 ≠ μ2 (μ1 < μ2, μ1 > μ2).
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test Statistic: Under H0,
(a) When the assumption of equality of variances is satisfied (σ1² = σ2² = σ²)
and σ² is unknown.
Two tailed test:
t = |X̄1 − X̄2| / (S√(1/n1 + 1/n2)) ~ t with (n1 + n2 − 2) d.f.
One tailed test:
t = (X̄1 − X̄2) / (S√(1/n1 + 1/n2)) ~ t with (n1 + n2 − 2) d.f.
where S is the pooled standard deviation,
S² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2).
(b) When the assumption of equality of variances is not satisfied (σ1² ≠ σ2²).
Two tailed test:
t = |X̄1 − X̄2| / √(S1²/n1 + S2²/n2) ~ t with (n1 + n2 − 2) d.f.
One tailed test:
t = (X̄1 − X̄2) / √(S1²/n1 + S2²/n2) ~ t with (n1 + n2 − 2) d.f.
where
S1² = Σᵢ(Xᵢ − X̄1)²/(n1 − 1) and S2² = Σᵢ(Yᵢ − X̄2)²/(n2 − 1).
5. Comparison and Conclusion.
3.4.1 Testing Using Excel: σ1² = σ2² = σ² (Unknown)
A1 Null Hypothesis μ1 = μ2
A2 Level of Significance (α) 0.05
A3 Sample Mean 1 X̄1
A4 Sample Mean 2 X̄2
A5 Sample Size 1 n1
A6 Sample Size 2 n2
A7 Sample Standard Deviation 1 S1
A8 Sample Standard Deviation 2 S2
A9 Pooled Estimate S =SQRT(((A5-1)*A7^2+(A6-1)*A8^2)/(A5+A6-2))
A10 Intermediate Calculations
A11 S.E. of Difference of Means S√(1/n1 + 1/n2) =A9*SQRT((1/A5)+(1/A6))
A12 t test Statistic t = (X̄1 − X̄2)/(S√(1/n1 + 1/n2)) ~ t(n1+n2−2) d.f. =(A3-A4)/A11
A13 Two Tailed Test Alternative Hypothesis H1: μ1 ≠ μ2
A14 Lower Critical Value =T.INV(A2/2, A5+A6-2)
A15 Upper Critical Value =T.INV(1-A2/2, A5+A6-2)
A16 p-Value =2*(1-T.DIST(ABS(A12), A5+A6-2, TRUE))
A17 Left Tailed Test Alternative Hypothesis H1: μ1 < μ2
A18 Lower Critical Value =T.INV(A2, A5+A6-2)
A19 p-Value =T.DIST(A12, A5+A6-2, TRUE)
A20 Right Tailed Test Alternative Hypothesis H1: μ1 > μ2
A21 Upper Critical Value =T.INV(1-A2, A5+A6-2)
A22 p-Value =1-T.DIST(A12, A5+A6-2, TRUE)
A23 Conclusion Reject or Do not reject H0
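The pooled two-sample t-test of this section can be sketched in Python with the standard library. The two samples are hypothetical, and the two-tailed critical value t(0.025, 13 d.f.) = 2.160 is read from the t table.

```python
from math import sqrt

# Hypothetical scores from two independent samples
x = [68, 72, 75, 70, 69, 74, 71, 73]
y = [65, 70, 67, 72, 66, 69, 68]

def mean(v):
    return sum(v) / len(v)

def sample_var(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)   # divisor n - 1

n1, n2 = len(x), len(y)
# Pooled variance, assuming sigma1^2 = sigma2^2 (case (a) of section 3.4)
s2_pooled = ((n1 - 1) * sample_var(x) + (n2 - 1) * sample_var(y)) / (n1 + n2 - 2)
se = sqrt(s2_pooled) * sqrt(1 / n1 + 1 / n2)   # S.E. of difference of means (row A11)
t = (mean(x) - mean(y)) / se                   # t statistic with n1 + n2 - 2 = 13 d.f.

t_tab = 2.160    # two-tailed critical value t(0.025, 13 d.f.) from the t table
decision = "Reject H0" if abs(t) > t_tab else "Do not reject H0"
print(round(t, 3), decision)
```

For these data |t| ≈ 2.67 exceeds the table value, so the two means differ significantly at the 5% level.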
3.4.2 Testing Using Excel: Unequal Variances (Unknown)
A1 Null Hypothesis μ1 = μ2
A2 Level of Significance (α) 0.05
A3 Sample Mean 1 X̄1
A4 Sample Mean 2 X̄2
A5 Sample Size 1 n1
A6 Sample Size 2 n2
A7 Sample Standard Deviation 1 S1
A8 Sample Standard Deviation 2 S2
A9 Intermediate Calculations
A10 S.E. of Difference of Means √(S1²/n1 + S2²/n2) =SQRT((A7^2/A5)+(A8^2/A6))
A11 t test Statistic t = (X̄1 − X̄2)/√(S1²/n1 + S2²/n2) ~ t(n1+n2−2) d.f. =(A3-A4)/A10
A12 Two Tailed Test Alternative Hypothesis H1: μ1 ≠ μ2
A13 Lower Critical Value =T.INV(A2/2, A5+A6-2)
A14 Upper Critical Value =T.INV(1-A2/2, A5+A6-2)
A15 p-Value =2*(1-T.DIST(ABS(A11), A5+A6-2, TRUE))
A16 Left Tailed Test Alternative Hypothesis H1: μ1 < μ2
A17 Lower Critical Value =T.INV(A2, A5+A6-2)
A18 p-Value =T.DIST(A11, A5+A6-2, TRUE)
A19 Right Tailed Test Alternative Hypothesis H1: μ1 > μ2
A20 Upper Critical Value =T.INV(1-A2, A5+A6-2)
A21 p-Value =1-T.DIST(A11, A5+A6-2, TRUE)
A22 Conclusion Reject or Do not reject H0
3.5 Test for difference of two proportions
1. Null hypothesis: H0: P1 = P2 (P1 ≥ P2, P1 ≤ P2).
2. Alternative hypothesis: H1: P1 ≠ P2 (P1 < P2, P1 > P2).
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test Statistic: Under H0,
Two tailed test:
Z = |p̂1 − p̂2| / √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2) ~ N(0, 1)
One tailed test:
Z = (p̂1 − p̂2) / √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2) ~ N(0, 1)
Test statistic using a pooled estimate:
Two tailed test:
Z = |p̂1 − p̂2| / √(P̄(1 − P̄)(1/n1 + 1/n2)) ~ N(0, 1)
One tailed test:
Z = (p̂1 − p̂2) / √(P̄(1 − P̄)(1/n1 + 1/n2)) ~ N(0, 1)
where
P̄ = (n1 p̂1 + n2 p̂2) / (n1 + n2).
5. Comparison and Conclusion.
3.5.1 Testing Using Excel: Test for Difference of Proportions
A1 Null Hypothesis P1 = P2
A2 Level of Significance (α) 0.05
A3 Sample Size 1 n1
A4 Sample Size 2 n2
A5 Number of Items of Interest 1 X1
A6 Number of Items of Interest 2 X2
A7 Sample Proportion 1 p̂1 =A5/A3
A8 Sample Proportion 2 p̂2 =A6/A4
A9 Pooled Estimate P̄ =(A3*A7+A4*A8)/(A3+A4)
A10 Intermediate Calculations
A11 S.E. √(P̄(1 − P̄)(1/n1 + 1/n2)) =SQRT(A9*(1-A9)*((1/A3)+(1/A4)))
A12 Z test Statistic =(A7-A8)/A11
A13 Two Tailed Test Alternative Hypothesis H1: P1 ≠ P2
A14 Lower Critical Value =NORM.S.INV(A2/2)
A15 Upper Critical Value =NORM.S.INV(1-A2/2)
A16 p-Value =2*(1-NORM.S.DIST(ABS(A12), TRUE))
A17 Left Tailed Test Alternative Hypothesis H1: P1 < P2
A18 Lower Critical Value =NORM.S.INV(A2)
A19 p-Value =NORM.S.DIST(A12, TRUE)
A20 Right Tailed Test Alternative Hypothesis H1: P1 > P2
A21 Upper Critical Value =NORM.S.INV(1-A2)
A22 p-Value =1-NORM.S.DIST(A12, TRUE)
A23 Conclusion Reject or Do not reject H0
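The pooled two-proportion Z-test can be sketched in Python with the standard library; the response counts below are hypothetical.

```python
from math import sqrt, erf

# Hypothetical data: 45/200 responders in campaign 1, 30/180 in campaign 2
x1, n1 = 45, 200
x2, n2 = 30, 180
alpha = 0.05

p1, p2 = x1 / n1, x2 / n2
p_bar = (x1 + x2) / (n1 + n2)                       # pooled proportion (row A9)
se = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))  # pooled S.E. (row A11)
z = (p1 - p2) / se                                  # Z statistic (row A12)

def norm_cdf(v):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(v / sqrt(2)))

p_value = 2 * (1 - norm_cdf(abs(z)))                # two-tailed p-value
decision = "Reject H0" if p_value < alpha else "Do not reject H0"
print(round(z, 3), round(p_value, 4), decision)
```

Here Z ≈ 1.43 with a two-tailed p-value above 0.05, so the difference in response rates is not significant at the 5% level.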
3.6 Test for dependent samples
In the previous section, we discussed the testing procedure used to test the hypothesis
constructed on the difference between two population means when the samples are
independent. In this section, we discuss a testing procedure for the case when the
samples are dependent. This is the case where the responses are taken from the same
set of individuals before and after an experiment. It is also used when the samples
are matched samples.
Suppose that, in a marketing research study, the researcher wants to know the effect
of his company's product on the customers. He selects a sample of n customers from a
population and records the weight of each customer. He then introduces the same
product with some additions to it, requests the customers to use it, and records their
weights again after a month. Here the variable measured is the weight of the customers,
and the researcher is interested in testing the hypothesis: "Is there any significant
difference between the average weight of the customers before and after the additions
to the product?"
1. Null hypothesis: H0: μ1 = μ2, i.e. μD = μ1 − μ2 = 0.
2. Alternative hypothesis: H1: μD ≠ 0 (μD < 0, μD > 0).
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test Statistic: Under H0,
t = d̄ / (Sd/√n) ~ t with (n − 1) d.f.
where d̄ is the mean of the n paired differences and Sd is their sample standard
deviation.
5. Comparison and Conclusion.
3.6.1 Testing Using Excel
A1 Null Hypothesis μD = μ1 − μ2 = 0
A2 Level of Significance (α) 0.05
A3 Sample Mean of Differences d̄
A4 Sample Size (number of pairs) n
A5 Degrees of Freedom (d.f.) n − 1
A6 Sample Standard Deviation of Differences Sd
A7 Intermediate Calculations
A8 S.E. Sd/√n =A6/SQRT(A4)
A9 t test Statistic t = d̄/(Sd/√n) ~ t(n−1) d.f. =A3/A8
A10 Two Tailed Test Alternative Hypothesis H1: μD ≠ 0
A11 Lower Critical Value =T.INV(A2/2, A5)
A12 Upper Critical Value =T.INV(1-A2/2, A5)
A13 p-Value =2*(1-T.DIST(ABS(A9), A5, TRUE))
A14 Left Tailed Test Alternative Hypothesis H1: μD < 0
A15 Lower Critical Value =T.INV(A2, A5)
A16 p-Value =T.DIST(A9, A5, TRUE)
A17 Right Tailed Test Alternative Hypothesis H1: μD > 0
A18 Upper Critical Value =T.INV(1-A2, A5)
A19 p-Value =1-T.DIST(A9, A5, TRUE)
A20 Conclusion Reject or Do not reject H0
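The paired t-test can be sketched in Python with the standard library. The before/after weights below are made-up numbers, and the two-tailed critical value t(0.025, 7 d.f.) = 2.365 is read from the t table.

```python
from math import sqrt

# Hypothetical weights (kg) of the same 8 customers before and after the change
before = [70.2, 68.5, 74.1, 66.0, 72.3, 69.8, 71.5, 67.4]
after  = [71.0, 69.3, 74.0, 67.2, 73.1, 70.5, 72.4, 68.0]

d = [a - b for a, b in zip(after, before)]   # per-customer differences
n = len(d)
d_bar = sum(d) / n                           # mean difference (row A3)
s_d = sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))   # S.D. of differences
t = d_bar / (s_d / sqrt(n))                  # t statistic with n - 1 = 7 d.f.

t_tab = 2.365    # two-tailed critical value t(0.025, 7 d.f.) from the t table
decision = "Reject H0" if abs(t) > t_tab else "Do not reject H0"
print(round(t, 3), decision)
```

Because each customer serves as his or her own control, the paired test removes the person-to-person variation and is far more sensitive than treating the two columns as independent samples.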
3.7 Test for difference of variances: F-test
This is an important test which is used to test the hypothesis H0: σ1² = σ2².
1. Null hypothesis: H0: σ1² = σ2².
2. Alternative hypothesis: H1: σ1² ≠ σ2² (σ1² < σ2², σ1² > σ2²).
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test Statistic: Under H0,
F = S1²/S2² ~ F with (n1 − 1, n2 − 1) d.f.
where S1² and S2² are the sample variances, with the larger sample variance taken
in the numerator.
5. Comparison and Conclusion.
Two tailed test:
1. Critical value approach:
Find the table value at the chosen level of significance α. Compare this value with
the calculated value.
(a) If cal ≤ tab then, do not reject the null hypothesis.
(b) If cal > tab then, reject the null hypothesis.
2. p-value approach:
Compute the p-value and compare it with the chosen level of significance α.
(a) If p ≥ α then, do not reject the null hypothesis.
(b) If p < α then, reject the null hypothesis.
One tailed test:
1. Right tailed test:
(a) Critical value approach:
Find the table value at the chosen level of significance α. Compare this value
with the calculated value.
i. If cal ≤ tab then, do not reject the null hypothesis.
ii. If cal > tab then, reject the null hypothesis.
(b) p-value approach:
Compute the p-value and compare it with α.
i. If p ≥ α then, do not reject the null hypothesis.
ii. If p < α then, reject the null hypothesis.
2. Left tailed test
(a) Critical value approach:
Find the table value at the chosen level of significance α. Compare this value
with the calculated value.
i. If cal > tab then, do not reject the null hypothesis.
ii. If cal ≤ tab then, reject the null hypothesis.
(b) p-value approach: Compute the p-value and compare it with α.
i. If p ≥ α then, do not reject the null hypothesis.
ii. If p < α then, reject the null hypothesis.
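The F-test of section 3.7 can be sketched in Python with the standard library. The two samples are hypothetical, and the critical value F(0.025; 7, 6) ≈ 5.70 is read from the F table (upper 2.5% point, for a two-tailed test at α = 0.05 with the larger variance in the numerator).

```python
# Hypothetical samples: daily output on two machines; compare their variability
x = [21.3, 22.1, 20.8, 23.0, 21.7, 22.5, 20.9, 22.2]
y = [21.0, 21.4, 21.2, 20.8, 21.1, 21.5, 20.9]

def sample_var(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)   # divisor n - 1

s1_sq, s2_sq = sample_var(x), sample_var(y)
# Larger sample variance in the numerator, as in section 3.7
if s1_sq >= s2_sq:
    F, df1, df2 = s1_sq / s2_sq, len(x) - 1, len(y) - 1
else:
    F, df1, df2 = s2_sq / s1_sq, len(y) - 1, len(x) - 1

F_tab = 5.70    # upper 2.5% point of F(7, 6) from the F table
decision = "Reject H0" if F > F_tab else "Do not reject H0"
print(round(F, 3), df1, df2, decision)
```

For these data F ≈ 9.24 exceeds the table value, so the hypothesis of equal variances is rejected; the pooled t-test of section 3.4(a) would not be appropriate here.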
Chapter 4
Chi-Square tests
4.1 Introduction
The statistical-inference techniques presented so far have dealt exclusively with hy-
pothesis tests and confidence intervals for population parameters, such as population
means and population proportions. In this chapter, we consider three widely used in-
ferential procedures that are not concerned with population parameters. These three
procedures are often called chi-square procedures because they rely on a distribution
called the chi-square distribution.
The distribution is also important in finance, in the discrete hedging and pricing of
options. It is used to construct the confidence interval for the population variance σ².
Note also that this distribution is derived from the normal distribution: the square of
a standard normal variate gives a chi-square random variable with 1 degree of freedom,
and the sum of the squares of n independent standard normal random variables follows
a chi-square distribution with n degrees of freedom.
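This relationship between the normal and chi-square distributions can be checked empirically with a small simulation; the sample size, number of replications, and seed below are arbitrary choices for illustration.

```python
import random

random.seed(42)
n, reps = 5, 100_000

# Sum of squares of n independent standard normals ~ chi-square with n d.f.
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(reps)]

mean_est = sum(samples) / reps
var_est = sum((s - mean_est) ** 2 for s in samples) / reps

# A chi-square variable with n d.f. has mean n and variance 2n
print(round(mean_est, 2), round(var_est, 2))   # close to 5 and 10
```

The simulated mean and variance come out close to n = 5 and 2n = 10, the theoretical moments of the chi-square distribution with 5 degrees of freedom.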
The tests discussed in this chapter have wide applicability