Fundamentals of Statistical
Inference
compiled by
Srilakshminarayana, G., M.Sc., Ph.D.
Shri Dharmasthala Manjunatheswara Institute
for Management Development
#1 Chamundi Hill Road, Siddhartha Nagar, Mysore-570011
(Private Circulation Only, September 2012)
Table of Contents

Important note about the material  1

1 Estimation  2
  1.1 Importance of estimation in management  2
  1.2 Key terms in estimation  7
  1.3 Determination of sample size  9
  1.4 Point estimator for population mean  9
      1.4.1 Steps in obtaining an estimate of population mean  11
  1.5 Point estimator for population variance  11
      1.5.1 Steps in calculating an estimate of population variance  11
  1.6 Role of sampling  12
  1.7 Sampling distribution of a Statistic  12
  1.8 Sampling error  13
  1.9 Point estimator for a population proportion  14
  1.10 Finding the best estimator  14
  1.11 Drawback of point estimate  15
  1.12 Interval estimation  15
  1.13 Probability of the true population parameter falling within the interval estimate  16
  1.14 Interval estimates and confidence intervals  18
  1.15 Relationship between confidence level and confidence interval  19
  1.16 Using sampling and confidence interval estimation  19
  1.17 Interval estimation of population mean (σ known)  20
  1.18 Using the Z statistic for estimating population mean  20
  1.19 Using finite correction factor for the finite population  25
  1.20 Interval estimation for difference of two means  27
  1.21 Confidence interval estimation of the population mean (σ unknown)  28
  1.22 Checking the assumptions  29
  1.23 Concept of degrees of freedom  30
  1.24 Confidence interval estimation for population proportion  32
  1.25 Estimation of the sample size  33
  1.26 Sample size for estimating population mean  35
  1.27 Sample size for estimating population proportion  40
  1.28 Sample size for an interval estimate of a population proportion  41
  1.29 Further discussion of sample size determination for a proportion  42

2 Testing of Hypothesis-Fundamentals  44
  2.1 Introduction  44
  2.2 Formats of Hypothesis  47
  2.3 The rationale for hypothesis testing  48
  2.4 Steps in hypothesis testing  49
  2.5 One tail and two tail tests  55
      2.5.1 One tailed test  56
      2.5.2 Two tailed test  56
  2.6 Critical region and non-critical region  56
  2.7 Errors in hypothesis testing  57
  2.8 Test for single mean  59
      2.8.1 Z-test for single mean (σ known)  59
      2.8.2 Testing Using Excel  60
      2.8.3 t-test for single mean (σ unknown)  61
      2.8.4 Testing Using Excel  62
  2.9 Test for single proportion  63
      2.9.1 Testing Using Excel  64
  2.10 Comparison and conclusion  65

3 Testing of Hypothesis-Two Sample Problem  67
  3.1 Introduction  67
  3.2 Assumptions  67
  3.3 Test for difference of means: Z-test  68
      3.3.1 Testing Using Excel: σ₁² = σ₂² = σ² (known)  69
      3.3.2 Testing Using Excel: Unequal Variances (Known)  70
  3.4 Test for difference of means: t-test  71
      3.4.1 Testing Using Excel: σ₁² = σ₂² = σ² (unknown)  72
      3.4.2 Testing Using Excel: Unequal Variances (Unknown)  73
  3.5 Test for difference of two proportions  74
      3.5.1 Testing Using Excel: Test for Difference of Proportions  75
  3.6 Test for dependent samples  76
      3.6.1 Testing Using Excel  77
  3.7 Test for difference of variances: F-test  78
  3.8 Comparison and conclusion  78

4 Chi-Square Tests  80
  4.1 Introduction  80
      4.1.1 Chi-square test for significance of a population variance  81
      4.1.2 Chi-square test for goodness of fit  81
      4.1.3 Chi-square test for independence of attributes  82
  4.2 Comparison and conclusion  82

5 Analysis of Variance (ANOVA)  84
  5.1 Introduction  84
  5.2 One way ANOVA  85
      5.2.1 Assumptions  85
      5.2.2 Steps for computing the F test value for ANOVA  86
  5.3 Two-Way Analysis of Variance  88
      5.3.1 Assumptions for the Two-Way ANOVA  90
  5.4 The Scheffe Test and the Tukey Test  91
      5.4.1 Scheffe Test  91
  5.5 Tukey Test  91

6 Correlation and Regression  93
  6.1 Testing significance of correlation: ρ = 0  93
  6.2 Testing significance of correlation: ρ = ρ₀  94
  6.3 Testing significance of correlation: ρ₁ = ρ₂  95
  6.4 Testing significance of regression model  96

References  97
Important note about the material
This material is for internal circulation only and is not a substitute for a textbook.
It contains only the fundamental steps to be followed when inferential tools are used
to analyze data. It is restricted to the needs of the present batch and does not contain
complete information about each topic. The complete information can be found in the
prescribed textbook and in the other references.
Chapter 1
Estimation
1.1 Importance of estimation in management
Everyone makes estimates. When you are ready to cross a street, you estimate the
speed of any car that is approaching, the distance between you and that car, and
your own speed. Having made these quick estimates, you decide whether to wait,
walk, or run. All managers must make quick estimates too. The outcome of these
estimates can affect their organizations as seriously as the outcome of your decision as
to whether to cross the street. University department heads make estimates of next
session's enrollment in Statistics. Credit managers estimate whether a purchaser will
eventually pay his bills. Prospective home buyers make estimates concerning the
behavior of interest rates in the mortgage market. All these people make estimates
without worrying about whether they are scientific, but with the hope that the estimates
bear a reasonable resemblance to the outcome. Managers use estimates because, in all
but the most trivial decisions, they must make rational decisions without complete
information and with a great deal of uncertainty about what the future will bring.
How do managers use sample statistics to estimate population parameters? The
department head attempts to estimate enrollments next fall from current enrollments
in the same courses. The credit manager attempts to estimate the creditworthiness of
prospective customers from a sample of their past payment habits. The home buyer
attempts to estimate the future course of interest rates by observing the current
behavior of those rates. In each case, somebody is trying to infer something about a
population from information taken from a sample. This chapter introduces methods
that enable us to estimate with reasonable accuracy the population proportion (the
proportion of the population that possesses a given characteristic) and population
mean. To calculate the exact proportion or the exact mean would be an impossible
goal. Even so, we will be able to make an estimate, make a statement about the error
that will probably accompany the estimate, and implement some controls to avoid as much
of the error as possible. As decision makers, we will be forced at times to rely on
blind hunches. Yet in other situations, in which information is available and we apply
statistical concepts, we can do better than that.
Let us start with a small discussion on why a management student should study
statistical methods to estimate unknown quantities. Estimates are made at all levels
of management: low, middle, and high. At any level, we first understand the present
carefully, then look into the past to see what has happened, list all the options the
past suggests, and choose the best among them. The option that best suits the present
is taken as the solution.
For example, the manager of a production unit wishes to estimate the items to be
produced for the current year and depending on his estimate he wishes to place an
order for the raw materials. He consults his records and looks at the items produced
and the raw materials used to produce them. Finally, he uses his experience, decides
on the items to be produced for the current year and, depending on the estimate,
prepares an order for the raw materials. But what is the guarantee that
the value he estimated is free of error? How can he justify that the actual requirement
is close to the value he estimated? There is a chance that the value he estimated using
his experience may be an overestimate or an underestimate. How can he convince his
boss that the value he chose will yield the organization better profits? If everything
goes fine, no one will blame him. It would have been better if life were free of
uncertainty, but it is not. The manager must therefore account for the uncertainty
associated with an estimate obtained from experience. At this stage one can argue
that, being experienced, he can specify a range instead of a single value. The statement
could be, "Maybe this time the requirement will lie between 10000 and 15000." Even
now there is uncertainty, because the word "maybe" signals it. One can continue the
argument, but what we finally want is a statement of the form: the requirement for
the current year lies between 10000 and 15000, and the chance that it lies outside this
range is 0.05. How do we get this chance of 0.05? From the systematic procedures
available in statistics. The manager needs such a statement because he has to report
to his boss that the requirement for the current year lies between limits A and B. The
boss is most concerned with satisfying the needs of the customers, and if anything
goes wrong it is the boss who will be targeted first. To avoid this, the manager can
use the statistical techniques available and provide the range along with the associated
chance. This is again done by taking the past data into consideration. Systematic
construction requires understanding the past carefully and choosing a tool appropriate
for the given situation. The technique must be chosen based on the study variable
under consideration, because the tools used for a quantitative variable cannot be
applied to a qualitative variable without proper adjustment.
The example discussed above is from a production unit. Similarly, let us consider
marketing. The sales executive has to report to his boss the number of packets of
oil he will sell this month. What will he do when his boss asks him about this? He
immediately says that he will sell 150 packets of oil this month. How did he say this?
He didn't use statistics to do this; he used his experience. There is no point in
opening an Excel sheet and running a statistical procedure to produce this number.
He used his common sense, past experience, and knowledge of market conditions. He
is sure that the market needs at least 150 packets this month, and he already sold 145
packets last month. He also has complete knowledge of his competitors' sales in the
market. Taking all these factors into consideration, he could easily estimate the current month's
sales. Let us consider another example. Suppose that this time the sales executive
has been promoted to sales manager. Now he has to estimate the sales of the entire
region. The problem is that he is now a manager, not a sales executive. He has to
take the data from the sales executives of the entire region and then estimate the
sales for the current year. Depending on this estimate, he has to build a strategy
to increase sales. In the previous case he could get by on experience and common
sense. But now he is a manager and cannot take that risk. He can still use his
experience, but this time only to develop a proper strategy. He should take the help
of statistical methods to give a proper estimate and to construct a better strategy.
What will he do? He will take the data from the sales executives, take the average
of all the values, adjust it according to the market conditions, and finally give an
estimate. Is it a good estimate? What adjustment should he make to the average so
that the estimate convinces his boss? The answer is very simple.
According to the statistical theory, the sample average best estimates the population
average. Here the population average is the sales for the current year and the sample
average is the value he calculated after obtaining the data from his executives. What
about the adjustment? The adjustment is to construct an interval associated with a
probability value, take a value within the interval and consider it as an estimate.
Let us consider the case in Human resource management (HRM). Suppose that
the HR manager wishes to know about the performance of the new appraisal system
developed to appraise the employees. Since the organization has thousands of
employees, it is apparent that she cannot take the opinion of all of them. She has
to take a sample of employees and consider their opinions. Here the statistic is the
sample proportion of employees who are against the system, an estimator of the
population proportion. The variable under consideration is a qualitative variable and
the appropriate estimator is the sample proportion.
Management is a discipline which uses statistical tools to support the decisions
relating to various business situations. Most of the time the decision maker is left
with some amount of data relating to the given situation, on which he is supposed
to take a decision. It is always desirable to use the data obtained and take an
appropriate decision. One important aspect of decision making is estimation, which
is a part of statistical inference. Estimation is a systematic way of understanding
the behavior of unknown population characteristics based on a sample. These
characteristics include all the descriptive measures of a properly defined population,
but most of the time we are interested in the population mean and variance. These
are the characteristics that play an important role in making decisions. It is very
important to note that the mean should always be accompanied by the variance: the
mean measures central tendency and the variance measures dispersion. To estimate
these characteristics, we use the sample data gathered from the defined population.
The sample is selected as a true representative of the population chosen for the study;
note that care has to be taken while selecting the sample. Coming back to estimation,
we use the sample characteristics to estimate the population characteristics: the
sample mean and variance are used to estimate the population mean and variance.
Two types of estimation have been studied formally by researchers: point estimates and interval
estimates. A point estimate is the value of the statistic for a given sample. We
use sample statistics as estimators to estimate the population parameters. These
estimators are functions of the sample i.e., they produce different values for different
samples. Each value is considered as the estimate of the parameter. Point estimates
obtained for different samples put together constitute sampling distribution of the
statistic. The usual understanding in estimation is that, for sufficiently large samples,
these sample means, when plotted, produce a normal curve. This basic result is
very important for constructing an interval estimate. Another important aspect of
point estimation is the associated sampling error of the statistic. When we obtain
the point estimate from a sample, it is equally important to obtain the sampling
error or standard deviation of the statistic. This sampling error gives the amount of
fluctuation that can be allowed below and above the estimate.
The purpose of any random sample is to estimate properties of a population
from the data observed in the sample. The mathematical procedures appropriate
for performing this estimation depend on which properties are of interest and which
type of random sampling scheme is used. Note that the sampling scheme has to be
selected appropriately for a given situation. The decision maker has to take care of
the assumptions made at the time of selecting the sampling scheme. This is very im-
portant because the assumptions of the mathematical model that will be used in the
later stages should coincide with the assumptions made at the time of selecting the
sample. If this is not taken care of, the results obtained may not be reliable. Along
with this, another aspect that plays an important role is sampling error. Sampling
error is the inevitable result of basing an inference on a random sample rather than
on the entire population.
1.2 Key terms in estimation

1. Population: Group of objects or individuals that possess the assumed charac-
teristics under study. This group can be finite or infinite.
2. Sample: Group of objects or individuals that possess the same characteristics
as that of population, taken for enumeration and further analysis. This group
is considered as the true representative of the entire population under study.
3. Parameter: Unknown characteristics of the population under study such as
population mean, median, mode, standard deviation etc.
4. Statistic: Characteristics of the sample such as sample mean, median, mode,
standard deviation etc.
5. Estimator: Any statistic, which is a function of sample values, used to estimate
a population parameter.
6. Estimate: An estimate is a specific value of the estimator for a given sample.
7. Point estimate: A point estimate is a numerical value, a best guess of a
population parameter, based on the data in a sample.
8. Estimation error: The estimation error is the difference between the point
estimate and the true value of the population parameter being estimated.
9. Interval estimate: An interval estimate is an interval around the point esti-
mate, calculated from the sample data, where we strongly believe the true value
of the population parameter lies.
10. Unbiased estimate: An unbiased estimate is a point estimate such that the
mean of its sampling distribution is equal to the true value of the population
parameter being estimated.
11. Efficiency: Another desirable property of a good estimator is that it be ef-
ficient. Efficiency refers to the size of the standard error of the statistic. If
we compare two statistics computed from samples of the same size and try to decide
which one is the more efficient estimator, we would pick the statistic that has
the smaller standard error, or standard deviation of the sampling distribution.
12. Sufficiency: An estimator is sufficient if it makes so much use of the infor-
mation in the sample that no other estimator could extract from the sample
additional information about the population parameter being estimated.
13. Consistency: A point estimator is said to be consistent if its value tends to
become closer to the population parameter as the sample size increases.
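The unbiasedness and consistency properties above can be illustrated with a short simulation. This is only a sketch: the population (10,000 normal values) and the sample sizes are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(42)

# Hypothetical population of 10,000 values (arbitrary, for illustration only).
population = [random.gauss(50, 10) for _ in range(10_000)]
pop_mean = statistics.mean(population)

def sample_means(n, reps=2000):
    """Sample means of `reps` repeated random samples of size n."""
    return [statistics.mean(random.sample(population, n)) for _ in range(reps)]

# Unbiasedness: the sampling distribution of the mean is centred on the
# population mean.
center = statistics.mean(sample_means(30))

# Consistency: the sample mean concentrates around the parameter as n grows,
# so its spread (standard error) shrinks.
spread_small_n = statistics.stdev(sample_means(10))
spread_large_n = statistics.stdev(sample_means(200))

print(round(pop_mean, 1), round(center, 1))
print(round(spread_small_n, 2), round(spread_large_n, 2))
```

The same simulation, run with the biased 1/n variance formula instead of the mean, would show its centre falling below the population variance.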
1.3 Determination of sample size
There are several ways to estimate an unknown characteristic of a population. In
this compiled work we discuss only parametric estimation; interested readers can look
into standard books for other methods such as non-parametric estimation, robust
estimation, etc. In parametric estimation we mainly talk about population
characteristics such as the mean, variance/standard deviation, and proportion. We
first discuss determination of sample size in detail and then proceed to estimation
procedures. At an intermediate stage, i.e. after collecting the sample from the
population under study, we look forward to understanding the behavior of the
population through the characteristics estimated from the sample. Hence one has
to note at this point that the sample taken plays an important role in studying the
population. Now the question is: what should the sample size be? This is an
interesting question, which does not have a ready-made answer, and it is an important
step before the survey. Note that sampling error decreases as the sample size increases.
We also use the fact that the larger the population variance, the larger the sample
size needed to achieve a given degree of accuracy.
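A sketch of how this plays out numerically (the population standard deviation σ and the margin of error below are assumed values, not taken from the text):

```python
import math

sigma = 12.0  # assumed population standard deviation (illustrative)

# The standard error of the mean, sigma / sqrt(n), halves each time n quadruples.
for n in (25, 100, 400):
    print(n, sigma / math.sqrt(n))

# Sample size needed so that a 95% margin of error (1.96 * SE) is at most e:
e = 1.5
n_needed = math.ceil((1.96 * sigma / e) ** 2)
print(n_needed)
```

Halving the margin of error quadruples the required sample size, which is why accuracy targets drive survey cost so quickly.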
Determining the best sample size is not just a statistical decision. Statisticians
can tell you how the standard error behaves as you increase or decrease the sample
size, and the market researchers can tell you what the cost of taking more or larger
samples will be. But it is the decision maker who must use judgement to combine
these two inputs and make a sound managerial decision.
1.4 Point estimator for population mean
Definition 1. Point estimator: A sample statistic that is calculated using sample
data to estimate the most likely value of
the corresponding unknown population parameter is termed as point estimator, and the
numerical value of the estimator is termed as point estimate. A point estimate consists
of a single sample statistic that is used to estimate the true value of a population
parameter.
For example, the sample mean X̄ is a point estimate of the population mean μ,
and the sample variance S² is a point estimate of the population variance σ². On
many occasions estimating the population mean is useful in business research. For
example:

1. The manager of human resources in a company might want to estimate the
average number of days of work an employee misses per year because of illness.
If the firm has thousands of employees, direct calculation of a population mean
such as this may be practically impossible. Instead, a random sample of em-
ployees can be taken, and the sample mean number of sick days can be used to
estimate the population mean.
2. Suppose that another company developed a new process for prolonging the
shelf life of a loaf of bread. The company wants to be able to date each loaf for
freshness, but company officials do not know exactly how long the bread will
stay fresh. By taking a random sample and determining the sample mean shelf
life, they can estimate the average shelf life for the population of bread.
3. As the cellular telephone industry matures, a cellular telephone company is
rethinking its pricing structure. Users appear to be spending more time on
the phone and are shopping around for the best deals. To do better planning,
the cellular company wants to ascertain the average number of minutes of time
used per month by each of its residential users but does not have the resources
available to examine all monthly bills and extract the information. The company
decides to take a random sample of customer bills and estimate the population
mean from sample data. A researcher for the company takes a random sample
of 85 bills for a recent month and from these bills computes a sample mean
of 510 min. This sample mean, which is a statistic, is used to estimate the
population mean, which is a parameter. If the company uses the sample mean
of 510 min as an estimate for the population mean, then the sample mean is
used as a point estimate.
4. A tire manufacturer developed a new tire designed to provide an increase in
mileage over the firm's current line of tires. To estimate the mean number of
miles provided by the new tires, the manufacturer selected a sample of 120 new
tires and observed a sample mean of 36,500 miles.
In all the above examples, note that the statistic (the sample mean) is a function of the
sample drawn from the population under study and the numerical value assumed by
this statistic is an estimate of the population mean. (Observe the difference between
an estimator and an estimate).
1.4.1 Steps in obtaining an estimate of population mean
1. Draw a sample from the population under study.
2. Find the total of all the observations in the sample.
3. Divide the total by the number of observations.
4. The resultant value is the sample mean, which is taken as the estimate of the
population mean.
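The four steps can be sketched in a few lines of Python (the sample values here are made up for illustration):

```python
sample = [12, 15, 11, 14, 13, 15, 10, 14]   # step 1: the drawn sample
total = sum(sample)                          # step 2: total of all observations
mean_estimate = total / len(sample)          # step 3: divide by the number of observations
print(mean_estimate)                         # step 4: point estimate of the population mean
```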
1.5 Point estimator for population variance

The estimation of population variance is an important step in analyzing the sample
drawn from the population under study. We use sample variance to estimate the
population variance. But sample variance is not an unbiased estimator of population
variance. So we modify the formula used to calculate the sample variance. The
formula to calculate the sample variance is given by

    s² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)²

In order to get an unbiased estimator, one has to change 1/n to 1/(n − 1). The
resultant is called the mean square error, which gives an unbiased estimator of the
population variance.
1.5.1 Steps in calculating an estimate of population variance
1. Calculate the mean of the sample drawn.
2. Compute the deviation of all the observations from the mean.
3. Square the deviations and obtain the total.
4. Divide the total obtained in step 3 by n − 1.
Note 1. The above formulae for the mean and variance apply when individual
observations are available. If one is working with a frequency distribution, then the
frequencies have to be included in calculating the mean and variance.
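The steps above, with the n − 1 divisor, can be sketched as follows (the sample values are hypothetical; the result agrees with `statistics.variance`, which also divides by n − 1):

```python
import statistics

sample = [12, 15, 11, 14, 13, 15, 10, 14]      # hypothetical sample
n = len(sample)

mean = sum(sample) / n                          # step 1: sample mean
deviations = [x - mean for x in sample]         # step 2: deviations from the mean
total_sq = sum(d * d for d in deviations)       # step 3: sum of squared deviations
variance_estimate = total_sq / (n - 1)          # step 4: divide by n - 1 (unbiased)

print(variance_estimate)
```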
1.6 Role of sampling
In order to understand the population characteristics like mean, variance etc. it is
very important to draw a sample which is a true representative of the population.
A proper sampling design has to be adopted before drawing a sample. A sampling
frame should be constructed and then checked against the population. Care should
be taken to decrease the non-response rate. It should be noted that a random sample
will estimate the population parameters better than a non-random sample. In order
to get a better estimate, it is also important to ensure that the sample is free of any
sort of bias. The questionnaire framed to collect the responses should be tested using
a pilot survey before the actual survey. One has to note that the pilot survey has to
be framed in such a way that it resembles the actual survey, and it should give better
insight into the resources needed to conduct the actual survey. An interesting point is
that a smaller sample that is a true representative gives more satisfactory results than
a larger sample that is not a true representative of the population. Another interesting
aspect of sampling is that the belief that larger populations need larger samples is not
always valid. The sample should be taken depending on the situation and the
objectives.
1.7 Sampling distribution of a Statistic
Sampling distribution is the underlying probability distribution of the statistic used
for the study. This is constructed by taking several samples from the population.
For example, a sampling distribution of sample mean is constructed by taking as
many samples as possible from the population and by calculating sample mean for
all the samples. The set of all these values constitutes the sampling distribution of
the sample mean. Theoretically, it has been shown that the sampling distribution of
the mean is either normal (by the central limit theorem: finite known variance, larger
sample sizes) or a t-distribution (when the assumption of normality is satisfied: small
sample sizes). When the assumption of normality is not satisfied, the sampling
distribution of the sample mean can be approximated by the normal law, using the
central limit theorem, for sufficiently large sample sizes. The sampling distribution
of the sample variance, or mean square error, is related to the chi-square distribution
(discussed in detail in Chapter 4).
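A small simulation illustrates the central limit theorem claim: even for a markedly non-normal (exponential) population, the sampling distribution of the mean is centred on the population mean with spread close to σ/√n. The population and sizes here are assumptions made for illustration.

```python
import random
import statistics

random.seed(7)

# Exponential population with mean 1 and standard deviation 1.
def sample_mean(n):
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

# Build the sampling distribution of the mean from 5,000 samples of size 50.
means = [sample_mean(50) for _ in range(5000)]

center = statistics.mean(means)    # close to the population mean, 1.0
spread = statistics.stdev(means)   # close to sigma / sqrt(n) = 1 / sqrt(50)

print(round(center, 3), round(spread, 3))
```

Plotting `means` as a histogram would show the familiar bell shape despite the skewed parent population.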
1.8 Sampling error
After drawing a sample, it is important to study the sampling error. For this, the
decision maker has to find the standard error of the estimator used to estimate the
population parameter. What, then, is the relation between the standard error and the
sampling error? Note that the sample is drawn to understand the behaviour of
population characteristics (the mean, median, etc.), which are studied through their
estimators computed from the sample. Obviously, if the sampling error is large, this
will be reflected in the standard error of the estimator. Also note that the reciprocal
of the standard error gives the precision of the estimator. This is because it is expected
that the absolute difference between the true population characteristic and the sample
estimator is less than ε, where ε depends on the standard error. Refer to the section
on determination of sample size to understand this better.
Sampling variation is the price we pay for working with a sample rather than the
population.
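As a sketch, the standard error of the sample mean is estimated by s/√n. The data below are hypothetical monthly usage figures in the spirit of the cellular-bills example earlier in the chapter:

```python
import math
import statistics

sample = [510, 480, 530, 495, 520, 505, 490, 515]   # hypothetical monthly minutes
n = len(sample)

s = statistics.stdev(sample)        # sample standard deviation (n - 1 divisor)
standard_error = s / math.sqrt(n)   # estimated sampling error of the mean

print(round(statistics.mean(sample), 2), round(standard_error, 2))
```

The point estimate would then be reported together with this standard error, which quantifies the fluctuation to be allowed above and below it.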
1.9 Point estimator for a population proportion
When the underlying variable is a qualitative variable, one is interested in studying
the proportion of individuals who satisfy a particular attribute. For example, the
sales manager may be interested in studying the proportion of individuals who give
more importance to quality than cost. Here, he may confine to the customers who
are regular in purchasing from his store. For this properly defined population, the
parameter is the proportion (denoted by P) and the sample proportion (denoted by p̄)
is its unbiased estimator. To calculate the sample proportion, one has to define
the random variable under study properly. Then, count the individuals who satisfy
the attribute (denote this count by X) and take the ratio of X to n, the sample size, to
get the estimate. Note that the sampling distribution of the sample proportion can be
approximated by the normal distribution. But the exact probability distribution used
to model the number of individuals who fall under a particular category is the binomial
distribution.
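A minimal sketch of this calculation; the counts X and n below are made-up values for illustration:

```python
# Point estimate of a population proportion: X individuals in the sample
# satisfy the attribute, out of n sampled. Both counts are assumptions.
X = 132          # number who value quality over cost (assumed)
n = 400          # sample size (assumed)

p_hat = X / n    # unbiased point estimate of the population proportion P
print(p_hat)     # 0.33
```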
1.10 Finding the best estimator
A given sample statistic is not always the best estimator of its analogous population
parameter. Consider a symmetrically distributed population in which the values of
the median and the mean coincide. In this instance, the sample mean would be
an unbiased estimator of the population median. Also, the sample mean would be a
consistent estimator of the population median because, as the sample size increases,
the value of the sample mean tends to come very close to the population median.
And the sample mean would be a more efficient estimator of the population median
than the sample median itself because, in large samples, the sample mean has a smaller
standard error than the sample median. At the same time, the sample median in a
symmetrically distributed population would be an unbiased and consistent estimator
of the population mean, but not the most efficient estimator, because in large samples
its standard error is larger than that of the sample mean.
1.11 Drawback of point estimate
The drawback of a point estimate is that no information is available regarding its
reliability, i.e., how close it is to the true population parameter. In fact, the probability
that a single sample statistic exactly equals the population parameter is extremely
small. For this reason, point estimates are rarely used alone to estimate population
parameters. It is better to offer a range of values within which the population
parameter is expected to fall, so that the reliability (probability) of the estimate can be
measured. This is the purpose of interval estimation.
1.12 Interval estimation
In most cases, a point estimate does not provide information about how close
the estimate is to the population parameter unless it is accompanied by a statement of
the possible sampling error involved, based on the sampling distribution of the statistic.
It is therefore important to know the precision of an estimate before depending on
it to make a decision. Thus, decision-makers prefer to use an interval estimate (i.e.,
a range of values defined around a sample statistic) that is likely to contain the
population parameter value.

Interval estimation is a rule for calculating two numerical values, say L and U, that
create an interval [L, U] intended to contain the population parameter of interest. The
probability that this interval contains the parameter is commonly referred to as the
confidence coefficient and denoted by (1 − α). It is also important to state how
confident one should be that the interval estimate contains the parameter value. Hence
an interval estimate of a population parameter is a confidence interval with a statement
of confidence (probability) that the interval contains the parameter value. In other
words, a confidence interval estimate is an interval of values computed from sample
data that is likely to contain the true population parameter value.
Suppose the marketing research director needs an estimate of the average life in
months of car batteries his company manufactures. We select a random sample of
200 batteries, record the car owners' names and addresses as listed in store records,
and interview these owners about the battery life they have experienced. Our sample
of 200 users has a mean battery life of 36 months. If we use the point estimate of the
sample mean X̄ as the best estimator of the population mean μ, we would report that
the mean life of the company's batteries is 36 months. But the director also asks for a
statement about the uncertainty likely to accompany this estimate, that
is, a statement about the range within which the unknown population mean is likely
to lie. To provide such a statement, we need to find the standard error of the mean.
The general form of an interval estimate is as follows:

Point estimate ± Margin of error

The purpose of an interval estimate is to provide information about how close the
point estimate is to the value of the population parameter. The general form of an
interval estimate of a population mean is

X̄ ± Margin of error

The general form of an interval estimate of a population proportion is

p̄ ± Margin of error

The sampling distributions of X̄ and p̄ play key roles in computing these interval
estimates.
1.13 Probability of the true population parameter
falling within the interval estimate

To begin to solve this problem, we should review the relevant concepts of the normal
probability distribution: specific portions of the area under the normal curve are
located between plus and minus any given number of standard deviations from the
mean. Fortunately, we can apply these properties to the standard error of the mean
and make a statement about the range of values used to
make an interval estimate. Note that if we select and plot a large number of sample
means from a population, the distribution of these means will approximate the normal
curve. Furthermore, the mean of the sample means will be the same as the population
mean. Our sample size of 200 (in the battery example) is large enough that we can apply
the central limit theorem. To measure the spread, or dispersion, in our distribution
of sample means, we can use the following formula and calculate the standard error
of the mean:

Standard error of the mean for an infinite population:

σX̄ = σ/√n, where σ is the standard deviation of the population.

Suppose we have already estimated the standard deviation of the population of the
batteries and reported that it is 10 months. Using this standard deviation, we can
calculate the standard error of the mean:

σX̄ = σ/√n = 10/√200 = 0.707 month

We could now report to the director that our estimate of the life of the company's
batteries is 36 months, and the standard error that accompanies this estimate is
0.707. In other words, the actual mean life for all the batteries may lie somewhere
in the interval estimate of 35.293 to 36.707 months. This is helpful but insufficient
information for the director. Next we need to calculate the chance that the actual life
will lie in this interval or in other intervals of different widths that we might choose,
such as ±2σX̄ (2 × 0.707), ±3σX̄ (3 × 0.707), and so on.

The probability is 0.955 that the mean of a sample of size 200 will be within ±2
standard errors of the population mean. Stated differently, 95.5 percent of all the
sample means are within ±2 standard errors of μ, and hence μ is within ±2 standard
errors of 95.5 percent of all the sample means. Theoretically, if we select
1,000 samples at random from a given population and then construct an interval of
±2 standard errors around the mean of each of these samples, about 955 of these
intervals will include the population mean. Similarly, the probability is 0.683 that
the mean of the sample will be within ±1 standard error of the population mean, and
so forth. This theoretical concept is basic to interval construction and statistical
inference. Applying this to the battery example, we can now report to the director
that our best estimate of the life of the company's batteries is 36 months, and we are
68.3 percent confident that the life lies in the interval from 35.293 to 36.707 months
(36 ± 1σX̄). Similarly, we are 95.5 percent confident that the life falls within the
interval of 34.586 to 37.414 months (36 ± 2σX̄), and we are 99.7 percent confident
that battery life falls within the interval of 33.879 to 38.121 months (36 ± 3σX̄).
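The arithmetic of the battery example can be checked directly; the figures 36 months, σ = 10, and n = 200 come from the text above:

```python
# Reproducing the battery example: standard error and the 68.3%, 95.5%,
# and 99.7% interval estimates around the sample mean of 36 months.
x_bar, sigma, n = 36.0, 10.0, 200

se = sigma / n ** 0.5          # standard error of the mean, 10 / sqrt(200)
print(round(se, 3))            # 0.707

for k in (1, 2, 3):            # +/- 1, 2, and 3 standard errors
    lower, upper = x_bar - k * se, x_bar + k * se
    print(round(lower, 3), round(upper, 3))
```

The three printed pairs match the intervals quoted in the text: 35.293 to 36.707, 34.586 to 37.414, and 33.879 to 38.121 months.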
1.14 Interval estimates and confidence intervals
In using interval estimates, we are not confined to ±1, ±2, and ±3 standard errors. For
example, ±1.64 standard errors include about 90 percent of the area under the curve:
0.4495 of the area lies on either side of the mean in a normal distribution. Similarly,
±2.58 standard errors include 99 percent of the area, or 49.51 percent on each side of
the mean.
In statistics, the probability that we associated with an interval estimate is called
the confidence level. This probability indicates how confident we are that the interval
estimate will include the population parameter. A higher probability means more
confidence. In estimation, the most commonly used confidence levels are 90 percent,
95 percent, and 99 percent, but we are free to apply any confidence level.
The confidence interval is the range of the estimate we are making. If we report
that we are 90 percent confident that the mean of the population of incomes of people
in a certain community will lie between Rs. 8,000 and Rs. 24,000, then the range
Rs. 8,000 to Rs. 24,000 is our confidence interval. Often, however, we will express the
confidence interval in
standard errors rather than in numerical values. Thus, we will often express confidence
intervals like this: X̄ ± 1.64σX̄, where

X̄ + 1.64σX̄ = Upper limit of the confidence interval
X̄ − 1.64σX̄ = Lower limit of the confidence interval

Thus, confidence limits are the upper and lower limits of the confidence interval. In
this case, X̄ + 1.64σX̄ is called the upper confidence limit (UCL) and X̄ − 1.64σX̄ is
the lower confidence limit (LCL).
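A short check of these ±1.64 standard-error (approximately 90 percent) limits, reusing the battery example's figures of 36 months and a 0.707 standard error:

```python
# Upper and lower confidence limits at +/- 1.64 standard errors.
# 36.0 and 0.707 are the battery example's mean and standard error.
x_bar = 36.0
se = 0.707

lcl = x_bar - 1.64 * se   # lower confidence limit (LCL)
ucl = x_bar + 1.64 * se   # upper confidence limit (UCL)
print(round(lcl, 3), round(ucl, 3))
```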
1.15 Relationship between confidence level and
confidence interval
You may think that we should use a high confidence level, such as 99%, in all
estimation problems. After all, a high confidence level seems to signify a high degree
of accuracy in the estimate. In practice, however, high confidence levels produce
large confidence intervals, and such large intervals are not precise; they give very fuzzy
estimates.
1.16 Using sampling and confidence interval
estimation
We have described samples being drawn repeatedly from a given population in order
to estimate a population parameter. We also mentioned selecting a large number of
sample means from a population. In practice, however, it is often difficult or expensive
to take more than one sample from a population. Based on just one sample, we
estimate the population parameter. We must be careful, then, about interpreting the
results of such a process.
Suppose we calculate from one sample in our battery example the following con-
fidence interval and confidence level: We are 95 percent confident that the mean
battery life of the population lies within 30 and 42 months. This statement does not
mean that the chance is 0.95 that the mean life of all our batteries falls within the
interval established from this one sample. Instead, it means that if we select many
random samples of the same size and calculate a confidence interval for each of these
samples, then in about 95 percent of these cases, the population mean will lie within
that interval.
1.17 Interval estimation of population mean
(σ known)
In order to develop an interval estimate of a population mean, either the population
standard deviation or the sample standard deviation must be used to compute the
margin of error. Although the population standard deviation is rarely known exactly,
historical data or other information available in some applications permit us to obtain
a good estimate of it prior to sampling. In such cases, the population standard
deviation can, for all practical purposes, be considered known. We refer to such cases
as the σ known case.
1.18 Using the Z statistic for estimating population mean

Note that a complete census is neither a feasible nor a practical option. In order to
draw an inference about the population, a researcher has to take a sample and has
to apply statistical techniques to estimate population parameter on the basis of the
sample statistics. For example, a researcher can use two methods to find out the rate
of absenteeism in a manufacturing company with 500,000 employees. The first method
is to go in for a census and calculate the rate of absenteeism based on information from
all the 500,000 employees. This would be extremely difficult in terms of execution
and would be time-consuming and costly. Instead of this, a researcher can take a
sample of any size (keeping in mind the definition of small- and large-sized samples)
and can make an estimate based on the information obtained from the sample. The
possibility of committing non-sampling errors will also be minimized if this method
is used. We need to develop a statistical tool that provides a good estimate of the
population parameter on the basis of the sample statistic. The Z statistic can be
used for estimating the population parameter on the basis of the sample statistic.
According to the central limit theorem, the sample means for sufficiently large
samples (n ≥ 30) are approximately normally distributed, regardless of the shape
of the population distribution. For a normally distributed population, sample means
are normally distributed for any size of the sample.

Suppose the population mean μ is unknown and the true population standard
deviation σ is known. Then for a large sample size (n ≥ 30), the sample mean X̄ is
the best point estimator for the population mean μ. Since the sampling distribution
is approximately normal, it can be used to compute a confidence interval for the
population mean as follows:

X̄ ± Zα/2 σ/√n, or equivalently X̄ − Zα/2 σ/√n ≤ μ ≤ X̄ + Zα/2 σ/√n,

where Zα/2 is the Z-value representing an area of α/2 in the right tail of the standard
normal probability distribution, and (1 − α) is the level of confidence.

Alternative approach:
A (1 − α)100% large-sample confidence interval for a population mean can also be
found by using the statistic

Z = (X̄ − μ) / (σ/√n),

which has a standard normal distribution (i.e., Z ~ N(0, 1)). This formula can be
rearranged algebraically for the population mean:

μ = X̄ ± Z σ/√n

The sample mean can be greater than or less than the population mean; hence the
± sign in the formula. Here α is the area under the normal curve that lies outside the
confidence interval, in the tails of the curve. The confidence interval is the range
within which we can say, with some confidence, that the population mean is located.
We can say this with some confidence, but we are not absolutely sure that the
population mean is within the confidence interval. To be 100% sure, the confidence
interval would have to be indefinitely wide, which would be meaningless. We therefore
use the concept of probability to define some certainty: we can assign some
probability that the population mean is located within the confidence interval.
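This (1 − α)100% interval can be computed for any confidence level using the standard normal quantile for Zα/2. A sketch reusing the battery example's figures (X̄ = 36, σ = 10, n = 200) with α = 0.05:

```python
# (1 - alpha) * 100% confidence interval for the mean with sigma known.
# The sample figures reuse the battery example; alpha = 0.05 is assumed.
from statistics import NormalDist

x_bar, sigma, n = 36.0, 10.0, 200
alpha = 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{alpha/2}, about 1.96
margin = z * sigma / n ** 0.5             # margin of error
print(round(x_bar - margin, 3), round(x_bar + margin, 3))
```

Note that the exact 95% interval (z ≈ 1.96) is slightly wider than the rule-of-thumb ±2-standard-error interval quoted earlier only because 1.96 < 2; the construction is the same.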
If Zα/2 is the Z-value with an area of α/2 in the right tail of the normal curve, then we
can write

P(−Zα/2 ≤ Z ≤ Zα/2) = 1 − α.

The alternative hypothesis is the logical opposite of the null hypothesis. In other
words, when the null
hypothesis is found to be true, the alternative hypothesis must be false or when
null hypothesis is found to be false, the alternative hypothesis must be true. The
alternative hypothesis represents the conclusion reached by rejecting the null
hypothesis if there is sufficient evidence from the sample information to decide
that the null hypothesis is unlikely to be true. Hypothesis-testing methodology
is designed so that the rejection of the null hypothesis is based on evidence from
the sample that the alternative hypothesis is far more likely to be true. However,
failure to reject the null hypothesis is not proof that it is true. One can never
prove that the null hypothesis is correct because the decision is based only on the
sample information, not on the entire population. Therefore, if you fail to reject
the null hypothesis, you can only conclude that there is insufficient evidence to
warrant its rejection. A summary of the null and alternative hypotheses is
presented below:
The null and alternative hypotheses:
(a) The null hypothesis H0 represents the status quo or the current belief in
a situation.
(b) The alternative hypothesis H1 is the opposite of the null hypothesis and
represents a research claim or specific inference you would like to prove.
(c) If you reject the null hypothesis, you have statistical proof that the alter-
native hypothesis is correct.
(d) If you do not reject the null hypothesis, then you have failed to prove the
alternative hypothesis. The failure to prove the alternative, however, does
not mean that you have proven the null hypothesis.
Testing of Hypothesis 52
(e) The null hypothesis H0 always refers to a specified value of the population
parameter (such as μ), not a sample statistic (such as X̄).

(f) The statement of the null hypothesis always contains an equal sign
regarding the specified value of the population parameter (e.g. H0 : μ = 368 grams).

(g) The statement of the alternative hypothesis never contains an equal sign
regarding the specified value of the population parameter (e.g. H1 : μ ≠ 368 grams).
Each of the following statements is an example of a null hypothesis and alter-
native hypothesis:
H0 : μ = μ0    H1 : μ ≠ μ0
H0 : μ ≤ μ0    H1 : μ > μ0
H0 : μ ≥ μ0    H1 : μ < μ0
(I) Directional hypotheses

(a) H0 : There is no difference between the average pulse rates of men and
women.
H1 : Men have lower average pulse rates than women do.

(b) H0 : There is no relationship between exercise intensity and the resulting
aerobic benefit.
H1 : Increasing exercise intensity increases the resulting aerobic benefit.

(c) H0 : The defendant is innocent.
H1 : The defendant is guilty.
(II) Non-directional hypotheses

(a) H0 : Men and women have the same verbal abilities.
H1 : Men and women have different verbal abilities.
(b) H0 : The average monthly salary for management graduates with 4 years
of experience is Rs.75,000.
H1 : The average monthly salary is not Rs.75,000.

(c) H0 : Older workers are more loyal to a company.
H1 : Older workers may not be loyal to a company.
3. Determine the appropriate statistical test:
After setting the hypothesis, the researcher has to decide on an appropriate sta-
tistical test that will be used for statistical analysis. The tests of significance or
test statistic are classified into two categories: parametric and non-parametric
tests. Parametric tests are more powerful because their data are derived from
interval and ratio measurements. Nonparametric tests are used to test hypothe-
ses with nominal and ordinal data. Parametric techniques are the tests of choice
provided certain assumptions are met. Assumptions for parametric tests are as
follows:
i. The selection of any element (or member) from the population should not
affect the chance of any other being included in the sample drawn
from the population.
ii. The samples should be drawn from normally distributed population.
iii. Populations under study should have equal variances.
Non-parametric tests have few assumptions and do not specify normally dis-
tributed populations or homogeneity of variance.
Selection of a test:
For choosing a particular test of significance following three factors are consid-
ered:
a. Whether the test involves one sample, two samples or k samples?
b. Whether samples used are independent or related?
c. Is the measurement scale nominal, ordinal, interval, or ratio?
Further, it is also important to know: (i) the sample size, (ii) the number of samples
and their sizes, and (iii) whether the data have been weighted. Such questions help
in selecting an appropriate test statistic. One-sample tests are used for a single
sample and to test the hypothesis that it comes from a specified population.
The following questions need to be answered before using one-sample tests:

a. Is there a difference between observed frequencies and the expected frequencies
based on a statistical theory?
b. Is there a difference between observed and expected proportions?
c. Is it reasonable to conclude that a sample is drawn from a population with
some specific distribution (normal, Poisson, and so on)?
d. Is there a significant difference between some measure of central tendency
and its population parameter?
The value of the test statistic is calculated from the sampling distribution of the
sample statistic by using the following formula:

Test statistic = (Value of sample statistic − Value of hypothesized population
parameter) / (Standard error of the sample statistic)

The choice of the probability distribution of a sample statistic is guided by the
sample size n and whether the population standard deviation σ is known, as shown
below:

Sample size    σ known                σ unknown
n > 30         Normal distribution    Normal distribution
n ≤ 30         Normal distribution    t-distribution
4. Level of significance: This is the admissible level of error at which we test the null
hypothesis. The level of significance, generally denoted by α, is the probability
of rejecting the null hypothesis even when it is
true. The level of significance is also known as the size of the rejection region
or the size of the critical region. It is very important to note that the level
of significance must be determined before drawing the samples, so that the
obtained result is free from the choice bias of the decision maker. The levels of
significance generally applied by researchers are 0.01, 0.05, and 0.10. It
is specified as the probability of wrongly rejecting the null hypothesis H0. In
other words, the level of significance defines the likelihood of rejecting a null
hypothesis when it is true, i.e. it is the risk a decision maker takes of rejecting
the null hypothesis when it is really true. The guide provided by the statistical
theory is that this probability must be small.
5. Test statistic: This is constructed using the statistic that estimates the population
parameter on which the hypothesis is being tested. The value of the test
statistic decides whether or not to reject the null hypothesis.
6. Critical value: After constructing the test statistic, we need to obtain the critical
value. This critical value divides the entire region into critical and non-critical
region.
7. Conclusion: At this stage, the calculated value of the test statistic is compared
with the critical value and a conclusion is drawn accordingly. In recent times,
the p-value approach has become prominent; these two methods will be discussed
in detail in a later section.
8. Power of the test: This measures the strength of the test in correctly rejecting the
null hypothesis. Its calculation will be discussed for each test separately using
an example.
2.5 One tail and two tail tests
The form of the alternative hypothesis can be either one-tailed or two-tailed, depend-
ing on what the analyst is trying to prove.
2.5.1 One tailed test
One tailed tests are further classified as right tailed and left tailed tests. The
alternative hypothesis decides whether a test is right tailed or left tailed. If the
alternative hypothesis is of the type >, the test is classified as a right-tailed test, and
if it is of the type <, the test is classified as a left-tailed test. Note that the equality
sign always belongs in the null hypothesis. This is because the test statistic is
calculated under the assumption that the null hypothesis is true.
2.5.2 Two tailed test

When the alternative hypothesis is of the type ≠, the test is classified as a two-tailed
test.
2.6 Critical region and non-critical region
The sampling distribution of the test statistic is divided into two regions, a region of
rejection (sometimes called the critical region) and a region of non-rejection. If the
test statistic falls into the region of non-rejection, you do not reject the null hypothesis.
If the test statistic falls into the rejection region, you reject the null hypothesis.

The region of rejection consists of the values of the test statistic that are unlikely
to occur if the null hypothesis is true. These values are much more likely to occur
if the null hypothesis is false. Therefore, if a value of the test statistic falls into this
rejection region, you reject the null hypothesis because that value is unlikely if the
null hypothesis is true. To make a decision concerning the null hypothesis, you first
determine the critical value of the test statistic. The critical value divides the
non-rejection region from the rejection region. Determining the critical value depends on
the size of the rejection region. The size of the rejection region is directly related to
the risks involved in using only sample evidence to make decisions about a population
parameter.
2.7 Errors in hypothesis testing
A Type I error occurs if you reject the null hypothesis, H0, when it is true and should
not be rejected. A Type I error is a false alarm. The probability of a Type I error
occurring is α.
A Type II error occurs if you do not reject the null hypothesis, H0, when it is
false and should be rejected. A Type II error represents a missed opportunity to take
some corrective action. The probability of a Type II error occurring is β.
Whenever we reject a null hypothesis, there is a chance that we have made a
mistake, i.e., that we have rejected a true statement. Rejecting a true null hypothesis
is referred to as a Type I error, and our probability of making such an error is
represented by the Greek letter alpha (). This probability, which is referred to as
the significance level of the test, is of primary concern in hypothesis testing.
On the other hand, we can also make the mistake of failing to reject a false null
hypothesis; this is a Type II error. Our probability of making it is represented by the
Greek letter beta (β). Naturally, if we either fail to reject a true null hypothesis or
reject a false null hypothesis, we have acted correctly. The probability of rejecting
a false null hypothesis is called the power of the test. The four possibilities are shown
in the table below.
                                 Actual Situation
Statistical decision   H0 true                                  H0 false
Do not reject H0       Correct decision, Confidence = (1 − α)   Type II error, P(Type II error) = β
Reject H0              Type I error, P(Type I error) = α        Correct decision, Power = (1 − β)
In hypothesis testing, there is a necessary trade-off between Type I and Type II
errors: For a given sample size, reducing the probability of a Type I error increases the
probability of a Type II error, and vice versa. The only sure way to avoid accepting
false claims is to never accept any claims. Likewise, the only sure way to avoid
rejecting true claims is to never reject any claims. Of course, each of these extreme
approaches is impractical, and we must usually compromise by accepting a reasonable
risk of committing either type of error.
Complements of Type-I and Type-II Errors
The confidence coefficient, 1 − α, is the probability that you will not reject the null
hypothesis when it is true and should not be rejected.

The power of a statistical test, 1 − β, is the probability that you will reject the
null hypothesis when it is false and should be rejected.
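These quantities can be computed for a concrete test. A sketch for a right-tailed Z test of H0: μ = μ0 against H1: μ > μ0 with σ known; the values of μ0, μ1, σ, n, and α are illustrative assumptions, not figures from the text:

```python
# Type II error probability (beta) and power for a right-tailed Z test.
# mu1 is an assumed true mean under H1; all numbers are illustrative.
from statistics import NormalDist

mu0, mu1, sigma, n, alpha = 100.0, 103.0, 10.0, 36, 0.05

se = sigma / n ** 0.5
z_crit = NormalDist().inv_cdf(1 - alpha)          # rejection cutoff in Z units

# beta = P(do not reject H0 | true mean is mu1); power = 1 - beta
beta = NormalDist().cdf(z_crit - (mu1 - mu0) / se)
power = 1 - beta
print(round(beta, 3), round(power, 3))
```

Shifting μ1 further from μ0, increasing n, or increasing α all raise the power, which is the trade-off described above.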
2.8 Test for single mean
In this section, we discuss the two tests that are most common in testing a hypothesis
about the population mean μ: the Z test and the t test. We discuss these two tests
in detail using appropriate examples. The selection of the test depends on the sample
size of the study and on whether the population standard deviation σ is known or
unknown.

Assumptions

1. The variable under study is measured on a ratio or interval scale.
2. The population follows a normal distribution.
3. Population variance σ²: known (Z-test), unknown (t-test).
4. Responses are independent within the samples.
2.8.1 Z-test for single mean (σ known case)

The procedure to use a Z-test is as follows:

1. Null hypothesis: H0 : μ = μ0 (or μ ≤ μ0, or μ ≥ μ0).
2. Alternative hypothesis: H1 : μ ≠ μ0 (or μ > μ0, or μ < μ0).
3. Level of significance: α = 0.05 (or 0.01, 0.02, 0.10).
4. Test Statistic: Under H0,

Two tailed test:

Z = |X̄ − μ0| / (σ/√n) ~ N(0, 1)

One tailed test:

Z = (X̄ − μ0) / (σ/√n) ~ N(0, 1)

5. Comparison and Conclusion.
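The steps above can be sketched in Python's standard library; the sample figures are made-up, and the test shown is two-tailed with α = 0.05:

```python
# Two-tailed Z test for a single mean with sigma known.
# mu0, sigma, n, x_bar, and alpha are assumed figures for illustration.
from statistics import NormalDist

mu0, sigma, n, x_bar, alpha = 50.0, 8.0, 64, 52.4, 0.05

z = (x_bar - mu0) / (sigma / n ** 0.5)        # test statistic
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value

decision = "Reject H0" if abs(z) > z_crit else "Do not reject H0"
print(round(z, 2), round(p_value, 4), decision)
```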
2.8.2 Testing Using Excel

A1 Null Hypothesis: μ = μ0
A2 Level of Significance (α): 0.05
A3 Population Standard Deviation: σ
A4 Sample Size: n
A5 Sample Mean: X̄
A6 Intermediate Calculations
A7 Standard Error of the Mean, σ/√n: =A3/SQRT(A4)
A8 Z Test Statistic, Z = (X̄ − μ0)/(σ/√n) ~ N(0, 1): =(A5-A1)/A7
A9 Two Tailed Test, Alternative Hypothesis H1: μ ≠ μ0
A10 Lower Critical Value: =NORM.S.INV(A2/2)
A11 Upper Critical Value: =NORM.S.INV(1-A2/2)
A12 p-Value: =2*(1-NORM.S.DIST(ABS(A8), TRUE))
A13 Left Tailed Test, Alternative Hypothesis H1: μ < μ0
A14 Lower Critical Value: =NORM.S.INV(A2)
A15 p-Value: =NORM.S.DIST(A8, TRUE)
A16 Right Tailed Test, Alternative Hypothesis H1: μ > μ0
A17 Upper Critical Value: =NORM.S.INV(1-A2)
A18 p-Value: =1-NORM.S.DIST(A8, TRUE)
A19 Conclusion
A20 Reject or Do not reject H0: =IF(A12<A2, "Reject H0", "Do not reject H0")
2.8.3 t-test for single mean (σ unknown case)

The procedure to use a t-test is as follows:

1. Null hypothesis: H0 : μ = μ0 (or μ ≤ μ0, or μ ≥ μ0).
2. Alternative hypothesis: H1 : μ ≠ μ0 (or μ > μ0, or μ < μ0).
3. Level of significance: α = 0.05 (or 0.01, 0.02, 0.10).
4. Test Statistic: Under H0,

Two tailed test:

t = |X̄ − μ0| / (S/√n) ~ t with (n − 1) d.f.

One tailed test:

t = (X̄ − μ0) / (S/√n) ~ t with (n − 1) d.f.

where

S = √( Σ (Xi − X̄)² / (n − 1) ), with the sum taken over i = 1 to n.

5. Comparison and Conclusion.
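A sketch of the steps with made-up data; because the standard library has no t distribution, the two-tailed critical value t with 9 d.f. at α = 0.05 (2.262) is taken from standard t tables:

```python
# Two-tailed t test for a single mean with sigma unknown.
# The data list and mu0 are made-up figures for illustration.
import statistics

data = [48.2, 51.3, 49.8, 52.6, 50.1, 47.9, 53.0, 50.7, 49.4, 51.5]
mu0 = 48.0

n = len(data)
x_bar = statistics.mean(data)
s = statistics.stdev(data)        # sample standard deviation, n - 1 divisor

t = (x_bar - mu0) / (s / n ** 0.5)
t_crit = 2.262                    # two-tailed critical value, 9 d.f., alpha = 0.05

decision = "Reject H0" if abs(t) > t_crit else "Do not reject H0"
print(round(t, 3), decision)
```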
2.8.4 Testing Using Excel

A1 Null Hypothesis: μ = μ0
A2 Level of Significance (α): 0.05
A3 Sample Standard Deviation: S
A4 Sample Size: n
A5 Degrees of Freedom (d.f.): n − 1
A6 Sample Mean: X̄
A7 Intermediate Calculations
A8 Standard Error of the Mean, S/√n: =A3/SQRT(A4)
A9 t Test Statistic, t = (X̄ − μ0)/(S/√n) ~ t with (n − 1) d.f.: =(A6-A1)/A8
A10 Two Tailed Test, Alternative Hypothesis H1: μ ≠ μ0
A11 Lower Critical Value: =T.INV(A2/2, A5)
A12 Upper Critical Value: =T.INV(1-A2/2, A5)
A13 p-Value: =2*(1-T.DIST(ABS(A9), A5, TRUE))
A14 Left Tailed Test, Alternative Hypothesis H1: μ < μ0
A15 Lower Critical Value: =T.INV(A2, A5)
A16 p-Value: =T.DIST(A9, A5, TRUE)
A17 Right Tailed Test, Alternative Hypothesis H1: μ > μ0
A18 Upper Critical Value: =T.INV(1-A2, A5)
A19 p-Value: =1-T.DIST(A9, A5, TRUE)
A20 Conclusion
A21 Reject or Do not reject H0: =IF(A13<A2, "Reject H0", "Do not reject H0")
2.9 Test for single proportion
In this section, we discuss the procedure used to test the significance of a single
proportion.

Assumptions

1. The sampling distribution of the sample proportion is approximately normal.
2. The condition np ≥ 5, n(1 − p) ≥ 5 is satisfied. This condition is necessary to
approximate the sampling distribution of the statistic by the normal law.

Steps in using the test

1. Null hypothesis: H0 : P = P0 (or P ≤ P0, or P ≥ P0).
2. Alternative hypothesis: H1 : P ≠ P0 (or P > P0, or P < P0).
3. Level of significance: α = 0.05 (or 0.01, 0.02, 0.10).
4. Test Statistic: Under H0,

Two tailed test:

Z = |p̄ − P0| / √(P0(1 − P0)/n) ~ N(0, 1)

One tailed test:

Z = (p̄ − P0) / √(P0(1 − P0)/n) ~ N(0, 1)

5. Comparison and Conclusion.
2.9.1 Testing Using Excel
A1 Null Hypothesis P = P0
A2 Level of Significance (α) 0.05
A3 Number of Items of Interest X
A4 Sample Size n
A5 Intermediate Calculations
A6 Sample Proportion p̂ = X/n =A3/A4
A7 Standard Error √(P0(1 − P0)/n) =SQRT((A1*(1-A1))/A4)
A8 Z test Statistic Z = (p̂ − P0)/√(P0(1 − P0)/n) ~ N(0, 1) =(A6-A1)/A7
A9 Two Tailed Test Alternative Hypothesis H1: P ≠ P0
A10 Lower Critical Value =NORM.S.INV(A2/2)
A11 Upper Critical Value =NORM.S.INV(1-A2/2)
A12 p-Value =2*(1-NORM.S.DIST(ABS(A8), TRUE))
A13 Left Tailed Test Alternative Hypothesis H1: P < P0
A14 Lower Critical Value =NORM.S.INV(A2)
A15 p-Value =NORM.S.DIST(A8, TRUE)
A16 Right Tailed Test Alternative Hypothesis H1: P > P0
A17 Upper Critical Value =NORM.S.INV(1-A2)
A18 p-Value =1-NORM.S.DIST(A8, TRUE)
A19 Conclusion
A20 Reject or Do not reject H0 =IF(A12 < A2, "Reject H0", "Do not reject H0")
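The same single-proportion Z-test can be sketched in Python with the standard library; the counts below are hypothetical, and the `norm_cdf` helper plays the role of NORM.S.DIST.

```python
from math import sqrt, erf

# Hypothetical example: 64 of 100 customers prefer the new pack; test H0: P = 0.5
X, n, P0, alpha = 64, 100, 0.50, 0.05

p_hat = X / n                         # sample proportion (row A6)
se = sqrt(P0 * (1 - P0) / n)          # standard error under H0 (row A7)
z = (p_hat - P0) / se                 # Z statistic (row A8)

def norm_cdf(x):
    """Standard normal CDF via the error function (what NORM.S.DIST computes)."""
    return 0.5 * (1 + erf(x / sqrt(2)))

p_value = 2 * (1 - norm_cdf(abs(z)))  # two-tailed p-value (row A12)
decision = "Reject H0" if p_value < alpha else "Do not reject H0"
print(round(z, 3), round(p_value, 4), decision)
```

With these numbers Z = 2.8 and the two-tailed p-value is about 0.005, so H0: P = 0.5 is rejected at the 5% level.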
2.10 Comparison and conclusion
Two tailed test:
1. Critical value approach:
Find the table value at the chosen level of significance α. Compare this value with
the calculated value.
(a) If |cal| ≤ tab then, do not reject the null hypothesis.
(b) If |cal| > tab then, reject the null hypothesis.
2. p-value approach:
Compute the p-value and compare it with the chosen level of significance α.
(a) If p ≥ α then, do not reject the null hypothesis.
(b) If p < α then, reject the null hypothesis.
One tailed test:
1. Right tailed test:
(a) Critical value approach:
Find the table value at the chosen level of significance α. Compare this value
with the calculated value.
i. If cal ≤ tab then, do not reject the null hypothesis.
ii. If cal > tab then, reject the null hypothesis.
(b) p-value approach:
Compute the p-value and compare it with α.
i. If p ≥ α then, do not reject the null hypothesis.
ii. If p < α then, reject the null hypothesis.
2. Left tailed test
(a) Critical value approach:
Find the table value at the chosen level of significance α. Compare this value
with the calculated value. (Here the table value is the negative lower critical value.)
i. If cal > tab then, do not reject the null hypothesis.
ii. If cal ≤ tab then, reject the null hypothesis.
(b) p-value approach: Compute the p-value and compare it with α.
i. If p ≥ α then, do not reject the null hypothesis.
ii. If p < α then, reject the null hypothesis.
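The p-value decision rule is the same in every case above, so it can be captured in one small helper. This is a sketch; the function name is our own choice.

```python
def p_value_decision(p_value: float, alpha: float = 0.05) -> str:
    """Apply the p-value rule: reject H0 when p < alpha, otherwise do not reject."""
    if not (0.0 <= p_value <= 1.0):
        raise ValueError("p-value must lie in [0, 1]")
    return "Reject H0" if p_value < alpha else "Do not reject H0"

print(p_value_decision(0.012))         # rejected at alpha = 0.05
print(p_value_decision(0.012, 0.01))   # not rejected at the stricter alpha = 0.01
```

Note how the same p-value can lead to different conclusions at different levels of significance.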
Chapter 3
Testing of hypothesis: Two sample problem
3.1 Introduction
In this chapter, we discuss the testing procedures used to test the significance of the
difference between parameters belonging to two independent populations. The several
cases that arise are discussed in the following sections.
3.2 Assumptions
In this section, we list some important assumptions underlying the testing procedures
used in the two sample problem.
1. The variable under study is measured on a ratio or interval scale.
2. The populations follow the normal distribution.
3. The population variances are equal, i.e. σ1² = σ2².
4. The samples are independent.
5. The responses are independent within the samples.
3.3 Test for difference of means: Z-test
1. Null hypothesis: H0: μ1 = μ2 (μ1 ≥ μ2, μ1 ≤ μ2).
2. Alternative hypothesis: H1: μ1 ≠ μ2 (μ1 < μ2, μ1 > μ2).
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test Statistic: Under H0,
Two tailed test:
Z = |X̄1 − X̄2| / S.E.(X̄1 − X̄2) ~ N(0, 1)
One tailed test:
Z = (X̄1 − X̄2) / S.E.(X̄1 − X̄2) ~ N(0, 1)
where S.E.(X̄1 − X̄2) = σ√(1/n1 + 1/n2) when σ1² = σ2² = σ² (known), and
S.E.(X̄1 − X̄2) = √(σ1²/n1 + σ2²/n2) when the variances are unequal (known).
5. Comparison and Conclusion.
3.3.1 Testing Using Excel: σ1² = σ2² = σ² (known)
A1 Null Hypothesis μ1 = μ2
A2 Level of Significance (α) 0.05
A3 Sample Mean 1 X̄1
A4 Sample Mean 2 X̄2
A5 Sample Size 1 n1
A6 Sample Size 2 n2
A7 Population Standard Deviation σ
A8 Intermediate Calculations
A9 S.E. of Difference of Means σ√(1/n1 + 1/n2) =A7*SQRT((1/A5)+(1/A6))
A10 Z test Statistic Z = (X̄1 − X̄2)/(σ√(1/n1 + 1/n2)) ~ N(0, 1) =(A3-A4)/A9
A11 Two Tailed Test Alternative Hypothesis H1: μ1 ≠ μ2
A12 Lower Critical Value =NORM.S.INV(A2/2)
A13 Upper Critical Value =NORM.S.INV(1-A2/2)
A14 p-Value =2*(1-NORM.S.DIST(ABS(A10), TRUE))
A15 Left Tailed Test Alternative Hypothesis H1: μ1 < μ2
A16 Lower Critical Value =NORM.S.INV(A2)
A17 p-Value =NORM.S.DIST(A10, TRUE)
A18 Right Tailed Test Alternative Hypothesis H1: μ1 > μ2
A19 Upper Critical Value =NORM.S.INV(1-A2)
A20 p-Value =1-NORM.S.DIST(A10, TRUE)
A21 Conclusion Reject or Do not reject H0
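The known-σ case above translates directly into a short Python sketch; the sample means, sizes, and σ below are hypothetical numbers chosen for illustration.

```python
from math import sqrt, erf

# Hypothetical data: mean daily sales at two outlets, common sigma known to be 4.0
xbar1, xbar2 = 52.3, 50.1
n1, n2 = 40, 50
sigma, alpha = 4.0, 0.05

se = sigma * sqrt(1 / n1 + 1 / n2)    # S.E. of the difference of means (row A9)
z = (xbar1 - xbar2) / se              # Z statistic (row A10)

def norm_cdf(x):
    """Standard normal CDF (what NORM.S.DIST computes)."""
    return 0.5 * (1 + erf(x / sqrt(2)))

p_value = 2 * (1 - norm_cdf(abs(z)))  # two-tailed p-value (row A14)
decision = "Reject H0" if p_value < alpha else "Do not reject H0"
print(round(z, 3), round(p_value, 4), decision)
```

Here Z ≈ 2.59 with a two-tailed p-value below 0.05, so the difference in means is significant at the 5% level.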
3.3.2 Testing Using Excel: Unequal Variances (Known)
A1 Null Hypothesis μ1 = μ2
A2 Level of Significance (α) 0.05
A3 Sample Mean 1 X̄1
A4 Sample Mean 2 X̄2
A5 Sample Size 1 n1
A6 Sample Size 2 n2
A7 Population Standard Deviation 1 σ1
A8 Population Standard Deviation 2 σ2
A9 Intermediate Calculations
A10 S.E. of Difference of Means √(σ1²/n1 + σ2²/n2) =SQRT((A7^2/A5)+(A8^2/A6))
A11 Z test Statistic Z = (X̄1 − X̄2)/√(σ1²/n1 + σ2²/n2) ~ N(0, 1) =(A3-A4)/A10
A12 Two Tailed Test Alternative Hypothesis H1: μ1 ≠ μ2
A13 Lower Critical Value =NORM.S.INV(A2/2)
A14 Upper Critical Value =NORM.S.INV(1-A2/2)
A15 p-Value =2*(1-NORM.S.DIST(ABS(A11), TRUE))
A16 Left Tailed Test Alternative Hypothesis H1: μ1 < μ2
A17 Lower Critical Value =NORM.S.INV(A2)
A18 p-Value =NORM.S.DIST(A11, TRUE)
A19 Right Tailed Test Alternative Hypothesis H1: μ1 > μ2
A20 Upper Critical Value =NORM.S.INV(1-A2)
A21 p-Value =1-NORM.S.DIST(A11, TRUE)
A22 Conclusion Reject or Do not reject H0
3.4 Test for difference of means: t-test
1. Null hypothesis: H0: μ1 = μ2 (μ1 ≥ μ2, μ1 ≤ μ2).
2. Alternative hypothesis: H1: μ1 ≠ μ2 (μ1 < μ2, μ1 > μ2).
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test Statistic: Under H0,
(a) When the assumption of equality of variances is satisfied (σ1² = σ2² = σ²)
and σ² is unknown.
Two tailed test:
t = |X̄1 − X̄2| / (S√(1/n1 + 1/n2)) ~ t with (n1 + n2 − 2) d.f.
One tailed test:
t = (X̄1 − X̄2) / (S√(1/n1 + 1/n2)) ~ t with (n1 + n2 − 2) d.f.
where S is the pooled standard deviation,
S² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2).
(b) When the assumption of equality of variances is not satisfied (σ1² ≠ σ2²).
Two tailed test:
t = |X̄1 − X̄2| / √(S1²/n1 + S2²/n2) ~ t with (n1 + n2 − 2) d.f.
One tailed test:
t = (X̄1 − X̄2) / √(S1²/n1 + S2²/n2) ~ t with (n1 + n2 − 2) d.f.
where
S1² = Σᵢ(Xᵢ − X̄1)²/(n1 − 1) and S2² = Σᵢ(Yᵢ − X̄2)²/(n2 − 1).
5. Comparison and Conclusion.
3.4.1 Testing Using Excel: σ1² = σ2² = σ² (Unknown)
A1 Null Hypothesis μ1 = μ2
A2 Level of Significance (α) 0.05
A3 Sample Mean 1 X̄1
A4 Sample Mean 2 X̄2
A5 Sample Size 1 n1
A6 Sample Size 2 n2
A7 Sample Standard Deviation 1 S1
A8 Sample Standard Deviation 2 S2
A9 Pooled Estimate S =SQRT(((A5-1)*A7^2+(A6-1)*A8^2)/(A5+A6-2))
A10 Intermediate Calculations
A11 S.E. of Difference of Means S√(1/n1 + 1/n2) =A9*SQRT((1/A5)+(1/A6))
A12 t test Statistic t = (X̄1 − X̄2)/(S√(1/n1 + 1/n2)) ~ t(n1+n2−2) d.f. =(A3-A4)/A11
A13 Two Tailed Test Alternative Hypothesis H1: μ1 ≠ μ2
A14 Lower Critical Value =T.INV(A2/2, A5+A6-2)
A15 Upper Critical Value =T.INV(1-A2/2, A5+A6-2)
A16 p-Value =2*(1-T.DIST(ABS(A12), A5+A6-2, TRUE))
A17 Left Tailed Test Alternative Hypothesis H1: μ1 < μ2
A18 Lower Critical Value =T.INV(A2, A5+A6-2)
A19 p-Value =T.DIST(A12, A5+A6-2, TRUE)
A20 Right Tailed Test Alternative Hypothesis H1: μ1 > μ2
A21 Upper Critical Value =T.INV(1-A2, A5+A6-2)
A22 p-Value =1-T.DIST(A12, A5+A6-2, TRUE)
A23 Conclusion Reject or Do not reject H0
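The pooled two-sample t-test of this section can be sketched in Python with the standard library. The two samples are hypothetical, and the two-tailed critical value t(0.025, 13 d.f.) = 2.160 is read from the t table.

```python
from math import sqrt

# Hypothetical scores from two independent samples
x = [68, 72, 75, 70, 69, 74, 71, 73]
y = [65, 70, 67, 72, 66, 69, 68]

def mean(v):
    return sum(v) / len(v)

def sample_var(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)   # divisor n - 1

n1, n2 = len(x), len(y)
# Pooled variance, assuming sigma1^2 = sigma2^2 (case (a) of section 3.4)
s2_pooled = ((n1 - 1) * sample_var(x) + (n2 - 1) * sample_var(y)) / (n1 + n2 - 2)
se = sqrt(s2_pooled) * sqrt(1 / n1 + 1 / n2)   # S.E. of difference of means (row A11)
t = (mean(x) - mean(y)) / se                   # t statistic with n1 + n2 - 2 = 13 d.f.

t_tab = 2.160    # two-tailed critical value t(0.025, 13 d.f.) from the t table
decision = "Reject H0" if abs(t) > t_tab else "Do not reject H0"
print(round(t, 3), decision)
```

For these data |t| ≈ 2.67 exceeds the table value, so the two means differ significantly at the 5% level.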
3.4.2 Testing Using Excel: Unequal Variances (Unknown)
A1 Null Hypothesis μ1 = μ2
A2 Level of Significance (α) 0.05
A3 Sample Mean 1 X̄1
A4 Sample Mean 2 X̄2
A5 Sample Size 1 n1
A6 Sample Size 2 n2
A7 Sample Standard Deviation 1 S1
A8 Sample Standard Deviation 2 S2
A9 Intermediate Calculations
A10 S.E. of Difference of Means √(S1²/n1 + S2²/n2) =SQRT((A7^2/A5)+(A8^2/A6))
A11 t test Statistic t = (X̄1 − X̄2)/√(S1²/n1 + S2²/n2) ~ t(n1+n2−2) d.f. =(A3-A4)/A10
A12 Two Tailed Test Alternative Hypothesis H1: μ1 ≠ μ2
A13 Lower Critical Value =T.INV(A2/2, A5+A6-2)
A14 Upper Critical Value =T.INV(1-A2/2, A5+A6-2)
A15 p-Value =2*(1-T.DIST(ABS(A11), A5+A6-2, TRUE))
A16 Left Tailed Test Alternative Hypothesis H1: μ1 < μ2
A17 Lower Critical Value =T.INV(A2, A5+A6-2)
A18 p-Value =T.DIST(A11, A5+A6-2, TRUE)
A19 Right Tailed Test Alternative Hypothesis H1: μ1 > μ2
A20 Upper Critical Value =T.INV(1-A2, A5+A6-2)
A21 p-Value =1-T.DIST(A11, A5+A6-2, TRUE)
A22 Conclusion Reject or Do not reject H0
3.5 Test for difference of two proportions
1. Null hypothesis: H0: P1 = P2 (P1 ≥ P2, P1 ≤ P2).
2. Alternative hypothesis: H1: P1 ≠ P2 (P1 < P2, P1 > P2).
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test Statistic: Under H0,
Two tailed test:
Z = |p̂1 − p̂2| / √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2) ~ N(0, 1)
One tailed test:
Z = (p̂1 − p̂2) / √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2) ~ N(0, 1)
Test statistic using a pooled estimate:
Two tailed test:
Z = |p̂1 − p̂2| / √(P̄(1 − P̄)(1/n1 + 1/n2)) ~ N(0, 1)
One tailed test:
Z = (p̂1 − p̂2) / √(P̄(1 − P̄)(1/n1 + 1/n2)) ~ N(0, 1)
where
P̄ = (n1 p̂1 + n2 p̂2) / (n1 + n2).
5. Comparison and Conclusion.
3.5.1 Testing Using Excel: Test for Difference of Proportions
A1 Null Hypothesis P1 = P2
A2 Level of Significance (α) 0.05
A3 Sample Size 1 n1
A4 Sample Size 2 n2
A5 Number of Items of Interest 1 X1
A6 Number of Items of Interest 2 X2
A7 Sample Proportion 1 p̂1 =A5/A3
A8 Sample Proportion 2 p̂2 =A6/A4
A9 Pooled Estimate P̄ =(A3*A7+A4*A8)/(A3+A4)
A10 Intermediate Calculations
A11 S.E. √(P̄(1 − P̄)(1/n1 + 1/n2)) =SQRT(A9*(1-A9)*((1/A3)+(1/A4)))
A12 Z test Statistic =(A7-A8)/A11
A13 Two Tailed Test Alternative Hypothesis H1: P1 ≠ P2
A14 Lower Critical Value =NORM.S.INV(A2/2)
A15 Upper Critical Value =NORM.S.INV(1-A2/2)
A16 p-Value =2*(1-NORM.S.DIST(ABS(A12), TRUE))
A17 Left Tailed Test Alternative Hypothesis H1: P1 < P2
A18 Lower Critical Value =NORM.S.INV(A2)
A19 p-Value =NORM.S.DIST(A12, TRUE)
A20 Right Tailed Test Alternative Hypothesis H1: P1 > P2
A21 Upper Critical Value =NORM.S.INV(1-A2)
A22 p-Value =1-NORM.S.DIST(A12, TRUE)
A23 Conclusion Reject or Do not reject H0
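The pooled two-proportion Z-test can be sketched in Python with the standard library; the response counts below are hypothetical.

```python
from math import sqrt, erf

# Hypothetical data: 45/200 responders in campaign 1, 30/180 in campaign 2
x1, n1 = 45, 200
x2, n2 = 30, 180
alpha = 0.05

p1, p2 = x1 / n1, x2 / n2
p_bar = (x1 + x2) / (n1 + n2)                       # pooled proportion (row A9)
se = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))  # pooled S.E. (row A11)
z = (p1 - p2) / se                                  # Z statistic (row A12)

def norm_cdf(v):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(v / sqrt(2)))

p_value = 2 * (1 - norm_cdf(abs(z)))                # two-tailed p-value
decision = "Reject H0" if p_value < alpha else "Do not reject H0"
print(round(z, 3), round(p_value, 4), decision)
```

Here Z ≈ 1.43 with a two-tailed p-value above 0.05, so the difference in response rates is not significant at the 5% level.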
3.6 Test for dependent samples
In the previous section, we discussed the testing procedure used to test the hypothesis
constructed on the difference between two population means when the samples are
independent. In this section, we discuss a testing procedure for the case when the
samples are dependent. This is the case where the responses are taken from the same
set of individuals before and after an experiment. It is also used when the samples
are matched samples.
Suppose that, in a marketing research study, the researcher wants to know the effect
of his company's product on the customers. He selects a sample of n customers from a
population and records the weight of each customer. He then introduces the same
product with some additions to it, requests the customers to use it, and records their
weights again after a month. Here the variable measured is the weight of the customers,
and the researcher is interested in testing the hypothesis: "Is there any significant
difference between the average weight of the customers before and after the additions
to the product?"
1. Null hypothesis: H0: μ1 = μ2, i.e. μD = μ1 − μ2 = 0.
2. Alternative hypothesis: H1: μD ≠ 0 (μD < 0, μD > 0).
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test Statistic: Under H0,
t = d̄ / (Sd/√n) ~ t with (n − 1) d.f.
where d̄ is the mean of the n paired differences and Sd is their sample standard
deviation.
5. Comparison and Conclusion.
3.6.1 Testing Using Excel
A1 Null Hypothesis μD = μ1 − μ2 = 0
A2 Level of Significance (α) 0.05
A3 Sample Mean of Differences d̄
A4 Sample Size (number of pairs) n
A5 Degrees of Freedom (d.f.) n − 1
A6 Sample Standard Deviation of Differences Sd
A7 Intermediate Calculations
A8 S.E. Sd/√n =A6/SQRT(A4)
A9 t test Statistic t = d̄/(Sd/√n) ~ t(n−1) d.f. =A3/A8
A10 Two Tailed Test Alternative Hypothesis H1: μD ≠ 0
A11 Lower Critical Value =T.INV(A2/2, A5)
A12 Upper Critical Value =T.INV(1-A2/2, A5)
A13 p-Value =2*(1-T.DIST(ABS(A9), A5, TRUE))
A14 Left Tailed Test Alternative Hypothesis H1: μD < 0
A15 Lower Critical Value =T.INV(A2, A5)
A16 p-Value =T.DIST(A9, A5, TRUE)
A17 Right Tailed Test Alternative Hypothesis H1: μD > 0
A18 Upper Critical Value =T.INV(1-A2, A5)
A19 p-Value =1-T.DIST(A9, A5, TRUE)
A20 Conclusion Reject or Do not reject H0
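The paired t-test can be sketched in Python with the standard library. The before/after weights below are made-up numbers, and the two-tailed critical value t(0.025, 7 d.f.) = 2.365 is read from the t table.

```python
from math import sqrt

# Hypothetical weights (kg) of the same 8 customers before and after the change
before = [70.2, 68.5, 74.1, 66.0, 72.3, 69.8, 71.5, 67.4]
after  = [71.0, 69.3, 74.0, 67.2, 73.1, 70.5, 72.4, 68.0]

d = [a - b for a, b in zip(after, before)]   # per-customer differences
n = len(d)
d_bar = sum(d) / n                           # mean difference (row A3)
s_d = sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))   # S.D. of differences
t = d_bar / (s_d / sqrt(n))                  # t statistic with n - 1 = 7 d.f.

t_tab = 2.365    # two-tailed critical value t(0.025, 7 d.f.) from the t table
decision = "Reject H0" if abs(t) > t_tab else "Do not reject H0"
print(round(t, 3), decision)
```

Because each customer serves as his or her own control, the paired test removes the person-to-person variation and is far more sensitive than treating the two columns as independent samples.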
3.7 Test for difference of variances: F-test
This is an important test which is used to test the hypothesis H0: σ1² = σ2².
1. Null hypothesis: H0: σ1² = σ2².
2. Alternative hypothesis: H1: σ1² ≠ σ2² (σ1² < σ2², σ1² > σ2²).
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test Statistic: Under H0,
F = S1²/S2² ~ F with (n1 − 1, n2 − 1) d.f.
where S1² and S2² are the sample variances, with the larger sample variance taken
in the numerator.
5. Comparison and Conclusion.
Two tailed test:
1. Critical value approach:
Find the table value at the chosen level of significance α. Compare this value with
the calculated value.
(a) If cal ≤ tab then, do not reject the null hypothesis.
(b) If cal > tab then, reject the null hypothesis.
2. p-value approach:
Compute the p-value and compare it with the chosen level of significance α.
(a) If p ≥ α then, do not reject the null hypothesis.
(b) If p < α then, reject the null hypothesis.
One tailed test:
1. Right tailed test:
(a) Critical value approach:
Find the table value at the chosen level of significance α. Compare this value
with the calculated value.
i. If cal ≤ tab then, do not reject the null hypothesis.
ii. If cal > tab then, reject the null hypothesis.
(b) p-value approach:
Compute the p-value and compare it with α.
i. If p ≥ α then, do not reject the null hypothesis.
ii. If p < α then, reject the null hypothesis.
2. Left tailed test
(a) Critical value approach:
Find the table value at the chosen level of significance α. Compare this value
with the calculated value.
i. If cal > tab then, do not reject the null hypothesis.
ii. If cal ≤ tab then, reject the null hypothesis.
(b) p-value approach: Compute the p-value and compare it with α.
i. If p ≥ α then, do not reject the null hypothesis.
ii. If p < α then, reject the null hypothesis.
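The F-test of section 3.7 can be sketched in Python with the standard library. The two samples are hypothetical, and the critical value F(0.025; 7, 6) ≈ 5.70 is read from the F table (upper 2.5% point, for a two-tailed test at α = 0.05 with the larger variance in the numerator).

```python
# Hypothetical samples: daily output on two machines; compare their variability
x = [21.3, 22.1, 20.8, 23.0, 21.7, 22.5, 20.9, 22.2]
y = [21.0, 21.4, 21.2, 20.8, 21.1, 21.5, 20.9]

def sample_var(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)   # divisor n - 1

s1_sq, s2_sq = sample_var(x), sample_var(y)
# Larger sample variance in the numerator, as in section 3.7
if s1_sq >= s2_sq:
    F, df1, df2 = s1_sq / s2_sq, len(x) - 1, len(y) - 1
else:
    F, df1, df2 = s2_sq / s1_sq, len(y) - 1, len(x) - 1

F_tab = 5.70    # upper 2.5% point of F(7, 6) from the F table
decision = "Reject H0" if F > F_tab else "Do not reject H0"
print(round(F, 3), df1, df2, decision)
```

For these data F ≈ 9.24 exceeds the table value, so the hypothesis of equal variances is rejected; the pooled t-test of section 3.4(a) would not be appropriate here.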
Chapter 4
Chi-Square tests
4.1 Introduction
The statistical-inference techniques presented so far have dealt exclusively with hy-
pothesis tests and confidence intervals for population parameters, such as population
means and population proportions. In this chapter, we consider three widely used in-
ferential procedures that are not concerned with population parameters. These three
procedures are often called chi-square procedures because they rely on a distribution
called the chi-square distribution.
The distribution is also important in finance, in the discrete hedging and pricing of
options. It is used to construct the confidence interval for the population variance σ².
Note also that this distribution is derived from the normal distribution: the square of
a standard normal variate gives a chi-square random variable with 1 degree of freedom,
and the sum of the squares of n independent standard normal random variables follows
a chi-square distribution with n degrees of freedom.
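This relationship between the normal and chi-square distributions can be checked empirically with a small simulation; the sample size, number of replications, and seed below are arbitrary choices for illustration.

```python
import random

random.seed(42)
n, reps = 5, 100_000

# Sum of squares of n independent standard normals ~ chi-square with n d.f.
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(reps)]

mean_est = sum(samples) / reps
var_est = sum((s - mean_est) ** 2 for s in samples) / reps

# A chi-square variable with n d.f. has mean n and variance 2n
print(round(mean_est, 2), round(var_est, 2))   # close to 5 and 10
```

The simulated mean and variance come out close to n = 5 and 2n = 10, the theoretical moments of the chi-square distribution with 5 degrees of freedom.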
The tests discussed in this chapter have wide applicability