Mathematics and Statistics (Unit IV & v)

Date post: 16-Nov-2014
Upload: denish-gandhi

BHARTHIDASAN UNIVERSITY

UNIT-IV

THEORY OF SAMPLING AND TESTING OF HYPOTHESIS

4.0 OBJECTIVES
4.1 NEED FOR SAMPLING
4.2 ELEMENTS OF SAMPLING PLAN
4.3 TYPES OF SAMPLING
4.3.1 Random or Probability Sampling: Simple Random Sampling, Stratified Random Sampling, Systematic Random Sampling, Cluster Sampling
4.3.2 Non-Random or Non-Probability Sampling: Convenience Sampling, Judgmental Sampling, Quota Sampling
4.4 SAMPLING AND NON-SAMPLING ERRORS
4.4.1 Reasons for sampling errors
4.4.2 Reasons for non-sampling errors
4.5 TESTING OF HYPOTHESIS
4.5.1 Sampling Distribution
4.5.2 Standard Error
4.5.3 Null & Alternative Hypothesis
4.5.4 Errors in testing of hypothesis
4.5.5 Critical Region
4.5.6 Two tailed and One tailed test
4.5.7 Large and Small sample test
4.6 PROCEDURE FOR TESTING OF HYPOTHESIS
4.7 TESTS OF SIGNIFICANCE
4.7.1 Test for single mean
4.7.2 Test for difference of two means
4.7.3 Test for two standard deviations
4.7.4 Test for Single Proportion
4.7.5 Test for difference of two proportions
4.8 Analysis of Variance
4.8.1 Assumptions
4.8.2 One way ANOVA
4.8.3 Applications

MATHEMATICS AND STATISTICS Page 1

4.0 Objectives

Sampling is used in our everyday life, often without our being aware of it. For example, a cook tests a small quantity of rice to see whether it has been well cooked, and a grain merchant does not examine each grain of what he intends to purchase, but inspects only a small quantity of grains. Most of our decisions are based on the examination of a few items only.

In a statistical investigation, the interest usually lies in the assessment of general magnitude and the study of variation with respect to one or more characteristics relating to individuals belonging to a group. This group of individuals or units under study is called the population or universe. Thus in statistics, a population is an aggregate of objects or units under study. The population may be finite or infinite.

Sampling and Sample

Sampling is a method of selecting units for analysis, such as households, consumers or companies, from the respective population under statistical investigation. The theory of sampling is based on the principle of statistical regularity. According to this principle, a moderately large number of items chosen at random from a large group are almost sure, on average, to possess the characteristics of the larger group. The smallest indivisible part of the population is called a unit. A unit should be well defined and unambiguous. For example, if the unit is a household, the definition should ensure that no person belongs to two households and that no person in the population is left out.

A finite subset of a population is called a sample and the number of units in a sample is called its sample size. By analyzing the data collected from the sample one can draw inference about the population under study.

Parameter and Statistic

The statistical constants of a population, such as the mean (μ), variance (σ²), and proportion (P), are termed parameters. Statistical measures such as the mean (x̄), variance (s²), and proportion (p) computed from the sampled observations are known as statistics. Sampling is employed to throw light on the population parameter. A statistic is an estimate, based on sample data, used to draw inference about the population parameter.

4.1 NEED FOR SAMPLING

Suppose that the raw materials department in a company receives items in lots and issues them to the production department as and when required. Before accepting these items, the inspection department inspects or tests them to make sure that they meet the required specifications. It has two options:

(i) it could inspect all items in the lot, or

(ii) it could take a sample, inspect the sample for defectives, and then estimate the total number of defectives for the population as a whole.

The first approach is called complete enumeration (census). It has two major disadvantages namely, the time consumed and the cost involved in it.

The second approach that uses sampling has two major advantages.

(i) It is significantly less expensive.

(ii) It takes the least possible time with the best possible results.

There are situations that involve destructive testing, where sampling is the only answer. A well-designed statistical sampling methodology gives accurate results while reducing both cost and time. Thus sampling is the best available tool for decision makers.

4.2 ELEMENTS OF SAMPLING PLAN

The main steps involved in the planning and execution of sample survey are:

i) Objectives: The first task is to lay down in concrete terms the basic objectives of the survey. Failure to define the objective(s) will clearly undermine the purpose of carrying out the survey itself. For example, if a nationalized bank wants to study savings bank account holders' perception of the service quality rendered over a period of one year, the objective of the sampling is to analyze the perception of the account holders in the bank.

ii) Population to be covered: Based on the objectives of the survey, the population should be well defined. The characteristics concerning the population under study should also be clearly defined. For example, to analyze the perception of the savings bank account holders about the service rendered by the bank, all the account holders in the bank constitute the population to be investigated.

iii) Sampling frame: In order to cover the population decided upon, there should be some list, map or other acceptable material (called the frame) which serves as a guide to the population to be covered. The list or map must be examined to be sure that it is reasonably free from defects. The sampling frame will help us in the selection of the sample. All the account numbers of the savings bank account holders in the bank are the sampling frame in the analysis of the customers' perception of the service rendered by the bank.

iv) Sampling Unit: For the purpose of sample selection, the population should be capable of being divided up into sampling units. The division of the population into sampling units should be unambiguous. Every element of the population should belong to just one sampling unit. Each savings bank account holder forms a sampling unit, since all the savings bank account holders in the bank constitute the population.

v) Sample Selection: The size of the sample and the manner of selecting it should be defined based on the objectives of the statistical investigation. The estimation of the population parameters, along with their margin of uncertainty, is one of the important aspects to be considered in sample selection.

vi) Collection of data: The method of collecting the information has to be decided, keeping in view the costs involved and the accuracy aimed at. Physical observation, interviewing respondents and collecting data through mail are some of the methods that can be followed in the collection of data.

vii) Analysis of data: The collected data should be properly classified and subjected to an appropriate analysis. The conclusions are drawn based on the results of the analysis.

4.3 Types of Sampling

The technique of selecting a sample from a population usually depends on the nature of the data and the type of enquiry. The procedure of sampling may be broadly classified under the following heads:

1) Probability sampling or random sampling and

2) Non-probability sampling or non-random sampling.

4.3.1 Probability sampling

Probability sampling is a method of sampling that ensures that every unit in the population has a known non-zero chance of being included in the sample.

The different methods of random sampling are:

(a) Simple Random Sampling (SRS)

Simple random sampling is the foundation of probability sampling. It is a special case of probability sampling in which every unit in the population has an equal chance of being included in a sample. Simple random sampling also makes the selection of every possible combination of the desired number of units equally likely. Sampling may be done with or without replacement. When the sampling is with replacement, each unit drawn is returned to the population before the next selection is made, so the population size remains constant. If one wants to select n units from a population of size N without replacement, then every possible selection of n units must have the same probability. There are $^{N}C_{n}$ possible ways to pick n units from a population of size N.

[Figure: Types of sampling. Probability (random) sampling: simple random sampling, stratified random sampling, systematic sampling, cluster sampling. Non-probability (non-random) sampling: convenience sampling, judgement sampling, quota sampling.]

Simple random sampling guarantees that every sample of n units has the same probability, $1/\,^{N}C_{n}$, of being selected.

Example

A bank wants to study the savings bank account holders' perception of the service quality rendered over a period of one year. The bank has to prepare a complete list of its savings bank account holders, called the sampling frame, containing, say, 500 accounts. The process then involves selecting a sample of 50 out of the 500 and interviewing them. This could be achieved in many ways. Two common ways are:

(1) Lottery method: Select 50 slips from a box containing well shuffled 500 slips of account numbers without replacement. This method can be applied when the population is small enough to handle.

(2) Random numbers method: When the population size is very large, the most practical and inexpensive method of selecting a simple random sample is by using the random number tables.
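The selection of 50 accounts out of 500 can be sketched in Python as follows. This is only an illustration: a pseudo-random generator stands in for the lottery or the random number table, and the account numbers 1 to 500 and the function name `simple_random_sample` are assumptions made for the example.

```python
import math
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units from the sampling frame without replacement (SRS)."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical frame: account numbers 1..500 stand in for the bank's list.
frame = list(range(1, 501))
sample = simple_random_sample(frame, 50, seed=1)

# Number of distinct samples of size 50 that can be drawn from 500 units.
# Under SRS each of them has probability 1 / num_possible_samples.
num_possible_samples = math.comb(len(frame), 50)
```

Fixing the seed only makes the illustration repeatable; in a real survey the draw would not be seeded.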

(b) Stratified Random Sampling

Stratified sampling is a two-step process in which the population is partitioned into sub-populations, or strata. The strata should be mutually exclusive and collectively exhaustive in that every population element should be assigned to one and only one stratum and no population elements should be omitted. Next, elements are selected from each stratum by a random procedure, usually SRS. Technically, only SRS should be employed in selecting the elements from each stratum. In practice, sometimes systematic sampling and other probability sampling procedures are employed. Stratified sampling differs from quota sampling in that the sample elements are selected probabilistically rather than based on convenience or judgment. A major objective of stratified sampling is to increase precision without increasing cost.

The variables used to partition the population into strata are referred to as stratification variables. The criteria for the selection of these variables consist of homogeneity, heterogeneity, relatedness, and cost. The elements within a stratum should be as homogeneous as possible, but the elements in different strata should be as heterogeneous as possible. The stratification variables should also be closely related to the characteristic of interest. The more closely these criteria are met, the greater the effectiveness in controlling extraneous sampling variation. Finally, the variables should decrease the cost of the stratification process by being easy to measure and apply.
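The two-step process can be sketched in Python as below. The sketch assumes proportional allocation (each stratum contributes in proportion to its size), which the text does not prescribe; the strata labels and sizes are hypothetical.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Draw an SRS from every stratum, sized by proportional allocation.

    strata maps a stratum label to the list of units in that stratum.
    Rounding can make the total differ slightly from n.
    """
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for label, units in strata.items():
        k = round(n * len(units) / total)   # this stratum's share of the sample
        sample[label] = rng.sample(units, k)
    return sample

# Hypothetical strata: 500 account holders split by branch type.
strata = {
    "urban": list(range(0, 300)),
    "semi-urban": list(range(300, 450)),
    "rural": list(range(450, 500)),
}
chosen = stratified_sample(strata, 50, seed=7)
sizes = {label: len(units) for label, units in chosen.items()}
```

Because every stratum is sampled, no subgroup can be missed by chance, which is the source of the gain in precision described above.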

(c) Systematic Random Sampling

In systematic random sampling, the sample is chosen by selecting a random starting point and then picking every ith element in succession from the sampling frame. The sampling interval, i, is determined by dividing the population size N by the sample size n and rounding to the nearest integer. For example, suppose there are 100,000 elements in the population and a sample of 1,000 is desired. In this case, the sampling interval, i, is 100. A random number between 1 and 100 is selected. If, for example, this number is 23, the sample consists of elements 23, 123, 223, 323, 423, 523, and so on.

Systematic sampling is similar to SRS in that each population element has a known and equal probability of selection. However, it differs from SRS in that only the permissible samples of size n that can be drawn have a known and equal probability of selection. The remaining samples of size n have zero probability of being selected.

For systematic sampling, we assume that the population elements are ordered in some respect. In some cases, the ordering is unrelated to the characteristic of interest. Systematic sampling is a convenient way of selecting a sample. It requires less time and cost when compared to simple random sampling.
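The worked figures above (N = 100,000, n = 1,000, random start 23) can be reproduced with a short Python sketch. The function name and its arguments are illustrative, and positions are counted from 1 as in the text.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Return the 1-based positions chosen by systematic sampling."""
    i = round(N / n)                                  # sampling interval
    if start is None:
        start = random.Random(seed).randint(1, i)     # random start in 1..i
    return list(range(start, N + 1, i))[:n]

# The example from the text: interval 100 and random start 23.
positions = systematic_sample(100_000, 1_000, start=23)
first_six = positions[:6]
```

Only the starting point is random; once it is fixed, the whole sample is determined, which is why only i distinct samples are possible.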

(d) Cluster Random Sampling

In cluster sampling, the target population is first divided into mutually exclusive and collectively exhaustive subpopulations, or clusters. Then a random sample of clusters is selected, based on a probability sampling technique such as SRS. For each selected cluster, either all the elements are included in the sample or a sample of elements is drawn probabilistically. If all the elements in each selected cluster are included in the sample, the procedure is called one-stage cluster sampling. If a sample of elements is drawn probabilistically from each selected cluster, the procedure is two-stage cluster sampling. Furthermore, a cluster sample can have multiple (more than two) stages, as in multistage cluster sampling.

The distinction between cluster sampling and stratified sampling is that in cluster sampling only a sample of the subpopulations (clusters) is chosen, whereas in stratified sampling all the subpopulations (strata) are selected for further sampling.
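One-stage and two-stage cluster sampling can be sketched in Python as follows. The branch names and account labels are hypothetical, and SRS is assumed at both stages.

```python
import random

def cluster_sample(clusters, num_clusters, per_cluster=None, seed=None):
    """One-stage cluster sampling if per_cluster is None, otherwise two-stage."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), num_clusters)    # stage 1: SRS of clusters
    sample = {}
    for name in picked:
        units = clusters[name]
        if per_cluster is None:
            sample[name] = list(units)                     # keep every element
        else:
            sample[name] = rng.sample(units, per_cluster)  # stage 2: SRS within cluster
    return sample

# Hypothetical clusters: 10 branches with 20 account holders each.
clusters = {f"branch_{b}": [f"b{b}_acct{a}" for a in range(20)] for b in range(10)}
one_stage = cluster_sample(clusters, 3, seed=5)
two_stage = cluster_sample(clusters, 3, per_cluster=5, seed=5)
```

Unlike the stratified case, most clusters contribute nothing to the sample, so the saving in field cost comes at the price of larger sampling variation when clusters differ from one another.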

4.3.2 Non-Probability Sampling

The fundamental difference between probability sampling and non-probability sampling is that in non-probability sampling procedure, the selection of the sample units does not ensure a known chance to the units being selected. In other words the units are selected without using the principle of probability. Even though the non-probability sampling has advantages such as reduced cost, speed and convenience in implementation, it lacks accuracy in view of the selection bias. Non-probability sampling is suitable for pilot studies and exploratory research.

The methods of non-random sampling are:

(a) Convenience sampling:

Convenience sampling attempts to obtain a sample of convenient elements. The selection of sampling units is left primarily to the interviewer. Often, respondents are selected because they happen to be in the right place at the right time. Examples of convenience sampling include

(1) Use of students, church groups, and members of social organizations,

(2) mall-intercept interviews without qualifying the respondents,

(3) Department stores using charge account lists.

Convenience sampling is the least expensive and least time consuming of all sampling techniques. The sampling units are accessible, easy to measure, and cooperative.

(b) Judgmental sampling:

Judgmental sampling is a form of convenience sampling in which the population elements are selected based on the judgment of the researcher. The researcher, exercising judgment or expertise, chooses the elements to be included in the sample because he or she believes that they are representative of the population of interest or are otherwise appropriate. Common examples of judgmental sampling include

(1) Test markets selected to determine the potential of a new product,

(2) Purchase engineers selected in industrial marketing research because they are considered to be representative of the company,

(3) Expert witnesses used in court.

(c) Quota sampling:

This is a restricted type of judgment sampling. This consists in specifying quotas of the samples to be drawn from different groups and then drawing the required samples from these groups by judgmental sampling. Quota sampling is widely used in opinion and market research surveys.

4.4 SAMPLING AND NON SAMPLING ERRORS:

A sample is a part of the whole population. A sample drawn from the population depends on chance, and as such not all the characteristics of the population may be present in the sample. Any statistical measure of the sample, say the mean, may not be equal to the corresponding statistical measure of the population from which the sample has been drawn. Thus there can be discrepancies between the statistical measures of the population, i.e., parameters, and the statistical measures of a sample drawn from the same population, i.e., statistics. These discrepancies are known as errors in sampling. Errors in sampling are of two types:

(i) Sampling Errors

(ii) Non-sampling Errors

4.4.1 Sampling Errors

Sampling error is inherent in the method of sampling. Sampling depends on chance, and because of this element of chance, sampling errors occur. Errors in sampling arise primarily due to the following reasons:

1. Faulty selection of the sample. This may be due to selection of defective sampling techniques which may introduce the element of bias, e.g., purposive or judgmental sampling, in which investigator deliberately selects a non-representative sample.

2. Substitution. Sometimes an investigator, while collecting information from a particular sampling unit included in the random selection, substitutes a convenient member of the population. This may lead to some bias, as the characteristics possessed by the substituted unit may differ from those possessed by the unit originally included in the sample.

3. Faulty demarcation of sampling units

4. Variability of the population. Sampling error may also depend on the variability or heterogeneity of the population from which the samples are to be drawn.

4.4.2 Non-Sampling Errors

Non-sampling errors or Bias automatically creep in due to human factors which always vary from one investigator to another. Bias may arise in the following different ways.

(i) Due to negligence and carelessness on the part of the investigator

(ii) Due to faulty planning of sampling

(iii) Due to the faulty selection of sample units

(iv) Due to incomplete investigation and sample survey

(v) Due to framing of a wrong questionnaire

(vi) Due to negligence and non-response on the part of the respondents

(vii) Due to substitution of a selected unit by another

(viii) Due to error in compilation

(ix) Due to applying wrong statistical measure.

4.5 TESTING OF HYPOTHESIS

The testing of hypothesis is a procedure that helps us to ascertain the likelihood of hypothesized population parameter being correct by making use of the sample statistic. In testing of hypothesis a statistic is computed from a sample drawn from the parent population and on the basis of this statistic, it is observed whether the sample so drawn has come from the population with certain specified characteristic.

4.5.1 Sampling Distribution

Consider all possible samples of size ‘n’ which can be drawn from a given population. For each sample we can compute a statistic such as mean, standard deviation, etc. which will vary from sample to sample. The aggregate of various values of the statistic under consideration may be grouped into a frequency distribution. This distribution is known as sampling distribution of the statistic. Thus the probability distribution of all the possible values that a sample statistic can take is called the sampling distribution of the statistic.

The sample mean and the sample proportion based on a random sample are examples of sample statistics.

Sampling distribution of the Mean from normal population

If $x_1, x_2, x_3, \dots, x_n$ are n independent random observations drawn from a normal population with mean μ and standard deviation σ, then the sampling distribution of $\bar{x}$ (the sample mean) follows a normal distribution with mean μ and standard deviation $\sigma/\sqrt{n}$.

It may be noted that

(i) the sample mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n}$. Thus $\bar{x}$ is a random variable and will be different every time a new sample of n observations is taken.

(ii) $\bar{x}$ is an unbiased estimator of the population mean μ, i.e. $E(\bar{x}) = \mu$, denoted by $\mu_{\bar{x}} = \mu$.

(iii) The standard deviation of the sample mean $\bar{x}$ is given by $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.
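Properties (ii) and (iii) can be checked empirically with a small simulation. The sketch below uses illustrative values (μ = 50, σ = 10, n = 25) that are not taken from the text, and a seeded pseudo-random generator so that the run is repeatable.

```python
import math
import random
import statistics

def simulate_sample_means(mu, sigma, n, repetitions, seed=0):
    """Draw `repetitions` samples of size n from N(mu, sigma) and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(repetitions)
    ]

mu, sigma, n = 50.0, 10.0, 25          # illustrative values, not from the text
means = simulate_sample_means(mu, sigma, n, repetitions=20_000)

observed_mean = statistics.fmean(means)     # should be close to mu
observed_sd = statistics.stdev(means)       # should be close to sigma / sqrt(n)
theoretical_sd = sigma / math.sqrt(n)       # 10 / 5 = 2.0
```

With 20,000 repetitions the observed mean and standard deviation of the sample means should land near 50 and 2 respectively, in line with the two properties.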

Sampling distribution of proportions

Suppose that a population is infinite and that the probability of occurrence of an event, say success, is P. Let Q = 1 − P denote the probability of failure.

Consider all possible samples of size n drawn from this population. For each sample, determine the proportion p of successes. Applying the central limit theorem, if the sample size n is large, the distribution of the sample proportion p follows a normal distribution with mean $\mu_p = P$ and standard deviation $\sigma_p = \sqrt{\frac{PQ}{n}}$.

4.5.2 Standard Error

The standard deviation of the sampling distribution of a statistic is called the standard error of the statistic. The standard deviation of the distribution of the sample mean is called the standard error of the mean. Likewise, the standard deviation of the distribution of the sample proportion is called the standard error of the proportion.

The standard error is popularly known as sampling error. It throws light on the precision and accuracy of the estimate. The standard error is inversely proportional to the square root of the sample size, i.e. the larger the sample size, the smaller the standard error.

The standard error measures the dispersion of all possible values of the statistic in repeated samples of a fixed size from a given population. It is used to set up confidence limits for population parameters in tests of significance. Thus the standard errors of the sample mean x̄ and the sample proportion p are used to find confidence limits for the population mean μ and the population proportion P respectively.

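Statistic | Standard Error | Remark
Sample mean x̄ | $\frac{\sigma}{\sqrt{n}}$ | Population is infinite, or sampling is with replacement
Sample mean x̄ | $\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$ | Population of finite size N and sampling without replacement
Sample proportion p | $\sqrt{\frac{PQ}{n}}$ | Population is infinite, or sampling is with replacement
Sample proportion p | $\sqrt{\frac{PQ}{n}}\sqrt{\frac{N-n}{N-1}}$ | Population of finite size N and sampling without replacement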
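The four standard error formulas can be sketched as two small Python functions. The function names and the numbers in the usage lines are illustrative only; passing the population size N switches on the finite population correction.

```python
import math

def se_mean(sigma, n, N=None):
    """Standard error of the sample mean; pass N to apply the finite population correction."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

def se_proportion(P, n, N=None):
    """Standard error of the sample proportion, with Q = 1 - P."""
    se = math.sqrt(P * (1 - P) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Illustrative numbers only (not from the text).
a = se_mean(sigma=12, n=36)             # 12 / 6
b = se_mean(sigma=12, n=36, N=1000)     # reduced by the correction factor
c = se_proportion(P=0.5, n=100)         # sqrt(0.25 / 100)
```

The correction factor is always below 1 for n > 1, so sampling without replacement from a finite population gives a smaller standard error than the infinite-population formula.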

4.5.3 Null and Alternative Hypothesis

Null Hypothesis: The statistical hypothesis that is set up for testing is known as the null hypothesis. It is set up only to decide whether to accept or reject it. It asserts that there is no difference between the sample statistic and the population parameter, and that whatever difference is observed is attributable to sampling errors. The null hypothesis is usually denoted by H0.

Alternative Hypothesis: The negation of Null Hypothesis is called the Alternative hypothesis. In other words, any hypothesis which is not a null hypothesis is called alternative hypothesis. It is always denoted by H1 or Ha. It is set in such a way that rejection of null hypothesis implies the acceptance of alternative hypothesis.

4.5.4 Error in testing of hypothesis

For testing the hypothesis, we take a sample from the population, and on the basis of the sample result obtained, we decide whether to accept or reject the hypothesis. Here, two types of error are possible. A null hypothesis could be rejected when it is true. This is called a Type I error, and the probability of committing a Type I error is denoted by α. Alternatively, a null hypothesis could be accepted when it is false. This is known as a Type II error, and the probability of committing a Type II error is denoted by β.

This is illustrated in the following table:

True Situation | Statistical decision of the test: H0 is true (accept H0) | Statistical decision of the test: H0 is false (reject H0)
H0 is True | Correct Decision | Type I Error
H0 is False | Type II Error | Correct Decision

4.5.5 Critical Region

A region in the sample space which amounts to rejection of the null hypothesis (H0) is called the critical region. After formulating the null and alternative hypotheses about a population parameter, we take a sample from the population, calculate the value of the relevant statistic, and compare it with the hypothesized population parameter. We then have to decide the criteria for accepting or rejecting the null hypothesis. These criteria are given as a range of values in the form of an interval, say (a, b), so that if the statistic value falls outside the range, we reject the null hypothesis. If the statistic value falls within the interval (a, b), then we accept H0. This criterion has to be decided on the basis of the level of significance. A 5% level of significance means that, when H0 is true, 5% of the statistic values obtained from samples will fall outside the range (a, b) and 95% of the values will fall within it. Thus the level of significance is the probability of a Type I error. The levels of significance usually employed in testing of hypothesis are 5% and 1%. Choosing a higher significance level implies a higher probability of rejecting a null hypothesis that is true.

Table of critical values |Zα| of Z

Level of significance | 1% | 5% | 10%
Two tailed test | 2.58 | 1.96 | 1.645
One tailed test | 2.33 | 1.645 | 1.28
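The tabulated critical values can be recomputed from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `critical_value` is an illustrative choice.

```python
from statistics import NormalDist

def critical_value(alpha, two_tailed=True):
    """Critical value of the standard normal Z for significance level alpha."""
    # A two tailed test splits alpha equally between the two tails.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

for alpha in (0.01, 0.05, 0.10):
    print(f"{alpha:.0%}  two-tailed {critical_value(alpha):.3f}  "
          f"one-tailed {critical_value(alpha, two_tailed=False):.3f}")
```

The computed values agree with the table once rounded to the same number of decimal places.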

4.5.6 Two tailed and one tailed test:

The probability curve of the sampling distribution of the test statistic is a normal curve. In any test, the critical region is represented by a portion of the area under this normal curve. This curve has two sides (or ends) known as its two tails. The rejection region may be represented by a portion of the area on each of the two sides or on only one side of the normal curve, and correspondingly the test is known as a two tailed test (or two sided test) or a one tailed test (or one sided test).

When the test of hypothesis is made on the basis of rejection region represented by both sides of the standard normal curve, it is called a two tailed test or two sided test.

When the test of hypothesis is made on the basis of a rejection region represented by only one side of the standard normal curve, it is called a one tailed test or one sided test.

4.5.7 Large and small sample test

Tests of significance are classified as (a) tests of significance for large samples and (b) tests of significance for small samples. For large sample sizes (n > 30), distributions such as the Binomial and Poisson are approximated by the normal distribution. Thus the normal probability curve can be used for testing of hypothesis.

4.6 PROCEDURE FOR TESTING OF HYPOTHESIS:

The steps for testing a hypothesis are given below (for both large sample and small sample tests).

Step 1: Null hypothesis: Set up null hypothesis H0

Step 2: Alternative Hypothesis: Set up the alternative hypothesis H1, which is complementary to H0 and which indicates whether a one tailed (right or left tailed) or a two tailed test is to be applied.

Step 3: Level of significance: Choose an appropriate level of significance (α).

Step 4: Test Statistic (or test of criterion):

Calculate the value of the test statistic $Z = \frac{t - E(t)}{S.E.(t)}$ under the null hypothesis, where t is the sample statistic.

Step 5: Critical Value: Find the critical value Zα of Z at the chosen level of significance from the table "Areas under the normal curve" in the case of large samples, or from the t-table, F-table or Chi-square table in the case of small samples.

Step 6: Inference: Compare the computed value of Z (in absolute value) with the critical value Zα/2 for a two tailed test (or Zα for a one tailed test). If |Z| > Zα, we reject the null hypothesis H0 at the α% level of significance, and if |Z| < Zα, we accept H0 at the α% level of significance.
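Steps 4 to 6 can be sketched for the large sample case in Python as follows. The function name `z_test` and the numbers in the usage line are illustrative assumptions. As in Step 6, the decision compares |Z| with the critical value, so for a one tailed test this sketch does not check the direction of the difference.

```python
from statistics import NormalDist

def z_test(t, expected, std_error, alpha=0.05, two_tailed=True):
    """Compute Z, look up the critical value, and state the decision."""
    z = (t - expected) / std_error                          # Step 4: test statistic
    tail_area = alpha / 2 if two_tailed else alpha
    z_crit = NormalDist().inv_cdf(1 - tail_area)            # Step 5: critical value
    decision = "reject H0" if abs(z) > z_crit else "accept H0"   # Step 6: inference
    return z, z_crit, decision

# Illustrative numbers only: statistic 52, hypothesised value 50, S.E. 0.8.
z, z_crit, decision = z_test(52, 50, 0.8)
```

Computing the critical value from the normal distribution replaces the table lookup; for small samples the t, F or Chi-square distribution would be needed instead.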

4.7. LARGE SAMPLE TESTS:

4.7.1 Test for single mean:

Step 1: Setting up of a Null hypothesis: There is no significant difference between the sample mean and the population mean, i.e. the sample has been drawn from the parent population. H0: x̄ = μ

Step 2: Setting up of an Alternative hypothesis: There is a significant difference between the sample mean and the population mean. H1: x̄ ≠ μ

Step 3: Fixing of level of significance: α (normally it is 5%)

Step 4: Computation of test Statistic:

$Z_{cal} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$

Step 5: Critical Value: Find the critical value Zα at the α% level of significance from the table "Areas under the normal curve".

Step 6: Inference: If the modulus of the calculated value satisfies |Zcal| ≤ Zα, where Zα is the critical value obtained in Step 5, we accept the null hypothesis at the α% level of significance. Otherwise we reject the null hypothesis at the α% level of significance.

Now, we discuss the above with an example.

Example 4.1 A Sample of 400 male students of a college is found to have a mean height of 171.38cm. Can it be regarded as a sample from a large population with mean height 171.17cm and standard deviation 4.40cm?

Solution:

Given n = 400 (Large Sample)

μ = 171.17 cm; x̄ = 171.38 cm; σ = 4.40 cm

Null Hypothesis (H0): The sample has been drawn from a large population with mean height of 171.17 cm, i.e., H0: μ = 171.17 cm

Alternative Hypothesis (H1): The sample has not been drawn from a large population with mean height of 171.17 cm, i.e., H1: μ ≠ 171.17 cm.

Level of significance (α): 5%

Test Statistic:

$Z_{cal} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{171.38 - 171.17}{4.40/\sqrt{400}} = \frac{0.21}{0.22} = 0.9545$

Critical value: At 5% level, Z0.05 = 1.96

Inference: Since the calculated value of Z is less than the critical value of Z at 5% level, we accept the null hypothesis and conclude that the sample has been drawn from a large population with mean height of 171.17 cm.
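The working rule for a single mean can be checked with a short Python sketch. The helper name `z_single_mean` is hypothetical (it does not come from the notes), and the critical value 1.96 is simply the 5% two-tailed table value quoted in the text.

```python
from math import sqrt


def z_single_mean(sample_mean: float, pop_mean: float, sigma: float, n: int) -> float:
    """Z statistic for a single mean: (sample mean - population mean) / (sigma / sqrt(n))."""
    return (sample_mean - pop_mean) / (sigma / sqrt(n))


# Figures from Example 4.1: n = 400, sample mean 171.38, mu = 171.17, sigma = 4.40
z_cal = z_single_mean(171.38, 171.17, 4.40, 400)
Z_CRITICAL_5_PERCENT = 1.96  # two-tailed table value used in the text

print(f"|Zcal| = {abs(z_cal):.4f}")
print("accept H0" if abs(z_cal) <= Z_CRITICAL_5_PERCENT else "reject H0")
```

The same function reproduces Example 4.2 when called as `z_single_mean(1570, 1600, 120, 100)`, which gives -2.5.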

Example 4.2 The mean lifetime of 100 fluorescent light bulbs produced by a company is computed to be 1570 hours with standard deviation of 120 hours. If μ is the mean lifetime of all the bulbs produced by the company, test the hypothesis μ = 1600 hours against the alternative hypothesis μ ≠ 1600 hours using a 5% level of significance.

Solution:

We are given

x̄ = 1570 hrs, μ = 1600 hrs, σ = s = 120 hrs, n = 100

Null Hypothesis (H0): μ = 1600, i.e., there is no significant difference between the sample mean and the population mean.

Alternative Hypothesis (H1): μ ≠ 1600 (two-tailed alternative), i.e., there is a significant difference between the sample mean and the population mean.

Level of Significance (α): 5%

Test Statistic:

$$Z_{cal} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{1570 - 1600}{120 / \sqrt{100}} = \frac{-30}{12} = -2.5$$

|Zcal| = 2.5

Critical value: At 5% level, Z0.05 = 1.96

Inference: Since the calculated value |Zcal| is greater than the critical value of Z at 5% level, we reject the null hypothesis and conclude that there is a significant difference between the sample mean and the population mean.

Self – Assessment Question

1. A random sample of 900 members has a mean 3.4 cm and S.D 2.61 cm. Is the sample from a large population of mean 3.25 cm and S.D 2.61 cm?

[Hint: n = 900, x̄ = 3.4, μ = 3.25, σ = 2.61, |Zcal| = 1.724; H0 is accepted at 5% level]

2. A random sample of size 400 is drawn and the sample mean is found to be 99. Test whether the sample could have come from a normal population with mean 100 and standard deviation 8 at 5% level.

[Hint: n = 400, x̄ = 99, μ = 100, σ = 8, |Zcal| = 2.5; H0 is rejected at 5% level]

4.7.2 Test for difference of two means:

Working Rule:

Step 1: Setting up of a Null Hypothesis: The two samples have been drawn from populations having the same mean and equal standard deviation. H0: μ1 = μ2

Step 2: Setting up of an Alternative Hypothesis: The two samples have been drawn from populations with different means. H1: μ1 ≠ μ2 (two-tailed test), or H1: μ1 < μ2 (one-tailed test), or H1: μ1 > μ2 (one-tailed test).

Step 3: Fixing the level of Significance: α (normally it is 5%)


Step 4: Computation of test Statistic:

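$$Z_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$ if the population s.d.'s are known;

$$Z_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$ if the population s.d.'s are not known.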

Step 5: Critical Value: Find the critical value Zα at the α% level of significance from the table "Areas under the normal curve".

Step 6: Inference: If |Zcal| ≤ Zα, where Zα is the critical value obtained in step 5, we accept the null hypothesis at the α% level of significance. Otherwise we reject the null hypothesis at the α% level of significance.

Example 4.3: A college conducts both day and evening classes intended to be identical. A sample of 100 day students yields examination results x̄1 = 72.4 and σ1 = 14.8. A sample of 200 evening students yields examination results x̄2 = 73.9 and σ2 = 17.9. Are the two means statistically equal at 5% level?

Solution:

We are given

n1 = 100, x̄1 = 72.4 and σ1 = 14.8; n2 = 200, x̄2 = 73.9 and σ2 = 17.9

Null Hypothesis (H0): μ1 = μ2, i.e., the two means are statistically equal.

Alternative Hypothesis (H1): μ1 ≠μ2 (Two tailed test) i.e., the two means are not statistically equal.

Level of Significance (α): 5%

Test Statistics


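$$Z_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{72.4 - 73.9}{\sqrt{\dfrac{(14.8)^2}{100} + \dfrac{(17.9)^2}{200}}} = \frac{-1.5}{\sqrt{3.7925}} = \frac{-1.5}{1.947} = -0.77$$

|Zcal| = 0.77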

Critical value: At 5% level, Z0.05 = 1.96

Inference: Since the calculated value of Zcal is less than the critical value of Z at 5% level, hence we accept the null hypothesis and conclude that, the two means are statistically equal.
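A small Python sketch of the two-sample statistic is given below as a cross-check. The function name `z_two_means` is a hypothetical helper, and the sample mean 72.4 for the day students is the value used in the worked computation above.

```python
from math import sqrt


def z_two_means(mean1: float, mean2: float, sd1: float, sd2: float,
                n1: int, n2: int) -> float:
    """Z statistic for the difference of two means from large samples."""
    # Standard error of the difference: sqrt(sd1^2 / n1 + sd2^2 / n2)
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Figures used in the computation of Example 4.3
z_cal = z_two_means(72.4, 73.9, 14.8, 17.9, 100, 200)
print(f"Zcal = {z_cal:.2f}")
print("accept H0" if abs(z_cal) <= 1.96 else "reject H0")
```

The same formula applies whether the standard deviations are the known population values or the sample estimates s1 and s2.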

Example 4.4 A random sample of 1000 workers from South India shows that their mean wages are Rs. 47 per week with a standard deviation of Rs. 28. A random sample of 1500 workers from North India gives a mean wage of Rs. 49 per week with a standard deviation of Rs. 40. Is there any significant difference between their mean levels of wages?

Solution

We are given n1 = 1000, x̄1 = 47 and s1 = 28; n2 = 1500, x̄2 = 49 and s2 = 40

Null Hypothesis (H0): H0:μ1=μ2 i.e., there is no significant difference between their mean level of wages

Alternative Hypothesis (H1): H1: μ1≠μ2 (Two tailed test) i.e., there is a significant difference between their mean level of wages.

Level of significance (α): 5%

Test Statistics

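$$Z_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{47 - 49}{\sqrt{\dfrac{(28)^2}{1000} + \dfrac{(40)^2}{1500}}} = \frac{-2}{\sqrt{1.8507}} = \frac{-2}{1.3604} = -1.47$$

|Zcal| = 1.47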

Critical value: At 5% level, Z0.05 = 1.96

Inference: Since the calculated value of |Zcal| is less than the critical value of Z at 5% level, we accept the null hypothesis and conclude that there is no significant difference between the mean levels of wages.

Self-Assessment Question


1. In a survey of buying habits, 400 women shoppers are chosen at random in super market 'A' located in a certain section of the city. Their average weekly food expenditure is Rs. 250 with a standard deviation of Rs. 40. For 400 women shoppers chosen at random in super market 'B' in another section of the city, the average weekly food expenditure is Rs. 220 with a standard deviation of Rs. 55. Test at 5% level of significance whether the average weekly food expenditures of the two populations of shoppers are equal.

[Hint: n1 = 400, x̄1 = 250, s1 = 40, n2 = 400, x̄2 = 220 and s2 = 55, |Zcal| = 8.82]

2. Random samples drawn from two countries gave the following data relating to the heights of adult males:

                          Country A    Country B
Mean height in inches     67.42        67.25
Standard deviation        2.58         2.50
Number in sample          1000         1200

Is the difference between the means significant?

[Hint: n1 = 1000, x̄1 = 67.42, s1 = 2.58, n2 = 1200, x̄2 = 67.25 and s2 = 2.50, |Zcal| = 1.56]

4.7.3 Test for single proportion

Working Rule:

Step 1: Setting up of Null Hypothesis: The sample has been drawn from a population with proportion P0, i.e., H0: P = P0.

Step 2: Setting up of Alternative Hypothesis: The sample has not been drawn from a population with proportion P0, i.e., H1: P ≠ P0.

Step 3: Fixing of level of significance: α (normally it is 5%)

Step 4: Computation of test statistic:

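$$Z_{cal} = \frac{p - P}{\sqrt{PQ/n}}$$

where p is the sample proportion, P is the population proportion under H0 and Q = 1 − P.

Step 5: Critical Value: Find the critical value Zα at the α% level of significance from the table "Areas under the normal curve".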


Step 6: Inference: If the modulus of the calculated value |Zcal|≤ Zα, obtained in step 5, we accept the null hypothesis at α% level of significance. Otherwise we reject the null hypothesis at α% level of significance.

Now, we discuss the above with an example.

Example 4.5 In a sample of 1000 people in Karnataka, 540 are rice eaters and the rest are wheat eaters. Can we assume that rice and wheat are equally popular in this state at 1% level of significance?

Solution:

Given n = 1000; p = 540/1000 = 0.54; P = 0.5; Q = 1 − P = 1 − 0.5 = 0.5

Null Hypothesis (H0): The sample has been drawn from a population with proportion P, i.e., H0: P=0.5

Alternative Hypothesis (H1): The sample has not been drawn from a population with proportion P, i.e., H1: P≠0.5.

Level of significance (α): 1%

Test Statistics:

$$Z_{cal} = \frac{p - P}{\sqrt{PQ/n}} = \frac{0.54 - 0.5}{\sqrt{(0.5)(0.5)/1000}} = \frac{0.04}{0.0158} = 2.53$$

|Zcal| = 2.53

Critical Value: At 1% level, Z0.01=2.58

Inference: Since the calculated value of |Zcal|is less than the critical value of Z at 1% level, hence we accept the null hypothesis and conclude that, the sample has been drawn from a population with proportion P, i.e., H0: P=0.5.
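As a cross-check, the single-proportion statistic can be sketched in Python. The name `z_single_proportion` is a hypothetical helper introduced here for illustration; the critical value 2.58 is the 1% two-tailed table value quoted in the example.

```python
from math import sqrt


def z_single_proportion(successes: int, n: int, pop_proportion: float) -> float:
    """Z statistic for a single proportion: (p - P) / sqrt(P * Q / n)."""
    p = successes / n          # sample proportion
    q = 1 - pop_proportion     # Q = 1 - P
    return (p - pop_proportion) / sqrt(pop_proportion * q / n)


# Figures from Example 4.5: 540 rice eaters out of 1000, H0: P = 0.5
z_cal = z_single_proportion(540, 1000, 0.5)
print(f"|Zcal| = {abs(z_cal):.2f}")
print("accept H0" if abs(z_cal) <= 2.58 else "reject H0")
```

Note that the decision depends on the chosen level: the same |Zcal| would exceed the 5% critical value of 1.96.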

Self – assessment Question

1. In a random sample of 400 persons from a large population, 120 are female. Can it be said that males and females are in the ratio 5:3 in the population? Use 5% level of significance.

[Hint: n = 400, p = 120/400 = 0.3; P = 3/8 = 0.375; Q = 1 − P = 1 − 0.375 = 0.625; |Zcal| ≈ 3.10]

4.7.4 Test for two proportions:

Working Rule:

Step 1: Setting up of a Null Hypothesis. The two samples have been drawn from same population, i.e., H0: P1=P2.

Step 2: Setting up of an Alternative hypothesis. The two samples have not been drawn from same population, i.e., H1:P1≠P2

Step 3: Fixing of level of significance: α (normally it is 5%)

Step 4: Computation of test statistics:

$$Z_{cal} = \frac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where $P = \dfrac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$ and $Q = 1 - P$.

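Step 5: Critical value: Find the critical value Zα at the α% level of significance from the table "Areas under the normal curve".

Step 6: Inference: If |Zcal| ≤ Zα, we accept the null hypothesis at the α% level of significance. Otherwise we reject the null hypothesis at the α% level of significance.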

Example 4.6 In a sample of 600 men from a certain city, 450 are found to be smokers. In a sample of 900 from another city 450 are found to be smokers. Do the data indicate that the two cities are significantly different with respect to prevalence of smoking habits among men?

Solution:

Given n1 = 600; n2 = 900; p1 = 450/600 = 0.75; p2 = 450/900 = 0.5;

$$P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{600(0.75) + 900(0.5)}{600 + 900} = 0.6$$

Q = 1 − P = 1 − 0.6 = 0.4

Null Hypothesis (H0): The two samples have been drawn from same population, i.e., H0: P1 = P2

Alternative Hypothesis (H1): The two samples have not been drawn from the same population, i.e., H1: P1 ≠ P2.

Level of significance (α): 5%


Test Statistic:

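$$Z_{cal} = \frac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{0.75 - 0.5}{\sqrt{(0.6)(0.4)\left(\dfrac{1}{600} + \dfrac{1}{900}\right)}} = \frac{0.25}{0.0258} = 9.7$$

|Zcal| = 9.7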

Critical Value: At 5% level, Z0.05=1.96

Inference: Since the calculated value of |Zcal|is greater than the critical value of Z at 5% level, hence we reject the null hypothesis and conclude that,the two samples have not been drawn from the same population.

Example 4.7 A machine puts out 16 imperfect articles in a sample of 500. After the machine is overhauled, it puts out 3 imperfect articles in a batch of 100. Has the machine improved?

Solution:

Given n1 = 500; n2 = 100; p1 = 16/500 = 0.032; p2 = 3/100 = 0.03;

$$P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{500(0.032) + 100(0.03)}{500 + 100} = 0.0317$$

Q = 1 − P = 1 − 0.0317 = 0.9683

Null Hypothesis (H0): P1 = P2.

Alternative Hypothesis (H1): P1 > P2 (one-tailed test)

Level of significance (α): 5%

Test Statistic:

$$Z_{cal} = \frac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{0.032 - 0.03}{\sqrt{(0.0317)(0.9683)\left(\dfrac{1}{500} + \dfrac{1}{100}\right)}} = \frac{0.002}{0.0192} = 0.10$$

|Zcal| = 0.10

Critical Value: At 5% level, Z0.05 = 1.645

Inference: Since the calculated value of |Zcal| is less than the critical value of Z at 5% level, we accept the null hypothesis and conclude that there is no improvement after overhauling.
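The pooled two-proportion statistic used in the examples above can be sketched in Python as follows. The function name `z_two_proportions` is hypothetical, and the figures passed to it are those of Example 4.6.

```python
from math import sqrt


def z_two_proportions(x1: int, n1: int, x2: int, n2: int) -> float:
    """Z statistic for the difference of two proportions with a pooled estimate P."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion P = (n1*p1 + n2*p2) / (n1 + n2), which equals (x1 + x2) / (n1 + n2)
    pooled = (x1 + x2) / (n1 + n2)
    q = 1 - pooled
    standard_error = sqrt(pooled * q * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# Figures from Example 4.6: 450 smokers out of 600 and 450 smokers out of 900
z_cal = z_two_proportions(450, 600, 450, 900)
print(f"|Zcal| = {abs(z_cal):.2f}")
print("reject H0" if abs(z_cal) > 1.96 else "accept H0")
```

For a one-tailed alternative such as H1: P1 > P2, the same statistic is compared with 1.645 instead of 1.96 at the 5% level.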


Self Assessment Question

1. In random samples of 600 and 1000 men from two cities, 400 and 600 men respectively are found to be literate. Do the data indicate at 5% level of significance that the populations are significantly different in the percentage of literacy?

[Hint: n1 = 600, n2 = 1000, p1 = 400/600 = 0.67, p2 = 600/1000 = 0.6; P = 0.625; Q = 0.375; |Zcal| = 2.67]

2. Before an increase in excise duty on tea, 400 people out of a sample of 500 persons were found to be tea drinkers. After the increase in duty, 400 persons were found to be tea drinkers in a sample of 600 people. Do you think that there has been a significant decrease in the consumption of tea after the increase in the excise duty?

[Hint: n1 = 500, n2 = 600, p1 = 400/500 = 0.80, p2 = 400/600 = 0.67; P = 0.73; Q = 0.27; H0: P1 = P2; H1: P1 > P2 (one-tailed test); |Zcal| ≈ 4.94]

4.8 ANALYSIS OF VARIANCE:

In many statistical studies a variable of interest, called the response variable (or dependent variable), is identified. Then the data are collected that tell us about how one or more factors (or independent variables) influence the variable of interest. If we cannot control the factor(s) being studied, we say that the data obtained are observational. For example, suppose that in order to study how the size of a home relates to the sale price of the home, a real estate agent randomly selects 50 recently sold homes and records the square footages and sale prices of these homes. Because the real estate agent cannot control the sizes of the randomly selected homes, we say that data are observational.

If we can control the factors being studied, we say that the data are experimental. Furthermore, in this case the values, or levels, of the factor (or combination of factors) are called treatments. The purpose of most experiments is to compare and estimate the effects of the different treatments on the response variable. For example, suppose that an oil company wishes to study how three different gasoline types (A, B and C) affect the mileage obtained by a popular midsized automobile model. Here the response variable is gasoline mileage and the company will study a single factor, gasoline type. Since the oil company can control which gasoline type is used in the midsized automobile, the data that the oil company will collect are experimental. Furthermore, the treatments (the levels of the factor gasoline type) are gasoline types A, B and C.


In order to collect data in an experiment, the different treatments are assigned to objects (people, cars, animals or the like) that are called experimental units. For example, in the gasoline mileage situation, gasoline types A, B and C will be compared by conducting mileage tests using a midsized automobile. The automobiles used in the tests are the experimental units.

Definition:

According to R.A. Fisher, Analysis of Variance (ANOVA) is the "Separation of Variance ascribable to one group of causes from the variance ascribable to other group". By this technique the total variation in the sample data is expressed as the sum of its non-negative components, where each of these components is a measure of the variation due to some specific independent source, factor or cause.

4.8.1 Assumptions:

For the validity of the F-test in ANOVA the following assumptions are made:
(i) The observations are independent.
(ii) The parent population from which the observations are taken is normal.
(iii) The various treatment and environmental effects are additive in nature.

4.8.2 One Way Classification

Let us suppose that N observations xij (i = 1, 2, …, k; j = 1, 2, …, ni) of a random variable X are grouped, on some basis, into k classes of sizes n1, n2, …, nk respectively ($N = \sum_{i=1}^{k} n_i$), as exhibited below:

Observations                     Mean    Total
x11   x12   . . .   x1n1         x̄1      T1
x21   x22   . . .   x2n2         x̄2      T2
xi1   xi2   . . .   xini         x̄i      Ti
xk1   xk2   . . .   xknk         x̄k      Tk
                          Grand Total    G

The total variation in the observation xij can be split into the following two components:

(i) The variation between the classes, or the variation due to different bases of classification, commonly known as treatments.


(ii) The variation within the classes i.e., the inherent variation of the random variable within the observations of a class.

The first type of variation is due to assignable causes, which can be detected and controlled by human endeavor, and the second type of variation is due to chance causes, which are beyond human control.

In particular, let us consider the effect of k different rations on the yield in milk of N cows (of the same breed and stock) divided into k classes of sizes n1, n2, …, nk respectively ($N = \sum_{i=1}^{k} n_i$). Hence the sources of variation are

(i) Effect of rations
(ii) Error due to chance causes, produced by causes so numerous that they cannot be detected and identified.

Test Procedure:

The steps involved in carrying out the analysis are:

1) Null Hypothesis (H0): The first step is to set up the null hypothesis H0: μ1 = μ2 = … = μk

2) Alternative Hypothesis (H1): Not all μi's are equal (i = 1, 2, …, k)

3) Level of significance: Let α = 0.05

4) Test statistic:

Various sums of squares are obtained as follows:

a) Find the sum of values of all the (N) items of the given data. Let this grand total be represented by 'G'. Then the correction factor (C.F.) is

$$\text{C.F.} = \frac{G^2}{N}$$

b) Find the sum of squares of all the individual items (xij) and then the total sum of squares (TSS):

$$\text{TSS} = \sum_{i}\sum_{j} x_{ij}^2 - \text{C.F.}$$

c) Find the sum of squares of all the class totals (or each treatment total) Ti (i = 1, 2, …, k) and then the sum of squares between the classes or between the treatments (SST):

$$\text{SST} = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \text{C.F.}$$

where ni (i = 1, 2, …, k) is the number of observations in the ith class, or the number of observations received by the ith treatment.


d) Find the sum of squares within the classes, or sum of squares due to error (SSE), by subtraction: SSE = TSS − SST

5) Degrees of freedom (d.f.): The degrees of freedom for the total sum of squares (TSS) is (N − 1), for SST it is (k − 1), and for SSE it is (N − k).

6) Mean sum of squares: The mean sum of squares for treatments is MST = SST/(k − 1) and the mean sum of squares for error is MSE = SSE/(N − k).

7) ANOVA Table: The above sums of squares, together with their respective degrees of freedom and mean sums of squares, are summarized in the following table.

ANOVA Table for one-way classification

Sources of Variation    d.f.     S.S.    M.S.S.               F-ratio
Between Treatments      k − 1    SST     MST = SST/(k − 1)    F = MST/MSE
Error                   N − k    SSE     MSE = SSE/(N − k)
Total                   N − 1

Calculation of variance ratio

The variance ratio F is the ratio of the greater variance to the smaller variance; thus

$$F = \frac{\text{Variation between the treatments}}{\text{Variation within the treatments}} = \frac{\text{MST}}{\text{MSE}}$$

If the variance within the treatments is more than the variance between the treatments, then the numerator and denominator should be interchanged and the degrees of freedom adjusted accordingly.

8) Critical Value of F or Table value of F: The critical value of F or table value of F is obtained from the F table for (k − 1, N − k) d.f. at 5% level of significance.

9) Inference: If the calculated F value is less than the table value of F, we may accept our null hypothesis H0 and say that there is no significant difference between treatments. If the calculated F value is greater than the table value of F, we reject our H0 and say that the difference between treatments is significant.

Example 4.8

The following table gives the yields on 15 sample plots under three varieties of seed

A:   20   21   23   16   20
B:   18   20   17   15   25
C:   25   28   22   28   32

Prepare an analysis of variance table

Solution:

Null Hypothesis (H0): μ1 = μ2 = μ3 (i.e., the varieties of seed are homogeneous)

Alternative hypothesis (H1): Not all of μ1, μ2, μ3 are equal (i.e., the varieties of seed are not homogeneous)

Level of Significance(α):0.05

Test Statistic:

Variety    Yields                     Total (Ti)    Ti²
A          20   21   23   16   20     100           10000
B          18   20   17   15   25      95            9025
C          25   28   22   28   32     135           18225
Grand Total (G)                       330

Squares:

Variety    Squared yields                    Total
A          400   441   529   256    400      2026
B          324   400   289   225    625      1863
C          625   784   484   784   1024      3701
Total                                        7590

$$\text{C.F.} = \frac{G^2}{N} = \frac{330^2}{15} = 7260$$


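Total sum of squares (TSS):

$$\text{TSS} = \sum_{i}\sum_{j} x_{ij}^2 - \text{C.F.} = 7590 - 7260 = 330$$

Sum of squares between the classes or between the treatments (SST):

$$\text{SST} = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \text{C.F.} = \left(\frac{100^2}{5} + \frac{95^2}{5} + \frac{135^2}{5}\right) - 7260 = 7450 - 7260 = 190$$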

Sum of squares due to error (SSE) = TSS –SST = 330-190 = 140

ANOVA Table for one-way classification

Source of Variation    d.f.           S.S.    M.S.S.             F-ratio
Between treatments     3 − 1 = 2      190     190/2 = 95         95/11.667 = 8.143
Error                  15 − 3 = 12    140     140/12 = 11.667
Total                  15 − 1 = 14

Table Value: The table value of F for (2, 12) d.f. at 5% level of significance is 3.89 (from the F-table).

Inference: Since the calculated F is greater than the table value of F, we reject H0 and conclude that the varieties of seed are not homogeneous.
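The correction-factor procedure of the test can be sketched in Python as a cross-check of the worked example. The function name `one_way_anova` and the dictionary keys are hypothetical choices made for this illustration; the table value of F still has to be looked up separately.

```python
def one_way_anova(groups: list[list[float]]) -> dict[str, float]:
    """One-way ANOVA by the correction-factor method (C.F., TSS, SST, SSE, F)."""
    n_total = sum(len(g) for g in groups)            # N
    k = len(groups)                                  # number of treatments
    grand_total = sum(sum(g) for g in groups)        # G
    cf = grand_total ** 2 / n_total                  # C.F. = G^2 / N
    tss = sum(x * x for g in groups for x in g) - cf
    sst = sum(sum(g) ** 2 / len(g) for g in groups) - cf
    sse = tss - sst                                  # obtained by subtraction
    mst = sst / (k - 1)
    mse = sse / (n_total - k)
    return {"CF": cf, "TSS": tss, "SST": sst, "SSE": sse,
            "MST": mst, "MSE": mse, "F": mst / mse}


# Yields for the three seed varieties in the worked example
yields = [
    [20, 21, 23, 16, 20],   # A
    [18, 20, 17, 15, 25],   # B
    [25, 28, 22, 28, 32],   # C
]
result = one_way_anova(yields)
for name, value in result.items():
    print(f"{name} = {value:.3f}")
```

Because each class total is divided by its own size, the same function also handles unequal class sizes, as in the self-assessment question that follows.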

Self – Assessment Question

1. Three processes A, B and C are tested to see whether their outputs are equivalent. The following observations of output are made:

A:   10   12   13   11   10   14   15   13
B:    9   11   10   12   13
C:   11   10   15   14   12   13

Carry out the analysis of variance and state your conclusion

Hint:

Sources of Variation    d.f.           S.S.    M.S.S.            F-ratio
Between treatments      3 − 1 = 2      7       7/2 = 3.5         3.5/3.19 = 1.097
Error                   19 − 3 = 16    51      51/16 = 3.19
Total                   19 − 1 = 18

Table value of F for (2, 16) d.f. at 5% level of significance is 3.63

Chapter Summary

This chapter has explained different types of sampling. First we discussed the need for sampling and the elements of a sampling plan. We continued by discussing sampling and non-sampling errors and testing of hypothesis. We saw that inferences can be made using both large sample and small sample tests. We learned the procedure for testing of hypothesis for a single mean, difference of two means, single proportion and difference of two proportions under large sample tests. To conclude this chapter, we explained how to test homogeneity using one-way ANOVA.


UNIT – V

CORRELATION AND REGRESSION ANALYSIS

5.0 OBJECTIVES
5.1 MEANING OF CORRELATION
5.2 TYPES OF CORRELATION

5.2.1 Positive and Negative Correlation
5.2.2 Linear and Non-linear Correlation

5.3 MEASUREMENT TECHNIQUES OF CORRELATION COEFFICIENT

5.3.1 Scatter Diagram
5.3.2 Karl Pearson's Coefficient of Correlation
5.3.3 Spearman's Rank Correlation
Ranks are given directly
Non-repeated ranks
Repeated ranks

5.4 PROPERTIES OF CORRELATION COEFFICIENT
5.5 MEANING OF REGRESSION
5.6 TYPES OF REGRESSION LINES

5.6.1 Regression line of X on Y
5.6.2 Regression line of Y on X

5.7 CONSTRUCTION OF REGRESSION EQUATIONS
5.8 PROPERTIES OF REGRESSION COEFFICIENTS
5.9 DIFFERENCES BETWEEN CORRELATION AND REGRESSION
5.10 APPLICATIONS OF REGRESSION ANALYSIS


INTRODUCTION

In this unit you will learn the concepts of correlation and regression. You will also learn the various methods of obtaining the correlation coefficient, the rank correlation coefficient, regression equations and so on. This unit explains the differences between correlation and regression. The techniques discussed in this unit are easier to understand with the help of a calculator, so try out the example problems with one.

5.1 MEANING OF CORRELATION

We know that a change in one factor, say the amount of rainfall, affects another factor, say the yield of rice. This means that there exists some kind of relationship between the two factors. Thus correlation is the relationship between two factors.

In simple words, correlation means "the degree of relationship between two or more factors". Another example is the relationship that exists between price and demand.

5.2 TYPES OF CORRELATION

There are different types of correlation. They can be classified into the following categories.

a) Positive and Negative degree correlation
b) Linear and Non-linear correlation

First we will discuss positive and negative degree correlation.

5.2.1 Positive and Negative degree correlation

If the changes in the factors are in the same direction then the correlation is said to be "Positive degree correlation". The relationship between the amount of rainfall and the yield of rice is an example of positive degree correlation: if the rainfall level increases then the yield of rice also increases, and vice versa.

If the changes in the factors are in opposite directions then the correlation is said to be "Negative degree correlation". The relationship between price and demand is an example: if the price increases then the demand generally decreases, and vice versa.

Now, we will discuss the linear and non-linear correlation.

5.2.2 Linear and Non-linear Correlation

If the changes in the factors are in the constant ratio then the correlation is said to be “Linear correlation”.


For example

Amount of rainfall (in mm)    40     60     80     100    120
Yield of rice (in tonnes)     100    150    200    250    300

From the above example, it can be observed that the amount of rainfall increases by 20 mm at each level and the yield of rice increases by 50 tonnes at each level.

If the changes in the factors are not in the constant ratio then the correlation is said to be “Non-linear correlation”.

For example

Factor 1    40     60     80     100    120
Factor 2    100    150    200    250    300

From the above example, it can be observed that, the changes at various levels are different

Self Assessment Question

State the different types of correlation with example in the space given below. Limit your answer in about 80 words.

____________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________


5.3.1 Scatter Diagram

If the values of the variables or factors, say X and Y, are plotted in the XY-plane, the diagram of the data obtained is called a scatter diagram. The greater the scatter of the plotted points on the diagram, the weaker is the relationship between the two variables or factors.

1. If all the points lie on a straight line rising from the lower left-hand corner to the upper right-hand corner, the correlation is said to be perfectly positive, i.e., the correlation coefficient r = +1 (Fig 5.1).

Figure 5.1 r = +1

2. If all the points lie on a straight line falling from the upper left-hand corner to the lower right-hand corner, the correlation is said to be perfectly negative, i.e., the correlation coefficient r = −1 (Fig 5.2).

Figure 5.2 r = −1

3. If the points fall in a narrow band and show a rising tendency from the lower left-hand corner to the upper right-hand corner, there is a high degree of positive correlation (Fig 5.3).

Figure 5.3 r close to +1

4. If the points fall in a narrow band and show a declining tendency from the upper left-hand corner to the lower right-hand corner, there is a high degree of negative correlation (Fig 5.4).

Figure 5.4 r close to −1

5. If the points fall in a wide band and show a rising tendency from the lower left-hand corner to the upper right-hand corner, there is a low degree of positive correlation (Fig 5.5).

Figure 5.5 r > 0

6. If the points fall in a wide band and show a declining tendency from the upper left-hand corner to the lower right-hand corner, there is a low degree of negative correlation (Fig 5.6).

Figure 5.6 r < 0

7. If the plotted points lie on a straight line parallel to the x-axis or are scattered in a haphazard manner, there is no correlation between the two factors (Fig 5.7).

Figure 5.7 r = 0


5.3.2 KARL PEARSON'S COEFFICIENT OF CORRELATION

As a measure of the degree of linear relationship between two variables, Karl Pearson developed a formula called the correlation coefficient. The correlation coefficient between two variables X and Y, usually denoted by rxy, is defined as

$$r_{xy} = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2}\,\sqrt{\sum (Y - \bar{Y})^2}} = \frac{\sum xy}{\sqrt{\sum x^2}\,\sqrt{\sum y^2}}$$

where $x = X - \bar{X}$ and $y = Y - \bar{Y}$.

Working Procedure

Step 1: Denote one series by X and other by Y

Step 2: Calculate the means X̄ and Ȳ of the X and Y series respectively, using the formulas

$$\bar{X} = \frac{\sum X}{n} \quad \text{and} \quad \bar{Y} = \frac{\sum Y}{n}$$

Step 3: Take the deviations of the observations in the X series from X̄ and write them under the column headed by x = X − X̄. Take the deviations of the observations in the Y series from Ȳ and write them under the column headed by y = Y − Ȳ.

Step 4: Multiply the respective deviations and write the products under the column headed by xy.

Step 5: Square the deviations obtained in step 3 for the X and Y series and write them under the columns headed by x² and y².

Step 6: Apply the following formula to calculate the correlation coefficient (r).

$$r_{xy} = \frac{\sum xy}{\sqrt{\sum x^2}\,\sqrt{\sum y^2}}$$


Example 5.1 Find the coefficient of correlation between height of brothers and sisters from the following data

Height of Brothers (in cm)    65   66   67   68   69   70   71
Height of Sisters (in cm)     67   68   66   69   72   72   69

Solution: Let the heights of Brothers be denoted by X and that of Sisters by Y. Let us prepare the following table

X      x = X − X̄    Y      y = Y − Ȳ    xy     x²     y²
65     -3           67     -2           6      9      4
66     -2           68     -1           2      4      1
67     -1           66     -3           3      1      9
68      0           69      0           0      0      0
69      1           72      3           3      1      9
70      2           72      3           6      4      9
71      3           69      0           0      9      0
476     -           483     -           20     28     32

From the above table.

N = 7; ∑X = 476; ∑Y = 483; ∑xy = 20; ∑x² = 28; ∑y² = 32

$$\bar{X} = \frac{\sum X}{n} = \frac{476}{7} = 68, \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{483}{7} = 69$$

Karl Pearson’s Coefficient of Correlation is now calculated as follows:

$$r = \frac{\sum xy}{\sqrt{\sum x^2}\,\sqrt{\sum y^2}} = \frac{20}{\sqrt{28}\,\sqrt{32}} = \frac{20}{(5.2915)(5.6569)} = \frac{20}{29.9335} = 0.6681$$

r = 0.6681

Self – Assessment Question

Calculate the correlation coefficient between the height of sister and height of the brothers from the given data:

Height of Sisters (in cm)     64   65   66   67   68   69   70
Height of Brothers (in cm)    66   67   65   68   70   68   72

[Hint: X̄ = 67, Ȳ = 68, ∑x² = 28, ∑y² = 34, ∑xy = 25, r = 0.81]
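The direct (deviation) method can be sketched in Python as follows. The function name `pearson_r` is a hypothetical helper, and the data are the heights from the self-assessment question above, so the result can be compared with the hint.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Karl Pearson's r by the direct method, using deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]                    # x = X - mean of X
    dev_y = [y - mean_y for y in ys]                    # y = Y - mean of Y
    sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))   # sum of xy
    sum_x2 = sum(a * a for a in dev_x)                  # sum of x^2
    sum_y2 = sum(b * b for b in dev_y)                  # sum of y^2
    return sum_xy / (sqrt(sum_x2) * sqrt(sum_y2))


sisters = [64, 65, 66, 67, 68, 69, 70]
brothers = [66, 67, 65, 68, 70, 68, 72]
r = pearson_r(sisters, brothers)
print(f"r = {r:.2f}")
```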

Short Cut Method

The above direct method for calculating 'r' is not convenient when (i) the terms of the series X and Y are large and the calculation of X̄ and Ȳ becomes difficult, or (ii) the means of X or Y are not integers. In these cases we apply the following assumed-mean formula:

$$r_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}}$$

where,

dx = X-A, A is the assumed mean of X – series

dy = Y-B, B is the assumed mean of Y – series

n is number of observation of X and Y

Working Procedure

Step 1: Denote one series by X and the other by Y.

Step 2: Take any term ‘A’ as assumed mean of X series and ‘B’ as assumed mean of Y series (preferably the middle one).


Step 3: Take the deviations of the observations in the X series from A and write them under the column headed by dx = X − A. Take the deviations of the observations in the Y series from B and write them under the column headed by dy = Y − B.

Step 4: Multiply the respective deviations and write the products under the column headed by dxdy.

Step 5: Square the deviations obtained in step 3 for the X and Y series and write them under the columns headed by dx² and dy².

Step 6: Apply the following formula to calculate the correlation coefficient (r).

$$r_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}}$$

Example 5.2: Calculate the coefficient of correlation for the following pairs of values of X and Y.

X    17   19   21   26   20   28   26   29
Y    23   27   25   26   27   25   30   33

Solution:

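Let the assumed means for X and Y be 23 and 27 respectively, so that $d_x = X - 23$ and $d_y = Y - 27$.

We have the following table:

| X | Y | $d_x = X - 23$ | $d_y = Y - 27$ | $d_x d_y$ | $d_x^2$ | $d_y^2$ |
|---|---|---|---|---|---|---|
| 17 | 23 | -6 | -4 | 24 | 36 | 16 |
| 19 | 27 | -4 | 0 | 0 | 16 | 0 |
| 21 | 25 | -2 | -2 | 4 | 4 | 4 |
| 26 | 26 | 3 | -1 | -3 | 9 | 1 |
| 20 | 27 | -3 | 0 | 0 | 9 | 0 |
| 28 | 25 | 5 | -2 | -10 | 25 | 4 |
| 26 | 30 | 3 | 3 | 9 | 9 | 9 |
| 29 | 33 | 6 | 6 | 36 | 36 | 36 |
| Total: 186 | 216 | 2 | 0 | 60 | 144 | 70 |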

Note that here $\bar{X} = \frac{\sum X}{n} = \frac{186}{8} = 23.25$, which is not an integer, so we use the short-cut method.

Here $n = 8$, $\sum d_x = 2$, $\sum d_y = 0$, $\sum d_x d_y = 60$, $\sum d_x^2 = 144$, $\sum d_y^2 = 70$.

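$$r_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}}$$

$$r_{xy} = \frac{8(60) - (2)(0)}{\sqrt{8(144) - (2)^2}\,\sqrt{8(70) - (0)^2}}$$

$$r_{xy} = \frac{480}{\sqrt{1148}\,\sqrt{560}} = \frac{480}{(33.8821)(23.6643)} = \frac{480}{801.7962}$$

$$r_{xy} = 0.5987$$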
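The short-cut computation can be checked with a few lines of Python. This is a minimal sketch, not part of the original text: the function name `shortcut_r` is illustrative, and the data are the pairs of Example 5.2. It also evaluates the formula with a second pair of assumed means (20 and 25, chosen arbitrarily) to show that the result does not depend on the choice of A and B.

```python
from math import sqrt

def shortcut_r(xs, ys, a, b):
    """Correlation coefficient by the assumed-mean (short-cut) formula."""
    n = len(xs)
    dx = [x - a for x in xs]                     # deviations of X from A
    dy = [y - b for y in ys]                     # deviations of Y from B
    s_dx, s_dy = sum(dx), sum(dy)
    s_dxdy = sum(u * v for u, v in zip(dx, dy))
    s_dx2 = sum(u * u for u in dx)
    s_dy2 = sum(v * v for v in dy)
    num = n * s_dxdy - s_dx * s_dy
    den = sqrt(n * s_dx2 - s_dx ** 2) * sqrt(n * s_dy2 - s_dy ** 2)
    return num / den

X = [17, 19, 21, 26, 20, 28, 26, 29]
Y = [23, 27, 25, 26, 27, 25, 30, 33]

r = shortcut_r(X, Y, a=23, b=27)
print(round(r, 4))                               # 0.5987

# A different (arbitrary) choice of assumed means gives the same r.
r_other = shortcut_r(X, Y, a=20, b=25)
```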

Self – assessment Question

Compute the coefficient of correlation for the following data

| X | 10 | 25 | 13 | 25 | 22 | 11 | 12 | 25 | 21 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| Y | 12 | 22 | 16 | 15 | 18 | 18 | 17 | 23 | 24 | 17 |

[Ans: rxy= 0.53]

5.3.3 Spearman's Rank Correlation Coefficient

The coefficient of rank correlation is based on the ranks of the values of the variates. It is denoted by ρ and is given by

$$\rho = 1 - \frac{6\sum D^2}{n^3 - n}$$

where D is the difference of corresponding ranks and n is the number of pairs of observations.

TYPE I: RANKS ARE GIVEN DIRECTLY

Working Procedure

Step 1: Denote rank of X series by R1 and rank of Y series by R2.

Step 2: Calculate the difference of R1 and R2 and write it under the column headed D.

Step 3: Square the difference D and write it under the column headed $D^2$.

Step 4: Apply the formula:

$$\rho = 1 - \frac{6\sum D^2}{n^3 - n}$$

This method is illustrated by the following example.

Example 5.3: Two judges in a beauty contest rank the 12 entries as follows.

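| Judge X | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Judge Y | 12 | 9 | 6 | 10 | 3 | 5 | 4 | 7 | 8 | 2 | 11 | 1 |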

Calculate the rank correlation coefficient between the two judges X and Y.

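| Judge X ($R_1$) | Judge Y ($R_2$) | $D = R_1 - R_2$ | $D^2$ |
|---|---|---|---|
| 1 | 12 | -11 | 121 |
| 2 | 9 | -7 | 49 |
| 3 | 6 | -3 | 9 |
| 4 | 10 | -6 | 36 |
| 5 | 3 | 2 | 4 |
| 6 | 5 | 1 | 1 |
| 7 | 4 | 3 | 9 |
| 8 | 7 | 1 | 1 |
| 9 | 8 | 1 | 1 |
| 10 | 2 | 8 | 64 |
| 11 | 11 | 0 | 0 |
| 12 | 1 | 11 | 121 |
| Total | | | 416 |

Here $n = 12$; $\sum D^2 = 416$.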

Now,

$$\rho = 1 - \frac{6\sum D^2}{n^3 - n} = 1 - \frac{6(416)}{12^3 - 12} = 1 - \frac{2496}{1728 - 12} = 1 - \frac{2496}{1716} = 1 - 1.4545$$

$$\rho = -0.4545$$
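The Type I procedure (ranks given directly) reduces to one sum and one division, as the following minimal Python sketch shows. The function name `spearman_from_ranks` is illustrative only; the ranks are those of Example 5.3.

```python
def spearman_from_ranks(r1, r2):
    """Rank correlation 1 - 6*sum(D^2)/(n^3 - n) for two lists of ranks."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))   # sum of D^2
    return 1 - 6 * sum_d2 / (n ** 3 - n)

judge_x = list(range(1, 13))                              # ranks 1..12
judge_y = [12, 9, 6, 10, 3, 5, 4, 7, 8, 2, 11, 1]

rho = spearman_from_ranks(judge_x, judge_y)
print(round(rho, 4))                                      # -0.4545
```

A negative value such as this one indicates that the two judges tend to rank the entries in opposite orders.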

Example 5.4: Ten competitors in a beauty contest were ranked by three judges in the following order:

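| Judge 1 | 1 | 6 | 5 | 10 | 3 | 2 | 4 | 9 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| Judge 2 | 3 | 5 | 8 | 4 | 7 | 10 | 2 | 1 | 6 | 9 |
| Judge 3 | 6 | 4 | 9 | 8 | 1 | 2 | 3 | 10 | 5 | 7 |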

Use the rank correlation coefficient to determine which pair of judges has the nearest approach to common taste in beauty.

Solution

Let R1, R2, R3 respectively be the ranks given by first, second and third judge.

Let $\rho_{ij}$ be the rank correlation coefficient between the ranks given by the ith and jth judges, i = 1, 2, 3; j = 1, 2, 3.

Let $D_{ij} = R_i - R_j$ be the difference between the ranks given to an individual by the ith and jth judges.

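| Judge 1 ($R_1$) | Judge 2 ($R_2$) | Judge 3 ($R_3$) | $D_{12} = R_1 - R_2$ | $D_{12}^2$ | $D_{23} = R_2 - R_3$ | $D_{23}^2$ | $D_{13} = R_1 - R_3$ | $D_{13}^2$ |
|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 6 | -2 | 4 | -3 | 9 | -5 | 25 |
| 6 | 5 | 4 | 1 | 1 | 1 | 1 | 2 | 4 |
| 5 | 8 | 9 | -3 | 9 | -1 | 1 | -4 | 16 |
| 10 | 4 | 8 | 6 | 36 | -4 | 16 | 2 | 4 |
| 3 | 7 | 1 | -4 | 16 | 6 | 36 | 2 | 4 |
| 2 | 10 | 2 | -8 | 64 | 8 | 64 | 0 | 0 |
| 4 | 2 | 3 | 2 | 4 | -1 | 1 | 1 | 1 |
| 9 | 1 | 10 | 8 | 64 | -9 | 81 | -1 | 1 |
| 7 | 6 | 5 | 1 | 1 | 1 | 1 | 2 | 4 |
| 8 | 9 | 7 | -1 | 1 | 2 | 4 | 1 | 1 |
| Total | | | | 200 | | 214 | | 60 |

Here $n = 10$; $\sum D_{12}^2 = 200$, $\sum D_{23}^2 = 214$, $\sum D_{13}^2 = 60$.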

First and Second Judges

$$\rho_{12} = 1 - \frac{6\sum D_{12}^2}{n^3 - n} = 1 - \frac{6(200)}{10^3 - 10} = 1 - \frac{1200}{990} = 1 - 1.2121 = -0.2121$$

Second and Third Judges

$$\rho_{23} = 1 - \frac{6\sum D_{23}^2}{n^3 - n} = 1 - \frac{6(214)}{10^3 - 10} = 1 - \frac{1284}{990} = 1 - 1.2970 = -0.2970$$

First and Third Judges

$$\rho_{13} = 1 - \frac{6\sum D_{13}^2}{n^3 - n} = 1 - \frac{6(60)}{10^3 - 10} = 1 - \frac{360}{990} = 1 - 0.3636 = 0.6364$$

Since $\rho_{13}$ is the largest of the three coefficients, the first and third judges have the nearest approach to common taste in beauty.
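The pairwise comparison of the three judges can also be sketched in Python. The helper `spearman_from_ranks` and the dictionary layout are illustrative choices, not part of the original text; the ranks are those of Example 5.4. The sketch computes the coefficient for every pair of judges and picks the pair with the largest value.

```python
from itertools import combinations

def spearman_from_ranks(r1, r2):
    """Rank correlation 1 - 6*sum(D^2)/(n^3 - n) for two lists of ranks."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

ranks = {
    1: [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    2: [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    3: [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

# One coefficient per unordered pair of judges: (1, 2), (1, 3), (2, 3).
rho = {(i, j): spearman_from_ranks(ranks[i], ranks[j])
       for i, j in combinations(ranks, 2)}

best_pair = max(rho, key=rho.get)      # pair with the largest coefficient
for pair, value in rho.items():
    print(pair, round(value, 4))
print("closest pair:", best_pair)      # closest pair: (1, 3)
```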

Self –assessment Question

1. Two judges in a musical contest rank the 10 entries as follows:

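| Judge X | 3 | 5 | 8 | 4 | 7 | 10 | 2 | 1 | 6 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Judge Y | 6 | 4 | 9 | 8 | 1 | 2 | 3 | 10 | 5 | 7 |

[Hint: $n = 10$; $\sum D^2 = 214$; $\rho = -0.2970$]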

2. Ten competitors in a beauty contest were ranked by three judges in the following order:

| 1st Judge | 1 | 5 | 4 | 8 | 9 | 6 | 10 | 7 | 3 | 2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2nd Judge | 4 | 8 | 7 | 6 | 5 | 9 | 10 | 3 | 2 | 1 |
| 3rd Judge | 6 | 7 | 8 | 1 | 5 | 10 | 9 | 2 | 3 | 4 |

Use Spearman's coefficient of rank correlation to determine which pair of judges has the nearest approach to common taste in beauty.

[Hint: $n = 10$; $\sum D_{12}^2 = 74$, $\sum D_{23}^2 = 44$, $\sum D_{13}^2 = 156$; $\rho_{12} = 0.5515$, $\rho_{23} = 0.7333$, $\rho_{13} = 0.0545$]

TYPE II: RANKS ARE NOT GIVEN – NON-REPEATED RANKS

In this case only the data are given. We assign ranks to the values of both the X and Y series, using ascending order for both series (or descending order for both).

Working Rule

Step 1: Assign ranks to each item of both series in ascending or descending order.

Step 2: Calculate the difference of ranks and write it under the column headed D.

Step 3: Square the difference D and write it under the column headed $D^2$.

Step 4: Apply the formula

$$\rho = 1 - \frac{6\sum D^2}{n^3 - n}$$

This method is explained with the help of the following example.

Example 5.5

For the following data calculate the coefficient of rank correlation.

| Series X | 80 | 91 | 99 | 71 | 61 | 81 | 70 | 59 |
|---|---|---|---|---|---|---|---|---|
| Series Y | 123 | 135 | 154 | 110 | 105 | 134 | 121 | 106 |

Solution:

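| Series X | Series Y | Rank X ($R_1$) | Rank Y ($R_2$) | D | $D^2$ |
|---|---|---|---|---|---|
| 80 | 123 | 5 | 5 | 0 | 0 |
| 91 | 135 | 7 | 7 | 0 | 0 |
| 99 | 154 | 8 | 8 | 0 | 0 |
| 71 | 110 | 4 | 3 | 1 | 1 |
| 61 | 105 | 2 | 1 | 1 | 1 |
| 81 | 134 | 6 | 6 | 0 | 0 |
| 70 | 121 | 3 | 4 | -1 | 1 |
| 59 | 106 | 1 | 2 | -1 | 1 |
| Total | | | | | 4 |

Here $n = 8$; $\sum D^2 = 4$.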

Now,

$$\rho = 1 - \frac{6\sum D^2}{n^3 - n} = 1 - \frac{6(4)}{8^3 - 8} = 1 - \frac{24}{504} = 1 - 0.0476 = 0.9524$$
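When ranks are not given, the only extra step is the ranking itself. The sketch below is a minimal illustration under the Type II assumption that no value is repeated; the function names `ascending_ranks` and `spearman_from_ranks` are hypothetical, and the data are those of Example 5.5.

```python
def ascending_ranks(values):
    """Rank 1 for the smallest value; assumes no value is repeated."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_from_ranks(r1, r2):
    """Rank correlation 1 - 6*sum(D^2)/(n^3 - n) for two lists of ranks."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

series_x = [80, 91, 99, 71, 61, 81, 70, 59]
series_y = [123, 135, 154, 110, 105, 134, 121, 106]

rank_x = ascending_ranks(series_x)     # [5, 7, 8, 4, 2, 6, 3, 1]
rank_y = ascending_ranks(series_y)     # [5, 7, 8, 3, 1, 6, 4, 2]
rho = spearman_from_ranks(rank_x, rank_y)
print(rank_x, rank_y, rho)
```

Ranking both series in descending order instead would leave every difference D unchanged in magnitude, so the coefficient is the same either way.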

Self – assessment Question

Calculate the rank correlation coefficient for the following data of two series

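| Series X | 92 | 89 | 87 | 86 | 83 | 77 | 71 | 63 | 53 | 50 |
|---|---|---|---|---|---|---|---|---|---|---|
| Series Y | 86 | 83 | 91 | 77 | 68 | 85 | 52 | 82 | 37 | 57 |

[Hint: $n = 10$; $\sum D^2 = 44$; $\rho = 0.733$]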

TYPE III: RANKS ARE NOT GIVEN – REPEATED RANKS

If two or more individuals are placed together in any classification with respect to an attribute, so that more than one item has the same rank in either or both series, the problem is solved by assigning the average rank to each of the tied individuals.

For example, suppose a value is repeated at rank 5 (i.e., the 5th and 6th items have the same value); then the common rank assigned to the 5th and 6th items is (5+6)/2 = 5.5. If a value is repeated thrice, the common rank assigned to it is the sum of the three ranks divided by 3. To find the rank correlation coefficient in this case, an adjustment factor is added to $\sum D^2$ in the formula. It is given by

$$\text{Adjustment Factor (A.F.)} = \frac{1}{12}(m^3 - m)$$

where m is the number of times an item is repeated. This adjustment factor is to be added for each repeated value in both series.

The modified formula for the rank correlation coefficient is

$$\rho = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \cdots\right]}{n^3 - n}$$

where $m_1, m_2, \ldots$ are the numbers of repetitions of the individual repeated values.

This method is explained with the following example.

Example 5.6 From the following data related to the series X and Y, calculate the coefficient of rank correlation.

| Series X | 48 | 33 | 40 | 9 | 16 | 16 | 65 | 24 | 16 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|
| Series Y | 13 | 13 | 24 | 6 | 15 | 4 | 20 | 9 | 6 | 19 |

Solution

| Series X | Series Y | Rank X ($R_1$) | Rank Y ($R_2$) | $D = R_1 - R_2$ | $D^2$ |
|---|---|---|---|---|---|
| 48 | 13 | 8 | 5.5 | 2.5 | 6.25 |
| 33 | 13 | 6 | 5.5 | 0.5 | 0.25 |
| 40 | 24 | 7 | 10 | -3 | 9.00 |
| 9 | 6 | 1 | 2.5 | -1.5 | 2.25 |
| 16 | 15 | 3 | 7 | -4 | 16.00 |
| 16 | 4 | 3 | 1 | 2 | 4.00 |
| 65 | 20 | 10 | 9 | 1 | 1.00 |
| 24 | 9 | 5 | 4 | 1 | 1.00 |
| 16 | 6 | 3 | 2.5 | 0.5 | 0.25 |
| 57 | 19 | 9 | 8 | 1 | 1.00 |
| Total | | | | | 41.00 |

Here $n = 10$, $\sum D^2 = 41$.

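[Remark: In the X series the value 16 is repeated thrice; the common rank given to it is 3, which is the average of 2, 3 and 4, i.e., (2+3+4)/3 = 3. In the Y series the values 6 and 13 are each repeated twice and receive the common ranks (2+3)/2 = 2.5 and (5+6)/2 = 5.5 respectively.]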

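Now, the adjustment factors are as follows.

For the X series (16 repeated thrice),

$$AF_1 = \frac{1}{12}(3^3 - 3) = 2$$

For the Y series (6 and 13 each repeated twice),

$$AF_2 = \frac{1}{12}(2^3 - 2) = 0.5, \qquad AF_3 = \frac{1}{12}(2^3 - 2) = 0.5$$

The coefficient of rank correlation is

$$\rho = 1 - \frac{6\left[\sum D^2 + AF_1 + AF_2 + AF_3\right]}{n^3 - n} = 1 - \frac{6[41 + 2 + 0.5 + 0.5]}{10^3 - 10} = 1 - \frac{6[44]}{990} = 1 - \frac{264}{990} = 1 - 0.2667$$

$$\rho = 0.7333$$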
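The treatment of repeated values (average ranks plus one adjustment factor per repeated value) can be sketched in Python as follows. The function names are illustrative and not part of the original text; the data are those of Example 5.6.

```python
from collections import Counter

def average_ranks(values):
    """Ascending ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(values)
    ranks = []
    for v in values:
        first = order.index(v) + 1           # lowest rank occupied by v
        m = order.count(v)                   # number of times v occurs
        ranks.append(first + (m - 1) / 2)    # mean of first..first+m-1
    return ranks

def adjustment(values):
    """Sum of (m^3 - m)/12 over the values of a series (zero when m = 1)."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values())

def spearman_with_ties(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    total = sum_d2 + adjustment(xs) + adjustment(ys)
    return 1 - 6 * total / (n ** 3 - n)

series_x = [48, 33, 40, 9, 16, 16, 65, 24, 16, 57]
series_y = [13, 13, 24, 6, 15, 4, 20, 9, 6, 19]

rho = spearman_with_ties(series_x, series_y)
print(round(rho, 4))                         # 0.7333
```

Values that occur only once contribute nothing to the adjustment, so the same function also covers the non-repeated case.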

Self – assessment Question

Obtain the rank correlation coefficient for the following data

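| Series X | 68 | 64 | 75 | 50 | 64 | 80 | 75 | 40 | 55 | 64 |
|---|---|---|---|---|---|---|---|---|---|---|
| Series Y | 62 | 58 | 68 | 45 | 81 | 60 | 68 | 48 | 50 | 70 |

[Hint: $n = 10$; $\sum D^2 = 72$; $\rho = 0.545$]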

5.4 Properties of Correlation Coefficient

o The value of r does not depend on which of the two variables under study is labeled X and which is labeled Y.
o The correlation coefficient lies between -1 and +1, i.e., $-1 \le r \le +1$.
o The correlation coefficient is independent of change of origin and scale.
o r = +1 if all $(X_i, Y_i)$ pairs lie on a straight line with positive slope, and r = -1 if all $(X_i, Y_i)$ pairs lie on a straight line with negative slope.
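These properties can be illustrated numerically. The following is a minimal Python sketch, with the function name `pearson_r` chosen for illustration and the pairs of Example 5.2 reused as sample data. The change of origin and scale uses arbitrary constants and assumes positive scale factors; a negative scale factor on one variable would reverse the sign of r.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient from deviations about the actual means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

X = [17, 19, 21, 26, 20, 28, 26, 29]
Y = [23, 27, 25, 26, 27, 25, 30, 33]

r = pearson_r(X, Y)
r_swapped = pearson_r(Y, X)                      # labels exchanged

# Change of origin (subtract a constant) and scale (divide by a positive one).
U = [(x - 23) / 2 for x in X]
V = [(y - 27) / 5 for y in Y]
r_transformed = pearson_r(U, V)

# Points lying exactly on a straight line.
r_up = pearson_r(X, [3 * x + 1 for x in X])      # positive slope
r_down = pearson_r(X, [10 - 2 * x for x in X])   # negative slope

print(r, r_swapped, r_transformed, r_up, r_down)
```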

5.5 REGRESSION ANALYSIS

Managers often make decisions by studying the relationship between variables, and process improvements can often be made by understanding how changes in one or more variables affect the process output. Regression analysis is a statistical technique in which we use observed data to relate a variable of interest, called the dependent (or response) variable, to one or more independent (or predictor) variables. The objective is to build a regression model, or prediction equation, that can be used to describe, predict and control the dependent variable on the basis of the independent variables. For example, a company might wish to improve its marketing process. After collecting data concerning the demand for a product, the product's price, and the advertising expenditures made to promote the product, the company might use regression analysis to develop an equation to predict demand on the basis of price and advertising expenditure. Predictions of demand for various price-advertising expenditure combinations can then be used to evaluate potential changes in the company's marketing strategies.

In the words of M.M. Blair,

Regression analysis is a “mathematical measure of average relationship between two or more variables in terms of the original unit of the data”.

5.6 Types of Regression Lines

A line of regression is the line that gives the best estimate of one variable for any given value of the other variable. We have two types of regression lines, namely,

o Regression line of X on Y
o Regression line of Y on X

First we will give the regression line of X on Y.

It is the line, which gives the best estimate for the values of X for a specified value of Y.

It is given by

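$$X - \bar{X} = b_{xy}(Y - \bar{Y})$$

where $b_{xy}$ is the regression coefficient of X on Y, which can be calculated using any one of the following formulas, depending upon the nature of the data:

$$b_{xy} = \frac{\sum xy}{\sum y^2}$$

where $x = X - \bar{X}$ and $y = Y - \bar{Y}$,

or

$$b_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_y^2 - (\sum d_y)^2}$$

where $d_x = X - A$, $d_y = Y - B$, and A, B are the assumed means,

or

$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y}$$

where r is the correlation coefficient and $\sigma_x$ and $\sigma_y$ are the standard deviations of the X and Y series.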

Now we give the regression line of Y on X.

It is the line, which gives the best estimate for the value of Y for a specified value of X.

It is given by

$$Y - \bar{Y} = b_{yx}(X - \bar{X})$$

where $b_{yx}$ is the regression coefficient of Y on X, which can be calculated using any one of the following formulas, depending upon the nature of the data:

$$b_{yx} = \frac{\sum xy}{\sum x^2}$$

where $x = X - \bar{X}$ and $y = Y - \bar{Y}$,

or

$$b_{yx} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_x^2 - (\sum d_x)^2}$$

where $d_x = X - A$, $d_y = Y - B$, and A, B are the assumed means,

or

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x}$$

where r is the correlation coefficient and $\sigma_x$, $\sigma_y$ are the standard deviations of the X and Y series.
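The deviation formulas and the assumed-mean formulas for the two regression coefficients give the same values, which the following minimal Python sketch checks. The function names are illustrative; the data are the pairs of Example 5.2, reused here purely as sample numbers, with assumed means A = 23 and B = 27.

```python
def coeffs_by_deviations(xs, ys):
    """(b_xy, b_yx) from deviations taken about the actual means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / syy, sxy / sxx

def coeffs_by_assumed_means(xs, ys, a, b):
    """(b_xy, b_yx) from deviations taken about assumed means A and B."""
    n = len(xs)
    dx = [x - a for x in xs]
    dy = [y - b for y in ys]
    num = n * sum(u * v for u, v in zip(dx, dy)) - sum(dx) * sum(dy)
    bxy = num / (n * sum(v * v for v in dy) - sum(dy) ** 2)
    byx = num / (n * sum(u * u for u in dx) - sum(dx) ** 2)
    return bxy, byx

X = [17, 19, 21, 26, 20, 28, 26, 29]
Y = [23, 27, 25, 26, 27, 25, 30, 33]

bxy_dev, byx_dev = coeffs_by_deviations(X, Y)
bxy_am, byx_am = coeffs_by_assumed_means(X, Y, a=23, b=27)
print(round(bxy_am, 4), round(byx_am, 4))        # 0.8571 0.4181
```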

5.7 CONSTRUCTION OF REGRESSION EQUATION

Example 5.7 The heights of a sample of 10 fathers and their eldest sons are given below (to the nearest cm).

| Height of Father (X) | 170 | 167 | 162 | 163 | 167 | 166 | 169 | 171 | 166 | 169 |
|---|---|---|---|---|---|---|---|---|---|---|
| Height of Son (Y) | 166 | 167 | 164 | 166 | 166 | 164 | 168 | 170 | 163 | 166 |

(i) Obtain the two regression equations.
(ii) Estimate the likely height of the father when the height of the son is 190 cm.
(iii) Estimate the likely height of the son when the height of the father is 160 cm.

Solution

| Height of Father (X) | Height of Son (Y) | $x = X - \bar{X}$ | $y = Y - \bar{Y}$ | $xy$ | $x^2$ | $y^2$ |
|---|---|---|---|---|---|---|
| 170 | 166 | 3 | 0 | 0 | 9 | 0 |
| 167 | 167 | 0 | 1 | 0 | 0 | 1 |
| 162 | 164 | -5 | -2 | 10 | 25 | 4 |
| 163 | 166 | -4 | 0 | 0 | 16 | 0 |
| 167 | 166 | 0 | 0 | 0 | 0 | 0 |
| 166 | 164 | -1 | -2 | 2 | 1 | 4 |
| 169 | 168 | 2 | 2 | 4 | 4 | 4 |
| 171 | 170 | 4 | 4 | 16 | 16 | 16 |
| 166 | 163 | -1 | -3 | 3 | 1 | 9 |
| 169 | 166 | 2 | 0 | 0 | 4 | 0 |
| Total: 1670 | 1660 | 0 | 0 | 35 | 76 | 38 |

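Here $n = 10$, $\sum X = 1670$, $\sum Y = 1660$.

$$\bar{X} = \frac{\sum X}{n} = \frac{1670}{10} = 167$$

$$\bar{Y} = \frac{\sum Y}{n} = \frac{1660}{10} = 166$$

From the table, $\sum xy = 35$, $\sum x^2 = 76$, $\sum y^2 = 38$.

$$b_{xy} = \frac{\sum xy}{\sum y^2} = \frac{35}{38} = 0.9211$$

$$b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{35}{76} = 0.4605$$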

(i) Regression line of X on Y

$$\begin{aligned}
X - \bar{X} &= b_{xy}(Y - \bar{Y}) \\
X - 167 &= 0.9211(Y - 166) \\
X - 167 &= 0.9211Y - 152.9026 \\
X &= 0.9211Y - 152.9026 + 167 \\
X &= 0.9211Y + 14.0974
\end{aligned}$$

Regression line of Y on X

$$\begin{aligned}
Y - \bar{Y} &= b_{yx}(X - \bar{X}) \\
Y - 166 &= 0.4605(X - 167) \\
Y - 166 &= 0.4605X - 76.9035 \\
Y &= 0.4605X - 76.9035 + 166 \\
Y &= 0.4605X + 89.0965
\end{aligned}$$

(ii) Given: height of son (Y) = 190 cm.

To estimate the height of the father (X), we use the X on Y equation.

$$\begin{aligned}
X &= 0.9211Y + 14.0974 \\
X &= 0.9211(190) + 14.0974 \\
X &= 175.009 + 14.0974 \\
X &= 189.1064 \text{ cm}
\end{aligned}$$

(iii) Given: height of father (X) = 160 cm.

To estimate the height of the son (Y), we use the Y on X equation.

$$\begin{aligned}
Y &= 0.4605X + 89.0965 \\
Y &= 0.4605(160) + 89.0965 \\
Y &= 73.68 + 89.0965 \\
Y &= 162.78 \text{ cm}
\end{aligned}$$
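The construction of the two regression lines and the two estimates can be reproduced with a short Python sketch. The function name `regression_lines` and the tuple layout are illustrative choices; the data are the heights of Example 5.7. Because the sketch keeps the coefficients unrounded, its estimates may differ from the hand computation in the second decimal place.

```python
def regression_lines(xs, ys):
    """Return ((b_xy, c_x), (b_yx, c_y)) for X = b_xy*Y + c_x and Y = b_yx*X + c_y."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    bxy, byx = sxy / syy, sxy / sxx
    # Each line passes through the point of means, which fixes its intercept.
    return (bxy, mx - bxy * my), (byx, my - byx * mx)

father = [170, 167, 162, 163, 167, 166, 169, 171, 166, 169]
son = [166, 167, 164, 166, 166, 164, 168, 170, 163, 166]

(bxy, c_x), (byx, c_y) = regression_lines(father, son)

father_est = bxy * 190 + c_x     # X on Y line: father's height for a son of 190 cm
son_est = byx * 160 + c_y        # Y on X line: son's height for a father of 160 cm
print(round(bxy, 4), round(byx, 4))              # 0.9211 0.4605
print(round(father_est, 2), round(son_est, 2))
```

Note that each estimate uses the line whose left-hand side is the variable being estimated: X on Y for the father, Y on X for the son.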

Self – Assessment Question

From the following data, obtain the two regression equations.

| Sales | 91 | 97 | 108 | 121 | 67 | 124 | 51 | 73 | 111 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|
| Purchase | 71 | 75 | 69 | 97 | 70 | 91 | 39 | 61 | 80 | 47 |

Also estimate the sales when the purchase is 90.

[Hint: $n = 10$; $\bar{X} = 90$, $\bar{Y} = 70$, $\sum xy = 3900$, $\sum x^2 = 6360$, $\sum y^2 = 2868$; $b_{xy} = 1.36$; $b_{yx} = 0.6132$; line of X on Y: X = 1.36Y - 5.2; line of Y on X: Y = 0.6132X + 14.812; estimated sales when the purchase is 90 = 117.2]

Example 5.8

Find the two lines of regression from the following data.

| Price at Mumbai (in Rs.) | 36 | 42 | 55 | 61 | 76 | 26 |
|---|---|---|---|---|---|---|
| Price at Chennai (in Rs.) | 15 | 36 | 24 | 26 | 15 | 14 |

Also estimate the likely price at Mumbai when the price at Chennai is Rs 60/-.

Solution

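| Price at Mumbai (X) | Price at Chennai (Y) | $d_x = X - A$ (A = 55) | $d_y = Y - B$ (B = 26) | $d_x d_y$ | $d_x^2$ | $d_y^2$ |
|---|---|---|---|---|---|---|
| 36 | 15 | -19 | -11 | 209 | 361 | 121 |
| 42 | 36 | -13 | 10 | -130 | 169 | 100 |
| 55 | 24 | 0 | -2 | 0 | 0 | 4 |
| 61 | 26 | 6 | 0 | 0 | 36 | 0 |
| 76 | 15 | 21 | -11 | -231 | 441 | 121 |
| 26 | 14 | -29 | -12 | 348 | 841 | 144 |
| Total: 296 | 130 | -34 | -26 | 196 | 1848 | 490 |

Here $n = 6$, $\sum X = 296$, $\sum Y = 130$, $\sum d_x = -34$, $\sum d_y = -26$, $\sum d_x d_y = 196$, $\sum d_x^2 = 1848$, $\sum d_y^2 = 490$.

$$\bar{X} = \frac{\sum X}{n} = \frac{296}{6} = 49.33$$

$$\bar{Y} = \frac{\sum Y}{n} = \frac{130}{6} = 21.67$$

$$b_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_y^2 - (\sum d_y)^2} = \frac{6(196) - (-34)(-26)}{6(490) - (-26)^2} = \frac{1176 - 884}{2940 - 676} = \frac{292}{2264} = 0.1290$$

$$b_{yx} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_x^2 - (\sum d_x)^2} = \frac{6(196) - (-34)(-26)}{6(1848) - (-34)^2} = \frac{292}{11088 - 1156} = \frac{292}{9932} = 0.0294$$

Regression line of X on Y

$$\begin{aligned}
X - \bar{X} &= b_{xy}(Y - \bar{Y}) \\
X - 49.33 &= 0.1290(Y - 21.67) \\
X - 49.33 &= 0.1290Y - 2.7954 \\
X &= 0.1290Y - 2.7954 + 49.33 \\
X &= 0.1290Y + 46.5346
\end{aligned}$$

Regression line of Y on X

$$\begin{aligned}
Y - \bar{Y} &= b_{yx}(X - \bar{X}) \\
Y - 21.67 &= 0.0294(X - 49.33) \\
Y - 21.67 &= 0.0294X - 1.4503 \\
Y &= 0.0294X - 1.4503 + 21.67 \\
Y &= 0.0294X + 20.2197
\end{aligned}$$

To estimate the likely price at Mumbai, we use the line of X on Y.

$$X = 0.1290Y + 46.5346$$

$$X = 0.1290(60) + 46.5346 = 7.74 + 46.5346 = 54.27$$

Hence the estimated price at Mumbai is Rs 54.27.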

Self – assessment Question

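| Age of Husband | 25 | 22 | 28 | 26 | 35 | 20 | 22 | 40 | 20 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|
| Age of Wife | 18 | 15 | 20 | 17 | 22 | 14 | 16 | 21 | 15 | 14 |

Hence estimate the age of husband when the age of wife is 19.

[Hint: $n = 10$; $\bar{X} = 25.6$, $\bar{Y} = 17.2$; $b_{xy} = 2.23$; $b_{yx} = 0.385$; X = 2.23Y - 12.76; Y = 0.385X + 7.34; age of husband (X) = 29.61]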

Example 5.9 Find the likely production corresponding to a rainfall of 40 cm from the following data.

| | Rainfall (in cm) | Output (in quintals) |
|---|---|---|
| Average | 30 | 50 |
| Standard Deviation | 5 | 10 |

Coefficient of correlation, r = 0.8

Let X and Y denote the rainfall and output respectively.

Given: $\bar{X} = 30$, $\bar{Y} = 50$, $\sigma_x = 5$, $\sigma_y = 10$, $r = 0.8$

Regression line of Y on X

$$Y - \bar{Y} = b_{yx}(X - \bar{X})$$

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = 0.8\left(\frac{10}{5}\right) = 1.6$$

$$\begin{aligned}
Y - 50 &= 1.6(X - 30) \\
Y - 50 &= 1.6X - 48 \\
Y &= 1.6X - 48 + 50 \\
Y &= 1.6X + 2
\end{aligned}$$

When rainfall X = 40 cm

$$Y = 1.6(40) + 2$$

$$Y = 66 \text{ quintals}$$
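When only the means, standard deviations and correlation coefficient are known, the regression line of Y on X follows directly from them. A minimal Python sketch of that computation is given below; the function name `line_y_on_x` is illustrative, and the inputs are the summary figures of Example 5.9.

```python
def line_y_on_x(mean_x, mean_y, sd_x, sd_y, r):
    """Slope and intercept of Y on X from summary statistics."""
    byx = r * sd_y / sd_x                  # regression coefficient of Y on X
    return byx, mean_y - byx * mean_x      # line passes through the means

slope, intercept = line_y_on_x(mean_x=30, mean_y=50, sd_x=5, sd_y=10, r=0.8)
output_at_40 = slope * 40 + intercept      # estimated output for 40 cm of rainfall
print(round(slope, 4), round(intercept, 4), round(output_at_40, 4))   # 1.6 2.0 66.0
```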

Self – Assessment Question

Estimate the most likely yield of paddy when the annual rainfall is 22 cm, other factors being assumed to remain the same.

| | Yield per hectare (in kg) | Annual Rainfall (in cm) |
|---|---|---|
| Mean | 973.5 | 18.3 |
| Standard Deviation | 38.4 | 2.0 |

Coefficient of correlation = 0.58

[Hint: Regression line of Y on X: Y = 11.136X + 769.71; for X = 22, yield (Y) = 1014.7 kg]

5.8 Properties of regression coefficients

1. There are two regression lines, namely X on Y and Y on X, and they always intersect at the point of means $(\bar{X}, \bar{Y})$.

2. If one regression coefficient is greater than unity, then the other must be less than unity.

3. The geometric mean of the regression coefficients is the correlation coefficient, i.e., $r = \pm\sqrt{b_{xy}\, b_{yx}}$, where r takes the sign common to the two coefficients.

4. Although the two regression equations are usually different, they become identical if $r = \pm 1$.

5. If r = 0, the regression lines are perpendicular to each other.
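Properties 1 and 3 can be verified numerically. The following minimal Python sketch uses the sums and means computed in Example 5.7 ($\sum xy = 35$, $\sum x^2 = 76$, $\sum y^2 = 38$, means 167 and 166); the variable names are illustrative only.

```python
from math import sqrt, copysign

sxy, sxx, syy = 35, 76, 38            # sums from the table of Example 5.7
mean_x, mean_y = 167, 166

bxy, byx = sxy / syy, sxy / sxx       # the two regression coefficients

# Property 3: r is the geometric mean of the coefficients, with their common sign.
r_from_coeffs = copysign(sqrt(bxy * byx), bxy)
r_direct = sxy / sqrt(sxx * syy)

# Property 1: solve X = bxy*Y + c_x and Y = byx*X + c_y simultaneously.
c_x = mean_x - bxy * mean_y
c_y = mean_y - byx * mean_x
y_star = (byx * c_x + c_y) / (1 - bxy * byx)
x_star = bxy * y_star + c_x

print(round(r_from_coeffs, 4), round(x_star, 4), round(y_star, 4))
```

Since the product of the coefficients equals $r^2$, it can never exceed 1, which is why both coefficients cannot be greater than unity at the same time (property 2).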

5.9. Difference between correlation and regression

| Correlation | Regression |
|---|---|
| 1. It is the degree of relationship between two or more variables or factors. | 1. It is the average relationship between two or more variables or factors. |
| 2. It is symmetric in x and y, i.e., $r_{xy} = r_{yx}$. | 2. The regression coefficients are not symmetric. |
| 3. The correlation coefficient does not reflect upon the nature of the variables (independent or dependent). | 3. Regression coefficients reflect the nature of the variables, i.e., which is dependent and which is independent. |
| 4. It does not imply a cause and effect relationship between the variables under study. | 4. It indicates the cause and effect relationship between the variables. The variable corresponding to the cause is taken as the independent variable, whereas the one corresponding to the effect is taken as the dependent variable. |
| 5. It is a relative measure and is independent of the units of measurement. | 5. Regression coefficients are absolute measures of the relationship between two or more variables. |
| 6. It indicates the degree of association. | 6. It is used to forecast the value of the dependent variable when the value of the independent variable is known. |

5.10 Application of Regression

o The cause and effect relations are indicated from the study of regression analysis.
o It establishes the rate of change in one variable in terms of the changes in another variable.
o It is useful in economic analysis, as a regression equation can determine the increase in the cost of living index for a particular increase in the general price level.
o It helps in prediction and thus can estimate the values of unknown quantities.
o It helps in determining the coefficient of correlation.
o It enables us to study the nature of the relationship between the variables.
o It can be useful in all natural, social and physical sciences, where the data are in functional relationship.

Chapter Summary

This chapter has discussed the simple correlation coefficient, the rank correlation coefficient and simple linear regression analysis, which relates a dependent variable to a single independent variable. We began by considering the simple linear regression model, which employs two parameters: the slope and the y-intercept. We next discussed how to compute the least squares point estimates of the parameters and how to construct the regression equations by using various methods. We also learned the difference between correlation and regression and the applications of regression analysis.
