
Regression

Date post: 22-Oct-2014
Upload: ioana-ion

7.1 Introduction

Quite often in business, changes in one or more variables appear to be related in some way to movements in one or several other variables. For example, a sales manager may observe that sales value changes when advertising expenditure changes, or a logistics manager may notice that maintenance expenses grow as cars and trucks are used more and the number of clients increases.

This chapter considers the problems of analysing the relationships between variables. Different types of scatter diagrams are depicted. Straight line equations are described, together with the method of calculating the least squares regression line. The uses and methods of calculating the coefficient of determination and the coefficient of correlation are described, and the development of confidence limits for the regression line is explained in detail. The chapter concludes with an explanation of the rank correlation coefficients as non-parametric measures of statistical association.

REGRESSION AND CORRELATION

Statistics for Business Administration

Certain questions may occur to the manager or analyst, such as the following:
1. Do the variables change in the same direction or in opposite directions?
2. Could changes in one variable be influencing, or be influenced by, movements in the other variable?
3. Is this an important relationship, or could apparently related movements come about purely by chance?
4. Could movements in two variables be related, not directly, but through movements in a third variable?
5. What is the importance of this knowledge for the business decision system?

On many occasions the manager or analyst is interested in predicting the value of one variable from other variables that are considered to influence it. For example, the quality control manager may want to know what the effect on the number of failures might be if the amount of expenditure on inspection were increased. The marketing manager may wish to predict market share if advertising costs were cut by 20%. Suppose that a manager has sensed that two variables are behaving in some 'related' way. How will the manager proceed to investigate the relation? A possible methodology might be as follows:
a) Observe and note what is happening in a systematic way.
b) Draw a scatter diagram of the data being observed.
c) Measure statistically the intensity of the relation and its significance, and describe the relation.
d) Use the result to improve decisions.

In the managerial process it is necessary to have statistical information that is as varied and comprehensive as possible, which can be used to measure the relations of independence or dependence between variables. Relationships between statistical variables or indicators can be observed in all economic activity: in production, between production indicators and those of efficiency and productivity, between resources and the results of their use, and between the results obtained and the investment plan.

Bivariate data involve two distinct categories of variables: independent and dependent. The independent variable is the variable that occurs randomly or is chosen freely, and it is usually denoted by x. The dependent variable occurs as a result of the variation of the independent variable, and it is usually denoted by y.

7.2 Categories of Relations between Variables

The relations that can be found between the x and y variables, modelled as y = f(x) + ε, allow us to characterize the direction of change, the intensity of change and the shape of the relation. The relations are classified as follows:
a. according to the direction of change we can have:
- direct relations, also called positive relations, meaning that a change in the independent variable induces a change of the dependent variable in the same direction: if x increases then y also increases, and if x decreases then y decreases
- opposite relations, also called negative relations, meaning that a change in the independent variable induces a change of the dependent variable in the opposite direction: if x increases then y decreases, and if x decreases then y increases
b. according to the intensity of the relation we can have:
- high intensity, strong, or tight relations, expressed by a high level of correlation between the variables
- medium intensity relations
- low intensity relations
c. according to the shape of the relation we can observe:
- linear relations
- non-linear relations, such as exponential growth, logarithmic decrease, etc.

d. according to the randomness involved, we have deterministic and probabilistic models

A deterministic model allows us to determine the value of the dependent variable exactly from the values of the independent variables. Such models represent relationships in the natural sciences. Example of a deterministic model:

$$E = mc^2,$$

where:
E – energy
m – mass
c – speed of light

For practical models we have to represent the randomness that is part of a real-life process. Such models are called probabilistic models. For a probabilistic model we add a random term (also called the error variable). The random term accounts for all the variables, measurable and immeasurable, that are not part of the model. The probabilistic first-order model is:

$$Y = \beta_0 + \beta_1 X + \varepsilon, \qquad (7.1)$$

where:
Y – dependent or explained variable;
X – independent or explanatory variable;
ε – random variable (error term);
β0, β1 – parameters.

Example 1: For the situation in Table 7-1 we are asked to characterize the relation between expenditure on inspection and defective parts delivered to the customer, for a company with ten operating plants of similar size producing small components.

Table 7-1 Costs and defective products recording

Observation number   Control costs per batch   Defective parts per batch of units
 1                   25                        50
 2                   30                        35
 3                   15                        60
 4                   75                        15
 5                   40                        46
 6                   65                        20
 7                   45                        28
 8                   24                        45
 9                   35                        42
10                   70                        22

We can deduce that there is likely to be an opposite relationship between the control cost and the number of defective parts delivered to the customer: the higher the cost, the fewer defective units are delivered. Based on this assumption, which is a form of hypothesis, the data can be graphed using a scatter diagram. The scatter diagram is a graphical display of the data constructed as follows:
- the horizontal or x axis is used for the values or classes of the independent variable, in this case expenditure;
- the vertical or y axis is used for the values or classes of the dependent variable, in this case defective parts delivered.

Figure 7.1 Scatter diagram based on the data in Example 1 (horizontal axis: cost per controlled batch; vertical axis: defective parts per delivered batch).

Figure 7.1 shows a clear drift downwards in defectives delivered as cost per batch increases. The scatter diagram shows:
- an opposite or negative relation, due to the negative slope
- a linear relation, due to the linear shape of the cloud of points (a linear equation can correctly model the relation because the points are close to a line)
- a medium intensity of the relation, because the points are not very tightly gathered

- an inelastic relation, due to the fact that the slope is almost 45°

Sometimes other possibilities exist, ranging from a perfect negative or perfect positive relationship to no discernible relationship. A perfect relationship is one where a single straight line can be drawn through all the points, as in Figures 7.2 and 7.3.

Figure 7.2 Perfect positive relationship

Figure 7.3 Perfect negative relationship

7.3 Simple Regression Model

Regression is a statistical method providing a mathematical description of the statistical relations between variables. The purposes of the regression techniques are:
- to describe a statistical relation in mathematical terms,
- to estimate the value of the dependent variable given the value of the independent variable, and
- to compare statistical relations between variables for two companies, countries or regions.

Regression is concerned with obtaining a mathematical function describing the statistical relation between variables. If the relation is between one dependent and one independent variable, we have simple regression; if the statistical relation is between one dependent and two or more independent variables, we have multiple regression. This section deals only with simple regression techniques. According to the mathematical function modelling the relation between the variables, we can distinguish linear and non-linear equations.

7.3.1 Simple Linear Regression Model

Simple linear regression model:

$$Y = \beta_0 + \beta_1 X + \varepsilon \qquad (7.2)$$

The main attributes of linear regression, which models the relation between two variables using a first-degree equation, are:
a. It is a useful means of forecasting when the data have a generally linear relationship. Over operational ranges linearity (or near linearity) is often assumed for such items as costs, contributions and sales.
b. A measure of the accuracy of fit (R², the coefficient of determination, or r, the coefficient of correlation) can easily be calculated for any linear regression line.
c. To have confidence in the calculated regression relationship it is preferable to have a large number of observations.
d. With further analysis, confidence limits can be calculated for forecasts produced by the regression formula.
e. Any form of extrapolation, including that based on regression analysis, must be done with great caution, also taking other forecasting techniques into account. Outside the range of observed values, relationships and conditions may change drastically.
f. Regression is not an adaptive forecasting system, i.e. it is not suitable for incorporation in, say, a stock control system where the requirement would be for a forecasting system that automatically produces forecasts which adapt to current market conditions.
g. In many circumstances it is not sufficiently accurate to assume that y depends on only one independent variable, as discussed above for simple linear regression. Frequently, a particular value depends on two or more factors, in which case multiple regression analysis is employed.

For example, an analysis of a firm might produce the following multiple regression equation:

Overheads (EUROS) = 10800 + 6.9x + 7.2y + 3.7z,

where:
x: labour hours worked
y: machine hours
z: production volume (tonnage)

7.3.2 Least Squares Method

To define the relationship between Y and X we need to know the values of the coefficients of the linear model, β0 and β1 (the population parameters). We have to estimate these parameters from a sample of n observations. The estimated linear regression is:

$$\hat{y}_i = b_0 + b_1 x_i, \quad i = 1, \dots, n \qquad (7.3)$$

Usually the estimators for the parameters of the regression line are obtained by using the least squares method. To find the line of best fit mathematically it is necessary to calculate the line that minimizes the total of the squared deviations of the actual observations from the calculated line. This is known as the method of least squares or the least squares method of linear regression.

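$$S = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \rightarrow \min$$

$$\begin{cases} \dfrac{\partial S}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0 \\[2ex] \dfrac{\partial S}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0 \end{cases}$$

so,

$$\begin{cases} n b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \\[1ex] b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \end{cases} \qquad (7.4)$$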

By solving the system of equations we obtain the values of b0 and b1, and we can then calculate the value of the regression equation for each value of the x variable. These values are also called the theoretical values of the y variable depending on x, and the operation of replacing the actual terms with the values of the regression equation (theoretical values) is called adjustment computation.

The parameter b0 represents the fixed element and b1 is the slope of the line, i.e. the change in the mean value of y per unit change in x. The two parameters b0 and b1 have the character of means, and they have to be representative of most of the values from which they were calculated.

The b0 parameter, called the intercept, has the character of a mean in the sense that it shows the level the y characteristic would reach if all the factors exerted a constant action on its formation. In that case, the individual values of the dependent variable would all be equal to each other, and therefore equal to their mean.

The b1 parameter, called the regression coefficient, expresses geometrically the slope of the straight line. It measures the average change in the y variable when the x variable increases by one unit. Moreover, the regression coefficient shows the direction of the relation:
• If b1 > 0, the relationship is positive.
• If b1 < 0, the relationship is negative.
• If b1 = 0, the two variables are unrelated and $\hat{y}_x = b_0$, so the value of the regression equation equals the mean value of the dependent variable ($\hat{y}_x = \bar{y}$).

The use of these equations will be demonstrated using the Example 1 data contained in Table 7-1. The equations become:

10 b0 + 424 b1 = 363
424 b0 + 21,926 b1 = 12,815

Solving gives b0 = 63.97 and b1 = -0.65 to 2 decimal places. Therefore, the regression line for Example 1 is:

y = 63.97 – 0.65 x

Note: the normal equations automatically produce the sign (+ or -) of the regression coefficient b1; in this case, minus.

The calculated values can be used to draw the mathematically correct line of best fit on a graph. This is usually done by plotting three values of x: the lowest, the highest and the mean. For Example 1 the three values of x are 15, 42.4 and 75. Each of these values is substituted into the calculated regression line and the resulting values are plotted on the graph.

Note: The values of b0 and b1 have been calculated in the example above by substituting in the normal equations. An alternative is to transpose the normal equations so as to be able to find b0 and b1 directly. The formulae are as follows:

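$$b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n} = \bar{y} - b_1 \bar{x} \qquad (7.5)$$

$$b_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} \qquad (7.6)$$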

It is often more convenient to use this alternative form, especially when using a calculator. The values of b0 and b1 are re-calculated using the transposed formulae and the Table 7-1 data.

$$b_1 = \frac{10 \times 12{,}815 - 424 \times 363}{10 \times 21{,}926 - (424)^2} = -0.652467 \approx -0.65$$

$$b_0 = \frac{363}{10} - (-0.652467) \times \frac{424}{10} = 63.97$$
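The calculation with the transposed formulae (7.5) and (7.6) can be sketched in a few lines of Python. This is only an illustrative check using the Table 7-1 data; the function name `least_squares` is chosen here and does not come from the text.

```python
# Table 7-1 data: control cost per batch (x) and defective parts per batch (y).
x = [25, 30, 15, 75, 40, 65, 45, 24, 35, 70]
y = [50, 35, 60, 15, 46, 20, 28, 45, 42, 22]


def least_squares(x, y):
    """Return (b0, b1) from the transposed normal equations (7.5) and (7.6)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = least_squares(x, y)
print(f"b0 = {b0:.2f}, b1 = {b1:.4f}")
```

The slope agrees with the -0.652467 obtained above. The unrounded intercept comes out very close to, though not exactly at, the 63.97 quoted in the text; the small difference appears to be a matter of rounding.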

Figure 7.4 Calculated line of best fit (horizontal axis: inspection costs per batch; vertical axis: defective parts per delivered batch).

For any set of bivariate data, a least squares regression line always passes through the mean point (x̄, ȳ) of the data.

7.3.3 Using the Results of the Simple Regression Analysis

When the values of b0 and b1 have been calculated, predictions or forecasts can be made for values of x that have not yet occurred. The predictions can be read from the graph on which the line of best fit has been plotted (Figure 7.4), or the values can be inserted into the straight-line formula. Reverting to Example 1, it will be recalled that the manager wished to know the likely number of defects if 50 parts per 1000 was spent on inspection.

From Figure 7.4 it will be seen that the number of defects would be about 31 per 1000. The formula can also be used: y = 63.97 – 0.65x, so when x is 50:

y = 63.97 – 0.65 (50) = 31.47

Thus the manager would conclude that, on average, 31.47 defects per 1000 would be found if 50 parts per 1000 was spent on inspection. Predictions should be given only if the result has an economic meaning. The fact that an x value can be substituted into the regression line does not necessarily mean that we have obtained a practically useful forecast. The predicted value is just a single point that needs to be qualified by the use of confidence intervals.

7.3.4 Quality of the Regression Line. Regression Line Standard Error

The accuracy of the regression line can be measured with the standard error of the regression. This measure is also used in inference about the regression parameters b0 and b1 and to construct their confidence intervals. Inference concerning these estimates can be made using the t significance test or by constructing confidence intervals. In both cases we need to calculate the standard error, denoted by Se:

$$S_e = \sqrt{\frac{\sum y_i^2 - b_0 \sum y_i - b_1 \sum x_i y_i}{n - 2}} \qquad (7.7)$$

The above formula provides an estimate of the standard error because it uses the regression line values b0 and b1, which are themselves estimates. This is why it is also called the residual standard deviation. For the example concerning the defective parts, the standard error is computed as follows:

$$S_e = \sqrt{\frac{15{,}123 - 63.97 \cdot 363 - (-0.65) \cdot 12{,}815}{10 - 2}} = 5.76 \text{ defective parts}$$

This value is used to set the confidence limits for an individual value prediction or for the whole regression line.
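As a rough numerical check of formula (7.7), the sketch below recomputes Se from the Table 7-1 data. It assumes the unrounded least squares coefficients are used in the formula; the variable names are illustrative only.

```python
import math

# Table 7-1 data: control cost per batch (x) and defective parts per batch (y).
x = [25, 30, 15, 75, 40, 65, 45, 24, 35, 70]
y = [50, 35, 60, 15, 46, 20, 28, 45, 42, 22]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)

# Unrounded least squares estimates, formulas (7.5) and (7.6).
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

# Formula (7.7): residual sum of squares divided by n - 2 degrees of freedom.
residual_ss = sum_y2 - b0 * sum_y - b1 * sum_xy
se = math.sqrt(residual_ss / (n - 2))
print(f"Se = {se:.2f}")
```

The numerator of (7.7) is a small difference between large terms, so substituting the rounded values 63.97 and -0.65 can give a noticeably different figure; carrying full precision reproduces the 5.76 stated above.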

The line of best fit y = b0 + b1x is an average line which passes through the mean point (x̄, ȳ), and any estimate based on it is a point estimate of a mean value. The confidence limits for the whole of the regression line are calculated by using a quantity known as the standard error of the average forecast, given by:

$$S_{ef} = S_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}} \qquad (7.8)$$

7.3.5 Constructing the Confidence Interval

The confidence interval is constructed in exactly the same way as that for a mean or for a proportion. In this case, since the number of observations is 10, the t distribution is used with 10 - 2 = 8 degrees of freedom. The interval is calculated by estimating the fitted value of y for each value of x in the original data using the equation y = b0 + b1x. The interval then takes the form:

$$y \pm S_{ef} \times t \qquad (7.9)$$

Given that (based on Example 1):

b0 = 63.97
b1 = -0.65
Se = 5.76

$$S_{ef} = S_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}} = 5.76 \times \text{(value from Table 2 above)}$$

and t = 2.306 for 8 degrees of freedom and a 95% confidence interval, the confidence interval can now be calculated as follows:
- when x = 15, y = 63.97 – 0.65 (15) = 54.2 and the limits round this estimate are 54.2 ± 7.15, which gives an upper limit of 61.35 and a lower limit of 47.05;
- when x = 24, y = 48.37 ± 5.72, giving an upper limit of 54.09 and a lower limit of 42.65.
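The two intervals above can be reproduced, at least approximately, with the short sketch below. It takes the summary figures quoted in the text as given constants (n = 10, Σx = 424, Σx² = 21,926, Se = 5.76, t = 2.306); the function name `mean_forecast_interval` is a hypothetical helper, not something defined in the source.

```python
import math

# Summary figures quoted in the text for Example 1.
N, SUM_X, SUM_X2 = 10, 424, 21926
B0, B1, SE = 63.97, -0.65, 5.76
T_95_8DF = 2.306  # tabulated t value for 8 degrees of freedom, 95% level


def mean_forecast_interval(x0):
    """Return (fitted y, half-width) of the interval for the mean of y at x0."""
    x_bar = SUM_X / N
    sxx = SUM_X2 - SUM_X ** 2 / N
    s_ef = SE * math.sqrt(1 / N + (x0 - x_bar) ** 2 / sxx)  # formula (7.8)
    return B0 + B1 * x0, T_95_8DF * s_ef


for x0 in (15, 24):
    fitted, half_width = mean_forecast_interval(x0)
    print(f"x = {x0}: {fitted:.2f} +/- {half_width:.2f}")
```

The half-width grows as x moves away from the mean of 42.4, which is why the interval at x = 15 is wider than the one at x = 24.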

When making an individual value prediction for y, it is necessary for technical reasons to amend the previous formula of the standard error, obtaining the standard error of the individual forecast:

$$S_{ef} = S_e \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}} \qquad (7.10)$$

Using the data in our example and, for instance, the x value 45, we obtain

$$S_{ef} = 5.76 \sqrt{1 + \frac{1}{10} + \frac{(45 - 42.4)^2}{21{,}926 - \frac{424^2}{10}}} = 6.04.$$
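Formula (7.10) can be checked in the same way. The sketch below again treats the rounded figures from the text as inputs, so the last decimal may differ slightly from the printed values; `individual_forecast_se` is an illustrative name.

```python
import math

# Summary figures quoted in the text for Example 1.
N, SUM_X, SUM_X2 = 10, 424, 21926
SE = 5.76
T_95_8DF = 2.306  # tabulated t value for 8 degrees of freedom, 95% level


def individual_forecast_se(x0):
    """Standard error of an individual forecast at x0, formula (7.10)."""
    x_bar = SUM_X / N
    sxx = SUM_X2 - SUM_X ** 2 / N
    return SE * math.sqrt(1 + 1 / N + (x0 - x_bar) ** 2 / sxx)


s_ind = individual_forecast_se(45)
fitted = 63.97 - 0.65 * 45
lower = fitted - T_95_8DF * s_ind
upper = fitted + T_95_8DF * s_ind
print(f"S_ef = {s_ind:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

The extra "1 +" under the square root is what makes the individual interval so much wider than the interval for the mean at the same x.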

When x = 45 and y = 34.72, the individual confidence interval is 34.72 ± 2.306 × 6.04, with a lower limit of 20.79 and an upper limit of 48.65. These limits differ from those computed previously because, when an individual prediction of y is made, the confidence intervals are much wider.

7.3.6 Standard Errors for the Parameters b0 and b1

Since b0 and b1 are computed from sample data, they can be considered as estimates (statistics) of the population intercept, denoted by β0, and the population slope, denoted by β1. Under repeated sampling, the mean of the b0 values is β0, the population intercept, and the mean of the b1 values is β1, the population slope. The standard deviations are:

$$S_a = S_e \cdot \sqrt{\frac{\sum x_i^2}{n \sum x_i^2 - \left(\sum x_i\right)^2}}, \qquad (7.11)$$

where Se = standard error of regression.

The confidence intervals for β0 and β1 are obtained as follows:
- for the intercept: b0 ± t × Sa
- for the slope: b1 ± t × Sb

where:

$$S_b = \frac{S_e}{\sqrt{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}}$$

and t is the value of the t statistic corresponding to n - 2 degrees of freedom at the chosen probability, which gives the confidence level. In addition we construct a significance test for β0 and β1:
- For the intercept:

H0: β0 = chosen value
H1: β0 ≠ chosen value

The test statistic is the t test:

$$t = \frac{b_0 - \beta_0}{S_a} \qquad (7.12)$$

- For the slope:

H0: β1 = 0
H1: β1 ≠ 0

The test statistic is the t test:

$$t = \frac{b_1 - \beta_1}{S_b} \qquad (7.13)$$

For Example 1, t = (-0.65 - 0) / 0.092 = -7.07, so |t| = 7.07.
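The slope test can be sketched numerically as follows. The inputs are the rounded figures quoted in the text, and the critical value 2.306 is the tabulated two-sided 95% value for 8 degrees of freedom; the variable names are assumptions made for this illustration.

```python
import math

# Summary figures quoted in the text for Example 1.
N, SUM_X, SUM_X2 = 10, 424, 21926
B1, SE = -0.65, 5.76
T_CRIT = 2.306  # tabulated t value for 8 degrees of freedom, 95% level

# Standard error of the slope: Se divided by sqrt(sum(x^2) - (sum x)^2 / n).
sxx = SUM_X2 - SUM_X ** 2 / N
s_b = SE / math.sqrt(sxx)

# Test statistic (7.13) under H0: beta1 = 0.
t_stat = (B1 - 0) / s_b
reject_h0 = abs(t_stat) > T_CRIT
print(f"S_b = {s_b:.3f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```

Because the test is two-sided, only the absolute value of t is compared with the critical value; the sign simply reflects the negative slope.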

Since 7.07 > 2.306, H0 can be rejected. On the basis of this evidence the regression equation y = 63.97 – 0.65 x can be used as a basis of prediction for Example 1.

7.4 Non-linear Regression Models

There are many occasions when the relationship between variables cannot be adequately described by linear functions, whether they use a single independent variable or several. In such circumstances some form of non-linear or curvilinear model is likely to be more suitable, and the following paragraphs describe some commonly encountered non-linear models.

The exponential function

The exponential function takes the form:

$$y = a b^x$$

where y is the dependent variable, a and b are constants and x denotes the independent variable.

Linear form of the exponential function

The exponential function can be reduced to linear form by taking the logarithm of the function, thus:

log y = log a + x log b

or

log y = A + Bx, where A = log a and B = log b.

The similarity between this expression and the linear regression line previously discussed will be apparent. An interesting feature of the log form of the exponential function is that it is equivalent to fitting a straight line to a graph drawn on semi-logarithmic graph paper (i.e. a logarithmic scale on the vertical axis and an ordinary arithmetic scale on the horizontal axis).

Logarithmic functions

An alternative non-linear function is known as a logarithmic function, which has the form:

$$y = a x^b,$$

where y denotes the variable to be predicted, a and b are constants and x denotes the time periods. As with the exponential function, this function can be expressed in a linear form using logarithms, thus:

log y = log a + b log x

In this function y is said to be a logarithmic function of x. This function is equivalent to fitting a straight line to a graph drawn on log-log paper (i.e. both horizontal and vertical scales being logarithmic).
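To illustrate the linearisation of the exponential function, the sketch below fits a straight line to log y against x and converts the result back to a and b. The data are hypothetical, generated exactly from y = 2 × 1.5^x purely for demonstration, and `fit_line` is a helper name introduced here.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept and slope for ys regressed on xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


# Hypothetical data generated exactly from y = 2 * 1.5**x (not from the text).
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * 1.5 ** x for x in xs]

# Regress log y on x: log y = A + B x, with A = log a and B = log b.
A, B = fit_line(xs, [math.log10(y) for y in ys])
a_hat, b_hat = 10 ** A, 10 ** B
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

For the form y = ax^b the same helper can be used by regressing log y on log x instead; the slope is then b itself and the intercept is log a.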

The hyperbolic curve

This is another type of non-linear curve and takes the form

$$y = a + \frac{b}{x}$$

The values of a and b are calculated by reference to amended formulas:

$$b = \frac{n \sum \left(\frac{1}{x}\right) y - \sum \left(\frac{1}{x}\right) \sum y}{n \sum \left(\frac{1}{x}\right)^2 - \left(\sum \frac{1}{x}\right)^2}$$

$$a = \frac{\sum y}{n} - b \, \frac{\sum \frac{1}{x}}{n}$$

Data have been kept for 10 orders showing the variation in unit cost against order volume for 10 clients, as shown in Table 7-2:

Table 7-2 Relation between order size and unit costs

Client number   Order volume x   Unit cost y
 1              10               150
 2              11               127
 3              12               123
 4              13               117
 5              14               110
 6              15               107
 7              17               104
 8              18               101
 9              19                97
10              20                95

These data have been graphed in Figure 7.5, and the graph suggests that the hyperbolic curve might be appropriate for predicting the unit cost of an order of 22 units. What is the predicted cost?

Figure 7.5 Unit cost and order size relation (horizontal axis: order volume; vertical axis: unit cost)

Solution: The calculations for the least squares line of best fit are shown in Table 7-3.

Table 7-3

Observation   1/x     y       (1/x)²    (1/x)y
 1            0.100   150     0.0100    15.000
 2            0.090   127     0.0083    11.545
 3            0.083   123     0.0069    10.250
 4            0.077   117     0.0059     9.000
 5            0.071   110     0.0051     7.857
 6            0.067   107     0.0044     7.133
 7            0.059   104     0.0035     6.118
 8            0.056   101     0.0031     5.611
 9            0.053    97     0.0027     5.105
10            0.050    95     0.0025     4.750
Total         0.706   1,131   0.0524    82.369

$$b = \frac{10 \times 82.369 - 0.706 \times 1131}{10 \times 0.0524 - (0.706)^2}$$

b = 985.92

$$a = \frac{1{,}131}{10} - 985.92 \times \frac{0.706}{10}$$

a = 43.49
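The arithmetic above can be reproduced with a short sketch that plugs the rounded totals of Table 7-3 into the amended formulas. The variable names and the helper `predict_unit_cost` are illustrative and not part of the original text.

```python
# Totals from Table 7-3, rounded as printed in the text.
n = 10
sum_inv_x = 0.706      # sum of 1/x
sum_y = 1131           # sum of y
sum_inv_x2 = 0.0524    # sum of (1/x)^2
sum_inv_x_y = 82.369   # sum of (1/x) * y

# Amended least squares formulas for y = a + b / x.
b = (n * sum_inv_x_y - sum_inv_x * sum_y) / (n * sum_inv_x2 - sum_inv_x ** 2)
a = sum_y / n - b * sum_inv_x / n


def predict_unit_cost(order_volume):
    """Unit cost predicted by the fitted hyperbolic curve."""
    return a + b / order_volume


print(f"a = {a:.2f}, b = {b:.2f}, cost at 22 units = {predict_unit_cost(22):.2f}")
```

The denominator is a small difference between two rounded quantities, so the coefficients are sensitive to rounding: working from unrounded reciprocals of x can shift a and b noticeably, even though the fitted curve stays similar over the observed range.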

Thus the hyperbolic function is:

$$y = 43.49 + \frac{985.92}{x}$$

The calculated least squares line can now be fitted on the graph using the values calculated from the hyperbolic function, shown in Table 7-4.

Table 7-4

x    a + b/x               Value of y
10   43.49 + 985.92 / 10   142.08
11   43.49 + 985.92 / 11   133.12
12   43.49 + 985.92 / 12   125.65
13   43.49 + 985.92 / 13   119.33
14   43.49 + 985.92 / 14   113.91
15   43.49 + 985.92 / 15   109.22
17   43.49 + 985.92 / 17   101.49
18   43.49 + 985.92 / 18    98.26
19   43.49 + 985.92 / 19    95.38
20   43.49 + 985.92 / 20    92.79

These values are plotted on Figure 7.6.

Figure 7.6 Scatter diagram with fitted hyperbolic curve (horizontal axis: order volume; vertical axis: unit cost)

The same information can be reproduced in a linear form where the x axis is defined as 1/x. The question posed in the problem, namely what the unit cost is for an order size of 22 units, can be answered from one of the graphs or by direct calculation.

On the assumption that the known relationship between x and y continues beyond the observed range, the unit cost for an order size of 22 is:

$$y = 43.49 + \frac{985.92}{22} = 88.30$$

Learning curves

Forecasting is concerned with what we anticipate will happen in the future. Unthinking extrapolation of past conditions is unlikely to produce good forecasts. If we are aware of an expected change in conditions in the future, this must be taken into account when preparing the finalised forecast. A particular example of this relates to what are known as learning curves, which are a practical application of a non-linear function. The learning curve depicts the way people learn by doing a task and are therefore able to complete the task more quickly the next time they attempt it. Learning is rapid in the early stages and the rate gradually declines until a sufficient number of units or tasks have been completed, when the time taken becomes constant.

The main practical application is concerned with direct labour times and costs. Cost predictions, especially those relating to direct labour costs, should allow for the effects of the learning process. During the early stages of producing a new part or carrying out a new process, experience and skill are gained, productivity increases and the time taken per unit falls. Studies have shown that there is a tendency for the time per unit to reduce at some constant rate as production mounts. For example, an 80% learning curve means that as cumulative production quantities double, the average time per unit falls by 20%.

This is shown in Table 7-5:

Table 7-5 Illustration of an 80% Learning Curve

Cumulative number of clients   Cumulative time taken (mins)   Average time per client
 20                              400                          20
 40                              640                          16 (20 × 80%)
 80                            1,024                          12.8 (20 × 80% × 80%)
160                            1,638.4                        10.24 (20 × 80% × 80% × 80%)

The learning curve is a non-linear function with the general form:

$$y = a x^b$$

where:
y: average labour hours for a client
a: number of labour hours for the first client
x: cumulative number of clients
b: the learning coefficient

The learning coefficient is calculated as follows:

$$b = \frac{\log(1 - \text{proportionate decrease})}{\log 2}$$

thus for a 20% decrease (i.e. an 80% learning curve)

$$b = \frac{\log(1 - 0.2)}{\log 2} = \frac{\bar{1}.90309}{0.30103} = -0.322$$

Note: It will be remembered from mathematics that the log of 0.8 is conventionally written as $\bar{1}.90309$ but is actually -1 + 0.90309, i.e. -0.09691, which, divided by 0.30103, gives -0.322.

Having established the values for the function, it can be used to find the expected labour time per unit. For example, with an 80% learning curve and a time of 10 minutes for the first client, what is the expected time per client when the cumulative number of clients is 20? Using the function we obtain:

$$y = a x^b = 10 \times 20^{-0.322} = 3.812 \text{ mins}$$
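The learning coefficient and the worked example can be checked with a minimal sketch. The function names below are illustrative; the learning rate is passed as 0.80 for an 80% curve, i.e. one minus the proportionate decrease.

```python
import math


def learning_coefficient(learning_rate):
    """b = log(1 - proportionate decrease) / log 2, where learning_rate = 1 - decrease."""
    return math.log10(learning_rate) / math.log10(2)


def average_time(first_unit_time, cumulative_units, learning_rate):
    """Average time per unit from y = a * x**b."""
    return first_unit_time * cumulative_units ** learning_coefficient(learning_rate)


b = learning_coefficient(0.80)
y_20 = average_time(10, 20, 0.80)
y_40 = average_time(10, 40, 0.80)
print(f"b = {b:.3f}, average time at 20 clients = {y_20:.3f} mins")
print(f"ratio after doubling = {y_40 / y_20:.2f}")
```

The last line illustrates the defining property of the curve: each doubling of the cumulative quantity multiplies the average time by the learning rate.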

Note: Whilst it is clear that learning does take place and that average times are likely to reduce, in practice it is highly unlikely that there will be a regular, consistent rate of decrease as exemplified above. Accordingly, any cost predictions based on conventional learning curves should be used cautiously, as with any other forecasting method.

Linear transformation of learning curve

An alternative method of calculating the learning curve coefficient uses the linear transformation formed by taking the logarithm of the function, thus:

log y = log (ax^b)
log y = log a + b log x

This will be recognised as a transformation to the general linear form

y = a + bx

If X stands for log x and Y stands for log y, then the standard formulae for a and b become

$$b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - \left(\sum X\right)^2}$$

$$\log a = \frac{\sum Y}{n} - b \, \frac{\sum X}{n}$$

The above formulae are illustrated using the data in the previous paragraph, carried from Table 7-5 into Table 7-6.

Table 7-6

Cumulative number of clients (x)   Cumulative time   Average time per client (y)
 20                                  400             20
 40                                  640             16
 80                                1,024             12.8
160                                1,638.4           10.24

The logarithms of the cumulative number of clients, x, and the average serving time, y, are used to find the values for the formulae above and are shown in Table 7-7.

Table 7-7

X (i.e. log x)   Y (log y)   X² ((log x)²)   XY (log x · log y)
1.30103          1.30103     1.69268         1.69268
1.60206          1.20412     2.56659         1.92907
1.90309          1.10721     3.62175         2.10712
2.20412          1.01030     4.85815         2.22682

∑X = 7.01030   ∑Y = 4.62266   ∑X² = 12.73917   ∑XY = 7.95569

These values are inserted into the formula

$$B = \frac{4 \times 7.95569 - 7.01030 \times 4.62266}{4 \times 12.73917 - 7.01030^2} = -0.3223$$

This will be seen to be the same value as calculated above. The learning curve thus has the form:

$$y = a x^{-0.3223}$$

For completeness the value of a is calculated. This represents the number of labour hours for the first client. This was not one of the observed values, which started at 20 clients, so the value represents the theoretical time for the first client, given the relationships found for the observed range of 20 to 160 clients. Using the formula given above

$$\log a = \frac{4.62266}{4} - (-0.3223) \times \frac{7.01030}{4}$$

log a = 1.72052, and finding the antilog gives a = 52.546. The full learning curve formula is thus:

$$y = 52.546 \, x^{-0.3223}$$

The value of 52.546 for the first unit can be checked by inserting one of the observed values, say 20 clients, and confirming that the calculated time agrees with the observed average time of 20. To find the value of $y = 52.546 \times 20^{-0.3223}$ using logarithms we can compute:

Number                       Log
20                           1.30103
20^(-0.3223)                 -0.3223 × 1.30103 = -0.41932 (written $\bar{1}.58068$ in log-table notation)
52.546                       1.72052
52.546 × 20^(-0.3223)        1.72052 - 0.41932 = 1.30120

The antilog of 1.30120 is approximately 20.01, which is almost exactly the observed value of 20.
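The log-linear fit above can be sketched in a few lines of Python. This is only an illustration of the same least squares formulae applied to the Table 7-6 data; the function name `fit_learning_curve` is hypothetical and not part of the original text. Carried at full floating-point precision, the fit gives b of about -0.3219 and a of about 52.46, marginally different from the hand-calculated -0.3223 and 52.546, which are affected by rounding.

```python
import math

# Table 7-6: cumulative number of clients (x) and average time per client (y)
clients = [20, 40, 80, 160]
avg_time = [20, 16, 12.8, 10.24]


def fit_learning_curve(x_values, y_values):
    """Fit y = a * x**b by least squares on X = log10(x), Y = log10(y)."""
    X = [math.log10(x) for x in x_values]
    Y = [math.log10(y) for y in y_values]
    n = len(X)
    sum_x, sum_y = sum(X), sum(Y)
    sum_xx = sum(v * v for v in X)
    sum_xy = sum(u * v for u, v in zip(X, Y))
    # Slope and intercept of the transformed (linear) model
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    log_a = sum_y / n - b * sum_x / n
    return 10 ** log_a, b


a, b = fit_learning_curve(clients, avg_time)
predicted_at_20 = a * 20 ** b  # check against the observed average time of 20
print(f"a = {a:.3f}, b = {b:.4f}, y(20) = {predicted_at_20:.2f}")
```

Because the sample data follow an exact power law, the fitted curve reproduces the observed value at 20 clients.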

7.5 Multiple Regression Models

This section shows the development of a multiple regression model and how the closeness of fit is measured by the coefficient of multiple determination. Various non-linear models, such as the exponential, logarithmic and hyperbolic functions, are explained and exemplified, and the chapter concludes with an analysis of learning curves.

There will be occasions when the simple model, $y = b_0 + b_1 x$, will not be considered satisfactory, meaning that the simple linear model is not a good enough predictor. In such circumstances there are two possible courses of action:

a. To investigate the possibility that movements in y, the dependent variable, depend on several independent variables and not just one as in the basic model. For example, changes in demand for a product may depend on:
• the price of the product
• the price of substitutes
• the level of incomes
• consumer tastes, and so on.
If linearity can be assumed then a linear multiple regression model can be used. These models are dealt with in the first part of the chapter.

b. Alternatively, a non-linear model may be considered more appropriate; several of the more important non-linear functions are dealt with later in the chapter.

A model which incorporates several independent variables is known as a multiple regression model. Because of the lengthy nature of the calculations it would be unlikely that a detailed question on multiple regression would appear in the examinations for which this manual is intended. Familiarity with the processes involved and the structure of the model is, however, necessary. The development of this model is shown below.

The basic two variable model (one dependent and one independent variable) is:

$$y = b_0 + b_1 x$$

which can be solved using the Normal equations thus:

$$\sum y = b_0 n + b_1 \sum x$$
$$\sum xy = b_0 \sum x + b_1 \sum x^{2}$$

From this can be developed models with more than two variables, and this is illustrated below using a three variable model (one dependent and two independent variables: $y$, $x_1$ and $x_2$).

$$y = b_0 + b_1 x_1 + b_2 x_2 \qquad (7.14)$$

In the case of mass phenomena the resulting variable is considered a function of many variables:

$$y = f(x_1, x_2, \ldots, x_n) + e$$

where the variables $x_1, x_2, \ldots, x_n$ are the factorial variables, which determine to a certain extent the variation of the resulting variable ($y$).

If the relation between every factor and the resulting variable is linear, then the estimation equation will be:

$$Y(x_1, x_2, \ldots, x_n) = b_0 + b_1 x_1 + \ldots + b_n x_n + e \qquad (7.15)$$

where:

$b_0$: the parameter that expresses the unregistered factors, considered as having a constant action, that is, all the other factors except for those considered factorial variables

$b_1, \ldots, b_n$: the coefficients of regression, which show by how much the resulting variable changes, on average, when the corresponding factorial variable changes by one unit

$x_1, x_2, \ldots, x_n$: the independent variables included in the relation of interdependence.

The parameters are determined by the method of least squares, which requires that the sum of the squared deviations of the empirical terms from the regression line be a minimum:

$$\sum \left(y - Y_{x_1, x_2, \ldots, x_n}\right)^{2} = \min$$

To find the values of these parameters, the system of Normal equations is derived from the condition:

$$\sum \left[y - \left(b_0 + b_1 x_1 + \ldots + b_n x_n\right)\right]^{2} = \min$$

Solving the system gives the estimated parameters of the regression function. In the case of multiple correlation, the degree of intensity of the correlation is measured by the ratio of correlation. The multiple linear model can be solved by the Normal equations for a three variable model, as follows:

$$\sum y = b_0 n + b_1 \sum x_1 + b_2 \sum x_2$$
$$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^{2} + b_2 \sum x_1 x_2 \qquad (7.16)$$
$$\sum x_2 y = b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^{2}$$

The line of best fit gives way to a plane of best fit. The parameter $b_1$ is the slope of the plane along the $x_1$ axis, $b_2$ is the slope along the $x_2$ axis, and the plane cuts the $y$ axis at $b_0$ (written $a$ in the worked example).

The aim of adding to the simple two variable model is to improve the fit of the data.

The above models are illustrated by the following examples.

Example of multiple regression

The X consultancy company is investigating the relationship between performance in Statistics Methods and hours studied per week and the general level of intelligence of candidates. The company has data on ten students as follows:

Student    Hours    I.Q.    Examination level (%)
 1           6      100       45
 2           6      117       55
 3          12      119       80
 4          14       95       73
 5          11      110       71
 6           9       99       56
 7          19       98       95
 8          16      101       86
 9           3      100       34
10           9      115       66

It is required to calculate the separate simple regressions, the multiple regression and the coefficients of determination.

Solution: Part A – Calculation of separate regressions

Table 7-8
No.       y       y²     x1     x1²     x2       x2²     x1y      x2y     x1x2
 1       56    3,136      9      81     99     9,801     504    5,544      891
 2       45    2,025      6      36    100    10,000     270    4,500      600
 3       80    6,400     12     144    119    14,161     960    9,520    1,428
 4       73    5,329     14     196     95     9,025   1,022    6,935    1,330
 5       71    5,041     11     121    110    12,100     781    7,810    1,210
 6       55    3,025      6      36    117    13,689     330    6,435      702
 7       95    9,025     19     361     98     9,604   1,805    9,310    1,862
 8       86    7,396     16     256    101    10,201   1,376    8,686    1,616
 9       34    1,156      3       9    100    10,000     102    3,400      300
10       66    4,356      9      81    115    13,225     594    7,590    1,035
Total   661   46,889    105   1,321  1,054   111,806   7,744   69,730   10,974

For the regression of y on x1 (examination scores on hours studied), the parameters are:

$$b_{x_1} = \frac{n\sum x_1 y - \sum x_1 \sum y}{n\sum x_1^{2} - \left(\sum x_1\right)^{2}} = \frac{10 \times 7{,}744 - 105 \times 661}{10 \times 1{,}321 - 105^{2}} = \frac{8{,}035}{2{,}185}; \qquad b_{x_1} = 3.68$$

$$a_{x_1} = \frac{\sum y}{n} - b_{x_1}\frac{\sum x_1}{n} = \frac{661}{10} - 3.67734 \times \frac{105}{10}; \qquad a_{x_1} = 27.49$$

The regression equation for the relationship of hours studied and examination result is:

$$y_{x_1} = a_{x_1} + b_{x_1} x_1 = 27.49 + 3.68\,x_1$$

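The coefficient of correlation for this relationship is:

$$r_{x_1} = \frac{n\sum x_1 y - \sum x_1 \sum y}{\sqrt{\left[n\sum x_1^{2} - \left(\sum x_1\right)^{2}\right] \times \left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}$$

Note: This formula is a direct equivalent of that given previously but is easier to work with, since all terms except $n\sum y^{2} - \left(\sum y\right)^{2}$ are already known.

$$n\sum y^{2} - \left(\sum y\right)^{2} = 10 \times 46{,}889 - 661^{2} = 468{,}890 - 436{,}921 = 31{,}969$$

$$r_{x_1} = \frac{8{,}035}{\sqrt{2{,}185 \times 31{,}969}} = 0.9613$$

$r_{x_1}^{2} = 0.9243$, i.e. the coefficient of determination for y: x1.

In a similar manner the regression of y on x2 (examination scores on IQ scores) is calculated, resulting in:

$$y_{x_2} = a_{x_2} + b_{x_2} x_2 = 57.16 + 0.085\,x_2$$

$$r_{x_2}^{2} = 0.001608$$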
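As a cross-check on Part A, the two separate regressions can be recomputed directly from the raw data of Table 7-8. The sketch below is illustrative only; the helper name `simple_regression` is an assumption, and the formulae are the same reduced-computation formulae used above.

```python
import math

# Columns x1, x2 and y of Table 7-8
hours = [9, 6, 12, 14, 11, 6, 19, 16, 3, 9]
iq = [99, 100, 119, 95, 110, 117, 98, 101, 100, 115]
score = [56, 45, 80, 73, 71, 55, 95, 86, 34, 66]


def simple_regression(x, y):
    """Return intercept a, slope b and r squared for the line y = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(u * v for u, v in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n
    r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return a, b, r * r


a1, b1, r2_hours = simple_regression(hours, score)
a2, b2, r2_iq = simple_regression(iq, score)
print(f"y on x1: a = {a1:.2f}, b = {b1:.2f}, r^2 = {r2_hours:.4f}")
print(f"y on x2: a = {a2:.2f}, b = {b2:.3f}, r^2 = {r2_iq:.6f}")
```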

Figure 7-6 Scatter diagram of examination scores and hours studied (y: x1). [Horizontal axis: hours studied per week; vertical axis: examination score; fitted regression line shown.]

Figure 7-7 Scatter diagram of examination scores and I.Q. scores (y: x2). [Horizontal axis: IQ score; vertical axis: examination score; fitted regression line shown.]

Solution: Part B – The multiple regression (y: x1 and x2)

The multiple regression calculations are carried out using the three variable Normal equations (7.16) and the totals in Table 7-8 above, thus:

$$661 = 10a + 105 b_1 + 1{,}054 b_2$$
$$7{,}744 = 105a + 1{,}321 b_1 + 10{,}974 b_2$$
$$69{,}730 = 1{,}054a + 10{,}974 b_1 + 111{,}806 b_2$$

Using standard simultaneous equation procedures results in the following values for the coefficients in the equation:

$$y = a + b_1 x_1 + b_2 x_2$$
$$y = -38.06 + 3.93\,x_1 + 0.6\,x_2$$
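The "standard simultaneous equation procedures" are not spelled out in the text. One possible procedure, shown here purely as an illustration, is Cramer's rule applied to the three Normal equations; the helper `solve_3x3` is a hypothetical name, and any other elimination method would serve equally well.

```python
def solve_3x3(matrix, rhs):
    """Solve a 3x3 linear system by Cramer's rule."""
    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(matrix)
    solution = []
    for col in range(3):
        # Replace one column by the right-hand side and take the determinant ratio
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(det3(replaced) / d)
    return solution


# Totals from Table 7-8
n, sum_x1, sum_x2, sum_y = 10, 105, 1054, 661
sum_x1x1, sum_x2x2, sum_x1x2 = 1321, 111806, 10974
sum_x1y, sum_x2y = 7744, 69730

normal_matrix = [
    [n,      sum_x1,   sum_x2],
    [sum_x1, sum_x1x1, sum_x1x2],
    [sum_x2, sum_x1x2, sum_x2x2],
]
a, b1, b2 = solve_3x3(normal_matrix, [sum_y, sum_x1y, sum_x2y])
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.3f}")
```

At full precision b2 comes out slightly below 0.6 (about 0.597), which matters when the coefficients are reused in later calculations.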

This result could be used to predict the examination score for a candidate, given the number of hours worked and IQ. For example, what is the expected score of a candidate who has worked for 13 hours per week and who has an IQ of 102?

$$y = -38.06 + 3.93 \times 13 + 0.6 \times 102 = 74.23\%$$ expected examination score.

Solution: Part C – Coefficient of multiple determination, R²

Using the computational formula (7.25) and the values calculated above, $R^2$ can be calculated thus:

$$R^{2} = \frac{(-38.06 \times 661) + (3.93 \times 7{,}744) + (0.6 \times 69{,}730) - \dfrac{661^{2}}{10}}{46{,}889 - \dfrac{661^{2}}{10}} = 0.9995$$
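The computational formula is sensitive to rounding of the coefficients, so it can be useful to evaluate it with the coefficients at full precision. The sketch below is an illustration, not the text's own procedure: it recovers a, b1 and b2 from centred sums (an algebraically equivalent way of solving the Normal equations) and then applies the computational formula. Under these assumptions R² comes out at about 0.9994 and the prediction for 13 hours and an IQ of 102 at about 73.9, close to the hand-calculated 0.9995 and 74.23, whereas substituting the two-decimal coefficients literally gives a value above 1.

```python
hours = [9, 6, 12, 14, 11, 6, 19, 16, 3, 9]            # x1
iq = [99, 100, 119, 95, 110, 117, 98, 101, 100, 115]   # x2
score = [56, 45, 80, 73, 71, 55, 95, 86, 34, 66]       # y
n = len(score)


def total(u, v):
    """Sum of products, e.g. total(hours, score) is the sum of x1*y."""
    return sum(p * q for p, q in zip(u, v))


def centred(u, v):
    """Sum of products of deviations from the means."""
    return total(u, v) - sum(u) * sum(v) / n


s11, s22, s12 = centred(hours, hours), centred(iq, iq), centred(hours, iq)
s1y, s2y = centred(hours, score), centred(iq, score)
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
a = (sum(score) - b1 * sum(hours) - b2 * sum(iq)) / n

# Computational formula for R squared, full-precision coefficients
correction = sum(score) ** 2 / n
explained = a * sum(score) + b1 * total(hours, score) + b2 * total(iq, score) - correction
r_squared = explained / (total(score, score) - correction)

# Same formula with the coefficients rounded as printed in the text
rounded = (-38.06 * 661 + 3.93 * 7744 + 0.6 * 69730 - correction) / (46889 - correction)

predicted = a + b1 * 13 + b2 * 102
print(f"R^2 = {r_squared:.4f}, with rounded coefficients = {rounded:.4f}")
print(f"predicted score for 13 hours and IQ 102 = {predicted:.2f}")
```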

The various coefficients of determination can now be summarised and interpreted:

$r_{x_1}^{2} = 0.9243$
$r_{x_2}^{2} = 0.0016$
$R^{2} = 0.9995$

$r_{x_1}^{2}$: This indicates that about 92% of the variation in examination scores is explained by variation in hours of study, which is obviously a major influence.

$r_{x_2}^{2}$: This indicates that only 0.16% of the variation in examination score is explained by variation in IQ score, which is a very small influence indeed.

$R^{2}$: This shows the combined effect of the two independent variables and indicates that 99.95% of the variation in examination score is explained by movements in hours studied and IQ score. This, however, assumes that it is a reasonable hypothesis that examination results are influenced by the intelligence of candidates and how hard they work!

7.6 Correlation between Variables

The degree of correlation between two variables can be measured by using the following indicators:

a) Covariance represents an absolute measure of the intensity of the relation and is computed as the arithmetic mean of the products $(x_i - \bar{x})(y_i - \bar{y})$:

$$\operatorname{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n} \qquad (7.17)$$

If the result tends to zero then there is no linear relation between the variables. If the result is positive, then we have a positive correlation, and if the result is negative, we have a negative correlation. In the case of perfect correlation the covariance reaches its maximum absolute value, which equals the product of the standard deviations of the two variables.

b) Coefficient of correlation, denoted by r, used only for linear relations. This provides a measure of the strength of association between two variables; r can range from -1, i.e. perfect negative correlation, to +1, i.e. perfect positive correlation.

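The formula for the coefficient of correlation is:

$$r = \frac{\operatorname{cov}(x, y)}{\sigma_x \cdot \sigma_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n \cdot \sigma_x \cdot \sigma_y} \qquad (7.18)$$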

c) Ratio of determination, denoted by $R^2$: expresses how much of the total variation of the Y variable is explained by the independent variable.

d) The rank correlation coefficients. These provide a measure of the association between two sets of ranked or ordered data.

Whichever type of coefficient is being used, a coefficient of zero or near zero generally indicates no correlation.

7.6.1 Parametric Measures of Simple Correlation. Coefficient and Ratio of Correlation

Coefficient of correlation

This coefficient gives an indication of the strength of the linear relationship between two variables. It is computed differently depending on how the data are presented, as related pairs or as classified pairs of figures:
a. simple bivariate numerical data which are not grouped
b. bivariate numerical data grouped by classes or variants with common frequencies for the x and y variation
c. bivariate numerical data grouped by variants or classes into a cross table

a. In the case of simple bivariate numerical data which are not grouped and are presented as related pairs of figures:

X values    X1 ... Xi ... Xn    Σ Xi
Y values    Y1 ... Yi ... Yn    Σ Yi

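The general formula is

$$r = \frac{\operatorname{cov}(x, y)}{\sigma_x \cdot \sigma_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n \cdot \sigma_x \cdot \sigma_y} \qquad (7.19)$$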

There are several possible formulae, but a practical one is the "reduced computation" formula (7.20):

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right] \times \left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}} \qquad (7.20)$$

This formula is used to find r from the data in Example 1 from Table 7-9.

Table 7-9
   X      Y       X²       Y²       XY
  15     60      225    3,600      900
  24     45      576    2,025    1,080
  25     50      625    2,500    1,250
  30     35      900    1,225    1,050
  35     42    1,225    1,764    1,470
  40     46    1,600    2,116    1,840
  45     28    2,025      784    1,260
  65     20    4,225      400    1,300
  70     22    4,900      484    1,540
  75     15    5,625      225    1,125
 424    363   21,926   15,123   12,815
  ΣX     ΣY      ΣX²      ΣY²      ΣXY

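Using the formula above:

$$r = \frac{10 \times 12{,}815 - 424 \times 363}{\sqrt{\left(10 \times 21{,}926 - 424^{2}\right)\left(10 \times 15{,}123 - 363^{2}\right)}} = \frac{128{,}150 - 153{,}912}{\sqrt{(219{,}260 - 179{,}776)(151{,}230 - 131{,}769)}} = \frac{-25{,}762}{\sqrt{39{,}484 \times 19{,}461}} = -0.93$$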

Thus the correlation coefficient is -0.93 which indicates a strong negative linear association between expenditure on inspection and defective parts delivered. It will be seen that the formula automatically produces the correct sign for the coefficient.
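The equivalence between the definition of r through the covariance (7.19) and the reduced computation formula (7.20) can be illustrated on the Table 7-9 data. The function names below are hypothetical; the standard deviations are the population versions (divisor n), matching the covariance defined in (7.17).

```python
import math

# Table 7-9
x = [15, 24, 25, 30, 35, 40, 45, 65, 70, 75]
y = [60, 45, 50, 35, 42, 46, 28, 20, 22, 15]


def r_reduced(xs, ys):
    """Reduced computation formula (7.20)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    num = n * sum(u * v for u, v in zip(xs, ys)) - sx * sy
    den = math.sqrt((n * sum(u * u for u in xs) - sx ** 2)
                    * (n * sum(v * v for v in ys) - sy ** 2))
    return num / den


def r_from_covariance(xs, ys):
    """Definition (7.19): covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((u - mean_x) ** 2 for u in xs) / n)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in ys) / n)
    return cov / (sd_x * sd_y)


r1 = r_reduced(x, y)
r2 = r_from_covariance(x, y)
print(f"reduced formula: {r1:.4f}, covariance definition: {r2:.4f}")
```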

b. In the case of bivariate numerical data grouped by classes or variants with common frequencies for the x and y variation, the input data are arranged as follows:

X values          X1 ... Xi ... Xn    Σ Xi
Y values          Y1 ... Yi ... Yn    Σ Yi
Frequencies fi    f1 ... fi ... fn    Σ fi

For the above table the practical formula for the coefficient of correlation is:

$$r = \frac{\sum f_i \cdot \sum x_i y_i f_i - \sum x_i f_i \cdot \sum y_i f_i}{\sqrt{\left[\sum f_i \cdot \sum x_i^{2} f_i - \left(\sum x_i f_i\right)^{2}\right] \times \left[\sum f_i \cdot \sum y_i^{2} f_i - \left(\sum y_i f_i\right)^{2}\right]}}$$
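A minimal sketch of the frequency-weighted formula follows. The data are invented purely for illustration and do not come from the text; the check at the end relies on the fact that repeating each pair as many times as its frequency and applying the ungrouped formula must give the same coefficient.

```python
import math


def weighted_r(x, y, f):
    """Coefficient of correlation for pairs (x_i, y_i) observed f_i times."""
    total_f = sum(f)
    sxf = sum(xi * fi for xi, fi in zip(x, f))
    syf = sum(yi * fi for yi, fi in zip(y, f))
    sxxf = sum(xi * xi * fi for xi, fi in zip(x, f))
    syyf = sum(yi * yi * fi for yi, fi in zip(y, f))
    sxyf = sum(xi * yi * fi for xi, yi, fi in zip(x, y, f))
    num = total_f * sxyf - sxf * syf
    den = math.sqrt((total_f * sxxf - sxf ** 2) * (total_f * syyf - syf ** 2))
    return num / den


# Hypothetical grouped data (illustration only)
x_vals = [1, 2, 3, 4]
y_vals = [2, 3, 5, 4]
freqs = [3, 5, 4, 2]
r_grouped = weighted_r(x_vals, y_vals, freqs)

# Expand every pair f_i times and treat the result as ungrouped data
expanded_x = [xi for xi, fi in zip(x_vals, freqs) for _ in range(fi)]
expanded_y = [yi for yi, fi in zip(y_vals, freqs) for _ in range(fi)]
r_expanded = weighted_r(expanded_x, expanded_y, [1] * len(expanded_x))
print(f"grouped: {r_grouped:.4f}, expanded: {r_expanded:.4f}")
```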

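c. In the case of bivariate numerical data grouped into a cross table, the data are arranged as in Table 7-10:

Table 7-10
Variation class middles       Variation class middles or variants
or variants of the            of the dependent variable (yj)          Total (fi.)
independent variable (xi)     Y1     ...    Yj     ...    Ym
X1                            f11    ...    f1j    ...    f1m         f1.
...                           ...           ...           ...         ...
Xi                            fi1    ...    fij    ...    fim         fi.
...                           ...           ...           ...         ...
Xn                            fn1    ...    fnj    ...    fnm         fn.
Total (f.j)                   f.1    ...    f.j    ...    f.m         ΣΣ fij = Σ fi. = Σ f.j

For a cross table the practical formula of r is:

$$r = \frac{\sum\sum f_{ij} \cdot \sum\sum x_i y_j f_{ij} - \sum x_i f_{i\cdot} \cdot \sum y_j f_{\cdot j}}{\sqrt{\left[\sum\sum f_{ij} \cdot \sum x_i^{2} f_{i\cdot} - \left(\sum x_i f_{i\cdot}\right)^{2}\right] \times \left[\sum\sum f_{ij} \cdot \sum y_j^{2} f_{\cdot j} - \left(\sum y_j f_{\cdot j}\right)^{2}\right]}} \qquad (7.21)$$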
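The cross-table formula can be sketched as below. The class middles and frequencies are hypothetical, chosen only to exercise the calculation; `cross_table_r` and `pearson` are illustrative names. The row totals play the role of fi. and the column totals the role of f.j.

```python
import math


def pearson(xs, ys):
    """Ungrouped reduced computation formula, used here as a reference."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    num = n * sum(a * b for a, b in zip(xs, ys)) - sx * sy
    den = math.sqrt((n * sum(a * a for a in xs) - sx ** 2)
                    * (n * sum(b * b for b in ys) - sy ** 2))
    return num / den


def cross_table_r(x_mid, y_mid, table):
    """table[i][j] is the frequency of the pair (x_mid[i], y_mid[j])."""
    row_tot = [sum(row) for row in table]          # f_i.
    col_tot = [sum(col) for col in zip(*table)]    # f_.j
    total = sum(row_tot)
    sx = sum(x * f for x, f in zip(x_mid, row_tot))
    sy = sum(y * f for y, f in zip(y_mid, col_tot))
    sxx = sum(x * x * f for x, f in zip(x_mid, row_tot))
    syy = sum(y * y * f for y, f in zip(y_mid, col_tot))
    sxy = sum(x * y * f
              for x, row in zip(x_mid, table)
              for y, f in zip(y_mid, row))
    num = total * sxy - sx * sy
    den = math.sqrt((total * sxx - sx ** 2) * (total * syy - sy ** 2))
    return num / den


# Hypothetical cross table (illustration only): 3 classes of x, 2 classes of y
x_mid = [10, 20, 30]
y_mid = [5, 15]
table = [[4, 1],
         [2, 3],
         [1, 5]]
r_cross = cross_table_r(x_mid, y_mid, table)

# Reference value: list every (x, y) pair as often as its cell frequency
pairs = [(x, y) for x, row in zip(x_mid, table)
         for y, f in zip(y_mid, row) for _ in range(f)]
r_pairs = pearson([p[0] for p in pairs], [p[1] for p in pairs])
print(f"cross table: {r_cross:.4f}, expanded pairs: {r_pairs:.4f}")
```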

Whatever the presentation and classification of the data, the coefficient of correlation is interpreted by comparing it with zero and with its limits, -1 and +1.

Interpretation of the value of r

Caution is needed in the interpretation of the coefficient of correlation, r. A high value (above +0.9 or below -0.9) only shows a strong association between the two variables and does not show a causal relationship. It is possible to find two variables which produce a high calculated r value yet which have no causal relationship. This is known as spurious or nonsense correlation. An example might be the wheat harvest in America and the number of deaths by drowning in Britain. There might be a high apparent correlation between these two variables but there clearly is no causal relationship.

The coefficient of correlation can take values between -1 and +1. Its absolute value is interpreted as follows:
• $|r| = 0$: no linear relationship
• $|r| \in (0, 0.2)$: there is a low intensity relation between the variables
• $|r| \in (0.2, 0.5)$: there is a weak correlation, a case needing a significance test to be applied, for instance the Student t test
• $|r| \in (0.5, 0.75)$: medium intensity relation
• $|r| \in (0.75, 0.95)$: tight, high intensity relation
• $|r| \in (0.95, 1.00)$: an extremely strong relationship between the variables, almost a deterministic (functional) relation

Comparing r with zero:
• r > 0 shows a positive relationship and should correspond to a positive slope of the regression line
• r < 0 shows a negative relationship and should correspond to a negative slope

A low correlation coefficient, somewhere near zero, does not always mean that there is no relationship between the variables. All it says is that there is no linear relationship between the variables; there may be a strong relationship, but a non-linear one.

A further problem in interpretation arises from the fact that the coefficient of correlation measures the relationship between a single independent variable and the dependent variable, whereas a particular variable may be dependent on several independent variables, in which case multiple correlation should be calculated rather than the simple two-variable coefficient.

The significance of r

Frequently the set of X and Y observations is based upon a sample. Had a different sample been drawn then the value of r would be different, although the degree of correlation in the reference population would remain the same. In the same way that knowledge of the sample mean $\bar{x}$ enables an estimate to be made of the population mean, knowledge of r enables the analyst to make an estimate of $\rho$, the population coefficient of correlation.

Generally in examination questions the sample size is limited to some figure that can be dealt with in the time allowed. It is questionable whether the sample size given in examinations provides enough data for a credible judgment to be formed about a possible relationship between the X and Y values: does a high r reflect a real relationship, or is it just that the particular sample gives this impression? Conversely, if r is low does it really imply a lack of a relationship? There may indeed be a close relationship which the data have not revealed. Further, the relationship may exist, but it may not be linear or it may not be direct.

It is possible to test whether the value of r is sufficiently different from zero for the analyst to decide whether the X and Y values are correlated. The test may be stated in terms of the null hypothesis and its alternative:

$H_0: \rho = 0$
$H_1: \rho \neq 0$

It is a t test for which the test statistic is given by:

$$t = \frac{r - \rho}{\sqrt{1 - r^{2}}} \times \sqrt{n - 2} \qquad (7.22)$$

Using the values from Example 1, i.e. r = -0.93 and n = 10, we obtain:

$$t = \frac{-0.93 - 0}{\sqrt{1 - 0.93^{2}}} \times \sqrt{10 - 2} = -2.53 \times 2.83 = -7.16$$

so that $|t| = 7.16$.
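The test statistic (7.22) is easy to evaluate in code. The sketch below assumes a two-sided test at the 5% significance level, which the text does not specify; the critical value 2.306 is the tabulated Student t value for n - 2 = 8 degrees of freedom and is hard-coded here rather than computed.

```python
import math


def correlation_t_statistic(r, n, rho=0.0):
    """t statistic of formula (7.22) for testing H0: rho = 0."""
    return (r - rho) / math.sqrt(1 - r ** 2) * math.sqrt(n - 2)


t_value = correlation_t_statistic(-0.93, 10)

# Assumption: two-sided test at the 5% level, 8 degrees of freedom
critical_value = 2.306
reject_h0 = abs(t_value) > critical_value
print(f"t = {t_value:.2f}, reject H0: {reject_h0}")
```

Under these assumptions the absolute value of t comfortably exceeds the critical value, so the hypothesis of no correlation in the population would be rejected.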

Ratio of correlation

The ratio of correlation can be used to characterise any type of relation, linear or non-linear. It can also be used to measure the intensity of the relation no matter how many independent variables are taken into account. The ratio of correlation shows only the intensity of the relation; it does not show its direction.

It is computed with the formula:

$$R = \sqrt{1 - \frac{\sum \left(y_i - Y_{x_i}\right)^{2}}{\sum \left(y_i - \bar{y}\right)^{2}}} \qquad (7.23)$$

where:

$y_i$: the observed values of the dependent variable

$Y_{x_i}$: the adjusted values, calculated according to the regression function

$\bar{y}$: the arithmetic mean of the dependent values
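As an illustration of formula (7.23), the sketch below fits a straight line to the Table 7-9 data by least squares and computes the ratio of correlation from the residuals. For a simple linear fit the ratio equals the absolute value of r, so the result is about 0.93 with no sign, which is the sense in which the ratio shows intensity but not direction. Variable names are illustrative.

```python
import math

# Table 7-9
x = [15, 24, 25, 30, 35, 40, 45, 65, 70, 75]
y = [60, 45, 50, 35, 42, 46, 28, 20, 22, 15]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares line y = a + b*x
b = (sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
     / sum((u - mean_x) ** 2 for u in x))
a = mean_y - b * mean_x
fitted = [a + b * u for u in x]

# Ratio of correlation: 1 minus unexplained over total variation, square-rooted
residual_ss = sum((v - f) ** 2 for v, f in zip(y, fitted))
total_ss = sum((v - mean_y) ** 2 for v in y)
ratio = math.sqrt(1 - residual_ss / total_ss)
print(f"slope = {b:.3f}, ratio of correlation = {ratio:.4f}")
```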

The ratio of correlation is interpreted in the same way as the coefficient of correlation, except that it can only take values between 0 and +1.

7.6.2 Parametric Measures of Multiple Correlation

In the case of multiple correlation the closeness of fit is measured by the coefficient of multiple determination, $R^2$, for which the general formula and the useful computational formula are given below:

$$R^{2} = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\sum \left(Y_{\text{estimate}} - \bar{Y}\right)^{2}}{\sum \left(Y - \bar{Y}\right)^{2}} \qquad (7.24)$$

where $Y_{\text{estimate}}$ now equals the estimate of Y for each pair of values of $x_1$ and $x_2$.

$$R^{2} = \frac{a\sum y + b_1 \sum x_1 y + b_2 \sum x_2 y - \dfrac{\left(\sum y\right)^{2}}{n}}{\sum y^{2} - \dfrac{\left(\sum y\right)^{2}}{n}} \qquad (7.25)$$

It is not necessarily the case that the value of the coefficient of determination will improve appreciably with the addition of extra variables.

The ratio of multiple correlation is calculated as in the case of simple correlation, from the share of the dispersion produced by the registered factors ($\sigma^{2}_{y/x_1, x_2}$) in the total dispersion of the resulting variable ($\sigma^{2}_{y}$). Using the relation between the three dispersions, $\sigma^{2}_{y} = \sigma^{2}_{y/x_1, x_2, \ldots, x_n} + \sigma^{2}_{y/r}$, the ratio of correlation is computed with the formula:

$$R_{y, x_1, x_2, \ldots, x_n} = \sqrt{1 - \frac{\sum \left(y_i - Y_{x_1, x_2, \ldots, x_n}\right)^{2}}{\sum \left(y_i - \bar{y}\right)^{2}}} \qquad (7.26)$$

The ratio of multiple correlation can take values between 0 and +1. It is at least as high as any of the simple correlation indicators, because it combines the influence of each factor and of the interactions between them. So, the more factors are considered, the higher the value of the ratio. Theoretically, if all the determining factors could be expressed numerically, the ratio of multiple correlation would be 1, showing a functional dependence between the resulting variable and all of its determining factors. In that case the regression equation would reproduce the empirical values of the resulting variable exactly from the sizes of all the determining factors, and the free term would be 0:

$$Y(x_1, x_2, \ldots, x_n) = a + b_1 x_1 + b_2 x_2 + \ldots + b_n x_n \qquad (7.27)$$

In practice, however, not all the factors of influence can be identified and some of them cannot be quantified. For this reason, the values given by the multiple regression will deviate more or less from the real values of the terms of the series, because of the influence of the unregistered factors included in the value of the free term $a_0$.

When the relation with each of the considered factors is linear, the ratio of multiple correlation becomes a coefficient of multiple correlation; that is, the coefficient of multiple correlation equals the ratio of multiple correlation. In the case of multiple correlation, the ratio of linear correlation synthesises all the simple linear relations. If the factors are independent of one another, then the ratio of multiple determination equals the sum of the ratios of simple determination. For instance, for two factors:

$$R^{2}_{y, x_1, x_2} = R^{2}_{y x_1} + R^{2}_{y x_2} \qquad (7.28)$$

If the relation is linear, then R is substituted by r:

$$R^{2}_{y, x_1, x_2} = r^{2}_{y x_1} + r^{2}_{y x_2} \qquad (7.29)$$

From this, the ratio of correlation is:

$$R_{y, x_1, x_2} = \sqrt{r^{2}_{y x_1} + r^{2}_{y x_2}} \qquad (7.30)$$

Usually, among socio-economic phenomena, the factors of influence are interdependent, and it therefore becomes necessary to take the reciprocal influence of the factors into account. If the factors are interdependent, $r_{x_1 x_2} \neq 0$.

This inter-influence has to be eliminated, because otherwise it is carried into the value of the multiple correlation coefficient. The ratio of multiple linear correlation is calculated from the coefficients of simple correlation:

$$R_{y, x_1, x_2} = \sqrt{\frac{r^{2}_{y x_1} + r^{2}_{y x_2} - 2\, r_{y x_1} r_{y x_2} r_{x_1 x_2}}{1 - r^{2}_{x_1 x_2}}} \qquad (7.31)$$
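Formula (7.31) can be tried out on the examination data of Table 7-8, where hours studied and IQ are mildly negatively correlated. The sketch below is an illustration under that choice of data; `pearson` is a hypothetical helper. It also shows that simply adding the two simple coefficients of determination, as in (7.29), understates the multiple coefficient when the factors are not independent.

```python
import math


def pearson(u, v):
    """Simple coefficient of correlation (reduced computation formula)."""
    n = len(u)
    su, sv = sum(u), sum(v)
    num = n * sum(p * q for p, q in zip(u, v)) - su * sv
    den = math.sqrt((n * sum(p * p for p in u) - su ** 2)
                    * (n * sum(q * q for q in v) - sv ** 2))
    return num / den


hours = [9, 6, 12, 14, 11, 6, 19, 16, 3, 9]            # x1
iq = [99, 100, 119, 95, 110, 117, 98, 101, 100, 115]   # x2
score = [56, 45, 80, 73, 71, 55, 95, 86, 34, 66]       # y

r_y1 = pearson(score, hours)
r_y2 = pearson(score, iq)
r_12 = pearson(hours, iq)

# Formula (7.31): multiple correlation from the three simple coefficients
multiple_r = math.sqrt((r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12)
                       / (1 - r_12 ** 2))
# Formula (7.29) would only be valid if r_12 were zero
naive_r_squared = r_y1 ** 2 + r_y2 ** 2
print(f"r_12 = {r_12:.4f}, R^2 (7.31) = {multiple_r ** 2:.4f}, "
      f"sum of simple r^2 = {naive_r_squared:.4f}")
```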

7.6.3 Nonparametric Measures of Correlation

Sometimes in practice none of the known functions can be used to interpret the relation, because there are not enough elements to identify the distribution law of the errors for the series used. In this case nonparametric methods are used, such as the coefficient of association proposed by Yule and the rank correlation coefficients proposed by Kendall and Spearman. These coefficients have the advantage that they can be used in the case of a skewed distribution or a small number of units. This is possible because in such situations the terms are arranged according to the rank of each variable.

Yule coefficient of association

This coefficient is used when the statistical units can be separated into two groups according to the x and y variation, or when the variables are binary:

Table 7-11
X groups or variants    Y groups or variants    Total
                        Y1         Y2
X1                      A          B            A+B
X2                      C          D            C+D
Total                   A+C        B+D          A+B+C+D

In order to express the intensity of the relation we use the formula:

$$K_{Yule} = \frac{A \cdot D - B \cdot C}{A \cdot D + B \cdot C} \qquad (7.32)$$

with the same interpretation as for the coefficient of correlation, taking values between -1 and +1.

Rank coefficients

This nonparametric method also has the advantage that the analysis can include dependence relations between phenomena and qualitative variables that cannot be expressed numerically but can be ordered by rank. The data are arranged according to the variation of the independent variable and each value is replaced by its order number, called its rank. The ranks can be assigned either in increasing order, when the best value of the indicator is the minimum value, or in decreasing order, when the maximum value has rank one. As far as the value of the coefficients of correlation is concerned, the direction in which the ranks are assigned is not important, provided the same direction is kept for all the variables. The direction matters only if the correlation analysis is combined with the establishment of a hierarchic typology.

The starting hypothesis is that there is concordance between the two series of ranks: when a relation exists between the two variables, each unit should have the same number of units ranked above it, and the same number ranked below it, on both variables. The most frequently used formulas for the rank correlation coefficient are those of Spearman and Kendall. The rank coefficient proposed by Spearman is:

$$r_s = 1 - \frac{6\sum d_i^{2}}{n^{3} - n} \qquad (7.33)$$

where $d_i$ is the rank difference between the correlated variables for unit i and n is the number of correlated units.

The coefficient of rank correlation proposed by Kendall has the formula:

$$r_K = \frac{2S}{n(n - 1)} \qquad (7.34)$$

where:

S = P + Q: the score for the relative positions of the ranks of the correlated variables

P: the number of higher ranks that follow each rank of the effect variable for which the calculation is made

Q: the number of lower ranks of the effect variable that follow the same rank, taken with a negative sign

P always has a positive value and Q a negative one, so S can be positive or negative. The coefficients of rank correlation can therefore take values between -1 and +1. They are interpreted in the same way as the parametric correlation. Their advantages and ease of calculation make these coefficients very suitable for studying the relation between specific phenomena, including qualitative variables measured on an ordinal scale. Because it is easier to calculate, Spearman's coefficient is the one most frequently used. It is deduced from the coefficient of simple linear correlation, where the mean and the dispersion are based on the properties of the arithmetic progression formed by the ranks. Kendall's coefficient is generally found to be smaller than Spearman's.

Tied rankings. Adjusted rankings

A slight adjustment to the formula is necessary if, for example, in a study recording students' marks some students obtained the same mark in a test and are thus given the same ranking. The adjustment is:

$$\frac{t^{3} - t}{12} \qquad (7.35)$$

where t is the number of tied rankings.

The adjusted formula for the Spearman coefficient is:

$$R = 1 - \frac{6\left(\sum d^{2} + \dfrac{t^{3} - t}{12}\right)}{n\left(n^{2} - 1\right)} \qquad (7.35')$$
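The adjusted formula can be sketched as a small function. The rankings used are those of the eight students A to H in Table 7-12, with E and F sharing rank 3½. Summing the correction over several groups of ties is an assumption made here for generality; the text itself only gives the adjustment for a single group of t tied rankings.

```python
def spearman_adjusted(rank_a, rank_b, tie_group_sizes=()):
    """Spearman coefficient with the (t**3 - t) / 12 adjustment for ties.

    tie_group_sizes lists the number of tied rankings in each group of ties
    (assumption: one correction term is added per group).
    """
    n = len(rank_a)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_a, rank_b))
    correction = sum((t ** 3 - t) / 12 for t in tie_group_sizes)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))


# Table 7-12, students A..H
qt_rank = [2, 7, 6, 1, 3.5, 3.5, 5, 8]
ma_rank = [3, 6, 4, 2, 5, 1, 8, 7]

# One group of two tied rankings (students E and F)
r_tied = spearman_adjusted(qt_rank, ma_rank, tie_group_sizes=[2])
print(f"adjusted Spearman coefficient = {r_tied:.2f}")
```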

For example, assume that students E and F achieved equal marks in QT and were given joint third place. The revised data are given in Table 7-12:

Table 7-12
Student    Q.T. ranking    M.A. ranking    d       d²
A          2               3               -1      1
B          7               6               +1      1
C          6               4               +2      4
D          1               2               -1      1
E          3½              5               -1½     2¼
F          3½              1               +2½     6¼
G          5               8               -3      9
H          8               7               +1      1
                                           Σd² =   25½

$$R = 1 - \frac{6\left(\sum d^{2} + \dfrac{t^{3} - t}{12}\right)}{n\left(n^{2} - 1\right)} = 1 - \frac{6\left(25\tfrac{1}{2} + \dfrac{2^{3} - 2}{12}\right)}{8\left(8^{2} - 1\right)} = +0.69$$

As will be seen, the Spearman value has moved from +0.74 to +0.69.

7.7 Exercises

Multiple choice exercises with answers

1. Which of the following techniques is used to predict the value of one variable on the basis of other variables?
a. Correlation analysis
b. Coefficient of correlation
c. Covariance
d. Regression analysis
ANSWER: d

2. If cov(X, Y) = 1260, $s_x^{2} = 1600$ and $s_y^{2} = 1225$, then the coefficient of determination is:
a. 0.7875
b. 1.0286
c. 0.8100
d. 0.7656
ANSWER: c

3. The coefficient of determination $R^{2}$ measures the amount of:
a. variation in y that is explained by variation in x
b. variation in x that is explained by variation in y
c. variation in y that is unexplained by variation in x
d. variation in x that is unexplained by variation in y
ANSWER: a

4. In the simple linear regression model, the y-intercept represents the:
a. change in y per unit change in x
b. change in x per unit change in y
c. value of y when x = 0
d. value of x when y = 0
ANSWER: c

Multiple choice exercises without answers

5. In a regression problem, if the coefficient of determination is 0.95, this means that:
a. 95% of the y values are positive
b. 95% of the variation in y can be explained by the variation in x
c. 95% of the x values are equal
d. 95% of the variation in x can be explained by the variation in y

6. In a regression problem, if all the values of the independent variable are equal, then the coefficient of determination must be:
a. 1
b. 0.5
c. 0
d. -1

7. The following sums of squares are produced: $\sum (y_i - \bar{y})^{2} = 200$, $\sum (y_i - \hat{y}_i)^{2} = 50$, $\sum (\hat{y}_i - \bar{y})^{2} = 150$. The proportion of the variation in y that is explained by the variation in x is:
a. 25%
b. 75%
c. 33%
d. 50%

Open ended exercises with answers

8. Consider the following data values of variables x and y.

x     2     4     6     8    10    13
y     7    11    17    21    27    36

a. Determine the least squares regression line.
b. Find the predicted value of y for x = 9.
c. What does the value of the slope of the regression line tell you?
d. Calculate the coefficient of determination, and describe what this statistic tells you about the relationship between the two variables.
e. Calculate the Pearson coefficient of correlation. What sign does it have? Why?
f. What does the coefficient of correlation calculated in part (e) tell you about the direction and strength of the relationship between the two variables?

ANSWERS:
a. $\hat{y} = 0.934 + 2.637x$
b. 24.667
c. If x increases by one unit, y on average will increase by 2.637.
d. $R^{2} = 0.995$. This means that 99.5% of the variation in the dependent variable y is explained by the variation in the independent variable x.
e. r = 0.9975. It is positive since the slope of the regression line is positive.
f. There is a very strong (almost perfect) positive linear relationship between the two variables.

9. A professor of economics wants to study the relationship between income (y in $1000s) and education (x in years). A random sample of eight individuals is taken and the results are shown below.

Education    16    11    15     8    12    10    13    14
Income       58    40    55    35    43    41    52    49

a. Draw a scatter diagram of the data to determine whether a linear model appears to be appropriate.
b. Determine the least squares regression line.
c. Interpret the value of the slope of the regression line.
d. Determine the standard error of estimate and describe what this statistic tells you about the regression line.

ANSWERS:
a. It appears that a linear model is appropriate. [Scatter diagram: income (vertical axis) plotted against years of education (horizontal axis).]
b. $\hat{y} = 10.6165 + 2.9098x$
c. For each additional year of education, the income on average increases by $2,909.80.
d. $s_{\varepsilon} = 2.436$; the model's fit is good.

10. A scatter diagram includes the following data points:

x     3     2     5     4     5
y     8     6    12    10    14

Two regression models are proposed:

Model 1: $\hat{y} = 1.2 + 2.5x$
Model 2: $\hat{y} = 5.5 + 4.0x$

Using the least squares method, which of these regression models provides the better fit to the data? Why?

ANSWER: The sum of squared errors is 4.95 for model 1 and 593.25 for model 2. Therefore, model 1 is better than model 2.

11. Consider the following data values of variables x and y:

x     2     4     6     8    10    13
y     7    11    17    21    27    36

a. Determine the least squares regression line.
b. Find the predicted value of y for x = 9.
c. What does the value of the slope of the regression line tell you?
d. Calculate the coefficient of determination, and describe what this statistic tells you about the relationship between the two variables.
e. Calculate the Pearson coefficient of correlation. What sign does it have? Why?
f. What does the coefficient of correlation calculated in part (e) tell you about the direction and strength of the relationship between the two variables?

ANSWERS:
a. $\hat{y} = 0.934 + 2.637x$
b. 24.667
c. If x increases by one unit, y on average will increase by 2.637.
d. $R^{2} = 0.995$. This means that 99.5% of the variation in the dependent variable y is explained by the variation in the independent variable x.
e. r = 0.9975. It is positive since the slope of the regression line is positive.
f. There is a very strong (almost perfect) positive linear relationship between the two variables.

Open ended exercises without answers

12. Refer to Exercise 9.

a. Determine the coefficient of determination and discuss what its value tells you about the two variables.

b. Calculate the Pearson correlation coefficient. What sign does it have? Why?


c. Conduct a test of the population coefficient of correlation to determine at the 5% significance level whether a linear relationship exists between years of education and income.

13. In a simple linear regression problem, the following statistics are calculated from a sample of 10 observations: Σ(x − x̄)(y − ȳ) = 2250, s_x = 10, Σx = 50, Σy = 75

Compute the regression equation.

14. Given the least squares regression line y = -2.48 + 1.63x and a coefficient of determination of 0.81, compute and interpret the coefficient of correlation.

15. Refer to Exercise 10.

a. Use the regression equation to determine the predicted values of y.

b. Use the predicted and actual values of y to calculate the residuals.

c. Plot the residuals against the predicted values of y. Does the variance appear to be constant?

d. Identify possible outliers.

16. For a company we know the following information about the evolution of turnover and profit:

Year                                              1991  1992  1993  1994  1995  1996
Turnover mobile relative change (%)                 +3    +4    +2    +4    -2    +8
Profit chain base absolute change (mill. m.u.)       6     4    -1     9     2     9

The turnover in 1990 was 80 m.u. and the average rate of profit per year was +6.4%. Considering that the evolution of profit is influenced by the evolution of turnover, you are asked to:

a. Reconstruct and graph the historical evolutions.

b. Forecast the turnover for the next year, choosing the most appropriate method between the simple and the analytic methods.

c. Forecast the profit in 1997, taking into account its dependency upon the turnover (according to the regression line).

d. Measure the intensity of the relation using parametric and nonparametric measures.

