2. The Linear Regression Model
Joshua Sherman Applied Econometrics
040693 University of Vienna
Regression model
We use regression models to answer the following types of questions:
If one variable (e.g., PUNISHMENT) changes in a certain way, by how much will another variable (e.g., CRIME) change?
Given the value of one variable, can we predict the corresponding value of another?
Simple linear regression model
We begin by discussing the simple linear regression model, in which there is only one explanatory variable on the right hand side of the regression equation:
$$E(Y \mid X) = \beta_0 + \beta_1 X$$
The unknown parameters $\beta_0$ and $\beta_1$ are the intercept and slope of the regression function. We refer to them as population parameters.
Suppose Y is sales of umbrellas and X is the amount of rainfall in centimeters in a given city. Then the slope coefficient $\beta_1$ represents the change in the average number of umbrella sales given a 1 cm change in rainfall. The intercept coefficient $\beta_0$ represents the average number of umbrella sales in a city with no rainfall.
Simple linear regression model
The number of umbrella sales for all cities in which annual rainfall is a given amount (for example, 10 cm) will be scattered around the mean.
A probability density function (pdf) will depict how these values are scattered around the mean.
The mean is just one descriptor of a distribution. Another important descriptor is the variance.
The variance is defined as the average of the squared differences between the values of a distribution and the mean. It is essentially a measure of the extent to which values of a distribution are spread out. Mathematically:
$$\operatorname{var}(Y) = E\big[(Y - \mu_Y)^2\big]$$
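As a quick numerical illustration (the sales figures below are hypothetical, not from the lecture), the variance can be computed directly from this definition in a few lines of Python:

```python
# Hypothetical umbrella-sales figures for five cities (illustration only).
values = [10.0, 12.0, 14.0, 16.0, 18.0]

# Variance = average of the squared deviations from the mean.
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)

print(mean, variance)  # → 14.0 8.0
```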
Probability density function for Y
[Figure: probability density functions of Y at X = 10 and X = 25, each centered on the regression line $E(Y \mid X) = \beta_0 + \beta_1 X$.]
This regression function shows the average number of umbrellas sold at different levels of rainfall, in centimeters.
The conditional variance of Y is $\operatorname{var}(Y \mid X) = \sigma^2$ for all values of X.
Probability density function for Y
On the previous slide, the constant variance assumption implies that at each level of rainfall X we are equally uncertain about how far values of Y will be from their average value, $E(Y \mid X) = \beta_0 + \beta_1 X$. Data satisfying this condition are considered to be homoskedastic. If this assumption is not satisfied, the data are considered to be heteroskedastic.
The error term
An observation on Y can be decomposed into two parts:
Systematic component:
$$E(Y \mid X) = \beta_0 + \beta_1 X$$
Random component:
$$e = Y - E(Y \mid X) = Y - \beta_0 - \beta_1 X$$
Rearranging, we obtain the simple linear regression model:
$$Y = \beta_0 + \beta_1 X + e$$
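The decomposition above can be made concrete with a small simulation (the parameter values and rainfall levels below are hypothetical): each observation on Y is a systematic component plus a mean-zero random draw.

```python
import random

random.seed(0)  # reproducible draws

beta0, beta1, sigma = 2.0, 0.5, 1.0   # hypothetical population parameters
X = [5.0, 10.0, 15.0, 20.0, 25.0]     # rainfall in cm (illustration only)

# Each Y is the systematic component beta0 + beta1*X plus a random error e.
errors = [random.gauss(0.0, sigma) for _ in X]
Y = [beta0 + beta1 * x + e for x, e in zip(X, errors)]
```

In a real data set only X and Y are observed; the parameters and the errors are unknown, which is what motivates estimation.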
The error term
Why do we introduce an error term?
- Unavailability of data
- Randomness in human behavior
- Net influence of a large number of small and independent causes
As e is the random component, we know that $E(e) = 0$ because:
$$E(e) = E(Y) - \beta_0 - \beta_1 X = 0$$
We also know that the variances of Y and e are identical and equal to $\sigma^2$, because they differ only by a constant. Thus the pdfs of Y and e are identical in all respects except for their location.
The error term: initial assumptions
Several assumptions are required in order to run the simple linear regression model. Thus far we have assumed that:
1. $Y = \beta_0 + \beta_1 X + e$
2. $E(e) = 0$, which is equivalent to stating that $E(Y) = \beta_0 + \beta_1 X$.
3. $\operatorname{var}(e) = \sigma^2 = \operatorname{var}(Y)$ (homoskedasticity)
Later we will see why these assumptions (and others) are important for our purposes
The population vs. the sample
In practice, the econometrician will possess a sample of Y values corresponding to some fixed X values rather than data from the entire population of values. Therefore the econometrician will never truly know the values of $\beta_0$ and $\beta_1$.
However, we may estimate these parameters. We will denote these estimators as $\hat\beta_0$ and $\hat\beta_1$.
Ordinary least squares
So how shall we find $\hat\beta_0$ and $\hat\beta_1$? We need a method or rule for how to estimate the population parameters using sample data.
The most widely used rule is the method of least squares, or ordinary least squares (OLS). According to this principle, a line is fitted to the data that renders the sum of the squares of the vertical distances from each data point to the line as small as possible.
Ordinary least squares
Therefore the fitted line may be written as:
$$\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i$$
The vertical distances from the fitted line to each data point are the least squares residuals, $\hat{e}_i$. They are given by:
$$\hat{e}_i = Y_i - \hat{Y}_i = Y_i - \hat\beta_0 - \hat\beta_1 X_i$$
Ordinary least squares
Mathematically, we want to find $\hat\beta_0$ and $\hat\beta_1$ such that the sum of the squared vertical distances from the data points to the line is minimized:
$$\min_{\hat\beta_0,\,\hat\beta_1} \sum_i \hat{e}_i^2 = \sum_i (Y_i - \hat{Y}_i)^2 = \sum_i (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$$
If you do not recall how to find the solution for $\hat\beta_0$ and $\hat\beta_1$ using partial derivatives, the steps may be found in the course text.
The least squares estimators
Upon solving this minimization problem we find that:
$$\hat\beta_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}$$
$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$$
where $\bar{Y} = \frac{1}{N}\sum_i Y_i$ and $\bar{X} = \frac{1}{N}\sum_i X_i$ are the sample means of the observations on Y and X.
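The estimators above translate directly into code. The sketch below (plain Python, no libraries) recovers the slope and intercept from a sample; the data points are made up so that they lie exactly on the line Y = 1 + 2X, which the estimator should recover exactly.

```python
def ols(X, Y):
    """Least squares estimates for the simple regression of Y on X."""
    n = len(X)
    xbar = sum(X) / n
    ybar = sum(Y) / n
    # Slope: sum of cross deviations over sum of squared X deviations.
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
          / sum((x - xbar) ** 2 for x in X))
    # Intercept: the fitted line passes through the point of means.
    b0 = ybar - b1 * xbar
    return b0, b1

b0, b1 = ols([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(b0, b1)  # → 1.0 2.0
```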
OLS and the true parameter values
So how are the OLS estimators $\hat\beta_0$ and $\hat\beta_1$ related to $\beta_0$ and $\beta_1$?
If assumptions 1 and 2 from earlier hold, then $E(\hat\beta_0) = \beta_0$ and $E(\hat\beta_1) = \beta_1$ (proof provided in the text).
That is, if we were able to take repeated samples, the expected value of the estimators $\hat\beta_0$ and $\hat\beta_1$ would equal the true parameter values $\beta_0$ and $\beta_1$.
When the expected value of an estimator of a parameter equals the true parameter value, that estimator is unbiased.
Later we will explore how violation of certain assumptions will cause estimators to be biased.
OLS and the true parameter values
So the idea behind OLS is that if we are dealing with an instance in which certain assumptions hold, the expected value of the estimators $\hat\beta_0$ and $\hat\beta_1$ will equal the true parameter values $\beta_0$ and $\beta_1$.
Coefficient of determination
We are interested in a measure that will indicate how well our sample regression line fits the data. Let us define $y = Y - \bar{Y}$, the deviation of a variable from its mean. Using sample data, we note that $\hat{y}_i = \hat{Y}_i - \bar{Y}$ and $\hat{e}_i = Y_i - \hat{Y}_i$. Then:
$$y_i = \hat{y}_i + \hat{e}_i$$
In other words, the amount by which the data deviate from the mean can be broken into an explained portion ($\hat{y}_i$) and an unexplained portion ($\hat{e}_i$).
Coefficient of determination
Using $y_i = \hat{y}_i + \hat{e}_i$, we may square both sides and sum over all observations to obtain:
$$\sum_i y_i^2 = \sum_i \hat{y}_i^2 + \sum_i \hat{e}_i^2$$
(the cross-product term $\sum_i \hat{y}_i \hat{e}_i$ equals zero). We may then define the coefficient of determination $R^2$ as the ratio of explained variation to total variation:
$$R^2 = \frac{\sum_i \hat{y}_i^2}{\sum_i y_i^2} = 1 - \frac{\sum_i \hat{e}_i^2}{\sum_i y_i^2}$$
Coefficient of determination
Therefore we have:
$\sum_i (Y_i - \bar{Y})^2$: Total sum of squares (TSS). A measure of total variation in Y about the mean.
$\sum_i (\hat{Y}_i - \bar{Y})^2$: Explained sum of squares (ESS). The part of total variation in Y about the mean that is explained by the sample regression.
$\sum_i \hat{e}_i^2$: Residual sum of squares (RSS). The part of total variation in Y about the mean that is not explained by the sample regression.
Coefficient of determination
It can also be shown that:
$$R^2 = \frac{\left[\sum_i (X_i - \bar{X})(Y_i - \bar{Y})\right]^2}{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}$$
That is, in the simple regression model, $R^2$ equals the squared sample correlation between X and Y.
Its limits are $0 \le R^2 \le 1$. If $Y_i = \hat{Y}_i$ for each i, then $R^2 = 1$.
How would the regression line appear graphically if $R^2 = 0$? What is the intuition?
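A short script can verify these relationships numerically (the data below are made up for illustration): it computes TSS, ESS, and RSS, then forms R² both as ESS/TSS and as the squared sample correlation between X and Y; in the simple regression model the two coincide.

```python
def r_squared(X, Y):
    """R^2 via the explained/total decomposition, plus the squared correlation."""
    n = len(X)
    xbar, ybar = sum(X) / n, sum(Y) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
    sxx = sum((x - xbar) ** 2 for x in X)
    syy = sum((y - ybar) ** 2 for y in Y)

    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    Yhat = [b0 + b1 * x for x in X]

    tss = syy                                           # total sum of squares
    ess = sum((yh - ybar) ** 2 for yh in Yhat)          # explained sum of squares
    rss = sum((y - yh) ** 2 for y, yh in zip(Y, Yhat))  # residual sum of squares

    return ess / tss, (sxy ** 2) / (sxx * syy), tss, ess, rss

r2, r2_corr, tss, ess, rss = r_squared([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
print(r2)  # → 0.81 (up to floating-point error)
```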
Coefficient of determination
One should remain level-headed upon finding the $R^2$:
- It would not be surprising to find an $R^2$ near 1 when working with particular types of time series data that trend smoothly over time.
- It would not be surprising to find a relatively low $R^2$ when working with microeconomic data involving consumer behavior. Variations in individual behavior may be difficult to fully explain.
There are several other measures that are important when evaluating a model:
- Signs and magnitudes of the estimates
- Precision of the estimates
- The model's predictive value
What makes a good estimator?
Unbiasedness
Earlier we stated that an estimator is unbiased if its mean is equal to the true value of the parameter being estimated
Efficiency
The smaller the variance of an estimator, the better the chance that the estimate is close to the actual value of $\beta$, which is unknown.
What makes a good estimator?
Restricting an estimator to be a linear function of the observations on the dependent variable makes our choice of which unbiased estimator has smallest variance manageable.
An estimator that is linear, unbiased, and that has minimum variance among all linear unbiased estimators is called the best linear unbiased estimator (BLUE).
Assumptions when running OLS
We require several assumptions in order for the OLS estimators to be BLUE:
1. $Y = \beta_0 + \beta_1 X + e$
2. $E(e) = 0$. It is important that the factors not explicitly included in the model, and therefore incorporated into $e$, do not systematically affect the average value of Y. That is, the positive values of $e$ cancel out the negative values, so that their average effect on Y is zero.
3. $\operatorname{var}(e) = \sigma^2 = \operatorname{var}(Y)$. This is the assumption of homoskedasticity. Otherwise, our estimators will not have minimum variance.
Assumptions when running OLS
4. $\operatorname{cov}(e_i, e_j) = E(e_i e_j) = 0$ for $i \neq j$. That is, the covariance between any pair of random errors is zero. Otherwise, our estimators will not have minimum variance.
5. The variable X is not random and must take at least two different values. Without this condition, we cannot run OLS. Quite simply, if there is no variation in the X variable, then we will not be able to explain variation in the Y variable.
6. The values of $e$ are normally distributed about their mean, and therefore Y is normally distributed (this is necessary for hypothesis testing, which we will discuss in a later lecture):
$$e \sim N(0, \sigma^2)$$
Variance
While the econometrician can never be certain that the estimates obtained are equal or close to the true parameters of the model (as the true parameters are unknowable), finding a coefficient with relatively small variance will certainly give him or her more confidence that the estimate is good
That is, given two different distributions of $\hat\beta_1$ with the same mean, we prefer the distribution with smaller variance.
Variance size will be shown to be crucial when testing hypotheses
Variance
Given our previous definition of variance, if our assumptions (1-5) hold, it can be shown that the variances of $\hat\beta_0$ and $\hat\beta_1$ are:
$$\operatorname{var}(\hat\beta_0) = \sigma^2 \frac{\sum_i X_i^2}{N \sum_i (X_i - \bar{X})^2}$$
$$\operatorname{var}(\hat\beta_1) = \frac{\sigma^2}{\sum_i (X_i - \bar{X})^2}$$
How does the extent to which X is spread out relate to these variances?
Variance
In addition, we may be interested in the variance of the random error term e.
The variance of the random error is:
$$\operatorname{var}(e_i) = \sigma^2 = E\big[(e_i - E(e_i))^2\big] = E(e_i^2)$$
Of course, the random errors are unobservable. So how shall we proceed?
Variance
Recall that:
$$\hat{e}_i = Y_i - \hat{Y}_i = Y_i - \hat\beta_0 - \hat\beta_1 X_i$$
We may therefore replace $e_i$ with $\hat{e}_i$:
$$\hat\sigma^2 = \frac{\sum_i \hat{e}_i^2}{N}$$
However, we must modify this formula slightly based on the number of regression parameters, K (what is the intuition?). When dealing with only $\hat\beta_0$ and $\hat\beta_1$, K = 2. Therefore the formula that we use to ensure an unbiased estimator is:
$$\hat\sigma^2 = \frac{\sum_i \hat{e}_i^2}{N - 2}$$
Variance
Now that we have found $\hat\sigma^2$, an unbiased estimator of $\sigma^2$, we may write:
$$\widehat{\operatorname{var}}(\hat\beta_0) = \hat\sigma^2 \frac{\sum_i X_i^2}{N \sum_i (X_i - \bar{X})^2}$$
$$\widehat{\operatorname{var}}(\hat\beta_1) = \frac{\hat\sigma^2}{\sum_i (X_i - \bar{X})^2}$$
The square roots of the estimated variances are the standard errors of $\hat\beta_0$ and $\hat\beta_1$.
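Putting the pieces together, a small function can compute the standard errors from scratch (the sample values are hypothetical): estimate the line, form the unbiased error-variance estimate with the N − 2 divisor, then plug it into the two variance formulas and take square roots.

```python
import math

def std_errors(X, Y):
    """Standard errors of the OLS intercept and slope estimates."""
    n = len(X)
    xbar, ybar = sum(X) / n, sum(Y) / n
    sxx = sum((x - xbar) ** 2 for x in X)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sxx
    b0 = ybar - b1 * xbar

    # Unbiased estimator of the error variance: divide RSS by N - K with K = 2.
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
    sigma2_hat = rss / (n - 2)

    var_b0 = sigma2_hat * sum(x ** 2 for x in X) / (n * sxx)
    var_b1 = sigma2_hat / sxx
    return math.sqrt(var_b0), math.sqrt(var_b1)

se_b0, se_b1 = std_errors([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
```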
Covariance
Earlier in the lecture we defined the variance:
$$\operatorname{var}(X) = E\big[(X - E(X))(X - E(X))\big] = E(X^2) - E(X)E(X)$$
By extension we may define the covariance between two random variables X and Y as:
$$\operatorname{cov}(X, Y) = E\big[(X - E(X))(Y - E(Y))\big] = E(XY) - E(X)E(Y)$$
Positive covariance: When X is above (below) its mean, Y is likely to be above (below) its mean, and vice versa.
Negative covariance: When X is above (below) its mean, Y is likely to be below (above) its mean, and vice versa
Coefficient of correlation
However, interpreting $\operatorname{cov}(X, Y)$ is difficult because it may arbitrarily increase or decrease depending on the units of measurement. We may therefore scale the covariance by the standard deviations of the variables and define the coefficient of correlation as:
$$\rho_{XY} = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)}\sqrt{\operatorname{var}(Y)}} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
Its limits are $-1 \le \rho_{XY} \le 1$, where $|\rho_{XY}| = 1$ indicates a perfect linear relationship between X and Y.
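These definitions can be checked numerically (again with made-up data): for Y chosen as an exact linear function of X, the correlation should equal 1 up to floating-point error.

```python
import math

def cov_corr(X, Y):
    """Sample covariance and correlation coefficient, averaging over N."""
    n = len(X)
    xbar, ybar = sum(X) / n, sum(Y) / n
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in X) / n)  # std. dev. of X
    sy = math.sqrt(sum((y - ybar) ** 2 for y in Y) / n)  # std. dev. of Y
    return cov, cov / (sx * sy)

cov, rho = cov_corr([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])  # Y = 2X exactly
print(cov, rho)  # cov = 4.0, rho ≈ 1.0
```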
Covariance
The covariance between $\hat\beta_0$ and $\hat\beta_1$ is also a measure of the association between the two estimators:
$$\operatorname{cov}(\hat\beta_0, \hat\beta_1) = E\big[(\hat\beta_0 - E(\hat\beta_0))(\hat\beta_1 - E(\hat\beta_1))\big]$$
It can then be shown that:
$$\operatorname{cov}(\hat\beta_0, \hat\beta_1) = -\bar{X}\,\frac{\sigma^2}{\sum_i (X_i - \bar{X})^2}$$
Now that we have explored the theoretical background required to appreciate OLS, let's start working with an actual data set.