PY1PR2 lecture 3: Simple regression Dr David Field.


Summary

• Andy Field covers simple regression in the first half of chapter 7

• Regression is a technique you can use when you want to predict the value of a variable from the value of another variable

• Relationship between correlation and regression

• Finding the “line of best fit”

• Interpreting regression results and using them to make predictions

• Assessing how good the fit is

Regression vs. Correlation

• The three highlighted scatter plots all show examples of perfect positive correlation

• But there is an obvious difference between them: the slope

• Simple regression assesses the slope of the relationship between two variables, while correlation assesses how strong / tight the relationship is

How regression measures the slope

• Make the assumption that the slope of the relationship can be described by a straight line

– if the scatter plot of two variables shows a curved relationship you can’t use regression

• Find the “best fitting” line, and then the gradient of that line is the measure of the slope

• There are an infinite number of possible straight lines you could draw on the scatter plot of two variables

– regression includes a method of finding the best one

Example data

• Today’s lecture, and the regression workshop, will make use of some real data to illustrate how simple regression is used to explore the relationship between variables

• The sample is a sample of rich, developed nations

• The variables are

– average annual income per head

– income inequality

– % of population suffering from any form of mental illness during the past year (WHO world mental health survey)

– life expectancy (workshop)

• Source: Wilkinson & Pickett (2009) The Spirit Level

Variation between the per capita incomes of rich countries

[Figure: average income per head in $ per year (axis from 10000 to 40000) for USA, Norway, Switzerland, Denmark, Ireland, Austria, Canada, Belgium, Japan, Netherlands, Australia, UK, Germany, France, Finland, Sweden, Italy, Singapore, Spain, New Zealand, Greece, Israel, Portugal]

Income Gap – the 20:20 ratio

• Average annual income is much larger in some developed countries than others

• Countries also differ in terms of how wide the spread around the average income is

– In some countries there is a great deal of variation around the mean (large SD)

– In other countries there is a small SD

• Economists think about this in terms of inequality

– How rich is rich?

– Subjectively, in some countries, rich means double the average income

– In other countries rich means 4 times the average income

• Quantifying inequality as an index

– The top 20% of earners in a country are defined as “rich”

– The bottom 20% are defined as “poor”

– The mean income of the top 20% is then divided by the mean income of the bottom 20%, producing an “inequality ratio”

Income and inequality as predictor variables

• Wilkinson & Pickett (2009) analysed the relationship between income, income inequality, and the prevalence of a range of health and social problems in different countries

– e.g. homicide rate, imprisonment rate, teenage birth rate, social mobility

• To illustrate linear regression, we will focus on the relationship between income, income inequality, and a psychological variable:

– % of population suffering from any form of mental illness during the past year

– The scatter plot indicates that mental health problems are more prevalent in more unequal societies

– To quantify the relationship with regression, the first step is to find the “line of best fit”

– The line of best fit can be estimated by eye

– This line is obviously a poor summary of the trend on the graph

– The slope of the line is much too steep

– This line looks like it captures the relationship fairly well

– The solid red line captures the relationship fairly well

– But possibly the green dotted line does a better job

– Regression uses a mathematical technique to decide which of all the possible lines is the best fitting line

The method of least squares

The method of least squares can be used to decide which of these two lines provides a better model of the data

• Calculate the difference between each data point and the line, in terms of the predicted variable (mental health problems)

• Positive numbers mean the model has overestimated, negative numbers mean it has underestimated

• To measure the fit, square the difference scores and sum them (why square the difference scores?)

• Next, calculate the sum of squared differences from the line on the right

• Whichever line has a smaller total squared difference score is a better model of the data

• There are always an infinite number of lines to compare, so regression uses a mathematical technique to find the one that minimizes the squared differences
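The steps above can be sketched in a few lines of Python. This is only an illustration: the data points and the two "by eye" candidate lines are invented, and the function names are hypothetical. The closed-form formulas in least_squares_line are the standard textbook solution for the line that minimizes the sum of squared differences; they are not taken from the slides.

```python
# Hypothetical (predictor, outcome) values, invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [8.0, 13.0, 15.0, 21.0, 23.0]


def sum_squared_differences(b0, b1, xs, ys):
    """Sum of squared vertical differences between the data and the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Intercept and slope of the line that minimizes the sum of squared differences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Two candidate lines, as if drawn by eye (intercept, slope).
ss_line_a = sum_squared_differences(2.0, 5.0, xs, ys)
ss_line_b = sum_squared_differences(5.0, 3.5, xs, ys)

# The least squares line cannot do worse than either candidate.
best_b0, best_b1 = least_squares_line(xs, ys)
ss_best = sum_squared_differences(best_b0, best_b1, xs, ys)

print("candidate A:", ss_line_a)
print("candidate B:", ss_line_b)
print("least squares line:", best_b0, best_b1, ss_best)
```

Squaring stops positive and negative differences from cancelling each other out, which is one answer to the question posed above.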

Describing the line of best fit

• Once you have obtained the line of best fit it can be drawn on a scatter plot, and it can be described by two numbers

– the first number, called the intercept or b0, is the value of the predicted variable (e.g. mental health) when the predictor has a value of 0, i.e. the point where the line crosses the vertical axis of the graph

– the second number, called b1, is the gradient or slope of the line, and tells you how much the y axis value of the best fit line changes when you increase the value on the x axis by 1 unit

• Together, they are known as the regression coefficients

[Figure: two straight lines plotted with Outcome on the y axis (0 to 16) against Predictor on the x axis]

• The value of the intercept, b0, for both lines is 8

• The solid red line has a positive value of b1 (gradient or slope)

• For the dashed line, b1 is a negative number, indicating that as the predictor increases the predicted value decreases

[Figure: three straight lines plotted with Outcome on the y axis (0 to 16) against Predictor on the x axis (0 to 10)]

• These 3 lines all have the same positive slope value, b1

• But each has a different intercept, b0

The line of best fit as a “model” of the data

• A statistical model is a way of describing the most important aspects of a set of data that is simpler than the data itself

– a straight line is simpler than the scatter plot it summarizes

• Like all statistical models, the line of best fit can be used to predict values of the outcome for a specific value of the predictor

– outcome = b0 + (b1 * specific value of predictor)

– see later for examples

• Best fit line for prediction of mental health by inequality: b0 = 6.6, b1 = 3.7

– b0 is in the same units as the predicted variable, % in this case

– b1 is 3.7% per unit of the predictor variable

– In other words, moving from an inequality ratio of 3 to 4 increases the predicted rate of mental illness by 3.7 percentage points

Using the model to predict values

• Data for the % of population suffering from any form of mental illness during the past year were not available for Greece

– but we do know that Greece has an income inequality ratio of 6.19, which is 3.28 units higher than the most equal country, Japan

– We can use the formula for straight lines, combined with the values of b0 and b1, to predict the level of mental illness in Greece (the calculation uses 3.28 rather than 6.19, so the predictor appears to be scored as units above Japan)

– b0 + (b1 * 3.28)

– 6.6 + (3.7 * 3.28) = 18.74%

• You can also check the predicted value for Greece graphically

– draw a line up from the x axis to meet the regression line at the point corresponding to the predictor value for Greece

– draw a line across to the y axis and read off the predicted value
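The Greece prediction can be reproduced with a short Python sketch. The coefficients 6.6 and 3.7 and the predictor value 3.28 come from the slides; the function and variable names are hypothetical, and the sketch assumes, as the worked calculation does, that the predictor is entered as units above Japan.

```python
B0 = 6.6  # intercept from the slides, in % of population
B1 = 3.7  # slope from the slides, in % per unit of the inequality ratio


def predict_mental_illness(units_above_japan):
    """Predicted % with mental illness: outcome = b0 + (b1 * predictor value)."""
    return B0 + B1 * units_above_japan


greece_prediction = predict_mental_illness(3.28)
print(round(greece_prediction, 2))
```

Calling the same function with a predictor value far outside the range of the sample would be extrapolation, which the next slide asks you to think about.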

Making predictions for extreme values of the predictor

• Imagine we obtained a measurement of income inequality for two new countries

– a capitalist country with no welfare state and zero taxation, inequality ratio 18 (much higher than any country in our sample)

– a communist country, inequality ratio 1.5 (less than half the most equal country in our sample, Japan)

• We could use the equation for straight lines to predict levels of mental illness for both countries

– Doing this is referred to as “extrapolation”

– Can you think of any problems with doing this, or reasons for caution?

How well does the model fit?

• Regression is guaranteed to find the best fitting straight line, but it might still be a poor fit if the two variables are only weakly related

• There are two ways of assessing the model

1) R², the proportion of the total variance in the predicted variable that the best fitting line accounts for

2) a null hypothesis test:

• what is the probability of obtaining a b1 value at least as large as the one in the sample if the true value of b1 in the population is zero?

• a b1 of 0 means that as the value of the predictor increases the value of the predicted variable stays the same

• (as inequality increases mental illness stays the same)

– Let’s compare the fit of two models that predict the proportion of the population with mental health problems

Predicting mental health from income

b1 = 0.65

b0 = -1.7 (zero income = less than zero mental health problems?!?!)

Assessing the two models of mental health: R²

• To assess the line of best fit, it is compared to an even simpler model of the predicted variable

– If I had mental health data for the 11 countries on the scatter plot, but no data about income or inequality, and I was asked to predict the level of mental health problems in Greece, my best guess would be the mean of the mental health data

– The simplest possible model of the predicted variable is its mean

– Calculate the sum of the squared deviations from the mean (total sum of squares)

– Calculate the sum of the squared deviations from the line of best fit (residual sum of squares)

– If the line is a good model, the residual sum should be much smaller than the total sum

The line of best fit versus the mean

mean of Y

Can you find any countries where the mean is a better model of mental health problems than the line of best fit for inequality?

The line of best fit versus the mean

Can you find any countries where the mean is a better model of mental health problems than the line of best fit for income?

For how many countries is the model obviously better than the mean?

Calculating R²

• R² is a descriptive statistic that describes how much better the model is at explaining variation in the predicted variable than using the mean as a model

• It is expressed as a proportion of the total variation (variance) in the predicted variable

– therefore it has a maximum of 1 and a minimum of 0

R² = (total sum of squares - residual sum of squares) / total sum of squares

Note: in simple regression, the square root of R² is the absolute value of the Pearson correlation of the two variables (the sign of the correlation is the same as the sign of b1)
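As a rough illustration of the calculation, the sketch below computes the total and residual sums of squares, R², and the Pearson correlation for a small data set. The data are invented and the variable names are hypothetical; nothing here uses the lecture's real country data.

```python
import math

# Hypothetical (predictor, outcome) values, invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [8.0, 13.0, 15.0, 21.0, 23.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Line of best fit by least squares.
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Squared deviations from the mean (the simplest model).
total_ss = sum((y - mean_y) ** 2 for y in ys)
# Squared deviations from the line of best fit.
residual_ss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

r_squared = (total_ss - residual_ss) / total_ss

# Pearson correlation computed directly, to compare with the square root of R².
pearson_r = sxy / math.sqrt(sxx * total_ss)

print("R squared:", round(r_squared, 4))
print("sqrt(R squared):", round(math.sqrt(r_squared), 4))
print("Pearson r:", round(pearson_r, 4))
```

The same relationship can be checked against the values reported on the next slide: the square root of 0.55 is about 0.74, and the square root of 0.16 is 0.40, which matches the reported 0.39 once rounding of R² is allowed for.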

Which variable is a better predictor?

R² = 0.16, correlation = 0.39

R² = 0.55, correlation = 0.74

Is b1 significantly different from zero?

• If the true value of b1 in the population is zero, what is the probability that a random sample of this size would have a value of b1 as big or bigger than the observed value?

• The p value is provided by a t test

• t = b1 / standard error of b1

• standard error of b1 is based on the SD of the residual (deviation) scores

– if the data points are close to the line of best fit SE will be small, if far away, SE will be large

• otherwise identical to the t test for the difference between two sample means

• if p < 0.05 we reject the null hypothesis that b1 is zero, which supports the hypothesis that the predictor variable is useful in estimating the value of the outcome
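The sketch below shows one way the t statistic and its p value could be computed. Several things in it are assumptions rather than content from the slides: the data are invented, the function names are hypothetical, the standard error uses the usual textbook formula (residual SD with n - 2 degrees of freedom, divided by the square root of the sum of squared deviations of the predictor), and the p value is obtained by numerically integrating the t distribution so that only the Python standard library is needed.

```python
import math

# Hypothetical (predictor, outcome) values, invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [8.0, 13.0, 15.0, 21.0, 23.0]


def slope_t_statistic(xs, ys):
    """Return (t, df) for the test of b1 against zero in simple regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    residual_ss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    df = n - 2
    residual_sd = math.sqrt(residual_ss / df)  # SD of the residual scores
    se_b1 = residual_sd / math.sqrt(sxx)       # standard error of b1
    return b1 / se_b1, df


def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=10000):
    """Two-tailed p value, using Simpson's rule on the density from 0 to |t|."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area = total * h / 3
    return 1 - 2 * area


t_value, df = slope_t_statistic(xs, ys)
p_value = two_tailed_p(t_value, df)
print(f"t({df}) = {t_value:.2f}, p = {p_value:.4f}")

# p values for the two t statistics reported in the lecture, both with 9 df.
p_for_t_1_29 = two_tailed_p(1.29, 9)
p_for_t_3_30 = two_tailed_p(3.3, 9)
print(round(p_for_t_1_29, 2), round(p_for_t_3_30, 3))
```

The last two lines should come out close to the p values reported on the next slide, which is a useful sanity check on the numerical integration.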

Are the predictors statistically significant?

t(9) = 1.29, p = 0.23, NS

t(9) = 3.3, p = 0.009

The simple linear regression model

• As we saw earlier, simple regression is simply a model of the data as a straight line:

– Outcome = b0 + (b1 * specific value of predictor)

• But that equation is not quite complete, because the regression model needs to reflect the fact that the data points rarely all lie exactly on the line

• Therefore, you will usually see the regression equation written as some symbolic variant of:

– Outcome = b0 + (b1 * specific value of predictor) + “residual error”

• residual error refers to the deviation score from the model (the vertical lines drawn on the scatter plots earlier)
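To make the role of the residual concrete, the sketch below splits each outcome into the part predicted by the line and the residual error. The data are invented, the names are hypothetical, and the coefficients 4.6 and 3.8 are simply the least squares values for this invented data set.

```python
# Hypothetical (predictor, outcome) values, invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [8.0, 13.0, 15.0, 21.0, 23.0]

# Least squares coefficients for the invented data above.
b0 = 4.6
b1 = 3.8

rows = []
for x, y in zip(xs, ys):
    predicted = b0 + b1 * x     # the point on the line
    residual = y - predicted    # vertical distance from the line to the data point
    rows.append((x, y, predicted, residual))
    print(f"x={x}: outcome {y} = predicted {predicted:.1f} + residual {residual:+.1f}")

# For a least squares line the residuals sum to (approximately) zero.
residual_sum = sum(row[3] for row in rows)
print("sum of residuals:", round(residual_sum, 10))
```

Each outcome is recovered exactly by adding its residual back onto the predicted value, which is all the extra term in the equation says.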

Things to bear in mind

• Often, you can’t be sure that the predicted variable is being caused by the predictor

– it might be the other way round (and you can swap the predictor and predicted around and run the analysis again)

– in some cases, e.g. inequality and mental health, it does not make sense to run the regression the other way around

• Nobody would claim that mental health problems cause inequality

• but you might make an argument that a 3rd variable causes the variation in both observed variables

• The predicted variable should be a continuous variable measured on an interval or ratio scale

• If a scatter plot suggests a non-linear relationship, you can’t use simple regression

If you’d like to evaluate the effects of inequality and other variables yourself more data is available in the book or on the website

e.g. more unequal US states have higher levels of health and social problems

Statistical terms for revision

• model

• method of least squares

• residual

• regression coefficients

– intercept, b0

– slope or gradient, b1

• best fitting line

• extrapolation

• R²

