
Statistical Analysis SC504/HS927 Spring Term 2008

Transcript
Page 1: Statistical Analysis SC504/HS927 Spring Term 2008

1

Statistical Analysis SC504/HS927 Spring Term 2008

Introduction to Logistic Regression

Dr. Daniel Nehring

Page 2: Statistical Analysis SC504/HS927 Spring Term 2008

2

Outline

Preliminaries: The SPSS syntax
Linear regression and logistic regression
OLS with a binary dependent variable
Principles of logistic regression
Interpreting logistic regression coefficients
Advanced principles of logistic regression (for self-study)

Source: http://privatewww.essex.ac.uk/~dfnehr

Page 3: Statistical Analysis SC504/HS927 Spring Term 2008

3

PRELIMINARIES

Page 4: Statistical Analysis SC504/HS927 Spring Term 2008

4

The SPSS syntax

Simple programming language allowing access to all SPSS operations
Access to operations not covered in the main interface
Accessible through syntax windows
Accessible through ‘Paste’ buttons in every window of the main interface
Documentation available in the ‘Help’ menu

Page 5: Statistical Analysis SC504/HS927 Spring Term 2008

5

Using SPSS syntax files

Saved in a separate file format through the syntax window
Run commands by highlighting them and pressing the arrow button
Comments can be entered into the syntax
Copy-paste operations allow easy learning of the syntax
The syntax is preferable at all times to the main interface, to keep a log of work and to identify and correct mistakes

Page 6: Statistical Analysis SC504/HS927 Spring Term 2008

6

PART I

Page 7: Statistical Analysis SC504/HS927 Spring Term 2008

7

Simple linear regression

Relation between 2 continuous variables

Regression coefficient β1
measures the association between y and x
the amount by which y changes on average when x changes by one unit
estimated by the least squares method

y = α + β1·x1   (the slope is β1)

[Figure: fitted regression line of y against x]

Page 8: Statistical Analysis SC504/HS927 Spring Term 2008

8

Multiple linear regression

Relation between a continuous variable and a set of i continuous variables

Partial regression coefficients βi

Amount by which y changes on average when xi changes by one unit and all the other xis remain constant

Measures association between xi and y adjusted for all other xi

y = α + β1·x1 + β2·x2 + ... + βi·xi

Page 9: Statistical Analysis SC504/HS927 Spring Term 2008

9

Multiple linear regression

ŷ = α + β1·x1 + β2·x2 + ... + βi·xi

ŷ is the predicted / response / dependent variable
the xi are the predictor / explanatory / independent variables

Page 10: Statistical Analysis SC504/HS927 Spring Term 2008

10

OLS with a binary dependent variable

Binary variables can take only 2 possible values:
yes/no (e.g. educated to degree level, smoker/non-smoker)
success/failure (e.g. of a medical treatment)

Coded 1 or 0 (by convention 1 = yes/success)

Using OLS with a binary dependent variable, the predicted values can be interpreted as probabilities and are expected to lie between 0 and 1

But nothing constrains the regression model to predict values between 0 and 1; values less than 0 or greater than 1 are possible and have no logical interpretation

Approaches which ensure that predicted values lie between 0 and 1, such as logistic regression, are therefore required

Page 11: Statistical Analysis SC504/HS927 Spring Term 2008

Fitting the equation to the data

Linear regression: least squares
Logistic regression: maximum likelihood

Likelihood function: estimates the parameters with the property that the likelihood (probability) of the observed data is higher than for any other parameter values
In practice it is easier to work with the log-likelihood:

L(β) = ln l(β) = Σ_{i=1..n} [ y_i ln π(x_i) + (1 − y_i) ln(1 − π(x_i)) ]
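To make the log-likelihood concrete, here is a minimal Python sketch (added for illustration, not from the original slides); the toy data and coefficient values are invented:

import numpy as np

def log_likelihood(beta, X, y):
    # beta: coefficient vector (intercept first); X: n x k matrix with a leading
    # column of ones; y: n-vector of 0/1 outcomes
    z = X @ beta                     # linear predictor a + b1*x1 + ... + bk*xk
    pi = 1.0 / (1.0 + np.exp(-z))    # pi(x_i), the predicted probability
    return np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

# toy example: 4 observations, one predictor
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0, 0, 1, 1])
print(log_likelihood(np.array([-1.5, 1.0]), X, y))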

Page 12: Statistical Analysis SC504/HS927 Spring Term 2008

12

Maximum Likelihood Estimation (MLE)

OLS cannot be used for logistic regression, since the relationship between the dependent and independent variables is non-linear
MLE is used instead to estimate the coefficients on the independent variables (the parameters)
Of all possible values of these parameters, MLE chooses those under which the model would have been most likely to generate the observed sample

Page 13: Statistical Analysis SC504/HS927 Spring Term 2008

13

Logistic regression

Models the relationship between a set of variables xi
dichotomous (yes/no)
categorical (social class, ...)
continuous (age, ...)
and a dichotomous (binary) variable Y

Page 14: Statistical Analysis SC504/HS927 Spring Term 2008

14

PART II

Page 15: Statistical Analysis SC504/HS927 Spring Term 2008

15

Logistic regression (1)

‘Logistic regression’ or ‘logit’
p is the probability of an event occurring
1 − p is the probability of the event not occurring
p can take any value from 0 to 1
the odds of the event occurring = p / (1 − p)
the dependent variable in a logistic regression is the natural log of the odds: ln[ p / (1 − p) ]
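As a quick illustration (added here, not part of the original slides), a few lines of Python convert a probability into the corresponding odds and log odds:

import math

p = 0.2                      # probability of the event occurring
odds = p / (1 - p)           # odds of the event occurring
log_odds = math.log(odds)    # natural log of the odds (the logit)
print(odds, log_odds)        # 0.25 and about -1.386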

Page 16: Statistical Analysis SC504/HS927 Spring Term 2008

16

Logistic regression (2)

ln[ p / (1 − p) ] can take any value, but p will always range from 0 to 1

the equation to be estimated is:

ln[ p / (1 − p) ] = a + b1·x1 + b2·x2 + ... + bk·xk

Page 17: Statistical Analysis SC504/HS927 Spring Term 2008

Logistic regression (3)

[Slide shows the regression equation annotated as the logit of P(y|x), obtained through the logistic transformation]

Page 18: Statistical Analysis SC504/HS927 Spring Term 2008

18

Predicting p

let z = a + b1·x1 + b2·x2 + ... + bk·xk

then, to predict p for individual i,

p_i = e^(z_i) / (1 + e^(z_i)),   where z_i = a + b1·x_i1 + b2·x_i2 + ... + bk·x_ik
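A minimal Python sketch of this prediction step (the coefficient values below are arbitrary placeholders, not the model estimated later in the slides):

import math

def predict_p(a, b, x):
    # predicted probability from a logistic model: p = e^z / (1 + e^z)
    z = a + sum(bj * xj for bj, xj in zip(b, x))   # linear predictor z_i
    return math.exp(z) / (1 + math.exp(z))

# one hypothetical individual with two predictor values
print(predict_p(a=-1.0, b=[0.5, -0.3], x=[2.0, 1.0]))   # about 0.43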

Page 19: Statistical Analysis SC504/HS927 Spring Term 2008

Logistic function (1)

[Figure: the S-shaped logistic curve, with the probability of event y on the vertical axis (0.0 to 1.0) plotted against x]

Page 20: Statistical Analysis SC504/HS927 Spring Term 2008

20

PART III

Page 21: Statistical Analysis SC504/HS927 Spring Term 2008

21

Interpreting logistic regression coefficients

the intercept is the value of the ‘log of the odds’ when all independent variables are zero

each slope coefficient is the change in the log odds from a 1-unit increase in the independent variable, controlling for the effects of the other variables

two problems:
log odds are not easy to interpret
the change in the probability p from a 1-unit increase in one independent variable depends on the values of the other independent variables

but the exponent of b (e^b) does not depend on the values of the other independent variables and is the odds ratio

Page 22: Statistical Analysis SC504/HS927 Spring Term 2008

22

Odds ratio

odds ratio for coefficient on a dummy variable, e.g. female=1 for women, 0 for men

odds ratio = ratio of the odds of event occurring for women to the odds of its occurring for men

odds for women are e^b times the odds for men

Page 23: Statistical Analysis SC504/HS927 Spring Term 2008

23

General rules for interpreting logistic regression coefficients

if b1 > 0, X1 increases p

if b1 < 0, X1 decreases p

if odds ratio >1, X1 increases p

if odds ratio < 1, X1 decreases p

if CI for b1 includes 0, X1 does not have a statistically significant effect on p

if CI for odds ratio includes 1, X1 does not have a statistically significant effect on p

Page 24: Statistical Analysis SC504/HS927 Spring Term 2008

24

An example: modelling the relationship between disability, age and income in the 65+ population

dependent variable = presence of disability (1 = yes, 0 = no)

independent variables:
X1: age in years in excess of 65 (i.e. 65 → 0, 70 → 5)
X2: whether the person has a low income (in the lowest third of the income distribution)

data: Health Survey for England, 2000

Page 25: Statistical Analysis SC504/HS927 Spring Term 2008

25

Example: logistic regression estimates for the probability of being disabled, people aged 65+

Variable          Coeff (b)   Odds ratio   p-value   95% CI on coeff     95% CI on odds ratio
constant            -0.912                  0.000    -1.129 to -0.696
age                  0.078      1.081       0.000     0.060 to  0.095    1.062 to 1.100
has low income      -0.270      0.764       0.003    -0.515 to -0.024    0.597 to 0.976

source: estimated from the Health Survey for England, 2000
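A quick Python check (added for illustration) that each odds ratio in the table is simply the exponent of the corresponding coefficient:

import math

coeffs = {"age": 0.078, "has low income": -0.270}   # coefficients from the table above
for name, b in coeffs.items():
    print(name, round(math.exp(b), 3))
# age -> 1.081; has low income -> 0.763 (the table shows 0.764, computed from the unrounded coefficient)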

Page 26: Statistical Analysis SC504/HS927 Spring Term 2008

26

PART IV

Page 27: Statistical Analysis SC504/HS927 Spring Term 2008

27

Odds, log odds, odds ratios and probabilities

odds = p / (1 − p)

log odds = ln[ p / (1 − p) ] = a + b1·x1 + b2·x2 + ... + bk·xk

odds ratio for variable k = e^(bk)

probability p = e^(a + b1·x1 + ... + bk·xk) / (1 + e^(a + b1·x1 + ... + bk·xk))

Page 28: Statistical Analysis SC504/HS927 Spring Term 2008

28

Odds, odds ratios and probability of disability among non low income people aged 65+

[Figure: chart of probabilities, odds and odds ratios (vertical axis, 0 to 8) against age (65 to 90) for people aged 65+ not on low income. Three series are plotted: the odds, the probability, and the odds ratio compared with age 65. The data labels show the odds rising from 0.40 at age 65 to 2.82 at age 90, the probability from 0.29 to 0.74, and the odds ratio from 1.000 to 7.029; the per-year odds ratio is 1.081.]

Page 29: Statistical Analysis SC504/HS927 Spring Term 2008

29

Odds, odds ratios and probabilities

pj = 0.2, i.e. a 20% probability
oddsj = 0.2 / (1 − 0.2) = 0.2 / 0.8 = 0.25
pk = 0.4
oddsk = 0.4 / 0.6 = 0.67
relative probability/risk: pj / pk = 0.2 / 0.4 = 0.5
odds ratio: oddsj / oddsk = 0.25 / 0.67 = 0.37
the odds ratio is not equal to the relative probability/risk, except approximately when pj and pk are small
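The arithmetic above can be checked with a couple of lines of Python (added for illustration):

p_j, p_k = 0.2, 0.4
odds_j = p_j / (1 - p_j)     # 0.25
odds_k = p_k / (1 - p_k)     # about 0.67
print(p_j / p_k)             # relative risk: 0.5
print(odds_j / odds_k)       # odds ratio: 0.375, clearly not equal to the relative risk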

Page 30: Statistical Analysis SC504/HS927 Spring Term 2008

30

Points to note from logit example.xls

if you see an odds ratio of e.g. 1.5 for a dummy variable indicating female, beware of saying ‘women have a probability 50% higher than men’; only if both p’s are small can you say this
it is better to calculate probabilities for example cases and compare these

Page 31: Statistical Analysis SC504/HS927 Spring Term 2008

31

Predicting p

let z = a + b1·x1 + b2·x2 + ... + bk·xk

then, to predict p for individual i,

p_i = e^(z_i) / (1 + e^(z_i)),   where z_i = a + b1·x_i1 + b2·x_i2 + ... + bk·x_ik

Page 32: Statistical Analysis SC504/HS927 Spring Term 2008

32

E.g.: predicting a probability from our model

Predict disability for someone on a low income aged 75:
Add up the linear equation: a (= -0.912) + 10 × 0.078 (age over 65) + 1 × (-0.270) = -0.402
Take the exponent of it to get the odds of being disabled: e^(-0.402) = 0.669
Put the odds over 1 + the odds to give the probability: 0.669 / 1.669 ≈ 0.40, or a 40 per cent chance of being disabled
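This worked example can be reproduced with a short Python snippet (using the coefficients reported in the table on page 25):

import math

a, b_age, b_lowinc = -0.912, 0.078, -0.270   # estimates from the table on page 25
age_over_65, low_income = 10, 1              # aged 75, on a low income

z = a + b_age * age_over_65 + b_lowinc * low_income   # -0.402
odds = math.exp(z)                                    # about 0.669
p = odds / (1 + odds)                                 # about 0.40
print(z, odds, p)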

Page 33: Statistical Analysis SC504/HS927 Spring Term 2008

33

Goodness of fit in logistic regressions

based on improvements in the likelihood of observing the sample
use a chi-square test with the test statistic

χ² = −2 ln( L_R / L_U )

where R and U indicate the restricted and unrestricted models
unrestricted – all independent variables in the model
restricted – all or a subset of the variables excluded from the model (their coefficients restricted to be 0)
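A minimal Python sketch of this likelihood-ratio test; the two log-likelihood values and the single restricted coefficient below are invented purely for illustration:

from scipy.stats import chi2

# hypothetical maximised log-likelihoods of two fitted models
loglik_unrestricted = -1180.5   # all independent variables included
loglik_restricted = -1205.2     # one variable dropped (its coefficient restricted to 0)

lr_stat = -2 * (loglik_restricted - loglik_unrestricted)   # = -2 ln(L_R / L_U)
df = 1                                                     # number of restricted coefficients
p_value = chi2.sf(lr_stat, df)                             # upper-tail chi-square probability
print(lr_stat, p_value)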

Page 34: Statistical Analysis SC504/HS927 Spring Term 2008

34

Statistical significance of coefficient estimates in logistic regressions

calculated using standard errors, as in OLS:

t = b / se(b),   where b is the estimated coefficient and se(b) its standard error

for large n, t > 1.96 means that an estimate this far from zero would occur with 5% or lower probability if the true value of the coefficient were 0, i.e. p < 0.05
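For instance, using the age coefficient from the table on page 25, with its standard error backed out of the reported confidence interval (an approximation added here for illustration), the t statistic is:

b_age = 0.078                              # coefficient on age from the table on page 25
ci_low, ci_high = 0.060, 0.095             # its reported 95% confidence interval
se_age = (ci_high - ci_low) / (2 * 1.96)   # approximate standard error backed out of the CI

t = b_age / se_age
print(round(se_age, 4), round(t, 1))       # roughly se = 0.0089 and t = 8.7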

Page 35: Statistical Analysis SC504/HS927 Spring Term 2008

35

95% confidence intervals for logistic regression coefficient estimates

CI for a coefficient: b ± 1.96 · se(b)

For CIs of odds ratios, calculate the CIs for the coefficients and take their exponents
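Continuing the previous sketch, the 95% CI for the age coefficient and the corresponding CI for its odds ratio can be reproduced approximately (the standard error is again backed out of the table, so the numbers are illustrative):

import math

b_age, se_age = 0.078, 0.0089         # coefficient and approximate standard error for age
ci_low = b_age - 1.96 * se_age        # about 0.061
ci_high = b_age + 1.96 * se_age       # about 0.095

# CI for the odds ratio: exponentiate the endpoints of the coefficient CI
print(round(math.exp(ci_low), 3), round(math.exp(ci_high), 3))   # roughly 1.062 to 1.100, matching the table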
