
CHAPTER 1: BINARY LOGIT MODEL

Prof. Alan Wan


Table of contents

1. Introduction
   1.1 Dichotomous dependent variables
   1.2 Problems with OLS
2. Odds versus probability
3. The Logit model
   3.1 Basic elements
   3.2 Maximum likelihood estimation
   3.3 PROC LOGISTIC
       3.3.1 SAS codes and basic outputs
       3.3.2 Wald test for individual significance
       3.3.3 Likelihood-ratio, LM and Wald tests for overall significance
       3.3.4 Odds ratio estimates
       3.3.5 AIC, SC and Generalised R2
       3.3.6 Association of predicted probabilities and observed responses
       3.3.7 Hosmer-Lemeshow test statistic
4. Class exercises


Introduction

Motivation for Logit model:

- Dichotomous dependent variables;
- Problems with Ordinary Least Squares (OLS) in the face of dichotomous dependent variables;
- Alternative estimation techniques



Dichotomous dependent variables

Often variables in social sciences are dichotomous:

- employed vs. unemployed
- married vs. unmarried
- guilty vs. innocent
- voted vs. didn’t vote


Dichotomous dependent variables

- Social scientists frequently wish to estimate regression models with a dichotomous dependent variable;
- Most researchers are aware that something is wrong with OLS in the face of a dichotomous dependent variable, but they do not know what makes dichotomous variables problematic in regression, and what other methods are superior


Dichotomous dependent variables

- The focus of this chapter is on binary Logit models (or logistic regression models) for dichotomous dependent variables;
- Logits have many similarities to OLS, but there are also fundamental differences


Problems with OLS

- Examine why OLS regression runs into problems when the dependent variable is 0/1.
- Example
  - Dataset: penalty.txt
  - Comprises 147 penalty cases in the state of New Jersey;
  - In all cases the defendant was convicted of first-degree murder with a recommendation by the prosecutor that a death sentence be imposed;
  - A penalty trial is conducted to determine if the defendant should receive a death penalty or life imprisonment;


Problems with OLS

- The dataset comprises the following variables:
  - DEATH: 1 for a death sentence, 0 for a life sentence
  - BLACKD: 1 if the defendant was black, 0 otherwise
  - WHITVIC: 1 if the victim was white, 0 otherwise
  - SERIOUS: an average rating of the seriousness of the crime evaluated by a panel of judges, ranging from 1 (least serious) to 15 (most serious)
- The goal is to regress DEATH on BLACKD, WHITVIC and SERIOUS;


Problems with OLS

- Note that DEATH, which has only two outcomes, follows a Bernoulli(p) distribution, with p being the probability of a death sentence. Let Y = DEATH; then

  Pr(Y = y) = p^y (1 − p)^(1−y),  y = 0, 1

- Recall that Bernoulli trials lead to the Binomial distribution: if we repeat the Bernoulli(p) trial n times and count the number of successes W, then W follows a Binomial B(n, p) distribution, i.e.,

  Pr(W = w) = nCw p^w (1 − p)^(n−w),  0 ≤ w ≤ n

- So the Bernoulli distribution is a special case of the Binomial distribution with n = 1.


Problems with OLS

data penalty;
   infile 'd:\teaching\ms4225\penalty.txt';            /* read the raw data file */
   input DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;   /* variables in the order they appear */
PROC REG;
   MODEL DEATH=BLACKD WHITVIC SERIOUS;                 /* OLS regression of DEATH on the covariates */
RUN;


Problems with OLS

The REG Procedure
Model: MODEL1
Dependent Variable: DEATH

Analysis of Variance
                               Sum of        Mean
Source             DF         Squares      Square    F Value    Pr > F
Model               3         2.61611     0.87204       4.11    0.0079
Error             143        30.37709     0.21243
Corrected Total   146        32.99320

Root MSE            0.46090    R-Square    0.0793
Dependent Mean      0.34014    Adj R-Sq    0.0600
Coeff Var         135.50409

Parameter Estimates
                      Parameter    Standard
Variable        DF     Estimate       Error    t Value    Pr > |t|
Intercept        1     -0.05492     0.12499      -0.44      0.6610
BLACKD           1      0.12197     0.08224       1.48      0.1403
WHITVIC          1      0.05331     0.08411       0.63      0.5272
SERIOUS          1      0.03840     0.01200       3.20      0.0017


Problems with OLS

- The coefficient of SERIOUS is positive and very significant;
- Neither of the two racial variables is significantly different from zero;
- R2 is low;
- The F-test indicates overall significance of the model;
- But.... can we trust these results?


Problems with OLS

- Note that if y is a 0/1 variable, then

  E(yi) = 1 × Pr(yi = 1) + 0 × Pr(yi = 0)
        = 1 × pi + 0 × (1 − pi)
        = pi.

- But based on the linear regression yi = β1 + β2Xi + εi,

  E(yi) = E(β1 + β2Xi + εi)
        = β1 + β2Xi + E(εi)
        = β1 + β2Xi.

- Therefore, pi = β1 + β2Xi. This is commonly referred to as the linear probability model (LPM).


Problems with OLS

- Accordingly, from the SAS results, a one-point increase in the SERIOUS scale is associated with a 0.038 increase in the probability of a death sentence; the probability of a death sentence for blacks is 0.12 higher than for non-blacks, ceteris paribus. But do these results make sense?
- The LPM pi = β1 + β2Xi is actually implausible because pi is postulated to be a linear function of Xi and thus has no upper and lower bounds. Accordingly, pi (which is a probability) can be greater than 1 or smaller than 0!!
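One simple way to see the problem in this example is to save the LPM's fitted values and inspect their range. A minimal sketch is given below; it assumes the penalty data set read in earlier, and the output data set name lpm and variable name phat are arbitrary choices, not part of the original program.

PROC REG DATA=penalty;
   MODEL DEATH=BLACKD WHITVIC SERIOUS;
   OUTPUT OUT=lpm P=phat;       /* save the fitted values, i.e. the estimated probabilities */
RUN;
PROC MEANS DATA=lpm MIN MAX;
   VAR phat;                    /* fitted values below 0 or above 1 expose the LPM's weakness */
RUN;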


Odds versus probability

- Odds of an event: the ratio of the expected number of times that an event will occur to the expected number of times it will not occur;
- For example, an odds of 4 means we expect 4 times as many occurrences as non-occurrences; an odds of 5/2 (or 5 to 2) means we expect 5 occurrences to 2 non-occurrences;
- Let p be the probability of an event occurring and o the corresponding odds; then o = p/(1 − p) or p = o/(1 + o);



- Relationship between probability and odds:

  Probability   Odds
     0.1        0.11
     0.2        0.25
     0.3        0.43
     0.4        0.67
     0.5        1.00
     0.6        1.50
     0.7        2.33
     0.8        4.00
     0.9        9.00

- o < 1 ⇔ p < 0.5 and o > 1 ⇔ p > 0.5;
- 0 ≤ o < ∞ although 0 ≤ p ≤ 1


Odds versus probability

- Death sentence by race of defendant for 147 penalty trials:

           blacks   non-blacks   total
  death        28           22      50
  life         45           52      97
  total        73           74     147

- oD = 50/97 = 0.52; oD|B = 28/45 = 0.62; and oD|NB = 22/52 = 0.42;
- Hence the ratio of blacks' odds of death to non-blacks' odds of death is 0.62/0.42 = 1.476;
- This means the odds of a death sentence for blacks are 47.6% higher than for non-blacks, or equivalently, the odds of a death sentence for non-blacks are about 0.68 (= 0.42/0.62) times the corresponding odds for blacks
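As a check, the same 2×2 table and an odds-ratio estimate can be requested directly in SAS; a minimal sketch, assuming the penalty data set created earlier:

PROC FREQ DATA=penalty;
   TABLES BLACKD*DEATH / RELRISK;   /* cross-tabulation plus the case-control (odds ratio) estimate */
RUN;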



Logit model: basic elements

- The Logit model is based on the following cumulative distribution function of the logistic distribution:

  pi = 1 / (1 + e^(−(β1 + β2Xi)));

- Let Zi = β1 + β2Xi; then

  pi = 1 / (1 + e^(−Zi)) = F(β1 + β2Xi) = F(Zi);

- As Zi ranges from −∞ to ∞, pi ranges between 0 and 1;
- pi is non-linearly related to Zi.


Logit model: basic elements

- Graph of the Logit with β1 = 0 and β2 = 1:

  [Figure: S-shaped logistic curve, pi plotted against Zi for Zi from −4 to 4, rising from near 0 to near 1]
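The curve is straightforward to reproduce; the sketch below (the data set name logit_curve is an arbitrary choice) evaluates the logistic CDF over a grid and plots it:

DATA logit_curve;
   DO z = -4 TO 4 BY 0.1;
      p = 1/(1 + exp(-z));     /* logistic CDF with beta1 = 0, beta2 = 1 */
      OUTPUT;
   END;
RUN;
PROC SGPLOT DATA=logit_curve;
   SERIES X=z Y=p;             /* S-shaped curve rising from near 0 to near 1 */
RUN;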


Logit model: basic elements

- Note that e^Zi = pi/(1 − pi), the odds of an event;
- So, ln(pi/(1 − pi)) = Zi = β1 + β2Xi; in other words, the log of the odds is linear in Xi, although pi and Xi have a non-linear relationship. This is different from the LPM.


Logit model: basic elements

- For a linear model yi = β1 + β2Xi + εi, ∂yi/∂Xi = β2, a constant;
- But for a Logit model, pi = F(β1 + β2Xi), so

  ∂pi/∂Xi = ∂F(β1 + β2Xi)/∂Xi
          = F′(β1 + β2Xi) β2
          = f(β1 + β2Xi) β2,

  where f(.) is the probability density function of the logistic distribution.
- As f(β1 + β2Xi) is always positive, the sign of β2 indicates the direction of the relationship between pi and Xi.


Logit model: basic elements

- Note that for the Logit model

  f(β1 + β2Xi) = e^(−Zi) / (1 + e^(−Zi))^2
               = F(β1 + β2Xi)(1 − F(β1 + β2Xi))
               = pi(1 − pi)

- Therefore, ∂pi/∂Xi = β2 × pi(1 − pi). In other words, a 1-unit change in Xi does not produce a constant effect on pi.
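For instance, using the SERIOUS coefficient of 0.1871 reported later in the PROC LOGISTIC output, a one-point increase in SERIOUS raises pi by roughly 0.1871 × 0.3 × 0.7 ≈ 0.039 when pi = 0.3, but by only 0.1871 × 0.9 × 0.1 ≈ 0.017 when pi = 0.9; the effect shrinks as pi approaches 0 or 1.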


Maximum Likelihood estimation

- Note that yi only takes on the values 0 and 1, so the observed odds yi/(1 − yi), and hence the log-odds, are undefined and OLS is not an appropriate method of estimation. Maximum likelihood (ML) estimation is usually the technique to adopt;
- ML principle: choose as estimates the parameter values which would maximise the probability of what we have already observed;
- Steps of ML estimation: first, construct the likelihood function by expressing the probability of observing the data as a function of the unknown parameters; second, find the values of the unknown parameters that make the value of this expression as large as possible.


Maximum Likelihood estimation

- The likelihood function is given by

  L = Pr(y1, y2, ..., yn)
    = Pr(y1)Pr(y2)....Pr(yn), assuming independent sampling
    = ∏(i=1 to n) Pr(yi)

- But by definition, Pr(yi = 1) = pi and Pr(yi = 0) = 1 − pi. Therefore, Pr(yi) = pi^yi (1 − pi)^(1−yi)


Maximum Likelihood estimation

- So,

  L = ∏(i=1 to n) Pr(yi) = ∏(i=1 to n) pi^yi (1 − pi)^(1−yi)
    = ∏(i=1 to n) (pi/(1 − pi))^yi (1 − pi)

- It is usually easier to maximise the log of L than L itself. Taking logs of both sides yields

  lnL = Σ(i=1 to n) [ yi ln(pi/(1 − pi)) + ln(1 − pi) ]
      = Σ(i=1 to n) yi ln(pi/(1 − pi)) + Σ(i=1 to n) ln(1 − pi)


Maximum Likelihood estimation

- Substituting pi = 1/(1 + e^(−(β1 + β2Xi))) into lnL leads to

  lnL = β1 Σ(i=1 to n) yi + β2 Σ(i=1 to n) Xi yi − Σ(i=1 to n) ln(1 + e^(β1 + β2Xi))

- There are no closed-form solutions for β1 and β2 when maximising lnL;
- Numerical optimisation is required; SAS uses Fisher's Scoring, which is similar in principle to the Newton-Raphson algorithm.
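Although not needed in practice (PROC LOGISTIC handles this automatically), one way to see that the estimates really do maximise this lnL is to hand the Bernoulli likelihood to a general ML routine. The sketch below is an illustration only; the parameter names b0-b3 are arbitrary choices.

PROC NLMIXED DATA=penalty;
   PARMS b0=0 b1=0 b2=0 b3=0;                        /* start all coefficients at zero */
   eta = b0 + b1*BLACKD + b2*WHITVIC + b3*SERIOUS;   /* linear predictor Z_i */
   p   = 1/(1 + exp(-eta));                          /* logistic CDF */
   MODEL DEATH ~ BINARY(p);                          /* Bernoulli log-likelihood */
RUN;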


Maximum Likelihood estimation

- Suppose θ is a univariate unknown parameter to be estimated. The Newton-Raphson algorithm derives estimates based on the formula

  θ̂new = θ̂old − H^(−1)(θ̂old) U(θ̂old),

  where H(.) and U(.) are the second and first derivatives of the objective function with respect to θ. The algorithm stops when the estimates from successive iterations converge;
- Consider a simple example, where g(θ) = −θ^3 + 3θ^2 − 5. So, U(θ) = −3θ(θ − 2) and H(θ) = −6(θ − 1);
- The actual maximum and minimum of g(θ) are located at θ = 2 and θ = 0 respectively;


Maximum Likelihood estimation

- Step 1: Choose an arbitrary initial starting value, say, θ̂initial = 1.5. So, U(1.5) = 2.25 and H(1.5) = −3. The new estimate of θ is therefore θ̂new = 1.5 − 2.25/(−3) = 2.25;
- Step 2: θ̂old = 2.25. So, U(2.25) = −1.6875 and H(2.25) = −7.5. The new estimate of θ is θ̂new = 2.25 − (−1.6875)/(−7.5) = 2.025;
- Continue with Steps 3, 4 and so on until convergence;
- Caution: Suppose we start with θ̂initial = 0.5. If the process is left unchecked, the algorithm will converge to the minimum located at θ = 0!!
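These iterations are easy to reproduce in a short DATA step; a sketch for the example above (the data set name newton is an arbitrary choice):

DATA newton;
   theta = 1.5;                    /* starting value */
   DO iter = 1 TO 10;
      u = -3*theta*(theta - 2);    /* first derivative U(theta)  */
      h = -6*(theta - 1);          /* second derivative H(theta) */
      theta = theta - u/h;         /* Newton-Raphson update      */
      OUTPUT;                      /* records 2.25, 2.025, ... converging to 2 */
   END;
RUN;
PROC PRINT DATA=newton; VAR iter theta; RUN;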


Maximum Likelihood estimation

- The only difference between Fisher's Scoring and the Newton-Raphson algorithm is that Fisher's Scoring uses E(H(.)) instead of H(.);
- Our current situation is more complicated in that the unknowns are multivariate. However, the optimisation principle remains the same;
- In practice, we need a set of initial values. PROC LOGISTIC in SAS starts with all coefficients equal to zero.


PROC LOGISTIC: basic elements

data PENALTY;
   infile 'd:\teaching\ms4225\penalty.txt';
   input DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
PROC LOGISTIC DATA=PENALTY DESCENDING;     /* DESCENDING makes SAS model Pr(DEATH=1) */
   MODEL DEATH=BLACKD WHITVIC SERIOUS;
RUN;


PROC LOGISTIC: basic elements

The LOGISTIC Procedure

Model Information
Data Set                     WORK.PENALTY
Response Variable            DEATH
Number of Response Levels    2
Number of Observations       147
Model                        binary logit
Optimization Technique       Fisher's scoring

Response Profile
Ordered               Total
  Value    DEATH    Frequency
      1    1               50
      2    0               97

Probability modeled is DEATH=1.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.


PROC LOGISTIC: basic elements

Model Fit Statistics
                 Intercept     Intercept and
Criterion             Only        Covariates
AIC                190.491           184.285
SC                 193.481           196.247
-2 Log L           188.491           176.285

Testing Global Null Hypothesis: BETA=0
Test                  Chi-Square    DF    Pr > ChiSq
Likelihood Ratio         12.2060     3        0.0067
Score                    11.6560     3        0.0087
Wald                     10.8211     3        0.0127


PROC LOGISTIC: basic elements

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates
                             Standard        Wald
Parameter   DF   Estimate       Error   Chi-Square   Pr > ChiSq
Intercept    1    -2.6516      0.6748      15.4424       <.0001
BLACKD       1     0.5952      0.3939       2.2827       0.1308
WHITVIC      1     0.2565      0.4002       0.4107       0.5216
SERIOUS      1     0.1871      0.0612       9.3342       0.0022

Odds Ratio Estimates
              Point         95% Wald
Effect     Estimate    Confidence Limits
BLACKD        1.813       0.838     3.925
WHITVIC       1.292       0.590     2.832
SERIOUS       1.206       1.069     1.359

Association of Predicted Probabilities and Observed Responses
Percent Concordant    67.2    Somers' D    0.349
Percent Discordant    32.3    Gamma        0.351
Percent Tied           0.5    Tau-a        0.158
Pairs                 4850    c            0.675


Wald test for individual significance

- Test of significance of individual coefficients:

  H0: βj = 0 vs. H1: otherwise

  Instead of reporting the t-stats, PROC LOGISTIC reports the Wald χ2-stats for the significance of individual coefficients. The reason is that the t-stat is not t-distributed in a Logit model; instead, it has an asymptotic N(0, 1) distribution under the null H0: βj = 0. The square of a N(0, 1) variable is a χ2 variable with 1 df. The Wald χ2-stat is just the square of the usual t-stat.
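As a quick check against the output above: for SERIOUS, (0.1871/0.0612)^2 ≈ 9.35, which agrees with the reported Wald chi-square of 9.3342 up to rounding of the displayed estimate and standard error.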


Likelihood-ratio, LM and Wald tests for overall significance

- Test of overall model significance:

  H0: β1 = β2 = .... = βk = 0 vs. H1: otherwise

  1. Likelihood-ratio test: LR = 2[lnL(β̂(UR)) − lnL(β̂(R))] ~ χ2(k)
  2. Score (Lagrange-multiplier, LM) test: LM = [U(β̂(R))]′ [−H^(−1)(β̂(R))] [U(β̂(R))] ~ χ2(k)
  3. Wald test: W = β̂′(UR) [−H(β̂(UR))] β̂(UR) ~ χ2(k)
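In the penalty example, the LR statistic can be read off the Model Fit Statistics: LR = (−2 lnL, intercept only) − (−2 lnL, with covariates) = 188.491 − 176.285 = 12.206, which is the Likelihood Ratio value of 12.2060 with 3 df (p = 0.0067) reported under Testing Global Null Hypothesis.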


Odds ratio estimates

- The odds ratio estimates are obtained by exponentiating the corresponding β estimates, i.e., e^β̂j;
- The (predicted) odds ratio of 1.813 indicates that the odds of a death sentence for black defendants are 81% higher than the odds for other defendants;
- Similarly, the (predicted) odds of death are about 29% higher when the victim is white, notwithstanding the coefficient being insignificant;
- A 1-unit increase in the SERIOUS scale is associated with a 21% increase in the predicted odds of a death sentence
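As a check: e^0.5952 ≈ 1.813 for BLACKD, e^0.2565 ≈ 1.292 for WHITVIC and e^0.1871 ≈ 1.206 for SERIOUS, matching the Odds Ratio Estimates table above.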


AIC, SC and Generalised R2

- Model selection criteria

  1. Akaike's Information Criterion (AIC): AIC = −2[lnL − (k + 1)]
  2. Schwarz Bayesian Criterion (SBC or SC): SC = −2 lnL + (k + 1) × ln(n)
  3. Generalised R2 = 1 − e^(−LR/n), analogous to the conventional R2 used in linear regression
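These can be verified from the output above: with −2 lnL = 176.285 for the fitted model and k = 3, AIC = 176.285 + 2(3 + 1) = 184.285 and SC = 176.285 + (3 + 1) ln(147) ≈ 176.285 + 19.962 = 196.247; the Generalised R2 is 1 − e^(−12.206/147) ≈ 0.08.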


Association of predicted probabilities and observed responses

- For the 147 observations in the sample, there are 147C2 = 10731 ways to pair them up (without pairing an observation with itself). Of these, 5881 pairs have either both 1's or both 0's on y. These we ignore, leaving 4850 pairs for which one case has a 1 and the other case has a 0;
- For each of these pairs, we ask the following question: based on the estimated model, does the case with a 1 have a higher predicted probability of attaining 1 than the case with a 0?
- If yes, we call the pair "concordant"; if no, we call the pair "discordant"; if the two cases have the same predicted values, we call it a "tie";
- Obviously, the more concordant pairs, the better the fit of the model.


Association of predicted probabilities and observed responses

- Let C = number of concordant pairs, D = number of discordant pairs, T = number of ties, and N = total number of pairs before eliminating any;
- Tau-a = (C − D)/N, Somers' D (SD) = (C − D)/(C + D + T), Gamma = (C − D)/(C + D) and c-stat = 0.5 × (1 + SD)
- All 4 measures vary between 0 and 1, with large values corresponding to stronger associations between the predicted and observed values
- "Rules of thumb" for minimally acceptable levels of Tau-a, SD, Gamma and c-stat are 0.1, 0.3, 0.3 and 0.65 respectively.
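These formulas reproduce the output above: with 67.2% concordant, 32.3% discordant and 0.5% tied among the 4850 informative pairs, Somers' D = 0.672 − 0.323 = 0.349, Gamma = 0.349/(0.672 + 0.323) ≈ 0.351, Tau-a = 0.349 × 4850/10731 ≈ 0.158 and c = 0.5 × (1 + 0.349) ≈ 0.675.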


Hosmer-Lemeshow goodness of fit test

- The Hosmer-Lemeshow (HL) test is a goodness-of-fit test which may be invoked by adding the LACKFIT option to the MODEL statement under PROC LOGISTIC;
- The HL statistic is calculated as follows. Based on the estimated model, predicted probabilities are generated for all observations. These are sorted by size, then grouped into approximately 10 intervals. Within each interval, the expected frequency is obtained by adding up the predicted probabilities. Expected frequencies are compared with the observed frequencies by the conventional Pearson χ2 statistic. The df is the number of intervals minus 2;
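For the penalty example, the option is simply appended to the earlier model statement; a minimal sketch:

PROC LOGISTIC DATA=PENALTY DESCENDING;
   MODEL DEATH=BLACKD WHITVIC SERIOUS / LACKFIT;   /* request the Hosmer-Lemeshow test */
RUN;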


Hosmer-Lemeshow goodness of fit test

- HL = Σ(j=1 to 2G) (Oj − Ej)^2 / Ej ~ χ2(G − 2), where G is the number of intervals, and O and E are the observed and predicted frequencies respectively. The LACKFIT output is as follows:

  Partition for the Hosmer and Lemeshow Test
                    DEATH = 1               DEATH = 0
  Group   Total   Observed   Expected   Observed   Expected
    1       15        3         2.04        12        12.96
    2       15        2         2.78        13        12.22
    3       15        3         3.49        12        11.51
    4       15        4         4.10        11        10.90
    5       15        6         4.89         9        10.11
    6       15        6         5.42         9         9.58
    7       15        4         5.97        11         9.03
    8       15        6         6.77         9         8.23
    9       15        7         7.50         8         7.50
   10       12        9         7.05         3         4.95

  Hosmer and Lemeshow Goodness-of-Fit Test
  Chi-Square   DF   Pr > ChiSq
      3.9713    8       0.8597


Class exercises

1. Tutorial 1
2. Table 12.4 of Ramanathan (1995): Introductory Econometrics, presents information on the acceptance or rejection to medical school for a sample of 60 applicants, along with a number of their characteristics. The variables are as follows:

   ACCEPT = 1 if granted acceptance, 0 otherwise;
   GPA = cumulative undergraduate grade point average;
   BIO = score in the biology portion of the Medical College Admission Test (MCAT);
   CHEM = score in the chemistry portion of the MCAT;


Class exercises

   PHY = score in the physics portion of the MCAT;
   RED = score in the reading portion of the MCAT;
   PRB = score in the problem portion of the MCAT;
   QNT = score in the quantitative portion of the MCAT;
   AGE = age of the applicant;
   GENDER = 1 for male, 0 for female;

Answer the following questions with the aid of the program and output medicalsas.txt and medicalout.txt uploaded on the course website:


Class exercises

1. Write down the estimated Logit model that regresses ACCEPT on all of the above explanatory variables.
2. Test for the overall significance of the model using the LR, LM and Wald tests. Do the three tests provide consistent results?
3. Test for the significance of the individual coefficients using the Wald test.
4. Predict the probability of success of an individual with the following characteristics: GPA=2.96, BIO=7, CHEM=7, PHY=8, RED=5, PRB=7, QNT=5, AGE=25, GENDER=0.
5. Calculate the Generalised R2 for the above regression. How well does the model appear to fit the data?
6. AGE and GENDER represent personal characteristics. Test the hypothesis that they jointly have no impact on the probability of success (a starting sketch follows this list).
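A minimal PROC LOGISTIC sketch for getting started on these questions is shown below; the data set name MEDICAL and the TEST label are assumptions (the actual program is in medicalsas.txt), and the TEST statement requests a Wald test of the joint restriction in question 6:

PROC LOGISTIC DATA=MEDICAL DESCENDING;
   MODEL ACCEPT=GPA BIO CHEM PHY RED PRB QNT AGE GENDER;
   personal: TEST AGE=0, GENDER=0;    /* Wald test that AGE and GENDER are jointly zero */
RUN;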
