Modeling a Count Response
Bruce A Craig
Department of Statistics, Purdue University
Reading: Faraway Ch. 5, Agresti Ch. 4, KNNL Ch. 14
STAT 526 Topic 8 1
Poisson Distribution
Let Y represent the count of events (e.g., successes) that occur in some fixed interval of measure (e.g., time interval, region of area)
Unlike binomial/Bernoulli, no upper bound on support, Y = 0, 1, 2, . . .
Poisson distribution appropriate model when
– The number of events that occur in two nonoverlapping units of measure are independent
– The probability that an event occurs in a unit of measure is the same for all units of equal size and is proportional to the size of the unit
– The probability that more than one event occurs in a unit of measure is negligible for very small-sized units. In other words, the events occur one at a time.
Poisson Distribution
Probability mass function
\[ P(Y = y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad y = 0, 1, 2, \ldots \]
Can be shown to be an exponential family distribution
E(Y) = Var(Y) = λ
Features:
– If $Y_i \overset{ind}{\sim} \text{Pois}(\lambda_i)$, then $\sum_i Y_i \sim \text{Pois}\left(\sum_i \lambda_i\right)$
– Arises naturally when the times between events are iid exponential
– Approximates B(m, p) when m is large and p is small, with λ = mp
– Approaches Normality as λ increases
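Two of these features can be checked numerically. The following is a small base-R sketch; the values of m, p, lambda1 and lambda2 are illustrative choices, not taken from the course material.

```r
# Poisson(lambda = m * p) as an approximation to Binomial(m, p)
m <- 1000; p <- 0.002                 # illustrative: large m, small p
y <- 0:10
binom_pmf <- dbinom(y, size = m, prob = p)
pois_pmf  <- dpois(y, lambda = m * p)
max_diff  <- max(abs(binom_pmf - pois_pmf))
round(cbind(y, binom_pmf, pois_pmf), 5)

# Additivity: the pmf of Y1 + Y2, obtained by convolution,
# matches the Pois(lambda1 + lambda2) pmf
lambda1 <- 1.5; lambda2 <- 2.5        # illustrative rates
conv_pmf <- sapply(y, function(s) sum(dpois(0:s, lambda1) * dpois(s:0, lambda2)))
add_diff <- max(abs(conv_pmf - dpois(y, lambda1 + lambda2)))
c(max_diff = max_diff, add_diff = add_diff)
```

The first difference is small but not zero (it is an approximation); the second is zero up to floating-point error.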
Poisson(λ) for Various λ
Poisson Regression
Model
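\[ Y_i \overset{ind}{\sim} \text{Pois}(\lambda_i), \qquad \log \lambda_i = x_i\beta \]
– βj = diff in log E(Y) for unit change in xj (all others fixed)
– mean at xj + 1 is mean at xj times exp{βj}
log is common (canonical) link but others may be used
Log-likelihood
\[ \ell(\beta) = \log \prod_{i=1}^{n} \left[ \frac{\lambda_i^{y_i} \exp\{-\lambda_i\}}{y_i!} \right] = \sum_{i=1}^{n} \left( y_i \log \lambda_i - \lambda_i - \log y_i! \right) = \sum_{i=1}^{n} \left( y_i x_i\beta - \exp\{x_i\beta\} \right) + \text{constant} \]
\[ \Longrightarrow \quad X'y = X'\hat{\mu} \]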
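As a rough check of the score equations and the multiplicative interpretation of the coefficients, here is a sketch on simulated data. The data, the true coefficients and the variable names x1 and x2 are all made up for illustration.

```r
set.seed(526)
n  <- 200
x1 <- rnorm(n); x2 <- runif(n)
y  <- rpois(n, lambda = exp(0.5 + 0.3 * x1 - 0.4 * x2))   # simulated counts
fit <- glm(y ~ x1 + x2, family = poisson)

# Score equations at the MLE: X'y should equal X'mu-hat
X <- model.matrix(fit)
score_lhs <- drop(crossprod(X, y))
score_rhs <- drop(crossprod(X, fitted(fit)))
rbind(score_lhs, score_rhs)

# exp(beta_j) is the ratio of means for a unit increase in x_j
rate_ratio <- exp(coef(fit)["x1"])
nd <- data.frame(x1 = c(0, 1), x2 = 0.5)
mu <- predict(fit, newdata = nd, type = "response")
c(ratio_of_means = unname(mu[2] / mu[1]), exp_beta = unname(rate_ratio))
```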
Goodness of Fit
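Deviance
\[ D = 2\sum_{i=1}^{n} \left( y_i \log\frac{y_i}{\hat{\mu}_i} - (y_i - \hat{\mu}_i) \right) = 2\sum_{i=1}^{n} \left( y_i \log\frac{y_i}{\hat{\lambda}_i} - (y_i - \hat{\lambda}_i) \right) \]
– Also known as the G² statistic
– Different deviance values for grouped or ungrouped data
Pearson X²
\[ X^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)} = \sum_{i=1}^{n} \frac{(y_i - \hat{\lambda}_i)^2}{\hat{\lambda}_i} \]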
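Both statistics can be computed by hand from the fitted means. The sketch below uses simulated data (an assumption, not one of the course examples) and takes y log(y/µ̂) to be 0 when y = 0.

```r
set.seed(1)
x <- runif(50)
y <- rpois(50, exp(1 + x))              # simulated counts
fit <- glm(y ~ x, family = poisson)
mu  <- fitted(fit)

# Deviance: the term y * log(y / mu) is defined as 0 when y = 0
ylogy  <- ifelse(y == 0, 0, y * log(y / mu))
D_hand <- 2 * sum(ylogy - (y - mu))

# Pearson X^2 with Poisson variance function V(mu) = mu
X2_hand <- sum((y - mu)^2 / mu)

c(D_hand = D_hand, D_glm = deviance(fit))
c(X2_hand = X2_hand, X2_glm = sum(residuals(fit, type = "pearson")^2))
```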
Inference About β
Same general procedure as with other GLMs
To test H0 : βj = 0 versus Ha : βj ≠ 0, use
– Wald test
– Likelihood ratio test
– Score test
For inference about µi, compute η̂i and its SE. Backtransform if interested in CI
For prediction of Y, use Poisson pmf with λ = µ̂ to compute appropriate quantiles
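A minimal sketch of the last two steps, on simulated data (the data and the new point x = 1 are illustrative assumptions). Note that the prediction quantiles below plug in µ̂ and so ignore the uncertainty in µ̂ itself.

```r
set.seed(2)
x <- runif(80, 0, 2)
y <- rpois(80, exp(0.5 + 0.8 * x))      # simulated counts
fit <- glm(y ~ x, family = poisson)

# CI for the mean at x = 1: build it on the link scale, then backtransform
new <- data.frame(x = 1)
eta <- predict(fit, newdata = new, type = "link", se.fit = TRUE)
ci_mean <- exp(eta$fit + c(-1, 1) * qnorm(0.975) * eta$se.fit)
mu_hat  <- exp(eta$fit)

# Approximate 95% prediction limits for a new count Y at x = 1
pred_int <- qpois(c(0.025, 0.975), lambda = mu_hat)

list(mu_hat = mu_hat, ci_mean = ci_mean, pred_int = pred_int)
```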
Example
Is the diversity in plant species on an island related to various geographical features of the island?
Study looked at 30 Galapagos islands obtaining a species count (Y) and five geographical features
– Area of the island (km²)
– Highest elevation of the island (m)
– Distance to Santa Cruz island (km)
– Distance to the nearest island (km)
– Area of adjacent island (km²)
Also have count of endemic (unique) species as another outcome variable
Examine the Data Structure
> library(faraway)
> str(gala)
'data.frame': 30 obs. of 7 variables:
$ Species : num 58 31 3 25 2 18 24 10 8 2 ...
$ Endemics : num 23 21 3 9 1 11 0 7 4 2 ...
$ Area : num 25.09 1.24 0.21 0.1 0.05 ...
$ Elevation: num 346 109 114 46 77 119 93 168 71 112 ...
$ Nearest : num 0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
$ Scruz : num 0.6 26.3 58.7 47.4 1.9 ...
$ Adjacent : num 1.84 572.33 0.78 0.18 903.82 ...
## Will remove the Endemics variable for easier analysis
> gala1 = gala[,-2]
Fitting the Full GLM
> fit1 <- glm(Species~., family=poisson, data=gala1)
> summary(fit1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.155e+00 5.175e-02 60.963 < 2e-16 ***
Area -5.799e-04 2.627e-05 -22.074 < 2e-16 ***
Elevation 3.541e-03 8.741e-05 40.507 < 2e-16 ***
Nearest 8.826e-03 1.821e-03 4.846 1.26e-06 ***
Scruz -5.709e-03 6.256e-04 -9.126 < 2e-16 ***
Adjacent -6.630e-04 2.933e-05 -22.608 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3510.73 on 29 degrees of freedom
Residual deviance: 716.85 on 24 degrees of freedom
AIC: 889.68
Number of Fisher Scoring iterations: 5
Goodness of Fit Tests
Residual Deviance
– Tests H0 : our model vs Ha : saturated model
– Distribution poorly approximated by χ²
> pchisq(fit1$deviance,fit1$df.residual,lower.tail=FALSE)
[1] 7.073157e-136
Pearson X²
– Tests H0 : our model vs Ha : saturated model
– Distribution better approximated by χ²
– Gets even better when Poisson approaches Normal
> pchisq(sum(residuals(fit1, type="pearson")^2), fit1$df.residual, lower.tail=FALSE)
[1] 2.18719e-145 # reject H0
Can look more into model diagnostics to find possible reasons for poor fit
Randomized Quantile Residuals
Can again use these residuals to assess fit
Warning: Problem with extreme outliers / overdispersion
> library(statmod)   # provides qres.pois
> fit1res = qres.pois(fit1)
> fit1res
[1] -2.3704014 2.2765322 -5.5201747 0.7896027 -4.6562257
[6] -3.3413838 -1.3724017 -0.3180478 -4.6393914 -6.2867257
[11] Inf 0.5979665 8.2095362 -6.6106086 2.8081647
[16] -1.2164403 -0.5969734 -5.1684852 -8.2777086 -1.2047194
[21] -3.9679243 2.1827066 3.9796196 -7.4357526 7.9414445
[26] 0.1319390 Inf 1.1363113 -3.6259643 0.6910023
Obs #11: y = 97 and µ̂ = 27.86135
Obs #27: y = 285 and µ̂ = 158.1653
Can set Inf values to large value (e.g., 10)
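To see where the Inf values come from, here is a base-R sketch of a randomized quantile residual for a Poisson fit, assuming the usual construction r = Φ⁻¹(u) with u drawn uniformly between F(y − 1; µ̂) and F(y; µ̂). The helper names rqr_lower and rqr_upper are hypothetical and are not part of statmod. Working in the upper tail is one possible way to keep the residual finite; it is an alternative to capping at 10, not a description of what qres.pois does.

```r
# Randomized quantile residual computed from the lower tail
rqr_lower <- function(y, mu) {
  a <- ppois(y - 1, mu)                       # F(y - 1)
  b <- ppois(y, mu)                           # F(y)
  qnorm(runif(length(y), a, b))
}

# The same residual computed from the upper tail, which avoids
# probabilities that round to exactly 1 in double precision
rqr_upper <- function(y, mu) {
  a <- ppois(y, mu, lower.tail = FALSE)       # P(Y > y)
  b <- ppois(y - 1, mu, lower.tail = FALSE)   # P(Y >= y)
  qnorm(runif(length(y), a, b), lower.tail = FALSE)
}

set.seed(11)
y11 <- 97; mu11 <- 27.86135      # observation 11 from the output above
r_low <- rqr_lower(y11, mu11)    # F(96) rounds to 1, so qnorm gives Inf
r_up  <- rqr_upper(y11, mu11)    # finite, but very large
c(lower_tail = r_low, upper_tail = r_up)
```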
Diagnostics
library(ggplot2)   # provides qplot
qplot(predict(fit1,type="link"),fit1res,geom="point",ylab="Residual")
fit1res1 = fit1res
fit1res1[is.infinite(fit1res1)] = 10
qqnorm(fit1res1)
abline(a=0,b=1,col="red")
Mean versus Variance
Model assumes mean and variance are the same
Can compare observed variance with µ̂ (log scale)
plot(log(predict(fit1,type="response")),
     log((gala1$Species-predict(fit1,type="response"))^2),
     xlab="Estimated Mean", ylab="Approx Variance")
abline(a=0,b=1)
Quasi-Poisson
Goodness of fit tests and diagnostics suggest overdispersion
Similar to binomial, can estimate dispersion parameter φ
> phihat = sum(residuals(fit1,type="pearson")^2)/fit1$df.residual
> phihat
[1] 31.74914
> summary(fit1,dispersion=phihat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.1548079 0.2915897 10.819 < 2e-16 ***
Area -0.0005799 0.0001480 -3.918 8.95e-05 ***
Elevation 0.0035406 0.0004925 7.189 6.53e-13 ***
Nearest 0.0088256 0.0102621 0.860 0.390
Scruz -0.0057094 0.0035251 -1.620 0.105
Adjacent -0.0006630 0.0001653 -4.012 6.01e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 31.74914)
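The same adjustment is available through family=quasipoisson. The sketch below, on simulated overdispersed counts (an assumption, not the gala data), suggests that the quasi-Poisson dispersion estimate equals the Pearson-based φ̂ and that the standard errors are the Poisson ones scaled by √φ̂. One difference: summary() of a quasipoisson fit reports t tests, while summary(fit, dispersion=phihat) reports z tests.

```r
set.seed(3)
x <- runif(100)
y <- rnbinom(100, size = 2, mu = exp(1 + x))   # simulated overdispersed counts

fit_p <- glm(y ~ x, family = poisson)
fit_q <- glm(y ~ x, family = quasipoisson)

# Pearson-based dispersion estimate from the Poisson fit
phihat <- sum(residuals(fit_p, type = "pearson")^2) / fit_p$df.residual

se_p <- summary(fit_p)$coefficients[, "Std. Error"]
se_q <- summary(fit_q)$coefficients[, "Std. Error"]

c(phihat = phihat, quasi_dispersion = summary(fit_q)$dispersion)
cbind(scaled_poisson_se = se_p * sqrt(phihat), quasi_se = se_q)
```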
Altering the Structural Form
Scatterplots of log(Species) versus the predictors show that the predictors are highly skewed and the relationships nonlinear
Consider model using the log of each predictor
Fitting the Full GLM
## gala2: assumed to be gala1 with each predictor log-transformed (its construction is not shown)
> fit2 <- glm(Species~., family=poisson, data=gala2)
> summary(fit2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.296705 0.285277 11.556 < 2e-16 ***
Area 0.349908 0.018011 19.428 < 2e-16 ***
Elevation 0.034825 0.057003 0.611 0.54125
Nearest -0.041491 0.013926 -2.979 0.00289 **
Scruz -0.030435 0.011573 -2.630 0.00854 **
Adjacent -0.089823 0.006925 -12.972 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3510.73 on 29 degrees of freedom
Residual deviance: 360.39 on 24 degrees of freedom
AIC: 533.22
Diagnostics
fit2res = qres.pois(fit2)   # randomized quantile residuals for the new fit
qplot(predict(fit2,type="link"),fit2res,geom="point",ylab="Residual")
qqnorm(fit2res)
abline(a=0,b=1,col="red")
Quasi-Poisson
Even with better form, overdispersion appears relevant
> phihat = sum(residuals(fit2,type="pearson")^2)/fit2$df.residual
> phihat
[1] 16.6475
> summary(fit2,dispersion=phihat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.29671 1.16397 2.832 0.00462 **
Area 0.34991 0.07349 4.762 1.92e-06 ***
Elevation 0.03482 0.23258 0.150 0.88098
Nearest -0.04149 0.05682 -0.730 0.46527
Scruz -0.03044 0.04722 -0.645 0.51923
Adjacent -0.08982 0.02825 -3.179 0.00148 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 16.6475)
Summary
Substantial change in deviance using new structural form
> anova(fit1,fit2)
Analysis of Deviance Table
Model 1: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
Resid. Df Resid. Dev Df Deviance
1 24 716.85
2 24 360.39 0 356.45
Overdispersion, however, still present
Change in log(y) for a unit change in log(x) approximates the percent change in y for a one percent change in x (elasticity)
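As a quick arithmetic illustration of the elasticity reading, using the log-Area coefficient reported in the fit2 output (0.349908): with a log link and a logged predictor, multiplying x by c multiplies the mean by c^β.

```r
b_area <- 0.349908                    # coefficient of (log) Area in fit2, from the output above

# Percent change in the mean species count for a 1% increase in Area
pct_1 <- 100 * (1.01^b_area - 1)

# Multiplicative change in the mean when Area doubles
dbl <- 2^b_area

c(pct_change_for_1pct = pct_1, factor_for_doubling = dbl)
```

The first number is close to the coefficient itself, which is the sense in which β approximates the percent change in y per percent change in x.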
Rate Model: Motivation
Each Yi represents a different interval in space or time
Number of events may depend on the size of the interval
– # of crimes in cities (population size)
– # of customers served by workers (time worked)
– # of cars running red lights in different intersections (how busy)
Sometimes can consider problem in binomial setting but often still better to model as Poisson
– counts can be small compared to the total
– the total number of trials may not be easily known
Goal: express a common effect of covariates on all Yi, while accounting for differences in 'exposure'
– 'exposure' needs to be defined by a variable
Rate Model: Formulation
Consider the following model:
\[ Y_i \overset{ind}{\sim} \text{Pois}(\lambda_i) \]
with
\[ \lambda_i = \text{exposure}_i \exp\{x_i\beta\} \]
or
\[ \log(\lambda_i) = \log(\text{exposure}_i) + x_i\beta \]
This model is equivalent to using log exposure as a predictor in Poisson GLM with the coefficient fixed at 1
The log exposure variable is often called an offset in this context
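A sketch of the three ways this can be written in glm, on simulated data (the variable names exposure and x and the true coefficients are illustrative assumptions). The first two fits fix the coefficient of log(exposure) at 1; the third estimates it freely.

```r
set.seed(4)
n <- 60
exposure <- runif(n, 50, 500)                     # simulated exposure sizes
x <- rnorm(n)
y <- rpois(n, exposure * exp(-2 + 0.5 * x))       # rate model with offset

# Offset inside the formula
fit_off <- glm(y ~ x + offset(log(exposure)), family = poisson)
# Equivalent: offset passed as an argument
fit_arg <- glm(y ~ x, offset = log(exposure), family = poisson)
# Coefficient of log(exposure) estimated rather than fixed at 1
fit_free <- glm(y ~ x + log(exposure), family = poisson)

coef(fit_off)
coef(fit_free)    # a log(exposure) coefficient near 1 supports using the offset
```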
Example
Experiment to study the effect of gamma radiation on the numbers of chromosomal abnormalities (ca)
– Three dose amounts (in Grays)
– Nine dose rates (in Grays/hour)
The number of cells (cells, in hundreds) exposed to a dose/rate combination varied across runs
– Thus, cells can be viewed as the 'exposure' variable
– Can have more than one abnormality per cell
> library(faraway)
> head(dicentric)
cells ca doseamt doserate
1 478 25 1 0.10
2 1907 102 1 0.25
3 2258 149 1 0.50
4 2329 160 1 1.00
5 1238 75 1 1.50
6 1491 100 1 2.00
Exploring the Data
Outcome: abnormalities per unit of cells (ca/cells)
Interaction plot:
> with(dicentric,interaction.plot(doseamt,doserate,ca/cells,legend=FALSE))
> legend('topleft',c('0.1','0.25','0.5','1','1.5','2','2.5','3','4'),
lty=9:1,cex=0.8)
Exploring the Data
Interaction plot with numeric factor on x axis
> with(dicentric,interaction.plot(doserate,log(doseamt),ca/cells,legend=FAL
> legend('topleft',c('1','2.5','5'),lty=3:1,cex=0.6)
Can more easily see the change in slopes?
Exploring the Data
Plot with numeric factor on x axis
> ## assumes ggplot2 is loaded and lograte = log(doserate) was added to dicentric earlier
> ggplot(dicentric,aes(x=lograte,y=ca/cells,col=as.factor(doseamt)))+
geom_point()+geom_smooth(method="lm", se=FALSE)+
theme(legend.position = "none")
Model Without an Offset
The coefficient of log(cells) is close to 1 (offset reasonable)
> dicentric$doseF <- factor(dicentric$doseamt)
> fit1 = glm(ca ~ log(cells) + log(doserate)*doseF,
family=poisson, data=dicentric)
> summary(fit1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.76534 0.38116 -7.255 4.02e-13 ***
log(cells) 1.00252 0.05137 19.517 < 2e-16 ***
log(doserate) 0.07200 0.03547 2.030 0.042403 *
doseF2.5 1.62984 0.10273 15.866 < 2e-16 ***
doseF5 2.76673 0.12287 22.517 < 2e-16 ***
log(doserate):doseF2.5 0.16111 0.04837 3.331 0.000866 ***
log(doserate):doseF5 0.19316 0.04299 4.493 7.03e-06 ***
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 916.127 on 26 degrees of freedom
Residual deviance: 21.748 on 20 degrees of freedom
AIC: 211.15
GLM Model With an Offset
Very little difference in estimates here
> fit2 <- glm(ca ~ offset(log(cells)) + log(doserate)*doseF,
family=poisson, data=dicentric)
> summary(fit2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.74671 0.03426 -80.165 < 2e-16 ***
log(doserate) 0.07178 0.03518 2.041 0.041299 *
doseF2.5 1.62542 0.04946 32.863 < 2e-16 ***
doseF5 2.76109 0.04349 63.491 < 2e-16 ***
log(doserate):doseF2.5 0.16122 0.04830 3.338 0.000844 ***
log(doserate):doseF5 0.19350 0.04243 4.561 5.1e-06 ***
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4753.00 on 26 degrees of freedom
Residual deviance: 21.75 on 21 degrees of freedom
AIC: 209.16
Alternative Approach I
Consider LM using proportions as response
> propca = dicentric$ca / dicentric$cells
> fit3 = lm(propca ~ log(doserate)*doseF, data=dicentric)
> summary(fit3)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.063489 0.019528 3.251 0.00382 **
log(doserate) 0.004573 0.016692 0.274 0.78680
doseF2.5 0.276315 0.027616 10.005 1.92e-09 ***
doseF5 1.004119 0.027616 36.359 < 2e-16 ***
log(doserate):doseF2.5 0.063933 0.023606 2.708 0.01317 *
log(doserate):doseF5 0.239129 0.023606 10.130 1.54e-09 ***
Residual standard error: 0.05858 on 21 degrees of freedom
Multiple R-squared: 0.9874, Adjusted R-squared: 0.9844
F-statistic: 330 on 5 and 21 DF, p-value: < 2.2e-16
Coefficients defined on data scale
Alternative Approach I
Model assumes Normality and constant variance
Residual plot: evidence of unequal variance
Clearly incorrect model choice
> plot(fit3$fitted,fit3$residuals)
> abline(h=0)
Alternative Approach II
Consider LM using log(proportion) as response
> fit4 <- lm(log(propca) ~ log(doserate)*doseF,data=dicentric)
> summary(fit4)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.76243 0.03352 -82.402 < 2e-16 ***
log(doserate) 0.07561 0.02866 2.638 0.015364 *
doseF2.5 1.64378 0.04741 34.672 < 2e-16 ***
doseF5 2.77866 0.04741 58.610 < 2e-16 ***
log(doserate):doseF2.5 0.16483 0.04053 4.067 0.000553 ***
log(doserate):doseF5 0.19480 0.04053 4.807 9.47e-05 ***
...
Residual standard error: 0.1006 on 21 degrees of freedom
Multiple R-squared: 0.9943, Adjusted R-squared: 0.9929
F-statistic: 729.4 on 5 and 21 DF, p-value: < 2.2e-16
Analogous to using offset in Poisson regression
Parameter estimates similar to those in Poisson GLM
SEs are smaller (more optimistic) than in Poisson regression
Alternative Approach II
Model still assumes Normality and constant variance
Residual plot: evidence of unequal variance?
May be ok choice but Poisson GLM preferred
> plot(fit4$fitted,fit4$residuals)
> abline(h=0)
Pairwise Comparisons
When there is an interaction, often compare levels of A at a fixed level of B
Example 1: µ(dose 5, dose rate 1) vs µ(dose 2.5, dose rate 1)
Using Poisson GLM with offset
> library(gmodels)
> contrast.function <- c(0, 0, -1, 1, 0, 0)
> names(contrast.function) <- names(coef(fit2))
> contrast.function
           (Intercept)          log(doserate)               doseF2.5                 doseF5
                     0                      0                     -1                      1
log(doserate):doseF2.5   log(doserate):doseF5
                     0                      0
## Contrast = x(dose 5, rate 1) - x(dose 2.5, rate 1):
##     (1, log(1), 0, 1, 0, log(1))
##   - (1, log(1), 1, 0, log(1), 0)
> test.result <- estimable(fit2, contrast.function)
> test.result
Estimate Std. Error X^2 value DF Pr(>|X^2|)
(0 0 -1 1 0 0) 1.135667 0.04460504 648.2372 1 0
Pairwise Comparisons
Example 2: µ(dose 5, dose rate 3) vs µ(dose 2.5, dose rate 3)
> contrast.function <- c(0, 0, -1, 1, -1.0986, 1.0986)
> names(contrast.function) <- names(coef(fit2))
## Contrast = x(dose 5, rate 3) - x(dose 2.5, rate 3), with log(3) = 1.0986:
##     (1, log(3), 0, 1, 0, log(3))
##   - (1, log(3), 1, 0, log(3), 0)
> test.result <- estimable(fit2, contrast.function)
> test.result
Estimate Std. Error X^2 value DF Pr(>|X^2|)
(0 0 -1 1 -1.0986 1.0986) 1.171129 0.05390965 471.9289 1 0
Difference is larger (expected because of interaction)
Note that comparisons are made on the link scale
\[ \log(\hat{\mu}_{ij}) - \log(\hat{\mu}_{i'j}) = \log(\hat{\mu}_{ij}/\hat{\mu}_{i'j}) \neq \log(\hat{\mu}_{ij} - \hat{\mu}_{i'j}) \]
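The same kind of contrast can be computed directly from coef() and vcov(), without gmodels. The sketch below uses simulated data with hypothetical names (dose, rate, cells, y); exponentiating the estimate gives the ratio of rates, which is the natural scale for reporting a link-scale difference.

```r
set.seed(526)
# Hypothetical design: 3 dose groups crossed with 5 rates, varying exposure
d <- expand.grid(dose = factor(c(1, 2.5, 5)), rate = c(0.1, 0.5, 1, 2, 4))
d$cells <- sample(500:2000, nrow(d))
d$y <- rpois(nrow(d), d$cells * exp(-3 + 0.1 * log(d$rate) + 0.8 * as.integer(d$dose)))
fit <- glm(y ~ offset(log(cells)) + log(rate) * dose, family = poisson, data = d)

# Contrast: dose 5 vs dose 2.5 at rate = 3, in the order of names(coef(fit))
cvec <- c(0, 0, -1, 1, -log(3), log(3))
est  <- sum(cvec * coef(fit))
se   <- sqrt(drop(t(cvec) %*% vcov(fit) %*% cvec))
wald <- (est / se)^2                         # Wald chi-square on 1 df

c(estimate = est, se = se, X2 = wald,
  p = pchisq(wald, df = 1, lower.tail = FALSE),
  rate_ratio = exp(est))

# Check: the contrast equals the difference of the two linear predictors
nd <- data.frame(dose = factor(c("5", "2.5"), levels = levels(d$dose)),
                 rate = 3, cells = 1)
p_link <- predict(fit, newdata = nd, type = "link")
```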
Grouped versus Ungrouped Analysis
Since $Y_i \overset{ind}{\sim} \text{Pois}(\lambda_i)$, the aggregate $\sum_i Y_i \sim \text{Pois}\left(\sum_i \lambda_i\right)$
As in binomial/Bernoulli response, log-likelihood only involves sums of yi with same covariate patterns
In previous example, could add total cells and counts of abnormalities for entries with same doseamt and doserate
Rate models: individual vs grouped
– different deviances
– same parameter estimates
– same comparison of nested models
Models with no offset: individual vs grouped
– different deviances
– same parameters except intercept
– same comparison of nested models
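The rate-model case can be illustrated on simulated data (group labels, exposures and rates below are made up): fitting to individual records and to records summed within covariate pattern gives the same estimates and the same drop in deviance between nested models, while the residual deviances themselves differ.

```r
set.seed(1)
# Individual-level records: 3 covariate patterns, each with its own exposure
ind <- data.frame(g = factor(rep(c("a", "b", "c"), each = 20)),
                  expo = runif(60, 1, 5))
true_rate <- c(a = 0.2, b = 0.6, c = 1.0)
ind$y <- rpois(60, ind$expo * exp(true_rate[as.character(ind$g)]))

# Grouped data: sum counts and exposures within each covariate pattern
grp <- aggregate(cbind(y, expo) ~ g, data = ind, FUN = sum)

fit_ind  <- glm(y ~ g + offset(log(expo)), family = poisson, data = ind)
fit_grp  <- glm(y ~ g + offset(log(expo)), family = poisson, data = grp)
null_ind <- glm(y ~ 1 + offset(log(expo)), family = poisson, data = ind)
null_grp <- glm(y ~ 1 + offset(log(expo)), family = poisson, data = grp)

rbind(individual = coef(fit_ind), grouped = coef(fit_grp))
c(dev_ind = deviance(fit_ind), dev_grp = deviance(fit_grp))
drop_ind <- deviance(null_ind) - deviance(fit_ind)
drop_grp <- deviance(null_grp) - deviance(fit_grp)
c(drop_ind = drop_ind, drop_grp = drop_grp)
```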
Negative Binomial Distribution
Typical motivation
– Independent trials with P(success) = p
– Z = the number of trials until kth success
– Z ∼ NB(p, k), Z = k, k + 1, . . .
A more useful motivation
– The distribution of Y when Y | λ ∼ Pois(λ) and λ ∼ G(k, α)
Probability distribution
\[ P(Z = z) = \binom{z - 1}{k - 1} p^{k} (1 - p)^{z - k}, \qquad z = k, k + 1, \ldots \]
or
\[ P(Y = y) = \binom{y + k - 1}{k - 1} \left( \frac{\alpha}{1 + \alpha} \right)^{y} \left( \frac{1}{1 + \alpha} \right)^{k}, \qquad y = 0, 1, \ldots \]
Using parameterization Y = Z − k and p = (1 + α)^{-1}
Negative Binomial Distribution
Under this reparameterization
– E(Y) = k(1 − p)/p = kα = µ
– Var(Y) = k(1 − p)/p² = kα(1 + α) = µ(1 + µ/k)
– Variance approaches Poisson variance as k → ∞
Log-likelihood is
\[ \ell = \sum_{i=1}^{n} \left( y_i \log\frac{\alpha_i}{\alpha_i + 1} - k \log(1 + \alpha_i) \right) + \text{constant} = \sum_{i=1}^{n} \left( y_i \log\frac{\mu_i}{k + \mu_i} + k \log\frac{k}{k + \mu_i} \right) + \text{constant} \]
Consider k as fixed or additional parameter to estimate
Use log(µi/(k + µi)) = xiβ as link function in GLM
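A numerical sketch of the Poisson-gamma motivation, with illustrative values k = 3 and α = 2 (assumptions, not from the course data). It compares the pmf formula with R's dnbinom, taking size = k and prob = 1/(1 + α), and checks the mean and variance formulas by simulating the mixture.

```r
k <- 3; alpha <- 2                    # illustrative gamma shape and scale
y <- 0:15

# pmf from the formula above versus R's built-in negative binomial
pmf_formula <- choose(y + k - 1, k - 1) * (alpha / (1 + alpha))^y * (1 / (1 + alpha))^k
pmf_r       <- dnbinom(y, size = k, prob = 1 / (1 + alpha))

# Simulate the mixture: lambda ~ Gamma(shape k, scale alpha), then Y | lambda ~ Pois(lambda)
set.seed(5)
lambda <- rgamma(1e5, shape = k, scale = alpha)
ysim   <- rpois(1e5, lambda)

mu <- k * alpha
c(sim_mean = mean(ysim), theory_mean = mu)
c(sim_var = var(ysim), theory_var = mu * (1 + mu / k))
```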
Return to Galapagos Islands Study
Can fix k based on exploratory plots of mean-variance relationship
R: use glm with the negative.binomial family (log link) from the MASS package
> library(MASS)
> fit.nb <- glm(Species~ log(Area)+log(Adjacent),
family=negative.binomial(1), data=gala)
> summary(fit.nb)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.27257 0.15304 21.384 < 2e-16 ***
log(Area) 0.35100 0.03773 9.304 6.52e-10 ***
log(Adjacent) -0.03204 0.04015 -0.798 0.432
...
(Dispersion parameter for Negative Binomial(1) family
taken to be 0.4650222)
Null deviance: 54.069 on 29 degrees of freedom
Residual deviance: 13.965 on 27 degrees of freedom
AIC: 292.97
Return to Galapagos Islands Study
Given dispersion parameter < 1, k too small
Could try various k or estimate using ML (glm.nb)
> fit.nb1 <- glm.nb(Species~ log(Area)+log(Adjacent),data=gala)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.27777 0.14495 22.613 <2e-16 ***
log(Area) 0.34973 0.03541 9.875 <2e-16 ***
log(Adjacent) -0.03316 0.03737 -0.887 0.375
(Dispersion parameter for Negative Binomial(2.6196)
family taken to be 1)
Null deviance: 134.240 on 29 degrees of freedom
Residual deviance: 32.741 on 27 degrees of freedom
AIC: 284.99
Theta: 2.620
Std. Err.: 0.753
Return to Galapagos Islands Study
Not a big difference in estimates and SEs across the two fits
CI for Theta = k almost includes 1
Poisson limiting case of NB when k → ∞
Thus, test H0 : φ = 1/k = 0 vs Ha : φ > 0
– One-sided hypothesis
– Under H0, φ is on the boundary of support
– Verbeke and Molenberghs (2000):
\[ LR = 2\left[ \text{logLik(NB model)} - \text{logLik(Poisson model)} \right] \sim \tfrac{1}{2}\chi^2_0 + \tfrac{1}{2}\chi^2_1 \]
Return to Galapagos Islands Study
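For this study, the two model fits give
– NB: 32.741
– Poisson: 395.54
The value 32.741 is the residual deviance in the glm.nb output, so these appear to be residual deviances rather than log-likelihoods; the LR statistic itself should be computed from logLik() of each fit
395.54 − 32.741 = 362.799 (the bracketed difference, before doubling), which far exceeds
\[ \tfrac{1}{2}\chi^2_0(0.95) + \tfrac{1}{2}\chi^2_1(0.95) = 1.920729 \]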
Conclusion: reject H0
Caution: quality of approximation can be poor
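A sketch of the boundary LR test computed from logLik(), on simulated overdispersed counts (the data are an assumption, not the gala data). Under the ½χ²₀ + ½χ²₁ reference distribution, the p-value for a positive LR statistic is half the usual χ²₁ tail probability.

```r
library(MASS)                          # glm.nb
set.seed(6)
x <- runif(150)
y <- rnbinom(150, size = 2, mu = exp(1.5 + x))   # simulated overdispersed counts

fit_pois <- glm(y ~ x, family = poisson)
fit_nb   <- glm.nb(y ~ x)

# LR statistic from the log-likelihoods (not from the deviances)
LR <- 2 * as.numeric(logLik(fit_nb) - logLik(fit_pois))

# Mixture reference distribution: half point mass at 0, half chi-square(1)
p_value <- 0.5 * pchisq(LR, df = 1, lower.tail = FALSE)
c(LR = LR, p_value = p_value)
```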
Zero-Inflated Models
Again refers to there being an excess of zero counts that cannot be adequately modeled using a dispersion parameter
Will consider both zero-inflated Poisson (ZIP) and zero-inflated negative-binomial (ZINB) models
Can think of model structure in two ways
– Additional point mass at zero
\[ P(Y = 0) = p_i^* + (1 - p_i^*)\, g_i(0), \qquad P(Y = y) = (1 - p_i^*)\, g_i(y), \quad y > 0 \]
– Hurdle model
\[ P(Y = 0) = f_1(0), \qquad P(Y = y) = \frac{1 - f_1(0)}{1 - f_2(0)}\, f_2(y), \quad y > 0 \]
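The two structures can be written as short pmf functions. This is a base-R sketch with a Poisson count distribution; the function names dzip and dhurdle and the parameter values are hypothetical, chosen only for illustration.

```r
# Zero-inflated Poisson: extra point mass pstar at zero, Poisson otherwise
dzip <- function(y, pstar, lambda) {
  ifelse(y == 0, pstar, 0) + (1 - pstar) * dpois(y, lambda)
}

# Hurdle: P(Y = 0) = p0, and a rescaled zero-truncated Poisson for y > 0
dhurdle <- function(y, p0, lambda) {
  ifelse(y == 0, p0, (1 - p0) * dpois(y, lambda) / (1 - dpois(0, lambda)))
}

y <- 0:100
zip_total    <- sum(dzip(y, pstar = 0.3, lambda = 2))      # should be 1
hurdle_total <- sum(dhurdle(y, p0 = 0.4, lambda = 2))      # should be 1

# ZIP puts more mass at zero than the plain Poisson with the same lambda
c(zip_zero = dzip(0, 0.3, 2), pois_zero = dpois(0, 2))
```

In the ZIP model zeros come from two sources (the point mass and the count distribution), while in the hurdle model all zeros come from the binary part.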
Hurdle Model
Consider a latent variable Zi that must exceed some value for there to be a nonzero count
This value is often called a hurdle
Consider using binary model for this part
Must use a rescaled truncated distribution for second part
In R: hurdle function in pscl package
– dist = c("poisson", "negbin", "geometric")
– zero.dist = c("binomial", "poisson", "negbin", "geometric")
– link = c("logit", "probit", "cloglog", "cauchit", "log")
Mixture Model
Already discussed this model in relation to the binomial
Consider that there is a proportion p∗ that will always respond with a zero
Remaining 1 − p∗ follow specified distribution
In R: zeroinfl function in pscl package
– dist = c("poisson", "negbin", "geometric")
– link = c("logit", "probit", "cloglog", "cauchit", "log")
– Model specified using y ~ x1 + x2 | z1 + z2 + z3
Example
915 biochemistry graduate students
Outcome: Number of articles produced in last 3 yrs of PhD
> data("bioChemists", package="pscl")
> mod = glm(art ~ ., data=bioChemists,family=poisson)
> summary(mod)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.304617 0.102981 2.958 0.0031 **
femWomen -0.224594 0.054613 -4.112 3.92e-05 ***
marMarried 0.155243 0.061374 2.529 0.0114 *
kid5 -0.184883 0.040127 -4.607 4.08e-06 ***
phd 0.012823 0.026397 0.486 0.6271
ment 0.025543 0.002006 12.733 < 2e-16 ***
Null deviance: 1817.4 on 914 degrees of freedom
Residual deviance: 1634.4 on 909 degrees of freedom
AIC: 3314.1
Example: Hurdle Model
> library(pscl)
> modh = hurdle(art~.,data=bioChemists)
> summary(modh)
Count model coefficients (truncated poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.67114 0.12246 5.481 4.24e-08 ***
femWomen -0.22858 0.06522 -3.505 0.000457 ***
marMarried 0.09649 0.07283 1.325 0.185209
kid5 -0.14219 0.04845 -2.934 0.003341 **
phd -0.01273 0.03130 -0.407 0.684343
ment 0.01875 0.00228 8.222 < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.23680 0.29552 0.801 0.4230
femWomen -0.25115 0.15911 -1.579 0.1144
marMarried 0.32623 0.18082 1.804 0.0712 .
kid5 -0.28525 0.11113 -2.567 0.0103 *
phd 0.02222 0.07956 0.279 0.7800
ment 0.08012 0.01302 6.155 7.52e-10 ***
Number of iterations in BFGS optimization: 12
Log-likelihood: -1605 on 12 Df
Example: Mixture Model
> modzi = zeroinfl(art~.,data=bioChemists)
> summary(modzi)
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.640838 0.121307 5.283 1.27e-07 ***
femWomen -0.209145 0.063405 -3.299 0.000972 ***
marMarried 0.103751 0.071111 1.459 0.144565
kid5 -0.143320 0.047429 -3.022 0.002513 **
phd -0.006166 0.031008 -0.199 0.842378
ment 0.018098 0.002294 7.888 3.07e-15 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.577059 0.509386 -1.133 0.25728
femWomen 0.109746 0.280082 0.392 0.69518
marMarried -0.354014 0.317611 -1.115 0.26502
kid5 0.217097 0.196482 1.105 0.26919
phd 0.001274 0.145263 0.009 0.99300
ment -0.134114 0.045243 -2.964 0.00303 **
Number of iterations in BFGS optimization: 21
Log-likelihood: -1605 on 12 Df
Summary
Very difficult to choose between the two
Be careful interpreting models
– Hurdle: binary part models the probability of a positive count
– Mixture: binary part models the probability of an excess (always-zero) count
Bland-Altman plot better describes prediction differences