Logistic Regression on Autopilot
Trevor Hastie and Mee Young Park
Stanford University
Topics
• glmpath: R package for fitting the ℓ1 regularization path for generalized linear models and the Cox PH model.
• stepPLR: a regularized forward-stepwise logistic regression package for fitting gene/gene and gene/environment interaction models in studies of genetic diseases.
• gbm: Greg Ridgeway's package for fitting gradient boosted models, including logistic regression.
Background
Logistic regression is a heavily used tool in statistics. Here are some applications we have been involved in:
• Risk modeling: e.g. risk factors for heart disease, risk of insurance fraud or of payment default on credit cards.
• Prediction models, e.g. QSAR (Quantitative Structure-Activity Relationship): use a compound's chemical and structural attributes to predict its biological activity, toxicity, etc.
• Discovering genes and their interactions in genetic studies of diseases, based on measurements on a large number of SNPs.
Problems with (stepwise) logistic regression
Stepwise logistic regression is popular in the biosciences (SAS community), because it automatically builds a model (with interactions). It has a number of failings:
• It can overfit and perform (predict) poorly.
• With multi-level factors and smallish datasets (genetic studies), empty/sparse cells cause instability.
• It is difficult to assess results: how do we assign a p-value to selected variables?
GLMs with ℓ1 (lasso) regularization
For logistic regression we fit the linear model
$$\log \frac{\Pr(Y = 1 \mid X)}{\Pr(Y = 0 \mid X)} = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$$
via regularized maximum likelihood:
$$\max_{\beta} \; \ell(y; \beta) - \lambda \|\beta\|_1$$
This is the lasso (Tibshirani, 1996) for logistic regression, and is well known:
• Does variable selection and shrinkage.
• Smoother path than forward stepwise.
• Select λ by AIC, BIC or k-fold CV.
glmpath package
• Computes the entire ℓ1 path for GLMs and the Cox model.
• Uses predictor-corrector ideas from convex optimization.
• Computes the exact path at a sequence of index points t.
• Can approximate the junctions (in λ) where the active set changes.
[Figure: Standardized coefficient paths for variables x1–x5 as functions of λ, computed at a coarse and at a fine sequence of points along the path.]
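A minimal sketch of such a fit, on the South African heart disease data that (we believe) ships with the package as heart.data; the glmpath, plot and cv.glmpath calls follow our reading of the package documentation and should be checked against it:

```r
library(glmpath)

# South African heart disease data bundled with the package
data(heart.data)
x <- heart.data$x
y <- heart.data$y

# Compute the entire L1 regularization path for logistic regression
fit <- glmpath(x, y, family = binomial)

# Standardized coefficient profiles as a function of lambda
plot(fit, xvar = "lambda")

# 10-fold cross-validation to guide the choice of lambda
cv.glmpath(x, y, family = binomial, nfold = 10)
```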
Bootstrap Sensitivity Analysis
[Figure: Histograms of bootstrapped coefficient estimates for sbp, tobacco, ldl and adiposity.]
Fit a logistic regression path, and use 10-fold CV to select λ.
• For the SA heart disease data, red lines indicate the fitted values.
• Histograms represent the distribution obtained when repeating this procedure on bootstrapped datasets.
• Something like a Bayesian posterior distribution.
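A minimal sketch of this bootstrap loop, continuing the glmpath example above. Here select.lambda() is a hypothetical stand-in for whatever rule picks λ at the 10-fold-CV minimum, and we assume predict() on a glmpath fit returns coefficients when type = "coefficients"; both should be checked against the package:

```r
library(glmpath)
data(heart.data)
x <- heart.data$x
y <- heart.data$y
n <- nrow(x)

B <- 500                                   # bootstrap replications
boot.coef <- matrix(NA, B, ncol(x), dimnames = list(NULL, colnames(x)))

for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)      # resample cases with replacement
  fit <- glmpath(x[idx, ], y[idx], family = binomial)
  lam <- select.lambda(x[idx, ], y[idx])   # hypothetical helper: CV-chosen lambda
  # coefficients at the chosen lambda; drop the intercept
  boot.coef[b, ] <- predict(fit, s = lam, mode = "lambda",
                            type = "coefficients")[-1]
}

# Bootstrap distribution for one coefficient (cf. the histograms above)
hist(boot.coef[, "tobacco"], breaks = 30, main = "tobacco")
```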
• Pairwise plots are useful too; often a selected variable has a correlated cousin, and either could be important.
• Can be used with stepwise logistic regression as well.
[Figure: Pairwise scatterplots of bootstrapped coefficients: obesity vs. adiposity, and tobacco vs. ldl.]
• GLMpath paper on my website: Park & Hastie (2006), An ℓ1 regularization-path algorithm for generalized linear models.
• Yuan and Lin (2006) extend the lasso to deal with groups of variables (e.g. dummy variables for factors):
$$\max_{\beta} \; \ell(y; \beta) - \lambda \sum_{m=1}^{M} \gamma_m \|\beta_m\|_2.$$
The ℓ2 norm ensures the vector of coefficients βm is all zero or all non-zero together (the weights γm, typically the square root of the group size, put groups of different sizes on an equal footing).
• Using predictor-corrector methods we can construct the path for this criterion: Park & Hastie (2006), Regularization path algorithms for detecting gene interactions.
Stepwise Penalized Logistic Regression
• In genetic disease studies, we are often faced with modest sample sizes, binary (case-control) responses, and many candidate genes, each a 3-level factor (AA, Aa, aa).
• Usually the wild-type is prevalent, so the factor levels are unevenly populated.
• Two-way interactions have 9 cells, many of which have zero or very low counts.
• Logistic regression does not do well in these scenarios: exact aliasing, high-variance coefficients, convergence problems and overfitting.
• We work with the ℓ2 penalized log-likelihood:
$$\max_{\beta} \; \ell(y; \beta) - \lambda \|\beta\|_2^2,$$
with typically a small value for λ.
• Benefits of the ℓ2 penalty:
– Can code factors with a saturated set of dummy variables; the natural "summation to 0" constraints are automatically maintained.
– Coefficients for sparsely populated cells are shrunk more toward zero than those for well-populated cells.
– Coefficients for empty cells are set to zero automatically.
• Can calibrate the effective df for such a fit from the trace of the weighted ridge operator of the final IRLS step.
• We then run forward stepwise logistic regression using this fitting engine.
• Hierarchy rule: we allow interactions to enter if either of the main effects is present.
• Use AIC/BIC to guide the forward growing/backward deletion.
• Small/large λ allows less/more complex models. We estimate λ by cross-validation.
• Package stepPLR in R (see the sketch after this list).
• Method described in Park & Hastie (2006), Penalized logistic regression for detecting gene interactions.
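A minimal sketch of how such a fit might be called, assuming the step.plr() interface from the stepPLR package; the argument names lambda, cp and max.terms follow our recollection of its documentation, the genotype data are simulated placeholders, and the way x should encode 3-level factors ought to be checked against the package:

```r
library(stepPLR)

# Simulated placeholder data: 21 three-level genotype factors, binary response
set.seed(1)
n <- 200
snps <- data.frame(lapply(1:21, function(j)
  factor(sample(c("AA", "Aa", "aa"), n, replace = TRUE,
                prob = c(0.7, 0.25, 0.05)))))  # wild-type AA prevalent
y <- rbinom(n, 1, 0.4)

# Forward stepwise penalized logistic regression: small L2 penalty,
# BIC-guided growing/deletion, interactions obeying the hierarchy rule
fit <- step.plr(snps, y, lambda = 1e-4, cp = "bic", max.terms = 5)
summary(fit)
```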
Comparisons
• MDR: Multifactor Dimensionality Reduction, Ritchie et al (2001). This is a shotgun approach that examines all first-order, second-order, third-order, etc. tables looking for interactions.
– The case-control polarity for each cell in a table is determined by the observed majority.
– This coded table is then used as a classifier, and the tables are ranked via classification performance using 10-fold CV.
• FlexTree: Huang et al (2004). Complex method that builds classification trees using splits determined by linear combinations of subsets of factors.
Hypertension Study
Data used in Huang et al (2004), PNAS; data kindly supplied by Dr Richard Olshen, Stanford. Menopausal status and genotypes at 21 loci, for 216 hypotensive and 364 hypertensive Chinese women.
[Figure: ROC curves (sensitivity vs. specificity) for PLR under unequal loss (compared with FlexTree) and under equal loss (compared with FlexTree and MDR).]
Comments
• MDR authors claim the binary coding of cells drops the dimension to one. In simulation studies we show the effective dimension is much more than 1: df = 2, 6 and 17 for 3-, 9- and 27-cell tables. Df are used up in coding the cells.
• MDR does not do well (loses power) with additive or low-order multiple effects. It has to see these via a bigger multiway table.
• FlexTree models perform almost as well as stepPLR, but are hard to interpret.
• stepPLR delivers a familiar logistic regression model, suitably tamed via regularization, that so far has performed no worse and often better than all competitors we have tried.
Gradient Boosting
• Adaptive nonparametric method for building powerful predictive models (Friedman, 2001). In our context the model has the form
$$\log \frac{\Pr(Y = 1 \mid X)}{\Pr(Y = 0 \mid X)} = \sum_{m=1}^{M} T_m(X),$$
where each of the terms is a tree.
• The model is fit sequentially by a form of functional gradient descent; a tree is grown to the current gradient of the log-likelihood, shrunk down heavily by a shrinkage factor, and added to the current model. See our book Elements of Statistical Learning, HTF (2001), for details.
• M is a tuning parameter, much like λ in lasso.
• The depth of the trees is another tuning parameter, and determines the maximum interaction order of the model. E.g., depth-two trees give second-order models.
• Very nice R package gbm by Greg Ridgeway.
• We explored the use of GBM for detecting gene-gene and other interactions in genetic disease studies, as in the sketch below.
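A minimal sketch of such a fit with gbm; dat and its case column are placeholder names for a case-control data frame, and the tuning values mirror those used later in the talk:

```r
library(gbm)

# dat: data frame with a 0/1 outcome `case` plus genotype/exposure predictors
# (placeholder; substitute the study data)
fit <- gbm(case ~ ., data = dat,
           distribution = "bernoulli",  # binomial log-likelihood (logistic)
           n.trees = 400,               # M, the total number of trees
           interaction.depth = 2,       # depth-2 trees: second-order model
           shrinkage = 0.01,            # shrink each tree down heavily
           cv.folds = 5)                # track 5-fold CV deviance alongside

# Choose M at the minimum of the cross-validated deviance
best.M <- gbm.perf(fit, method = "cv")
```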
Example: Bladder Cancer
[Figure: ROC curves (sensitivity vs. specificity) for stepPLR and for GBM with interaction orders 2 and 3.]
• Data from Hung et al. (2004), Cancer Epidemiology, Biomarkers and Prevention, kindly supplied by Dr John Witte, UCSF.
• Fit interaction order 2 and 3 models, with shrinkage = 0.01.
• Performance is similar to stepPLR, and could potentially scale up better to larger problems (many loci).
Training and Cross-validation
The tuning parameter M (the number of trees) determines complexity. Very similar to Lasso/Cosso (see papers on webpage on Forward Stagewise and Monotone Lasso).
[Figure: Training and 5-fold CV deviance as a function of the number of trees, for interaction orders 2 and 3.]
Variable Importance
[Figure: Relative influence of each candidate variable; smoke_3, mpo_n and mnsod_n2 contribute most.]
• Method for assessing the overall contribution of each variable to the model.
• Does not treat main effects separately from interactions.
• Mixes contributions of correlated variables.
• Would need refinements to tease out interaction vs. main effect contributions.
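With a gbm fit, the relative-influence barplot comes from summary(); this continues the hypothetical fit and best.M from the earlier sketch:

```r
# Relative influence of each variable at the CV-chosen number of trees
summary(fit, n.trees = best.M)
```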
Partial Dependence Plots
Give the average (main effects here) for important variables: $\bar{f}(x_1) = E_{X_2} f(x_1, X_2)$.
[Figure: Partial main effects for smoke_3, mpo_n and mnsod_n2.]
[Figure: Second-order interaction effects for the pairs mpo_n/smoke_3, mpo_n/mnsod_n2, smoke_3/mnsod_n2, mpo_n/gstm1_n and gstm1_n/mnsod_n2.]
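Both the main-effect and the second-order plots can be produced with plot() on a gbm fit, continuing the earlier sketch; i.var names one variable, or a pair for an interaction:

```r
# Partial dependence of the fitted log-odds on one variable,
# averaging over the other predictors
plot(fit, i.var = "smoke_3", n.trees = best.M)

# Joint partial dependence for a pair of variables (second-order effect)
plot(fit, i.var = c("mpo_n", "smoke_3"), n.trees = best.M)
```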
Discussion
• Splits divide three-level factors into two groups, potentially appropriate for genotype data (dominant vs. recessive).
• Potentially useful screening tool for a large number of SNPs.
• Needs further refinement.
Wrap-up
All three methods available in R as packages:
• glmpath by Mee Young Park & Hastie. Suitable for automatic variable selection and regularization in linear logistic models.
• stepPLR by Mee Young Park & Hastie. Suitable for detecting interactions in logistic regression models.
• gbm by Greg Ridgeway. Exploratory tool for screening a large number of variables for main effects and interactions.