Logistic Regression on Autopilot
Trevor Hastie and Mee Young Park
Stanford University
Topics
• glmpath: R package for fitting the ℓ1 regularization path for generalized linear models and the Cox PH model.
• stepPLR: a regularized forward-stepwise logistic regression package for fitting gene/gene and gene/environment interaction models in studies of genetic diseases.
• gbm: Greg Ridgeway's package for fitting gradient boosted models, including logistic regression.
Background
Logistic regression is a heavily used tool in statistics. Here are some applications we have been involved in:
• Risk modeling: e.g. risk factors for heart disease, risk of insurance fraud or of payment default on credit cards.
• Prediction models, e.g. QSAR (Quantitative Structure-Activity Relationship): use a compound's chemical and structural attributes to predict its biological activity, toxicity, etc.
• Discovering genes and their interactions in genetic studies of diseases, based on measurements on a large number of SNPs.
Problems with (stepwise) logistic regression
Stepwise logistic regression is popular in the biosciences (SAS community), because it automatically builds a model (with interactions). It has a number of failings:
• It can overfit and perform (predict) poorly.
• With multi-level factors and smallish datasets (genetic studies), empty/sparse cells cause instability.
• It is difficult to assess results: how do we assign a p-value to selected variables?
GLMs with ℓ1 (lasso) regularization
For logistic regression we fit the linear model
$$\log \frac{\Pr(Y = 1 \mid X)}{\Pr(Y = 0 \mid X)} = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$$
via regularized maximum likelihood:
$$\max_{\beta} \; \ell(y; \beta) - \lambda \|\beta\|_1$$
This is the lasso (Tibshirani, 1996) for logistic regression, and is well known:
• Does variable selection and shrinkage.
• Smoother path than forward stepwise.
• Select λ by AIC, BIC or k-fold CV.
glmpath package
• Computes the entire ℓ1 path for GLMs and the Cox model.
• Uses predictor-corrector ideas from convex optimization.
• Computes the exact path at a sequence of index points t.
• Can approximate the junctions (in λ) where the active set changes.
[Figure: Standardized coefficient paths for variables x1–x5 as functions of λ, computed at a coarse and at a fine sequence of points along the path.]
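A minimal sketch of such a fit, on the South African heart disease data that (we believe) ships with the package as heart.data; the glmpath, plot and cv.glmpath calls follow our reading of the package documentation and should be checked against it:

```r
library(glmpath)

# South African heart disease data bundled with the package
data(heart.data)
x <- heart.data$x
y <- heart.data$y

# Compute the entire L1 regularization path for logistic regression
fit <- glmpath(x, y, family = binomial)

# Standardized coefficient profiles as a function of lambda
plot(fit, xvar = "lambda")

# 10-fold cross-validation to guide the choice of lambda
cv.glmpath(x, y, family = binomial, nfold = 10)
```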
Bootstrap Sensitivity Analysis
[Figure: Histograms of bootstrapped coefficient estimates for sbp, tobacco, ldl and adiposity.]
Fit a logistic regression path, and use 10-fold CV to select λ.
• For the SA heart disease data, red lines indicate the fitted values.
• Histograms represent the distribution obtained when repeating this procedure on bootstrapped datasets.
• Something like a Bayesian posterior distribution.
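A minimal sketch of this bootstrap loop, continuing the glmpath example above. Here select.lambda() is a hypothetical stand-in for whatever rule picks λ at the 10-fold-CV minimum, and we assume predict() on a glmpath fit returns coefficients when type = "coefficients"; both should be checked against the package:

```r
library(glmpath)
data(heart.data)
x <- heart.data$x
y <- heart.data$y
n <- nrow(x)

B <- 500                                   # bootstrap replications
boot.coef <- matrix(NA, B, ncol(x), dimnames = list(NULL, colnames(x)))

for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)      # resample cases with replacement
  fit <- glmpath(x[idx, ], y[idx], family = binomial)
  lam <- select.lambda(x[idx, ], y[idx])   # hypothetical helper: CV-chosen lambda
  # coefficients at the chosen lambda; drop the intercept
  boot.coef[b, ] <- predict(fit, s = lam, mode = "lambda",
                            type = "coefficients")[-1]
}

# Bootstrap distribution for one coefficient (cf. the histograms above)
hist(boot.coef[, "tobacco"], breaks = 30, main = "tobacco")
```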
• Pairwise plots are useful too; often a selected variable has a correlated cousin, and either could be important.
• Can be used with stepwise logistic regression as well.
[Figure: Pairwise scatterplots of bootstrapped coefficients: obesity vs. adiposity, and tobacco vs. ldl.]
• GLMpath paper on my website: Park & Hastie (2006), An ℓ1 regularization-path algorithm for generalized linear models.
• Yuan and Lin (2006) extend the lasso to deal with groups of variables (e.g. dummy variables for factors):
$$\max_{\beta} \; \ell(y; \beta) - \lambda \sum_{m=1}^{M} \gamma_m \|\beta_m\|_2.$$
The ℓ2 norm ensures the vector of coefficients βm is all zero or all non-zero together (the weights γm, typically the square root of the group size, put groups of different sizes on an equal footing).
• Using predictor-corrector methods we can construct the path for this criterion: Park & Hastie (2006), Regularization path algorithms for detecting gene interactions.
Stepwise Penalized Logistic Regression
• In genetic disease studies, we are often faced with modest sample sizes, binary (case-control) responses, and many candidate genes, each a 3-level factor (AA, Aa, aa).
• Usually the wild-type is prevalent, so the factor levels are unevenly populated.
• Two-way interactions have 9 cells, many of which have zero or very low counts.
• Logistic regression does not do well in these scenarios: exact aliasing, high-variance coefficients, convergence problems and overfitting.
• We work with the ℓ2 penalized log-likelihood:
$$\max_{\beta} \; \ell(y; \beta) - \lambda \|\beta\|_2^2,$$
with typically a small value for λ.
• Benefits of the ℓ2 penalty:
– Can code factors with a saturated set of dummy variables; the natural "summation to 0" constraints are automatically maintained.
– Coefficients for sparsely populated cells are shrunk more toward zero than those for well-populated cells.
– Coefficients for empty cells are set to zero automatically.
• Can calibrate the effective df for such a fit from the trace of the weighted ridge operator of the final IRLS step.
• We then run forward stepwise logistic regression using this fitting engine.
• Hierarchy rule: we allow interactions to enter if either of the main effects is present.
• Use AIC/BIC to guide the forward growing/backward deletion.
• Small/large λ allows less/more complex models. We estimate λ by cross-validation.
• Package stepPLR in R (see the sketch after this list).
• Method described in Park & Hastie (2006), Penalized logistic regression for detecting gene interactions.
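A minimal sketch of how such a fit might be called, assuming the step.plr() interface from the stepPLR package; the argument names lambda, cp and max.terms follow our recollection of its documentation, the genotype data are simulated placeholders, and the way x should encode 3-level factors ought to be checked against the package:

```r
library(stepPLR)

# Simulated placeholder data: 21 three-level genotype factors, binary response
set.seed(1)
n <- 200
snps <- data.frame(lapply(1:21, function(j)
  factor(sample(c("AA", "Aa", "aa"), n, replace = TRUE,
                prob = c(0.7, 0.25, 0.05)))))  # wild-type AA prevalent
y <- rbinom(n, 1, 0.4)

# Forward stepwise penalized logistic regression: small L2 penalty,
# BIC-guided growing/deletion, interactions obeying the hierarchy rule
fit <- step.plr(snps, y, lambda = 1e-4, cp = "bic", max.terms = 5)
summary(fit)
```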
Comparisons
• MDR: Multifactor Dimensionality Reduction, Ritchie et al (2001). This is a shotgun approach that examines all first-order, second-order, third-order, etc. tables looking for interactions.
– The case-control polarity for each cell in a table is determined by the observed majority.
– This coded table is then used as a classifier, and the tables are ranked via classification performance using 10-fold CV.
• FlexTree: Huang et al (2004). Complex method that builds classification trees using splits determined by linear combinations of subsets of factors.
Hypertension Study
Data used in Huang et al (2004), PNAS; data kindly supplied by Dr Richard Olshen, Stanford. Menopausal status and genotypes at 21 loci, for 216 hypotensive and 364 hypertensive Chinese women.
[Figure: ROC curves (sensitivity vs. specificity) for PLR under unequal loss (compared with FlexTree) and under equal loss (compared with FlexTree and MDR).]
Comments
• MDR authors claim the binary coding of cells drops the dimension to one. In simulation studies we show the effective dimension is much more than 1: df = 2, 6 and 17 for 3-, 9- and 27-cell tables. Df are used up in coding the cells.
• MDR does not do well (loses power) with additive or low-order multiple effects. It has to see these via a bigger multiway table.
• FlexTree models perform almost as well as stepPLR, but are hard to interpret.
• stepPLR delivers a familiar logistic regression model, suitably tamed via regularization, that so far has performed no worse and often better than all competitors we have tried.
Gradient Boosting
• Adaptive nonparametric method for building powerful predictive models (Friedman, 2001). In our context the model has the form
$$\log \frac{\Pr(Y = 1 \mid X)}{\Pr(Y = 0 \mid X)} = \sum_{m=1}^{M} T_m(X),$$
where each of the terms is a tree.
• The model is fit sequentially by a form of functional gradient descent; a tree is grown to the current gradient of the log-likelihood, shrunk down heavily by a shrinkage factor, and added to the current model. See our book Elements of Statistical Learning, HTF (2001), for details.
• M is a tuning parameter, much like λ in lasso.
• The depth of the trees is another tuning parameter, and determines the maximum interaction order of the model. E.g., depth-two trees give second-order models.
• Very nice R package gbm by Greg Ridgeway.
• We explored the use of GBM for detecting gene-gene and other interactions in genetic disease studies, as in the sketch below.
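A minimal sketch of such a fit with gbm; dat and its case column are placeholder names for a case-control data frame, and the tuning values mirror those used later in the talk:

```r
library(gbm)

# dat: data frame with a 0/1 outcome `case` plus genotype/exposure predictors
# (placeholder; substitute the study data)
fit <- gbm(case ~ ., data = dat,
           distribution = "bernoulli",  # binomial log-likelihood (logistic)
           n.trees = 400,               # M, the total number of trees
           interaction.depth = 2,       # depth-2 trees: second-order model
           shrinkage = 0.01,            # shrink each tree down heavily
           cv.folds = 5)                # track 5-fold CV deviance alongside

# Choose M at the minimum of the cross-validated deviance
best.M <- gbm.perf(fit, method = "cv")
```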
Example: Bladder Cancer
[Figure: ROC curves (sensitivity vs. specificity) for stepPLR and for GBM with interaction orders 2 and 3.]
• Data from Hung et al. (2004), Cancer Epidemiology, Biomarkers and Prevention, kindly supplied by Dr John Witte, UCSF.
• Fit interaction order 2 and 3 models, with shrinkage = 0.01.
• Performance is similar to stepPLR, and could potentially scale up better to larger problems (many loci).
Training and Cross-validation
The tuning parameter M (the number of trees) determines complexity. Very similar to Lasso/Cosso (see papers on webpage on Forward Stagewise and Monotone Lasso).
[Figure: Training and 5-fold CV deviance as a function of the number of trees, for interaction orders 2 and 3.]
Variable Importance
[Figure: Relative influence of each candidate variable; smoke_3, mpo_n and mnsod_n2 contribute most.]
• Method for assessing the overall contribution of each variable to the model.
• Does not treat main effects separately from interactions.
• Mixes contributions of correlated variables.
• Would need refinements to tease out interaction vs. main effect contributions.
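With a gbm fit, the relative-influence barplot comes from summary(); this continues the hypothetical fit and best.M from the earlier sketch:

```r
# Relative influence of each variable at the CV-chosen number of trees
summary(fit, n.trees = best.M)
```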
Partial Dependence Plots
Give the average (main effects here) for important variables: $\bar{f}(x_1) = E_{X_2} f(x_1, X_2)$.
[Figure: Partial main effects for smoke_3, mpo_n and mnsod_n2.]
[Figure: Second-order interaction effects for the pairs mpo_n/smoke_3, mpo_n/mnsod_n2, smoke_3/mnsod_n2, mpo_n/gstm1_n and gstm1_n/mnsod_n2.]
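Both the main-effect and the second-order plots can be produced with plot() on a gbm fit, continuing the earlier sketch; i.var names one variable, or a pair for an interaction:

```r
# Partial dependence of the fitted log-odds on one variable,
# averaging over the other predictors
plot(fit, i.var = "smoke_3", n.trees = best.M)

# Joint partial dependence for a pair of variables (second-order effect)
plot(fit, i.var = c("mpo_n", "smoke_3"), n.trees = best.M)
```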
Discussion
• Splits divide three-level factors into two groups, potentially appropriate for genotype data (dominant vs. recessive).
• Potentially useful screening tool for a large number of SNPs.
• Needs further refinement.
Wrap-up
All three methods available in R as packages:
• glmpath by Mee Young Park & Hastie. Suitable for automatic variable selection and regularization in linear logistic models.
• stepPLR by Mee Young Park & Hastie. Suitable for detecting interactions in logistic regression models.
• gbm by Greg Ridgeway. Exploratory tool for screening a large number of variables for main effects and interactions.