Stat 306: Finding Relationships in Data.
Lecture 15, Sections 4.1 and 4.2
Chapter 4 – Variable selection and additional diagnostics
4.1 Variable Selection algorithms
4.2 Cross-validation and out-of-sample assessment
4.3 Additional diagnostics
4.4 Transforms and nonlinearity
4.5 Diagnostics for data collected sequentially in time
Four categories of scientific study
                       Observational   Experimental
Goal is Explanation    1.              2.
Goal is Prediction     3.              4.
Goal is Explanation
1. What questions do you want to ask?
2. Define an appropriate model.
3. Define the hypotheses that correspond to the questions of interest.
4. Collect the data.
5. Fit the model as defined earlier.
6. Answer your questions with uncertainty quantification (i.e. with p-values, confidence intervals).
Goal is Prediction
1. What do you want to predict?
2. Define an appropriate metric for evaluating the quality of predictions (e.g. RMSE, absolute prediction error, ROC curve).
3. Collect the data.
4. Separate your data into “train” and “holdout” subsets.
5. Fit many different models to the “train” subset of the data.
6. Pick the model that is “best” (according to your chosen metric) for making predictions on the “holdout” subset of the data.
7. Note that p-values and confidence intervals are not valid for a model selected this way.
Goal is Prediction… but you also want some explanations (warning, this is a bit outdated)
1. Collect the data.
2. Select a “model-selection” criterion (e.g. adjusted R² or Cp).
3. Identify all possible regression models with all possible combinations of the predictors.
4. Identify a subset of models that are best in terms of the chosen “model-selection” criterion.
5. Evaluate and refine the models identified in Step 4 by doing residual analyses, transformations, and checks of the model assumptions.
6. Pick a “best” model from the refined subset of models that meets the assumptions and allows you to do some explanations.
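
Steps 1 to 4 above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data are simulated, the criterion is adjusted R², and the helpers `solve` and `sse_of_fit` are hypothetical names written for this sketch, not course code. Steps 5 and 6 (residual analysis and the final choice) still have to be done by hand.

```python
import random
from itertools import combinations


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def sse_of_fit(rows, y, cols):
    """Fit y on an intercept plus the chosen predictor columns; return the SSE."""
    X = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))


# Step 1: simulated data (an assumption); only x0 and x1 truly affect y.
random.seed(306)
n, p = 60, 3
rows = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2 + 3 * r[0] - 2 * r[1] + random.gauss(0, 1) for r in rows]

y_bar = sum(y) / n
sst = sum((yi - y_bar) ** 2 for yi in y)

# Steps 2-3: compute adjusted R^2 for every subset of the predictors.
results = []
for size in range(p + 1):
    for cols in combinations(range(p), size):
        sse = sse_of_fit(rows, y, cols)
        n_params = size + 1  # predictors plus the intercept
        adj_r2 = 1 - (sse / (n - n_params)) / (sst / (n - 1))
        results.append((adj_r2, cols))

# Step 4: shortlist the models with the largest adjusted R^2.
results.sort(reverse=True)
for adj_r2, cols in results[:3]:
    print(cols, round(adj_r2, 3))
```

With 3 predictors the loop fits 8 models; the shortlist is what would then go on to the residual checks in Step 5.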
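4.1 Variable Selection algorithms
• Even with a small number of possible covariates, there are a lot of possible models one could fit: p candidate covariates already give 2^p possible subsets.
• And think about all the possible interaction terms!
• This can make fitting and comparing every model almost impossible.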
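
To get a feel for how fast the number of candidate models grows, here is a small counting sketch. It assumes every model keeps an intercept, that each covariate is either in or out, and (for the second count) that any subset of main effects and pairwise interactions is allowed, even ones that include an interaction without its main effects. The function names are made up for this illustration.

```python
def count_main_effect_models(p):
    """Number of models using any subset of p covariates (intercept always included)."""
    return 2 ** p


def count_models_with_pairwise_interactions(p):
    """Number of subsets of the p main effects plus the p*(p-1)/2 pairwise interactions."""
    n_terms = p + p * (p - 1) // 2
    return 2 ** n_terms


for p in (3, 5, 10):
    print(p, count_main_effect_models(p), count_models_with_pairwise_interactions(p))
```

Even at p = 10 there are 1024 main-effect models, and the count with pairwise interactions is far beyond what can be fitted one by one.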
The Cp statistic, a “model-selection” criterion
The Cp statistic and the adjusted R² are very similar
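
For reference, one common textbook form of the two criteria is written out below. The notation is an assumption, since the formulas on the original slides did not survive: n is the number of observations, p the number of regression parameters in the candidate model (including the intercept), SSE_p its residual sum of squares, SST the total sum of squares, and \hat\sigma^2_full the mean squared error of the full model. The course may use an equivalent variant.

```latex
C_p = \frac{\mathrm{SSE}_p}{\hat\sigma^2_{\text{full}}} - (n - 2p),
\qquad
R^2_{\text{adj}} = 1 - \frac{\mathrm{SSE}_p / (n - p)}{\mathrm{SST} / (n - 1)}
```

Both start from SSE_p and penalize the number of parameters, which is why they tend to rank models similarly: a good model has a small C_p (close to p) and a large adjusted R².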
1. Forward Selection
   - start with one variable, add one variable at a time
2. Backward Elimination
   - start with the full model (all potential variables), remove one variable at a time
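
A minimal sketch of forward selection, under several assumptions: the data are simulated, the criterion is adjusted R², and the search stops as soon as no added variable improves it (the slides do not specify a stopping rule). The sketch starts from the intercept-only model, so its first step picks the best single variable. `solve`, `sse_of_fit` and `forward_selection` are hypothetical helper names. Backward elimination would mirror this loop, starting from all variables and removing one at a time.

```python
import random


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def sse_of_fit(rows, y, cols):
    """Fit y on an intercept plus the chosen predictor columns; return the SSE."""
    X = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))


def forward_selection(rows, y):
    """Greedily add the variable that most improves adjusted R^2; stop when none does."""
    n = len(y)
    y_bar = sum(y) / n
    sst = sum((v - y_bar) ** 2 for v in y)

    def score(cols):
        sse = sse_of_fit(rows, y, cols)
        return 1 - (sse / (n - len(cols) - 1)) / (sst / (n - 1))

    selected = []
    remaining = list(range(len(rows[0])))
    best = score(selected)  # intercept-only model
    while remaining:
        cand_score, cand = max((score(selected + [c]), c) for c in remaining)
        if cand_score <= best:
            break  # no candidate improves the criterion
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best


# Simulated data (an assumption): only x0 and x2 truly affect y.
random.seed(15)
rows = [[random.gauss(0, 1) for _ in range(4)] for _ in range(80)]
y = [1 + 4 * r[0] + 2 * r[2] + random.gauss(0, 1) for r in rows]

selected, best = forward_selection(rows, y)
print("order of entry:", selected, "adjusted R^2:", round(best, 3))
```

Forward selection fits far fewer models than the all-subsets search, at the cost of possibly missing the best subset.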
4.2 Train/Test
Goal is Prediction
1. What do you want to predict?
2. Define an appropriate metric for evaluating the quality of predictions (e.g. RMSE, absolute prediction error, ROC curve).
3. Collect the data.
4. Separate your data into “train” and “holdout” subsets.
5. Fit many different models to the “train” subset of the data.
6. Pick the model that is “best” (according to your chosen metric) for making predictions on the “holdout” subset of the data.
7. Note that p-values and confidence intervals are not valid for a model selected this way.
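
The train/holdout steps above might look like the following sketch. Everything specific here is an assumption for illustration: the simulated data, the 70/30 split, RMSE as the metric, and the three small candidate models (`fit_mean`, `fit_origin`, `fit_line` are hypothetical names).

```python
import math
import random


def fit_mean(xs, ys):
    """Intercept-only model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda v: m


def fit_origin(xs, ys):
    """Regression through the origin: y = b * x."""
    b = sum(a * c for a, c in zip(xs, ys)) / sum(a * a for a in xs)
    return lambda v: b * v


def fit_line(xs, ys):
    """Simple linear regression with an intercept, by least squares."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((a - x_bar) * (c - y_bar) for a, c in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    slope = sxy / sxx
    return lambda v: y_bar + slope * (v - x_bar)


def rmse(predict, xs, ys):
    """Step 2: the chosen metric, root mean squared prediction error."""
    return math.sqrt(sum((c - predict(a)) ** 2 for a, c in zip(xs, ys)) / len(ys))


# Step 3: simulated data (an assumption): y = 5 + 2x + noise.
random.seed(15)
x = [random.uniform(0, 10) for _ in range(100)]
y = [5 + 2 * xi + random.gauss(0, 1) for xi in x]

# Step 4: random 70/30 split into "train" and "holdout".
idx = list(range(len(x)))
random.shuffle(idx)
train, holdout = idx[:70], idx[70:]
x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
x_ho, y_ho = [x[i] for i in holdout], [y[i] for i in holdout]

# Step 5: fit every candidate model on the "train" subset only.
candidates = {"mean": fit_mean, "origin": fit_origin, "line": fit_line}
fitted = {name: fit(x_tr, y_tr) for name, fit in candidates.items()}

# Step 6: pick the model with the smallest RMSE on the "holdout" subset.
scores = {name: rmse(predict, x_ho, y_ho) for name, predict in fitted.items()}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

The holdout rows are never used for fitting, so the holdout RMSE is an out-of-sample assessment of each candidate.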
4.2 Cross-validation
Goal is Prediction
1. What do you want to predict?
2. Define an appropriate metric for evaluating the quality of predictions (e.g. RMSE, absolute prediction error, ROC curve).
3. Collect the data.
4. Separate your data into K random subsets.
5. For k in 1:K
   - Fit your model using all the data except the kth subset.
   - Calculate the metric (e.g. prediction error) by using the fitted model to predict the kth subset of the data.
6. Calculate the average of the K metrics for each model.
7. Choose the “best model” based on the averaged metric.
8. Note that p-values and confidence intervals are not valid for a model selected this way.
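
A minimal sketch of the K-fold procedure above, assuming simulated data, K = 5, mean absolute prediction error as the metric, and two small candidate models. The names `fit_mean`, `fit_line` and `k_fold_cv` are hypothetical helpers written for this illustration.

```python
import random


def fit_mean(xs, ys):
    """Intercept-only model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda v: m


def fit_line(xs, ys):
    """Simple linear regression with an intercept, by least squares."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((a - x_bar) * (c - y_bar) for a, c in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    slope = sxy / sxx
    return lambda v: y_bar + slope * (v - x_bar)


def k_fold_cv(fit, x, y, K, seed=0):
    """Return the K-averaged mean absolute prediction error for one model."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)          # Step 4: K random subsets
    folds = [idx[k::K] for k in range(K)]
    fold_errors = []
    for k in range(K):                        # Step 5: loop over the folds
        held_out = set(folds[k])
        train = [i for i in idx if i not in held_out]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        errors = [abs(y[i] - predict(x[i])) for i in folds[k]]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / K               # Step 6: average of the K metrics


# Step 3: simulated data (an assumption): y = 5 + 2x + noise.
random.seed(306)
x = [random.uniform(0, 10) for _ in range(100)]
y = [5 + 2 * xi + random.gauss(0, 1) for xi in x]

# Step 7: compare the models on the same folds (same seed) and pick the best.
cv = {name: k_fold_cv(fit, x, y, K=5) for name, fit in
      {"mean": fit_mean, "line": fit_line}.items()}
best = min(cv, key=cv.get)
print(cv, "->", best)
```

Using the same seed for every model keeps the folds identical, so the averaged metrics are directly comparable.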
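For each model, we do 5-fold CV:
Metric: Mean Absolute Prediction Error
Mean Absolute Prediction Error on the five held-out folds: 12, 8, 6, 9, 5
K-averaged metric = (12 + 8 + 6 + 9 + 5)/5 = 40/5 = 8
Source: http://blog.goldenhelix.com/goldenadmin/cross-validation-for-genomic-prediction-in-svs/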
4.2 Leave-one-out
Goal is Prediction