Feature selection from gene expression data
Molecular signatures for breast cancer prognosis and inference of gene regulatory networks

Anne-Claire Haury
Centre for Computational Biology
Mines ParisTech, INSERM U900, Institut Curie
PhD Defense, December 14, 2012
Introduction
Gene expression
Figure: Central Dogma of Molecular Biology (Source: Wikipedia)
=> RNA as a proxy to measure gene expression.
Microarrays: measuring gene expression
- Microarrays: one option.
- RNA hybridized onto a chip.
- Quantification of gene activity in different conditions at the genome scale.
- Resulting data: gene expression data.
Figure: gene expression matrix (heatmap with genes and samples as axes, color scale from low to high expression).
Feature selection: extract relevant information
- Genome ≈ 25,000 genes.
- From gene expression, explain a phenomenon.
- Feature selection: deciding which variables/genes are relevant.
Mathematically: explain response $Y$ using the relevant variables amongst $(X_j)_{j=1,\dots,p}$:
$Y = f(X_1, X_2, X_3, X_4, \dots, X_p) + \varepsilon$
- Objective 1: more accurate predictors.
- Objective 2: more interpretable predictors.
- Objective 3: faster algorithms.
Contributions of this thesis
Gene Regulatory Network Inference
- TIGRESS: new method based on local feature selection.
- Ranked 3rd/29 at DREAM5 challenge.
- Linear method, competitive with more complex algorithms.
Molecular signatures for breast cancer prognosis
- Select biomarkers to predict metastasis/relapse in breast cancer patients.
- Complete benchmark of feature selection methods.
- Investigation of the stability issue.
Gene Regulatory Network Inference with TIGRESS
Haury, A.-C., Mordelet, F., Vera-Licona, P. and Vert, J.-P., TIGRESS: Trustful Inference of Gene REgulation using Stability Selection, BMC Systems Biology, 2012, 6:145.
Gene Regulatory Networks
- Gene regulation: control of gene expression.
- Transcription factors (TF) activate or repress target genes (TG).
- Gene Regulatory Network (GRN): representation of the regulatory interactions between genes.
Figure: E. coli regulatory network
DREAM network inference challenge
Network inference challenge, DREAM5 results:

Method            Network 1        Network 3        Network 4        Overall
                  AUPR   AUROC     AUPR   AUROC     AUPR   AUROC
GENIE3 [1]        0.291  0.815     0.093  0.617     0.021  0.518     40.28
ANOVerence [2]    0.245  0.780     0.119  0.671     0.022  0.519     34.02
Naive TIGRESS     0.301  0.782     0.069  0.595     0.020  0.517     31.1

[1] Huynh-Thu et al., 2010
[2] Kueffner et al., 2012
Purposes
- Introduce TIGRESS: Trustful Inference of Gene REgulation using Stability Selection.
- Assess the impact of the parameters.
- Test and benchmark TIGRESS on several datasets.
Outline
1 Methods
  - Regression-based inference
  - TIGRESS
  - Material
2 Results
  - In silico network results
  - In vitro network results
  - Undirected case: DREAM4
3 Conclusion
Regression-based inference: hypotheses
Notations
- $n_{tf}$ transcription factors (TF), $n_{tg}$ target genes (TG), $n_{exp}$ experiments.
- Expression data: $X$ of size $n_{exp} \times (n_{tf} + n_{tg})$.
- $X_g$: expression levels of gene $g$.
- $X_G$: expression levels of the genes in $G$.
- $T_g$: candidate TFs for gene $g$.
Hypotheses
1 The expression level $X_g$ of a TG $g$ is a function of the expression levels $X_{T_g}$ of $T_g$: $X_g = f_g(X_{T_g}) + \varepsilon$.
2 A score $s_g(t)$ can be derived from $f_g$, for all $t \in T_g$, to assess the probability of the interaction $(t, g)$.
Regression-based inference: main steps
Idea: consider as many problems as TGs ($n_{tg}$ subproblems); subproblem $g$ ⇔ find the regulators TFs($g$) of gene $g$.
1 For each TG, score all $n_{tf}$ candidate interactions:

               TG 1    TG 2    TG 3    ...    TG n_tg
   TF 1        -       0.23    0       ...    0.11
   TF 2        0.97    -       0.03    ...    0
   ...         ...     ...     ...     ...    ...
   TF n_tf     0       0       0       ...    0.76

2 Rank all the scores together:

   TF 12 → TG 17    1
   TF 23 → TG 5     0.99
   TF 2  → TG 1     0.97
   ...              ...

3 Threshold to a value or a given number N of edges.
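Steps 2 and 3 of this slide can be sketched in a few lines of Python. This is only an illustration: the function name `rank_edges`, the dictionary layout and the toy scores are assumptions made for the example, not part of the TIGRESS software.

```python
def rank_edges(scores, n_edges=None, threshold=None):
    """Turn a {(tf, tg): score} mapping into a ranked, thresholded edge list.

    Hypothetical helper for illustration; self-interactions (tf == tg) are skipped,
    mirroring the "-" entries of the score table.
    """
    edges = [(tf, tg, s) for (tf, tg), s in scores.items() if tf != tg]
    # Step 2: rank all candidate interactions together, highest score first.
    edges.sort(key=lambda e: e[2], reverse=True)
    # Step 3: threshold either on the score value or on a number N of edges.
    if threshold is not None:
        edges = [e for e in edges if e[2] >= threshold]
    if n_edges is not None:
        edges = edges[:n_edges]
    return edges


# Toy scores, loosely following the table on the slide.
scores = {
    ("TF 1", "TG 2"): 0.23,
    ("TF 2", "TG 1"): 0.97,
    ("TF 2", "TG 3"): 0.03,
    ("TF 1", "TG 3"): 0.0,
}
top = rank_edges(scores, n_edges=2)
```

Keeping the whole ranked list and thresholding late makes it easy to evaluate the same predictions at several network sizes.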
Adding a linearity assumption
TIGRESS' first hypothesis: regulations are linear
$X_g = f_g(X_{T_g}) + \varepsilon = X_{T_g}\omega_g + \varepsilon$
Consequence: if $\omega_{gt} = 0$, no edge between $g$ and $t$.
Adding a sparsity assumption
TIGRESS' second hypothesis: few TFs regulate each TG
$\forall g,\quad \#\{t : \omega_{gt} \neq 0\} \ll n_{tf}$
Possible algorithm: LARS with L steps => L TFs selected.
Stability Selection
Problem: LARS efficiency is limited:
- bad response to correlation;
- no confidence score for each TF.
Solution: Stability Selection with randomized LARS (Bach, 2008; Meinshausen and Bühlmann, 2009):
- Resample the experiments: run LARS many (e.g. 1,000) times with different training sets.
- "Resample" the variables: also weight the variables
  $X_{it} \leftarrow W_t X_{it}$  (1)
  where $W_t \sim U([\alpha, 1])$ for all $t = 1, \dots, n_{tf}$. The smaller $\alpha$, the more randomized the variables.
- Get a frequency of selection for each TF.
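The resampling loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TIGRESS implementation: the selector `top_l_by_covariance` is a simple stand-in that keeps the L columns with the largest absolute covariance with the target (it is not LARS), each run uses half of the experiments (the slide only says "different training sets"), and all function names and the toy data are hypothetical.

```python
import random


def top_l_by_covariance(X, y, L):
    """Stand-in for L LARS steps (NOT LARS): keep the L columns whose
    centered inner product with y is largest in absolute value."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    scores = []
    for t in range(p):
        col = [row[t] for row in X]
        c_mean = sum(col) / n
        scores.append(abs(sum((c - c_mean) * (v - y_mean) for c, v in zip(col, y))))
    return sorted(range(p), key=lambda t: scores[t], reverse=True)[:L]


def selection_frequencies(X, y, L, n_runs=200, alpha=0.4, seed=0,
                          select=top_l_by_covariance):
    """Frequency of selection of each candidate TF over randomized runs."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_runs):
        # Resample the experiments (assumption: half of them, without replacement).
        rows = rng.sample(range(n), n // 2)
        # "Resample" the variables: X_it <- W_t X_it with W_t ~ U([alpha, 1]).
        w = [rng.uniform(alpha, 1.0) for _ in range(p)]
        Xs = [[w[t] * X[i][t] for t in range(p)] for i in rows]
        ys = [y[i] for i in rows]
        for t in select(Xs, ys, L):
            counts[t] += 1
    return [c / n_runs for c in counts]


# Toy data: 40 experiments, 5 candidate TFs, only TF 0 drives the target gene.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(40)]
y = [5.0 * row[0] + 0.1 * data_rng.gauss(0, 1) for row in X]
freqs = selection_frequencies(X, y, L=2)
```

Passing the selector as an argument keeps the randomization logic separate from the feature selection algorithm, so a real LARS routine could be plugged in without touching the loop.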
Stability Selection path
Figure: frequency of selection of each TF over the subsamplings (y-axis, 0 to 1) against the number L of LARS steps (x-axis, 0 to 20); example for one target gene.
Scoring
Figure: frequency of selection of each TF over the subsamplings against the number L of LARS steps.
Choose L, then:
- Original scoring
- Area scoring (contribution)
Let $H_t$ be the rank of TF $t$. Then,
$\mathrm{score} = E[\phi(H_t)]$
with
- Original: $\phi(h) = 1$ if $h \le L$, 0 otherwise
- Area: $\phi(h) = L + 1 - h$ if $h \le L$, 0 otherwise
=> Area scoring takes the value of the rank into account.
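The two scoring rules can be compared on a small example. The sketch below estimates $E[\phi(H_t)]$ by averaging over runs; the data layout (one dictionary of ranks per run, with unselected TFs simply absent) and the function names are assumptions made for illustration.

```python
def phi_original(h, L):
    """Original scoring: 1 if the TF entered within the first L steps."""
    return 1.0 if h <= L else 0.0


def phi_area(h, L):
    """Area scoring: earlier entry (smaller rank h) gives a larger value."""
    return float(L + 1 - h) if h <= L else 0.0


def score(rank_runs, L, phi):
    """Average phi(H_t) over runs; a TF absent from a run counts as rank L + 1."""
    tfs = set().union(*rank_runs)
    n = len(rank_runs)
    return {t: sum(phi(run.get(t, L + 1), L) for run in rank_runs) / n for t in tfs}


# Three hypothetical runs with L = 2: rank 1 means "selected first".
runs = [{"A": 1, "B": 2}, {"B": 1, "A": 2}, {"A": 1, "C": 2}]
original = score(runs, 2, phi_original)
area = score(runs, 2, phi_area)
```

In this toy case the original score only counts how often a TF appears in the first two steps, while the area score also rewards A for usually being selected first.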
TIGRESS Summary
Idea: consider as many problems as TGs ($n_{tg}$ subproblems); subproblem $g$ ⇔ find the regulators TFs($g$) of gene $g$.
1 For each TG, score all $n_{tf}$ candidate interactions (LARS + Stability Selection + choose L + score):

               TG 1    TG 2    TG 3    ...    TG n_tg
   TF 1        -       0.23    0       ...    0.11
   TF 2        0.97    -       0.03    ...    0
   ...         ...     ...     ...     ...    ...
   TF n_tf     0       0       0       ...    0.76

2 Rank all the scores together:

   TF 12 → TG 17    1
   TF 23 → TG 5     0.99
   TF 2  → TG 1     0.97
   ...              ...

3 Threshold to a value or a given number N of edges.
Parameters
TIGRESS needs four parameters to be set:
- scoring method (original, area, ...);
- number of runs R: large;
- randomization level α: between 0 and 1;
- number of LARS steps L: not obvious.
Data

Network                                # TF   # Genes   # Chips   # Edges
DREAM5 Net 1 (in-silico)               195    1643      805       4012
DREAM5 Net 3 (E. coli)                 334    4511      805       2066
DREAM5 Net 4 (S. cerevisiae)           333    5950      536       3940
E. coli Net from Faith et al., 2007    180    1525      907       3812
DREAM4 Multifactorial Net 1            100    100       100       176
DREAM4 Multifactorial Net 2            100    100       100       249
DREAM4 Multifactorial Net 3            100    100       100       195
DREAM4 Multifactorial Net 4            100    100       100       211
DREAM4 Multifactorial Net 5            100    100       100       193
Impact of the parameters: results on in silico network
Figure: heatmaps of the overall score (color scale 40 to 100) as a function of α (x-axis, 0.1 to 0.9) and L (y-axis, 5 to 20), for area and original scoring with 1,000, 4,000 and 10,000 runs.
- Area less sensitive than original to α and L.
- Area systematically outperforms original.
- The more runs, the better.
- Best values: α = 0.4, L = 2, R = 10,000.
How to choose L?
Figure: histograms for the first 5,000, 10,000, 50,000 and 100,000 predicted edges (one panel each), shown for L = 2 and for L = 20.
- L = 2: number of TFs/TG smaller and more variable.
- L = 20: greater number of TFs/TG, less sparsity.
=> L should depend on the expected network's topology.
TIGRESS vs state-of-the-art
Figure: ROC curves (TPR against FPR) and precision-recall curves for GENIE3, ANOVerence, CLR, ARACNE, Naive TIGRESS and TIGRESS.
TIGRESS is competitive with the state of the art.
Results on E. coli network
Figure: ROC curves (TPR against FPR) and precision-recall curves for GENIE3, CLR, ARACNE and TIGRESS.
GENIE3, the Random Forests-based DREAM winner, outperforms all other methods.
False discovery analysis on E. coli
Main false positive pattern found by TIGRESS:
Figure: the regulatory pattern TIGRESS should find next to the one it does find.
- Good news: spuriously inferred edges are close to true edges.
- Bad news: confusion (due to the linear model?).
Undirected case: DREAM4 challenge
- DREAM4: undirected networks (TFs not known in advance).
- A posteriori comparison of Default TIGRESS and GENIE3:

Method     Network 1       Network 2       Network 3       Network 4       Network 5
           AUPR   AUROC    AUPR   AUROC    AUPR   AUROC    AUPR   AUROC    AUPR   AUROC
GENIE3     0.154  0.745    0.155  0.733    0.231  0.775    0.208  0.791    0.197  0.798
TIGRESS    0.165  0.769    0.161  0.717    0.233  0.781    0.228  0.791    0.234  0.764

Overall scores:
- GENIE3: 37.48
- TIGRESS: 38.85
Conclusion
Contributions:
- Automation and adaptation of Stability Selection to GRN inference.
- Area scoring setting: better results and less sensitivity to parameters.
- Insights on the network's behavior.
TIGRESS is:
- Linear
- Competitive
- Parallelizable
- Available (http://cbio.ensmp.fr/tigress)
- But outperformed in some cases by random forests: limits of linearity?
Perspectives:
- Adaptive/changeable value for L.
- Group selection of TFs.
- Use of time series/knock-out/replicates information.
Molecular signatures for breast cancer prognosis
Haury, A.-C., Gestraud, P. and Vert, J.-P., The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures, 2011, PLoS ONE
Haury, A.-C., Jacob, L. and Vert, J.-P., Improving Stability and Interpretability of Molecular Signatures, 2010, arXiv 1001.3109
Motivation
Prediction of breast cancer outcome
- Assist breast cancer prognosis based on gene expression.
- Avoid adjuvant/preventive chemotherapy when not needed.
Gene expression signature
- Data: primary site tumor expression arrays.
- Among the genome, find the few (50-100) genes sufficient to predict metastasis/relapse.
- Main challenge: high-dimensional data (few samples, many variables).
Background
- 2002: Van’t Veer et al. publish the 70-gene signature MammaPrint.
- Since then: at least 47 published signatures (Venet et al., 2011).
- Little overlap, if any (Fan et al., 2006; Thomassen et al., 2007).
- Many gene sets are equally predictive (Michiels et al., 2005; Ein-Dor et al., 2005), even within the same dataset.
- Prediction discordances (Reyal et al., 2008).
- Stability at the functional level? Not sure.
Seeking stability
- Accuracy is not enough to choose the right genes.
=> Stability as a confidence indicator.
Contributions
1 Systematic comparison of feature selection methods in terms of:
  - predictive performance;
  - stability;
  - biological stability and interpretability.
2 Group selection of genes:
  - with predefined groups (Graph Lasso);
  - with latent groups (k-overlap norm).
3 Evaluation of Ensemble methods
Evaluation
- Accuracy: how well selected genes + classifier predict metastatic events on test data.
- Stability: how similar two lists of genes are.
- Interpretability: how much biological sense selected genes make.
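One way to make "how similar two lists of genes are" concrete is a set-overlap index. The sketch below uses the Jaccard index averaged over all pairs of signatures; this particular measure is an assumption chosen for illustration (the slide does not state which similarity is used), and the gene identifiers are placeholders.

```python
def jaccard(a, b):
    """Jaccard index between two gene lists: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_stability(signatures):
    """Average Jaccard index over all pairs of signatures (assumed stability measure)."""
    pairs = [(i, j) for i in range(len(signatures))
             for j in range(i + 1, len(signatures))]
    return sum(jaccard(signatures[i], signatures[j]) for i, j in pairs) / len(pairs)


# Three hypothetical 3-gene signatures, e.g. obtained on different datasets.
signatures = [["g1", "g2", "g3"], ["g2", "g3", "g4"], ["g1", "g3", "g5"]]
stability = mean_pairwise_stability(signatures)
```

A value of 1 would mean that all signatures are identical, and a value of 0 that no two signatures share a gene.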
Data
Four public breast cancer datasets from the same technology (Affymetrix U133A):

GEO Reference   # genes   # samples   # positives
GSE1456         12,065    159         40
GSE2034         12,065    286         107
GSE2990         12,065    125         49
GSE4922         12,065    249         89
Outline
1 A simple start
2 An attempt at enforcing stability: Ensemble methods
3 Using prior knowledge: Graph Lasso
4 Acknowledging latent team work
5 Conclusion
Classical feature selection/feature ranking methods
Type       Characteristics                    Examples
Filters    Univariate, fast                   T-test, KL Divergence,
           Only depend on the data            Wilcoxon rank-sum test
           Do not use loss function
Wrappers   Learning machine as a criterion    SVM RFE, Greedy FS
           Computationally expensive
Embedded   Learning + selecting               Lasso, Elastic Net
           Possible use of prior knowledge

Each of these algorithms returns a ranked list of genes to be thresholded.
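To make the "ranked list, then threshold" idea concrete, here is a minimal sketch of a filter method: genes are ranked by the absolute value of a two-sample t-statistic and the top of the list is kept. The use of Welch's unequal-variance variant, the function names and the toy data are assumptions; the slides only name the t-test.

```python
import math
from statistics import mean, variance


def welch_t(pos, neg):
    """Welch's t-statistic between two groups of expression values."""
    se = math.sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return (mean(pos) - mean(neg)) / se


def rank_genes_by_ttest(X, y):
    """Return gene indices sorted by decreasing |t| (a filter method).

    X: list of samples, each a list of p expression values.
    y: list of 0/1 labels (1 = positive, e.g. metastatic event).
    """
    p = len(X[0])
    scores = []
    for j in range(p):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(welch_t(pos, neg)))
    return sorted(range(p), key=lambda j: scores[j], reverse=True)


# Toy data (made up): gene 1 separates the classes, genes 0 and 2 do not.
X = [[1.0, 5.1, 0.3],
     [1.2, 4.9, 0.1],
     [0.9, 5.3, 0.4],
     [1.1, 1.0, 0.2],
     [1.0, 1.2, 0.3],
     [1.3, 0.8, 0.1]]
y = [1, 1, 1, 0, 0, 0]
ranking = rank_genes_by_ttest(X, y)
signature = ranking[:1]  # threshold: keep the top-ranked gene
```

Wrappers and embedded methods fit the same interface: anything that returns an ordering of the genes can be thresholded to a signature of the desired size.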
38
First results
100 genes over four databases.
Figure: stability (y-axis) vs. accuracy (x-axis) for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and E-Net.
39
First conclusions
Figure: stability (y-axis) vs. accuracy (x-axis) for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and E-Net.
Random better than tossing a coin.
Elastic Net neither more stable nor more accurate than Lasso.
Accuracy/Stability trade-off.
T-test: both simplest and best.
Next step:
Can we have a better stability without decreasing accuracy?
40
Ensemble Methods
Run each algorithm R times on subsamples.
Get R ranked lists of genes $(r^b)_{b=1 \dots R}$.
Aggregate and get a score for each gene:

$$S(g) = \frac{1}{R} \sum_{b=1}^{R} f(r^b_g).$$

average: $f(r) = (p - r)/p$
exponential: $f(r) = \exp(-\alpha r)$
stability selection: $f(r) = \delta(r \le k)$

Sort S in decreasing order and threshold to get the final signature.
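The aggregation step can be sketched as follows. The R ranked lists are assumed to be already available as dictionaries mapping each gene to its rank (1 = best); the function names, the toy ranks and the values of `alpha` and `k` are illustrative assumptions, not settings taken from the slides.

```python
import math


def aggregate_scores(rank_lists, f):
    """S(g) = (1/R) * sum_b f(r_g^b) for R dicts mapping gene -> rank (1 = best)."""
    R = len(rank_lists)
    genes = rank_lists[0].keys()
    return {g: sum(f(ranks[g]) for ranks in rank_lists) / R for g in genes}


def make_scoring_functions(p, alpha, k):
    """The three aggregation functions: average, exponential, stability selection."""
    return {
        "average": lambda r: (p - r) / p,
        "exponential": lambda r: math.exp(-alpha * r),
        "stability selection": lambda r: 1.0 if r <= k else 0.0,
    }


# Three hypothetical ranked lists over p = 4 genes (R = 3 subsample runs).
rank_lists = [
    {"A": 1, "B": 2, "C": 3, "D": 4},
    {"A": 2, "B": 1, "C": 4, "D": 3},
    {"A": 1, "B": 3, "C": 2, "D": 4},
]
fs = make_scoring_functions(p=4, alpha=0.5, k=2)
scores = aggregate_scores(rank_lists, fs["stability selection"])
# Sort by decreasing score and threshold to a 2-gene signature.
signature = sorted(scores, key=scores.get, reverse=True)[:2]
```

With the stability selection function, S(g) is simply the fraction of runs in which gene g lands in the top k.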
42
Results
100 genes over four databases.
Figure: stability (y-axis) vs. accuracy (x-axis) for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and E-Net; single-run vs. stability selection.
43
Functional stability
100 genes over four databases.
Figure: functional stability (y-axis) vs. accuracy (x-axis) for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and E-Net; single-run vs. stability selection.
44
Ensemble methods: conclusions
Figure: two panels, stability vs. accuracy and functional stability vs. accuracy, for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and E-Net; single-run vs. stability selection.
Expected improvement in stability not happening.
Slight improvement in accuracy in some cases.
Loss in functional stability.
T-test: still the preferred method.
Next step:
Can we do better by incorporating prior knowledge?
45
Data
Expression data: Van’t Veer et al., 2002; Wang et al., 2005.
PPI network with 8141 genes (Chuang et al., 2007).
Assumption: genes close on the graph behave similarly.
Idea: instead of selecting single genes, select edges.
47
Selecting groups of genes: ℓ1 methods
Lasso : selects single genes (Tibshirani, 1996)
Why groups?
selecting similar genes: improving stability and interpretability
smoothing out noise by "averaging": improving accuracy
Group Lasso (Yuan & Lin, 2006): induces group sparsity for groups of covariates that form a partition of {1, ..., p}.
Overlapping group Lasso (Jacob et al., 2009): selects a union of potentially overlapping groups of covariates (e.g. gene pathways).
Graph Lasso: uses groups induced by the graph (e.g. edges).
48
Is group sparsity enough?
ℓ1 methods work well, but face serious stability issues when groups are correlated.
Solution: randomization through stability selection.
49
Accuracy results
Test on data from Wang et al., 2005.
Figure: balanced accuracy of Lasso, Lasso + stab. sel., Graph Lasso and Graph Lasso + stab. sel., for signatures learnt on the Van’t Veer dataset and on the Wang dataset.
Neither prior knowledge nor stability selection brings any improvement!
50
Stability results
Figure: number of genes in the overlap (y-axis) vs. number of genes in the signatures (x-axis) for Lasso, Graph Lasso, Lasso with stability selection and Graph Lasso with stability selection.
Graph Lasso slightly improves stability.
51
Interpretability
Signature obtained using Lasso:
52
Interpretability
Signature obtained using Graph Lasso + Stability Selection:
52
Graph Lasso conclusion
Graphical prior seems to increase stability and interpretability.
However: no change in accuracy.
Next step:
Grouping increases stability. Now on to accuracy!
53
Latent grouping
Grouping genes makes sense.
Let the data tell which genes to select together.
The k-support norm
Introduced by Argyriou et al., 2012.
A trade-off between ℓ1 and ℓ2.
Equivalent to overlapping group Lasso (Jacob et al., 2009) with all possible groups of size k.
Results in selecting groups that are not predefined.
55
Extreme randomization
Following Breiman’s random forests.
Sample both the examples and the covariates.
Fewer variables = less correlation.
Give each gene a chance to be selected.
Extreme randomization
For each of the R runs:
  Bootstrap samples (classical Ensemble method).
  Sample the covariates: randomly choose 10% of them.
  Run FS procedure on the restricted data.
=> Compute frequency of selection: P(g selected |g preselected)
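The procedure above can be sketched as below. The feature selection routine is passed in as a callback; `top_mean_difference` is a deliberately simple stand-in, and the toy data, R = 200 and the random seed are assumptions made for illustration rather than settings from the study.

```python
import random
from collections import Counter


def extreme_randomization(X, y, select, R=200, frac=0.1, seed=0):
    """Estimate P(g selected | g preselected) for each covariate g.

    select(X_sub, y_sub) is any feature selection routine returning the
    column positions (within X_sub) that it selects.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = max(1, round(frac * p))
    preselected, selected = Counter(), Counter()
    for _ in range(R):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap the examples
        cols = rng.sample(range(p), m)               # sample the covariates
        X_sub = [[X[i][j] for j in cols] for i in rows]
        y_sub = [y[i] for i in rows]
        preselected.update(cols)
        selected.update(cols[c] for c in select(X_sub, y_sub))
    return {g: selected[g] / preselected[g] for g in preselected}


def top_mean_difference(X_sub, y_sub, n_keep=1):
    """Toy selector: keep the columns with the largest class-mean gap."""
    pos = [row for row, label in zip(X_sub, y_sub) if label == 1]
    neg = [row for row, label in zip(X_sub, y_sub) if label == 0]
    if not pos or not neg:  # degenerate bootstrap: select nothing
        return []
    gaps = [abs(sum(r[c] for r in pos) / len(pos) - sum(r[c] for r in neg) / len(neg))
            for c in range(len(X_sub[0]))]
    return sorted(range(len(gaps)), key=lambda c: gaps[c], reverse=True)[:n_keep]


# Toy data (made up): 20 samples, 20 genes, only gene 0 is informative.
y = [i % 2 for i in range(20)]
X = [[10.0 * y[i]] + [((7 * i + 3 * j) % 5) / 10 for j in range(1, 20)]
     for i in range(20)]
freq = extreme_randomization(X, y, top_mean_difference)
```

Dividing by the number of times a gene was preselected, rather than by R, is what gives every gene a fair chance even though each run only sees 10% of the covariates.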
56
Accuracy or stability?
Figure: stability (y-axis) vs. accuracy (x-axis) for Random, T-test, Lasso, ENet and kSupport (k = 2, 10, 20); single-run vs. Extreme Rand. + SS.
57
Stability or redundancy?
Figure: stability (y-axis) vs. correlation (x-axis) for Random, T-test, Lasso, ENet and kSupport (k = 2, 10, 20); single-run vs. Extreme Rand. + SS.
58
Conclusions
Figure: two panels, stability vs. accuracy and stability vs. correlation, for Random, T-test, Lasso, ENet and kSupport (k = 2, 10, 20); single-run vs. Extreme Rand. + SS.
Extreme Randomization improves accuracy.
Grouping improves stability.
But: both effects do not add up that well (redundancy).
T-test still the best trade-off?
59
Signature selection: conclusion
Contributions:
Step-by-step study of the behavior of FS methods on several breast cancer datasets.
Systematic analysis of accuracy, stability, interpretability.
Insights on the accuracy/stability trade-off.
What have we learned?
Best methods: simple t-test or complex black box.
Grouping improves gene and functional stability.
Randomization improves accuracy (sometimes) but has unwanted effects on stability.
Accuracy/Stability trade-off: Stability => redundancy => lower accuracy.
61
Signature selection: perspectives
One unique signature?
  single breast cancer subtype
  many samples
  larger signature
Is expression data sufficient?
  probably not all information is there
  clinical data: same accuracy (same information?)
  possibly look at genotype, methylation, clinical and expression
Is stability important?
  not as important as accuracy + prediction concordance
  possibly not even achievable
62
Conclusion
63
Conclusion
Gene expression data:
  High-dimensional, noisy.
  Possibly contains important information.
Feature selection:
  Find the needle in the haystack.
  Output relevant genes to be studied further.
Main issues:
  Results are not necessarily transferable across datasets.
  Models rely on hypotheses!
Fixing:
  Testing on many databases.
  Keeping model hypotheses in mind / not being afraid of black boxes.
64
Acknowledgements
Fantine Mordelet, Paola Vera-Licona, Pierre Gestraud, Laurent Jacob
65
The k-support norm
It can be shown that:
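$$\Omega_k^{sp}(\omega) = \left( \sum_{i=1}^{k-r-1} \left( |\omega|^{\downarrow}_i \right)^2 + \frac{1}{r+1} \left( \sum_{i=k-r}^{d} |\omega|^{\downarrow}_i \right)^2 \right)^{\frac{1}{2}}$$

where r is the only integer in $\{0, \dots, k-1\}$ satisfying

$$|\omega|^{\downarrow}_{k-r-1} > \frac{1}{r+1} \sum_{i=k-r}^{d} |\omega|^{\downarrow}_i \ge |\omega|^{\downarrow}_{k-r},$$

and $|\omega|^{\downarrow}_i$ is the i-th largest value of $|\omega|$ ($|\omega|^{\downarrow}_0 = +\infty$).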
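The closed form translates directly into a short routine: sort the absolute values, search for the integer r that satisfies the double inequality, then evaluate the expression. This is a minimal sketch for illustration; the function name is hypothetical and no attention is paid to efficiency or to floating-point edge cases.

```python
import math


def k_support_norm(w, k):
    """k-support norm of w via the closed form (1 <= k <= len(w))."""
    d = len(w)
    if not 1 <= k <= d:
        raise ValueError("k must satisfy 1 <= k <= len(w)")
    # z[i] is the i-th largest |w_i| (1-indexed), with z[0] = +inf.
    z = [math.inf] + sorted((abs(x) for x in w), reverse=True)
    for r in range(k):
        tail = sum(z[k - r:])  # sum_{i=k-r}^{d} z[i]
        avg = tail / (r + 1)
        if z[k - r - 1] > avg >= z[k - r]:
            head = sum(v * v for v in z[1:k - r])  # sum_{i=1}^{k-r-1} z[i]^2
            return math.sqrt(head + tail * tail / (r + 1))
    raise RuntimeError("no valid r found (numerical issue)")


w = [3.0, -4.0, 0.0]
norm_k1 = k_support_norm(w, 1)  # k = 1: sum of absolute values
norm_k3 = k_support_norm(w, 3)  # k = d: Euclidean norm
```

The two extreme cases illustrate the trade-off: k = 1 gives the ℓ1 norm and k = d gives the ℓ2 norm.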
66
Link with overlapping Group Lasso
The k-support norm is equivalent to the overlapping group Lasso norm

$$\Omega_k^{sp}(\omega) = \min_{v \in \mathbb{R}^{p \times G_k}} \left\{ \sum_{I \in G_k} \|v_I\|_2 \; : \; \operatorname{supp}(v_I) \subseteq I, \; \sum_{I \in G_k} v_I = \omega \right\}$$

where $G_k$ denotes all subsets of $\{1, \dots, d\}$ of cardinality k.
Remark 1: it selects at least k variables.
Remark 2: the first selected group consists of the k variables most correlated with the response.
67
ADMM - applied to k-support problem
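Our problem

$$\min_{\omega, \beta} \; R_l(\omega) + \frac{\lambda}{2} \Omega_k^{sp}(\beta)^2 \quad \text{s.t.} \quad \omega - \beta = 0$$

Augmented Lagrangian

$$L_\rho(\omega, \beta, \mu) = R_l(\omega) + \frac{\lambda}{2} \Omega_k^{sp}(\beta)^2 + \mu^\top (\omega - \beta) + \frac{\rho}{2} \|\omega - \beta\|^2$$

Algorithm
1 Initialize: $\beta^{(1)}, \omega^{(1)}, \mu^{(1)}$
2 for $t = 1, 2, \dots$ do

$$\omega^{(t+1)} = \arg\min_{\omega} \; R_l(\omega) + \mu^{(t)\top} \omega + \frac{\rho}{2} \|\omega - \beta^{(t)}\|^2$$

$$\beta^{(t+1)} = \operatorname{prox}_{\frac{\lambda}{2\rho} \Omega_k^{sp}(\cdot)^2} \left( \omega^{(t+1)} + \frac{\mu^{(t)}}{\rho} \right)$$

$$\mu^{(t+1)} = \mu^{(t)} + \rho \left( \omega^{(t+1)} - \beta^{(t+1)} \right)$$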
68
ADMM - optimality conditions
Three first-order conditions:
Primal condition: $\omega^* - \beta^* = 0$
Dual condition 1: $\nabla R_l(\omega^*) + \mu^* = 0$
Dual condition 2: $0 \in \frac{\lambda}{2} \partial \Omega_k^{sp}(\beta^*)^2 - \mu^*$
Resulting in a definition for the residuals at step t + 1:
Primal residuals: $r^{(t+1)} = \omega^{(t+1)} - \beta^{(t+1)}$
Dual residuals: $s^{(t+1)} = \rho (\beta^{(t+1)} - \beta^{(t)})$
As the algorithm converges, the (norm of the) residuals tend to zero.
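The iterations and residuals can be illustrated on a toy instance where every step has a closed form. Two assumptions are made here that are not in the slides: the loss is taken to be quadratic, $R_l(\omega) = \frac{1}{2}\|\omega - a\|^2$, and k = d, a case in which the k-support norm coincides with the ℓ2 norm so that the proximal step is a simple rescaling. The general k-support proximal operator is not implemented in this sketch.

```python
import math


def admm_l2_sketch(a, lam, rho=1.0, max_iter=500, tol=1e-10):
    """ADMM for min 0.5*||w - a||^2 + (lam/2)*||beta||_2^2  s.t.  w - beta = 0.

    Assumptions (not from the slides): quadratic loss, and k = d so that the
    regularizer is the squared l2 norm and its prox has a closed form.
    """
    d = len(a)
    w, beta, mu = [0.0] * d, [0.0] * d, [0.0] * d
    for _ in range(max_iter):
        # w-update: minimize R_l(w) + mu'w + (rho/2)*||w - beta||^2.
        w = [(a[i] - mu[i] + rho * beta[i]) / (1.0 + rho) for i in range(d)]
        # beta-update: prox of (lam / (2*rho)) * ||.||_2^2 at w + mu/rho.
        beta_old = beta
        beta = [(w[i] + mu[i] / rho) / (1.0 + lam / rho) for i in range(d)]
        # Dual variable update.
        mu = [mu[i] + rho * (w[i] - beta[i]) for i in range(d)]
        # Norms of the primal and dual residuals.
        r = math.sqrt(sum((w[i] - beta[i]) ** 2 for i in range(d)))
        s = rho * math.sqrt(sum((beta[i] - beta_old[i]) ** 2 for i in range(d)))
        if r < tol and s < tol:
            break
    return w, beta, mu


a = [1.0, -2.0, 0.5]
w, beta, mu = admm_l2_sketch(a, lam=1.0)
```

For this toy problem the minimizer is known analytically, $\omega^* = a / (1 + \lambda)$, which makes it easy to check that both residuals vanish at the right point.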
69
ADMM - choice of parameter
Parameter ρ is critical in ADMM: it controls how much variables change.
It can be seen as a step size.
How to choose it?
Adaptive ADMM
One solution is to let it adapt to the problem:

$$\rho^{(t+1)} = \begin{cases} (1 + \tau)\rho^{(t)} & \text{if } \|r^{(t+1)}\|_2 > \eta \|s^{(t+1)}\|_2 \text{ and } t \le t_{max} \\ \rho^{(t)} / (1 + \tau) & \text{if } \eta \|r^{(t+1)}\|_2 < \|s^{(t+1)}\|_2 \text{ and } t \le t_{max} \\ \rho^{(t)} & \text{otherwise} \end{cases}$$

In practice, we use τ = 1, η = 10 and t_max = 100.
=> Adaptive ADMM forces the primal and dual residuals to be of a similar amplitude.
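The update rule fits in a few lines. The sketch below uses the quoted values of τ, η and t_max as defaults; the function name and argument names are hypothetical, and the residual norms are assumed to be computed by the surrounding ADMM loop.

```python
def update_rho(rho, r_norm, s_norm, t, tau=1.0, eta=10.0, t_max=100):
    """Adaptive penalty update: rebalance primal and dual residual norms."""
    if t <= t_max:
        if r_norm > eta * s_norm:
            return (1.0 + tau) * rho   # primal residual too large: increase rho
        if eta * r_norm < s_norm:
            return rho / (1.0 + tau)   # dual residual too large: decrease rho
    return rho                         # balanced, or adaptation switched off


rho_up = update_rho(1.0, r_norm=100.0, s_norm=1.0, t=5)
rho_down = update_rho(1.0, r_norm=1.0, s_norm=100.0, t=5)
rho_same = update_rho(1.0, r_norm=1.0, s_norm=1.0, t=5)
```

Freezing ρ after t_max iterations keeps the later iterations identical to standard fixed-ρ ADMM.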
70
Comparison
Figure: RDG (y-axis) vs. time in seconds (x-axis), four panels (k = 1, k = 5, k = 10, k = 100), comparing ADMM - adaptive, ADMM - rho=1, ADMM - rho=10, ADMM - rho=100 and FISTA.
71
Accuracy vs size of the signature
Figure: AUC (y-axis) vs. size of the signature (x-axis), four panels (Single-run, Ensemble-Mean, Ensemble-Exponential, Ensemble-Stability Selection), for Random, T-test, Entropy, Bhatt., Wilcoxon, SVM RFE, GFS, Lasso and E-Net.
72
Stability vs size of the signature
Figure: stability (y-axis) vs. size of the signature (x-axis), four panels (Single-run, Ensemble-average, Ensemble-exponential, Ensemble-stability selection), for Random, T-test, Entropy, Bhatt., Wilcoxon, RFE, GFS, Lasso and E-Net.
73