Feature selection from gene expression data
Molecular signatures for breast cancer prognosis and inference of gene regulatory networks.

Anne-Claire Haury

Centre for Computational Biology
Mines ParisTech, INSERM U900, Institut Curie

PhD Defense, December 14, 2012


Introduction


Gene expression

Figure: Central Dogma of Molecular Biology (Source: Wikipedia)

=> RNA as a proxy to measure gene expression.


Microarrays: measuring gene expression

Microarrays: one option.
RNA hybridized onto a chip.
Quantification of gene activity in different conditions at the genome scale.
Resulting data: gene expression data.

Figure: gene expression matrix (genes × samples), color scale from low to high expression.

Feature selection: extract relevant information

Genome ≈ 25,000 genes
From gene expression, explain a phenomenon
Feature selection: deciding which variables/genes are relevant.

Mathematically: explain response Y using relevant variables amongst (X_j)_{j=1...p}

Y = f(X_1, X_2, X_3, X_4, ..., X_p) + ε

Objective 1: more accurate predictors
Objective 2: more interpretable predictors
Objective 3: faster algorithms

Contributions of this thesis

Gene Regulatory Network Inference
  TIGRESS: new method based on local feature selection.
  Ranked 3rd/29 at DREAM5 challenge.
  Linear method, competitive with more complex algorithms.

Molecular signatures for breast cancer prognosis
  Select biomarkers to predict metastasis/relapse in breast cancer patients.
  Complete benchmark of feature selection methods.
  Investigation of the stability issue.


Gene Regulatory Network Inference with TIGRESS

ACH, Mordelet, F., Vera-Licona, P. and Vert, J.-P., TIGRESS: Trustful Inference of Gene REgulation using Stability Selection, BMC Systems Biology, 2012, 6:145.

Gene Regulatory Networks

Gene regulation: control gene expression.
Transcription factors (TF) activate or repress target genes (TG).
Gene Regulatory Network (GRN): representation of the regulatory interactions between genes.

Figure: E. coli regulatory network

DREAM network inference challenge

Network inference challenge:

DREAM5 results:

Method           Network 1        Network 3        Network 4        Overall
                 AUPR    AUROC    AUPR    AUROC    AUPR    AUROC
GENIE3 [1]       0.291   0.815    0.093   0.617    0.021   0.518    40.28
ANOVerence [2]   0.245   0.780    0.119   0.671    0.022   0.519    34.02
Naive TIGRESS    0.301   0.782    0.069   0.595    0.020   0.517    31.1

[1] Huynh-Thu et al., 2010
[2] Kueffner et al., 2012

Purposes

Introduce TIGRESS: Trustful Inference of Gene REgulation using Stability Selection.
Assess the impact of the parameters.
Test and benchmark TIGRESS on several datasets.

Outline

1 Methods
    Regression-based inference
    TIGRESS
    Material

2 Results
    In silico network results
    In vitro networks results
    Undirected case: DREAM4

3 Conclusion

Regression-based inference: hypotheses

Notations
  n_tf transcription factors (TF), n_tg target genes (TG), n_exp experiments
  Expression data: X (n_exp × (n_tf + n_tg)).
  X_g: expression levels of gene g.
  X_G: expression levels of genes in G.
  T_g: candidate TFs for gene g.

Hypotheses
  1 The expression level X_g of a TG g is a function of the expression levels X_{T_g} of T_g:
        X_g = f_g(X_{T_g}) + ε.
  2 A score s_g(t) can be derived from f_g, for all t ∈ T_g, to assess the probability of the interaction (t, g).

Regression-based inference: main steps

Idea: consider as many problems as TGs (n_tg subproblems)
subproblem g ⇔ find regulators TFs(g) of gene g

1 For each TG, score all n_tf candidate interactions:

             TG 1    TG 2    TG 3    ...    TG n_tg
   TF 1      -       0.23    0       ...    0.11
   TF 2      0.97    -       0.03    ...    0
   ...       ...     ...     ...     ...    ...
   TF n_tf   0       0       0       ...    0.76

2 Rank the scores altogether:

   TF 12 → TG 17    1
   TF 23 → TG 5     0.99
   TF 2 → TG 1      0.97
   ...

3 Threshold to a value or a given number N of edges.
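The three steps above (score, rank, threshold) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name rank_edges, the gene names and the use of None for the "-" entries (taken here to mean a TF paired with itself) are hypothetical, and the scores are the toy values from the table, not real results.

```python
def rank_edges(scores, tf_names, tg_names, n_edges):
    """Flatten a TF x TG score matrix into a ranked edge list, keep the top n_edges.

    scores[i][j] is the score of the candidate edge tf_names[i] -> tg_names[j];
    None marks an excluded pair (assumed here: a TF regulating itself).
    """
    edges = [
        (tf, tg, scores[i][j])
        for i, tf in enumerate(tf_names)
        for j, tg in enumerate(tg_names)
        if scores[i][j] is not None
    ]
    # Rank all candidate edges together, highest score first.
    edges.sort(key=lambda edge: edge[2], reverse=True)
    # Threshold to a given number N of edges.
    return edges[:n_edges]


# Hypothetical toy input: the first two targets are the TFs themselves.
tf_names = ["TF1", "TF2", "TF3"]
tg_names = ["TF1", "TF2", "G3", "G4"]
scores = [
    [None, 0.23, 0.0, 0.11],
    [0.97, None, 0.03, 0.0],
    [0.0, 0.0, 0.0, 0.76],
]

top = rank_edges(scores, tf_names, tg_names, n_edges=3)
for tf, tg, score in top:
    print(f"{tf} -> {tg}  {score:.2f}")
```

Thresholding to a score value instead of a number of edges would only change the last line of rank_edges to a filter on edge[2].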

Adding a linearity assumption

TIGRESS’ first hypothesis: regulations are linear

X_g = f_g(X_{T_g}) + ε = X_{T_g} ω_g + ε

Consequence: if ω_gt = 0, no edge between g and t.

Adding a sparsity assumption

TIGRESS’ second hypothesis: few TFs regulate each TG

∀g, #{t : ω_gt ≠ 0} ≪ n_tf

Possible algorithm: LARS with L steps => L TFs selected.

Stability Selection

Problem: LARS efficiency is limited:
  bad response to correlation;
  no confidence score for each TF.

Solution: Stability Selection with randomized LARS (Bach, 2008; Meinshausen and Bühlmann, 2009):
  Resample the experiments: run LARS many (e.g. 1,000) times with different training sets.
  “Resample” the variables: also weight the variables

      X_it ← W_t X_it    (1)

  where W_t ~ U([α, 1]) for all t = 1...n_tf. The smaller α, the more randomized the variables.
  Get a frequency of selection for each TF.

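The resampling scheme above can be sketched as follows. This is a minimal illustration, not the TIGRESS implementation: it assumes NumPy is available, it swaps LARS for a simpler greedy forward selection (pick the column most correlated with the residual, then refit by least squares), it draws half of the experiments without replacement at each run, and all function names and the synthetic data are hypothetical.

```python
import numpy as np


def greedy_selection_order(X, y, n_steps):
    """Stand-in for LARS: greedily pick n_steps columns of X, in order of entry."""
    selected = []
    residual = y.copy()
    for _ in range(n_steps):
        corr = np.abs(X.T @ residual)
        if selected:
            corr[selected] = -np.inf  # a column enters at most once
        selected.append(int(np.argmax(corr)))
        # Refit on the selected columns and update the residual.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return selected


def selection_frequencies(X, y, n_steps=2, n_runs=200, alpha=0.4, seed=0):
    """Frequency with which each column is among the first n_steps selected."""
    rng = np.random.default_rng(seed)
    n_exp, n_tf = X.shape
    counts = np.zeros(n_tf)
    for _ in range(n_runs):
        # Resample the experiments (assumption: half of them, without replacement).
        rows = rng.choice(n_exp, size=n_exp // 2, replace=False)
        # "Resample" the variables: X_it <- W_t X_it with W_t ~ U([alpha, 1]).
        weights = rng.uniform(alpha, 1.0, size=n_tf)
        order = greedy_selection_order(X[rows] * weights, y[rows], n_steps)
        counts[order] += 1
    return counts / n_runs


# Synthetic example: only columns 0 and 1 influence the response.
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * data_rng.standard_normal(100)

freq = selection_frequencies(X, y)
print(np.round(freq, 2))
```

Each run selects exactly n_steps columns, so the frequencies sum to n_steps; the two informative columns should come out with the highest frequencies.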

Stability Selection path

Figure: frequency of selection of each TF over the subsamplings, as a function of the number L of LARS steps (example for one target gene).

Scoring

Figure: frequency of selection of each TF over the subsamplings, as a function of the number L of LARS steps.

Choose L, then:
  Original scoring
  Area scoring (contribution)

Let H_t be the rank of TF t. Then,

score = E[φ(H_t)]

with
  Original: φ(h) = 1 if h ≤ L, 0 otherwise
  Area: φ(h) = L + 1 − h if h ≤ L, 0 otherwise

=> Area scoring takes the value of the rank into account.
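A small worked example may help compare the two scoring rules. The sketch below estimates E[φ(H_t)] by averaging φ over the runs; the rank lists are made up for illustration, and a TF that is not selected in a run is given an infinite rank (an assumption of this sketch).

```python
import math


def original_score(ranks, L):
    """Average of phi(h) = 1 if h <= L, 0 otherwise."""
    return sum(1 for h in ranks if h <= L) / len(ranks)


def area_score(ranks, L):
    """Average of phi(h) = L + 1 - h if h <= L, 0 otherwise."""
    return sum(L + 1 - h for h in ranks if h <= L) / len(ranks)


L = 3
# Hypothetical ranks of three TFs over four runs (math.inf: never selected).
ranks_a = [1, 1, 2, 1]                # always selected, usually first
ranks_b = [3, 3, 2, 3]                # always selected, usually last
ranks_c = [1, math.inf, math.inf, 2]  # selected in half of the runs

for name, ranks in [("a", ranks_a), ("b", ranks_b), ("c", ranks_c)]:
    print(name, original_score(ranks, L), area_score(ranks, L))
```

With these values the original score cannot separate TF a from TF b (both get 1.0), while the area score gives 2.75 and 1.25 because it rewards early entry.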

TIGRESS Summary


For each TG, score all n_tf candidate interactions:

LARS + Stab. Selection + Choose L + Score  =>

             TG 1    TG 2    TG 3    ...    TG n_tg
   TF 1      -       0.23    0       ...    0.11
   TF 2      0.97    -       0.03    ...    0
   ...       ...     ...     ...     ...    ...
   TF n_tf   0       0       0       ...    0.76

Parameters

TIGRESS needs four parameters to be set:

  scoring method (original, area, ...);
  number of runs R: large;
  randomization level α: between 0 and 1;
  number of LARS steps L: not obvious.

Data

Network                               # TF   # Genes   # Chips   # Edges
DREAM5 Net 1 (in-silico)              195    1643      805       4012
DREAM5 Net 3 (E. coli)                334    4511      805       2066
DREAM5 Net 4 (S. cerevisiae)          333    5950      536       3940
E. coli Net from Faith et al., 2007   180    1525      907       3812
DREAM4 Multifactorial Net 1           100    100       100       176
DREAM4 Multifactorial Net 2           100    100       100       249
DREAM4 Multifactorial Net 3           100    100       100       195
DREAM4 Multifactorial Net 4           100    100       100       211
DREAM4 Multifactorial Net 5           100    100       100       193


Impact of the parameters: results on in silico network

Figure: overall score (color scale from 40 to 100) as a function of α and L, for Area and Original scoring with 1,000, 4,000 and 10,000 runs.

Area less sensitive than original to α and L.
Area systematically outperforms original.
The more runs, the better.
Best values: α = 0.4, L = 2, R = 10,000.

How to choose L?

Figure: histograms for L = 2 and L = 20 (panel title: First 5,000 edges).

L=2: number of TFs/TG smaller and more variable.
L=20: greater number of TFs/TG, less sparsity.

=> L should depend on the expected network’s topology

TIGRESS vs state-of-the-art

Figure: ROC curves (TPR vs. FPR) and precision-recall curves for GENIE3, ANOVerence, CLR, ARACNE, Naive TIGRESS and TIGRESS.

TIGRESS is competitive with state-of-the-art.

Results on E. coli network

Figure: ROC curves (TPR vs. FPR) and precision-recall curves for GENIE3, CLR, ARACNE and TIGRESS.

GENIE3, the Random Forests-based DREAM winner, outperforms all methods.

False discovery analysis on E. coli

Main false positive pattern found by TIGRESS:

Figure: pattern TIGRESS should find vs. pattern it does find.

Good news: spuriously inferred edges close to true edges
Bad news: confusion (due to linear model?)

Undirected case: DREAM4 challenge

DREAM4: undirected networks (TFs not known in advance)
A posteriori comparison of Default TIGRESS and GENIE3:

Method    Network 1       Network 2       Network 3       Network 4       Network 5
          AUPR   AUROC    AUPR   AUROC    AUPR   AUROC    AUPR   AUROC    AUPR   AUROC
GENIE3    0.154  0.745    0.155  0.733    0.231  0.775    0.208  0.791    0.197  0.798
TIGRESS   0.165  0.769    0.161  0.717    0.233  0.781    0.228  0.791    0.234  0.764

Overall scores:
  GENIE3: 37.48
  TIGRESS: 38.85


Conclusion

Contributions:
  Automation and adaptation of Stability Selection to GRN inference.
  Area scoring setting: better results and less sensitivity to parameters.
  Insights on network’s behavior.

TIGRESS is:
  Linear
  Competitive
  Parallelizable
  Available (http://cbio.ensmp.fr/tigress)
  But outperformed in some cases by random forests: limits of linearity?

Perspectives:
  Adaptive/changeable value for L.
  Group selection of TFs.
  Use of time series/knock-out/replicates information.

Molecular signatures for breast cancer prognosis

ACH, Gestraud, P. and Vert, J.-P., The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures, 2011, PLoS ONE

ACH, Jacob, L. and Vert, J.-P., Improving Stability and Interpretability of Molecular Signatures, 2010, arXiv 1001.3109

Motivation

Prediction of breast cancer outcome
  Assist breast cancer prognosis based on gene expression.
  Avoid adjuvant/preventive chemotherapy when not needed.

Gene expression signature
  Data: primary site tumor expression arrays.
  Among the genome, find the few (50-100) genes sufficient to predict metastasis/relapse.
  Main challenge: high-dimensional data (few samples, many variables).

Background

2002: Van’t Veer et al. publish 70-gene signature MammaPrint.
Since then: at least 47 published signatures (Venet et al., 2011).
Little overlap, if any (Fan et al., 2006; Thomassen et al., 2007).
Many gene sets are equally predictive (Michiels et al., 2005; Ein-Dor et al., 2005), even within the same dataset.
Prediction discordances (Reyal et al., 2008).
Stability at the functional level? Not sure.

Seeking stability

Accuracy is not enough to choose the right genes.
=> Stability as a confidence indicator.


Contributions

1 Systematic comparison of feature selection methods in terms of:
    predictive performance;
    stability;
    biological stability and interpretability.

2 Group selection of genes:
    with predefined groups (Graph Lasso);
    with latent groups (k-overlap norm).

3 Evaluation of Ensemble methods

Evaluation

Accuracy: how well selected genes + classifier predict metastatic events on test data.
Stability: how similar two lists of genes are.
Interpretability: how much biological sense selected genes make.
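Stability is described above as the similarity between two lists of genes. The slides do not state which similarity measure is used, so the sketch below assumes one simple choice, the fraction of genes shared by two signatures of the same size, averaged over all pairs of signatures. The function names and the gene identifiers are hypothetical.

```python
from itertools import combinations


def overlap_fraction(sig_a, sig_b):
    """Fraction of genes shared by two signatures of the same size."""
    a, b = set(sig_a), set(sig_b)
    if len(a) != len(b):
        raise ValueError("signatures must have the same size")
    return len(a & b) / len(a)


def mean_pairwise_stability(signatures):
    """Average overlap over all pairs of signatures."""
    pairs = list(combinations(signatures, 2))
    return sum(overlap_fraction(a, b) for a, b in pairs) / len(pairs)


# Hypothetical signatures obtained from three different datasets.
signatures = [
    ["g1", "g2", "g3", "g4"],
    ["g1", "g2", "g5", "g6"],
    ["g1", "g3", "g5", "g7"],
]
stability = mean_pairwise_stability(signatures)
print(stability)
```

Each pair of signatures above shares two genes out of four, so the average overlap is 0.5.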

Data

Four public breast cancer datasets from the same technology (Affymetrix U133A):

GEO Reference   # genes   # samples   # positives
GSE1456         12,065    159         40
GSE2034         12,065    286         107
GSE2990         12,065    125         49
GSE4922         12,065    249         89

Outline

1 A simple start

2 An attempt at enforcing stability: Ensemble methods

3 Using prior knowledge: Graph Lasso

4 Acknowledging latent team work

5 Conclusion

Classical feature selection/feature ranking methods

Type       Characteristics                    Examples
Filters    Univariate, fast                   T-test, KL Divergence
           Only depend on the data            Wilcoxon rank-sum test
           Do not use loss function
Wrappers   Learning machine as a criterion    SVM RFE, Greedy FS
           Computationally expensive
Embedded   Learning + selecting               Lasso, Elastic Net
           Possible use of prior knowledge

Each of these algorithms returns a ranked list of genes to be thresholded.
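As an illustration of the filter family, the sketch below ranks genes by the absolute value of a two-sample t-statistic and returns the ranked list, which can then be thresholded. It assumes the Welch (unequal variance) form of the statistic, which the slides do not specify; the function names and the expression values are invented for the example.

```python
import math
from statistics import mean, variance


def welch_t(x_pos, x_neg):
    """Two-sample t-statistic with unequal variances (Welch form, an assumption)."""
    se = math.sqrt(variance(x_pos) / len(x_pos) + variance(x_neg) / len(x_neg))
    return (mean(x_pos) - mean(x_neg)) / se


def rank_genes_by_ttest(expr, labels):
    """expr maps gene -> expression values (one per sample); labels are 0/1.

    Returns the genes sorted by decreasing |t|.
    """
    scores = {}
    for gene, values in expr.items():
        pos = [v for v, lab in zip(values, labels) if lab == 1]
        neg = [v for v, lab in zip(values, labels) if lab == 0]
        scores[gene] = abs(welch_t(pos, neg))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: 3 positive and 3 negative samples, 3 genes.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "geneA": [5.1, 4.9, 5.3, 2.0, 2.2, 1.9],  # clearly differential
    "geneB": [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],  # same mean in both classes
    "geneC": [4.0, 3.0, 5.0, 3.5, 2.5, 4.5],  # small, noisy difference
}
ranking = rank_genes_by_ttest(expr, labels)
print(ranking)
```

Keeping the first k entries of the ranking gives a signature of k genes; each gene is scored on its own, which is what makes filters univariate and fast.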

First results

100 genes over four databases.

Figure: stability vs. accuracy for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and E-Net.

First conclusions

Figure: stability vs. accuracy for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and E-Net.

Random better than tossing a coin.
Elastic Net neither more stable nor more accurate than Lasso.
Accuracy/Stability trade-off.
T-test: both simplest and best.

Next step:

Can we have a better stability without decreasing accuracy?



Ensemble Methods

Run each algorithm R times on subsamples.
Get R ranked lists of genes (r_b)_{b=1...R}.
Aggregate and get a score for each gene:

S(g) = (1/R) Σ_{b=1}^{R} f(r_b^g).

  average: f(r) = (p − r)/p
  exponential: f(r) = exp(−αr)
  stability selection: f(r) = δ(r ≤ k)

Sort S by decreasing order and threshold to get final signature.
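The aggregation formula S(g) can be written out directly. The sketch below is an illustration with hypothetical names and made-up ranks: each run is represented as a dictionary mapping a gene to its rank (1 is best), and the three choices of f are provided as small factory functions.

```python
import math


def aggregate(rank_lists, f):
    """S(g) = (1/R) * sum over runs b of f(rank of g in run b)."""
    n_runs = len(rank_lists)
    genes = rank_lists[0].keys()
    return {g: sum(f(ranks[g]) for ranks in rank_lists) / n_runs for g in genes}


def average_f(p):
    return lambda r: (p - r) / p


def exponential_f(alpha):
    return lambda r: math.exp(-alpha * r)


def stability_selection_f(k):
    return lambda r: 1.0 if r <= k else 0.0


# Hypothetical ranks of p = 4 genes over R = 3 runs.
rank_lists = [
    {"g1": 1, "g2": 2, "g3": 3, "g4": 4},
    {"g1": 2, "g2": 1, "g3": 4, "g4": 3},
    {"g1": 1, "g2": 3, "g3": 2, "g4": 4},
]

scores = aggregate(rank_lists, stability_selection_f(k=2))
signature = sorted(scores, key=scores.get, reverse=True)[:2]
print(scores)
print(signature)
```

With k = 2, g1 is in the top 2 in all three runs and g2 in two of them, so the final two-gene signature is g1 and g2. Replacing stability_selection_f(k=2) by average_f(p=4) or exponential_f(alpha) changes only the weighting of the ranks.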

Results


Figure: stability vs. accuracy for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and E-Net; single-run vs. stability selection.

Functional stability


Figure: functional stability vs. accuracy for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and E-Net; single-run vs. stability selection.

Ensemble methods: conclusions

Figure: stability vs. accuracy and functional stability vs. accuracy for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and E-Net; single-run vs. stability selection.

Expected improvement in stability not happening.
Slight improvement in accuracy in some cases.
Loss in functional stability.
T-test: still the preferred method.

Next step:

Can we do better by incorporating prior knowledge?



Data

Expression data: Van’t Veer et al., 2002; Wang et al., 2005.
PPI network with 8141 genes (Chuang et al., 2007)

Assumption: genes close on the graph behave similarly
Idea: instead of selecting single genes, select edges

Selecting groups of genes: ℓ1 methods

Lasso: selects single genes (Tibshirani, 1996)

Why groups?

  selecting similar genes: improving stability and interpretability
  smoothing out noise by "averaging": improving accuracy

Group Lasso (Yuan & Lin, 2006): implies group sparsity for groups of covariates that form a partition of 1...p
Overlapping group Lasso (Jacob et al., 2009): selects a union of potentially overlapping groups of covariates (e.g. gene pathways).
Graph Lasso: uses groups induced by the graph (e.g. edges)
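To make the difference between single-gene and group sparsity concrete, the sketch below compares the Lasso penalty with a group Lasso penalty on a partition of the coefficients. It is a simplified illustration: the penalty is taken as the plain sum of Euclidean norms of the groups, without the group-size weights that some formulations include, and it does not cover the overlapping case. Function names and coefficient values are hypothetical.

```python
import math


def lasso_penalty(w):
    """l1 norm: sum of absolute values of the coefficients."""
    return sum(abs(x) for x in w)


def group_lasso_penalty(w, groups):
    """Sum of Euclidean norms of the coefficient blocks.

    groups is a list of index lists that must form a partition of 0..p-1.
    """
    covered = sorted(j for g in groups for j in g)
    if covered != list(range(len(w))):
        raise ValueError("groups must form a partition of the coefficients")
    return sum(math.sqrt(sum(w[j] ** 2 for j in g)) for g in groups)


groups = [[0, 1], [2, 3]]
w_one_group = [3.0, 4.0, 0.0, 0.0]   # both non-zero coefficients in group 0
w_two_groups = [3.0, 0.0, 4.0, 0.0]  # non-zero coefficients spread over both groups

print(lasso_penalty(w_one_group), group_lasso_penalty(w_one_group, groups))
print(lasso_penalty(w_two_groups), group_lasso_penalty(w_two_groups, groups))
```

Both vectors have the same Lasso penalty (7.0), but the group penalty is 5.0 when the non-zero coefficients sit in one group and 7.0 when they are spread over two, which is why the group penalty favors selecting whole groups.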

Is group sparsity enough?

ℓ1 methods work well, but face serious stability issues when groups are correlated.
Solution: randomization through stability selection.

Accuracy results

Test on data from Wang et al., 2005.

Figure: balanced accuracy of Lasso, Lasso + stab. sel., Graph Lasso and Graph Lasso + stab. sel., for the signature learnt on the Van’t Veer dataset and the signature learnt on the Wang dataset.

Neither prior knowledge nor stability selection brings any improvement!

50

Stability results

Figure: number of genes in the overlap vs. number of genes in the signatures, for Lasso, Lasso with stability selection, Graph Lasso, and Graph Lasso with stability selection.

Graph Lasso slightly improves stability.

51

Interpretability

Signature obtained using Lasso:

52

Interpretability

Signature obtained using Graph Lasso + Stability Selection:

52

Graph Lasso conclusion

Graphical prior seems to increase stability and interpretability.
However: no change in accuracy.

Next step:

Grouping increases stability. Now on to accuracy!

53

Latent grouping

Grouping genes makes sense.
Let the data tell which genes to select together.

The k-support norm

Introduced by Argyriou et al., 2012
A trade-off between ℓ1 and ℓ2.
Equivalent to overlapping group Lasso (Jacob et al., 2009) with all possible groups of size k.
Results in selecting groups that are not predefined.

55

Extreme randomization

Following Breiman’s random forests
Sample both the examples and the covariates.
Fewer variables = less correlation.
Give each gene a chance to be selected.

Extreme randomization

For each of the R runs:
Bootstrap samples (classical ensemble method)
Sample the covariates: randomly choose 10% of them.
Run FS procedure on the restricted data.

=> Compute frequency of selection: P(g selected | g preselected)
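The procedure above can be sketched as follows. This is an illustration under stated assumptions: the feature selection step is a Welch t-score ranking chosen here as a stand-in for "the FS procedure", and the names `t_scores`, `extreme_randomization`, `n_select` and `frac`, as well as the synthetic data, are hypothetical.

```python
import numpy as np

def t_scores(X, y):
    """Absolute Welch t-statistic of each column between classes 1 and 0."""
    a, b = X[y == 1], X[y == 0]
    num = a.mean(axis=0) - b.mean(axis=0)
    den = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(num / den)

def extreme_randomization(X, y, n_select=2, n_runs=200, frac=0.1, seed=0):
    """Estimate P(g selected | g preselected) for every gene g."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = max(n_select, int(round(frac * p)))     # number of preselected covariates
    preselected = np.zeros(p)
    selected = np.zeros(p)
    for _ in range(n_runs):
        rows = rng.choice(n, size=n, replace=True)    # bootstrap the examples
        cols = rng.choice(p, size=m, replace=False)   # sample the covariates
        scores = t_scores(X[rows][:, cols], y[rows])  # FS on the restricted data
        top = cols[np.argsort(scores)[::-1][:n_select]]
        preselected[cols] += 1
        selected[top] += 1
    # Conditional frequency; genes that were never preselected get 0.
    return np.divide(selected, preselected, out=np.zeros(p), where=preselected > 0)

# Synthetic example: gene 0 is strongly differentially expressed.
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 50))
y = np.repeat([0, 1], 30)
X[y == 1, 0] += 3.0
freq = extreme_randomization(X, y)
```

Dividing by the number of times a gene was preselected, rather than by the number of runs, is what makes the score a conditional frequency, so that genes are not penalised for being left out of the random 10%.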

56

Accuracy or stability?

Figure: stability vs. accuracy for Random, T-test, Lasso, ENet and kSupport (k = 2, 10, 20). Markers: single-run, extreme randomization + stability selection. The T-test point is labelled.

57

Stability or redundancy?

Figure: stability vs. correlation for Random, T-test, Lasso, ENet and kSupport (k = 2, 10, 20). Markers: single-run, extreme randomization + stability selection.

58

Conclusions

Figures: stability vs. accuracy and stability vs. correlation (same plots as on the two previous slides).

Extreme Randomization improves accuracy.
Grouping improves stability.
But: both effects do not add up that well (redundancy)
T-test still the best trade-off?

59

Signature selection: conclusion

Contributions:

Step-by-step study of FS methods behavior on several breast cancer datasets.
Systematic analysis of accuracy, stability, interpretability.
Insights on the accuracy/stability trade-off.

What have we learned?

Best methods: simple t-test or complex black box
Grouping improves gene and functional stability.
Randomization improves accuracy (sometimes) but has unwanted effects on stability.
Accuracy/stability trade-off: stability => redundancy => lower accuracy.

61

Signature selection: perspectives

One unique signature?
single breast cancer subtype
many samples
larger signature

Is expression data sufficient?
probably not all information is there
clinical data: same accuracy (same information?)
possibly look at genotype, methylation, clinical and expression

Is stability important?
not as important as accuracy + prediction concordance
possibly not even achievable

62

Conclusion

63

Conclusion

Gene expression data:
High-dimensional, noisy
Possibly contains important information

Feature selection:
Find the needle in the haystack.
Output relevant genes to be studied further.

Main issues:
Results are not necessarily transferable across datasets.
Models rely on hypotheses!

Fixing:
Testing on many databases.
Keeping model hypotheses in mind / not being afraid of black boxes.

64

Acknowledgements

Fantine Mordelet, Paola Vera-Licona, Pierre Gestraud, Laurent Jacob

65

The k-support norm

It can be shown that:

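$$\Omega_k^{sp}(\omega) = \left( \sum_{i=1}^{k-r-1} \left(|\omega|_i^{\downarrow}\right)^2 + \frac{1}{r+1} \left( \sum_{i=k-r}^{d} |\omega|_i^{\downarrow} \right)^2 \right)^{1/2}$$

where r is the only integer in {0, ..., k − 1} satisfying

$$|\omega|_{k-r-1}^{\downarrow} > \frac{1}{r+1} \sum_{i=k-r}^{d} |\omega|_i^{\downarrow} \ge |\omega|_{k-r}^{\downarrow}$$

and $|\omega|_i^{\downarrow}$ is the i-th largest value of |ω| (with $|\omega|_0^{\downarrow} = +\infty$).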
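As a sanity check on the closed form, here is a small sketch that evaluates it by searching for the integer r directly. The function name `k_support_norm` is illustrative; the search is a plain loop over r = 0, ..., k − 1 and makes no attempt at efficiency or at handling floating-point ties robustly.

```python
import numpy as np

def k_support_norm(w, k):
    """Evaluate the k-support norm of w using the closed-form expression."""
    a = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]   # |w| in decreasing order
    d = a.size
    if not 1 <= k <= d:
        raise ValueError("k must satisfy 1 <= k <= len(w)")
    for r in range(k):
        tail = a[k - r - 1:].sum()          # sum_{i=k-r}^{d} |w|_i (1-based indices)
        avg = tail / (r + 1)
        # |w|_{k-r-1}, with the convention |w|_0 = +infinity
        prev = np.inf if k - r - 1 == 0 else a[k - r - 2]
        if prev > avg >= a[k - r - 1]:
            head = np.sum(a[:k - r - 1] ** 2)   # sum_{i=1}^{k-r-1} |w|_i^2
            return float(np.sqrt(head + tail ** 2 / (r + 1)))
    raise RuntimeError("no admissible r found (possible floating-point tie)")

print(k_support_norm([3.0, -4.0], k=1))   # k = 1: the l1 norm
print(k_support_norm([3.0, -4.0], k=2))   # k = d: the l2 norm
```

The two calls illustrate the trade-off between ℓ1 and ℓ2: for k = 1 the formula reduces to the ℓ1 norm, and for k = d to the ℓ2 norm.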

66

Link with overlapping Group Lasso

The k-support norm is equivalent to the overlapping group Lasso norm

$$\Omega_k^{sp}(\omega) = \min_{v \in \mathbb{R}^{p \times G_k}} \left\{ \sum_{I \in G_k} \|v_I\|_2 \;:\; \mathrm{supp}(v_I) \subseteq I,\ \sum_{I \in G_k} v_I = \omega \right\}$$

where $G_k$ denotes all subsets of {1, ..., d} of cardinality k.
Remark 1: it selects at least k variables.
Remark 2: the first selected group consists of the k variables most correlated with the response

67

ADMM - applied to k-support problem

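Our problem

$$\min_{\omega, \beta} \; R_l(\omega) + \frac{\lambda}{2} \Omega_k^{sp}(\beta)^2 \quad \text{s.t.} \quad \omega - \beta = 0$$

Augmented Lagrangian

$$L_\rho(\omega, \beta, \mu) = R_l(\omega) + \frac{\lambda}{2} \Omega_k^{sp}(\beta)^2 + \mu^\top (\omega - \beta) + \frac{\rho}{2} \|\omega - \beta\|^2$$

Algorithm
1 Initialize: $\beta^{(1)}, \omega^{(1)}, \mu^{(1)}$
2 for $t = 1, 2, \dots$ do

$$\omega^{(t+1)} = \arg\min_{\omega} \; R_l(\omega) + \mu^{(t)\top} \omega + \frac{\rho}{2} \|\omega - \beta^{(t)}\|^2$$

$$\beta^{(t+1)} = \mathrm{prox}_{\frac{\lambda}{2\rho} \Omega_k^{sp}(\cdot)^2} \left( \omega^{(t+1)} + \frac{\mu^{(t)}}{\rho} \right)$$

$$\mu^{(t+1)} = \mu^{(t)} + \rho \left( \omega^{(t+1)} - \beta^{(t+1)} \right)$$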
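The three updates can be sketched in a few lines. To keep the example short, two substitutions are made and should be read as assumptions: the loss R_l is taken to be the squared loss (1/2n)||y − Xω||², and the proximal operator of the squared k-support norm is replaced by plain soft-thresholding, i.e. the penalty is λ||β||₁ instead of (λ/2)Ω(β)². The structure of the loop is otherwise the one given above; the name `admm_l1` is illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (stand-in for the k-support prox)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def admm_l1(X, y, lam, rho=1.0, n_iter=200):
    """ADMM for min (1/2n)||y - X w||^2 + lam * ||b||_1  s.t.  w - b = 0."""
    n, p = X.shape
    A = X.T @ X / n + rho * np.eye(p)    # matrix of the omega-subproblem
    b = X.T @ y / n
    omega, beta, mu = np.zeros(p), np.zeros(p), np.zeros(p)
    for _ in range(n_iter):
        # omega-update: closed-form minimiser of the quadratic subproblem
        omega = np.linalg.solve(A, b - mu + rho * beta)
        # beta-update: proximal step at omega + mu / rho
        beta_old = beta
        beta = soft_threshold(omega + mu / rho, lam / rho)
        # dual update
        mu = mu + rho * (omega - beta)
        r = np.linalg.norm(omega - beta)           # primal residual norm
        s = rho * np.linalg.norm(beta - beta_old)  # dual residual norm
    return beta, r, s

# Orthogonal design: the solution is the soft-thresholding of X'y/n.
X = np.sqrt(3.0) * np.eye(3)
y = np.sqrt(3.0) * np.array([3.0, -0.5, 2.0])
beta, r, s = admm_l1(X, y, lam=1.0)
```

Swapping `soft_threshold` for a routine computing the prox of the squared k-support norm would recover the algorithm on the slide; that routine is not reproduced here.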

68

ADMM - optimality conditions

Three first order conditions:
Primal condition: $\omega^* - \beta^* = 0$
Dual condition 1: $\nabla R_l(\omega^*) + \mu^* = 0$
Dual condition 2: $0 \in \frac{\lambda}{2} \partial \Omega_k^{sp}(\beta^*)^2 - \mu^*$

Resulting in a definition for the residuals at step t + 1:
Primal residuals: $r^{(t+1)} = \omega^{(t+1)} - \beta^{(t+1)}$
Dual residuals: $s^{(t+1)} = \rho (\beta^{(t+1)} - \beta^{(t)})$

As the algorithm converges, the norms of the residuals tend to zero.

69

ADMM - choice of parameter

Parameter ρ is critical in ADMM: it controls how much variables change.
It can be seen as a step size.

How to choose it?

Adaptive ADMM
One solution is to let it adapt to the problem:

$$\rho^{(t+1)} = \begin{cases} (1 + \tau)\rho^{(t)} & \text{if } \|r^{(t+1)}\|_2 > \eta \|s^{(t+1)}\|_2 \text{ and } t \le t_{max} \\ \rho^{(t)} / (1 + \tau) & \text{if } \eta \|r^{(t+1)}\|_2 < \|s^{(t+1)}\|_2 \text{ and } t \le t_{max} \\ \rho^{(t)} & \text{otherwise} \end{cases}$$

In practice, we use τ = 1, η = 10 and tmax = 100.
=> Adaptive ADMM forces the primal and dual residuals to be of a similar amplitude.
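The update rule translates directly into a small helper. The function name `update_rho` and its argument names are illustrative; the default values are the ones quoted above (τ = 1, η = 10, tmax = 100).

```python
def update_rho(rho, r_norm, s_norm, t, tau=1.0, eta=10.0, t_max=100):
    """Adaptive penalty update based on primal (r) and dual (s) residual norms."""
    if t <= t_max:
        if r_norm > eta * s_norm:
            return (1.0 + tau) * rho      # primal residual too large: increase rho
        if eta * r_norm < s_norm:
            return rho / (1.0 + tau)      # dual residual too large: decrease rho
    return rho                            # balanced residuals, or adaptation switched off

rho = update_rho(1.0, r_norm=100.0, s_norm=1.0, t=1)   # doubled with tau = 1
```

Freezing ρ after tmax iterations keeps the penalty fixed for the remainder of the run.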

70

Comparison

Figure: RDG vs. time (seconds), four panels: k = 1, k = 5, k = 10, k = 100. Methods: ADMM (adaptive), ADMM (ρ = 1), ADMM (ρ = 10), ADMM (ρ = 100), FISTA.

71

Accuracy vs size of the signature

Figure: AUC vs. size of the signature, four panels: Single-run, Ensemble-Mean, Ensemble-Exponential, Ensemble-Stability Selection. Methods: Random, T-test, Entropy, Bhatt., Wilcoxon, SVM RFE, GFS, Lasso, E-Net.

72

Stability vs size of the signature

Figure: stability vs. size of the signature, four panels: Single-run, Ensemble-average, Ensemble-exponential, Ensemble-stability selection. Methods: Random, T-test, Entropy, Bhatt., Wilcoxon, RFE, GFS, Lasso, E-Net.

73
