A Hybrid Evolutionary Feature Selection Method for Microarray Data
Denson Smith, Sumaiya Iqbal, Md Tamjidul Hoque
{dsmith8,siqbal1,thoque}@uno.edu
University of New Orleans
Abstract
DNA microarray data allows the analysis of the expression levels of thousands of genes simultaneously. This process can capture the current state of gene regulation within a cell by capturing mRNA expression, instead of the tedious quantitative and qualitative measurement of protein expression, which would have been a more accurate measure of cellular activities. Because mRNA expression measures the interaction only indirectly, robust approaches are needed to infer the true statistics. Such approaches make it possible to produce clinically and/or scientifically useful predictions, such as diagnosing diseases, identifying tumor types, and selecting treatments. Many statistical classification methods are available for this type of task. A central difficulty in such statistical classification is that some of the features (variables) in the data may be irrelevant or redundant to the prediction task. Irrelevant and redundant data complicate and confound the classification process; it is therefore desirable to identify and eliminate variables that are not useful for the classification task. The aim of this research is to propose a robust methodology for classifying DNA microarray data using feature selection, the process of identifying and eliminating features that are irrelevant or redundant. The proposed method performs effective feature selection to identify a subset of genes that best describes a disease. Two well-known DNA microarray datasets were used to validate the method.
Feature Selection
• The process of selecting a subset of relevant features (variables) for use in classification model construction is known as feature selection (a.k.a. variable selection, attribute selection, or variable subset selection).
• Classification models constructed with an optimal subset of features have been demonstrated, both in theory and in practice, to be faster to train and faster to run, to provide a better understanding of the underlying processes, and to have improved predictive accuracy, better generalization, and reduced model complexity.
Microarray Data Challenges for Classification
• Many datasets are high dimensional, i.e. thousands or tens of thousands of features.
• Many of the features are redundant, irrelevant, or weakly relevant.
• Datasets often contain missing and/or incorrect values.
• There are possibly mislabeled samples.
• Usually, there are relatively few samples available for training and validation of the model.
Example Dataset: Breast Cancer
• Goal is to classify test samples as relapse or not-relapse (binary classification).
• “Well known” dataset from the Kent Ridge Bio-medical Dataset Repository.
• 24481 gene expression ratios
• 78 training samples
• 19 test samples
• Missing data
Genetic Forest Feature Selection Algorithm
[Figures: Extra Tree Classifier; Feature Importance Estimates; Workflow]
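The section names a genetic search over feature subsets but gives no pseudocode. As a rough illustration only, the sketch below shows how a generic genetic search with elitism over feature masks can work. It is not the authors' GFFS: the function name `genetic_feature_search`, the toy fitness function, and all parameter values are hypothetical (`n_elite=4` merely echoes the "elite: 4" setting reported in the results). In a real run the fitness would presumably be a classifier's score on the selected features.

```python
import random

def genetic_feature_search(n_features, fitness, pop_size=20, generations=30,
                           n_elite=4, mutation_rate=0.05, seed=0):
    """Search for a feature mask (list of 0/1) that maximizes `fitness`."""
    rng = random.Random(seed)
    # Start from random masks: 1 keeps a feature, 0 drops it.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        # Elitism: the best masks survive unchanged into the next generation.
        next_gen = [mask[:] for mask in ranked[:n_elite]]
        parents = ranked[:pop_size // 2]
        while len(next_gen) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


# Toy fitness (hypothetical): features 0-2 are "relevant", every other
# selected feature costs a small penalty.
RELEVANT = {0, 1, 2}

def toy_fitness(mask):
    hits = sum(mask[i] for i in RELEVANT)
    extras = sum(mask) - hits
    return hits - 0.1 * extras

best = genetic_feature_search(n_features=12, fitness=toy_fitness)
selected = [i for i, bit in enumerate(best) if bit]
print("selected features:", selected)
```

Because the elite masks are copied unchanged, the best fitness in the population never decreases from one generation to the next.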
Results
Darker colors indicate features that appear in more candidate feature sets. Lighter colors indicate features that appear in fewer candidate feature sets. Features that do not appear in any candidate feature set are likely to be irrelevant. Rows with equal or near-equal performance but different features likely contain features that are mutually redundant.
A set of 10 candidate features is generated for each fitness metric:
1. MCC
2. AUC
3. accuracy
4. F1
5. (MCC+AUC)/2
6. (F1+AUC)/2
7. (accuracy+AUC)/2
8. (precision+recall)/2
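The eight fitness metrics listed above are either single scores or simple averages of two scores. A minimal sketch of how they could be computed from a set of base scores follows; the function name `fitness_scores` and the dictionary keys are assumptions made for illustration, and the input values are the scores the poster reports for its best feature set.

```python
def fitness_scores(s):
    """Return the eight fitness values from a dict of base scores."""
    return {
        "MCC": s["mcc"],
        "AUC": s["auc"],
        "accuracy": s["accuracy"],
        "F1": s["f1"],
        "(MCC+AUC)/2": (s["mcc"] + s["auc"]) / 2,
        "(F1+AUC)/2": (s["f1"] + s["auc"]) / 2,
        "(accuracy+AUC)/2": (s["accuracy"] + s["auc"]) / 2,
        "(precision+recall)/2": (s["precision"] + s["recall"]) / 2,
    }

# Base scores reported on the poster for the best feature set found.
reported = {"mcc": 0.8895, "auc": 0.8571, "accuracy": 0.9474,
            "f1": 0.9231, "precision": 1.0000, "recall": 0.8571}

scores = fitness_scores(reported)
for name, value in scores.items():
    print(f"{name:22s} {value:.4f}")
```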
Metric       Best MCC found                      All features
             (metric: accuracy+AUC, elite: 4)    (metric: None)
# features   32                                  24187
AUC          0.8571                              0.8393
accuracy     0.9474                              0.8421
precision    1.0000                              0.8333
recall       0.8571                              0.7143
F1           0.9231                              0.7692
MCC          0.8895                              0.6548
Matthews Correlation Coefficient
$$\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where
TP = the number of true positives
TN = the number of true negatives
FP = the number of false positives
FN = the number of false negatives
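As a worked check of the Matthews correlation coefficient, the sketch below computes it from confusion-matrix counts. The counts are not stated on the poster: they are inferred here as the values on 19 test samples that are consistent with the reported accuracy, precision, and recall, so treat them as an assumption. The zero-denominator convention (returning 0) is also a common choice rather than something the source specifies.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Inferred counts (assumption): 19 test samples, 7 of them positive.
best_subset = mcc(tp=6, tn=12, fp=0, fn=1)    # accuracy 18/19, recall 6/7
all_features = mcc(tp=5, tn=11, fp=1, fn=2)   # accuracy 16/19, recall 5/7

print(f"best feature subset: {best_subset:.4f}")
print(f"all features:        {all_features:.4f}")
```

Under these inferred counts the values round to the reported MCC scores of 0.8895 and 0.6548.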
Method Comparison
Classification technique   Selection technique   # of genes   Accuracy (fraction)   Reference
SVM                        PSO                   20           1.0000                [2]
SVM                        ABC                   5            0.9470                [3]
ET                         GFFS                  32           0.9470                Proposed method
J48                        GA                    41           0.9381                [4]
SMV                        DRF0-1G               44           0.8421                [1]
• PSO – particle swarm optimization
• ABC – artificial bee colony
• GFFS – genetic forest feature selector
• GA – genetic algorithm
• J48 – decision tree
• LDAGA – linear discriminant analysis genetic algorithm
• Filter – correlation of individual gene expression with target class
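The "Filter" entry in the legend above describes ranking genes by the correlation of each gene's expression with the target class. A minimal sketch of such a filter, using Pearson correlation on toy data, might look like this. The function names `pearson` and `rank_genes` and the data are hypothetical, and the compared methods may use a different correlation measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def rank_genes(expression, labels):
    """expression: one row per sample. Returns gene indices, strongest first."""
    n_genes = len(expression[0])
    scores = [abs(pearson([row[g] for row in expression], labels))
              for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)

# Toy data (hypothetical): gene 1 tracks the class, gene 0 is noise,
# gene 2 is constant across samples.
expression = [
    [0.3, 1.0, 5.0],
    [0.9, 1.2, 5.0],
    [0.4, 3.1, 5.0],
    [0.8, 2.9, 5.0],
]
labels = [0, 0, 1, 1]

ranking = rank_genes(expression, labels)
print("genes ranked by |correlation| with class:", ranking)
```

A filter like this scores each gene on its own, so it cannot detect redundancy between genes, which is one reason subset-based searches are used instead.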
Overfitting?
• Some candidate feature sets that performed well with the training data performed very poorly with the validation data.
• This is likely due to spurious relationships between irrelevant features and the target class.
• If this is the cause, then feature selection may be viewed as a form of overfitting the training data.
• This illustrates why a validation set kept separate during feature selection is crucial.
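The effect described in these points can be reproduced with pure noise. The following simulation is an illustration, not the authors' experiment: labels and features are independent coin flips, the sample sizes mirror the 78 training and 19 test samples of the example dataset, and the feature count, `TOP_K`, and the seed are arbitrary assumptions. Selecting the features that look best on the training samples produces an apparent signal that disappears on held-out samples.

```python
import random

rng = random.Random(0)
N_TRAIN, N_TEST, N_FEATURES, TOP_K = 78, 19, 2000, 10

def random_bits(n):
    return [rng.randint(0, 1) for _ in range(n)]

def accuracy(feature, labels):
    """Fraction of samples where the feature bit equals the class label."""
    return sum(f == y for f, y in zip(feature, labels)) / len(labels)

# Labels and features are independent, so no feature is truly predictive.
y_train, y_test = random_bits(N_TRAIN), random_bits(N_TEST)
features = [(random_bits(N_TRAIN), random_bits(N_TEST))
            for _ in range(N_FEATURES)]

# "Select" the features that look best on the training samples only.
selected = sorted(features, key=lambda f: accuracy(f[0], y_train),
                  reverse=True)[:TOP_K]

train_acc = sum(accuracy(f[0], y_train) for f in selected) / TOP_K
held_out_acc = sum(accuracy(f[1], y_test) for f in selected) / TOP_K
print(f"training accuracy of selected features: {train_acc:.3f}")
print(f"held-out accuracy of selected features: {held_out_acc:.3f}")
```

With thousands of candidate features and fewer than a hundred samples, some irrelevant features match the training labels well by chance alone, which is why the score on data kept out of the selection step is the one to trust.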
Conclusions
• The usual goal of feature selection is to identify and remove all irrelevant and redundant features.
• Redundant features provide an opportunity to mitigate, or at least predict, performance loss due to missing data.
• Selected features may provide insight into genes correlated with the disease.
• Feature selection may be a form of overfitting the training data.
• A validation dataset is crucial to the feature selection process.
Future Work
• Reapply feature selection using only the candidate feature sets to determine if results improve
• Attempt to reduce overfitting of the training data during feature selection
• Formalize the method of choosing an alternative feature set in the case of missing data
• Complete the process on additional microarray datasets
• Complete the process on datasets from different problem domains
References
• [1] Huerta, E. B., Duval, B. and Hao, J.-K. Gene selection for microarray data by a LDA-based genetic algorithm. Springer, 2008.
• [2] Sahu, B. and Mishra, D. A novel feature selection algorithm using particle swarm optimization for cancer microarray data. Procedia Engineering, 38 (2012), 27-31.
• [3] Garro, B. A., Rodríguez, K. and Vázquez, R. A. Classification of DNA microarrays using artificial neural networks and ABC algorithm. Applied Soft Computing, 38 (2016), 548-560.
• [4] Sasikala, S., alias Balamurugan, S. A. and Geetha, S. A Novel Feature Selection Technique for Improved Survivability Diagnosis of Breast Cancer. Procedia Computer Science, 50 (2015), 16-23.