A Hybrid Evolutionary Feature Selection Method for ...cs.uno.edu/~tamjid/Papers/2016_LA_O3.pdf ·...

A Hybrid Evolutionary Feature Selection Method for Microarray Data

Denson Smith, Sumaiya Iqbal, Md Tamjidul Hoque

{dsmith8,siqbal1,thoque}@uno.edu

University of New Orleans

Abstract

DNA microarray data allows the analysis of the expression levels of thousands of genes simultaneously. This process can capture the current state of gene regulation within a cell by capturing mRNA expressions, instead of tedious quantitative and qualitative measurement of protein expressions, which would have been a more accurate measure of cellular activities. Because mRNA expression measures these interactions only indirectly, robust approaches are needed to infer the true statistics. Such approaches make possible clinically and/or scientifically useful predictions, such as diagnosing diseases, identifying tumor types and selecting treatments. Many statistical classification methods are available for this type of task. A central difficulty in such statistical classification is that some of the features (variables) in the data may be irrelevant or redundant to the prediction task. Irrelevant and redundant data complicate and confound the classification process; it is therefore desirable to identify and eliminate variables that are not useful for the classification task. The aim of this research is to propose a robust methodology for classifying DNA microarray data using feature selection, which is the process of identifying and eliminating features that are irrelevant or redundant. The proposed method performs effective feature selection to identify a subset of genes that best describe a disease. Two well-known DNA microarray datasets were used to validate the method.

Feature Selection

•  The process of selecting a subset of relevant features (variables) for use in classification model construction is known as feature selection (a.k.a. variable selection, attribute selection or variable subset selection).

•  Classification models constructed with an optimal subset of features have been shown, both in theory and in practice, to train and run faster, to provide a better understanding of the underlying processes, and to offer improved predictive accuracy, better generalization and reduced model complexity.

Microarray Data Challenges for Classification

•  Many datasets are high dimensional, i.e. thousands or tens of thousands of features.

•  Many of the features are redundant, irrelevant or weakly relevant.

•  Datasets often contain missing and/or incorrect values.

•  There are possibly mislabeled samples.

•  Usually, there are relatively few samples available for training and validation of the model.

Example Dataset: Breast Cancer

•  The goal is to classify test samples as relapse or non-relapse (binary classification).

•  “Well-known” dataset from the Kent Ridge Bio-medical Dataset Repository.

•  24481 gene expression ratios

•  78 training samples

•  19 test samples

•  Missing data

Genetic Forest Feature Selection Algorithm

Extra Tree Classifier

Feature Importance Estimates

Workflow

Results

Darker colors indicate features that appear in more candidate feature sets. Lighter colors indicate features that appear in fewer candidate feature sets. Features that do not appear in any candidate feature set are likely to be irrelevant. Rows with equal or near-equal performance but different features likely contain features that are mutually redundant.

A set of 10 candidate features is generated for each fitness metric:

1.  MCC
2.  AUC
3.  accuracy
4.  F1
5.  (MCC + AUC)/2
6.  (F1 + AUC)/2
7.  (accuracy + AUC)/2
8.  (precision + recall)/2
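The color reading described in this section amounts to counting how often each feature appears across the candidate feature sets. The sketch below illustrates that count only; it is not the authors' code, and the gene identifiers and the three candidate sets are made up for the example.

```python
from collections import Counter


def feature_frequencies(candidate_sets):
    """Count in how many candidate feature sets each feature appears."""
    counts = Counter()
    for features in candidate_sets:
        # set() so that a feature is counted at most once per candidate set
        counts.update(set(features))
    return counts


def never_selected(all_features, candidate_sets):
    """Features that appear in no candidate set (likely irrelevant)."""
    counts = feature_frequencies(candidate_sets)
    return sorted(f for f in all_features if counts[f] == 0)


# Hypothetical data: five genes and three candidate feature sets.
all_features = ["g1", "g2", "g3", "g4", "g5"]
candidate_sets = [["g1", "g2"], ["g1", "g3"], ["g1", "g2", "g3"]]

freq = feature_frequencies(candidate_sets)
unused = never_selected(all_features, candidate_sets)
```

In this toy input, a feature with a high count corresponds to a darker cell, and the entries of `unused` correspond to features that appear in no candidate set.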

             Best MCC found     All features
metric       accuracy + AUC     None
elite        4                  -
# features   32                 24187
AUC          0.8571             0.8393
accuracy     0.9474             0.8421
precision    1.0000             0.8333
recall       0.8571             0.7143
F1           0.9231             0.7692
MCC          0.8895             0.6548

Matthews Correlation Coefficient

\[
\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]

where

•  TP = the number of true positives
•  TN = the number of true negatives
•  FP = the number of false positives
•  FN = the number of false negatives
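As a worked check of the Matthews correlation coefficient, the sketch below computes accuracy, precision, recall, F1 and MCC from confusion-matrix counts. The counts are not given in the source; they are an assumption, chosen because they are consistent with 19 test samples and the reported precision, recall and accuracy.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used here: return 0.0 when a marginal is empty and the ratio is undefined.
    return ((tp * tn) - (fp * fn)) / denom if denom else 0.0


def summary(tp, tn, fp, fn):
    """Common binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
        "MCC": mcc(tp, tn, fp, fn),
    }


# Assumed counts (not stated in the source), chosen to match the reported metrics.
selected = summary(tp=6, tn=12, fp=0, fn=1)    # 32 selected features
all_feats = summary(tp=5, tn=11, fp=1, fn=2)   # all features
```

Under these assumed counts the values round to the reported figures: MCC of about 0.8895 for the selected features and about 0.6548 for all features.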

Method Comparison

Classification technique   Selection technique   # of genes   Accuracy   Reference
SVM                        PSO                   20           1.0000     [2]
SVM                        ABC                   5            0.9470     [3]
ET                         GFFS                  32           0.9470     Proposed method
J48                        GA                    41           0.9381     [4]
SVM                        DRF0-1G               44           0.8421     [1]

•  PSO – particle swarm optimization
•  ABC – artificial bee colony
•  GFFS – genetic forest feature selector
•  GA – genetic algorithm
•  J48 – decision tree
•  LDAGA – linear discriminant analysis genetic algorithm
•  Filter – correlation of individual gene expression with target class

Overfitting?

•  Some candidate feature sets that performed well with the training data performed very poorly with the validation data.

•  This is likely due to spurious relationships between irrelevant features and the target class.

•  If this is the cause, then feature selection may be viewed as a form of overfitting the training data.

•  This illustrates why a validation set kept separate during feature selection is crucial.
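The effect described above can be reproduced on purely random data. The sketch below is an illustration under stated assumptions, not the poster's experiment: the features are Gaussian noise, the labels are balanced, "selection" simply picks the single feature whose sign best matches the training labels, and the sizes (20 training samples, 200 validation samples, 2000 features) and the seed are arbitrary.

```python
import random


def make_noise_data(n_samples, n_features, rng):
    """Balanced 0/1 labels with features that are pure Gaussian noise."""
    labels = [i % 2 for i in range(n_samples)]
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(n_samples)]
    return rows, labels


def sign_accuracy(rows, labels, feature, flip):
    """Accuracy of predicting 1 when the feature is positive (negative if flip)."""
    hits = 0
    for row, label in zip(rows, labels):
        pred = int(row[feature] > 0)
        if flip:
            pred = 1 - pred
        hits += int(pred == label)
    return hits / len(labels)


def select_best_feature(rows, labels):
    """Pick the (feature, flip) pair with the highest accuracy on the given data."""
    n_features = len(rows[0])
    candidates = [(f, flip) for f in range(n_features) for flip in (False, True)]
    return max(candidates, key=lambda c: sign_accuracy(rows, labels, *c))


rng = random.Random(0)  # arbitrary seed, for repeatability only
train_rows, train_labels = make_noise_data(20, 2000, rng)
valid_rows, valid_labels = make_noise_data(200, 2000, rng)

# Selection sees only the training data; the validation data is held out.
feature, flip = select_best_feature(train_rows, train_labels)
train_acc = sign_accuracy(train_rows, train_labels, feature, flip)
valid_acc = sign_accuracy(valid_rows, valid_labels, feature, flip)
```

With this many noise features and so few training samples, the selected feature typically scores far above chance on the training data while staying near 0.5 on the held-out data, since there is no real signal to find.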

Conclusions

•  The usual goal of feature selection is to identify and remove all irrelevant and redundant features.

•  Redundant features provide an opportunity to mitigate, or at least predict, performance loss due to missing data.

•  Selected features may provide insights into genes correlated with the disease.

•  Feature selection may be a form of overfitting the training data.

•  A validation dataset is crucial to the feature selection process.

Future Work

•  Reapply feature selection using only the candidate feature sets to determine if results improve.

•  Attempt to reduce overfitting of the training data during feature selection.

•  Formalize the method of choosing an alternative feature set in the case of missing data.

•  Complete the process on additional microarray datasets.

•  Complete the process on datasets from different problem domains.

References

•  [1] Huerta, E. B., Duval, B. and Hao, J.-K. Gene selection for microarray data by a LDA-based genetic algorithm. Springer, City, 2008.

•  [2] Sahu, B. and Mishra, D. A novel feature selection algorithm using particle swarm optimization for cancer microarray data. Procedia Engineering, 38 (2012), 27-31.

•  [3] Garro, B. A., Rodríguez, K. and Vázquez, R. A. Classification of DNA microarrays using artificial neural networks and ABC algorithm. Applied Soft Computing, 38 (2016), 548-560.

•  [4] Sasikala, S., alias Balamurugan, S. A. and Geetha, S. A Novel Feature Selection Technique for Improved Survivability Diagnosis of Breast Cancer. Procedia Computer Science, 50 (2015), 16-23.

Date post:	05-Aug-2020