
A Hybrid Evolutionary Feature Selection Method for Microarray Data

Date post: 26-Jun-2020
Acknowledgements

Sumaiya Iqbal and Md Tamjidul Hoque gratefully acknowledge the Louisiana Board of Regents through the Board of Regents Support Fund, LEQSF (2013-16)-RD-A-19.

• The usual goal of feature selection is to identify and remove all irrelevant and redundant features.
• Redundant features provide an opportunity to mitigate, or at least predict, performance loss due to missing data.
• Selected features may provide insight into genes correlated with the disease.
• Feature selection may be a form of overfitting the training data.
• A validation dataset is crucial to the feature selection process.

Ideally, we would use only the features present in the top row of the heatmap to build our final model. However, we may choose to build additional models with features from additional rows. If future test samples are missing features in the top row, we may choose a model constructed with an alternative set of features.

Abstract

DNA microarray data allows the analysis of the expression levels of thousands of genes simultaneously. This process captures the current state of gene regulation within a cell by measuring mRNA expression, instead of the tedious quantitative and qualitative measurement of protein expression, which would be a more accurate measure of cellular activity. Because mRNA expression measures this activity only indirectly, robust approaches are needed to infer the true statistics. Such approaches make possible clinically and/or scientifically useful predictions, such as disease diagnosis, identification of tumor types, and treatment selection. Many statistical classification methods are available for this type of task. A central difficulty in such statistical classification is that some of the features (variables) in the data may be irrelevant or redundant to the prediction task.
Irrelevant and redundant data complicate and confound the classification process; it is therefore desirable to identify and eliminate variables that are not useful for the classification task. The aim of this research is to propose a robust methodology for classifying DNA microarray data using feature selection, which is the process of identifying and eliminating features that are irrelevant or redundant. The proposed method performs effective feature selection to identify a subset of genes that best describe a disease. Two well-known DNA microarray datasets were used to validate the method.

A Hybrid Evolutionary Feature Selection Method for Microarray Data
Denson Smith, Sumaiya Iqbal, Md Tamjidul Hoque
email: {dsmith8, siqbal1, thoque}@uno.edu
Department of Computer Science, University of New Orleans, New Orleans, LA, USA

Poster sections: Method; Abstract; Results and Discussion; Conclusions; Future Work; Acknowledgements

\[
\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]

where
TP = the number of true positives
TN = the number of true negatives
FP = the number of false positives
FN = the number of false negatives

Extra Tree Classifier

For validation and the final model, the extra tree classifier (ET) is tuned to maximize classification performance. Other classifiers, such as deep neural networks and support vector machines, may also be trained on the selected features.

The ET provides the genetic algorithm with two pieces of information about each candidate feature set:

• Predictions from the ET are used to generate fitness estimates for the genetic algorithm.
• Feature importance estimates from the ET are used to remove features estimated to be unimportant from some of the current generation's offspring. If these features are indeed unimportant (irrelevant), then the offspring will have a fitness estimate equal to or higher than that of its parents.
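The pruning step described above can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the function names `prune_offspring` and `accept_if_not_worse`, the gene names, the importance values, and the toy fitness function are all hypothetical. The acceptance rule (keep the pruned offspring only if its fitness is not lower than the parent's) is an assumption used to illustrate the "equal or higher fitness" remark. In practice the importances might come from a fitted tree ensemble, for example the `feature_importances_` attribute of scikit-learn's `ExtraTreesClassifier`.

```python
from typing import Callable, Dict, FrozenSet


def prune_offspring(features: FrozenSet[str],
                    importances: Dict[str, float],
                    n_drop: int) -> FrozenSet[str]:
    """Return the feature set without its n_drop least important members."""
    # Sort ascending by importance; ties are broken by name for determinism.
    ranked = sorted(features, key=lambda f: (importances.get(f, 0.0), f))
    return frozenset(ranked[n_drop:])


def accept_if_not_worse(parent: FrozenSet[str],
                        child: FrozenSet[str],
                        fitness: Callable[[FrozenSet[str]], float]) -> FrozenSet[str]:
    """Keep the pruned child only if its fitness is at least the parent's
    (an assumed acceptance rule, for illustration)."""
    return child if fitness(child) >= fitness(parent) else parent


# Hypothetical importance estimates for four genes.
importances = {"g1": 0.40, "g2": 0.35, "g3": 0.02, "g4": 0.01}
relevant = {"g1", "g2"}  # toy ground truth, unknown in a real run


def toy_fitness(feature_set: FrozenSet[str]) -> float:
    """Toy stand-in for a classifier-based fitness estimate: reward relevant
    genes, apply a small penalty for each irrelevant one."""
    hits = len(feature_set & relevant) / len(relevant)
    noise = 0.05 * len(feature_set - relevant)
    return hits - noise


parent = frozenset(importances)
child = prune_offspring(parent, importances, n_drop=2)
survivor = accept_if_not_worse(parent, child, toy_fitness)
print(sorted(survivor))
```

Here the two genes with the lowest importance are removed, and the pruned offspring survives because removing irrelevant genes does not lower the toy fitness.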
[Figure: Heatmap of the breast cancer candidate feature sets ranked by Matthews Correlation Coefficient]
[Figure: Detail of heatmap]

Darker colors indicate features that appear in more candidate feature sets. Lighter colors indicate features that appear in fewer candidate feature sets. Features that do not appear in any candidate feature set are likely to be irrelevant. Rows with equal or near-equal performance but different features likely contain features that are mutually redundant.

A set of 10 candidate features is generated for each fitness metric:

1. MCC
2. AUC
3. accuracy
4. F1
5. (MCC+AUC)/2
6. (F1+AUC)/2
7. (accuracy+AUC)/2
8. (precision+recall)/2

During the feature selection process, the ET parameters are tuned to maximize the accuracy of the feature importances. Features that generate higher information gain at more nodes are estimated to be more important. Information gain is measured by Gini impurity or information entropy.

Performance

Measure          Best MCC found    All features
Fitness metric   accuracy+AUC      None
Elite            4                 -
# features       32                24187
AUC              0.8571            0.8393
Accuracy         0.9474            0.8421
Precision        1.0000            0.8333
Recall           0.8571            0.7143
F1               0.9231            0.7692
MCC              0.8895            0.6548

References

[1] Huerta, E. B., Duval, B. and Hao, J.-K. Gene selection for microarray data by a LDA-based genetic algorithm. Springer, City, 2008.
[2] Sahu, B. and Mishra, D. A novel feature selection algorithm using particle swarm optimization for cancer microarray data. Procedia Engineering, 38 (2012), 27-31.
[3] Garro, B. A., Rodríguez, K. and Vázquez, R. A. Classification of DNA microarrays using artificial neural networks and ABC algorithm. Applied Soft Computing, 38 (2016), 548-560.
[4] Sasikala, S., alias Balamurugan, S. A. and Geetha, S. A Novel Feature Selection Technique for Improved Survivability Diagnosis of Breast Cancer. Procedia Computer Science, 50 (2015), 16-23.
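The fitness metrics and the Performance figures above can be checked against confusion-matrix counts. The sketch below is an illustration under stated assumptions: the poster does not report the counts, so the ones used here were inferred as one set that is consistent with the reported precision, recall, and accuracy for a 19-sample validation set (the sample count is itself an inference). The function names are hypothetical, and returning 0.0 when a denominator is zero is a common convention rather than something the poster specifies. AUC is not computed because it requires classifier scores, so composite metrics involving AUC would take it as an input.

```python
import math
from typing import Dict


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, F1 and MCC from confusion counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


def composite(a: float, b: float) -> float:
    """Average of two metrics, as in (precision+recall)/2 or (MCC+AUC)/2."""
    return (a + b) / 2


# Inferred counts (assumption): consistent with the "best MCC found" column.
best = confusion_metrics(tp=6, tn=12, fp=0, fn=1)
# Inferred counts (assumption): consistent with the "all features" column.
baseline = confusion_metrics(tp=5, tn=11, fp=1, fn=2)

print(f"best MCC found: {best['mcc']:.4f}")    # 0.8895
print(f"all features:   {baseline['mcc']:.4f}")  # 0.6548
print(f"(precision+recall)/2 for best set: "
      f"{composite(best['precision'], best['recall']):.4f}")
```

With these counts, the computed accuracy, precision, recall, F1, and MCC round to the values in both columns of the Performance table.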
Overfitting?

• Some candidate feature sets that performed well with the training data performed very poorly with the validation data.
• This is likely due to spurious relationships between irrelevant features and the target class.
• If this is the cause, then feature selection may be viewed as a form of overfitting the training data.
• This illustrates why a validation dataset is crucial.

Comparison with Other Methods

Classification technique   Selection technique   # of genes   Accuracy   Reference
SVM                        PSO                   20           1.0000     [2]
SVM                        ABC                   5            0.9470     [3]
ET                         GFFS                  32           0.9470     Proposed method
J48                        GA                    41           0.9381     [4]
SVM                        Filter + LDA-GA       44           0.8421     [1]

Abbreviations:

• PSO – particle swarm optimization
• ABC – artificial bee colony
• GFFS – genetic forest feature selector
• GA – genetic algorithm
• J48 – decision tree
• LDA-GA – linear discriminant analysis genetic algorithm
• Filter – correlation of individual gene expression with target class

• Dimensionality greatly reduced
• Substantial improvement of all performance metrics
• The best MCC was generated from a candidate set selected with accuracy+AUC as the fitness metric for the GA

Future Work

• Reapply feature selection using only the candidate feature sets to determine if results improve
• Attempt to reduce overfitting of the training data during feature selection
• Formalize the method of choosing an alternative feature set in the case of missing data
• Complete the process on additional microarray datasets
• Complete the process on datasets from different problem domains
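One simple way the choice of an alternative feature set under missing data could be formalized is sketched below. This is only one possible rule, not a method described on the poster: walk the candidate feature sets in order of validation score and take the first one whose features are all present in the new sample. The function name `choose_feature_set`, the gene names, and the scores are hypothetical.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

Candidate = Tuple[float, Sequence[str]]  # (validation score, feature names)


def choose_feature_set(ranked_sets: List[Candidate],
                       available: Iterable[str]) -> Optional[Candidate]:
    """Return the best-scoring candidate whose features are all available.

    ranked_sets is assumed to be sorted by score, best first.
    Returns None if no candidate is fully covered by the available features.
    """
    available_set = set(available)
    for score, features in ranked_sets:
        if set(features) <= available_set:
            return score, features
    return None


# Hypothetical candidate feature sets ranked by validation MCC.
candidates: List[Candidate] = [
    (0.89, ["geneA", "geneB", "geneC"]),
    (0.85, ["geneA", "geneD"]),
    (0.80, ["geneD", "geneE"]),
]

# A new sample in which geneB was not measured.
measured = ["geneA", "geneC", "geneD", "geneE"]
choice = choose_feature_set(candidates, measured)
print(choice)
```

In this toy case the top-ranked set is skipped because `geneB` is missing, and the second-ranked set is chosen. The score difference between the two gives a rough estimate of the performance loss due to the missing feature.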
Source: cs.uno.edu/~tamjid/Papers/2016_LA_P3.pdf
