heatmap A Hybrid Evolutionary Feature Selection Method for...

Sumaiya Iqbal and Md Tamjidul Hoque gratefully acknowledge the Louisiana Board of Regents through the Board of Regents Support Fund, LEQSF (2013-16)-RD-A-19.

• The usual goal of feature selection is to identify and remove all irrelevant and redundant features
• Redundant features provide an opportunity to mitigate, or at least predict, performance loss due to missing data
• Selected features may provide insight into genes correlated with the disease
• Feature selection may be a form of overfitting the training data
• A validation dataset is crucial to the feature selection process

Ideally, we would use only the features present in the top row to build our final model. However, we may choose to build additional models with features from additional rows. If future test samples are missing features in the top row, we may choose a model constructed with an alternative set of features.
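
The fallback idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the poster lists formalizing this choice as future work, so the rule used here (take the highest-ranked candidate set whose features are all present in the sample) and the names `choose_feature_set`, `ranked_sets`, and `available_features` are hypothetical, not the authors' method.

```python
def choose_feature_set(ranked_sets, available_features):
    """Return (rank, features) for the best-ranked usable candidate set.

    ranked_sets        : list of feature sets, best-performing row first
    available_features : features actually measured for the new sample
    Assumed rule: a set is usable only if none of its features is missing.
    """
    available = set(available_features)
    for rank, features in enumerate(ranked_sets):
        if set(features) <= available:
            return rank, set(features)
    return None  # no candidate set is fully covered by this sample


# Hypothetical gene names; the top row needs g2 and g3, which are missing.
ranked = [{"g1", "g2", "g3"}, {"g1", "g4"}, {"g5"}]
choice = choose_feature_set(ranked, available_features={"g1", "g4", "g5"})
print(choice)
```

A model trained on the returned set would then be used for that sample. Other rules, such as tolerating a small number of missing features, are equally plausible.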

DNA microarray data allows the expression levels of thousands of genes to be analyzed simultaneously. This process captures the current state of gene regulation within a cell by measuring mRNA expression, rather than through tedious quantitative and qualitative measurement of protein expression, which would be a more accurate measure of cellular activity. Because mRNA expression measures the interaction only indirectly, robust approaches are needed to infer the true statistics. Such approaches make clinically and/or scientifically useful predictions possible, such as disease diagnosis, identification of tumor types, and treatment selection. Many statistical classification methods are available for this type of task. A central difficulty in such statistical classification is that some of the features (variables) in the data may be irrelevant or redundant to the prediction task. Irrelevant and redundant data complicate and confound the classification process; it is therefore desirable to identify and eliminate variables that are not useful for the classification task. The aim of this research is to propose a robust methodology for classifying DNA microarray data using feature selection, which is the process of identifying and eliminating features that are irrelevant or redundant. The proposed method performs effective feature selection to identify a subset of genes that best describe a disease. Two well-known DNA microarray datasets were used to validate the method.

A Hybrid Evolutionary Feature Selection Method for Microarray Data
Denson Smith, Sumaiya Iqbal, Md Tamjidul Hoque
email: {dsmith8, siqbal1, thoque}@uno.edu
Department of Computer Science, University of New Orleans, New Orleans, LA, USA

Method
Abstract
Results and Discussion
Conclusions
Future Work
Acknowledgements

\[
\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\]
where,
TP = the number of true positives
TN = the number of true negatives
FP = the number of false positives
FN = the number of false negatives
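
As a small worked illustration of the Matthews Correlation Coefficient defined above, the following sketch computes it from confusion-matrix counts. The function name, the zero-denominator convention, and the example counts are assumptions made for illustration; the poster does not report a confusion matrix.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0.0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a 19-sample test set (not taken from the poster).
example_mcc = matthews_corrcoef(tp=6, tn=12, fp=0, fn=1)
print(round(example_mcc, 4))
```

Unlike accuracy, the value stays near zero for a classifier that only predicts the majority class, which is one reason MCC is useful on small, imbalanced microarray datasets.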

Extra Tree Classifier

For validation and the final model, the ET is tuned to maximize classification performance. Other classifiers, such as a deep neural network or a support vector machine, may also be trained on the selected features.

The extra tree classifier provides the genetic algorithm with two pieces of information about each candidate feature set. First, predictions from the ET are used to generate fitness estimates for the genetic algorithm. Second, feature importance estimates from the ET are used to remove features estimated to be unimportant from some of the current generation's offspring. If these features are indeed unimportant (irrelevant), then the offspring will have an equal or higher fitness estimate compared with its parents.
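
The importance-based pruning step described above might look roughly like the sketch below. It assumes importance estimates are already available as a plain dictionary (in practice they would come from the trained ET). The names `prune_offspring`, `threshold`, and `prune_prob`, and the rule of pruning each offspring with a fixed probability, are hypothetical; the poster does not specify how the pruned offspring are chosen.

```python
import random


def prune_offspring(offspring, importances, threshold, prune_prob=0.5, rng=None):
    """Drop low-importance features from a random subset of offspring.

    offspring   : list of sets of feature indices (candidate feature sets)
    importances : dict mapping feature index -> importance estimate
    threshold   : features below this importance are removed (assumed rule)
    prune_prob  : chance that a given offspring is pruned (assumed value)
    """
    rng = rng or random.Random()
    pruned = []
    for features in offspring:
        if rng.random() < prune_prob:
            kept = {f for f in features if importances.get(f, 0.0) >= threshold}
            # Never return an empty candidate; fall back to the original set.
            pruned.append(kept or set(features))
        else:
            pruned.append(set(features))
    return pruned


# Hypothetical importance estimates for four features.
importances = {0: 0.40, 1: 0.00, 2: 0.35, 3: 0.01}
children = [{0, 1, 2}, {1, 3}, {0, 3}]
result = prune_offspring(children, importances, threshold=0.05, prune_prob=1.0)
print(result)
```

Each pruned offspring would then be re-scored with the fitness metric; if the removed features were truly irrelevant, its fitness should not drop relative to its parents.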

Heatmap of the breast cancer candidate feature sets ranked by Matthews Correlation Coefficient

Detail of heatmap

Darker colors indicate features that appear in more candidate feature sets. Lighter colors indicate features that appear in fewer candidate feature sets. Features that do not appear in any candidate feature set are likely to be irrelevant. Rows with equal or near-equal performance but different features likely contain features that are mutually redundant.
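
The quantity behind the heatmap shading, namely how many candidate feature sets contain each feature, can be computed as in the sketch below. The function names and the gene labels are hypothetical and used only for illustration.

```python
from collections import Counter


def feature_frequencies(candidate_sets):
    """Count in how many candidate feature sets each feature appears."""
    counts = Counter()
    for features in candidate_sets:
        counts.update(set(features))  # set() guards against repeats within one row
    return counts


def never_selected(all_features, candidate_sets):
    """Features absent from every candidate set (likely irrelevant)."""
    counts = feature_frequencies(candidate_sets)
    return sorted(f for f in all_features if counts[f] == 0)


# Hypothetical candidate sets over five genes.
candidates = [{"g1", "g2"}, {"g1", "g3"}, {"g1", "g2", "g4"}]
freq = feature_frequencies(candidates)
unused = never_selected(["g1", "g2", "g3", "g4", "g5"], candidates)
print(freq.most_common(), unused)
```

Higher counts correspond to darker cells, and features with a count of zero are the ones the text flags as likely irrelevant.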

A set of 10 candidate features is generated for each fitness metric:
1. MCC
2. AUC
3. accuracy
4. F1
5. (MCC+AUC)/2
6. (F1+AUC)/2
7. (accuracy+AUC)/2
8. (precision+recall)/2
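
The eight fitness metrics listed above can all be derived from confusion-matrix counts plus an AUC value, as in this minimal sketch. The function name `fitness_scores` and the example inputs are assumptions. AUC is passed in as a number because it depends on ranked prediction scores rather than on the counts alone.

```python
import math


def fitness_scores(tp, tn, fp, fn, auc):
    """Return the eight fitness values for one candidate, keyed by metric name."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "MCC": mcc,
        "AUC": auc,
        "accuracy": accuracy,
        "F1": f1,
        "(MCC+AUC)/2": (mcc + auc) / 2,
        "(F1+AUC)/2": (f1 + auc) / 2,
        "(accuracy+AUC)/2": (accuracy + auc) / 2,
        "(precision+recall)/2": (precision + recall) / 2,
    }


# Hypothetical counts and AUC for a single candidate feature set.
scores = fitness_scores(tp=6, tn=12, fp=0, fn=1, auc=0.9)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Running the genetic algorithm once per key gives one group of candidates per metric, which can then be compared on a common metric such as MCC.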

During the feature selection process, the ET parameters are tuned to maximize the accuracy of the feature importances. Features that generate higher information gain at more nodes are estimated to be more important. Information gain is measured by Gini impurity or information entropy.
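
To make the two impurity measures concrete, the sketch below computes Gini impurity, entropy, and the impurity decrease (information gain) of a single binary split. It uses the standard textbook definitions, not the ET implementation itself, and the function names are chosen here for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def impurity_decrease(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# A perfectly separating split of four samples with two classes.
parent, left, right = [0, 0, 1, 1], [0, 0], [1, 1]
gain_gini = impurity_decrease(parent, left, right, impurity=gini)
gain_entropy = impurity_decrease(parent, left, right, impurity=entropy)
print(gain_gini, gain_entropy)
```

A feature's importance estimate then aggregates such decreases over every node, in every tree, where the feature is used for splitting.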

Performance

              best MCC found   all features
metric        accuracy+AUC     None
elite         4
# features    32               24187
AUC           0.8571           0.8393
accuracy      0.9474           0.8421
precision     1.0000           0.8333
recall        0.8571           0.7143
F1            0.9231           0.7692
MCC           0.8895           0.6548

References

[1] Huerta, E. B., Duval, B. and Hao, J.-K. Gene selection for microarray data by a LDA-based genetic algorithm. Springer, City, 2008.
[2] Sahu, B. and Mishra, D. A novel feature selection algorithm using particle swarm optimization for cancer microarray data. Procedia Engineering, 38 (2012), 27-31.
[3] Garro, B. A., Rodríguez, K. and Vázquez, R. A. Classification of DNA microarrays using artificial neural networks and ABC algorithm. Applied Soft Computing, 38 (2016), 548-560.
[4] Sasikala, S., alias Balamurugan, S. A. and Geetha, S. A Novel Feature Selection Technique for Improved Survivability Diagnosis of Breast Cancer. Procedia Computer Science, 50 (2015), 16-23.

• PSO – particle swarm optimization
• ABC – artificial bee colony
• GFFS – genetic forest feature selector
• GA – genetic algorithm
• J48 – decision tree
• LDA GA – linear discriminant analysis genetic algorithm
• Filter – correlation of individual gene expression with target class

Overfitting?

• Some candidate feature sets that performed well with the training data performed very poorly with the validation data.
• This is likely due to spurious relationships between irrelevant features and the target class.
• If this is the cause, then feature selection may be viewed as a form of overfitting the training data.
• This illustrates why a validation dataset is crucial.

Comparison with Other Methods

Classification technique   Selection technique   # of genes   Accuracy (fraction)   Reference
SVM                        PSO                   20           1.0000                [2]
SVM                        ABC                   5            0.9470                [3]
ET                         GFFS                  32           0.9470                Proposed method
J48                        GA                    41           0.9381                [4]
SVM                        Filter+LDA-GA         44           0.8421                [1]

• Dimensionality greatly reduced
• Substantial improvement of all performance metrics
• The best MCC was generated from a candidate set selected with accuracy+AUC as the fitness metric for the GA

• Reapply feature selection using only the candidate feature sets to determine if results improve
• Attempt to reduce overfitting of the training data during feature selection
• Formalize the method of choosing an alternative feature set in the case of missing data
• Complete the process on additional microarray datasets
• Complete the process on datasets from different problem domains

Date post:	26-Jun-2020