Sumaiya Iqbal and Md Tamjidul Hoque gratefully acknowledge the Louisiana Board of Regents through the Board of Regents Support Fund, LEQSF (2013-16)-RD-A-19.
• The usual goal of feature selection is to identify and remove all irrelevant and redundant features.
• Redundant features provide an opportunity to mitigate, or at least predict, performance loss due to missing data.
• Selected features may provide insight into genes correlated with the disease.
• Feature selection may be a form of overfitting the training data.
• A validation dataset is crucial to the feature selection process.
Ideally, we would use only the features present in the top row of the heatmap to build our final model. However, we may choose to build additional models with features from additional rows. If future test samples are missing features in the top row, we may choose a model constructed with an alternative set of features.
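The fallback described above can be sketched in Python as a lookup over ranked feature sets. This is a minimal illustration, not the authors' procedure: the function name `choose_feature_set` and the gene identifiers are hypothetical, and it assumes one model has already been built per candidate feature set.

```python
def choose_feature_set(ranked_sets, available):
    """Return the best-ranked feature set whose features are all available.

    ranked_sets: list of feature lists, best-ranked first (hypothetical input).
    available:   iterable of feature names measured for the new sample.
    Returns None when no candidate set is fully covered.
    """
    available = set(available)
    for features in ranked_sets:
        if set(features) <= available:
            return features
    return None


# Made-up gene identifiers, best-ranked candidate set first.
ranked_sets = [
    ["g1", "g2", "g3"],
    ["g1", "g4"],
    ["g5", "g6"],
]

# The sample lacks g2 and g3, so the top-ranked set cannot be used.
sample_genes = {"g1", "g4", "g5", "g6"}
print(choose_feature_set(ranked_sets, sample_genes))  # ['g1', 'g4']
```

The model trained on the returned feature set would then be used for that sample.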
DNA microarray data allows the expression levels of thousands of genes to be analyzed simultaneously. This process captures the current state of gene regulation within a cell by measuring mRNA expression, instead of the tedious quantitative and qualitative measurement of protein expression, which would be a more accurate measure of cellular activity. Because mRNA expression measures this activity only indirectly, robust approaches are needed to infer the true statistics. Such approaches make it possible to produce clinically and/or scientifically useful predictions, such as diagnosing diseases, identifying tumor types, and selecting treatments. Many statistical classification methods are available for this type of task. A central difficulty in such statistical classification is that some of the features (variables) in the data may be irrelevant or redundant to the prediction task. Irrelevant and redundant data complicate and confound the classification process; it is therefore desirable to identify and eliminate variables that are not useful for the classification task. The aim of this research is to propose a robust methodology for classifying DNA microarray data using feature selection, which is the process of identifying and eliminating features that are irrelevant or redundant. The proposed method performs effective feature selection to identify a subset of genes that best describes a disease. Two well-known DNA microarray datasets were used to validate the method.

A Hybrid Evolutionary Feature Selection Method for Microarray Data
Denson Smith, Sumaiya Iqbal, Md Tamjidul Hoque
email: {dsmith8, siqbal1, thoque}@uno.edu
Department of Computer Science, University of New Orleans, New Orleans, LA, USA
Poster sections: Method; Abstract; Results and Discussion; Conclusions; Future Work; Acknowledgements
\[
\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\]
where
TP = the number of true positives
TN = the number of true negatives
FP = the number of false positives
FN = the number of false negatives
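As a worked check of the formula above, the following Python sketch computes MCC from confusion-matrix counts. The counts are an inference, not values stated on the poster: they are the integer counts on 19 validation samples that match the reported precision, recall and accuracy of the two result columns. Returning 0 when the denominator is zero is a common convention and is assumed here.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: MCC is taken as 0 when any marginal is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Inferred counts: precision 6/6, recall 6/7, accuracy 18/19.
print(round(mcc(tp=6, tn=12, fp=0, fn=1), 4))  # 0.8895

# Inferred counts: precision 5/6, recall 5/7, accuracy 16/19.
print(round(mcc(tp=5, tn=11, fp=1, fn=2), 4))  # 0.6548
```

Both values agree with the MCC figures reported for the selected feature set and for all features, which supports reading the denominator as a square root.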
Extra Tree Classifier (ET)
For validation and the final model, the ET is tuned to maximize classification performance. Other classifiers, such as deep neural networks and support vector machines, may also be trained on the selected features.
The extra tree classifier provides the genetic algorithm with two pieces of information about each candidate feature set. Predictions from the ET are used to generate fitness estimates for the genetic algorithm. Feature importance estimates from the ET are used to remove features estimated to be unimportant from some of the current generation's offspring. If these features are indeed unimportant (irrelevant), then the offspring will have an equal or higher fitness estimate compared with its parents.
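The pruning step described above can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the authors' implementation: `toy_fitness` stands in for a fitness estimate derived from ET predictions, the `importances` dictionary stands in for ET feature importance estimates, and the threshold value and gene names are hypothetical.

```python
def prune_offspring(features, importances, threshold):
    """Drop features whose estimated importance is below `threshold`."""
    return [f for f in features if importances.get(f, 0.0) >= threshold]


def toy_fitness(features, informative=frozenset({"g1", "g2"})):
    """Stand-in fitness: reward informative features, lightly penalize size.

    In the method described above this value would come from scoring ET
    predictions; the formula here is purely illustrative.
    """
    covered = len(informative & set(features)) / len(informative)
    return covered - 0.01 * len(features)


# Hypothetical offspring and made-up importance estimates.
offspring = ["g1", "g2", "g7", "g9"]
importances = {"g1": 0.50, "g2": 0.40, "g7": 0.01, "g9": 0.00}

pruned = prune_offspring(offspring, importances, threshold=0.05)
print(pruned)  # ['g1', 'g2']

# If the removed features were truly irrelevant, fitness should not drop.
keep_pruned = toy_fitness(pruned) >= toy_fitness(offspring)
print(keep_pruned)  # True
```

Comparing the pruned offspring with the unpruned one is one way to check whether the removed features were in fact irrelevant.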
Figure: Heatmap of the breast cancer candidate feature sets ranked by Matthews Correlation Coefficient
Figure: Detail of heatmap
Darker colors indicate features that appear in more candidate feature sets. Lighter colors indicate features that appear in fewer candidate feature sets. Features that do not appear in any candidate feature set are likely to be irrelevant. Rows with equal or near equal performance but different features likely contain features that are mutually redundant.
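The quantity that the heatmap colors encode, namely how many candidate feature sets contain each feature, can be computed as in the Python sketch below. The function names and gene identifiers are hypothetical and the data is made up for illustration.

```python
from collections import Counter


def feature_frequencies(candidate_sets):
    """Count how many candidate feature sets contain each feature."""
    return Counter(f for s in candidate_sets for f in set(s))


def never_selected(all_features, candidate_sets):
    """Features that appear in no candidate set (likely irrelevant)."""
    counts = feature_frequencies(candidate_sets)
    return sorted(f for f in all_features if counts[f] == 0)


# Made-up candidate feature sets over five hypothetical genes.
candidate_sets = [["g1", "g2"], ["g1", "g3"], ["g1", "g2", "g4"]]
all_features = ["g1", "g2", "g3", "g4", "g5"]

counts = feature_frequencies(candidate_sets)
print(counts["g1"], counts["g2"], counts["g3"])  # 3 2 1
print(never_selected(all_features, candidate_sets))  # ['g5']
```

A higher count corresponds to a darker cell in the heatmap, and a zero count marks a feature that no candidate set selected.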
A set of 10 candidate features is generated for each fitness metric:
1. MCC
2. AUC
3. accuracy
4. F1
5. (MCC+AUC)/2
6. (F1+AUC)/2
7. (accuracy+AUC)/2
8. (precision+recall)/2
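The eight fitness metrics listed above can be computed from confusion-matrix counts plus an AUC value, as in this Python sketch for a binary problem. The function and key names are illustrative. AUC is passed in because it depends on ranked scores and cannot be derived from the counts alone. Returning 0 on a zero denominator is an assumed convention.

```python
import math


def _ratio(num, den):
    """Division that returns 0.0 on a zero denominator (assumed convention)."""
    return num / den if den else 0.0


def fitness_scores(tp, tn, fp, fn, auc):
    """Return the eight fitness metrics as a dictionary."""
    accuracy = _ratio(tp + tn, tp + tn + fp + fn)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    mcc = _ratio(tp * tn - fp * fn,
                 math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "MCC": mcc,
        "AUC": auc,
        "accuracy": accuracy,
        "F1": f1,
        "(MCC+AUC)/2": (mcc + auc) / 2,
        "(F1+AUC)/2": (f1 + auc) / 2,
        "(accuracy+AUC)/2": (accuracy + auc) / 2,
        "(precision+recall)/2": (precision + recall) / 2,
    }


# Illustrative counts for 19 samples, with an illustrative AUC value.
scores = fitness_scores(tp=6, tn=12, fp=0, fn=1, auc=0.8571)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

The genetic algorithm would use one of these values as its fitness for a given run.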
During the feature selection process, the ET parameters are tuned to maximize the accuracy of feature importances. Features that generate higher information gain at more nodes are estimated to be more important. Information gain is measured by Gini impurity or information entropy.
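The two impurity measures named above, and the gain a single split produces, can be written out as a short Python sketch. The class counts are made up for illustration. How a tree ensemble aggregates these per-node gains into a per-feature importance is not shown here.

```python
import math


def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy (in bits) of a node given its per-class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


def impurity_decrease(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * impurity(ch) for ch in children)
    return impurity(parent) - weighted


# Hypothetical split of a node with 7 positive and 12 negative samples.
parent = [7, 12]
children = [[6, 0], [1, 12]]

print(round(impurity_decrease(parent, children, gini), 4))
print(round(impurity_decrease(parent, children, entropy), 4))
```

A split that separates the classes well, as in this example, yields a large decrease under either measure.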
Performance
Measure      best MCC found   all features
metric       accuracy+AUC     None
elite        4                -
#features    32               24187
AUC          0.8571           0.8393
accuracy     0.9474           0.8421
precision    1.0000           0.8333
recall       0.8571           0.7143
F1           0.9231           0.7692
MCC          0.8895           0.6548
References
[1] Huerta, E. B., Duval, B. and Hao, J.-K. Gene selection for microarray data by a LDA-based genetic algorithm. Springer, 2008.
[2] Sahu, B. and Mishra, D. A novel feature selection algorithm using particle swarm optimization for cancer microarray data. Procedia Engineering, 38 (2012), 27-31.
[3] Garro, B. A., Rodríguez, K. and Vázquez, R. A. Classification of DNA microarrays using artificial neural networks and ABC algorithm. Applied Soft Computing, 38 (2016), 548-560.
[4] Sasikala, S., alias Balamurugan, S. A. and Geetha, S. A Novel Feature Selection Technique for Improved Survivability Diagnosis of Breast Cancer. Procedia Computer Science, 50 (2015), 16-23.
• PSO – particle swarm optimization
• ABC – artificial bee colony
• GFFS – genetic forest feature selector
• GA – genetic algorithm
• J48 – decision tree
• LDA-GA – linear discriminant analysis genetic algorithm
• Filter – correlation of individual gene expression with target class
Overfitting?
• Some candidate feature sets that performed well with the training data performed very poorly with the validation data.
• This is likely due to spurious relationships between irrelevant features and the target class.
• If this is the cause, then feature selection may be viewed as a form of overfitting the training data.
• This illustrates why a validation dataset is crucial.
Comparison with Other Methods
Classification technique   Selection technique   # of genes   Accuracy   Reference
SVM                        PSO                   20           1.0000     [2]
SVM                        ABC                   5            0.9470     [3]
ET                         GFFS                  32           0.9470     Proposed method
J48                        GA                    41           0.9381     [4]
SVM                        Filter + LDA-GA       44           0.8421     [1]
• Dimensionality greatly reduced
• Substantial improvement of all performance metrics
• The best MCC was generated from a candidate set selected with accuracy+AUC as the fitness metric for the GA
• Reapply feature selection using only the candidate feature sets to determine if results improve
• Attempt to reduce overfitting of the training data during feature selection
• Formalize the method of choosing an alternative feature set in the case of missing data
• Complete the process on additional microarray datasets
• Complete the process on datasets from different problem domains