DOFA: A NOVEL DISEASE ORIENTED GENE SELECTION ALGORITHM*

JAIN-SHING WU†

Department of Computer Science and Engineering, National Sun Yat Sen University, No. 70, Lienhai Rd. Kaohsiung, Kaohsiung 80424, Taiwan

[email protected]

KUO-YI WU‡

Department of Computer Science and Engineering, National Sun Yat Sen University, No. 70, Lienhai Rd. Kaohsiung, Kaohsiung 80424, Taiwan

[email protected]

CHUNG-NAN LEE§

Department of Computer Science and Engineering, National Sun Yat Sen University, No. 70, Lienhai Rd. Kaohsiung, Kaohsiung 80424, Taiwan

[email protected]

CHUAN-WEN CHIANG**

Department of Computer and Communication Engineering, National Kaohsiung First University of Science and Technology, No.2, Jhuoyue Rd., Nanzih District, Kaohsiung City 811, Taiwan

[email protected]

Received (Day Month Year) Revised (Day Month Year)

Accepted (Day Month Year)

Finding disease-causing genes plays a crucial role in cancer diagnosis and treatment. Existing feature selection methods may not work efficiently, since they try to select genes that distinguish all diseases simultaneously. Here, we propose a novel disease-oriented feature selection algorithm (DOFA) that picks out the genes related to each disease. DOFA uses a genetic algorithm (GA) to select the related genes automatically, and uses a support vector machine (SVM) and k-nearest neighbor (KNN) as classifiers. DOFA is tested on the AMLALL and Colon datasets, for which it selects 21 and 25 genes, respectively. According to the literature, 20 of the 21 genes selected for the AMLALL dataset and 20 of the 25 genes selected for the Colon dataset are related either directly to the disease or to cancer. Three more experiments are conducted to verify the discriminability of the genes selected by DOFA. The experimental results all indicate that DOFA obtains better discriminability than the competing methods. Thus, DOFA not only selects the genes related to the diseases but also increases the classification accuracy.

Keywords: genetic algorithm; feature selection; gene relationship.

* DOFA: A NOVEL DISEASE ORIENTED GENE SELECTION ALGORITHM
† JAIN-SHING WU. No. 70, Lienhai Rd. Kaohsiung, Kaohsiung 80424, Taiwan, R.O.C.
‡ KUO-YI WU. No. 70, Lienhai Rd. Kaohsiung, Kaohsiung 80424, Taiwan, R.O.C.
§ CHUNG-NAN LEE. No. 70, Lienhai Rd. Kaohsiung, Kaohsiung 80424, Taiwan, R.O.C.
** CHUAN-WEN CHIANG. No.2, Jhuoyue Rd., Nanzih District, Kaohsiung City 811, Taiwan, R.O.C.

1. Introduction

With the rapid development of medical science, the DNA microarray technique has recently become a powerful tool for disease diagnosis. The technique is also useful in many other medical applications, such as cancer filtering, drug design, and the study of how genes cooperate. Via expression data, a large number of genes can be observed at the same time, and the tissues under test can be examined to distinguish normal from abnormal samples, or even different diseases. Although researchers can obtain gene relationships from the gene expression levels, analyzing such a large number of signals on a microarray chip is a tedious task. Hence, many algorithms have been proposed to help biologists analyze the large amount of information hidden in the expression data.

The genes related to a disease can be observed from the microarray gene expression data. Effective gene selection not only significantly reduces the dimensionality of the feature space without loss of classification accuracy by removing unrelated features, but also reveals the relationship between genes and diseases. Gene selection methods can be roughly categorized into three types: filters, wrappers, and embedded systems [1]. The filter approach identifies discriminatory genes by using an individual attribute reduction criterion. The wrapper approach searches for the optimal gene set for classification by using a specific classifier as the evaluator. In embedded systems, the search for the optimal gene set is built into the classifier, so when the search is done, the classification of the microarray gene expression data is also done.

The filters and the wrappers are independent of the type of classifier, whereas the performance of an embedded system depends on the classifier adopted in it. A sophisticated wrapper approach, which searches for the optimal gene set and uses this subset to build a specific classification model for the testing dataset, usually shows relatively high classification accuracy compared with a filter approach. However, the performance improvement is not always significant [2]. Furthermore, the high computational cost of a wrapper-based method can make it impractical. In contrast, a filter-based method identifies discriminatory genes by using an individual attribute reduction criterion. Hence, we use a filter method for gene ranking to filter out the unrelated genes and a wrapper method for the selection.

The main goals of this paper are to find the genes that are related to diseases and to obtain high accuracy in classifying samples as normal or abnormal, or even into multiple classes, from the microarray expression data.

The rest of this paper is organized as follows. Section 2 briefly discusses related work. In Section 3, the details of DOFA are described. Section 4 presents experimental results. Finally, Section 5 draws conclusions.

2. Related Work

Several studies have addressed the microarray gene selection and classification problem. Romdhane et al. [3] provided a new two-phase model for a possibilistic approach to mining gene microarray data. They first clustered the microarray data using the “Partition Information Entropy” as a validity measure. Then, they selected the most representative genes from each computed cluster and modeled them as a graph called a proximity graph. This model obtains high partition accuracies and higher prediction accuracies on unknown genes.

Hong and Cho [4] used Pearson correlation (abbreviated PC), Spearman correlation (abbreviated SC), Euclidean distance (abbreviated ED), cosine coefficient (abbreviated CC), and signal-to-noise ratio (abbreviated SNR) for gene selection and adopted genetic programming to generate the classification rules. The samples are then classified by voting among these rules. Genetic programming makes it easy to find gene combinations. However, the tree of an individual can grow very large, so additional rules are needed to limit the tree height.

Chen and Zhao [5] also used PC, SC, ED, CC, and the Fisher ratio as gene selection methods to filter out the unrelated genes. The PC, SC, ED, and CC methods in their algorithm differ only slightly from those used in Hong and Cho's algorithm. An artificial neural network (abbreviated ANN) is used as the classifier, and particle swarm optimization is adopted to train the ANN's parameters to obtain good classification accuracies. However, Chen and Zhao's algorithm performs classification with a fixed, manually chosen number of genes, and the number of selected genes affects the classification results. Hence, their algorithm needs a better way to decide how many genes to use in classification.

Yu and Liu [6] used a redundancy-based filter as the gene selection method and adopted C4.5, a decision tree method, as the classifier. They defined a relevance function to evaluate gene sets and used this information to find the strongly relevant subset. In Yu and Liu's work, the strongly relevant subset is selected easily. However, some genes are assigned to weakly relevant subsets that are not used in classification. Hence, the classification results may not be good enough.

Wang et al. [7] used information gain as the gene selection method and adopted the Adaptive Network based Fuzzy Inference System (abbreviated ANFIS) as the classifier. They found some, but not all, of the genes that are relevant to the diseases.

Tan et al. [8] used the Top Scoring Pair (abbreviated TSP) as the classifier. TSP does not require an additional gene selection method, since the gene pair is determined automatically by an internal cross-validation loop in the training step. TSP obtains good classification results with only a few genes. However, because so few genes are used, it is difficult to obtain the best classification result.

Xiong and Chen [9] used the BW ratio as the gene selection method and adopted a kernel-based support vector machine (abbreviated SVM) [10] as the classifier. They proposed a new kernel as the classification kernel of the SVM. However, the classification accuracy of their algorithm depends on the number of selected genes, and the best number differs from dataset to dataset. Hence, one has to use trial and error to decide how many genes to use in classification in order to obtain the best classification accuracy for each dataset.

The performance of all the algorithms described above depends on the selected genes. However, most of these algorithms use a fixed number of genes to classify the dataset, which may cause low accuracy when the wrong number of genes is used. In addition, some of these algorithms select different gene sets for different classes but then combine these gene sets into one set and apply the combined set for classification. Figure 1 shows an example of the problem caused by combining the genes selected for two different classes.

In Figure 1, the gray blocks denote the important gene sets selected for the classes (diseases), and the hatched blocks denote the unselected genes. Although a gene subset is selected for each class, these algorithms merge the subsets into one. They obtain lower accuracies because genes that are informative for one class may be noise for another class. Therefore, the gene sets have to be used separately for classification.

3. Disease-Oriented Feature Selection Algorithm

This section introduces a genetic approach for picking out the genes related to each disease. The optimized solution is the combination of genes that best classifies the data in a set of microarray data. However, the number of genes in a microarray dataset is so large that the genes cannot be used directly as the individuals of a genetic algorithm (GA). Therefore, we first use the microarray attribute reduction scheme (abbreviated MARS) [11] to remove the unrelated genes, and then adopt a GA to pick out the genes that are meaningful for each disease as classification patterns. At the end of the algorithm, the classification patterns are used to classify the testing datasets.

Figure 1. An example of the problem caused by combination of the selected genes for two different classes.

3.1. Overview of DOFA

Before describing the whole algorithm, some variables are defined first. Suppose that a training dataset $S$,

$$S = \{\, X_j \mid X_j = (x_1^j, x_2^j, \ldots, x_m^j),\ 1 \le j \le n \,\} \tag{1}$$

consists of $n$ expression samples, and $m$ is the number of genes measured. In $S$, all samples associated with a class label $k \in \{1, 2, \ldots, c\}$ constitute a subset $S_k$, and

$$S = S_1 \cup S_2 \cup \cdots \cup S_k \cup \cdots \cup S_c \tag{2}$$

where $c$ is the total number of classes.

DOFA consists of four processes: the normalization process, the selection process, the classification process, and the fusion and verification process. The flowchart of DOFA is given in Figure 2.

First, the microarray expression training dataset S is normalized to eliminate the differences in scale between genes. The normalized gene expression data is then sent to the gene selection process, which selects the genes that are informative for each disease. The gene selection process consists of two phases: the gene reduction phase and the classification pattern learning phase. In the gene reduction phase, MARS [11] is applied to the normalized gene expression data in order to filter out the unrelated genes. In the classification pattern learning phase, DOFA uses a GA to learn the classification patterns for each disease. In the classification process, DOFA uses the classification patterns to keep only the informative genes of the training and testing datasets and then sends them to the SVM to obtain the classification results. In the final process, a decision function fuses the classification results into one predicted result that decides which class the testing sample belongs to. The predicted results are then verified to calculate the classification accuracy.

3.2. Normalization process

Since the expression values of different genes differ notably in scale, the gene expression data should be normalized first to eliminate these differences. Figure 3 shows some of the raw values of one microarray expression dataset.

As shown in Figure 3, the differences in scale between genes are large. The expression values of gene x1 are in the thousands, whereas those of gene x3 are in the tens. If the microarray expression data were classified using these un-normalized values, some genes related to the disease might be neglected. Hence, a normalization process that eliminates these differences between genes is needed.

[Flowchart. Inputs: microarray training data and microarray testing data. Steps: Start; the normalization process; the gene selection process, consisting of the gene reduction phase (which produces the attribute reduced training dataset for each class) and the classification pattern learning phase, repeated until all classification patterns have been learned; the classification process using the classification patterns; the fusion and verification process; End.]

Figure 2. The flowchart of the proposed algorithm.

Class label  1         0         1         0         1         0         1
x1           8589.416  9164.254  3825.705  6246.449  3230.329  2510.325  7126.599
x2           1285.603  2253.363  1066.839  1144.691  1540.25   962.01    2003.484
x3           84.62     91.92     43.01     75.67625  27.1525   31.1975   108.9388

Figure 3. Some of the original data of the Colon dataset.

The normalized function $Nor(x_i^j)$ is defined as follows:

$$Nor(x_i^j) = \frac{x_i^j - x_{i,\min}}{x_{i,\max} - x_{i,\min}}, \quad \forall\, i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, n \tag{3}$$

where $x_{i,\max}$ represents the maximum value of all gene expression samples at the $i$-th gene and $x_{i,\min}$ denotes the minimum value of all gene expression samples at the $i$-th gene. These two values, $x_{i,\max}$ and $x_{i,\min}$, are defined as follows:

$$x_{i,\max} = \max_{1 \le j \le n} x_i^j \tag{4}$$

$$x_{i,\min} = \min_{1 \le j \le n} x_i^j \tag{5}$$

With the normalized function, all gene expression values of the training dataset are scaled to the range 0 to 1, and the dataset S is transformed into the normalized dataset S’. After normalization, the relationships between genes and classes are easier to find. However, not all genes are equally important to all classes. Some genes are unrelated to the disease and are regarded as noise. Hence, the unrelated genes should be removed in order to increase the classification accuracy.
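As an illustration, the min-max normalization of Eqs. (3)-(5) can be sketched in Python as follows. This is a minimal sketch, not the authors' Java implementation: the function name `normalize`, the one-row-per-sample layout, and the handling of a gene with constant expression (mapped to 0) are assumptions made for the example. The input values are the first three samples of Figure 3.

```python
def normalize(samples):
    """Min-max normalize each gene (column) to [0, 1], following Eq. (3)."""
    n_genes = len(samples[0])
    # Per-gene minimum and maximum over all samples, as in Eqs. (4) and (5).
    mins = [min(s[i] for s in samples) for i in range(n_genes)]
    maxs = [max(s[i] for s in samples) for i in range(n_genes)]
    normalized = []
    for s in samples:
        row = []
        for i in range(n_genes):
            span = maxs[i] - mins[i]
            # Assumption: a gene with constant expression is mapped to 0.0.
            row.append((s[i] - mins[i]) / span if span else 0.0)
        normalized.append(row)
    return normalized


# First three samples of Figure 3; each row is one sample (x1, x2, x3).
colon = [
    [8589.416, 1285.603, 84.62],
    [9164.254, 2253.363, 91.92],
    [3825.705, 1066.839, 43.01],
]
for row in normalize(colon):
    print(row)
```

After this step, genes x1 and x3 lie on the same 0 to 1 scale even though their raw values differ by two orders of magnitude.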

3.3. Gene selection process

The gene selection process, which is the core of DOFA, consists of two phases: the gene reduction phase and the classification pattern learning phase. The gene reduction phase filters out the unrelated genes. The classification pattern learning phase then selects the genes that are informative for the disease.

3.3.1. Gene reduction phase

In the gene reduction phase, the microarray attribute reduction scheme (MARS) [11] is adopted to remove the genes that are unrelated to classification. Figure 4 shows the flowchart of MARS.

MARS consists of two procedures: the gene ranking procedure and the gene reduction procedure. In the gene ranking procedure, the normalized training dataset S’ is used to calculate a score for each gene with respect to each class using a gene ranking function. The matrix R records the scores of every gene for every class. For each gene $x_i$, $\mu_i^k$ is the mean of the $i$-th gene for class $k$, calculated as follows:

$$\mu_i^k = \frac{\sum_{\forall X_j \in S_k} x_i^j}{|S_k|} \tag{6}$$

where $|S_k|$ represents the number of samples in the $k$-th class subset $S_k$.

And $\mu^k$ is the mean vector of all genes for the class $k$. $\mu^k$ is denoted as follows:

$$\mu^k = (\mu_1^k, \mu_2^k, \ldots, \mu_m^k) \tag{7}$$

When the mean vector $\mu^k$ is determined, a gene ranking function, $score(i, k)$, indicates the ability of the $i$-th gene to identify the samples associated with the $k$ class label. The gene ranking function, $score(i, k)$, is defined as follows:

$$score(i, k) = \left[ \frac{\sum_{\forall X_j \in S_k} v(x_i^j, k)}{|S_k|} \right] \tag{8}$$

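The weighted voting function $v(x_i^j, k)$ is given as follows:

$$v(x_i^j, k) = \begin{cases} 1, & \text{if } |x_i^j - \mu_i^k| < |x_i^j - \mu_i^l|,\ \forall\, l \in \{1, 2, \ldots, c\},\ l \ne k \\ 0, & \text{otherwise} \end{cases} \tag{9}$$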

The weighted voting function compares the distance between the expression value of the $i$-th gene and the center of class $k$ with the distance between that expression value and the center of every other class $l$ ($l \in \{1, 2, \ldots, c\}$, $l \ne k$). If the expression value is closer to the center of class $k$ than to the center of any other class, the weighted voting function returns 1; otherwise, it returns 0.

The scores of all genes for all classes are recorded in the matrix R. The higher the score, the better the corresponding $i$-th gene can distinguish the samples of class $k$.

In the gene reduction procedure, the gene selection vector $T^k$, $1 \le k \le c$, which records whether each gene is noise for class $k$, is generated according to the threshold $thr$ and the matrix R. All the gene selection vectors are recorded in the matrix T. A gene may have different scores for different classes. Hence, in order to find the genes that are related to class $k$, the gene selection vector $T^k = (t_1^k, t_2^k, \ldots, t_i^k, \ldots, t_m^k)$ records the genes selected for class $k$. The attribute $t_i^k$ of the vector $T^k$ is defined as follows:

$$t_i^k = \begin{cases} 1, & \text{if } score(i, k) \ge thr \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

Figure 4. The flowchart of the gene reduction phase.

Genes whose scores reach the threshold value thr are regarded as related genes and are passed on to the classification pattern learning phase, whereas genes whose scores fall below thr are regarded as unrelated genes and are removed. The threshold thr removes the unrelated genes in order to narrow down the search space of the classification pattern learning phase. Changing the value of thr only affects the size of the search space and does not affect the classification accuracy. Hence, the threshold thr is determined by an empirical rule. The T matrix is then used to remove the unrelated genes from the normalized training dataset S’, which yields an attribute reduced training dataset for each disease.
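The gene ranking and gene reduction procedures can be illustrated with a small Python sketch. It assumes that the score of Eq. (8) is the fraction of class-k samples whose vote from Eq. (9) is 1. The function names, the toy data, and the threshold value 0.8 are hypothetical choices for the example, not values taken from the paper or from MARS [11].

```python
def class_means(samples, labels):
    """Per-class mean of every gene (Eqs. 6 and 7)."""
    means = {}
    for k in set(labels):
        members = [s for s, y in zip(samples, labels) if y == k]
        means[k] = [sum(col) / len(members) for col in zip(*members)]
    return means


def score(i, k, samples, labels, means):
    """Fraction of class-k samples that vote for gene i (assumed Eq. 8)."""
    members = [s for s, y in zip(samples, labels) if y == k]
    votes = 0
    for s in members:
        d_own = abs(s[i] - means[k][i])
        # Eq. (9): vote 1 only if strictly closer to the own class center.
        if all(d_own < abs(s[i] - means[l][i]) for l in means if l != k):
            votes += 1
    return votes / len(members)


def selection_vector(k, samples, labels, thr):
    """Gene selection vector T^k of Eq. (10)."""
    means = class_means(samples, labels)
    n_genes = len(samples[0])
    return [1 if score(i, k, samples, labels, means) >= thr else 0
            for i in range(n_genes)]


# Hypothetical normalized data: gene 0 separates the classes, gene 1 does not.
samples = [[0.9, 0.5], [0.8, 0.1], [0.1, 0.1], [0.2, 0.5]]
labels = [1, 1, 0, 0]
print(selection_vector(1, samples, labels, thr=0.8))  # gene 0 kept, gene 1 removed
```

Gene 1 has the same mean in both classes, so no sample is strictly closer to its own class center and the gene is removed for every class.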

3.3.2. Classification pattern learning phase

Since genetic algorithms [12] (abbreviated GAs) perform well in clustering [13], we use a GA in the classification pattern learning phase of the gene selection process to find the meaningful classification patterns for each class. Figure 5 shows the flowchart of the classification pattern learning phase, which consists of six procedures: the classification pattern generation procedure, the evaluation procedure, the selection procedure, the crossover procedure, the mutation procedure, and the survival procedure.

The classification patterns are generated in the first procedure. For class $k$, the classification pattern $ID$, which is the individual of the GA, is a binary string of length $\alpha_k$. The value $\alpha_k$ is the number of genes in $S_r^k$, the attribute reduced training dataset obtained with the gene selection vector $T^k$ of class $k$. Figure 6 shows the structure of $ID$.

Each element of $ID$ indicates whether the corresponding gene is selected: a value of “1” means that the gene is selected; otherwise, the gene is discarded. In the evaluation procedure, the classification patterns are evaluated via the fitness function, and the k-nearest neighbor classifier [14] (abbreviated KNN) is used for the evaluation. We propose a fitness function that combines the KNN fitness function and the expected value fitness function, both defined below, to evaluate the performance of the classification pattern $ID$:

$$\mathrm{fitness\_c}(ID, k, S_r^{\prime k}) = \mathrm{fitness\_e}(ID, k, S_r^{\prime k}) \times \mathrm{fitness\_k}(ID, S_r^{\prime k}) \tag{11}$$

According to the content of $ID$, the evaluation procedure selects genes from the attribute reduced training dataset $S_r^k$ to form the reduced dataset $S_r^{\prime k}$. The fitness function $\mathrm{fitness\_e}(ID, k, S_r^{\prime k})$ indicates the ability of $ID$ to identify the samples associated with the class label $k$, and it is defined as follows:

$$\mathrm{fitness\_e}(ID, k, S_r^{\prime k}) = prob(ID, k) \times CR(k, S_r^{\prime k}) + (1 - prob(ID, k)) \times CE(k, S_r^{\prime k}) \tag{12}$$

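The function $prob(ID, k)$ returns the probability of accuracy with each sample of the subset $S_{r,k}^{\prime k}$ of $S_r^{\prime k}$ in class $k$, and it is defined as follows:

$$prob(ID, k) = \frac{\left( \sum_{\forall P_z \in S_{r,k}^{\prime k}} g(P_z, k) \right)}{|S_{r,k}^{\prime k}|} \tag{13}$$

and $g(P_z, k)$ is defined as follows:

$$g(P_z, k) = \begin{cases} 1, & \text{if } \lVert P_z - \mu^k \rVert < \lVert P_z - \mu^l \rVert,\ \forall\, l \in \{1, 2, \ldots, c\},\ l \ne k \\ 0, & \text{otherwise} \end{cases} \tag{14}$$

where $P_z$ is the sample that belongs to subset $S_{r,k}^{\prime k}$. The center vector $\mu^y$ denoted in the following is the average value of all genes for the class $y$ in the reduced dataset $S_r^{\prime k}$.

$$\mu^y = (\mu_1^y, \mu_2^y, \ldots, \mu_{\beta_k}^y), \quad 1 \le y \le c \tag{15}$$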

[Flowchart. Input: attribute reduced training dataset $S_r^k$. Steps: generate the corresponding classification patterns of class k; evaluate the classification patterns; find the maximum fitness value fitness_max and calculate the average fitness value fitness_avg of the individuals; check the termination condition (Yes: end of the classification pattern learning phase; No: selection, crossover, mutation, calculate the fitness of the offspring, update the global best, survival).]

Figure 5. Flowchart of the classification pattern learning phase of the proposed algorithm.

ID: 1 0 … 0 1 ($\alpha_k$ elements)

Figure 6. An example of the classification pattern ID for class k.

where $\beta_k$ is the number of genes in reduced dataset $S_r^{\prime k}$. The values of $CR(k, S_r^{\prime k})$ and $CE(k, S_r^{\prime k})$ are denoted as

$$CR(k, S_r^{\prime k}) = \left( \sum_{\forall P_z \in S_{r,k}^{\prime k}} g(P_z, k) \right) \tag{16}$$

$$CE(k, S_r^{\prime k}) = |S_{r,k}^{\prime k}| - \left( \sum_{\forall P_z \in S_{r,k}^{\prime k}} g(P_z, k) \right) \tag{17}$$

The KNN fitness function, $\mathrm{fitness\_k}(ID, S_r^{\prime k})$, is denoted as follows:

$$\mathrm{fitness\_k}(ID, S_r^{\prime k}) = KNN(ID, S_r^{\prime k}) \tag{18}$$

The $KNN(ID, S_r^{\prime k})$ function returns the accuracy that is obtained by using the KNN to classify the reduced dataset $S_r^{\prime k}$.
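To make the expected value fitness of Eqs. (12)-(17) concrete, the following Python sketch evaluates it for one classification pattern. It is only an illustration under stated assumptions: the distance in Eq. (14) is taken to be Euclidean, the pattern is assumed to select at least one gene, and the function names and toy data are hypothetical. The KNN factor of Eq. (11) is omitted.

```python
import math


def class_centers(samples, labels):
    """Center vector of every class in the reduced dataset (Eq. 15)."""
    centers = {}
    for y in set(labels):
        members = [s for s, lab in zip(samples, labels) if lab == y]
        centers[y] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def g(p, k, centers):
    """Eq. (14): 1 if p is strictly closest to the center of class k."""
    d_own = math.dist(p, centers[k])  # assumed Euclidean distance
    others = (math.dist(p, centers[l]) for l in centers if l != k)
    return 1 if all(d_own < d for d in others) else 0


def fitness_e(pattern, k, samples, labels):
    """Expected value fitness of Eq. (12) for a binary pattern."""
    # Keep only the genes whose bit in the pattern is 1.
    reduced = [[x for x, bit in zip(s, pattern) if bit] for s in samples]
    centers = class_centers(reduced, labels)
    members = [s for s, lab in zip(reduced, labels) if lab == k]
    cr = sum(g(p, k, centers) for p in members)   # Eq. (16)
    ce = len(members) - cr                        # Eq. (17)
    prob = cr / len(members)                      # Eq. (13)
    return prob * cr + (1 - prob) * ce            # Eq. (12)


# Hypothetical attribute reduced data: gene 0 is informative, gene 1 is noisy.
samples = [[0.9, 0.5], [0.8, 0.1], [0.1, 0.2], [0.2, 0.2]]
labels = [1, 1, 0, 0]
print(fitness_e([1, 0], 1, samples, labels))  # pattern using gene 0 only
print(fitness_e([0, 1], 1, samples, labels))  # pattern using gene 1 only
```

On this toy data the pattern that keeps only the informative gene scores 2.0, while the pattern that keeps only the noisy gene scores 1.0.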

Once the fitness values of all classification patterns have been obtained, the maximum fitness value fitness_max and the average fitness value fitness_avg are found and used to calculate the crossover rate and the mutation rate.

In the selection procedure, the competing method is adopted to select the classification patterns that generate new offspring in the following two procedures, the crossover procedure and the mutation procedure.

In the crossover procedure, each pair of parents performs crossover with the crossover rate proposed in [15], which is defined as follows:

$$\text{crossover rate} = \frac{fitness_{\max} - fitness'}{fitness_{\max} - fitness_{avg}} \tag{19}$$

where $fitness'$ is the larger of the fitness values of the two selected individuals, $A$ and $B$, and is defined as follows:

$$fitness' = \max(fitness_A, fitness_B) \tag{20}$$

In this paper, the one-point crossover method is adopted in the crossover procedure.

In the mutation procedure, each classification pattern of the offspring mutates with a probability. The mutation rate proposed in [15] is adopted and is defined as follows:

$$\text{mutation rate} = \frac{fitness_{\max} - fitness_i}{fitness_{\max} - fitness_{avg}} \tag{21}$$

where $fitness_i$ is the fitness value of the $i$-th classification pattern of the offspring. In this paper, two-point mutation is adopted: two elements of a classification pattern are randomly selected and their values are inverted, i.e., a “1” changes to “0” and vice versa. When enough offspring have been collected, the new classification patterns are evaluated. Then, the global best is updated according to the fitness values of all classification patterns. In the final procedure, the enhanced roulette wheel method is used to carry the population into the next generation. The classification patterns are sorted and ranked according to their fitness values, and the survival rate of each classification pattern is then calculated. The higher the fitness value, the higher the probability that the classification pattern survives into the next generation.
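The crossover and mutation procedures can be sketched in Python as follows. This is an illustrative sketch only. Two details are assumptions that the text does not specify: a rate above 1 (which the ratio produces for a below-average fitness) is clamped to 1, and a population in which the maximum and average fitness coincide uses rate 1. All function names and the example values are hypothetical.

```python
import random


def adaptive_rate(fitness, fitness_max, fitness_avg):
    """Common form of Eqs. (19) and (21)."""
    if fitness_max == fitness_avg:
        # Assumption: undefined ratio when all individuals are equally fit.
        return 1.0
    # Assumption: clamp to 1 so that the result is a valid probability.
    return min(1.0, (fitness_max - fitness) / (fitness_max - fitness_avg))


def crossover_rate(fitness_a, fitness_b, fitness_max, fitness_avg):
    """Eq. (19), with fitness' = max(fitness_A, fitness_B) from Eq. (20)."""
    return adaptive_rate(max(fitness_a, fitness_b), fitness_max, fitness_avg)


def mutation_rate(fitness_i, fitness_max, fitness_avg):
    """Eq. (21)."""
    return adaptive_rate(fitness_i, fitness_max, fitness_avg)


def one_point_crossover(a, b, rng):
    """Swap the tails of two patterns after a random cut point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def two_point_mutation(pattern, rng):
    """Invert two randomly chosen, distinct elements of a pattern."""
    child = list(pattern)
    for pos in rng.sample(range(len(child)), 2):
        child[pos] = 1 - child[pos]
    return child


rng = random.Random(0)
parent_a, parent_b = [1, 0, 1, 1, 0], [0, 1, 0, 0, 1]
print(crossover_rate(8, 5, fitness_max=10, fitness_avg=6))
child_a, child_b = one_point_crossover(parent_a, parent_b, rng)
print(child_a, child_b)
print(two_point_mutation(child_a, rng))
```

With these rates, patterns whose fitness is close to the current maximum are rarely disrupted, while weaker patterns are recombined and mutated more often.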

3.4. Classification process

After the informative genes have been selected, the matrix T is used to remove the unrelated genes from the training dataset and the testing dataset. The classification patterns are then applied to the reduced training and testing datasets to select the informative genes, and the classification is performed. The SVM learns a hyperplane from the training dataset and uses it to classify the testing dataset.

3.5. Fusion and verification process

The proposed algorithm adopts a decision rule that fuses all the classification results and decides which class each testing sample belongs to. If all the results show that the testing sample belongs to class A, the decision rule assigns the testing sample to class A. If no class result claims the testing sample, or if more than one class result claims it, the testing sample is sent to KNN to decide which class it belongs to.
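The decision rule can be sketched in a few lines of Python. The sketch reads the rule as "accept a class only when exactly one class-specific classifier claims the sample, otherwise fall back to KNN". This reading, the function name `fuse`, and the stand-in KNN callable are assumptions made for the example.

```python
def fuse(claims, sample, knn_classify):
    """Fuse per-class results into one predicted class.

    claims maps each class label to True if the classifier trained with
    that class's classification pattern claims the sample.
    """
    claimed = [label for label, accepted in claims.items() if accepted]
    if len(claimed) == 1:
        # The results agree on a single class.
        return claimed[0]
    # No class or several classes claim the sample: let KNN decide.
    return knn_classify(sample)


def knn_stub(sample):
    """Hypothetical stand-in for a trained KNN classifier."""
    return "ALL"


print(fuse({"AML": True, "ALL": False}, [0.2, 0.7], knn_stub))  # AML
print(fuse({"AML": True, "ALL": True}, [0.2, 0.7], knn_stub))   # ALL, via KNN
```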

4. Experimental Results

This section describes the performance of the proposed algorithm. The program is run on a PC with an Intel Core2 2.33 GHz processor and 2 GB of RAM. The proposed algorithm is implemented in Java (J2DK1.6.0_10). The LIBSVM library [16] (http://www.csie.ntu.edu.tw/~cjlin/libsvm) (libsvm 2.9) and the KNN library of Java-ML (http://java-ml.sourceforge.net/) (javaml 0.1.4) are used as our classifiers. Nine datasets are used in these experiments; they are listed in Table 1.

Table 1. Summary of the nine cancer classification datasets

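| Dataset name | No. of classes | No. of genes | No. of samples |
| --- | --- | --- | --- |
| AMLALL [17] | 2 | 7129 | 72 |
| ALLMLLAML [18] | 3 | 12582 | 72 |
| CNS [19] | 2 | 7129 | 60 |
| Ovarian [20] | 2 | 15154 | 253 |
| Colon [21] | 2 | 2000 | 62 |
| Lung [22] | 2 | 12533 | 181 |
| Prostate [23] | 2 | 12600 | 102 |
| Lymphoma [24] | 2 | 7129 | 77 |
| Subtypes [25] | 6 | 12625 | 248 |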

Table 1 gives the dataset name, number of classes, number of genes and number of samples for each of the nine datasets. The parameters used in the proposed algorithm are as follows: the population size is 100, the termination condition is that the global best remains unchanged for 100 generations, and the value k of KNN is set to 9. For the SVM, we use a polynomial kernel $K(x, y) = (\gamma x^{T} y + r)^{p}$, where x and y are samples with gene expression values and p, γ, r are kernel parameters. The cost C (the penalty parameter of SVMs) is set to 400 and p is set to 3. The kernel parameters γ and r are set to the default values given in [16]: γ = 1/(number of genes) and r = 0.
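To make the kernel concrete, the following sketch evaluates the polynomial kernel for two short vectors with the stated settings (p = 3, r = 0, γ = 1/number of genes). The function name `polynomial_kernel` and the two-gene sample vectors are hypothetical; in the experiments the kernel is computed inside LIBSVM.

```python
def polynomial_kernel(x, y, gamma, r=0.0, p=3):
    """Return (gamma * <x, y> + r) ** p for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + r) ** p

# Hypothetical samples with two genes each.
x = [1.0, 2.0]
y = [3.0, 4.0]
gamma = 1.0 / len(x)  # default: 1 / number of genes

value = polynomial_kernel(x, y, gamma)  # (0.5 * 11) ** 3 = 166.375
```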

4.1. Experiments of the disease oriented feature selection results

In this experiment, we use the AMLALL and Colon datasets to check whether the genes selected by DOFA are related to the diseases. For the AMLALL dataset, DOFA selects 21 genes. We search NCBI PubMed (http://www.ncbi.nlm.nih.gov/pubmed) for literature on each of these genes and extract the publications that relate them to the diseases. Table 2 shows the search results relating the selected genes to AML or ALL; for each selected gene it lists the gene ID, gene name, relationship and reference. Based on the literature, 12 of the 21 genes are directly related to AML or ALL, 8 are related to cancer, and 1 is unrelated. For the Colon dataset, 11 of the 25 genes are directly related to colorectal cancer, 9 are related to cancer, and 5 are unrelated, as listed in Table 3. These two results indicate that the genes selected by the proposed algorithm do correspond to the diseases.

4.2. Comparisons of the classification results

To verify the discriminability of the genes selected by DOFA, we perform three experiments. In the first experiment, we compare the gene selection method of DOFA with two well-known gene selection methods, BW ratio and S2N. In the second and third experiments, we compare DOFA with existing methods for the gene expression classification problem. The first and second experiments use LOOCV, and the third experiment uses 50% VS 50% (OVO) to resample the datasets. The OVO method randomly separates the original dataset into two disjoint datasets of equal size (a training dataset and a testing dataset, 50% each), and repeats this 100 times.
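The 50% VS 50% resampling can be sketched as repeated random halving of the sample indices. This is an illustrative sketch only; the generator name `half_splits`, the fixed seed and the handling of an odd sample count (the extra sample goes to the testing half) are assumptions.

```python
import random

def half_splits(n_samples, n_repeats=100, seed=0):
    """Yield (train_indices, test_indices) pairs, one per repetition."""
    rng = random.Random(seed)
    half = n_samples // 2
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The two halves are disjoint and together cover every sample.
        yield indices[:half], indices[half:]

# Example with 62 samples, the size of the Colon dataset in Table 1.
splits = list(half_splits(62))
```

The reported accuracy for a dataset would then be the mean of the accuracies measured on the testing half over the 100 repetitions.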

To compare the influence of the different gene selection methods on the proposed algorithm, we use the same classification tools (SVM+KNN) to classify the data. In this experiment, leave-one-out cross validation (LOOCV) is adopted to analyze the data.
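For readers unfamiliar with LOOCV, the loop below holds out each sample once, trains on the rest and counts correct predictions. It is a generic sketch: `loocv_accuracy` and the toy one-nearest-neighbour classifier are hypothetical stand-ins, not the SVM+KNN pipeline used in the experiments.

```python
def loocv_accuracy(samples, labels, train_and_predict):
    """Leave-one-out accuracy for any train_and_predict(train_x, train_y, x)."""
    correct = 0
    for i in range(len(samples)):
        # Hold out sample i and train on all the others.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if train_and_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

def nearest_neighbour(train_x, train_y, x):
    """Toy classifier: label of the closest training sample."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    best = min(range(len(train_x)), key=lambda j: sq_dist(train_x[j], x))
    return train_y[best]

# Hypothetical one-gene dataset with two well-separated classes.
toy_x = [[0.0], [0.1], [1.0], [1.1]]
toy_y = ["a", "a", "b", "b"]
toy_accuracy = loocv_accuracy(toy_x, toy_y, nearest_neighbour)
```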

Table 4 shows comparisons among the classification results of the different gene selection methods on the nine datasets: AMLALL, Colon, Lung, Ovarian, CNS, Lymphoma, Prostate, ALLMLLAML, and Subtypes.

Table 2. The relationship table of the AMLALL dataset

| Gene ID | Gene name | Relation | Reference |
| --- | --- | --- | --- |
| AB006782 | LGALS9 | Related | Okudaira, et al. [26] |
| D90279 | COL5A1 | Related | Endo, et al. [27] |
| L05424 | CD44 | Related | Krause, et al. [28] |
| L19063 | ATF1 | Related | Müller-Tidow, et al. [29] |
| L19493 | FMR1 | Related | Au, et al. [30] |
| L31529 | A1B | Cancer related | Savas and Younghusband [31] |
| M14199 | RPSA | Related | Wei, et al. [32] |
| M28439 | KRT16 | Cancer related | Hickinson, et al. [33] |
| M32373 | ARSB | Cancer related | Ghosh [34] |
| M74542 | ALDH3A1 | Cancer related | Nielsen, et al. [35] |
| U20230 | GUCY2C | Cancer related | Mejia, et al. [36] |
| U35451 | CBX1 | Cancer related | Tuskan, et al. [37] |
| U44378 | SMAD4 | Related | Wierenga, et al. [38] |
| U76421 | ADARB1 | Cancer related | Paz, et al. [39] |
| X70340 | TGFA | Cancer related | Das, et al. [40] |
| L10838 | SRSF3 | Related | Bavelloni, et al. [41] |
| L13744 | MLLT3 | Related | Hutter, et al. [42] |
| M16474 | BCHE | Related | Stephenson, et al. [43] |
| M22403 | CD42B | Related | Shikata, et al. [44] |
| U37408 | CTBP1 | Related | Senyuk, et al. [45] |
| U49973 | TIGD1 | Unrelated | N/A |

Table 3. The relationship table of the Colon dataset

| Gene ID | Gene name | Relation | Reference |
| --- | --- | --- | --- |
| J05032 | DARS | Related | Jimenez, et al. [46] |
| M35252 | CO-029 | Related | Le Naour, et al. [47] |
| L42611 | KRT6C | Cancer related | Carey, et al. [48] |
| X70040 | MST1R | Related | Yang, et al. [49] |
| X07384 | GLI1 | Related | Fu, et al. [50] |
| H82631 | N/A | Unrelated | N/A |
| H23975 | N/A | Unrelated | N/A |
| U26401 | GALK1 | Related | Pang, et al. [51] |
| T47562 | KDELR1 | Related | Trifan, et al. [52] |
| H40137 | RPL23A | Related | Lai and Xu [53] |
| R60877 | ZEB1 | Related | Serova, et al. [54] |
| R40717 | CSNK1E | Cancer related | Zhu, et al. [55] |
| H01418 | SOS2 | Cancer related | Kurada, et al. [56] |
| R16255 | PPP3CB | Cancer related | Hundhausen, et al. [57] |
| X05276 | TPM4 | Cancer related | Kopantzev, et al. [58] |
| T70063 | N/A | Unrelated | N/A |
| H29546 | N/A | Unrelated | N/A |
| T83368 | CD46 | Related | Ravindranath and Shuler [59] |
| H01346 | EIF4G1 | Cancer related | Peffley, et al. [60] |
| T97199 | ITGB4 | Related | Ortega, et al. [61] |
| X52151 | ARSA | Cancer related | Kim, et al. [62] |
| M23115 | ATP2A2 | Related | Korosec, et al. [63] |
| H70250 | KIAA1712 | Unrelated | N/A |
| R83923 | CLPP | Cancer related | Slominska, et al. [64] |
| M76558 | CACNA1D | Cancer related | Washington, et al. [65] |

Table 4. Comparisons among the classification results using different gene selection methods for nine datasets, AMLALL, Colon, Lung, Ovarian, CNS, Lymphoma, Prostate, ALLMLLAML, and Subtypes. LOOCV is used for resampling.

| Method \ Dataset | AMLALL | Colon | Lung | Ovarian | CNS | Lymphoma | Prostate | ALLMLLAML | Subtypes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DOFA | 100.00% | 100.00% | 91.66% | 100.00% | 96.77% | 100.00% | 97.05% | 100.00% | 99.19% |
| BW | 97.22% | 93.06% | 65.00% | 99.21% | 82.25% | 99.45% | 95.10% | 94.81% | 98.39% |
| S2N | 97.22% | N/A | 63.33% | 99.21% | 85.48% | 100.00% | 95.10% | 94.81% | N/A |

BW: ratio of between groups to within groups sum of squares, S2N: ratio of Signal to Noise. The best results in different datasets are emphasized as bold.

As Table 4 shows, the gene selection process of DOFA performs the best among the 3 gene selection methods on the nine datasets. It wins on 8 of the 9 datasets; the remaining dataset is a draw, on which the accuracy reaches 100.00%. The classification pattern learning phase of the gene selection process of DOFA performs well because it picks up the genes that are informative for the disease. In addition, there is a remarkable difference between the gene selection process of the proposed algorithm and the other gene selection methods. The proposed algorithm does not combine the important classification patterns of all classes into one classification; instead, it uses the important classification patterns of each class to perform a separate classification and then fuses the classification results. Since a gene that is important to one class may be a noise gene to another class, combining the important classification patterns of all classes to classify the data would lead to misclassification. The experimental results also show that the proposed gene selection process obtains the best classification accuracies among all competing gene selection methods.
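For reference, the two baseline gene scores can be computed per gene as in the sketch below. The formulas are the commonly used definitions of these scores (S2N as the difference of class means over the sum of class standard deviations, BW as the between-group over the within-group sum of squares) and are assumed here, since this section does not restate them; the function names, the use of the sample standard deviation and the example values are likewise assumptions.

```python
from statistics import mean, stdev

def s2n(values, labels, pos, neg):
    """Signal-to-noise score of one gene for a two-class problem."""
    a = [v for v, lab in zip(values, labels) if lab == pos]
    b = [v for v, lab in zip(values, labels) if lab == neg]
    # Assumption: sample standard deviation; each class needs >= 2 samples.
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

def bw_ratio(values, labels):
    """Between-group to within-group sum of squares for one gene."""
    overall = mean(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, lab in zip(values, labels) if lab == cls]
        centre = mean(group)
        between += len(group) * (centre - overall) ** 2
        within += sum((v - centre) ** 2 for v in group)
    # Undefined when every class is constant (within == 0).
    return between / within

# Hypothetical expression values of one gene in six samples.
expr = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
cls = ["a", "a", "a", "b", "b", "b"]
gene_bw = bw_ratio(expr, cls)        # 54 / 4
gene_s2n = s2n(expr, cls, "a", "b")  # (2 - 8) / (1 + 1)
```

Both scores rank each gene once over all samples, which differs from the per-class classification patterns that DOFA learns.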

In the second experiment, we compare the performance of the proposed algorithm with that of twenty-one existing methods (including the algorithm proposed by Chen [5], PSO+ANN [5], C4.5 [6], Single Neuro Fuzzy [7], Neuro Fuzzy Ensemble [7], Top Scoring Pair [8], k-Top Scoring Pair [8], decision trees [8], Naïve Bayes [4,8], k nearest neighbor [8], Support Vector Machine [8], prediction analysis of microarrays [8], majority vote [4], maximum vote [4], minimum vote [4], average vote [4], product vote [4], behaviour knowledge space [4], decision templates [4] and Oracle [4]). To allow comparison with these methods, we use the same datasets and the same resampling method (LOOCV) as they do. The datasets used in this experiment are AMLALL, Colon, Lung, and Ovarian. Table 5 shows the classification results obtained by the proposed algorithm and the existing methods. In Table 5, “N/A” indicates that a method has no classification result for that dataset, and numbers in bold correspond to the best classification for each dataset. As shown in Table 5, the proposed algorithm performs the best among all 22 methods on three datasets. Although the proposed algorithm obtains an accuracy of 96.77% on the Colon dataset, which is lower than the result of NFE (100%), only 2 of the 62 samples are misclassified. Although the proposed algorithm uses SVM as its classifier, as Tan, et al. [8] do, it obtains higher accuracy because the proposed gene selection method is better at finding the informative genes.

Table 5. Comparisons among the classification results obtained by the proposed algorithm and twenty one existing methods for AMLALL, Colon, Lung, and Ovarian datasets using LOOCV

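| Method \ Dataset | AMLALL | Colon | Lung | Ovarian |
| --- | --- | --- | --- | --- |
| DOFA | 100.00% | 96.77% | 100.00% | 100.00% |
| Chen [5] | 98.60% | 95.20% | 100.00% | 99.60% |
| PSO+ANN [5] | 86.10% | 88.70% | 98.30% | 97.00% |
| C4.5 [6] | 87.50% | 93.55% | 98.34% | N/A |
| Single NF [7] | 87.50% | 93.55% | N/A | N/A |
| NFE [7] | 95.85% | 100.00% | N/A | N/A |
| TSP [8] | 93.80% | 91.10% | 98.30% | N/A |
| kTSP [8] | 95.83% | 90.30% | 98.90% | N/A |
| DT1 [8] | 73.61% | 80.65% | 96.13% | N/A |
| NB [8] | 100.00% | 58.06% | 97.79% | N/A |
| KNN [8] | 84.72% | 74.19% | 98.34% | N/A |
| SVM [8] | 98.61% | 82.26% | 99.45% | N/A |
| PAM [8] | 97.22% | 85.48% | 99.45% | N/A |
| MAJ [8] | N/A | N/A | 99.20% | 97.10% |
| MAX [8] | N/A | N/A | 99.40% | 97.40% |
| MIN [8] | N/A | N/A | 96.70% | 95.60% |
| AVG [8] | N/A | N/A | 99.40% | 97.50% |
| PRO [8] | N/A | N/A | 94.60% | 94.00% |
| NB [8] | N/A | N/A | 95.40% | 96.40% |
| BKS [8] | N/A | N/A | 99.20% | 97.10% |
| DT2 [8] | N/A | N/A | 99.20% | 98.00% |
| ORA [8] | N/A | N/A | 99.80% | 98.20% |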

PSO: Particle Swarm Optimization, ANN: Artificial Neural Network, Single NF: Single Neuro Fuzzy, NFE: Neuro Fuzzy Ensemble, TSP: Top Scoring Pair, kTSP: k Top Scoring Pair, DT1: decision trees, NB: Naïve Bayes, KNN: k nearest neighbor, SVM: Support Vector Machine, PAM: prediction analysis of microarrays, MAJ: majority vote, MAX: maximum vote, MIN: minimum vote, AVG: average vote, PRO: product vote, BKS: behaviour knowledge space, DT2: decision templates, ORA: Oracle. “N/A” represents that this method doesn’t have the classification result for this dataset. The numbers in bold correspond to the best classifications for each dataset.

In the third experiment, we compare the accuracy obtained by the proposed algorithm with that of the algorithm proposed by Xiong and Chen [9]. For a fair comparison, we follow their experimental protocol: the dataset is randomly separated into two equal-sized datasets, a training dataset and a testing dataset (50% VS 50%), and this is repeated 100 times. Table 6 shows the results obtained by the proposed algorithm and the results listed in [9]. In Table 6, the number in bold is the best classification result for each dataset. As Table 6 shows, the proposed algorithm performs the best among all methods on 8 of the 9 datasets.

Table 6. Comparisons among the classification results obtained by the proposed algorithm and the other five methods for AMLALL, Colon, Lung, Ovarian, CNS, Lymphoma, Prostate, ALLMLLAML, and Subtypes datasets using 50% VS 50% (OVO) to resample the datasets

| Method \ Dataset | AMLALL | Colon | Lung | Ovarian | CNS | Lymphoma | Prostate | ALLMLLAML | Subtypes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DOFA | 100.00% | 88.54% | 100.00% | 100.00% | 83.55% | 99.87% | 96.49% | 98.19% | 98.63% |
| KNN | 98.68% | 85.97% | 98.79% | 99.26% | 80.48% | 97.95% | 92.59% | 93.83% | 97.43% |
| ULDA | 96.92% | 83.16% | 99.19% | 99.98% | 87.74% | 97.95% | 94.78% | 97.86% | 98.27% |
| DLDA | 97.05% | 87.35% | 99.53% | 98.42% | 77.58% | 93.77% | 93.27% | 94.81% | 97.55% |
| SVM | 97.30% | 88.16% | 99.47% | 99.83% | 86.65% | 98.97% | 95.14% | 97.17% | 97.40% |
| KerNN | 97.30% | 88.42% | 99.69% | 99.99% | 84.68% | 98.10% | 95.10% | 96.79% | 97.58% |

KNN: k nearest neighbor, ULDA: uncorrelated linear discriminant analysis, DLDA: diagonal linear discriminant analysis, SVM: support vector machine, KerNN: kernel optimization based KNN. Each experiment was carried out for 100 runs. The best results in different datasets are emphasized as bold.

4.3. Discussions

The classification result obtained by the proposed algorithm for the CNS dataset is not the best among all methods, because the gene expression values of the two classes in this dataset overlap and are difficult to separate. Figure 7 shows the expression values of the 4390th, 3635th and 3189th genes, which are selected by the proposed algorithm for the CNS dataset.

As shown in Figure 7, one can easily observe that the expression values of these genes overlap. When the expression values of a dataset overlap too much, the class centers also overlap, and the proposed algorithm cannot provide very useful information. However, the genes selected by the proposed algorithm still have discriminability: on this dataset the proposed algorithm obtains neither the best classification result nor the worst one among all competing methods. For the other 8 datasets, the performance of the proposed algorithm is still better than the results in [9] that use SVM as the classifier, because the proposed gene selection method is better at selecting the informative genes.

Figure 7. The overlapping of 4930-th, 3635-th, and 3189-th genes in CNS dataset.

5. Conclusions

We have proposed a novel disease oriented feature selection algorithm (DOFA) that consists of four processes for picking up the genes related to the diseases. A normalization process is used to remove the differences between the scales of different genes. An efficient selection process consisting of two phases, the gene reduction phase and the classification pattern learning phase, is used to find the informative genes. The gene reduction phase removes the unrelated genes. The genetic algorithm is used in the classification pattern learning phase to find the informative gene subsets for each disease. Then, for each class, these informative gene subsets are used to classify the testing dataset separately. SVM and KNN are used as the classifiers to perform the classification. Finally, the separate classification results are fused into one final classification result.

Of the 21 genes selected by the proposed algorithm for the AMLALL dataset, 20 are either directly related to AML or ALL or related to cancer. Of the 25 genes selected for the Colon dataset, 20 are either directly related to colorectal cancer or related to cancer. Compared with two well-known gene selection methods, BW ratio and S2N, DOFA obtains 8 wins in 9 datasets. Compared with 21 existing methods commonly used for the cancer classification problem, the proposed algorithm obtains 3 wins in 4 datasets using LOOCV. Compared with the five existing methods in [9], the proposed algorithm obtains 8 wins in 9 datasets using the OVO resampling method. In addition, when compared with the other gene selection methods under the same classification approach, the gene selection process of the proposed algorithm has the best performance among all gene selection methods. These experimental results indicate that the proposed algorithm has the best performance among all competing methods. The genes selected by the proposed algorithm are also related to the diseases.

Acknowledgments

We would like to thank Mr. Min Thai Wu for useful discussions on some ideas in the experiments.

References

1. Y. Saeys, I. Inza and P. Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics 23 (2007), pp. 2507-2517.

2. I. Guyon and A. Elisseeff, An introduction to variable and feature selection. Journal of Machine Learning Research 3 (2003), pp. 1157–1182.

3. L. B. Romdhane, H. Shili and B. Ayeb, P3M —possibilistic multi-step maxmin and merging algorithm with application to gene expression data mining, International Journal on Artificial Intelligence Tools 18 (2009) pp. 545-567 DOI: 10.1142/S0218213009000263.

4. J. H. Hong and S. B. Cho, The classification of cancer based on DNA microarray data that uses diverse ensemble genetic programming. Artificial Intelligence in Medicine 36 (2006) pp. 43–58.

5. Y. H. Chen, and Y. Zhao, A novel ensemble of classifiers for microarray data classification. Applied Soft Computing 8 (2008) pp. 1664-1669.

6. L. Yu and H. Liu, Redundancy based feature selection for microarray data. Proceeding of ACM Special Interest Group Discovery and Data Mining ’04 2004 (2004) Research Track Poster.

7. Z. Y. Wang, V. Palade and Y. Xu, Neuro-fuzzy ensemble approach for microarray cancer gene expression data analysis. Proceedings of the 2006 International Symposium on Evolving Fuzzy Systems (IEEE)(2006) pp. 241–246.

8. A. C. Tan, D.Q. Naiman, L. Xu, R. L. Winslow and D. Geman, Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 21 (2005) pp. 3896–3904.

9. H. Xiong, and X. Chen, Kernel-based distance metric learning for microarray data classification. BMC Bioinformatics 7 (2006) 299.

10. C. Cortes and V. Vapnik, Support-vector networks. Machine Learning 20 (1995) pp. 273-297.

11. J. S. Wu, MARS: A microarray attributes reduction scheme for microarray cancer classification problem, CIIS lab technique report 2010 (2010), Department of Computer Science and Engineering, National Sun Yat-Sen University, accessed at http://edith.cse.nsysu.edu.tw/wu/MARS.pdf .

12. J. H. Holland, Adaptation in natural and artificial systems. University of Michigan Press (1975).

13. Y. Liu, X. R. Pu, Y. D. Shen, Z. Yi and X. F. Liao, Clustering using an improved hybrid genetic algorithm. International Journal on Artificial Intelligence Tools 16 (2007) pp. 919-934 DOI: 10.1142/S021821300700362X.

14. J. H. Friedman, Flexible metric nearest neighbor classification. In Technical report Dept. of Statistics (Stanford University 1994).

15. Y. K. Kwok and I. Ahmad, Efficient scheduling of arbitrary task graphs to multiprocessors using a parallel genetic algorithm. Journal of Parallel and Distributed Computing 47 (1997) pp. 58-77.

16. C. C. Chang and C. J. Lin, LIBSVM : a library for support vector machines. (2001) Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm .

17. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield and E. S. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286 (1999) pp. 531-537.

18. S. A. Armstrong, J. E. Staunton, L. B. Silverman, R. Pieters, M. L. den Boer, M. D. Minden, S. E. Sallan, E. S. Lander, T. R. Golub and S. J. Korsmeyer, MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 30 (2001) pp. 41-47.

19. S. L. Pomeroy, P. Tamayo, M. Gaasenbeek, L. M. Sturla, M. Angelo, M. E. McLaughlin, J. Y. H. Kim, L. C. Goumnerova, P. M. Black, C. Lau, J. C. Allen, D. Zagzag, J. M. Olson, T. Curran, C. Wetmore, J. A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D. N. Louis, J. P. Mesirov, E. S. Lander and T. R. Golub, Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature 415 (2002) pp. 436-442.

20. E. F. Petricoin, A. M. Ardekani, B. A. Hitt, P. J. Levine, V. A. Fusaro, S. M. Steinberg, G. B. Mills, C. Simone, D. A. Fishman, E. C. Kohn and L. A. Liotta, Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359 (2002) pp. 572-577.

21. U. Alon, N. Barkai, D. A. Notterman, J. Gish, S. Ybarra, D. Mack and A. J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences USA, 96 (1999) pp. 6745-6750.

22. G. J. Gordon, R. V. Jensen, L. L. Hsiao, S. R. Gullans, J. E. Blumenstock, S. Ramaswamy, W. G. Richards, D. J. Sugarbaker and R. Bueno, Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research 62 (2002) pp. 4963-4967.

23. D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D'Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kantoff, T. R. Golub and W. R. Sellers, Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 (2002) pp. 203-209.

24. M. A. Shipp, K. N. Ross, P. Tamayo, A. P. Weng, J. L. Kutok, R. C. T. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G. S. Pinkus, T. S. Ray, M. A. Koval, K. W. Last, A. Norton, T. A. Lister, J. Mesirov, D. S. Neuberg, E. S. Lander, J. C. Aster and T. R. Golub, Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8 (2002) pp. 68-74.

25. E. J. Yeoh, M. E. Ross, S. A. Shurtleff, W. K. Williams, D. Patel, R. Mahfouz, F. G. Behm, S. C. Raimondi, M. V. Relling, A. Patel, C. Cheng, D. Campana, D. Wilkins, X. Zhou, J. Li, H. Liu, C. H. Pui, W. E. Evans, C. Naeve, L. Wong and J. R. Downing, Classification, subtype discovery and prediction of outcome in pediatric lymphoblastic leukemia by gene expression profiling. Cancer Cell 1 (2002) pp. 133-143.

26. T. Okudaira, M. Hirashima, C. Ishikawa, S. Makishi, M. Tomita, T. Matsuda, H. Kawakami, N. Taira, K. Ohshiro, M. Masuda, N. Takasu and N. Mori, A modified version of galectin-9 suppresses cell growth and induces apoptosis of human T-cell leukemia virus type I-infected T-cell lines. International Journal of Cancer 120 (2007) pp. 2251-2261.

27. T. Endo, T. Imanishi, T. Gojobori and H. Inoko, Evolutionary significance of intra-genome duplications on human chromosomes. Gene 205 (1997) pp. 19-27.

28. D. S. Krause, T. R. Spitzer and C. P. Stowell, The concentration of CD44 is increased in hematopoietic stem cell grafts of patients with acute myeloid leukemia, plasma cell myeloma, and non-Hodgkin lymphoma. Archives of Pathology & Laboratory Medicine 134 (2010) pp. 1033-1038.

29. C. Müller-Tidow, H. U. Klein, A. Hascher, F. Isken, L. Tickenbrock, N. Thoennissen, S. Agrawal-Singh, P. Tschanter, C. Disselhoff, Y. Wang, A. Becker, C. Thiede, G. Ehninger, U. Zur Stadt, S. Koschmieder, M. Seidl, F. U. Müller, W. Schmitz, P. Schlenke, M. McClelland, W. E. Berdel, M. Dugas and H. Serve, Profiling of histone H3 lysine 9 trimethylation levels predicts transcription factor activity and survival in acute myeloid leukemia. Blood 116 (2010) pp. 3564-3571. PMID:20498303.

30. W. Y. Au, C. Man, A. Pang and Y. L. Kwong, Acute lymphoblastic leukemia in a patient with fragile X syndrome: cytogenetic and molecular features. Haematologica 88 (2003) ECR13.

31. S. Savas and H. B. Younghusband, dbCPCO: a database of genetic markers tested for their predictive and prognostic value in colorectal cancer. Human Mutation 31 (2010) pp. 901-907.

32. Q. Wei, Y. Li, L. Chen, L. Zhang, X. He, X. Fu, K. Ying, J. Huang, Q. Chen, Y. Xie and Y. Mao, Genes differentially expressed in responsive and refractory acute leukemia. Frontiers in Bioscience 11 (2006) pp. 977-982.

33. D. M. Hickinson, G. B. Marshall, G. J. Beran, M. Varella-Garcia, E. A. Mills, M. C. South, A. M. Cassidy, K. L. Acheson, G. McWalter, R. M. McCormack, P. A. Bunn, T. French, A. Graham, B. R. Holloway, F. R. Hirsch and G. Speake, Identification of biomarkers in human head and neck tumor cell lines that predict for in vitro sensitivity to gefitinib. Clinical and Translational Science 2 (2009) pp. 183-192.

34. D. Ghosh, Human sulfatases: a structural perspective to catalysis. Cellular and Molecular Life Sciences 64 (2007) pp. 2013-2022.

35. S. S. Nielsen, R. McKean-Cowdin, F. M. Farin, E. A. Holly, S. Preston-Martin and B. A. Mueller, Childhood brain tumors, residential insecticide exposure, and pesticide metabolism genes. Environmental Health Perspectives 118 (2010) pp. 144-149.

36. A. Mejia, S. Schulz, T. Hyslop, D. S. Weinberg and S. A. Waldman, GUCY2C reverse transcriptase PCR to stage pN0 colorectal cancer patients. Expert Review of Molecular Diagnostics 9 (2009) pp. 777-785.

37. R. G. Tuskan, S. Tsang, Z. Sun, J. Baer, E. Rozenblum, X. Wu, D. J. Munroe and K. M. Reilly, Real-time PCR analysis of candidate imprinted genes on mouse chromosome 11 shows balanced expression from the maternal and paternal chromosomes and strain-specific variation in expression levels. Epigenetics 3 (2008) pp. 43-50.

38. A. T. Wierenga, B. J. Eggen, W. Kruijer and E. Vellenga, Proteolytic degradation of Smad4 in extracts of AML blasts. Leukemia Research 26 (2002) pp. 1105-1111.

39. N. Paz, E. Y. Levanon, N. Amariglio, A. B. Heimberger, Z. Ram, S. Constantini, Z. S. Barbash, K. Adamsky, M. Safran, A. Hirschberg, M. Krupsky, I. Ben-Dov, S. Cazacu, T. Mikkelsen, C. Brodie, E. Eisenberg and G. Rechavi, Altered adenosine-to-inosine RNA editing in human cancer. Genome Research 17 (2007) pp. 1586-1595.

40. K. Das, P. Lorena, L. K. Ng, D. Lim, L. Shen, W. Y. Siow, M. Teh, J. Reichardt and M. Salto-Tellez, Differential expression of steroid 5{alpha}-reductase isozymes and association with disease severity and angiogenic genes predict their biological role in prostate cancer. Endocrine-Related Cancer 17 (2010) pp. 757-770.

41. A. Bavelloni, I. Faenza, G. Cioffi, M. Piazzi, D. Parisi, I. Matic, N. M. Maraldi and L. Cocco, Proteomic-based analysis of nuclear signaling: PLCbeta1 affects the expression of the splicing factor SRp20 in Friend erythroleukemia cells. Proteomics 6 (2006) pp. 5725-5734.

42. C. Hutter, A. Attarbaschi, S. Fischer, C. Meyer, M. Dworzak, M. König, R. Marschalek, G. Mann, O. A. Haas and E. R. Panzer-Grümayer, Acute monocytic leukaemia originating from MLL-MLLT3-positive pre-B cells. British Journal of Haematology 150 (2010) pp. 621-623 PMID:20497176.

43. J. Stephenson, B. Czepulkowski, W. Hirst and G. J. Mufti, Deletion of the acetylcholinesterase locus at 7q22 associated with myelodysplastic syndromes (MDS) and acute myeloid leukaemia (AML). Leukemia Research 20 (1996) pp. 235-241.

44. H. Shikata, T. Matumoto, H. Teraoka, M. Kaneko, M. Nakanishi and T. Yoshino, Myeloid sarcoma in essential thrombocythemia that transformed into acute myeloid leukemia. International Journal of Hematology 89 (2009) pp. 214-217.

45. V. Senyuk, S. Chakraborty, F. M. Mikhail, R. Zhao, Y. Chi and G. Nucifora, The leukemia-associated transcription repressor AML1/MDS1/EVI1 requires CtBP to induce abnormal growth and differentiation of murine hematopoietic cells. Oncogene 21 (2002) pp. 3232-3240.

46. C. R. Jimenez, J. C. Knol, G. A. Meijer and R. J. A. Fijneman, Proteomics of colorectal cancer: Overview of discovery studies and identification of commonly identified cancer-associated proteins and candidate CRC serum markers. Journal of Proteomics, 73 (2010) pp. 1873-1895.

47. F. Le Naour, M. André, C. Greco, M. Billard, B. Sordat, J. F. Emile, F. Lanza, C. Boucheix and E. Rubinstein, Profiling of the tetraspanin web of human colon cancer cells. Molecular & Cellular Proteomics 5 (2006) pp. 845-857.

48. L. A. Carey, C. M. Perou, C. A. Livasy, L. G. Dressler, D. Cowan, K. Conway, G. Karaca, M. A. Troester, C. K. Tse, S. Edmiston, S. L. Deming, J. Geradts, M. C. Cheang, T. O. Nielsen, P. G. Moorman, H. S. Earp and R. C. Millikan, Race, breast cancer subtypes, and survival in the Carolina Breast Cancer Study. JAMA 295 (2006) pp. 2492-2502.

49. L. Yang, L. Zhang, Q. Wu and D. D. Boyd, Unbiased screening for transcriptional targets of ZKSCAN3 identifies integrin beta 4 and vascular endothelial growth factor as downstream targets. Journal of Biological Chemistry 283 (2008) pp. 35295-35304.

50. X. Fu, H. Deng, L. Zhao, J. Li, Y. Zhou and Y. Zhang, Distinct expression patterns of hedgehog ligands between cultured and primary colorectal cancers are associated with aberrant methylation of their promoters. Molecular and Cellular Biochemistry 337 (2010) pp. 185-192.

51. Q. Pang, T. A. Prolla and R. M. Liskay, Functional domains of the Saccharomyces cerevisiae Mlh1p and Pms1p DNA mismatch repair proteins and their relevance to human hereditary nonpolyposis colorectal cancer-associated mutations. Molecular and Cellular Biology 17 (1997) pp. 4465-4473.

52. O. C. Trifan, R. M. Smith, B. D. Thompson and T. Hla, Overexpression of cyclooxygenase-2 induces cell cycle arrest. Evidence for a prostaglandin-independent mechanism. Journal of Biological Chemistry 274 (1999) pp. 34141-34147.

53. M. D. Lai and J. Xu, Ribosomal Proteins and Colorectal Cancer. Current Genomics 8 (2007) pp. 43-49.

54. M. Serova, L. Astorgues-Xerri, I. Bieche, S. Albert, M. Vidaud, K. A. Benhadji, S. Emami, D. Vidaud, P. Hammel, N. Theou-Anton, C. Gespach, S. Faivre and E. Raymond, Epithelial-to-mesenchymal transition and oncogenic Ras expression in resistance to the protein kinase Cbeta inhibitor enzastaurin in colon cancer cells. Molecular Cancer Therapeutics 9 (2010) pp. 1308-1317.

55. Y. Zhu, R. G. Stevens, A. E. Hoffman, L. M. Fitzgerald, E. M. Kwon, E. A. Ostrander, S. Davis, T. Zheng and J. L. Stanford, Testing the circadian gene hypothesis in prostate cancer: a population-based case-control study. Cancer Research 69 (2009) pp. 9315-9322.

56. B. R. Kurada, L. C. Li, N. Mulherkar, M. Subramanian, K. V. Prasad and B. S. Prabhakar, MADD, a splice variant of IG20, is indispensable for MAPK activation and protection against apoptosis upon tumor necrosis factor-alpha treatment. Journal of Biological Chemistry 284 (2009) pp. 13533-13541.

57. C. Hundhausen, C. Boesch-Saadatmandi, N. Matzner, F. Lang, R. Blank, S. Wolffram, W. Blaschek and G. Rimbach, Ochratoxin A lowers mRNA levels of genes encoding for key proteins of liver cell metabolism. Cancer Genomics & Proteomics 5 (2008) pp. 319-332.

58. E. P. Kopantzev, G. S. Monastyrskaya, T. V. Vinogradova, M. V. Zinovyeva, M. B. Kostina, O. B. Filyukova, A. G. Tonevitsky, G. T. Sukhikh and E. D. Sverdlov, Differences in gene expression levels between early and later stages of human lung development are opposite to those between normal lung tissue and non-small lung cell carcinoma. Lung Cancer 62 (2008) pp. 23-34.

59. N. M. Ravindranath and C. Shuler, Cell-surface density of complement restriction factors (CD46, CD55, and CD59): oral squamous cell carcinoma versus other solid tumors. Oral Surgery, Oral Medicine, Oral Pathology, Oral Radiology & Endodontics 103 (2006) pp. 231-239.

60. D. M. Peffley, C. Sharma, P. Hentosh and R. D. Buechler, Perillyl alcohol and genistein differentially regulate PKB/Akt and 4E-BP1 phosphorylation as well as eIF4E/eIF4G interactions in human tumor cells. Archives of Biochemistry and Biophysics 465 (2007) pp. 266-273.

61. P. Ortega, A. Moran, T. Fernandez-Marcelo, C. De Juan, C. Frias, J. A. Lopez-Asenjo, A. Sanchez-Pernaute, A. Torres, E. Diaz-Rubio, P. Iniesta and M. Benito, MMP-7 and SGCE as distinctive molecular factors in sporadic colorectal cancers from the mutator phenotype pathway. International Journal of Oncology 36 (2010) pp. 1209-1215.

62. D. H. Kim, M. Muto, Y. Kuwahara, Y. Nakanishi, H. Watanabe, K. Aoyagi, K. Ogawa, T. Yoshida and H. Sasaki, Array-based comparative genomic hybridization of circulating esophageal tumor cells. Oncology Reports 16 (2006) pp. 1053-1059.

63. B. Korosec, D. Glavac, T. Rott and M. Ravnik-Glavac, Alterations in the ATP2A2 gene in correlation with colon and lung cancer. Cancer Genetics and Cytogenetics 171 (2006) pp. 105-111.

64. M. Slominska, A. Wahl, G. Wegrzyn and K. Skarstad, Degradation of mutant initiator protein DnaA204 by proteases ClpP, ClpQ and Lon is prevented when DNA is SeqA-free. Biochemical Journal 370 (2003) pp. 867-871.

65. M. N. Washington and N. L. Weigel, 1α,25-Dihydroxyvitamin D3 inhibits growth of VCaP prostate cancer cells despite inducing the growth-promoting TMPRSS2:ERG gene fusion. Endocrinology 151 (2010) pp. 1409-1417.

