[IEEE 2011 2nd International Conference on Intelligent Systems, Modelling and Simulation (ISMS) -...

B-MIPT: A Case Tool for Biomedical Image Processing and their Classification using Nearest Neighbor and Genetic Algorithm

Pardeep Kumar Naik Emory University School of Medicine, Atlanta, GA, USA

e-mail: [email protected]

Nitin University of Nebraska at Omaha, Omaha, NE, USA

e-mail: [email protected]

Aujasvita Janmeja, Sushain Puri, Kunal Chawla, Manav Bhasin and Kunal Jain

Jaypee University of Information Technology, Waknaghat, Solan-173215, Himachal Pradesh, INDIA

e-mail: [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—A high rate of expression of Endothelin protein in placental cells is strongly regulated by the inhalation of tobacco smoke and leads to placental abnormalities that may in turn lead to birth failure. Our application, developed using Image Processing [1-7], the Nearest Neighbour algorithm (NN) and Genetic Algorithms (GA), automates the study of these proteins to assist pathologists and lab technicians in achieving a faster and more efficient diagnosis. Using three distinct parameters, recognition of images with high protein expression was accurate up to 91% of the time. The tool achieved a Matthews Correlation Coefficient (MCC) of 0.91. The other performance measures are accuracy = 91.1%, sensitivity = 0.91 and specificity = 0.82. These results show that computer-aided diagnosis can be a helpful tool, especially in a field that lacks experienced specialists.


I. INTRODUCTION

The rate of expression of Endothelin and CD31 proteins in the placental cell depends on the rate of inhalation of tobacco smoke [8-12]. In women who smoke tobacco very frequently (Active Smokers) the rate of expression of these two proteins is high, whereas in women who do not smoke tobacco but inhale smoke coming from other sources (Passive Smokers) the rate of expression is low [8-10]. The higher expression of these two proteins leads to placental abnormalities that may in turn lead to birth failure [12]. Hence, the study of these two proteins and their expression is very important for proper medication. Traditionally, the diagnostic study is done with the immunohistochemistry technique; an experienced pathologist generally derives the inference by inspecting the image with the naked eye. Because this is a manual method, the diagnosis is sometimes erroneous and the images are wrongly classified. It is therefore essential to build an automatic tool that uses a reproducible algorithm for prediction, classification and proper diagnosis. Hence, in this study, an attempt has been made to develop an automated tool using Image Processing and Genetic Algorithm techniques to study the rate of expression of both proteins in the placenta.

II. GENETIC ALGORITHMS

Genetic Algorithms (GAs) [13-17] are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection. Working on a population of possible solutions simultaneously, they represent an intelligent exploitation of random search within a defined search space. This is done by employing a population of individuals (each representing a particular solution) that undergo selection in the presence of variation-inducing operators such as mutation and recombination (crossover). A fitness function is used to evaluate individuals, and reproductive success varies with fitness (Figure 1).

Figure 1. Flow chart showing important steps of genetic algorithm.
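The loop of selection, recombination and mutation described above can be illustrated with a minimal sketch. It is not the tool's implementation: the target colour, the fitness function, the tournament selection, the elitism step and all parameter values are assumptions chosen only to make the example runnable.

```python
import random

# Hypothetical target colour (R, G, B); an assumption used only for illustration.
TARGET = (120, 80, 40)


def fitness(individual):
    """Higher is better: negative city-block distance to TARGET."""
    return -sum(abs(a - b) for a, b in zip(individual, TARGET))


def tournament(population, rng, k=3):
    """Selection: return the fittest of k randomly drawn individuals."""
    return max(rng.sample(population, k), key=fitness)


def crossover(parent_a, parent_b, rng):
    """Recombination: take each channel from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(parent_a, parent_b))


def mutate(individual, rng, rate=0.2, step=10):
    """Mutation: nudge each channel with probability `rate`, clamped to 0..255."""
    return tuple(
        min(255, max(0, c + rng.randint(-step, step))) if rng.random() < rate else c
        for c in individual
    )


def evolve(generations=60, size=30, seed=0):
    """Run the GA and return (best of first generation, best of last generation)."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 255) for _ in range(3)) for _ in range(size)]
    initial_best = max(population, key=fitness)
    for _ in range(generations):
        elite = max(population, key=fitness)  # elitism: the best survives unchanged
        children = []
        for _ in range(size - 1):
            child = crossover(tournament(population, rng), tournament(population, rng), rng)
            children.append(mutate(child, rng))
        population = [elite] + children
    return initial_best, max(population, key=fitness)


initial_best, final_best = evolve()
print("first generation best:", initial_best, fitness(initial_best))
print("last generation best: ", final_best, fitness(final_best))
```

Because the fittest individual is copied unchanged into each new generation, the best fitness in this sketch can never decrease from one generation to the next.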

2011 Second International Conference on Intelligent Systems, Modelling and Simulation

978-0-7695-4336-9/11 $26.00 © 2011 IEEE

DOI 10.1109/ISMS.2011.26


III. METHODOLOGY

A. Data set and image processing

A set of 203 images was taken from the All India Institute of Medical Sciences (AIIMS, India) for this study. Out of the 203 images, 100 images (43 active smokers and 57 passive smokers) were labeled and processed histopathologically for the expression of Endothelin proteins and 103 images (50 active smokers and 53 passive smokers) for the CD31 proteins. In both cases, the proteins are labeled with a brown spot. Using an in-house developed mouse event program, we extracted 100 different pixels from each image, and the pixel information was captured as the intensity of the three primary colors: Red (R), Green (G) and Blue (B). These features were used for the training and testing of our algorithms. The training data set was developed based on the genetic algorithm.

B. Nearest Neighbour Algorithm

The Nearest Neighbour algorithm is a technique to categorize an object into a particular class based on the minimum linear distance between two or more objects. The pixels extracted from the brown spots from each class (active or passive) were considered and the intensity of the brown spot was calculated in terms of RGB values. Repeating this process for each image class, we generated intensities characteristic of images falling under each class, i.e. Endothelin Active, Endothelin Passive, CD31 Active, and CD31 Passive. Given an image for testing, it was analyzed and the intensity of the brown spot was calculated. Then the average linear distance between the intensity from the test image and all intensities registered under a particular class was calculated, and the image was classified into the class with the minimum average linear distance.

C. Image segmentation

The tool provides a utility to classify the image by two different algorithms, which are sensitive to the image size and quality.

1) Image Region Analysis (IRA)

This algorithm is recommended in case of bad image clarity and resolution, i.e. if the image is not so clear and only some particular regions are of good quality. Under this algorithm, the user is given the choice of selecting a particular area that, according to the user, has some relevance to classification. The selected region is analyzed and the intensity is calculated. The intensity is normalized according to the size of the area and used for classification. This gives us the ability to classify the image according to the training data set, which is generated on images of a fixed resolution, i.e. 600 x 400.

2) Image Intensity Analysis

This algorithm is the default algorithm of the tool. In this algorithm, the image loaded by the user is analyzed and the intensity of the image is calculated with the help of the training data set and RGB scan.
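The nearest-neighbour rule of Section III-B (assign the image to the class with the minimum average linear distance) can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the reference intensities are made-up values, "linear distance" is taken to be the Euclidean distance in RGB space, and the names `CLASS_INTENSITIES`, `mean_rgb`, `average_distance` and `classify` are hypothetical.

```python
import math

# Hypothetical per-class reference intensities (R, G, B). In the tool these
# would come from pixels sampled on the brown spots of the training images.
CLASS_INTENSITIES = {
    "Endothelin Active": [(80, 70, 50), (78, 74, 55)],
    "Endothelin Passive": [(100, 100, 85), (104, 98, 88)],
}


def mean_rgb(pixels):
    """Average the sampled pixels into one (R, G, B) intensity."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


def average_distance(intensity, references):
    """Mean Euclidean distance from `intensity` to every reference of a class."""
    return sum(math.dist(intensity, ref) for ref in references) / len(references)


def classify(pixels, class_intensities=CLASS_INTENSITIES):
    """Return the class whose references are closest on average."""
    intensity = mean_rgb(pixels)
    return min(
        class_intensities,
        key=lambda name: average_distance(intensity, class_intensities[name]),
    )


test_pixels = [(79, 72, 52), (81, 70, 54)]  # hypothetical brown-spot samples
print(classify(test_pixels))
```

Averaging the distance over all references of a class, rather than taking the single closest reference, follows the wording of the section; a different distance measure could be substituted in `average_distance` without changing the rest of the sketch.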

Figure 2. Schematic diagram showing the various steps of image analysis.

IV. PERFORMANCE MEASURES

The prediction results of the tool developed were evaluated using the following statistical measures.

TABLE I. VARIOUS PERFORMANCE MEASURES USED TO EVALUATE THE APPLICATION.

Performance Measure: Formula

Accuracy of the methods: $Q_{ACC} = \frac{P + N}{T}$, where $T = (P + N + O + U)$; P and N: correctly predicted; O: over predictions; U: under predictions

MCC (Matthews Correlation Coefficient): $MCC = \frac{(P \times N) - (O \times U)}{\sqrt{(P + U) \times (P + O) \times (N + U) \times (N + O)}}$

Sensitivity: $Q_{sens} = \frac{P}{P + U}$

Specificity: $Q_{spec} = \frac{N}{N + O}$

Probability of correct prediction: $Q_{pred} = \frac{P}{P + O} \times 100$

Percentage over coverage: $Q_{obs} = \frac{P}{P + U} \times 100$
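As a worked illustration of the measures in the table above, the following sketch computes them from the four counts P, N, O and U. The counts passed in at the end are hypothetical and are not taken from the study, and the MCC is written in its standard form, which is an assumption about the intended formula.

```python
import math


def performance_measures(P, N, O, U):
    """Compute the evaluation measures from prediction counts.

    P, N: correctly predicted positives and negatives
    O:    over predictions (false positives)
    U:    under predictions (false negatives)
    """
    T = P + N + O + U
    return {
        "accuracy": (P + N) / T,
        # Standard Matthews correlation coefficient (assumed form).
        "mcc": (P * N - O * U)
        / math.sqrt((P + U) * (P + O) * (N + U) * (N + O)),
        "sensitivity": P / (P + U),
        "specificity": N / (N + O),
        "q_pred": 100 * P / (P + O),  # probability of correct prediction, in %
        "q_obs": 100 * P / (P + U),   # percentage coverage, in %
    }


# Hypothetical counts, chosen only to exercise the formulas.
measures = performance_measures(P=45, N=40, O=10, U=5)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

The function does not guard against a zero denominator, which occurs when a class has no observed or no predicted members; a production version would need to handle that case explicitly.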


V. RESULTS AND DISCUSSION

The three features (R, G and B) calculated from the pixels taken from the brown spot differed significantly between the images belonging to the active and passive classes and are therefore suitable for training and testing the algorithm. The values are also well differentiated between the images labeled for Endothelin and CD31 proteins. Both algorithms, GA and NN, were trained with the image-derived features (RGB intensity). Based on the three inputs, the tool generates a range of RGB values using GA, which is in turn used in the second layer, with NN, to predict the rate of expression of the proteins from the intensity of the brown color. By applying a fivefold cross-validation test using five data sets, we found that the prediction and classification of the images reached an overall accuracy of 91.1%. The prediction results are presented in Table II. The tool achieved a Matthews correlation coefficient (MCC) of 0.91. The other performance measures are sensitivity = 91.1% and specificity = 82.2% (Table II). The rate of expression was higher for active smokers than for passive smokers. The performance measures for the individual categories of images (labeled with Endothelin and CD31 proteins) are almost equal, as shown in Table III. The analysis revealed a wide difference in the range of intensity between the two classes of images, and hence it is possible to classify a user-supplied image into its corresponding class.

The results demonstrate that the developed GA-NN based classification of images into active and passive smokers is adequate and can be considered an effective tool for in silico screening of images. The results also demonstrate that the image-derived parameters, readily accessible from the images, can produce a variety of useful information to be used in silico, which demonstrates the adequacy and good predictive power of the developed GA-NN model. Presumably, the accuracy of the approach based on image-derived features can be improved even further by expanding the features or by applying more powerful classification techniques such as Support Vector Machines or Bayesian Neural Networks. Use of purely statistical techniques in conjunction with the image-derived parameters would also be beneficial, as they would allow interpreting individual parameter contributions to "active/passive smoker-likeness".
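The fivefold cross-validation protocol mentioned in this section can be sketched as follows. How the images were actually partitioned is not stated in the paper, so the shuffled, equally sized folds, the fixed seed and the `evaluate` callback are assumptions made for illustration.

```python
import random


def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each fold of a shuffled data set."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # Every fold except the held-out one forms the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_items, evaluate, n_folds=5):
    """Average the score returned by evaluate(train, test) over all folds."""
    scores = [evaluate(train, test) for train, test in five_fold_splits(n_items, n_folds)]
    return sum(scores) / len(scores)


# With 100 images, each fold holds out 20 images and trains on the other 80.
for train, test in five_fold_splits(100):
    print(len(train), len(test))
```

In use, `evaluate` would train the classifier on the `train` indices and return its accuracy on the `test` indices, so that `cross_validate` reports the average accuracy over the five folds.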

VI. CONCLUSION

The results of the present work demonstrate that image-derived features combined with the GA-NN model provide a very fast image identification mechanism with good results, comparable to some of the current efforts in the literature. We have demonstrated the feasibility of combining GA-NN with image-derived features for classification of images into active/passive smokers. Expanding the image-derived features, using purely statistical techniques in conjunction with the extracted parameters, and assembling an adequate, low-noise training set are critical to the success of NN. Apparently, the more specific the prediction task for an image, the more definite a training set can be assembled, and the higher the predictive power the corresponding NN can acquire. In the future, we envisage an array of NNs being trained to predict different classes and stages of abnormality of the patients from the processed pathological images and to parse other pharmacological investigation data in parallel, complementing current methods to achieve more reliable, high-throughput detection and medication.

TABLE II. PREDICTIVE PERFORMANCE MEASURES USED TO EVALUATE THE APPLICATION USING 5 FOLD CROSS VALIDATION.

Performance Measures    Values
Average Specificity     82.2%
Average MCC             0.91
Average Sensitivity     91.1%
Average Accuracy        91.1%

TABLE III. PREDICTIVE CLASSIFICATION ACCURACY OF THE IMAGES USING GA APPLICATION

Metrics              Endothelin        Endothelin        CD31              CD31
                     Active Smoker     Passive Smoker    Active Smoker     Passive Smoker
Specificity          81.7%             81.7%             82.6%             82.6%
MCC                  91.2%             90.7%             94.3%             88.0%
Sensitivity          90.7%             91.2%             88.1%             94.3%
Q(total)/Accuracy    0.91              0.91              91.3%             91.3%
Q(predicted)         88.6%             92.9%             93.6%             89.3%
Q(observed)          90.7%             91.2%             88.0%             94.4%


TABLE IV. IMAGE SEGMENTATION AND EXTRACTION OF PIXEL INTENSITY (RED, GREEN AND BLUE) OF THE TOBACCO SMOKING PATIENTS SPECIFICALLY LABELED FOR ENDOTHELIN PROTEIN

Color Tone    Active                  Passive
              Max.    Min.    Avg.    Max.    Min.    Avg.
RED           134     24      79      153     47      100
GREEN         122     22      72      162     37      100
BLUE          70      36      53      135     36      85

TABLE V. IMAGE SEGMENTATION AND EXTRACTION OF PIXEL INTENSITY (RED, GREEN AND BLUE) OF THE TOBACCO SMOKING PATIENTS SPECIFICALLY LABELED FOR CD31 PROTEIN

Color Tone    Active                  Passive
              Max.    Min.    Avg.    Max.    Min.    Avg.
RED           138     17      77      144     38      91
GREEN         143     17      75      150     36      93
BLUE          121     19      70      124     50      87

TABLE VI. NUMBER OF IMAGES USED TO EVALUATE THE APPLICATION OF GA FOR ANNOTATION AND CLASSIFICATION PROBLEM FOR TOBACCO SMOKING PATIENTS

Number of Images    Number of Active      Number of Passive     Intensity of protein      Intensity of protein
provided            Images (classified)   Images (classified)   in Active Smoker          in Passive Smoker
                    Right     Wrong       Right     Wrong       Max.    Min.    Avg.      Max.    Min.    Avg.
Endothelin (100)    39        4           52        5           1554    543     1048      602     321     461
CD31 (100)          44        6           50        3           1524    618     1071      476     113     294

Figure 3. Comparison of predictive classification accuracy of the images using GA application. (Bar chart, scale 0 to 100, for the four classes: Endothelin Active Smoker, Endothelin Passive Smoker, CD31 Active Smoker and CD31 Passive Smoker.)


Figure 4. Comparison of image segmentation and extraction of pixel intensity (red, green and blue) of the tobacco smoking patients specifically labeled for Endothelin protein.

Figure 5. Comparison of number of images used to evaluate the application of GA for annotation and classification problem for tobacco smoking patients.

(Bar charts: Max., Min. and Avg. intensity of the BLUE, RED and GREEN channels for the Active and Passive classes.)

Table VI. Comparison of Number of Images used to Evaluate the Application of GA for Annotation and Classification Problem for Tobacco Smoking Patients.

ACKNOWLEDGEMENTS

We are thankful to the All India Institute of Medical Sciences (AIIMS), New Delhi, India for providing us the set of 203 images. Out of the 203 images, 100 images (43 active smokers and 57 passive smokers) were labeled and processed histopathologically for the expression of Endothelin proteins and 103 images (50 active smokers and 53 passive smokers) for the CD31 proteins. We are also thankful to the University of Nebraska at Omaha and Emory University, Atlanta, for the use of their computing labs.

REFERENCES

[1] The Image Processing Handbook by John C. Russ, ISBN 0849372542 (2006).

[2] Fundamentals of Image Processing by Ian T. Young, Jan J. Gerbrands, Lucas J. Van Vliet, Paperback, ISBN 90-75691-01-7 (1995).

[3] Front-End Vision and Multi-Scale Image Analysis by Bart M. ter Haar Romeny, Paperback, ISBN 1-4020-1507-0 (2003).

[4] Image Analysis and Mathematical Morphology by Jean Serra, ISBN 0126372403 (1982).

[5] Christopher M. Bishop (2007) Pattern Recognition and Machine Learning, Springer ISBN 0-387-31073-8.

[6] Neural Computing and Applications, Springer-Verlag.

[7] Bhagat, P.M. (2005) Pattern Recognition in Industry, Elsevier. ISBN 0-08-044538-1.

[8] T. Haak, E. Jungmann, C. Raab and K.H. Usadel (1994) Elevated endothelin-1 level after cigarette smoking. Metabolism 43(3) 267-269.

[9] T. Kosicka, H. Kara-Perz and S. Perz (2006). Evaluation of plasma endothelin-1 concentration in tobacco smoking patients with essential hypertension. Przegl Lek. 63(10) 957-959.

[10] R. Poreba, A. Skoczynska, A. Derkacz, A. Wojakowsk and B. Turczyn (2004). Influence of tobacco smoking on endothelial function in lead-exposed male workers. Med. Pr. 55(2) 145-151.

[11] D.A. Scott and R.M. Palmer (2002) The influence of tobacco smoking on adhesion molecule profiles. Tobacco induced diseases 1(1) 7-25.

[12] N. Takeshi, M. Sata, M. Washida, Y. Hirata, R. Nagai and M. Makuuchi (2003). Nicotine enhances neovascularization and promotes tumor growth. 16(2) 143-146.

[13] Bishop, C.M. (1995) Neural Networks for Pattern Recognition, Oxford: Oxford University Press. ISBN 0-19-853849-9 (hardback) or ISBN 0-19-853864-2 (paperback).

[14] Duda, R.O., Hart, P.E., Stork, D.G. (2001) Pattern classification (2nd edition), Wiley, ISBN 0-471-05669-3.

[15] Gurney, K. (1997) An Introduction to Neural Networks London: Routledge. ISBN 1-85728-673-1 (hardback) or ISBN 1-85728-503-4 (paperback).

[16] Haykin, S. (1999) Neural Networks: A Comprehensive Foundation, Prentice Hall, ISBN 0-13-273350-1.

[17] A. Zell & G. Mamier, Stuttgart Neural Network Simulator version 4.2. University of Stuttgart, Stuttgart, Germany (1997).

