B-MIPT: A Case Tool for Biomedical Image Processing and their Classification using Nearest Neighbor and Genetic Algorithm
Pardeep Kumar Naik Emory University School of Medicine, Atlanta, GA, USA
e-mail: [email protected]
Nitin University of Nebraska at Omaha, Omaha, NE, USA
e-mail: [email protected]
Aujasvita Janmeja, Sushain Puri, Kunal Chawla, Manav Bhasin and Kunal Jain
Jaypee University of Information Technology, Waknaghat, Solan-173215, Himachal Pradesh, INDIA
e-mail: [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract—The expression of Endothelin protein in placental cells is strongly influenced by the inhalation of tobacco smoke, and a high rate of expression leads to placental abnormalities that may result in birth failure. Our application, developed using Image Processing [1-7], the Nearest Neighbour (NN) algorithm and Genetic Algorithms (GA), automates the study of these proteins to help pathologists and lab technicians reach a faster and more efficient diagnosis. Using three distinct parameters, the tool recognised images with high protein expression correctly in up to 91% of cases. The tool achieved a Matthews Correlation Coefficient (MCC) of 0.91. The other performance measures are accuracy = 91.1%, sensitivity = 0.91 and specificity = 0.82. These results show that computer-aided diagnosis can be a helpful tool, especially in a field that lacks experienced specialists.
I. INTRODUCTION
The rate of expression of Endothelin and CD31 proteins in the placental cell depends on the rate of inhalation of tobacco smoke [8-12]. In women who smoke tobacco very frequently (Active Smokers) the rate of expression of these two proteins is high, whereas in women who do not smoke but inhale smoke from other sources (Passive Smokers) the rate of expression is low [8-10]. The higher expression of these two proteins leads to placental abnormalities that may in turn lead to birth failure [12]. Hence, the study of these two proteins and their expression is very important for proper medication. Traditionally, the diagnostic study is done with the immunohistochemistry technique: an experienced pathologist derives the inference by inspecting the image with the naked eye. Because this is a manual method, the diagnosis is sometimes erroneous and the images are wrongly classified. It is therefore essential to build an automatic tool that uses a reproducible algorithm for prediction, classification and proper diagnosis. In this study, an attempt has been made to develop an automated tool using Image Processing and Genetic Algorithm techniques to study the rate of expression of both proteins in the placenta.
II. GENETIC ALGORITHMS
Genetic Algorithms (GAs) [13-17] are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection. Working on a population of possible solutions simultaneously, they represent an intelligent exploitation of random search within a defined search space. This is done by employing a population of individuals (each representing a particular solution) that undergo selection in the presence of variation-inducing operators such as mutation and recombination (crossover). A fitness function is used to evaluate individuals, and reproductive success varies with fitness (Figure 1).
Figure 1. Flow chart showing important steps of genetic algorithm.
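The loop described above (evaluate, select, recombine, mutate) can be illustrated with a minimal sketch. The paper does not specify its encoding, fitness function or operators, so everything here is an assumption made for illustration: individuals are hypothetical RGB triples, fitness is the negative mean Euclidean distance to a handful of made-up training pixels, and selection is simple truncation with elitism. The function names `fitness`, `crossover`, `mutate` and `evolve` are likewise hypothetical.

```python
import random

def fitness(candidate, pixels):
    """Higher is better: negative mean Euclidean distance to the training pixels."""
    total = 0.0
    for p in pixels:
        total += sum((c - v) ** 2 for c, v in zip(candidate, p)) ** 0.5
    return -total / len(pixels)

def crossover(a, b, rng):
    """Uniform crossover: each channel is copied from one parent at random."""
    return tuple(rng.choice(pair) for pair in zip(a, b))

def mutate(candidate, rng, rate=0.2, step=10):
    """With probability `rate`, shift a channel by up to `step`, clamped to 0..255."""
    return tuple(
        min(255, max(0, c + rng.randint(-step, step))) if rng.random() < rate else c
        for c in candidate
    )

def evolve(pixels, pop_size=30, generations=50, seed=0):
    """Return the fittest RGB triple found after a fixed number of generations."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 255) for _ in range(3)) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda c: fitness(c, pixels), reverse=True)
        parents = ranked[: pop_size // 2]   # truncation selection (assumed scheme)
        children = [ranked[0]]              # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    return max(population, key=lambda c: fitness(c, pixels))

# Hypothetical training pixels sampled from a brown spot.
training_pixels = [(79, 72, 53), (80, 70, 55), (78, 74, 51)]
best = evolve(training_pixels)
```

Because the best individual is carried over unchanged in every generation, the best fitness in the population never decreases from one generation to the next.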
2011 Second International Conference on Intelligent Systems, Modelling and Simulation
978-0-7695-4336-9/11 $26.00 © 2011 IEEE
DOI 10.1109/ISMS.2011.26
III. METHODOLOGY
A. Data set and image processing
A set of 203 images was taken from All India Institute of Medical Sciences (AIIMS, India) for this study. Out of the 203 images, 100 images (43 active smokers and 57 passive smokers) were labeled and processed histopathologically for the expression of Endothelin proteins and 103 images (50 active smokers and 53 passive smokers) for the CD31 proteins. In both cases, the proteins are labeled with a brown spot. Using an in-house developed mouse event program, we extracted 100 different pixels from each image, and the pixel information was captured as the intensities of the 3 primary colors: Red (R), Green (G) and Blue (B). These features were used for the training and testing of our algorithms. The training data set was developed based on the genetic algorithm.
B. Nearest Neighbour Algorithm
The Nearest Neighbour algorithm is a technique to categorize an object into a particular class based on the minimum linear distance between two or more objects. The pixels extracted from the brown spots from each class (active or passive) were considered and the intensity of the brown spot was calculated in terms of RGB values. Repeating this process for each image class, we generated intensities characteristic of the images falling under each class, i.e. Endothelin Active, Endothelin Passive, CD31 Active, and CD31 Passive. Given an image for testing, it was analyzed and the intensity of its brown spot was calculated. Then the average linear distance between the intensity from the test image and all intensities registered under a particular class was calculated, and the image was classified into the class with the minimum average linear distance.
C. Image segmentation
The tool provides a utility to classify the image by two different algorithms, which will be sensitive to the image size and quality.
1) Image Region Analysis (IRA)
This algorithm is recommended in case of bad image clarity and resolution, i.e. if the image is not so clear and only some particular regions are of good quality. Under this algorithm, the user is given the choice of selecting a particular area that, according to the user, has some relevance to classification. The selected region is analyzed and its intensity is calculated. The intensity is normalized according to the size of the area and used for classification. This gives us the ability to classify the image according to the training data set, which is generated on images of a fixed resolution, i.e. 600 x 400.
2) Image Intensity Analysis
This algorithm is the default algorithm of the tool. In this algorithm, the image loaded by the user is analyzed and the intensity of the image is calculated with the help of the training data set and an RGB scan.
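The classification rule of Section III.B, combined with a size-normalized region intensity as in Image Region Analysis, can be sketched as follows. This is an illustrative sketch, not the tool's code: it assumes that "linear distance" means Euclidean distance in RGB space and that normalizing by the size of the area means averaging over the pixels of the selected region. The function names and all pixel values are hypothetical.

```python
def mean_rgb(pixels):
    """Average (R, G, B) over a region; dividing by the pixel count
    normalizes the intensity for the size of the selected area."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))

def linear_distance(a, b):
    """Euclidean distance between two RGB intensities (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(test_intensity, class_intensities):
    """Assign the class whose registered intensities have the smallest
    average distance to the test intensity."""
    averages = {
        label: sum(linear_distance(test_intensity, ref) for ref in refs) / len(refs)
        for label, refs in class_intensities.items()
    }
    return min(averages, key=averages.get)

# Hypothetical intensities registered during training.
registered = {
    "Endothelin Active": [(79, 72, 53), (85, 70, 50)],
    "Endothelin Passive": [(100, 100, 85), (110, 95, 90)],
}

# Hypothetical pixels from a user-selected region of a test image.
region = [(80, 70, 55), (82, 74, 51)]
predicted = classify(mean_rgb(region), registered)
```

Averaging the distance over all registered intensities of a class, rather than taking the single closest one, follows the description in the text and makes the decision less sensitive to one unusual training image.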
Figure 2. Schematic diagram showing the various steps of image analysis.
IV. PERFORMANCE MEASURES
The prediction results of the tool developed were evaluated using the following statistical measures.
TABLE I. VARIOUS PERFORMANCE MEASURES USED TO EVALUATE THE APPLICATION.
Performance Measure | Formula
Accuracy of the methods | $Q_{ACC} = \frac{P + N}{T}$, where $T = (P + N + O + U)$; P and N: correctly predicted; O: over predictions; U: under predictions
MCC (Matthews Correlation Coefficient) | $MCC = \frac{(P \times N) - (O \times U)}{\sqrt{(P + U) \times (P + O) \times (N + U) \times (N + O)}}$
Sensitivity | $Q_{sens} = \frac{P}{P + U}$
Specificity | $Q_{spec} = \frac{N}{N + O}$
Probability of correct prediction | $Q_{pred} = \frac{P}{P + O} \times 100$
Percentage over coverage | $Q_{obs} = \frac{P}{P + U} \times 100$
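As a worked illustration of the measures in this section, the short sketch below computes them from the four counts P, N, O and U. The function name `performance` is hypothetical. The example counts are the Endothelin row of Table VI, under the assumption that the active class is treated as the positive class, so that wrongly classified active images count as under predictions (U) and wrongly classified passive images as over predictions (O).

```python
from math import sqrt

def performance(P, N, O, U):
    """Performance measures from correctly predicted positives (P) and
    negatives (N), over predictions (O) and under predictions (U)."""
    T = P + N + O + U
    return {
        "Q_ACC": (P + N) / T,
        "MCC": ((P * N) - (O * U)) / sqrt((P + U) * (P + O) * (N + U) * (N + O)),
        "Q_sens": P / (P + U),
        "Q_spec": N / (N + O),
        "Q_pred": 100 * P / (P + O),
        "Q_obs": 100 * P / (P + U),
    }

# Endothelin counts from Table VI, with active assumed to be the positive class.
measures = performance(P=39, N=52, O=5, U=4)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these counts the sketch gives an accuracy of 0.91 and a sensitivity of about 0.907, in line with the Endothelin Active Smoker column of Table III.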
V. RESULTS AND DISCUSSION
The three features (R, G and B) calculated from the pixels taken from the brown spot differed significantly between the images belonging to the active and passive classes, and are therefore suitable for training and testing the algorithm. The values are also well differentiated between the images labeled for Endothelin and CD31 proteins. Both algorithms, GA and NN, were trained with the image-derived features (RGB intensity). Based on the three inputs, the tool generates a range of RGB values using GA, which is in turn used in the second layer to predict the rate of expression of the proteins from the intensity of the brown color using NN. By applying a fivefold cross-validation test using five data sets, we found that the prediction and classification of the images reached an overall accuracy of 91.1%. The prediction results are presented in Table II. The tool achieved a Matthews correlation coefficient (MCC) of 0.91. The other performance measures are sensitivity = 91.1% and specificity = 82.2% (Table II). The rate of expression was higher for active smokers than for passive smokers. The performance measures for the individual categories of images (labeled with Endothelin and CD31 proteins) are almost equal, as shown in Table III. The analysis revealed a wide difference in the range of intensity between the two classes of images, and hence it is possible to classify a user-given image into its corresponding class.
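The fivefold cross-validation mentioned above can be sketched as follows. The paper does not describe how its five data sets were formed, so this sketch assumes contiguous folds of near-equal size and reports the mean per-fold accuracy; in practice the samples would usually be shuffled or stratified by class first. The names `k_fold_indices`, `cross_validate` and `train_and_predict` are hypothetical, and the classifier is passed in as a callback.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(samples, labels, train_and_predict, k=5):
    """Mean accuracy over k folds.

    train_and_predict(train_x, train_y, test_x) must return one predicted
    label per element of test_x. Assumes len(samples) >= k.
    """
    accuracies = []
    for fold in k_fold_indices(len(samples), k):
        held_out = set(fold)
        train_x = [s for i, s in enumerate(samples) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        test_x = [samples[i] for i in fold]
        predicted = train_and_predict(train_x, train_y, test_x)
        correct = sum(p == labels[i] for p, i in zip(predicted, fold))
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k

# For the 203 images of this study, five folds would hold 41, 41, 41, 40 and 40 images.
fold_sizes = [len(f) for f in k_fold_indices(203, 5)]
```

Each image is held out exactly once, so every prediction that contributes to the reported accuracy is made on an image the classifier did not see during training.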
The results demonstrate that the developed GA-NN based classification of images into active and passive smokers is adequate and can be considered an effective tool for in silico screening of images. The results also demonstrate that the image-derived parameters, readily accessible from the images, can produce a variety of useful information to be used in silico; this clearly demonstrates the adequacy and good predictive power of the developed GA-NN model. Presumably, the accuracy of the approach operating on the image-derived features can be improved even further by expanding the features or by applying more powerful classification techniques such as Support Vector Machines or Bayesian Neural Networks. Use of purely statistical techniques in conjunction with the image-derived parameters would also be beneficial, as they would allow the contributions of individual parameters to “active/passive smoker-likeness” to be interpreted.
VI. CONCLUSION
The results of the present work demonstrate that image-derived features combined with the GA-NN model provide a very fast image identification mechanism with good results, comparable to some of the current efforts in the literature. We have demonstrated the feasibility of combining GA-NN with image-derived features for classification of images into active/passive smokers. Expanding the image-derived features, using purely statistical techniques in conjunction with the extracted parameters, and assembling an adequate, low-noise training set are critical to the success of NN. Apparently, the more specific the image class to be predicted, the more definite the training set that can be assembled, and the higher the predictive power the corresponding NN can acquire. In the future, we envisage an array of NNs being trained to predict different classes and stages of abnormality of patients from the processed pathological images, and to parse other pharmacological investigation data in parallel, complementing current methods to achieve more reliable, high-throughput detection and medication.
TABLE II. PREDICTIVE PERFORMANCE MEASURES USED TO EVALUATE THE APPLICATION USING 5 FOLD CROSS VALIDATION.
Performance Measures | Values
Average Specificity | 82.2%
Average MCC | 0.91
Average Sensitivity | 91.1%
Average Accuracy | 91.1%
TABLE III. PREDICTIVE CLASSIFICATION ACCURACY OF THE IMAGES USING GA APPLICATION
Metrics | Endothelin Active Smoker | Endothelin Passive Smoker | CD31 Active Smoker | CD31 Passive Smoker
Specificity | 81.7% | 81.7% | 82.6% | 82.6%
MCC | 91.2% | 90.7% | 94.3% | 88.0%
Sensitivity | 90.7% | 91.2% | 88.1% | 94.3%
Q(total)/Accuracy | 91.0% | 91.0% | 91.3% | 91.3%
Q(predicted) | 88.6% | 92.9% | 93.6% | 89.3%
Q(observed) | 90.7% | 91.2% | 88.0% | 94.4%
TABLE IV. IMAGE SEGMENTATION AND EXTRACTION OF PIXEL INTENSITY (RED, GREEN AND BLUE) OF THE TOBACCO SMOKING PATIENTS SPECIFICALLY LABELED FOR ENDOTHELIN PROTEIN
Color Tone | Active Max. | Active Min. | Active Avg. | Passive Max. | Passive Min. | Passive Avg.
RED | 134 | 24 | 79 | 153 | 47 | 100
GREEN | 122 | 22 | 72 | 162 | 37 | 100
BLUE | 70 | 36 | 53 | 135 | 36 | 85
TABLE V. IMAGE SEGMENTATION AND EXTRACTION OF PIXEL INTENSITY (RED, GREEN AND BLUE) OF THE TOBACCO SMOKING PATIENTS SPECIFICALLY LABELED FOR CD31 PROTEIN
Color Tone | Active Max. | Active Min. | Active Avg. | Passive Max. | Passive Min. | Passive Avg.
RED | 138 | 17 | 77 | 144 | 38 | 91
GREEN | 143 | 17 | 75 | 150 | 36 | 93
BLUE | 121 | 19 | 70 | 124 | 50 | 87
TABLE VI. NUMBER OF IMAGES USED TO EVALUATE THE APPLICATION OF GA FOR ANNOTATION AND CLASSIFICATION PROBLEM FOR TOBACCO SMOKING PATIENTS
Columns 2-5: number of active and passive images classified right or wrong. Columns 6-8: intensity of protein in Active Smoker. Columns 9-11: intensity of protein.
Number of Images provided | Active Right | Active Wrong | Passive Right | Passive Wrong | Max. | Min. | Avg. | Max. | Min. | Avg.
Endothelin (100) | 39 | 4 | 52 | 5 | 1554 | 543 | 1048 | 602 | 321 | 461
CD31 (100) | 44 | 6 | 50 | 3 | 1524 | 618 | 1071 | 476 | 113 | 294
Figure 3. Comparison of predictive classification accuracy of the images using GA application. (Bar chart; vertical axis 0 to 100; categories: Endothelin Active Smoker, Endothelin Passive Smoker, CD31 Active Smoker, CD31 Passive Smoker.)
Figure 4. Comparison of image segmentation and extraction of pixel intensity (red, green and blue) of the tobacco smoking patients specifically labeled for Endothelin protein.
Figure 5. Comparison of number of images used to evaluate the application of GA for annotation and classification problem for tobacco smoking patients.
(Two bar charts; each plots the series BLUE, RED and GREEN over Max., Min. and Avg. for the Active and Passive groups; vertical axes 0 to 180 and 0 to 160.)
Table VI. Comparison of Number of Images used to Evaluate the Application of GA for Annotation and Classification Problem for Tobacco Smoking Patients.
ACKNOWLEDGEMENTS
We are thankful to All India Institute of Medical Sciences (AIIMS), New Delhi, India for providing us with the set of 203 images. Out of the 203 images, 100 images (43 active smokers and 57 passive smokers) were labeled and processed histopathologically for the expression of Endothelin proteins and 103 images (50 active smokers and 53 passive smokers) for the CD31 proteins. We are also thankful to University of Nebraska at Omaha and Emory University, Atlanta, for the use of their computing labs.
REFERENCES
[1] John C. Russ (2006) The Image Processing Handbook. ISBN 0849372542.
[2] Ian T. Young, Jan J. Gerbrands and Lucas J. Van Vliet (1995) Fundamentals of Image Processing. Paperback, ISBN 90-75691-01-7.
[3] Bart M. ter Haar Romeny (2003) Front-End Vision and Multi-Scale Image Analysis. Paperback, ISBN 1-4020-1507-0.
[4] Jean Serra (1982) Image Analysis and Mathematical Morphology. ISBN 0126372403.
[5] Christopher M. Bishop (2007) Pattern Recognition and Machine Learning. Springer, ISBN 0-387-31073-8.
[6] Neural Computing and Applications, Springer-Verlag.
[7] P.M. Bhagat (2005) Pattern Recognition in Industry. Elsevier, ISBN 0-08-044538-1.
[8] T. Haak, E. Jungmann, C. Raab and K.H. Usadel (1994) Elevated endothelin-1 level after cigarette smoking. Metabolism 43(3) 267-269.
[9] T. Kosicka, H. Kara-Perz and S. Perz (2006) Evaluation of plasma endothelin-1 concentration in tobacco smoking patients with essential hypertension. Przegl Lek. 63(10) 957-959.
[10] R. Poreba, A. Skoczynska, A. Derkacz, A. Wojakowsk and B. Turczyn (2004) Influence of tobacco smoking on endothelial function in lead-exposed male workers. Med. Pr. 55(2) 145-151.
[11] D.A. Scott and R.M. Palmer (2002) The influence of tobacco smoking on adhesion molecule profiles. Tobacco Induced Diseases 1(1) 7-25.
[12] N. Takeshi, M. Sata, M. Washida, Y. Hirata, R. Nagai and M. Makuuchi (2003) Nicotine enhances neovascularization and promotes tumor growth. 16(2) 143-146.
[13] C.M. Bishop (1995) Neural Networks for Pattern Recognition. Oxford: Oxford University Press, ISBN 0-19-853849-9 (hardback) or ISBN 0-19-853864-2 (paperback).
[14] R.O. Duda, P.E. Hart and D.G. Stork (2001) Pattern Classification (2nd edition). Wiley, ISBN 0-471-05669-3.
[15] K. Gurney (1997) An Introduction to Neural Networks. London: Routledge, ISBN 1-85728-673-1 (hardback) or ISBN 1-85728-503-4 (paperback).
[16] S. Haykin (1999) Neural Networks: A Comprehensive Foundation. Prentice Hall, ISBN 0-13-273350-1.
[17] A. Zell and G. Mamier (1997) Stuttgart Neural Network Simulator version 4.2. University of Stuttgart, Stuttgart, Germany.