
IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 1, JANUARY 2012

Wireless Capsule Endoscopy Video Segmentation Using an Unsupervised Learning Approach Based on Probabilistic Latent Semantic Analysis With Scale Invariant Features

Yao Shen, Parthasarathy (Partha) Guturu, Senior Member, IEEE, and Bill P. Buckles, Senior Member, IEEE

Abstract—Since wireless capsule endoscopy (WCE) is a novel technology for recording videos of a patient's digestive tract, the problem of segmenting a WCE video into subvideos corresponding to the entrance, stomach, small intestine, and large intestine regions is not well addressed in the literature. The few papers addressing this problem follow supervised learning approaches that presume the availability of a large database of correctly labeled training samples. Considering the difficulty of procuring the sizable WCE training data sets needed to achieve high classification accuracy, we introduce in this paper an unsupervised learning approach that employs the scale invariant feature transform (SIFT) for extraction of local image features and the probabilistic latent semantic analysis (pLSA) model, used in linguistic content analysis, for data clustering. Experimental results indicate that this method compares well in classification accuracy with state-of-the-art supervised classification approaches to WCE video segmentation.

Index Terms—Classification, probabilistic latent semantic analysis, scale invariant feature transform, video segmentation, wireless capsule endoscopy.

I. INTRODUCTION

WIRELESS capsule endoscopy (WCE) is a novel technology for recording videos of the parts of the gastrointestinal tract that cannot be visualized through other types of endoscopy, such as colonoscopy. In WCE, a patient swallows a pill-sized capsule equipped with a tiny camera, which captures videos of the digestive tract as the capsule is propelled through the tract by normal peristalsis. These videos are transmitted by a tiny wireless device attached to the capsule to a wireless receiver located outside the human body. There are about 5000 frames in each video, at a frame rate of 2 frames per second. Since its introduction into clinical practice, capsule endoscopy has proved to play an important role in tasks such as detection of bleeding locations [1], diagnosis of Crohn's disease [2], and diagnosis of celiac disease [3].

Manuscript received April 3, 2011; revised August 15, 2011; accepted October 1, 2011. Date of publication October 17, 2011; date of current version February 3, 2012.

Y. Shen and B. Buckles are with the Department of Computer Science and Engineering, College of Engineering, University of North Texas, Denton, TX 76203 USA (e-mail: [email protected]; [email protected]).

P. (Partha) Guturu is with the Department of Electrical Engineering, College of Engineering, University of North Texas, Denton, TX 76203 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TITB.2011.2171977

Fig. 1. Endoscopy images. (a) Entrance image. (b) Stomach image. (c) Small intestine image. (d) Large intestine image.

However, this method is manpower intensive, since it takes more than an hour for a trained specialist to examine one WCE video for abnormal conditions such as blood or ulcers [4]. To mitigate this problem, the endoscopy video frames are usually segmented (separated) into four groups corresponding to the four parts of the digestive tract: entrance, stomach, small intestine, and large intestine. Fig. 1 depicts exemplary frames from these groups. Since some abnormal events occur only in particular parts of the digestive tract, such a grouping of video frames facilitates evaluation, as the clinician now only needs to focus on a small set of frames for analysis. Thus, WCE video segmentation is pivotal to reducing the time required for WCE data analysis.

The problem of automatic WCE video segmentation (equivalently, part–part boundary detection) has been addressed in the medical imaging literature on two fronts. The first set of papers is concerned with the extraction of robust features, such as color and texture features [5], [6], whereas the second set of research articles focuses on powerful classifiers, such as the Bayesian classifier and the support vector machine (SVM) [7], [8]. In [5], Coimbra and Cunha use the scalable color and homogeneous texture descriptors of MPEG-7 to segment the video. Boulougoura et al. [9] constitute their feature vectors by computing, for the six color channels (R, G, B, H, S, V), the statistical measurements standard deviation, variance, skew, kurtosis, entropy, energy, inverse difference moment, contrast, and covariance.



Mackiewicz et al. [10] combine the color, texture, and motion features of the subimage region containing only visible tissue. In [7], Spyridonos et al. employ an SVM classifier with the gradient tensor feature of the image to detect the wrinkles that indicate the existence of contractions. Cunha et al. [8] compare the Bayesian classifier and the SVM for endoscopic video segmentation. Mackiewicz et al. use an SVM based on a hidden Markov model (HMM), under the assumption that the transitions between the four states (entrance, stomach, small intestine, and large intestine) follow a certain probability distribution.

All the above approaches, however, employ supervised learning paradigms for WCE video segmentation. The main problem with these approaches is the presumption that a large database of correctly labeled samples is available for accurate training of the classifiers. But for WCE, being a new technology, it may be difficult to procure a sufficiently large number of videos. Further, because of the small differences in color and texture of the internal organs of different individuals, labeled samples collected from the WCE of one person may not yield good performance when applied to the video of another person. Hence, we propose in this paper a powerful unsupervised learning approach, called probabilistic latent semantic analysis (pLSA) [11], [12], for WCE video segmentation. Application of pLSA to the current problem is possible because of the similarity between the classification of video frames based on their feature content and the semantic segregation of text documents based on their linguistic content (specifically, key words). pLSA employs a bag-of-words model to analyze the semantic content of documents and to segregate them based on the dominant topic (e.g., sports, politics) even though the class (topic) information is latent (hidden). In the present image analysis context, local image features take the role of words in the linguistic analysis. Specifically, we perform video segmentation through pLSA with feature vectors extracted using the scale invariant feature transform (SIFT) [13], [14] in lieu of words. We obtain better results by fusing the SIFT features with color features.

The rest of the paper is organized as follows. In Section II, we present the SIFT feature extraction and pLSA classification approaches in a general context. In Section III, we discuss the proposed algorithm for video segmentation, along with details on preprocessing, visual-word (or visterm) vocabulary building with SIFT feature descriptors derived from the image documents (i.e., WCE video frames), and pLSA parameter learning and classification. Finally, results are presented in Section IV, followed by conclusions in Section V.

II. SIFT FEATURE EXTRACTION AND PLSA CLASSIFICATION APPROACHES—AN OVERVIEW

Any good classification algorithm depends on an effective feature extraction strategy. In the current work, we use SIFT features extracted from video frames and perform unsupervised classification of the video frames using the pLSA method, originally developed for the content analysis and categorization of linguistic documents based on the probability distribution of words. In the following two subsections, we present these two building blocks of our method in a general context.

A. Detection and Description of SIFT Features

The SIFT algorithm was originally developed for gray images by Lowe [13], [14] for extracting highly discriminative local image features that are invariant to image scaling and rotation, and partially invariant to changes in illumination and viewpoint. The algorithm involves two steps: 1) extraction (detection) of key points in the image, and 2) computation of the feature vectors characterizing the key points. The key points are derived by considering the extrema of the DOG (difference of Gaussians) filter outputs obtained by convolving an input image I(x, y) with the DOG function at multiple scales as follows:

D(x, y, σ) = (G(x, y, Lσ) − G(x, y, σ)) ∗ I(x, y)    (1)

           ≈ (L − 1) σ² ∇²G(x, y) ∗ I(x, y)    (2)

where

G(x, y, σ) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²)).    (3)

In the above equations, x and y are image pixel coordinates, σ is the standard deviation of the Gaussian filter, and L is a constant. The Laplacian of Gaussian (LOG) filter in (2) is a theoretically well-established tool for multiscale image analysis, and the DOG filter in (1) is a computationally expedient way to approximate LOG filtering. The value of L may be chosen as √2 because this value is large enough (compared to 1) to make the above approximation nontrivial and, at the same time, has been found in practice not to affect the stability of the key-point detection and localization discussed shortly. Since convolution of an image with Gaussians of different σ values produces different levels of smoothing, σ may be regarded as a scale parameter defining the DOG filter outputs.
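As a concrete illustration of (1)-(3), the following sketch (our own minimal example, not the authors' implementation; it assumes SciPy's gaussian_filter and a grayscale image array) builds DOG responses at a few scales with L = √2.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_stack(image, sigmas, L=np.sqrt(2.0)):
    """Return a list of DOG responses D(x, y, sigma) = G(L*sigma) - G(sigma), as in eq. (1)."""
    image = image.astype(np.float64)
    stack = []
    for sigma in sigmas:
        blurred_wide = gaussian_filter(image, sigma=L * sigma)   # G(x, y, L*sigma) * I
        blurred_narrow = gaussian_filter(image, sigma=sigma)     # G(x, y, sigma) * I
        stack.append(blurred_wide - blurred_narrow)              # DOG response at this scale
    return stack

# Example: DOG responses at three scales for a synthetic frame
frame = np.random.rand(256, 256)
responses = dog_stack(frame, sigmas=[1.6, 1.6 * np.sqrt(2), 3.2])
```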

Fig. 2. DOG filter outputs of an image after the first-stage convolution of the image with Gaussians of different scale (σ) values. The key points of the image are those pixels (e.g., the ×-marked pixel in the second column) with extremal (maximal or minimal) values among the eight neighboring pixels in the same (middle) DOG output and the nine pixels in each of the corresponding 3×3 pixel windows in the neighboring outputs.

The key points of the image are selected to be those pixels whose values are extrema (maxima or minima) among the eight neighboring pixels at the same scale (σ value) and the nine pixels in each of the corresponding 3×3 pixel windows in the adjacent scales, as shown in Fig. 2. Because they are selected from DOG filter outputs at multiple scales, the key points turn out to be scale invariant. Once the key points have been identified, the gradient directions of pixels in the vicinity of each key point are separated into 36 bins, where each bin covers a 10° range and each sample in a bin is weighted by the corresponding gradient magnitude. One or more orientations, corresponding to the bin with the highest value or bins within 80% of the highest value, are assigned to each key point. Computation of the key-point orientation(s) is followed by computation of the key-point descriptors, that is, feature vectors that are invariant to scale and rotation of the object in focus. For this, a 16×16 pixel window around each key point is selected and segmented into sixteen 4×4 subwindows, as shown in Fig. 3(a). The pixel gradients in each subwindow are then accumulated, weighted by their magnitudes, into eight bins, as shown in Fig. 3(b). Since there are 16 subwindows and 8 bins per subwindow, each key-point descriptor turns out to be a 128-dimensional feature vector. Rotation invariance is achieved by using pixel gradient directions relative to the key-point orientation(s) rather than the absolute direction values. However, since a key point may occasionally have more than one orientation, its descriptors could also be multiple. Since these multiple descriptors represent the same concept in a visual context, they are analogous to synonymous words in linguistics.
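To make the selection rule of Fig. 2 concrete, the sketch below (illustrative only; it reuses the hypothetical dog_stack output from the previous example and uses a naive triple loop for clarity) flags a pixel as a key-point candidate when its value is an extremum over the 26 neighbors in the 3×3×3 scale-space cube.

```python
import numpy as np

def scale_space_extrema(dog_responses):
    """dog_responses: list of 2-D DOG outputs at increasing scales.
    Returns (scale, row, col) triples whose value is an extremum over the
    26 neighbors in the 3x3x3 scale-space cube described in Fig. 2."""
    volume = np.stack(dog_responses)                     # shape: (n_scales, rows, cols)
    candidates = []
    for s in range(1, volume.shape[0] - 1):
        for r in range(1, volume.shape[1] - 1):
            for c in range(1, volume.shape[2] - 1):
                cube = volume[s - 1:s + 2, r - 1:r + 2, c - 1:c + 2].ravel()
                centre = cube[13]                        # the pixel under test
                others = np.delete(cube, 13)             # its 26 scale-space neighbors
                if centre > others.max() or centre < others.min():
                    candidates.append((s, r, c))
    return candidates
```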

Since their original development for object recognition, SIFT features have been used in various applications, including robot localization [15], human action recognition [16], and analysis of the human brain in 3-D magnetic resonance images [17].

B. Probabilistic Latent Semantic Analysis

Probabilistic latent semantic analysis (pLSA) was introduced by Hofmann [11], originally for information retrieval, and later extended for unsupervised learning and document classification [12]. pLSA is a generative model with a sound statistical foundation for addressing the challenging problem of semantic content analysis of natural language speech records or text documents. Mathematically, the problem is to analyze each of M documents dj ∈ D = {d1, . . . , dM} containing words wi ∈ W = {w1, . . . , wN}, where N is the total number of words in a vocabulary W. The analysis process is to find the latent aspects or classes (or simply topics, such as sports, politics, etc.) zk ∈ Z = {z1, . . . , zK} in the documents. Since some words (called polysemous words) can have different meanings in different documents depending upon their latent aspects, it is not possible to decipher the semantic contents of documents simply from the words they contain. The analysis is further complicated by the fact that a document dj may be a mixture of latent aspects. The power of pLSA stems from its decomposition of a document into a mixture of latent aspects, each defined by a multinomial distribution over the words in the vocabulary. Suppose now that each document dj is a mixture of latent aspects, defined by the multinomial distribution P(zk|dj). Let P(wi|zk) be the multinomial distribution for aspect zk.

Fig. 3. SIFT feature extraction. (a) Image gradients. (b) Keypoint descriptor.

The joint probability of a word wi and document dj can now be defined by the symmetric mixture model as follows:

P(wi, dj) = Σ_{k=1}^{K} P(zk) P(wi|zk) P(dj|zk).    (4)

This model is said to be symmetric because the probability P(zk|dj) that document dj contains latent class zk is assumed to be the same as the probability P(dj|zk) that latent class zk influences document dj.

Since the aspect attribution is latent, and hence not observable, an expectation maximization (EM) algorithm can be used to estimate the parameters P(zk), P(wi|zk), and P(dj|zk) iteratively from the observed data by maximizing the following log-likelihood function:

L = Σ_{i=1}^{N} Σ_{j=1}^{M} n(wi, dj) log P(wi, dj)    (5)

where n(wi, dj) denotes the number of times word wi occurs in document dj.

In general, the EM algorithm seeks, by an iterative two-step process, the maximum likelihood estimate (MLE) of the likelihood function of the complete data, which includes both the observable data and the latent aspects (i.e., unobservable hidden variables). In the first step, called the expectation step (E-step),


the expected value of the complete-data log-likelihood function is computed with respect to the conditional distribution of the latent aspects given the observed data and the current estimate of the unknown parameters (initially chosen at random). Then, in the second step, called the maximization step (M-step), the parameters that maximize the expectation found in the E-step are computed. These new parameter estimates are used in the next E-step, and the process continues until it converges. In the E-step of the EM algorithm for the current application, the conditional probability distribution of the latent aspect zk is computed as follows:

P(zk|wi, dj) = P(zk) P(wi|zk) P(dj|zk) / Σ_{l=1}^{K} P(zl) P(wi|zl) P(dj|zl).    (6)

In the M-step, the topic probability P(zk), the topic-conditioned word probability P(wi|zk), and the topic-conditioned document probability P(dj|zk) are updated as follows, based on the new expected values P(zk|wi, dj):

P(zk) = Σ_{i=1}^{N} Σ_{j=1}^{M} n(wi, dj) P(zk|wi, dj) / Σ_{i=1}^{N} Σ_{j=1}^{M} n(wi, dj)    (7)

P(wi|zk) = Σ_{j=1}^{M} n(wi, dj) P(zk|wi, dj) / Σ_{m=1}^{N} Σ_{j=1}^{M} n(wm, dj) P(zk|wm, dj)    (8)

P(dj|zk) = Σ_{i=1}^{N} n(wi, dj) P(zk|wi, dj) / Σ_{m=1}^{N} Σ_{l=1}^{M} n(wm, dl) P(zk|wm, dl).    (9)

The EM algorithm starts with random values (normalized to sum to 1 over the respective variable ranges) for P(zk), P(wi|zk), and P(dj|zk). After each EM iteration defined by (6) through (9), the likelihood value L defined by (5) is computed, and if it is higher than the previously computed maximum, the new value of L replaces the old one. The whole problem configuration, particularly the set of P(dj|zk) values, is also saved because it corresponds to the maximum likelihood value at that point. After a number of iterations, when the algorithm converges and no further improvement in L can be obtained, the most dominant aspect of each document dj is taken to be the zk for which P(dj|zk) is maximum over all k.
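The EM recursion in (5)-(9) can be written compactly in NumPy. The sketch below is a schematic re-implementation under our reading of those equations (the variable names, fixed iteration count, and synthetic example are our own), not the authors' code; counts[i, j] holds n(wi, dj).

```python
import numpy as np

def plsa_em(counts, K, n_iter=100, seed=0):
    """counts: (N_words, M_docs) matrix of n(w_i, d_j). Returns P(z), P(w|z), P(d|z), log-likelihood."""
    rng = np.random.default_rng(seed)
    N, M = counts.shape
    p_z = rng.random(K);        p_z /= p_z.sum()            # P(z_k)
    p_w_z = rng.random((N, K)); p_w_z /= p_w_z.sum(axis=0)   # P(w_i | z_k)
    p_d_z = rng.random((M, K)); p_d_z /= p_d_z.sum(axis=0)   # P(d_j | z_k)

    for _ in range(n_iter):
        # E-step, eq. (6): P(z_k | w_i, d_j) for every word/document pair
        joint = p_z[None, None, :] * p_w_z[:, None, :] * p_d_z[None, :, :]   # (N, M, K)
        p_z_wd = joint / joint.sum(axis=2, keepdims=True)

        # Log-likelihood, eq. (5), under the current parameters (to monitor convergence)
        loglik = np.sum(counts * np.log(joint.sum(axis=2) + 1e-12))

        # M-step, eqs. (7)-(9)
        weighted = counts[:, :, None] * p_z_wd               # n(w, d) * P(z | w, d)
        denom = weighted.sum(axis=(0, 1))                    # one normalizer per topic
        p_z = denom / counts.sum()                           # eq. (7)
        p_w_z = weighted.sum(axis=1) / denom                 # eq. (8)
        p_d_z = weighted.sum(axis=0) / denom                 # eq. (9)
    return p_z, p_w_z, p_d_z, loglik

# Example: cluster 50 synthetic "documents" over a 200-word vocabulary into K = 4 aspects
rng = np.random.default_rng(1)
counts = rng.integers(0, 5, size=(200, 50)).astype(float)
p_z, p_w_z, p_d_z, _ = plsa_em(counts, K=4)
labels = p_d_z.argmax(axis=1)        # most dominant aspect per document
```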

III. PROPOSED ALGORITHM FOR WCE VIDEO SEGMENTATION WITH SIFT FEATURES AND PLSA

Fig. 4 presents the flow diagram of our pLSA- and SIFT-feature-based approach to video segmentation. Prior to application of the algorithm, unwanted information is eliminated from the WCE frames by the preprocessing operations discussed in Section III-A. The algorithm takes as input the total set of preprocessed video frames that need to be grouped into different categories. Through random sampling of this set, a training subset is formed for later use in pLSA model construction, and the remaining samples are used as test samples. A vocabulary is then built from the visual words (also called visterms) extracted from the training set by application of the SIFT algorithm followed by a vector quantization procedure, as described in Section III-B. Subsequently, the pLSA model parameters (probabilities) are learned from the visterm distributions in the training frames through the EM algorithm under the pLSA framework described in Section II-B.

Fig. 4. Endoscopy segmentation algorithm.

During the test phase, the same pLSA procedure, with the visterm probability distributions of the various classes learned during the training phase, is applied to the test video frames, and each test sample is classified into the class with the maximum class-conditioned document (test frame) probability. Further details of the pLSA training and test procedures used in our video segmentation algorithm are provided in Section III-C. This pLSA-based classification procedure may be categorized as an unsupervised learning method because no class labels are provided with the training samples during the learning (training) phase.

A. Video Preprocessing

The first step of our multistep video segmentation process eliminates from the original WCE video frames unwanted information, such as textual annotations and pixels corresponding to background objects, and thereby precludes the generation of unexpected features. Due to the dome structure of the WCE camera, only the subimage within the circular area around the center of each video frame can be considered the region of interest (ROI). The ROI is defined as a circle centered in the image whose radius is one pixel smaller than the full "useful" information disk. The area surrounding the ROI is either out of focus or replete with irrelevant information, and hence can be safely removed.
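A minimal sketch of this ROI masking is given below (our illustration; the exact center and radius used by the authors are not specified beyond "one pixel smaller than the useful disk", so the defaults here are assumptions).

```python
import numpy as np

def circular_roi(frame, radius=None):
    """Zero out everything outside a centered circular ROI of the given radius
    (defaults to one pixel less than half the smaller image dimension)."""
    rows, cols = frame.shape[:2]
    if radius is None:
        radius = min(rows, cols) // 2 - 1      # one pixel inside the full "useful" disk
    cy, cx = rows / 2.0, cols / 2.0
    yy, xx = np.ogrid[:rows, :cols]
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    roi = frame.copy()
    roi[~mask] = 0                             # discard annotations / background pixels
    return roi
```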


B. Visual Word Extraction and Vocabulary Building

The next step of the processing is vocabulary building with visterms (visual words) extracted from the individual video frames. In the current application, the SIFT features derived from the ROIs of the WCE video frames (as discussed in Section II-A) are considered as visterms. This choice is motivated by the prior evidence that SIFT features of gray-level intensity images have been used quite successfully in bag-of-features approaches to general scene and object categorization (see, e.g., [18]). Considering that color provides more discriminative information than intensities alone, particularly in the case of endoscopy images, we use color SIFT features in this work. Even though the RGB (red, green, and blue) color space is simple and very common, HSI (hue, saturation, and intensity) features have been shown to have the best discriminatory potential among several perceptually relevant color spaces [19]. Also, previous work of Coimbra and Cunha [5] indicates that HSV (hue, saturation, and value) MPEG-7 scalable color descriptors yield the best classification (segmentation) results for WCE images. Hence, we use in this work SIFT descriptors derived from all three channels of the HSV color model. Thus, there is a threefold increase in the dimensionality of the SIFT feature vector, and the key-point descriptors in color space have 3×128 (= 384) components.

Since classification performance can depend on both the color space used and the color feature extraction method employed, we did not outright reject the RGB color space on the basis of the counter-evidence provided in [20] and [21]. Along with the HSV SIFT features, we experimented with 384-component RGB feature vectors obtained by concatenating the SIFT descriptors computed over the three RGB channels.
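One way to realize the 384-component color descriptors is sketched below (assuming OpenCV's SIFT implementation, which the paper does not name): key points are detected on the intensity image, and the 128-D descriptor is then computed on each color channel at those same key points and concatenated.

```python
import cv2
import numpy as np

def color_sift_descriptors(bgr_frame, color_space=cv2.COLOR_BGR2HSV):
    """Return an (n_keypoints, 384) array: one 128-D SIFT descriptor per channel, concatenated."""
    sift = cv2.SIFT_create()
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    keypoints = sift.detect(gray, None)                      # key points from the intensity image
    channels = cv2.split(cv2.cvtColor(bgr_frame, color_space))
    per_channel = []
    for ch in channels:
        _, desc = sift.compute(ch, keypoints)                # 128-D descriptors on this channel
        per_channel.append(desc)
    return np.hstack(per_channel)                            # 3 x 128 = 384 components
```

Swapping color_space for cv2.COLOR_BGR2RGB gives the RGB variant compared in Section IV.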

The final step of the feature extraction process is building a vocabulary of visual words (or visterms), consistent with our goal of developing an unsupervised learning method based on the "bag of words" model used in the analysis of semantic contents of text documents. However, the feature-vector descriptors cannot be directly considered as visual words, simply because each component of the vector spans the infinite set of real numbers, whereas the words in a language are composed of characters belonging to a finite character set. A simple and obvious solution to this problem is to limit the number of possible feature vectors using a vector quantization [22] procedure. In this paper, we group the feature vectors extracted from randomly chosen video frames into a large (but finite) number of small clusters using the k-means clustering algorithm. Note that k is the size of the vocabulary, and any feature vector can be uniquely mapped onto a specific word in the vocabulary depending on which cluster mean is closest to the feature vector under consideration. The process of extracting visual words from the sample video frames is depicted in Fig. 5(a) through Fig. 5(d). Fig. 5(a) presents a video frame after preprocessing, i.e., elimination of unwanted information around the ROI and convolution with a Gaussian filter. Fig. 5(b) shows the key points extracted from a frame using multiple DOG filter outputs, following the procedure described in Section II-A and depicted in Fig. 2.

Fig. 5. Visual word extraction and codebook construction. (a) Preprocessing. (b) Key point extraction. (c) Color SIFT. (d) Vector quantization.

The histogram, or frequency distribution, of the 384 SIFT feature-vector components (128 in each of the three colors of the RGB color space) is presented in Fig. 5(c). Finally, Fig. 5(d) presents a typical result of the vector quantization procedure described above; that is, it shows the distribution of a prototype vector represented by the centroid of one of the clusters formed by the k-means clustering method. This prototype (quantized) vector is considered a visual code word, or visterm, in our visual vocabulary, and all key-point descriptors (feature vectors) within the vicinity of the prototype (i.e., within the cluster boundaries) are considered to represent the same code word as the prototype. Thus, the size of our vocabulary is the same as the number of clusters formed by the k-means algorithm. In our experimentation, we used vocabularies with up to 1500 visterms.
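The vector quantization step might look like the following sketch (using scikit-learn's KMeans as a stand-in for whatever clustering implementation the authors used; vocab_size corresponds to the codebook size k studied in Section IV).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sets, vocab_size=1200, seed=0):
    """Cluster all training descriptors; each cluster center is one visterm (code word)."""
    all_desc = np.vstack(descriptor_sets)                 # stack the per-frame (n_i, 384) arrays
    return KMeans(n_clusters=vocab_size, random_state=seed, n_init=10).fit(all_desc)

def frame_to_counts(descriptors, vocabulary):
    """Map a frame's descriptors to visterm indices and count them (one column of n(w, d))."""
    words = vocabulary.predict(descriptors)               # nearest cluster center per descriptor
    return np.bincount(words, minlength=vocabulary.n_clusters)
```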

C. Frame Classification and Video Segmentation Using pLSA

Adaptation of the generic pLSA classification method described in Section II-B to the video segmentation problem is straightforward. Using the same notation for the current problem, it is easy to see that a document dj corresponds to a WCE video frame, and D to the total video. The zk are the parts (e.g., small intestine) of the digestive tract captured by the video frames, and the wi are the visterms, or quantized feature vectors, obtained by the vocabulary building process of Section III-B. Since the video is to be segmented into four parts corresponding to the four regions of the digestive tract, the number of latent aspects K here is 4.

With the above mapping of our problem variables onto the generic pLSA problem variables, we can proceed with the pLSA formulation of our problem. Suppose now that each video frame dj is a mixture of latent aspects, defined by the multinomial distribution P(zk|dj). Let P(wi|zk) be the multinomial distribution for aspect zk. The joint probability of a word (visterm) wi and video frame dj can now be defined by the mixture model given in (4). Now, using the counts of the different visterms (equivalently, color SIFT descriptors) in the training video frames, the various probabilities in the pLSA framework, initially chosen to be random values, are estimated by the EM iterations defined by (6) through (9), as discussed in Section II-B. The temporal information in the video is very useful in choosing the training frames for the EM iterations.


Since only the classification of the frames near the part–part boundaries is problematic, frames that are known, based on the temporal information, to lie in the middle of a part are sampled randomly to form the training set used for estimation of the parameters (probabilities). Once the pLSA parameters have been determined accurately using the training samples, they (specifically, the P(wi|zk) values) can be used for classification of the test video frames (including those on the part–part boundaries) using the same EM iterations. Finally, each video frame dj is assigned to class zk if P(dj|zk) is maximum among all P(dj|zl) values.
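Putting the pieces together, the boundary-aware sampling of training frames and the final labeling step could be sketched as follows (the 100-frame margin and the helper names are our own illustrative choices; the paper only states that mid-part frames are sampled randomly).

```python
import numpy as np

def sample_mid_part_frames(part_boundaries, n_frames, margin=100, frac=0.5, seed=0):
    """Randomly pick training frame indices that lie well inside a part.
    part_boundaries: sorted frame indices where a new part starts."""
    rng = np.random.default_rng(seed)
    edges = np.array([0, *part_boundaries, n_frames])
    candidates = []
    for start, end in zip(edges[:-1], edges[1:]):
        inner = np.arange(start + margin, end - margin)   # frames far from the boundaries
        candidates.extend(inner.tolist())
    candidates = np.array(candidates)
    return rng.choice(candidates, size=int(frac * len(candidates)), replace=False)

# After fitting pLSA on the training counts, every frame j receives the label
#   argmax_k P(d_j | z_k)
# which maps the four latent aspects onto entrance, stomach, small intestine, and large intestine.
```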

IV. EXPERIMENTAL RESULTS

For our experimentation, we used ten annotated capsule endoscopy videos collected using the Given Imaging PillCam SB [23]. Experienced clinicians annotated each video frame as belonging to one of four parts: entrance (P1), stomach (P2), small intestine (P3), and large intestine (P4).

Since classification depends both on the feature set and on the type of classifier used, our first experiment concerns identification of the best classifier for the same (Gray-SIFT) features. Our goal in this experiment is to compare our unsupervised approach using pLSA with a support vector machine (SVM), which has established itself as the state-of-the-art classifier operating in supervised mode. Hence, we considered SVMs with both linear and nonlinear (radial basis function) kernels. For the classification of the four parts of the digestive tract after extraction of the SIFT features, we used three SVM classifiers to classify adjacent parts, rather than a single SVM classifier to classify all four parts at once, since the most challenging task is the classification of video frames on the part–part boundaries; classifying frames captured while the capsule is in the middle of a part during its travel through consecutive parts is not that difficult. However, the use of SIFT features with an SVM (or any other classifier based on a supervised learning scheme) poses one problem. These classifiers consider each image frame as a single holistic unit and work on feature vectors derived from individual frames. One way to construct the feature vectors for individual frames is to concatenate the feature vectors of the key points of the frames in raster-scan order. But, since we can in general have different numbers of key points per image, the frame feature-vector sizes may vary. To overcome this problem, we considered the size of the longest feature vector as the size of the typical feature vector, and padded all shorter feature vectors with random numbers so as to make them constant-sized vectors. We used 50% of the randomly sampled video frames for training the classifiers and the remaining 50% for testing. Table I presents the results of classification. Here, P1/P2, P2/P3, and P3/P4 indicate the dichotomies entrance/stomach, stomach/small intestine, and small intestine/large intestine, respectively. Since there is a possibility of confusion only while discriminating between adjacent parts, the table does not include the dichotomies P1/P3, P1/P4, and P2/P4; these could be easily resolved using the temporal separation between the frames corresponding to those parts.

TABLE I. COMPARISON OF CORRECT CLASSIFICATION PERCENTAGES OF THE SIFT-FEATURE-BASED SVM CLASSIFIERS AND THE PROPOSED PLSA APPROACH

TABLE II. COMPARISON OF CORRECT CLASSIFICATION PERCENTAGES OF THE PLSA MODEL WITH THE TRADITIONAL GRAY-SIFT AND THE PROPOSED COLOR-SIFT FEATURES

The entries in this and the following tables indicate the percentages of correct classification when tested with the manually labeled test frames belonging to the two classes (parts) listed in the column heading (e.g., P2/P3). These percentages are computed by averaging the classification results over ten different pairs of training and test sample sets. The results show that the entrance/stomach and stomach/small intestine classifiers perform better than the small intestine/large intestine classifier; the reason is the significant similarity of the tissues in the two most easily confused parts. From the results, it is also clear that the SVM with the radial basis function kernel performs, as expected, better than the one with the linear kernel. Our proposed pLSA method yields much higher accuracy than either SVM classifier, but, in all fairness, this cannot be construed as conclusive evidence for the superiority of the pLSA method over the SVM. The inferior performance of the SVM classifiers could be due to the earlier described approach of constructing constant-sized SIFT feature vectors by padding with random data that carries no information content.
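For reference, the padding workaround described above can be sketched as follows (our illustration of the idea; the distribution of the random padding values is not specified in the paper).

```python
import numpy as np

def pad_frame_features(per_frame_descriptors, seed=0):
    """Concatenate each frame's raster-ordered key-point descriptors and pad
    shorter vectors with random numbers to the length of the longest one."""
    rng = np.random.default_rng(seed)
    flat = [d.reshape(-1) for d in per_frame_descriptors]    # one long vector per frame
    target = max(v.size for v in flat)
    padded = [np.concatenate([v, rng.random(target - v.size)]) for v in flat]
    return np.vstack(padded)                                  # fixed-size rows for the SVM
```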

The goal of our second experiment is a comparative analysis of the traditional Gray-SIFT, RGB-SIFT, and HSV-SIFT features with the pLSA classifier for endoscopy video segmentation. The results presented in Table II indicate that the RGB-SIFT features outperform the other two. In view of the unsuitability of SIFT features for SVM classification, we compare in Table III the results of our pLSA classifier with RGB-SIFT against the best classification results of Mackiewicz et al. presented in [10]. The authors of that paper obtained their results using the same two SVM classifiers but with features constituted from the following components: 1) local binary pattern (LBP) histograms sorted into 343 (= 7³) bins over three color channels and compressed using principal component analysis (PCA); 2) motion-feature 6-tuples extracted from 41 consecutive frames, transformed using the discrete Fourier transform (DFT), and compressed using PCA; and 3) 32×32-bin HS (hue and saturation) histograms compressed using the discrete cosine transform (DCT) and PCA.


TABLE III. COMPARISON OF CORRECT CLASSIFICATION PERCENTAGES OF THE PLSA MODEL BASED ON THE COLOR-SIFT FEATURES AND THE TRADITIONAL SVM CLASSIFIERS BASED ON MULTIPLE FEATURES

From Table III, it is clear that the classification results of the pLSA-RGB-SIFT algorithm are slightly modest compared with those of the state-of-the-art SVM classifiers based on multiple-feature extraction. Still, these results can be considered significant, given that the pLSA classifier does not use class label information as input at any stage of the processing. The small discrepancies could also have been caused by the specific databases used for experimentation and by the feature extraction processes employed.

Our last experiment tests an aspect specific to the pLSA method. Since the performance of the pLSA-Color-SIFT algorithm is likely to depend on the size of the codebook (vocabulary), we study here the effect of codebook size (equivalently, the number of clusters formed in our k-means approach to vector quantization) on the accuracy of video segmentation. In this experiment, the statistical variations of the classification accuracy percentages for codebook sizes ranging from 200 through 1500 visterms are captured in the box-and-whisker diagrams plotted in Fig. 6, one for each of the three dichotomies P1/P2, P2/P3, and P3/P4, using experimental results on ten different pairs of training and test sets generated through random sampling of the total set of video frames in the corresponding dichotomy. For each codebook size, the left whisker, the left edge of the left box, the edge separating the two boxes, the right edge of the right box, and the right whisker of the horizontal bar represent the minimum, first quartile, median, third quartile, and maximum classification accuracies, respectively, for the corresponding dichotomy. Also depicted in each diagram is a curve (in the vertical direction) that runs through the mean values. From these diagrams, it is clear that the highest mean classification accuracy, with minimal variation across randomly selected training and test pairs, is achieved for codebook sizes around and greater than 1000 for all three dichotomies. A codebook size of 1200 seems to yield the best performance with respect to the statistical criteria of highest mean and smallest range of the classification accuracy values achieved with randomly selected training and test sample sets.

Fig. 6. Box-and-whisker diagrams depicting the statistical dependence of classification performance on the visterm vocabulary size. (a) Codebook size versus classification accuracy for the P1/P2 dichotomy. (b) Codebook size versus classification accuracy for the P2/P3 dichotomy. (c) Codebook size versus classification accuracy for the P3/P4 dichotomy.

V. CONCLUSION

In this paper, we proposed a novel pLSA method for endoscopy video segmentation. This method uses local features derived using the SIFT algorithm for vocabulary building and video frame classification. The experiments confirm that our SIFT-pLSA method performs better than traditional SVM classifiers based on the same SIFT features.


However, the difficulty in constructing constant-sized SIFT feature vectors for SVM training and testing suggests that SIFT features may not be the right features for an imaging application that uses an SVM or any other classifier with a supervised learning scheme. In our pLSA approach, the SIFT features derived in color spaces performed, as expected, much better than the gray SIFT features. However, in our case, SIFT features in the RGB color space yield higher classification accuracy than those in the HSV color space, unlike in some prior work on supervised classification approaches for imaging applications [5], [19]. Comparison of our results with the state-of-the-art SVM classifiers for endoscopy video segmentation indicates that pLSA, despite its unsupervised mode of operation, yields competitive performance. These results could be improved using better feature sets and properly chosen codebook sizes. In our simulations, performed on a laptop with a Pentium 4 2.8-GHz CPU and no hardware acceleration, the computation time for a typical codebook size of 600 is 1.375 s per frame for testing and 15.338 s per frame for training. This computational speed suggests that our method is quite promising for clinical use, considering the tediousness (and hence error proneness) and the time involved in manual classification by clinicians.

REFERENCES

[1] G. Gay, M. Delvaux, and J. Key, "The role of video capsule endoscopy in the diagnosis of digestive diseases: A review of current possibilities," Endoscopy, vol. 36, 2004.

[2] P. Swain, "Wireless capsule endoscopy and Crohn's disease," Gut, vol. 54, pp. 323–326, 2005.

[3] A. Culliford, J. Daly, B. Diamond, M. Rubin, and P. H. R. Green, "The value of wireless capsule endoscopy in patients with complicated celiac disease," Gastrointestinal Endosc., vol. 62, no. 1, pp. 55–61, 2005.

[4] A. Maieron et al., "Multicenter retrospective evaluation of capsule endoscopy in clinical routine," Endoscopy, vol. 36, pp. 864–868, 2004.

[5] M. T. Coimbra and J. P. S. Cunha, "MPEG-7 visual descriptors: Contributions for automated feature extraction in capsule endoscopy," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 5, pp. 628–637, May 2006.

[6] P. Wang, S. Krishnan, C. Kugean, and M. P. Tjoa, "Classification of endoscopic images based on texture and neural network," in Proc. 23rd Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., 2001, vol. 4, pp. 3691–3695.

[7] P. Spyridonos, F. Vilarino, J. Vitrià, F. Azpiroz, and P. Radeva, "Anisotropic feature extraction from endoluminal images for detection of intestinal contractions," in Proc. MICCAI Conf., 2006, vol. 2, pp. 161–168.

[8] J. Cunha, M. Coimbra, P. Campos, and J. M. Soares, "Automated topographic segmentation and transit time estimation in endoscopic capsule exams," IEEE Trans. Med. Imag., vol. 27, no. 1, pp. 19–27, Jan. 2008.

[9] M. Boulougoura, E. Wadge, V. S. Kodogiannis, and H. S. Chowdrey, "Intelligent systems for computer-assisted clinical endoscopic image analyses," in Proc. 2nd Int. Conf. Biomed. Eng., Innsbruck, Austria, 2005, pp. 405–408.

[10] M. Mackiewicz, J. Berens, and M. Fisher, "Wireless capsule endoscopy color video segmentation," IEEE Trans. Med. Imag., vol. 27, no. 12, pp. 1769–1781, Dec. 2008.

[11] T. Hofmann, "Probabilistic latent semantic indexing," in Proc. 22nd Annu. Int. SIGIR Conf. Res. Develop. Inf. Retrieval (SIGIR-99), 1999, pp. 50–57.

[12] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Mach. Learn., vol. 42, no. 1, pp. 177–196, 2001.

[13] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. Int. Conf. Comput. Vis., 1999, vol. 2, pp. 1150–1157.

[14] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.

[15] S. Se, D. G. Lowe, and J. Little, "Vision-based mobile robot localization and mapping using scale invariant features," in Proc. Int. Conf. Robot. Autom., 2001, vol. 2, pp. 2051–2055.

[16] P. Scovanner, S. Ali, and M. Shah, "A 3-dimensional SIFT descriptor and its application to action recognition," in Proc. 15th Int. Conf. Multimedia, 2007, pp. 357–360.

[17] M. Toews, W. M. Wells III, D. L. Collins, and T. Arbel, "Feature-based morphometry: Discovering group-related anatomical patterns," NeuroImage, vol. 49, no. 3, pp. 2318–2327, 2010.

[18] F. Jurie and B. Triggs, "Creating efficient codebooks for visual recognition," in Proc. Int. Conf. Comput. Vis., 2005, vol. 1, pp. 604–610.

[19] J. Berens, "Image indexing using compressed color histograms," Ph.D. dissertation, School of Information Systems, Univ. East Anglia, Norwich, U.K., 2002.

[20] A. Bosch, A. Zisserman, and X. Muñoz, "Scene classification using a hybrid generative/discriminative approach," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. 712–727, Apr. 2008.

[21] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, vol. 2, pp. 524–531.

[22] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. 28, no. 1, pp. 84–95, 1980.

[23] Given Imaging, "PillCam SB," Capsule Endoscopy Product, 2011. [Online]. Available: http://www.givenimaging.com/en-us/healthcareprofessionals/Products/Pages/PillCamSB.aspx

Yao Shen received the B.S. degree in electrical engineering from Nanjing Normal University, China, in 2003, and the Ph.D. degree from the Computer Science and Engineering Department, University of North Texas, Denton, Texas, in 2011.

He is currently working for Microsoft, Redmond, Washington, as a consultant from Matisia Consultants. His research interests include object tracking, video processing, Bayesian filtering, and probabilistic latent semantic analysis.

Parthasarathy (Partha) Guturu (SM'90) received the B.Tech. (Hons.) and Ph.D. (Eng.) degrees from the ECE Department of the Indian Institute of Technology (IIT), Kharagpur.

Subsequently, as a faculty member for ten years at IIT, he directed four Ph.D. dissertations and many master's theses, and established a strong publication record. He later worked as a Visiting Professor at the University of Quebec, Montreal, Canada, and as a senior designer/architect in the Nortel R&D units in Ottawa, Canada, and Richardson, USA. During his seven years in industry, he contributed to innovative research that resulted in three US patents in the areas of intelligent networks and 3G wireless systems. He returned to academia in 2004 and is currently an Associate Professor in the EE Department of the University of North Texas, Denton. To date, he has published three book chapters and around 60 international journal/conference papers, and has contributed to the areas of computer vision, computational intelligence, and wired/wireless networks. He plans to embark upon an ambitious interdisciplinary research program that spans the areas of his expertise.

Bill P. Buckles (SM'10) received graduate degrees, including the Ph.D., from the University of Alabama in Huntsville in computer science and in operations research.

He is presently a Professor in the Computer Science and Engineering Department at the University of North Texas. He has published almost 150 papers in national/international journals and has authored a book. His research has been supported by NASA, NSF, and the Missile Defense Agency. He has been a Visiting Professor at the Technische Hochschule in Aachen, Germany, the GMD in Germany, the Free University of Brussels, and National Central University of Taiwan. His research focuses on image understanding and related problems in search, optimization, and pattern recognition.

Prof. Buckles has been an Associate Editor of the IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, the IEEE Computer Society Technical Committee Chair on Distributed Processing, and General Chair of the IEEE International Conference on Distributed Computing Systems. Twice, he was honored with National Technical Achievement Awards from NASA.

