EURASIP Journal on Applied Signal Processing 2004:6, 814–832
© 2004 Hindawi Publishing Corporation

Automatic Video Object Segmentation Using Volume Growing and Hierarchical Clustering

Fatih Porikli
Mitsubishi Electric Research Laboratories, Cambridge, MA 02139, USA
Email: [email protected]

Yao Wang
Department of Electrical Engineering, Polytechnic University, Brooklyn, NY 11201, USA
Email: [email protected]

Received 4 February 2003; Revised 26 December 2003

We introduce an automatic segmentation framework that blends the advantages of color-, texture-, shape-, and motion-based segmentation methods in a computationally feasible way. A spatiotemporal data structure is first constructed for each group of video frames, in which each pixel is assigned a feature vector based on low-level visual information. Then, the smallest homogeneous components, so-called volumes, are expanded from selected marker points using an adaptive, three-dimensional, centroid-linkage method. Self descriptors that characterize each volume and relational descriptors that capture the mutual properties between pairs of volumes are determined by evaluating the boundary, trajectory, and motion of the volumes. These descriptors are used to measure the similarity between volumes, based on which volumes are further grouped into objects. A fine-to-coarse clustering algorithm yields a multiresolution object tree representation as an output of the segmentation.

Keywords and phrases: video segmentation, object detection, centroid linkage, color similarity.

1. INTRODUCTION

Object segmentation is important for video compression standards as well as recognition, event analysis, understanding, and video manipulation. By object we refer to a collection of image regions grouped under some homogeneity criteria, where a region is defined as a contiguous set of pixels.

Basically, segmentation techniques can be grouped into three classes: region-based methods using a homogeneous color or texture criterion, motion-based approaches utilizing a homogeneous motion criterion, and object tracking. Approaches in the region-oriented domain range from empirical evaluation of various color spaces [1], to clustering in feature space [2], to nearest-neighbor algorithms, to pyramid linking [3], to morphological methods [4], to split-and-merge [5], to hierarchical clustering [6]. Color-clustering-based methods often utilize histograms and are computationally simple. Histogram analysis delivers satisfactory segmentation results, especially for multimodal color distributions and where the input data set is relatively simple, clean, and fits the model well. However, this method lacks generality and robustness. Besides, histogram methods fail to establish spatial connectivity. Region-growing-based techniques provide better performance in terms of spatial connectivity and boundary accuracy than histogram-based methods. However, extracted regions may not correspond to actual physical objects unless the intensity or color of each pixel in objects differs from the background. A common problem of histogram and region-based methods arises from the fact that a video object can contain several totally different colors.

On the other hand, works in the motion-oriented domain start with the assumption that a semantic video object has a coherent motion that can be modeled by the same set of motion parameters. This type of motion segmentation work can be separated into two broad classes: boundary-placement schemes [7] and region-extraction schemes [8, 9, 10, 11, 12]. Most of these techniques are based on rough optical flow estimation or unreliable spatiotemporal segmentation, and may suffer from the inaccuracy of motion boundaries. The estimation of a dense motion field tends to be extremely slow, hence not suitable for processing large volumes of video and real-time data. Blockwise or higher-order motion models may be used instead of dense motion fields. However, a chicken-and-egg problem exists in modeling motion: should the region where a motion model is to be fitted be determined first, or should the motion field to be used to obtain the region be calculated first? Stochastic methods may overcome this priority problem by simultaneously modeling flow field and spatial connectivity, but they require that the number of objects be supplied as a priori information before the segmentation. Small and nonrigid motion gives rise to additional model-fitting difficulties. Furthermore, modeling may fail when a semantic video object has different motions in different parts of the object. Briefly, computational complexity, region-motion priority, and modeling issues are to be considered in utilizing dense motion fields for segmentation.

The last class is “tracking” [13]. A tracking process can be interpreted as the search for a target; it is the trajectories of the dynamic parameters that are linked in time. This process is usually embodied through model matching. Many types of features, for example, points [14], intensity edges [15], textures [16], and regions [17], can be utilized for tracking. Three main approaches have been developed to track objects depending on their type: whether they are rigid, nonrigid, or have no regular shape. For the first two approaches, the goal is to compute the correspondences between objects already tracked and the newly detected moving regions, whereas the goal of the last approach is handling the situations where correspondences are ambiguous. The major difficulty in tracking is to deal with the interframe changes of moving objects. It is clear that the image shape of a moving object may undergo deformation, since a new aspect of the object may become visible or the actual shape of the object may change. Thus a model needs to evolve from one frame to the next, capturing the changes in the image shape of an object as it moves. Although in most cases more than two video frames are already available before segmentation, existing techniques usually view tracking as a unidirectional propagation problem.

Semiautomatic segmentation methods have the power of correlating semantic information with extracted regions using human assistance. However, such assistance often requires users to be trained in the behaviour of the segmentation method. Besides, real-time video systems require user-independent processing tools. The vast amount of video data demands automatic segmentation, since entering object boundaries by hand is cumbersome.

In summary, a single homogeneous color or motion criterion does not lead to satisfactory extraction of object information, because each homogeneous criterion can only deal with a limited set of scenarios, and a video object may contain multiple colors and complex motions.

2. PROPOSED SEGMENTATION FRAMEWORK

Each of the segmentation algorithms summarized above has its own advantages. It would be desirable to have a general segmentation framework that combines the distinct qualities of the separate methods without being hampered by their pitfalls. Such a system is expected to be made up of compatible processing modules that can be easily modified with respect to the application parameters. Even user assistance and system-specific a priori information should be easily embedded into the segmentation framework without reconstructing the overall system architecture. Thus, we designed our segmentation framework to meet the following targets:

(i) automaticity,
(ii) adaptability,
(iii) accuracy,
(iv) computational complexity.

A general flow diagram of the framework is given in Figure 1. In the diagram, the main algorithm is shown in gray, and its modular extensions that include application-specific modules, that is, skin color detection, frame difference, and motion vector processing, are shown by the dashed lines. When MPEG-7 dominant color descriptors are available, they can be utilized in the volume-growing stage to adapt the color similarity function parameters. The frame difference score becomes useful where the camera system is stationary. Skin color can be incorporated as an additional feature for human detection. For MPEG-encoded sequences, motion vectors can be used at the hierarchical clustering stage.

Figure 1: Flow diagram of the video segmentation algorithm showing all the major modular stages. The main pipeline is raw video, preprocessing, marker assignment, volume growing, volume refinement, descriptor extraction, and hierarchical clustering, producing an object tree; optional inputs include MPEG-7 descriptors, skin color score, frame difference score, MPEG motion vectors with feature points and motion parameter estimation, human face, and object number.

Before segmentation, the input video sequence is sliced into video shots, which are defined as groups of consecutive frames having similar attributes between two scene cuts. The segmentation algorithm takes a certain number of consecutive frames within the same video shot and processes all of these frames at the same time. The number of frames chosen can be the same as the length of the corresponding shot, or a number that is sufficient to have discriminatory object motion within the chosen frames. A limiting factor may be the memory requirement due to the large data size. After filtering, a spatiotemporal data structure is formed by computing pointwise features of the frames. These features include color values, frame difference score, skin colors, and so forth, as illustrated in Figure 2.

Figure 2: Construction of spatiotemporal data from the video. Frames from one video shot ($t = 1$ to $t = t_M$) contribute color values ($Y$, $U$, $V$), texture scores $\theta_k$, frame difference score $\delta$, and skin color score $\rho$ at each point $p = (x, y, t)$ of the spatiotemporal data structure.

We acquire homogeneous parts of the spatiotemporal data by growing volumes around selected marker points. By volume growing, all the frames of an input video shot are segmented simultaneously. Such an approach solves the problem of tracking objects and correlating the segmented regions between consecutive frames, since no account of the quantitative information about the regions and boundaries needs to be kept. The volume-growing approach solves the problem of “should the region of support be obtained first by color segmentation followed by motion estimation, or should the motion field be obtained first followed by segmentation based on motion consistency?” by supplying the region of support and an initial estimation of motion at the same time. In addition, volume growing is computationally simple.

The grown volumes are refined to remove small and erroneous volumes. Then, motion trajectories of individual volumes are determined. Thus, without explicit motion estimation, a functional approximation of motion is obtained. Self descriptors for each volume and mutual descriptors for a pair of volumes are computed from volume trajectories and also from other volume statistics. These volumewise descriptors are designed to capture motion, shape, color, and other characteristics of the grown volumes. At this stage, we have the smallest homogeneous parts of a video shot and their relations in terms of mutual descriptors. Application-specific information can be incorporated as separate descriptors, such as skin color.

In the following clustering stage, volumes are merged into objects by evaluating their descriptors. An iterative, hierarchical fine-to-coarse clustering is carried out until the motion similarity of merged objects becomes small. After clustering, an object partition tree that gives the video object planes for successively smaller numbers of objects is generated. The object partition tree can be appended to the input video for further recognition, data mining, and event analysis purposes. Note that this framework does not claim to obtain semantic information automatically, but it aims to provide tools for efficient extraction and integration of explicit visual features to improve object detection. Thus, a user can easily change the visual definition of a semantic object at the clustering stage, which has an insignificant computational load, without segmenting the video over again.

3. FORMATION OF SPATIOTEMPORAL DATA

3.1. Filtering

In the preprocessing stage, the input frames are filtered first. The two main objectives of filtering are noise removal and simplification of the color components. Noisy or highly textured frames can cause oversegmentation by producing an excessive number of segments. This not only slows down the algorithm, but also increases the memory requirements and degrades the stability of the segmentation. However, most noise filtering techniques demand intensive operations. Thus, we have developed a computationally efficient simplification filter which retains the edge structure and yet smooths the texture between edges. Simply stated, the color value of a point is compared with its neighbors for each color channel. If the distance is less than a threshold, the point’s color value is updated by the average of its neighbors within a local window. For the performance comparison of this filter with other methods, including Gaussian, median, and morphological filtering, see [18]. A sample filtering result is given in Figure 3.

Figure 3: Original and filtered images using the simplification filter.
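As a rough illustration of this edge-preserving simplification step, the following Python sketch applies the stated rule to a single channel; the window size and threshold are placeholder values, not the parameters used in the paper, and the comparison against the local mean is one possible reading of "compared with its neighbors."

```python
import numpy as np

def simplify_channel(channel, threshold=12.0, half_window=1):
    """Edge-preserving simplification of one color channel (rough sketch).

    Each pixel is compared with the average of its local window; if it is
    close to that average (below the threshold), it is replaced by the
    average so texture is smoothed, otherwise it is kept so edges survive.
    """
    img = channel.astype(np.float32)
    out = img.copy()
    h, w = img.shape
    for y in range(half_window, h - half_window):
        for x in range(half_window, w - half_window):
            window = img[y - half_window:y + half_window + 1,
                         x - half_window:x + half_window + 1]
            local_mean = window.mean()
            if abs(img[y, x] - local_mean) < threshold:
                out[y, x] = local_mean   # smooth texture between edges
            # otherwise leave the pixel untouched to retain the edge
    return out
```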

3.2. Quantization and color space

To further simplify the input images, color quantization is applied by estimating a certain number of dominant colors. Quantization also decreases the total processing time by allowing the use of smaller data structures in the implementation of the code. The dominant colors are determined by a hierarchical clustering approach incorporating the generalized Lloyd algorithm (GLA) at each level. Suppose we already have an optimal partitioning of all color vectors in the input image into $2^k$ levels. At the $(k+1)$th level, we perturb each cluster center into two vectors, and use the resulting $2^{k+1}$ cluster centers as the initial cluster centers at this level. We then run the GLA to obtain an optimal partition with $2^{k+1}$ levels. Specifically, starting with the initial cluster centers, we group each input color vector to its closest cluster center. The cluster centers are then updated based on the new grouping. A distortion score is calculated, which is the sum of the distances of the color vectors to the cluster centers. The grouping and the recalculation of the cluster centers are repeated until the distortion does not reduce significantly anymore. Initially, at level $k = 0$, we have one cluster only, including all the color vectors of the input image. As a final stage, the clusters that have close color centers are grouped to decide on a final number of dominant colors.
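A minimal sketch of this bisecting GLA quantization, assuming the frame pixels have been reshaped into an N x 3 array of YUV vectors; the perturbation magnitude and stopping tolerance are illustrative choices, not the paper's settings, and the final grouping of close centers is omitted.

```python
import numpy as np

def gla_dominant_colors(pixels, levels=5, tol=1e-3, rng=np.random.default_rng(0)):
    """Bisecting generalized Lloyd algorithm (sketch).

    pixels: (N, 3) float array of YUV color vectors.
    levels: number of doubling steps, giving up to 2**levels dominant colors.
    """
    centers = pixels.mean(axis=0, keepdims=True)          # level k = 0: one cluster
    for _ in range(levels):
        # perturb each center into two nearby vectors
        eps = rng.normal(scale=1.0, size=centers.shape)
        centers = np.vstack([centers - eps, centers + eps])
        prev_distortion = np.inf
        while True:
            # assign every color vector to its closest center
            dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            distortion = dists[np.arange(len(pixels)), labels].sum()
            # update centers from the new grouping
            for j in range(len(centers)):
                members = pixels[labels == j]
                if len(members) > 0:
                    centers[j] = members.mean(axis=0)
            if prev_distortion - distortion < tol * max(prev_distortion, 1.0):
                break
            prev_distortion = distortion
    return centers
```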

The complexity of the metric used for computing color distances is a major factor in selecting a color space, since most of the processing time is spent computing the color distances between the points. We preferred the YUV color space since the color distance can be computed using simpler norms. In addition, the YUV space separates the chrominance from the luminance component, and represents color in more accordance with human perception than RGB [19]. Thus, the segmentation results are visually more plausible. The above-described dominant colors have minor differences from the MPEG-7 dominant color descriptors. For example, MPEG-7 has a smaller number of color bins, and it is based on the Lab color space. In the case where MPEG-7 descriptors are available with the input video, the dominant color descriptor can be directly used to quantize the input video after suitable conversion of the color space. In Figure 4, quantized images with different numbers of dominant colors are given.

Figure 4: Quantization by 32, 16, and 8 dominant colors, which are shown next to each image. As visible, very low quantization levels may disturb the color properties, that is, skin colors and edges.

3.3. Feature vectors

Frames of the input video shot are then assembled into a spatiotemporal data structure $S$. Each element of this data structure has a feature vector $w(p) = [Y, U, V, \delta, \theta_1, \ldots, \theta_K, \rho]$. Here, $p = (x, y, t)$ is a point in $S$, where $(x, y)$ is the spatial coordinate and $t$ is the frame number. We will denote individual attributes of the feature vector, for example, the $Y$ color value of point $p$, by $Y(p)$. Sometimes we also use $w(p, k)$ to represent feature $k$ at point $p$, for example, $k = Y, U, V$. Table 1 summarizes the notation. Besides the color values, additional attributes can be included in the feature vector. The frame difference score $\delta$ is defined as the pointwise color dissimilarity of two frames with respect to a given set of rules. One such rule is

$$\delta(p) = \left|Y(p) - Y(p_{t-})\right|, \tag{1}$$

where $p_{t-} = (x, y, t-1)$. The texture features $\theta_1, \ldots, \theta_K$ are computed by convolving the luminance channel $Y$ with the Gabor filter kernels as

$$\theta_k(p) = \left| Y(p) \otimes \frac{1}{2\pi\sigma^2}\, e^{-\left((x^2+y^2)/2\pi\sigma^2\right)}\, e^{-j 2\pi (u_k x + v_k y)} \right|. \tag{2}$$

It is sufficient to employ the values for the spatial frequency $\sqrt{u^2 + v^2} = 2, 4, 8$ and the direction $\tan^{-1}(u/v) = 0, \pi/4, \pi/2, 3\pi/4$, which leads to a total of 12 texture features. Obtaining texture features is computationally as intensive as estimating motion vectors by phase correlation due to the convolution process. Blending texture and color components into a single similarity measure is usually done by assigning weighting parameters [20]. In this work, we concentrate on the color components.

The skin color score $\rho$ indicates whether a point has a high likelihood of corresponding to human skin. We obtained a mapping from the color space to the skin color values by projecting the color values of a large set of manually segmented skin images that include people of various races, genders, and ages. This mapping is used as a lookup table to determine the skin color score. More details on this derivation can be found in [21]. In Figure 5, skin color scores of sample images are shown. In these images, higher intensity values correspond to higher likelihoods.

Table 1: Notation of parameters.

$S$: volumetric spatiotemporal data
$p$: point in $S$; $p = (x, y, t)$
$w(p)$: feature vector at $p$
$Y(p)$, $U(p)$, $V(p)$: color values at $p$
$\delta(p)$: frame difference at $p$
$\theta_k(p)$: texture features at $p$
$\rho(p)$: skin color score at $p$
$\nabla Y$, $\nabla U$, $\nabla V$: color gradient
$m_i$: marker of volume $V_i$
$c_i$: feature vector of volume $V_i$
$V_i$: a volume within $S$
$\gamma(i)$: self descriptor of volume $V_i$
$\Gamma(i, j)$: relational descriptor of pair $V_i$, $V_j$

4. VOLUME GROWING

Volumes are the smallest connected components of the spatiotemporal data $S$ with homogeneous color and texture distribution within each volume. Using markers and evaluating various distance criteria, volumes are grown iteratively by grouping neighboring points of similar characteristics.

In principle, volume-growing methods are applicable whenever a distance measure and a linkage strategy can be defined. Several linkage methods were developed in the literature; they differ in the spatial relation of the points for which the distance measure is computed. In single-linkage volume growing, a point is joined to its 3D neighboring points whose properties are similar enough. In hybrid-linkage growing, similarity among the points is established based on the properties within a local neighborhood of the point itself instead of using the immediate neighbors. In the case of centroid-linkage volume growing, a point is joined to a volume by evaluating the distance between the centroid of the volume and the current point. Yet another approach is to provide not only a point that is in the desired volume but also counterexamples that are not in the volume. Two-dimensional versions of these linkage algorithms are explained in [22]. In the following, we first describe the marker selection process, and then the centroid-linkage algorithm in more detail.

4.1. Marker assignment

Figure 5: Skin color scores ρ of sample images.

A marker is the seed of the volume grown around it. Since a volume’s initial properties will be determined by its marker, a marker should be a good representative of its local neighborhood. A point that has a low color gradient magnitude satisfies this criterion. Let $m_i$ be a marker for volume $V_i$, and $Q$ the set of all available points, that is, initially all the points of $S$. The color gradient magnitude is defined as follows:

$$\left|\nabla S(p)\right| = \left|\nabla Y(p)\right| + \left|\nabla U(p)\right| + \left|\nabla V(p)\right| \tag{3}$$

such that the gradient magnitude of a channel is

$$\left|\nabla Y(p)\right| = \left|Y(p_{x+}) - Y(p_{x-})\right| + \left|Y(p_{y+}) - Y(p_{y-})\right| + \left|Y(p_{t+}) - Y(p_{t-})\right|, \tag{4}$$

where $p_{x+}$ and $p_{x-}$ represent equal distances in the $x$-direction from the center point $p$, that is, $(x-1, y, t)$, $(x+1, y, t)$, and so forth. We observed that using the $L_2$ norm instead of the $L_1$ norm does not improve the results. The point having the local minimum gradient magnitude is chosen as a marker. A volume $V_i$ is grown as will be explained in the following section, and all the points of the volume are removed from the set $Q$. The next minimum in the remaining set is chosen, and the selection process is repeated until no more available points remain in $S$. Rather than searching the full-resolution spatiotemporal data, a subsampled version of it is used to find the minima, since searching in full resolution is computationally costly.
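A compact sketch of this marker selection rule on a single frame slice, assuming 8-bit YUV channels stacked as a (3, H, W) array; only the spatial part of the gradient of equations (3)-(4) is shown, and the subsampling factor is an illustrative choice.

```python
import numpy as np

def gradient_magnitude(yuv):
    """Sum of absolute central differences over Y, U, V (spatial part of eqs. (3)-(4))."""
    grad = np.zeros(yuv.shape[1:], dtype=np.float32)
    for ch in yuv.astype(np.float32):
        gy, gx = np.gradient(ch)           # central differences along y and x
        grad += np.abs(gx) + np.abs(gy)
    return grad

def pick_marker(yuv, available_mask, step=4):
    """Return the (y, x) of the minimum-gradient unassigned point on a subsampled grid."""
    grad = gradient_magnitude(yuv)
    grad = np.where(available_mask, grad, np.inf)   # exclude already grown points
    sub = grad[::step, ::step]                      # coarse search for speed
    y, x = np.unravel_index(np.argmin(sub), sub.shape)
    return y * step, x * step
```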

More computational reduction is achieved by dividing the subsampled $S$ into slices. A minimum gradient magnitude point is found for the first slice and a volume is grown; then the next minimum is searched in the next slice, as illustrated in Figure 6. The temporal continuity is preserved by growing a volume in the whole spatiotemporal data $S$ after selecting a marker in the current slice. In case the markers are limited only within the first frame, the algorithm becomes a forward volume growing.

Figure 6: Fast marker selection finds the minimum gradient magnitude points in the current slice of the downsampled data. Then, a volume is grown within the spatiotemporal data, and the process is repeated until no point remains unclassified.

Generally, the marker points are uniformly distributed among the frames of a video shot in which objects are consistent and motion is uniform. For such video shots, a single frame of $S$ can be used for the selection of all markers instead of using the whole $S$. However, the presence of fast moving small objects, highly textured objects, and illumination changes may deteriorate the segmentation performance if a single frame is used. Besides, objects that are not visible in the single frame may not be detected at all. The iterative slice approach overcomes these difficulties.

4.2. Centroid-linkage algorithm

For each new volume $V_i$, a volume feature vector $c_i$, the so-called “centroid,” is assigned. The centroid-linkage algorithm compares the features of a candidate point to the current volume’s feature vector. This vector is composed of the color statistics of the volume, and initially it is equal to the feature vector of the point chosen as the marker, $c_i(k) = w(m_i, k)$. In a 6-point neighborhood, two in each of the $x$, $y$, $t$ directions, the color distances of the adjoint points are calculated. If the distance $d(c_i, w(q))$ is less than a volume-specific threshold $\varepsilon_i$, the point $q$ is included in the volume, and the centroid vector is updated as

$$c_i^{n}(k) = \frac{1}{N}\left[(N-1)\, c_i^{n-1}(k) + w(q, k)\right], \tag{5}$$

where $N$ is the number of points in the volume after the inclusion of $q$. If the point $q$ has a neighbor that is not included in the current volume, it is assigned as an “active-shell” point. Thus, active-shell points constitute the boundary of the volume. In the next cycle, the unclassified neighbors of the active-shell points are probed. Linkage is repeated until either no point remains in the active shell or no unclassified point remains in the spatiotemporal data.
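A minimal sketch of the centroid-linkage growth loop on a 3D label array, assuming a precomputed feature array of shape (T, H, W, C) holding the color values, a distance function dist(c, w), and a per-volume threshold eps; the boundary handling and active-shell bookkeeping are simplified.

```python
from collections import deque
import numpy as np

NEIGHBORS = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]  # 6-neighborhood in (t, y, x)

def grow_volume(features, labels, marker, volume_id, dist, eps):
    """Centroid-linkage volume growing from a single marker point.

    features: (T, H, W, C) array of feature vectors w(p).
    labels:   (T, H, W) int array, 0 where points are still unclassified.
    marker:   (t, y, x) tuple selected by the marker-assignment step.
    dist:     callable d(centroid, w(q)) -> float, e.g. eq. (6) or (8).
    eps:      volume-specific distance threshold, e.g. 2.5 sigma or eq. (9).
    """
    centroid = features[marker].astype(np.float64)   # c_i initialised from the marker
    n = 1
    labels[marker] = volume_id
    shell = deque([marker])                           # active shell = boundary of the volume
    while shell:
        t, y, x = shell.popleft()
        for dt, dy, dx in NEIGHBORS:
            q = (t + dt, y + dy, x + dx)
            if not all(0 <= q[i] < labels.shape[i] for i in range(3)):
                continue
            if labels[q] != 0:
                continue                              # already assigned to some volume
            if dist(centroid, features[q]) < eps:
                labels[q] = volume_id
                n += 1
                centroid = ((n - 1) * centroid + features[q]) / n   # eq. (5) update
                shell.append(q)
    return centroid, n
```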

There are two other possible linkage techniques: single linkage, which compares a point with only its immediate neighbors, and dual linkage, which compares with the current object boundary. We observed that these two techniques are prone to segmentation errors such as leakage and color-inconsistent segments. The sample results for the various linkage algorithms are given in Figure 7.

Figure 7: Segmentation by (a) single linkage, (b) dual linkage, and (c) centroid linkage. Single linkage is prone to errors.

4.3. Distance calculation and threshold determination

The aim of the linkage algorithm is to generate homogeneous volumes. Here we define homogeneity as the quality of being uniform in color composition; in other words, it is the amount of color variation. For a moment, let us assume that a color density function of the data is available. The modality of this density function refers to the number of its principal components, that is, the number of separate models in a mixture-of-models representation. A high modality indicates a larger number of distinct color clusters in the density function. Our key hypothesis is that points of a color-homogeneous volume are more likely to be in the same color cluster than in different color clusters. Thus, we can establish a relationship between the number of clusters and the homogeneity specifications of volumes. If we know the color cluster that a volume corresponds to, we can determine the specifications of homogeneity for that volume, that is, the parameters of the color distance function and its threshold.

Before volume growing, we approximate the color density function by deriving a 3D color histogram of the slice. We find cluster centers within the color space either by assigning the dominant colors as centers or by using the described GLA clustering algorithm. We group each color vector $w(p)$ to the closest cluster center, and for each cluster we compute a within-cluster distance variance $\sigma^2$.

After choosing a marker and initializing a volume feature vector $c_i$, we determine the closest cluster center to $c_i$ in the color space. Using the variance of this cluster, we define the color distance and its threshold as follows:

$$d(c_i, q) = \sqrt{\sum_{k} \left(c_i(k) - w(q, k)\right)^2}, \tag{6}$$

where $k : Y, U, V$, and the threshold is $\varepsilon_i = 2.5\sigma$ to allow the inclusion of 95% of the colors within the same color cluster. The above formulation assumes that the color channels contribute equally (due to the Euclidean distance norm) and that the 3D color histogram is densely populated (for effective application of clustering). However, a dense histogram may not be available in the case of small slice sizes, and the color components may not be equally important in the case of the YUV space.

We also developed an alternative approach that uses separate 1D histograms. Local maxima $h_n(k)$ of the histograms are obtained for each channel such that $h_n(k) < h_{n+1}(k)$ and $n = 1, \ldots, H_k$. Note that the number of maxima $H_k$ may be different for different channels. The histograms are clustered, and a within-cluster distance variance is computed for each cluster similarly. Using the current marker point $m_i$, three coefficients $\tau_i(k)$, $k : Y, U, V$ (one for each histogram), are determined as

$$\tau_i(k) = 2.5\,\sigma_j(k), \qquad j = \arg\min_{n} \left|c_i(k) - h_n(k)\right|, \tag{7}$$

where $h_j(k)$ is the closest center. These coefficients specify the cluster ranges. A logarithmic distance function is formulated as follows:

$$d(c_i, q) = \sum_{k} H_k \log_2\left(1 + \frac{\left|c_i(k) - w(q, k)\right|}{\tau_i(k)}\right). \tag{8}$$

We normalized the channel differences with the cluster ranges to equalize the contribution of a wide cluster in one histogram to a narrow cluster in another histogram. The logarithmic term is intended to suppress large color mismatches of a single histogram. Considering that a channel that has more distinctive colors should provide more information for segmentation, the channel distances are weighted by the corresponding $H_k$'s. Then, the distance threshold for volume $V_i$ is derived as

$$\varepsilon_i = \sum_{k} H_k. \tag{9}$$
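The 1D-histogram distance of equations (7)-(9) can be sketched as follows, assuming cluster_centers[k] and cluster_sigmas[k] hold the local-maximum locations $h_n(k)$ and their within-cluster standard deviations for each channel k; these names are illustrative, not from the paper.

```python
import numpy as np

def channel_coefficients(centroid, cluster_centers, cluster_sigmas):
    """Per-channel coefficients tau_i(k) from eq. (7): 2.5 sigma of the closest maximum."""
    taus = []
    for k, c in enumerate(centroid):
        j = np.argmin(np.abs(cluster_centers[k] - c))   # closest local maximum h_j(k)
        taus.append(2.5 * cluster_sigmas[k][j])
    return np.asarray(taus)

def log_color_distance(centroid, w_q, taus, counts):
    """Logarithmic distance of eq. (8); counts[k] is the number of maxima H_k."""
    diff = np.abs(np.asarray(centroid) - np.asarray(w_q))
    return float(np.sum(counts * np.log2(1.0 + diff / taus)))

def distance_threshold(counts):
    """Volume-specific threshold epsilon_i of eq. (9)."""
    return float(np.sum(counts))
```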

4.4. Modes of volume growing

Volume growing can be carried out either by growing multiple volumes simultaneously or by expanding only one volume at a time. Furthermore, the expansion itself can be done either in an intraframe-interframe switching fashion or in a recursive outward growing style.


(i) Simultaneous growing. After a certain number of marker points are determined, volumes are grown simultaneously from each marker. At a growing cycle, all the existing volumes are updated by examining the points neighboring the active shell of the current volume. In case a volume stops growing, an additional marker that is an adjoint point to the boundary of the stopped volume is selected. Although simultaneous growing is fast, it may divide homogeneous volumes into multiple smaller volumes; thus volume merging becomes necessary.

(ii) One-at-a-time growing. At each cycle, only a single marker point is chosen, and a volume is grown around this marker. After the volume stops growing, another marker in the remaining portion of the spatiotemporal data is selected. This process continues until no more points remain in $S$. An advantage of one-at-a-time growing is that it can be implemented by recursive programming. It also generates more homogeneous volumes. However, it demands more memory to keep all the pointers.

(iii) Recursive diffusion. The points neighboring the active shell are evaluated regardless of whether they are in the same frame as the active-shell point, as illustrated in Figure 8. After a point is included within a volume, the point becomes a point of the active shell as long as it has a neighbor that is not included in the same volume. By updating the active shell as described, the volume is diffused outward from the marker. Instead of using only adjoint points, other points within a local window around the active-shell point can be used in diffusion as well. However, in this case the computational complexity increases, and moreover, connectivity may deteriorate.

(iv) Intraframe-interframe switching. A volume grown using recursive diffusion tends to be topologically noncompact, having several holes and ridges within it. Such a volume usually generates unconnected regions when it is sliced framewise. In intraframe-interframe switching, the diffusion mechanism is first applied within the same frame to grow a region; then the results are propagated to the previous and next frames. The grown region is assigned as the active shell for the neighboring frames. As a result, each framewise projection of a volume will be a single connected region, and volumes will have more compact shapes.

Figure 8: (a) Volume growing by intraframe-interframe switching. (b) Recursive diffusion. As visible, recursive diffusion grows volumes as an inflating balloon, whereas the switching method first enlarges a region in a frame and then spreads this region to the adjoint frames.

4.5. Volume refinement

After volume growing, some of the volumes may be negligible in size or very elongated due to fine texture and edges. Such volumes increase the computational load of the later processing. A simple way of removing a small or elongated volume is labeling its points as unclassified and inflating the remaining volumes iteratively to fill up the empty space. First, the unclassified points that are adjoint to other volumes are put into an active shell set. Then, each active-shell point is included in the volume which is adjoint and has the minimum color distance. The point is removed from the active shell, and the inclusion process is iterated until no more unclassified points remain. Alternatively, a small volume can be merged into one of its neighbors as a whole using volumewise similarity. In this case, similarity is defined as a combination of the ratio of the mutual surface, the compactness ratio, and the color distance. For more details on the definition of such a similarity measure, see [21].
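A rough sketch of this refinement step, reusing the same labels and features arrays and dist function as in the growing sketch above; centroids is an assumed mapping from volume ids to their feature vectors, and points with no labeled neighbor reachable are simply left unassigned.

```python
from collections import deque
import numpy as np

NEIGHBORS = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]

def _inside(shape, p):
    return all(0 <= pi < si for pi, si in zip(p, shape))

def refill_unclassified(labels, features, centroids, dist):
    """Re-assign unclassified points (label 0) to the adjoint volume with the
    minimum color distance, iterating until no reachable point is left."""
    # seed the active shell with unclassified points that touch a labeled volume
    shell = deque()
    for p in zip(*np.nonzero(labels == 0)):
        if any(_inside(labels.shape, q) and labels[q] != 0
               for q in (tuple(np.add(p, d)) for d in NEIGHBORS)):
            shell.append(p)
    while shell:
        p = shell.popleft()
        if labels[p] != 0:
            continue
        adjoint = []
        for d in NEIGHBORS:
            q = tuple(int(v) for v in np.add(p, d))
            if _inside(labels.shape, q):
                adjoint.append(q)
        labeled = {labels[q] for q in adjoint if labels[q] != 0}
        if labeled:
            labels[p] = min(labeled, key=lambda v: dist(centroids[v], features[p]))
            # the newly labeled point may expose more unclassified neighbors
            shell.extend(q for q in adjoint if labels[q] == 0)
    return labels
```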

5. DESCRIPTORS OF VOLUMES

Descriptors capture various aspects of the volumes, such as motion, shape, and color characteristics of individual volumes, as well as pairwise relations among the volumes.

5.1. Self descriptors

Self descriptors evaluate a volume's properties such as its size $\gamma_{si}(i)$, its total boundary $\gamma_{bo}(i)$, its normalized color histogram $\gamma_{h}(i)$ ($0 \le \gamma_{h}(i) \le 1$), and the number of frames $\gamma_{ex}(i)$ that the volume extends in the spatiotemporal data. Compactness $\gamma_{co}(i)$ is defined as

$$\gamma_{co}(i) = \frac{1}{\gamma_{ex}(i)} \sum_{t} \frac{\gamma_{si}(i, t)}{\gamma_{bo}(i, t)^2}, \tag{10}$$

where the framewise boundary $\gamma_{bo}(i, t)$ is squared to make the compactness score independent of the radius of the framewise region $\gamma_{si}(i, t)$ at frame $t$. (Consider the case of a disk: $\gamma_{co} = \pi r^2 / (2\pi r)^2 = 1/(4\pi)$.) Note that, in the spatiotemporal data, the most compact volume is a cylinder along the time axis, not a sphere. Elongated, sharp-pointed, shell-like, and thin shapes have lower compactness scores. However, the compactness score is sensitive to boundary irregularities.
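As a small illustration, the compactness of equation (10) can be computed from per-frame masks of a volume; estimating the perimeter by counting boundary pixels is one of several reasonable conventions and an assumption on our part.

```python
import numpy as np

def compactness(volume_mask):
    """Compactness gamma_co of eq. (10) for a boolean (T, H, W) volume mask."""
    scores = []
    for frame in volume_mask:
        area = int(frame.sum())
        if area == 0:
            continue
        # boundary pixels: foreground pixels with at least one background 4-neighbor
        padded = np.pad(frame, 1, constant_values=False)
        interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                    padded[1:-1, :-2] & padded[1:-1, 2:])
        perimeter = int((frame & ~interior).sum())
        if perimeter > 0:
            scores.append(area / perimeter**2)
    return sum(scores) / len(scores) if scores else 0.0
```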

The motion trajectory of a volume is defined as the localization of its framewise representative points. The representative point can be chosen as the center of mass, or it can be the intersection of the longest line within the volume's frame projection and another line that is longest in the perpendicular direction. We used the center of mass since it can be computed easily. The trajectory $T(i, t) = [T_i^x(t), T_i^y(t)]^T$ is calculated by computing the framewise averages of the volume's coordinates along the $x$ and $y$ directions. Sample trajectories are shown in Figure 9. Note that these trajectories do not involve any motion estimation. The trajectory approximates the translational motion in most cases. Translational motion is the easiest to be perceived by the human visual system, and for much the same reason it is the most discriminative in object recognition. The motion trajectory enables one to comprehend the motion of a volume between frames without requiring complex motion vector computation. It can also be used to initialize parameterized motion estimation to improve the accuracy and accelerate the speed.
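A short sketch of this trajectory computation from the label array used above; the volume id and array layout are the same assumptions as before.

```python
import numpy as np

def trajectory(labels, volume_id):
    """Framewise center of mass T(i, t) = [T_x(t), T_y(t)] of one volume."""
    traj = {}
    for t, frame in enumerate(labels):
        ys, xs = np.nonzero(frame == volume_id)
        if len(xs) > 0:                      # volume exists in this frame
            traj[t] = (xs.mean(), ys.mean())
    return traj
```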

The descriptor $\gamma_{tl}(i)$ measures the length of the trajectory. Volumes that are stationary with respect to the camera imaging plane have shorter trajectory lengths. The set of affine motion parameters $A(i, t) = [a_1(i, t), \ldots, a_6(i, t)]$ for a volume models the framewise motion

$$v(p) = \begin{bmatrix} a_1(i, t) & a_2(i, t) \\ a_4(i, t) & a_5(i, t) \end{bmatrix} p + \begin{bmatrix} a_3(i, t) \\ a_6(i, t) \end{bmatrix} - p, \tag{11}$$

where $v(p)$ are the motion vectors at $p$. To estimate these parameters, a certain number of feature points $p_f$ are selected for each region $R_i(t)$, and the corresponding motion vectors are computed. Feature points are selected among the high spatial energy points. The spatial energy of a point is defined in terms of color variance as

$$w(p, e) = \sum_{p} \sum_{k} \left(w(p, k) - w(p, \mu_k)\right)^2. \tag{12}$$

Above, $w(p, \mu_k)$ is the color mean of points in a small local window centered around $p$. After the $w(p, e)$'s are computed, the points of $R_i(t)$ are ordered with respect to their spatial energy magnitudes. The highest-ranked point on the list is assigned as a feature point $p_f$, and the neighboring points of $p_f$ are removed from the list. Then, the next highest-ranked point is chosen, until a certain number of points are selected. To estimate the motion vectors, we used phase correlation in which the search range is constrained around the trajectory $T(i, t)$. Given motion vectors $\hat{v}(p_f)$, the affine model is fitted by minimizing

$$A(i, t) = \arg\min \sum_{p_f} \log\left(1 + \left|v(p_f) - \hat{v}(p_f)\right|\right), \tag{13}$$

where $v(p_f)$ are the affine-projected motion vectors as given in (11) and $\hat{v}(p_f)$ are the motion vectors estimated by phase correlation at the feature points $p_f$. The logarithm term works as a robust estimator which can detect and reject the measurement outliers that violate the motion model. We used the downhill simplex method for minimization. To reduce the load of the above computationally intensive motion vector and parameter estimation procedures, we used only up to 20 points to estimate the parameters. Note that the motion parameters are estimated for only a small number of volumes, usually between 10 and 100, after the volume refinement stage.

Figure 9: Sample trajectories of Children and Foreman.
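A sketch of the robust affine fit of equation (13) using SciPy's Nelder-Mead (downhill simplex) minimizer; the feature points and measured motion vectors are assumed to be given as (M, 2) arrays, and starting from the identity transform is an illustrative choice.

```python
import numpy as np
from scipy.optimize import minimize

def fit_affine(points, measured_vectors):
    """Fit a1..a6 of eq. (11) by minimizing the robust log cost of eq. (13).

    points:            (M, 2) array of feature point coordinates p_f = (x, y).
    measured_vectors:  (M, 2) array of motion vectors from phase correlation.
    """
    def cost(a):
        A = np.array([[a[0], a[1]], [a[3], a[4]]])
        b = np.array([a[2], a[5]])
        predicted = points @ A.T + b - points          # v(p) from eq. (11)
        residuals = np.linalg.norm(predicted - measured_vectors, axis=1)
        return np.sum(np.log1p(residuals))             # sum of log(1 + |v - v_hat|)
    a0 = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])      # identity motion as the start
    result = minimize(cost, a0, method="Nelder-Mead")
    return result.x
```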

The frame difference descriptor $\gamma_{\delta}(i)$ is proportional to the amount of color change in the volume after trajectory motion compensation:

$$\gamma_{\delta}(i) = \frac{1}{\gamma_{si}(i)} \sum_{p \in V_i} \delta\left(x - T_i^x(t),\; y - T_i^y(t),\; t\right), \tag{14}$$

where the frame difference score $\delta$ is given as in (1). We present truncated frame difference scores in Figure 10. The skin color descriptor $\gamma_{\rho}(i)$ is computed similarly:

$$\gamma_{\rho}(i) = \frac{1}{\gamma_{si}(i)} \sum_{p \in V_i} \rho(p), \tag{15}$$

where $\rho(p)$ is the skin color score as explained in Section 3.3 and $\gamma_{si}(i)$ is the size of the volume.


5.2. Relational descriptors

These descriptors evaluate the correlation between a pair of volumes $V_i$ and $V_j$. The mutual trajectory distance $\Delta(i, j, t)$ is one of the motion-based relative descriptors. It is calculated by

$$\Delta(i, j, t) = \left|T(i, t) - T(j, t)\right|. \tag{16}$$

The mean trajectory distance $\Gamma_{\mu}(i, j)$ measures the average distance between the trajectories, and $\Gamma_{\sigma}(i, j)$ is the variance of the distance $\Delta(i, j, t)$. A small variance means two volumes have similar translational motion, and a large variance reveals volumes having different motion, that is, getting away from each other or moving in opposite directions, etc. One exception occurs in the case of a large background, since its trajectory usually falls on the center of the frames. To distinguish volumes that have small motion variances but opposite motion directions, for example, two volumes turning around a mutual axis, the directional difference $\Gamma_{dd}(i, j)$ can also be defined. The parameterized motion similarity is measured by $\Gamma_{pm}(i, j)$:

$$\Gamma_{pm}(i, j) = \sum_{t}\left[\, c_R \sum_{n=1,2,4,5} \left|a_n(i, t) - a_n(j, t)\right| + c_T \sum_{n=3,6} \left|a_n(i, t) - a_n(j, t)\right|\right], \tag{17}$$

where the constants are set as cT � cR to take into ac-count of the fact that a small change in the parameters an,n = 1, 2, 3, 4, can lead to much larger difference in the mod-eled motion field than the translation parameters a5, a6. Thecompactness ratio Γcr(i, j) of a pair of volumes is the amountof the change on the total compactness before and after thetwo volumes merge:

$$\Gamma_{cr}(i, j) = \frac{\gamma_{co}\left(V_i \cup V_j\right)}{\gamma_{co}(i) + \gamma_{co}(j)}, \tag{18}$$

where a small $\Gamma_{cr}(i, j)$ means the merging of $V_i$ and $V_j$ will generate a less compact volume. Another shape-related descriptor, $\Gamma_{br}(i, j)$, is the ratio of the mutual boundary of two volumes $V_i$ and $V_j$ to the boundary of volume $V_i$. The color difference descriptor $\Gamma_{cd}(i, j)$ gives the sum of the differences between the color histograms, the mutual existence $\Gamma_{ex}(i, j)$ counts the number of frames in which both volumes exist, and $\Gamma_{ne}(i, j)$ shows whether the volumes are adjoint. Similarly, $\Gamma_{\rho}(i, j)$ shows the difference in the skin color scores between the volumes, and $\Gamma_{fd}(i, j)$ gives the difference in the change detection scores.
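A small sketch of the trajectory-based relational descriptors Γ_μ and Γ_σ, reusing the trajectory dictionaries computed earlier; restricting to frames where both volumes exist is an assumption consistent with the mutual existence descriptor.

```python
import numpy as np

def trajectory_descriptors(traj_i, traj_j):
    """Mean and variance of the mutual trajectory distance around eq. (16)."""
    common = sorted(set(traj_i) & set(traj_j))       # frames where both volumes exist
    if not common:
        return None, None, 0
    dists = np.array([np.hypot(traj_i[t][0] - traj_j[t][0],
                               traj_i[t][1] - traj_j[t][1]) for t in common])
    gamma_mu = float(dists.mean())                   # average trajectory distance
    gamma_sigma = float(dists.var())                 # variance: similar motion -> small
    gamma_ex = len(common)                           # mutual existence count
    return gamma_mu, gamma_sigma, gamma_ex
```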

6. FINE-TO-COARSE CLUSTERING

As described in the general framework, the volumes are clustered into objects using their descriptors. Different approaches to clustering data can be categorized as hierarchical and partitional approaches. Hierarchical methods produce a nested series of partitions, while a partitional clustering algorithm obtains a single partition of the data. Merging the volumes in a fine-to-coarse manner is an example of the hierarchical approaches. Grouping volumes using an adaptive k-means method in a coarse-to-fine manner is an example of the partitional approaches, as illustrated in Figure 11.

Figure 10: Frame difference score δ(p) for Foreman, Akiyo, and Head. Frame difference indicates the amount of motion for certain cases.

Figure 11: (a) Coarse-to-fine (k-means, GLA, quad tree) and (b) fine-to-coarse clustering. The first approach divides the volumes into a certain number of clusters at each time; the second merges a pair of volumes at each level.

In the fine-to-coarse merging method, the determination of the most similar volumes is done iteratively. At each iteration, all the possible volume combinations are evaluated. The pair having the highest similarity score is merged, and the affected descriptors are updated. A similar morphological image segmentation approach using such hierarchical clustering is presented in [6].

Detection of a semantic object requires explicit knowledge of specific object characteristics. Therefore, the user has to decide which criteria dictate the similarity of volumes. It is the semantic information that is being incorporated at this stage of the segmentation. We designed the segmentation framework such that most of the important object characteristics are available to the user in terms of the self and relational descriptors. Other characteristics can be included easily without changing the overall architecture. Furthermore, the computational load of building objects from the volumes is minimized significantly by transferring the descriptor extraction to the previous automatic stages.

The following observations are made on the similarity of two volumes.

(1) Two volumes are similar if their motion is similar. In other words, volumes having similar motion construct the same object. A stationary region has a high probability of being in the same object as another region that is stationary, for example, a tree and a house in the same scene. We already measured the motion similarity of two volumes in terms of the motion-based relational descriptors $\Gamma_{\sigma}(i, j)$, $\Gamma_{dd}(i, j)$, and $\Gamma_{pm}(i, j)$. These descriptors can be incorporated in the similarity definition. However, without using further intelligent models, it is not straightforward to distinguish objects with similar motion.

(2) Objects tend to be compact. A human face, a car, a flag, and a soccer ball are all compact objects. On the other hand, a car in a surveillance video is formed by separate elongated smaller regions. The shape of a volume gives clues about its identity. We captured shape information in the descriptors $\Gamma_{cr}(i, j)$ and $\Gamma_{br}(i, j)$, and also in the volume boundary itself. Note that the compactness ratio must be used with caution in merging volumes. If a volume is enclosing another volume, their merge will increase compactness whether or not these two volumes correspond to the same object. Furthermore, many objects such as cloud formations, walking people, and so forth are not compact. To improve the success of shape-based object clustering, application-specific criteria should be used, for example, a human model for videoconferencing.

(3) Objects have connected parts. This is obvious in most cases (an animal, a car, a plane, a human, and so forth), unless an object is only partially visible. We begin the evaluation of similarity with the volumes that are neighbors of each other. The neighborhood constraint is useful, and yet it can easily deteriorate the segmentation accuracy in the case of undersegmentation, that is, when the background encloses most of the volumes.

(4) An object moves as a whole. Although this statement is not always true for human objects, it is useful for rigid bodies. The change detection descriptor becomes very useful in constructing objects that are moving in front of a stationary background.

(5) Each volume already has a consistent color by construction; therefore, there is little room for utilization of color information to determine a neighbor to merge with. In fact, most objects are made from small volumes that have different colors, for example, the human body constituents: face, hair, dress, and so forth. When forming the similarity measure, color should not be a key factor. However, for specific video sequences featuring people, human skin color is an important factor.

(6) Important objects tend to be at the center. Good examples can be found in head-and-shoulder sequences, sports, and so forth.

To blend all the above observations and statements, we evaluate the likelihood of a volume merge given the relevant descriptors. For this purpose, we define a similarity score

$$P_{*}(V_{i,j}) \equiv \frac{\Gamma_{*}(i, j)}{\sum_{m,n} \Gamma_{*}(m, n)}. \tag{19}$$

Alternatively, $P_*(V_{i,j})$ can be defined using a ranking-based similarity measure. For all possible neighboring volume pairs, the relevant relative descriptors are ordered in separate lists in either descending or ascending order. For example, $L_{\sigma}(i, j)$ returns a number indicating the rank of the descriptor $\Gamma_{\sigma}(i, j)$ in its ordered list. Using the ranks in the corresponding lists, the likelihood is computed as

$$P_*(V_{i,j}) \equiv 1 - \frac{2\, L_*(i, j)}{l_*\,(l_* + 1)}, \tag{20}$$

where the length of the list $L_*$ is $l_*$. The similarity based on all descriptors is defined as

$$P(V_{i,j}) = \sum_{*:\, \sigma, dd, \ldots} \lambda_*\, P_*(V_{i,j}), \tag{21}$$

where the constant multipliers $\lambda$ are used to normalize and adjust the contribution of each descriptor. These multipliers can be adapted to specific applications as well. To detect a human face, the skin color descriptor $\Gamma_{\rho}(i, j)$ can be included in the above formula. Similarly, if we are interested in finding moving objects in a stationary camera setup, but trajectory or parametric modeling is not sufficient to obtain an accurate motion representation, the frame difference descriptor $\gamma_{\delta}(i)$ becomes an adequate source.

The pair having the highest similarity score is merged, and the descriptors of the volumes are updated accordingly. Clustering is performed until there are only two volumes remaining. At each level of the clustering algorithm, we can analyze whether the chosen volume pair is a good choice. This can be done by observing the behaviour of the similarity score of the selected merge. If this score gets small or shows a sudden drop, the merge is likely not a valid merge, although it is the best available merge.
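A schematic sketch of this fine-to-coarse merging loop; pair_similarity stands for the combined score of equation (21) and merge_volumes for the descriptor update, both placeholders for the paper's actual descriptor machinery.

```python
def fine_to_coarse_clustering(volumes, pair_similarity, merge_volumes, min_volumes=2):
    """Iteratively merge the most similar pair of volumes until min_volumes remain.

    volumes:          list of volume records (descriptors, neighbors, ...).
    pair_similarity:  callable (vi, vj) -> score, e.g. the weighted sum of eq. (21).
    merge_volumes:    callable (vi, vj) -> merged volume with updated descriptors.
    Returns the merge history, which defines the multiresolution object tree.
    """
    history = []
    volumes = list(volumes)
    while len(volumes) > min_volumes:
        # evaluate every possible pair and take the best merge
        best_pair, best_score = None, float("-inf")
        for a in range(len(volumes)):
            for b in range(a + 1, len(volumes)):
                score = pair_similarity(volumes[a], volumes[b])
                if score > best_score:
                    best_pair, best_score = (a, b), score
        a, b = best_pair
        merged = merge_volumes(volumes[a], volumes[b])
        history.append((volumes[a], volumes[b], best_score))
        volumes = [v for i, v in enumerate(volumes) if i not in best_pair] + [merged]
    return history, volumes
```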

The segmentation algorithm supplies volumes, their attributes, and information about how these volumes can be merged. Since a human is the ultimate decision maker in analyzing the results of video segmentation, it is necessary to provide the segmentation results in an appropriate format to the user or another decision mechanism for further analysis. We use an object tree structure to represent the segmentation results, as demonstrated in Figure 12. In this representation, the video is divided into objects, and objects into volumes. At the lowest, volume level, the descriptors and boundaries are available. Volumes are homogeneous in color and texture, and they are connected within. The clustering step generates higher levels that are consistent in motion. The user can choose the segmentation result at different levels based on the desired level of detail. In case a user wants to change the criteria used to cluster volumes, only the clustering stage needs to be executed with the new criteria, for example, different weights for the descriptors, which is computationally simple.

Figure 12: Multiresolution partition of objects in a hierarchical tree representation. The video is split into foreground and background, then into objects and volumes; the higher levels are characterized by slow motion, spatial position, large volume, change ratio, and consistent motion, while volumes are characterized by uniform color, uniform texture, and spatial connectivity.

Figure 13: Results at object levels 13, 10, 8, 6, 3, and 2 for Akiyo using the frame difference descriptor.

The corresponding objects at various object levels of the multiresolution object tree are presented in Figures 13 and 14. The descriptor multipliers are set as $\lambda_{fd} = \lambda_{\rho} = \lambda_{cr} = \lambda_{br} = 1$, $\lambda_{\text{others}} = 0$ for Akiyo, since we intended to find a human head having very slow nonrigid motion; $\lambda_{\mu} = \lambda_{cr} = \lambda_{br} = 1$, $\lambda_{\text{others}} = 0$ for Bream, since motion is the most discriminating visual feature for the fish; and $\lambda_{\mu} = \lambda_{\rho} = \lambda_{cr} = \lambda_{br} = 1$, $\lambda_{\text{others}} = 0$ for Children, since objects are defined as moving regions that have human skin colors. Hierarchical clustering finds the mouth of the speaker in Akiyo as the most different object, since it has the highest frame difference and skin color scores. At the consequent levels of the multiresolution tree, the face and suit follow for the same reason. For Children, the red ball has the most discriminating motion among all the objects, and the proposed video object segmentation (VOS) method correctly put it on the top level of the multiresolution tree (Figure 14). As visible, volume growing accurately detects the object boundaries as a result of the adaptive color distance threshold assignment.

7. EXPERIMENTAL RESULTS

We selected a version of the proposed VOS framework to be used as a reference considering the computational simplicity; that is, texture features and motion parameters are omitted. Centroid linkage is used to grow volumes, and the 1D histogram-based formulation (8) is applied to compute the color distance. The intra-inter switching method is preferred to prevent a volume from having disconnected regions.

We also implemented two other state-of-the-art semiautomatic trackers to provide a detailed comparison of the proposed method with others.

Figure 14: Results at object levels 12, 10, 8, 7, 6, 5, 4, 3, and 2 for the sequence Children using the trajectory distance variance descriptor.

7.1. Reference methods

Active MPEG-4 object segmentation (AMOS)

We used a semiautomatic video object segmentation algorithm [23, 24] to compare our results. This algorithm requires the initial object definition, that is, the object boundary, to be provided by the user via mouse-selected points around the target object. Then a snake algorithm refines the user input to fit a smooth boundary. The initial object is generated through a region segmentation and aggregation process. To extract regions homogeneous in both color and motion, motion segmentation based on a dense motion field is used to further split the color regions. Homogeneous regions are classified as either foreground or background to form the object. Region aggregation is based on the coverage of each region by the initial object mask: regions that are covered more than a certain percentage are grouped into the foreground object. The final contour of the semantic object is computed from the foreground regions. Tracking is done at both the region and object levels. Segmented regions from the previous frame are first projected to the current frame using their individual 2D affine models with 6 parameters. An expanded bounding box including all projected foreground regions is computed. Then the area inside the bounding box is split into homogeneous color and motion regions following a region tracking process. Pixels that cannot be tracked from any old regions are labeled as new regions. Thus the resulting homogeneous regions are tagged either foreground (meaning tracked from a foreground region), background (meaning tracked from a background region), or new (meaning not tracked). They are then passed to an aggregation process and classified as belonging to either the foreground object or the background. To handle possible motion estimation errors, the aggregation process is carried out iteratively. Finally, the object contour is computed from the foreground regions.

This technique is very similar to the system explained in the COST-211 project [25].
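To make the region projection step concrete, the following minimal Python sketch (an illustration only, not the AMOS implementation; the function name and parameter ordering are our own) applies a 6-parameter 2D affine model to the pixel coordinates of a region:

```python
import numpy as np

def project_region_affine(points: np.ndarray, params) -> np.ndarray:
    """Project region pixel coordinates with a 6-parameter 2D affine model.

    points: (N, 2) array of (x, y) coordinates from the previous frame.
    params: (a1, ..., a6) such that x' = a1*x + a2*y + a3 and
            y' = a4*x + a5*y + a6.
    """
    a1, a2, a3, a4, a5, a6 = params
    x, y = points[:, 0], points[:, 1]
    return np.stack([a1 * x + a2 * y + a3,
                     a4 * x + a5 * y + a6], axis=1)
```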


Self-affine mapping tracker (SAM)

We also made comparisons with another semiautomatic tracker [26] in which the initial boundary was entered by painting a region instead of mouse clicks. The concept of this method is quite different from that of the snake method. A self-affine mapping system, instead of an energy minimization procedure, is used to approach and fit the roughly drawn line to the object contour. The object contour is extracted as a self-similar curve instead of a smooth curve. The self-affine map's parameters are detected by analyzing the blockwise self-similarity of an image using a simplified algorithm from fractal encoding.

7.2. Performance measures

As explained in [27], comparative assessment of segmentation algorithms is often based upon subjective judgement, which is qualitative and time consuming. Although several measures can be applied in the presence of a ground-truth mask, the generation of ground truth requires significant effort and is often limited to foreground-background type object segmentation. Selection of a conventional ground truth for multilevel object extraction algorithms may not be possible. For instance, what should be assigned as ground truth for the 3-object level of the Children sequence: two boys and the background, or one boy, the ball, and the background, or some other possible combination? Should the two boys constitute a single object, or should they be considered separate entities? For the two-object case, we hand segmented the foreground object using the AMOS method since it is semiautomatic. However, we stopped the tracker whenever it made an error and corrected the object boundary accordingly. We observed that even for experienced users and careful initialization, the generation of ground truth is very exhausting, and it takes more than 20 seconds for a single frame on average.

Using the binary ground truth G(p) = (1: object, 0: background), we calculate a point misclassification score E_pixel(t) at frame t as follows:

\[ E_{\text{pixel}}(t) = \frac{1}{|R(t)|} \sum_{p} \bigl| G(p) - R(p) \bigr|, \tag{22} \]

where R(p) = 1 if the point p is inside the object, and |R(t)| is the number of points inside the object. This measure computes the ratio of the misclassified points to the total number of object points in the current frame.
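As a minimal sketch (Python/NumPy, not part of the original implementation), the score of (22) can be computed from two binary masks as follows:

```python
import numpy as np

def pixel_misclassification(ground_truth: np.ndarray, result: np.ndarray) -> float:
    """Point misclassification score E_pixel(t) of (22).

    ground_truth, result: binary masks of the same frame
    (1 = object, 0 = background); the mismatch count is normalized
    by the number of points inside the segmented object, |R(t)|.
    """
    mismatched = np.abs(ground_truth.astype(int) - result.astype(int)).sum()
    object_size = int(result.sum())  # |R(t)|
    return mismatched / object_size if object_size > 0 else float("inf")
```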

In addition to the ground-truth-based measure, we use three other color- and motion-based performance measures: spatial color distance, temporal histogram distance, and motion distance. These measures do not require a ground truth and depend on the following assumptions: object boundaries coincide with both color and motion boundaries, and the color histogram of the object is stationary from frame to frame. In order to measure the spatial color difference, a set of probe points just inside and just outside of the object is selected. For points p_out, p_in that are on opposite sides of the object boundary and at an equal distance from it, the averaged colors I(p_out) and I(p_in) are computed in the M × M neighborhoods of the corresponding points. The color difference measure along the boundary is calculated as

\[ E_{\text{spacol}}(t) = 1 - \frac{1}{|B(t)|} \sum_{p \in B(t)} \bigl| I(p_{\text{out}}) - I(p_{\text{in}}) \bigr|, \tag{23} \]

where |B(t)| is the total length of the object boundary B(t) in frame t. When the location of the object boundary is estimated correctly, we expect the spatial color measure E_spacol to take a small value. However, the converse of this statement is not necessarily true; that is, if the spatial color measure has a small value, this does not imply that the object boundary is located correctly. This color measure is expected to be reliable when the object and background textures are not cluttered and when the color contrast across the boundary is high.
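The probe construction can be realized in several ways; the sketch below (an assumption-laden illustration rather than the original code) approximates the inner and outer probes by stepping a fixed distance along the boundary normal estimated from a smoothed object mask, with image intensities assumed to be normalized to [0, 1]:

```python
import numpy as np
from scipy import ndimage

def spatial_color_measure(image, mask, probe_dist=2, M=3):
    """Approximate E_spacol(t) of (23) for one frame.

    image: float array in [0, 1], shape (H, W) or (H, W, C).
    mask:  binary object mask, shape (H, W).
    """
    mask = mask.astype(bool)
    # M x M local color averages used for I(p_in) and I(p_out).
    if image.ndim == 2:
        avg = ndimage.uniform_filter(image, size=M)
    else:
        avg = np.stack([ndimage.uniform_filter(image[..., c], size=M)
                        for c in range(image.shape[-1])], axis=-1)

    # Boundary pixels: object pixels with at least one background neighbor.
    boundary = mask & ~ndimage.binary_erosion(mask)
    ys, xs = np.nonzero(boundary)

    # Inward/outward directions from the gradient of a smoothed mask.
    smooth = ndimage.gaussian_filter(mask.astype(float), sigma=2.0)
    gy, gx = np.gradient(smooth)

    h, w = mask.shape
    diffs = []
    for y, x in zip(ys, xs):
        norm = np.hypot(gy[y, x], gx[y, x])
        if norm < 1e-6:
            continue
        dy, dx = gy[y, x] / norm, gx[y, x] / norm   # points toward the object
        yi, xi = int(round(y + probe_dist * dy)), int(round(x + probe_dist * dx))
        yo, xo = int(round(y - probe_dist * dy)), int(round(x - probe_dist * dx))
        if 0 <= yi < h and 0 <= xi < w and 0 <= yo < h and 0 <= xo < w:
            diffs.append(np.abs(avg[yi, xi] - avg[yo, xo]).mean())
    return 1.0 - float(np.mean(diffs)) if diffs else 1.0
```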

A straightforward way to assess the temporal changes in the segmented object is to calculate the pairwise color histogram differences of the objects at times t and t − 1. However, a drawback of this approach is that it may not catch a gradual deterioration. Therefore, we alternatively check the histogram difference between the first and the current object regions. This method penalizes the cumulative difference effect of the previous approach. The temporal histogram difference measure is defined as

\[ E_{\text{hist}}(t) = 1 - \bigl| \gamma_h(i, t) - \gamma_h(i, 1) \bigr|, \tag{24} \]

where γ_h(i, 1) is the normalized framewise color histogram of object i at the first frame t = 1. We used the foreground object for the results presented in Figures 15, 16, and 17. In order to quantify how well the estimated object boundary coincides with the actual motion boundaries, we adopt the geometry of the probes used for the spatial color difference and consider the difference of the average motion vectors in the neighborhoods of the points. The motion measure for frame t is estimated as follows:

\[ E_{\text{motion}}(t) = 1 - \frac{1}{|B(t)|} \sum_{p \in B(t)} \Bigl( 1 - e^{-\left| v(p_{\text{out}}) - v(p_{\text{in}}) \right|} \Bigr). \tag{25} \]

The motion difference can sometimes be large, not because of errors in segmentation, but as a consequence of the fact that not all parts of the object are moving or undergoing a uniform translational motion.
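Given a dense motion field and the same inside/outside probe pairs as above, the motion measure (25) reduces to a few lines; the sketch below assumes the probe lists are supplied by the caller:

```python
import numpy as np

def motion_boundary_measure(flow, probes_in, probes_out):
    """Motion difference measure E_motion(t) of (25).

    flow: dense motion field of shape (H, W, 2).
    probes_in, probes_out: matching lists of (y, x) probe points just
    inside and just outside the object boundary.
    """
    terms = [1.0 - np.exp(-np.linalg.norm(flow[yo, xo] - flow[yi, xi]))
             for (yi, xi), (yo, xo) in zip(probes_in, probes_out)]
    # Small values indicate the boundary lies on a strong motion edge.
    return 1.0 - float(np.mean(terms)) if terms else 1.0
```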

7.3. Ground truth for motion field

We implemented a dense optical flow estimation method [28, 29] to generate the ground-truth motion vectors as illustrated in Figure 18. This is done only for comparison, and this dense field is not a part of the proposed segmentation framework. Instead of simple block matching, we used phase correlation, which is a frequency-domain motion measurement method that makes use of the shift property of the Fourier transform. Phase correlation takes advantage of the fact that a shift in the spatial domain is equivalent to a phase shift in the frequency domain.


Figure 15: Comparison of (a) the spatial color distance, (b) temporal histogram distance, and (c) motion distance measures for Akiyo.

Figure 16: Comparison of (a) the spatial color distance, (b) temporal histogram distance, and (c) motion distance measures for Bream. When AMOS cut off most of the fish, its spatial color and temporal histogram errors became very large in comparison to VOS.


Figure 17: Comparison of (a) the spatial color distance, (b) temporal histogram distance, and (c) motion distance measures for Children.

Using the rotation and scale properties of the Fourier transform, it is possible to find the rotation and scale as a shift in the frequency domain, invariant to any translation.

Figure 18: Estimated ground-truth motion vectors using phase correlation for frame 101 of (a) Akiyo, (b) Bream, and (c) Children.

We first window both images due to the repeating nature of the frequency spectrum and calculate their Fourier transforms. We filter out the DC component and any high-frequency noise, and then calculate the normalized cross power spectrum. We take its inverse Fourier transform and find the peak on the correlation surface. Finally, an interpolation is carried out on the surface to achieve subpixel accuracy. Phase correlation is limited by the number of samples that the Fourier transform can use, which limits the resolution in the frequency domain. Therefore, the block size is chosen as 32 × 32.
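The following NumPy sketch illustrates the core of the block-wise phase correlation described above (an illustration only, not the authors' implementation; the windowing choice and the omission of subpixel interpolation are simplifications):

```python
import numpy as np

def phase_correlation_shift(block_a, block_b):
    """Estimate the integer-pixel translation between two equally sized
    blocks by phase correlation. The peak location gives the displacement
    between the blocks (the sign depends on which block is the reference)."""
    # Window both blocks to suppress the wrap-around effects of the FFT.
    win = np.outer(np.hanning(block_a.shape[0]), np.hanning(block_a.shape[1]))
    fa = np.fft.fft2(block_a * win)
    fb = np.fft.fft2(block_b * win)

    # Normalized cross power spectrum; remove the DC component.
    cross = fa * np.conj(fb)
    cross /= np.abs(cross) + 1e-12
    cross[0, 0] = 0.0

    # Peak of the correlation surface locates the shift.
    surface = np.real(np.fft.ifft2(cross))
    dy, dx = np.unravel_index(np.argmax(surface), surface.shape)

    # Map peaks in the upper half of the surface to negative shifts.
    if dy > block_a.shape[0] // 2:
        dy -= block_a.shape[0]
    if dx > block_a.shape[1] // 2:
        dx -= block_a.shape[1]
    return dy, dx
```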


Figure 19: Average processing times of different components for a single frame. Preprocessing includes filtering and threshold adaptation. Volume growing includes marker selection and one-at-a-time growing. Postprocessing includes volume refinement and descriptor extraction.

7.4. Discussion on results

We extensively tested the proposed algorithm and the reference methods. For the AMOS method, we carefully marked the initial boundary by mouse clicking on more than 50 boundary points. The initial boundary was aligned on the object as closely as possible. Then we segmented the sequence for a total of 136 frames. We generated the spatiotemporal data for the same-sized video and ran the automatic segmentation as mentioned before.

For the VOS method results presented in this section, we did not fine-tune any parameters but only modified the multipliers in the clustering stage since they are related to the semantic definition, which differs for each sequence. We set the multipliers of the hierarchical clustering stage as λ_ρ = 1, λ_others = 0 for Akiyo, λ_µ = 1, λ_others = 0 for Bream, and λ_µ = 1, λ_others = 0 for Children to be able to extract objects semantically similar to the hand-generated ground truth, that is, the face, the fish, the children, and the ball.

We performed experiments on 320 × 240 YUV video on a P4 1.8 GHz CPU. In Figure 19, we show the average processing time for each module of the proposed method for various test sequences. The differences among the processing times are a result of the spatial color distribution and the number of small volumes going into the volume refinement. For instance, for the Bream sequence, the fine texture on the fish causes several small volumes to be removed. On the other hand, the smooth background and the relatively larger volumes after volume growing keep the computational time low for Akiyo. Table 2 shows the averaged CPU processing time of a frame and the preparation time required before the segmentation for the semiautomatic methods. For a small number of experienced users, we counted the initial boundary marking time for the reference methods.

Table 2: Processing times of a single frame.

            Processing                     Preparation
            VOS      AMOS    SAM           VOS     AMOS   SAM
Akiyo       86 ms    2.1 s   70 ms         27 ms   36 s   25 s
Bream       128 ms   6.3 s   113 ms        25 ms   45 s   35 s
Children    125 ms   5.5 s   120 ms        25 ms   55 s   35 s
Mother      68 ms    7.2 s   123 ms        25 ms   40 s   23 s
Stefan      157 ms   2.2 s   87 ms         28 ms   30 s   25 s

As presented in the table, most users spend more than 30 seconds to enter the initial object boundary for the AMOS and SAM methods. The preparation time for the VOS method indicates the time required for threshold adaptation and memory handling before the segmentation. We observed that the SAM and VOS methods have similar speeds (around 100 milliseconds per frame), although the SAM algorithm requires an additional 30 seconds for boundary initialization. Moreover, we observed that the segmentation results of SAM deteriorated after only a small number of frames (around 10) and required halting the tracking process and correcting the boundary. The AMOS method needs more time to process a frame (more than 2 seconds) but is more stable. Thus, we compared the segmentation accuracy with the better-performing AMOS method.

We present the segmentation results in Figure 20. The proposed method consistently produces both visually and quantitatively better results. In Figure 21, the misclassification errors are plotted for VOS (blue) and AMOS (red). For Akiyo, the error scores are similar due to the minor differences between the extracted boundaries. However, the semiautomatic method (AMOS) fails to maintain the correct boundary on the left side of the head and starts expanding after a certain number of frames. This also happens whenever the object moves fast, which causes the tracker to miss part of the object, as in the case of Children when the boy on the left suddenly kneels down. For Bream, the proposed method manages to detect the correct boundary even when the fish changes its direction. On the other hand, the motion estimation and boundary fitting mechanisms of AMOS cannot compensate for this movement; as a result, the object boundary is significantly deformed. One shortcoming of the proposed method is that the volume refinement process may drop a grown volume if it fails to satisfy the size criterion. For instance, the size of the volume corresponding to the feet of the boy on the left in Children is less than the threshold, thus it is not included among the volumes sent to the clustering stage. Still, it is evident that the proposed algorithm produces results superior to those of the reference method.

The computed point discrepancy measure (given in Figure 21) also confirms these observations. We used (22) to find the misclassification scores.

In Figures 15, 16, and 17, we present the results of the performance measures that do not require a ground truth. These graphs confirm the ground-truth results, although under certain conditions the sensitivity of the motion and temporal color distance measures is limited.


Figure 20: Segmented objects for frames 1, 26, 101, 116, and 136 of the test sequences. Ground truth is marked by a red boundary in the original images. The red areas in the segmented images show undersegmented pixels which are missed. The cyan areas correspond to oversegmented regions where the algorithm exceeded the object boundary. White + cyan areas show what the segmentation generates.

In Figure 22, we plot the performance measures versus object levels and frame numbers. As visible, the errors decrease for most frames as the number of objects gets smaller until it reaches 2, which was the intention of the foreground/background segmentation. However, we observed that the measures do not always comply with this observation since they depend on the previously described assumptions.

In Figure 23, the highest similarity scores P(V_i, V_j) at each object level are plotted for different test sequences. One hypothesis is that if the clustering stage "accurately" merges two volumes at the current level (k), the highest likelihood at the next object level (k − 1) will be less than the highest value at the current level (k). Otherwise, a possible merge with a higher likelihood value would have been missed, since it would be encountered in the following level (that is how we detect the existence of such a merge).


Figure 21: Misclassification errors (22) of the proposed segmentation framework (VOS) and a semiautomatic method (AMOS) using manually extracted ground truths for (a) Akiyo, (b) Bream, and (c) Children.

This hypothesis is justified by the object level versus performance measure plot shown in Figure 23.

Figure 22: Performance measures for different object levels in the hierarchical clustering: (a) the motion distance for Akiyo, (b) the spatial color distance for Bream, and (c) the temporal histogram distance for Children. As expected, the temporal histogram errors consistently dropped for the smaller object numbers.

These plots show that the highest likelihood drops as the object level decreases, which also indicates that the merging process works accurately.

We also analyzed the effects of the color quantization as shown in Figure 24. By quantizing the 3D color space into 256 levels, we are able to decrease the computational load by 15% without causing a degradation of the segmentation performance.


Figure 23: The highest similarity score monotonically decreases as the volumes are merged. Note that after volume growing, the number of volumes is different for each sequence. Large decreases indicate potentially weak merges.

Figure 24: Effects of quantization by 256, 64, and 16 dominant colors. Quantization decreases the computational load. However, with the decreasing number of quantization levels, the extracted volume boundaries become more sensitive to quantization errors. First row: 256 color levels, Head sequence, for 17, 6, and 2 objects after clustering. Second row: 64 levels for 10, 3, and 2 objects. Third row: 16 levels for 11, 4, and 2 objects. Fourth row: 32 levels, Akiyo, for 18, 6, and 2 objects. Last row: 16 levels for 11, 4, and 2 objects.

This gain is a result of using shorter data structures for memory handling in the implementation. Further quantization, that is, into 64 and 32 levels, requires platform-specific data structures. Severe quantization, that is, into 16 and 4 levels, significantly disturbs the volume boundaries and washes out skin colors.
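The paper does not prescribe a particular quantizer; as one possible sketch, dominant colors can be obtained by coarse uniform binning of the color space followed by nearest-dominant-color assignment (the function and parameter names below are our own):

```python
import numpy as np

def quantize_to_dominant_colors(frame, num_colors=256):
    """Map each pixel of an (H, W, 3) uint8 frame to one of num_colors
    dominant colors obtained from the most populated coarse color bins."""
    pixels = frame.reshape(-1, 3).astype(np.int32)

    # Coarse uniform binning: 16 levels per channel -> 4096 candidate bins.
    bins = pixels // 16
    bin_ids = bins[:, 0] * 256 + bins[:, 1] * 16 + bins[:, 2]
    counts = np.bincount(bin_ids, minlength=16 ** 3)

    # Centers of the most populated bins serve as the dominant colors.
    top = np.argsort(counts)[::-1][:num_colors]
    centers = np.stack([top // 256, (top // 16) % 16, top % 16], axis=1) * 16 + 8

    # Assign every pixel to its nearest dominant color (chunked for memory).
    quantized = np.empty_like(pixels)
    step = 4096
    for start in range(0, pixels.shape[0], step):
        chunk = pixels[start:start + step]
        dist = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        quantized[start:start + step] = centers[np.argmin(dist, axis=1)]
    return quantized.reshape(frame.shape).astype(np.uint8)
```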

8. SUMMARY

We introduced an automatic segmentation framework. The main stages of the presented framework are filtering and simplifying color distributions, calculating feature vectors, assigning markers as seeds of volumes, volume growing, removing volume irregularities, deriving self and relational descriptors of volumes, and clustering volumes into a multiresolution object tree. Several alternatives for each of the preceding stages have been explored.

For volume growing, we discussed several linkage methods: single linkage, dual linkage, and centroid linkage. We also proposed threshold adaptation techniques for the centroid-linkage method. Furthermore, we compared various modes of volume growing. Among these, the simultaneous growing and one-at-a-time growing methods basically differ in the number of markers that are active at each iteration. The recursive diffusion and intraframe/interframe switching methods offer different expansion mechanisms. We assigned self descriptors to quantify individual volumes. We also introduced the relational descriptor concept, which evaluates the similarity between a pair of volumes. In addition to descriptors that capture general attributes such as motion and shape, we discussed ways to integrate application-specific features, such as skin color and frame difference, into the descriptors. A hierarchical clustering approach was adapted to group volumes into objects. We used a rank-based similarity measure of volumes. We proposed a multiresolution object tree representation as the output of the segmentation. This framework blends the advantages of color-, texture-, shape-, and motion-based segmentation methods in an automatic and computationally feasible way.

Our experiments demonstrated the effectiveness and accuracy of the proposed framework.

As future work, we plan to integrate the previously mentioned texture features and the available compressed-domain features into the automatic segmentation framework.

REFERENCES

[1] Y. Ohta, A region-oriented image-analysis system by computer, Ph.D. thesis, Kyoto University, Japan, 1980.

[2] B. Schachter, L. S. Davis, and A. Rosenfeld, "Some experiments in image segmentation by clustering of local feature values," Pattern Recognition, vol. 11, no. 1, pp. 19–28, 1979.

[3] P. J. Burt, T. H. Hong, and A. Rosenfeld, "Segmentation and estimation of image region properties through cooperative hierarchical computation," IEEE Trans. Systems, Man, and Cybernetics, vol. 11, no. 12, pp. 802–809, 1981.

[4] P. Salembier and M. Pardas, "Hierarchical morphological segmentation for image sequence coding," IEEE Trans. Image Processing, vol. 3, no. 5, pp. 639–651, 1994.


[5] M. Kunt, A. Ikonomopoulos, and M. Kocher, "Second-generation image coding techniques," Proceedings of the IEEE, vol. 73, no. 4, pp. 549–574, 1985.

[6] J. Crespo, R. Schafer, J. Serra, C. Gratin, and F. Meyer, "The flat zone approach: A general low-level region merging segmentation method," Signal Processing, vol. 62, no. 1, pp. 37–60, 1998.

[7] W. B. Thompson and T. G. Pong, "Detecting moving objects," International Journal of Computer Vision, vol. 4, no. 1, pp. 39–57, 1990.

[8] J. Wang and E. Adelson, "Representing moving images with layers," IEEE Trans. Image Processing, vol. 3, no. 5, pp. 625–638, 1994.

[9] G. Adiv, "Determining three-dimensional motion and structure from optical flow generated by several moving objects," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 7, no. 4, pp. 384–401, 1985.

[10] P. Bouthemy and E. Francois, "Motion segmentation and qualitative dynamic scene analysis from an image sequence," International Journal of Computer Vision, vol. 10, no. 2, pp. 157–182, 1993.

[11] B. Duc, P. Schroeter, and J. Bigun, "Spatio-temporal robust motion estimation and segmentation," in Proc. 6th Int. Conf. Computer Analysis of Images and Patterns, pp. 238–245, Springer-Verlag, Prague, September 1995.

[12] D. W. Murray and B. F. Buxton, "Scene segmentation from visual motion using global optimization," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 9, no. 2, pp. 220–228, 1987.

[13] J. K. Aggarwal, L. S. Davis, and W. N. Martin, "Corresponding processes in dynamic scene analysis," Proceedings of the IEEE, vol. 69, no. 5, pp. 562–572, 1981.

[14] I. K. Sethi and R. Jain, "Finding trajectories of feature points in a monocular image sequence," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 9, no. 1, pp. 56–73, 1987.

[15] R. Deriche and O. Faugeras, "Tracking line segments," in Proc. European Conference on Computer Vision, O. Faugeras, Ed., vol. 427 of Lecture Notes in Computer Science, pp. 259–268, Springer-Verlag, Antibes, France, April 1990.

[16] M. J. Black, "Combining intensity and motion for incremental segmentation and tracking over long image sequences," in Proc. European Conf. Computer Vision, vol. 588 of Lecture Notes in Computer Science, pp. 485–493, Santa Margherita Ligure, Italy, May 1992.

[17] F. Meyer and P. Bouthemy, "Region-based tracking using affine motion models in long image sequences," CVGIP: Image Understanding, vol. 60, no. 2, pp. 119–140, 1994.

[18] F. Porikli, "Image simplification by robust estimator based reconstruction filter," in Proc. 16th Int. Symposium on Computer and Information Sciences, November 2001.

[19] W. Skarbek and A. Koschan, "Colour image segmentation: A survey," Tech. Rep. 94-32, Department of Computer Science, Technical University of Berlin, 1994.

[20] A. K. Jain and F. Farrokhnia, "Unsupervised texture segmentation using Gabor filters," Pattern Recognition, vol. 24, no. 12, pp. 1167–1186, 1991.

[21] F. Porikli, Video object segmentation, Ph.D. thesis, Electrical and Computer Engineering Department, Polytechnic University, New York, 2002.

[22] R. M. Haralick and L. G. Shapiro, Computer and Robot Vision, Addison-Wesley, Boston, Mass, USA, 1st edition, 1992.

[23] S.-F. Chang, W. Chen, H. Meng, H. Sundaram, and D. Zhong, "A fully automated content-based video search engine supporting spatiotemporal queries," IEEE Trans. Circuits and Systems for Video Technology, vol. 8, no. 5, pp. 602–615, 1998.

[24] D. Zhong and S. F. Chang, "Long-term moving object segmentation and tracking using spatio-temporal consistency," in Proc. International Conference on Image Processing, vol. 3, pp. 57–60, Thessaloniki, Greece, October 2001.

[25] A. Alatan, L. Onural, M. Wollborn, R. Mech, E. Tuncel, and T. Sikora, "Image sequence analysis for emerging interactive multimedia services-the European COST 211 framework," IEEE Trans. Circuits and Systems for Video Technology, vol. 8, no. 7, pp. 802–813, 1998.

[26] T. Ida and Y. Sambonsugi, "Self-affine mapping system and its application to object contour extraction," IEEE Trans. Image Processing, vol. 9, no. 11, pp. 1926–1936, 2000.

[27] C. E. Erdem, A. M. Tekalp, and B. Sankur, "Video object tracking with feedback of performance evaluation measures," IEEE Trans. Circuits and Systems for Video Technology, vol. 13, no. 4, pp. 310–324, 2003.

[28] B. Reddy and B. Chatterji, "An FFT-based technique for translation, rotation and scale-invariant image registration," IEEE Trans. Image Processing, vol. 5, no. 8, pp. 1266–1271, 1996.

[29] L. Hill and T. Vlachos, "Shape adaptive phase correlation," Electronics Letters, vol. 37, no. 25, pp. 1512–1513, 2000.

Fatih Porikli received the B.S. degree in electronic engineering from Bilkent University, Ankara, Turkey, in 1992, and the M.S. and Ph.D. degrees in electrical and computer engineering from Polytechnic University, Brooklyn, NY, in 1996 and 2002, respectively. He joined the Mitsubishi Electric Research Laboratories, Cambridge, Mass, in 2000 after working on stereoscopic depth estimation at AT&T Labs Research in 1997 and developing satellite imagery classification methods at Hughes Research Labs in 1999. Previously, he designed algorithms for postfiltering, network management, and optimal bandwidth allocation. More recently, his research has focused on computer vision and data mining, automatic object detection and tracking, unusual event detection, video content analysis, and multicamera surveillance applications. He is serving as an Associate Editor for the SPIE Journal of Real-Time Imaging and is a Senior Member of IEEE and ACM.

Yao Wang received the B.S. and M.S. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1983 and 1985, respectively, and the Ph.D. degree in electrical and computer engineering from the University of California at Santa Barbara in 1990. Since 1990, she has been with the faculty of Polytechnic University, Brooklyn, NY, and is presently a Professor of electrical and computer engineering. She was on sabbatical leave at Princeton University in 1998 and was a Visiting Professor at the University of Erlangen, Germany, in the summer of 1998. She was a Consultant with AT&T Labs Research, formerly AT&T Bell Laboratories, from 1992 to 2000. Her research areas include video communications, multimedia signal processing, and medical imaging. She is the leading author of a textbook titled Video Processing and Communications, and has published over 100 papers in journals and conference proceedings. She is a Senior Member of IEEE and has served as an Associate Editor for IEEE Transactions on Multimedia and IEEE Transactions on Circuits and Systems for Video Technology. She received the New York City Mayor's Award for Excellence in Science and Technology in the Young Investigator Category in 2000.

