A framework for unsupervised mesh based segmentation
of moving objects
Andreas Kriechbaum & Roland Mörzinger &
Georg Thallinger
Published online: 24 September 2009
© Springer Science + Business Media, LLC 2009
Abstract Multimedia analysis usually deals with a large amount of video data with a
significant number of moving objects. Often it is necessary to reduce the amount of data
and to represent the video in terms of moving objects and events. Event analysis can be
built on the detection of moving objects. In order to automatically process a variety of video
content in different domains, largely unsupervised moving object segmentation algorithms
are needed. We propose a fully unsupervised system for moving object segmentation that
does not require any restriction on the video content. Our approach to extract moving
objects relies on a mesh-based combination of results from colour segmentation (Mean
Shift) and motion segmentation by feature point tracking (KLT tracker). The proposed
algorithm has been evaluated using precision and recall measures for comparing moving
objects and their colour segmented regions with manually labelled ground truth data.
Results show that the algorithm is comparable to other state-of-the-art algorithms. The
extracted information is used in a search and retrieval tool. For that purpose a moving
object representation in MPEG-7 is implemented. It facilitates high performance indexing
and retrieval of moving objects and events in large video databases, such as the search for
similar moving objects occurring in a certain period.
Keywords Moving object segmentation · Unsupervised system · Spatial segmentation ·
Motion extraction · Clustering · Optical flow
Multimed Tools Appl (2010) 50:7–28
DOI 10.1007/s11042-009-0366-9
A. Kriechbaum (*) · R. Mörzinger · G. Thallinger
JOANNEUM RESEARCH, Institute of Information Systems, Steyrergasse 17, 8010 Graz, Austria
e-mail: [email protected]
1 Introduction
A critical task in video understanding for a large amount of data is the automatic
interpretation of semantically meaningful spatio-temporal objects. To achieve this task, the
gap between pixel values and semantic descriptions needs to be bridged. The successful
application of object-based media description and representation depends largely on
effective moving object segmentation tools.
Moving Object Segmentation (MOS) can be used for providing important spatio-temporal
information about objects whose motion is more or less homogeneous, at least
over a certain period.
Generally, moving object segmentation can be used in applications in the field of
content-based media retrieval. For example, many film archives lack annotation, so a similarity
search requires manually browsing all videos, which is time consuming and labour intensive.
Automatic unsupervised systems make it possible to search for similar objects in a video
archive. Another reasonable application is video compression. Special video formats for
compression have been developed; a well-known example is the MPEG-4 format, which contains a
description of moving objects. Motion segmentation also reduces the large amount of video
data: after the analysis of the videos, only regions extracted by the MOS are subject to
further processing, which means a strong reduction of data. The task of event detection in
videos requires a system to automatically extract moving objects in order to facilitate
subsequent person identification and behaviour and event analysis. Semantic event detection,
media monitoring and video indexing are only a few examples from the large spectrum of
applications.
These applications impose common challenges for moving object segmentation. Most
importantly, with a moving camera the moving object (MO) moves relative to a background that
is itself moving. Another challenge is that the objects' behaviour is generally not known a
priori, i.e. they may be rigid or non-rigid and may move fast or slowly. Illumination changes
over a short period of time and shadows cast on or by the object also complicate the
segmentation process. Background clutter such as swaying branches makes the segmentation of
foreground objects even more difficult. If multiple moving objects with similar appearance
move side by side, the possibility of separating them is limited. Occlusions, e.g. temporary
disappearance when objects move in front of others, are a further challenge. The
computational complexity increases with the scene dynamics, e.g. the number of moving objects.
Therefore, segmenting frames into distinctive parts corresponding to moving objects is
very difficult. In general, many calculations are necessary for that purpose because moving
regions usually have no single, strongly discriminative feature that can be calculated.
Possible features are colour, shape, texture and velocity of the associated region pixels.
Each feature has advantages and disadvantages in different environments. In the past many
algorithms were developed, but most of them impose hard restrictions and require manual
human intervention.
The aim of this work is to explore the feasibility of a fully automatic moving object
segmentation system which tackles the above mentioned challenges. Thus, an unsupervised
system should extract moving objects without any restriction on the content and without
any manual intervention in the moving object segmentation process.
2 Related work
Moving object segmentation draws from techniques in the field of tracking and
segmentation. This section deals with related techniques proposed in the literature.
The following overview of natural-based tracking techniques is proposed in [20]:
2.1 Edge-based motion tracker
This approach is very efficient in terms of computation and implementation costs. The
computational complexity is low because only a small fraction of the image pixels is
processed. Furthermore, this algorithm is robust against illumination changes and is very
simple to implement. Different edge-based algorithms have been developed with the same
gradient-based approach: edges are found in an image where strong gradients occur. The
algorithms differ as follows: one extracts explicit model contours for matching with the
object database, and the other extracts gradients to estimate the object's pose without
calculating a contour. The extraction of contours leads to a much more reliable result but
is slower than the algorithm without contour extraction.
2.2 Optical flow-based motion tracker
If moving objects are to be derived from the analysis of an image sequence, useful
information can be extracted through the optical flow-based approach. The optical flow is
the velocity vector of a pixel with the same intensity in subsequent frames. The optical
flow is calculated over the whole image, and the vector at each pixel is a 2D approximation
of the real 3D motion. To get reliable results an accurate dense optical flow has to be
computed. A common representation of the optical flow field uses arrows on a mesh, as in Fig. 1.
The direction of each arrow describes the direction of the motion, and its length describes
the magnitude of the motion of the considered pixel: the longer the arrow, the larger the
motion. In optical flow-based techniques a velocity field is extracted as described in [5, 20].
An important disadvantage of this algorithm is the large linearization error in the optical
flow constraint if the video contains large motion. Furthermore, generating an optical flow
field that captures motion along edges, especially of circles, is very difficult due to the
aperture problem [5]. Another disadvantage is the slower computation due to the
calculation of the optical flow over the whole image.
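For reference, the optical flow constraint mentioned above is commonly written in the following linearized form. The notation is the standard textbook one and is an assumption here, not taken from the paper: $I$ is the image intensity, $I_x, I_y, I_t$ are its partial derivatives and $(u, v)$ is the flow vector at a pixel.

$$I(x+u,\, y+v,\, t+1) \approx I(x, y, t) + I_x u + I_y v + I_t \quad\Longrightarrow\quad I_x u + I_y v + I_t = 0$$

The first-order Taylor expansion only holds for small displacements, which is the source of the linearization error for large motion. The constraint is a single equation in the two unknowns $u$ and $v$, so only the flow component along the intensity gradient is determined at a pixel; this is the aperture problem.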
A combination of edge- and optical flow-based methods is certainly possible and
provides good results.
2.3 Template-based motion tracker
The template-based algorithm does not rely on tracked points. Instead, it depends on
templates, which are patterns of the objects to be tracked. The algorithm was developed for
complex objects which are difficult to model with local features. An example
of a useful application is finding a book cover in an image sequence. With the help of edges
and local features the content of the image is processed. If the cover is found in the form
of a plane, only a detection of the plane has to take place, which is much simpler than the
other proposed algorithms. A disadvantage of this algorithm is that it is computationally
expensive. An important application of the template-based algorithm is the Lucas-Kanade
tracker [20], which estimates the deformation that warps a template of an object into
the image.
2.4 Interest points-based motion tracker
This algorithm relies on local features, similar to the optical flow-based method. The
extraction of a subset of image pixels reduces the computational cost. The patches around
the points should be textured and should differ from their neighbours in order to eliminate
unstable edges. An object feature is defined by its location and the corresponding patch.
After the initialization of the features the algorithm computes the same loop as the
edge-based algorithm. Local features spread across the image are robust against partial
occlusion and matching errors. The interest points-based algorithm exploits more of the
available image information to gain robustness.
In the literature different ways of classifying moving object segmentation approaches are
discussed. A review of state-of-the-art techniques is presented in [30]. Generally, the
algorithms can be divided into three main groups of moving object segmentation
techniques. The following sections briefly describe representative algorithms.
2.5 Spatial segmentation, motion extraction, clustering
A Mean-Shift based algorithm proposed in [16] provides robust homogeneous colour
regions according to dominant colours. Furthermore, frame intensity difference based
motion detection is applied for motion extraction. The detected moving regions are
analyzed by a region-based affine model and further tracked to increase the consistency of
the extracted objects. A morphological open-close operator is used to remove gulfs and
isthmi (narrow connection between two large regions) for object boundary smoothing. A
shape coding optimization is done using boundaries of variable width. The algorithm is fast
and highly accurate.
In [15] a two-dimensional feature vector is used for clustering in the feature space. The
first feature is image brightness which reveals the structure of interest in the image. The
second feature is the Euclidean norm of the optical flow vector. The optical flow field is
computed using the Horn-Schunck algorithm [18]. By clustering the feature space, moving
objects in the image are detected. The algorithm has the advantage that it is robust
with regard to background movement.
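The two-dimensional feature vector described for [15] can be illustrated with a minimal sketch. It assumes that the brightness image and the two optical flow components are already available as equally sized nested lists; the function name and the toy data are hypothetical and not taken from [15].

```python
import math


def brightness_flow_features(brightness, flow_u, flow_v):
    """Build one (brightness, flow magnitude) feature per pixel.

    The second component is the Euclidean norm of the flow vector (u, v).
    """
    features = []
    for row_b, row_u, row_v in zip(brightness, flow_u, flow_v):
        for b, u, v in zip(row_b, row_u, row_v):
            features.append((b, math.hypot(u, v)))
    return features


# Toy 2x2 image: the right column is bright and moves, the left column is static.
brightness = [[10, 200], [12, 198]]
flow_u = [[0.0, 3.0], [0.0, 3.0]]
flow_v = [[0.0, 4.0], [0.0, 4.0]]
features = brightness_flow_features(brightness, flow_u, flow_v)
```

In this toy example the moving pixels form a group in the feature space that is clearly separated from the static ones, which is what a clustering step would then exploit.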
In [21] the moving object segmentation procedure is treated as a Markovian labelling
and is validated by an elaborate occlusion detection scheme. The initial object mask is
segmented by the MRF model. A disadvantage of the approach is that it cannot deal
properly with noise.
The PCA and GGM based algorithm proposed in [22] consists of three stages. First, the
first frame is initially segmented using colour, motion, and position information,
based on a variant of the K-Means-with-connectivity-constraint algorithm. Then a
temporal tracking algorithm is applied, using a Bayes classifier and rule-based processing
to reassign changed pixels to existing regions and to handle the introduction of new
regions. Finally, a trajectory-based region merging procedure is used that employs the
long-term trajectories of regions to group them into objects with different motion. An
advantage is that the algorithm can handle fast-moving objects, new objects and
disappearing objects.
2.6 Point tracking, 2D mesh
In the first frame of the video, an optimal number of feature points is selected as
nodes of a 2D content-based mesh. These points are classified as moving (foreground)
or stationary nodes based on the multi-frame node motion analysis described in [7],
yielding a coarse estimate of the foreground object boundary. To extract the moving
object, colour differences across triangles near the coarse boundary are employed. The
boundary of the video object is refined by a maximum contrast path search along
the edges of the 2D mesh. Next, the refined boundary is propagated to the subsequent
frame by using the motion vectors of the node points, forming the coarse boundary in
the next frame.
This algorithm is able to detect occlusions, but small objects cannot be found. The point
tracker-based approach is a possible approach for fully automatic moving object
segmentation and is similar to the approach developed by the authors.
2.7 Active contours
The latest algorithms are often based on active contours. This is an iterative algorithm
that produces a better contour-line description in every iteration step. It receives the
previous contour line as input and uses constant balancing factors (for the internal and
external energy) to produce a new contour line description. Active contours (snakes)
minimize the sum of the internal and external energy [27]. The graph cut algorithm
[29] is an improvement of active contours, leading to a smooth contour free of
self-crossing and uneven spacing problems. The internal force, which is used in the energy
functions to control the smoothness, is no longer needed and the number of parameters is
reduced.
An advantage of these algorithms is the ability to handle unknown noise, highly textured
background, and partial object occlusions. The disadvantage of active contour-based
algorithms is the need for an initial object segmentation or for initial seeds.
For this reason they are often used for video object tracking, and they are not
considered in this work.
3 Mesh-based moving object segmentation
The main goal of this work is to segment moving regions in videos without any restrictions
on the content (such as static or moving cameras). The mesh-based moving object
segmentation algorithm is intended to solve the problems in moving object segmentation
described above. Most of the work was done in the master thesis [19].
Many approaches from literature suggest using dense motion fields or an optical flow
field in which the video objects have to be segmented. However, the problem is that the
method for optical flow field calculation for every pixel of the image is underspecified and
that computing a velocity vector for every pixel of the images is often redundant because
most pixels in an image have zero motion. In the proposed algorithm a similar, yet modified
approach is chosen.
In general, the developed approach is based on the combination of pre-extracted
features. These are feature points of a KLT feature tracker and regions extracted by colour
segmentation. The colour segmentation algorithm used here builds on the specific
Mean-Shift implementation described in [12].
The main contribution of this paper is a new integrated workflow for mesh-based
moving object segmentation. It draws from existing state-of-the-art techniques for tracking
and segmentation, such as the Mean-Shift colour segmentation and the KLT feature
tracking. In this paper these approaches have been combined and partly extended and
improved in order to provide an integrated framework for fully automatic moving object
segmentation.
3.1 Mean-shift algorithm
The Mean-Shift (MS) algorithm is a procedure for analysing feature spaces using a clustering
method. Many pattern recognition applications make use of the advantages of the MS
algorithm, for example colour image segmentation, where the image pixels are first
mapped into a specific feature space (e.g. the L*u*v* colour space) in which clusters
are then formed. The cluster centres are of great importance because they characterize the
most significant features of the image, which (in the colour image segmentation example)
are the dominant colours. Consequently, the Mean-Shift algorithm is applied to locate the
clusters and to estimate their centres to get the dominant colours of the image, which can
then be used for the segmentation procedure. The result of the segmentation process is a
correspondence between clusters and homogeneous colour regions of the image.
In 1997 Comaniciu and Meer [9] applied the Mean-Shift algorithm to colour image
segmentation. The results of the segmentation were better than those of other
similar approaches (for example the Watershed algorithm); hence this approach became
very popular. The use of MS on image sequences is also proposed in [3, 12]. In principle,
the procedure is more robust against changes of the illumination conditions and more
reliable with regard to over-segmentation.
Figure 2 shows an input image and the corresponding three-dimensional feature space.
Finding the location of the clusters is done by using a search window in the feature
space, which shifts to the centre of each cluster. The direction and the magnitude of the shift
are based on the difference of the centre of the search window and the local mean value in
3.1.1 Propagation of cluster centres
The position defining where the search begins determines the number of shifts which are
necessary for locating the centre of a cluster in the feature space. Consequently it is crucial to
find the best starting position for the search window in order to minimize the number of shifts.
This is implemented by choosing a number of random positions and then declaring the one with
the highest density of feature vectors as starting position for the search window.
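The selection of the starting position can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the random candidates are drawn from the feature vectors themselves, density is measured as the number of feature vectors inside a spherical window, and the names `count_in_window`, `pick_start` and `n_candidates` are hypothetical.

```python
import random


def count_in_window(points, centre, radius):
    """Count the feature vectors lying inside the sphere around `centre`."""
    r2 = radius * radius
    return sum(
        1 for p in points
        if sum((a - b) ** 2 for a, b in zip(p, centre)) <= r2
    )


def pick_start(points, radius, n_candidates=25, seed=0):
    """Pick, among random candidates, the one with the densest window."""
    rng = random.Random(seed)
    candidates = rng.sample(points, min(n_candidates, len(points)))
    return max(candidates, key=lambda c: count_in_window(points, c, radius))


# Ten vectors close together plus one isolated outlier.
dense = [(0.1 * i, 0.0, 0.0) for i in range(10)]
outlier = (50.0, 50.0, 50.0)
start = pick_start(dense + [outlier], radius=2.0, n_candidates=11)
```

Starting in a dense area means the window is already near a cluster centre, so fewer shifts are needed before convergence.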
3.1.2 Iteration steps
After the successful determination of the starting position the search window is shifted to
the designated position iteratively. The intensity distribution of each colour can be seen as a
probability density function. The difference between the local mean of this function and the
Fig. 2 Colour image and corresponding L*u*v* colour space [8]
which defines the designated cluster centre. Intuitively mean shift corresponds to an
estimation of the gradient of the data density.
Figure 3 visualises the iteration process. In the left image the starting points (colour) are
shown, which move along the paths visualized in the right illustration. Some clusters merge
to larger clusters, which represent a single colour.
3.1.3 Mathematical theory of mean-shift algorithm
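A sphere $S_x$ with radius $r$ is centred on the point $x$ and separates the feature vectors $y$
lying within the sphere from the remaining vectors. The expected value of the vector
$z = y - x$, given $x$ and $S_x$, can be computed by:

$$\mu = E[z \mid S_x] = \int_{S_x} (y - x)\, p(y \mid S_x)\, dy = \int_{S_x} (y - x)\, \frac{p(y)}{p(y \in S_x)}\, dy$$

In [11] the computation leading to the following final equation can be found.

$$\mu = E[z \mid S_x] \approx \frac{r^2}{n + 2}\, \frac{\nabla p(x)}{p(x)}, \qquad \mu = E[x \mid x \in S_x] - x$$

Here $n$ is the dimension of the feature space.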
The mean shift vector represents, as discussed earlier, the difference between the local
mean and the centre of the search window and is proportional to the density gradient.
Consequently, the centres of the clusters are located in a region, where the gradient values
are low and p(x) is high. The mean shifts point towards the maximum of the probability density,
which is called the mode. At the mode the mean shift is zero, so a
criterion for stopping the iteration has to be defined. For example, if the magnitude of
the shift is smaller than 0.1 the iteration stops. All pixels of the search window are discarded
and their 8-connected neighbours are removed. Afterwards, further starting-points can be
found and they are iteratively processed to converge to a cluster centre.
This procedure is done until the number of feature vectors within the search window is
smaller than a value, which defines the smallest number of elements which are needed for a
relevant image region. All pixels within the window whose feature vectors are located on
the relevant colour points are attached to the colour of the centre of the window. For a
further calculation process only those pixels are kept which have at least one neighbour
with an already assigned colour in the feature space.
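The iteration towards a mode can be sketched in a few lines. This is a simplified illustration that assumes a flat (uniform) spherical window; it only covers the shift loop with the stopping threshold of 0.1 mentioned above and leaves out the removal of processed pixels and the region growing. The function names are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def local_mean(points, centre, radius):
    """Mean of all feature vectors inside the search window."""
    inside = [p for p in points if math.dist(p, centre) <= radius]
    if not inside:
        return centre  # empty window: nothing to shift towards
    dim = len(centre)
    return tuple(sum(p[i] for p in inside) / len(inside) for i in range(dim))


def mean_shift_mode(points, start, radius, min_shift=0.1, max_iter=100):
    """Shift the window to the local mean until the shift is below min_shift."""
    centre = tuple(start)
    for _ in range(max_iter):
        mean = local_mean(points, centre, radius)
        shift = math.dist(mean, centre)  # magnitude of the mean shift vector
        centre = mean
        if shift < min_shift:
            break
    return centre


# One compact cluster around (10, 10) and one far-away vector.
points = [(9, 9), (9, 11), (11, 9), (11, 11), (10, 10), (100, 100)]
mode = mean_shift_mode(points, start=(8, 8), radius=4)
```

Each step moves the window centre by the mean shift vector, i.e. the difference between the local mean and the current centre, so the window climbs the density gradient until it settles at the cluster centre.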
The Mean-Shift algorithm can be used in any n-dimensional feature space. In this
paper we used the MS algorithm for basic colour segmentation and as a clustering method
for the KLT tracker. Possible features for spanning a feature space in moving object
segmentation are the common colour components red, green, and blue or their
conversion into the L*u*v* space. Further helpful features would be the
spatial coordinates and a temporal relationship of points extracted by a point tracker
over a block of frames. Each of these features refines the result, with the drawback of
a longer time to convergence.
3.1.4 Post-processing steps
In a post-processing step small regions, which potentially occur along the boundary of
larger regions, are assigned to the closest substantial colour in the feature space.
The Mean-Shift algorithm yields better results than other proposed colour segmentation
approaches with regard to image sequences. However, it is computationally expensive.
This algorithm can be used in many different applications because the number of
dimensions is unrestricted, so diverse features can be included.
The accuracy of the segmentation is determined by the size of the search window. A small
window results in a more accurate segmentation, which means that more segmented regions
are obtained, and vice versa. Therefore, the radius should adapt to a criterion which
represents the visual activity.
3.1.5 Mean-shift advantages / disadvantages
On the one hand, the Mean-Shift algorithm yields impressive segmentation results, and not
only in the field of image segmentation: many other clustering applications can make use
of its positive properties. On the other hand, the calculation time of the algorithm is high
compared to other proposed segmentation procedures (e.g. the Watershed algorithm
[15]).
By adjusting some parameters of the Mean-Shift algorithm it is possible to obtain
diverse segmentation results, so it can easily be used whether under- or over-segmentation
is desired.
Moreover, the segmentation results are robust towards changes in the images (for
example changes in the illumination conditions), and consequently the Mean-Shift
segmentation procedure yields very reliable results for a video sequence.
In [11] it was shown that the Mean-Shift algorithm is well suited for colour segmentation of
image sequences, due to better temporal stability of the segmentation compared to other
approaches (for example the Watershed algorithm). A drawback is that the Mean-Shift
algorithm is computationally more expensive, and the computational costs are even more of
an issue when the algorithm is applied to image sequences (videos). To overcome this,
several optimizations are proposed in [3].
The MS produces a base segmentation which is combined with the motion information
generated by the KLT tracker to obtain a moving object. The Mean-Shift algorithm provides
the dominant motion cluster assignment as a new property of the KLT feature points. Using
the ability of the MS algorithm to perform clustering in high dimensions, motion clusters
3.2 Mesh-based moving object segmentation
The mesh-based segmentation approach is motivated by the optical flow-based algorithms
suggested in the literature and by the point tracker-based and mesh-based approach to
moving object segmentation in [7]. In the developed approach the feature points are
assigned to dominant clusters which represent moving objects. Furthermore, this
algorithm is based on the assumption that a colour-segmented region belongs to a single
object. This object is either a foreground object (moving object) or a background object.
This assignment is done similarly to [7]. The workflow can be seen in Fig. 4.
In this approach velocity-stable triangles (built from the extracted feature points) which
belong to one cluster are combined to represent a moving object. Triangles which lie on the
same object belong to the same motion cluster. If not all points of a triangle are assigned
to the same cluster, the triangle is not used for the moving object segmentation. Triangles
with points from the same cluster are assigned to colour-segmented regions, and so a stable
skeleton of the object is formed. The stability of the triangles is determined by the
interpolated motion field calculated for the feature points. The extracted moving object is
reliable owing to the combination of the two base approaches (colour segmentation, feature
point tracking) and additional quality measures based on the motion field. The
combination of these algorithms solves the problems of MOS with moving cameras. With
this algorithm occlusion is not a problem; the appearance and disappearance of objects in the
image do not have any negative impact on the algorithm.
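The rule that a triangle is only used when all of its points belong to the same motion cluster can be sketched as follows. The data layout is an assumption made for illustration: triangles are tuples of point indices and `cluster_of` maps a point index to its dominant motion cluster label; both names are hypothetical.

```python
def same_cluster_triangles(triangles, cluster_of):
    """Keep triangles whose three vertices share one motion cluster.

    Returns a list of (triangle, cluster_label) pairs.
    """
    kept = []
    for tri in triangles:
        labels = {cluster_of[i] for i in tri}
        if len(labels) == 1:  # all three points agree on the cluster
            kept.append((tri, labels.pop()))
    return kept


# Points 0..2 move together (cluster "a"); point 3 belongs to cluster "b".
cluster_of = {0: "a", 1: "a", 2: "a", 3: "b"}
triangles = [(0, 1, 2), (1, 2, 3)]
kept = same_cluster_triangles(triangles, cluster_of)
```

The second triangle straddles two clusters and is dropped, so only triangles that lie entirely on one moving object contribute to its skeleton.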
3.3 Stable triangle detection
This processing step is one of the most important steps in the mesh-based algorithm which
leads to the desired quality. In this step reliable triangles are selected from all triangles for a
further region assignment to the according dominant cluster. The feature points extracted by
the KLT feature tracker are triangulated resulting in a mesh. The mesh extraction facilitates
a fast computation of a dense optical flow field. This process is called motion estimation.
The diagram in Fig. 5 shows the process to make this clearer.
[Fig. 4 (workflow diagram; only the block labels survive). Processing blocks: Mean-Shift Colour; Point Tracking and Dominant Motion Clustering; Stable triangle detection; Find candidate regions; Assign dominant motion clusters; Track regions; Hierarchical Clustering; Moving Object. Data passed between blocks: Points, Cluster; Regions; Stable Triangles; Region and Cluster.]
First, Delaunay triangulation [11] is applied to the extracted feature points,
which carry their local motion information. Several tests have shown that the Delaunay
algorithm is very fast and gives a high stability (linkage) of the extracted mesh over time,
which is a necessary precondition for the mesh-based moving object algorithm.
Second, a dense motion field is extracted. The motion field is calculated by applying the
Gouraud shading algorithm [17] to the velocity vectors of the extracted mesh. The
assumption is that KLT feature points from the mesh which are assigned to one dominant
motion cluster belong to one moving object. If the points on the moving objects have
correctly assigned velocity vectors and if there are many points inside the border of the
moving object, the Gouraud shading algorithm provides a linearly interpolated motion
field of the image. The linear interpolation between the motion vectors is valid because
the error is small inside the area of the moving objects and larger only in the
neighbourhood of the moving object. If there are too few points in the area of the moving
object, the algorithm provides worse results. Furthermore, if the points extracted by the
KLT tracker have incorrect velocities, the motion field will not be calculated properly.
However, the Gouraud interpolation is only an approximation of a dense optical flow field.
The motion field calculation is very fast in contrast to the methods described in [6].
Using the motion estimate, points in a frame can be identified which are more reliable than
other points in terms of temporal stability. This reliability can be used in the further
processing steps.
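Gouraud shading interpolates vertex values linearly across a triangle, which for a single pixel amounts to a barycentric weighting of the three vertex velocities. The following sketch shows this for one point; it is an illustration under that assumption, with hypothetical function names, and does not reproduce the rasterization used in the paper.

```python
def barycentric(p, a, b, c):
    """Barycentric weights of point p with respect to triangle (a, b, c).

    Raises ZeroDivisionError for a degenerate (zero-area) triangle.
    """
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


def interpolate_velocity(p, vertices, velocities):
    """Linearly interpolate the vertex velocities at point p."""
    weights = barycentric(p, *vertices)
    vx = sum(w * v[0] for w, v in zip(weights, velocities))
    vy = sum(w * v[1] for w, v in zip(weights, velocities))
    return vx, vy


vertices = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]    # tracked feature points
velocities = [(0.0, 0.0), (4.0, 0.0), (0.0, 8.0)]  # their velocity vectors
v_inside = interpolate_velocity((1.0, 1.0), vertices, velocities)
```

At a vertex the interpolation returns exactly the tracked velocity, and inside the triangle it blends the three vectors, which is why wrong or sparse feature points directly degrade the dense field.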
Third, the displaced frame difference (DFD) image is extracted, which shows the reliability
of the extracted motion field. The DFD image is calculated from the original image and an
image predicted from the calculated motion. Due to the interpolation of the motion field the
predicted image is only motion compensated and not identical to the original consecutive
image. If the motion field is correctly calculated, all values of the DFD image are zero.
Non-zero values indicate pixels/regions which are unreliably calculated due to an
incorrectly estimated motion field. Pixels/regions with values near zero have a high reliability.
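A DFD image can be sketched for grey-value frames stored as nested lists. Several details here are assumptions made for illustration and are not specified in the text: the motion vector is stored at the target pixel in frame t+1 (backward warping), sampling uses the nearest pixel, coordinates are clamped at the image border, and the difference is taken as an absolute value.

```python
def displaced_frame_difference(frame_t, frame_t1, motion):
    """Absolute difference between frame t+1 and its prediction from frame t.

    motion[y][x] = (dx, dy) is the displacement that brought a pixel of
    frame t to position (x, y) in frame t+1.
    """
    h, w = len(frame_t), len(frame_t[0])
    dfd = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = motion[y][x]
            # Source position in frame t, rounded and clamped to the image.
            sx = min(max(int(round(x - dx)), 0), w - 1)
            sy = min(max(int(round(y - dy)), 0), h - 1)
            predicted = frame_t[sy][sx]
            dfd[y][x] = abs(frame_t1[y][x] - predicted)
    return dfd


# A bright pixel moves one position to the right between the two frames.
frame_t = [[0, 9, 0, 0]]
frame_t1 = [[0, 0, 9, 0]]
correct = [[(1, 0)] * 4]
static = [[(0, 0)] * 4]
dfd_ok = displaced_frame_difference(frame_t, frame_t1, correct)
dfd_bad = displaced_frame_difference(frame_t, frame_t1, static)
```

With the correct motion field the prediction matches frame t+1 and the DFD vanishes; with a wrong (here: zero) motion field the moving pixel leaves non-zero residuals at its old and new position.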
Fig. 5 Stable triangle detection with motion interpolation (diagram blocks: Points, Cluster; Triangulation; Triangulated mesh; Gouraud Shading; Motion Field; Original frame t; Motion compensation in frame t+1; Interpolated frame t+1; Original frame t+1; Displaced Frame Difference Image; DFD value for triangle; Lifetime and Cluster Reliability Calculation; Stable Triangles)
point does not necessarily have to be in the same cluster, because the region may be a
border region of the object with possibly poor colour segmentation. Moreover, in the
reliable triangle search, triangles which are below a minimum temporal reliability are
discarded to guarantee reliable moving object segmentation. This can be done based on the
dominant cluster information over time. The dominant motion cluster is calculated over
subsequent frames. Consequently each triangle has a lifetime due to the cluster assignment
and tracking of its points. If the lifetime of the triangle exceeds an adjustable parameter,
the triangle is a candidate for further processing. A problem of the algorithm is its
dependency on the reliability of the KLT tracker and on the stability of the triangulation
over time. This dependency causes tracked triangles to be missing in some frames over a
certain time. These outliers are tolerated by introducing a new parameter which defines the
minimum appearance of the triangle. Otherwise the skipped frames would lead to discarded
triangles.
Reliable triangles are selected with the help of the previously introduced parameters, namely
the DFD-dependent triangle reliability, the minimum lifetime and the minimum
appearance percentage. In this process a number of reliable triangle candidates is
extracted for each frame for further processing.
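The three selection parameters can be combined in a small predicate. The sketch below makes several assumptions that the text leaves open: the lifetime is taken as the span between the first and last frame in which the triangle was found, the appearance is the fraction of frames within that span in which it was found, and the thresholds are inclusive. All names are hypothetical.

```python
def is_reliable(presence, dfd_value, max_dfd, min_lifetime, min_appearance):
    """Decide whether a tracked triangle is a reliable candidate.

    presence: one bool per frame, True if the triangle was found with the
    same cluster assignment in that frame.
    """
    seen = [i for i, found in enumerate(presence) if found]
    if not seen:
        return False
    lifetime = seen[-1] - seen[0] + 1   # frames between first and last sighting
    appearance = len(seen) / lifetime   # tolerated gaps reduce this fraction
    return (
        dfd_value <= max_dfd
        and lifetime >= min_lifetime
        and appearance >= min_appearance
    )


# The triangle is missing in one of five frames (a skipped frame).
presence = [True, True, False, True, True]
accepted = is_reliable(presence, dfd_value=2.0, max_dfd=5.0,
                       min_lifetime=4, min_appearance=0.75)
```

Because the appearance threshold is below 1, a single skipped frame does not discard the triangle, which is the purpose of the minimum appearance parameter.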
3.4 Cluster assignment to segmented regions
After a set of stable triangles has been selected, colour-segmented regions can be assigned to
the corresponding triangles. The idea is that each triangle is already assigned to a dominant
motion, so a region that overlaps the triangle is a candidate for the same assignment. If a
region is correctly assigned to a triangle, and thereby to a dominant cluster, a skeleton of the
moving object is extracted. The assignment to a dominant motion cluster is attempted for all
segmented regions which lie entirely or partially inside the triangle. An adjustable minimum
threshold has to be exceeded. This threshold specifies the minimum percentage of the area
of a triangle which must overlap a colour-segmented region. If the assignment of clusters
to a region is ambiguous, a further decision is needed, otherwise multiple
clusters could be assigned to the same region. A triangle whose three tracking points all
belong to the same cluster has a higher weight in the calculation than
a triangle with only two points belonging to the same cluster. The triangle with the
higher weight is assigned to the region. If the region contains more than one triangle with
the same weight, the triangle with the larger area overlap is selected.
All regions with the same dominant cluster assignment are selected to extract the
skeleton of the moving object. In Fig. 4 two further processing steps are depicted: tracking
of the assigned regions and hierarchical clustering. These steps are used to obtain realistic
moving objects from the extracted moving object skeletons.
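The tie-breaking rule for ambiguous assignments can be sketched as follows. This is an illustration under assumptions, not the original code: the `Candidate` record, the function names and the example threshold of 0.5 are hypothetical, and the overlap is assumed to be given as the fraction of the triangle area lying inside the region.

```python
from collections import Counter
from typing import NamedTuple, Sequence


class Candidate(NamedTuple):
    """One triangle overlapping a colour-segmented region (hypothetical layout)."""
    vertex_clusters: Sequence[int]  # motion-cluster label of each of the 3 points
    overlap: float                  # fraction of the triangle area inside the region


def dominant_cluster(candidate):
    """Return (cluster, weight); weight is the number of points sharing the cluster."""
    cluster, weight = Counter(candidate.vertex_clusters).most_common(1)[0]
    return cluster, weight


def assign_region(candidates, min_overlap=0.5):
    """Choose the motion cluster for one region, or None if no triangle qualifies."""
    best_key, best_cluster = None, None
    for c in candidates:
        cluster, weight = dominant_cluster(c)
        # the overlap threshold has to be exceeded, and at least two points must agree
        if c.overlap <= min_overlap or weight < 2:
            continue
        key = (weight, c.overlap)  # higher weight first, then larger overlap
        if best_key is None or key > best_key:
            best_key, best_cluster = key, cluster
    return best_cluster
```

Comparing the tuples `(weight, overlap)` encodes both rules in one step: a triangle with three agreeing points always beats one with two, and the overlap only decides between triangles of equal weight.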
3.5 Region tracking
Region tracking is used to obtain first realistic moving objects
from the initial moving object skeletons. To each extracted skeleton a cluster parameter
(velocity) from the dominant motion tracking is assigned. With these
parameters, new regions corresponding to the moving region of the current frame are found in the
The new colour-segmented regions have to satisfy several thresholds, such as the mean RGB
colour threshold of the region, bounding box width, bounding box height and region area.
These thresholds ensure that the corresponding new region in the next and previous frame
is the same region of the moving object as in the current frame. The result of region tracking
is a set of moving regions, each containing points with a similar
dominant motion, obtained by assignment across previous and next frames.
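A minimal sketch of the threshold check between a region and a candidate region in the neighbouring frame is given below. It rests on several assumptions that the text does not fix: the cluster parameter is treated as a pure translation (vx, vy) in pixels per frame, size differences are compared as relative differences, and all names and default thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Region:
    """A colour-segmented region; field names are illustrative only."""
    bbox: Tuple[float, float, float, float]  # (x, y, width, height)
    mean_rgb: Tuple[float, float, float]
    area: float


def shift_bbox(bbox, velocity):
    """Move a bounding box by the cluster velocity (vx, vy)."""
    x, y, w, h = bbox
    vx, vy = velocity
    return (x + vx, y + vy, w, h)


def rel_diff(a, b):
    """Relative difference of two non-negative sizes, in [0, 1]."""
    return abs(a - b) / max(a, b, 1e-9)


def same_region(region, candidate, velocity,
                max_rgb_diff=20.0, max_size_diff=0.25, max_offset=8.0):
    """Return True if `candidate` plausibly continues `region` in the next frame."""
    px, py, pw, ph = shift_bbox(region.bbox, velocity)
    cx, cy, cw, ch = candidate.bbox
    # largest per-channel difference of the mean colours
    rgb_diff = max(abs(a - b) for a, b in zip(region.mean_rgb, candidate.mean_rgb))
    return (rgb_diff <= max_rgb_diff
            and abs(px - cx) <= max_offset
            and abs(py - cy) <= max_offset
            and rel_diff(pw, cw) <= max_size_diff
            and rel_diff(ph, ch) <= max_size_diff
            and rel_diff(region.area, candidate.area) <= max_size_diff)
```

Backward tracking can reuse the same check with the negated velocity, i.e. `same_region(region, previous_candidate, (-vx, -vy))`.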
3.6 Hierarchical clustering
To get more realistic segmentation results for non-rigid moving objects, a further processing
step is introduced. Using hierarchical clustering, parts of non-rigid objects that exhibit
different dominant motions should be connected.
Hierarchical clustering is well known and commonly used in image processing. It requires
the definition of distances between nodes. With the help of these distances,
hierarchical clustering finds the nearest nodes (which represent features of moving objects)
and combines them into a single node. The clustering uses the single
linkage method described in [13]. It continues iteratively, finding the
next nearest nodes, until a hierarchical tree is extracted. In the case of moving object segmentation
the nodes are features of moving regions: the motion trajectories and the
distances between the centroids of the moving regions. The motion trajectories locate the clusters
over several frames. If the motion trajectories of the compared regions are relatively similar and
the distance is relatively short, the compared regions belong to the same moving object. For a
better understanding of the algorithm an example is given in the next paragraph.
The motion from all bones of a leg is not the same during a time period (for example in a
[Fig. 6 flowchart, box labels as extracted: Proceed over all frames as long as new MO-regions are found; Association of the assigned regions to the actual moving object; Find all colour segmented regions in the next frame t+1; Forward tracking and backward tracking of all colour segmented regions until the last possible assignment of the region; Find the colour segmented regions of the moving object in frame t; Check the underlying base segmented regions for colour similarity and the similarity of the transformed bounding box to the bounding box of the underlying region; Find the bounding box of the colour segmented region; Transform the bounding box with the cluster parameters from the point tracker into the next frame; If the similarity is higher than a threshold assign the region to the moving object]
Fig. 6 Region tracking. Left: tracking over the entire appearance of a skeleton of a moving region. The
algorithm proceeds over all frames where colour segmented regions of the moving object are found. Right:
backward/forward tracking algorithm to the previous and the next frame
of the hierarchical clustering. A threshold is introduced to cut off the tree and obtain real moving
objects. The cut-off threshold of the hierarchical tree was set to 0.7 as proposed in [1]. Several
tests searching for the best parameter setting confirmed this value.
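Single linkage clustering with a cut-off can be sketched in a few lines. The sketch makes assumptions beyond the text: the distance function is supplied by the caller, the cut-off is interpreted as a threshold on that distance (the text does not state on which scale the value 0.7 is measured), and the function name is hypothetical.

```python
import math
from itertools import combinations


def single_linkage(items, dist, cutoff=0.7):
    """Agglomerative single linkage clustering, stopped at `cutoff`.

    Returns a list of clusters, each a sorted list of indices into `items`.
    """
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # single linkage: cluster distance = smallest pairwise member distance
            d = min(dist(items[i], items[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d > cutoff:
            break  # cutting the tree: remaining merges are too far apart
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return [sorted(c) for c in clusters]


# Illustrative feature vectors (e.g. normalised centroid positions of moving regions)
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0)]
groups = single_linkage(points, math.dist, cutoff=0.7)
```

In this example the first three points are chained together because each neighbouring pair is 0.5 apart, although the outer two are 1.0 apart. This chaining is the property of single linkage that lets parts of a non-rigid object with gradually changing motion end up in the same moving object.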
3.7 Moving object segmentation in videos
The input of moving object segmentation is usually a block of frames (BOF), i.e. a
limited number of subsequent frames. In the proposed system, long videos are
processed instead of a few frames, so a large amount of data has to be analyzed, requiring
considerable time and storage. To keep this cost within limits,
additional measures are needed. A common and effective technique for extracting moving objects in
videos and films, frequently described in the literature, is the following.
One of the most fundamental tasks in moving object segmentation for extracting a
description of video and film is to find frames where the motion of the moving objects is
high enough for segmentation and the content is important with respect to motion.
The literature offers many different techniques for finding these frames; an example from
[25] is shown in Fig. 7. In the candidate frames the key objects, which change
significantly in their visual content, are extracted. These important BOFs usually follow shot
boundaries [16]. It is necessary to find these frames because memory and
time are limited. Many algorithms have recently been proposed for finding the frames with the important
content; a detailed description can be found in [22, 25]. After shot boundary
detection and key-frame extraction the mesh-based MOS approach can be applied.
The reason for performing shot boundary detection in moving object segmentation is the high
content movement after such boundaries, which results in the extraction of different moving
objects. At shot boundaries many visual features change, and it is therefore crucial to detect
the shot boundaries before doing further analysis such as moving object segmentation [4].
3.8 Representation and retrieval of objects and events
The previous sections described a way of extracting moving objects, resulting in several
moving object descriptions.
The retrieved moving objects and their trajectories are directly applicable to event
analysis and retrieval. But how can we get events or actions out of the extracted moving
objects? And how is it possible to represent or save the moving objects in an effective way
so that they can be compared to any previously extracted
moving objects?
A standardized way of describing the extracted moving objects is preferable, such as
MPEG-7, which supports content-based video indexing and retrieval. An overview of
MPEG-7 is given in [24]. The standardized format allows interoperability between
applications. MPEG-7 predefines some features for moving object description. These
features are low-level descriptions of elementary features such as colour (e.g. Colour
Layout, Colour Structure), texture and shape of regions. In this work, a moving object
description structure with special focus on colour features has been developed based on the
Detailed Audiovisual Profile (DAVP) of MPEG-7 [2].
Due to the vast amount of monitored data in surveillance systems and other archives, the
For that purpose we have developed a Search and Retrieval Tool [26] which is able to
import MPEG-7 documents and formulate queries by means of a graphical user interface (GUI) and
pre-defined SQL statements. Different videos can be opened, viewed and analyzed. After the
video object has been defined, the search tool automatically builds the query from a
combination of predefined keywords (SQL statements) and the content-based extracted
elements (MPEG-7 descriptors). The parameters used (e.g. which descriptors should be
combined) are defined by the type of analysis. The search result is presented as
a list of references to the metadata descriptions of the matching moving objects,
sorted by similarity.
In the literature an event is defined as something that happens at a given place and time.
Two types of events are possible: object domain events and frame or shot domain events. In
the search tool these events are easy to retrieve. In the context of event retrieval, the most
useful query parameters are the motion trajectories. The trajectories contain the information
of primitive motion, e.g. move left, move right. With SQL statements, moving objects with the
same motion can be searched for. Furthermore, all moving objects within a certain period
can be found. The search tool thus supports the user in bridging the gap between the numerical
features and the symbolic description of meaningful actions and events.
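The kind of query described above can be illustrated with a small, self-contained sketch. The database schema of the Search and Retrieval Tool is not given in this text, so the table `moving_object`, its columns and the sample rows below are purely hypothetical; Python's built-in `sqlite3` module is used only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per extracted moving object
conn.execute(
    "CREATE TABLE moving_object ("
    " id INTEGER PRIMARY KEY, video TEXT,"
    " start_s REAL, end_s REAL, motion TEXT)"
)
conn.executemany(
    "INSERT INTO moving_object (video, start_s, end_s, motion) VALUES (?, ?, ?, ?)",
    [
        ("ski.mpg", 0.0, 4.0, "move left"),
        ("ski.mpg", 3.0, 9.0, "move right"),
        ("f1.mpg", 10.0, 15.0, "move left"),
    ],
)


def find_objects(conn, motion, t_from, t_to):
    """Ids of objects with the given primitive motion that overlap [t_from, t_to]."""
    rows = conn.execute(
        "SELECT id FROM moving_object"
        " WHERE motion = ? AND start_s < ? AND end_s > ?"
        " ORDER BY start_s",
        (motion, t_to, t_from),
    )
    return [row[0] for row in rows]


left_movers = find_objects(conn, "move left", 2.0, 12.0)
```

The two conditions `start_s < t_to` and `end_s > t_from` select every object whose time of appearance overlaps the requested period, which corresponds to the search for all moving objects within a certain period.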
4 Results
In general evaluation of automatic moving object segmentation is a complex process and
Fig. 7 Moving object extraction procedure in video and film [26]
For event detection, it is crucial to extract motion trajectories from moving objects. For
that purpose we need to evaluate the assignment of regions to moving objects (rather than
region segmentation), so we decided to use the precision/recall approach. Computing
precision and recall requires ground truth data.
The ground truth data is extracted by Mean-Shift segmentation (colour segmentation).
The colour-segmented regions are candidates for the ground truth regions. The final ground
truth regions (moving objects) were manually composed of a set of colour-segmented
regions. We define precision and recall as follows:
$$\text{Precision: } p = \frac{n_t}{N_d}, \qquad \text{Recall: } r = \frac{n_t}{N_G}$$

$n_t$: number of correctly segmented regions of all moving objects in frame $t$
$N_d$: total number of segmented regions assigned to all moving objects in one frame by the
algorithm
$N_G$: total number of segmented regions assigned to all moving objects in one frame from
the ground truth data
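For one frame, the two measures can be computed from sets of region identifiers as in the following sketch. The function name and the use of integer region ids are assumptions, as is the convention of returning 0.0 when a set is empty.

```python
def precision_recall(found, ground_truth):
    """Return (p, r) for one frame.

    found        -- ids of regions the algorithm assigned to moving objects (N_d)
    ground_truth -- ids of regions belonging to moving objects in the ground truth (N_G)
    """
    found, ground_truth = set(found), set(ground_truth)
    n_t = len(found & ground_truth)  # correctly segmented regions
    # assumption: an empty set yields 0.0 instead of a division by zero
    precision = n_t / len(found) if found else 0.0
    recall = n_t / len(ground_truth) if ground_truth else 0.0
    return precision, recall


p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

Here three of the four found regions are correct, so p = 3/4 = 0.75, while three of the five ground truth regions are found, so r = 3/5 = 0.6.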
In order to evaluate the specific challenges in moving object segmentation, we have selected
sports videos (skiing and car race) with dynamic scenes, multiple fast moving objects and
occlusion.
The precision/recall values shown in Figs. 8 and 9 indicate good segmentation results
for our mesh-based algorithm. The outliers (worse moving object segmentation) are due to
high motion of video objects and the resulting worse feature tracking. The precision and
recall rates are high and similar (mean values about 0.85) for both videos. High precision values
mean that nearly all found regions are correctly assigned to (i.e. are part of) the real moving object.
The lower recall value illustrates that a number of regions given in the ground truth are not
segmented by the algorithm. The algorithm was designed to extract moving regions which are
assured parts of the moving object, with the drawback that fewer segmented regions are
obtained. In the Formula-1 video more regions are found since the motion vectors can be
calculated better on rigid objects (cars); in the ski-race video fewer regions are found because
many different (non-rigid) motions are combined in one moving object.
[Plot: Precision/Recall Rate, 0.4 to 1, over frames 0 to 120]
[Plot: Precision/Recall Rate, 0.5 to 1, over Frame Number 0 to 120; series: Precision, Recall]
Fig. 9 Precision and recall values for 120 frames of the ski-race video. Average number of MO per frame is 1.2
The recall and precision values are similar to the results of the algorithm described in [28].
Generally, the algorithm has problems if too few stable feature points are found by the
tracker in relation to the number of segmented regions. This can happen if the object is too
far away from the camera, the object does not have enough corners, or there is too much motion
blur in the image.
In the following figures exemplary segmentation results are shown.
Figure 10 visualizes correct moving object segmentation, which in this case results from
good colour segmentation and correctly assigned tracking points.
In Fig. 11 incorrect examples of moving object segmentation are shown. The MOS
results are false due to the incorrect assignment of tracking points to the object.
The analysis was run on an Intel Duo Processor (2.4 GHz, 2 MB L2 Cache, 800 MHz
FSB) with 2 GB of 667 MHz DDR2 SDRAM. The average processing time is 320 ms/frame
at a resolution of 352×288, which is too slow for applications requiring real-time
processing. However, it is possible to speed up the processing depending on the number of
key-frames extracted per shot.
5 Conclusion
In the context of self-configurable event detection, special focus is on unsupervised
algorithms that are flexible enough for application in different domains.
In this work we presented a fully unsupervised mesh-based algorithm for moving object
segmentation. The proposed system facilitates automatic moving object segmentation and is
not restricted to pre-defined settings of the environment; it therefore overcomes a
limitation of many existing moving object segmentation tools.
The evaluation shows that the moving objects extracted by the mesh-based
algorithm reach high precision and recall values of 0.85 on average; the algorithm is therefore
comparable with other state-of-the-art algorithms.
The results show that the algorithm depends on its base techniques, namely
Mean-Shift colour segmentation and KLT point tracking. The colour segmentation should
separate regions of the foreground objects from the background. This was not always
possible due to the different light conditions and the similar colours of foreground and
background. The point tracker has to generate enough stable points on the foreground
objects. Another problem that limits the quality of the moving object segmentation is the
fact that the foreground objects have fewer tracking points and are usually smaller than
the background.
Future work may restrict the application to a specified environment and
implement self-adaptation. Further, the run-time performance has to be improved before
the system can be applied in real-time systems such as online event detection.
Generally, the results encourage further development and application of the proposed
system. Reasonable applications are semantic video indexing, content-based video retrieval
(e.g. search for similar moving objects), and video compression algorithms (e.g. the
MPEG-4 format, which contains a description of moving objects). This work has also
proposed a compact and efficient representation of the content and moving objects using
MPEG-7, including database-based indexing for retrieval of moving objects in large-scale
video repositories.
Acknowledgements The authors would like to thank Werner Haas, Werner Bailer and Peter Schallauer as
well as several other colleagues at JOANNEUM RESEARCH, who provided valuable feedback. The
research leading to these results has received funding from the European Community's Seventh Framework
Programme (FP7/2007-2013) under grant agreement n° 216465 (ICT project SCOVIS).
References
1. Antonini G, Martinez SV, Bierlaire M, Thiran JP (2006) Behavioral priors for detection and tracking of
pedestrians in video sequences. Int J Comput Vis 69(2):159–180
2. Bailer W, Schallauer P (2006) Detailed audiovisual profile: enabling interoperability between MPEG-7
based systems. International Conference on Multi Media Modelling
3. Bailer W, Schallauer P, Bergur Haraldsson H, Rehatschek H (2005) Optimized mean shift algorithm for
color segmentation in image sequences. Image and Video Communications and Processing, pp 522–529
6. Borshukov GD, Bozdagi G, Altunbasak Y, Tekalp AM (1997) Motion segmentation by multistage affine
classification. IEEE Trans Image Process 6:1591–1594
7. Celasun I, Tekalp AM, Gökçetekin MH, Harmancı DM (2001) 2-D mesh-based video object
segmentation and tracking with occlusion resolution. Signal Processing: Image Communication Volume
16, Issue 10
8. Comaniciu D (2002) Mean shift: a robust approach toward feature space analysis. IEEE Transactions on
Pattern Analysis and Machine Intelligence
9. Comaniciu D, Meer P (1997) Robust analysis of feature spaces: colour image segmentation. Department
of Electrical and Computer Engineering
10. Computer Vision Research Group, Department of Computer Science, Homepage:
http://www.cs.otago.ac.nz/research/vision, http://of-eval.sourceforge.net/, 1999.
11. Davis JC (2002) Statistics and data analysis in geology, 3rd edn. Wiley
12. Donoser M (2003) Object segmentation in film and video. Diploma thesis, TU-Graz
13. Duda RO, Hart PE, Stork DG (2001) Pattern classification. Wiley
14. Erdem CE, Sankur B (2000) Performance evaluation metrics for object-based video segmentation.
Proceedings of the 10th European Signal Processing Conference (EUSIPCO 00), pp 917–920, Tampere,
Finland
15. Galić S, Lončarić S (2000) Spatio-temporal image segmentation using optical flow and clustering
algorithm. Proceedings of the First International Workshop on Image and Signal Processing and
Analysis
16. Guo J, Kim J, Jay Kuo C-C (1999) New video object segmentation technique with color/motion
information and boundary postprocessing. Applied Intelligence Journal
17. Heidrich W, Seidel H-P (1999) Realistic, hardware-accelerated shading and lighting. Proceedings of
SIGGRAPH 99
18. Horn BKP, Schunck BG (1980) Determining optical flow. Massachusetts Institute of Technology
19. Kriechbaum A (2005) Segmentation of moving objects in film and video. Master thesis
20. Lepetit V, Fua P (2005) Monocular model-based 3D tracking of rigid objects: a survey. Foundations and
Trends in Computer Graphics and Vision 1(1):1–89
21. Lienhart R (2001) Reliable transition detection in videos: a survey and practitioner's guide. International
Journal of Image and Graphics (IJIG) 1(3):469–486
22. Liu L, Fan G (2005) Combined key-frame extraction and object-based video segmentation. IEEE Trans.
Circuits and System for Video Technology
23. Lucas BD, Kanade T (1981) An iterative image registration technique with an application to stereo
vision. International Joint Conference on Artificial Intelligence, pp 674–679
24. Martinez JM (2002) MPEG-7 overview. International organisation for standardisation
25. Oh J, Lee J, Vemuri E (2003) An efficient technique for segmentation of key object(s) from video shots.
ITCC 03: Proceedings of the International Conference on Information Technology: Computers and
Communications
26. Rehatschek H, Schallauer P, Bailer W, Haas W, Wertner A (2004) An innovative system for formulating
complex combined content-based and keyword-based queries. Proceedings of SPIE-IS&T, Electronic
Imaging, vol. 5304, pp 160–169
27. Tsechpenakis G, Rapatzikos K, Tsapatsoulis N, Kollias S (2003) Object tracking in clutter and partial
occlusion through rule-driven utilization of snakes. IEEE International Conference on Multimedia &
Expo (ICME)
28. Wei Z, Jun D, Wen G, Qingming H (2005) Robust moving object segmentation on H.264/AVC
compressed video using the block-based MRF model. Real-Time Imaging
29. Xu N, Ahuja N, Bansal R (2003) Object segmentation using graph cuts based active contours. CVPR03,
pp 46–53
30. Zhang D, Lu G (2001) Segmentation of moving objects in image sequence: a review. Circuits Syst
Signal Process 20(2):143–183
Andreas Kriechbaum finished his study of Telematics at the University of Technology in Graz in July 2007
with the master thesis "Moving Object Segmentation in Video and Film". This work was performed at the
Institute of Information Systems at JOANNEUM RESEARCH, where he has worked since 2001. He is involved
in a number of national and European research projects in the area of interactive TV and surveillance. His
areas of interest and experience are content based analysis and retrieval of audiovisual information, and the
application of these in the domains of audiovisual archives, video annotation and surveillance.
Roland Mörzinger finished his study "Software Engineering für Medizin" at the Hagenberg University of
Applied Sciences in July 2005 with the diploma thesis "Detection of Grain and Noise for Regraining in Film
and Video". Since then he has been working as a research associate for the JOANNEUM RESEARCH Institute
of Information Systems, where he is involved in international R&D projects. His research interests include
computer vision and multimedia retrieval with a focus on film restoration, machine learning, image and video
classification.
Georg Thallinger received an MSc in Telematics from Graz University of Technology, Austria in 1992.
Georg joined the Institute of Information Systems at JOANNEUM RESEARCH right after university as
a research engineer in the domain of scientific visualization. Since 2002 he has been a co-leader of the Digital Media
group at the institute and as such is co-ordinating large, international projects. His areas of interest and
experience are content based analysis and retrieval of audiovisual information, and the application of these in
the domains of audiovisual archives, film restoration, and surveillance.