
A framework for unsupervised mesh based segmentation of moving objects

    Andreas Kriechbaum · Roland Mörzinger · Georg Thallinger

    Published online: 24 September 2009
    © Springer Science + Business Media, LLC 2009

    Abstract Multimedia analysis usually deals with a large amount of video data with a

    significant number of moving objects. Often it is necessary to reduce the amount of data

    and to represent the video in terms of moving objects and events. Event analysis can be

    built on the detection of moving objects. In order to automatically process a variety of video

    content in different domains, largely unsupervised moving object segmentation algorithms

    are needed. We propose a fully unsupervised system for moving object segmentation that

    does not require any restriction on the video content. Our approach to extract moving objects relies on a mesh-based combination of results from colour segmentation (Mean

    Shift) and motion segmentation by feature point tracking (KLT tracker). The proposed

    algorithm has been evaluated using precision and recall measures for comparing moving

    objects and their colour segmented regions with manually labelled ground truth data.

    Results show that the algorithm is comparable to other state-of-the-art algorithms. The

    extracted information is used in a search and retrieval tool. For that purpose a moving

    object representation in MPEG-7 is implemented. It facilitates high performance indexing

    and retrieval of moving objects and events in large video databases, such as the search for

    similar moving objects occurring in a certain period.

    Keywords Moving object segmentation · Unsupervised system · Spatial segmentation · Motion extraction · Clustering · Optical flow

    Multimed Tools Appl (2010) 50:7–28

    DOI 10.1007/s11042-009-0366-9

    A. Kriechbaum (*) · R. Mörzinger · G. Thallinger
    JOANNEUM RESEARCH, Institute of Information Systems, Steyrergasse 17, 8010 Graz, Austria
    e-mail: [email protected]


    1 Introduction

    A critical task in video understanding for a large amount of data is the automatic

    interpretation of semantically meaningful spatio-temporal objects. To achieve this task, the

    gap between pixel values and semantic descriptions needs to be bridged. The successful application of object-based media description and representation depends largely on

    effective moving object segmentation tools.

    Moving Object Segmentation (MOS) can be used for providing important spatio-

    temporal information about objects whose motion is more or less homogeneous, at least

    over a certain period.

    Generally, moving object segmentation can be used in applications in the field of content-

    based media retrieval. For example, in many film archives a manual similarity search through

    all videos is needed because of the lack of annotation, but this is time-consuming and

    labour-intensive. Automatic unsupervised systems introduce the possibility to search for

    similar objects in a video archive. Other reasonable applications are video compression

    algorithms. Special video formats for compressing videos were developed; a well-known

    example is the MPEG-4 format, which contains a description of moving objects. Motion

    segmentation reduces the high amount of video data: after the analysis of the videos, only

    regions extracted by the MOS are subject to further processing and applications, which

    means a strong reduction of data. The task of event detection in videos requires a system to automatically

    extract moving objects in order to facilitate subsequent person identification and behaviour and

    event analysis. Semantic event detection, media monitoring and video indexing are only a few

    examples from the large spectrum of applications.

    These applications impose common challenges for moving object segmentation. Most importantly, with moving cameras the moving objects move relative to a moving background.

    Another challenge is that the objects' behaviour is generally not known a priori, i.e. they

    may be rigid or non-rigid and moving fast or slow. Illumination variance over a short period

    of time and shadows cast on the object or cast by the object also complicate the

    segmentation process. Background clutter such as swaying branches further makes the

    segmentation of foreground objects difficult. In the case of multiple moving objects which

    move side by side with similar appearance, the possibility for separating the objects is

    limited. Occlusions, e.g. the temporary disappearance of objects moving behind others, are

    further challenges. The computational complexity increases with the scene dynamics, e.g.

    the number of moving objects.

    Therefore, segmenting frames into distinctive parts corresponding to moving objects is

    very difficult. In general, many calculations are necessary for that purpose because moving

    regions usually have no single significant discriminatory feature that can be calculated.

    Possible features are colour, shape, texture and velocity of the associated region pixels. Each

    feature has advantages and disadvantages in different environments. In the past

    many algorithms were developed, but most of them rely on hard restrictions and manual

    human intervention.

    The aim of this work is to explore the feasibility of a fully automatic moving object

    segmentation system which tackles the above-mentioned challenges. Thus, an unsupervised system should extract moving objects without any restriction on the content and without

    any manual intervention in the moving object segmentation process.


    2 Related work

    Moving object segmentation draws from techniques in the field of tracking and

    segmentation. This section deals with related techniques proposed in the literature.

    The following overview of natural-feature-based tracking techniques is given in [20]:

    2.1 Edge-based motion tracker

    This approach is very efficient regarding computation and implementation

    cost. The low fraction of image pixels involved in the calculation is the reason for the low

    computational complexity. Furthermore, this algorithm is robust against illumination

    changes and is very simple to implement. Different edge-based algorithms have been

    developed from the same gradient-based approach: edges are found in an image where

    strong gradients occur. The difference between the algorithms is that one

    extracts explicit model contours for matching with the object database, while the

    other extracts gradients to estimate the object's pose without calculating a contour.

    The extraction of contours leads to a much more reliable result but is slower than the

    algorithm without contour extraction.
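    The gradient criterion above can be illustrated in a few lines (a NumPy sketch, not the implementation referenced in [20]): only pixels whose gradient magnitude exceeds a threshold are kept as edge candidates, which is why so few pixels enter the computation.

```python
import numpy as np

def gradient_magnitude(img):
    """Edge strength as the magnitude of the image gradient.

    Edge-based trackers only process high-gradient pixels, which keeps
    the computational cost low."""
    gy, gx = np.gradient(img.astype(float))  # central differences
    return np.hypot(gx, gy)

# A vertical step edge: only the columns next to the step respond.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
mag = gradient_magnitude(img)
edge_pixels = np.argwhere(mag > 0.25)  # 16 of 64 pixels remain
```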

    2.2 Optical flow-based motion tracker

    If the analysis of a sequence of images should yield moving objects, useful information

    can be extracted through the optical flow-based approach. The optical flow is the velocity

    vector of a pixel with the same intensity in subsequent frames. The optical flow is calculated over the whole image, where each pixel describes a 2D approximation of the

    real 3D motion. To get reliable results, an accurate dense optical flow has to be computed. A

    common representation of the optical flow field uses arrows in a mesh, as in Fig. 1.

    The direction of the visualised arrows describes the direction of the motion; the length

    of the arrows describes the magnitude of the motion of the considered pixel, the longer the

    arrow the larger the motion. In optical flow-based techniques a velocity field is

    extracted as described in [5, 20].

    An important disadvantage of this algorithm is the large linearization error in the optical

    flow constraint if the video contains large motion. Furthermore, generating an optical flow

    field for motion along edges, especially of circles, is very difficult due to the aperture problem [5]. Another disadvantage is the slower computation due to the

    calculation of the optical flow over the whole image.

    Certainly, a combination of edge- and optical flow-based methods is possible and

    provides good results.
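    The optical flow constraint can be made concrete with a minimal single-point Lucas-Kanade estimate (a sketch under the brightness-constancy assumption Ix·u + Iy·v + It = 0; the window size and the synthetic frames are illustrative choices, not values from the paper):

```python
import numpy as np

def lucas_kanade_point(I0, I1, y, x, win=9):
    """Estimate the optical flow (u, v) at pixel (y, x) by solving the
    brightness-constancy equations Ix*u + Iy*v = -It in the least-squares
    sense over a small window (the Lucas-Kanade idea)."""
    h = win // 2
    Iy, Ix = np.gradient(I0)          # spatial gradients (central differences)
    It = I1 - I0                      # temporal difference
    sl = (slice(y - h, y + h + 1), slice(x - h, x + h + 1))
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
    b = -It[sl].ravel()
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)
    return flow                       # (u, v) in pixels per frame

# Synthetic frames: a Gaussian blob translated by one pixel in x.
yy, xx = np.mgrid[0:64, 0:64].astype(float)
I0 = np.exp(-((xx - 30) ** 2 + (yy - 32) ** 2) / 40.0)
I1 = np.exp(-((xx - 31) ** 2 + (yy - 32) ** 2) / 40.0)
u, v = lucas_kanade_point(I0, I1, 32, 30)
# u is close to 1 and v close to 0, matching the true motion.
```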


    2.3 Template-based motion tracker

    The template-based algorithm does not rely on tracked points; it depends on

    templates, i.e. patterns of the objects to be tracked. The algorithm is

    designed for complex objects which are difficult to model with local features. An example

    of a useful application is finding a book cover in an image sequence. With the help of edges

    and local features the content of the image is processed; once the cover is found in the form

    of a plane, only the plane has to be detected, which is much simpler than in the

    other proposed algorithms. A disadvantage of this algorithm is that it is computationally

    expensive. An important application of the template-based approach is the Lucas-Kanade

    tracker [20], which finds a value for the deformation that warps a

    template of an object into the image.

    2.4 Interest points-based motion tracker

    This algorithm relies on local features similar to the optical flow-based method. The

    extraction of a subset of image pixels reduces the computational cost. The patches around

    the points should be textured and their neighbors should be different to eliminate unstable

    edges. An object feature is defined by the location and the corresponding patch. After the

    initialization of the features the algorithm computes the same loop as the edge-based

    algorithm. The local features across the image are robust against partial occlusion and

    matching errors. The interest point-based algorithm exploits more of the information

    in the whole image to gain robustness.

    In the literature, different ways of classifying moving object segmentation approaches are discussed. A review of state-of-the-art techniques is presented in [30]. Generally, the

    algorithms can be divided into three main groups of moving object segmentation

    techniques. The following sections describe representative algorithms shortly.

    2.5 Spatial segmentation, motion extraction, clustering

    A Mean-Shift based algorithm proposed in [16] provides robust homogeneous colour

    regions according to dominant colours. Furthermore, frame intensity difference based

    motion detection is applied for motion extraction. The detected moving regions are

    analyzed by a region-based affine model and further tracked to increase the consistency of the extracted objects. A morphological open-close operator is used to remove gulfs and

    isthmi (narrow connections between two large regions) for object boundary smoothing. A

    shape coding optimization is done using boundaries of variable width. The algorithm is fast

    and highly accurate.

    In [15] a two-dimensional feature vector is used for clustering in the feature space. The

    first feature is image brightness which reveals the structure of interest in the image. The

    second feature is the Euclidean norm of the optical flow vector. The optical flow field is

    computed using the Horn-Schunck algorithm [18]. By clustering the feature space, moving

    objects in the image are detected. The algorithm has the advantage that it is robust with regard to background movement.

    In [21] the moving object segmentation procedure is treated as a Markovian labelling problem


    and is validated by an elaborate occlusion detection scheme. The initial object mask is

    segmented by the MRF model. A disadvantage of the approach is that it cannot deal

    properly with noise.

    The PCA and GGM based algorithm proposed in [22] consists of three stages: the

    initial segmentation of the first frame using colour, motion, and position information, based on a variant of the K-Means-with-connectivity-constraint algorithm. Then a

    temporal tracking algorithm is applied, using a Bayes classifier and rule-based processing

    to reassign changed pixels to existing regions and to handle the introduction of new

    regions. Finally a trajectory-based region merging procedure is used that employs the

    long-term trajectory of regions to group them to objects with different motion. It is

    advantageous that the algorithm can handle fast-moving objects, new objects and

    disappearing objects.

    2.6 Point tracking, 2D mesh

    At the first frame of the video, an optimal number of feature points are selected as

    nodes of a 2D content-based mesh. These points are classified as moving (foreground)

    and stationary nodes based on the multi-frame node motion analysis described in [7],

    yielding a coarse estimate of the foreground object boundary. To extract the moving

    object, colour differences across triangles near the coarse boundary are employed. The

    boundary of the video object is refined by the maximum contrast path search along

    the edges of the 2D mesh. Next the refined boundary to the subsequent frame is

    propagated by using motion vectors of the node points to form the coarse boundary at

    the next frame.

    This algorithm is able to detect occlusions, but small objects cannot be found. The point

    tracker-based approach for moving object segmentation is a possible approach for a fully

    automatic moving object segmentation and similar to the approach developed by the

    authors.

    2.7 Active contours

    The latest algorithms are often based on active contours. This is an iterative algorithm

    that produces a better contour-line description in every iteration step. It receives the

    previous contour line as input and uses some balancing constant factors (internal and external energy) to produce a new contour-line description. Active contours (snakes)

    minimize the sum of the internal and external energy [27]. The graph cut algorithm

    [29] is an improvement of active contours, leading to a smooth contour free of self-

    crossing and uneven spacing problems. The internal force, which is used in the energy

    functions to control the smoothness, is no longer needed and the number of parameters is

    reduced.

    An advantage of these algorithms is the ability to handle unknown noise, highly textured

    background, and partial object occlusions. The disadvantage of active contours based

    algorithms is the need for initial object segmentation or the requirement of initial seeds. Due to the need for initial object segmentation this algorithm is often used for video object

    tracking and it is not considered in this work.


    3 Mesh-based moving object segmentation

    The main goal of this work is to segment moving regions in videos without any restrictions

    on the content (like static or moving cameras). The development of the mesh-based moving

    object segmentation algorithm should solve the problems in moving object segmentation. The work was mainly done in the master's thesis [19].

    Many approaches from literature suggest using dense motion fields or an optical flow

    field in which the video objects have to be segmented. However, the problem is that the

    method for optical flow field calculation for every pixel of the image is underspecified and

    that computing a velocity vector for every pixel of the images is often redundant because

    most pixels in an image have zero motion. In the proposed algorithm a similar, yet modified

    approach is chosen.

    In general, the developed approach is based on the combination of pre-extracted

    features. These are feature points of a KLT feature tracker and regions extracted by colour

    segmentation. The colour segmentation algorithm used was developed using the specific

    Mean-Shift implementation described in [12].

    The main contribution of this paper is a new integrated workflow for mesh-based

    moving object segmentation. It draws from existing state-of-the-art techniques for tracking

    and segmentation, such as the Mean-Shift colour segmentation and the KLT feature

    tracking. In this paper these approaches have been combined and partly extended and

    improved in order to provide an integrated framework for fully automatic moving object

    segmentation.

    3.1 Mean-shift algorithm

    The Mean-Shift (MS) algorithm is a procedure to analyse feature spaces using a clustering

    method. Many pattern recognition applications make use of the advantages of the MS

    algorithm, for example colour image segmentation, where image pixels are first

    mapped into a specific feature space (e.g. the L*u*v* colour space) and then clusters

    are formed. The cluster centres are of great importance, because they characterize the most

    significant features of the image, which, in the colour image segmentation example, are the

    dominant colours. Consequently, the Mean-Shift algorithm is applied to locate the clusters

    and to estimate their centres to obtain the dominant colours of the image, which can then be

    used for the segmentation procedure. The result of the segmentation process is a correspondence of clusters to homogeneous colour regions of the image.

    In 1997 Comaniciu and Meer [9] applied the Mean-Shift algorithm to colour image

    segmentation. The results of the segmentation were better compared to the results of other

    similar applications (for example the Watershed algorithm); hence this approach became

    very popular. The use of MS in image sequences is also proposed in [3, 12]. In principle,

    the procedure is more robust against changes of the illumination conditions and less

    prone to over-segmentation.

    Figure 2 shows an input image and the according three-dimensional feature space.

    Finding the location of the clusters is done by using a search window in the feature space, which shifts to the centre of each cluster. The direction and the magnitude of the shift

    are based on the difference between the centre of the search window and the local mean value within the window.


    3.1.1 Propagation of cluster centres

    The position defining where the search begins determines the number of shifts which are

    necessary for locating the centre of a cluster in the feature space. Consequently it is crucial to

    find the best starting position for the search window in order to minimize the number of shifts.

    This is implemented by choosing a number of random positions and then declaring the one with

    the highest density of feature vectors as starting position for the search window.
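    This starting-position heuristic can be sketched as follows (parameter names and the toy data are illustrative assumptions, not the authors' code):

```python
import numpy as np

def best_start(points, radius, n_candidates=10, seed=0):
    """Choose the search-window start: among randomly drawn candidate
    positions, return the one with the highest density of feature
    vectors, i.e. the most samples inside the window."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=n_candidates, replace=False)
    counts = [(np.linalg.norm(points - points[i], axis=1) < radius).sum()
              for i in idx]
    return points[idx[int(np.argmax(counts))]]

# A dense cluster near the origin plus a few sparse outliers: the window
# start lands in the dense cluster, minimising the shifts still needed.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.3, (90, 2)), rng.normal(10, 0.3, (10, 2))])
start = best_start(pts, radius=1.0)
```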

    3.1.2 Iteration steps

    After the successful determination of the starting position the search window is shifted to

    the designated position iteratively. The intensity distribution of each colour can be seen as a

    probability density function. The difference between the local mean of this function and the

    centre of the search window defines the mean shift, which points towards the designated cluster centre. Intuitively, mean shift corresponds to an

    estimation of the gradient of the data density.

    Fig. 2 Colour image and corresponding L*u*v* colour space [8]

    Figure 3 visualises the iteration process. In the left image the starting points (colour) are

    shown, which move along the paths visualized in the right illustration. Some clusters merge

    to larger clusters, which represent a single colour.

    3.1.3 Mathematical theory of mean-shift algorithm

    A sphere $S_x$ with radius $r$ positioned at point $x$ delimits a set of feature vectors $y$ within the sphere. The expected value of $z = y - x$ for these vectors, which is based on $x$ and $S_x$, can be computed by:

    $$m \equiv E[z \mid S_x] = \int_{S_x} (y - x)\, p(y \mid S_x)\, dy = \int_{S_x} (y - x)\, \frac{p(y)}{P(y \in S_x)}\, dy$$

    In [11] the computation leading to the following final equation can be found:

    $$m = E[z \mid S_x] \approx \frac{r^2}{n + 2}\, \frac{\nabla p(x)}{p(x)} = E[x \mid x \in S_x] - x$$

    where $n$ is the dimension of the feature space.

    The mean shift vector represents, as discussed earlier, the difference between the local

    mean and the centre of the search window and is proportional to the density gradient.

    Consequently, the centres of the clusters are located in a region, where the gradient values

    are low and p(x) is high. The mean shifts aim to the maximum of the probability density,

    which is called the mode. At the mode, the mean shift is obviously zero, so accordingly a criterion for stopping the iterative steps has to be defined; for example, if the magnitude of

    the shift is smaller than 0.1, the iteration stops. All pixels of the search window are discarded

    and their 8-connected neighbours are removed. Afterwards, further starting-points can be

    found and they are iteratively processed to converge to a cluster centre.

    This procedure is done until the number of feature vectors within the search window is

    smaller than a value, which defines the smallest number of elements which are needed for a

    relevant image region. All pixels within the window whose feature vectors are located on

    the relevant colour points are attached to the colour of the centre of the window. For a

    further calculation process only those pixels are kept which have at least one neighbour

    with an already assigned colour in the feature space.
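    Reduced to its essentials, the iteration reads as follows (a flat-kernel sketch in a toy 2-D feature space; the pruning of processed pixels and the minimum-region criterion described above are omitted):

```python
import numpy as np

def mean_shift_modes(points, radius=1.0, tol=1e-3, max_iter=100):
    """Shift every sample to a mode of the data density: repeatedly
    replace a position by the mean of all samples within `radius`
    (a flat kernel) until the shift magnitude drops below `tol`."""
    shifted = points.copy()
    for i in range(len(shifted)):
        x = shifted[i]
        for _ in range(max_iter):
            d = np.linalg.norm(points - x, axis=1)
            m = points[d < radius].mean(axis=0) - x  # the mean-shift vector
            x = x + m
            if np.linalg.norm(m) < tol:              # converged to a mode
                break
        shifted[i] = x
    return shifted

# Two well-separated clusters in a toy 2-D feature space: every sample
# converges to one of two modes (the dominant "colours").
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])
modes = mean_shift_modes(pts)
```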


    The Mean-Shift algorithm can be used in any n-dimensional feature space. In this

    paper we used the MS algorithm for basic colour segmentation and as a clustering device

    for the KLT tracker. Possible features for spanning a feature space in moving object

    segmentation are the common colour components red, blue, and green, or the

    converted features in the L*u*v* space. Further helpful features would be the spatial coordinates and a temporal relationship of points extracted by a point tracker

    over a block of frames. Each of these features refines the result with the drawback of

    a longer time to convergence.

    3.1.4 Post-processing steps

    In a post-processing step small regions, which potentially occur along the boundary of

    larger regions, are assigned to the closest substantial colour in the feature space.

    The Mean-Shift algorithm yields better results than other proposed colour segmentation

    approaches with regard to image sequences; however, it is computationally expensive. This algorithm can be used in many different applications, due to the unrestricted

    number of dimensions, thus diverse features can be obtained.

    The accuracy of the segmentation is determined by the size of the search window. A small

    window results in a more accurate segmentation, i.e. more segmented regions

    are obtained, and vice versa. Therefore, the radius should adapt to a criterion which

    represents the visual activity.

    3.1.5 Mean-shift advantages / disadvantages

    On the one hand, the Mean-Shift algorithm yields impressive segmentation results; not only

    in the field of image segmentation, but also in many other clustering applications its

    positive properties. On the other hand, the calculation time of the algorithm is high

    compared to other proposed segmentation procedures (e.g. to the Watershed algorithm

    [15]).

    By adjusting some parameters of the Mean-Shift algorithm, it is possible to obtain

    diverse segmentation results, so it can easily be tuned towards under- or over-segmentation

    as desired.

    Moreover, the segmentation results are robust towards changes in the images (for

    example changes in the illumination conditions), and consequently the Mean-Shift segmentation procedure yields very reliable results for a video sequence.

    In [11] it was shown that the Mean-Shift algorithm is well suited for colour segmentation of

    image sequences, due to better results regarding the temporal stability of the segmentation

    compared to other approaches (for example Watershed algorithm). A drawback is that the

    Mean-Shift algorithm is computationally more expensive and the computational costs are even

    more an issue when the algorithm is applied to image sequences (videos). To overcome this

    several optimizations are proposed in [3].

    The MS produces a base segmentation which is combined with the motion information

    generated by the KLT tracker to obtain a moving object. The Mean-Shift algorithm provides the dominant motion cluster assignment as a new property of the KLT feature points. Using

    the ability of the MS algorithm to perform clustering in high dimensions, motion clusters

    are formed.

    3.2 Mesh-based moving object segmentation

    The motivation for the mesh-based segmentation approach comes from the optical flow-

    based algorithms suggested in the literature and the point tracker-based and mesh-based approach for

    moving object segmentation in [7]. In the developed approach the feature points are assigned to dominant clusters which represent moving objects. Furthermore, this

    algorithm is based on the assumption that a colour-segmented region belongs to a single

    object, which is either a foreground object (moving object) or a background object. This

    assignment is done similarly to [7]. A workflow can be seen in Fig. 4.

    In this approach velocity stable triangles (built from the extracted feature points) which

    belong to one cluster are combined to represent a moving object. Triangles which are on the

    same object are from the same motion cluster. If not all points from the triangle are assigned

    to the same cluster the triangle is not used for the moving object segmentation. Triangles

    with points from the same cluster are assigned to colour segmented regions and so a stable

    skeleton of the object is formed. The stability of the triangles is determined by the

    interpolated motion field calculated for the feature points. The extracted moving object is

    reliable through the combination of the two base approaches (colour segmentation, feature

    point tracking) and further introduced quality measurements based on the motion field. The

    combination of these algorithms solves the problems of MOS with moving cameras. With

    this algorithm occlusion is not a problem; appearance and disappearance of objects in the

    image do not have any negative impact on the algorithm.

    3.3 Stable triangle detection

    This processing step is one of the most important steps in the mesh-based algorithm which

    leads to the desired quality. In this step reliable triangles are selected from all triangles for a

    further region assignment to the according dominant cluster. The feature points extracted by

    the KLT feature tracker are triangulated resulting in a mesh. The mesh extraction facilitates

    a fast computation of a dense optical flow field. This process is called motion estimation.

    The diagram in Fig. 5 shows the process to make this clearer.

    [Fig. 4 Workflow of the mesh-based moving object segmentation: Mean-Shift colour segmentation and point tracking with dominant motion clustering (followed by hierarchical clustering) deliver points, clusters and regions; stable triangles are detected, dominant motion clusters are assigned, candidate regions are found and tracked, yielding the moving object.]


    First, Delaunay's triangulation method [11] is applied to the extracted feature points

    containing their local motion information. Several tests have shown that Delaunay's

    algorithm is very fast and has a high stability (linkage) of the extracted mesh over time

    which is a necessary precondition in the mesh-based moving object algorithm.
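    For illustration, the triangulation step can be reproduced with SciPy's Qhull-based implementation (the paper does not state which Delaunay implementation was used; the feature points below are hypothetical):

```python
import numpy as np
from scipy.spatial import Delaunay

# Hypothetical KLT feature point positions (x, y); in the real system
# these come from the tracker together with their velocity vectors.
points = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0], [2.0, 1.5]])
tri = Delaunay(points)

# tri.simplices holds one row of three point indices per triangle.
n_triangles = len(tri.simplices)
```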

    Second, a dense motion field is extracted. The motion field is calculated by applying the

    Gouraud shading algorithm [17] to the velocity vectors of the extracted mesh. The assumption is that KLT feature points of the mesh which are assigned to one dominant

    motion cluster belong to a moving object. If these points on the moving objects

    have a correctly assigned velocity vector and if there are many points inside the border of the

    moving object, the Gouraud shading algorithm will provide a linearly interpolated motion

    field of the image. The linear interpolation between the motion vectors is valid because the

    error is minor in the area of the moving objects and larger in their neighbourhood.

    If there are too few points in the area of the moving object, the algorithm provides

    worse results. Furthermore, if the points extracted by the KLT tracker have incorrect

    velocities, the motion field will not be calculated properly. However, the Gouraud

    interpolation is only an approximation of a dense optical flow field. The motion field calculation is very fast in contrast to the methods described in [6]. Using the extracted

    motion estimate, points can be found in a frame which are more reliable than other

    points in terms of temporal stability. This reliability can be used in the further processing

    steps.
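    The Gouraud-style interpolation amounts to linear (barycentric) interpolation of the per-vertex velocity vectors inside each triangle; a minimal sketch with hypothetical vertex positions and motion vectors:

```python
import numpy as np

def interpolate_motion(p, verts, motions):
    """Gouraud-style motion interpolation: linearly interpolate the
    per-vertex motion vectors `motions` at point `p` inside the triangle
    `verts`, using barycentric coordinates."""
    a, b, c = verts
    T = np.column_stack([b - a, c - a])
    l1, l2 = np.linalg.solve(T, p - a)   # barycentric coordinates
    l0 = 1.0 - l1 - l2
    return l0 * motions[0] + l1 * motions[1] + l2 * motions[2]

verts = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
motions = np.array([[1.0, 0.0], [1.0, 0.0], [3.0, 0.0]])  # per-vertex velocity
v = interpolate_motion(np.array([0.0, 2.0]), verts, motions)
# Halfway along the left edge the interpolated motion is (2.0, 0.0).
```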

    Third, the displaced frame difference (DFD) image is extracted which shows the reliability

    of the extracted motion field. The DFD-image is calculated from the original image and an

    image predicted based on the calculated motion. Due to the interpolation of the motion field the

    predicted image is not identical (only motion compensated) to the original consecutive image. If

    the motion field is correctly calculated, all values of the DFD image are zero. Values which are

    not zero indicate pixels/regions which are calculated unreliably due to a falsely estimated motion

    field; pixels/regions with values near zero have a high reliability.
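    A toy displaced frame difference (with hypothetical frames; the paper computes the DFD on real images): the DFD is exactly zero wherever the motion-compensated prediction matches frame t+1.

```python
import numpy as np

def dfd(frame_next, frame_pred):
    """Displaced frame difference: zero where the motion field predicted
    the next frame correctly, large where the estimate was wrong."""
    return np.abs(frame_next.astype(float) - frame_pred.astype(float))

# A bright 2x2 block moving one pixel to the right between two frames.
f0 = np.zeros((6, 6)); f0[2:4, 1:3] = 1.0
f1 = np.zeros((6, 6)); f1[2:4, 2:4] = 1.0

pred = np.roll(f0, 1, axis=1)   # motion-compensate f0 with flow (+1, 0)
err_good = dfd(f1, pred)        # all zero: the motion field is correct
err_bad = dfd(f1, f0)           # nonzero: no motion compensation applied
```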

    [Flowchart: points and clusters are triangulated; Gouraud shading of the triangulated mesh yields a motion field; motion compensation of original frame t produces an interpolated frame t+1, which is compared with the original frame t+1 to form the displaced frame difference image; the DFD value per triangle enters the lifetime and cluster reliability calculation, which outputs the stable triangles.]

    Fig. 5 Stable triangle detection with motion interpolation


    point does not necessarily have to be in the same cluster, since the region may be a

    border region of the object with possibly poor colour segmentation. Moreover, in the

    reliable triangle search, triangles are discarded which are below a minimum temporal

    reliability in order to guarantee reliable moving object segmentation. This can be done based on the

    dominant cluster information over time: the dominant motion cluster is calculated over subsequent frames, so each triangle has a lifetime due to the cluster assignment

    and the tracking of its points. If a triangle has a longer lifetime than an adjustable parameter,

    the triangle is a candidate for further processing. A problem of the algorithm is the

    dependency on the reliability of the KLT tracker and on the stability of the triangulation

    over time. This dependency results in skipped frames of the tracked triangles over a certain

    time. These outliers are tolerated by introducing a new parameter which defines the

    minimum appearance of the triangle; otherwise the skipped frames would lead to discarded

    triangles.

Reliable triangles are selected with the help of the previously introduced parameters, namely the DFD-based triangle reliability, the minimum lifetime and the minimum appearance percentage. This process yields, for each frame, a set of reliable triangle candidates for further processing.
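The three selection criteria can be sketched as a simple filter. The data layout and the threshold values are illustrative assumptions, not the parameters used in this work:

```python
def select_reliable_triangles(history, min_lifetime=5, min_appearance=0.8,
                              max_dfd=10.0):
    """Filter triangle candidates by the three criteria from the text.

    history: dict mapping triangle id -> list of (frame_no, dfd_value)
             observations while the triangle kept its dominant cluster.
    A triangle qualifies if its lifetime (frames between first and last
    observation) is long enough, it appeared in a large enough fraction
    of those frames (tolerating skipped frames), and its mean DFD value
    indicates a reliably estimated motion field.
    """
    reliable = []
    for tri_id, obs in history.items():
        frames = [f for f, _ in obs]
        lifetime = max(frames) - min(frames) + 1
        appearance = len(set(frames)) / lifetime
        mean_dfd = sum(d for _, d in obs) / len(obs)
        if (lifetime >= min_lifetime and appearance >= min_appearance
                and mean_dfd <= max_dfd):
            reliable.append(tri_id)
    return reliable
```

A triangle observed in five consecutive frames passes; one seen only twice over ten frames fails the appearance test, mirroring the outlier handling described above.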

    3.4 Cluster assignment to segmented regions

After a set of stable triangles has been selected, colour segmented regions can be assigned to the corresponding triangles. The idea is that the current triangle is assigned to a dominant motion, which means that a possible region assignment overlaps the triangle. If the region is correctly assigned to a triangle, and thereby to a dominant cluster, a skeleton of the moving object is extracted. The assignment to a dominant motion cluster is done for all segmented regions which lie entirely or partially within a triangle, and a minimum adjustable threshold has to be exceeded. This threshold declares the minimum percentage of the triangle area which overlaps a colour segmented region. If the assignment of the clusters to the region is ambiguous, a further determination has to be made; otherwise multiple clusters could be assigned to the region. If all three tracking points inside a triangle belong to the same cluster, that triangle has a higher weight in the calculation than triangles with only two points belonging to the same cluster. The triangle with the higher weight is assigned to the region. If the region contains more than one triangle with the same weight, the triangle with the larger area overlap is selected.

All regions with the same dominant cluster assignment are selected to extract the skeleton of the moving object. In Fig. 4 two further processing steps are depicted: tracking of the assigned regions and hierarchical clustering. These steps are used to derive realistic moving objects from the extracted moving object skeletons.
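The ambiguity resolution of the cluster-to-region assignment can be sketched as follows; the dictionary layout and the default overlap threshold are illustrative assumptions:

```python
def assign_region_cluster(triangles, min_overlap=0.5):
    """Assign a colour-segmented region to a dominant motion cluster.

    triangles: list of dicts, one per triangle overlapping the region, e.g.
        {"cluster": 3, "same_cluster_points": 3, "overlap": 0.9}
    where "overlap" is the fraction of the triangle area covered by the
    region and "same_cluster_points" (2 or 3) counts the triangle's
    tracked corner points that share the dominant cluster.
    """
    # Keep only triangles whose overlap with the region is large enough.
    candidates = [t for t in triangles if t["overlap"] >= min_overlap]
    if not candidates:
        return None
    # Triangles with all three points in one cluster outweigh two-point ones;
    # ties are resolved by the larger area overlap, as described in the text.
    best = max(candidates, key=lambda t: (t["same_cluster_points"], t["overlap"]))
    return best["cluster"]
```

So a triangle with three same-cluster points wins over one with two points even when the latter has a larger overlap, which is exactly the weighting rule stated above.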

    3.5 Region tracking

Region tracking describes an approach to derive first realistic moving objects from the initial moving object skeletons. To each extracted skeleton a cluster parameter (velocity) from the dominant motion tracking is assigned. Using these parameters, new regions corresponding to the moving region of the current frame are found in the


The new colour segmented regions have to satisfy several thresholds, such as the mean RGB colour threshold of the region, the bounding box width, the bounding box height and the region area. These thresholds ensure that the corresponding new region in the next and previous frame is the same region of the moving object as in the current frame. After region tracking, the results are moving regions, each containing points with a similar dominant motion due to the assignment across previous and next frames.
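The per-region continuation test can be sketched as below. The region representation, the tolerance values and the use of a simple channel-wise colour distance are illustrative assumptions, not the thresholds used in this work:

```python
def matches_tracked_region(region_t, region_t1, velocity,
                           color_tol=20.0, size_tol=0.2):
    """Decide whether a region in frame t+1 continues a moving region from frame t.

    region_t / region_t1: dicts with "bbox" = (x, y, w, h) and
    "mean_rgb" = (r, g, b); velocity: (dx, dy) cluster parameter from the
    dominant motion tracking.
    """
    x, y, w, h = region_t["bbox"]
    # Transform the bounding box into frame t+1 using the cluster velocity.
    px, py = x + velocity[0], y + velocity[1]
    x1, y1, w1, h1 = region_t1["bbox"]
    # Mean RGB colour similarity (largest per-channel deviation).
    color_dist = max(abs(a - b)
                     for a, b in zip(region_t["mean_rgb"], region_t1["mean_rgb"]))
    # Bounding-box position and size similarity.
    pos_ok = abs(px - x1) <= size_tol * w and abs(py - y1) <= size_tol * h
    size_ok = abs(w - w1) <= size_tol * w and abs(h - h1) <= size_tol * h
    return color_dist <= color_tol and pos_ok and size_ok
```

A region moved by the cluster velocity with near-identical colour and box size is accepted; a colour mismatch rejects the assignment.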

    3.6 Hierarchical clustering

To obtain more realistic segmentation results for non-rigid moving objects, a further processing step is introduced. Using hierarchical clustering, parts of non-rigid objects containing different dominant motions are connected.

Hierarchical clustering is well known and commonly used in image processing. It requires distance measures between nodes; with the help of these, the clustering finds the nearest nodes (representing features of moving objects) and combines them into a single node. The clustering uses the single linkage method described in [13] and continues iteratively until a hierarchical tree is extracted. In the case of moving object segmentation, the node parameters are features of the moving regions: the motion trajectories and the distances between the centroids of the moving regions. The motion trajectories locate the clusters over several frames. If the motion trajectories of the compared regions are sufficiently similar and their distance is sufficiently short, the compared regions belong to the same moving object. For a better understanding of the algorithm, an example is given in the next paragraph.

The motion of all bones of a leg is not the same during a time period (for example in a

[Flowchart, left panel: find the colour segmented regions of the moving object in frame t → find all colour segmented regions in the next frame t+1 → forward tracking and backward tracking of all colour segmented regions until the last possible assignment of the region → association of the assigned regions to the actual moving object → proceed over all frames as long as new MO regions are found. Right panel: find the bounding box of the colour segmented region → transform the bounding box with the cluster parameters from the point tracker into the next frame → check the underlying base segmented regions for colour similarity and the similarity of the transformed bounding box to the bounding box of the underlying region → if the similarity is higher than a threshold, assign the region to the moving object.]

Fig. 6 Region tracking. Left: tracking over the entire appearance of a skeleton of a moving region. The algorithm proceeds over all frames where colour segmented regions of the moving object are found. Right: backward/forward tracking algorithm to the previous and the next frame



of the hierarchical clustering. A threshold is introduced to cut off the tree in order to obtain real moving objects. The cut-off threshold of the hierarchical tree was set to 0.7 as proposed in [1]; this value was also confirmed by several tests in the search for the best parameter setting.
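The single linkage clustering with a tree cut can be sketched in a naive form. The feature layout (trajectory parameters plus centroid coordinates, assumed to be normalised so that distances are comparable) is an illustrative assumption:

```python
import math

def single_linkage_clusters(features, cut=0.7):
    """Naive single-linkage clustering with the hierarchy cut at `cut`.

    features: list of feature vectors, one per moving region.
    Regions whose single-link distance stays below `cut` end up in the
    same cluster, mirroring the 0.7 cut-off used in the text.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Start with every region in its own cluster.
    clusters = [{i} for i in range(len(features))]
    while len(clusters) > 1:
        # Single linkage: cluster distance = distance of the closest member pair.
        i, j, d = min(
            ((i, j, min(dist(features[a], features[b])
                        for a in clusters[i] for b in clusters[j]))
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda t: t[2])
        if d > cut:          # cutting the hierarchical tree
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

Two regions with similar trajectories and nearby centroids merge into one moving object, while a distant region stays its own cluster.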

    3.7 Moving object segmentation in videos

The input of moving object segmentation is usually a block of frames (BOF), i.e. a limited number of subsequent frames. In the proposed system, entire videos are processed instead of a few frames; thus a large amount of data has to be analyzed, requiring considerable time and storage. To keep these costs within a limit, new approaches are needed. A common and effective technique to extract moving objects in videos and films, frequently described in the literature, is the following.

One of the most fundamental tasks in moving object segmentation for extracting a description of video and film is to find frames where the motion of the moving objects is high enough for segmentation and the content is important with respect to motion. The literature offers many different techniques to obtain these frames; an example from [25] is shown in Fig. 7. In the candidate frames, the key objects which change significantly in their visual content are extracted. These important BOFs are usually located right after shot boundaries [16]. Finding these frames is necessary due to the limitation of memory and processing time. Recently, many algorithms have been proposed to obtain the frames with the important content; a detailed description can be found in [22, 25]. After shot boundary detection and key-frame extraction, the mesh-based MOS approach can be applied.

The reason for performing shot boundary detection before moving object segmentation is the large content change across such boundaries, which results in the extraction of different moving objects. At shot boundaries many visual features change, and it is therefore crucial to detect the boundaries before further analysis such as moving object segmentation [4].

    3.8 Representation and retrieval of objects and events

In the previous sections a way of extracting moving objects, resulting in several moving object descriptions, was described.

The retrieved moving objects and their trajectories are directly applicable to event analysis and retrieval. But how can we derive events or actions from the extracted moving objects? And how can the moving objects be represented or stored in an effective way, so that they can be compared to any previously extracted moving objects?

A standardized way of describing the extracted moving objects is preferable, such as MPEG-7, which supports content-based video indexing and retrieval. An overview of MPEG-7 is given in [24]. The standardized format allows interoperability between applications. MPEG-7 predefines some features for moving object description. These features are low-level descriptions of elementary features such as colour (e.g. Colour Layout, Colour Structure), texture and shape of regions. In this work, a moving object description structure with special focus on colour features has been developed based on the detailed audiovisual profile (DAVP) MPEG-7 profile [2].

Due to the vast amount of monitored data in surveillance systems and other archives, the



For that purpose we have developed a Search and Retrieval Tool [26] which is able to import MPEG-7 documents and formulate queries through a graphical user interface (GUI) and pre-defined SQL statements. Different videos can be opened, viewed and analyzed. After the definition of the video object, the search tool automatically builds the query from a combination of predefined keywords (SQL statements) and the content-based extracted elements (MPEG-7 descriptors). The parameters used (e.g. which descriptors should be combined) are defined by the type of analysis process. The search result is a list of references to the metadata descriptions of the matching moving objects, sorted by similarity.

In the literature, an event is defined as something that happens at a given place and time. Two types of events are possible: object domain events and frame or shot domain events. In the search tool these events are easy to retrieve. In the context of event retrieval, the most useful query parameters are the motion trajectories. The trajectories contain the information of primitive motion, e.g. move left, move right. With SQL statements, moving objects with the same motion can be searched for; furthermore, all moving objects within a certain period can be found. The search tool thus supports the user in bridging the gap between the numerical features and the symbolic description of meaningful actions and events.
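The query pattern for such event retrieval can be sketched with a hypothetical minimal schema; the table layout, column names and sample data are illustrative assumptions, not the schema of the actual tool:

```python
import sqlite3

# Hypothetical minimal schema; the real tool stores references to MPEG-7
# descriptors, but the SQL query pattern is the same.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE moving_object
               (id INTEGER PRIMARY KEY, video TEXT,
                start_frame INTEGER, end_frame INTEGER, motion TEXT)""")
con.executemany(
    "INSERT INTO moving_object VALUES (?, ?, ?, ?, ?)",
    [(1, "ski.mpg", 10, 90, "move_left"),
     (2, "ski.mpg", 40, 120, "move_right"),
     (3, "f1.mpg", 200, 260, "move_left")])

# Event query: all moving objects with a given primitive motion that
# appear within a certain period (here frames 0-100 of ski.mpg).
rows = con.execute(
    """SELECT id FROM moving_object
       WHERE video = ? AND motion = ?
         AND start_frame < ? AND end_frame > ?
       ORDER BY id""",
    ("ski.mpg", "move_left", 100, 0)).fetchall()
# rows -> [(1,)]
```

The same pattern, with the motion predicate dropped, returns all moving objects within a time window.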

    4 Results

In general, evaluation of automatic moving object segmentation is a complex process and

    Fig. 7 Moving object extraction procedure in video and film [26]



For event detection, it is crucial to extract motion trajectories from moving objects. For that purpose we need to evaluate the assignment of regions to moving objects (rather than the region segmentation itself), so we decided to use the precision/recall approach. Computing precision and recall requires ground truth data.

The ground truth data is derived from Mean Shift (colour) segmentation: the colour segmented regions are candidates for the ground truth regions, and the final ground truth regions (moving objects) were manually composed of sets of colour segmented regions. We adopt precision and recall as follows:

Precision: p = n_t / N_d        Recall: r = n_t / N_G

n_t  number of correctly segmented regions of all moving objects in frame t
N_d  total number of segmented regions assigned to all moving objects in one frame by the algorithm
N_G  total number of segmented regions assigned to all moving objects in one frame in the ground truth data
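These per-frame measures can be sketched directly over sets of region ids; representing regions by ids is an illustrative assumption:

```python
def precision_recall(detected, ground_truth):
    """Per-frame precision/recall over colour-segmented region ids.

    detected: set of region ids the algorithm assigned to moving objects
    in frame t; ground_truth: set of region ids labelled manually.
    n_t = |detected & ground_truth|, N_d = |detected|, N_G = |ground_truth|.
    """
    n_t = len(detected & ground_truth)
    precision = n_t / len(detected) if detected else 0.0
    recall = n_t / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

For example, four detected regions of which three are in a five-region ground truth give p = 0.75 and r = 0.6.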

In order to evaluate the specific challenges in moving object segmentation, we have selected sports videos (skiing and car race) with dynamic scenes, multiple fast moving objects and occlusion.

The precision/recall values shown in Figs. 8 and 9 indicate good segmentation results for our mesh-based algorithm. The outliers (poor moving object segmentation) are due to high motion of video objects and therefore poor feature tracking results. The precision and recall rates are high and similar (mean values of about 0.85) for both videos. High precision values mean that nearly all found regions are correctly assigned to (i.e. are part of) the real moving object. The lower recall values illustrate that a number of regions given in the ground truth are not segmented by the algorithm. The algorithm was designed to extract moving regions which are assured parts of the moving object; the drawback is that fewer segmented regions are obtained. In the Formula-1 video more regions are found, since the motion vectors can be calculated better on the rigid objects (cars); in the ski race fewer regions are found, due to the large set of different (non-rigid) motions combined in one moving object.

[Fig. 8: precision/recall rate vs. frame number, frames 0–120]

  • 7/29/2019 Framework for Unsupervised

    17/23


    Fig. 9 Precision and recall values for 120 frames of the ski-race video. Average number of MO per frame is 1.2



The recall and precision values are similar to the results of the algorithm described in [28]. Generally, the algorithm has problems if the tracker does not find enough stable feature points in relation to the number of segmented regions. This can happen if the object is too far away from the camera, the object does not have enough corners, or there is too much motion blur in the image.

In the following figures, exemplary segmentation results are shown.

    In Fig. 11 incorrect examples of moving object segmentation are shown. The MOS

    results are false due to the incorrect assignment of tracking points to the object.

    The analysis was done on an Intel Duo Processor (2.4 GHz, 2 MB L2 Cache, 800 MHz

    FSB) and 2 GB, 667 MHz DDR2 SDRAM. The average operating time is 320 ms/frame

    with a resolution of 352x288, which is too slow for applications requiring real-time

    processing. However, it is possible to speed up the processing depending on the number of

    key-frames extracted per shot.



    5 Conclusion

In the context of self-configurable event detection, special focus is on unsupervised algorithms that are flexible enough for application in different domains.

In this work we presented a fully unsupervised mesh-based algorithm for moving object segmentation. The proposed system facilitates automatic moving object segmentation, is not restricted to pre-defined settings of the environment, and therefore overcomes the limitation of many existing moving object segmentation tools.

The evaluation highlights that the moving objects extracted by the mesh-based algorithm reach high precision and recall values of 0.85 on average; the algorithm is therefore comparable with other state-of-the-art algorithms.

The results also show that the algorithm depends on the base techniques, namely Mean Shift colour segmentation and KLT point tracking. The colour segmentation should separate regions of the foreground objects from the background. This was not always possible due to varying lighting conditions and similar colours between foreground and background. The point tracker has to generate enough stable points on the foreground objects. Another problem that limits the quality of the moving object segmentation is the fact that the foreground objects have fewer tracking points and are usually smaller than the background.

Future work may restrict the application to a specific environment and implement self-adaptation. Further, an improvement of run-time performance is necessary for real-time systems such as online event detection.

Generally, the results encourage further development and application of the proposed system. Reasonable applications are semantic video indexing, content-based video retrieval (e.g. search for similar moving objects), and video compression algorithms (e.g. the MPEG-4 format, which contains a description of moving objects). This work has also proposed a compact and efficient representation of the content and moving objects using MPEG-7, including database-based indexing for retrieval of moving objects in large-scale video repositories.

Acknowledgements The authors would like to thank Werner Haas, Werner Bailer and Peter Schallauer as well as several other colleagues at JOANNEUM RESEARCH, who provided valuable feedback. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 216465 (ICT project SCOVIS).

    References

1. Antonini G, Martinez SV, Bierlaire M, Thiran JP (2006) Behavioral priors for detection and tracking of pedestrians in video sequences. Int J Comput Vis 69(2):159–180
2. Bailer W, Schallauer P (2006) Detailed audiovisual profile: enabling interoperability between MPEG-7 based systems. International Conference on Multi Media Modelling
3. Bailer W, Schallauer P, Bergur Haraldsson H, Rehatschek H (2005) Optimized mean shift algorithm for color segmentation in image sequences. Image and Video Communications and Processing, pp 522–529
6. Borshukov GD, Bozdagi G, Altunbasak Y, Tekalp AM (1997) Motion segmentation by multistage affine classification. IEEE Trans Image Process 6:1591–1594
7. Celasun I, Tekalp AM, Gökçetekin MH, Harmancı DM (2001) 2-D mesh-based video object segmentation and tracking with occlusion resolution. Signal Process Image Commun 16(10)
8. Comaniciu D (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell
9. Comaniciu D, Meer P (1997) Robust analysis of feature spaces: colour image segmentation. Department of Electrical and Computer Engineering
10. Computer Vision Research Group, Department of Computer Science. http://www.cs.otago.ac.nz/research/vision, http://of-eval.sourceforge.net/, 1999
11. Davis JC (2002) Statistics and data analysis in geology, 3rd edn. Wiley
12. Donoser M (2003) Object segmentation in film and video. Diploma thesis, TU Graz
13. Duda RO, Hart PE, Stork DG (2001) Pattern classification. Wiley
14. Erdem CE, Sankur B (2000) Performance evaluation metrics for object-based video segmentation. Proceedings of the 10th European Signal Processing Conference (EUSIPCO 00), pp 917–920, Tampere, Finland
15. Galić S, Lončarić S (2000) Spatio-temporal image segmentation using optical flow and clustering algorithm. Proceedings of the First International Workshop on Image and Signal Processing and Analysis
16. Guo J, Kim J, Jay Kuo C-C (1999) New video object segmentation technique with color/motion information and boundary postprocessing. Applied Intelligence Journal
17. Heidrich W, Seidel H-P (1999) Realistic, hardware-accelerated shading and lighting. Proceedings of SIGGRAPH 99
18. Horn BKP, Schunck BG (1980) Determining optical flow. Massachusetts Institute of Technology
19. Kriechbaum A (2005) Segmentation of moving objects in film and video. Master thesis
20. Lepetit V, Fua P (2005) Monocular model-based 3D tracking of rigid objects: a survey. Foundations and Trends in Computer Graphics and Vision 1(1):1–89
21. Lienhart R (2001) Reliable transition detection in videos: a survey and practitioner's guide. International Journal of Image and Graphics (IJIG) 1(3):469–486
22. Liu L, Fan G (2005) Combined key-frame extraction and object-based video segmentation. IEEE Trans Circuits Syst Video Technol
23. Lucas BD, Kanade T (1981) An iterative image registration technique with an application to stereo vision. International Joint Conference on Artificial Intelligence, pp 674–679
24. Martinez JM (2002) MPEG-7 overview. International Organisation for Standardisation
25. Oh J, Lee J, Vemuri E (2003) An efficient technique for segmentation of key object(s) from video shots. ITCC 03: Proceedings of the International Conference on Information Technology: Computers and Communications
26. Rehatschek H, Schallauer P, Bailer W, Haas W, Wertner A (2004) An innovative system for formulating complex combined content-based and keyword-based queries. Proceedings of SPIE-IS&T Electronic Imaging, vol 5304, pp 160–169
27. Tsechpenakis G, Rapantzikos K, Tsapatsoulis N, Kollias S (2003) Object tracking in clutter and partial occlusion through rule-driven utilization of snakes. IEEE International Conference on Multimedia & Expo (ICME)
28. Wei Z, Jun D, Wen G, Qingming H (2005) Robust moving object segmentation on H.264/AVC compressed video using the block-based MRF model. Real-Time Imaging
29. Xu N, Ahuja N, Bansal R (2003) Object segmentation using graph cuts based active contours. CVPR 03, pp 46–53
30. Zhang D, Lu G (2001) Segmentation of moving objects in image sequence: a review. Circuits Syst Signal Process 20(2):143–183


Andreas Kriechbaum finished his study of Telematics at the University of Technology in Graz in July 2007 with the master thesis "Moving Object Segmentation in Video and Film". This work was performed at the Institute of Information Systems at JOANNEUM RESEARCH, where he has worked since 2001. He is involved in a number of national and European research projects in the area of interactive TV and surveillance. His areas of interest and experience are content-based analysis and retrieval of audiovisual information, and their application in the domains of audiovisual archives, video annotation and surveillance.

Roland Mörzinger finished his study "Software Engineering für Medizin" at the Hagenberg University of Applied Sciences in July 2005 with the diploma thesis "Detection of Grain and Noise for Regraining in Film and Video". Since then he has been working as a research associate for the JOANNEUM RESEARCH Institute of Information Systems, where he is involved in international R&D projects. His research interests include computer vision and multimedia retrieval with a focus on film restoration, machine learning, and image and video classification.


Georg Thallinger received an MSc in Telematics from Graz University of Technology, Austria, in 1992. Georg joined the Institute of Information Systems at JOANNEUM RESEARCH right after university as a research engineer in the domain of scientific visualization. Since 2002 he has been a co-leader of the Digital Media group at the institute and, as such, coordinates large international projects. His areas of interest and experience are content-based analysis and retrieval of audiovisual information, and their application in the domains of audiovisual archives, film restoration, and surveillance.


Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

