
Stopped Object Detection by Learning Foreground Model in Videos

    Lucia Maddalena, Member, IEEE, and Alfredo Petrosino, Senior Member, IEEE

Abstract— The automatic detection of objects that are abandoned or removed in a video scene is an interesting area of computer vision, with key applications in video surveillance. Forgotten or stolen luggage in train and airport stations and irregularly parked vehicles are examples that concern significant issues, such as the fight against terrorism and crime, and public safety. Both issues involve the basic task of detecting static regions in the scene. We address this problem by introducing a model-based framework to segment static foreground objects against moving foreground objects in single view sequences taken from stationary cameras. An image sequence model, obtained by learning in a self-organizing neural network image sequence variations, seen as trajectories of pixels in time, is adopted within the model-based framework. Experimental results on real video sequences and comparisons with existing approaches show the accuracy of the proposed stopped object detection approach.

Index Terms— Artificial neural network, image sequence modeling, stopped foreground detection, video surveillance.

    I. INTRODUCTION

ABANDONED and removed object detection in image sequences is a task common to many video surveillance applications, such as the detection of irregularly parked vehicles [1] and the detection of unattended luggage in public areas [1]–[3]. Many definitions of the problem have been given; the most widely accepted definition can be stated as in [4], where an abandoned object is defined as "a stationary object that has not been in the scene before," and a removed object is defined as "a stationary object that has been in the scene before, but is not there anymore." Both the problems have the common basic task of detecting stationary regions in the scene, that is, "changes of the scene that stay in the same position for relatively long time." As an example, a moving object that stops for a while in the scene could be an abandoned object; if the object later starts again, it could be a removed object. In both cases, the basic task is to detect the object as stationary in order to later classify it as either abandoned or removed. This basic task, hereafter referred to as stopped object detection (SOD), is the objective of the present research.

Manuscript received April 10, 2012; revised January 15, 2013; accepted January 16, 2013. Date of publication February 8, 2013; date of current version March 8, 2013.

L. Maddalena is with the National Research Council of Italy, Institute for High-Performance Computing and Networking, Naples 80131, Italy (e-mail: [email protected]).

A. Petrosino is with the Department of Applied Science, University of Naples Parthenope, Naples 80143, Italy (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TNNLS.2013.2242092

Depending on the video surveillance application to be solved, SOD is followed by higher level scene analysis tasks, such as discrimination between abandoned and removed objects, as well as between humans and non-humans.

The typical SOD process is articulated into two steps: 1) detection of foreground objects not belonging to the scene background, usually termed moving object detection (MOD) [5]–[8]; and 2) detection, among moving objects, of those objects that are static. Accordingly, key challenges for SOD include the following.

1) Well-known issues in MOD (see [9]), such as gradual or abrupt light changes, moving background, and cast shadows (basic challenges).

2) Further specific challenges in complex real-world scenes arising from the need to determine when stopped objects resume motion (restart challenge) or are occluded by other objects or by people passing in front of them (occlusion challenge).

A. Related Work

Several single view approaches have been proposed to tackle SOD key challenges, and they can be broadly grouped in tracking-based and background subtraction-based approaches.

In tracking-based approaches, SOD is obtained on the basis of the analysis of object trajectories through an application-dependent event detection phase [10]–[16]. While these methods allow us to directly extract trajectory-based semantics (e.g., who left the item or how long the item has been left before the person moved away), they are often unreliable in complex surveillance videos, due to occlusions and light changes, as also pointed out in [4], [17], and [18].

Background subtraction-based approaches rely on background modeling and foreground analysis for SOD. They have generally proven to achieve superior robustness to key challenges in complex real-world scenarios [4], [17]–[24] and can be used as a preprocessing stage to improve tracking-based video analysis (e.g., [25]). A further sub-classification can be obtained depending on the adoption of pixel-level or region-level analysis [17]. Pixel-based methods assume that the time series observations are independent at each pixel, while region-based methods take advantage of inter-pixel relations: the images are segmented into regions, or the low-level classification obtained at the pixel level is refined. While pixel-wise background model adaptation is very useful for handling multimodal backgrounds (e.g., waving trees), it lacks higher level information about the object shape [4].



This does not greatly influence the restart challenge, but it could have an influence over the occlusion challenge. Indeed, a region-level analysis can play a major role in identifying false foreground and false stationary objects, leading to higher robustness to occlusion [4], [17], at the price of an increased overall complexity.

One of the earliest approaches to SOD based on background subtraction is reported in [19] and [21]. Two processes are distinguished: pixel analysis, which determines whether a pixel is stationary, transient, or background by observing its intensity profile over time, and region analysis, which deals with the agglomeration of groups of pixels into moving and stopped regions. Stationary regions are added as a layer over the background, through a layer management process used to help solve occlusion and restart challenges.

In [22], a SOD method based on double-background subtraction is reported that is robust to illumination changes and noise. The long-term background represents a clean scene; all objects located in this background are considered not as static objects, but as part of the scene. The short-term background shows static objects abandoned in the scene; this image is made up of the last background image, updated with pieces of the current image. Static foreground objects are detected by thresholding the difference between the short-term and the long-term backgrounds, while dynamic foreground objects are detected by thresholding the difference between the current image and the long-term background.

Dynamic background changes, such as parked cars, left packages, or displaced chairs, are explicitly handled in [23], highlighting the importance of avoiding the mere absorption of these changes into the background model, in the light of intelligent visual surveillance. Dynamic background changes are inserted into short-term background layers, and, in order to update a long-term background model, the layer modeling technique is embedded into a codebook-based background subtraction algorithm that is robust to local and global illumination changes.

Other approaches are based on pixel layer-based foreground detection [24], where temporally persistent objects are introduced. To detect persistent foreground objects, the authors use a color persistence criterion, continually monitoring the color histogram of the foreground object and using correlation to decide if the color is persistent. After a user-defined time threshold, the persistent foreground object is converted to a new background layer. To reconvert persistent objects back to foreground layers when they become interesting again (restart challenge), the authors use higher level features (i.e., globally modeling the region as a whole, based on the use of region-level features).

A pixel-wise method that employs dual foregrounds to extract temporally static image regions is reported in [18]. The authors construct separate long- and short-term backgrounds, modeled as pixel-wise multivariate Gaussian models [26], whose parameters are adapted online using a Bayesian update mechanism imposed at different learning rates. By comparing each frame with these models, they estimate two foreground masks: the long-term foreground mask shows color variations in the scene, as well as moving cast shadows and illumination changes; the short-term foreground mask contains the moving objects, noise, etc. An evidence score at each pixel is inferred by applying a set of hypotheses on the foreground masks, and then aggregated in time to provide temporal consistency.

A complete system for the detection of abandoned and removed objects is presented in [4], where the mixture of Gaussians method [26], suitably adapted, is employed to detect both the moving and the static objects in the scene. Several improvements allow the system to properly handle shadows and light changes, and higher level modules enable the discrimination of abandoned and removed objects, also under occlusion.

In [17], region-level analysis is exploited both for background maintenance, where region-level information is fed back to adaptively control the learning rate, and for SOD, where it helps validating candidate stationary objects. The resulting method is robust against illumination changes and occlusions.

A system for the detection of static objects based on a dual background model that classifies pixels by means of a finite-state machine is presented in [20]. The state machine provides the means for interpreting the results obtained from background subtraction.

It should be observed that other methods exist that do not fall in the above classification. Examples include the approach in [27], which addresses the occlusion problem by using occlusion reasoning in a multiview setting, and the method in [28] for parked vehicle detection, based on the analysis of spatio-temporal maps of static image corners, which neither relies on background subtraction nor performs object tracking.

B. Overview of Our Approach

We approach SOD with a background subtraction-based method that relies on modeling not only the background, but also the stopped foreground. Two main novelties are introduced as compared with previous approaches.

The first contribution resides in the proposition of a general framework for SOD, which we named the stopped foreground subtraction (SFS) algorithm. The basic idea consists of maintaining an up-to-date model of the stopped foreground and discriminating moving objects as those that deviate from this model, using a mechanism that is robust to occlusion and restart challenges. The proposed SFS algorithm is quite general: indeed, it is independent from the model chosen for the scene background and foreground, and therefore it can be used in conjunction with whichever model is preferred.

Another main contribution concerns a 3-D neural model for image sequences that automatically adapts to scene changes in a self-organizing manner. Neural network-based solutions to MOD have already been considered, due to the fact that these methods are usually more effective and efficient than traditional ones [29]–[36]. The proposed 3-D neural network behaves as a competitive neural network implementing a winner-take-all function with an associated mechanism that modifies the local synaptic plasticity of the neurons, allowing learning to be spatially restricted to the local neighborhood of the most active neurons.


Therefore, the neural image sequence model adapts well to scene changes and can capture the most persistent features of the image sequence, making it suitable to model both the scene background and foreground. The 3-D neural model differs from the 2-D model proposed in [33] in terms of layered network structure and of inter- and intra-layer weight updates. The rationale is to produce a 3-D topographic map that is more consistent with the image sequence, where every layer is spatially consistent with the pixel locations (i.e., each neuron corresponds to a single pixel) and their image neighborhoods, while different neurons at different layers (corresponding to the same pixel location) are more or less responsive to the change detected at that pixel.

Overall, the reported research extends our previous research [37], now including a complete description of the SFS algorithm, a constructive description of the needed background and foreground models, a detailed description of the neural image sequence model, as well as extensive and reproducible experimental results.

This paper is organized as follows. In Section II, we propose a model-based framework for SOD that is independent from the chosen model. In Section III, we describe the adopted self-organizing model for image sequences, and describe how this model is used for background and foreground modeling. Section IV presents results of moving and stopped object segmentation obtained by adopting the self-organizing model for the model-based framework, while Section V includes concluding remarks.

    II. MODEL-BASED FRAMEWORK FOR SOD

In this section, we propose a model-based approach to the classification of foreground objects into stopped and moving objects. The basic idea consists of keeping a model of foreground objects and classifying as stopped objects those whose model holds the same features for several consecutive frames; remaining foreground objects are consequently classified as moving objects.

Specifically, in the following sections we will describe in a constructive way three models related to an image sequence {I_t}: B_t for the background, F_t for the moving foreground, and S_t for the stopped foreground. Such models allow us to achieve a segmentation of sequence frame I_t at time t, classifying each pixel of I_t as either background, moving, or stopped, thus providing a solution to the considered problem.

A. Background Model: Assumptions

In order to highlight the general applicability of the proposed approach, independently from the specific model adopted for the background, for the moving foreground, and for the stopped foreground, in this section we assume we are somehow able to do background subtraction in order to detect, for each sequence frame I_t, image elements that do not belong to the scene background. Specifically:

Assumption 1: Given an image sequence {I_t}, we suppose to have devised:

1) a background modeling technique, allowing to initialize and update a model B_t of background appearance at time t, which gives a statistical description of the scene background for sequence {I_t};

2) a foreground detection technique, allowing to discriminate whether an incoming image pixel x of current sequence frame I_t is modeled by the background model.

The following notation will be adopted:
1) the case of x being a background pixel, modeled by B_t, will be denoted as x ∈ B_t;
2) the case of x being a foreground pixel, not modeled by B_t, will be denoted as x ∉ B_t;
3) the initialization of the background model B_t, for t = 0, is achieved, for every image pixel x of the sequence frame I_0, through the procedure insert(B_0, x), which returns the initial background model B_0;
4) the update of the background model B_{t-1}, for t > 0, is achieved, for every image pixel x of the sequence frame I_t, through the procedure update(B_t, B_{t-1}, x), which returns the updated background model B_t.

Both the insert and the update procedures depend upon the choice of the model representation; therefore, in this section they are left unspecified.
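Although the paper leaves these procedures abstract, their role can be made concrete with a minimal sketch. The following Python interface is purely illustrative (class and method names are ours, not the paper's); any concrete per-pixel model, neural or otherwise, that provides these procedures can be plugged into the framework.

```python
from abc import ABC, abstractmethod

class PixelModel(ABC):
    """Illustrative interface for the per-pixel models B_t, F_t, S_t.

    Any concrete model (e.g., a self-organizing neural model or a
    mixture of Gaussians) fits the framework as long as it provides
    these three procedures plus a membership test.
    """

    @abstractmethod
    def insert(self, x, value):
        """Initialize the model for pixel x from the observed value."""

    @abstractmethod
    def update(self, x, value):
        """Adapt the model for pixel x to the new observed value."""

    @abstractmethod
    def delete(self, x):
        """Erase the model for pixel x, so that x is no longer modeled."""

    @abstractmethod
    def models(self, x, value) -> bool:
        """Return True if the value observed at x is explained by the model."""
```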

B. Moving Foreground Model: Initial Construction

For all foreground pixels, we construct a model F_t of the sequence foreground, in order to classify as stopped those foreground pixels that hold the same features for several consecutive frames, i.e., those that are modeled by F_t for several consecutive frames. The sequence foreground model F_t is iteratively constructed for each pixel x of the current sequence frame I_t as follows.

1) F_t is initialized as soon as x is detected as a foreground pixel: if x ∈ B_{t-1} and x ∉ B_t, then insert(F_t, x).

2) F_t is updated if it continues to hold the same features held in past frames: if x ∈ F_{t-1}, then update(F_t, F_{t-1}, x).

3) F_t is erased if x is a background pixel: if x ∈ F_{t-1} and x ∈ B_t, then delete(F_t, F_{t-1}, x).

Contrary to background modeling, foreground modeling allows us to model pixels whose behavior is usually only temporary. This implies the above described further issue of erasing the model whenever it is no longer representative; this is achieved through the procedure delete, which returns the model F_t updated in such a way that x ∉ F_t.

C. Moving Foreground Model: Counting of Consecutive Occurrences

The availability of the constructed foreground model F_t allows us to discriminate, among foreground pixels, those that are moving and those that are stationary. Specifically:

Definition 1: Given a foreground model F_t for the image sequence {I_t}, an image pixel x of sequence frame I_t at time t is said to be a stopped pixel if it has been modeled by F_t for at least τ consecutive frames: x ∈ F_i, i = t − τ, …, t.

Definition 2: The minimum number τ of consecutive frames after which a foreground pixel assuming constant features is classified as stopped is called the stationary threshold.


The value of τ is chosen depending on the desired responsiveness of the system, and is application-dependent.

In order to classify stopped pixels, for each pixel x of sequence frame I_t, we compute a function C_t of pixel feature consecutive occurrences in the foreground model F_t as follows:

    C_t(x) = \begin{cases}
      \max(0,\, C_{t-1}(x) - k_A), & \text{if } x \in B_t \\
      \min(\tau,\, C_{t-1}(x) + k_B), & \text{if } x \notin B_t \wedge x \in F_t \\
      \max(1,\, C_{t-1}(x) - k_C), & \text{if } x \notin B_t \wedge x \notin F_t
    \end{cases}  \qquad (1)

where ∧ indicates the logical AND operation, and C_0(x) = 0. Basically, the count value is increased if x is modeled by the foreground model F_t, and decreased otherwise. The three different cases of (1) can be explained as follows.

If, at time t, x is a background pixel [first case of (1)], then, at previous time t − 1, it was in one of three different possible states: 1) it was a background pixel; 2) it was a foreground pixel belonging to a moving object that continues to move at time t; or 3) it was a foreground pixel belonging to a stopped object that starts moving again at time t. In each case, the counting of consecutive occurrences of x in F_t should be re-initialized, by setting C_t(x) = 0. Instead of just zeroing C_t(x), decreasing its value by the decay value k_A in (1) allows us to control how fast the system should recognize that a stopped pixel, as in case 3, has moved again. To set the alarm flag off immediately after the removal of the stopped pixel, the decay value should be large, eventually equal to τ.

If x is modeled by the foreground model F_t [second case of (1)], then it holds the same features held in the previous frame, and therefore C_t(x) is incremented. This case includes situations where in past frames x was a moving pixel belonging to an object that has not moved since then; therefore, the growth factor k_B in (1) determines how fast the system should recognize that a moving pixel is going to stop.

Finally, if pixel x is not modeled by any of the background or moving foreground models [third case of (1)], then it must be a new moving foreground pixel, and therefore C_t(x) should be set to 1. Instead, we just decrease it by the decay factor k_C in (1), in order to enhance robustness to false negatives in the background or in the moving foreground.
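As an illustration, (1) can be transcribed almost literally into code. The following Python sketch uses our own names; `in_B` and `in_F` stand for the membership tests x ∈ B_t and x ∈ F_t.

```python
def update_count(c_prev: int, in_B: bool, in_F: bool,
                 tau: int, k_A: int = 1, k_B: int = 1, k_C: int = 1) -> int:
    """Update of the consecutive-occurrence count C_t(x) as in (1)."""
    if in_B:                          # background pixel: decay toward 0
        return max(0, c_prev - k_A)
    if in_F:                          # modeled by the moving foreground: grow
        return min(tau, c_prev + k_B)
    # new moving foreground pixel: decay toward 1
    return max(1, c_prev - k_C)
```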

D. Stopped Foreground Model: Construction

According to Definition 1, a pixel x in the sequence frame I_t is classified as stopped if C_t(x), as defined in (1), reaches the stationary threshold value τ. In order to retain the memory of objects that have stopped and to distinguish them from moving objects eventually passing in front of them in subsequent frames, pixels x for which C_t(x) = τ are moved from the moving foreground model F_t to a new stopped foreground model S_t.

Given a pixel x of current sequence frame I_t, the stopped foreground model S_t is constructed and updated according to:

1) S_t is initialized as soon as x has been modeled by F_t for at least τ consecutive frames: if x ∈ F_t and C_t(x) = τ, then insert(S_t, x);

Fig. 1. Classification at time t of the state of a pixel x of I_t as either background (BG), moving foreground (MOV), or stopped foreground (ST).

2) S_t is updated if it continues to hold the same features held in past frames: if x ∈ S_{t-1}, then update(S_t, S_{t-1}, x);

3) S_t is erased if x is a background pixel for at least k_A frames: if x ∈ S_{t-1}, x ∈ B_t, and C_t(x) = 0, then delete(S_t, S_{t-1}, x);

where k_A and C_t(x) are those of (1).

As previously mentioned, as soon as the stopped foreground model S_t is initialized for pixel x of the current sequence frame I_t, the corresponding foreground model F_t is re-initialized [delete(F_t, F_t, x)], together with the corresponding counting function value C_t, by setting C_t(x) = 0. This refinement allows us to adopt the erased model F_t for new moving foreground pixels that in subsequent frames eventually pass in front of the stopped pixels.

E. Classification of a Pixel

A simplified flow-chart describing how the state of an incoming pixel x in sequence frame I_t is classified, at time t, as either background (BG), moving (MOV), or stopped (ST), is reported in Fig. 1. Here, the input background model B_t, the moving foreground model F_{t-1}, and the stopped foreground model S_{t-1} are obtained as described in Sections II-A to II-D, respectively. Moreover, the input function C_{t-1} is computed as in (1), where we have assumed for simplicity k_A = τ, k_B = 1, and k_C = τ. While the flow-chart concerns only the update of the counting function C_t and the initialization of the two foreground models F_t and S_t, a complete and constructive description is given in Algorithm 1.
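The following sketch renders our reading of the simplified flow-chart (with k_A = τ, k_B = 1, k_C = τ) in Python, using the illustrative PixelModel interface from Section II-A; it is not the full Algorithm 1, and the order of the model tests is our assumption.

```python
def classify_pixel(x, value, B, F, S, C, tau):
    """Classify pixel x at time t as 'BG', 'MOV', or 'ST' (cf. Fig. 1).

    B, F, S follow the PixelModel interface sketched earlier; C is a
    dict mapping each pixel to its count C_{t-1}(x) (missing keys = 0).
    """
    if B.models(x, value):                 # background pixel
        C[x] = 0                           # k_A = tau: decay to 0 at once
        return 'BG'
    if S.models(x, value):                 # pixel of an already-stopped object
        S.update(x, value)
        return 'ST'
    if F.models(x, value):                 # same features as in past frames
        C[x] = min(tau, C.get(x, 0) + 1)   # k_B = 1
        F.update(x, value)
        if C[x] == tau:                    # stationary threshold reached:
            S.insert(x, value)             # hand the pixel over from F to S
            F.delete(x)
            C[x] = 0
            return 'ST'
        return 'MOV'
    F.insert(x, value)                     # new moving foreground pixel
    C[x] = 1                               # k_C = tau: reset to 1
    return 'MOV'
```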

F. Stopped Foreground Model: Layering

When several overlapping stopped objects are present in the scene, we cannot consider a single stopped foreground model S_t if we want to consider all these objects separately and keep memory of all of them.


    Algorithm 1 Stopped Foreground Subtraction (SFS)

To this end, we introduce L different stopped foreground layers S_t^1, …, S_t^L, each of which contains stopped foreground objects that do not overlap with stopped foreground objects of the other layers, i.e., that do not occupy the same image spatial region. The layer S_t^1 contains the model for the first detected stopped foreground pixels; subsequently detected stopped foreground pixels that do not belong to this layer, but overlap with it, will be included in the next layer S_t^2, and so on.

G. SFS Algorithm

The stopped foreground subtraction algorithm, in the following referred to as the SFS algorithm, for an incoming pixel x in sequence frame I_t, t ∈ [0, T], adopting the stopped layering mechanism with L layers described in Section II-F, is detailed as Algorithm 1.

Most of the algorithm is a clear consequence of the procedures for constructing the moving foreground model F_t and the stopped foreground model S_t, and of (1) for computing the function C_t of pixel feature occurrences in the foreground model, as described in Sections II-B to II-D, respectively. Instead, the layering mechanism for stopped foreground models S_t^1, …, S_t^L introduced in Section II-F needs further clarification.

1) When a pixel is detected as background for at least k_A consecutive frames, all the stopped foreground layers are erased, because it is assumed that there are no more stopped objects containing this pixel (lines 4–5).

2) When a pixel is detected as an old stationary pixel belonging to the stopped layer S_t^l, then this layer is updated (line 9); in this case, all other successive stopped layers S_t^{l+1}, …, S_t^L must be erased, because it means that they have been removed from the scene (line 10).

3) Nothing should be done in any stopped layer in the case the pixel x is an old or new moving foreground pixel (lines 11–16). Indeed, the moving object it belongs to could be passing in front of stopped objects, and we want to keep the stopped objects model as is; this ensures the SFS algorithm's robustness against stopped object occlusion.

4) Finally, when a pixel is detected as a new stopped pixel, then it must be inserted into one of the stopped foreground layers (lines 20–21). Choice of the suitable layer can be done on a per-pixel basis (e.g., insert it into the first empty stopped layer, as is done in the SFS algorithm) or on a per-region basis (e.g., insert it into the empty stopped layer that already contains adjacent stopped pixels); a per-pixel sketch is given below.
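As a rough per-pixel sketch of these layering rules (our naming; `is_empty_at` is a hypothetical helper telling whether a layer holds no model at x, and the classification into the four states is assumed to have been done beforehand):

```python
def update_stopped_layers(x, value, layers, state):
    """Per-pixel handling of the stopped layers S^1..S^L of the SFS
    algorithm (our reading of lines 4-21 of Algorithm 1).

    `layers` is a list of stopped-layer models following the PixelModel
    interface; `state` summarizes the classification of x at time t.
    """
    if state == 'BG_LONG':                 # background for >= k_A frames:
        for S in layers:                   # no stopped object left at x
            S.delete(x)
    elif state == 'OLD_ST':                # matches some stopped layer l
        l = next(i for i, S in enumerate(layers) if S.models(x, value))
        layers[l].update(x, value)
        for S in layers[l + 1:]:           # deeper layers have been
            S.delete(x)                    # removed from the scene
    elif state == 'MOV':                   # moving pixel, possibly passing
        pass                               # in front of stopped objects
    elif state == 'NEW_ST':                # count just reached tau:
        S = next(S for S in layers if S.is_empty_at(x))
        S.insert(x, value)                 # first empty layer (per-pixel)
```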

H. Considerations on the Proposed Framework

The modeling of the foreground and the stopped foreground layering strategy allow us to keep an up-to-date representation of all stopped pixels and an initial classification of scene objects, which can be used to support subsequent tracking and classification phases. Moreover, stopped foreground modeling allows us to distinguish moving and stopped foreground objects when they appear overlapped (occlusion challenge). Indeed, in this case, the availability of an up-to-date model of all stopped objects allows us to discern whether pixels that are not modeled as background belong to one of the stopped objects or to moving objects. Finally, separate background and stopped foreground modeling allow us to determine when stopped objects resume their motion (restart challenge).

The described stopped foreground subtraction algorithm is completely independent from the model adopted for the background, for the moving foreground, and for the stopped foreground layers. Once a specific model has been chosen, SOD can be accomplished through the described SFS algorithm, provided that suitable procedures for model initialization, update, and deletion are specified.

III. NEURAL MODEL FOR SOD

In this section, we give a description of a self-organizing model for image sequences and describe how the model is adopted for both background and foreground modeling.

A. Image Sequence Modeling

Relying on recent research in this area [33], [38], the idea is to build the image sequence model by learning in a self-organizing manner image sequence variations, seen as trajectories of pixels in time.



Fig. 2. (a) Example of sequence frame I_t (each circle represents a pixel, connected with its neighbors), (b) corresponding neural map M_t with n = 5 layers, and (c) inter-layer and intra-layer weight decaying during the update procedure for M_t.

A neural network mapping method is proposed to use a whole trajectory, incrementally in time, fed as an input to the network. Each neuron computes a function of the weighted linear combination of incoming inputs, and therefore can be represented by a weight vector, obtained by collecting the weights related to incoming links. An incoming pattern is mapped to the neuron whose set of weight vectors is most similar to the pattern, and weight vectors in a neighborhood of this node are updated. Unlike [33] and [38], the obtained self-organizing neural network is organized as a 3-D grid of neurons, producing a representation of training samples with lower dimensionality, at the same time preserving topological neighborhood relations of the input patterns.

1) Neural Model Representation: Given an image sequence {I_t}, for each pixel x in the image domain D, we build a neural map consisting of n weight vectors m_t^i(x), i = 1, …, n, which will be called a model for pixel x. If every sequence frame has N rows and P columns, the complete set of models M_t(x) = (m_t^1(x), …, m_t^n(x)) for all pixels x of the t-th sequence frame I_t is organized as a 3-D neural map M_t with N rows, P columns, and n layers. An example of this neural map is given in Fig. 2, where for each pixel x, identified by one of the colored circles in the sequence frame represented in Fig. 2(a), we have a model M_t(x) = (m_t^1(x), …, m_t^n(x)) that is identified by identically colored circles in the model layers shown in Fig. 2(b).

Put in other terms, the neural model M_t consists of n images L_t^i, i = 1, …, n, of the same size as image I_t, which we call layers. Each layer L_t^i contains, for each pixel x, the i-th weight vector m_t^i(x):

    L_t^i = \{ m_t^i(x),\; x \in D \}, \quad i = 1, \ldots, n.

2) Neural Model Initialization: All weight vectors related to a pixel x are initialized with the pixel brightness value at time 0, that is

    m_0^i(x) = I_0(x), \quad i = 1, \ldots, n. \qquad (2)

The resulting neural map M_0 consists of n layers, each of which is a copy of the first sequence frame I_0. The idea behind this is that the initial guess for the model M_t, for t = 0, is exactly the first frame of the sequence, which can be considered a good initial approximation of the image sequence.
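In code, the initialization (2) amounts to stacking n copies of the first frame. A minimal NumPy sketch, under our naming and assuming color frames of shape (N, P, 3):

```python
import numpy as np

def init_model(I0: np.ndarray, n: int = 5) -> np.ndarray:
    """Initialize the neural map M_0 as n layers, each a copy of I0.

    I0 has shape (N, P, 3); the result has shape (n, N, P, 3), one
    weight vector m^i_0(x) = I0(x) per layer and pixel, as in (2).
    """
    return np.repeat(I0[np.newaxis].astype(np.float64), n, axis=0)
```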

3) Neural Model Update: Subsequent learning of the neural map allows us to adapt the image sequence model to scene modifications. The learning process consists of updating the model by changing the neural weights, according to a visual attention mechanism of reinforcement. Specifically, temporally subsequent samples are fed to the network. At time t, the value I_t(x) of each incoming pixel x of the t-th sequence frame I_t is compared to the current pixel model M_t(x) = (m_t^1(x), …, m_t^n(x)), to determine the weight vector m_t^b(x) that best matches it:

    d\big(m_t^b(x), I_t(x)\big) = \min_{i=1,\ldots,n} d\big(m_t^i(x), I_t(x)\big) \qquad (3)

where the metric d(·, ·) is suitably chosen according to the specific color space being considered. Example metrics could be the Euclidean distance in RGB color space, or the Euclidean distance of vectors in the HSV color hexcone, as suggested in [39]. The latter is the one adopted for the experiments reported in Section IV. Indeed, the HSV color space allows us to specify colors in a way that is close to the human experience of colors, relying on the hue, saturation, and value properties of each color. Moreover, hue stability against illumination changes is known to be important for both cast shadow suppression [40] and motion analysis [41], [42].
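A vectorized sketch of the best-match search (3) follows; for simplicity it uses the plain Euclidean distance in RGB rather than the HSV hexcone metric of [39] adopted in the paper.

```python
import numpy as np

def best_match(M: np.ndarray, I: np.ndarray):
    """Per-pixel best layer index b and distance d(m^b_t(x), I_t(x)).

    M: model of shape (n, N, P, 3); I: frame of shape (N, P, 3).
    Uses Euclidean distance in RGB; the paper adopts an HSV-based metric.
    """
    d = np.linalg.norm(M - I[np.newaxis], axis=-1)   # shape (n, N, P)
    b = d.argmin(axis=0)                             # best layer per pixel
    return b, d.min(axis=0)
```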


The best matching weight vector m_t^b(x), computed according to (3), belonging to layer b of model M_t, is used as the pixel encoding approximation. Therefore, the best matching weight vector m_t^b(x) and its neighboring weight vectors of the b-th layer of model M_t are updated according to a weighted running average:

    m_t^b(y) = (1 - \alpha(x, y))\, m_{t-1}^b(y) + \alpha(x, y)\, I_t(y), \quad \forall y \in N_x. \qquad (4)

Here, N_x = {y : |x − y| ≤ w_{2D}} is a 2-D spatial neighborhood of x of size (2w_{2D} + 1) × (2w_{2D} + 1) including x. Moreover, α(x, y) = γ G_{2D}(y − x), where γ represents the learning rate that depends on scene variability, while G_{2D}(·) = N(·; 0, σ_{2D}² I) is a 2-D Gaussian low-pass filter [43] with zero mean and variance σ_{2D}² I.¹ The α(x, y) values are weights that allow us to smoothly take into account the spatial relationship between the current pixel x and its neighboring pixels y ∈ N_x, thus preserving topological properties of the input (close inputs correspond to close outputs). The 2-D intra-layer update of (4) is schematically shown in Fig. 2. Let us suppose that the current incoming pixel x of the t-th sequence frame I_t is the center black-colored circle in Fig. 2(a) and that the best matching weight vector m_t^b(x), computed according to (3), belongs to layer b = 3 of model M_t, and is indicated by the black-colored circle in the third layer of Fig. 2(b). Then, the weighted running average of (4) in a neighborhood N_x of size 3 × 3 (choosing w_{2D} = 1) involves all black-colored circles shown in the third layer of Fig. 2(c), using Gaussian weights represented by the 2-D Gaussian function over these nodes.

The 2-D update of (4) involves only model weight vectors lying in the same layer b as the best matching weight vector m_t^b(x). In order to further enable the reinforcement of m_t^b(x) in the model for pixel x, also the weight vectors of x belonging to layers close to layer b are updated. Such a further update is achieved by the weighted running average

    m_t^i(x) = (1 - \beta(x))\, m_{t-1}^i(x) + \beta(x)\, I_t(x) \qquad (5)

that involves the weight vectors m_t^i(x) of x such that |i − b| ≤ w_{1D}; that is, it involves the weight vectors that belong to a 1-D inter-layer neighborhood of m_t^b(x) having size 2w_{1D} + 1. In (5), β(x) = η G_{1D}(x), where η is the learning rate and G_{1D}(·) = N(·; 0, σ_{1D}²) is a 1-D Gaussian low-pass filter with zero mean and variance σ_{1D}² in the 1-D inter-layer neighborhood. The 1-D inter-layer update of (5) is schematically shown in Fig. 2(c) with w_{1D} = 2; it involves all black-colored circles in the center circles column, using Gaussian weights represented by the 1-D Gaussian function over these nodes.
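The two updates can be sketched for a single pixel as follows (our vectorization; interior pixels are assumed, so boundary handling is omitted, and the kernels are normalized discrete approximations of G_{2D} and G_{1D}):

```python
import numpy as np

def gaussian_kernel_1d(halfwidth: int, var: float) -> np.ndarray:
    """Samples of a zero-mean Gaussian at integer offsets -w..w."""
    t = np.arange(-halfwidth, halfwidth + 1)
    g = np.exp(-t**2 / (2.0 * var))
    return g / g.sum()

def update_pixel(M, I, x, b, gamma, eta, w2d=1, w1d=1,
                 var2d=0.75, var1d=0.75):
    """Apply the intra-layer (4) and inter-layer (5) updates at pixel x.

    M: (n, N, P, 3) model; I: (N, P, 3) frame; b: best-matching layer
    at x. Pixels near the image border are not handled, for brevity.
    """
    r, c = x
    g2 = gaussian_kernel_1d(w2d, var2d)
    alpha = gamma * np.outer(g2, g2)[..., np.newaxis]  # 2-D weights
    patch = (slice(r - w2d, r + w2d + 1), slice(c - w2d, c + w2d + 1))
    M[b][patch] = (1 - alpha) * M[b][patch] + alpha * I[patch]  # eq. (4)

    g1 = gaussian_kernel_1d(w1d, var1d)
    for k, i in enumerate(range(b - w1d, b + w1d + 1)):
        if 0 <= i < M.shape[0]:
            beta = eta * g1[k]                                  # eq. (5)
            M[i, r, c] = (1 - beta) * M[i, r, c] + beta * I[r, c]
```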

B. Modeling the Background, the Moving Foreground, and the Stopped Foreground

The previously described image sequence model can be readily adapted for modeling the background, the moving foreground, and the stopped foreground for SOD, to be accomplished by the general model-based framework described in Section II.

¹Contrary to the Gaussian mixture model [26], we do not assume any distribution of input data. A Gaussian function is only adopted in order to smooth the contribution of the best matching weight vector to the model update.

Specifically, the background model B_t is initialized as in (2), while updating is performed according to a selective weighted running average, in order to adapt the model to slight scene modifications without introducing the contribution of pixels that do not belong to the background scene. For each incoming pixel x of the t-th sequence frame I_t, (4) and (5) are applied only if the best matching weight vector m_t^b(x) of B_t, computed as in (3), is close enough to the pixel value I_t(x), that is, only if

    d\big(m_t^b(x), I_t(x)\big) = \min_{i=1,\ldots,n} d\big(m_t^i(x), I_t(x)\big) \le \varepsilon \qquad (6)

where ε is a threshold allowing us to distinguish between foreground and background pixels. Otherwise, if no acceptable best matching weight vector exists, then x is detected as a foreground pixel. In the following, we denote this procedure as the 3D_SOBS algorithm [3-D self-organizing background subtraction]. In the usual case that a set of K initial sequence frames is available for training, the above described initialization and update procedures are adopted for training the neural network background model, to be used for detection and adaptation in subsequent sequence frames. Specifically, after the described initialization of B_0, the background model B_t, t = 1, …, K − 1, is updated by selective weighted running average on the remaining K − 1 training sequence frames with the described update procedure. Therefore, what differentiates the training and the adaptation phases is the choice of parameters in (4)–(6).

The threshold ε in (6) is chosen as

    \varepsilon = \begin{cases} \varepsilon_1, & \text{if } 0 < t < K \\ \varepsilon_2, & \text{if } t \ge K \end{cases} \qquad (7)

with ε₁ and ε₂ small constants. The mathematical ground behind the choice of ε₁ and ε₂ is as follows. To obtain a (possibly rough) initial background model that includes several observed pixel intensity variations, the value for ε [ε₁ in (7)] within the initial training sequence frames should be high. On the other side, to obtain a more accurate background model in the detection and adaptation phase, the value for ε [ε₂ in (7)] should be lower within subsequent frames. Therefore, it should be ε₂ ≤ ε₁.

Moreover, the learning rate γ in (4) is chosen as

    \gamma = \begin{cases} \gamma_1 - t\,(\gamma_1 - \gamma_2)/K, & \text{if } 0 < t < K \\ \gamma_2, & \text{if } t \ge K \end{cases} \qquad (8)

where γ₁ and γ₂ are predefined constants such that γ₂ ≤ γ₁. Indeed, in order to ensure neural network convergence during the training phase, the learning factor [γ₁ in (8)] is chosen as a monotonically decreasing function of time t, while, during the subsequent adaptation phase, the learning factor [γ₂ in (8)] is chosen as a constant value that depends on the scene variability. Large values enable the network to learn changes corresponding to the background faster, but also lead to false negatives, that is, inclusion into the background model of pixels belonging to foreground moving objects. On the contrary, lower learning rates make the network slower to adapt to rapid background changes, but make the model


TABLE I
ADOPTED PARAMETER VALUES FOR 3D_SOBS AND SFS ALGORITHMS SPECIFIC FOR EACH IMAGE SEQUENCE

          Dog    Binders  PV-Easy  PV-Medium  PV-Hard  AB-Easy  AB-Medium  AB-Hard
    τ     60     30       1500     1500       1500     2500     2500       2500
    ε₂    0.02   0.02     0.02     0.01       0.005    0.02     0.02       0.02
    γ₂    0.05   0.05     0.05     0.08       0.05     0.05     0.05       0.05
    η₂    0.05   0.05     0.05     0.08       0.05     0.05     0.05       0.05

more tolerant to errors due to false negatives through self-organization. Indeed, weight vectors of false negative pixels are readily smoothed out by the learning process itself.

The same mathematical ground is behind the choice of the learning rate η in (5), whose values can be chosen as

    \eta = \begin{cases} \eta_1 - t\,(\eta_1 - \eta_2)/K, & \text{if } 0 < t < K \\ \eta_2, & \text{if } t \ge K \end{cases} \qquad (9)

where η₁ and η₂ are predefined constants such that η₂ ≤ η₁.

The moving foreground model F_t is initialized as in (2)

using the pixel value I_t(x), as soon as x is detected as not belonging to the background model B_t, and it is updated according to (4) and (5) if x is detected as belonging to the moving foreground model F_t. The latter condition can be checked by thresholding as in (6), where this time m_t^b(x) should be interpreted as the best matching weight vector of F_t for the current pixel x.

Finally, the stopped foreground model S_t is initialized as in (2), as soon as x is detected as belonging to the moving foreground model F_t for at least τ consecutive frames. Here, instead of using the incoming pixel value I_t(x), we can make use of the already established moving foreground model F_t, by just moving weight vectors of F_t into the stopped foreground model. The stopped foreground model S_t is updated according to (4) and (5) if x is detected as belonging to the stopped foreground model S_t. As for the case of the moving foreground model, the latter condition can be checked by the thresholding of (6), where m_t^b(x) is interpreted as the best matching weight vector of S_t for the current pixel x.
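Putting the pieces together, the following is a sketch of the selective background update with the training/adaptation schedules of (6)–(9); the names and parameter defaults are ours (mirroring Section IV-A and Table I), and update_pixel refers to the earlier sketch.

```python
import numpy as np

def threshold(t: int, K: int, eps1: float, eps2: float) -> float:
    """Segmentation threshold of (7): eps1 during training, eps2 after."""
    return eps1 if t < K else eps2

def learning_rate(t: int, K: int, v1: float, v2: float) -> float:
    """Learning rates gamma of (8) and eta of (9): linear decay from v1
    to v2 over the K training frames, then the constant v2."""
    return v1 - t * (v1 - v2) / K if t < K else v2

def update_background_pixel(M, I, x, t, K, eps=(0.1, 0.02),
                            gam=(1.0, 0.05), eta=(1.0, 0.05)):
    """Selective update of the background model at pixel x: apply the
    updates (4)-(5) only if the best match satisfies (6); otherwise
    flag x as foreground."""
    r, c = x
    d = np.linalg.norm(M[:, r, c] - I[r, c], axis=-1)  # distance per layer
    b = int(d.argmin())                                # best layer, as in (3)
    if d[b] <= threshold(t, K, *eps):                  # condition (6)
        update_pixel(M, I, x, b,
                     gamma=learning_rate(t, K, *gam),
                     eta=learning_rate(t, K, *eta))
        return 'BG'
    return 'FG'
```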

IV. EXPERIMENTAL RESULTS

Several experiments have been conducted to validate our approach to SOD and to compare its results with those achieved by other state-of-the-art methods. In the following, the choice of parameter values and qualitative and quantitative results will be described for several publicly available sequences.

A. Parameter Values

For most of the parameters of the 3D_SOBS and SFS algorithms, we could choose values common to all the considered sequences. Specifically, we fixed n = 5 model layers for the background, the moving foreground, and the stopped foreground models; halfwidths w_{2D} = 1 and w_{1D} = 1 of the neighborhoods for the model updates in (4) and (5), respectively; variances σ_{2D}² = 0.75 and σ_{1D}² = 0.75 of the 2-D and 1-D Gaussian low-pass filters specifying the weights for the running averages in (4) and (5), respectively;


Fig. 3. Results of the moving and stopped object segmentation for the Dog sequence. (a) Original frame I_270. (b) MOD mask. (c) Representation of the background model B_270, computed by the 3D_SOBS algorithm. (d) Representation of the moving foreground model F_270 and (e) of the stopped foreground model S_270, computed by the SFS algorithm. (f) Original frame with moving (green) and stopped (red) foreground objects.

the parameters for training the neural background model [including K = 30 training frames, the training segmentation threshold ε₁ = 0.1 in (7), and the training learning rates γ₁ = 1 in (8) and η₁ = 1 in (9)]; and, finally, the growth/decay factors k_A = 1, k_B = 1, and k_C = 1 adopted in (1). Such choices have been driven by the conducted experiments, where, varying the parameters, we observed almost constant accuracy.

The remaining parameters have been experimentally chosen for each image sequence as reported in Table I. As clarified in Section II-C, the value of the stationary threshold τ used in (1) is application-dependent; the values chosen here for the Dog and the Binders sequences (see Section IV-B) are linked to the duration of the exemplified stopped object events, while those for the Parked Vehicle and the Abandoned Bag sequences (see Section IV-C) derive from the corresponding task definition given by the AVSS 2007 contest [1]. The segmentation threshold ε₂ adopted in (7), described in Section III-B, strictly depends on the color similarity between the foreground objects and the background; suitable values are on the order of 10⁻². The learning rate γ₂ for the intra-layer update of (8) has always been chosen equal to the learning rate η₂ for the inter-layer update of (9), with values on the order of 10⁻². As clarified in Section III-B, these learning rates depend on the scene variability; because the considered sequences do not exhibit rapid background changes, the chosen low learning rates allow the neural network to slowly adapt to these changes, at the same time allowing the self-organizing model to better tolerate errors due to a few false negatives. We would remark that, according to the mathematical ground related to (7)–(9) reported in Section III-B, values for the last three parameters for the testing phase should be chosen smaller than their training counterparts.

B. Qualitative Evaluation

The Dog sequence is an outdoor sequence consisting of 532 frames of 320 × 240 spatial resolution, available on the web at http://www.openvisor.org. The scene consists of a garden, where a man sits on a chair, while a dog passes in front of him. One representative frame and the related results are reported in Fig. 3.


Fig. 4. Results of moving and stopped object segmentation for frames 1280 (first row), 1321 (second row), and 1395 (third row) of the Binders sequence. (a) Original frames. (b) MOD masks computed by the 3D_SOBS algorithm. (c) Original frames with the first (red) and second (blue) stopped layers. Representations of (d) the first stopped layer model, (e) the second stopped layer model, and (f) the moving foreground model.

Here, we can observe that, although quite accurate, the MOD mask computed by the 3D_SOBS algorithm [Fig. 3(b)] for the original sequence frame I_270 [Fig. 3(a)] does not allow us to distinguish the man sitting on the chair from the dog passing in front of him. Representations of the moving and the stopped foreground models constructed by the SFS algorithm are reported in Fig. 3(d) and (e), respectively. Such model representations are computed as the best approximations to the current sequence frame I_270 obtainable by the related 3-D neural model F_t or S_t, for t = 270 (i.e., choosing, for each pixel of I_270, the best matching weight vector of the related model). Small unsteadiness of the sitting man can be observed in the moving foreground model [Fig. 3(d)], which contains not only the moving dog, but also the contour and the hair of the man. Nonetheless, the availability of the moving and the stopped foreground models, together with the background model obtained by the 3D_SOBS algorithm [whose representation is reported in Fig. 3(c)], allowed us to isolate moving and stopped foreground pixels, thus proving the robustness of the SFS algorithm to the partial occlusion of the stopped person. The final segmentation is reported in Fig. 3(f), showing the original frame with stopped foreground pixels (in red) and moving foreground pixels (in green).

The Binders sequence is an indoor sequence consisting of 638 frames of 320 × 240 spatial resolution (publicly available in the download section of http://cvprlab.uniparthenope.it), specifically designed for showing the results achieved by the SFS algorithm. The scene consists of an office, where a person first leaves a green binder on a desk, then leaves a red binder, and, after that, leaves a blue book in front of the red binder. After a while, the person picks up the green binder, passing it in front of the two stopped items, then picks up in turn also the blue book and the red binder. Representative frames together with related results are reported in Fig. 4. Here, we report the original sequence frames 1280, 1321, and 1395 [column (a)] and the corresponding MOD masks computed by the 3D_SOBS algorithm [column (b)]. As for the Dog sequence, the MOD masks, although quite accurate, do not allow us to distinguish between moving and stopped objects. Specifically, in frame 1321 the blue book is stopped in front of the red binder, which is itself stopped. However, the MOD mask allows us only to detect the entire image area including the two stopped objects, but it does not allow us to distinguish them. Likewise, in frame 1395 the green binder is passing in front of the other two stopped objects; the MOD mask includes the image area of the three objects extraneous to the background, giving no indication of whether they are moving or stopped. In column (c), we report the original frames with the first and second stopped layers computed by the SFS algorithm superimposed, identified by red and blue pixels, respectively; moving pixels are not shown. Apart from a few pixels belonging to the contours of the objects, the achieved classification of stopped objects, as well as their separation into two different layers, is perfectly consistent with the scene in the three frames, also proving the robustness of the SFS algorithm against the stopped object occlusion and restart challenges. Representations of the first and the second stopped layer models, as well as of the foreground model, are reported in columns (d)–(f), respectively.

Besides robustness against stopped object occlusion and restart challenges, the reported qualitative results highlight that the SFS algorithm can be readily adopted for the detection of abandoned objects, as defined in Section I. Also the case of removed objects can be readily handled (e.g., the removed green binder of Fig. 4), provided that the removed object does not appear as stationary from the beginning of the sequence. Indeed, in this case it is included into the background and, due to the selective update of the background model, it will be detected as foreground only when it starts to move.

C. Quantitative Evaluation

To compare results of our approach to SOD with other existing approaches, we further consider the i-LIDS dataset (publicly available at ftp://motinas.elec.qmul.ac.uk/pub/iLids/), including also annotated ground truths provided for the AVSS 2007 contest [1]. Two scenarios are considered.


TABLE II
COMPARISON OF GROUND TRUTH (GT) STOPPED OBJECT EVENT START AND END TIMES (IN MIN) FOR THE i-LIDS SEQUENCES WITH THOSE COMPUTED BY THE PROPOSED APPROACH AND BY OTHER APPROACHES

Parked Vehicle Sequences
                            PV-Easy        PV-Medium      PV-Hard        Mean     Median
                            Start   End    Start   End    Start   End    Error    Error
    GT                      02:48   03:15  01:28   01:47  02:12   02:33  -        -
    SFS + 3D_SOBS           02:45   03:19  01:28   01:51  02:12   02:34  4.00     4.00
    SFS + SOBS [33]         02:45   03:20  01:28   01:51  02:12   02:35  4.67     4.00
    SFS + MOG [26]          02:44   03:20  01:27   01:50  02:12   02:35  5.00     4.00
    Bhargava et al. [10]    N/A     N/A    N/A     N/A    N/A     N/A    N/A      N/A
    Boragno et al. [11]     02:48   03:19  01:28   01:55  02:12   02:36  5.00     4.00
    Guler et al. [12]       02:46   03:18  01:28   01:54  02:13   02:36  5.33     5.00
    Lee et al. [13]         02:51   03:18  01:33   01:52  02:16   02:34  7.00     6.00
    Porikli et al. [18]     N/A     N/A    01:39   01:47  N/A     N/A    11.00    11.00
    Venetianer et al. [16]  02:52   03:16  01:43   01:47  02:19   02:34  9.33     8.00

Abandoned Bag Sequences
                            AB-Easy        AB-Medium      AB-Hard        Mean     Median
                            Start   End    Start   End    Start   End    Error    Error
    GT                      03:00   03:12  02:42   03:00  02:42   03:06  -        -
    SFS + 3D_SOBS           02:50   03:17  02:35   03:01  02:42   03:07  8.00     8.00
    SFS + SOBS [33]         02:50   03:17  02:35   03:02  02:42   03:08  8.67     9.00
    SFS + MOG [26]          02:51   03:18  02:34   03:02  02:43   03:09  9.67     10.00
    Bhargava et al. [10]    02:59   03:12  02:46   03:00  02:43   03:07  2.33     2.00
    Boragno et al. [11]     N/A     N/A    N/A     N/A    N/A     N/A    N/A      N/A
    Guler et al. [12]       02:23   03:18  02:42   03:06  02:14   03:16  29.00    38.00
    Lee et al. [13]         N/A     N/A    N/A     N/A    N/A     N/A    N/A      N/A
    Porikli et al. [18]     N/A     N/A    N/A     N/A    N/A     N/A    N/A      N/A
    Venetianer et al. [16]  02:54   03:17  02:54   03:01  N/A     03:13  10.33    11.00

The Parked Vehicle sequences represent typical situations critical for MOD in outdoor sequences, presenting strong shadows cast by objects on the ground, mild positional instability caused by small movements of the camera due to the wind, and strong and long-lasting illumination variations due to clouds covering and uncovering the sun. They are devoted to detecting vehicles in no parking areas, where the street under control is more or less crowded with cars, depending on the hour of the day the scene refers to. The no parking area adopted for the contest is defined as the main street borders, and the stationary threshold is defined as τ = 1500; this means that an object is considered irregularly parked if it stops in the no parking area for more than 60 s (scenes are captured at 25 frames/s, so τ = 60 × 25 = 1500 frames). The Abandoned Bag sequences are devoted to detecting abandoned objects in a train station, and the detection area is restricted to the train platform. Different crowd densities in the various sequences determine different levels of occlusion problems. Here, the event task for the AVSS 2007 contest [1] is defined as detecting bags left unattended by their owners for more than 60 s, thus requiring the association of the owner with the bag. For this scenario, we then set the stationary threshold τ = 2500 (instead of τ = 1500), in order to approximately take into account the time the owners take to move away from the bag.

We compared the results on the i-LIDS sequences obtained by our approach with those obtained by several other approaches. Specifically, we consider results obtained by the SFS algorithm using three different models for the background and the foreground (3D_SOBS, SOBS [33], and MOG [26]), and those obtained by the following.

1) Bhargava et al. [10], who search for unattended objects that are separated from nearby blobs and perform reverse traversal to search for the candidate owner, continuously monitoring the scene for the departure/return of the owner.

2) Boragno et al. [11], who employ a DSP-based system for automatic visual surveillance, where block-matching motion detection is coupled with MOG-based foreground extraction.

3) Guler et al. [12], who extend a tracking system inspired by the human visual cognition system, introducing a stationary object model where each region represents a hypothesized stationary object whose associated probability measures the endurance of the region.

4) Lee et al. [13], who present a detection and tracking system operating on a 1-D projection of images.

5) Porikli et al. [18], who employ dual foregrounds to extract temporally static image regions, as already described in Section I.

6) Venetianer et al. [16], who employ an object-based video analysis system, featuring detection, tracking, and classification of objects.

Table II reports results on the i-LIDS sequences in terms of stopped object event start and end times (in min) for each sequence, as well as the mean and median errors computed over the absolute errors on the two scenarios. We can observe that stopped object events detected by the SFS algorithm, whichever pixel-based model is adopted (3D_SOBS, SOBS [33], or MOG [26]), generally start as soon as a single pixel is detected as stopped and end as soon as no more stopped pixels are detected; this unavoidably leads to a small anticipation of the event start time and a slight delay of the event end time, as compared to object-based approaches. However, from the results we can conclude that the SFS algorithm generally compares favorably to the other SOD approaches, independently of scene traffic or crowd level. Indeed, differences between the ground truth and the computed stopped object event times are generally smaller for SFS coupled with 3D_SOBS than for the other approaches, and this is even more evident if we consider the mean and median errors for the two scenarios. It is worth pointing out that, in the case of the Abandoned Bag sequences, lower accuracy can be observed for most of the compared methods. This is due to the need for higher level knowledge of the scene content for the exact detection of the Abandoned Bag event. Although this is beyond our objective, higher level information concerning bag ownership could be exploited to more accurately detect the start of the abandoned bag event (e.g., [12], [16]).
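The pixel-level event timing just described can be sketched as follows; the helper below is our simplified illustration of this behavior, not the authors' implementation: an event starts at the first frame in which at least one pixel is flagged as stopped, and ends at the first subsequent frame in which no stopped pixels remain.

    def stopped_event_interval(stopped_pixel_counts):
        """Given per-frame counts of pixels flagged as stopped, return the
        (start_frame, end_frame) of the first stopped-object event, where
        end_frame is None if the event is still ongoing, or None if no
        event occurs at all."""
        start = None
        for frame, count in enumerate(stopped_pixel_counts):
            if start is None and count > 0:
                start = frame              # first stopped pixel: event starts
            elif start is not None and count == 0:
                return (start, frame)      # no stopped pixels left: event ends
        return (start, None) if start is not None else None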

Results reported in Table II also reveal the robustness of all the compared methods against the occlusion challenge (all of them detect the beginning of the stopped object event, despite the occlusions) and the restart challenge (all the compared methods are able to detect the end of the stopped object event).


TABLE III
COMPARISON OF THE NUMBER OF TRUE DETECTIONS (TD) AND FALSE DETECTIONS (FD) FOR THE i-LIDS SEQUENCES ACHIEVED BY THE PROPOSED APPROACH AND BY OTHER APPROACHES

             SFS + 3D_SOBS   Albiol et al. [28]   Evangelio et al. [20]   Pan et al. [17]   Tian et al. [4]
             TD    FD        TD    FD             TD    FD                TD    FD          TD    FD
PV-easy       1     0         1    N/A            N/A   N/A                1     0           1     0
PV-medium     1     0         1    N/A            N/A   N/A                1     0           1     0
PV-hard       1     0         1    N/A            N/A   N/A                1     0           1     1
AB-easy       1     0        N/A   N/A             1     0                 1     0           1     0
AB-medium     1     0        N/A   N/A             1     5                 1     0           1     0
AB-hard       1     1        N/A   N/A             1     6                 1     0           1     1


More recent works [4], [17], [20], [28] report accuracy results in terms of event-based detections, instead of stopped object event start and end times. These results are compared to those of the proposed approach in Table III, where we report the number of correctly detected events (TD, true detections) and the number of erroneously detected events (FD, false detections). The results show that all the compared methods are robust against the occlusion challenge (despite the occlusions, all of them detect the expected number of stopped object events), but nothing can be said concerning the restart challenge. Some of the methods, including the proposed one, report false detections (mainly in the AB-hard sequence) that are due to static people. These false detections could be avoided using a people detector (as suggested in [4]) or exploiting region-level information (as in [17]). We can conclude that the proposed approach achieves results that are consistent with the state of the art, and that the few false detections can be easily eliminated by higher level processing.
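For completeness, the event-based scoring of Table III can be sketched as follows, under the assumed matching rule that a detected event counts as a TD if it overlaps a ground-truth event and as an FD otherwise (the rule is our illustration, not necessarily the one used by each cited work):

    def score_events(detected, ground_truth):
        """Count true detections (TD) and false detections (FD).
        Events are (start_frame, end_frame) intervals; a detection is a
        TD if it overlaps some ground-truth event, an FD otherwise."""
        td = fd = 0
        for d_start, d_end in detected:
            if any(d_start <= g_end and g_start <= d_end
                   for g_start, g_end in ground_truth):
                td += 1
            else:
                fd += 1
        return td, fd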

To give an idea of the segmentation accuracy achieved by our neural-based approach to SOD, in Fig. 5 we analyze the results achieved on the Dog and Binders sequences in terms of the F1 measure, as adopted in [33]. The reported F1 values have been obtained by comparing manually labeled ground truth masks for the first stopped layer of 20 frames of the Dog sequence, and for the first and second stopped layers of 20 frames of the Binders sequence, with the corresponding segmentation results, varying the number n of model layers. The high F1 values, slightly increasing with n, are achieved because most of the stopped pixels are indeed detected as stationary and only a few pixels detected as stopped are instead moving. Further experimental results showing the accuracy of the 3D_SOBS algorithm for MOD can be found in [44].
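The F1 measure is the harmonic mean of pixel-wise precision and recall; a minimal NumPy sketch of the computation just described, assuming boolean masks as inputs, is:

    import numpy as np

    def f1_measure(gt_mask, det_mask):
        """Pixel-wise F1 between a ground-truth stopped-layer mask and the
        corresponding detected mask (boolean arrays of equal shape)."""
        tp = np.logical_and(gt_mask, det_mask).sum()    # stopped pixels correctly detected
        fp = np.logical_and(~gt_mask, det_mask).sum()   # moving pixels flagged as stopped
        fn = np.logical_and(gt_mask, ~det_mask).sum()   # stopped pixels missed
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)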

The computational complexity of the proposed MOD and SOD algorithm, both in terms of space and time, is basically proportional to the computational complexity inherent to the adopted model, because it entails analogous models for background, moving, and stopped foreground. Therefore, it is O(n N_P) for each sequence image, where n is the number of model layers, N_P is the image size, and other small multiplicative constants are omitted.
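To make the O(n N_P) bound concrete, a rough per-frame operation count for illustrative values (n = 5 layers and the high-resolution setting used below) is:

    # Rough per-frame work estimate for the O(n * N_P) bound
    # (illustrative only; multiplicative constants are omitted, as in the text).
    n = 5                      # number of model layers
    N_P = 720 * 576            # pixels of a high-resolution (H) frame
    print(n * N_P)             # 2073600 weight-vector visits per frame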

Fig. 5. Average segmentation accuracy on the Dog and Binders sequences, varying the number n of model layers.

Fig. 6. Execution times (in ms/frame) for the proposed MOD and SOD algorithm on color image sequences with different sizes (S, M, H), varying the number n of model layers.


To complete our analysis, in Fig. 6 we report the execution times (in ms/frame) of the proposed SFS algorithm using the 3D_SOBS model, varying the spatial resolution of the color image sequences (S = 180 × 144, M = 360 × 288, H = 720 × 576) and the number of model layers (n = 3, 5, 7, 9). Timings have been obtained with a prototype implementation in the C programming language on a Pentium 4 with 2.40 GHz and 512-MB RAM, running the Windows XP operating system, and do not include I/O. Image sequences with different resolutions have been obtained by subsampling the PV-medium sequence. As expected from the analysis of the computational complexity, the sequence resolution is the predominant factor influencing execution times. For a fixed resolution, execution times moderately increase when augmenting the number of model layers. Having already observed in Fig. 5 that the segmentation accuracy slightly increases with n, the value n = 5 chosen for all the reported experimental results represents a good compromise between high accuracy and computational complexity. Moreover, the plot shows that only for high-resolution sequences is the achieved frame rate insufficient for real-time processing (about 40 ms/frame). Nonetheless, we can observe that MOD (i.e., 3D_SOBS) requires most of the total execution time, while SOD (i.e., SFS) represents only a small percentage of it (around 15%). Therefore, stopped foreground subtraction can be considered a useful and inexpensive by-product of background subtraction.
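As a final note on the timings, real-time processing of a 25-fps sequence leaves a budget of 1000/25 = 40 ms per frame, which is the figure quoted above; a trivial check (ours, for illustration only) is:

    # Real-time budget: at 25 fps, each frame must be processed
    # within 1000 / 25 = 40 ms (frame rate taken from the text).
    def is_real_time(ms_per_frame, fps=25):
        return ms_per_frame <= 1000.0 / fps

    print(is_real_time(40.0))  # True: exactly at the real-time boundary
    print(is_real_time(55.0))  # False: hypothetical timing, too slow for 25 fps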


    V. CONCLUSION

This paper presented a neural-based contribution to stopped object detection in digital image sequences taken from stationary cameras. A 3-D neural model for image sequences that automatically adapts to scene changes in a self-organizing manner was targeted for modeling the background and the foreground, aimed at the detection of stopped objects. Coupled with the proposed model-based framework for stopped object detection, it enables the segmentation of stopped foreground objects against moving foreground objects, robustly handling occlusion and restart problems.

Experimental results on real video sequences and comparisons with existing approaches showed that the proposed 3-D neural model-based framework compares favorably to other tracking- and non-tracking-based approaches to stopped object detection. The proposed approach is shown to be an inexpensive by-product of background subtraction that provides an initial segmentation of scene objects, useful for any subsequent video analysis tasks, such as abandoned and removed object classification, people counting, and human activity recognition.

    REFERENCES

[1] Fourth IEEE International Conference on Advanced Video and Signal Based Surveillance. Piscataway, NJ: IEEE Computer Society, Sep. 2007.

[2] Ninth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, J. M. Ferryman, Ed. Piscataway, NJ: IEEE Computer Society, Jun. 2006.

[3] Tenth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, J. M. Ferryman, Ed. Piscataway, NJ: IEEE Computer Society, Oct. 2007.

[4] Y. Tian, R. Feris, H. Liu, A. Hampapur, and M.-T. Sun, "Robust detection of abandoned and removed objects in complex surveillance videos," IEEE Trans. Syst., Man, Cybern. C, vol. 41, no. 5, pp. 565–576, Sep. 2011.

[5] S.-C. S. Cheung and C. Kamath, "Robust techniques for background subtraction in urban traffic video," Proc. SPIE, vol. 5308, pp. 881–892, 2004, doi:10.1117/12.526886.

[6] S. Elhabian, K. El Sayed, and S. Ahmed, "Moving object detection in spatial domain using background removal techniques: State-of-art," Recent Patents Comput. Sci., vol. 1, no. 1, pp. 32–54, Jan. 2008.

[7] M. Piccardi, "Background subtraction techniques: A review," in Proc. IEEE Int. Conf. Syst. Man Cybern., vol. 4, Oct. 2004, pp. 3099–3104.

[8] R. J. Radke, S. Andra, O. Al-Kofahi, and B. Roysam, "Image change detection algorithms: A systematic survey," IEEE Trans. Image Process., vol. 14, no. 3, pp. 294–307, Mar. 2005.

[9] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: Principles and practice of background maintenance," in Proc. Int. Conf. Comput. Vis., vol. 1, 1999, pp. 255–261.

[10] M. Bhargava, C.-C. Chen, M. Ryoo, and J. Aggarwal, "Detection of object abandonment using temporal logic," Mach. Vis. Appl., vol. 20, no. 5, pp. 271–281, 2009.

[11] S. Boragno, B. Boghossian, J. Black, D. Makris, and S. Velastin, "A DSP-based system for the detection of vehicles parked in prohibited areas," in Proc. IEEE Conf. Adv. Video Signal Based Surveill., Sep. 2007, pp. 260–265.

[12] S. Guler, J. A. Silverstein, and I. H. Pushee, "Stationary objects in multiple object tracking," in Proc. IEEE Conf. Adv. Video Signal Based Surveill., Sep. 2007, pp. 248–253.

[13] J. Lee, M. Ryoo, M. Riley, and J. Aggarwal, "Real-time illegal parking detection in outdoor environments using 1-D transformation," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 7, pp. 1014–1024, Jul. 2009.

[14] S. Lu, J. Zhang, and D. Dagan Feng, "Detecting unattended packages through human activity recognition and object association," Pattern Recognit., vol. 40, no. 8, pp. 2173–2184, Aug. 2007.

[15] A. Singh, S. Sawan, M. Hanmandlu, V. Madasu, and B. Lovell, "An abandoned object detection system based on dual background segmentation," in Proc. 6th IEEE Int. Conf. Adv. Video Signal Based Surveill., Sep. 2009, pp. 352–357.

[16] P. Venetianer, Z. Zhang, W. Yin, and A. Lipton, "Stationary target detection using the ObjectVideo surveillance system," in Proc. IEEE Conf. Adv. Video Signal Based Surveill., Sep. 2007, pp. 242–247.

[17] J. Pan, Q. Fan, and S. Pankanti, "Robust abandoned object detection using region-level analysis," in Proc. Int. Conf. Image Process., Sep. 2011, pp. 3597–3600.

[18] F. Porikli, Y. Ivanov, and T. Haga, "Robust abandoned object detection using dual foregrounds," EURASIP J. Adv. Signal Process., Jan. 2008, p. 30.

[19] R. T. Collins, A. J. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, O. Hasegawa, P. Burt, and L. Wixson, "A system for video surveillance and monitoring," Robotics Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-RI-TR-00-12, 2000.

[20] R. Evangelio, M. Patzold, and T. Sikora, "A system for automatic and interactive detection of static objects," in Proc. IEEE Workshop Person-Oriented Vision, Jan. 2011, pp. 27–32.

[21] H. Fujiyoshi and T. Kanade, "Layered detection for multiple overlapping objects," in Proc. 16th Int. Conf. Pattern Recognit., vol. 4, Aug. 2002, pp. 156–161.

[22] E. Herrero-Jaraba, C. Orrite-Urunuela, and J. Senar, "Detected motion classification with a double-background and a neighborhood-based difference," Pattern Recognit. Lett., vol. 24, no. 12, pp. 2079–2092, 2003.

[23] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis, "Real-time foreground-background segmentation using codebook model," Real-Time Imag., vol. 11, no. 3, pp. 172–185, 2005.

[24] K. Patwardhan, G. Sapiro, and V. Morellas, "Robust foreground detection in video using pixel layers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. 746–751, Apr. 2008.

[25] N. Papadakis and A. Bugeau, "Tracking with occlusions via graph cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 144–157, Jan. 2011.

[26] C. Stauffer and W. Grimson, "Adaptive background mixture models for real-time tracking," in Proc. Comput. Vis. Pattern Recognit., vol. 2, Jun. 1999, p. 252.

[27] Q. Zhang and K. Ngan, "Segmentation and tracking multiple objects under occlusion from multiview video," IEEE Trans. Image Process., vol. 20, no. 11, pp. 3308–3313, Nov. 2011.

[28] A. Albiol, L. Sanchis, A. Albiol, and J. Mossi, "Detection of parked vehicles using spatiotemporal maps," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 4, pp. 1277–1291, Dec. 2011.

[29] D. Culibrk, O. Marques, D. Socek, H. Kalva, and B. Furht, "Neural network approach to background modeling for video object segmentation," IEEE Trans. Neural Netw., vol. 18, no. 6, pp. 1614–1627, Nov. 2007.

[30] L. Duan, D. Xu, and I. Tsang, "Domain adaptation from multiple sources: A domain-dependent regularization approach," IEEE Trans. Neural Netw., vol. 23, no. 3, pp. 504–518, Mar. 2012.

[31] A. Iosifidis, A. Tefas, and I. Pitas, "View-invariant action recognition based on artificial neural networks," IEEE Trans. Neural Netw., vol. 23, no. 3, pp. 412–424, Mar. 2012.

[32] E. Lopez-Rubio, R. M. L. Baena, and E. Dominguez, "Foreground detection in video sequences with probabilistic self-organizing maps," Int. J. Neural Syst., vol. 21, no. 3, pp. 225–246, 2011.

[33] L. Maddalena and A. Petrosino, "A self-organizing approach to background subtraction for visual surveillance applications," IEEE Trans. Image Process., vol. 17, no. 7, pp. 1168–1177, Jul. 2008.

[34] G. Pajares, "A Hopfield neural network for image change detection," IEEE Trans. Neural Netw., vol. 17, no. 5, pp. 1250–1264, Sep. 2006.

[35] S. Pal, A. Petrosino, and L. Maddalena, Handbook on Soft Computing for Video Surveillance. London, U.K.: Chapman & Hall, 2012.

[36] P. Wang, C. Shen, N. Barnes, and H. Zheng, "Fast and robust object detection using asymmetric totally corrective boosting," IEEE Trans. Neural Netw., vol. 23, no. 1, pp. 33–46, Jan. 2012.

[37] L. Maddalena and A. Petrosino, "3D neural model-based stopped object detection," in Proc. 15th Int. Conf. Image Anal. Process., LNCS 5716, 2009, pp. 585–593.

[38] L. Maddalena and A. Petrosino, "Object motion detection and tracking by an artificial intelligence approach," Int. J. Pattern Recognit. Artif. Intell., vol. 22, no. 5, pp. 915–928, Jan. 2008.

[39] R. Fisher. (1999). Change Detection in Color Images [Online]. Available: http://homepages.inf.ed.ac.uk/rbf/PAPERS/iccv99.pdf

[40] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, "Detecting moving objects, ghosts, and shadows in video streams," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1337–1342, Oct. 2003.

[41] J. Barron and R. Klette, "Experience with optical flow in colour video image sequences," in Proc. Inter-Vehicle Commun., 2001, pp. 195–200.

[42] Y. Mileva, A. Bruhn, and J. Weickert, "Illumination-robust variational optical flow with photometric invariants," in Proc. 29th DAGM Conf. Pattern Recognit., 2007, pp. 152–162.

[43] P. Burt, "Fast filter transform for image processing," Comput. Graph. Image Process., vol. 16, no. 1, pp. 20–51, 1981.

[44] L. Maddalena and A. Petrosino, "Further experimental results with the 3D_SOBS algorithm for moving object detection," Dept. Applied Sci., Univ. Naples Parthenope, Naples, Italy, Tech. Rep. RT-DSA-UNIPARTHENOPE-12-01, 2012.

Lucia Maddalena (M'08) received the Laurea degree (cum laude) in mathematics and the Ph.D. degree in applied mathematics and computer science from the University of Naples Federico II, Naples, Italy.

She is currently a Researcher with the Institute for High-Performance Computing and Networking, National Research Council, Naples, Italy. She was involved in research on parallel computing algorithms, methodologies, and techniques, and their applications to computer graphics. She is currently involved in research on methods, algorithms, and software for image processing and multimedia systems in high-performance computational environments, with applications to real-world problems, particularly digital film restoration and video surveillance. She has taught with the University of Naples Federico II and the University of Naples Parthenope, Naples. She has co-edited one book on soft computing for video surveillance.

Dr. Maddalena is a member of the International Association for Pattern Recognition, an Associate Editor of the International Journal of Biomedical Data Mining, and a reviewer for several international journals.

Alfredo Petrosino (SM'02) received the Laurea degree (cum laude) in computer science from the University of Salerno, Salerno, Italy, under the supervision of E. R. Caianiello.

He is currently a Full Professor of computer science with the University of Naples Parthenope, Naples, Italy, where he is the Head of CVPRLab, a research laboratory in computer vision and pattern recognition (cvprlab.uniparthenope.it). He was with the University of Salerno, the International Institute of Advanced Scientific Studies, the Institute for the Physics of Matter, and the National Research Council. He has taught with the University of Salerno, the University of Siena, the University of Naples Federico II, and the University of Naples Parthenope. He has authored or co-authored more than 100 papers in journals and conferences, and has co-edited six books. His current research interests include computer vision, image and video analysis, pattern recognition, neural networks, and fuzzy and rough sets.

Prof. Petrosino is also a member of the International Association for Pattern Recognition and of the International Neural Networks Society. He has been the General Chair of the International Workshop on Fuzzy Logic and Applications since 1995, and is the General Chair of the 17th International Conference on Image Analysis and Processing in 2013. He is an Associate Editor of Pattern Recognition, a member of the Editorial Boards of Pattern Recognition Letters and the International Journal of Knowledge Engineering and Soft Data Paradigms, and a Guest Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, Information Sciences, Fuzzy Sets and Systems, Image and Vision Computing, and Parallel Computing.


