
Time-Line Editing of Objects in Video

Shao-Ping Lu, Song-Hai Zhang, Jin Wei, Shi-Min Hu, Member, IEEE, and Ralph R Martin

Abstract—We present a video editing technique based on changing the time-lines of individual objects in video, which leaves them in their original places but puts them at different times. This allows the production of object-level slow motion effects, fast motion effects, or even time reversal. This is more flexible than simply applying such effects to whole frames, as new relationships between objects can be created. As we restrict object interactions to the same spatial locations as in the original video, our approach can produce high-quality results using only coarse matting of video objects. Coarse matting can be done efficiently using automatic video object segmentation, avoiding tedious manual matting. To design the output, the user interactively indicates the desired new life-spans of objects, and may also change the overall running time of the video. Our method rearranges the time-lines of objects in the video whilst applying appropriate object interaction constraints. We demonstrate that, while this editing technique is somewhat restrictive, it still allows many interesting results.

Index Terms—Object-level motion editing, Foreground/background reconstruction, Slow motion, Fast motion, Time reversal.

1 INTRODUCTION

Visual special effects can make movies more entertaining, allowing the impossible to become possible, and bringing dreams, illusions, and fantasies to life. Special effects are an indispensable post-production tool to help convey a director's ideas and artistic concepts.

Time-line editing during post-production is an important strategy to produce special effects. Fast-motion is a creative way to indicate the passage of time. Accelerated clouds, city traffic or crowds of people are often depicted in this way. Slowing down a video can enhance emotional and dramatic moments: for example, comic moments are often more appealing when seen in slow-motion. However, time-line editing is normally applied to entire frames, so that the whole scene in a section of video undergoes the same transformation of time coordinate. Allowing time-line changes for individual objects in a video has the potential to offer the director much more freedom of artistic expression, and allows new relationships between objects to be constructed. Several popular films have used such effects: for example, characters move while time stands still in the film 'The Matrix'. Usually, such effects are captured on the set.

• S.-P. Lu, S.-H. Zhang and J. Wei are with TNList, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China. E-mail: [email protected], [email protected], [email protected].

• S.-M. Hu (Corresponding author) is with TNList, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China. E-mail: [email protected].

• R.R. Martin is with the School of Computer Science and Informatics, Cardiff University, Cardiff, Wales CF24 3AA, UK. E-mail: [email protected].

The time-lines of individual objects in video may be changed by cutting, transforming and pasting them back into the video during post-production. This requires fine-scale object matting, as in general new object interactions may occur within the space-time video volume: objects may newly touch or overlap in certain frames, or an occlusion of one object by another may no longer happen. Typical video object segmentation and composition methods still need intensive user interaction, especially in regions where objects interact. On the other hand, automatic or semi-automatic tracking approaches, such as mean shift tracking [1], particle filtering [2] and space-time optimization [3], can readily provide coarse matting results, e.g. a bounding ellipse that contains the tracked object together with some background pixels. We take advantage of such methods to provide an easy-to-use object-level time-line editing system. Our key idea is to retain and reuse the original interactions between moving objects, and the relationships between moving objects and the background. In particular, moving objects are kept at their original spatial locations, as are the original object interactions, but these may occur at a different time. The user can specify a starting and ending time for the motion of each object, and a speed function (constant by default) in that interval, subject to these constraints.

We use an optimization method to adjust video objects' time-lines to best meet user specified requirements, while satisfying constraints enforcing object interactions to remain at their original spatial positions. This optimization process is fast, taking just a second to perform for a dozen objects over 100 frames, allowing interactive editing of video. The user can clone objects, speed them up or slow them down, or even reverse time for objects, to achieve special effects in video.


Fig. 1. Time-line editing of video objects puts them in the same places but at different times, resulting in new temporal relationships between objects. Top: original video, bottom: with slower cat. Left: trajectories of the cat and the woman in the video.

2 RELATED WORK

Video editing based on object matting and compositing is often used in digital media production. Schodl and Essa [4] extract video objects using blue screen matting, and generate new videos by controlling the trajectories of the objects and rendering them in arbitrary video locations. Video matting and compositing entail a tedious amount of user interaction to extract objects, even for a short video [5], [6], [7], [8]. A variety of approaches can be used to alleviate this, such as contour tracking [9], optical flow assisted Bayesian matting [10], 3D graph cut [11], mean shift segmentation [12], and local classifiers [13]. Even so, current methods still require intensive user interaction to perform accurate video object matting, and cannot handle objects lacking clear shape boundaries such as smoke, or objects with motion blur. Although path arrangement has been extensively considered in 3D animation, such as group motion editing [14], it cannot be directly used in video object path editing due to the difficulty of object extraction and compositing. In contrast, our system avoids the need for accurate object matting as moving objects are always placed at their original locations, albeit at different times, finessing the compositing problem. Even a bounding ellipse provided by straightforward tracking of the object can provide adequate matting results.

Various approaches have been devised to provide temporal analysis and editing tools for video. For example, video abstraction [15], [16] allows fast video browsing by automatically creating a shortened version which still contains all important events. A common approach to representing video evolution over time is to use key frame selection and arrangement [17], but, being frame-based, it does not permit manipulation of individual objects. Video montage [18] is similar to video summarization, but extracts informative spatio-temporal portions of the input video and fuses them in a new way into an output video volume. Barnes et al. [19] visualize the video in the style of a tapestry without hard borders between frames, providing spatial continuity yet also allowing continuous zoom-in to finer temporal resolutions. Again, the aim is automatic summarization, rather than user-controlled editing. Video condensation based on seam carving [20], [21], [22] is another flexible approach for removing frames to adjust the overall length of a video. The above methods generally handle information at the frame, or pixel level, whereas our tools allow the user to modify objects, which allows more flexible rearrangement of video content.

Goldman et al. [2] advocate that object tracking, annotation and composition can lead to enriched video editing applications. Methods in [23], [24] enable users to navigate video using computed time-lines of moving objects and other interesting content. Liu et al. [25] present a system for magnifying micro-repetitive motions by motion based pixel clustering and inpainting. Recent work [26] provides a video mesh structure to interactively achieve depth-aware compositing and relighting. Scholz et al. [27] present a fine segmentation and composition framework to produce object deformation and other editing effects, allowing spatial changes to objects; it requires intensive user interaction as well as foreground extraction and 3D inpainting. Rav-Acha et al. [28], [29] introduce a dynamic scene mosaic editing approach to generate temporal changes, but, to avoid artifacts, require that the moving objects should not interact. Our method analyzes and records object interactions, and avoids artifacts in the output by constraining the kinds of editing allowed. We provide a visual interface for efficient manipulation of objects' life-spans and speeds in the video volume.

Object-based video summarization methods also exist, rearranging objects into an image or a short video in the style of a static or moving storyboard [30], [31]. Goldman et al. [32] present a method for visualizing short video clips as a single image, using the visual language of storyboards. Pritch et al. [33] shorten an input video by simultaneously showing several actions which originally occurred at different times. These techniques represent activities as 3D objects in the space-time volume, and produce shortened output by packing these objects more tightly along the time axis while avoiding object collisions. These approaches shift object interactions through time while keeping their spatial locations intact to avoid visual artifacts. We use the same approach of keeping object interactions at the same spatial locations, optimizing objects' locations in time to best meet the user's requirements whilst also satisfying constraints. Our method not only allows video to be condensed, but also allows the user to determine the life-spans of individual objects, including changing their starting times, speeding them up or slowing them down, or even reversing their time-lines.

Fig. 2. Pipeline. Preprocessing includes panoramic background construction and coarse extraction of moving objects. Video tubes are constructed for each extracted object. The user specifies starting times and speed for edited objects. Trajectories of video objects are optimized to preserve original interactions between objects, and relationships to the background. Resampling of objects from suitable frames produces the output.

3 APPROACH

Our system allows the user to edit objects' time-lines, and produces artifact-free video output while only needing coarse video object matting (see the bottom of Figure 4). Our approach relies on reinserting each object in the output at the same place as before, with the same background, but at a different time. Carefully handling object interactions is the key to our approach. When one object partly or wholly occludes another, or their coarse segmentations overlap, any changes to their interaction will prevent direct composition of these objects with the background. We thus disallow such changes: output is produced using an optimization approach which imposes two hard constraints, while also best satisfying the (possibly conflicting) user requests:

• Moving objects must remain in their original spatial positions (and orientations) and can only be transformed to a new time.

• Interacting objects (i.e. objects which are very close or overlapping) must still interact in the same way, at the same relative speed, although maybe at a different time.

• The user may specify new starting and ending times for objects, as well as a speed function within that duration; weights prioritize such requirements for different objects.

• Certain frames for an object may be marked as important, with a greater priority of selection in the output.

• The user may lengthen or shorten the entire video.

We allow the background to move (pan), in which case keeping objects and their interactions at the same spatial positions does not mean at the same pixel coordinates, but at the same location relative to a global static background for the scene. Thus, our method builds a panoramic background for all frames and coarsely extracts tubes representing the spatio-temporal presence of each moving object in the video. Object extraction is performed using an interactive keyframe-interpolation system, which coarsely determines a bounding ellipse in each frame for each moving object, forming a tube in video space-time. After detecting all bounding ellipses in each frame, SIFT features are extracted from the remaining background in each frame, and we follow the approach in [34] to register frames to generate the panoramic background image. We extract SIFT features for all images and also use optical flow to provide correspondences between adjacent frames. RANSAC is then used to match all frames to a base frame and compute a perspective matrix for each frame. The homography parameters obtained from the above approach are used to transform each frame and its bounding ellipses to global coordinates. The bounding ellipses are labeled, and may be adjusted on key frames by the user if necessary. After interpolation to other frames, the resulting ellipses can also be manually adjusted if poor results are caused by problems with interpolation or homography parameters. To perform coarse matting, we directly extract the difference between each aligned image and the background image inside each object's ellipse to produce that object's alpha information for the current frame.
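As a concrete illustration of this preprocessing (a sketch under assumptions, not the authors' implementation), the following Python/OpenCV fragment registers one frame against a base image with SIFT features and RANSAC, and derives a coarse alpha mask by differencing inside a bounding ellipse. Function names, the ratio-test threshold and the difference threshold are illustrative choices; the paper additionally uses optical flow between adjacent frames and stitches a full panorama following [34].

```python
import cv2
import numpy as np

def register_to_base(frame_bgr, base_bgr, fg_mask=None):
    """Estimate a homography mapping frame_bgr onto base_bgr using SIFT + RANSAC."""
    sift = cv2.SIFT_create()
    g1 = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(base_bgr, cv2.COLOR_BGR2GRAY)
    k1, d1 = sift.detectAndCompute(g1, fg_mask)   # fg_mask can exclude object ellipses
    k2, d2 = sift.detectAndCompute(g2, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(d1, d2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # ratio test
    src = np.float32([k1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H

def coarse_alpha(aligned_bgr, background_bgr, ellipse, thresh=30):
    """Coarse matting: difference between the aligned frame and the background,
    kept only inside the object's bounding ellipse ((cx, cy), (width, height), angle)."""
    mask = np.zeros(background_bgr.shape[:2], np.uint8)
    cv2.ellipse(mask, ellipse, 255, -1)
    diff = cv2.absdiff(aligned_bgr, background_bgr).max(axis=2)
    return np.where((diff > thresh) & (mask > 0), 255, 0).astype(np.uint8)

# Usage sketch: warp a frame into the background's coordinates, then matte.
# H = register_to_base(frame, panorama)
# aligned = cv2.warpPerspective(frame, H, panorama.shape[1::-1])
# alpha = coarse_alpha(aligned, panorama, ((400, 300), (120, 80), 15.0))
```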


Having determined the background and moving objects, video object trajectory rearrangement involves two steps: adjusting the shapes of the video tubes within the video volume, and resampling the tubes at new times. The basic principle used in the first step is that all objects should follow their original spatial pathways but at different times to before. Initially, the user sets a new time-line for each object to be changed, including its starting and ending time, and its speed function. These user-selected time-lines may conflict with the interaction constraints, so we optimize the time-lines for all objects to best meet the user's input requests while strictly preserving the nature and spatial locations of object interactions. We also take into account any requested change in overall video length. This optimization may be weighted to indicate that some tubes are more important than others. The result is a new video tube for the optimized life-span of each object; this may be shorter or longer than the original.

The new video tubes are now resampled, and stitched with the background to produce the overall result. Resampling is done by means of a per-object frame selection process which takes into account any user-prioritized frames that should be preferentially kept. As objects still appear in their original spatial locations, and have the same relationships with the background and other objects (but at different times), the main visual defects which may arise are due to illumination changes over time. Alpha matting with illumination adjustment solves this problem to a large degree.

We next define our notation. We suppose the input video has N frames, and the chosen number of output frames is M. The spatio-temporal track of a moving object is a tube made up of pixels with spatial coordinates (x, y) in frames at time t. See the top-left of Figure 3. The purple tube representing one object interacts with another gold tube, green dots marking the beginning and end of the interaction. In such cases, we subdivide these tubes into sub-tubes at the start and end interaction points as shown at the bottom-left of Figure 3. A point at (x, y) in frame t which belongs to sub-tube i is denoted by pi(x, y, t). Sub-tubes are given an index i; all sub-tubes for a given object have consecutive indices. For example, if there were only 2 tubes, with 3 and 4 parts respectively, their sub-tubes would have indices 0, 1, 2 and 3, 4, 5, 6. We use tis and tie to represent the start and end times of each sub-tube. The second sub-tube of the purple tube (see the bottom-left of Figure 3) is an interaction sub-tube shared with the gold tube. Within such a sub-tube, both interacting objects must retain their original relative temporal relationship, so that they interact in the same way, ensuring that the original frames still represent a valid appearance for the interaction.
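To make the bookkeeping concrete, the sketch below shows one possible Python representation of sub-tubes and of splitting a tube at interaction boundaries. The class and function names, and the reduction of a sub-tube to just its frame range, are illustrative assumptions; the paper's tubes of course also carry the per-frame ellipses and pixel data.

```python
from dataclasses import dataclass

@dataclass
class SubTube:
    index: int         # global sub-tube index i (consecutive within an object)
    obj_id: int        # which object's tube this piece belongs to
    t_start: int       # t_is, first input frame of the sub-tube
    t_end: int         # t_ie, last input frame of the sub-tube
    interacting: bool  # True if this piece is shared with another object's tube

def split_tube(obj_id, t_start, t_end, interaction_intervals, first_index):
    """Split one object's tube [t_start, t_end] into sub-tubes at the start and
    end of each interaction interval (a sketch of the subdivision in the text)."""
    cuts = sorted({t_start, t_end + 1}
                  | {a for a, _ in interaction_intervals}
                  | {b + 1 for _, b in interaction_intervals})
    pieces = []
    for i, (a, b) in enumerate(zip(cuts[:-1], cuts[1:])):
        inter = any(s <= a and b - 1 <= e for s, e in interaction_intervals)
        pieces.append(SubTube(first_index + i, obj_id, a, b - 1, inter))
    return pieces

# A tube over frames 0..99 that interacts with another object during frames
# 40..60 yields three sub-tubes: 0..39, 40..60 (interacting), and 61..99.
print(split_tube(obj_id=0, t_start=0, t_end=99,
                 interaction_intervals=[(40, 60)], first_index=0))
```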

Fig. 3. The video volume. Top-left: original trajectories, with start and end of an interaction marked by green dots. Bottom-left: object trajectories are subdivided into sub-tubes at these points. The second sub-tube in the purple trajectory interacts with the gold trajectory. Right: trajectories mapped onto x-y and x-t planes. Note that the red dot is not a real interaction event.

The top-right of Figure 3 shows all tubes mapped onto the x-y plane. Potential interaction points like the red circle are not an actual intersection in the video volume, but have the potential to become one if object tubes were adjusted independently. When optimizing the new object tubes, we do so in a way which avoids the possibility of a potential interaction becoming an actual interaction.

Our video editing process rearranges object time-lines using affine transformations of time for each sub-tube. First, however, all user-defined speed functions are applied as pre-mapping functions which adjust the trajectories of the tubes while keeping the life-spans. Thus, each sub-tube i is transformed to a new output sub-tube i′ given by

pi′(x, y, t) = pi(x, y, Ai t + Bi), (1)

where Ai determines time scaling and Bi determines time shift. We find Ai and Bi for each sub-tube by seeking a solution which is as close as possible to the user's requests for time-line modification while meeting the hard constraints.

3.1 Optimization

Determining appropriate affine transformations of time is done by taking into account the considerations described below; appropriate scaling is applied if the overall video length is to be changed. These requirements may conflict, so we seek an overall solution which is the best compromise to meeting them all. For brevity, we ignore cases in which objects are to be reversed in time; these can easily be handled by straightforward modifications.

Fig. 4. Original interaction preservation. Left, center: the segmented regions for two cars, taken from the original frames, shown on the global background. Right: composed result containing both cars simultaneously. Note that their original interaction is preserved. Bottom: corresponding matting results.

Duration: Durations of life-spans should remain unchanged for unedited objects. Edited objects should have new life-spans with lengths as close as possible to those specified by the user.

Temporal location: The life-spans of unedited objects should start and end as near as possible to their original starting and ending times, with time progressing uniformly between start and finish. Edited objects should start and end as close as possible to user specified times, with time progressing uniformly in between. For objects with a user specified speed function, the new space-time distribution of the tube is applied after mapping the original tube by the speed function.

To meet the first requirement, we define an energy term ED(i) whose effect is to enforce the appropriate life-span for each sub-tube:

ED(i) = || Hi − Ai Li / M ||. (2)

Here Hi is the desired life-span of sub-tube i, and Li = (tie − tis) is its original life-span. For edited objects, Hi is given by the desired life-span of its parent tube as determined by the user, while it is set to M Li / N for unedited objects.

To meet the second requirement, we define an energy EL(i) which penalizes displacement of sub-tube i from its desired position in time; we do so by considering the time at which each frame of the object should occur:

EL(i) = (1 / (tie − tis)) Σ_{t=tis..tie} || (Ai t + Bi) / M − t / N ||. (3)

(This equation can be simplified to avoid the need for explicit summation.)
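The paper does not spell out which simplification is meant; one possible reading (an assumption on our part, valid only when the summand keeps a single sign over the sub-tube, i.e. the remapped frames are uniformly early or uniformly late) is that the sum of absolute values collapses to the absolute value of a sum:

$$\sum_{t=t_{is}}^{t_{ie}} \left\| \frac{A_i t + B_i}{M} - \frac{t}{N} \right\| = \left\| \left(\frac{A_i}{M} - \frac{1}{N}\right)\frac{(t_{is}+t_{ie})(t_{ie}-t_{is}+1)}{2} + \frac{B_i}{M}(t_{ie}-t_{is}+1) \right\|,$$

which depends only on the sub-tube endpoints rather than on a per-frame sum.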

Fig. 5. New interaction prevention. Top: if new interactions are prevented, the cars' tubes remain separate. Bottom: otherwise, unwanted interactions may arise between tubes, resulting in the blue and white cars overlapping in an unrealistic manner.

These two terms are combined in an overall energy function to balance these requirements; a per-tube weight wi allows the user to indicate the importance of meeting the requirements for each sub-tube; tubes with higher weight should more closely meet the user's requirements:

E = Σi wi (λ ED(i) + EL(i)), (4)

where λ controls the relative importance of these requirements, and is set to 2.5 by default.

Before we can minimize this energy, we also apply several hard constraints, as described shortly. This leads to a non-linear convex optimization problem whose unknowns are Ai and Bi. We use CVX [35] to efficiently find an optimal solution.

3.2 Constraints

When minimizing the energy, several constraints must be imposed in addition to those described earlier.

Affine parameters: Affine parameters can only take on certain values if each object is to fit into the target number of output frames. Temporal scaling and shifting parameters must thus satisfy

max(N, M) / Li ≥ Ai ≥ 0,
max(N, M) ≥ Bi ≥ min(−N, −M).

Tube continuity: Consecutive sub-tubes belonging to the same tube must remain connected. Continuity between the end of sub-tube i and the start of sub-tube j requires

Ai tie + Bi + 1 = Aj (tie + 1) + Bj. (5)

Original interaction preservation: To preserve original interaction points at which different objects start to interact, relevant object sub-tubes for the interacting objects must connect in space-time. If one object's sub-tube i starts at time tk and interacts with sub-tube j of another object's trajectory, preservation of the initial interaction point under the affine transformation requires that

Ai tk + Bi = Aj tk + Bj. (6)

An example preserving the interaction between two cars is shown in the top row of Figure 4.

New interaction prevention: To prevent potential interaction points between different object tubes from becoming real interaction points, we should ensure that different objects go through them at different times. These times should be sufficiently distinct that we can rely on coarse object matting when placing objects in the final output. We thus impose the following temporal separation constraint. If sub-tube i and sub-tube j share a potential interaction point, we require

||(Ai ti + Bi) − (Aj tj + Bj)|| > ε (7)

where ti and tj are the corresponding times for each object. In our implementation, we set ε to between 5 and 10, taking into account both the sizes of the objects and the speeds at which they are moving, ensuring at least 5 frames of separation as a safety margin (as the objects may be sampled differently; see Section 3.3). Figure 5 shows the undesirable results that can happen if this constraint is not added.
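The optimization above can be prototyped with an off-the-shelf convex solver. The sketch below sets up Equations (2)–(7) for a tiny invented example in Python with CVXPY (the paper itself uses CVX in Matlab [35]); all frame counts, life-spans, interaction times and weights are made up for illustration, and the sign of the separation constraint (7) is fixed to the objects' original temporal order so the problem stays convex, which is an assumption of this sketch.

```python
import cvxpy as cp
import numpy as np

# Toy data: object A has sub-tubes 0 and 1, object B has sub-tube 2; sub-tubes
# 1 and 2 share an interaction starting at t = 40, while sub-tubes 0 and 2
# share only a *potential* interaction point (crossed at times 20 and 55).
N, M, lam, eps = 100, 100, 2.5, 5
ts = np.array([0, 40, 40])                           # sub-tube start frames t_is
te = np.array([39, 80, 80])                          # sub-tube end frames   t_ie
L = te - ts                                          # original life-spans   L_i
H = np.array([60.0, M * L[1] / N, M * L[2] / N])     # desired life-spans    H_i
w = np.array([2.0, 1.0, 1.0])                        # per-tube weights      w_i

A = cp.Variable(3)                                   # temporal scalings A_i
B = cp.Variable(3)                                   # temporal shifts   B_i

energy = 0
for i in range(3):
    E_D = cp.abs(H[i] - A[i] * L[i] / M)                          # Eq. (2)
    t = np.arange(ts[i], te[i] + 1)
    E_L = cp.sum(cp.abs((A[i] * t + B[i]) / M - t / N)) / L[i]    # Eq. (3)
    energy += w[i] * (lam * E_D + E_L)                            # Eq. (4)

cons = [A >= 0, A <= max(N, M) / L,                  # affine parameter bounds
        B <= max(N, M), B >= min(-N, -M),
        A[0] * te[0] + B[0] + 1 == A[1] * (te[0] + 1) + B[1],     # Eq. (5)
        A[1] * 40 + B[1] == A[2] * 40 + B[2],                     # Eq. (6)
        # Eq. (7), with the original order (20 before 55) fixed to keep it convex
        (A[2] * 55 + B[2]) - (A[0] * 20 + B[0]) >= eps]

cp.Problem(cp.Minimize(energy), cons).solve()
print("A =", np.round(A.value, 3), "B =", np.round(B.value, 3))
```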

3.3 Object Resampling

Having determined the affine temporal scaling for each sub-tube, we separately resample each transformed sub-tube to give the object's appearance in each output frame. We use uniform resampling with user defined weighting curves to produce the output frames. While interpolation between appropriate input samples would seem an alternative and perhaps more intuitive solution, it can introduce artifacts for various reasons, as explained in e.g. [36], and more importantly, is incompatible with our coarse matting approach.

Clearly, output samples will not necessarily fall precisely at transformed input samples. Thus, instead we select input frames for each object to generate the final output. Given a sequence of input samples, and times at which output samples are required, we use the input sample occurring at the time nearest to that of the desired output sample. Other work has also used frame based resampling [31], [37].
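A minimal sketch of this frame selection, assuming the mapping of Equation (1) (output frame t shows the input frame nearest to Ai t + Bi); the argument names are our own, and the user-defined weighting curve, which biases which frames are kept, is ignored here.

```python
import numpy as np

def select_input_frames(A, B, t_in_start, t_in_end, t_out_start, t_out_end):
    """For each output frame of one sub-tube, pick the nearest existing input
    frame to the desired time A * t + B (nearest-frame resampling sketch)."""
    out_frames = np.arange(t_out_start, t_out_end + 1)
    wanted = A * out_frames + B                         # desired input times
    chosen = np.clip(np.rint(wanted), t_in_start, t_in_end).astype(int)
    return dict(zip(out_frames.tolist(), chosen.tolist()))

# Example: a sub-tube covering input frames 30..69 slowed down to fill output
# frames 10..89 (A = 0.5, B = 25 maps output frame 10 to input frame 30).
print(select_input_frames(0.5, 25.0, 30, 69, 10, 89))
```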

3.4 User Interface

Our interface offers various controls for object time-line editing (see the supplemental video), including both overall video length (M), and for moving objects, specifying the start and end time (the tis and tie values), the life-span (Hi), and graphical entry of their speed function (resampling weighting). A default resampling weight of 1 is used; values are allowed to range from 0 to 5. The user may edit the weights using the speed curve to give desired resampling weights for each frame. The user may also specify time reversal for objects, object cloning and object deletion. The interface also allows sub-tubes and output frames to be marked as having greater importance (wi) in the output. During editing, unedited objects which interact with edited objects are also adjusted to ensure interactions are preserved. For example, if the user clones an object, any other interacting objects will also be cloned. If this is not desired, the user should carefully limit the portion of the object's life which is cloned to that part without interactions with other objects.

3.5 Performance

Our system allows objects to be extracted without laborious interaction, and provides real-time interactive control and visualization of the editing results. The most time consuming step in our system is the preprocessing used to construct the panoramic background image and interactively extract the moving objects. The user marks a bounding ellipse on key frames for each moving object; the system takes about 0.3s per frame to track each object in a complex scene. Background construction takes 0.1s per frame. Although the preprocessing needs hundreds of seconds, it is executed just once, and after that the user is free to experiment with many different rearrangements of the video. As shown in Table 1, once the user has set parameters, optimization takes under 1 second in our examples, mostly to solve the energy Equation 4 to find optimal object rearrangements. This time depends mainly on the number of constraints, which in turn is determined by the number of object interactions. Overall, we can readily achieve realtime performance for user interaction on a typical PC with an Intel 2.5GHz Core 2 Duo CPU and 2GB memory. To further improve performance, we merge video sub-tubes with start or end points less than 5 frames apart, marking the result as an intersection sub-tube. Table 1 shows timings for all examples in the paper, with numbers of objects and corresponding relation constraints.

4 RESULTS AND DISCUSSION

We have tested our video editing algorithm with many examples, producing a variety of visual effects which demonstrate its usefulness, and that would be difficult or tedious to obtain by other means. Figures 1 and 8 show examples of object-level fast and slow motion effects and the corresponding original videos. In Figure 1 the cat moves slower. A faster cat is shown in the supplemental material. In order to preserve the original spatial location of the interaction between the cat and the girl (see the fourth column), the faster moving character enters the scene at a later time. In Figure 8, we have shortened the entire video and interactively rearranged the cars to have approximately the same speed, and to be approximately equally spaced. In Figure 7 we have moved the time-spans of the penguins to make them appear in the scene simultaneously. Figure 9 shows object cloning and reversal examples in which we make three copies of a girl on a slide; we also make the girl go up the slide rather than down. Unlike [30], which leads to ghosting, our method automatically arranges tubes to prevent object overlaps. (Corresponding videos are available in the supplement.)

Fig. 6. Video editing. Top: input video. Middle: summarization into half the original time. Bottom: editing to fast forward and then reverse the blue car. Frames are shown at equal time separations in each case. Note that life-spans of unedited objects are preserved to the extent possible.

Fig. 7. Video object rearrangement. Top right: four penguins appear pairwise in the original video. Bottom right: an output frame after interactively rearranging the penguins to appear together. Left: corresponding tubes.

TABLE 1
Performance of the system.

Video Clip      Fig. 1   Fig. 6   Fig. 7   Fig. 8   Fig. 9
Frames          150      270      240      160      230
Width           960      720      960      1280     640
Height          540      560      540      720      480
Objects         2        5        4        10       3
Sub-tubes       5        7        8        16       10
Preprocessing   225s     570s     300s     760s     695s
Optimization    0.48s    0.71s    0.66s    0.85s    0.7s

While such applications are the main target of our approach, other less obvious effects can also be achieved. For example, we can selectively shorten a video to a user-specified length (see Figure 6) by setting object lifetimes to either their original lengths, or the desired total length, whichever is smaller. Our approach differs from previous approaches to video summarization, which either produce a summary no shorter than the longest lifetime of a single object, or, for shorter results, unnaturally cut the video tubes for objects and move parts of them in time (e.g. Pritch's method in [33]). Instead, we can speed objects up to reduce the overall time. The bottom row of Figure 6 further shows object fast forward, reverse motion and object duplication, as well as local speed editing.

As Figure 5 shows, if object inter-relationships are ignored when moving objects in time, unwanted overlaps may arise between objects originally crossing the same region of space at different times. By preventing new object interactions, we avoid such collisions between object trajectories in the output video. Unlike Rav-Acha et al.'s method [28], [29], our algorithm preserves real interactions, allowing effective editing of objects while avoiding visual artifacts at interactions. In Figure 6, more objects are shown per frame as the overall video time is reduced. Objects follow their original paths, but spatial relationships are also well preserved. We also note that each sub-tube (not tube) is adjusted in terms of time scale and offset to meet the user's desired object time-lines. It would be extremely difficult and tedious for the user to manually adjust time-lines so as to preserve existing interactions and prevent new interactions, especially for multiple objects.

We use ellipses for masking for simplicity of implementation and ease of manipulation. The user can easily draw an ellipse in key frames to initiate object tracking. Furthermore, when tracking is inaccurate in non-key frames, the user can quickly manually correct the ellipse: a little additional user interaction can overcome minor failures in tracking. This avoids the cost and difficulty of implementation of highly sophisticated tracking methods, which still are not guaranteed to always work. Exactly how coarse matting is done is unimportant; the key idea is that accurate matting is not needed when the background is robustly reconstructed and objects retain their original locations.

In practice, it is not always necessary to extract all moving objects, provided that the ones of interest (e.g. football players) consistently occupy a different spatial area of the video to the others (e.g. spectators), so that the two groups do not interact. In this case a moving background can be used. It too must be resampled if a different length video is required, using a similar approach to that for object resampling.

5 LIMITATIONS

Although we have obtained encouraging results for many applications of our video object editing framework, our approach can provide poor quality results in certain cases. Our method is appropriate for video for which a panoramic background can be readily constructed and video objects can be tracked (as individuals, or suitable groups). In such cases, temporal adjustment and rearrangement at the object level makes it possible to produce special visual effects. Clearly, our system can break down if there is a failure to track and extract foreground objects or the background. Less obviously, if the user places unrealistic or conflicting requirements on the rearranged objects, this may result in an unsolvable optimization problem; this may also happen if a scene is very complex and there is insufficient freedom to meet all of a user's seemingly plausible requests. Finally, if large changes are made to the lifetimes of object sub-tubes, motion of objects in the output video may appear unnatural due to use of a frame-selection process. Widely different changes to lifetimes of adjacent sub-tubes for a single object may also result in unnatural accelerations or decelerations. We now discuss these issues further.

Complex backgrounds and camera motions: Our method may work poorly in the presence of background change (e.g. in lighting, even if background objects remain static), and errors in background reconstruction. Each frame in the output video includes moving objects and the panoramic background, and their composition is performed according to the alpha values obtained by coarse matting. We note that the coarse matting includes part of the background as well as the moving object, so if the background changes noticeably, visible artifacts may arise due to composition of the elliptical region with the background. Furthermore, good video composition results rely on successful background reconstruction. The panoramic background image is generated under an assumption of a particular model of camera motion, which may not be accurate; even if it is, a single static image may not exist for complex camera motions, e.g. due to parallax effects. Our method thus shares common limitations with several other papers with respect to handling complex backgrounds and camera motions [27], [30]. Robust camera stabilization and background reconstruction for more general camera motions are still challenging topics in computer vision [38], [39]. Our method can potentially benefit from advances in those areas.

Fig. 8. Video object rearrangement. Top: two frames of the original video. Bottom: two output frames after interactively rearranging the cars to be approximately equally spaced.

Fig. 9. Video object reversal and cloning. First row: girl on slide cloned. Second row: girl going up the slide.

Time-line conflicts: Preserving original interactions and preventing new ones are imposed as constraints during video editing. If the user manipulates objects inconsistently, this may lead to visual artifacts; it may even lead to an unsolvable optimization problem if there are many complex interactions and insufficient freedom to permit the desired editing operations. To avoid artifacts, and gain extra freedom, the user may resolve conflicts by moving or trimming parts of objects' time-lines to produce the desired result, or even delete whole objects. For example, in the time reversal example shown, we trimmed the last part of the girl's tube to ensure the problem was solvable.

Implausible speeds: Our algorithm focuses on preserving interrelations between paths in the time dimension. If many objects in the video have the potential to intersect (i.e. to cross a shared spatial location at different times), and certain objects are weighted for preservation, other objects can suffer unnatural accelerations or temporal jittering. This could perhaps be mitigated by using a more sophisticated resampling method (as in [31]) or content-aware interpolation. The former would reduce, but not completely eliminate, motion jitter. Simple 2D interpolation can often produce visual artifacts even with accurate motion estimation, as objects have 3D shapes; such artifacts may be no more acceptable to the viewer than minor motion jitter. An alternative solution to alleviate such artifacts would be to introduce motion blur (e.g. using simple box filtering of adjacent frames after realigning objects' centroids [40]) for fast moving objects. Such cases could also be handled better if spatial adjustments were allowed as well as temporal ones during video tube optimization, but doing so is incompatible with our framework based on coarse segmentation and matting. Indeed, allowing spatial editing would give a much more flexible system overall. Nevertheless, it may be possible to be a little more flexible without offering full generality of spatial adjustment. If we were to restrict spatial changes to locations with similar, constant coloured, backgrounds for example, we might be able to still use coarse matting, perhaps using some combination of video inpainting and graph-cut matching to find an optimal new location.

Our method also may produce unnatural results if the output video is excessively stretched (or compressed) in time, due to the use of frame selection: frames would be repeated, and motion would tend to jump. Again, an interpolation scheme of some kind could overcome this issue. A further problem which may arise is sudden changes in speed between adjacent sub-tubes, resulting in implausibly large accelerations or decelerations. This could be solved by introducing a higher order smoothing term into our optimization framework, or even by constructing a new speed-aware optimization scheme with larger freedom. Currently, sub-tube motion rearrangement assumes an affine transformation with a time shift and scaling, giving little freedom to edit the speed in the presence of higher order constraints. Speed-oriented modeling could be used to precisely edit the motion at frame level, but would require more complicated user interaction. In the current system, we have preferred simplicity over intricate control, but accept that for some applications, detailed control would be desirable.

6 CONCLUSIONS

We have presented a novel realtime method for modifying object motions in video. The key idea in our algorithm is to keep object interactions at the same spatial locations with respect to the background while modifying the interaction times. This allows us to avoid the need for precise matting, reducing the need for much tedious user interaction. We optimize object trajectories to meet user requests concerning temporal locations and speeds of objects, while at the same time including constraints to preserve interrelations between individual object trajectories.

ACKNOWLEDGMENTS

We would like to thank all the anonymous reviewers for their valuable comments. This work was supported by the National Basic Research Project of China (Project Number 2011CB302205), the Natural Science Foundation of China (Project Numbers 61120106007 and 60970100), and a UK EPSRC Travel Grant.

REFERENCES

[1] G. R. Bradski, "Computer vision face tracking for use in a perceptual user interface," Intel Technology Journal, vol. 2, pp. 12–21, 1998.

[2] D. B. Goldman, C. Gonterman, B. Curless, D. Salesin, and S. M. Seitz, "Video object annotation, navigation, and composition," in Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology, Oct. 2008, pp. 3–12.

[3] X. L. K. Wei and J. X. Chai, "Interactive tracking of 2D generic objects with spacetime optimization," in Proceedings of the 10th European Conference on Computer Vision: Part I, 2008, pp. 657–670.

[4] A. Schodl and I. A. Essa, "Controlled animation of video sprites," in ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2002, pp. 121–127.

[5] S. Yeung, C. Tang, M. Brown, and S. Kang, "Matting and compositing of transparent and refractive objects," ACM Transactions on Graphics, vol. 30, no. 1, p. 2, 2011.

[6] K. He, C. Rhemann, C. Rother, X. Tang, and J. Sun, "A global sampling method for alpha matting," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2049–2056.

[7] Y. Zhang and R. Tong, "Environment-sensitive cloning in images," The Visual Computer, pp. 1–10, 2011.

[8] Z. Tang, Z. Miao, Y. Wan, and D. Zhang, "Video matting via opacity propagation," The Visual Computer, pp. 1–15, 2011.

[9] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," International Journal of Computer Vision, vol. 1, pp. 321–331, 1988.

[10] Y.-Y. Chuang, A. Agarwala, B. Curless, D. H. Salesin, and R. Szeliski, "Video matting of complex scenes," ACM Transactions on Graphics, vol. 21, pp. 243–248, Jul. 2002.

[11] Y. Li, J. Sun, and H.-Y. Shum, "Video object cut and paste," ACM Transactions on Graphics, vol. 24, pp. 595–600, Jul. 2005.

[12] J. Wang, P. Bhat, R. A. Colburn, M. Agrawala, and M. F. Cohen, "Interactive video cutout," ACM Transactions on Graphics, vol. 24, pp. 585–594, Jul. 2005.

[13] X. Bai, J. Wang, D. Simons, and G. Sapiro, "Video snapcut: robust video object cutout using localized classifiers," ACM Transactions on Graphics, vol. 28, pp. 70:1–70:11, Jul. 2009.

[14] T. Kwon, K. H. Lee, J. Lee, and S. Takahashi, "Group motion editing," ACM Transactions on Graphics, vol. 27, no. 3, pp. 80:1–80:8, Aug. 2008.

[15] Y. Li, T. Zhang, and D. Tretter, "An overview of video abstraction techniques," HP Laboratory, Tech. Rep. HP-2001-191, 2001.

[16] B. T. Truong and S. Venkatesh, "Video abstraction: A systematic review and classification," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 3, pp. 1–37, 2007.

[17] C. W. Ngo, Y. F. Ma, and H. J. Zhang, "Automatic video summarization by graph modeling," in Proceedings of the Ninth IEEE International Conference on Computer Vision, 2003, pp. 104–109.

[18] H. W. Kang, X. Q. Chen, Y. Matsushita, and X. Tang, "Space-time video montage," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. II: 1331–1338.

[19] C. Barnes, D. B. Goldman, E. Shechtman, and A. Finkelstein, "Video tapestries with continuous temporal zoom," ACM Transactions on Graphics, vol. 29, pp. 89:1–89:9, Jul. 2010.

[20] Z. Li, P. Ishwar, and J. Konrad, "Video condensation by ribbon carving," IEEE Transactions on Image Processing, vol. 18, pp. 2572–2583, 2009.

[21] K. Slot, R. Truelsen, and J. Sporring, "Content-aware video editing in the temporal domain," in Proceedings of the 16th Scandinavian Conference on Image Analysis, 2009, pp. 490–499.

[22] B. Chen and P. Sen, "Video carving," in Eurographics 2008, Short Papers, 2008.

[23] S. Pongnumkul, J. Wang, G. Ramos, and M. F. Cohen, "Content-aware dynamic timeline for video browsing," in Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, 2010, pp. 139–142.

[24] T. Karrer, M. Weiss, E. Lee, and J. Borchers, "Dragon: A direct manipulation interface for frame-accurate in-scene video navigation," in Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, Apr. 2008, pp. 247–250.

[25] C. Liu, A. Torralba, W. T. Freeman, F. Durand, and E. H. Adelson, "Motion magnification," ACM Transactions on Graphics, vol. 24, no. 3, pp. 519–526, Jul. 2005.

[26] J. Chen, S. Paris, J. Wang, W. Matusik, M. Cohen, and F. Durand, "The video mesh: A data structure for image-based video editing," in Proceedings of the IEEE International Conference on Computational Photography, 2011, pp. 1–8.

[27] V. Scholz, S. El-Abed, H.-P. Seidel, and M. A. Magnor, "Editing object behaviour in video sequences," Computer Graphics Forum, vol. 28, no. 6, pp. 1632–1643, 2009.

[28] A. Rav-Acha, Y. Pritch, D. Lischinski, and S. Peleg, "Evolving time fronts: Spatio-temporal video warping," Hebrew University, Tech. Rep. HUJI-CSE-LTR-2005-10, Apr. 2005.

[29] ——, "Dynamosaicing: Mosaicing of dynamic scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1789–1801, Oct. 2007.

[30] C. D. Correa and K.-L. Ma, "Dynamic video narratives," ACM Transactions on Graphics, vol. 29, pp. 88:1–88:9, Jul. 2010.

[31] E. P. Bennett and L. McMillan, "Computational time-lapse video," ACM Transactions on Graphics, vol. 26, pp. 102–108, Jul. 2007.

[32] D. B. Goldman, B. Curless, D. Salesin, and S. M. Seitz, "Schematic storyboarding for video visualization and editing," ACM Transactions on Graphics, vol. 25, pp. 862–871, Jul. 2006.

[33] Y. Pritch, A. Rav-Acha, and S. Peleg, "Nonchronological video synopsis and indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 1971–1984, Nov. 2008.

[34] M. Brown and D. G. Lowe, "Recognising panoramas," in Proceedings of the Ninth IEEE International Conference on Computer Vision, 2003, pp. 1218–1227.

[35] M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming, version 1.21," http://cvxr.com/cvx, Dec. 2010.

[36] Y. Weng, W. Xu, S. Hu, J. Zhang, and B. Guo, "Keyframe based video object deformation," in International Conference on Cyberworlds, 2008, pp. 142–149.

[37] K. Peker, A. Divakaran, and H. Sun, "Constant pace skimming and temporal sub-sampling of video using motion activity," in IEEE International Conference on Image Processing, 2001, pp. 414–417.

[38] F. Liu, M. Gleicher, J. Wang, H. Jin, and A. Agarwala, "Subspace video stabilization," ACM Transactions on Graphics, vol. 30, no. 1, p. 4, 2011.

[39] Z. Farbman and D. Lischinski, "Tonal stabilization of video," ACM Transactions on Graphics, vol. 30, no. 4, p. 89, 2011.

[40] A. Finkelstein, C. E. Jacobs, and D. H. Salesin, "Multiresolution video," in Computer Graphics (Proceedings of SIGGRAPH 96), 1996, pp. 281–290.

Shao-Ping Lu is a Ph.D. candidate at the Department of Computer Science and Technology, Tsinghua University. His research interests include image and video processing. He is a student member of ACM and CCF.

Song-Hai Zhang obtained his Ph.D. in 2007 from Tsinghua University. He is currently an associate professor of computer science at Tsinghua University, China. His research interests include image and video processing and geometric computing.

Jin Wei is currently a Research Assistant at the Department of Computer Science and Technology, Tsinghua University. He received his MS degree in Computer Science from Tsinghua University and BS degree from Huazhong University of Science and Technology. His research interests include digital geometry modeling and processing, video processing and computational camera.

Shi-Min Hu received the PhD degree from Zhejiang University in 1996. He is currently a professor in the Department of Computer Science and Technology at Tsinghua University, Beijing. His research interests include digital geometry processing, video processing, rendering, computer animation, and computer-aided geometric design. He is associate Editor-in-Chief of The Visual Computer (Springer), and on the editorial boards of Computer-Aided Design and Computers & Graphics (Elsevier). He is a member of the IEEE and ACM.

Ralph R Martin obtained his PhD in 1983 from Cambridge University. Since then he has been at Cardiff University, as Professor since 2000, where he leads the Visual Computing research group. His publications include over 200 papers and 10 books covering such topics as solid modeling, surface modeling, reverse engineering, intelligent sketch input, mesh processing, video processing, computer graphics, vision based geometric inspection and geometric reasoning. He is a Fellow of the Learned Society of Wales, the Institute of Mathematics and its Applications, and the British Computer Society. He is on the editorial boards of "Computer Aided Design", "Computer Aided Geometric Design", "Geometric Models", the "International Journal of Shape Modeling", "CAD and Applications", and the "International Journal of CAD/CAM".

