Addressing Visual Consistency in Video Retargeting: A Refined Homogeneous Approach

890 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 6, JUNE 2012

Addressing Visual Consistency in VideoRetargeting: A Refined Homogeneous Approach

Zheng Yuan, Taoran Lu, Yu Huang, Dapeng Wu, Senior Member, IEEE, and Heather Yu

Abstract—For the video retargeting problem which adjustsvideo content into a smaller display device, it is not clear howto balance the three conflicting design objectives: 1) visual inter-estingness preservation; 2) temporal retargeting consistency; and3) nondeformation. To understand their perceptual importance,we first identify that the latter two play a dominating rolein making the retargeting results appealing. Then a statisticalstudy on human response to the targeting scale is carried out,suggesting that the global preservation of contents pursued bymost existing approaches is not necessary. Based on the newlyprioritized objectives and the statistical findings, we design avideo retargeting system which, as a refined homogeneous ap-proach, addresses the temporal consistency issue holistically andis still capable of preserving high degree of visual interestingness.In particular, we propose a volume retargeting cost metric tojointly consider the retargeting objectives and formulate videoretargeting as an optimization problem in graph representation.A dynamic programming solution is then given. In addition, weintroduce a nonlinear fusion based attention model to measurethe visual interestingness distribution. The experiment resultsfrom both image rendering and subjective tests indicate thatour proposed attention modeling and video retargeting systemoutperform their conventional methods, respectively.

Index Terms—Attention modeling, confidence interval analysis,dynamic programming, subjective test, video retargeting.

I. Introduction

THANKS to the enriched communication networkresources and efficient compression techniques, a diver-

sity of mobile devices and streaming terminals gain signifi-cant shares in multimedia access over years. While they aredesigned in unique resolutions and aspect ratios, most mediasources, when created, generally follow the standard formats(e.g., resolution 1920×1080, 720×576). As the proxy to trans-fer video media across platforms, video retargeting techniquesautomatically adapt source video contents to fit for the display

Manuscript received March 30, 2011; revised August 12, 2011 and October8, 2011; accepted November 3, 2011. Date of publication December 22, 2011;date of current version May 31, 2012. This paper was recommended byAssociate Editor C. N. Taylor.

Z. Yuan and D. Wu are with the Department of Electrical and ComputerEngineering, University of Florida, Gainesville, FL 32611 USA (e-mail:[email protected]; [email protected]).

T. Lu is with the Image Technology Group, Dolby Laboratories, Inc.,Burbank, CA 91505 USA (e-mail: [email protected]).

Y. Huang is with the Digital Media Solution Laboratory, Sam-sung Electronics America, Ridgefield Park, NJ 07660 USA (e-mail:[email protected]).

H. Yu is with the Huawei Media Networking Laboratory, Bridgewater, NJ08807 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are availableonline at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2011.2181230

size of target devices (generally from large to small). Whilethey attempt to preserve most visual interestingness (VI) foreach individual frame as required in the image retargeting task,they also demand the temporal consistencies among adjacentframes to generate visually agreeable retargeting results.

As video retargeting seeks to preserve the VI efficiently, it issensible to understand how the VI distributes over pixels spa-tially and temporally. Attention modeling, as a self-containedresearch topic, provides such distribution information by mim-icking the visual stimulus from each pixel of video frames.In the literature, many methods are proposed, including [1]–[6] where multimodalities are considered. Several aliases ofvisual interestingness include visual importance [7], visualenergy [8], and saliency [1], [2].

A. Previous Methods

Current video retargeting techniques are conceptually clas-sified into two major categories: the homogeneous versus het-erogeneous approaches. The homogeneous methodology for-mulates video retargeting as searching for a sequence of rigidretargeting windows on each original frame followed by ho-mogeneously resizing the contents within the window into thetarget display. Although the homogeneous methodology maysacrifice contents outside the retargeting window, this schemeallows a systematic regulation of retargeted pixels. Spatially,the selected pixels are treated equally to avoid the geometricdistortion, which is especially meaningful when the contentcontains the well-defined or viewer familiar objects suchas human face or architectures. Temporally, the retargetingconsistency can be easily achieved by merely imposing con-straints on the window parameters across neighboring frames.Given that viewers exercise nonuniform attention response tostimulus from different pixels shown by visual psychophys-ical studies, Liu et al.’s auto-pan-scan [9] and Hua et al.’ssearch window method [10] look for a window to movedynamically to secure the most visually interesting regionsfor each individual frame. Regarding the arisen consistencyissue, they both utilize curve fitting to smooth the windowparameters. Although this procedure alleviates the shaky ar-tifacts to some extent when visually interesting areas locateclosely over frames, it cannot guarantee consistent retargetedviews in the presence of rapid and irregular content changes.Instead, Deselaers et al. [11] considered the visual interesting-ness preservation and retargeting consistency together as theretargeting scores and traced back the best window sequencethat maximizes the accumulation scores. This approach, for

1051-8215/$26.00 c© 2011 IEEE

YUAN et al.: ADDRESSING VISUAL CONSISTENCY IN VIDEO RETARGETING 891

the first time, models the retargeting consistency issue asa built-in consideration, and ensures the consistency of theretargeted views. Nevertheless, this setup of maximization ofaccumulation score may not quite suit the VI preservation:without knowing a prerecognized destination that specifiesthe retargeting window to track the updated visual interestingareas, the calculated window sequence is anchored near itsinitial position due to the encouragement of an overinertiatransition among adjacent frames, which we name “consis-tency clamping” artifact. This artifact is visible, especially inlong video whilst its complex background change requires theretargeting window to update to the current position. Also, ifthere is “saliency inaccuracy” in the initial frame, the identifiedparameters of the first retargeting window (RW) are impaired.The impairment will propagate into all the following framesdue to clamping. For long video retargeting, the propagationis even longer.

On the other hand, the heterogenous methodology does notperform “hard” selection of a chunk of pixels continuouslyaligned in a rectangle window, but rather takes each pixel indi-vidually and imposes “soft” manipulation of potentially everypixel. This flexibility makes the heterogenous methodology thefocus of media retargeting in academia for many years, whichdevelops primarily in two tracks. The first track focuses onseam carving: [8] shrinks original image by eliminating pixelsaligning in seams (continuous curves) in priority of the onewith less visual energy. The same idea extends to video retar-geting [12] by cutting 2-D seam manifolds therein from a 3-Dvideo frame volume. Recently, Matthias et al. [13] imposedthe discontinuity constraint on seam structures to alleviate thecutting of featured objects. Reference [14] adopted a multiop-erator scheme to choose seam carving when appropriate. Theother track is warping based approaches: they do not explicitlyremove pixels away but rather morphologically squeeze themto various extents proportional to their visual importance.Wolf et al. [7] initialized this idea by formulating the mappingfrom the original pixel to the retargeted correspondence as asparse linear system of equations. Wang et al. [15] then intro-duced the “scale-and-stretch” method to warp image with thematched local scaling factors. The authors [16] then expandedthe same philosophy to video retargeting by incorporatingmotion-aware constraints. In practice, Krahenbuhl et al. [17]implemented a unique system based on warping for streamingapplications. In essence, the flexible pixel rearrangement of theheterogenous methodology avoids explicit content sacrifice,suggesting somewhat latent preference to a globally retargetedview. They produced excellent results in natural images andin scenarios when aspect ratio change is significant. However,individual manipulations of pixels also demand for such a largenumber of pixel parameters to be jointly optimized that thereare always some pixels not coordinated well spatially or tem-porally. Therefore, it is common to observe the resultant defor-mation and/or inconsistency, which can be quite noticeable oreven disastrous when the content involves well-defined objects.

B. Our Approach

Motivated by the difference between homogeneous andheterogenous methodologies, our approach takes into account

the major design considerations: 1) content preservation; 2)the temporal retargeting consistency; and 3) the prevention ofdeformation. We ask a big question: from the perspective ofobservers, what are the real priorities among these considera-tions to perform a compelling video retargeting?

For content preservation versus nondeformation, since theyare both the in-frame considerations, it is insightful to ex-plore the conclusions from image retargeting evaluations.Recently, Rubinstein et al. [18] created a benchmark imagedatabase and performed comprehensive statistical analysis onthe human response to several state-of-the-art techniques fromboth homogeneous and heterogenous methodologies. Theyfound that “viewers consistently demonstrate high sensitivityto deformation,” and “in many cases users prefer sacrificingcontent over inserting deformation to the media” [18]. Thenfor the temporal retargeting consistency that stands out invideo retargeting, it plays an even more important role: evena mild inconsistency among adjacent frames would lead tothe annoying flickering or jittering, which may fatigue hu-man eyes quickly; whereas, content sacrifice may be lesstangible without the presence of original video since viewersare capable of recovering the content via imagination. Ourapproach, as a refined homogeneous approach, is differentfrom the traditional homogeneous methods in the followingways.

1) Our approach enables a user-specified retargeting scale,making the tradeoffs between the visual content preser-vation and retargeting consistency appropriate for eachindividual viewer. Comparing heterogenous approachesthat assertively impose the pro-global-view retargetingor the traditional homogeneous methods that imposethe pro-local-view retargeting, our refined approach iscloser to the real viewer aesthetic fact. This user-specificpractice is inspired and justified by the study of humanresponse to the retargeting scale.

2) Our video retargeting system is capable of processinga long-duration video with generic contents, not limitedto the realm of short video clips as in many existingworks. Unlike short clips that last only one scene, longvideos contain many scene changes and the length ofa scene can be very long or very short. The proposedsystem studies the temporal retargeting consistency com-prehensively, which to our best knowledge is the firsttime to discuss this issue on such an elaborate level.Considering the fact that frames at different temporallocations require different consistency, our retargetingsystem adapts the tradeoff structure to different frametypes, aiming at striving for the most freedom forsaliency preservation. On the whole, this frameworkbridges the miscellaneous retargeting considerations inpractice with a structuralized analysis of retargetingobjectives and endows the generic video retargetingproblem with elegant mathematical formulations.

3) As the substantial algorithm of the video retargetingsystem, we formulate the retargeting as an optimizationproblem, where the variables to solve are the sequentialpositions of the retargeting windows over a subshot.Regarding the objective function, we propose the volume


Fig. 1. Retargeted views: global versus local. Left: original image Sunflower.Right top: retargeted image by pro-global view. Right bottom: retargetedimage by pro-local view.

retargeting cost metric to systematically consider theretargeting consistency and VI preservation together. Wefurther represent the optimization into a graph contextand then prove the equivalency of the optimization assearching for the path with minimal total cost on thegraph. The solution is obtained in a dynamic program-ming fashion. It is encouraging that the solution mayextend to the other measures of visual interestingnesspreservation and consistency refined in the future.

4) For the attention modeling, we propose an innovativecomputational model with nonlinear fusion of spatialand motion channels. The proposed model with inde-pendent channels captures the distinct mechanism ofhuman perception to luminance, chrominance and mo-tion stimulus, avoiding the shape twist of salient entitiesdue to joint processing of intertwined spatial-temporaldata in many attention models. Also, the nonlinearfusion scheme takes advantage of the properties of thecomputed visual interestingness distribution from thetwo channels and strives for the detection of meaningfulentities, which may subsequently make the suggestedinteresting objects more likely be preserved in the retar-geting window.

This paper is organized as follows. Section II describesthe statistical study of human response to retargeting scales.Section III presents the proposed video retargeting systemarchitecture. Section IV describes a visual information lossmetric to measure interestingness preservation and Section Vproposes a nonlinear fusion based attention model. Section VIproposes a volume retargeting cost metric and the corre-sponding graph representation with the dynamic programmingsolution. Section VII presents a method to choose a unifiedscale for a shot. Section VIII shows our experimental results,and Section IX concludes this paper.

II. Global or Local?−Statistical Study on

Human Response to Retargeting Scale

This section studies how viewers evaluate retargeting scales.Ideologically, most heterogeneous approaches follow the hy-pothesis that a global scale is preferred in the retargeting taskwhile most current homogeneous approaches tend to retargetcontent in a local scale (they may also be a pro-global scaleif the aspect ratio change is not drastic). Considering both themerits and weaknesses with the two methodologies, we inquirethe validity of the hypotheses by examining whether there

really exists a consensus of perception bias toward a particularscale, either global or local. This inquiry is inspired by thesubjectivity of human aesthetics, the randomness of the imagecontent, the retargeting purpose and many other nonobjectivefactors, which probably suggest there is no consensus ofpreference to make one scale dominate the other and thusboth hypotheses are opposed. Note that although preservingthe original image on a global scale is intuitively justifiedamong many viewers and then is enforced by most existingworks, a significant number of people alternatively considerit would not be strictly necessary if at the cost of geometricdistortions or inconsistency over adjacent frames. They prefera nice local retargeting that enhances the emphasis on theobject of interest. For example, in Fig. 1, the top right image isobtained by purely scaling the original image globally withoutany content removal and the bottom right image is obtainedby first cropping a local region and then scaling to fit thetarget display, but with a smaller scaling factor. As expected,many viewers we surveyed claimed the top is better for itscomplete content preservation; however, other viewers arguedthat it is reasonable to cut off the boundary regions as thegreen twigs there are not visually important and not even intactin the original frame, but the cropped retargeting renders thesunflower with finer resolution as a considerable perceptionadvantage of the bottom image. To clear the disagreement,we conduct a statistical study1 to test which hypothesis ofretargeting scale is true, aiming at the potential computationalmeasure for the supported hypothesis. Otherwise, neither canoverride the other if suggested by the study, we may leavethe freedom of choosing a retargeting scale to the individualviewer if neither hypothesis can be statistically supported todominate the other.

We devise the statistical experiment as follows: given 15different images that cover most popular topics in photogra-phy, we retarget each image with two distinct scales, whichrepresent pro-global-view2 and pro-local-view strategies, re-spectively. We also incorporate the target aspect ratio as avariable to our statistical study: each of the 15 images is inone of three common display aspect ratios 3 : 2, 4 : 3, and16 : 9; for each image, we retarget it into all three aspectratios with two retargeting scales. Then we collect the responsescores from 60 viewers to evaluate the retargeted image. Theresponse scores are rated according to the individual aestheticstandard, ranging from 1 to 10 with the increment as 1. Ourobjective is to determine if there is a meaningful gap of theresponse scores between the two retargeting strategies.

Statistically speaking, given the grouped score samplesXik = [xik1, . . . , xikj, . . . , xikn], i = 1 : 60, k = 1 : 2, n =15 × 3 = 45, where i is the viewer index, j is the imageindex, k is the strategy/group index, and n is the number

1Available at http://www.mcn.ece.ufl.edu/public/ZhengYuan/statistical−study−retargeting−scale.html

2Here, we use two methods to implement the global-scale retargeting, theglobal cropping-and-scaling and the heterogeneous method in [15]. The formerintroduces no shape deformation while its modest cropping may remove a littleborder areas and the latter keeps all contents with least shape deformation.We assign the higher score from the two methods as the score for globalretargeting to measure its best perception performance whereas a singlemethod is hardly to be deformation-free and keeping all contents as well.


Fig. 2. Images for the statistic study of human response to retargeting scale. Top left to bottom right: ArtRoom, MusicSound, DKNYGirl, Redtree, Trees,Football, Sunflower, Architecture, Butterfly, CanalHouse, Perissa, Child, Fish, GreekWine, Fatem. Each original image is in one of the three common aspectratios (4:3, 3:2, and 16:9) and is retargeted to all three aspect ratios. They are represented as thumbnails due to the conflict between various sizes and thespace limitation. Our survey website provides the retargeting results in their authentic sizes. Courtesy of [18] for the image retargeting benchmark database.

of retargeting processes, we want to infer if the subjectivescores suggest retargeting equivalence between two groups.Retargeting equivalence is statistically defined as the norm ofthe group mean difference �μ = X1 − X2 is bounded by thesubspace H = {�μ : −1 < �μj < 1, ∀j = 1 : n}. (Notethat two ratings with difference less than one are consideredthe equivalent as the discretization of the score rating is 1)with a high probability 1 − α. Its two complement spacesare H∗ = {�μ : 1 < �μj < ∞, ∀j = 1 : n} andH◦ = {�μ : −∞ < �μj < −1, ∀j = 1 : n}. H∗ represent thespace where viewers generally consider local scale is betterand H◦ refers to the space where global scale is better

P(�μ ∈ H) > 1 − α. (1)

We utilize confidence interval estimation to solve the prob-lem. Assume the collected response scores Xik of each groupfollow a multivariate Gaussian distribution with mean μk andvariance-covariance matrix �

Xik ∼ N(μk, �) ∀i = 1 : n k = 1 : 2. (2)

Hence, the difference of two group means �μ also followsGaussian distribution with mean μ1 − μ2 and covariance �

�μ ∼ N(μ1 − μ2, �). (3)

Given the confidence level 1 − α, we estimate the corre-sponding interval H where �μ probably lies in according tothe distribution in (3). If H is a subspace of H, we consideredthe two strategies are retargeting equivalent. If H is a subspaceof H∗, it suggests that viewers prefer local retargeting. If His a subspace of H◦, it is highly likely that viewers preferglobal retargeting. Otherwise, no significant preference biasexists. Table I describes the 95% confidence interval of thegroup mean difference for each dimension/image, using T 2

confidence interval analysis.As indicated in Table I, nine retargeting processes marked

with ◦ result in the confidence intervals in H◦, so they suggestthe preference to global retargeting. Meanwhile, another sixretargeting processes marked with ∗ have the confidence inter-vals in H∗, suggesting local retargeting is preferred. For theremaining 30 retargeting processes, their intervals are eitherretargeting scale equivalence or no significant preference bias.Therefore, if we take the retargeting generically, there is noconsensus on the preference to some particular scale. Thisconclusion suggests that in the retargeting task, one does nothave to preemptively preserve a global view or local view.

TABLE I

Confidence Intervals by T 2Estimation with Level 1 − α = 95%

Images3:2 4:3 16:9

lbound ubound lbound ubound lbound uboundArtRoom −1.00 −0.02 −0.78 0.23 −2.81 −1.64°DKNYGirl −0.40 0.58 −0.26 0.57 −0.15 1.08Child −0.79 0.86 0.78 1.64∗ −0.42 0.55Butterfly −0.09 0.59 1.12 1.85∗ 1.11 1.60∗

Greenwine −1.95 −1.10° −2.51 −1.70° −1.05 −0.41CanalHouse 1.04 2.71∗ −0.80 0.69 −0.37 0.49Sunflower −0.28 0.85 −0.28 0.69 −0.89 0.11Fatem −2.59 −1.66° −1.47 0.61 −1.19 0.59Fish 0.10 0.90 1.55 2.36∗ −0.52 0.84Perissa −3.42 −2.03° −2.03 −0.98° −2.22 −0.94°MusicSound 0.17 1.16 −0.07 0.79 0.84 2.05∗

Football −3.36 −2.67° −1.94 −1.16° −1.38 −0.47Trees −1.08 0.79 −0.99 1.19 0.03 0.94Architecture −1.05 0.52 −0.89 0.01 −0.87 0.34Redtree 0.28 1.32 −0.30 0.55 −0.11 0.72

Based on this inference, our proposed system endows thefreedom of choosing the global view or local view or someother scales in between to individual users according to theirown aesthetic preferences and needs. This strategy maximizesthe retargeting performance by allowing the greatest flexibility.The corresponding interface of a scale optimizer is designedin a try-and-error fashion to facilitate viewers to determine amore customized viewing scale.

III. System Design

This section discusses the design methodology elaborately.We propose a practical retargeting system that adopts the ho-mogeneous methodology, with retargeting process equivalentto searching for the scale and position of a proper retargetingwindow for each frame. In our refined homogeneous approach,we let viewers determine their aesthetically pleasing scale,hence the corresponding removal of the periphery region basedon the customized scale should be considered reasonable forthe individual viewer. Particularly, the homogeneous method-ology allows a RW to contain any region, even to the fullinclusion.

A. System Architecture

Fig. 3 describes the system architecture and explains thework flow of how to find the scale and location for the RW


Fig. 3. Video retargeting system architecture. Rectangle block: systemmodules. Elliptical block: video or image frame.

of a frame. The system consists of six major components:1) shot detection; 2) saliency calculation; 3) scale optimiza-tion; 4) visual information analysis; 5) boundary-frame re-targeting; and 6) inner-frame retargeting. The shot detectionmodule divides a long video into visually coherent units forsubquential independent retargeting. The saliency detectionmodule implements the attention model we proposed in Sec-tion V to quantize the interestingness of each pixel on theframe. The scale optimization module evaluates an entire shotand then determines a single optimal scale of the RWs for theshot. The visual-info analysis module transforms the saliencydistribution into the potential visual information loss valueincurred by the RW at all possible locations. The boundary-frame retargeting module searches the best location of the RWfor the boundary frames. Finally, the inner-frame retargetingmodule takes the inner frames altogether and determines thedynamic trace of the RWs over them.

As a new frame coming in, the shot detection modulecompares the statistics of its luminance and chrominance withprevious frames and decides if a shot is initiated [19], [20].At the same time, the saliency calculator takes the incomingframe and generates a saliency distribution map in the sameframe size. Then all frames in the detected shot with theirsaliency maps are streamed into the scale optimizer to find aunified viewing scale for the RWs of the entire shot. As thesize of the RW is aesthetically determined with the optimizedscale, the visual-info analyzer computes the potential loss dueto the cropping and scaling incurred by the RWs at all possiblelocations for each frame. Finally, the optimal locations ofthe RWs are searched in two cases: for the frames at theboundary of a subshot (the subshot generator chops a shotinto several even length subshots), boundary-frame retargetingmodule finds the RW by only minimizing the visual-info lossof the individual frame; for other within-subshot frames, theinner-frame retargeting module considers the sequential framesjointly and determines a smooth trace of RWs.

B. Design Principle

The proposed system is designed in such a way in orderto comprehensively address the visual consistency, rather thantreating it as a trivial postprocessing procedure. It is sensi-ble to notice that the frames at different temporal locationsraise distinct extent of consistency requirements. Utilizing thisnonuniformity, we exercise customized retargeting schemes tospecialize the tradeoff between the required consistency and

visual information preservation. This flexibility commits to themost exploitable visual-info preservation as long as the consis-tency requirement is satisfied. The consistency considerationsare summarized as follows.

1) If a frame is asserted as the boundary of a shot, normallya clear cut is inserted before the next frame. Theinstantaneous content switching in the original videois prevalently used and it looks natural to viewers tosymbolize the start of a new episode. Hence, no visualconsistency concerns are required in the retargeted videoeither. For the same reason, a shot can be processedindependently as the retargeting unit.

2) When a frame is within a shot, any abrupt RW changeover two adjacent frames would result in a very no-ticeable jitter against the similar visual contents amongadjacent frames. However, for these subshot boundaryframes, we may only consider the VI preservation be-cause the consistency requirement pertaining to themwill be solved by their two immediate neighbors in 3).This strategy actually keeps an intermittent but timelyupdate of where salient entities in the original frame go,rendering a retargeted video that is very likely to containthe interesting parts. This update mechanism avoidsthe “clamping” and prevents the possible inaccurateretargeting due to saliency inaccuracy from propagatingto the next updated frame.

3) When a frame belongs to the inner-subshot frames, thetwo competitive objectives should be jointly considered.In order to obtain a visual-info-keeping but also smoothtrace of RWs linking the optimal locations of the twosubshot boundary frames, we stack the inner-subshotframes together as a volume and minimize the totalvisual-information loss of the volume under the con-straint that any RW transits over adjacent frames withcurvature less than a bound.

In addition, we fixed one unified scale of RWs for the entireshot provided that human perception is susceptible to scalevariation of RWs over the frames with similar contents, knownas the “flicker” artifact. As for the locations, on the contrary, acertain amount of translation is tolerable by human perception.Actually, it is the translation of neighboring RWs that make thetracking of interesting content possible. The real challenge ishow to configure the allowed translation carefully to smoothlycatch up with the movement of the interesting region of theoriginal video.

IV. Visual Information Loss

As mentioned previously, our retargeting system aims atthe best possible preservation of visual interestingness andalso permits the retargeting consistency. This section discussesthe measure of the former and the mathematical quantificationof the visual-info loss caused by a retargeting window withcandidate parameters. The measure is implemented in thevisual-info analysis module in the system.

The visual-info loss closely relates to the attentionmodel in Section V as the pixels manipulated by the RWshed distinct interestingness. It comes from two sources:


1) the cropping loss: the retargeting system removes theperiphery areas outside the RW so that viewers cannot attendthe contents therein; and 2) the scaling loss: the croppedcontent is further downsampled with scale s to be exactly inthe retargeting size, which degrades the original frame into thecoarser resolution.

Suppose the attention distribution φ(x, y) corresponding toa frame is available, the cropping loss is then measured as theaccumulated interestingness of those pixels discarded

Lc = 1 −∑

(x,y)∈W

φ(x, y). (4)

Here, φ(x, y) is a normalized saliency map such that∑(x,y) φ(x, y) = 1, W is the retargeting window.The measure of scaling loss is more challenging as the

original frame cannot be compared directly with the retargetedcounterpart due to their distinct sizes. In [10], the originalframe is compared with its blurred version by applying a low-pass Gaussian filter, where the latter presumably reconstructsthe retargeted frame in the same size of the original. However,this heuristic does not exactly match the real operation inthe retargeting scheme—downsampling. In our measure, wegenerate a synthesized frame which contains exactly the samevisual information as the retargeted frame, but still in the samesize of the RW. To guarantee the visual-info equivalency ofthe synthesized frame with the retargeted frame, we upsamplethe synthetic frame by mapping each pixel in the retargetedframe repetitively to be the successive 1/s ≤ 1 pixels inthe synthetic frame. When 1/s is not an integer, bilinearinterpolation is adopted. The scaling loss is then defined asthe square difference between the synthesized frame and thecontent of original frame within the RW

Ls =∑

(x,y)∈W

(F (x, y) − F (x, y))2. (5)

where F = upsizing(g ∗ downsizing(F, s), s). The visualinformation loss is thus the combination of the two sources

L(x, y) = (1 − λ)Lc(x, y) + λLs(x, y) (6)

where λ is a factor to balance the importance of contentcompleteness and resolution, adjustable to user’s preference.Given a λ, we may find the optimal RW parameter (x, y, s)by minimizing the visual information loss measure

P(x, y, s) = arg minx,y,s

L(x, y, s · Wt, s · Ht) (7)

where Ws, Wt, Hs, Ht are the width and height of the sourceand target frames, respectively. Note the search range of(x, y, s) is constrained by 0 ≤ x ≤ W − s · Wt, 0 ≤ y ≤H − s · Ht. Therefore, for the subshot boundary frames, theRW is finalized by x, y, s and the retargeted frame is generatedimmediately by zooming the RW out by s times.

Note that the scale of RW is fixed within a shot (see SectionVII). Therefore, for subshot boundary frames, we only searchalong (x, y) to minimize the visual information loss measure.

V. Nonlinear Fusion Based Attention Modeling

Attention modeling computes the meaningful saliency dis-tribution for the evaluation of how much a retargeting window

preserves the visual interestingness. Among a diversity ofmodeling methods, Guo et al. [4] took the advantage of thequaternion Fourier transform [21] to extract saliency fromfrequency domain, which is a principled approach free ofthe sensitivity of laborious parameter tuning. Meanwhile, [4]proved that it is in fact the phase spectrum of an imagethat captures its most saliency feature. Hence, it is proper tosubstitute the entire spectrum residue (Hou et al. [2] showedthe spectrum residue is an efficient way to detect saliency)with only the phase spectrum for the succinct purpose. In thispaper, we inherit the merit of the phase quaternion Fouriertransform (PQFT) to detect spatial saliency but computetemporal saliency separately. We argue it is not justifiableas in [10] to mix the three spatial images (one luminanceand two chrominance) together with one temporal (motion)image into the quaternion and derive the saliency from the in-terweaved spatiotemporal data. Since humans perceive spatialand temporal stimulus by distinct psychophysical mechanisms,treating spatial and temporal channels jointly with a unifiedtransform would produce a saliency model that twists theactual spatiotemporal interaction. Instead, we first simulate thespatial saliency with PQFT and the temporal saliency withthe local motion intensity. Then we fuse them nonlinearly tomimic the human responses to various spatial and temporalsaliency distribution scenarios.

A. Spatial Saliency with Phase Quaternion Fourier Transform

Spatial saliency, to a large amount, originates from thestimulus of those pixels with high contrast in luminance andchrominance against their neighborhoods. The high contrastsare largely mathematically captured by the phase spectrumwhile quaternion Fourier transform provides a nice principledapproach to calculate the phase spectrum of a color image.In YCbCr color space, a video frame I(x, y, t) is representedas three independent scalar images, Y (x, y, t), Cb(x, y, t), andCr(x, y, t), where x, y are the location of a discrete pixel on theframe, t is the id number of the frame in temporal order, Y isthe luminance component, and Cb and Cr are two chrominancecomponents. Quaternion, generally a hypercomplex number, isq = a + bμ1 + cμ2 + dμ3, where a, b, c, d are numbers in realvalue, and

μi2 = −1 μi⊥μj i = j μ1 × μ2 = μ3. (8)

Reorganize q in the symplectic form as the combination oftwo complex numbers

q = f1 + f2μ2 f1 = a + bμ1 f2 = c + dμ1. (9)

Then the quaternion Fourier transform can be performed bytwo standard fast Fourier transforms (F1 and F2), where

Q(u, v, t) = F1(u, v, t) + F2(u, v, t)μ2 (10)

where Fi(u, v, t) is the Fourier transform of fi(x, y, t).Assume a = 0, and substitute b, c, d with Y , Cb, Cr,

respectively, we may represent the video frame I(x, y, t) asa pure quaternion frame q(x, y, t)

q(x, y, t) = Y (x, y, t)μ1 + Cb(x, y, t)μ2 + Cr(x, y, t)μ3. (11)


Then apply (10) to calculate the quaternion transformQI(u, v, t) of the quaternion frame q(x, y, t) and then its phasespectrum PI(u, v, t) is derived by dividing QI over its norm|QI |, PI = QI/|QI |. Finally, we take the inverse quaternionFourier transform for PI(u, v, t) to get the phase quaternionimage qp(x, y, t). Finally, the spatial saliency φs(x, y, t) isobtained by smoothing out the squared L2 norm of qp(x, y, t)with a 2-D Gaussian smoothing filter g

φs = g ∗ ||qp(x, y, t)||2. (12)

B. Temporal Saliency with Local Motion Intensity

Generally, the disparity of temporally adjacent pixels comesfrom both the camera motion and local object motion. Whilecamera motion applies to every pixel on the image globally,the local motion embodies the actual contrast of a pixelagainst its neighborhood. Hence, the local motion reflects thetemporal saliency. We first use Kanade–Lucas–Tomasi (KLT)tracker to obtain a set of matched good feature points [22] ofone video frame and its neighboring frame and then estimatethe global motion parameters [23] with affine model. Thelocation motion is extracted after removing the global motionfrom temporal disparity.

Denote the disparity of a KLT feature point (xt−1, yt−1) atprevious frame It−1 matched with x = (xt, yt) in current frameIt as d = (dx, dy)T . Then the disparity can be approximatedby a six-parameter affine global motion model d = Dx + t,where t is the translation component t = (tx, ty)T and D is a2 × 2 rotation matrix. Represent the affine model in the formof the matched feature points, xt−1 = Axt + t, where A = E+D

and E is a 2 × 2 identity matrix. The motion parameters in t

and D can be estimated by minimizing the total neighborhooddissimilarity of all the matched features

{A, t} = arg minA,t

∫W

(It(Ax+t) − It−1(x))2dx (13)

where W denotes a 8 × 8 neighborhood of feature points.We adopt the least median squares approach to estimate theaffine parameters robustly [23]. We generate the global motionpredicted frame by warping the current frame It(x, y, t) withthe estimated parameter A and t. The absolute difference(after smoothing) of the predicted frame with the previousframe It−1(x, y, t) reflects the intensity of local motion or thetemporal saliency

φm = g(x) ∗ |It−1(x) − It(A−1

[x − t ])|. (14)

C. Nonlinear Fusion of Spatial and Temporal Saliency

The actual strategies adopted by human attention whenfusing spatial and temporal saliency components are rathercomplex, depending on particular distributions of spatial andtemporal salient areas. It is noticeable that humans are likelyto attend video frames on pixel clusters, which may suggestthe existence of a meaningful entity, rather than attraction toa solitary pixel. In a huge variety of videos, salient entitiesalmost universally express high spatial saliency value sincemost videos are captured, by either professionals or amateurs,to promote a certain foreground target. For example, when the

Fig. 4. Comparison of linear combination, naive MAX operation, and pro-posed approach when global motion is correct or wrong.

director shoots a TV show, the actor or actress at the center offocus is often depicted with unique characteristics to distin-guish him or her from the background or others. Furthermore,an entity also demonstrates high motion saliency if it happensto move. However, since the entity may also not move in anumber of scenarios, motion saliency does not carry as dis-criminative power as spatial saliency. Hence, aiming at a fusedsaliency map featuring the high confidence and stability tosuggest the existence of an entity, we make the spatial saliencyas the primary cue and the motion saliency as secondary.

On the other hand, we observe that the detected spatiallysalient area (obtained by thresholding the spatial saliency map)is generally continuous and concentrated since the smoothingprocedure in (12) is suited for the detection of a high spatialcontrast region, rather than an individual pixel. This trait nicelystands out the existence of an underlying salient entity. In orderto further enhance the entity from other spatial salient areas,we resort to the correlation of the motion saliency map andincrease the saliency value of the spatially salient areas bythe amount proportional to the corresponding motion saliencyvalue. The reason why this measure works is that both thespatial and motion saliency are driven by the same entities.Hence, their intersection area suggests the probable existenceof real salient entity. For example, the bee in Fig. 4 representsa salient entity that viewers usually attend. As indicated inthe spatial saliency map, the high saliency area indeed coverswhere the bee locates but also admits other areas that alsoconsidered spatially salient. In this case, the calculated motionsaliency map concentrates and suggests where the bee is. Sothe motion saliency is able to improve the conspicuousness ofthe bee by increasing the saliency value of the bee area of thespatial saliency map.

However, since motion saliency is performed on a pixellevel, the detected salient area can be rather dispersed as thepixels in scattered locations may contribute competitive localmotions so that none of them can be distinguished. In thiscase, the motion saliency map cannot suggest reliable anddominant salient entities, although the related pixels are indeedwith high local motion. Fortunately, the spatial saliency mapis not affected and thus can be utilized as a filter to confine theexistence of the salient entity within the spatially saliency areaand the pixels in such areas are still allowed to be enhanced bytheir motion saliency value. For example, in Fig. 4, the motionsalient areas spread over the entire image, capturing none ofthe two anchors (the salient entities). However, the spatial


saliency map successfully concentrates on the two anchors.Therefore, the final saliency map may use the spatial salientarea to filter the motion saliency and only increase the saliencyof pixels covering the anchors. Based on the analysis above,we devise the following nonlinear fusion scheme:

φ(x, y, t) = max{φs(x, y, t),M ∩ φm(x, y, t)} (15)

M = {(x, y, t) : φs(x, y, t) ≥ ε}.The intersection of M and motion saliency map simulates

the filtering of spatial salient areas over temporal saliency.Within the intersection, the max operator use motion saliencyvalue to intensify the areas where both two cues agree. Thisoperator utilizes the characteristics of two saliency maps andproduces a saliency map that emphasizes on salient entity.

VI. Joint Considerations of Retargeting

Consistency and Interestingness Preservation

This section discusses the more general case: the retar-geting of the inner-subshot frames. Here, we not only seekto avoid the visual information loss but also ensure theretargeting consistency over adjacent frames. In essence, thetwo objectives are conflicting tradeoffs. On one hand, if wefind RWs by merely minimizing intraframe visual loss forevery frame, the resultant RW indeed always tracks the mostupdated information-rich areas. However, those areas do notnecessarily move with consistent patterns and so with thesearched RWs. Eventually, we may end up with a videocontaminated with annoying jitters. On the other hand, if weonly desire the absolute retargeting consistency, the positionsof a RW during the entire subshot should be fixed as otherwiseany nonstatic RW would introduce an artificial global motioninto the retargeted video. Nevertheless, the static RW in thissituation is unable to track the dynamic visual information richareas and preserve them.

Keep in mind that to permit retargeting consistency, it isimpossible for the RW of each frame individually to attainits best position with local minimal visual-info loss. Hence,the joint consideration of the two objectives requires us totreat the entire inner-subshot frames together. We thus proposethe volume retargeting cost metric in (16) to evaluate theretargeting of a whole subshot

Lv =N∑t=1

L(xt, yt) + ω

N∑t=1

D(xt, yt, xt−1, yt−1) (16)

where D(xt, yt, xt−1, yt−1) = |xt −xt−1, yt −yt−1|2 is the differ-ential of the RW trace at the frame t, measuring the retargetinginconsistency therein. L is the visual interestingness loss withthe same interpretation as in (6). ω is the tradeoff factor ofthe two objectives and N is the total number of inner framesin a subshot.

The volume retargeting cost metric features the total visual-info loss plus total retargeting inconsistency and emphasizessearching for a dynamic RW trace for the entire subshot. Whenthose temporally adjacent frames are stacked together, theminimization of the volume metric explores a configuration ofthe RW positions with low total cost, forcing the information

Fig. 5. Graph model for optimize crop window trace; each frame corre-sponds to a layer. Green: source and destination vertexes. Yellow: candidatevertex for each frame. Red: the path with least cost to denote optimizeddynamic trace.

loss and retargeting inconsistency regarding each individualframe low as well. This metric is a relaxation of individualvisual interestingness preservation when mingled with theretargeting consistency concern.

Furthermore, in order to guarantee the retargeting consis-tency, we explicitly add a constraint that the norm of thedifferential of the RW trace at each frame should be less thana value. Therefore, the searching for the best trace of RWs isformulated as the following optimization problem:

{xt, yt}Nt=1 = arg min{xt ,yt}Nt=1Lv(xt, yt) (17)

s.t. D(xt, yt, xt−1, yt−1) ≤ ε

where ε is a psychophysical threshold below which humanattention can tolerate view inconsistency.

A. Graph Representation of the Optimization Problem

The solution to the optimization problem in (17) is nottrivial due to two aspects.

1) The arguments in the optimization are the trace of theRWs; they are the sequential positions of RWs in highdimension.

2) The objective function may be nonlinear and evennonconvex due to the nature of the computed saliencydistribution. Thus, regular analytical or computationmethods may not work in this situation; however, weobserve it is meaningful to represent the optimization ina graph and explore the solution in that context.

As depicted in Fig. 5, we construct a graph with vertexesacross N layers. Each layer symbolizes an inner frame from 1to n within a subshot and each vertex on one layer representsa possible position of the RW to be decided. We assigna cost value for each vertex as the visual information lossincurred by the RW at the corresponding position. Then forevery pair of vertexes on the adjacent layers, if the norm ofdifferentiate with them is less than the bound ε, we establishan edge to link them together, suggesting a possible transitionof the RW over adjacent frames. This establishment ensuresthe retargeting consistency constraint is satisfied. Then we alsoassign each edge a cost value as the norm of the differentiatewith the two vertexes on its two ends. Specifically, the sourceand destination vertexes are the positions of the RWs on thetwo boundary frames of the subshot, which are obtained byminimizing visual information loss only. In this graph, anypath from source to destination vertexes would be a possibletrace of the RWs over the subshot. We define the cost of a pathas the total cost of the vertexes and edges on it. Evidently, the


cost of a path denotes the volume retargeting cost metric in(17) and the solution to the constrained optimization problemis the path with the minimal cost.

B. Dynamic Programming Solution

We propose a dynamic programming method to find the pathwith minimal cost. Suppose the optimal path from the sourcevertex s to the jth vertex v

ji on the ith layer is s → v

ji , the

question is how to find out the optimal path to all the vertexeson the next layer. For the kth vertex vk

i+1 on layer i+ 1, denotethe set of vertexes on layer i that has an edge to link vk

i+1 asV . Then the option to find the best path to vk

i+1 is to augmentthe optimal paths up to every vertex in V to vk

i+1 and choosethe one with minimal updated cost. Therefore, the recursiveformat of the objective function in (17) is as follows:

Lv(s → vki+1) = min

vj

i∈V

{Lv(s → vji ) + ω · D(vj

i , vki+1)} + L(vk

i+1)

(18)

where s = v1 and Lv(s → v1) = 0. Lv(s → vji ) denotes

minimized volume retargeting cost up to frame i assuming vji

is the destination or equivalently the shortest path from sourcevertex s of frame 1 to the jth vertex of frame i. Lv(s → vk

i+1) isthe shortest path up to the kth vertex of frame i+1, D(vj

i , vki+1)

denotes the cost of edge connecting the jth vertex of frame i

to the kth vertex of frame i + 1, and L(vji , v

ki+1) is the cost of

the kth vertex of frame i + 1.Notice that both the source s and destination d are predeter-

mined by minimizing (6) for boundary frames. Starting froms, we update the best path up to the vertexes on each layerfrom 1 to N. In the end, for all the vertexes on layer N, wefind out the one through which the best path leading to d

vkN = arg min

vj

N

Lv(s → vjN ) + ω · D(vj

N, d). (19)

Then the best path leading to vkN is the final solution to the

optimization in (17).

VII. Optimal Selection of Scale in a Shot

As mentioned before, a unified scale is chosen for the entireshot. It determines the actual size of the retargeting windowand reflects the aesthetic preference of a particular viewer. Inthe minimization of the information loss function in (6), thechosen scale depends on the cropping-scaling tradeoff factorλ. Our proposed system allows viewers to initialize it, givingthe system a general idea of generating a local or globalretargeting view or somewhere in between.

Based on the initial λ, we find the optimal scale by makingthe partial derivative of the visual-info loss function withregard to the scale equal to zero. We perform this operationon each frame in the shot and average the obtained scales asthe unified scale of the shot.

The size of the RW definitely affects how quickly the RWresponds to the dynamic content change. It is helpful to thinkof the size of a RW as gears of the manual transmission ina car, the transition rate of RWs between adjacent framesas the tachometer and the change of dynamic content as the

road condition. We wish to maintain the transition rate withina pleasing level for consistency concerns. Meanwhile, thefeatured contents are supposed to be tracked and preservedno matter how quickly they change.

Just as choosing a higher gear for a high speed if roadcondition permits, when the salient entities move rapidly,it is sensible to choose a larger RW size to satisfy theaforementioned two desires. On the contrary, when the saliententities move slowly, we may alter to a smaller RW to savethe resolution.

Therefore, given the initial λ which settles the aestheticpreference, we tune it afterward to better suit for the contentchange. We use the velocity the RW transits (predicted by theinitial λ) to estimate how fast salient entities move. Then basedon the velocity estimate, we adjust the weight λ in order toobtain a more suitable scale, which then resizes the RW totrack salient entities more wisely

λ′ =λ

1 + e−( 1

N·∑N

i=1

(vi−vi−1)2

β2i

−vα)(20)

where 1N

·∑Ni=1

(vi−vi−1)2

β2i

is the velocity estimate of the movingRW, N is the total number of frames in the shot. βi is themaximum distance a RW can move from vi−1 at frame i − 1.vα denotes a reference velocity which humans find pleasing.Given the updated weight λ′, a new optimal scale average iscalculated for the shot.

VIII. Experimental Results

We design three groups of experiments: 1) spatial saliencymodeling; 2) video saliency modeling; and 3) video retar-geting to demonstrate the effectiveness and efficiency ofthe proposed attention modeling method and the proposedretargeting system. For evaluating convenience, the testingvideo sequences are clips in 1 to 2 min length. They covermany genres, have multiple scenes and complex backgrounds,and sufficiently demonstrate the validness of our retargetingsystem. Also, in order to test the performance of our methodon long videos, we retargeted two 10 min movies, the bigbuck bunny and elephant dream, to variant resolutions andaspect ratios. They are full movies from Open Movie projectswithout license issue. In all three experiments, we compareour schemes with the representative existing approaches. Ourschemes are implemented in C++ and MATLAB with opensources libraries OpenCV (http://opencv.willowgarage.com/wiki/), FFTW3 (http://www.fftw.org/), and KLT (http://www.ces.clemson.edu/∼stb/klt/).

Since the attention modeling and the entire video retargetingare highly subjective tasks, we summarize the experimentalresults in two fashions, including the image snapshots andonline videos to provide readers with the visual comprehensionof the proposed schemes and also the subjective tests of theviewer perception scores.

A. Spatial Saliency Modeling

1) Proto-Region Detection Results on Spatial SaliencyMap: Fig. 6 shows the attention modeling results (saliency


Fig. 6. Comparison of saliency analysis on images. Col. 1: original image. Col. 2: human labeled salient regions. Col. 3: proto-regions detected by STB.Col. 4: saliency map by STB. Col. 5: proto-regions detected by CS. Col. 6: saliency map by CS. Col. 7: proto-regions by HC. Col. 8: saliency map by HC.Col. 9: proto-regions by RC. Col. 10: proto-regions by RC. Col. 11: proto-regions detected by our method. Col.12: saliency map of our method.

maps) generated by humans, saliency toolbox (STB, http://www.saliencytoolbox.net/), context-aware saliency (CS) [24],histogram-based contrast (HC) [25], Region-based contrast(RC) [25] and our attention modeling algorithm respectively.We use a collection of images in multiple resolutions andaspect ratios for the experiments. Beside the resultant saliencymaps, we also illustrate the so-called proto-regions [26], whichare found by thresholding the normalized saliency maps withthe same threshold [2], to show the contours of salient entitiesin contents, as shown in the red-circled regions on originalimages in Fig. 6.

Note that in rigid video retargeting, it is the shapes ofthe detected salient objects and whether they are accuratelydistinguished from the background that matter most becausethe salient objects, as integrated entities, are intended to bekept in the retargeting window. A good saliency map heredoes not only capture all the pixels of the salient objects,but also avoids to capture too much details from the back-ground, which weakens the emphasis of saliency objects.Therefore, we use the similarity of the “proto-regions” with theground-truth “attended objects” (generated by us as viewersin Col. 2 of Fig. 6) to measure the effectiveness of eachmethod.

Comparing the saliency maps of STB, our algorithm suc-cessfully detects more pixels on the salient objects and thusour “proto-regions” have much more similarity of the shapewith the ground-truth. For example, for image 4 which showstwo children playing at beach with a sailing boat in the sea,our algorithm successfully extracts the regions of children andthe boat as salient objects while STB is only able to capture

the line between the sea and the sky. As for HC and RC,their methods produce a saliency map that emphasizes regions,which is suitable for image segmentation (e.g., image 1 andimage 7 of Fig. 9). However, we observe occasionally theobjects of interest, although wholly extracted as regions, arenot very “salient” comparing with other regions. For example,for HC, the house region in image 3 is considered less salientthan the flower region in both the two methods and the saliencyof the old lady’s face is overshadowed by the wall nearby;for RC, the children in image 4 are less salient than thesea surface. For the comparison with CS, both CS and ouralgorithm detect all the pixels of the salient objects. Indeed,CS captures more details in both the salient objects and thebackground, which potentially counterbalances the importanceof the salient objects. In image 2, CS results show both thehouse and the ground are captured, resulting in a “proto-region” that includes many unnecessary background regions.

2) Subjective Test of the Attention Modeling: We con-duct a subjective test to collect the scores ranging from 1to 5 graded by 60 participants (http://www.mcn.ece.ufl.edu/public/ZhengYuan/spatial−saliency−comparison.html). Thenwe conduct confidence interval analysis to calculate wherethe mean scores of each method lie in with a 95%probability.

The results of confidence interval analysis are summarizedin Table II and Fig. 7 is the bar chart. It is shown that for allimages except image 7 and image 9, they exhibit a consistencythat the confidence intervals of STB do not overlap withours. Since our method has a higher mean score than STB, itsuggests that our method outperforms STB. Comparing with


Fig. 7. Statistics for saliency maps by STB, CS, HC, RC, and our method.

HC and RC, except for image 2 and image 4, the confidenceintervals of our method overlap with the two methods butgenerally occupy slightly higher ranges for other images. Thepossible reason is that the saliency of the objects of interest byHC and RC may be diluted by the color variations nearby. Forthe comparison with CS, it seems that for most images, theconfidence intervals overlap. Generally, in terms of the gradingon the similarity of detected salient objects with the groundtruth, the performances of ours and the CS method are similar.Sometimes, the two methods still demonstrate a potential ofperformance gap. In image 5, the confidence interval of CS ishigher than ours. However in image 4, our confidence intervalis higher. The possible reason is that in the saliency mapby CS, too much details on the two children are extracted.Thus, the same threshold level which perfectly distinguishesthe shape of the sailing boat results in two oversized children.This weakened salient object due to over-extracted details alsoexplains the score comparison on image 8.

3) Computational Complexity: We test the time complexityof the five methods on the operating system as Windows 7, 64bits and hardware as CPU 2.67 GHz and memory 4 GB. Thetime complexity is also in Table II. Indicated by the computa-tion time, our method is faster than all the other four methods,especially our computational efficiency is 1000 times that ofSC, thus is suitable for real-time video retargeting systems.

B. Attention Modeling Comparison in Video Retargeting

1) Saliency Video Comparsion: Here, we focus on the at-tention modeling unit and compare the proposed nonlinear fusemodeling with the baseline-PQFT [4]. We present the saliencycomparisons on six videos, Chicken, Rat, News, Sink, Jets, andSunflower. The comparisons are in the form of live videos withbaseline-PQFT side by side. Please visit http://www.mcn.ece.ufl.edu/public/ZhengYuan/saliency−comparison.html to watchthem. Due to the space limitation, Fig. 8 presents the snap-shots of two representative saliency video results (Rat andSunflower).

As both the online videos and Fig. 8 show, the salientobjects in Rat detected by our method have more resemblingshape and gesture with the rat (the salient object in Rat) thanthat of Baseline-PQFT. The same case happens in the bee inSunflower. The reason is that in our nonlinear fusion modeling,the fusion of two channels emphasizes the potential existenceof salient objects rather than individual salient pixel, thusresults in better shape of the detected salient objects. Baseline-PQFT feeds the spatial and motion data directly into the PQFT

TABLE II

Confidence Interval Analysis for Subjective Evaluation

on Spatial Saliency Modeling α = 0.05

L Bound Mean U Bound TimeSTB 2.68 2.94 3.21 97.0 msCS 3.34 3.59 3.83 129 608 ms

Img1 HC 3.49 3.69 3.90 5323 msRC 3.26 3.49 3.73 35 501 ms

Ours 3.24 3.48 3.72 10.9 msSTB 1.86 2.11 2.36 103.2 msCS 3.39 3.66 3.92 74 972 ms

Img2 HC 2.34 2.58 2.82 2069 msRC 2.62 2.86 3.10 25 971 ms


Img3 HC 3.19 3.42 3.66 1076 msRC 2.99 3.21 3.44 20 531 ms


Img4 HC 2.61 2.95 3.29 385 msRC 2.45 2.64 2.83 1831 ms


Img5 HC 3.08 3.25 3.42 1445 msRC 2.81 3.08 3.36 7529 ms


Img6 HC 3.35 3.69 4.04 428 msRC 3.27 3.64 4.02 2215 ms


Img7 HC 3.26 3.47 3.68 5561 msRC 3.28 3.50 3.73 220 497 ms


Img8 HC 3.31 3.56 3.82 765 msRC 3.66 3.92 4.19 5170 ms


Img9 HC 3.38 3.64 3.91 5757 msRC 3.36 3.61 3.87 44 585 ms

Ours 3.25 3.53 3.79 11.7 ms

calculation. Since the values are different in nature, the shapeof detect salient objects may be twisted.

2) Subjective Test: We also carry out the subjective test forthe saliency modeling comparison. Like the spatial saliencyevaluation, we collect the scores from 60 participants andperform the confidence interval analysis of the mean scoresfor the two saliency modeling methods. Table III presents theresults of the estimated interval where the two mean scoreslie in for each video and Fig. 9 shows the bar chart. FromFig. 9, we can see for Chicken, Rat, News and Sunflower,the interval by our method does not overlap with that ofBaseline-PQFT and our interval is higher. For Jets and Sink,although our confidence intervals have a small overlap with


TABLE III

Confidence Interval Analysis for Subjective Evaluation

on Video Saliency Modeling α = 0.05

L Bound Mean U Bound

ChickenBaseline-PQFT 2.54 2.93 3.06Ours 3.28 3.65 4.02

RatBaseline-PQFT 2.73 2.97 3.21Ours 3.67 3.97 4.26

NewsBaseline-PQFT 1.82 2.24 2.66Ours 3.36 3.60 3.84

SinkBaseline-PQFT 3.52 3.84 4.16Ours 3.86 4.13 4.39

JetsBaseline-PQFT 3.16 3.46 3.75Ours 3.38 3.65 3.92

SunflowerBaseline-PQFT 2.35 2.64 2.93

Ours 3.92 4.24 4.56

Fig. 8. Two snapshots of video saliency modeling comparison betweenbaseline-PQFT [4] and our approach. The first and fourth rows: original videosequences Rat and Sunflower, the second and fifth rows: the saliency modelingresults of baseline-PQFT, the third and sixth rows: the saliency modelingresults of our method.

Fig. 9. Statistical analysis for video saliency modeling. Green: saliencybaseline-PQFT. Purple: ours.

Baseline-PQFT, they still occupy a higher range. It suggeststhat participants generally consider our saliency modeling isbetter than Baseline-PQFT in terms of salient object detection.

C. Comparison of Video Retargeting Approaches

1) Video and Image Snapshot Comparison: We presentvideo targeting comparison on six videos, Barnyard, Fash-ionshow, Hearme, Madagascar, Rat, and Soccer. They are in1 to 2 min length with multiple scene changes and complexbackgrounds. We perform video retargeting with our retarget-ing system and by two previous homogeneous methods: singleframe smoothing (SFS) [9], [10] and back tracing (BT) [11].Each original video has an aspect ratio between 1

2 and 34 with

width more than 600 pixels. The retargeted output sizes are320 × 240 and 480 × 240, so in the retargeting process, thevideos are both squeezed and stretched.

TABLE IV

Confidence Interval Analysis for Video Retargeting α = 0.05

BT OursVideos Retarget Size lB M uB lB M uBBarn 320 × 240 3.36 3.60 3.84 3.65 3.85 4.05

480 × 240 3.64 3.90 4.15 3.78 4.07 4.36Fashion 320 × 240 3.25 3.53 3.80 3.58 3.85 4.12

480 × 240 3.17 3.47 3.76 3.92 4.22 4.51Hearme 320 × 240 3.69 3.88 4.06 3.82 4.04 4.26

480 × 240 3.46 3.74 4.02 3.75 4.07 4.39Madaga 320 × 240 3.31 3.64 3.97 3.52 3.86 4.20

480 × 240 3.55 3.82 4.08 3.83 4.13 4.42Soccer 320 × 240 3.18 3.48 3.77 3.43 3.67 3.91

480 × 240 3.34 3.64 3.86 3.66 3.92 4.17

We demonstrate the video comparison results on http://www.mcn.ece.ufl.edu/public/ZhengYuan/video−retargeting−comparison.html. From all the videos, we may generallyobserve that SFS suffers from jittering, which causesuncomfortable feelings of viewers. Back tracing is mostlyacceptable; however, the retargeted video is not alwaysable to preserve regions of interest of the original video.In comparison, our method throughout preserves the salientregion as a frame goes further and avoids jitter effects as well.3

Due to the space limitation, Fig. 10 captures the snapshotsof two representative video results, Madagascar (retargetto 320 × 240, aspect ratio squeezed) and the Fashionshow(retarget to 480×240, aspect ratio stretched). For the authenticperformance comparison, readers may still refer to our websitefor observation.

In the results of SFS, although the lion and the zebra arepreserved completely, the retargeting window shifts back andforth frequently, which suggests huge jitter effects in theretargeted video. Regarding BT, from the first to the thirdframes, the retargeting window includes the complete zebra;however, as frame goes to the fourth and the fifth, it is leftbehind by the zebra due to the fast motion of the latter. Somost parts of zebra are lost in the retargeted video. In contrast,our result yields a visually consistent retargeting windowtrace to preserve zebra completely. For the Fashionshow,BT results in a retargeting window that does not include themodel’s face as she moves across frames. In comparison, forthe corresponding frames, our results alleviate the consistencyclamping artifact and mostly keep the model’s face inside.For SFS retargeting, it still has a shaky artifact, viewers mayperceive that through online video results.

2) Subjective Test: Since it is obvious from the retargetedvideos that both BT and our approach are better than SFS, weonly need to evaluate BT and our approach quantitively. Foreach retargeting process, we collect the subjective scores rang-ing from 1 to 5 from 60 participants, as in the previous tests.

Based on the collected scores, we perform confidenceinterval analysis to compare two methods according to wheretheir true mean scores lie. Table IV summarizes the confidenceintervals for each video being retargeted to 320 × 240 and480 × 240. The bar chart in Fig. 11 illustrates the relative

3Occasionally, it may include modest unnatural camera motion, which canbe alleviated by increasing ω in (17).


Fig. 10. Two snapshots of video retargeting comparison among SFS, BT, and our approach: the first and fourth rows: single frame search and smoothing,the second and fifth rows: back tracing, the third and sixth rows: our proposed approach.

Fig. 11. Statistical analysis for video retargeting approaches. Green: backtracing. Purple: ours.

locations of the mean scores. From Table IV and Fig. 11, itseems for all the retargeting processes where the aspect ratiois stretched, the confidence intervals by our method are higherthan those of BT, although sometimes they overlap with BTwith a small percentage. For the retargeting process where theaspect ratio is squeezed, the confidence intervals of our methodindeed overlap with those of BT; however, it seems that theystill occupy higher ranges. This subjective test results suggestthat our method is generally better than BT, especially in thecase where the aspect ratio is stretched.

IX. Conclusion

We presented a novel video retargeting system that issuitable for long videos with generic contents. Different frommany existing approaches that focus on visual interesting-ness preservation, we identified that the temporal retargeting

consistency and nondeformation play a dominant role. Also,our statistical study on human response to the retargetingscale showed that seeking a global retargeting view as inheterogeneous approaches is not necessary. Our refined homo-geneous approach for the first time comprehensively addressedthe prioritized temporal retargeting consistency and achievedthe most possible preservation of visual interestingness. Inparticular, we proposed a volume retargeting cost metric tojointly consider the two objectives and formulated the retar-geting as an optimization problem in graph representation. Adynamic programming solution is given. To measure the visualinterestingness distribution, we also introduced a nonlinear fu-sion based attention modeling. Encouraging results have beenobtained through the image rendering of the proposed attentionmodeling and video retargeting system on various images andvideos. Our subjective test statistically demonstrated that ourattention model can more effectively and efficiently extractsalient regions than conventional methods and also the videoretargeting system outperforms other homogeneous methods.

Although the current measure of temporal consistency ismathematically interpreted as the norm of parameter dif-ferentials of adjacent retargeting windows, our system canbe easily extended to other consistency measures with theoptimization formulation, graph representation and dynamicprogramming solution remaining the same. In the future, wewill focus our research on human response to retargetingconsistency, aiming at a computational model to mimic the realpsychophysical process. A statistical study will be conductedto learn the relationship between the consistency perceptionsand retargeting window parameters.


Acknowledgment

The authors would like to thank Dr. T. Deselaers and P.Dreuw from RWTH Aachen University, Aachen, Germany,Dr. Y.-S. Wang from National Cheng Kung University, Tainan,Taiwan, and Dr. M. Cheng from Tsinghua University, Beijing,China, for sharing their source codes and instructions onparameter settings.

References

[1] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visualattention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 20, no. 11, pp. 1254–1259, Nov. 1998.

[2] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,”in Proc. IEEE Comput. Vis. Pattern Recognit., Jun. 2007, pp. 1–8.

[3] X. Hou and L. Zhang, “Dynamic visual attention: Searching for codinglength increments,” in Proc. Advances Neural Inform. Process. Syst.,vol. 21. 2008, pp. 681–688.

[4] C. Guo, Q. Ma, and L. Zhang, “Spatio-temporal saliency detection usingphase spectrum of quaternion Fourier transform,” in Proc. IEEE Comput.Vis. Pattern Recognit., Jun. 2008, pp. 1–8.

[5] D. Walther and C. Koch, “Modeling attention to salient proto-objects,”Neural Netw., vol. 19, no. 9, pp. 1395–1407, 2006.

[6] Y. Ma, X. Hua, L. Lu, and H. Zhang, “A generic framework of userattention model and its application in video summarization,” IEEE Trans.Multimedia, vol. 7, no. 5, pp. 907–919, Oct. 2005.

[7] L. Wolf, M. Guttmann, and D. Cohen-Or, “Non-homogeneous content-driven video-retargeting,” in Proc. IEEE Int. Conf. Comput. Vis., Oct.2007, pp. 1–6.

[8] S. Avidan and A. Shamir, “Seam carving for content-aware imageresizing,” ACM Trans. Graph., vol. 26, no. 3, p. 10, 2007.

[9] F. Liu and M. Gleicher, “Video retargeting: Automating pan and scan,”in Proc. ACM Multimedia, 2006, pp. 241–250.

[10] G. Hua, C. Zhang, Z. Liu, Z. Zhang, and Y. Shan, “Efficient scale-spacespatiotemporal saliency tracking for distortion-free video retargeting,”in Proc. Comput. Vis., 2010, pp. 182–192.

[11] T. Deselaers, P. Dreuw, and H. Ney, “Pan, zoom, scan—time-coherent,trained automatic video cropping,” in Proc. IEEE Comput. Vis. PatternRecognit., Jun. 2008, pp. 1–8.

[12] M. Rubinstein, A. Shamir, and S. Avidan, “Improved seam carving forvideo retargeting,” ACM Trans. Graph., vol. 27, no. 3, pp. 1–9, 2008.

[13] M. Grundmann, V. Kwatra, M. Han, and I. Essa, “Discontinuous seam-carving for video retargeting,” in Proc. IEEE Comput. Vis. PatternRecognit., Jun. 2010, pp. 569–576.

[14] M. Rubinstein, A. Shamir, and S. Avidan, “Multi-operator media retar-geting,” ACM Trans. Graph., vol. 28, pp. 23:1–23:11, Jul. 2009.

[15] Y. S. Wang, C. L. Tai, O. Sorkine, and T. Y. Lee, “Optimized scale-and-stretch for image resizing,” ACM Trans. Graph., vol. 27, no. 5, pp. 1–8,2008.

[16] Y. S. Wang, H. Fu, O. Sorkine, T. Y. Lee, and H. P. Seidel, “Motion-aware temporal coherence for video resizing,” ACM Trans. Graph., vol.28, pp. 127:1–127:10, Dec. 2009.

[17] P. Krahenbuhl, M. Lang, A. Hornung, and M. Gross, “A system forretargeting of streaming video,” ACM Trans. Graph., vol. 28, pp. 126:1–126:10, Dec. 2009.

[18] M. Rubinstein, D. Gutierrez, O. Sorkine, and A. Shamir, “A comparativestudy of image retargeting,” ACM Trans. Graph., vol. 29, no. 6, p. 160,2010.

[19] J. S. Boreczky and L. A. Rowe, “Comparison of video shot boundarydetection techniques,” J. Electron. Imaging, vol. 5, no. 2, pp. 122–128,1996.

[20] N. V. Patel and I. K. Sethi, “Video shot detection and characterizationfor video databases,” Pattern Recognit., vol. 30, no. 4, pp. 583–592,1997.

[21] T. A. Ell and S. J. Sangwine, “Hypercomplex Fourier transforms of colorimages,” IEEE Trans. Image Process., vol. 16, no. 1, pp. 22–35, Jan.2007.

[22] J. Shi and C. Tomasi, “Good features to track,” in Proc. IEEE Comput.Vis. Pattern Recognit., Jun. 1994, pp. 593–600.

[23] P. J. Rousseeuw and A. M. Leroy, Robust Regression and OutlierDetection. New York: Wiley-IEEE, 2003.

[24] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliencydetection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun.2010, pp. 2376–2383.

[25] M. Cheng, G. Zhang, N. Mitra, X. Huang, and S. Hu, “Global contrastbased salient region detection,” in Proc. IEEE Conf. Comput. Vis. PatternRecognit., Jun. 2011, pp. 409–416.

[26] R. Rensink, “Seeing, sensing and scrutinizing,” Vis. Res., vol. 40, nos.10–12, pp. 1469–1487, Jun. 2000.

Zheng Yuan received the B.S. degree in electri-cal engineering from Jilin University, Changchun,China, in 2006, and the M.S. degree in electricalengineering from Shanghai Jiao Tong University,Shanghai, China, in 2009. Currently, he is pursuingthe Ph.D. degree with the Multimedia and Commu-nication Network Group, Department of Electricaland Computer Engineering, University of Florida,Gainesville.

His current research interests include video andimage analysis/processing, video compression, mul-

timedia, computer vision, and machine learning.

Taoran Lu received the Ph.D. degree in electricaland computer engineering from the University ofFlorida, Gainesville, in 2010.

She is currently a Senior Research Engineer withthe Image Technology Group, Dolby Laboratories,Inc., Burbank, CA. Her current research interestsinclude image and video compression and transmis-sion, next-generation video coding, image and videoanalysis, and computer vision.

Yu Huang received the Bachelors degree in infor-mation and control engineering from Xi’an Jiao-Tong University, Xi’an, China, in 1990, the Mas-ters degree in electronic engineering from XidianUniversity, Xi’an, in 1993, and the Ph.D. degree ininformation science from Beijing Jiao-Tong Univer-sity, Beijing, China, in 1997.

He was a Post-Doctoral Research Associate withthe Beckman Institute, University of Illinois atUrbana-Champaign, Urbana, from 2000 to 2003.He is currently a Senior Staff Research Engineer

with the Digital Media Solution Laboratory, Samsung Electronics America,Ridgefield Park, NJ. His current research interests include image and videoprocessing, computer vision, human–computer interaction, machine learning,image-based rendering, and augmented reality.

Dapeng Wu (S’98–M’04–SM’06) received thePh.D. degree in electrical and computer engineeringfrom Carnegie Mellon University, Pittsburgh, PA, in2003.

Currently, he is a Professor with the Departmentof Electrical and Computer Engineering, Universityof Florida, Gainesville.

Heather Yu received the Ph.D. degree from Prince-ton University, Princeton, NJ, in 1998.

Currently, she is the Director of the Huawei MediaNetworking Laboratory, Bridgewater, NJ. She holds23 granted U.S. patents and has published more than70 publications, including 4 books, P2P Networkingand Applications, Semantic Computing, P2P Hand-books, and Multimedia Security Technologies forDigital Rights Management.

Dr. Yu has served numerous positions with relatedassociations, such as the Chair of the IEEE Multime-

dia Communications Tech Committee, member of the IEEE CommunicationsSociety Strategic Planning Committee Member, Chair of the IEEE HumanCentric Communications Emerging Technology Committee Chair, AssociateEditor-in-Chief for the Peer-to-Peer Networking and Applications journal, As-sociate Editor of several IEEE journals/magazines, and the Conference Chairand Technical Program Committee Chair for many conferences in the field.

Date post:	12-Dec-2016
Category:	Documents
Upload:	heather
View:	212 times
Download:	0 times

Addressing Visual Consistency in Video Retargeting: A Refined Homogeneous Approach

Documents