
978-1-4673-0964-6/12/$31.00 ©2012 IEEE

2012 5th International Congress on Image and Signal Processing (CISP 2012)

Frame Rate Up-Conversion With Nonlinear Temporal Interpolation

Yücel ÇİMTAY, Electrical and Electronics Engineering, Eskişehir Osmangazi University, Eskişehir, Turkey

Erol SEKE, Electrical and Electronics Engineering, Eskişehir Osmangazi University, Eskişehir, Turkey

Abstract— Motion-compensated frame rate up-conversion (MCFRUC) partly resolves some of the issues in video interpolation, such as the blurred images that result from incorrect linear motion interpolation and the motion jerkiness caused by frame repetition. Most block artifacts and incorrectly interpolated frames are caused by linear-motion-based FRUC. In this paper, we evaluate a motion-compensation approach based on a nonlinear spatio-temporal motion assumption, which is more general and realistic than the linear assumption. The new motion model is tested against its linear counterpart and shows clear superiority in generating better-reconstructed intermediate frames that are closer to the ground truth.

Keywords-FRUC; motion compensation; nonlinearity

I. INTRODUCTION

In video storage and communication, some frames are skipped in video streams to achieve a specific compression ratio and satisfy bandwidth limitations. However, the resulting video exhibits motion jerkiness and some of its visual quality may be lost. To improve visual quality and achieve fluid video on the receiver side, frame repetition and linear interpolation methods have been studied. However, frame repetition results in jerky motion and linear frame interpolation causes blurred images. In an attempt to eliminate the undesired effects of these methods, motion-compensated frame rate up-conversion (MCFRUC) has been widely studied. It aims to resolve motion jerkiness and blur, and thereby improve visual quality, by estimating the positions of objects at the exact desired points in time. MCFRUC inserts one or more temporally interpolated frames between existing frames, so that the frame rate is increased and a more fluid video stream is obtained. MCFRUC may use unidirectional or bidirectional motion compensation. Motion estimation is the process of determining the positions of overlapping or non-overlapping blocks of the current frame within the previous/next frame. Instead of block-based search, pixel-based search can also be performed. Pixel-based estimation can reduce block artifacts but has a huge computational cost [1]. Therefore, it is almost customary to employ block-based estimation followed by artifact removal/reduction algorithms.

When the moving objects in a video sequence are larger than the blocks, the motion vectors calculated for these blocks are expected to be similar for spatially or temporally neighboring blocks. However, when objects move away from each other, occlude one another, or newly appear, resulting in blocks with no correspondence in the other frame, motion vectors may be unreliable or simply incorrect. This causes null areas and block artifacts in the interpolated/reconstructed frame. Several researchers have worked on correcting or filtering incorrect motion vectors [2, 3]; averaging and median filtering of the motion vector field are examples of such work [4]. The assumption that neighboring blocks move together is the basis for many studies [6-8]. When neighboring blocks move together with the same motion vector, no hole requiring filling is created between them in the constructed frame [8]. Both motion estimation and construction of the moving block in the reconstructed frame can be unidirectional [5] or bidirectional [7, 8]. In most studies, it is assumed that objects in view, and hence the corresponding blocks, move linearly within the time span between successive frames. However, in the real world, objects almost never exhibit linear motion. Therefore, representing block motion with a linear model is only an approximation. When the temporal distances between frames are large compared to the speed of the motion, nonlinearity in the motion becomes significant and the linear-motion assumption becomes increasingly invalid.

In some video transmission systems, when real intermediate frames exist but are not transmitted to the user, the difference between the motion vectors to the real intermediate frame and the vectors obtained under the linear-motion assumption is calculated and transmitted to the receiver as an error vector. This is a vector form of differential-mode transmission. At the receiver side, these differences are added to the motion vectors of each block and the intermediate frame is reconstructed [9]. Although the difference data is very small compared to the rest of the transmission, this approach implies that the coder side must be modified to accommodate the additional data. In the work proposed here, no data related to real intermediate frames is used and no transmitter-side modifications are needed. Intermediate frames are reconstructed using multiple receiver-side frames and nonlinear motion estimation.

II. PROPOSED METHOD

In order to model the nonlinear motion of objects in video sequences, we apply three steps in the proposed work.


A. Motion Estimation and Motion Vector Detection

Unidirectional motion estimation is performed using the three consecutive frames (k-1), (k+1) and (k+3), where frame (k+1) is the reference frame in the middle. Blocks of (k+1) are searched in frames (k-1) and (k+3) using exhaustive search. Fig. 1 illustrates the relation and positions of an example block in the three frames. Note that the even-indexed frames are spared as originals for PSNR (peak signal-to-noise ratio) calculations, as they are the frames estimated by the proposed method.

Motion vectors are first filtered by grouping [3] and applying the averaging method shown in [2]. It is obvious from the example in Fig. 1 that when a block under investigation moves nonlinearly within the neighborhood, a linear motion approach cannot accurately represent the motion of that block.
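The exhaustive search in this step can be sketched as follows. The SAD matching criterion, the function names and the 8-pixel search range are illustrative assumptions, not details specified by the paper.

```python
import numpy as np

def exhaustive_block_search(ref_block, frame, top, left, search_range=8):
    """Full search: try every displacement within +/- search_range and
    keep the one minimizing the sum of absolute differences (SAD).
    SAD criterion and range are illustrative assumptions."""
    bh, bw = ref_block.shape
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # skip candidate positions that fall outside the frame
            if y < 0 or x < 0 or y + bh > frame.shape[0] or x + bw > frame.shape[1]:
                continue
            cand = frame[y:y + bh, x:x + bw].astype(np.int32)
            sad = np.abs(cand - ref_block.astype(np.int32)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv  # displacement (rows, cols) of the best match
```

Running this per block of frame (k+1) against frames (k-1) and (k+3) yields the two motion vectors per block illustrated in Fig. 1.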

B. Modeling the Block Motions

In general, the coordinates (x, y) of a block in frames within the close neighborhood of a specified point in time can be written as

x(t) = f(t),  y(t) = g(t)   (1)

where t is the position in time and f(.) and g(.) are block-specific functions defining the block position at that time. If f(.) and g(.) are linear functions of t, the motion is called linear. Since real motions are in general not linear, a better choice for f(.) and g(.) is the pair of second-degree polynomials of t given in (2).

It should be stressed that the name "nonlinear motion" does not imply that the motion of a block from one frame to another traces a second-degree curve (parabola) in the view. It is quite possible that a block moves along a line while its time dependency is a nonlinear function; a simple case is a block accelerating in a fixed direction. Our definition (2) accommodates such motion. Here (a_2, a_1, a_0) and (b_2, b_1, b_0) are the coefficients of the second-degree functions. In the proposed method the motion is modeled by the second-degree function pair (2); however, higher-degree polynomial pairs are also employable:

x(t) = a_n t^n + a_{n-1} t^{n-1} + ... + a_0
y(t) = b_n t^n + b_{n-1} t^{n-1} + ... + b_0   (3)

On the other hand, there is actually no basis for preferring one nonlinear model over another. We have chosen the second-degree time-parametric model (2) on the belief that it is sufficient. Since we are, in practice, limited to 3 or 4 consecutive frames when modeling motion, such a model seems appropriate. Reaching farther in time to collect temporal samples would likely introduce accuracy problems. At that point, one sees two possibilities in the motion/time relation within recorded sequences of regularly spaced frames: either the frames are so far apart that temporal aliasing occurs, or they are close enough that (2) is sufficient. Statistical analysis beyond that given in Table 2, on a much larger number of real video sequences, still needs to be performed.

Given that a block is at positions (x_{k-1}, y_{k-1}), (x_{k+1}, y_{k+1}) and (x_{k+3}, y_{k+3}) in the consecutive frames (k-1), (k+1) and (k+3), captured at times t_{k-1}, t_{k+1} and t_{k+3}, the coefficients of (2) for the x coordinate can be found using (4). The coefficients for the y function can be found similarly, and (5) can be written.

\begin{bmatrix} t_{k-1}^2 & t_{k-1} & 1 \\ t_{k+1}^2 & t_{k+1} & 1 \\ t_{k+3}^2 & t_{k+3} & 1 \end{bmatrix} \begin{bmatrix} a_2 \\ a_1 \\ a_0 \end{bmatrix} = \begin{bmatrix} x_{k-1} \\ x_{k+1} \\ x_{k+3} \end{bmatrix}   (4)

TA = X,  TB = Y   (5)

At least three frames are needed to determine the three-parameter motion equations. Any three neighboring frames can be used to reconstruct frame (k), but the farther those frames are from the frame to be reconstructed, the less accurate the representation capability of (2) becomes.
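Solving (4) amounts to an exact degree-2 polynomial fit through three (time, position) samples. The paper performs this with MATLAB's polyfit; `numpy.polyfit` is the direct analogue, and the sketch below uses it with illustrative function names and sample values.

```python
import numpy as np

def fit_block_trajectory(times, xs, ys):
    """Solve (4) for the x and y coefficient vectors of (2).  With
    exactly three samples, a degree-2 least-squares fit reproduces
    the Vandermonde solution exactly."""
    ax = np.polyfit(times, xs, 2)  # [a2, a1, a0]
    ay = np.polyfit(times, ys, 2)  # [b2, b1, b0]
    return ax, ay

def block_position(ax, ay, t):
    """Evaluate the fitted trajectory (2) at an arbitrary time t."""
    return np.polyval(ax, t), np.polyval(ay, t)

# Illustrative example: frames (k-1), (k+1), (k+3) at times -1, 1, 3,
# block positions sampled from a quadratic motion; reconstruct the
# position at t = 0, i.e. in the skipped frame (k).
ax, ay = fit_block_trajectory([-1, 1, 3], [4.0, 10.0, 32.0], [3.0, 1.0, 7.0])
x0, y0 = block_position(ax, ay, 0.0)
```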

Unless properly handled, there would be spaces with no assigned pixel value in the reconstructed frame due to occlusions and appearances. The hole-handling approaches suggested in [7, 8] are employed in our work as well. There, blocks that are temporally or spatially close to each other move

x(t) = a_2 t^2 + a_1 t + a_0
y(t) = b_2 t^2 + b_1 t + b_0   (2)

Figure 1. Position of a block in three consecutive frames (k-1), (k+1) and (k+3), with its motion vectors.


according to the same or similar motion vectors. That means the blocks at the same position in frames (k) and (k+1) have the same motion equations. In linear-motion-based methods, the motion vector of a block in frame (k+1) is halved to retrieve blocks from frames (k+1) and (k-1); these blocks are averaged and placed on the regular grid of frame (k). Halving the motion vector for the intermediate frame is identical to calculating the position of the block under evaluation along a linear motion trajectory. Similarly, it is assumed that the trajectory of the block A(x'_k, y'_k) is the same as the trajectories of the blocks at the same position in the neighboring possible intermediate frames. Therefore the position difference

Δx = x'_k - x_k,  Δy = y'_k - y_k   (6)

between the position of A(x'_k, y'_k) in the interpolated frame and (x_k, y_k) can be used to retrieve/calculate the block to be placed in that position. The situation is illustrated in Fig. 2, where block I is the interpolated block in frame (k). It is at the same position as block A in frame (k+1). Shifted blocks B in the neighboring frames are used to interpolate/generate the block at position (x_k, y_k). Since all temporally interpolated blocks are placed on an exact block grid, there is no need to employ hole-filling algorithms. This provides fairness in comparisons between the linear and nonlinear motion test results, as otherwise the filled regions would play an important role in the PSNR calculations.
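The construction step described above can be sketched as follows: evaluate the fitted trajectory at the times of the neighbouring frames, grab the correspondingly shifted blocks B, and average them onto the exact block grid of frame (k). All names, and the simple two-frame average, are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def interpolate_block(prev_frame, next_frame, ax, ay, t_prev, t_next, bsize=8):
    """Average the blocks that the trajectory (2) selects in the previous
    and next frames.  Indices are rounded to whole pixels, as in the
    paper; frame-boundary handling is omitted for brevity."""
    def grab(frame, t):
        x = int(round(np.polyval(ax, t)))  # trajectory column at time t
        y = int(round(np.polyval(ay, t)))  # trajectory row at time t
        return frame[y:y + bsize, x:x + bsize].astype(np.float64)
    blk = (grab(prev_frame, t_prev) + grab(next_frame, t_next)) / 2.0
    return np.clip(np.round(blk), 0, 255).astype(np.uint8)
```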

The linear motion model uses the first-degree equation set

x_l(t) = c_1 t + c_0
y_l(t) = d_1 t + d_0   (7)

The coefficients c_0, c_1, d_0 and d_1 of (7) are found using the block locations in frames (k-1) and (k+1). (e_xl, e_yl) is the positional difference between the block location obtained via the linear model (7) and the exact block location (x_kr, y_kr) in frame (k); similarly, (e_xnl, e_ynl) is the corresponding difference for the nonlinear model (2).

The positional errors can thus be written as

e_xl = x_kr - x'_lk,  e_yl = y_kr - y'_lk   (8)

e_xnl = x_kr - x'_k,  e_ynl = y_kr - y'_k   (9)

where (x'_lk, y'_lk) is the position at (k) given by the linear trajectory (7) and (x'_k, y'_k) is the position at (k) given by the nonlinear trajectory (2). The case for second- (or higher-) degree polynomials as models of block motion can be made by statistically analyzing e_x and e_y for different video sequences. Results of such statistical experiments are provided in Tables 1 and 2.
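A quick numerical check of (8)-(9): for a genuinely accelerating block, the linear fit (7) through frames (k-1) and (k+1) misses the true position at frame (k), while the quadratic fit (2) through three frames recovers it exactly. The trajectory below is synthetic, for illustration only.

```python
import numpy as np

t_samples = np.array([-1.0, 1.0, 3.0])           # times of (k-1), (k+1), (k+3)
true_x = lambda t: 0.5 * t**2 + 2.0 * t + 10.0   # synthetic accelerating motion
xs = true_x(t_samples)                            # observed block x positions

lin = np.polyfit(t_samples[:2], xs[:2], 1)       # (7): fit from (k-1), (k+1)
quad = np.polyfit(t_samples, xs, 2)              # (2): fit from all three frames

e_xl = abs(true_x(0.0) - np.polyval(lin, 0.0))   # linear positional error, as in (8)
e_xnl = abs(true_x(0.0) - np.polyval(quad, 0.0)) # nonlinear positional error, as in (9)
# here e_xl = 0.5 pixel while e_xnl is essentially zero
```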

C. Producing the Intermediate Frames

The matching block from either (k-1) or (k+1), or a linear combination of the two, can be placed at position (x_k, y_k) of the interpolated frame (k). The interpolated block whose position is estimated using the nonlinear formulation is consistently closer to the original block position than the position obtained using the linear motion model. For motions that are almost linear, both methods generate very similar results; when motion nonlinearity is dominant, the proposed nonlinear method generates higher-quality intermediate frames than the linear method. In addition, since all temporally interpolated blocks are placed on an exact block grid, there is no need to employ hole-filling algorithms; this makes the PSNR comparisons fairer and also eliminates the computational cost of hole filling.

III. SIMULATION RESULTS

Block motion statistics were calculated for the real video sequences Mother&Daughter, City, Crew, Miss America, Foreman and Mobile, in order to support the hypothesis that led us to the nonlinear motion model. Table 1 shows the average positional errors in the horizontal (e_x) and vertical (e_y) directions. These numbers are collected by first storing, for each frame, the average positional discrepancy between the block positions calculated by the two methods and the original position obtained by searching for the block in the original frame (k), and then averaging over time. Note that these are average positional errors; for many blocks, the linear method's positional error exceeds that of the nonlinear method by as much as 3 or 4 pixels. For a block to be counted as moving nonlinearly, the second-order term must amount to at least 1 pixel, because in this work the indexes of the B blocks are rounded when block I is

Figure 2. Illustrations of the nonlinear (upper) and linear (lower) methods used in the reconstruction of intermediate frames. Step 1: using the locations of block A in 3 (2 in the linear case) consecutive frames, the block trajectory is calculated. Step 2: block I of intermediate frame (k) is assumed to be located at the exact point of A in (k+1) and is interpolated from the B blocks.


interpolated. The percentages of nonlinear motion are given, for each sequence, in Table 2. These numbers support the choice of second-degree parametric motion equations for the nonlinear motion approach. Table 2 also shows the average PSNR over 50 interpolated frames. As the table shows, the proposed method has an obvious superiority in generating better-reconstructed intermediate frames that are closer to the ground truth. Figs. 3-6 are comparative PSNR graphs of the linear and proposed methods for the Foreman, Hall, Mother&Daughter and Akiyo video sequences; the proposed method exhibits a higher PSNR for almost all interpolated frames. Fig. 7 illustrates a visual comparison of the linear and proposed nonlinear methods: the distance between the nose and mouth of Miss America is the same in the actual frame and the frame produced by the proposed method, but different in the linear method's interpolated frame.
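The PSNR figures reported in Tables 1-2 and Figs. 3-6 follow the standard definition; a minimal implementation for 8-bit frames (function name illustrative):

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between an original frame and
    its reconstruction (8-bit pixel range assumed via `peak`)."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```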

TABLE I. AVERAGE POSITIONAL ERROR (IN PIXELS) FOR INDIVIDUAL SEQUENCES

Sequence          e_xl   e_xnl  e_yl   e_ynl
Mother&Daughter   1.44   0.63   1.02   0.75
Mobile            2.23   1.33   2.51   1.35
Foreman           2.52   1.92   2.35   1.72
Miss America      1.67   1.12   2.11   1.52
City              1.21   0.97   1.32   0.58
Crew              2.36   1.88   1.71   1.03
Akiyo             1.42   0.81   1.38   0.78
Hall              1.93   1.01   2.03   1.22

Figure 3. PSNR values for the Foreman video sequence.

Figure 4. PSNR values for the Hall video sequence.

Figure 5. PSNR values for the Mother&Daughter video sequence.

TABLE II. THE PERCENTAGE OF NONLINEAR MOTION AND COMPARISON OF RECONSTRUCTION RESULTS USING THE LINEAR AND NONLINEAR MOTION MODELS

Sequence          % nonlinear  Linear method (dB)  Proposed method (dB)
Mother&Daughter   23%          38.01               39.43
Mobile            14%          20.52               21.79
Foreman           22%          29.49               30.45
Miss America      34%          42.09               43.37
City              45%          24.99               25.46
Crew              67%          26.85               27.19
Akiyo             8%           44.55               45.25
Hall              12%          32.68               33.02


Figure 6. PSNR values for the Akiyo video sequence.

IV. SUMMARY AND DISCUSSION

MCFRUC is superior to frame repetition and linear interpolation since it removes jerky motion and blur. On the other hand, MCFRUC under the linear-motion assumption has a high tendency to create motion artifacts and generate incorrect interpolated frames when motions are dominantly nonlinear. In reality, objects in a video stream rarely move linearly; between successive frames they usually exhibit nonlinear motion. Linear MCFRUC is adequate only for motions with very small nonlinear components. When no data for ground-truth intermediate frames is available, modeling the block motions with a nonlinear parametric equation set and generating the intermediate frames accordingly is the more logical choice, as it is expected to produce more realistic results. The test results of this research support this argument.

Both the linear and nonlinear methods use unidirectional motion estimation from frame (k+1) to frame (k-1) and from frame (k+1) to frame (k+3) to produce frames (k) and (k+2). Thus there is no difference in computational cost between the linear and nonlinear methods in terms of the motion estimation process. The only additional cost of the nonlinear method is fitting the second-degree function, done here with the MATLAB polyfit function. This cost is negligible, however, since it amounts to solving a linear system with three unknowns per block.

No work has been done here on the parts that appear and/or disappear between successive frames. For any type of MCFRUC to generate successful results in practice, the ground-truth correspondence between blocks of successive frames must be established, which is possible only when such occlusions and appearances are handled properly. This issue should be addressed in future work.

REFERENCES

[1] C.W. Tang and O.C. Au, "Comparison between block-based and pixel-based temporal interpolation for video coding", Proc. IEEE Int'l Symp. on Circuits and Systems, Monterey, CA, June 1998, pp. 122-125.

[2] C.S. Park, J. Ye and S.U. Lee, "Lost motion vector recovery algorithm", Proc. IEEE Int'l Symp. on Circuits and Systems, vol. 3, pp. 229-232, 1994.

[3] S. Liu, Z. Yan, J. Kim and C.-C. J. Kuo, "Global/local motion-compensated frame interpolation for low bitrate video", Proc. SPIE Image and Visual Communications Processing, Jan. 2000.

[4] P. Haskell and D. Messerschmitt, "Resynchronization of motion compensated video affected by ATM cell loss", Proc. ICASSP, vol. 3, pp. 545-548, 1992.

[5] C.-W. Tang and O.C. Au, "Unidirectional motion compensated frame interpolation", Proc. IEEE ISCAS'97 Int'l Symp. on Circuits and Systems, vol. 2, pp. 1444-1447, 1997.

[6] K. Hilman, H.W. Park and Y. Kim, "Using motion-compensated frame-rate conversion for the correction of 3:2 pulldown artifacts in video sequences", IEEE Trans. Circuits and Systems for Video Tech., vol. 10, pp. 869-877, Sept. 2000.

[7] J. Chalidabhongse and C.-C. J. Kuo, "Fast motion estimation using multiresolution-spatio-temporal correlations", IEEE Trans. Circuits and Systems for Video Tech., vol. 7, pp. 477-488, June 1997.

[8] T. Chen, "Adaptive temporal interpolation using bidirectional motion estimation and compensation", Proc. Int'l Conf. on Image Processing, vol. 2, pp. 313-316, 2002.

[9] S. Liu, J.W. Kim and C.-C. J. Kuo, "Non-linear motion-compensated interpolation for low bit rate video", Proc. SPIE 4115, 203, doi:10.1117/12.411544, 2000.

Figure 7. Frames from the Miss America sequence: a) actual 20th frame, b) reconstructed using the linear motion model (42.46 dB), c) reconstructed using the second-degree motion model (45.14 dB).

