IMPLEMENTATION OF AN OUT-OF-THE LOOP POST-PROCESSING TECHNIQUE

FOR HEVC DECODED DEPTH MAPS

PARASHAR NAYANA KARUNAKAR

Presented to the Faculty of the Graduate School of

The University of Texas at Arlington in Partial Fulfillment

of the Requirements

for the Degree of

MASTER OF SCIENCE IN ELECTRICAL ENGINEERING

THE UNIVERSITY OF TEXAS AT ARLINGTON

December 2013

Dedicated to my Grand-Mother

~No occasion is complete without you~

Acknowledgements

I take this opportunity to express my gratitude to Dr. K.R. Rao, my supervising

professor. If it was not for his support, guidance and mentoring throughout my Masters,

this thesis would have been impossible. I would like to thank Dr. Jonathan Bredow and

Dr. Alan Davis at UTA for serving on my committee.

I would like to thank my manager, Dr. Yuriy Reznik at InterDigital, San Diego for

being considerate and supportive towards my thesis work while I worked as an intern, Dr.

Karsten Mueller and Dr. Gerhard Tech of Fraunhofer HHI and Dr. Varuna De Silva of

University of Surrey for their prompt email responses and clarifications that helped me

during the course of my thesis.

A special shout-out to Shwetha and Auddy, who have helped me in innumerable

ways (tech talks, university related work, always motivating and encouraging, etc.). I

thank Dilip and Abhijith for helping me endure the daunting task of managing the thesis-

work and internship simultaneously; all the “Housians” (Sindhu, Raghu, Asha,

Sarmishtha, Rohit, KT, Om, Karthik, …); my friends, Chethana and Apoorva back in India

who managed to support me irrespective of our time-zone differences. Also, a number of

people (family and friends) helped me to collect the results for the thesis by participating

in my image quality survey; I would like to acknowledge their help.

Finally and most importantly, I would like to thank my parents, Karunakar and

Champaka Parashar; my aunts, Suryaprabha, Saraswathi and Pankaja; my cousin

Chandana and her parents, Mr. and Mrs. Gurumurthy. Their unwavering love and support

always motivate me to march forward without fear and inhibitions.

Nayana Parashar

25th Nov, 2013

Abstract

IMPLEMENTATION OF AN OUT-OF-THE LOOP POST-PROCESSING TECHNIQUE

FOR HEVC DECODED DEPTH MAPS

Parashar Nayana Karunakar, MSEE

The University of Texas at Arlington, 2013

Supervising Professor: K.R. Rao

When depth-maps are compressed using the existing video codecs, the

compression artifacts introduce distortions in the rendered views. To get better rendering,

it is important to remove these compression artifacts. This thesis does so by applying

a post-processing framework to HEVC decoded depth maps. The proposed method is

based on compression artifact analysis of depth maps. The proposed work implements a

two-stage post-processing filter framework: an edge-adaptive joint trilateral filter,

followed by histogram analysis and adaptive bilateral filtering [43], which can

effectively minimize the effects of compression artifacts in the HEVC decoded

depth-maps. The rendered views before and after applying the post-processing filter

are compared with respect to their perceptual quality. PSNR, SSIM and MOS are the

metrics used for video quality estimation. The post-processing was applied to three

different sequences. For all three sequences, improvements in SSIM and better MOS

ratings were obtained for images rendered using post-processed depth-maps in

comparison to images rendered using just the HEVC decoded depth-maps.

The obtained results suggest that

the post-processing technique proposed in this thesis can be effectively used to improve

the quality of images obtained from depth-map based rendering.

Table of Contents

Acknowledgements

Abstract

List of Illustrations

List of Tables

List of Acronyms

Chapter 1 Introduction

1.1 Multimedia

1.1.1 Multimedia applications

1.2 Visual media

1.2.1 Multi-view plus depth video format

1.2.2 Depth-image based rendering

1.3 Need for compression

1.4 Thesis Outline

Chapter 2 Video compression standard - HEVC

2.1 High Efficiency Video Coding (HEVC)

2.1.1 HEVC coding design and feature highlights

2.1.1.1 Video Coding Layer

2.1.1.2 High level syntax architecture

2.1.1.3 Parallel decoding syntax and modified slice structuring

2.1.2 HEVC complexity analysis

2.2 Summary

Chapter 3 3D video compression standards

3.1 3D video coding in H.264/AVC

3.2 3D video coding in HEVC

3.2.1 Multi view plus depth video

3.2.2 Transmission of 3D video

3.2.3 Coding algorithm

3.2.4 Basic structure of 3D video codec

3.2.5 MVD codec vs HEVC standard codec

3.2.5.1 Coding of dependent views

3.2.5.2 Coding of depth maps

3.3 Summary

Chapter 4 Analysis of compression artifacts in depth maps

4.1 Virtual stereoscopic view generation process

4.2 Analysis of compression artifacts in depth maps on view rendering

4.2.1 Frequency domain analysis of the artifacts

4.3 Design requirements of a post-processing filter to minimize the effects of compression artifacts

4.4 Summary

Chapter 5 Thesis – Scope, Background and Working algorithm

5.1 Scope

5.2 Background Theory

5.2.1 Bilateral Filter

5.2.2 The joint bilateral and trilateral filter

5.2.3 The adaptive bilateral filter

5.3 Proposed algorithm

5.3.1. Depth discontinuity analysis

5.3.1.1 Identification of significant depth discontinuities

5.3.1.2 Identifying of aligned color and depth edges

5.3.2 Pre-filtering of the depth maps to improve depth bin identification

5.3.3 Global histogram analysis

5.3.4 Bilateral Sharpening Filter

5.3.5 Stereoscopic view rendering and comparison of results

5.4 Summary

Chapter 6 Experimental results

6.1 An approximate Mean Opinion Score calculation

6.2 Input Parameters

6.2 Results for different sequences

6.2.1 Sequence: Balloons

6.2.2 Sequence: Break-Dancer

6.2.3 Sequence: Kendo

6.3 Summary

Chapter 7 Conclusions and future-work

7.1 Conclusions

7.2 Future work

References

Biographical Information

List of Illustrations

Figure 1-1 2D image with spatial samples (L) and Video with N frames (R) [8]

Figure 1-2 Color video frame (L) and associated depth map frame (R) [18]

Figure 2-1 Chronology of International video coding standards [8]

Figure 2-2 Typical HEVC encoder [1]

Figure 2-3 HEVC decoder block diagram [4]

Figure 2-4 Example of CTU, partitioning and processing order when size of CTU is equal to 64 × 64 and minimum CU size is equal to 8 × 8 (a) CTU partitioning (b) Corresponding coding tree structure [5]

Figure 2-5 Prediction unit splitting types (U = up, D = down, L = left, R = right) [5]

Figure 2-6 Integer and fractional positions for luma interpolation [1]

Figure 2-7 Motion estimation with multiple reference frames [38]

Figure 2-8 Nine 4 × 4 luma prediction (intra-prediction) modes in H.264 [3]

Figure 2-9 Modes and directional orientations for intra-picture prediction [1]

Figure 2-10 Sub-division of a picture into a) Slices b) Tiles and c) illustration of wavefront parallel processing [1]

Figure 3-1 Simulcast coding structure with hierarchical B pictures for temporal prediction (black arrows) (L) and Multi-view coding structure with hierarchical B pictures for both temporal (black arrows) and inter-view prediction (red arrows) (R) [36]

Figure 3-2 Overview of the system structure and data format for the transmission of 3D video [37]

Figure 3-3 Access unit structure and coding order of view components [37]

Figure 3-4 Basic codec structure with inter-component prediction (red-arrows) [35] [37]

Figure 3-5 Disparity-compensated prediction as an alternative to motion-compensated prediction [37]

Figure 3-6 Basic principle of deriving motion parameters for a block in a current picture based on motion parameters in an already coded reference view and an estimate of the depth map for the current picture [37]

Figure 3-7 Independent derivation of motion information for each point of encoded CU from corresponding point in reference view [37]

Figure 3-8 Basic concept for the inter-view residual prediction [37]

Figure 4-1 Virtual view generation in Depth Image Based Rendering (DIBR) process [43]

Figure 4-2 Effect of compression noise in areas of homogenous depth [43]

Figure 4-3 Effect of compression noise in areas of sharp depth discontinuities [43]

Figure 5-1 Block diagram for the proposed work [43]

Figure 5-2 Illustration of depth discontinuity analysis

Figure 6-1 Result images – Rendered left-side images for balloons sequence

Figure 6-2 Result images – Rendered left-side images for break-dancer sequence

Figure 6-3 Rendered left-side images for kendo sequence

List of Tables

Table 1-1 Mass storage requirements by various media types (B=byte) [9]

Table 2 Input parameters and their values

Table 3 Filter parameters for EA-JTF and ABF [43]

Table 4 Sequence Balloons results

Table 5 Balloons sequence MOS rating

Table 6 Sequence Break-Dancer results

Table 7 Break-dancer sequence MOS rating

Table 8 Sequence Kendo results

Table 9 Kendo sequence MOS rating

List of Acronyms

2D: Two Dimensional
3D: Three Dimensional
ABF: Adaptive Bilateral Filtering
ADLF: Availability Deblocking Loopback Filter
AMVP: Advanced Motion Vector Prediction
AVC: Advanced Video Coding
AVS China: The Audio and Video coding Standard of China
CB: Coding Block
CG: Computer Graphics
CU: Coding Unit
CTB: Coding Tree Block
CTU: Coding Tree Unit
DBMP: Depth-Based Motion Prediction
DCP: Disparity-Compensated Prediction
DCT: Discrete Cosine Transform
DF: Deblocking Filter
DFT: Discrete Fourier Transform
DGLF: Depth Gradient based Loopback Filter
DIBR: Depth Image Based Rendering
DST: Discrete Sine Transform
EA-JTF: Edge-Adaptive Joint Trilateral Filter
FVV: Free Viewpoint Video
HD: High Definition
HEVC: High Efficiency Video Coding
IEC: International Electrotechnical Commission
ISO: International Standardization Organization
ITU-T: The Telecommunication Standardization Sector of International Telecommunication Union
MC: Motion Compensation
MCP: Motion-Compensated Prediction
MOS: Mean Opinion Score
MPEG: Moving Picture Experts Group
MV: Motion Vector
MVC: Multi-View Coding
MVD: Multi-View plus Depth
NAL: Network Abstraction Layer
PSNR: Peak Signal to Noise Ratio
PU: Prediction Unit
QP: Quantization Parameter
SAO: Sample Adaptive Offset
SDO: Standards Development Organization
SEI: Supplemental Enhancement Information
SMPTE: Society of Motion Picture and Television Engineers
SSIM: Structural Similarity Index Metric
TU: Transform Unit
VCEG: Video Coding Experts Group
VCL: Video Coding Layer
VOI: View Order Index
VPS: Video Parameter Set
VUI: Video Usability Information
WPP: Wavefront Parallel Processing
ZZC: Znear Zfar Compensation

Chapter 1

Introduction

1.1 Multimedia

The combination of multiple sources of video, audio, image and text is usually

known as multimedia. Multimedia systems combine a variety of information sources, such

as voice, graphics, animations, image, audio, and full-motion video, into a wide range of

applications. The big picture shows multimedia as the merging of three industries:

computing, communication and broadcasting. The defining characteristic of multimedia

systems is the incorporation of continuous media such as voice, video and animation [7].

1.1.1 Multimedia applications

Multimedia finds its application in various areas including, but not limited to

advertisements, art, education, entertainment, engineering, medicine, mathematics,

business, scientific research and spatial / temporal applications. Multimedia finds its

major application in the creative and entertainment industries. Movies and TV shows, the music

industry, interactive and non-interactive video games, video teleconferencing, medical

imaging and interactive educational tools are some of the many applications of

multimedia [7].

1.2 Visual media

Images and video make up the visual media. An image is characterized by pixels

or pels. The number of pixels in an image (height and width) and the color and brightness of

each pixel determine the properties of the image. Video is composed of a sequence of

pictures (frames) taken at regular temporal intervals. The number of frames per second is

called the frame rate [8]. This is shown in Figure 1-1.

Figure 1-1 2D image with spatial samples (L) and Video with N frames (R) [8]

1.2.1 Multi-view plus depth video format

The multi-view video plus depth (MVD) [16] [36] format is currently one of the

most promising formats to provide enhanced 3D visual experiences [13]. This type of

representation provides, for each view-point, texture (image sequences) and depth (map

sequences) information as shown in Figure 1-2. The depth maps represent the per-pixel

depth of a corresponding color image, and signal the disparity information needed at the

virtual (novel) view rendering system. The depth maps can be represented as a gray-

scale image sequence for storage and transmission requirements, and thus can be

compressed with existing video codecs, such as H.264/AVC [3] and HEVC [1] [2].

Thus, in addition to the textures, depth maps must be efficiently coded and

transmitted to the decoder, to be later used to render some virtual intermediate views of

the scene. This solution promises the capability to render a large number of views

(across a wide view-angle) while reducing the amount of data that needs to be

transmitted. However, the MVD format still requires a significant amount of data to be

stored or transmitted which is essential to provide the enhanced experience associated

with emerging applications such as free viewpoint video (FVV) [15] and next-generation

3DTV displays [15]. For FVV, MVD allows the viewer to select any desired view point

while for autostereoscopic displays (glasses free), intermediate views can be created at

the decoder for a multitude of viewing angles, thus increasing the 3D experience [14]

[16] [36] [13].

In the depth maps, each pixel conveys information on the relative distance from

the camera to the object in the 3D space. While the lighter gray regions represent near

objects, the darker gray regions represent far objects as shown in Figure 1-2. While

depth maps must be efficiently encoded, they are never displayed, but only used to

synthesize the intermediate views from the original ones, typically with depth image-

based rendering (DIBR) techniques. Contrary to the texture, depth map sequences do

not have any color, texture or illumination changes and are correlated with the

corresponding texture view sequence only at the object boundaries [16] [36].
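The mapping from gray level to distance described above can be sketched as follows. This is a minimal illustration, assuming the inverse-depth quantization convention commonly used with 8-bit depth maps, in which 255 corresponds to the nearest plane and 0 to the farthest. The function name and the parameters z_near and z_far are hypothetical and are not taken from this thesis.

```python
def depth_value_to_distance(v, z_near, z_far, levels=255):
    """Map an 8-bit depth-map sample v (0..levels) to a distance Z.

    Assumed convention: v = levels is the nearest plane (z_near), v = 0 is
    the farthest plane (z_far), and samples are spaced uniformly in 1/Z.
    """
    if not 0 <= v <= levels:
        raise ValueError("depth sample out of range")
    inv_z = (v / levels) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return 1.0 / inv_z


# Lighter gray (larger v) maps to a smaller distance, darker gray to a larger one.
near = depth_value_to_distance(255, z_near=1.0, z_far=10.0)
mid = depth_value_to_distance(128, z_near=1.0, z_far=10.0)
far = depth_value_to_distance(0, z_near=1.0, z_far=10.0)
```

Because the samples are uniform in 1/Z under this assumption, nearby objects receive finer depth resolution than distant ones.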

The quality of experience provided by the MVD format depends on several

factors, one of them being the accuracy of the estimated depth map. Depth can be

obtained in several ways, such as direct capture using time-of-flight cameras or

estimation from texture with stereo or multi-view matching algorithms [17]. In addition, the

depth map coding scheme selected is crucial to enable high quality view synthesis in

bandwidth constrained channels; a relevant factor for the performance of depth map

coding schemes is the preservation of the depth map discontinuities as this is critical to

reduce the geometric distortions along the object contours [16] [36].

Figure 1-2 Color video frame (L) and associated depth map frame (R) [18]

1.2.2 Depth-image based rendering

Depth-image-based rendering (DIBR) [20] is the process of synthesizing “virtual”

views of a scene from still or moving images and associated per-pixel depth information

[19]. Conceptually, this novel view generation can be understood as the following two-

step process: At first, the original image points are reprojected into the 3D world, utilizing

the respective depth data. Thereafter, these 3D space points are projected into the

image plane of a “virtual” camera, which is located at the required viewing position. The

concatenation of reprojection (2D-to-3D) and subsequent projection (3D-to-2D) is usually

called 3D image warping in the Computer Graphics (CG) literature [20].
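The two-step process above can be sketched for a single pixel. This is a simplified illustration under stated assumptions: ideal pinhole cameras with identical intrinsics, parallel optical axes, and a virtual camera displaced purely horizontally. The function names and the parameters focal, cx, cy and baseline are hypothetical, and the sketch ignores occlusions and hole filling.

```python
def reproject_to_3d(u, v, z, focal, cx, cy):
    """Step 1 (2D-to-3D): back-project pixel (u, v) with depth z into space."""
    x = (u - cx) * z / focal
    y = (v - cy) * z / focal
    return (x, y, z)


def project_to_virtual(point, focal, cx, cy, baseline):
    """Step 2 (3D-to-2D): project into a camera shifted by `baseline` along x."""
    x, y, z = point
    u = focal * (x - baseline) / z + cx
    v = focal * y / z + cy
    return (u, v)


def warp_pixel(u, v, z, focal, cx, cy, baseline):
    """Concatenate both steps: the 3D image warp of one pixel."""
    point = reproject_to_3d(u, v, z, focal, cx, cy)
    return project_to_virtual(point, focal, cx, cy, baseline)


# The same pixel warped with a near depth and with a far depth.
near_u, near_v = warp_pixel(400.0, 300.0, 2.0, focal=1000.0, cx=320.0, cy=240.0, baseline=0.05)
far_u, far_v = warp_pixel(400.0, 300.0, 20.0, focal=1000.0, cx=320.0, cy=240.0, baseline=0.05)
```

Under these assumptions the horizontal shift reduces to focal × baseline / z, so near pixels move further than far ones, which is why errors in the depth map translate directly into geometric errors in the rendered view.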

1.3 Need for compression

Compression is a technique wherein data is coded in an efficient manner to

represent it using fewer bits than the original un-coded data. The goal of compression is

to represent data with as low a bit-rate as possible without compromising on the quality.

Audio, image, and video signals require a vast amount of data for their representation [9].

In particular, the data required for representation of multi-view video with its associated

depth-maps is many times that of normal 2D video. Table 1-1 illustrates the mass

storage requirements for various media types, namely text, image, audio, and video.

Table 1-1 Mass storage requirements by various media types (B=byte) [9]

Media type | Object type | Size and bandwidth

Text | ASCII, EBCDIC | 2 KB per page

Image | Bitmapped graphics, still photos, faxes | Simple: 64 KB/image; detailed (color): 7.5 MB/image

Audio | Non-coded stream of digitized audio or voice | Voice/phone, 8 kHz, 8 bits (mono): 6-44 KB/s; audio CD, 44.1 kHz, 16 bit, stereo: 176 KB/s

Video | TV analog or digital image with synched streams at 24-30 frames/s | 27.7 MB/s for 640 × 480 pixels per frame (24-bit color) at 30 frames/s
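The uncompressed rates in Table 1-1 follow from simple arithmetic, sketched below. The helper names are hypothetical. The results are raw bytes per second, which come out close to the rounded figures quoted in the table.

```python
def video_rate_bytes_per_s(width, height, bits_per_pixel, fps):
    """Raw video rate: pixels per frame x bytes per pixel x frames per second."""
    return width * height * bits_per_pixel // 8 * fps


def audio_rate_bytes_per_s(sample_rate_hz, bits_per_sample, channels):
    """Raw PCM audio rate: samples per second x bytes per sample x channels."""
    return sample_rate_hz * bits_per_sample // 8 * channels


video = video_rate_bytes_per_s(640, 480, 24, 30)   # 640 x 480, 24-bit color, 30 frames/s
cd_audio = audio_rate_bytes_per_s(44_100, 16, 2)   # audio CD, stereo
phone = audio_rate_bytes_per_s(8_000, 8, 1)        # telephone-quality voice, mono

print(f"video: {video / 1e6:.1f} MB/s, CD audio: {cd_audio / 1e3:.1f} KB/s, phone: {phone / 1e3:.1f} KB/s")
```

A multi-view plus depth sequence multiplies the video figure by the number of views and adds one gray-scale depth stream per view, which is why the format depends so heavily on compression.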

When it comes to video, another important factor to be considered is the video

internet traffic. Video internet traffic is growing at a very fast pace, and it is estimated that

by 2017, 69% of the total Internet traffic will be video. The growing number of mobile

devices (smart phones, tablets, etc.) capable of video streaming and playback, and the

increasing popularity of viewing online video content, are driving this growth, with a growth rate of

90% projected from 2012 to 2017 [10]. Thus, there are three main reasons why present multimedia

systems require data to be compressed. They are:

a) Large storage requirements of multimedia data.

b) Relatively slow storage devices which do not allow playing multimedia

data (specifically video) in real-time.

c) The present network’s bandwidth, which does not allow real-time video

data transmission.

These three reasons along with the vast applications and usefulness of multimedia data

make compression an extremely important and challenging task.

1.4 Thesis Outline

Chapter 2 covers the HEVC video compression standard. The 3D video codecs

for H.264/AVC and HEVC are covered in chapter 3. In chapter 4, the analysis of compression

artifacts in depth-maps is explained. Chapter 5 covers the scope, background theory

and the working algorithm used in the proposed research. In chapter 6, the results of

the experimentation are listed along with the input parameters used. Chapter 7 gives the

conclusions and the areas where more work can be done in the future.

Chapter 2

Video compression standard - HEVC

Video and audio coding standards guarantee interoperability between software

and hardware provided by multiple vendors that make multimedia communications

practical. A series of video and audio coding standards has been developed by Standards

Development Organizations (SDO), including ISO/IEC (the International Standardization

Organization and the International Electrotechnical Commission) [11] [12], ITU-T (the

Telecommunication Standardization Sector of the International Telecommunication

Union, formerly CCITT) [21], SMPTE (Society of Motion Picture and Television

Engineers) [22], AVS China (the Audio and Video coding Standard of China) [24], DIRAC

by BBC [25] [26], and well-known companies, including Microsoft [23], Real Networks

[27] and On2 Technologies (acquired by Google) [28]. The chronology of different video

compression standards is shown in Figure 2-1.

Figure 2-1 Chronology of International video coding standards [8]

In this chapter, the High Efficiency Video Coding (HEVC) [2] video compression

standard is discussed, followed by discussions of 3D video compression in

H.264/AVC [3] and HEVC [2].

2.1 High Efficiency Video Coding (HEVC)

High-Efficiency Video Coding (HEVC) [1] [2] [32] is the newest video coding

standard of the ITU-T [29] Video Coding Experts Group (VCEG) and the ISO/IEC Moving

Picture Experts Group (MPEG) [30]. HEVC enables significantly improved compression

performance relative to existing standards – in the range of 50% bit rate reduction [1] for

equal perceptual video quality.

The major video coding standard directly preceding the HEVC [1] [2] [32] project

was H.264/MPEG-4 Advanced Video Coding (AVC) [3], which was initially developed

during 1999–2003, and then was extended in several important ways during 2003–2009.

H.264/MPEG-4 AVC [3] was an enabling technology for digital video in almost every area

that was not previously covered by H.262/MPEG-2 [31] Video, and has substantially

displaced the older standard within its existing application domain. It is widely used for

many applications, including broadcast of high definition (HD) TV signals over satellite,

cable, and terrestrial transmission systems, video content acquisition and editing

systems, camcorders, security applications, Internet and mobile network video, Blu-ray

discs, and real-time conversational applications such as video chat, video conferencing,

and tele-presence systems [1].

An increasing diversity of services, the growing popularity of HD video, and the

emergence of beyond-HD formats (e.g. 4k×2k or 8k×4k resolution) are creating even

stronger needs for coding efficiency superior to H.264/MPEG-4 AVC’s [3] capabilities.

The need is even stronger when higher resolution is accompanied by stereo or multi-view

capture and display. Moreover, the traffic caused by video applications targeting mobile

devices and tablet-PCs, as well as the transmission needs for video on demand services,

are imposing severe challenges on today’s networks. An increased desire for higher

quality and resolutions is also arising in mobile applications. HEVC standardization [1] [2]

[32] began to address all these needs. HEVC has been designed to address essentially

all existing applications of H.264/MPEG-4 AVC [3] and to particularly focus on two key

issues: increased video resolution and increased use of parallel processing architectures.

2.1.1 HEVC coding design and feature highlights

The HEVC standard is designed to achieve multiple goals: coding efficiency,

transport system integration and data loss resilience, as well as implementability using

parallel processing architectures. The following sub-sections describe the key elements

of the design by which these goals are achieved, and the typical encoder operation which

would generate a valid bitstream.

2.1.1.1 Video Coding Layer

The video coding layer of HEVC employs the same “hybrid” approach

(inter-/intra-picture prediction and 2D transform coding) used in all video compression

standards since H.261 [33]. Figure 2-2 depicts the block diagram of a hybrid video

encoder, which can create a bitstream conforming to the HEVC standard.

An encoding algorithm producing an HEVC [1] [2] [32] compliant bitstream would

typically proceed as follows. Each picture is split into block-shaped regions, with the

exact block partitioning being conveyed to the decoder. The first picture of a video

sequence (and the first picture at each “clean” random access point in a video sequence)

is coded using only intra-picture prediction (which uses some prediction of data spatially

from region-to-region within the same picture but has no dependence on other pictures).

For all the remaining pictures of a sequence or between random access points, inter-

picture temporally-predictive coding modes are typically used for most blocks. The

encoding process for inter-picture prediction consists of choosing motion data comprising

the selected reference picture and motion vector (MV) to be applied for predicting the

samples of each block. The encoder and decoder generate identical inter prediction

signals by applying motion compensation (MC) using the MV and mode decision data,

which are transmitted as side information [41].
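The inter-picture prediction step described above can be sketched as integer-sample block matching. This is a simplified stand-in rather than HEVC's actual procedure, which also involves fractional-sample interpolation, multiple reference pictures and motion vector prediction. The full-search strategy, the SAD cost and all function names are assumptions made for illustration.

```python
def get_block(frame, top, left, size):
    """Extract a size x size block from a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def estimate_motion(current, reference, top, left, size, search):
    """Encoder side: full search for the integer MV (dy, dx) with minimum SAD."""
    target = get_block(current, top, left, size)
    height, width = len(reference), len(reference[0])
    best_mv, best_cost = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            if t < 0 or l < 0 or t + size > height or l + size > width:
                continue  # candidate block falls outside the reference picture
            cost = sad(target, get_block(reference, t, l, size))
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost


def motion_compensate(reference, top, left, size, mv):
    """Encoder and decoder side: build the prediction from the transmitted MV."""
    dy, dx = mv
    return get_block(reference, top + dy, left + dx, size)


def make_frame(patch_top, patch_left, size=8):
    """Toy frame: black background with one bright 2 x 2 patch."""
    frame = [[0] * size for _ in range(size)]
    for r in range(patch_top, patch_top + 2):
        for c in range(patch_left, patch_left + 2):
            frame[r][c] = 200
    return frame


reference = make_frame(2, 3)   # patch position in the reference picture
current = make_frame(3, 5)     # patch moved one row down and two columns right
mv, cost = estimate_motion(current, reference, top=3, left=5, size=2, search=2)
prediction = motion_compensate(reference, 3, 5, 2, mv)
```

Only the motion vector needs to be sent as side information here. The decoder calls the same motion_compensate function on its own copy of the reference picture and obtains an identical prediction.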

Figure 2-2 Typical HEVC encoder [1]

The residual signal of the intra or inter prediction, which is the difference between

the original block and its prediction, is transformed by a linear spatial transform. The

transform coefficients are then scaled, quantized, entropy coded, and transmitted

together with the prediction information [1] [2].

The encoder duplicates the decoder processing loop such that both will generate

identical predictions for subsequent data. Therefore, the quantized transform coefficients

are constructed by inverse scaling and are then inverse transformed to duplicate the

decoded approximation of the residual signal. The residual is then added to the

prediction, and the result of that addition may then be fed into one or two loop filters to

smooth out artifacts induced by the block-wise processing and quantization. The final

picture representation (which is a duplicate of the output of the decoder) is stored in a

decoded picture buffer to be used for the prediction of subsequent pictures. In general,

the order of the encoding or decoding processing of pictures often differs from the order

in which they arrive from the source, necessitating a distinction between the decoding

order (a.k.a. bitstream order) and the output order (a.k.a. display order) for a decoder.

Figure 2-3 shows the block diagram of an HEVC decoder, which performs the inverse of

the encoder process.

Figure 2-3 HEVC decoder block diagram [4]
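The closed prediction loop described above can be illustrated with a minimal sketch. It is a toy example under stated assumptions: the transform, entropy coding and loop filters are omitted, a plain uniform quantizer stands in for scaling and quantization, and the function names are hypothetical and not taken from any HEVC implementation.

```python
def quantize(residual, step):
    """Map each residual sample to an integer level (uniform quantizer)."""
    return [int(round(r / step)) for r in residual]


def dequantize(levels, step):
    """Inverse scaling: recover the approximate residual from the levels."""
    return [q * step for q in levels]


def reconstruct_block(levels, prediction, step):
    """Decoder-side reconstruction: prediction plus decoded residual."""
    return [p + r for p, r in zip(prediction, dequantize(levels, step))]


def encode_block(original, prediction, step):
    """Encoder: code the residual, then duplicate the decoder reconstruction.

    The returned reconstruction (not the original block) is what the encoder
    would store for predicting subsequent data.
    """
    residual = [o - p for o, p in zip(original, prediction)]
    levels = quantize(residual, step)
    return levels, reconstruct_block(levels, prediction, step)


original = [100, 104, 97, 90]
prediction = [98, 98, 98, 98]
levels, encoder_recon = encode_block(original, prediction, step=4)
decoder_recon = reconstruct_block(levels, prediction, step=4)
```

Because the encoder stores `encoder_recon` rather than `original`, its reference data stays identical to that of the decoder, and quantization error does not accumulate as drift.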

Video material to be encoded by HEVC is generally expected to be input as

progressive scan imagery (either due to the source video originating in that format or

resulting from de-interlacing prior to encoding). No explicit coding features are present in

the HEVC design to support the use of interlaced scanning, as interlaced scanning is no

longer used for displays and is becoming substantially less common for distribution.

However, metadata syntax has been provided in HEVC to allow an encoder to indicate

that interlace-scanned video has been sent by coding each field (i.e. the even or odd

numbered lines of each video frame) of interlaced video as a separate picture or that it

has been sent by coding each interlaced frame as an HEVC coded picture. This provides

an efficient method of coding interlaced video without burdening decoders with a need to

support a special decoding process for it [1] [2].

The various features involved in hybrid video coding using HEVC will now be

highlighted [1] [2]:

Coding Tree Units and Coding Tree Block structure: The core of the

coding layer in previous standards was the macroblock, containing a

16×16 block of luma samples and, in the usual case of 4:2:0 color

sampling, two corresponding 8×8 blocks of chroma samples; whereas

the analogous structure in HEVC is the coding tree unit (CTU), which has

a size selected by the encoder and can be larger than a traditional

macroblock. The CTU consists of a luma coding tree block (CTB) and the

corresponding chroma CTBs and syntax elements. The size L x L of a

luma CTB can be chosen as L = 16, 32, or 64 samples, with the larger

sizes typically enabling better compression. HEVC then supports a

partitioning of the CTBs into smaller blocks using a tree structure and

quadtree-like signaling.

Coding Units and Coding Blocks: The quadtree syntax of the CTU

specifies the size and positions of its luma and chroma coding blocks

(CBs). The root of the quadtree is associated with the CTU. Hence, the

size of the luma CTB is the largest supported size for a luma CB. The

splitting of a CTU into luma and chroma CBs is signaled jointly. One

luma CB and ordinarily two chroma CBs, together with associated

syntax, form a Coding Unit (CU). A CTB may contain only one CU or

may be split to form multiple CUs, and each CU has an associated

partitioning into prediction units (PUs) and a tree of transform units

(TUs). An example of CTU partitioning and processing order, for a CTU size of

64 × 64 and a minimum CU size of 8 × 8, is shown in

Figure 2-4 [5].

Figure 2-4 Example of CTU partitioning and processing order when the size of the CTU is equal

to 64 × 64 and the minimum CU size is equal to 8 × 8: (a) CTU partitioning, (b) corresponding

coding tree structure [5]
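The quadtree partitioning of a CTU into CUs can be sketched as a short recursive routine. This is an illustration only: in a real encoder the split decision comes from rate-distortion optimization, whereas here it is a hypothetical predicate `should_split` supplied by the caller.

```python
def partition_ctu(x, y, size, min_cu, should_split):
    """Return the CUs of a CTU as (x, y, size) tuples in z-scan order.

    A block is split into four equal quadrants while it is larger than the
    minimum CU size and the (hypothetical) predicate asks for a split.
    """
    if size > min_cu and should_split(x, y, size):
        half = size // 2
        cus = []
        # Quadrant order: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                cus.extend(
                    partition_ctu(x + dx, y + dy, half, min_cu, should_split)
                )
        return cus
    return [(x, y, size)]


# Example rule: keep splitting only the block that contains the top-left corner.
cus = partition_ctu(0, 0, 64, 8, lambda x, y, size: x == 0 and y == 0)
```

With a 64 × 64 CTU and a minimum CU size of 8 × 8, this example rule yields four 8 × 8 CUs, three 16 × 16 CUs and three 32 × 32 CUs, which together tile the CTU.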

Prediction Units and Prediction Blocks: The decision whether to code

a picture area using inter-picture or intra-picture prediction is made at the

CU level. A prediction unit (PU) partitioning structure has its root at the

CU level. Depending on the basic prediction type decision, the luma and

chroma CBs can then be further split in size and predicted from luma and

chroma prediction blocks (PBs). HEVC supports variable PB sizes from

64×64 down to 4×4 samples. Different PU splitting types are shown in

Figure 2-5 [5].

Transform Units and Transform Blocks: The prediction residual is

coded using block transforms. A transform unit (TU) tree structure has its

root at the CU level. The luma CB residual may be identical to the luma

transform block (TB) or may be further split into smaller luma TBs. The

same applies to the chroma TBs. Integer basis functions similar to those

of a discrete cosine transform (DCT) are defined for the square TB sizes

4×4, 8×8, 16×16, and 32×32. For the 4×4 transform of intra-picture

prediction residuals, an integer transform derived from a form of discrete

sine transform (DST) is alternatively specified.

Figure 2-5 Prediction unit splitting types (U = up, D = down, L = left, R = right) [5]

Motion vector signaling: Advanced motion vector prediction (AMVP) is

used, including derivation of several most probable candidates based on

data from adjacent PBs and the reference picture. A “merge” mode for

MV coding can be also used, allowing the inheritance of MVs from

neighboring PBs. Moreover, compared to H.264/MPEG-4 AVC, improved

“skipped” and “direct” motion inference are also specified.

Motion compensation: Quarter-sample precision is used for the MVs,

and 7-tap or 8-tap filters are used for interpolation of fractional-sample

positions (compared to 6-tap filtering of half-sample positions followed

by bi-linear interpolation of quarter-sample positions in H.264/MPEG-4

AVC) [3]. Integer (A i,j) and fractional sample positions (lower-case

letters) for luma interpolation are shown in Figure 2-6. Similar to

H.264/MPEG-4 AVC [3], multiple reference pictures as shown in Figure

2-7 are used. For each PB, either one or two motion vectors can be

transmitted, resulting either in uni-predictive or bi-predictive coding,

respectively. As in H.264/MPEG-4 AVC [3], a scaling and offset

operation may be applied to the prediction signal(s) in a manner known

as weighted prediction.

Figure 2-6 Integer and fractional positions for luma interpolation [1]

Figure 2-7 Motion estimation with multiple reference frames [38]
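The fractional-sample interpolation can be illustrated in one dimension. The sketch below uses the HEVC luma filter coefficients (an 8-tap filter for the half-sample position and 7-tap filters for the quarter-sample positions, each normalized by 64). It is simplified: it works on a single row of 8-bit samples, clamps indices at the borders, and rounds in one stage, whereas the standard specifies a separable 2-D process with higher intermediate precision. The function name is hypothetical.

```python
# Luma filter taps for fractional positions 1/4, 2/4 and 3/4 (sum = 64 each).
LUMA_FILTERS = {
    1: (-1, 4, -10, 58, 17, -5, 1, 0),
    2: (-1, 4, -11, 40, 40, -11, 4, -1),
    3: (0, 1, -5, 17, 58, -10, 4, -1),
}


def interpolate_row(samples, pos, frac):
    """Sample value at horizontal position pos + frac/4 (frac in 0..3)."""
    if frac == 0:
        return samples[pos]
    acc = 0
    for k, coeff in enumerate(LUMA_FILTERS[frac]):
        # Taps cover integer positions pos-3 .. pos+4; clamp at the borders.
        idx = min(max(pos - 3 + k, 0), len(samples) - 1)
        acc += coeff * samples[idx]
    # Normalize by 64 with rounding and clip to the 8-bit range.
    return min(max((acc + 32) >> 6, 0), 255)


ramp = [10 * i for i in range(16)]           # 0, 10, 20, ...
half_sample = interpolate_row(ramp, 5, 2)    # between 50 and 60
```

On a linear ramp the half-sample value falls midway between its two integer neighbors, and on a constant row every fractional position reproduces the constant, since each filter sums to 64.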

Intra-picture prediction: The decoded boundary samples of adjacent

blocks are used as reference data for spatial prediction in PB regions

when inter-picture prediction is not performed. Intra prediction supports

33 directional modes [1] [2] (compared to 9 such modes shown in Figure

2-8 in H.264/MPEG-4 AVC) [3], plus planar (surface fitting) and DC (flat)

prediction modes. The selected intra prediction modes are encoded by

deriving most probable modes (e.g. prediction directions) based on those

of previously-decoded neighboring PBs. The different modes and

directional orientations for intra-picture prediction are as shown in Figure

2-9 [1].

Figure 2-8 Nine 4×4 luma intra-prediction modes in H.264 [3]

Figure 2-9 Modes and directional orientations for intra-picture prediction
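The two non-directional intra modes, DC and planar, can be sketched from the boundary samples of the neighboring blocks. The sketch assumes a square N × N block with N a power of two; `top` and `left` hold the N reference samples above and to the left, and `top_right` and `bottom_left` are the single reference samples just beyond them. Reference-sample substitution and the boundary smoothing applied by the standard are omitted, and the function names are hypothetical.

```python
def intra_dc(top, left):
    """DC (flat) prediction: rounded mean of the 2N boundary samples."""
    n = len(top)
    shift = n.bit_length()          # log2(N) + 1 for power-of-two N
    dc = (sum(top) + sum(left) + n) >> shift
    return [[dc] * n for _ in range(n)]


def intra_planar(top, left, top_right, bottom_left):
    """Planar prediction: average of a horizontal and a vertical
    linear interpolation between the boundary samples."""
    n = len(top)
    shift = n.bit_length()
    return [
        [
            (
                (n - 1 - x) * left[y] + (x + 1) * top_right
                + (n - 1 - y) * top[x] + (y + 1) * bottom_left
                + n
            ) >> shift
            for x in range(n)
        ]
        for y in range(n)
    ]


dc_block = intra_dc([10, 10, 10, 10], [20, 20, 20, 20])
flat_block = intra_planar([50] * 4, [50] * 4, 50, 50)
```

Both modes reproduce a flat block exactly when all reference samples are equal, which is the typical situation in the homogeneous regions of a depth map.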

Quantization control: As in H.264/MPEG-4 AVC, uniform reconstruction

quantization (URQ) is used in HEVC, with quantization scaling matrices

supported for the various transform block sizes.

Entropy coding: Context adaptive binary arithmetic coding (CABAC) is

used for entropy coding. This is similar to the CABAC scheme in

H.264/MPEG-4 AVC [3], but has undergone several improvements to

improve its throughput speed (especially for parallel-processing

architectures) and its compression performance, and also to reduce its

context memory requirements.

In-loop deblocking filtering (DF): A deblocking filter (DF) similar to the

one used in H.264/MPEG-4 AVC is operated in the inter-picture

prediction loop. However, the design is simplified in regard to its

decision-making and filtering processes, and is made friendlier to parallel

processing.

Sample adaptive offset (SAO): A non-linear amplitude mapping is

introduced in the inter-picture prediction loop after the deblocking filter.

The goal is to better reconstruct the original signal amplitudes by using a

look-up table that is described by a few additional parameters that can be

determined by histogram analysis at the encoder side.

2.1.1.2 High level syntax architecture

A number of design aspects new to the HEVC standard improve flexibility

for operation over a variety of applications and network environments and also

improve robustness to data losses. However, the high-level syntax architecture used in

the H.264/MPEG-4 AVC [3] standard has generally been retained, including the following

features [1] [2]:

Parameter set structure: Parameter sets contain information that can

be shared for the decoding of several regions of the decoded video. The

parameter set structure provides a robust mechanism for conveying data

that are essential to the decoding process. The concepts of sequence

and picture parameter sets from H.264/MPEG-4 AVC [3] are augmented

by a new video parameter set (VPS) structure.

NAL unit syntax structure: Each syntax structure is placed into a

logical data packet called a network abstraction layer (NAL) unit.

Depending on the content of a two-byte NAL unit header, it is possible to

readily identify the purpose of the associated payload data.

Slices: A slice is a data structure that can be decoded independently

from other slices of the same picture, in terms of entropy coding, signal

prediction, and residual signal reconstruction. (This describes ordinary

slices. An alternative form known as dependent slices is discussed

below.) A slice can either be an entire picture or a region of a picture.

One of the main purposes of slices is re-synchronization in the event of

data losses. In the case of packetized transmission, the maximum

number of payload bits within a slice is typically restricted, and the

number of CTUs in the slice is often varied to minimize the packetization

overhead while keeping the size of each packet within this bound.

SEI and VUI metadata: The syntax includes support for various types of

metadata known as supplemental enhancement information (SEI) and video

usability information (VUI). Such data provides information about the

timing of the video pictures, the proper interpretation of the color space

used in the video signal, 3D stereoscopic frame packing information,

other “display hint” information, etc.
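As an illustration of the NAL unit syntax structure listed above, the two-byte NAL unit header can be unpacked as follows. The field layout (1-bit forbidden_zero_bit, 6-bit nal_unit_type, 6-bit nuh_layer_id, 3-bit nuh_temporal_id_plus1) follows the HEVC specification; the class and function names are hypothetical.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    forbidden_zero_bit: int
    nal_unit_type: int
    nuh_layer_id: int
    nuh_temporal_id_plus1: int


def parse_nal_header(data: bytes) -> NalHeader:
    """Unpack the first two bytes of an HEVC NAL unit."""
    if len(data) < 2:
        raise ValueError("an HEVC NAL unit header needs two bytes")
    value = (data[0] << 8) | data[1]
    return NalHeader(
        forbidden_zero_bit=(value >> 15) & 0x1,
        nal_unit_type=(value >> 9) & 0x3F,
        nuh_layer_id=(value >> 3) & 0x3F,
        nuh_temporal_id_plus1=value & 0x7,
    )


header = parse_nal_header(bytes([0x40, 0x01]))
```

For the bytes 0x40 0x01 the NAL unit type is 32, which identifies a video parameter set, so the purpose of the payload is known before any of it is parsed.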

2.1.1.3 Parallel decoding syntax and modified slice structuring

Finally, a few new features are introduced in the HEVC standard to enhance

parallel processing capability or to modify the structuring of slice data for packetization

purposes. Each of them may have benefits in particular application contexts, and it is

generally up to the implementer of an encoder or decoder to determine whether and how

to take advantage of these features [1] [2].

Tiles: The option to partition a picture into rectangular regions called tiles

has been specified. The main purpose of tiles is to increase the capability

for parallel processing rather than provide error resilience. Tiles are

independently-decodable regions of a picture that are encoded with

some shared header information. Therefore, they can additionally be

used for the purpose of random access to local regions of video pictures.

A typical tile configuration of a picture consists of segmenting the picture

into rectangular regions with approximately equal numbers of CTUs in

each tile. Tiles provide parallelism at a more coarse level (picture/sub-

picture) of granularity, and no sophisticated synchronization of threads is

necessary for their use.

Wavefront parallel processing: When wavefront parallel processing

(WPP) is enabled, a slice is divided into rows of CTUs. The first row is

processed in an ordinary way; the second row can begin to be processed

after only a few decisions have been made in the first row; the third row

can begin to be processed after only a few decisions have been made in

the second row; etc. The context models of the entropy coder in each

row are inferred from those in the preceding row with a small fixed

processing lag. WPP provides a form of processing parallelism at a

rather fine level of granularity, i.e. within a slice. WPP may often provide

better compression performance than tiles (and avoid some visual

artifacts that may be induced by tiles).

Dependent slices: A structure called a dependent slice allows data

associated with a particular wavefront entry point or tile to be carried in a

separate NAL unit, and thus potentially makes that data available to a

system for fragmented packetization with lower latency than if it were all

coded together in one slice. A dependent slice for a wavefront entry point

can only be decoded after at least part of the decoding process of

another slice has been performed. Dependent slices are mainly useful in

low-delay encoding, where other parallel tools may penalize compression

performance.

The concepts of slices, tiles and wavefront parallel processing are illustrated in

Figure 2-10 [1].

Figure 2-10 Sub-division of a picture into a) Slices b) Tiles and c) illustration of wavefront

parallel processing [1]
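The dependency pattern of wavefront parallel processing can be made concrete with a small scheduling sketch. It assumes an idealized model that is not part of the standard: one thread per CTU row, unit processing time per CTU, and a CTU may start once its left neighbor and the above-right CTU of the preceding row are finished. The function name is hypothetical.

```python
def wavefront_schedule(rows, cols):
    """Earliest start step of every CTU under the idealized WPP model."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(start[r][c - 1])            # left neighbor
            if r > 0:
                # Above-right CTU (or the last CTU of the row above).
                deps.append(start[r - 1][min(c + 1, cols - 1)])
            start[r][c] = max(deps) + 1 if deps else 0
    return start


schedule = wavefront_schedule(3, 5)
total_steps = schedule[-1][-1] + 1
```

Under these assumptions each row starts two steps after the row above it, so a 3 × 5 CTU grid finishes in 9 steps instead of the 15 needed for purely sequential processing.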

2.1.2 HEVC complexity analysis

Complexity of some key modules such as transforms, intra prediction, and

motion compensation is higher in HEVC than in H.264/AVC [3]. Complexity of modules

such as entropy coding and deblocking is lower in HEVC than in H.264/AVC [3]. The

implementation cost of an HEVC decoder is thus not much higher than that of an

H.264/AVC decoder, even with the addition of an in-loop filter such as SAO [34].

From an encoder perspective, things look different: HEVC features many more

mode combinations as a result of the added flexibility from the quadtree structures and

the increase of intra prediction modes. An encoder fully exploiting the capabilities of

HEVC is thus expected to be several times more complex than an H.264/AVC encoder.

This added complexity does however have a substantial benefit in the expected

significant improvement in rate-distortion performance [34]. Researchers are focusing on

reducing the encoder complexity [39] [40] [63] [64].

2.2 Summary

In this chapter, development of different video compression standards is first

explored followed by a detailed description of the latest HEVC video compression

standard. In chapter 3, 3D video coding in H.264/AVC and 3D video coding in HEVC are

covered.

Chapter 3

3D video compression standards

With the development of 3D video technology, there is a rising demand for better

representation and delivery of 3D video without compromising on the quality of video.

The 3D video coding versions of H.264/AVC [6] and HEVC [1] [2] are emerging as

solutions to these demands.

3.1 3D video coding in H.264/AVC

Multiview Video Coding (MVC) [6] is an amendment to H.264/MPEG-4

AVC video compression standard [3] developed with joint efforts by MPEG/VCEG that

enables efficient encoding of sequences captured simultaneously from multiple cameras

using a single video stream.

MVC is intended for encoding stereoscopic (two-view) video, as well as free

viewpoint television and multi-view 3D television. The Stereo High profile of H.264/AVC

was standardized in June 2009; the profile is based on the MVC toolset and is used in

stereoscopic Blu-ray 3D releases [6].

An MVC stream is backward compatible with H.264/AVC [3], which allows older

devices and software to decode stereoscopic video streams, ignoring additional

information for the second view [6].

Multiview video contains a large amount of inter-view statistical dependencies,

since all cameras capture the same scene from different viewpoints. Therefore, combined

temporal and inter-view prediction is the key for efficient MVC encoding. A frame from a

certain camera can be predicted not only from temporally related frames from the same

camera, but also from the frames of neighboring cameras. These interdependencies can

be used for efficient prediction [6].

3.2 3D video coding in HEVC

The multi-view plus depth video format [16] explained in section 1.2.1 is the video

format used for 3D video coding in HEVC. This section explains the codec used for 3D

extension of HEVC.

3.2.1 Multi view plus depth video

Recent improvements in 3D video technology led to a growing interest in 3D

video. Autostereoscopic displays, which provide a 3D viewing experience without

glasses, are consistently improved and are considered as a promising technology for

future 3D home entertainment [35] [36]. In contrast to common stereo displays,

autostereoscopic displays require not only two, but a multitude of different views for

providing the 3D viewing experience. Since the bit rate required for coding multiview

video with the MVC extension of H.264/AVC [3] increases approximately linearly with the

number of coded views, MVC [6] is not appropriate for delivering 3D content for

autostereoscopic displays. A promising alternative is the transmission of 3D video in the

multiview video plus depth (MVD) format [13] [36].

In the MVD format, typically only a few views are actually coded, but each of

them is associated with coded depth data, which represent the basic geometry of the

captured video scene. Based on the transmitted video pictures and depth maps,

additional views suitable for displaying 3D video content on autostereoscopic displays

can be generated using depth image based rendering (DIBR) [20] techniques at the

receiver side. For the purpose of view synthesis, camera parameters are additionally

included in the bitstream. The bitstream packets include header information, which signals,

in connection with transmitted parameter sets, a view identifier and an indication whether

the packet contains video or depth data. The difference between the simulcast coding

structure with hierarchical B pictures and that of multi-view coding structure with

hierarchical B pictures is shown in Figure 3-1 [36] [37].

Figure 3-1 Simulcast coding structure with hierarchical B pictures for

temporal prediction (black arrows) (L) and Multi-view coding structure with hierarchical B

pictures for both temporal (black arrows) and inter-view prediction (red arrows) (R) [36]

3.2.2 Transmission of 3D video

The basic concept of the system and data format is illustrated in Figure 3-2. In

general the input signal for the encoder consists of multiple views, associated depth

maps, and corresponding camera parameters. However, as described above, the codec

can also be operated without depth data. The input component signals are coded using a

3D video encoder, which represents an extension of HEVC. The base view is coded

using an unmodified HEVC encoder. The 3D video encoder generates a bitstream, which

represents the input videos and depth data in a coded format. If the bitstream is decoded

using a 3D video decoder, the input videos, the associated depth data, and camera

parameters are reconstructed with the given fidelity. For displaying the 3D video on an

autostereoscopic display, additional intermediate views are generated by a DIBR

algorithm using the reconstructed views and depth data. If the 3D video decoder is

connected to a conventional stereo display instead of to an autostereoscopic display, the

view synthesizer can also generate a pair of stereo views, in case such a pair is not

actually present in the bitstream. It is possible to adjust the rendered stereo views to the

stereo geometry of the viewing conditions. One of the decoded views or an intermediate

view at an arbitrary virtual camera position can also be used for displaying a single view

on a conventional 2D display. The 3D video bitstream is constructed in a way that the

sub-bitstream representing the coded representation of the base view can be extracted

by simple means. The bitstream packets representing the base view can be identified by

inspecting transmitted parameter sets and the packet headers. The sub-bitstream for the

base view can be extracted by discarding all packets that contain depth data or data for

the dependent views. Then, the extracted sub-bitstream can be directly decoded with an

unmodified HEVC decoder and displayed on a conventional 2D video display. [35] [37]

The encoder can also be configured in a way that the sub-bitstream containing

only two stereo views can be extracted and directly decoded using a stereo decoder. The

encoder can also be configured in a way that the views can be generally decoded

independently of the depth data. It is also possible to synthesize intermediate views using

only the stereo sequences as input of the view synthesis [35] [37].

Figure 3-2 Overview of the system structure and data format for the transmission

of 3D video [37]
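The sub-bitstream extraction described in this section amounts to filtering packets on two header properties. The sketch below is a simplified model: the `Packet` type and its fields are hypothetical stand-ins for the view identifier and the video/depth indication signalled in the packet headers and parameter sets, and the base view is assumed to have view identifier 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    view_id: int          # view identifier signalled for the packet
    is_depth: bool        # True for depth data, False for video data
    payload: bytes = b""


def extract_base_view(packets, base_view_id=0):
    """Discard all depth packets and all packets of dependent views."""
    return [p for p in packets if p.view_id == base_view_id and not p.is_depth]


def extract_texture_views(packets, view_ids):
    """Keep only the video packets of the requested views (e.g. a stereo pair)."""
    wanted = set(view_ids)
    return [p for p in packets if p.view_id in wanted and not p.is_depth]


stream = [
    Packet(0, False), Packet(0, True),
    Packet(1, False), Packet(1, True),
    Packet(2, False), Packet(2, True),
]
base = extract_base_view(stream)
stereo = extract_texture_views(stream, [0, 1])
```

The stereo extraction is only decodable on its own if the encoder was configured so that the video pictures do not depend on the depth data, as noted above.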

3.2.3 Coding algorithm

The coding algorithm based on the MVD format, in which each video picture is

associated with a depth map, is described. The coding algorithm can also be used for a

multiview format without depth maps. The video pictures and, when present, the depth

maps are coded access unit by access unit, as illustrated in Figure 3-3. An access unit

includes all video pictures and depth maps that correspond to the same time instant.

Non-VCL NAL units containing camera parameters may be additionally associated with

an access unit [37].

The video pictures and depth maps corresponding to a particular camera position

are indicated by a view identifier (viewId). All video pictures and depth maps that belong

to the same camera position are associated with the same value of viewId. The view

identifiers are used for specifying the coding order inside the access units and detecting

missing views in error-prone environments. Inside an access unit, the video picture and,

when present, the associated depth map with viewId equal to 0 are coded first, followed

by the video picture and depth map with viewId equal to 1, etc. A video picture and depth

map with a particular value of viewId are transmitted after all video pictures and depth

maps with smaller values of viewId. [35] [37]

Figure 3-3 Access unit structure and coding order of view components [37]

For the independent view, the video picture is always coded before the

associated depth map. For dependent views, the video picture may be coded before or

after the associated depth map (i.e., the depth map with the same value of viewId). It

should be noted that the value of viewId does not necessarily represent the arrangement

of the cameras in the camera array. For ordering the reconstructed video pictures and

depth maps after decoding, each value of viewId is associated with another identifier

called view order index (VOI). The view order index is a signed integer value, which

specifies the ordering of the coded views from left to right. If a view A has a smaller value

of VOI than a view B, the camera for view A is located to the left of the camera of view B. In

addition, camera parameters required for converting depth values into disparity vectors

are included in the bitstream [35] [37].
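The two orderings described above, the coding order by viewId and the spatial order by view order index, can be contrasted in a short sketch. The data layout is hypothetical, and the sketch assumes for simplicity that the video picture precedes the depth map in every view, although for dependent views the opposite order is also permitted.

```python
def coding_order(components):
    """Order the (view_id, kind) components of one access unit for coding.

    Components are sorted by increasing view_id; within a view the video
    picture is placed before the depth map (a simplifying assumption).
    """
    return sorted(components, key=lambda c: (c[0], 0 if c[1] == "video" else 1))


def left_to_right(view_ids, view_order_index):
    """Arrange decoded views spatially using the signed view order index."""
    return sorted(view_ids, key=lambda v: view_order_index[v])


access_unit = [(1, "depth"), (0, "depth"), (2, "video"),
               (0, "video"), (1, "video"), (2, "depth")]
ordered = coding_order(access_unit)

# Hypothetical camera array: view 1 is left of view 0, view 2 is right of it.
spatial = left_to_right([0, 1, 2], {0: 0, 1: -1, 2: 1})
```

In this example the base view (viewId 0) is coded first but sits in the middle of the camera array, which shows why viewId alone does not describe the camera arrangement.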

3.2.4 Basic structure of 3D video codec

The basic structure of the 3D video codec is shown in Figure 3-4. In principle,

each component signal is coded using an HEVC-based codec. The resulting bitstream

packets, or more accurately, the resulting Network Abstraction Layer (NAL) units, are

multiplexed to form the 3D video bitstream. The base or independent view is coded using

an unmodified HEVC codec. Given the 3D video bitstream, the NAL units containing data

for the base layer can be identified by parsing the parameter sets and NAL unit header of

coded slice NAL units (up to the picture parameter set identifier). Based on these data,

the sub-bitstream for the base view can be extracted and directly decoded using a

conventional HEVC decoder [4] [35] [37].

Figure 3-4 Basic codec structure with inter-component prediction (red

arrows) [35] [37]

For coding the dependent views and the depth data, modified HEVC codecs are

used, which are extended by including additional coding tools and inter-component

prediction techniques that employ already coded data inside the same access unit as

indicated by the red arrows in Figure 3-4. For enabling an optional discarding of depth

data from the bitstream, e.g., for supporting the decoding of a stereo video suitable for

conventional stereo displays, the inter-component prediction can be configured in a way

that video pictures can be decoded independently of the depth data.

3.2.5 MVD codec vs HEVC standard codec

In this section, those aspects of MVD codec that are different from the standard

HEVC codec are discussed.

3.2.5.1 Coding of dependent views

Additional tools have been integrated into the HEVC codec, which employ

already coded data in other views for efficiently representing a dependent view. These

tools include –

Disparity-compensated prediction

As a first coding tool for the dependent views, the well-known concept of

disparity-compensated prediction (DCP), which is also used in MVC, has been added as

an alternative to motion-compensated prediction (MCP). Here, MCP refers to an inter-

picture prediction that uses already coded pictures of the same view, while DCP refers to

an inter-picture prediction that uses already coded pictures of other views in the same

access unit, as illustrated in Figure 3-5 [37].

View synthesis based inter-view prediction

Based on all already coded views, a new virtual view is synthesized at the

position of the current view. Some regions of the newly synthesized image are not available

because they were occluded in the previously coded views. These disoccluded regions are

identified and marked on a binary map, called the availability map, which controls the coding

and decoding process. The coder and decoder both use this map to determine

whether a given CU is coded or not. Because in a typical case most of the scene is the

same in all views, only small parts are disoccluded in subsequently coded views, and

thus only a small number of CUs need to be coded [37].

Figure 3-5 Disparity-compensated prediction as an alternative to motion-

compensated prediction [37]

Post processing in-loop filtering

A final step of view-synthesis prediction is the reduction of artifacts in the synthesized

view. This post-processing consists of a Depth-Gradient-based Loopback Filter (DGLF)

and an Availability Deblocking Loopback Filter (ADLF) [37].

The former (DGLF) reduces texture artifacts introduced by the DIBR [20] technique

in areas of sudden depth changes. To cope with these, the synthesized image is

adaptively filtered with respect to the depth gradient strength. Large depth edges impose

strong low-pass filtering of the synthesized texture, while flat depth regions are not

filtered at all [37].

The latter (ADLF) reduces artifacts that are generated as a result of CU-based

block coding. The shape of a coded region does not necessarily match the shape of the binary availability

map. This discrepancy is a source of artificial edges between those regions. The ADLF

provides a smooth transition between coded and synthesized regions by interpolating

between them [37].

Inter-view motion prediction

The basic concept of the inter-view prediction of motion parameters is illustrated

in Figure 3-6. For the following overview, it is assumed that an estimate of a pixel-wise

depth map for the current picture is given. Below, it is described how such an estimate

can be derived. For deriving candidate motion parameters for a current block in a

dependent view, a sample location x in the middle of the block is selected and the

associated depth value d is converted to a disparity vector. By adding the disparity vector

to the sample location x a reference sample location xR is obtained. The prediction block

in the already coded picture in the reference view that covers the sample location xR is

used as the reference block. If this reference block is coded using MCP, the associated

motion parameters can be used as candidate motion parameters for the current block in

the current view. The derived disparity vector can also be directly used as a candidate

disparity vector for DCP [37].
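The derivation of a candidate from the reference view can be sketched as follows. Several assumptions are made that are not stated in the text: the 8-bit depth value d maps linearly to inverse depth between 1/zfar (d = 0) and 1/znear (d = 255), the cameras are rectified and parallel so that the disparity is purely horizontal and equals focal length × baseline / z, and the motion data of the reference view is stored per block in a dictionary. All names and parameters are hypothetical.

```python
def depth_to_disparity(d, z_near, z_far, focal_length, baseline):
    """Convert an 8-bit depth value to a horizontal disparity in samples."""
    inv_z = (d / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return focal_length * baseline * inv_z


def reference_location(x, y, d, cam):
    """Reference sample location xR = x + disparity (horizontal shift only)."""
    return x + int(round(depth_to_disparity(d, **cam))), y


def candidate_motion(x, y, d, cam, ref_motion, block_size=8):
    """Motion parameters of the reference-view block covering xR.

    ref_motion maps block coordinates to a motion vector; blocks that were
    not coded with MCP are simply absent, in which case None is returned.
    """
    xr, yr = reference_location(x, y, d, cam)
    return ref_motion.get((xr // block_size, yr // block_size))


cam = {"z_near": 10.0, "z_far": 100.0, "focal_length": 1000.0, "baseline": 0.1}
ref_motion = {(3, 1): (2, -1)}      # block (3, 1) was coded with MV (2, -1)
candidate = candidate_motion(20, 8, 255, cam, ref_motion)
```

Here the nearest possible depth (d = 255) gives a disparity of about 10 samples, so the sample at (20, 8) maps to (30, 8) in the reference view and inherits the motion vector of the block covering that location.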

Depth-based motion parameter prediction

Depth-Based Motion Prediction (DBMP) is a new coding tool for multiview video

coding which originates from the idea that motion fields of neighboring views in a multiview

sequence are highly correlated. DBMP provides an efficient representation of motion data

in multiview video bitstreams that also carry depth/disparity maps. The motion

information, such as motion vectors and reference indices, for each pixel of the encoded

coding unit (CU) is directly inferred, with the use of already coded disparity maps, from

encoded CUs in the neighboring views at the same temporal instance (Figure 3-7). This

procedure is repeated independently for every pixel of the encoded CU. Consequently,

motion vectors and reference indices for the CU are not transmitted in the bitstream but are

obtained from the reference view at the receiving side.

Figure 3-6 Basic principle of deriving motion parameters for a block in a current picture

based on motion parameters in an already coded reference view and an estimate of the

depth map for the current picture [37].

Figure 3-7 Independent derivation of motion information for each point of encoded CU

from corresponding point in reference view [37].

Inter-view residual prediction

The basic principle of the inter-view residual prediction is illustrated in Figure 3-8.

As for the inter-view motion prediction, the inter-view residual prediction is

based on a depth map estimate for the current picture. The same depth map estimate as

for the inter-view motion prediction is used. Based on the depth map estimate, a disparity

vector is determined for a current block, and the residual block in the reference view that

is referenced by the disparity vector is used for predicting the residual of the current block.

Figure 3-8 Basic concept for the inter-view residual prediction [37]

Adjustment of QP of texture based on depth data

In order to improve perceptual quality of coded texture, a tool for bit assignment

in the texture layer was developed. The basic idea is to increase texture quality of objects

in the foreground and to increase compression factor (decrease texture quality) for

objects in the background. The quality is adjusted in coding units (CUs) with the use of a

quantization parameter QP that depends on the corresponding depth values. The QP

adjustment is done identically in the coder and decoder so that no additional information

is sent. The described tool is disabled in the base view to preserve HEVC compatibility. The

texture QP is modified in the following way:

where QP' is the adjusted QP value for the corresponding disparity dx,y (8-bit depth

maps are considered).

3.2.5.2 Coding of depth maps

For the coding of depth maps, basically the same concepts of intra-prediction,

motion-compensated prediction, disparity-compensated prediction, and transform coding

as for the coding of the video pictures are used. However, some tools have been

modified for depth maps, other tools have been generally disabled, and additional tools

have been added.

As a first difference to the coding of video pictures, the inter-view motion and

residual prediction as described in sec. 3.2.5.1 are not used for depth coding. Instead,

motion parameters are derived based on coded data in the associated video pictures.

The other differences are described in this section.

Disabled chrominance coding

Depth maps may be coded in 4:0:0 chroma sampling format.

Non-linear depth representation

As an alternative representation of depth maps, the depth may be non-linearly

scaled as described in the following. The human perception of depth depends on

the absolute distance of viewed objects; therefore the internal depth representation is non-

linear. Closer objects are represented more accurately than distant ones. As a result,

subjective quality of synthesized views is improved.

Z-near z-far compensated weighted prediction

The proposed znear-zfar compensation (ZZC) is a new coding tool for multiview

video, designed especially for inter-frame depth map coding.

The concept of ZZC exploits the observation that frames from different views and

time instances of encoded depth sequence may have different znear and zfar

parameters. The mentioned znear and zfar parameters describe range of depths

represented in a gray-scale depth map. If znear and zfar parameters are different for two

frames, then given depth value is represented with different gray-scale values in those

depth maps. Consequently, using one of such depth maps as a reference for the other

one will result in a poor prediction.

To overcome this problem, a new ZZC coding tool is proposed. Prior to any inter-

frame depth map prediction, each depth map that resides in the codec reference picture

list is scaled, so that gray-scale depth values in scaled image and currently coded image

refer to the same depth. As a result, depth maps with compensated znear and zfar range

are used for prediction. Values used for prediction (instead of the original ones)

are calculated as follows:

where L_T is the compensated disparity in the depth range znear_T to zfar_T and L_S is the

original disparity in the depth range znear_S to zfar_S. An 8-bit image is considered.

Modified motion compensation and motion vector coding

In contrast to natural video, depth maps are characterized by sharp edges and

large regions with nearly constant values. The eight-tap interpolation filters that are used

for motion-compensated interpolation in HEVC can produce ringing artifacts at sharp

edges in depth maps, which are visible as disturbing components in synthesized

intermediate views. For avoiding this issue and for decreasing the encoder and decoder

complexity, the motion-compensated prediction (MCP) as well as the disparity-

compensated prediction (DCP) has been modified in a way that no interpolation is used.

That means, for depth maps, the inter-picture prediction is always performed with full-

sample accuracy. For the actual MCP or DCP, a block of samples in the reference picture

is directly used as prediction signal without interpolating any intermediate samples. In

order to avoid the transmission of motion and disparity vectors with an unnecessary

accuracy, full-sample accurate motion and disparity vectors are used for coding depth

maps. The transmitted motion vector differences are coded using full-sample instead of

quarter-sample precision.

Disabling of in-loop filtering

The in-loop filters in the HEVC design have been particularly designed

for the coding of natural video. For the coding of depth maps, these filters are

less useful. In order to decrease the encoder and decoder complexity, the in-

loop filters have been disabled for depth coding. This includes the following

filters:

the de-blocking filter;

the sample-adaptive offset (SAO) filter.

Depth modelling modes

Depth maps are mainly characterized by sharp edges (which represent object

borders) and large areas of nearly constant or slowly varying sample values (which

represent object areas). While the HEVC intra prediction and transform coding are well-

suited for nearly constant regions, they can result in significant coding artifacts at sharp

edges, which are visible in synthesized intermediate views. For a better representation of

edges in depth maps, four new intra prediction modes for depth coding are added.

Four depth-modeling modes, which mainly differ in the way the partitioning is

derived and transmitted, have been added:

Mode 1: Explicit wedgelet signaling;

Mode 2: Intra-predicted wedgelet partitioning;

Mode 3: Inter-component-predicted wedgelet partitioning;

Mode 4: Inter-component-predicted contour partitioning.

Mode 1: The basic principle of this mode is to find the best matching wedgelet partition at

the encoder and transmit the partition information in the bitstream. At the decoder the

signal of the block is reconstructed using the transmitted partition information.

Mode 2: The basic principle of this mode is to predict the wedgelet partition from data of

previously coded blocks in the same picture, i.e. by intra-picture prediction. For a better

approximation, the predicted partition is refined by varying the line end position. Only the

offset to the line end position is transmitted in the bitstream and at the decoder the signal

of the block is reconstructed using the partition information that results from combining

the predicted partition and the transmitted offset.

Mode 3: The basic principle of this mode is to predict the wedgelet partition from a

texture reference block, namely the co-located block of the associated video picture. This

type of prediction is referred to as inter-component prediction. Unlike temporal or inter-

view prediction, no motion or disparity compensation is used, as the texture reference

picture shows the scene at the same time and from the same perspective. The wedgelet

partition information is not transmitted for this mode and consequently, the inter-

component prediction uses the reconstructed video picture as a reference. For efficient

processing, only the luminance signal of the reference block is taken into account, as this

typically contains the most significant information for predicting the partition of a depth

block, i.e. the edges between objects.

Mode 4: The basic principle of this mode is to predict a contour partition from a texture

reference block by inter-component prediction. Like for the inter-component prediction of

a wedgelet partition pattern, the reconstructed luminance signal of the co-located block of

the associated video picture is used as a reference. In contrast to wedgelet partitions, the

prediction of a contour partition is realized by a thresholding method. Here, the mean

value of the texture reference block is set as the threshold and depending on whether the

value of a sample is above or below the threshold, the sample position is marked as part of

region P1 or P2 in the resulting contour partition pattern.
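The thresholding step of Mode 4 can be illustrated with a minimal sketch. The function name, the sample block and the choice that samples above the mean belong to P1 (and that samples equal to the mean fall into P2) are assumptions made for illustration; the text above only states that the mean of the texture reference block is used as the threshold.

```python
def contour_partition(texture_block):
    """Split a luma block into two regions by thresholding at its mean.

    Returns a pattern of booleans: True marks region P1 (sample above the
    mean), False marks region P2. Which side is called P1 is an assumption.
    """
    samples = [v for row in texture_block for v in row]
    threshold = sum(samples) / len(samples)  # mean of the reference block
    return [[v > threshold for v in row] for row in texture_block]


# Hypothetical 4x4 reconstructed luma block with a dark and a bright object.
block = [
    [20, 22, 200, 210],
    [21, 23, 205, 208],
    [19, 24, 198, 202],
    [18, 20, 25, 199],
]
pattern = contour_partition(block)
```

Because the pattern is derived from reconstructed texture samples that are also available at the decoder, no partition information needs to be transmitted for such a mode.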

3.3 Summary

In this chapter, the 3D video codecs are explained in some detail. The two video

codecs that are covered are – 3D video coding in H.264/AVC (multi-view coding) and 3D

video coding in HEVC (multi-view plus depth coding). In chapter 4, compression artifacts

in depth maps and how they affect the rendered views are analyzed.

Chapter 4

Analysis of compression artifacts in depth maps

Depth maps were briefly explained in section 1.2.1. As explained, depth maps

can be represented as a grayscale image sequence for storage and transmission

purposes and can be compressed with any existing video codecs. However, existing

video codecs are optimized to encode image/video sequences that are finally viewed by

the end users. Depth maps on the other hand, are not viewed by the end-users, but are

used as an aid for view rendering. Therefore, when the existing video codecs are used to

compress depth maps, the compression artifacts on depth maps cause distortions in

rendered views. Two types of solution can be identified to solve this problem. The first

solution is to come-up with novel compression techniques and introduce depth-map

compression specific features to the codec. The HEVC-3D codec [37] described in

section 3.2 is one such codec. As already observed in section 3.2, it is very clear that this

kind of solution increases the complexity of the codec to a very large extent as it deals

with components that are specific to compression of depth maps. The second type of

solution is to encode depth-maps with existing video codecs and to post-process the

decoded depth-maps with image denoising techniques [42]. The advantage of this type of

solution is that the existing codecs need not be modified to specifically suit the

compression of depth-maps, but image denoising techniques can be used on the

decoded depth maps to minimize undesirable compression artifacts. The proposed post-

processing algorithm belongs to the second category.

As the proposed post-processing algorithm minimizes the effects of compression

artifacts upon virtual view generation process, it is extremely important to understand the

view generation process as well as the effects of compression artifacts upon view

generation. Although multi-view rendering is possible, the scope of this thesis is limited

to rendering of stereoscopic views with monoscopic color image and per-pixel depth map

(Figure 1-2). After introducing the view generation process, a theoretical derivation of

thresholds for the maximum distortions in the depth map that do not cause perceptual

rendering distortion is presented.

4.1 Virtual stereoscopic view generation process

A monoscopic color image and per-pixel depth map can be used to generate

virtual stereoscopic views. The virtual view generation process is shown in Figure 4-1.

Figure 4-1 Virtual view generation in Depth Image Based Rendering (DIBR) process

In this process, the original image points at locations (x, y) are transferred to new

locations (xL, y) and (xR, y) for the left and right views, respectively.

This process is defined with:

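\[ x_R = x + \frac{p_{pix}}{2} \tag{1} \]

\[ x_L = x - \frac{p_{pix}}{2} \tag{2} \]

\[ p_{pix} = -x_B \frac{N_{pix}}{D} \left\{ \frac{m}{255} \left( k_{near} + k_{far} \right) - k_{far} \right\} \tag{3} \]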

ppix – pixel parallax

xB – distance between the left and right virtual cameras or eye separation (assumed to be

D - viewing distance (assumed to be 250 cm)

m – depth value of each pixel in the reference view

knear and kfar – range of the depth information respectively behind and in front of the

picture, relative to the screen width.

Npix – screen width measured in pixels

8-bit images are considered
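As an illustration of Eqs. (1) to (3), the following minimal sketch computes the pixel parallax for a depth value and the resulting column positions in the left and right views. Only D = 250 cm is taken from the text; the values used for xB, Npix, knear and kfar, as well as the function names, are assumptions chosen for the example.

```python
def pixel_parallax(m, x_b, n_pix, d, k_near, k_far):
    """Pixel parallax of Eq. (3) for an 8-bit depth value m (0..255)."""
    return -x_b * (n_pix / d) * ((m / 255.0) * (k_near + k_far) - k_far)


def warp_column(x, p_pix):
    """Return (x_L, x_R) from Eqs. (2) and (1); the row y is unchanged."""
    return x - p_pix / 2.0, x + p_pix / 2.0


# Illustrative values only: D = 250 cm is from the text, the rest are assumed.
X_B, N_PIX, D, K_NEAR, K_FAR = 6.0, 1024, 250.0, 0.1, 0.2

p_at_0 = pixel_parallax(0, X_B, N_PIX, D, K_NEAR, K_FAR)      # depth value 0
p_at_255 = pixel_parallax(255, X_B, N_PIX, D, K_NEAR, K_FAR)  # depth value 255
x_left, x_right = warp_column(100.0, p_at_255)
```

The two views are shifted symmetrically about the original column, so the separation between xR and xL equals the parallax.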

Virtual cameras are selected such that the epipolar lines are horizontal, and thus

the y component is constant. The equation (3) is in accordance with MPEG informative

recommendation [44]. The dis-occluded regions (visual holes) are filled by background

pixel extrapolation technique [45]. Due to any noise with which the depth maps could be

corrupted, the luminance values of the pixels would be modified, i.e. m in eq (3) will be

modified. This will result in warping error and thus cause distortions in the image

rendered with the noisy depth map. A pixel wise distortion model that quantifies errors on

the rendered views is given in [46].

The quality of the rendered virtual views can be determined by calculating the

PSNR between view rendered with uncompressed color image and depth map and the

view rendered with the compressed/corrupted color image and depth map [44]. Another

metric that can be used to measure the quality of rendered views is the Structural

Similarity Index Metric (SSIM) [48].
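For reference, the PSNR between two rendered views can be computed as in the sketch below. It assumes 8-bit single-channel images stored as lists of rows; the function name and the handling of identical images (infinite PSNR) are choices made for this example.

```python
import math


def psnr(reference, test, peak=255.0):
    """PSNR in dB between two equally sized 8-bit images (lists of rows)."""
    squared_errors = [
        (a - b) ** 2
        for ref_row, test_row in zip(reference, test)
        for a, b in zip(ref_row, test_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# View rendered with the original depth map vs. a slightly distorted rendering.
view_ref = [[100, 100], [100, 100]]
view_test = [[100, 101], [99, 100]]
quality_db = psnr(view_ref, view_test)
```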

4.2 Analysis of compression artifacts in depth maps on view rendering

Due to bandwidth constraints, it is a common practice to compress depth maps

before storage or transmission in bandwidth limited channels. Traditional block based

video codecs, such as HEVC [1] [2] and H.264/AVC [3], are based on motion estimation,

transform coding, quantization and entropy coding. During the quantization process, high

spatial frequencies in individual images are eliminated. This is done mainly due to the

fact that the human visual system is more sensitive to the low spatial frequencies in

images.

When traditional video codecs are used to compress depth maps, which are not

viewed by humans but used to aid the view rendering process, the compression artifacts

will have adverse consequences upon the quality of the rendered views. It is highly

important to preserve the sharp depth discontinuities present in depth maps for high

quality virtual view generation. In this subsection we analyze the effects of compression

artifacts upon the view generation process.

Eq. (3) provides the relationship between the value of the depth pixel (m)

and the corresponding pixel parallax (ppix). If the value of the depth pixel

changes by ∆m, the pixel parallax changes by a corresponding amount ∆p.

8-bit images are considered.

\[ p_{pix} + \Delta p = -x_B \frac{N_{pix}}{D} \left\{ \frac{m + \Delta m}{255} \left( k_{near} + k_{far} \right) - k_{far} \right\} \tag{4} \]

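Subtracting Eq. (3) from Eq. (4) and taking magnitudes, it can be deduced that

\[ |\Delta p| = x_B \frac{N_{pix}}{D} \left\{ \frac{|\Delta m|}{255} \left( k_{near} + k_{far} \right) \right\} \tag{5} \]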

In terms of the rendering algorithm used, the change of depth pixel value (∆m)

will not have any significance unless it is large enough to cause a parallax change of at

least 1 pixel. According to Eqs. (1) and (2), each view is shifted by half the parallax, so the maximum value that ∆p in Eq. (5) can take is

2 pixels. Using this information, the maximum change (∆mmax) of the value of a depth

pixel could be derived as follows,

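\[ 2 = x_B \frac{N_{pix}}{D} \left\{ \frac{\Delta m_{max}}{255} \left( k_{near} + k_{far} \right) \right\} \tag{6} \]

\[ \Delta m_{max} = \frac{2 \cdot D \cdot 255}{x_B \cdot N_{pix} \cdot \left( k_{near} + k_{far} \right)} \tag{7} \]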

The ∆mmax in Eq. (7) provides a theoretical threshold, which indicates the

maximum change a depth pixel value could undergo without causing a rendering error. It

should be noted that the above derivation does not take into account rounding errors,

in which case a minimum parallax of 0.2 could bring about change in the warped pixel

position. However, the above derivation is valid when the rendering quality does not

consider positional errors of one pixel [47]. The derived threshold is the basis for

calculating most of the parameters in designing the filters used in the proposed work.
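A short numerical sketch of Eq. (7) is given below. Only D = 250 cm comes from the text; the eye separation, screen width and depth range are assumed values, so the resulting threshold is illustrative and not the one used in the experiments.

```python
def delta_m_max(x_b, n_pix, d, k_near, k_far, max_parallax=2.0):
    """Largest depth-value change that keeps the parallax change within
    max_parallax pixels, following Eq. (7) for 8-bit depth maps."""
    return max_parallax * d * 255.0 / (x_b * n_pix * (k_near + k_far))


# Assumed setup: x_B = 6 cm, 1024-pixel-wide screen, k_near + k_far = 0.3.
threshold = delta_m_max(x_b=6.0, n_pix=1024, d=250.0, k_near=0.1, k_far=0.2)

# Substituting the threshold back into Eq. (5) should return 2 pixels.
parallax_change = 6.0 * (1024 / 250.0) * (threshold / 255.0) * (0.1 + 0.2)
```

Under these assumed values the threshold is roughly 69 gray levels; a wider screen or a larger depth range lowers it, which makes the rendering more sensitive to depth errors.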

4.2.1 Frequency domain analysis of the artifacts

In this sub-section, frequency domain analysis of the artifacts introduced during

the compression of depth maps is presented. For analysis purposes, the 2D discrete Fourier

transform (DFT) F(u, v) of a digital image f(x, y) of size M x N is defined by Eq. (8):

\[ F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) \, e^{-j 2 \pi \left( \frac{ux}{M} + \frac{vy}{N} \right)} \tag{8} \]

where u = 0, 1, ..., M-1 and v = 0, 1, ..., N-1.

The power spectrum (PS) P (u,v) of a considered image segment f( x,y) is

obtained using the DFT as,

\[ P(u, v) = |F(u, v)|^2 \tag{9} \]
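Eqs. (8) and (9) can be evaluated directly for a small image segment, as in the sketch below. It is a naive, unoptimized transcription of the definitions using only the standard library; in practice a fast transform such as `numpy.fft.fft2` would normally be used instead. The function names and the test segment are assumptions made for illustration.

```python
import cmath


def dft2(f):
    """Direct evaluation of the 2D DFT in Eq. (8); result indexed as F[u][v]."""
    m, n = len(f), len(f[0])
    return [
        [
            sum(
                f[x][y] * cmath.exp(-2j * cmath.pi * (u * x / m + v * y / n))
                for x in range(m)
                for y in range(n)
            )
            for v in range(n)
        ]
        for u in range(m)
    ]


def power_spectrum(f):
    """Power spectrum of Eq. (9): squared magnitude of each DFT coefficient."""
    return [[abs(c) ** 2 for c in row] for row in dft2(f)]


# A perfectly homogeneous 4x4 segment: all energy sits in the DC term P[0][0].
flat_segment = [[2.0] * 4 for _ in range(4)]
spectrum = power_spectrum(flat_segment)
```

For the homogeneous segment only the DC coefficient is non-zero, which mirrors the observation that removing small depth variations removes energy at high spatial frequencies.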

Figure 4-2 illustrates the effect of compression artifact in areas with homogenous

depth. During compression, small depth variations, which represent high spatial

frequencies, are removed. This fact is illustrated by the power spectrum difference

between Figure 4-2(e) and 4-2 (g), where the energy in 4-2 (g) is much lower than 4-2

(e). The periodic nature of the power spectrum in Figure 4-2(g), is due to the blocking

artifacts present in Figure 4-2(c). The effect of this upon rendered views is illustrated in

Figure 4-2(b) and 4-2(d). The corresponding power spectrums illustrated respectively in

Figure 4-2(f) and 4-2(h) do not show significant change. Figure 4-3(a-d) illustrates

the effect of compression artifacts in an area with a sharp depth discontinuity. The

corresponding power spectrum of the image rendered with the compressed depth map in

Figure 4-3(h) illustrates increased energy concentration in high spatial frequencies, as

compared to the power spectrum of the image rendered with the uncompressed depth

map in Figure 4-3(f). This is mainly due to the uneven blurring at the depth discontinuity.

For the purpose of the following analysis we define the following terms,

Depth Noise: difference in the pixel values between the original depth map and

the compressed depth map.

Rendering Noise: difference in pixel values between the view rendered with the

original depth map and the view rendered with the compressed depth map.

Perceived Noise: Perceivable difference between the view rendered with the

original depth map and the view rendered with the compressed depth map. This measure

will neglect tiny position errors due to warping and pixel value differences below a certain

threshold. An estimate of perceivable difference can be obtained using SSIM [48].

Figure 4-2 Effect of compression noise in areas of homogenous depth

Figure 4-3 Effect of compression noise in areas of sharp depth discontinuities [43]

4.3 Design requirements of a post-processing filter to minimize the effects of

compression artifacts

Based on the analysis of compression artifacts upon view rendering in section

4.2, the compression artifacts appear as non-uniform noise in the depth maps. The non-

uniform noise (uneven noise) due to compression artifacts could be approximated as a

zero-mean normal distribution. Therefore, a candidate post-processing filter should be

able to minimize the spread of the noise. However, when smoothing, it is important not to

change any depth value by more than the threshold derived in eq. (7). If smoothing alters the depth map beyond

this threshold, it will cause perceivable rendering artifacts. Furthermore, the

above analysis illustrated that the artifacts along the depth discontinuities cause

significant distortion in terms of the quality of rendered views. Compression artifacts that

are present in smooth (homogenous) depth areas, generally, cause less distortion.

Considering the design requirements outlined above, a depth map post-

processing framework to minimize the effect of compression artifacts upon view

rendering is proposed in [43]. The proposed depth map post processing framework is

designed based on the principles of bilateral filtering [49]. In chapter 5, the post-

processing frame-work in [43] along with the required background theory will be

described.

4.4 Summary

In this chapter, a detailed analysis of compression artifacts is presented. The virtual

stereoscopic view generation process is explained, and how the presence of compression

artifacts in the depth map affects the rendered views is discussed. Finally, design

requirements for a post-processing frame-work to reduce compression artifacts based on

artifact analysis are presented. In chapter 5, the scope, necessary background and the

algorithm for the thesis are considered.

Chapter 5

Thesis – Scope, Background and Working algorithm

This chapter discusses the scope of the proposed work followed by background

knowledge required to understand the working algorithm and finally the working algorithm

itself.

5.1 Scope

While a lot of research has been done to study the effect of using post-

processing image denoising techniques on H.264/AVC decoded depth maps, there has

not been any research done to study the effects of applying image denoising techniques

(post – processing) to HEVC decoded images/video. This thesis applies image denoising

techniques on HEVC decoded depth maps as a post-processing technique. Quality of

rendered images with and without post-processing is compared using PSNR and SSIM

[48]. Specifically, a post-processing framework that is based on analysis of compression

artifacts upon generation of virtual views is used. The post-processing frame-work utilizes

a non-linear spatial filtering technique to reduce compression artifacts [43].

This thesis is an effort to effectively reduce the compression artifacts from HEVC

decoded depth-maps and improve the perceptual quality of rendered views without using

depth-map specific video codec.

5.2 Background Theory

Based on the analysis of compression artifacts presented in chapter 4, a post

processing framework is proposed [43], to minimize the effects of compression artifacts in

depth maps. The proposed framework is inspired by two applications of bilateral filtering

[49], namely, joint (cross) bilateral filtering [50], [51] and adaptive bilateral filtering [52].

These two concepts will be briefly explained in this section.

5.2.1 Bilateral Filter

The bilateral filter [49] uses both a closeness filter kernel as well as a similarity

filter kernel evaluated on the pixel values. More formally, for some pixel position p the

filtered result Bp is given as in the eq. (10),

\[ B_p = \frac{\sum_{q \in \Omega} w_{pq} \, I_q}{\sum_{q \in \Omega} w_{pq}} \tag{10} \]

In Eq. (10), Iq is the value at pixel position q in the kernel neighborhood. The

filter weight wpq at pixel position q is calculated as,

\[ w_{pq} = c(p, q) \cdot s(p, q) \tag{11} \]

where c is the closeness filter kernel and s is the similarity filter kernel. Both c

and s are commonly implemented as Gaussians, centered at p and Ip respectively (Ip is the value at

pixel position p), with standard deviations σc and σs:

\[ c(p, q) = \exp\left( -\frac{1}{2} \frac{(p - q)^2}{\sigma_c^2} \right) \tag{12} \]

\[ s(p, q) = \exp\left( -\frac{1}{2} \frac{(I_p - I_q)^2}{\sigma_s^2} \right) \tag{13} \]
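The bilateral filter of Eqs. (10) to (13) can be sketched as follows. This is a plain, unoptimized version for grayscale images stored as lists of rows; the square window of a given radius, the clipping of the window at the image border and the example parameter values are assumptions made for the illustration.

```python
import math


def bilateral_filter(image, radius, sigma_c, sigma_s):
    """Bilateral filter following Eqs. (10)-(13) on a list-of-rows image."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for py in range(h):
        for px in range(w):
            num = den = 0.0
            # Kernel neighborhood: square window clipped at the image border.
            for qy in range(max(0, py - radius), min(h, py + radius + 1)):
                for qx in range(max(0, px - radius), min(w, px + radius + 1)):
                    dist2 = (py - qy) ** 2 + (px - qx) ** 2
                    c = math.exp(-0.5 * dist2 / sigma_c ** 2)          # Eq. (12)
                    diff = image[py][px] - image[qy][qx]
                    s = math.exp(-0.5 * diff ** 2 / sigma_s ** 2)      # Eq. (13)
                    num += c * s * image[qy][qx]                       # Eq. (11)
                    den += c * s
            out[py][px] = num / den                                    # Eq. (10)
    return out


# A noisy step edge: small fluctuations are smoothed, the edge is preserved.
noisy = [[10, 12, 10, 200, 198, 200] for _ in range(3)]
smoothed = bilateral_filter(noisy, radius=1, sigma_c=1.0, sigma_s=5.0)
```

Because the similarity kernel gives almost zero weight to samples on the other side of the edge, the values next to the discontinuity are averaged only with samples from their own side.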

5.2.2 The joint bilateral and trilateral filter

When the similarity filter kernel of the bilateral filter is derived from a second (guidance)

image, the process is known as a joint (cross) bilateral filter (JBF). The concept of joint

bilateral filtering was first proposed as a means of removing adverse effects of flash

photography [50], [51].

Accordingly, the similarity filter kernel (sj) in the case of a joint bilateral filter is

implemented as,

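\[ s_j(p, q) = \exp\left( -\frac{1}{2} \frac{(\tilde{I}_p - \tilde{I}_q)^2}{\sigma_j^2} \right) \tag{14} \]

where \tilde{I}_p and \tilde{I}_q are the values of the guidance image at pixel positions p and q.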

When there are two similarity filter kernels used along with a closeness filter, the

filter is known as a trilateral filter [53]. The basis for the similarity filters needs to be

judiciously selected. For example, a trilateral filter is designed as an in-loop deblocking filter

in Ref. [54], in which two similarity filter kernels are derived, one from the color image

and one from the depth map. The similarity filter kernel st of the filter proposed in Ref. [54] is given by

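\[ s_t(p, q) = s(p, q) \cdot s_j(p, q) \tag{15} \]

\[ s_t(p, q) = \exp\left( -\frac{1}{2} \frac{(I_p - I_q)^2}{\sigma_s^2} \right) \cdot \exp\left( -\frac{1}{2} \frac{(\tilde{I}_p - \tilde{I}_q)^2}{\sigma_j^2} \right) \]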

5.2.3 The adaptive bilateral filter

In Ref. [52] the authors define the Adaptive Bilateral Filter (ABF) as an image

sharpening technique. In ABF [52] the similarity filter kernel sa is defined as,

\[ s_a(p, q) = \exp\left( -\frac{1}{2} \frac{(I_p - I_q - \Delta_p)^2}{\sigma_p^2} \right) \tag{16} \]

where Δp and σp are adaptation parameters dependent on p and they are used

to control the center and the standard deviation of the Gaussian kernel that implements

sa. As opposed to the similarity filter kernel s in Eq. (13), which is centered around Ip, the

similarity filter kernel of the ABF is centered at Ip - Δp. The ABF has very good

sharpening ability if the adaptation parameters Δp and σp are

calculated appropriately. In [52], the adaption parameters Δp and σp

are found empirically for digital images by a least mean square error

training method.

5.3 Proposed algorithm

This thesis is based on the post-processing algorithm in [43]. The method is

designed based on the principles of bilateral filtering introduced in section 5.2 to minimize the

effects of compression artifacts upon the virtual view generation process. Specifically, a

Bilateral Sharpening filter (BSF) is used to post-process compressed depth maps by

analyzing global image histograms. Figure 5-1 illustrates the block diagram of the

proposed post-processing framework.

Figure 5-1 Block diagram for the proposed research [43]

The video sequence and depth map are encoded using HM 9.2 [55]. The

decoded depth map with compression artifacts is then obtained. The BSF operates by

adjusting the histogram of the compressed depth map by identifying the dominant depth

value bins present in the depth map. Thus, the identification of correct depth value bins is

crucial for the correct operation of the BSF. If the histograms are analyzed directly from

the compressed depth [56], the identified depth value bins will not be very appropriate

due to the effect of noisy pixels. These noisy pixels could be a result of either the

compression algorithm or of the depth map generation process. The identification of the

depth value bins could be improved by filtering the compressed depth maps to reduce

these noisy pixels. Edge Adaptive Joint Trilateral Filter (EA-JTF) [43] is used, whose filter

coefficients are theoretically derived to enable maximum possible filtering of the noisy

pixels. Furthermore, in areas where the color image and the corresponding depth map

are aligned, the EA-JTF is designed to utilize the edges in the color image to reconstruct

the depth map. The output of the EA-JTF is then given as the input to BSF.

The different steps involved in the algorithm are explained in the coming sub-

sections.

5.3.1. Depth discontinuity analysis

The purpose of the depth discontinuity analysis step is twofold. Firstly, the areas

that have aligned edges in the color image and the corresponding depth map are

identified. The filter kernels of the EA-JTF are adaptively selected based on this

information about edge alignment between the color image and the depth map. Secondly,

all depth discontinuities that are significant in terms of rendering are identified. This

information about significant depth discontinuities is used to reduce the complexity of

the bilateral sharpening filter.

5.3.1.1 Identification of significant depth discontinuities

Theoretically, a depth discontinuity, or an edge in the depth map, is considered to

be significant if the neighborhood of the corresponding color pixels in the warped image

is different from the original color image. In the context of depth map based stereo view

rendering, a significant depth discontinuity will cause the corresponding color pixels on

either side of the edge to be shifted by different magnitudes.

First, the depth map is convolved with a Sobel filter. Let the result of this

operation be denoted as Gx. An edge mask Ed is then derived as in Eq. (17),

which corresponds to pixel locations of significant depth

discontinuities. However, this derivation neglects round-off errors in

the rendering algorithm.

\[ E_d(p, q) = \begin{cases} 1 & \text{if } |G_x(p, q)| \geq \Delta m_{max} \\ 0 & \text{if } |G_x(p, q)| < \Delta m_{max} \end{cases} \tag{17} \]

where ∆mmax is defined in eq. (7).
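A minimal sketch of the edge mask in Eq. (17) is shown below. It assumes an unnormalized 3x3 horizontal Sobel kernel and leaves the one-pixel image border at zero; neither detail is specified in the text, and the kernel is applied as a correlation, which only flips the sign of Gx and does not affect its magnitude.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]


def sobel_x(depth):
    """Horizontal Sobel response G_x; the one-pixel border is left at zero."""
    h, w = len(depth), len(depth[0])
    gx = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx[y][x] = sum(
                SOBEL_X[j][i] * depth[y + j - 1][x + i - 1]
                for j in range(3)
                for i in range(3)
            )
    return gx


def edge_mask(depth, delta_m_max):
    """Binary mask E_d of Eq. (17): 1 where |G_x| reaches the threshold."""
    return [
        [1 if abs(g) >= delta_m_max else 0 for g in row]
        for row in sobel_x(depth)
    ]


# Hypothetical 5x5 depth block with a vertical depth discontinuity.
depth_block = [[10, 10, 100, 100, 100] for _ in range(5)]
e_d = edge_mask(depth_block, delta_m_max=50)
```

Only the two columns adjacent to the discontinuity are marked, while the homogeneous areas on either side produce a zero response.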

5.3.1.2 Identification of aligned color and depth edges

Once the edge mask Ed is obtained as in Eq. (17), it is necessary to identify the

regions in which the color edges and depth discontinuities are aligned. For this purpose

an edge mask Ec of the color image is generated by the Canny edge detection algorithm.

Using Ed and Ec, the binary mask Es signifying the aligned edge areas is obtained as

follows,

\[ E_s = \left( \left( E_d \oplus S_1 \right) \cap E_c \right) \oplus S_2 \tag{18} \]

where ⊕ represents morphological dilation and S1 and S2 represent flat

square structuring elements of size 2 and 7, respectively.
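Eq. (18) can be sketched with a simple binary dilation, as below. The masks are lists of rows of 0/1 values; the placement of the origin for the even-sized element S1 and the clipping at the image border are assumptions, since the text does not specify them.

```python
def dilate(mask, size):
    """Binary dilation with a flat size x size square structuring element.

    For even sizes the origin is placed at index size // 2 (an assumption).
    """
    h, w = len(mask), len(mask[0])
    before = size // 2
    after = size - 1 - before
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - before), min(h, y + after + 1)):
                    for xx in range(max(0, x - before), min(w, x + after + 1)):
                        out[yy][xx] = 1
    return out


def aligned_edge_mask(e_d, e_c, size1=2, size2=7):
    """E_s of Eq. (18): dilate E_d, intersect with E_c, dilate the result."""
    widened = dilate(e_d, size1)
    common = [[a & b for a, b in zip(r1, r2)] for r1, r2 in zip(widened, e_c)]
    return dilate(common, size2)


# Hypothetical 9x9 masks: a depth edge pixel and a color edge one pixel away.
e_d = [[0] * 9 for _ in range(9)]
e_c = [[0] * 9 for _ in range(9)]
e_d[4][4] = 1
e_c[4][3] = 1
e_s = aligned_edge_mask(e_d, e_c)
```

The first dilation tolerates a small offset between the depth and color edges, and the second one grows the aligned edge pixels into the area around the edge in which both similarity kernels are later used.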

An example of outputs at each step of the depth discontinuity analysis is given in

Figure 5-2.

Figure 5-2 Illustration of depth discontinuity analysis

5.3.2 Pre-filtering of the depth maps to improve depth bin identification

The correct depth value bin identification from the histogram analysis is very

important for the operation of the proposed post-processing filter. As noisy depth pixels

affect the depth bin identification process, smoothing the compressed depth map to filter

out any insignificant depth discontinuities will improve the correctness of the bins that are

identified. A bilateral filter whose edge threshold is selected to preserve only the

significant discontinuities, is a good candidate for this purpose. Furthermore, this stage of

filtering also makes use of the corresponding color image to realign the discontinuities in

the depth map with the edges in the color image.

Considering both the above requirements, the Joint Trilateral Filter as described

in section 5.2.2 could be used for our purpose. However, the JTF is suitable only in areas

in which the color and depth edges are aligned. If the color edges and depth

discontinuities are not aligned, the JTF will generate depth maps different from the

original depth maps. Therefore, in [43], the similarity filter kernel st of the joint trilateral

filter is adaptively selected as given in Eq. (19). For the areas where the

edges between the color image and the corresponding depth map are

aligned, there will be two similarity filter kernels used, each derived

from the compressed depth map and the color image. For the

remaining area, only the similarity filter kernel derived from the

compressed depth map is used.

\[ s_t(p, q) = \begin{cases} s(p, q) \cdot s_j(p, q) & \text{if } E_s(p, q) = 1 \\ s(p, q) & \text{if } E_s(p, q) = 0 \end{cases} \tag{19} \]

The aim of this pre-filtering step is to smooth the compressed depth map so as to remove

any insignificant depth discontinuities. Therefore, the edge threshold used for the

similarity filter kernel s in Eq. (19) is made equal to ∆mmax given by eq (7). While the

closeness filter and the similarity filter kernel sj derived from the color image are

implemented as Gaussian kernels, the similarity filter kernel s derived from the depth map is

implemented as a binary filter.
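The adaptive selection in Eq. (19) can be sketched for a single pair of pixels as follows. Reading the binary depth kernel as "1 when the depth difference does not exceed ∆mmax, 0 otherwise", and indexing the mask Es at the filtered pixel, are assumptions made for this illustration; the function and parameter names are hypothetical.

```python
import math


def ea_jtf_similarity(depth_p, depth_q, color_p, color_q,
                      aligned, delta_m_max, sigma_j):
    """Similarity weight s_t of Eq. (19) for one neighbor q of pixel p.

    aligned is the value of the mask E_s at the filtered pixel.
    """
    # Binary depth kernel s: keep only neighbors within the edge threshold.
    s = 1.0 if abs(depth_p - depth_q) <= delta_m_max else 0.0
    if not aligned:
        return s
    # Gaussian color kernel s_j, used only where color and depth edges align.
    s_j = math.exp(-0.5 * (color_p - color_q) ** 2 / sigma_j ** 2)
    return s * s_j


# Same depth pair, evaluated inside and outside an aligned-edge area.
w_unaligned = ea_jtf_similarity(100, 104, 50, 90, False, 69.0, 10.0)
w_aligned = ea_jtf_similarity(100, 104, 50, 90, True, 69.0, 10.0)
w_across_edge = ea_jtf_similarity(100, 200, 50, 50, True, 69.0, 10.0)
```

Outside the aligned areas the color image has no influence on the weight, which avoids transferring misaligned color edges into the depth map.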

While JBF [57] could recover edge information from its corresponding texture

image to a certain extent, it lacks the capability to do so in areas where there are depth

discontinuities, but inadequate gradient in the color image to support it. While the JTF

proposed in Ref. [54] overcomes this problem, it fails to perform in areas of the depth

map that are not perfectly aligned with the color image. Thus, the EA-JTF is designed to

overcome drawbacks of both the JBF and the JTF.

5.3.3 Global histogram analysis

The aim of this step is to identify the significant depth value bins by analyzing the

global histograms of the depth map. A depth value bin in the histogram is characterized

by a peak enclosed by the nearest minima (valleys) on either side, except when

the peak is at 0 or 255 gray levels. Once the particular depth value bins are identified,

they are represented as a non-symmetric Gaussian distribution centered at the peak

value of a particular bin. The distance to the enclosing valleys from the peak is used to

calculate the standard deviation on each side of the Gaussian curve representing the

particular depth value cluster.

For this purpose, the output of the EA-JTF is segmented into equal-size blocks

of 64x64 pixels (72x72 for some sequences, so that the image divides into equal-size blocks).

The histogram analysis is performed on each segmented block to find all the dominant

depth value bins within that block. The decision to perform the histogram analysis on

64x64 blocks, rather than on a pixel by pixel basis is made for two reasons. Firstly, this

makes the reconstruction method consistent among all the pixels within the

block. Secondly, it minimizes the chance of noisy depth pixel values being

identified as a significant depth value bin. The steps of the global histogram

analysis are given below.

1. Segment the output of the EA-JTF into equal size segments of 64x64

pixels

2. Obtain the histogram for each segment

3. Smooth the histogram using an averaging filter: The averaging filter

kernel used is [1 1 1 1 1].

4. Identify dominant peaks and their enclosing valleys.
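Steps 2 to 4 above can be sketched for a single segment as follows. The averaging kernel [1 1 1 1 1] is applied here without normalization, flat peaks are reported at their center, and a valley is taken as the first point where the smoothed histogram stops decreasing or reaches zero; these details, and the function names, are assumptions, since the text does not specify them.

```python
def histogram(block, levels=256):
    """Histogram of an 8-bit block given as a list of rows."""
    counts = [0] * levels
    for row in block:
        for value in row:
            counts[value] += 1
    return counts


def smooth(counts, width=5):
    """Moving sum with kernel [1 1 1 1 1] (unnormalized, zero-padded)."""
    half = width // 2
    n = len(counts)
    return [sum(counts[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def find_bins(hist):
    """Return (valley_low, peak, valley_high) for every peak in hist."""
    n = len(hist)
    bins = []
    i = 0
    while i < n:
        j = i
        while j + 1 < n and hist[j + 1] == hist[i]:
            j += 1  # extend over a flat plateau
        left = hist[i - 1] if i > 0 else -1
        right = hist[j + 1] if j + 1 < n else -1
        if hist[i] > 0 and left < hist[i] and right < hist[i]:
            lo = i
            while lo > 0 and hist[lo] > 0 and hist[lo - 1] <= hist[lo]:
                lo -= 1  # walk downhill to the lower valley
            hi = j
            while hi < n - 1 and hist[hi] > 0 and hist[hi + 1] <= hist[hi]:
                hi += 1  # walk downhill to the upper valley
            bins.append((lo, (i + j) // 2, hi))
        i = j + 1
    return bins


# Hypothetical segment with two depth layers at gray levels 50 and 200.
segment = [[50] * 30 + [200] * 10]
depth_bins = find_bins(smooth(histogram(segment)))
```

For real 64x64 segments, an additional minimum peak height would typically be needed to separate dominant peaks from small residual ones; that criterion is not stated in the text and is therefore left out of the sketch.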

5.3.4 Bilateral Sharpening Filter

In this subsection, bilateral sharpening technique to minimize the effects of

artifacts in compressed depth map [43] is explained in detail. The method is inspired by

the Adaptive Bilateral Filter (ABF) [52] described in section 5.2.3. The ABF [52] is able to

adjust the histogram of an image in a desired way by selecting the adaptation parameters

in eq. (16) appropriately. The method proposed in [52] is optimized for sharpening natural

images and the adaptation parameters are found by a training method based on Least

Mean Squared Error (LMSE) minimization.

Unlike natural images, depth maps are mostly piecewise smooth images with

sharp depth discontinuities (edges). By appropriate selection of the adaptation

parameters in Eq. (16) it is possible to adjust the histograms of the compressed depth

maps to a form similar to that before compression.

Depth maps are captured by various techniques such as depth range

cameras and computer vision techniques based on disparity estimation. Thus, a training

method as proposed in Ref. [52] cannot be successfully adapted for sharpening of depth

maps, to accommodate varying types of depth maps. We use the piecewise smooth

property of depth maps to propose a non-training based method to find adaptation

parameters of sa in Eq. (16). Specifically, the depth value bins identified by the global

histogram analysis and characterized by Gaussian curves as described in section 5.3.3

are used to derive the appropriate adaptation parameters of the Bilateral Sharpening

Filter (BSF) proposed in [43].

The EA-JTF [43] filters out insignificant depth discontinuities present in the compressed depth maps, and its output is passed to the BSF. At this stage of filtering, each pixel of the depth map is replaced by a value determined by the following bilateral sharpening process. All the pixels in a 64x64 block are processed with respect to the depth value bins identified for that block. Once a pixel (Ip) is taken for filtering, the nearest depth value bin in the histogram of the corresponding 64x64 block is identified. Thereafter, the bilateral filter weights (wpq in Eq. (7)) are derived as described below. To reduce the spread around the peaks in the histogram of the compressed image, the similarity filter kernel of the bilateral sharpening filter in [43] is implemented as a Gaussian centered at the nearest peak (Np) to Ip, given as,

$$s_a(p,q) = \exp\left(-\frac{1}{2}\,\frac{(N_p - I_q)^2}{\sigma_p^2}\right) \tag{20}$$

where, σp is defined as half the distance (since 2σp corresponds to a 95%

confidence interval in a Gaussian distribution) to valleys enclosing Np,

σ p=¿ (21)

In Eq. (21), Vhigh and Vlow represent the enclosing valley greater than and lower

than Np, respectively. The bilateral filter weights are then derived as follows,

$$W_{pq} = c(p,q) \cdot s_s(p,q) \tag{22}$$

Finally, the filtered result (Bp) is calculated as in Eq. (10).
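The sharpening of a single pixel can be sketched as below, in plain Python on a depth map stored as a list of rows. Several points are assumptions of the sketch rather than statements from the text: the closeness kernel c(p,q) is taken to be a spatial Gaussian with standard deviation `sigma_c`, the similarity term in the weight product is taken to be the peak-centred kernel of Eq. (20), the filtered value is taken to be the usual normalised weighted average, and σp is taken as half the distance to the nearer enclosing valley. The function name and the `bins` format (a list of `(peak, v_low, v_high)` tuples for the pixel's block) are hypothetical.

```python
import math


def sharpen_pixel(depth, y, x, bins, radius=3, sigma_c=12.0):
    """Return the sharpened value of depth[y][x].

    bins: (peak, v_low, v_high) tuples for the 64x64 block containing (y, x).
    radius=3 gives a 7x7 window; sigma_c is the closeness standard deviation.
    """
    i_p = depth[y][x]
    # Nearest dominant peak N_p to the pixel value I_p.
    peak, v_low, v_high = min(bins, key=lambda b: abs(b[0] - i_p))
    # Assumed reading of Eq. (21): half the distance to the nearer valley.
    sigma_p = max(min(peak - v_low, v_high - peak) / 2.0, 1e-6)

    num = 0.0
    den = 0.0
    for qy in range(max(y - radius, 0), min(y + radius + 1, len(depth))):
        for qx in range(max(x - radius, 0), min(x + radius + 1, len(depth[0]))):
            i_q = depth[qy][qx]
            # Assumed spatial Gaussian for the closeness kernel c(p, q).
            closeness = math.exp(-((qy - y) ** 2 + (qx - x) ** 2) / (2.0 * sigma_c ** 2))
            # Eq. (20): Gaussian centred at the nearest peak, not at I_p.
            similarity = math.exp(-0.5 * (peak - i_q) ** 2 / sigma_p ** 2)
            weight = closeness * similarity
            num += weight * i_q
            den += weight
    # If every weight underflows to zero, leave the pixel unchanged.
    return float(i_p) if den == 0.0 else num / den
```

Because the similarity kernel is centred on the peak rather than on the pixel itself, a pixel whose value has drifted away from its depth level is pulled back toward that level, while neighbours belonging to a different level receive a negligible weight.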

5.3.5 Stereoscopic view rendering and comparison of results

After obtaining the depth map with artifacts removed using the steps described in sections 5.3.1 to 5.3.4, the left-side and right-side views are obtained using the stereoscopic view rendering process [44] [45]. The images obtained using the uncompressed depth map, the HEVC decoded depth-map and the HEVC decoded depth-map to which the post-processing has been applied are compared using PSNR, SSIM [48] and an approximate Mean Opinion Score.

5.4 Summary

In this chapter, the scope of the proposed thesis is discussed. This is followed by

the exploration of the background theory necessary to understand the post-processing

technique used in this thesis. Finally, the entire working algorithm is explained. Chapter 6

provides the experimental results for different test sequences.

Chapter 6

Experimental results

To evaluate the performance of the EA-JTF [43] on HEVC decoded depth maps, color sequences along with the corresponding depth maps are compressed using the HEVC reference software HM 9.2 [55]. Since one frame is enough for stereoscopic rendering, only one frame is compressed, at QP = 32. Thereafter, to assess the effectiveness of the post-processing framework, three different rendered images are obtained. First, the original image and the corresponding depth map are used for stereoscopic rendering. Second, the HEVC decoded image and the corresponding decoded depth-map are used for stereoscopic rendering. Third, the post-processing framework is applied to the HEVC decoded depth-map, which is then used with the HEVC decoded image for stereoscopic rendering. Stereoscopic views are rendered according to the MPEG informative recommendations [44] [58].

To evaluate the improvements of using the post-processing techniques on HEVC

decoded depth maps, two metrics are utilized to measure the quality between the views

rendered with the post-processed depth map and the corresponding views rendered with

the uncompressed depth map. These are PSNR and SSIM [48]. Also, an approximate

Mean Opinion Score (MOS) [59] was used to evaluate the perceptual quality of the

rendered views. The way in which MOS was calculated is explained in section 6.1.
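As an illustration of the first metric, the sketch below computes PSNR between a reference view and a test view using the standard definition, 10 log10(peak² / MSE), with a peak value of 255 for 8-bit images. It is a plain-Python sketch on lists of rows, not the code used for the experiments; the function name is hypothetical, and SSIM [48] is not reproduced here.

```python
import math


def psnr(reference, test, peak=255.0):
    """PSNR in dB between two equally sized 2-D lists of pixel values."""
    total = 0.0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for r, t in zip(ref_row, test_row):
            total += (r - t) ** 2
            count += 1
    mse = total / count
    # Identical images have zero error, so PSNR is unbounded.
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)
```

In the comparisons of this chapter, `reference` would be the view rendered with the uncompressed depth map and `test` the view rendered with either the decoded or the post-processed depth map.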

The experiments were performed on three different sequences: Break-dancers [60], Balloons [18] and Kendo [18]. All the post-processing was performed using MATLAB R2013a (student version). The results for each sequence are discussed in detail in section 6.2.

6.1 An approximate Mean Opinion Score calculation

To compute the MOS, the rendered images from all three test cases (original, HEVC decoded, HEVC decoded + post-processed) were sent to a number of volunteers. The volunteers were not told which images were rendered from which depth-maps. They were asked to view the images under their normal conditions (viewing distance, posture, etc.), as they would when viewing pictures or videos on their laptops or desktops. The volunteers were then asked to rate the images based on viewing quality. The best image was to be given a rating of 3, the second-best image a rating of 2 and the worst image a rating of 1. The volunteers were chosen from a spectrum of people, ranging from those who have some knowledge of image-processing techniques to those who are completely new to image quality assessment.

From these ratings, the MOS is calculated for each test case using the formula given in Eq. (23).

$$\mathrm{MOS} = \frac{\sum \text{Ratings}}{\text{Number of people}} \tag{23}$$
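Eq. (23) is a plain arithmetic mean of the ratings given to one test case. A minimal sketch is shown below; the function name is hypothetical and the ratings in the usage line are made-up values for illustration, not the survey data.

```python
def mean_opinion_score(ratings):
    """Eq. (23): sum of the ratings divided by the number of people."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)


# Hypothetical ratings from five volunteers for one rendered image (scale 1 to 3).
example = mean_opinion_score([3, 2, 3, 1, 2])
```

Because each volunteer ranks the three images as 3, 2 and 1, the three MOS values for a sequence always sum to 6, so the score is a relative ranking rather than an absolute quality scale.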

6.2 Input Parameters

The different input parameters used while conducting the experiments are listed

in this section.

Table 2 Input parameters and their values

| Parameter | Value |
|---|---|
| Viewing distance (D) | 250 cm (assumed) |
| Eye separation (xB) | 6 cm (assumed) |
| Screen width in pixels (Npix) | 1366 (for the laptop used for experimentation) |
| The range of depth information respectively behind and in front of the picture (knear and kfar) | knear = 44.00; kfar = 120.00 (BreakDancer) |
| | knear = 448.25; kfar = 11206.28 (Balloons) |
| | knear = 448.25; kfar = 11206.28 (Kendo) |
| Resolution of the video sequences used | 1024 x 768 |

Table 3 Filter parameters for EA-JTF and ABF [43]

EA-JTF

Kernel size: 15 x 15 pixels

Standard deviation for the color similarity filter (σs) = 0.025 (normalized range of 0-1)

Standard deviation for the depth similarity filter (σj) = 0.036 (normalized range of 0-1)

Standard deviation for the closeness filter (σc) = 21

ABF

Kernel size: 7 x 7 pixels

Standard deviation for the closeness filter (σc) = 12

6.2 Results for different sequences

For all the sequences, original is the view rendered using the original image and

depth-map. Decoded is the view rendered using the image and depth-map decoded

using HEVC. Processed is the view rendered using the HEVC decoded image and the

post-processed depth-map.

6.2.1 Sequence: Balloons

Sequence Balloons gave the best result of the three sequences in terms of PSNR, SSIM and MOS. This can be attributed to the fact that the original depth map for the Balloons sequence is itself blurred; the post-processing technique sharpens and enhances the depth-map to a large extent, which yields good results.

Table 4 Sequence Balloons results

| Metric | Decoded Image (Left-side view) | Processed Image (Left-side view) | Decoded Image (Right-side view) | Processed Image (Right-side view) |
|---|---|---|---|---|
| PSNR (dB) | 32.6614 | 38.2611 | 32.6614 | 38.2611 |
| SSIM | 0.6977 | 0.9143 | 0.6977 | 0.6977 |

The MOS rating was collected only for the left-side rendered image. Ten volunteers rated the three images, and the results show that the processed image received a far higher rating than the decoded image. In this specific case of Balloons, due to the blurry nature of the original depth-map itself, some of the volunteers gave the processed image a higher rating than the original. On a scale of 3, the MOS is calculated using the formula given in eq. (23). Table 5 gives the MOS rating for the Balloons sequence. The images are shown in figure 6-1.

Table 5 Balloons sequence MOS rating

| Image | MOS Rating (max = 3) |
|---|---|
| Original | 2.4 |
| Decoded | 1.0 |
| Processed | 2.5 |

Figure 6-1 Result images – Rendered left-side images for balloons sequence

6.2.2 Sequence: Break-Dancer

For the sequence Break-dancer, the decoded image actually has a higher PSNR than the processed image. This can be attributed to the fact that, pixel value by pixel value, the original and decoded images are more similar than the original and post-processed images. However, in terms of SSIM and MOS, the processed image shows improvement compared to the decoded image. The results are given in Tables 6 and 7 respectively. The images are shown in figure 6-2.

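Table 6 Sequence Break-Dancer results

| Metric | Decoded Image (Left-side view) | Processed Image (Left-side view) | Decoded Image (Right-side view) | Processed Image (Right-side view) |
|---|---|---|---|---|
| PSNR (dB) | 41.1128 | 40.6078 | 41.1128 | 40.6078 |
| SSIM | 0.8953 | 0.8987 | 0.8953 | 0.8987 |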

Table 7 Break-dancer sequence MOS rating

| Image | MOS Rating (max = 3) |
|---|---|
| Original | 2.6 |
| Decoded | 1.5 |
| Processed | 1.9 |

Figure 6-2 Result images – Rendered left-side images for break-dancer sequence

6.2.3 Sequence: Kendo

As with Break-dancer, for the Kendo sequence there is no improvement in terms of PSNR, but in terms of perceptual quality, measured with SSIM and MOS, the post-processed image scores better than the decoded image. The PSNR and SSIM results are given in Table 8 and the MOS ratings in Table 9. The result images are shown in figure 6-3.

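Table 8 Sequence Kendo results

| Metric | Decoded Image (Left-side view) | Processed Image (Left-side view) | Decoded Image (Right-side view) | Processed Image (Right-side view) |
|---|---|---|---|---|
| PSNR (dB) | 41.7771 | 40.8181 | 41.7771 | 40.6078 |
| SSIM | 0.9459 | 0.9466 | 0.9459 | 0.9466 |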

Table 9 Kendo sequence MOS rating

| Image | MOS Rating (max = 3) |
|---|---|
| Original | 2.2 |
| Decoded | 1.7 |
| Processed | 2.1 |

Figure 6-3 Rendered left-side images for kendo sequence

6.3 Summary

In this chapter, the input parameters and filter parameters used for the experiments are discussed. This is followed by the results for the images rendered from three different test sequences. Conclusions and future work are presented in chapter 7.

Chapter 7

Conclusions and future-work

7.1 Conclusions

This thesis is an effort to improve the quality of rendered views obtained from HEVC decoded depth-maps. The 3D video codec for HEVC is considerably more complicated than the normal HEVC codec. In this thesis, the depth-maps are compressed directly using the HEVC reference software. The post-processing technique described in chapter 5 is applied to the HEVC decoded depth-maps. The views obtained using post-processed depth-maps are compared with the views obtained using just the decoded depth-maps, with the views obtained using the original depth-maps as the reference. Three test sequences are used, and PSNR, SSIM and MOS are the three metrics used to compare the results. For the sequence Balloons, the post-processing improves the quality of the rendered images to a large extent. There is a significant improvement in PSNR and SSIM, as well as in the MOS rating obtained from ten different volunteers: PSNR increases by 5.60 dB, SSIM improves by 0.2166, and the view obtained using the post-processed depth map was found to have the best MOS rating, 2.5. The PSNR results were not as promising for the other two sequences. However, the perceptual quality measurements using SSIM and MOS showed that the views rendered using post-processed depth maps are better than those rendered using just the decoded depth-maps. For the sequence Break-dancers, there is an SSIM improvement of 0.0034 and an MOS rating of 1.9, which is better than the rating of 1.5 obtained for the decoded case. For the Kendo sequence, an SSIM improvement of 0.0007 was obtained, while the MOS rating was 2.1 for the processed image compared to 1.7 for the decoded image. Thus, the results for all three sequences suggest that the perceptual quality of the views rendered using post-processed depth-maps is better than that of the views rendered using depth-maps that have only been HEVC decoded.
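The gains quoted above follow directly from the left-side view columns of Tables 4, 6 and 8. The short sketch below recomputes them as processed minus decoded; the dictionary layout and function name are illustrative only.

```python
# Left-side view values copied from Tables 4, 6 and 8 as (decoded, processed).
results = {
    "Balloons": {"PSNR": (32.6614, 38.2611), "SSIM": (0.6977, 0.9143)},
    "Break-dancers": {"PSNR": (41.1128, 40.6078), "SSIM": (0.8953, 0.8987)},
    "Kendo": {"PSNR": (41.7771, 40.8181), "SSIM": (0.9459, 0.9466)},
}


def gains(table):
    """Return processed minus decoded for every metric of every sequence."""
    return {
        sequence: {
            metric: round(processed - decoded, 4)
            for metric, (decoded, processed) in metrics.items()
        }
        for sequence, metrics in table.items()
    }


for sequence, delta in gains(results).items():
    print(f"{sequence}: PSNR {delta['PSNR']:+.4f} dB, SSIM {delta['SSIM']:+.4f}")
```

The PSNR difference is negative for Break-dancers and Kendo, while the SSIM difference is positive for all three sequences, which is the pattern summarised in the conclusions.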

7.2 Future work

There are a few directions in which this thesis can branch and provide scope for further research. More work on the filter design may provide more significant results. In the current work, only stereoscopic view rendering is considered; this can be extended to multi-view rendering. Also, the current work implements post-processing as an out-of-the-loop solution; this could be made in-loop and merged with the HEVC codec. For evaluating the perceptual quality, the current work used SSIM and an approximate Mean Opinion Score. More research into perceptual quality assessment for depth-maps and rendered views may be useful.

References

1) G.J. Sullivan; J. Ohm; Woo-Jin Han and T.Wiegand, “Overview of the High Efficiency Video Coding ( HEVC ) Standard ”, IEEE Trans. on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, Dec 2012.

2) HEVC text specification draft 10: http://phenix.it-sudparis.eu/jct/doc_end_user/current_document.php?id=7243

3) H.264/AVC reference website: http://www.itu.int/rec/T-REC-H.264-201003-I

4) C. M. Fu, et al, “Sample adaptive offset in the HEVC standard,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1755-1764, Dec. 2012.

5) K. Kim, et al, “Block partitioning structure in the HEVC standard,” IEEE Trans. on circuits and systems for video technology, vol. 22, pp.1697-1706, Dec. 2012.

6) 3DV for H.264: http://mpeg.chiariglione.org/technologies/general/mp-3dv/index.htm

7) B. Furht, “Multimedia systems: an overview”, IEEE Multimedia, vol. 1, pp. 47-59, 1994.

8) K.R. Rao, D.N. Kim and J.J. Hwang, “Video coding standards: AVS China, H.264/MPEG4-Part 10, HEVC, VP6, DIRAC and VC-1”, Springer -2014.

9) B. Furht, “Survey of multimedia compression techniques and standards. Part 1: JPEG standard”, Real time imaging, vol. 1, pp.49-67, 1995.

10) Cisco Visual Networking Index: Global mobile data traffic forecast update,2012-2017: http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.pdf

11) ISO website: http://www.iso.org/iso/home.htm

12) IEC website: http://www.iec.ch/

13) P. Merkle, A. Smolic, K. Muller and T. Wiegand, “Multi-view video plus depth representation and coding,” IEEE International Conf. on Image Processing, pp. I-201 – I-204, San Antonio, USA, Sept. 2007.

14) A. Vetro, S. Yea and A. Smolic, “Towards a 3D video format for Autostereoscopic displays,” SPIE Conf. on Applications of Digital Image Processing XXXI, San Diego, USA, Aug. 2008.

15) A. Smolic, et al, "3D Video and Free Viewpoint Video - Technologies, Applications and MPEG Standards," IEEE International Conference on Multimedia and Expo, 2006, vol., pp.2161, 2164, 9-12 July 2006.

16) D.K. Shah, et al, "Evaluating multi-view plus depth coding solutions for 3D video scenarios," 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), 2012, vol., pp.1-4, 15-17 Oct. 2012.

17) S. Gokturk, H. Yalcin and C. Bamji, “A time-of-flight depth sensor – system description, issues and solutions,” Conf. on Computer Vision and Pattern Recognition, Washington, USA, June 2004.

18) Balloons and Kendo test sequences: http://www.tanimoto.nuee.nagoya-u.ac.jp/~fukushima/mpegftv/

19) L. McMillan, “An Image-Based Approach on Three-Dimensional Computer Graphics”, Ph.D. thesis, University of North Carolina at Chapel Hill: 1997.

20) C. Fehn "A 3D-TV system based on video plus depth information," Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, vol.2, no., pp.1529-1533 Vol.2, 9-12 Nov. 2003.

21) ITU-T website: http://www.itu.int/ITU-T/index.html

22) SMPTE website: http://www.smpte.org/home/

23) Microsoft website: http://www.microsoft.com/en/us/default.aspx

24) Website of AVS working group: http://www.avs.org.cn/en

25) T. Borer and T. Davies, “Dirac Video compression using open technology,” EBU Technical Review, pp. 19, July 2005.

26) K. Onthriar, K.K. Loo and Z. Xue, “Performance comparison of emerging Dirac video codec with H.264/AVC,” IEEE Int’l Conf. on Digital Telecommunication, ICDT '06, pp. 2222, Cap Esterel, Cote d'Azur, France, Aug. 2006.

27) Website of Real Networks: http://www.realnetworks.com/

28) Website of ON 2 Technologies: http://www.on2.com/

29) Website for ITU-T:http://www.itu.int/en/ITU-T/studygroups/2013-2016/16/Pages/video/jctvc.aspx

30) MPEG website: http://www.mpeg.org/

31) Reference for H.262/MPEG-2: http://mpeg.chiariglione.org/standards/mpeg-2/video

32) Reference website for HEVC: www.hevc.info

33) H.261 recommendation: http://www.itu.int/rec/T-REC-H.261-199303-I/en

34) F Bossen, et al, “HEVC complexity and implementation analysis”, IEEE Trans. on Circuits and Systems for Video Technology, vol. 22, Issue: 12, pp. 1685 - 1696, Dec. 2012.

35) Fraunhofer HHI, 3D Video coding information: http://www.hhi.fraunhofer.de/fields-of-competence/image-processing/research-groups/image-video-coding/3d-hevc-extension.html

36) P. Merkle, A Smolic, K. Müller, and T. Wiegand, “Multi-View video plus depth data representation and coding”. Picture Coding Symposium, 2007.

37) “Test Model under Consideration for HEVC based 3D video coding”, ISO/IEC JTC1/SC29/WG11 MPEG2011/N12559, San Jose, CA, USA, Feb. 2012.

38) T. Lee, Y. Chan, C. Fu and W. Siu; “Reliable tracking algorithm for multiple reference frame motion estimation”, J. Electron. Imaging, vol. 20, Issue: 3, pp. 033003-01 - 033003-14, Jul – Sept., 2011.

39) H. Zhang and Z. Ma, “Fast intra prediction for high efficiency video coding”, Pacific Rim Conf. on Multimedia, PCM2012, Singapore, Dec. 2012.

40) M. Zhang, C. Zhao and J. Xu, “An adaptive fast intra mode decision in HEVC”, IEEE ICIP 2012, pp. 221-224, Orlando, FL, Sept. - Oct., 2012.

41) M. Jakubowski and G. Pastuszak, “Block-based motion estimation algorithms – a survey”, Opto-electronics review, vol. 21, pp. 86 – 102, 2013.

42) M.C. Motwani, et al, “A survey of image denoising techniques”, Proceedings of GSPx 2004, Santa Clara, CA: http://www.cse.unr.edu/~fredh/papers/conf/034-asoidt/paper.pdf

43) D.V.S. De Silva, et al, “A Depth Map Post-Processing Framework for 3D-TV systems based on Compression Artifact Analysis”, IEEE journal of Selected Topics in Signal Processing, vol. pp. ,Issue: 99, pp. 1 – 30, Aug. 2011

44) ISO/IEC JTC1/SC29/WG11, “Proposed experimental conditions for EE4 in MPEG 3DAV,” WG11 doc. m9016, Shanghai, Oct. 2002.

45) C. Fehn, “Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV,” Proceedings of the SPIE, vol. 5291, 93, 2004.

46) D. De Silva, W. Fernando, and S. Worrall, “Intra mode selection method for depth maps of 3D video based on rendering distortion modeling,” IEEE Trans. on Consumer Electronics, vol. 56, no. 4, pp. 2735–2740, Nov. 2010.

47) Y. Zhao and L. Yu, “Perceptual measurement for evaluating quality of view synthesis,” ISO/IEC JTC1/SC29/WG11/M16407, Apr. 2009.

48) Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. on Image Processing, vol. 13, no. 4, pp. 600 - 612, Apr. 2004.

49) C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” IEEE International Conference on Computer Vision, pp.839-846, Washington DC, USA, 1998.

50) E. Eisemann and F. Durand, “Flash photography enhancement via intrinsic relighting,” in ACM Trans. on Graphics (TOG), vol. 23, no. 3. ACM,pp. 673–678, 2004.

51) G. Petschnigg, et al, “Digital photography with flash and no-flash image pairs,” in ACM Trans. on Graphics (TOG), vol. 23, no. 3. ACM, pp. 664–672, 2004.

52) B. Zhang and J. Allebach, “Adaptive bilateral filter for sharpness enhancement and noise removal,” IEEE Trans. on Image Processing, vol. 17, no. 5, pp. 664–678, 2008.

53) P. Choudhury and J. Tumblin, “The trilateral filter for high contrast images and meshes,” in ACM SIGGRAPH 2005 Courses. ACM, pp. 5-es, 2005.

54) S. Liu, P. Lai, D. Tian, C. Gomila, and C. W. Chen, “Joint trilateral filtering for depth map compression,” pp. 77 440F-10, Huangshan, China, 2010.

55) HEVC reference software (HM 9.2):- https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/branches/HM-9.2-dev/

56) D. De Silva, et al, "Adaptive sharpening of depth maps for 3D-TV," IET Electronics Letters, vol.46, no.23, pp.1546, 1548, Nov. 11, 2010.

57) O. Gangwal and R. Berretty, “Depth map post-processing for 3D-TV,” in IEEE ICCE 2009, pp. 1-2, 2009.

58) MATLAB code for stereoscopic view rendering: http://www.mathworks.com/matlabcentral/fileexchange/27538-depth-image-based-stereoscopic-view-rendering

59) L Ma, et al, "Image Retargeting Quality Assessment: A study of subjective scores and objective metrics,", IEEE Journal of Selected Topics in Signal Processing,vol.6, no.6, pp.626-639, Oct. 2012.

60) Break-Dancers and Ballet sequence: http://research.microsoft.com/en-us/um/people/sbkang/3dvideodownload/

61) “Interactive stereoscopic video conversion”, IEEE Trans. on circuits and systems for video technology, vol. 23, Oct. 2013.

62) T. Na, et al, “A Hybrid Stereoscopic Video Coding Scheme Based on MPEG-2 and HEVC for 3DTV Services”, IEEE, Trans. on circuits and systems for video technology vol. 23, pp. 1542-1554, Sept. 2013.

63) S. Vasudevan, “Implementation of fast residual quadtree coding and fast intra-prediction in high efficiency video coding”, Masters’ thesis, EE Dept. University of Texas at Arlington, Dec. 2013.

64) V. Gajula, “Complexity reduction of intra-coding in HEVC and comparison with H.264/AVC”, Masters’ thesis, EE Dept., University of Texas at Arlington, Dec. 2013.

Biographical Information

Nayana Parashar was born in Bangalore, India, in 1989. She did her schooling at Holy Cross Convent, Kolhapur, Maharashtra, India. She received her Bachelor of Engineering (B.E.) degree in Instrumentation Technology from Visvesvaraya Technological University, India, in 2011. She started her M.S. program in Electrical Engineering at the University of Texas at Arlington in Jan. 2012 and received the M.S. degree in Dec. 2013. While at UTA, she joined the Multimedia Processing Lab as a student researcher under Dr. K.R. Rao. She also worked as an intern at InterDigital Communications, San Diego, CA from May to Dec. 2013. While at InterDigital, she worked on user-adaptive video streaming, fast transforms and perceptual video/image technology. After graduation, she hopes to work in multimedia and computer vision related fields where she can put her knowledge and experience to good use.
