IMPLEMENTATION OF AN OUT-OF-THE LOOP POST-PROCESSING TECHNIQUE
FOR HEVC DECODED DEPTH MAPS
by
PARASHAR NAYANA KARUNAKAR
Presented to the Faculty of the Graduate School of
The University of Texas at Arlington in Partial Fulfillment
of the Requirements
for the Degree of
MASTER OF SCIENCE IN ELECTRICAL ENGINEERING
THE UNIVERSITY OF TEXAS AT ARLINGTON
December 2013
Copyright © by Parashar Nayana Karunakar 2013
All Rights Reserved
Dedicated to my Grand-Mother
~No occasion is complete without you~
Acknowledgements
I take this opportunity to express my gratitude to Dr. K.R. Rao, my supervising
professor. If it was not for his support, guidance and mentoring throughout my Masters,
this thesis would have been impossible. I would like to thank Dr. Jonathan Bredow and
Dr. Alan Davis at UTA for serving on my committee.
I would like to thank my manager, Dr. Yuriy Rezink at InterDigital, San Diego for
being considerate and supportive towards my thesis work while I worked as an intern, Dr.
Karsten Mueller and Dr. Gerhard Tech of Fraunhofer HHI and Dr. Varuna De Silva of
University of Surrey for their prompt email responses and clarifications that helped me
during the course of my thesis.
A special shout to Shwetha and Auddy, they have helped me in innumerable
ways (tech talks, university related work, always motivating and encouraging, etc.). I
thank Dilip and Abhijith for helping me endure the daunting task of managing the thesis-
work and internship simultaneously; all the “Housians” (Sindhu, Raghu, Asha,
Sarmishtha, Rohit, KT, Om, Karthik, …); my friends, Chethana and Apoorva back in India
who managed to support me irrespective of our time-zone differences. Also, a number of
people (family and friends) helped me to collect the results for the thesis by participating
in my image quality survey; I would like to acknowledge their help.
Finally and most importantly, I would like to thank my parents, Karunakar and
Champaka Parashar; my aunts, Suryaprabha, Saraswathi and Pankaja; my cousin
Chandana and her parents, Mr. and Mrs. Gurumurthy. Their unwavering love and support
always motivate me to march forward without fear and inhibitions.
Nayana Parashar
25th Nov, 2013
Abstract
IMPLEMENTATION OF AN OUT-OF-THE LOOP POST-PROCESSING TECHNIQUE
FOR HEVC DECODED DEPTH MAPS
Parashar Nayana Karunakar, MSEE
The University of Texas at Arlington, 2013
Supervising Professor: K.R. Rao
When depth-maps are compressed using the existing video codecs, the
compression artifacts introduce distortions in the rendered views. To get better rendering,
it is important to get rid of these compression artifacts. This thesis achieves this by using
a post-processing frame-work on HEVC decoded depth maps. The proposed method is
based on compression artifact analysis of depth maps. The proposed work implements a
post-processing filter frame-work which involves two-stage filtering, first by an edge-
adaptive joint trilateral filter followed by histogram analysis and an adaptive bilateral
filtering [43], which can effectively minimize the effects of compression artifacts from the
HEVC decoded depth-maps. The rendered views before and after applying the post-
processing filter are compared with respect to the perceptual quality of the rendered
views. The PSNR, SSIM and MOS are the metrics that are used for video quality
estimation. The post-processing was applied to three different sequences. For all
three sequences, improvements in SSIM and better MOS ratings were obtained for
images rendered using the post-processed depth-maps in comparison to images
rendered using just the HEVC decoded depth-maps. The obtained results suggest that
the post-processing technique proposed in this thesis can be effectively used to improve
the quality of images obtained from depth-map based rendering.
Table of Contents
Acknowledgements........................................................................................................... iv
Abstract.............................................................................................................................. v
List of Illustrations.............................................................................................................. ix
List of Tables..................................................................................................................... xi
List of Acronyms............................................................................................................... xii
Chapter 1 Introduction.......................................................................................................1
1.1 Multimedia...............................................................................................................1
1.1.1 Multimedia applications....................................................................................1
1.2 Visual media............................................................................................................1
1.2.1 Multi-view plus depth video format...................................................................2
1.2.2 Depth-image based rendering..........................................................................4
1.3 Need for compression..............................................................................................4
1.4 Thesis Outline..........................................................................................................6
Chapter 2 Video compression standard - HEVC................................................................7
2.1 High Efficiency Video Coding (HEVC).....................................................................8
2.1.1 HEVC coding design and feature highlights.....................................................9
2.1.1.1 Video Coding Layer...................................................................................9
2.1.1.2 High level syntax architecture.................................................................18
2.1.1.3 Parallel decoding syntax and modified slice structuring..........................19
2.1.2 HEVC complexity analysis..............................................................................22
2.2 Summary...............................................................................................................22
Chapter 3 3D video compression standards....................................................................23
3.1 3D video coding in H.264/AVC..............................................................................23
3.2 3D video coding in HEVC......................................................................................24
3.2.1 Multi view plus depth video.............................................................................24
3.2.2 Transmission of 3D video...............................................................................25
3.2.3 Coding algorithm............................................................................................27
3.2.4 Basic structure of 3D video codec..................................................................29
3.2.5 MVD codec vs HEVC standard codec............................................................30
3.2.5.1 Coding of dependent views.....................................................................30
3.2.5.2 Coding of depth maps.............................................................................36
3.3 Summary...............................................................................................................40
Chapter 4 Analysis of compression artifacts in depth maps.............................................41
4.1 Virtual stereoscopic view generation process........................................................42
4.2 Analysis of compression artifacts in depth maps on view rendering......................44
4.2.1 Frequency domain analysis of the artifacts.....................................................45
4.3 Design requirements of a post-processing filter to minimize the effects
of compression artifacts...............................................................................................48
4.4 Summary...............................................................................................................49
Chapter 5 Thesis – Scope, Background and Working algorithm......................................50
5.1 Scope....................................................................................................................50
5.2 Background Theory...............................................................................................50
5.2.1 Bilateral Filter..................................................................................................51
5.2.2 The joint bilateral and trilateral filter................................................................51
5.2.3 The adaptive bilateral filter..............................................................................52
5.3 Proposed algorithm................................................................................................53
5.3.1. Depth discontinuity analysis..........................................................................54
5.3.1.1 Identification of significant depth discontinuities..........................54
5.3.1.2 Identifying of aligned color and depth edges...........................................55
5.3.2 Pre-filtering of the depth maps to improve depth bin identification.................56
5.3.3 Global histogram analysis..............................................................................58
5.3.4 Bilateral Sharpening Filter..............................................................................59
5.3.5 Stereoscopic view rendering and comparison of results.................................61
5.4 Summary...............................................................................................................61
Chapter 6 Experimental results........................................................................................62
6.1 An approximate Mean Opinion Score calculation..................................................62
6.2 Input Parameters...................................................................................................63
6.2 Results for different sequences.............................................................................64
6.2.1 Sequence: Balloons........................................................................................64
6.2.2 Sequence: Break-Dancer...............................................................................66
6.2.3 Sequence: Kendo...........................................................................................68
6.3 Summary...............................................................................................................69
Chapter 7 Conclusions and future-work...........................................................................70
7.1 Conclusions...........................................................................................................70
7.2 Future work............................................................................................................71
References.......................................................................................................................72
Biographical Information..................................................................................................76
List of Illustrations
Figure 1-1 2D image with spatial samples (L) and Video with N frames (R) [8].................2
Figure 1-2 Color video frame (L) and associated depth map frame (R) [18]......................4
Figure 2-1 Chronology of International video coding standards [8]....................................7
Figure 2-2 Typical HEVC encoder [1].............................................................................10
Figure 2-3 HEVC decoder block diagram [4]....................................................................11
Figure 2-4 Example of CTU, partitioning and processing order when size of CTU is equal
to 64 × 64 and minimum CU size is equal to 8 × 8 (a) CTU partitioning (b) Corresponding
coding tree structure [5]...................................................................................................13
Figure 2-5 Prediction unit splitting types (U = up, D = down, L = left, R = right) [5]..........14
Figure 2-6 Integer and fractional positions for luma interpolation [1]................................15
Figure 2-7 Motion estimation with multiple reference frames [38]....................................16
Figure 2-8 Nine 4 × 4 luma prediction (intra-prediction) modes in H.264 [3]..............................17
Figure 2-9 Modes and directional orientations for intra-picture prediction [1]...................17
Figure 2-10 Sub-division of a picture into a) Slices b) Tiles and c) illustration of wavefront
parallel processing [1]......................................................................................................21
Figure 3-1 Simulcast coding structure with hierarchical B pictures for
temporal prediction (black arrows) (L) and Multi-view coding structure with hierarchical B
pictures for both temporal (black arrows) and inter-view prediction (red arrows) (R) [36]....25
Figure 3-2 Overview of the system structure and data format for the transmission of 3D
video [37].........................................................................................................................27
Figure 3-3 Access unit structure and coding order of view components [37]...................28
Figure 3-4 Basic codec structure with inter-component prediction (red-arrows) [35] [37] 29
Figure 3-5 Disparity-compensated prediction as an alternative to motion-compensated
prediction [37].................................................................................................................. 31
Figure 3-6 Basic principle of deriving motion parameters for a block in a current picture
based on motion parameters in an already coded reference view and an estimate of the
depth map for the current picture [37]..............................................................................33
Figure 3-7 Independent derivation of motion information for each point of encoded CU
from corresponding point in reference view [37]..............................................................34
Figure 3-8 Basic concept for the inter-view residual prediction [37].................................35
Figure 4-1 Virtual view generation in Depth Image Based Rendering (DIBR) process [43]....42
Figure 4-2 Effect of compression noise in areas of homogenous depth [43]...................47
Figure 4-3 Effect of compression noise in areas of sharp depth discontinuities [43]........48
Figure 5-1 Block diagram for the proposed work [43]......................................................53
Figure 5-2 Illustration of depth discontinuity analysis.......................................................56
Figure 6-1 Result images – Rendered left-side images for balloons sequence...............66
Figure 6-2 Result images – Rendered left-side images for break-dancer sequence........67
Figure 6-3 Rendered left-side images for kendo sequence.............................................69
List of Tables
Table 1-1 Mass storage requirements by various media types (B=byte) [9]......................5
Table 2 Input parameters and their values.......................................................................63
Table 3 Filter parameters for EA-JTF and ABF [43].........................................................64
Table 4 Sequence Balloons results..................................................................................65
Table 5 Balloons sequence MOS rating...........................................................................65
Table 6 Sequence Break-Dancer results.........................................................................67
Table 7 Break-dancer sequence MOS rating...................................................................67
Table 8 Sequence Kendo results.....................................................................................68
Table 9 Kendo sequence MOS rating..............................................................................68
List of Acronyms
2D: Two Dimensional
3D: Three Dimensional
ABF: Adaptive Bilateral Filtering
ADLF: Availability Deblocking Loopback Filter
AMVP: Advanced Motion Vector Prediction
AVC: Advanced Video Coding
AVS China: The Audio and Video coding Standard of China
CB: Coding Block
CG: Computer Graphics
CU: Coding Unit
CTB: Coding Tree Block
CTU: Coding Tree Unit
DBMP: Depth-Based Motion Prediction
DCP: Disparity-Compensated Prediction
DCT: Discrete Cosine Transform
DF: Deblocking Filter
DFT: Discrete Fourier Transform
DGLF: Depth Gradient based Loopback Filter
DIBR: Depth Image Based Rendering
DST: Discrete Sine Transform
EA-JTF: Edge-Adaptive Joint Trilateral Filter
FVV: Free Viewpoint Video
HD: High Definition
HEVC: High Efficiency Video Coding
IEC: International Electrotechnical Commission
ISO: International Standardization Organization
ITU-T: The Telecommunication Standardization Sector of International Telecommunication Union
MC: Motion Compensation
MCP: Motion-Compensated Prediction
MOS: Mean Opinion Score
MPEG: Moving Picture Experts Group
MV: Motion Vector
MVC: Multi-View Coding
MVD: Multi-View plus Depth
NAL: Network Abstraction Layer
PSNR: Peak Signal to Noise Ratio
PU: Prediction Unit
QP: Quantization Parameter
SAO: Sample Adaptive Offset
SDO: Standards Development Organization
SEI: Supplemental Enhancement Information
SMPTE: Society of Motion Picture and Television Engineers
SSIM: Structural Similarity Index Metric
TU: Transform Unit
VCEG: Video Coding Experts Group
VCL: Video Coding Layer
VOI: View Order Index
VPS: Video Parameter Set
VUI: Video Usability Information
WPP: Wavefront Parallel Processing
ZZC: Znear Zfar Compensation
Chapter 1
Introduction
1.1 Multimedia
The combination of multiple sources of video, audio, image and text is usually
known as multimedia. Multimedia systems combine a variety of information sources, such
as voice, graphics, animations, image, audio, and full-motion video, into a wide range of
applications. The big picture shows multimedia as the merging of three industries:
computing, communication and broadcasting. The defining characteristic of multimedia
systems is the incorporation of continuous media such as voice, video and animation [7].
1.1.1 Multimedia applications
Multimedia finds its application in various areas including, but not limited to
advertisements, art, education, entertainment, engineering, medicine, mathematics,
business, scientific research and spatial / temporal applications. Multimedia finds its
major application in the creative and entertainment industries. Movies and TV shows, the music
industry, interactive and non-interactive video games, video teleconferencing, medical
imaging and interactive educational tools are some of the many applications of
multimedia [7].
1.2 Visual media
Images and video make up the visual media. An image is characterized by pixels
or pels. The number of pixels in an image (height and width) and the color and brightness of
each pixel determine the properties of an image. Video is composed of a sequence of
pictures (frames) taken at regular temporal intervals. The number of frames per second is
called the frame rate [8]. This is shown in Figure 1-1.
Figure 1-1 2D image with spatial samples (L) and Video with N frames (R) [8]
1.2.1 Multi-view plus depth video format
The multi-view video plus depth (MVD) [16] [36] format is currently one of the
most promising formats to provide enhanced 3D visual experiences [13]. This type of
representation provides, for each view-point, texture (image sequences) and depth (map
sequences) information as shown in Figure 1-2. The depth maps represent the per-pixel
depth of a corresponding color image, and signal the disparity information needed at the
virtual (novel) view rendering system. The depth maps can be represented as a gray-
scale image sequence for storage and transmission requirements, and thus can be
compressed with existing video codecs, such as H.264/AVC [3] and HEVC [1] [2].
Thus, in addition to the textures, depth maps must be efficiently coded and
transmitted to the decoder, to be later used to render some virtual intermediate views of
the scene. This solution promises the capability to render a large number of views
(across a wide view-angle) while reducing the amount of data that needs to be
transmitted. However, the MVD format still requires a significant amount of data to be
stored or transmitted which is essential to provide the enhanced experience associated
with emerging applications such as free viewpoint video (FVV) [15] and next-generation
3DTV displays [15]. For FVV, MVD allows the viewer to select any desired view point,
while for autostereoscopic (glasses-free) displays, intermediate views can be created at
the decoder for a multitude of viewing angles, thus enhancing the 3D experience [14]
[16] [36] [13].
In the depth maps, each pixel conveys information on the relative distance from
the camera to the object in the 3D space. While the lighter gray regions represent near
objects, the darker gray regions represent far objects as shown in Figure 1-2. While
depth maps must be efficiently encoded, they are never displayed, but only used to
synthesize the intermediate views from the original ones, typically with depth image-
based rendering (DIBR) techniques. Unlike the texture, depth map sequences do
not have any color, texture or illumination changes and are correlated with the
corresponding texture view sequence only at the object boundaries [16] [36].
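The relation between gray level and distance described above can be sketched in a few lines. This is a minimal illustration that assumes the inverse-depth quantization commonly used for 8-bit depth maps, with a nearest and farthest clipping plane (Znear and Zfar); the function name and the example plane distances are hypothetical and are not taken from this thesis.

```python
def depth_value_to_distance(v, z_near, z_far, max_level=255):
    """Map a depth-map gray level to a distance from the camera.

    Assumes inverse-depth quantization (an assumption, not stated in the
    text): max_level (lightest) maps to z_near, 0 (darkest) maps to z_far.
    """
    if not 0 <= v <= max_level:
        raise ValueError("gray level out of range")
    if not 0 < z_near < z_far:
        raise ValueError("expected 0 < z_near < z_far")
    # Interpolate linearly in 1/Z between the two clipping planes.
    inv_z = (v / max_level) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return 1.0 / inv_z


# Example: lighter pixels are closer to the camera than darker ones.
near = depth_value_to_distance(255, z_near=1.0, z_far=10.0)
mid = depth_value_to_distance(128, z_near=1.0, z_far=10.0)
far = depth_value_to_distance(0, z_near=1.0, z_far=10.0)
```

Because the interpolation is done in 1/Z, gray levels are spent more densely on near objects, where a given depth error causes a larger disparity error in the rendered view.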
The quality of experience provided by the MVD format depends on several
factors, one of them being the accuracy of the estimated depth map. Depth can be
obtained in several ways, for example captured directly using time-of-flight cameras or
estimated from texture with stereo or multi-view matching algorithms [17]. In addition, the
depth map coding scheme selected is crucial to enable high quality view synthesis in
bandwidth constrained channels; a relevant factor for the performance of depth map
coding schemes is the preservation of the depth map discontinuities as this is critical to
reduce the geometric distortions along the object contours [16] [36].
Figure 1-2 Color video frame (L) and associated depth map frame (R) [18]
1.2.2 Depth-image based rendering
Depth-image-based rendering (DIBR) [20] is the process of synthesizing “virtual”
views of a scene from still or moving images and associated per-pixel depth information
[19]. Conceptually, this novel view generation can be understood as the following two-
step process: At first, the original image points are reprojected into the 3D world, utilizing
the respective depth data. Thereafter, these 3D space points are projected into the
image plane of a “virtual” camera, which is located at the required viewing position. The
concatenation of reprojection (2D-to-3D) and subsequent projection (3D-to-2D) is usually
called 3D image warping in the Computer Graphics (CG) literature [20].
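The two-step warping described above can be sketched for a single pixel. This is a simplified illustration, assuming an ideal pinhole camera with focal length f and principal point (cx, cy), and a virtual camera that is only shifted horizontally by a baseline (a rectified setup). All function and parameter names are hypothetical; a practical DIBR implementation would also handle occlusions and fill holes.

```python
def reproject_to_3d(u, v, z, f, cx, cy):
    """Step 1 (2D-to-3D): lift pixel (u, v) with depth z into camera space."""
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return (x, y, z)


def project_to_virtual(point, f, cx, cy, baseline):
    """Step 2 (3D-to-2D): project a 3D point into the virtual camera.

    The virtual camera is assumed to be translated by `baseline` along
    the horizontal axis, with the same intrinsics as the original camera.
    """
    x, y, z = point
    x_virtual = x - baseline
    return (f * x_virtual / z + cx, f * y / z + cy)


def warp_pixel(u, v, z, f, cx, cy, baseline):
    """Concatenate both steps: the 3D image warp of one pixel."""
    point = reproject_to_3d(u, v, z, f, cx, cy)
    return project_to_virtual(point, f, cx, cy, baseline)


# Example: the same pixel warped at two different depths.
far_u, far_v = warp_pixel(100, 50, z=2.0, f=1000.0, cx=320.0, cy=240.0, baseline=0.05)
near_u, near_v = warp_pixel(100, 50, z=1.0, f=1000.0, cx=320.0, cy=240.0, baseline=0.05)
```

Under these assumptions the warp reduces to a horizontal shift of f * baseline / z pixels, so near objects move more than far ones, which is why depth errors at object boundaries are so visible in rendered views.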
1.3 Need for compression
Compression is a technique wherein data is coded in an efficient manner to
represent it using fewer bits than the original un-coded data. The goal of compression is
to represent data with as low a bit-rate as possible without compromising on the quality.
Audio, image, and video signals require a vast amount of data for their representation [9].
In particular, the data required to represent multi-view video with its associated
depth-maps is many times that of normal 2D video. Table 1-1 illustrates the mass
storage requirements for various media types, namely text, image, audio, and video.
Table 1-1 Mass storage requirements by various media types (B=byte) [9]
Text
  Object type: ASCII, EBCDIC
  Size and bandwidth: 2 KB per page
Image
  Object type: bitmapped graphics, still photos, faxes
  Size and bandwidth: simple, 64 KB/image; detailed (color), 7.5 MB/image
Audio
  Object type: non-coded stream of digitized audio or voice
  Size and bandwidth: voice/phone, 8 kHz / 8 bits (mono), 6-44 KB/s; audio CD, 44.1 kHz / 16 bits / stereo, 176 KB/s
Video
  Object type: TV analog or digital image with synched streams at 24-30 frames/s
  Size and bandwidth: 27.7 MB/s for 640 × 480 × 24 pixels per frame (24-bit color), 30 frames/s
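The uncompressed rates in Table 1-1 follow from simple arithmetic. The sketch below recomputes the video and audio CD entries; the function names are hypothetical, and 1 KB and 1 MB are taken here as 10^3 and 10^6 bytes, which is an assumption about the convention used in the table.

```python
def raw_video_rate(width, height, bits_per_pixel, fps):
    """Uncompressed video data rate in bytes per second."""
    bytes_per_frame = width * height * bits_per_pixel // 8
    return bytes_per_frame * fps


def raw_audio_rate(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM audio data rate in bytes per second."""
    return sample_rate_hz * bits_per_sample // 8 * channels


video_bps = raw_video_rate(640, 480, 24, 30)   # 24-bit color, 30 frames/s
cd_bps = raw_audio_rate(44100, 16, 2)          # audio CD, stereo
phone_bps = raw_audio_rate(8000, 8, 1)         # telephone-quality voice, mono

print(f"video: {video_bps / 1e6:.2f} MB/s")
print(f"audio CD: {cd_bps / 1e3:.1f} KB/s")
print(f"voice: {phone_bps / 1e3:.1f} KB/s")
```

The video result is about 27.6 million bytes per second, in line with the roughly 27.7 MB/s listed in the table, and the audio CD result is about 176 KB/s.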
When it comes to video, another important factor to be considered is
Internet video traffic. Internet video traffic is growing at a very fast pace, and it is estimated that
by 2017, 69% of the total Internet traffic will be video. Growing numbers of mobile
devices (smart phones, tablets, etc.) capable of video streaming and playback, and the
increasing popularity of viewing online video content, have accelerated this growth, with an
estimated growth rate of 90% from 2012 to 2017 [10]. Thus, there are three main reasons why present multimedia
systems require data to be compressed. They are:
a) Large storage requirements of multimedia data.
b) Relatively slow storage devices which do not allow playing multimedia
data (specifically video) in real-time.
c) The present network’s bandwidth, which does not allow real-time video
data transmission.
These three reasons along with the vast applications and usefulness of multimedia data
make compression an extremely important and challenging task.
1.4 Thesis Outline
Chapter 2 covers the HEVC video compression standard. The 3D video codecs
for H.264/AVC and HEVC are covered in chapter 3. In chapter 4, analysis of compression
artifacts in depth-maps is explained. Chapter 5 covers the scope, background theory
and the working algorithm that is used in the proposed research. In chapter 6, results of
the experimentation are listed along with the input parameters used. Chapter 7 gives the
conclusions and the areas where more work can be done in the future.
Chapter 2
Video compression standard - HEVC
Video and audio coding standards guarantee interoperability between software
and hardware provided by multiple vendors that make multimedia communications
practical. A series of video and audio coding standards has been developed by Standards
Development Organizations (SDO), including ISO/IEC (the International Standardization
Organization and the International Electrotechnical Commission) [11] [12], ITU-T (the
Telecommunication Standardization Sector of the International Telecommunication
Union, formerly CCITT) [21], SMPTE (Society of Motion Picture and Television
Engineers) [22], AVS China (the Audio and Video coding Standard of China) [24], DIRAC
by BBC [25] [26], and well-known companies, including Microsoft [23], Real Networks
[27] and On2 Technologies (acquired by Google) [28]. The chronology of different video
compression standards is shown in Figure 2-1.
Figure 2-1 Chronology of International video coding standards [8]
In this chapter, the High Efficiency Video Coding (HEVC) [2] video compression
standard is discussed, followed by discussions on 3D video compression in
H.264/AVC [3] and HEVC [2].
2.1 High Efficiency Video Coding (HEVC)
High-Efficiency Video Coding (HEVC) [1] [2] [32] is the newest video coding
standard of the ITU-T [29] Video Coding Experts Group (VCEG) and the ISO/IEC Moving
Picture Experts Group (MPEG) [30]. HEVC enables significantly improved compression
performance relative to existing standards – in the range of 50% bit rate reduction [1] for
equal perceptual video quality.
The major video coding standard directly preceding the HEVC [1] [2] [32] project
was H.264/MPEG-4 Advanced Video Coding (AVC) [3], which was initially developed
during 1999–2003, and then was extended in several important ways during 2003–2009.
H.264/MPEG-4 AVC [3] was an enabling technology for digital video in almost every area
that was not previously covered by H.262/MPEG-2 [31] Video, and has substantially
displaced the older standard within its existing application domain. It is widely used for
many applications, including broadcast of high definition (HD) TV signals over satellite,
cable, and terrestrial transmission systems, video content acquisition and editing
systems, camcorders, security applications, Internet and mobile network video, Blu-ray
discs, and real-time conversational applications such as video chat, video conferencing,
and telepresence systems [1].
An increasing diversity of services, the growing popularity of HD video, and the
emergence of beyond-HD formats (e.g. 4k×2k or 8k×4k resolution) are creating even
stronger needs for coding efficiency superior to H.264/MPEG-4 AVC’s [3] capabilities.
The need is even stronger when higher resolution is accompanied by stereo or multi-view
capture and display. Moreover, the traffic caused by video applications targeting mobile
devices and tablet-PCs, as well as the transmission needs for video on demand services,
are imposing severe challenges on today’s networks. An increased desire for higher
quality and resolutions is also arising in mobile applications. HEVC standardization [1] [2]
[32] began to address all these needs. HEVC has been designed to address essentially
all existing applications of H.264/MPEG-4 AVC [3] and to particularly focus on two key
issues: increased video resolution and increased use of parallel processing architectures
[1].
2.1.1 HEVC coding design and feature highlights
The HEVC standard is designed to achieve multiple goals: coding efficiency,
transport system integration and data loss resilience, as well as implementability using
parallel processing architectures. The following sub-sections describe the key elements
of the design by which these goals are achieved, and the typical encoder operation which
would generate a valid bitstream.
2.1.1.1 Video Coding Layer
The video coding layer of HEVC employs the same “hybrid” approach
(inter-/intra-picture prediction and 2D transform coding) used in all video compression
standards since H.261 [33]. Figure 2-2 depicts the block diagram of a hybrid video
encoder, which can create a bitstream conforming to the HEVC standard.
An encoding algorithm producing an HEVC [1] [2] [32] compliant bitstream would
typically proceed as follows. Each picture is split into block-shaped regions, with the
exact block partitioning being conveyed to the decoder. The first picture of a video
sequence (and the first picture at each “clean” random access point in a video sequence)
is coded using only intra-picture prediction (which uses some prediction of data spatially
from region-to-region within the same picture but has no dependence on other pictures).
For all the remaining pictures of a sequence or between random access points, inter-
picture temporally-predictive coding modes are typically used for most blocks. The
encoding process for inter-picture prediction consists of choosing motion data comprising
the selected reference picture and motion vector (MV) to be applied for predicting the
samples of each block. The encoder and decoder generate identical inter prediction
signals by applying motion compensation (MC) using the MV and mode decision data,
which are transmitted as side information [41].
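The choice of a motion vector and the resulting motion-compensated prediction can be illustrated with a toy example. The sketch below assumes a single reference picture, integer-sample motion vectors, an exhaustive search and the sum of absolute differences (SAD) as the cost; these are simplifications for illustration and not the search strategy of any particular HEVC encoder. All names are hypothetical.

```python
def get_block(frame, top, left, size):
    """Extract a size x size block from a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def motion_search(cur, ref, top, left, size, search_range):
    """Encoder side: find the MV (dy, dx) that minimizes the SAD."""
    height, width = len(ref), len(ref[0])
    target = get_block(cur, top, left, size)
    best_mv, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            t, l = top + dy, left + dx
            # Skip candidates that fall outside the reference picture.
            if t < 0 or l < 0 or t + size > height or l + size > width:
                continue
            cost = sad(target, get_block(ref, t, l, size))
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost


def motion_compensate(ref, top, left, size, mv):
    """Encoder and decoder side: build the prediction from ref and the MV."""
    dy, dx = mv
    return get_block(ref, top + dy, left + dx, size)


# Toy data: the current picture is the reference shifted right by one sample.
ref = [[y * 8 + x for x in range(8)] for y in range(8)]
cur = [[row[0]] + row[:-1] for row in ref]
mv, cost = motion_search(cur, ref, top=2, left=3, size=2, search_range=2)
prediction = motion_compensate(ref, 2, 3, 2, mv)
```

Only `motion_compensate` has to be mirrored in the decoder, which is why transmitting the MV as side information is enough for both sides to form the same prediction.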
Figure 2-2 Typical HEVC encoder [1]
The residual signal of the intra or inter prediction, which is the difference between
the original block and its prediction, is transformed by a linear spatial transform. The
transform coefficients are then scaled, quantized, entropy coded, and transmitted
together with the prediction information [1] [2].
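The residual path can be sketched as follows. This is a simplified illustration: it uses a floating-point orthonormal DCT-II and one uniform quantization step, whereas HEVC specifies integer transforms and QP-dependent scaling, and entropy coding is omitted. All function names and the step size are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are basis vectors)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_transform(block):
    """Separable 2D transform: C * block * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def inverse_transform(coeffs):
    """Inverse of forward_transform: C^T * coeffs * C."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def quantize(coeffs, step):
    return [[int(round(v / step)) for v in row] for row in coeffs]


def dequantize(levels, step):
    return [[level * step for level in row] for row in levels]


def code_residual(original, prediction, step):
    """Return the quantized levels and the reconstructed block."""
    residual = [[o - p for o, p in zip(row_o, row_p)]
                for row_o, row_p in zip(original, prediction)]
    levels = quantize(forward_transform(residual), step)
    approx = inverse_transform(dequantize(levels, step))
    recon = [[p + r for p, r in zip(row_p, row_r)]
             for row_p, row_r in zip(prediction, approx)]
    return levels, recon


flat_levels, flat_recon = code_residual([[14] * 4 for _ in range(4)],
                                        [[10] * 4 for _ in range(4)], 1.0)
```

For the flat residual in the example, all of the energy lands in the single DC level, which illustrates why smooth regions such as the interior of depth-map objects are cheap to code, while sharp edges spread energy across many coefficients.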
The encoder duplicates the decoder processing loop such that both will generate
identical predictions for subsequent data. Therefore, the quantized transform coefficients
are constructed by inverse scaling and are then inverse transformed to duplicate the
decoded approximation of the residual signal. The residual is then added to the
prediction, and the result of that addition may then be fed into one or two loop filters to
smooth out artifacts induced by the block-wise processing and quantization. The final
picture representation (which is a duplicate of the output of the decoder) is stored in a
decoded picture buffer to be used for the prediction of subsequent pictures. In general,
the order of the encoding or decoding processing of pictures often differs from the order
in which they arrive from the source; necessitating a distinction between the decoding
order (a.k.a. bitstream order) and the output order (a.k.a. display order) for a decoder.
Figure 2-3 shows the block diagram of an HEVC decoder, which performs the inverse of
the encoder's process.
Figure 2-5 HEVC decoder block diagram [4]
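The reconstruction loop described above can be illustrated with a minimal sketch. It assumes a one-dimensional block of four samples, a floating-point orthonormal DCT in place of the integer transform of HEVC, and a plain rounding quantizer; the function names and sample values are hypothetical. The point it shows is that the encoder reconstructs each block with the same steps as the decoder, so both hold identical reference data.

```python
import math

N = 4  # toy block length (one row of samples)

def dct_matrix(n):
    """Orthonormal DCT-II basis, standing in for the codec's integer transform."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

T = dct_matrix(N)

def forward(residual):
    return [sum(T[k][i] * residual[i] for i in range(N)) for k in range(N)]

def inverse(coeffs):
    return [sum(T[k][i] * coeffs[k] for k in range(N)) for i in range(N)]

def encode_block(original, prediction, qstep):
    """Return the quantized levels that would be entropy coded."""
    residual = [o - p for o, p in zip(original, prediction)]
    return [round(c / qstep) for c in forward(residual)]

def reconstruct_block(levels, prediction, qstep):
    """Decoder-side reconstruction; the encoder runs exactly the same steps."""
    residual = inverse([lv * qstep for lv in levels])
    return [p + r for p, r in zip(prediction, residual)]

original = [52, 55, 61, 66]
prediction = [50, 50, 60, 60]
qstep = 2.0
levels = encode_block(original, prediction, qstep)
encoder_recon = reconstruct_block(levels, prediction, qstep)
decoder_recon = reconstruct_block(levels, prediction, qstep)
```

Because the transform in this sketch is orthonormal, the reconstruction error of each sample cannot exceed the quantization step size for a four-sample block.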
Video material to be encoded by HEVC is generally expected to be input as
progressive scan imagery (either due to the source video originating in that format or
resulting from de-interlacing prior to encoding). No explicit coding features are present in
the HEVC design to support the use of interlaced scanning, as interlaced scanning is no
longer used for displays and is becoming substantially less common for distribution.
However, metadata syntax has been provided in HEVC to allow an encoder to indicate
that interlace-scanned video has been sent by coding each field (i.e. the even or odd
numbered lines of each video frame) of interlaced video as a separate picture or that it
has been sent by coding each interlaced frame as an HEVC coded picture. This provides
an efficient method of coding interlaced video without burdening decoders with a need to
support a special decoding process for it [1] [2].
The various features involved in hybrid video coding using HEVC will now be
highlighted [1] [2]:
Coding Tree Units and Coding Tree Block structure: The core of the
coding layer in previous standards was the macroblock, containing a
16×16 block of luma samples and, in the usual case of 4:2:0 color
sampling, two corresponding 8×8 blocks of chroma samples; whereas
the analogous structure in HEVC is the coding tree unit (CTU), which has
a size selected by the encoder and can be larger than a traditional
macroblock. The CTU consists of a luma coding tree block (CTB) and the
corresponding chroma CTBs and syntax elements. The size L x L of a
luma CTB can be chosen as L = 16, 32, or 64 samples, with the larger
sizes typically enabling better compression. HEVC then supports a
partitioning of the CTBs into smaller blocks using a tree structure and
quadtree-like signaling.
Coding Units and Coding Blocks: The quadtree syntax of the CTU
specifies the size and positions of its luma and chroma coding blocks
(CBs). The root of the quadtree is associated with the CTU. Hence, the
size of the luma CTB is the largest supported size for a luma CB. The
splitting of a CTU into luma and chroma CBs is signaled jointly. One
luma CB and ordinarily two chroma CBs, together with associated
syntax, form a Coding Unit (CU). A CTB may contain only one CU or
may be split to form multiple CUs, and each CU has an associated
partitioning into prediction units (PUs) and a tree of transform units
(TUs). An example of CTU partitioning and processing order, for a CTU
size of 64 × 64 and a minimum CU size of 8 × 8, is shown in
Figure 2-4 [5].
Figure 2-6 Example of CTU, partitioning and processing order when size of CTU is equal
to 64 × 64 and minimum CU size is equal to 8 × 8 (a) CTU partitioning (b) Corresponding
coding tree structure [5]
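The quadtree partitioning of a CTU into CUs can be sketched as a short recursion. This is an illustration only: the split decision callback `wants_split` is hypothetical and stands in for the encoder's mode decision, and the traversal order (top-left, top-right, bottom-left, bottom-right) is assumed to match the processing order of the figure.

```python
def split_ctu(x, y, size, min_size, wants_split):
    """Yield (x, y, size) leaf CUs of a CTU in depth-first z-scan order.

    wants_split(x, y, size) is a hypothetical encoder decision callback.
    """
    if size > min_size and wants_split(x, y, size):
        half = size // 2
        for dy in (0, half):          # top row of quadrants first
            for dx in (0, half):      # left quadrant before right
                yield from split_ctu(x + dx, y + dy, half, min_size, wants_split)
    else:
        yield (x, y, size)

# Example decision: keep splitting only the block anchored at the top-left corner.
def split_top_left(x, y, size):
    return x == 0 and y == 0

leaves = list(split_ctu(0, 0, 64, 8, split_top_left))
```

With a 64 × 64 CTU and a minimum CU size of 8 × 8, this example decision yields four 8 × 8 CUs, three 16 × 16 CUs and three 32 × 32 CUs, which together cover the CTU exactly.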
Prediction Units and Prediction Blocks: The decision whether to code
a picture area using inter-picture or intra-picture prediction is made at the
CU level. A prediction unit (PU) partitioning structure has its root at the
CU level. Depending on the basic prediction type decision, the luma and
chroma CBs can then be further split in size and predicted from luma and
chroma prediction blocks (PBs). HEVC supports variable PB sizes from
64×64 down to 4×4 samples. Different PU splitting types are shown in
Figure 2-5 [5].
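The PU splitting types can be written down as rectangles inside a CU. The sketch below assumes the mode names commonly used in the HEVC literature (2N×2N, 2N×N, N×2N, N×N and the four asymmetric types) and assumes that the asymmetric types split at one quarter of the CU size; the function name is hypothetical.

```python
def pu_rectangles(mode, size):
    """Return (x, y, width, height) prediction blocks for a size x size CU."""
    h, q = size // 2, size // 4
    table = {
        "2Nx2N": [(0, 0, size, size)],
        "2NxN":  [(0, 0, size, h), (0, h, size, h)],
        "Nx2N":  [(0, 0, h, size), (h, 0, h, size)],
        "NxN":   [(0, 0, h, h), (h, 0, h, h), (0, h, h, h), (h, h, h, h)],
        "2NxnU": [(0, 0, size, q), (0, q, size, size - q)],          # small part up
        "2NxnD": [(0, 0, size, size - q), (0, size - q, size, q)],   # small part down
        "nLx2N": [(0, 0, q, size), (q, 0, size - q, size)],          # small part left
        "nRx2N": [(0, 0, size - q, size), (size - q, 0, q, size)],   # small part right
    }
    return table[mode]
```

For example, `pu_rectangles("2NxnU", 32)` gives a 32 × 8 block on top of a 32 × 24 block.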
Transform Units and Transform Blocks: The prediction residual is
coded using block transforms. A transform unit (TU) tree structure has its
root at the CU level. The luma CB residual may be identical to the luma
transform block (TB) or may be further split into smaller luma TBs. The
same applies to the chroma TBs. Integer basis functions similar to those
of a discrete cosine transform (DCT) are defined for the square TB sizes
4×4, 8×8, 16×16, and 32×32. For the 4×4 transform of intra-picture
prediction residuals, an integer transform derived from a form of discrete
sine transform (DST) is alternatively specified.
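As an illustration of integer basis functions that resemble a DCT, the sketch below uses the 4 × 4 matrix commonly cited for the HEVC core transform. The matrix values are an assumption brought in for this example and are not taken from the text above; the scaling and shift stages of a real codec are omitted.

```python
# Commonly cited 4x4 HEVC core transform matrix (assumed here for illustration).
DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def forward_1d(samples):
    """Apply the integer basis to four residual samples (no scaling stage)."""
    return [dot(row, samples) for row in DCT4]

# Distinct basis rows are exactly orthogonal; their squared norms are nearly equal.
gram = [[dot(r, s) for s in DCT4] for r in DCT4]
flat_coeffs = forward_1d([10, 10, 10, 10])
```

A flat residual maps to a single non-zero (DC) coefficient, which is the energy compaction property the transform is chosen for.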
Figure 2-7 Prediction unit splitting types (U = up, D = down, L = left, R = right) [5]
Motion vector signaling: Advanced motion vector prediction (AMVP) is
used, including derivation of several most probable candidates based on
data from adjacent PBs and the reference picture. A “merge” mode for
MV coding can be also used, allowing the inheritance of MVs from
neighboring PBs. Moreover, compared to H.264/MPEG-4 AVC, improved
“skipped” and “direct” motion inference are also specified.
Motion compensation: Quarter-sample precision is used for the MVs,
and 7-tap or 8-tap filters are used for interpolation of fractional-sample
positions (compared to 6-tap filtering of half-sample positions followed
by bilinear interpolation of quarter-sample positions in H.264/MPEG-4
AVC) [3]. Integer (A i,j) and fractional sample positions (lower-case
letters) for luma interpolation are shown in Figure 2-6. Similar to
H.264/MPEG-4 AVC [3], multiple reference pictures, as shown in Figure
2-7, are used. For each PB, either one or two motion vectors can be
transmitted, resulting either in uni-predictive or bi-predictive coding,
respectively. As in H.264/MPEG-4 AVC [3], a scaling and offset
operation may be applied to the prediction signal(s) in a manner known
as weighted prediction.
Figure 2-8 Integer and fractional positions for luma interpolation [1]
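Fractional-sample interpolation along one row can be sketched as follows. The filter taps are the luma coefficients commonly quoted for HEVC and are an assumption of this example, as are the function name, the edge clamping and the 8-bit clipping; a real decoder applies the filters separably in two dimensions with intermediate precision.

```python
# Luma filter taps as commonly given for HEVC (assumed; each set sums to 64).
HALF_TAPS = [-1, 4, -11, 40, 40, -11, 4, -1]   # 8-tap, half-sample position
QUARTER_TAPS = [-1, 4, -10, 58, 17, -5, 1]     # 7-tap, quarter-sample position

def interpolate(row, i, taps):
    """Interpolate a sample between row[i] and row[i + 1].

    Taps are aligned so that taps[3] multiplies row[i]; indices outside the
    row are clamped to the nearest edge sample.
    """
    acc = 0
    for k, tap in enumerate(taps):
        j = min(max(i - 3 + k, 0), len(row) - 1)
        acc += tap * row[j]
    # Round, normalize by 64 and clip to the 8-bit sample range.
    return min(max((acc + 32) >> 6, 0), 255)

flat = [100] * 16
edge = [0] * 8 + [255] * 8
```

On a flat row the filters return the input value, while across a sharp edge the half-sample value lands midway between the two levels.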
Figure 2-9 Motion estimation with multiple reference frames [38]
Intra-picture prediction: The decoded boundary samples of adjacent
blocks are used as reference data for spatial prediction in PB regions
where inter-picture prediction is not performed. Intra prediction supports
33 directional modes [1] [2] (compared to the 8 directional modes among
the 9 modes shown in Figure 2-8 for H.264/MPEG-4 AVC [3]), plus planar
(surface fitting) and DC (flat)
prediction modes. The selected intra prediction modes are encoded by
deriving most probable modes (e.g. prediction directions) based on those
of previously-decoded neighboring PBs. The different modes and
directional orientations for intra-picture prediction are as shown in Figure
2-9 [1].
Figure 2-10 Nine 4×4 luma prediction (intra-prediction) modes in H.264 [3]
Figure 2-11 Modes and directional orientations for intra-picture prediction
[1]
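Of the intra prediction modes, the DC (flat) mode is the simplest to sketch. The example below assumes N decoded reference samples above the block and N to its left, with N a power of two, and fills the block with their rounded mean. The function name is hypothetical, and any additional smoothing of block edges is left out.

```python
def dc_prediction(top, left):
    """Fill an N x N block with the rounded mean of its reference samples."""
    n = len(top)
    if len(left) != n or n & (n - 1) != 0:
        raise ValueError("expects N samples per side, with N a power of two")
    shift = n.bit_length()                 # log2(N) + 1, i.e. divide by 2N
    dc = (sum(top) + sum(left) + n) >> shift
    return [[dc] * n for _ in range(n)]

block = dc_prediction([100, 102, 104, 106], [98, 98, 100, 100])
```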
Quantization control: As in H.264/MPEG-4 AVC, uniform reconstruction
quantization (URQ) is used in HEVC, with quantization scaling matrices
supported for the various transform block sizes.
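Uniform reconstruction quantization can be sketched as below. The sketch assumes the usual relation between quantization parameter and step size in H.264/AVC and HEVC, in which the step size doubles for every increase of 6 in QP and equals 1 at QP 4. Scaling matrices and the integer arithmetic of a real codec are omitted, and the rounding offset is an encoder choice.

```python
def qstep_from_qp(qp):
    """Step size doubles every 6 QP units; QP 4 maps to a step of 1."""
    return 2.0 ** ((qp - 4) / 6.0)

def quantize(coeff, qp, rounding=0.5):
    """Map a coefficient to an integer level; `rounding` is an encoder choice."""
    step = qstep_from_qp(qp)
    sign = -1 if coeff < 0 else 1
    return sign * int(abs(coeff) / step + rounding)

def dequantize(level, qp):
    """Uniform reconstruction: levels map to equally spaced multiples of the step."""
    return level * qstep_from_qp(qp)
```

For instance, at QP 10 the step size is 2, so a coefficient of 37 is coded as level 19 and reconstructed as 38.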
Entropy coding: Context adaptive binary arithmetic coding (CABAC) is
used for entropy coding. This is similar to the CABAC scheme in
H.264/MPEG-4 AVC [3], but has undergone several improvements to
improve its throughput speed (especially for parallel-processing
architectures) and its compression performance, and also to reduce its
context memory requirements.
In-loop deblocking filtering (DF): A deblocking filter (DF) similar to the
one used in H.264/MPEG-4 AVC is operated in the inter-picture
prediction loop. However, the design is simplified in regard to its
decision-making and filtering processes, and is made friendlier to parallel
processing.
Sample adaptive offset (SAO): A non-linear amplitude mapping is
introduced in the inter-picture prediction loop after the deblocking filter.
The goal is to better reconstruct the original signal amplitudes by using a
look-up table that is described by a few additional parameters that can be
determined by histogram analysis at the encoder side.
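One form of SAO, the band offset, gives a concrete picture of the look-up table mentioned above. The sketch assumes that the sample range is divided into 32 equal bands and that offsets are signalled for four consecutive bands beginning at a start band; the function and parameter names are hypothetical.

```python
def sao_band_offset(samples, start_band, offsets, bit_depth=8):
    """Add offsets[k] to samples that fall in band start_band + k."""
    shift = bit_depth - 5                    # 2**bit_depth / 32 values per band
    max_val = (1 << bit_depth) - 1
    lut = [0] * 32                           # per-band look-up table
    for k, off in enumerate(offsets):
        lut[(start_band + k) % 32] = off
    return [min(max(s + lut[s >> shift], 0), max_val) for s in samples]

corrected = sao_band_offset([10, 16, 23, 24, 47, 48, 255], 2, [3, -2, 1, 0])
```

Only samples whose amplitude falls inside the signalled bands are modified, which is what makes the mapping non-linear.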
2.1.1.2 High level syntax architecture
A number of design aspects new to the HEVC standard improve flexibility
for operation over a variety of applications and network environments and also
improve robustness to data losses. However, the high-level syntax architecture used in
the H.264/MPEG-4 AVC [3] standard has generally been retained, including the following
features [1] [2]:
Parameter set structure: Parameter sets contain information that can
be shared for the decoding of several regions of the decoded video. The
parameter set structure provides a robust mechanism for conveying data
that are essential to the decoding process. The concepts of sequence
and picture parameter sets from H.264/MPEG-4 AVC [3] are augmented
by a new video parameter set (VPS) structure.
NAL unit syntax structure: Each syntax structure is placed into a
logical data packet called a network abstraction layer (NAL) unit.
Depending on the content of a two-byte NAL unit header, it is possible to
readily identify the purpose of the associated payload data.
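Reading the two-byte NAL unit header can be sketched as follows. The field layout (1, 6, 6 and 3 bits) and the field names follow the HEVC specification as recalled here and should be checked against the standard; the function name is hypothetical.

```python
def parse_nal_header(data):
    """Split the first two bytes of a NAL unit into header fields."""
    if len(data) < 2:
        raise ValueError("NAL unit header needs two bytes")
    b0, b1 = data[0], data[1]
    return {
        "forbidden_zero_bit": b0 >> 7,
        "nal_unit_type": (b0 >> 1) & 0x3F,
        "nuh_layer_id": ((b0 & 0x01) << 5) | (b1 >> 3),
        "nuh_temporal_id_plus1": b1 & 0x07,
    }

header = parse_nal_header(bytes([0x40, 0x01]))
```

The example bytes decode to NAL unit type 32, which the HEVC specification assigns to the video parameter set.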
Slices: A slice is a data structure that can be decoded independently
from other slices of the same picture, in terms of entropy coding, signal
prediction, and residual signal reconstruction. (This describes ordinary
slices. An alternative form known as dependent slices is discussed
below.) A slice can either be an entire picture or a region of a picture.
One of the main purposes of slices is re-synchronization in the event of
data losses. In the case of packetized transmission, the maximum
number of payload bits within a slice is typically restricted, and the
number of CTUs in the slice is often varied to minimize the packetization
overhead while keeping the size of each packet within this bound.
SEI and VUI metadata: The syntax includes support for various types of
metadata known as supplemental enhancement information (SEI) and video
usability information (VUI). Such data provides information about the
timing of the video pictures, the proper interpretation of the color space
used in the video signal, 3D stereoscopic frame packing information,
other “display hint” information, etc.
2.1.1.3 Parallel decoding syntax and modified slice structuring
Finally, a few new features are introduced in the HEVC standard to enhance
parallel processing capability or to modify the structuring of slice data for packetization
purposes. Each of them may have benefits in particular application contexts, and it is
generally up to the implementer of an encoder or decoder to determine whether and how
to take advantage of these features [1] [2].
Tiles: The option to partition a picture into rectangular regions called tiles
has been specified. The main purpose of tiles is to increase the capability
for parallel processing rather than provide error resilience. Tiles are
independently-decodable regions of a picture that are encoded with
some shared header information. Therefore, they can additionally be
used for the purpose of random access to local regions of video pictures.
A typical tile configuration of a picture consists of segmenting the picture
into rectangular regions with approximately equal numbers of CTUs in
each tile. Tiles provide parallelism at a coarser level (picture/sub-
picture) of granularity, and no sophisticated synchronization of threads is
necessary for their use.
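A typical tile configuration with approximately equal numbers of CTUs per tile can be computed with integer arithmetic. The sketch below uses the spacing rule as recalled from the HEVC uniform spacing option; the function name and the picture dimensions are illustrative assumptions.

```python
def uniform_tile_sizes(num_ctus, num_tiles):
    """Split num_ctus CTU columns (or rows) into num_tiles near-equal runs."""
    return [((i + 1) * num_ctus) // num_tiles - (i * num_ctus) // num_tiles
            for i in range(num_tiles)]

# A 1920x1080 picture with 64x64 CTUs is 30 CTUs wide and 17 CTUs high.
col_widths = uniform_tile_sizes(30, 4)
row_heights = uniform_tile_sizes(17, 2)
```

Each tile in this 4 × 2 grid then holds between 56 and 72 CTUs and can be assigned to its own thread.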
Wavefront parallel processing: When wavefront parallel processing
(WPP) is enabled, a slice is divided into rows of CTUs. The first row is
processed in an ordinary way; the second row can begin to be processed
after only a few decisions have been made in the first row; the third row
can begin to be processed after only a few decisions have been made in
the second row; etc. The context models of the entropy coder in each
row are inferred from those in the preceding row with a small fixed
processing lag. WPP provides a form of processing parallelism at a
rather fine level of granularity, i.e. within a slice. WPP may often provide
better compression performance than tiles (and avoid some visual
artifacts that may be induced by tiles).
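The dependency pattern of WPP can be sketched as a schedule. The example assumes that every CTU takes one time step, that each CTU row has its own thread, and that a row may start once two CTUs of the row above are finished; the lag of two CTUs is an assumption standing in for "only a few decisions".

```python
def wavefront_schedule(rows, cols, lag=2):
    """Earliest step at which each CTU can start, one CTU per step per row.

    CTU (r, c) waits for its left neighbour and for CTU (r - 1, c + lag - 1).
    """
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            ready = 0
            if c > 0:
                ready = start[r][c - 1] + 1
            if r > 0:
                above = min(c + lag - 1, cols - 1)
                ready = max(ready, start[r - 1][above] + 1)
            start[r][c] = ready
    return start

schedule = wavefront_schedule(rows=3, cols=5)
```

With three rows of five CTUs, the last CTU starts at step 8, so the picture needs 9 steps instead of the 15 required by purely sequential processing.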
Dependent slices: A structure called a dependent slice allows data
associated with a particular wavefront entry point or tile to be carried in a
separate NAL unit, and thus potentially makes that data available to a
system for fragmented packetization with lower latency than if it were all
coded together in one slice. A dependent slice for a wavefront entry point
can only be decoded after at least part of the decoding process of
another slice has been performed. Dependent slices are mainly useful in
low-delay encoding, where other parallel tools may penalize compression
performance.
The concepts of slices, tiles and wavefront parallel processing are illustrated in
Figure 2-10 [1].
Figure 2-12 Sub-division of a picture into a) Slices b) Tiles and c) illustration of wavefront
parallel processing [1]
2.1.2 HEVC complexity analysis
Complexity of some key modules such as transforms, intra prediction, and
motion compensation is higher in HEVC than in H.264/AVC [3]. Complexity of modules
such as entropy coding and deblocking is lower in HEVC than in H.264/AVC [3]. The
implementation cost of an HEVC decoder is thus not much higher than that of an
H.264/AVC decoder, even with the addition of an in-loop filter such as SAO [34].
From an encoder perspective, things look different: HEVC features many more
mode combinations as a result of the added flexibility from the quadtree structures and
the increase of intra prediction modes. An encoder fully exploiting the capabilities of
HEVC is thus expected to be several times more complex than an H.264/AVC encoder.
This added complexity does however have a substantial benefit in the expected
significant improvement in rate-distortion performance [34]. Researchers are focusing on
reducing the encoder complexity [39] [40] [63] [64].
2.2 Summary
In this chapter, development of different video compression standards is first
explored followed by a detailed description of the latest HEVC video compression
standard. In chapter 3, 3D video coding in H.264/AVC and 3D video coding in HEVC are
covered.
Chapter 3
3D video compression standards
With the development of 3D video technology, there is a rising demand for better
representation and delivery of 3D video without compromising on the quality of video.
The 3D video coding versions of H.264/AVC [6] and HEVC [1] [2] are emerging as a
solution to these demands.
3.1 3D video coding in H.264/AVC
Multiview Video Coding (MVC) [6] is an amendment to the H.264/MPEG-4
AVC video compression standard [3], developed jointly by MPEG and VCEG, that
enables efficient encoding of sequences captured simultaneously from multiple cameras
using a single video stream.
MVC is intended for encoding stereoscopic (two-view) video, as well as free
viewpoint television and multi-view 3D television. The Stereo High profile of H.264/AVC
was standardized in June 2009; the profile is based on the MVC toolset and is used in
stereoscopic Blu-ray 3D releases [6].
An MVC stream is backward compatible with H.264/AVC [3], which allows older
devices and software to decode stereoscopic video streams, ignoring additional
information for the second view [6].
Multiview video contains a large amount of inter-view statistical dependencies,
since all cameras capture the same scene from different viewpoints. Therefore, combined
temporal and inter-view prediction is the key for efficient MVC encoding. A frame from a
certain camera can be predicted not only from temporally related frames from the same
camera, but also from the frames of neighboring cameras. These interdependencies can
be used for efficient prediction [6].
3.2 3D video coding in HEVC
The multi-view plus depth video format [16] explained in section 1.2.1 is the video
format used for 3D video coding in HEVC. This section explains the codec used for the 3D
extension of HEVC.
3.2.1 Multi view plus depth video
Recent improvements in 3D video technology led to a growing interest in 3D
video. Autostereoscopic displays, which provide a 3D viewing experience without
glasses, are consistently improved and are considered as a promising technology for
future 3D home entertainment [35] [36]. In contrast to common stereo displays,
autostereoscopic displays require not only two, but a multitude of different views for
providing the 3D viewing experience. Since the bit rate required for coding multiview
video with the MVC extension of H.264/AVC [3] increases approximately linearly with the
number of coded views, MVC [6] is not appropriate for delivering 3D content for
autostereoscopic displays. A promising alternative is the transmission of 3D video in the
multiview video plus depth (MVD) format [13] [36].
In the MVD format, typically only a few views are actually coded, but each of
them is associated with coded depth data, which represent the basic geometry of the
captured video scene. Based on the transmitted video pictures and depth maps,
additional views suitable for displaying 3D video content on autostereoscopic displays
can be generated using depth image based rendering (DIBR) [20] techniques at the
receiver side. For the purpose of view synthesis, camera parameters are additionally
included in the bitstream. The bitstream packets include header information, which signals,
in connection with transmitted parameter sets, a view identifier and an indication whether
the packet contains video or depth data. The difference between the simulcast coding
structure with hierarchical B pictures and that of multi-view coding structure with
hierarchical B pictures is shown in Figure 3-1 [36] [37].
Figure 3-13 Simulcast coding structure with hierarchical B pictures for
temporal prediction (black arrows) (L) and Multi-view coding structure with hierarchical B
pictures for both temporal (black arrows) and inter-view prediction (red arrows) (R) [36]
3.2.2 Transmission of 3D video
The basic concept of the system and data format is illustrated in Figure 3-2. In
general the input signal for the encoder consists of multiple views, associated depth
maps, and corresponding camera parameters. However, as described above, the codec
can also be operated without depth data. The input component signals are coded using a
3D video encoder, which represents an extension of HEVC. The base view is coded
using an unmodified HEVC encoder. The 3D video encoder generates a bitstream, which
represents the input videos and depth data in a coded format. If the bitstream is decoded
using a 3D video decoder, the input videos, the associated depth data, and camera
parameters are reconstructed with the given fidelity. For displaying the 3D video on an
autostereoscopic display, additional intermediate views are generated by a DIBR
algorithm using the reconstructed views and depth data. If the 3D video decoder is
connected to a conventional stereo display instead of to an autostereoscopic display, the
view synthesizer can also generate a pair of stereo views, in case such a pair is not
actually present in the bitstream. It is possible to adjust the rendered stereo views to the
stereo geometry of the viewing conditions. One of the decoded views or an intermediate
view at an arbitrary virtual camera position can also be used for displaying a single view
on a conventional 2D display. The 3D video bitstream is constructed in a way that the
sub-bitstream representing the coded representation of the base view can be extracted
by simple means. The bitstream packets representing the base view can be identified by
inspecting transmitted parameter sets and the packet headers. The sub-bitstream for the
base view can be extracted by discarding all packets that contain depth data or data for
the dependent views. Then, the extracted sub-bitstream can be directly decoded with an
unmodified HEVC decoder and displayed on a conventional 2D video display. [35] [37]
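The extraction of the base-view sub-bitstream amounts to filtering packets. The sketch below assumes that the view identifier and the depth indication have already been recovered from the packet headers and parameter sets; the `Packet` type and its field names are hypothetical.

```python
from collections import namedtuple

# view_id 0 denotes the base view in this sketch; is_depth marks depth payloads.
Packet = namedtuple("Packet", ["view_id", "is_depth", "payload"])

def extract_base_view(packets):
    """Keep only texture packets of the base view, preserving bitstream order."""
    return [p for p in packets if p.view_id == 0 and not p.is_depth]

stream = [
    Packet(0, False, b"T0"), Packet(0, True, b"D0"),
    Packet(1, False, b"T1"), Packet(1, True, b"D1"),
    Packet(0, False, b"T0b"),
]
base = extract_base_view(stream)
```

Discarding packets never requires re-encoding, which is why the extraction can be done "by simple means".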
The encoder can also be configured in a way that the sub-bitstream containing
only two stereo views can be extracted and directly decoded using a stereo decoder. The
encoder can also be configured in a way that the views can be generally decoded
independently of the depth data. It is also possible to synthesize intermediate views using
only the stereo sequences as input of the view synthesis [35] [37].
Figure 3-14 Overview of the system structure and data format for the transmission
of 3D video [37]
3.2.3 Coding algorithm
The coding algorithm based on the MVD format, in which each video picture is
associated with a depth map, is described. The coding algorithm can also be used for a
multiview format without depth maps. The video pictures and, when present, the depth
maps are coded access unit by access unit, as illustrated in Figure 3-3. An access unit
includes all video pictures and depth maps that correspond to the same time instant.
Non-VCL NAL units containing camera parameters may be additionally associated with
an access unit [37].
The video pictures and depth maps corresponding to a particular camera position
are indicated by a view identifier (viewId). All video pictures and depth maps that belong
to the same camera position are associated with the same value of viewId. The view
identifiers are used for specifying the coding order inside the access units and detecting
missing views in error-prone environments. Inside an access unit, the video picture and,
when present, the associated depth map with viewId equal to 0 are coded first, followed
by the video picture and depth map with viewId equal to 1, etc. A video picture and depth
map with a particular value of viewId are transmitted after all video pictures and depth
maps with smaller values of viewId. [35] [37]
Figure 3-15 Access unit structure and coding order of view components [37]
For the independent view, the video picture is always coded before the
associated depth map. For dependent views, the video picture may be coded before or
after the associated depth map (i.e., the depth map with the same value of viewId). It
should be noted that the value of viewId does not necessarily represent the arrangement
of the cameras in the camera array. For ordering the reconstructed video pictures and
depth maps after decoding, each value of viewId is associated with another identifier
called view order index (VOI). The view order index is a signed integer value, which
specifies the ordering of the coded views from left to right. If a view A has a smaller value
of VOI than a view B, the camera for view A is located to the left of the camera of view B. In
addition, camera parameters required for converting depth values into disparity vectors
are included in the bitstream [35] [37].
3.2.4 Basic structure of 3D video codec
The basic structure of the 3D video codec is shown in Figure 3-4. In principle,
each component signal is coded using an HEVC-based codec. The resulting bitstream
packets, or more accurately, the resulting Network Abstraction Layer (NAL) units, are
multiplexed to form the 3D video bitstream. The base or independent view is coded using
an unmodified HEVC codec. Given the 3D video bitstream, the NAL units containing data
for the base layer can be identified by parsing the parameter sets and NAL unit header of
coded slice NAL units (up to the picture parameter set identifier). Based on these data,
the sub-bitstream for the base view can be extracted and directly decoded using a
conventional HEVC decoder [4] [35] [37].
Figure 3-16 Basic codec structure with inter-component prediction (red-
arrows) [35] [37]
For coding the dependent views and the depth data, modified HEVC codecs are
used, which are extended by including additional coding tools and inter-component
prediction techniques that employ already coded data inside the same access unit as
indicated by the red arrows in Figure 3-4. For enabling an optional discarding of depth
data from the bitstream, e.g., for supporting the decoding of a stereo video suitable for
conventional stereo displays, the inter-component prediction can be configured in a way
that video pictures can be decoded independently of the depth data.
3.2.5 MVD codec vs HEVC standard codec
In this section, those aspects of the MVD codec that differ from the standard
HEVC codec are discussed.
3.2.5.1 Coding of dependent views
Additional tools have been integrated into the HEVC codec, which employ
already coded data in other views for efficiently representing a dependent view. These
tools include –
Disparity-compensated prediction
As a first coding tool for the dependent views, the well-known concept of
disparity-compensated prediction (DCP), which is also used in MVC, has been added as
an alternative to motion-compensated prediction (MCP). Here, MCP refers to an inter-
picture prediction that uses already coded pictures of the same view, while DCP refers to
an inter-picture prediction that uses already coded pictures of other views in the same
access unit, as illustrated in Figure 3-5 [37].
View synthesis based inter-view prediction
Based on all already coded views, a new virtual view is synthesized at the
position of the current view. Some regions of the newly synthesized image are not available
because they were occluded in previously coded views. Those disoccluded regions are
identified and marked on a binary map, called the availability map, which controls the coding
and decoding process. The coder and decoder both use this map to determine
whether a given CU is coded or not. Because in a typical case most of the scene is the
same in all views, only small parts are disoccluded in subsequently coded views, and
thus only a small number of CUs need to be coded [37].
Figure 3-17 Disparity-compensated prediction as an alternative to motion-
compensated prediction [37]
Post processing in-loop filtering
A final step of view-synthesis prediction is the reduction of artifacts in the synthesized
view. This post-processing consists of a Depth-Gradient-based Loopback Filter (DGLF)
and an Availability Deblocking Loopback Filter (ADLF) [37].
The first one (DGLF) reduces texture artifacts introduced by the DIBR [20] technique
in areas of sudden depth changes. To cope with these, the synthesized image is
adaptively filtered with respect to the depth gradient strength. Large depth edges impose
strong low-pass filtering of the synthesized texture, while flat depth regions are not
filtered at all [37].
The latter (ADLF) reduces artifacts that are generated as a result of block-based
(CU-based) coding. The shape of a coded region does not necessarily match the shape of the
binary availability map. This discrepancy is a source of artificial edges between those
regions. The ADLF provides a smooth transition between coded and synthesized regions by
interpolating between them [37].
Inter-view motion prediction
The basic concept of the inter-view prediction of motion parameters is illustrated
in Figure 3-6. For the following overview, it is assumed that an estimate of a pixel-wise
depth map for the current picture is given. Below, it is described how such an estimate
can be derived. For deriving candidate motion parameters for a current block in a
dependent view, a sample location x in the middle of the block is selected and the
associated depth value d is converted to a disparity vector. By adding the disparity vector
to the sample location x a reference sample location xR is obtained. The prediction block
in the already coded picture in the reference view that covers the sample location xR is
used as the reference block. If this reference block is coded using MCP, the associated
motion parameters can be used as candidate motion parameters for the current block in
the current view. The derived disparity vector can also be directly used as a candidate
disparity vector for DCP [37].
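The conversion of a depth value into a disparity vector and a reference sample location can be sketched as follows. The sketch assumes 8-bit depth samples quantized uniformly in inverse depth between znear and zfar, and parallel (rectified) cameras so that the disparity is purely horizontal. The function names, the camera parameters and the sign of the shift are assumptions of the example; the actual codec derives the conversion from the transmitted camera parameters.

```python
def depth_value_to_disparity(d, z_near, z_far, focal_length, baseline, bit_depth=8):
    """Convert a quantized depth sample to a horizontal disparity in samples.

    Assumes inverse-depth quantization and parallel (rectified) cameras.
    """
    max_d = (1 << bit_depth) - 1
    inv_z = (d / max_d) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return focal_length * baseline * inv_z

def reference_location(x, y, disparity):
    """Shift the block-centre sample horizontally into the reference view."""
    return (x + round(disparity), y)

near = depth_value_to_disparity(255, 2.0, 100.0, focal_length=1000.0, baseline=0.05)
far = depth_value_to_disparity(0, 2.0, 100.0, focal_length=1000.0, baseline=0.05)
x_ref = reference_location(100, 40, near)
```

The nearest representable depth gives the largest disparity, so foreground blocks are displaced furthest between views.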
Depth-based motion parameter prediction
Depth-Based Motion Prediction (DBMP) is a new coding tool for multiview video
coding which originates from the idea that the motion fields of neighboring views in a
multiview sequence are highly correlated. DBMP provides an efficient representation of
motion data in multiview video bitstreams that also carry depth/disparity maps. The motion
information, such as motion vectors and reference indices, for each pixel of the encoded
coding unit (CU) is directly inferred, using already coded disparity maps, from
encoded CUs in the neighboring views at the same time instant (Figure 3-7). This
procedure is repeated independently for every pixel of the encoded CU. Consequently,
motion vectors and reference indices for the CU are not transmitted in the bitstream but are
obtained from the reference view at the receiving side.
Figure 3-18 Basic principle of deriving motion parameters for a block in a current picture
based on motion parameters in an already coded reference view and an estimate of the
depth map for the current picture [37].
Figure 3-19 Independent derivation of motion information for each point of encoded CU
from corresponding point in reference view [37].
Inter-view residual prediction
The basic principle of the inter-view residual prediction is illustrated in Figure 3-8.
As with the inter-view motion prediction, the inter-view residual prediction is
based on a depth map estimate for the current picture. The same depth map estimate as
for the inter-view motion prediction is used. Based on the depth map estimate, a disparity
vector is determined for a current block and the residual block in the reference view that
is referenced by the disparity vector is used for predicting the residual of the current block
[37].
Figure 3-20 Basic concept for the inter-view residual prediction [37]
Adjustment of QP of texture based on depth data
In order to improve perceptual quality of coded texture, a tool for bit assignment
in the texture layer was developed. The basic idea is to increase texture quality of objects
in the foreground and to increase compression factor (decrease texture quality) for
objects in the background. The quality is adjusted in coding units (CUs) using a
quantization parameter QP that depends on the corresponding depth values. The QP
adjustment is done in both the coder and the decoder so that no additional information
is sent. The described tool is disabled in the base view to preserve HEVC compatibility. The
texture QP is modified in the following way:
where QP' is the adjusted QP value for the corresponding disparity dx,y (8-bit depth
maps are considered).
3.2.5.2 Coding of depth maps
For the coding of depth maps, basically the same concepts of intra-prediction,
motion-compensated prediction, disparity-compensated prediction, and transform coding
as for the coding of the video pictures are used. However, some tools have been
modified for depth maps, other tools have been generally disabled, and additional tools
have been added.
As a first difference to the coding of video pictures, the inter-view motion and
residual prediction as described in sec. 3.2.5.1 are not used for depth coding. Instead,
motion parameters are derived based on coded data in the associated video pictures.
The other differences are described in this section.
Disabled chrominance coding
Depth maps may be coded in 4:0:0 chroma sampling format.
Non-linear depth representation
As an alternative representation of depth maps, the depth may be non-linearly
scaled as described in the following. The human perception of depth depends on
the absolute distance of viewed objects; therefore, the internal depth representation is
non-linear. Closer objects are represented more accurately than distant ones. As a result,
the subjective quality of synthesized views is improved.
Z-near z-far compensated weighted prediction
Proposed znear-zfar compensation (ZZC) is a new coding tool for multiview
video, designed especially for inter-frame depth map coding.
The concept of ZZC exploits the observation that frames from different views and
time instances of encoded depth sequence may have different znear and zfar
parameters. The mentioned znear and zfar parameters describe range of depths
represented in a gray-scale depth map. If znear and zfar parameters are different for two
frames, then given depth value is represented with different gray-scale values in those
depth maps. Consequently, using one of such depth maps as a reference for the other
one will result in a poor prediction.
To overcome this problem, a new ZZC coding tool is proposed. Prior to any inter-
frame depth map prediction, each depth map that resides in the codec reference picture
list is scaled, so that gray-scale depth values in scaled image and currently coded image
refer to the same depth. As a result, depth maps with compensated znear and zfar range
are used for prediction. Values used for prediction (instead of the original ones)
are calculated as follows:
where $L_T$ is the compensated disparity in the depth range $znear_T$ to $zfar_T$ and $L_S$ is the
original disparity in the depth range $znear_S$ to $zfar_S$. 8-bit images are considered.
Modified motion compensation and motion vector coding
In contrast to natural video, depth maps are characterized by sharp edges and
large regions with nearly constant values. The eight-tap interpolation filters that are used
for motion-compensated interpolation in HEVC, can produce ringing artifacts at sharp
edges in depth maps, which are visible as disturbing components in synthesized
intermediate views. For avoiding this issue and for decreasing the encoder and decoder
complexity, the motion-compensated prediction (MCP) as well as the disparity-
compensated prediction (DCP) has been modified in a way that no interpolation is used.
That means, for depth maps, the inter-picture prediction is always performed with full-
sample accuracy. For the actual MCP or DCP, a block of samples in the reference picture
is directly used as prediction signal without interpolating any intermediate samples. In
order to avoid the transmission of motion and disparity vectors with an unnecessary
accuracy, full-sample accurate motion and disparity vectors are used for coding depth
maps. The transmitted motion vector differences are coded using full-sample instead of
quarter-sample precision.
Disabling of in-loop filtering
The in-loop filters in the HEVC design have been particularly designed
for the coding of natural video. For the coding of depth maps, these filters are
less useful. In order to decrease the encoder and decoder complexity, the in-
loop filters have been disabled for depth coding. This includes the following
filters:
the de-blocking filter;
the sample adaptive offset (SAO) filter.
Depth modelling modes
Depth maps are mainly characterized by sharp edges (which represent object
borders) and large areas of nearly constant or slowly varying sample values (which
represent object areas). While the HEVC intra prediction and transform coding are well-
suited for nearly constant regions, it can result in significant coding artifacts at sharp
edges, which are visible in synthesized intermediate views. For a better representation of
edges in depth maps, four new intra prediction modes for depth coding are added.
Four depth-modeling modes, which mainly differ in the way the partitioning is
derived and transmitted, have been added:
Mode 1: Explicit wedgelet signaling;
Mode 2: Intra-predicted wedgelet partitioning;
Mode 3: Inter-component-predicted wedgelet partitioning;
Mode 4: Inter-component-predicted contour partitioning.
Mode 1: The basic principle of this mode is to find the best matching wedgelet partition at
the encoder and transmit the partition information in the bitstream. At the decoder the
signal of the block is reconstructed using the transmitted partition information.
Mode 2: The basic principle of this mode is to predict the wedgelet partition from data of
previously coded blocks in the same picture, i.e. by intra-picture prediction. For a better
approximation, the predicted partition is refined by varying the line end position. Only the
offset to the line end position is transmitted in the bitstream and at the decoder the signal
of the block is reconstructed using the partition information that results from combining
the predicted partition and the transmitted offset.
Mode 3: The basic principle of this mode is to predict the wedgelet partition from a
texture reference block, namely the co-located block of the associated video picture. This
type of prediction is referred to as inter-component prediction. Unlike temporal or inter-
view prediction, no motion or disparity compensation is used, as the texture reference
picture shows the scene at the same time and from the same perspective. The wedgelet
partition information is not transmitted for this mode and consequently, the inter-
component prediction uses the reconstructed video picture as a reference. For efficient
processing, only the luminance signal of the reference block is taken into account, as this
typically contains the most significant information for predicting the partition of a depth
block, i.e. the edges between objects.
Mode 4: The basic principle of this mode is to predict a contour partition from a texture
reference block by inter-component prediction. Like for the inter-component prediction of
a wedgelet partition pattern, the reconstructed luminance signal of the co-located block of
the associated video picture is used as a reference. In contrast to wedgelet partitions, the
prediction of a contour partition is realized by a thresholding method. Here, the mean
value of the texture reference block is set as the threshold and, depending on whether the
value of a sample is above or below the threshold, the sample position is marked as part of
region P1 or P2 in the resulting contour partition pattern.
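The thresholding step of Mode 4 can be illustrated with a minimal sketch. It is not taken from the reference software; the function name `contour_partition`, the convention that samples above the mean belong to P1 (coded as 1), and the treatment of samples equal to the mean are assumptions made for illustration.

```python
def contour_partition(texture_block):
    """Binary contour pattern from a co-located luma block.

    Assumption: samples strictly above the block mean are marked 1 (region P1),
    all other samples are marked 0 (region P2).
    """
    samples = [v for row in texture_block for v in row]
    threshold = sum(samples) / len(samples)  # mean value of the reference block
    return [[1 if v > threshold else 0 for v in row] for row in texture_block]


# Hypothetical 4x4 reconstructed luma block with an object edge
block = [
    [20, 22, 200, 210],
    [21, 23, 205, 208],
    [19, 25, 198, 207],
    [18, 24, 30, 206],
]
pattern = contour_partition(block)
```

Because the pattern follows the texture values sample by sample, the resulting region boundary is not restricted to a straight line, which is the difference from the wedgelet modes.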
3.3 Summary
In this chapter, the 3D video codecs are explained in some detail. The two video
codecs that are covered are 3D video coding in H.264/AVC (multi-view coding) and 3D
video coding in HEVC (multi-view plus depth coding). In chapter 4, the analysis of
compression artifacts in depth maps and how they affect the rendered views are discussed.
Chapter 4
Analysis of compression artifacts in depth maps
Depth maps were briefly explained in section 1.2.1. As explained, depth maps
can be represented as a grayscale image sequence for storage and transmission
purposes and can be compressed with any existing video codecs. However, existing
video codecs are optimized to encode image/video sequences that are finally viewed by
the end users. Depth maps on the other hand, are not viewed by the end-users, but are
used as an aid for view rendering. Therefore, when the existing video codecs are used to
compress depth maps, the compression artifacts on depth maps cause distortions in
rendered views. Two types of solution can be identified to solve this problem. The first
solution is to come up with novel compression techniques and introduce depth-map
compression specific features to the codec. The HEVC-3D codec [37] described in
section 3.2 is one such codec. As already observed in section 3.2, it is very clear that this
kind of solution increases the complexity of the codec to a very large extent as it deals
with components that are specific to compression of depth maps. The second type of
solution is to encode depth-maps with existing video codecs and to post-process the
decoded depth-maps with image denoising techniques [42]. The advantage of this type of
solution is that the existing codecs need not be modified to specifically suit the
compression of depth-maps, but image denoising techniques can be used on the
decoded depth maps to minimize undesirable compression artifacts. The proposed post-
processing algorithm belongs to the second category.
As the proposed post-processing algorithm minimizes the effects of compression
artifacts upon virtual view generation process, it is extremely important to understand the
view generation process as well as the effects of compression artifacts upon view
generation. Although multi-view rendering is possible, the scope of this thesis is limited
to rendering of stereoscopic views with a monoscopic color image and per-pixel depth map
(Figure 1-2). After introducing the view generation process, a theoretical derivation of
thresholds for the maximum possible distortions in the depth map that do not cause perceptual
rendering distortion is presented.
4.1 Virtual stereoscopic view generation process
A monoscopic color image and per-pixel depth map can be used to generate
virtual stereoscopic views. The virtual view generation process is shown in Figure 4-1.
Figure 4-1 Virtual view generation in Depth Image Based Rendering (DIBR) process [43]
In this process, the original image points at locations (x, y) are transferred to new
locations ($x_L$, y) and ($x_R$, y) for the left and right view respectively.
This process is defined with:
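$$x_R = x + \frac{p_{pix}}{2} \tag{1}$$
$$x_L = x - \frac{p_{pix}}{2} \tag{2}$$
$$p_{pix} = -x_B \left(\frac{N_{pix}}{D}\right) \left\{ \left(\frac{m}{255}\right)(k_{near} + k_{far}) - k_{far} \right\} \tag{3}$$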
where
$p_{pix}$ – pixel parallax
$x_B$ – distance between the left and right virtual cameras or eye separation (assumed to be 6 cm)
$D$ – viewing distance (assumed to be 250 cm)
$m$ – depth value of each pixel in the reference view
$k_{near}$ and $k_{far}$ – range of the depth information respectively behind and in front of the
picture, relative to the screen width
$N_{pix}$ – screen width measured in pixels
8-bit images are considered.
Virtual cameras are selected such that the epipolar lines are horizontal, and thus
the y component is constant. Equation (3) is in accordance with the MPEG informative
recommendation [44]. The dis-occluded regions (visual holes) are filled by a background
pixel extrapolation technique [45]. Any noise that corrupts the depth map modifies the
luminance values of its pixels, i.e. m in Eq. (3). This results in warping errors and
thus causes distortions in the image rendered with the noisy depth map. A pixel-wise
distortion model that quantifies errors on the rendered views is given in [46].
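Eqs. (1) to (3) can be evaluated directly, as in the following minimal sketch. The screen width of 1024 pixels and the values of `k_near` and `k_far` are hypothetical and chosen only for illustration; the eye separation and viewing distance are the values assumed above. Hole filling and rounding to integer pixel positions are not covered.

```python
def pixel_parallax(m, x_b, d, n_pix, k_near, k_far):
    """Eq. (3): pixel parallax for an 8-bit depth value m."""
    return -x_b * (n_pix / d) * ((m / 255.0) * (k_near + k_far) - k_far)


def warp_x(x, m, **params):
    """Eqs. (1) and (2): horizontal position of a pixel in the left and right views."""
    p = pixel_parallax(m, **params)
    return x - p / 2.0, x + p / 2.0  # (x_L, x_R)


# x_b and d as assumed in the text; n_pix, k_near and k_far are hypothetical
params = dict(x_b=6.0, d=250.0, n_pix=1024, k_near=0.05, k_far=0.10)
x_left, x_right = warp_x(100, 255, **params)
```

With these hypothetical parameters the depth value m = 170 gives zero parallax, so pixels at that depth stay at the same column in both views.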
The quality of the rendered virtual views can be determined by calculating the
PSNR between view rendered with uncompressed color image and depth map and the
view rendered with the compressed/corrupted color image and depth map [44]. Another
metric that can be used to measure the quality of rendered views is the Structural
Similarity Index Metric (SSIM) [48].
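As an illustration of the first metric, a minimal PSNR sketch for 8-bit images is given below. It assumes that both rendered views are available as equally sized lists of rows of luminance values; the function name `psnr` is chosen here for illustration. SSIM is not reproduced because it requires local window statistics.

```python
import math


def psnr(reference, test, peak=255.0):
    """PSNR in dB between two equally sized 8-bit images (lists of rows)."""
    squared_errors = [
        (a - b) ** 2
        for row_ref, row_test in zip(reference, test)
        for a, b in zip(row_ref, row_test)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

Here `reference` would be the view rendered with the uncompressed color image and depth map, and `test` the view rendered with the compressed ones.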
4.2 Analysis of compression artifacts in depth maps on view rendering
Due to bandwidth constraints, it is a common practice to compress depth maps
before storage or transmission in bandwidth limited channels. Traditional block based
video codecs, such as HEVC [1] [2] and H.264/AVC [3], are based on motion estimation,
transform coding, quantization and entropy coding. During quantization process, high
spatial frequencies in individual images are eliminated. This is done mainly due to the
fact that the human visual system is more sensitive to the low spatial frequencies in
images.
When traditional video codecs are used to compress depth maps, which are not
viewed by humans but used to aid the view rendering process, the compression artifacts
will have adverse consequences upon the quality of the rendered views. It is highly
important to preserve the sharp depth discontinuities present in depth maps for high
quality virtual view generation. In this subsection we analyze the effects of compression
artifacts upon the view generation process.
Eq. (3) provides the relationship between the value of the depth pixel (m)
and the corresponding pixel parallax ($p_{pix}$). If the value of the depth pixel
changes by ∆m, the pixel parallax changes by a corresponding ∆p.
8-bit images are considered.
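$$p_{pix} + \Delta p = -x_B \left(\frac{N_{pix}}{D}\right) \left\{ \left(\frac{m + \Delta m}{255}\right)(k_{near} + k_{far}) - k_{far} \right\} \tag{4}$$
From Eq. (3) and Eq. (4), it can be deduced that the magnitude of the parallax change is
$$|\Delta p| = x_B \left(\frac{N_{pix}}{D}\right) \left\{ \left(\frac{|\Delta m|}{255}\right)(k_{near} + k_{far}) \right\} \tag{5}$$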
In terms of the rendering algorithm used, the change of depth pixel value (∆m)
will not have any significance unless it is large enough to cause a position change of at
least 1 pixel. According to Eqs. (1) and (2), each view shifts a pixel by half the parallax, so
the maximum value that ∆p in Eq. (5) can take is 2 pixels. Using this information, the
maximum change (∆mmax) of the value of a depth pixel can be derived as follows,
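$$2 = x_B \left(\frac{N_{pix}}{D}\right) \left\{ \left(\frac{\Delta m_{max}}{255}\right)(k_{near} + k_{far}) \right\} \tag{6}$$
$$\Delta m_{max} = \frac{2 \cdot D \cdot 255}{x_B \cdot N_{pix} \cdot (k_{near} + k_{far})} \tag{7}$$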
The ∆mmax in Eq. (7) provides a theoretical threshold, which indicates the
maximum change a depth pixel value could undergo without causing a rendering error. It
should be noted that the above derivation does not take into account rounding errors,
in which case a minimum parallax of 0.2 could bring about a change in the warped pixel
position. However, the above derivation is valid when the rendering quality does not
consider positional errors of one pixel [47]. The derived threshold is the basis for
calculating most of the parameters in designing the filters used in the proposed work.
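The threshold of Eq. (7) is a closed-form expression, so it can be computed once per display configuration. The sketch below is illustrative only; the screen width and the values of `k_near` and `k_far` are hypothetical, and rounding effects are ignored as in the derivation above.

```python
def delta_m_max(x_b, d, n_pix, k_near, k_far):
    """Eq. (7): largest depth-value change whose parallax change stays within 2 pixels."""
    return 2.0 * d * 255.0 / (x_b * n_pix * (k_near + k_far))


def parallax_change(delta_m, x_b, d, n_pix, k_near, k_far):
    """Eq. (5): magnitude of the parallax change caused by a depth change delta_m."""
    return x_b * (n_pix / d) * (abs(delta_m) / 255.0) * (k_near + k_far)


# x_b and d as assumed in section 4.1; the remaining values are hypothetical
cfg = dict(x_b=6.0, d=250.0, n_pix=1024, k_near=0.05, k_far=0.10)
threshold = delta_m_max(**cfg)
```

Substituting the returned threshold back into Eq. (5) reproduces the 2-pixel parallax change of Eq. (6), which is a quick consistency check on the derivation.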
4.2.1 Frequency domain analysis of the artifacts
In this sub-section, frequency domain analysis of the artifacts introduced during
the compression of depth maps is presented. For analysis purposes, the 2D discrete Fourier
transform (DFT) F(u, v) of a digital image f(x, y) of size M x N is defined by Eq. (8),
$$F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) \, e^{-j 2\pi \left(\frac{ux}{M} + \frac{vy}{N}\right)} \tag{8}$$
where u = 0, 1, …, M-1 and v = 0, 1, …, N-1.
The power spectrum (PS) P(u, v) of a considered image segment f(x, y) is
obtained using the DFT as,
$$P(u, v) = |F(u, v)|^2 \tag{9}$$
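Eqs. (8) and (9) can be evaluated literally for a small image segment, as sketched below. This direct summation is meant only to make the definitions concrete; a fast Fourier transform would be used in practice. The function name `power_spectrum` is chosen here for illustration.

```python
import cmath


def power_spectrum(f):
    """Eqs. (8) and (9): P(u, v) = |F(u, v)|^2 for an M x N segment f[x][y]."""
    m, n = len(f), len(f[0])
    spectrum = []
    for u in range(m):
        row = []
        for v in range(n):
            total = 0j
            for x in range(m):
                for y in range(n):
                    phase = -2j * cmath.pi * (u * x / m + v * y / n)
                    total += f[x][y] * cmath.exp(phase)
            row.append(abs(total) ** 2)
        spectrum.append(row)
    return spectrum


# A perfectly homogeneous 4x4 depth segment: all energy sits at (u, v) = (0, 0)
flat = [[10] * 4 for _ in range(4)]
ps_flat = power_spectrum(flat)
```

For the homogeneous segment only the zero-frequency term is non-zero, which matches the observation that removing small depth variations lowers the energy at high spatial frequencies.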
Figure 4-2 illustrates the effect of compression artifacts in areas with homogenous
depth. During compression, small depth variations, which represent high spatial
frequencies, are removed. This fact is illustrated by the power spectrum difference
between Figure 4-2(e) and 4-2(g), where the energy in 4-2(g) is much lower than in
4-2(e). The periodic nature of the power spectrum in Figure 4-2(g) is due to the blocking
artifacts present in Figure 4-2(c). The effect of this upon rendered views is illustrated in
Figure 4-2(b) and 4-2(d). The corresponding power spectra, illustrated respectively in
Figure 4-2(f) and 4-2(h), do not show significant change. Figure 4-3(a-d) illustrates
the effect of compression artifacts in an area with a sharp depth discontinuity. The
corresponding power spectrum of the image rendered with the compressed depth map in
Figure 4-3(h) illustrates increased energy concentration in high spatial frequencies, as
compared to the power spectrum of the image rendered with the uncompressed depth
map in Figure 4-3(f). This is mainly due to the uneven blurring at the depth discontinuity.
For the purpose of the following analysis we define the following terms,
Depth Noise: difference in the pixel values between the original depth map and
the compressed depth map.
Rendering Noise: difference in pixel values between the view rendered with the
original depth map and the view rendered with the compressed depth map.
Perceived Noise: Perceivable difference between the view rendered with the
original depth map and the view rendered with the compressed depth map. This measure
will neglect tiny position errors due to warping and pixel value differences below a certain
threshold. An estimate of perceivable difference can be obtained using SSIM [48].
Figure 4-2 Effect of compression noise in areas of homogenous depth [43]
Figure 4-3 Effect of compression noise in areas of sharp depth discontinuities [43]
4.3 Design requirements of a post-processing filter to minimize the effects of
compression artifacts
Based on the analysis of compression artifacts upon view rendering in section
4.2, the compression artifacts appear as non-uniform noise in the depth maps. The non-
uniform noise (uneven noise) due to compression artifacts could be approximated as a
zero-mean normal distribution. Therefore, a candidate post-processing filter should be
able to minimize the spread of the noise. However, when smoothing, it is important not to
exceed the threshold derived in eq. (7). If the depth map is changed by more than the
threshold derived in eq. (7), it will cause perceivable rendering artifacts. Furthermore, the
above analysis illustrated that the artifacts along the depth discontinuities cause
significant distortion in terms of the quality of rendered views. Compression artifacts that
are present in smooth (homogenous) depth areas, generally, cause less distortion.
Considering the design requirements outlined above, a depth map post-
processing framework to minimize the effect of compression artifacts upon view
rendering is proposed in [43]. The proposed depth map post processing framework is
designed based on the principles of bilateral filtering [49]. In chapter 5, the post-
processing frame-work in [43] along with the required background theory will be
described.
4.4 Summary
In this chapter, a detailed analysis of compression artifacts is presented. The virtual
stereoscopic view generation process is explained, and how the presence of compression
artifacts in the depth map affects the rendered views is discussed. Finally, design
requirements for a post-processing frame-work to reduce compression artifacts based on
artifact analysis are presented. In chapter 5, the scope, necessary background and the
algorithm for the thesis are considered.
Chapter 5
Thesis – Scope, Background and Working algorithm
This chapter discusses the scope of the proposed work followed by background
knowledge required to understand the working algorithm and finally the working algorithm
itself.
5.1 Scope
While a lot of research has been done to study the effect of using post-
processing image denoising techniques on H.264/AVC decoded depth maps, there has
not been any research done to study the effects of applying image denoising techniques
(post-processing) to HEVC decoded images/video. This thesis applies image denoising
techniques on HEVC decoded depth maps as a post-processing technique. Quality of
rendered images with and without post-processing is compared using PSNR and SSIM
[48]. Specifically, a post-processing framework that is based on analysis of compression
artifacts upon generation of virtual views is used. The post-processing frame-work utilizes
a non-linear spatial filtering technique to reduce compression artifacts [43].
This thesis is an effort to effectively reduce the compression artifacts from HEVC
decoded depth-maps and improve the perceptual quality of rendered views without using
depth-map specific video codec.
5.2 Background Theory
Based on the analysis of compression artifacts presented in chapter 4, a post
processing framework is proposed [43], to minimize the effects of compression artifacts in
depth maps. The proposed framework is inspired by two applications of bilateral filtering
[49], namely, joint (cross) bilateral filtering [50], [51] and adaptive bilateral filtering [52].
These two concepts will be briefly explained in this section.
5.2.1 Bilateral Filter
The bilateral filter [49] uses both a closeness filter kernel as well as a similarity
filter kernel evaluated on the pixel values. More formally, for some pixel position p the
filtered result Bp is given as in the eq. (10),
$$B_p = \frac{\sum_{q \in \Omega} w_{pq} \, I_q}{\sum_{q \in \Omega} w_{pq}} \tag{10}$$
In Eq. (10), $I_q$ is the value at pixel position q in the kernel neighborhood Ω. The
filter weight $w_{pq}$ at pixel position q is calculated as,
$$w_{pq} = c(p, q) \cdot s(p, q) \tag{11}$$
where c is the closeness filter kernel and s is the similarity filter kernel. Both c
and s are popularly implemented as Gaussians centered at p and $I_p$ ($I_p$ is the value at
pixel position p) with standard deviations $\sigma_c$ and $\sigma_s$, respectively, as,
$$c(p, q) = \exp\left(-\frac{1}{2} \frac{(p - q)^2}{\sigma_c^2}\right) \tag{12}$$
$$s(p, q) = \exp\left(-\frac{1}{2} \frac{(I_p - I_q)^2}{\sigma_s^2}\right) \tag{13}$$
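Eqs. (10) to (13) translate into a short brute-force routine. The sketch below is a plain reference version under stated assumptions: the neighborhood Ω is taken as a square window of half-width `radius` that is clipped at the image border, and (p - q)^2 is taken as the squared Euclidean distance between pixel positions. Both choices are assumptions made for illustration.

```python
import math


def bilateral_filter(img, radius, sigma_c, sigma_s):
    """Brute-force bilateral filter following Eqs. (10) to (13)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for py in range(h):
        for px in range(w):
            num = den = 0.0
            for qy in range(max(0, py - radius), min(h, py + radius + 1)):
                for qx in range(max(0, px - radius), min(w, px + radius + 1)):
                    dist2 = (py - qy) ** 2 + (px - qx) ** 2
                    c = math.exp(-0.5 * dist2 / sigma_c ** 2)  # Eq. (12)
                    diff2 = (img[py][px] - img[qy][qx]) ** 2
                    s = math.exp(-0.5 * diff2 / sigma_s ** 2)  # Eq. (13)
                    num += c * s * img[qy][qx]                 # Eq. (11) weight times I_q
                    den += c * s
            out[py][px] = num / den                            # Eq. (10)
    return out


# A sharp depth step: the similarity kernel keeps the two sides from mixing
step = [[10, 10, 200, 200] for _ in range(4)]
filtered = bilateral_filter(step, radius=1, sigma_c=1.0, sigma_s=5.0)
```

Since the centre pixel always contributes a weight of 1, the denominator is never zero. A small `sigma_s` relative to the step height preserves the edge, whereas a large `sigma_s` makes the filter behave like an ordinary Gaussian blur.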
5.2.2 The joint bilateral and trilateral filter
When the similarity filter kernel of the bilateral filter is derived from a second, guidance
image, the process is known as a joint (cross) bilateral filter (JBF). The concept of joint
bilateral filtering was first proposed as a means of removing adverse effects of flash
photography [50], [51].
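Accordingly, the similarity filter kernel ($s_j$) in the case of a joint bilateral filter is
implemented as,
$$s_j(p, q) = \exp\left(-\frac{1}{2} \frac{(\tilde{I}_p - \tilde{I}_q)^2}{\sigma_j^2}\right) \tag{14}$$
where $\tilde{I}_p$ and $\tilde{I}_q$ are the values of the guidance image at pixel positions p and q.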
When there are two similarity filter kernels used along with a closeness filter, the
filter is known as a trilateral filter [53]. The basis for the similarity filters needs to be
judiciously selected. For example, a trilateral filter is designed as an in-loop deblocking filter
in Ref. [54], in which the two similarity filter kernels are derived from the color image
and the depth map, respectively. The similarity filter kernel $s_t$ of the filter proposed in Ref. [54] is given
as,
$$s_t(p, q) = s(p, q) \cdot s_j(p, q) = \exp\left(-\frac{1}{2} \frac{(I_p - I_q)^2}{\sigma_s^2}\right) \cdot \exp\left(-\frac{1}{2} \frac{(\tilde{I}_p - \tilde{I}_q)^2}{\sigma_j^2}\right) \tag{15}$$
5.2.3 The adaptive bilateral filter
In Ref. [52] the authors define the Adaptive Bilateral Filter (ABF) as an image
sharpening technique. In the ABF [52] the similarity filter kernel $s_a$ is defined as,
$$s_a(p, q) = \exp\left(-\frac{1}{2} \frac{(I_p - I_q - \Delta_p)^2}{\sigma_p^2}\right) \tag{16}$$
where Δp and σp are adaptation parameters dependent on p and they are used
to control the center and the standard deviation of the Gaussian kernel that implements
$s_a$. As opposed to the similarity filter kernel s in Eq. (13), which is centered around Ip, the
similarity filter kernel of the ABF is centered at Ip - Δp. The ABF has very good
sharpening ability if the adaptation parameters Δp and σp are
calculated appropriately. In [52], the adaptation parameters Δp and σp
are found empirically for digital images by a least mean square error
training method.
5.3 Proposed algorithm
This thesis is based on the post-processing algorithm in [43]. The method is
designed based on the principles of bilateral filtering introduced in section 5.2 to minimize the
effects of compression artifacts upon the virtual view generation process. Specifically, a
Bilateral Sharpening filter (BSF) is used to post-process compressed depth maps by
analyzing global image histograms. Figure 5-1 illustrates the block diagram of the
proposed post-processing framework.
Figure 5-1 Block diagram for the proposed research [43]
The video sequence and depth map are encoded using HM 9.2 [55]. The
decoded depth map with compression artifacts is then obtained. The BSF operates by
adjusting the histogram of the compressed depth map by identifying the dominant depth
value bins present in the depth map. Thus, the identification of correct depth value bins is
crucial for the correct operation of the BSF. If the histograms are analyzed directly from
the compressed depth [56], the identified depth value bins will not be very appropriate
due to the effect of noisy pixels. These noisy pixels could be a result of either the
compression algorithm or of the depth map generation process. The identification of the
depth value bins could be improved by filtering the compressed depth maps to reduce
these noisy pixels. Edge Adaptive Joint Trilateral Filter (EA-JTF) [43] is used, whose filter
coefficients are theoretically derived to enable maximum possible filtering of the noisy
pixels. Furthermore, in areas where the color image and the corresponding depth map
are aligned, the EA-JTF is designed to utilize the edges in the color image to reconstruct
the depth map. The output of the EA-JTF is then given as the input to BSF.
The different steps involved in the algorithm are explained in the coming sub-
sections.
5.3.1. Depth discontinuity analysis
The purpose of the depth discontinuity analysis step is twofold. Firstly, the areas
that have aligned edges in the color image and the corresponding depth map are
identified. The filter kernels of the EA-JTF are adaptively selected based on this
information about edge alignment between the color image and the depth map. Secondly,
all depth discontinuities that are significant in terms of rendering are identified. This
information about significant depth discontinuities is used to reduce the complexity of
the bilateral sharpening filter.
5.3.1.1 Identification of significant depth discontinuities
Theoretically, a depth discontinuity, or an edge in the depth map, is considered to
be significant if the neighborhood of the corresponding color pixels in the warped image
is different from the original color image. In the context of depth map based stereo view
rendering, a significant depth discontinuity will cause the corresponding color pixels on
either side of the edge to be shifted by different magnitudes.
Firstly, the depth map is convolved with a Sobel filter. Let the result of this
operation be denoted as $G_x$. An edge mask $E_d$, which corresponds to the pixel
locations of significant depth discontinuities, is then derived as in Eq. (17).
However, this derivation neglects round-off errors in the rendering algorithm.
$$E_d(p, q) = \begin{cases} 1 & \text{if } |G_x(p, q)| \geq \Delta m_{max} \\ 0 & \text{if } |G_x(p, q)| < \Delta m_{max} \end{cases} \tag{17}$$
where ∆mmax is defined in eq. (7).
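The edge mask of Eq. (17) can be sketched as follows. The text does not state which Sobel kernel is used; the horizontal-gradient kernel is assumed here because the parallax shift is horizontal and the result is denoted $G_x$. Border pixels are handled by replicating the edge samples, which is also an assumption.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # assumed horizontal-gradient kernel


def sobel_x(depth):
    """Horizontal Sobel response with replicated borders.

    The kernel is applied as a correlation; only |G_x| is used afterwards,
    so the sign difference to a true convolution does not matter.
    """
    h, w = len(depth), len(depth[0])

    def clamp(v, upper):
        return max(0, min(upper, v))

    return [
        [
            sum(
                SOBEL_X[j + 1][i + 1] * depth[clamp(y + j, h - 1)][clamp(x + i, w - 1)]
                for j in (-1, 0, 1)
                for i in (-1, 0, 1)
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def depth_edge_mask(depth, delta_m_max):
    """Eq. (17): 1 where |G_x| reaches the threshold, 0 elsewhere."""
    return [[1 if abs(g) >= delta_m_max else 0 for g in row] for row in sobel_x(depth)]


depth = [[10, 10, 10, 50, 50, 50] for _ in range(3)]
mask = depth_edge_mask(depth, delta_m_max=100)
```

Note that the Sobel kernel weights a vertical step by a factor of 4, so the response of a step of height 40 is 160 in this example.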
5.3.1.2 Identification of aligned color and depth edges
Once the edge mask $E_d$ is obtained as in Eq. (17), it is necessary to identify the
regions in which the color edges and depth discontinuities are aligned. For this purpose
an edge mask $E_c$ of the color image is generated by the Canny edge detection algorithm.
Using $E_d$ and $E_c$, the binary mask $E_s$ signifying the aligned edge areas is obtained as
follows,
$$E_s = \left( (E_d \oplus S_1) \cap E_c \right) \oplus S_2 \tag{18}$$
where ⨁ represents the morphological dilation and $S_1$ and $S_2$ represent flat
square structuring elements of size 2 and 7 respectively.
An example of outputs at each step of the depth discontinuity analysis is given in
Figure 5-2.
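Eq. (18) combines two dilations with a logical AND, which can be sketched with plain nested lists. The position of the origin inside the even-sized structuring element $S_1$ is not specified in the text; the choice below (offsets -1 and 0 for size 2, -3 to 3 for size 7) is an assumption. The Canny mask $E_c$ is taken as given input.

```python
def dilate(mask, size):
    """Binary dilation with a flat size x size square structuring element."""
    h, w = len(mask), len(mask[0])
    lo, hi = -(size // 2), size - size // 2  # offsets lo .. hi - 1 (assumed origin)
    return [
        [
            int(any(
                mask[y + dy][x + dx]
                for dy in range(lo, hi)
                for dx in range(lo, hi)
                if 0 <= y + dy < h and 0 <= x + dx < w
            ))
            for x in range(w)
        ]
        for y in range(h)
    ]


def aligned_edge_mask(e_d, e_c, s1=2, s2=7):
    """Eq. (18): E_s = ((E_d dilated by S1) AND E_c) dilated by S2."""
    widened = dilate(e_d, s1)
    both = [[a & b for a, b in zip(row_d, row_c)] for row_d, row_c in zip(widened, e_c)]
    return dilate(both, s2)


# One depth edge pixel and a color edge pixel one column to its right
e_d = [[0] * 9 for _ in range(9)]
e_c = [[0] * 9 for _ in range(9)]
e_d[4][4] = 1
e_c[4][5] = 1
e_s = aligned_edge_mask(e_d, e_c)
```

The small first dilation tolerates a misalignment of about one pixel between the depth and color edges, while the larger second dilation grows the aligned region so that the neighborhood of the edge is also covered.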
Figure 5-2 Illustration of depth discontinuity analysis
5.3.2 Pre-filtering of the depth maps to improve depth bin identification
The correct depth value bin identification from the histogram analysis is very
important for the operation of the proposed post-processing filter. As noisy depth pixels
affect the depth bin identification process, smoothing the compressed depth map to filter
out any insignificant depth discontinuities will improve the correctness of the bins that are
identified. A bilateral filter whose edge threshold is selected to preserve only the
significant discontinuities, is a good candidate for this purpose. Furthermore, this stage of
filtering also makes use of the corresponding color image to realign the discontinuities in
the depth map with the edges in the color image.
Considering both the above requirements, the Joint Trilateral Filter as described
in section 5.2.2 could be used for our purpose. However, the JTF is suitable only in areas
in which the color and depth edges are aligned. If the color edges and depth
discontinuities are not aligned, the JTF will generate depth maps different from the
original depth maps. Therefore, in [43], the similarity filter kernel st of the joint trilateral
filter is adaptively selected as given in Eq. (19). For the areas where the
edges between the color image and the corresponding depth map are
aligned, there will be two similarity filter kernels used, each derived
from the compressed depth map and the color image. For the
remaining area, only the similarity filter kernel derived from the
compressed depth map is used.
$$s_t(p, q) = \begin{cases} s(p, q) \cdot s_j(p, q) & \text{if } E_s(p, q) = 1 \\ s(p, q) & \text{if } E_s(p, q) = 0 \end{cases} \tag{19}$$
The aim of this pre-filtering step is to smooth the compressed depth map so that
any insignificant depth discontinuities are removed. Therefore, the edge threshold used for the
similarity filter kernel s in Eq. (19) is made equal to ∆mmax given by eq (7). While the
closeness filter and the similarity filter kernel derived from the color image $s_j$ are
implemented as Gaussian kernels, the similarity filter kernel derived from the depth map s is
implemented as a binary filter.
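The adaptive kernel selection of Eq. (19) can be sketched for a single pixel pair as follows. The exact form of the binary depth kernel is not given in the text; it is assumed here to be 1 when the depth difference does not exceed ∆mmax and 0 otherwise. The function and parameter names are chosen for illustration.

```python
import math


def ea_jtf_similarity(depth_p, depth_q, color_p, color_q, aligned, delta_m_max, sigma_j):
    """Similarity kernel s_t of Eq. (19) for one pixel pair (p, q).

    aligned: value of the mask E_s (1 or 0).
    Assumption: the binary depth kernel s passes neighbors whose depth
    differs from the center by at most delta_m_max.
    """
    s = 1.0 if abs(depth_p - depth_q) <= delta_m_max else 0.0
    if not aligned:
        return s  # only the depth-derived kernel is used
    s_j = math.exp(-0.5 * (color_p - color_q) ** 2 / sigma_j ** 2)  # Eq. (14)
    return s * s_j
```

Multiplying this value by the closeness kernel of Eq. (12) gives the filter weight, which is then used in the normalized sum of Eq. (10).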
While JBF [57] could recover edge information from its corresponding texture
image to a certain extent, it lacks the capability to do so in areas where there are depth
discontinuities, but inadequate gradient in the color image to support it. While the JTF
proposed in Ref. [54] overcomes this problem, it fails to perform in areas of the depth
map that are not perfectly aligned with the color image. Thus, the EA-JTF is designed to
overcome drawbacks of both the JBF and the JTF.
5.3.3 Global histogram analysis
The aim of this step is to identify the significant depth value bins by analyzing the
global histograms of the depth map. A depth value bin in the histogram is characterized
by a peak enclosed by two immediate minimums (valleys) on either side, except when
the peak is at 0 or 255 gray levels. Once the particular depth value bins are identified,
they are represented as a non-symmetric Gaussian distribution centered at the peak
value of a particular bin. The distance to the enclosing valleys from the peak is used to
calculate the standard deviation on each side of the Gaussian curve representing the
particular depth value cluster.
For this purpose, the output of the EA-JTF is segmented into equal size blocks
of 64x64 (for some sequences 72x72 is used, so that the image divides into equal size blocks).
The histogram analysis is performed on each segmented block to find all the dominant
depth value bins within that block. The decision to perform the histogram analysis on
64x64 blocks, rather than on a pixel by pixel basis, is made for two reasons. Firstly, this
makes the reconstruction method consistent among all the pixels within the
block. Secondly, it minimizes the chances of some noisy depth pixel values being
identified as a significant depth value bin. The different steps of the global histogram
analysis are given below.
1. Segment the output of the EA-JTF into equal size segments of 64x64
pixels
2. Obtain the histogram for each segment
3. Smooth the histogram using an averaging filter: The averaging filter
kernel used is [1 1 1 1 1].
4. Identify dominant peaks and their enclosing valleys.
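Steps 2 to 4 above can be sketched for one segment as follows. The criterion for a "dominant" peak is not specified in the text, so a hypothetical `min_height` parameter on the smoothed histogram is used here. The valley search simply walks away from the peak until the histogram rises again or an empty bin is reached, which is also an assumption.

```python
def histogram(block, levels=256):
    """Step 2: count how often each gray level occurs in one segment."""
    hist = [0] * levels
    for row in block:
        for value in row:
            hist[value] += 1
    return hist


def smooth(hist, width=5):
    """Step 3: averaging kernel [1 1 1 1 1] / 5 with zero padding at both ends."""
    half = width // 2
    return [sum(hist[max(0, i - half): i + half + 1]) / width for i in range(len(hist))]


def find_peaks(hist, min_height):
    """Step 4a: centers of plateaus that are higher than both neighbors."""
    peaks, i, n = [], 0, len(hist)
    while i < n:
        j = i
        while j + 1 < n and hist[j + 1] == hist[i]:
            j += 1  # [i, j] is a run of equal values
        left_lower = i == 0 or hist[i - 1] < hist[i]
        right_lower = j == n - 1 or hist[j + 1] < hist[i]
        if hist[i] >= min_height and left_lower and right_lower:
            peaks.append((i + j) // 2)
        i = j + 1
    return peaks


def enclosing_valleys(hist, peak):
    """Step 4b: walk away from the peak while the histogram does not rise,
    stopping at the first empty bin or at the ends of the gray-level range."""
    lo = peak
    while lo > 0 and hist[lo] > 0 and hist[lo - 1] <= hist[lo]:
        lo -= 1
    hi = peak
    while hi < len(hist) - 1 and hist[hi] > 0 and hist[hi + 1] <= hist[hi]:
        hi += 1
    return lo, hi


def depth_bins(block, min_height=1.0):
    """Return (peak, valley_low, valley_high) for each dominant bin of a segment."""
    hist = smooth(histogram(block))
    return [(p,) + enclosing_valleys(hist, p) for p in find_peaks(hist, min_height)]


# Toy segment with two depth layers at gray levels 50 and 200
bins = depth_bins([[50] * 30, [200] * 20])
```

Each returned triple holds the peak position and its two enclosing valleys, which is the information needed to describe a depth value bin by a non-symmetric Gaussian.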
5.3.4 Bilateral Sharpening Filter
In this subsection, the bilateral sharpening technique to minimize the effects of
artifacts in compressed depth maps [43] is explained in detail. The method is inspired by
the Adaptive Bilateral Filter (ABF) [52] described in section 5.2.3. The ABF [52] is able to
adjust the histogram of an image in a desired way by selecting the adaptation parameters
in eq. (16) appropriately. The method proposed in [52] is optimized for sharpening natural
images and the adaptation parameters are found by a training method based on Least
Mean Squared Error (LMSE) minimization.
Unlike natural images, depth maps are mostly piecewise smooth images with
sharp depth discontinuities (edges). By appropriate selection of the adaptation
parameters in Eq. (16) it is possible to adjust the histograms of the compressed depth
maps to a form similar to the one they had before compression.
Depth maps are captured by various techniques such as depth range
cameras and computer vision techniques based on disparity estimation. Thus, a training
method as proposed in Ref. [52] cannot be successfully adapted for sharpening of depth
maps so as to accommodate varying types of depth maps. We use the piecewise smooth
property of depth maps to propose a non-training based method to find adaptation
parameters of sa in Eq. (16). Specifically, the depth value bins identified by the global
histogram analysis and characterized by Gaussian curves as described in section 5.3.3
are used to derive the appropriate adaptation parameters of the Bilateral Sharpening
Filter (BSF) proposed in [43].
The EA-JTF [43] successfully filters out insignificant depth discontinuities present
in the compressed depth maps, and its output is provided to the BSF. At this stage of
filtering, each pixel of the depth map is replaced by a value determined by the following
bilateral sharpening process. During the bilateral sharpening process, all the pixels in a
64x64 block are processed with respect to the identified depth value bins of the particular
block. Once a pixel (Ip) is taken for filtering, the nearest depth value bin in the histogram
of the corresponding 64x64 block is identified. Thereafter, the bilateral filter weights (wpq
in Eq. (11)) are derived as described below. To reduce the spread beside the
peaks in the histogram of the compressed image, the similarity filter
kernel of the bilateral sharpening filter in [43] is implemented as a
Gaussian centered at the nearest peak (Np) to Ip, given as,
\[ s_a(p,q) = \exp\left(-\frac{1}{2}\,\frac{(N_p - I_q)^2}{\sigma_p^2}\right) \tag{20} \]
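where σp is defined as half the distance to the valleys enclosing Np (since 2σp
corresponds to a 95% confidence interval in a Gaussian distribution),
\[ \sigma_p = \frac{1}{2}\,\min\left(\left|V_{\mathrm{high}} - N_p\right|,\ \left|N_p - V_{\mathrm{low}}\right|\right) \tag{21} \]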
In Eq. (21), Vhigh and Vlow represent the enclosing valleys above and below
Np, respectively. The bilateral filter weights are then derived as follows,
\[ w_{pq} = c(p,q)\, s_a(p,q) \tag{22} \]
Finally, the filtered result (Bp) is calculated as in Eq. (10).
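The bilateral sharpening step described above can be sketched in a few lines. The sketch below is written in Python for illustration only (the thesis work itself was carried out in MATLAB) and rests on several assumptions: the peaks and valleys of the block histogram are taken as given inputs (the histogram analysis of section 5.3.3 is not reproduced), the valley list is assumed to enclose every peak, sigma_p is taken from the nearer enclosing valley, the closeness kernel c(p,q) is assumed to be a spatial Gaussian, and the final result Bp is assumed to be the normalized weighted average of the neighboring pixels. The function names and the default 7 x 7 window with σc = 12 (borrowed from the ABF entries of Table 3) are illustrative choices.

```python
import math

def nearest(values, x):
    """Return the element of `values` closest to x."""
    return min(values, key=lambda v: abs(v - x))

def sigma_for_peak(peak, valleys):
    """Half the distance from a peak to its enclosing valleys.
    Assumption: the nearer of the two enclosing valleys is used."""
    v_low = max(v for v in valleys if v < peak)
    v_high = min(v for v in valleys if v > peak)
    return 0.5 * min(peak - v_low, v_high - peak)

def bilateral_sharpen(block, peaks, valleys, radius=3, sigma_c=12.0):
    """Filter a 2-D block (list of rows) with a similarity kernel
    centered at the histogram peak nearest to each pixel."""
    h, w = len(block), len(block[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            n_p = nearest(peaks, block[y][x])        # nearest peak Np
            sigma_p = sigma_for_peak(n_p, valleys)   # spread allowed around Np
            num = den = 0.0
            for j in range(max(0, y - radius), min(h, y + radius + 1)):
                for i in range(max(0, x - radius), min(w, x + radius + 1)):
                    # closeness kernel c(p,q): assumed spatial Gaussian
                    c = math.exp(-0.5 * ((i - x) ** 2 + (j - y) ** 2) / sigma_c ** 2)
                    # similarity kernel s_a(p,q): Gaussian centered at Np
                    s_a = math.exp(-0.5 * (n_p - block[j][i]) ** 2 / sigma_p ** 2)
                    num += c * s_a * block[j][i]
                    den += c * s_a
            # assumed normalization: weighted average of the neighborhood
            out[y][x] = num / den if den > 0 else float(block[y][x])
    return out
```

Because the similarity kernel is centered at Np rather than at Ip, neighbors whose values lie close to the nearest histogram peak dominate the average, while neighbors that belong to a different depth layer receive almost no weight, so depth edges are not blurred.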
5.3.5 Stereoscopic view rendering and comparison of results
After obtaining the depth map with artifacts removed using the steps described in
section 5.3.1 to section 5.3.4, the left-side and right-side views are obtained using the
stereoscopic view rendering process [44] [45]. The images obtained using the uncompressed
depth map, the HEVC decoded depth map, and the HEVC decoded depth map to which the
post-processing has been applied are compared using the metrics PSNR, SSIM [48] and an
approximate Mean Opinion Score.
5.4 Summary
In this chapter, the scope of the proposed thesis is discussed. This is followed by
the exploration of the background theory necessary to understand the post-processing
technique used in this thesis. Finally, the entire working algorithm is explained. Chapter 6
provides the experimental results for different test sequences.
Chapter 6
Experimental results
To evaluate the performance of the EA-JTF [43] on HEVC decoded depth maps,
color sequences along with the corresponding depth maps are compressed using the HEVC
reference software HM 9.2 [55]. Since one frame is enough for stereoscopic rendering,
only one frame is compressed, at QP = 32. Thereafter, to evaluate the
post-processing framework, three different rendered images are obtained. First, the
original image and the corresponding depth map are used for stereoscopic rendering.
Second, the HEVC decoded image and the corresponding decoded depth map are used
for stereoscopic rendering. Third, the post-processing framework is applied to the HEVC
decoded depth map, and the result is used together with the HEVC decoded image for
stereoscopic rendering. Stereoscopic views are
rendered according to the MPEG informative recommendations [44] [58].
To evaluate the improvement obtained by applying the post-processing techniques to HEVC
decoded depth maps, two metrics, PSNR and SSIM [48], are used to measure the quality of the
views rendered with the post-processed depth map relative to the corresponding views rendered with
the uncompressed depth map. In addition, an approximate
Mean Opinion Score (MOS) [59] is used to evaluate the perceptual quality of the
rendered views. The way in which the MOS is calculated is explained in section 6.1.
The experiments were performed on three different sequences: Break-dancers
[60], Balloons [18] and Kendo [18]. All the post-processing was performed
using MATLAB R2013a student version. The results for each sequence are discussed in
detail in section 6.2.
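As a point of reference for the objective comparison, PSNR between a view rendered with the uncompressed depth map and a view rendered with a decoded or post-processed depth map can be computed as in the minimal sketch below. It is written in Python for illustration (the thesis experiments used MATLAB), assumes 8-bit single-channel images stored as lists of rows, and uses a hypothetical function name; it is not the code used to produce the reported results.

```python
import math

def psnr(reference, test, peak=255.0):
    """PSNR in dB between two equally sized grayscale images (lists of rows)."""
    if len(reference) != len(test) or any(
        len(r) != len(t) for r, t in zip(reference, test)
    ):
        raise ValueError("images must have the same dimensions")
    n = sum(len(row) for row in reference)
    # mean squared error over all pixels
    sq_err = sum((a - b) ** 2 for r, t in zip(reference, test) for a, b in zip(r, t))
    mse = sq_err / n
    # identical images have zero error, hence unbounded PSNR
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)
```

PSNR depends only on pixel-wise differences, whereas SSIM [48] compares local luminance, contrast and structure; this is why the two metrics can rank the decoded and processed views differently.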
6.1 An approximate Mean Opinion Score calculation
To compute the MOS, the rendered images from all the three test-cases (original,
HEVC decoded, HEVC decoded + post-processed) were sent to a number of volunteers.
The volunteers were unaware of which images were rendered from which depth maps.
They were asked to view the images under normal conditions (viewing distance, posture,
etc.), as they would normally view pictures or videos on their laptops or desktops. The
volunteers were then asked to rate the images based on viewing quality.
The best image was to be given a rating of 3, the second-best image
a rating of 2 and the worst image a rating of 1. The volunteers
were chosen to span a spectrum from people with some knowledge of image-processing
techniques to people completely new to image quality assessment.
From these ratings, the MOS is calculated for each test case using the formula given in
Eq. (23).
\[ \mathrm{MOS} = \frac{\sum \text{Ratings}}{\text{Number of people}} \tag{23} \]
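Eq. (23) is a plain average of the ratings given to one test case. A minimal Python sketch is given below; the function name and the data layout (one dictionary per volunteer, mapping each test case to a rating of 1, 2 or 3) are assumptions made for illustration, and the ratings in the usage example are hypothetical, not the survey data.

```python
def mos_per_case(rankings):
    """Average the ratings of each test case over all volunteers (Eq. 23).

    rankings: list of dicts, one per volunteer, mapping test case -> rating.
    """
    if not rankings:
        raise ValueError("at least one volunteer is required")
    cases = rankings[0].keys()
    return {case: sum(r[case] for r in rankings) / len(rankings) for case in cases}

# Hypothetical ratings from four volunteers (not the survey data).
example = [
    {"original": 3, "decoded": 1, "processed": 2},
    {"original": 2, "decoded": 1, "processed": 3},
    {"original": 3, "decoded": 1, "processed": 2},
    {"original": 3, "decoded": 1, "processed": 2},
]
scores = mos_per_case(example)
```

Since every volunteer hands out the ratings 3, 2 and 1 exactly once, the three MOS values always sum to 6; the score is therefore a relative ranking of the three test cases rather than an absolute quality measure.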
6.2 Input Parameters
The different input parameters used while conducting the experiments are listed
in this section.
Table 2 Input parameters and their values
Parameter                                      Value
Viewing distance (D)                           250 cm (assumed)
Eye separation (xB)                            6 cm (assumed)
Screen width in pixels (Npix)                  1366 (for the laptop used for experimentation)
The range of depth information respectively    knear = 44.00; kfar = 120.00 (BreakDancer)
behind and in front of the picture             knear = 448.25; kfar = 11206.28 (Balloons)
(knear and kfar)                               knear = 448.25; kfar = 11206.28 (Kendo)
Resolution of the video sequences used         1024 x 768
Table 3 Filter parameters for EA-JTF and ABF [43]
EA-JTF
Kernel size: 15 x 15 pixels
Standard deviation for the color similarity filter (σs) = 0.025 (normalized range of 0-1)
Standard deviation for the depth similarity filter (σj) = 0.036 (normalized range of 0-1)
Standard deviation for the closeness filter (σc) = 21
ABF
Kernel size: 7 x 7 pixels
Standard deviation for the closeness filter (σc) = 12
6.2 Results for different sequences
For all the sequences, original is the view rendered using the original image and
depth-map. Decoded is the view rendered using the image and depth-map decoded
using HEVC. Processed is the view rendered using the HEVC decoded image and the
post-processed depth-map.
6.2.1 Sequence: Balloons
The Balloons sequence gave the best results of the three sequences in terms of
PSNR, SSIM and MOS. This can be attributed to the fact that the original depth
map for the Balloons sequence is itself blurred; the post-processing technique sharpens and
enhances the depth map to a large extent, which yields good results.
Table 4 Sequence Balloons results
Metric      Decoded image      Processed image    Decoded image       Processed image
            (left-side view)   (left-side view)   (right-side view)   (right-side view)
PSNR (dB)   32.6614            38.2611            32.6614             38.2611
SSIM        0.6977             0.9143             0.6977              0.6977
MOS ratings were collected only for the left-side rendered image. Ten volunteers
rated the three images, and the results show that the processed image received a far higher
rating than the decoded image. In the specific case of Balloons, because the original
depth map is itself blurry, some of the volunteers gave the processed image a
higher rating than the original. The MOS, on a scale of 3, is calculated using the formula
given in Eq. (23). Table 5 gives the MOS ratings for the Balloons sequence. The images
are shown in figure 6-1.
Table 5 Balloons sequence MOS rating
Image MOS Rating (max = 3)
Original 2.4
Decoded 1.0
Processed 2.5
Figure 6-1 Result images – Rendered left-side images for balloons sequence
6.2.2 Sequence: Break-Dancer
For the Break-dancer sequence, the decoded image actually has a higher PSNR
than the processed image. This can be attributed to the fact that, pixel value by pixel value,
the original and decoded views are more similar than the original and post-processed views.
However, in terms of SSIM and MOS, the processed image shows an improvement
over the decoded image. The results are given in Tables 6 and 7, respectively.
The images are shown in figure 6-2.
Table 6 Sequence Break-Dancer results
Metric      Decoded image      Processed image    Decoded image       Processed image
            (left-side view)   (left-side view)   (right-side view)   (right-side view)
PSNR (dB)   41.1128            40.6078            41.1128             40.6078
SSIM        0.8953             0.8987             0.8953              0.8987
Table 7 Break-dancer sequence MOS rating
Image MOS Rating (max = 3)
Original 2.6
Decoded 1.5
Processed 1.9
Figure 6-2 Result images – Rendered left-side images for break-dancer sequence
6.2.3 Sequence: Kendo
As with Break-dancer, the Kendo sequence shows no improvement in terms of
PSNR, but in terms of perceptual quality, measured with SSIM and MOS, the post-processed image
scores better than the decoded image. The results are given in Tables 8 and 9,
respectively. The result images are shown in figure 6-3.
Table 8 Sequence Kendo results
Metric      Decoded image      Processed image    Decoded image       Processed image
            (left-side view)   (left-side view)   (right-side view)   (right-side view)
PSNR (dB)   41.7771            40.8181            41.7771             40.6078
SSIM        0.9459             0.9466             0.9459              0.9466
Table 9 Kendo sequence MOS rating
Image MOS Rating (max = 3)
Original 2.2
Decoded 1.7
Processed 2.1
Figure 6-3 Rendered left-side images for kendo sequence
6.3 Summary
In this chapter, the different input parameters and filter parameters used for the
experimentation purpose are discussed. This is followed by exploring the results for the
images rendered from three different test sequences. Conclusions and future work are
presented in chapter 7.
Chapter 7
Conclusions and future-work
7.1 Conclusions
This thesis is an effort to improve the quality of rendered views obtained from
HEVC decoded depth maps. The 3D video codec for HEVC is extremely complicated
compared to the normal HEVC codec. In this thesis, the depth maps are compressed
directly using the HEVC reference software. The post-processing technique described in
chapter 5 is applied to the HEVC decoded depth maps. The
views obtained using post-processed depth maps are compared with the views obtained
using just the decoded depth maps, with the views obtained using the original depth
map as reference. Three test sequences are used, and PSNR, SSIM and MOS are the three
metrics used to compare the results. For the Balloons sequence, the
post-processing improves the quality of the rendered images to a large extent:
PSNR increases by 5.60 dB, SSIM improves by 0.2166,
and the view obtained using the post-processed depth map has the best MOS
rating, 2.5, from ten volunteers. The PSNR results are not as promising for the other two
sequences, where the processed views have a slightly lower PSNR than the decoded views.
However, perceptual quality measurements using SSIM and MOS show that the
views rendered using post-processed depth maps are better than those rendered using
just the decoded depth maps. For the Break-dancers sequence, SSIM
improves by 0.0034 and the MOS rating is 1.9, compared with 1.5 for the
decoded case. For the Kendo sequence,
SSIM improves by 0.0007 and the MOS rating is 2.1 for the
processed image, compared with 1.7 for the decoded image. Thus, the results for all
three sequences suggest that the perceptual quality of the views rendered using
the depth maps that have been post-processed is better than that of the views rendered using
depth maps that have only been HEVC decoded.
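The gains quoted in this section follow directly from the left-side view entries of Tables 4 to 9. The short Python sketch below reproduces that arithmetic; the dictionary layout and the function name are illustrative, and the numbers are copied from the tables as (decoded, processed) pairs.

```python
# (decoded, processed) values for the left-side view, copied from Tables 4 to 9
results = {
    "Balloons":     {"psnr": (32.6614, 38.2611), "ssim": (0.6977, 0.9143), "mos": (1.0, 2.5)},
    "Break-Dancer": {"psnr": (41.1128, 40.6078), "ssim": (0.8953, 0.8987), "mos": (1.5, 1.9)},
    "Kendo":        {"psnr": (41.7771, 40.8181), "ssim": (0.9459, 0.9466), "mos": (1.7, 2.1)},
}

def gains(table):
    """Return processed minus decoded for every metric of every sequence."""
    return {
        seq: {metric: round(processed - decoded, 4)
              for metric, (decoded, processed) in metrics.items()}
        for seq, metrics in table.items()
    }

improvement = gains(results)
for seq, delta in improvement.items():
    print(seq, delta)
```

A positive value means that the view rendered from the post-processed depth map scores higher than the view rendered from the decoded depth map; for PSNR the difference is in dB, while SSIM and MOS differences are unitless.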
7.2 Future work
There are a few directions in which this thesis can branch out and provide scope for
further research. More work on the filter design may provide more
significant results. In the current work, only stereoscopic view rendering is considered;
this can be extended to multi-view rendering. Also, the current work implements
post-processing as an out-of-the-loop solution; it could instead be made in-loop and merged
with the HEVC codec. For evaluating the perceptual quality, the current work uses SSIM
and an approximate Mean Opinion Score. More research into perceptual quality
assessment for depth maps and rendered views may be useful.
References
1) G.J. Sullivan; J. Ohm; Woo-Jin Han and T.Wiegand, “Overview of the High Efficiency Video Coding ( HEVC ) Standard ”, IEEE Trans. on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, Dec 2012.
2) HEVC text specification draft 10: http://phenix.it-sudparis.eu/jct/doc_end_user/current_document.php?id=7243
3) H.264/AVC reference website: http://www.itu.int/rec/T-REC-H.264-201003-I
4) C. M. Fu, et al, “Sample adaptive offset in the HEVC standard,” IEEE Trans. on circuits and systems for video technology, vol. 22, no. 12, pp. 1755-1764, Dec. 2012.
5) K. Kim, et al, “Block partitioning structure in the HEVC standard,” IEEE Trans. on circuits and systems for video technology, vol. 22, pp.1697-1706, Dec. 2012.
6) 3DV for H.264: http://mpeg.chiariglione.org/technologies/general/mp-3dv/index.htm
7) B. Furht, “Multimedia systems: an overview”, IEEE Multimedia, vol. 1, pp. 47-59, 1994.
8) K.R. Rao, D.N. Kim and J.J. Hwang, “Video coding standards: AVS China, H.264/MPEG4-Part 10, HEVC, VP6, DIRAC and VC-1”, Springer -2014.
9) B. Furht, “Survey of multimedia compression techniques and standards. Part 1: JPEG standard”, Real time imaging, vol. 1, pp.49-67, 1995.
10) Cisco Visual Networking Index: Global mobile data traffic forecast update,2012-2017: http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.pdf
11) ISO website: http://www.iso.org/iso/home.htm
12) IEC website: http://www.iec.ch/
13) P. Merkle, A. Smolic, K. Muller and T. Wiegand, “Multi-view video plus depth representation and coding,” IEEE International Conf. on Image Processing, pp. I-201 – I-204, San Antonio, USA, Sept. 2007.
14) A. Vetro, S. Yea and A. Smolic, “Towards a 3D video format for Autostereoscopic displays,” SPIE Conf. on Applications of Digital Image Processing XXXI, San Diego, USA, Aug. 2008.
15) A. Smolic, et al, "3D Video and Free Viewpoint Video - Technologies, Applications and MPEG Standards," IEEE International Conference on Multimedia and Expo, 2006, vol., pp.2161, 2164, 9-12 July 2006.
16) D.K. Shah, et al, "Evaluating multi-view plus depth coding solutions for 3D video scenarios," 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), 2012, vol., pp.1-4, 15-17 Oct. 2012.
17) S. Gokturk, H. Yalcin and C. Bamji, “A time-of-flight depth sensor – system description, issues and solutions,” Conf. on Computer Vision and Pattern Recognition, Washington, USA, June 2004.
18) Balloons and Kendo test sequences: http://www.tanimoto.nuee.nagoya-u.ac.jp/~fukushima/mpegftv/
19) L. McMillan, “An Image-Based Approach on Three-Dimensional Computer Graphics”, Ph.D. thesis, University of North Carolina at Chapel Hill: 1997.
20) C. Fehn "A 3D-TV system based on video plus depth information," Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, vol.2, no., pp.1529-1533 Vol.2, 9-12 Nov. 2003.
21) ITU-T website: http://www.itu.int/ITU-T/index.html
22) SMPTE website: http://www.smpte.org/home/
23) Microsoft website: http://www.microsoft.com/en/us/default.aspx
24) Website of AVS working group: http://www.avs.org.cn/en
25) T. Borer and T. Davies, “Dirac Video compression using open technology,” EBU Technical Review, pp. 19, July 2005.
26) K. Onthriar, K.K. Loo and Z. Xue, “Performance comparison of emerging Dirac video codec with H.264/AVC,” IEEE Int’l Conf. on Digital Telecommunication., ICDT '06, pp. 2222, Cap Esterel, Cote d'Azur, France, Aug. 2006.
27) Website of Real Networks: http://www.realnetworks.com/
28) Website of ON 2 Technologies: http://www.on2.com/
29) Website for ITU-T: http://www.itu.int/en/ITU-T/studygroups/2013-2016/16/Pages/video/jctvc.aspx
30) MPEG website: http://www.mpeg.org/
31) Reference for H.262/MPEG-2: http://mpeg.chiariglione.org/standards/mpeg-2/video
32) Reference website for HEVC: www.hevc.info
33) H.261 recommendation: http://www.itu.int/rec/T-REC-H.261-199303-I/en
34) F Bossen, et al, “HEVC complexity and implementation analysis”, IEEE Trans. on Circuits and Systems for Video Technology, vol. 22, Issue: 12, pp. 1685 - 1696, Dec. 2012.
35) Fraunhofer HHI, 3D Video coding information: http://www.hhi.fraunhofer.de/fields-of-competence/image-processing/research-groups/image-video-coding/3d-hevc-extension.html
36) P. Merkle, A Smolic, K. Müller, and T. Wiegand, “Multi-View video plus depth data representation and coding”. Picture Coding Symposium, 2007.
37) “Test Model under Consideration for HEVC based 3D video coding”, ISO/IEC JTC1/SC29/WG11 MPEG2011/N12559, San Jose, CA, USA, Feb. 2012.
38) T. Lee, Y. Chan, C. Fu and W. Siu; “Reliable tracking algorithm for multiple reference frame motion estimation”, J. Electron. Imaging, vol. 20, Issue: 3, pp. 033003-01 - 033003-14, Jul – Sept., 2011.
39) H. Zhang and Z. Ma, “Fast intra prediction for high efficiency video coding”, Pacific Rim Conf. on Multimedia, PCM2012, Singapore, Dec. 2012.
40) M. Zhang, C. Zhao and J. Xu, “An adaptive fast intra mode decision in HEVC”, IEEE ICIP 2012, pp. 221-224, Orlando, FL, Sept. - Oct., 2012.
41) M. Jakubowski and G. Pastuszak, “Block-based motion estimation algorithms – a survey”, Opto-electronics review, vol. 21, pp. 86 – 102, 2013.
42) M.C. Motwani, et al, “A survey of image denoising techniques”, Proceedings of GSPx 2004, Santa Clara, CA: http://www.cse.unr.edu/~fredh/papers/conf/034-asoidt/paper.pdf
43) D.V.S. De Silva, et al, “A Depth Map Post-Processing Framework for 3D-TV systems based on Compression Artifact Analysis”, IEEE journal of Selected Topics in Signal Processing, vol. pp. ,Issue: 99, pp. 1 – 30, Aug. 2011
44) I. J. S. W. 11, “Proposed experimental conditions for EE4 in MPEG 3DAV. WG 11 doc. m9016,” vol. Shanghai, Oct. 2002.
45) C. Fehn, “Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV,” Proceedings of the SPIE, vol. 5291, 93, 2004.
46) D. De Silva, W. Fernando, and S. Worrall, “Intra mode selection method for depth maps of 3D video based on rendering distortion modeling,” IEEE Trans. on Consumer Electronics, vol. 56, no. 4, pp. 2735–2740, Nov. 2010.
47) Y. Zhao and L. Yu, “Perceptual measurement for evaluating quality of view synthesis,” ISO/IEC JTC1/SC29/WG11/M16407, Apr. 2009.
48) Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. on Image Processing, vol. 13, no. 4, pp. 600 - 612, Apr. 2004.
49) C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” IEEE International Conference on Computer Vision, pp.839-846, Washington DC, USA, 1998.
50) E. Eisemann and F. Durand, “Flash photography enhancement via intrinsic relighting,” in ACM Trans. on Graphics (TOG), vol. 23, no. 3. ACM,pp. 673–678, 2004.
51) G. Petschnigg, et al, “Digital photography with flash and no-flash image pairs,” in ACM Trans. on Graphics (TOG), vol. 23, no. 3. ACM, pp. 664–672, 2004.
52) B. Zhang and J. Allebach, “Adaptive bilateral filter for sharpness enhancement and noise removal,” IEEE Trans. on Image Processing, vol. 17, no. 5, pp. 664–678, 2008.
53) P. Choudhury and J. Tumblin, “The trilateral filter for high contrast images and meshes,” in ACM SIGGRAPH 2005 Courses. ACM, pp. 5-es, 2005.
54) S. Liu, P. Lai, D. Tian, C. Gomila, and C. W. Chen, “Joint trilateral filtering for depth map compression,” pp. 77 440F-10, Huangshan, China, 2010.
55) HEVC reference software (HM 9.2):- https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/branches/HM-9.2-dev/
56) D. De Silva, et al, "Adaptive sharpening of depth maps for 3D-TV," IET Electronics Letters, vol.46, no.23, pp.1546, 1548, Nov. 11, 2010.
57) O. Gangwal and R. Berretty, “Depth map post-processing for 3D-TV,” in IEEE ICCE 2009, pp. 1-2, 2009.
58) MATLAB code for stereoscopic view rendering: http://www.mathworks.com/matlabcentral/fileexchange/27538-depth-image-based-stereoscopic-view-rendering
59) L Ma, et al, "Image Retargeting Quality Assessment: A study of subjective scores and objective metrics,", IEEE Journal of Selected Topics in Signal Processing,vol.6, no.6, pp.626-639, Oct. 2012.
60) Break-Dancers and Ballet sequence: http://research.microsoft.com/en-us/um/people/sbkang/3dvideodownload/
61) “Interactive stereoscopic video conversion”, IEEE Trans. on circuits and systems for video technology, vol. 23, Oct. 2013.
62) T. Na, et al, “A Hybrid Stereoscopic Video Coding Scheme Based on MPEG-2 and HEVC for 3DTV Services”, IEEE, Trans. on circuits and systems for video technology vol. 23, pp. 1542-1554, Sept. 2013.
63) S. Vasudevan, “Implementation of fast residual quadtree coding and fast intra-prediction in high efficiency video coding”, Masters’ thesis, EE Dept. University of Texas at Arlington, Dec. 2013.
64) V. Gajula, “Complexity reduction of intra-coding in HEVC and comparison with H.264/AVC”, Masters’ thesis, EE Dept., University of Texas at Arlington, Dec. 2013.
Biographical Information
Nayana Parashar was born in Bangalore, India, in 1989. She did her schooling at
Holy Cross Convent, Kolhapur, Maharashtra, India. She received her Bachelor of
Engineering (B.E.) degree in Instrumentation Technology from Visvesvaraya
Technological University, India, in 2011. She started her M.S. program in Electrical
Engineering at the University of Texas at Arlington in Jan. 2012 and received the M.S.
degree in Dec. 2013. While at UTA, she joined the Multimedia Processing Lab as a student
researcher under Dr. K.R. Rao. She also worked as an intern at InterDigital Communications,
San Diego, CA from May to Dec. 2013. While at InterDigital, she worked on user-adaptive
video streaming, fast transforms and perceptual video/image technology. After
graduation, she hopes to work in multimedia and computer vision related fields where
she can put her knowledge and experience into good use.