Multiplexing the elementary streams of H.264 video and
MPEG4 HE AAC v2 audio using MPEG2 systems
specification, demultiplexing and achieving lip
synchronization during playback
by
NAVEEN SIDDARAJU
Presented to the Faculty of the Graduate School of
The University of Texas at Arlington in Partial Fulfillment
of the Requirements
for the Degree of
MASTER OF SCIENCE IN ELECTRICAL ENGINEERING
Nov 2010
Copyright © by Naveen Siddaraju 2010
All Rights Reserved
ACKNOWLEDGEMENTS
I am greatly thankful to my supervising professor Dr. K. R. Rao, whose constant
encouragement, guidance and support have helped me in smooth completion of
the project. He has always been accessible and helpful throughout. I also thank
him for introducing me to the field of multimedia processing.
I would like to thank Dr. W. Alan Davis and Dr. William E. Dillon for taking interest
in my project and accepting to be part of my project defense committee.
I am forever grateful to my parents for their unconditional support at each turn of
the road. I thank my brother and sisters, who have always been a source of
inspiration. I would like to thank my friends both in US and in India for their
encouragement and support.
November 22, 2010
ABSTRACT
MULTIPLEXING THE ELEMENTARY STREAMS OF H.264 VIDEO AND
MPEG4 HE AAC v2 AUDIO USING MPEG2 SYSTEMS SPECIFICATION,
DEMULTIPLEXING AND ACHIEVING LIP SYNCHRONIZATION DURING
PLAYBACK
Naveen Siddaraju, MS
The University of Texas at Arlington, 2010
Supervising Professor: Dr. K. R. Rao
Delivering broadcast quality content to the mobile customers is one of the
most challenging tasks in the world of digital broadcasting. Limited network
bandwidth and processing capability of the handheld devices are critical factors
that should be considered. Hence selection of the compression schemes for the
media content is very important from both economic and quality points of view.
H.264, also known as Advanced Video Coding (AVC) [1], is the latest
and most advanced video codec available in the market today. The H.264
baseline profile, which is used in applications such as mobile television (mobile
DTV) broadcast, offers one of the best compression ratios among the profiles
and requires the least processing power at the decoder. The audio codec MPEG4
HE AAC v2 [2], also known as enhanced aacplus, is the latest audio codec
belonging to the AAC (advanced audio coding) [3] family. In addition to the core
AAC, it uses the latest tools such as Spectral Band Replication (SBR) [2] and
Parametric Stereo (PS) [2], resulting in the best perceived quality at the lowest
bitrates. The audio and video codec standards have been chosen based on ATSC-
M/H (advanced television systems committee – mobile handheld) [17].
For television broadcasting applications such as ATSC-M/H and DVB [16], the
encoded audio and video streams should be transmitted in a single transport
stream containing fixed-size data packets, which can be easily recognized and
decoded at the receiver. The goal of the project is to implement a multiplexing
scheme for the elementary streams of H.264 baseline and HE AAC v2 using the
MPEG2 systems specifications [4], then demultiplex the transport stream and
playback the decoded elementary stream with lip synchronization or audio-video
synchronization. The multiplexing involves two layers of packetization of the
elementary streams of audio and video. The first level of packetization results in
Packetized Elementary Stream (PES) packets, which are variable-size packets and
hence not suitable for transport. MPEG2 defines a transport stream where PES
packets are logically organized into fixed size packets called Transport Stream
(TS) packets, which are 188 bytes long. These packets are continuously generated
to form a transport stream, which is decoded by the receiver and the original
elementary streams are reconstructed. The PES packets that are logically
encapsulated into the TS packets contain the time stamp information, which is
used at the de-multiplexer to achieve synchronization between the audio and
video elementary streams.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS……………………………………………………………………………… iii
ABSTRACT…………………………………………………………………………………………………. iv
LIST OF FIGURES…………………………………………………………………………………….... ix
LIST OF TABLES…………………………………………………………………………………………. xi
ACRONYMS AND ABBREVIATIONS…………………………………………………………….. xii
Chapter
1. INTRODUCTION…………………………………………………………………………………… 1
2. OVERVIEW OF H.264……………………………………………………………………………. 2
2.1 H.264/ AVC………………………………………………………………………………. 2
2.2 Coding structure……………………………………………………………………….. 2
2.3 Profiles and levels……………………………………………………………………. 3
2.4 Description of various profiles ………………………………………………… 4
2.4.1 Baseline Profile…………………………………………………………... 4
2.4.2 Extended profile………………………………………………………….. 5
2.4.3 Main Profile………………………………………………………………… 5
2.4.4 High Profiles………………………………………………………………… 5
2.5 H.264 encoder and decoder……………………………………………………. 6
2.5.1 Intra prediction…………………………………………………………… 8
2.5.2 Inter prediction…………………………………………………………… 9
2.5.3 Transform and quantization………………………………………… 10
2.5.4 Entropy coding…………………………………………………………… 10
2.5.5 Deblocking filter………………………………………………………… 11
2.6 H.264 bitstream……………………………………………………………………. 11
3. OVERVIEW OF HE AAC V2…………………………………………………………………. 16
3.1 HE AAC v2……………………………………………………………………………… 16
3.2 Spectral Band Replication (SBR)……………………………………………. 18
3.3 Parametric Stereo (PS)………………………………………………………… 19
3.4 Enhanced aacplus encoder………………………………………………….. 20
3.5 Enhanced aacplus decoder………………………………………………….. 22
3.6 Advanced Audio Coding (AAC)……………………………………………. 23
3.6.1 AAC encoder…………………………………………………………. 23
3.7 HE AAC v2 bitstream formats…………………………………………….. 27
4. TRANSPORT PROTOCOLS………………………………………………………………. 30
4.1 Introduction……………………………………………………………………… 30
4.2 Real-Time protocol (RTP)…………………………………………………… 30
4.3 MPEG2 systems layer…………………………………………………………. 31
4.4 Packetized elementary stream (PES)………………………………….. 32
4.4.1 PES encapsulation process………………………………………………. 34
4.5 MPEG Transport stream (MPEG- TS)…………………………………. 35
4.6 Time stamps……………………………………………………………………… 38
5. MULTIPLEXING…………………………………………………………………………. 42
6. DE MULTIPLEXING……………………………………………………………………. 48
6.1 Lip or audio-video synchronization………………………………… 51
7. RESULTS………………………………………………………………………………….. 55
7.1 Buffer fullness…………………………………………………………….. 55
7.2 Synchronization/skew calculation……………………………….. 56
8. CONCLUSIONS …………………………………………………………………………. 59
9. FUTURE WORK………………………………………………………………………… 59
References…………………………………………………………………………………. 60
LIST OF FIGURES
Fig 2.1. Video data organization in H.264 [42].
Fig 2.2. Specific coding parts of the profiles in H.264 [5].
Fig 2.3. Different YUV systems.
Fig 2.4. H.264 encoder [5].
Fig 2.5. H.264 decoder [5].
Fig 2.6. Intra prediction modes for 4X4 luma in H.264.
Fig 2.7. Different layers of JVT coding.
Fig 2.8. NAL formatting of VCL and non-VCL data [6].
Fig 2.9. NAL unit format [6].
Fig 2.10. Relationship between parameter sets and picture slices [24].
Fig 3.1. HE AAC audio codec family.
Fig 3.2. Typical bitrate ranges of HE-AAC v2, HE-AAC and AAC for stereo [7].
Fig 3.4. Original audio signal [28].
Fig 3.5. High band reconstruction through SBR [28].
Fig 3.6. Enhanced aacplus encoder block diagram [9].
Fig 3.7. Enhanced aacplus decoder block diagram [9].
Fig 3.8. AAC encoder block diagram [10].
Fig 3.9. ADTS elementary stream.
Fig 4.1. RTP packet structure (simplified) [22].
Fig 4.2. MPEG2 transport stream [22]
Fig 4.3. Conversion of an elementary stream into PES packets [29]
Fig 4.4. A standard MPEG-TS packet structure [14].
Fig 4.5. Transport stream (TS) packet format used in this project.
Fig 5.1. Overall multiplexer flow diagram.
Fig 5.2. Flow chart of video processing block.
Fig 5.3. Flow chart of audio processing block.
Fig 6.1. Flow chart for the de-multiplexer used.
LIST OF TABLES
Table 2.1. NAL unit types.
Table 3.1. ADTS header format [2] [3].
Table 3.2. Profile bits expansion [2] [3].
Table 4.1. PES packet header format used [4].
Table 7.1. Video and audio buffer sizes and their respective playback times.
Table 7.2. Characteristics of test clips used.
Table 7.3. Demultiplexer output.
ACRONYMS AND ABBREVIATIONS
3GPP Third generation partnership project
AAC Advanced audio coding
AAC LTP Advanced audio coding - long term prediction
ADIF Audio data interchange format
ADTS Audio data transport stream
AFC Adaptation field control
ATM Asynchronous transfer mode
ATSC Advanced television systems committee
ATSC-M/H Advanced television systems committee- mobile /handheld
AVC Advanced Video Coding
CABAC Context-based Adaptive Binary Arithmetic Coding
CAVLC Context-based Adaptive Variable Length Coding
CC Continuity counter
CIF Common intermediate format
CRC Cyclic redundancy check
DCT Discrete cosine transform
DPCM Differential pulse code modulation
DVB Digital video broadcasting
DVD Digital video disc
EI Error indicator
ES Elementary stream
FMO Flexible macroblock ordering
GOP Group of pictures
HDTV High definition television
HE AACv2 High efficiency advanced audio codec version 2
IC Inter-channel Coherence
IDR Instantaneous decoder refresh
IID Inter-channel Intensity Differences
IP Internet protocol
IPD Inter-channel Phase Differences
ISDB Integrated Services Digital Broadcasting system
I-slice Intra predictive slice
ISO International Organization for Standardization
ITU International Telecommunication Union
JM Joint model
JVT Joint Video Team
M4A Moving picture experts group four file format audio only
MB Macro blocks
MC Motion compensation
MDCT Modified discrete cosine transform
ME Motion estimation
MP4 Moving picture experts group four file format
MPEG Moving Picture Experts Group
MPTS Multi program transport stream
NALU Network Abstraction Layer Unit
OPD Overall Phase Difference
PCR Program clock reference
PES Packetized elementary stream
PID Packet identifier
PMT Program map table
PPS Picture parameter set
PS Parametric stereo
P-slice Predictive slice
PTS Presentation time stamp
PUSI Payload unit start Indicator
QCIF Quarter common intermediate format
QMF Quadrature mirror filter banks
RTP Real-time transport protocol
SBR Spectral band replication
SCI Simplified chrominance intra prediction
S-DMB Satellite - Digital multimedia broadcasting
SDTV Standard definition television
SEI Supplemental enhancement information
SPS Sequence parameter set
SPTS Single program transport stream
STC System timing clock
TCP Transmission control protocol
TNS Temporal noise shaping
TS Transport stream
UDP User datagram protocol
VC1 Video Codec1
VCEG Video Coding Experts Group
VCL Video coding layer
VLC Variable length coding
VUI Video usability information
YUV Luminance and chrominance color components
CHAPTER 1
INTRODUCTION
Mobile broadcast systems are increasingly important as cellular phones and
highly efficient digital video compression techniques merge to enable digital TV
and multimedia reception on the move. There are several digital mobile TV
broadcast systems in the market. Major ones are DVB-H (digital video broadcast-
handheld) [16] and ATSC-M/H (advanced television systems committee-
mobile/handheld) [17] [18]. Both DVB-H and ATSC-M/H have relatively small
channel bandwidth allocations (~14 Mbps for DVB-H and ~19.6 Mbps for ATSC-
M/H), so the choice of multimedia compression standard and transport protocol
becomes very important. DVB-H specifies the use of either VC1 [19] or H.264 [1]
compression standard for video and AAC [3] audio. ATSC-M/H specifies H.264
baseline profile for video and HEAACv2 [2] for audio. The transport protocol is
usually RTP (real time protocol) [20] or MPEG2 part 1 systems [4].The MPEG-2
systems specification [4] describes how MPEG-compressed video and audio data
streams may be multiplexed together with other data to form a single data
stream suitable for digital transmission or storage. Two alternative streams are
specified for the MPEG-2 systems layer. The program stream is used for storage of
multimedia content like DVD while the transport stream is intended for the
simultaneous delivery of a number of programs over potentially error-prone
channels.
In this project, the compression standards used are the H.264 baseline profile for
video and HE AAC v2 for audio, as specified by the ATSC-M/H standard.
Distribution is achieved through the MPEG-2 part 1 systems transport stream.
Chapters 2 and 3 give a brief overview of the H.264 and HE AAC v2 compression
standards respectively. Chapter 4 explains the transport stream
protocol used in this project. Chapters 5 and 6 explain the multiplexing and de-
multiplexing schemes used in this project.
All the results are tabulated in chapter 7. Chapters 8 and 9 outline the conclusions
of the project and future work respectively.
CHAPTER 2
OVERVIEW OF H.264
2.1 H.264/ AVC
H.264 is the latest and most advanced video codec available today. It
was jointly developed by the VCEG (video coding experts group) of the ITU-T
(international telecommunication union) and the MPEG (moving picture experts
group) of ISO/IEC (international organization for standardization). This standard
achieves much greater compression than its predecessors like MPEG-2 video [37]
and MPEG4 part 2 [38], but the higher coding efficiency comes at the cost of
increased complexity. H.264 has been adopted as the video standard for many
applications around the world, including ATSC. The H.264 baseline profile with
some restrictions is the adopted standard for ATSC-M/H or ATSC mobile DTV [40].
2.2 Coding structure:
The basic coding structure of H.264 is similar to that of the earlier standards
(MPEG-1, MPEG-2) and is commonly referred to as motion-compensated
transform coding structure. Coding of video is performed picture by picture. Each
picture is partitioned into a number of slices, which is a sequence of macroblocks.
Each slice is coded independently in H.264. It is possible that a picture can have
just one slice. A macroblock (MB) consists of 16X16 luminance (y) component and
associated two chrominance ( ) components. Each macroblock’s 16X16
luminance can be partitioned into 16 X 16, 16 X 8, 8 X 16, and 8 X 8 units, and
further, each 8 X 8 luminance can be sub-partitioned into 8 X 8, 8 X 4, 4 X 8 and 4
X 4. The 4 X 4 sub-macroblock partition is called a block. The hierarchy of video
data organization is shown in figure 2.1.
Fig 2.1. Video data organization in H.264 [42].
There are basically three types of slices: I (intra predictive), P (predictive) and B
(bipredictive) slices. I slices are strictly intra coded, i.e. macroblocks (MBs) are
compressed without using any motion prediction from earlier slices. A special
type of picture containing only I slices is called an instantaneous decoder refresh
(IDR) picture. Any picture following the IDR picture does not use the pictures prior
to the IDR for its motion prediction. IDR pictures can be used for random access
or as entry points into a coded sequence [6]. P slices on the other hand contain
macroblocks which use motion prediction. The MBs of a P slice can use only one
frame as reference (either from the past or the future) for their motion prediction.
2.3 Profiles and levels:
Profiles and levels specify restrictions on bit streams and hence limits on
the capabilities needed to decode the bit streams. Profiles and levels may also be
used to indicate interoperability points between individual decoder
implementations. For any given profile, levels generally correspond to decoder
processing load and memory capability. Each level may support a different picture
size - QCIF, CIF, ITU-R 601 (SDTV), HDTV etc. Each level also sets limits on the data
bitrate, frame size, picture buffer size, etc. [5].
H.264/AVC is purely a video codec with as many as seven profiles. Three
profiles, baseline, main and extended, were included in its first release. Four new
profiles were added in subsequent releases, defined in the fidelity range
extensions, for applications such as content distribution, content contribution,
and studio editing and post-processing [5]. The profiles with their specific tools
and common features are shown in fig 2.2.
Fig 2.2. Specific coding parts of the profiles in H.264 [5].
It can be noted that I-slice, P-slice and CAVLC (Context-based Adaptive Variable
Length Coding) entropy coding are common to all the profiles.
2.4 Description of various profiles:
2.4.1 Baseline Profile:
The baseline profile supports coded sequences containing I and P slices. Apart
from the common features, the baseline profile includes some error resilience
tools such as Flexible Macroblock Ordering (FMO), arbitrary slice order and redundant
slices. It was designed for low delay applications, as well as for applications that
run on platforms with low processing power and in high packet loss environments.
Among the three profiles of the first release, it offers the least coding efficiency [6].
The baseline profile caters to applications such as video conferencing and mobile
television broadcast. This project uses the baseline profile for video encoding
since it is specified by ATSC for mobile digital television.
2.4.2 Extended profile:
The extended profile is a superset of the baseline profile. Besides the tools of
the baseline profile it includes B, SP and SI slices, data partitioning, and interlace
coding tools. SP and SI slices are specially coded P and I slices respectively, which
are used for efficient switching between different bitrates in some streaming
applications. It does not, however, include CABAC. The extended profile is thus
more complex but also provides better coding efficiency. Its intended application
is streaming video over the internet [6].
2.4.3 Main Profile:
Other than the common features, the main profile includes tools such as CABAC
entropy coding and B slices. It does not include error resilience tools such as
FMO. The main profile is used in broadcast television and high resolution video
storage and playback. Like the extended profile, it also contains interlaced coding
tools.
2.4.4 High Profiles:
High profiles are supersets of the main profile. They include additional tools such
as adaptive transform block size and quantization scaling matrices. High profiles are
used for applications such as content-contribution, content-distribution, and
studio editing and post-processing [5]. Four different high profiles are described
below:
High Profile - supports 8-bit video with 4:2:0 sampling for applications using
high resolution.
High 10 Profile – supports the 4:2:0 sampling with up to 10 bits of representation
accuracy per sample.
High 4:2:2 Profile - supports up to 4:2:2 chroma sampling and up to 10 bits per
sample.
High 4:4:4 Profile – supports up to 4:4:4 chroma sampling, up to 12 bits per
sample, and integer residual color transform for coding RGB signal. Different YUV
formats are shown in fig 2.3.
Fig 2.3. Different YUV systems.
For any given profile, a level corresponds to various data bit rates, frame size,
picture buffer size, etc.
2.5 H.264 encoder and decoder:
The H.264 encoder follows a classic DPCM encoding loop. The encoder may
select between various inter- or intra-prediction modes. Intra coding uses up to
nine prediction modes to reduce spatial redundancy within a single picture. Inter
coding is more efficient than intra coding and is used in P and B frames. Inter
coding uses motion vectors for block-based inter prediction to reduce temporal
redundancy among different pictures [5]. The de-blocking filter is used to reduce
blocking artifacts. The predicted signal is subtracted from the input sequence to
obtain a residual, which is further compressed by applying an integer transform;
this removes the spatial correlation between the pixels. The resulting signal is
given to the quantization block. Finally the quantized transform coefficients,
motion vectors, intra prediction modes, control data etc. are given to the entropy
coding block. There are basically two types of entropy encoders in H.264: CAVLC
(context adaptive variable length coding) and CABAC (context adaptive binary
arithmetic coding). The encoder and decoder for H.264 are shown in figures 2.4
and 2.5 respectively.
Fig 2.4. H.264 encoder [5].
The decoder operates in the reverse manner, taking in the encoded bitstream
and entropy decoding it. The result is then given to the inverse quantization and
inverse transform blocks.
Fig 2.5. H.264 decoder [5]
2.5.1 Intra prediction:
In the intra-coded mode, a prediction block is formed based on previously
reconstructed (but not yet deblocking-filtered) blocks of the same frame. The
residual signal between the current block and the prediction is then encoded.
All macroblocks in an I slice are intra coded. Macroblocks in P and B slices having
insufficient temporal correlation are also intra coded. Intra-coded macroblocks
introduce a large number of coded bits, which is a bottleneck for reducing the
bitrate. For the luma samples, the prediction block may be formed for each 4X4
sub-block, each 8X8 block, or for a 16X16 macroblock. There are a total of nine
prediction modes for 4X4 and 8X8 luma blocks, four modes for a 16X16 luma
block, and four modes for each chroma block. Figure 2.6 shows the intra
prediction modes for 4X4 luma.
For mode 0 (vertical) and mode 1 (horizontal), the predicted samples are formed
by extrapolation from upper samples [A, B, C, D] and from left samples [I, J, K, L]
respectively. For mode 2 (DC), all of the predicted samples are formed by the
mean of the upper and left samples [A, B, C, D, I, J, K, L]. For mode 3 (diagonal
down left), mode 4 (diagonal down right), mode 5 (vertical right), mode 6
(horizontal down), mode 7 (vertical left), and mode 8 (horizontal up), the
predicted samples are formed from a weighted average of the prediction samples
A–M.
Fig 2.6. Intra prediction modes for 4X4 luma in H.264 [39].
For prediction of each 8X8 luma block, one mode is selected from the nine modes,
similar to 4X4 intra block prediction. For prediction of the whole 16X16 luma
component of a macroblock, four modes are available. For mode 0 (vertical),
mode 1 (horizontal) and mode 2 (DC), the predictions are similar to the 4X4 luma
cases. For mode 3 (plane), a linear plane function is fitted to the upper and
left samples.
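As a concrete illustration of the simplest of these modes, the following C sketch computes mode 2 (DC) prediction for a 4X4 luma block; the function name is hypothetical and the handling of unavailable neighbors (blocks at picture or slice edges) is omitted:

#include <stdint.h>

/* Sketch of 4X4 intra DC prediction (mode 2). up[0..3] are the upper
   neighbors (samples A-D) and left[0..3] the left neighbors (samples I-L),
   assumed already reconstructed. Every predicted sample is the rounded
   mean of the eight neighbors. */
static void intra4x4_dc_predict(const uint8_t up[4], const uint8_t left[4],
                                uint8_t pred[4][4])
{
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += up[i] + left[i];
    uint8_t dc = (uint8_t)((sum + 4) >> 3);   /* mean of 8 samples, rounded */

    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            pred[r][c] = dc;                  /* flat DC prediction block */
}

The encoder would subtract this prediction from the current block and encode the residual.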
2.5.2 Inter prediction:
This block includes both Motion Estimation (ME) and Motion Compensation
(MC). It generates a predicted version of a rectangular array of pixels, by choosing
another similarly sized rectangular array of pixels from a previously decoded
reference picture, translating the reference array to the position of the
current rectangular array [6]. In H.264, the block sizes for motion prediction
include: 4X4, 4X8, 8X4, 8X8, 16X8, 8X16, and 16X16 pixels (shown in figure 2.1).
Inter-prediction of a sample block can also involve the selection of the frames to
be used as the reference pictures from a number of stored previously decoded
pictures. Reference pictures for motion compensation are stored in the picture
buffer. With respect to the current picture, pictures both before and after it in
display order are stored in the picture buffer. These are
classified as short-term and long-term reference pictures. Long-term reference
pictures are introduced to extend the motion search range by using multiple
decoded pictures, instead of using just one decoded short-term picture. Memory
management is required to mark some stored pictures as unused and to decide
which pictures to delete from the buffer [5].
2.5.3 Transform and quantization:
The residual signal (prediction error) still has high spatial redundancy.
AVC, like its predecessors, uses a block based transform (integer DCT) and
quantization to reduce this spatial redundancy. H.264 uses an adaptive transform
block size, 4X4 and 8X8 (the latter for high profiles only). The smaller block size
reduces ringing artifacts. Also, the 4X4 transform has the additional benefit of
removing the need for multiplications [5]. For improved compression H.264 also
employs a 4X4 Hadamard transform for the DC components of the 4X4 (DCT)
transforms in the case of the luma 16X16 intra mode and a 2X2 Hadamard
transform for the chroma DC coefficients.
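To make the transform concrete, the following sketch implements the standard H.264 4X4 forward core transform W = C X C^T (the per-coefficient scaling that H.264 folds into quantization is omitted); note that only additions, subtractions and multiplications by 2 are needed:

/* Sketch of the H.264 4X4 forward integer core transform W = C X C^T with
   C = [ 1  1  1  1 ]
       [ 2  1 -1 -2 ]
       [ 1 -1 -1  1 ]
       [ 1 -2  2 -1 ]
   applied first down the columns, then across the rows. */
static void fwd_transform4x4(const int x[4][4], int w[4][4])
{
    int t[4][4];
    for (int c = 0; c < 4; c++) {            /* vertical pass: t = C x */
        int s0 = x[0][c] + x[3][c], s1 = x[1][c] + x[2][c];
        int d0 = x[0][c] - x[3][c], d1 = x[1][c] - x[2][c];
        t[0][c] = s0 + s1;
        t[1][c] = 2 * d0 + d1;
        t[2][c] = s0 - s1;
        t[3][c] = d0 - 2 * d1;
    }
    for (int r = 0; r < 4; r++) {            /* horizontal pass: w = t C^T */
        int s0 = t[r][0] + t[r][3], s1 = t[r][1] + t[r][2];
        int d0 = t[r][0] - t[r][3], d1 = t[r][1] - t[r][2];
        w[r][0] = s0 + s1;
        w[r][1] = 2 * d0 + d1;
        w[r][2] = s0 - s1;
        w[r][3] = d0 - 2 * d1;
    }
}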
2.5.4 Entropy coding:
The predecessors of H.264 (MPEG-1 and MPEG-2) used entropy coding
based on fixed tables of variable length codes (VLC). H.264 uses different VLCs
to match a symbol to a code based on the context characteristics. All syntax
elements except the residual data are encoded with Exp-Golomb codes [5].
For coding the residual data, a more sophisticated method called CAVLC (context
based adaptive variable length coding) is employed. CABAC (context based
adaptive binary arithmetic coding) is employed in the main and high profiles;
CABAC achieves better coding efficiency, but with higher complexity compared to
CAVLC [1].
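As a small illustration of the Exp-Golomb codes mentioned above, the sketch below decodes an unsigned Exp-Golomb value ue(v): a codeword of n leading zero bits, a one bit, and n information bits decodes to 2^n - 1 + info. The bit reader is a minimal assumption with no end-of-buffer checking:

#include <stdint.h>
#include <stddef.h>

typedef struct { const uint8_t *buf; size_t bitpos; } BitReader;

static unsigned read_bit(BitReader *br)
{
    unsigned bit = (br->buf[br->bitpos >> 3] >> (7 - (br->bitpos & 7))) & 1;
    br->bitpos++;
    return bit;
}

/* Decode one unsigned Exp-Golomb codeword ue(v). */
static unsigned read_ue(BitReader *br)
{
    int zeros = 0;
    while (read_bit(br) == 0)      /* count leading zero bits */
        zeros++;
    unsigned info = 0;
    for (int i = 0; i < zeros; i++)
        info = (info << 1) | read_bit(br);
    return (1u << zeros) - 1 + info;
}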
2.5.5 Deblocking filter:
H.264 based systems may suffer from blocking artifacts due to block-based
transform in intra and inter-prediction coding, and the quantization of the
transform coefficients. The deblocking filter reduces the blocking artifacts at the
block boundaries and, since the filter is present in the DPCM loop, prevents the
propagation of accumulated coding noise.
For this project, the H.264 baseline profile at level 1.3 is used, as specified
for ATSC mobile digital television [41]. The resolution of the video sequence is
416 pixels X 240 lines (aspect ratio 16:9).
2.6 H.264 bitstream:
H.264 video syntax is organized into two layers: the Video Coding Layer
(VCL), which consists of the video data (the slice layer and below), and the
Network Abstraction Layer (NAL), which formats the VCL representation of the
video and provides the header information. The NAL also provides additional
non-VCL information like sequence parameter sets, picture parameter sets,
Supplemental Enhancement Information (SEI), etc., so that the stream may be
used in a variety of transport systems like the MPEG2 transport stream, IP/RTP
systems, etc., or on storage media like ISO file formats. Figure 2.7 shows the
different layers of JVT (joint video team) coding. Figure 2.8 shows the NAL
formatting of VCL and non-VCL data.
Fig 2.7. Different layers of JVT coding [1].
Fig 2.8. NAL formatting of VCL and non-VCL data [6].
The H.264 bit stream is encapsulated into packets called NAL units (NALUs).
NALUs are separated by the 4 byte sequence 0x00000001. The byte following
this sequence is the NAL header and the rest is a variable byte
length raw byte sequence payload (RBSP). The NAL header/unit format is shown
in figure 2.9. The first bit of the NAL header, called the forbidden bit, is always
zero. The next two bits (nal_ref_idc) indicate whether the NALU contains a
sequence parameter set, a picture parameter set or slice data of a reference
picture. The next five bits give the NALU type (type indicator), depending upon
the type of data being carried by that NALU. There are 32 different types of
NALUs. These may be classified into VCL NALUs and non-VCL NALUs, depending
upon the type of data they carry.
Fig 2.9. NAL unit format [6].
If the type indicator is in the range 1 to 5, it is a VCL NALU; otherwise it is a
non-VCL NALU. The different types of NALUs are listed in table 2.1.
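The following C sketch shows how this structure can be parsed; it assumes the 4 byte start codes used in this project (it ignores the 3 byte start codes that Annex B streams may also contain):

#include <stdint.h>
#include <stddef.h>

/* Return the offset of the NAL header byte following the next 4 byte
   start code 0x00000001 at or after `from`, or -1 if none is found. */
static long find_nal_unit(const uint8_t *bs, size_t len, size_t from)
{
    for (size_t i = from; i + 4 < len; i++) {
        if (bs[i] == 0 && bs[i + 1] == 0 && bs[i + 2] == 0 && bs[i + 3] == 1)
            return (long)(i + 4);
    }
    return -1;
}

/* Split the one-byte NAL header: forbidden bit (1), nal_ref_idc (2),
   nal_unit_type (5). Types 1-5 are VCL NALUs; 7 is SPS, 8 is PPS. */
static void parse_nal_header(uint8_t hdr, int *ref_idc, int *type)
{
    *ref_idc = (hdr >> 5) & 0x03;
    *type    = hdr & 0x1F;
}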
NALU types 1-5 are VCL NAL units containing coded VCL data. The rest of the
NALUs are non-VCL NAL units that contain information such as SEI, sequence
parameter sets, picture parameter sets etc. Of these, IDR pictures, the sequence
parameter set and the picture parameter set are particularly important.
An instantaneous decoder refresh (IDR) picture is a picture that is placed at
the beginning of the video sequence. When the decoder receives an IDR picture,
all reference information is refreshed, which indicates a new coded video
sequence; frames prior to the IDR frame are not required to decode the new
sequence.
The sequence parameter set contains important header information that
applies to all NALUs in the coded sequence. The picture parameter set contains
important header information that is used for decoding one or more frames in the
coded sequence.
Type indicator    NALU type
0                 unspecified
1                 coded slice
2                 data partition A
3                 data partition B
4                 data partition C
5                 IDR (instantaneous decoder refresh)
6                 SEI (supplemental enhancement information)
7                 sequence parameter set
8                 picture parameter set
9                 access unit delimiter
10                end of sequence
11                end of stream
12                filler data
13-23             extended
24-31             undefined

Table 2.1. NAL unit types [6].
The relationship between the parameter sets and the slice data is shown in
fig 2.10. Each VCL NAL unit contains an identifier that refers to the content of the
relevant Picture Parameter Set (PPS) and each PPS contains an identifier that
refers to the content of the relevant Sequence Parameter Set (SPS). In this
manner, a small amount of data (the identifier) is used to refer to a larger amount
of information (the parameter set) without repeating that information within
each VCL NAL unit. Sequence and picture parameter sets can be sent well ahead
of the VCL NAL units to which they apply, and can be repeated to provide
robustness against data loss. In some applications, parameter sets may be sent
within the channel that carries the VCL NAL units (termed "in-band"
transmission). In other applications, it can be advantageous to convey the
parameter sets "out of band" using a more reliable transport mechanism than the
video channel itself. By using this mechanism, H.264 can transmit multiple video
sequences (with different parameters) in a single bitstream.
Fig 2.10. Relationship between parameter sets and picture slices [24].
The important information carried by the SPS includes the profile/level indicator,
decoding or playback order, frame size, number of reference frames, and Video
Usability Information (VUI) such as aspect ratio, color space details etc. The SPS
remains the same for the entire coded video sequence. Important information
carried by the PPS includes the entropy coding scheme used, macroblock
reordering, quantization parameters, a flag to indicate whether inter predicted
MBs can be used for intra prediction etc. The PPS remains unchanged within a
coded picture.
This chapter provided an overview of H.264. Various profiles, encoder,
decoder and the H.264 bit stream format were discussed in detail. An overview of
the HE AAC v2 audio codec is presented in the next chapter.
CHAPTER 3
OVERVIEW OF HE AAC V2
3.1 HE AAC v2
High efficiency advanced audio codec version 2, also known as enhanced
aacplus, is a low bit rate audio codec defined in the MPEG4 audio standard [2]
and belonging to the AAC family. It is specifically designed for low bit rate
applications such as streaming.
HE AAC v2 is among the most efficient audio compression tools available
today. It comes with a fully featured toolset which enables coding in mono,
stereo and multichannel modes (up to 48 channels). Apart from ATSC [17],
enhanced aacplus is already the audio standard in various applications and
systems around the world. In Asia, HE-AAC v2 is the mandatory audio codec for
the Korean Satellite Digital Multimedia Broadcasting (S-DMB) [25] technology and
is optional for Japan’s terrestrial Integrated Services Digital Broadcasting system
(ISDB) [26]. HE-AAC v2 is also a central element of the 3GPP and 3GPP2 [27]
specifications and is applied in multiple music download services over 2.5 and 3G
mobile communication networks. Other adopters include XM satellite radio (the
digital satellite broadcasting service in the USA) and HD Radio (the terrestrial
digital broadcasting system of iBiquity Digital, USA) [7].
HE AAC v2 is a combination of three technologies: AAC (advanced audio
coding), SBR (spectral band replication) and PS (parametric stereo). All three
technologies are defined in the MPEG4 audio standard [2]. The combination of
AAC and SBR is called HE-AAC or aacplus. AAC is a general audio codec, SBR is a
bandwidth extension technique offering substantial coding gain in combination
with AAC, and Parametric Stereo (PS) enables stereo coding at very low bitrates.
Figure 3.1 shows the family of AAC audio codecs.
Fig 3.1. HE AAC audio codec family [7].
Figure 3.2 shows the typical bitrate ranges for stereo plotted against the
perceptual quality factor for all three forms of the codec. It can be seen that
HE AAC v2 provides the best quality at the lowest bitrates.
Fig 3.2. Typical bitrate ranges of HE-AAC v2, HE-AAC and AAC for stereo [7].
3.2 Spectral Band Replication (SBR):
SBR [2] is a bandwidth expansion technique; it has emerged as one of the
most important tools that have led to the development of audio coding
technology.
SBR exploits the correlation that exists between the energy of the audio
signal at high and low frequencies, also referred to as the high and low bands. It
is also based on the fact that the psychoacoustic importance of the high band is
relatively low. SBR uses a guided technique called transposition to predict the
high band energies from the low band. Beyond transposition, the reconstruction
of the high band is aided by transmitting guiding information such as the spectral
envelope of the original signal, prediction error, etc. These are referred to as SBR
data. The original and the high band reconstructed audio signals are shown in
figures 3.4 and 3.5 respectively.
Fig 3.4. Original audio signal [28].
Fig 3.5. High band reconstruction through SBR [28].
SBR has enabled high-quality stereo sound at bitrates as low as 48 kbps. SBR was
conceived as a bandwidth extension tool to be used along with AAC. It was
adopted into the MPEG4 standard in March 2004 [2].
3.3 Parametric Stereo (PS):
Parametric stereo coding is a technique to efficiently code a stereo audio
signal as a monaural signal plus a small amount of stereo parameters. The
monaural signal can be encoded using any audio coder. The stereo parameters
can be embedded in the ancillary part of the mono bit stream, preserving
backward compatibility with mono decoders. In the decoder, the monaural signal
is decoded first, after which the stereo signal is reconstructed from the stereo
parameters.
PS coding has led to a high quality stereo sound reconstruction at relatively
low bitrates. In the parametric approach, the audio signal or stereo image is
separated into its transient, sinusoid, and noise components. Next, each
component is re-represented via parameters that drive a model for the signal,
rather than the standard approach of coding the actual signal itself. PS uses three
types of parameters to describe the stereo image:
Inter-channel Intensity Differences (IID): describes the intensity differences
between the channels.
Inter-channel Phase Differences (IPD): describes the phase differences between
the channels and
Inter-channel Coherence (IC): describes the coherence between the channels.
The coherence is measured as the maximum of the cross-correlation as a function
of time or phase.
In principle, these three parameters allow for a high quality reconstruction of the
stereo image. However, the IPD parameters only specify the relative phase
differences between the channels of the stereo input signal. They do not
prescribe the distribution of these phase differences over the left and right
channels. Hence, a fourth type of parameter is introduced, describing an overall
phase offset or Overall Phase Difference (OPD). In order to reconstruct the stereo
image, the PS decoder performs a number of operations, consisting of scaling
(IID), phase rotations (IPD/OPD) and decorrelation (IC).
3.4 Enhanced aacplus encoder:
Figure 3.6 shows the complete block diagram of the enhanced aacplus
encoder. The input PCM time domain signal (raw audio signal) is first fed to a
stereo-to-mono down mix unit, which is only applied if the input signal is stereo
but the chosen audio encoding mode is selected to be mono.
The (mono or stereo) input time domain signal is fed to an IIR resampling
filter in order to adjust the input sampling frequency to the best-suited
sampling rate for the encoding process. The IIR resampler block is used only if
the input signal sampling rate differs from the encoding sampling
rate. The IIR resampler may either be run as a 3:2 downsampler (e.g. to
downsample from 48 kHz to 32 kHz) or as a 1:2 upsampler (e.g. to upsample from
16 to 32 kHz). The QMF filter bank (part of SBR) is used to derive the spectral
envelope of the original signal. This envelope data along with some other error
information forms the SBR stream.
The enhanced aacplus encoder basically consists of the well-known AAC
encoder, the SBR high band reconstruction encoding tool and the Parametric
Stereo (PS) encoding tool. The enhanced aacplus encoder is operated in a dual
frequency mode: the SBR encoder unit operates at the encoding sampling
frequency fs (as delivered from the IIR resampler) and the AAC encoder unit at
half of this sampling rate (fs/2). Consequently a 2:1 downsampler is present at the
input to the AAC encoder. For an efficient implementation an IIR (Infinite Impulse
Response) filter is used. The parametric stereo tool is used for low-bitrate stereo
coding, i.e. up to and including a bitrate of 44 kbps [9].
Fig 3.6. Enhanced aacplus encoder block diagram [9].
The SBR encoder consists of a QMF (Quadrature Mirror Filter) analysis filter
bank, which is used to derive the spectral envelope of the original input signal.
This spectral envelope data along with transposition information forms the SBR
stream.
For stereo bitrates at and below 44 kbps, the parametric stereo encoding
tool in the enhanced aacplus encoder is used. For stereo bitrates above 44 kbps,
normal stereo operation is performed. The parametric stereo encoding tool
estimates parameters characterizing the perceived stereo image of the input
signal. These stereo parameters are embedded in the SBR stream.
[Figure 3.6 block labels: input PCM samples; stereo-to-mono downmix (usage dependent on audio mode); IIR resampler; analysis QMF bank; envelope estimation; parametric stereo estimation (incl. downmix); downsampled synthesis QMF bank; 2:1 IIR downsampler; AAC core encoder; SBR-related modules; bitstream payload formatter; coded audio stream.]
The embedding of the SBR stream (including the parametric stereo data) into the
AAC stream is done in a backwards compatible way, i.e. legacy AAC decoders can
parse the enhanced aacplus stream and decode the AAC core part.
3.5 Enhanced aacplus decoder:
Figure 3.7 shows the entire block diagram of an enhanced aacplus decoder.
In the decoder the bitstream is de-multiplexed into the AAC and the SBR stream.
Error concealment, e.g. in the case of frame loss, is achieved by designated
algorithms in the decoder for AAC, SBR and parametric stereo.
The low band AAC time domain signal, sampled at fs/2, is first fed to a
32-channel QMF analysis filter bank. The QMF low-band samples are then used to
generate a high-band signal, with the transmitted transposition guidance
information used to best match the original input signal characteristics.
The transposed high band signal is then adjusted according to the
transmitted spectral envelope signal to best match the original’s spectral
envelope. Also, missing components that could not be reconstructed by the
transposition process are introduced. Finally, the lowband and the reconstructed
highband are combined to obtain the complete output signal in the QMF domain.
In the case of a stream using parametric stereo, the mono output signal
from the underlying aacplus decoder is converted into a stereo signal. This
process is carried out in the QMF domain and is controlled by the parametric
stereo parameters embedded in the SBR stream.
A 64-channel QMF synthesis filter bank is used to obtain the time domain output
signal, sampled at the encoding sampling rate fs. The synthesis filter bank may
also be used to apply an implicit downsampling by a factor of 2, resulting in an
output sampling rate of fs/2.
Fig 3.7. Enhanced aacplus decoder block diagram [9].
3.6 Advanced Audio Coding (AAC)
3.6.1 AAC encoder:
The AAC encoder acts as the core encoding algorithm of the enhanced
aacplus system, encoding at half the sampling rate of aacplus. In the case of SBR
being used, the maximum AAC sampling rate is restricted to 24 kHz whereas if
SBR is not used, the maximum AAC sampling rate is restricted to 48 kHz [9].
Figure 3.8 shows the block diagram of a core AAC encoder. The various blocks
in the encoder are explained below.
Stereo preprocessing: In this block, the stereo width of signals that are difficult to
encode at low bitrates is reduced (attenuated). Stereo preprocessing is active for
bitrates less than 60 kbps. The lower the bitrate, the more attenuation of the
side channel takes place.
[Figure 3.7 block labels: coded audio stream; bitstream payload demultiplexer; AAC core decoder (incl. error concealment); analysis QMF bank; HF generation (SBR), driven by guidance information and SBR error concealment; envelope adjustment; merge lowband and highband; parametric stereo synthesis, driven by stereo parameters and PS error concealment; synthesis QMF bank; output PCM samples.]
Filter bank: The encoder breaks the raw audio signal down into segments known
as blocks. The Modified Discrete Cosine Transform (MDCT), given in (3.1), is
applied to overlapping blocks to maintain a smooth transition from block to
block. AAC dynamically switches between two block sizes, 2048 samples and
256 samples, referred to as long blocks and short blocks respectively. AAC also
switches between two different types of long block windows, sine and
Kaiser-Bessel Derived (KBD), according to the complexity of the signal.
X(k) = 2 \sum_{n=0}^{N-1} x(n) \cos\left( \frac{2\pi}{N} (n + n_0) \left(k + \frac{1}{2}\right) \right), \qquad 0 \le k < N/2    (3.1)

where N is the block length, x(n) is the windowed input block and n_0 = (N/2 + 1)/2.
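A direct, unoptimized C rendering of (3.1) follows; production encoders evaluate the MDCT with FFT-based fast algorithms, so this O(N^2) loop is meant only to mirror the equation:

#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Naive MDCT per (3.1): a windowed block x of length N (2048 for long
   blocks, 256 for short blocks) yields N/2 spectral coefficients. */
static void mdct(const double *x, double *X, int N)
{
    const double n0 = (N / 2 + 1) / 2.0;
    for (int k = 0; k < N / 2; k++) {
        double sum = 0.0;
        for (int n = 0; n < N; n++)
            sum += x[n] * cos((2.0 * M_PI / N) * (n + n0) * (k + 0.5));
        X[k] = 2.0 * sum;
    }
}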
Psychoacoustic model: This is a highly complex block which implements the
switching between block sizes, threshold calculation (the upper limit of
quantization error), spread energy calculation, grouping etc. [10].
Temporal Noise Shaping (TNS) block: This technique performs noise shaping in
the time domain by means of an open loop prediction in the frequency domain. The TNS
technique provides enhanced control of the location, in time, of quantization
noise within a filter bank window. TNS proves to be especially successful for the
improvement of speech quality at low bit-rates.
Mid/Side Stereo: M/S stereo coding is another data reduction module based on
channel pair coding. In this case channel pair elements are analyzed as left/right
and sum/difference signals on a block-by-block basis. In cases where the M/S
channel pair can be represented by fewer bits, the spectral coefficients are coded,
and a bit is set to note that the block has used M/S stereo coding. During
decoding, the decoded channel pair is de-matrixed back to its original left/right
state. For normal stereo operation, M/S stereo is required only when operating
the encoder at bitrates at or above 44 kbps. Below 44 kbps the parametric stereo
coding tool is used instead, and the AAC core is operated in mono.
Reduction of psychoacoustic requirements block: Usually the requirements of the
psychoacoustic model are too strong for the desired bitrate. Thus a threshold
reduction strategy is necessary, i.e. the strategy reduces the requirements by
increasing the thresholds given by the psychoacoustic model.
Quantization and coding: A majority of the data reduction generally occurs in the
quantization phase after the data has already achieved a certain level of
compression when passed through the previous modules. This module contains
several further blocks, namely the scale factor quantization block, noiseless
coding and out of bits prevention.
Scale factor quantization: This block consists of two additional blocks called scale
factor determination and scale factor difference reduction.
Scale factor determination: The scale factors determine the quantization step size
for each scale factor band. By changing the scale factor, the quantization noise
can be controlled.
Scale factor difference reduction: This block takes into account the difference of
the scale factors which will be encoded. A smaller difference between two
adjacent scale factors requires fewer bits.
Noiseless coding: Coding of the quantized spectral coefficients is done by the
noiseless coding block. The encoder uses a so-called greedy merge algorithm to
segment the 1024 coefficients of a frame into sections and to find the best
Huffman codebook for each section.
Out of bits prevention: After noiseless coding, the number of bits actually needed
is counted. If this number is too high, the number of bits has to be reduced.
Fig 3.8. AAC encoder block diagram [10].
[Figure 3.8 block labels: input signal; stereo preprocessing; filterbank; psychoacoustic model; TNS; M/S; reduction of psychoacoustic requirements; scalefactors / quantization; noiseless coding; out of bits prevention (together: quantization & coding); bitstream multiplex; bitstream.]
3.7 HE AAC v2 bitstream formats:
HE AAC v2 encoded data can be carried in several file and bit stream formats
with different extensions, depending on the implementation and the usage
scenario. The most commonly used file formats are the MPEG-4 file formats MP4
and M4A [15], carrying the respective extensions .mp4 and .m4a. The ".m4a"
extension is used to emphasize the fact that a file contains audio only.
Additionally there are other bit stream formats, such as MPEG-4 ADTS (audio
data transport stream) and ADIF (audio data interchange format).
ADIF format has a single header at the beginning of the bit stream followed
by raw audio data blocks. It is used mainly for local storage purposes. ADTS has a
header before each access unit or audio frame, and the header information
remains the same for all the frames in a stream. ADTS is more robust against
errors and is suited for communication applications like broadcasting. For this
project the ADTS format has been used.
Tables 3.1 and 3.2 describe the ADTS header, which is present before each
access unit (a frame). This is later exploited for packetizing the frames into
packetized elementary stream (PES) packets, which is the first layer of
packetization before transport. Figure 3.9 shows the ADTS elementary stream.
In this chapter, an overview of HE AAC v2 audio codec standard was
presented. The encoder, decoder, SBR, PS, AAC encoder and the bit stream
format were described. The next chapter gives a brief overview of transport
protocols, in particular the MPEG2 systems layer.
Field name                        Bits      Notes
ADTS fixed header
  syncword                        12        always "111111111111"
  ID                              1         0: MPEG-4, 1: MPEG-2
  layer                           2         always "00"
  protection_absent               1
  profile                         2         explained below (table 3.2)
  sampling_frequency_index        4
  private_bit                     1
  channel_configuration           3
  original/copy                   1
  home                            1
ADTS variable header
  copyright_identification_bit    1
  copyright_identification_start  1
  aac_frame_length                13        length of the frame including header (in bytes)
  adts_buffer_fullness            11        0x7FF indicates VBR
  no_raw_data_blocks_in_frame     2
crc_check                         16        only if protection_absent == 0
raw_data_blocks                   variable size

Table 3.1. ADTS header format [2] [3].
profile bits
bits      ID == 1 (MPEG-2 profile)                ID == 0 (MPEG-4 object type)
00 (0)    Main profile                            AAC MAIN
01 (1)    Low Complexity profile (LC)             AAC LC
10 (2)    Scalable Sample Rate profile (SSR)      AAC SSR
11 (3)    (reserved)                              AAC LTP

Table 3.2. Profile bits expansion [2] [3].
[Figure 3.9: a sequence of ADTS frames, each consisting of an ADTS header (syncword, profile, sampling frequency, ..., ADTS variable header) followed by ADTS raw data.]
Fig 3.9. ADTS elementary stream [3].
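A sketch of how this header structure can be exploited in practice: scan for the syncword and extract aac_frame_length, the 13 bit field spanning the low 2 bits of header byte 3, all of byte 4 and the high 3 bits of byte 5. Since the length includes the header itself, it is exactly the stride from one frame to the next (CRC handling omitted):

#include <stdint.h>
#include <stddef.h>

/* Find the next ADTS frame at or after `from`; on success return its
   offset and store the frame length (header included) in *frame_len. */
static long next_adts_frame(const uint8_t *bs, size_t len, size_t from,
                            int *frame_len)
{
    for (size_t i = from; i + 7 <= len; i++) {
        if (bs[i] == 0xFF && (bs[i + 1] & 0xF0) == 0xF0) {  /* 12 bit syncword */
            *frame_len = ((bs[i + 3] & 0x03) << 11)
                       |  (bs[i + 4] << 3)
                       |  (bs[i + 5] >> 5);
            return (long)i;
        }
    }
    return -1;
}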
CHAPTER 4
TRANSPORT PROTOCOLS
4.1 Introduction
Once the raw video and audio are encoded into their respective elementary
streams, they have to be converted into fixed-size packets to enable transmission
across networks such as IP (internet protocol) networks, wireless mobile
networks etc. H.264 and HE AAC v2 do not define a transport mechanism for the
coded data. There are a number of transport solutions available which can be
used depending on the application. Some of them are discussed briefly below.
4.2 Real-Time Protocol (RTP):
RTP is a packetization protocol which can be used along with the User
Datagram Protocol (UDP) to transmit real time multimedia content across
networks that use the Internet Protocol (IP). The packet structure for RTP real
time data is shown in figure 4.1. The payload type indicates the type of codec
used to generate the coded data. The sequence number is used during playback
for reordering packets that are received out of order. A time stamp is used for
calculating the presentation time during decoding. Transmission via RTP involves
packetizing each elementary stream into a series of RTP packets and transmitting
them across the IP network using UDP as the underlying transport protocol. The
H.264 NAL has been designed with the RTP protocol in mind, since each NALU
can be placed in its own RTP packet.
[Figure 4.1 fields: payload type, sequence number, time stamp, unique id, payload.]
Fig 4.1. RTP packet structure (simplified) [22]
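For reference, the simplified diagram corresponds to the 12 byte RTP fixed header of RFC 3550, where the "unique id" is the SSRC field. The struct below is purely illustrative; on the wire the fields are packed big-endian, so it should not be used as a direct memory overlay:

#include <stdint.h>

typedef struct {
    uint8_t  vpxcc;      /* version(2) padding(1) extension(1) CSRC count(4) */
    uint8_t  m_pt;       /* marker(1) payload type(7), identifies the codec  */
    uint16_t seq;        /* sequence number, for reordering/loss detection   */
    uint32_t timestamp;  /* sampling instant, drives presentation timing     */
    uint32_t ssrc;       /* synchronization source id (the "unique id")      */
} RtpHeader;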
4.3 MPEG2 systems layer:
The MPEG2 part 1 [4] standard describes a method of combining one or
more elementary streams of video and audio, as well as other data, into a single
stream which is suitable for storage (e.g. DVD) or transmission (digital television,
streaming etc.). There are basically two types of system coding: Program Stream
(PS) and Transport Stream (TS). Each one is optimized for different types of
applications. The program stream is designed for reasonably reliable media such
as DVDs, while the transport stream is designed for less reliable media such as
television broadcast, mobile networks etc. Irrespective of the scheme used, the
transport/program stream is constructed in two layers: the outer layer is the
system layer and the innermost layer is the compression layer. The system layer
provides the functions necessary for combining one or more compressed
streams, such as audio and video, in a system.
The MPEG2 transport stream system is shown in figure 4.2. An elementary
stream such as coded audio or video undergoes two layers of packetization. The
first layer of packetization results in variable sized packets known as the
Packetized Elementary Stream (PES). PES packets from different elementary
streams undergo one more level of packetization, known as multiplexing, where
they are broken down into fixed size packets (188 bytes) known as transport
stream (TS) packets. These TS packets are what are actually transmitted across
the network using
broadcast techniques such as those used in ATSC and DVB. The TS contains the
actual data (payload) as well as timing and synchronization information and some
error control mechanisms; the timing information plays a crucial role during the
decoding process. This project is implemented using the MPEG2 transport stream
specification with a few modifications. Transport of the H.264 bit stream over
MPEG2 systems is covered in amendment 3 of the MPEG2 systems specification
[4]. Even though MPEG2 systems support multiple elementary streams for
multiplexing, for implementation purposes only two elementary streams, audio
and video, are considered. Additional elementary streams like data streams or
different audio/video streams may be added by following the same method
described next.
[Figure 4.2: elementary streams (e.g. coded video, audio) are packetized into PES packets; PES packets from multiple streams are multiplexed into the transport stream.]
Fig 4.2. MPEG2 transport stream [22]
4.4 Packetized elementary stream (PES):
PES packets are obtained after the first layer of packetization of the audio and
video coded data. This packetization process is carried out by sequentially
separating the audio and video elementary streams into access units. The
access units for both audio and video elementary streams are frames. Hence each
PES packet is an encapsulation of one frame of data from either the audio or the
video elementary stream. Each PES packet contains a packet header and payload
data from only one particular stream. The PES header contains information which
can distinguish between audio and video PES packets. Since the number of bits
used to represent a frame in the bit stream varies (for both audio and video), the
size of the PES packets also varies and depends on the type of frame that is
encoded. For example, I frames require more bits than P frames. Figure 4.3
shows how the elementary stream is converted into PES stream.
Fig 4.3. Conversion of an elementary stream into PES packets [29]
The PES header used is shown in table 4.1. The PES header starts with a 3
byte packet start code prefix, which is always 0x000001, followed by a 1 byte
stream id. The stream id is used to uniquely identify a particular stream. The
stream id together with the start code prefix is known as the start code (4 bytes).
Valid stream ids [30] range from 11000000 to 11011111 for audio streams and
from 11100000 to 11101111 for video streams. Stream ids 11000000 and
11100000 are used for audio and video respectively in this implementation.
Name                       Size (bytes)   Description
Packet start code prefix   3              0x000001
Stream id                  1              Unique ID to distinguish between audio and video PES
                                          packets, e.g. audio streams (0xC0-0xDF), video streams
                                          (0xE0-0xEF) [3]. Together, the above 4 bytes are known
                                          as the start code.
PES packet length          2              The PES packet can be of any length. A value of zero for
                                          the PES packet length can be used only when the PES
                                          packet payload is a video elementary stream.
Time stamp                 2              Frame number

Table 4.1. PES packet header format used [4].
The PES packet length may vary and go up to 65535 bytes. In the case of a longer
elementary stream, the packet length may be set as unbounded, i.e. 0, but only
for a video stream. The next two bytes in the header form the time stamp field,
which contains the playback time information. In this project the frame number
is used to calculate the playback time, which is explained in detail later.
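As a preview of that calculation, the sketch below maps a frame-number time stamp to a playback time; the frame rate and audio frame size here are illustrative assumptions (an AAC core frame codes 1024 samples, which becomes 2048 output samples per frame when SBR doubles the sampling rate):

/* Hypothetical mapping from frame-number time stamps to playback time. */
static double video_pts_sec(unsigned frame_no, double fps)
{
    return frame_no / fps;                   /* e.g. fps = 30.0 */
}

static double audio_pts_sec(unsigned frame_no, double samples_per_frame,
                            double sample_rate)
{
    return frame_no * samples_per_frame / sample_rate;  /* e.g. 2048/48000 */
}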
4.4.1 PES encapsulation process:
As discussed before, PES packets are obtained by encapsulating sequential
access units' data bytes from the elementary streams behind a PES header. In the
case of an audio stream, the HE AAC v2 bitstream is searched for the 12 bit sync
word "111111111111", which indicates the start of an ADTS header and hence
the start of an audio frame. The frame length is extracted from the ADTS header.
The audio frame number is counted from the beginning of the stream and coded
as a two byte timestamp. The stream id used for audio is 11000000. An audio PES
packet is formed by encapsulating the start code, frame length, time stamp and
the payload.
In the case of a video stream, the H.264 bitstream is searched for the 4 byte start
code sequence 0x00000001, which indicates the beginning of a NAL unit. Then
the 5 bit NAL unit type is extracted from the NAL header and checked to see
whether it is a video
frame or a parameter set. Parameter sets are very important and are required for
the decoding process, so if a parameter set is found (both PPS and SPS) it is
packetized separately and transmitted. If the NAL unit contains slice data, then
the frame number is calculated from the beginning of the stream and coded as
the time stamp in the PES header. It has to be noted that parameter sets are not
counted as frames, so while coding parameter sets the time stamp field is coded
as zero. The stream id used for video is 11100000. The video PES packet is then
formed by encapsulating the start code, frame length, time stamp and payload.
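One possible rendering of this encapsulation as a C sketch, using the 8 byte header of table 4.1 (start code prefix, stream id, 2 byte length, 2 byte frame-number time stamp); whether the length field counts the header bytes is an assumption made here:

#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Wrap one access unit (an audio or video frame) in the PES header of
   table 4.1. `out` must have room for frame_len + 8 bytes; the function
   returns the total PES packet size. stream_id is 0xC0 for audio and
   0xE0 for video in this implementation. */
static size_t build_pes(uint8_t *out, uint8_t stream_id,
                        const uint8_t *frame, uint16_t frame_len,
                        uint16_t frame_no)
{
    out[0] = 0x00; out[1] = 0x00; out[2] = 0x01;  /* start code prefix */
    out[3] = stream_id;
    out[4] = (uint8_t)(frame_len >> 8);           /* PES packet length */
    out[5] = (uint8_t)(frame_len & 0xFF);
    out[6] = (uint8_t)(frame_no >> 8);            /* time stamp = frame no. */
    out[7] = (uint8_t)(frame_no & 0xFF);
    memcpy(out + 8, frame, frame_len);
    return (size_t)frame_len + 8;
}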
4.5 MPEG Transport stream (MPEG- TS):
PES packets are of variable sizes and are difficult to multiplex and transmit
over an error prone network. Hence they undergo one more layer of
packetization, which results in Transport Stream (TS) packets.
MPEG Transport Streams (MPEG-TS) use a fixed length packet size, and a
packet identifier identifies each transport packet within the transport stream. A
packet identifier in an MPEG system identifies the type of packetized elementary
stream (PES), whether audio or video. Each TS packet is 188 bytes long, which
includes the header and the payload data. Each PES packet may be broken down
into a number of transport stream (TS) packets, since a PES packet, which
represents an access unit (a frame) of the elementary stream, will usually be
much bigger than 188 bytes. Also, a particular TS packet should contain data from
only one particular PES.
The standard MPEG TS packet format is shown in fig 4.4. It consists of a
synchronization byte, whose value is 0x47, followed by three one-bit flags and a
13-bit PID (packet identifier). This is followed by a 4-bit continuity counter, which
increments with each subsequent packet of the same PID and can be used to
detect missing packets. Additional optional transport fields, whose presence may
be signaled in the optional adaptation field, may follow. The rest of the packet
consists of payload. Packets are most often 188 bytes in length, but some
transport streams consist of 204-byte packets which end in 16 bytes of Reed-
Solomon error correction data. The 188-byte packet size was originally chosen for
compatibility with ATM systems.
Fig. 4.4. A standard MPEG-TS packet structure [14].
The Transport Error Indicator (TEI) flag is set by the demodulator if it
cannot correct errors in the stream, to tell the demultiplexer that the packet
has an uncorrectable error. The Payload Unit Start Indicator (PUSI) flag
indicates the start of PES data. The transport priority flag, when set, means
the packet has higher priority than other packets with the same PID. Of the
188 bytes, the header occupies 4 bytes and the remaining 184 bytes carry the
payload.
For the purposes of this implementation not all of the flags and fields
mentioned above are required, so a few changes have been made, although the
framework and the packet size remain the same. The whole header is represented
in 3 bytes instead of 4, and the rest of the packet is available for payload
data. The modified transport stream (TS) packet is shown in Fig. 4.5.
[Fig. 4.5 layout: sync byte 0x47 (1 byte); PUSI, AFC, 4-bit CC and 10-bit PID
(2 bytes); optional 8-bit offset; data payload (185 bytes); 188 bytes total.]
Fig. 4.5. Transport stream (TS) packet format used in this project.
The sync byte (0x47) indicates the start of a new TS packet. It is followed by
the payload unit start indicator (PUSI) flag, which when set indicates that
the data payload contains the start of a new PES packet.
The adaptation field control (AFC) flag, when set, indicates that the 185
bytes allotted to the data payload are not fully occupied by PES data. This
occurs when the remaining PES data is smaller than 185 bytes. In that case the
unoccupied bytes of the data payload are filled with filler data (all zeros
here), and the length of the filler data is stored in a byte called the
offset, placed right after the TS header; the offset is calculated as 185
minus the length of the PES data.
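A sketch of how such a modified TS packet could be assembled is shown below.
The exact bit packing of the 2-byte header is an assumption; the thesis fixes
only the field widths (1-bit PUSI, 1-bit AFC, 4-bit CC, 10-bit PID):

    #include <cstdint>
    #include <vector>

    // Build one 188-byte TS packet of the modified format of Fig. 4.5;
    // 'pes' is assumed to hold at most 185 bytes of PES data.
    static std::vector<uint8_t> buildTsPacket(bool pusi, uint8_t cc,
                                              uint16_t pid,
                                              const std::vector<uint8_t>& pes) {
        std::vector<uint8_t> ts;
        ts.reserve(188);
        bool afc = pes.size() < 185;              // stuffing needed?
        uint16_t hdr = uint16_t((pusi ? 0x8000 : 0) |   // PUSI flag
                                (afc  ? 0x4000 : 0) |   // AFC flag
                                ((cc & 0x0F) << 10) |   // 4-bit CC
                                (pid & 0x03FF));        // 10-bit PID
        ts.push_back(0x47);                       // sync byte
        ts.push_back(uint8_t(hdr >> 8));
        ts.push_back(uint8_t(hdr & 0xFF));
        if (afc)                                  // offset byte, per the
            ts.push_back(uint8_t(185 - pes.size())); // thesis: 185 - length
        ts.insert(ts.end(), pes.begin(), pes.end());
        ts.resize(188, 0x00);                     // zero stuffing to 188
        return ts;
    }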
The continuity counter (CC) is a 4-bit field which is incremented by the
multiplexer for each TS packet sent for a particular stream, i.e. audio PES or
video PES; this information is used at the demultiplexer side to determine
whether any packets are lost, repeated or out of sequence. The packet ID (PID)
is a unique 10-bit identifier describing the particular stream to which the
data payload of the TS packet belongs. The MPEG-2 transport stream has a
concept of broadcast programs. Each program is described by a Program Map
Table (PMT), and the elementary streams associated with that program have
their PIDs listed in the PMT. For instance, a transport stream used in digital
television might contain three programs, representing three television
channels. Suppose each channel consists of one video stream and one audio
stream. A receiver wishing to decode a particular "channel" merely has to
decode the payloads of each PID associated with its program, and can discard
the contents of all other PIDs. A transport stream with more than one program
is referred to as an MPTS (multi-program transport stream); similarly, a
transport stream with a single program is referred to as an SPTS
(single-program transport stream). A PMT has its own PID and is transmitted at
regular intervals. In this implementation only two streams, audio and video,
are used, so the PMT is not required and the PIDs are assumed to be known by
the decoder. The PIDs used in this implementation are 0000001110 (14) for the
audio stream and 0000001111 (15) for the video stream.
Optional offset byte: as described above, if the adaptation field control flag
is set, this byte carries the length of the filler data (zeros).
4.6 Time stamps:
Time stamps indicate where a particular access unit belongs in time. Lip sync
is obtained by incorporating time stamps into the headers of both video and
audio PES packets. When a decoder receives a selected PES packet, it decodes
each access unit and stores the elementary streams in buffers. When the
time-line count reaches the value of the time stamp, the buffer is read out.
This operation has two desirable results. First, effective time base
correction is obtained in each elementary stream. Second, the video and audio
elementary streams can be synchronized together to make a program.
Traditionally, to enable the decoder to maintain synchronization between the
audio track and the video frames, a 33-bit sample of the encoder clock called
the Program Clock Reference (PCR) is transmitted in the adaptation field of a
TS packet at regular intervals (at least every 100 ms). The PCR transport
stream (TS) packets have their own PID, which is recognized by the decoder.
The PCR is used to regenerate the system time clock (STC) in the decoder,
which provides an accurate time base. This, along with the presentation time
stamp (PTS) field residing in the PES packet layer of the transport stream, is
used to synchronize the audio and video elementary streams.
This project uses the frame numbers of both audio and video as time stamps to
synchronize the streams. This section explains how frame numbers can be used
for this purpose. As explained in sections 2.6 and 3.7, both H.264 and HE AAC
v2 bit streams are organized into access units, i.e. frames, separated by
their respective sync sequences. A particular video sequence has a fixed frame
rate during playback, specified in frames per second (fps). Assuming the
decoder has prior knowledge of the fps of the stream, the presentation
(playback) time of a particular video frame can be calculated using (4.1):

    video frame presentation time = frame number / fps        (4.1)
The AAC compression standard defines each audio frame to contain 1024 samples
per channel; this is true for HE AAC v2 as well. The sampling frequency of the
audio stream can be extracted from the sampling frequency index field of the
ADTS header and remains the same throughout a particular audio stream. Since
both the samples per frame and the sampling frequency are fixed, the audio
frame rate also remains constant throughout a particular audio stream. Hence
the presentation time of a particular audio frame can be calculated using
(4.2):

    audio frame presentation time = (frame number × 1024) / sampling frequency    (4.2)
Since the frame duration depends only on the samples per channel and the
sampling frequency, the same expression holds for multi-channel audio streams
as well.
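Equations (4.1) and (4.2) translate directly into code; a minimal sketch,
assuming the decoder already knows the fps and the sampling frequency:

    // Presentation times from frame numbers, following (4.1) and (4.2).
    static double videoPresentationTime(unsigned frameNo, double fps) {
        return frameNo / fps;                    // (4.1)
    }
    static double audioPresentationTime(unsigned frameNo, double sampleRate) {
        return frameNo * 1024.0 / sampleRate;    // (4.2): 1024 samples/frame
    }
    // Example: video frame 240 at 24 fps plays at 10 s; audio frame 235
    // at 24000 Hz plays at 235*1024/24000 = 10.027 s.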
Hence, by coding the frame numbers as time stamps, the presentation times of
the frames can be calculated using (4.1) and (4.2). Moreover, once the
presentation time of one stream is known, the frame number of the second
stream that has to be played at that particular time can be calculated. This
approach is used at the decoder to achieve audio-video (lip) synchronization,
as explained in detail in later chapters.
Using frame numbers as time stamps has several advantages over the traditional
PCR approach: there is no need to send additional Transport Stream (TS)
packets carrying PCR information, the overall complexity is reduced, clock
jitter need not be considered during synchronization, and the time stamp field
in the PES packet is smaller, just 16 bits to encode the frame number compared
to 33 bits for the Presentation Time Stamp (PTS), which carries a sample of
the encoder clock.
The time stamp field in this project is encoded in 2 bytes of the PES header,
which means it can carry frame numbers up to 65,535. Once the frame number of
either stream exceeds this value, which is a possibility for long video and
audio sequences, the frame number is reset to 1. The reset is applied
simultaneously to both the audio and video frame numbers as soon as either
stream crosses the limit. This does not create a frame number conflict at the
demultiplexer during synchronization, because the audio and video buffers hold
far fewer frames than the maximum allowed frame number; hence at no point in
time will there be two frames in a buffer with the same time stamp.
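The simultaneous reset can be sketched as below (illustrative names, not the
project code):

    // Wrap both frame counters together as soon as either exceeds the
    // 16-bit range, so audio and video time stamps stay consistent.
    static void updateFrameNumbers(unsigned& videoFrameNo,
                                   unsigned& audioFrameNo) {
        const unsigned kMax = 65535;   // largest value a 2-byte field holds
        if (videoFrameNo > kMax || audioFrameNo > kMax) {
            videoFrameNo = 1;          // reset both streams together
            audioFrameNo = 1;
        }
    }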
The next chapter addresses the multiplexing scheme used to multiplex the
audio and video elementary streams.
CHAPTER 5
MULTIPLEXING
Multiplexing is a process in which Transport Stream (TS) packets are generated
and transmitted in such a way that the buffers at the decoder (demultiplexer)
neither overflow nor underflow. Buffer overflow or underflow in the video and
audio elementary streams can cause skips or freeze/mute errors during
playback. Various systems adopt different methods to prevent this at the
decoder side; for example, when a potential buffer overflow is detected, null
packets (transmitted to maintain a constant bit rate) are deleted, or the
presentation time is delayed by a few frames until both buffers contain the
content to be played back at that presentation time.
The flow chart of the multiplexing scheme used in this project is shown in
figures 5.1, 5.2 and 5.3. The basic logic relies on both the audio and video
sequences having constant frame rates: for video, the frames-per-second value
remains the same throughout the sequence, and for audio, since the sampling
frequency remains constant and the samples per frame are fixed (1024 per
channel), the frame duration also remains constant.
For transmission, a PES packet, which represents one frame, is logically
broken down into n TS packets of 188 bytes each, where n depends on the PES
packet size. The exact presentation time of each video TS packet (PT_video_TS)
may be calculated as shown in (5.1), (5.2) and (5.3), where n_video_TS is the
number of TS packets required to represent the corresponding PES packet
(frame):

    video frame duration = 1 / fps                                 (5.1)
    video TS duration = video frame duration / n_video_TS          (5.2)
    PT_video_TS = previous PT_video_TS + video TS duration         (5.3)

Similarly for audio, where the audio TS duration is given by (5.5):

    audio frame duration = 1024 / sampling frequency               (5.4)
    audio TS duration = audio frame duration / n_audio_TS          (5.5)
    PT_audio_TS = previous PT_audio_TS + audio TS duration         (5.6)
From (5.3) and (5.6) it may be observed that the presentation time of the
current TS packet is the cumulative sum of the presentation time of the
previous TS packet (of the same type) and the current TS duration. The
decision to transmit a particular TS packet (audio or video) is made by
comparing the respective presentation times: whichever stream has the lower
value is scheduled to transmit a TS packet. This ensures that the audio and
video content get equal priority and are transmitted uniformly. Once the
decision about which TS to transmit is made, control passes to one of the
blocks where the actual generation and transmission of TS and PES packets
take place.
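The decision loop can be sketched as follows. The two callbacks stand in for
the audio and video processing blocks of figures 5.2 and 5.3 and are
hypothetical, not taken from the thesis code:

    #include <functional>

    // Whichever stream has the smaller TS presentation time transmits
    // next, per (5.3) and (5.6). Each callback emits one TS packet and
    // returns its duration, or a negative value when its stream is done.
    void multiplex(std::function<double()> sendVideoTs,
                   std::function<double()> sendAudioTs) {
        double ptVideo = 0.0, ptAudio = 0.0;   // PT_VIDEO_TS, PT_AUDIO_TS
        bool videoDone = false, audioDone = false;
        while (!videoDone || !audioDone) {
            if (!videoDone && (audioDone || ptVideo < ptAudio)) {
                double d = sendVideoTs();      // video processing block
                if (d < 0) videoDone = true; else ptVideo += d;  // (5.3)
            } else {
                double d = sendAudioTs();      // audio processing block
                if (d < 0) audioDone = true; else ptAudio += d;  // (5.6)
            }
        }
    }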
In the audio/video processing block, the first step is to check whether the
multiplexer is still in the middle of a frame or at the beginning of a new
frame. If a new frame is being processed, (5.2) or (5.5) is evaluated, as
appropriate, to find the TS duration; this information is used to update the
TS presentation time at a later stage. Next, data is read from the PES packet
concerned: if the PES packet is bigger than 185 bytes, only the first 185
bytes are read out and the PES packet is adjusted accordingly. If the current
TS packet is the last packet for that PES packet, a new PES packet for the
next frame (of that stream) is generated. Now the 185 bytes of payload data
and all the remaining information are ready for generating the transport
stream (TS) packet. Once a TS packet is generated, the TS presentation time is
updated using (5.3) or (5.6). Control then returns to the presentation time
decision block, and the whole process is repeated until all the video and
audio frames have been transmitted. It has to be noted that one of the
streams, i.e. video or audio, may be transmitted completely before the other;
in that case only the processing block of the stream still pending
transmission is operated.
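The per-frame step at the top of the processing blocks reduces to a ceiling
division and equation (5.2) or (5.5); a minimal sketch with illustrative
names:

    #include <cstddef>

    // When a new frame (PES packet) starts, compute how many 185-byte
    // TS payloads it needs and the per-packet duration; frameDuration
    // is 1/fps for video or 1024/fs for audio, per (5.1) and (5.4).
    static void startNewFrame(std::size_t pesLength, double frameDuration,
                              unsigned& numTsPackets, double& tsDuration) {
        numTsPackets = (unsigned)((pesLength + 184) / 185);  // ceiling
        tsDuration   = frameDuration / numTsPackets;         // (5.2)/(5.5)
    }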
The next chapter describes the demultiplexing algorithm and the method used to
achieve audio-video synchronization.
[Fig. 5.1: the presentation times of the next video and audio TS packets are
compared (PT_VIDEO_TS < PT_AUDIO_TS?). The lower branch transmits a TS packet
of that stream, generates the next PES packet if required and updates the
corresponding presentation time; the loop runs until all audio and video
frames are done.]
Fig. 5.1. Overall multiplexer flow diagram.
[Fig. 5.2: for a new video frame, the number of TS packets and the TS duration
are calculated. If the current PES length exceeds 185 bytes, only the first
185 bytes are taken for transmission and the PES data and length are updated;
otherwise a PES packet for the next video frame is generated. Finally a video
TS packet is generated (transmitted) and PT_VIDEO_TS is updated.]
Fig. 5.2. Flow chart of the video processing block.
[Fig. 5.3: for a new audio frame, the number of TS packets and the TS duration
are calculated. If the current PES length exceeds 185 bytes, only the first
185 bytes are taken for transmission and the PES data and length are updated;
otherwise a PES packet for the next audio frame is generated. Finally an audio
TS packet is generated (transmitted) and PT_AUDIO_TS is updated.]
Fig. 5.3. Flow chart of the audio processing block.
CHAPTER 6
DEMULTIPLEXING
The transport stream (TS) input to a receiver is separated into a video
elementary stream and an audio elementary stream by the demultiplexer. The
video and audio elementary streams are temporarily stored in the video and
audio buffers, respectively.
The basic flow chart of the demultiplexer is shown in figure 6.1. After a
packet is received, it is checked for the sync byte (0x47) to determine
whether the packet is valid. If invalid, the packet is skipped and
demultiplexing continues with the next packet. For a valid TS packet, the
header is read to extract fields such as the packet ID (PID), the adaptation
field control (AFC) flag, the payload unit start (PUS) flag and the 4-bit
continuity counter. The payload is then prepared to be read into the
appropriate buffer: the AFC flag indicates whether an offset value has to be
applied or all 185 payload bytes in the TS packet carry data. If the AFC flag
is set, the payload is extracted by skipping over the stuffing bytes.
The payload unit start (PUS) bit is checked to see whether the present TS
packet contains a PES header. If so, the PES header is first checked for the
presence of the sync sequence (i.e. 0x000001); if the sequence is absent, the
packet is discarded and the next TS packet is processed. If valid, the PES
header is read and fields such as the stream ID, PES length and frame number
are extracted. The PID is then checked to determine whether it is an audio or
a video TS packet, and the payload is written into the respective buffer. If
the TS packet payload contained a PES header, information such as the frame
number, its location in the corresponding buffer and the PES length is stored
in a separate array, which is later used for synchronizing the audio and video
streams.
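The TS header parse can be sketched as follows, mirroring the bit packing
assumed in the multiplexer sketch of section 4.5; names are illustrative:

    #include <cstdint>

    // Parse the modified 3-byte TS header (Fig. 4.5); returns false for
    // an invalid packet so the caller can skip to the next one.
    static bool parseTsHeader(const uint8_t ts[188], bool& pusi, bool& afc,
                              uint8_t& cc, uint16_t& pid, int& payloadStart) {
        if (ts[0] != 0x47) return false;          // sync byte check
        uint16_t hdr = uint16_t((uint16_t(ts[1]) << 8) | ts[2]);
        pusi = (hdr & 0x8000) != 0;               // start of a PES packet?
        afc  = (hdr & 0x4000) != 0;               // stuffing present?
        cc   = uint8_t((hdr >> 10) & 0x0F);       // continuity counter
        pid  = hdr & 0x03FF;                      // 10-bit packet ID
        // If AFC is set, the next byte is the offset giving the length
        // of the zero stuffing, per section 4.5.
        payloadStart = afc ? 4 : 3;
        return true;
    }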
[Fig. 6.1: each TS packet is validated against the sync byte, and its PID and
AFC flag are read; if AFC = 1 the payload offset is adjusted. If PUS = 1, the
PES header is checked for the sync sequence, the PES length, frame number and
stream ID are read, and the frame number and buffer pointer location are
saved. The payload is written into the video or audio buffer according to the
PID. When the video buffer is full, the next IDR frame is located, the
corresponding audio frame is calculated, and both buffers are written from
those frames into their respective bitstream files (.264 and .aac).]
Fig. 6.1. Flow chart of the demultiplexer.
After the payload has been written into the audio/video buffer, the video
buffer is checked for fullness. Since video files are always much larger than
audio files, the video buffer fills up first. Once the video buffer is full,
it is searched for the next occurring IDR frame; the frame number found is
used to calculate the corresponding audio frame number that has to be played
at that time, as given by (6.1):

    audio frame number = (video frame number / fps) × (sampling frequency / 1024)    (6.1)

Equation (6.1) is used to synchronize the audio and video streams. Once the
frame numbers are obtained, the audio and video elementary streams can be
constructed by writing the audio and video buffer contents from those frames
into their respective elementary stream files, i.e. .aac and .264. The streams
are then merged into a container format using the freely available mkvmerge
software [31], and the resulting container can be played back with a video
player such as VLC media player [32] or GOM media player [33]. In the case of
the video sequence, to ensure proper playback, the picture and sequence
parameter sets must be inserted before the first IDR frame.
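Equation (6.1), including the rounding discussed in section 6.1 below, can be
sketched as:

    #include <cmath>

    // Map the chosen IDR video frame to the audio frame playing at the
    // same instant, rounding to the nearest frame; fps and sampleRate
    // are assumed known to the demultiplexer.
    static unsigned correspondingAudioFrame(unsigned videoFrameNo,
                                            double fps, double sampleRate) {
        double t = videoFrameNo / fps;            // video frame time, (4.1)
        return (unsigned)std::lround(t * sampleRate / 1024.0);  // (6.1)
    }
    // Example from Table 7.3: frame 29 at 24 fps and 24000 Hz gives
    // 29/24 * 24000/1024 = 28.3, rounded to audio frame 28.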
The reason demultiplexing is carried out from an IDR (instantaneous decoder
refresh) frame is that the IDR frame breaks the sequence, guaranteeing that
later frames, such as P-frames, do not use frames before the IDR frame as
references for inter prediction; this is not true of a normal I-frame. In a
long sequence, the GOPs after an IDR frame are therefore treated as a new
sequence by the H.264 decoder. In the case of audio, the HE AAC v2 decoder can
play back the sequence from any audio frame.
6.1 Lip or audio-video synchronization:
Synchronization in multimedia systems refers to the temporal relations that
exist between the media objects in a system. A temporal or time relation is the
relation between a video and an audio sequence during the time of recording. If
these objects are presented, the temporal relation during the presentations of
the two media objects must correspond to the temporal relation at the time of
recording.
Once the video buffer is full, the content (audio and video) is ready to be
written into the elementary streams and played back. The audio ADTS elementary
stream can be played back from any frame; for the H.264 stream, however,
decoding can only start from an IDR frame. The video buffer is therefore
searched for the next occurring IDR frame, and this frame number is used to
calculate the corresponding audio frame using (6.1).
Once the audio frame number is obtained, the audio buffer from that frame
onwards is written into its elementary stream (.aac file). For the video
elementary stream, however, the picture and sequence parameter sets (which
were sent as separate TS packets by the multiplexer) are inserted before
writing from the IDR frame in the video buffer, because the decoder needs the
PPS and SPS to determine the encoding parameters used.
Since the output of (6.1) may not be a whole number, it is rounded to the
closest integer. The theoretical maximum rounding error is therefore half the
audio frame duration, which depends on the sampling frequency of the audio
stream: for example, at a sampling frequency of 24,000 Hz the frame duration
is 1024/24000 s, i.e. about 42.7 ms, and the maximum latency is about 21.3 ms.
For most audio streams the frame duration is no more than 64 ms, so the
maximum latency is about 32 ms [33]. This latency/time difference is known as
"skew" [47]. According to published research, the "in-sync" region spans skews
from -80 ms (audio behind video) to +80 ms (video behind audio) [47]; in-sync
refers to the range of skew values where the synchronization error is not
perceptible. The MPEG-2 systems specification [4] defines a skew threshold of
±40 ms. Once the streams are synchronized, the skew remains constant
throughout. This possible maximum skew is a limitation of the project;
however, the value remains well below the allowed range. Simulation results
using the audio and video test sequences are presented in the next chapter.
CHAPTER 7
RESULTS
The multiplexing algorithm was implemented in MATLAB and the demultiplexing
algorithm in C++. The JM (joint model) 16.1 reference software [35] and the
3GPP enhanced aacPlus encoder [36] were used for encoding the raw video and
audio sequences, respectively. The GOP structure adopted for video encoding
was IPPP, using the H.264 baseline profile. For audio encoding the bitrate was
set to 32 kbps to enable parametric stereo.
7.1 Buffer fullness:
As stated before, buffer overflow or underflow in the video and audio
elementary streams can cause skips or freeze/mute errors during playback.
Table 7.1 shows the video buffer values, the corresponding audio buffer size
at that moment, and the playback times of the audio and video contents of the
buffers. It can be observed that the content playback times differ by at most
about 30 ms. This means that when the video buffer is full (for any video
buffer size), almost all of the corresponding audio content is present in the
audio buffer.
Video frames | Audio frames | Video buffer | Audio buffer | Video playback | Audio playback
in buffer    | in buffer    | size (kB)    | size (kB)    | time (s)       | time (s)
100          | 98           | 771.076      | 17.49        | 4.166          | 4.181
200          | 196          | 1348.359     | 34.889       | 8.333          | 8.362
300          | 293          | 1770.271     | 52.122       | 12.5           | 12.51
400          | 391          | 2238.556     | 69.519       | 16.666         | 16.682
500          | 489          | 2612.134     | 86.949       | 20.833         | 20.864
600          | 586          | 3158.641     | 104.165      | 25             | 25.002
700          | 684          | 3696.039     | 121.627      | 29.166         | 29.184
800          | 782          | 4072.667     | 139.043      | 33.333         | 33.365
900          | 879          | 4500.471     | 156.216      | 37.5           | 37.504
1000         | 977          | 4981.05      | 173.657      | 41.666         | 41.685
Table 7.1. Video and audio buffer sizes and their respective playback times
7.2 Synchronization/skew calculation:
Table 7.2 shows the results and various parameters of the test clips used. The
compression ratio achieved by HE AAC v2 is of the order of 45 to 65, at least
three times better than that achieved by the core AAC alone. The H.264 video
compression ratio is of the order of 100, owing to the use of the baseline
profile. The net transport stream bitrate requirement is about 50 kBps, which
can easily be accommodated in systems such as ATSC-M/H, whose allocated
bandwidth is 19.6 Mbps [17], i.e. 2450 kBps.
Test clip                        | 1       | 2
Clip length (s)                  | 30      | 50
Video frame rate (fps)           | 24      | 24
Audio sampling frequency (Hz)    | 24000   | 24000
Total video frames               | 721     | 1199
Total audio frames               | 704     | 1173
Video raw file (.yuv) size (kB)  | 105447  | 175354
Audio raw file (.wav) size (kB)  | 5626    | 9379
H.264 file size (kB)             | 1273    | 1365
AAC file size (kB)               | 92      | 204
Video compression ratio          | 82.82   | 128.4
Audio compression ratio          | 61.15   | 45.97
H.264 encoder bitrate (kBps)     | 42.43   | 27.3
AAC encoder bitrate (kbps)       | 32      | 32
Total TS packets                 | 8741    | 9858
Transport stream size (kB)       | 1605    | 1810
Transport stream bitrate (kBps)  | 53.49   | 36.2
Test clip size (kB)              | 1376.78 | 1576.6
Reconstructed clip size (kB)     | 1312.45 | 1563.22
Table 7.2. Characteristics of the test clips used
Table 7.3 shows the skew for various starting TS packets. The delay column
indicates the skew obtained when demultiplexing was started from different TS
packet numbers. The theoretical maximum is about 21 ms, since the sampling
frequency used is 24,000 Hz (audio frame duration about 42.7 ms). As seen, the
worst skew is 13 ms, and in several cases the skew is below 10 ms; this is
well below the MPEG-2 threshold of 40 ms. Chapter 8 outlines the conclusions,
followed by future work in chapter 9.
Transport stream | Video IDR frame | Audio frame   | Video frame           | Audio frame           | Delay | Perceptible?
packet number    | number chosen   | number chosen | presentation time (s) | presentation time (s) | (ms)  |
100              | 13              | 13            | .5416                 | .5546                 | 13    | no
300              | 29              | 28            | 1.208                 | 1.1946                | 13    | no
400              | 33              | 32            | 1.375                 | 1.365                 | 9.6   | no
500              | 45              | 44            | 1.875                 | 1.877                 | 2.3   | no
600              | 53              | 52            | 2.208                 | 2.218                 | 10.6  | no
800              | 73              | 71            | 3.041                 | 3.03                  | 11    | no
100              | 89              | 87            | 3.708                 | 3.712                 | 4     | no
Table 7.3. Demultiplexer output.
CHAPTER 8
CONCLUSIONS
This project implemented an effective multiplexing and demultiplexing scheme
with synchronization. The latest codecs, H.264 and HE AAC v2, were used. Both
encoders achieved very high compression ratios; as a result, the transport
stream bitrate requirement was contained to about 50 kBps. Buffer fullness was
also handled effectively, with the maximum buffer difference observed being
around 30 ms of media content. During decoding, audio-video synchronization
was achieved with a maximum skew of 13 ms.
CHAPTER 9
FUTURE WORK
This project implemented a multiplexing/demultiplexing algorithm for one audio
and one video stream, i.e. a single program. The same scheme can be extended
to multiplex multiple programs by introducing a program map table (PMT). The
algorithm can also be modified to multiplex other elementary streams such as
VC-1 [44], Dirac video [45] and AC-3 [46].
The present project used the standards specified by MPEG-2 systems. The same
multiplexing scheme can be applied to other transport mechanisms such as
RTP/IP, which are used for applications like streaming video over the
internet.
Since the transport stream is sent across error-prone networks, an error
correction scheme such as Reed-Solomon coding [43] or a CRC can be added when
coding the transport stream (TS) packets.
References:
[1] MPEG-4: ISO/IEC JTC1/SC29 14496-10: Information technology – Coding of audio-visual
objects - Part 10: Advanced Video Coding, ISO/IEC, 2005.
[2] MPEG-4: ISO/IEC JTC1/SC29 14496-3: Information technology – coding of audio-visual
objects – part3: Audio, AMENDMENT 4: Audio lossless coding (ALS), new audio profiles and
BSAC extensions.
[3] MPEG-2: ISO/IEC JTC1/SC29 13818-7, Advanced audio coding, AAC.
International Standard IS WG11, 1997.
[4] MPEG-2: ISO/IEC 13818-1 Information technology — generic coding of moving pictures and
associated audio—Part 1: Systems, ISO/IEC: 2005.
[5] Soon-kak Kwon et al, “Overview of H.264 / MPEG-4 Part 10 (pp.186-216)”, Special issue on “
Emerging H.264/AVC video coding standard”, J. Visual Communication and Image
Representation, vol. 17, pp.183-552, April. 2006.
[6] A. Puri et al, “Video coding using the H.264/MPEG-4 AVC compression standard”, Signal
Processing: Image Communication, vol.19, pp. 793-849, Oct. 2004.
[7] MPEG-4 HE-AAC v2 — audio coding for today's digital media world , article in the EBU
technical review (01/2006) giving explanations on HE-AAC. Link:
http://tech.ebu.ch/docs/techreview/trev_305-moser.pdf
[8] ETSI TS 101 154 "Implementation guidelines for the use of video and audio coding in
broadcasting applications based on the MPEG-2 transport stream”.
[9] 3GPP TS 26.401: General Audio Codec audio processing functions; Enhanced aacPlus
General Audio Codec; 2009
[10] 3GPP TS 26.403: Enhanced aacPlus general audio codec; Encoder Specification AAC part.
[11] 3GPP TS 26.404 : Enhanced aacPlus general audio codec; Encoder Specification SBR part.
[12] 3GPP TS 26.405: Enhanced aacPlus general audio codec; Encoder Specification Parametric
Stereo part.
[13] E. Schuijers et al, “Low complexity parametric stereo coding “,Audio engineering society,
May 2004 , Link: http://www.jeroenbreebaart.com/papers/aes/aes116_2.pdf
[14] MPEG Transport Stream. Link:
http://www.iptvdictionary.com/iptv_dictionary_MPEG_Transport_Stream_TS_definition.html
[15] MPEG-4: ISO/IEC JTC1/SC29 14496-14 : Information technology — coding of audio-visual
objects — Part 14 :MP4 file format, 2003
[16] DVB-H: Global mobile TV. Link: http://www.dvb-h.org/
[17] ATSC-M/H. Link: http://www.atsc.org/cms/
[18] Open Mobile Video Coalition. Link: http://www.openmobilevideo.com/about-mobile-dtv/standards/
[19] VC-1 Compressed Video Bitstream Format and Decoding Process (SMPTE 421M-
2006), SMPTE Standard, 2006 (http://store.smpte.org/category-s/1.htm).
[20] Henning Schulzrinne's RTP page. Link: http://www.cs.columbia.edu/~hgs/rtp/
[21] G.A.Davidson et al, “ATSC video and audio coding”, Proc. IEEE, vol.94, pp. 60-76,
Jan. 2006 (www.atsc.org).
[22] I. E.G.Richardson, “H.264 and MPEG-4 video compression: video coding for next-
generation multimedia”, Wiley, 2003.
[23] European Broadcasting Union, http://www.ebu.ch/
[24] Shintaro Ueda et al, "NAL level stream authentication for H.264/AVC", IPSJ Digital
Courier, vol. 3, Feb. 2007.
[25] World DMB: link: http://www.worlddab.org/
[26] ISDB website. Link: http://www.dibeg.org/
[27] 3gpp website. Link: http://www.3gpp.org/
[28] M Modi, “Audio compression gets better and more complex”, link:
http://www.eetimes.com/discussion/other/4025543/Audio-compression-gets-better-and-
more-complex
[29] PA Sarginson,”MPEG-2: Overview of systems layer”, Link:
http://downloads.bbc.co.uk/rd/pubs/reports/1996-02.pdf
[30] MPEG-2 ISO/IEC 13818-1: GENERIC CODING OF MOVING PICTURES AND AUDIO: part 1-
SYSTEMS Amendment 3: Transport of AVC video data over ITU-T Rec H.222.0 |ISO/IEC 13818-1
streams, 2003
[31] MKV merge software. Link: http://www.matroska.org/
[32] VLC media player. Link: http://www.videolan.org/
[33] Gom media player. Link: http://www.gomlab.com/
[34] H. Murugan, "Multiplexing H264 video bit-stream with AAC audio bit-stream,
demultiplexing and achieving lip sync during playback”, M.S.E.E Thesis, University of Texas at
Arlington, TX May 2007.
[35] H.264/AVC JM Software link: http://iphome.hhi.de/suehring/tml/download/.
[36] 3GPP Enhanced aacPlus reference software. Link: http://www.3gpp.org/ftp/
[37] MPEG–2: ISO/IEC JTC1/SC29 13818–2, Information technology -- Generic coding of moving
pictures and associated audio information: Part 2 - Video, ISO/IEC, 2000.
[38] MPEG–4: ISO/IEC JTC1/SC29 14496–2, Information technology – Coding of audio visual
objects: Part 2 - visual, ISO/IEC, 2004.
[39] T. Wiegand et al, ”Overview of the H.264/AVC Video Coding Standard ”, IEEE Trans. CSVT,
Vol. 13, pp. 560-576, July 2003.
[40] ATSC-Mobile DTV Standard, part 7 – AVC and SVC video system characteristics. Link:
http://www.atsc.org/cms/standards/a153/a_153-Part-7-2009.pdf
[41] ATSC-Mobile DTV Standard, part 8 – HE AAC audio system characteristics. Link:
http://www.atsc.org/cms/standards/a153/a_153-Part-8-2009.pdf
[42] H.264 Video Codec - Inter Prediction. Link:
http://mrutyunjayahiremath.blogspot.com/2010/09/h264-inter-predn.html
[43] B.A. Cipra, "The Ubiquitous Reed-Solomon Codes". Link:
http://www.eccpage.com/reed_solomon_codes.html
[44] VC-1 technical overview. Link:
http://www.microsoft.com/windows/windowsmedia/howto/articles/vc1techoverview.aspx
[45] Dirac video compression website. Link: http://diracvideo.org/
[46] MPEG-2: ISO/IEC JTC1/SC29/WG11 13818-3: Coding of moving pictures and associated audio: Part 3 – Audio, Nov. 1994.
[47] G. Blakowski et al, “A Media Synchronization Survey: Reference Model, Specification, and
Case Studies”, IEEE Journal on selected areas in communications, VOL. 14, NO. 1, Jan 1996.