Journal of Systems Architecture 49 (2003) 641–661
www.elsevier.com/locate/sysarc
A flexible architecture for H.263 video coding
Matías J. Garrido a,*, César Sanz a, Marcos Jiménez a, Juan M. Meneses b
a Dpto. de Sistemas Electrónicos y de Control, Universidad Politécnica de Madrid, E.U.I.T. Telecomunicación, Ctra. de Valencia, Km. 7, 28031 Madrid, Spain
b Dpto. Ingeniería Electrónica, Universidad Politécnica de Madrid, E.T.S.I. Telecomunicación, Ciudad Universitaria s/n, 28040 Madrid, Spain
Abstract
In this paper a flexible and efficient architecture that implements the core of a video coder according to Rec. H.263 is
presented. It consists of a RISC processor that controls the scheduling of a set of specialized processors that perform the
discrete cosine transform (DCT), the inverse discrete cosine transform (IDCT), the direct and inverse quantization (DQ
and IQ), the motion estimation (ME) and the motion compensation (MC). The architecture also includes pre-pro-
cessing modules for the input video signal from the camera and interfaces for the external video memory and the H.263
stream generation.
The processors have been written in synthesizable Verilog and the firmware for the RISC (a commercial processor)
has been developed in C language.
The design has been tested with hardware–software co-simulations in a Verilog testbench using standard video
sequences and has also been prototyped onto a development system based on an FPGA and a RISC. It performs 30
QCIF frames/s with a system clock of 12 MHz or 30 CIF frames/s with a system clock of 48 MHz, which is better than
other reported designs with similar degree of flexibility. Also, the low frequency system clock makes it suitable for low-
power applications such as mobile videotelephony.
© 2003 Elsevier B.V. All rights reserved.
Keywords: H.263; FPGA; RISC; Intellectual property; Low bit-rate video coding; Pipelined architecture; Discrete cosine transform;
Motion estimation
1. Introduction
In the last 10 years, the evolution of digital
technologies, together with the establishment of a set of standards widely followed by the industry,
* Corresponding author. Fax: +34-3367801.
E-mail addresses: [email protected] (M.J. Garrido), [email protected] (C. Sanz), [email protected] (M. Jiménez), [email protected] (J.M. Meneses).
1383-7621/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/S1383-7621(03)00094-8
such as MPEG-2 [1], MPEG-4 [2] and H.263 [3],
has allowed the development of a wide range
of video applications: digital TV, HDTV, VoD,
videotelephony, videoconference, etc.
The applications implemented in low-rate
channels, such as videotelephony, use low-resolu-
tion formats such as CIF (common intermediate
format: spatial resolution of 352 × 288 pels and
temporal resolution of 30 frames/s). Even so, the
available bandwidth is usually lower than that
necessary for working at minimum performance.
For example, a videophone working with CIF
and using an ISDN channel (considering 32 Kbits/
s for transmission and 32 Kbits/s for reception)
would allow the visualization of just one image
every 75 s.
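That figure can be checked with a short computation. A raw CIF frame at 24 bits per pel (an assumption made for this sketch; the pel format behind the 75 s estimate is not stated above) carries about 2.4 Mbits:

```c
#include <assert.h>

/* Rough transmission-time estimate for one uncompressed CIF frame
 * over a low-rate channel.  Assumes 24 bits per pel (no chrominance
 * sub-sampling), which is an assumption made for this sketch. */
static long cif_frame_bits(void) {
    return 352L * 288L * 24L;          /* 2,433,024 bits per frame */
}

static long seconds_per_frame(long channel_bps) {
    return cif_frame_bits() / channel_bps;
}
```

With a 32 kbit/s channel this gives 76 s per frame, consistent with the roughly 75 s quoted above.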
Image compression techniques can drastically reduce the bit-rate necessary for encoding the
digital video signals. The compression techniques
used take advantage of the peculiarities of the
human vision system to attain compression ratios
of up to 100:1 [4]. Although a large number of
useful techniques have been reported, nearly all
applications are based on the hybrid encoding
scheme shown in Fig. 1, which is based on the reduction of the spatial and temporal redundancies
existing in any natural sequence of images.
The hybrid encoder reduces the temporal re-
dundancy, encoding the difference between each
image and its prediction computed on the basis of
previous or future images in the sequence. A
transformation to the spatial frequency domain is
applied to this difference and, finally, the transformed coefficients are quantized. The spatial redundancy reduction is obtained by means of a
Fig. 1. Hybrid encoder for video compression.
coarse quantization of the higher spatial frequen-
cies and a variable length coding (VLC). As the
human vision system is less sensitive to these
higher spatial frequencies, the image quality re-
mains acceptable while the output bit-rate is
greatly reduced.
In 1998, the International Telecommunication
Union established the Recommendation H.263
that uses a number of encoding techniques tested
in other standards such as MPEG-1 and MPEG-2
as well as more advanced ones.
This paper shows an efficient and flexible ar-
chitecture that implements a baseline H.263
video coder, based on the hybrid encoding loop shown in Fig. 1. In Section 2, a survey of some of
the architectures that implement H.263 encoders
reported in the last four years is made. In Section 3
the proposed architecture, MVIP-2, is presented.
In Section 4 the methodology followed in the de-
velopment of the design is explained. Section 5 is
devoted to the prototyping stage. In Section 6 the
tests performed and the results obtained are shown. Finally, Section 7 explains the conclusions
of this work.
2. A survey of H.263 video encoding architectures
Most of the implementations that have been
presented in international publications in the last
four years can be classified into three groups.
The first group contains the implementations
based on general-purpose microprocessors, in-
cluding PCs or workstations. All the efforts are
focused on the optimisation of the code that im-
plements the encoder loop for the target micro-
processor. A very representative proposal of this
group is presented in [5], where the basic H.263
encoding loop is optimised for three different platforms: a 167 MHz Sun UltraSPARC-1 workstation, a 233 MHz Pentium II PC and a 600 MHz
Pentium III PC, attaining a minimum of 10, 13
and 35 frames/s for QCIF in tests with standard
sequences.
The second group contains architectures based
on special microprocessors such as DSPs, vector
parallel processors or multiprocessors. Among the more significant proposals, a multiprocessor architecture made up of interconnected nodes is presented
in [6]; each node containing a RISC core adapted
for video encoding, DRAM memory, a video in-
terface and an external host interface. Using two
nodes working at 120 MHz the system encodes 25
CIF images per second. In [7] a vector parallel
processor is used, with a scalar core at 200 MHz, that encodes 21 frames/s in QCIF.
The third group includes the architectures
based on a controller together with a set of spec-
ialised processors for the specific tasks in the en-
coding loop. An architecture based on a sequencer
that implements the scheduling for a group of
specialised processors, encoding and decoding 30
CIF frames/s simultaneously is proposed in [8]. The system clock frequency is 54 MHz and the
circuit has nearly 1.8 million gates. Another ar-
chitecture based on a dedicated sequencer and
specialized processors is detailed in [9]. It is im-
plemented on an 80,000 gate Xilinx FPGA run-
ning at 30 MHz and carrying out the basic core of
H.263 with CIF and 30 frames/s without motion
estimation. The depicted architectures lack flexi-bility because of their dedicated controller. In-
stead, the following ones use a programmable
controller: In [10], an ARM RISC core at 200
MHz is used to carry out the transforms (DCT,
IDCT) and quantizers (DQ, IQ) and controls a set
of processors for motion estimation and com-
pensation, video signal processing and external
dynamic memory interfacing. The processors are
implemented with about 40,000 gates and work at a 66 MHz clock frequency. This system implements the encoder and decoder for the H.263 with
QCIF and 29 frames/s. An architecture based on a
programmable address generator and a pipeline
controller for a set of processors: the camera in-
terface, the image filter, the loop DCT-DQ-IQ-
IDCT, the motion estimation and the VLC is
presented in [11]. With 80,000 gates and a 27 MHz system clock, this architecture encodes QCIF at 30
frames/s.
3. MVIP-2
3.1. The architecture
MVIP-2 is an evolution from the MViP archi-
tecture [12,13] to implement H.263 video encoding.
Our goal is to obtain a design with a moderate
number of gates and a slow system clock that will
be suitable in the future for low-power applications such as mobile videotelephony, and that is also
flexible enough to allow its adaptation to other
standards.
The block diagram of MVIP-2 is shown in Fig. 2. It consists of three functional blocks: the CPU
system, the processing system and the interface
system. MVIP-2 also needs several external mod-
ules: a digital camera, flash and RAM memories to
store code and data for the CPU and SDRAM for
the video memory.
3.1.1. The CPU system
The CPU system is made up of a 32-bit RISC
processor, an address decoder and a programma-
ble interrupt controller (PIC). After a reset, the
decoder maps the RAM, the flash memory and the
peripherals (the specialized processors in the pro-
cessing system and the interfaces in the interface
system) as shown in Fig. 3(a).

Fig. 2. Block diagram of MVIP-2.

The flash memory contains a loader and the
encoding firmware; when the CPU boots from
flash memory, the loader configures the decoder to
remap the memory as shown in Fig. 3(b), copies
the firmware to RAM and starts running from this
memory for a faster execution. On the other hand,
all the specialized processors and the interfaces
have the same structure (see Fig. 4): a core to implement specific tasks (e.g. discrete cosine
transform or direct quantization) and an interface
with the RISC based on a configuration register
and a status register. All the configuration registers
have a start bit and all the status registers have a
done bit, which is also connected to one of the 32
inputs of the PIC. The RISC configures and starts
the processors and interfaces by writing to their memory-mapped configuration registers. Also,
when a processor or interface ends its work, it
asserts the done bit that can be polled or generates
an interrupt if enabled.
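This register protocol can be sketched from the RISC side as follows; the base address, register offsets and bit positions below are hypothetical, since the real MVIP-2 memory map is not given here:

```c
#include <stdint.h>

/* Sketch of the RISC-side access to one specialized processor,
 * following the register structure described above (one configuration
 * register with a start bit, one status register with a done bit).
 * The base address, offsets and bit positions are hypothetical. */
#define CFG_OFFSET    0x0u
#define STATUS_OFFSET 0x4u
#define START_BIT     (1u << 0)   /* in the configuration register */
#define DONE_BIT      (1u << 0)   /* in the status register        */

static volatile uint32_t *reg(uintptr_t base, uintptr_t off) {
    return (volatile uint32_t *)(base + off);
}

/* Configure, start and wait (by polling the done bit) for one
 * processor mapped at the given base address. */
static void run_processor(uintptr_t base, uint32_t config) {
    *reg(base, CFG_OFFSET) = config | START_BIT;
    while ((*reg(base, STATUS_OFFSET) & DONE_BIT) == 0)
        ;   /* alternatively, the done bit can raise an interrupt */
}
```

The same access pattern serves every processor and interface, which is what makes the scheduling fully software-controlled.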
The CPU carries out the following tasks: initial
configuration of the system, control of the sched-
uling of the specialized processors and the inter-
faces and a part of the work in the inter/intra
decision, VLC and H.263 bit-stream generation.
3.1.2. The processing system
The processing system consists of specialized
processors for implementing the direct and inverse discrete cosine transform (DCT and IDCT), the
direct and inverse quantization (DQ and IQ) and
the motion estimation and compensation (ME
and MC), a set of internal memories (M10…M51) and an interconnection network (CROSSBAR).
The internal memories are a set of macroblock-
size memories that are accessed by the processors
using the CROSSBAR. They are divided up into five groups with different data bus sizes as stated in
Table 1.
The CROSSBAR implements the interface be-
tween the processors and the internal memories. It
has nine read-channels and seven write-channels
on the processor side and 17 memory interface
channels on the memory side.
Fig. 4. Structure of all processors and interfaces.
Table 1
Groups of internal memories

Group      Data bus size
M10…M13    8-bit wide
M20…M24    9-bit wide
M30…M33    12-bit wide
M40…M41    11-bit wide
M50…M51    15-bit wide
Fig. 5. The discrete cosine transform processor.
Fig. 3. Memory map of MVIP-2 after a reset (a) and after remapping (b).
The DCT processor (see Fig. 5) reads macro-
blocks from an internal memory (M20 or M21)
using a CROSSBAR read-channel, processes
them, and writes the results into another internal memory (M30…M33) using a CROSSBAR write-channel. Actually, the DCT processor works on a
block basis and six blocks are sequentially pro-
cessed per macroblock. For each block, the pro-
cessor carries out a 64-pel two-dimensional
discrete cosine transform of type DCT-II [14]
particularized for 8-pel wide square blocks.
The IDCT processor also reads macroblocks from an internal memory (M30…M33) and writes the results into another internal memory (M23 or
M24). The IDCT is computed with enough precision to
be IEEE 1180 [15] compliant.
Fig. 6. The direct quantizer.
Fig. 8. The motion estimation process.
The DQ processor (see Fig. 6) reads the DCT results from M30…M33 and writes the quantized macroblocks in M40 or M41. Actually, the DQ
processor works on a block basis and six blocks
are sequentially processed per macroblock. The
processor carries out the quantization of each 64-
pel data block as defined in [16].
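Behaviourally, the two quantizers can be sketched as below. The inverse rule is the one Rec. H.263 itself defines; the forward rule is not normative, so the dead-zone formulation here follows the TMN test models and is only an assumption about what [16] specifies for DQ:

```c
#include <stdlib.h>

/* Behavioural sketch of H.263 quantization for non-DC coefficients.
 * Forward rule: TMN test-model formulation (assumption, not normative).
 * Inverse rule: as defined by Rec. H.263 (result forced odd). */
static int quantize(int cof, int qp, int intra) {
    int mag = abs(cof);
    int level = intra ? mag / (2 * qp)            /* INTRA            */
                      : (mag - qp / 2) / (2 * qp); /* INTER: dead zone */
    return (cof >= 0) ? level : -level;
}

static int dequantize(int level, int qp) {
    if (level == 0) return 0;
    int mag = qp * (2 * abs(level) + 1);
    if ((qp & 1) == 0) mag -= 1;                  /* even QP: make odd */
    return (level > 0) ? mag : -mag;
}
```

The intra DC coefficient is a special case in H.263 (fixed step of 8) and is omitted from this sketch.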
The IQ processor (see Fig. 7) is divided internally into two main modules that read the same input macroblock from M40 or M41: IQ_proc and
LRL_proc. IQ_proc carries out the inverse quantization as defined in [16] and writes the de-quantized macroblock in M30…M33. LRL_proc
performs a last, run, level encoding and writes the
results in M50 and M51. Both modules share the
control and status registers and a read controller
to get the input data. The zz modules carry out an address translation for reading the blocks in zig-zag scan.
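The address translation the zz modules perform can be modelled as a pure function from the position in the zig-zag scan to the (row, column) position in the block:

```c
/* Zig-zag address translation for an 8x8 block: given the n-th
 * coefficient in zig-zag order, return its (row, col) position in
 * raster order.  The walk alternates up-right and down-left runs. */
static void zigzag_pos(int n, int *row, int *col) {
    int r = 0, c = 0;
    for (int i = 0; i < n; i++) {
        if ((r + c) % 2 == 0) {            /* moving up-right  */
            if (c == 7)      r++;
            else if (r == 0) c++;
            else           { r--; c++; }
        } else {                           /* moving down-left */
            if (r == 7)      c++;
            else if (c == 0) r++;
            else           { r++; c--; }
        }
    }
    *row = r; *col = c;
}
```

In hardware the translated addresses are simply presented to the memory, so the block is read out in zig-zag order without moving any data.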
The ME processor works on a macroblock
basis. For each macroblock in the frame to be
coded (current frame in Fig. 8), the ME carries out
a search in a limited area (search area) around
the counterpart macroblock in a previous frame,
to find the one that minimizes an error function. When the ME processor selects a macroblock
from the candidates, it then outputs a motion
Fig. 7. The inverse quantizer.
vector that will allow a decoder to recover the
same macroblock from a previously decoded image. The search area is 7.5 pels wide around
counterpart macroblock in the previous frame, the
error function is the mean absolute error (MAE)
and the motion vector is computed with half-pel
precision.
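The MAE cost can be sketched as follows; the hardware compares accumulated sums rather than means, so the final division by 256 is kept only for clarity:

```c
#include <stdlib.h>

/* Mean absolute error between a 16x16 reference macroblock and the
 * candidate at offset (dx, dy) inside the search area.  The hardware
 * compares accumulated sums, so the division by 256 is shown here
 * only for clarity. */
static double mae(unsigned char ref[16][16],
                  const unsigned char *area, int stride,
                  int dx, int dy) {
    long sum = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sum += labs((long)ref[y][x] - area[(y + dy) * stride + (x + dx)]);
    return sum / 256.0;
}
```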
The ME processor consists of four main blocks (see Fig. 9): a controller for reading the search area
from video memory and the reference macroblock
from the internal memories, an internal RAM
bank to store the search area, an entire-pel preci-
sion processor (EST1P) to find a first macroblock
candidate and a half-pel precision processor
(EST1_2P) to refine the result of EST1P and to
output a half-pel precision vector.
The controller uses two IMEM read channels to
read the search area from a former image in the
video memory and one CROSSBAR read channel
to read the reference macroblock from the internal
memories (M10…M13).
The RAM bank is used as in [17] to reduce the
data throughput into the ME. The memory is divided into three blocks, each the size of a half search area. As can be seen in Fig. 10, the right
half search area of macroblock #n overlaps with the left half search area of macroblock #n+1, so only half of the search area must be read for
each macroblock. The controller reads the half
search area of macroblock #n+1 and stores it in a RAM bank block while EST1P reads the entire search area of macroblock #n from the other two blocks.
Fig. 9. Top-level block diagram of the motion estimation processor.
Fig. 10. Overlapping of the search areas for consecutive macroblocks.
Fig. 11. Architecture of the motion compensation processor.
EST1P is based on EST3P [18], a hierarchical
three-step motion estimation circuit. During the
third step, EST1P reads only a 20 × 20 pels search
area and computes an entire-pel precision motion
vector. In parallel, EST1_2P reads the same 20 × 20 pels search area and, at the end of the third step,
interpolates a half-pel precision search area, reads
the entire-pel precision motion vector from EST1P
and obtains a half-pel precision motion vector in
one more step. Finally, the candidate macroblock
with half-pel precision (Y, CR and CB) is written in video memory using an IMEM write channel (not
shown in Fig. 9).
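The hierarchical three-step scheme that EST1P builds on can be sketched as below. Block size and search range are simplified (8-pel blocks, steps 4/2/1) and the cost is a plain SAD, so this is an illustration of the algorithm family, not the EST1P/EST3P datapath:

```c
#include <stdlib.h>

#define B 8   /* block size used in this sketch */

/* Sum of absolute differences between a reference block and the
 * candidate at position (px, py) of the image. */
static long sad(const unsigned char *ref, const unsigned char *img,
                int stride, int px, int py) {
    long s = 0;
    for (int y = 0; y < B; y++)
        for (int x = 0; x < B; x++)
            s += labs((long)ref[y * B + x] - img[(py + y) * stride + px + x]);
    return s;
}

/* Three-step search: the step starts at 4 pels and halves after each
 * step; at every step nine candidates around the current best
 * position are compared.  Returns the best position via (bx, by). */
static void three_step_search(const unsigned char *ref,
                              const unsigned char *img, int stride,
                              int cx, int cy, int *bx, int *by) {
    int px = cx, py = cy;
    for (int step = 4; step >= 1; step /= 2) {
        long best = sad(ref, img, stride, px, py);
        int nx = px, ny = py;
        for (int dy = -step; dy <= step; dy += step)
            for (int dx = -step; dx <= step; dx += step) {
                long s = sad(ref, img, stride, px + dx, py + dy);
                if (s < best) { best = s; nx = px + dx; ny = py + dy; }
            }
        px = nx; py = ny;
    }
    *bx = px; *by = py;
}
```

The attraction of the scheme is that only 25 candidates (9 + 8 + 8) are evaluated instead of the full search grid, at the cost of possibly missing the global minimum.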
The MC processor (Fig. 11) reads the macro-
block candidate selected by the ME processor
from video memory using a read IMEM channel
and the reference macroblock from the internal
memories (M10…M13) using a CROSSBAR read channel. If the reference macroblock is to be coded in intra mode, MC writes it into the internal memories (M20 or M21) using a CROSSBAR write
channel, otherwise MC writes the difference be-
tween the reference and the selected candidate.
3.1.3. The interface system
The interface system is made up of a set of
modules: for the camera (IVIDEOH, IVIDEOV),
for the frame memory (IFRMEM), for the reconstruction memory (IFRECMEM), for the video memory (IMEM) and for the H.263 bit-stream generation (IT).
The IMEM module supports a set of 16-bit read
or write channels that compete to access the
SDRAM. IMEM performs the SDRAM initial
configuration and the refreshing tasks, manages
the requests of the channels and sends the read or
write commands to the SDRAM. The architecture of this module is shown in Fig. 12. Each of the 16-bit channels has a controller (rd_ctrl for read
channels or wr_ctrl for write channels) and a four-
word memory (mr or mw). A write channel stores
the 16-bit words sequentially in the four-word
memory using a rq/ack protocol and when a 64-bit
word is completed the controller requests an access
to the memory manager. If the SDRAM is not busy, the 64-bit word is written and the request is
Fig. 12. Architecture of the video memory interface.
acknowledged. The read channels work in a simi-
lar way. The memory manager supports requests
from six read and five write channels, resolves the
priorities, if necessary, performs the read and write
operations in the SDRAM and acknowledges the
channels.
The camera interface. The camera interface
consists of two modules (Fig. 13) working on an
image basis: IVIDEO_H reads the images from the
camera in raster-scan format, synchronizes them
with the system clock, performs a horizontal fil-
tering, if necessary, and stores the results in video
memory using an IMEM write channel. IVIDEO_V reads the image from video memory, carries out a chrominance sub-sampling and a
vertical filtering and writes the results in video
memory.
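The vertical chrominance sub-sampling step can be sketched as an averaging filter over vertically adjacent pels; the actual filter taps used by IVIDEO_V are not given here, so plain two-tap averaging is an assumption:

```c
/* Vertical 2:1 sub-sampling of one chrominance plane, converting
 * 4:2:2 input to the 4:2:0 format used by the coder.  Each pair of
 * vertically adjacent pels is averaged (rounded) into one output pel.
 * The real hardware filter taps may differ; this is a sketch. */
static void subsample_422_to_420(const unsigned char *in, int width,
                                 int height, unsigned char *out) {
    for (int y = 0; y < height / 2; y++)
        for (int x = 0; x < width; x++)
            out[y * width + x] =
                (unsigned char)((in[(2 * y) * width + x] +
                                 in[(2 * y + 1) * width + x] + 1) / 2);
}
```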
The IFRMEM module (Fig. 14) reads the fil-
tered images from video memory using an IMEM
read channel and writes them on a macroblock
basis in one of four internal memories
(M10…M13) using a CROSSBAR write channel.
Fig. 13. The camera interface.
Fig. 14. The frame memory interface (IFRMEM).
Fig. 15. The reconstruction memory interface.
Fig. 16. The H.263 frame interface (IT).
The IFRECMEM module (Fig. 15) is committed
to storing the reconstructed images in video
memory on a macroblock basis. The reconstructed
macroblock, generated by the IDCT processor, is
read from an internal memory (M23 or M24). Also, the prediction obtained from the ME processor is read from video memory. In intra mode,
IFRECMEM writes to video memory only the
reconstructed macroblock. In inter mode, IF-
RECMEM writes to video memory the sum of the
reconstructed macroblock plus the prediction.
The IT module (Fig. 16) reads the run-length
coded (RLC) coefficients from the inverse quantization processor and the image, group-of-blocks and macroblock headers from the RISC, assembles the H.263 bit-stream and outputs it through
an 8-bit port. Inside the IT, a module reads the RLC coefficients from the internal memories (M50
or M51) and carries out the VLC. The image and
macroblock headers are written in a header
memory by the RISC as they become available.
The IT module joins both data sources in a byte-
aligned stream that is sent to a first in first out
(FIFO) buffer.
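The bit-level assembly performed inside the IT can be sketched with a small bit-writer that accumulates variable length codes and emits whole bytes, MSB first, into a buffer that stands in for the output FIFO (all names are illustrative):

```c
#include <stdint.h>

/* Minimal MSB-first bit-writer: variable length codes and header
 * fields are appended bit by bit; completed bytes are flushed to a
 * buffer that stands in for the output FIFO. */
typedef struct {
    uint8_t  buf[256];   /* stands in for the output FIFO */
    int      nbytes;
    uint32_t acc;        /* bit accumulator               */
    int      nbits;      /* bits pending in acc           */
} bitwriter;

static void bw_init(bitwriter *bw) {
    bw->nbytes = 0; bw->acc = 0; bw->nbits = 0;
}

static void put_bits(bitwriter *bw, uint32_t code, int len) {
    bw->acc = (bw->acc << len) | (code & ((1u << len) - 1));
    bw->nbits += len;
    while (bw->nbits >= 8) {               /* emit completed bytes */
        bw->nbits -= 8;
        bw->buf[bw->nbytes++] = (uint8_t)(bw->acc >> bw->nbits);
    }
}

/* Pad with zero bits up to the next byte boundary, as required for a
 * byte-aligned output stream. */
static void byte_align(bitwriter *bw) {
    if (bw->nbits > 0)
        put_bits(bw, 0, 8 - bw->nbits);
}
```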
3.2. Scheduling of the architecture
MVIP-2 works with three levels of pipeline:
image-level, macroblock-level and pel-level.
The interfaces IVIDEOH, IVIDEOV, IFR-
MEM and IFRECMEM work with an image-
level-pipeline. IMEM supports frame-size logic
pages and the processors use these pages to interchange the images. Each processor reads an image
from a logic page, processes it and stores the re-
sults on a different page for the next processor. In
Fig. 17 a typical coding sequence is shown: inside
each frame period IVIDEOH reads an image from
the camera and carries out the horizontal filtering,
IVIDEOV carries out the vertical filtering, IFRMEM reads the filtered image and stores it, on a macroblock basis, in the internal memory
(M10…M13) and, at the end of the coding loop, IFRECMEM reads the IDCT output pels from
M23 and M24 and stores them in a page of video
memory.
Table 2 shows the access sequence to video
memory logic pages corresponding to the coding
sequence shown in Fig. 17. In the frame periods (T0…T5) the processors interchange images using seven logic pages (P0…P6), e.g. at T1 IVIDEOH writes the second frame (WR F2) using P1 while
Fig. 17. Example of encoding sequence.
IVIDEOV reads the first frame (RD F1), already
filtered in the horizontal dimension, from P0. After
the initial latency of three frames all processors
work in parallel using six logic pages in video
memory.
IFRMEM, ME, MC, DCT, DQ, IQ, IDCT and
IFRECMEM work with a macroblock-level-
pipeline. Each processor reads a macroblock from
one of the internal memories through the
CROSSBAR, then processes it and writes the re-
sults in another internal memory (IFRMEM and
IFRECMEM also access video memory for reading and writing, respectively). As Fig. 17 shows, after an initial latency of seven macroblocks all
processors work in parallel.
Table 3 shows a typical inter coding sequence,
where T0…T8 represent macroblock periods. The
macroblocks to be processed (current macroblocks) are stored alternately in M10…M13 by IFRMEM (I-frmem WR CF#). The motion estimator reads the current macroblock three times (p-me 3×RD CF#) to carry out the prediction and once more to get the 1/2-pel accuracy. Finally, the
motion compensator reads it once again to com-
pute the difference with the prediction and to write
the result alternately in M20 and M21 (p-mc WR
CF#-REC#). The DCT processor reads these data
(p-dct RD#) and stores the transformed coefficients in M30…M33 (p-dct WR#), from where
they are read and quantized by DQ, which writes
them into M40 or M41 (p-dq WR#). The IQ processor reads the quantized coefficients and
calculates inverse quantization and last, run, level
(LRL) coding simultaneously. The de-quantized
coefficients are stored in M30…M33 (p-iq WR#) and the LRL coded coefficients in M50 or M51
(p-lrl WR#). The IDCT processor reads the
de-quantized coefficients and writes the spatial
domain transformed pels in M23 or M24 (p-idct WR#), from where they are read by IFRECMEM
(i-frecmem RD#). The LRL coefficients are read
by IT (i-tr RD#).
At the image and macroblock level, the sched-
uling can be controlled completely by the RISC:
each processor remains idle until the micropro-
cessor sets its start bit and when the image or
macroblock is processed the processor sets its done bit, which can be polled or generate an interrupt.
The MVIP-2 processors also work with a classic
pel-level-pipeline. The controllers of the processors
have been designed in order to modify the number
of their pipeline stages easily.
Table 2
Access to video memory pages sequence (entries listed by frame period T0…T5)

P0: T0 IVIDEOH WR F1; T1 IVIDEOV RD F1; T2 IVIDEOH WR F3; T3 IVIDEOV RD F3; T4 IVIDEOH WR F5; T5 IVIDEOV RD F5
P1: T1 IVIDEOH WR F2; T2 IVIDEOV RD F2; T3 IVIDEOH WR F4; T4 IVIDEOV RD F4; T5 IVIDEOH WR F6
P2: T1 IVIDEOV WR F1; T2 IFRMEM RD F1; T3 IVIDEOV WR F3; T4 IFRMEM RD F3; T5 IVIDEOV WR F5
P3: T2 IVIDEOV WR F2; T3 IFRMEM RD F2; T4 IVIDEOV WR F4; T5 IFRMEM RD F4
P4: T2 IFRECMEM WR F1; T3 P-ME RD F2; T4 IFRECMEM WR F3; T5 P-ME RD F4
P5: T3 IFRECMEM WR F2; T4 P-ME RD F3; T5 IFRECMEM WR F4
P6: T3 P-ME WR F2, P-MC RD F2, IFRECMEM RD F2; T4 P-ME WR F3, P-MC RD F3, IFRECMEM RD F3; T5 P-ME WR F4, P-MC RD F4, IFRECMEM RD F4
3.3. The software implementation
The coder firmware has been written in C for the
RISC processor. In order to allow
flexibility in the design of the encoding algorithm
two application programmer interfaces (APIs)
have been implemented:
• The mvip2 API provides access to the proces-
sors and interfaces, supporting the start, stop
and configuration tasks.
• The h263 API supports the generation of the
H.263 stream headers.
The firmware is structured in two pieces of
code:
• init.s is a loader written in assembly language.
• coder.c is the module that implements the
scheduling for the processors and interfaces
using the mvip2 API and the generation of the
H.263 headers using the h263 API.
In the current version of the firmware, the processors that operate at the image-level pipeline
(IVIDEOH and IVIDEOV) are managed by in-
terrupt while the remaining processors are polled.
The module coder.c has a main program and an
interrupt routine. When IVIDEOH is started for
the first time in the main program (see Fig. 18), it synchronizes with the first image, processes it and
generates an interrupt when it finishes. In the in-
terrupt routine (see Fig. 19) IVIDEOH is started
again to wait for the next image and IVIDEOV is
started to process the former. Also, when IVID-
EOV finishes, the interrupt routine is entered and a
flag is asserted.
In the main program, first of all, the SDRAM controller (IMEM) is initialized, IVIDEOH is
started with its interrupt enabled and the RISC
waits for IVIDEOV to assert the flag. When
IVIDEOV ends the processing of the current
image and the flag is asserted in the interrupt
routine, the Image Header is created and sent to
the IT and the processors (IFRECMEM, IFRMEM, DCT, IDCT, DQ, IQ, ME and MC) are started in sequence, beginning with the slower
ones. While a macroblock is being processed, the
RISC generates the macroblock header, the group
of blocks header (if necessary) and other parts of
the H.263 bit-stream, sends these data to the IT
and starts it. At the end of the loop processing, the
RISC reads the motion vector and other parameters from the processors and starts them again for the next macroblock.
On the completion of an image, the end of
frame header is generated and sent to the IT and
the flag is deasserted.
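The main-loop structure just described can be condensed into a compilable sketch in which the mvip2/h263 API calls are reduced to counters; every identifier below is illustrative, not a real firmware symbol:

```c
/* Compilable sketch of the coder.c control flow (Fig. 18), with the
 * mvip2/h263 API calls reduced to counters so that the scheduling
 * structure can be followed. */
#define MBS_PER_FRAME 99            /* QCIF: 11 x 9 macroblocks */

static volatile int flag;           /* set by the IVIDEOV interrupt  */
static int mbs_processed;
static int headers_sent;

static void start_processors(void) { /* mvip2 API: start ME, MC, ... */ }
static void send_mb_header(void)   { headers_sent++;  /* h263 API    */ }
static void process_one_mb(void)   { mbs_processed++; /* poll done   */ }

/* Encode a single frame: one iteration of the firmware main loop. */
static void encode_frame(void) {
    while (!flag)
        ;                           /* wait for the filtered image   */
    /* image (and GOB) headers would be created and sent to IT here  */
    for (int mb = 0; mb < MBS_PER_FRAME; mb++) {
        start_processors();         /* slower processors first       */
        send_mb_header();
        process_one_mb();           /* read results, update params   */
    }
    /* the end-of-frame header would be generated here               */
    flag = 0;                       /* ready for the next image      */
}
```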
Table 3
Macroblock-level operations scheduling (entries listed by macroblock period T0…T8)

M10: T0 I-frmem WR CF0; T1 p-me 3×RD CF0; T2 p-me RD CF0; T3 p-mc RD CF0; T4 I-frmem WR CF4; T5 p-me 3×RD CF4; T6 p-me RD CF4; T7 p-mc RD CF4; T8 I-frmem WR CF8
M11: T1 I-frmem WR CF1; T2 p-me 3×RD CF1; T3 p-me RD CF1; T4 p-mc RD CF1; T5 I-frmem WR CF5; T6 p-me 3×RD CF5; T7 p-me RD CF5; T8 p-mc RD CF5
M12: T2 I-frmem WR CF2; T3 p-me 3×RD CF2; T4 p-me RD CF2; T5 p-mc RD CF2; T6 I-frmem WR CF6; T7 p-me 3×RD CF6; T8 p-me RD CF6
M13: T3 I-frmem WR CF3; T4 p-me 3×RD CF3; T5 p-me RD CF3; T6 p-mc RD CF3; T7 I-frmem WR CF7; T8 p-me 3×RD CF7
M20: T3 p-mc WR CF0-REC0; T4 p-dct RD 0; T5 p-mc WR CF2-REC2; T6 p-dct RD 2; T7 p-mc WR CF4-REC4; T8 p-dct RD 4
M21: T4 p-mc WR CF1-REC1; T5 p-dct RD 1; T6 p-mc WR CF3-REC3; T7 p-dct RD 3; T8 p-mc WR CF5-REC5
M30: T4 p-dct WR 0; T5 p-dq RD 0; T6 p-iq WR 0; T7 p-idct RD 0; T8 p-dct WR 4
M31: T5 p-dct WR 1; T6 p-dq RD 1; T7 p-iq WR 1; T8 p-idct RD 1
M32: T6 p-dct WR 2; T7 p-dq RD 2; T8 p-iq WR 2
M33: T7 p-dct WR 3; T8 p-dq RD 3
M40: T5 p-dq WR 0; T6 p-iq RD 0; T7 p-dq WR 2; T8 p-iq RD 2
M41: T6 p-dq WR 1; T7 p-iq RD 1; T8 p-dq WR 3
M50: T6 p-lrl WR 0; T7 p-tr RD 0; T8 p-lrl WR 2
M51: T7 p-lrl WR 1; T8 p-tr RD 1
M23: T7 p-idct WR 0; T8 I-frecmem RD 0
M24: T8 p-idct WR 1
If, when IVIDEOV processing finishes, the
processors are still working for the former image,
then an image is skipped.
3.4. The hardware/software co-operation
The specialized processors and RISC co-oper-
ation at the macroblock level determine the speed
of the overall system. Fig. 20 shows the computing
period for a macroblock; the top line represents
the RISC tasks and the bottom represents the
processor tasks.
The RISC carries out three main tasks for the
macroblock processing: (1) configuring and starting the processors, (2) generating the macroblock
header (and other H.263 bit-stream components)
and starting the IT and (3) reading the results from
the processors and computing parameters for the
Fig. 18. Main program flowchart.
Fig. 19. IVIDEOH and IVIDEOV interrupt routine flowchart.
next macroblock. These tasks can be evaluated in
terms of system clock cycles (N, M and P cycles in Fig. 20).
The processors are idle before they are started
and while the RISC is reading their results and
computing the parameters for the next macroblock. Each processor takes a fixed minimum number of clock cycles and, if the processor accesses the video memory, this number is incremented by a quantity that depends on the priority
assigned for this access. The IT processing time
depends on the image and the quantization step. In
Fig. 20, L represents the number of cycles from the
start of the last processor to the end of all processor and IT activities.
In the current state of the design, N is about 1000 cycles, M is about 150 cycles, P is about 1400 cycles and L is about 2900 cycles.
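These figures can be put against the available cycle budget. At 12 MHz and 30 QCIF frames/s (99 macroblocks per frame) about 4040 system clock cycles are available per macroblock, of the same order as the serialized N + M + L above; notably, CIF at 48 MHz (396 macroblocks per frame) yields the same per-macroblock budget:

```c
#include <assert.h>

/* Available system clock cycles per macroblock for a given clock
 * frequency, frame rate and frame size (in macroblocks). */
static long cycles_per_mb(long clock_hz, int fps, int mbs_per_frame) {
    return clock_hz / ((long)fps * mbs_per_frame);
}
```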
4. Design methodology
The design methodology we have used has been oriented towards three objectives: (1) the
design must be flexible enough to evolve or be
reused to implement the encoding loop for other
standards, (2) the functional test of such a com-
plex system must be carried out efficiently and (3)
the design must be oriented towards rapid pro-
totyping.
As well as using an HDL for the design de-scription, we get the flexibility by using the fol-
lowing techniques:
Fig. 20. The HW/SW tasks arrangement: within one MB period, the RISC starts the processors (N cycles), writes the MB header and starts the IT (P cycles), polls, and finally reads the results and computes the parameters for the next MB (M cycles), while the processors are idle except during the loop processing (L cycles).
1 MVIP-2 is a complex design with more than 200 modules. As we use parameters and retiming techniques, the synthesis process is also complex and prone to human error. We use formal verification to ensure that the synthesized netlist is equivalent to the RTL design, ruling out human errors in the process.
• Most of the design modules have been parameterized. As an example, the DCT processor is instantiated with the multipliers and transformation-matrix size adequate for the current design, but these parameters can be changed to adapt it to other designs.
• The initial latency of the controllers has also been parameterized, allowing the automatic synthesis tools to re-design the pipeline of critical processors without re-designing the controllers.
• The processors can be synthesized to work with an 8-bit data bus or a 32-bit data bus by changing a parameter (the firmware is changed with the same parameter). The 32-bit option is faster, while the 8-bit option is smaller.
• The CROSSBAR design is modular and can be modified very easily to implement other topologies. As the processor control is implemented in software, it is very easy to add new processors to MVIP-2 and to fit them into the macroblock-level pipeline.
• The IMEM design is also modular, so it is very easy to add new channels and therefore new processors to the image-level pipeline.
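To illustrate the third point, a single compile-time parameter could condition both the synthesized bus width and the firmware access routines. The sketch below is hypothetical (the names BUS_WIDTH and mvip_write_reg are ours, not from MVIP-2), assuming a memory-mapped processor register written over either bus:

```c
#include <stdint.h>

/* Hypothetical sketch: the same BUS_WIDTH parameter that selects the
 * 8-bit or 32-bit synthesis option also conditions the firmware, as
 * the text notes the firmware is changed with the same parameter.   */
#ifndef BUS_WIDTH
#define BUS_WIDTH 32
#endif

static void mvip_write_reg(volatile uint8_t *base, uint32_t value)
{
#if BUS_WIDTH == 32
    /* One 32-bit access on the wide bus. */
    *(volatile uint32_t *)base = value;
#else
    /* Four byte accesses on the narrow bus, least significant first. */
    for (int i = 0; i < 4; i++)
        base[i] = (uint8_t)(value >> (8 * i));
#endif
}
```

Keeping the parameter in one place means the firmware and the RTL cannot drift apart when the bus width is changed.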
These features will allow MVIP-2 to be used in
the future as a base for the design of an IP to
implement the hybrid coding loop for MPEG-2 or
MPEG-4 video coders.
Efficient verification is carried out by designing exhaustive functional testbenches with self-test capabilities before the logic synthesis stage, and by using formal verification techniques in the post-synthesis stage.
The design of MVIP-2 has been oriented to
rapid system prototyping [19]. This feature allows
us to configure a lower complexity version for easy
prototyping.
A simplified diagram of the design cycle is shown in Fig. 21. The first stage is the development of the software for the RISC processor and the Verilog register-transfer-level (RTL) description of the other hardware modules. A testbench that includes Verilog simulation models for the camera, the memories and the RISC allows the functional tests to be carried out with HW/SW co-simulations. The second stage is logic synthesis (using Design Compiler from Synopsys); a netlist is obtained from the RTL description and the area and timing restrictions. The third stage is formal verification (using Formality from Synopsys) in order to validate the netlist against the RTL description. 1 The fourth is the prototyping stage, which is explained in more detail in the next section.
5. The prototyping stage
As we said in Section 4, the design of MVIP-2
has been oriented to rapid system prototyping.
Fig. 21. Design cycle: from the specification, the SW for the RISC and the RTL description of MVIP-2 are developed and verified against the Verilog testbench until a golden RTL is obtained; the synthesizer produces a netlist from the golden RTL, the target library and the restrictions; formal verification validates the netlist against the RTL, and the validated netlist proceeds to prototyping.
Among other features, MVIP-2 can be configured with only one DCT processor to carry out both the DCT and the IDCT sequentially, 1-pel accuracy motion estimation and compensation and no filtering, while maintaining the rest of the features stated in Section 3. This configuration allows prototyping onto a cost-effective development board.
Fig. 22 shows a block diagram of the testbench used for prototyping, consisting of three main modules: a development board, a personal computer with a PCI input/output board and a logic analyzer.
The modules inside the dotted box in Fig. 22 are included in the development board, an HSDT200 [20], which is a cost-effective system for prototyping a wide range of designs.

Fig. 22. Block diagram of the prototype testbench (RISC, FPGA, SDRAM, RAM and flash on the development board; a PC supplying input frames and a PC collecting the output H.263 stream; logic analyzer; JTAG multiICE).

The core of this board is
an EP20K400BC652-1 FPGA [21] from Altera connected to an ARM7TDMI processor [22]. In the FPGA, 400,000-gate designs with up to 200,000 bits of memory can usually be implemented, and the ARM is a scalar 32-bit RISC that can execute up to 120 MIPS. Around these elements the board has 1 Mb of static RAM, 4 Mb of flash memory and several connectors to ease debugging and to allow working with SDRAMs and standard interfaces like RS-232, Smart Card and PCI.
The input-output board in the PC is a PCI-6534 from National Instruments [23]. Using this 32-channel PCI board, a very flexible pattern generator has been created. The system, which is managed from a shell designed for this research, emulates an OV6620 colour digital camera [24], allowing an image file to be selected, displayed on the PC monitor and output to the coder in an infinite loop.
The logic analyzer is a TLA714 from Tektronix. This 96-channel, 200 MHz logic analyzer allows the address, data and control buses of the RISC processor or the SDRAM interface to be watched. We are using the ARM Software Development Tools v. 2.50 and multiICE to download and debug the ARM C code; the FPGA Compiler II v. 2000.11-FC3.5 from Synopsys, in order to achieve better results for the target technology, and Quartus II v. 1.1 from Altera for the design implementation on the FPGA; and finally, the shell for the pattern generator management has been developed using LabVIEW v. 6.0.
A photograph of the testbench can be seen in Fig. 23.

Fig. 23. The testbench.
6. Results
To date, an RTL description of MVIP-2 and a first version of the software for the RISC processor have been obtained. Exhaustive functional tests with fixed and random data sequences have been carried out. Afterwards, the encoder has been tested with standard sequences such as Foreman, Silent or Miss America. Finally, the generated H.263 bit-stream has been tested using ClipPlayer [25]. Fig. 24 shows the first (intra with pquant 16) and 12th (inter with pquant 12) reconstructed images of Foreman together with their original versions. Also, Table 4 shows the PSNR and the number of bytes per image for the first 12 images.
The functional tests have shown that MVIP-2 can encode each macroblock in about 4050 system clock cycles; this performance allows the encoding of 30 QCIF fps with a 12 MHz system clock or 30 CIF fps with a 48 MHz system clock.
Fig. 24. Original (a) and reconstructed (b) frames of Foreman sequence.
We have also performed a logic synthesis and formal verification of the entire design with Design Compiler and Formality.
An MVIP-2 configuration with a single DCT processor to carry out both the DCT and the IDCT, and with entire-pel accuracy for motion estimation, has been synthesized with FPGA Compiler II and fitted into the EP20K400BC652-1 FPGA, using 93% of its logic cells and 52% of its internal memory resources. Due to the physical limitations of the prototyping board, the maximum achievable system clock frequency is 24 MHz. The tests carried out on the prototype have shown that, at this frequency, the system can encode QCIF at 60 fps, equivalent to CIF at 15 fps. With a 48 MHz system clock MVIP-2 would encode 30 CIF fps, but in order to support this clock frequency the design must be retargeted onto a different platform (e.g. an FPGA with an embedded RISC, or an ASIC).
The results of the logic synthesis are summa-
rized in Figs. 25 and 26. Fig. 25 shows the size, in
Table 4
Performance data for the first 12 images of the Foreman sequence

Image  pquant  PSNR Y (dB)  # of bytes
#1     16      31.31        1760
#2     12      31.28        468
#3     12      31.20        517
#4     12      30.99        405
#5     12      31.23        404
#6     12      31.25        403
#7     12      31.24        336
#8     12      31.18        378
#9     12      31.09        678
#10    12      30.82        693
#11    12      30.90        755
#12    12      30.82        749
#12    12      36.46        82
logic cells, for all processors and interfaces of MVIP-2 except the RISC. The DCT and the ME processors use nearly half of the available logic cells. As the number of typical equivalent gates in the EP20K400BC652 FPGA is 400,000, we can state, for comparison purposes, that the number of gates of the design (without the RISC) is about 375,000.
Fig. 26 shows the amount of internal FPGA memory used by the design. About 90 Kbits are spent maintaining the macroblock-level pipeline. The ME processor spends about 10 Kbits to store the search area, and a small amount is needed by the DCT processor to store coefficients and intermediate results.

Fig. 25. Size of processors and interfaces (logic cells).
7. Conclusions
In this paper MVIP-2, a flexible and efficient architecture based on a RISC and a set of specialized processors and interfaces that implements an H.263 base-line encoder, has been presented. The design methodology of MVIP-2 has been oriented towards providing a reusable design, easing exhaustive testing and enabling fast prototyping, features that will allow its transformation into an IP.
The set of specialized processors and interfaces has been described in Verilog RTL, while the RISC is a commercial processor. The functionality of the entire architecture, as well as the synthesizeability of the RTL code, has been exhaustively tested.
The Verilog description of the specialized processors and interfaces has parameters for different architectural features, including bus widths, number of IMEM channels, number and type of
Fig. 26. Internal memory used by MVIP-2 (Kbits; M10...M51, DCT and ME).
CROSSBAR channels and number of pipeline stages of the controllers. On the other hand, the RISC implements the scheduling of the processors and interfaces, most of the VLC tables and most of the H.263 output stream generation. These
Table 5
Our proposal

fps/size  CLK (MHz)  Kgates      Performance
30/QCIF   12         375 + RISC  H.263 base-line
15/CIF    24
30/CIF    48

Table 6
Summary of proposals referenced in Section 2

Ref.  fps/size  CLK (MHz)            Kgates     Performance
[5]   35/QCIF   600                             H.263 base-line
[6]   25/CIF    120                             H.263 base-line
[7]   21/QCIF   200
[8]   30/CIF    54                   1800       H.263 with options; encoder + decoder
[9]   30/CIF    30                   <85        H.263 base-line; without ME/MC
[10]  29/QCIF   200 (RISC), 66 (ME)  40 + RISC  H.263 base-line; encoder + decoder
[11]  30/QCIF   27                   80         H.263 with options
features will allow the design to be easily transformed in the future to implement the hybrid coding loop core for an MPEG-2 or MPEG-4 encoder.
MVIP-2 has also been prototyped onto a commercial board based on an FPGA and a RISC processor, working with a 24 MHz system clock. The performance of the design is summarized in Table 5. As can be seen, real-time encoding is achieved with low system clock frequencies; this feature makes MVIP-2 suitable for low-power applications like mobile videotelephony. Moreover, Table 6 summarizes the characteristics of the proposals referenced in Section 2. The direct comparison of all these proposals is not easy due to
its heterogeneity but, as can be seen, our proposal can encode more frames per second (fps) at a lower system clock frequency than the others, with the exception of [9], which is an encoder without motion estimation and compensation. In [5–7] and [10] the high system clock frequency is compensated by a lower hardware complexity. If we look at the low system clock proposals, [8,9] and [11], we find that all of them use specialized processors to implement the hybrid coding loop. The architecture proposed in [11] has a good trade-off between system clock, hardware complexity and number of fps, but lacks the flexibility of MVIP-2. The proposal in [8] is much more complex, although it implements both the encoder and the decoder. Finally, the proposal in [9] is fast and of low complexity because motion estimation and compensation are not implemented, and it also lacks the flexibility of MVIP-2.
Acknowledgements
This work is being supported by grant TIC99-0927 from the Comisión Interministerial de Ciencia y Tecnología of the Spanish Government.
References
[1] ISO/IEC 13818-2 (ITU-T Rec.H.262), Generic coding of
moving pictures and associated audio information: Video,
1995.
[2] ISO/IEC JTC1/SC29/WG11, CD 14496-2 Coding of Audio
Visual Objects: Video, 1998.
[3] ITU-T Rec. H.263, Video Coding for Low Bit-Rate
Communication, 1998.
[4] A.K. Jain, Image Data Compression: A Review, Proceed-
ings of the IEEE, vol. 69, no. 3, March 1981.
[5] S.M. Akramullah, I. Ahmad, M.L. Liou, Optimization of
H.263 video encoding using a single processor computer:
performance tradeoffs and benchmarking, IEEE Transac-
tions on Circuits and Systems for Video Technology 11 (8)
(2001).
[6] K. Herrmann, S. Moch, J. Hilgenstock, P. Pirsch, Imple-
mentation of a multiprocessor system with distributed
embedded DRAM on a large area integrated circuit, IEEE
International Symposium on Defect and Fault Tolerance
in VLSI, Proceedings, 2000, pp. 1665–1669.
[7] T.P.Q. Nguyen, A. Zakhor, K. Yelick, Performance Analysis of an H.263 video encoder for VIRAM, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, 1999.
[8] M. Harrand, J. Sanches, A. Bellon, J. Bulone, A. Tournier, A single chip CIF 30-Hz, H.261, H.263 and H.263+ video encoder/decoder with embedded display controller, IEEE Journal of Solid-State Circuits 34 (11) (1999) 1627–1633.
[9] G. Lienhart, R. Lay, R. Manner, An FPGA video
compressor for H.263 compatible bitstreams, International
Conference on Consumer Electronics, 2000, Digest of
Technical Papers, pp. 320–321.
[10] S.K. Jang, S.D. Kim, J. Lee, G.Y. Choi, J.B. Ra,
Hardware–software co-implementation of a H.263 video
codec, IEEE Transactions on Consumer Electronics 46 (1)
(2000) 191–200.
[11] C. Honsawek, K. Ito, T. Ohtsuka, T. Isshiki, Li Dongju, T.
Adiono, H. Kunieda, System-MSPA design of H.263+
video encoder LSI for face focused videotelephony, The
2000 IEEE Asia-Pacific Conference on Circuits and Sys-
tems, 2000.
[12] J.M. Fernández, F. Moreno, J. Meneses, A high-performance architecture with a macroblock-level-pipeline for MPEG-2 coding, Real Time Imaging Journal 2 (1996) 331–340.
[13] J.M. Fernández, Arquitecturas VLSI para la codificación de imágenes en movimiento en tiempo real, Ph.D. Thesis, E.T.S.I.T., Universidad Politécnica de Madrid, March 1998.
[14] K.R. Rao, P. Yip, Discrete Cosine Transform, Algorithms,
Advantages, Applications, Academic Press, 1990.
[15] IEEE G.216, Presentation to IEEE G.216 Video Com-
pression Measurement Subcommittee on IEEE 1180/1190
Standard, Discrete Cosine Transform Accuracy Test,
January 1998.
[16] Video Codec Test Model, TMN5, Telenor Research, 1995.
[17] C. Sanz, M.A. Freire, J. Meneses, Low cost ASIC implementation of a three-step search block-matching algorithm for motion estimation in image coding, Design Automation and Test in Europe Conference, User's Forum, paper awarded the Best ASIC Prize, 1999, pp. 75–79.
[18] C. Sanz, M. Garrido, J. Meneses, VLSI Architecture for
Motion Estimation using the Three-Step Block Matching
Algorithm, Design Automation and Test in Europe Con-
ference, Designer Track, 1998, pp. 45–50.
[19] M. Garrido, C. Sanz, M. Jimenez, J. Meneses, A Flexible
H.263 Video Coder Prototype Based on FPGA, 13th IEEE
International Workshop in Rapid System Prototyping,
2002, pp. 34–41.
[20] SIDSA. Semiconductor Design Solutions. Available from
<http://www.sidsa.com>.
[21] APEX 20 K Programmable Logic Device Family data
sheet. Available from <http://www.altera.com/literature/
ds/apex.pdf>.
[22] ARM7TDMI Technical Reference Manual Rev. 4. Available from <http://www.arm.com/arm/TRMs?OpenDocument>.
[23] High-Speed 32-bit Digital Pattern I/O and Handshaking.
Available from <http://www.ni.com/pdf/products/us/
mhw332-333e.pdf>.
[24] OV6620 Single-chip CMOS CIF color digital camera.
Available from <http://www.ovt.com/pdfs/ov6620-
DSLF.PDF>.
[25] ClipPlayer V1.1b2. 1996 Fraunhofer-Gesellschaft, IIS.
Matías J. Garrido received the Ingeniero Técnico de Telecomunicación degree in 1986 and the Ingeniero de Telecomunicación degree in 1996, both from the Universidad Politécnica de Madrid. Since 1986 he has been a member of the faculty of the E.U.I.T. de Telecomunicación of the U.P.M., and since 1987 he has been Associate Lecturer at the Department of Electronic and Control Systems. He is a founder member (in 1996) of the Electronic and Microelectronic Design Group (GDEM), participating in design projects from the Spanish and European industry as well as university projects. His areas of interest are electronic digital design, video coding and digital video broadcasting.
César Sanz received the Ingeniero de Telecomunicación degree with honors in 1989 and the Doctor Ingeniero de Telecomunicación degree with summa cum laude in 1998, both from the Universidad Politécnica de Madrid. Since 1984 he has been a member of the faculty of the E.U.I.T. de Telecomunicación of the U.P.M., and since 1999 has been Associate Professor at the Department of Electronic and Control Systems. In addition, he leads the Electronic and Microelectronic Design Group (GDEM), involved in R&D projects with Spanish and European companies and public institutions. His areas of interest are microelectronic design applied to image coding, digital TV and IP-data transmission over digital broadcasting networks.
Marcos Jiménez received the Ingeniero de Telecomunicación degree in 2001 from the Universidad Politécnica de Madrid. He has been a member of the Electronic and Microelectronic Design Group since 2000 and at present also works as a software developer at SIDSA. His areas of interest are real-time video coding hardware implementations and IP transmission over digital television networks.
Juan M. Meneses received the Ingeniero de Telecomunicación degree in 1977 and the Doctor Ingeniero de Telecomunicación degree with summa cum laude in 1985, both from the Universidad Politécnica de Madrid. Since 1989 he has directed a research group in digital architectures for image and video processing. At present, he is a full professor at the Electronics Engineering Department and senior scientist at GDEM.