
Journal of Systems Architecture 49 (2003) 641–661

www.elsevier.com/locate/sysarc

A flexible architecture for H.263 video coding

Matías J. Garrido a,*, César Sanz a, Marcos Jiménez a, Juan M. Meneses b

a Dpto. de Sistemas Electrónicos y de Control, Universidad Politécnica de Madrid, E.U.I.T. Telecomunicación, Ctra. de Valencia, Km. 7, 28031 Madrid, Spain

b Dpto. Ingeniería Electrónica, Universidad Politécnica de Madrid, E.T.S.I. Telecomunicación, Ciudad Universitaria s/n, 28040 Madrid, Spain

Abstract

In this paper a flexible and efficient architecture that implements the core of a video coder according to Rec. H.263 is

presented. It consists of a RISC processor that controls the scheduling of a set of specialized processors that perform the

discrete cosine transform (DCT), the inverse discrete cosine transform (IDCT), the direct and inverse quantization (DQ

and IQ), the motion estimation (ME) and the motion compensation (MC). The architecture also includes pre-pro-

cessing modules for the input video signal from the camera and interfaces for the external video memory and the H.263

stream generation.

The processors have been written in synthesizable Verilog and the firmware for the RISC (a commercial processor)

has been developed in C language.

The design has been tested with hardware–software co-simulations in a Verilog testbench using standard video

sequences and has also been prototyped onto a development system based on an FPGA and a RISC. It performs 30

QCIF frames/s with a system clock of 12 MHz or 30 CIF frames/s with a system clock of 48 MHz, which is better than

other reported designs with a similar degree of flexibility. Also, the low-frequency system clock makes it suitable for low-

power applications such as mobile videotelephony.

© 2003 Elsevier B.V. All rights reserved.

Keywords: H.263; FPGA; RISC; Intellectual property; Low bit-rate video coding; Pipelined architecture; Discrete cosine transform;

Motion estimation

1. Introduction

In the last 10 years, the evolution of digital

technologies, together with the establishment of a set of standards widely followed by the industry,

* Corresponding author. Fax: +34-3367801.

E-mail addresses: [email protected] (M.J. Garrido), [email protected] (C. Sanz), [email protected] (M. Jiménez), [email protected] (J.M. Meneses).

1383-7621/$ - see front matter © 2003 Elsevier B.V. All rights reserved.

doi:10.1016/S1383-7621(03)00094-8

such as MPEG-2 [1], MPEG-4 [2] and H.263 [3],

has allowed the development of a wide range

of video applications: digital TV, HDTV, VoD,

videotelephony, videoconference, etc.

The applications implemented in low-rate

channels, such as videotelephony, use low-resolu-

tion formats such as CIF (common intermediate

format: spatial resolution of 352 × 288 pels and

temporal resolution of 30 frames/s). Even so, the

available bandwidth is usually lower than that

necessary for working at minimum performance.


For example, a videophone working with CIF

and using an ISDN channel (considering 32 Kbits/

s for transmission and 32 Kbits/s for reception)

would allow the visualization of just one image

every 75 s.
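(A quick sanity check, assuming an uncompressed image of 24 bits per pel: a CIF frame carries 352 × 288 × 24 ≈ 2.43 Mbits, so its transmission over the 32 Kbit/s channel takes

$$\frac{352 \times 288 \times 24\ \text{bits}}{32\,000\ \text{bits/s}} \approx 76\ \text{s},$$

in agreement with the figure above.)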

Image compression techniques can drastically reduce the bit-rate necessary for encoding the

digital video signals. The compression techniques

used take advantage of the peculiarities of the

human vision system to attain compression ratios

of up to 100:1 [4]. Although a large number of useful techniques have been reported, nearly all applications are based on the hybrid encoding scheme shown in Fig. 1, which is based on the reduction of the spatial and temporal redundancies

existing in any natural sequence of images.

The hybrid encoder reduces the temporal re-

dundancy, encoding the difference between each

image and its prediction computed on the basis of

previous or future images in the sequence. A

transformation to the spatial frequency domain is

applied to this difference and finally, the transformed coefficients are quantized. The spatial redundancy reduction is obtained by means of a coarse quantization of the higher spatial frequencies and a variable length coding (VLC).

Fig. 1. Hybrid encoder for video compression.

As the

human vision system is less sensitive to these

higher spatial frequencies, the image quality re-

mains acceptable while the output bit-rate is

greatly reduced.

In 1998, the International Telecommunication

Union established the Recommendation H.263

that uses a number of encoding techniques tested

in other standards such as MPEG-1 and MPEG-2

as well as more advanced ones.

This paper shows an efficient and flexible ar-

chitecture that implements a baseline H.263

video coder, based on the hybrid encoding loop shown in Fig. 1. In Section 2, a survey of some of

the architectures that implement H.263 encoders

reported in the last four years is made. In Section 3

the proposed architecture, MVIP-2, is presented.

In Section 4 the methodology followed in the de-

velopment of the design is explained. Section 5 is

devoted to the prototyping stage. In Section 6 the

tests performed and the results obtained are shown. Finally, Section 7 explains the conclusions

of this work.


2. A survey of H.263 video encoding architectures

Most of the implementations that have been

presented in international publications in the last

four years can be classified into three groups.

The first group contains the implementations

based on general-purpose microprocessors, in-

cluding PCs or workstations. All the efforts are

focused on the optimisation of the code that im-

plements the encoder loop for the target micro-

processor. A very representative proposal of this

group is presented in [5], where the basic H.263

encoding loop is optimised for three different platforms: a 167 MHz Sun UltraSPARC-1 work-

station, a 233 MHz Pentium II PC and a 600 MHz

Pentium III PC, attaining a minimum of 10, 13

and 35 frames/s for QCIF in tests with standard

sequences.

The second group contains architectures based

on special microprocessors such as DSPs, vector

parallel processors or multiprocessors. Among the most significant proposals, a multiprocessor architec-

ture made up of interconnected nodes is presented

in [6]; each node containing a RISC core adapted

for video encoding, DRAM memory, a video in-

terface and an external host interface. Using two

nodes working at 120 MHz the system encodes 25

CIF images per second. In [7] a vector parallel

processor is used, with a scalar core at 200 MHz, that encodes 21 frames/s in QCIF.

The third group includes the architectures

based on a controller together with a set of spec-

ialised processors for the specific tasks in the en-

coding loop. An architecture based on a sequencer

that implements the scheduling for a group of

specialised processors, encoding and decoding 30

CIF frames/s simultaneously is proposed in [8]. The system clock frequency is 54 MHz and the

circuit has nearly 1.8 million gates. Another ar-

chitecture based on a dedicated sequencer and

specialized processors is detailed in [9]. It is im-

plemented on an 80,000 gate Xilinx FPGA run-

ning at 30 MHz and carrying out the basic core of

H.263 with CIF and 30 frames/s without motion

estimation. The depicted architectures lack flexibility because of their dedicated controller. In-

stead, the following ones use a programmable

controller: In [10], an ARM RISC core at 200

MHz is used to carry out the transforms (DCT,

IDCT) and quantizers (DQ, IQ) and controls a set

of processors for motion estimation and com-

pensation, video signal processing and external

dynamic memory interfacing. The processors are

implemented with about 40,000 gates and work at a 66 MHz clock frequency. This system imple-

ments the encoder and decoder for the H.263 with

QCIF and 29 frames/s. An architecture based on a

programmable address generator and a pipeline

controller for a set of processors: the camera in-

terface, the image filter, the loop DCT-DQ-IQ-

IDCT, the motion estimation and the VLC is

presented in [11]. With 80,000 gates and a 27 MHz system clock, this architecture encodes QCIF at 30

frames/s.

3. MVIP-2

3.1. The architecture

MVIP-2 is an evolution from the MViP archi-

tecture [12,13] to implement H.263 video encoding.

Our goal is to obtain a design with a moderate

number of gates and a slow system clock that will

be suitable in the future for low-power applica-

tions such as mobile videotelephony; and also

flexible enough to allow its adaptation to other

standards.

The block diagram of MVIP-2 is shown in Fig.

2. It consists of three functional blocks: the CPU

system, the processing system and the interface

system. MVIP-2 also needs several external mod-

ules: a digital camera, flash and RAM memories to

store code and data for the CPU and SDRAM for

the video memory.

3.1.1. The CPU system

The CPU system is made up of a 32-bit RISC

processor, an address decoder and a programma-

ble interrupt controller (PIC). After a reset, the

decoder maps the RAM, the flash memory and the

peripherals (the specialized processors in the pro-

cessing system and the interfaces in the interface

system) as shown in Fig. 3(a).

The flash memory contains a loader and the

encoding firmware; when the CPU boots from

Fig. 2. Block diagram of MVIP-2.


flash memory the loader configures the decoder to

remap the memory as shown in Fig. 3(b), copies

the firmware to RAM and starts running from this

memory for a faster execution. On the other hand,

all the specialized processors and the interfaces

have the same structure (see Fig. 4): a core to implement specific tasks (e.g. discrete cosine

transform or direct quantization) and an interface

with the RISC based on a configuration register

and a status register. All the configuration registers

have a start bit and all the status registers have a

done bit, which also is connected to one of the 32

inputs of the PIC. The RISC configures and starts

the processors and interfaces by writing in their memory-mapped configuration registers. Also,

when a processor or interface ends its work, it

asserts the done bit that can be polled or generates

an interrupt if enabled.
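As an illustration of this start/done protocol, the firmware-side access could look like the following C sketch. The base address, register layout and bit positions are hypothetical (the paper does not publish its memory map); only the structure, a memory-mapped configuration register with a start bit and a status register with a done bit, follows the text.

```c
#include <stdint.h>

/* Hypothetical address and bit positions; what is taken from the text
   is that every processor and interface has a configuration register
   with a start bit and a status register with a done bit (the done bit
   is also wired to one of the 32 PIC inputs). */
#define DCT_BASE   0x40000000u      /* hypothetical base address      */
#define CFG_START  (1u << 0)        /* start bit, assumed at bit 0    */
#define STS_DONE   (1u << 0)        /* done bit, assumed at bit 0     */

typedef struct {
    volatile uint32_t cfg;          /* configuration register         */
    volatile uint32_t sts;          /* status register                */
} mvip_proc;

static mvip_proc *const dct = (mvip_proc *)DCT_BASE;

/* Write the parameters with the start bit set, then poll done.
   Alternatively, the done bit can raise an interrupt through the PIC. */
static void run(mvip_proc *p, uint32_t params)
{
    p->cfg = params | CFG_START;
    while ((p->sts & STS_DONE) == 0)
        ;                           /* busy-wait on the done bit */
}
```

Since all processors and interfaces share this RISC interface, the same helper serves any of them.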

The CPU carries out the following tasks: initial

configuration of the system, control of the sched-

uling of the specialized processors and the inter-

faces and a part of the work in the inter/intra

decision, VLC and H.263 bit-stream generation.

3.1.2. The processing system

The processing system consists of specialized

processors for implementing the direct and inverse discrete cosine transform (DCT and IDCT), the

direct and inverse quantization (DQ and IQ) and

the motion estimation and compensation (ME

and MC), a set of internal memories (M10...M51) and an interconnection network (CROSSBAR).

The internal memories are a set of macroblock-

size memories that are accessed by the processors

using the CROSSBAR. They are divided up into five groups with different data bus sizes as stated in

Table 1.

The CROSSBAR implements the interface be-

tween the processors and the internal memories. It

has nine read-channels and seven write-channels

on the processor side and 17 memory interface channels on the memory side.

Fig. 4. Structure of all processors and interfaces.

Table 1
Groups of internal memories

Group      Data bus size
M10...M13  8-bit wide
M20...M24  9-bit wide
M30...M33  12-bit wide
M40...M41  11-bit wide
M50...M51  15-bit wide

Fig. 5. The discrete cosine transform processor.

Fig. 3. Memory map of MVIP-2 after a reset (a) and after remapping (b).


The DCT processor (see Fig. 5) reads macro-

blocks from an internal memory (M20 or M21)

using a CROSSBAR read-channel, processes

them, and writes the results in another internal memory (M30...M33) using a CROSSBAR write-channel. Actually, the DCT processor works on a

block basis and six blocks are sequentially pro-

cessed per macroblock. For each block, the pro-

cessor carries out a 64-pel two-dimensional

discrete cosine transform of type DCT-II [14]

particularized for 8-pel wide square blocks.
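As a functional point of reference, the transform in question is the standard two-dimensional 8 × 8 DCT-II; a direct, unoptimized C model is sketched below. A real implementation, the hardware included, would use a fast separable factorization rather than this quadruple loop.

```c
#include <math.h>

#define N 8

/* Direct two-dimensional 8x8 DCT-II (the transform of [14] restricted
   to 8-pel square blocks), written as a plain reference model. */
void dct8x8(const double in[N][N], double out[N][N])
{
    const double PI = 3.14159265358979323846;
    for (int u = 0; u < N; u++)
        for (int v = 0; v < N; v++) {
            double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double sum = 0.0;
            for (int x = 0; x < N; x++)
                for (int y = 0; y < N; y++)
                    sum += in[x][y]
                         * cos((2 * x + 1) * u * PI / (2.0 * N))
                         * cos((2 * y + 1) * v * PI / (2.0 * N));
            out[u][v] = 0.25 * cu * cv * sum;   /* C(u)C(v)/4 scaling */
        }
}
```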

The IDCT processor also reads macroblocks from an internal memory (M30...M33) and writes the results to another internal memory (M23 or

M24). The IDCT is computed with sufficient precision to

be IEEE-1180 [15] compliant.

Fig. 6. The direct quantizer.

Fig. 8. The motion estimation process.

The DQ processor (see Fig. 6) reads the DCT results from M30...M33 and writes the quantized macroblocks in M40 or M41. Actually, the DQ

processor works on a block basis and six blocks

are sequentially processed per macroblock. The

processor carries out the quantization of each 64-

pel data block as defined in [16].
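The quantizer defined in [16] is, in essence, a uniform quantizer with step size 2·QP, with a dead zone for inter blocks and a fixed step of 8 for the intra DC coefficient. A minimal sketch along those lines follows; the exact rounding and clipping rules of [16] are assumed rather than reproduced.

```c
/* Sketch of a TMN-style H.263 quantizer: step 2*QP, a dead zone of
   QP/2 for inter (predicted) blocks, and a fixed step of 8 for the
   intra DC coefficient. Rounding/clipping details are assumptions. */
int quantize(int cof, int qp, int intra, int is_dc)
{
    int sign = (cof < 0) ? -1 : 1;
    int mag  = (cof < 0) ? -cof : cof;

    if (intra && is_dc)
        return (cof + 4) / 8;                  /* intra DC: step 8       */
    if (intra)
        return sign * (mag / (2 * qp));        /* intra AC: no dead zone */
    return sign * ((mag - qp / 2) / (2 * qp)); /* inter: dead zone       */
}
```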

The IQ processor (see Fig. 7) is divided inter-

nally into two main modules that read the same input macroblock from M40 or M41: IQ_proc and

LRL_proc. IQ_proc carries out the inverse quan-

tization as defined in [16] and writes the de-quan-

tized macroblock in M30...M33. LRL_proc

performs a last, run, level encoding and writes the

results in M50 and M51. Both modules share the

control and status registers and a read controller

to get the input data. The zz modules carry out an address translation for reading the blocks in zig-

zag scan.

The ME processor works on a macroblock

basis. For each macroblock in the frame to be

coded (current frame in Fig. 8), the ME carries out

a search in a limited area (search area) around

the counterpart macroblock in a previous frame,

to find the one that minimizes an error function.

Fig. 7. The inverse quantizer.

When the ME processor selects a macroblock from the candidates, it then outputs a motion vector that will allow a decoder to recover the same macroblock from a previously decoded image. The search area is 7.5 pels wide around the counterpart macroblock in the previous frame, the error function is the mean absolute error (MAE) and the motion vector is computed with half-pel precision.
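For reference, the MAE criterion amounts to minimizing a sum of absolute differences over the 16 × 16 macroblock; a plain exhaustive-search sketch is shown below. The actual processor replaces the exhaustive scan with the hierarchical three-step search of [18] plus a half-pel refinement, but the cost function is the same.

```c
#include <limits.h>

#define MB    16   /* macroblock size in pels                        */
#define RANGE 7    /* entire-pel range; the paper refines to +/-7.5  */

/* Sum of absolute differences between the reference macroblock at
   (x, y) and the candidate displaced by (dx, dy) in the previous
   frame. Dividing by 256 would give the MAE proper, but a constant
   factor does not change which candidate is the minimum. */
static int sad16(const unsigned char *cur, const unsigned char *prev,
                 int stride, int x, int y, int dx, int dy)
{
    int sum = 0;
    for (int i = 0; i < MB; i++)
        for (int j = 0; j < MB; j++) {
            int d = cur[(y + i) * stride + (x + j)]
                  - prev[(y + dy + i) * stride + (x + dx + j)];
            sum += (d < 0) ? -d : d;
        }
    return sum;
}

/* Exhaustive entire-pel search (frame-border checks omitted). */
void full_search(const unsigned char *cur, const unsigned char *prev,
                 int stride, int x, int y, int *mvx, int *mvy)
{
    int best = INT_MAX;
    for (int dy = -RANGE; dy <= RANGE; dy++)
        for (int dx = -RANGE; dx <= RANGE; dx++) {
            int e = sad16(cur, prev, stride, x, y, dx, dy);
            if (e < best) { best = e; *mvx = dx; *mvy = dy; }
        }
}
```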

The ME processor consists of four main blocks (see Fig. 9): a controller for reading the search area

from video memory and the reference macroblock

from the internal memories, an internal RAM

bank to store the search area, an entire-pel preci-

sion processor (EST1P) to find a first macroblock

candidate and a half-pel precision processor

(EST1_2P) to refine the result of EST1P and to

output a half-pel precision vector.

The controller uses two IMEM read channels to

read the search area from a former image in the

video memory and one CROSSBAR read channel

to read the reference macroblock from the internal

memories (M10...M13).

The RAM bank is used as in [17] to reduce the

data throughput into the ME. The memory is di-

vided into three blocks, each the size of a half search area. As can be seen in Fig. 10, the right

half search area of macroblock #n overlaps with the left half search area of macroblock #n+1, so only half of the search area must be read for

each macroblock. The controller reads the half

search area of macroblock #n+1 and stores it in a RAM bank block while EST1P reads the entire

search area of macroblock #n from the other two blocks.

Fig. 9. Top-level block diagram of the motion estimation processor.

Fig. 10. Overlapping of the search areas for consecutive macroblocks.

Fig. 11. Architecture of the motion compensation processor.

EST1P is based on EST3P [18], a hierarchical

three-step motion estimation circuit. During the

third step, EST1P reads only a 20 × 20 pels search

area and computes an entire-pel precision motion

vector. In parallel, EST1_2P reads the same 20 × 20 pels search area and, at the end of the third step,

interpolates a half-pel precision search area, reads

the entire-pel precision motion vector from EST1P

and obtains a half-pel precision motion vector in

one more step. Finally, the candidate macroblock

with half-pel precision (Y, CR and CB) is written in video memory using an IMEM write channel (not

shown in Fig. 9).

The MC processor (Fig. 11) reads the macro-

block candidate selected by the ME processor

from video memory using a read IMEM channel

and the reference macroblock from the internal

memories (M10...M13) using a CROSSBAR read channel. If the reference macroblock is to be coded in intra mode, MC writes it in the internal mem-

ories (M20 or M21) using a CROSSBAR write

channel, otherwise MC writes the difference be-

tween the reference and the selected candidate.
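In other words, the compensation stage is a pel-wise pass-through or subtraction; a minimal sketch (luminance only, sizes hardwired for clarity):

```c
/* Pel-wise view of the MC processor: in intra mode the reference
   macroblock is written unchanged; in inter mode the prediction chosen
   by the ME processor is subtracted. The signed result fits the 9-bit
   internal memories (M20, M21) listed in Table 1. */
void compensate(const unsigned char ref[16 * 16],
                const unsigned char pred[16 * 16],
                short out[16 * 16], int intra)
{
    for (int i = 0; i < 16 * 16; i++)
        out[i] = intra ? ref[i] : (short)(ref[i] - pred[i]);
}
```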


3.1.3. The interface system

The interface system is made up of a set of

modules: for the camera (IVIDEOH, IVIDEOV),

for the frame memory (IFRMEM), for the re-

construction memory (IFRECMEM), for the video memory (IMEM) and for the H.263 bit-

stream generation (IT).

The IMEM module supports a set of 16-bit read

or write channels that compete to access the

SDRAM. IMEM performs the SDRAM initial

configuration and the refreshing tasks, manages

the requests of the channels and sends the read or

write commands to the SDRAM. The architecture of this module is shown in Fig. 12. Each of the 16-

bit channels has a controller (rd_ctrl for read

channels or wr_ctrl for write channels) and a four-

word memory (mr or mw). A write channel stores

the 16-bit words sequentially in the four-word

memory using a rq/ack protocol and when a 64-bit

word is completed the controller requests an access

to the memory manager. If the SDRAM is notbusy, the 64-bit word is written and the request is

add_ch0_rd data_ch0_rd

ack_ch0_rd rq_ch0_rd

rd_ctrl

add_ch5_rd data_ch5_rd

ack_ch5_rd rq_ch5_rd

rd_ctrl

add_ch0_wr data_ch0_wr

ack_ch0_wr rq_ch0_wr

wr_ctrl

add_ch4_wr data_ch4_wr

ack_ch4_wr rq_ch4_wr

wr_ctrl

mr0 demux

addrd

addrd

addwr

addwr

datard

datard

datawr

datawr

rq0rd

rq5rd

rq0wr

rq4wr

ack0rd

ack5rd

ack0wr

ack4wr

mr5

mw0

mw4

mu x

Fig. 12. Architecture of the v

acknowledged. The read channels work in a simi-

lar way. The memory manager supports requests

from six read and five write channels, resolves the

priorities, if necessary, performs the read and write

operations in the SDRAM and acknowledges the

channels.The camera interface. The camera interface

consists of two modules (Fig. 13) working on an

image basis: IVIDEO_H reads the images from the

camera in raster-scan format, synchronizes them

with the system clock, performs a horizontal fil-

tering, if necessary, and stores the results in video

memory using an IMEM write channel. IVI-

DEO_V reads the image from video memory, carries out a chrominance sub-sampling and a

vertical filtering and writes the results in video

memory.

The IFRMEM module (Fig. 14) reads the fil-

tered images from video memory using an IMEM

read channel and writes them on a macroblock

basis in one of four internal memories

(M10...M13) using a CROSSBAR write channel.


Fig. 13. The camera interface.

Fig. 14. The frame memory interface (IFRMEM).

Fig. 15. The reconstruction memory interface.

Fig. 16. The H.263 frame interface (IT).

The IFRECMEM module (Fig. 15) is committed

to storing the reconstructed images in video

memory on a macroblock basis. The reconstructed

macroblock, generated by the IDCT processor, is

read from an internal memory (M23 or M24). Also, the prediction obtained from the ME pro-

cessor is read from video memory. In intra mode,

IFRECMEM writes to video memory only the

reconstructed macroblock. In inter mode, IF-

RECMEM writes to video memory the sum of the

reconstructed macroblock plus the prediction.

The IT module (Fig. 16) reads the run-length

coded (RLC) coefficients from the inverse quantization processor and the image, group of block

and macroblock headers from the RISC, assem-

bles the H.263 bit-stream and outputs it through

an 8-bit port. Inside the IT, a module reads the RLC coefficients from the internal memories (M50

or M51) and carries out the VLC. The image and

macroblock headers are written in a header

memory by the RISC as they become available.

The IT module joins both data sources in a byte-

aligned stream that is sent to a first in first out

(FIFO) buffer.
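The last, run, level representation produced by LRL_proc and consumed here can be sketched as follows; how the triples are packed into the 15-bit words of M50/M51 is not detailed in the text, so the output is left simply as an array of triples.

```c
/* (last, run, level) encoding of a zig-zag scanned block of 64
   quantized coefficients: run counts the zeros preceding each nonzero
   level and last marks the final nonzero coefficient of the block. */
typedef struct { int last, run, level; } lrl_t;

int lrl_encode(const int zz[64], lrl_t out[64])
{
    int n = 0, run = 0;
    for (int i = 0; i < 64; i++) {
        if (zz[i] == 0) { run++; continue; }
        out[n].last  = 0;
        out[n].run   = run;
        out[n].level = zz[i];
        n++;
        run = 0;
    }
    if (n > 0)
        out[n - 1].last = 1;   /* mark the final nonzero coefficient */
    return n;
}
```

The IT then maps each triple to its H.263 variable-length code, with the VLC tables largely held by the RISC as described in Section 7.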

3.2. Scheduling of the architecture

MVIP-2 works with three levels of pipeline:

image-level, macroblock-level and pel-level.

The interfaces IVIDEOH, IVIDEOV, IFR-

MEM and IFRECMEM work with an image-

level-pipeline. IMEM supports frame-size logic

pages and the processors use these pages to interchange the images. Each processor reads an image

from a logic page, processes it and stores the re-

sults on a different page for the next processor. In

Fig. 17 a typical coding sequence is shown: inside

each frame period IVIDEOH reads an image from

the camera and carries out the horizontal filtering,

IVIDEOV carries out the vertical filtering, IFR-

MEM reads the filtered image and stores it, on a macroblock basis, in the internal memory (M10...M13) and, at the end of the coding loop, IFRECMEM reads the IDCT output pels from

M23 and M24 and stores them in a page of video

memory.

Table 2 shows the access sequence to video

memory logic pages corresponding to the coding

sequence shown in Fig. 17. In the frame periods (T0...T5) the processors interchange images using seven logic pages (P0...P6), e.g. at T1 IVIDEOH writes the second frame (WR F2) using P1 while IVIDEOV reads the first frame (RD F1), already filtered in the horizontal dimension, from P0.

Fig. 17. Example of encoding sequence.

After

the initial latency of three frames all processors

work in parallel using six logic pages in video

memory.

IFRMEM, ME, MC, DCT, DQ, IQ, IDCT and

IFRECMEM work with a macroblock-level-

pipeline. Each processor reads a macroblock from

one of the internal memories through the

CROSSBAR, then processes it and writes the re-

sults in another internal memory (IFRMEM and

IFRECMEM also access video memory for read-

ing and writing respectively). As Fig. 17 shows, after an initial latency of seven macroblocks all

processors work in parallel.

Table 3 shows a typical inter coding sequence,

where T0...T8 represent macroblock periods. The macroblocks to be processed (current macroblocks) are stored alternately in M10...M13 for IFRMEM (i-frmem WR CF#). The motion estimator reads the current macroblock three times (p-me 3×RD CF#) to carry out the prediction and once more to get the 1/2-pel accuracy. Finally, the

motion compensator reads it once again to com-

pute the difference with the prediction and to write

the result alternately in M20 and M21 (p-mc WR CF#-REC#). The DCT processor reads these data

(p-dct RD#) and stores the transformed coeffi-

cients in M30...M33 (p-dct WR#), from where

they are read and quantized by DQ, which writes

them into M40 or M41 (p-dq WR#). The IQ processor reads the quantized coefficients and

calculates inverse quantization and last, run, level

(LRL) coding simultaneously. The de-quantized

coefficients are stored in M30...M33 (p-iq WR#) and the LRL coded coefficients in M50 or M51

(p-lrl WR#). The IDCT processor reads the

de-quantized coefficients and writes the spatial

domain transformed pels in M23 or M24 (p-idct WR#), from where they are read by IFRECMEM

(i-frecmem RD#). The LRL coefficients are read

by IT (i-tr RD#).

At the image and macroblock level, the sched-

uling can be controlled completely by the RISC:

each processor remains idle until the micropro-

cessor sets its start bit and when the image or

macroblock is processed the processor sets its done bit, which can be polled or generate an interrupt.

The MVIP-2 processors also work with a classic

pel-level-pipeline. The controllers of the processors

have been designed so that the number of their pipeline stages can be modified easily.

Table 2
Access sequence to video memory pages

     T0              T1              T2               T3               T4               T5
P0   IVIDEOH WR F1   IVIDEOV RD F1   IVIDEOH WR F3    IVIDEOV RD F3    IVIDEOH WR F5    IVIDEOV RD F5
P1                   IVIDEOH WR F2   IVIDEOV RD F2    IVIDEOH WR F4    IVIDEOV RD F4    IVIDEOH WR F6
P2                   IVIDEOV WR F1   IFRMEM RD F1     IVIDEOV WR F3    IFRMEM RD F3     IVIDEOV WR F5
P3                                   IVIDEOV WR F2    IFRMEM RD F2     IVIDEOV WR F4    IFRMEM RD F4
P4                                   IFRECMEM WR F1   P-ME RD F2       IFRECMEM WR F3   P-ME RD F4
P5                                                    IFRECMEM WR F2   P-ME RD F3       IFRECMEM WR F4
P6                                                    P-ME WR F2       P-ME WR F3       P-ME WR F4
                                                      P-MC RD F2       P-MC RD F3       P-MC RD F4
                                                      IFRECMEM RD F2   IFRECMEM RD F3   IFRECMEM RD F4

3.3. The software implementation

The coder firmware has been designed for the

RISC processor in C language. In order to allow

flexibility in the design of the encoding algorithm

two application programming interfaces (APIs)

have been implemented:

• The mvip2 API provides access to the proces-

sors and interfaces, supporting the start, stop

and configuration tasks.

• The h263 API supports the generation of the

H.263 stream headers.

The firmware is structured in two pieces of

code:

• init.s is a loader written in assembly language.

• coder.c is the module that implements the

scheduling for the processors and interfaces

using the mvip2 API and the generation of the

H.263 headers using the h263 API.

In the current version of the firmware, the processors that operate at the image-level pipeline

(IVIDEOH and IVIDEOV) are managed by in-

terrupt while the remaining processors are polled.

The module coder.c has a main program and an

interrupt routine. When IVIDEOH is started for

the first time in the main program (see Fig. 18), it synchronizes with the first image, processes it and

generates an interrupt when it finishes. In the in-

terrupt routine (see Fig. 19) IVIDEOH is started

again to wait for the next image and IVIDEOV is

started to process the former. Also, when IVID-

EOV finishes, the interrupt routine is entered and a

flag is asserted.

In the main program, first of all, the SDRAM controller (IMEM) is initialized, IVIDEOH is

started with its interrupt enabled and the RISC

waits for IVIDEOV to assert the flag. When

IVIDEOV ends the processing of the current

image and the flag is asserted in the interrupt

routine, the Image Header is created and sent to

the IT and the processors (IFRECMEM, IFR-

MEM, DCT, IDCT, DQ, IQ, ME and MC) are started in sequence, beginning with the slower

ones. While a macroblock is being processed, the

RISC generates the macroblock header, the group

of blocks header (if necessary) and other parts of

the H.263 bit-stream, sends these data to the IT

and starts it. At the end of the loop processing, the

RISC reads the motion vector and other parame-

ters from the processors and starts them again for the next macroblock.

On the completion of an image, the end of

frame header is generated and sent to the IT and

the flag is deasserted.
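Putting the pieces together, the main program described above could be summarized by the following C sketch. The mvip2_* and h263_* names stand in for the two APIs of Section 3.3; they are illustrative, not the firmware's actual identifiers.

```c
/* Hypothetical firmware API: these prototypes stand in for the mvip2
   and h263 APIs; the names and signatures are illustrative only. */
void mvip2_init_imem(void);                /* SDRAM controller setup     */
void mvip2_start_ivideoh_irq(void);        /* IVIDEOH, interrupt enabled */
void mvip2_start_loop_processors(void);    /* slower processors first    */
void mvip2_start_it(void);
void mvip2_wait_loop_done(void);           /* poll the done bits         */
void mvip2_read_results_and_update(int mb);
void h263_send_picture_header(void);
void h263_send_mb_header(int mb);          /* plus GOB header if needed  */
void h263_send_end_of_frame(void);

#define NUM_MB 99                          /* macroblocks per QCIF frame */

volatile int ivideov_flag;                 /* set by the interrupt routine */

void encode_frames(void)
{
    mvip2_init_imem();
    mvip2_start_ivideoh_irq();

    for (;;) {
        while (!ivideov_flag)              /* wait for a filtered image */
            ;
        h263_send_picture_header();

        for (int mb = 0; mb < NUM_MB; mb++) {
            mvip2_start_loop_processors();
            h263_send_mb_header(mb);       /* header while the loop runs */
            mvip2_start_it();
            mvip2_wait_loop_done();
            mvip2_read_results_and_update(mb);
        }

        h263_send_end_of_frame();
        ivideov_flag = 0;                  /* ready for the next image */
    }
}
```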

Table 3
Macroblock-level operations scheduling (T0...T8 are macroblock periods)

M10: T0 i-frmem WR CF0 | T1 p-me 3×RD CF0 | T2 p-me RD CF0 | T3 p-mc RD CF0 | T4 i-frmem WR CF4 | T5 p-me 3×RD CF4 | T6 p-me RD CF4 | T7 p-mc RD CF4 | T8 i-frmem WR CF8
M11: T1 i-frmem WR CF1 | T2 p-me 3×RD CF1 | T3 p-me RD CF1 | T4 p-mc RD CF1 | T5 i-frmem WR CF5 | T6 p-me 3×RD CF5 | T7 p-me RD CF5 | T8 p-mc RD CF5
M12: T2 i-frmem WR CF2 | T3 p-me 3×RD CF2 | T4 p-me RD CF2 | T5 p-mc RD CF2 | T6 i-frmem WR CF6 | T7 p-me 3×RD CF6 | T8 p-me RD CF6
M13: T3 i-frmem WR CF3 | T4 p-me 3×RD CF3 | T5 p-me RD CF3 | T6 p-mc RD CF3 | T7 i-frmem WR CF7 | T8 p-me 3×RD CF7
M20: T3 p-mc WR CF0-REC0 | T4 p-dct RD 0 | T5 p-mc WR CF2-REC2 | T6 p-dct RD 2 | T7 p-mc WR CF4-REC4 | T8 p-dct RD 4
M21: T4 p-mc WR CF1-REC1 | T5 p-dct RD 1 | T6 p-mc WR CF3-REC3 | T7 p-dct RD 3 | T8 p-mc WR CF5-REC5
M30: T4 p-dct WR 0 | T5 p-dq RD 0 | T6 p-iq WR 0 | T7 p-idct RD 0 | T8 p-dct WR 4
M31: T5 p-dct WR 1 | T6 p-dq RD 1 | T7 p-iq WR 1 | T8 p-idct RD 1
M32: T6 p-dct WR 2 | T7 p-dq RD 2 | T8 p-iq WR 2
M33: T7 p-dct WR 3 | T8 p-dq RD 3
M40: T5 p-dq WR 0 | T6 p-iq RD 0 | T7 p-dq WR 2 | T8 p-iq RD 2
M41: T6 p-dq WR 1 | T7 p-iq RD 1 | T8 p-dq WR 3
M50: T6 p-lrl WR 0 | T7 p-tr RD 0 | T8 p-lrl WR 2
M51: T7 p-lrl WR 1 | T8 p-tr RD 1
M23: T7 p-idct WR 0 | T8 i-frecmem RD 0
M24: T8 p-idct WR 1

If, when IVIDEOV processing finishes, the

processors are still working on the former image,

then an image is skipped.

3.4. The hardware/software co-operation

The co-operation between the specialized processors and the RISC at the macroblock level determines the speed

of the overall system. Fig. 20 shows the computing

period for a macroblock; the top line represents

the RISC tasks and the bottom represents the

processor tasks.

The RISC carries out three main tasks for the

macroblock processing: (1) configuring and start-ing the processors, (2) generating the macroblock

header (and other H.263 bit-stream components)

and starting the IT and (3) reading the results from

the processors and computing parameters for the next macroblock. These tasks can be evaluated in terms of system clock cycles (N, M and P cycles in Fig. 20).

Fig. 18. Main program flowchart.

Fig. 19. IVIDEOH and IVIDEOV interrupt routine flowchart.

The processors are idle before they are started

and while the RISC is reading their results and

computing the parameters for the next macro-

block. Each processor takes a fixed minimum number of clock cycles and, if the processor accesses the video memory, this number is incremented by a quantity that depends on the priority

assigned for this access. The IT processing time

depends on the image and the quantization step. In

Fig. 20, L represents the number of cycles from the

start of the last processor to the end of all pro-

cessor and IT activities.

In the current state of the design, N is about 1000 cycles, M is about 150 cycles, P is about 1400 cycles and L is about 2900 cycles.

4. Design methodology

The design methodology we have used has been oriented towards three objectives: (1) the

design must be flexible enough to evolve or be

reused to implement the encoding loop for other

standards, (2) the functional test of such a com-

plex system must be carried out efficiently and (3)

the design must be oriented towards rapid pro-

totyping.

As well as using an HDL for the design description, we get the flexibility by using the fol-

lowing techniques:

Fig. 20. The HW/SW tasks arrangement.

1 MVIP-2 is a complex design with more than 200 modules. As we use parameters and retiming techniques, the synthesis process is also complex and prone to human error. We use formal verification to ensure that the synthesized netlist is equivalent to the RTL design, ruling out human error in the process.

• Most of the design modules have been parame-

terized. As an example, the DCT processor is

instantiated with the multipliers and transfor-

mation matrix size adequate for the current de-

sign, but these parameters can be changed to

adapt to other designs.

• The initial latency of the controllers has also been parameterized, allowing the automatic

synthesis tools to re-design the pipeline of critical

processors without re-designing the controller.

• The processors can be synthesized to work with

an 8-bit data bus or a 32-bit data bus by chang-

ing a parameter (the firmware is also changed

with the same parameter). The 32-bit option is

faster while the 8-bit option is smaller.• The CROSSBAR design is modular and can be

modified very easily to implement other topolo-

gies. As the control of the processors is implemented

by software, it is very easy to add new proces-

sors to MVIP-2, and to fit them in the macro-

block-level-pipeline.

• Also, the IMEM design is modular. It is very

easy to add new channels and therefore to add new processors to the image-level pipeline.

These features will allow MVIP-2 to be used in

the future as a base for the design of an IP to

implement the hybrid coding loop for MPEG-2 or

MPEG-4 video coders.

An efficient verification is carried out by designing exhaustive functional testbenches with self-test capabilities before the logic synthesis stage, and using

formal verification techniques in the post-synthesis

stage.

The design of MVIP-2 has been oriented to

rapid system prototyping [19]. This feature allows

us to configure a lower complexity version for easy

prototyping.

A simplified diagram of the design cycle is

shown in Fig. 21. The first stage is the development

of the software for the RISC processor and the Verilog register transfer level (RTL) description of

the other hardware modules. A testbench that in-

cludes Verilog simulation models for the camera,

the memories and the RISC allows the functional

tests with HW/SW co-simulations to be carried

out. The second stage is logic synthesis (using

Design Compiler from Synopsys); a netlist is ob-

tained from the RTL description and the area and time restrictions. The third stage is formal verifi-

cation (using Formality from Synopsys) in order to

validate the netlist against the RTL description. 1

The fourth is the prototyping stage and is ex-

plained in more detail in the next section.

5. The prototyping stage

As we said in Section 4, the design of MVIP-2

has been oriented to rapid system prototyping.

Fig. 21. Design cycle.

Among other features, MVIP-2 can be configured

with only one DCT processor to carry out both

DCT and IDCT sequentially, 1-pel accuracy mo-tion estimation and compensation and no filter-

ing while maintaining the rest of the features

stated in Section 3. This configuration allows the

prototyping onto a cost-effective development

board.

In Fig. 22 a block diagram of the testbench used

for prototyping is shown, consisting of three main

modules: a development board, a personal computer with a PCI input/output board and a logic

analyzer.

The modules inside the dotted box in Fig. 22 are

included in the development board, an HSDT200

[20], which is a cost-effective system for prototyping a wide range of designs.

Fig. 22. Block diagram of the prototype testbench.

The core of this board is

an EP20K400BC652-1 FPGA [21] from Altera connected to an ARM7TDMI processor [22]. In

the FPGA, 400,000 gate designs with up to

200,000 bits of memory can usually be imple-

mented and the ARM is a scalar 32-bit RISC that

can execute up to 120 MIPS. Around these ele-

ments the board has 1 Mb of static RAM, 4 Mb of

flash memory and several connectors to ease de-

bugging and to allow working with SDRAMs andstandard interfaces like RS-232, Smart Card and

PCI.


The input–output board in the PC is a PCI-

6534 from National Instruments [23]. Using this

32-channel PCI board a very flexible pattern gen-

erator has been created. The system, which is man-

aged from a shell that has been designed for this

research, emulates an OV6620 colour digital camera [24], allowing an image file to be selected,

displayed on the PC monitor and outputted to the

coder in an infinite loop.

The logic analyzer is a TLA714 from Tektronix.

This 96-channel and 200 MHz logic analyzer al-

lows the address, data and control buses of the

RISC processor or the SDRAM interface to be

watched.

We are using the ARM Software Development

Tools v. 2.50 and multiICE to download and de-

bug the ARM C code; the FPGA Compiler II v.

2000.11-FC3.5 from Synopsys, in order to achieve

better results for the target technology, and

Quartus II v. 1.1 from Altera for the design im-

plementation on the FPGA; and finally, the shell

for the pattern generator management has been developed using LabVIEW v. 6.0.

A photograph of the testbench can be seen in

Fig. 23.

Fig. 23. The testbench.

6. Results

To date, an RTL description of MVIP-2 and a

first version of the software for the RISC processor

have been obtained. Exhaustive functional tests with fixed and random data sequences have been

carried out. Afterwards, the encoder has been

tested with standard sequences such as Foreman,

Silent or Miss America. Finally, the H.263 gener-

ated bit-stream has been tested using ClipPlayer

[25]. In Fig. 24 the first (intra with pquant 16) and

12th (inter with pquant 12) reconstructed images

of Foreman together with their original versions are shown. Also, Table 4 shows the PSNR and

the number of bytes per image for the first 12

images.

The functional tests have shown that MVIP-2

can encode each macroblock in about 4050 system

clock cycles; this performance allows the encoding

of 30 QCIF fps with a 12 MHz system clock or 30

CIF fps with a 48 MHz system clock.
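(These figures are easy to verify: a QCIF frame contains 99 macroblocks, so

$$99\ \frac{\text{MB}}{\text{frame}} \times 4050\ \frac{\text{cycles}}{\text{MB}} \times 30\ \frac{\text{frames}}{\text{s}} \approx 12.0\ \text{MHz},$$

and a CIF frame, four times larger at 396 macroblocks, leads to the quoted 48 MHz. The 4050-cycle figure is also consistent with Section 3.4, matching N + M + L = 1000 + 150 + 2900, which suggests that the P cycles overlap with the loop processing.)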


We have also performed logic synthesis

and formal verification of the entire design with

Design Compiler and Formality.

An MVIP-2 configuration with a DCT proces-

sor to carry out both DCT and IDCT and with

entire-pel accuracy for motion estimation has been synthesized with FPGA Compiler II and fitted

into the EP20K400BC652-1 FPGA using 93% of

its logic cells and 52% of its internal memory re-

sources. Due to the physical limitations in the

prototyping board, the maximum achievable sys-

tem clock frequency is 24 MHz. The tests carried

out in the prototype have shown that, at this fre-

quency, the system can encode QCIF at 60 fps, equivalent to CIF at 15 fps. With a 48 MHz system

clock MVIP-2 would encode 30 CIF fps, but in

order to support this clock frequency the design

must be retargeted onto a different platform (e.g.

an FPGA with an embedded RISC or an ASIC).

The results of the logic synthesis are summa-

rized in Figs. 25 and 26. Fig. 25 shows the size, in

Fig. 24. Original (a) and reconstructed (b) frames of Foreman sequence.

Table 4
Performance data for the first 12 images of Foreman sequence

Image  pquant  PSNR Y (dB)  # of bytes
#1     16      31.31        1760
#2     12      31.28        468
#3     12      31.20        517
#4     12      30.99        405
#5     12      31.23        404
#6     12      31.25        403
#7     12      31.24        336
#8     12      31.18        378
#9     12      31.09        678
#10    12      30.82        693
#11    12      30.90        755
#12    12      30.82        749
#12    12      36.46        82

logic cells, for all processors and interfaces of MVIP-2 except the RISC. The DCT and the ME processors use nearly half of the available logic

cells. As the number of typical equivalent gates in the EP20K400BC652 FPGA is 400,000, we can state for comparison purposes that the number of gates of the design (without the RISC) is about 375,000.

Fig. 26 shows the amount of internal FPGA memory used by the design. About 90 Kbits are

spent in maintaining the macroblock-level pipeline. The ME processor spends about 10 Kbits to store the search area and a small amount is needed by the DCT processor to store coefficients and intermediate results.

Fig. 25. Size of processors and interfaces (logic cells).

7. Conclusions

In this paper MVIP-2, a flexible and efficient

architecture based on a RISC and a set of spe-

cialized processors and interfaces that implements

an H.263 baseline encoder, has been presented.

The design methodology of MVIP-2 has been

oriented towards providing a reusable design, easing exhaustive testing and enabling fast prototyping,

features that will allow its transformation into an

IP.

The set of specialized processors and interfaces has been described in Verilog RTL while the

RISC is a commercial processor. The functionality

of the entire architecture as well as the synthe-

sizability of the RTL code have been exhaustively tested.

The Verilog description of the specialized pro-

cessors and interfaces has parameters for different

architectural features, including: bus widths,

number of IMEM channels, number and type of CROSSBAR channels and number of pipeline stages of the controllers.

Fig. 26. Internal memory used by MVIP-2.

On the other hand, the

RISC implements the scheduling for the proces-

sors and interfaces, most of the VLC tables and

most of the H.263 output stream generation.

Table 5
Our proposal

fps/size   CLK (MHz)   Kgates       Performance
30/QCIF    12          375 + RISC   H.263 baseline
15/CIF     24
30/CIF     48

Table 6
Summary of proposals referenced in Section 2

Ref.   fps/size   CLK (MHz)             Kgates      Performance
[5]    35/QCIF    600                   -           H.263 baseline
[6]    25/CIF     120                   -           H.263 baseline
[7]    21/QCIF    200                   -           -
[8]    30/CIF     54                    1800        H.263 with options. Encoder + decoder.
[9]    30/CIF     30                    <85         H.263 baseline. Without ME/MC.
[10]   29/QCIF    200 (RISC), 66 (ME)   40 + RISC   H.263 baseline. Encoder + decoder.
[11]   30/QCIF    27                    80          H.263 with options.

These

features will allow the design to be easily transformed in the future to implement the hybrid coding loop core for an MPEG-2 or MPEG-4 encoder.

MVIP-2 has also been prototyped onto a com-

mercial board based on an FPGA and a RISC

processor, working with a 24 MHz system clock.

The performance of the design is summarized in

Table 5. As we can see, real-time encoding is achieved with low system clock frequencies; this

feature makes MVIP-2 suitable for low-power ap-

plications like mobile videotelephony. Moreover,

in Table 6 we have summarized the characteristics

of the proposals referenced in Section 2. The direct

comparison of all these proposals is not easy due to


its heterogeneity but, as can be seen, our proposal can encode more frames per second (fps) with a lower system clock frequency than the others, with the exception of [9], which is an encoder without motion estimation and compensation. In [5–7] and [10] the high system clock frequency is compensated by a lower hardware complexity. If we look at the low system clock proposals, [8,9] and [11], we find that all of them use specialized processors to implement the hybrid coding loop. The architecture proposed in [11] has a good trade-off between system clock, hardware complexity and number of fps, but lacks the flexibility of MVIP-2. The proposal in [8] is much more complex, although it implements both encoder and decoder. Finally, the proposal in [9] is fast and of low complexity because motion estimation and compensation are not implemented, and it too lacks the flexibility of MVIP-2.

Acknowledgements

This work is being supported by grant TIC99-

0927 from the Comisión Interministerial de Ciencia y Tecnología of the Spanish Government.

References

[1] ISO/IEC 13818-2 (ITU-T Rec.H.262), Generic coding of

moving pictures and associated audio information: Video,

1995.

[2] ISO/IEC JTC1/SC29/WG11, CD 14496-2 Coding of Audio

Visual Objects: Video, 1998.

[3] ITU-T Rec. H.263, Video Coding for Low Bit-Rate

Communication, 1998.

[4] A.K. Jain, Image Data Compression: A Review, Proceed-

ings of the IEEE, vol. 69, no. 3, March 1981.

[5] S.M. Akramullah, I. Ahmad, M.L. Liou, Optimization of

H.263 video encoding using a single processor computer:

performance tradeoffs and benchmarking, IEEE Transac-

tions on Circuits and Systems for Video Technology 11 (8)

(2001).

[6] K. Herrmann, S. Moch, J. Hilgenstock, P. Pirsch, Imple-

mentation of a multiprocessor system with distributed

embedded DRAM on a large area integrated circuit, IEEE

International Symposium on Defect and Fault Tolerance

in VLSI, Proceedings, 2000, pp. 1665–1669.

[7] T.P.Q. Nguyen, A. Zakhor, K. Yelick, Performance

Analysis of an H.263 video encoder for VIRAM, Department of Electrical Engineering and Computer Sciences,

University of California at Berkeley, 1999.

[8] M. Harrand, J. Sanches, A. Bellon, J. Bulone, A. Tournier,

A single-chip CIF 30-Hz H.261, H.263 and H.263+ video

encoder/decoder with embedded display controller, IEEE

Journal of Solid-State Circuits 34 (11) (1999) 1627–1633.

[9] G. Lienhart, R. Lay, R. Manner, An FPGA video

compressor for H.263 compatible bitstreams, International

Conference on Consumer Electronics, 2000, Digest of

Technical Papers, pp. 320–321.

[10] S.K. Jang, S.D. Kim, J. Lee, G.Y. Choi, J.B. Ra,

Hardware–software co-implementation of a H.263 video

codec, IEEE Transactions on Consumer Electronics 46 (1)

(2000) 191–200.

[11] C. Honsawek, K. Ito, T. Ohtsuka, T. Isshiki, Li Dongju, T.

Adiono, H. Kunieda, System-MSPA design of H.263+

video encoder LSI for face focused videotelephony, The

2000 IEEE Asia-Pacific Conference on Circuits and Sys-

tems, 2000.

[12] J.M. Fernández, F. Moreno, J. Meneses, A high-performance architecture with a macroblock-level-pipeline for MPEG-2

coding, Real Time Imaging Journal 2 (1996) 331–340.

[13] J.M. Fernández, Arquitecturas VLSI para la codificación de imágenes en movimiento en tiempo real, Ph.D. Thesis, E.T.S.I.T., Universidad Politécnica de Madrid, March 1998.

[14] K.R. Rao, P. Yip, Discrete Cosine Transform, Algorithms,

Advantages, Applications, Academic Press, 1990.

[15] IEEE G.216, Presentation to IEEE G.216 Video Com-

pression Measurement Subcommittee on IEEE 1180/1190

Standard, Discrete Cosine Transform Accuracy Test,

January 1998.

[16] Video Codec Test Model, TMN5, Telenor Research, 1995.

[17] C. Sanz, M.A. Freire, J. Meneses, Low Cost ASIC

Implementation of a Three-Step Search Block-Matching

Algorithm for Motion Estimation in Image Coding, Design

Automation and Test in Europe Conference, User's Forum, Paper awarded the Best ASIC Prize, 1999, pp. 75–79.

[18] C. Sanz, M. Garrido, J. Meneses, VLSI Architecture for

Motion Estimation using the Three-Step Block Matching

Algorithm, Design Automation and Test in Europe Con-

ference, Designer Track, 1998, pp. 45–50.

[19] M. Garrido, C. Sanz, M. Jimenez, J. Meneses, A Flexible

H.263 Video Coder Prototype Based on FPGA, 13th IEEE

International Workshop in Rapid System Prototyping,

2002, pp. 34–41.

[20] SIDSA. Semiconductor Design Solutions. Available from

<http://www.sidsa.com>.

[21] APEX 20 K Programmable Logic Device Family data

sheet. Available from <http://www.altera.com/literature/

ds/apex.pdf>.

[22] ARM7TDMITechnical ReferenceManual Rev. 4. Available

from <http://www.arm.com/arm/TRMs?OpenDocument>.

[23] High-Speed 32-bit Digital Pattern I/O and Handshaking.

Available from <http://www.ni.com/pdf/products/us/

mhw332-333e.pdf>.

[24] OV6620 Single-chip CMOS CIF color digital camera.

Available from <http://www.ovt.com/pdfs/ov6620-

DSLF.PDF>.

[25] ClipPlayer V1.1b2. 1996 Fraunhofer-Gesellschaft, IIS.


Matías J. Garrido received the Ingeniero Técnico de Telecomunicación degree in 1986 and the Ingeniero de Telecomunicación degree in 1996, both from the Universidad Politécnica de Madrid. Since 1986 he has been a member of the faculty of the E.U.I.T. de Telecomunicación of the U.P.M., and since 1987 he has been Associate Lecturer at the Department of Electronic and Control Systems. He is a founder member (in 1996) of the Electronic and Microelectronic Design Group (GDEM), participating in design projects from the Spanish and European industry as well as university projects. His areas of interest are electronic digital design, video coding and digital video broadcasting.

César Sanz received the Ingeniero de Telecomunicación degree with honors in 1989 and the Doctor Ingeniero de Telecomunicación degree with summa cum laude in 1998, both from the Universidad Politécnica de Madrid. Since 1984 he has been a member of the faculty of the E.U.I.T. de Telecomunicación of the U.P.M., and since 1999 has been Associate Professor at the Department of Electronic and Control Systems. In addition, he leads the Electronic and Microelectronic Design Group (GDEM), involved in R&D projects with Spanish and European companies and public institutions. His areas of interest are microelectronic design applied to image coding, digital TV and IP-data transmission over digital broadcasting networks.

Marcos Jiménez received the Ingeniero de Telecomunicación degree in 2001 from the Universidad Politécnica de Madrid. He has been a member of the Electronic and Microelectronic Design Group since 2000 and at present also works as a software developer at SIDSA. His areas of interest are real-time video coding hardware implementations and IP transmission over digital television networks.

Juan M. Meneses received the Ingeniero de Telecomunicación degree in 1977 and the Doctor Ingeniero de Telecomunicación degree with summa cum laude in 1985, both from the Universidad Politécnica de Madrid. Since 1989 he has directed a research group in digital architectures for image and video processing. At present, he is a full professor at the Electronics Engineering Department and senior scientist at GDEM.

