2.ASIP FOR H

8/9/2019 2.ASIP FOR H

1/15

Journal of Signal Processing Systems 50, 53–67, 2008

© 2007 Springer Science + Business Media, LLC. Manufactured in The United States.

DOI: 10.1007/s11265-007-0109-y

ASIP Approach for Implementation of H.264/AVC

SUNG DAE KIM AND MYUNG H. SUNWOO

School of Electrical and Computer Engineering, Ajou University, San 5, Wonchun-Dong,

Yeongtong-Ku, Suwon, 442-749, South Korea

Received: 3 April 2007; Accepted: 13 June 2007

Abstract. This paper presents an Application Specific Instruction Set Processor (ASIP) for the implementation of H.264/AVC, called the Video Specific Instruction-set Processor (VSIP). The proposed VSIP has novel instructions and optimized hardware architectures for specific tasks, such as intra prediction, the in-loop deblocking filter, and the integer transform. Moreover, VSIP has coprocessors for the computation-intensive parts of video signal processing, such as inter prediction and entropy coding. The proposed VSIP has a much smaller area than commercial DSP chips and dramatically reduces the number of memory accesses, which results in low power consumption. Moreover, the proposed hardware accelerators are small, consume little power, and can support real-time video processing. VSIP has been thoroughly verified using an FPGA board with a Xilinx Virtex II and can implement a real-time H.264/AVC decoder. The proposed VSIP is a promising solution for video signal processing.

Keywords: application specific instruction-set processor, hardware software codesign, H.264/AVC, low power design, data reuse, hardware accelerator

1. Introduction

With the rapid progress of semiconductor technology, the Application Specific Instruction-set Processor (ASIP), which combines the high performance and low power of an ASIC with the flexibility of a DSP, has become increasingly important. The ASIP market is growing fast since sales of portable devices, such as cellular phones, digital cameras, MP3 (MPEG Layer-3) players, and Portable Multimedia Players (PMP), are increasing dramatically. These applications need high performance, low power consumption, and low cost. Application-Specific Integrated Circuit (ASIC) designs can reduce the cost, size, and power consumption of systems. However, ASIC designs adapt poorly to upgraded standards since they must be redesigned. On the other hand, programmable DSPs can greatly reduce time-to-market and allow faster changes and upgrades, but they may have disadvantages in cost, size, and power consumption. An ASIP offers a compromise between the advantages of ASIC designs and those of general DSP chips [1–5]. An ASIP can also achieve low power at the algorithm/architecture level, which is the most efficient way to reduce power consumption [6].

Multimedia technology has developed along with the progress of semiconductor technology. Multimedia codec technology has been standardized as MPEG-2, MPEG-4, H.261, H.263, etc. The Joint Video Team (JVT) announced H.264/AVC in Dec. 2003 [7]. The new video coding standard H.264/AVC can provide about twice the compression efficiency of MPEG-4. However, compared with the MPEG-4 visual simple profile codec, it has about 2 times more hardware complexity for a decoder and about 10 times more hardware complexity for an encoder [8].

In mobile communications, the implementation of a multimedia codec needs high performance, low power consumption, and low cost. The implementation also requires a flexible system that can be upgraded without being replaced. The ASIP approach is well suited to these requirements. Hence, we propose an ASIP for the implementation of mobile multimedia codecs, called the Video Specific Instruction-set Processor (VSIP) [5]. We implement the VSIP based on the ASIP design flow shown in Fig. 1 [9].

First, the target application is chosen. H.264/AVC is widely used in mobile communication standards, such as DMB, DVB-S2, and DVB-T. Hence, the target application of VSIP is video signal processing including H.264/AVC. Second, we profile the H.264/AVC tasks to find the complexity of each task. According to this complexity, we divide the application into a hardware implementation for high performance and a software implementation for flexibility. H.264/AVC has computation-intensive parts such as inter prediction and entropy coding. To achieve low power consumption and real-time processing, hardware accelerators for inter prediction and entropy coding are required. Next, we design the optimized instruction set and its architecture based on the analysis. H.264/AVC has new features such as intra prediction, the in-loop deblocking filter, and the integer transform. It is inefficient to implement these blocks using existing DSP instructions. Hence, we propose new instructions and their architecture to implement H.264/AVC efficiently. The optimized instruction set can reduce computation complexity, redundancy, and overhead. In general, an ASIP needs far fewer computation cycles to perform its target applications than a general DSP. Finally, the functions of the proposed VSIP have been thoroughly verified using the Xilinx XC2v6000 FPGA.

The proposed VSIP can efficiently perform the new features of H.264/AVC, such as intra prediction, the in-loop deblocking filter, and the integer transform. Moreover, VSIP has hardware accelerators for inter prediction and entropy coding, which account for the largest portion of power consumption and are the timing-critical parts of video processing. Hence, VSIP can implement a real-time, low-power H.264/AVC baseline profile decoder in QCIF format.

Figure 1. Design flow of ASIP. [Blocks: target application selection; application profiling; H/W, S/W partitioning; design of special instructions and architecture; design of hardware accelerators; verification and performance comparison; chip fabrication.]

This paper is organized as follows. Section 2

introduces H.264/AVC and describes existing DSP

instructions to implement multimedia standards.

Section 3 proposes novel instructions and hardware

accelerators, and Section 4 explains performance

comparisons. Finally, Section 5 contains concluding

remarks.

2. Implementation Analysis for H.264/AVC

This section briefly introduces H.264/AVC and shows the results of profiling. Then, various implementations of H.264/AVC are analyzed. This section also presents existing DSP instructions for video signal processing.

2.1. Implementation of H.264/AVC

H.264/AVC has adopted new features to improve coding efficiency, which are described as follows.

Figure 2. Complexity analysis of the H.264/AVC baseline profile.

Figure 3. Computation times of various implementations. a DSP. b ASIC. c VSIP + accelerators.

H.264/AVC uses several reference frames, variable block sizes, and quarter-pixel accuracy in Motion Estimation (ME)/Motion Compensation (MC). These features enable the encoder to search for the best match for the current frame. However, the memory accesses and hardware complexity increase significantly. Past standards, such as MPEG-2, MPEG-4, and H.263, transmit the first frame without compression. On the other hand, the H.264/AVC encoder adopts intra prediction, which eliminates the redundancy within an intra frame.

The block-based structure causes blocking artifacts. Thus, H.264/AVC adopts the in-loop deblocking filter to eliminate them. Exponential Golomb Coding (EGC) and Context Adaptive Variable Length Coding (CAVLC) are also newly adopted features of the H.264/AVC baseline profile. EGC uses variable length codes with a regular construction [10]. CAVLC is the method used to encode the residual data of 4×4 blocks [11–13].

Figure 2 shows the operation complexity of the H.264/AVC baseline profile [14]. As shown in Fig. 2, ME/MC takes 53% and VLC takes 18.20% of the operation complexity. These tasks, in particular, access memory frequently, so dedicated hardware is needed for them to achieve low power consumption. In practice, the computation complexity of VLC is not a dominant part of H.264/AVC; intra prediction and the in-loop deblocking filter have more computation complexity than VLC. However, VLC requires bit manipulation operations that are inefficient to implement on a general processor. Hence, we employ dedicated hardware for VLC. Moreover, inter prediction can be executed in parallel with intra prediction, and entropy coding can be executed in parallel with the in-loop deblocking filter, so parallelism can be exploited for these tasks. The proposed instructions of VSIP can efficiently support intra prediction and the in-loop deblocking filter [5]. To maximize the usage of hardware resources, hardware accelerators for inter prediction and entropy coding are essential. Thus, the proposed VSIP employs hardware accelerators for these tasks.

Figure 5. Proposed VSIP architecture. [Block diagram: PCU (prefetch logic, program counter, instruction register, FSM, stack, interrupt controller); DPU (two MACs, two ALUs, shifter, register file); AU (two AGUs with 16-bit address registers); program memory and data memories 1 and 2; a 16-bit program bus, 16-bit address buses, and 32-bit data buses; IPA (ME and MC hardware accelerators); ECA (CAVLC and EGC accelerators).]

Figure 4. DOTPU4 instruction in TMS320c64x. [src1 = (a0, a1, a2, a3), src2 = (b0, b1, b2, b3), dst = (a0*b0) + (a1*b1) + (a2*b2) + (a3*b3).]

Figure 3 shows the computation times of the DSP, ASIC, and VSIP implementations according to the profiling results. Figure 3(a) shows the computation times of the DSP implementation. If a single DSP is used to implement the H.264/AVC algorithm, the DSP executes all of the algorithm blocks serially. Figure 3(b) shows the computation times of the ASIC implementation. Each block is executed using dedicated hardware. However, the blocks cannot all be executed in parallel, since some blocks use the output of other blocks. For example, the transform block uses the ME results, and the entropy coding block needs the transform/quantization results. Figure 3(c) shows the proposed VSIP with accelerators. The VSIP implementation with accelerators requires more computation time than the ASIC implementation. However, it requires much less computation time than the DSP and can support various profiles and standards.

2.2. Existing DSP Instructions for Video Signal Processing

Existing DSPs support various instructions that execute packed operations between two registers. These operations are used for video signal processing tasks such as DCT, IDCT, and ME/MC. The TMS320c6x of Texas Instruments supports special instructions for multimedia signal processing, such as SUBABS4 and AVGx [15]. The SUBABS4 instruction calculates the absolute differences of four pairs of packed data. The AVG4 instruction calculates the averages of the packed data in two registers: after the four pairs of packed data are added, each of the four sums is shifted right by one bit to divide by two, with 0.5 added to each result for rounding. The TMS320c6x series also supports the DOTPU4 instruction, which calculates the dot product of four pairs of packed 8-bit values. Figure 4 shows the operation flow of the DOTPU4 instruction. The values in both src1 and src2 are treated as unsigned 8-bit packed data. The 32-bit unsigned result is written into dst. Four clock cycles are required to execute this instruction.

DCT has a regular computation flow, while ME/MC and entropy coding have control-based computations. The TMS320c55x has a coprocessor for DCT computations and requires 2.8 MIPS for DCT to achieve a processing speed of 30 fps for the QCIF format. The TMS320c6x, which has eight function units, requires 1.1 MIPS to implement the DCT of 30 fps QCIF video data using DSP instructions [16].
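The packed operations described above can be illustrated with a small behavioral model. The following Python sketch is not TI code and does not use any TI intrinsic; the function names `subabs4`, `avg4`, and `dotpu4` simply mirror the mnemonics in the text, and the byte order (least significant byte first) and the helpers `unpack4` and `pack4` are assumptions made for illustration.

```python
def unpack4(reg):
    """Split a 32-bit word into four unsigned bytes (assumed LSB first)."""
    return [(reg >> (8 * i)) & 0xFF for i in range(4)]


def pack4(values):
    """Pack four unsigned bytes (LSB first) into one 32-bit word."""
    return sum((v & 0xFF) << (8 * i) for i, v in enumerate(values))


def subabs4(src1, src2):
    """Absolute difference of each of the four packed byte pairs."""
    return pack4([abs(a - b) for a, b in zip(unpack4(src1), unpack4(src2))])


def avg4(src1, src2):
    """Rounded average of each byte pair: add, add 1, shift right by one."""
    return pack4([(a + b + 1) >> 1 for a, b in zip(unpack4(src1), unpack4(src2))])


def dotpu4(src1, src2):
    """Unsigned dot product a0*b0 + a1*b1 + a2*b2 + a3*b3."""
    return sum(a * b for a, b in zip(unpack4(src1), unpack4(src2)))


a = pack4([10, 20, 30, 40])
b = pack4([1, 25, 30, 7])
print(unpack4(subabs4(a, b)))  # [9, 5, 0, 33]
print(unpack4(avg4(a, b)))     # [6, 23, 30, 24]
print(dotpu4(a, b))            # 1690
```

All three operations combine byte i of one register with byte i of the other, which is the two-register pattern that the instructions proposed later in the paper depart from.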

In entropy coding, the codeword table is looked up according to the number of successive zeros in the input bit stream. Moreover, packed compare operations are required. To execute these operations, the TMS320c64x supports the LMBD and CMPEQ/GT/LT instructions, and the Blackfin DSP of Analog Devices supports the ONES instruction [16, 17]. The LMBD (leftmost bit detection) instruction counts the number of leading zeros in a register. The CMPEQ/GT/LT instructions compare pairs of 8-bit or 16-bit packed data.

Figure 6. Block boundary. [Pixels q3 q2 q1 q0 | p0 p1 p2 p3 on either side of the boundary.]

Figure 7. Identification of samples for 4×4 intra prediction. [The 4×4 block holds samples a–p in raster order (rows a b c d, e f g h, i j k l, m n o p). Neighboring samples are labeled A–Q, with A–H in the row above the block and I–L in the column to its left.]

Figure 8. Proposed HADD instructions. [(a) dst = HADD(src); (b) dst = HADD(src:mask); (c) dst = HADD(src:mask1.mask2).]

3. Proposed Instructions and Accelerators

This section presents an overall architecture, new

instructions and hardware accelerators for the H.264/

AVC codec.

3.1. Overall Architecture of the Proposed VSIP

Figure 5 shows the overall architecture of the proposed VSIP. The proposed VSIP consists of two parts, a programmable DSP part and a hardware accelerator part. The DSP part has a program control unit (PCU), a data processing unit (DPU), and an address unit (AU). The hardware accelerator part has an Inter Prediction Accelerator (IPA) and an Entropy Coding Accelerator (ECA). IPA consists of an ME accelerator and an MC accelerator. ECA has a CAVLC accelerator and an EGC accelerator. The hardware accelerators can operate in parallel with the DSP units.

PCU consists of prefetch logic, a program counter, an instruction register, an FSM (Finite State Machine), a stack, and an interrupt controller. DPU consists of two Multiply and Accumulate (MAC) units for two 16-bit by 16-bit multiplications and accumulations, two Arithmetic Logic Units (ALU), a barrel shifter, and a register file. AU has two address generation units (AGU) for load and store. The internal word length is 32 bits. The instruction pipeline consists of six stages: pre-fetch, fetch, decode, execute1, execute2, and execute3. The proposed ASIP has 35 arithmetic instructions, 11 logical and shift instructions, 6 program control instructions, 4 move instructions, and 16 special instructions, including instructions for H.264/AVC, which are described next.

3.2. Proposed Instructions for In-loop Deblocking

Filter and Intra Prediction

The in-loop deblocking filter is used to eliminate blocking artifacts, as mentioned in Section 2. Figure 6 shows 8 pixels of neighboring 4×4 blocks. The 8 pixel values are determined from p0–p3 and q0–q3 according to the boundary strength (bS), which represents the difference between the two neighboring blocks. The equations calculating the pixel values are defined in [7]. The equations can be classified into five categories as follows.

\begin{align}
p_2 + p_1 + p_0 \tag{1}\\
p_2 + 2p_1 + 2p_0 \tag{2}\\
2p_3 + 3p_2 + p_1 + p_0 \tag{3}\\
2p_1 + p_0 \tag{4}\\
(p_0 + q_0 + 1) \gg 1 \tag{5}
\end{align}

p0–p3 are packed data in one register, and q0–q3 are packed data in another register. Eq. (1) is the addition of three packed data in one register. Eq. (2) is a one-bit left shift of two data followed by the addition of three packed data in the same register. Eq. (3) is a one-bit left shift of one datum and a multiplication of another datum, followed by the addition of four packed data. Eq. (4) is a one-bit left shift of one packed datum followed by the addition of two packed data. Eq. (5) is the addition of the most significant byte (MSB) of one register and the least significant byte (LSB) of the other register, followed by a one-bit shift. Even though these computations are packed operations, they do not occur between two registers; as shown in Fig. 6, they occur between the packed data within the same register.

Figure 9. Assembly program of core block for in-loop deblocking filter.

As mentioned in Section 2, intra prediction eliminates the redundancy within an intra frame, and also within an inter frame that has little redundancy with other frames. Figure 7 shows the identification of samples for 4×4 intra prediction. Samples a–p in Fig. 7 are predicted using A–Q according to the equations defined in [7]. Some of these equations are shown in Eq. (6), where A, B, and C represent pixel values, each represented using 8 bits. For a 32-bit architecture, A, B, C, and D are stored in one register, since a–p and A–Q in Fig. 7 are 8-bit values.

\begin{equation}
\begin{aligned}
&(A + 2B + C + 2) \gg 2\\
&(A + B + 1) \gg 1\\
&(A + 3B + 2) \gg 2
\end{aligned}
\tag{6}
\end{equation}

Figure 10. Assembly program of intra prediction.

Figure 11. Operation flow of 4×4 integer transform. (a) 1-D forward transform. (b) 1-D inverse transform.

As described in Section 2, existing DSPs support

only packed operations between two registers. A

large number of instruction cycles is required to

implement the in-loop deblocking filter and intra

prediction with the existing packed instructions that

execute packed operations between two registers.

Hence, H.264/AVC may require a new instruction to

execute packed operations within a register.

Figure 8 shows the three proposed horizontal addition (HADD) instructions. The instruction in Fig. 8(a) treats a 32-bit register as four packed 8-bit data, adds the four packed data, and then saturates the result to 8 bits. Figure 8(b) is similar to Fig. 8(a), except that the packed data selected by a mask are shifted left by one bit before the addition. In Fig. 8(c), mask1 selects the data to be added, and mask2 selects the data to be shifted. Eqs. (1), (2), (4), and (5) of the in-loop deblocking filter and Eq. (6) of the intra prediction can be implemented using the proposed HADD instructions.
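As an illustration of the three HADD forms, the following Python sketch models one possible reading of the description above. The function name `hadd`, its parameters, the mask encoding (bit i of a mask refers to byte i, least significant byte first), and the `sat_max` argument are all assumptions; the paper does not specify the bit-level encoding.

```python
def hadd(src, add_mask=0b1111, shift_mask=0b0000, sat_max=255):
    """Behavioral model of a horizontal add over four packed bytes.

    add_mask selects the bytes to be added (mask1 in the text) and
    shift_mask selects the bytes shifted left by one bit before the
    addition (mask2 in the text). The sum is saturated to sat_max.
    """
    total = 0
    for i in range(4):
        if (add_mask >> i) & 1:
            value = (src >> (8 * i)) & 0xFF
            if (shift_mask >> i) & 1:
                value <<= 1  # one-bit left shift doubles the value
            total += value
    return min(total, sat_max)


# Hypothetical register holding p0..p3 in bytes 0..3.
p0, p1, p2, p3 = 3, 5, 7, 9
reg = p0 | (p1 << 8) | (p2 << 16) | (p3 << 24)

print(hadd(reg))                                      # all four: 24
print(hadd(reg, add_mask=0b0111))                     # Eq. (1): p2 + p1 + p0 = 15
print(hadd(reg, add_mask=0b0111, shift_mask=0b0011))  # Eq. (2): p2 + 2*p1 + 2*p0 = 23
print(hadd(reg, add_mask=0b0011, shift_mask=0b0010))  # Eq. (4): 2*p1 + p0 = 13
```

With default arguments the function corresponds to the form in Fig. 8(a); passing only `shift_mask` corresponds to Fig. 8(b), and passing both masks to Fig. 8(c).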

Intra prediction and the in-loop deblocking filter require dot product calculations. TI's TMS320C6x supports the DOTPU4 instruction for dot product calculation, which performs packed multiplications of two registers and adds the four results in four cycles. ADI's ADSP-BF53x, which does not support such special instructions, requires more clock cycles to perform the dot product calculation. Hence, we only compare the proposed HADD instruction with the TMS320C6x instruction.

Figure 9 shows the assembly program of the core block for the in-loop deblocking filter. R0 and R1 are general registers in which the packed pixel data is stored. The result of each instruction is available after one clock cycle. Hence, the proposed VSIP can execute each of these equations for the in-loop deblocking filter in one clock cycle.

Figure 10 shows the assembly program of the intra prediction. Acc represents an accumulator, and the packed pixel data is stored in R0. R1 and R2 hold offset values for rounding. The ADDAR instruction in VSIP adds two source data and then shifts the result right by the immediate value in the instruction. The result of each instruction is available after one clock cycle. Hence, the proposed VSIP can execute each of these equations for intra prediction in two clock cycles.

Figure 13. Assembly program of 4×4 forward 2-D integer transform.

    Loop label1 #2
    Acc = fTRAN(R0)
    R(4) = MOVR(Acc)
    Acc = fTRAN(R1)
    R(5) = MOVR(Acc)
    Acc = fTRAN(R2)
    R(6) = MOVR(Acc)
    Acc = fTRAN(R3)
    R(7) = MOVR(Acc)
    G0 = TRAN(G1)
    label1

Figure 12. Operation flow of the fTRAN and TRAN instructions. (a) Operation flow of the fTRAN instruction: for src = (A, B, C, D), dst = (A+B+C+D, (B-C)+2(A-D), A+D-(B+C), (A-D)-2(B-C)). (b) Operation flow of the TRAN instruction: the rows R0 = (A, B, C, D), R1 = (E, F, G, H), R2 = (I, J, K, L), R3 = (M, N, O, P) are transposed into R4 = (A, E, I, M), R5 = (B, F, J, N), R6 = (C, G, K, O), R7 = (D, H, L, P).
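The two-step evaluation of Eq. (6), a weighted horizontal sum followed by an add-and-shift, can be sketched as follows. This is a minimal Python model under stated assumptions: `addar` follows the textual description of ADDAR (add, then shift right by an immediate), while the helper names `pred_a2bc`, `pred_ab`, and `pred_a3b` are hypothetical and the weighted sums are written out directly rather than through the actual VSIP instructions of Fig. 10.

```python
def addar(src1, src2, shift):
    """Add two source values, then shift the sum right by an immediate."""
    return (src1 + src2) >> shift


def pred_a2bc(a, b, c):
    """(A + 2B + C + 2) >> 2: weighted sum, then rounding offset 2, shift 2."""
    return addar(a + 2 * b + c, 2, 2)


def pred_ab(a, b):
    """(A + B + 1) >> 1: sum, then rounding offset 1, shift 1."""
    return addar(a + b, 1, 1)


def pred_a3b(a, b):
    """(A + 3B + 2) >> 2: weighted sum, then rounding offset 2, shift 2."""
    return addar(a + 3 * b, 2, 2)


print(pred_a2bc(10, 20, 30))  # (10 + 40 + 30 + 2) >> 2 = 20
print(pred_ab(10, 21))        # (10 + 21 + 1) >> 1 = 16
print(pred_a3b(10, 20))       # (10 + 60 + 2) >> 2 = 18
```

In each helper, the first step stands in for the horizontal sum and the second step is the ADDAR with the rounding offset, which is consistent with the two clock cycles stated above.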

3.3. Proposed Instructions for Integer Transform

The 4×4 integer transform can be computed using the forward transform shown in Fig. 11(a). The forward transform is executed on four rows of four packed data. Then, the forward transform is performed again on four columns of four packed data to obtain the results of the 4×4 integer transform. Figure 11(b) represents the inverse transform. Similarly, the 4×4 inverse integer transform can be executed using the operations in Fig. 11(b).

This paper proposes novel instructions to efficiently execute the forward/inverse 4×4 integer transform as follows.

dst = fTRAN(src), dst = iTRAN(src)

The two instructions perform the operations of Fig. 11(a) and (b), respectively.

Figure 15. ME computation flow. (a) Existing computation flow. (b) Proposed computation flow.

Figure 14. Operation flow of the proposed motion estimation. (a) ME operation in the first cycle. (b) ME operation in the second cycle.

Figure 12(a) shows the operation flow of the fTRAN instruction. The fTRAN instruction reads a 32-bit general register in one register file, which consists of four 32-bit registers, and executes the operation flow in Fig. 12(a). Then, the results are written into another register file consisting of four 32-bit registers. The iTRAN instruction performs a similar operation. These instructions can be implemented using the adders and eight additional 2×1 multiplexers. Figure 12(b) shows the operation flow of the TRAN instruction. The general register file holds a 4×4 matrix whose elements are 8-bit pixel data. The TRAN instruction in VSIP executes the transpose of a 4×4 matrix as shown in Fig. 12(b).

The 4×4 integer transform can be easily programmed with the proposed instructions. The assembly program of the 4×4 forward 2-D integer transform is shown in Fig. 13. G0 and G1 represent two general register files, which contain R0, R1, R2, and R3 and R4, R5, R6, and R7, respectively. The Loop instruction repeats the program up to the label. The number of repeats is determined by the immediate value in the instruction. The MOVR instruction moves the source data into a general register that holds four 8-bit pixel data. Each general register file has four general registers. Each VSIP instruction requires one clock cycle. The program in Fig. 13 has nine instructions for the 1-D forward integer transform. To execute the 2-D transform, 18 instructions for computation and 1 instruction for program control are needed. Hence, 19 cycles are required to execute this program. If the data load cycles are added, the 4×4 forward 2-D integer transform takes 23 cycles in total. The implementation using the instructions of TMS320c55x (SW) for the integer transform requires more than 1,078 cycles [18]. Hence, the proposed fTRAN and iTRAN instructions can be more efficient than the existing DSP instructions for the integer transform.
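The loop structure of Fig. 13 (four row transforms followed by one transpose, repeated twice) can be checked with a short Python model. This is only a sketch: `ftran` uses the output expressions given in Fig. 12(a), `tran` is a plain matrix transpose, and the names `forward_2d`, `CORE`, and `matmul` are introduced here for illustration. Register widths, packing, and saturation are ignored.

```python
def ftran(row):
    """1-D forward transform of one row (A, B, C, D), as in Fig. 12(a)."""
    a, b, c, d = row
    return [a + b + c + d,
            (b - c) + 2 * (a - d),
            (a + d) - (b + c),
            (a - d) - 2 * (b - c)]


def tran(block):
    """Transpose of a 4x4 block, as in Fig. 12(b)."""
    return [list(col) for col in zip(*block)]


def forward_2d(block):
    """Two passes of (four row transforms, one transpose), as in Fig. 13."""
    for _ in range(2):
        block = tran([ftran(row) for row in block])
    return block


# Reference: the same result written as the matrix product CORE * X * CORE^T.
CORE = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]


def matmul(x, y):
    """Plain 4x4 integer matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


x = [[5, 11, 8, 10], [9, 8, 4, 12], [1, 10, 11, 4], [19, 6, 15, 7]]
print(forward_2d(x) == matmul(matmul(CORE, x), tran(CORE)))  # True
```

Each pass of the loop body corresponds to the nine instructions counted in the text (eight fTRAN/MOVR instructions plus one TRAN).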

3.4. Proposed Accelerator for Inter Prediction

The proposed MC accelerator supports motion vectors with quarter-pixel accuracy, which is one of the key features of H.264/AVC. However, it does not yet support multiple reference frames, which is another key feature of H.264/AVC. As described in Section 2, ME/MC must access memory frequently, which can be a serious problem for both performance and power consumption. Thus, the sliding window method is used to alleviate this problem [19]. Figure 14 illustrates the proposed ME operation flow.

Figure 17. Comparison of flows for the level of the nonzero coefficient decoding. (a) Generic level of the nonzero coefficient decoding flow: first 1 detect, level decode, and table update for each symbol, over pipeline stages 1 to 5. (b) Proposed level of the nonzero coefficient decoding flow: first 1 detect, level decode, and pre-table update for each symbol, over pipeline stages 1 to 3.

Figure 16. Threshold value of each level decoding table. [vlc=1: 001X; vlc=2: 0011X; vlc=3: 00111X; vlc=4: 001111X; vlc=5: 0011111X; vlc=6: 00111111X; vlc=n: 00, then n ones, then X.]

The proposed ME architecture supports the [-8, +7] search window. In the [-8, +7] search window, sixteen 4×4 blocks exist in a row. In the first cycle, four SADs are calculated simultaneously as shown in Fig. 14(a). Next, the search window shifts right and each operation unit repeats the SAD calculation as shown in Fig. 14(b). The SADs of the upper four pixels of every block in a row are obtained after four cycles, and the 16 SADs are stored in buffers. The SADs of the second row of pixels are calculated in the same way, and these 16 SADs are accumulated with the corresponding 16 SADs in the buffers. Then, after 16 cycles, the 16 SADs of the 4×4 blocks are obtained.

Figure 15 shows the ME computation flows of general architectures [20] and of the proposed architecture. In Fig. 15(a), the pixel values in the dotted block must be fetched again to calculate the SAD of the dotted block after the SADs of the two adjacent blocks (block 1 and block 2) are obtained. However, if a 4×4 block is shifted pixel by pixel as shown in Fig. 15(b), the data in the dotted block in Fig. 15(a) can be reused. Hence, we can achieve low power consumption.
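The cycle count in this description (four SAD units, 16 horizontal candidates, four pixel rows) can be reproduced with a small Python model. It is a sketch of the schedule only, not of the hardware: the names `row_sad` and `block_sads`, the layout of the reference rows (each row holds 16 + 3 pixels so that 16 overlapping 4-pixel windows fit), and the test data are assumptions.

```python
def row_sad(cur_row, ref_row, offset):
    """SAD of one 4-pixel row of the current block at a horizontal offset."""
    return sum(abs(c - ref_row[offset + i]) for i, c in enumerate(cur_row))


def block_sads(cur, ref, candidates=16, units=4):
    """Accumulate row SADs into one buffer per candidate position.

    Each modeled cycle, `units` SAD units handle `units` neighboring
    candidates; the window then shifts right. After all candidates of a
    pixel row are done, the next row is accumulated into the same buffers.
    """
    buffers = [0] * candidates
    cycles = 0
    for row in range(4):
        for first in range(0, candidates, units):
            for k in range(first, first + units):  # parallel in hardware
                buffers[k] += row_sad(cur[row], ref[row], k)
            cycles += 1
    return buffers, cycles


# Synthetic reference rows; the current block is copied from offset 5.
ref = [[(37 * r + 11 * c * c) % 256 for c in range(19)] for r in range(4)]
cur = [row[5:9] for row in ref]
sads, cycles = block_sads(cur, ref)
print(sads[5], cycles)  # 0 16
```

Because neighboring candidates overlap by three pixels, a reference row fetched once serves all 16 candidates, which is the data reuse illustrated in Fig. 15(b).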

3.5. Proposed Accelerator for Entropy Coding

As described in Section 2, H.264/AVC uses EGC and CAVLC for entropy coding. Since EGC has a regular coding structure, the EGC accelerator consists of a barrel shifter, a first-one detector, an adder, etc. The CAVLC encoder looks up the value and the length of the codeword in memory according to the data. Therefore, an efficient memory address generator is needed. The CAVLC decoder is usually implemented with a lookup table. In the decoding process, however, decoding the level of the nonzero coefficient is an iterative method, which can be implemented without a lookup table.
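The regular structure of EGC mentioned above can be seen in a software decoder for the unsigned Exp-Golomb code. The Python sketch below is not the accelerator design; the function name `decode_ue` and the use of a string of '0'/'1' characters as the bit stream are assumptions for illustration. The three steps map loosely onto the listed components: counting leading zeros (first-one detector), extracting the suffix (barrel shifter), and forming the value (adder).

```python
def decode_ue(bits, pos=0):
    """Decode one unsigned Exp-Golomb codeword from a '0'/'1' string.

    Returns (value, position of the next unread bit).
    """
    # Step 1: count zeros up to the first 1 bit.
    leading_zeros = 0
    while bits[pos + leading_zeros] == "0":
        leading_zeros += 1
    # Step 2: the suffix has as many bits as there were leading zeros.
    start = pos + leading_zeros + 1
    suffix = bits[start:start + leading_zeros]
    # Step 3: value = 2^leading_zeros - 1 + suffix.
    value = (1 << leading_zeros) - 1 + (int(suffix, 2) if suffix else 0)
    return value, start + leading_zeros


stream = "1" + "010" + "011" + "00100"
pos, values = 0, []
while pos < len(stream):
    value, pos = decode_ue(stream, pos)
    values.append(value)
print(values)  # [0, 1, 2, 3]
```

No codeword table is needed, since the codeword length follows directly from the number of leading zeros.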

A generic decoding process for the level of the nonzero coefficient is as follows. First, the decoder obtains the number of successive zeros in the input bit stream. Next, the decoder calculates the current symbol length and decodes the current symbol. Finally, the decoder updates the table information used for decoding the next symbol. The decoder cannot decode the next symbol until the table information is decided.

To increase the decoding speed, we propose the pre-table update method. Whether the table is updated depends on whether the current symbol value is greater than the threshold value. Figure 16 shows the threshold value of each level decoding table. As shown there, all threshold values have a regular form: a run of two zeros followed by a run of ones whose length equals the level decoding table index. Hence, we can update the table before the current symbol is decoded.

The generic level of the nonzero coefficient decod-

ing process is shown in Fig. 17(a). The level decoding

process cannot be performed until finishing the table

update process. Therefore, five pipeline stages are

required to decode two symbols. Figure 17(b) shows

the level of the nonzero coefficient decoding process

using the proposed pre-table update method. Since

three pipeline stages are required to decode two

symbols, we can reduce two computation cycles for

level decoding.

4. Implementation and Performance Comparisons

H.264/AVC can be implemented using VSIP with hardware accelerators. The proposed VSIP has been modeled in Verilog HDL and thoroughly verified using the iPROVE FPGA board with the Xilinx Virtex II shown in Fig. 18.

Table 1. Performance comparisons of 4×4 integer transform.

Parameter   TMS320c55x (SW) [18]   TMS320c55x (HW) [18]   TMS320c64x [16]   Proposed VSIP
MIPS        12.8                   2.8                    1.1               1.1

Figure 18. FPGA emulation.

Several core blocks for generating an intra predictor and an in-loop deblocking filter are coded using the proposed special instructions, and the same blocks are also coded using the existing instructions of TMS320c64x. The proposed architecture can reduce the number of clock cycles for generating an intra predictor by about 40% compared with TMS320c6x. Moreover, the total number of clock cycles to execute the in-loop deblocking filter can be reduced by about 20–25% compared with TMS320c6x. TMS320c64x supports the DOTPU4 instruction, which executes packed multiplications of two registers and adds the four results in four cycles. Other DSPs require more instructions, since they do not support such special instructions.

The fTRAN and iTRAN instructions can each be executed in one cycle. Hence, 23 clock cycles are required to execute the 4×4 integer transform using the proposed instructions, and about 1,092,960 clock cycles for 30 frames ((23 cycles × 16 blocks) × 99 macroblocks × 30 frames) are required for QCIF images, since a QCIF image has 99 16×16 macroblocks. Table 1 shows the number of required instructions for 30 frames on existing DSPs [16, 18] and on the proposed VSIP. VSIP can be more efficient for the integer transform than the implementations using the instructions of TMS320c55x (SW) and the coprocessor of TMS320c55x (HW). TMS320c64x is a large VLIW architecture with eight function units, while VSIP requires only two 32-bit adders.
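The cycle count quoted above can be reproduced with a few lines of arithmetic. The Python snippet below only restates the numbers given in the text; the constant names are introduced here, and the QCIF luma size of 176×144 is the standard format definition rather than a value taken from this paper.

```python
CYCLES_PER_4X4_TRANSFORM = 23   # from the text, including data load cycles
BLOCKS_PER_MACROBLOCK = 16      # sixteen 4x4 blocks in a 16x16 macroblock
QCIF_WIDTH, QCIF_HEIGHT = 176, 144
FRAMES = 30

macroblocks_per_frame = (QCIF_WIDTH // 16) * (QCIF_HEIGHT // 16)
total_cycles = (CYCLES_PER_4X4_TRANSFORM * BLOCKS_PER_MACROBLOCK
                * macroblocks_per_frame * FRAMES)

print(macroblocks_per_frame)         # 99
print(total_cycles)                  # 1092960
print(round(total_cycles / 1e6, 2))  # 1.09
```

With one instruction per cycle, about 1.09 million cycles for 30 frames is consistent with the 1.1 MIPS listed for the proposed VSIP in Table 1.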

The optimized VSIP instructions can reduce computation complexity, redundancy, and overhead. Hence, VSIP needs far fewer computation cycles to perform the tasks in H.264/AVC than general DSPs, and it can efficiently reduce the number of memory accesses and achieve low power.

The hardware accelerators have been implemented using the MagnaChip HSI 0.25 μm standard cell library with the Synopsys Astro tool, as shown in Fig. 19. The chip specifications are listed in Table 2. The chip is being fabricated by the Information Technology System on Chip (ITSOC) MPW service in Korea. The ME chip achieves a gate count of 40 K without memory and an operating frequency of 100 MHz. The MC chip achieves a gate count of 10 K without memory and an operating frequency of 150 MHz. ME requires more memory than MC. Moreover, the size of ME is almost four times larger than the size of MC. However, we could not insert the memory of ME in the chip because of the size limitation of the ITSOC MPW service in Korea. Moreover, we separated ME and MC.

Table 2. Chip summary of proposed hardware accelerator for ME and MC.

Parameter             ME hardware accelerator   MC hardware accelerator
Process technology    0.25 μm 1p4m              0.25 μm 1p4m
Logic gate count      40,000                    10,000
Maximum frequency     100 MHz                   150 MHz
On chip memory size   32 kb

Figure 19. Chips of proposed hardware accelerator for ME and MC. (a) ME hardware accelerator. (b) MC hardware accelerator.

Table 3. Performance comparisons of the hardware ME architectures.

Parameter               Clock cycles/frame   Search range   Supported block size     Gate counts (K)
Reference [20]          405,603              [-16, +15]     Variable block support   154
Reference [21]          406,077              [-8, +7]       Variable block support   61
Proposed architecture   431,244              [-8, +7]       Variable block support   40

The proposed ME accelerator significantly reduces the gate count compared with [21] and [22]. Table 3 compares [21], [22], and our architecture. Kim et al. [21] support a larger search range, [−16, +15], than the other architectures, but at a much larger gate count (154 K). The computation cycles required by [21] and [22] are comparable to those of our architecture, which needs about 6% more cycles per frame. However, the gate counts of [21] and [22] (154 K and 61 K) are much larger than that of our architecture (40 K).
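The comparison in Table 3 can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: it copies the cycle and gate counts from Table 3, labels the two reference designs with the citation numbers used in the text ([21] and [22]), and uses a hypothetical helper, relative_to_proposed, that is not part of the original work. The time-per-frame figure assumes that the ME accelerator runs at its 100 MHz maximum frequency from Table 2.

```python
# Figures copied from Table 3; the dictionary keys are illustrative labels only.
TABLE3 = {
    "Kim et al. [21]": {"cycles_per_frame": 405_603, "gates_k": 154},
    "Yap and McCanny [22]": {"cycles_per_frame": 406_077, "gates_k": 61},
    "Proposed": {"cycles_per_frame": 431_244, "gates_k": 40},
}


def relative_to_proposed(table, field):
    """Return each design's value for `field` divided by the proposed design's value."""
    base = table["Proposed"][field]
    return {name: row[field] / base for name, row in table.items()}


gate_ratio = relative_to_proposed(TABLE3, "gates_k")
cycle_ratio = relative_to_proposed(TABLE3, "cycles_per_frame")

for name in TABLE3:
    print(f"{name:22s} gates x{gate_ratio[name]:.2f}  cycles x{cycle_ratio[name]:.3f}")

# Assumption: the ME accelerator is clocked at 100 MHz (maximum frequency in Table 2).
ME_CLOCK_HZ = 100e6
frame_time_ms = TABLE3["Proposed"]["cycles_per_frame"] / ME_CLOCK_HZ * 1e3
print(f"proposed ME time per frame at 100 MHz: {frame_time_ms:.2f} ms")
```

With these figures, the reference designs use about 3.9 and 1.5 times as many gates as the proposed accelerator, while needing roughly 6% fewer cycles per frame.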

The proposed hardware accelerator for CAVLC takes 368 clock cycles per macroblock on average. To meet the real-time requirement for H.264/AVC decoding in the HD1080i format, the proposed design must run at more than 90 MHz. Since its maximum operating frequency is about 130 MHz, the proposed design can support real-time processing.
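The 90 MHz requirement can be reproduced with a short calculation. The sketch below is a back-of-the-envelope check, not the authors' derivation: it assumes a 1920 × 1080 picture padded to whole 16 × 16 macroblocks (1920 × 1088) and 30 frames per second (60 interlaced fields per second), neither of which is stated in the text. The function names are hypothetical.

```python
def macroblocks_per_frame(width, height, mb_size=16):
    """Count macroblocks, rounding each dimension up to a whole macroblock."""
    mbs_wide = (width + mb_size - 1) // mb_size
    mbs_high = (height + mb_size - 1) // mb_size
    return mbs_wide * mbs_high


def required_clock_hz(cycles_per_mb, width, height, frames_per_second):
    """Clock rate needed to process every macroblock of every frame in real time."""
    return cycles_per_mb * macroblocks_per_frame(width, height) * frames_per_second


CYCLES_PER_MB = 368          # average CAVLC cycles per macroblock (from the text)
WIDTH, HEIGHT = 1920, 1080   # assumed HD1080i picture size
FRAMES_PER_SECOND = 30       # assumed: 60 fields/s interlaced = 30 frames/s
MAX_CLOCK_HZ = 130e6         # maximum operating frequency (from the text)

required = required_clock_hz(CYCLES_PER_MB, WIDTH, HEIGHT, FRAMES_PER_SECOND)
headroom = MAX_CLOCK_HZ / required

print(f"required clock: {required / 1e6:.1f} MHz")   # prints "required clock: 90.1 MHz"
print(f"headroom at 130 MHz: {headroom:.2f}x")
```

Under these assumptions a frame holds 8,160 macroblocks, so the accelerator needs 8,160 × 30 × 368 ≈ 90.1 × 10^6 cycles per second. This is consistent with the stated requirement and leaves a margin of roughly 1.4 times against the 130 MHz maximum frequency.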

5. Conclusions

This paper presents an ASIP for video signal processing, called VSIP. VSIP has special instructions and optimized hardware architectures for H.264/AVC. Moreover, VSIP has hardware accelerators for ME/MC and entropy coding. As the performance comparisons show, VSIP requires far fewer computation cycles for the target applications than general-purpose DSPs. Moreover, VSIP can dramatically reduce memory accesses by using the proposed special instructions and hardware accelerators. Hence, VSIP can achieve low power at the algorithm/architecture level. Since the hardware accelerators can operate concurrently, VSIP can efficiently perform real-time video processing and can support various profiles and standards. The proposed VSIP is one of the promising solutions for video signal processing.

Acknowledgements

This work was supported in part by the Ubiquitous Computing and Network (UCN) Project, the Ministry of Information and Communication (MIC) 21st Century Frontier R&D Program in Korea, in part by the IT R&D Project funded by the Korean Ministry of Information and Communications, in part by the second stage of the Brain Korea 21 Project in 2006, and in part by IDEC.

References

1. J. S. Lee, Y. S. Jeon and M. H. Sunwoo, “Design of new DSP instructions and their hardware architecture for high-speed FFT,” in Proc. IEEE Workshop on Signal Processing Syst., Sept. 2001, pp. 80–90.

2. J. Glossner, J. Moreno, M. Moudgill, J. Derby, E. Hokenek, D. Meltzer, U. Shavadron and M. Ware, “Trends in compilable DSP architecture,” in Proc. IEEE Workshop on Signal Processing Syst., 2000, pp. 181–199.

3. J. H. Lee, J. H. Moon, K. L. Heo, M. H. Sunwoo, S. K. Oh and I. H. Kim, “Implementation of Application Specific DSP for OFDM Systems,” in Proc. IEEE Int. Symp. Circuits Syst., May 2004.

4. S. H. Yoon, J. H. Moon and M. H. Sunwoo, “Efficient DSP Architecture for High-Quality Audio Algorithms,” in Proc. IEEE Int. Symp. Circuits Syst., May 2005.

5. S. D. Kim, J. H. Lee, C. J. Hyun and M. H. Sunwoo, “ASIP approach for implementation of H.264/AVC,” in Proc. Asia South Pacific Design Automation Conf., Jan. 2006.

6. J. Chen and K. J. R. Liu, “Cost-effective low-power architectures of video coding systems,” in Proc. IEEE Int. Symp. on Circuits and Syst., May 1999, pp. 153–156.

7. Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264/ISO/IEC 14496-10 (E) AVC), July 2004.

8. J. Ostermann, T. Wedi, et al., “Video coding with H.264/AVC: tools, performance, and complexity,” IEEE Circuits and Systems Magazine, vol. 4, 2004, pp. 7–28.

9. M. K. Jain, M. Balakrishnan and A. Kumar, “ASIP design methodologies: survey and issues,” in Fourteenth International Conference on VLSI Design, Jan. 2001, pp. 76–81.

10. W. Di, G. Wen, H. Mingzeng and J. Zhenzhou, “An Exp-Golomb encoder and decoder architecture for JVT/AVS,” in Proc. 5th International Conference on ASIC, vol. 2, 21–24 Oct. 2003, pp. 910–913.

11. G. Bjontegaard and K. Lillevold, “Context-adaptive VLC (CAVLC) coding of coefficients,” Doc. JVT-028, JVT of ISO/IEC MPEG & ITU-T VCEG 3rd Meeting, Virginia, USA, May 2002.


12. H.-C. Chang, C.-C. Lin and J.-I. Guo, “A Novel Low-Cost High-Performance VLSI Architecture for MPEG-4 AVC/H.264 CAVLC Decoding,” in Proc. IEEE Int. Symp. Circuits Syst., May 2005.

13. Y.-K. Lai, C.-C. Chou and Y.-C. Chung, “A simple and cost effective video encoder with memory-reducing CAVLC,” in Proc. IEEE Int. Symp. Circuits Syst., May 2005.

14. W. I. L. Choi, B. Jeon and J. Jeong, “Fast motion estimation with modified diamond search for variable motion block sizes,” in Proc. International Conference on Image Processing, vol. 3, Sept. 2003, pp. 14–17.

15. TMS320C6000 CPU and Instruction Set Reference Guide, Texas Instruments Inc., Dallas, TX, 2000.

16. TMS320C64 Image/Video Processing Library, Texas Instruments Inc., Dallas, TX, 2003.

17. Blackfin DSP Instruction Set Reference, Analog Devices Inc., Norwood, Mass., 2002.

18. TMS320C55 Hardware Extensions for Image/Video Applications Programmer's Reference, Texas Instruments Inc., Dallas, TX, 2002.

19. T. Wiegand, X. Zhang and B. Girod, “Long-Term Memory Motion-Compensated Prediction,” Trans. Circuit Syst. Video Technol., vol. 9, no. 1, Feb. 1999, pp. 70–84.

20. I. E. G. Richardson, Video Codec Design: Developing Image and Video Compression Systems, Wiley, 2002.

21. M. H. Kim, I. G. Hwang and S. I. Chae, “A Fast VLSI Architecture for Full-Search Variable Block Size Motion Estimation in MPEG-4 AVC/H.264,” in Proc. of Asia and South Pacific Design Automation Conference (ASP-DAC 2005), Shanghai, China, Jan. 2005.

22. S. Y. Yap and J. V. McCanny, “A VLSI Architecture for Variable Block Size Video Motion Estimation,” Trans. Circuit Syst. Video Technol., vol. 51, no. 7, July 2004.

Sung Dae Kim received the B.S. degree in Electronics Engineering from Ajou University, Suwon, Korea, in 2002. He is currently working toward the Ph.D. degree. His current research interests are in the areas of digital image/video processing, DSP, and ASIP, specifically low power and high performance architectures.

Myung H. Sunwoo received the B.S. degree in Electronic Engineering from Sogang University in 1980, the M.S. degree in Electrical and Electronics from the Korea Advanced Institute of Science and Technology in 1982, and the Ph.D. degree in Electrical and Computer Engineering from the University of Texas at Austin in 1990. He worked for the Electronics and Telecommunications Research Institute (ETRI) in Daejeon, Korea, from 1982 to 1985, and for Digital Signal Processor Operations, Motorola, Austin, TX, from 1990 to 1992. Since 1992, he has been a Professor with the School of Electrical and Computer Engineering, Ajou University, Suwon, Korea. In 2000, he was a Visiting Professor in the Department of Electrical and Computer Engineering, University of California, Davis, CA. He has over 300 papers and also holds 37 patents. He has received more than 20 research awards, including the Best Student Paper Award at the IEEE Workshop on Signal Processing Systems (SIPS) 2005, Athens, Greece, and awards from the Ministry of Commerce, Industry and Energy, Samsung Electronics, the Institute of Electronics Engineers of Korea (IEEK), and professional foundations. His research interests include SOC architectures and design for multimedia and communications, application-specific DSP architectures, and application-specific design. He served as a Technical Program Chair of the IEEE Workshop on SIPS in 2003 and the International Conference on SOC Design in 2003, and has served on the program, organizing, steering, and executive committees of major international conferences and workshops, including the IEEE Workshop on SIPS, Cool Chips, Design, Automation and Test in Europe (DATE), the IEEE International ASIC/SOC Conference, the Asian-Pacific Conference on CAS (APC-CAS), the Asian Solid-State Circuits Conference (A-SSCC), the International SOC Design Conference (ISOCC), and the International Symposium on VLSI Design, Automation and Test (VLSI-DAT). He served as an Associate Editor for the IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2002–2003) and as a Guest Editor for the Journal of VLSI Signal Processing (Kluwer, 2004). He is a Director of the National Research Laboratory sponsored by the Ministry of Science and Technology, a Director of the New Growth Engine Semiconductor Center, and an Executive Director of IEEK. Currently, he is a Senior Member of IEEE and Chair of the Seoul Chapter of the IEEE CAS Society.

