  • 8/9/2019 2.ASIP FOR H

    1/15

    Journal of Signal Processing Systems 50, 53–67, 2008

    © 2007 Springer Science + Business Media, LLC. Manufactured in The United States.

    DOI: 10.1007/s11265-007-0109-y

    ASIP Approach for Implementation of H.264/AVC

    SUNG DAE KIM AND MYUNG H. SUNWOO

    School of Electrical and Computer Engineering, Ajou University, San 5, Wonchun-Dong, Yeongtong-Ku, Suwon, 442-749, South Korea

    Received: 3 April 2007; Accepted: 13 June 2007

    Abstract. This paper presents an Application Specific Instruction-set Processor (ASIP) for the implementation of H.264/AVC, called the Video Specific Instruction-set Processor (VSIP). The proposed VSIP has novel instructions and optimized hardware architectures for specific tasks such as intra prediction, the in-loop deblocking filter, and the integer transform. Moreover, VSIP has coprocessors for the computation-intensive parts of video signal processing, such as inter prediction and entropy coding. The proposed VSIP has a much smaller area and requires dramatically fewer memory accesses than commercial DSP chips, which results in low power consumption. Moreover, the proposed hardware accelerators are small and consume little power, and they can support real-time video processing. VSIP has been thoroughly verified using an FPGA board with a Xilinx Virtex II. VSIP can implement a real-time H.264/AVC decoder. The proposed VSIP is one of the promising solutions for video signal processing.

    Keywords: application specific instruction-set processor, hardware software codesign, H.264/AVC, low power design, data reuse, hardware accelerator

    1. Introduction

    With the rapid progress of semiconductor technology, the Application Specific Instruction-set Processor (ASIP), which combines the high performance and low power of an ASIC with the flexibility of a DSP, has become increasingly important. The ASIP market is growing fast as sales of portable devices such as cellular phones, digital cameras, MP3 (MPEG Layer-3) players, and PMPs (Portable Multimedia Players) increase dramatically. These applications need high performance, low power consumption, and low cost. Application-Specific Integrated Circuit (ASIC) designs can reduce the cost, size, and power consumption of systems. However, ASIC designs are poorly suited to standard upgrades, since they must be redesigned. On the other hand, programmable DSPs can greatly reduce time-to-market and allow faster changes and upgrades, but they may have disadvantages in cost, size, and power consumption. An ASIP can combine the advantages of ASIC designs and general DSP chips [1–5]. In other words, ASIP chips adopt the high performance and low power of ASIC chips and the flexibility of DSP chips. An ASIP can reduce power at the algorithm/architecture level, which can provide the most efficient way to achieve low power consumption [6].

    Multimedia technology has developed along with the progress of semiconductor technology. Multimedia codec technology has been standardized as MPEG-2, MPEG-4, H.261, H.263, etc. The Joint Video Team (JVT) announced H.264/AVC in Dec. 2003 [7]. The new video coding standard H.264/AVC can provide about twice the compression efficiency of MPEG-4. However, it has about 2 times more hardware complexity for a decoder, and about 10 times more hardware complexity for an encoder, than the MPEG-4 visual simple profile codec [8].

    In mobile communications, the implementation of a multimedia codec needs high performance, low power consumption, and low cost. The implementation also requires a flexible system that can be upgraded without being replaced. The ASIP approach is well suited to these requirements. Hence, we propose an ASIP for the implementation of a mobile multimedia codec, called the Video Specific Instruction-set Processor (VSIP) [5]. We implement the VSIP based on the ASIP design flow shown in Fig. 1 [9].

    First, the target application is chosen. H.264/AVC is widely used in mobile communication standards, such as DMB, DVB-S2, DVB-T, etc. Hence, the target application of VSIP is video signal processing including H.264/AVC. Second, we profile the H.264/AVC tasks to find their complexity. According to the complexity of each task, we divide the application into a hardware implementation for high performance and a software implementation for flexibility. H.264/AVC has computation-intensive parts such as inter prediction and entropy coding. To achieve low power consumption and real-time processing, hardware accelerators for inter prediction and entropy coding are required. Next, we design the optimized instruction set and its architecture based on the analysis. H.264/AVC has new features such as intra prediction, the in-loop deblocking filter, and the integer transform. It is inefficient to implement these blocks using existing DSP instructions. Hence, we propose new instructions and their architecture to implement H.264/AVC efficiently. The optimized instruction set can reduce computation complexity, redundancy, and overhead. In general, an ASIP needs far fewer computation cycles than a general DSP to perform its target applications. Finally, the functions of the proposed VSIP have been thoroughly verified using the Xilinx XC2V6000 FPGA.

    The proposed VSIP can efficiently perform the new features of H.264/AVC, such as intra prediction, the in-loop deblocking filter, and the integer transform. Moreover, VSIP has hardware accelerators for inter prediction and entropy coding, which account for the largest portion of the power consumption and the critical timing parts of video processing. Hence, VSIP can implement a real-time and low power H.264/AVC baseline profile decoder in QCIF format.

    [Figure 1 flowchart boxes: target application selection; application profiling; H/W, S/W partitioning; design special instructions and architecture; design hardware accelerators; verification and performance comparison; chip fabrication.]

    Figure 1. Design flow of ASIP.

    This paper is organized as follows. Section 2 introduces H.264/AVC and describes existing DSP instructions for implementing multimedia standards. Section 3 proposes novel instructions and hardware accelerators, and Section 4 presents performance comparisons. Finally, Section 5 contains concluding remarks.

    2. Implementation Analysis for H.264/AVC

    This section briefly introduces H.264/AVC and shows the results of profiling. Then, various implementations of H.264/AVC are analyzed. This section also presents existing DSP instructions for video signal processing.

    2.1. Implementation of H.264/AVC

    H.264/AVC has adopted new features to improve coding efficiency, which are described as follows. H.264/AVC uses several reference frames, variable block sizes, and quarter pixel accuracy in Motion Estimation (ME)/Motion Compensation (MC). These features enable the encoder to search for the best match for the current frame. However, the memory accesses and hardware complexity increase significantly. The past standards, such as MPEG-2, MPEG-4, H.263, etc., transmit the first frame without compression. On the other hand, the H.264/AVC encoder adopts intra prediction, which eliminates the redundancy within an intra frame.

    Figure 2. Complexity analysis of the H.264/AVC baseline profile.

    Figure 3. Computation times of various implementations. (a) DSP. (b) ASIC. (c) VSIP + accelerators.

    The block-based structure causes blocking artifacts. Thus, H.264/AVC adopts the in-loop deblocking filter to eliminate them. Exponential Golomb Coding (EGC) and Context Adaptive Variable Length Coding (CAVLC) are also newly adopted features of the H.264/AVC baseline profile. EGC uses variable length codes with a regular construction [10]. CAVLC is the method used to encode the residual data of 4×4 blocks [11–13].
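The regular construction of EGC mentioned above can be illustrated with a short sketch. It assumes the unsigned Exp-Golomb form used by H.264/AVC, in which a value v is written as the binary representation of v + 1 preceded by as many zeros as that representation has bits after its leading one. The function names `ue_encode` and `ue_decode` are illustrative only and do not come from the paper.

```python
def ue_encode(code_num: int) -> str:
    """Return the unsigned Exp-Golomb codeword for code_num as a bit string."""
    if code_num < 0:
        raise ValueError("code_num must be non-negative")
    info = bin(code_num + 1)[2:]           # binary of code_num + 1, starts with '1'
    return "0" * (len(info) - 1) + info    # M leading zeros, then the M + 1 info bits


def ue_decode(bits: str, pos: int = 0) -> tuple[int, int]:
    """Decode one codeword starting at bits[pos]; return (code_num, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":        # count zeros up to the first one
        zeros += 1
    end = pos + 2 * zeros + 1              # the codeword is 2 * M + 1 bits long
    value = int(bits[pos + zeros:end], 2)  # the leading one plus M suffix bits
    return value - 1, end
```

Because the codeword length follows directly from the number of leading zeros, decoding needs only a first-one detection, a shift, and a subtraction, with no lookup table.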

    Figure 2 shows the operation complexity of the H.264/AVC baseline profile [14]. As shown in Fig. 2, ME/MC takes 53% and VLC takes 18.20% of the operation complexity. In particular, these tasks access memory frequently, so dedicated hardware is needed for them to achieve low power consumption. In practice, the computation complexity of VLC is not a dominant part of H.264/AVC; intra prediction and the in-loop deblocking filter have more computation complexity than VLC. However, VLC requires bit manipulation operations, which are inefficient to implement on a general processor. Hence, we employ dedicated hardware for VLC. Moreover, inter prediction can be executed in parallel with intra prediction, and entropy coding can be executed in parallel with the in-loop deblocking filter, so parallelism can be exploited for these tasks. The proposed instructions of VSIP can efficiently support intra prediction and the in-loop deblocking filter [5]. To maximize the usage of hardware resources, hardware accelerators for inter prediction and entropy coding are essential. Thus, the proposed VSIP employs hardware accelerators for these tasks.

    [Figure 5 block diagram. DSP part: PCU (prefetch logic, program counter, instruction register, FSM, stack, interrupt controller); DPU (two MACs, two ALUs, shifter, register file); AU (two AGUs, 16 bit address registers); program memory, data memory 1, data memory 2; program bus (16 bit), address buses (16 bit), data buses (32 bit). Accelerator part: IPA (ME hardware accelerator, MC hardware accelerator); ECA (CAVLC accelerator, EGC accelerator).]

    Figure 5. Proposed VSIP architecture.

    [Figure 4 diagram. src1 = (a0, a1, a2, a3), src2 = (b0, b1, b2, b3), dst = (a0*b0) + (a1*b1) + (a2*b2) + (a3*b3).]

    Figure 4. DOTPU4 instruction in TMS320C64x.

    Figure 3 shows the computation times of the DSP, ASIC, and VSIP implementations according to the profiling results. Figure 3(a) shows the computation times of the DSP implementation. If a single DSP is used to implement the H.264/AVC algorithm, the DSP executes all of the algorithm blocks serially. Figure 3(b) shows the computation times of the ASIC implementation. Each block is executed using dedicated hardware. However, not all of the blocks can be executed in parallel, since some blocks use the output of other blocks. For example, the transform block uses the ME results, and the entropy coding block needs the transform/quantization results. Figure 3(c) shows the proposed VSIP with accelerators. The VSIP implementation with accelerators requires more computation time than the ASIC implementation. However, it requires much less computation time than the DSP and can support various profiles and standards.

    2.2. Existing DSP Instructions for Video Signal Processing

    Existing DSPs support various instructions that execute packed operations between two registers. These operations are used in video signal processing tasks such as DCT, IDCT, and ME/MC. The TMS320C6x of Texas Instruments supports special instructions for multimedia signal processing, such as SUBABS4 and AVGx [15]. The SUBABS4 instruction calculates the absolute differences of four pairs of packed data. The AVG4 instruction calculates the averages of the packed data in two registers: after the four pairs of packed data are added, the four results are shifted one bit to the right for division, with 0.5 added to each result for rounding. The TMS320C6x series also supports the DOTPU4 instruction, which calculates the dot product of four pairs of packed 8 bit values. Figure 4 shows the operation flow of the DOTPU4 instruction. The values in both src1 and src2 are treated as unsigned 8 bit packed data, and the 32 bit unsigned result is written into dst. Four clock cycles are required to execute this instruction.

    DCT has a regular computation flow, while ME/MC and entropy coding have control-based computations. The TMS320C55x has a coprocessor for DCT computations and requires 2.8 MIPS for DCT to achieve a processing speed of 30 fps for the QCIF format. The TMS320C6x, which has eight function units, requires 1.1 MIPS to implement DCT on QCIF video at 30 fps using DSP instructions [16].
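The packed operations described in this section can be emulated in a few lines to make their semantics concrete. The sketch below is only a behavioral model of the descriptions given here, not TI's intrinsics or exact instruction definitions. The helper names and the lane order (lane 0 in the least significant byte) are assumptions made for illustration.

```python
def unpack4(word: int) -> list[int]:
    """Split a 32-bit word into four unsigned bytes, least significant first."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(4)]


def pack4(lanes: list[int]) -> int:
    """Pack four byte values (lane 0 = least significant byte) into one word."""
    return sum((value & 0xFF) << (8 * lane) for lane, value in enumerate(lanes))


def subabs4(src1: int, src2: int) -> int:
    """Absolute differences of four pairs of packed bytes."""
    return pack4([abs(a - b) for a, b in zip(unpack4(src1), unpack4(src2))])


def avg4(src1: int, src2: int) -> int:
    """Rounded averages of four pairs of packed bytes: (a + b + 1) >> 1."""
    return pack4([(a + b + 1) >> 1 for a, b in zip(unpack4(src1), unpack4(src2))])


def dotpu4(src1: int, src2: int) -> int:
    """Unsigned dot product of four pairs of packed bytes, as a 32-bit result."""
    return sum(a * b for a, b in zip(unpack4(src1), unpack4(src2)))
```

All three operate lane by lane on two source registers, which is the pattern that the in-register operations of Section 3.2 do not fit.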

    In entropy coding, the codeword table is looked up according to the number of successive zeros in the input bit stream. Moreover, packed compare operations are required. To execute these operations, the TMS320C64x supports the LMBD and CMPEQ/GT/LT instructions, and the Blackfin DSP of Analog Devices supports the ONES instruction [16, 17]. The LMBD instruction counts the number of leading zeros in a register. The CMPEQ/GT/LT instructions compare pairs of 8 bit or 16 bit packed data.

    [Figure 6 diagram. Pixels q3 q2 q1 q0 and p0 p1 p2 p3 on either side of the block boundary.]

    Figure 6. Block boundary.

    [Figure 7 diagram. Samples a–p of the 4×4 block (rows a b c d, e f g h, i j k l, m n o p) and the neighboring samples A–Q.]

    Figure 7. Identification of samples for 4×4 intra prediction.

    [Figure 8 diagram. The three HADD forms, each operating on the packed data a0 a1 a2 a3 of src: (a) dst = HADD(src); (b) dst = HADD(src:mask); (c) dst = HADD(src:mask1.mask2).]

    3. Proposed Instructions and Accelerators

    This section presents the overall architecture, new instructions, and hardware accelerators for the H.264/AVC codec.

    3.1. Overall Architecture of the Proposed VSIP

    Figure 5 shows the overall architecture of the proposed VSIP. The proposed VSIP consists of two parts, a programmable DSP part and a hardware accelerator part. The DSP part has a program control unit (PCU), a data processing unit (DPU), and an address unit (AU). The hardware accelerator part has an Inter Prediction Accelerator (IPA) and an Entropy Coding Accelerator (ECA). IPA consists of an ME accelerator and an MC accelerator. ECA has a CAVLC accelerator and an EGC accelerator. The hardware accelerators can operate in parallel with the DSP units.

    PCU consists of prefetch logic, a program counter, an instruction register, an FSM (Finite State Machine), a stack, and an interrupt controller. DPU consists of two Multiply and Accumulate (MAC) units for two 16-bit by 16-bit multiplications and accumulations, two Arithmetic Logic Units (ALUs), a barrel shifter, and a register file. AU has two address generation units (AGUs) for load and store. All internal word lengths are 32 bits. The instruction pipeline consists of six stages: pre-fetch, fetch, decode, execute1, execute2, and execute3. The proposed ASIP has 35 arithmetic instructions, 11 logical and shift instructions, 6 program control instructions, 4 move instructions, and 16 special instructions including the instructions for H.264/AVC, which are described next.

    3.2. Proposed Instructions for In-loop Deblocking Filter and Intra Prediction

    The in-loop deblocking filter is used to eliminate blocking artifacts, as mentioned in Section 2. Figure 6 shows 8 pixels of neighboring 4×4 blocks. The 8 pixel values are decided according to the boundary Strength (bS), which represents the difference between two neighboring blocks, using p0–p3 and q0–q3.

    The equations for calculating the pixel values are defined in [7]. The equations can be classified into five categories as follows.

    \begin{align}
    p_2 + p_1 + p_0 \tag{1} \\
    p_2 + 2p_1 + 2p_0 \tag{2} \\
    2p_3 + 3p_2 + p_1 + p_0 \tag{3} \\
    2p_1 + p_0 \tag{4} \\
    (p_0 + q_0 + 1) \gg 1 \tag{5}
    \end{align}

    Here p0–p3 are packed in one register, and q0–q3 are packed in another register. Eq. (1) adds three packed data in one register. Eq. (2) shifts two data left by one bit and then adds three packed data in the same register. Eq. (3) shifts one datum left by one bit, multiplies another datum, and then adds four packed data. Eq. (4) shifts one packed datum left by one bit and then adds two packed data. Eq. (5) adds the most significant byte (MSB) of one register and the least significant byte (LSB) of the other register, and then shifts the result by one bit. Even though these computations are packed operations, they do not occur between two registers; as Fig. 6 shows, they occur between the packed data within the same register.

    Figure 9. Assembly program of core block for in-loop deblocking filter.

    As mentioned in Section 2, intra prediction eliminates the redundancy within an intra frame, and also within an inter frame that has little redundancy between the two frames. Figure 7 shows the identification of samples for 4×4 intra prediction. Samples a–p in Fig. 7 are predicted from A–Q according to the equations defined in [7]. Some of these equations are given in Eq. (6), where A, B, and C represent pixel values and each pixel value is represented using 8 bits. On a 32 bit architecture, A, B, C, and D are stored in one register, since a–p and A–Q in Fig. 7 are 8 bit values.

    [Figure 11 signal-flow graphs. Inputs x(0)–x(3), outputs X(0)–X(3); (a) uses the multipliers 2 and −2, (b) uses the multiplier 1/2.]

    Figure 11. Operation flow of the 4×4 integer transform. (a) 1-D forward transform. (b) 1-D inverse transform.

    Figure 10. Assembly program of intra prediction.

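    \begin{equation}
    \begin{aligned}
    &(A + 2B + C + 2) \gg 2 \\
    &(A + B + 1) \gg 1 \\
    &(A + 3B + 2) \gg 2
    \end{aligned}
    \tag{6}
    \end{equation}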
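As a quick numerical illustration, the three forms in Eq. (6) can be written directly as integer arithmetic. This is a minimal sketch. The function names are invented for illustration, and the inputs are assumed to be unsigned 8 bit pixel values.

```python
def three_tap(a: int, b: int, c: int) -> int:
    """(A + 2B + C + 2) >> 2: weighted average of three neighboring samples."""
    return (a + 2 * b + c + 2) >> 2


def two_tap(a: int, b: int) -> int:
    """(A + B + 1) >> 1: rounded average of two neighboring samples."""
    return (a + b + 1) >> 1


def edge_tap(a: int, b: int) -> int:
    """(A + 3B + 2) >> 2: three-tap form with the last sample repeated."""
    return (a + 3 * b + 2) >> 2
```

In every form the constant added before the shift is half of the divisor, so the right shift rounds to the nearest integer instead of truncating.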

    As described in Section 2, existing DSPs support only packed operations between two registers. Implementing the in-loop deblocking filter and intra prediction with such instructions requires a large number of instruction cycles. Hence, H.264/AVC may require a new instruction to execute packed operations within a register.

    Figure 8 shows the three proposed horizontal addition (HADD) instructions. The instruction in Fig. 8(a) unpacks a 32 bit register into four 8 bit data, adds the four data, and then saturates the result to 8 bits. Figure 8(b) is similar to Fig. 8(a), except that the packed data selected by a mask are shifted left by one bit. In Fig. 8(c), mask1 selects the data to be added, and mask2 selects the data to be shifted. Eqs. (1), (2), (4), and (5) of the in-loop deblocking filter and Eq. (6) of the intra prediction can be implemented using the proposed HADD instructions.
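The following sketch models the HADD behavior as described in the text, as one reading of that description and not the authors' hardware definition. The function name `hadd`, the two mask parameters, the mask bit order, and the placement of p0 in the least significant byte are all assumptions. The three instruction forms correspond to calling it with no mask, with only a shift mask, and with both masks.

```python
def hadd(src: int, add_mask: int = 0b1111, shift_mask: int = 0b0000) -> int:
    """Horizontal add of the four packed bytes of a 32-bit word.

    Bit i of add_mask selects byte lane i for the sum; bit i of shift_mask
    doubles lane i (one-bit left shift) before it is added. Lane 0 is the
    least significant byte. The sum saturates to 8 bits, as the text states.
    """
    total = 0
    for lane in range(4):
        if not (add_mask >> lane) & 1:
            continue                       # lane not selected for the addition
        value = (src >> (8 * lane)) & 0xFF
        if (shift_mask >> lane) & 1:
            value <<= 1                    # one-bit left shift of this lane
        total += value
    return min(total, 0xFF)                # saturate to 8 bits


# Assumed register layout for the demo: p3 | p2 | p1 | p0 (p0 in lane 0).
P0, P1, P2, P3 = 10, 20, 30, 40
SRC = (P3 << 24) | (P2 << 16) | (P1 << 8) | P0

EQ1 = hadd(SRC, add_mask=0b0111)                      # p2 + p1 + p0
EQ2 = hadd(SRC, add_mask=0b0111, shift_mask=0b0011)   # p2 + 2*p1 + 2*p0
EQ4 = hadd(SRC, add_mask=0b0011, shift_mask=0b0010)   # 2*p1 + p0
```

With this layout, Eqs. (1), (2), and (4) each map to a single call, which is consistent with the one-instruction-per-equation claim made for Fig. 9.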

    Intra prediction and the in-loop deblocking filter require dot product calculation. TI's TMS320C6x supports the DOTPU4 instruction for dot product calculation, which performs packed multiplications of two registers and adds the four results in four cycles. ADI's ADSP-BF53x, which does not support such special instructions, requires more clock cycles to perform dot product calculation. Hence, we compare the proposed HADD instruction only with the TMS320C6x instruction.

    Figure 9 shows the assembly program of the core block for the in-loop deblocking filter. R0 and R1 are general registers in which the packed pixel data is stored. Each instruction produces its result after one clock cycle. Hence, the proposed VSIP can execute these equations for the in-loop deblocking filter in one clock cycle.

    Figure 10 shows the assembly program of the intra prediction. Acc represents an accumulator, and the packed pixel data is stored in R0. R1 and R2 hold offset values for rounding. The ADDAR instruction in VSIP adds two source operands and then shifts the result right by the immediate value in the instruction. Each instruction produces its result after one clock cycle. Hence, the proposed VSIP can execute these equations for intra prediction in two clock cycles.

        Loop label1 #2
        Acc = fTRAN(R0)
        R(4) = MOVR(Acc)
        Acc = fTRAN(R1)
        R(5) = MOVR(Acc)
        Acc = fTRAN(R2)
        R(6) = MOVR(Acc)
        Acc = fTRAN(R3)
        R(7) = MOVR(Acc)
        G0 = TRAN(G1)
        label1

    Figure 13. Assembly program of the 4×4 forward 2-D integer transform.

    [Figure 12 diagrams. (a) fTRAN: src = (A, B, C, D) gives dst = (A+B+C+D, (B−C)+2(A−D), A+D−(B+C), (A−D)−2(B−C)). (b) TRAN: R0 = (A, B, C, D), R1 = (E, F, G, H), R2 = (I, J, K, L), R3 = (M, N, O, P) gives R4 = (A, E, I, M), R5 = (B, F, J, N), R6 = (C, G, K, O), R7 = (D, H, L, P).]

    Figure 12. Operation flow of the fTRAN and TRAN instructions. (a) Operation flow of the fTRAN instruction. (b) Operation flow of the TRAN instruction.

    3.3. Proposed Instructions for Integer Transform

    The 4×4 integer transform can be computed using the forward transform shown in Fig. 11(a). The forward transform is executed on four rows of four packed data. Then, the forward transform is performed again on four columns of four packed data to get the results of the 4×4 integer transform. Figure 11(b) represents the inverse transform. Similarly, the 4×4 inverse integer transform can be executed using the operations in Fig. 11(b).

    This paper proposes novel instructions to efficiently execute the forward/inverse 4×4 integer transform as follows.

    dst = fTRAN(src), dst = iTRAN(src).

    These instructions perform the operations of Fig. 11(a) and (b), respectively.

    Figure 15. ME computation flow. (a) Existing computation flow. (b) Proposed computation flow.

    [Figure 14 diagrams. Each panel shows the 4×4 current block, the reference picture, and the SAD operation.]

    Figure 14. Operation flow of the proposed motion estimation. (a) ME operation in the first cycle. (b) ME operation in the second cycle.

    Figure 12(a) shows the operation flow of the fTRAN instruction. The fTRAN instruction reads a 32 bit general register in one register file, which consists of four 32 bit registers, and executes the operation flow in Fig. 12(a). The results are then written into another register file consisting of four 32 bit registers. The iTRAN instruction performs a similar operation. These instructions can be implemented using the adders and eight additional 2:1 multiplexers. Figure 12(b) shows the operation flow of the TRAN instruction. The general register file holds a 4×4 matrix whose elements are 8 bit pixel data. The TRAN instruction in VSIP transposes a 4×4 matrix as shown in Fig. 12(b).

    The 4×4 integer transform can be easily programmed with the proposed instructions. The assembly program of the 4×4 forward 2-D integer transform is shown in Fig. 13. G0 and G1 represent two general register files, which contain R0, R1, R2, and R3 and R4, R5, R6, and R7, respectively. The Loop instruction repeats the program up to the label, and the number of repetitions is determined by the immediate value in the instruction. The MOVR instruction moves the source data into a general register that holds four 8 bit pixel data. Each general register file has four general registers. Each VSIP instruction requires one clock cycle. The program in Fig. 13 has nine instructions for the 1-D forward integer transform. To execute the 2-D transform, 18 instructions for computation and 1 instruction for program control are needed. Hence, 19 cycles are required to execute this program. If the data load cycles are added, the 4×4 forward 2-D integer transform takes 23 cycles in total. The implementation using the instructions of the TMS320C55x (SW) for the integer transform requires more than 1,078 cycles [18]. Hence, the proposed fTRAN and iTRAN instructions can be more efficient than the existing DSP instructions for the integer transform.
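The row, transpose, row, transpose structure of the program in Fig. 13 can be checked with a small behavioral model. The sketch below follows the fTRAN output expressions and the TRAN transpose given in Fig. 12, but it uses plain Python integers rather than packed 8 bit registers. The function names and the matrix-based reference are illustrative assumptions.

```python
def ftran(row):
    """1-D forward transform of four values, using the Fig. 12(a) expressions."""
    a, b, c, d = row
    return [a + b + c + d,
            (b - c) + 2 * (a - d),
            (a + d) - (b + c),
            (a - d) - 2 * (b - c)]


def tran(block):
    """Transpose a 4x4 block, as the TRAN instruction does on a register file."""
    return [list(column) for column in zip(*block)]


def forward_4x4(block):
    """Model of the Fig. 13 loop: two passes of four fTRANs and one TRAN."""
    g0 = [list(row) for row in block]
    for _ in range(2):                       # Loop label1 #2
        g1 = [ftran(row) for row in g0]      # Acc = fTRAN(Rn); R(n+4) = MOVR(Acc)
        g0 = tran(g1)                        # G0 = TRAN(G1)
    return g0


def reference_4x4(block):
    """Matrix form C * X * C^T of the same transform, for comparison."""
    c = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]
    cx = [[sum(c[i][k] * block[k][j] for k in range(4)) for j in range(4)]
          for i in range(4)]
    return [[sum(cx[i][k] * c[j][k] for k in range(4)) for j in range(4)]
            for i in range(4)]
```

The second transpose returns the coefficients to their natural orientation, so the loop output matches the matrix form without any extra reordering step.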

    3.4. Proposed Accelerator for Inter Prediction

    The proposed MC accelerator supports motion vectors with quarter pixel accuracy, which is one of the key features of H.264/AVC. However, it does not yet support multiple reference frames, which is another key feature of H.264/AVC. As described in Section 2, ME/MC must access memory frequently, which can be a serious problem for both performance and power consumption. Thus, the sliding window method is used to alleviate this problem [19]. Figure 14 illustrates the proposed ME operation flow.

    [Figure 17 diagrams. (a) Two symbols, each with First 1 detect, Level Decode, and Table Update, over pipeline stages 1–5. (b) Two symbols, each with First 1 detect, Level Decode, and Pre Table Update, over pipeline stages 1–3.]

    Figure 17. Comparison of flows for the level of the nonzero coefficient decoding. (a) Generic decoding flow. (b) Proposed decoding flow.

    vlc=1    001X
    vlc=2    0011X
    vlc=3    00111X
    vlc=4    001111X
    vlc=5    0011111X
    vlc=6    00111111X
    vlc=n    00 1...1 X (n ones)

    Figure 16. Threshold value of each level decoding table.

    The proposed ME architecture supports the [+8, −7] search window. In the [+8, −7] search window, sixteen 4×4 blocks exist in a row. In the first cycle, four SADs are calculated simultaneously, as shown in Fig. 14(a). Next, the search window shifts right and each operation unit repeats the SAD calculation, as shown in Fig. 14(b). The SADs of the upper four pixels of every block in a row are obtained after four cycles, and the 16 SADs are stored in buffers. The SADs of the second row are calculated in the same way and accumulated with the corresponding 16 SADs in the buffers. After 16 cycles, the 16 SADs of the 4×4 blocks are obtained.

    Figure 15 shows the ME computation flows of general architectures [20] and the proposed architecture. In Fig. 15(a), the pixel values in the dotted block must be fetched again to calculate the SAD of the dotted block after the SADs of two adjacent blocks (block 1 and block 2) are obtained. However, if a 4×4 block is shifted pixel by pixel as shown in Fig. 15(b), the data in the dotted block in Fig. 15(a) can be reused. Hence, we can achieve low power consumption.
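The accumulation order described above can be sketched in software. The model below is sequential and only mirrors the order in which partial SADs are produced and accumulated (one pixel row at a time, 16 horizontal candidates per row); it does not model the four parallel SAD units or the cycle counts. The function names and the demo data are made up for illustration.

```python
def row_sad(cur_row, ref_row, x):
    """SAD of one 4-pixel row of the current block against ref_row[x:x + 4]."""
    return sum(abs(c - r) for c, r in zip(cur_row, ref_row[x:x + 4]))


def sads_by_row(cur, ref, positions=16):
    """Accumulate row SADs into one buffer entry per horizontal candidate."""
    acc = [0] * positions
    for y in range(4):                   # upper row first, then the rows below
        for x in range(positions):       # window slides right one pixel at a time
            acc[x] += row_sad(cur[y], ref[y], x)
    return acc


def sad_direct(cur, ref, x):
    """Reference: full 4x4 SAD of the candidate at horizontal offset x."""
    return sum(abs(cur[y][i] - ref[y][x + i])
               for y in range(4) for i in range(4))


# Demo data (made up): 4 rows of 19 reference pixels, current block cut at x = 5.
REF = [[(7 * i * i + 13 * y) % 256 for i in range(19)] for y in range(4)]
CUR = [row[5:9] for row in REF]
SADS = sads_by_row(CUR, REF)
```

Adjacent candidates share three of their four pixels in each row, which is the overlap that the pixel-by-pixel shift of Fig. 15(b) reuses instead of fetching again.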

    3.5. Proposed Accelerator for Entropy Coding

    As described in Section 2, H.264/AVC uses EGC and CAVLC for entropy coding. Since EGC has a regular coding structure, the EGC accelerator consists of a barrel shifter, a first-one detector, an adder, etc. The CAVLC encoder looks up the value and the length of the codeword in memory according to the data. Therefore, an efficient memory address generator is needed. The CAVLC decoder is usually implemented with a lookup table. In the decoding process, decoding the level of the nonzero coefficient is an iterative method, which can be implemented without a lookup table.

    A generic decoding process for the level of the nonzero coefficient is as follows. First, the decoder obtains the number of successive zeros in the input bit stream. Next, the decoder calculates the current symbol length and decodes the current symbol. Finally, the decoder updates the table information used for decoding the next symbol. The decoder cannot decode the next symbol until the table information is decided.

    To increase the decoding speed, we propose the pre-table update method. Whether the table is updated depends on whether the current symbol value is greater than the threshold value. Figure 16 shows the threshold value of each level decoding table. As shown there, all threshold values have a regular form: a run of two zeros followed by a run of ones whose length equals the level decoding table index. Hence, we can update the table before decoding the current symbol.

    The generic decoding process for the level of the nonzero coefficient is shown in Fig. 17(a). The level decoding process cannot be performed until the table update process has finished. Therefore, five pipeline stages are required to decode two symbols. Figure 17(b) shows the decoding process for the level of the nonzero coefficient using the proposed pre-table update method. Since only three pipeline stages are required to decode two symbols, two computation cycles can be saved in level decoding.

    4. Implementation and Performance Comparisons

    H.264/AVC can be implemented using VSIP with hardware accelerators. The proposed VSIP has been modeled in Verilog HDL and thoroughly verified using the iPROVE FPGA board with the Xilinx Virtex II shown in Fig. 18.

    Several core blocks for generating an intra predictor and an in-loop deblocking filter were coded using the proposed special instructions, and the same blocks were also coded using the existing instructions of the TMS320C64x. The proposed architecture reduces the number of clock cycles for generating an intra predictor by about 40% compared with the TMS320C6x. Moreover, the total number of clock cycles to execute the in-loop deblocking filter is reduced by about 20–25% compared with the TMS320C6x. The TMS320C64x supports the DOTPU4 instruction, which executes packed multiplications of two registers and adds the four results in four cycles. Other DSPs require more instructions, since they do not support such special instructions.

    Table 1. Performance comparisons of the 4×4 integer transform.

    Parameter    TMS320C55x (SW) [18]    TMS320C55x (HW) [18]    TMS320C64x [16]    Proposed VSIP
    MIPS         12.8                    2.8                     1.1                1.1

    Figure 18. FPGA emulation.

    The fTRAN and iTRAN instructions each execute in one cycle. Hence, 23 clock cycles are required to execute a 4×4 integer transform using the proposed instructions, and 1,092,960 clock cycles ((23 cycles × 16 blocks) × 99 macroblocks × 30 frames) are required for 30 frames of QCIF images, since a QCIF image has 99 16×16 macroblocks. Table 1 shows the number of required instructions (in MIPS) for 30 frames on existing DSPs [16, 18] and the proposed VSIP. For the integer transform, VSIP is more efficient than the implementations using the instructions of TMS320c55 (SW) or the coprocessor of TMS320c55 (HW). TMS320c64 is a large VLIW architecture with eight functional units, while VSIP requires only two 32-bit adders.
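The cycle count quoted above can be reproduced with a few lines of Python. The sketch below only restates the arithmetic in the text (23 cycles per 4×4 block, 16 blocks per macroblock, 30 frames). The QCIF luma resolution of 176×144 and the reading of 30 frames as one second of video are assumptions used to relate the result to the MIPS figure in Table 1. All names are hypothetical.

```python
CYCLES_PER_4X4_BLOCK = 23            # from the text, using fTRAN/iTRAN
BLOCKS_PER_MACROBLOCK = 16           # sixteen 4x4 blocks in a 16x16 macroblock
QCIF_WIDTH, QCIF_HEIGHT = 176, 144   # assumed QCIF luma resolution
FRAMES = 30                          # assumed to correspond to one second


def macroblocks_per_frame(width: int, height: int, mb_size: int = 16) -> int:
    """Number of mb_size x mb_size macroblocks in a frame."""
    return (width // mb_size) * (height // mb_size)


def transform_cycles(frames: int = FRAMES) -> int:
    """Total integer-transform cycles for the given number of QCIF frames."""
    mbs = macroblocks_per_frame(QCIF_WIDTH, QCIF_HEIGHT)
    return CYCLES_PER_4X4_BLOCK * BLOCKS_PER_MACROBLOCK * mbs * frames


mbs = macroblocks_per_frame(QCIF_WIDTH, QCIF_HEIGHT)
total = transform_cycles()
print(mbs)                    # 99
print(total)                  # 1092960
print(round(total / 1e6, 1))  # 1.1
```

Under the one-second assumption, 1,092,960 cycles corresponds to roughly 1.1 million cycles per second, which agrees with the VSIP entry in Table 1.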

    The optimized VSIP instructions reduce computational complexity, redundancy, and overhead. Hence, VSIP needs far fewer computation cycles than general DSPs to perform H.264/AVC tasks. It also reduces the number of memory accesses, which lowers power consumption.

    The hardware accelerators have been implemented using the MagnaChip HSI 0.25 μm standard cell library with the Synopsys Astro tool, as shown in Fig. 19. The chip specifications are listed in Table 2. The chip is being fabricated by the Information Technology System on Chip (ITSOC) MPW service

    Table 2. Chip summary of proposed hardware accelerators for ME and MC.

    Parameter             ME hardware accelerator   MC hardware accelerator
    Process technology    0.25 μm 1p4m              0.25 μm 1p4m
    Logic gate count      40,000                    10,000
    Maximum frequency     100 MHz                   150 MHz
    On chip memory size   32 kb

    Figure 19. Chips of proposed hardware accelerators for ME and MC. (a) ME hardware accelerator. (b) MC hardware accelerator.

    Table 3. Performance comparisons of the hardware ME architectures.

    Parameter               Clock cycles/frame   Search range   Supported block size     Gate counts (K)
    Reference [21]          405,603              [−16, +15]     Variable block support   154
    Reference [22]          406,077              [−8, +7]       Variable block support   61
    Proposed architecture   431,244              [−8, +7]       Variable block support   40


    in Korea. The ME chip has a gate count of 40 K (excluding memory) and an operating frequency of 100 MHz. The MC chip has a gate count of 10 K (excluding memory) and an operating frequency of 150 MHz. ME requires more memory than MC, and the ME logic is almost four times larger than that of MC. Because of the size limitation of the ITSOC MPW service, we could not place the ME memory on the chip, and we separated ME and MC.

    The proposed ME accelerator significantly reduces the gate count compared with [21] and [22]. Table 3 compares [21], [22], and our architecture. The architecture of Kim et al. [21] supports a larger search range than the other architectures, but it has a much larger gate count. The computation cycles required by [21] and [22] are comparable to those of our architecture, whereas their total gate counts are much larger.

    The proposed hardware accelerator for CAVLC takes 368 clock cycles on average to decode a macroblock. To meet the real-time processing requirement for H.264/AVC decoding in the HD1080i format, the proposed design must run at more than 90 MHz. The proposed design can support real-time processing, since its maximum operating frequency is about 130 MHz.
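The 90 MHz requirement can be checked with a short back-of-the-envelope calculation. The Python sketch below assumes a 1920×1080 picture, with the height rounded up to a whole number of 16×16 macroblocks, and 30 frames per second. Neither value is stated in the text. Only the 368 cycles per macroblock and the 130 MHz maximum frequency come from the paper, and the function name is hypothetical.

```python
import math

CYCLES_PER_MACROBLOCK = 368   # average CAVLC decoding cost, from the text
MAX_FREQUENCY_HZ = 130e6      # maximum operating frequency, from the text
WIDTH, HEIGHT = 1920, 1080    # assumed HD1080 picture size
FRAME_RATE = 30               # assumed frames per second


def required_frequency_hz(width, height, fps, cycles_per_mb, mb_size=16):
    """Cycles per second needed to decode every macroblock of every frame."""
    mbs_x = math.ceil(width / mb_size)
    mbs_y = math.ceil(height / mb_size)
    return mbs_x * mbs_y * fps * cycles_per_mb


needed = required_frequency_hz(WIDTH, HEIGHT, FRAME_RATE, CYCLES_PER_MACROBLOCK)
print(needed)                      # 90086400
print(needed <= MAX_FREQUENCY_HZ)  # True
```

With these assumptions the decoder needs about 90.1 million cycles per second, which is consistent with the stated 90 MHz bound and leaves headroom below the 130 MHz maximum.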

    5. Conclusions

    This paper presents an ASIP for video signal processing, called VSIP. VSIP has special instructions and optimized hardware architectures for H.264/AVC. Moreover, VSIP has hardware accelerators for ME/MC and entropy coding. As shown in the performance comparisons, VSIP needs far fewer computation cycles than general DSPs to perform the target applications. Moreover, VSIP can dramatically reduce memory accesses by using the proposed special instructions and the hardware accelerators. Hence, VSIP can achieve low power at the algorithm/architecture level. Since the hardware accelerators can operate concurrently, VSIP can efficiently perform real-time video processing, and it can support various profiles and standards. The proposed VSIP is a promising solution for video signal processing.

    Acknowledgements

    This work was supported in part by the Ubiquitous Computing and Network (UCN) Project, the Ministry of Information and Communication (MIC) 21st Century Frontier R&D Program in Korea, in part by the IT R&D Project funded by the Korean Ministry of Information and Communications, in part by the second stage of the Brain Korea 21 Project in 2006, and in part by IDEC.

    References

    1. J. S. Lee, Y. S. Jeon and M. H. Sunwoo, "Design of new DSP instructions and their hardware architecture for high-speed FFT," in Proc. IEEE Workshop on Signal Processing Syst., Sept. 2001, pp. 80–90.

    2. J. Glossner, J. Moreno, M. Moudgill, J. Derby, E. Hokenek, D. Meltzer, U. Shavadron and M. Ware, "Trends in compilable DSP architecture," in Proc. IEEE Workshop on Signal Processing Syst., 2000, pp. 181–199.

    3. J. H. Lee, J. H. Moon, K. L. Heo, M. H. Sunwoo, S. K. Oh and I. H. Kim, "Implementation of Application Specific DSP for OFDM Systems," in Proc. IEEE Int. Symp. Circuits Syst., May 2004.

    4. S. H. Yoon, J. H. Moon and M. H. Sunwoo, "Efficient DSP Architecture for High-Quality Audio Algorithms," in Proc. IEEE Int. Symp. Circuits Syst., May 2005.

    5. S. D. Kim, J. H. Lee, C. J. Hyun and M. H. Sunwoo, "ASIP approach for implementation of H.264/AVC," in Proc. Asia South Pacific Design Automation Conf., Jan. 2006.

    6. J. Chen and K. J. R. Liu, "Cost-effective low-power architectures of video coding systems," in Proc. IEEE Int. Symp. on Circuits and Syst., May 1999, pp. 153–156.

    7. Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264/ISO/IEC 14496-10 (E) AVC), July 2004.

    8. J. Ostermann, T. Wedi, et al., "Video coding with H.264/AVC: tools, performance, and complexity," IEEE Circuits and Systems Magazine, vol. 4, 2004, pp. 7–28.

    9. M. K. Jain, M. Balakrishnan and A. Kumar, "ASIP design methodologies: survey and issues," in Fourteenth International Conference on VLSI Design, Jan. 2001, pp. 76–81.

    10. W. Di, G. Wen, H. Mingzeng and J. Zhenzhou, "An Exp-Golomb encoder and decoder architecture for JVT/AVS," in Proc. 5th International Conference on ASIC, vol. 2, 21–24 Oct. 2003, pp. 910–913.

    11. G. Bjontegaard and K. Lillevold, "Context-adaptive VLC (CAVLC) coding of coefficients," Doc. JVT-028, JVT of ISO/IEC MPEG & ITU-T VCEG 3rd Meeting, Virginia, USA, May 2002.


    12. H.-C. Chang, C.-C. Lin and J.-I. Guo, "A Novel Low-Cost High-Performance VLSI Architecture for MPEG-4 AVC/H.264 CAVLC Decoding," in Proc. IEEE Int. Symp. Circuits Syst., May 2005.

    13. Y.-K. Lai, C.-C. Chou and Y.-C. Chung, "A simple and cost effective video encoder with memory-reducing CAVLC," in Proc. IEEE Int. Symp. Circuits Syst., May 2005.

    14. W. I. L. Choi, B. Jeon and J. Jeong, "Fast motion estimation with modified diamond search for variable motion block sizes," in Proc. International Conference on Image Processing, vol. 3, Sept. 2003, pp. 14–17.

    15. TMS320C6000 CPU and Instruction Set Reference Guide, Texas Instruments Inc., Dallas, TX, 2000.

    16. TMS320C64 Image/Video Processing Library, Texas Instruments Inc., Dallas, TX, 2003.

    17. Blackfin DSP Instruction Set Reference, Analog Devices Inc., Norwood, Mass., 2002.

    18. TMS320C55 Hardware Extensions for Image/Video Applications Programmer's Reference, Texas Instruments Inc., Dallas, TX, 2002.

    19. T. Wiegand, X. Zhang and B. Girod, "Long-Term Memory Motion-Compensated Prediction," Trans. Circuit Syst. Video Technol., vol. 9, no. 1, Feb. 1999, pp. 70–84.

    20. I. E. G. Richardson, Video Codec Design: Developing Image and Video Compression Systems, Wiley, 2002.

    21. M. H. Kim, I. G. Hwang and S. I. Chae, "A Fast VLSI Architecture for Full-Search Variable Block Size Motion Estimation in MPEG-4 AVC/H.264," in Proc. of Asia and South Pacific Design Automation Conference (ASP-DAC 2005), Shanghai, China, Jan. 2005.

    22. S. Y. Yap and J. V. McCanny, "A VLSI Architecture for Variable Block Size Video Motion Estimation," Trans. Circuit Syst. Video Technol., vol. 51, no. 7, July 2004.

    Sung Dae Kim received the B.S. degree in Electronics Engineering from the Ajou University, Suwon, Korea in

    2002. He is currently working toward the Ph.D. degree. His current research interests are in the areas of digital image/

    video processing, DSP, and ASIP, specifically low power and

    high performance architectures.

    Myung H. Sunwoo received the B.S. degree in Electronic Engineering from the Sogang University in 1980, the M.S.

    degree in Electrical and Electronics from the Korea Advanced

    Institute of Science and Technology in 1982, and the Ph.D.

    degree in Electrical and Computer Engineering from the

    University of Texas at Austin in 1990. He worked for

    Electronics and Telecommunications Research Institute

    (ETRI) in Daejeon, Korea from 1982 to 1985, and Digital

    Signal Processor Operations, Motorola, Austin, TX from 1990

    to 1992. Since 1992, he has been a Professor with the School of Electrical and Computer Engineering, Ajou University in

    Suwon, Korea. In 2000, he was a Visiting Professor in the

    Department of Electrical and Computer Engineering, the

    University of California, Davis, CA. He has over 300 papers

    and also holds 37 patents. He received more than 20 research

    awards including the Best Student Paper Award from the

    IEEE Workshop on Signal Processing Systems (SIPS) 2005,

    Athens, Greece, the Ministry of Commerce, Industry and

    Energy, Samsung Electronics, the Institute of Electronics

    Engineers of Korea (IEEK), and professional foundations.

    His research interests include SOC architectures and design

    for multimedia and communications, application-specific DSP

    architectures, and application-specific design. He served on

    the Technical Program Chairs of the IEEE Workshop on SIPS

    in 2003 and International Conference on SOC Design in 2003 and has served on program committee, organizing committee,

    steering committee, and executive committee for major

    international conferences and workshops including IEEE

    Workshop on SIPS, Cool Chips, Design, Automation and Test

    in Europe (DATE), IEEE International ASIC/SOC Confer-

    ence, Asian-Pacific Conference on CAS (APC-CAS), Asian-

    Solid State Circuits Conference (A-SSCC), International SOC

    Design Conference (ISOCC), International Symposium on

    VLSI Design, Automation and Test (VLSI-DAT), etc. He

    served as an Associate Editor for the IEEE Transactions on

    Very Large Scale Integration (VLSI) Systems (2002–2003)

    and as a Guest Editor for the Journal of VLSI Signal

    Processing (Kluwer, 2004). He is a Director of the National

    Research Laboratory sponsored by the Ministry of Science and

    Technology, a Director of the New Growth Engine Semiconductor Center, and an Executive Director of IEEK. Currently,

    he is a Senior Member of IEEE and a Chair of the IEEE CAS

    Society of the Seoul Chapter.
