
Hindawi Publishing Corporation
VLSI Design
Volume 2012, Article ID 413747, 14 pages
doi:10.1155/2012/413747

Research Article

An Efficient Multi-Core SIMD Implementation for H.264/AVC Encoder

M. Bariani, P. Lambruschini, and M. Raggio

Department of Biophysical and Electronic Engineering, University of Genova, Via Opera Pia 11 A, 16145 Genova, Italy

Correspondence should be addressed to P. Lambruschini, [email protected]

Received 18 November 2011; Revised 20 February 2012; Accepted 3 March 2012

Academic Editor: Muhammad Shafique

Copyright © 2012 M. Bariani et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The optimization process of an H.264/AVC encoder on three different architectures is presented. The architectures are multi- and single-core, and their SIMD instruction sets have different vector register sizes. The need for code optimization is fundamental when addressing HD resolutions with real-time constraints. The encoder is subdivided into functional modules in order to better understand where optimization is a key factor and to evaluate the performance improvement in detail. Common issues in both partitioning a video encoder onto parallel architectures and SIMD optimization are described, and the authors' solutions are presented for all the architectures. Besides showing efficient video encoder implementations, one of the main purposes of this paper is to discuss how the characteristics of different architectures and different sets of SIMD instructions can impact the target application performance. Results about the achieved speedup are provided in order to compare the different implementations and evaluate the most suitable solutions for present and next-generation video-coding algorithms.

1. Introduction

In recent years, video compression algorithms have played an important role in the enjoyment of multimedia content. The transition from the analog to the digital world in the multimedia environment cannot be accomplished without compression algorithms. DVDs, Blu-ray, and digital TV are typical examples. The compression algorithm used in DVDs is MPEG-2, and Blu-ray supports VC-1, standardized under the name SMPTE 421M [1], in addition to MPEG-2 and H.264. In digital television, compression algorithms are used to reduce the transmission throughput. In DVB-T, the picture format for DVD and Standard Definition TV (SDTV) is 720 × 576, and this resolution is the most used in digital multimedia content. The most recent standards for digital television, such as DVB-T2 and DVB-H, support H.264/MPEG-4 AVC for coding video.

The H.264/AVC [2] video compression standard can cope with a large range of applications, reaching compression rates and video quality levels never accomplished by previous algorithms. Even if the initial H.264/AVC standard (completed in May 2003) was primarily focused on "entertainment-quality" video, not dealing with the highest video resolutions, the introduction of a new set of extensions in July 2004 covered this lack. These extensions are known as "fidelity range extensions" (FRExt) and produced a set of new profiles, collectively called High Profiles. As described in [3], these profiles support all the Main Profile features and introduce additional characteristics such as adaptive transform block size and perceptual quantization scaling matrices. Experimental results show that, when restricted to intra-only coding, H.264/AVC High Profile outperforms the state of the art in still-image coding represented by JPEG2000 on a set of monochrome test images by 0.5 dB average PSNR [4].

It follows that an H.264 encoder addressing high-definition (HD) resolutions needs to support the High Profiles in order to be part of an effective video application. On the other hand, the already great complexity of the H.264 algorithm is further increased by supporting FRExt. In particular, this leads to implementing two new modules: the 8 × 8 intraprediction and the 8 × 8 transform.

In the case of mobile devices, the H.264 complexity issues, together with the constraints of limited power consumption and the typical need for real-time operation in video-based applications, draw a difficult scenario for video application developers.


HD resolutions involve a large amount of data, and compression algorithms are computationally demanding applications, often used as benchmarks to measure processor performance. In order to support real-time video encoding and decoding, specific architectures are developed. Multicore architectures have the potential to meet the performance levels required by the real-time coding of HD video resolutions. But in order to exploit multicore architectures, several problems have to be faced. The first issue is the subdivision of an encoder application into modules that can be executed in parallel. In this case, the main difficulty is the strong data dependency in video encoder algorithms. Parallel architectures can be more easily exploited by other kinds of algorithms, such as computer graphics, rendering, or cryptography, where the data dependency is not as strong as in video compression. Once a good partitioning is achieved, the optimization of a video encoder should take advantage of data-level parallelism to increase the performance of each encoder module running on the architecture's processing elements. A common approach is to use SIMD instructions to exploit the data-level parallelism during execution; otherwise, ASIC design can be adopted for critical kernels. SIMD architectures are widely used for their flexibility. SIMD ISAs have been added to most widespread processors: Intel's MMX, SSE1, SSE2, SSE3, and SSE4; AMD's 3DNow!; ARM's NEON; Motorola's AltiVec (also known as Apple's Velocity Engine or IBM's VMX).

In this paper, we will show how the data-level parallelism is exploited by SIMD and which instructions are most useful in video processing. Different instruction set architectures (ISAs) will be compared in order to show how the optimization can be driven and how different ISA features can lead to different performance. This paper is intended to be of help both to software programmers that have to choose the most suitable SIMD ISA for developing a video-based application and to ISA designers that want to create a generic instruction set able to give good performance on video applications. In that regard, the authors will select a set of generic SIMD instructions that can speed up video codec applications, detailing the modules that will profit from the introduction of each instruction. Besides describing the optimization methods, the paper indicates a few guidelines that should be followed to partition the encoder into separate modules.

Even though the work focuses on H.264/AVC, most of the proposed solutions also apply to the earlier mentioned standards as well as to more recent video compression algorithms such as scalable video coding (SVC) [5]. Moreover, H.264/AVC tools will have a fundamental role in the emerging high efficiency video coding (HEVC) standardization project [6].

This paper is organized as follows. Section 2 gives an overview of the state-of-the-art SIMD-based architectures, giving particular attention to those targeting video-coding applications. A brief description of the three architectures used for the presented project is given in Section 3. Section 4 describes the H.264 optimized encoder, focusing on module partitioning and SIMD-based implementation. The performance results of both the pure C implementation and the SIMD version are given in Section 5, together with an explanation of the key instructions for optimizing a video codec. Finally, the conclusion is drawn in Section 6.

2. Related Works

The basic concept of SIMD instructions is the possibility of filling vector registers with multiple data elements in order to execute the same operation on several elements. One of the major bottlenecks in the SIMD approach is the overhead due to the data handling needed to feed the vector registers. Typical required operations are extra memory accesses, packing data, element permutation inside vectors, and conversion from vector to scalar results. All these preliminary operations limit the vector dimension and the performance enhancement achievable with SIMD optimization.

In the literature, several studies regarding the SIMD optimization of video-coding applications are available [7–11]. The goal driving these studies is the achievement of maximum performance, adopting measures in order to reduce the known bottlenecks. Since its standardization, SIMD optimizations targeting the H.264 algorithm have been proposed, as can be seen in [12, 13]. However, both works only address the H.264 decoder and present an MMX optimization starting from the H.264 reference code. Besides addressing a more complex application, our aims were also to discuss how the characteristics of different architectures and different sets of SIMD instructions can impact the encoder performance.

In SIMD processors the memory access has an important impact on performance. Unaligned access is usually not possible in SIMD ISAs, and when possible it is discouraged due to additional instruction latency. Programmers usually take care of handling the unaligned load, adding further overhead to the vector data organization. Moreover, the need for unaligned loads is always present in video-coding algorithms, especially in motion estimation (ME) and motion compensation (MC), where the pixel blocks selected by motion vectors are frequently at misaligned positions even if the start of a frame is memory aligned. Often, the position of a block we need to access cannot be known in advance, and this leads to unpredictable misalignment in data loaded from memory. In Intel's architectures, support for unaligned loads has been added starting from SSE2, but the performance is strongly reduced either if the load operation crosses a cache boundary or, with SSE3, if the load instruction needs store-to-load forwarding. In AltiVec, it is necessary to load two adjacent aligned positions and shift the data in order to achieve one unaligned load, a commonly adopted approach to overcome the misaligned-access issue. This problem is common in digital signal processors (DSPs) as well. Usually, DSPs do not support unaligned loads, but due to the large use of DSPs in video applications several producers have added support for this kind of operation. For example, the Texas Instruments TMS320C64x family supports unaligned load and store operations of 32- and 64-bit elements, but only with one of the two memory ports [14].
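The "two aligned loads plus shift" technique can be modeled in plain C as follows. This is an illustrative sketch only (scalar code with a hypothetical helper name); a real implementation would replace the byte arrays with the target ISA's vector registers and the final loop with a concatenate/permute instruction.

#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16

/* Load VEC_BYTES starting at an arbitrary (possibly misaligned) address,
 * using only VEC_BYTES-aligned accesses: two aligned loads followed by a
 * concatenate-and-shift step. */
static void load_unaligned_emulated(const uint8_t *src, uint8_t dst[VEC_BYTES])
{
    uintptr_t addr    = (uintptr_t)src;
    const uint8_t *lo = (const uint8_t *)(addr & ~(uintptr_t)(VEC_BYTES - 1));
    unsigned offset   = (unsigned)(addr & (VEC_BYTES - 1));
    uint8_t a[VEC_BYTES], b[VEC_BYTES];

    memcpy(a, lo, VEC_BYTES);              /* first aligned "vector" load  */
    memcpy(b, lo + VEC_BYTES, VEC_BYTES);  /* second aligned "vector" load */

    /* Concatenate a:b and extract VEC_BYTES starting at the byte offset. */
    for (unsigned i = 0; i < VEC_BYTES; i++)
        dst[i] = (offset + i < VEC_BYTES) ? a[offset + i]
                                          : b[offset + i - VEC_BYTES];
}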


The MediaBreeze SIMD processor was proposed to reduce the bottlenecks in SIMD implementations [15]. The Breeze SIMD ISA uses a multidimensional vector able to speed up nested loops, but at the cost of a very complicated instruction structure requiring a dedicated instruction memory. In [16], a specific SIMD ISA named VS-ISA was proposed in order to improve performance in video coding. The authors adopted specific solutions for the sum of absolute differences (SAD), unaligned loads applied to ME, interpolation, DCT/IDCT, and quantization/dequantization.

Another typical approach to reduce the SIMD overhead is the usage of a multibank vector memory where data is stored interleaved. The drawback is the increased hardware cost for supporting the address generation.

An alternative to SIMD implementation on programmable processor architectures is the hardwired processor. Usually, it is only used when performance and low power consumption are essential requirements [7, 14, 17]. In fact, the lack of flexibility typical of hardwired processors reduces their applicability to a narrow segment of the market, where programmability is either not required or considerably reduced.

3. SIMD ISA Description

In order to optimize the H.264 encoder, we chose three different ISAs. The adopted architectures are ST240, xSTream, and P2012, all developed by STMicroelectronics. The first is a single-processor architecture, and the others are multicore platforms. In the following, the three architectures will be briefly described, giving special attention to the SIMD instruction set.

We chose these architectures for their novelty and for the possibility of having a complete toolchain (code generation, simulation, profiling, etc.) for developing an application in an optimal way. Each toolchain allowed a complete observability of the system. In this way, it was possible to evaluate the effectiveness of each of our solutions. Observability is a very important characteristic when developing/optimizing an application. Using a real system it is not always possible to reach the degree of observability you have using a simulator and a suitable toolchain. Moreover, for an architecture under development such as P2012, we had the possibility to contribute to the SIMD instruction set and, more importantly, to evaluate the contribution of each particular SIMD instruction to the performance of the target video codec application. The three instruction sets present suitable characteristics for our research: they are generic instruction sets, but the ST240 includes a few video-specific instructions; we can analyse the impact of different vector register sizes; even if xSTream and P2012 share many characteristics, only xSTream supports horizontal SIMD (a special feature; e.g., other SIMD extensions such as Intel SSE and ARM NEON do not offer the same support); and on the P2012 platform we were able to define and insert new SIMD instructions.

Besides the type of instructions, the SIMD extensions differ in both size and precision. These differences allow analyzing the impact of different architecture solutions on the global performance.

Figure 1: SAD operation (absubu.pb produces the per-byte absolute differences |D − H|, |C − G|, |B − F|, |A − E|; sadu.pb additionally sums the four byte lanes into a single result).

3.1. ST240. The ST240 is a processor of the STMicroelectronics ST200 family, based on the LX technology jointly developed with Hewlett Packard [18, 19]. The main ST240 features are the following:

(i) 4-issue Very Long Instruction Word (VLIW)

(ii) 64 32-bit general-purpose registers

(iii) 32 KB D-cache and 32 KB I-cache

(iv) 450 MHz clock frequency

(v) 8-bit/16-bit arithmetic SIMD.

In the H.264 encoder SIMD optimization, the most significant instructions of the ST240 ISA are the following: the SIMD add.ph and sub.ph, which perform, respectively, the packed 16-bit addition and subtraction; the perm.pb instruction, which performs byte permutations; and the muladdus.pb, which multiplies an unsigned byte by a signed byte in each of the byte lanes and then sums across the four lanes to produce a single result. Furthermore, several data manipulation instructions are defined: pack.pb packs 16-bit values to byte elements ignoring the upper half; shuffeve.pb and shuffodd.pb, respectively, perform 8-bit shuffles of the even and odd lanes. Two averaging operations (avg4u.pb and avgu.pb) are also defined in the instruction set.

One important operation in video-coding algorithms, the absolute value of the difference, abs(a − b), can be performed with the absubu.pb instruction (Figure 1), which works on each byte lane (treating each byte lane as an unsigned value) and returns the result in the corresponding byte lane of the destination register. The sadu.pb (Figure 1) performs the same operation and then sums the byte-lane values and returns the result.
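As a point of reference, the semantics of these two instructions can be modeled in plain C as follows (an illustrative scalar sketch assuming 32-bit registers holding four unsigned byte lanes; these are not the actual ST240 intrinsics):

#include <stdint.h>

/* absubu.pb: per-byte-lane absolute difference of two packed registers. */
static uint32_t absubu_pb(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        uint32_t d = (x > y) ? (uint32_t)(x - y) : (uint32_t)(y - x);
        r |= d << (8 * lane);
    }
    return r;
}

/* sadu.pb: same per-lane absolute difference, then sum of the four lanes. */
static uint32_t sadu_pb(uint32_t a, uint32_t b)
{
    uint32_t d = absubu_pb(a, b);
    return (d & 0xFF) + ((d >> 8) & 0xFF) + ((d >> 16) & 0xFF) + (d >> 24);
}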

3.2. xSTream. xSTream is a multiprocessor dataflow architecture for high-performance embedded multimedia streaming applications designed at STMicroelectronics [20, 21].

xSTream is a parallel, distributed, and shared-memory architecture. It is an array of processing elements connected by a Network on Chip (NoC), with specific hardware for the management of communication [22], as depicted in Figure 2.


Figure 2: xSTream architecture (xPE processing elements with local memories and flow controllers, interconnected by the xSTNoC; xDMA, system bus, host processor, I/O, RAM, and IP blocks).

Figure 3: Vector operand (a 128-bit register viewed either as eight 16-bit subwords or as four 32-bit subwords, with the corresponding element numbering).

The main elements in Figure 2 are the general-purpose engine, the xSTreaming Processing Engines (XPEs), and the NoC interconnecting all components.

The XPEs are based on the ST231 VLIW processors [22] of the STMicroelectronics ST200 family [18, 19]. The main features can be summarized as

(i) 2-issue VLIW,

(ii) 128-bit vector registers,

(iii) up to 512 KB local memory cache,

(iv) up to 1 GHz clock frequency, and

(v) 16-bit/32-bit arithmetic SIMD.

In order to achieve excellent performance, the XPE core tries to exploit the available parallelism at various levels. It supports a plethora of SIMD instructions to exploit the available data-level parallelism. These instructions concurrently execute up to four operations on 32-bit operands or eight operations on 16-bit operands. The core supports wide 128-bit loads/stores.

The xSTream architecture handles scalar and vector operands. Vector operands are 128-bit wide and consist of either eight 16-bit half-words or four 32-bit words, as shown in Figure 3.

In the xSTream ISA each SIMD instruction has an additional operand that allows permuting the result's element positions or replicating any element in the other positions.

Figure 4: P2012 scheme (clusters of cores with L1 caches and an L2 memory, connected through a network interface to the cluster fabric controller).

This feature considerably increases the SIMD flexibility, because the results often have to be reordered for further elaboration. This is especially true for video-coding algorithms, whose operations are performed in several steps where the input of the next step is usually the output of the previous one. The permutation operand allows this at the cost of only one additional cycle, which reduces the cost of all the operations needed for data reordering.

The XPE supports horizontal SIMD as well. This kind of SIMD allows operations among elements in the same vector, and it is a key feature for speeding up execution in several H.264 functional units, as we will see in the next sections.

3.3. Platform 2012 (P2012). Platform 2012 is a high-performance programmable architecture for computationally demanding embedded multimedia applications, currently under joint development by STMicroelectronics and the Commissariat à l'énergie atomique et aux énergies alternatives (CEA) [23]. The goal of the P2012 platform is to be the reference architecture for the next generation of multimedia products.

The P2012 architecture (Figure 4) is constituted by a large number of decoupled clusters of STxP70 processors interconnected by a Network on Chip (NoC). Each cluster can contain a number of computational elements ranging from 1 to 16. The main features of the STxP70 processor element are as follows:

(i) 32-bit RISC processor (up to 2 instructions per cycle),

(ii) 128-bit vector registers,

(iii) 256 KB of memory shared by all the processors (per cluster),

(iv) 600 MHz clock frequency, and

(v) 16-bit/32-bit arithmetic SIMD.

The P2012 basic modules can be easily replicated to provide scalability [24]. Each module is constituted by a computing cluster with a cache memory hierarchy and a communication engine. The STxP70 is a dual-issue application-specific instruction-set processor (ASIP) [25] with a domain-specific parameterized vector extension named VECx. The STxP70 SIMD instructions are used to exploit the available data-level parallelism [26]. These instructions execute in parallel up to four operations on 32-bit operands or eight operations on 16-bit operands, while 128-bit loads/stores are supported.

Vector operands are 128-bit wide and consist of either eight 16-bit half-words or four 32-bit words. In order to increase the SIMD flexibility, instructions able to permute data positions inside the vector operands are defined in the instruction set. The support for horizontal SIMD is limited to operations involving only two adjacent elements inside a vector, but its presence is fundamental for typical video-coding operations like the sum of absolute differences (SAD).

3.4. SIMD Instruction-Set Evaluation. Whatever platform we choose, we will have a limited number of SIMD instructions because of hardware constraints. For this reason, besides precision and size, one of the key issues when choosing a SIMD extension is generality versus application-specific instructions. The former can show good speedups for a large variety of applications. The latter can reach greater performance, but limited to a particular family of applications. Of course, there are a lot of solutions that lie in the middle.

The vector register size impacts performance, hardware reliability, and costs. The choice of the optimal size and precision of SIMD instructions is a key factor for reaching the desired performance for the target application. The axiom "larger SIMD equals better performance" may be valid for applications having no constraints and data dependencies in either the spatial or the temporal domain. This is not the case for the H.264 encoder. In general, algorithms with heavy control flow are very difficult to vectorize, and SIMD optimization does not always lead to the desired performance enhancement.

Application developers should choose the dimension that best fits their needs, just as ISA designers should take into account the requirements of the application families they are targeting. As stated in [7], in a processor designed to handle video-coding standards for which the theoretical worst-case video sequence will consist of a large number of 4 × 4 blocks, four-way SIMD parallelism makes full use of the data paths. In this case, increasing the size will lead to little performance improvement. In contrast, if we focus on the H.264 fidelity range extensions, with their 8 × 8 transform and 8 × 8 intraprediction, an ISA with eight-way SIMD parallelism will yield better performance. Next-generation video-coding standards like HEVC will use wider ranges of block sizes for both the prediction and transformation processes, making the choice of the optimal vector register size even more complicated.

4. H.264 Encoder Implementation

4.1. Software Partitioning. In order to support real-time video encoding addressing HD resolutions, multiprocessor architectures seem to be an optimal solution, as explained earlier. Moreover, we wanted to test the multicore architectures with an application of high interest but not so suitable for this kind of architecture, in order to stress the architecture design, evaluate possible issues, and find solutions that could also be useful for other applications.

Figure 5: MB neighbours (MBs D, B, and C in the row above and MB A to the left of the current MB).

The programmer's first task when dealing with this type of platform is the subdivision of the encoder application into modules that can execute in parallel. The H.264 encoder partitioning plays a fundamental role in multicore architectures such as xSTream and P2012, where each functional block has to fit the resources of the processor elements, and the interconnection system must provide the memory bandwidth needed to feed the modules. The designer's choice becomes more complex when some modules can run in parallel, avoiding stalls in the pipeline [26].

Even if a detailed description of the encoder partitioning is beyond the scope of this paper, we can outline here some issues we faced while approaching this process and the solutions we adopted.

First of all, it is worth taking into account the data dependency inside the H.264 encoder. Temporal data dependency is implicit in the motion estimation mechanism: the coding of the current frame always depends on the previously encoded frame(s) used as references. Thus, there is always a temporal data dependency, except when the current frame is an I picture. The encoding process also shows a spatial data dependency between macroblocks, that is, the basic encoding blocks comprising 16 × 16 pixel elements. While coding the current macroblock (MB), we need data from the previously encoded MBs belonging to the same frame or, to be more precise, to the same slice (a sequence of MBs into which the frame can be segmented). Figure 5 shows the current MB together with the already reconstructed neighbours that are needed for its prediction. Specifically, MBs A, B, C, and D are required for intraprediction, motion vector prediction, and spatial direct prediction (in the SVC-compatible version). Furthermore, MBs A and B are used to check the skip mode in P frames.

Spatial data dependency can even occur inside an MB. The prediction of a 4 × 4 block may depend on the results of already-predicted neighbouring blocks; for example, this occurs in Intra 4 × 4 or in the deblocking filter.

In this scenario, we cannot encode two frames in parallel, because of the temporal data dependency, and we cannot concurrently process different MBs, because of the spatial data dependency, unless the MBs belong to different slices. Thus, one opportunity is to concurrently process every slice, but this solution has two drawbacks: it strongly depends on the particular encoder configuration, and it requires implementing the whole encoder on every processor element. Therefore, the only chance to partition the encoder is during the MB processing. This does not mean separately processing 8 × 8 or 4 × 4 blocks, but separately executing the encoder functional units at MB level.

Table 1: Encoder data flow.

Module                            Input                                    Output                                   Input from local buffers
MV prediction                     MV A                                     MV pred                                  MV B, C, D
Motion estimation                 Search window, original MB,              MV, cost, best intermode,                -
                                  MV predictor                             MB predictor
Intraprediction                   Original MB, reconstructed MB A          Cost, best intramode, MB predictor       Reconstructed MB B, C, D
Residual coding                   Original MB, MB predictor                Residual signal, coded MB parameters     -
IDCT-DeQuant and reconstruction   Residual signal                          Reconstructed MB                         -
Deblocking filter                 Reconstructed MB A                       Decoded MB                               Reconstructed MB B
Entropy coding                    Coded MB parameters                      Output stream                            -

The encoder partitioning should now derive from an evaluation of the functional units that can be concurrently computed, taking into account the amount of data that needs to be exchanged between the different cores.

If we suppose that each module will run on a different core, we must consider both the chunk of data each core needs to exchange with the interconnected cores and the frequency of such communications. Therefore, for an optimal module partitioning, it is important to analyse the encoder data flow. Basically, this analysis should result in a list of selected modules with a set of input and output data for every entry of the list, as shown in Table 1. The notation of Figure 5 is used to indicate the neighbouring MBs. This table allows identifying the dependencies between modules as well as the data flow, from which we can obtain the required bandwidth for the communication mechanism between processor elements. This preliminary analysis also produces the partition diagram, shown in Figure 6.

Each module will keep local memory buffers containing the data required to process the current MB. For example, the intraprediction module needs to store a row of reconstructed MBs plus one MB (the left MB) in order to be able to predict the current MB. The deblocking filter will need to store the same number of reconstructed MBs as well. These local storages are filled by producer modules as soon as they complete their respective tasks. In the previous example, "IDCT-DeQuant & Reconstruction" is the producer for intraprediction; when the MB reconstruction has completed for MB_n, the intraprediction of MB_n+1 can start. It is worth noting that the intraprediction of MB_n+1 can be concurrently executed with the motion estimation of MB_n+1 and the deblocking filter of MB_n.

Figure 6: Encoder partition diagram.

For the sake of simplicity, we did not put all the project components into the Figure 6 scheme. The buffer mechanism for passing reference-frame data to the ME and the decoded picture buffer are not described. We preferred to focus on the encoder data flow in order to highlight the chances for module parallelisation. Moreover, the buffering mechanisms strongly depend on the architecture design implementation.

The partitioning described here seems to both fulfil the data dependency constraints and exploit the few opportunities for parallel execution available in an H.264 encoder. Moreover, the computational weight of the encoder components is quite well distributed among the different cores. The only exception is the ME, which is the most time-consuming module. In our encoder we utilize the SLIMH264 ME algorithm [27]. SLIMH264 is divided into two different stages: the first phase is common to all the partitions and performs a fast search; the second step utilizes the coarse results coming from the first phase to refine the search for every MB partition. The second step can be executed in parallel for every MB partition. This leaves the designer the freedom to subsequently split the module in order to avoid stalls in the pipeline in the likely case that the ME requires more cycles than the intraprediction.

Figure 7: Search window update.

Among the issues the designer should take into account, there is still the memory bandwidth needed to feed the modules. From Table 1, we can notice that the ME module requires the largest amount of data. Besides coding parameters, the ME should receive data belonging to two frames: the current and the reference frame. For each MB, the data passed to the ME consists of the original MB luma values and the portion of the reference frame enclosed by the search window (SW). Supposing one byte per luma sample and a SW set to 64 × 64 pixels (a suitable value for HD formats), we get an area of (64 + 16 + 64) × (64 + 16 + 64) pixels, that is, 144 × 144 = 20736 bytes. Thus, we had 20736 bytes plus the 16 × 16 bytes of the original-image MB to send to the ME module for every MB. This leads to a very large memory bandwidth. Anyway, as can be noticed in Figure 7, not all of the SW must be resent every time a new MB is coded. Since MBs are coded in raster-scan order and the search windows of neighbouring MBs overlap, just a 16-byte-wide column update can be sent after the first complete window, as described in [28, 29]. Figure 7 shows the SW for MB_N (left side) and the SW for the next MB (right side). The amount of data sent to the ME module for coding MB_N+1 is just the 16-pixel-wide column entering the window. When reaching the end of the row, MB_N+1 does not need the update because it would be over the image border. Nevertheless, a SW update is written to the array, and it will be part of the SW of the first MB in the next row.
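As a quick check of the bandwidth figures above, the following sketch recomputes the per-MB traffic under the same assumptions (1 byte per luma sample, 16 × 16 MBs, 64-pixel search range on each side; the column-update size is the derived 16 × 144 value):

#include <stdio.h>

int main(void)
{
    const int mb = 16;                      /* macroblock size in pixels      */
    const int range = 64;                   /* search range on each side      */
    const int sw = range + mb + range;      /* 144-pixel search-window side   */

    int full_window   = sw * sw;            /* 20736 bytes, first MB of a row */
    int column_update = mb * sw;            /* 16 * 144 = 2304 bytes per MB   */

    printf("first MB of a row : %d bytes\n", full_window + mb * mb);
    printf("subsequent MBs    : %d bytes\n", column_update + mb * mb);
    return 0;
}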

4.2. Modules Optimization. The H.264 encoder modules work on a block basis. Even though the basic block of the coding process is the macroblock, consisting of 16 × 16 pixel elements, the basic block of each module's computation can vary from 4 × 4 to 16 × 16. A number of experiments carried out at STMicroelectronics' Advanced System Technology Laboratories showed that, when addressing HD resolutions, it is possible to disable the interprediction modes involving the 8 × 8 block subpartitions without significant effects on video quality and coding efficiency. The same experiments also showed that the fidelity range extensions are needed to improve video quality at high resolutions, as one could expect. For this reason, we chose both to disable ME on the 4 × 8, 8 × 4, and 4 × 4 partitions and to add Intra 8 × 8 and the 8 × 8 transform. In this scenario, most of the encoder modules work on 8 × 8 blocks of 8-bit samples. The 4 × 4 blocks are still used in Intra 4 × 4 and in DCT/Q/IQ/IDCT 4 × 4. The Intra 16 × 16 prediction works on the whole MB, whereas the corresponding transformations just iterate the 4 × 4 procedures.

Usually, inside each module the computations require 16-bit precision for intermediate results. Thus, a typical situation is as follows (sketched in C after the list):

(i) load 8-bit samples from memory;

(ii) switch to 16-bit precision and compute the results;

(iii) store the results to memory as 8-bit samples.
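A minimal scalar illustration of this load/compute/store pattern (a hypothetical helper, not code from the encoder): the 8-bit samples are widened to 16 bits for the intermediate arithmetic and clipped back to 8 bits on store.

#include <stdint.h>

/* Add a 16-bit correction term to an 8x8 block of 8-bit samples and store
 * the result as 8-bit samples again, demoting the precision with saturation. */
static void add_and_clip_8x8(const uint8_t *src, const int16_t *corr,
                             uint8_t *dst, int stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int v = (int)src[y * stride + x] + corr[y * 8 + x];
            if (v < 0)   v = 0;                 /* clip to the 8-bit range */
            if (v > 255) v = 255;
            dst[y * stride + x] = (uint8_t)v;
        }
    }
}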

Some of the modules, or at least some parts of them, require 32-bit precision. Among them, it is worth noting a few computations for pixel interpolation and the quantization and inverse-quantization process.

In order to evaluate the different performance achievable with the three different ISAs, we have inserted the SIMD instructions into already optimized ANSI C code, which is used as the reference to evaluate the achieved speedup. For a better understanding of the presented work, the comparison is carried out not only at the global level but also for every H.264 functional unit.

In the following, the implementation details of the sum of absolute differences (SAD) and of the Hadamard filter will be shown for all three addressed ISAs. Among the several module implementations, we have chosen to describe these particular operations for different reasons: the SAD is one of the most time-consuming operations in video-compression algorithms, and the implementation of the Hadamard filter is a good example of how an ANSI C implementation can be rewritten to best fit the available SIMD ISA. The access to data stored in memory will be discussed as well, because it is a typical issue in optimizing video compression algorithms with SIMD instructions. A complete description of the encoder SIMD implementation on the ST240 processor can be found in [30].

4.2.1. SAD Operation. The sum of absolute differences is a key operation for a large variety of video-coding algorithms. The number of times this operation is executed during a coding process can vary depending on the encoder implementation, and it strongly depends on the motion estimation module, which is not covered by the H.264 standard definition. Anyway, independently of specific implementations, this operation is a key factor for the whole-encoder performance.

Here, we will show three different SAD implementations using SIMD instructions, and we will compare them with an optimized ANSI C code.

Given the essential role the SAD plays in video-coding algorithms, some instruction sets include specific instructions to speed up this operation. Here, we will compare SIMD instruction sets having different sizes and different degrees of specialization.


/* load 4 elements for p and i */
p_temp0 = *pp; pp += p_off;
i_temp0 = *dd; dd += d_off;
/* load 4 elements for p and i */
p_temp1 = *pp; pp += p_off;
i_temp1 = *dd; dd += d_off;
/* load 4 elements for p and i */
p_temp2 = *pp; pp += p_off;
i_temp2 = *dd; dd += d_off;
/* load 4 elements for p and i */
p_temp3 = *pp;
i_temp3 = *dd;

sad0 = sadu.pb(i_temp0, p_temp0);
sad1 = sadu.pb(i_temp1, p_temp1);
sad2 = sadu.pb(i_temp2, p_temp2);
sad3 = sadu.pb(i_temp3, p_temp3);

sad_result = sad0 + sad1 + sad2 + sad3;

Figure 8: SAD implementation.

Table 2: SAD performance.

                   Cycles   Operations   Load   Store
ANSI C version       36        134         8      0
SIMD version         14         30         8      0

Using the ST240 32-bit-wide SIMD extension, the optimization of the SAD computation has been quite straightforward thanks to the SIMD instruction sadu.pb.

The SAD finds the "distance" between two 4 × 4 blocks, generally between a prediction block and the original image; given the two blocks, the pseudocode computing the SAD can be viewed in Figure 8. Besides loading the input data, it basically consists of four calls to the sadu.pb instruction.
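For comparison, a plain ANSI C version of the same 4 × 4 SAD might look as follows (an illustrative sketch of the scalar baseline, not necessarily the exact code measured in Table 2):

#include <stdint.h>
#include <stdlib.h>

/* Scalar 4x4 SAD between a predictor block p and an original-image block d,
 * each addressed through its own line offset. */
static int sad_4x4(const uint8_t *p, int p_off, const uint8_t *d, int d_off)
{
    int sad = 0;
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++)
            sad += abs((int)p[x] - (int)d[x]);
        p += p_off;
        d += d_off;
    }
    return sad;
}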

The achieved speedup is shown in Table 2.

The xSTream and P2012 architectures support 128-bit-wide vector registers, and they can perform 8-bit, 16-bit, or 32-bit arithmetic SIMD operations. Usually, the SAD is performed using 8-bit precision, allowing each SIMD calculation to handle sixteen elements. Using vertical SIMD instructions, it is easy to obtain the absolute differences of several elements stored in two vectors, but the addition of the elements stored in a single vector is onerous, because it usually requires several vertical SIMD instructions that are inefficiently utilized. Both the P2012 and xSTream ISAs have horizontal-addition SIMD instructions, but with different capabilities. In xSTream, it is possible to add all the elements stored in the same vector, producing a scalar result. In P2012, the VECx horizontal addition is limited to adding only two adjacent elements inside a vector; in this way, four SIMD instructions must be used in order to obtain the scalar result of the SAD operation. This difference significantly impacts the encoder optimization. For example, when the SAD is calculated to evaluate the predictor cost in Intra 16 × 16, only two SIMD instructions are used with xSTream against the six used with P2012. This is schematized in Figure 9.

Figure 9: Predictor cost calculation.

Even if in the P2012 ISA the lack of a full horizontal SIMD addition partially reduces the obtained gain, we still complete the SAD operation using six VECx instructions and one scalar instruction, as shown in Figure 9, versus the 48 scalar instructions used in the ANSI C implementation (16 subtractions, 16 absolute values, and 16 additions).
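To make the difference between the two reduction styles concrete, the following scalar sketch models the two horizontal-add behaviours on the sixteen absolute differences (hypothetical models, not the vendor intrinsics):

#include <stdint.h>

/* xSTream-style reduction: a single horizontal add collapses all sixteen
 * byte lanes of the absolute-difference vector into one scalar cost. */
static int reduce_all_lanes(const uint8_t lanes[16])
{
    int cost = 0;
    for (int i = 0; i < 16; i++)   /* models one "add all lanes" operation */
        cost += lanes[i];
    return cost;
}

/* P2012/VECx-style reduction: each horizontal add only sums adjacent pairs,
 * so four passes (16 -> 8 -> 4 -> 2 -> 1 partial sums) are needed before the
 * final scalar result can be extracted. */
static int reduce_pairwise(const uint8_t lanes[16])
{
    int tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = lanes[i];
    for (int n = 16; n > 1; n /= 2)          /* one pairwise pass per level */
        for (int i = 0; i < n / 2; i++)
            tmp[i] = tmp[2 * i] + tmp[2 * i + 1];
    return tmp[0];
}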

4.2.2. Hadamard. We consider the Hadamard SIMD optimization very interesting because it involves a large number of instructions and can be considered a typical case study.

Although the Hadamard transform is not currently used in the rest of the encoder, the intraprediction module utilizes this transform to find the best 16 × 16 intraprediction mode. The intra module divides the predicted MB into sixteen 4 × 4 blocks. Each block is compared to the corresponding original-image block, and sixteen differences are calculated. These sixteen values are filtered through the Hadamard transform before computing the SAD of the whole MB.

In the ST240 code, the optimization started from the observation that the Hadamard transform can be subdivided into two different phases: horizontal and vertical. The horizontal phase can be subdivided into 4 rows, just as the vertical phase can be subdivided into 4 columns, as shown in the portion of pseudocode in Table 3.

Page 9: AnEfficientMulti-CoreSIMDImplementationfor …ing high efficiency video coding (HEVC) standardization project [6]. This paper is organized as follows. Section 2 gives an overview

VLSI Design 9

Table 3: Hadamard phases.

Horizontal phase                     Vertical phase

/* first row */                      /* first column */
m0 = d0 + d3 + d1 + d2;              w0 = m0 + m12 + m4 + m8;
m1 = d0 + d3 - d1 - d2;              w1 = m0 + m12 - m4 - m8;
m2 = d0 - d3 + d1 - d2;              w2 = m0 - m12 + m4 - m8;
m3 = d0 - d3 - d1 + d2;              w3 = m0 - m12 - m4 + m8;

/* second row */                     /* second column */
m4 = d4 + d7 + d5 + d6;              w4 = m2 + m14 + m6 + m10;
m5 = d4 + d7 - d5 - d6;              w5 = m2 + m14 - m6 - m10;
m6 = d4 - d7 + d5 - d6;              w6 = m2 - m14 + m6 - m10;
m7 = d4 - d7 - d5 + d6;              w7 = m2 - m14 - m6 + m10;

/* third row */                      /* third column */
m8 = d8 + d11 + d9 + d10;            w8 = m1 + m13 + m5 + m9;
m9 = d8 + d11 - d9 - d10;            w9 = m1 + m13 - m5 - m9;
m10 = d8 - d11 + d9 - d10;           w10 = m1 - m13 + m5 - m9;
m11 = d8 - d11 - d9 + d10;           w11 = m1 - m13 - m5 + m9;

/* fourth row */                     /* fourth column */
m12 = d12 + d15 + d13 + d14;         w12 = m3 + m15 + m7 + m11;
m13 = d12 + d15 - d13 - d14;         w13 = m3 + m15 - m7 - m11;
m14 = d12 - d15 + d13 - d14;         w14 = m3 - m15 + m7 - m11;
m15 = d12 - d15 - d13 + d14;         w15 = m3 - m15 - m7 + m11;

In the pseudocode, the di are the differences and the mi the intermediate values of the transform.
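A direct scalar C transcription of the two phases in Table 3 can serve as a reference point for the SIMD versions discussed below (a sketch; d holds the sixteen input differences in row-major order):

#include <stdint.h>

/* Hadamard transform of a 4x4 block of differences, as written in Table 3.
 * d[16] is the row-major input, w[16] the output, m[] the intermediates. */
static void hadamard_4x4(const int16_t d[16], int16_t w[16])
{
    int16_t m[16];

    /* Horizontal phase: one group of four m values per input row. */
    for (int r = 0; r < 4; r++) {
        const int16_t *p = d + 4 * r;
        m[4 * r + 0] = p[0] + p[3] + p[1] + p[2];
        m[4 * r + 1] = p[0] + p[3] - p[1] - p[2];
        m[4 * r + 2] = p[0] - p[3] + p[1] - p[2];
        m[4 * r + 3] = p[0] - p[3] - p[1] + p[2];
    }

    /* Vertical phase: the columns are picked in the order used by Table 3
     * (m0/m12/m4/m8, m2/m14/m6/m10, m1/m13/m5/m9, m3/m15/m7/m11). */
    static const int col[4] = { 0, 2, 1, 3 };
    for (int c = 0; c < 4; c++) {
        int16_t a = m[col[c]],     b = m[col[c] + 12];
        int16_t e = m[col[c] + 4], f = m[col[c] + 8];
        w[4 * c + 0] = a + b + e + f;
        w[4 * c + 1] = a + b - e - f;
        w[4 * c + 2] = a - b + e - f;
        w[4 * c + 3] = a - b - e + f;
    }
}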

Once we have all the differences contained in packed 16-bit values subdivided into even and odd pairs, we can rewrite the first row of the horizontal Hadamard transform as

    m0 = (d0 + d1) + (d2 + d3),
    m1 = (d0 - d1) - (d2 - d3),
    m2 = (d0 + d1) - (d2 + d3),
    m3 = (d0 - d1) + (d2 - d3).    (1)

In such a way, we can exploit the packed 16-bit addition and subtraction to obtain the high and low halves of the mi coefficients. As can be noted, the low and high halves of m0 and m2 are the same, but while m0's value is achievable by adding its halves, to compute the value of m2 we have to subtract its high half from the low one. Similar considerations can be applied to the odd elements m1 and m3.

Anyway, since the vertical phase of the Hadamard is yet to come, there is no need to compute such values at this point. In fact, we can rewrite the mi coefficients as functions of their own halves as follows:

    m0 = m0L + m0H,
    m1 = m1L - m1H,
    m2 = m2L - m2H,
    m3 = m3L + m3H,    (2)

and utilize this notation to rewrite the vertical phase of the Hadamard transform as described below:

    w0 = (m0 + m4) + (m8 + m12)
       = (m0L + m4L) + (m0H + m4H) + (m8L + m12L) + (m8H + m12H)
       = (m0L + m4L) + (m8L + m12L) + (m0H + m4H) + (m8H + m12H)
       = w0L + w0H.    (3)

We can use the low and high halves of the intermediate coefficients to compute the low and high halves of the final coefficients wi, as illustrated in Figure 10.

The Hadamard optimization with the ST240 SIMD is quite complex. Due to the short SIMD width, the standard algorithm has been modified in order to better match the SIMD ISA features.

Using the xSTream and P2012 architectures, we followed a different approach. Our goal was the exploitation of the 128-bit-wide SIMD while minimizing the data reordering. Considering that the Hadamard transform can be defined as

    Hn = [ H(n-1)    H(n-1) ]
         [ H(n-1)   -H(n-1) ],    H0 = 1,    (4)

the Hadamard matrices are composed of ±1 and are a special case of the discrete Fourier transform (DFT). For this reason, the calculation can exploit an FFT-style algorithm, usually known as the fast Walsh-Hadamard transform [31].

The only issue is obtaining a good implementation of the FFT butterfly with SIMD, avoiding wasting all the gain achieved by the fast algorithm on the data reordering needed to implement the calculation. Our approach consists of a modified butterfly that allows always using the same butterfly structure at every level, even if we have to reorder data between stages (Figure 11).

The output values coming from every butterfly can be calculated for 16 samples at a time using two SIMD instructions, one calculating the additions and one calculating the differences. In this way, we have the advantage of computing the output of each level using a simple SIMD implementation, at the cost of swapping intermediate results between the different levels.
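In scalar form, the same four-level scheme can be sketched as follows (an illustrative reference, assuming a 16-sample block; each level performs the additions and the subtractions of one butterfly stage, with the data reordering expressed through indexing instead of merge instructions):

#include <stdint.h>

/* In-place fast Walsh-Hadamard transform of 16 samples: four levels of
 * add/subtract butterflies, the scalar counterpart of the two-SIMD-per-level
 * scheme described above. */
static void fwht16(int16_t s[16])
{
    for (int span = 8; span >= 1; span >>= 1) {        /* four levels */
        for (int base = 0; base < 16; base += 2 * span) {
            for (int i = 0; i < span; i++) {
                int16_t a = s[base + i];
                int16_t b = s[base + i + span];
                s[base + i]        = a + b;             /* "addition" half    */
                s[base + i + span] = a - b;             /* "subtraction" half */
            }
        }
    }
}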

Even if xSTream and P2012 share this implementation mechanism, we have measured different performance. In this case, the difference depends on the different types of data manipulation instructions. The xSTream ISA, having the third operand that allows the permutation of results inside vectors, is more flexible and can implement the above algorithm with a reduced number of instructions with respect to the P2012 VECx ISA. Algorithm 1 shows the xSTream SIMD implementation.

/* first level: one 16-sample butterfly */
/* (s0..s7) + (s8..s15) */
vaddh out_low = in_low, in_high
/* (s0..s7) - (s8..s15) */
vsubh out_high = in_low, in_high

/* data reordering */
/* 0 1 2 3 8 9 10 11 */
vmrgbl in_low = out_low, out_high, perm
/* 4 5 6 7 12 13 14 15 */
vmrgbu in_high = out_low, out_high, perm

/* second level: two 8-sample butterflies */
vaddh out_low = in_low, in_high
vsubh out_high = in_low, in_high

/* data reordering */
/* 0 1 8 9 4 5 12 13 */
vmrge in_low = out_low, out_high
/* 2 3 10 11 6 7 14 15 */
vmrgo in_high = out_low, out_high

/* third level: four 4-sample butterflies */
vaddh out_low = in_low, in_high
vsubh out_high = in_low, in_high

/* data reordering */
/* 0 8 2 10 4 12 6 14 */
vmrgeh in_low = out_low, out_high
/* 1 9 3 11 5 13 7 15 */
vmrgoh in_high = out_low, out_high

/* fourth level: eight 2-sample butterflies */
vaddh out_low = in_low, in_high
vsubh out_high = in_low, in_high

Algorithm 1: Hadamard transform xSTream SIMD implementation.

Figure 10: Hadamard vertical phase with ST240 SIMD. The packed vertical phase is computed as

w0_pck = add(add(m0_2_pck, m4_6_pck), add(m8_10_pck, m12_14_pck));
w1_pck = sub(sub(m0_2_pck, m4_6_pck), sub(m8_10_pck, m12_14_pck));
w2_pck = sub(add(m0_2_pck, m4_6_pck), add(m8_10_pck, m12_14_pck));
w3_pck = add(sub(m0_2_pck, m4_6_pck), sub(m8_10_pck, m12_14_pck));

Figure 11: Hadamard modified butterfly.

4.2.3. Memory Access Issues. As previously explained, a key factor in achieving a good performance improvement with SIMD optimization is the efficient handling of unaligned load operations. In general, programmers should structure the application data in order to avoid or minimize misaligned memory accesses. In video compression algorithms, motion compensation is surely a case where it is not possible to avoid unaligned memory accesses, because it is impossible to predict motion vectors and consequently align the data.

None of the three addressed architectures supports unaligned load instructions. Therefore, it is important to efficiently use aligned accesses to load misaligned data from memory. The three ISAs support instructions to concatenate two vectors. This allows a solution consisting of two steps: first, we use two aligned load instructions to load data into two vector registers, and then we concatenate and shift their elements in order to extract a single vector containing the needed data, as shown in Figure 12.

uint32 AddressAt128;
vector_16b_sw Va, Vb, Vout;

AddressAt128 = ((uint32)(mref_ptr)) & (~0xF);
Offset = ((uint32)(mref_ptr)) & (0xF);

Va = ldq(AddressAt128, 0);
Vb = ldq(AddressAt128, 16);
Vout = wrot(Va, Vb, Offset);

Algorithm 2: Unaligned load SIMD implementation with concatenate instruction.

ui32_t PackCurr0 = *(orig_line);
ui32_t PackCurr1 = *(orig_line + 1);

/* Pack to 128 bits */
TmpVectArray[0] = PackCurr0;
TmpVectArray[1] = PackCurr1;
Pack128In = ldqi(Pack128In, TmpVectArray, 0);

/* Reorganize pixels */
Va = vmrgbeh(Va, Pack128In, VZero, permute0);
Vb = vmrgboh(Vb, Pack128In, VZero, permute1);
VPackCurr = vaddh(VPackCurr, Va, Vb, 0);

Algorithm 3: Unaligned load SIMD implementation without concatenate instruction.

Figure 12: Unaligned load.

If an ISA does not define a SIMD instruction performing this type of concatenate operation, then the unaligned load will be implemented with an extra cost due to the use of additional instructions for merging data between the two vectors.

Algorithm 2 shows the implementation of an unaligned load using xSTream. This solution can be compared to the same operation carried out without a concatenate instruction, shown in Algorithm 3, in which three more instructions must be added to reorganize the data and compose the required non-aligned vector.

It is very important that these concatenate instructions can take the offset argument not as a constant but as a variable value; otherwise, modules such as motion compensation would not get any benefit from using them. For example, the Intel SSSE3 "palignr" instruction concatenates two operands and shifts the composite vector right by an offset to extract an aligned result, but the offset must be a compile-time constant value. This is a big issue for a module such as motion compensation, in which it is impossible to know in advance the offset of a misaligned address.
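To illustrate the cost of that restriction, the following sketch (a hypothetical helper built on the SSSE3 intrinsic _mm_alignr_epi8, which requires an immediate shift count) dispatches a runtime offset to one of sixteen constant variants, exactly the kind of extra code that a variable-offset concatenate instruction avoids:

#include <tmmintrin.h>  /* SSSE3 */

/* "hi" holds the 16 bytes that follow the 16 aligned bytes in "lo";
 * offset is the misalignment of the desired address within "lo". */
static __m128i align_variable(__m128i hi, __m128i lo, int offset)
{
    switch (offset) {
    case 1:  return _mm_alignr_epi8(hi, lo, 1);
    case 2:  return _mm_alignr_epi8(hi, lo, 2);
    case 3:  return _mm_alignr_epi8(hi, lo, 3);
    case 4:  return _mm_alignr_epi8(hi, lo, 4);
    case 5:  return _mm_alignr_epi8(hi, lo, 5);
    case 6:  return _mm_alignr_epi8(hi, lo, 6);
    case 7:  return _mm_alignr_epi8(hi, lo, 7);
    case 8:  return _mm_alignr_epi8(hi, lo, 8);
    case 9:  return _mm_alignr_epi8(hi, lo, 9);
    case 10: return _mm_alignr_epi8(hi, lo, 10);
    case 11: return _mm_alignr_epi8(hi, lo, 11);
    case 12: return _mm_alignr_epi8(hi, lo, 12);
    case 13: return _mm_alignr_epi8(hi, lo, 13);
    case 14: return _mm_alignr_epi8(hi, lo, 14);
    case 15: return _mm_alignr_epi8(hi, lo, 15);
    default: return lo;  /* offset 0: the data is already aligned */
    }
}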

5. Results

In the H.264 encoder, the most cycle-demanding modules have been optimized using SIMD instructions: motion estimation and compensation, DCT, intraprediction, and so forth. The best way to compare different instruction sets, in order to judge the effectiveness of both the SIMD extensions and the code optimizations, is to measure the speedup obtained with the SIMD-based implementation versus the ANSI C version of the same source code. In order to separate the effect of the SIMD performance improvement from the ANSI C optimizations, we have inserted SIMD instructions into previously optimized ANSI C modules.

The results are provided in terms of average cycles spent to process one macroblock. The xSTream and P2012 architectures share the same module subdivision. For the single-core DSP ST240, the subdivision is less fine, and related modules are joined together. In the reported tests, the presence of the ST240 processor is important because it allows comparing the single-processor elements of the multicore platforms to a single-core architecture. Tests are performed on a set of video sequences addressing different resolutions, and the average results are summarized in Table 5.


Table 4: SIMD instructions for video coding.

Horizontal add: adds all the elements inside a vector register and produces a scalar result.
    Affected modules: ME, intraprediction. Notes: speeds up the SAD.

Horizontal permute: rearranges elements inside a vector register.
    Affected modules: intraprediction, DCT/Q/IQ/IDCT. Notes: allows zig-zag scan and speeds up the intra diagonal modes.

Concatenate: concatenates two vector registers into an intermediate composite, then shifts the composite to the right by a variable offset.
    Affected modules: motion estimation and compensation. Notes: allows a software implementation of the unaligned load.

Promotion/demotion of precision: efficient support for promoting element precision while loading data from memory, and demoting the precision (with saturation) while storing data to memory.
    Affected modules: all the main modules. Notes: speeds up the load and store operations for several modules.

Absolute subtraction: for every element "a" in the first vector and every element "b" in the second vector, performs |a - b|.
    Affected modules: ME, intraprediction, deblocking filter. Notes: speeds up the SAD in conjunction with the horizontal add; used in the deblocking filter.

Shift with round: for every element "a" in the vector operand, performs (a + 2^(n-1)) >> n, where n is a scalar value.
    Affected modules: IDCT, deblocking filter, motion compensation. Notes: speeds up 1/2-pixel interpolation.

Average: for every element "a" in the first vector and every element "b" in the second vector, performs (a + b + 1) >> 1.
    Affected modules: intraprediction, deblocking filter, motion compensation. Notes: speeds up 1/4-pixel interpolation.

Table 5: Cycles/MB spent in each module for each ISA.

Module | xSTream (ANSI C / SIMD / gain) | P2012 (ANSI C / SIMD / gain) | ST240 (ANSI C / SIMD / gain)
Luma motion compensation | 4788 / 2257 / 2.1x | 8286 / 3965 / 2.1x | —
Chroma motion compensation | 3064 / 658 / 4.7x | 3626 / 1282 / 2.8x | —
Motion estimation | 303769 / 84342 / 3.6x | 603182 / 114776 / 5.3x | 265559 / 200380 / 1.3x (ME + MC)
Intra 4 × 4 | 24366 / 10076 / 2.4x | 38234 / 15760 / 2.4x | 32013 / 19182 / 1.6x (4 × 4 + 8 × 8)
Intra 8 × 8 | 15396 / 4997 / 3.0x | 26972 / 9455 / 2.9x | —
DCT/Q/IQ/IDCT 4 × 4 | 14994 / 7616 / 2.0x | 20473 / 9088 / 2.3x | 32013 / 19182 / 1.7x (4 × 4 + 8 × 8)
DCT/Q/IQ/IDCT 8 × 8 | 18660 / 3498 / 5.3x | 24486 / 11636 / 2.1x | —

Note: the ST240 module subdivision is coarser, so its figures cover joined modules (motion estimation plus compensation, intra 4 × 4 plus 8 × 8, and the complete 4 × 4 plus 8 × 8 transform chain).

The results in Table 5 and Figure 13 show that the ST240, exploiting instruction-level parallelism (ILP) with its 4-issue VLIW architecture, achieves the best performance for the ANSI C implementation. All the SIMD implementations improve performance for every encoder module, but the ST240, having the shortest SIMD width, obtains the lowest speedup factor. P2012 and xSTream, with their wider SIMD, can better exploit data-level parallelism. In terms of the pure number of cycles spent to encode one macroblock, the xSTream ISA achieves the best performance.

It is worth analyzing these results in detail to understand how different instruction sets lead to different performance. The xSTream processor elements benefit from the "horizontal add" instruction, which allows an efficient computation of the SAD operations: this is evident in the ME module, where xSTream spends about 25% fewer cycles than P2012 (84,342 versus 114,776 cycles/MB). The higher speedup obtained by P2012 is mainly due to the less efficient ANSI C code generated by the P2012 compiler. We have already described how the ST240 can exploit a specific instruction for the SAD operation: in fact, its result is not far from the architectures having 128-bit-wide vector registers (the 200,380 cycles/MB also include motion compensation). From these results, we can state that support for horizontal SIMD operations not only greatly improves the SAD computation itself but also significantly impacts the whole ME module.
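As an illustration of how much a horizontal reduction in hardware simplifies the SAD kernel, the following sketch (assuming SSE2, not taken from the presented encoder) computes the SAD of one 16-pixel row with a single psadbw plus a final scalar add; without such support the absolute differences would have to be reduced with an explicit tree of adds.

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* SAD of one 16-pixel row: sum over i of |cur[i] - ref[i]|.
   psadbw performs both the absolute differences and the horizontal
   add, leaving two 64-bit partial sums to be combined. */
static unsigned sad_16x1(const uint8_t *cur, const uint8_t *ref)
{
    __m128i c = _mm_loadu_si128((const __m128i *)cur);
    __m128i r = _mm_loadu_si128((const __m128i *)ref);
    __m128i s = _mm_sad_epu8(c, r);
    return (unsigned)(_mm_cvtsi128_si32(s) +
                      _mm_cvtsi128_si32(_mm_srli_si128(s, 8)));
}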

As noted earlier, data manipulation instructions are a key factor in fully exploiting SIMD implementations, because operations such as matrix transposition and data reordering become frequent in this type of optimization. An experimental result confirming this consideration can be seen in the DCT/Q/IQ/IDCT 8 × 8 module, which covers the whole toolchain performing residual coding and decoding. This module involves several data-reordering operations, ranging from matrix transposition to zig-zag reordering. Both the ST240 and xSTream instruction sets support the permutation of elements inside a vector in a very efficient way, as described in Sections 3.1 and 3.2, while the P2012 SIMD extension includes a series of instructions for interleaving and merging elements between two vector operands.


Figure 13: ISA comparison (cycles/MB, ANSI C versus SIMD, for the ME/MC, intra, and DCT modules on xSTream, P2012, and ST240).

The large speedup that the xSTream architecture obtains in comparison with P2012 is mainly due to the possibility of permuting elements with a single instruction, a sort of horizontal permute. The effect is emphasized in the 8 × 8 transform, where the data-reordering process is stressed more than in the 4 × 4 case. In our experience, if such an instruction is available, the zig-zag reordering can be effectively implemented with SIMD instructions; otherwise, we are forced to fall back to the scalar implementation, which uses look-up tables to perform the reordering.
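A single horizontal permute is enough to express the whole reordering. The sketch below is only an illustration, assuming SSSE3 and with the coefficients narrowed to bytes so that a 4 × 4 block fits one register; real 16-bit residuals span two registers and need two shuffles.

#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

/* Zig-zag read-out of a 4x4 block stored in raster order: result
   element i takes the coefficient at raster position scan[i]. */
static __m128i zigzag_4x4(__m128i raster)
{
    const __m128i scan = _mm_setr_epi8(0, 1, 4, 8, 5, 2, 3, 6,
                                       9, 12, 13, 10, 7, 11, 14, 15);
    return _mm_shuffle_epi8(raster, scan);
}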

Intraprediction can exploit the horizontal permute instruction as well: the intraprediction modes involving diagonal directions require the permutation of elements inside the resulting vectors. For similar reasons, the ST240 achieves good speedup factors in the DCT and intra modules (1.7 and 1.6, resp.), considering that its 32-bit-wide SIMD can only perform two 16-bit arithmetic operations at a time.

There are several other SIMD instructions that, in our opinion, should be considered key instructions for optimizing video codec applications. Here, we assume that an instruction set already includes SIMD support for all the common arithmetic, compare, select, shift, and memory operations.

In the previous sections, we already discussed the impact of unaligned memory accesses on video codec performance. All the encoder modules are affected by the performance of unaligned memory operations, but it becomes a key factor for motion estimation and compensation. An instruction concatenating two vectors and producing a vector at the desired offset is fundamental to implementing an unaligned load. As stated in Section 4.2.3, the capability to support variable offsets is essential for the instruction usability, because the offset may not be known in advance.

Inside most of the modules, the computations require 16-bit precision for intermediate results, but the input and output data contained in the uncompressed YUV images are 8-bit values. Thus, a typical operation at the beginning of a module is to load the 8-bit input values and extend them to 16-bit precision; at the end, the output data precision is usually demoted back to 8 bits, saturating the values before storing the results. Therefore, even if support for 8-bit arithmetic is not required, it is very useful for an instruction set to include SIMD instructions that promote and demote precision in a fast way. An optimal solution also combines promotion with the load operations and demotion with the store instructions.
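A minimal sketch of this load-with-promotion and store-with-demotion pattern, assuming SSE4.1 and not taken from the presented encoder, is shown below.

#include <smmintrin.h>   /* SSE4.1 */
#include <stdint.h>

/* Load eight 8-bit pixels and promote them to 16-bit precision. */
static __m128i load_promote_8(const uint8_t *src)
{
    __m128i p = _mm_loadl_epi64((const __m128i *)src);  /* 8 bytes   */
    return _mm_cvtepu8_epi16(p);                         /* 8 x int16 */
}

/* Demote sixteen 16-bit results to 8 bits with unsigned saturation
   and store them. */
static void store_demote_16(uint8_t *dst, __m128i lo, __m128i hi)
{
    _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(lo, hi));
}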

Video codec algorithms usually avoid division operations because of their computational cost. When a division is needed, the divisor is a power of two, and the division is replaced by a shift right with rounding:

a / 2^n ⟺ (a + 2^(n−1)) >> n.   (5)

Therefore, even though most instruction sets already include this type of instruction, it is important to recall its usefulness. The shift right with rounding is also often used for averaging two or more values, as in the intraprediction and deblocking filter modules. In our implementation, one of the reasons the ST240 achieves a good speedup in the intraprediction module is the presence of an average SIMD instruction in its instruction set.
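Both operations map to one or two SIMD instructions on most ISAs; a sketch assuming SSE2 (again only an illustration, not the presented implementation) is given below.

#include <emmintrin.h>   /* SSE2 */

/* Shift right with rounding on eight signed 16-bit lanes:
   (a + 2^(n-1)) >> n, as used after interpolation and the IDCT. */
static __m128i sra_round_epi16(__m128i a, int n)
{
    __m128i half = _mm_set1_epi16((short)(1 << (n - 1)));
    return _mm_sra_epi16(_mm_add_epi16(a, half), _mm_cvtsi32_si128(n));
}

/* Rounded average (a + b + 1) >> 1 on sixteen unsigned 8-bit lanes,
   as used for 1/4-pixel interpolation and intra prediction. */
static __m128i avg_pixels(__m128i a, __m128i b)
{
    return _mm_avg_epu8(a, b);
}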

Table 4 summarizes our conclusions based on the presented work. The proposed instructions are described in the first column. For each instruction, the table indicates the H.264 modules that are mainly affected by its introduction, as well as a few notes about its specific contribution to basic video coding operations.

6. Conclusions

This paper presents efficient implementations of the H.264/AVC encoder on three different ISAs. The optimization process exploits the SIMD extensions of the three architectures to improve the performance of the most time-consuming encoder modules. For each addressed architecture, experimental results are presented in order to both compare the different implementations and evaluate the speedup versus the optimized ANSI C code.

The paper discusses how the SIMD width and the differences between instruction sets impact the achievable performance. Several issues affecting video-coding SIMD optimization are discussed, and the authors' solutions are presented for all the architectures.

Most instruction sets have specific SIMD instructions for video coding. Even though these instructions can lead to great performance improvements, they may be useless for other application families. In this paper, we identify a set of generic SIMD instructions that can significantly improve the performance of video applications.

Besides presenting the SIMD optimization of the most time-demanding modules, the paper describes how a complex application such as the H.264/AVC encoder can be partitioned onto a multicore architecture.

Acknowledgments

The authors would like to thank STMicroelectronics' Advanced System Technology Laboratories for their support. This work is supported by the European Commission in the context of the FP7 HEAP project (#247615).


