
1544 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 11, NOVEMBER 2010

A MIMO Decoder Accelerator for Next Generation Wireless Communications

Karim Mohammed, Member, IEEE, and Babak Daneshrad, Member, IEEE

Abstract—In this paper, we present a multi-input–multi-output (MIMO) decoder accelerator architecture that offers versatility and reprogrammability while maintaining a very high performance-cost metric. The accelerator is meant to address the MIMO decoding bottlenecks associated with the convergence of multiple high-speed wireless standards onto a single device. It is scalable in the number of antennas, bandwidth, modulation format, and most importantly, present and emerging decoder algorithms. It features a Harvard-like architecture with complex vector operands and a deeply pipelined fixed-point complex arithmetic processing unit. When implemented on a Xilinx Virtex-4 LX200FF1513 field-programmable gate array (FPGA), the design occupied 43% of overall FPGA resources. The accelerator shows an advantage of up to three orders of magnitude (1000 times) in power-delay product for typical MIMO decoding operations relative to a general purpose DSP. When compared to dedicated application-specific IC (ASIC) implementations of MMSE MIMO decoders, the accelerator showed a degradation of 340%–17%, depending on the actual ASIC being considered. In order to optimize the design for both speed and area, specific challenges had to be overcome. These include: definition of the processing units and their interconnection; proper dynamic scaling of the signal; and memory partitioning and parallelism.

Index Terms—Application-specific processor, multi-input–multi-output (MIMO), orthogonal frequency-division multiplexing (OFDM), software-defined radio.

I. INTRODUCTION

TODAY, two prominent trends in wireless communication systems are the use of multi-input–multi-output (MIMO) processing and orthogonal frequency-division multiplexing (OFDM). MIMO-OFDM techniques improve data rate and reliability [1]. As a result, many current and future wireless standards such as 802.11n WiFi, WiMax, and 3G long-term evolution (LTE) leverage MIMO-OFDM to deliver on the user requirements. Additionally, all trends point to the convergence of all such wireless standards on a single platform such as a personal digital assistant (PDA) or a smartphone. This motivates an accelerator-like approach to efficiently deliver on the computation-intensive elements of the system. The MIMO decoder is one such component. MIMO processing is computationally intensive due to the need to invert a channel matrix with very low latency. Moreover, over time, systems are expected to incorporate a higher number of antennas and more advanced algorithms. Analogous to the use of the Viterbi accelerator engines [2], [3] in today's cellular systems, a MIMO decoder accelerator that is programmable in bandwidth, number of antennas, decoder algorithm, and modulation format will greatly facilitate the adoption of multistandard, MIMO-based solutions. Such an accelerator engine could also greatly accelerate the adoption of MIMO communications on software-defined radio (SDR) and cognitive radio (CR) based platforms that are mostly found in the research and advanced development environments today [4]–[6].

Manuscript received October 23, 2008; revised March 09, 2009. First published September 01, 2009; current version published October 27, 2010.

K. Mohammed was with the University of California, Los Angeles, CA 90095 USA. He is now with Cairo University, Giza 12613, Egypt (e-mail: kabbas@ee.ucla.edu).

B. Daneshrad is with the University of California, Los Angeles, CA 90095 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVLSI.2009.2025590

MIMO decoding is essentially an inversion of a complex matrix channel. This can be achieved by using a variety of algorithms with a range of complexity and performance. The choice of algorithm and antenna configuration depends on the expected channel conditions, power budget, available resources, and throughput requirements. Even for relatively simple MIMO decoding algorithms, the MIMO decoder is among the most complicated blocks in the transceiver. For example, in a 4 × 4 802.11n transceiver, an MMSE decoder can easily occupy as much area as the rest of the digital baseband [7]. In addition to the high resource requirements, MIMO decoders also require a long design cycle if they are to be optimized to the target platform.

Traditionally, matrix inversion is simplified by using one of a number of matrix decompositions to transform the channel matrix into a more invertible form [8]. The decomposition usually involves regular arrays (systolic arrays) of processing elements (often coordinate rotation digital computer (CORDIC) processors) [9]. QR decomposition leading to an MMSE solution is the traditional approach, but singular value decomposition (SVD) is also efficiently implemented on systolic arrays [10]. Recursive algorithms have also been implemented using high-efficiency arrays; in [11], an LMS-based solution is implemented using a systolic array. Systolic arrays deliver a quick, efficient implementation of simple algorithms such as MMSE, but they do not offer an easy tradeoff in cost/performance. A Gram–Schmidt-based QR decomposition on Xilinx Virtex-4 delivers a high-throughput, low-slice-count 4 × 4 802.11n-compliant MMSE solution in [7] by paying acute attention to the dynamic-range requirements and aggressive utilization and time sharing of system resources. SVD is implemented in [12] using a power-optimized data path, delivering superior performance relative to most SVD application-specific IC (ASIC) solutions [8]–[10]. Programmable solutions have been proposed for matrix decompositions. A programmable solution that uses a compact

1063-8210/$26.00 © 2009 IEEE


time-shared processing unit with fixed-point optimizations is presented in [13]. The architecture supports 4 × 4 QR decomposition and SVD.

In this paper, we present a MIMO decoder accelerator architecture. The accelerator allows the programmer to easily define and implement MIMO decoders at will. The accelerator is intended to replace or accelerate the performance of MIMO decoders on programmable devices or system-on-chip (SoC) solutions. The accelerator has a processor-like architecture with most of the controls derived from a memory-stored program. The processing core is designed to support a range of complex operations necessary to enable the realization of major MIMO decoding algorithms. This architecture does not benefit from the regular, application-specific flow of regular arrays, nor can it rely on platform- or technology-specific optimizations as a main driver of high performance. The MIMO accelerator departs radically from a conventional processor in several areas, which deliver an improvement in performance over general purpose processors reaching three orders of magnitude. The accelerator core accepts very wide complex matrix operands and produces complex matrix results. The high access rate required to support this is made possible by a memory map that exploits the matrix/vector nature of the operands in MIMO decoding. The memory map is augmented by sorting circuits at the inputs and outputs of memory that allow the programmer to redefine input and output orders without using extra processing cycles. The processing cycle uses properties of OFDM decoding to optimize its flow, and through the use of predecoded instructions and proper compiler positioning of critical control signals, the accelerator ensures that the processing pipeline is continuously engaged. A programmable dynamic scaling circuit automatically handles intermediate wordlength issues for high dynamic range operations. This allows us to use fixed-point processing units, which substantially increases the performance of the processing pipeline over a floating-point implementation. With these optimizations in place, the accelerator penalty (in terms of the ratio of resource-delay or PDP) relative to optimized ASIC and field-programmable gate array (FPGA) implementations is less than an order of magnitude for most implementations. On the other hand, when compared to TI DSP 6416 processors, the PDP ratio for a number of complex arithmetic test benches was always over two orders of magnitude, and often above three orders of magnitude.

In Section II, we briefly introduce major MIMO decoding algorithms and derive a set of primitive processing operations. These primitives form the minimal common set of processing operations needed to realize all MIMO decoding algorithms.

Section III takes the primitives derived in Section II and describes the architectures used to implement them, showing optimizations in arrangement and interconnection, as well as introducing a multiple-cycle dynamic scaling circuit used to maintain bus widths in the overall architecture within reasonable bounds.

Section IV discusses memory access, detailing the memory map and associated sorting circuits that allow the programmer to support the high rate of the architecture in Section III, and to efficiently define and access data in terms of complex matrix operands.

Finally, in Section V, we compare the accelerator to a number of FPGA, ASIC, and DSP implementations using slice count, cycle count, and PDP as comparison metrics.

II. MINIMUM OPERATION SET

The literature is rich with alternative algorithms for MIMO decoding [7]–[12]. Our approach to supporting these algorithms is to identify the set of primitive processing elements that form the basis of major MIMO decoding algorithms. With such a set in hand, the realization of a specific decoder algorithm translates into the proper sequencing of data among these primitive elements.

A MIMO system with N_T transmitters and N_R receivers can be modeled as follows:

    y = Hx + n                                                   (1)

where y is the N_R × 1 observation vector, x is the N_T × 1 transmitted vector, and n is the additive Gaussian noise. H is the N_R × N_T channel matrix, where each element of the matrix is Rayleigh faded. MIMO decoding involves extracting an estimated vector from the observation vector. The optimal MIMO decoding algorithm is an exhaustive-search maximum-likelihood (ML) decoding. In this algorithm, the geometric distance between the observation vector and a distorted version of every candidate transmit vector is measured, and the nearest candidate is chosen. The high complexity of an exhaustive search ML can be managed through the sphere decoding algorithm (SD). SD can potentially reduce the complexity of exhaustive search by arranging the search so that more likely points are examined first and by pruning large subsets of points early in the search process [14]. With added constraints, SD can be suboptimal relative to an exhaustive search.

Linear decoding algorithms are much simpler than ML, but are also suboptimal. These algorithms calculate a weight matrix, often an implicit or explicit inverse of some expression of the channel matrix. This weight is multiplied by the observation to obtain an estimate. The most common linear decoding algorithm is MMSE. In MMSE, the estimate is calculated as follows:

    x̂ = (H^H H + (1/SNR) I)^(-1) H^H y                          (2)

MMSE can also be implemented in a variant form, called square-root MMSE, where inversion is performed implicitly on a factor of the MMSE expression [7]. Inverting the MMSE expression directly is untenable in hardware; therefore, a matrix decomposition is used to reduce it to a more manageable form. QR decomposition is a typical choice in this case because of its well-behaved dynamic performance, the usefulness of the resultant matrices, and its amenability to parallel processing [8]. This method of explicit matrix inversion is also useful for algorithms other than MMSE; for example, it is used for a recursive least-squares algorithm in [11].
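The linear MMSE estimate of (2) can be checked numerically. The following sketch is a hypothetical floating-point reference model in NumPy (the function name and test values are ours), not the paper's fixed-point QR-based pipeline:

```python
import numpy as np

def mmse_estimate(H, y, snr):
    """Linear MMSE estimate: x_hat = (H^H H + I/SNR)^(-1) H^H y.
    Floating-point reference; the accelerator realizes the same
    computation in fixed point via a matrix decomposition."""
    n_t = H.shape[1]
    W = np.linalg.inv(H.conj().T @ H + np.eye(n_t) / snr) @ H.conj().T
    return W @ y

# 4x4 Rayleigh-faded channel and a QPSK-like transmit vector
rng = np.random.default_rng(1)
H = (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))) / np.sqrt(2)
x = (np.sign(rng.standard_normal(4)) + 1j * np.sign(rng.standard_normal(4))) / np.sqrt(2)
y = H @ x                              # noiseless observation for clarity
x_hat = mmse_estimate(H, y, snr=1e6)   # at high SNR the estimate approaches x
```

At high SNR the regularizing term (1/SNR)·I vanishes and MMSE degenerates to the zero-forcing inverse, which is why the noiseless estimate above recovers the transmit vector.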

SVD decomposes a matrix into two unitary matrices and a diagonal matrix of singular values. The decomposition is typically performed using the Jacobi algorithm [8]. Jacobi uses a


series of unitary transformations to reduce the matrix to its singular values. A 2 × 2 two-sided complex Jacobi starts by transforming the matrix to pure real. The first step performs a complex Givens rotation to null element (2,1):

    [ cos θ           e^{jφ} sin θ ] [ m11  m12 ]   [ m'11  m'12 ]
    [ -e^{-jφ} sin θ   cos θ       ] [ m21  m22 ] = [ 0     m'22 ]    (3)

A unitary transformation can remove the phase of element (2,2). Another transformation that exchanges the phases of the off-diagonal elements yields a pure real matrix:

    diag(1, e^{-jα}) M' diag(1, e^{-jβ}) = M'',  M'' pure real        (4)

Real-valued Jacobi can then be used to decompose the matrix by applying a two-sided unitary transformation, leaving only the singular values:

    J(θ_l)^T M'' J(θ_r) = diag(σ1, σ2)                                (5)

where J(·) denotes a real Givens rotation. This can be extended to larger matrices, but multiple iterations are required to null all off-diagonal elements. Although SVD has multiple potential implementations, Jacobi diagonalization is highly suited for hardware implementation.
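The three steps of the 2 × 2 two-sided complex Jacobi can be traced in software. The sketch below is an illustrative NumPy model (function names and the sample matrix are ours, and it is not the paper's CORDIC datapath): a complex Givens rotation nulls element (2,1), diagonal phase rotations make the matrix pure real, and a real two-sided rotation leaves the singular values:

```python
import numpy as np

def rot(t):
    """Real 2x2 Givens rotation."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def jacobi_svd_2x2(M):
    """Two-sided complex Jacobi for a 2x2 matrix: returns a real
    diagonal matrix whose entries are the singular values of M
    (up to sign). All factors applied are unitary."""
    # Step 1: complex Givens rotation nulling element (2,1)
    a, b = M[0, 0], M[1, 0]
    r = np.sqrt(abs(a) ** 2 + abs(b) ** 2)
    G = np.array([[np.conj(a), np.conj(b)], [-b, a]]) / r
    M1 = G @ M                       # M1[1,0] == 0 and M1[0,0] is real
    # Step 2: diagonal phase rotations make the matrix pure real
    pr = np.angle(M1[0, 1])
    pl = np.angle(M1[1, 1]) - pr
    M2 = (np.diag([1, np.exp(-1j * pl)]) @ M1 @ np.diag([1, np.exp(-1j * pr)])).real
    # Step 3: real two-sided Jacobi -- symmetrize, then zero the off-diagonal
    ts = np.arctan2(M2[1, 0] - M2[0, 1], M2[0, 0] + M2[1, 1])
    N = rot(ts).T @ M2               # now symmetric
    tj = 0.5 * np.arctan2(2 * N[0, 1], N[0, 0] - N[1, 1])
    return rot(tj).T @ N @ rot(tj)   # diagonal matrix of singular values

M = np.array([[1 + 2j, -0.5 + 1j], [0.3 - 0.7j, 2 - 1j]])
S = jacobi_svd_2x2(M)                # off-diagonal entries are ~0
```

Because every factor is unitary, the diagonal of S carries the singular values of M, which can be confirmed against a library SVD.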

The algorithms discussed before share a number of common operations. The order of operations and the nature of the operands set them apart. For example, complex matrix multiplication is used in square-root MMSE, while matrix-vector multiplication and dot products of various sizes are used in all algorithms, either as part of the inversion (decomposition) or in the decoding steps. ML and SD are simplified by a matrix decomposition in the initial stage to facilitate the search phase. The search phase of both algorithms requires dedicated hardware, especially in high-throughput systems [15], but the calculation of metrics can be stated in matrix form, and can thus benefit from a matrix processing accelerator.

Matrix decomposition is critical to all decoding algorithms discussed before. Although there are alternative decompositions for some algorithms, QR decomposition is the most practical method for hardware application. In some decoding algorithms, the target matrix may possess special properties that allow simplification [7], but in the accelerator, we have to provide support for QR for a general target matrix. QR decomposition can be done using several methods; for example, in [7], Gram–Schmidt orthogonalization is used efficiently to such an end. However, Givens rotations are commonly used because they are well suited for hardware implementation using CORDIC [8]. The diagonal exchange in (4) and Jacobi diagonalization in (5) reinforce the utility of Givens rotations in algorithms that do not use QR decomposition.

TABLE I
PRIMITIVE OPERATIONS

Table I lists the primitive operations common to the decoding algorithms discussed before. These primitives are derived from certain realizations of the decompositions, but inspection of the accelerator will show that it can support alternatives. The accelerator has to give full flexibility to the operations in Table I, for example, allowing multiplication of any combination of matrix and vector operands of any size, or supporting left and right unitary transformations without any loss in performance. Table II details the operations necessary for four MIMO decoding algorithms. Although the operations fall neatly into a few arithmetic categories (listed in Table I and detailed in Section III), the flexibility of the operands discussed earlier introduces many possible suboperations. The challenge this introduces to memory access will be discussed in detail in Section IV.

III. ACCELERATOR PROCESSING UNIT

The MIMO-accelerator processing unit (Fig. 1) consists of four cores: a vector multiplication core, a scalar division core, a CORDIC (or coordinate rotation) core, and a vector addition core. Although the design is scalable, this paper discusses results for a realization optimized for 4 × 4 or smaller matrices. Column operations on matrices with more than four rows or row operations on matrices with more than four columns can still be performed but require multiple instructions per operation, whereas all smaller matrices require a single instruction.

The cores are designed for full coverage of the operations in Table I and all their realizations in Table II. Efficient realization of the operations is contingent on a memory access scheme that allows single-cycle access to multiple elements of stored matrices, rearranged at will into rows, columns, diagonals, submatrices, etc. The cores are designed to produce at least one 4 × 1 output vector per processor cycle. The rotation core by necessity generates a vector pair; therefore, the processor output is


TABLE II
COMPONENT OPERATIONS OF MIMO DECODING ALGORITHMS


expressed as a pair of 4 × 1 vectors even for cores that produce a single vector output. The addition core is a set of eight complex adders capable of performing two 4 × 1 vector additions per cycle. The division core consists of four dividers, supporting one 4 × 1 vector scaling per cycle. The multiplication core consists of four complex dot-product units that can perform four 4 × 1 dot products per cycle, producing one 4 × 1 vector output. A full 4 × 4 matrix multiplication can be performed in four cycles. Every clock cycle, the processor accepts 32 complex inputs (N1–N32 in Fig. 1); this is the number of inputs necessary for the multiplication core to realize four distinct dot products. The inputs are divided into two sets/operands.
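The four-cycle matrix-multiplication schedule can be modeled in a few lines. In this hypothetical behavioral sketch (ours, mirroring the dataflow rather than the hardware), each "cycle" feeds the four dot-product units one column of the right operand against the four rows of the left operand, producing one 4 × 1 output vector per cycle:

```python
import numpy as np

def matmul_by_cycles(A, B):
    """Behavioral model of the multiplication core's schedule: four
    complex dot-product units produce one 4x1 output vector per
    cycle, so a full 4x4 complex matrix product takes four cycles."""
    C = np.zeros((4, 4), dtype=complex)
    for cycle in range(4):           # one output column per cycle
        for unit in range(4):        # four parallel dot-product units
            C[unit, cycle] = np.dot(A[unit, :], B[:, cycle])
    return C

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
C = matmul_by_cycles(A, B)           # matches A @ B
```

The inner loop also makes the 32-input figure concrete: one column of B (4 complex values) plus four rows of A (16 complex values), doubled for the two operand sets, is what the four dot-product units consume each cycle.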

A. Coordinate Rotation Core

The coordinate rotation core supports (6)–(8) in Table I. Normally, CORDIC processors [8] are used in unitary transformations due to their good dynamic-range properties and the fact that they can be implemented without any multipliers. An exception to this rule is when the target hardware platform is an FPGA with dedicated multiplier resources; in this case, some authors [7] have used a Gram–Schmidt-based approach that does not leverage CORDICs. Given that the proposed accelerator will

Fig. 1. Overall architecture of the accelerator: processing core, data memory, and input and output sorting circuits.

most likely be implemented in an ASIC flow, we will use the CORDIC for implementing the rotation engine.

CORDIC performs Cartesian-to-polar or polar-to-Cartesian transformation using adders and shifters. CORDIC units can be combined to perform two angle transformations on complex coordinate pairs. We use a compact realization of complex CORDIC, sometimes referred to as a super CORDIC. The super CORDIC assumes a pure real leading element in the vectoring pair, but since a simple rotation can make any element of the matrix real, this causes no loss in generality. The traditional super CORDIC units [11] realize (6) directly. The circuits in Fig. 2 are identical to this realization when the multiplexers are set to mode 0. The realization in Fig. 2 is modified to readily support both equations (7) and (8). Equation (7) is two independent real rotations. The multiplexers in Fig. 2 can realize (7) by rerouting the real and imaginary components of the input vectors as inputs to individual CORDIC units. In fact, since the two component CORDICs of the vectoring unit and two of the three in the rotation units are completely decoupled, (7) can be extended to cases where the two rotation terms in the rotation matrix have different phases, or to process multiple real vector inputs simultaneously. Equation (8) can be realized by recognizing that it is identical to (6) if one of the rotation phases is bypassed and the sense of left-side and right-side rotations is reversed. Again, the multiplexers can be used to reroute the inputs to achieve this. Thus, the modified super CORDIC architecture in Fig. 2 supports the rotation operations listed in Table I while providing additional flexibility in phase distinctions at the negligible cost of the added multiplexers.
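The vectoring primitive underlying these units can be illustrated with a scalar real CORDIC. This is a textbook sketch (ours, not the paper's super CORDIC): shift-add microrotations drive the y component to zero, and the per-iteration direction bits are exactly the kind of microrotation vector that a phase processing unit can replay into rotation CORDICs:

```python
import math

def cordic_vectoring(x, y, iters=16):
    """Real CORDIC in vectoring mode: returns (magnitude, phase) of
    (x, y) using only shifts, adds, and a small arctan table. Each
    iteration picks a direction bit that drives y toward zero."""
    angle = 0.0
    if x < 0:                          # pre-rotate into the right half-plane
        x, y, angle = -x, -y, math.pi
    gain = 1.0
    for i in range(iters):
        d = -1.0 if y > 0 else 1.0     # microrotation direction bit
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        angle -= d * math.atan(2.0 ** -i)
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))  # accumulated CORDIC gain
    return x / gain, angle

mag, ph = cordic_vectoring(3.0, 4.0)   # magnitude ~5, phase ~atan2(4, 3)
```

A rotation-mode CORDIC is the same loop with the direction bits taken from a supplied phase instead of from the sign of y, which is why the phases produced by a vectoring unit can be applied identically across several rotation units.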

The coordinate rotation core is derived from the traditional systolic array architecture [8]–[11]. Systolic arrays provide very high performance, and past implementations have realized a full parallel array of systolic units. This would be overkill for our application as it would deliver throughput far


Fig. 2. Building blocks of the rotation core: (left) multimode super vectoring (translation) CORDIC and (right) multimode super rotation CORDIC. Mode 0 realizes complex Givens rotation, mode 1 real Givens rotation, and mode 2 single-phase rotation of complex vectors (e.g., Jacobi diagonalization).

beyond that required by current or future wireless systems. It also has the additional drawback of reducing the flexibility and programmability of the overall accelerator. The rotation core is therefore arranged in a collapsed or linear array; see Fig. 3. The collapsed array delivers vector pair outputs; since decompositions can be easily broken down into vector pair operations, this greatly simplifies programming. The rotation core in Fig. 3 consists of four super rotation (SR) CORDIC processors and one super vectoring (SV) CORDIC. The SV CORDICs generate two phases based on their complex inputs. Each phase is generated from a bit vector derived from the inputs, indicating the direction of microrotation at every step of the process. Conventionally, every rotation CORDIC has a phase interpretation unit that accepts the input phase in radians from a vectoring CORDIC and translates it back into a series of directions for microrotations [8], [11]. The phase interpretation components of a rotation CORDIC contain relatively costly trigonometric lookup tables and a set of adders. In the accelerator, all SR units and therefore all rotation CORDICs operate on the same vector pair operand per cycle, thus using identical phases (one of the two phases generated by the vectoring CORDIC). We designed rotation CORDICs without phase interpretation components, instead using a common pair of phase processing units (PP in Fig. 3) that do the job for all 12 rotation CORDICs. This reduces the total resources of the rotation core by 8.7%. The phase processing units contain the phase-to-direction interpretation units traditionally found in rotation CORDICs, thus allowing them to convert phase values in radians to microrotation direction vectors. They can also serialize the phase encoded as microrotations directly from the vectoring CORDIC. Basic arithmetic operations can be performed on the phases in the phase processing unit without feeding phase values back to memory.

As discussed earlier, this realization of the accelerator can natively handle inputs with dimensions up to 4. The rotation core accepts

Fig. 3. Rotation core with phase processing unit expanded.

two vectors of up to 5 × 1 size. However, only four of the input and output pairs are significant in any clock cycle. In most cases, the rotation core rotates a pair of leading elements in the SV unit, reducing one to null and calculating the phases used to achieve such a result (a process known as vectoring), and identically rotates the three remaining pairs from the 4 × 1 vector input pair in the 3 SR units. In (3), for example, this involves calculating θ and φ and performing the rotation in a single cycle. In other cases, however, the core is required to rotate all four input pairs not by a phase derived from a leading element, but by a phase stored in the phase processing unit or externally generated in the accelerator. This is true, for example, when rotating the unitary factor in the QR decomposition or in the real diagonalization step of the complex two-sided Jacobi. In this case, multiplexers in the phase processor bypass the phases of the SV; thus, only the outputs of the SRs are significant. As


Fig. 4. Dynamic scaling unit.

shown in Fig. 3, the rotation core outputs are multiplexed between these two cases, resulting in a four-element coordinate rotation output. Although this is relatively simple, it has a significant impact in light of the memory access scheme and the outputs of the remaining processing units.

B. High Dynamic Range Processing

The accelerator must support multiple algorithms and allow modifications and manipulations at will. Part of this is to provide the programmer with reasonable freedom in the number of operations that can be performed before the processor overflows. Traditionally, a fixed-point simulation is needed before a hardware implementation is considered. The simulation helps quantify the impact of fixed-point effects and the choice of intermediate wordlength values. The accelerator cannot benefit from this approach. In the accelerator, an algorithm is implemented by repeatedly passing operands through the processing unit (an arbitrary number of times); therefore, no useful predictions can be made about appropriate intermediate precision requirements. Use of a very wide wordlength would cause the size of the accelerator to grow rapidly. This is exacerbated by the matrix/vector nature of the processing units and the fact that all operands are complex numbers.

Coordinate rotation, as performed by CORDIC, has a relatively stable wordlength. The addition operation also has very slow growth. Multiplication and division, however, can result in very fast growth in the wordlength requirement. Complex vector multiplication effectively doubles the wordlength of the inputs. We use a dynamic scaling circuit to manage the precision of the multiplier and divider outputs. The circuit efficiently handles vector and matrix inputs of variable lengths over a variable number of cycles (corresponding to variable matrix sizes), as shown in Fig. 4. It accepts a vector input of size 4 (the size of the output of the multiplication and division cores). Each vector element is a complex number of B bits per rail, where B

Fig. 5. SQNR for a series of successive matrix multiplications.

is the native precision of the processor. An initial most significant bit (MSB) mask is estimated from the inputs by passing them through the four-input OR gate bank. The bank performs an OR operation between corresponding bits of each element of the 4 × 1 input vector. This is performed for all most significant bits in excess of native precision. The result contains information on the active bits in the largest member of the output vector. This mask is then held for one cycle in register R1 and further OR'ed through the two-input OR bank. The inputs to the two-input OR bank are the old (accumulative) mask in R2, and the corresponding bits from the mask in R1 resulting from a new input vector. R2 thus holds information on active bits from multiple cycles. The final result in R2 is then passed to a scaling value circuit that extracts a shift value from the result. Signals are routed to a programmable shifter where the shift value is held for a period appropriate for the matrix size while results are scaled back to significant bits per rail. dyn_clear and dyn_hold provide primary control over the size of the matrix being scaled. dyn_clear resets the value of R1 to zero and R2 to R1, thus starting a new multiple-cycle sequence. dyn_hold disables registering new values into the scaling value circuit from R2, thus controlling the number of cycles over which the scaling value is considered valid. The dynamic scaling circuit provides programmable scaling while maintaining the throughput of the processor. The additional latency is absorbed since the latency of the CORDIC core is higher than the combined latencies of the multipliers and the dynamic scaling circuit.
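As a rough functional sketch of the mask-and-shift behavior described above (not the RTL, and with an illustrative 16-bit native precision), the circuit's effect over one multiple-cycle matrix result can be emulated in Python:

```python
def dynamic_scale(vectors, native_bits=16):
    """Emulate the dynamic scaling circuit over one matrix result.

    vectors: per-cycle output vectors (integer values per rail, at full
    pretruncation precision). Returns (scaled vectors, shift applied).
    """
    mask = 0                      # accumulative MSB mask (register R2)
    for vec in vectors:
        cycle_mask = 0
        for v in vec:             # four-input OR bank across the vector
            cycle_mask |= abs(v)
        mask |= cycle_mask        # two-input OR bank merging R1 into R2
    # Scaling value circuit: count active bits in excess of native
    # precision (one bit of native_bits is reserved for the sign).
    shift = max(0, mask.bit_length() - (native_bits - 1))
    # Programmable shifter: all rails share the same shift, so relative
    # magnitudes across the whole matrix are preserved.
    scaled = [[v >> shift if v >= 0 else -((-v) >> shift) for v in vec]
              for vec in vectors]
    return scaled, shift
```

Because every rail over every cycle contributes to one shared mask, the largest element in the matrix determines a single shift for all elements, which is what keeps relative magnitudes intact.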

Fig. 5 shows the signal to quantization noise ratio (SQNR) for a matrix multiplier whose outputs are dynamically scaled using the circuit in Fig. 4. The outputs are always truncated to the native precision, and the independent axis shows the pretruncation wordlength at the output of the multipliers. Another set of curves shows results for a circuit that always stores the highest MSBs. The quantization noise is measured relative to a noiseless result. The curves show that for one matrix multiplication, the difference in performance between the two is not dramatic. However, when they are used to run more matrix multiplications successively, the dynamically scaled circuit preserves SQNR above 50 dB while the constantly scaled outputs quickly deteriorate. Fig. 6 shows simulation results for

1550 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 11, NOVEMBER 2010

Fig. 6. BER curves for a ZF solution realized with constant-scaled and dynamic-scaled matrix multiplications.

a zero-forcing decoder, where fixed-point modeling is limited to the effect of the multipliers. To operate within 2 dB SNR of floating-point performance, a circuit with static scaling needs 27-bit multipliers, while a circuit using dynamic scaling needs only 16 bits per rail.
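For reference, the SQNR metric plotted in Fig. 5 (quantization noise measured relative to a noiseless result) reduces to the following computation; this is a generic sketch, not the authors' simulation code:

```python
import numpy as np

def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB, with the noise taken
    as the deviation from a noiseless (floating-point) reference."""
    reference = np.asarray(reference, dtype=complex)
    quantized = np.asarray(quantized, dtype=complex)
    signal_power = np.mean(np.abs(reference) ** 2)
    noise_power = np.mean(np.abs(reference - quantized) ** 2)
    return 10.0 * np.log10(signal_power / noise_power)
```

Chaining several fixed-point matrix multiplications and evaluating this metric after each product reproduces the qualitative trend in Fig. 5: the error of a statically scaled datapath compounds, while dynamic scaling holds the SQNR roughly flat.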

IV. MEMORY ACCESS

A. Memory Partitioning, Addressing, and Vector Operands

The processing core is designed to accept complex matrix or vector inputs. The efficiency of the processor is contingent on a memory access scheme that allows access to any combination of matrix elements simultaneously. The algorithms discussed in Section II require access to row-vectors, column-vectors, whole matrices, submatrices, and diagonals in a single cycle at will. To distinguish between different antenna configurations, the programmer needs to be able to define how, where, and which results are stored back to memory. Memory read/write operations must be performed as fast as possible since the processing cores are deeply pipelined and provide a new output every processor cycle. Read and write operations can be simultaneously supported through the use of dual-port memories, but the randomness of access to matrix elements and the large number of operands needed every cycle require more than multiple-port memories.

As shown in Fig. 1, the processing core has 32 complex operand inputs to accept vector or matrix inputs from matrices in data memory. By inspection of the algorithms discussed in Section II, the programmer may need up to four 4 × 4 matrices to store intermediate results or observation vectors while processing the 4 × 4 channel matrix. This corresponds to three matrices needed for the factors of SVD, and one matrix for observation vectors and results. If all four matrices are stored in a single block of memory and we rely on the memory address to access elements, a serious bottleneck is created at the data bus. For example, for a matrix-vector multiplication of a 4 × 4 matrix and four 4 × 1 vectors, the processor has to wait 32 cycles for all inputs to be registered. The processor also provides a new output vector (of up to eight elements) every cycle, which means that the processor actually has to wait 40 cycles. Our tests with Virtex-4 block RAM and TSMC 65-nm register-based memories show that the memory can be clocked

Fig. 7. Data memory map.

twice as fast as the processor. So, if a flat single-port memory is used, the processor will have to be underclocked by a factor of 20, reducing the performance significantly.

MIMO-OFDM systems employ OFDM. In OFDM, a wideband channel is divided into a number of narrowband subchannels (subcarriers), where each can be treated as a flat channel. In OFDM, all subcarriers are processed identically and independently, so data for all subcarriers must be stored and decoded. Thus, data memory does not only contain four matrices as discussed before, but a number that is a multiple of the number of subcarriers (64 or 128 in 802.11n, but higher for other standards). The size of memory in this case justifies splitting it into multiple blocks. With multiple ports and decoders, the processor can be clocked at a faster rate. If splitting is taken to the extent that each of the 64 elements of the four 4 × 4 matrices occupies an independent block, the processor can be clocked at its full potential. This memory map is shown in Fig. 7. The four conceptual 4 × 4 matrices are labeled A, B, C, and D, respectively; and the 64 independent memory blocks (not all shown) are each as deep as the number of subcarriers.
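A simple behavioral model of this memory map may make the organization concrete; the block naming and class layout below are illustrative, not taken from the paper:

```python
NUM_SUBCARRIERS = 64  # 802.11n-like; 128 or more for other standards

class DataMemory:
    """Behavioral model of the Fig. 7 memory map: 64 blocks (four 4x4
    matrices A-D, 16 elements each), each block as deep as the number
    of subcarriers."""

    def __init__(self, depth=NUM_SUBCARRIERS):
        # One independent block per matrix element, e.g., "A23".
        self.blocks = {f"{m}{r}{c}": [0] * depth
                       for m in "ABCD" for r in range(4) for c in range(4)}

    def read_matrix(self, matrix, subcarrier):
        """All 16 elements of one matrix in a single access; every
        block sees the same subcarrier (depth) address."""
        return [[self.blocks[f"{matrix}{r}{c}"][subcarrier]
                 for c in range(4)] for r in range(4)]

    def write_element(self, matrix, r, c, subcarrier, value):
        self.blocks[f"{matrix}{r}{c}"][subcarrier] = value
```

Because every block is addressed only by subcarrier depth, a whole matrix (or any subset of its elements) can be fetched in one access without per-element addressing.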

Exchanging a single memory block for 64 blocks with independent address decoders introduces some challenges. Each memory location now needs a pair of indexes to locate it: one to indicate its memory block (e.g., A23) and one to index its depth, namely the subcarrier. The latter in particular can be prohibitive, needing either a very long instruction (nearly 900 additional instruction bits to support 128 subcarriers) or a very complicated address decoder. However, although the processor accepts a large number of complex inputs every cycle, all elements of all operands come from the same depth (subcarrier), regardless of which of the 64 memory blocks they come from. This is because in any cycle, a single subcarrier is being processed. So, even though the elements are stored in independent memories, they do not need independent memory addresses; they all share the same subcarrier address in any given cycle. As shown in Fig. 8, the address is provided directly by the controller as derived from relatively simple matrix index logic. Addresses are multiplexed between read and write indexes, usually offset by the latency of the processing units. Write enables are multiplexed between a null word and values provided from the

MOHAMMED AND DANESHRAD: MIMO DECODER ACCELERATOR FOR NEXT GENERATION WIRELESS COMMUNICATIONS 1551

Fig. 8. Section of data memory showing control and data paths.

Fig. 9. Single-port RAM equivalent circuit.

controller (and in turn from the instruction). Since the memory can be easily clocked twice as fast as the processing unit critical path, this allows multiple one-port memories to read and write a whole 4 × 4 or smaller matrix every processor cycle.
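The shared-address scheme with writes trailing reads by the pipeline latency can be sketched as a generator; the latency value in the example is illustrative:

```python
def address_stream(num_subcarriers, latency):
    """Yield (read_subcarrier, write_subcarrier) per processor cycle.

    All memory blocks see the same subcarrier address; write addresses
    lag read addresses by the processing-unit latency, so each result
    lands at the depth of the subcarrier that produced it."""
    for cycle in range(num_subcarriers + latency):
        read_addr = cycle if cycle < num_subcarriers else None
        write_addr = cycle - latency if cycle >= latency else None
        yield read_addr, write_addr
```

With the memory clocked twice as fast as the processor, the read and write phases of each processor cycle time-share the single port, so one counter plus this fixed offset replaces per-block address decoders.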

With the aforementioned in mind, the memory map, as shown in Fig. 7, needs to be reconsidered. The identical nature of the addresses to all memory blocks means that independent address decoders are not necessary. The only control port distinguishing memory blocks is the write enable during the write phase of the processor cycle. Fig. 9 shows how memory can be conceptually viewed. It is one very wide single-port memory (64 × 32 bits wide in a 16-bit engine), where each memory location contains all elements of all four matrices for a certain subcarrier. Memory addresses do not need to be distinguished, but write enables need to be independently managed for different segments of the word representing different matrix elements. The true additional cost of this memory map reduces to the additional logic used to activate writing; this cost is negligible relative to the memory itself.
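The per-segment write-enable behavior of this conceptual wide memory can be modeled as masked updates of one wide word; the 32-bit segment width matches a complex element with 16 bits per rail, and the function name is illustrative:

```python
def write_wide_word(word, elements, write_enables, seg_bits=32):
    """Overwrite only the enabled 32-bit segments of one wide memory
    word; disabled segments keep their previous contents."""
    seg_mask = (1 << seg_bits) - 1
    for i, (val, we) in enumerate(zip(elements, write_enables)):
        if we:
            word &= ~(seg_mask << (i * seg_bits))          # clear segment
            word |= (val & seg_mask) << (i * seg_bits)     # write new value
    return word
```

The per-segment enables are the only mechanism distinguishing matrix elements at write time, which is why their control logic is the entire incremental cost of the map.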

B. Sorting Circuits and Addressing

Data memory provides access to all elements of a matrix in a fixed, predetermined manner. The processing unit inputs are also fixed; for example, the multiplication core multiplies all elements in vector input 1 with the exact corresponding elements of vector input 2, and the coordinate rotation core always considers the first element to be the vectoring element. For the accelerator to support all operations in Table II and begin to extend beyond, significant flexibility needs to be introduced. To define which matrices or vectors are multiplied, and the direction and target of coordinate rotations, the programmer has to be able to map the outputs of the data memory freely to the inputs of the processor. Many operations also require freedom in mapping the outputs of the processor to the inputs of memory. Generally speaking, we need to be able to map any of the 64 memory output ports (Fig. 7) to any of the 32 input ports of the processor (N1–N32 in Fig. 1), and any of the eight output ports of the processor to any of the 64 input ports of the memory (M1–M32 in Fig. 1). This is the function of the memory input and the processor input sorting units.

The sorting circuits proved to be two of the most resource-intensive components of the MIMO accelerator. Essentially, each sorting circuit consists of a collection of multiplexers equal to the number of target ports (32 for the processor input sorter, 64 for the memory input sorter), with a number of inputs equal to the source (8 for the memory input sorter, 64 for the processor input sorter). Alternatively, sorting can be performed at the data memory ports by using memory addresses to access specific elements. Since the memory address is already reserved for indexing the subcarrier, additional addressing for sorting involves some redundancy. Due to the critical nature of the sorting circuits, we considered three alternatives that trade off multiplexer use against redundancy in data memory. In the first alternative for the processor input sorter, shown in Fig. 10, we consider replicating the entire data memory while simultaneously reducing the number of independent memory blocks in Fig. 7 by the same order of replication. This consolidates multiple matrix elements in single blocks, allowing a compound memory address to both distinguish the subcarrier and provide some level of matrix element distinction. The remaining reordering can be supported by a smaller number of multiplexers. However, the size of memory, even for a small number of subcarriers, grows much faster than the saving in multiplexers, both on CMOS ASIC and FPGA targets. Table III shows the trend on a V4 LX200 target for 64 subcarriers.

In the second strategy, we consider multiple-port memories as an alternative. Table IV shows that the memory grows less dramatically with port replication than with full memory replication. However, the trend is still not favorable; although having two read ports seems to minimize resources, the saving is not significant, especially considering the additional complexity in memory addressing and the additional circuitry used to address the problem at the write port. Thus, the alternative using only multiplexers and a memory map unchanged from Fig. 7 is optimal. For the processor input sorting circuit, this translates into 32 64-to-1 multiplexers, equivalent to 16,970 slices on a V4 LX200. This is roughly 150% of the total area of the coordinate rotation core, or 57% of all resources used in the processor excluding data memory.

The 32 input ports of the processor are not independent. They are divided into at most two vector/matrix arithmetic operands,


Fig. 10. Memory replication and remainder multiplexing of orders 32, 16, and 8.

TABLE III
BRAM REPLICATION AND OPERAND SHARING

TABLE IV
MULTIPLE READ PORT MEMORY AND OPERAND SORTING

each of 16 complex elements. Each operand comes from a single matrix in memory. The four areas of memory (A, B, C, and D in Fig. 7) can be associated with actual matrices through the compiler without loss of generality. This means that each set of 16 processor input ports is associated with 16 memory ports (as opposed to 64), defined on a per-instruction basis. Fig. 11 shows how this matrix-operand relation can be leveraged in the sorting circuit. A first level of two 16-element-wide 4-to-1 multiplexers is used to link each of the operands to a matrix, allowing the main sorting multiplexers to be reduced from 64-to-1 to 16-to-1. This results in a resource saving of nearly 70% over a direct multiplexing approach. Table V shows alternative memory-area-to-input-group premultiplexing strategies and their equivalent

Fig. 11. Processing unit input sorter.

TABLE V
PROCESSOR INPUT SORTER RESOURCES BY MULTIPLEXING STRATEGY

resources. Access flexibility is defined as the percentage of the memory accessible for either input after the first level of multiplexing. The critical access flexibility is 25%, since it allows each input group to access a full matrix (16 blocks) of memory. The approach used in Fig. 11 is optimal in this context, using minimum resources while not restricting the processor.
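The two-level sorting idea of Fig. 11 reduces to one matrix-select multiplexer per operand followed by per-port element multiplexers. A behavioral sketch (function and parameter names are illustrative) assuming two 16-port operands:

```python
def sort_processor_inputs(memory, operand_matrix, element_map):
    """memory: dict matrix name -> 16 element values (one subcarrier).
    operand_matrix[op]: which matrix feeds operand op
                        (the first-level 4-to-1 multiplexer).
    element_map[op][port]: which of that matrix's 16 elements each of
                        the 16 ports receives (second-level 16-to-1
                        multiplexers).
    Returns the 32 processor operand inputs, operand 0 then operand 1."""
    ports = []
    for op in range(2):
        block = memory[operand_matrix[op]]               # 4:1 matrix select
        ports.extend(block[e] for e in element_map[op])  # 16:1 per port
    return ports
```

Restricting each operand to a single matrix is what lets the second-level multiplexers shrink from 64-to-1 to 16-to-1 without limiting any operation the compiler accepts.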

The memory input sorting circuit accepts 3 × 8 input buses (the three bus inputs to MC in Fig. 1) and redistributes them over 64 memory input ports. The inputs to this circuit are results from matrix or vector operations. Similar to the processor input sorting circuit, processor outputs are divided into at most two vector/matrix outputs with four elements each. Each processor output is assigned in its entirety to a matrix in memory. So, it is only necessary to distribute the processor outputs over 32 ports (M1–M32 in Fig. 1) corresponding to at most two memory matrices, and these ports can then be mirrored on the rest of memory without loss of generality. Additionally, since only one core per cycle is producing a result, a first level of multiplexing (MC in Fig. 1) allows them to timeshare the circuit. Another optimization for the memory input sorter is multiplexer MO in the CORDIC core (Fig. 3). Although the rotation core has ten outputs, only four output pairs at a time are meaningful; additionally, all other processing cores have eight or fewer outputs. By performing premultiplexing in MO, multiplexers M1–M32 in Fig. 1 are reduced from 10-to-1 to 8-to-1. This saves 36% of sorting circuit resources in a V4LX200.


Fig. 12. 4 × 4 QRD program. The original matrix is A; by the end of processing, A contains the R factor and matrix B contains the Q factor.

The accelerator controller uses an open instruction with control signals mostly corresponding to controls in the circuit. This allows the programmer to define new memory access and processor-mode combinations at will. Addressing for the accelerator is nonconventional since it involves defining operand and result orientation, write enables, and conjugation as opposed to a traditional addressing scheme [16]. Thus, a high-level syntax is provided to support a large number of common matrix/vector operations. The syntax is very similar to MATLAB with additional operators to cover unitary transformations. High-level programs written in this syntax are converted through a custom compiler to machine-level instructions. Fig. 12 shows the syntax of a program that performs a 4 × 4 QRD on matrices stored in memory area A, leaving the R component in A and the Q component in B. Operators “!” and “@” correspond to real rotation (mode 1 in CORDIC) and complex Givens (mode 0), respectively. Operands and results are written as matrix ranges, and the matrices correspond to the areas of memory in Fig. 7. The compiler recognizes most matrix expressions and common mathematical operations as long as they conform to the following memory access limitation: each 4 × 1 vector operand and result must originate from or be stored in a single matrix in memory.
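For reference, the Givens-rotation QRD that such a program carries out can be sketched in Python with NumPy. This is the generic textbook formulation, not the accelerator's CORDIC-based instruction sequence:

```python
import numpy as np

def givens_qrd(A):
    """QR decomposition of a complex square matrix using Givens
    rotations, zeroing subdiagonal entries column by column from the
    bottom up."""
    n = A.shape[0]
    R = A.astype(complex).copy()
    Q = np.eye(n, dtype=complex)
    for c in range(n):
        for r in range(n - 1, c, -1):
            a, b = R[r - 1, c], R[r, c]
            d = np.hypot(abs(a), abs(b))
            if d == 0.0:
                continue
            G = np.eye(n, dtype=complex)   # complex Givens rotation
            G[r - 1, r - 1] = np.conj(a) / d
            G[r - 1, r] = np.conj(b) / d
            G[r, r - 1] = -b / d
            G[r, r] = a / d
            R = G @ R                      # zeroes R[r, c]
            Q = Q @ G.conj().T             # accumulate the unitary factor
    return Q, R  # A == Q @ R, with R upper triangular
```

Each inner-loop iteration corresponds to one complex Givens operation of the kind the “@” operator issues to the rotation core, and accumulating Q alongside R mirrors the program leaving R in area A and Q in area B.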

V. RESULTS

Table VI shows synthesis results of the accelerator and its main building blocks for a Virtex-4 LX200 target. Timing results show a critical path in the fixed-point divider, or in the multipliers if they are realized using logic slices. The maximum clock speed on a Virtex-4 LX200, speed grade 11, is 208 MHz. When two numbers are given in a field, they represent results with logic-slice-based and DSP-slice-based multipliers. Other than the rotation and multiplication cores, the main nonmemory components are the sorting circuits, accounting for nearly 20% of total logic slices. Section IV, however, shows that this is a substantial improvement over a flat multiplexer solution. Table VII shows synthesis results for a 65-nm CMOS ASIC technology. Results are listed for the entire circuit and the circuit excluding data memory (128 kbits for 64 subcarriers).

TABLE VI
RESOURCES FOR ACCELERATOR BUILDING BLOCKS ON V4-LX200

TABLE VII
SYNTHESIS RESULTS FOR A TSMC 65-nm PROCESS

TABLE VIII
CYCLE COUNT ESTIMATES BY ALGORITHM AND ANTENNA COMBINATION

Table VIII lists cycle count results obtained from cycle-accurate fixed-point simulations for different antenna/algorithm configurations for 64 subcarriers.

The accelerator is optimized for matrix operations used for MIMO decoding. By accepting complex vector operands and by virtue of an optimized processing core, it should show a significant advantage in matrix operations over a general purpose processor. Additionally, a custom processor cycle (the subject of a future paper) allows the accelerator processor pipeline to remain full at all times, thereby reducing the processor overhead significantly. To quantify the accelerator advantage, we carried out a series of tests to measure the energy required to carry out a set of typical complex matrix operations. To quantify delay, we measure the number of cycles required to run these tests on a general purpose DSP and the MIMO accelerator. Latency could be a disadvantage to the DSP in terms of power-delay product, so to isolate any trends, we repeat the test for different subcarrier counts, and repeat the tests in isolation and in series, allowing a range of DSP compiler optimizations to become visible. We also compare a number of decompositions and decoding algorithms on both platforms. The algorithms running on the DSP are not identical to those running on the accelerator; for example, in MMSE, an inversion is carried out without performing a QRD since the decomposition does not aid the software implementation. For the 4 × 4 QRD, the DSP implementation uses


TABLE IX
PDP AND THROUGHPUT COMPARISON WITH TMS320C641T-600 MHz; ACCELERATOR CLOCK 234 MHz. MMSE is 2 × 2, and ML is exhaustive metric calculation. Throughput for the DSP is in kilosamples per second and for the accelerator in megasamples per second.

TABLE X
PDP AND THROUGHPUT COMPARISON WITH TMS320C641T-720 MHz; ACCELERATOR CLOCK 234 MHz. MMSE is 2 × 2, and ML is exhaustive metric calculation. Throughput for the DSP is in kilosamples per second and for the accelerator in megasamples per second.

Gram-Schmidt orthogonalization instead of Givens rotations, since the former is more suited for software. The DSP compiler is set to optimize for speed above code size, and a flat memory map is used. The DSP used is a fixed-point TI DSP6416 at 600 MHz; a power number assuming 60% CPU utilization is used, and cycle counts are obtained from TI Code Composer Studio simulations. Table IX lists the ratio of accelerator PDP to DSP PDP and throughputs for different test scenarios. The accelerator has a significant advantage in all operations and in all scenarios, almost always above two orders of magnitude. The advantage is most notable in rotation (unitary transformation), where the ratio is consistently above three orders of magnitude. Unitary transformations dominate in most decoding algorithms (MMSE, ZF, and SVD).
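The power-delay-product comparison reduces to the following arithmetic; the numbers in the usage example are purely illustrative and are not taken from Table IX:

```python
def pdp(power_watts, cycles, clock_hz):
    """Power-delay product of one test: power times runtime,
    i.e., the energy consumed to complete the operation."""
    return power_watts * (cycles / clock_hz)

def pdp_ratio(dsp_power, dsp_cycles, dsp_clock,
              acc_power, acc_cycles, acc_clock):
    """Ratio of DSP PDP to accelerator PDP; > 1 favors the accelerator."""
    return (pdp(dsp_power, dsp_cycles, dsp_clock)
            / pdp(acc_power, acc_cycles, acc_clock))
```

For instance, a hypothetical DSP burning 1 W over 60,000 cycles at 600 MHz against an accelerator burning 0.5 W over 100 cycles at 234 MHz gives a ratio of 468, i.e., between two and three orders of magnitude.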

Table X shows results from the tests repeated for a faster DSP from the high-performance 6416 family. The trends observed before are still valid, and the ratio of PDP remains above three

TABLE XI
COMPARISON OF MMSE CIRCUITS TO ACCELERATOR. Time to invert and decode 64 samples includes latency; the penalty is the percentage by which the figure of merit (slices × time to invert and decode) for the accelerator is off from the dedicated design, based on slices × number of cycles.

TABLE XII
PDP COMPARISON SUMMARY

orders of magnitude on average, staying above 2000 for Givens rotations.

Compared to single-purpose ASIC and ASIC-like MIMO decoders that do not offer the same level of versatility and programmability, the accelerator is bound to have some performance penalty. Table XI quantifies this penalty against a number of MMSE MIMO decoder implementations on a Virtex-2 platform, while Table XII compares against a CMOS ASIC implementation. For FPGA platforms, the penalty is defined based on the following figure of merit: (total slices) × (time to invert and decode 64 samples). The accelerator disadvantage is mostly within an order of magnitude and comparable to ASICs within the systolic array architecture, where the disadvantage in PDP is around 20% (Table XII). These penalties are considerably low compared to the significant PDP advantage of the accelerator against general purpose processors.

REFERENCES

[1] G. J. Foschini and M. J. Gans, “On limits of wireless communications in a fading environment when using multiple antennas,” Wireless Pers. Commun., vol. 6, no. 3, pp. 311–335, Mar. 1998.

[2] M. Anders, S. Mathew, R. Krishnamurthy, and S. Borkar, “A 64-state 2 GHz 500 Mbps 40 mW Viterbi accelerator in 90 nm CMOS,” in Symp. VLSI Circuits, Dig. Tech. Papers, Jun. 2004, pp. 174–175.

[3] M. Anders, S. Mathew, S. Hsu, R. Krishnamurthy, and S. Borkar, “A 1.9 Gb/s 358 mW 16–256 state reconfigurable Viterbi accelerator in 90 nm CMOS,” IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 214–222, Jan. 2008.

[4] J. Mitola III, “The software radio architecture,” IEEE Commun. Mag., vol. 33, no. 5, pp. 26–38, May 1995.

[5] S. Haykin, “Cognitive radio: Brain-empowered wireless communications,” IEEE J. Sel. Areas Commun., vol. 23, no. 2, pp. 201–220, Feb. 2005.


[6] C. Chang, J. Wawrzynek, and R. W. Brodersen, “BEE2: A high-end reconfigurable computing system,” IEEE Des. Test Comput., vol. 22, no. 2, pp. 114–125, Mar. 2005.

[7] H. S. Kim, W. Zhu, J. Bhatia, K. Mohammed, A. Shah, and B. Daneshrad, “A practical, hardware friendly MMSE detector for MIMO-OFDM based systems,” EURASIP J. Adv. Signal Process., 2008, Article ID 267460.

[8] N. D. Hemkumar, “Efficient VLSI architectures for matrix factorizations,” Ph.D. dissertation, George R. Brown School of Engineering, Dept. Elect. Comput. Eng., Rice Univ., Houston, TX, 1994.

[9] R. P. Brent and F. T. Luk, “The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays,” SIAM J. Sci. Stat. Comput., vol. 6, no. 1, pp. 69–84, Jan. 1985.

[10] R. P. Brent, F. T. Luk, and C. Van Loan, “Computation of the singular value decomposition using mesh connected processors,” J. Very Large Scale Integr. Comp. Syst., vol. 1, no. 3, pp. 242–267, 1985.

[11] J. Wang, “A recursive least-squares ASIC for broadband 8 × 8 multiple-input multiple-output wireless communications,” Ph.D. dissertation, Henry Samueli School Eng. Appl. Sci., Univ. California, Los Angeles, Los Angeles, CA, 2005.

[12] D. Markovic, R. W. Brodersen, and B. Nikolic, “A 70 GOPS, 34 mW multi-carrier MIMO chip in 3.5 mm²,” in Symp. VLSI Circuits Dig. Tech. Papers, 2006, pp. 196–197.

[13] C. Studer, P. Blösch, P. Friedli, and A. Burg, “Matrix decomposition architecture for MIMO systems: Design and implementation trade-offs,” in Proc. Conf. Rec. 41st Asilomar Conf. Signals, Syst. Comput. (ACSSC), Nov. 4–7, 2007, pp. 1986–1990.

[14] B. Hassibi and H. Vikalo, “On the sphere-decoding algorithm I: Expected complexity,” IEEE Trans. Signal Process., vol. 53, no. 8, pp. 2806–2818, Aug. 2005.

[15] A. Burg, M. Borgmann, M. Wenk, and M. Zellweger, “VLSI implementation of MIMO detection using the sphere decoding algorithm,” IEEE J. Solid-State Circuits, vol. 40, no. 7, pp. 1566–1577, Jul. 2005.

[16] F. M. Cady, Microcontrollers and Microprocessors: Principles of Software and Hardware Engineering. New York: Oxford Univ. Press, 1997.

[17] M. Myllyla, J. Hintikka, J. R. Cavallaro, and M. Juntti, “Complexity analysis of MMSE detector architectures for MIMO OFDM systems,” in Conf. Rec. 39th Asilomar Conf. Signals, Syst. Comput., 2005, pp. 75–81.

[18] I. LaRoche and S. Roy, “An efficient regular matrix inversion circuit architecture for MIMO processing,” presented at the IEEE Int. Symp. Circuits Syst., Kos, Greece, 2006.

[19] F. Edman and V. Öwall, “A scalable pipelined complex valued matrix inversion architecture,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2005, pp. 4489–4492.

[20] M. Karkooti and J. R. Cavallaro, “FPGA implementation of matrix inversion using QRD-RLS algorithm,” in Proc. Conf. Rec. 39th Asilomar Conf. Signals, Syst. Comput., 2005, pp. 1625–1629.

Karim Mohammed (S’99–M’09) received the B.Sc. degree in electronics and electrical communications and the M.S. degree with emphasis on microelectronics from Cairo University, Cairo, Egypt, in 2002 and 2004, respectively, and the Ph.D. degree from the University of California, Los Angeles (UCLA), in 2009.

He is currently a Lecturer at Cairo University. His research interests include architectural approaches toward the realization of complex digital signal processing for wireless communication.

Babak Daneshrad (S’84–M’94) received the B.Eng. and M.Eng. degrees with emphasis in communications from McGill University, Montreal, QC, Canada, in 1986 and 1988, respectively, and the Ph.D. degree with emphasis in ICs and systems from the University of California, Los Angeles (UCLA), in 1993.

He is currently a Professor in the Electrical Engineering Department, UCLA. His research interests include wireless communication system design, experimental wireless systems, and VLSI for communications; his work is cross-disciplinary in nature and deals with addressing practical issues associated with the realization of advanced wireless systems. He is a coauthor of the paper that received the Best Paper Award at Parallel and Distributed Simulation 2004.

Prof. Daneshrad was a recipient of the 2005 Okawa Foundation Award and the First Prize in the Design Automation Conference (DAC) 2003 design contest. He is the beneficiary of the endowment for “UCLA-Industry Partnership for Wireless Communications and Integrated Systems.” In January 2001, he cofounded Innovics Wireless, a company focused on developing 3G-cellular mobile terminal antenna diversity solutions, and in 2004, he cofounded Silvus Technologies. From 1993 to 1996, he was a member of technical staff with the Wireless Communications Systems Research Department, AT&T Bell Laboratories, where he was involved in the design and implementation of systems for high-speed wireless packet communications.

