
834 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 5, MAY 2013

Novel MIMO Detection Algorithm for High-Order Constellations in the Complex Domain

Mojtaba Mahdavi and Mahdi Shabany

Abstract: A novel detection algorithm with an efficient VLSI architecture featuring efficient operation over infinite complex lattices is proposed. The proposed design results in the highest throughput, the lowest latency, and the lowest energy compared to the complex-domain VLSI implementations to date. The main innovations are a novel complex-domain means of expanding/visiting the intermediate nodes of the search tree on demand, rather than exhaustively, as well as a new distributed sorting scheme to keep track of the best candidates at each search phase. Its support of unbounded infinite lattice decoding distinguishes the present method from previous K-Best strategies and also allows its complexity to scale sublinearly with the modulation order. Since the expansion and sorting cores are data-driven, the architecture is well suited for a pipelined parallel VLSI implementation. The proposed algorithm is used to fabricate a 4 × 4, 64-QAM complex multiple-input-multiple-output detector in a 0.13-μm CMOS technology, achieving a clock rate of 417 MHz with the core area of 340 kgates. The chip test results prove that the fabricated design can sustain a throughput of 1 Gb/s with energy efficiency of 110 pJ/bit, the best numbers reported to date.

Index Terms: Complex-domain detection, K-best detectors, LTE/WiMAX systems, multiple-input multiple-output (MIMO) detector.

I. INTRODUCTION

MULTIPLE-INPUT MULTIPLE-OUTPUT (MIMO) systems have the potential of achieving high spectral efficiency, high data rates, and robust wireless links with an acceptable implementation complexity. MIMO technology has already been included in many wireless communication standards, such as the long-term evolution (LTE) project, IEEE 802.16e, and IEEE 802.16m. The key challenge in the design of any MIMO receiver is to achieve low complexity, low energy, high performance, and high throughput. Several MIMO detection algorithms have been proposed to address this challenge, which offer various tradeoffs between the performance and the computational complexity.

Among the large variety of MIMO detection techniques, maximum-likelihood (ML) detection is the optimum detection method and minimizes the bit error rate (BER). However, its computational complexity grows exponentially with the number of transmit antennas. On the other hand, linear detection methods such as zero-forcing or minimum mean squared error (MMSE) have lower complexity but poor BER performance. Ordered successive interference cancelation (SIC) algorithms, such as the vertical Bell Laboratories layered space-time algorithm, form another category of detectors. These algorithms perform better than the linear detectors, but their BER performance is still not acceptable.

Finally, as a tradeoff between complexity and performance loss, a large category of detection algorithms has been proposed, which includes the depth-first and the breadth-first search algorithms. The well-known depth-first strategy is the sphere decoder (SD) [1], which guarantees the optimal performance in the case of unlimited execution time [2]. However, its intrinsically variable throughput results in extra hardware overhead and significantly lower data rates in the lower signal-to-noise ratio (SNR) regimes. Among the breadth-first search methods, the most well-known approach is the K-Best algorithm (a.k.a. M-algorithm) [3]. The K-Best detector guarantees an SNR-independent fixed-throughput detection scheme with a performance close to that of ML. Although K-Best detectors are attractive for VLSI implementations, challenges remain, such as the need for efficient sorting and expansion schemes, before high throughputs can be achieved.

Manuscript received November 11, 2011; revised February 2, 2012; accepted April 4, 2012. Date of publication May 15, 2012; date of current version April 22, 2013.

The authors are with the Electrical Engineering Department, Sharif University of Technology, Tehran 14174, Iran (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2012.2196296

II. SYSTEM MODEL

Consider a MIMO system with $N_T$ transmit and $N_R$ receive antennas. The equivalent baseband model of the Rayleigh fading channel between the transmitter and the receiver is described by a complex-valued $N_R \times N_T$ channel matrix $H$. There are two models for such MIMO systems, namely, the complex equivalent model and the real equivalent model. In this paper, we consider the complex-domain framework. However, the proposed scheme can be easily tailored for the real equivalent model. The complex baseband equivalent model can be expressed as

$$y = Hs + v \tag{1}$$

where $s = [s_1, s_2, \ldots, s_{N_T}]^T$ is the $N_T$-dimensional complex transmit signal vector, in which each element is independently drawn from a complex constellation $\mathcal{O}$ (symmetric $M$-QAM schemes with $\log_2 M$ bits per symbol, i.e., $|\mathcal{O}| = M$), $y = [y_1, y_2, \ldots, y_{N_R}]^T$ is the $N_R$-dimensional received symbol vector, and $v = [v_1, v_2, \ldots, v_{N_R}]^T$ represents the $N_R$-dimensional independent identically distributed circularly symmetric complex zero-mean Gaussian noise vector with variance $\sigma^2$, i.e., $v_i \sim \mathcal{N}_c(0, \sigma^2)$. The real equivalent model can also be derived using a simple real-valued decomposition technique [1].

1063-8210/\$31.00 © 2012 IEEE

The objective of the MIMO detection method is to find the closest lattice point $\hat{s}$ for a given received signal $y$:

$$\hat{s} = \arg\min_{s \in \mathcal{O}^{N_T}} \| y - Hs \|^2. \tag{2}$$
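As an illustration of (1) and (2), the following minimal sketch builds a small noiseless instance of the model and solves (2) by exhaustive search. It is not part of the proposed detector: the helper names (`qam_constellation`, `apply_channel`, `ml_detect`), the unnormalized odd-integer constellation, and the tiny 2 × 2, 4-QAM setup are assumptions chosen only so that the brute-force search stays cheap.

```python
import itertools
import random

def qam_constellation(M):
    """Square M-QAM points with odd-integer coordinates (assumed, unnormalized)."""
    side = int(round(M ** 0.5))
    assert side * side == M, "M must be a perfect square"
    levels = [2 * k - (side - 1) for k in range(side)]   # e.g. [-1, 1] for 4-QAM
    return [complex(re, im) for re in levels for im in levels]

def apply_channel(H, s):
    """Return H s for a channel matrix stored as a list of rows."""
    return [sum(h * x for h, x in zip(row, s)) for row in H]

def ml_detect(H, y, O):
    """Exhaustive search for (2): test all |O|^N_T candidate vectors."""
    n_t = len(H[0])
    best, best_cost = None, float("inf")
    for cand in itertools.product(O, repeat=n_t):
        hs = apply_channel(H, cand)
        cost = sum(abs(yi - hi) ** 2 for yi, hi in zip(y, hs))
        if cost < best_cost:
            best, best_cost = list(cand), cost
    return best, best_cost

rng = random.Random(1)
O = qam_constellation(4)
# Rayleigh fading: i.i.d. complex Gaussian entries with unit variance.
std = 0.5 ** 0.5
H = [[complex(rng.gauss(0, std), rng.gauss(0, std)) for _ in range(2)]
     for _ in range(2)]
s = [rng.choice(O) for _ in range(2)]
y = apply_channel(H, s)            # noiseless case (v = 0), for illustration only
s_hat, cost = ml_detect(H, y, O)
```

The search visits $M^{N_T}$ candidates, which is why exhaustive ML detection becomes impractical for larger antenna counts and constellations.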

In this paper, a novel MIMO detection algorithm is proposed to solve the above problem in the complex domain with linear complexity. Since the proposed algorithm is based on the K-Best algorithm, that algorithm is briefly described first.

A. K-Best Algorithm

Consider the problem in (2), and let us denote the QR decomposition of the channel matrix as $H = QR$, where $Q$ is a unitary $N_R \times N_T$ matrix and $R$ is an upper triangular $N_T \times N_T$ matrix. Performing the nulling operation by $Q^H$ results in $z = Q^H y = Rs + w$, where $w = Q^H v$. Since the nulling matrix is unitary, the noise $w$ remains spatially white after the nulling. Exploiting the triangular nature of $R$, (2) can be expanded as

$$\hat{s} = \arg\min_{s \in \mathcal{O}^{N_T}} \sum_{i=1}^{N_T} \Bigl| z_i - \sum_{j=i}^{N_T} r_{ij} s_j \Bigr|^2. \tag{3}$$

The above problem can be thought of as a tree-search problem with $N_T$ levels, where, starting from the last row, one symbol is detected and, based on that, the next symbol in the upper row is detected, and so on. Thus, starting from $i = N_T$, (3) can be evaluated in an iterative manner as follows:

$$T_i(s^{(i)}) = T_{i+1}(s^{(i+1)}) + |e_i(s^{(i)})|^2 \tag{4}$$

$$e_i(s^{(i)}) = z_i - \sum_{j=i}^{N_T} r_{ij} s_j = L_i(s^{(i)}) - r_{ii} s_i \tag{5}$$

$$L_i(s^{(i)}) = z_i - \sum_{j=i+1}^{N_T} r_{ij} s_j \tag{6}$$

$$\bar{L}_i(s^{(i)}) = L_i(s^{(i)}) / r_{ii} \tag{7}$$

where $s^{(i)} = [s_i, s_{i+1}, \ldots, s_{N_T}]^T$, $T_i(s^{(i)})$ is the accumulated partial Euclidean distance (PED) with $T_{N_T+1}(s^{(N_T+1)}) = 0$, and $|e_i(s^{(i)})|^2$ denotes the distance increment between two successive nodes/levels in the tree.
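The recursion (4)-(6) can be checked numerically with a short sketch. The function name `accumulated_ped`, the 0-based indexing, and the example values of `R`, `z`, and `s` are illustrative assumptions; the point is only that accumulating $|e_i|^2$ from row $N_T$ up to row 1 reproduces the full metric $\|z - Rs\|^2$ of (3).

```python
def accumulated_ped(R, z, s):
    """Accumulated PED T_1(s^(1)) obtained by running (4)-(6) from row N_T to row 1."""
    n_t = len(s)
    T = 0.0                                     # T_{N_T+1} = 0
    for i in range(n_t - 1, -1, -1):            # rows N_T, ..., 1 (0-based here)
        L_i = z[i] - sum(R[i][j] * s[j] for j in range(i + 1, n_t))   # (6)
        e_i = L_i - R[i][i] * s[i]                                    # (5)
        T += abs(e_i) ** 2                                            # (4)
    return T

# Hypothetical 2 x 2 upper triangular example with a real diagonal.
R = [[2.0, 0.5 + 0.5j],
     [0.0, 1.5]]
s = [1 + 1j, -1 - 1j]
z = [2.3 + 1.1j, -1.4 - 1.6j]

recursive = accumulated_ped(R, z, s)
direct = sum(abs(z[i] - sum(R[i][j] * s[j] for j in range(2))) ** 2
             for i in range(2))
```

Both `recursive` and `direct` evaluate the same sum, only in a different order, so they agree up to rounding.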

Based on the above model, starting from level $i = N_T$, the K-Best algorithm expands each of the $K$ existing nodes in each level to $M$ new possible children in $\mathcal{O}$ and calculates their updated PEDs. It then sorts the $K \times M$ produced nodes and selects the $K$ best nodes, i.e., those with the lowest PED, as the surviving nodes in the next level. The path with the lowest PED at the first level of the tree is the hard decision output of the detector. There are two main computational cores in the above algorithm, which are discussed in the following.

1) Expansion: According to the K-Best algorithm in the complex domain, in each level, $K$ (parents of each level) $\times$ $M$ (children per parent) children should be enumerated, which results in a large complexity. The current expansion schemes in the real domain, such as the phase shift keying (PSK) enumeration [11], the base-centric search methodology [5], and the relaxed K-Best enumeration scheme based on PSK enumeration [6], are compared in [7]. Although these schemes can be applied to the complex domain, they do not scale linearly with the constellation size (such as in [11]) and/or have performance loss compared to the exact K-Best implementation (such as in [5] and [6]). To address this challenge, an efficient complex-domain expansion method called the on-demand expansion scheme is proposed in this paper, which provides all the information required for the exact K-Best implementation in the complex domain with no performance degradation while avoiding the exhaustive enumeration of the children. The computational complexity of the proposed scheme is independent of the constellation size.

2) Sorting: In each level of the K-Best algorithm in the complex domain, $K \times M$ children should be sorted. In [7] and [8], most of the sorting schemes, such as bubble sorting [3], which is a sorting method based on the Schnorr-Euchner (SE) technique ([2], [9]), and a distributed sorting scheme [6], [10], are compared. However, some of them are time-intensive for large values of $K$ and $M$ ([3], [11]) or have performance loss ([6], [10]). The most efficient sorting scheme is proposed in [7] and is used in this paper. This distributed sorter works for any value of $K$ and $M$ with no performance loss. Also, its complexity is independent of the constellation size and scales linearly with the value of $K$ [footnote 1].
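For reference, the conventional K-Best procedure described in this subsection (exhaustive expansion of every survivor into $M$ children, followed by a full sort) can be sketched as follows. This is the baseline that the on-demand expansion and distributed sorting are meant to avoid, not the proposed detector; the function name `k_best_exhaustive` and the small example values are assumptions made for illustration.

```python
def k_best_exhaustive(R, z, O, K):
    """Baseline K-Best: expand each survivor to all |O| children, keep the K best."""
    n_t = len(z)
    survivors = [(0.0, ())]              # (PED, partial path s_{i+1}, ..., s_{N_T})
    for i in range(n_t - 1, -1, -1):     # levels N_T, ..., 1 (0-based rows)
        children = []
        for ped, path in survivors:
            # L_i from (6): remove the interference of already detected symbols.
            L_i = z[i] - sum(R[i][i + 1 + k] * path[k] for k in range(len(path)))
            for c in O:                  # exhaustive expansion: M children per parent
                children.append((ped + abs(L_i - R[i][i] * c) ** 2, (c,) + path))
        children.sort(key=lambda node: node[0])   # sort K x M nodes by PED
        survivors = children[:K]
    best_ped, best_path = survivors[0]
    return list(best_path), best_ped

# Hypothetical noiseless example: z = R s for s = [1+1j, -1-1j] over 4-QAM.
O4 = [complex(re, im) for re in (-1, 1) for im in (-1, 1)]
R = [[2.0, 0.5 + 0.5j],
     [0.0, 1.5]]
z = [2 + 1j, -1.5 - 1.5j]
detected, ped = k_best_exhaustive(R, z, O4, K=2)
```

Each level costs $K \times M$ PED evaluations plus a sort of $K \times M$ entries, which is the cost that motivates the two computational cores discussed above.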

Due to the intrinsic challenges of implementation in the complex domain, most of the MIMO detection algorithms in the literature have been proposed for the real domain. However, on account of the deeper search tree, the real-domain implementation results in a larger silicon area and a larger latency. Nevertheless, a high-throughput MIMO detector in the complex domain with an acceptable complexity for high-order constellations has always been a challenge in the literature. To address this challenge, in this paper a high-throughput detection algorithm along with its VLSI architecture for a 4 × 4, 64-QAM complex MIMO detector is proposed, which is scalable to higher order constellation schemes such as 256-QAM and to a larger number of antennas (i.e., $N_T > 4$).

III. COMPLEX SE ENUMERATION

The main challenge in developing the complex-domain enumerator lies in devising a means of iteratively enumerating the elements of the complex constellation in the order of increasing squared distance from the unconstrained value, i.e., increasing PED. In this paper, a novel complex SE enumeration scheme is proposed to enumerate the complex constellation points in the order of nondecreasing PED.

A. First Child (FC)

Based on the complex version of the system model in (4), the FC of a node in $\mathcal{K}_{l+1}$, denoted $s_l^{[1]}$, is the one that minimizes $|e_l(s^{(l)})|^2$ [footnote 2]. In other words

$$s_l^{[1]} = \arg\min_{s_l \in \mathcal{O}} |e_l(s^{(l)})|^2 = \arg\min_{s_l \in \mathcal{O}} |L_l(s^{(l)}) - r_{ll} s_l|^2$$

$$= \arg\min_{\Re(s_l) \in \Omega} \Bigl| \underbrace{\Re[L_l(s^{(l)})/r_{ll}]}_{u_l^R} - \Re(s_l) \Bigr|^2 \tag{8}$$

$$\quad + j \arg\min_{\Im(s_l) \in \Omega} \Bigl| \underbrace{\Im[L_l(s^{(l)})/r_{ll}]}_{u_l^I} - \Im(s_l) \Bigr|^2 \tag{9}$$

$$= a^R_{[1]} + j\,a^I_{[1]} \tag{10}$$

where $\Omega = \{-\sqrt{M} + 1, \ldots, -1, +1, \ldots, +\sqrt{M} - 1\}$ represents all the possible values of the real/imaginary part of the constellation points, $a^R_{[1]} = \Re[s_l^{[1]}]$, $a^I_{[1]} = \Im[s_l^{[1]}]$, and the index $l$ was removed as we focus on one parent node in level $l + 1$ and try to enumerate its children in level $l$. Note that (8) and (9) are derived based on the fact that $r_{ll}$ is a real number, which is a result of the QR decomposition.

Fig. 1. Three-level tree used for enumeration of the complex constellation $\mathcal{O}$. This tree is defined for each complex constellation point.

Footnote 1: By increasing the value of $K$, the performance becomes close to that of ML detection. However, a higher $K$ results in more hardware complexity. In this paper, based on the simulations (see Section VII), $K$ is chosen to be 10.

Let us define $s_l^{[0]} = L_l(s^{(l)})/r_{ll}$ as the unconstrained received value. Considering the square symmetric $M$-QAM constellation schemes, there are $|\Omega| = \sqrt{M}$ possible integers on both the real and imaginary axes. Thus the optimizations in (8) and (9) are computationally inexpensive to implement, as $u_l^R = \Re[s_l^{[0]}]$ and $u_l^I = \Im[s_l^{[0]}]$ can be easily rounded to the nearest integers in $\Omega$ to find the optimized values for $\Re(s_l^{[1]})$ and $\Im(s_l^{[1]})$, respectively. This optimized value is denoted by $a^R_{[1]} + j\,a^I_{[1]}$ in (10), which is the FC of the current parent. Therefore, the FC can be easily implemented through a 2-D slicer.
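A 2-D slicer of this kind can be sketched in a few lines. The names `slice_axis` and `first_child` are hypothetical, and the sketch assumes a bounded square $M$-QAM constellation whose real and imaginary parts are odd integers, so each axis is rounded to the nearest odd integer and then clamped to the edge of the constellation.

```python
import math

def slice_axis(u, M):
    """Nearest odd integer to u, clamped to the constellation edge."""
    side = math.isqrt(M)                    # sqrt(M) values per axis
    nearest_odd = 2 * math.floor(u / 2) + 1
    return max(-(side - 1), min(side - 1, nearest_odd))

def first_child(L_l, r_ll, M):
    """FC of a parent: slice the unconstrained value s0 = L_l / r_ll on both axes."""
    s0 = L_l / r_ll                         # r_ll is real after the QR decomposition
    return complex(slice_axis(s0.real, M), slice_axis(s0.imag, M))

fc_inside = first_child(complex(-1.2, 1.6), 2.0, 16)    # s0 = -0.6 + 0.8j
fc_outside = first_child(complex(7.2, -9.0), 1.0, 16)   # s0 outside the 16-QAM grid
```

The two axes are sliced independently, which mirrors the separation into (8) and (9).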

B. Next Child (NC)

To describe how the next child in the complex domain is calculated, let us denote all the points in the complex constellation $\mathcal{O}$ by a three-level tree as shown in Fig. 1, where $s_l^{[0]}$ is at the root (Level 1). Once the FC (i.e., $s_l^{[1]} = a^R_{[1]} + j\,a^I_{[1]}$) is determined, it is selected as the first node in Level 2 of the three-level tree (the left-most node in Level 2 in Fig. 1) and the $\sqrt{M}$ Level-2 siblings are chosen as those that share the same imaginary value $a^I_{[1]}$ as the FC [footnote 3]. Therefore, their squared distances from $s_l^{[0]}$ vary directly with those of their real components from the real part of $s_l^{[0]}$. Since they are all in one line in the complex constellation with different real parts, the typical real SE enumeration technique [7] can be applied to enumerate them in the order of nondecreasing squared distance from $s_l^{[0]}$ [row SE (RSE) enumeration]. This means that the nodes in the second level of the three-level tree are positioned in the nondecreasing PED order.

Footnote 2: Because $T_{l+1}(s^{(l+1)})$ is common for all the children of a parent node.

Footnote 3: This is because of the fact that there are $\sqrt{M}$ possible real values and $\sqrt{M}$ possible imaginary values in the constellation.

For the third level of the three-level tree in Fig. 1, $\sqrt{M} - 1$ siblings are assigned to each Level-2 parent node such that they share the same real value inherited from their common Level-2 node, whereas their imaginary parts take the values $a^I_{[2]}, \ldots, a^I_{[\sqrt{M}]}$. For instance, the elements of the left-most subtree in Fig. 1 are $\{a^R_{[1]} + j\,a^I_{[2]}, \ldots, a^R_{[1]} + j\,a^I_{[\sqrt{M}]}\}$ and those of the right-most subtree are $\{a^R_{[\sqrt{M}]} + j\,a^I_{[2]}, \ldots, a^R_{[\sqrt{M}]} + j\,a^I_{[\sqrt{M}]}\}$. In fact, these $(\sqrt{M} - 1)$ Level-3 nodes and their common Level-2 node are in the same column of the complex constellation. Thus each column of the complex constellation corresponds to a definite subtree in the three-level tree. Similar to the Level-2 nodes, the real SE enumeration can be applied to their imaginary components to enumerate them in the order of nondecreasing squared distance from $s_l^{[0]}$ [column SE (CSE) enumeration]. This implies that, using CSE, all the Level-3 nodes of a definite Level-2 node are positioned in the corresponding subtree in the third level of the three-level tree from left to right in the order of nondecreasing PED.
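The real SE enumeration that underlies both the row (RSE) and column (CSE) steps is a zigzag around the unconstrained coordinate. A minimal sketch is given below, assuming a bounded axis of odd integers; the generator name `se_enumerate` and its `side` parameter (the number of values per axis, $\sqrt{M}$) are illustrative choices, not taken from the paper.

```python
import math

def se_enumerate(u, side):
    """Yield the odd integers in [-(side-1), side-1] in nondecreasing |a - u|."""
    lo, hi = -(side - 1), side - 1
    first = max(lo, min(hi, 2 * math.floor(u / 2) + 1))   # sliced (nearest) value
    yield first
    down, up = first - 2, first + 2          # next candidates on either side
    while down >= lo or up <= hi:
        # Zigzag: take whichever side is closer; fall back when one side runs out.
        if down < lo or (up <= hi and abs(up - u) <= abs(down - u)):
            yield up
            up += 2
        else:
            yield down
            down -= 2

row_order = list(se_enumerate(-0.4, 4))      # real axis of 16-QAM, u = -0.4
edge_order = list(se_enumerate(10.0, 4))     # u beyond the constellation edge
```

Because each value is produced only when requested, the enumeration can stop after any number of candidates, which is what the on-demand expansion relies on.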

Based on the above three-level tree structure, the next child is calculated as follows. Recall that the FC corresponds to the $s_l \in \mathcal{O}$ that minimizes $|e_l(s^{(l)})|$. By definition, its next best sibling ($s_l^{[2]}$) is the one that has the next smallest incremental distance $|e_l(s^{(l)})|$, i.e., it is the one in $s_l \in \{\mathcal{O} - s_l^{[1]}\}$ that minimizes $|e_l(s^{(l)})|$. Let $\mathcal{L}$ denote the set of points in the constellation that have been enumerated but have not yet been announced as the next best sibling. These nodes are named visited nodes. As a new point is enumerated, it is added to $\mathcal{L}$, so initially $\mathcal{L} = \{s_l^{[1]}\}$. At each step, the point in $\mathcal{L}$ with the lowest PED is announced to be the next best sibling. That point is removed from $\mathcal{L}$ and the complex SE enumeration is applied to it. So, according to the type of the announced node, i.e., Level-3 or Level-2, the announced node is replaced by one or two new points, respectively.

In other words, if the announced node is a Level-3 node, then only column SE (CSE) enumeration will be applied to it, whereas if the announced node is a Level-2 node, then both RSE and CSE enumerations will be performed on it. In fact, the row and column enumerations enable the coverage of the possible sets of values for $\Re[s_l]$ and $\Im[s_l]$, respectively. These new point(s) are said to be visited and are then added to $\mathcal{L}$.

Fig. 2. First four best children using complex SE enumeration in a 16-QAM constellation scheme. (a) $\mathcal{L} = \{-1 + j\}$. (b) $\mathcal{L} = \{-1 - j, +1 + j\}$. (c) $\mathcal{L} = \{-1 - j, +1 - j, -3 + j\}$. (d) $\mathcal{L} = \{-1 + 3j, +1 - j, -3 + j\}$.

Fig. 2 shows an example of the complex SE enumeration, where the bold crosses represent the visited points while the bold circled crosses represent the points announced as the next best child. Starting from $s_l^{[1]}$ [Fig. 2(a)], which is the left-most node in Level 2, its corresponding Level-3 node and its next sibling in Level 2 are visited and are added to $\mathcal{L}$. Among these two new nodes in $\mathcal{L}$, the one with the lowest PED is chosen [$+1 + j$ in Fig. 2(b)]. If the chosen node is a Level-2 node, its next sibling in Level 2 and its next child in Level 3 are both enumerated and are added to $\mathcal{L}$, which is equivalent to running both the RSE and the CSE enumerations simultaneously [Fig. 2(c)]. However, if the chosen node in $\mathcal{L}$ is a Level-3 node, only its next sibling in Level 3 is enumerated and added to $\mathcal{L}$, as is the case in Fig. 2(d). In other words, after finding $s_l^{[2]}$ in Fig. 2(b), since it is a Level-2 node, both row and column enumerations are performed, resulting in the addition of $+1 - j$ and $-3 + j$ to $\mathcal{L}$. On the other hand, since in Fig. 2(c) the node $s_l^{[3]}$ is a Level-3 node, only the column enumeration is performed, resulting in the addition of $-1 + 3j$ to $\mathcal{L}$. This process is performed until all the points in the constellation are covered. Repeated application of this procedure ensures the on-demand enumeration of the complex constellation points in the order of increasing local PED. Note that, since the expansion scheme proposed here is on demand, not all the nodes in the constellation are necessarily visited.
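The enumeration walked through above can be sketched as a lazy generator that keeps the visited set $\mathcal{L}$ in a priority queue. This is a software illustration under stated assumptions rather than the hardware scheme: `se_enumerate`, `complex_se`, and the use of Python's `heapq` are choices made here, the constellation is assumed to be bounded square $M$-QAM with odd-integer coordinates, and the unconstrained point `s0 = -0.2 + 0.4j` is a hypothetical value picked so that the first announcements follow the sequence described for Fig. 2.

```python
import heapq
import itertools
import math

def se_enumerate(u, side):
    """Real SE zigzag over the odd integers in [-(side-1), side-1]."""
    lo, hi = -(side - 1), side - 1
    first = max(lo, min(hi, 2 * math.floor(u / 2) + 1))
    yield first
    down, up = first - 2, first + 2
    while down >= lo or up <= hi:
        if down < lo or (up <= hi and abs(up - u) <= abs(down - u)):
            yield up
            up += 2
        else:
            yield down
            down -= 2

def complex_se(s0, M):
    """Yield constellation points in nondecreasing distance from s0, on demand."""
    side = math.isqrt(M)
    row = se_enumerate(s0.real, side)        # RSE over the Level-2 nodes
    visited = []                             # the set L, kept as a min-heap
    tick = 0                                 # tie-breaker so heap entries compare

    def visit(re, col, is_level2):
        nonlocal tick
        im = next(col, None)                 # CSE: next imaginary value in column
        if im is None:
            return                           # column exhausted
        p = complex(re, im)
        heapq.heappush(visited, (abs(p - s0) ** 2, tick, p, is_level2, col))
        tick += 1

    visit(next(row), se_enumerate(s0.imag, side), True)      # first child (FC)
    while visited:
        _, _, p, is_level2, col = heapq.heappop(visited)     # announce lowest PED
        yield p
        if is_level2:                        # Level-2 node: also run the RSE step
            nxt = next(row, None)
            if nxt is not None:
                visit(nxt, se_enumerate(s0.imag, side), True)
        visit(p.real, col, False)            # CSE step for Level-2 and Level-3

s0 = complex(-0.2, 0.4)
first_four = list(itertools.islice(complex_se(s0, 16), 4))
all_points = list(complex_se(s0, 16))
```

Requesting only four children, as in `first_four`, leaves most of the constellation unvisited, which is the on-demand property noted above.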

The sequence in Fig. 2 shows the process of finding the first four best children of a particular parent node using the proposed scheme, with the elements of $\mathcal{L}$ listed for each stage in the caption. At each stage, a dashed gray circle indicates the distance of the most recently announced K-Best node to $s_l^{[0]}$. One argument should be proven to guarantee the correct functionality of this complex enumeration scheme, i.e., it should be ensured that, when a node in $\mathcal{L}$ is announced as the next K-Best node, any other node in $\mathcal{O}$ with a lower PED has already been visited and announced as a K-Best candidate. In other words, all the unannounced nodes in $\mathcal{O}$ should have a larger PED than the most recently announced K-Best node.

Proposition: Using the proposed complex SE enumeration scheme, nodes are visited in the order of increasing PED value.

Proof: The following two observations are used for this proof.

Lemma:

1) any Level-3 node has a larger PED than that of its corresponding Level-2 parent node;

2) if a node is announced as the next K-Best node, it has the lowest PED among the nodes in $\mathcal{L}$.

Fig. 3. Six possible cases for the complex SE enumeration.

Let $\mathcal{U} = \mathcal{O} - \mathcal{K} - \mathcal{L}$, where $\mathcal{O}$ is the set of all nodes in the constellation, $\mathcal{K}$ is the set of nodes that have been announced as the K-Best nodes, and $\mathcal{L}$ is the set of nodes that have been visited but not yet announced as K-Best candidates. It should be proved that all the unannounced nodes in $\mathcal{O}$ have a larger PED than the nodes in $\mathcal{K}$. In a mathematical form

$$\forall s_l \in \{\mathcal{O} - \mathcal{K}\},\ \forall s_l^{[i]} \in \mathcal{K} : \mathrm{PED}(s_l) > \mathrm{PED}(s_l^{[i]}) \tag{11}$$

where $\mathrm{PED}(s_l^{[i]})$ denotes the PED value of the node and all its ancestors to the $N_T$th level of the detection tree. Since K-Best nodes are announced in the nondecreasing order of the PED, one only needs to prove that

$$\forall s_l \in \{\mathcal{O} - \mathcal{K}\} : \mathrm{PED}(s_l) > \mathrm{PED}(s_l^{[k]}) \tag{12}$$

where $s_l^{[k]}$ represents the most recently announced K-Best node in $\mathcal{K}$.

To prove this, let us consider the contrary, i.e., assume there is a node, represented by $s_l$, which is not in $\mathcal{K}$ and has a lower PED than $\mathrm{PED}(s_l^{[k]})$. There are six possible cases to look at, as shown in Fig. 3 and described in the following:

1) this case implies that a Level-2 node $s_l$ has a lower PED than a Level-2 node $s_l^{[k]}$, which is contrary to the concept of the ordered row SE enumeration;

2) this case means that a Level-2 node $s_l \in \mathcal{L}$ has a lower PED than the Level-3 node $s_l^{[k]}$, which is contrary to note 2) above;

3) this case means that a Level-2 node $s_l \in \mathcal{U}$ has a lower PED than the Level-3 node $s_l^{[k]}$. This implies that there is one unannounced Level-2 node in $\mathcal{L}$ whose PED is larger than $\mathrm{PED}(s_l^{[k]})$. Since $\mathrm{PED}(s_l) < \mathrm{PED}(s_l^{[k]})$, the unvisited node $s_l$ would then have a lower PED than that visited Level-2 node, which is contrary to the ordered row SE enumeration;

4) this case implies that a Level-3 node $s_l \in \mathcal{L}$ has a lower PED than the node $s_l^{[k]}$, which is contrary to note 2) above;

5) this case means that a Level-3 node $s_l \in \mathcal{U}$, whose Level-2 parent node is in $\mathcal{L}$, has a lower PED than the node $s_l^{[k]}$, which is contrary to notes 1) and 2) above;

6) this case implies that a Level-3 node $s_l \in \mathcal{U}$, whose Level-2 parent node is in $\mathcal{U}$, has a lower PED than the node $s_l^{[k]}$. This means that, regardless of the level of $s_l^{[k]}$, there is one unannounced Level-2 node in $\mathcal{L}$ (not including $s_l^{[k]}$), which has a lower PED than the node $s_l^{[k]}$. This is contrary to the ordered row SE enumeration and note 1) above.

Fig. 4. Variation of the value of $|\mathcal{L}|$ for 16-QAM for a specific received symbol. (a) $|\mathcal{L}| = 3$. (b) $|\mathcal{L}| = 4$. (c) $|\mathcal{L}| = 4$. (d) $|\mathcal{L}| = 4$. (e) $|\mathcal{L}| = 4$. (f) $|\mathcal{L}| = 1$.

Considering the nature of the SE enumeration in both dimensions creates an intuitive proof. In other words, the nodes in the three-level tree shown in Fig. 1 are sorted while being enumerated in the tree on both Levels 2 and 3 from left to right. Therefore, the selection of the nodes proceeds from left to right on both levels. In fact, moving from left to right on the three-level tree corresponds to visiting the nodes with nondecreasing PED value. Therefore, if there is one unannounced Level-3 node among the children of a Level-2 parent node, it is guaranteed that all its Level-3 siblings on its right side have a higher PED. Also, if there is one unannounced Level-2 node, all the other Level-2 nodes on its right side and their children have a higher PED than this node.

One key property of the proposed enumeration scheme is that, because it is based on the SE enumeration, it does not require that the lattice search space under consideration be bounded. Another feature is that the best children of each parent are generated one by one and on demand (without visiting all the other points). Therefore, the complexity of this approach and the search complexity are independent of the constellation order. This makes our approach a promising one, especially for higher order modulations such as 64-QAM or 256-QAM.

C. Relation of $|\mathcal{L}|$ and the Constellation Order ($M$)

Another promising aspect of the proposed approach is that it can be shown that $|\mathcal{L}| \le \sqrt{M}$. What makes this feature possible is the ordered expansion along with the pipelined sorting scheme. It is worth noting that the complex version requires extra circuitry to implement the above proposed expansion scheme in the complex domain. However, since it is implemented on an on-demand basis, and since $\mathcal{L}$ does not grow linearly with the constellation size, this extra circuitry is not considerable.

Proposition: Using the above complex SE enumeration scheme, the size of $\mathcal{L}$ never exceeds $\sqrt{M}$, where $M$ represents the size of the constellation.

Proof: Based on the above proposed scheme, there are two levels of nodes in the tree (Level-2 and Level-3 nodes). If any of the Level-2 nodes is selected, it is excluded from $\mathcal{L}$ and at most two more constellation points are added to $\mathcal{L}$ (its next sibling in Level 2 and its next child in Level 3). Therefore, the selection of any Level-2 node increases the value of $|\mathcal{L}|$ by at most 1. However, if the selected node is in Level 3, it is excluded from $\mathcal{L}$ and at most its sibling, if any, is added to $\mathcal{L}$. Therefore, if a Level-3 node is selected, the value of $|\mathcal{L}|$ does not change, or may even decrease. Since there are $\sqrt{M}$ Level-2 nodes in any $M$-QAM constellation, the value of $|\mathcal{L}|$ can increase by at most $\sqrt{M}$; thus $|\mathcal{L}| \le \sqrt{M}$.

This fact is illustrated in Fig. 4, which shows the complex enumeration in a 16-QAM constellation for a specific received symbol, which is marked in the figure. Dots represent the constellation points, circled dots represent the visited candidates not selected yet (the elements of $\mathcal{L}$), and black circles denote the constellation points announced so far (the elements of $\mathcal{K}$). The arrows show the flow of the enumeration in the 16-QAM constellation, which depends on the location of the received symbol, and the number on each arrow represents the time step at which the target node is visited. For instance, in Fig. 4(a), $+1 - j$ and $-1 - 3j$ are visited in the second step of the enumeration, while $+1 + j$ is visited in the fifth step of the enumeration [Fig. 4(c)]. The value of $|\mathcal{L}|$ is the number of visited points not selected yet, i.e., circled dots. By looking at Fig. 4, the largest value of $|\mathcal{L}|$ is 4. The value of $|\mathcal{L}|$ for each step is mentioned in the caption.
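The bound on $|\mathcal{L}|$ can also be checked empirically with a short sketch that runs the enumeration to completion and records the largest size of the visited set. The function name `max_visited_size` is hypothetical, the constellation is assumed to be bounded square $M$-QAM with odd-integer coordinates, and sorting each axis by distance is used as a stand-in for the RSE/CSE zigzag order (it produces the same order, but it is not how the hardware would generate it).

```python
import heapq
import math

def max_visited_size(s0, M):
    """Largest |L| seen while enumerating all M points around s0."""
    side = math.isqrt(M)
    levels = [2 * k - (side - 1) for k in range(side)]
    # Stand-in for RSE/CSE: each axis sorted by distance from s0 (assumption).
    row = sorted(levels, key=lambda a: abs(a - s0.real))
    col = sorted(levels, key=lambda a: abs(a - s0.imag))

    def dist2(r, c):
        return (row[r] - s0.real) ** 2 + (col[c] - s0.imag) ** 2

    L = [(dist2(0, 0), 0, 0)]            # FC: (PED, row index, column index)
    largest = 1
    while L:
        _, r, c = heapq.heappop(L)       # announce the node with the lowest PED
        if c == 0 and r + 1 < side:      # Level-2 node: add next row sibling
            heapq.heappush(L, (dist2(r + 1, 0), r + 1, 0))
        if c + 1 < side:                 # add the next node of the same column
            heapq.heappush(L, (dist2(r, c + 1), r, c + 1))
        largest = max(largest, len(L))
    return largest

peak_16qam = max_visited_size(complex(-0.2, 0.4), 16)
```

For the 16-QAM example point used here, the peak equals $\sqrt{16} = 4$, matching the largest value reported for Fig. 4.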

IV. PROPOSED COMPLEX K-BEST ALGORITHM

The implementation in the real domain is straightforward, as the next child can be found by a simple zigzag movement around the unconstrained received value $s_l^{[0]}$ without any PED calculation in the feedback path of the architecture [12]. However, in the complex SE enumeration scheme, a 2-D SE enumeration is needed to find the next best siblings. After each complex SE enumeration, one or two new nodes are generated, which are added to $\mathcal{L}$ after calculation of their PED values. Thus the size of $\mathcal{L}$, which includes the best sibling nodes, may increase, and it is most probable that the size of $\mathcal{L}$ is greater than 1. So in order to find the next child of the announced node, we should find the child with the minimum PED among the entries of $\mathcal{L}$, which incurs extra complexity in the critical path.

On the other hand, in order to have a high-throughput detector, the $K$ parent nodes of the next level should be generated in $K$ clock cycles. So according to the nature of the distributed K-Best algorithm, both the PED calculation and the PED comparison should be done in the feedback path in one clock cycle for each announced node, which is the main underlying challenge. Also, the PED calculation in the complex domain needs more computations than in the real domain. Therefore, all these computations are added to the total critical path of the detector, which results in a significant decrease in the throughput. Thus the idea of the real-domain distributed K-Best algorithm cannot be applied to the complex domain to achieve a high-throughput design.

To address the above challenges, a novel complex-domain detection algorithm is proposed in this paper. Let us consider an $N_R \times N_T$, $M$-QAM MIMO system, so that the complex-domain detection tree has $N_T$ levels. The proposed algorithm can be described as follows.

A. Proposed Complex K-Best Algorithm

Step I. Level $N_T$

1) Calculate the FC of the incoming node, which is the $N_T$th entry of the vector $z$.
2) Find all of the Level-2 nodes that are located in the same row of the constellation as the FC.
3) Calculate the PED of the FC and all the Level-2 nodes and save all these $\sqrt{M}$ nodes and their PED values in a register bank (i.e., $\mathcal{L}$).
4) For $k = 1 : K$
   a) Find the node with the minimum PED in $\mathcal{L}$ and announce it as one of the $K$ parent nodes of the next level of the detector.
   b) Find the next child of the announced node. In the complex domain, the next child should be calculated by the complex SE enumeration technique.
   c) Calculate the PED of the new Level-3 node and replace the announced node with the new Level-3 node in $\mathcal{L}$ [footnote 4].
   End

Step II. Level $(N_T - 1)$ to Level 2

1) For $i = 1 : K$
   a) Calculate the value of $L_i(s^{(i)})$ for the incoming node, which is the $i$th parent of the current level.
   b) Find the FC.
   c) Find RSE_Num Level-2 nodes, which are the nearest nodes to the FC in the constellation, using the row SE (RSE) enumeration technique.
   d) Calculate the PED values for the above RSE_Num Level-2 nodes and the FC, and then save all these nodes in the corresponding register bank ($\mathcal{L}$), resulting in $|\mathcal{L}|$ = RSE_Num + 1 for the $i$th parent of the current level.
   e) For $j = 1$ : CSE_Num
      i) Find the node with the minimum PED in $\mathcal{L}$.
      ii) Find the next child of the selected node. In this step, all the $\mathcal{L}$ entries are Level-2 nodes. So regardless of the type of the selected node (Level 2 or Level 3), in order to find the next child, only the corresponding Level-3 node of the selected node should be found by using the CSE technique.
      iii) Calculate the PED of the new Level-3 node and save it in $\mathcal{L}$. So the size of $\mathcal{L}$ will increase by 1.
      End
   f) For the current parent node (i.e., the $i$th parent), sort the entries of the obtained $\mathcal{L}$ in the order of nondecreasing PED, resulting in the final size of $|\mathcal{L}|$ = RSE_Num + 1 + CSE_Num.
   End
2) Find the sorted list of the $K$ first children of the $K$ parents in the order of nondecreasing PED.
3) For $k = 1 : K$
   a) Announce the node with the minimum PED in the above sorted list as the $k$th parent of the next level.
   b) In the above sorted list, replace the announced node with its next child, which is obtained from the sorted $\mathcal{L}$ of its parent.
   End

Step III. First Level

1) For $k = 1 : K$
   a) Calculate the value of $L_k(s^{(k)})$ for the incoming node, which is the $k$th parent of the current level.
   b) Find the FC and the corresponding PED.
   End
2) Find the sorted list of the $K$ first children of the $K$ parents in the order of nondecreasing PED.
3) Find the node with the minimum PED from the sorted list of the first children and announce it with all of its parents up to level $N_T$ as the hard decision outputs of the detector.

Footnote 4: Note that there are $\sqrt{M}$ Level-2 nodes in an $M$-QAM constellation. So in Step I.2), $\mathcal{L}$ was initialized by $\sqrt{M}$ Level-2 nodes ($|\mathcal{L}| = \sqrt{M}$). Also, according to the proposed idea in Step I.4.b, each announced node will be replaced with only one Level-3 node. So the size of $\mathcal{L}$ will remain fixed ($|\mathcal{L}| = \sqrt{M}$).
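Step I of the listing can be illustrated in software as follows. This is a sketch under stated assumptions, not the fabricated architecture: the function name `step_one`, the list-based register bank, the use of a sorted axis as a stand-in for the CSE order, and the example values of `z_nt` and `r_nt` are all hypothetical, and $K \le M$ is assumed. It shows that pre-loading all $\sqrt{M}$ Level-2 nodes and then replacing each announced node by its next Level-3 node yields the $K$ lowest-PED children at level $N_T$.

```python
import math

def step_one(z_nt, r_nt, M, K):
    """Step I sketch: K parents for the next level from a sqrt(M)-entry bank."""
    side = math.isqrt(M)
    levels = [2 * k - (side - 1) for k in range(side)]
    s0 = z_nt / r_nt                                   # unconstrained value
    # Column order by distance; stands in for the CSE zigzag (assumption).
    col = sorted(levels, key=lambda a: abs(a - s0.imag))

    def ped(p):
        return abs(z_nt - r_nt * p) ** 2               # T_{N_T+1} = 0 at this level

    # I.1-I.3: the FC and all Level-2 nodes share the imaginary part col[0].
    bank = [[ped(complex(re, col[0])), complex(re, col[0]), 0] for re in levels]
    parents = []
    for _ in range(K):                                 # I.4
        entry = min(bank, key=lambda e: e[0])          # I.4.a: lowest PED in L
        parents.append((entry[1], entry[0]))
        nxt = entry[2] + 1                             # I.4.b: next node in column
        if nxt < side:
            p = complex(entry[1].real, col[nxt])
            entry[:] = [ped(p), p, nxt]                # I.4.c: replace in place
        else:
            entry[0] = math.inf                        # column exhausted
    return parents

z_nt, r_nt = complex(2.3, -4.1), 1.3
parents = step_one(z_nt, r_nt, M=64, K=10)

# Reference: exhaustive PEDs of all 64 points, sorted.
lv = [2 * k - 7 for k in range(8)]
reference = sorted(abs(z_nt - r_nt * complex(re, im)) ** 2 for re in lv for im in lv)
```

The register bank keeps exactly $\sqrt{M}$ entries throughout, in line with footnote 4, and only $\sqrt{M} + K$ PEDs are computed instead of $M$.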

B. Limited SE Enumeration Idea

There are two key points in the detection process that affect the BER performance of the detection algorithm.

1) The generation of the $K$ parent nodes of level $N_T$ should be done carefully, as any error in level $N_T$ propagates to all of the other levels, which results in performance loss.

2) According to the complex SE enumeration, the Level-2 node of each column has a lower PED than the other nodes of that column. So the Level-2 node of each column in the constellation has a higher priority than the other nodes of the same column to be announced as one of the $K$ best nodes of the next level. Thus, it is necessary to avoid missing the Level-2 nodes in the detection algorithm.

So the generation of the parent nodes of the level NT andthe generation of the Level-2 nodes are two important factors

in the final BER of the system.One of the innovations of this paper is to find all the

Level-2 nodes at the beginning of the proposed algorithm (i.e.,Level NT). Then, regardless of the type of the current node

(Level-2 or Level-3), in order to find the next child, only the

corresponding Level-3 node of the current node should befound, which can be done by the CSE enumeration technique.

So, in order to consider the first factor at the beginning of

level NT, all the Level-2 nodes are generated and saved in

L (Step I.2). If one of these nodes is selected as one of the

parents of the next level, then only its corresponding Level-3

node should be generated, as its neighboring Level-2 nodes

have already been generated (Step I.4.b). So all the Level-2 and Level-3 nodes can be generated in this level and there is

no omitted node in the level NT of the proposed algorithm.

This will ensure that there is no performance degradation in

this level of the proposed algorithm compared to the exact

K-Best algorithm.

Also, the second point can be considered through the

generation of all of the Level-2 nodes. The above idea can

be applied to the other levels of the proposed algorithm at the added cost of a larger silicon area. On the other hand, the effect of the parent nodes of level 1 on the BER performance is lower than that of the parent nodes of level NT. Another contribution of the proposed algorithm is to generate a fixed number of child nodes for each parent in level NT − 1 through level 2, as described in the following.

Let us consider the distributed K-Best algorithm which uses the complex SE enumeration scheme [13]. After announcing a node as one of the K best nodes of the next level, the complex SE enumeration should be applied on the announced node. Then the node with the lowest PED in L will be announced as one of the K parents of the next level. This process will be repeated to generate all of the K parents of the next level. During this process, an unpredictable number (0 to K) of parents of the next level will be chosen from the same parent of the current level. In other words, the number of children of a given current-level parent that are chosen as parents of the next level is not fixed. This means that the number of column/row SE enumerations that should be performed on a given parent, and also the size of the corresponding L, are unknown and variable. This results in a significant decrease in the hardware utilization. To address this challenge, in the proposed algorithm a fixed number of children will be generated for each parent (i.e., RSE_Num + CSE_Num + 1). Thus a fixed number of the column/row SE enumerations will be applied to each parent, which is done in Steps II.1.c and II.1.e of the proposed algorithm by using RSE_Num row SE enumerations and CSE_Num CSE enumerations, respectively. Proper values of these two important parameters are determined as follows.

C. Finding the Value of CSE_Num

Let us consider the generation of the parent nodes in the

complex domain using the complex SE enumeration scheme.

In this case, all the child nodes of a parent can be enumerated

and there is no constraint on the number of the column/row SE enumerations. This method is referred to as the relaxed SE enumeration scheme as opposed to our proposed limited SE enumeration scheme. In fact, the BER performance of the exact K-Best algorithm will be obtained for the relaxed SE enumeration scheme. It will also be shown that the difference between the final BER of our proposed detection algorithm and the exact K-Best algorithm is negligible.

In order to find the proper value of the CSE_Num, consider the generation of the parent nodes in the relaxed SE enumeration scheme in the complex domain. The simulation result of this scheme for a 4 × 4, 64-QAM MIMO detector with K = 10 is shown in Fig. 5. This simulation was performed for 21 845 packets, where each packet consists of 4 (transmitted vectors/packet) × 4 (symbols/transmitted vector) × 6 (bits/symbol) = 96 bits (thus about 2 Mbits in total). This figure shows the number of parents that have the same number of visited child nodes at the end of the simulation. For more clarity, the generation of the parents will be explained step by step in the following.

Fig. 5. Number of parent nodes that have the same number of visited child nodes for a 4 × 4, 64-QAM complex MIMO detector with K = 10.

At the beginning, the FC of each parent will be found. So

each parent node has only one visited child. Thus the parent

nodes will be placed in the first column in Fig. 5. If the FC of a parent node from the first column is announced as one of the K-Best parents of the next level, then both column/row SE enumerations will be applied on it and two new nodes will be generated (note that the FC is a Level-2 node). So there is a total of three visited children, which will be reflected in the third column in Fig. 5. Thus the parent nodes with no announced child will remain in the first column. This is the reason why the second column is always empty, which is the result of the concept of the complex SE enumeration [see Fig. 2(a) and (b)].

Moreover, Fig. 5 shows that there are nearly 11 × 10^6 parent nodes that have one visited child till the end of the simulation with no announced node. Also, there are almost 6 × 10^6 parent nodes with three visited children.

This can be seen differently in Fig. 6, which shows the

number of child nodes that are in the same category. For

example, the third column of Fig. 5 shows that there are 6 × 10^6 parent nodes with three visited children. Thus there are 3 × 6 × 10^6 = 18 × 10^6 child nodes in this category. In fact, the value of each column in Fig. 6 is equal to the product of the values of the horizontal and vertical axes of the corresponding column in Fig. 5.

Fig. 6. Number of child nodes that are in the same category for a 4 × 4, 64-QAM MIMO detector with K = 10 in the complex domain.

Note that, according to Figs. 5 and 6, the number of parent nodes with five visited children is larger than the number of parent nodes with four visited children (i.e., the fourth column). After announcing the FC as a parent of the next level, its parent will be transferred from the first column to the third column. According to the concept of the complex SE enumeration, the third column includes both Level-2 and Level-3 nodes. So there are two scenarios.

1) If a Level-3 node is chosen from the third column as

one of the K parents of the next level, then just the

column SE enumeration will be applied to it, which

results in transferring that parent from the third column

to the fourth column. This is the only possible way to add a node to the fourth column because the second

column is empty.

2) If a Level-2 node is chosen from the third column

as one of the parents of the next level, then both the

column/row SE enumeration will be applied to it, which

results in transferring that parent from the third column to the fifth column. But as a key point, consider a Level-3 node in the fourth column, which is chosen as one of the K parents of the next level. So only the

column SE enumeration will be applied to it, which

results in transferring that parent node from the fourth

column to the fifth column.

Thus the fifth column is the target of two columns (i.e., the

third and fourth columns). But the fourth column is the target

of only one column (i.e., the third column). So the number of nodes that are placed in the fifth category is larger than in the fourth category.

It is worth noting that visiting and announcing are two different concepts, meaning that the number of visited nodes and the number of announced nodes from a parent node will be different, as shown in Table I. The first row of this table corresponds to the horizontal axis in Fig. 5, which includes the number of visited child nodes for a parent, and the second row shows the possible number of announced child nodes for that parent node. For instance, the parent nodes of the third column in Fig. 5 have only one announced child node (i.e., their first children), which is shown in the third column of Table I.

Note that in the second row of Table I the possible range

for the number of announced children from a specific parent

is shown. For example, for the parent nodes of the seventh

column of Table I, we are sure that at least three child nodes of

these parents were announced as the parents of the next level.

TABLE I

COMPARISON OF THE NUMBER OF VISITED NODES AND THE NUMBER OF ANNOUNCED NODES FOR A SPECIFIC PARENT

Number of visited nodes | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Number of announced nodes | 0 | 0 | 1 | 2 | 2, 3 | 3, 4 | 3, 4, 5 | 4, 5, 6 | 4, 5, 6, 7 | 5, 6, 7, 8

Fig. 7. All possible scenarios for the node generation from a definite parent node with CSE_Num = 3.

But this table shows that the parent nodes of this category can

have up to five announced child nodes.
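The entries of Table I can be reproduced with a small counting model. The sketch below is an illustration under assumptions drawn from the scenarios described above: the first announced child is always the FC, every announced Level-2 node adds two visited children (one row and one column SE enumeration), and every announced Level-3 node adds one. The function name `possible_announced_counts` is hypothetical.

```python
def possible_announced_counts(visited):
    """Return the sorted announced-child counts consistent with `visited`.

    Model (an assumption for this sketch):
        visited = 1 + 2 * n2 + n3
    where n2 >= 1 counts announced Level-2 nodes (the FC included) and
    n3 >= 0 counts announced Level-3 nodes.
    """
    if visited == 1:
        return [0]  # only the FC has been visited; nothing announced yet
    counts = set()
    for n2 in range(1, visited):
        n3 = visited - 1 - 2 * n2
        if n3 >= 0:
            counts.add(n2 + n3)
    return sorted(counts)
```

For two visited children the function returns an empty list, which corresponds to the empty second column of Fig. 5.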

As an important result, Fig. 5 shows that less than 9% of all the parent nodes (2 × 10^6 parent nodes from 22 × 10^6) have more than five visited child nodes, and less than 2% of all the parent nodes (0.5 × 10^6 parent nodes from 22 × 10^6) have more than seven visited child nodes. So we can ignore the parent nodes of the other categories. Thus, one of the innovations of this paper is to limit the number of SE enumerations for each parent node, which is equivalent to ignoring the last columns in Fig. 5. This idea is used to obtain the appropriate values of CSE_Num and RSE_Num based on the above observation. Larger values of CSE_Num and RSE_Num result in a better BER performance at the cost of more silicon area.

In order to find the proper value of CSE_Num, exhaustive

simulations were done and the value of CSE_Num was chosen

to be 3. According to the concept of complex SE enumeration,

it can be proven that in this case (i.e., CSE_Num = 3) the number of visited children for a parent node will be five, six, or seven child nodes, which is shown in Fig. 7(a)–(c), respectively. Fig. 7 shows all of the possible scenarios for the node generation of a definite parent with CSE_Num = 3, which results in visiting five, six, or seven child nodes from that parent. The announced child nodes of the parent are shown by green circles in Fig. 7. Thus in our proposed algorithm, three CSE enumerations (i.e., CSE_Num = 3) will be applied on each parent node and three Level-3 nodes will be generated (Step II.1.e in the proposed algorithm).

D. Value of RSE_Num

Fig. 7 shows that, according to the value of the CSE_Num, after the FC, up to three Level-2 nodes can be generated [see Fig. 7(a)]. So, in order to avoid missing the Level-2 nodes in the proposed algorithm, in addition to the FC of a parent, three Level-2 nodes should always be generated, which can be done using three row SE enumerations. According to Step II.1.f of the proposed algorithm, all children of a parent should be sorted in the order of nondecreasing PED. According to the proposed architecture for Step II.1.f, it can be proven that the difference between the hardware required to sort seven or eight values is negligible. So, according to this fact and the importance of the Level-2 nodes in the final BER (see Section IV-B2), the number of Level-2 child nodes of each parent was chosen to be five in the proposed algorithm, which implies that RSE_Num = 4, because the FC is generated in a different manner, without the SE enumeration technique. Thus four row SE enumerations will always be applied on each parent node and four Level-2 nodes will be generated (Step II.1.c in the proposed algorithm).

Therefore, the result of the above ideas is to generate four Level-2 nodes by applying four row SE enumerations (RSE_Num = 4) and three Level-3 nodes by applying three CSE enumerations (CSE_Num = 3) for each parent (Steps II.1.c and II.1.e in the proposed algorithm). Thus, in addition to the FC, seven child nodes will be generated for each parent. Fig. 8 shows two possible cases that are covered by the proposed idea. Note that, after finding the FC, all these nodes will be visited for each parent (Steps II.1.c and II.1.e). This means that all the nodes that are not indicated with the green circles in Fig. 7 can be announced as the parents of the next level. But if the next child of the announced node does not exist in the collection of eight visited nodes of that parent, then that next child will not be enumerated.

Fig. 8. Two scenarios of node generation from a definite parent node that are covered by the proposed idea. (a) Level-3 nodes are in three different columns. (b) Level-3 nodes are in two columns.

Fig. 9. Proposed VLSI architecture of a 4 × 4, 64-QAM MIMO detector.

V. PROPOSED VLSI ARCHITECTURE

The proposed VLSI architecture for a 4 × 4, 64-QAM hard-output MIMO detector is shown in Fig. 9. This architecture consists of four layers. Each layer gets the entries of z, R, some control signals, and the K parents of the previous layer (if any) as inputs and generates the K parents of the next layer as outputs. The control unit performs the scheduling of the inputs of the layers. Each layer consists of some subblocks, as described in the following. In order to reduce the final critical path of the design, fine-grain pipelining is applied to the proposed architecture. The internal pipelining stages of the subblocks and the pipeline stages between the blocks are shown by the red dashed lines and arrows in all the figures to come.

Fig. 10. Detailed VLSI architecture for Layer NT of the proposed MIMO detector.

A. Layer NT

The layer NT of the proposed architecture implements Step I of the proposed algorithm. The detailed VLSI architecture of the layer NT is shown in Fig. 10, which gets z4 and r44 as inputs and generates K parents of layer NT − 1 as the outputs.

B. Layer (NT − 1) ... Layer 2

In the layer NT − 1 through Layer 2, the corresponding entries of z and R, some control signals, and the K parents of the previous layer are the inputs and the K parents of the next layer will be generated as the outputs. In fact, these layers of the proposed architecture perform Step II of the proposed algorithm. The architecture of these layers is the same, which will be described in detail.

C. Layer 1

Finally, Layer 1 of the proposed architecture performs Step III of the proposed algorithm and announces the child

with the lowest PED with all its parents up to Layer NT as

the hard decision output s of the detector.

D. Multipliers

According to (3)–(7), there are four types of multiplication

in the proposed scheme, which are implemented as follows.

1) Fast Multiplier: The first type of multipliers is devised to perform L i = rii L i. This multiplication is time-intensive and is a part of the total critical path of the system. So, in order to improve the throughput of the system, an efficient architecture is proposed in Fig. 11. The basic core of the proposed architecture is a 2's-complement 4 × 4-bit Baugh–Wooley carry-save multiplier [Fig. 11(b)]. The final 16 × 16-bit fast multiplier is obtained by connecting a group of these basic cores together [Fig. 11(a)]. Also, in order to decrease the critical path of the multiplier, fine-grain pipelining is used in this block [red arrows in Fig. 11(b)], which results in a critical path of 1 ns and a throughput of 32 Gb/s for the multiplier in a 0.13-μm CMOS technology, instead of 7 ns and a throughput of 4.5 Gb/s for a conventional 16 × 16-bit multiplier in the same technology.

Fig. 11. (a) Proposed architecture of the fast multiplier. (b) Architecture of a 4 × 4-bit Baugh–Wooley carry-save multiplier.
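To make the role of the basic core more concrete, the following Python sketch models the partial-product formation of a two's-complement multiplier in the commonly used modified Baugh–Wooley form. It is a behavioral illustration only: it does not model the carry-save adder structure, the pipelining, or the way the 4 × 4-bit cores are combined into the 16 × 16-bit multiplier, and the function name `baugh_wooley_multiply` is an assumption of this example.

```python
def baugh_wooley_multiply(a, b, n=4):
    """Signed n x n-bit multiply from modified Baugh-Wooley partial products."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operands must fit in n-bit two's complement")
    # Python's >> is arithmetic, so these are the two's-complement bits.
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    total = 0
    for i in range(n):
        for j in range(n):
            bit = a_bits[i] & b_bits[j]
            # Partial products that involve exactly one sign bit are inverted.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            total += bit << (i + j)
    # Fixed correction constants at bit positions n and 2n - 1.
    total += (1 << n) + (1 << (2 * n - 1))
    total &= (1 << (2 * n)) - 1          # keep the 2n product bits
    if total >= 1 << (2 * n - 1):        # reinterpret as a signed value
        total -= 1 << (2 * n)
    return total
```

Because every partial product is a plain AND (or inverted AND) bit with a positive weight, the array can be summed with unsigned adders, which is what makes this form attractive for a regular carry-save layout.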

2) Constant Multiplier (CM): The second multiplier is devised to perform rij × R/I{sk}, which is the basic core of the remaining three types of multipliers. So an efficient low-area multiplier is proposed that is implemented by using simple shift and addition operations [Fig. 12(a)]. The key idea in the proposed multiplier comes from the fact that both operands are real valued and the value of R/I{sk} is always chosen from a constant set.

3) Real × Complex Multiplier (RCM): The third type of multipliers is designed to perform rii × si. After the QR decomposition, all the diagonal elements of R will be real and the other elements will be complex. So this multiplier is an RCM. The RCM is used to implement (5). It is implemented by a combination of two constant multipliers [see Fig. 12(b)].

4) Complex × Complex Multiplier: Finally, the fourth type of multipliers is designed to perform rij × sj, which is a CCM. According to (6), the CCM is used to implement L i s(i), shown in Fig. 12(c).
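The shift-and-add idea behind the CM, and the way the RCM and CCM can be composed from it, can be sketched as follows. This is an illustrative model, not the architecture of Fig. 12: it assumes that the constant set is the odd integers ±1, ±3, ±5, ±7 of an unnormalized 64-QAM constellation, that operands are integer (raw fixed-point) values, and that the CCM uses four constant multipliers. All function names are hypothetical.

```python
def constant_multiply(r, s):
    """Multiply integer r by s in {+-1, +-3, +-5, +-7} with shifts and adds."""
    mag = abs(s)
    if mag == 1:
        p = r
    elif mag == 3:
        p = (r << 1) + r      # 3r = 2r + r
    elif mag == 5:
        p = (r << 2) + r      # 5r = 4r + r
    elif mag == 7:
        p = (r << 3) - r      # 7r = 8r - r
    else:
        raise ValueError("s must be an odd integer in [-7, 7]")
    return -p if s < 0 else p

def real_complex_multiply(r, s_re, s_im):
    """Real r times complex s, built from two constant multipliers."""
    return constant_multiply(r, s_re), constant_multiply(r, s_im)

def complex_complex_multiply(r_re, r_im, s_re, s_im):
    """Complex r times complex s, here built from four constant multipliers."""
    re = constant_multiply(r_re, s_re) - constant_multiply(r_im, s_im)
    im = constant_multiply(r_re, s_im) + constant_multiply(r_im, s_re)
    return re, im
```

Each product needs at most one shift and one addition or subtraction, which is why no general-purpose multiplier is required for these terms.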

E. PED Calculation Block (PED Calc.)

This block calculates the PED based on (4), which is

implemented through a fully pipelined architecture in Fig. 13.

The middle subblocks in Fig. 13 implement the 1-norm calculation. Simulation results show that the difference between

the BER performance of 2-norm and 1-norm is negligible

[11]. Due to the lower complexity of the 1-norm, it is the

preferred approach for the implementation in our design.

Fig. 12. Proposed architectures of (a) CM, (b) RCM, and (c) CCM.

Fig. 13. Proposed VLSI architecture for the PED calculation block.

F. Li Calculation (Li Calc.)

According to (8) and (10), in order to find the FC, the value of s[0]l = Ll s(l) should be calculated. Also, based on (7), this value is used to calculate Ll s(l) for the PED calculation. Since different numbers of zi and rij are used to calculate L1, L2, and L3, a customized fully pipelined architecture is proposed for each of them in Fig. 14(a)–(c), respectively. These blocks perform Step II.1.a and Step III.1.a of the proposed algorithm.

G. FC Block

The proposed architecture for the FC block is shown

in Fig. 15(a), which performs Step I.1, Step II.1.b, and

Step III.1.b of the proposed algorithm. This is done by using

two mapper blocks and two limiter blocks, which are described

below.

1) Mapper: The main task of the mapper block is to round the real/imaginary part of Li (i.e., R/I{s[0]l}) to the nearest odd integer value, which can be done on the basis of the following equation:

$$\Re/\Im\{s_l^{[1]}\} = 2\left\lfloor \frac{\Re/\Im\{s_l^{[0]}\}}{2} + 1 \right\rfloor - 1 \qquad (13)$$

where ⌊·⌋ represents the truncation operation. The detailed architecture of the mapper block is shown in Fig. 15(b).

2) Limiter: If R/I{s[1]l} is outside the allowed boundary, it will be bounded by the limiter block to generate R/I{s[1]l}, which is the real/imaginary part of the FC [see Fig. 15(c)].
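The mapper and limiter can be illustrated with a short Python sketch. It assumes that truncation acts as a floor (as it does when fractional bits of a two's-complement number are dropped) and that the allowed boundary is ±7, i.e., an unnormalized 64-QAM constellation; the function names are hypothetical.

```python
import math

def map_to_nearest_odd(x):
    """Round a real value to the nearest odd integer: 2*floor(x/2 + 1) - 1."""
    return 2 * math.floor(x / 2 + 1) - 1

def limit(value, max_level=7):
    """Clamp an odd integer to the constellation boundary [-max_level, max_level]."""
    return max(-max_level, min(max_level, value))

def first_child(s0, max_level=7):
    """Map and limit the real and imaginary parts of the unconstrained estimate s0."""
    re = limit(map_to_nearest_odd(s0.real), max_level)
    im = limit(map_to_nearest_odd(s0.imag), max_level)
    return complex(re, im)
```

For example, an estimate of 2.9 − 0.5j maps to 3 − 1j, while an estimate far outside the constellation is pulled back to the nearest corner point by the limiter.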

Fig. 14. Proposed VLSI architecture of Li calculation block. (a) L1 calculation block. (b) L2 calculation block. (c) L3 calculation block.

Fig. 15. (a) Proposed architecture for the FC block. (b) Detailed architecture of the mapper block. (c) Detailed architecture of the limiter block.

H. CSE Enumeration

This block performs the CSE technique to generate the

Level-3 nodes of a parent. In fact, this block performs

Step I.4.b and Step II.1.e.ii of the proposed algorithm, which

is implemented using simple adders and subtractors.

I. NC Block

This block is used in all layers except the first and the last

layer. A fully pipelined VLSI architecture is proposed for the NC block, shown in Fig. 16. The NC block implements the

Step II.1.c and the Step II.1.d using adders, subtractors, and

norm calculation blocks. Also, it implements the Step II.1.e

and the Step II.1.f of the proposed algorithm using the upper

right corner blocks and lower half part of Fig. 16, respectively.

J. Sorters

The proposed VLSI architecture includes four types of

sorters, which are described below.

1) Sorter 1: This sorter performs the Step II.2 and Step III.2

of the proposed algorithm. Sorter 1 proposed in [7] is used in

all of the layers except the Layer NT.

2) Sorter 2 and Shifter: This block implements the Step II.3 of the proposed algorithm by using Sorter 2, which is the same as the Sorter 1, and a shifter. This block is used in all layers

except the first and the last.

3) Sorter 3: The main task of Sorter 3 is to perform the

Step I.4.a of the proposed algorithm. A feedforward and

pipelinable VLSI architecture is proposed for Sorter 3, which

is shown in Fig. 17.

4) Sorter NC: This sorter is used in the NC block. In fact, Sorter NC is a modified version of the Sorter 3, which is customized for a different number of inputs. Note that the only difference between the Sorter NC blocks in Fig. 16 is in the number of inputs.

Fig. 16. Detailed VLSI architecture of the NC block.

Fig. 17. Proposed VLSI architecture of Sorter 3.

It is worth noting that the proposed feedforward/fully

pipelined architecture can be easily applied to the multicarrier

scenarios. In fact, the proposed MIMO detector is applied on

each carrier separately, so each subsequent carrier can be fed

to the proposed MIMO detector in a pipelined fashion. This can be done through a simple hardware wrapper sitting next to the proposed core. While it is assumed that the channel is perfectly known to the receiver, the proposed algorithm can be used under different channel conditions when used along

with a channel estimator providing the estimate of the current

channel status.

VI. COMPLEXITY ANALYSIS

A. ASIC Implementation

The proposed VLSI architecture was modeled in Verilog HDL using ModelSim, synthesized using Synopsys Design Vision in 0.13-μm and 90-nm CMOS technologies, and placed and routed using Cadence SoC Encounter. The chip boundary and final graphic database system (GDS) stream-out were performed using Cadence Virtuoso. The golden fixed-point MATLAB model was used to validate the register-transfer-level (RTL) and gate-level netlists. The final verified ASIC core was fabricated in 0.13-μm IBM 1P/8M CMOS technology using Artisan standard library cells. A micrograph of the die for the design is shown in Fig. 18, which was packaged in a CFP120 package. The fabricated design was tested using an Agilent (Verigy) 93000 SoC high-speed digital tester and a Temptronic TP04300 thermal forcing unit. The test setup consisted of the 93K system-on-chip (SoC) tester, the thermal forcing unit, and a load board holding the device under test. The nominal supply

voltage supplied to the core is 1.2 V, while the I/O voltage is 2.7 V. The operation of the chip was verified by passing the input vectors at different SNR values to the chip through the tester and comparing the detector outputs with the expected values from the bit-true simulations in both MATLAB and ModelSim. Finally, an at-speed test was run on the chip and the outputs were compared against the expected bit stream generated by the MATLAB simulations.

TABLE II

DESIGN COMPARISON OF THE CURRENT ASIC IMPLEMENTATIONS FOR 4 × 4 MIMO DETECTORS

Reference | JSAC 2006 [1] | TCAS-II 2010 [14] | DATE 2009 [15] | TVLSI 2007 [6] | TVLSI 2010 [16] | JSSC 2010 [17] | JSSC 2011 [18] | TVLSI 2011 [19] | TVLSI 2011 [7] | This work
Modulation | 16-QAM | 16-QAM | 64-QAM | 64-QAM | 64-QAM | (4-64)QAM | 64-QAM | (16-64)QAM | 64-QAM | 64-QAM
Antenna | 4 × 4 | 4 × 4 | 4 × 4 | 4 × 4 | 4 × 4 | 4 × 4 – 8 × 8 | 4 × 4 | 4 × 4 | 4 × 4 | 4 × 4
Method | K-Best | SISO-SD | Sys. Like detection | K-Best | K-Best | MBF-FD (SD) | SISO MMSE-PIC | MMF-LSD | K-Best | Modified K-Best
Domain | Real | Complex | Complex | Complex | Real | Complex | Complex | Real | Real | Complex
K-Value | 5 | N/A | N/A | 64 | 5-64 | N/A | N/A | N/A | 10 | 10
Process | 0.35 μm | 90 nm | 45 nm | 0.13 μm | 65 nm | 0.13 μm | 90 nm | 0.18 μm | 0.13 μm | 0.13 μm
fmax (MHz) | 100 | 250 | 574.7 | 270 | 158 | 198 | 568 | 250 | 282 | 417
Throughput (Mb/s) | 54 (145^a) | 90 (62^a) | 215 (74^a) | 100 | 732-100 (366-50^a) | 285-431 | 757 (524^a) | 31.7-146.3 (44-202^a) | 675 | 1000
Gate count | 91 kG | 96 kG | 33.1 kG | 5270 kG | 1760 kG | 350 kG | 410 kG | 25.4-48.2 kG | 114 kG | 340 kG
NHE^b | 0.63^a | 1.6^a | 0.45^a | 52.7 | 4.81-35.2^a | 1.23-0.81 | 0.78^a | 0.58-0.24^a | 0.17 | 0.34
Energy/bit | 594 pJ/b | N/A | N/A | 8470 pJ/b | N/A | N/A | 250 pJ/b | N/A | 200 pJ/b | 110 pJ/b
Power (mW) | 626 | N/A | N/A | 847 | 165 | 57-74 | 189.1 | 57-90 | 135 | 1700
Latency (μs) | 2.4 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.6 | 0.36
Hard/soft | Hard | Soft | Soft | Soft | Hard | Soft | Soft | Soft | Hard | Hard

^a Technology scaling from S1 to 0.13-μm CMOS process assuming tpd2 = tpd1 / (S1 (nm) / 130 (nm)), with fmax ∝ 1/tpd.
^b Normalized hardware efficiency (kG/(Mb/s)).

Fig. 18. Photograph of the die (regions labeled Layer 1 through Layer 4).

The final measured BER performance result of the proposed approach is similar to that in [1] and [11]. Thus the major difference between all of these schemes, including the one in this paper, is the way the detection algorithm is implemented, which translates to different throughput and/or hardware complexity. It is shown that the proposed algorithm is implemented using a feedforward architecture. In our proposed architecture, the critical paths of the subblocks such as the 16 × 16-bit fast multiplier and the PED Calc. block are reduced by applying the pipelining technique. According to the proposed algorithm, K-Best candidates of each layer of the architecture are generated in K clock cycles, which increases the throughput of the system.
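As a rough consistency check of the reported figures, the throughput can be estimated under the assumption that, in steady state, the pipelined detector completes one transmitted vector every K clock cycles. This relation is an inference from the statement above rather than a formula given in the paper, and the function name `detector_throughput_mbps` is hypothetical.

```python
import math

def detector_throughput_mbps(f_clk_mhz, n_tx=4, qam_order=64, k=10):
    """Estimated throughput in Mb/s, assuming one vector per k clock cycles."""
    bits_per_vector = n_tx * int(math.log2(qam_order))  # 4 * 6 = 24 bits
    return bits_per_vector * f_clk_mhz / k
```

With these assumptions, a 417-MHz clock gives about 1000 Mb/s and a 150-MHz clock gives 360 Mb/s, which agrees with the ASIC and FPGA throughputs reported in this paper.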

TABLE III

DEVICE UTILIZATION IN THE FPGA PLATFORM

Resource | Slices | LUTs | Reg. | LUT-FF pairs | DSP48Es
Available | 14720 | 58880 | 58880 | 51869 | 640
Used | 13467 | 46160 | 36912 | 31203 | 8
Utilization | 91% | 78% | 62% | 60% | 1%

The comparison between the proposed complex MIMO detector and the recently proposed MIMO detectors in the complex and the real domains that are reported in the literature is shown in Table II. This comparison shows that the proposed scheme has the same performance but higher throughput, lower area, lower energy, and lower latency compared to all the reported complex-domain VLSI realizations. Also, the proposed design has higher throughput, less latency, and less energy than the distributed K-Best algorithm in [7], which is one of the most efficient real-domain MIMO detectors. Needless to say, the proposed design has a larger core area than the one in [7], which is related to the nature of the complex-domain implementation and extra resources for the complex-domain calculations.

In order to perform a fair comparison, a normalized hardware efficiency (NHE) is defined, which includes the core area of the design (i.e., gate count) and the corresponding scaled throughput in the same technology for all designs. So

$$\mathrm{NHE}\ \bigl(\mathrm{kG}/(\mathrm{Mb/s})\bigr) = \frac{\text{core area (kG)}}{\text{scaled throughput (Mb/s)}}.$$

Table II shows that the proposed design has the lowest NHE among all of the complex-domain MIMO detectors. Moreover, the proposed scheme has a lower NHE than all the real-domain implementations except [7].
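The NHE entries of Table II can be recomputed with a few lines of Python. The sketch below assumes the linear technology scaling of the table footnote (throughput scales with S1/130 when moving from an S1-nm process to 0.13 μm); the function names are hypothetical.

```python
def scale_throughput_to_130nm(throughput_mbps, process_nm):
    """Scale a throughput to 0.13-um CMOS, assuming delay proportional to feature size."""
    return throughput_mbps * process_nm / 130.0

def nhe(gate_count_kg, throughput_mbps, process_nm=130.0):
    """Normalized hardware efficiency in kG/(Mb/s); lower is better."""
    return gate_count_kg / scale_throughput_to_130nm(throughput_mbps, process_nm)
```

For instance, the 0.35-μm design of [1] (91 kG, 54 Mb/s) scales to about 145 Mb/s and an NHE of about 0.63, while the proposed design (340 kG, 1000 Mb/s in 0.13 μm) has an NHE of 0.34.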

Fig. 19. K-Best versus ML BER performance for different values of K in both real and complex domains for a 4 × 4, 64-QAM MIMO detector. (Plot of BER versus SNR; curves: K-Best complex with K = 7, 11, and 15, K-Best real with K = 10, and ML.)

Moreover, the proposed scheme is implemented in the FPGA platform. The synthesis results and the required resources for the 4 × 4, 64-QAM MIMO detector using the proposed scheme are shown in Table III. The target device is the Virtex-5 FPGA from Xilinx, i.e., XC5VSX95T-2FF1136. On the FPGA platform, the throughput of 360 Mb/s at 150 MHz is achieved for the proposed design.

VII. SIMULATION RESULTS

In theory, the K-Best algorithm might miss the hard-ML point and might have performance loss as a result. However, by a proper choice of K, the BER performance of the K-Best method approaches the optimal case for a reasonable range of SNR values. Since the proposed algorithm is based on the K-Best algorithm, it is necessary to choose a proper value for K. Fig. 19 shows the BER performance curves for a 4 × 4, 64-QAM MIMO system using the proposed scheme versus the ML detector. It reveals the behavior of the proposed algorithm for different values of K. It is seen that by increasing the value of K, the performance result becomes close to that of the ML detection. However, a higher K value results in more hardware complexity. For instance, in the proposed algorithm for a 4 × 4, 64-QAM MIMO system, K = 15 results in an ML-like result while K = 8 comes with less diversity in high-SNR regimes (Fig. 19). The performance of the K-Best scheme with K = 10 is close to the ML while having a moderate complexity. Thus K = 10 is chosen as the framework for the hardware implementation in this paper.

Moreover, the word-length effect is another important parameter that affects the final BER of the algorithm and also the hardware complexity. Let us consider a pair (W, F) for each variable of the algorithm, which represents the total word length and fractional length, respectively. We consider three different cases for (W, F): small values, medium values, and large values. These values are listed in Table IV. The BER comparison of these cases in the fixed-point domain and also the floating-point simulation results of the proposed algorithm as well as ML detection are shown in Fig. 20. Simulation results show that choosing a large value for W and F results in less performance loss, but a larger and more complicated hardware is needed ("best" in Fig. 20). More truncation of parameters results in lower hardware complexity but a larger BER and performance loss ("bad" in Fig. 20). Medium hardware complexity and BER are obtained by choosing the medium set of (W, F), denoted by "optimized" in Fig. 20, which is chosen in this paper.

TABLE IV

FIXED-POINT VARIABLES WITH THREE SETS OF (W, F)

Set | rij | rii | zi | si | Li | PED
Best | (34, 30) | (31, 28) | (34, 25) | (4, 0) | (34, 25) | (34, 31)
Optimized | (16, 12) | (13, 10) | (16, 7) | (4, 0) | (16, 7) | (16, 13)
Bad | (10, 6) | (9, 6) | (12, 3) | (4, 0) | (11, 2) | (10, 7)

Fig. 20. BER performance of the proposed algorithm in both fixed/floating point domains for different values of (W, F) versus ML for a 4 × 4, 64-QAM MIMO detector with K = 10.
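The effect of a (W, F) choice on a single variable can be illustrated with a simple quantizer. The sketch below assumes a signed two's-complement format with truncation of the extra fractional bits and saturation on overflow; the paper does not state the rounding or overflow behavior, so these are assumptions of the example, as is the function name `quantize`.

```python
import math

def quantize(x, w, f):
    """Quantize x to a signed fixed-point value with w total bits and f fractional bits."""
    scale = 1 << f
    raw = math.floor(x * scale)            # truncate the extra fractional bits
    raw_max = (1 << (w - 1)) - 1
    raw_min = -(1 << (w - 1))
    raw = max(raw_min, min(raw_max, raw))  # saturate on overflow
    return raw / scale
```

Using the Li formats of Table IV, the value 1.2345 becomes 1.234375 with the "optimized" setting (16, 7) but only 1.0 with the "bad" setting (11, 2), which gives a feel for how the shorter word lengths degrade the PED values.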

Finally, it is necessary to verify the proposed idea of the limited number of column/row SE enumerations and compare the BER of the proposed algorithm with the exact K-Best algorithm and also ML detection. In other words, the chosen values of CSE_Num and RSE_Num (i.e., CSE_Num = 3, RSE_Num = 4) as well as their effects on the BER of the proposed algorithm should be confirmed. According to the proposed algorithm, the values of CSE_Num and RSE_Num affect the complexity of the architecture and the BER performance. Larger values result in a larger chip area and better BER performance.

There are two strategies for the value of CSE_Num and RSE_Num. The first strategy is the relaxed SE enumeration scheme, where the number of column/row SE enumerations is not limited ("relaxed SE" in Fig. 21). In fact, the BER of this strategy is the same as the BER of the exact K-Best algorithm [13]. The second strategy is our proposed limited SE enumeration scheme, where the number of column/row SE enumerations is limited (i.e., CSE_Num = 3, RSE_Num = 4). This scheme is denoted by "Limited SE" in Fig. 21. The simulation results of these strategies versus ML detection are shown in Fig. 21. We can see that the difference between the BER of these two schemes is negligible, which confirms that the proposed algorithm can achieve the same BER as the exact K-Best algorithm.

Fig. 21. Effect of the proposed limited SE enumeration idea on the BER performance of the proposed algorithm versus the exact K-Best algorithm and the ML for a 4 × 4, 64-QAM MIMO detector with K = 10.

The above simulation results are for a single-carrier 4 × 4, 64-QAM MIMO system. The simulations for the BER curves were performed for 100 000 packets, where each packet consists of 4 × 6 × 4 = 96 bits (9.6 Mbits in total) for a 4 × 4, 64-QAM MIMO system. Test vectors are generated using: 1) pseudorandom data; 2) a complex-valued random Gaussian channel matrix H with statistically independent elements updated every four channel uses; and 3) additive white Gaussian (circularly symmetric) complex random noise.
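A test-vector generator along these lines can be sketched with the Python standard library. This is an illustrative setup, not the authors' MATLAB model: it assumes unnormalized 64-QAM levels ±1 to ±7 drawn directly at the symbol level, unit-variance channel entries, an arbitrary noise variance, and one channel matrix per packet of four channel uses. The names `complex_gaussian` and `make_packet` are hypothetical.

```python
import random

LEVELS = (-7, -5, -3, -1, 1, 3, 5, 7)  # assumed 64-QAM levels per dimension

def complex_gaussian(rng, variance=1.0):
    """Draw a circularly symmetric complex Gaussian sample with the given variance."""
    sigma = (variance / 2.0) ** 0.5
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))

def make_packet(rng, n_t=4, n_r=4, uses_per_packet=4, noise_var=0.1):
    """Return (H, [(s, y), ...]) for one packet; H is held fixed over the packet."""
    h = [[complex_gaussian(rng) for _ in range(n_t)] for _ in range(n_r)]
    vectors = []
    for _ in range(uses_per_packet):
        # Pseudorandom 64-QAM symbols, one per transmit antenna.
        s = [complex(rng.choice(LEVELS), rng.choice(LEVELS)) for _ in range(n_t)]
        # Received vector y = H s + n with additive white Gaussian noise.
        y = [sum(h[r][t] * s[t] for t in range(n_t)) + complex_gaussian(rng, noise_var)
             for r in range(n_r)]
        vectors.append((s, y))
    return h, vectors
```

Each packet produced this way carries 4 vectors × 4 symbols × 6 bits = 96 bits, matching the packet size used in the simulations.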

VIII. CONCLUSION

A novel detection algorithm with an efficient architecture featuring efficient operation over infinite complex-domain lattices has been proposed. The proposed design is scalable both in terms of the number of transmit antennas and the constellation order. Efficient implementation of the subblocks results in the design with the highest throughput and the lowest area and energy consumption in the literature to date. The proposed design was implemented on both the FPGA and ASIC platforms. In the ASIC implementation, the proposed hard-output detector provided a sustained throughput of 1 Gb/s with a core area of 340 kgates in a 0.13-μm CMOS process. Synthesis results in 90-nm CMOS show a potential throughput of 1.5 Gb/s.

REFERENCES

[1] Z. Guo and P. Nilsson, "Algorithm and implementation of the K-best sphere decoding for MIMO detection," IEEE J. Sel. Areas Commun., vol. 24, no. 3, pp. 491–503, Mar. 2006.

[2] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, "Closest point search in lattices," IEEE Trans. Inf. Theory, vol. 48, no. 8, pp. 2201–2214, Aug. 2002.

[3] K. W. Wong, C. Y. Tsui, R. S. K. Cheng, and W. H. Mow, "A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels," in Proc. IEEE Int. Symp. Circuits Syst., vol. 3, May 2002, pp. 273–276.

[4] B. M. Hochwald and S. T. Brink, "Achieving near-capacity on a multiple-antenna channel," IEEE Trans. Commun., vol. 51, no. 3, pp. 389–399, Mar. 2003.

[5] H.-L. Lin, R. C. Chang, and H. Chan, "A high-speed SDM-MIMO decoder using efficient candidate searching for wireless communication," IEEE Trans. Circuits Syst. II, vol. 55, no. 3, pp. 289–293, Mar. 2008.

[6] S. Chen, T. Zhang, and Y. Xin, "Relaxed K-best MIMO signal detector design and VLSI implementation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 3, pp. 328–337, Mar. 2007.

[7] M. Shabany and P. G. Gulak, "A 675 Mb/s, 4 × 4 64-QAM K-best MIMO detector in 0.13 µm CMOS," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 1, pp. 135–147, Jan. 2012.

[8] P. A. Bengough and S. J. Simmons, "Sorting-based VLSI architecture for the M-algorithm and T-algorithm trellis decoders," IEEE Trans. Commun., vol. 43, no. 2/3/4, pp. 514–522, Mar. 1995.

[9] C. P. Schnorr and M. Euchner, "Lattice basis reduction: Improved practical algorithms and solving subset sum problems," Math. Programm., vol. 66, nos. 1–3, pp. 181–191, 1994.

[10] B. Kim and I. C. Park, "K-best MIMO detection based on interleaving of distributed sorting," Electron. Lett., vol. 44, no. 1, pp. 42–43, Jan. 2008.

[11] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, "K-best MIMO detection VLSI architectures achieving up to 424 Mb/s," in Proc. IEEE Int. Symp. Circuits Syst., May 2006, pp. 1151–1154.

[12] M. Shabany and P. G. Gulak, "A 0.13 µm CMOS, 655 Mb/s, 64-QAM, K-best 4 × 4 MIMO detector," in Proc. IEEE Int. Solid State Circuits Conf., Feb. 2009, pp. 256–257.

[13] M. Shabany, K. Su, and P. G. Gulak, "A pipelined high-throughput implementation of near-optimal complex K-best lattice decoders," in Proc. IEEE Int. Conf. Acoust., Speech, Signal, Apr. 2008, pp. 3173–3176.

[14] E. M. Witte, F. Borlenghi, G. Ascheid, R. Leupers, and H. Meyr, "A scalable VLSI architecture for soft-input soft-output single tree-search sphere decoding," IEEE Trans. Circuits Syst. II, vol. 57, no. 9, pp. 706–710, Sep. 2010.

[15] P. Bhagawat, R. Dash, and G. Choi, "Systolic like soft-detection architecture for 4 × 4 64-QAM MIMO system," in Proc. IEEE Design, Autom. Test Eur. Conf. Exhibit., Jun. 2009, pp. 870–873.

[16] S. Mondal, A. Eltawil, C. Shen, and K. Salama, "Design and implementation of a sort-free K-best sphere decoder," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 10, pp. 1497–1501, Oct. 2010.

[17] C. Liao, T. Wang, and T. Chiueh, "A 74.8 mW soft-output detector IC for 8 × 8 spatial-multiplexing MIMO communications," IEEE J. Solid State Circuits, vol. 45, no. 2, pp. 411–421, Feb. 2010.

[18] C. Studer, S. Fateh, and D. Seethaler, "ASIC implementation of soft-input soft-output MIMO detection using MMSE parallel interference cancellation," IEEE J. Solid State Circuits, vol. 46, no. 7, pp. 1754–1765, Jul. 2011.

[19] M. Myllylä, J. Cavallaro, and M. Juntti, "Architecture design and implementation of the metric first list sphere detector algorithm," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 5, pp. 895–899, May 2011.

Mojtaba Mahdavi received the M.Sc. degree in electrical engineering from the Sharif University of Technology, Tehran, Iran, in 2010.

He was with the Advanced Integrated Circuit Design Laboratory, Sharif University of Technology, from 2010 to 2012. Currently, he is working on implementation of the Long Term Evolution (LTE-Advanced) System. His current research interests include digital VLSI architectures for digital signal processing algorithms, VLSI communication systems, digital integrated circuit design, and field-programmable gate array-based systems.

Mr. Mahdavi was on the subcommittee for the International Solid-State Circuits Conference from 2005 to 2008.

Mahdi Shabany received the B.Sc. degree in electrical engineering from the Sharif University of Technology, Tehran, Iran, in 2002, and the M.Sc. and Ph.D. degrees in electrical engineering from the University of Toronto, Toronto, ON, Canada, in 2004 and 2008, respectively.

He is an Assistant Professor with the Electrical Engineering Department, Sharif University of Technology. From 2007 to 2008, he was with Redline Communications Company, Toronto, where he developed and patented designs for WiMAX systems. He served as a Post-Doctoral Fellow with the University of Toronto in 2009. He holds two U.S. patents. His current research interests include digital electronics and VLSI architecture and algorithm design for broadband communication systems.
