
    834 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 5, MAY 2013

Novel MIMO Detection Algorithm for High-Order Constellations in the Complex Domain

    Mojtaba Mahdavi and Mahdi Shabany

Abstract: A novel detection algorithm with an efficient VLSI architecture featuring efficient operation over infinite complex lattices is proposed. The proposed design results in the highest throughput, the lowest latency, and the lowest energy compared to the complex-domain VLSI implementations to date. The main innovations are a novel complex-domain means of expanding/visiting the intermediate nodes of the search tree on demand, rather than exhaustively, as well as a new distributed sorting scheme to keep track of the best candidates at each search phase. Its support of unbounded infinite lattice decoding distinguishes the present method from previous K-Best strategies and also allows its complexity to scale sublinearly with the modulation order. Since the expansion and sorting cores are data-driven, the architecture is well suited for a pipelined parallel VLSI implementation. The proposed algorithm is used to fabricate a 4 × 4, 64-QAM complex multiple-input multiple-output detector in a 0.13-μm CMOS technology, achieving a clock rate of 417 MHz with a core area of 340 kgates. The chip test results prove that the fabricated design can sustain a throughput of 1 Gb/s with an energy efficiency of 110 pJ/bit, the best numbers reported to date.

Index Terms: Complex-domain detection, K-Best detectors, LTE/WiMAX systems, multiple-input multiple-output (MIMO) detector.

    I. INTRODUCTION

MULTIPLE-INPUT MULTIPLE-OUTPUT (MIMO) systems have the potential of achieving high spectral efficiency, high data rate, and a robust wireless link, with an acceptable implementation complexity in wireless systems. MIMO technology has already been included in many wireless communication standards, such as the long-term evolution project, IEEE 802.16e, and IEEE 802.16m. The design of low-complexity, low-energy, high-performance, and high-throughput receivers is the key challenge in the design of any MIMO receiver. Several MIMO detection algorithms have been proposed to address this challenge, which offer various tradeoffs between the performance and the computational complexity.

Among the large variety of MIMO detection techniques, maximum-likelihood (ML) detection is the optimum detection method and minimizes the bit error rate (BER). But its computational complexity grows exponentially with the number of transmit antennas. On the other hand, linear

Manuscript received November 11, 2011; revised February 2, 2012; accepted April 4, 2012. Date of publication May 15, 2012; date of current version April 22, 2013.

The authors are with the Electrical Engineering Department, Sharif University of Technology, Tehran 14174, Iran (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TVLSI.2012.2196296

detection methods such as the zero-forcing or the minimum mean squared error (MMSE) have lower complexity with a poor BER performance. Ordered successive interference cancellation (SIC) algorithms such as the vertical Bell Laboratories layered space-time algorithm are employed in another category of detectors. These algorithms have better performance than the linear detection ones but their BER performance is not acceptable.

Finally, as a tradeoff between complexity and performance loss, a large category of detection algorithms has been proposed, which includes the depth-first and the breadth-first search algorithms. The well-known depth-first strategy is the sphere decoder (SD) [1], which guarantees the optimal performance in the case of unlimited execution time [2]. But the intrinsic variable throughput results in extra overhead in the hardware and significantly lower data rates in the lower signal-to-noise ratio (SNR) regimes. Among the breadth-first search methods, the most well-known approach is the K-Best algorithm (a.k.a. M-algorithm) [3]. The K-Best detector guarantees an SNR-independent fixed-throughput detection scheme with a performance close to that of ML. Although K-Best detectors are attractive for VLSI implementations, there are still some challenges, such as an efficient sorting and expansion scheme, in order to pave the way to achieve high throughputs.

II. SYSTEM MODEL

Consider a MIMO system with $N_T$ transmit and $N_R$ receive antennas. The equivalent baseband model of the Rayleigh fading channel between the transmitter and the receiver is described by a complex-valued $N_R \times N_T$ channel matrix $H$. There are two models for such MIMO systems, namely, the complex equivalent model and the real equivalent model. In this paper, we consider the complex-domain framework. However, the proposed scheme can be easily tailored for the real equivalent model. The complex baseband equivalent model can be expressed as

$$y = Hs + v \tag{1}$$

where $s = [s_1, s_2, \ldots, s_{N_T}]^T$ is the $N_T$-dimensional complex transmit signal vector, in which each element is independently drawn from a complex constellation $\mathcal{O}$ (symmetric M-QAM schemes with $\log_2 M$ bits per symbol, i.e., $|\mathcal{O}| = M$), $y = [y_1, y_2, \ldots, y_{N_R}]^T$ is the $N_R$-dimensional received symbol vector, and $v = [v_1, v_2, \ldots, v_{N_R}]^T$ represents the $N_R$-dimensional independent identically distributed circularly symmetric complex zero-mean Gaussian noise vector with variance $\sigma^2$, i.e., $v_i \sim \mathcal{N}_c(0, \sigma^2)$. The real equivalent model can also be derived using a simple real-valued decomposition technique [1].

1063-8210/$31.00 © 2012 IEEE

The objective of the MIMO detection method is to find the closest lattice point $\hat{s}$ for a given received signal $y$:

$$\hat{s} = \arg\min_{s \in \mathcal{O}^{N_T}} \|y - Hs\|^2. \tag{2}$$

In this paper, a novel MIMO detection algorithm is proposed to solve the above problem in the complex domain with linear complexity. Since the proposed algorithm is based on the K-Best algorithm, this algorithm is first briefly described in the following.

    A. K-Best Algorithm

Consider the problem in (2), and let us denote the QR decomposition of the channel matrix as $H = QR$, where $Q$ is a unitary $N_R \times N_T$ matrix and $R$ is an upper triangular $N_T \times N_T$ matrix. Performing the nulling operation by $Q^H$ results in $z = Q^H y = Rs + w$, where $w = Q^H v$. Since the nulling matrix is unitary, the noise $w$ remains spatially white after the nulling. Exploiting the triangular nature of $R$, (2) can be expanded as

$$\hat{s} = \arg\min_{s \in \mathcal{O}^{N_T}} \sum_{i=1}^{N_T} \Big| z_i - \sum_{j=i}^{N_T} r_{ij} s_j \Big|^2. \tag{3}$$

The above problem can be thought of as a tree-search problem with $N_T$ levels, where, starting from the last row, one symbol is detected and, based on that, the next symbol in the upper row is detected, and so on. Thus starting from $i = N_T$, (3) can be evaluated in an iterative manner as follows:

$$T_i(s^{(i)}) = T_{i+1}(s^{(i+1)}) + |e_i(s^{(i)})|^2 \tag{4}$$

$$e_i(s^{(i)}) = z_i - \sum_{j=i}^{N_T} r_{ij} s_j = L_i(s^{(i)}) - r_{ii} s_i \tag{5}$$

$$L_i(s^{(i)}) = z_i - \sum_{j=i+1}^{N_T} r_{ij} s_j \tag{6}$$

$$\bar{L}_i(s^{(i)}) = L_i(s^{(i)}) / r_{ii} \tag{7}$$

where $s^{(i)} = [s_i, s_{i+1}, \ldots, s_{N_T}]^T$, $T_i(s^{(i)})$ is the accumulated partial Euclidean distance (PED) with $T_{N_T+1}(s^{(N_T+1)}) = 0$, and $|e_i(s^{(i)})|^2$ denotes the distance increment between two successive nodes/levels in the tree.
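The recursion in (4) to (6) can be checked numerically. The following is a minimal sketch, assuming NumPy and a square 4 × 4 channel; the function name `accumulated_ped` and the test data are illustrative and not part of the paper. It accumulates the PED from level $N_T$ down to level 1 and compares the result with the direct metric $\|y - Hs\|^2$ from (2).

```python
import numpy as np

def accumulated_ped(z, R, s):
    """Evaluate (4)-(6) from level N_T down to level 1 (0-based arrays)."""
    n_t = len(s)
    T = 0.0                                    # T_{N_T+1} = 0
    for i in range(n_t - 1, -1, -1):
        L_i = z[i] - R[i, i + 1:] @ s[i + 1:]  # (6): cancel already-decided symbols
        e_i = L_i - R[i, i] * s[i]             # (5): residual at level i
        T += abs(e_i) ** 2                     # (4): PED update
    return T

# Illustrative data (assumed): random 4x4 channel, 16-QAM symbols, mild noise
rng = np.random.default_rng(0)
n = 4
H = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
Q, R = np.linalg.qr(H)
s = rng.choice([-3, -1, 1, 3], size=n) + 1j * rng.choice([-3, -1, 1, 3], size=n)
v = 0.1 * (rng.normal(size=n) + 1j * rng.normal(size=n))
y = H @ s + v
z = Q.conj().T @ y                             # nulling by Q^H

ped = accumulated_ped(z, R, s)
direct = np.linalg.norm(y - H @ s) ** 2        # metric of (2)
```

For a square channel the two values agree up to rounding, because multiplying by the unitary $Q^H$ preserves the Euclidean norm.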

Based on the above model, starting from level $i = N_T$, the K-Best algorithm expands each of the $K$ existing nodes in each level to $M$ new possible children in $\mathcal{O}$ and calculates their updated PED. It then sorts the $KM$ produced nodes and selects the $K$ best nodes with the lowest PED as the surviving nodes in the next level. The path with the lowest PED at the first level of the tree is the hard decision output of the detector. There are two main computational cores in the above algorithm, which are discussed in the following.

1) Expansion: According to the K-Best algorithm in the complex domain, in each level, $K$ (parents of each level) $\times$ $M$ (children per parent) children should be enumerated, which results in a large complexity. The current expansion schemes in the real domain such as the phase shift keying (PSK) enumeration [11], the base-centric search methodology [5], and the relaxed K-Best enumeration scheme based on PSK enumeration [6] are compared in [7]. Although these schemes can be applied to the complex domain, they do not scale linearly with the constellation size (such as in [11]) and/or have performance loss compared to the exact K-Best implementation (such as in [5] and [6]). To address this challenge, an efficient complex-domain expansion method called the on-demand expansion scheme is proposed in this paper, which provides all the information required for the exact K-Best implementation in the complex domain with no performance degradation while avoiding the exhaustive enumeration of the children. The computational complexity of the proposed scheme is independent of the constellation size.

2) Sorting: In each level of the K-Best algorithm in the complex domain, $KM$ children should be sorted. In [7] and [8], most of the sorting schemes such as bubble sorting [3], which is a sorting method based on the Schnorr-Euchner (SE) ([2], [9]) technique, and a distributed sorting scheme [6], [10] are compared. But some of them are time-intensive for large values of $K$ and $M$ ([3], [11]) or have performance loss ([6], [10]). The most efficient sorting scheme is proposed in [7], which is used in this paper. The distributed sorter works for any value of $K$ and $M$ with no performance loss. Also, its complexity is independent of the constellation size and scales linearly with the value of $K$.¹
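For reference, the baseline K-Best search with exhaustive expansion and a full sort of the $KM$ children, i.e., the two cores discussed above before any on-demand optimization, can be sketched as follows. This is an illustrative software model under stated assumptions (NumPy, a noise-free 4 × 4 example, the hypothetical name `k_best_detect`), not the architecture proposed in the paper.

```python
import numpy as np

def k_best_detect(z, R, constellation, K):
    """Breadth-first K-Best search: expand all K*M children, sort, keep K."""
    n_t = R.shape[0]
    survivors = [(0.0, [])]            # (PED, symbols for levels i+1..N_T)
    for i in range(n_t - 1, -1, -1):
        children = []
        for ped, tail in survivors:
            # (6): remove the contribution of the already-decided symbols
            L_i = z[i] - sum(R[i, i + 1 + j] * sym for j, sym in enumerate(tail))
            for c in constellation:    # exhaustive expansion, M children per parent
                e_i = L_i - R[i, i] * c
                children.append((ped + abs(e_i) ** 2, [c] + tail))
        children.sort(key=lambda node: node[0])   # full sort of the K*M children
        survivors = children[:K]                  # K best survive to the next level
    best_ped, best_path = survivors[0]
    return np.array(best_path), best_ped

# Illustrative, noise-free example (assumed data): 4x4 channel, 16-QAM, K = 10
rng = np.random.default_rng(1)
qam16 = [a + 1j * b for a in (-3, -1, 1, 3) for b in (-3, -1, 1, 3)]
H = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
Q, R = np.linalg.qr(H)
s_true = rng.choice(qam16, size=4)
z = Q.conj().T @ (H @ s_true)
s_hat, ped = k_best_detect(z, R, qam16, K=10)
```

In this noise-free setting the transmitted path has a PED of essentially zero at every level, so it always survives and `s_hat` equals `s_true`. The cost of the inner loops grows with $KM$ per level, which is the overhead the on-demand expansion is meant to avoid.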

Due to the intrinsic challenges in the implementation in the complex domain, most of the MIMO detection algorithms in the literature have been proposed for the real domain. However, on account of the deeper search tree, the real-domain implementation results in a larger silicon area and a larger latency. Nevertheless, a high-throughput MIMO detector in the complex domain with an acceptable complexity for high-order constellations has always been a challenge in the literature. To address this challenge, in this paper a high-throughput detection algorithm along with its VLSI architecture for a 4 × 4, 64-QAM complex MIMO detector is proposed, which is scalable to higher order constellation schemes such as 256-QAM and to a larger number of antennas (i.e., $N_T > 4$).

    III. COMPLEX SE ENUMERATION

The main challenge in developing the complex-domain enumerator lies in devising a means of iteratively enumerating the elements of the complex constellation in the order of increasing squared distance from the unconstrained value, i.e., the PED value. In this paper, a novel complex SE enumeration scheme is proposed to enumerate the complex constellation points in the order of nondecreasing PED.

    A. First Child (FC)

Based on the complex version of the system model in (4), the FC of a node in $\mathcal{K}_{l+1}$, denoted $s_l^{[1]}$, is the one that minimizes $|e_l(s^{(l)})|$.² In other words

$$s_l^{[1]} = \arg\min_{s_l \in \mathcal{O}} |e_l(s^{(l)})|^2 = \arg\min_{s_l \in \mathcal{O}} |L_l(s^{(l)}) - r_{ll} s_l|^2$$

$$= \arg\min_{\Re(s_l) \in \Omega} \Big| \underbrace{\Re\big[L_l(s^{(l)})/r_{ll}\big]}_{u_l^R} - \Re(s_l) \Big|^2 \tag{8}$$

$$+ \, j \arg\min_{\Im(s_l) \in \Omega} \Big| \underbrace{\Im\big[L_l(s^{(l)})/r_{ll}\big]}_{u_l^I} - \Im(s_l) \Big|^2 \tag{9}$$

$$= a^R_{[1]} + j a^I_{[1]} \tag{10}$$

where $\Omega = \{-\sqrt{M} + 1, \ldots, -1, +1, \ldots, +\sqrt{M} - 1\}$ represents all the possible values of the real/imaginary part of the constellation points, $a^R_{[1]} = \Re[s_l^{[1]}]$, $a^I_{[1]} = \Im[s_l^{[1]}]$, and the index $l$ was removed as we focus on one parent node in level $l + 1$ and try to enumerate its children in level $l$. Note that (8) and (9) are derived based on the fact that $r_{ll}$ is a real number, which is a result of the QR decomposition.

¹ By increasing the value of $K$, the performance becomes close to that of ML detection. However, a higher $K$ results in more hardware complexity. In this paper, based on the simulations (see Section VII), $K$ is chosen to be 10.

Fig. 1. Three-level tree used for enumeration of the complex constellation $\mathcal{O}$. This tree is defined for each complex constellation point.

Let us define $s_l^{[0]} = L_l(s^{(l)})/r_{ll}$ as the unconstrained received value. Considering the square symmetric M-QAM constellation schemes, there are $|\Omega| = \sqrt{M}$ possible integers on both the real and imaginary axes. Thus the optimizations in (8) and (9) are computationally inexpensive to implement, as $u_l^R = \Re[s_l^{[0]}]$ and $u_l^I = \Im[s_l^{[0]}]$ can be easily rounded to the nearest integers in $\Omega$ to find the optimized values for $\Re(s_l^{[1]})$ and $\Im(s_l^{[1]})$, respectively. This optimized value is denoted by $a^R_{[1]} + j a^I_{[1]}$ in (10), which is the FC of the current parent. Therefore, the FC can be easily implemented through a 2-D slicer.
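A 2-D slicer of this kind can be sketched in a few lines. The sketch below assumes the odd-integer axis alphabet $\{-\sqrt{M}+1, \ldots, -1, +1, \ldots, \sqrt{M}-1\}$ and a real, nonzero $r_{ll}$; the names `slice_axis` and `first_child` are illustrative and not taken from the paper.

```python
import math

def slice_axis(u, sqrt_m):
    """Round u to the nearest odd integer, clipped to [-(sqrt_m - 1), sqrt_m - 1]."""
    nearest_odd = 2 * math.floor(u / 2) + 1
    limit = sqrt_m - 1
    return max(-limit, min(limit, nearest_odd))

def first_child(L_l, r_ll, sqrt_m):
    """Slice the real and imaginary parts of the unconstrained value independently."""
    s0 = L_l / r_ll                    # unconstrained value s_l^[0]
    return complex(slice_axis(s0.real, sqrt_m), slice_axis(s0.imag, sqrt_m))

# 16-QAM (sqrt_m = 4): 0.4 - 2.6j is sliced to +1 - 3j
fc = first_child(0.4 - 2.6j, 1.0, 4)
```

Because the two axes are sliced independently, the cost is two roundings and two clips regardless of the constellation size.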

    B. Next Child (NC)

To describe how the next child in the complex domain is calculated, let us denote all the points in the complex constellation $\mathcal{O}$ by a three-level tree as shown in Fig. 1, where $s_l^{[0]}$ is at the root (Level 1). Once the FC (i.e., $s_l^{[1]} = a^R_{[1]} + j a^I_{[1]}$) is determined, it is selected as the first node in Level 2 of the three-level tree (the left-most node in Level 2 in Fig. 1), and the $\sqrt{M}$ Level-2 siblings are chosen as those that share the same imaginary value $a^I_{[1]}$ as the FC.³ Therefore, their squared distances from $s_l^{[0]}$ vary directly with those of their real components from the real part of $s_l^{[0]}$. Since they are all in one line in the complex constellation with different real parts, the typical real SE enumeration technique [7] can be applied to enumerate them in the order of nondecreasing squared distance from $s_l^{[0]}$ [row SE (RSE) enumeration]. This means that the nodes in the second level of the three-level tree are positioned in the nondecreasing PED order.

² Because $T_{l+1}(s^{(l+1)})$ is common for all the children of a parent node.

³ This is because of the fact that there are $\sqrt{M}$ possible real values and $\sqrt{M}$ possible imaginary values in the constellation.
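The real SE enumeration used along a row can be illustrated with a short generator. This is a sketch of the usual zigzag around the unconstrained coordinate, assuming the odd-integer alphabet of a square M-QAM axis; the name `se_enumerate` is hypothetical, and the boundary handling shown here is one possible choice rather than the paper's hardware realization.

```python
import math

def se_enumerate(u, sqrt_m):
    """Yield the axis values in order of nondecreasing |u - a| (zigzag around u)."""
    limit = sqrt_m - 1
    first = max(-limit, min(limit, 2 * math.floor(u / 2) + 1))  # sliced value
    yield first
    lo, hi = first - 2, first + 2      # next candidates on each side
    go_up = u >= first                 # step toward the side where u lies first
    while lo >= -limit or hi <= limit:
        if (go_up and hi <= limit) or lo < -limit:
            yield hi
            hi += 2
        else:
            yield lo
            lo -= 2
        go_up = not go_up              # alternate sides while both remain available

# 16-QAM axis, u = 0.4: the visiting order is +1, -1, +3, -3
order = list(se_enumerate(0.4, 4))
```

No distance is computed during the enumeration; the order follows from the sign of $u$ minus the sliced value alone, which is why the real-domain next child needs no PED calculation to be located.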

For the third level of the three-level tree in Fig. 1, $\sqrt{M} - 1$ siblings are assigned to each Level-2 parent node such that they share the same real value inherited from their common Level-2 node whereas their imaginary parts take the values $a^I_{[2]}, \ldots, a^I_{[\sqrt{M}]}$. For instance, the elements of the left-most subtree in Fig. 1 are $\{a^R_{[1]} + j a^I_{[2]}, \ldots, a^R_{[1]} + j a^I_{[\sqrt{M}]}\}$ and those of the right-most subtree are $\{a^R_{[\sqrt{M}]} + j a^I_{[2]}, \ldots, a^R_{[\sqrt{M}]} + j a^I_{[\sqrt{M}]}\}$. In fact, these $(\sqrt{M} - 1)$ Level-3 nodes and their common Level-2 node are in the same column of the complex constellation. Thus each column of the complex constellation corresponds to a definite subtree in the three-level tree. Similar to the Level-2 nodes, the real SE enumeration can be applied to their imaginary components to enumerate them in the order of nondecreasing squared distance from $s_l^{[0]}$ [column SE (CSE) enumeration]. This implies that, using CSE, all the Level-3 nodes of a definite Level-2 node are positioned in the corresponding subtree in the third level of the three-level tree from left to right in the order of nondecreasing PED.

Based on the above three-level tree structure, the next child is calculated as follows. Recall that the FC corresponds to the $s_l \in \mathcal{O}$ that minimizes $|e_l(s^{(l)})|$. By definition, its next best sibling ($s_l^{[2]}$) is the one that has the next smallest incremental distance $|e_l(s^{(l)})|$, i.e., it is the one in $s_l \in \{\mathcal{O} - s_l^{[1]}\}$ that minimizes $|e_l(s^{(l)})|$. Let $\mathcal{L}$ denote the set of points in the constellation that have been enumerated but have not yet been announced as the next best sibling. These nodes are named visited nodes. As a new point is enumerated, it is added to $\mathcal{L}$, so initially $\mathcal{L} = \{s_l^{[1]}\}$. At each step, the point in $\mathcal{L}$ with the lowest PED is announced to be the next best sibling. That point is removed from $\mathcal{L}$ and the complex SE enumeration is applied to it. So, according to the type of the announced node, i.e., Level-3 or Level-2, the announced node is replaced by one or two new points, respectively.

In other words, if the announced node is a Level-3 node, then only column SE (CSE) enumeration will be applied to it, whereas if the announced node is a Level-2 node, then both row and column SE enumerations will be performed on it. In fact, the row and column enumerations enable the coverage of the possible sets of values for $\Re[s_l]$ and $\Im[s_l]$, respectively. These new point(s) are said to be visited and are then added to $\mathcal{L}$.

Fig. 2 shows an example of the complex SE enumeration, where the bold crosses represent the visited points while the bold circled crosses represent the points announced as the next best child. Starting from $s_l^{[1]}$ [Fig. 2(a)], which is the left-most node in Level 2, its corresponding Level-3 node and its next sibling in Level 2 are visited and are added to $\mathcal{L}$. Among these two new nodes in $\mathcal{L}$, the one with the lowest PED is chosen [$+1 + j$ in Fig. 2(b)]. If the chosen node is a Level-2 node, its next sibling in Level 2 and its next child in Level 3 are both enumerated and added to $\mathcal{L}$, which is equivalent to running both the row and the column SE enumerations simultaneously [Fig. 2(c)]. However, if the chosen node in $\mathcal{L}$ is a Level-3 node, only its next sibling in Level 3 is enumerated and added to $\mathcal{L}$, as is the case in Fig. 2(d). In other words, after finding $s_l^{[2]}$ in Fig. 2(b), since it is a Level-2 node, both row and column enumerations are performed, resulting in the addition of $+1 - j$ and $-3 + j$ to $\mathcal{L}$. On the other hand, since in Fig. 2(c) the node $s_l^{[3]}$ is a Level-3 node, only the column enumeration is performed, resulting in the addition of $-1 + 3j$ to $\mathcal{L}$. This process is performed until all the points in the constellation are covered. Repeated application of this procedure ensures the on-demand enumeration of the complex constellation points in the order of increasing local PED. Note that, since the expansion scheme proposed here is on demand, not all the nodes in the constellation are necessarily visited.

Fig. 2. First four best children using complex SE enumeration in a 16-QAM constellation scheme. (a) $\mathcal{L} = \{-1 + j\}$. (b) $\mathcal{L} = \{-1 - j, +1 + j\}$. (c) $\mathcal{L} = \{-1 - j, +1 - j, -3 + j\}$. (d) $\mathcal{L} = \{-1 + 3j, +1 - j, -3 + j\}$.
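The procedure described above can be modeled in software with a priority queue playing the role of the visited list. The following is a minimal sketch under stated assumptions: the unconstrained value $-0.3 + 0.4j$ is an arbitrary choice made here for illustration, `sorted()` stands in for the zigzag SE order along each axis, and the name `complex_se_enumerate` is hypothetical. A node is identified by its position `p` in the row order and `q` in the column order; it is a Level-2 node when `q == 0`.

```python
import heapq

def complex_se_enumerate(s0, sqrt_m):
    """Yield (point, size of the visited list) in order of nondecreasing distance."""
    omega = range(-(sqrt_m - 1), sqrt_m, 2)
    # sorted() is a stand-in for the real SE (zigzag) order on each axis
    re_order = sorted(omega, key=lambda a: abs(s0.real - a))
    im_order = sorted(omega, key=lambda a: abs(s0.imag - a))

    def dist(p, q):
        return (s0.real - re_order[p]) ** 2 + (s0.imag - im_order[q]) ** 2

    visited = [(dist(0, 0), 0, 0)]           # the visited list, initially {FC}
    while visited:
        _, p, q = heapq.heappop(visited)     # announce the lowest-distance node
        if q == 0 and p + 1 < sqrt_m:        # Level-2 node: row SE step
            heapq.heappush(visited, (dist(p + 1, 0), p + 1, 0))
        if q + 1 < sqrt_m:                   # Level-2 or Level-3: column SE step
            heapq.heappush(visited, (dist(p, q + 1), p, q + 1))
        yield complex(re_order[p], im_order[q]), len(visited)

# 16-QAM example with an assumed unconstrained value
trace = list(complex_se_enumerate(-0.3 + 0.4j, 4))
points = [pt for pt, _ in trace]
```

For this assumed input, the first four announced points are $-1 + j$, $+1 + j$, $-1 - j$, and $+1 - j$, and the visited list after the third announcement holds $\{-1 + 3j, +1 - j, -3 + j\}$. Since the function is a generator, a caller that needs only the few best children stops early and the remaining points are never visited.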

The sequence in Fig. 2 shows the process of finding the first four best children of a particular parent node using the proposed scheme, with the elements of $\mathcal{L}$ listed in each stage in the caption. At each stage, a dashed gray circle indicates the distance of the most recently announced K-Best node to $s_l^{[0]}$. One argument should be proven to guarantee the correct functionality of this complex enumeration scheme, i.e., it should be ensured that, when a node in $\mathcal{L}$ is announced as the next K-Best node, any other node in $\mathcal{O}$ with a lower PED has already been visited and announced as a K-Best candidate. In other words, all the unannounced nodes in $\mathcal{O}$ should have a larger PED than the most recently announced K-Best node.

Proposition: Using the proposed complex SE enumeration scheme, nodes are visited in the order of increasing PED value.

Proof: The following two observations are used for this proof.

Lemma:
1) any Level-3 node has a larger PED than that of its corresponding Level-2 parent node;
2) if a node is announced as the next K-Best node, it has the lowest PED among the nodes in $\mathcal{L}$.

    Fig. 3. Six possible cases for the complex SE enumeration.

Let $\mathcal{U} = \mathcal{O} - \mathcal{K} - \mathcal{L}$, where $\mathcal{O}$ is the set of all nodes in the constellation, $\mathcal{K}$ is the set of nodes that have been announced as the K-Best nodes, and $\mathcal{L}$ is the set of nodes that have been visited but not yet announced as K-Best candidates. It should be proved that all the unannounced nodes in $\mathcal{O}$ have a larger PED than the nodes in $\mathcal{K}$. In a mathematical form

$$\forall s_l \in \{\mathcal{O} - \mathcal{K}\},\ \forall s_l^{[i]} \in \mathcal{K}: \ \mathrm{PED}(s_l) > \mathrm{PED}(s_l^{[i]}) \tag{11}$$

where $\mathrm{PED}(s_l^{[i]})$ denotes the PED value of the node and all its ancestors to the $N_T$th level of the detection tree. Since K-Best nodes are announced in the nondecreasing order of the PED, one only needs to prove that

$$\forall s_l \in \{\mathcal{O} - \mathcal{K}\}: \ \mathrm{PED}(s_l) > \mathrm{PED}(s_l^{[k]}) \tag{12}$$

where $s_l^{[k]}$ represents the most recently announced K-Best node in $\mathcal{K}$.

To prove this, let us consider the contrary, i.e., assume there is a node, represented by $s_l$, which is not in $\mathcal{K}$ and has a lower PED than $\mathrm{PED}(s_l^{[k]})$. There are six possible cases to look at, as shown in Fig. 3 and described in the following:

1) this case implies that a Level-2 node $s_l$ has a lower PED than a Level-2 node $s_l^{[k]}$, which is contrary to the concept of the ordered row SE enumeration;
2) this case means that a Level-2 node $s_l \in \mathcal{L}$ has a lower PED than the Level-3 node $s_l^{[k]}$, which is contrary to note 2) above;
3) this case means that a Level-2 node $s_l \in \mathcal{U}$ has a lower PED than the Level-3 node $s_l^{[k]}$. This implies that there is one unannounced Level-2 node in $\mathcal{L}$ whose PED is larger than $\mathrm{PED}(s_l^{[k]})$, resulting in the conclusion that $\mathrm{PED}(s_l) < \mathrm{PED}(s_l^{[k]})$, which is contrary to the ordered row SE enumeration;
4) this case implies that a Level-3 node $s_l \in \mathcal{L}$ has a lower PED than the node $s_l^{[k]}$, which is contrary to note 2) above;
5) this case means that a Level-3 node $s_l \in \mathcal{U}$, whose Level-2 parent node is in $\mathcal{L}$, has a lower PED than the node $s_l^{[k]}$, which is contrary to notes 1) and 2) above;
6) this case implies that a Level-3 node $s_l \in \mathcal{U}$, whose Level-2 parent node is in $\mathcal{U}$, has a lower PED than the node $s_l^{[k]}$. This means that regardless of the level of $s_l^{[k]}$, there is one unannounced Level-2 node in $\mathcal{L}$ (not including $s_l^{[k]}$), which has a lower PED than the node $s_l^{[k]}$. This is contrary to the ordered row SE enumeration and note 1) above.

Fig. 4. Variation of the value of $|\mathcal{L}|$ for 16-QAM for a specific received symbol. (a) $|\mathcal{L}| = 3$. (b) $|\mathcal{L}| = 4$. (c) $|\mathcal{L}| = 4$. (d) $|\mathcal{L}| = 4$. (e) $|\mathcal{L}| = 4$. (f) $|\mathcal{L}| = 1$.

Considering the nature of the SE enumeration in both dimensions creates an intuitive proof. In other words, the nodes in the three-level tree shown in Fig. 1 are sorted while being enumerated in the tree on both Levels 2 and 3 from left to right. Therefore, the selection of the nodes starts from left to right on both levels. In fact, moving from left to right on the three-level tree corresponds to visiting the nodes with nondecreasing PED value. Therefore, if there is one unannounced Level-3 node among the children of a Level-2 parent node, it is guaranteed that all its Level-3 siblings on its right side have a higher PED. Also, if there is one unannounced Level-2 node, all the other Level-2 nodes on its right side and their children have a higher PED than this node.

One key property of the proposed enumeration scheme is that, because it is based on the SE enumeration, it does not require that the lattice search space under consideration be bounded. Another feature is that the best children of each parent are generated one by one and on demand (without visiting all the other points). Therefore, the complexity of this approach and the search complexity are independent of the constellation order. This makes our approach a promising one especially for higher order modulations such as 64-QAM or 256-QAM.

C. Relation of $|\mathcal{L}|$ and the Constellation Order (M)

Another promising aspect of the proposed approach is that it can be shown that $|\mathcal{L}| \le \sqrt{M}$. The fact that makes this feature possible is the ordered expansion along with the pipelined sorting scheme. It is worth noting that the complex version requires extra circuitry to implement the above proposed expansion scheme in the complex domain. However, since it is implemented on an on-demand basis, and since $\mathcal{L}$ does not populate linearly, this extra circuitry is not considerable.

Proposition: Using the above complex SE enumeration scheme, the value of $|\mathcal{L}|$ is always at most $\sqrt{M}$, where $M$ represents the size of the constellation.

Proof: Based on the above proposed scheme, there are two levels of nodes in the tree (Level-2 and Level-3 nodes). If any of the Level-2 nodes is selected, it is excluded from $\mathcal{L}$ and at most two more constellation points are added to $\mathcal{L}$ (its next sibling in Level 2 and its next child in Level 3). Therefore, the selection of any Level-2 node would increase the value of $|\mathcal{L}|$ by at most 1. However, if the selected node is in Level 3, it is excluded from $\mathcal{L}$ and at most its sibling, if any, is added to $\mathcal{L}$. Therefore, if a Level-3 node is selected, the value of $|\mathcal{L}|$ does not change, or may even decrease. Having said that, since there are $\sqrt{M}$ Level-2 nodes in any M-QAM constellation, the value of $|\mathcal{L}|$ can increase by at most $\sqrt{M}$; thus $|\mathcal{L}| \le \sqrt{M}$.

This fact is illustrated in Fig. 4, which shows the complex enumeration in a 16-QAM constellation for a specific received symbol, which is marked in the figure. Note that dots represent the constellation points, circled dots represent the visited candidates not selected yet (the elements of $\mathcal{L}$), and the black circles denote the constellation points announced so far (the elements of $\mathcal{K}$). The arrows show the flow of the enumeration in the 16-QAM constellation, which depends on the location of the received symbol, and the number on each arrow represents the time step at which the target node is visited. For instance, in Fig. 4(a), $+1 - j$ and $-1 - 3j$ are visited in the second step of the enumeration, while $+1 + j$ is visited in the fifth step of the enumeration [Fig. 4(c)]. The value of $|\mathcal{L}|$ is the number of visited points not selected yet, i.e., the circled dots. By looking at Fig. 4, the largest value of $|\mathcal{L}|$ is 4. The value of $|\mathcal{L}|$ for each step is mentioned in the caption.
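The counting argument of the proof can be written out step by step. The following is a sketch under the assumptions that the list starts as the single FC and that each announcement counts as one step; the symbols $\mathcal{L}_t$ (the list after $t$ announcements) and $n_2(t)$ (the number of Level-2 nodes announced up to step $t$) are introduced here for illustration only.

```latex
\begin{align*}
|\mathcal{L}_0| &= 1, \\
|\mathcal{L}_{t+1}| - |\mathcal{L}_t| &\le
  \begin{cases}
    +1, & \text{a Level-2 node is announced and a next row sibling exists},\\
    \phantom{+}0, & \text{otherwise (the last Level-2 node, or any Level-3 node)},
  \end{cases} \\
|\mathcal{L}_t| &\le 1 + \min\{\, n_2(t),\ \sqrt{M} - 1 \,\} \le \sqrt{M}.
\end{align*}
```

Only $\sqrt{M} - 1$ of the Level-2 nodes have a next row sibling, which gives the second term of the bound. For 16-QAM the bound evaluates to 4, in agreement with the largest value of the list size reported for Fig. 4.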

IV. PROPOSED COMPLEX K-BEST ALGORITHM

The implementation in the real domain is straightforward, as the next child can be found by a simple zigzag movement around the unconstrained received value $s_l^{[0]}$ without any PED calculation in the feedback path of the architecture [12]. However, in the complex SE enumeration scheme, a 2-D SE enumeration is needed to find the next best siblings. After each complex SE enumeration, one or two new nodes will be generated that will later be added to $\mathcal{L}$ after calculation of their PED values. Thus the size of $\mathcal{L}$, which includes the best sibling nodes, may increase. It is most probable that the size of $\mathcal{L}$ is greater than 1. So in order to find the next child of the announced node, we should find the child with the minimum PED among the entries of $\mathcal{L}$, which incurs an extra complexity in the critical path.

On the other hand, in order to have a high-throughput detector, the $K$ parent nodes of the next level should be generated in $K$ clock cycles. So according to the nature of the distributed K-Best algorithm, both the PED calculation and the PED comparison processes should be done in the feedback path in one clock cycle for each announced node, which is the main underlying challenge. Also it is obvious that the PED calculation in the complex domain needs more computations compared to the real domain. Therefore, all these computations will be added to the total critical path of the detector, which will result in a significant decrease in the throughput. Thus the idea of the real-domain distributed K-Best algorithm cannot be applied to the complex domain to achieve a high-throughput design.

To address the above challenges, a novel complex-domain detection algorithm is proposed in this paper. Let us consider an $N_R \times N_T$, M-QAM MIMO system, so that the complex-domain detection tree has $N_T$ levels. The proposed algorithm can be described as follows.

    A. Proposed Complex K-Best Algorithm

    Step I. Level NT

    1) Calculate the FC of the incoming node, which is the

    NTth entry of the z matrix.2) Find all of the Level-2 nodes that are located in the same

    row of the constellation with the FC.3) Calculate the PED of the FC and all the Level-2 nodes

    and save all these

    M nodes and their PED values in

    a register bank (i.e., L).

    4) For k = 1:Ka) Find the node with the minimum PED in theL and

    announce it as one of the K parent nodes of thenext level of the detector.

    b) Find the next child of the announced node. In the

    complex domain, the next child should be calcu-

    lated by the complex SE enumeration technique.

    c) Calculate the PED of the new Level-3 node and

    replace the announced node with the new Level-3

    node in L.4

    End

    Step II. Level (NT 1)Level 2

    1) For i = 1 : Ka) Calculate the value of L is(i) for the incoming

    node, which is the i th parent of the current level.

    b) Find the FC.c) Find RSE_Num Level-2 nodes, which are the

    nearest nodes to the FC in the constellation, using

    the row SE enumeration (RSE) technique.

    d) Calculate the PED values for the above RSE_Num

    Level-2 nodes and the FC, and then save all these

    nodes in the corresponding register bank (L) result-

    ing in |L| = RSE_Num + 1 for the i th parent ofthe current level.

    e) For j = 1 : CSE_Numi) Find the node with the minimum PED in the L.

    ii) Find the next child of the selected node. Inthis step, all the L entries are Level-2 nodes.

    So regardless of the type of the selected node

    (Level 2 or Level 3), in order to find the next

    child, only the corresponding Level-3 node of

    the selected node should be found by using

    the CSE technique.

    4Note that there are

    M Level-2 nodes in an M-QAM constellation. So inStep I.2) dfd, L was initialized by

    M Level-2 nodes (|L| =

    M). Also,

    according to the proposed idea in Step I.4.b, each announced node will bereplaced with only one Level-3 node. So the size of L will remain fixed(|L| =

    M).

            iii) Calculate the PED of the new Level-3 node and save it in L. So the size of L will increase by 1.

        End

        f) For the current parent node (i.e., ith parent), sort the entries of the obtained L in the order of nondecreasing PED, resulting in the final size of |L| = RSE_Num + 1 + CSE_Num.

    End

    2) Find the sorted list of the K first children of the K parents in the order of nondecreasing PED.

    3) For k = 1 : K

        a) Announce the node with the minimum PED in the above sorted list as the kth parent of the next level.

        b) In the above sorted list, replace the announced node with its next child, which is obtained from the sorted L of its parent.

    End

    Step III. First Level

    1) For k = 1 : K

        a) Calculate the value of Lk s(k) for the incoming node, which is the kth parent of the current level.

        b) Find the FC and the corresponding PED.

    End

    2) Find the sorted list of the K first children of the K parents in the order of nondecreasing PED.

    3) Find the node with the minimum PED from the sorted list of the first children and announce it with all of its parents up to the level NT as the hard decision outputs of the detector.
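Steps II.2 and II.3 can be read as a K-way selection over per-parent child lists that are already sorted. The following is a minimal software sketch of that reading, not the hardware sorter of the paper: the function name `select_k_parents`, the `(ped, label)` tuples, and the use of a binary heap in place of the sorter and shifter are all assumptions made for illustration.

```python
import heapq

def select_k_parents(sorted_children, k):
    """Pick the k lowest-PED children across all parents.

    sorted_children[i] is the child list of parent i as (ped, label) tuples,
    already sorted by nondecreasing PED (the outcome of Step II.1.f).
    """
    # Step II.2: start with the first (best) child of every parent.
    heap = [(kids[0][0], i, 0) for i, kids in enumerate(sorted_children) if kids]
    heapq.heapify(heap)
    announced = []
    # Step II.3: announce the minimum, then replace it with the next child
    # taken from the sorted list of the same parent.
    while heap and len(announced) < k:
        ped, i, j = heapq.heappop(heap)
        announced.append((ped, i, sorted_children[i][j][1]))
        if j + 1 < len(sorted_children[i]):
            heapq.heappush(heap, (sorted_children[i][j + 1][0], i, j + 1))
    return announced

# Hypothetical PED values for three parents.
lists = [
    [(0.1, "a0"), (0.4, "a1"), (0.9, "a2")],
    [(0.2, "b0"), (0.3, "b1")],
    [(0.5, "c0")],
]
result = select_k_parents(lists, 4)
```

In this toy input, two of the four announced nodes come from parent 0 and two from parent 1, while parent 2 contributes none, which illustrates why the number of children taken from one parent is not known in advance.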

    B. Limited SE Enumeration Idea

    There are two key points in the detection process that affect

    the BER performance of the detection algorithm.

    1) It is obvious that the generation of the K parent nodes of

    level NT should be done carefully, as any error in level NT propagates to all of the other levels, which results

    in performance loss.

    2) According to the complex SE enumeration, any Level-2

    node of each column has lower PED than the other nodes

    of that column. So the Level-2 node of each column

    in the constellation has a higher priority than the other

    nodes of the same column to announce as one of the K

    best nodes of the next level. Thus, it is necessary to avoid

    missing the Level-2 nodes in the detection algorithm.

    So the generation of the parent nodes of the level NT and the generation of the Level-2 nodes are two important factors in the final BER of the system.

    One of the innovations of this paper is to find all the Level-2 nodes at the beginning of the proposed algorithm (i.e., Level NT). Then, regardless of the type of the current node (Level-2 or Level-3), in order to find the next child, only the corresponding Level-3 node of the current node should be found, which can be done by the CSE enumeration technique.

    So, in order to consider the first factor at the beginning of

    level NT, all the Level-2 nodes are generated and saved in

    L (Step I.2). If one of these nodes is selected as one of the


    parents of the next level, then only its corresponding Level-3

    node should be generated, as its neighboring Level-2 nodes

    have already been generated (Step I.4.b). So all the Level-2 and Level-3 nodes can be generated in this level and there is

    no omitted node in the level NT of the proposed algorithm.

    This will ensure that there is no performance degradation in

    this level of the proposed algorithm compared to the exact

    K-Best algorithm.

    Also, the second point can be considered through the

    generation of all of the Level-2 nodes. The above idea can

    be applied to the other levels of the proposed algorithm with the added cost of a larger silicon area. On the other hand, the effect

    of the parent nodes of the level 1 on the BER performance

    is lower than that of the parent nodes of level NT. Another contribution in the proposed algorithm is to generate a fixed

    number of child nodes for each parent in level NT − 1 through level 2, described in the following.

    Let us consider the distributed K-Best algorithm which uses

    the complex SE enumeration scheme [13]. After announcing a

    node as one of the K best nodes of the next level, the complex

    SE enumeration should be applied on the announced node. Then the node with the lowest PED in L will be announced as one of the K parents of the next level. This process will be repeated to generate all of the K parents of the next level. During this process, an unpredictable number (0 to K) of parents of the next level will be chosen from the same parent of the current level. So the number of children of a given parent of the current level that are chosen as parents of the next level is not fixed. This means that the number of column/row SE enumerations that should be performed on a given parent, and also the size of the corresponding L, are unknown and variable. This results in a significant decrease in the hardware utilization. To address this challenge, in the proposed algorithm a fixed number of children will be generated for each parent (i.e., RSE_Num + CSE_Num + 1). Thus a fixed number of the column/row SE enumerations will be applied to each parent, which is done in Steps II.1.c and II.1.e of the proposed algorithm by using RSE_Num row SE enumerations and CSE_Num CSE enumerations, respectively. Proper values of these two important parameters are determined as follows.

    C. Finding the Value of CSE_Num

    Let us consider the generation of the parent nodes in the

    complex domain using the complex SE enumeration scheme.

    In this case, all the child nodes of a parent can be enumerated

    and there is no constraint on the number of the column/row SE enumerations. This method is referred to as the relaxed SE enumeration scheme as opposed to our proposed limited SE enumeration scheme. In fact, the BER performance of the exact K-Best algorithm will be obtained for the relaxed SE enumeration scheme. It will also be shown that the difference between the final BER of our proposed detection algorithm and the exact K-Best algorithm is negligible.

    In order to find the proper value of the CSE_Num,

    consider the generation of the parent nodes in the relaxed

    SE enumeration scheme in the complex domain. The simulation result of this scheme for a 4 × 4, 64-QAM MIMO detector with K = 10 is shown in Fig. 5. This simulation was performed for 21 845 packets, where each packet consists of 4 (transmitted vectors/packet) × 4 (symbols/transmitted vector) × 6 (bits/symbol) = 96 bits (thus about 2 Mbits in total). This figure shows the number of parents that have the same number of visited child nodes at the end of the simulation. For more clarity, the generation of the parents will be explained step by step in the following.

    Fig. 5. Number of parent nodes that have the same number of visited child nodes for a 4 × 4, 64-QAM complex MIMO detector with K = 10.

    At the beginning, the FC of each parent will be found. So

    each parent node has only one visited child. Thus the parent

    nodes will be placed in the first column in Fig. 5. If the FC of a parent node from the first column is announced as one

    of the K-Best parents of the next level, then both column/row SE enumerations will be applied on it and two new nodes will

    be generated (note that the FC is a Level-2 node). So there is a total of three visited children, which will be reflected in the third column in Fig. 5. Thus the parent nodes with

    no announced child will remain in the first column. This is the reason why the second column is always empty, which follows from the concept of the complex SE enumeration [see Fig. 2(a) and (b)].

    Moreover, Fig. 5 shows that there are nearly 11 × 10^6 parent nodes that have one visited child until the end of the simulation with no announced node. Also, there are almost 6 × 10^6 parent nodes with three visited children.

    This can be seen differently in Fig. 6, which shows the

    number of child nodes that are in the same category. For

    example, the third column of Fig. 5 shows that there are 6 × 10^6 parent nodes with three visited children. Thus there are 3 × 6 × 10^6 = 18 × 10^6 child nodes in this category. In fact, the value of each column in Fig. 6 is equal to the product of the values of the horizontal and vertical axes of the corresponding column in Fig. 5.

    Note that, according to Figs. 5 and 6, the number of parent

    nodes with five visited children is larger than the number of parent nodes with four visited children (i.e., the fourth

    column). After announcing the FC as a parent of the next

    level, its parent will be transferred from the first column to

    the third column. According to the concept of the complex


    Fig. 6. Number of child nodes that are in the same category for a 4 × 4, 64-QAM MIMO detector with K = 10 in the complex domain.

    SE enumeration, the third column includes both Level-2

    and -3 nodes. So there are two scenarios.

    1) If a Level-3 node is chosen from the third column as

    one of the K parents of the next level, then just the

    column SE enumeration will be applied to it, which

    results in transferring that parent from the third column

    to the fourth column. This is the only possible way to add a node to the fourth column because the second

    column is empty.

    2) If a Level-2 node is chosen from the third column

    as one of the parents of the next level, then both the

    column/row SE enumeration will be applied to it, which

    results in transferring that parent from the third column to the fifth column. But as a key point, consider a

    Level-3 node in the fourth column, which is chosen as one of the K parents of the next level. So only the

    column SE enumeration will be applied to it, which

    results in transferring that parent node from the fourth

    column to the fifth column.

    Thus the fifth column is the target of two columns (i.e., the

    third and fourth columns). But the fourth column is the target

    of only one column (i.e., the third column). So the number of nodes that are placed in the fifth category is larger than in the fourth category.

    It is worth noting that visiting and announcing are two

    different concepts, meaning that the number of visited nodes

    and the number of announced nodes from a parent node will

    be different, as shown in Table I. The first row of this table corresponds to the horizontal axis in Fig. 5, which includes the

    number of visited child nodes for a parent, and the second row shows the possible number of announced child nodes for that

    parent node. For instance, the parent nodes of the third column in Fig. 5 have only one announced child node (i.e., their first

    children), which is shown in the third column of Table I.

    Note that in the second row of Table I the possible range

    for the number of announced children from a specific parent

    is shown. For example, for the parent nodes of the seventh

    column of Table I, we are sure that at least three child nodes of

    these parents were announced as the parents of the next level.

    TABLE I
    COMPARISON OF THE NUMBER OF VISITED NODES AND THE NUMBER OF ANNOUNCED NODES FOR A SPECIFIC PARENT

    Number of visited nodes | 1 | 2 | 3 | 4 | 5    | 6    | 7       | 8       | 9          | 10
    Number of parent nodes  | 0 | 0 | 1 | 2 | 2, 3 | 3, 4 | 3, 4, 5 | 4, 5, 6 | 4, 5, 6, 7 | 5, 6, 7, 8

    Fig. 7. All possible scenarios for the node generation from a definite parent

    node with CSE_Num = 3.

    But this table shows that the parent nodes of this category can

    have up to five announced child nodes.
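The ranges in Table I can be reproduced from the counting rule described above: announcing a Level-2 node adds two visited children (one from the row SE enumeration and one from the column SE enumeration), while announcing a Level-3 node adds one. The sketch below assumes exactly this rule and ignores constellation boundaries; the function name `announced_counts` is hypothetical.

```python
def announced_counts(visited):
    """Possible numbers of announced children for a parent that has
    `visited` visited children, under the two-or-one counting rule."""
    if visited == 1:
        return [0]  # only the FC has been visited; nothing is announced yet
    extra = visited - 1  # children visited beyond the FC
    counts = set()
    # a2 = announced Level-2 nodes (the FC is the first of them),
    # a3 = announced Level-3 nodes; they must satisfy 2*a2 + a3 == extra.
    for a2 in range(1, extra // 2 + 1):
        a3 = extra - 2 * a2
        counts.add(a2 + a3)
    return sorted(counts)

table = {v: announced_counts(v) for v in range(1, 11)}
```

Under these assumptions the function returns an empty list for two visited children, which corresponds to the always-empty second column of Fig. 5, and it returns 3, 4, and 5 for seven visited children, in line with the discussion of the seventh column.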

    As an important result, Fig. 5 shows that less than 9% of

    all the parent nodes (2 × 10^6 parent nodes out of 22 × 10^6) have more than five visited child nodes, and less than 2% of all the

    parent nodes (0.5 × 10^6 parent nodes out of 22 × 10^6) have more than seven visited child nodes. So we can ignore the parent

    nodes of the other categories. Thus, one of the innovations of

    this paper is to limit the number of SE enumerations for each

    parent node, which is equivalent to ignoring the last columns

    in Fig. 5. This idea is used to obtain the appropriate values of CSE_Num and RSE_Num based on the above observation.

    Larger values of CSE_Num and RSE_Num result in a better BER performance at the cost of more silicon area.

    In order to find the proper value of CSE_Num, exhaustive

    simulations were done and the value of CSE_Num was chosen

    to be 3. According to the concept of complex SE enumeration,

    it can be proven that in this case (i.e., CSE_Num = 3) the number of visited children for a parent node will be five, six, or seven child nodes, which is shown in Fig. 7(a)–(c),

    respectively. Fig. 7 shows all of the possible scenarios for the node generation of a definite parent with CSE_Num = 3, which results in visiting five, six, or seven child nodes from

    that parent. The announced child nodes of the parent are shown by green circles in Fig. 7. Thus in our proposed algorithm,

    three CSE enumerations (i.e., CSE_Num = 3) will be applied on each parent node and three Level-3 nodes will be generated

    (Step II.1.e in the proposed algorithm).

    D. Value of RSE_Num

    Fig. 7 shows that, according to the value of the CSE_Num after the FC, up to three Level-2 nodes can be generated [see

    Fig. 7(a)]. So, in order to avoid missing the Level-2 nodes

    in the proposed algorithm, always in addition to the FC of a

    parent, three Level-2 nodes should be generated, which can be


    Fig. 8. Two scenarios of node generation from a definite parent node that are covered: (a) Level-3 nodes are in three different columns and (b) Level-3 nodes are in two columns.

    Fig. 9. Proposed VLSI architecture of a 4 × 4, 64-QAM MIMO detector.

    done using three row SE enumerations. According to the Step II.1.f

    of the proposed algorithm, all children of a parent should be

    sorted in the order of nondecreasing PED. According to the

    proposed architecture for Step II.1.f, it can be proven that

    the difference between the required hardware to sort seven

    or eight values is negligible. So, according to this fact and

    the importance of the Level-2 nodes in the final BER (see Section IV-B2), the number of Level-2 child nodes of each parent was chosen to be five in the proposed algorithm, which implies that RSE_Num = 4, because the FC is generated in a different manner and without the SE enumeration technique. Thus four row SE enumerations will always be applied on each parent node and four Level-2 nodes will be generated (Step II.1.c in the proposed algorithm).

    Therefore, the result of the above ideas is to generate four Level-2 nodes by applying four row SE enumerations (RSE_Num = 4) and three Level-3 nodes by applying three CSE enumerations (CSE_Num = 3) for each parent (Steps II.1.c and II.1.e in the proposed algorithm). Thus, in addition to the

    FC, seven child nodes will be generated for each parent. Fig. 8

    shows two possible cases that are covered by the proposed

    idea. Note that, after finding the FC, all these nodes will be

    visited for each parent (Steps II.1.c and II.1.e). This means

    that all the nodes that are not indicated with the green circles

    in Fig. 7 can be announced as the parents of the next level.

    But if the next child of the announced node does not exist in

    the collection of eight visited nodes of that parent, then that next child will not be enumerated.
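The limited enumeration with RSE_Num = 4 and CSE_Num = 3 can be sketched in software as follows. This is only an illustration under several assumptions that are not taken from the paper: a 64-QAM grid with odd-integer levels from −7 to 7, squared Euclidean distance to the unconstrained estimate `z` as a stand-in for the PED, and each node being selected for column expansion at most once. The names `se_order` and `limited_children` are hypothetical.

```python
LEVELS = [-7, -5, -3, -1, 1, 3, 5, 7]  # assumed 64-QAM axis levels

def se_order(x):
    """Axis levels sorted by distance to x (one-dimensional SE order)."""
    return sorted(LEVELS, key=lambda a: abs(a - x))

def limited_children(z, rse_num=4, cse_num=3):
    """Return the FC, rse_num Level-2 nodes and cse_num Level-3 nodes of a
    parent whose unconstrained estimate is z, sorted by the metric."""
    dist = lambda s: abs(z - s) ** 2      # stand-in for the PED
    cols = se_order(z.real)               # columns, nearest first
    rows = se_order(z.imag)               # rows, nearest first
    # FC plus the Level-2 nodes: nearest row, successive columns (row SE).
    nodes = [complex(c, rows[0]) for c in cols[:rse_num + 1]]
    depth = {c: 0 for c in cols[:rse_num + 1]}  # column SE position per column
    expanded = set()
    for _ in range(cse_num):
        # Select the best node that has not yet produced its Level-3 child.
        best = min((s for s in nodes if s not in expanded), key=dist)
        expanded.add(best)
        col = best.real
        depth[col] += 1
        nodes.append(complex(col, rows[depth[col]]))  # next node in that column
    return sorted(nodes, key=dist)

kids = limited_children(2.2 + 0.6j)
```

For this example estimate the eight visited children consist of five nodes on the nearest row and three Level-3 nodes spread over two columns, which resembles the two-column scenario of Fig. 8(b).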

    V. PROPOSED VLSI ARCHITECTURE

    The proposed VLSI architecture for a 4 × 4, 64-QAM hard-output MIMO detector is shown in Fig. 9. This architecture

    consists of four layers. Each layer gets the entries of z, R,

    some control signals, and the K parents of the previous layer (if any) as inputs and generates the K parents of the next layer

    as outputs. The control unit performs the scheduling of the

    inputs of the layers. Each layer consists of some subblocks,

    as described in the following. In order to reduce the final

    Fig. 10. Detailed VLSI architecture for Layer NT of the proposed MIMO detector.

    critical path of the design, fine-grain pipelining is applied to

    the proposed architecture. The internal pipelining stages of

    the subblocks and the pipeline stages between the blocks

    are shown by the red dashed lines and arrows in all the figures

    to come.

    A. Layer NT

    The layer NT of the proposed architecture implements Step I of the proposed algorithm. The detailed VLSI architecture of

    the layer NT is shown in Fig. 10, which gets z4 and r44 as inputs and generates the K parents of layer NT − 1 as the outputs.

    B. Layer (NT − 1) to Layer 2

    In Layer NT − 1 through Layer 2, the corresponding entries of z and R, some control signals, and the K parents of the previous layer are the inputs, and the K parents of the next layer are generated as the outputs. In fact, these layers of the proposed architecture perform Step II of the proposed algorithm. The architecture of these layers is the same and will be described in detail.

    C. Layer 1

    Finally, Layer 1 of the proposed architecture performs Step III of the proposed algorithm and announces the child

    with the lowest PED with all its parents up to Layer NT as

    the hard decision output s of the detector.

    D. Multipliers

    According to (3)–(7), there are four types of multiplication

    in the proposed scheme, which are implemented as follows.


    Fig. 11. (a) Proposed architecture of the fast multiplier. (b) Architecture of a 4 × 4-bit Baugh-Wooley carry-save multiplier.

    1) Fast Multiplier: The first type of multiplier is devised to perform L i = rii L i. This multiplication is time-intensive and is a part of the total critical path of the system. So, in order to improve the throughput of the system, an efficient architecture is proposed in Fig. 11. The basic core of the proposed architecture is a two's complement 4 × 4-bit Baugh-Wooley carry-save multiplier [Fig. 11(b)]. The final 16 × 16-bit fast multiplier is obtained by connecting a group of these basic cores together [Fig. 11(a)]. Also, in order to decrease the critical path of the multiplier, fine-grain pipelining is used in this block [red arrows in Fig. 11(b)], which results in a critical path of 1 ns and a throughput of 32 Gb/s for the multiplier in a 0.13-µm CMOS technology, instead of 7 ns and a throughput of 4.5 Gb/s for a conventional 16 × 16-bit multiplier in the same technology.

    2) Constant Multiplier (CM): The second multiplier is devised to perform ri j R/I{sk}, which is the basic core of the remaining three types of multipliers. So an efficient low-area multiplier is proposed that is implemented by using simple shift and addition operations [Fig. 12(a)]. The key idea in the proposed multiplier comes from the fact that both operands are real valued and the value of R/I{sk} is always chosen from a constant set.

    3) Real × Complex Multiplier (RCM): The third type of multiplier is designed to perform rii si. After the QR decomposition, all the diagonal elements of R will be real and the other elements will be complex. So this multiplier is an RCM. The RCM is used to implement (5). It is implemented by a combination of two constant multipliers [see Fig. 12(b)].

    4) Complex × Complex Multiplier: Finally, the fourth type of multiplier is designed to perform ri j sj, which is a CCM. According to (6), the CCM is used to implement L i s(i), shown in Fig. 12(c).
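The multiplier types above can be modeled at bit level for illustration. The sketch below is a behavioral model only and makes several assumptions that the text does not state: the 4 × 4-bit core uses the common modified Baugh-Wooley form with complemented sign-row partial products and two correction constants, the constant set is the 64-QAM set {±1, ±3, ±5, ±7}, and the CCM is built from four constant multiplications. All function names are hypothetical, and the carry-save structure and pipelining of Fig. 11 are not modeled.

```python
def baugh_wooley_4x4(a, b):
    """Signed 4x4-bit product from Baugh-Wooley partial products."""
    n = 4
    abit = [(a >> i) & 1 for i in range(n)]   # two's complement bits of a
    bbit = [(b >> j) & 1 for j in range(n)]   # two's complement bits of b
    acc = (1 << n) + (1 << (2 * n - 1))       # correction constants
    for i in range(n):
        for j in range(n):
            pp = abit[i] & bbit[j]
            if (i == n - 1) != (j == n - 1):  # exactly one sign bit involved
                pp ^= 1                       # complemented partial product
            acc += pp << (i + j)
    acc &= (1 << (2 * n)) - 1                 # keep the 8-bit result
    return acc - (1 << (2 * n)) if acc >> (2 * n - 1) else acc

def const_mult(x, c):
    """x * c for c in {+-1, +-3, +-5, +-7} using shifts and adds only."""
    y = {1: x, 3: (x << 1) + x, 5: (x << 2) + x, 7: (x << 3) - x}[abs(c)]
    return -y if c < 0 else y

def rcm(r, s):
    """Real r times complex symbol s = (s_re, s_im): two constant multiplies."""
    return const_mult(r, s[0]), const_mult(r, s[1])

def ccm(r, s):
    """Complex r = (r_re, r_im) times complex symbol s: four constant multiplies."""
    re = const_mult(r[0], s[0]) - const_mult(r[1], s[1])
    im = const_mult(r[0], s[1]) + const_mult(r[1], s[0])
    return re, im
```

Here `x`, `r`, and the components of `r` are integers standing for raw fixed-point values, so the shift-and-add results are exact.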

    E. PED Calculation Block (PED Calc.)

    This block calculates the PED based on (4), which is

    implemented through a fully pipelined architecture in Fig. 13.

    The middle subblocks in Fig. 13 implement the 1-norm calculation. Simulation results show that the difference between

    the BER performance of 2-norm and 1-norm is negligible

    [11]. Due to the lower complexity of the 1-norm, it is the

    preferred approach for the implementation in our design.
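The two metric updates can be contrasted with a small sketch. Equation (4) is not reproduced in this part of the text, so the code assumes that the 2-norm variant accumulates the squared magnitude of a complex error term `e` and that the 1-norm variant accumulates the sum of the absolute values of its real and imaginary parts. The function names are hypothetical.

```python
def ped_2norm(parent_ped, e):
    """Accumulate the squared magnitude of the complex error e."""
    return parent_ped + e.real ** 2 + e.imag ** 2

def ped_1norm(parent_ped, e):
    """Accumulate |Re(e)| + |Im(e)|, which needs no multipliers."""
    return parent_ped + abs(e.real) + abs(e.imag)

# Example error term and parent PED (arbitrary values).
l2 = ped_2norm(1.0, 3 - 4j)
l1 = ped_1norm(1.0, 3 - 4j)
```

The 1-norm update replaces two squarers with absolute-value operations, which is the complexity saving referred to above. The two metrics do not always rank candidates identically, so the negligible BER difference is an empirical result rather than an identity.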

    Fig. 12. Proposed architectures of (a) CM, (b) RCM, and (c) CCM.

    Fig. 13. Proposed VLSI architecture for the PED calculation block.

    F. Li Calculation (Li Calc.)

    According to (8) and (10), in order to find the FC, the value of s[0]l = Ll s(l) should be calculated. Also, based on (7), this value is used to calculate Ll s(l) for the PED calculation. Since different numbers of zi and ri j are used to calculate L1, L2, and L3, a customized fully pipelined architecture is proposed for each of them in Fig. 14(a)–(c), respectively. These blocks perform Step II.1.a and Step III.1.a of the proposed algorithm.

    G. FC Block

    The proposed architecture for the FC block is shown

    in Fig. 15(a), which performs Step I.1, Step II.1.b, and

    Step III.1.b of the proposed algorithm. This is done by using

    two mapper blocks and two limiter blocks, which are described

    below.

    1) Mapper: The main task of the mapper block is to round

    the real/imaginary part of L i (i.e., R/I{s[0]l}) to the nearest odd integer value, which can be done on the basis of the following equation:

    $$\mathcal{R}/\mathcal{I}\{s_l^{[1]}\} = 2\left\lfloor \frac{\mathcal{R}/\mathcal{I}\{s_l^{[0]}\}}{2} + 1 \right\rfloor - 1 \tag{13}$$

    where ⌊·⌋ represents the truncation operation. The detailed architecture of the mapper block is shown in Fig. 15(b).

    2) Limiter: If R/I{s[1]l} is outside the allowed boundary of the constellation, it will be bounded by the limiter block to generate R/I{s[1]l}, which is the real/imaginary part of the FC [see Fig. 15(c)].
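The mapper and limiter can be sketched together as follows. This is a behavioral illustration with assumptions: the constellation is taken to be 64-QAM with odd-integer levels bounded by ±7, and the truncation is modeled as a floor (rounding toward minus infinity), which is how dropping fractional bits behaves for two's complement values. The names `mapper`, `limiter`, and `first_child` are hypothetical.

```python
import math

def mapper(x):
    """Nearest odd integer to x, computed as 2*floor(x/2) + 1."""
    return 2 * math.floor(x / 2) + 1

def limiter(v, bound=7):
    """Clip v to the assumed constellation boundary [-bound, +bound]."""
    return max(-bound, min(bound, v))

def first_child(s0, bound=7):
    """Map the unconstrained estimate s0 to a constellation point (the FC)."""
    return complex(limiter(mapper(s0.real), bound),
                   limiter(mapper(s0.imag), bound))

fc = first_child(9.2 - 0.4j)
```

In this example the real part maps to 9 and is clipped to 7, while the imaginary part maps to −1 and passes through the limiter unchanged. A truncation toward zero would give +1 for the imaginary part instead, so the floor interpretation matters for negative inputs.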


    Fig. 14. Proposed VLSI architecture of Li calculation block. (a) L1 calculation block. (b) L2 calculation block. (c) L3 calculation block.

    Fig. 15. (a) Proposed architecture for the FC block. (b) Detailed architecture of the mapper block. (c) Detailed architecture of the limiter block.

    H. CSE Enumeration

    This block performs the CSE technique to generate the

    Level-3 nodes of a parent. In fact, this block performs

    Step I.4.b and Step II.1.e.ii of the proposed algorithm, which

    is implemented using simple adders and subtractors.

    I. NC Block

    This block is used in all layers except the first and the last

    layer. A fully pipelined VLSI architecture is proposed for the NC block, shown in Fig. 16. The NC block implements the

    Step II.1.c and the Step II.1.d using adders, subtractors, and

    norm calculation blocks. Also, it implements the Step II.1.e

    and the Step II.1.f of the proposed algorithm using the upper

    right corner blocks and lower half part of Fig. 16, respectively.

    J. Sorters

    The proposed VLSI architecture includes four types of

    sorters, which are described below.

    1) Sorter 1: This sorter performs the Step II.2 and Step III.2

    of the proposed algorithm. Sorter 1 proposed in [7] is used in

    all of the layers except Layer NT.

    2) Sorter 2 and Shifter: This block implements the Step II.3 of the proposed algorithm by using Sorter 2, which is the same as Sorter 1, and a shifter. This block is used in all layers

    except the first and the last.

    3) Sorter 3: The main task of Sorter 3 is to perform the

    Step I.4.a of the proposed algorithm. A feedforward and

    pipelinable VLSI architecture is proposed for Sorter 3, which

    is shown in Fig. 17.

    4) Sorter NC: This sorter is used in the NC block. In fact,

    Sorter NC is a modified version of the Sorter 3, which is

    customized for a different number of inputs. Note that the

    Fig. 16. Detailed VLSI architecture of the NC block.

    Fig. 17. Proposed VLSI architecture of Sorter 3.

    only difference between the Sorter NC blocks in Fig. 16 is in

    the number of inputs.

    It is worth noting that the proposed feedforward/fully

    pipelined architecture can be easily applied to the multicarrier

    scenarios. In fact, the proposed MIMO detector is applied on

    each carrier separately, so each subsequent carrier can be fed

    to the proposed MIMO detector in a pipelined fashion. This can be done through a simple hardware wrapper sitting

    next to the proposed core. While it is assumed that the channel

    is perfectly known to the receiver, the proposed algorithm can be used under different channel conditions when used along

    with a channel estimator providing the estimate of the current

    channel status.

    VI. COMPLEXITY ANALYSIS

    A. ASIC Implementation

    The proposed VLSI architecture was modeled in Verilog HDL using ModelSim, synthesized using Synopsys Design Vision in 0.13-µm and 90-nm CMOS technologies, and placed and routed using Cadence SoC Encounter. The chip boundary and final graphic database system (GDS) stream-out were performed using Cadence Virtuoso. The golden fixed-point MATLAB model was used to validate the register-transfer-level (RTL) and gate-level netlists. The final verified ASIC core was fabricated in 0.13-µm IBM 1P/8M CMOS technology using Artisan standard library cells. The die, a micrograph of which is shown in Fig. 18, was packaged in a CFP120 package.

    The fabricated design was tested using an Agilent (Verigy) 93000 SoC high-speed digital tester and a Temptronic TP04300 thermal forcing unit. The test setup consisted of the

    93K system-on-chip (SoC) tester, the thermal forcing unit, and

    a load board holding the device under test. The nominal supply

    voltage supplied to the core is 1.2 V, while the I/O voltage is


    TABLE II
    DESIGN COMPARISON OF THE CURRENT ASIC IMPLEMENTATIONS FOR 4 × 4 MIMO DETECTORS

    Reference           | JSAC 2006 [1] | TCAS-II 2010 [14] | DATE 2009 [15]      | TVLSI 2007 [6] | TVLSI 2010 [16]
    Modulation          | 16-QAM        | 16-QAM            | 64-QAM              | 64-QAM         | 64-QAM
    Antenna             | 4 × 4         | 4 × 4             | 4 × 4               | 4 × 4          | 4 × 4
    Method              | K-Best        | SISO-SD           | Sys. Like detection | K-Best         | K-Best
    Domain              | Real          | Complex           | Complex             | Complex        | Real
    K-value             | 5             | N/A               | N/A                 | 64             | 5-64
    Process             | 0.35 µm       | 90 nm             | 45 nm               | 0.13 µm        | 65 nm
    fmax (MHz)          | 100           | 250               | 574.7               | 270            | 158
    Throughput (Mb/s)   | 54 (145^a)    | 90 (62^a)         | 215 (74^a)          | 100            | 732-100 (366-50^a)
    Gate count          | 91 kG         | 96 kG             | 33.1 kG             | 5270 kG        | 1760 kG
    NHE^b               | 0.63^a        | 1.6^a             | 0.45^a              | 52.7           | 4.81-35.2^a
    Energy/bit          | 594 pJ/b      | N/A               | N/A                 | 8470 pJ/b      | N/A
    Power (mW)          | 626           | N/A               | N/A                 | 847            | 165
    Latency (µs)        | 2.4           | N/A               | N/A                 | N/A            | N/A
    Hard/soft           | Hard          | Soft              | Soft                | Soft           | Hard

    Reference           | JSSC 2010 [17] | JSSC 2011 [18] | TVLSI 2011 [19]       | TVLSI 2011 [7] | This work
    Modulation          | (4-64)QAM      | 64-QAM         | (16-64)QAM            | 64-QAM         | 64-QAM
    Antenna             | 4 × 4 - 8 × 8  | 4 × 4          | 4 × 4                 | 4 × 4          | 4 × 4
    Method              | MBF-FD (SD)    | SISO MMSE-PIC  | MMF-LSD               | K-Best         | Modified K-Best
    Domain              | Complex        | Complex        | Real                  | Real           | Complex
    K-value             | N/A            | N/A            | N/A                   | 10             | 10
    Process             | 0.13 µm        | 90 nm          | 0.18 µm               | 0.13 µm        | 0.13 µm
    fmax (MHz)          | 198            | 568            | 250                   | 282            | 417
    Throughput (Mb/s)   | 285-431        | 757 (524^a)    | 31.7-146.3 (44-202^a) | 675            | 1000
    Gate count          | 350 kG         | 410 kG         | 25.4-48.2 kG          | 114 kG         | 340 kG
    NHE^b               | 1.23-0.81      | 0.78^a         | 0.58-0.24^a           | 0.17           | 0.34
    Energy/bit          | N/A            | 250 pJ/b       | N/A                   | 200 pJ/b       | 110 pJ/b
    Power (mW)          | 57-74          | 189.1          | 57-90                 | 135            | 1700
    Latency (µs)        | N/A            | N/A            | N/A                   | 0.6            | 0.36
    Hard/soft           | Soft           | Soft           | Soft                  | Hard           | Hard

    ^a Technology scaling from S1 to the 0.13-µm CMOS process assuming tpd2 = tpd1 · 130 (nm)/S1 (nm) and fmax ∝ 1/tpd.

    ^b Normalized hardware efficiency, in kG/(Mb/s).

    Fig. 18. Photograph of the die (labeled regions: Layer 1, Layer 2, Layer 3, and Layer 4).

    2.7 V. The operation of the chip was verified by passing the

    input vectors at different SNR values to the chip through the

    tester and comparing the detector outputs with the expected values from the bit-true simulations both from MATLAB and

    ModelSim simulations. Finally, an at-speed test was run on the chip and the outputs were compared against the expected

    bit stream generated by the MATLAB simulations.

    The final measured BER performance result of the

    proposed approach is similar to that in [1] and [11]. Thus the major difference between all of these schemes, including the one in this paper, is the way the detection algorithm is implemented, which translates to different throughput and/or hardware complexity. It is shown that the proposed algorithm is implemented using a feedforward architecture. In our proposed architecture, the critical paths of the subblocks such as the 16 × 16-bit fast multiplier and the PED Calc. block are reduced by applying the pipelining technique. According

    to the proposed algorithm, K-Best candidates of each layer

    of the architecture are generated in K clock cycles, which

    increases the throughput of the system.
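The relation between clock rate and throughput can be checked with a short calculation. The sketch assumes that, in steady state, the pipelined detector finishes one transmitted vector every K clock cycles, so the throughput is the number of bits per vector times the clock rate divided by K. This formula is an inference from the statement above rather than an equation given in the paper, and the function name `throughput_mbps` is hypothetical.

```python
import math

def throughput_mbps(n_t, m_qam, f_mhz, cycles_per_vector):
    """Bits per vector (n_t * log2(M)) delivered every cycles_per_vector cycles."""
    return n_t * math.log2(m_qam) * f_mhz / cycles_per_vector

asic = throughput_mbps(4, 64, 417, 10)   # 0.13-µm ASIC clock rate
fpga = throughput_mbps(4, 64, 150, 10)   # FPGA clock rate reported in the paper
```

Under this assumption the 417-MHz ASIC gives about 1000 Mb/s and a 150-MHz FPGA clock gives 360 Mb/s, which agrees with the throughput figures reported for the two platforms.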

    TABLE III

    DEVICE UTILIZATION IN THE FPGA PLATFORM

    Slices LUTs Reg. LUT-FF pairs DSP48Es

    Available 14720 58880 58880 51869 640

    Used 13467 46160 36912 31203 8

    Utilization 91% 78% 62% 60% 1%

    The comparison between the proposed complex MIMO detector and the recently proposed MIMO detectors in the

    complex and the real domains that are reported in the literature is shown in Table II. This comparison shows that the proposed

    scheme has the same performance but higher throughput,

    lower area, lower energy, and lower latency compared to

    all the reported complex-domain VLSI realizations. Also, the

    proposed design has higher throughput, less latency, and less

    energy than the distributed K-Best algorithm in [7], which

    is one of the most efficient real-domain MIMO detectors.

    Needless to say, the proposed design has a larger core area than the one in [7], which is related to the nature of the

    complex-domain implementation and extra resources for the

    complex-domain calculations.

    In order to perform a fair comparison, a normalized hardware efficiency (NHE) is defined, which includes the core area of the design (i.e., gate count) and the corresponding scaled throughput in the same technology for all designs:

    $$\mathrm{NHE}\ \left(\mathrm{kG/(Mb/s)}\right) = \frac{\text{core area (kG)}}{\text{scaled throughput (Mb/s)}}$$

    Table II shows that the proposed design has the lowest NHE

    compared to all of the complex-domain MIMO detectors.

    Moreover, the proposed scheme has a lower NHE than all the

    real-domain implementations except [7].
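The NHE values of Table II can be recomputed as a consistency check. The sketch assumes that the scaled throughput is the reported throughput multiplied by S1/130, with S1 the process feature size in nanometers, which reproduces the scaled entries of the table. The function names are hypothetical.

```python
def scaled_throughput(mbps, process_nm, target_nm=130):
    """Throughput scaled to the target process, assuming fmax grows with S1/target."""
    return mbps * process_nm / target_nm

def nhe(gate_count_kg, mbps, process_nm):
    """Normalized hardware efficiency in kG/(Mb/s); lower is better."""
    return gate_count_kg / scaled_throughput(mbps, process_nm)

this_work = nhe(340, 1000, 130)  # 340 kG, 1000 Mb/s, 0.13 µm
ref_1 = nhe(91, 54, 350)         # [1]: 91 kG, 54 Mb/s, 0.35 µm
ref_18 = nhe(410, 757, 90)       # [18]: 410 kG, 757 Mb/s, 90 nm
ref_7 = nhe(114, 675, 130)       # [7]: 114 kG, 675 Mb/s, 0.13 µm
```

Rounded to two decimals, these calls give 0.34, 0.63, 0.78, and 0.17, matching the corresponding NHE entries in Table II.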


    Fig. 19. K-Best versus ML BER performance for different values of K in both real and complex domain for a 4 × 4, 64-QAM MIMO detector. (Axes: BER versus SNR. Curves: K-Best complex (K = 7), K-Best real (K = 10), K-Best complex (K = 11), K-Best complex (K = 15), and ML.)

    Moreover, the proposed scheme is implemented in the

    FPGA platform. The synthesis results and the required resources for the 4 × 4, 64-QAM MIMO detector using the proposed scheme are shown in Table III. The target device is the Virtex-5 FPGA from Xilinx, i.e., XC5VSX95T-2FF1136. On

    the FPGA platform, the throughput of 360 Mb/s at 150 MHz

    is achieved for the proposed design.

    VII. SIMULATION RESULTS

    In theory, the K-Best algorithm might miss the hard-ML

    point and might have performance loss as a result. However,

    by a proper choice of K, the BER performance of the K-Best method approaches the optimal case for a reasonable

    range of SNR values. Since the proposed algorithm is based on the K-Best algorithm, it is necessary to choose a proper

    value for K. Fig. 19 shows the BER performance curves for a 4 × 4, 64-QAM MIMO system using the proposed scheme versus the ML detector. It reveals the behavior of

    the proposed algorithm for different values of K. It is seen

    that by increasing the value of K, the performance result

    becomes close to that of the ML detection. However, a higher

    K value results in more hardware complexity. For instance,

    in the proposed algorithm for a 4 4, 64-QAM MIMOsystem, K = 15 results in an ML-like result while K = 8comes with less diversity in high-SNR regimes (Fig. 19). Theperformance of the K-Best scheme with K = 10 is close tothe ML while having a moderate complexity. Thus K = 10 ischosen as the framework for the hardware implementation in

    this paper.Moreover, word-length effect is another important parameter

    that affects on the final BER of the algorithm and also thehardware complexity. Let us consider a pair (W, F) for each

    variable of the algorithm, which represents the total wordlength and fractional length, respectively. We consider three

    different cases for (W, F): small values, medium values, and

    large values. These values are listed in Table IV. The BERcomparison of these cases in the fixed-point domain and also

    the floating point simulation results of the proposed algorithm

    as well as ML detection are shown in Fig. 20. Simulation

    results show that choosing a large value for W and F results in

    TABLE IV

    FIXED-P OINT VARIABLES WIT H THREE SETS OF (W, F)

    ri j ri i Zi Si L i P E D

    Best (34, 3 0) (3 1, 28 ) (34, 2 5) (4, 0 ) (34 , 2 5) (3 4, 31 )

    Optimized (16, 12) (13, 10) (16, 7) (4, 0) (16, 7) (16, 13)

    Bad (10, 6) (9, 6) (12, 3) (4, 0) (11, 2) (10, 7)

    Fig. 20. BER performance of the proposed algorithm in both fixed/floatingpoint domains for different values of (W, F) versus ML for a 4 4, 64-QAMMIMO detector with K = 10.

    less performance loss, but a larger and complicated hardware

    is needed (best in Fig. 20). More truncation of parameters

    results in lower hardware complexity but a larger BER and

    performance loss (bad in Fig. 20). Medium hardware com-plexity and BER are obtained by choosing the medium set of

    (W, F), denoted by optimized in Fig. 20, which is chosenin this paper.
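    To make the role of a (W, F) pair concrete, the following sketch quantizes a value to W total bits with F fractional bits. It assumes a signed two's-complement format, truncation toward minus infinity, and saturation on overflow; the section does not state these conventions, and the function name `quantize` and the sample value are hypothetical. The three pairs used are the PED entries of Table IV.

```python
import math


def quantize(x, W, F):
    """Map x to a signed fixed-point value with W total bits, F fractional bits.

    Assumptions of this sketch: two's-complement range, truncation
    (floor) of the discarded bits, and saturation instead of wrap-around.
    """
    scale = 1 << F
    lowest = -(1 << (W - 1))        # most negative integer code
    highest = (1 << (W - 1)) - 1    # most positive integer code
    code = math.floor(x * scale)    # drop bits below 2**-F
    code = max(lowest, min(highest, code))
    return code / scale


ped = 1.2345  # illustrative partial Euclidean distance value
ped_formats = {"Best": (34, 31), "Optimized": (16, 13), "Bad": (10, 7)}
for name, (W, F) in ped_formats.items():
    q = quantize(ped, W, F)
    print(f"{name:9s} (W={W}, F={F}): {q!r}, error {ped - q:.3e}")
```

    With truncation, the error stays below 2^-F as long as the value does not saturate, so each fractional bit removed doubles the worst-case error, while W - F fixes the representable range.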

    Finally, it is necessary to verify the proposed idea of limiting the number of column/row SE enumerations and to compare the BER of the proposed algorithm with the exact K-Best algorithm and also ML detection. In other words, the chosen values of CSE_Num and RSE_Num (i.e., CSE_Num = 3, RSE_Num = 4) as well as their effects on the BER of the proposed algorithm should be confirmed. According to the proposed algorithm, the values of CSE_Num and RSE_Num affect the complexity of the architecture and the BER performance. Larger values result in a larger chip area and better BER performance.
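    The idea of capping the number of SE enumerations can be illustrated in one real dimension. The sketch below lists the points of an 8-PAM axis (one axis of 64-QAM) in Schnorr-Euchner order, i.e., by increasing distance from the received value, and stops after `limit` points. It is not the column/row procedure of the proposed architecture; the function name and the `limit` parameter are hypothetical stand-ins for the role played by CSE_Num and RSE_Num.

```python
PAM8 = (-7, -5, -3, -1, 1, 3, 5, 7)  # one real axis of 64-QAM


def se_enumerate(z, points=PAM8, limit=None):
    """Return constellation points in order of increasing distance from z.

    Starts at the nearest point and then alternates outward (zigzag),
    skipping a side once it runs off the end of the constellation.
    """
    if limit is not None and limit < 1:
        raise ValueError("limit must be at least 1")
    pts = sorted(points)
    start = min(range(len(pts)), key=lambda i: abs(pts[i] - z))
    order = [pts[start]]
    lo, hi = start - 1, start + 1
    while (lo >= 0 or hi < len(pts)) and (limit is None or len(order) < limit):
        take_lo = hi >= len(pts) or (
            lo >= 0 and abs(pts[lo] - z) <= abs(pts[hi] - z)
        )
        if take_lo:
            order.append(pts[lo])
            lo -= 1
        else:
            order.append(pts[hi])
            hi += 1
    return order


print(se_enumerate(0.6))           # full (relaxed) enumeration
print(se_enumerate(0.6, limit=3))  # limited enumeration, three candidates
print(se_enumerate(6.5, limit=4))  # near the constellation edge
```

    Because the candidates come out already sorted, truncating the list discards only the farthest points, which is why a small limit can cost little in BER.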

    There are two strategies for the values of CSE_Num and RSE_Num. The first strategy is the relaxed SE enumeration scheme, where the number of column/row SE enumerations is not limited ("relaxed SE" in Fig. 21). In fact, the BER of this strategy is the same as the BER of the exact K-Best algorithm [13]. The second strategy is our proposed limited SE enumeration scheme, where the number of column/row SE enumerations is limited (i.e., CSE_Num = 3, RSE_Num = 4). This scheme is denoted by "Limited SE" in Fig. 21. The simulation results of these strategies versus ML detection are shown in Fig. 21. We can see that the difference between the BER of these two schemes is negligible, which confirms that the proposed algorithm can achieve the same BER as the exact K-Best algorithm.
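    For readers who want to reproduce the relation between the exact K-Best search and ML detection on a toy problem, the sketch below runs both on a small real-valued, upper-triangular system z = R s + n. It is a textbook breadth-first K-Best with full child expansion and sorting, not the on-demand expansion of the proposed architecture; the 4-PAM alphabet, the three-level tree, and all names are assumptions chosen to keep the exhaustive ML search cheap.

```python
import itertools
import random

ALPHABET = (-3, -1, 1, 3)  # 4-PAM, stand-in for one real axis of a QAM grid
N = 3                      # number of tree levels (transmitted symbols)


def ped_increment(R, z, symbols, level):
    """Squared residual of row `level`; `symbols` holds s[level..N-1]."""
    residual = z[level]
    for offset, s in enumerate(symbols):
        residual -= R[level][level + offset] * s
    return residual * residual


def k_best(R, z, alphabet, K):
    """Breadth-first search keeping the K lowest-PED paths per level."""
    survivors = [(0.0, [])]
    for level in range(len(z) - 1, -1, -1):
        children = []
        for ped, tail in survivors:
            for s in alphabet:
                symbols = [s] + tail
                cost = ped + ped_increment(R, z, symbols, level)
                children.append((cost, symbols))
        children.sort(key=lambda child: child[0])
        survivors = children[:K]
    return survivors[0]


def ml(R, z, alphabet):
    """Exhaustive search over all symbol vectors."""
    best = None
    for cand in itertools.product(alphabet, repeat=len(z)):
        cost = sum(
            ped_increment(R, z, cand[level:], level) for level in range(len(z))
        )
        if best is None or cost < best[0]:
            best = (cost, list(cand))
    return best


rng = random.Random(0)
R = [
    [(rng.uniform(0.5, 1.5) if j == i else rng.gauss(0, 1)) if j >= i else 0.0
     for j in range(N)]
    for i in range(N)
]
s_true = [rng.choice(ALPHABET) for _ in range(N)]
z_clean = [sum(R[i][j] * s_true[j] for j in range(i, N)) for i in range(N)]
z_noisy = [v + rng.gauss(0, 0.5) for v in z_clean]

print("ML     :", ml(R, z_noisy, ALPHABET))
print("K-Best :", k_best(R, z_noisy, ALPHABET, K=2))
```

    When K is at least the number of paths at every level, no path is ever pruned and the K-Best result coincides with ML; for smaller K the returned metric can only be equal or worse, which is the performance loss that the BER curves quantify.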


    Fig. 21. Effect of the proposed limited SE enumeration idea on the BER performance of the proposed algorithm versus the exact K-Best algorithm and the ML for a 4 × 4, 64-QAM MIMO detector with K = 10.

    The above simulation results are for a single-carrier 4 × 4, 64-QAM MIMO system. The BER curves are simulated over 100 000 packets, where each packet consists of 4 × 6 × 4 = 96 bits (9.6 Mbit in total). Test vectors are generated using: 1) pseudorandom data; 2) a complex-valued random Gaussian channel matrix H with statistically independent elements, updated every four channel uses; and 3) additive white Gaussian (circularly symmetric) complex random noise.
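    The test-vector recipe above can be sketched with the Python standard library alone. The code below generates one packet: 96 pseudorandom bits, one channel matrix H held for four channel uses, and noisy received vectors y = H s + n. The bit-to-symbol mapping (natural binary per axis, no Gray coding), the unit-variance channel entries, the absence of power normalization, and all names are assumptions of this sketch, since the section does not specify them.

```python
import random

N_TX = N_RX = 4
BITS_PER_SYMBOL = 6   # 64-QAM
CHANNEL_USES = 4      # H is held constant over these uses
PACKETS = 100_000

bits_per_packet = N_TX * BITS_PER_SYMBOL * CHANNEL_USES
total_bits = PACKETS * bits_per_packet

PAM8 = (-7, -5, -3, -1, 1, 3, 5, 7)


def map_64qam(bits):
    """Map six bits to a 64-QAM point (natural binary per axis, assumed)."""
    i = bits[0] * 4 + bits[1] * 2 + bits[2]
    q = bits[3] * 4 + bits[4] * 2 + bits[5]
    return complex(PAM8[i], PAM8[q])


def cgauss(rng, variance):
    """Circularly symmetric complex Gaussian sample with given variance."""
    sigma = (variance / 2) ** 0.5
    return complex(rng.gauss(0, sigma), rng.gauss(0, sigma))


def make_packet(rng, noise_variance):
    """Return (bits, H, received vectors) for one packet."""
    H = [[cgauss(rng, 1.0) for _ in range(N_TX)] for _ in range(N_RX)]
    bits = [rng.randint(0, 1) for _ in range(bits_per_packet)]
    per_use = N_TX * BITS_PER_SYMBOL
    received = []
    for use in range(CHANNEL_USES):
        chunk = bits[use * per_use:(use + 1) * per_use]
        s = [map_64qam(chunk[6 * a:6 * a + 6]) for a in range(N_TX)]
        y = [
            sum(H[r][c] * s[c] for c in range(N_TX)) + cgauss(rng, noise_variance)
            for r in range(N_RX)
        ]
        received.append(y)
    return bits, H, received


bits, H, received = make_packet(random.Random(1), noise_variance=0.1)
print(bits_per_packet, "bits per packet,", total_bits, "bits in total")
```

    Feeding such packets to a detector and counting bit errors over all 100 000 packets gives one point of a BER curve; sweeping `noise_variance` gives the SNR axis.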

    VIII. CONCLUSION

    A novel detection algorithm with an efficient architecture featuring efficient operation over infinite complex-domain lattices has been proposed. The proposed design is scalable both in terms of the number of transmit antennas and the constellation order. Efficient implementation of the subblocks results in the design with the highest throughput and the lowest area and energy consumption in the literature to date. The proposed design was implemented on both the FPGA and ASIC platforms. In the ASIC implementation, the proposed hard-output detector provided a sustained throughput of 1 Gb/s with an area of 340 kgates in a 0.13-µm CMOS process. Synthesis results in 90-nm CMOS show a potential throughput of 1.5 Gb/s.

    REFERENCES

    [1] Z. Guo and P. Nilsson, "Algorithm and implementation of the K-best sphere decoding for MIMO detection," IEEE J. Sel. Areas Commun., vol. 24, no. 3, pp. 491–503, Mar. 2006.
    [2] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, "Closest point search in lattices," IEEE Trans. Inf. Theory, vol. 48, no. 8, pp. 2201–2214, Aug. 2002.
    [3] K. W. Wong, C. Y. Tsui, R. S. K. Cheng, and W. H. Mow, "A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels," in Proc. IEEE Int. Symp. Circuits Syst., vol. 3. May 2002, pp. 273–276.
    [4] B. M. Hochwald and S. T. Brink, "Achieving near-capacity on a multiple-antenna channel," IEEE Trans. Commun., vol. 51, no. 3, pp. 389–399, Mar. 2003.
    [5] H.-L. Lin, R. C. Chang, and H. Chan, "A high-speed SDM-MIMO decoder using efficient candidate searching for wireless communication," IEEE Trans. Circuits, Syst. II, vol. 55, no. 3, pp. 289–293, Mar. 2008.
    [6] S. Chen, T. Zhang, and Y. Xin, "Relaxed K-best MIMO signal detector design and VLSI implementation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 3, pp. 328–337, Mar. 2007.
    [7] M. Shabany and P. G. Gulak, "A 675 Mb/s, 4 × 4 64-QAM K-best MIMO detector in 0.13 µm CMOS," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 1, pp. 135–147, Jan. 2012.
    [8] P. A. Bengough and S. J. Simmons, "Sorting-based VLSI architecture for the M-algorithm and T-algorithm trellis decoders," IEEE Trans. Commun., vol. 43, no. 2–4, pp. 514–522, Mar. 1995.
    [9] C. P. Schnorr and M. Euchner, "Lattice basis reduction: Improved practical algorithms and solving subset sum problems," Math. Programm., vol. 66, nos. 1–3, pp. 181–191, 1994.
    [10] B. Kim and I. C. Park, "K-best MIMO detection based on interleaving of distributed sorting," Electron. Lett., vol. 44, no. 1, pp. 42–43, Jan. 2008.
    [11] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, "K-best MIMO detection VLSI architectures achieving up to 424 Mb/s," in Proc. IEEE Int. Symp. Circuits Syst., May 2006, pp. 1151–1154.
    [12] M. Shabany and P. G. Gulak, "A 0.13 µm CMOS, 655 Mb/s, 64-QAM, K-best 4 × 4 MIMO detector," in Proc. IEEE Int. Solid State Circuits Conf., Feb. 2009, pp. 256–257.
    [13] M. Shabany, K. Su, and P. G. Gulak, "A pipelined high-throughput implementation of near-optimal complex K-best lattice decoders," in Proc. IEEE Int. Conf. Acoust., Speech, Signal, Apr. 2008, pp. 3173–3176.
    [14] E. M. Witte, F. Borlenghi, G. Ascheid, R. Leupers, and H. Meyr, "A scalable VLSI architecture for soft-input soft-output single tree-search sphere decoding," IEEE Trans. Circuits, Syst. II, vol. 57, no. 9, pp. 706–710, Sep. 2010.
    [15] P. Bhagawat, R. Dash, and G. Choi, "Systolic like soft-detection architecture for 4 × 4 64-QAM MIMO system," in Proc. IEEE Design, Autom. Test Eur. Conf. Exhibit., Jun. 2009, pp. 870–873.
    [16] S. Mondal, A. Eltawil, C. Shen, and K. Salama, "Design and implementation of a sort-free K-best sphere decoder," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 10, pp. 1497–1501, Oct. 2010.
    [17] C. Liao, T. Wang, and T. Chiueh, "A 74.8 mW soft-output detector IC for 8 × 8 spatial-multiplexing MIMO communications," IEEE J. Solid State Circuits, vol. 45, no. 2, pp. 411–421, Feb. 2010.
    [18] C. Studer, S. Fateh, and D. Seethaler, "ASIC implementation of soft-input soft-output MIMO detection using MMSE parallel interference cancellation," IEEE J. Solid State Circuits, vol. 46, no. 7, pp. 1754–1765, Jul. 2011.
    [19] M. Myllylä, J. Cavallaro, and M. Juntti, "Architecture design and implementation of the metric first list sphere detector algorithm," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 5, pp. 895–899, May 2011.

    Mojtaba Mahdavi received the M.Sc. degree in electrical engineering from the Sharif University of Technology, Tehran, Iran, in 2010.

    He was with the Advanced Integrated Circuit Design Laboratory, Sharif University of Technology, from 2010 to 2012. Currently, he is working on implementation of the Long Term Evolution (LTE-Advanced) System. His current research interests include digital VLSI architectures for digital signal processing algorithms, VLSI communication systems, digital integrated circuit design, and field-programmable gate array-based systems.

    Mr. Mahdavi was on the subcommittee for the International Solid-State Circuits Conference from 2005 to 2008.

    Mahdi Shabany received the B.Sc. degree in electrical engineering from the Sharif University of Technology, Tehran, Iran, in 2002, and the M.Sc. and Ph.D. degrees in electrical engineering from the University of Toronto, Toronto, ON, Canada, in 2004 and 2008, respectively.

    He is an Assistant Professor with the Electrical Engineering Department, Sharif University of Technology. From 2007 to 2008, he was with Redline Communications Company, Toronto, where he developed and patented designs for WiMAX systems. He served as a Post-Doctoral Fellow with the University of Toronto in 2009. He holds two U.S. patents. His current research interests include digital electronics and VLSI architecture and algorithm design for broadband communication systems.

