Application-Specific Processor for MIMO-OFDM Software-Defined Radio

Diss. ETH No. 18582


A dissertation submitted to
ETH ZURICH

for the degree of
Doctor of Sciences

presented by
STEFAN EBERLI
Dipl. El.-Ing. ETH
born 15.4.1978
citizen of Zürich (ZH) and Hüttwilen (TG)

accepted on the recommendation of
Prof. Dr. W. Fichtner, examiner
Prof. Dr. H. Meyr, co-examiner

2009

Acknowledgments

I would like to express my gratitude to Prof. Dr. Fichtner, who gave me the opportunity to pursue my Ph.D. at the Integrated Systems Laboratory (IIS), in an excellent scientific environment and with excellent colleagues. His concise advice kept me focused on the project without losing the big picture. My gratitude also goes to Prof. Dr. Meyr for reading, correcting, and co-examining this thesis.

I am very thankful to Dr. Norbert Felber and Dr. Hubert Kaeslin for their support during all the years at the IIS, and for their valuable input while proof-reading this manuscript – thank you Norbert, e grazie mille Hubert. Next, I would like to acknowledge Prof. Dr. Burg. He has been a reference since the beginning and his precious and practical advice guided me many times – thank you Andy.

BridgeCo AG has been the industrial partner of this Ph.D. project. Among the many colleagues at BridgeCo, I am especially thankful to Thomas Thaler, who contributed to setting up the project; and to Dr. Markus Thalmann, Matthias Tramm, and Dr. Manfred Stadler for their support and the fruitful discussions and advice on processors. Thanks also go to the Swiss innovation promotion agency CTI, which enabled the project.

Among the colleagues at the IIS, my gratitude goes to Dr. David Perels for his careful guidance in the early stage of my thesis. I was also very fortunate to be able to discuss with top-class, brilliant colleagues such as Christoph Studer, Felix Bürgin, Flavio Carbognani, Frank, Jürg Treichler, Marc Wegmüller, Markus Wenk, Matthias Brändli, Peter Lüthi, Simon Häne, and Stephan Oetiker.

At the Communication Theory Group, Davide Cescato helped me a lot with mathematical discussions and contributed the divide-and-conquer matrix inversion method – grazie Davide.

I would also like to thank the following students for their contributions to this thesis: Luca Henzen, Christoph Pedretti, and Lorenzo Bardelli (master thesis); and Benjamin Dietrich and Lukas Haas (semester thesis).

Finally, I owe much to my family, and to Mattea and Caterina – thanks for coloring life ♥.

Abstract

Software-defined radios (SDRs) present a promising approach to face the demands of today’s fast-evolving environment of wireless communication standards. Ever-increasing requirements in terms of performance and flexibility call for programmable, high-performance signal processing platforms. Performance is necessary to cope with the high computational burden inherent to wireless communications. Flexibility is desired to shorten the time-to-market. Unfortunately, performance and flexibility are antagonists and are difficult to combine in one single efficient platform, as one is obtained at the cost of the other.

This thesis contributes to the SDR research domain, addressing the implementation of a 2×2 MIMO-OFDM receiver on an SDR platform. Appropriate receiver algorithms are evaluated and the corresponding computational complexity is derived. The use of low-complexity algorithms is imperative to spare the limited processing resources of a programmable platform. Three software-programmable architectures are evaluated to find a suitable SDR platform, eventually leading to the selection of a design-time configurable application-specific processor as platform. The 2×2 MIMO-OFDM receiver is split into two parts which are mapped onto two application-specific processors, each tailored to the computational needs of the associated digital signal processing kernels. The first processor performs the per-stream MIMO-OFDM processing. The second processor handles the MIMO detection. Finally, the 0.18 µm 1P/6M CMOS technology layout of both fabricated application-specific processors is presented: the silicon area required by the two processors is 7.65 mm² and real-time baseband processing is possible on these engines running at a clock frequency of 250 MHz.


Zusammenfassung

Software-defined radios (SDRs) are a promising approach for adapting quickly to today's rapidly evolving environment of wireless communication standards. Ever-increasing requirements, expressed in terms of performance and flexibility, call for programmable and powerful signal processing platforms. Performance is necessary to cope with the high computational effort inherent to wireless communication. Flexibility is desired to shorten the time-to-market. Unfortunately, performance and flexibility are antagonists and are difficult to combine in a single, efficient, and powerful platform, since one is achieved at the expense of the other.

This work contributes to the SDR research field and addresses the implementation of a 2×2 MIMO-OFDM receiver on an SDR platform. To this end, suitable receiver algorithms are evaluated and the corresponding computational effort is derived. The use of algorithms with low computational effort is mandatory in order to save the precious computing resources of a programmable platform. Three software-programmable architectures are evaluated to find a suitable SDR platform. A configurable, application-specific processor is finally selected as the platform. The 2×2 MIMO-OFDM receiver is then split into two parts, which are mapped onto two such application-specific processors, with the computational units of the two processors tailored to the characteristics of the algorithms. The first processor performs the processing of the MIMO-OFDM data stream. The second processor handles the MIMO detection. Finally, the 0.18 µm 1P/6M CMOS layout of both fabricated processors is presented: the silicon area of the two processors together amounts to 7.65 mm², and real-time data processing is possible at a clock frequency of 250 MHz.

Contents

Acknowledgments

Abstract

Zusammenfassung

1 Introduction
  1.1 Motivation – Mobility and Wireless Communications
  1.2 This Thesis
  1.3 Outline

2 State of the Art
  2.1 Design Considerations
    2.1.1 Flexible architectures
    2.1.2 Technology scaling
    2.1.3 Real-valued vs. complex-valued functional units
  2.2 Flexible Architectures for OFDM Baseband Processing
    2.2.1 Academic players
    2.2.2 Relevant examples for industrial implementations
  2.3 Flexible Architecture for MIMO-OFDM Baseband Processing
  2.4 Summary and Discussion

3 Algorithms and Computational Complexity
  3.1 MIMO-OFDM System Model
  3.2 Performance and Computational Complexity Metrics
  3.3 Choice of the MIMO Detector
    3.3.1 Brute-force maximum-likelihood (ML)
    3.3.2 Sphere decoding (SD)
    3.3.3 K-Best (KB)
    3.3.4 Successive interference cancellation (SIC)
    3.3.5 Linear detection
    3.3.6 Results and conclusion
  3.4 Linear MMSE Detection
    3.4.1 Adjoint method
    3.4.2 LR-decomposition
    3.4.3 LDL-decomposition
    3.4.4 GS-decomposition
    3.4.5 QR-decomposition
    3.4.6 Rank-1 update
    3.4.7 Divide-and-Conquer algorithm
    3.4.8 Results and conclusion
  3.5 MIMO-OFDM Receiver Algorithms
    3.5.1 Frame-start detection
    3.5.2 STF processing
    3.5.3 LTF processing
    3.5.4 MIMO channel processing
    3.5.5 Data processing
    3.5.6 Computational complexity of the presented algorithms
  3.6 Summary and Conclusion

4 Design Space Exploration
  4.1 C6455
    4.1.1 SISO-OFDM transceiver
    4.1.2 Results, discussion, and conclusion
  4.2 MSEC4
    4.2.1 Architecture details
    4.2.2 Results, discussion, and conclusion
  4.3 ASPE
    4.3.1 Architecture
    4.3.2 SISO-OFDM receiver
    4.3.3 Results, discussion, and conclusion
  4.4 Summary and Conclusion

5 MIMO-OFDM SDR Receiver
  5.1 SDR Platform Overview
    5.1.1 The Modified ASPE
    5.1.2 Receiver Architecture
    5.1.3 Common ASPE A and ASPE B configuration
  5.2 ASPE A – MIMO-OFDM Processing
    5.2.1 Datapath configuration
    5.2.2 BB processing on ASPE A
  5.3 ASPE B – MIMO Detection
    5.3.1 Datapath configuration
    5.3.2 BB processing on ASPE B
  5.4 Dictionary Based Program-Code Compression
    5.4.1 Reference design
    5.4.2 DBCC with NOP bitmask
  5.5 Implementation Results

6 Summary and Conclusions

A MIMO Detection Methods
  A.1 Sphere Decoding
  A.2 K-Best
  A.3 Successive Interference Cancellation
  A.4 Linear Detection – Matrix Decomposition and Inversion Methods
    A.4.1 Adjoint method
    A.4.2 LR-decomposition
    A.4.3 LDL-decomposition
    A.4.4 GS-decomposition
    A.4.5 QR-decomposition
    A.4.6 Rank-1 update method
    A.4.7 Divide-and-conquer method

B Datasheet

Chapter 1

Introduction

1.1 Motivation – Mobility and Wireless Communications

The beginnings. Long-distance wireless communication across the Atlantic Ocean was inaugurated in 1901 by Marconi. On Signal Hill in Newfoundland (CA) he received a message transmitted from Cornwall (GB). Since then, the domain of wireless communications has evolved enormously and, with the advances in both hardware and wireless technology, wireless communications have diffused into everyone’s daily life.

Diffusion to mass market. In the late 1980s, the definition of mobile phone communication standards prepared the proliferation of economically affordable, mobile wireless communications to the mass market, culminating today in penetration rates near 100 %. The success of mobile wireless communications was – and still is – fueled by the human need for communication and social interaction, combined with a global environment demanding flexibility and mobility.

This need for mobility is also well reflected and sustained by the rapid development and deployment of portable personal computers where, again, wireless communications played an important role. Initially, wired local area networks (LANs) were conceived for large universities and research laboratories to interconnect their computers. In the early 1990s, the Internet’s tentacles began to grow dramatically, interconnecting an increasing number of LANs via metropolitan area networks (MANs). The Internet diffused into the homes, providing the mass market with the possibility to share and disseminate information across almost the entire globe. In the late 1990s, wireless communication was introduced into the LAN infrastructure thanks to the global standardization effort, which led to the initial IEEE 802.11 wireless LAN (WLAN) standard (2 Mbit/s). Wired LAN connections could finally be replaced by wireless ones, and gradually, new wireless access points appeared. Nowadays, train stations, airports, and even coffee bars provide WLAN infrastructure, leading to almost ubiquitous Internet connectivity. MANs are experiencing a similar fate, and wireless links have started to replace wired last-mile connections.

Today: plethora of standards. Despite all attempts at global standardization, the world of (consumer) wireless communications is populated by differing standards. These standards are tailored to the particular needs of their own application domain, employing the appropriate modulation techniques to best exploit the application’s wireless channel. To name a few, the WMAN IEEE 802.16 standard, the IEEE 802.11a/b/g WLAN standards, the Bluetooth wireless personal area network (WPAN) standard, as well as the mobile phone GSM standard all populate this heterogeneous world. Figure 1.1 illustrates the situation. Today, the interaction between services relying on these different standards is crucial for the mobile end-user. For instance, a calendar has to be synchronized between laptop and mobile phone via Bluetooth, and, at the same time, the mobile phone has to support GSM and possibly UMTS for its original duty.

While infrastructure components can concentrate on a single wireless standard and do not pose overly tight requirements on power consumption, mobile terminals (e.g., mobile phones, laptops) evolve towards multi-standard platforms that are very sensitive to power consumption. For conventional mobile terminals, this means the integration of one dedicated, hardwired modem for each supported standard. The result is an efficient but complex platform that has long redesign times and little flexibility. This is an unacceptable condition in this rapidly evolving world.

Figure 1.1: Wireless networks form a heterogeneous environment (source [1]).

SDR. Innovative solutions group multiple standards onto a single, programmable and thus flexible platform, pointing towards the software-defined radio (SDR)¹. The ideal SDR is capable of handling all imaginable standards, from the radio frequency (RF) front-end to the baseband processing and up to MAC layer processing – all on the same platform. The platform on which the SDR resides is programmed in software, enabling run-time adaptation to each particular standard. For instance, a single SDR chip embedded in a mobile phone would provide the connectivity to the mobile phone network, the WLAN, and the WPAN; either in mutual exclusion, or even concurrently, time-sharing the hardware resources to process the different standards.

¹ www.sdrforum.org

Compared to dedicated hardwired solutions, the advantages of SDR platforms are evident. They reduce the time to market due to the inherent programmability and design re-use. Algorithms can be improved or adapted to evolving standards after chip fabrication, and both hardware and software bugs can be fixed. In other words, they perfectly fit this heterogeneous and rapidly evolving habitat.

Unfortunately, the stringent – and, when combined, somewhat utopian – exigencies of the ideal SDR pose significant challenges to all components of the underlying platform. With today’s technology, the ideal SDR is unrealistic and, indeed, no such fully generic SDR exists. Although the advances in silicon integration technology, expressed by Moore’s law [2] as exponential in time, bring some relief, the concurrent complexity increase associated with the exponential growth of communication data rates predicted by Edholm’s law [3] partially nullifies these advances.

In this scenario, research into an efficient hardware platform contributing to the SDR realization is imperative. It appears reasonable to step back from the ambition of directly implementing the ideal SDR. Current research efforts focus on the different components of the SDR (i.e., the RF front-end, the baseband processing, etc.) separately. In the baseband processing domain – in which this thesis is situated – an important aspect to be considered is the platform, or the flexible architecture (FA) employed to implement the SDR functionality. This FA has to have the right balance between flexibility and the processing performance required to support the considered standards. Commercial SDRs are appearing as standard-specific baseband software-programmable architectures that address the relatively low data rates of mobile phone standards (e.g., [4, 5]). For high data rates, the implementation on the limited processing resources of an FA is challenging and commercial solutions are still lacking.

1.2 This Thesis

This thesis contributes to the SDR development by concentrating on the WLAN domain and by implementing the relevant MIMO-OFDM baseband processing on an FA. Instead of pursuing the ambition of designing a fully generic platform, the more pragmatic standard-specific approach is followed.

In the WLAN domain, the push towards ever higher data rates is well documented by the numerous amendments to the initial IEEE 802.11 standard. Orthogonal frequency-division multiplexing (OFDM) is included as modulation technique in the 802.11a-1999 supplement, enabling data rates up to 54 Mbit/s. Currently, the use of multiple antennas at both ends of the wireless link – commonly referred to as multiple-input multiple-output (MIMO) – is being integrated in what will become the 802.11n supplement [6, 7, 8].

With respect to single-antenna communication systems (single-input single-output, SISO), MIMO systems deliver significant performance gains that manifest themselves in a broader coverage, a higher throughput, or a more reliable wireless link. For the transmission over a wideband frequency-selective channel, MIMO is typically combined with OFDM (as in the 802.11n supplement), which divides the wideband frequency-selective channel into a set of narrowband, frequency-flat, parallel subchannels. As a result, the communication is more robust in multipath-propagation environments and the channel equalization at the receiver becomes less computationally intensive.

However, compared to SISO-OFDM, the signal processing load inherent to MIMO-OFDM communication is significant: it scales at least with the number of transmit antennas and it depends on the employed receiver algorithm. Therefore, for real-time operation on FAs, only low-complexity MIMO-OFDM transceivers with computationally efficient algorithms can be considered.

In this thesis, three different FAs are assessed by mapping computationally hard OFDM baseband processing kernels onto the following architectures: the Texas Instruments high-end C6455 DSP, a special-purpose baseband processor, and a design-time customizable application-specific instruction-set processor (ASIP) developed in the predecessor’s Ph.D. thesis [9]. Eventually, the proposed ASIP qualifies best, and proceeding further on this track for the 2×2 MIMO-OFDM baseband implementation is worthwhile.

Contributions. In summary, the two main contributions of this thesis are:

• The implementation of the complete IEEE 802.11a baseband processing (except Viterbi decoding) on an ASIP, published in [10] and described in Section 4.3. This contribution shows that the real-time implementation of OFDM baseband processing on the selected ASIP is possible, enabling data rates up to 54 Mbit/s. Compared to other work in the domain, the solution presented here is very competitive in terms of silicon area.

• The implementation of the relevant baseband processing kernels of a 2×2 MIMO-OFDM receiver on a pair of properly customized ASIPs, published in [11] and detailed in Chapter 5. The comparison of the proposed baseband receiver with published related results [12] indicates that – thanks to appropriate design-time customization and to low-complexity algorithms – the solution presented here is very area-efficient. In addition, it suggests that ASIPs are appropriate and efficient vehicles for the further development of baseband processing in SDRs.

A number of minor contributions were necessary to construct and consolidate the two main contributions:

• The classification of FAs that are employed for OFDM baseband processing and are reported in the open literature. The classification clarifies the options available to design FAs and the obtained performance. The material is presented in Chapter 2.

• The detailed analysis of existing MIMO detection algorithms, performed with special attention to computational complexity, in view of the implementation on an FA. The analysis can be used as a reference for future work in the domain, for instance, when more sophisticated detectors are required. The material is presented in Chapter 3, partly in [11] and [13].

• The mapping of hard computational kernels, identified during the algorithm evaluation, onto different FAs, which permits their assessment for (MIMO-)OFDM processing. The material is presented in Chapter 4.

• Finally, this thesis acts as proof of concept for the design-time customizable ASIP framework proposed in the predecessor’s Ph.D. thesis [9].


1.3 Outline

The remainder of this thesis is organized as follows:

Chapter 2 reviews related work. FAs implementing OFDM-related baseband processing tasks are presented. Relevant figures of merit are gathered and/or extrapolated from the open literature. The final discussion allows a comparison of the different FAs.

Chapter 3 is dedicated to the algorithms. After an introduction to the domain of MIMO-OFDM wireless communications, two crucial MIMO-OFDM design considerations are made. First, the computational complexities of several MIMO detectors are compared, leading to the selection of linear MMSE detection as an appropriate one. Second, algorithms to compute linear MMSE detectors are assessed with respect to their computational complexity and BER performance. Finally, the complete baseband processing for the MIMO-OFDM receiver considered in this thesis is described and the associated computational complexity is derived.

Chapter 4 evaluates three different FAs by mapping computationally hard OFDM baseband processing kernels onto each one of them. The evaluation suggests considering the design-time customizable ASIP for the case study described in the subsequent chapter.

Chapter 5 details the implementation of the relevant baseband processing kernels of the 2×2 MIMO-OFDM receiver. The receiver is split onto two properly configured ASIPs. The chapter concludes with the comparison to the only known related work [12] and gives reference figures for the silicon complexity.

Chapter 6 summarizes and discusses the achieved results and draws the appropriate conclusions.

Chapter 2

State of the Art

This chapter reviews the literature on flexible architectures that are employed as SDR platforms for OFDM baseband processing in wireless communication systems. The review includes selected examples from both academia and industry, and it aims at presenting the architecture design environment in which this thesis is situated.

Flexible architectures are briefly described and characterized by their area, processing performance, and power consumption. At the end of the chapter, these characteristics are presented in a unified manner to get an overview of the domain. Before the actual literature review, a few concepts and considerations related to the design of FAs need to be introduced.

2.1 Design Considerations

2.1.1 Flexible architectures

The term flexible architecture (FA) unifies two architecture categories, namely software-programmable architectures (SPAs) and reconfigurable architectures (RAs). Figure 2.1, on the left, delineates the generic FA concept with a block diagram. FAs allow their datapath (DP) to operate in several modes or configurations through F control bits that are determined by the control path (CP). Inside the DP, a number of functional units (FUs) perform the actual data processing. Memory stores the data to be processed, as well as instructions or configurations. Data and instructions or configurations may also be stored in separate memories.

Figure 2.1: Left: Flexible architecture (FA) block diagram, CP: control path, DP: datapath. Right: Software-programmable architectures (SPAs), reconfigurable architectures (RAs), and a combination of these (SPA+RA) are FA sub-classes.

SPAs determine their DP operation with a sequence of instructions that are fetched from the instruction memory: the DP can change its operation mode at each clock cycle. However, the instruction set architecture is determined at design time, and it cannot change after the SPA is fabricated. The desired functionality is implemented by selecting and sequencing the appropriate instructions into program code, ideally with the aid of a software development tool. RAs, instead, keep one DP configuration F for several clock cycles, and, commonly, a change of the DP configuration requires several clock cycles to take place. The configurations are determined after design time by the user/programmer and are loaded into the configuration memory during an initialization phase. A combination of SPA and RA (SPA+RA) results in part of the DP being able to change operation mode at each clock cycle, and another part – the reconfigurable one – continuing with the same configuration for many clock cycles. This classification is depicted on the right side of Figure 2.1.
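The difference between the two categories can be illustrated with a toy cycle-count model. This is only a sketch under stated assumptions: the function names, the example schedule, and the reconfiguration cost of four cycles are hypothetical and do not come from any of the reviewed architectures, and the initial configuration load is not counted.

```python
def spa_cycles(schedule):
    """SPA: the DP may change its operation mode at every clock cycle,
    so one cycle per scheduled operation suffices (toy assumption)."""
    return len(schedule)


def ra_cycles(schedule, reconfig_cycles):
    """RA: every change of the DP configuration costs `reconfig_cycles`
    additional clock cycles (hypothetical, constant penalty)."""
    changes = sum(1 for prev, cur in zip(schedule, schedule[1:]) if prev != cur)
    return len(schedule) + changes * reconfig_cycles


# Hypothetical sequence of DP operation modes, one entry per data item.
schedule = ["mul", "mul", "add", "mul", "add", "add"]

print(spa_cycles(schedule))                    # one cycle per operation
print(ra_cycles(schedule, reconfig_cycles=4))  # three mode changes are penalized
```

In this simplified view, an RA pays off when the same configuration is kept for long runs of data, whereas an SPA tolerates frequent mode changes.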

Processing performance. The peak processing performance of FAs is commonly reported in millions of operations per second (MOPS). It is derived by multiplying the number of functional units (FUs) that can be addressed in parallel during one clock cycle by the operating clock frequency. Conventionally, to compute the processing performance, all operations that can be executed in parallel are taken into account (e.g., load/store, data processing operations, address generation operations). The processing performance obtained in this way is a qualitative measure that allows only a rough comparison between differing architectures. Because the operations on the various architectures are not necessarily the same, comparisons are error-prone.

In this thesis, for the comparison with the computational load inherent to the algorithms presented in Chapter 3, the processing performance (PP) is defined as the millions of data-operations per second (MdOp/s) an FA can execute. This unit considers only the data-processing-relevant operations that can be executed in parallel on the FA, multiplied by the operating clock frequency. The relevant data processing operations are those typical of digital signal processors: additions, subtractions, multiplications, and combinations of these.
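The two metrics can be contrasted with a short sketch. The list of parallel operations below describes a hypothetical FA, not one of the reviewed architectures, and counting each FU as exactly one operation per clock cycle is an assumption made for illustration.

```python
# Operations counted as data operations (additions, subtractions,
# multiplications); combined operations are left out of this sketch.
DATA_OPERATIONS = {"add", "sub", "mul"}


def peak_mops(parallel_ops, clock_mhz):
    """Peak performance in MOPS: all operations issued in parallel per
    cycle, multiplied by the clock frequency in MHz."""
    return len(parallel_ops) * clock_mhz


def processing_performance(parallel_ops, clock_mhz):
    """PP in MdOp/s: only the data processing operations are counted."""
    data_ops = [op for op in parallel_ops if op in DATA_OPERATIONS]
    return len(data_ops) * clock_mhz


# Hypothetical FA: eight operations per clock cycle, four of them data operations.
fus = ["load", "store", "agu", "agu", "mul", "mul", "add", "sub"]

print(peak_mops(fus, clock_mhz=250))               # MOPS
print(processing_performance(fus, clock_mhz=250))  # MdOp/s
```

For this hypothetical FA, the MdOp/s figure is half of the MOPS figure, which illustrates why the two metrics should not be mixed when comparing architectures.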

2.1.2 Technology scaling

Important figures of merit that characterize ICs include silicon area, processing performance (e.g., the throughput), and energy consumption. These figures of merit are related to one IC realization in a given CMOS technology. Since the architectures reported in the literature may be implemented in different technologies, their figures of merit cannot be compared directly to each other, especially when considering architecture design aspects.

Therefore, in this thesis, these technology-dependent figures of merit are normalized to a 0.18 µm CMOS technology. Although scaling to a reference technology represents an approximation, it allows more meaningful conclusions. The scaling from the original technology, utilized in the considered publication, to the 0.18 µm reference technology is performed according to [14], assuming devices and wires scale equally with 1/α_D:

A_{0.18} = A \cdot \frac{1}{\alpha_D^2}    (2.1)

f_{0.18} = f \cdot \alpha_D    (2.2)

P_{0.18} = P \cdot (\varepsilon / \alpha_D)^2,    (2.3)


where A, f, and P stand for the area, clock frequency, and power consumption in the original technology and the normalized quantities are indicated by the 0.18 subscript. The technology scaling factor α_D = x/0.18 is derived from the half-minimum-feature-size x in the original technology, so that a design originating from a coarser technology (α_D > 1) shrinks in area and gains in clock frequency when normalized. Although not standardized, the CMOS technology’s name commonly indicates approximately the associated half-minimum-feature-size x: for instance, in a 0.13 µm CMOS technology x = 0.13. The correction factor ε reflects the voltage scaling V_{0.18} = V · ε/α_D from the original to the 0.18 µm reference technology (typically ε ≈ 1).

Figure 2.2 illustrates the area-time product curves (AT-plot, [15]) for a 16 bit adder and a 16 bit multiplier, synthesized for 0.25 µm, 0.18 µm, and 0.13 µm UMC CMOS technologies. The curves have been generated by synthesizing the corresponding circuit with Synopsys Design Compiler (Z-2007.03-SP3) and applying the timing constraints indicated by the curve’s markers. The blue circles show the results obtained when taking the 0.25 µm CMOS technology as starting point and scaling the two designs to the 0.18 µm (α_D = 0.25/0.18) and 0.13 µm (α_D = 0.25/0.13) technologies, according to (2.1) and (2.2). The scaled results match the synthesis results well: the distance between each scaled result point and the corresponding synthesized result curve is minimal.
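The normalization (2.1)–(2.3) can be sketched in a few lines. The sketch takes α_D as the ratio of the original to the target feature size, as in the 0.25 µm to 0.18 µm example above (α_D = 0.25/0.18). The function name and the numerical values of the example design are hypothetical.

```python
def scale_to_reference(area, freq, power, x, x_ref=0.18, eps=1.0):
    """Scale area, clock frequency, and power from a technology with
    feature size x [um] to the reference technology x_ref [um].

    alpha_d is the ratio of original to reference feature size;
    eps is the voltage-scaling correction factor (typically about 1).
    """
    alpha_d = x / x_ref
    return {
        "area": area / alpha_d ** 2,            # (2.1)
        "freq": freq * alpha_d,                 # (2.2)
        "power": power * (eps / alpha_d) ** 2,  # (2.3)
    }


# Hypothetical design reported in a 0.25 um technology:
# 1.0 mm^2, 100 MHz, 10 mW.
scaled = scale_to_reference(area=1.0, freq=100.0, power=10.0, x=0.25)
print({name: round(value, 3) for name, value in scaled.items()})
```

With ε = 1, area and power shrink by the same factor (0.18/0.25)², while the clock frequency increases by 0.25/0.18.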

2.1.3 Real-valued vs. complex-valued functional units

Most baseband algorithms operate on complex-valued numbers. Since conventional FAs provide FUs that perform real-valued arithmetic operations, it is necessary to map the complex-valued operations into real-valued ones – which is easily done (cf. Section 3.2). However, an interesting question is whether the support of complex-valued operations by the FA’s datapath is desirable or not.

Figure 2.3 reports the AT-plot for a real-valued and a complex-valued multiplier-FU, both synthesized for a 0.18 µm CMOS technology. The two synthesized units are shown in Figure 2.4. While the real-valued FU requires six clock cycles to compute the complex-valued multiplication, the complex-valued FU requires just one clock cycle for the same task.¹ This difference is visible in the AT-plot, where the isomorphic, complex-valued unit can attain the higher throughput (at the cost of an area increase), whereas the decomposed real-valued unit occupies the smallest area (at the penalty of a lower throughput). As shown by the black dashed curve, the AT-efficiency of the two units is nearly the same, with the complex-valued unit being slightly more efficient. The operation with FUs performing only real-valued arithmetic results in a lower throughput than that with complex-valued FUs, since the circuit cannot be synthesized for timing constraints below 12 ns per data item. Thus, it can be stated that for algorithms requiring high throughput and mostly operating on complex-valued numbers, it is convenient to incorporate complex-valued FUs into the FA’s datapath.

Figure 2.2: Area-time product curves (AT-plot) for 0.25 µm, 0.18 µm, and 0.13 µm CMOS technology (umc250, umc180, umc130). Top: 16 bit adder, Bottom: 16 bit × 16 bit multiplier. Axes: longest path [ns] versus area [µm²].

¹ The computation of the complex-valued multiplication $C = A \cdot B$ is divided into six real-valued steps: $s_1 = \Re\{A\}\,\Re\{B\}$, $s_2 = \Im\{A\}\,\Im\{B\}$, $s_3 = \Re\{A\}\,\Im\{B\}$, $s_4 = \Im\{A\}\,\Re\{B\}$, $s_5 = s_1 - s_2$, $s_6 = s_3 + s_4$, where the $\Re\{\cdot\}$ and $\Im\{\cdot\}$ operators return the real and imaginary parts of their argument, respectively. The complex-valued result is $C = s_5 + j\,s_6$.
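The six-step decomposition given in the footnote can be sketched in Python as follows. The function name is illustrative and does not appear in the thesis; each assignment corresponds to one real-valued step, i.e. one clock cycle on the real-valued FU.

```python
def complex_mul_six_steps(a: complex, b: complex) -> complex:
    """Multiply two complex numbers using only real-valued operations."""
    s1 = a.real * b.real  # step 1: Re{A} * Re{B}
    s2 = a.imag * b.imag  # step 2: Im{A} * Im{B}
    s3 = a.real * b.imag  # step 3: Re{A} * Im{B}
    s4 = a.imag * b.real  # step 4: Im{A} * Re{B}
    s5 = s1 - s2          # step 5: real part of the product
    s6 = s3 + s4          # step 6: imaginary part of the product
    return complex(s5, s6)


product = complex_mul_six_steps(1 + 2j, 3 + 4j)
```

A complex-valued FU would compute all six steps in a single cycle with four multipliers and two adders, which is the area-for-throughput trade discussed above.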


[Figure 2.3 plot: area [µm²] versus time per data item (T) [ns], AT-plot for UMC180, with curves for CMUL and RMULADD and a dashed curve marking AT = 410700 µm² ns.]

Figure 2.3: AT-plot for complex-valued multiplication with real-valued functional unit (RMULADD) and complex-valued functional unit (CMUL).

[Figure 2.4 schematics: the left unit takes the 16 bit inputs Re{A}, Im{A}, Re{B}, Im{B} and delivers the 16 bit outputs Re{C} and Im{C}; the right unit takes the 16 bit inputs X and Y, delivers the output Z, and contains a result selector.]

Figure 2.4: Left: complex-valued multiplication unit. Right: real-valued unit able to compute one complex-valued multiplication in six steps, or clock cycles.


2.2 Flexible Architectures for OFDM Baseband Processing

2.2.1 Academic players

Reconfigurable datapath (RD) The potential of run-time reconfigurable hardware in supporting different standards that exploit the same modulation technique is explored in [16] (2003). The RD is designed to support synchronization and demapping of IEEE 802.11a and HiperLAN/2 standards, both relying on OFDM. Other tasks that are also required by the two standards, as for instance the computation of a fast Fourier transform (FFT) and Viterbi decoding, are deemed too computationally intensive and delegated to dedicated hardware blocks.

The RD is depicted in Figure 2.5. The datapath is steered by a controller and one of three possible configurations, stored in a configuration memory, is applied to the datapath. The configuration bits for the three operation modes are determined without the aid of a software tool, i.e. by hand. Small memories for temporary data storage are distributed across the datapath. The architecture falls into the category of RAs.

The circuit is realized in 0.35 µm CMOS technology and runs at a clock frequency of 100 MHz [16]. No power consumption figures are reported. The area of the circuit is normalized to the area of one 8 bit multiplier, and amounts to 143. The area savings obtained by sharing parts of the datapath instead of implementing distinct units for each computation amount to 20 % [16]. The datapath has a wordwidth of 8 bit and incorporates nine multiplier units, one divider, and twelve adders (the two CORDIC units are neglected). Consequently, the processing performance is PP = 2′200 MdOp/s.²

RaPiD In [17] (2004) a 4 antenna OFDM receiver has been implemented on the RaPiD architecture [18] (1996). The receiver performs the timing synchronization necessary to detect an OFDM frame and to align the received symbol boundaries. The FFT on the four received streams, required to demodulate the received data, is performed as well.

² PP = 22 dOp × 100 MHz = 2′200 MdOp/s.
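The peak processing performance used throughout this chapter is simply the number of datapath operations that can be issued per clock cycle multiplied by the clock frequency. A minimal sketch for the RD figures quoted above follows; the function and variable names are illustrative, and the assumption that every counted unit completes one operation per cycle is the same optimistic assumption made in the text.

```python
def peak_processing_performance(ops_per_cycle: int, clock_mhz: int) -> int:
    """Peak PP in MdOp/s, assuming every counted FU is busy in every cycle."""
    return ops_per_cycle * clock_mhz


# FU counts of the reconfigurable datapath as listed in the text
# (the two CORDIC units are neglected, as in the original estimate).
rd_units = {"multiplier": 9, "divider": 1, "adder": 12}

rd_ops_per_cycle = sum(rd_units.values())                    # 22 dOp per cycle
rd_pp = peak_processing_performance(rd_ops_per_cycle, 100)   # MdOp/s
```

The same helper reproduces the other footnoted estimates, for example `peak_processing_performance(4 * 32, 400)` for the four 32-way SIMD PEs of SODA.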


[Figure 2.5 diagram. Blocks shown: multipliers (*), adders and subtractors (+, -, +-), four inverters (Inv), CORDIC1 and CORDIC2, two phase blocks, and Memories 1a, 1b, 2a, 2b, 3, and 4, connected through the elements labeled m1 to m10 and m_o1 to m_o4.]

Figure 2.5: Reconfigurable datapath block diagram (source [16]).


[Figure 2.6 diagram. Blocks shown: external memory, external sensors, stream manager with FIFOs for the input and output streams, MULT, ALU, RAM, REG, and user-defined FU blocks on the data bus, configurable interconnect, configurable instruction decode, instruction generator, and control.]

Figure 2.6: RaPiD block diagram (source [18]).

The same receiver tasks are also mapped onto an ASIC, an FPGA, and a DSP for comparing the achieved performance-over-cost. One of the conclusions in [17] is that RAs fill the performance-over-cost gap between ASICs and DSPs. It is found that there is a six-fold increase in complexity compared to an ASIC implementation, while the cost is reduced by a factor of six compared to an implementation on conventional DSPs. Thus, compared to ASICs, lower NRE-cost production is possible, while realizing a higher performance-over-cost than on DSPs.

The RaPiD architecture is reported in Figure 2.6. Its datapath consists of a heterogeneous, linear array of FUs. The FUs are ALUs, multipliers (MULT), registers (REG), and storage units (RAM) that are connected through a configurable interconnect network. The number and type of FUs are scalable, and determined at design-time. The interconnect is built by multiplexers that select the input to the functional units, by tristate buffers for driving the output of functional units onto the wanted bus, and by bus connectors that split long bus segments into smaller ones, enabling concurrent utilization of two bus segments belonging to the same long bus.

The configuration bits required to operate the RaPiD architecture are divided into hard and soft configuration bits. Hard configuration


bits (as for example those controlling the bus inter-connectors) do not change while an application is running, whereas soft configuration bits (e.g., control bits for functional units) can change at each clock cycle. The control part is sophisticated and implements instruction compression through the addition of instruction repeat counters and automatic loop generation. RaPiD is programmed by means of the RaPiD-C language. According to the definition at the beginning of this chapter, the RaPiD architecture falls into the category of SPA+RA.
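The idea behind instruction repeat counters can be illustrated with a run-length style expansion of soft configuration words. This is only a sketch of the general principle: the actual RaPiD control encoding is not given in the text, and the `(word, repeat)` format and all names below are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def expand_soft_config(compressed: Iterable[Tuple[int, int]]) -> Iterator[int]:
    """Yield one soft control word per clock cycle from (word, repeat) pairs."""
    for word, repeat in compressed:
        for _ in range(repeat):
            yield word


# Hypothetical compressed control program: three stored entries
# describe six clock cycles of soft configuration bits.
program: List[Tuple[int, int]] = [(0b0001, 3), (0b0110, 1), (0b0001, 2)]
per_cycle = list(expand_soft_config(program))
```

Hard configuration bits would be written once before the application starts and are therefore not part of this per-cycle stream.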

The figures of merit reported in [19] refer to one RaPiD benchmark cell. The datapath of one such RaPiD benchmark cell has a wordwidth of 16 bit, and includes 1 multiplier and 3 ALUs. Its area is 5.07 mm² in 0.5 µm CMOS technology at 100 MHz. The estimated power consumption lies between 1.9 W, for performing a 16-tap FIR filter, and 6.1 W peak power consumption. The receiver described in [17] utilizes 16 RaPiD benchmark cells, thus resulting in an area of approximately 16 · 5.07 mm² = 81 mm² in 0.5 µm CMOS technology, at a power consumption of 16 · 1.9 W = 30.4 W. The corresponding datapath processing performance is PP = 6′400 MdOp/s.³

MS1 and MaRS A Viterbi decoder and an FFT are implemented as relevant wireless communication kernels on MaRS in [20] (2005).

MaRS [21] is the scalable successor of MorphoSys MS1 [22]. Since only marginal information regarding MaRS is available, the following description concentrates on the MS1, shown in Figure 2.7. Its datapath is composed of an 8 × 8 array of reconfigurable cells (RCs). Each RC incorporates an ALU, a multiplier, and a register file. The RCs are configured through 32 bit context words. Two RC-array configuration modes exist: in row-mode, all RCs of the same row receive the same context word; analogously, in column-mode all RCs of the same column are provided with the same context word. As a result, the operation is either row- or column-SIMD-like. The MS1 is an RA.
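The two configuration modes can be pictured as broadcasting eight context words over the 8 × 8 array, either along rows or along columns. The following sketch models only this broadcast pattern; the function name, the mode strings, and the example context values are assumptions and do not reflect the actual MS1 context word format.

```python
from typing import List, Sequence


def broadcast_context(words: Sequence[int], mode: str, size: int = 8) -> List[List[int]]:
    """Return grid[row][col] holding the context word seen by each RC."""
    if len(words) != size:
        raise ValueError("one context word per row or column is required")
    if mode == "row":
        # Row-mode: every RC in row r receives words[r].
        return [[words[r]] * size for r in range(size)]
    if mode == "column":
        # Column-mode: every RC in column c receives words[c].
        return [list(words) for _ in range(size)]
    raise ValueError("mode must be 'row' or 'column'")


contexts = [0x10 + i for i in range(8)]  # hypothetical context words
row_grid = broadcast_context(contexts, "row")
col_grid = broadcast_context(contexts, "column")
```

In both modes only eight distinct context words are active at a time, which is why the array behaves like a row-wise or column-wise SIMD machine.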

In [22], the MS1 architecture is synthesized for a 0.35 µm CMOS technology and occupies 180 mm² (entire chip, including periphery). One RC occupies an area of 1.5 mm² and the achieved clock frequency is estimated to be 100 MHz [22]. The peak data processing performance is PP = 6′400 MdOp/s.⁴

³ PP = 16 tiles × 4 dOp/tile × 100 MHz = 6′400 MdOp/s.

⁴ PP = 64 RCs × 1 dOp/RC × 100 MHz = 6′400 MdOp/s.


Figure 2.7: MS1 block diagram (source [22]).


SODA The SODA architecture [23] (2006) is especially designed for the SDR domain. In [23], the W-CDMA and IEEE 802.11a standards are taken as two wireless communication standards that rely on completely different modulation techniques, and are implemented on the SODA architecture. The achieved performance is 2 Mbit/s for W-CDMA, and 24 Mbit/s for 802.11a (including Viterbi decoding).

The SODA architecture (see Figure 2.8) is composed of an ARM Cortex-M3 processor for top-level control tasks connected to a system bus. The system bus, in turn, connects four processing elements (PEs) and a global memory. The PEs (see Figure 2.9) are designed to support data-level parallelism and data transfers, since these are key elements of the analyzed communication algorithms. Each PE contains a scalar unit and a 32-way SIMD unit. Their 16 bit datapaths are interconnected through a shuffle network for the conversion from scalar to vector operation, and vice-versa. The scalar unit incorporates one ALU, whereas the 32-way SIMD unit incorporates 32 multipliers and 32 adders. Each PE contains a scalar, as well as an SIMD scratch-pad memory, and the corresponding register-file counterparts. Programming SODA’s PEs is done in C-language with additional optimization and mapping of processing kernels supported by a software tool. The SODA architecture falls into the category of SPAs.

SODA occupies an area of 26.6 mm² and runs at a clock frequency of 400 MHz in 0.18 µm CMOS technology [23]. The power consumption lies around 3 W. The datapath’s peak processing performance, for the four PEs together, is considerable: PP = 51′200 MdOp/s.⁵

⁵ PP = 4 PEs × 32 dOp/PE × 400 MHz = 51′200 MdOp/s.


[Figure 2.8 diagram. Blocks shown: ARM, global memory, and four PEs, each with a local memory and an execution unit; inside a PE: DMA, SIMD memory, scalar memory, SIMD ALU, scalar ALU, SIMD register file, scalar RF, and the WtoS and StoW networks.]

Figure 2.8: SODA multi-core architecture (source [23]).


[Figure 2.9 diagram. Blocks shown: SIMD pipeline with 32-way SIMD RF (2 read ports, 1 write port), 16 bit ALU and MULT lanes with 16x16bit RFs, and the SIMD scratchpad memory (8KB, 2 read/write ports, 512 bit wide); wide SIMD to scalar reduction network stages WtoS 1 and WtoS 2, and StoW1 and StoW2; scalar pipeline with scalar scratchpad memory (4KB, 2 read/write ports, 16 bit wide); AGU pipeline with address calculation; I-MEM (4KB), I-Queue, PC and loop counter; DMA to the inter-PE bus.]

Figure 2.9: One SODA PE (source [23]).


2.2.2 Relevant examples for industrial implementations

Montium – Recore Systems Recore Systems, founded in 2005 in Enschede, The Netherlands, sells the Montium processor as intellectual property (IP). The Montium processor has its origins at the University of Twente, The Netherlands.

In [24] (2004), the implementation of an OFDM receiver on the Montium reconfigurable architecture is described. The receiver is implemented on three Montium tiles. The first tile performs the frequency offset correction, the second computes a 64-point FFT, and the last performs channel equalization, phase offset correction, and the subsequent demapping. The proposed platform achieves data rates up to 54 Mbit/s.

One Montium tile [25, 26] is depicted in Figure 2.10. The architecture is given by a linear array of five ALUs connected to ten memory units. All units are connected to each other by a dense bus network. The ALUs embody one multiplier, three adders, and may be extended at design-time with user defined functionality. Each ALU stores four different configurations in corresponding registers. These configuration registers are addressed by the ALU decoder to control the ALU’s operation. A similar concept is employed for controlling the memory, the bus network, and the ALU’s input registers, thus resulting in a hierarchical control system with the main program sequencer selecting the configurations of the four sub-systems. The Montium processor is programmed by means of the MontiumC language. It falls into the SPA+RA category.

The figures of merit for one Montium tile are obtained from [27]. One Montium tile occupies an area of 2 mm² in 0.13 µm CMOS technology. The achievable clock frequency is 100 MHz, which is rather low when compared to the gate delays of that technology. The power efficiency for one tile is estimated to be 0.5 mW/MHz and thus the power consumption is derived as 50 mW [27]. The peak datapath performance of one Montium tile is determined by the concurrent operation of the five ALUs, resulting in 500 MdOp/s.⁶ Finally, the total area for the above-described OFDM receiver, which employs three Montium

⁶ PP = 5 ALUs × 1 dOp/ALU × 100 MHz = 500 MdOp/s.


[Figure 2.10 diagram. Blocks shown: ALU1 to ALU5, each with inputs A, B, C, D and outputs OUT1 and OUT2; ten memories M01 to M10; memory decoder, crossbar decoder, register decoder, ALU decoder, sequencer, and the communication and configuration unit.]

Figure 2.10: Montium tile block diagram (source [27]).

tiles, is 6 mm², the power consumption scales to 150 mW, and the data processing performance to PP = 1′500 MdOp/s.
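The Montium receiver figures follow from two simple steps: the tile power is the power efficiency multiplied by the clock frequency, and the three-tile receiver scales area, power, and peak performance linearly with the number of tiles. A minimal sketch under exactly these assumptions (function names are illustrative, and linear scaling ignores any inter-tile overhead):

```python
from typing import Dict


def power_from_efficiency(mw_per_mhz: float, clock_mhz: float) -> float:
    """Power in mW derived from a power efficiency given in mW/MHz."""
    return mw_per_mhz * clock_mhz


def scale_tiles(tiles: int, area_mm2: float, power_mw: float, pp_mdops: float) -> Dict[str, float]:
    """Scale per-tile figures of merit linearly to a multi-tile design."""
    return {
        "area_mm2": tiles * area_mm2,
        "power_mw": tiles * power_mw,
        "pp_mdops": tiles * pp_mdops,
    }


tile_power_mw = power_from_efficiency(0.5, 100)  # 0.5 mW/MHz at 100 MHz
tile_pp_mdops = 5 * 1 * 100                      # five ALUs, one dOp each, 100 MHz
receiver = scale_tiles(3, 2, tile_power_mw, tile_pp_mdops)
```

The resulting dictionary holds the 6 mm², 150 mW, and 1′500 MdOp/s quoted for the three-tile OFDM receiver.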

EVP – NXP NXP, formerly Philips, acquired the company SystemOnIC AG, Dresden, Germany, in early 2003. SystemOnIC developed DSP1 [28], the predecessor of EVP [5]. DSP1, in turn, leans upon the M3-DSP architecture [29] developed at TU Dresden, Germany.

Reference [5] analyzes different wireless baseband processing kernels (including OFDM baseband processing) and derives the architectural requirements that an SPA must meet to support baseband processing efficiently. The conclusions are that, although SIMD operation can be employed heavily, the support of scalar operations is still required. The common wordwidth in the evaluated algorithms is 16 bit, with a few exceptions requiring 8 bit or 32 bit precision. The embedded vector processor (EVP), sketched in Figure 2.11, meets these requirements.


[Figure 2.11 diagram. Blocks shown: program memory, VLIW controller, AGU, vector memory, 16 vector registers, code generation unit; 16-way SIMD units (load/store unit, ALU, MAC/shift unit, shuffle unit, intravector unit); scalar units (ALU, MAC, load/store, 32 registers).]

Figure 2.11: EVP block diagram (source [5]).

The EVP’s datapath includes one scalar unit, as well as a set of 16-way SIMD units. The datapath is controlled by very long instruction words (VLIWs). The data memory feeds the 16 vector registers from where the execution units retrieve their operands. The programming of the EVP is performed in EVP-C, an extension to the C-language for supporting the SIMD units. The EVP falls into the category of SPAs.

The EVP described in [5] is synthesized for a 90 nm CMOS technology. It runs at a frequency of 300 MHz and occupies an area of 2 mm². The power efficiency is 1 mW/MHz, leading to a power consumption of 300 mW. The EVP’s processing performance is derived by observing that one multiplication and one ALU operation can be executed in parallel on the 16-way SIMD datapath, thus resulting in a peak data processing performance of PP = 9′600 MdOp/s.⁷

⁷ PP = 2 units × 16 dOp/unit × 300 MHz = 9′600 MdOp/s.


SB3010 – Sandbridge Sandbridge Technologies Inc. (Tarrytown NY, USA) was founded in 2001, targeting the domain of baseband processors for 3G wireless phones. No specific academic project is behind the company, but rather different personalities of the digital signal processor scene, especially from IBM (eLite DSP project [30]).

In [31] a WiMAX receiver, which relies on OFDM, is implemented on the SB3010 platform. The receiver performs timing synchronization, frequency offset compensation, channel equalization, and demodulates BPSK symbols via 256-point FFTs. Viterbi decoding is also implemented on the SB3010 platform. The Sandblaster architecture [32, 33] (see Figure 2.12) encapsulates four DSP cores, which are controlled by a general purpose processor (ARM9). The scheduling of tasks among these four cores is dynamic. Each DSP core is designed to support 4-way SIMD instructions, scalar and general-purpose instructions, as well as memory address generation. The I-decode unit distributes the instructions to these three parts. A single memory delivers the data for the SIMD and the scalar parts. Eight banks guarantee enough bandwidth to keep the SIMD register-file filled. The SB3010 platform is programmed in C-language by means of a powerful software development kit. Sandblaster falls into the category of SPAs.

The SB3010 chip is fabricated in 90 nm CMOS technology and each DSP core runs at 600 MHz [31]. The power consumption is reported as 150 mW in [33]. No area figures are disclosed. The total data processing performance delivered by the four DSP cores is PP = 9′600 MdOp/s.⁸

⁸ PP = 4 DSP cores × 4 dOp/DSP core × 600 MHz = 9′600 MdOp/s.


[Figure 2.12 diagram. Top, entire platform: DSP complex with four DSP cores, each with instruction and data memories and an L2 memory; DSP-ARM bridge; ARM processor with vector interrupt controller, DMA controller, AHB-APB bridge, multi-port memory controller, and peripherals (timers, GPIO, UART, USB, LCD, audio codec, keypad, smart card, and multimedia card interfaces, JTAG); RF control, TX and RX data interfaces, and clock generation. Bottom, one DSP slice: I-cache (64kB, 64B lines, 4W, 2 active), I-decode, data memory (64kB, 8 banks), data buffer, bus/memory interface, address unit, load/store, integer, branch, and SIMD instruction queues, and four vector processing lanes with MPY, ADD, ACC, and SAT stages.]

Figure 2.12: SB3010 architecture (source [32, 33]). Top: entire platform. Bottom: one DSP slice.


BBP1 and BBP2 – Coresonic Coresonic is a start-up company founded in 2004 in Linköping, Sweden. The company has its roots in the BBP1 processor research project at Linköping University, Sweden.

BBP1 [34] is a multi-standard baseband processor mainly designed for WLAN standards (e.g., IEEE 802.11a/b/g). The attained performance is, for instance, sufficient to sustain 54 Mbit/s in the IEEE 802.11a standard (OFDM, no Viterbi decoding). The main idea leading to the BBP1 architecture reported in Figure 2.13 is that many wireless communication standards employ the same set of functions (e.g., filters, FFTs, interleaving, etc.), configured with standard-specific parameters (e.g., filter coefficients, number of FFT points, permutations used for interleaving, etc.). The resulting architecture contains a baseband processor-core that is connected to a set of specialized, parameterizable data processing blocks, and to data memories (DM). The processor core controls the specialized units, and it is equipped with an ALU and a complex-valued MAC unit. Vector instructions are used to schedule the processing of data blocks on the specialized blocks. Programming is eased by an assembler and an instruction set simulator. The BBP1 is classified as an SPA, and its accelerators as RAs.

The figures of merit for a BBP1 realization in 0.18 µm CMOS technology are collected from [34]. The BBP1 runs up to a frequency of 240 MHz and occupies an area of 2.9 mm². The power consumption is 126 mW when operating at 160 MHz (which is the frequency required for the 802.11a receiver operation). The processing performance is hard to estimate because of the heterogeneous granularity of the datapath. Therefore, no performance figures are extrapolated.

The BBP2 processor, described in [35, 36] (see Figure 2.14), is the successor of the BBP1. It is designed for multi-standard baseband processing and its datapath includes two 4-way SIMD-units that operate on 16 bit complex-valued vectors. The first unit is a complex-valued ALU, and the second a complex-valued MAC. A simple controller unit steers the two SIMD units, through the corresponding vector control units, and performs the program control flow. The controller supports up to three contexts, for three different tasks. Four memory banks for complex-valued data and one bank for real-valued data compose the data memory of the BBP2 processor. Each of the four complex-valued data banks contains four memories that are accessed concurrently. As a result, enough bandwidth for the operation on one of the two


[Figure 2.13 diagram. Blocks shown: RF front-end, decimator and symbol shaper, RAKE/despread, FFT/CMAC, interleave, Viterbi, CRC, MAC port, central baseband processor core and accelerator network, memories DM1 to DM4, CM, and PM, and the MAC/application processor.]

Figure 2.13: BBP1 block diagram (source [34]).

[Figure 2.14 diagram. Blocks shown: complex oriented on-chip network with complex memory banks and their AGUs; integer oriented on-chip network with integer memory; digital front-end (filter and decimation, frequency error cancellation, NCO); CALU SIMD datapath (four CALUs, vector controller, vector L/S unit); CMAC SIMD datapath (four CMACs, vector controller, vector L/S unit); controller unit (ALSU, MAC, RF, stack, PM); map/demap, PRBS generator, and host interface.]

Figure 2.14: BBP2 block diagram (source [35]).

4-way SIMD units is delivered. The BBP2 is programmed in assembler language and debugged with a bit and cycle true C-simulator. The BBP2 processor is classified as an SPA.

The implementation in 0.13 µm CMOS technology runs at a clock frequency of 240 MHz and occupies an area of 11 mm² [35]. The data processing performance is determined by the two SIMD units that can execute 8 complex-valued operations per clock cycle. Accordingly, the real-valued data processing performance becomes PP = 5′760 MdOp/s.⁹

⁹ PP = 2 units × 4 CdOp/unit × 240 MHz = 24 RdOp/unit × 240 MHz = 5′760 MdOp/s.


CSP2xxx Series – Silicon Hive Silicon Hive (Eindhoven, The Netherlands) was spun out of Philips Research in 2007.¹⁰ It bases its CSP2xxx processor series upon the AVISPA processor [37]. Three AVISPA architectures are reported in literature: AVISPA, AVISPA+, and AVISPA-CH. AVISPA and AVISPA+ are designed for OFDM baseband processing [38]. AVISPA-CH [39], the successor of AVISPA and AVISPA+, is designed for multi-standard digital television baseband processing and incorporates complex-valued FUs.

The AVISPA architecture [37] is shown in Figure 2.15. The top level architecture contains a control processing and storage element (PSE) and a mesh of four PSEs for data processing. Each data processing PSE instantiates different FUs, namely: a 16 bit ALU, a 16 bit multiplier, a 40 bit accumulator, a 40 bit barrel shifter, an address generation unit, and two 16 bit load/store units connected to a local dual-port data memory. Small register files (RFs), connected to the FUs through a local interconnect network, enable temporary data storage. The AVISPA processor is programmed by means of different tools that allow writing code in a subset of the C-language and extracting the instruction level parallelism. The AVISPA architecture is classified as an SPA.

The AVISPA architecture is realized in 0.13 µm CMOS technology, running at 150 MHz and consuming an area of 6.5 mm² [38]. The processor consumes around 127 mW and the peak data processing performance is PP = 1′200 MdOp/s.¹¹

2.3 Flexible Architecture for MIMO-OFDM Baseband Processing

Today, to the best of the author’s knowledge, only one MIMO-OFDM baseband processing implementation case-study exists that is comparable to the one presented in this thesis. The corresponding FA is described here and then revisited in Chapter 5, when presenting the implementation results of this thesis.

¹⁰ http://www.siliconhive.com

¹¹ PP = 4 PSE × 2 dOp/PSE × 150 MHz = 1′200 MdOp/s.


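[Figure 2.15 diagram. Blocks shown: multi-cell core with host, program memory, control PSE, and four data processing PSEs on a bus; inside a cell: FUs with issue slots (IS), register files (RF), interconnect networks (IN), and a local memory.]

Figure 2.15: AVISPA block diagram (source [37]).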

ADRES – IMEC The ADRES processor [40] was developed at the IMEC research center in Leuven, Belgium. The implementation of the complete 2 × 2 MIMO-OFDM baseband processing on the ADRES processor is presented in [12, 41] (only Viterbi decoding is performed on a dedicated unit). The presented receiver can process data rates up to 108 Mbit/s.

The generic ADRES architecture template [40] and the realization for the baseband receiver in [12] are depicted in Figure 2.16. The ADRES core is composed of one VLIW part and one coarse-grained array (CGA) part. These two parts operate in mutual exclusion. For the realization in [12], the CGA part consists of a 4 × 4 array of 4-way SIMD 16 bit FUs and the VLIW part comprises 3 FUs. The VLIW and CGA parts exchange data over a shared register-file. A four-bank scratch-pad memory completes the storage capabilities. The ADRES core is programmed in C-language; the mapping to the VLIW and CGA parts is done by the DRESC compiler. It is interesting to note that the CGA architecture and the interconnect are based on the MS1 architecture analyzed in Section 2.2.1 [42].

The ADRES processor is fabricated in 90 nm CMOS technology. It occupies an area of 5.79 mm², runs at a frequency of 400 MHz, and consumes around 220 mW [12]. The data processing performance is determined by the 4-way SIMD CGA, which can perform PP = 25′600 MdOp/s.¹²

¹² PP = 16 FUs × 4 dOp/FU × 400 MHz = 25′600 MdOp/s.


[Figure 2.16 diagram. Top, generic architecture: program fetch, instruction dispatch, and instruction decode feeding a row of VLIW FUs on a shared register file (VLIW view), and an array of reconfigurable cells (RC) forming the reconfigurable matrix view. Middle, 4x4 CGA: FU0 to FU15 with local register files (LRF), shared CDRF/CPRF, ICache, configuration memory banks 1 and 2, VLIW CU, DMQ, debug interface, and AHB-S interface. Bottom, functional unit: FU with LRF, inputs from different sources, outputs to different destinations, configuration RAM, and configuration counter.]

Figure 2.16: ADRES block diagram (source: [12, 40]). Top: Generic architecture template. Middle: 4 × 4 CGA realization for the 2 × 2 MIMO-OFDM receiver. Bottom: FU template.


2.4 Summary and Discussion

Summary This chapter reviewed selected flexible architectures (FAs) employed as SDR platforms, and especially related to the OFDM baseband processing domain. The review provided insight into the structure of these architectures, which range from RAs (e.g., RD, MS1), over SPA+RA (e.g., RaPiD, BBP1), to pure SPAs (e.g., SODA, SB3010). For a concise overview, the key figures of merit of these FAs are reported in Table 2.1. In Table 2.2, these figures are normalized to a 0.18 µm CMOS technology, according to the technology scaling described in Section 2.1.

In the following, the data of Table 2.2 is further elaborated to highlight different design aspects associated with the reviewed FAs (where the corresponding figures of merit are available).

Discussion Figure 2.17 depicts the data processing performance per area attained by the various FAs. In these terms, the RD is clearly the most efficient architecture. Thanks to its datapath, especially tailored to support two specific and very similar tasks, it reaches a high processing performance at small silicon area expense. The SODA and RaPiD architectures follow next in the ranking. Both provide slightly more than 1′500 MdOp/s/mm², which is already a factor of three less than that of the RD. The DSP1, EVP, and ADRES architectures deliver more than 500 MdOp/s/mm², while MS1, BBP2, Montium, and AVISPA deliver less than that.

Figure 2.18 expands the view to the energy efficiency. The most processing- and energy-efficient architectures reside in the upper-right corner of the figure. The energy efficiencies of the presented FAs are widely spread, and lie between 20 MdOp/s/mW (ADRES, SODA) and 2 MdOp/s/mW (RaPiD), one order of magnitude apart. Accordingly, the power densities attained by the various architectures lie in the strip between 10 mW/mm² and 1 W/mm².
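The three derived metrics used in this discussion (performance per area, energy efficiency, and power density) are simple ratios of the figures of merit. The sketch below evaluates them for SODA, whose figures are already given for the 0.18 µm reference technology; the function name and dictionary keys are illustrative, the 3 W value is the approximate power quoted in the text, and the results are only expected to match the plotted values roughly.

```python
from typing import Dict


def efficiency_metrics(pp_mdops: float, power_mw: float, area_mm2: float) -> Dict[str, float]:
    """Derive area efficiency, energy efficiency, and power density."""
    return {
        "area_efficiency": pp_mdops / area_mm2,    # MdOp/s/mm^2
        "energy_efficiency": pp_mdops / power_mw,  # MdOp/s/mW
        "power_density": power_mw / area_mm2,      # mW/mm^2
    }


# SODA: 51'200 MdOp/s, about 3 W, 26.6 mm^2 in 0.18 um CMOS.
soda = efficiency_metrics(pp_mdops=51200, power_mw=3000, area_mm2=26.6)
```

For the other architectures the inputs would first have to be normalized to 0.18 µm as in Table 2.2, which is not reproduced in this sketch.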

Nonetheless, when considering a real-world implementation, the absolute figures of merit are important. In this perspective, Figure 2.19 illustrates the processing performance vs. the power consumption of the considered FAs and, in addition, each marker’s size is made proportional to the FA’s area. The power consumption is a crucial aspect for portable mobile devices since it determines the endurance attainable from the battery.


Today, typical batteries of mobile devices have a capacity of 3600 mWh. At a power consumption of, for instance, 1 W, an endurance of 3.6 h can be attained from the battery. Thus, for reasonable real-life applications, the power consumption of mobile devices should remain well below 1 W (cf. vertical line in Figure 2.19). On the reference 0.18 µm CMOS technology, both the SODA and RaPiD architectures violate this constraint.
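The endurance estimate is the battery capacity divided by the average power draw. A minimal sketch, assuming a constant power consumption and an ideal battery (both simplifications; the function name is illustrative):

```python
def endurance_hours(capacity_mwh: float, power_mw: float) -> float:
    """Battery endurance in hours for a constant power draw."""
    if power_mw <= 0:
        raise ValueError("power consumption must be positive")
    return capacity_mwh / power_mw


hours_at_1w = endurance_hours(3600, 1000)   # the 1 W example from the text
hours_at_150mw = endurance_hours(3600, 150) # e.g. a 150 mW baseband processor
```

The second call shows why a budget well below 1 W is desirable: at 150 mW the same battery would last 24 h if the processor were the only consumer.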

On the other side, the computational processing performance required to sustain the OFDM baseband processing is significant. The horizontal lines at 680 MdOp/s, 1′530 MdOp/s, 3′650 MdOp/s, and 8′580 MdOp/s indicate the estimated data-processing performance required for the SISO-OFDM, 2 × 2, 3 × 3, and 4 × 4 MIMO-OFDM WLAN baseband processing, as described later in Chapter 3 (without considering Viterbi decoding). Apparently, all reported FAs can support single-antenna OFDM baseband processing. AVISPA-CH and DSP1 possibly support 2 × 2, and EVP 3 × 3 MIMO-OFDM operation. ADRES possibly supports up to 4 × 4 MIMO-OFDM operation. However, it must be stressed that the data processing performance attributed to the FAs and the algorithmic processing performance requirements are qualitative measures. The processing performance requirements assume that the underlying FA is able to compute exactly the required operation at the correct point in time. The data processing performance, instead, assumes that all FUs inside the FA’s datapath are fully exploited, which is rather difficult to fulfill and depends on how well the FA’s datapath matches the application domain.¹³ Despite these two assumptions, the big picture presented in Figure 2.19 remains valid and becomes especially helpful for comparing the various FAs among each other.
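The comparison against the four horizontal lines amounts to finding the largest antenna configuration whose estimated requirement does not exceed the peak performance of an FA. The sketch below uses the four thresholds from the text; the function name is illustrative and the example performance values are hypothetical, since the normalized figures of Table 2.2 are not reproduced here.

```python
from typing import Optional

# Estimated requirements in MdOp/s (without Viterbi decoding), from the text.
REQUIRED_MDOPS = [
    ("SISO", 680),
    ("2x2", 1530),
    ("3x3", 3650),
    ("4x4", 8580),
]


def largest_supported_config(pp_mdops: float) -> Optional[str]:
    """Return the largest configuration whose requirement is met, or None."""
    supported = None
    for name, required in REQUIRED_MDOPS:
        if pp_mdops >= required:
            supported = name
    return supported


# Hypothetical peak performance values, for illustration only.
examples = {pp: largest_supported_config(pp) for pp in (500, 2000, 4000, 9000)}
```

As stressed above, such a comparison is optimistic, because it assumes full utilization of all FUs.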

To conclude, it can be stated that a vast number of FAs exists and that an increasing number of publications describe the implementation of (MIMO-)OFDM baseband processing related tasks onto these FAs. As this vast number of architectures suggests, there is no unique and optimal FA able to support MIMO-OFDM baseband processing yet, but rather there are many ways aiming at the same goal. As the

¹³ The discrepancy between estimated algorithmic data-processing performance and the data-processing performance delivered by the FA can be seen as an indicator of how well the FA matches the algorithm. The lower this discrepancy, the better the FA matches the algorithm since the number of overhead instructions is reduced.


complexity of the underlying architectures increases, the importance of the support by a powerful programming tool grows. Indeed, many of the reviewed FAs go in this direction (e.g., EVP, Montium, ADRES) and provide various, more or less sophisticated, tool chains. Finally, among the related work, only one addresses the complete implementation of a 2 × 2 MIMO-OFDM receiver on the coarse grained ADRES architecture [41].

Further reading The review in this chapter presented only a set of selected FAs employed in the OFDM baseband processing domain and strictly related to this thesis. The following list summarizes, in more general terms, the surveys and descriptions related to FAs:

• In 1999, Enzler explores the status of the early reconfigurable computing research [43]. After an analysis and description of the reconfigurable computing paradigm, a list of over sixty (!) different FAs is given. In his thesis [44] many of the issues faced by the early FAs are described.

• In 2001, Hartenstein gives a survey of FAs [45]. Herein, the architectures reported are: DP-FPGA, KressArray, Colt, MATRIX, RAW, GARP, REMARC, MorphoSys, CHESS, DReAM, CS2000 family, MECA family, CALISTO, FIPSOC, RaPiD, PipeRench, PADDI and PADDI-2, and Pleiades. The author concludes that the exploding design costs of dedicated VLSI solutions and the shrinking product life-cycles strengthen the demand for FAs. The challenge lies in the development of software tools that effectively support designers and hence reduce time-to-market.

• In 2006, Amano [46] lists recent FAs that have gained industrial maturity: CS2112 (Chameleon), DAPDNA-2 (IPFlex), DRP-1 (NEC Electronics), FE-FA (Hitachi), XPP-64 (PACT), D-Fabrix (Elixent), Kilocore KC256 (Rapport), ADRES (IMEC), S5-engine (Stretch), Cluster machine (Fujitsu). The author concludes that many FAs have gained industrial attention and that their structure greatly varies according to the application domain. It is argued that in the near future the structure will be generated automatically to match the target application domain.

• In 2008, IEEE Micro dedicates a complete edition to hardware accelerators [47].


Table 2.1: Figures of merit for the reviewed FAs, in the original technology.

Flexible Architecture (a)   CMOS    Area     Freq.   Power       Proc. Perf.
                            [µm]    [mm²]    [MHz]   [W]         [MdOp/s]
RD [48], 2003               0.35    2.86     100     n.a.        2’200
RaPiD [18, 17], 1996        0.5     81       100     30.4        6’400
MS1 [22], 1999              0.35    180      100     n.a.        6’400
SODA [23], 2006             0.18    26.6     400     3           51’200
Montium [49, 24], 2003      0.13    6        100     0.150       1’500
DSP1 [28, 5], 2002          0.13    1.5      160     0.128       2’560
EVP [5], 2005               0.09    2        300     0.300       9’600
SB3010 [32, 31], 2002       0.09    n.a.     600     0.150       9’600
BBP1 [34], 2005             0.18    2.9      240     0.189 (b)   n.a.
BBP2 [35], 2007             0.13    11       240     n.a.        5’760
AVISPA [37], 2003           0.13    6.5      150     0.127       1’200
ADRES [40, 12], 2003        0.09    5.79     400     0.22        25’600

(a) The first reference indicates the description of the architecture, while the second refers to the SDR/OFDM baseband processing related work (if different from the first). The year refers to the first publication of the architecture.
(b) Linearly scaled from 126 mW @ 160 MHz to 189 mW @ 240 MHz.


Table 2.2: Figures of merit for the reviewed FAs, scaled to 0.18 µm CMOS technology.

Flexible        Scaling         Area     Freq.   Power    Proc. Perf.
Architecture    αD      ε       [mm²]    [MHz]   [W]      [MdOp/s]
RD              1.94    1.06    0.76     194     n.a.     4’278
RaPiD           2.78    1.52    10.5     278     9        17’778
MS1             1.94    1.06    47.61    194     n.a.     12’444
SODA            1       1       26.6     400     3        51’200
Montium         0.72    1.08    11.5     72      0.338    1’083
DSP1            0.72    1.08    2.88     116     0.288    1’849
EVP             0.5     0.9     8        150     0.972    4’800
SB3010          0.5     0.9     n.a.     300     0.486    4’800
BBP1            1       1       2.9      240     0.189    n.a.
BBP2            0.72    1.08    21.09    173     n.a.     4’160
AVISPA          0.72    1.08    12.46    108     0.286    867
ADRES           0.5     0.9     23.16    200     0.713    12’800


Figure 2.17: Performance/area, normalized to 0.18 µm CMOS technology. [Bar chart of the performance over area in MdOp/s/mm² (scale 0 to 6000) for RD, SODA, RaPiD, DSP1, EVP, AVISPA-CH, ADRES, MS1, BBP2, AVISPA+, Montium, and AVISPA, with reference lines at 500 and 1′500 MdOp/s/mm².]


Figure 2.18: Performance/area vs. energy efficiency, normalized to 0.18 µm CMOS technology. [Log-log scatter plot with axes energy efficiency [MdOp/s/mW] and performance over area [MdOp/s/mm²]; reference lines at 10, 100, and 1000 mW/mm²; data points for RaPiD, Montium, SODA, DSP1, EVP, AVISPA, AVISPA+, AVISPA-CH, and ADRES.]


Figure 2.19: Processing performance vs. power consumption, normalized to 0.18 µm CMOS technology. The marker size is proportional to the corresponding circuit area. [Log-log scatter plot with axes power consumption [mW] and processing performance [MdOp/s]; reference lines at 1, 10, and 100 MdOp/s/mW; horizontal lines at 680, 1′530, 3′650, and 8′580 MdOp/s; data points for RaPiD, Montium, SODA, DSP1, EVP, AVISPA, AVISPA+, AVISPA-CH, and ADRES.]

Chapter 3

Algorithms and Computational Complexity

The current chapter describes the system model of the MIMO-OFDM transceiver considered in this thesis. It details and evaluates the corresponding receiver algorithms with the aim of eventually finding a suitable candidate that fits the limited processing resources of FAs, while delivering an acceptable receiver signal quality. To this end, the mathematical relation used to model the MIMO-OFDM system is presented first. Next, well-known MIMO detectors are reviewed and evaluated with special attention to their computational complexity and to their impact on the receive signal quality. The evaluation generates the decision criteria that lead to the selection of linear minimum mean-squared error (MMSE) detection as the best candidate.

Many methods exist to implement linear MMSE detection. The difference among these methods does not lie in the achieved result or quality, but in the way the result is attained.1 Again, in order to find the best trade-off between computational complexity and achievable receive quality, different candidates are assessed and two promising methods are identified. With the appropriate linear MMSE detector at hand, the algorithms for the practical MIMO-OFDM receiver considered in this thesis are detailed. The summary with the computational complexity of the presented MIMO receiver and the subsequent discussion of these results concludes the chapter.

1The achieved signal quality is dictated by the type of MIMO detector (linear MMSE in this case). When the computations are performed in infinite precision, all methods using the same type of MIMO detector lead to the same result. However, when considering finite precision computations the picture changes. Rounding differences among the methods used to implement the MIMO detector may lead to a better or worse received signal quality. For this reason, Section 3.4, describing linear MMSE detection, also considers finite precision effects by implementing the methods in fixed-point.

Notation $\mathbf{x} \in \mathbb{C}^{a}$ is a complex-valued vector with $a$ entries; $x(n)$ is the $n$th element of vector $\mathbf{x}$. $\mathbf{X} \in \mathbb{C}^{a \times b}$ is a complex-valued matrix with $a$ rows and $b$ columns. The superscripts $(\cdot)^T$ and $(\cdot)^H$ denote the transpose and conjugate transpose, respectively. The notation $\mathbf{z} \sim \mathcal{CN}(\mathbf{u}, \mathbf{R})$ indicates that the random vector $\mathbf{z}$ is characterized by a circularly-symmetric complex-valued Gaussian distribution with mean $\mathbf{u}$ and covariance matrix $\mathbf{R}$.

3.1 MIMO-OFDM System Model

Figure 3.1 depicts a generic $M_R \times M_T$ MIMO-OFDM transceiver. In this thesis, the transceiver operates in spatial multiplexing mode, or space-division multiplex mode. It employs $M_T$ transmit antennas and $M_R \geq M_T$ receive antennas. The transmission is frame based, and one OFDM-frame is composed of a preamble and the data payload consisting of one or more OFDM-symbols. The preamble serves the receiver for estimating physical system impairments, whereas the OFDM-symbols carry the actual data to be transmitted.

Accordingly, the operation of the receiver can be divided into three main phases, depending on whether it is processing a frame or not; and, if it is processing a frame, depending on which part of the frame is being processed. During the frame-start detection phase the receiver is not processing a frame, but it is analyzing the received samples to discover a frame start. Then, the preamble is processed during the preprocessing phase, and the payload during the data processing phase.

Figure 3.1: $M_R \times M_T$ MIMO-OFDM transceiver. [Block diagram. Transmitter: TxData (bits b), FEC, serial to parallel conversion, mapping, and one OFDM modulator per stream. Channel: $\mathbf{H}_k$ with additive noise $\mathbf{n}_k$. Receiver: one OFDM demodulator per stream yielding $\mathbf{y}_k$, MIMO processing (MIMO detection and bit-metric computation), parallel to serial conversion, decoding, RxData.]

Figure 3.2: M-ary QAM constellation points with $M = 2^Q$, and corresponding Gray-mapped binary labels. (For 16-QAM and 64-QAM, only four labels are shown for clarity.) [Four constellation diagrams in the Re/Im plane: (a) BPSK, M = 2, Q = 1, with points at ±1; (b) QPSK, M = 4, Q = 2, with coordinates ±1/√2; (c) 16-QAM, M = 16, Q = 4, with coordinates ±1/√10 and ±3/√10; (d) 64-QAM, M = 64, Q = 6, with coordinates ±1/√42, ±3/√42, ±5/√42, and ±7/√42.]

The transceiver’s building blocks illustrated in Figure 3.1 are now described one after the other, from the transmitter to the receiver. In the description, the subscript $k = 1, \ldots, N$ indicates the OFDM-subchannel a variable belongs to. The superscript $n = 1, \ldots, M_T$ associates a variable to one of the $M_T$ lower-rate datastreams. $\mathcal{A}$ is an alphabet containing M-ary QAM constellation points that have modulation order $M = 2^Q$, mean zero, and average energy $1/M_T$. $Q$ is the number of bits encoded by one constellation point. Figure 3.2 depicts the alphabets $\mathcal{A}$ for $M_T = 1$, and $M = 2$, 4, 16, and 64.

Forward Error Correction The transmitter starts by convolutionally encoding the incoming binary datastream (TxData, with bits b). Encoding increases the transmission’s robustness by adding redundancy to the incoming bitstream. It is performed by the forward error correction (FEC) block with a coding rate R, meaning that each bit of the FEC’s output encodes R bits of its input (typical coding rates are R = 1/2, 2/3, 3/4, and 5/6). Then, in spatial multiplexing mode, the encoded binary datastream is split into $M_T$ lower-rate datastreams.

OFDM Modulation For obtaining OFDM modulated data in a system with N OFDM-subchannels, the operations described in the following are performed, independently, on each of the $M_T$ lower-rate transmit streams (typically 2 to 4 streams).

1. The encoded lower-rate bitstream is partitioned into groups of Q bits represented by the binary labels

\[ \mathbf{x}_k^{(n)} = [b^{(n)}_{(k-1)Q},\ b^{(n)}_{(k-1)Q+1},\ \ldots,\ b^{(n)}_{kQ-1}]. \]

These binary labels $\mathbf{x}_k^{(n)}$ are Gray-mapped, by the Mapping-block in Figure 3.1, into the complex-valued constellation points $s_k^{(n)} \in \mathcal{A}$ according to $G : \mathbf{x}_k^{(n)} \mapsto s_k^{(n)}$.

2. Next, by using an N-point inverse Fourier transform, groups of N constellation points are mapped into time-domain OFDM-symbols. Each time-domain OFDM-symbol is prepended by a cyclic extension of itself. This extension is named guard interval (GI) and adds robustness against interference caused by multipath propagation.

3. Finally, the MIMO-OFDM preamble is inserted in front of the resulting $M_T$ time-domain OFDM-symbols, and the complete OFDM-frame is transmitted – all $M_T$ streams concurrently and in the same frequency band – over the wireless baseband (BB) channel.
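The modulation steps above can be sketched for a single transmit stream as follows. This is only a minimal illustration under stated assumptions: QPSK is used instead of 64-QAM, N = 8 subchannels and a guard interval of 2 samples are toy values, and the function names as well as the bit-to-axis labeling convention are hypothetical choices that are not taken from the thesis.

```python
import cmath
import math

def gray_map_qpsk(bits, m_t=1):
    """Map pairs of bits to QPSK points with average energy 1/m_t."""
    scale = 1.0 / math.sqrt(2.0 * m_t)
    points = []
    for i in range(0, len(bits), 2):
        # Assumed labeling: first bit selects the sign of the real part,
        # second bit the sign of the imaginary part (neighbors differ in one bit).
        re = scale if bits[i] else -scale
        im = scale if bits[i + 1] else -scale
        points.append(complex(re, im))
    return points

def idft(freq):
    """Plain N-point inverse DFT (a fast transform would be used in practice)."""
    n = len(freq)
    return [sum(freq[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def add_guard_interval(time_symbol, gi_len):
    """Prepend the last gi_len samples as a cyclic extension."""
    return time_symbol[-gi_len:] + time_symbol

bits = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0]  # 16 bits -> 8 QPSK points
points = gray_map_qpsk(bits)
ofdm_symbol = add_guard_interval(idft(points), gi_len=2)
```

The guard interval is a plain copy of the symbol tail, so the first `gi_len` samples of `ofdm_symbol` equal its last `gi_len` samples.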

OFDM Demodulation The receiver performs the inverse process of the transmitter. After removal of the guard interval in the time-domain, the $M_R$ received datastreams are individually OFDM demodulated using Fourier transforms. This transformation (back) into the frequency domain, and the successive stacking of the demodulated OFDM-symbols, yields the $M_R$-dimensional received vector

\[ \mathbf{y}_k = \mathbf{H}_k \mathbf{s}_k + \mathbf{n}_k, \qquad (3.1) \]

where the corresponding OFDM-subchannel is indicated by the index k. The transmitted frequency-domain vector-symbols

\[ \mathbf{s}_k = [s_k^{(1)},\ s_k^{(2)},\ \ldots,\ s_k^{(M_T)}]^T \]

are obtained by stacking the $M_T$ constellation points of subchannel k into one vector. Each vector-symbol $\mathbf{s}_k$ then conveys $M_T \cdot Q \cdot R$ bits of the binary datastream TxData. In (3.1), the noise experienced at the receiver is modeled by the additive noise $\mathbf{n}_k$ whose entries are distributed according to $\mathcal{CN}(0, \sigma^2)$. The matrix $\mathbf{H}_k \in \mathbb{C}^{M_R \times M_T}$ describes the gain and phase of the $M_R \times M_T$ wireless BB channel.
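The input-output relation (3.1) for one subchannel can be sketched as below. The helper names `random_channel` and `transmit` are hypothetical, the channel entries are drawn as i.i.d. CN(0, 1) as in the simulation setup of Section 3.2, and the two QPSK points are an arbitrary example with energy 1/M_T.

```python
import math
import random

def random_channel(m_r, m_t, rng):
    """Draw an m_r x m_t channel matrix with i.i.d. CN(0, 1) entries."""
    std = math.sqrt(0.5)  # standard deviation per real dimension
    return [[complex(rng.gauss(0.0, std), rng.gauss(0.0, std)) for _ in range(m_t)]
            for _ in range(m_r)]

def transmit(H, s, sigma2, rng):
    """Return y = H s + n, where n has i.i.d. CN(0, sigma2) entries."""
    std = math.sqrt(sigma2 / 2.0)
    y = []
    for row in H:
        noise = complex(rng.gauss(0.0, std), rng.gauss(0.0, std))
        y.append(sum(h * x for h, x in zip(row, s)) + noise)
    return y

rng = random.Random(0)
H = random_channel(2, 2, rng)
s = [complex(0.5, 0.5), complex(-0.5, 0.5)]  # QPSK points, energy 1/M_T each
y_noiseless = transmit(H, s, 0.0, rng)
y_noisy = transmit(H, s, 0.01, rng)
```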

MIMO Detection The task of the receiver is to recover the transmitted binary datastream by observing the received datastream. More precisely, the $M_T \cdot Q \cdot R$ data bits conveyed by the vector-symbol $\mathbf{s}_k$ have to be recovered by observing the corresponding received vector $\mathbf{y}_k$. To this end, the MIMO detector processes the received vector $\mathbf{y}_k$ and outputs one row vector

\[ \hat{\mathbf{x}}_k^{(n)} = [L^{(n)}_{(k-1)Q},\ L^{(n)}_{(k-1)Q+1},\ \ldots,\ L^{(n)}_{kQ-1}] \]

for each of the $M_T$ entries of $\mathbf{s}_k$ ($n = 1, 2, \ldots, M_T$). The row vector $\hat{\mathbf{x}}_k^{(n)}$ contains Q entries. Each entry $L_i^{(n)}$ [$i = (k-1)Q, (k-1)Q+1, \ldots, kQ-1$] delivers decision information employed to detect the corresponding bit of the binary label of the constellation point $s_k^{(n)}$.

MIMO detectors (and detectors in general) can be classified into two categories according to the type of decision information they deliver. Hard detectors (or hard-out detectors) output only two values, usually −1 and +1, according to whether the bit that has to be detected is estimated to be 0 or 1, respectively. Soft detectors (or soft-out detectors), instead, deliver an entire range of values that usually lie in the interval [−1, +1]. The sign indicates whether the bit is estimated to be 0 or 1, and larger absolute values indicate more reliable estimates than smaller ones. Soft detection is superior to hard detection in the receiver’s signal quality. On the other hand, depending on the MIMO detection algorithm, the generation of soft information is not always possible, or it may be associated with an overly increased computational complexity.

Another distinction is made according to whether the MIMO detector performs coherent or non-coherent detection. In this thesis, receivers that perform coherent detection are considered: for coherent detection, the receiver needs to take the effect of the wireless channel into account for correct operation, whereas non-coherent detection is performed without channel knowledge. The coherent receiver has to estimate the wireless channel $\mathbf{H}_k$ during an appropriate training phase of the transmission (typically at the beginning of an OFDM-frame). The channel-estimate is then expressed as matrix $\hat{\mathbf{H}}_k$.

Finally, to perform MIMO detection, several algorithms that vary in computational complexity and in the receiver’s signal quality are known (e.g., [50]). Before choosing an appropriate MIMO detector for the implementation on an FA in Section 3.3, the evaluation metrics required to make this choice are introduced in Section 3.2.

Convolutional Decoding / Viterbi Decoding As a last step, the decision information (either hard or soft) computed by the MIMO detector is multiplexed into one single stream and fed into the convolutional decoder that eventually delivers the received bitstream.

Please note that from now on, the OFDM-subchannel index k will be dropped for the sake of brevity. Also, in the following, the channel estimate $\hat{\mathbf{H}}$ is supposed to be perfect (i.e., $\hat{\mathbf{H}} = \mathbf{H}$), thus rendering the distinction with the channel matrix $\mathbf{H}$ superfluous. The OFDM-subchannel index k and channel-estimate matrix $\hat{\mathbf{H}}$ will only be considered when necessary.

3.2 Performance and Computational Complexity Metrics

The algorithms evaluated in Section 3.3 are classified according to the following characteristics.

BER performance The wireless receiver signal quality is measured by means of the bit error rate $\mathrm{BER} = b_{\mathrm{err}} / b_{\mathrm{tot}}$, where $b_{\mathrm{err}}$ is the number of erroneous bits in the received datastream and $b_{\mathrm{tot}}$ is the total number of transmitted bits.

In this thesis, the BER performance is obtained by running Monte Carlo simulations, and it is computed for different receiver signal-to-noise ratios (SNRs). In each simulation cycle, randomly generated bits are sent through a rate R = 1/2 convolutional encoder (which has generator polynomials $[133_8\ 171_8]$ and constraint length 7)2 and successively Gray-mapped onto points of a 64-QAM constellation (recall Figure 3.2). The resulting complex-valued symbols are stacked to build the vector $\mathbf{s}$, which has average energy 1. The vector $\mathbf{s}$ is transmitted over the MIMO channel according to (3.1), where the entries of the channel matrix $\mathbf{H}$ are chosen to be independent and identically-distributed (i.i.d.) as $\mathcal{CN}(0, 1)$. The SNR at the receiver is $1/\sigma^2$. In the simulations, the receiver perfectly estimates the wireless channel (i.e., $\hat{\mathbf{H}} = \mathbf{H}$), as well as the noise variance $\sigma^2$.
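As an illustration of the encoder used in this setup, the following sketch implements a rate-1/2 convolutional encoder with the octal generator polynomials 133 and 171 and constraint length 7. The tap alignment is an assumption: the newest input bit is associated with the x^0 term of the polynomials, whereas other texts associate it with the highest-order term. The function name `conv_encode` is hypothetical.

```python
G0 = 0o133  # x^6 + x^4 + x^3 + x + 1
G1 = 0o171  # x^6 + x^5 + x^4 + x^3 + 1

def conv_encode(bits, constraint_len=7):
    """Rate-1/2 convolutional encoder; emits two coded bits per input bit."""
    mask = (1 << constraint_len) - 1
    state = 0
    coded = []
    for b in bits:
        # Shift the new bit into the register (assumed: newest bit at the LSB).
        state = ((state << 1) | b) & mask
        # Each output is the parity (XOR) of the register bits selected by a polynomial.
        coded.append(bin(state & G0).count("1") % 2)
        coded.append(bin(state & G1).count("1") % 2)
    return coded

impulse_response = conv_encode([1, 0, 0, 0, 0, 0, 0])
```

Feeding a single 1 followed by zeros reads out the polynomial coefficients one tap per step, which is a convenient sanity check of the tap alignment.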

Depending on the purpose of the simulations, the numerical precision for the entire receiver is set to either floating-point, in order to fathom its limits, or parts of the receiver may be written to emulate fixed-point behavior, introducing quantization errors. This makes it possible to assess the receiver’s performance achievable on practical FAs that typically support word-widths of 16 bits and perform fixed-point computations. For instance, the decision of transmitting 64-QAM constellation points is taken in this perspective: the higher numerical precision requirements of 64-QAM, compared to 16-QAM, QPSK, or BPSK, make it possible to derive the fixed-point requisites for the datapath of an FA that supports up to 64-QAM modulation.

2The polynomials are expressed in octal format. In this case, the generator polynomials are $g_0(x) = x^6 + x^4 + x^3 + x + 1$ and $g_1(x) = x^6 + x^5 + x^4 + x^3 + 1$, in GF(2).

Computational complexity (CC) The CC of a given algorithm depends on the number of operations and on the type of operations, i.e., the atomic operations, required to complete it. The CC for algorithm a can be described by

\[ C_a = \sum_{o \in \mathcal{N}} N_o w_o, \]

where $\mathcal{N}$ is the set of all atomic operations needed for algorithm a; $N_o$ is the number of operations performed with atomic operation o, and $w_o$ the cost of o. In general, $N_o$ depends on the specific realization of algorithm a, whereas $w_o$ varies according to the target platform selected for implementing that algorithm.

For estimating the CC, with the implementation on an FA in mind, the atomic operations commonly available on digital signal processing platforms are split into the following categories (remember also Chapter 2):

• ADD: add, subtract, arithmetic shift, compare.

• MAC: multiply, multiply and accumulate, multiply and subtract.

Additional atomic operations that are required in parts of the MIMO receiver, and are thus also considered with special care as atomic operations of a signal processor, are:

• DIV: invert (1/x), divide (y/x).

• ANGLE: compute angle α of a complex number z, $\alpha = \angle(z)$. Other trigonometric functions.

• SQRT: square-root $\sqrt{x}$.

With this subdivision, the CC for algorithm a is computed as

\[ C_a = N_{\mathrm{MAC}} w_{\mathrm{MAC}} + N_{\mathrm{ADD}} w_{\mathrm{ADD}} + N_{\mathrm{DIV}} w_{\mathrm{DIV}} + N_{\mathrm{ANGLE}} w_{\mathrm{ANGLE}} + N_{\mathrm{SQRT}} w_{\mathrm{SQRT}}. \]

$N_{\mathrm{MAC}}$, $N_{\mathrm{ADD}}$, $N_{\mathrm{DIV}}$, $N_{\mathrm{ANGLE}}$, and $N_{\mathrm{SQRT}}$ are the number of MAC, ADD, DIV, ANGLE, and SQRT atomic operations, whereas the costs associated to these atomic operations are set to $w_{\mathrm{MAC}} = w_{\mathrm{ADD}} = w_{\mathrm{DIV}} = w_{\mathrm{ANGLE}} = w_{\mathrm{SQRT}} = 1$. These weights reflect the clock cycles required to complete the corresponding operation. It is important to note that the weights for the additional atomic operations DIV, ANGLE, and SQRT are optimistic and assume single-cycle atomic operations. Thus, the CC of algorithms that involve one of these atomic operations is rather underestimated. Summarizing, the so-defined CC is a rough measure of the clock cycles needed to accomplish algorithm a.

Most of the BB processing algorithms deal with complex-valued numbers. Since conventional FAs do not offer dedicated execution units for complex-valued operations, the CCs are reported in two flavors: first, counting the operations as if only real-valued atomic operations were available on the platform, and second, counting the operations assuming complex-valued atomic operations. The mapping from complex- to real-valued operations, and vice versa, is:

• 1 complex-valued addition ↔ 2 real-valued additions,

• 1 complex-valued multiplication ↔ 4 real-valued multiplications and 2 real-valued additions,

• 1 complex-valued multiply and accumulate ↔ 4 real-valued multiply and accumulate (or, 4 real-valued multiplications and 4 additions).

The mapping of complex- into real-valued division is not necessary since the considered BB algorithms require only real-valued divisions.
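The last mapping in the list above can be made concrete with a short sketch. It shows how one complex-valued multiply and accumulate decomposes into four real-valued MAC-class operations; the function name `complex_mac` is a hypothetical helper, not part of the thesis.

```python
def complex_mac(acc_re, acc_im, a_re, a_im, b_re, b_im):
    """Compute acc + a * b with four real-valued MAC-class operations."""
    acc_re += a_re * b_re  # 1: multiply and accumulate
    acc_re -= a_im * b_im  # 2: multiply and subtract
    acc_im += a_re * b_im  # 3: multiply and accumulate
    acc_im += a_im * b_re  # 4: multiply and accumulate
    return acc_re, acc_im

# Example: (1 + 2j) + (3 + 4j) * (5 - 6j)
result = complex_mac(1, 2, 3, 4, 5, -6)
reference = (1 + 2j) + (3 + 4j) * (5 - 6j)
```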

Finally, the processing performance requirement $P_a$ necessary to compute algorithm a in real-time is obtained by $P_a = C_a / T_a$, where $T_a$ is the time lapse, or the duty cycle, at disposal to complete algorithm a. The required processing performance $P_a$ is expressed in millions of data-operations per second (MdOp/s) the platform has to execute (cf. Chapter 2, with the MdOp/s delivered by the various FAs).
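The two formulas of this section can be combined in a small calculation. The sketch below assumes the system parameters quoted elsewhere in this chapter (52 subchannels and 4 µs per OFDM-symbol); the operation counts in `example_counts` are made-up illustration values and the function names are hypothetical.

```python
def computational_complexity(counts, weights=None):
    """C_a = sum over atomic operations o of N_o * w_o (all weights default to 1)."""
    weights = weights or {}
    return sum(n * weights.get(op, 1) for op, n in counts.items())

def required_performance_mdops(cc_per_subchannel, subchannels=52, duration_us=4.0):
    """P_a = C_a / T_a; operations per microsecond equal MdOp/s."""
    return cc_per_subchannel * subchannels / duration_us

example_counts = {"MAC": 12, "ADD": 4}  # hypothetical per-subchannel counts
cc = computational_complexity(example_counts)
perf = required_performance_mdops(cc)
```

With a per-subchannel CC of 16 the result is 208 MdOp/s, and with 4′096 single-cycle candidate tests it is 53′248 MdOp/s, the figure derived for brute-force ML in Section 3.3.1.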

3.3 Choice of the MIMO Detector

This section reviews the most popular types of MIMO detectors and derives the corresponding CCs, in light of an implementation on an FA equipped with the atomic operations ADD, MAC, DIV, ANGLE, and SQRT. At the end of the section, the CC of the evaluated detectors is reported for one OFDM-subchannel and it is split into two parts. It is reported separately for the receiver’s preprocessing phase (in Table 3.1) and the data processing phase (in Table 3.2), since the algorithms involved in the two phases differ. The subsequent discussion and comparison of the CCs and the BER performances makes it possible to select a MIMO detector that is appropriate for an FA.

For the implementation as dedicated VLSI components, a good reference that compares the appropriate CCs and the BER performances of various MIMO detection algorithms is [51].

3.3.1 Brute-force maximum-likelihood (ML)

MIMO detection is a decision problem: among all possible transmitted vector-symbols $\mathbf{s} \in \mathcal{A}^{M_T}$, the MIMO detector has to choose the vector-symbol $\hat{\mathbf{s}}$ that maximizes the probability of a correct decision (i.e., that $\hat{\mathbf{s}} = \mathbf{s}$).3 From statistics, it is well known that in this case the ML rule (e.g., [7, 52]) maximizes the probability of a correct decision. For the considered MIMO system, the ML rule can be reduced to

\[ \hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \mathcal{A}^{M_T}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2. \qquad (3.2) \]

The solution of (3.2) can be found by first estimating the channel to get $\hat{\mathbf{H}}$ and by precomputing all possible $\hat{\mathbf{H}}\mathbf{s}$ candidates during the preprocessing, followed by exhaustively testing all $|\mathcal{A}^{M_T}| = M^{M_T}$ candidates against the current received vector $\mathbf{y}$ during the data processing.

For hard detection, once the solution of (3.2) is obtained, the entries of $\hat{\mathbf{s}}$ are directly translated into the binary labels of the corresponding constellation points (demapping). The resulting CCs associated to both the preprocessing and the data processing phases are prohibitive [in the order of $O(M^{M_T})$]. The rough calculation of the CC for a 64-QAM 2 × 2 MIMO-OFDM receiver is sufficient to show the overwhelming complexity of the problem. In total, $64^2$ = 4′096 candidate vector-symbols have to be tested, just for one OFDM-subchannel. The assumption that one test can be completed in one clock cycle, for an OFDM system with 52 subchannels and an OFDM-symbol duration of 4 µs, would lead to a required processing performance of 4′096 × 52/4 µs = 53′248 MdOp/s (!). Soft detection would lead to an even higher CC [53].

3The vector-symbol $\mathbf{s} \in \mathcal{A}^{M_T}$ has $M_T$ entries $s \in \mathcal{A}$.

These results show that brute-force ML is not a viable path to find the solution of (3.2). For this reason, brute-force ML is dropped from the candidate list.
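For reference, the exhaustive search of (3.2) can be sketched in a few lines. The example uses a QPSK alphabet with energy 1/M_T and an arbitrary 2 × 2 channel so that only 4² = 16 candidates are tested; the function name `ml_detect` and all numeric values are illustrative assumptions. With 64-QAM the same loop would visit 4′096 candidates per subchannel.

```python
import itertools

def ml_detect(y, H, alphabet):
    """Brute-force ML: return the candidate s minimizing ||y - H s||^2."""
    m_t = len(H[0])
    best, best_metric = None, float("inf")
    for cand in itertools.product(alphabet, repeat=m_t):
        metric = 0.0
        for row, y_i in zip(H, y):
            hs_i = sum(h * s for h, s in zip(row, cand))
            metric += abs(y_i - hs_i) ** 2
        if metric < best_metric:
            best, best_metric = cand, metric
    return best

alphabet = [complex(re, im) for re in (-0.5, 0.5) for im in (-0.5, 0.5)]
H = [[1 + 0.5j, 0.3 - 0.2j],
     [0.1j, 0.8 + 0.4j]]
s_true = (alphabet[0], alphabet[3])
y = [sum(h * s for h, s in zip(row, s_true)) for row in H]  # noise-free reception
s_hat = ml_detect(y, H, alphabet)
```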

3.3.2 Sphere decoding (SD)

A more sophisticated and computationally less complex method to obtain the ML solution is to map the problem (3.2) onto an equivalent tree structure, in a first step. Then, in a second step, the problem can be solved by applying appropriate tree search algorithms that use pruning on forlorn branches to find the ML solution earlier. Eventually, this leads to a lower CC than brute-force ML. In this category, the most prominent and promising algorithm is SD [54], which is considered in the following.

In order to perform the mapping onto the appropriate tree structure, during the preprocessing, the QR-decomposition $\mathbf{H} = \mathbf{Q}\mathbf{R}$ has to be computed. The QR-decomposition leads to the $M_R \times M_T$ orthonormal matrix $\mathbf{Q}$ and to the $M_T \times M_T$ right-triangular matrix $\mathbf{R}$.4 Then, during data processing, the received vector $\mathbf{y}$ is left-multiplied by $\mathbf{Q}^H$, leading to the modified input-output relation [cf. (3.1)] $\tilde{\mathbf{y}} = \mathbf{R}\mathbf{s} + \tilde{\mathbf{n}}$, where $\tilde{\mathbf{y}} = \mathbf{Q}^H\mathbf{y}$, and $\tilde{\mathbf{n}} = \mathbf{Q}^H\mathbf{n}$ has the same statistics as $\mathbf{n}$. This modified input-output relation enables the mapping onto an equivalent tree structure (see Appendix A.1).

The CC of SD during data processing is proportional to the average number of visited tree nodes ($N_{av}$). $N_{av}$, in its turn, depends on the SNR-regime the receiver is operating at. At low SNR, $N_{av}$ is larger than at high SNR, as visible in Figure 3.3. The CC of SD, for one visited node, is reported in Table 3.2.

One problem of SD resides exactly in its varying CC and thus its varying run-time. On average, its complexity is significantly lower than that of brute-force ML; however, in the worst case it is equal.

4The precise QR-decomposition is detailed later, in Section 3.4.


Figure 3.3: CC of SD with respect to SNR for a 2 × 2 and 4 × 4 MIMO system with 64-QAM. [Plot of the average number of visited nodes $N_{av}$ (scale 0 to 20) versus the SNR in dB (scale 0 to 40), with one curve for 2 × 2 MIMO and one for 4 × 4 MIMO.]

Thus, if a certain throughput at a fixed BER must be guaranteed, the implementation of SD has to be dimensioned for the worst case, resulting in a design that fully exploits its resources only occasionally. To overcome this problem, run-time constraints can be imposed on SD such that the search is terminated after a pre-determined amount of time or number of operations, as proposed in [55]. Applying this restriction makes SD attractive, but it slightly degrades the BER performance. In [55], the average number of visited nodes has to be set to values between $N_{av}$ = 7 and 18 for obtaining a reasonable BER performance in a 4 × 4 MIMO system with 16-QAM.

Both above-described SD incarnations, [54, 55], deliver hard decision information. The results in [56, 57] describe a realization of SD capable of both delivering soft decision information and respecting run-time constraints. Although it provides a much better BER performance, the CC of soft-out SD is not yet manageable for the implementation on a reasonably-dimensioned FA.


3.3.3 K-Best (KB)

Another tree-search method that completes in a fixed run-time, at the cost of diverging from the ML performance, is the KB algorithm [58]. As for SD, the preprocessing requires the QR-decomposition of $\mathbf{H}$ for obtaining the tree structure. In [59], a high-throughput VLSI implementation of KB is presented where the complex-valued input-output relation (3.1) is decomposed into a real-valued problem, through its real-valued decomposition (RVD). As a consequence, the size of the involved vectors and matrices doubles, but the computations are all real-valued instead of complex-valued. During data processing, at each tree level KB keeps only the K best solutions in its candidate list. All other candidates are neglected. Once the lowest tree level is attained, the best of the K solutions is returned and the so-obtained vector-symbol is declared as the transmitted vector-symbol (see Appendix A.2).

Although the CC during data processing is reduced compared to SD, involving only real-valued operators due to the RVD, it is still considerable (as depicted in Figure 3.4, at the end of this section). The BER performance slightly diverges from the ML BER performance in the high SNR regime. The larger K, the later the KB BER performance diverges from that of ML.5
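The level-by-level pruning of KB can be sketched as follows. Note the assumptions: the sketch works directly on the complex-valued relation ỹ = Rs + ñ rather than on the real-valued decomposition used in [59], it sorts the full extended list instead of using a hardware-friendly selection, and the function name `k_best_detect` and the numeric values are illustrative only.

```python
def k_best_detect(y_tilde, R, alphabet, k):
    """K-best tree search on y_tilde = R s + n with R upper (right) triangular."""
    m_t = len(R)
    # Each candidate is (accumulated distance, symbols decided so far for levels i+1..M_T).
    candidates = [(0.0, ())]
    for i in range(m_t - 1, -1, -1):
        extended = []
        for dist, tail in candidates:
            # Contribution of the symbols already decided on the levels below.
            interference = sum(R[i][j] * tail[j - i - 1] for j in range(i + 1, m_t))
            for s in alphabet:
                increment = abs(y_tilde[i] - interference - R[i][i] * s) ** 2
                extended.append((dist + increment, (s,) + tail))
        # Keep only the K candidates with the smallest accumulated distance.
        extended.sort(key=lambda c: c[0])
        candidates = extended[:k]
    return candidates[0][1]

alphabet = [complex(re, im) for re in (-0.5, 0.5) for im in (-0.5, 0.5)]
R = [[1.0 + 0j, 0.5 + 0.2j],
     [0j, 0.9 + 0j]]
s_true = (alphabet[1], alphabet[2])
y_tilde = [R[0][0] * s_true[0] + R[0][1] * s_true[1], R[1][1] * s_true[1]]
s_hat = k_best_detect(y_tilde, R, alphabet, k=2)
```

Because exactly K candidates survive each level, the number of distance computations is fixed in advance, which is the property that gives KB its constant run-time.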

3.3.4 Successive interference cancellation (SIC)

During preprocessing the SIC algorithm relies on the QR-decomposition, as it is required for SD and for KB. However, during the data processing phase, SIC has a much lower CC than SD or KB. SIC maps the detection problem onto the equation $\hat{\mathbf{y}} = \mathbf{R}^{-1}\mathbf{Q}^H\mathbf{y}$. Thanks to the right-triangular structure of the $M_T \times M_T$ matrix $\mathbf{R}$, the unknown $\hat{\mathbf{y}}$ is stepwise reconstructed through back-substitution (3.3), solving $\hat{\mathbf{y}} = \mathbf{R}^{-1}\tilde{\mathbf{y}}$ (where $\tilde{\mathbf{y}} = \mathbf{Q}^H\mathbf{y}$):

\[ \hat{y}_i = \tilde{y}_i - \sum_{j=i+1}^{M_T} r_{i,j}\,\hat{s}_j \qquad (3.3) \]
\[ \hat{s}_i = \mathcal{Q}(\hat{y}_i, r_{i,i}), \qquad (3.4) \]

and $i = M_T, \ldots, 1$. After each back-substitution step i, the obtained solution $\hat{y}_i$ is mapped to the nearest constellation point in the alphabet $\mathcal{A}$ (3.4), leading to the (hard) detected vector-symbol $\hat{\mathbf{s}}$.6 The main drawback of SIC is that no good-quality soft information can easily be extracted [61].

5Reference [60] describes the implementation of the KB list sphere decoding on a transport triggered architecture. The achieved throughput is 5.3 Mbit/s.

The BER performance of the SIC algorithm lies between that of KB and linear detection. The CC of SIC during the data processing is derived in Appendix A.3 and it is reported in Table 3.2 at the end of this section.
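The back-substitution (3.3) and slicing (3.4) steps can be sketched as below. The slicer compares the intermediate value against the constellation points scaled by the diagonal element of R, which is one way of taking the diagonal elements into account without a division; this realization, the function name `sic_detect`, and the numeric values are assumptions made for illustration.

```python
def sic_detect(y_tilde, R, alphabet):
    """Hard-output SIC on y_tilde = R s + n with R upper (right) triangular."""
    m_t = len(R)
    s_hat = [None] * m_t
    for i in range(m_t - 1, -1, -1):
        # (3.3): cancel the interference of the symbols detected so far.
        y_hat_i = y_tilde[i] - sum(R[i][j] * s_hat[j] for j in range(i + 1, m_t))
        # (3.4): slice against the alphabet scaled by r_ii (no division needed).
        s_hat[i] = min(alphabet, key=lambda a: abs(y_hat_i - R[i][i] * a))
    return s_hat

alphabet = [complex(re, im) for re in (-0.5, 0.5) for im in (-0.5, 0.5)]
R = [[1.0 + 0j, 0.5 + 0.2j],
     [0j, 0.9 + 0j]]
s_true = [alphabet[1], alphabet[2]]
y_tilde = [R[0][0] * s_true[0] + R[0][1] * s_true[1], R[1][1] * s_true[1]]
s_detected = sic_detect(y_tilde, R, alphabet)
```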

3.3.5 Linear detection

Linear detectors reduce the CC in the receiver by splitting the MIMO detection problem into $M_T$ independent SISO problems, before applying the ML rule to each stream. To this end, the received vector $\mathbf{y}$ is multiplied with an estimator matrix $\mathbf{G}$. Commonly the matrix $\mathbf{G}$ is obtained as either zero forcing (ZF) or minimum mean-squared error (MMSE) estimator. ZF has a slightly worse BER performance and almost the same computational complexity as MMSE; therefore, MMSE is usually preferred.

To derive the MMSE estimate $\hat{\mathbf{y}}$, given the received vector $\mathbf{y}$ in (3.1) and the channel estimate $\hat{\mathbf{H}}$, the following three steps are required [7]:

\[ \mathbf{F} = \hat{\mathbf{H}}^H\hat{\mathbf{H}} + M_T\sigma^2\mathbf{I} \qquad (3.5) \]
\[ \mathbf{G} = \mathbf{F}^{-1}\hat{\mathbf{H}}^H \qquad (3.6) \]
\[ \hat{\mathbf{y}} = \mathbf{G}\mathbf{y}. \qquad (3.7) \]

It is important to note that in (3.6) the matrix $\mathbf{F}$ has to be inverted to obtain $\mathbf{G}$, constituting a major computational challenge when it comes to the fixed-point implementation on the FA. Furthermore, to obtain the ZF estimator matrix it suffices to set $\sigma^2 = 0$ in (3.5).

The vector $\hat{\mathbf{y}}$ is further processed to obtain the estimated transmitted symbol $\hat{\mathbf{s}} = \mathcal{Q}(\hat{\mathbf{y}})$ through the slicing operation $\mathcal{Q}(\cdot)$, when performing hard detection. As an alternative, with only minor additional CC, $\hat{\mathbf{y}}$ can be processed further to obtain appropriate soft information as detailed in [53, 62]. Here, the possibility of obtaining good-quality soft information represents a significant advantage over SIC. The CC of linear MMSE detection during the preprocessing in Table 3.1 is obtained using the rank-1 update method (proposed in [63]) for the inversion of $\mathbf{F}$.

6Note that the mapping $\mathcal{Q}(\cdot)$ takes the diagonal elements $r_{i,i}$ of $\mathbf{R}$ into account for scaling the decision boundaries, such that no division is required.
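Steps (3.5) to (3.7) can be sketched for the special case of two transmit streams. The closed-form 2 × 2 inverse used here is a simplification chosen for brevity and is not the rank-1 update method of [63]; the function name `mmse_estimate_2x2` and the numeric values are illustrative assumptions.

```python
def mmse_estimate_2x2(y, H, sigma2, m_t=2):
    """Linear MMSE estimate for M_T = 2; sigma2 = 0 gives the ZF estimate."""
    m_r = len(H)
    # H^H: conjugate transpose, shape 2 x M_R.
    hh = [[H[r][c].conjugate() for r in range(m_r)] for c in range(2)]
    # (3.5): F = H^H H + M_T * sigma^2 * I
    F = [[sum(hh[a][r] * H[r][b] for r in range(m_r)) + (m_t * sigma2 if a == b else 0.0)
          for b in range(2)] for a in range(2)]
    # Closed-form inverse of the 2 x 2 matrix F.
    det = F[0][0] * F[1][1] - F[0][1] * F[1][0]
    F_inv = [[F[1][1] / det, -F[0][1] / det],
             [-F[1][0] / det, F[0][0] / det]]
    # (3.6): G = F^-1 H^H
    G = [[sum(F_inv[a][b] * hh[b][r] for b in range(2)) for r in range(m_r)]
         for a in range(2)]
    # (3.7): y_hat = G y
    return [sum(G[a][r] * y[r] for r in range(m_r)) for a in range(2)]

H = [[1 + 0.5j, 0.3 - 0.2j],
     [0.1j, 0.8 + 0.4j]]
s_true = [complex(0.5, 0.5), complex(-0.5, 0.5)]
y = [sum(h * s for h, s in zip(row, s_true)) for row in H]  # noise-free reception
y_hat_zf = mmse_estimate_2x2(y, H, sigma2=0.0)
y_hat_mmse = mmse_estimate_2x2(y, H, sigma2=0.01)
```

In the noise-free ZF case the estimate reproduces the transmitted symbols up to rounding, whereas a nonzero σ² regularizes the inversion and slightly shrinks the estimate.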

3.3.6 Results and conclusion

Complexity Tables 3.1 and 3.2 summarize the CCs of the above-described MIMO detectors for the preprocessing and the data processing, respectively. Both the case in which only real-valued atomic operations are at disposal and the case in which complex-valued atomic operations are at disposal are considered. The CCs are listed for one OFDM-subchannel, and thus, for obtaining the CC of the MIMO-OFDM detector, they have to be scaled with the number of data-carrying OFDM-subchannels.

Although the reported CCs represent estimates and do not take any implementation overhead into account, they are essential to relate the detectors among each other. Figure 3.4 visualizes the findings for different, symmetric ($M_T = M_R$), antenna configurations. For KB the CC was obtained with K = 5, whereas for SD the average number of visited tree nodes was set to $N_{av}$ = 2.5, 3.75, and 5 for the 2 × 2, 3 × 3, and 4 × 4 systems, respectively. These $N_{av}$ values are optimistic, and correspond to operating in the high SNR regime (cf. Figure 3.3).

While all considered detectors have a comparable CC during preprocessing, the situation is completely different during data processing. As expected, SD and KB have a much higher CC than SIC and linear MMSE detection, and SIC is slightly more complex than linear MMSE. Examining the MT = MR = 2 case with real-valued atomic operations [see Figure 3.4(a)] shows that the preprocessing has a CC of 224 (with QR-decomposition), and the data processing a CC of 1′940 with SD, 574 with KB, 96 with SIC, and 16 with linear MMSE detection. Scaling these results to a MIMO-OFDM system such as the one described later in Section 3.5 leads to a required processing performance of more than 2′900 MdOp/s for the preprocessing, and of approximately 25′000 MdOp/s, 7′400 MdOp/s, 1′200 MdOp/s, and 200 MdOp/s for SD, KB, SIC, and linear MMSE, respectively, during the data processing phase.⁷

As these figures show, linear MMSE and SIC detection may fit on a conventional high-end digital signal processor (DSP), e.g., TI's C6455 (with a peak data processing performance of 4′000 MdOp/s) or ADI's TigerSHARC (with 2′400 MdOp/s peak data performance).⁸ SD and KB, however, are far beyond that possibility.

An FA with execution units performing complex-valued arithmetic would reduce the CC [see Figure 3.4(b)], and the processing requirements would become 800 MCdOp/s (millions of complex-valued operations per second) for the preprocessing with QR-decomposition, and 8′400 MCdOp/s (SD), 7′400 MCdOp/s (KB), 400 MCdOp/s (SIC), and 50 MCdOp/s (MMSE) for the data payload processing.

BER performance. First, the BER performance without any decoding is considered. The simulations were run according to the setup described in Section 3.2, and the results are reproduced in Figure 3.5 for both a 2×2 and a 4×4 MIMO system. The performance gap between SD and KB, which achieve ML and near-ML performance, and SIC and MMSE is substantial. For the 2×2 system the gap is around 5 dB at an SNR of 30 dB, whereas for the 4×4 system it is more than 10 dB.

Figure 3.6 illustrates the BER performance obtained when considering encoding and decoding, employing a MIMO detector delivering hard-out information. The gap between SD and KB on one side and SIC and MMSE on the other is perceptible here as well. In addition, the figure shows the BER performance obtained with soft-out information for SD and linear MMSE. Note that with the simulation setup described in Section 3.2, where the channel H has i.i.d. CN(0, 1) entries, the BER performance difference between these two MIMO detectors is only minimal. Using a different channel model leads to a different gap in the BER performance between soft-out SD and soft-out MMSE detection. With the TGn channel, for instance, the gap becomes larger [64].

⁷ The MIMO-OFDM system considered in Section 3.5 employs 52 data-carrying OFDM-subchannels and the data processing has to be concluded in 4 µs. Hence, the required processing performance is computed as Pa = Ca · 52/(4 µs).

⁸ The data processing performance is derived as PP = 2-way SIMD × 2 dOp/SIMD × 1 GHz = 4′000 MdOp/s for the C6455. For the TigerSHARC ADSP TS201S: PP = 2-way SIMD × 2 dOp/SIMD × 600 MHz = 2′400 MdOp/s.
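The scaling rule of footnote 7 can be illustrated with a minimal sketch. The function and variable names are hypothetical and chosen only for illustration; the CC values are the 2×2 real-valued figures quoted in the text.

```python
# Sketch of the scaling rule in footnote 7: P = C * 52 / 4 us.
# Names are illustrative and not taken from the thesis.
N_SUBCHANNELS = 52   # data-carrying OFDM-subchannels
T_BUDGET_US = 4.0    # time budget in microseconds

def required_mops(cc_per_subchannel):
    """Operations per microsecond, i.e. millions of operations per second."""
    return cc_per_subchannel * N_SUBCHANNELS / T_BUDGET_US

# Per-subchannel CCs for MT = MR = 2 with real-valued atomic operations.
cc_real_2x2 = {"preprocessing": 224, "SD": 1940, "KB": 574, "SIC": 96, "MMSE": 16}
requirements = {name: required_mops(cc) for name, cc in cc_real_2x2.items()}
```

The resulting values round to the figures given in the running text (for example, slightly more than 2′900 MdOp/s for the preprocessing).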


Table 3.1: Per-subchannel CC of different MIMO detectors – Preprocessing.

Detector        Ops.    Preprocessing
Brute-force ML  C-Ops.  $M_R M_T M^{M_T}$
Brute-force ML  R-Ops.  $4 M_R M_T M^{M_T}$
SD, KB, SIC     C-Ops.  $(17/2 + 2M_R) M_R M_T + \tfrac{3}{2} M_R M_T^2$
SD, KB, SIC     R-Ops.  $4(7 + 2M_R) M_R M_T + 6 M_R M_T^2$
Linear MMSE     C-Ops.  $2M_R + 2M_R M_T + 4 M_R M_T^2$
Linear MMSE     R-Ops.  $3M_R + 8M_R M_T + 14 M_R M_T^2$
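As a cross-check, the preprocessing formulas of Table 3.1 can be evaluated numerically. The following is a minimal sketch with hypothetical function names; for symmetric configurations it reproduces the preprocessing values shown in Figure 3.4 (e.g., 224 and 150 real-valued operations for the 2×2 case).

```python
# Sketch: evaluate the preprocessing CC formulas of Table 3.1.
# Function names are illustrative only.

def preprocessing_cc_qr(m_r, m_t, complex_ops):
    """Preprocessing CC shared by SD, KB, and SIC (QR-decomposition based)."""
    if complex_ops:
        return (17 / 2 + 2 * m_r) * m_r * m_t + 3 / 2 * m_r * m_t ** 2
    return 4 * (7 + 2 * m_r) * m_r * m_t + 6 * m_r * m_t ** 2

def preprocessing_cc_mmse(m_r, m_t, complex_ops):
    """Preprocessing CC of linear MMSE detection."""
    if complex_ops:
        return 2 * m_r + 2 * m_r * m_t + 4 * m_r * m_t ** 2
    return 3 * m_r + 8 * m_r * m_t + 14 * m_r * m_t ** 2

# Real-valued CCs for the symmetric 2x2, 3x3, and 4x4 configurations.
real_cc = [(preprocessing_cc_qr(m, m, False), preprocessing_cc_mmse(m, m, False))
           for m in (2, 3, 4)]
```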

Conclusion. Although SD and KB exhibit a much better BER performance than SIC and MMSE, the analysis of their CCs and the resulting processing performance requirements excludes them from the candidates suitable for an FA. Between SIC and MMSE there is no practical BER performance gain. Thus, considering that MMSE detection has the lowest CC during both preprocessing and data processing, the implementation of a linear MMSE detector seems the most reasonable step towards a MIMO-OFDM SDR implementation on an FA. Further, this choice is reinforced by the observation that it is still possible to boost the MMSE detector's BER performance by generating soft decision information, as illustrated in Figure 3.6.

3.4 Linear MMSE Detection

The hardest computational kernel involved in the computation of the linear MMSE estimator matrix G in (3.6) is the matrix inversion $F^{-1}$ performed during the preprocessing. Therefore, this section inspects different methods to obtain G and, especially, to invert the $M_T \times M_T$ matrix F. By comparing the associated CCs and the achievable BER performance, as done in the previous section, it is possible to quantify the qualities of the evaluated methods, which eventually permits selecting the method that best fits the target platform.

Figure 3.4: CC of different MIMO detectors (SD with Nav = 2.5, 3.75, 5; KB with K = 5; SIC; LMMSE) versus antenna configuration (M×M), with one panel for preprocessing and one for symbol processing. (a) CC considering real-valued atomic operations. (b) CC considering complex-valued atomic operations. [Plots not reproduced.]

Figure 3.5: Uncoded BER performance for different MIMO detectors with 64-QAM. Top: 2×2 MIMO system. Bottom: 4×4 MIMO system. [Plots not reproduced; axes: BER versus SNR [dB]; curves: SD, KB (K = 5), SIC, MMSE.]

Figure 3.6: Coded BER performance for different hard-out MIMO detectors with 64-QAM. Top: 2×2 MIMO system. Bottom: 4×4 MIMO system. [Plots not reproduced; axes: BER versus SNR [dB]; curves: hard-out SD, hard-out KB (K = 5), hard-out SIC, hard-out MMSE, soft-out SD, soft-out MMSE.]

Table 3.2: Per-subchannel CC of different MIMO detectors – Data processing.

Detector        Ops.    Symbol detection
Brute-force ML  C-Ops.  $(2M_R + 1) M^{M_T}$
Brute-force ML  R-Ops.  $(6M_R + 2) M^{M_T}$
SD              C-Ops.  $N_{av}(4M + M_T + 1)$
SD              R-Ops.  $N_{av}(12M + 4M_T)$
KB              C-Ops.  $(K(6\sqrt{M} - 1) + 4\sqrt{M}) M_T + 2K M_T^2$
KB              R-Ops.  $(K(6\sqrt{M} - 1) + 4\sqrt{M}) M_T + 2K M_T^2$
SIC             C-Ops.  $(1/2 + \sqrt{M} + \log_2 M) M_T + M_T^2/2$
SIC             R-Ops.  $2(2\sqrt{M} + \log_2 M) M_T + 2M_T^2$
Linear MMSE     C-Ops.  $M_R M_T$
Linear MMSE     R-Ops.  $4 M_R M_T$

The methods considered for computing G are: the classical adjoint method; indirect inversion of F through LR-decomposition, LDL-decomposition, GS-decomposition, and QR-decomposition; and direct inversion of F by a series of Rank-1 updates and by the Divide-and-Conquer (D&C) algorithm. While the exact CCs and the steps leading to the inverse of F for the above-mentioned algorithms are all detailed in Appendix A.4, in the following only the underlying principles are explained and the corresponding atomic operations are identified. The discussion focuses on the steps that are performed during the preprocessing and on the resulting CC, since the CC of the data processing is equal for all methods. The CC of the data processing arises from the matrix-vector multiplication $\tilde{y} = Gy$ that is part of the detection. It amounts to $4M_RM_T$ when considering only real-valued atomic operations, or to $M_RM_T$ with complex-valued atomic operations.
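The data-processing formulas of Table 3.2 can be evaluated in the same spirit. The sketch below uses hypothetical function names and assumes 64-QAM (M = 64), K = 5, and the Nav values stated for Figure 3.4; under these assumptions it reproduces the symbol-processing values of that figure, including the $4M_RM_T$ cost of the linear MMSE matrix-vector product.

```python
import math

# Sketch: evaluate the data-processing CC formulas of Table 3.2.
# Function names are illustrative; m_order is the constellation size M.

def sd_cc(m_t, m_order, n_av, complex_ops):
    if complex_ops:
        return n_av * (4 * m_order + m_t + 1)
    return n_av * (12 * m_order + 4 * m_t)

def kb_cc(m_t, m_order, k):
    root = math.sqrt(m_order)
    return (k * (6 * root - 1) + 4 * root) * m_t + 2 * k * m_t ** 2

def sic_cc(m_t, m_order, complex_ops):
    root, bits = math.sqrt(m_order), math.log2(m_order)
    if complex_ops:
        return (0.5 + root + bits) * m_t + m_t ** 2 / 2
    return 2 * (2 * root + bits) * m_t + 2 * m_t ** 2

def mmse_cc(m_r, m_t, complex_ops):
    return m_r * m_t if complex_ops else 4 * m_r * m_t

# Real-valued CCs for the 2x2 system with 64-QAM.
cc_2x2 = (sd_cc(2, 64, 2.5, False), kb_cc(2, 64, 5),
          sic_cc(2, 64, False), mmse_cc(2, 2, False))
```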

Finally, we remark that F is Hermitian and positive-definite by construction. A matrix F is Hermitian if it satisfies $F^H = F$, and positive-definite if $x^H F x > 0$ holds for all nonzero $x \in \mathbb{C}^{M_T}$ [65]. Some of the presented methods exploit this structure of F to reduce the CC, while others do not, since no significant advantage would result.


3.4.1 Adjoint method

The classical way of presenting the inverse of an $M_T \times M_T$ matrix F is by the adjoint method:

$$F^{-1} = \frac{\operatorname{adj}(F)}{\det(F)}. \tag{3.8}$$

The solution is trivial for the case $M_T = 1$, since F is then a positive scalar and the CC amounts to 1 (i.e., one division). For $M_T = 2$, (3.8) corresponds to:

$$F^{-1} = \begin{bmatrix} a & b \\ b^* & c \end{bmatrix}^{-1} = \frac{1}{ac - bb^*} \begin{bmatrix} c & -b \\ -b^* & a \end{bmatrix}. \tag{3.9}$$

To solve (3.9) employing complex-valued atomic operations, 5 CMACs and 1 real-valued DIV are required, resulting in a CC of 6. When considering real-valued atomic operations, 20 MACs and 1 real-valued DIV are necessary, leading to a CC of 21. In total, the CC for the computation of G is 22 with complex-valued operations, or 69 when counting real-valued atomic operations.

In the case $M_T > 2$, however, the high CC and the large dynamic range render the use of (3.8) impractical.
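A minimal NumPy sketch of the 2×2 direct inversion in (3.9) follows. The function name and the example channel are assumptions made for illustration; floating-point arithmetic is used, not the fixed-point arithmetic evaluated in the thesis.

```python
import numpy as np

def inverse_2x2_hermitian(f):
    """Invert a 2x2 Hermitian positive-definite matrix via (3.9)."""
    a, b, c = f[0, 0].real, f[0, 1], f[1, 1].real
    det = a * c - (b * np.conj(b)).real      # real and positive for such F
    adj = np.array([[c, -b], [-np.conj(b), a]])
    return adj / det                          # a single real-valued division

# Hypothetical 2x2 channel and noise variance, used only to build F.
h = np.array([[1.0 + 0.5j, -0.2j], [0.3, 0.8 - 0.1j]])
f = h.conj().T @ h + 2 * 0.01 * np.eye(2)
f_inv = inverse_2x2_hermitian(f)
```

In hardware, the division would typically be computed once as a reciprocal and then applied to the adjoint by multiplications; the sketch does not model that detail.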

3.4.2 LR-decomposition

LR-decomposition [66] decomposes F into a left-triangular $M_T \times M_T$ matrix L and a right-triangular $M_T \times M_T$ matrix R, such that $LR = F$. Then, two successive back-substitution steps lead to the matrix G: 1) $A = L^{-1}H^H$, and 2) $G = R^{-1}A$. No divisions are required during the first back-substitution, since the diagonal entries of L are all 1. However, for the second back-substitution the inversion of the diagonal elements of R is required. The atomic operations for LR-decomposition are ADD, MAC, and DIV.
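The two substitution steps can be sketched as follows. This is an illustrative floating-point version that assumes a Doolittle-style elimination without pivoting (acceptable here because F is positive-definite); the function names are hypothetical and the operation count is not meant to match Appendix A.4.

```python
import numpy as np

def lr_decompose(f):
    """Doolittle LR-decomposition without pivoting: F = L R, diag(L) = 1."""
    n = f.shape[0]
    l = np.eye(n, dtype=complex)
    r = f.astype(complex).copy()
    for col in range(n - 1):
        for row in range(col + 1, n):
            l[row, col] = r[row, col] / r[col, col]
            r[row, :] -= l[row, col] * r[col, :]
    return l, r

def mmse_matrix_lr(h, sigma2):
    """G = F^{-1} H^H with F = H^H H + M_T sigma^2 I, via L and R."""
    m_r, m_t = h.shape
    hh = h.conj().T
    l, r = lr_decompose(hh @ h + m_t * sigma2 * np.eye(m_t))
    a = np.zeros((m_t, m_r), dtype=complex)
    for i in range(m_t):                 # step 1: A = L^{-1} H^H, no divisions
        a[i] = hh[i] - l[i, :i] @ a[:i]
    g = np.zeros_like(a)
    for i in reversed(range(m_t)):       # step 2: G = R^{-1} A, divides by r[i, i]
        g[i] = (a[i] - r[i, i + 1:] @ g[i + 1:]) / r[i, i]
    return g

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
g = mmse_matrix_lr(h, 0.1)
```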

3.4.3 LDL-decomposition

With LDL-decomposition [66], F is decomposed into the left-triangular $M_T \times M_T$ matrix L and the diagonal $M_T \times M_T$ matrix D, with the property $LDL^H = F$. To perform the detection, first the multiplication $R = DL^H$ leads to a right-triangular matrix R. Thereafter, the two consecutive back-substitution steps utilized in the LR-decomposition are performed to obtain the MMSE estimator G. As for LR-decomposition, the atomic operations are ADD, MAC, and DIV.

3.4.4 GS-decomposition

When using GS-decomposition⁹ [66], the augmented channel matrix

$$\bar{H} = \left[ H^H \;\; \sqrt{M_T}\,\sigma I_{M_T} \right]^H \in \mathbb{C}^{(M_R+M_T)\times M_T}$$

is decomposed into the $(M_R+M_T) \times M_T$ orthonormal matrix Q and the right-triangular $M_T \times M_T$ matrix R, such that $QR = \bar{H}$. Then, the augmented matrix $\bar{G} = R^{-1}Q^H$ is computed. During data processing, detection is performed as $\tilde{y} = \bar{G}\bar{y}$, where $\bar{y}$ is the received vector y extended with zeros to render the multiplication with $\bar{G}$ possible.

The presence of the SQRT atomic operation in the GS-decomposition, in addition to the common atomic operations ADD, MAC, and DIV, is a clear drawback of this method: when two methods have nearly equal CC, the method with the fewer, or less costly, atomic operations is favored.¹⁰
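The augmented-matrix construction can be sketched as below. This is a floating-point illustration with hypothetical names; it uses the modified Gram-Schmidt ordering and, for brevity, `numpy.linalg.solve` in place of an explicit back-substitution for $R^{-1}Q^H$. The first $M_R$ columns of the augmented result coincide with the MMSE matrix G.

```python
import numpy as np

def mmse_matrix_gs(h, sigma2):
    """Return the augmented matrix R^{-1} Q^H of size M_T x (M_R + M_T)."""
    m_r, m_t = h.shape
    # Augmented channel: H stacked on sqrt(M_T) * sigma * I.
    h_aug = np.vstack([h, np.sqrt(m_t * sigma2) * np.eye(m_t)]).astype(complex)
    q = np.zeros_like(h_aug)
    r = np.zeros((m_t, m_t), dtype=complex)
    for j in range(m_t):
        v = h_aug[:, j].copy()
        for i in range(j):
            r[i, j] = np.vdot(q[:, i], v)      # projection onto earlier columns
            v = v - r[i, j] * q[:, i]
        r[j, j] = np.sqrt(np.vdot(v, v).real)  # the SQRT atomic operation
        q[:, j] = v / r[j, j]
    return np.linalg.solve(r, q.conj().T)

rng = np.random.default_rng(1)
h = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
g_aug = mmse_matrix_gs(h, 0.05)
g = g_aug[:, :3]   # the part that multiplies the non-zero entries of the extended y
```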

3.4.5 QR-decomposition

The classical QR-decomposition [66] involves the same steps as the GS-decomposition. The only difference is the method used to obtain the matrices Q and R. The QR-decomposition analyzed in Appendix A.4.5 relies on Givens rotations, which require the computation of the arctangent as a fundamental operation. Therefore, compared to GS-decomposition, the ANGLE atomic operation is required instead of the SQRT atomic operation, in addition to ADD, MAC, and DIV. As for GS-decomposition, the presence of an additional atomic operation (here ANGLE) represents a potential disadvantage.

⁹ Named after the initials of its two independent discoverers: Jørgen Pedersen Gram and Erhard Schmidt. Gram published in 1883 [67], whereas Schmidt published in 1907 [68].

¹⁰ Recall that, for deriving the CC, the weights of all atomic operations have been set to 1 (in Section 3.2), which is, of course, an approximation. If the difference among the CCs of the various linear MMSE detection methods under evaluation becomes too small for a clear choice, these weights may be refined. However, this will not be necessary, as the final discussion will show.

3.4.6 Rank-1 update

Rank-1 update (R1) [63] directly inverts F, starting from $P^{(0)} = \frac{1}{M_T\sigma^2} I_{M_T}$ and performing a series of $M_T$ rank-1 updates ($k = 1, 2, \ldots, M_T$):

$$P^{(k)} = P^{(k-1)} - \frac{P^{(k-1)} h_k^H h_k P^{(k-1)}}{1 + h_k P^{(k-1)} h_k^H},$$

where $h_k$ denotes the kth row of the channel estimate H. Eventually, the inverse of F is obtained as $F^{-1} = P^{(M_T)}$ and further employed to compute $G = F^{-1}H^H$. The required atomic operations are ADD, MAC, and DIV.
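The update recursion can be sketched as follows (floating point, hypothetical names). The loop runs over the rows of H, of which there are $M_R$; for the symmetric configurations considered in the text this equals $M_T$.

```python
import numpy as np

def inverse_rank1(h, sigma2):
    """Invert F = H^H H + M_T sigma^2 I by one rank-1 update per row of H."""
    m_r, m_t = h.shape
    p = np.eye(m_t, dtype=complex) / (m_t * sigma2)   # P^(0)
    for k in range(m_r):
        hk = h[k:k + 1, :]                  # k-th row, shape 1 x M_T
        ph = p @ hk.conj().T                # P h_k^H, shape M_T x 1
        denom = 1.0 + (hk @ ph).real.item() # real because P is Hermitian
        p = p - (ph @ (hk @ p)) / denom
    return p

rng = np.random.default_rng(2)
h = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
f_inv = inverse_rank1(h, 0.1)
g = f_inv @ h.conj().T
```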

3.4.7 Divide-and-Conquer algorithm

The Divide-and-Conquer algorithm (D&C) [13] is a recursive matrix inversion method. The key element of D&C is the partitioning of F as

$$F = \begin{bmatrix} A & B \\ B^H & C \end{bmatrix}, \tag{3.10}$$

with $A \in \mathbb{C}^{p \times p}$ and $C \in \mathbb{C}^{(M_T-p)\times(M_T-p)}$, $1 \le p < M_T$. Then, using the Banachiewicz formula for the inverse of a partitioned matrix [69],¹¹ $F^{-1}$ is computed as:

$$F^{-1} = \begin{bmatrix} A^{-1} + A^{-1}BS^{-1}B^HA^{-1} & -A^{-1}BS^{-1} \\ -S^{-1}B^HA^{-1} & S^{-1} \end{bmatrix}, \tag{3.11}$$

with $S \triangleq C - B^HA^{-1}B$ being the Schur complement of A in F. As a result, the task of inverting the $M_T \times M_T$ matrix F can be replaced by the simpler tasks of inverting the $p \times p$ matrix A and the $(M_T-p)\times(M_T-p)$ matrix S, followed by combining the resulting $A^{-1}$ and $S^{-1}$ according to (3.11). If the matrix A (or S) has dimension 2×2 or less, direct matrix inversion is performed to obtain $A^{-1}$ (or $S^{-1}$). Otherwise, A (or S) is partitioned as in (3.10), leading to a recursive procedure to obtain $A^{-1}$ (or $S^{-1}$). The recursion terminates when the matrix A (or S) at the current level of recursion can be inverted by scalar or 2×2 direct inversion.

The atomic operations required by D&C are ADD, MAC, and DIV.

¹¹ The formula for the inverse of a nonsingular partitioned block matrix was introduced in 1937 by the astronomer Tadeusz Banachiewicz (1882-1954) [70]. Note, however, that as stated in [69], closely related results were obtained earlier, in 1923 by the geodesist Hans Boltz (1883-1947) [71] and in 1933 by Ralf Lohan (1902-2000) [72].
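The recursion defined by (3.10) and (3.11) can be sketched as follows. The split point p = ⌊n/2⌋ is an assumption made for this illustration (the text only requires 1 ≤ p < M_T), and the function name is hypothetical.

```python
import numpy as np

def inverse_dc(f):
    """Recursive inversion of a Hermitian positive-definite matrix via (3.11)."""
    n = f.shape[0]
    if n == 1:
        return 1.0 / f
    if n == 2:                                   # direct 2x2 inversion, cf. (3.9)
        det = f[0, 0] * f[1, 1] - f[0, 1] * f[1, 0]
        return np.array([[f[1, 1], -f[0, 1]], [-f[1, 0], f[0, 0]]]) / det
    p = n // 2                                   # assumed split point
    a, b, c = f[:p, :p], f[:p, p:], f[p:, p:]
    a_inv = inverse_dc(a)
    a_inv_b = a_inv @ b
    s_inv = inverse_dc(c - b.conj().T @ a_inv_b)  # inverse of the Schur complement
    top_right = -a_inv_b @ s_inv
    # A^{-1} + A^{-1} B S^{-1} B^H A^{-1}, using that A^{-1} is Hermitian.
    top_left = a_inv - top_right @ a_inv_b.conj().T
    return np.block([[top_left, top_right], [top_right.conj().T, s_inv]])

rng = np.random.default_rng(3)
h = rng.standard_normal((6, 5)) + 1j * rng.standard_normal((6, 5))
f = h.conj().T @ h + 5 * 0.1 * np.eye(5)
f_inv = inverse_dc(f)
```

The lower-left block is obtained as the Hermitian transpose of the upper-right block, which exploits the Hermitian structure of F and avoids one matrix product.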

3.4.8 Results and conclusion

Complexity. Table 3.3 summarizes the preprocessing CCs for the above-described linear MMSE detection methods. The CCs are formulated for both real- and complex-valued atomic operations, as done for the MIMO detector comparison. Figure 3.7 shows the corresponding complexity scaling for symmetric (MT = MR) MIMO systems and for the processing of one OFDM-subchannel.

When considering an FA that provides real-valued atomic operations, the LR and LDL algorithms have a slightly lower CC than D&C [Figure 3.7(a)]. The GS-decomposition, the Rank-1 update algorithm, and the QR-decomposition have the highest CCs. Recall that these CCs are higher even though the weights $w_{SQRT}$ (for GS) and $w_{ANGLE}$ (for QR) were set to 1. On the other hand, when complex-valued atomic operations are available, the D&C algorithm has the lowest CC for a given antenna configuration, which is explained by the fact that D&C has more complex-valued multiplications and fewer complex-valued additions than LR. The remaining rankings are unchanged [Figure 3.7(b)].

Thus, neglecting any implementation overhead, an FA that provides real-valued ADD, MAC, and DIV atomic operations will require the fewest clock cycles with LR- or LDL-decomposition, whereas a platform with the same set of complex-valued atomic operations will require fewer cycles with D&C. For instance, the processing performance required for a 2×2 MIMO-OFDM system with 52 data-carrying OFDM-subchannels amounts to approximately 850 MdOp/s and 280 MCdOp/s for real- and complex-valued atomic operations, respectively.


Table 3.3: CC of different linear MMSE detectors – Preprocessing.

Method  Ops.    Preprocessing
LR      C-Ops.  $1 - 2M_R + (-1/3 + \tfrac{5}{2}M_R)M_T + (1 + \tfrac{3}{2}M_R)M_T^2 + M_T^3/3$
LR      R-Ops.  $2 - 4M_R + (-7/3 + 6M_R)M_T + (2 + 6M_R)M_T^2 + \tfrac{4}{3}M_T^3$
LDL     C-Ops.  $-2M_R + (4/3 + \tfrac{5}{2}M_R)M_T + \tfrac{3}{2}(1 + M_R)M_T^2 + M_T^3/6$
LDL     R-Ops.  $-4M_R + (4/3 + 6M_R)M_T + (5 + 6M_R)M_T^2 + \tfrac{2}{3}M_T^3$
GS      C-Ops.  $-1 + (4 + M_R)M_T + (1/2 + 2M_R)M_T^2 + M_T^3/2$
GS      R-Ops.  $-2 + (25/3 + 5M_R)M_T + (2 + 7M_R)M_T^2 + \tfrac{5}{3}M_T^3$
QR      C-Ops.  $-1 + (2 + 9M_R + 2M_R^2)M_T + 2M_RM_T^2$
QR      R-Ops.  $-2 + (3 + 30M_R + 8M_R^2)M_T + 8M_RM_T^2$
R1      C-Ops.  $2M_R + 2M_RM_T + 4M_RM_T^2$
R1      R-Ops.  $3M_R + 8M_RM_T + 14M_RM_T^2$
D&C     C-Ops.  $-9 + (41 + 3M_R)M_T/6 + (-1 + 3M_R)M_T^2/2 + \tfrac{2}{3}M_T^3$
D&C     R-Ops.  $-30 + (127/6 + 2M_R)M_T + (-3/2 + 6M_R)M_T^2 + \tfrac{7}{3}M_T^3$

BER performance. Figure 3.8 illustrates the uncoded BER performance achievable in a 4×4 MIMO system transmitting 64-QAM symbols, where the linear MMSE estimator G is computed employing the previously described algorithms with 16-bit fixed-point precision. Each algorithm has been independently optimized to obtain the highest possible BER performance. In the figure, the floating-point BER curve serves as a reference to quantify the implementation loss. The Rank-1 update algorithm is the first to depart from the reference curve and exhibits the highest error floor, at BER = 5 · 10⁻². The D&C algorithm's error floor is at BER = 10⁻². The LR, LDL, GS, and QR decompositions all have a lower error floor.

Conclusion. For the implementation of MIMO(-OFDM) systems with two or fewer antennas at the receiver, the adjoint method (direct matrix inversion) is the most promising. For systems with more than two antennas at the receiver, the D&C and LR algorithms are the most promising ones when low CC is desired. However, if the BER performance attained by D&C and LR is deemed insufficient, the GS- and QR-decomposition come into play at the price of a higher CC. Finally, for the implementation on an FA, the most reasonable choice is to start by considering a 2×2 MIMO-OFDM system, where the CC appears affordable.


Figure 3.7: CC-scaling of different linear MMSE MIMO detectors (D&C, LR-decomposition, LDL-decomposition, GS-decomposition, Rank-1, QR-decomposition) with MT = MR antennas during the preprocessing phase and for one OFDM-subchannel. (a) CC accounting for real-valued operators. (b) CC accounting for complex-valued operators. [Plots not reproduced.]

Figure 3.8: Uncoded BER performance for a linear MMSE 4×4 MIMO system using different matrix decomposition methods. Floating point vs. quantized 16-bit fixed-point operation. [Plot not reproduced; axes: BER versus SNR [dB]; curves: floating point, LR, GS, LDL, QR, R1, DC.]

Figure 3.9: 2×2 MIMO-OFDM-frame structure (top), and corresponding receiver states (bottom). [Diagram not reproduced. Top: for each receive antenna, the preamble consists of the STF (t1 ... t10, 8 µs), LTF 1 (GI2, T1A, T1B, 8 µs), and LTF 2 (GI, T2, 4 µs), followed by the data payload of MIMO-OFDM symbols S1 ... SNd (GI and D, 4 µs each); 1 µs corresponds to 20 samples. Bottom: receiver states 1) frame detection, 2) STF processing, 3) LTF processing, 4) MIMO channel processing, 5) data processing, grouped into frame detection, preprocessing, and data processing.]

3.5 MIMO-OFDM Receiver Algorithms

Figure 3.9 shows the time-domain frame structure for the MIMO-OFDM system considered in this thesis.¹² Although the illustrated frame structure is specific to the 2×2 case, it can easily be extended to the generic MR × MT case. The considered frame structure is similar to that of the IEEE 802.11n standard [8] in HT-Greenfield mode.

Frames start with a short training field (STF) that is composed of ten identical short training sequences (t1, t2, ..., t10), each of length NSP samples. This sequence is designed to support frame-start detection, automatic gain control adjustment,¹³ and coarse frequency offset estimation. The STF is followed by a sequence of MT long training fields (LTF1, LTF2, ..., LTFMT). The first long training field (LTF1) comprises a long guard interval (GI2) of NGI2 samples and two identical long training symbols (T1A and T1B), each of length NLP samples. LTF1 is exploited to refine the frequency offset estimation and participates in the channel estimation together with the remaining long training fields (LTFn, n = 2, 3, ..., MT). Each of the remaining LTFs is composed of a guard interval (GI) of length NGI samples, followed by a training symbol Tn of length NLP. The MIMO-OFDM data symbols Sm have a guard interval GIm of NGI samples and carry the data Dm (m = 1, 2, ..., Nd). The number of data-carrying OFDM-subchannels is Nc, and the remaining N − Nc subchannels are either unused or carry pilot symbols. One OFDM-symbol has a duration Tsym and a length of Ns = NGI + N samples, at a sample rate fs. The OFDM parameters for the system under consideration are reported in Table 3.4.

Table 3.4: OFDM modulation parameters for the system under consideration.

Parameter  N   Nc  NSP  NLP  NGI2  NGI  Tsym  Ns  fs
Value      64  52  16   64   64    16   4 µs  80  20 MHz

Based on the above-described frame structure, proper reception of an OFDM-frame can be divided into five states: frame-start detection, STF processing, LTF processing, MIMO channel processing, and data payload processing. The bottom section of Figure 3.9 illustrates how these five receiver states are traversed during the reception of an OFDM-frame. Note that the exact point in time for switching from one receiver state to the next varies depending on the quality of the received signal and, consequently, on when the frame start is detected. Typically, the first 4 to 6 short training sequences are corrupted by AGC.

¹² In the following, the term OFDM-frame is used when referring to a SISO-OFDM-frame as well as to a MIMO-OFDM-frame.

¹³ Automatic gain control (AGC) has not been implemented in this work.

3.5.1 Frame-start detection

The frame-start detection is the receiver's idle state. In this state, the presence of a new OFDM-frame has to be detected by analyzing the incoming received BB samples. The corresponding detection algorithm is extended from the well-established single-antenna algorithm proposed in [73]. The basic idea of this extended algorithm is to compute two metrics $p_L^{(n)}[d]$ and $m_L^{(n)}[d]$ for each time-domain BB sample $r^{(n)}[d]$, for all receive antennas $n = 1, \ldots, M_R$, according to

$$p_L^{(n)}[d] = \sum_{j=0}^{L-1} \left( r^{(n)}[d+j]^H \cdot r^{(n)}[d+j+L] \right) \tag{3.12}$$

$$m_L^{(n)}[d] = \sum_{j=0}^{L-1} \left| r^{(n)}[d+j+L] \right|^2 \tag{3.13}$$

with $L = N_{SP}$. As a result, (3.12) correlates the received BB samples over the length of two adjacent short training sequences and (3.13) computes the energy over the length of one short training sequence.

Next, the $p_L^{(n)}[d]$ and $m_L^{(n)}[d]$ metrics from all receive antennas are averaged to obtain

$$p_L[d] = \frac{1}{M_R} \sum_{j=1}^{M_R} p_L^{(j)}[d], \tag{3.14}$$

$$m_L[d] = \frac{1}{M_R} \sum_{j=1}^{M_R} m_L^{(j)}[d]. \tag{3.15}$$

As a last step, $p_L[d]$ and $m_L[d]$ are compared. A frame start is detected for the first discrete sample-time index $d = d_{SP}$ that satisfies the threshold detection inequality

$$|p_L[d]|^2 > |m_L[d]|^2 / 2. \tag{3.16}$$

In that case, the receiver proceeds to the STF processing state.

The atomic operations required to compute the correlation (3.12) and the mean energy (3.13) are multiply and accumulate (MAC) operations, the arithmetic means in (3.14) and (3.15) require additions, whereas the threshold detection in (3.16) requires comparisons.
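The detection rule in (3.12) to (3.16) can be sketched as follows. This is a straightforward, non-incremental floating-point illustration; the function name, the constant-modulus training sequence, and the noise-free test signal are assumptions made for the example and are not taken from the thesis.

```python
import numpy as np

def frame_start_index(r, length):
    """r has shape (M_R, n_samples); returns the first d satisfying (3.16)."""
    n = r.shape[1]
    for d in range(n - 2 * length + 1):
        first = r[:, d:d + length]
        second = r[:, d + length:d + 2 * length]
        # Correlation (3.12) per antenna, averaged over antennas (3.14).
        p = np.mean(np.sum(np.conj(first) * second, axis=1))
        # Energy (3.13) per antenna, averaged over antennas (3.15).
        m = np.mean(np.sum(np.abs(second) ** 2, axis=1))
        if abs(p) ** 2 > abs(m) ** 2 / 2:          # threshold test (3.16)
            return d
    return None

# Hypothetical test signal: 40 silent samples, then ten repetitions of a
# 16-sample unit-modulus sequence on two receive antennas.
t_short = np.exp(1j * np.pi * np.arange(16) ** 2 / 16)
x = np.concatenate([np.zeros(40, dtype=complex), np.tile(t_short, 10)])
rx = np.vstack([x, 1j * x])
detected = frame_start_index(rx, 16)
```

With this rule the threshold is crossed slightly before the first correlation window is completely filled with training samples, which is one reason why a fine time-synchronization step follows later.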

3.5.2 STF processing

Once a frame start has been detected, the remaining short training sequences are exploited to roughly estimate the rotation induced by the carrier frequency offset on the received BB samples, i.e., to perform the coarse frequency offset estimation (FOE). The corresponding phase increment between two consecutive received BB samples is given by $\varphi = \angle(p_{N_{SP}}[d_{SP}])/N_{SP}$.¹⁴ The phase $\varphi$ can be computed by the CORDIC algorithm (COordinate Rotation DIgital Computer, e.g. [74, 75]), for which the required atomic operations are additions, shifts, and comparisons.

To compensate for the estimated frequency offset at the receiver, all received time-domain BB samples are rotated through multiplication with a complex-valued phasor: $\tilde{r}[d] = r[d] \cdot e^{-j\varphi d}$. The atomic operations associated with this frequency offset compensation (FOC) are real-valued multiplications and additions (which together correspond to a complex-valued multiplication).

Next, fine time-synchronization takes place in order to refine the estimate of the location of the OFDM-symbol boundaries. To this end, the computations executed for the frame-start detection [(3.12)–(3.16)] are repeated on $\tilde{r}[d]$, with the only difference that $L = N_{LP}$ now corresponds to the period of the two long training sequences T1A and T1B. The start of LTF1 is detected for the first sample (where $d = d_{LP}$) satisfying the threshold detection inequality (3.16). Then, the time-domain frame start index is updated and the receiver proceeds to the LTF processing state.
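The coarse FOE and FOC steps can be sketched as follows for a single receive stream. The sketch uses `numpy.angle` as a stand-in for the CORDIC-based angle computation, and the function names and the test signal are assumptions made for illustration.

```python
import numpy as np

def estimate_phase_increment(r, d_start, length):
    """Phase increment per sample from the autocorrelation at lag `length`."""
    first = r[d_start:d_start + length]
    second = r[d_start + length:d_start + 2 * length]
    p = np.sum(np.conj(first) * second)
    return np.angle(p) / length       # valid while |phi * length| < pi

def compensate(r, phi):
    """Rotate every sample by exp(-j * phi * d)."""
    d = np.arange(r.shape[-1])
    return r * np.exp(-1j * phi * d)

# Hypothetical periodic training signal with a known phase increment.
phi_true = 0.01
t_short = np.exp(1j * np.pi * np.arange(16) ** 2 / 16)
clean = np.tile(t_short, 10)
rx = clean * np.exp(1j * phi_true * np.arange(clean.size))
phi_est = estimate_phase_increment(rx, 0, 16)
rx_comp = compensate(rx, phi_est)
```

The estimate is unambiguous only while the phase accumulated over one training period stays below π in magnitude, which bounds the frequency offset that the coarse stage can handle.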

3.5.3 LTF processing

On the first part of LTF1, carrier FOE is performed a second time based on the autocorrelation $p_{N_{LP}}[d_{LP}]$, to allow for a more accurate estimation of the residual phase rotation remaining after coarse FOC. The residual phase rotation is compensated, involving the same algorithms and atomic operations required for computing the phase $\varphi$ and for the coarse FOC performed in the STF processing state.

After removing GI2, the arithmetic mean of the first and the second long training sequences is computed (i.e., T1 = (T1A + T1B)/2). For each receive stream, the resulting N arithmetic-mean values are transformed by an N-point Fourier transform, leading to the received frequency-domain vectors $z_k[1] \in \mathbb{C}^{M_R}$, $k = 1, 2, \ldots, N$. Next, the GI is removed from the following long training fields (LTFn) and the residual phase rotation is compensated. The remaining samples are N-point Fourier transformed as well, resulting in the received frequency-domain vectors $z_k[n]$, $n = 2, \ldots, M_T$. The FFT's atomic operations are additions and multiplications. It is important to note that for the computation of the FFT all N input samples are required concurrently, whereas for the preceding computations the processing is executed at sample rate.¹⁵ Hence, during LTF processing, the processing granularity changes from a single sample to an entire OFDM-symbol and it remains unchanged until the end of the OFDM-frame. FOC represents the only exception and continues at sample rate.

Once the last long training field is received, frequency-offset compensated, and Fourier transformed, the wireless MIMO channel can be estimated. To perform this operation, the receiver switches to the MIMO channel processing state.

¹⁴ The atomic operation $\angle(z)$ returns the angle spanned between the real and imaginary parts of a complex number z.

3.5.4 MIMO channel processing

The knowledge of the transmitted long training fields is exploited at the receiver to estimate the MIMO channel $H_k$ for each subchannel k. During the training phase, the received frequency-domain vectors for the kth OFDM-subchannel

$$Z_k = \left[ z_k[1], z_k[2], \ldots, z_k[M_T] \right] \in \mathbb{C}^{M_R \times M_T}$$

can be described by $Z_k = H_k T_k + n_k$, where the $M_T \times M_T$-dimensional matrix $T_k$ is the known training sequence. With this knowledge, $\hat{H}_k = Z_k T_k^{-1}$ yields the ZF channel estimate.

In the system under consideration, $T_k$ is a Hadamard matrix, which implies that the scaled entries of the corresponding inverse $T_k^{-1}$ are +1 or −1. Therefore, the atomic operations for the matrix multiplication required to compute $\hat{H}_k$ can be reduced to only additions and subtractions, followed by the correct scaling; or, if preferred, the matrix multiplication can be performed utilizing the multiplication atomic operation. Next, the channel estimate $\hat{H}_k$ is used to obtain the linear MMSE estimator matrix [recall (3.6)]

$$G_k = \left( \hat{H}_k^H \hat{H}_k + M_T \sigma_k^2 I_{M_T} \right)^{-1} \hat{H}_k^H = F_k^{-1} \hat{H}_k^H,$$

where $\sigma_k^2$ denotes the noise variance on subchannel k.

The matrix inversion required to obtain $F_k^{-1}$ can be performed by one of the methods presented earlier in Section 3.4. For the 2×2 MIMO-OFDM receiver considered in Chapter 5, direct matrix inversion is used. The atomic operations required to compute $G_k$ are multiplications and additions (matrix multiplication). Matrix inversion additionally requires real-valued divisions as atomic operations. After computing the MMSE estimator $G_k$, the receiver proceeds to the data processing state for decoding the OFDM-symbols that carry the actual payload.

¹⁵ In practical systems, the input samples of the FFT may be processed slightly staggered, potentially exploiting pipelining and thus slightly reducing the FFT's processing latency.
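The per-subchannel channel estimation and MMSE matrix computation can be sketched as follows. The 2×2 training matrix shown is a hypothetical Hadamard example (the actual training matrix of the system is not reproduced here), noise is omitted from the example, and `numpy.linalg.solve` stands in for the inversion methods of Section 3.4.

```python
import numpy as np

def estimate_channel(z, t):
    """ZF estimate Z T^{-1} for a real Hadamard training matrix T."""
    m_t = t.shape[0]
    # For a real Hadamard matrix T^{-1} = T^T / M_T, so only +/-1 weights
    # (additions and subtractions) and one scaling are needed.
    return (z @ t.T) / m_t

def mmse_matrix(h_est, sigma2):
    """G = (H^H H + M_T sigma^2 I)^{-1} H^H for one subchannel."""
    m_t = h_est.shape[1]
    f = h_est.conj().T @ h_est + m_t * sigma2 * np.eye(m_t)
    return np.linalg.solve(f, h_est.conj().T)

t_train = np.array([[1.0, 1.0], [1.0, -1.0]])          # assumed training matrix
h_true = np.array([[1.0 + 0.5j, -0.2j], [0.3, 0.8 - 0.1j]])
z = h_true @ t_train                                    # noise-free reception
h_est = estimate_channel(z, t_train)
g = mmse_matrix(h_est, sigma2=0.01)
```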

3.5.5 Data processing

During the data processing state, the guard interval GIm of each received OFDM-symbol Sm is removed. Fine FOC is applied to the remaining time-domain samples of the OFDM-symbol. The result is then Fourier transformed into the frequency domain, leading to the received vectors $y_k$ (see Figure 3.1). Next, for each subchannel k, $s_k$ is estimated by first computing $\tilde{y}_k = G_k y_k$, followed by detection. Detection maps the entries of $\tilde{y}_k$ to the nearest constellation points in A, resulting in the detected vector-symbol $\hat{s}_k = Q(\tilde{y}_k)$ (cf. Figure 3.1). Then, the constellation points composing $\hat{s}_k$ are directly translated into the corresponding binary labels (demapped), and the resulting bitstream is de-interleaved and directed to a Viterbi decoder.¹⁶

The computation of $\tilde{y}_k$ requires a complex-valued matrix-vector multiplication, where the atomic operations are multiplications and additions. Detection, to obtain $\hat{s}_k$, can be performed by shift operations. When performing hard detection, the subsequent demapping requires a table look-up; for soft detection, it requires a series of comparisons, additions, and multiplications [53, 62].
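The equalization and hard-detection steps for one subchannel can be sketched as follows. The sketch assumes an unnormalized 64-QAM grid with levels {−7, −5, ..., 7} per dimension and uses floor and clip operations as a floating-point stand-in for the shift-based slicing mentioned above; the channel, the symbols, and all names are hypothetical.

```python
import numpy as np

def hard_detect_64qam(y_eq):
    """Map each entry to the nearest point of the grid {-7, -5, ..., 7}^2."""
    def slice_axis(x):
        # Nearest odd integer, limited to the outermost constellation levels.
        return np.clip(2 * np.floor(x / 2) + 1, -7, 7)
    return slice_axis(y_eq.real) + 1j * slice_axis(y_eq.imag)

# Hypothetical noise-free 2x2 example.
h = np.array([[1.0, 0.2j], [-0.3, 0.9]])
s = np.array([3 - 5j, -7 + 1j])
y = h @ s
sigma2 = 1e-6
f = h.conj().T @ h + 2 * sigma2 * np.eye(2)
g = np.linalg.solve(f, h.conj().T)      # linear MMSE estimator matrix
y_eq = g @ y                            # matrix-vector product of the detector
s_hat = hard_detect_64qam(y_eq)
```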

¹⁶ Viterbi decoding is not taken into account for the implementation on an FA. Its required performance (approximately MT · 4000 MdOp/s) is too high for an efficient implementation on an FA and will thus be performed on a dedicated hardware block.


3.5.6 Computational complexity of the presented algorithms

The CCs for the five receiver states described in the previous sections are presented in Table 3.5. The complexities are given for a generic MR × MT system and account for real-valued atomic operations. Figure 3.10 visualizes the CCs resulting for symmetric (MT = MR) MIMO-OFDM systems and derives the corresponding processing performance for sustaining real-time operation. These requirements are obtained with the OFDM configuration summarized in Table 3.4 and by constraining the duration of the duty cycle Tdc, i.e. the duration of each task in a state, to the duration Tdc = Tsym = 4 µs of one OFDM-symbol. Consequently, the number of samples to be processed in one duty cycle amounts to Ndc = Ns = 80, accommodating the computation's granularity over all three receiver phases.

The processing performance requirements for a given antenna configuration are dictated by the receiver state with the highest CC. In particular, for the symmetric system at hand with up to 2 antennas, they are dictated by the LTF processing and the data processing. For systems with more antennas, the CC of MIMO channel processing, which grows polynomially and is inherent to matrix inversion/manipulation, prevails over the CCs of the remaining states and sets the processing performance mark.

Clearly, the estimates in Table 3.5 (and Figure 3.10) depend on the specific implementation of the algorithm and do not take any overhead (e.g., control, address generation, load/store, or sorting of data) into account. Nevertheless, they constitute a valid indicator of the required processing performance and are instrumental in identifying the hard computational kernels and suitable custom datapath configurations for an implementation on an application-specific processor (such as the one described in Chapter 5). In summary, the complexity evaluation presented here indicates that a SISO-OFDM receiver requires a platform delivering a pure data-processing performance of approximately 680 MdOp/s, and a 2×2, 3×3, and 4×4 MIMO-OFDM receiver a platform delivering approximately 1′530, 3′650, and 8′580 MdOp/s, respectively. As a comparison, Appendix A of [34] formulates the processing requirements for an IEEE 802.11a receiver (SISO system), which amount to 600 MdOp/s.


Table 3.5: CC for an MR × MT MIMO-OFDM receiver (estimated real-valued CC per state and task; "-" marks an unused operation type).

Frame-start detection
Task            MAC                                  MUL          ADD
Correlation     $4M_R(N_{SP} + 2(N_{dc} - 1))$       -            $2N_{dc}\lceil\log_2(M_R)\rceil$
Mean energy     $2M_R(N_{SP} + 2(N_{dc} - 1))$       -            $N_{dc}\lceil\log_2(M_R)\rceil$
Th. detection   -                                    $4N_{dc}$    $3N_{dc}$

STF processing
Task            MAC                                  MUL          ADD
Phase comp.     -                                    -            96
FOC             -                                    $4M_RN_{dc}$ $2M_RN_{dc}$
Correlation     $4M_R(N_{LP} + 2(N_{dc} - 1))$       -            $2N_{dc}\lceil\log_2(M_R)\rceil$
Mean energy     $2M_R(N_{LP} + 2(N_{dc} - 1))$       -            $N_{dc}\lceil\log_2(M_R)\rceil$
Th. detection   -                                    $4N_{dc}$    $3N_{dc}$

LTF1 processing
Task            MAC   MUL                        ADD
Phase comp.     -     -                          96
FOC             -     $2M_R \cdot 4N_{dc}$       $2M_R \cdot 2N_{dc}$
Mean LTF1       -     -                          $2N_{LP}$
FFT LTF1 (a)    -     $768M_R$                   $768M_R$

LTFn processing, for n = 2, 3, ..., MT
Task            MAC   MUL                        ADD
FOC             -     $M_R \cdot 4N_{dc}$        $M_R \cdot 2N_{dc}$
FFT LTFn (a)    -     $768M_R$                   $768M_R$

MIMO channel processing
Task            MAC                      DIV           ADD
Channel est. H  -                        -             $2M_RM_T^2N_c$
Matrix G (b)    $N_cN_{DC,MAC}$ (c)      $N_cM_T/2$    $N_cN_{DC,ADD}$ (d)

Data processing
Task                  MAC                                       MUL                                  ADD
FOC                   -                                         $M_R \cdot 4N_{dc}$                  $M_R \cdot 2N_{dc}$
FFT (a)               -                                         $768M_R\lfloor N_{dc}/80\rfloor$     $768M_R\lfloor N_{dc}/80\rfloor$
Hard detect (64QAM)   $4N_cM_TM_R\lfloor N_{dc}/80\rfloor$      -                                    $N_cM_R \cdot 12\lfloor N_{dc}/80\rfloor$

(a) Assuming a radix-2 FFT implementation.
(b) D&C is used for the matrix inversion.
(c) $N_{DC,MAC} = 4(-6 + (4 - M_R/2)M_T + (-1 + 6M_R)M_T^2/4 + M_T^3/2)$.
(d) $N_{DC,ADD} = -6 + \tfrac{14}{3}M_T - M_T^2/2 + M_T^3/3$.


Table 3.6: 2×2 MIMO-OFDM receiver processing requirements.

State / Task                     Op       MdOp/s

Frame start detection
Correlation (3.14), L = N_SP     1’552    388
Mean energy                      776      194
Th. detection                    560      140
TOTAL                            2’888    722

STF processing
Phase comp.                      96       24
FOC                              960      240
Correlation                      1’936    484
Mean energy                      968      242
Th. detection                    560      140
TOTAL                            4’520    1’130

LTF1 processing
Phase comp.                      96       24
FOC                              1’920    480
Mean LTF1                        128      32
FFT LTF1^a                       3’072    768
TOTAL                            5’216    1’304

LTF2
FOC                              960      240
FFT LTF2^a                       3’072    768
TOTAL                            4’032    1’008

MIMO channel processing
Channel est. H                   832      208
Matrix G^b                       3’380    845
TOTAL                            4’212    1’053

Data processing
FOC                              960      240
FFT^a                            3’072    768
Hard detect (64QAM)              2’080    520
TOTAL                            6’112    1’528

^a 64-point radix-2 FFT implementation.
^b Direct matrix inversion is used.
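The link between the symbolic counts of Table 3.5 and the numbers of Table 3.6 can be retraced with a short sketch. The parameter values below (N_dc = 80, N_SP = 16, N_c = 52, and a block duration of 4 µs) are assumptions: they are not stated in this section but were back-solved from the entries of Table 3.6. The sketch also assumes that each MAC, MUL, and ADD counts as one operation; function names are illustrative only.

```python
from math import ceil, log2

# Assumed parameters, back-solved from Table 3.6 (not stated in this section).
N_DC = 80         # samples per processed block
N_SP = 16         # correlation window used for frame-start detection
N_C = 52          # subcarriers entering channel estimation and detection
BLOCK_US = 4.0    # assumed duration of one 80-sample block in microseconds


def frame_start_ops(m_r):
    """Real-valued operations (MAC + MUL + ADD) per block, after Table 3.5."""
    lg = ceil(log2(m_r))
    return {
        "correlation": 4 * m_r * (N_SP + 2 * (N_DC - 1)) + 2 * N_DC * lg,
        "mean energy": 2 * m_r * (N_SP + 2 * (N_DC - 1)) + N_DC * lg,
        "th. detection": 4 * N_DC + 3 * N_DC,
    }


def data_ops(m_r, m_t):
    """Operations per block in the data-processing state, after Table 3.5."""
    blocks = N_DC // 80
    return {
        "FOC": m_r * 4 * N_DC + m_r * 2 * N_DC,
        "FFT": (768 + 768) * m_r * blocks,
        "hard detect": 4 * N_C * m_t * m_r * blocks + N_C * m_r * 12 * blocks,
    }


def mdops_per_s(ops):
    """Operations per block divided by the block duration in microseconds."""
    return ops / BLOCK_US


for name, table in (("frame start", frame_start_ops(2)),
                    ("data", data_ops(2, 2))):
    total = sum(table.values())
    print(name, table, total, mdops_per_s(total))
```

Under these assumptions the 2×2 totals come out as 2888 operations (722 MdOp/s) for frame-start detection and 6112 operations (1528 MdOp/s) for data processing, which agrees with Table 3.6.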


Figure 3.10: CC and processing requirements for a 1×1 (SISO), a 2×2, a 3×3, and a 4×4 MIMO-OFDM receiver. [Stacked bar chart; horizontal axis: 1 to 4; vertical axis: CC, 0 to 35000; bar totals: 680, 1528, 3653, and 8580 MdOp/s; bar segments: frame start detection, STF, LTF, MIMO, data payload.]


3.6 Summary and Conclusion

Summary This chapter started with the description of the considered MIMO-OFDM system along with the evaluation of different MIMO detectors. Linear MMSE detection appeared to offer the best trade-off between CC and achievable BER for an FA. The subsequent fixed-point assessment of methods for computing the linear MMSE estimator matrix G showed that the spectrum of achievable BER performance vs. CC is vast. For a 2×2 system, direct matrix inversion was identified as the best candidate method, whereas for systems with more antennas either the D&C method, the LR-decomposition, or the QR-decomposition is the most promising, depending on whether the focus lies more on low CC or on good BER performance. Once linear MMSE detection had been selected as the detection method, the three receiver phases (frame-start detection, preprocessing, and data processing) could be described, eventually leading to the detailed complexity analysis of the entire MIMO-OFDM receiver.

Conclusion In conclusion, for implementing a MIMO-OFDM receiver on an FA, linear MMSE detection seems the most reasonable choice to start with. This statement is primarily dictated by the limited processing resources available on these platforms. In addition, for assessing the challenges of a practical implementation, it seems reasonable to start with a SISO-OFDM receiver, upon which the 2×2 MIMO-OFDM receiver can be built. The architecture exploration in the next chapter follows exactly this line.

Chapter 4

Design Space Exploration

In this chapter, three different SPAs are evaluated. The architectures are representatives of three different signal processor classes: the Texas Instruments (TI) C6455 represents the general purpose DSP class, the MSEC4 is a special purpose BB processor, whereas the ASPE was designed for multimedia streaming applications. The evaluation aims at characterizing these architectures for OFDM BB processing in an exemplary manner and, although it is far from exercising all corners of the design space, it helps to distinguish important from unnecessary properties.

The considered architectures are briefly described, unveiling their key characteristics. Subsequently, some of the hard BB processing kernels identified in Chapter 3 are mapped onto each architecture to assess the available processing performance. The chapter closes with a discussion of the achieved results and with the suggestion of a suitable SPA.

4.1 C6455

The C6455 [76] is a commercial high-performance fixed-point VLIW DSP. Its core is depicted in Figure 4.1. The CPU consists of fetch and decode logic together with two identical datapaths. Each datapath incorporates four functional units, namely: a logical unit (.L), a shift unit (.S), a multiply unit (.M), and a data unit (.D) that handles loads and stores. In each clock cycle, up to eight instructions can be fetched from the instruction memory and assembled into one VLIW. Thus, up to eight instructions per clock cycle can be executed in parallel, exploiting the eight units. The memory hierarchy is composed of two levels. L1 is a 32 KiB cache and L2 a 2 MiB on-chip RAM. The C6455 is implemented in 90 nm CMOS technology, occupies an area of 91 mm^2, and runs at a clock frequency of 1 GHz; Figure 4.2 depicts the C6455 die used for measuring the silicon area. The C6455’s peak performance is 4′000 MdOp/s. A powerful software development kit (SDK), Code Composer Studio, which offers many options to produce optimized code, is available for programming the DSP.

Figure 4.1: The C6455’s core. [Block diagram: C64+ CPU with instruction fetch, SPLOOP buffer, 16/32-bit instruction dispatch, instruction decode, and two datapaths (A and B register files; units .L1, .S1, .M1, .D1 and .D2, .M2, .S2, .L2); L1 program and L1 data memory controllers with L1P and L1D cache/SRAM; L2 memory controller with L2 cache/SRAM; IDMA; DMA slave interface and master port; interrupt and exception controller; power control; advanced event triggering (AET).]

Figure 4.2: Die photograph of the C6455. The black dots are solder balls remaining after etching the package away.

4.1.1 SISO-OFDM transceiver

The implementation of an acoustic single-antenna OFDM transmitter and receiver on the C6455 makes it possible to assess its qualification as a BB processor [77]. The lower data rates of the acoustic domain permit a real-time and real-life implementation, with data streaming across the C6455’s input and output ports, while the programming does not require a thorough optimization of the employed algorithms. In addition, the acoustic physical front-end is more economical than its RF counterpart.1

1 Although not specifically considered here, all important physical RF system impairments are recognizable in the acoustic domain as well, enabling the study of appropriate countermeasures (e.g., sample rate estimation and tracking, carrier frequency offset estimation and tracking, etc.).

The OFDM parameters used for the acoustic communication system are summarized in Table 4.2 at the end of this section. Out of the 64 allocated OFDM-subchannels, 54 carry data. To avoid digital up and down conversion, the real-valued acoustic passband signal is generated using a 128-point FFT. The first 64 tones are determined by the transmit constellation points, and the remaining 64 correspond to the symmetric complex conjugate of the first half (i.e., the constellation points on the subchannels k = 64, 65, ..., 127 are determined by s_k = s^H_{127−k}). Eventually, one time-domain OFDM-symbol consists of 208 samples, of which 80 are GI samples. It is interesting to note that the acoustic-domain GI is relatively long compared to the OFDM-symbol duration; this length is necessary to absorb the delay spread of the acoustic channel.2

Figure 4.3: Acoustic transceiver setup with two C6455 DSK boards. [Diagram: a Tx PC and an Rx PC, each connected over an Ethernet link to a C6455 DSK board.]

Figure 4.3 illustrates the system setup. Both the transmitter and the receiver are composed of a PC connected to a C6455 DSK board over an Ethernet link. On the transmit side, the PC generates random data that is assembled into Ethernet packets and sent to the transmit DSK board over the Ethernet link.

2 The measured delay spread is approximately 1 ms in the anechoic chamber.

The transmit C6455 generates the OFDM-frame’s preamble while the first transmit data samples arrive in the DSP’s input buffer via the Ethernet link. The preamble comprises an STF and an LTF, as described in Section 3.5 (Figure 3.9). The STF is composed of ten identical short training sequences, each having a length of 16 samples, whereas the LTF is composed of two identical long training symbols that have a length of 208 samples each. The data payload is rate R = 1/2 convolutionally encoded, with the generator polynomials [133_8 171_8] (octal) and constraint length 7. The encoded data is then interleaved and OFDM modulated through inverse FFTs. Finally, the OFDM-frame is written to the output buffer, from where it is sent to the board’s audio codec. Double buffering of the output data avoids interference between data processing and streaming out the processed data. The two output buffers both have a size of N_Obuf samples and are read at a rate of 48 kS/s by the audio codec. Hence, a full output buffer is emptied in T_Obuf = N_Obuf/(48 kS/s), and real-time operation is possible if the processing of N_Obuf transmit samples takes a time T_p < T_Obuf.

This constraint is fulfilled in the considered acoustic implementation, where the output buffer’s size is chosen to be N_Obuf = 41′600 samples, leading to T_Obuf = 866 ms.3 The profiling results for a 64-QAM transmission are summarized in Table 4.1: one OFDM-symbol is processed in 22.8 µs and thus the entire OFDM-frame, consisting of 200 OFDM-symbols, is processed in T_p = 200 · 22.8 µs = 4.6 ms < T_Obuf = 866 ms.
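The double-buffering constraint can be checked numerically with a few lines. The following is only a sketch that uses the figures quoted above; the function and constant names are illustrative and not taken from the actual DSP code.

```python
SAMPLE_RATE = 48_000       # audio codec read rate in samples per second
N_OBUF = 41_600            # output buffer size in samples
SYMBOL_LEN = 208           # samples per OFDM-symbol (80 GI + 128)
T_SYMBOL_PROC = 22.8e-6    # profiled Tx processing time per symbol in seconds


def buffer_drain_time(n_samples, rate):
    """Time the codec needs to empty a full buffer."""
    return n_samples / rate


def frame_processing_time(n_samples, symbol_len, t_symbol):
    """Time needed to prepare all symbols that fit into one buffer."""
    return (n_samples // symbol_len) * t_symbol


t_obuf = buffer_drain_time(N_OBUF, SAMPLE_RATE)
t_p = frame_processing_time(N_OBUF, SYMBOL_LEN, T_SYMBOL_PROC)
print(f"T_Obuf = {t_obuf * 1e3:.1f} ms, T_p = {t_p * 1e3:.2f} ms, "
      f"real-time: {t_p < t_obuf}")
```

The margin of roughly two orders of magnitude between T_p and T_Obuf is what allows the acoustic system to run without optimized code.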

The receive C6455 double buffers the data samples received over the audio codec’s input line. Each input buffer has a size of N_Ibuf = N_Obuf received samples. These samples are processed block-wise for detecting the OFDM frame start, in accordance with the method described in Section 3.5.1 [(3.14)-(3.16)]. Once a frame has been detected and the timing reconstructed, the acoustic channel is estimated using the LTF. Next, the time-domain OFDM-symbols are transformed by FFT, demapped taking the channel estimates into account, and deinterleaved. Note that neither FOE nor FOC is performed on the receiver, which reduces the processing power requirements (and, of course, also the BER performance). Finally, the on-chip Viterbi coprocessor is set up to decode the received data and to convey them to the receive PC. The BER performance of the acoustic communication system is computed on the receive PC by comparing the transmitted with the received data.

3 The length of one OFDM-symbol is N_GI + N = 80 + 128 samples. The output buffer can hold at most 41′600/208 = 200 OFDM-symbols, cf. Table 4.2.

For the receiver to operate in real-time, the processing of the input data buffer has to be finished in T_p < T_Ibuf (= T_Obuf). Or, equivalently, the processing rates associated with the three receiver phases (frame-start detection, preprocessing, and data processing) have to be respected. As detailed in the next paragraph, the real-time constraint is respected during all three phases.

Table 4.1: Processing times resulting from the profiling of the transmission (Tx) and reception (Rx) of one 64-QAM OFDM-symbol on the C6455. The RF columns (set in italics in the original) give the processing times for the RF system discussed in Section 4.1.2.

                                     Time [µs]                           Time [µs]
Phase                  Tx-Task       Acoustic  RF     Rx-Task            Acoustic  RF
Frame-start detection  –             –         –      Frame start det.   1600      1600
Preprocessing          –             –         –      Channel est.       3.3       2.9
Data proc.             Encode        10.7      9.5    Decode             12.6      11.2
                       Interleave    2         1.8    Deinterleave       1.2       1.1
                       Map           8.8       7.8    Demap              10.2      9.0
                       I-FFT         1.3       0.5    FFT                1.3       0.5
                       Total         22.8      19.6   Total              12.7      10.6

Frame start detection over the N_Ibuf received samples requires 1.6 ms in the worst case, i.e., when no frame start is detected and all N_Ibuf samples have to be processed. The corresponding per-sample processing completes in 38.5 ns, which is much less than the sample duration of 20.8 µs, satisfying the real-time constraint. The subsequent preprocessing, which consists of estimating the channel, is completed in 3.3 µs and can thus be performed within the duration T_sym = 4.3 ms of one OFDM-symbol. Then, as the data processing time breakdown in Table 4.1 reveals, computing the 128-point FFT requires 1.3 µs, the subsequent demapping takes 10.2 µs, and the deinterleaving 1.2 µs, summing up to a total of 12.7 µs. Viterbi decoding on the DSP’s on-chip coprocessor occurs concurrently, and one 64-QAM OFDM-symbol (324 raw bits or 162 data bits) is decoded in 12.6 µs. The processing time for one OFDM-symbol is dictated by the longer of the two concurrent tasks, and is thus 12.7 µs. Again, the real-time constraint is met.


4.1.2 Results, discussion, and conclusion

Results A first evident result is that the acoustic SISO-OFDM system implemented on the C6455 operates in real-time, respecting the OFDM system parameter specifications of Table 4.2. The maximum design datarate achievable with these specifications is obtained when transmitting 64-QAM OFDM-symbols and amounts to 75 kbit/s.4

Unfortunately, the much harder RF constraints lead to a different situation. The profiling results for the RF domain have been obtained by compiling the transmit and receive programs with the RF OFDM-system parameters in Table 4.2, and by setting the size of the FFT to 64 points. The profiling results are reported in italics in Table 4.1. The computations performed for frame start detection are the same as in the acoustic domain, and hence 38.5 ns are necessary to process one received sample. The duration of one RF-domain received sample, however, is only 50 ns, which reduces the safety margin but still allows for real-time operation. The preprocessing is completed in 2.9 µs, which is less than the duration T_sym = 4 µs of one OFDM-symbol. The hurdle comes during the data processing. The transmitting C6455 requires 19.6 µs to prepare one 64-QAM OFDM-symbol, while the receive C6455 needs 11.2 µs to decode it. Since the duration of one OFDM-symbol is T_sym = 4 µs, neither transmission nor reception works in real-time. The corresponding datarate attained by the transmitting C6455 for 64-QAM is 14.7 Mbit/s, while the receive C6455 sustains data rates up to 25.7 Mbit/s.5

The power consumption of the C6455 during the reception of one RF OFDM-frame has been determined with [78] and is approximately 2 W. Accordingly, the energy efficiency becomes 4′000 MdOp/s / 2 W = 2 MdOp/s/mW.

4 The design datarate is computed as: (54 data carriers × 6 bits per carrier)/(4.3 ms per OFDM-symbol) = 75 kbit/s.

5 The transmit data rate is computed as: (48 · 6 bit)/19.6 µs = 14.7 Mbit/s and the receive data rate as: (48 · 6 bit)/11.2 µs = 25.7 Mbit/s.

Discussion Surprisingly, the transmitter achieves a lower data rate than the receiver. A closer look at the profiling results reveals that the convolutional encoding of the transmit bitstream requires most of the processing time, closely followed by the Gray-mapping of the binary labels onto constellation points. Convolutional encoding operates on the incoming bitstream at a bitwise granularity. Even though the C6455 supports bitwise operations, it is necessary to manipulate and shift single bits, eventually resulting in an inefficient implementation. At the receiver, instead, the decoding is more efficient since it is assisted by the dedicated Viterbi coprocessor. The demapping at the receiver, however, experiences the same difficulties as the Gray-mapping at the transmitter. The Gray-mapping of the binary labels onto constellation points relies either on extensive use of if-then-else statements, or on extensive masking and shifting followed by a table look-up. Both options result in inefficient C-code.
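The bitwise granularity mentioned above becomes apparent in a straightforward software model of the rate-1/2, constraint-length-7 encoder with generator polynomials 133 and 171 (octal). The sketch below is not the thesis implementation; the register orientation (newest bit in the most significant position) is an assumption, and the names are illustrative. Every input bit triggers a shift, two maskings, and two parity computations, which is exactly the kind of per-bit work that maps poorly onto a word-oriented datapath.

```python
G0 = 0o133   # first generator polynomial (octal)
G1 = 0o171   # second generator polynomial (octal)
K = 7        # constraint length


def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def conv_encode(bits):
    """Rate-1/2 convolutional encoder; emits two coded bits per input bit."""
    reg = 0
    out = []
    for b in bits:
        # Shift the new bit into the MSB of the K-bit register (assumed order).
        reg = ((b & 1) << (K - 1)) | (reg >> 1)
        out.append(parity(reg & G0))
        out.append(parity(reg & G1))
    return out


impulse = [1] + [0] * (K - 1)
coded = conv_encode(impulse)
print(coded[0::2])  # [1, 0, 1, 1, 0, 1, 1], the taps of 133 (octal)
print(coded[1::2])  # [1, 1, 1, 1, 0, 0, 1], the taps of 171 (octal)
```

Feeding a single 1 followed by zeros reproduces the tap patterns of the two polynomials, which is a quick sanity check of the register orientation.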

Lowering the modulation order and transmitting only BPSK-modulated OFDM-symbols lightens the computational burden, and could possibly enable real-time operation. In BPSK transmission, each tone carries only one bit instead of six. Then, the transmitter processes one OFDM-symbol in 3.7 µs. The receiver performs the FFT in 0.5 µs. Demapping requires 1.5 µs, the deinterleaving 0.2 µs, and the concurrent Viterbi decoding requires 1.9 µs. Consequently, the processing time of one OFDM-symbol amounts to 2.2 µs, which is less than the duration of one OFDM-symbol. Since the processing of one OFDM-symbol at both the transmitter and the receiver requires less than the OFDM-symbol duration, real-time operation is possible and the associated datarate is 12 Mbit/s.
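The receive-side reasoning for the RF system can be condensed into a small sketch: the CPU tasks run back to back, the Viterbi coprocessor runs concurrently, and the longer of the two determines the symbol processing time. The numbers are the profiled values quoted in the text and in Table 4.1; the function names are illustrative.

```python
T_SYM_US = 4.0          # RF OFDM-symbol duration in microseconds
N_DATA_CARRIERS = 48    # data carriers of the RF system


def rx_symbol_time(fft, demap, deinterleave, viterbi):
    """CPU tasks are sequential; the Viterbi coprocessor runs in parallel."""
    return max(fft + demap + deinterleave, viterbi)


def rate_mbps(bits_per_carrier, t_us):
    """Raw rate in Mbit/s for one symbol handled in t_us microseconds."""
    return N_DATA_CARRIERS * bits_per_carrier / t_us


qam64 = rx_symbol_time(0.5, 9.0, 1.1, 11.2)
bpsk = rx_symbol_time(0.5, 1.5, 0.2, 1.9)

for name, bits, t in (("64-QAM", 6, qam64), ("BPSK", 1, bpsk)):
    print(f"{name}: {t:.1f} us per symbol, real-time: {t <= T_SYM_US}, "
          f"sustained {rate_mbps(bits, t):.1f} Mbit/s, "
          f"design {rate_mbps(bits, T_SYM_US):.0f} Mbit/s")
```

For 64-QAM the Viterbi decoding (11.2 µs) dominates and exceeds the 4 µs budget, whereas for BPSK the CPU tasks (2.2 µs) dominate and fit within it.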

Table 4.2 also compares the obtained performance figures to those of related work. Sereni [79] describes an IEEE 802.11a receiver implementation for a C64x DSP running at 600 MHz and reports the computational complexity for transmitter and receiver. Tariq [80] presents an OFDM system centered on two C62x platforms connected through a cable for the transmission in the BB. Video bursts are transmitted at a sustained datarate of 1.7 Mbit/s. Cinquino [81] reports cycle counts for an OFDM-based system and maps these counts into a datarate of 4.9 Mbit/s for a C64x platform that uses the Viterbi coprocessor.

Conclusion Although the potential for optimizing the C-code of the presented transceiver is intact, it can be stated that the C6455’s datapath is not well suited for the fine-grained bit-wise operations required by interleaving and deinterleaving, and by convolutional encoding and decoding. The dedicated (partially configurable) Viterbi coprocessor is a good example of how a dedicated component relieves the CPU from operations that do not match its granularity. However, despite the dedicated Viterbi coprocessor, the processing performance delivered by the C6455 is on the edge of what is required for the reception of BPSK-modulated OFDM-frames in an RF system such as the one sketched above, and this at a relatively low energy efficiency of 2 MdOp/s/mW.

In summary, for systems that require higher modulation orders the available processing performance is not sufficient, nor is it sufficient to meet the even higher requirements of an RF MIMO-OFDM system. Finally, the main limiting factor of the C6455 is its power consumption of 2 W (in 90 nm CMOS technology), which is definitely too high for a mobile wireless device.

4.2 MSEC4

The MSEC4, developed at the IIS, ETH Zurich [82], is a fixed-point SIMD BB processor targeted at OFDM and CDMA BB processing. Figure 4.4 shows the block diagram of the MSEC4’s core. It contains four parallel processing elements (PEs), a program control unit (PCU), and an address generation unit (AGU). The MSEC4’s memory consists of multiple tightly-coupled memories that directly supply the PEs via parallel data buses. These memories are addressed via the AGU. The MSEC4 core further contains a register file for intermediate data storage and a system control unit (SCU). The SCU comprises instruction fetch and decode, as well as several registers for device control. The MSEC4 has two pipeline stages: in the first stage, instruction fetch, decode, and AGU address calculation are performed. Execution and write-back take place in the second stage.

The description of the MSEC4’s building blocks in the next subsections underlines the important design aspects considered for tailoring the processor to the BB processing domain.


Table 4.2: OFDM system parameters and performance for the acoustic and RF systems considered in this thesis; and comparison to related work. The reported data rates refer to the raw over-the-air rate, and do not consider coding.

                                      This work, C6455        [79]^a     [80]       [81]^a
OFDM Parameter                        Acoustic     RF         C64x       C62x       C64x
Channel bandwidth [MHz]               0.024        20         20         20         n.a.
# OFDM-subchannels (N)                64           64         64         64         64
# OFDM data carriers (N_c)            54           48         48         48         64
Subchannel spacing [kHz]              0.375        312.5      312.5      312.5      n.a.
GI [# samples] / [µs]                 80 / 1′666   16 / 0.8   16 / 0.8   16 / 0.8   n.a.
OFDM-symbol [# samples] / [µs]        208 / 4′300  80 / 4     80 / 4     80 / 4     n.a.
BB sample rate [MS/s]                 0.048        20         20         20         n.a.
Modulation                            64-QAM       64-QAM     BPSK       QPSK       16-QAM
Coding                                yes          yes        yes        no         yes
Design datarate^b [Mbit/s]            0.075        72         12         24         n.a.
Achieved Rx-data-rate^c [Mbit/s]      25.5         25.7       n.a.^d     1.7        4.9

^a Implements the receiver on a C62x platform and scales the results to a C64x platform.
^b Data rate specified by the OFDM parameters, i.e., bits per OFDM-symbol divided by the OFDM-symbol duration.
^c Effective data rate achieved on the DSP, i.e., bits per OFDM-symbol divided by the OFDM-symbol processing time.
^d The CC is estimated to be 2977 MOPS for the processing of one OFDM-symbol.

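Figure 4.4: MSEC4 core. [Block diagram: four processing elements PE0 to PE3 fed by the X and Y operand buses and writing to the Z result bus; register file; PCU with branch control and loop control; AGU driving the data address buses; SCU receiving the instructions and driving the instruction address bus.]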

4.2.1 Architecture details

Processing elements A crucial computation kernel in many BB processing algorithms, including OFDM (cf. Table 3.5), is the Fourier transform. As a result of its complex internal structure, its computation is a tedious and time-consuming task when performed on architectures that do not provide special Fourier-transform support. The fast Fourier transform (FFT) reduces the CC of an N-point discrete Fourier transform from the order of N^2 operations to (N/r) log_r(N) butterflies, where r is the radix of the FFT algorithm and N has to be a power of r. The atomic operation of a radix-r FFT is called a butterfly.

Accordingly, the MSEC4’s main execution unit (consisting of the four PEs) has the structure of a radix-4 butterfly, enabling fast and efficient FFT computation. Each of the four identical PEs includes a 16×8 bit complex-valued multiplier, a ’trivial multiplier’, a 32 bit complex-valued ALU with accumulator register, and a second-stage 32 bit complex-valued adder [see Figure 4.5.a)]. This composition makes it possible to interconnect the four PEs in such a way as to form one complete single-cycle radix-4 butterfly [see Figure 4.5.b)] or two radix-2 butterflies. In addition, each PE can perform one complex-valued MAC instruction in a single clock cycle, providing optimal support for vector/matrix operations, filters, and other convolutional algorithms.

Figure 4.5: a) MSEC4’s processing element and b) radix-4 butterfly. [Diagram a): PE0 with operand inputs X and Y, complex multiplier, trivial multiplier, ALU, links from/to PE1 and PE2, and result output Z, annotated with complex word widths of 16|16, 32|32, and 64|64 bit; diagram b): radix-4 butterfly with the trivial factors −1 and j.]
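The reason why a radix-4 butterfly maps well onto four simple PEs can be illustrated with a floating-point sketch: a 4-point DFT needs only additions, subtractions, and multiplications by the trivial factors ±1 and ±j. The code below is an illustration under these assumptions, not a model of the MSEC4 datapath; the twiddle-factor multiplications that accompany the butterfly in a full FFT are omitted, and all names are hypothetical.

```python
import cmath


def mul_j(z):
    """Multiply by j by swapping real and imaginary parts; no multiplier."""
    return complex(-z.imag, z.real)


def radix4_butterfly(x0, x1, x2, x3):
    """4-point DFT using only additions, subtractions, and a swap."""
    a = x0 + x2
    b = x0 - x2
    c = x1 + x3
    d = x1 - x3
    jd = mul_j(d)
    return (a + c, b - jd, a - c, b + jd)


def dft4_reference(x):
    """Direct 4-point DFT, used only to check the butterfly."""
    return tuple(
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / 4) for n in range(4))
        for k in range(4)
    )


x = (1 + 2j, 3 - 1j, -2 + 0.5j, 0 - 4j)
fast = radix4_butterfly(*x)
ref = dft4_reference(x)
print(all(abs(f - r) < 1e-12 for f, r in zip(fast, ref)))
```

Each of the four outputs is obtained from two first-stage and one second-stage addition, which mirrors the ALU followed by a second-stage adder described for each PE.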

The trivial multiplier is employed for efficient signal de-/spreading with complex-valued binary codes, as found in various communication protocols such as UMTS. Here, signal de-/spreading is performed by multiplying the original data signal with a complex-valued binary code sequence ∈ {±1±j}. While this could be done with the main multiplier, a dedicated solution performs the same trivial multiplication more efficiently, both in terms of power consumption and memory usage.
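A multiplication by a code chip from {±1±j} reduces to sign changes and two additions, as the following sketch shows. It is an illustrative model only; the function name and the integer operand format are assumptions and do not reflect the MSEC4 implementation.

```python
def trivial_multiply(re, im, c_re, c_im):
    """Compute (re + j*im) * (c_re + j*c_im) for c_re, c_im in {+1, -1}.

    Only negations and additions are used, no multiplier.
    """
    real = (re if c_re > 0 else -re) - (im if c_im > 0 else -im)
    imag = (re if c_im > 0 else -re) + (im if c_re > 0 else -im)
    return real, imag


# Check against ordinary complex multiplication for all four code values.
sample = (3, 4)
for c_re in (1, -1):
    for c_im in (1, -1):
        expected = complex(*sample) * complex(c_re, c_im)
        got = trivial_multiply(sample[0], sample[1], c_re, c_im)
        print((c_re, c_im), got, got == (expected.real, expected.imag))
```

For example, (3 + 4j)(1 − j) = 7 + j is obtained as 3 + 4 for the real part and −3 + 4 for the imaginary part.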

Data memories In contrast to conventional load/store architectures, MSEC4 supports direct data processing from memory to memory. This capability offers great advantages for stream-oriented applications that often rely on data-block processing. Block sizes, though, tend to be rather small in order to minimize latency and limit memory requirements. The MSEC4 memory architecture matches these characteristics by employing sixteen 64-word data-memory blocks and four 64-word coefficient-memory blocks. At the same time, 8 memory read ports and 4 write ports allow for very high memory bandwidth. The data memory has a word width of 16 bit, whereas for the coefficients a word width of 8 bit is sufficient. Furthermore, direct memory processing eliminates the need for constant register file reloading, which causes considerable overhead and is an often-experienced system bottleneck in conventional load/store architectures. Very short access times are achieved by using tightly-coupled memories that are, apart from their access latency, comparable to large register files, but much smaller in area.6 Memory access conflicts that occur when multiple PEs address the same memory bank are resolved automatically by stalling the computation until all data items are ready, and are transparent to the programmer.

Address Generation Unit The AGU can compute up to twelve addresses per clock cycle: eight read (operand) and four write (result) addresses. Four specialized address generation modes are provided for the efficient implementation of DSP algorithms: linear addressing (standard arithmetic) for general purpose computations; modulo addressing allowing efficient data access in circular buffers; radix-4 and radix-2 bit-reverse addressing for FFT address calculation. These modes can be arbitrarily combined with the AGU operations for address modification, which include post in-/decrement, post increment by signed offset, and indexed by signed offset, allowing for a rich variety of addressing schemes.
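The modulo and bit-reverse addressing modes can be mimicked in software as follows. This is a generic sketch of the two techniques and not a description of the AGU hardware; function names and parameters are illustrative. Radix-2 bit reversal and radix-4 digit reversal are treated as the same operation with digits of one and two bits, respectively.

```python
def modulo_address(base, length, index):
    """Address of element `index` in a circular buffer of `length` words."""
    return base + (index % length)


def digit_reverse(index, n_digits, bits_per_digit):
    """Reverse the order of the base-2**bits_per_digit digits of index."""
    mask = (1 << bits_per_digit) - 1
    out = 0
    for _ in range(n_digits):
        out = (out << bits_per_digit) | (index & mask)
        index >>= bits_per_digit
    return out


# Circular buffer of 8 words starting at address 100.
print([modulo_address(100, 8, i) for i in range(6, 11)])
# [106, 107, 100, 101, 102]

# Radix-2 bit reversal for an 8-point FFT (three 1-bit digits).
print([digit_reverse(i, 3, 1) for i in range(8)])
# [0, 4, 2, 6, 1, 5, 3, 7]

# Radix-4 digit reversal for a 16-point FFT (two 2-bit digits).
print([digit_reverse(i, 2, 2) for i in range(16)])
# [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
```

Computing such addresses in dedicated hardware removes the index arithmetic from the datapath, which is the motivation for providing them as AGU modes.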

Program Control Unit All tasks related to program sequence control, i.e., tasks that manipulate the program counter (PC), are performed by the PCU. This includes jumps, branches, subroutine calls, return instructions, and looping. MSEC4 supports up to four nested zero-overhead hardware loops, an important feature for performance optimization of heavily loop-based signal processing algorithms.

6 Depending on the underlying CMOS technology and design library, there might be a critical memory size below which the instantiation of a register file is more area-efficient.


Instruction Set Architecture MSEC4 is based on an orthogonal, 4-way SIMD instruction set that allows maximum use of processing resources and high program memory utilization. Moreover, the instruction set’s orthogonal and regular structure enables fast instruction decoding and eases the design verification. Dedicated instructions for complex arithmetic lead to compact code for complex-valued algorithms. Instructions are 62 bits long and can address up to four sets of three operands (two sources, one destination) with individual address modification operations in each clock cycle.

In addition to conventional SIMD instructions, MSEC4 provides reconfigurable instructions [83]. In a first step, operation with reconfigurable instructions requires application-specific PE configurations to be loaded into a dedicated on-chip memory at runtime. Then, on request, they can be activated by executing the corresponding reconfigurable instruction, which contains the address in the configuration memory and the operands to be processed. Hence, the four PEs can be individually configured, effectively enabling the execution of almost arbitrary operations. In particular, it becomes possible to perform different operations on different PEs, which represents an essential enhancement over common SIMD architectures. The integration of reconfigurable instructions into the MSEC4 architecture provides an increase in both flexibility and performance, yet without introducing notable architectural complexity.

4.2.2 Results, discussion, and conclusion

Results MSEC4, synthesized for a 0.25 µm CMOS technology, runs at a clock frequency of 65 MHz and occupies an area of 8.14 mm^2. The data and coefficient memories of the synthesized version have sizes of 4 KiB and 512 B, respectively. The resulting data processing performance is 1040 MdOp/s.7 The power consumption of the MSEC4’s core has been determined in [84] (p. 70) by collecting the node toggling activities experienced while computing a 1024-point FFT on a placed and routed design. The resulting power consumption amounts to 2.4 W in the considered 0.25 µm CMOS technology.

7 For the performance, complex-valued operators are mapped to real-valued ones, i.e., PP = 4 PEs × 4 real-valued dOp/PE × 65 MHz = 1040 MdOp/s. The flexibility does not consider reconfigurable instructions.


The processing performance achieved by MSEC4 for the most commonly used BB processing algorithms (FFT, FIR, LMS, etc.) is evaluated and compared to the TI DSP generations for mobile (C55x) and high performance (C64x) applications [85]. The results of the evaluation are summarized in Table 4.3. As shown, speedup factors between two and fifteen in terms of cycle counts are achieved by the MSEC4 design. The prime example is the computation of radix-4 FFTs, which are greatly accelerated thanks to the PEs’ butterfly layout.

Discussion A 64-point FFT, as required by the MIMO-OFDM receiver detailed in Section 3.5, is computed in only 93 clock cycles compared to the 182 clock cycles required by the C6455. Unfortunately, the single-cycle radix-4 configuration also determines the longest timing path, thereby limiting the operating frequency of the circuit. As a consequence, the achieved clock frequency of 65 MHz is low compared to what is offered by the employed 0.25 µm CMOS technology, and a performance gain in terms of cycles is watered down by longer execution times. Indeed, the scaling of the MSEC4 to 90 nm CMOS technology, for a comparison with the C6455, confirms this expectation. The MSEC4’s scaling to 90 nm technology leads to an area of 2.1 mm^2 at a frequency of 130 MHz. The processing performance becomes 2080 MdOp/s and the power consumption scales down to approximately 320 mW. With these performance figures, the computation of the 64-point FFT requires 715 ns on the MSEC4, whereas on the C6455 only 182 ns are necessary.
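The cycle and time figures above can be retraced from the MSEC4 cycle estimate for an N-point complex FFT listed in Table 4.3, (N/4 + 10) log4(N) + 15, and from the peak-performance formula of footnote 7. The sketch below only evaluates these expressions; the function names are illustrative.

```python
def log4_exact(n):
    """Return k such that n == 4**k; raise if n is not a power of four."""
    k = 0
    while n > 1:
        if n % 4:
            raise ValueError("N must be a power of four")
        n //= 4
        k += 1
    return k


def msec4_fft_cycles(n):
    """MSEC4 cycle estimate for an N-point complex FFT (Table 4.3)."""
    return (n // 4 + 10) * log4_exact(n) + 15


def exec_time_ns(cycles, f_mhz):
    """Execution time in nanoseconds at a clock frequency given in MHz."""
    return cycles * 1e3 / f_mhz


def peak_mdops(n_pe, dops_per_pe, f_mhz):
    """Peak performance in MdOp/s, as in footnote 7."""
    return n_pe * dops_per_pe * f_mhz


cycles_64 = msec4_fft_cycles(64)
print(cycles_64, msec4_fft_cycles(256))            # 93 311
print(round(exec_time_ns(cycles_64, 130)))         # 715 (MSEC4 at 130 MHz)
print(round(exec_time_ns(182, 1000)))              # 182 (C6455 at 1 GHz)
print(peak_mdops(4, 4, 65), peak_mdops(4, 4, 130)) # 1040 2080
```

The example makes the trade-off explicit: roughly half the cycle count of the C6455, but about four times the execution time because of the lower clock frequency.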

Although OFDM-symbol processing is not jeopardized by the resulting processing time in 90 nm technology, and although the introduction of pipeline stages inside the PEs to shorten the critical timing path can relieve this shortcoming, the evaluation highlighted a few other characteristics of the MSEC4 that speak against a re-design for subsequent use as a MIMO-OFDM BB processor.

The high memory bandwidth required to feed the 4-way SIMD datapath is obtained by splitting the memory into multiple small memory banks that can be accessed concurrently. This solution is excellent for computations that have a block-wise granularity, because the data samples processed simultaneously during one clock cycle are distributed over different memory banks, allowing for single-cycle memory accesses. When the granularity is fine instead, e.g., during the frame-start detection or STF processing, the computations become less efficient. The per-sample granularity forces frequent accesses to data samples residing inside the same memory bank, which results in penalty clock cycles necessary to fetch the data through the single read port provided by the addressed memory bank.
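The effect of bank conflicts can be illustrated with a deliberately simplified model. It assumes that consecutive addresses fill one 64-word bank before the next one starts, that each bank has a single read port, and that conflicting reads are serialized; the actual address-to-bank mapping of the MSEC4 is not specified here, so these are assumptions made for illustration only.

```python
from collections import Counter

BANK_WORDS = 64   # words per data-memory block


def access_cycles(addresses):
    """Cycles needed to read all addresses, one read per bank and cycle.

    Assumes bank = address // BANK_WORDS (hypothetical linear mapping).
    """
    per_bank = Counter(addr // BANK_WORDS for addr in addresses)
    return max(per_bank.values())


# Block-wise access: the four operands lie in four different banks.
print(access_cycles([0, 64, 128, 192]))   # 1

# Sample-wise access: four neighbouring samples share one bank.
print(access_cycles([10, 11, 12, 13]))    # 4
```

In this model the sample-wise pattern takes four times as long as the block-wise one, which corresponds to the penalty cycles described above.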

In addition, as for the C6455, no explicit support for computing the angle of a complex number (required for FOE) is provided by the MSEC4’s PEs, and thus FOE would be computationally expensive and inefficient. The MSEC4 also offers no appropriate support for the subsequent FOC.

Conclusion The MSEC4 is computationally efficient and well suited for regular, complex-valued processing kernels that exhibit sufficient data-level parallelism to exploit the four-way SIMD datapath.

However, for sample-based computations, the large memory bandwidth provided by the MSEC4 is not well exploited due to frequent memory access conflicts and the associated penalty, which burdens the processing with overhead.

4.3 ASPE

The adaptive stream processing engine (ASPE), developed at the IIS, ETH Zurich [9], is a modular coarse-grained ASIP architecture optimized for multimedia stream processing, which mainly consists of regular and repetitive tasks.

4.3.1 Architecture

Figure 4.6 shows the ASPE architecture. The ASPE is tightly coupled with a general purpose processor (GPP) responsible for controlling and setting up the ASPE, as well as for executing tasks that are not performance-critical. In addition, the ASPE has access to the system bus, which gives it the capability of handling datastreams autonomously.

The ASPE consists of a datapath and a controlpath. The datapathemploys two types of building blocks: functional units (FUs) andstorage units (SUs). FUs perform the arithmetic operations and SUs

4.3. ASPE 99

Table4.3:

Perfo

rman

ceevalua

tionfortypicalD

SPalgorit

hms.

Cycle

Estim

ates

(Formulaan

dNum

eric

Exam

ple)

Ben

chmark

C55xa

C64xb

MSE

C4c

N(=

4k)Po

int

CFF

Tn.a.

0.75N

log 4N

+38

(N/4

+10

)log

4N

+15

N=

256

4786

[154

0%]

614

[197

%]

311

[100

%]

RFI

RFilte

rnx/2(nh

+4)

nx/4(nh

+11

)+15

nx/4(nh/2

+2)

+36

nx

=10

0,nh

=32

1800

[370

%]

1090

[220

%]

486

[100

%]

CFI

RFilte

rnx

(2nh

+4)

nxnh

+24

nx/4nh

+14

nx

=10

0,nh

=32

6800

[840

%]

3224

[400

%]

814

[100

%]

CDelayed

LMSFilte

rnx

(8nh

+5)

dnx

(3nh

+17

)dnx

(3/4nh

+10

)+24

nx

=10

0,nh

=32

2610

0[7

60%

]11

300

[330

%]

3424

[100

%]

CMatrix

Prod

uct

2r1c

1c2

+4r

1c2

+10c2

dr1c1c2

+4.

5r1c

2+

11dr1c1c2/4

+6r

1c2

+6c

2+

24r1

=c1

=c2

=16

9376

[350

%]

5259

[200

%]

2680

[100

%]

nx:nu

mbe

rof

samples;nh:nu

mbe

rof

taps;ri:

rowsin

matrix

i;ci:columns

inmatrix

i(c

1=r2)

a Two

real

processin

gun

its;b

Four

real

processin

gun

its;c

Four

complex

processin

gun

itsAllTIB

enchmarks

arefro

m[85],e

xcept:

d Extrapo

latedfro

mcorrespo

nding

Rbe

nchm

arks.


Figure 4.6: Block diagram of the generic ASPE framework described in [9]. (Labeled elements: GPP; system bus; controlpath with SEQs and C-Net; datapath with SUs, FUs, and register file (RF) in slots 0 to 15, connected through the D-Net; CWs passed from controlpath to datapath.)

provide local storage for the data processing. Design-time configuration permits the selection of appropriate SUs and FUs, which provide suitable address generation modes and the atomic operations required for a particular application, respectively. The units are selected from a library to which new units can easily be added at design-time via user-defined modules implemented in predefined wrappers. In the ASPE architecture, the FUs and SUs are connected through a run-time reconfigurable network, which allows multiple FUs and SUs to be combined to form a single atomic operation (e.g., SU → FU → FU → SU → FU → SU). This datapath reconfiguration provides an advantage compared to conventional VLIW architectures since data does not need to take turns over the bottleneck of a complex full-custom multi-port register-file.

The controlpath consists of sequencer units (SEQs). The SEQs provide the FUs and SUs with the necessary 16 bit control words (CWs) that determine their operation mode, and they control the reconfigurable network (D-Net) to route the data between FUs and SUs. The SEQs support zero-overhead loops and data-dependent control flow.


4.3.2 SISO-OFDM receiver

The ASPE is evaluated through the implementation of a SISO-OFDM receiver that is based on the IEEE 802.11a physical layer specification [86]. The algorithms mapped onto the ASPE are described in Section 3.5 and obtained by setting MT = MR = 1 [10]. A careful analysis of the selected SISO-OFDM BB algorithms reveals that an ASPE configuration composed of 1×SEQ, 3×FUs, and 8×SUs is necessary in order to sustain real-time operation. As in the MSEC4 design, nearly all operations (except one real-valued comparison operation) required by the selected algorithms deal with complex-valued numbers, thus all three FUs are designed to support complex-valued arithmetic. Memory-access bottlenecks are easily avoided by storing the real and imaginary parts in the lower and upper half of the same data word, respectively.

Figure 4.7 depicts the block diagram of the ASPE customized for SISO-OFDM BB processing. The single SEQ is configured with a program memory of 762 words, each 192 bits wide, to store the program control flow for the SEQ itself and the 16 bit CWs for the eleven units (FUs and SUs).^8 The three FUs correspond to one complex-valued multiply and accumulate (CMAC) unit, and two complex-valued arithmetic logic units (CALUs). The former performs atomic operations such as multiply, multiply with complex-conjugate, and the corresponding accumulate operations [e.g., as required by (3.12) and (3.13)]. The CALUs implement basic ALU functionalities (i.e., add, sub, shift, max, min, bit-wise and, and bit-wise or) and provide the SEQ with flags for data-dependent decisions, as required for the state transition according to the threshold detection in (3.16). In addition, they have been enhanced by a set of BB processing operations (CORDIC, detect and demap, and absolute value) that are implemented by sharing the already available CALU resources, thus adding only a minimal, control-related hardware overhead. Six of the eight SUs are composed of 256×32 bit memories and incorporate addressing schemes for bit-reversal (as proven to be important by the MSEC4 architecture) and additionally de-interleaving. One SU acts as an input data buffer (FIFO) and has a size of 64×32 bit, the last SU is a register-file of

^8 The resulting instruction word length justifies the classification of this ASPE customization as a VLIW architecture.


Figure 4.7: ASPE configured for SISO-OFDM BB processing. (Labeled elements: controlpath with sequencer, PC, and VLIW program memory of program length P; control network distributing one 16 bit CW to each of the Nu units; datapath with CMAC, CALU0, CALU1, RAM0 to RAM5, I-BUF, and RF as 2-way SIMD units on the data network; Data In, Data Req, and Data Ack signals.)

eight registers.

All FUs implement a two-stage pipeline, resulting in an equal execution time for all FUs independent of their complexity. The potential advantage of exploiting the FUs' different execution times for higher hardware efficiency is traded off against the advantage of regular assembler programming for a shorter development time.

The FUs and SUs have been enhanced to operate in a 2-way SIMD manner to better exploit the data level parallelism inherent to many signal processing algorithms, OFDM BB processing included. Finally, a datapath word-width of 16 bit guarantees sufficient precision for all the required computations.

Careful scheduling is required to efficiently share all FUs and SUs. Table 4.4 summarizes the assembler cycle counts for the SISO-OFDM BB implementation and the corresponding processing times, whereas Figure 4.8 depicts the ASPE's task schedule for the reception of a BPSK modulated OFDM-frame during the data processing state. A clock frequency of 160 MHz together with the duty cycle of Tdc = 4 µs,


Figure 4.8: Data processing task schedule for BPSK modulated OFDM-symbols. (Time axis with duty cycle Tdc = 4 µs; resources: CMAC, CALU0, CALU1, SU0, SU1, SU2, ...; the input buffer receives symbol Si+1 while earlier symbols are processed. Tasks: GI removal & FOC 140 cycles; 64-point FFT 160 cycles; channel compensation 96 cycles; demapping & detection 36 cycles; deinterleaving 66 cycles. Total: 498 cycles / 640 cycles.)

leads to a total of 640 clock cycles available for performing all data processing related tasks. This has proven to be sufficient for real-time reception of OFDM-frames modulated up to 64-QAM.
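The cycle budget quoted above follows directly from the clock frequency and the duty cycle. The short sketch below reproduces this arithmetic and compares it with the BPSK task cycle counts listed in Figure 4.8; the variable and function names are illustrative and not taken from the ASPE tool flow.

```python
# Cycle budget per OFDM symbol: clock cycles available within one duty cycle.
F_CLK_HZ = 160e6   # clock frequency stated in the text
T_DC_S = 4e-6      # duty cycle Tdc stated in the text


def cycle_budget(f_clk_hz: float, t_dc_s: float) -> int:
    """Return the number of clock cycles that fit into one duty cycle."""
    return round(f_clk_hz * t_dc_s)


# Task cycle counts for BPSK reception as listed in Figure 4.8.
BPSK_TASKS = {
    "GI removal & FOC": 140,
    "64-point FFT": 160,
    "Channel compensation": 96,
    "Demapping & detection": 36,
    "Deinterleaving": 66,
}

budget = cycle_budget(F_CLK_HZ, T_DC_S)
used = sum(BPSK_TASKS.values())
print(f"budget: {budget} cycles, used: {used} cycles, slack: {budget - used} cycles")
```

For BPSK, the schedule thus leaves a margin of 142 cycles within each duty cycle.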

4.3.3 Results, discussion, and conclusion

Results. Using the described ASPE architecture, the complete BB processing of an IEEE 802.11a receiver has been implemented in assembler language. Its BER performance has been verified using bit-true MATLAB models and shows only a small degradation compared to a floating-point implementation. As attested by the cycle counts in Table 4.4, real-time operation up to 54 Mbit/s is possible at a clock frequency of 160 MHz, which has been achieved when synthesizing the ASPE for the 0.13 µm CMOS target process. The corresponding silicon area amounts to 1.9 mm^2, and is low compared to the area required by similar approaches, e.g., Montium [87] and MS1/MaRs [88] (see Chapter 2). The resulting processing performance is 2′560 MdOp/s.^9 Unfortunately, no power consumption figures were extracted from this

^9 The processing performance is obtained as: PP = 2-way SIMD × 8 dOps/SIMD × 160 MHz = 2560 MdOp/s.


Table 4.4: Assembler cycle counts and processing times for SISO-OFDM BB processing on ASPE running at 160 MHz.

State / Task                       | Assembler cycle counts | #   | Time [µs]
Frame-start detection
  Correlation                      | 2Ns + NSP + 20         | 196 | 1.22
  Mean energy and th.              | Ns + 20                | 100 | 0.63
  TOTAL                            | 3Ns + NSP + 40         | 296 | 1.85
Short preamble processing (init)
  coarse FOE                       | 75                     | 75  | 0.47
  coarse FOC                       | Ns + 10                | 90  | 0.56
  TOTAL                            | Ns + 85                | 165 | 1.03
Short preamble processing
  coarse FOC                       | Ns + 10                | 90  | 0.56
  LTF1 start detect                | 6Ns + 40               | 520 | 3.25
  TOTAL                            | 7Ns + 90               | 610 | 3.81
LTF1 processing
  fine FOE                         | 75                     | 75  | 0.47
  fine FOC                         | Ns + 10                | 90  | 0.56
  mean LTF1                        | NLP/2 + 10             | 42  | 0.26
  FFT on LTF1                      | 160                    | 160 | 1.00
  Channel estimation               | NLP/2 + 10             | 42  | 0.26
  TOTAL                            | Ns + NLP + 265         | 409 | 2.55
Data processing
  fine FOC                         | Ns + 10                | 90  | 0.56
  FFT                              | 160                    | 160 | 1.00
  Channel compensation             | 114                    | 114 | 0.71
  Demap 64QAM                      | 270                    | 270 | 1.69
  TOTAL                            | Ns + 554               | 634 | 3.96


implementation.

Discussion. The complete BB processing implementation permitted an extensive evaluation of the described ASPE architecture. Although real-time operation is possible, there remains potential for further increasing the hardware efficiency. The implementation revealed, for instance, that the SEQ program code exhibits a very poor density, and that the processor's efficiency in computing the employed algorithms could be further increased by only minimal hardware modifications that facilitate the intra-/inter-kernel data sorting. As an example, the computation of the SIMD 64-point radix-2 FFT requires 160 clock cycles in the above described implementation, instead of the theoretical 192/2 = 96 clock cycles, showing that there is indeed a substantial overhead. Another source of inefficiency is the control network residing inside the ASPE's controlpath. The control network is responsible for scheduling the potentially concurrent accesses of multiple SEQs to the same FU. The complexity of the resulting network is such that it limits the achievable clock frequency. Thus, especially for ASPE incarnations where only one SEQ is used, the control network becomes a severe bottleneck.
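The theoretical figure of 192/2 = 96 cycles can be retraced with a small calculation. The sketch below counts the radix-2 butterflies of an N-point FFT and divides by the SIMD width; it assumes (this is an idealization, not a statement from the implementation) that each SIMD lane completes one butterfly per clock cycle.

```python
def radix2_butterflies(n: int) -> int:
    """Number of radix-2 butterflies in an n-point FFT (n a power of two)."""
    stages = n.bit_length() - 1
    if n < 2 or (1 << stages) != n:
        raise ValueError("n must be a power of two >= 2")
    return (n // 2) * stages


def ideal_cycles(n: int, simd_ways: int) -> int:
    # Assumption: every SIMD lane retires one butterfly per clock cycle.
    return radix2_butterflies(n) // simd_ways


MEASURED_CYCLES = 160  # 64-point FFT cycle count reported in the text

ideal = ideal_cycles(64, simd_ways=2)
overhead = MEASURED_CYCLES / ideal - 1.0
print(f"ideal: {ideal} cycles, measured: {MEASURED_CYCLES} cycles, overhead: {overhead:.0%}")
```

Under this idealization, the reported 160 cycles correspond to an overhead of roughly two thirds over the butterfly-bound minimum.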

The CORDIC algorithm support provided by the CALU unit proved to be essential for FOE because it enabled a cycle-efficient computation of the angle of a complex-valued number.

Conclusion. The ASPE's two-fold adaptivity to tasks of different granularities has proven to be important for the successful implementation of the SISO-OFDM BB processing. The evaluation showed that the ASPE's datapath can easily be tailored to the needs of the application domain at hand: the design-time configurability provides an enormous flexibility and enables the design of appropriate units that, at run-time, provide exactly the right flexibility/granularity.

4.4 Summary and Conclusion

Summary. The design space exploration presented in this chapter highlighted important characteristics that a software-programmable platform requires for its deployment as an OFDM BB processor.


Facts. The C6455 high performance DSP is extremely flexible and allows for rapid code development thanks to its powerful SDK. However, despite the two-fold datapath and the high clock frequency, its processing performance is barely sufficient to sustain real-time SISO-OFDM BB processing with BPSK modulation. The MSEC4 special purpose BB processor incorporates many efficient mechanisms that support OFDM BB processing (e.g., radix-4 butterfly structure, flexible AGUs for intra-kernel data sorting, zero-overhead loop support). However, mainly due to its long critical timing-path, but also because of its difficulty in performing the per-sample processing required during the initial reception phase, the processor cannot be employed for efficient MIMO-OFDM BB processing without a substantial re-design. Finally, the properly configured ASPE streaming processor comes with enough flexibility to sustain both per-sample and per-symbol computations, and delivers enough performance to sustain real-time SISO-OFDM BB processing.

Important characteristics. Although the OFDM BB processing is mainly represented by regular and repetitive processing tasks and is thus well suited for DSPs, frame-start detection and STF processing present irregular tasks that demand sample-based processing and program control-flow. The sample-based FOE requires the support of dedicated hardware (e.g., support for the CORDIC algorithm) to be computationally efficient. Next, the extensive support of complex-valued arithmetic greatly reduces the overhead otherwise experienced when complex-valued operations are performed on single, real-valued operators (cf. Chapter 2). Another important characteristic required to sustain the differing granularities of OFDM BB processing is the presence of mechanisms that assist efficient intra- and inter-kernel data sorting/addressing.

Conclusion. Table 4.5 reports the area and the corresponding operating frequency for the evaluated DSPs for their original target technology, as well as normalized for a 0.18 µm CMOS technology. As reinforced graphically by Figure 4.9, the best area efficiency is attained by the ASPE, followed by the MSEC4. The C6455's efficiency is by far the lowest, which can be attributed, on the one hand, to the DSP's large


Table 4.5: Areas and clock frequencies for the original designs and their normalized versions for a 0.18 µm CMOS technology.

             | Original                          | Normalized to 0.18 µm
Architecture | CMOS [µm] | f [MHz] | A [mm^2] | f0.18 [MHz] | A0.18 [mm^2]
C6455 [76]   | 0.09      | 1′000   | 91       | 500         | 360
MSEC [82]    | 0.25      | 65      | 8.14     | 90          | 4.2
ASPE [10]    | 0.13      | 160     | 1.9      | 115         | 3.6

on-chip memory, and on the other hand to its enormous flexibility.

To conclude, the ASPE architecture has the best prerequisites to successfully implement the MIMO-OFDM BB processing. Thanks to its design-time configurability, it has the best potential to be tailored to its application domain and thus to be finely positioned in the performance-flexibility design space, whereas neither the MSEC4 nor the C6455 offers this configurability.
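The normalized values in Table 4.5 are consistent with first-order technology scaling, in which the clock frequency scales inversely with the feature size and the area scales with its square. This scaling rule is an assumption inferred from the tabulated numbers rather than a rule stated in this section; the sketch below applies it to the three original designs, with function and variable names chosen for illustration.

```python
def normalize(f_mhz: float, area_mm2: float, node_um: float,
              target_um: float = 0.18) -> tuple[float, float]:
    """First-order scaling of frequency and area from node_um to target_um.

    Assumption: f scales with 1/feature size, area with (feature size)^2.
    """
    s = node_um / target_um
    return f_mhz * s, area_mm2 / (s * s)


# (architecture, CMOS node [um], f [MHz], A [mm^2]) as given in Table 4.5
ORIGINAL = [
    ("C6455", 0.09, 1000.0, 91.0),
    ("MSEC", 0.25, 65.0, 8.14),
    ("ASPE", 0.13, 160.0, 1.9),
]

normalized = {name: normalize(f, a, node) for name, node, f, a in ORIGINAL}
for name, (f018, a018) in normalized.items():
    print(f"{name}: f0.18 = {f018:.1f} MHz, A0.18 = {a018:.2f} mm^2")
```

The computed values agree with the table entries to within the rounding used there.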


Figure 4.9: Processing performance per area, for the evaluated SPAs. (Bar chart; categories: ASPE, MSEC4, C6455; vertical axis: performance/area [MdOPS/mm^2], scaled from 0 to 600.)

Chapter 5

MIMO-OFDM SDR Receiver

This chapter presents the mapping of the relevant BB processing algorithms described in Section 3.5 onto an SDR platform, in order to form a 2×2 MIMO-OFDM receiver. The SDR platform is composed of two ASIPs, each of which is tailored to the computational needs of the associated digital signal processing kernels. The first processor performs the per-stream MIMO-OFDM processing. The second processor handles the MIMO detection.

5.1 SDR Platform Overview

The two application-specific processors used to implement the SDR platform are based on a modified version of the ASPE design framework that prevailed against its two competitor architectures evaluated in the previous chapter. The modified ASPE design framework is described next.

5.1.1 The Modified ASPE

Figure 5.1 illustrates the modified ASPE design framework. Three modifications mainly characterize this framework. The addition of


Figure 5.1: Block diagram of the modified ASPE design framework. (Labeled elements: GPP; system bus; controlpath with a single SEQ holding DICTIONARY and INSTRUCTIONS memories; datapath with SUs, FUs, and register file (RF) in slots 0 to 15, connected through the D-Net; dedicated IBUF and OBUF.)

dedicated input and output buffers (IBUF and OBUF), dictated by the streaming-like nature of the BB processing tasks, enables direct access to the data source and sink, thus reducing the data movement overhead otherwise experienced across the D-Net and the system bus. In this way, the system bus' load is also relieved and the bandwidth thus freed is available for other, possibly more control-related tasks, such as reloading the SEQ's program memory.

The second modification concerns the controlpath structure. While the initial framework permitted the use of many SEQs, the modified ASPE framework supports just one SEQ. This restriction is motivated by two considerations. First, the use of a single SEQ eases the assembler programming considerably, allowing more rapid progress. Second, the complexity of the control network, which was found to be a limiting factor in the previous chapter, is greatly reduced, enabling operation at higher clock frequencies.

The third modification concerns the SEQ structure. In order to improve the low program-code density, which is inherent to VLIW architectures as pointed out by the SISO-OFDM receiver implementation in the previous chapter, the SEQ is enhanced to support dictionary-based program-code compression. With this technique, the ASPE instruction words are stored in a dictionary memory that is indexed through the content of a much narrower, but deeper program memory. This mechanism reduces the overall storage requirements. The method used to compress the program-code is detailed later on in Section 5.4, whereas the next section concentrates on the high-level receiver architecture.
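The principle of dictionary-based compression can be illustrated with a few lines of code. The sketch below is a generic illustration of the idea (unique instruction words in a dictionary, a pointer sequence as the program); it is not the compression method of Section 5.4, and all names as well as the toy program are hypothetical.

```python
def compress(program: list[str]) -> tuple[list[str], list[int]]:
    """Split a program into a dictionary of unique words and a pointer list."""
    dictionary: list[str] = []
    index_of: dict[str, int] = {}
    pointers: list[int] = []
    for word in program:
        if word not in index_of:
            index_of[word] = len(dictionary)
            dictionary.append(word)
        pointers.append(index_of[word])
    return dictionary, pointers


def decompress(dictionary: list[str], pointers: list[int]) -> list[str]:
    """Reproduce the original program by following the pointers in order."""
    return [dictionary[p] for p in pointers]


def storage_bits(program_len: int, dict_len: int, word_bits: int) -> tuple[int, int]:
    """Return (uncompressed, compressed) storage in bits.

    Assumption: pointers use the minimum width that can address the dictionary.
    """
    ptr_bits = max(1, (dict_len - 1).bit_length())
    return program_len * word_bits, dict_len * word_bits + program_len * ptr_bits


# Toy program with many repeated instruction words (hypothetical mnemonics).
program = ["init", "load"] + ["iter_add", "iter_sub"] * 18 + ["store", "branch"]
dictionary, pointers = compress(program)
raw, packed = storage_bits(len(program), len(dictionary), word_bits=192)
print(f"{len(program)} words, {len(dictionary)} unique: {raw} bits -> {packed} bits")
```

The saving grows with the amount of repetition in the program, which is why unrolled loops that repeat the same wide instruction words compress well under this scheme.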

5.1.2 Receiver Architecture

The choice of the high-level architecture for implementing the 2×2 SDR MIMO-OFDM receiver is mainly guided by the findings of the algorithm analysis in Section 3.5 and of the SISO-OFDM BB implementation in Section 4.3. The results of the BB implementation indicate that the ASPE architecture has enough processing power to achieve real-time operation for a single-antenna OFDM receiver. On the other hand, the first order complexity-estimates in Table 3.5 reveal that, in a two-antenna MIMO-OFDM receiver, the MIMO-OFDM processing alone requires slightly more than twice as many operations compared to a single-antenna receiver. In addition, significant effort is required for the computation of the MIMO estimator matrix G [in (3.6)], especially for the involved matrix inversion, but also for the MIMO detection itself.

Nevertheless, a system architecture with only two ASPEs has been chosen. In the proposed configuration, the first ASPE (named ASPE A in the following) is dedicated to the OFDM-related processing of the two receive chains. The second ASPE (ASPE B) handles the computation of the MMSE estimator matrix and the MIMO detection. The two ASPEs are connected through their dedicated I/O-buffers. More precisely, ASPE A's OBUF is connected to ASPE B's IBUF, allowing data to be streamed from the first processor to the second. The partitioning of the functionality is illustrated by the dashed boxes in Figure 5.2 (cf. also Figure 3.1).

This approach appears reasonable from a structural point of view, since MIMO-OFDM processing and MIMO-detection rely on different computational kernels. The former requires mainly sample rate processing and FFTs, while the latter relies on matrix manipulations such as inversions, matrix-matrix and matrix-vector multiplications. Thus, one ASPE can be customized for the OFDM processing, while the


Figure 5.2: Simplified block diagram of the considered 2×2 MIMO-OFDM platform and task partitioning onto ASPE A and ASPE B. (Transmitter: TxData, S to P conversion and mapping, two OFDM modulators, transmit vector sk; channel Hk with noise nk; receiver: received vector yk, two OFDM demodulators on ASPE A, MIMO detection on ASPE B, demapping and P to S conversion, RxData.)

other ASPE can be tailored to the MIMO processing. In addition, both ASPEs can operate concurrently in a pipelined fashion while decoding an OFDM-frame, maximizing their resource utilization.

5.1.3 Common ASPE A and ASPE B configuration

The design-time configuration of the two ASPEs aims to fully support the target application while reducing the differences between the two architectures to a minimum, which simplifies the portability of the design tools (e.g., assembler, interface, HDL test-bench). From this perspective, the characteristics that are common to both ASPE A and ASPE B are described in the following.

Since nearly all operations required by the selected algorithms deal with complex-valued numbers, and since the evaluation in Chapter 4 showed this to be expedient, both ASPEs implement FUs that are designed to support 16 bit fixed-point complex-valued arithmetic. As for the SISO-OFDM implementation, memory-access bottlenecks are easily avoided by storing the real and imaginary parts in the lower and upper half of the same data word, respectively. The SUs are composed of 256×32 bit SIMD single-ported memories and incorporate common addressing schemes (i.e., post increment by one, post decrement by one, post increment by an offset register, and bit-reversal). Both the SIMD IBUF and OBUF have a size of 64×32 bit and come with a read and a write port to operate as first-in first-out buffers. Finally, the temporary storage capability of the two ASPEs is provided by eight SIMD registers that compose the register-file (two read-ports and one write-port).

In both ASPEs, the SEQ is configured with a VLIW dictionary memory of 256 words, each 192 bits wide, to store the 16 bit CWs for the SEQ itself and the CWs for the eleven units (FUs and SUs) instantiated inside each ASPE. The program memory can contain up to 1024 dictionary pointers. It stores the pointers to the dictionary memory and guarantees proper sequencing of the dictionary entries in order to reproduce the original uncompressed program.

The number and the type of FUs and SUs selected to realize the particular datapath of ASPE A and of ASPE B are described in more detail in the next two sections (Section 5.2 and Section 5.3). However, the partitioning of the 16 bit CWs used to control the FUs' operation is common for all FUs across the two designs. The CWs are partitioned into the orthogonal fields:

Instr (4 bit), OpA (4 bit), OpB (4 bit), Shamt (4 bit)

where OpA and OpB select the operand sources (i.e., SUs or FUs), Shamt defines a possible shift amount, and Instr codes the instruction to be executed on the FU.
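The partitioning of a 16 bit CW into four 4 bit fields can be sketched as simple bit packing. The bit positions used below (Instr in the most significant nibble, followed by OpA, OpB, and Shamt) are an assumption made for illustration only; the text defines the fields and their widths, but not their order within the word.

```python
FIELD_BITS = 4
# Assumed field order from most to least significant nibble (hypothetical).
FIELDS = ("instr", "op_a", "op_b", "shamt")


def pack_cw(instr: int, op_a: int, op_b: int, shamt: int) -> int:
    """Pack four 4 bit fields into one 16 bit control word."""
    cw = 0
    for value in (instr, op_a, op_b, shamt):
        if not 0 <= value < (1 << FIELD_BITS):
            raise ValueError("each field must fit into 4 bits")
        cw = (cw << FIELD_BITS) | value
    return cw


def unpack_cw(cw: int) -> dict[str, int]:
    """Split a 16 bit control word back into its four fields."""
    if not 0 <= cw < (1 << 16):
        raise ValueError("control word must fit into 16 bits")
    shifts = range(FIELD_BITS * (len(FIELDS) - 1), -1, -FIELD_BITS)
    return {name: (cw >> s) & 0xF for name, s in zip(FIELDS, shifts)}


# Example: CALU ADD (Instr = 0001 in Table 5.1) with illustrative operand
# source selectors 2 and 3 and a shift amount of 4.
cw = pack_cw(0b0001, 2, 3, 4)
print(f"CW = 0x{cw:04X} ->", unpack_cw(cw))
```

Because the fields are orthogonal, each one can be decoded independently of the others, which keeps the per-unit decoder simple.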

5.2 ASPE A – MIMO-OFDM Processing

The particular datapath configuration of ASPE A is detailed first, followed by the description of the per-stream OFDM processing on the properly configured architecture.

5.2.1 Datapath configuration

Figure 5.3 depicts the block diagram of the design-time configured ASPE A. The analysis of the atomic operations required by the selected BB algorithms reveals that for the 2×2 MIMO-OFDM related processing an ASPE configuration with three FUs, five SUs, one IBUF,


Figure 5.3: ASPE A: datapath configuration. (Labeled elements: controlpath with SEQ, IDX memory, and VLIW dictionary memories; control network distributing the CWs; datapath with CMAC, CALU0, CALU1, SU0 to SU4, REG, IBUF, and OBUF as 2-way SIMD units on the data network; Data In and Data Out, each with Data Req and Data Ack signals.)

one OBUF, and the register-file can deliver the necessary processing performance.

The three FUs have been selected principally by observing that the most demanding computational kernel in the MIMO-OFDM processing part is the computation of the 64-point FFT and that this kernel is best supported by butterfly operators that significantly speed up its computation (see Table 3.5). Consequently, the three FUs can be interconnected to form a radix-2 butterfly. The FUs are: one complex-valued multiply and accumulate (CMAC) unit, and two complex-valued arithmetic logic units (CALU0 and CALU1). The CMAC unit is implemented with three pipeline stages, while the two CALUs require only two pipeline stages to attain the same critical path length on all FUs.
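How three such units combine into a radix-2 butterfly can be sketched in software. In the sketch below, the assignment of the twiddle multiplication to a CMAC-like step and of the sum and difference to two CALU-like steps is an illustrative assumption, and the code is a floating-point decimation-in-time FFT, not the fixed-point ASPE assembler implementation.

```python
import cmath


def butterfly(a: complex, b: complex, w: complex) -> tuple[complex, complex]:
    """Radix-2 butterfly built from one multiply and two add/sub steps."""
    t = b * w              # complex multiply (CMAC-like step, assumed mapping)
    return a + t, a - t    # sum and difference (CALU0/CALU1-like steps)


def bit_reverse(i: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of i (cf. bit-reversal addressing)."""
    return int(format(i, f"0{bits}b")[::-1], 2)


def fft_radix2(x: list[complex]) -> tuple[list[complex], int]:
    """Decimation-in-time radix-2 FFT; returns the spectrum and butterfly count."""
    n = len(x)
    bits = n.bit_length() - 1
    if n < 2 or (1 << bits) != n:
        raise ValueError("length must be a power of two >= 2")
    data = [x[bit_reverse(i, bits)] for i in range(n)]
    count = 0
    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)  # twiddle factor
                i, j = start + k, start + k + half
                data[i], data[j] = butterfly(data[i], data[j], w)
                count += 1
        size *= 2
    return data, count


# A unit impulse at index 1 transforms into the phasors exp(-j*2*pi*k/64).
impulse = [0j] * 64
impulse[1] = 1 + 0j
spectrum, n_butterflies = fft_radix2(impulse)
print("butterflies for 64 points:", n_butterflies)
```

The bit-reversed reordering at the input corresponds to the kind of access pattern that the bit-reversal addressing mode of the SUs is meant to provide without extra data movement.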


The CALU unit

The structure of one CALU is illustrated in Figure 5.4. The CALU implements basic ALU functionalities (i.e., add, sub, shift, max, min, bit-wise and, and bit-wise or) and it provides the SEQ with flags for data-dependent decisions, as required, for example, for the state transition triggered by the threshold detection in (3.16). As for the SISO-OFDM implementation, the two CALUs have been enhanced by a set of BB processing operations (CORDIC and absolute value computation) that are implemented by sharing the already available CALU resources. Table 5.1 lists the Instr-field coding of the CALU's CW.

The operation of the CALU is illustrated using the example of the specialized CORDIC instruction, since it differs from the conventional ALU operation. Figure 5.5 shows how the two CALUs interact to compute the angle of a complex-valued number z = x + j y by using the CORDIC algorithm [75]. This datapath configuration is required, for instance, by the FOE in Section 3.5.3. The CORDIC algorithm performs a series of iterations that successively lead to the angle φ = ∠(z). One CORDIC iteration is defined by

    x^{(i+1)} = x^{(i)} - d^{(i)} y^{(i)} 2^{-i}            (5.1)
    y^{(i+1)} = y^{(i)} + d^{(i)} x^{(i)} 2^{-i}            (5.2)
    \phi^{(i+1)} = \phi^{(i)} - d^{(i)} \arctan(2^{-i}),    (5.3)

where i = 0, 1, ..., N_COR − 1 is the iteration index, N_COR the number of iterations, and d^{(i)} = −sign(x^{(i)} y^{(i)}) indicates whether φ^{(i)} is positive (d^{(i)} = +1) or negative (d^{(i)} = −1). The initial values are x^{(0)} = x, y^{(0)} = y, and φ^{(0)} = 0.
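The iteration (5.1) to (5.3) can be retraced in a few lines of floating-point code. The sketch below is a behavioral illustration only: it ignores the 16 bit fixed-point datapath, it resolves the case x·y = 0 by choosing d = +1 (an assumption, since the text does not specify this case), and it is meant for inputs with a positive real part, for which the iteration converges to the angle itself.

```python
import math


def cordic_angle(x: float, y: float, n_iter: int = 16) -> float:
    """Angle of z = x + j*y via the iteration (5.1)-(5.3), for x > 0."""
    phi = 0.0
    for i in range(n_iter):
        # d = -sign(x*y); the tie x*y == 0 is mapped to +1 (assumption).
        d = -1.0 if x * y > 0 else 1.0
        step = 2.0 ** -i
        # (5.1) and (5.2): both updates use the values of the previous iteration.
        x, y = x - d * y * step, y + d * x * step
        # (5.3): accumulate the elementary rotation angle.
        phi -= d * math.atan(step)
    return phi


for z in (1 + 1j, 3 - 4j, 10 + 1j):
    estimate = cordic_angle(z.real, z.imag)
    reference = math.atan2(z.imag, z.real)
    print(f"z = {z}: CORDIC {estimate:+.5f} rad, atan2 {reference:+.5f} rad")
```

With 16 iterations, the remaining angle error is bounded by the last elementary angle arctan(2^-15), which is about 3 · 10^-5 rad.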

The CORDIC instruction of the CALU exploits and configures the datapath to compute (5.1), (5.2), as well as d^{(i)}, which is mapped onto the CALU's flags. By checking the corresponding flag, the SEQ decides whether to perform an addition or a subtraction between the operands of the second CALU, eventually computing (5.3). The arctan(2^{-i}) values are loaded into ASPE A's data memory before the computation starts. In addition, to facilitate the FOC executed by the CMAC unit, the arctan(2^{-i}) values are scaled by a factor of 256/2π (see later description of CMAC). The alternative of hard-wiring these values into a look-up table that resides inside the CALU has been discarded in favor of more numerical flexibility at run-time. The FU's datapath word-width of 16 bit permits running up to N_COR = 16 iterations: arctan(2^{-15}) = 3.0518 · 10^{-5}, which, quantized for a fixed-point representation [16 15], corresponds to 1 and is the smallest representable quantized number.^1 Listings 5.1 and 5.2 show two assembler code snippets that compute the CORDIC algorithm on 16 complex-valued data samples. The former code snippet is compact and, at first sight, seems efficient in both program memory and computation. However, a closer look at the CordicIterLoop at line 20 reveals that only two of the five VLIWs that compose the loop perform effective data operations (i.e., one of those at lines 21 or 22 depending on the flag of CALU0, and that at line 23), while the remaining two instructions are required to fill the two branch delay slots. Thus, the code is not computationally efficient.

The second code snippet (Listing 5.2) is computationally more efficient. At every second clock cycle an effective data operation takes place (again, depending on the flag of CALU0). The apparent program code inefficiency caused by the several repetitions of the same VLIWs turns out not to be an issue. Thanks to the dictionary-based program code compression, only the unique VLIWs need to be stored inside the dictionary memory, hence resulting in a compact code. Eventually, this code snippet is used for the MIMO-OFDM BB implementation.
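The limit of N_COR = 16 iterations stated above for the [16 15] fixed-point format can be checked numerically. The sketch below quantizes the unscaled arctan(2^-i) values; it assumes a signed two's-complement representation with rounding to nearest and saturation, which the text does not spell out, and it leaves aside the additional 256/2π scaling used for the FOC.

```python
import math


def quantize(x: float, ww: int, fw: int) -> int:
    """Quantize x to a signed fixed-point number in [ww fw] notation.

    Assumptions: two's complement, round to nearest, saturation on overflow.
    """
    xq = round(x * 2 ** fw)
    lo, hi = -(2 ** (ww - 1)), 2 ** (ww - 1) - 1
    return max(lo, min(hi, xq))


def dequantize(xq: int, fw: int) -> float:
    """Map a fixed-point number back to its real value: x = xq / 2^fw."""
    return xq / 2 ** fw


# Quantized rotation angles for the 16 CORDIC iterations i = 0, ..., 15.
ATAN_TABLE = [quantize(math.atan(2.0 ** -i), 16, 15) for i in range(16)]
one_more = quantize(math.atan(2.0 ** -16), 16, 15)

print("first entries:", ATAN_TABLE[:4])
print("entry for i = 15:", ATAN_TABLE[15], "| entry for i = 16:", one_more)
```

The entry for i = 15 quantizes to 1, the smallest nonzero magnitude, whereas a seventeenth entry would quantize to 0 and could no longer change the accumulated angle.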

The CMAC unit

The CMAC unit is illustrated in Figure 5.6. It performs 16 bit fixed-point complex-valued operations such as multiply, multiply with complex-conjugate, and the corresponding accumulate operations [e.g., as required by (3.12) and (3.13)]. In addition, the CMAC provides circuitry to support FOC, i.e., r[d] = r[d] · e^{-jφd}. This circuitry includes a look-up table (LUT) containing 256 phasors L[k] = e^{-j2πk/256}, with k = 0, 1, ..., 255 being the LUT's address.

FOC is performed by first loading the scaled phase increment (256/2π) φ,

^1 The notation [ww fw] means that a fixed-point number x_q has a word-width of ww bits and that its decimal point is positioned at the fw-th bit, starting from the least significant bit. The mapping from a fixed-point number to its corresponding real-valued number is: x = x_q / 2^{fw}.


Listing 5.1: ASPE A assembler code snippet to compute CORDIC within two loops. The code is compact, but overhead affected.

     1  // Init
     2  // D0: contains data samples
     3  // D1: contains atan values
     4  $6 = NSAMPLE; \\ Load number of samples to be processed
     5  nop, \
     6    D0.or0R = 0, \
     7    D1.or0R = 16, \
     8    D2.or0R = 0, \
     9    D3.or0R = 0, \
    10    D4.or0R = 0, \
    11    COEFS.or0R = 0; \\ Init right SIMD memory offset regs
    12
    13  // Only work on right SIMD memory banks
    14  // Do loop over all samples and for each sample do the
    15  // CORDIC iterations
    16  SampleLoop:
    17    nop, D0[rpR]+1; // Get new data sample
    18    nop, calu0 = cordic(D0,0);
    19    $7 = NITER-1, D1[rpR]+1, calu0, calu1 = regs0, regs0[0];
    20  CordicIterLoop:
    21    if (!flag0), calu0, calu1 = int + D1;
    22    if (!cond), calu0, calu1 = int - D1;
    23    if (--$7) goto CordicIterLoop, calu0 = cordic(calu0,1), calu1;
    24    nop, D1[rpR]+1, calu0, calu1; // Fill delay slot 1
    25    nop, calu0, calu1; // Fill delay slot 2
    26    if (--$6) goto SampleLoop, D1[rpR]-or0, D2[wpR]+1 = caluR1;
    27    nop; // Fill delay slot 1
    28    nop; // Fill delay slot 2
    29
    30  // D2 now contains the angles of all processed samples


Listing 5.2: ASPE A assembler code snippet to compute CORDIC within one loop. The code is repetitive, but computationally efficient.

    // D0: contains data samples, D1: contains atan values
    $6 = NSAMPLE; \\ Load number of samples to be processed
    \\ Init right SIMD memory offset regs:
    nop, D0.or0R = 0, D1.or0R = 16, D2.or0R = 0, D3.or0R = 0, \
      D4.or0R = 0, COEFS.or0R = 0;
    // Loop over all samples and for each sample do 16
    // CORDIC iterations
    SampleLoop:
      nop, D0[rpR]+1; // Get new data sample
      nop, calu0 = cordic(D0,0);
      nop, calu0;
      nop, D1[rpR]+1, calu0=cordic(calu0,1), calu1=regs0[0];
      if (!flag10), D1[rpR]+1, calu0=cordic(int,1), calu1=int+D1;
      if (!cond), D1[rpR]+1, calu0=cordic(int,1), calu1=int-D1;
      if (!flag10), D1[rpR]+1, calu0=cordic(int,1), calu1=int+D1;
      if (!cond), D1[rpR]+1, calu0=cordic(int,1), calu1=int-D1;
      if (!flag10), D1[rpR]+1, calu0=cordic(int,1), calu1=int+D1;
      if (!cond), D1[rpR]+1, calu0=cordic(int,1), calu1=int-D1;
      if (!flag10), D1[rpR]+1, calu0=cordic(int,1), calu1=int+D1;
      if (!cond), D1[rpR]+1, calu0=cordic(int,1), calu1=int-D1;
      if (!flag10), D1[rpR]+1, calu0=cordic(int,1), calu1=int+D1;
      if (!cond), D1[rpR]+1, calu0=cordic(int,1), calu1=int-D1;
      ... // Omitted 6 instructions for sake of brevity
      if (!flag10), D1[rpR]+1, calu0=cordic(int,1), calu1=int+D1;
      if (!cond), D1[rpR]+1, calu0=cordic(int,1), calu1=int-D1;
      if (!flag10), D1[rpR]+1, calu0=cordic(int,1), calu1=int+D1;
      if (!cond), D1[rpR]+1, calu0=cordic(int,1), calu1=int-D1;
      if (!flag10), D1[rpR]+1, calu0=cordic(int,1), calu1=int+D1;
      if (!cond), D1[rpR]+1, calu0=cordic(int,1), calu1=int-D1;
      if (!flag10), D1[rpR]+1, calu0=cordic(int,1), calu1=int+D1;
      if (!cond), D1[rpR]+1, calu0=cordic(int,1), calu1=int-D1;
      if (!flag10), D1[rpR]+1, calu0=cordic(int,1), calu1=int+D1;
      if (!cond), D1[rpR]+1, calu0=cordic(int,1), calu1=int-D1;
      if (!flag10), D1[rpR]+1, calu0=cordic(int,1), calu1=int+D1;
      if (!cond), D1[rpR]+1, calu0=cordic(int,1), calu1=int-D1;
      if (!flag10), D1[rpR]+1, calu0=cordic(int,1), calu1=int+D1;
      if (!cond), D1[rpR]+1, calu0=cordic(int,1), calu1=int-D1;
      if (!flag10), calu0=cordic(int,1), calu1=int+D1;
      if (!cond), calu0=cordic(int,1), calu1=int-D1;
      if (--$6) goto SampleLoop;
      nop, D1[rpR]-or0;
      nop, D2[wpR]+1 = caluR1;
    // D2 now contains the angles of all processed samples


Figure 5.4: One of the two identical datapaths that compose the 2-way SIMD CALU FUs of ASPE A. (Labeled elements: operands A and B (16 | 16 bit); input selection network; two +/- units; two shifters; max/min; |x|^2; flags; result selection network; saturation; output OUT (16 | 16 bit); decoder driven by the 16 bit control word.)


Figure 5.5: Configuration and combination of the two CALUs used to compute one CORDIC iteration. (Labeled elements: first CALU with operand A (16|16 bit), Re{.} and Im{.} paths each with +/- and shift, iteration counter, −d(i), and FLAG to SEQ; second CALU with operand A (0|16 bit) and arctan(2^-i) input combined by +/-; each CALU with a decoder driven by its control word.)


Table 5.1: CALU instruction coding.
OUT: CALU output, A: operand A, B: operand B, SHAMT: shift amount provided by the Shamt-field inside the CW.

Instr | Mnemonic | Meaning
0000  | HOLD     | OUT ← OUT
0001  | ADD      | OUT ← A + (B >> SHAMT)
0010  | SUB      | OUT ← A − (B >> SHAMT)
0011  | MAX      | OUT ← max(A, B)
0100  | MIN      | OUT ← min(A, B)
0101  | SQU      | OUT ← |A|^2 >> SHAMT
0110  | NOP      | OUT ← A
1000  | AND      | OUT ← bitand(A, B)
1001  | OR       | OUT ← bitor(A, B)
1010  | CORDIC^a | OUT ← CORDIC(A)

a The CORDIC instruction performs one CORDIC iteration. The first of a sequence of CORDIC instructions has to reset the internal iteration counter. This is done by setting the corresponding CW's Shamt-field to '0000'.

which is computed by the CALU through CORDIC rotations, into the corresponding initialization register PHI.^2 The scaling of φ is performed to directly obtain the LUT's address in the subsequent steps. For each received sample, the scaled phase increment is accumulated in the phase accumulator register PHIACCU that resides in the accumulator stage of the CMAC. Concurrently, each received sample is multiplied by the corresponding phasor L[PHIACCU] pointed to by the phase accumulator register. The involved multiplication is performed on the CMAC's multiply-stage; thereafter the result bypasses the accumulator stage and is available at the output.

The two accumulator registers (ACCU and PHIACCU) assure that noreloading of the accumulated data nor accumulated phase is necessaryfor the CMAC’s operation. Thus, no reloading penalty is inferredwhen switching back and forth between FOC and FFT computationduring the data processing.

The CMAC’s FOC operation is another example of how the com-2Loading the PHI-register is done with the load PHI (LDPHI) instruction, see

Table 5.2.


Table 5.2: CMAC instruction coding.OUT : CMAC output, A: operand A, B: operand B, , SHAMT :shift amount provided by the Shamt-field inside the CW, ACCU :accumulator register, PHI: φ register, PHIACCU : φ accumulatorregister.

Instr Mnemonic Meaning0000 NOP OUT CACCU >> SHAMT0010 MAC OUT CACCU >> SHAMT ;

ACCU CACCU +A·B0100 MUL OUT C (A·B) >> SHAMT ;

ACCU C 00110 MSU OUT CACCU >> SHAMT ;

ACCU CACCU −A·B1000 cMAC OUT CACCU >> SHAMT ;

ACCU CACCU +A·BH

1010 cMUL OUT C (A·BH) >> SHAMT ;ACCU C 0

1100 cMSU OUT CACCU >> SHAMT ;ACCU CACCU −A·BH

1110 LDPHI PHI CA;PHIACCU C 0

0001 FOC OUT CA·L[PHIACCU ];PHIACCU C PHIACCU + PHI

putational blocks composing one unit can be combined together rais-ing the utilization of these blocks and allowing for the support ofa broader range of operations, while adding only a minimal control-related hardware-overhead.

To conclude, Table 5.2 summarizes the Instr field coding of theCMAC’s CW.
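The LDPHI and FOC rows of Table 5.2 can be mimicked with a small behavioral sketch. The LUT size of 256 entries, the sign convention of the stored phasors, the class name `FocUnit`, and the use of floating-point complex numbers are all assumptions for illustration; only the register updates follow the table.

```python
import cmath
import math

LUT_SIZE = 256  # assumption: number of phasors stored in the LUT

# Assumed LUT content: unit phasors rotating in the compensating direction.
LUT = [cmath.exp(-2j * math.pi * k / LUT_SIZE) for k in range(LUT_SIZE)]

class FocUnit:
    """Behavioral sketch of the CMAC's LDPHI and FOC instructions."""

    def __init__(self):
        self.phi = 0       # PHI: scaled phase increment (a LUT address step)
        self.phiaccu = 0   # PHIACCU: phase accumulator

    def ldphi(self, a):
        # LDPHI: PHI <- A; PHIACCU <- 0
        self.phi = a % LUT_SIZE
        self.phiaccu = 0

    def foc(self, a):
        # FOC: OUT <- A * L[PHIACCU]; PHIACCU <- PHIACCU + PHI
        out = a * LUT[self.phiaccu]
        self.phiaccu = (self.phiaccu + self.phi) % LUT_SIZE
        return out

# Usage: samples rotating by 3 LUT steps per sample are de-rotated to a constant.
unit = FocUnit()
unit.ldphi(3)
rx = [cmath.exp(2j * math.pi * 3 * n / LUT_SIZE) for n in range(10)]
compensated = [unit.foc(r) for r in rx]
```

Because the increment is already scaled to LUT addresses, the per-sample work reduces to one table lookup, one complex multiplication, and one addition.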


Figure 5.6: One of the two identical datapaths that compose the 2-way SIMD CMAC FU of ASPE A.


5.2.2 BB processing on ASPE A

This section summarizes the tasks performed on ASPE A from the frame-start detection to the data processing states. It illustrates the corresponding datapath configurations and reports the clock cycle counts required by the assembler implementation to process blocks of Ndc samples.³ Together, these results are instrumental in determining the clock frequency the processor has to attain for real-time operation and in determining a viable task schedule.

IBUF data sorting. The IBUF has a capacity of 64 words. The received data samples are stored in the left and right parts of the SIMD memory according to the receive stream they belong to, as shown in Figure 5.7. The buffer provides the upstream circuitry with a handshake interface to control the data flow and avoid buffer overflow. The upstream circuitry is required to provide the IBUF with the received datastream sampled at a rate of 20 MS/s and to be capable of buffering at least Ndc samples.

To increase the efficiency of the frame-start detection, the incoming samples are processed in blocks of the length of one OFDM symbol (Ndc = Ns samples).

Frame start detection. In a first step and through appropriate VLIWs, the datapath is configured as illustrated in Figure 5.8(a) for computing m16 [(3.15) in Section 3.5.1]. While computing m16, the received data samples r[d] are also temporarily stored into SU1 and SU2, to be further processed in a second step. ASPE A requires 2Ndc + 35 clock cycles to obtain Ndc mean-energy values m16. Next, the received data samples previously buffered in SU1 and SU2 are used to compute p16 [cf. (3.14)], with the datapath configuration shown in Figure 5.8(b). This task takes another 2Ndc + 35 clock cycles. Finally, the threshold detection (3.16) is performed in 3Ndc + 20 cycles, configuring the datapath as in Figure 5.9. For this operation, the datapath crossing mechanism at the input A of the CALU is exploited, so that only one of the two SIMD units is employed for the comparison. Once the frame start is detected, the receiver proceeds into the STF processing state.

Summarizing, frame start detection requires 7Ndc + 90 clock cycles to process blocks of Ndc received data samples.

³ The reported clock cycle counts are rounded up to the next multiple of five. The terms 'clock cycle' and 'cycle' are used interchangeably in this chapter.

Figure 5.7: IBUF receive-data sorting. [Diagram: the interleaved rx datastream is stored word by word, with r(2)[d] in the left (L) and r(1)[d] in the right (R) half of the SIMD memory.]

STF processing. First, FOE as described in Section 3.5.2 is performed in 60 cycles. FOE needs to be computed only once during STF processing. The datapath configuration for the computation of the phase rotation by the CORDIC algorithm uses both CALUs and has been detailed in Section 5.2.1. Then, coarse FOC is performed on the Ndc received samples employing the CMAC unit, which takes Ndc + 10 cycles. The subsequent computation of the mean energy values m64 (3.15), the correlation p64 (3.14), and the threshold detection in (3.16) with L = NLP = 64, necessary to detect the start of LTF1, are performed in 6Ndc + 75 cycles. If the threshold is detected, the SUs' base addresses are aligned to match the OFDM-symbol boundary in at most Ndc + 10 cycles.

Thus, in steady state (without FOE), STF processing over Ndc received data samples requires 8Ndc + 95 cycles.
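The cycle counts of the frame-start detection and STF processing states can be tallied with a short sketch, which also gives a rough feeling for the clock frequency they imply. The block length of 80 samples and the simple real-time criterion (a block has to be processed in the time it takes to receive it at 20 MS/s) are assumptions made for illustration; the function names are hypothetical.

```python
def frame_start_cycles(n_dc):
    """Cycles for frame-start detection over a block of n_dc samples."""
    m16 = 2 * n_dc + 35         # mean-energy values m16
    p16 = 2 * n_dc + 35         # correlation p16
    threshold = 3 * n_dc + 20   # threshold detection
    return m16 + p16 + threshold

def stf_steady_state_cycles(n_dc):
    """Cycles for STF processing in steady state (FOE excluded)."""
    coarse_foc = n_dc + 10      # coarse FOC on the CMAC
    detection = 6 * n_dc + 75   # m64, p64, and threshold detection
    alignment = n_dc + 10       # worst-case SU base-address alignment
    return coarse_foc + detection + alignment

def min_clock_hz(cycles, n_dc, sample_rate_hz=20e6):
    """Assumed criterion: finish a block within its own arrival time."""
    return cycles * sample_rate_hz / n_dc

N_DC = 80  # assumption: samples per OFDM symbol, not stated in this section

for name, fn in (("frame start", frame_start_cycles),
                 ("STF", stf_steady_state_cycles)):
    cycles = fn(N_DC)
    print(name, cycles, "cycles,", min_clock_hz(cycles, N_DC) / 1e6, "MHz")
```

Under these assumptions the STF state is the more demanding of the two, since it needs Ndc + 5 more cycles per block than frame-start detection.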


Figure 5.8: Datapath configuration for frame-start detection. (a) Computation of m16. (b) Computation of p16.

Figure 5.9: Datapath configuration for threshold detection.

LTF processing. Coarse FOC is performed on the samples of the LTF2 not yet compensated during STF processing (in at most Ndc + 10 cycles). Subsequently, fine FOE takes place (in 60 cycles) employing p64[d64], followed by fine FOC on the three OFDM symbols that together compose LTF1 and LTF2, in order to remove the residual frequency offset (in 250 cycles). Thereafter, the granularity of the operations switches to that of an OFDM symbol and the average over the two OFDM symbols T1A and T1B composing LTF1 is calculated (in 85 cycles). Finally, the long guard interval GI2 is removed, and the averaged LTF1 and the LTF2 are fast Fourier transformed in 250 cycles each.

The datapath configuration for FOE and FOC is the same as employed for the corresponding task during STF processing. Only the addressing of the SUs changes and reflects the increased lag of the correlation: from L = NSP = 16 to L = NLP = 64. To compute the average over the two OFDM symbols T1A and T1B that compose LTF1, one CALU reads the corresponding samples of the LTF1 from SU1 and SU2 and averages them. The interleaved 64-point FFT datapath configuration with the corresponding data sorting is shown in Figure 5.10. Listing 5.3 illustrates the assembler code snippet used to compute the second stage of the 64-point interleaved FFT.


Figure 5.10: Datapath configuration for 64-point FFT computation.


Listing 5.3: Second FFT-stage on ASPE A.
// Stage 2
// Read data and coefficients
nop, D3[rpL,rpR]+1, COEFS[rpL,rpR]+1;
// -- Fill pipelines (prologue)
nop, cmac=D3*COEFS>>SHAMT, D3[rpL,rpR]+1;
nop, cmac=D3*COEFS>>SHAMT, D3[rpL,rpR]+1;
nop, cmac=D3*COEFS>>SHAMT, D2[rpL,rpR]+1, D3[rpL,rpR]+1;
nop, calu0=D2+cmac, calu1=D2-cmac, cmac=D3*COEFS>>SHAMT,
     D2[rpL,rpR]+1, D3[rpL,rpR]+1;
nop, calu0=D2+cmac, calu1=D2-cmac, cmac=D3*COEFS>>SHAMT,
     D2[rpL,rpR]+1, D3[rpL,rpR]+1;
// -- Radix-2 butterflies (steady-state)
nop, D0[wpL,wpR]+1=[caluL0,caluR0], D1[wpL,wpR]+1=[caluL1,caluR1],
     calu0=D2+cmac, calu1=D2-cmac, cmac=D3*COEFS>>SHAMT,
     D2[rpL,rpR]+1, D3[rpL,rpR]+1;
nop, D0[wpL,wpR]+1=[caluL0,caluR0], D1[wpL,wpR]+1=[caluL1,caluR1],
     calu0=D2+cmac, calu1=D2-cmac, cmac=D3*COEFS>>SHAMT,
     D2[rpL,rpR]+1, D3[rpL,rpR]+1;
... // -- Removed code to shorten listing
nop, D1[wpL,wpR]+1=[caluL0,caluR0], D0[wpL,wpR]+1=[caluL1,caluR1],
     calu0=D2+cmac, calu1=D2-cmac, cmac=D3*COEFS>>SHAMT,
     D2[rpL,rpR]+1, D3[rpL,rpR]+1;
nop, D1[wpL,wpR]+1=[caluL0,caluR0], D0[wpL,wpR]+1=[caluL1,caluR1],
     calu0=D2+cmac, calu1=D2-cmac, cmac=D3*COEFS>>SHAMT,
     D2[rpL,rpR]+1, D3[rpL,rpR]-or0;
// -- Empty pipeline stages (epilogue)
nop, D1[wpL,wpR]+1=[caluL0,caluR0], D0[wpL,wpR]+1=[caluL1,caluR1],
     calu0=D2+cmac, calu1=D2-cmac, cmac=D3*COEFS>>SHAMT,
     D2[rpL,rpR]+1;
nop, D1[wpL,wpR]+1=[caluL0,caluR0], D0[wpL,wpR]+1=[caluL1,caluR1],
     calu0=D2+cmac, calu1=D2-cmac, D2[rpL,rpR]+1;
nop, D1[wpL,wpR]+1=[caluL0,caluR0], D0[wpL,wpR]+1=[caluL1,caluR1],
     calu0=D2+cmac, calu1=D2-cmac, D2[rpL,rpR]-or0;
nop, D1[wpL,wpR]+1=[caluL0,caluR0], D0[wpL,wpR]+1=[caluL1,caluR1],
     calu0=D2+cmac, calu1=D2-cmac;
nop, D1[wpL,wpR]+1=[caluL0,caluR0], D0[wpL,wpR]+1=[caluL1,caluR1];
nop, D1[wpL,wpR]-or1=[caluL0,caluR0], D0[wpL,wpR]-or0=[caluL1,caluR1];
// -- Total: 39 clock cycles for stage 2
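The arithmetic performed by each steady-state VLIW of Listing 5.3 is a radix-2 butterfly: the CMAC multiplies one input by a twiddle factor, and the two CALUs form the sum and the difference. The following sketch shows that operation in floating point and embeds it in a plain recursive FFT. The recursive formulation, the function names, and the absence of fixed-point scaling (the `>>SHAMT` of the listing) are assumptions for illustration and do not reproduce the interleaved memory layout of ASPE A.

```python
import cmath

def butterfly(a, b, w):
    """Radix-2 butterfly: one CMAC product feeding two CALU operations."""
    t = b * w             # cmac  = D3 * COEFS
    return a + t, a - t   # calu0 = D2 + cmac, calu1 = D2 - cmac

def fft(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of 2)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out

def dft(x):
    """Direct O(n^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

samples = [complex(i % 7, -(i % 5)) for i in range(64)]
spectrum = fft(samples)
```

A 64-point transform consists of six such stages with 32 butterflies each, which is why a fully pipelined butterfly per clock cycle and per SIMD half matters for the cycle counts reported in this section.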


MIMO channel processing. First, the coarse and fine phases are added together to generate a single, fine frequency offset estimate (in 15 cycles). Fine FOC takes place next (Ndc + 10 cycles). Then, the channel estimate is computed by the matrix-matrix multiplication $H = Z T^{-1}$ described in Section 3.5.4, for which ASPE A requires 4Nc + 10 cycles. The resulting channel estimates are transferred to ASPE B via the OBUF→IBUF link between the two processors. Once on ASPE B, they are processed further to obtain the linear MMSE estimator matrix G. Then, the processor switches to the data processing state.

Data processing. ASPE A performs fine FOC and computes an FFT on the received samples, in Ndc + 10 and 250 cycles, respectively. The transformed data is then conveyed to ASPE B for detection.

5.3 ASPE B – MIMO Detection

This section describes the FUs, SUs, and the operation of ASPE B as part of the SDR platform, analogously to the description of ASPE A.

5.3.1 Datapath configuration

ASPE B is dedicated to MIMO detection and configured with four FUs, four SUs, an IBUF, an OBUF, and the register file, as illustrated in Figure 5.11. The FUs necessary for the matrix computations of MIMO detection are: one CALU, two CMACs, and one real-valued divider unit (DIV). The DIV unit is necessary to perform the matrix inversion, as described in Section 3.5.4.

In order to reach the highest possible clock frequency, the FUs of ASPE B have two different pipeline depths. The CALU and the two CMAC units have three pipeline stages each, while the DIV unit has been pipelined with seven stages. As a result, all units have the same critical-path length.


Figure 5.11: Datapath configuration of ASPE B.

The CALU unit

The CALU (see Figure 5.12) implements the instructions reported in Table 5.3. Compared to the CALU of ASPE A, no CORDIC support is provided, since it is not required during MIMO detection. Instead, the important feature of adding only the real parts of two complex-valued numbers, while setting the imaginary part to zero, is implemented.

Figure 5.12: One of the two identical datapaths that compose the 2-way SIMD CALU FU of ASPE B.

Table 5.3: ASPE B CALU instruction coding. OUT: CALU's output, A: operand A, B: operand B, SHREG: shift amount register, SHAMT: shift amount provided by the Shamt-field inside the CW.

Instr  Mnemonic  Meaning
0000   NOP       OUT ← B >> SHAMT
0001   HOLD      OUT ← OUT
0010   LDSHAMT   SHREG ← B
0011   SHIFT     OUT ← B >> (SHREG + SHAMT)
0100   ADDRe     OUT ← (Re{A} + Re{B}) >> (SHREG + SHAMT)
0110   SUB       OUT ← (A − B) >> (SHREG + SHAMT)
0111   ADD       OUT ← (A + B) >> (SHREG + SHAMT)

The CMAC unit

The CMAC unit of ASPE B is illustrated in Figure 5.13. It performs 16 bit fixed-point complex-valued operations such as multiply, multiply with complex conjugate, and the corresponding accumulate operations. The Instr field of the CMAC codes the instructions reported in Table 5.4. Even though its main functionality is the same as that of the CMAC of ASPE A, its structure is adapted to the particular needs of the MIMO detection algorithms. In particular, the FOC support is dropped in favor of a few additional variations of complex-valued multiply operations (e.g., cMULn, cMULcn) and a wider shift range for the CMAC's result. Overall, the circuit's complexity is reduced compared to the CMAC unit of ASPE A.

The DIV unit

Figure 5.14 depicts the real-valued divider unit, involved in the 2 × 2 direct matrix inversion required to compute the estimator matrix G. The divider performs 16 bit divisions and provides the instructions reported in Table 5.5. The DIV instruction performs a standard, signed division between the operands A and B. The SHDIV instruction, in contrast, is mainly used to compute the shifted inverse $2^{\lfloor \log_2 B \rfloor}/B$ of the real-valued fixed-point operand B. Note that for the SHDIV instruction, both operands of the DIV unit have to be set to operand B. The shifted inverse obtained in this way allows the dynamic range of the division to be fully exploited. The shift amount $\lfloor \log_2 B \rfloor$ is available at the DIV's output always one cycle before the actual division result and can thus be incorporated in the subsequent computations. The shifted inverse of a negative number results in the largest positive value at the output of the DIV unit. This behavior is desired for the computation of the matrix inversion, where the determinant of the Hermitian matrix F, which has to be inverted, is always positive.
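The benefit of the shifted inverse can be seen in a small integer sketch. Since $2^{\lfloor \log_2 B \rfloor} \le B < 2^{\lfloor \log_2 B \rfloor + 1}$, the quotient always lies in (0.5, 1] and therefore uses nearly all fractional bits, independent of the magnitude of B. The Q15 output format, the saturation to 32767, and the function names are assumptions for illustration; the handling of non-positive operands follows the behavior described in the text.

```python
FRAC_BITS = 15                   # assumption: Q15 fractional output format
MAX_POS = (1 << FRAC_BITS) - 1   # largest positive 16 bit value (32767)

def floor_log2(b):
    """floor(log2(b)) for a positive integer b."""
    return b.bit_length() - 1

def shifted_inverse(b):
    """Return (shift, q) with q / 2**FRAC_BITS ~ 2**shift / b.

    For b <= 0 the sketch returns shift 0 and the largest positive value,
    mirroring the behavior described for negative operands.
    """
    if b <= 0:
        return 0, MAX_POS
    shift = floor_log2(b)
    q = (1 << (shift + FRAC_BITS)) // b   # 2**shift / b in Q15
    return shift, min(q, MAX_POS)         # saturate when b is a power of two

# Usage: recover 1/b from the shifted inverse and the shift amount.
shift, q = shifted_inverse(1000)
approx_inverse = q / 2 ** (FRAC_BITS + shift)
```

The true inverse is recovered by an additional right shift by the reported shift amount, which is why it is useful that this amount is available ahead of the division result.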


Figure 5.13: One of the two identical datapaths that compose the 2-way SIMD CMAC FU of ASPE B.


Table 5.4: ASPE B CMAC instruction coding. OUT: CMAC's output, A: operand A, B: operand B, SHREG: shift amount register, SHAMT: shift amount provided by the Shamt-field inside the CW.

Instr  Mnemonic  Meaning
0001   LDSHAMT   SHREG ← B
0010   MUL       OUT ← (A·B) >> (SHREG + SHAMT)
0011   MAC       OUT ← ACCU >> (SHREG + SHAMT);  ACCU ← ACCU + A·B
0100   cMUL      OUT ← (A·B^H) >> (SHREG + SHAMT)
0101   cMAC      OUT ← ACCU >> (SHREG + SHAMT);  ACCU ← ACCU + A·B^H
0110   cMULn     OUT ← −(A·B^H) >> (SHREG + SHAMT)
1000   MACn      OUT ← ACCU >> (SHREG + SHAMT);  ACCU ← ACCU − A·B^H
1001   cMULcn    OUT ← −(A^H·B) >> (SHREG + SHAMT)
1010   cMACcn    OUT ← ACCU >> (SHREG + SHAMT);  ACCU ← ACCU − A^H·B
1011   NOP       OUT ← ACCU >> (SHREG + SHAMT)

Table 5.5: ASPE B DIV instruction coding. OUT: DIV's output, A: operand A, B: operand B, SHAMT: shift amount provided by the Shamt-field inside the CW.

Instr  Mnemonic  Meaning
0000   NOP       OUT ← ⌊log2(A)⌋ >> (SHREG + SHAMT)
0001   DIV       OUT ← A/B >> (SHREG + SHAMT)
0010   SHDIV     OUT ← 2^⌊log2(A)⌋ / B >> (SHREG + SHAMT)
0011   HOLD      OUT ← OUT >> (SHREG + SHAMT)


Figure 5.14: One of the two identical datapaths that compose the 2-way SIMD DIV FU of ASPE B. [Diagram annotation: X = log2(Re{A}) if A > 0, else X = 0.]


5.3.2 BB processing on ASPE B

ASPE B remains idle until it is triggered by the first channel-estimate samples dropping into its IBUF, upon which it starts with MIMO channel processing.

MIMO channel processing. ASPE B computes blocks of four 2 × 2 linear MMSE estimator matrices G in 44 cycles: the computation of four F matrices requires 12 clock cycles, the matrix inversions resulting in four $F^{-1}$ matrices are then performed in 20 clock cycles, and the matrix multiplications to obtain four $G = F^{-1} H^H$ are computed in 12 clock cycles. Figure 5.15 illustrates the data sorting adopted and how the FUs are chained in order to compute $F = H^H H + M_T \sigma^2 I$ (the addition of $M_T \sigma^2$ to the diagonal elements of $H^H H$ is performed on the CALU and is not shown in the figure).
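The three steps of the estimator computation (forming F, inverting it directly, and multiplying by the Hermitian transpose of H) can be sketched for a single 2 × 2 channel matrix. The sketch reads the regularized matrix as $F = H^H H + M_T \sigma^2 I$ and uses floating-point complex numbers; both the reading of the garbled formula and the helper names are assumptions, and the fixed-point scaling and the SHDIV-based inversion of ASPE B are not modeled.

```python
def conj_transpose(m):
    """Hermitian transpose of a 2x2 matrix given as nested lists."""
    return [[m[j][i].conjugate() for j in range(2)] for i in range(2)]

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2x2(f):
    """Direct 2x2 inversion via the adjugate and the determinant."""
    det = f[0][0] * f[1][1] - f[0][1] * f[1][0]
    return [[f[1][1] / det, -f[0][1] / det],
            [-f[1][0] / det, f[0][0] / det]]

def mmse_estimator(h, sigma2, m_t=2):
    """G = F^-1 H^H with F = H^H H + m_t * sigma2 * I (assumed form)."""
    hh = conj_transpose(h)
    f = matmul(hh, h)
    for i in range(2):
        f[i][i] += m_t * sigma2   # diagonal loading (done on the CALU in the text)
    return matmul(inv2x2(f), hh)

H = [[1 + 1j, 0.5 + 0j], [0.2j, 2 - 1j]]
G = mmse_estimator(H, sigma2=0.1)
```

With the noise term set to zero, G reduces to the inverse of H, which provides a simple sanity check for the chain of operations.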

Data processing. During the data processing state, s is computed by a matrix-vector multiplication combined with an arithmetic shift (thus performing hard-out detection). The configuration is similar to that of Figure 5.15, and 16 cycles are required for computing four symbols s, independent of the modulation order M. Finally, the detected data is written to the OBUF, where it is ready to be demapped, de-interleaved, and conveyed to an external Viterbi decoder.


Figure 5.15: Datapath configuration for the matrix-matrix multiplication $H^H H$. [Diagram: CMAC0 and CMAC1 compute Y = X X^H on the left (L) and right (R) SIMD datapaths.]


5.4 Dictionary Based Program-Code Compression

The SISO-OFDM BB processing implementation on the ASPE described in Section 4.3 revealed that the achieved program code density is poor, as is typical for VLIW architectures.⁴ VLIW processors can exploit the instruction-level parallelism (ILP) inherent to data processing algorithms by operating the FUs allocated inside their datapath concurrently. Unfortunately, the implemented application often does not require all available units to run concurrently, and therefore the code contains many no-operations (NOPs) for the unemployed units. Especially for VLIW processors such as the ASPE, which store the entire VLIW in one program memory word (i.e., fixed instruction format VLIW processors), this means that the code density is low. Moreover, the application may require one VLIW to be repeated several times inside the program memory, resulting in large programs (cf. Listing 5.2).

In this thesis, dictionary-based program-code compression (DBCC) is implemented to alleviate this inefficiency. DBCC schemes collect all unique instructions of a program into a dictionary memory at compile time. Then, at run time, the dictionary entries are indexed through a much narrower, but deeper memory, which eventually allows the original program flow to be reconstructed. In [89], dictionary-based program-code compression was introduced to VLIW architectures. Compared to other methods that use, for instance, entropy encoding (e.g., as in [90, 91, 92]) and are better suited for flexible instruction format VLIWs, DBCC is well suited for fixed instruction width VLIW processors. It achieves good compression ratios, while the additional hardware overhead required to decompress and restore the original program flow is low. Motivated by the observation that the dictionary memory still contains a significant number of NOP CWs, the dictionary memory is further compressed in this thesis, which represents an addition to conventional DBCC.

In the following, first the initial SISO-OFDM ASPE program memory configuration is reported, which will serve as a reference. Then, the actual DBCC method with the additional dictionary compaction step is described and compared to the reference setup.

⁴ In this thesis, the code density is defined as $\rho = \frac{\#\text{ of instructions}}{\#\text{ of instructions} + \#\text{ of NOPs}}$.

Table 5.6: Benchmark program sizes and code densities.

Benchmark              P     ρ     ρ̄ (a)   bunc
SISO BB proc.          762   24%   76%     146304
16 Tap FIR (b)         44    24%   76%     8448
64-point FFT           149   47%   53%     28608
Two 64-point FFT (c)   253   47%   53%     49152
MIMO BB kernel (d)     506   34%   66%     97152

(a) ρ̄ = 1 − ρ.
(b) 16-tap finite impulse response (FIR) filter implementation.
(c) Implementation of two 64-point FFTs interleaved in memory, as used in Section 5.2.2.
(d) Implementation of MIMO detection, as used in Section 5.2.2.

5.4.1 Reference design

The ASPE performing the SISO-OFDM BB processing of Section 4.3 is equipped with eleven units (FUs and SUs, cf. Figure 4.7). One of its VLIWs is 192 bits wide and comprises Nu = 12 CWs, which control the SEQ unit and the eleven units (each CW is 16 bits wide). The program memory can contain P = 762 VLIWs, thus resulting in an uncompressed storage capacity of bunc = 146'304 bits.

Figure 5.16 illustrates the area breakdown of the various building blocks of the reference ASPE, placed and routed for 0.18 µm CMOS technology. Here, the SEQ share amounts to almost 25% of the processor's total silicon area (or 1 mm²), which underlines that the required program-code memory area is indeed considerable. The program code density for the SISO-OFDM assembler program of Section 4.3 is reported in Table 5.6, together with a set of other benchmark programs. The SISO-OFDM assembler code implementation attains a low code density: 24% of all CWs are useful instructions, whereas the remaining 76% of the CWs are filled with NOPs. For the additional benchmarks the situation is similar.
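The two quantities used to characterize the reference design, the uncompressed storage size and the code density, follow directly from their definitions. The sketch below assumes a program represented as a list of VLIWs, each a list of CW strings with `"----"` standing for a NOP; this representation and the function names are illustrative assumptions.

```python
CW_BITS = 16   # width of one control word
NOP = "----"   # assumption: textual placeholder for a NOP control word

def uncompressed_bits(p, n_u=12, cw_bits=CW_BITS):
    """Storage of P VLIWs with n_u control words each."""
    return p * n_u * cw_bits

def code_density(program):
    """Share of non-NOP control words among all control words."""
    total = sum(len(vliw) for vliw in program)
    useful = sum(cw != NOP for vliw in program for cw in vliw)
    return useful / total

print(uncompressed_bits(762))   # 146304, the reference program memory size

toy_program = [["0082", NOP, NOP, NOP],
               [NOP, NOP, "600E", "7800"]]
print(code_density(toy_program))   # 0.375 for this toy example
```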


Figure 5.16: Area breakdown of the ASPE placed and routed for 0.18 µm CMOS technology. The total area amounts to 4.2 mm², the SEQ containing the program memory occupies 1 mm². [Pie chart shares: SEQ 24%, CMAC 13%, Other 12%, CALU0 7%, CALU1 7%, ST0 5%, ST1 5%, ST2 5%, ST3 5%, ST4 5%, ST5 4%, IBUF 4%, Regfile 2%.]

5.4.2 DBCC with NOP bitmask

DBCC. Figure 5.17 depicts the hardware components necessary to implement DBCC (without the additional dictionary compression step yet). To generate the dictionary, the initial assembler program is parsed and each unique VLIW is saved into the dictionary binary file. Concurrently, to generate the indexes that allow the original program to be reconstructed, the address of the corresponding VLIW inside the dictionary is stored into the index binary file for each parsed VLIW. These two files are loaded into the index and dictionary memories, after which the ASPE is ready to operate. The sequencer starts by fetching an index from the index memory. This index addresses the dictionary, from where the effective VLIW is fetched. Then, the CWs composing the VLIW are directed to the corresponding units, i.e., to the sequencer and to the eleven datapath units. In the figure, NOP CWs are indicated by '----'.

As suggested by the dashed contour in Figure 5.17, the change to the SEQ structure required to support DBCC is only minimal. The interfaces to the program memory remain the same as for the SEQ operating without DBCC. The only difference lies in the additional instruction fetch step introduced by the index memory, which is necessary to retrieve the actual VLIW that has to be executed. Jumps and branches are handled as without the dictionary; the programmer only has to respect the additional latency cycle.

Figure 5.17: Dictionary-based decompression hardware. [Diagram: the PC addresses the index memory (program length P); the fetched index selects one of the L entries of the uncompressed dictionary memory, each Nu CWs wide; the sequencer forwards the CWs to the units.]
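The dictionary and index generation described above amounts to deduplicating the VLIWs of the program while remembering, for every program position, where its VLIW landed in the dictionary. A minimal sketch, assuming VLIWs are represented as lists of CW strings and using hypothetical function names:

```python
def build_dictionary(program):
    """Return (dictionary, index) for a program given as a list of VLIWs."""
    dictionary = []   # unique VLIWs in order of first appearance
    index = []        # one dictionary address per program position
    position = {}     # VLIW -> dictionary address
    for vliw in program:
        key = tuple(vliw)
        if key not in position:
            position[key] = len(dictionary)
            dictionary.append(key)
        index.append(position[key])
    return dictionary, index

def fetch(dictionary, index, pc):
    """Two-step fetch: index memory first, then dictionary memory."""
    return dictionary[index[pc]]

program = [["0082", "----"], ["----", "E1D5"], ["0082", "----"], ["----", "E1D5"]]
dictionary, index = build_dictionary(program)
print(len(dictionary), index)   # 2 unique VLIWs, index [0, 1, 0, 1]
```

The two-step `fetch` corresponds to the additional latency cycle that the programmer has to take into account for jumps and branches.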

Table 5.7 summarizes the results obtained when applying DBCC to the benchmark programs of Table 5.6. Clearly, the program length P remains the same as in the reference design, whereas its width is now reduced to a word width of $\lceil \log_2 L \rceil$ bits. L indicates the number of words of the dictionary memory, and bcmp the storage bits required for the index memory and the dictionary memory. The compression ratio is commonly defined as Rbit = bcmp/bunc, and thus takes the additional overhead caused by the index memory into account. Low Rbit values indicate good compression ratios and, as the table reveals, a first improvement over the reference design is achieved. However, the code densities ρ attained by the DBCC benchmark programs are still poor. The additional dictionary compression step described next further improves this aspect.


Table 5.7: DBCC benchmark program sizes and code densities.

Benchmark          P     L     ρ     ρ̄ (a)   bcmp    Rbit (b)
SISO BB proc.      762   433   27%   73%     89994   62%
16 Tap FIR         44    13    26%   74%     2672    32%
64-point FFT       149   118   46%   54%     23699   83%
Two 64-point FFT   253   145   49%   51%     29864   61%
MIMO BB kernel     506   213   42%   58%     44944   46%

(a) ρ̄ = 1 − ρ.
(b) The lower the better.
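The storage figures of Table 5.7 can be reproduced from P and L alone: the index memory holds P words of ⌈log2 L⌉ bits, and the dictionary holds L uncompressed VLIWs of Nu · 16 bits. The function names below are hypothetical, and the sketch assumes Nu = 12 and 16 bit CWs as in the reference design.

```python
def index_bits(l):
    """ceil(log2(l)) for l >= 1, computed exactly on integers."""
    return (l - 1).bit_length()

def dbcc_bits(p, l, n_u=12, cw_bits=16):
    """Index memory plus dictionary memory for plain DBCC."""
    return p * index_bits(l) + l * n_u * cw_bits

def compression_ratio(b_cmp, b_unc):
    """R_bit = b_cmp / b_unc (the lower the better)."""
    return b_cmp / b_unc

# (P, L, b_unc) for the benchmarks, taken from Tables 5.6 and 5.7
benchmarks = {
    "SISO BB proc.": (762, 433, 146304),
    "16 Tap FIR": (44, 13, 8448),
    "64-point FFT": (149, 118, 28608),
    "MIMO BB kernel": (506, 213, 97152),
}
for name, (p, l, b_unc) in benchmarks.items():
    b_cmp = dbcc_bits(p, l)
    print(name, b_cmp, round(100 * compression_ratio(b_cmp, b_unc)), "%")
```

The 64-point FFT compresses poorly because its dictionary is almost as long as the program itself (118 of 149 VLIWs are unique), so the index memory is close to pure overhead.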

DBCC with NOP bitmask. The dictionary memory is compressed in a second step. Figure 5.18 depicts the decompression hardware block diagram and illustrates the concept adopted for the dictionary compression step. Again, it is interesting to note that the interfaces to the program memory (see dashed contour) remain the same as without DBCC, enabling a modular design. The uncompressed dictionary is shown on top (label 1). Here, the three highlighted VLIWs can be condensed into one compressed dictionary memory entry.

The generated compressed dictionary, with the corresponding program memory containing indexes and mask bits, is shown in the middle of Figure 5.18 (label 2). The three highlighted VLIWs satisfy the test described in the next paragraph and have been mapped onto the single, condensed VLIW highlighted in the compressed dictionary memory. The reconstruction of the original VLIW requires an additional bitmask to be stored inside the index memory, together with the pointer to the corresponding compressed dictionary word. This bitmask consists of Nu bits and identifies whether each CW in the original VLIW is an effective instruction or a NOP. If the ith bit of the bitmask is '0', then the CW at position i ∈ {1, . . . , Nu} in the original VLIW was a NOP. Thus, the CW is masked out and replaced with a NOP CW in any case. Otherwise, if the ith bit of the bitmask is '1', the considered CW was an effective instruction, and it is not replaced. As shown at the bottom of Figure 5.18 (label 3), three appropriate bitmasks allow the original VLIWs to be reconstructed from the condensed one by setting the CWs of unused units to NOPs.
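The mask logic itself is a per-unit selection between the CW stored in the condensed VLIW and a NOP. The sketch below uses the condensed VLIW and the three bitmasks shown in Figure 5.18; the string representation of CWs and masks, the function name, and the convention that the leftmost mask bit belongs to the first CW are assumptions for illustration.

```python
NOP = "----"  # assumption: textual placeholder for a NOP control word

def apply_mask(condensed, mask):
    """Keep the CW where the mask bit is '1', force a NOP where it is '0'."""
    return [cw if bit == "1" else NOP for cw, bit in zip(condensed, mask)]

# Condensed VLIW and masks as they appear in Figure 5.18 (Nu = 12).
condensed = ["0038", NOP, NOP, NOP, "600E", "7800",
             NOP, NOP, NOP, NOP, "E1D5", "E1D5"]
masks = ["000011000011", "000000000011", "100000000000"]

for mask in masks:
    print(mask, "".join(apply_mask(condensed, mask)))
```

Each of the three masks recovers a different original VLIW from the same dictionary entry, so three uncompressed dictionary words are replaced by one condensed word plus 12 mask bits per program position.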


Figure 5.18: Decompressor hardware for dictionary-based code decompression with NOP bitmask. [Diagram: (1) uncompressed dictionary memory with three highlighted VLIWs; (2) compressed dictionary memory (L entries of Nu CWs) addressed by the index part of the program memory (P entries of index and mask); (3) mask logic in the sequencer that reconstructs the original VLIWs. Example from the figure: the condensed VLIW 0038 ---- ---- ---- 600E 7800 ---- ---- ---- ---- E1D5 E1D5 combined with the masks 000011000011, 000000000011, and 100000000000 yields the three original VLIWs.]


Two VLIWs can be condensed if they pass the following test. For each of the Nu unit slots, the two corresponding CWs of the two VLIWs are compared. The comparison has four possible outcomes:

1. If the CWs of both VLIWs are NOPs, then the chance of condensing the two VLIWs is still intact.

2. If the CWs of both VLIWs are effective instructions (i.e., non-NOPs), then these two CWs have to be equal for the chance of condensing the two VLIWs to remain intact.

3. If the CW of one VLIW is an effective instruction and that of the other a NOP CW, then the chance of condensing the two VLIWs is still intact.

4. If none of the above three tests is successful, then the two VLIWs cannot be condensed.

Once all Nu CW pairs are tested and it is found that the two VLIWs can be merged, the resulting condensed VLIW is stored inside the compressed dictionary memory.
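The four outcomes of the comparison collapse into a single per-slot condition: two CWs are compatible if at least one of them is a NOP or if they are equal. A minimal sketch of the test and of the merge, with VLIWs represented as lists of CW strings (an assumption, as are the function names):

```python
NOP = "----"  # assumption: textual placeholder for a NOP control word

def can_condense(a, b):
    """True if every unit slot holds a NOP in a or b, or equal CWs in both."""
    return all(x == NOP or y == NOP or x == y for x, y in zip(a, b))

def condense(a, b):
    """Merge two compatible VLIWs; returns None if they conflict."""
    if not can_condense(a, b):
        return None
    # Take the effective instruction wherever one of the two VLIWs has one.
    return [y if x == NOP else x for x, y in zip(a, b)]

v1 = [NOP, "600E", "7800", "E1D5"]
v2 = ["0038", NOP, NOP, NOP]
v3 = [NOP, "600F", NOP, NOP]

print(condense(v1, v2))   # merged VLIW with four effective CWs
print(condense(v1, v3))   # None: slot 2 holds two different instructions
```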

The algorithm used to generate the compressed dictionary is illustrated by the pseudo-code fragment in Algorithm 1. The uncompressed dictionary memory is considered as an L × Nu matrix M, and the resulting compressed dictionary as an L′ × Nu matrix N. The entries of both matrices are CWs of the datapath units (FUs and SUs) and the SEQ. The notation M(m, :) means that the entire mth row of M is accessed, and similarly, M(:, n) means that the entire nth column of M is accessed. At line 1, the sortcols(.) function permutes the columns of the uncompressed dictionary matrix M, such that the first column of the output matrix T1 contains the column of M with the most non-NOP CWs, and the last column of T1 the column of M with the fewest non-NOP CWs. The vector t1i stores the column-permutation indexes of T1. The sortrows(.) function at line 3, in turn, sorts the rows of its argument vector according to how often the CWs occur inside the original, uncompressed program. The most frequent CW is permuted to the first row of t2, whereas the least frequent CW becomes the last entry of t2. The function genpattern(.) (line 15) generates an appropriate pattern pat that is used to test whether the four above-mentioned conditions can be satisfied. The operator =∼ performs this test at once.

Table 5.8: DBCC with NOP masking for the benchmark programs.

Benchmark          P     L′    ρ     ρ̄ (a)   bcmp    Rbit (b)
SISO BB proc.      762   142   53%   47%     42504   29%
16 Tap FIR         44    8     33%   67%     2196    25%
64-point FFT       149   74    59%   41%     17039   59%
Two 64-point FFT   253   97    59%   41%     23431   48%
MIMO BB kernel     506   116   63%   37%     31886   33%

(a) ρ̄ = 1 − ρ.
(b) The lower the better.

After the completion of Algorithm 1, the compressed dictionary is available in the matrix N. The compressed dictionary pointers and the NOP bitmasks can be easily generated by parsing the VLIWs of the original uncompressed program M and comparing these VLIWs with the entries of the compressed dictionary memory. If the original VLIW can be reconstructed from the condensed VLIW by an appropriate bitmask, then the corresponding address of the condensed VLIW and the NOP bitmask are stored inside the index memory.
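A minimal Python sketch of this index generation follows. It assumes that `None` represents a NOP and that a mask bit of 1 means "execute the unit's CW" (the mask polarity is not stated here); `encode_vliw` and `decode_vliw` are hypothetical names, and the dictionary contents are made up.

```python
NOP = None  # stand-in for a NOP control word (assumption of this sketch)

def encode_vliw(vliw, dictionary):
    """Return (pointer, mask) such that masking dictionary[pointer] rebuilds vliw."""
    for pointer, entry in enumerate(dictionary):
        if all(cw is NOP or cw == ecw for cw, ecw in zip(vliw, entry)):
            mask = tuple(0 if cw is NOP else 1 for cw in vliw)
            return pointer, mask
    raise LookupError("no dictionary entry covers this VLIW")

def decode_vliw(pointer, mask, dictionary):
    """Rebuild the original VLIW: masked-out units become NOPs."""
    return tuple(ecw if bit else NOP for ecw, bit in zip(dictionary[pointer], mask))

# Hypothetical compressed dictionary and program.
dictionary = [("cmac", "add", "ld"), ("fft", NOP, "st")]
program = [("cmac", NOP, "ld"), (NOP, "add", NOP), ("fft", NOP, "st")]
index_memory = [encode_vliw(v, dictionary) for v in program]
```

Each index-memory word thus consists of a dictionary pointer and one mask bit per unit, which matches the word width discussed for the SEQ program memory.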

Finally, Table 5.8 shows the results obtained with the additional dictionary compression step. Compared to DBCC without the additional compaction step, the compression ratio is further decreased to R_bit = 39% on average. The decoding latency remains one clock cycle. The resulting decompression hardware overhead of 0.015 mm² is relatively small.

SEQ program memory configuration. The definitive SEQ's index and mask memory can contain P = 1024 pointers to the compressed dictionary memory. The compressed dictionary memory is selected to contain L′ = 256 condensed VLIWs, which, together with the support of N_u = 12 units, defines the width of the index and mask memory as ⌈log2 L′⌉ + N_u = 20 bits. With this program memory configuration, the total area of the SEQ amounts to 0.78 mm². Thus, compared to the reference ASPE, where the SEQ occupied an area of 1 mm², DBCC


Algorithm 1 Dictionary compression step.
In: Uncompressed dictionary M
Out: Compressed dictionary N
 1: [T1, t1i] = sortcols(M)
 2: for k = 1, 2, ..., N_u do
 3:   t2 = sortrows(T1(:, k))
 4:   for l = 1, 2, ..., length(t2) do
 5:     for m = 1, 2, ..., L do
 6:       cw = t2(l) // Get the CW to be processed
 7:       if cw == M(m, t1i(k)) then
 8:         if M(m, :) ∈ N then
 9:           // VLIW is already in compressed dictionary N
10:         else
11:           if size(N) == 0 then
12:             N(1, :) = M(m, :) // Init
13:           else
14:             for o = 1, ..., length(N) do
15:               pat = genpattern(N(o, :))
16:               if pat =∼ M(m, :) then
17:                 for p = 1, 2, ..., N_u do
18:                   if N(o, p) == M(m, p) then
19:                     // CWs are equal
20:                   else if (N(o, p) == NOP) and (M(m, p) != NOP) then
21:                     N(o, p) = M(m, p) // Modify compressed dict. entry
22:                   else if (N(o, p) != NOP) and (M(m, p) == NOP) then
23:                     // Compressed dict. CW will be masked out.
24:                   else
25:                     error()
26:                   end if
27:                 end for
28:                 break
29:               else
30:                 if o == length(N) then // No matching entry was found
31:                   N(o + 1, :) = M(m, :) // Append VLIW to end of compr. dict.
32:                 end if
33:               end if
34:             end for
35:           end if
36:         end if
37:       end if
38:     end for
39:   end for
40: end for


enabled an area saving of 22%.

In addition, this configuration supports all of the considered benchmark programs and allows running programs containing more than 762 VLIWs, namely up to 1024 VLIWs when the ratio L′/P ≤ 0.25. As a reference, the average L′/P ratio for the five considered benchmark programs amounts to 0.29, resulting in an average supported program length of 256/0.29 = 882 VLIWs.
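The memory dimensioning above can be reproduced with a few lines of Python. This is only an arithmetic sketch of the figures quoted in the text; the function names are hypothetical.

```python
def index_word_width(dict_entries, n_units):
    """Bits per index/mask word: ceil(log2(L')) pointer bits plus one mask bit per unit."""
    pointer_bits = (dict_entries - 1).bit_length()  # exact integer ceil(log2(L'))
    return pointer_bits + n_units

def supported_program_length(dict_entries, ratio, max_pointers):
    """Longest program that fits for a given dictionary-to-program ratio L'/P."""
    return min(max_pointers, int(dict_entries / ratio))

width = index_word_width(256, 12)                          # L' = 256, N_u = 12
best_case = supported_program_length(256, 0.25, 1024)      # L'/P <= 0.25
average_case = supported_program_length(256, 0.29, 1024)   # average benchmark ratio
```

With these inputs the sketch yields a 20-bit index word, 1024 VLIWs in the favorable case, and 882 VLIWs for the average benchmark ratio.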

5.5 Implementation Results

Task Schedule for ASPE A and ASPE B. Careful scheduling is required to efficiently share all FUs and SUs on both ASPEs. Tables 5.9 and 5.10 summarize the cycle counts of the tasks performed on ASPE A and ASPE B, respectively. With these results the schedule of Figure 5.19 is determined. The figure depicts how the tasks are scheduled among the two ASPE instances when running at a clock frequency of 250 MHz.

The hard computational kernels that have to be performed on ASPE A just after the detection of a MIMO-OFDM frame (e.g., LTF processing) almost fill one duty cycle, fully exploiting the available computational power. However, during the data processing part of the frame, ASPE A is less loaded. The situation is similar on ASPE B. Here, the computational load required for performing the matrix inversion fully claims the available processing power during the MIMO channel processing state. Later, while performing data processing, the computational load is reduced and the resources that were allocated for the highest-load period are only partially utilized.

At first sight, the suboptimal resource utilization during the major part of the frame suggests that the duty cycle should be extended to allow for a reduction of the hardware complexity. Unfortunately, such an extension of the duty cycle would also increase the receiver's latency, which is usually constrained (in the solution at hand, by the IEEE 802.11n standard), and is thus not a viable solution.
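The load per receiver state can be checked against the duty cycle with a short Python sketch. The cycle-count expressions are those of Tables 5.9 and 5.10; the values N_dc = 80 and N = 52 are assumptions inferred from the numeric column of those tables, and the 1000-cycle duty cycle is taken from Figure 5.19.

```python
F_CLK_MHZ = 250      # clock frequency
DUTY_CYCLE = 1000    # clock cycles per duty cycle (4 us at 250 MHz)
N_DC, N = 80, 52     # assumed values, inferred from the tables' numeric column

# Cycle counts of the tasks in each state.
aspe_a = {
    "frame start detection": [2 * N_DC + 35, 2 * N_DC + 35, 3 * N_DC + 20],
    "LTF processing": [N_DC + 10, 60, 3 * N_DC + 10, 85, 250, 250],
    "MIMO channel processing": [15, N_DC + 10, 4 * N + 10, 250],
    "data processing": [N_DC + 10, 250],
}
aspe_b = {
    "MIMO channel processing": [11 * N],
    "data processing": [4 * N],
}

def summarize(states):
    """Map each state to (cycles, time in us, fraction of the duty cycle used)."""
    return {
        name: (sum(c), sum(c) / F_CLK_MHZ, sum(c) / DUTY_CYCLE)
        for name, c in states.items()
    }

for label, states in (("ASPE A", aspe_a), ("ASPE B", aspe_b)):
    for name, (cycles, t_us, load) in summarize(states).items():
        print(f"{label} {name}: {cycles} cycles, {t_us:.2f} us, load {load:.0%}")
```

Under these assumptions, LTF processing on ASPE A is the state that comes closest to filling the duty cycle, whereas data processing leaves most of it unused on both ASPEs.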

Silicon realization. The system composed of two different ASPE configurations presented in this chapter is capable of implementing the BB processing tasks required for a 2×2 MIMO-OFDM receiver,


Table 5.9: Assembler cycle counts and processing times for 2×2 MIMO-OFDM processing on ASPE A running at 250 MHz.

State / Task                        Cycle counts         #     Time [µs]
Frame start detection
  correlation                       2N_dc + 35           195   0.78
  mean energy                       2N_dc + 35           195   0.78
  threshold det.                    3N_dc + 20           260   1.04
  TOTAL                             7N_dc + 90           650   2.60
Short preamble processing (init)
  coarse FOE                        60                   60    0.24
  coarse FOC                        N_dc + 10            90    0.36
  TOTAL                             N_dc + 70            150   0.60
Short preamble processing
  coarse FOC                        N_dc + 10            90    0.36
  LTF1 start detect                 6N_dc + 75           555   2.22
  align SU addresses                N_dc + 10            90    0.36
  TOTAL                             8N_dc + 95           735   2.94
LTF processing
  coarse FOC                        max. N_dc + 10       90    0.36
  fine FOE                          60                   60    0.24
  fine FOC LTF1 & LTF2              3N_dc + 10           250   1.00
  average LTF1                      85                   85    0.34
  FFT on LTF1                       250                  250   1.00
  FFT on LTF2                       250                  250   1.00
  TOTAL                             4N_dc + 665          985   3.94
MIMO channel processing
  φ_f + φ_c                         15                   15
  fine FOC                          N_dc + 10            90    0.36
  Channel est.                      4N + 10              218   0.87
  FFT on S1                         250                  250   1.00
  TOTAL                             N_dc + 4N + 285      573   2.29
Data processing
  fine FOC                          N_dc + 10            90    0.36
  FFT                               250                  250   1.00
  TOTAL                             N_dc + 260           340   1.36


Figure 5.19: Scheduling of all receiver tasks for the presented 2×2 MIMO-OFDM system. (Figure not reproduced. It shows the input, the tasks, and the output of ASPE A and ASPE B over time for the receiver states 1) frame detection, 2) STF processing, 3) LTF processing, 4) MIMO channel processing, and 5) data processing, with a duty cycle of 1000 clock cycles at 250 MHz, i.e., 4 µs.)


Table 5.10: Assembler cycle counts and processing times for 2×2 MIMO detection on ASPE B running at 250 MHz.

State / Task                  Cycle counts   #     Time [µs]
MIMO channel processing
  linear MMSE estimator G     11N            572   2.29
  TOTAL                       11N            572   2.29
Data processing
  Demapping                   4N             208   0.83
  TOTAL                       4N             208   0.83

as described in Section 3.5. Both ASPE designs were synthesized and placed in a 0.18 µm 1P/6M CMOS technology and fabricated on a multi-project wafer run. The achieved post-layout clock frequency of 250 MHz permits following the schedule of Figure 5.19, allowing the system to operate in real time. ASPE A occupies an area of 3.95 mm² and ASPE B an area of 3.7 mm². Thus, the total area required by the two programmable solutions together amounts to 7.65 mm², or approximately 792 kGE.⁵ Tables 5.11 and 5.12 summarize the post-layout key figures of the implemented designs, whereas Figure 5.20 shows the final floorplan of the two ASPEs with the main building blocks highlighted. The post-layout power consumption of each ASPE amounts to approximately 700 mW, and consequently the complete system with two ASPEs consumes around 1.4 W (see Appendix B).

Comparison with the 2×2 MIMO-OFDM receiver in [12]. To the best of my knowledge, the sole comparable solution described in the literature is [12] (cf. Section 2.3), where the silicon implementation of the ADRES architecture in a 90 nm TSMC process is described (with a particular focus on power efficiency). The ADRES core is responsible for performing the MIMO-OFDM BB processing tasks, comparable to those described in this thesis. The resulting SDR is reported to run at a clock frequency of 400 MHz, while occupying a total silicon

⁵ The area of one gate equivalent (GE) corresponds to the silicon area occupied by one low-drive 2-input NAND. On the 0.18 µm CMOS technology considered, this amounts to 9.67 µm².


Table 5.11: Post-layout results for ASPE A on a 0.18 µm 1P/6M CMOS technology.

Entity    Area [mm²]   Complexity [kGE]   Area share [%]
CMAC      0.54         56.418             13.7
CALU 0    0.32         32.713             8.1
CALU 1    0.31         31.881             7.8
SU 0      0.22         22.479             5.6
SU 1      0.22         22.562             5.6
SU 2      0.23         23.933             5.8
SU 3      0.21         21.963             5.3
SU 4      0.21         22.114             5.3
IBUF      0.17         18.095             4.3
OBUF      0.18         18.722             4.5
SEQ       0.78         80.862             20
other     0.56         56.641             14
ASPE A    3.95         408.388            100

Table 5.12: Post-layout results for ASPE B on a 0.18 µm 1P/6M CMOS technology.

Entity    Area [mm²]   Complexity [kGE]   Area share [%]
DIV       0.40         41.792             11
CMAC0     0.37         38.131             10
CMAC1     0.38         39.620             10
CALU      0.10         10.487             3
SU 0      0.22         22.623             6
SU 1      0.22         22.531             6
SU 2      0.23         23.801             6
SU 3      0.22         22.575             6
IBUF      0.10         10.976             3
OBUF      0.11         11.538             3
SEQ       0.78         80.459             21
other     0.57         58.576             15
ASPE B    3.70         383.109            100


(a) ASPE A for MIMO-OFDM processing.

(b) ASPE B for MIMO detection.

Figure 5.20: Floorplan of the two fabricated chips. The main building blocks are highlighted and the corresponding areas (in kGE) are reported. The non-labeled area is occupied by the D-Net, by the control logic of the SUs, and mainly by filler cells.


area of 5.79 mm². The power consumption amounts to approximately 220 mW.

Without memories, the ADRES core occupies approximately 45% of the total silicon area, or 2.6 mm², on the 90 nm TSMC process [12]. Scaling this area to the 0.18 µm CMOS reference technology, for a comparison with the solution presented here, leads to a scaled core area of 2.6 mm² · 4 = 10.4 mm². Scaling the frequency leads to 400 MHz · 1/2 = 200 MHz, and the power consumption (of the entire chip) scales to 220 mW · (0.9/0.5)² = 713 mW.

Although these figures are only rough estimates, they clearly show that the implementation presented here (7.65 mm² for both ASPEs, compared to a scaled ADRES core area of 10.4 mm²) is very competitive in terms of area efficiency.
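The scaling arithmetic can be restated as a small Python sketch. It assumes quadratic area scaling and linear frequency scaling with the feature size, and it treats the factor 0.9/0.5 from the text as a ratio that enters the power quadratically; the function names are hypothetical.

```python
def scale_area(area_mm2, from_nm, to_nm):
    """Area is assumed to scale with the square of the feature size."""
    return area_mm2 * (to_nm / from_nm) ** 2

def scale_frequency(f_mhz, from_nm, to_nm):
    """Clock frequency is assumed to scale inversely with the feature size."""
    return f_mhz * from_nm / to_nm

def scale_power(p_mw, ratio):
    """Power is assumed to scale with the square of the given ratio."""
    return p_mw * ratio ** 2

adres_area = scale_area(2.6, 90, 180)       # ADRES core area in mm^2 at 0.18 um
adres_freq = scale_frequency(400, 90, 180)  # MHz
adres_power = scale_power(220, 0.9 / 0.5)   # mW, entire chip
```

These rough estimates can then be set against the 7.65 mm², 250 MHz, and approximately 1.4 W reported for the two ASPEs.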

Chapter 6

Summary and Conclusions

Summary

As the domain of mobile wireless communications becomes increasingly populated with differing communication protocols, the importance of mobile software-defined radio (SDR) terminals grows. The high data rates prescribed, however, render an implementation on the limited processing resources of a flexible architecture extremely challenging. This trend is especially perceptible in the wireless local area network (WLAN) domain, where the data rates are already high compared, for instance, to those of mobile phone standards. Moreover, the tight power consumption constraint, necessary to ensure long battery operation times, makes this challenge even harder.

Nevertheless, the implementation of the 2×2 MIMO-OFDM baseband receiver algorithms on an SDR platform composed of two application-specific processors (ASIPs), as proposed in this thesis, indicates a viable way to overcome these tough constraints. To this end, the joint analysis of the computational complexity and the BER performance of MIMO detection algorithms was necessary to identify suitable, low-complexity algorithms (Chapter 3). The subsequent mapping of computationally hard OFDM baseband processing kernels onto three different software-programmable architectures led to the selection of the ASPE architecture [9] for the 2×2 MIMO-OFDM receiver implementation (Chapter 4). The ASPE architecture was further improved by three modifications: 1) the addition of dedicated input and output buffers simplified the data-stream handling; 2) the processor's control-path structure was restricted to support only a single, VLIW-fashioned program sequencer, reducing the control overhead and thus the critical timing path; 3) the program sequencer was enhanced to support dictionary-based code decompression, which further increased the ASPE's area efficiency. Next, the MIMO-OFDM receiver was split into two parts and mapped onto two properly configured ASPEs. The first, ASPE A, was prepared for frame-start detection, frequency offset estimation and compensation, and OFDM processing. The second, ASPE B, was responsible for the MIMO detection. The two ASPEs were placed and routed for a 0.18 µm CMOS technology, and the final post-layout clock frequency of 250 MHz permitted the receiver to operate in real time. The silicon area of both ASPEs together amounted to 7.65 mm², or approximately 792 kGE.¹ This area was then compared to that of a similar approach on the ADRES processor [12], demonstrating the competitiveness of the solution presented in this thesis (Chapter 5).

¹ The area of one gate equivalent (GE) corresponds to the silicon area occupied by one low-drive 2-input NAND. On the 0.18 µm CMOS technology considered, this amounts to 9.67 µm².

Conclusions

The implementation of the 2×2 MIMO-OFDM baseband receiver algorithms on the two ASPEs proved to be extremely challenging. On the one hand, appropriate algorithms had to be selected among the vast number of MIMO-OFDM receiver algorithms. To this end, extensive Monte Carlo simulations were run to assess the achievable (bit-true) BER performance, and the involved atomic operations were counted to derive each algorithm's computational complexity. On the other hand, the algorithms had to be implemented on the ASPE architecture. To this end, the ASPE's datapath first had to be configured with appropriate units, followed by the mapping of the assembly-coded, hand-scheduled algorithms onto it. Bit-true hardware description language (HDL) simulations were run afterwards to verify the implementation's correctness. Among these design-flow steps, the most time-consuming was the iterative scheduling and assembly coding of the various algorithms onto the ASPEs, combined with possible changes to the ASPE configuration.

Development tool support. A first general but important conclusion can be drawn from this observation. In the future, the support of a software toolkit is imperative for an efficient and rapid development of applications on the ASPE (and on ASIPs in general). Clearly, human interaction remains and is desired for taking important design decisions, but the time-consuming and repetitive tasks may well be automated and accelerated. The ideal software toolkit has to include a (bit-true) simulator and an appropriate compiler, both enabling faster iterations from the algorithm level to the architecture. The recent advances in ASIP development reinforce this conclusion, and indeed, such toolkits are starting to reach commercial maturity today (e.g., the LISA processor description language [93]).

Application-specific implementation. On the architectural level, it can be stated that the modified ASPE design framework has all the characteristics necessary to act as a platform for SDR baseband processing. As shown by the 2×2 MIMO-OFDM implementation, the design-time configurability made it possible to instantiate units matching the particular application domain. This is an important characteristic that makes it possible to attain just as much flexibility and processing performance as required by the application domain.

The functional units of both ASPEs perform complex-valued computations as required by most of the involved algorithms, and a word width of 16 bits proved to be sufficient for the 2×2 MIMO-OFDM receiver. The datapath of ASPE A was tailored to computationally intensive kernels such as correlations (required for frame-start detection), the CORDIC algorithm (for frequency offset estimation/compensation), and fast Fourier transforms (OFDM processing). ASPE B, in contrast, was configured for matrix manipulations such as matrix inversion and matrix-matrix and matrix-vector multiplications, all of which are computational kernels required for the (linear MMSE) MIMO detection. Hence, the partitioning of the receiver into two parts, respecting the granularity and the type of the involved computational kernels, was a key decision for an efficient implementation. This is a crucial difference to other related architectures such as Montium [27] or ADRES [12] (Chapter 2), which provide a 1D and a 2D array of almost equal functional units, respectively.

An interesting observation, highlighted by the partitioning of the receiver, is that the tasks involved in the MIMO-OFDM baseband processing can be split into the following elementary computational kernels:

• correlation and filtering (atomic operation: MAC),

• fast Fourier transform (radix-r butterfly),

• computation of the angle of a complex number (CORDIC iteration),

• matrix inversion, matrix-matrix and matrix-vector multiplication (MAC),

• interleaving, convolutional encoding, and other bitwise atomic operations,

• convolutional decoding (e.g., Viterbi decoder, or add-compare-select units), and

• generation of appropriate addressing modes to support the above-listed elementary kernels.

An efficient ASIP SDR platform thus requires units that respect and support the granularity of these elementary kernels.

DLP and ILP. The degree of data-level parallelism (DLP) inherent to many signal processing algorithms should be exploited to increase the processing performance while reducing the datapath control overhead. In this thesis, the DLP inherent to the two receive streams of the 2×2 MIMO-OFDM receiver was efficiently exploited by extending the ASPE's datapath to operate in a 2-way SIMD manner. The instruction-level parallelism (ILP) was exploited by controlling the ASPE's datapath through very long instruction words (VLIWs), which made it possible to efficiently map the above-mentioned elementary kernels onto the instantiated functional units. Finally, the implemented dictionary-based program-code decompression mechanism increased the otherwise low program-code density that is inherent to fixed-format VLIW architectures.

Final words and future work. To conclude, the 2×2 MIMO-OFDM receiver presented in this thesis has shown that ASIPs are suitable and efficient platforms for the baseband processing of SDRs. Nevertheless, it must be noted that some dedicated units (here, the Viterbi decoder) with little or no configurability at all may still be necessary to allow for real-time operation.

Although the proposed direction is especially promising, the way leading to an efficient, fully functional, multi-standard SDR is still long and difficult. Future work needs to consider and introduce low-power design techniques in order to reduce the power consumption, since the power consumption achieved by the system presented in this thesis is still too high for integration into a commercial mobile device. Further, future work needs to address the integration of the two ASPEs into a single system-on-chip (SoC), with the addition of a general-purpose processor responsible for controlling the two ASPEs and for handling the MAC layer protocol, which is a challenging engineering task. The extension of the ASPEs to support a second wireless standard is a necessary next step towards multi-standard systems. Finally, despite the extensive simulations, the implementation needs to be validated with real-life data. Here, a possible solution is to design a PCB for the two fabricated ASPEs and to embed this PCB into an appropriate testbed, for instance the MIMO-OFDM testbed of ETH Zurich [94, 95].

Appendix A

MIMO Detection Methods

Maximum likelihood (ML) detection [e.g., [7, 52]] maximizes the probability of a correct decision (i.e., that $\hat{s} = s$). The ML detector has to solve

\[ \hat{s} = \arg\min_{s \in \mathcal{A}^{M_T}} \| y - Hs \|^2. \tag{A.1} \]
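As a point of reference for the lower-complexity methods, (A.1) can be solved by exhaustive search. The following Python sketch assumes that H is given as a list of rows, that `alphabet` lists the constellation points, and that the search over all M^{M_T} candidates is affordable; `ml_detect` is a hypothetical name and the channel values are made up.

```python
from itertools import product

def ml_detect(H, y, alphabet):
    """Exhaustive ML detection: minimise ||y - H s||^2 over all s in alphabet^MT."""
    n_tx = len(H[0])

    def distance(s):
        return sum(
            abs(y_r - sum(h * x for h, x in zip(row, s))) ** 2
            for y_r, row in zip(y, H)
        )

    return min(product(alphabet, repeat=n_tx), key=distance)

# Hypothetical noiseless 2x2 example with a QPSK alphabet.
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
H = [[1 + 0j, 0.5j], [0.25 + 0j, 1 + 0j]]
s_sent = (1 + 1j, -1 - 1j)
y = [sum(h * x for h, x in zip(row, s_sent)) for row in H]
s_hat = ml_detect(H, y, qpsk)
```

The number of tested candidates grows as M^{M_T}, which is what motivates the tree-search and linear methods reviewed in this appendix.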

A.1 Sphere Decoding

Before deriving the CC for processing one visited tree node in the sphere decoding (SD) algorithm, the algorithm itself is briefly reviewed. The material presented in this review is mainly taken from [54].

Review. SD maps the problem of finding the solution of (A.1) onto an appropriate tree structure. To perform this mapping, the QR-decomposition $H = QR$ has to be computed during the preprocessing, leading to the $M_R \times M_T$ orthonormal matrix $Q$ and to the $M_T \times M_T$ right-triangular matrix $R$.¹ Then, during data processing, the received vector $y$ is left-multiplied by $Q^H$, leading to the modified input-output relation [cf. (A.6)] $\tilde{y} = Rs + \tilde{n}$, where $\tilde{y} = Q^H y$, and $\tilde{n} = Q^H n$ has the same statistics as $n$.

¹ QR-decomposition is detailed later, in Section 3.4.

The tree structure constructed for SD has $M_T$ tree levels. By convention, the tree root is at level $M_T$, whereas the leaves are at level 1. At each tree level $i = M_T, M_T - 1, \ldots, 1$ there are $N_{\mathrm{nodes}}(i) = M^{(M_T + 1 - i)}$ tree nodes. The path to each tree node at level $i$ represents one possible combination of constellation points $s^{(i)} = [s_i, s_{i+1}, \ldots, s_{M_T}]$ that could have been sent, up to the $i$th tree level.

Vector-symbol candidates of the MIMO alphabet $\mathcal{A}^{M_T}$ that violate the sphere constraint can be excluded from the search by pruning the corresponding tree branches. As a result, the CC is greatly reduced compared to brute-force ML. The sphere constraint is defined as $d(s) = \| \tilde{y} - Rs \|^2 < r^2$, with $s$ being the tested vector-symbol candidate. The radius $r$ constrains the search and influences the number of visited tree branches. Eventually, the ML solution of (A.1) is given by the vector-symbol candidate $\hat{s} = \arg\min_{s \,|\, d(s) < r^2} d(s)$, i.e., the vector-symbol candidate $s$ that leads to the smallest distance $d(s)$.

The distance $d(s)$ can be computed recursively while descending the tree levels $i = M_T, M_T - 1, \ldots, 1$. The partial distances are described by

\[ T_i(s^{(i)}) = T_{i+1}(s^{(i+1)}) + \Big\| \tilde{y}_i - \sum_{j=i}^{M_T} r_{ij} s_j \Big\|^2, \tag{A.2} \]

where $s_j$ is the $j$th element of the tested vector-symbol candidate $s^{(i)}$, $r_{ij}$ is the element in the $i$th row and $j$th column of $R$, and $T_{M_T+1} = 0$. Finally, once a tree leaf is reached, the distance for the corresponding symbol is obtained by $d(s) = T_1$.

Implementation. The efficient implementation of the SD algorithm relies upon a strategy that leads to the ML solution while pruning as many tree branches as possible (without losing the ML solution). One efficient implementation traverses the tree in a depth-first search. Herein, the hardest computations that have to be performed at level $i$ and for each visited tree node are

\begin{align}
b_{i+1}(s^{(i+1)}) &= \tilde{y}_i - \sum_{j=i+1}^{M_T} r_{ij} s_j, \tag{A.3} \\
D_i(s_k) &= \| b_{i+1}(s^{(i+1)}) - r_{ii} s_k \|^2, \quad k = 1, 2, \ldots, M, \tag{A.4} \\
T_i &= T_{i+1} + \min_k D_i(s_k), \quad k = 1, 2, \ldots, M. \tag{A.5}
\end{align}

First, (A.3) is computed. This requires 1 complex-valued ADD and, in the worst case, $M_T - 1$ complex-valued MACs. Then, all distances to the $M$ nodes $s_k$ that can be reached from the current node at level $i + 1$ are computed in (A.4), where $M$ is the size of the QAM alphabet. This step requires $M$ additions and $2M$ complex-valued multiplications. Finally, the minimum over the previously computed distances $D_i(s_k)$ is taken (A.5) in order to find the child node of $s^{(i+1)}$ leading to the smallest partial Euclidean distance $T_i$. The tree search proceeds downwards from the node $s_k$ whose distance leads to exactly this minimum $T_i$.

The CC for the preprocessing and for each visited tree node during the data processing is summarized in Table A.1. To obtain the overall CC for the entire tree search, the per-node results of the data processing have to be multiplied by the average number of visited tree nodes $N_{\mathrm{av}}$.
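The depth-first search built on (A.3) to (A.5) can be sketched in Python as follows. The sketch assumes an upper-triangular R given as a list of rows and zero-based indices; visiting children in order of increasing distance and shrinking the radius whenever a leaf is reached is one common variant and an assumption here, not necessarily the implementation of [54]. `sphere_decode` is a hypothetical name.

```python
def sphere_decode(R, y, alphabet, radius_sq=float("inf")):
    """Depth-first search for the s minimising ||y - R s||^2, R upper triangular."""
    n = len(R)
    best = {"dist": radius_sq, "s": None}
    s = [None] * n

    def descend(i, partial):
        # Interference-reduced observation at level i, cf. (A.3).
        b = y[i] - sum(R[i][j] * s[j] for j in range(i + 1, n))
        # Distance increments of all children, cf. (A.4), smallest first.
        children = sorted(
            ((abs(b - R[i][i] * a) ** 2, a) for a in alphabet),
            key=lambda child: child[0],
        )
        for increment, symbol in children:
            total = partial + increment  # partial distance, cf. (A.2)
            if total >= best["dist"]:
                break  # sphere constraint violated for this and all later children
            s[i] = symbol
            if i == 0:
                best["dist"], best["s"] = total, tuple(s)  # leaf: shrink the radius
            else:
                descend(i - 1, total)

    descend(n - 1, 0.0)
    return best["s"], best["dist"]

# Hypothetical noiseless example with a QPSK alphabet.
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
R = [[2 + 0j, 0.5 + 0.5j], [0j, 1.5 + 0j]]
s_sent = (1 - 1j, -1 + 1j)
y = [R[0][0] * s_sent[0] + R[0][1] * s_sent[1], R[1][1] * s_sent[1]]
s_hat, dist = sphere_decode(R, y, qpsk)
```

If no candidate lies inside the initial radius, the function returns `None` as the solution, in which case the search would have to be restarted with a larger radius.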

A.2 K-Best

The K-best algorithm is a breadth-first search algorithm that operates on the same tree structure as the SD algorithm. The K-best algorithm employed here and a corresponding ASIC implementation are presented in [59]. In the following, the K-best algorithm is only briefly described. The aim is to highlight the differences to the SD algorithm and to compute its CC.

Two preprocessing methods are considered in [59]. With QR-decomposition the resulting problem is complex-valued, whereas with real-valued decomposition (RVD) and subsequent QR-decomposition, the problem becomes purely real-valued. With RVD each tree level has $\sqrt{M}$ nodes and the tree depth doubles to $2M_T$ (without RVD, in contrast, each of the $M_T$ tree levels has $M$ nodes). The BER


Table A.1: CC for SD.

Preprocessing
Step                   CMAC                                       ANGLE        Total
QR-Decomposition (a)   (13/2 + 2 M_R) M_R M_T + 3/2 M_R M_T^2     2 M_R M_T    T_C (b)
Total C-Ops.           (13/2 + 2 M_R) M_R M_T + 3/2 M_R M_T^2     2 M_R M_T    T_C
Total R-Ops.           2 (13 + 4 M_R) M_R M_T + 6 M_R M_T^2       2 M_R M_T    T_R (c)

Detection
Step           CMAC               CADD      Total
(A.3)          M_T − 1            1         M_T
(A.4)          2M                 M         3M
(A.5)          0                  M + 1     M + 1
Total C-Ops.   2M + M_T − 1       2M + 2    4M + M_T + 1
Total R-Ops.   8M + 4 M_T − 4     4M + 4    12M + 4 M_T

(a) As described in Section A.4.5, (A.25).
(b) T_C = (17/2 + 2 M_R) M_R M_T + 3/2 M_R M_T^2.
(c) T_R = 4 (7 + 2 M_R) M_R M_T + 6 M_R M_T^2.

performance of the RVD problem is slightly better than when operating on the QR-decomposed problem.

In contrast to SD, at each tree level $i$, only the nodes that lead to the $K$ smallest distances are further considered and expanded. Once the lowest tree level is reached, the vector-symbol $\hat{s}$ among the $K$ evaluated ones leading to the smallest distance $d(s)$ is declared as the received vector-symbol. The solution is not necessarily the ML solution.
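The breadth-first principle can be sketched in Python as follows. For brevity the sketch operates on the complex-valued QR-decomposed problem (without RVD) and on an upper-triangular R given as a list of rows; it is not the implementation of [59], and `k_best` is a hypothetical name.

```python
def k_best(R, y, alphabet, k):
    """Breadth-first tree search that keeps only the k best partial paths per level."""
    n = len(R)
    # Each survivor is (partial distance, symbols of the levels processed so far).
    survivors = [(0.0, ())]
    for i in range(n - 1, -1, -1):
        candidates = []
        for dist, tail in survivors:
            # tail holds (s_{i+1}, ..., s_{n-1}); remove its interference from y_i.
            b = y[i] - sum(R[i][i + 1 + t] * sym for t, sym in enumerate(tail))
            for a in alphabet:
                candidates.append((dist + abs(b - R[i][i] * a) ** 2, (a,) + tail))
        candidates.sort(key=lambda c: c[0])  # ascending partial distance
        survivors = candidates[:k]           # keep the k smallest
    best_dist, best_s = survivors[0]
    return best_s, best_dist

# Hypothetical noiseless example with a QPSK alphabet.
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
R = [[2 + 0j, 0.5 + 0.5j], [0j, 1.5 + 0j]]
s_sent = (1 - 1j, -1 + 1j)
y = [R[0][0] * s_sent[0] + R[0][1] * s_sent[1], R[1][1] * s_sent[1]]
s_hat, dist = k_best(R, y, qpsk, k=2)
```

Because paths outside the K best are discarded at every level, the effort per level is fixed, at the price that the ML path may be dropped when noise is present.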

For the CC, the case with RVD during the preprocessing is considered, and thus all atomic operations are real-valued. However, with QR-decomposition the results lead to a similar complexity. The KB algorithm is sketched in pseudo-code in Algorithm 2, and the corresponding estimated operation count is given by

\begin{align*}
N_{\mathrm{MULT}} &= \sum_{i=1}^{2M_T} \left[ \sqrt{M} + K \left( 2M_T - i + \sqrt{M} \right) \right] = \left( K (2\sqrt{M} - 1) + 2\sqrt{M} \right) M_T + 2 K M_T^2, \\
N_{\mathrm{ADD}} &= 2 M_T \sqrt{M} \, (1 + 2K).
\end{align*}

Table A.2 summarizes these findings (for $K < \sqrt{M}$).
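The closed form of the multiplication count can be cross-checked numerically against the level-by-level sum. The short Python sketch below assumes a square QAM alphabet, so that √M is an integer; the function names are hypothetical.

```python
from math import isqrt

def kb_mults_by_sum(m_t, m, k):
    """Multiplications summed level by level over the 2*M_T tree levels."""
    root = isqrt(m)  # sqrt(M), assumed to be an integer
    return sum(root + k * (2 * m_t - i + root) for i in range(1, 2 * m_t + 1))

def kb_mults_closed_form(m_t, m, k):
    """Closed-form expression for the same count."""
    root = isqrt(m)
    return (k * (2 * root - 1) + 2 * root) * m_t + 2 * k * m_t * m_t

example = kb_mults_closed_form(2, 16, 3)  # M_T = 2, 16-QAM, K = 3
```

Both functions agree for all tested parameter combinations, which confirms that the closed form follows from the per-level counts annotated in Algorithm 2.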


Algorithm 2 KB.
In: R, ỹ
Out: ŝ
for i = 2M_T, 2M_T − 1, ..., 1 do
  for n = 1, ..., √M do
    d_i(s_n) = r_ii s_n − ỹ_i // Mult: 1, Add: 1
  end for
  for k = 1, ..., K do
    b_{i+1}(s^(i+1)(k)) = Σ_{j=i+1}^{2M_T} r_ij s_j(k) // Mult: 2M_T − i
    for n = 1, ..., √M do
      D(k, n) = T_{i+1}(k) + ‖b_{i+1}(s^(i+1)(k)) + d_i(s_n)‖² // Mult: 1, Add: 2
    end for
  end for
  T_i[1:K] = sort(D)[1:K] // sort in ascending order and take the K smallest distances
  Store the K candidate vector-symbols s^(i)(k) that lead to T_i[1:K]
end for
ŝ = min(s^(1)[1:K])

Table A.2: CC for KB decoding.

Preprocessing
Step                   CMAC                                       ANGLE        Total
QR-Decomposition (a)   (13/2 + 2 M_R) M_R M_T + 3/2 M_R M_T^2     2 M_R M_T    T_pp,C (b)
Total C-Ops.           (13/2 + 2 M_R) M_R M_T + 3/2 M_R M_T^2     2 M_R M_T    T_pp,C
Total R-Ops.           2 (13 + 4 M_R) M_R M_T + 6 M_R M_T^2       2 M_R M_T    T_pp,R (c)

Symbol-Vector Detection
Step           MAC                                        ADD                     Total
KB             (K (2√M − 1) + 2√M) M_T + 2K M_T^2         2 M_T √M (1 + 2K)       T_dp,C (d)
Total C-Ops.   (K (2√M − 1) + 2√M) M_T + 2K M_T^2         2 M_T √M (1 + 2K)       T_dp,C
Total R-Ops.   (K (2√M − 1) + 2√M) M_T + 2K M_T^2         2 M_T √M (1 + 2K)       T_dp,R (e)

(a) As described in Section A.4.5, (A.25). RVD is neglected.
(b) T_pp,C = (17/2 + 2 M_R) M_R M_T + 3/2 M_R M_T^2.
(c) T_pp,R = 4 (7 + 2 M_R) M_R M_T + 6 M_R M_T^2.
(d) T_dp,C = (K (6√M − 1) + 4√M) M_T + 2K M_T^2.
(e) T_dp,R = (K (6√M − 1) + 4√M) M_T + 2K M_T^2.


Algorithm 3 SIC.
In: R, ỹ
Out: ŝ
for i = M_T, ..., 1 do
  ỹ_i = ỹ_i − Σ_{j=i+1}^{M_T} r_{i,j} ŝ_j // Mult: M_T − i, Add: 1
  ŝ_i = Q(ỹ_i, r_{i,i}) // Mult: ⌊√M⌋, Comp: log2 M
end for

A.3 Successive Interference Cancellation

The successive interference cancellation (SIC) algorithm does not achieve ML performance. As for SD and KB, performing SIC on the received vector-symbol first requires the QR-decomposition of the channel matrix $H$, which transforms the problem into $\tilde{y} = Rs + \tilde{n}$. Then, SIC solves the detection problem by back-substitution. After each back-substitution step, the entry of the obtained solution is sliced and mapped to the nearest constellation point in the alphabet $\mathcal{A}$. SIC is summarized in Algorithm 3 and requires the following number of operations:

\begin{align*}
N_{\mathrm{MULT}} &= \sum_{i=1}^{M_T} \left( M_T - i + \lfloor \sqrt{M} \rfloor \right) = \left( -1/2 + \sqrt{M} \right) M_T + M_T^2 / 2, \\
N_{\mathrm{ADD}} &= (1 + \log_2 M) \, M_T.
\end{align*}

Table A.3 summarizes the findings.
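A minimal Python sketch of SIC on the QR-decomposed problem is given below. It assumes an upper-triangular R given as a list of rows and implements the slicer Q(·) as an exhaustive nearest-point search over the alphabet, which is simpler but more costly than the comparison-based slicing counted above; `sic_detect` is a hypothetical name.

```python
def sic_detect(R, y, alphabet):
    """Back-substitution with slicing after every step (R upper triangular)."""
    n = len(R)
    s = [None] * n
    for i in range(n - 1, -1, -1):
        # Cancel the interference of the symbols that are already decided.
        residual = y[i] - sum(R[i][j] * s[j] for j in range(i + 1, n))
        # Slice: pick the constellation point a that minimises |residual - r_ii * a|.
        s[i] = min(alphabet, key=lambda a: abs(residual - R[i][i] * a))
    return tuple(s)

# Hypothetical noiseless example with a QPSK alphabet.
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
R = [[2 + 0j, 0.5 + 0.5j], [0j, 1.5 + 0j]]
s_sent = (1 - 1j, -1 + 1j)
y = [R[0][0] * s_sent[0] + R[0][1] * s_sent[1], R[1][1] * s_sent[1]]
s_hat = sic_detect(R, y, qpsk)
```

Since every decision is final, a slicing error at one level propagates to all following levels, which is why SIC does not reach ML performance.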

A.4 Linear Detection – Matrix Decomposition and Inversion Methods

The MIMO input-output relation and the steps necessary to reconstruct the transmitted symbol-vector $s$ with linear MMSE detection


Table A.3: CC for SIC.

Preprocessing
Step                   CMAC                                       ANGLE        Total
QR-Decomposition (a)   (13/2 + 2 M_R) M_R M_T + 3/2 M_R M_T^2     2 M_R M_T    T_pp,C (b)
Total C-Ops.           (13/2 + 2 M_R) M_R M_T + 3/2 M_R M_T^2     2 M_R M_T    T_pp,C
Total R-Ops.           2 (13 + 4 M_R) M_R M_T + 6 M_R M_T^2       2 M_R M_T    T_pp,R (c)

Symbol-Vector Detection
Step           CMAC                               CADD                   Total
SIC            (−1/2 + √M) M_T + M_T^2/2          (1 + log2 M) M_T       T_dp,C (d)
Total C-Ops.   (−1/2 + √M) M_T + M_T^2/2          (1 + log2 M) M_T       T_dp,C
Total R-Ops.   (−1/2 + √M) 4 M_T + 2 M_T^2        (1 + log2 M) 2 M_T     T_dp,R (e)

(a) As described in Section A.4.5, (A.25). RVD is neglected.
(b) T_pp,C = (17/2 + 2 M_R) M_R M_T + 3/2 M_R M_T^2.
(c) T_pp,R = 4 (7 + 2 M_R) M_R M_T + 6 M_R M_T^2.
(d) T_dp,C = (1/2 + √M + log2 M) M_T + M_T^2/2.
(e) T_dp,R = 2 (2√M + log2 M) M_T + 2 M_T^2.

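are recapitulated in (A.6)-(A.9) below.

\begin{align}
y &= Hs + n \in \mathbb{C}^{M_R} \tag{A.6} \\
F &= \left( H^H H + M_T \sigma^2 I \right) \in \mathbb{C}^{M_T \times M_T} \tag{A.7} \\
G &= F^{-1} H^H \in \mathbb{C}^{M_T \times M_R} \tag{A.8} \\
\hat{y} &= G y \in \mathbb{C}^{M_T} \tag{A.9}
\end{align}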

At the receiver, the (perfect) channel estimate $H \in \mathbb{C}^{M_R \times M_T}$ as well as the received vector $y \in \mathbb{C}^{M_R}$ are known. The noise $n$ is unknown, whereas we assume its variance $\sigma^2$ is known. We note that the matrix $F$ has to be inverted for obtaining $G$.

For hard detection, the vector $\hat{y}$ is further processed to obtain the estimated transmitted symbol $\hat{s} = Q(\hat{y})$ through slicing. The CC of slicing is not considered in this evaluation, since this step is common to all methods and would not change the overall rankings.

The decomposition and inversion methods to obtain $F^{-1}$, which are evaluated in the following, are mainly taken from [66, 51]. Their CC is derived by counting and weighting the number of operations necessary to detect one received symbol vector $y$. For this evaluation, which targets a software-programmable platform, the considered operators are ADD, MUL, and DIV (plus, in certain cases, additional operators such as SQRT or ANGLE, see Section 3.2).

For the decomposition methods that construct an upper triangular matrix $R$, the detection of $\hat{y}$ through back-substitution (BS) is considered as well. BS is applied to linear equation systems $Rb = a$, with $R \in \mathbb{C}^{M \times M}$, $a \in \mathbb{C}^M$, and $b \in \mathbb{C}^M$ being the unknown. BS consists in solving the $M$th equation of the system, i.e., $b_M = a_M / r_{M,M}$, substituting the result into the $(M-1)$th equation, and then repeating the procedure until $b$ is obtained. Otherwise, when no upper triangular matrix is constructed, only detection by matrix multiplication (MM) is considered. MM consists in computing (A.9) directly. The notation MM and BS has been taken from [51].
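Back-substitution for an upper triangular system Rb = a can be sketched in a few lines of Python. The sketch assumes R is given as a list of rows with non-zero diagonal entries and uses zero-based indices; `back_substitute` is a hypothetical name.

```python
def back_substitute(R, a):
    """Solve R b = a for b, with R upper triangular and a non-zero diagonal."""
    n = len(R)
    b = [0j] * n
    for i in range(n - 1, -1, -1):
        # Subtract the contributions of the unknowns that are already solved.
        acc = a[i] - sum(R[i][j] * b[j] for j in range(i + 1, n))
        b[i] = acc / R[i][i]
    return b

# Small real-valued example: 2*b0 + b1 = 4 and 4*b1 = 8.
solution = back_substitute([[2, 1], [0, 4]], [4, 8])
```

Each unknown costs one division and as many MACs as there are unknowns already solved, which is the origin of the quadratic terms in the detection complexity.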

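A.4.1 Adjoint method

The classical way of writing and presenting the inverse of an $M \times M$ matrix $F$ employs the adjoint method

\[ F^{-1} = \frac{\operatorname{adj}(F)}{\det(F)}. \tag{A.10} \]

When $M = 3$, (A.10) becomes slightly more demanding than the case $M = 2$ presented in Section 3.4:

\[ F^{-1} = \begin{bmatrix} f_{11} & f_{12} & f_{13} \\ f_{12}^* & f_{22} & f_{23} \\ f_{13}^* & f_{23}^* & f_{33} \end{bmatrix}^{-1} = \frac{1}{D} \begin{bmatrix} \mathit{fi}_{11} & \mathit{fi}_{12} & \mathit{fi}_{13} \\ \mathit{fi}_{12}^* & \mathit{fi}_{22} & \mathit{fi}_{23} \\ \mathit{fi}_{13}^* & \mathit{fi}_{23}^* & \mathit{fi}_{33} \end{bmatrix}, \tag{A.11} \]

where the determinant of $F$ is $D = f_{11}(f_{33} f_{22} - f_{23}^* f_{23}) - f_{12}^*(f_{33} f_{12} - f_{23}^* f_{13}) + f_{13}^*(f_{23} f_{12} - f_{22} f_{13})$ and the entries $\mathit{fi}_{kl}$ are given by the cofactors of $F$:

\begin{align*}
\mathit{fi}_{11} &= f_{33} f_{22} - f_{23}^* f_{23} \\
\mathit{fi}_{12} &= -(f_{33} f_{12} - f_{23}^* f_{13}) \\
\mathit{fi}_{13} &= f_{23} f_{12} - f_{22} f_{13} \\
\mathit{fi}_{22} &= f_{33} f_{11} - f_{13}^* f_{13} \\
\mathit{fi}_{23} &= -(f_{23} f_{11} - f_{12}^* f_{13}) \\
\mathit{fi}_{33} &= f_{22} f_{11} - f_{12}^* f_{12}.
\end{align*}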


Algorithm 4 Implementation of 2×2 HPD matrix inversion with the adjoint method.
r1 ← a c          CMUL
r2 ← r1 − b b*    CMAC
r3 ← r2^(−1)      DIV
x ← r3 c          CMUL
y ← −r3 b         CMUL
z ← r3 a          CMUL

F^(−1) = [ x  y ; y*  z ]

In total, (A.11) requires roughly 21 complex-valued MAC operations, or, translated into real-valued operations, 84 real-valued MAC operations. Thus, the CC of (A.11) is roughly 21 when counting complex-valued atomic operations and 84 when counting real-valued ones. However, the high numerical precision and dynamic range required to compute $1/D$ render the use of (A.10) impractical for matrices with $M > 2$.

Finally, Algorithm 4 describes the 2×2 Hermitian and positive-definite (HPD) matrix inversion as it is implemented in this thesis (on ASPE B).
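The steps of the 2×2 inversion can be written as a short Python sketch. It assumes the labeling F = [[a, b], [b*, c]] with real, positive a and c (the labeling is defined in Section 3.4 and is an assumption here), and it takes z = r3·a so that F·F⁻¹ equals the identity; `invert_hpd_2x2` is a hypothetical name.

```python
def invert_hpd_2x2(a, b, c):
    """Invert F = [[a, b], [conj(b), c]] with a, c real and a*c > |b|^2."""
    r1 = a * c
    r2 = r1 - (b * b.conjugate()).real  # determinant of F, real for Hermitian F
    r3 = 1.0 / r2                       # the single division
    x = r3 * c
    y = -r3 * b
    z = r3 * a
    return [[x, y], [y.conjugate(), z]]

# Hypothetical example matrix.
a, b, c = 2.0, 0.5 + 0.5j, 3.0
F = [[a, b], [b.conjugate(), c]]
F_inv = invert_hpd_2x2(a, b, c)
# Product F * F_inv, which should be close to the identity matrix.
product = [
    [sum(F[i][k] * F_inv[k][j] for k in range(2)) for j in range(2)]
    for i in range(2)
]
```

Only one division is needed, and the lower-left entry follows from the upper-right one by conjugation, so it does not have to be computed separately.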

A.4.2 LR-decomposition

The LR-decomposition (or LU-decomposition) in Algorithm 5 is stable for strictly diagonally dominant matrices, which is the case for the matrix $F$ in the low-SNR regime. The steps leading to $\hat{y}$ are:

\begin{align}
[L, R] &= \mathrm{LRdecomp}(F), \tag{A.12} \\
A &= \mathrm{BS}(L, H^H) = L^{-1} H^H \in \mathbb{C}^{M_T \times M_R}. \tag{A.13}
\end{align}

• For BS-based detection we proceed as follows:

\begin{align}
m &= A y \in \mathbb{C}^{M_T}, \tag{A.14} \\
\hat{y} &= \mathrm{BS}(R, m) = R^{-1} m \in \mathbb{C}^{M_T}. \tag{A.15}
\end{align}

The corresponding CC is reported in Table A.4.


Algorithm 5 [L, R] = LRdecomp(F)
In: F
Out: LR = F
  for i = 0, 1, ..., M_T − 1 do
    for j = i, i + 1, ..., M_T − 1 do
      r_{i,j} = f_{i,j} − Σ_{k=0}^{i−1} r_{k,j} · l_{i,k}                // Mult: i(M_T − i), Add: M_T − i
      l_{j,i} = (f_{j,i} − Σ_{k=0}^{i−1} r_{k,i} · l_{j,k}) / r_{i,i}    // Mult: (i + 1)(M_T − i − 1), Div: 1, Add: M_T − i − 1
    end for
  end for

• For MM-based detection the matrix $G$ is constructed by two subsequent BSs:

\[ G = \mathrm{BS}(R, A) = R^{-1}A \in \mathbb{C}^{M_T \times M_R}, \tag{A.16} \]
\[ \hat{y} = Gy \in \mathbb{C}^{M_T}. \]

The corresponding CC is reported in Table A.5.

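The number of operations required to complete Algorithm 5 is estimated as follows:

\[ N_{MULT} = \sum_{i=0}^{M_T-1} \bigl[ i(M_T - i) + (i+1)(M_T - i - 1) \bigr] = \tfrac{1}{3} M_T (M_T^2 - 1), \]
\[ N_{DIV} = M_T, \]
\[ N_{ADD} = \sum_{i=1}^{M_T-1} \bigl[ (M_T - i) + (M_T - i - 1) \bigr] = 1 - 2M_T + M_T^2. \]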
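To make the loop structure of the LR-decomposition concrete, the sketch below transcribes it into plain Python with zero-based indices. The function name `lr_decomp` and the example matrix are assumptions chosen for illustration (the matrix is strictly diagonally dominant, so no pivoting is needed).

```python
def lr_decomp(F):
    """Return (L, R) with L unit lower triangular, R upper triangular, L R = F."""
    M = len(F)
    L = [[0.0] * M for _ in range(M)]
    R = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i, M):
            # Row i of R: uses the rows of R and entries of L computed earlier.
            R[i][j] = F[i][j] - sum(R[k][j] * L[i][k] for k in range(i))
            # Column i of L: r_{i,i} is already available because j starts at i.
            L[j][i] = (F[j][i] - sum(R[k][i] * L[j][k] for k in range(i))) / R[i][i]
    return L, R


# Hypothetical strictly diagonally dominant example.
F = [[4.0, 1.0, 2.0],
     [1.0, 5.0, 1.0],
     [2.0, 1.0, 6.0]]
L, R = lr_decomp(F)
LR = [[sum(L[i][k] * R[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
```

The product `LR` reproduces `F` up to rounding, and the diagonal of `L` equals one, which is the property used when A.13 is solved by substitution.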

A.4.3 LDL-decomposition

Three versions of the LDL-decomposition are reported. Algorithm 6 is the implementation reported in [66]; Algorithm 7 and Algorithm 8 are modified versions with slightly lower CC.


Table A.4: LR-decomposition's CC, BS-based detection.

Preprocessing
Step | CMAC | DIV | CADD | Total
(A.7) | $M_T M_R (M_T+1)/2$ | 0 | $M_T$ | $(1 + M_R/2)M_T + M_R M_T^2/2$
(A.12) | $(1/3) M_T (M_T^2 - 1)$ | $M_T$ | $1 - 2M_T + M_T^2$ | $1 - (4/3)M_T + M_T^2 + M_T^3/3$
(A.13) | $M_T M_R (M_T-1)/2$ | 0 | $M_R(M_T-1)$ | $-M_R + M_R M_T/2 + M_R M_T^2/2$
Total C-Ops. | $T_1$ (a) | $M_T$ | $T_2$ (b) | $1 - M_R - M_T/3 + M_R M_T + (1 + M_R)M_T^2 + M_T^3/3$
Total R-Ops. | $4T_1$ | $M_T$ | $2T_2$ | $T_R$ (c)

Symbol-Vector Detection
Step | CMAC | DIV | CADD | Total
(A.14) | $M_T M_R$ | 0 | 0 | $M_T M_R$
(A.15) | $M_T (M_T+1)/2$ | 0 (d) | $M_T - 1$ | $-1 + (3/2)M_T + M_T^2/2$
Total C-Ops. | $(1/2 + M_R)M_T + M_T^2/2$ | 0 | $M_T - 1$ | $-1 + (3/2 + M_R)M_T + M_T^2/2$
Total R-Ops. | $(1 + 2M_R)2M_T + 2M_T^2$ | 0 | $2M_T - 2$ | $-2 + 4(1 + M_R)M_T + 2M_T^2$

(a) $T_1 = -M_T/3 + M_R M_T^2 + M_T^3/3$
(b) $T_2 = 1 - M_R + (M_R - 1)M_T + M_T^2$
(c) $T_R = 2 - 2M_R + (-7/3 + 2M_R)M_T + (2 + 4M_R)M_T^2 + (4/3)M_T^3$
(d) Division is avoided: $1/r_{i,i}$ is computed during preprocessing.


Table A.5: LR-decomposition's CC, MM-based detection.

Preprocessing
Step | CMAC | DIV | CADD | Total
(A.7) | $M_T M_R (M_T+1)/2$ | 0 | $M_T$ | $(1 + M_R/2)M_T + M_R M_T^2/2$
(A.12) | $(1/3) M_T (M_T^2 - 1)$ | $M_T$ | $1 - 2M_T + M_T^2$ | $1 - (4/3)M_T + M_T^2 + M_T^3/3$
(A.13) | $M_T M_R (M_T-1)/2$ | 0 | $M_R(M_T-1)$ | $-M_R + M_R M_T/2 + M_R M_T^2/2$
(A.16) | $M_T M_R (M_T+1)/2$ | 0 | $M_R(M_T-1)$ | $-M_R + (3/2)M_R M_T + M_R M_T^2/2$
Total C-Ops. | $T_1$ (a) | $M_T$ | $T_2$ (b) | $T_C$ (c)
Total R-Ops. | $4T_1$ | $M_T$ | $2T_2$ | $T_R$ (d)

Symbol-Vector Detection
Step | CMAC | DIV | CADD | Total
(A.9) | $M_T M_R$ | 0 | 0 | $M_T M_R$
Total C-Ops. | $M_T M_R$ | 0 | 0 | $M_R M_T$
Total R-Ops. | $4M_T M_R$ | 0 | 0 | $4M_R M_T$

(a) $T_1 = (-1/3 + M_R/2)M_T + (3/2)M_R M_T^2 + M_T^3/3$
(b) $T_2 = 1 - 2M_R + (-1 + 2M_R)M_T + M_T^2$
(c) $T_C = 1 - 2M_R + (-1/3 + (5/2)M_R)M_T + (1 + (3/2)M_R)M_T^2 + M_T^3/3$
(d) $T_R = 2 - 4M_R + (-7/3 + 6M_R)M_T + (2 + 6M_R)M_T^2 + (4/3)M_T^3$


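To detect $\hat{y} = Gy$ using the LDL-decomposition, with $LDL^{H} = F$ and observing that $F^{-1} = L^{-H}D^{-1}L^{-1}$, the computations are:

\[ F = (H^{H}H + M_T\sigma^2 I) \in \mathbb{C}^{M_T \times M_T}, \]
\[ [L, D] = \mathrm{LDLdecomp}(F), \tag{A.17} \]
\[ A = \mathrm{BS}(L, H^{H}) = L^{-1}H^{H} \in \mathbb{C}^{M_T \times M_R}, \]
\[ R = DL^{H} \in \mathbb{C}^{M_T \times M_T}. \tag{A.18} \]

• For BS-based detection, we proceed as follows:

\[ m = Ay \in \mathbb{C}^{M_T}, \]
\[ \hat{y} = \mathrm{BS}(R, m) = R^{-1}m \in \mathbb{C}^{M_T}, \]

and the corresponding CC is reported in Table A.6.

• For MM-based detection, the matrix $G$ is constructed by a second BS, before performing MM:

\[ G = \mathrm{BS}(R, A) = R^{-1}A \in \mathbb{C}^{M_T \times M_R}, \]
\[ \hat{y} = Gy \in \mathbb{C}^{M_T}. \]

The corresponding CC is reported in Table A.7.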

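The number of operations for Algorithm 6 is estimated as:

\[ N_{MULT} = \sum_{i=1}^{M_T} \Bigl[ \sum_{j=1}^{i-1} 1 + (i - 1) + \sum_{j=i+1}^{M_T} i \Bigr] = -7M_T/6 + M_T^2 + M_T^3/6, \]
\[ N_{DIV} = M_T, \]
\[ N_{ADD} = -M_T/2 + M_T^2/2. \]

The estimated number of operations to complete Algorithm 7 is:

\[ N_{MULT} = \sum_{i=1}^{M_T} \Bigl[ (i - 1) + \sum_{j=1}^{i-1} \bigl( (j - 1) + 1 \bigr) \Bigr] = -2M_T/3 + M_T^2/2 + M_T^3/6, \]
\[ N_{DIV} = M_T, \]
\[ N_{ADD} = -M_T/2 + M_T^2/2. \]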


Algorithm 6 [L, D] = LDLdecomp(F), Golub-version [66].
In: F
Out: LDL^H = F
  for i = 1, ..., M_T do
    for j = 1, ..., i − 1 do
      v_j = l^H_{i,j} d_j                                          // Mult: 1
    end for
    v_i = f_{i,i} − Σ_{k=1}^{i−1} v_k · l_{i,k}                     // Mult: i − 1, Add: 1
    d_i = v_i
    r_i = 1/d_i                                                    // Div: 1
    for j = i + 1, ..., M_T do
      l_{j,i} = (f_{j,i} − Σ_{m=1}^{i−1} v_m · l_{j,m}) · r_i       // Mult: i − 1 + 1, Add: 1
    end for
  end for

Algorithm 7 [L, D] = LDLdecomp(F). Version with M_T divisions.
In: F
Out: LDL^H = F
  for i = 1, ..., M_T do
    for j = 1, ..., i − 1 do
      v_j = f_{i,j} − Σ_{k=1}^{j−1} v_k · l^H_{j,k}                 // Mult: j − 1, Add: 1
      l_{i,j} = v_j · r_j                                          // Mult: 1
    end for
    d_i = f_{i,i} − Σ_{k=1}^{i−1} v_k · l^H_{i,k}                   // Mult: i − 1, Add: 1
    r_i = 1/d_i                                                    // Div: 1
  end for

Algorithm 8 [L, D] = LDLdecomp(F). Version with more than M_T divisions.
In: F
Out: LDL^H = F
  for i = 1, ..., M_T do
    for j = 1, ..., i − 1 do
      v_j = f_{i,j} − Σ_{k=1}^{j−1} v_k · l^H_{j,k}                 // Mult: j − 1, Add: 1
      l_{i,j} = v_j / d_j                                          // Div: 1
    end for
    d_i = f_{i,i} − Σ_{k=1}^{i−1} v_k · l^H_{i,k}                   // Mult: i − 1, Add: 1
  end for

The number of operations to complete Algorithm 8 is:

\[ N_{MULT} = \sum_{i=1}^{M_T} \Bigl[ (i - 1) + \sum_{j=1}^{i-1} (j - 1) \Bigr] = -M_T/6 + M_T^3/6, \]
\[ N_{DIV} = \sum_{i=1}^{M_T} \sum_{j=1}^{i-1} 1 = -M_T/2 + M_T^2/2, \]
\[ N_{ADD} = -M_T/2 + M_T^2/2. \]
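The LDL variant that stores the reciprocals $r_i = 1/d_i$ (the version with $M_T$ divisions) can be sketched in Python as follows. The function name `ldl_decomp`, the zero-based indexing, and the example matrix are assumptions made for illustration only.

```python
def ldl_decomp(F):
    """Return (L, d) with L unit lower triangular and F = L diag(d) L^H."""
    M = len(F)
    L = [[1.0 if row == col else 0.0 for col in range(M)] for row in range(M)]
    d = [0.0] * M
    r = [0.0] * M                        # stored reciprocals 1/d_i
    for i in range(M):
        v = [0.0] * i                    # v_j = l_{i,j} d_j for the current row
        for j in range(i):
            v[j] = F[i][j] - sum(v[k] * L[j][k].conjugate() for k in range(j))
            L[i][j] = v[j] * r[j]        # multiplication instead of division
        d[i] = (F[i][i] - sum(v[k] * L[i][k].conjugate() for k in range(i))).real
        r[i] = 1.0 / d[i]                # one division per row
    return L, d


# Hypothetical Hermitian positive-definite example.
F = [[4, 1 + 1j, 0],
     [1 - 1j, 3, 1j],
     [0, -1j, 2]]
L, d = ldl_decomp(F)
F_rebuilt = [[sum(L[i][k] * d[k] * L[j][k].conjugate() for k in range(3))
              for j in range(3)] for i in range(3)]
```

For this example the diagonal factors are $d = (4, 2.5, 1.6)$, whose product equals $\det(F) = 16$.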

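A.4.4 GS-decomposition

To perform Gram-Schmidt based QR-decomposition, we start by observing that with $\underline{H} = [H^{H}\ \ \sqrt{M_T}\sigma I_{M_T}]^{H} \in \mathbb{C}^{(M_R+M_T) \times M_T}$ we obtain

\[ \underline{G} = (\underline{H}^{H}\underline{H})^{-1}\underline{H}^{H} = (H^{H}H + M_T\sigma^2 I_{M_T})^{-1}[H^{H}\ \ \sqrt{M_T}\sigma I_{M_T}] = [G\ \ \tilde{H}], \tag{A.19} \]

where $\tilde{H} = \sqrt{M_T}\sigma F^{-1}$.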


Table A.6: LDL-decomposition's CC, BS-based.

Preprocessing
Step | CMAC | DIV | CADD | Total
(A.7) | $M_T M_R (M_T+1)/2$ | 0 | $M_T$ | $(1 + M_R/2)M_T + M_R M_T^2/2$
Algorithm 6 | $-7M_T/6 + M_T^2 + M_T^3/6$ | $M_T$ | $-M_T/2 + M_T^2/2$ | $-2M_T/3 + 3M_T^2/2 + M_T^3/6$
Algorithm 7 | $-2M_T/3 + M_T^2/2 + M_T^3/6$ | $M_T$ | $-M_T/2 + M_T^2/2$ | $-M_T/6 + M_T^2 + M_T^3/6$
Algorithm 8 | $-M_T/6 + M_T^3/6$ | $-M_T/2 + M_T^2/2$ | $-M_T/2 + M_T^2/2$ | $-7M_T/6 + M_T^2 + M_T^3/6$
(A.13) | $M_T M_R (M_T-1)/2$ | 0 | $M_R(M_T-1)$ | $-M_R + M_R M_T/2 + M_R M_T^2/2$
(A.18) | $M_T (M_T+1)/2$ | 0 | 0 | $M_T (M_T+1)/2$
Total C-Ops. (a) | $T_1$ (b) | $M_T$ | $T_2$ (c) | $T_C$ (d)
Total R-Ops. | $4T_1$ | $M_T$ | $2T_2$ | $T_R$ (e)

Symbol-Vector Detection
Step | CMAC | DIV | CADD | Total
(A.14) | $M_T M_R$ | 0 | 0 | $M_T M_R$
(A.15) | $M_T (M_T+1)/2$ | 0 (f) | $M_T - 1$ | $-1 + (3/2)M_T + M_T^2/2$
Total C-Ops. | $(1/2 + M_R)M_T + M_T^2/2$ | 0 | $M_T - 1$ | $-1 + (3/2 + M_R)M_T + M_T^2/2$
Total R-Ops. | $4(1/2 + M_R)M_T + 2M_T^2$ | 0 | $2M_T - 2$ | $-2 + 4(1 + M_R)M_T + 2M_T^2$

(a) For the Total, Algorithm 7 has been taken into account, since it has a lower CC.
(b) $T_1 = -M_T/6 + (1 + M_R)M_T^2 + M_T^3/6$
(c) $T_2 = -M_R + (1/2 + M_R)M_T + M_T^2/2$
(d) $T_C = -M_R + (4/3 + M_R)M_T + (3/2 + M_R)M_T^2 + M_T^3/6$
(e) $T_R = -2M_R + (4/3 + 2M_R)M_T + (5 + 4M_R)M_T^2 + 2M_T^3/3$
(f) Division is avoided if $1/r_{i,i}$ can be stored during preprocessing.


Table A.7: LDL-decomposition's CC, MM-based detection.

Preprocessing
Step | CMAC | DIV | CADD | Total
(A.7) | $M_T M_R (M_T+1)/2$ | 0 | $M_T$ | $(1 + M_R/2)M_T + M_R M_T^2/2$
Algorithm 6 | $-7M_T/6 + M_T^2 + M_T^3/6$ | $M_T$ | $-M_T/2 + M_T^2/2$ | $-2M_T/3 + 3M_T^2/2 + M_T^3/6$
Algorithm 7 | $-2M_T/3 + M_T^2/2 + M_T^3/6$ | $M_T$ | $-M_T/2 + M_T^2/2$ | $-M_T/6 + M_T^2 + M_T^3/6$
Algorithm 8 | $-M_T/6 + M_T^3/6$ | $-M_T/2 + M_T^2/2$ | $-M_T/2 + M_T^2/2$ | $-7M_T/6 + M_T^2 + M_T^3/6$
(A.13) | $M_T M_R (M_T-1)/2$ | 0 | $M_R(M_T-1)$ | $-M_R + M_R M_T/2 + M_R M_T^2/2$
(A.18) | $M_T (M_T+1)/2$ | 0 | 0 | $M_T (M_T+1)/2$
(A.16) | $M_T M_R (M_T+1)/2$ | 0 (a) | $M_R(M_T-1)$ | $-M_R + (3/2)M_R M_T + M_R M_T^2/2$
Total C-Ops. (b) | $T_1$ (c) | $M_T$ | $T_2$ (d) | $T_C$ (e)
Total R-Ops. | $4T_1$ | $M_T$ | $2T_2$ | $T_R$ (f)

Symbol-Vector Detection
Step | CMAC | DIV | CADD | Total
(A.9) | $M_T M_R$ | 0 | 0 | $M_T M_R$
Total C-Ops. | $M_R M_T$ | 0 | 0 | $M_R M_T$
Total R-Ops. | $4M_R M_T$ | 0 | 0 | $4M_R M_T$

(a) Division is avoided if $1/r_{i,i}$ can be stored during preprocessing.
(b) For the Total, Algorithm 7 has been taken into account, since it has a lower CC.
(c) $T_1 = (-1/6 + M_R/2)M_T + (1 + (3/2)M_R)M_T^2 + M_T^3/6$
(d) $T_2 = -2M_R + (1/2 + 2M_R)M_T + M_T^2/2$
(e) $T_C = -2M_R + (4/3 + 5M_R/2)M_T + (3/2)(1 + M_R)M_T^2 + M_T^3/6$
(f) $T_R = -4M_R + (4/3 + 6M_R)M_T + (5 + 6M_R)M_T^2 + (2/3)M_T^3$


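Now we can write

\[ \hat{y} = \underline{G}\,\underline{y} = [G\ \ \tilde{H}] \begin{pmatrix} y \\ 0 \end{pmatrix}. \tag{A.20} \]

By taking the GS-decomposition of $\underline{H}$

\[ [\underline{Q}, R] = \mathrm{GSdecomp}(\underline{H}) \tag{A.21} \]

we obtain the matrix $\underline{Q} \in \mathbb{C}^{(M_R+M_T) \times M_T}$, which has orthonormal columns, and the upper triangular matrix $R \in \mathbb{C}^{M_T \times M_T}$.

Substituting $\underline{H}$ with $\underline{Q}R$ in (A.19) leads to:

\[ \underline{G} = \bigl( (\underline{Q}R)^{H}(\underline{Q}R) \bigr)^{-1} (\underline{Q}R)^{H} = (R^{H}\underline{Q}^{H}\underline{Q}R)^{-1}(R^{H}\underline{Q}^{H}) = (R^{H}R)^{-1}(R^{H}\underline{Q}^{H}) = R^{-1}\underline{Q}^{H}. \]

Thus, according to (A.20), we have $\hat{y} = R^{-1}\underline{Q}^{H}\underline{y}$. Since the last $M_T$ entries of $\underline{y}$ are all zero and with

\[ \underline{Q} = \begin{pmatrix} Q \\ \tilde{Q} \end{pmatrix}, \]

the detection problem simplifies to $\hat{y} = R^{-1}Q^{H}y$. Depending on the choice of the detection method we proceed as follows.

• For BS-based detection:

\[ m = Q^{H}y \in \mathbb{C}^{M_T}, \tag{A.22} \]
\[ \hat{y} = \mathrm{BS}(R, m) = R^{-1}m \in \mathbb{C}^{M_T}. \tag{A.23} \]

Table A.8 summarizes the CC of detecting $\hat{y}$ by following the procedure described above.

• Whereas for MM-based detection we compute:

\[ G = \mathrm{BS}(R, Q^{H}) = R^{-1}Q^{H} \in \mathbb{C}^{M_T \times M_R}, \tag{A.24} \]
\[ \hat{y} = Gy \in \mathbb{C}^{M_T}. \]

Table A.9 summarizes the corresponding CC.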


Table A.8: GS-decomposition's CC, BS-based.

Preprocessing
Step | CMAC | DIV/SQRT | CADD | Total
(A.21) | $T_1$ (a) | $2M_T$ | $T_2$ (b) | $T_C$ (c)
Total C-Ops. | $T_1$ | $2M_T$ | $T_2$ | $T_C$
Total R-Ops. | $4T_1$ | $2M_T$ | $2T_2$ | $T_R$ (d)

Symbol-Vector Detection
Step | CMAC | DIV | CADD | Total
(A.22) | $M_R M_T$ | 0 | 0 | $M_R M_T$
(A.23) | $M_T (M_T+1)/2$ | 0 (e) | $M_T - 1$ | $-1 + (3/2)M_T + M_T^2/2$
Total C-Ops. | $(1/2 + M_R)M_T + M_T^2/2$ | 0 | $M_T - 1$ | $-1 + (3/2 + M_R)M_T + M_T^2/2$
Total R-Ops. | $4(1/2 + M_R)M_T + 2M_T^2$ | 0 | $2M_T - 2$ | $-2 + (2 + 4(1/2 + M_R))M_T + 2M_T^2$

(a) $T_1 = (7/6 + M_R)M_T + (1/2 + M_R)M_T^2 + M_T^3/3$
(b) $T_2 = (-1 - 3M_R)M_T/6 + M_R M_T^2/2 + M_T^3/6$
(c) $T_C = (6 + M_R)M_T/2 + (1 + 3M_R)M_T^2/2 + M_T^3/2$
(d) $T_R = (19/3 + 3M_R)M_T + (2 + 5M_R)M_T^2 + (5/3)M_T^3$
(e) Division is avoided since $1/r_{i,i}$ is computed during preprocessing and stored.


Table A.9: GS-decomposition's CC, MM-based.

Preprocessing
Step | CMAC | DIV/SQRT | CADD | Total
(A.21) | $T_1$ (a) | $2M_T$ | $T_2$ (b) | $T_3$ (c)
(A.24) | $M_R M_T (M_T+1)/2$ | 0 (d) | $M_T - 1$ | $-1 + (2 + M_R)M_T/2 + M_R M_T^2/2$
Total C-Ops. | $T_4$ (e) | $2M_T$ | $T_5$ (f) | $T_C$ (g)
Total R-Ops. | $4T_4$ | $2M_T$ | $2T_5$ | $T_R$ (h)

Symbol-Vector Detection
Step | CMAC | DIV | CADD | Total
(A.9) | $M_R M_T$ | 0 | 0 | $M_R M_T$
Total C-Ops. | $M_R M_T$ | 0 | 0 | $M_R M_T$
Total R-Ops. | $4M_R M_T$ | 0 | 0 | $4M_R M_T$

(a) $T_1 = (7/6 + M_R)M_T + (1/2 + M_R)M_T^2 + M_T^3/3$
(b) $T_2 = (-1 - 3M_R)M_T/6 + M_R M_T^2/2 + M_T^3/6$
(c) $T_3 = (6 + M_R)M_T/2 + (1 + 3M_R)M_T^2/2 + M_T^3/2$
(d) Division is avoided since $1/r_{i,i}$ is computed during preprocessing and stored.
(e) $T_4 = (7 + 9M_R)M_T/6 + (1 + 3M_R)M_T^2/2 + M_T^3/3$
(f) $T_5 = -1 + (5 - 3M_R)M_T/6 + M_R M_T^2/2 + M_T^3/6$
(g) $T_C = -1 + (4 + M_R)M_T + (1/2 + 2M_R)M_T^2 + M_T^3/2$
(h) $T_R = -2 + (25/3 + 5M_R)M_T + (2 + 7M_R)M_T^2 + (5/3)M_T^3$


Algorithm 9 [Q, R] = GSdecomp(H)
In: Q = H (augmented channel matrix), R = 0_{M_T×M_T}.
Out: QR = H.
  for i = 1, 2, ..., M_T do
    r_{i,i} = sqrt(q_i^H q_i)            // Mult: M_R + i, Sqrt: 1
    q_i = q_i / r_{i,i}                  // Mult: M_R + i, Div: 1
    for k = i + 1, ..., M_T do
      r_{i,k} = q_i^H q_k                // Mult: M_R + i − 1
      q_k = q_k − r_{i,k} q_i            // Mult: M_R + i, Add: M_R + i
    end for
  end for

The number of operations for completing Algorithm 9 is:

\[ N_{MULT} = \sum_{i=1}^{M_T} \bigl[ 2(M_R + i) + (M_T - i)(2M_R + 2i - 1) \bigr] = (7/6 + M_R)M_T + (1/2 + M_R)M_T^2 + M_T^3/3, \]
\[ N_{SQRT} = M_T, \]
\[ N_{DIV} = M_T, \]
\[ N_{ADD} = \sum_{i=1}^{M_T} (M_T - i)(M_R + i) = -M_T/6 - M_R M_T/2 + M_R M_T^2/2 + M_T^3/6. \]
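The closed-form counts can be cross-checked by summing the per-iteration costs annotated in Algorithm 9 (Mult: $2(M_R+i)$ per outer iteration plus $2M_R+2i-1$ per inner iteration; Add: $M_R+i$ per inner iteration). The following Python sketch does this with exact rational arithmetic; the function names are assumptions introduced for this check only.

```python
from fractions import Fraction


def gs_mult_sum(MT, MR):
    """Multiplications of Algorithm 9, summed iteration by iteration."""
    return sum(2 * (MR + i) + (MT - i) * (2 * MR + 2 * i - 1)
               for i in range(1, MT + 1))


def gs_mult_closed(MT, MR):
    """Closed form (7/6 + MR) MT + (1/2 + MR) MT^2 + MT^3 / 3."""
    return (Fraction(7, 6) + MR) * MT + (Fraction(1, 2) + MR) * MT**2 + Fraction(MT**3, 3)


def gs_add_sum(MT, MR):
    """Additions of Algorithm 9, summed iteration by iteration."""
    return sum((MT - i) * (MR + i) for i in range(1, MT + 1))


def gs_add_closed(MT, MR):
    """Closed form -MT/6 - MR MT/2 + MR MT^2/2 + MT^3/6."""
    return (-Fraction(MT, 6) - Fraction(MR * MT, 2)
            + Fraction(MR * MT**2, 2) + Fraction(MT**3, 6))


# Compare both expressions on a small grid of antenna configurations.
checks = [(MT, MR,
           gs_mult_sum(MT, MR) == gs_mult_closed(MT, MR),
           gs_add_sum(MT, MR) == gs_add_closed(MT, MR))
          for MT in range(1, 7) for MR in range(1, 7)]
```

For instance, $M_T = M_R = 2$ gives 19 multiplications with either expression.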

We note that the square root operator (SQRT) is required. For computing the CC we set the weight of the SQRT operator to an optimistic value of $w_{SQRT} = 1$. With this choice, we obtain a lower bound of the GS-decomposition's CC, as defined in Section 3.2.

A.4.5 QR-decomposition

In order to detect $\hat{y} = Gy$ using the classical Givens-rotations based QR-decomposition, we proceed in a way analogous to the GS-decomposition. With $\underline{H} = [H^{H}\ \ \sqrt{M_T}\sigma I_{M_T}]^{H}$ and (A.19), we obtain (A.20). By taking the QR-decomposition $\underline{H} = \underline{Q}\,\underline{R}$ we obtain a unitary matrix $\underline{Q} \in \mathbb{C}^{(M_R+M_T) \times (M_R+M_T)}$ and an upper triangular matrix $\underline{R} = [R^{H}\ \ 0]^{H} \in \mathbb{C}^{(M_T+M_R) \times M_T}$. Since the last $M_R$ rows of $\underline{R}$ are all zero, and with the optimization in [51], we can reduce the number of required operations by modifying the QR-decomposition to compute only $\bar{Q} = [Q^{H}\ \ \tilde{Q}^{H}]^{H} \in \mathbb{C}^{(M_R+M_T) \times M_T}$, such that $\bar{Q}R = \underline{H}$. Substituting the decomposed matrix $\underline{H}$ with $\bar{Q}$ and $R$ in (A.19) we obtain:

\[ \underline{G} = \bigl( (\bar{Q}R)^{H}(\bar{Q}R) \bigr)^{-1} (\bar{Q}R)^{H} = (R^{H}\bar{Q}^{H}\bar{Q}R)^{-1}(R^{H}\bar{Q}^{H}) = (R^{H}R)^{-1}(R^{H}\bar{Q}^{H}) = R^{-1}\bar{Q}^{H}. \]

This leads to the solution of the detection problem by computing $\hat{y} = R^{-1}\bar{Q}^{H}\underline{y} = R^{-1}Q^{H}y$. The optimized QR-decomposition is described in Algorithm 10 and delivers

\[ [Q, R] = \mathrm{QRdecomp}(H, \sigma). \tag{A.25} \]

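The so-called Givens rotations required by the QR-decomposition algorithm are performed with the aid of two unitary matrices $E_m(\alpha)$ and $U_{m,n}(\beta)$, defined as follows:

\[ E_m(\alpha) = \begin{pmatrix} 1 & 0 & \cdots & & 0 \\ 0 & \ddots & & & \vdots \\ \vdots & & e^{j\alpha} & & \vdots \\ & & & \ddots & 0 \\ 0 & \cdots & & 0 & 1 \end{pmatrix}, \]

where $e^{j\alpha}$ is located in row $m$ and column $m$, and

\[ U_{m,n}(\beta) = \begin{pmatrix} 1 & & & & & \\ & \ddots & & & & \\ & & \cos\beta & \sin\beta & & \\ & & -\sin\beta & \cos\beta & & \\ & & & & \ddots & \\ & & & & & 1 \end{pmatrix}, \]

where $\cos\beta$ and $\sin\beta$ are located in row $m$ (columns $m$ and $n$), $-\sin\beta$ and $\cos\beta$ in row $n$ (columns $m$ and $n$), and all remaining entries are those of the identity matrix.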

According to the detection method we proceed as follows.

• For BS-based detection:

\[ m = Q^{H}y \in \mathbb{C}^{M_T}, \]
\[ \hat{y} = \mathrm{BS}(R, m) = R^{-1}m \in \mathbb{C}^{M_T}. \]

Table A.10 summarizes the CC of detecting $\hat{y}$ by the procedure described above.²

• Whereas for MM-based detection the steps are:

\[ G = \mathrm{BS}(R, Q^{H}) = R^{-1}Q^{H} \in \mathbb{C}^{M_T \times M_R}, \]
\[ \hat{y} = Gy \in \mathbb{C}^{M_T}. \]

The number of operations to complete Algorithm 10 is:

\[ N_{MULT} = \sum_{n=1}^{M_T} \sum_{m=1}^{M_R} \bigl[ 2(M_T - n + 2) + (M_T - n) + 4m + 2 \bigr] \]
\[ \phantom{N_{MULT}} = (5/2)M_R M_T + (3/2)M_R M_T^2 + 2M_R(1 + M_R)M_T + 2M_R M_T = (13/2 + 2M_R)M_R M_T + (3/2)M_R M_T^2, \]
\[ N_{DIV} = 0, \]
\[ N_{ADD} = 0, \]
\[ N_{ANGLE} = 2M_R M_T. \]

² No division is required during symbol processing if the slicing operation Q(.) takes the matrix $R$ into account.


Algorithm 10 [Q, R] = QRdecomp(H, σ)
In: H, σ
Out: QR = H
  R = [H^H  √(M_T) σ I_{M_T}]^H
  Q = [I_{M_R}  0_{M_R×M_T}]^H
  for n = 1, 2, ..., M_T do
    for m = 1, 2, ..., M_R do
      p = M_R + n − m
      q = p − 1
      ϕ = atan(Im(r_{q,n}) / Re(r_{q,n}))     // Angle: 1
      R = E_q(ϕ) R                            // Mult: 1
      Q^H = E_q(ϕ) Q^H                        // Mult: 1
      ϑ = atan(Re(r_{p,n}) / Re(r_{q,n}))     // Angle: 1
      R = U_{p,q}(ϑ) R                        // Mult: 2(M_T − n + 2) + (M_T − n)
      Q^H = U_{p,q}(ϑ) Q^H                    // Mult: 2 · 2m
    end for
  end for
  Q = Q_{M_R×M_T}
  R = R_{M_T×M_T}
Note: With infinite precision both r_{p,n} and r_{q,n} are real-valued in ϑ = atan(Re(r_{p,n}) / Re(r_{q,n})).


Algorithm 11 F^(−1) = Rank1Update(H, 1/(M_T σ²))
In: H, 1/(M_T σ²)
Out: F^(−1)
  P^(0) = 1/(M_T σ²) I
  for i = 1, 2, ..., M_R do
    m = P^(i−1) h_i^H                                   // Mult: M_T²
    s = 1 + h_i m                                       // Mult: M_T, Add: 1
    s_e = ⌊log2(s)⌋, s_m = 2^(s_e) / s                  // Div: 1
    m̃ = s_m m                                           // Mult: M_T
    P^(i) = P^(i−1) − m̃ m^H 2^(−s_e)                    // Mult: M_T², Add: M_T²
  end for
  F^(−1) = P^(M_R)

The number of operations that require the arc tangent function [atan()] is $N_{ANGLE}$. For the computation of the QR-decomposition's CC we assume that an execution unit delivering the result in one clock cycle is available and thus $w_{ANGLE} = 1$. However, we recall that if another decomposition method has a similar CC but requires fewer atomic operators, we prefer that other method.

A.4.6 Rank-1 update method

The Rank-1 update method inverts the matrix $F$ directly:

\[ F^{-1} = \mathrm{Rank1Update}(H, 1/(M_T\sigma^2)) \in \mathbb{C}^{M_T \times M_T}, \tag{A.26} \]
\[ G = F^{-1}H^{H} \in \mathbb{C}^{M_T \times M_R}, \]
\[ \hat{y} = Gy \in \mathbb{C}^{M_T}. \]

The associated CC is summarized in Table A.12, and the Rank-1 update algorithm is described by Algorithm 11 [63].
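The recursion behind the Rank-1 update is the Sherman-Morrison identity applied once per row $h_i$ of $H$. The sketch below is a simplified floating-point illustration: it divides by $s$ directly and does not reproduce the exponent/mantissa split ($s_e$, $s_m$) of Algorithm 11. The function name and the example values are assumptions.

```python
def rank1_update_inverse(H, c):
    """Return (H^H H + (1/c) I)^-1 using one rank-1 update per row of H.

    c corresponds to 1/(M_T * sigma^2); H is a list of M_R rows of length M_T.
    """
    MT = len(H[0])
    P = [[c if row == col else 0.0 for col in range(MT)] for row in range(MT)]
    for h in H:
        # m = P h^H (column vector), s = 1 + h m (real for Hermitian P)
        m = [sum(P[row][k] * h[k].conjugate() for k in range(MT)) for row in range(MT)]
        s = 1.0 + sum(h[k] * m[k] for k in range(MT)).real
        for row in range(MT):
            for col in range(MT):
                P[row][col] -= m[row] * m[col].conjugate() / s
    return P


# Hypothetical 2 x 2 channel with M_T * sigma^2 = 0.5, i.e. c = 2.
H = [[1 + 1j, 2], [0.5, -1j]]
c = 2.0
P = rank1_update_inverse(H, c)
F = [[sum(H[i][row].conjugate() * H[i][col] for i in range(2))
      + (1.0 / c if row == col else 0.0) for col in range(2)] for row in range(2)]
FP = [[sum(F[row][k] * P[k][col] for k in range(2)) for col in range(2)] for row in range(2)]
```

Here `FP` evaluates to the identity matrix up to rounding, confirming that `P` equals $F^{-1} = (H^{H}H + M_T\sigma^2 I)^{-1}$ for this example.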


Table A.10: QR-decomposition's CC, BS-based.

Preprocessing
Step | CMAC | ANGLE | CADD | Total
(A.25) | $(13/2 + 2M_R)M_R M_T + (3/2)M_R M_T^2$ | $2M_R M_T$ | 0 | $(17/2 + 2M_R)M_R M_T + (3/2)M_R M_T^2$
Total C-Ops. | $(13/2 + 2M_R)M_R M_T + (3/2)M_R M_T^2$ | $2M_R M_T$ | 0 | $(17/2 + 2M_R)M_R M_T + (3/2)M_R M_T^2$
Total R-Ops. | $2(13 + 4M_R)M_R M_T + 6M_R M_T^2$ | $2M_R M_T$ | 0 | $4(7 + 2M_R)M_R M_T + 6M_R M_T^2$

Symbol-Vector Detection
Step | CMAC | DIV | CADD | Total
(A.22) | $M_R M_T$ | 0 | 0 | $M_R M_T$
(A.23) | $M_T (M_T+1)/2$ | $M_T$ (a) | $M_T - 1$ | $-1 + (3/2)M_T + M_T^2/2$
Total C-Ops. | $(1/2 + M_R)M_T + M_T^2/2$ | $M_T$ | $M_T - 1$ | $-1 + (5/2 + M_R)M_T + M_T^2/2$
Total R-Ops. | $(2 + 4M_R)M_T + 2M_T^2$ | $M_T$ | $2M_T - 2$ | $-2 + (5 + 4M_R)M_T + 2M_T^2$

(a) Division can be avoided if the BS(.) performs slicing and takes the factor $r_{i,i}$ into account for adjusting the decision boundaries.


Table A.11: QR-decomposition CC, MM-based detection.

Preprocessing
Step | CMAC | ANGLE | CADD | Total
(A.25) | $(13/2 + 2M_R)M_R M_T + (3/2)M_R M_T^2$ | $2M_R M_T$ | 0 | $(17/2 + 2M_R)M_R M_T + (3/2)M_R M_T^2$
(A.24) | $M_R M_T (M_T+1)/2$ | $M_T$ | $M_T - 1$ | $-1 + (2 + M_R/2)M_T + M_R M_T^2/2$
Total C-Ops. | $(7 + 2M_R)M_R M_T + 2M_R M_T^2$ | $(1 + 2M_R)M_T$ | $M_T - 1$ | $-1 + (2 + 9M_R + 2M_R^2)M_T + 2M_R M_T^2$
Total R-Ops. | $4(7 + 2M_R)M_R M_T + 8M_R M_T^2$ | $(1 + 2M_R)M_T$ | $2M_T - 2$ | $-2 + (3 + 30M_R + 8M_R^2)M_T + 8M_R M_T^2$

Symbol-Vector Detection
Step | CMAC | DIV | CADD | Total
(A.9) | $M_R M_T$ | 0 | 0 | $M_R M_T$
Total C-Ops. | $M_R M_T$ | 0 | 0 | $M_R M_T$
Total R-Ops. | $4M_R M_T$ | 0 | 0 | $4M_R M_T$


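Table A.12: Rank-1 update CC for MM-based

Preprocessing
Step | CMAC | DIV | CADD | Total
(A.26) | $2M_R(M_T + M_T^2)$ | $M_R$ | $M_R(1 + M_T^2)$ | $2M_R + 2M_R M_T + 3M_R M_T^2$
(A.8) | $M_R M_T^2$ | 0 | 0 | $M_R M_T^2$
Total C-Ops. | $2M_R M_T + 3M_R M_T^2$ | $M_R$ | $M_R(1 + M_T^2)$ | $2M_R + 2M_R M_T + 4M_R M_T^2$
Total R-Ops. | $8M_R M_T + 12M_R M_T^2$ | $M_R$ | $2M_R(1 + M_T^2)$ | $3M_R + 8M_R M_T + 14M_R M_T^2$

Symbol-Vector Detection
Step | CMAC | DIV | CADD | Total
(A.9) | $M_R M_T$ | 0 | 0 | $M_R M_T$
Total C-Ops. | $M_R M_T$ | 0 | 0 | $M_R M_T$
Total R-Ops. | $4M_R M_T$ | 0 | 0 | $4M_R M_T$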


The number of operations for completing Algorithm 11 is:

\[ N_{MULT} = \sum_{i=1}^{M_R} 2(M_T + M_T^2) = 2M_R(M_T + M_T^2), \]
\[ N_{DIV} = M_R, \]
\[ N_{ADD} = \sum_{i=1}^{M_R} (1 + M_T^2) = M_R(1 + M_T^2). \]

A.4.7 Divide-and-conquer method

The Divide-and-Conquer (D&C) method recursively inverts $F$. The steps leading to MM-based MMSE detection are:

\[ F = (H^{H}H + M_T\sigma^2 I) \in \mathbb{C}^{M_T \times M_T}, \]
\[ F^{-1} = \mathrm{D\&C}(F) \in \mathbb{C}^{M_T \times M_T}, \tag{A.27} \]
\[ G = F^{-1}H^{H} \in \mathbb{C}^{M_T \times M_R}, \]
\[ \hat{y} = Gy \in \mathbb{C}^{M_T}. \]

The complete, recursive D&C method is described in Algorithm 12 [13]. The corresponding CCs are reported in Table A.13.


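Table A.13: D&C CC for MM-based

Preprocessing
Step | CMAC | DIV | CADD | Total
(A.7) | $M_T M_R (M_T+1)/2$ | 0 | $M_T$ | $(1 + M_R/2)M_T + M_R M_T^2/2$
(A.27) | $T_1$ (a) | $M_T/2$ | $T_2$ (b) | $T_3$ (c)
(A.8) | $M_R M_T^2$ | 0 | 0 | $M_R M_T^2$
Total C-Ops. | $T_4$ (d) | $M_T/2$ | $-3 + (7/3)M_T - M_T^2/4 + M_T^3/6$ | $T_C$ (e)
Total R-Ops. | $4T_4$ | $M_T/2$ | $-6 + (14/3)M_T - M_T^2/2 + M_T^3/3$ | $T_R$ (f)

Symbol-Vector Detection
Step | CMAC | DIV | CADD | Total
(A.9) | $M_R M_T$ | 0 | 0 | $M_R M_T$
Total C-Ops. | $M_R M_T$ | 0 | 0 | $M_R M_T$
Total R-Ops. | $4M_R M_T$ | 0 | 0 | $4M_R M_T$

(a) $T_1 = -6 + 4M_T - M_T^2/4 + M_T^3/2$
(b) $T_2 = -3 + (4/3)M_T - M_T^2/4 + M_T^3/6$
(c) $T_3 = -9 + (35/6)M_T - M_T^2/2 + (2/3)M_T^3$
(d) $T_4 = -6 + (4 + M_R/2)M_T + (-1 + 6M_R)M_T^2/4 + M_T^3/2$
(e) $T_C = -9 + (41 + 3M_R)M_T/6 + (-1 + 3M_R)M_T^2/2 + (2/3)M_T^3$
(f) $T_R = -30 + (127/6 + 2M_R)M_T + (-3/2 + 6M_R)M_T^2 + (7/3)M_T^3$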


Algorithm 12 F^(−1) = D&C(F)
In: F ∈ C^(M×M)
Out: F^(−1)
  if M = 1 then
    F^(−1) = 1/f_{1,1}
  else if M = 2 then
    F^(−1) = [ f_{1,1}  f_{1,2} ; f*_{1,2}  f_{2,2} ]^(−1) = 1/(f_{1,1} f_{2,2} − f_{1,2} f*_{1,2}) · [ f_{2,2}  −f_{1,2} ; −f*_{1,2}  f_{1,1} ]
  else
    pick a suitable p satisfying 1 ≤ p < M
    partition F as in (3.10)
    A^(−1) = D&C(A)
    S = C − B^H A^(−1) B
    S^(−1) = D&C(S)
    F^(−1) as in (3.11)
  end if
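A compact Python sketch of the recursion is given below. Since (3.10) and (3.11) are not reproduced in this appendix, the sketch assumes the standard partition $F = [A\ B;\ B^{H}\ C]$ with a $p \times p$ block $A$ and the standard block-inverse formula based on the Schur complement $S$; the split point $p = \lfloor M/2 \rfloor$, the helper names, and the test matrices are likewise assumptions.

```python
def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def herm(X):
    """Conjugate transpose."""
    return [[X[i][j].conjugate() for i in range(len(X))] for j in range(len(X[0]))]


def dc_inverse(F):
    """Recursively invert a Hermitian positive-definite matrix F."""
    M = len(F)
    if M == 1:
        return [[1 / F[0][0]]]
    if M == 2:
        det = F[0][0] * F[1][1] - F[0][1] * F[0][1].conjugate()
        return [[F[1][1] / det, -F[0][1] / det],
                [-F[0][1].conjugate() / det, F[0][0] / det]]
    p = M // 2                                   # assumed choice of the split point
    A = [row[:p] for row in F[:p]]
    B = [row[p:] for row in F[:p]]
    C = [row[p:] for row in F[p:]]
    Ai = dc_inverse(A)
    AiB = mat_mul(Ai, B)                         # A^-1 B
    BhAiB = mat_mul(herm(B), AiB)
    S = [[C[i][j] - BhAiB[i][j] for j in range(M - p)] for i in range(M - p)]
    Si = dc_inverse(S)                           # inverse of the Schur complement
    X = mat_mul(AiB, Si)                         # A^-1 B S^-1
    corr = mat_mul(X, herm(AiB))                 # A^-1 B S^-1 B^H A^-1
    top_left = [[Ai[i][j] + corr[i][j] for j in range(p)] for i in range(p)]
    top_right = [[-x for x in row] for row in X]
    bottom_left = herm(top_right)
    return ([tl + tr for tl, tr in zip(top_left, top_right)]
            + [bl + si for bl, si in zip(bottom_left, Si)])


# Hypothetical HPD test matrices of size 3 and 4.
F3 = [[4, 1 + 1j, 0], [1 - 1j, 3, 1j], [0, -1j, 2]]
F4 = [[4.0, 1.0, 0.0, 1.0], [1.0, 5.0, 1.0, 0.0], [0.0, 1.0, 6.0, 1.0], [1.0, 0.0, 1.0, 7.0]]
I3 = mat_mul(F3, dc_inverse(F3))
I4 = mat_mul(F4, dc_inverse(F4))
```

Both `I3` and `I4` come out as identity matrices up to rounding; the only divisions occur in the $M = 1$ and $M = 2$ leaves of the recursion.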

Appendix B

Datasheet

Pinout

Figure B.1 illustrates the pinout of ASPE A and ASPE B in their PGA120 package. The two ASPEs are pin-compatible. Table B.1 describes the functionality of the I/O signal pads.

[Figure B.1: ASPE A and ASPE B pinout. Diagram of the 120 pins of the PGA120 package, showing the signal pads together with the core power, pad power, and GND pads.]

Table B.1: Description of signal pads.

Signal Name | I/O | Width | Description
RstxRBI | in | 1 | Asynchronous reset (active-low)
ClkxCI | in | 1 | Core clock
SCKxCI | in | 1 | SPI clock
SSxSBI | in | 1 | SPI slave select (active-low)
SlaveInxDI | in | 1 | SPI slave data in
SlaveOutxDO | out | 1 | SPI slave data out
FifoInWritexSI | in | 2 | Input FIFO, write assert
FifoInDataxDI | in | 16 | Input FIFO, data in
FifoInDataReqxSO | out | 1 | Input FIFO, data request
FifoOutDataReqxSI | in | 1 | Output FIFO, data request
FifoOutWritexSO | out | 2 | Output FIFO, write data assert
FifoOutDataxDO | out | 16 | Output FIFO, data out
StBistOutxTO | out | 14 | SU, BIST output
SeqBistOutxTO | out | 3 | SEQ, BIST output
BistModexTI | in | 1 | BIST mode; 0: show BIST ok, 1: show BIST done
BistEnablexTI | in | 1 | BIST enable; 0: disabled, 1: enabled
ScanModexTI | in | 1 | Scan mode; 0: no RAM bypassing, 1: RAM bypassing
ScanEnablexTI | in | 1 | Scan enable; 0: normal mode, 1: scan mode


Table B.2: Input signal timing constraints.

Group | Pad | Setup | Hold
SPI timing (SCKxC, min. 7 ns) | SSxSBI | 5 ns | 0.5 ns
 | SlaveInxDI | 2.25 ns | 1.8 ns
I/O timing (ClkxC, min. 4 ns) | FifoInDataxDI | 2.4 ns | 0.6 ns
 | FifoInWritexSI | 0.7 ns | 0.8 ns
 | FifoOutDataReqxSI | 0.8 ns | 0.6 ns
Test timing | ScanModexTI | 8.2 ns | 0.8 ns
 | ScanEnablexTI | 8 ns | 4.7 ns
 | BistEnablexTI | 5.1 ns |
 | BistModexTI | 7.2 ns | 4.1 ns

Table B.3: Output signal timing constraints.

Group | Pad | Prop. Delay
SPI timing | SlaveOutxDO | 4 ns
I/O timing | FifoOutDataxDO | 4 ns
 | FifoInDataReqxSO | 3.6 ns
Test timing | StBistOutxTO | 5.7 ns
 | SeqBistOutxTO | 5.1 ns

Post-layout Timing and Power Consumption

The timing constraints and power consumption of both ASPEs are extracted using the design tools and the post-layout netlist. The post-layout timing and power consumption are estimated on the netlist generated with CDS First Encounter v6.2. The power consumption is estimated by extracting the simulation-based node toggling activity obtained with the post-layout netlist.

Timing The maximum core clock frequency (ClkxCI) resulting from the post-layout simulation is 250 MHz and the maximum SPI clock frequency (SCKxC) is 142 MHz. Table B.2 reports the corresponding input timing constraints, whereas Table B.3 reports the output timing constraints. These constraints are achieved on both ASPEs.

Table B.4: Post-layout power consumption.

Clock period [ns] | Power consumption [mW], ASPE A | Power consumption [mW], ASPE B
4 | 700 | 700
5 | |
7.5 | |

Power For ASPE A, the node toggling activity was extracted while simulating the computation of a 64-point FFT. For ASPE B, the node toggling activity was extracted while simulating the computation of the inverse of a 2 × 2 matrix. The resulting power consumption is reported in Table B.4. At 250 MHz the resulting power consumption is around 700 mW for each ASPE.

Operating Modes

Configuration through SPI

The serial peripheral interface (SPI) provides access to the ASPE's memories. This access is essential for loading software into the sequencer's program memory. Also, while debugging, SPI access to the SUs and FUs facilitates error localization. ScanEnablexTI, ScanModexTI, and BistEnablexTI shall be tied to ground during configuration.

Figure B.2 shows the generic timing diagram of one SPI transaction. One transaction begins with SSxSBI being de-asserted, while SCKxCI is low. Next, an SPI-command-byte is transmitted, bit by bit and from MSB to LSB, over the SlaveInxDI pin. Concurrently, one SPI-response-byte is output from MSB to LSB, over the SlaveOutxDO pin. Three SPI-commands are available: READ from and WRITE to an ASPE memory, and NOP.

READ The structure of an SPI-READ sequence is shown in Figure B.3. Reading one 32 bit data-word from one ASPE memory-location involves four phases:


[Figure B.2: SPI transaction. Timing diagram of SCKxCI, SSxSBI, SlaveInxDI, and SlaveOutxDO over the eight clock cycles of one transaction.]

1. Issue an SPI-READ command. Requires one SPI transaction.

2. Send address, from least significant byte (A0) to most significantone (A2). Requires three SPI transactions.

3. Send SPI-NOP commands until RREADY command is output onSlaveOutxDO. Requires a variable number of transactions.

4. Receive data, from the least significant byte (D0) to the mostsignificant one (D3). Requires four SPI transactions.

WRITE Figure B.4 illustrates one SPI-WRITE sequence, used forwriting one 32 bit data-word D (bytes D3-D0) to the ASPE memory ataddress A (bytes A2-A0). The operation is divided into four phases:

1. Issue an SPI-WRITE command. Requires one SPI transaction.

2. Send write address, from least significant byte (A0) to mostsignificant one (A2). Requires three SPI transactions.

3. Send data-word to be written, from the least significant byte (D0)to the most significant one (D3). Requires four SPI transactions.

[Figure B.3: SPI-READ. Timing diagram of SCKxCI, SSxSBI, SlaveInxDI, and SlaveOutxDO. Bytes on SlaveInxDI: READ, A0, A1, A2, followed by NOPs. Bytes on SlaveOutxDO: NOP, READ, RACK0, RACK1, RACK2, READING, RREADY, D0, D1, D2, D3.]


[Figure B.4: SPI-WRITE. Timing diagram of SCKxCI, SSxSBI, SlaveInxDI, and SlaveOutxDO. Bytes on SlaveInxDI: WRITE, A0, A1, A2, D0, D1, D2, D3, followed by NOPs. Bytes on SlaveOutxDO: NOP, WRITE, RACK0, RACK1, RACK2, DACK0, DACK1, DACK2, DACK3, WRITING, WDONE.]

4. Send SPI-NOP commands until WDONE command is output onSlaveOutxDO. Requires a variable number of transactions.

SPI commands Table B.5 summarizes the SPI commands and SPIresponses.

Table B.5: SPI commands and SPI responses.

SPI commands (SlaveInxDI)
Mnemonic | Cmd | Meaning
NOP | 0x00 | No-operation
READ | 0x01 | Read data-word
WRITE | 0x02 | Write data-word

SPI responses (SlaveOutxDO)
Mnemonic | Cmd | Meaning
NOP | 0x00 | No-operation
READ | 0x01 | Read acknowledge
WRITE | 0x02 | Write acknowledge
ACK0 | 0x03 | R/W A0 address acknowledge
ACK1 | 0x04 | R/W A1 address acknowledge
ACK2 | 0x05 | R/W A2 address acknowledge
DACK0 | 0x06 | W D0 data acknowledge
DACK1 | 0x07 | W D1 data acknowledge
DACK2 | 0x08 | W D2 data acknowledge
DACK3 | 0x09 | W D3 data acknowledge
RREADY | 0x0A | Read done
READING | 0x0B | Read access in progress
WDONE | 0x0C | Write done
WRITING | 0x0D | Write access in progress
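The byte order of the READ and WRITE sequences (command byte, then address bytes A0 to A2, then data bytes D0 to D3, least significant byte first) can be illustrated with a small host-side helper. This is a hypothetical sketch: the function names are assumptions, only the command codes are taken from Table B.5, and the NOP polling for RREADY or WDONE as well as the actual SPI driver are left out.

```python
SPI_NOP, SPI_READ, SPI_WRITE = 0x00, 0x01, 0x02   # command bytes from Table B.5


def _address_bytes(address):
    if not 0 <= address < 1 << 24:
        raise ValueError("address must fit into 24 bits")
    return [(address >> (8 * k)) & 0xFF for k in range(3)]      # A0, A1, A2


def spi_read_bytes(address):
    """Bytes sent on SlaveInxDI to start a read (NOP polling follows)."""
    return [SPI_READ] + _address_bytes(address)


def spi_write_bytes(address, data):
    """Bytes sent on SlaveInxDI for one 32 bit write (NOP polling follows)."""
    if not 0 <= data < 1 << 32:
        raise ValueError("data must fit into 32 bits")
    data_bytes = [(data >> (8 * k)) & 0xFF for k in range(4)]   # D0 .. D3
    return [SPI_WRITE] + _address_bytes(address) + data_bytes


# Example: write 0x00000000 to address 0x780060.
sequence = spi_write_bytes(0x780060, 0x00000000)
```

Each list element corresponds to one SPI transaction of eight clock cycles, so a write occupies eight transactions before the host starts sending NOPs.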


Memory Map Figure B.5 depicts the physical design of the ASPE's index and dictionary memory. The index memory is a 1024 × 20 bit SRAM and the dictionary memory is composed of three 256 × 64 bit SRAMs (DICT0-DICT2). The 16 bit control words (CWs) used to control the SEQ, the FUs, and SUs are assigned to the dictionary memory slots as in Table B.6 (cf. Figure B.5). Finally, the memory map in Table B.7 allows the index and dictionary memories to be accessed over SPI.

Access to the SUs and FUs is also possible over SPI. For the FU and SU access, the 24 bit SPI-address A (bytes A0-A2) has the bit structure 11uu uuaa aaaa aaaa aaaa aa00. The 4 bits uuuu encode the unit's internal slot number and the 16 bits aaaa aaaa aaaa aaaa the unit's CW. The internal slot number (uuuu) assignment is reported in Table B.8. The memory map required to access the ASPE's storage units is reported in Table B.9. The IBUF and OBUF are accessed in an analogous way.
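The bit structure 11uu uuaa aaaa aaaa aaaa aa00 can be made explicit with a short encoder and decoder. The function names are hypothetical; the example values are chosen such that the result matches the first SU0 and SU1 entries of Table B.9 (internal slots 0x8 and 0x9 according to Table B.8).

```python
def unit_address(slot, cw):
    """Build the 24 bit SPI address 11uu uuaa aaaa aaaa aaaa aa00."""
    if not 0 <= slot <= 0xF:
        raise ValueError("slot number uuuu must fit into 4 bits")
    if not 0 <= cw <= 0xFFFF:
        raise ValueError("control word must fit into 16 bits")
    return (0b11 << 22) | (slot << 18) | (cw << 2)


def decode_unit_address(address):
    """Return (slot, cw) extracted from a 24 bit FU/SU address."""
    return (address >> 18) & 0xF, (address >> 2) & 0xFFFF


# Slot 0x8 (SU0) with the 16 bit field 0x6800 yields 0xE1A000.
su0_write_both = unit_address(0x8, 0x6800)
slot, cw = decode_unit_address(0xE5A000)
```

Decoding 0xE5A000 returns slot 0x9 (SU1) with the same 16 bit field, which shows that the storage units share one address layout and differ only in the uuuu bits.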

Normal operation

The ASPE starts its autonomous operation once the stall register residing inside the SEQ is cleared. To this end, the SPI command 'WRITE 0x780060 0x00000000' is issued (write the data word 0x00000000 to address 0x780060, which corresponds to the SEQ stall register). The SEQ then starts fetching the first dictionary pointer located at address 0x000 of the index memory.

ScanEnablexTI, ScanModexTI, and BistEnablexTI shall be tied to ground during normal operation.

BIST

The memory BIST writes a chess-board pattern into the memories. Thereafter, it reads out the memories and checks whether the pattern matches the expected one or not. If the check for one memory passes, the BIST-signal of that memory is raised, otherwise it remains zero.

Figure B.6 shows the signaling scheme used to enable the memory built-in self-test (BIST) mode. To enter the BIST mode, the BistEnablexTI signal has to be set to 1 before RstxRBI is released. BistModexTI selects whether the "Bist DONE" or "Bist OK" status is

[Figure B.5: Physical view of ASPE's index and dictionary memories. The index memory IDX holds 1024 entries (addresses 0x000 to 0x3FF); the dictionary memories DICT2, DICT1, and DICT0 hold 256 entries each, with four slots per entry: U0 (SEQ) to U3 in DICT2, U4 to U7 in DICT1, and U8 to U11 in DICT0.]


Table B.6: Assignment of dictionary memory slots to SEQ, FUs, and SUs.

Dict. Slot | ASPE A | ASPE B
U0 | SEQ | SEQ
U1 | REG | REG
U2 | CMAC | DIV
U3 | CALU0 | CMAC0
U4 | CALU1 | CMAC1
U5 | SU0 | CALU
U6 | SU1 | SU0
U7 | SU2 | SU1
U8 | SU3 | SU2
U9 | SU4 | SU3
U10 | OBUF | OBUF
U11 | IBUF | IBUF

Table B.7: Index and dictionary memory maps for ASPE A and ASPE B (write only).

Unit | Address Range
DICT0 | 0x720000 to 0x720FFC
DICT1 | 0x760000 to 0x760FFC
DICT2 | 0x7A0000 to 0x7A0FFC
IDX | 0x7B0000 to 0x7B0FFC

Table B.8: Internal slot number uuuu assignment for FUs and SUs.

Internal slot uuuu | ASPE A | ASPE B
0x0 | REG | REG
0x1 | nil | nil
0x2 | nil | DIV
0x3 | nil | CMAC0
0x4 | CMAC | CMAC1
0x5 | CALU0 | CALU
0x6 | CALU1 | nil
0x7 | nil | nil
0x8 | SU0 | SU0
0x9 | SU1 | SU1
0xA | SU2 | SU2
0xB | SU3 | SU3
0xC | SU4 | OBUF
0xD | OBUF | IBUF
0xE | IBUF | nil

visible on StBistOutxTO_X and SeqBistOutxTO_X. If BistModexTI is set to 1, the "Bist DONE" status is visible, otherwise "Bist OK". Table B.11 reports the StBistOutxTO_X and SeqBistOutxTO_X mapping to the different memories.

SCAN

The scan chain is activated through ScanEnablexTI (active if 1). ScanModexTI isolates the memories from the scan chain and needs to be set to 1 during test mode.


Table B.9: Memory maps for addressing SUs of ASPE A and ASPE B.

Unit | Address Range | Meaning
SU0 | 0xE1A000 to 0xE1A3FF | Write to both SIMD mems
 | 0xE18000 to 0xE183FF | Write to R mem
 | 0xE19000 to 0xE193FF | Write to L mem
 | 0xE14000 to 0xE143FF | Read from R mem
 | 0xE15000 to 0xE153FF | Read from L mem
SU1 | 0xE5A000 to 0xE5A3FF | Write to both SIMD mems
 | 0xE58000 to 0xE583FF | Write to R mem
 | 0xE59000 to 0xE593FF | Write to L mem
 | 0xE54000 to 0xE543FF | Read from R mem
 | 0xE55000 to 0xE553FF | Read from L mem
SU2 | 0xE9A000 to 0xE9A3FF | Write to both SIMD mems
 | 0xE98000 to 0xE983FF | Write to R mem
 | 0xE99000 to 0xE993FF | Write to L mem
 | 0xE94000 to 0xE943FF | Read from R mem
 | 0xE95000 to 0xE953FF | Read from L mem

Table B.10: Memory maps for addressing SUs of ASPE A and ASPE B. Note that SU4 is only available on ASPE A.

Unit | Address Range | Meaning
SU3 | 0xEDA000 to 0xEDA3FF | Write to both SIMD mems
 | 0xED8000 to 0xED83FF | Write to R mem
 | 0xED9000 to 0xED93FF | Write to L mem
 | 0xED4000 to 0xED43FF | Read from R mem
 | 0xED5000 to 0xED53FF | Read from L mem
SU4 | 0xF1A000 to 0xF1A3FF | Write to both SIMD mems
 | 0xF18000 to 0xF183FF | Write to R mem
 | 0xF19000 to 0xF193FF | Write to L mem
 | 0xF14000 to 0xF143FF | Read from R mem
 | 0xF15000 to 0xF153FF | Read from L mem


[Figure B.6: BIST signaling. Timing diagram of ClkxCI, RstxRBI, BistEnablexTI, BistModexTI, StBistOutxTO_X, and SeqBistOutxTO_X. BistEnablexTI is set before RstxRBI goes high; depending on BistModexTI, the outputs show either the BIST DONE or the BIST OK status.]

Table B.11: BIST pad-to-memory mapping.

Pad | ASPE A | ASPE B
StBistOutxTO_X: | |
0 | SU0, RAM R | SU0, RAM 0
1 | SU0, RAM L | SU0, RAM 1
2 | SU1, RAM R | SU1, RAM 0
3 | SU1, RAM L | SU1, RAM 1
4 | SU2, RAM R | SU2, RAM 0
5 | SU2, RAM L | SU2, RAM 1
6 | SU3, RAM R | SU3, RAM 0
7 | SU3, RAM L | SU3, RAM 1
8 | SU4, RAM R | always 1
9 | SU4, RAM L | always 0
10 | always 0 | always 0
11 | always 0 | always 0
12 | OBUF | OBUF
13 | IBUF | IBUF
SeqBistOutxTO_X: | |
0 | DICT0 | DICT0
1 | DICT1 | DICT1
2 | DICT2 | DICT2

Bibliography

[1] M. Eteläperä and J.-P. Soininen, “4G mobile terminal architectures,” Technical Research Centre of Finland (VTT), Tech. Rep., 2007. [Online]. Available: http://rooster.oulu.fi/materiaalit/4G%20terminal%20architectures.pdf

[2] G. E. Moore, “Cramming more components onto integrated circuits, reprinted from Electronics, volume 38, number 8,” IEEE Solid-State Circuits Newsletter, vol. 20, no. 3, pp. 33–35, Sep. 2006.

[3] S. Cherry, “Edholm’s law of bandwidth,” IEEE Spectrum, vol. 41, no. 7, pp. 58–60, Jul. 2004.

[4] U. Ramacher, “Software-defined radio prospects for multistandard mobile phones,” Computer, vol. 40, no. 10, pp. 62–69, Oct. 2007.

[5] K. van Berkel, F. Heinle, P. P. E. Meuwissen, K. Moerman, and M. Weiss, “Vector processing as an enabler for software-defined radio in handheld devices,” EURASIP Journal on Applied Signal Processing, vol. 2005, no. 16, pp. 2613–2625, 2005.

[6] H. Bölcskei, “MIMO-OFDM wireless systems: basics, perspectives, and challenges,” IEEE Wireless Communications, vol. 13, no. 4, pp. 31–37, Aug. 2006.

[7] A. J. Paulraj, D. A. Gore, R. U. Nabar, and H. Bölcskei, “Anoverview of MIMO communications - a key to gigabit wireless,”

211

http://rooster.oulu.fi/materiaalit/4G%20terminal%20architectures.pdf

http://rooster.oulu.fi/materiaalit/4G%20terminal%20architectures.pdf

212 BIBLIOGRAPHY

Proceedings of the IEEE, vol. 92, no. 2, pp. 198–218, Feb. 2004.5, 53, 57, 161

[8] IEEE, Draft Standard for Information Technology-Telecommunications and information exchange between systems–Local and metropolitan area networks–Specific requirements– Part11: Wireless LAN Medium Access Control (MAC) and PhysicalLayer (PHY) specifications: Amendment 4: Enhancements forHigher Throughput, 2007. 5, 72

[9] T. Boesch, “Adaptive stream processor for network multimediaconsumer electronic devices,” Ph.D. dissertation, ETH Zurich,2004. 5, 6, 98, 100, 156

[10] S. Eberli, A. Burg, T. Boesch, and W. Fichtner, “An IEEE 802.11abaseband receiver implementation on an application specific pro-cessor,” in Circuits and Systems, 2007. MWSCAS 2007. 50thMidwest Symposium on, Montreal, Que., Aug. 5–8, 2007, pp.1324–1327. 5, 101, 107

[11] S. Eberli, A. Burg, and W. Fichtner, “Implementa-tion of a 2× 2 MIMO-OFDM receiver on an appli-cation specific processor,” Microelectronics Journal, vol.In Press, Corrected Proof, pp. –, 2009 (invited). [On-line]. Available: http://www.sciencedirect.com/science/article/B6V44-4VWJ1YP-1/2/bbcb3d4c513f25e650913b83fef4c11d 6

[12] B. Bougard, B. De Sutter, S. Rabou, D. Novo, O. Allam,S. Dupont, and L. Van der Perre, “A coarse-grained array basedbaseband processor for 100Mbps+ software defined radio,” in De-sign, Automation and Test in Europe, 2008. DATE ’08, Munich,Germany, Mar. 2008, pp. 716–721. 6, 7, 33, 34, 38, 151, 154, 156,158

[13] S. Eberli, D. Cescato, and W. Fichtner, “Divide-and-conquer ma-trix inversion for linear MMSE detection in SDR MIMO receivers,”in NORCHIP, 2008., Tallinn, Nov. 2008, pp. 162–167. 6, 67, 189

[14] R. H. Dennard, J. Cai, and A. Kumar, “A perspective on today’sscaling challenges and possible future directions,” Solid-StateElectronics, vol. 51, no. 4, pp. 518–525, April 2007. 11

http://www.sciencedirect.com/science/article/B6V44-4VWJ1YP-1/2/bbcb3d4c513f25e650913b83fef4c11d

http://www.sciencedirect.com/science/article/B6V44-4VWJ1YP-1/2/bbcb3d4c513f25e650913b83fef4c11d

BIBLIOGRAPHY 213

[15] H. Kaeslin, Digital Integrated Circuits: From VLSI Architecturesto CMOS Fabrication. Cambridge University Press, May 2008.12

[16] T. Pionteck, L. D. Kabulepa, and M. Glesner, “Exploring thecapabilities of reconfigurable hardware for OFDM-based WLANs,”ser. IFIP International Federation for Information Processing, vol.200. Springer Boston, 2006, pp. 149–164. [Online]. Available:http://www.springerlink.com/content/0148404m07082636/ 17,18

[17] C. Ebeling, C. Fisher, G. Xing, M. Shen, and H. Liu, “Implement-ing an OFDM receiver on the RaPiD reconfigurable architecture,”IEEE Transactions on Computers, vol. 53, no. 11, pp. 1436–1448,Nov. 2004. 17, 19, 20, 38

[18] C. Ebeling, D. C. Cronquist, and P. Franklin, Field-ProgrammableLogic Smart Applications, New Paradigms and Compilers, ser.Lecture Notes in Computer Science. Springer Berlin / Heidelberg,1996, vol. 1142, ch. RaPiD – Reconfigurable pipelined datapath,pp. 126–135. 17, 19, 38

[19] D. C. Cronquist, C. Fisher, M. Figueroa, P. Franklin, and C. Ebel-ing, “Architecture design of reconfigurable pipelined datapaths,”in Advanced Research in VLSI, 1999. Proceedings. 20th Anniver-sary Conference on, Atlanta, GA, Mar. 1999, pp. 23–40. 20

[20] A. Kamalizad, N. Tabrizi, N. Bagherzadeh, and A. Hatanaka,“A programmable DSP architecture for wireless communicationsystems,” in Application-Specific Systems, Architecture Processors,2005. ASAP 2005. 16th IEEE International Conference on, Jul.23–25, 2005, pp. 231–238. 20

[21] N. Tabrizi, N. Bagherzadeh, A. Kamalizad, and H. Du, “MaRS: Amacro-pipelined reconfigurable system,” ACM Computing Fron-tiers, pp. 343 – 349, 2004. 20

[22] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, andE. M. Chaves Filho, “MorphoSys: an integrated reconfigurablesystem for data-parallel and computation-intensive applications,”

http://www.springerlink.com/content/0148404m07082636/

214 BIBLIOGRAPHY

IEEE Transactions on Computers, vol. 49, no. 5, pp. 465–481,May 2000. 20, 21, 38

[23] Y. Lin, H. Lee, M. Who, Y. Harel, S. Mahlke, T. Mudge,C. Chakrabarti, and K. Flautner, “SODA: A low-power architec-ture for software radio,” in Computer Architecture, 2006. ISCA’06. 33rd International Symposium on, Boston, MA, 2006, pp.89–101. 22, 23, 24, 38

[24] G. K. Rauwerda, P. M. Heysters, and G. J. M. Smit, “AnOFDM receiver implemented on the coarse-grain reconfigurablemontium processor,” in Proceedings of the 9th InternationalOFDM-Workshop, Dresden, Germany, September 2004, pp. 197–201. [Online]. Available: http://eprints.eemcs.utwente.nl/1496/25, 38

[25] P. M. Heysters, G. J. Smit, and E. Molenkamp, “Montium -balancing between energy-efficiency, flexibility and performance,”in International Conference on Engineering of ReconfigurableSystems and Algorithms, ERSA, 2003, pp. 235–241. [Online].Available: http://doc.utwente.nl/46380/ 25

[26] G. K. Rauwerda, “Multi-standard adaptive wireless communica-tion receivers: adaptive applications mapped on heterogeneousdynamically reconfigurable hardware,” Ph.D. dissertation, Univ.of Twente, Enschede, January 2008. [Online]. Available:http://dx.doi.org/10.3990/1.9789036526074 25

[27] G. K. Rauwerda, P. M. Heysters, and G. J. M. Smit, “Towardssoftware defined radios using coarse-grained reconfigurable hard-ware,” IEEE Transactions on Very Large Scale Integration (VLSI)Systems, vol. 16, no. 1, pp. 3–13, Jan. 2008. 25, 26, 158

[28] J. Kneip, M. Weiss, W. Drescher, V. Aue, J. Strobel, T. Oberthür,M. Bolle, and G. Fettweis, “Single chip programmable basebandASSP for 5GHz wireless LAN applications,” IEICE Trans. Elec-tron., vol. E85-C, pp. 359 – 367, Feb. 2002. 26, 38

[29] T. Richter, W. Drescher, E. Engel, S. Kobayashi, V. Nikolajevic,Weiss, and G. Fettweis, “A platform-based highly parallel digital

http://eprints.eemcs.utwente.nl/1496/

http://doc.utwente.nl/46380/

http://dx.doi.org/10.3990/1.9789036526074

BIBLIOGRAPHY 215

signal processor,” in Custom Integrated Circuits, 2001, IEEEConference on., San Diego, CA, May 2001, pp. 305–308. 26

[30] J. H. Moreno, V. Zyuban, U. Shvadron, F. D. Neeser, J. H.Derby, M. S. Ware, K. Kailas, A. Zaks, A. Geva, S. Ben-David,S. W. Asaad, T. W. Fox, D. Littrell, M. Biberstein, D. Naishlos,and H. Hunter, “An innovative low-power high-performance pro-grammable signal processor for digital communications,” IBM J.Res. Dev., vol. 47, no. 2-3, pp. 299–326, 2003. 28

[31] D. Iancu, H. Ye, E. Surducan, M. Senthilvelan, J. Glossner, V. Sur-ducan, V. Kotlyar, A. Iancu, G. Nacer, and J. Takala, “Softwareimplementation of wimax on the sandbridge sandblaster platform,”in SAMOS, 2006, pp. 435–446. 28, 38

[32] C. J. Glossner, T. Raja, E. Hokenek, and M. Moudgill, “A multi-threaded processor architecture for sdr,” Proceedings of the KoreanInstitute of Communication Sciences, pp. 70–85, November 2002.28, 29, 38

[33] M. Schulte, J. Glossner, S. Jinturkar, M. Moudgill, S. Mamidi,and S. Vassiliadis, “A low-power multithreaded processor forsoftware defined radio,” The Journal of VLSI Signal Processing,vol. 43, no. 2-3, pp. 143–159, June 2006. [Online]. Available:http://www.springerlink.com/content/t535603181168478/ 28, 29

[34] E. Tell, “Design of programmable baseband processors,” Ph.D.dissertation, Linköping University, 2005. [Online]. Available: http://liu.diva-portal.org/smash/get/diva2:20611/FULLTEXT01 30,31, 38, 78

[35] A. Nilsson, “Design of programmable multi-standard basebandprocessors,” Ph.D. dissertation, Linköping University, 2007.[Online]. Available: http://www.ep.liu.se/smash/get/diva2:23639/FULLTEXT01 30, 31, 38

[36] A. Nilsson, E. Tell, and D. Liu, “An 11mm2, 70 mW fully pro-grammable baseband processor for mobile WiMAX and DVB-T/H in 0.12µm CMOS,” in Solid-State Circuits, IEEE Journalof, vol. 44, no. 1, Lille, France, Jan. 2009, pp. 90–97. 30

http://www.springerlink.com/content/t535603181168478/

http://liu.diva-portal.org/smash/get/diva2:20611/FULLTEXT01

http://liu.diva-portal.org/smash/get/diva2:20611/FULLTEXT01

http://www.ep.liu.se/smash/get/diva2:23639/FULLTEXT01

http://www.ep.liu.se/smash/get/diva2:23639/FULLTEXT01

216 BIBLIOGRAPHY

[37] J. Leĳten, G. Burns, J. Huisken, E. Waterlander, and A. vanWel, “AVISPA: a massively parallel reconfigurable accelerator,”System-on-Chip, 2003. Proceedings. International Symposium on,pp. 165 – 168, Nov. 19–21, 2003. 32, 33, 38

[38] T. R. Halfhill, “Silicon hive breaks out,” Microprocessor Report,pp. 165 – 168, Dec. 1, 2003. 32

[39] I. Held and B. VanderWiele, “AVISPA CH - embedded communi-cations signal processor for multi-standard digital television,” inGSPx TV to Mobile, Mar. 2006. 32

[40] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins,“ADRES: An architecture with tightly coupled VLIW processorand coarse-grained reconfigurable matrix,” in FPL, 2003, pp.61–70. 33, 34, 38

[41] B. Bougard, A. Bourdoux, F. Naessens, M. Glassee, V. Derudder,and L. Van der Perre, “Energy-efficient software-defined radio so-lutions for MIMO-based broadband communication,” in EuropeanSignal Processing Conference. Proceedings of, Poznan, Poland,Sep. 2007. 33, 37

[42] D. Novo, W. Moffat, V. Derudder, and B. Bougard, “Mappinga multiple antenna SDM-OFDM receiver on the ADRES coarse-grained reconfigurable processor,” IEEE Workshop on SignalProcessing Systems Design and Implementation, pp. 473–478. ,Nov. 2–4, 2005. 33

[43] R. Enzler, “The current status of reconfigurable computing,” ETHZurich, Electronics Laboratory, Tech. Rep., 1999. 37

[44] ——, “Architectural trade-offs in dynamically reconfigurable pro-cessors,” Ph.D. dissertation, ETH Zurich, 2004. 37

[45] R. Hartenstein, “A decade of reconfigurable computing: a vision-ary retrospective,” Design, Automation and Test in Europe, 2001.Conference and Exhibition 2001. Proceedings, pp. 642 – 649, 13-16Mar. 2001. 37

BIBLIOGRAPHY 217

[46] H. Amano, “A survey on dynamically reconfigurable processors,”IEICE Transactions on Communications, vol. E89-B, no. 12,pp. 3179–3187, 2006. [Online]. Available: http://ietcom.oxfordjournals.org/cgi/content/abstract/E89-B/12/3179 37

[47] “IEEE Micro Special Issue: Accelerator Architectures,” IEEEMicro, vol. 28, no. 4, Jul./Aug. 2008. 37

[48] T. Pionteck, L. D. Kabulepa, and M. Glesner, “Exploring thecapabilities of reconfigurable hardware for ofdm-based wlans,” inVLSI-SOC, Darmstadt, Germany, Dec. 1–3, 2003, pp. 149–164.38

[49] P. M. Heysters, G. J. Smit, and E. Molenkamp, “A Flexible andEnergy-Efficient Coarse-Grained Reconfigurable Architecture forMobile Systems,” The Journal of Supercomputing, vol. 26, no. 3,pp. 283 – 308, November 2003. 38

[50] A. Paulraj, R. Nabar, and D. Gore, Introduction to Space-TimeWireless Communications. Cambridge Univ. Press, 2003. 49

[51] A. Burg, “VLSI circuits for MIMO communication systems,” Ph.D.dissertation, ETH Zurich, 2006. 53, 167, 168, 182

[52] R. G. Galager, Principles of Digital Communication. CambridgeUniversity Press, 2008. 53, 161

[53] I. B. Collings, M. R. G. Butler, and M. McKay, “Low complexityreceiver design for MIMO bit-interleaved coded modulation,” inSpread Spectrum Techniques and Applications, 2004 IEEE EighthInternational Symposium on, Aug. 30–Sep. 2, 2004, pp. 12–16. 54,58, 77

[54] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, andH. Bölcskei, “VLSI implementation of MIMO detection using thesphere decoding algorithm,” IEEE Journal of Solid-State Circuits,vol. 40, no. 7, pp. 1566–1577, Jul. 2005. 54, 55, 161

[55] A. Burg, M. Borgmann, M. Wenk, C. Studer, and H. Bölcskei,“Advanced receiver algorithms for MIMO wireless communications,”in Design, Automation and Test in Europe, 2006. DATE ’06.Proceedings, vol. 1, Mar. 6–10, 2006. 55

http://ietcom.oxfordjournals.org/cgi/content/abstract/E89-B/12/3179

http://ietcom.oxfordjournals.org/cgi/content/abstract/E89-B/12/3179

218 BIBLIOGRAPHY

[56] C. Studer, M. Wenk, A. Burg, and H. Bölcskei, “Soft-outputsphere decoding: Performance and implementation aspects,” inSignals, Systems and Computers, 2006. ACSSC ’06. FortiethAsilomar Conference on, Pacific Grove, CA, Oct./Nov. 2006, pp.2071–2076. 55

[57] C. Studer, A. Burg, and H. Bölcskei, “Soft-output sphere decoding:algorithms and VLSI implementation,” IEEE Journal on SelectedAreas in Communications, vol. 26, no. 2, pp. 290–300, Feb. 2008.55

[58] K. wai Wong, C. ying Tsui, R. S. K. Cheng, and W. ho Mow,“A VLSI architecture of a k-best lattice decoding algorithm forMIMO channels,” in Circuits and Systems, 2002. ISCAS 2002.IEEE International Symposium on, vol. 3, 2002, pp. 273–276. 56

[59] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner,“K-best MIMO detection VLSI architectures achieving up to 424Mbps,” in Circuits and Systems, 2006. ISCAS 2006. Proceedings.2006 IEEE International Symposium on, Island of Kos. 56, 163

[60] J. Antikainen, P. Salmela, O. Silven, M. Juntti, J. Takala, andM. Myllyla, “Application-specific instruction set processor im-plementation of list sphere detector,” in Signals, Systems andComputers, 2007. ACSSC 2007. Conference Record of the Forty-First Asilomar Conference on, Pacific Grove, CA, Nov. 4–7, 2007,pp. 943–947. 56

[61] E. Zimmermann and G. Fettweis, “Adaptive vs. hybrid iterativeMIMO receivers based on MMSE linear and soft-SIC detection,”in Personal, Indoor and Mobile Radio Communications, 2006IEEE 17th International Symposium on, Helsinki, Sep. 2006, pp.1–5. 57

[62] S. Haene, A. Burg, D. Perels, P. Luethi, N. Felber, and W. Ficht-ner, “Silicon implementation of an MMSE-based soft demapperfor MIMO-BICM,” in Circuits and Systems, 2006. ISCAS 2006.Proceedings. 2006 IEEE International Symposium on, May 21–24,2006. 58, 77

BIBLIOGRAPHY 219

[63] A. Burg, S. Haene, D. Perels, P. Luethi, N. Felber, and W. Ficht-ner, “Algorithm and VLSI architecture for linear MMSE detectionin MIMO-OFDM systems,” in Circuits and Systems, 2006. ISCAS2006. Proceedings. 2006 IEEE International Symposium on, May2006. 58, 67, 185

[64] M. Wenk, “Real-time MIMO-OFDM testbed: Challenges, imple-mentations, and measurement results,” Ph.D. dissertation, ETHZurich, 2010, to appear. 59

[65] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge, U.K.,1985. 64

[66] G. H. Golub and C. F. V. Loan, Matrix computations (3rd ed.).Baltimore, MD, USA: Johns Hopkins University Press, 1996. 65,66, 167, 170, 174

[67] J. P. Gram, “Ueber die entwickelung reeller funtionen in reihenmittelst der methode der kleinsten quadrate,” Journal für diereine und angewandte Mathematik, vol. 94, pp. 41–73, 1883. 66

[68] E. Schmidt, “Zur theorie der linearen und nichtlinearen integral-gleichungen. i. teil: Entwicklung willkürlicher funktionen nachsystemen vorgeschriebener,” Mathematische Annalen, vol. 63, pp.433–476, 1907. 66

[69] F. Zhang, Ed., The Schur Complement and Its Applications, ser.Numerical Methods and Algorithms. Springer US, March 302006, vol. 4. 67

[70] T. Banachiewicz, “Méthode de résolution numérique des équationslinéaires, du calcul des déterminants et des inverses, et de réduc-tion des formes quadratiques.” Bull. internat. Acad. PolonaiseSci. Leit., Cl. Sci. math. natur., A, vol. 1938, pp. 393–404, 1938.67

[71] H. Boltz, “Entwickelungs-verfahren zum ausgleichen geodädtis-cher netze nach der methode der kleinsten quadrate,” in Verof-fentlichungen des Preussischen Geodatischen Institutes, NeueFolge 90, Druck und Verlag von P. Stankiewicz’ Buchdruckerei,1923. 67

220 BIBLIOGRAPHY

[72] R. Lohan, “Das entwicklungsverfahren zum ausgleichen geodätis-cher netze nach boltz im matrizenkalkül,” Zeitschrift für ange-wandte Mathematik und Mechanik, vol. 13, pp. 59–60, 1933. 67

[73] T. M. Schmidl and D. C. Cox, “Robust frequency and timing syn-chronization for OFDM,” IEEE Transactions on Communications,vol. 45, no. 12, pp. 1613–1621, Dec. 1997. 73

[74] J. Volder, “The CORDIC trigonometric computing technique,” inIRE Trans. Electronic Computers, vol. EC-8, no. 3, Sep. 1959, pp.330–334. 75

[75] B. Parhami, Computer Arithmetic: Algorithms and HardwareDesigns. Oxford University Press, 2000. 75, 115

[76] SPRS276H, TMS320C6455 – Fixed-Point Digital Signal Processor,Texas Instruments Incorporated, May 2005. 83, 107

[77] L. Bardelli, L. Henzen, and C. Pedretti, “Design of a real-timeunderwater acoustic mimo-ofdm communication system,” Master’sthesis, ETH Zürich, Jun. 2007. 85

[78] SPRAAE8B, TMS320C6455/C6454 Power Consumption Sum-mary (Rev. B), Texas Instruments Incorporated, Oct. 2007.[Online]. Available: http://focus.ti.com.cn/cn/lit/an/spraae8b/spraae8b.pdf 89

[79] E. Sereni, S. Culicchi, V. Vinti, E. Luchetti, S. Ottaviani, andM. Salvi, “A software radio OFDM transceiver for WLANapplications,” 2001. [Online]. Available: http://www.di.uoa.gr/speech/dsp/X/PERUGI.PDF 90, 92

[80] M. Tariq, Y. Baltaci, T. Horseman, M. Butler, and A. Nix, “De-velopment of an OFDM based high speed wireless LAN platformusing the TI C6x DSP,” Communications, 2002. ICC 2002. IEEEInternational Conference on, vol. 1, pp. 522 – 526, 28 April-2 May2002. 90, 92

[81] Y. Cinquino, A.L. amd Shayan, “A real-time software implemen-tation of an OFDM modem suitable for software defined radios,”Electrical and Computer Engineering, 2004. Canadian Conferenceon, vol. 2, pp. 697 – 701, 2-5 May 2004. 90, 92

http://focus.ti.com.cn/cn/lit/an/spraae8b/spraae8b.pdf

http://focus.ti.com.cn/cn/lit/an/spraae8b/spraae8b.pdf

http://www.di.uoa.gr/speech/dsp/X/PERUGI.PDF

http://www.di.uoa.gr/speech/dsp/X/PERUGI.PDF

BIBLIOGRAPHY 221

[82] M. Schoenes, S. Eberli, A. Burg, D. Perels, S. Haene, N. Felber,and W. Fichtner, “A novel SIMD DSP architecture for softwaredefined radio,” 2003. MWSCAS ’03. Proceedings of the 46th IEEEInternational Midwest Symposium on Circuits and Systems, vol. 3,pp. 1443–1446, Dec. 2003. 91, 107

[83] F. Barat and R. Lauwereins, “Reconfigurable instruction set pro-cessors: a survey,” in Rapid System Prototyping, 2000. RSP 2000.Proceedings. 11th International Workshop on, Paris, France, 2000,pp. 168–173. 96

[84] C. Merk and C. Studer, “Power and performance optimizationof a complex-number arithmetic logic unit,” ETH Zürich, Tech.Rep., Feb. 2003. 96

[85] Benchmarks for TMS320C55x and TMS320C64x DSP Families,Texas Instruments Incorporated. [Online]. Available: http://dspvillage.ti.com 97, 99

[86] IEEE Std., Part 11: Wireless LAN medium Access Control (MAC)and Physical Layer (PHY) specifications, High-speed PhysicalLayer in the 5GHz Band, 1999. 101

[87] P. M. Heysters, G. K. Rauwerda, and G. J. Smit, “Implementa-tion of a HiperLAN/2 receiver on the reconfigurable Montiumarchitecture,” in Parallel and Distributed Processing Symposium,2004. Proceedings. 18th International, 26-30 Apr. 2004, p. 147.103

[88] A. Niktash, R. Maestre, and N. Bagherzadeh, “A case study ofperforming OFDM kernels on a novel reconfigurable DSP architec-ture,” in Military Communications Conference, 2005. MILCOM2005. IEEE, 17-20 oct 2005, pp. 1813–1818. 103

[89] M. Ros and P. Sutton, “Compiler optimization and orderingeffects on VLIW code compression,” in CASES ’03: Proceedingsof the 2003 international conference on Compilers, architectureand synthesis for embedded systems. New York, NY, USA: ACM,2003, pp. 95–103. 139

http://dspvillage.ti.com

http://dspvillage.ti.com

222 BIBLIOGRAPHY

[90] S. Y. Larin and T. M. Conte, “Compiler-driven cached code com-pression schemes for embedded ILP processors,” in Microarchitec-ture, 1999. MICRO-32. Proceedings. 32nd Annual InternationalSymposium on, Haifa, Nov. 16–18, 1999, pp. 82–92. 139

[91] Y. Xie, W. Wolf, and H. Lekatsas, “Code compression for em-bedded VLIW processors using variable-to-fixed coding,” IEEETransactions on Very Large Scale Integration (VLSI) Systems,vol. 14, no. 5, pp. 525–536, May 2006. 139

[92] C. H. Lin, Y. Xie, and W. Wolf, “LZW-based code compressionfor VLIW embedded systems,” in Design, Automation and Testin Europe Conference and Exhibition, 2004. Proceedings, vol. 3,Feb. 16–20, 2004, pp. 76–81. 139

[93] O. Schliebusch, H. Meyr, and R. Leupers, OptimizedASIP Synthesis from Architecture Description LanguageModels. Springer Netherlands, 2007. [Online]. Available:http://www.springerlink.com/content/wpg782/ 157

[94] S. Haene, D. Perels, and A. Burg, “A real-time 4-stream MIMO-OFDM transceiver: System design, FPGA implementation, andcharacterization,” IEEE Journal on Selected Areas in Communi-cations, vol. 26, no. 6, pp. 877–889, Aug. 2008. 159

[95] M. Wenk, P. Luethi, T. Koch, P. Maechler, M. Lerjen, N. Felber,and W. Fichter, “Hardware platform and implementation of a real-time multi-user MIMO-OFDM testbed,” in Proc. IEEE Int. Symp.on Circuits and Systems (ISCAS’09), May 2009, pp. 789–792. 159

http://www.springerlink.com/content/wpg782/

Curriculum Vitae

Stefan Eberli was born in 1978 in Lugano. He studied electrical engineering at the ETH Zürich, where he received the Dipl. Ing. degree in 2003, writing a thesis about a novel digital signal processor for software-defined radio applications at the Integrated Systems Laboratory (IIS).

After two years at BridgeCo AG, Dübendorf, Switzerland, where he worked in the field of ASIC functional verification, he joined the IIS, ETH Zürich, in 2005 as a research and teaching assistant. His research interests lie in the domain of digital signal processing and include, in particular, the design of efficient platforms for software-defined radios.

