
A Heterogeneous SDR MPSoC in 28 nm CMOS for Low-Latency Wireless Applications

Sebastian Haas¹, Tobias Seifert¹, Benedikt Nöthen¹, Stefan Scholze¹, Sebastian Höppner¹, Andreas Dixius¹, Esther Pérez Adeva¹, Thomas Augustin¹, Friedrich Pauls¹, Sadia Moriam¹, Mattis Hasler¹, Erik Fischer¹, Yong Chen¹, Emil Matúš¹, Georg Ellguth¹, Stephan Hartmann¹, Stefan Schiefer¹, Love Cederström¹, Dennis Walter¹, Stephan Henker¹, Stefan Hänzsche¹, Johannes Uhlig¹, Holger Eisenreich², Stefan Weithoffer³, Norbert Wehn³, René Schüffny¹, Christian Mayr¹, Gerhard Fettweis¹

¹ Technische Universität Dresden, Center for Advancing Electronics Dresden (cfaed), Dresden, Germany, {first.last}@tu-dresden.de
² Racyics GmbH, Dresden, Germany, [email protected]
³ Technische Universität Kaiserslautern, Microelectronic Systems Design, Kaiserslautern, Germany, {weithoffer,wehn}@eit.uni-kl.de

ABSTRACT
Current and future applications impose high demands on software-defined radio (SDR) platforms in terms of latency, reliability, and flexibility. This paper presents a heterogeneous SDR MPSoC with a hexagonal network-on-chip to address these issues. It features four data processing modules and a baseband processing engine for iterative multiple-input multiple-output (MIMO) receiving. Integrated memory controllers enable dynamic data flow mapping and application isolation. In a 4 × 4 MIMO application scenario, the MPSoC achieves a throughput of 232 Mbit/s with a latency of 20 µs while consuming 414 mW. It outperforms state-of-the-art platforms in terms of throughput by a factor of 4.

1 INTRODUCTION
Current and future applications in areas like Car2X, augmented reality, or Internet of Things (IoT) connected via 5G cellular will impose high demands on processing platforms. The challenges for terminal devices are twofold: firstly, systems should deal with flexible and energy-efficient data processing such as audio/video coding, web browsing, payment systems, etc. Secondly, they should handle signal processing applications of wireless technologies from 2G to 5G, WiFi, and NarrowBand-IoT [15]. Similar requirements are also present on the infrastructure side. Emerging paradigms like the Tactile Internet [5] or Mobile Edge Computing [9] will impose stringent latency and reliability constraints as oftentimes required by mission-critical applications.

Consequently, a flexible software-centric solution is essential which still provides high throughputs with low energy consumption. One key towards this solution is hence the use of multiple application-specific processors tailored, e.g., by instruction set extensions since they provide high throughputs while being energy-efficient. However, flexible communication schemes and low-latency data transfers between the cores are major prerequisites to avoid wasting the mentioned energy gain. Furthermore, combining the applications listed above into one single SDR system demands fast switching between protocol and physical layer while isolating their dedicated resources.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
DAC '17, Austin, TX, USA
© 2017 ACM. ISBN 978-1-4503-4927-7/17/06...$15.00
DOI: http://dx.doi.org/10.1145/3061639.3062188

Previous work [4, 11, 14] shows multi-core SDR architectures with advanced hardware accelerators for signal processing. These designs focus on the individual components which aim to tackle the aforementioned requirements but lack proper interaction and data transfer capabilities between the processing elements. This paper proposes a heterogeneous SDR MPSoC for low-latency wireless applications which is capable of interweaving the necessary protocol, signal, and bit processing by fast local switching of computing fabric. This is enabled by (1) heterogeneous data processing modules with single-cycle task switching, (2) a high-throughput baseband processing engine for multi-iterative MIMO receiving, (3) a hexagonal network-on-chip (NoC) topology for improved resiliency and latency, and (4) sophisticated memory controllers with hardware-supported data flow control, data manipulation (filtering, reordering), memory pooling, and application isolation.

2 MPSOC ARCHITECTURE
Our SDR MPSoC is built upon the Tomahawk concept [2], which is primarily designed for signal processing applications. Tomahawk provides a platform with multiple heterogeneous processing modules (PMs) or clusters which integrate two or more processing elements (PEs) equipped with local memories for data and instructions. A central application core together with a global memory and peripherals represents the top level in the system hierarchy. It defines the control flow of the application by using data flow graphs (DFGs) such as Kahn process networks [6] and synchronous DFGs [10]. A central scheduling unit, called CoreManager, dynamically maps the DFG to the platform by performing adaptive runtime scheduling with power management capabilities and data transfer management.

In particular, the presented SDR MPSoC, called Tomahawk4, comprises six PMs: four data PMs (DPMs), one baseband PM (BBPM), as well as an application control module (ACM), as depicted in Figure 1. All modules have a separate local synchronous clock (all-digital phase-locked loop, ADPLL) and are globally asynchronously connected via the NoC (GALS). Besides the local memories (SRAM) in each PM, an LPDDR2 interface connects 128 MB of global memory (SDRAM). The LPDDR2 controller runs at 400 MHz and supports a bandwidth of 12.5 Gbit/s. The FPGA interface allows multiple chips to be interconnected using on-chip SerDes links with speeds of up to 5 Gbit/s. The periphery includes register files to access eight GPIO pins and a UART. It further contains an OpenMSP430 controller for mathematical computations as well as application-specific hardware supporting the Advanced Encryption Standard (AES). Table 1 summarizes the MPSoC specifications.

DAC '17, June 18-22, 2017, Austin, TX, USA. S. Haas et al.

Figure 1: Tomahawk4 MPSoC block diagram, left: general architecture, right: hexagonal NoC topology.
[Block diagram labels: four DPMs (Data Proc. ASIP, ARM, iDMA, 128 kB MEM, MM, ADPLL, DVFS/AVS, NoC IF); BBPM (Equalizer ASIP, Sphere Detector, Turbo Decoder, iDMA, 128 kB MEM, MM, ADPLL, PMGT); ACM (CoreManager, App.-Core, 96 kB MEM, 32 kB cache, DMA, ADPLL, PMGT); LPDDR2 IF with 128 MB SDRAM; FPGA IF; periphery (GPIO/UART); routers R[0] to R[7] of the hexagonal NoC with a serial NoC link.]

The packet-switched hexagonal NoC employs parallel links operating at 80 Gbit/s at 500 MHz. The high-speed serial link between routers 6 and 7 bridges large on-chip distances. With a 66 % higher bisection bandwidth, the hexagonal NoC has improved traffic distribution as well as greater fault resilience compared to an XY mesh NoC. The routing logic of the routers uses programmable look-up tables (LUTs) which allow for adaptation of the routing function. This is necessary to re-route the traffic when links are blocked, either due to failures or because of link reservation by a PE for accessing memory of neighboring PEs. The hexagonal NoC is especially suited to these re-routing techniques since it provides greater path diversity than the XY mesh. The LUTs can be altered by sending configuration flits over the NoC to the affected routers. Moreover, for the purpose of monitoring the traffic over the NoC, all links have flit counters which can be read via read-request flits.
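The LUT-based re-routing idea can be illustrated with a small software model. The following is only a sketch under stated assumptions: the six-router topology, the function names, and the breadth-first LUT construction are hypothetical and do not describe the Tomahawk4 router arrangement or its configuration-flit format. Each router holds a table mapping a destination to a next hop, and rewriting the tables after a link is blocked stands in for sending configuration flits.

```python
from collections import deque

# Hypothetical topology for illustration only: a six-router ring plus one
# cross link. It is NOT the Tomahawk4 router arrangement.
LINKS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (1, 4)]


def neighbors(router, blocked):
    """Routers reachable from `router` over one usable (non-blocked) link."""
    result = []
    for a, b in LINKS:
        if (a, b) in blocked or (b, a) in blocked:
            continue
        if a == router:
            result.append(b)
        elif b == router:
            result.append(a)
    return result


def build_lut(router, blocked=frozenset()):
    """Return {destination: next_hop} for one router via breadth-first search."""
    lut = {}
    visited = {router}
    queue = deque()
    for n in neighbors(router, blocked):
        visited.add(n)
        lut[n] = n
        queue.append((n, n))  # (current node, first hop used to reach it)
    while queue:
        node, first_hop = queue.popleft()
        for n in neighbors(node, blocked):
            if n not in visited:
                visited.add(n)
                lut[n] = first_hop
                queue.append((n, first_hop))
    return lut


def route(src, dst, luts):
    """Follow the per-router LUTs hop by hop and return the visited routers."""
    path = [src]
    while path[-1] != dst:
        path.append(luts[path[-1]][dst])
    return path


ROUTERS = range(6)
luts = {r: build_lut(r) for r in ROUTERS}
path_before = route(0, 4, luts)

# The cross link 1-4 becomes unavailable (failure or reservation); rewriting
# every LUT models the configuration flits sent to the affected routers.
blocked = frozenset({(1, 4)})
luts = {r: build_lut(r, blocked) for r in ROUTERS}
path_after = route(0, 4, luts)
```

In this toy model the packet from router 0 to router 4 first uses the cross link and, after reconfiguration, takes the alternative path around the ring, which is the kind of path diversity the text attributes to the hexagonal topology.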

2.1 Data Processing Module
Each DPM contains a customized Tensilica LX5 RISC and a general-purpose ARM Cortex-M4F with 64 kB memories each for data (DMEM) and instructions (IMEM). Both memories are dual-ported, accessible from the NoC side and shared on the core side. The architecture demonstrates the interoperability of different code binaries. In general, Tensilica processors are used for physical layer signal processing tasks while ARM cores are more beneficial in higher protocol layers. In future work, we plan to investigate and compare the strengths and weaknesses of both. The LX5 is extended by application-specific functional units (FUs) to accelerate selected bit operations (application-specific instruction-set processor, ASIP), as demonstrated by [8]. In this HW/SW codesign approach, the FUs represent an additional instruction set extending the processor's hardware and can be accessed by the PE's software via assembler macros. For data processing purposes, the FUs support hashing algorithms for integers and strings, bit and integer count functions, and an integer sorting algorithm. Furthermore, to improve SDR functionality, one FU contains an accelerated 128-point FFT operating on 32-bit fixed-point complex values. The LX5 has two 128-bit data interfaces and two load/store units to provide SIMD capabilities that fully exploit the data parallelism of the implemented FUs. The IMEM interface has a width of 64 bit for a 3-issue VLIW architecture. The DPMs are connected to three power supply rails, each variable in the range of 0.6 V to 1.1 V. Each power rail is set to a specific supply voltage, which thus defines the maximum clock frequency. The integrated power management (PMGT) controller enables ultra-fast dynamic voltage and frequency scaling (DVFS). Furthermore, each DPM is equipped with an adaptive voltage scaling (AVS) controller for voltage adaptations according to the chip's process corner and temperature.

2.1.1 iDMA Controller. Each PM connects to the NoC via an iDMA (intelligent direct memory access) controller, which combines the functionality of a conventional DMA controller with the handling of data fields by means of byte-wise data filtering, data modification (based on a 3-stage butterfly network with eight 8-bit input values), and data distribution. The iDMA supports eight configurable virtual channels between PEs or between a PE and SDRAM. The channel concept ensures strict isolation of memory regions of different applications. Memory-mapped FIFOs allow dynamic size adaptation to the application needs. To avoid overflow of FIFO buffers, a credit-based flow-control mechanism is supported.
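Credit-based flow control, as mentioned for the iDMA FIFOs, can be sketched in a few lines. This is a generic textbook model, not the iDMA implementation: the class names, the FIFO depth of four, and the one-credit-per-word granularity are assumptions made for illustration. The sender starts with as many credits as the receiver has free slots and stalls when it runs out, so the FIFO cannot overflow.

```python
from collections import deque


class CreditedFifo:
    """Receiver-side FIFO of fixed depth; each consumed word frees one credit."""

    def __init__(self, depth):
        self.depth = depth
        self.words = deque()

    def push(self, word):
        # Reaching this branch would mean the sender ignored its credit count.
        if len(self.words) >= self.depth:
            raise OverflowError("FIFO overflow")
        self.words.append(word)

    def pop(self):
        """Consume one word and return the number of credits handed back (1)."""
        self.words.popleft()
        return 1


class Sender:
    """Sends only while it holds credits, so the receiver never overflows."""

    def __init__(self, fifo):
        self.fifo = fifo
        self.credits = fifo.depth  # initial credits equal the free FIFO slots

    def try_send(self, word):
        if self.credits == 0:
            return False  # stall and wait for a credit instead of overflowing
        self.credits -= 1
        self.fifo.push(word)
        return True

    def receive_credit(self, count):
        self.credits += count


fifo = CreditedFifo(depth=4)
sender = Sender(fifo)
sent = [sender.try_send(w) for w in range(6)]  # only the first four succeed
sender.receive_credit(fifo.pop())              # the receiver drains one word
resumed = sender.try_send(99)                  # one more word now fits
```

The design choice worth noting is that the sender never inspects the FIFO itself; it only counts credits, which is what makes the scheme usable across a network where the fill level is not directly visible.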


Table 1: Tomahawk4 MPSoC Hardware Components Overview.

Component | Area [mm²], total (w/o mem) | Area [mm²], mem | Total memory size [kB] | Max. clock freq. [MHz] at VDD = 1.1 V | Performance at 500 MHz | Power [mW] at 500 MHz, VDD = 1.1 V
ACM, CoreManager | 0.139 | 0.354 | 96 | 667 | 2.33 MTasks/s | 33.9
ACM, App.-Core | 0.110 | 0.098 | 33.5 | 667 | 1 GOPS | 28.4
1x DPM, LX5 | 0.313 1) | 0.564 | 128 | 667 | 256 GOPS | 49.6 2)
1x DPM, ARM | 0.108 | | | 571 | 1 GOPS | 71 3)
BBPM, EQ | 0.344 1) | 0.564 | 128 | 667 | 96 GOPS | 126
BBPM, SD | 0.094 | 0.093 | 4.25 | 571 | 188 Mbit/s | 136
BBPM, TD | 0.338 | 0.222 | 54.125 | 571 | 382 Mbit/s | 230
NoC | 0.357 | - | - | 667 | 80 Gbit/s | 23
Periphery | 0.101 | 0.041 | 16 | - | - | -
FPGA IF | 0.046 | - | - | 500 | 5 Gbit/s | -
LPDDR2 IF | 1.535 | - | - | 400 | 12.5 Gbit/s | -

1) Includes iDMA, MM, Mux.
2) Averaged over multiple functional units.
3) Based on exp()-calculation benchmark.

2.1.2 Memory Manager. On-chip SRAM is a very expensive resource in terms of area. This constrains the size of local memories within a PM. Tasks which need more memory than is locally available either cannot run on the platform at all or need a mechanism to gain access to more memory. Since the architecture enforces strict memory isolation, a separate unit is needed which opens up the visible address space in a defined manner. This unit is called memory manager (MM) and is integrated into each PM. It can transparently extend the local data memory to up to 192 kB by utilizing currently unused memories of a different PM (which can provide 128 kB of SRAM at maximum). The advantage of using SRAM of neighboring modules instead of SDRAM is that it provides a fixed and predictable latency, which is essential for giving hard real-time guarantees. NoC routes can be reserved to enforce this. If predictability and thus fixed latency is not important, memory can also be mapped to SDRAM. In this case, the local memory can be extended to a maximum of 2 MB in the current implementation. It is only limited by the address space configuration of the PE. In contrast to the highly variable, load-specific SDRAM latency (min. 33 cycles), the read latency to a neighboring module stays constant at 29 cycles. Also, in the latter case the memory and NoC load can be more evenly distributed over the chip since traffic does not need to be routed through the SDRAM interface. Thus, the MM provides a scalable solution which increases memory placement flexibility and reduces traffic to and from SDRAM while still ensuring isolation.
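The address-window idea behind the memory manager can be made concrete with a small model. This is a sketch under assumptions, not the MM hardware: the function name `translate`, the flat address layout (local DMEM first, lent SRAM directly behind it), and the use of `"DPM[1]"` as the lending module are hypothetical. Only the sizes (64 kB local DMEM, at most 128 kB lent, 192 kB in total) come from the text.

```python
KB = 1024
LOCAL_DMEM = 64 * KB       # local data memory of a DPM (from the text)
REMOTE_WINDOW = 128 * KB   # maximum SRAM a neighboring PM can lend (from the text)


def translate(addr, remote_pm=None, remote_size=0):
    """Map a PE data address to (target, offset).

    Hypothetical layout: addresses below LOCAL_DMEM hit local SRAM, the next
    `remote_size` bytes are forwarded to `remote_pm`, everything else is
    rejected so that isolation is preserved.
    """
    if remote_size > REMOTE_WINDOW:
        raise ValueError("a neighboring PM can lend at most 128 kB")
    if 0 <= addr < LOCAL_DMEM:
        return ("local", addr)
    if remote_pm is not None and LOCAL_DMEM <= addr < LOCAL_DMEM + remote_size:
        return (remote_pm, addr - LOCAL_DMEM)
    raise PermissionError(f"address {addr:#x} is outside the mapped window")


local_hit = translate(0x100)
remote_hit = translate(LOCAL_DMEM + 16, remote_pm="DPM[1]", remote_size=REMOTE_WINDOW)
max_visible = LOCAL_DMEM + REMOTE_WINDOW  # 192 kB, matching the figure in the text
```

Rejecting out-of-window addresses rather than wrapping or forwarding them is what keeps the extension compatible with the strict isolation the text requires.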

2.2 Baseband Processing Module
The BBPM comprises three accelerators for computationally intensive receiver algorithms: a minimum mean square error (MMSE) based equalization (EQ) core, a MIMO sphere detection (SD) core, and a turbo decoding (TD) core. The BBPM area is 1.687 mm² including 0.879 mm² for 186 kB memory.

The architecture of the TD core is based on eight radix-4 maximum a posteriori (MAP) kernels optimized for high communications performance. It supports 3GPP-LTE-A block sizes of up to 2016 bit and on-the-fly iteration control. Furthermore, it also features cyclic redundancy checks (CRC) for up to four LTE transport blocks and supports soft value computation.

The soft-input soft-output (SISO) SD core is based on a set of programmable FUs which are controlled by the opcode of 64-bit VLIWs. In contrast to previous implementations, the partial metrics of the log-likelihood ratio values are calculated by exploiting symmetric properties of the QAM-modulated symbols [1], which reduces the enumeration effort and results in an increased detection throughput. At the same time, close to full max-log a posteriori probability communications performance is still provided.

The EQ core extends a basic LX5 RISC by two dedicated FUs for efficient SISO MMSE-based equalization. While the first FU combines multiple MAC operations to perform complex-valued dot product computations, the second FU enables accelerated processing for recursive algorithms such as forward/back substitution and Cholesky decomposition.
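To show why complex dot products, Cholesky decomposition, and forward/back substitution appear together in MMSE equalization, the following sketch solves (H^H H + σ² I) x = H^H y in plain Python. It is a textbook formulation under assumptions, not the EQ core's firmware: the function names, the 2 × 2 example channel, and the choice of a zero noise variance (which reduces MMSE to zero forcing so the sent symbols are recovered exactly) are illustrative.

```python
def cholesky(a):
    """Return lower-triangular L with a = L * L^H for Hermitian positive definite a."""
    n = len(a)
    l = [[0j] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: real and positive for a positive definite matrix.
        s = a[j][j] - sum(l[j][k] * l[j][k].conjugate() for k in range(j))
        l[j][j] = complex(s.real ** 0.5)
        for i in range(j + 1, n):
            # The inner sum is a complex-valued dot product (a chain of MACs).
            s = a[i][j] - sum(l[i][k] * l[j][k].conjugate() for k in range(j))
            l[i][j] = s / l[j][j]
    return l


def forward_substitution(l, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i in range(len(b)):
        acc = sum(l[i][k] * z[k] for k in range(i))
        z.append((b[i] - acc) / l[i][i])
    return z


def back_substitution(l, z):
    """Solve L^H x = z, reading the upper-triangular factor directly from L."""
    n = len(z)
    x = [0j] * n
    for i in reversed(range(n)):
        acc = sum(l[k][i].conjugate() * x[k] for k in range(i + 1, n))
        x[i] = (z[i] - acc) / l[i][i].conjugate()
    return x


def mmse_equalize(h, y, noise_var):
    """Solve (H^H H + noise_var * I) x = H^H y via Cholesky and substitution."""
    rows, cols = len(h), len(h[0])
    a = [[sum(h[r][i].conjugate() * h[r][j] for r in range(rows))
          + (noise_var if i == j else 0.0)
          for j in range(cols)] for i in range(cols)]
    b = [sum(h[r][i].conjugate() * y[r] for r in range(rows)) for i in range(cols)]
    l = cholesky(a)
    return back_substitution(l, forward_substitution(l, b))


# Hypothetical 2 x 2 channel and symbols, noise-free so the estimate is exact.
h = [[1 + 1j, 0.5 + 0j], [0.2j, 1 - 0.5j]]
x_sent = [1 + 1j, -1 + 1j]
y = [sum(h[r][c] * x_sent[c] for c in range(2)) for r in range(2)]
x_hat = mmse_equalize(h, y, noise_var=0.0)
```

Both substitution loops carry a dependency from one unknown to the next, which is the recursive structure the text says the second FU accelerates.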

2.3 Application Control Module
The ACM includes a Tensilica 570T CPU used as application core with 16 kB cache each for data and instructions. It furthermore comprises a dynamic scheduling unit, called CoreManager, to orchestrate the platform. The CoreManager is based on a Tensilica LX5 RISC with tailored instructions for efficient data flow mapping. The core is attached to 64 kB DMEM and 32 kB IMEM. The total ACM area is 0.701 mm², of which 0.452 mm² is occupied by the SRAM and caches. To ensure security, only the CoreManager is allowed to access specific MPSoC configurations such as the LUTs or the flit counters integrated in the NoC routers. Monitoring the flit counters and reconfiguring the routing enables the CoreManager to control the traffic distribution and hence to reduce latency.

3 EVALUATION
This section presents the manufactured Tomahawk4 chip and the measurement results of the individual MPSoC components as well as a performance analysis of two multi-core application benchmarks.

3.1 Implementation Results
The 14-core SDR MPSoC was fabricated in Globalfoundries 28 nm SLP CMOS technology and occupies 3 × 6 mm² with 24.43 M NAND2 gate equivalents and 844 kB total on-chip memory. A high-speed serial link between routers 6 and 7 bridges the 4.5 mm on-chip distance, as can be seen in the chip photo in Figure 2.

Figure 2: Tomahawk4 Die Photograph.
[Die photo labels: LPDDR2 memory interface, BBPM, SerDes I/O (FPGA IF), DPM[0] to DPM[3], ACM, periphery, GPIO, NoC routers R[6] and R[7], serial link.]

Table 2: Data Processing ASIP: Single-core Measurements of Functional Units at 500 MHz, 1.0 V.

Functional unit | Integer Hashing | String Hashing | Bit Count | Integer Count | Sorting | FFT
Throughput [Gbit/s] | 81.1 | 38.2 | 82.5 | 61.2 | 1.0 | 0.3
Speedup (RISC vs. ASIP) | 1188.8× | 84.1× | 851.5× | 21.5× | 8.8× | 12.8×
Power [mW] | 64.8 | 35.9 | 50.0 | 49.9 | 45.2 | 51.8
Energy [pJ/bit] | 0.8 | 0.9 | 0.6 | 0.8 | 45.2 | 172.7
Energy gain (RISC vs. ASIP) | 664.7× | 78.1× | 677.6× | 17.5× | 8.7× | 12.7×

The measurement set-up, depicted in Figure 3, includes the chip module with MPSoC and SDRAM, which is placed on a power supply board. The board is plugged into a Xilinx Virtex-7 FPGA to enable inter-chip packet communication for system scalability. The physical inter-chip communication is realized by the mentioned FPGA interface employing high-speed serial links. In addition, the FPGA integrates a UDP/IP stack connecting the MPSoCs with a host PC for control and debug purposes. Further, the power supply board integrates an analog-to-digital converter to measure voltage and current of each power rail. Hardware performance counters in each PE are used to monitor the execution time. Both power consumption and performance can be accessed by the CoreManager's software to support scheduling decisions.

Figure 3: Tomahawk4 Measurement Set-up.

As explained in Section 2.1, the LX5 core of each DPM is extended by additional FUs to accelerate bit operations. Hence, we first evaluate the performance and power consumption of the individual FUs. For these experiments, the input data consists of randomly generated values with uniform distribution and is located in the local DMEM of the DPM. Since the bandwidth of the global memory is the bottleneck in the system, loading the data from SDRAM would limit the performance and hide the effective throughput of the cores. Table 2 summarizes the measurement results of each FU. Hashing either extracts bits from a 32-bit integer value as specified by a bit mask or applies the CityHash32 algorithm [7] to 8-bit strings. Bit and integer count create a histogram to illustrate the value distribution of bits and 32-bit values, respectively. The sorting operator is based on a merge-sort scheme and operates on a list of 32-bit integer values. The throughput for the 128-point fixed-point FFT is determined for 32-bit complex values and interpreted for a 64-QAM signal. Compared with the pure RISC implementation, the FUs achieve speedups of up to more than three orders of magnitude while increasing the power consumption by 36 % on average. As the power increase is low compared to the speedups, the FUs lead to energy gains between 9× and 678×.
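The energy row of Table 2 follows directly from the throughput and power rows, which the short check below reproduces. The dictionary and function names are made up for this illustration; the numbers are copied from Table 2.

```python
# Throughput [Gbit/s] and power [mW] per functional unit, copied from Table 2.
TABLE2 = {
    "Integer Hashing": (81.1, 64.8),
    "String Hashing": (38.2, 35.9),
    "Bit Count": (82.5, 50.0),
    "Integer Count": (61.2, 49.9),
    "Sorting": (1.0, 45.2),
    "FFT": (0.3, 51.8),
}


def energy_pj_per_bit(throughput_gbit_s, power_mw):
    """mW / (Gbit/s) = 1e-3 J/s / 1e9 bit/s = 1e-12 J/bit, i.e. pJ/bit."""
    return power_mw / throughput_gbit_s


energy = {fu: round(energy_pj_per_bit(t, p), 1) for fu, (t, p) in TABLE2.items()}
```

Rounded to one decimal, the computed values agree with the "Energy [pJ/bit]" row, so the energy figures are simply power divided by throughput at the stated operating point.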

In the next set of experiments, we evaluate the BBPM cores. Table 3 lists their characteristics and performance along with state-of-the-art competitors. Based on the throughput and power values in Table 3, energy-delay profiles expressed as EDP and ED²P figures can be derived for the given operating points (clock frequency and supply voltage). The delay is taken from the throughput and calculated for 64 bit. Regarding the EQ core, its area and energy efficiency are significantly lower (by more than a factor of 4) compared to the ASIC implementation. However, due to the highly flexible implementation, the ASIP EQ core can be reused for a range of transmission scenarios (both uplink and downlink) as well as iterative and non-iterative receiver schemes. Columns 3 and 4 of Table 3 show that the SD core outperforms a comparable ASIC implementation in terms of both area and energy efficiency by 6.8× and 1.4×, respectively. Considering the last two columns, the TD core proves to be a powerful but still energy-efficient solution, outperforming the TD presented in [12] by a factor of almost 3 for both EDP and ED²P. Since ED²P emphasizes performance, we can confirm that our TD mainly benefits from a lower power consumption. Besides the individual results, an application benchmark reflecting the interaction of all BBPM cores is investigated in Section 3.2.
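The EDP and ED²P figures can be recomputed from the throughput and power rows of Table 3. The sketch below assumes the usual definitions (energy = power × delay, EDP = energy × delay, ED²P = energy × delay²) with the delay being the time to process 64 bit; the names are illustrative, the input numbers are copied from Table 3.

```python
BITS = 64  # the delay is calculated for 64 bit, as stated in the text

# (throughput [Mbit/s], power [mW]) at f_max, copied from Table 3.
CORES = {
    "EQ, this work": (168.0, 203.0),
    "EQ, Studer [13]": (757.0, 189.0),
    "SD, this work": (214.0, 142.0),
    "SD, Borlenghi [3]": (66.0, 177.5),
    "TD, this work": (435.9, 365.0),
    "TD, Studer [12]": (390.6, 789.0),
}


def edp_ed2p(throughput_mbit_s, power_mw):
    """Return (EDP in 1e-15 Ws^2, ED2P in 1e-21 Ws^3) for one operating point."""
    delay = BITS / (throughput_mbit_s * 1e6)  # seconds needed for 64 bit
    energy = power_mw * 1e-3 * delay          # joules spent in that time
    edp = energy * delay
    ed2p = edp * delay
    return edp / 1e-15, ed2p / 1e-21


profiles = {name: edp_ed2p(t, p) for name, (t, p) in CORES.items()}
td_edp_ratio = profiles["TD, Studer [12]"][0] / profiles["TD, this work"][0]
```

With these definitions the computed values round to the EDP and ED²P rows of Table 3, and the turbo decoder ratio comes out a little below 3, consistent with the "factor of almost 3" stated in the text.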

Table 3: Baseband Processing Components Comparison.

Metric | Turbo Equalizer (EQ), this work | Turbo Equalizer (EQ), Studer [13] | Sphere Detector (SD), this work | Sphere Detector (SD), Borlenghi [3] | Turbo Decoder (TD), this work | Turbo Decoder (TD), Studer [12]
Approach | SISO MMSE PIC (flexible MIMO setup) | SISO MMSE PIC (up to 4 × 4 MIMO) | SISO SD (up to 4 × 4 64-QAM) | SISO SD (up to 4 × 4 64-QAM) | 3GPP-LTE turbo decoder | 3GPP-LTE turbo decoder
Design | ASIP | ASIC | ASIP | ASIC | ASIC | ASIC
Technology | 28 nm | 90 nm | 28 nm | 65 nm | 28 nm | 130 nm
Area [mm²] | 1.69 1) (0.22 2)) | 1.5 | 1.69 1) (0.187) | 2.78 | 1.69 1) (0.56) | 3.57
Supply voltage VDD [V] | 1.1 | 1.2 | 1.0 | 1.2 | 1.1 | 1.2
Max. clock freq. fmax [MHz] at VDD | 667 | 568 | 571 | 135 | 571 | 302
Gate count [kGE] | 427 2) | 410 | 399 | 872 | 1196 | 553
Throughput [Mbit/s] at fmax | 168 3) | 757 3) | 214 4) 5) | 66 4) | 435.9 6) | 390.6 7)
Power [mW] at fmax, VDD | 203 8) | 189 | 142 8) | 177.5 | 365 8) | 789
Area efficiency [Mbit/s/kGE] | 0.39 | 1.85 | 0.54 | 0.08 | 0.36 | 0.71
EDP [10^-15 Ws²] | 29.5 | 1.4 | 12.7 | 166.9 | 7.9 | 21.2
ED²P [10^-21 Ws³] | 11.2 | 0.1 | 3.8 | 161.8 | 1.2 | 3.5

1) Refers to BBPM area (values in brackets refer to the PE area).
2) Core area only (w/o memories, iDMA, MM).
3) For matrix inversion only (4 × 4 MIMO, 64-QAM).
4) With 2 detection-decoding iterations, 4 × 4 MIMO, 64-QAM.
5) At 10^-5 BER, SD radius size of 4 candidates.
6) LTE block size of 2016 info bits, 1/3 rate, 5.5 iterations.
7) LTE block size of 3200 info bits, 1/3 rate, 5.5 iterations.
8) Includes leakage power of total BBPM and NoC.

3.2 Application Scenarios
The presented SDR MPSoC is capable of combining high-throughput data processing with low-latency wireless applications. In a first benchmark, we execute integer hashing on the LX5 cores of the DPMs, which support the algorithm through instruction set extensions. The three power supply rails are set to 1.1, 1.0, and 0.9 V, which corresponds to maximum clock frequencies of 667, 444, and 333 MHz, respectively. We further define the requirement to achieve a total throughput of at least 200 Gbit/s. The CoreManager analyzes the throughput and power consumption for a specific number of PEs and their assigned power rail. Figure 4 depicts the measured power consumption for different numbers of PEs. Since a single PE delivers only 109 Gbit/s, at least two PEs have to be available. The CoreManager minimizes the total power consumption to 136 mW by using four PEs assigned to the power rail at 0.9 V. Compared to the two-core option, the energy consumption is reduced by approximately 32 %.

Figure 4: CoreManager determines minimal power consumption with available PEs to meet the throughput constraint (>200 Gbit/s) by assigning PEs to power supply rails with predefined voltages and frequencies.
[Plot axes: power in mW versus number of PEs (1 to 4) and versus supply voltage (0.9 to 1.1 V). Annotated operating points: 667 MHz at 1.1 V with 109 Gbit/s; 667 MHz at 1.1 V, 444 MHz at 1.0 V, and 333 MHz at 0.9 V, each with 219.2 Gbit/s.]
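The rail selection in the first benchmark amounts to a small constrained minimization, sketched below. Several parts are assumptions and not measured data: throughput is taken to scale linearly with clock frequency and PE count, all PEs are placed on the same rail, and the per-PE power values for the 1.1 V and 1.0 V rails are placeholders invented for illustration. Only the 109 Gbit/s single-PE throughput, the rail voltages and frequencies, the 200 Gbit/s constraint, and the 136 mW result for four PEs at 0.9 V come from the text.

```python
from itertools import product

REQUIRED_GBIT_S = 200.0
PE_GBIT_S_AT_667 = 109.0                 # single-PE throughput from the text
RAILS = {1.1: 667, 1.0: 444, 0.9: 333}   # supply voltage [V] -> f_max [MHz]

# Per-PE power [mW] on each rail. ASSUMPTIONS: 34 mW is 136 mW / 4 PEs from the
# text; the values for 1.1 V and 1.0 V are placeholders for illustration only.
PE_POWER_MW = {1.1: 100.0, 1.0: 55.0, 0.9: 34.0}


def throughput(num_pes, vdd):
    """Assume throughput scales linearly with clock frequency and PE count."""
    return num_pes * PE_GBIT_S_AT_667 * RAILS[vdd] / 667


def select_operating_point(max_pes=4):
    """Return (power, num_pes, vdd) with minimal power that meets the constraint."""
    feasible = [
        (num_pes * PE_POWER_MW[vdd], num_pes, vdd)
        for num_pes, vdd in product(range(1, max_pes + 1), RAILS)
        if throughput(num_pes, vdd) >= REQUIRED_GBIT_S
    ]
    return min(feasible)


best_power, best_pes, best_vdd = select_operating_point()
```

Under these assumptions the search lands on four PEs at 0.9 V, matching the operating point reported in the text; with different placeholder powers the ranking of the other candidates would change.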

As described in Section 2.2, the BBPM comprises essential receiver components for physical layer processing. Depending on the communication direction (uplink/downlink) and the channel scenario, the BBPM cores perform iterative detection-decoding or iterative equalization-decoding. As a second benchmark, we use a non-iterative receiver scheme for a 4 × 4 MIMO-ISI channel scenario, which represents SC-FDMA transmission for the LTE uplink. For this purpose, the received signal (present in the frequency domain) has to be equalized, transferred to the time domain, soft-demodulated, and eventually decoded. To guarantee high communications performance at low latency, the data flow of the processing schemes can be efficiently mapped to dedicated MPSoC resources (Figure 5).

Figure 5: Data Flow Mapping and iDMA Mechanism.
[Diagram labels: on-chip iterative receiver chain with RF + ADC, four N-DFT blocks, MIMO EQ (BBPE), four N-IDFT blocks (mapped on DPM0 LX5), P/S, four soft-demodulation blocks (mapped on MIMO Sphere Detector, 696 Mbit/s), and turbo decoder (1/3 rate, 232 Mbit/s without iterations); direct memory transfers via iDMA slots 0 to 7 over a virtual channel across the hexagonal NoC between the Equalizer ASIP and the Data Proc. ASIP, each with iDMA, 128 kB MEM, and NoC IF.]

The considered codeword length is 6060 bit with a code rate of 1/3. In this scenario, a bandwidth of 128 subcarriers is allocated, which requires two SC-FDMA symbols to accommodate the complete codeword. It is assumed that the channel does not significantly vary within one transmission time interval (1 ms). Consequently, the equalization filter matrix does not need to be recalculated for each SC-FDMA symbol. The 128-point IFFT operation can be carried out individually for each transmit layer and is mapped to the four LX5 RISC cores of the DPMs. The soft demodulation for each transmit layer is performed by the sphere detection core. All processing modules run at 571 MHz and 1.04 V, resulting in a total power consumption of 414 mW for this benchmark. Due to the ASIP's memory controllers as well as double buffering mechanisms at the sphere detector and turbo decoder input, a high total application throughput of 696 Mbit/s (encoded) or 232 Mbit/s (effective) can be achieved. The total processing latency from equalizer input to decoder output is 20 µs.
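The symbol count and the two throughput figures in this paragraph can be cross-checked with a few lines of arithmetic. One input is an assumption: the modulation order is not stated for this benchmark, so 64-QAM (6 bit per symbol) is assumed here because it is the modulation quoted for the cores in Table 3. All variable names are illustrative.

```python
import math
from fractions import Fraction

CODEWORD_BITS = 6060          # from the text
SUBCARRIERS = 128             # from the text
LAYERS = 4                    # 4 x 4 MIMO, one stream per transmit layer
BITS_PER_SYMBOL = 6           # ASSUMPTION: 64-QAM, as quoted in Table 3
CODE_RATE = Fraction(1, 3)    # from the text
ENCODED_MBIT_S = 696          # from the text

# Coded bits that fit into one SC-FDMA symbol across all layers.
bits_per_scfdma_symbol = SUBCARRIERS * LAYERS * BITS_PER_SYMBOL
symbols_needed = math.ceil(CODEWORD_BITS / bits_per_scfdma_symbol)

# Only one third of the encoded bits carry information.
effective_mbit_s = ENCODED_MBIT_S * CODE_RATE
```

Under the 64-QAM assumption one SC-FDMA symbol carries 3072 coded bits, so a 6060-bit codeword needs two symbols as the text states, and the rate-1/3 code turns 696 Mbit/s of encoded data into 232 Mbit/s of effective throughput.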


DAC ’17, June 18-22, 2017, Austin, TX, USA S. Haas et al.

Table 4: State-of-the-art Platform Comparisons.

| | This work | Tomahawk2 [11] | 2x MT-ADRES [14] | Magali [4] |
|---|---|---|---|---|
| Platform scope | MIMO 3GPP-LTE-A, 802.11n, NB-IoT, SDR | MIMO 3GPP-LTE, WiMAX, 802.11n, SDR | MIMO 3GPP-LTE Cat-4, 802.11a/n, SISO WiMAX, SDR | MIMO 3GPP-LTE, WiMAX, 802.11n, SDR, Cognitive Radio |
| Die size | 18 mm² | 36 mm² | 10.6 mm² | 29.6 mm² |
| Technology | 28 nm | 65 nm | 40 nm | 65 nm |
| Clocking and power mgt. | GALS, local DVFS and AVS, power gating | GALS, local DVFS and AVS, power gating | - | GALS, local DFS |
| Scheduling | Dynamic | Dynamic | - | Static |
| Network topology | 2D Hexagonal NoC | 2D star-mesh NoC | Interconnecting buses | 2D regular mesh NoC |
| Memory organization | Distributed, Shared, Pooling (MM) | Distributed, Shared | Distributed, Shared | Distributed |
| Peak performance | 1121 GOPS | 105 GOPS | - | 37 GOPS |
| Application scenario | 4 × 4 MIMO 3GPP-LTE-A Uplink Rx Baseband | 4 × 4 MIMO 3GPP-LTE Rx Baseband | 2 × 2 MIMO 3GPP-LTE Cat 4 | 4 × 2 MIMO 3GPP-LTE Rx, 2 × 2 MIMO Tx, MAC |
| Application performance | 232 Mbit/s | 60 Mbit/s | 150 Mbit/s | 10.8 Mbit/s |
| Power consumption | 414 mW (at 571 MHz, 1.04 V) | 480 mW (at 1.15 V) | 500 mW | 477 mW (at 1.2 V) |
| Energy | 1.8 nJ/bit | 8.0 nJ/bit | 3.3 nJ/bit | 44.2 nJ/bit |

Table 4 compares our SDR MPSoC with state-of-the-art platforms. For the described 4 × 4 MIMO application scenario, our SDR MPSoC outperforms the Tomahawk2 platform [11] in throughput by a factor of about 4 (232 Mbit/s versus 60 Mbit/s), while reducing the total power consumption by 13 % and 17 % compared to the Magali MPSoC [4] and the MT-ADRES chips [14], respectively.
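The quoted ratios can be reproduced from the throughput and power rows of Table 4. The following is a minimal sketch of that check; the dictionary layout and variable names are assumptions made for illustration, and the numbers are copied from the table.

```python
# (application throughput in Mbit/s, power consumption in mW), from Table 4
PLATFORMS = {
    "This work": (232.0, 414.0),
    "Tomahawk2 [11]": (60.0, 480.0),
    "2x MT-ADRES [14]": (150.0, 500.0),
    "Magali [4]": (10.8, 477.0),
}

ours_rate, ours_power = PLATFORMS["This work"]
others = {name: v for name, v in PLATFORMS.items() if name != "This work"}

# Throughput ratio of this work relative to each reference platform.
speedup = {name: ours_rate / rate for name, (rate, _) in others.items()}

# Relative power reduction of this work in percent.
power_saving_pct = {name: 100.0 * (1.0 - ours_power / power)
                    for name, (_, power) in others.items()}

for name in others:
    print(f"{name}: {speedup[name]:.2f}x throughput, "
          f"{power_saving_pct[name]:.1f} % less power")
```

The throughput ratio against Tomahawk2 comes out slightly below 4 (about 3.9), and the power reductions against Magali and MT-ADRES round to 13 % and 17 %.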

4 CONCLUSION
SDR platforms with sophisticated hardware components for digital signal processing and bit processing are key to deal with current and future wireless applications. In this paper, we demonstrated the energy-efficient implementation of an iterative MIMO receiver by following an HW/SW codesign approach with tailored data processing modules and a baseband processing engine. Due to the integrated memory controllers, the data flow can be directly mapped on the platform to ensure low communication latencies. Furthermore, a hexagonal NoC improves the bisection bandwidth by 66 % compared to a standard XY mesh NoC. For security reasons, only a single scheduling unit, the CoreManager, controls the platform while applying power management techniques such as DVFS and AVS. Comparing a 4 × 4 MIMO application scenario for an LTE-A uplink with state-of-the-art systems, our final taped-out SDR MPSoC shows a 4× performance improvement at reduced power consumption.

5 ACKNOWLEDGMENTS
This work has been supported in part by the state of Saxony under grant of the German Research Foundation (DFG) within the Cluster of Excellence “Center for Advancing Electronics Dresden” (cfaed), and SFB912 - HAEC. This work has further been supported by the European Social Fund in the framework of the Young Investigators Group “Communication Infrastructures for Attonets in 3D-Chip-Stacks” (Atto3D), and by the European Union under Grant Agreements No. 604102 and DLV-720270 (Human Brain Project). Furthermore, the authors would like to thank Synopsys, Cadence, and ARM for providing software and IP.

REFERENCES
[1] E. P. Adeva and G. P. Fettweis. 2016. Efficient Architecture for Soft-Input Soft-Output Sphere Detection With Perfect Node Enumeration. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 24, 9 (Sept 2016), 2932–2945. DOI: http://dx.doi.org/10.1109/TVLSI.2016.2526904

[2] Oliver Arnold, Emil Matus, Benedikt Noethen, Markus Winter, Torsten Limberg, and Gerhard Fettweis. 2014. Tomahawk: Parallelism and Heterogeneity in Communications Signal Processing MPSoCs. ACM Trans. Embed. Comput. Syst. 13, 3s (2014), 107:1–107:24.

[3] Filippo Borlenghi, Ernst Martin Witte, Gerd Ascheid, Heinrich Meyr, and Andreas Burg. 2012. A 2.78 mm² 65nm CMOS Gigabit MIMO Iterative Detection and Decoding Receiver. In 2012 Proceedings of the ESSCIRC (ESSCIRC). IEEE, 65–68.

[4] Fabien Clermidy, Christian Bernard, Romain Lemaire, Jerome Martin, Ivan Miro-Panades, Yvain Thonnart, Pascal Vivet, and Norbert Wehn. 2010. A 477mW NoC-based digital baseband for MIMO 4G SDR. In 2010 IEEE International Solid-State Circuits Conference-(ISSCC). IEEE, 278–279.

[5] Gerhard P Fettweis. 2014. The Tactile Internet: Applications and Challenges. IEEE Vehicular Technology Magazine 9, 1 (2014), 64–70.

[6] Gilles Kahn. 1974. The semantics of a simple language for parallel programming. In Information Processing 74 (1974), 471–475.

[7] Google Inc. 2013. CityHash v1.1.1. http://code.google.com/p/cityhash/. (June 2013).

[8] Sebastian Haas, Oliver Arnold, Benedikt Nöthen, Stefan Scholze, and others. 2016. An MPSoC for Energy-efficient Database Query Processing. In Proceedings of the 53rd Annual Design Automation Conference (DAC'16). 112:1–112:6.

[9] Yun Chao Hu, Milan Patel, Dario Sabella, Nurit Sprecher, and Valerie Young. 2015. Mobile Edge Computing – A Key Technology Towards 5G. ETSI White Paper 11 (2015).

[10] Edward A Lee and David G Messerschmitt. 1987. Synchronous Data Flow. Proc. IEEE 75, 9 (1987), 1235–1245.

[11] Benedikt Nöthen, Oliver Arnold, Esther Perez Adeva, Tobias Seifert, and others. 2014. A 105GOPS 36mm² Heterogeneous SDR MPSoC with Energy-Aware Dynamic Scheduling and Iterative Detection-Decoding for 4G in 65nm CMOS. In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE, 188–189.

[12] Christoph Studer, Christian Benkeser, Sandro Belfanti, and Qiuting Huang. 2010. A 390Mb/s 3.57 mm² 3GPP-LTE turbo decoder ASIC in 0.13µm CMOS. In 2010 IEEE International Solid-State Circuits Conference-(ISSCC).

[13] Christoph Studer, Schekeb Fateh, and Dominik Seethaler. 2011. ASIC implementation of soft-input soft-output MIMO detection using MMSE parallel interference cancellation. IEEE Journal of Solid-State Circuits 46, 7 (2011), 1754–1765.

[14] Tomoya Suzuki, Hideki Yamada, Toshiyuki Yamagishi, Daisuke Takeda, Koji Horisaki, Tom Vander Aa, Toshio Fujisawa, Liesbet Perre, and Yasuo Unekawa. 2011. High-Throughput, Low-Power Software-Defined Radio Using Reconfigurable Processors. IEEE Micro 31, 6 (2011), 19–28.

[15] Y-P Eric Wang, Xingqin Lin, Ansuman Adhikary, Asbjörn Grövlen, Yutao Sui, Yufei Blankenship, Johan Bergman, and Hazhir S Razaghi. 2016. A Primer on 3GPP Narrowband Internet of Things (NB-IoT). arXiv preprint arXiv:1606.04171 (2016).

