University of Michigan, Electrical Engineering and Computer Science

From SODA to Scotch: The Evolution of a Wireless Baseband Processor

Mark Woh (University of Michigan - Ann Arbor)
Yuan Lin (University of Michigan - Ann Arbor)
Sangwon Seo (University of Michigan - Ann Arbor)
Scott Mahlke (University of Michigan - Ann Arbor)
Trevor Mudge (University of Michigan - Ann Arbor)
Chaitali Chakrabarti (Arizona State University)
Richard Bruce (ARM Ltd.)
Danny Kershaw (ARM Ltd.)
Alastair Reid (ARM Ltd.)
Mladen Wilder (ARM Ltd.)
Krisztian Flautner (ARM Ltd.)

From SODA to Scotch : What is this talk about?

• Is a fully programmable 3G baseband processor commercially viable?

► The SODA processor was the first full research design [ISCA06]

► ARM R&D developed the Ardbeg SDR commercial prototype

• What we will present

► Comparison study between SODA and Ardbeg

► Lessons learned in the evolution



Mobile Computing

• In 2007, worldwide mobile telephone subscriptions: 3.3 billion [1]

► ~Half of the world’s population
► Some countries have mobile penetration over 100%
► Largest consumer electronic device in terms of volume

• Wireless multimedia anywhere at anytime

Cell phones are getting more complex

PCs are getting more mobile

[1] “Global cellphone penetration reaches 50 pct”, Reuters, Nov. 29th, 2007



Wireless Communication

[Figure: wireless standards grouped by network range. Personal Area Network: Bluetooth, UWB. Local Area Network: 802.11g, 802.11n. Wide Area Network: GSM, W-CDMA. Global Network: DVB, GPS.]



Software Defined Radio

[Figure: handset block diagram. Blocks: GPS, Bluetooth, WCDMA, analog frontend, baseband processor, application processors, camera, keypad, display, speaker, microphone.]

[Figure: the same diagram annotated with the protocol layers (MAC, Link, Network, Transport, PHY) and the hardware that implements them (GPP, DSP + ASICs).]

[Figure: the same diagram with an SDR baseband processor next to the application processors.]



Advantages of Soft Radio

• Design factor
► Protocol complexity
► Multi-mode operation
► Prototyping and bug fixes

• Cost factor
► Time-to-market
► Silicon area
► Higher volume
► Longevity of platform

[Figure: Bluetooth, UWB, 802.11g, 802.11n, GSM, W-CDMA, DVB, and GPS all served by a single SDR platform.]



Mobile SDR Design Challenges

[Figure: peak performance (Gops, 1 to 1000) versus power (Watts, 0.1 to 100), with power-efficiency lines at 1, 10, and 100 Mops/mW. Regions: general purpose processors, embedded DSPs, high-end DSPs, mobile SDR requirements. Labeled points: Pentium M, TI C6x, IBM Cell.]

SDR Design Objectives for 3G and WiFi

Throughput requirements: 40+ Gops peak throughput

Power budget: 100mW~500mW peak power
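
The two design objectives above imply a target power efficiency, which is what the Mops/mW lines in the plot express. A minimal sketch of that arithmetic follows; the function name is illustrative, and 40 Gops is taken as the lower bound of the "40+" requirement.

```python
def mops_per_mw(gops: float, milliwatts: float) -> float:
    """Convert a throughput in Gops and a power in mW to Mops/mW."""
    return gops * 1000.0 / milliwatts  # 1 Gops = 1000 Mops


peak_gops = 40.0  # lower bound of the stated "40+ Gops" requirement

for budget_mw in (100.0, 500.0):
    efficiency = mops_per_mw(peak_gops, budget_mw)
    print(f"{budget_mw:.0f} mW budget -> {efficiency:.0f} Mops/mW")
```

Under these assumptions the target falls between 80 and 400 Mops/mW, which lies around or above the 100 Mops/mW line in the plot.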



First Generation SDR Processor : SODA

• Our first attempt was the SODA processor

► Designed in 180nm technology
► Built with WCDMA and 802.11a in mind
► Sub-500mW operation estimated at 90nm



SODA

System:
• Heterogeneous multi-core architecture
• Multi-level scratchpad memories

PE:
• SIMD/Scalar/AGU LIW
• 32-lane 16-bit SIMD
• 16-bit scalar datapath
• Scalar-to-SIMD
• SIMD-to-scalar
• Iterative Perfect Shuffle Network

[Figure: SODA PE block diagram. 1. Wide SIMD: 512-bit SIMD Reg. File, 512-bit SIMD ALU+Mult, SIMD Shuffle Network (SSN), Pred. Regs, SIMD to Scalar (VtoS). 2. Scalar: Scalar RF, Scalar ALU. 3. Local memory: L1 SIMD Data Memory, L1 Scalar Data Memory, L1 Program Memory. 4. AGU: AGU RF, AGU ALU. 5. DMA, connected to the system bus. Other blocks: STV, VTS, Controller.]
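
To illustrate what an iterative perfect shuffle does, here is a minimal sketch of the generic perfect-shuffle permutation on a 32-lane vector. It is not a model of the SODA SSN hardware; the function names are illustrative. An iterative network reaches more complex permutations by applying a simple pass like this several times.

```python
def perfect_shuffle(lanes):
    """Interleave the lower and upper halves of a vector: a0 b0 a1 b1 ..."""
    half = len(lanes) // 2
    out = []
    for lo, hi in zip(lanes[:half], lanes[half:]):
        out.extend((lo, hi))
    return out


def iterate_shuffle(lanes, passes):
    """Apply the perfect shuffle `passes` times in a row."""
    for _ in range(passes):
        lanes = perfect_shuffle(lanes)
    return lanes


lanes = list(range(32))
once = perfect_shuffle(lanes)
print(once[:8])
print(iterate_shuffle(lanes, 5) == lanes)
```

Each pass rotates the 5-bit lane index by one bit, so five passes over 32 lanes restore the original order.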



SODA Summary

[Figure: peak performance (Gops) versus power (Watts) with power-efficiency lines at 1, 10, and 100 Mops/mW. Regions: general purpose processors, embedded DSPs, high-end DSPs, mobile SDR requirements. Labeled points: SODA 180nm, SODA 90nm, TI C6x 90nm, Picochip 130nm, Sandbridge 90nm, NXP EVP 90nm. Annotation: "req. ASICs".]



Ardbeg SDR Processor

[Figure: Ardbeg PE block diagram. 1. Wide SIMD: 512-bit SIMD Reg. File, 512-bit SIMD Mult, 512-bit SIMD ALU with shuffle, SIMD Shuffle Network, 1024-bit SIMD ACC RF, SIMD Pred. ALU, Pred. RF, SIMD+Scalar Transf Unit. 2. Scalar & AGU: Scalar ALU+Mult, Scalar RF+ACC, AGU RF, AGU. 3. Memory: L1 Data Memory, L2 Memory, L1 Program Memory. Other blocks: Controller, interconnects, scalar and SIMD wdata paths.]

[Figure: Ardbeg system. Control Processor, FEC Accelerator, PEs (each with L1 Mem and Execution Unit), DMAC, Peripherals, L1 Mem, L2 Mem, 512-bit bus, 64-bit AMBA 3 AXI interconnect.]

Callouts:
• Application Specific Hardware
• Block Floating Point
• Combined Scalar/Vector Memory
• 8, 16, 32 bit fixed point support
• 128-lane 8-bit Banyan Network
• 3 Read/2 Write RF for VLIW
• Sparse Connected VLIW
• Multiple Data Address Accesses
• Fused Permute ALU operations



Evolution to Ardbeg : Lessons Learned

• Ardbeg achieved ~3x speedup overall at 30% lower power than SODA

• These improvements came from lessons learned across a series of design studies

• We will present a few of these studies
► 1) Benefit of Wide SIMD
► 2) VLIW on SIMD support
► 3) Support for Complex Shuffle Network
► 4) Application Specific Hardware



1) Benefiting from Wide SIMD

• Increasing SIMD width is still a good idea for SDR
• But area becomes a big concern

► 32-wide 16-bit SIMD at 90nm seems a good fit

[Figure: normalized energy-delay product and normalized area versus SIMD width (8, 16, 32, 64).]



2) VLIW Support for Wide SIMD

• VLIW execution on top of the SIMD datapath

• 3 read ports, 2 write ports

► Shared between SIMD units
► 2-issue SIMD LIW
► Only supports the most frequently used SIMD op pairs

[Figure: Ardbeg PE datapath with the SIMD units highlighted. Blocks: 32-lane SIMD ALU, SIMD RF, 128-lane SSN, SIMD-scalar transfer unit, scalar RF, 16-bit scalar ALU, AGUs, data memory, interconnects.]



2) VLIW on SIMD Support

• A distinct set of instruction pairs frequently executes at the same time

• We want to take advantage of this in order to reduce the complexity of the VLIW

Co-execution frequency of SIMD instruction class pairs:

          Mem.   Arith.  Mult.  Shuffle  Trans.  Move   Comp.
Mem.      NA     High    High   Low      High    Low    Low
Arith.    --     NA      Mid    High     Mid     Low    Low
Mult.     --     --      NA     Mid      High    High   Low
Shuffle   --     --      --     NA       Mid     Low    Low
Trans.    --     --      --     --       NA      Low    Low
Move      --     --      --     --       --      NA     Low
Comp.     --     --      --     --       --      --     NA
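
The pair table above can be read as data: the pairs rated "High" are the natural candidates for a sparsely connected 2-issue VLIW. A minimal sketch of that selection follows. The slides do not state exactly which pairs Ardbeg supports, so treating the "High" entries as the supported set is an assumption, and all names in the code are illustrative.

```python
CLASSES = ["Mem", "Arith", "Mult", "Shuffle", "Trans", "Move", "Comp"]

# Upper triangle of the table, row by row (diagonal "NA" entries omitted).
UPPER = {
    "Mem":     ["High", "High", "Low", "High", "Low", "Low"],
    "Arith":   ["Mid", "High", "Mid", "Low", "Low"],
    "Mult":    ["Mid", "High", "High", "Low"],
    "Shuffle": ["Mid", "Low", "Low"],
    "Trans":   ["Low", "Low"],
    "Move":    ["Low"],
    "Comp":    [],
}


def pairs_with(level):
    """Return the (row, column) class pairs rated at the given level."""
    found = []
    for i, row in enumerate(CLASSES):
        for col, freq in zip(CLASSES[i + 1:], UPPER[row]):
            if freq == level:
                found.append((row, col))
    return found


high_pairs = pairs_with("High")
total_pairs = sum(len(row) for row in UPPER.values())
print(f"{len(high_pairs)} of {total_pairs} pairs are rated High:")
for pair in high_pairs:
    print("  ", pair)
```

Only a small share of the possible pairs is rated "High", which is why supporting a restricted set of op pairs can keep most of the benefit of dual issue.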



2) VLIW on SIMD Support

• 3 Read/2 Write is, in most cases, the best overall design point

[Figure: energy-delay product (0 to 1.2) for FIR, CFIR, FFT Radix-2, FFT Radix-4, Viterbi K7, Viterbi K9, and their average. Configurations: 2 Read/2 Write (single issue), 3 Read/2 Write (Ardbeg), 4 Read/4 Write (any two SIMD ops), 6 Read/5 Write (any three SIMD ops).]



3) Support for Shuffle Network

• 7-stage single-cycle SSN
► Banyan network
► 128-lane 8-bit (64-lane 16-bit)

[Figure: 2 stage 16-lane Banyan network, shown next to the Ardbeg PE datapath with the SIMD shuffle network highlighted. Blocks: 32-lane SIMD ALU, SIMD RF, 128-lane SSN, scalar RF, 16-bit scalar ALU, SIMD data memory, AGU.]



3) Support for Shuffle Network

• The 64-wide Banyan network comes close to the energy of a simple iterative interconnect while delivering crossbar-like performance

[Figure: energy-delay product (0 to 1.2) for 64pt FFT Radix-2, 2048pt FFT Radix-2, 64pt FFT Radix-4, 2048pt FFT Radix-4, and Viterbi K9. Configurations: 32 Wide Perfect, 64 Wide Perfect, 64 Wide Crossbar, 64 Wide Banyan.]
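
As a rough illustration of why a Banyan network sits between an iterative shuffle and a crossbar, the sketch below counts stages and switching elements. It assumes a Banyan built from 2x2 switches, which is consistent with the 7 stages quoted for 128 lanes; the function names are illustrative and the counts say nothing about actual area or energy.

```python
def banyan_cost(lanes):
    """Return (stages, 2x2 switches) for a Banyan network with `lanes` inputs."""
    stages = lanes.bit_length() - 1  # log2(lanes) for a power of two
    if lanes < 2 or 2 ** stages != lanes:
        raise ValueError("lane count must be a power of two")
    return stages, stages * lanes // 2  # lanes/2 switches in every stage


def crossbar_cost(lanes):
    """Return the number of crosspoints in a full crossbar."""
    return lanes * lanes


for lanes in (64, 128):
    stages, switches = banyan_cost(lanes)
    print(f"{lanes} lanes: {stages} stages, {switches} switches, "
          f"crossbar needs {crossbar_cost(lanes)} crosspoints")
```

For 128 lanes this gives 7 stages and 448 switches against 16384 crossbar crosspoints.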



4) Application Specific Optimizations

• Application specific hardware
► Turbo coprocessor
► Block-floating point support
► Fused Permute-ALU operations
► Interleaving support

• Trade-off programmability for performance
► Less “soft” than SODA
► But more energy efficient for common operations
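
For readers unfamiliar with block floating point, here is a minimal software sketch of the general idea: a block of samples shares one exponent, so each sample only needs a fixed-point mantissa. This is a generic illustration, not Ardbeg's hardware scheme; the function names and the 16-bit mantissa width are assumptions, and the sketch only scales down (exponent of zero or more).

```python
def to_block_float(samples, mantissa_bits=16):
    """Quantize a block to integer mantissas with one shared exponent."""
    limit = 2 ** (mantissa_bits - 1) - 1  # largest signed mantissa
    peak = max(abs(s) for s in samples)
    exponent = 0
    # Grow the shared exponent until the largest sample fits in the mantissa.
    while peak / (2 ** exponent) > limit:
        exponent += 1
    mantissas = [int(round(s / 2 ** exponent)) for s in samples]
    return mantissas, exponent


def from_block_float(mantissas, exponent):
    """Rebuild approximate sample values from mantissas and the shared exponent."""
    return [m * 2 ** exponent for m in mantissas]


block = [100000.0, -50000.0, 12.0]
mantissas, exponent = to_block_float(block)
print(mantissas, exponent)
print(from_block_float(mantissas, exponent))
```

The whole block is scaled by the same power of two, so small samples in a block with a large peak lose low-order bits; that is the precision trade-off of the format.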



4) Application Specific Optimizations

• Some kernels are common among many different protocols

► Many protocols use the same error correction algorithms

• Turbo decoding is one of them
► Trade-off between programmable and ASIC implementations

• An ASIC implementation is around 5x more efficient than a programmable implementation

► SODA PE: 2Mbps with 111mW in 90nm
► ASIC: 2Mbps with 21mW in 90nm
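
The "around 5x" figure follows directly from the two power numbers at equal throughput. A minimal sketch of the arithmetic, expressed as energy per decoded bit (the function name is illustrative):

```python
def energy_per_bit_nj(power_mw, throughput_mbps):
    """Energy per bit in nJ: mW divided by Mbps gives nJ/bit."""
    return power_mw / throughput_mbps


soda = energy_per_bit_nj(111.0, 2.0)  # SODA PE, 90nm
asic = energy_per_bit_nj(21.0, 2.0)   # ASIC, 90nm
ratio = soda / asic

print(f"SODA PE: {soda} nJ/bit")
print(f"ASIC:    {asic} nJ/bit")
print(f"ASIC advantage: {ratio:.1f}x")
```

With the numbers from the slide, the ASIC advantage works out to about 5.3x.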



Overall Improvements

• Achieves between ~1.5-7x speedup for wireless algorithms compared to SODA

[Figure: Ardbeg speedup over SODA (axis 0 to 4.5, with one bar annotated 7x), broken down into Baseline SODA, SIMD ALU, SIMD Shuffle, VLIW, and Compiler Optimization.
Filtering: FIR 16-taps, FIR 33-taps, FIR 65-taps, CFIR 16-taps, CFIR 33-taps, CFIR 65-taps, Average.
Modulation: FFT Rx2 64pt, FFT Rx2 2048pt, FFT Rx4 64pt, FFT Rx4 2048pt, QAM4, QAM16, QAM64, Despreader, Descrambler, Combiner, Average.
Synchronization: W-CDMA Searcher, 802.11a Interpolator, DVB-T Equalizer, DVB-T Chan. Est., Average.
Error Correction: Viterbi K7, Viterbi K9, Bit Intlv 3, Bit Intlv 6, Interleaver, Average.]



Summary of Ardbeg

• Power vs Throughput for protocols on different processors

[Figure: achieved throughput (Mbps, 0.01 to 100) versus power (Watts, 0.01 to 1000). Processors: SODA, ASIC, Sandblaster, TigerSHARC, Pentium M. Protocol points: 802.11a, W-CDMA 2Mbps, W-CDMA data, W-CDMA voice, plus 180nm points for 802.11a and W-CDMA 2Mbps.]



Summary of Ardbeg

• Ardbeg is lower power at same throughput
• We are getting closer to ASICs

[Figure: achieved throughput (Mbps, 0.01 to 100) versus power (Watts, 0.01 to 1000). Processors: Ardbeg, SODA, ASIC, Sandblaster, TigerSHARC, Pentium M. Protocol points: W-CDMA 2Mbps, DVB-H, DVB-T, 802.11a, W-CDMA data, W-CDMA voice, plus 180nm points for 802.11a and W-CDMA 2Mbps.]



Conclusion

• SODA to Ardbeg

► Overall ~1.5-7x improvement across multiple wireless algorithms

► 30% less power than SODA (with turbo also in software)

• Fully programmable research design evolved to a commercial design that is “less soft”

• Feasible to design programmable solutions that start to approach ASIC efficiency

► ASICs are locally optimal for single kernels but combined create an inefficient system

• Programmability allows time multiplexing of hardware = Less hardware, same amount of work



Questions?

Thanks!
