Date post: | 21-Dec-2015 |
University of Michigan, Electrical Engineering and Computer Science
From SODA to Scotch: The Evolution of a Wireless Baseband Processor
Mark Woh, Yuan Lin, Sangwon Seo, Scott Mahlke, Trevor Mudge (University of Michigan - Ann Arbor)
Chaitali Chakrabarti (Arizona State University)
Richard Bruce, Danny Kershaw, Alastair Reid, Mladen Wilder, Krisztian Flautner (ARM Ltd.)
From SODA to Scotch: What is this talk about?
• Is a fully programmable 3G baseband processor commercially viable?
► The SODA processor was the first full research design [ISCA06]
► ARM R&D developed the Ardbeg SDR commercial prototype
• What we will present
► Comparison study between SODA and Ardbeg
► Lessons learned in the evolution
Mobile Computing
• In 2007, world-wide mobile telephone subscriptions: 3.3 billion [1]
► ~Half of the world’s population
► Some countries have mobile penetration over 100%
► Largest consumer electronic device in terms of volume
• Wireless multimedia anywhere at anytime
Cell phones are getting more complex
PCs are getting more mobile
[1] “Global cellphone penetration reaches 50 pct”, Reuters, Nov. 29th, 2007
Wireless Communication
[Figure: wireless protocols arranged by network range (Personal Area Network, Local Area Network, Wide Area Network, Global Network). Protocols shown: Bluetooth, UWB, 802.11g, 802.11n, GSM, W-CDMA, DVB, GPS.]
Software Defined Radio
[Figure: handset block diagram. Blocks shown: Analog Frontend, Baseband Processor, Application Processors, Camera, Keypad, Display, Speaker, Microphone. Protocols shown: WCDMA, GPS, Bluetooth.]
Software Defined Radio
[Figure: the same handset block diagram, annotated with the protocol layers (Transport, Network, Link, MAC, PHY) and the hardware labels GPP and DSP + ASICs.]
Software Defined Radio
[Figure: the same handset block diagram, with an SDR Baseband Processor between the Analog Frontend (WCDMA, GPS, Bluetooth) and the Application Processors.]
Advantages of Soft Radio
• Design factor
► Protocol complexity
► Multi-mode operation
► Prototyping and bug fixes
• Cost factor
► Time-to-market
► Silicon area
► Higher volume
► Longevity of platform
[Figure: Bluetooth, UWB, 802.11g, 802.11n, GSM, W-CDMA, DVB, and GPS shown around a single SDR block.]
Mobile SDR Design Challenges
[Figure: peak performance (Gops) versus power (Watts) on log-log axes, with power-efficiency lines at 1, 10, and 100 Mops/mW and an arrow marking the direction of better power efficiency. Regions shown: General Purpose Processors, Embedded DSPs, High-end DSPs, and Mobile SDR Requirements. Labelled points: Pentium M, TI C6x, IBM Cell.]
SDR Design Objectives for 3G and WiFi
► Throughput requirements: 40+ Gops peak throughput
► Power budget: 100mW~500mW peak power
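The design objectives above imply a power-efficiency target that can be checked with simple arithmetic. The sketch below is illustrative only: it takes the lower bound of "40+ Gops" and the two ends of the stated power budget, and the function name `mops_per_mw` is a hypothetical helper, not something from the talk.

```python
# Worked arithmetic for the SDR design objectives.
# 1 Gops = 1000 Mops, so (Gops * 1000) / mW gives Mops/mW.

def mops_per_mw(gops: float, milliwatts: float) -> float:
    """Convert a throughput in Gops and a power in mW to Mops/mW."""
    return gops * 1000.0 / milliwatts

PEAK_GOPS = 40.0                   # lower bound of "40+ Gops" (assumption: use the bound)
POWER_BUDGET_MW = (100.0, 500.0)   # the two ends of the stated power budget

for power in POWER_BUDGET_MW:
    eff = mops_per_mw(PEAK_GOPS, power)
    print(f"{PEAK_GOPS:.0f} Gops at {power:.0f} mW -> {eff:.0f} Mops/mW")
```

Under these assumptions the target falls between 80 and 400 Mops/mW, which is the neighbourhood of the 100 Mops/mW efficiency line named in the figure.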
First Generation SDR Processor : SODA
• Our first attempt was the SODA processor
► Designed in 180nm technology
► Built with WCDMA and 802.11a in mind
► Sub-500mW operation estimated at 90nm
SODA
System:
• Heterogeneous multi-core architecture
• Multi-level scratchpad memories
PE:
• SIMD/Scalar/AGU LIW
• 32-lane 16-bit SIMD
• 16-bit scalar datapath
• Scalar-to-SIMD
• SIMD-to-scalar
• Iterative Perfect Shuffle Network
[Figure: SODA PE block diagram. Numbered groups: 1. wide SIMD, 2. Scalar, 3. Local memory, 4. AGU, 5. DMA. Blocks shown: 512-bit SIMD Reg. File, 512-bit SIMD ALU+Mult, SIMD Shuffle Network (SSN), Pred. Regs, Scalar RF, Scalar ALU, STV and VTS (SIMD to Scalar, VtoS) units, AGU RF, AGU ALU, L1 SIMD Data Memory, L1 Scalar Data Memory, L1 Program Memory, Controller, and DMA to the system bus, with EX and WB pipeline stages.]
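As a rough illustration of the "Iterative Perfect Shuffle Network" item in the PE list, the sketch below models a single perfect-shuffle pass over a vector of lanes and shows how repeated passes compose. It is a software model under stated assumptions, not the SODA hardware: the function names are hypothetical, and it assumes "iterative" means data can be passed through the shuffle more than once.

```python
def perfect_shuffle(lanes):
    """One perfect-shuffle pass: interleave the lower and upper halves."""
    n = len(lanes)
    if n % 2:
        raise ValueError("lane count must be even")
    half = n // 2
    out = []
    for lo, hi in zip(lanes[:half], lanes[half:]):
        out.extend((lo, hi))
    return out


def iterate_shuffle(lanes, passes):
    """Apply the perfect shuffle `passes` times in a row."""
    for _ in range(passes):
        lanes = perfect_shuffle(lanes)
    return lanes


print(perfect_shuffle(list(range(8))))   # [0, 4, 1, 5, 2, 6, 3, 7]

# On 32 lanes, five passes return every lane to its starting position,
# because lane p moves to 2*p mod 31 on each pass and 2**5 mod 31 == 1.
v = list(range(32))
print(iterate_shuffle(v, 5) == v)        # True
```

A single fixed pattern keeps the wiring simple; the cost is that some permutations need several passes.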
SODA Summary
[Figure: peak performance (Gops) versus power (Watts) on log-log axes, with power-efficiency lines at 1, 10, and 100 Mops/mW. Regions shown: General Purpose Processors, Embedded DSPs, High-end DSPs, and Mobile SDR requirements. Labelled points: SODA 180nm, SODA 90nm, TI C6x 90nm, Picochip 130nm, Sandbridge 90nm, NXP EVP 90nm (req. ASICs).]
Ardbeg SDR Processor
[Figure: Ardbeg PE and Ardbeg System block diagrams. PE groups: 1. wide SIMD, 2. Scalar & AGU, 3. Memory. PE blocks shown: 512-bit SIMD Reg. File, 512-bit SIMD Mult, 512-bit SIMD ALU with shuffle, SIMD Shuffle Network, 1024-bit SIMD ACC RF, SIMD Pred. ALU, Pred. RF, SIMD+Scalar Transf Unit, Scalar ALU+Mult, Scalar RF+ACC, AGU RF, AGU, L1 Data Memory, L1 Program Memory, L2 Memory, Controller, and interconnects, with EX and WB stages. System blocks shown: Control Processor, FEC Accelerator, PEs (L1 Mem and Execution Unit), DMAC, Peripherals, L2 Mem, a 64-bit AMBA 3 AXI Interconnect, and a 512-bit bus.]
Highlighted features:
► Application Specific Hardware
► Block Floating Point
► Combined Scalar/Vector Memory
► 8, 16, 32 bit fixed point support
► 128-lane 8-bit Banyan Network
► 3 Read/2 Write RF for VLIW
► Sparse Connected VLIW
► Multiple Data Address Accesses
► Fused Permute ALU operations
Evolution to Ardbeg : Lessons Learned
• Ardbeg achieved ~3x speedup overall at 30% lower power than SODA
• These improvements came from lessons learned in a series of design studies
• We will present a few of these studies
► 1) Benefit of Wide SIMD
► 2) VLIW on SIMD support
► 3) Support for Complex Shuffle Network
► 4) Application Specific Hardware
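To put the headline numbers in one place, the sketch below combines "~3x speedup" and "30% lower power" into relative energy and energy-delay figures. This is a back-of-the-envelope calculation, not a result from the talk: it assumes both numbers apply to the same workload, that the power figure is an average over the run, and that energy is power times time.

```python
# Relative energy and energy-delay product, with SODA normalized to 1.0.
SPEEDUP = 3.0          # "~3x speedup overall" (approximate, from the slide)
POWER_RATIO = 0.70     # "30% lower power" means 70% of SODA's power

relative_time = 1.0 / SPEEDUP
relative_energy = POWER_RATIO * relative_time   # energy = power * time
relative_edp = relative_energy * relative_time  # energy-delay product

print(f"relative time:   {relative_time:.3f}")
print(f"relative energy: {relative_energy:.3f}")
print(f"relative EDP:    {relative_edp:.3f}")
```

Under these assumptions the same work would cost roughly a quarter of the energy, which is why the later studies are reported in energy-delay terms.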
1) Benefiting from Wide SIMD
• Increasing SIMD width is still a good idea for SDR
• But area becomes a big concern
► 32-wide 16-bit SIMD at 90nm seems a good fit
[Figure: normalized energy-delay product and normalized area versus SIMD width (8, 16, 32, 64).]
2) VLIW Support for Wide SIMD
• VLIW execution on top of the SIMD datapath
• 3 read ports, 2 write ports
► Shared between SIMD units
► 2-issue SIMD LIW
► Only support the most frequently used SIMD op pairs
[Figure: Ardbeg SIMD pipeline. Blocks shown: 32-lane SIMD ALU, SIMD RF, 128-lane SSN, SIMD-scalar transfer unit, scalar RF, 16-bit ALU, AGU, data memory, and interconnects, with EX and WB stages.]
2) VLIW on SIMD Support
• There is a distinct set of instructions that frequently execute at the same time
• We want to take advantage of this to reduce the complexity of the VLIW

          Mem.  Arith.  Mult.  Shuffle  Trans.  Move  Comp.
Mem.      NA    High    High   Low      High    Low   Low
Arith.    --    NA      Mid    High     Mid     Low   Low
Mult.     --    --      NA     Mid      High    High  Low
Shuffle   --    --      --     NA       Mid     Low   Low
Trans.    --    --      --     --       NA      Low   Low
Move      --    --      --     --       --      NA    Low
Comp.     --    --      --     --       --      --    NA
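The pairing table above can be encoded directly and queried for the pairs worth supporting in a restricted VLIW. The sketch below is one way to do that; the data structure and the `pairs_with` helper are hypothetical, and choosing only the "High" pairs is an assumption made for illustration, not the selection rule used for Ardbeg.

```python
OPS = ["Mem", "Arith", "Mult", "Shuffle", "Trans", "Move", "Comp"]

# Upper triangle of the table: for each row, the entries to the right
# of its NA diagonal cell, in column order.
UPPER = {
    "Mem":     ["High", "High", "Low", "High", "Low", "Low"],
    "Arith":   ["Mid", "High", "Mid", "Low", "Low"],
    "Mult":    ["Mid", "High", "High", "Low"],
    "Shuffle": ["Mid", "Low", "Low"],
    "Trans":   ["Low", "Low"],
    "Move":    ["Low"],
    "Comp":    [],
}


def pairs_with(level):
    """Return all (op_a, op_b) pairs whose table entry equals `level`."""
    result = []
    for i, a in enumerate(OPS):
        for b, freq in zip(OPS[i + 1:], UPPER[a]):
            if freq == level:
                result.append((a, b))
    return result


high_pairs = pairs_with("High")
for a, b in high_pairs:
    print(f"{a} + {b}")
```

Only 6 of the 21 possible pairs are rated High, which is the observation that motivates supporting a small set of op pairs rather than arbitrary dual issue.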
2) VLIW on SIMD Support
[Figure: energy-delay product for FIR, CFIR, FFT Radix-2, FFT Radix-4, Viterbi K7, Viterbi K9, and their average, comparing four register-file configurations: 2 Read/2 Write (Single Issue), 3 Read/2 Write (Ardbeg), 4 Read/4 Write (Any two SIMD ops), and 6 Read/5 Write (Any three SIMD ops).]
• In most cases, 3 Read/2 Write is the best overall design point
3) Support for Shuffle Network
• 7-stage single-cycle SSN
► Banyan network
► 128-lane 8-bit (64-lane 16-bit)
[Figure: a 2-stage 16-lane Banyan network, shown alongside the Ardbeg SIMD pipeline: 32-lane SIMD ALU, SIMD RF, 128-lane SSN, SIMD-scalar transfer unit, scalar RF, 16-bit ALU, AGU, SIMD data memory, and interconnects, with EX and WB stages.]
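The stage count quoted for the shuffle network matches the usual property of banyan-class networks: N lanes need log2(N) stages of 2x2 switches, so 128 lanes need 7. The sketch below models a butterfly network, one member of that family, purely as an illustration. The topology, the control encoding, and the function names are assumptions and are not taken from the Ardbeg design.

```python
def num_stages(lanes):
    """Number of 2x2 switch stages for a power-of-two lane count."""
    stages = lanes.bit_length() - 1
    if lanes < 2 or (1 << stages) != lanes:
        raise ValueError("lane count must be a power of two >= 2")
    return stages


def butterfly_route(data, controls):
    """Route data through a butterfly network.

    controls[s][i] is True when the switch joining lane i and its
    partner i ^ (1 << s) at stage s is set to 'swap'. Only the entry
    for the lower-numbered lane of each pair is consulted.
    """
    lanes = list(data)
    for s, stage_ctrl in enumerate(controls):
        nxt = list(lanes)
        for i in range(len(lanes)):
            partner = i ^ (1 << s)
            if i < partner and stage_ctrl[i]:
                nxt[i], nxt[partner] = lanes[partner], lanes[i]
        lanes = nxt
    return lanes


print(num_stages(128))   # 7

# With every switch set to swap, lane i ends up at position i ^ 7,
# which reverses an 8-lane vector.
all_swap = [[True] * 8 for _ in range(num_stages(8))]
print(butterfly_route(range(8), all_swap))   # [7, 6, 5, 4, 3, 2, 1, 0]
```

Each lane passes through one switch per stage, so the path length grows with log2(N) rather than N, while the control bits still allow many distinct permutations in a single pass.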
3) Support for Shuffle Network
[Figure: energy-delay product for 64pt FFT Radix-2, 2048pt FFT Radix-2, 64pt FFT Radix-4, 2048pt FFT Radix-4, and Viterbi K9, comparing four shuffle networks: 32 Wide Perfect, 64 Wide Perfect, 64 Wide Crossbar, and 64 Wide Banyan.]
• The 64-wide Banyan network has energy close to that of a simple iterative interconnect, with crossbar-like performance
4) Application Specific Optimizations
• Application specific hardware
► Turbo coprocessor
► Block-floating point support
► Fused Permute-ALU operations
► Interleaving support
• Trade-off programmability for performance
► Less “soft” than SODA
► But more energy efficient for common operations
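For readers unfamiliar with block floating point, the sketch below shows the general idea in software: a block of fixed-point values shares one exponent so that every value fits a narrow mantissa. This is a generic textbook-style scheme written for illustration. The function names and the 16-bit mantissa default are assumptions, and nothing here describes how the Ardbeg hardware implements the feature.

```python
def to_block_floating_point(block, mantissa_bits=16):
    """Scale a block of integers to signed mantissas with one shared exponent."""
    limit = 1 << (mantissa_bits - 1)   # mantissas must lie in [-limit, limit - 1]
    peak = max((abs(x) for x in block), default=0)
    exponent = 0
    while (peak >> exponent) >= limit:
        exponent += 1
    # Arithmetic right shift: low-order bits are discarded (rounds toward -inf).
    mantissas = [x >> exponent for x in block]
    return mantissas, exponent


def from_block_floating_point(mantissas, exponent):
    """Undo the scaling; the bits discarded earlier are not recovered."""
    return [m << exponent for m in mantissas]


block = [100000, -50000, 1234, 7]
mantissas, exponent = to_block_floating_point(block)
print(mantissas, exponent)                              # [25000, -12500, 308, 1] 2
print(from_block_floating_point(mantissas, exponent))   # [100000, -50000, 1232, 4]
```

The largest value sets the exponent, so small values in the same block lose precision. That trade is usually acceptable in kernels such as FFTs, where the values in a block tend to have similar magnitudes.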
4) Application Specific Optimizations
• Some kernels are common among many different protocols
► Many protocols use the same error correction algorithms
• The Turbo coprocessor targets one of them
► Tradeoff between programmable and ASIC implementations
• An ASIC implementation is around 5x more efficient than a programmable implementation
► SODA PE: 2Mbps with 111mW in 90nm
► ASIC: 2Mbps with 21mW in 90nm
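The "around 5x" figure follows from the two data points on the slide. The sketch below works it through as a power ratio and as energy per bit; the only assumption is that both implementations sustain the stated 2 Mbps, and the helper name is hypothetical.

```python
def nanojoules_per_bit(power_mw: float, throughput_mbps: float) -> float:
    """Energy per bit in nJ: (mW * 1e-3 W) / (Mbps * 1e6 bit/s) * 1e9 nJ/J."""
    return power_mw / throughput_mbps


SODA_MW, ASIC_MW = 111.0, 21.0   # from the slide, both in 90nm
THROUGHPUT_MBPS = 2.0            # same throughput for both

soda_nj = nanojoules_per_bit(SODA_MW, THROUGHPUT_MBPS)
asic_nj = nanojoules_per_bit(ASIC_MW, THROUGHPUT_MBPS)
ratio = SODA_MW / ASIC_MW

print(f"SODA PE: {soda_nj:.1f} nJ/bit")   # 55.5 nJ/bit
print(f"ASIC:    {asic_nj:.1f} nJ/bit")   # 10.5 nJ/bit
print(f"ratio:   {ratio:.2f}x")           # 5.29x
```

Because the throughput is the same in both cases, the power ratio and the energy-per-bit ratio are identical.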
Overall Improvements
[Figure: Ardbeg speedup over SODA per kernel, broken down into Baseline SODA, SIMD ALU, SIMD Shuffle, VLIW, and Compiler Optimization, with one bar marked 7x. Kernel groups, each with an average: Filtering (FIR and CFIR with 16, 33, and 65 taps); Modulation (FFT Rx2 64pt and 2048pt, FFT Rx4 64pt and 2048pt, QAM4, QAM16, QAM64, Despreader, Descrambler, Combiner); Synchronization (W-CDMA Searcher, 802.11a Interpolator, DVB-T Equalizer, DVB-T Chan. Est.); Error Correction (Viterbi K7, Viterbi K9, Bit Intlv 3, Bit Intlv 6, Interleaver).]
• Achieves ~1.5-7x speedup over SODA for wireless algorithms
Summary of Ardbeg
[Figure: achieved throughput (Mbps) versus power (Watts) on log-log axes for SODA, ASIC, Sandblaster, TigerSHARC, and Pentium M. Labelled points: 802.11a, W-CDMA 2Mbps, W-CDMA data, and W-CDMA voice, including 180nm variants of 802.11a and W-CDMA 2Mbps.]
• Power vs throughput for protocols on different processors
Summary of Ardbeg
[Figure: the same throughput (Mbps) versus power (Watts) plot with Ardbeg added alongside SODA, ASIC, Sandblaster, TigerSHARC, and Pentium M. Labelled points now also include DVB-H and DVB-T, in addition to 802.11a, W-CDMA 2Mbps, W-CDMA data, and W-CDMA voice.]
• Ardbeg is lower power at the same throughput
• We are getting closer to ASICs
Conclusion
• SODA to Ardbeg
► Overall ~1.5-7x improvement across multiple wireless algorithms
► 30% less power than SODA (with turbo also in software)
• Fully programmable research design evolved to a commercial design that is “less soft”
• Feasible to design programmable solutions that start to approach ASIC efficiency
► ASICs are locally optimal for single kernels but, combined, create an inefficient system
• Programmability allows time multiplexing of hardware: less hardware, same amount of work