RICE UNIVERSITY
SWAPs: Re-thinking mobile and base-station architectures
Sridhar Rajagopal
VLSI Signal Processing Group
Center for Multimedia Communication
Department of Electrical and Computer Engineering
Rice University, Houston TX 77005
March 23, 2003
This work has been supported in part by Nokia, TI, TATP and NSF
Future wireless devices:
High data rate mobile devices with multimedia
Seamless connection across environments and standards
Use the fastest and cheapest available service
Bluetooth/Home Networks
Wireless Cellular
Wireless LAN
Aim of the talk
How do I build such a device?
  Challenges
  Constraints
  Solutions
Trend comparisons
              Past                   Current                Future
Year          1990’s                 2002-2005              2006+
Function      Voice                  Data                   Multimedia
Data rates    10’s of Kbps           100’s of Kbps (10x)    10’s of Mbps (10-100x)
Complexity    KOPs                   MOPs (1000x)           GOPs (1000x)
Power         < 500 mW               < 500 mW               < 500 mW
Antennas      Single                 Single                 Multiple
Standard      GSM (Europe),          GSM/TDMA/CDMA          GSM/TDMA/CDMA/EDGE/
              CDMA (Qualcomm),       on same device         Wireless LAN/Bluetooth
              TDMA (Nokia)                                  on same device
              (different devices)
Change in flexibility requirements
Application Layer: no change (already flexible)
Network Layer
MAC Layer
Physical Layer: maximum change (needs to support multiple environments, algorithms and standards)
Summary of Challenges for
Sophisticated algorithms (GOPs of computation)
  10’s of Mbps, < 500 mW
Flexibility required at physical layer
  Multiple algorithms, multiple standards, multiple environments
What we would also like:
  Time to market
  Rapid evaluation and implementation
  Scalable architecture design methodologies
Physical layer of a receiver
[Figure: receiver chain. Antenna, RF front-end, then baseband processing (channel estimation, detection, decoding), then higher (MAC/Network/OS) layers.]
Receiver more complex than transmitter
Physical layer architecture
[Figure: handset block diagram. Labels: Analog RF; Analog Baseband; Audio; A/D; D/A; Digital Baseband; DSP; ASICs; controller.]
Source: M. L. McMahan, Evolving Cellular Handset Architectures but a Continuing, Insatiable Desire for DSP MIPs, TI Report SPRA650, March 2000
Architecture trade-offs
Past: more DSP + less ASIC
Current “proposed” solutions: less DSP + more ASICs
  Reason: DSPs not powerful enough
Can’t we build better DSPs?
[Figure: ASIC solutions, intermediate solutions and programmable solutions placed on axes of area-time-power performance versus flexibility.]
Can this methodology scale for
Baseband increasingly important for real-time and power
Need much more flexibility
  Environment-specific sophisticated algorithms
  Cannot keep adding co-processors: that loses the flexibility of a programmable solution
1 Mbps with a 100 MHz processor
  100 cycles per bit to do all your work (GOPs/bit)
Power consumption rises with bigger color displays, video and more complex algorithms
  May have only ~100 mW for baseband
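The cycle budget on this slide is simple arithmetic. A minimal Python sketch follows; the function names and the 1 GOPs workload are illustrative assumptions, not figures from the talk.

```python
def cycles_per_bit(clock_hz, bit_rate_bps):
    """Clock cycles available to process one received bit."""
    return clock_hz / bit_rate_bps


def ops_per_cycle_needed(ops_per_second, clock_hz):
    """Operations that must complete every cycle to sustain a workload."""
    return ops_per_second / clock_hz


# Slide figures: 100 MHz processor, 1 Mbps data rate.
budget = cycles_per_bit(100_000_000, 1_000_000)

# Assumed workload of 1 GOPs (hypothetical) on the same 100 MHz clock.
parallelism = ops_per_cycle_needed(1_000_000_000, 100_000_000)

print(budget, parallelism)
```

Under these assumptions the processor has 100 cycles per bit and would need to retire about 10 operations every cycle, which is why the number of functional units becomes the limiting factor.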
Motivation
Now that we know the challenges and constraints,
Design me
design
How do we choose the right algorithms? The right amount of flexibility?
Do we build DSPs, ASICs, heterogeneous, reconfigurable?
  If ASICs, how to build better ASICs?
  If programmable, how to build better DSPs?
  If both, how do we mix them better?
Answers depend on the level of flexibility needed and on area-time-power architecture trade-offs
My contributions
“Low-complexity” algorithms for wireless:
  Parallel, fixed point algorithms for multiuser estimation and detection
ASIC design for wireless using computer arithmetic techniques:
  Dynamic truncation using on-line arithmetic
Programmable architecture design for wireless:
  Scalable Wireless Application-specific Processors (SWAPs)
Programmable architectures
Current DSPs
  Not enough functional units (FUs) for GOPs of computation
Cannot extend to more FUs
  Limited Instruction Level Parallelism (ILP)
  Limited Subword Parallelism (SP)
  Cannot support more registers (register area increases quadratically with FUs)
  Compilers: difficult to find ILP as FUs increase
Solution
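Exploit data parallelism (DP)
  Lots available in wireless algorithms
Example:
    int a[1024], b[1024], c[1024];    // 32 bits
    short d[1024], e[1024], f[1024];  // 16 bits, packed
    for (int i = 0; i < 1024; i++)    // DP: iterations are independent
    {
        a[i] = b[i] + c[i];           // ILP: the two statements are independent
        d[i] = e[i] + f[i];           // SP: two 16-bit adds fit in one 32-bit word
    }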
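The data-parallel and subword-parallel readings of the example loop can be imitated in software. The Python sketch below is only an illustration: `data_parallel_add` and `packed_add16` are hypothetical helper names, the clusters run one after another rather than concurrently, and the interleaved assignment of iterations to clusters is an assumption.

```python
def data_parallel_add(b, c, clusters):
    """DP: cluster k handles iterations k, k + clusters, k + 2*clusters, ..."""
    a = [0] * len(b)
    for cluster in range(clusters):            # conceptually concurrent
        for i in range(cluster, len(b), clusters):
            a[i] = b[i] + c[i]
    return a


def packed_add16(x, y):
    """SP: add two 32-bit words as two independent 16-bit lanes.

    Each lane wraps around on overflow; no carry crosses between lanes.
    """
    lo = (x + y) & 0xFFFF
    hi = ((x >> 16) + (y >> 16)) & 0xFFFF
    return (hi << 16) | lo


result = data_parallel_add(list(range(8)), [10] * 8, clusters=4)
packed = packed_add16(0x0001FFFF, 0x00010001)
print(result, hex(packed))
```

Because no iteration reads a value written by another, the result is the same for any cluster count, which is the property the slide relies on.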
DSP vs. SWAPs
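[Figure: a DSP is one cluster of functional units (+++***) with internal memory and exploits ILP. SWAPs replicate that cluster behind a stream register file, exploiting ILP inside each cluster and DP across clusters. Captions: DSP (1 cluster), SWAPs (max. clusters).]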
Builds on the Imagine media processor
[Figure: Imagine stream processor block diagram. Labels: host processor, stream controller, microcontroller, network interface, network, stream register file, ALU clusters 0 to 7, streaming memory system, four SDRAM banks.]
SWAPs trade-offs
Same internal memory size as DSPs
  Dependent on application, not architecture
Needs more area to support more functional units
  Area is less of a constraint than power
Varying levels of DP in applications
  Needs reconfiguration!
  Need to turn off unused clusters
More parallelism, so lower clock frequency, so lower voltage
  Low power (CV²f + leakage) in spite of larger area
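The parallelism-for-power argument can be checked with the dynamic power term CV²f alone. The Python sketch below is a rough model under stated assumptions: leakage is ignored, switched capacitance is assumed to grow linearly with the number of clusters, and the two supply voltages (1.5 V and 1.2 V) are used only as example inputs.

```python
def dynamic_power(c_eff, vdd, freq_hz):
    """Dynamic switching power, P = C * V^2 * f (leakage not modelled)."""
    return c_eff * vdd ** 2 * freq_hz


def scaled_power_ratio(n_clusters, v_single, v_parallel):
    """Power of n clusters at f/n and v_parallel, relative to 1 cluster at f and v_single."""
    c, f = 1.0, 1.0                      # normalized capacitance and frequency
    single = dynamic_power(c, v_single, f)
    parallel = dynamic_power(n_clusters * c, v_parallel, f / n_clusters)
    return parallel / single


ratio = scaled_power_ratio(n_clusters=8, v_single=1.5, v_parallel=1.2)
print(round(ratio, 2))
```

In this model the extra capacitance and the lower clock cancel, so the saving comes entirely from the voltage reduction, (1.2/1.5)² = 0.64 here; a real design would also have to account for the leakage term.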
Design methodology
[Figure: design methodology flow. Boxes: chain of receiver algorithms; low “complexity”, parallel, fixed point; high level language implementation; programmable implementation (example: Pentium, DSP, SWAPs); modular programmable architecture design; ASIC implementation; FPGA, customized, reconfigurable, heterogeneous implementations (example: H-SWAPs). Steps are numbered 1 to 8; edges carry area-time-power checks (“specs: no”) and “learn” labels.]
Choosing the right algorithms : theory
Algorithm research:
  Spectral efficiency
  Low power (RF)
Metrics:
  Bit error rate
  Frame error rate
[Figure: bit error rate (10^0 down to 10^-8) versus signal to noise ratio, with curves labelled Past, Current, Future and Theory.]
Choosing right algorithms : practice
Refine candidates from theory (using linear algebra / opt.)
  Lower “complexity”, parallel, fixed-point
Optimization:
  Area: A
  Time: B
  Power: A
  Energy: A/B
Multi-parameter optimization?
“Complexity”: #operations of equivalent type
[Figure: bar chart of execution time (axis 0 to 80) for Original, Candidate A and Candidate B, grouped by Complexity and by Complexity/Parallelism.]
Example : Parallel Viterbi Decoding
Add-Compare-Select (ACS): trellis interconnect
  Re-order for exploiting DP
  Parallelism depends on constraint length (#states)
Conventional traceback: sequential
  Use Register Exchange (RE) for parallel solution
Exploiting DP in a programmable architecture implies:
  Re-order ACS
  Re-order RE
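To make the ACS and register-exchange steps concrete, here is a small hard-decision Viterbi decoder in Python. It is a sketch under assumptions not taken from the talk: a rate 1/2, constraint length 3 code with generators (7, 5) octal, a Hamming branch metric, and an encoder that starts in state 0. The loop over next states has no dependence between iterations, which is the data parallelism the slide refers to.

```python
K = 3                       # constraint length (assumed example)
G = (0b111, 0b101)          # generator polynomials, (7, 5) octal (assumed)
N_STATES = 1 << (K - 1)


def parity(x):
    return bin(x).count("1") & 1


def branch(state, bit):
    """Return (next_state, output bits) when `bit` enters the encoder in `state`."""
    reg = (bit << (K - 1)) | state
    return reg >> 1, tuple(parity(reg & g) for g in G)


def encode(bits):
    state, out = 0, []
    for bit in bits:
        state, symbols = branch(state, bit)
        out.extend(symbols)
    return out


def acs_with_register_exchange(metric, paths, rx):
    """One trellis step: add-compare-select plus survivor copy for every state."""
    new_metric, new_paths = [], []
    for ns in range(N_STATES):                 # independent iterations (DP)
        bit = ns >> (K - 2)                    # input bit that leads into ns
        base = (ns & ((1 << (K - 2)) - 1)) << 1
        candidates = []
        for s in (base, base + 1):             # the two predecessor states
            _, out = branch(s, bit)
            dist = sum(a != b for a, b in zip(rx, out))   # add
            candidates.append((metric[s] + dist, s))
        best_metric, best_s = min(candidates)  # compare and select
        new_metric.append(best_metric)
        new_paths.append(paths[best_s] + [bit])  # register exchange
    return new_metric, new_paths


def decode(symbols):
    metric = [0] + [float("inf")] * (N_STATES - 1)   # start in state 0
    paths = [[] for _ in range(N_STATES)]
    for t in range(0, len(symbols), len(G)):
        metric, paths = acs_with_register_exchange(metric, paths, symbols[t:t + len(G)])
    best = min(range(N_STATES), key=lambda s: metric[s])
    return paths[best]


message = [1, 0, 1, 1, 0, 0, 1, 0]
decoded = decode(encode(message))
print(decoded == message)
```

Register exchange copies a whole survivor per state on every step, trading memory traffic for the removal of the sequential traceback pass.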
SWAP design
Decide how many clusters
  Exploit DP
  Look at the for-loop iteration count
Decide what to put within each cluster
  Maximize ILP with high functional unit efficiency
  Search design space
See how it meets time-area-power constraints
What goes inside a cluster?
[Figure: auto-exploration of adders and multipliers for kernel "acskc". Axes: #Adders (1 to 5), #Multipliers (1 to 5), instruction count (40 to 160); each point is annotated with FU utilization (+,*).]
Re-ordering for parallel Viterbi
[Figure: (a) Trellis, with states X(0) to X(15) in natural order; (b) Shuffled Trellis, with the states re-ordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15).]
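The even-then-odd ordering in the shuffled trellis can be generated and checked with a few lines of Python. This is a sketch under an assumed state numbering (a shift-register trellis in which next state j and next state j + N/2 both have predecessors 2j and 2j + 1); the figure itself only gives the ordering, so the function names and the numbering convention are assumptions.

```python
def shuffled_order(n_states):
    """Even-indexed states first, then odd-indexed states."""
    return list(range(0, n_states, 2)) + list(range(1, n_states, 2))


def predecessors(next_state, n_states):
    """Predecessors under the assumed shift-register numbering: 2j and 2j + 1."""
    j = next_state % (n_states // 2)
    return 2 * j, 2 * j + 1


def predecessor_positions(next_state, n_states):
    """Where the two predecessors sit after the shuffle."""
    position = {state: p for p, state in enumerate(shuffled_order(n_states))}
    return tuple(position[s] for s in predecessors(next_state, n_states))


order = shuffled_order(16)
print(order)
print(predecessor_positions(5, 16))
```

After the shuffle, the two metrics a state needs sit at positions j and j + N/2, so every ACS reads its inputs with the same fixed stride instead of following the irregular trellis interconnect.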
Viterbi reconfiguration
Packet 1: constraint length 7 (16 clusters)
Packet 2: constraint length 9 (64 clusters)
Packet 3: constraint length 5 (4 clusters)
DP changes per packet; unused clusters can be turned OFF
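The cluster counts on this slide are consistent with four trellis states per cluster, since a constraint length K code has 2^(K-1) states. The Python sketch below encodes that reading; the ratio of 4 and the 64-cluster maximum are inferred from the slide's numbers, and the function names are hypothetical.

```python
STATES_PER_CLUSTER = 4   # inferred: K = 7 has 64 states and uses 16 clusters
MAX_CLUSTERS = 64        # assumed machine size (the K = 9 case uses all of them)


def active_clusters(constraint_length):
    """Clusters needed to hold all 2^(K-1) trellis states."""
    states = 1 << (constraint_length - 1)
    return min(MAX_CLUSTERS, max(1, states // STATES_PER_CLUSTER))


def idle_clusters(constraint_length):
    """Clusters that could be turned off for this packet."""
    return MAX_CLUSTERS - active_clusters(constraint_length)


for k in (7, 9, 5):
    print(k, active_clusters(k), idle_clusters(k))
```

For the three packets this gives 16, 64 and 4 active clusters, leaving 48, 0 and 60 clusters idle respectively.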
How to reconfigure?
Move data to appropriate clusters and turn off unused clusters and SRF
  Significant loss in performance
  Maximum power savings
Use Conditional Streams
  Cannot turn off SRF, comm, scratchpad in clusters
  Minimal loss in performance
Use mux-demux buffers
  Can turn off clusters entirely: more power savings
  Minimal loss in performance
[Figure: kernels (computation) and memory accesses for three 64-bit packets. Packet 1: rate ½, constraint length 7. Packet 2: rate ½, constraint length 9. Packet 3: rate ½, constraint length 5.]
Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz
[Figure: frequency needed to attain real-time (in MHz, 10^0 to 10^3) versus number of clusters (10^0 to 10^2), log-log axes. Curves: actual K = 9, K = 7 and K = 5, for regular code and reconfigurable code.]
Viterbi decoding: Execution time
[Figure: execution time on log axes for actual K = 9, K = 7 and K = 5, with reference points for an ideal DSP C64x (w/o co-proc), the Virtex II FPGA* and 128 KHz (1 bit/cycle). Labels along the performance scale: DSP (RE); SWAP; ASIC/FPGA, real-time performance; DP; task pipelining; dedicated interconnect.]
*VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding, M. Vaya, J. R. Cavallaro, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2003, Hong Kong
Salient features of this solution
Any constraint length: 10 MHz at 128 Kbps (handset)
Same code for all constraint lengths
  No need to re-compile or load another code, as long as the parallelism/cluster ratio is constant
Exploiting parallelism for real-time:
  Instruction Level Parallelism (DSP)
  Subword Parallelism (DSP)
  Data Parallelism (Imagine)
  Dynamic Cluster Scaling (SWAP)
Power savings due to dynamic cluster scaling
Expected SWAP power numbers
Viterbi decoding
64 clusters and 1 multiplier per cluster:
  Process: 0.13 micron
  Voltage: 1.5 V (to min. leakage when not active)
  R-T Frequency: f ~ 10 MHz
  Peak Active Power: ~16 mW/MHz (11 mW/MHz if 1.2 V)
  Area: ~53.7 mm²
10 MHz, 128 Kbps:
  ~160 (110) mW for K = 9
  ~53.33 (36.7) mW for K = 7
  ~26.67 (12.5) mW for K = 5
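The K = 9 figures follow directly from the peak active power per MHz multiplied by the real-time clock frequency. A minimal Python check, in which the function name is hypothetical and only slide values are used as inputs:

```python
def peak_active_power_mw(mw_per_mhz, freq_mhz):
    """Peak active power in mW from a per-MHz rating and a clock in MHz."""
    return mw_per_mhz * freq_mhz


p_15v = peak_active_power_mw(16, 10)    # 1.5 V rating, 10 MHz
p_12v = peak_active_power_mw(11, 10)    # 1.2 V rating, 10 MHz

# Ratio of the quoted K = 7 power to the K = 9 power at 1.5 V.
k7_fraction = 53.33 / p_15v

print(p_15v, p_12v, round(k7_fraction, 3))
```

The K = 7 figure is about one third of the K = 9 figure although only a quarter of the clusters (16 of 64) are active. One possible reading, which is an inference and not stated on the slide, is that part of the power does not scale with the number of active clusters.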
ASICs : ~10-100 W
*Exploring the VLSI Scalability of Stream Processors, Brucek Khailany et al, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164
Problems
Suitable for handsets? Not yet!
Still too general: not low power enough!
No special customization for the application
  Except for a fixed-point architecture
  Generic instruction set
  Generic ALUs (though they can be powered down)
  Generic inter-cluster communication network
Suitable for base-stations?
  Why not? Power is not a primary constraint
Multiuser Estimation-Detection+Decoding
[Figure: frequency needed to attain real-time (in MHz, 10^1 to 10^5) versus number of clusters (10^0 to 10^2), log-log axes. Curves: FAST, MEDIUM, SLOW. Annotations: 32-user 3G base-station, hand-set.]
Real-time target: 128 Kbps per user
Expected power numbers
32-user base-station with 3 multipliers per cluster and 64 clusters:
  Process: 0.13 micron
  Voltage: 1.2 V (always active, leakage less important)
  R-T Frequency: f ~ 1 GHz
  Peak Active Power: ~19.88 mW/MHz (increased *)
  Area: ~93.4 mm²
Total base-station power consumption:
  ~19.88 W at 1 GHz for 32 users at 128 Kbps/user
H-SWAPs
Trade Data Parallelism for Task Pipelining
Customize each SWAPlet
[Figure: SWAPs (max. clusters and reconfigure): internal memory feeding many identical +++*** clusters, full DP. SWAPlet (limit clusters): a few clusters with limited DP. H-SWAPs (collection of customized SWAPlets): several SWAPlets with different functional-unit mixes (+++*, ++*, ++++), each with limited DP.]
Viterbi decoding
Survivor management: serial
  Finding parallel solution for SWAPs: expensive
  > 50% of execution time: overhead
  Serial solution now possible with H-SWAPs
[Figure: H-SWAPs for Viterbi decoding. An ACS unit (four ACS clusters, limited DP) and a traceback unit (TBU).]
Potential advantages
[Figure: two performance scales, one for SWAPs and one for H-SWAPs, each ending at ASIC/FPGA real-time performance. SWAPs scale: DSP (RE); SWAP; DP; task pipelining; dedicated interconnect. H-SWAPs scale: DSP (RE); H-SWAP; partial DP + task pipelining; application-specific units; dedicated interconnect.]
Current research
How to trade-off task vs. data parallelism?
Evaluation of specialized inter-cluster communication
Integrating specialized arithmetic units (ACS, on-line)
Area-Time-Power efficiency of Handset SWAPs
Learning to migrate from H-SWAPs to SWAPs
Scale to future systems!!
Future research: efficient algorithms
[Figure: receiver block diagram. Labels: Multipath Channel, Equalizer, MRC, Decoder, Detector, Demodulator, Non-Coherent or Partially Coherent STC, Beamforming, Coherent STC, Channel Estimator, CSIR, Finite Feedback Channel, Turbo Equalizer.]
Future research: architectures
Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs
Some other potential applications
  Image processing:
    Cameras: variety of compression algorithms
  Biomedical applications:
    Hearing aids: DSP running on body heat*
  Sensor networks
*Quote: Gene Frantz, TI Fellow
Conclusions
Exciting times for wireless algorithm and architecture research
  More complex algorithms
  Higher data rates: meet real-time requirements
  Lower power
  Low area
Seek to design flexible architectures: learn from ASIC solutions
Inter-disciplinary research needed:
  Computer architecture, VLSI, wireless communications, computer arithmetic, compilers