RICE UNIVERSITY
Flexible wireless communication architectures
Sridhar Rajagopal
Department of Electrical and Computer Engineering, Rice University, Houston TX
March 31, 2003
This work has been supported in part by Nokia, TI, TATP and NSF

Future wireless devices:
High data rate mobile devices with multimedia
Multiple antennas with complex algorithms, GOPs of computation
Area-Time-Power constraints
Seamless connection across environments and standards
Use the fastest and cheapest available service
Bluetooth/Home Networks
Wireless Cellular
Wireless LAN
Aim of the talk
Design me

Trends
 | Past | Current | Future
Year | 1990’s | 2002-2005 | 2006+
Function | Voice | Data | Multimedia
Data rates | 10’s of Kbps | 100’s of Kbps (10x) | 10’s of Mbps (10-100x)
Complexity | KOPs | MOPs (1000x) | GOPs (1000x)
Power | < 500 mW | < 500 mW | < 500 mW
Antennas | Single | Single | Multiple
Standard | GSM (Europe), CDMA (Qualcomm), TDMA (Nokia) (different devices) | GSM/TDMA/CDMA on same device | GSM/TDMA/CDMA/EDGE/Wireless LAN/Bluetooth on same device
FLEXIBILITY

Change in flexibility requirements
Physical Layer: maximum change (needs to support multiple environments, algorithms and standards)
MAC Layer
Network Layer
Application Layer: no change (already flexible)

Architecture trade-offs
Past: more DSP + less ASIC. Current: less DSP + more ASIC
Reason: need less flexibility OR DSPs not powerful enough?
Can’t we build better DSPs? How much flexibility do we need?
[Figure: spectrum from ASICs through intermediate designs to programmable processors. ASIC end: area-time-power benefits. Programmable end: flexibility, time-to-market, software updates.]

What is the right architecture?
ASICs not good: need much more flexibility
Multiple complex algorithms and multiple environments
Cannot keep adding co-processors
DSPs not good either:
1 Mbps with a 100 MHz processor: 100 cycles available per bit (GOPs)
Power: bigger color displays and more complex algorithms
Only ~100 mW for baseband
Need a methodology to explore flexibility-architecture tradeoffs
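The cycle budget quoted above is a simple ratio of clock rate to bit rate. A minimal sketch of that arithmetic follows; the function names are illustrative and not from the talk, and the operations-per-bit figure used in the usage note is a hypothetical value chosen only to show how a GOPs-scale requirement arises.

```c
#include <assert.h>

/* Cycles available per received bit for a processor clocked at clock_hz
   that must keep up with a stream of bit_rate_bps bits per second. */
double cycles_per_bit(double clock_hz, double bit_rate_bps)
{
    assert(bit_rate_bps > 0.0);
    return clock_hz / bit_rate_bps;
}

/* Operations per second needed if every bit costs ops_per_bit operations
   (ops_per_bit is an assumed input, not a number from the slides). */
double required_ops_per_second(double bit_rate_bps, double ops_per_bit)
{
    return bit_rate_bps * ops_per_bit;
}
```

With a 100 MHz clock and a 1 Mbps stream, `cycles_per_bit(100e6, 1e6)` gives the 100 cycles per bit stated on the slide. If an algorithm chain needed, say, 1000 operations per bit, `required_ops_per_second(1e6, 1000.0)` would come to 1e9 operations per second, far more than one operation per cycle.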

My contributions
Algorithms: parallel, fixed point algorithms for multiuser estimation and detection
Architectures: dynamic truncation in ASICs using on-line arithmetic
Processors:
Scalable Wireless Application-specific Processors (SWAPs)
Design methodology to explore flexibility vs. architecture tradeoffs

Problems with current DSPs
Current DSPs
Not enough functional units (FUs) for GOPs of computation
Need 100’s of FUs
Not low power enough!!
Cannot extend to more FUs
Limited Instruction Level Parallelism (ILP)
Limited Subword Parallelism (such as MMX)
Cannot support more registers (area, ports)
Compilers: difficult to find ILP as FUs increase
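
Solution: SWAPs
Exploit data parallelism (DP)
Available in many wireless algorithms
This is what ASICs do!!
Example:
    #define N 1024
    int i, a[N], b[N], c[N];      // 32 bits
    short int d[N], e[N], f[N];   // 16 bits, packed
    for (i = 0; i < N; ++i) {
        a[i] = b[i] + c[i];
        d[i] = e[i] + f[i];
    }
ILP: the two additions inside the loop body are independent of each other
DP: the loop iterations are independent of each other
Subword: two 16-bit additions fit into one 32-bit operation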
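Subword parallelism can be illustrated in plain C by adding two packed 16-bit values with one 32-bit addition. The sketch below is a generic software technique, not the SWAP or MMX instruction itself; the names `pack16x2` and `add16x2` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word (high lane, low lane). */
uint32_t pack16x2(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Add both 16-bit lanes at once. Each lane wraps modulo 2^16 and
   no carry crosses from the low lane into the high lane. */
uint32_t add16x2(uint32_t x, uint32_t y)
{
    /* Add the low 15 bits of each lane; the carry lands in bit 15 of that lane. */
    uint32_t low15 = (x & 0x7FFF7FFFu) + (y & 0x7FFF7FFFu);
    /* Fold in the top bit of each lane without generating a carry out. */
    return low15 ^ ((x ^ y) & 0x80008000u);
}
```

For example, `add16x2(pack16x2(1, 2), pack16x2(3, 4))` equals `pack16x2(4, 6)`, and an overflow in the low lane does not disturb the high lane.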

SWAPs: stream processors for wireless
[Figure: a receiver expressed as kernels connected by streams. The input data stream is the received signal and the output data stream is the decoded bits. Kernels: correlator (channel estimation), matched filter, interference cancellation, Viterbi decoding.]
Kernels (computation) and streams (communication)
Operations on kernels use local data
Streams expose data parallelism
Imagine stream processor at Stanford
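The kernel/stream split can be mimicked in ordinary C: a kernel is a function that reads one stream and writes another, and a program is a chain of kernels. This is only a software analogy under assumed names (`kernel_fn`, `scale_kernel`, `slice_kernel`, `run_chain`); it is not the programming interface of Imagine or of SWAPs, and the two kernels are stand-ins rather than real receiver stages.

```c
#include <assert.h>
#include <stddef.h>

/* A kernel consumes one input stream and produces one output stream of equal length. */
typedef void (*kernel_fn)(const int *in, int *out, size_t n);

/* Hypothetical kernel: scale every sample (stand-in for a filtering stage). */
void scale_kernel(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)   /* iterations are independent: data parallel */
        out[i] = 2 * in[i];
}

/* Hypothetical kernel: hard decision, maps each sample to a bit. */
void slice_kernel(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] > 0 ? 1 : 0;
}

/* Run the kernels in order, passing streams through two ping-pong buffers.
   Returns a pointer to the buffer holding the final output stream. */
const int *run_chain(const kernel_fn *chain, size_t stages,
                     const int *input, int *buf_a, int *buf_b, size_t n)
{
    const int *src = input;
    int *dst = buf_a;
    assert(buf_a != buf_b);
    for (size_t s = 0; s < stages; ++s) {
        chain[s](src, dst, n);                    /* kernel: computation     */
        src = dst;                                /* stream: communication   */
        dst = (dst == buf_a) ? buf_b : buf_a;
    }
    return src;
}
```

Each kernel touches only its own input and output streams, which is what lets a stream processor spread the loop iterations across clusters.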
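
DSP vs. SWAPs
[Figure: a DSP drawn as one cluster of functional units (+++***) exploiting ILP; SWAPs drawn as many such clusters side by side, exploiting ILP within each cluster and DP across clusters. Labels: Internal Memory, Stream Register File (SRF).]
DSP (1 cluster)
SWAPs (max. clusters): all clusters are the same and do the same operations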

Arithmetic clusters
FUs (+, *, /)
Scratch-pad (Sp): indexed accesses
Comm. unit (CU): intercluster comm.
Distributed reg. files: more FUs
[Figure: one arithmetic cluster. Labels: adders, multipliers and dividers (+, *, /), Local Register File, Cross Point, scratch-pad (Sp), comm. unit (CU), From/To SRF, Intercluster Network.]

SWAPs vs. Imagine trade-offs
Imagine – Stanford
Optimized for media processing
Floating point with 8 clusters
3 adders, 2 multipliers, 1 divider in each
Architecture simulator tool
Vary number of clusters, functional units, registers ….
SWAPs – Rice
Optimized for wireless communications
Minimized access to data memory
Fixed point with clusters adapting to available DP
Functional units adapting to available ILP

SWAPs vs. DSPs trade-offs
Same internal memory size as DSPs
Dependent on application, not architecture
Needs more area to support more functional units
Area is less of a constraint than power
Varying levels of DP in applications
Needs reconfiguration!!
Need to turn off unused clusters (and FUs)
More parallelism -> lower clock frequency -> lower voltage -> low power (CV^2 f + leakage) in spite of larger area
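The power argument rests on the dynamic power term C V^2 f. The sketch below works through it under stated assumptions: leakage is ignored, n clusters switch n times the capacitance at 1/n of the clock, and the supply voltage that the lower clock allows is passed in as a free parameter rather than derived. Function names and any numeric inputs are illustrative only.

```c
#include <assert.h>

/* Dynamic (switching) power in watts: P = C * V^2 * f. Leakage is ignored. */
double dynamic_power(double cap_farads, double volts, double freq_hz)
{
    assert(cap_farads >= 0.0 && freq_hz >= 0.0);
    return cap_farads * volts * volts * freq_hz;
}

/* Power of n_clusters identical clusters sharing the work of one cluster:
   n times the switched capacitance, 1/n of the clock frequency,
   running at the (possibly lower) supply voltage scaled_volts. */
double parallel_power(double cap_farads, double scaled_volts,
                      double freq_hz, int n_clusters)
{
    assert(n_clusters > 0);
    return dynamic_power(cap_farads * n_clusters, scaled_volts,
                         freq_hz / n_clusters);
}
```

At an unchanged voltage the parallel design burns the same dynamic power as the single cluster, so the saving comes entirely from the voltage reduction: halving the voltage cuts dynamic power to one quarter, whatever the cluster count.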

Design methodology
[Figure: design-flow diagram. Boxes: chain of receiver algorithms; low “complexity”, parallel, fixed point; high level language implementation; algorithm-specific architecture exploration; ASIC design (FPGA, customized, reconfigurable, heterogeneous designs); modular programmable architecture design (DSP, SWAPs); H-SWAPs, with two arrows labeled “learn”; flexibility-performance tradeoffs.]

Physical layer of wireless receivers
[Figure: receiver block diagram. Antenna -> RF front-end -> baseband processing (channel estimation, detection, decoding) -> higher (MAC/Network/OS) layers.]
Receiver more complex than transmitter

Algorithms for
Multiple antenna systems (MIMO systems)
Complexity exponential with transmit * receive antennas
Wide range of extremely complex algorithms
Optimal depends on fading, mobility, bandwidth, antennas
GOPs of computations
Estimation: Linear MMSE, blind, conjugate gradient….
Detection: FFT, (blind) interference cancellation….
Decoding: Viterbi, Turbo, LDPC….
Implement ALL of them AND the NEXT one in line
Use the best for the situation
Example for concept demonstration: Viterbi decoding

Parallel Viterbi Decoding
1. Add-Compare-Select (ACS): trellis interconnect
Parallelism depends on constraint length (#states)
2. Conventional traceback
Sequential (no DP)
Difficult to implement in parallel architecture
Use Register Exchange (RE): parallel solution
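The two steps above can be sketched in C for a small code. The following is a minimal hard-decision decoder, assuming a rate 1/2 code with constraint length 3 and generators 7 and 5 (octal), a particular shift-register convention, and packets of at most 64 bits so each survivor fits in one 64-bit register. None of these choices come from the talk; it shows the ACS loop over states and register exchange in place of traceback, not the SWAP implementation.

```c
#include <assert.h>
#include <stdint.h>

#define K        3                 /* constraint length (assumed for this sketch) */
#define NSTATES  (1u << (K - 1))   /* number of trellis states */
#define G0       07u               /* generator polynomials, octal (assumed) */
#define G1       05u
#define INF      1000000u

static unsigned parity(unsigned x)
{
    unsigned p = 0;
    for (; x; x >>= 1) p ^= x & 1u;
    return p;
}

/* 2-bit code symbol emitted when input bit b enters the encoder in state s. */
static unsigned branch_symbol(unsigned s, unsigned b)
{
    unsigned reg = (b << (K - 1)) | s;   /* newest bit sits above the state bits */
    return (parity(reg & G0) << 1) | parity(reg & G1);
}

/* Rate 1/2 encoder: n input bits -> n 2-bit symbols, starting in state 0. */
void conv_encode(const uint8_t *bits, int n, uint8_t *sym)
{
    unsigned s = 0;
    for (int t = 0; t < n; ++t) {
        sym[t] = (uint8_t)branch_symbol(s, bits[t]);
        s = ((unsigned)bits[t] << (K - 2)) | (s >> 1);
    }
}

/* Hard-decision Viterbi decoding with register-exchange survivors (n <= 64). */
void viterbi_re_decode(const uint8_t *sym, int n, uint8_t *bits)
{
    unsigned metric[NSTATES], new_metric[NSTATES];
    uint64_t surv[NSTATES] = {0}, new_surv[NSTATES];

    assert(n >= 1 && n <= 64);
    for (unsigned s = 0; s < NSTATES; ++s) metric[s] = s ? INF : 0;

    for (int t = 0; t < n; ++t) {
        /* ACS: each state is updated independently (the data-parallel part). */
        for (unsigned ns = 0; ns < NSTATES; ++ns) {
            unsigned b = ns >> (K - 2);          /* input bit that leads into ns */
            unsigned best = 0, best_m = 0;
            for (unsigned j = 0; j < 2; ++j) {
                unsigned p = ((ns << 1) & (NSTATES - 1)) | j;   /* predecessor */
                unsigned d = branch_symbol(p, b) ^ sym[t];
                unsigned m = metric[p] + (d & 1u) + ((d >> 1) & 1u); /* add */
                if (j == 0 || m < best_m) { best = p; best_m = m; }  /* compare, select */
            }
            new_metric[ns] = best_m;
            new_surv[ns] = (surv[best] << 1) | b;   /* register exchange: copy and extend */
        }
        for (unsigned s = 0; s < NSTATES; ++s) {
            metric[s] = new_metric[s];
            surv[s] = new_surv[s];
        }
    }

    unsigned best = 0;
    for (unsigned s = 1; s < NSTATES; ++s)
        if (metric[s] < metric[best]) best = s;
    for (int t = 0; t < n; ++t)
        bits[t] = (uint8_t)((surv[best] >> (n - 1 - t)) & 1u);
}
```

Register exchange costs a full survivor copy per state per step, but every copy is independent, so it parallelizes the same way ACS does. A traceback would instead walk the decisions one step at a time.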

Re-ordering for parallel Viterbi
[Figure: a. Trellis: states X(0), X(1), ..., X(15) in both columns. b. Shuffled trellis: the states in one column are re-ordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15), while the other column stays in order X(0), X(1), ..., X(15).]
Exploiting Viterbi DP in SWAPs: Re-order ACS, RE Overhead
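The shuffled ordering listed in the figure (even-indexed states first, then odd-indexed states) is a simple permutation. A minimal sketch, with a made-up function name:

```c
#include <assert.h>
#include <stddef.h>

/* Even/odd re-ordering of a state vector of even length n:
   out = in[0], in[2], ..., in[n-2], in[1], in[3], ..., in[n-1]. */
void shuffle_even_odd(const int *in, int *out, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n / 2; ++i) {
        out[i]         = in[2 * i];       /* even-indexed states fill the first half */
        out[n / 2 + i] = in[2 * i + 1];   /* odd-indexed states fill the second half */
    }
}
```

One plausible reading of why this helps: if state i is reached from states 2i and 2i+1 (a common trellis convention, assumed here), then after the shuffle both predecessors sit at offset i in the two halves, so the ACS inputs line up across clusters without irregular accesses.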
SWAP: Algorithms + Architecture
Algorithm design for parallelism
Architecture design?

SWAP design
Decide how many clusters
Exploit DP
Decide what to put within each cluster
Maximize ILP with high functional unit efficiency
Search design space with “explore” tool
See how it meets time-area-power constraints
[Figure: clusters with an undetermined mix of adders (+) and multipliers (*). Open questions (?): the functional units inside each cluster (ILP) and the number of clusters (DP).]

Inside a SWAP cluster: EXPLORE
Auto-exploration of adders and multipliers for “ACS”
[Figure: 3-D plot of instruction count (axis 40 to 160) against #Adders (1 to 5) and #Multipliers (1 to 5). Each point is labeled (Adder FU%, Multiplier FU%). Labels in extraction order: (43,58), (54,59), (39,41), (62,62), (47,43), (40,32), (70,59), (65,45), (49,33), (39,27), (80,34), (73,41), (61,33), (48,26), (39,22), (50,22), (85,24), (76,33), (60,26), (61,22), (85,17), (72,22), (72,19), (85,13), (85,11).]

“Explore” tool benefits
Instruction count vs. functional unit efficiency
What goes inside each cluster
Explore all algorithms -> turn off functional units not in use for given kernel
Design customized application-specific units
Better performance with increased FU utilization
Algorithm 1: 3 adders, 3 multipliers, 32 clusters
Algorithm 2: 4 adders, 1 multiplier, 64 clusters
Architecture: 4 adders, 3 multipliers, 64 clusters
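In the example above, the final architecture is the component-wise maximum of the per-algorithm requirements. A minimal sketch of that rule follows; the struct and function names are hypothetical, and the real “explore” tool may weigh other factors.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int adders;       /* adders per cluster */
    int multipliers;  /* multipliers per cluster */
    int clusters;     /* number of clusters */
} swap_config;

static int max_int(int a, int b) { return a > b ? a : b; }

/* Smallest configuration that covers every per-algorithm requirement. */
swap_config cover_all(const swap_config *req, size_t n)
{
    swap_config arch = {0, 0, 0};
    assert(req != NULL);
    for (size_t i = 0; i < n; ++i) {
        arch.adders      = max_int(arch.adders, req[i].adders);
        arch.multipliers = max_int(arch.multipliers, req[i].multipliers);
        arch.clusters    = max_int(arch.clusters, req[i].clusters);
    }
    return arch;
}
```

Feeding in the two algorithms from the slide, {3, 3, 32} and {4, 1, 64}, yields {4, 3, 64}. Whatever a given kernel does not need is then a candidate for being turned off.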

Viterbi reconfiguration
Packet 1: constraint length 7 (16 clusters)
Packet 2: constraint length 9 (64 clusters)
Packet 3: constraint length 5 (4 clusters)
DP: unused clusters can be turned OFF
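The cluster counts on this slide match a fixed number of trellis states per cluster: a code with constraint length K has 2^(K-1) states, and 64, 256 and 16 states divided by 4 give 16, 64 and 4 clusters. The figure of 4 states per cluster is inferred from those numbers, not stated in the talk. A sketch with made-up function names:

```c
#include <assert.h>

/* Trellis states of a convolutional code with constraint length k: 2^(k-1). */
unsigned trellis_states(unsigned k)
{
    assert(k >= 1 && k <= 31);
    return 1u << (k - 1);
}

/* Clusters kept active when each cluster handles states_per_cluster states
   (assumed parameter), capped at the number of clusters on the chip. */
unsigned active_clusters(unsigned k, unsigned states_per_cluster,
                         unsigned max_clusters)
{
    assert(states_per_cluster > 0);
    unsigned needed = (trellis_states(k) + states_per_cluster - 1)
                      / states_per_cluster;
    return needed < max_clusters ? needed : max_clusters;
}
```

With 4 states per cluster and 64 clusters available, constraint lengths 7, 9 and 5 give 16, 64 and 4 active clusters, and the rest can be powered down for that packet.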

Reconfiguration 1: Data transfer
Move data to appropriate clusters via comm units
Significant performance loss, additional SRF memory required
Can turn off SRF too!
[Figure: SRF, clusters and comm. units (CU).]

Reconfiguration 2: Conditional streams
Transfer data via comm unit (CU) and scratchpad (Sp)
Minimal loss in performance
Cannot turn off SRF, comm unit, scratchpad in clusters

Reconfiguration 3: Multiplexed buffers
Use mux-demux buffers
Minimal loss in performance
Can turn off clusters entirely – more power savings

[Figure: execution time (cycles) for three 64-bit packets at rate 1/2: Packet 1 (K = 7), Packet 2 (K = 9), Packet 3 (K = 5). Legend: clusters (kernels, computation) and memory. Annotation: no data memory accesses.]

Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz
[Figure: log-log plot of frequency needed to attain real-time (in MHz, 1 to 1000) against number of clusters (1 to 100), with curves for K = 9, K = 7 and K = 5. Marked points: DSP, static architecture, SWAPs.]
Ideal C64x (w/o co-proc) needs ~200 MHz for real-time

SWAPs: Salient features
1-2 orders of magnitude better than 1 processor DSP
Any constraint length: 10 MHz at 128 Kbps
Same code for all constraint lengths
No need to re-compile or load another code as long as parallelism/cluster ratio is constant
Power savings due to dynamic cluster scaling

Expected SWAP power consumption
64 clusters and 1 multiplier per cluster: 0.13 micron, 1.2 V
Peak Active Power: ~9 mW at 1 MHz
Area: ~53.7 mm^2
10 MHz, 128 Kbps with reconfiguration
*Exploring the VLSI Scalability of Stream Processors, Brucek Khailany et al, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164
[Figure: power (in mW, 0 to 90) against active clusters (0 to 70, max 64).]
Viterbi | Clusters used | Peak Power
K = 9 | 64 | ~90 mW
K = 7 | 16 | ~28.57 mW
K = 5 | 4 | ~13.8 mW
overhead | 0 | ~8.1 mW

Flexibility vs. performance
Suitable for mobile devices?
SWAPs: Real-time at ~10-100 mW
Maybe; but can we do better?
ASICs: Real-time at ~10-100 µW
No special customization for the application
No application-specific units
Generic inter-cluster communication network
Overhead for extracting parallelism
SWAPs suitable for base-stations?
Why not? – power is not a primary constraint!

Multiuser Estimation-Detection+Decoding
Real-time target: 128 Kbps per user
[Figure: log-log plot of frequency needed to attain real-time (in MHz, 10 to 100000) against number of clusters (1 to 100), with curves for FAST, MEDIUM and SLOW. Marked points: 32-user base-station, mobile, DSP.]
Ideal C64x (w/o co-proc) needs ~15 GHz for real-time

Expected SWAP power: base-station
32 user base-station with 3 X’s per cluster and 64 clusters: 0.13 micron, 1.2 V
Peak Active Power: ~18.19 mW for 1 MHz (increased*)
Area: ~93.4 mm^2
Total Peak Base-station power consumption: ~18.19 W at 1 GHz for 32 users at 128 Kbps/user
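The totals on the power slides follow from scaling the per-MHz peak figure by the clock frequency, which is how dynamic power behaves at a fixed supply voltage. A minimal sketch of that scaling, with an illustrative function name:

```c
#include <assert.h>

/* Peak active power in mW at clock_mhz, given the peak power per MHz
   at a fixed supply voltage. */
double peak_power_mw(double mw_per_mhz, double clock_mhz)
{
    assert(mw_per_mhz >= 0.0 && clock_mhz >= 0.0);
    return mw_per_mhz * clock_mhz;
}
```

For the base-station configuration, ~18.19 mW per MHz at 1000 MHz comes to ~18190 mW, the ~18.19 W quoted above. The Viterbi configuration's ~9 mW per MHz at 10 MHz gives the ~90 mW peak reported for K = 9.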
Current research
SWAPs : Completely flexible and general
How do we trade-off flexibility for better performance?
Handset SWAPs (H-SWAPs)

Let’s look at ASICs
[Figure: execution time for Viterbi decoding on three architectures. DSP: 200 MHz (~1 bit / 2000 cycles). SWAP: 10 MHz (~1 bit / 100 cycles), reached through DP. ASIC/FPGA: 128 KHz* (1 bit / cycle), reached through task pipelining and dedicated interconnect.]
*VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding, M. Vaya, J. R. Cavallaro, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2003, Hong Kong

Handset SWAPs: H-SWAPs
Trade Data Parallelism for Task Pipelining
Design SWAPlets and customize each SWAPlet
[Figure: three organizations. SWAPs (max. clusters and reconfigure): many +++*** clusters on one SRF, exploiting DP. SWAPlet (limit clusters): a few +++* clusters with limited DP. H-SWAPs (collection of customized SWAPlets): several SWAPlets with different functional unit mixes (+++*, ++*, ++++), each with limited DP.]

H-SWAPs: Viterbi decoding
Survivor management – serial
Finding parallel solution for SWAPs – expensive
> 50% of execution time: overhead
Serial solution now possible with H-SWAPs
Better performance with less flexibility!!
[Figure: H-SWAPs for Viterbi decoding. An ACS unit (ACS clusters with limited DP) feeds a traceback unit (TBU).]
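The serial survivor management that a dedicated traceback unit performs can be sketched as follows. The decision layout, the trellis convention (state s is reached from states 2s and 2s+1 modulo the state count, and the decoded bit is the top bit of the state) and the constraint length are assumptions made for this example, not details from the talk.

```c
#include <assert.h>
#include <stdint.h>

#define K        3                 /* constraint length (assumed for this sketch) */
#define NSTATES  (1u << (K - 1))

/* decisions[t * NSTATES + s] holds the low bit of the predecessor that ACS
   selected for state s at step t. Starting from final_state, walk the
   trellis backwards, fill bits[0..n-1], and return the starting state. */
unsigned traceback(const uint8_t *decisions, int n,
                   unsigned final_state, uint8_t *bits)
{
    unsigned s = final_state;
    assert(final_state < NSTATES);
    /* Strictly sequential: the state at step t is known only after step t+1. */
    for (int t = n - 1; t >= 0; --t) {
        bits[t] = (uint8_t)(s >> (K - 2));          /* decoded bit = top state bit */
        s = ((s << 1) & (NSTATES - 1))
            | (decisions[(unsigned)t * NSTATES + s] & 1u);
    }
    return s;
}
```

Traceback stores one decision bit per state per step and does a single pointer chase at the end, instead of copying whole survivor registers as register exchange does. It is cheap but inherently serial, which is why it suits a small dedicated unit better than a wide array of clusters.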

H-SWAPs: Potential advantages
[Figure: execution time for two paths towards ASIC/FPGA real-time performance. SWAPs path: DSP (RE), then SWAP (DP), then ASIC/FPGA (task pipelining, dedicated interconnect). H-SWAPs path: DSP (RE), then H-SWAP (partial DP + task pipelining, application-specific units), then ASIC/FPGA (dedicated interconnect).]

Current research
Task vs. data parallelism tradeoffs
Evaluation of specialized inter-cluster communication
Integrating specialized arithmetic units (ACS, on-line)
Learning to migrate from H-SWAPs to SWAPs
Scale to future systems!!

Future research: efficient algorithms
[Figure: block diagram with blocks labeled Multipath Channel, Equalizer, MRC, Decoder, Detector, Demodulator, Non-Coherent or Partially Coherent STC, Beamforming, Coherent STC, Channel Estimator, CSIR, Finite Feedback Channel, Turbo Equalizer.]

Future research: architectures
Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs
Some other potential applications
Image processing:
Cameras: variety of compression algorithms
Biomedical applications:
Hearing aids: DSP running on body heat*
Sensor networks
Compression of data before transmission
*Quote: Gene Frantz, TI Fellow

Conclusions
Need flexible architectures for future wireless devices
Higher data rates, lower power, more complex algorithms
Design methodology (SWAPs, H-SWAPs, ASICs)
Flexibility vs. performance trade-offs
Blurs distinction between ASICs and programmable solutions
Also need parallel, low precision algorithms for efficient mapping
Inter-disciplinary research: computer architecture, VLSI, wireless communications, computer arithmetic, compilers