RICE UNIVERSITY

Flexible wireless communication architectures

Sridhar Rajagopal

Department of Electrical and Computer Engineering, Rice University, Houston TX

March 31, 2003

This work has been supported in part by Nokia, TI, TATP and NSF


Future wireless devices :

High data rate mobile devices with multimedia

Multiple antennas w/ complex algorithms, GOPs of computation

Area-Time-Power constraints

Seamless connection across environments and standards

Use the fastest and cheapest available service

Bluetooth/Home Networks

Wireless Cellular

Wireless LAN


Aim of the talk

Design me


Trends

              Past            Current               Future
Year          1990’s          2002-2005             2006+
Function      Voice           Data                  Multimedia
Data rates    10’s of Kbps    100’s of Kbps (10x)   10’s of Mbps (10-100x)
Complexity    KOPs            MOPs (1000x)          GOPs (1000x)
Power         < 500 mW        < 500 mW              < 500 mW
Antennas      Single          Single                Multiple

Standard (past): GSM (Europe), CDMA (Qualcomm), TDMA (Nokia) (different devices)

Standard (current): GSM/TDMA/CDMA on same device

Standard (future): GSM/TDMA/CDMA/EDGE/Wireless LAN/Bluetooth on same device

FLEXIBILITY


Change in flexibility requirements

Physical Layer: maximum change (needs to support multiple environments, algorithms and standards)

MAC Layer

Network Layer

Application Layer: no change (already flexible)


Architecture trade-offs

Past : more DSP + less ASIC, Current : less DSP + more ASIC

Reason: need less flexibility OR DSPs not powerful enough?

Can’t we build better DSPs? How much flexibility do we need?

ASICs: area-time-power benefits

Intermediate

Programmable: flexibility, time-to-market, software updates


What is the right architecture?

ASICs not good: need much more flexibility

Multiple complex algorithms and multiple environments

Cannot keep adding co-processors

DSPs not good either: 1 Mbps with 100 MHz processor

100 cycles available per bit (GOPs)

Power: bigger color displays and more complex algorithms

Only ~100 mW for baseband

Need a methodology to explore flexibility-architecture tradeoffs
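The "100 cycles available per bit" figure follows directly from dividing the clock rate by the bit rate. A minimal sketch of that budget calculation, where the function name `cycles_per_bit` is hypothetical and not from the talk:

```c
#include <assert.h>
#include <stdint.h>

/* Whole clock cycles a processor can spend on each received bit
 * if it must keep up with the link in real time. */
uint64_t cycles_per_bit(uint64_t clock_hz, uint64_t bit_rate_bps)
{
    assert(bit_rate_bps > 0);          /* avoid division by zero */
    return clock_hz / bit_rate_bps;    /* integer division: whole cycles only */
}
```

With a 100 MHz clock and a 1 Mbps link, `cycles_per_bit(100000000, 1000000)` returns 100, so every algorithm in the receiver chain has to fit in roughly 100 cycles per bit.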


My contributions

Algorithms: Parallel, fixed point algorithms for multiuser estimation and detection

Architectures: Dynamic truncation in ASICs using on-line arithmetic

Processors:

Scalable Wireless Application-specific Processors (SWAPs)

Design methodology to explore flexibility vs. architecture tradeoffs


Problems with current DSPs

Current DSPs

Not enough functional units (FUs) for GOPs of computation

Need 100’s of FUs

Not low power enough!!

Cannot extend to more FUs

Limited Instruction Level Parallelism (ILP)

Limited Subword Parallelism (such as MMX)

Cannot support more registers (area, ports)

Compilers: difficult to find ILP as FUs increase


Solution: SWAPs

Exploit data parallelism (DP)

Available in many wireless algorithms

This is what ASICs do!!

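Example:

    #define N 1024
    #define K 1024
    int i, a[N], b[N], c[N];        // 32 bits
    short int d[K], e[K], f[K];     // 16 bits packed

    for (i = 0; i < N; ++i) {       // DP: the iterations are independent
        a[i] = b[i] + c[i];         // ILP: the two additions are independent
        d[i] = e[i] + f[i];         // Subword: 16-bit operands can be packed
    }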
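One way to picture how the data parallelism in this loop maps onto clusters is to process the arrays in steps of one element per cluster. The sketch below is an illustration only, not the SWAP programming model: the function name `dp_vector_add` and the `clusters` parameter are hypothetical, and the inner loop stands in for clusters that would run in lockstep in hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Element-wise additions, organised as SIMD steps of `clusters` elements.
 * The inner loop models the clusters; each one handles a single index. */
void dp_vector_add(int32_t *a, const int32_t *b, const int32_t *c,
                   int16_t *d, const int16_t *e, const int16_t *f,
                   int n, int clusters)
{
    assert(clusters > 0 && n % clusters == 0);
    for (int base = 0; base < n; base += clusters) {   /* one step for all clusters */
        for (int k = 0; k < clusters; ++k) {           /* conceptually in parallel */
            int i = base + k;
            a[i] = b[i] + c[i];                        /* 32-bit add */
            d[i] = (int16_t)(e[i] + f[i]);             /* 16-bit add, independent of the first */
        }
    }
}
```

Under this model, 1024 elements on 64 clusters would take 16 steps instead of 1024 sequential iterations.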


SWAPs: stream processors for wireless

[Figure: stream program for a wireless receiver. Kernels: correlator channel estimation, matched filter, interference cancellation, Viterbi decoding. Streams carry input and output data between the kernels, from the received signal to the decoded bits.]

Kernels (computation) and streams (communication)

Operations on kernels use local data

Streams expose data parallelism

Imagine stream processor at Stanford
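The kernel/stream split can be sketched in plain C as functions (kernels) that each read one array (input stream) and write another (output stream). This is a rough illustration under my own assumptions, not the Imagine or SWAP programming interface: the type `kernel_fn`, the two toy kernels, and `run_chain` are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* A kernel consumes one input stream and produces one output stream. */
typedef void (*kernel_fn)(const int *in, int *out, size_t n);

/* Hypothetical kernel: scale every sample by a fixed gain of 2. */
void apply_gain(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = 2 * in[i];
}

/* Hypothetical kernel: map each sample to a hard bit decision. */
void hard_decision(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (in[i] >= 0) ? 1 : 0;
}

/* Run the kernels in order, alternating between two stream buffers.
 * Returns the buffer that holds the final output stream. */
int *run_chain(const kernel_fn *kernels, size_t count,
               const int *input, int *buf_a, int *buf_b, size_t n)
{
    assert(count > 0);
    const int *src = input;
    int *dst = buf_a;
    int *last = buf_a;
    for (size_t k = 0; k < count; ++k) {
        kernels[k](src, dst, n);               /* kernel touches only its streams */
        last = dst;
        src = dst;                             /* output stream feeds the next kernel */
        dst = (dst == buf_a) ? buf_b : buf_a;
    }
    return last;
}
```

Because every element of a stream is processed the same way and each kernel only touches its own streams, the per-element loops are the natural place to spread work across clusters.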


DSP vs. SWAPs

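[Figure: a DSP (1 cluster of +++*** functional units with internal memory) exploits ILP only. SWAPs (max. clusters, fed by a Stream Register File (SRF)) exploit ILP within each cluster and DP across clusters. All clusters are the same and do the same operations.]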


Arithmetic clusters

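FUs (+, *, /)

Scratch-pad (Sp): indexed accesses

Comm. unit (CU): intercluster comm.

Distributed reg. files: more FUs

[Figure: one arithmetic cluster. Adders, multipliers and dividers, each with a local register file, connect through a cross point to the scratch-pad (Sp), to the comm. unit (CU) on the intercluster network, and to the SRF (from/to SRF).]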


SWAPs vs. Imagine trade-offs

Imagine – Stanford

Optimized for media processing

Floating point with 8 clusters

3 adders, 2 multipliers, 1 divider in each

Architecture simulator tool

Vary number of clusters, functional units, registers ….

SWAPs – Rice

Optimized for wireless communications

Minimized access to data memory

Fixed point with clusters adapting to available DP

Functional units adapting to available ILP


SWAPs vs. DSPs trade-offs

Same internal memory size as DSPs

Dependent on application, not architecture

Needs more area to support more functional units

Area is less of a constraint than power

Varying levels of DP in applications

Needs reconfiguration!!

Need to turn off unused clusters (and FUs)

More parallelism -> lower clock frequency -> lower voltage -> low power (CV²f + leakage) in spite of larger area
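The power argument rests on the usual model of active power as switched capacitance times supply voltage squared times clock frequency, plus leakage. A minimal sketch of that model follows; the function name and every number in the usage note are hypothetical and chosen only to show the direction of the trade-off, not taken from the talk.

```c
#include <assert.h>

/* Active power model: P = C * V^2 * f + leakage. */
double active_power_watts(double switched_cap_farads, double vdd_volts,
                          double clock_hz, double leakage_watts)
{
    assert(switched_cap_farads >= 0.0 && clock_hz >= 0.0);
    return switched_cap_farads * vdd_volts * vdd_volts * clock_hz
           + leakage_watts;
}
```

As an illustration, compare a design switching 1 nF at 1.2 V and 100 MHz with a design that has twice the hardware (2 nF) running at 50 MHz. At the same voltage the two dissipate the same dynamic power, but if the slower clock allows the supply to drop to, say, 0.9 V, the wider design comes out lower because voltage enters quadratically.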


Design methodology

[Figure: design-methodology flow chart. Boxes: chain of receiver algorithms (low “complexity”, parallel, fixed point); high level language implementation; modular programmable architecture design (DSP, SWAPs; H-SWAPs); ASIC design (FPGA, customized, reconfigurable, heterogeneous designs); algorithm-specific architecture exploration; flexibility-performance tradeoffs. “Learn” arrows link the designs.]


Physical layer of wireless receivers

[Figure: receiver chain. Antenna, RF front-end, baseband processing (channel estimation, detection, decoding), then higher (MAC/Network/OS) layers.]

Receiver more complex than transmitter


Algorithms for

Multiple antenna systems (MIMO systems)

Complexity exponential with transmit * receive antennas

Wide range of extremely complex algorithms

Optimal depends on fading, mobility, bandwidth, antennas

GOPs of computations

Estimation: Linear MMSE, blind, conjugate gradient….

Detection: FFT, (blind) interference cancellation….

Decoding: Viterbi, Turbo, LDPC….

Implement ALL of them AND the NEXT one in line

Use the best for the situation

Example for concept demonstration: Viterbi decoding


Parallel Viterbi Decoding

1. Add-Compare-Select (ACS): trellis interconnect

Parallelism depends on constraint length (#states)

2. Conventional Traceback

Sequential (No DP)

Difficult to implement in parallel architecture

Use Register Exchange (RE): parallel solution
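The two steps above can be sketched as a small hard-decision decoder in which every state runs the same ACS independently and carries its own survivor word forward (register exchange), so no sequential traceback is needed. This is an illustration under my own assumptions, not the implementation from the talk: constraint length 3, generators 7 and 5 (octal), Hamming branch metrics, and all function names are hypothetical choices.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define K        3                 /* constraint length (hypothetical choice) */
#define NSTATES  (1u << (K - 1))   /* 4 trellis states */
#define G0       7u                /* generator polynomials, octal 7 and 5 */
#define G1       5u

static unsigned parity(unsigned x)
{
    unsigned p = 0;
    while (x) { p ^= x & 1u; x >>= 1; }
    return p;
}

/* Two coded bits produced when input bit u enters the encoder in state s. */
static unsigned branch_output(unsigned s, unsigned u)
{
    unsigned reg = (u << (K - 1)) | s;
    return (parity(reg & G0) << 1) | parity(reg & G1);
}

/* Rate-1/2 encoder, message bits taken MSB first, starting in state 0. */
void conv_encode(uint64_t msg, int nbits, unsigned *pairs)
{
    unsigned s = 0;
    for (int t = 0; t < nbits; ++t) {
        unsigned u = (unsigned)((msg >> (nbits - 1 - t)) & 1u);
        pairs[t] = branch_output(s, u);
        s = (s >> 1) | (u << (K - 2));
    }
}

/* One trellis stage: ACS for every state, with register-exchange survivors. */
static void acs_re_step(unsigned rx, const unsigned *metric, const uint64_t *surv,
                        unsigned *new_metric, uint64_t *new_surv)
{
    for (unsigned j = 0; j < NSTATES; ++j) {       /* states are independent: DP */
        unsigned u  = j >> (K - 2);                /* input bit that leads to j */
        unsigned p0 = (2u * j) & (NSTATES - 1u);   /* the two predecessor states */
        unsigned p1 = p0 | 1u;
        unsigned d0 = branch_output(p0, u) ^ rx;   /* mismatching coded bits */
        unsigned d1 = branch_output(p1, u) ^ rx;
        unsigned m0 = metric[p0] + (d0 & 1u) + (d0 >> 1);   /* add */
        unsigned m1 = metric[p1] + (d1 & 1u) + (d1 >> 1);
        unsigned best = (m1 < m0) ? p1 : p0;                /* compare */
        new_metric[j] = (m1 < m0) ? m1 : m0;                /* select */
        new_surv[j] = (surv[best] << 1) | u;                /* register exchange */
    }
}

/* Hard-decision decode of nbits message bits (nbits <= 64). */
uint64_t viterbi_decode(const unsigned *pairs, int nbits)
{
    unsigned metric[NSTATES], new_metric[NSTATES];
    uint64_t surv[NSTATES] = {0}, new_surv[NSTATES];
    assert(nbits > 0 && nbits <= 64);
    for (unsigned s = 0; s < NSTATES; ++s)
        metric[s] = (s == 0) ? 0u : 1000u;         /* encoder starts in state 0 */
    for (int t = 0; t < nbits; ++t) {
        acs_re_step(pairs[t], metric, surv, new_metric, new_surv);
        memcpy(metric, new_metric, sizeof metric);
        memcpy(surv, new_surv, sizeof surv);
    }
    unsigned best = 0;
    for (unsigned s = 1; s < NSTATES; ++s)
        if (metric[s] < metric[best]) best = s;
    return surv[best];                             /* decoded bits, first bit highest */
}
```

The price of register exchange is visible in `acs_re_step`: every state copies a whole survivor word at every stage, which is the kind of overhead a serial traceback avoids.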


Re-ordering for parallel Viterbi

[Figure: a. Trellis: the 16 states X(0) to X(15) appear in natural order in both columns. b. Shuffled trellis: one column is re-ordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15), while the other column keeps the natural order X(0) to X(15).]

Exploiting Viterbi DP in SWAPs: re-order ACS, RE overhead
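The shuffled order in the figure lists the even-numbered states first and the odd-numbered states second. A minimal sketch that generates this permutation is given below; the function name `shuffled_order` is hypothetical. The remark about predecessors in the comment assumes the common shift-register trellis convention, which the slides do not spell out.

```c
#include <assert.h>

/* Fill order[] with the shuffled state order: evens first, then odds.
 * Under a shift-register trellis where state j is reached from states
 * 2j and 2j+1 (modulo nstates), these two predecessors then sit at
 * positions j and j + nstates/2 for j < nstates/2, a regular stride. */
void shuffled_order(int nstates, int *order)
{
    assert(nstates > 0 && nstates % 2 == 0);
    int half = nstates / 2;
    for (int i = 0; i < half; ++i) {
        order[i] = 2 * i;              /* X(0), X(2), ..., X(nstates-2) */
        order[i + half] = 2 * i + 1;   /* X(1), X(3), ..., X(nstates-1) */
    }
}
```

For 16 states this yields 0, 2, 4, ..., 14, 1, 3, ..., 15, matching the shuffled column in the figure.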


SWAP: Algorithms + Architecture

Algorithm design for parallelism

Architecture design?


SWAP design

Decide how many clusters

Exploit DP

Decide what to put within each cluster

Maximize ILP with high functional unit efficiency

Search design space with “explore” tool

See how it meets time-area-power constraints

[Figure: clusters whose mix of adders and multipliers (ILP) and whose count (DP) are still to be decided.]


Inside a SWAP cluster: EXPLORE

Auto-exploration of adders and multipliers for “ACS"

[Figure: instruction count (axis from 40 to 160) against #Adders (1 to 5) and #Multipliers (1 to 5). Each design point is labelled with (Adder FU%, Multiplier FU%).]


“Explore” tool benefits

Instruction count vs. functional unit efficiency

What goes inside each cluster

Explore all algorithms: turn off functional units not in use for given kernel

Design customized application-specific units

Better performance with increased FU utilization

Algorithm 1: 3 adders, 3 multipliers, 32 clusters

Algorithm 2: 4 adders, 1 multiplier, 64 clusters

Architecture: 4 adders, 3 multipliers, 64 clusters
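The architecture in this example is the element-wise maximum of what each algorithm needs. A minimal sketch of that combination step, with a hypothetical struct `cluster_config` and function `cover_all` that are not part of the "explore" tool:

```c
#include <assert.h>

/* Resources one algorithm needs, or one architecture provides. */
typedef struct {
    int adders;        /* adders per cluster */
    int multipliers;   /* multipliers per cluster */
    int clusters;      /* number of clusters */
} cluster_config;

static int max_int(int a, int b) { return a > b ? a : b; }

/* Smallest configuration that covers every algorithm's requirement. */
cluster_config cover_all(const cluster_config *needs, int count)
{
    assert(count > 0);
    cluster_config arch = needs[0];
    for (int i = 1; i < count; ++i) {
        arch.adders      = max_int(arch.adders, needs[i].adders);
        arch.multipliers = max_int(arch.multipliers, needs[i].multipliers);
        arch.clusters    = max_int(arch.clusters, needs[i].clusters);
    }
    return arch;
}
```

Given the two algorithms above, (3, 3, 32) and (4, 1, 64), `cover_all` returns (4, 3, 64). The units an algorithm does not need are the ones that can be turned off while it runs.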


Viterbi reconfiguration

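Packet 1: constraint length 7 (16 clusters)

Packet 2: constraint length 9 (64 clusters)

Packet 3: constraint length 5 (4 clusters)

Clusters not needed for the available DP can be turned OFF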
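The cluster counts quoted in the talk (16, 64 and 4 clusters for K = 7, 9 and 5) are consistent with a trellis of 2^(K-1) states mapped at four states per cluster. The sketch below encodes that relationship; the ratio of four is inferred from those numbers rather than stated in the slides, and the function names are hypothetical.

```c
#include <assert.h>

#define STATES_PER_CLUSTER 4   /* assumption inferred from the quoted cluster counts */

/* Number of trellis states for a given constraint length K. */
int trellis_states(int constraint_length)
{
    assert(constraint_length >= 2 && constraint_length <= 30);
    return 1 << (constraint_length - 1);
}

/* Clusters to keep powered for this packet; the rest can be turned off. */
int active_clusters(int constraint_length, int max_clusters)
{
    int states = trellis_states(constraint_length);
    int wanted = (states + STATES_PER_CLUSTER - 1) / STATES_PER_CLUSTER;
    return wanted < max_clusters ? wanted : max_clusters;
}
```

On a 64-cluster machine, `active_clusters(7, 64)` returns 16, `active_clusters(9, 64)` returns 64 and `active_clusters(5, 64)` returns 4.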


Reconfiguration : 1 : Data transfer

Move data to appropriate clusters via comm units

Significant performance loss, additional SRF memory required

Can turn off SRF too!

[Figure: SRF, clusters and comm. units (CU).]


Reconfiguration : 2: Conditional streams

[Figure: clusters, each with its own scratchpad (Sp).]

Transfer data via comm unit (CU) and scratchpad (Sp)

Minimal loss in performance

Cannot turn off SRF, comm unit , scratchpad in clusters


Reconfiguration : 3 : Multiplexed buffers

Use mux-demux buffers

Minimal loss in performance

Can turn off clusters entirely – more power savings


[Figure: execution time (cycles) for three 64-bit, rate ½ packets: Packet 1 (K = 7), Packet 2 (K = 9), Packet 3 (K = 5). Labels: kernels (computation), no data memory accesses, clusters, memory.]


Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz

[Figure: log-log plot of the frequency needed to attain real-time (in MHz, 1 to 1000) against the number of clusters (1 to 100), with curves for K = 9, K = 7 and K = 5. Annotations mark the static architecture, SWAPs and the DSP.]

Ideal C64x (w/o co-proc) needs ~200 MHz for real-time


SWAPs : Salient features

1-2 orders of magnitude better than 1 processor DSP

Any constraint length: 10 MHz at 128 Kbps

Same code for all constraint lengths

No need to re-compile or load another code as long as parallelism/cluster ratio is constant

Power savings due to dynamic cluster scaling


Expected SWAP power consumption

64 clusters and 1 multiplier per cluster: 0.13 micron, 1.2 V

Peak Active Power: ~9 mW at 1 MHz

Area: ~53.7 mm²

10 MHz, 128 Kbps with reconfiguration

*Exploring the VLSI Scalability of Stream Processors, Brucek Khailany et al, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164

[Figure: power (in mW, 0 to 90) against active clusters (0 to 70, max 64).]

Viterbi     Clusters used   Peak Power
K = 9       64              ~90 mW
K = 7       16              ~28.57 mW
K = 5       4               ~13.8 mW
overhead    0               ~8.1 mW


Flexibility vs. performance

Suitable for mobile devices?

SWAPs: Real-time at ~10-100 mW

Maybe; but can we do better?

ASICs: Real-time at ~10-100 W

No special customization for the application

No application-specific units

Generic inter-cluster communication network

Overhead for extracting parallelism

SWAPs suitable for base-stations?

Why not? – power is not a primary constraint!


Multiuser Estimation-Detection+Decoding

Real-time target : 128 Kbps per user

[Figure: log-log plot of the frequency needed to attain real-time (in MHz, 10 to 100000) against the number of clusters (1 to 100), with curves labelled FAST, MEDIUM and SLOW. Annotations mark the 32-user base-station, the mobile and the DSP.]

Ideal C64x (w/o co-proc) needs ~15 GHz for real-time


Expected SWAP power : base-station

32 user base-station with 3 multipliers per cluster and 64 clusters: 0.13 micron, 1.2 V

Peak Active Power: ~18.19 mW for 1 MHz (increased multipliers)

Area: ~93.4 mm²

Total Peak Base-station power consumption: ~18.19 W at 1 GHz for 32 users at 128 Kbps/user
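The totals quoted in the talk follow from scaling the per-MHz peak active power linearly with clock frequency at a fixed supply voltage. A minimal sketch of that scaling, assuming linearity holds over the whole range; the function name `peak_power_mw` is hypothetical.

```c
#include <assert.h>

/* Peak active power in mW, given power per MHz and the clock in MHz.
 * Assumes power grows linearly with frequency at a fixed voltage. */
double peak_power_mw(double mw_per_mhz, double clock_mhz)
{
    assert(mw_per_mhz >= 0.0 && clock_mhz >= 0.0);
    return mw_per_mhz * clock_mhz;
}
```

For the handset configuration, ~9 mW at 1 MHz scales to ~90 mW at 10 MHz. For the base-station configuration, ~18.19 mW at 1 MHz scales to ~18190 mW, or ~18.19 W, at 1 GHz.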


Current research

SWAPs : Completely flexible and general

How do we trade-off flexibility for better performance?

Handset SWAPs (H-SWAPs)


Let’s look at ASICs

*VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding, M. Vaya, J. R. Cavallaro, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2003, Hong Kong

[Figure: execution time on three architectures. DSP: 200 MHz (~1 bit / 2000 cycles). SWAP, using DP: 10 MHz (~1 bit / 100 cycles). ASIC/FPGA, using task pipelining and dedicated interconnect: 128 KHz* (1 bit / cycle).]


Handset SWAPs: H-SWAPs

Trade Data Parallelism for Task Pipelining

Design SWAPlets and customize each SWAPlet

[Figure: SWAPs (max. clusters and reconfigure): an SRF feeds many identical +++*** clusters that exploit DP. SWAPlet (limit clusters): a few identical clusters with limited DP. H-SWAPs (collection of customized SWAPlets): several SWAPlets, each with limited DP and its own functional-unit mix, such as +++*, ++* and ++++.]


H-SWAPs: Viterbi decoding

Survivor management – serial

Finding parallel solution for SWAPs – expensive

> 50% of execution time: overhead

Serial solution now possible with H-SWAPs

Better performance with less flexibility!!

[Figure: H-SWAPs for Viterbi decoding. An ACS unit (ACS clusters with limited DP) feeds a traceback unit (TBU).]


H-SWAPs: Potential advantages

[Figure: execution time, SWAPs vs. H-SWAPs. SWAPs panel: DSP (RE), then SWAP (DP), then ASIC/FPGA with real-time performance (task pipelining, dedicated interconnect). H-SWAPs panel: DSP (RE), then H-SWAP (partial DP + task pipelining, application-specific units), then ASIC/FPGA with real-time performance (dedicated interconnect).]


Current research

Task vs. data parallelism tradeoffs

Evaluation of specialized inter-cluster communication

Integrating specialized arithmetic units (ACS, on-line)

Learning to migrate from H-SWAPs to SWAPs

Scale to future systems!!


Future research: efficient algorithms

[Figure: block diagram with the blocks: multipath channel; equalizer; MRC; decoder; detector; demodulator; non-coherent or partially coherent STC; beamforming; coherent STC; channel estimator; CSIR; finite feedback channel; turbo equalizer.]


Future research: architectures

Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs

Some other potential applications

Image processing: Cameras: variety of compression algorithms

Biomedical applications: Hearing aids: DSP running on body heat*

Sensor networks: Compression of data before transmission

*Quote: Gene Frantz, TI Fellow


Conclusions

Need flexible architectures for future wireless devices

Higher data rates, lower power, more complex algorithms

Design methodology (SWAPs, H-SWAPs, ASICs)

Flexibility vs. performance trade-offs

Blurs distinction between ASICs and programmable solutions

Also need parallel, low precision algorithms for efficient mapping

Inter-disciplinary research: computer architecture, VLSI, wireless communications, computer arithmetic, compilers
