Reconfigurable stream processors for wireless base-stations
Sridhar Rajagopal ([email protected]), September 9, 2003
Abstract
The need to support evolving standards, rapid prototyping, and fast time-to-market are some of the key reasons for desiring programmability in future wireless base-stations. However, supporting highly complex signal processing algorithms for multiple users at high data rates (in Mbps), requiring billions of operations per second, while providing power efficiency, presents challenges in attaining that goal. This paper demonstrates the viability of stream processors for physical layer base-station processing by demonstrating a fully loaded 3G base-station for 32 users meeting 128 Kbps/user (rate 1/2, constraint length 9 coded data rate) at an estimated power consumption of 8.2 W. However, when the system load decreases and the amount of data parallelism reduces, clusters of ALUs in the stream processor remain unutilized and waste power. We provide reconfiguration support in stream processors that allows us to turn off clusters dynamically as the data parallelism reduces. Support is provided to turn off ALUs as well, such that only the minimum number of ALUs in active clusters needed to meet performance requirements are scheduled. When the system load changes from a fully loaded 32-user base-station at constraint length 9 to, say, a more typical 16 active users at constraint length 7, the power consumption can reduce to 2.23 W, with 5.06 W of power savings due to frequency scaling and a further 0.91 W due to hardware reconfiguration. Thus, by providing real-time and power-efficient support in fully programmable hardware, the architecture and algorithm development for wireless communication systems can be made independent, and limited versions of future systems can be deployed which can co-exist with current systems until programmable architecture research rises to meet the challenge.
1. Introduction
The past several decades have seen an increasing shift in embedded system designs from the traditional
custom application-specific integrated circuits (ASICs) towards programmable designs. The reasons for
this shift have been the flexibility and rapid design time available in programmable solutions. The
challenges in programmable architectures for embedded systems have been to meet real-time, area and
power constraints of the embedded application. This paper focuses on wireless cellular base-stations,
which show an increasing need for programmability and power savings compared to existing solutions.
1.1 Need for programmable base-stations
Wireless base-stations, past and current, have dedicated hardware for compute-intensive operations.
The recently announced 3G TCI1� UMTS chipset from Texas Instruments for cellular base-stations,
contains application-specific hardware for operations such as chip de-spreading and Viterbi/Turbo de-
coding (DSP co-processors) [4]. Less than 10% of the actual physical layer processing is being done
on fully programmable hardware (DSP). The lack of programmable solutions is one of the main fac-
tors in the delay in deployment and upgrading of wireless base-stations. Any proposed change in the
algorithms in the new standards means a complete re-design of the system hardware as well. One of
the chief factors contributing to the success of the PC revolution has been the fact that the application
development has been made independent of the architecture development. While attaining this is harder
for wireless base-stations due to real-time processing constraints, this is exactly what is needed for the
wireless revolution to succeed. Current solutions have to wait for new hardware to be developed for
newer standards and face difficulties co-existing with the already deployed ones and providing seamless
'handoff'. Thus, as systems change over from voice to multimedia data networks (such as 2G
to 3G to 4G) [8], intermediate services can be provided (such as 2.5G), and newer algorithms can be
implemented and tested and base-stations upgraded in a fast and efficient manner. The emerging standards
are extremely flexible in terms of providing support for a wide variety of data rates. 3G standards, for
example, support data rates from 16 Kbps to 10 Mbps, depending on the version of the standard and the
modulation, coding and spreading schemes used [2].
The support for lower data rate versions in evolving standards allows programmable architecture
designs to incorporate limited features of newer standards (limited by the processing power in the current
base-station) until architecture research catches up to support higher data rate versions of the standard.
This enables base-station algorithm designs to be done independently of the architecture design, paving
the way for quicker time-to-market, fast design time and rapid deployment of newer standards and
algorithms.
1.2 Need for power efficiency in base-stations
Power efficiency, as a secondary goal, is also desirable in designing wireless base-stations. A low
power wireless base-station helps in decreasing operational and maintenance costs by decreasing the
power requirements [17]. Lowering the power also helps in reducing the cooling requirements at the
base-station. Both of the above aspects will become more significant in future years as base-stations get
denser (pico-cells) and more prevalent in high-traffic areas. Base-station power savings will also result
in providing smaller and lower cost battery back-up systems, enabling continued, uninterrupted service
during critical times of power failures, such as the recent August blackout [19].
As shown later in this document, the compute requirements of the base-station are extremely dynamic
with the load on the base-station in terms of active users and the algorithms used. The programmable
architecture design should also be able to adapt to this varying load to be power efficient.
1.3 Challenges in providing programmable, power-efficient base-stations
As standards are evolving from 2G to 3G, there is an increase in complexity of algorithms used
at the receiver. This is because more sophisticated signal processing techniques are needed for the
receiver, to provide lower bit error rates, resulting in better reception, making more efficient use of
the expensive wireless spectrum and enabling higher capacity systems (higher bits/sec/Hz). This is
especially important as we move from voice networks to data networks, as data is not as tolerant as voice
to errors. The combination of the higher operation count requirements of the sophisticated algorithms
along with the increase in data rates results in a 1-2 order of magnitude increase in the complexity of the
receiver, as shown in Figure 1. For 4G standards, where multiple antenna systems
(MIMO) are being proposed, the data rates again increase due to simultaneous transmission across the
antennas, resulting in more sophisticated signal processing algorithms at the receiver, and a complexity
increase of another 1-2 orders of magnitude can be expected, depending on the number of antennas.
The concept of all-software radio base-stations is not new [9, 18, 21]. However, designing
programmable architectures that can meet the high-performance real-time requirements is still a formidable
challenge [7]. Figure 1 shows the increase in the physical layer operation count from millions of
operations per second (MOPs) to billions of operations per second (GOPs) as a 32-user base-station goes
from 2G (supporting 16 Kbps/user (coded data rate)) to 3G (supporting 128 Kbps/user). Note that the
numbers shown are for the base operations (additions and multiplications) required for major compute-
intensive parts of the physical layer processing at the base-station. Programmable architectures will have
an additional overhead while mapping them onto their instruction set (such as load, store, loop increments,
...) and accounting for dependencies, register pressure and latencies.

[Figure: operation count (in GOPs) for eight 32-user base-station load scenarios (cases 1-8: 4, 8, 16, and 32 users, each at constraint lengths K = 7 and 9), comparing a 2G base-station (16 Kbps/user) with a 3G base-station (128 Kbps/user).]
Figure 1. Operation count requirements for real-time physical layer processing in 2G and 3G base-stations
Providing support for billions of computations per second at low power implies having hundreds of
functional units for a 100 MHz programmable architecture (assuming high functional unit efficiency).
Current single processor DSP-based architectures do not have sufficient functional units in order to meet
real-time requirements for these algorithms. Extending the DSP architectures for more functional units
is not a feasible solution as the register file size and port interconnections rapidly begin to dominate
the architecture area with cubic complexity related to the functional units [29]. Also, it is difficult for
the compiler to find and extract instruction level parallelism and sub-word parallelism as the number
of functional units increase in the architecture. This leads to low functional unit utilizations in such
architectures. A multi-user base-station based on TI DSPs without co-processors, such as the C62x, may
require more than one DSP per user [3] for a 3G base-station. However, multi-processor DSP systems
(including chip multiprocessors) suffer from inter-processor communication bottlenecks (although the cost of
off-chip accesses is amortized over more cores). The instruction sequencing and control logic is replicated
in every tile in most of these solutions. Multi-processor solutions also do not scale well to changes
in performance or load requirements [14].
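As a back-of-envelope illustration of the functional-unit arithmetic above (the workload and efficiency numbers below are illustrative assumptions, not measurements from this paper):

```python
def required_fus(gops, clock_mhz, efficiency):
    """Functional units needed: demanded ops/s over achieved ops/s per FU."""
    return (gops * 1e9) / (clock_mhz * 1e6 * efficiency)

# A ~20 GOPs 3G-class workload on a 100 MHz architecture needs 200 FUs even
# at perfect efficiency; realistic efficiencies push the count higher.
print(required_fus(gops=20, clock_mhz=100, efficiency=1.0))  # 200.0
```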
[Figure: block diagram of the receiver: multiple users' signals arrive at the antenna and RF front-end; baseband processing consists of multi-user estimation, multi-user detection and Viterbi decoding, which feeds the higher layers.]
Figure 2. Physical layer processing
In this paper, we explore stream processors for meeting real-time requirements and power-efficiency
in software base-stations. We present methods to reconfigure them in software and hardware to adapt
to the varying load conditions at the base-station. We show that providing frequency scaling and
reconfiguration support in hardware allows us to adapt the architecture and provide power efficiency while
maintaining real-time performance. The following sections of this paper elaborate on the above concepts
and present the results and conclusions drawn.
2. Wireless applications
The physical layer in wireless communications is the layer between the RF antennas and the medium
access control layer. The computationally-intensive operations occurring in the physical layer are those
of channel estimation, detection and decoding. Figure 2 shows the compute-intensive blocks of the
physical layer in wireless communication systems [6]. Channel estimation refers to the process of deter-
mining the channel parameters such as the amplitude and phase of the received signal. These parameters
are then given to the detector, which detects the transmitted bits. The detected bits are then forwarded
to the decoder, which removes the error protection code on the transmitted signal and then sends the
decoded information bits to the MAC and higher layers. Although the physical layer also contains other
blocks such as Automatic Gain Control (AGC), Automatic Frequency Control (AFC), frequency off-
set compensation etc., these blocks are not compute-intensive and hence, in this paper, we will restrict
ourselves to attaining real-time for the compute-intensive algorithms in the base-station receiver.
2.1 2G Base-station processing
For the purposes of this paper, we will consider a 2G base-station to consist of simple versions of
channel estimation, detection and decoding, and a subset of a 3G base-station. Specifically, we will
consider a 32-user base-station system providing support for 16 Kbps/user (coded data rate), employing
a sliding correlator as a channel estimator and a code matched filter as a detector [20], followed by Viterbi
decoding [33].
A sliding correlator correlates the known (pilot) bits with the received data to calculate
the timing delays and phase-shifts. The operations involved in the sliding correlator used in our
design involve outer product updates. The matched filter detector despreads the received data and
converts the received 'chips' into 'bits' ('chips' are the values of a binary waveform used for spreading
the data 'bits'). The operations involved in matched filtering at the base-station involve a matrix-vector
product, with complexity proportional to the number of users. Both these algorithms are amenable to
parallel implementations. Viterbi decoding [33] is used to remove the error control coding done at the
transmitter. The strength of the error control code is usually dependent on the severity of the channel:
a channel with a good signal-to-noise ratio is usually coded less strongly than a bad channel. The reason
for lower strength coding for good channels is that the channel decoding complexity is exponential in
the strength of the code. Viterbi decoding typically involves a trellis [33] with two phases of computation:
an add-compare-select phase and a traceback phase. The add-compare-select phase in Viterbi goes
through the trellis to find the most likely path with the least error and has data parallelism proportional
to the strength (constraint length) of the code. However, the traceback phase traces the trellis backwards
and recovers the transmitted information bits. This traceback is inherently sequential and involves dynamic
decisions and pointer chasing. With large constraint lengths, the computational complexity
of Viterbi decoding increases exponentially and becomes the critical bottleneck, especially as multiple users
need to be decoded simultaneously in real-time. This is the main reason for Viterbi accelerators in the
C55x DSP and co-processors in the C6416 DSP from Texas Instruments, as the DSP cores are unable to
handle the needed computations in real-time.
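Returning to the matched filter above: its despreading step is essentially a matrix-vector product. The sketch below uses toy dimensions and orthogonal Walsh (Hadamard) spreading codes for clarity; these are simplifying assumptions, not the paper's system parameters.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n +/-1 Hadamard matrix (n a power of 2)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

users, gain = 4, 8                    # toy sizes: 4 users, spreading gain 8
codes = hadamard(gain)[:users]        # one orthogonal spreading code per user
sent_bits = np.array([1, -1, -1, 1])

received = codes.T @ sent_bits        # noise-free superposition of all users

# Matched filter: despread by correlating against each code -- a
# matrix-vector product with cost proportional to the number of users
soft = codes @ received
detected = np.sign(soft).astype(int)
print((detected == sent_bits).all())  # True
```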
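The add-compare-select phase can likewise be sketched. The branch metrics below are placeholders and the tiny constraint length is chosen only to keep the trellis readable; the point is that every state's update is independent, which is the data parallelism noted above.

```python
import numpy as np

K = 3                                 # constraint length (toy value)
n_states = 1 << (K - 1)               # 2^(K-1) = 4 trellis states

path_metrics = np.array([0.0, 1.0, 2.0, 3.0])
# branch_metrics[p, s]: cost of the transition p -> s (placeholder values)
branch_metrics = np.arange(n_states * n_states,
                           dtype=float).reshape(n_states, n_states)

new_metrics = np.empty(n_states)
survivors = np.empty(n_states, dtype=int)
for s in range(n_states):             # every state updates independently
    preds = (s >> 1, (s >> 1) | (n_states >> 1))  # two predecessor states
    cands = [path_metrics[p] + branch_metrics[p, s] for p in preds]
    best = int(np.argmin(cands))
    new_metrics[s] = cands[best]      # add, compare, select
    survivors[s] = preds[best]

print(new_metrics)  # [0. 1. 7. 8.]
```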
The break-up of the operation count and the memory requirements for a 2G base-station are shown in
Figure 3 and Figure 4.

[Figure: operation count break-up (in GOPs) among estimation, detection and decoding for eight 2G base-station load scenarios (cases 1-8: 4, 8, 16, and 32 users, each at constraint lengths K = 7 and 9).]
Figure 3. Operation count break-up for a 2G base-station

[Figure: memory requirements (in KB) among estimation, detection and decoding for eight 2G base-station load scenarios (cases 1-8: 4, 8, 16, and 32 users, each at constraint lengths K = 7 and 9).]
Figure 4. Memory requirements of a 2G base-station

2.2 3G base-station processing

For the purposes of this paper, a 3G base-station contains the elements of a 2G base-station, along
with some more sophisticated signal processing elements for better accuracy of channel estimates and
for eliminating interference between users. Specifically, we consider a 32-user base-station with 128
Kbps/user (coded), employing multiuser channel estimation, multiuser detection and Viterbi decoding
[25].
Multiuser channel estimation refers to the process of jointly estimating the channel parameters for
all the users at the base-station. Since the received signal has interference from other users, jointly
estimating the parameters allows us to obtain the optimal maximum-likelihood estimate of the channel
parameters for all users. However, a maximum likelihood estimate has a significant increase in
computational complexity over single-user estimates [30], but provides a much more reliable channel estimate
to the detector.
The maximum likelihood solutions also involve matrix inversions, which present difficulties in
numerical stability with finite precision computations and in exploiting data parallelism in a simple manner.

[Figure: operation count break-up (in GOPs) among multiuser estimation, multiuser detection and decoding for eight 3G base-station load scenarios (cases 1-8: 4, 8, 16, and 32 users, each at constraint lengths K = 7 and 9).]
Figure 5. Operation count break-up for a 3G base-station
Hence, we used the conjugate-gradient descent based algorithm proposed in [24], which approximates
the matrix inversion and replaces it with matrix multiplications, which are
simpler to implement and can be computed in finite precision without loss in bit error rate performance.
The details of the implemented algorithm are presented in [24]. The computations involve matrix-matrix
multiplications of the order of the number of users and the spreading gain.
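The general idea, solving a linear system using only matrix-vector products rather than an explicit inverse, can be sketched with textbook conjugate gradient. This is not the exact algorithm of [24], just an illustration of why the inversion can be avoided; the matrix and vector below are placeholders.

```python
import numpy as np

def cg_solve(A, b, iters=50, tol=1e-12):
    """Solve A x = b for symmetric positive definite A using only
    matrix-vector products (no explicit matrix inversion)."""
    x = np.zeros_like(b)
    r = b - A @ x                 # residual
    p = r.copy()                  # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # stand-in for a correlation matrix
b = np.array([1.0, 2.0])
x = cg_solve(A, b)
print(np.allclose(A @ x, b))  # True
```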
Multi-user detection [32] refers to the joint detection of all the users at the base-station. Since all the
wireless users interfere with each other at the base-station, interference cancellation techniques are used
to provide reliable detection of the transmitted bits of all users. The detection rate directly impacts the
real-time performance. We choose a parallel interference cancellation based detection algorithm [31]
for implementation, which has a bit-streaming and parallel structure using only adders and multipliers.
The computations involve matrix-vector multiplications of the order of the number of active users in the
base-station. The break-up of the operation count and the memory requirements for a 3G base-station are
shown in Figure 5 and Figure 6.
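A single parallel interference cancellation stage can be sketched as follows. The cross-correlation matrix and bit values are illustrative, and noise is omitted for clarity.

```python
import numpy as np

def pic_stage(y, R, prev_bits):
    """One PIC stage: each user subtracts the interference reconstructed
    from the other users' current decisions (a matrix-vector product)."""
    interference = (R - np.diag(np.diag(R))) @ prev_bits
    return np.sign(y - interference)

# Illustrative 3-user cross-correlation matrix and noise-free observations
R = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
true_bits = np.array([1.0, -1.0, 1.0])
y = R @ true_bits                     # matched-filter outputs

bits = np.sign(y)                     # initial decisions
for _ in range(3):                    # three pipelined PIC stages
    bits = pic_stage(y, R, bits)
print((bits == true_bits).all())  # True
```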
[Figure: memory requirements (in KB) among multiuser estimation, multiuser detection and decoding for eight 3G base-station load scenarios (cases 1-8: 4, 8, 16, and 32 users, each at constraint lengths K = 7 and 9).]
Figure 6. Memory requirements of a 3G base-station

For the rest of the paper, we will concentrate on the design aspects of a 3G base-station architecture.

2.3 Programmable architecture mapping and characteristics

Figure 7 presents a detailed view of the physical layer processing from a programmable architecture
perspective. Specifically, we mean a programmable architecture that is able to exploit subword
parallelism, instruction-level parallelism and data parallelism.

[Figure: data-flow of the receiver: the A/D converter loads received data in real-time into a buffer; a code matched filter (fed, via a transpose unit, by matrices for the matched filter from multiuser estimation) feeds three pipelined stages of parallel interference cancellation (fed by matrices for interference cancellation); a pack and re-order buffer collects the detected bits, and Viterbi decoding, which loads the surviving states for the current user being decoded, produces decoded bits for the MAC/network layers.]
Figure 7. Detailed view of the physical layer processing.

The A/D converter provides the input to the architecture, typically in the 8-12 bit range. However, programmable architectures can typically
support only byte-aligned units and hence, we will assume 16-bit complex received data at the input of
the architecture. Since the data will be arriving in real-time at a constant rate (4 million chips/second),
it becomes apparent from Figure 7 that unless the processing is done at the same rate, the data will be
lost. The real and imaginary parts of the 16-bit data are packed together and stored in memory. Note
that there is a trade-off between storing the low precision data values in memory and their efficient
utilization during computations. Certain computations may require the packing to be removed before
the operations can be performed. Hence, at every stage, a careful decision was made on the extent of
packing of the data, balancing memory requirements against its effect on real-time performance.
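The packing of 16-bit real and imaginary parts into one word, and the unpacking some computations require, might look like this (a generic bit-manipulation sketch, not the Imagine instruction sequence):

```python
def pack(re, im):
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)

def unpack(word):
    """Recover the signed 16-bit real and imaginary parts."""
    re = (word >> 16) & 0xFFFF
    im = word & 0xFFFF
    re -= 0x10000 if re & 0x8000 else 0   # sign-extend from 16 bits
    im -= 0x10000 if im & 0x8000 else 0
    return re, im

print(unpack(pack(-1234, 567)))  # (-1234, 567)
```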
The channel estimation block does not need to be computed every bit and needs evaluation only when
the channel statistics have changed over time. For the purposes of this paper, we set the update rate of
the channel estimates at once per 64 data bits. The received signal first passes through a code matched
filter, which provides initial estimates of the received bits of all the users. The output of the code matched
filter is then sent to three pipelined stages of parallel interference cancellation (PIC), where the decisions
of the users' bits are refined. The inputs to the PIC stages are not packed, to provide efficient computation
between the stages. In order to exploit data parallelism for the sequential Viterbi traceback, we use a
register-exchange based scheme [34], which provides a forward traceback scheme with data parallelism
that can be exploited in a parallel architecture. The Viterbi algorithm is able to make efficient use of
sub-word parallelism and hence, the detected bits are packed to 4 bits per word before being sent to
the decoder. While all the users are processed in parallel until this point, it is more efficient to
utilize data parallelism inside the Viterbi algorithm rather than exploit data parallelism among users.
This is because processing multiple users in parallel implies keeping a large amount of local active memory state
in the architecture, which may not be available. Also, it is more desirable to exploit non-user parallelism
whenever possible in the architecture, as those computations can be avoided if the number of active users
in the system changes. Hence, a pack and re-order buffer is used to hold the detected bits until 64 bits of
each user have been received. The transpose unit shown in Figure 7 is implemented by consecutive odd-even
sorts between the matrix rows, as proposed in the Altivec instruction set [1].
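The register-exchange idea can be sketched as follows: every trellis state carries the bit history of its surviving path forward, so the "traceback" becomes a parallel per-state update. The survivor choices below are placeholders, not a real decode.

```python
n_states = 4
history = [[] for _ in range(n_states)]   # surviving bit history per state

# survivors[t][s] = (predecessor state, decoded bit) chosen by ACS at step t
survivors = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(1, 1), (2, 0), (3, 1), (0, 0)],
]

for step in survivors:
    # all states update simultaneously -> data-parallel "traceback"
    history = [history[pred] + [bit] for (pred, bit) in step]

print(history[0])  # [1, 1]  (bit history of the path ending in state 0)
```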
The output bits of the Viterbi decoder are binary and hence need to be packed before being sent to the
higher layers. Due to this, the decoding is done 32 bits at a time, so that a word of each user is dumped
to memory. During the forward pass through the Viterbi trellis, the states start converging as soon as the
length of the pass exceeds 5K, where K is the constraint length. The depth of the Viterbi trellis is kept at
64, where the first 32 stages contain the past history, which is loaded during the algorithm initialization,
and the next 32 stages process the current received data. The path metrics and surviving states of the
new data are calculated while the old data is traced back and dumped. When rate 1/2 Viterbi decoding
(a typical rate) is used, the 64 detected bits of each user in the pack and re-order buffer get absorbed and
new data can now be stored in the blocks.
Note that the sequential processing of the users implies that some users incur higher latencies than
other users in the base-station. This is not a significant problem, as the users are switched at
32-bit intervals. Also, the base-station could potentially re-order users such that users with more
stringent latency requirements, such as video conferencing over wireless, have priority in decoding.
For the purposes of this paper, we will consider Viterbi decoding based on blocks of 64 bits per
user. We will assume that the users use constraint lengths 5, 7 and 9 (which are typically used in wireless
standards). Thus, users with a lower constraint length will have lower amounts of data parallelism that can
be exploited (which implies that power will be wasted if the architecture exploits more data parallelism
than the lowest parallelism case provides), but will have a corresponding decrease in computational complexity as
well.
Since performance, power, and area are critical to the design of any wireless communications system,
the estimation, detection, and decoding algorithms are typically designed to be simple and efficient.
Matrix-vector based signal processing algorithms are utilized to allow simple, regular, limited-precision
fixed-point computations. These types of operations can be statically scheduled and contain significant
parallelism. It is clear from the algorithms in wireless communications that a variable amount of work
needs to be performed in constant time, and this motivates the need for a flexible architecture that adapts
to the workload requirements. Furthermore, these characteristics are well suited to parallel architectures
that can exploit data parallelism with simple, limited precision computations.
3. Stream Processors
3.1 Choosing the right programmable architecture
Vector processors solve the problem of providing hundreds of functional units by providing clusters of
arithmetic units, and have traditionally been used in scientific computing to exploit data parallelism.
Stream processors combine the features of VLIW-based DSPs with traditional vector processors,
exploiting instruction-level, subword and data parallelism. Stream processors enhance the capabilities of
vector processors by providing a register hierarchy, allowing them to support many more ALUs for the
same memory bandwidth (see the discussion on stream vs. vector processors in [13]). Stream
processors [27] have been shown to be very suitable for high performance multimedia processing. Since
multimedia processing contains many signal processing algorithms similar to those used in wireless,
such as FFTs and FIR filters, in this paper we explore stream processors as a viable candidate for physical
layer processing in wireless base-stations.
[Figure: stream processor architecture: arithmetic clusters 0 through C, each attached to a bank (SRF 0 through SRF C) of the centrally located stream register file (SRF), which connects to external memory; the clusters communicate via an inter-cluster communication network.]
Figure 8. Stream Processor Architecture
3.2 Stream Processor background
Special-purpose processors for wireless communications perform well because of the abundant
parallelism and regular communication patterns within physical layer processing. These processors efficiently
exploit these characteristics to keep thousands of arithmetic units busy without requiring many expensive
global communication and storage resources. The bulk of the parallelism in wireless physical layer
processing can be exploited as data parallelism, as identical operations are performed repeatedly on
incoming data elements.
A stream processor can also efficiently exploit this kind of data parallelism, as it processes individual
elements from streams of data in parallel [27]. Streams are stored in a stream register file, which can
efficiently transfer data to and from a set of local register files between major computations. Local
register files, co-located with the arithmetic units, directly feed those units with their operands. Truly
global data, data that is persistent throughout the application, is stored off-chip only when necessary.
These three explicit levels of storage form an efficient communication structure to keep hundreds of
arithmetic units efficiently fed with data. The Imagine stream processor developed at Stanford is the first
implementation of such a stream processor [27].
Figure 8 shows the architecture of a stream processor with C arithmetic clusters. Operations in a
stream processor all consume and/or produce streams, which are stored in the centrally located stream
register file (SRF). The two major stream instructions are memory transfers and kernel operations. A
stream memory transfer either loads an entire stream into the SRF from external memory or stores
an entire stream from the SRF to external memory. Multiple stream memory transfers can occur
[Figure: stream register file implementation: a single-ported memory array of M blocks, each of N words; stream buffers, each two blocks wide, connect the array's single N-word port to clusters 0-7 and to other stream clients.]
Figure 9. Stream Register File Implementation
simultaneously, as hardware resources allow. A kernel operation performs a computation on a set of
input streams to produce a set of output streams. Kernel operations are performed within a data parallel
array of arithmetic clusters. Each cluster performs the same sequence of operations on independent
stream elements.
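The kernel execution model, the same operation sequence applied across clusters to independent stream elements, can be sketched as follows (the cluster count and kernel are illustrative):

```python
n_clusters = 8
stream_in = list(range(20))           # a 20-element input stream

def kernel(x):
    return 2 * x + 1                  # same instruction sequence everywhere

stream_out = []
for base in range(0, len(stream_in), n_clusters):
    batch = stream_in[base:base + n_clusters]    # one element per cluster
    stream_out.extend(kernel(x) for x in batch)  # conceptually simultaneous
print(stream_out[:5])  # [1, 3, 5, 7, 9]
```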
3.3 The Stream Register File
The heart of a stream processor is the stream register file. This register file must have sufficient
capacity to hold the working set of streams of an application and provide sufficient bandwidth to keep the
arithmetic clusters busy. In modern VLSI, these are typically conflicting goals, as it is not possible to
build a high bandwidth random access memory efficiently. Fortunately, the nature of a stream processor
severely restricts the type of accesses to the SRF, enabling an efficient, high-bandwidth
implementation [28]. The stream register file only supports sequential accesses to data streams. A set of stream
buffers external to the memory array of the SRF allows multiple streams to be accessed simultaneously.
However, there is only a single, wide port into the memory array of the SRF. Each stream buffer
arbitrates for access to that port to transfer data to or from the memory array. A stream buffer is twice the
width of the memory port, so that streams may be accessed in one half of the stream buffer while the
other half of the stream buffer is being filled or drained by the memory array.
The design of an SRF is shown in Figure 9. The figure shows part of the SRF for an eight-cluster stream
processor [27, 28]. In Figure 9, only a single stream buffer connected to the arithmetic clusters is shown,
but in practice the SRF would include numerous stream buffers - several connected to the arithmetic
clusters and several connected to other stream clients, such as the memory system. Streams are stored
in the SRF inside a single-ported memory array. In the figure, the memory array contains M blocks,
each consisting of N words. The single port into the memory array is one block (N words) wide. Data
may only be transferred to/from the memory array a block at a time.
To understand how the SRF operates, consider a kernel that is reading a stream from the SRF. A
stream descriptor is used to indicate the starting block address of the stream, the length of the stream
in words (a stream must start on a block boundary, but may have any number of words), and whether the
stream is being read or written. Initially, the stream buffer is empty. Since the stream is being read, the
first action the stream buffer will take is to arbitrate for access to the single port into the memory array.
When the stream buffer is granted access, it makes a single transfer from the memory array, filling half
of the stream buffer (N words in the figure). Note that N can be larger than the number of clusters. The
stream buffer that feeds the clusters is effectively banked (as is the SRF), with the same number of banks
as there are clusters. Words will be interleaved in the memory array such that every eighth word in the
stream is stored in the same stream buffer bank. That way, each cluster will access every eighth stream
element.
Once the first transfer to the stream buffer from the memory array occurs, the clusters may begin
accessing the stream. For each read requested by the clusters, the stream buffer will return the next eight
consecutive elements. Simultaneously, the stream buffer will also arbitrate for access to the memory port
again. When granted access, it makes another block transfer to fill the other half of the stream buffer.
After N/8 reads (each read transfers 8 words, one word per cluster) from the clusters, the initial half of
the stream buffer will be empty, and the stream buffer will then again arbitrate for access to the memory
port in order to refill itself. This process continues until the length specified in the stream descriptor is
reached. If, at any time, the stream buffer is completely empty and has not yet been granted access to
the memory port, the stream buffer will indicate that it is not ready for accesses by the clusters. In that
case, the clusters must wait until the stream buffer is ready before proceeding with the read operation.
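The double-buffered behavior described above can be modeled in a few lines; the block and cluster sizes below are illustrative, and arbitration for the port is abstracted away.

```python
N = 16            # words per SRF block (= half the stream buffer)
n_clusters = 8

def stream_reads(stream):
    """Yield n_clusters words per cluster read, refilling the buffer with
    N-word block transfers from the single-ported memory array."""
    buffer = []
    while stream or buffer:
        if len(buffer) < n_clusters and stream:
            buffer += stream[:N]      # one block transfer fills half the buffer
            stream = stream[N:]
        group, buffer = buffer[:n_clusters], buffer[n_clusters:]
        yield group                   # one word delivered to each cluster

groups = list(stream_reads(list(range(40))))
print(len(groups))  # 5 reads of 8 words each for a 40-word stream
```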
3.4 Stream Processor evaluation for Wireless Communications
Figure 10 shows the performance of stream processors for a fully loaded 2G and 3G base-station
with 32 users at 128 Kbps and 16 Kbps respectively. The architecture assumes sufficient memory in the
[Figure 10 plot: frequency needed to attain real-time (in MHz) vs. number of clusters; curves for the 3G base-station (128 Kbps/user), the 2G base-station (16 Kbps/user), and the corresponding C64x TI DSP reference points]
Figure 10. Real-time performance of stream processors for 2G and 3G fully loaded base-stations
SRF to avoid accesses to external memory (256 KB) and 3 adders and 3 multipliers per cluster (chosen
because matrix algorithms tend to use an equal number of adders and multipliers). For our implementation
of a fully loaded base-station, the parallelism available for channel estimation and detection is limited by
the number of users, while for decoding, it is limited by the constraint length (as the users are processed
sequentially).
Ideally, we would like to exploit all the available parallelism in the algorithms, as that would lower the
clock frequency. A lower clock frequency has also been shown to be power efficient, as it opens
up the possibility of reducing the voltage [23], thereby reducing the effective power consumption
(P ∝ V²f). While Viterbi at constraint length 9 has sufficient parallelism to support up to 64 clusters at full
load, the channel estimation and detection algorithms do not benefit from extra clusters, having exhausted
their parallelism. The mapping of a smaller-size problem to a larger cluster count is also complicated
because the algorithm mapping has to be adapted by some means, either in software or in hardware, to
use a lower number of clusters while the data is spread across the entire SRF.
From the figure, we can see that a 32-cluster architecture can operate as a 2G base-station at around 78
MHz and as a 3G base-station at around 935 MHz, thereby validating the suitability of Imagine as a viable
architecture for 3G base-station processing. It is also possible that other multi-DSP architectures such
as [5] can be adapted and are viable for 3G base-station processing. However, the lack of multi-DSP
simulators and tools limits our evaluation of these systems.
Kernel Max. Parallelism Reason
Estimation 32 number of users (32), spreading gain (32)
Detection 32 number of users (32), spreading gain (32)
Decoding 4/16/64 Constraint lengths 5,7,9, packing 4 bits per word
Table 1. Available Data Parallelism in Wireless Communication Tasks
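The decoding entry in Table 1 follows from the trellis size: a constraint length K decoder has 2^(K-1) states, and, by our reading of the table's "packing 4 bits per word", four states' worth of work is packed per word. A minimal sketch under that assumption:

```python
# Sketch (our reading of Table 1): decoding parallelism for constraint
# length K is the 2**(K-1) trellis states divided by 4 states packed per word.

def decoding_parallelism(K, states_per_word=4):
    """Data parallelism available to the Viterbi decoder at constraint length K."""
    return 2 ** (K - 1) // states_per_word

# Constraint lengths 5, 7, 9 give the 4/16/64 figures in Table 1:
print([decoding_parallelism(K) for K in (5, 7, 9)])  # [4, 16, 64]
```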
4. Reconfigurable Stream Processors
A reconfigurable stream processor that can change the number of active clusters would enable kernels
that have enough data parallelism to make use of a fully-enabled processor, while allowing unusable
clusters to be shut off for kernels with limited data parallelism.
Table 1 shows the available data parallelism of the tasks presented in Section 2. From the table, it is
obvious that the available data parallelism changes throughout the application. However, as shown in
Figure 9 of Section 3, the stream buffers attached to the stream register file feed all clusters. So, if some
of the clusters are inactive, then valid data in a stream will be sent to inactive clusters and lost. Therefore,
to successfully deactivate clusters in a stream processor, stream data must be rerouted so that the active
clusters receive the appropriate data. Three different mechanisms can be used to reroute the stream data
appropriately: by using the memory system (software), by using conditional streams (software), and by
using a multiplexer network (hardware).
4.1 Reorganizing Streams using Memory Transfers (Software)
A data stream can be realigned to match the number of active clusters by first transferring the stream
from the SRF to external memory, and then reloading it so that it is placed only in the SRF banks
that correspond to active clusters. Figure 11 shows how this would be accomplished on an eight-cluster
stream processor. In the figure, a 16-element data stream, labelled Stream A, is produced
by a kernel running on all eight clusters. For clarity, the banks within the SRF are shown explicitly and
the stream buffers are omitted. The stream is striped across all eight banks of the SRF. If the
machine is then reconfigured to only have four active clusters, the stream needs to be striped across only
the first four banks of the SRF. By first performing a stream store instruction, the stream can be stored
contiguously in memory, as shown in the figure. Then, a stream load instruction can be used to transfer
the stream back into the SRF, using only the first four banks. The figure shows Stream A' as the result
of this load. As can be seen in the figure, the second set of four banks of the SRF would not contain any
valid data for Stream A'.
[Figure 11 diagram: Stream A (elements 0-15) striped across eight SRF banks; Step 1 stores the stream contiguously to memory; Step 2 loads it back as Stream A', striped across the first four banks only]
Figure 11. Reorganizing Streams in Memory
While this mechanism works, it is quite inefficient. One of the reasons why stream processors can keep
so many arithmetic units busy is that data is infrequently transferred between memory and the processor
core. Forcing all data to go through memory whenever the number of active clusters changes directly
violates this premise. Therefore, using external memory to reorganize data streams will likely cause
numerous stalls, and result in more idle arithmetic units than if the machine were to remain fully active.
Using the memory system in this way is neither desirable nor productive.
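Figure 11's two-step restripe can be illustrated with a small sketch of ours (a functional model, not the simulator's implementation): gather the round-robin-striped stream into flat memory, then redistribute it over only the active banks.

```python
# Hypothetical model of the store/load restripe in Figure 11.

def store_to_memory(srf_banks):
    """Step 1: gather the round-robin-striped stream into a flat array."""
    n = sum(len(b) for b in srf_banks)
    mem = [None] * n
    for bank_id, bank in enumerate(srf_banks):
        for j, word in enumerate(bank):
            mem[j * len(srf_banks) + bank_id] = word
    return mem

def load_from_memory(mem, active_banks, total_banks):
    """Step 2: restripe across only the first `active_banks` banks."""
    banks = [[] for _ in range(total_banks)]
    for i, word in enumerate(mem):
        banks[i % active_banks].append(word)
    return banks

# 16-element stream striped over 8 banks, reconfigured to 4 active clusters:
eight = [[b, b + 8] for b in range(8)]   # bank b holds elements b and b+8
mem = store_to_memory(eight)             # contiguous: [0, 1, 2, ..., 15]
four = load_from_memory(mem, 4, 8)
print(four[0])  # [0, 4, 8, 12]; banks 4..7 hold no valid data
```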
4.2 Reorganizing Streams using Conditional Streams (Software)
Stream processors already contain a mechanism for reorganizing data streams: conditional
streams [12]. Conditional streams allow data streams to be compressed or expanded in
the clusters so that the direct mapping between SRF banks and clusters can be violated. When using
conditional streams, stream input and output operations are predicated in each cluster. If the predicate
is true in a particular cluster, it will receive the next stream element; if the predicate is false, it
will receive garbage. As an example, consider a stream processor executing a kernel that
reads a stream conditionally.
Figure 12 presents an example of a 4-cluster stream processor that reconfigures to 2 clusters
using conditional streams. The conditional buffer data is routed only to those clusters whose conditional
switch (predicate) is set. Hence, if the conditional switches are set only in the first 2 clusters, all the
data is routed only to the first 2 clusters, irrespective of the total number of clusters in the stream
processor.
[Figure 12 diagram: two conditional-stream accesses on a 4-cluster processor; with predicates set only on clusters 0 and 1, access 0 delivers elements A and B and access 1 delivers C and D to those clusters]
Figure 12. Reorganizing Streams with Conditional Streams
In order to use conditional streams to deactivate clusters, the active clusters would always use a true
predicate on conditional input or output stream operations and inactive clusters would always use a false
predicate. This has the effect of sending consecutive stream elements only to the active clusters. As
explained in [12], conditional streams are used for three main purposes: switching, combining, and load
balancing. Using conditional streams to completely deactivate a set of clusters is really a fourth use of
conditional streams.
In a stream processor, conditional input streams are implemented by having each cluster read data
from the stream buffer as normal, but instead of directly using that data, it is first buffered. Then,
based upon the predicates, the buffered data is sent to the appropriate clusters using an intercluster
switch [12]. Conditional output streams are implemented similarly: data is buffered, compressed, and
then transferred to the SRF. Therefore, when using conditional streams, inactive clusters are not truly
inactive. They must still buffer and communicate data. Furthermore, if conditional streams are already
being used to switch, combine, or load balance data, then the predicates must be modified to also account
for active and inactive clusters, which complicates programming.
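The predicate-driven compression described above can be sketched as follows (our functional model; the real hardware routes buffered words over the intercluster switch [12], and false-predicate clusters receive garbage rather than None):

```python
# Sketch of a conditional input stream: consecutive elements are delivered
# only to clusters whose predicate is true, in cluster order.

def make_conditional_stream(stream):
    """Return a read function that remembers its position in the stream."""
    it = iter(stream)
    def read(predicates):
        # True-predicate clusters get the next elements; others get None.
        return [next(it) if p else None for p in predicates]
    return read

# 4-cluster machine with only clusters 0 and 1 "active", as in Figure 12:
read = make_conditional_stream(["A", "B", "C", "D"])
print(read([True, True, False, False]))  # ['A', 'B', None, None]
print(read([True, True, False, False]))  # ['C', 'D', None, None]
```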
[Figure 13 diagram: a reconfigurable 4-cluster stream processor with a mux/demux network (and comm network) between the SRF stream buffers and the clusters, shown (a) unconfigured and reconfigured to (b) 4 clusters, (c) 2 clusters, and (d) 1 cluster]
Figure 13. Reorganizing Streams with a Multiplexer/Demultiplexer Network
4.3 Reorganizing Streams using a Multiplexer Network (Hardware)
Both the memory system and conditional streams are inefficient mechanisms for reorganizing streams.
While providing the flexibility to use a smaller number of clusters, neither method provides any
way to reduce the power wasted by unused clusters. Clusters are the main source of power consumption
in stream processors, as they contain the arithmetic units, the scratchpad, the register files and the inter-cluster
communication unit, and can account for up to 90% of the power dissipation in the architecture
[16]. To give an idea of the possible power wastage, when a 32-user base-station with constraint
length 5 is used instead of constraint length 9, only 4 out of the 32 clusters are used during decoding.
A multiplexer/demultiplexer network between the stream buffers of the SRF and the arithmetic clusters
can efficiently route stream data only to active clusters and can turn off unused clusters using clock and
Vdd-gating techniques [10, 22], thereby providing power savings. Figure 13 shows how such a network
would operate.
The mux network shown is per stream. Each input stream is configured as a multiplexer and each
output stream as a demultiplexer during the execution of a kernel. The cluster reconfiguration is
done dynamically by providing support in the instruction set of the host processor for setting the number
of active clusters during the execution of a kernel. By default, all the clusters are turned off when kernels
are inactive, thereby providing power savings during memory stalls and system initialization. Another
advantage of the mux network is that the user does not have to modify the code as long as the stream
length is a multiple of the number of clusters.
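A minimal sketch (ours) of the per-stream input multiplexer: with n_active of n_total clusters enabled, each access fans the next n_active words out to the low-numbered clusters, and disabled clusters receive nothing.

```python
# Hypothetical model of one input-stream access through the mux network.

def mux_read(stream, offset, n_active, n_total):
    """Deliver the next n_active words to clusters 0..n_active-1;
    deactivated clusters (n_active..n_total-1) get nothing."""
    words = [None] * n_total
    for c in range(n_active):
        words[c] = stream[offset + c]
    return words, offset + n_active

stream = list(range(16))
words, off = mux_read(stream, 0, n_active=4, n_total=8)
print(words)  # [0, 1, 2, 3, None, None, None, None]
words, off = mux_read(stream, off, 4, 8)
print(words)  # [4, 5, 6, 7, None, None, None, None]
```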
Value     Description
632578    Energy of an ALU operation
79365     LRF access energy
155       Energy of an SB access (1 bit)
8.7       SRAM access energy per bit
1603319   SP access energy
1         Normalized wire propagation energy per wire track
Table 2. Power consumption parameters (normalized to wire-track energy units)
4.4 Power analysis
For the purposes of power analysis in this paper, we assume the parameters shown in Table 2.
All energy values are normalized to the propagation energy of a single wire track, and the distance
between two adjacent clusters is measured in wire-track units. These numbers are based on [16], with
modifications to account for fixed-point ALUs (instead of floating point). The modifications are based
on a low-power version of stream processors for embedded systems (see (LP) in Chapter 6 of [14]).
These modifications reduce the power consumption from an estimated 16 mW/MHz for an 8-cluster
Imagine at 1.2 V [13] to approximately 8 mW/MHz for a 32-cluster low-power stream processor with 3
multipliers and 3 adders per cluster. For the rest of the paper, we consider a voltage of 1.2 V and assume
that power scales linearly with frequency in order to obtain estimates (since the scope for voltage scaling
is limited at 1.2 V).
4.5 Additional power consumption due to the mux network
To a first order, the power consumption of the mux network is dominated by wire-length capacitance
[16]. Assuming a linear layout of the clusters (worst case), the total wire length of the mux network
grows quadratically with the number of clusters, and for 8 streams of 32 bits each, the network requires
a correspondingly large total wire length [16]. The clusters can instead be re-ordered according to the
bit-reversal of their locations, which can be shown to reduce the total wire length from O(C²) to
O(C log₂ C) units, where C is the number of clusters and the unit is the wire length between adjacent
clusters. The resulting power consumption of this network is less than 1% of that of a single multiplier
and has a negligible effect on the power consumption of the stream processor design.
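A rough comparison (our arithmetic, using only the asymptotic forms stated above) of the two cluster orderings:

```python
# Compare mux wiring growth for the two layouts discussed above:
# linear placement is O(C^2); bit-reversed placement is O(C log2 C).

from math import log2

def linear_wire_units(C):
    return C * C                 # O(C^2) wire-track units

def bitrev_wire_units(C):
    return int(C * log2(C))      # O(C log2 C) wire-track units

for C in (8, 16, 32):
    print(C, linear_wire_units(C), bitrev_wire_units(C))
# At 32 clusters: 1024 vs 160 units, roughly a 6x reduction in mux wiring
```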
4.6 Reconfiguring ALUs within a cluster
Different kernels have different ALU utilization factors. During the execution of different kernels, one
or more ALUs can be turned off without loss in performance. Since most wireless algorithms have static
schedules, the schedule produced during compilation can provide information about the actual number
of adders or multipliers needed to maintain performance while turning off unused ones. This allows us
to save static power dissipation, but does not greatly affect dynamic power dissipation, as the number of
operations does not change. While dynamic power is the dominant power consumption source in current
technologies, static power dissipation is increasing with emerging technology [26]. In fact, it is projected
that static power dissipation will equal dynamic power dissipation in a few generations. Hence, it is
necessary to minimize static power dissipation as well, if it can be done without loss in performance.
For this purpose, we perform an architecture exploration at compile time and schedule the instructions
on the minimum number of adders and multipliers needed to provide the required performance. A 'sleep'
instruction is added to the instruction set, which turns off the ALUs that are not scheduled for the entire
kernel. The idea of turning off idle functional units has been explored in [26]. However, it has been done
in the context of regular programs at the micro level (instruction level), where the window of opportunity
to turn off arithmetic units is limited, as it is difficult to find cases where an arithmetic unit is idle for
sufficiently long periods of time to compensate for the overhead of turning units on and off. Since we
are able to do it at the kernel level (algorithm level), turning off arithmetic units is simpler in our case,
with a greater potential for power reduction. We provide the ability to determine how many arithmetic
units need to be active during compilation by varying the static schedule mapping over the various adder
and multiplier configurations and determining the best possible schedule.
Figure 14 shows the architecture exploration of an add-compare-select (ACS) kernel in Viterbi decoding
as the number of adders and multipliers is increased. As can be observed from the figure, a 3-adder,
1-multiplier configuration provides close to the best performance, and the additional ALUs can be turned
off to save static power consumption. The expected functional unit utilization from the schedule is
provided in (adder, multiplier) form on each data point. The 3-adder, 1-multiplier configuration attains a
functional unit utilization of 62% on each adder and multiplier. Thus, the compiler's schedule can be
changed from the default to use only the minimum number of units required to meet performance
requirements, as shown in Figure 15. The default schedule of the Viterbi ACS kernel, shown in
Figure 15(a), exhibits very poor utilization of the multipliers. Since we observed from Figure 14 that a
single multiplier was sufficient, we can re-compile and re-schedule the algorithm such that all the multiply
operations are packed onto a single multiplier. The compiler automatically inserts a 'sleep' instruction for
the remaining two multipliers for the entire kernel execution period.
[Figure 14 plot: instruction count vs. number of adders (1-5) and multipliers (1-5) for the kernel "acskc", each data point annotated with its (adder, multiplier) functional unit utilization]
Figure 14. ALU exploration for a cluster for an ACS kernel
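The selection step of the exploration can be sketched as follows. This is a hypothetical model of ours: the cost function below is a toy stand-in shaped like Figure 14 (adders dominate the ACS schedule length, multipliers barely matter), not the real compiler's scheduler.

```python
# Sketch of compile-time ALU exploration: schedule every (adders,
# multipliers) configuration, keep those within a slack of the best
# instruction count, and pick the one with the fewest total ALUs.

def pick_config(schedule_cost, configs, slack=1.10):
    """schedule_cost(a, m) -> instruction count for that ALU mix."""
    best = min(schedule_cost(a, m) for a, m in configs)
    feasible = [(a + m, a, m) for a, m in configs
                if schedule_cost(a, m) <= slack * best]
    _, a, m = min(feasible)          # fewest ALUs among near-best configs
    return a, m

# Toy cost model: performance plateaus past 3 adders; 1 multiplier suffices.
def cost(a, m):
    return max(40, 120 // a) + 2 * max(0, 2 - m)

configs = [(a, m) for a in (1, 2, 3, 4, 5) for m in (1, 2, 3)]
print(pick_config(cost, configs))  # (3, 1)
```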
5. Experimental Methodology
The stream processor code is written by splitting it into 'streamC' code and 'kernelC' code, where the
kernelC code represents the computations done in the kernels of the various algorithms and the streamC
code represents the communication between the computations [15]. The stream dataflow for the algorithms
is as shown in [25]. The code was written to avoid accesses to external memory. This was done by
ensuring sufficient SRF memory and by doing all the necessary data re-ordering operations within the
clusters. This ensured that the data for the next kernel operation was always in the right place. These
optimizations were done to avoid the latency and power consumption of external memory accesses.
[Figure 15 diagram: instruction schedules on three adders and three multipliers; (a) the default schedule leaves the multipliers largely idle, (b) after exploration, the multiply operations are packed onto one multiplier and 'sleep' instructions cover the remaining two]
Figure 15. ALU Schedule for ACS kernel before and after architecture exploration
For evaluation of the multiplexer network, the base stream processor simulator 'isim' was modified
to support a mux network between the SRF and the clusters. A 'setClusters(#clusters)' macro was
added to streamC in order to set the number of active clusters used by the following kernel. The macro
implements a setClusters instruction, which is scheduled to execute just before the kernel executes so
that the clusters can be turned off for the maximum possible time period. No modifications are required
in the streamC code unless the stream lengths are shorter than the number of active clusters. In that case,
the streams need to be zero padded, and additional dummy reads must be inserted in the streamC code
to account for the zero-padded data. Apart from this, no other change is required in the code.
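The padding rule just stated can be sketched in a few lines (ours, not the simulator's code): round the stream length up to a multiple of the active cluster count, and count the extra elements that must be consumed as dummy reads.

```python
# Sketch: zero pad a stream so its length is a multiple of the active
# cluster count; the pad count equals the dummy reads the streamC code
# must account for.

def pad_stream(stream, n_active):
    pad = (-len(stream)) % n_active
    return stream + [0] * pad, pad

padded, dummies = pad_stream([1, 2, 3, 4, 5], n_active=4)
print(padded, dummies)  # [1, 2, 3, 4, 5, 0, 0, 0] 3
```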
The streamC code now looks as follows:

    :
    setClusters(16); // use 16 clusters only for the next kernel
    kernel1(input1, input2, input3, ..., output1);
    setClusters(8);
    kernel2(input1, ..., output1, output2, ...);
    :
To reconfigure the number of ALUs, a 'sleep' instruction was added to the architecture, which is
intended to use Vdd- and clock-gating techniques to minimize the static power dissipation of the ALUs.
At compile time, different numbers of adders and multipliers are explored in the architecture to
determine the best possible combination and to turn off unused units. The compiler takes 2 parameters
(-add, -mul), where the number of adders and multipliers to be scheduled is given via a command-line
interface based on the architecture exploration.
6. Results
Figure 16 shows the performance comparison between a standard 32-cluster stream processor, a
stream processor with reconfiguration support in software, and one with reconfiguration support in
hardware. To make the worst-case comparison and find the reconfiguration overhead, the workload was
chosen such that it had sufficient data parallelism and did not actually involve any reconfiguration.
[Figure 16 bar chart: kernel execution time (cycles), broken into kernel-busy time and microcontroller stalls, for the default stream processor, software reconfiguration, and hardware reconfiguration]
Figure 16. Comparing hardware and software reconfiguration schemes
From the figure, we see that the performance overhead due to reconfiguration support for a fully loaded
32-user base-station with constraint length 9 Viterbi decoding is negligible. When the reconfiguration support is
provided in software (conditional streams are used instead of normal streams, even though conditional
streams are not required), the overhead of conditional streams is absorbed in most kernels that do not
have a high I/O requirement. For kernels such as matrix multiplication, which do have significant I/O,
we have observed an increase in microcontroller stalls due to the conditional streams. However, for
the end-to-end application run, Viterbi decoding dominates the execution time and has very little I/O
in its kernels, thereby absorbing most of the overhead. When reconfiguring in hardware using the mux
network, we actually see a slight improvement in performance rather than the expected performance
loss due to reconfiguration. Reconfiguration in hardware adds a latency on the order of log₂(C) to the
input and output streams before they can reach the clusters, where C is the number of clusters. However,
the mux network also doubles as a prefetch buffer, reducing the number of microcontroller stalls in
kernels requiring high I/O, such as matrix multiplication. Further power reduction is possible by turning
off arithmetic units within a cluster. If we assume that static power dissipation will be 33% of the
total power dissipation in arithmetic units [26], and the hardware reconfigures to 1 multiplier for Viterbi
decoding from the 3 multipliers used during channel estimation and detection, a further power savings
can be obtained for this case.
When the application requires reconfiguration, such as for a half-loaded base-station with 16 users at
constraint length 7, there are significant benefits to be gained both by scaling the frequency to meet the
[Figure 17 bar chart: frequency needed to run at real-time (in MHz) vs. base-station load; cases 1-8 correspond to 4, 8, 16, and 32 users, each at constraint lengths K = 7 and 9]
Figure 17. Frequency scaling due to base-station load variation
[Figure 18 stacked bar chart: power consumption (in Watts) vs. base-station load, split into non-cluster power and cluster power used, plus power saved by ALU turn-off, mux reconfiguration, and frequency scaling]
Figure 18. Power savings due to frequency scaling and hardware reconfiguration
lower load requirements and by reconfiguring the hardware to utilize the right number of clusters and
ALUs, dynamically. Reconfiguring in hardware from 32 clusters to 16 clusters (case 5) allows us
to reduce the power consumption by 62% (8.2 W to 3.14 W) due to frequency scaling and an additional
29% (from 3.14 W to 2.23 W) due to the mux network and the ALU turn-off feature, as shown
in Figure 17 and Figure 18. Figure 17 shows the decrease in the real-time frequency requirement as the
load on the base-station decreases. We can see that the frequency goes down from around 935 MHz to
around 100 MHz as the load decreases from 32 users to 4 users. Figure 18 shows the corresponding
power savings due to frequency scaling and due to the use of the mux network, thereby
providing a strong case for the use of frequency scaling and hardware reconfiguration support in stream
processors.
It may not be possible to change the clock frequency at such a fine granularity to optimize for performance,
and the frequency may be modifiable only in steps (as in Intel's XScale architecture, which can run at
333, 400, 600, or 733 MHz). However, in these cases, all the clusters can be turned off in the remaining
available time until data is ready for the next kernel to process, saving up to 70% of the power dissipated
on the chip. The power is assumed to be linear with clock frequency, although greater power savings can
be attained by lowering the voltage along with the frequency.
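A back-of-the-envelope sketch of the stepped-frequency case (our arithmetic, using the text's linear power model, the 8 mW/MHz estimate, and the XScale-like steps; the gating model simply turns the clusters off for the idle fraction of each period):

```python
# Sketch: run at the smallest available clock step that meets real-time,
# and gate all clusters off during the remaining slack.

STEPS_MHZ = [333, 400, 600, 733]   # available frequency steps
MW_PER_MHZ = 8                     # paper's 32-cluster estimate

def power_mw(required_mhz, gated):
    f = min(s for s in STEPS_MHZ if s >= required_mhz)
    duty = required_mhz / f if gated else 1.0   # busy fraction of each period
    return f * MW_PER_MHZ * duty

# A 100 MHz real-time load must run on the 333 MHz step:
print(round(power_mw(100, gated=False)))  # 2664 mW if clusters never gate off
print(round(power_mw(100, gated=True)))   # 800 mW: ~70% saved by gating
```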
6.1. Discussion
While we have shown the dynamic nature of the computational load at the base-station, there are
alternatives to hardware reconfiguration of the clusters to support this dynamic load. If there are multiple
levels of data parallelism in the algorithm (across time, users, spreading length, and so on), the data could
be re-organized to utilize the best type of available parallelism. This may not be possible for all algorithms
and may require re-organization of the data in the SRF. A good example for this case is the Viterbi
implementation, which exhibits parallelism both in the number of users and in the constraint length. For
the fully loaded case of 32 users and constraint length 9, it is better to exploit parallelism in terms of
constraint length, as exploiting parallelism across users would require storing survivor states for all users
in the internal scratchpad, exceeding the available scratchpad space. This is also easier to scale with user
counts that are not powers of 2. However, when the constraint length is low, for example, a constraint
length 5 case with 8 users, it is better to exploit parallelism in terms of users, as the survivor states can
now easily fit in the internal scratchpad space. Hence, tradeoffs need to be made during implementation
to decide the best form of parallelism to exploit. Eventually, these optimizations must be integrated into
the compiler. There also exists the possibility of supporting multiple threads in clustered architectures,
such as the Sandbridge processor [11], allowing us to run multiple threads of the same algorithm (loop
unrolling) or to run different algorithms simultaneously. This may, however, have a higher cost in terms
of providing support for multi-threading and for exploiting thread parallelism. However, our proposed
solution of hardware reconfiguration and frequency scaling can co-exist with these solutions as well,
providing power savings in the absence of available parallelism.
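The Viterbi tradeoff above amounts to a capacity check, which can be sketched as a hypothetical decision rule (ours; the scratchpad size and per-state storage cost below are made-up illustration values, not the processor's actual parameters):

```python
# Sketch: parallelize across users only when every user's survivor states
# fit in the cluster scratchpad; otherwise parallelize across trellis states.

def viterbi_parallelism(users, K, scratchpad_words, words_per_state=1):
    states = 2 ** (K - 1)                       # trellis states per user
    survivor_words = users * states * words_per_state
    return "users" if survivor_words <= scratchpad_words else "states"

# With a made-up 1024-word scratchpad:
print(viterbi_parallelism(8, 5, 1024))    # users: 8 * 16 = 128 words fit
print(viterbi_parallelism(32, 9, 1024))   # states: 32 * 256 = 8192 words do not
```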
For 4G systems employing multiple antennas, we can again expect an order-of-magnitude higher
complexity. In order to support such a system in a programmable processor, other features that ASICs
exploit, such as task pipelining, need to be exploited, and several stream processors may need to be
chained for base-station processing. However, one has to consider the bottlenecks of inter-processor
communication and the increase in software complexity before considering such solutions. The goal of
providing a 100 Mbps programmable architecture at a 100 MHz clock frequency, with a bit processed per
cycle, is still an open question for us at this time.
When the number of users is not a power of 2, the estimation and detection algorithms do not scale
well. However, instead of zero padding, there may exist possibilities of allocating the free codes to users,
enabling 'virtual' users to exist and thereby providing higher data rates to some users. Since the
decoding is done sequentially, the decoding algorithm implementation scales well with a varying number
of users.
Finally, the benefits of the proposed multiplexer network addition are not limited to wireless communication
applications; other applications of stream processors where power is a concern can also make use of
the multiplexer network. Other multi-cluster architectures oriented towards handsets, such as [11],
could also make use of the mux network features for additional power savings when the algorithms
and/or the compiler do not provide sufficient means to exploit parallelism.
7. Conclusions
In this paper, we have shown stream processors to be a viable programmable architecture for physical
layer processing in wireless base-stations. The varying computational load in base-station processing
requires reconfiguration support in the architecture. We show that by providing this reconfiguration
support using frequency scaling and in hardware using a multiplexer network, we can adapt the
power consumption of the architecture to the computational load without compromising real-time
performance. By providing a real-time software solution for wireless base-stations, we can separate
the algorithm design from the architecture design of wireless systems. Limited versions of higher
standards can be supported while hardware design research catches up and gets upgraded to newer
standards. We can thus attain a faster time-to-market, co-existence of multiple standards during this
wireless (re)evolution, and faster upgradability and verification of newly designed algorithms and
applications.
8. Acknowledgments
We are grateful to Ujval Kapasi and Abhishek Das for their timely help, suggestions and software
support with the Imagine simulator tools, and to Brucek Khailany for his assistance in providing power
consumption estimates for the stream processor architectures used in this paper.
References
[1] Matrix transpose using Altivec instructions. http://developer.apple.com/hardware/ve/algorithms.html#transpose.[2] Third Generation Partnership Project. http:// www.3gpp.org.[3] Implementation of a WCDMA Rake receiver on a TMS320C62x DSP device. Technical Report SPRA680,
Texas Instruments, July 2000.[4] Channel card design for 3G infrastructure equipment. Technical Report SPRY048, Texas Instruments, 2003.[5] B. Ackland et al. A single-chip, 1.6 Billion, 16-b MAC/s Multiprocessor DSP.IEEE Journal of Solid-State
Circuits, 35(3):412–425, March 2000.[6] M. Barberis. Designing W-CDMA Mobile Communication Systems. CSDMAG, pages 25–32, February
1999.[7] M. Baron. Extremely High Performance - Parallel EnginesChallenge Performance Limitations. Micropro-
cessor report, February 18 2003.[8] R. Berezdivin, R. Breinig, and R. Topp. Next-generationwireless communications concepts and technolo-
gies. IEEE Communications Magazine, 40(3):108–117, March 2002.[9] S. Blust and D. Murotake. Software-Defined Radio Moves into Base-station designs. Wireless System
Design Magazine Report 11, November 1999.[10] S. Dropsho, V. Kursun, D. H. Albonesi, S. Dwarkadas, andE. G. Friedman. Managing static energy leakage
in microprocessor functional units. InIEEE/ACM International Symposium on Micro-architecture (MICRO-35), pages 321–332, Istanbul, Turkey, November 2002.
[11] J. Glossner, E. Hokenek, and M. Moudgill. An overview ofthe Sandbridge processor technology. Whitepaper at www.sandbridgetech.com.
[12] U. Kapasi, W. Dally, S. Rixner, P. Mattson, J. Owens, andB. Khailany. Efficient conditional operations fordata-parallel architectures. In33rd Annual International Symposium on Microarchitecture, pages 159–170,Monterey, CA, December 2000.
29
[13] U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D. Owens. Programmablestream processors.IEEE Computer, 36(8):54–62, August 2003.
[14] B. Khailany. The VLSI implementation and evaluation of area- and energy-efficient streaming media pro-cessors. PhD thesis, Stanford University, Palo Alto, CA, June 2003.
[15] B. Khailany, W. J. Dally, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, A. Chang, andS. Rixner. Imagine: media processing with streams.IEEE Micro, 21(2):35–46, March-April 2001.
[16] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, J. D. Owens, and B. Towles. Exploring the VLSI scalabilityof stream processors. InInternational Conference on High Performance Computer Architecture (HPCA-2003), Anaheim, CA, February 2003.
[17] N. Mera. The wireless challenge. Telephony Online, http://telephonyonline.com/ar/telecomwirelesschallenge,December 9 1996.
[18] J. Mitola. The Software Radio Architecture.IEEE Communications Magazine, 33(5):26–38, May 1995.[19] P. R. L. Monica. Wireless gets blacked out too - North America Power Blackout. CNN Money News report:
http://money.cnn.com/2003/08/15/technology/landlines/index.htm, August 16 2003.[20] S. Moshavi. Multi-user detection for DS-CDMA communications.IEEE Communications Magazine, pages
124–136, October 1996.[21] S. Muir. The Vanu software base-station. White paper atwww.vanu.com, March 2003.[22] M. Powell, S. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. Gated-��� : A Circuit Technique to reduce
Leakage in Deep-Submicron Cache memories. InIEEE International Symposium on Low Power ElectronicDesign (ISLPED’00), pages 90–95, Rapallo, Italy, July 2000.
[23] J. M. Rabaey, A. Chandrakasan, and B. Nikolić. Digital Integrated Circuits - A Design Perspective. Prentice-Hall, 2nd edition, 2003.
[24] S. Rajagopal, S. Bhashyam, J. R. Cavallaro, and B. Aazhang. Real-time algorithms and architectures for multiuser channel estimation and detection in wireless base-station receivers. IEEE Transactions on Wireless Communications, 1(3):468–479, July 2002.
[25] S. Rajagopal, S. Rixner, and J. R. Cavallaro. A programmable baseband processor design for software defined radios. In IEEE International Midwest Symposium on Circuits and Systems, Tulsa, OK, August 2002.
[26] S. Rele, S. Pande, S. Onder, and R. Gupta. Optimizing static power consumption by functional units in superscalar architectures. In International Conference on Compiler Construction, LNCS 2304, Springer-Verlag, pages 261–275, Grenoble, France, April 2002.
[27] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens. A bandwidth-efficient architecture for media processing. In 31st Annual International Symposium on Microarchitecture, pages 3–13, Dallas, TX, November 1998.
[28] S. Rixner, W. Dally, B. Khailany, P. Mattson, U. Kapasi, and J. Owens. Register organization for media processing. In 6th International Symposium on High-Performance Computer Architecture, pages 375–386, Toulouse, France, January 2000.
[29] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. Lopez-Lagunas, P. R. Mattson, and J. Owens. A bandwidth-efficient architecture for media processing. In 31st Annual ACM/IEEE International Symposium on Microarchitecture (Micro-31), pages 3–13, Dallas, TX, December 1998.
[30] C. Sengupta. Algorithms and Architectures for Channel Estimation in Wireless CDMA Communication Systems. PhD thesis, Rice University, Houston, TX, December 1998.
[31] M. K. Varanasi and B. Aazhang. Multistage detection in asynchronous code-division multiple-access communications. IEEE Transactions on Communications, 38(4):509–519, April 1990.
[32] S. Verdu. Multiuser Detection. Cambridge University Press, 1998.
[33] B. Vucetic and J. Yuan. Turbo Codes: Principles and Applications. Kluwer Academic Publishers, 1st edition, 2000.
[34] S. B. Wicker. Error Control Systems for Digital Communication and Storage. Prentice Hall, 1st edition, 1995.