Reconfigurable stream processors for wireless base-stations
Sridhar Rajagopal ([email protected]), September 9, 2003
Abstract
The need to support evolving standards, rapid prototyping, and fast time-to-market are some of the key reasons for desiring programmability in future wireless base-stations. However, supporting highly complex signal processing algorithms for multiple users at high data rates (in Mbps), requiring billions of operations per second, while providing power efficiency, presents challenges in attaining that goal. This paper demonstrates the viability of stream processors for physical layer base-station processing by demonstrating a fully loaded 3G base-station for 32 users meeting 128 Kbps/user (rate 1/2, constraint length 9 coded data rate) at an estimated power consumption of 8.2 W. However, when the system load decreases and the amount of data parallelism reduces, clusters of ALUs in the stream processor remain unutilized and waste power. We provide reconfiguration support in stream processors that allows us to turn off clusters dynamically as the data parallelism reduces. Support is provided to turn off ALUs as well, such that only the minimum number of ALUs in active clusters needed to meet performance requirements are scheduled. When the system load changes from a fully loaded 32-user base-station at constraint length 9 to, say, a more typical 16 active users at constraint length 7, the power consumption can reduce to 2.23 W, with 5.06 W of power savings due to frequency scaling and a further 0.91 W due to hardware reconfiguration. Thus, by providing real-time and power-efficient support in fully programmable hardware, the architecture and algorithm development for wireless communication systems can be made independent, and limited versions of future systems can be deployed which can co-exist with current systems until programmable architecture research rises to meet the challenge.
1. Introduction
The past several decades have seen an increasing shift in embedded system designs from the traditional
custom application-specific integrated circuits (ASICs) towards programmable designs. The reasons for
this shift have been the flexibility and rapid design time available in programmable solutions. The
challenges in programmable architectures for embedded systems have been to meet real-time, area and
power constraints of the embedded application. This paper focuses on wireless cellular base-stations,
which show an increasing need for programmability and power savings compared to existing solutions.
1.1 Need for programmable base-stations
Wireless base-stations, past and current, have dedicated hardware for compute-intensive operations.
The recently announced 3G TCI1� UMTS chipset from Texas Instruments for cellular base-stations,
contains application-specific hardware for operations such as chip de-spreading and Viterbi/Turbo de-
coding (DSP co-processors) [4]. Less than 10% of the actual physical layer processing is being done
on fully programmable hardware (DSP). The lack of programmable solutions is one of the main fac-
tors in the delay in deployment and upgrading of wireless base-stations. Any proposed change in the
algorithms in the new standards means a complete re-design of the system hardware as well. One of
the chief factors contributing to the success of the PC revolution has been the fact that the application
development has been made independent of the architecture development. While attaining this is harder
for wireless base-stations due to real-time processing constraints, this is exactly what is needed for the
wireless revolution to succeed. Current solutions have to wait for new hardware to be developed for
newer standards and face difficulties co-existing with the already deployed ones and providing seamless
'handoff'. Thus, as systems change over from voice to multimedia data networks (such as 2G
to 3G to 4G) [8], intermediate services can be provided (such as 2.5G), and newer algorithms can be
implemented and tested and base-stations upgraded in a fast and efficient manner. The emerging standards
are extremely flexible in terms of providing support for a wide variety of data rates. 3G standards, for
example, support data rates from 16 Kbps to 10 Mbps, depending on the version of the standard and the
modulation, coding and spreading schemes used [2].
The support for lower data rate versions in evolving standards allows programmable architecture
designs to incorporate limited features of newer standards (limited by the processing power in the current
base-station) until architecture research catches up to support higher data rate versions of the standard.
This enables base-station algorithm designs to be done independently of the architecture design, paving
the way for quicker time-to-market, fast design time and rapid deployment of newer standards and
algorithms.
1.2 Need for power efficiency in base-stations
Power efficiency, as a secondary goal, is also desirable in designing wireless base-stations. A low
power wireless base-station helps in decreasing operational and maintenance costs by decreasing the
power requirements [17]. Lowering the power also helps in reducing the cooling requirements at the
base-station. Both of the above aspects will become more significant in future years as base-stations get
denser (pico-cells) and more prevalent in high-traffic areas. Base-station power savings will also result
in providing smaller and lower cost battery back-up systems, enabling continued, uninterrupted service
during critical times of power failures, such as the recent August blackout [19].
As shown later in this document, the compute requirements of the base-station are extremely dynamic
with the load on the base-station in terms of active users and the algorithms used. The programmable
architecture design should also be able to adapt to this varying load to be power efficient.
1.3 Challenges in providing programmable, power-efficient base-stations
As standards are evolving from 2G to 3G, there is an increase in complexity of algorithms used
at the receiver. This is because more sophisticated signal processing techniques are needed for the
receiver, to provide lower bit error rates, resulting in better reception, making more efficient use of
the expensive wireless spectrum and enabling higher capacity systems (higher bits/sec/Hz). This is
especially important as we move from voice networks to data networks, as data is not as tolerant as voice
to errors. The combination of the higher operation count requirements of the sophisticated algorithms
along with the increase in data rates results in a 1-2 order of magnitude increase in the complexity of the
receiver, as shown in Figure 1. For 4G standards, where multiple antenna systems
(MIMO) are being proposed, the data rates again increase due to simultaneous transmission across the
antennas, resulting in more sophisticated signal processing algorithms at the receiver, and a complexity
increase of another 1-2 orders of magnitude can be expected, depending on the number of antennas.
The concept of all-software radio base-stations is not new [9, 18, 21]. However, designing
programmable architectures that can meet the high-performance real-time requirements is still a formidable
challenge [7]. Figure 1 shows the increase in the physical layer operation count from millions of
operations per second (MOPs) to billions of operations per second (GOPs) as a 32-user base-station goes
from 2G (supporting 16 Kbps/user (coded data rate)) to 3G (supporting 128 Kbps/user). Note that the
numbers shown are for the base operations (additions and multiplications) required for major compute-
intensive parts of the physical layer processing at the base-station. Programmable architectures will have
an additional overhead while mapping them onto their instruction set (such as load, store, loop increments,
...) and accounting for dependencies, register pressure and latencies.

[Figure: operation count (in GOPs) for eight 32-user base-station load scenarios (cases 1-8: 4, 8, 16, and 32 users, each at constraint lengths K = 7 and 9), comparing a 2G base-station (16 Kbps/user) with a 3G base-station (128 Kbps/user).]
Figure 1. Operation count requirements for real-time physical layer processing in 2G and 3G base-stations
Providing support for billions of computations per second at low power implies having hundreds of
functional units for a 100 MHz programmable architecture (assuming high functional unit efficiency).
Current single processor DSP-based architectures do not have sufficient functional units in order to meet
real-time requirements for these algorithms. Extending the DSP architectures for more functional units
is not a feasible solution as the register file size and port interconnections rapidly begin to dominate
the architecture area with cubic complexity related to the functional units [29]. Also, it is difficult for
the compiler to find and extract instruction level parallelism and sub-word parallelism as the number
of functional units increase in the architecture. This leads to low functional unit utilizations in such
architectures. A multi-user base-station based on TI DSPs without co-processors, such as the C62x, may
require more than one DSP per user [3] for a 3G base-station. However, multi-processor DSP systems
(including chip multiprocessors) suffer from inter-processor communication bottlenecks (although the cost of
off-chip accesses is amortized over more cores). The instruction sequencing and control logic is replicated
in every tile in most of these solutions. Multi-processor solutions also do not scale well to changes
in performance or load requirements [14].
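As a back-of-envelope illustration of the functional-unit arithmetic above (the workload and efficiency numbers below are illustrative assumptions, not measurements from this paper):

```python
def required_fus(gops, clock_mhz, efficiency):
    """Functional units needed: demanded ops/s over achieved ops/s per FU."""
    return (gops * 1e9) / (clock_mhz * 1e6 * efficiency)

# A ~20 GOPs 3G-class workload on a 100 MHz architecture needs 200 FUs even
# at perfect efficiency; realistic efficiencies push the count higher.
print(required_fus(gops=20, clock_mhz=100, efficiency=1.0))  # 200.0
```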
[Figure: block diagram of the receiver: multiple users' signals arrive at the antenna and RF front-end; baseband processing consists of multi-user estimation, multi-user detection and Viterbi decoding, which feeds the higher layers.]
Figure 2. Physical layer processing
In this paper, we explore stream processors for meeting real-time requirements and power-efficiency
in software base-stations. We present methods to reconfigure them in software and hardware to adapt
to the varying load conditions at the base-station. We show that providing frequency scaling and
reconfiguration support in hardware allows us to adapt the architecture and provide power efficiency while
maintaining real-time performance. The following sections of this paper elaborate on the above concepts
and present the results and conclusions drawn.
2. Wireless applications
The physical layer in wireless communications is the layer between the RF antennas and the medium
access control layer. The computationally-intensive operations occurring in the physical layer are those
of channel estimation, detection and decoding. Figure 2 shows the compute-intensive blocks of the
physical layer in wireless communication systems [6]. Channel estimation refers to the process of deter-
mining the channel parameters such as the amplitude and phase of the received signal. These parameters
are then given to the detector, which detects the transmitted bits. The detected bits are then forwarded
to the decoder, which removes the error protection code on the transmitted signal and then sends the
decoded information bits to the MAC and higher layers. Although the physical layer also contains other
blocks such as Automatic Gain Control (AGC), Automatic Frequency Control (AFC), frequency off-
set compensation etc., these blocks are not compute-intensive and hence, in this paper, we will restrict
ourselves to attaining real-time for the compute-intensive algorithms in the base-station receiver.
2.1 2G Base-station processing
For the purposes of this paper, we will consider a 2G base-station to consist of simple versions of
channel estimation, detection and decoding, and a subset of a 3G base-station. Specifically, we will
consider a 32-user base-station system providing support for 16 Kbps/user (coded data rate), employing
a sliding correlator as a channel estimator and a code matched filter as a detector [20], followed by Viterbi
decoding [33].
A sliding correlator correlates the known (pilot) bits with the received data to calculate
the timing delays and phase-shifts. The operations involved in the sliding correlator used in our
design involve outer product updates. The matched filter detector despreads the received data and
converts the received 'chips' into 'bits' ('chips' are the values of a binary waveform used for spreading
the data 'bits'). The operations involved in matched filtering at the base-station involve a matrix-vector
product, with complexity proportional to the number of users. Both these algorithms are amenable to
parallel implementations. Viterbi decoding [33] is used to remove the error control coding done at the
transmitter. The strength of the error control code is usually dependent on the severity of the channel:
a channel with a good signal-to-noise ratio is usually coded less strongly than a bad channel. The reason
for lower strength coding for good channels is that the channel decoding complexity is exponential in
the strength of the code. Viterbi decoding typically involves a trellis [33] with two phases of computation:
an add-compare-select phase and a traceback phase. The add-compare-select phase in Viterbi goes
through the trellis to find the most likely path with the least error and has data parallelism proportional
to the strength (constraint length) of the code. However, the traceback phase traces the trellis backwards
and recovers the transmitted information bits. This traceback is inherently sequential and involves dynamic
decisions and pointer chasing. With large constraint lengths, the computational complexity
of Viterbi decoding increases exponentially and becomes the critical bottleneck, especially as multiple users
need to be decoded simultaneously in real-time. This is the main reason for Viterbi accelerators in the
C55x DSP and co-processors in the C6416 DSP from Texas Instruments, as the DSP cores are unable to
handle the needed computations in real-time.
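Returning to the matched filter above: its despreading step is essentially a matrix-vector product. The sketch below uses toy dimensions and orthogonal Walsh (Hadamard) spreading codes for clarity; these are simplifying assumptions, not the paper's system parameters.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n +/-1 Hadamard matrix (n a power of 2)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

users, gain = 4, 8                    # toy sizes: 4 users, spreading gain 8
codes = hadamard(gain)[:users]        # one orthogonal spreading code per user
sent_bits = np.array([1, -1, -1, 1])

received = codes.T @ sent_bits        # noise-free superposition of all users

# Matched filter: despread by correlating against each code -- a
# matrix-vector product with cost proportional to the number of users
soft = codes @ received
detected = np.sign(soft).astype(int)
print((detected == sent_bits).all())  # True
```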
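The add-compare-select phase can likewise be sketched. The branch metrics below are placeholders and the tiny constraint length is chosen only to keep the trellis readable; the point is that every state's update is independent, which is the data parallelism noted above.

```python
import numpy as np

K = 3                                 # constraint length (toy value)
n_states = 1 << (K - 1)               # 2^(K-1) = 4 trellis states

path_metrics = np.array([0.0, 1.0, 2.0, 3.0])
# branch_metrics[p, s]: cost of the transition p -> s (placeholder values)
branch_metrics = np.arange(n_states * n_states,
                           dtype=float).reshape(n_states, n_states)

new_metrics = np.empty(n_states)
survivors = np.empty(n_states, dtype=int)
for s in range(n_states):             # every state updates independently
    preds = (s >> 1, (s >> 1) | (n_states >> 1))  # two predecessor states
    cands = [path_metrics[p] + branch_metrics[p, s] for p in preds]
    best = int(np.argmin(cands))
    new_metrics[s] = cands[best]      # add, compare, select
    survivors[s] = preds[best]

print(new_metrics)  # [0. 1. 7. 8.]
```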
The break-up of the operation count and the memory requirements for a 2G base-station are shown in
Figure 3 and Figure 4.

[Figure: operation count break-up (in GOPs) among estimation, detection and decoding for eight 2G base-station load scenarios (cases 1-8: 4, 8, 16, and 32 users, each at constraint lengths K = 7 and 9).]
Figure 3. Operation count break-up for a 2G base-station

[Figure: memory requirements (in KB) among estimation, detection and decoding for eight 2G base-station load scenarios (cases 1-8: 4, 8, 16, and 32 users, each at constraint lengths K = 7 and 9).]
Figure 4. Memory requirements of a 2G base-station

2.2 3G base-station processing

For the purposes of this paper, a 3G base-station contains the elements of a 2G base-station, along
with some more sophisticated signal processing elements for better accuracy of channel estimates and
for eliminating interference between users. Specifically, we consider a 32-user base-station with 128
Kbps/user (coded), employing multiuser channel estimation, multiuser detection and Viterbi decoding
[25].
Multiuser channel estimation refers to the process of jointly estimating the channel parameters for
all the users at the base-station. Since the received signal has interference from other users, jointly
estimating the parameters allows us to obtain the optimal maximum-likelihood estimate of the channel
parameters for all users. However, a maximum likelihood estimate has a significant increase in
computational complexity over single-user estimates [30], but provides a much more reliable channel estimate
to the detector.
The maximum likelihood solutions also involve matrix inversions, which present difficulties in
numerical stability with finite precision computations and in exploiting data parallelism in a simple manner.

[Figure: operation count break-up (in GOPs) among multiuser estimation, multiuser detection and decoding for eight 3G base-station load scenarios (cases 1-8: 4, 8, 16, and 32 users, each at constraint lengths K = 7 and 9).]
Figure 5. Operation count break-up for a 3G base-station
Hence, we used the conjugate-gradient descent based algorithm proposed in [24], which approximates
the matrix inversion and replaces it with matrix multiplications, which are
simpler to implement and can be computed in finite precision without loss in bit error rate performance.
The details of the implemented algorithm are presented in [24]. The computations involve matrix-matrix
multiplications of the order of the number of users and the spreading gain.
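The general idea, solving a linear system using only matrix-vector products rather than an explicit inverse, can be sketched with textbook conjugate gradient. This is not the exact algorithm of [24], just an illustration of why the inversion can be avoided; the matrix and vector below are placeholders.

```python
import numpy as np

def cg_solve(A, b, iters=50, tol=1e-12):
    """Solve A x = b for symmetric positive definite A using only
    matrix-vector products (no explicit matrix inversion)."""
    x = np.zeros_like(b)
    r = b - A @ x                 # residual
    p = r.copy()                  # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # stand-in for a correlation matrix
b = np.array([1.0, 2.0])
x = cg_solve(A, b)
print(np.allclose(A @ x, b))  # True
```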
Multi-user detection [32] refers to the joint detection of all the users at the base-station. Since all the
wireless users interfere with each other at the base-station, interference cancellation techniques are used
to provide reliable detection of the transmitted bits of all users. The detection rate directly impacts the
real-time performance. We choose a parallel interference cancellation based detection algorithm [31]
for implementation, which has a bit-streaming and parallel structure using only adders and multipliers.
The computations involve matrix-vector multiplications of the order of the number of active users in the
base-station. The break-up of the operation count and the memory requirements for a 3G base-station are
shown in Figure 5 and Figure 6.
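A single parallel interference cancellation stage can be sketched as follows. The cross-correlation matrix and bit values are illustrative, and noise is omitted for clarity.

```python
import numpy as np

def pic_stage(y, R, prev_bits):
    """One PIC stage: each user subtracts the interference reconstructed
    from the other users' current decisions (a matrix-vector product)."""
    interference = (R - np.diag(np.diag(R))) @ prev_bits
    return np.sign(y - interference)

# Illustrative 3-user cross-correlation matrix and noise-free observations
R = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
true_bits = np.array([1.0, -1.0, 1.0])
y = R @ true_bits                     # matched-filter outputs

bits = np.sign(y)                     # initial decisions
for _ in range(3):                    # three pipelined PIC stages
    bits = pic_stage(y, R, bits)
print((bits == true_bits).all())  # True
```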
[Figure: memory requirements (in KB) among multiuser estimation, multiuser detection and decoding for eight 3G base-station load scenarios (cases 1-8: 4, 8, 16, and 32 users, each at constraint lengths K = 7 and 9).]
Figure 6. Memory requirements of a 3G base-station

For the rest of the paper, we will concentrate on the design aspects of a 3G base-station architecture.

2.3 Programmable architecture mapping and characteristics

Figure 7 presents a detailed view of the physical layer processing from a programmable architecture
perspective. Specifically, we mean a programmable architecture that is able to exploit subword
parallelism, instruction-level parallelism and data parallelism.

[Figure: data-flow of the receiver: the A/D converter loads received data in real-time into a buffer; a code matched filter (fed, via a transpose unit, by matrices for the matched filter from multiuser estimation) feeds three pipelined stages of parallel interference cancellation (fed by matrices for interference cancellation); a pack and re-order buffer collects the detected bits, and Viterbi decoding, which loads the surviving states for the current user being decoded, produces decoded bits for the MAC/network layers.]
Figure 7. Detailed view of the physical layer processing.

The A/D converter provides the input to the architecture, typically in the 8-12 bit range. However, programmable architectures can typically
support only byte-aligned units and hence, we will assume 16-bit complex received data at the input of
the architecture. Since the data will be arriving in real-time at a constant rate (4 million chips/second),
it becomes apparent from Figure 7 that unless the processing is done at the same rate, the data will be
lost. The real and imaginary parts of the 16-bit data are packed together and stored in memory. Note
that there is a trade-off between storing the low precision data values in memory and their efficient
utilization during computations. Certain computations may require the packing to be removed before
the operations can be performed. Hence, at every stage, a careful decision was made on the extent of
packing of the data, balancing memory requirements against its effect on real-time performance.
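The packing of 16-bit real and imaginary parts into one word, and the unpacking some computations require, might look like this (a generic bit-manipulation sketch, not the Imagine instruction sequence):

```python
def pack(re, im):
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)

def unpack(word):
    """Recover the signed 16-bit real and imaginary parts."""
    re = (word >> 16) & 0xFFFF
    im = word & 0xFFFF
    re -= 0x10000 if re & 0x8000 else 0   # sign-extend from 16 bits
    im -= 0x10000 if im & 0x8000 else 0
    return re, im

print(unpack(pack(-1234, 567)))  # (-1234, 567)
```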
The channel estimation block does not need to be computed every bit and needs evaluation only when
the channel statistics have changed over time. For the purposes of this paper, we set the update rate of
the channel estimates at once per 64 data bits. The received signal first passes through a code matched
filter, which provides initial estimates of the received bits of all the users. The output of the code matched
filter is then sent to three pipelined stages of parallel interference cancellation (PIC), where the decisions
of the users' bits are refined. The inputs to the PIC stages are not packed, to provide efficient computation
between the stages. In order to exploit data parallelism for the sequential Viterbi traceback, we use a
register-exchange based scheme [34], which provides a forward traceback scheme with data parallelism
that can be exploited in a parallel architecture. The Viterbi algorithm is able to make efficient use of
sub-word parallelism and hence, the detected bits are packed to 4 bits per word before being sent to
the decoder. While all the users are processed in parallel until this point, it is more efficient to
utilize data parallelism inside the Viterbi algorithm rather than exploit data parallelism among users.
This is because processing multiple users in parallel implies keeping a large amount of local active memory state
in the architecture, which may not be available. Also, it is more desirable to exploit non-user parallelism
whenever possible in the architecture, as those computations can be avoided if the number of active users
in the system changes. Hence, a pack and re-order buffer is used to hold the detected bits until 64 bits of
each user have been received. The transpose unit shown in Figure 7 is implemented by consecutive odd-even
sorts between the matrix rows, as proposed in the Altivec instruction set [1].
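The register-exchange idea can be sketched as follows: every trellis state carries the bit history of its surviving path forward, so the "traceback" becomes a parallel per-state update. The survivor choices below are placeholders, not a real decode.

```python
n_states = 4
history = [[] for _ in range(n_states)]   # surviving bit history per state

# survivors[t][s] = (predecessor state, decoded bit) chosen by ACS at step t
survivors = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(1, 1), (2, 0), (3, 1), (0, 0)],
]

for step in survivors:
    # all states update simultaneously -> data-parallel "traceback"
    history = [history[pred] + [bit] for (pred, bit) in step]

print(history[0])  # [1, 1]  (bit history of the path ending in state 0)
```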
The output bits of the Viterbi decoder are binary and hence need to be packed before being sent to the
higher layers. Due to this, the decoding is done 32 bits at a time, so that a word of each user is dumped
to memory. During the forward pass through the Viterbi trellis, the states start converging as soon as the
length of the pass exceeds 5K, where K is the constraint length. The depth of the Viterbi trellis is kept at
64, where the first 32 stages contain the past history, which is loaded during the algorithm initialization,
and the next 32 stages process the current received data. The path metrics and surviving states of the
new data are calculated while the old data is traced back and dumped. When rate 1/2 Viterbi decoding
(a typical rate) is used, the 64 detected bits of each user in the pack and re-order buffer get absorbed and
new data can now be stored in the blocks.
Note that the sequential processing of the users implies that some users incur higher latencies than
other users in the base-station. This is not a significant problem, as the users are switched at
32-bit intervals. Also, the base-station could potentially re-order users such that users with more
stringent latency requirements, such as video conferencing over wireless, have priority in decoding.
For the purposes of this paper, we will consider Viterbi decoding based on blocks of 64 bits per
user. We will assume that the users use constraint lengths 5, 7 and 9 (which are typically used in wireless
standards). Thus, users with a lower constraint length will have lower amounts of data parallelism that can
be exploited (which implies that power will be wasted if the architecture exploits more data parallelism
than the lowest parallelism case provides), but will have a corresponding decrease in computational complexity as
well.
Since performance, power, and area are critical to the design of any wireless communications system,
the estimation, detection, and decoding algorithms are typically designed to be simple and efficient.
Matrix-vector based signal processing algorithms are utilized to allow simple, regular, limited-precision
fixed-point computations. These types of operations can be statically scheduled and contain significant
parallelism. It is clear from the algorithms in wireless communications that a variable amount of work
needs to be performed in constant time, and this motivates the need for a flexible architecture that adapts
to the workload requirements. Furthermore, these characteristics are well suited to parallel architectures
that can exploit data parallelism with simple, limited precision computations.
3. Stream Processors
3.1 Choosing the right programmable architecture
Vector processors solve the problem of providing hundreds of functional units by providing clusters of
arithmetic units, and have traditionally been used in scientific computing to exploit data parallelism.
Stream processors combine the features of VLIW-based DSPs with traditional vector processors,
exploiting instruction-level, subword and data parallelism. Stream processors enhance the capabilities of
vector processors by providing a register hierarchy, allowing them to support many more ALUs for the
same memory bandwidth (see the discussion on stream vs. vector processors in [13]). Stream
processors [27] have been shown to be very suitable for high performance multimedia processing. Since
multimedia processing contains many signal processing algorithms similar to those used in wireless,
such as FFTs and FIR filters, in this paper we explore stream processors as a viable candidate for physical
layer processing in wireless base-stations.
[Figure: stream processor architecture: arithmetic clusters 0 through C, each attached to a bank (SRF 0 through SRF C) of the centrally located stream register file (SRF), which connects to external memory; the clusters communicate via an inter-cluster communication network.]
Figure 8. Stream Processor Architecture
3.2 Stream Processor background
Special-purpose processors for wireless communications perform well because of the abundant
parallelism and regular communication patterns within physical layer processing. These processors efficiently
exploit these characteristics to keep thousands of arithmetic units busy without requiring many expensive
global communication and storage resources. The bulk of the parallelism in wireless physical layer
processing can be exploited as data parallelism, as identical operations are performed repeatedly on
incoming data elements.
A stream processor can also efficiently exploit this kind of data parallelism, as it processes individual
elements from streams of data in parallel [27]. Streams are stored in a stream register file, which can
efficiently transfer data to and from a set of local register files between major computations. Local
register files, co-located with the arithmetic units, directly feed those units with their operands. Truly
global data, data that is persistent throughout the application, is stored off-chip only when necessary.
These three explicit levels of storage form an efficient communication structure to keep hundreds of
arithmetic units efficiently fed with data. The Imagine stream processor developed at Stanford is the first
implementation of such a stream processor [27].
Figure 8 shows the architecture of a stream processor with C arithmetic clusters. Operations in a
stream processor all consume and/or produce streams, which are stored in the centrally located stream
register file (SRF). The two major stream instructions are memory transfers and kernel operations. A
stream memory transfer either loads an entire stream into the SRF from external memory or stores
an entire stream from the SRF to external memory. Multiple stream memory transfers can occur
[Figure: stream register file implementation: a single-ported memory array of M blocks, each of N words; stream buffers, each two blocks wide, connect the array's single N-word port to clusters 0-7 and to other stream clients.]
Figure 9. Stream Register File Implementation
simultaneously, as hardware resources allow. A kernel operation performs a computation on a set of
input streams to produce a set of output streams. Kernel operations are performed within a data parallel
array of arithmetic clusters. Each cluster performs the same sequence of operations on independent
stream elements.
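The kernel execution model, the same operation sequence applied across clusters to independent stream elements, can be sketched as follows (the cluster count and kernel are illustrative):

```python
n_clusters = 8
stream_in = list(range(20))           # a 20-element input stream

def kernel(x):
    return 2 * x + 1                  # same instruction sequence everywhere

stream_out = []
for base in range(0, len(stream_in), n_clusters):
    batch = stream_in[base:base + n_clusters]    # one element per cluster
    stream_out.extend(kernel(x) for x in batch)  # conceptually simultaneous
print(stream_out[:5])  # [1, 3, 5, 7, 9]
```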
3.3 The Stream Register File
The heart of a stream processor is the stream register file. This register file must have sufficient
capacity to hold the working set of streams of an application and provide sufficient bandwidth to keep the
arithmetic clusters busy. In modern VLSI, these are typically conflicting goals, as it is not possible to
build a high bandwidth random access memory efficiently. Fortunately, the nature of a stream processor
severely restricts the type of accesses to the SRF, enabling an efficient, high-bandwidth
implementation [28]. The stream register file only supports sequential accesses to data streams. A set of stream
buffers external to the memory array of the SRF allows multiple streams to be accessed simultaneously.
However, there is only a single, wide port into the memory array of the SRF. Each stream buffer
arbitrates for access to that port to transfer data to or from the memory array. A stream buffer is twice the
width of the memory port, so that streams may be accessed in one half of the stream buffer while the
other half of the stream buffer is being filled or drained by the memory array.
The design of an SRF is shown in Figure 9. The figure shows part of the SRF for an eight-cluster stream
processor [27, 28]. In Figure 9, only a single stream buffer connected to the arithmetic clusters is shown,
but in practice the SRF would include numerous stream buffers - several connected to the arithmetic
clusters and several connected to other stream clients, such as the memory system. Streams are stored
in the SRF inside a single-ported memory array. In the figure, the memory array contains M blocks,
each consisting of N words. The single port into the memory array is one block (N words) wide. Data
may only be transferred to/from the memory array a block at a time.
To understand how the SRF operates, consider a kernel that is reading a stream from the SRF. A
stream descriptor is used to indicate the starting block address of the stream, the length of the stream
in words (a stream must start on a block boundary, but may have any number of words), and whether the
stream is being read or written. Initially, the stream buffer is empty. Since the stream is being read, the
first action the stream buffer will take is to arbitrate for access to the single port into the memory array.
When the stream buffer is granted access, it makes a single transfer from the memory array, filling half
of the stream buffer (N words in the figure). Note that N can be larger than the number of clusters. The
stream buffer that feeds the clusters is effectively banked (as is the SRF), with the same number of banks
as there are clusters. Words will be interleaved in the memory array such that every eighth word in the
stream is stored in the same stream buffer bank. That way, each cluster will access every eighth stream
element.
Once the first transfer to the stream buffer from the memory array occurs, the clusters may begin
accessing the stream. For each read requested by the clusters, the stream buffer will return the next eight
consecutive elements. Simultaneously, the stream buffer will also arbitrate for access to the memory port
again. When granted access, it makes another block transfer to fill the other half of the stream buffer.
After N/8 reads (each read transfers 8 words, one word per cluster) from the clusters, the initial half of
the stream buffer will be empty, and the stream buffer will then again arbitrate for access to the memory
port in order to refill itself. This process continues until the length specified in the stream descriptor is
reached. If, at any time, the stream buffer is completely empty and has not yet been granted access to
the memory port, the stream buffer will indicate that it is not ready for accesses by the clusters. In that
case, the clusters must wait until the stream buffer is ready before proceeding with the read operation.
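The double-buffered behavior described above can be modeled in a few lines; the block and cluster sizes below are illustrative, and arbitration for the port is abstracted away.

```python
N = 16            # words per SRF block (= half the stream buffer)
n_clusters = 8

def stream_reads(stream):
    """Yield n_clusters words per cluster read, refilling the buffer with
    N-word block transfers from the single-ported memory array."""
    buffer = []
    while stream or buffer:
        if len(buffer) < n_clusters and stream:
            buffer += stream[:N]      # one block transfer fills half the buffer
            stream = stream[N:]
        group, buffer = buffer[:n_clusters], buffer[n_clusters:]
        yield group                   # one word delivered to each cluster

groups = list(stream_reads(list(range(40))))
print(len(groups))  # 5 reads of 8 words each for a 40-word stream
```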
3.4 Stream Processor evaluation for Wireless Communications
Figure 10 shows the performance of stream processors for a fully loaded 2G and 3G base-station
with 32 users at 128 Kbps and 16 Kbps respectively. The architecture assumes sufficient memory in the
[Figure 10 plot: frequency needed to attain real-time (in MHz) vs. number of clusters; curves for the 3G base-station (128 Kbps/user), the 2G base-station (16 Kbps/user), and the corresponding C64x TI DSP reference points]
Figure 10. Real-time performance of stream processors for 2G and 3G fully loaded base-stations
SRF to avoid accesses to external memory (256 KB) and 3 adders and 3 multipliers per cluster (chosen
because matrix algorithms tend to use an equal number of adders and multipliers). For our implementation
of a fully loaded base-station, the parallelism available for channel estimation and detection is limited by
the number of users, while for decoding, it is limited by the constraint length (as the users are processed
sequentially).
Ideally, we would like to exploit all the available parallelism in the algorithms, as that would lower the
clock frequency. A lower clock frequency has also been shown to be power efficient, as it opens
up the possibility of reducing the voltage [23], thereby reducing the effective power consumption
(P ∝ V²f). While Viterbi at constraint length 9 has sufficient parallelism to support up to 64 clusters at full
load, the channel estimation and detection algorithms do not benefit from extra clusters, having exhausted
their parallelism. The mapping of a smaller-size problem to a larger cluster count is also complicated
because the algorithm mapping has to be adapted by some means, either in software or in hardware, to
use a lower number of clusters while the data is spread across the entire SRF.
From the figure, we can see that a 32-cluster architecture can operate as a 2G base-station at around 78
MHz and as a 3G base-station at around 935 MHz, thereby validating the suitability of Imagine as a viable
architecture for 3G base-station processing. It is also possible that other multi-DSP architectures such
as [5] can be adapted and are viable for 3G base-station processing. However, the lack of multi-DSP
simulators and tools limits our evaluation of these systems.
Kernel Max. Parallelism Reason
Estimation 32 number of users (32), spreading gain (32)
Detection 32 number of users (32), spreading gain (32)
Decoding 4/16/64 Constraint lengths 5,7,9, packing 4 bits per word
Table 1. Available Data Parallelism in Wireless Communication Tasks
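The decoding entry in Table 1 follows from the trellis size: a constraint length K decoder has 2^(K-1) states, and, by our reading of the table's "packing 4 bits per word", four states' worth of work is packed per word. A minimal sketch under that assumption:

```python
# Sketch (our reading of Table 1): decoding parallelism for constraint
# length K is the 2**(K-1) trellis states divided by 4 states packed per word.

def decoding_parallelism(K, states_per_word=4):
    """Data parallelism available to the Viterbi decoder at constraint length K."""
    return 2 ** (K - 1) // states_per_word

# Constraint lengths 5, 7, 9 give the 4/16/64 figures in Table 1:
print([decoding_parallelism(K) for K in (5, 7, 9)])  # [4, 16, 64]
```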
4. Reconfigurable Stream Processors
A reconfigurable stream processor that can change the number of active clusters would enable kernels
that have enough data parallelism to make use of a fully-enabled processor, while allowing unusable
clusters to be shut off for kernels with limited data parallelism.
Table 1 shows the available data parallelism of the tasks presented in Section 2. From the table, it is
obvious that the available data parallelism changes throughout the application. However, as shown in
Figure 9 of Section 3, the stream buffers attached to the stream register file feed all clusters. So, if some
of the clusters are inactive, then valid data in a stream will be sent to inactive clusters and lost. Therefore,
to successfully deactivate clusters in a stream processor, stream data must be rerouted so that the active
clusters receive the appropriate data. Three different mechanisms can be used to reroute the stream data
appropriately: by using the memory system (software), by using conditional streams (software), and by
using a multiplexer network (hardware).
4.1 Reorganizing Streams using Memory Transfers (Software)
A data stream can be realigned to match the number of active clusters by first transferring the stream
from the SRF to external memory, and then reloading it so that it is placed only in the SRF banks
that correspond to active clusters. Figure 11 shows how this would be accomplished on an eight-cluster
stream processor. In the figure, a 16-element data stream, labelled Stream A, is produced
by a kernel running on all eight clusters. For clarity, the banks within the SRF are shown explicitly and
the stream buffers are omitted. The stream is striped across all eight banks of the SRF. If the
machine is then reconfigured to only have four active clusters, the stream needs to be striped across only
the first four banks of the SRF. By first performing a stream store instruction, the stream can be stored
contiguously in memory, as shown in the figure. Then, a stream load instruction can be used to transfer
the stream back into the SRF, using only the first four banks. The figure shows Stream A' as the result
of this load. As can be seen in the figure, the second set of four banks of the SRF would not contain any
valid data for Stream A'.
[Figure 11 diagram: Stream A (elements 0-15) striped across eight SRF banks; Step 1 stores the stream contiguously to memory; Step 2 loads it back as Stream A', striped across the first four banks only]
Figure 11. Reorganizing Streams in Memory
While this mechanism works, it is quite inefficient. One of the reasons why stream processors can keep
so many arithmetic units busy is that data is infrequently transferred between memory and the processor
core. Forcing all data to go through memory whenever the number of active clusters changes directly
violates this premise. Therefore, using external memory to reorganize data streams will likely cause
numerous stalls, and result in more idle arithmetic units than if the machine were to remain fully active.
Using the memory system in this way is neither desirable nor productive.
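Figure 11's two-step restripe can be illustrated with a small sketch of ours (a functional model, not the simulator's implementation): gather the round-robin-striped stream into flat memory, then redistribute it over only the active banks.

```python
# Hypothetical model of the store/load restripe in Figure 11.

def store_to_memory(srf_banks):
    """Step 1: gather the round-robin-striped stream into a flat array."""
    n = sum(len(b) for b in srf_banks)
    mem = [None] * n
    for bank_id, bank in enumerate(srf_banks):
        for j, word in enumerate(bank):
            mem[j * len(srf_banks) + bank_id] = word
    return mem

def load_from_memory(mem, active_banks, total_banks):
    """Step 2: restripe across only the first `active_banks` banks."""
    banks = [[] for _ in range(total_banks)]
    for i, word in enumerate(mem):
        banks[i % active_banks].append(word)
    return banks

# 16-element stream striped over 8 banks, reconfigured to 4 active clusters:
eight = [[b, b + 8] for b in range(8)]   # bank b holds elements b and b+8
mem = store_to_memory(eight)             # contiguous: [0, 1, 2, ..., 15]
four = load_from_memory(mem, 4, 8)
print(four[0])  # [0, 4, 8, 12]; banks 4..7 hold no valid data
```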
4.2 Reorganizing Streams using Conditional Streams (Software)
Stream processors already contain a mechanism for reorganizing data streams: conditional
streams [12]. Conditional streams allow data streams to be compressed or expanded in
the clusters so that the direct mapping between SRF banks and clusters can be violated. When using
conditional streams, stream input and output operations are predicated in each cluster. If the predicate
is true in a particular cluster, it will receive the next stream element; if the predicate is false, it
will receive garbage. As an example, consider a stream processor executing a kernel that
reads a stream conditionally.
Figure 12 presents an example of a 4-cluster stream processor that reconfigures to 2 clusters
using conditional streams. The conditional buffer data is routed only to those clusters whose conditional
switch (predicate) is set. Hence, if the conditional switches are set only in the first 2 clusters, all the
data is routed only to the first 2 clusters, irrespective of the total number of clusters in the stream
processor.
[Figure 12 diagram: two conditional-stream accesses on a 4-cluster processor; with predicates set only on clusters 0 and 1, access 0 delivers elements A and B and access 1 delivers C and D to those clusters]
Figure 12. Reorganizing Streams with Conditional Streams
In order to use conditional streams to deactivate clusters, the active clusters would always use a true
predicate on conditional input or output stream operations and inactive clusters would always use a false
predicate. This has the effect of sending consecutive stream elements only to the active clusters. As
explained in [12], conditional streams are used for three main purposes: switching, combining, and load
balancing. Using conditional streams to completely deactivate a set of clusters is really a fourth use of
conditional streams.
In a stream processor, conditional input streams are implemented by having each cluster read data
from the stream buffer as normal, but instead of directly using that data, it is first buffered. Then,
based upon the predicates, the buffered data is sent to the appropriate clusters using an intercluster
switch [12]. Conditional output streams are implemented similarly: data is buffered, compressed, and
then transferred to the SRF. Therefore, when using conditional streams, inactive clusters are not truly
inactive. They must still buffer and communicate data. Furthermore, if conditional streams are already
being used to switch, combine, or load balance data, then the predicates must be modified to also account
for active and inactive clusters, which complicates programming.
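The predicate-driven compression described above can be sketched as follows (our functional model; the real hardware routes buffered words over the intercluster switch [12], and false-predicate clusters receive garbage rather than None):

```python
# Sketch of a conditional input stream: consecutive elements are delivered
# only to clusters whose predicate is true, in cluster order.

def make_conditional_stream(stream):
    """Return a read function that remembers its position in the stream."""
    it = iter(stream)
    def read(predicates):
        # True-predicate clusters get the next elements; others get None.
        return [next(it) if p else None for p in predicates]
    return read

# 4-cluster machine with only clusters 0 and 1 "active", as in Figure 12:
read = make_conditional_stream(["A", "B", "C", "D"])
print(read([True, True, False, False]))  # ['A', 'B', None, None]
print(read([True, True, False, False]))  # ['C', 'D', None, None]
```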
[Figure 13 diagram: a reconfigurable 4-cluster stream processor with a mux/demux network (and comm network) between the SRF stream buffers and the clusters, shown (a) unconfigured and reconfigured to (b) 4 clusters, (c) 2 clusters, and (d) 1 cluster]
Figure 13. Reorganizing Streams with a Multiplexer/Demultiplexer Network
4.3 Reorganizing Streams using a Multiplexer Network (Hardware)
Both the memory system and conditional streams are inefficient mechanisms for reorganizing streams.
While providing the flexibility to use a smaller number of clusters, neither method provides any
way to reduce the power wasted by unused clusters. Clusters are the main source of power consumption
in stream processors, as they contain the arithmetic units, the scratchpad, the register files and the inter-cluster
communication unit, and can account for up to 90% of the power dissipation in the architecture
[16]. To give an idea of the possible power wastage, when a 32-user base-station with constraint
length 5 is used instead of constraint length 9, only 4 out of the 32 clusters are used during decoding.
A multiplexer/demultiplexer network between the stream buffers of the SRF and the arithmetic clusters
can efficiently route stream data only to active clusters and can turn off unused clusters using clock and
Vdd-gating techniques [10, 22], thereby providing power savings. Figure 13 shows how such a network
would operate.
The mux network shown is per stream. Each input stream is configured as a multiplexer and each
output stream as a demultiplexer during the execution of a kernel. The cluster reconfiguration is
done dynamically by providing support in the instruction set of the host processor for setting the number
of active clusters during the execution of a kernel. By default, all the clusters are turned off when kernels
are inactive, thereby providing power savings during memory stalls and system initialization. Another
advantage of the mux network is that the user does not have to modify the code as long as the stream
length is a multiple of the number of clusters.
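A minimal sketch (ours) of the per-stream input multiplexer: with n_active of n_total clusters enabled, each access fans the next n_active words out to the low-numbered clusters, and disabled clusters receive nothing.

```python
# Hypothetical model of one input-stream access through the mux network.

def mux_read(stream, offset, n_active, n_total):
    """Deliver the next n_active words to clusters 0..n_active-1;
    deactivated clusters (n_active..n_total-1) get nothing."""
    words = [None] * n_total
    for c in range(n_active):
        words[c] = stream[offset + c]
    return words, offset + n_active

stream = list(range(16))
words, off = mux_read(stream, 0, n_active=4, n_total=8)
print(words)  # [0, 1, 2, 3, None, None, None, None]
words, off = mux_read(stream, off, 4, 8)
print(words)  # [4, 5, 6, 7, None, None, None, None]
```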
Value     Description
632578    Energy of an ALU operation
79365     LRF access energy
155       Energy of an SB access (1 bit)
8.7       SRAM access energy per bit
1603319   SP access energy
1         Normalized wire propagation energy per wire track
Table 2. Power consumption parameters (normalized to wire-track energy units)
4.4 Power analysis
For the purposes of power analysis in this paper, we assume the parameters shown in Table 2.
All energy values are normalized to the propagation energy of a single wire track, and the distance
between two adjacent clusters is measured in wire-track units. These numbers are based on [16], with
modifications to account for fixed-point ALUs (instead of floating point). The modifications are based
on a low-power version of stream processors for embedded systems (see (LP) in Chapter 6 of [14]).
These modifications reduce the power consumption from an estimated 16 mW/MHz for an 8-cluster
Imagine at 1.2 V [13] to approximately 8 mW/MHz for a 32-cluster low-power stream processor with 3
multipliers and 3 adders per cluster. For the rest of the paper, we consider a voltage of 1.2 V and assume
that power scales linearly with frequency in order to obtain estimates (since the scope for voltage scaling
is limited at 1.2 V).
4.5 Additional power consumption due to the mux network
To a first order, the power consumption of the mux network is dominated by wire-length capacitance
[16]. Assuming a linear layout of the clusters (worst case), the total wire length of the mux network
grows quadratically with the number of clusters, and for 8 streams of 32 bits each, the network requires
a correspondingly large total wire length [16]. The clusters can instead be re-ordered according to the
bit-reversal of their locations, which can be shown to reduce the total wire length from O(C²) to
O(C log₂ C) units, where C is the number of clusters and the unit is the wire length between adjacent
clusters. The resulting power consumption of this network is less than 1% of that of a single multiplier
and has a negligible effect on the power consumption of the stream processor design.
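A rough comparison (our arithmetic, using only the asymptotic forms stated above) of the two cluster orderings:

```python
# Compare mux wiring growth for the two layouts discussed above:
# linear placement is O(C^2); bit-reversed placement is O(C log2 C).

from math import log2

def linear_wire_units(C):
    return C * C                 # O(C^2) wire-track units

def bitrev_wire_units(C):
    return int(C * log2(C))      # O(C log2 C) wire-track units

for C in (8, 16, 32):
    print(C, linear_wire_units(C), bitrev_wire_units(C))
# At 32 clusters: 1024 vs 160 units, roughly a 6x reduction in mux wiring
```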
4.6 Reconfiguring ALUs within a cluster
Different kernels have different ALU utilization factors. During the execution of different kernels, one
or more ALUs can be turned off without loss in performance. Since most wireless algorithms have static
schedules, the schedule produced during compilation can provide information about the actual number
of adders or multipliers needed to maintain performance while turning off unused ones. This allows us
to save static power dissipation, but does not greatly affect dynamic power dissipation, as the number of
operations does not change. While dynamic power is the dominant power consumption source in current
technologies, static power dissipation is increasing with emerging technology [26]. In fact, it is projected
that static power dissipation will equal dynamic power dissipation in a few generations. Hence, it is
necessary to minimize static power dissipation as well, if it can be done without loss in performance.
For this purpose, we perform an architecture exploration at compile time and schedule the instructions
on the minimum number of adders and multipliers needed to provide the required performance. A 'sleep'
instruction is added to the instruction set, which turns off the ALUs that are not scheduled for the entire
kernel. The idea of turning off idle functional units has been explored in [26]. However, it has been done
in the context of regular programs at the micro level (instruction level), where the window of opportunity
to turn off arithmetic units is limited, as it is difficult to find cases where an arithmetic unit is idle for
sufficiently long periods of time to compensate for the overhead of turning units on and off. Since we
are able to do it at the kernel level (algorithm level), turning off arithmetic units is simpler in our case,
with a greater potential for power reduction. We provide the ability to determine how many arithmetic
units need to be active during compilation by varying the static schedule mapping over the various adder
and multiplier configurations and determining the best possible schedule.
Figure 14 shows the architecture exploration of an add-compare-select (ACS) kernel in Viterbi decoding
as the number of adders and multipliers is increased. As can be observed from the figure, a 3-adder,
1-multiplier configuration provides close to the best performance, and the additional ALUs can be turned
off to save static power consumption. The expected functional unit utilization from the schedule is
provided in (adder, multiplier) form on each data point. The 3-adder, 1-multiplier configuration attains a
functional unit utilization of 62% on each adder and multiplier. Thus, the compiler's schedule can be
changed from the default to use only the minimum number of units required to meet performance
requirements, as shown in Figure 15. The default schedule of the Viterbi ACS kernel, shown in
Figure 15(a), exhibits very poor utilization of the multipliers. Since we observed from Figure 14 that a
single multiplier was sufficient, we can re-compile and re-schedule the algorithm such that all the multiply
operations are packed onto a single multiplier. The compiler automatically inserts a 'sleep' instruction for
the remaining two multipliers for the entire kernel execution period.
[Figure 14 plot: instruction count vs. number of adders (1-5) and multipliers (1-5) for the kernel "acskc", each data point annotated with its (adder, multiplier) functional unit utilization]
Figure 14. ALU exploration for a cluster for an ACS kernel
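The selection step of the exploration can be sketched as follows. This is a hypothetical model of ours: the cost function below is a toy stand-in shaped like Figure 14 (adders dominate the ACS schedule length, multipliers barely matter), not the real compiler's scheduler.

```python
# Sketch of compile-time ALU exploration: schedule every (adders,
# multipliers) configuration, keep those within a slack of the best
# instruction count, and pick the one with the fewest total ALUs.

def pick_config(schedule_cost, configs, slack=1.10):
    """schedule_cost(a, m) -> instruction count for that ALU mix."""
    best = min(schedule_cost(a, m) for a, m in configs)
    feasible = [(a + m, a, m) for a, m in configs
                if schedule_cost(a, m) <= slack * best]
    _, a, m = min(feasible)          # fewest ALUs among near-best configs
    return a, m

# Toy cost model: performance plateaus past 3 adders; 1 multiplier suffices.
def cost(a, m):
    return max(40, 120 // a) + 2 * max(0, 2 - m)

configs = [(a, m) for a in (1, 2, 3, 4, 5) for m in (1, 2, 3)]
print(pick_config(cost, configs))  # (3, 1)
```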
5. Experimental Methodology
The stream processor code is written by splitting it into 'streamC' code and 'kernelC' code, where the
kernelC code represents the computations done in the kernels of the various algorithms and the streamC
code represents the communication between the computations [15]. The stream dataflow for the algorithms
is as shown in [25]. The code was written to avoid accesses to external memory. This was done by
ensuring sufficient SRF memory and by doing all the necessary data re-ordering operations within the
clusters. This ensured that the data for the next kernel operation was always in the right place. These
optimizations were done to avoid the latency and power consumption of external memory accesses.
[Figure 15 diagram: instruction schedules on three adders and three multipliers; (a) the default schedule leaves the multipliers largely idle, (b) after exploration, the multiply operations are packed onto one multiplier and 'sleep' instructions cover the remaining two]
Figure 15. ALU Schedule for ACS kernel before and after architecture exploration
For evaluation of the multiplexer network, the base stream processor simulator 'isim' was modified
to support a mux network between the SRF and the clusters. A 'setClusters(#clusters)' macro was
added to streamC in order to set the number of active clusters used by the following kernel. The macro
implements a setClusters instruction, which is scheduled to execute just before the kernel executes so
that the clusters can be turned off for the maximum possible time period. No modifications are required
in the streamC code unless the stream lengths are shorter than the number of active clusters. In that case,
the streams need to be zero padded, and additional dummy reads must be inserted in the streamC code
to account for the zero-padded data. Apart from this, no other change is required in the code.
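The padding rule just stated can be sketched in a few lines (ours, not the simulator's code): round the stream length up to a multiple of the active cluster count, and count the extra elements that must be consumed as dummy reads.

```python
# Sketch: zero pad a stream so its length is a multiple of the active
# cluster count; the pad count equals the dummy reads the streamC code
# must account for.

def pad_stream(stream, n_active):
    pad = (-len(stream)) % n_active
    return stream + [0] * pad, pad

padded, dummies = pad_stream([1, 2, 3, 4, 5], n_active=4)
print(padded, dummies)  # [1, 2, 3, 4, 5, 0, 0, 0] 3
```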
The streamC code now looks as follows:

    :
    setClusters(16); // use 16 clusters only for the next kernel
    kernel1(input1, input2, input3, ..., output1);
    setClusters(8);
    kernel2(input1, ..., output1, output2, ...);
    :
To reconfigure the number of ALUs, a 'sleep' instruction was added to the architecture, which is
intended to use Vdd- and clock-gating techniques to minimize the static power dissipation of the ALUs.
At compile time, different numbers of adders and multipliers are explored in the architecture to
determine the best possible combination and to turn off unused units. The compiler takes 2 parameters
(-add, -mul), where the number of adders and multipliers to be scheduled is given via a command-line
interface based on the architecture exploration.
6. Results
Figure 16 shows the performance comparison between a standard 32-cluster stream processor, a
stream processor with reconfiguration support in software, and one with reconfiguration support in
hardware. To make the worst-case comparison and find the reconfiguration overhead, the workload was
chosen such that it had sufficient data parallelism and did not actually involve any reconfiguration.
[Figure 16 bar chart: kernel execution time (cycles), broken into kernel-busy time and microcontroller stalls, for the default stream processor, software reconfiguration, and hardware reconfiguration]
Figure 16. Comparing hardware and software reconfiguration schemes
From the figure, we see that the performance overhead due to reconfiguration support for a fully loaded
32-user base-station with constraint length 9 Viterbi decoding is negligible. When the reconfiguration support is
provided in software (conditional streams are used instead of normal streams, even though conditional
streams are not required), the overhead of conditional streams is absorbed in most kernels that do not
have a high I/O requirement. For kernels such as matrix multiplication, which do have significant I/O,
we have observed an increase in microcontroller stalls due to the conditional streams. However, for
the end-to-end application run, Viterbi decoding dominates the execution time and has very little I/O
in its kernels, thereby absorbing most of the overhead. When reconfiguring in hardware using the mux
network, we actually see a slight improvement in performance rather than the expected performance
loss due to reconfiguration. Reconfiguration in hardware adds a latency on the order of log₂(C) to the
input and output streams before they can reach the clusters, where C is the number of clusters. However,
the mux network also doubles as a prefetch buffer, reducing the number of microcontroller stalls in
kernels requiring high I/O, such as matrix multiplication. Further power reduction is possible by turning
off arithmetic units within a cluster. If we assume that static power dissipation will be 33% of the
total power dissipation in arithmetic units [26], and the hardware reconfigures to 1 multiplier for Viterbi
decoding from the 3 multipliers used during channel estimation and detection, a further power savings
can be obtained for this case.
When the application requires reconfiguration, such as for a half-loaded base-station with 16 users at
constraint length 7, there are significant benefits to be gained both by scaling the frequency to meet the
[Figure 17 bar chart: frequency needed to run at real-time (in MHz) vs. base-station load; cases 1-8 correspond to 4, 8, 16, and 32 users, each at constraint lengths K = 7 and 9]
Figure 17. Frequency scaling due to base-station load variation
[Figure 18 stacked bar chart: power consumption (in Watts) vs. base-station load, split into non-cluster power and cluster power used, plus power saved by ALU turn-off, mux reconfiguration, and frequency scaling]
Figure 18. Power savings due to frequency scaling and hardware reconfiguration
lower load requirements and by reconfiguring the hardware to utilize the right number of clusters and
ALUs, dynamically. Reconfiguring in hardware from 32 clusters to 16 clusters (case 5) allows us
to reduce the power consumption by 62% (8.2 W to 3.14 W) due to frequency scaling and an additional
29% (from 3.14 W to 2.23 W) due to the mux network and the ALU turn-off feature, as shown
in Figure 17 and Figure 18. Figure 17 shows the decrease in the real-time frequency requirement as the
load on the base-station decreases. We can see that the frequency goes down from around 935 MHz to
around 100 MHz as the load decreases from 32 users to 4 users. Figure 18 shows the corresponding
power savings due to frequency scaling and due to the use of the mux network, thereby
providing a strong case for the use of frequency scaling and hardware reconfiguration support in stream
processors.
It may not be possible to change the clock frequency at such a fine granularity to optimize for performance,
and the frequency may be modifiable only in steps (as in Intel's XScale architecture, which can run at
333, 400, 600, or 733 MHz). However, in these cases, all the clusters can be turned off in the remaining
available time until data is ready for the next kernel to process, saving up to 70% of the power dissipated
on the chip. The power is assumed to be linear with clock frequency, although greater power savings can
be attained by lowering the voltage along with the frequency.
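A back-of-the-envelope sketch of the stepped-frequency case (our arithmetic, using the text's linear power model, the 8 mW/MHz estimate, and the XScale-like steps; the gating model simply turns the clusters off for the idle fraction of each period):

```python
# Sketch: run at the smallest available clock step that meets real-time,
# and gate all clusters off during the remaining slack.

STEPS_MHZ = [333, 400, 600, 733]   # available frequency steps
MW_PER_MHZ = 8                     # paper's 32-cluster estimate

def power_mw(required_mhz, gated):
    f = min(s for s in STEPS_MHZ if s >= required_mhz)
    duty = required_mhz / f if gated else 1.0   # busy fraction of each period
    return f * MW_PER_MHZ * duty

# A 100 MHz real-time load must run on the 333 MHz step:
print(round(power_mw(100, gated=False)))  # 2664 mW if clusters never gate off
print(round(power_mw(100, gated=True)))   # 800 mW: ~70% saved by gating
```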
6.1. Discussion
While we have shown the dynamic nature of the computational load at the base-station, there are
alternatives to hardware reconfiguration of the clusters to support this dynamic load. If there are multiple
levels of data parallelism in the algorithm (across time, users, spreading length, and so on), the data could
be re-organized to utilize the best type of available parallelism. This may not be possible for all algorithms
and may require re-organization of the data in the SRF. A good example for this case is the Viterbi
implementation, which exhibits parallelism both in the number of users and in the constraint length. For
the fully loaded case of 32 users and constraint length 9, it is better to exploit parallelism in terms of
constraint length, as exploiting parallelism across users would require storing survivor states for all users
in the internal scratchpad, exceeding the available scratchpad space. This is also easier to scale with user
counts that are not powers of 2. However, when the constraint length is low, for example, a constraint
length 5 case with 8 users, it is better to exploit parallelism in terms of users, as the survivor states can
now easily fit in the internal scratchpad space. Hence, tradeoffs need to be made during implementation
to decide the best form of parallelism to exploit. Eventually, these optimizations must be integrated into
the compiler. There also exists the possibility of supporting multiple threads in clustered architectures,
such as the Sandbridge processor [11], allowing us to run multiple threads of the same algorithm (loop
unrolling) or to run different algorithms simultaneously. This may, however, have a higher cost in terms
of providing support for multi-threading and for exploiting thread parallelism. However, our proposed
solution of hardware reconfiguration and frequency scaling can co-exist with these solutions as well,
providing power savings in the absence of available parallelism.
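The Viterbi tradeoff above amounts to a capacity check, which can be sketched as a hypothetical decision rule (ours; the scratchpad size and per-state storage cost below are made-up illustration values, not the processor's actual parameters):

```python
# Sketch: parallelize across users only when every user's survivor states
# fit in the cluster scratchpad; otherwise parallelize across trellis states.

def viterbi_parallelism(users, K, scratchpad_words, words_per_state=1):
    states = 2 ** (K - 1)                       # trellis states per user
    survivor_words = users * states * words_per_state
    return "users" if survivor_words <= scratchpad_words else "states"

# With a made-up 1024-word scratchpad:
print(viterbi_parallelism(8, 5, 1024))    # users: 8 * 16 = 128 words fit
print(viterbi_parallelism(32, 9, 1024))   # states: 32 * 256 = 8192 words do not
```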
For 4G systems employing multiple antennas, we can again expect an order-of-magnitude higher
complexity. In order to support such a system in a programmable processor, other features that ASICs
exploit, such as task pipelining, need to be exploited, and several stream processors may need to be
chained for base-station processing. However, one has to consider the bottlenecks of inter-processor
communication and the increase in software complexity before considering such solutions. The goal of
providing a 100 Mbps programmable architecture at a 100 MHz clock frequency, with a bit processed per
cycle, is still an open question for us at this time.
When the number of users is not a power of 2, the estimation and detection algorithms do not scale
well. However, instead of zero padding, there may exist possibilities of allocating the free codes to users,
enabling 'virtual' users to exist and thereby providing higher data rates to some users. Since the
decoding is done sequentially, the decoding algorithm implementation scales well with a varying number
of users.
Finally, the benefits of the proposed multiplexer network addition are not limited to wireless communication
applications; other applications of stream processors where power is a concern can also make use of
the multiplexer network. Other multi-cluster architectures oriented towards handsets, such as [11],
could also make use of the mux network features for additional power savings when the algorithms
and/or the compiler do not provide sufficient means to exploit parallelism.
7. Conclusions
In this paper, we have shown stream processors to be a viable programmable architecture for physical
layer processing in wireless base-stations. The varying computational load in base-station processing
requires reconfiguration support in the architecture. We show that by providing this reconfiguration
support using frequency scaling and in hardware using a multiplexer network, we can adapt the
power consumption of the architecture to the computational load without compromising real-time
performance. By providing a real-time software solution for wireless base-stations, we can separate
the algorithm design from the architecture design of wireless systems. Limited versions of higher
standards can be supported while hardware design research catches up and gets upgraded to newer
standards. We can thus attain a faster time-to-market, co-existence of multiple standards during this
wireless (re)evolution, and faster upgradability and verification of newly designed algorithms and
applications.
8. Acknowledgments
We are grateful to Ujval Kapasi and Abhishek Das for their timely help, suggestions and software
support with the Imagine simulator tools, and to Brucek Khailany for his assistance in providing power
consumption estimates for the stream processor architectures used in this paper.
References
[1] Matrix transpose using Altivec instructions. http://developer.apple.com/hardware/ve/algorithms.html#transpose.[2] Third Generation Partnership Project. http:// www.3gpp.org.[3] Implementation of a WCDMA Rake receiver on a TMS320C62x DSP device. Technical Report SPRA680,
Texas Instruments, July 2000.[4] Channel card design for 3G infrastructure equipment. Technical Report SPRY048, Texas Instruments, 2003.[5] B. Ackland et al. A single-chip, 1.6 Billion, 16-b MAC/s Multiprocessor DSP.IEEE Journal of Solid-State
Circuits, 35(3):412–425, March 2000.[6] M. Barberis. Designing W-CDMA Mobile Communication Systems. CSDMAG, pages 25–32, February
1999.[7] M. Baron. Extremely High Performance - Parallel EnginesChallenge Performance Limitations. Micropro-
cessor report, February 18 2003.[8] R. Berezdivin, R. Breinig, and R. Topp. Next-generationwireless communications concepts and technolo-
gies. IEEE Communications Magazine, 40(3):108–117, March 2002.[9] S. Blust and D. Murotake. Software-Defined Radio Moves into Base-station designs. Wireless System
Design Magazine Report 11, November 1999.[10] S. Dropsho, V. Kursun, D. H. Albonesi, S. Dwarkadas, andE. G. Friedman. Managing static energy leakage
in microprocessor functional units. InIEEE/ACM International Symposium on Micro-architecture (MICRO-35), pages 321–332, Istanbul, Turkey, November 2002.
[11] J. Glossner, E. Hokenek, and M. Moudgill. An overview ofthe Sandbridge processor technology. Whitepaper at www.sandbridgetech.com.
[12] U. Kapasi, W. Dally, S. Rixner, P. Mattson, J. Owens, andB. Khailany. Efficient conditional operations fordata-parallel architectures. In33rd Annual International Symposium on Microarchitecture, pages 159–170,Monterey, CA, December 2000.
29
[13] U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D. Owens. Programmablestream processors.IEEE Computer, 36(8):54–62, August 2003.
[14] B. Khailany. The VLSI implementation and evaluation of area- and energy-efficient streaming media pro-cessors. PhD thesis, Stanford University, Palo Alto, CA, June 2003.
[15] B. Khailany, W. J. Dally, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, A. Chang, andS. Rixner. Imagine: media processing with streams.IEEE Micro, 21(2):35–46, March-April 2001.
[16] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, J. D. Owens, and B. Towles. Exploring the VLSI scalabilityof stream processors. InInternational Conference on High Performance Computer Architecture (HPCA-2003), Anaheim, CA, February 2003.
[17] N. Mera. The wireless challenge. Telephony Online, http://telephonyonline.com/ar/telecomwirelesschallenge,December 9 1996.
[18] J. Mitola. The Software Radio Architecture.IEEE Communications Magazine, 33(5):26–38, May 1995.[19] P. R. L. Monica. Wireless gets blacked out too - North America Power Blackout. CNN Money News report:
http://money.cnn.com/2003/08/15/technology/landlines/index.htm, August 16 2003.[20] S. Moshavi. Multi-user detection for DS-CDMA communications.IEEE Communications Magazine, pages
124–136, October 1996.[21] S. Muir. The Vanu software base-station. White paper atwww.vanu.com, March 2003.[22] M. Powell, S. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. Gated-��� : A Circuit Technique to reduce
Leakage in Deep-Submicron Cache memories. InIEEE International Symposium on Low Power ElectronicDesign (ISLPED’00), pages 90–95, Rapallo, Italy, July 2000.
[23] J. M. Rabaey, A. Chandrakasan, and B. Nikolić. Digital Integrated Circuits - A Design Perspective. Prentice-Hall, 2nd edition, 2003.
[24] S. Rajagopal, S. Bhashyam, J. R. Cavallaro, and B. Aazhang. Real-time algorithms and architectures for multiuser channel estimation and detection in wireless base-station receivers. IEEE Transactions on Wireless Communications, 1(3):468–479, July 2002.
[25] S. Rajagopal, S. Rixner, and J. R. Cavallaro. A programmable baseband processor design for software defined radios. In IEEE International Midwest Symposium on Circuits and Systems, Tulsa, OK, August 2002.
[26] S. Rele, S. Pande, S. Onder, and R. Gupta. Optimizing static power consumption by functional units in superscalar architectures. In International Conference on Compiler Construction, LNCS 2304, Springer-Verlag, pages 261–275, Grenoble, France, April 2002.
[27] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens. A bandwidth-efficient architecture for media processing. In 31st Annual International Symposium on Microarchitecture, pages 3–13, Dallas, TX, November 1998.
[28] S. Rixner, W. Dally, B. Khailany, P. Mattson, U. Kapasi, and J. Owens. Register organization for media processing. In 6th International Symposium on High-Performance Computer Architecture, pages 375–386, Toulouse, France, January 2000.
[29] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. Lopez-Lagunas, P. R. Mattson, and J. Owens. A bandwidth-efficient architecture for media processing. In 31st Annual ACM/IEEE International Symposium on Microarchitecture (Micro-31), pages 3–13, Dallas, TX, December 1998.
[30] C. Sengupta. Algorithms and Architectures for Channel Estimation in Wireless CDMA Communication Systems. PhD thesis, Rice University, Houston, TX, December 1998.
[31] M. K. Varanasi and B. Aazhang. Multistage detection in asynchronous code-division multiple-access communications. IEEE Transactions on Communications, 38(4):509–519, April 1990.
[32] S. Verdu. Multiuser Detection. Cambridge University Press, 1998.
[33] B. Vucetic and J. Yuan. Turbo Codes: Principles and Applications. Kluwer Academic Publishers, 1st edition, 2000.
[34] S. B. Wicker. Error Control Systems for Digital Communication and Storage. Prentice Hall, 1st edition, 1995.