Implementing algorithms for advanced communication systems
-- My bag of tricks
Sridhar Rajagopal
Electrical and Computer Engineering
This work is supported by Nokia, TI, TATP and NSF
Motivation
Build wireless multimedia communication systems - Kbps to Mbps
Sophisticated algorithms - exponential complexity
Approaches: Sub-optimal algorithms - O(n^2), O(n^3) complexity
Better hardware implementations needed
Contributions
• Develop algorithms suitable for implementation
• Bit-level extensions to microprocessors
• Pipelining to reduce latency and memory
• On-line arithmetic for Most Significant Digit First Computations.
Outline
Advanced communication systems
• Algorithms for efficient implementation
• Pipelining
• On-line arithmetic
Bit-level extensions to microprocessors
Summary
Communication System - Physical layer: Transmitter
[Block diagram: Information bits (from higher layers) -> Coding -> Spreading -> D/A -> RF unit -> Antenna. Digital up to the D/A, analog after it.]
Communication System - Physical layer: Channel
Multipath reflections, attenuation, noise, multiple-user interference
Communication System - Physical layer: Receiver
[Block diagram: Antenna -> RF unit -> A/D -> Channel estimation and Detection -> Decoding -> Information bits (to higher layers). Analog up to the A/D, digital after it.]
Questions
Higher data rates => sophisticated algorithms => strain on hardware => lower data rates
1. Which is the best algorithm to use for implementation?
2. How to best do the digital part?
- VLSI, DSP, FPGA, microprocessor
- combination of these?
Multiuser Channel Estimation Algorithm
$R_{br} = \sum_i b_i r_i^H$
$R_{bb} = \sum_i b_i b_i^T$
$R_{bb} \, A_i = R_{br}$
$b_i \in \mathbb{R}^{2K}$, entries in {+1, -1} : Training/Tracking bits
$r_i \in \mathbb{C}^{N}$, 8-bit integer (complex) : Received signal
N = spreading gain (typically fixed, e.g. 32)
K = number of users (variable, <= N)
$A_i$ = Maximum Likelihood channel estimate
Iterative hardware-efficient scheme
Bit-streaming : suitable for tracking (window length L)
Method of gradient descent
Stable convergence behavior
Simple fixed-point VLSI architecture
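$R_{bb}^{(i)} = R_{bb}^{(i-1)} + b_L b_L^T - b_0 b_0^T$
$R_{br}^{(i)} = R_{br}^{(i-1)} + b_L r_L^H - b_0 r_0^H$
$A^{(i)} = A^{(i-1)} - \mu \, (R_{bb}^{(i)} A^{(i-1)} - R_{br}^{(i)})$
($b_L$, $r_L$: newest entry of the length-L window; $b_0$, $r_0$: oldest entry leaving the window; $\mu$: gradient descent step size)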
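A minimal sketch of the sliding-window gradient descent update, in plain Python and floating point rather than the fixed-point VLSI form. The class name, the parameter names, and the storage of the estimate as a 2K x N matrix (so that R_bb A = R_br is dimensionally consistent) are assumptions made for illustration, not taken from the slides.

```python
from collections import deque

def outer(u, v):
    """Outer product u v^T of two sequences, as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add_scaled(M, P, s):
    """In place: M += s * P."""
    for row_m, row_p in zip(M, P):
        for j, p in enumerate(row_p):
            row_m[j] += s * p

def mat_mul(X, Y):
    """Plain matrix product X Y."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

class SlidingChannelEstimator:
    """Hypothetical helper: tracks R_bb, R_br over a window and refines A."""

    def __init__(self, n_bits, n_chips, window, mu):
        self.window, self.mu = window, mu
        self.Rbb = [[0.0] * n_bits for _ in range(n_bits)]
        self.Rbr = [[0j] * n_chips for _ in range(n_bits)]
        self.A = [[0j] * n_chips for _ in range(n_bits)]
        self.history = deque()

    def update(self, b, r):
        """b: bits in {+1, -1}; r: complex received chips."""
        r_h = [complex(x).conjugate() for x in r]
        # Add the newest entry (b_L, r_L) to both correlation matrices.
        add_scaled(self.Rbb, outer(b, b), 1)
        add_scaled(self.Rbr, outer(b, r_h), 1)
        self.history.append((b, r_h))
        # Subtract the oldest entry (b_0, r_0) once the window is exceeded.
        if len(self.history) > self.window:
            b_old, r_old = self.history.popleft()
            add_scaled(self.Rbb, outer(b_old, b_old), -1)
            add_scaled(self.Rbr, outer(b_old, r_old), -1)
        # One gradient step: A <- A - mu * (R_bb A - R_br).
        gradient = mat_mul(self.Rbb, self.A)
        add_scaled(gradient, self.Rbr, -1)
        add_scaled(self.A, gradient, -self.mu)
        return self.A
```

The step size has to stay below 2 divided by the largest eigenvalue of R_bb for the iteration to converge; since the trace of R_bb is 2K times the window length, choosing mu below 1/(2K * window) is a conservative setting.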
Simulations - Static multipath channel
[Figure: Comparison of Bit Error Rates (BER). x-axis: Signal to Noise Ratio (SNR), 4 to 12; y-axis: BER, 10^-3 to 10^-1. Curves: MF, ActMF, ML, ActML. Complexity annotations: O(K^2 N) and O(K^3 + K^2 N).]
SINR = 0 dB
Paths = 3
Preamble = 150
Spreading N = 31
Users K = 15
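Multiuser interference
[Timing diagram: received windows r_{i-2}, r_{i-1}, r_i, r_{i+1} along the time axis, with rows for the Desired User, User 1 and User j. The window r_i overlaps bits b_i and b_{i+1} and contains:]
• Interference from previous bits of other users
• Interference from future bits of other users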
Block Based Detector
[Figure: bits are processed in overlapping blocks. Block 1 (bits 1-12) and block 2 (bits 11-22) each pass through Matched Filter, Stage 1, Stage 2 and Stage 3. Block 1 yields bits 2-11 and block 2 yields bits 12-21.]
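Detection
Matched filter:
$y^{(0)} = \mathrm{Re}[A^H r]$
$d^{(0)} = \mathrm{sign}(y^{(0)})$
Iterate for convergence:
$y^{(l+1)} = y^{(0)} - \mathrm{Re}[A^H A - S] \, d^{(l)}$
$d^{(l+1)} = \mathrm{sign}(y^{(l+1)})$
$d = [d_{1,1}, \ldots, d_{K,1}, \ldots, d_{1,D}, \ldots, d_{K,D}]^T$
$A^H A = \begin{bmatrix} A_0^H A_0 & A_0^H A_1 & 0 & 0 \\ A_1^H A_0 & A_1^H A_1 + A_0^H A_0 & A_0^H A_1 & 0 \\ 0 & \ddots & \ddots & \ddots \\ 0 & 0 & A_1^H A_0 & A_1^H A_1 \end{bmatrix}$
Per bit i:
$y_i^{(l+1)} = y_i^{(0)} - L \, d_{i-1}^{(l)} - C \, d_i^{(l)} - R \, d_{i+1}^{(l)}$, with $R = L^T$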
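The per-bit iteration can be sketched as below. This is an illustrative floating-point version, not the chip's fixed-point datapath. It assumes L and C are real K x K matrices supplied by the caller (how they are formed from the channel estimates is not spelled out here), that bits outside the received range contribute no interference, and that a soft value of exactly zero is decided as +1. All function names are hypothetical.

```python
def sign(x):
    """Hard decision: map a real soft value to a bit in {+1, -1}."""
    return 1 if x >= 0 else -1

def mat_vec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def pic_detect(y0, L, C, stages=3):
    """Multistage detection.

    y0: list over bit times i of length-K matched-filter outputs y_i^(0).
    Returns the detected bits after the given number of stages.
    """
    K = len(C)
    zero = [0] * K
    Lt = transpose(L)                              # R = L^T
    d = [[sign(v) for v in yi] for yi in y0]       # matched-filter decisions
    for _ in range(stages):
        new_d = []
        for i, yi in enumerate(y0):
            prev_d = d[i - 1] if i > 0 else zero
            next_d = d[i + 1] if i + 1 < len(d) else zero
            past = mat_vec(L, prev_d)              # previous bits of other users
            current = mat_vec(C, d[i])             # current bits of other users
            future = mat_vec(Lt, next_d)           # future bits of other users
            y = [yi[k] - past[k] - current[k] - future[k] for k in range(K)]
            new_d.append([sign(v) for v in y])
        d = new_d
    return d

# Two users, one bit time, no inter-bit interference (L = 0).
# The matched filter alone decides [-1, -1]; two stages recover [1, -1].
L_mat = [[0, 0], [0, 0]]
C_mat = [[0, 0.4], [0.4, 0]]
detected = pic_detect([[-0.1, -0.6]], L_mat, C_mat, stages=2)
```

Every stage uses only the previous stage's decisions for bits i-1, i and i+1, which is the property a pipelined implementation relies on.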
Pipelined detection scheme
[Timing diagram, repeated from the Multiuser interference slide: the window r_i for bits b_i and b_{i+1} is affected only by the previous and future bits of the other users.]
Pipelined Detector
[Figure: bits 1-12 stream through Matched Filter, Stage 1, Stage 2 and Stage 3 as one continuous pipeline, without block boundaries.]
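[Block diagram of the pipelined detector. Inputs: received signal r_{i+3} and channel estimates A_0, A_1, feeding the Matched Filter and the matrices C, L and R = L^T. Each stage has an adder with one + input and three - inputs (L \hat{d}_{i-1}, L^T \hat{d}_{i+1}, C \hat{d}_i) followed by Sign Detection. Stage labels: Stage 1 with \hat{d}_{i+2}, y_{i+2}; Stage 2 with \hat{d}_i, y_i; Stage 3 with \hat{d}_{i-2}, y_{i-2}. Delay elements carry y_i between stages. Output: detected bits \hat{d}_{i-4} (with y_{i-4}).]
Chip being built as part of the Elec 422 VLSI course project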
On-line arithmetic
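Sign of dot-product computations
High precision operations done to find the sign
Can be avoided with Most Significant Digit First computation using redundant number systems
$d_p = \mathrm{sign}(A^H r)$
[Figure: dot product as a row of multipliers (*) feeding a chain of adders (+). The terms A^H_{p,1}, A^H_{p,2}, ..., A^H_{p,N-1}, A^H_{p,N} multiply r_0, r_1, ..., r_{N-1}, r_N and the products are summed to give A^H r.]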
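The early-termination idea behind most-significant-digit-first sign detection can be illustrated with ordinary integers. The sketch below is not an on-line adder and does not use a redundant number system; it swaps in a simpler technique, scanning the bit planes of the already formed product terms from the most significant bit down and stopping once the bits still unseen can no longer change the sign. The function name and the bound are assumptions made for this illustration.

```python
def msd_first_sign(terms, width):
    """Return (sign, digits_used) for sum(terms), scanning MSB first.

    terms: signed integers with |term| < 2**width.
    sign is +1, -1, or 0 when the sum is exactly zero.
    """
    n = len(terms)
    partial = 0
    for k in range(width - 1, -1, -1):
        # Accumulate bit plane k of every term, with the term's sign.
        for t in terms:
            if (abs(t) >> k) & 1:
                partial += (1 << k) if t > 0 else -(1 << k)
        # Largest magnitude the still-unseen lower bits could contribute.
        remaining = n * ((1 << k) - 1)
        if abs(partial) > remaining:
            return (1 if partial > 0 else -1), width - k
    return 0, width

# Four 8-bit terms: the sign is settled after 4 of the 8 bit planes.
result = msd_first_sign([100, -3, 5, -7], 8)   # (1, 4)
```

When the terms differ widely in magnitude, the sign is known after only a few leading digits, which is the saving the slide points to.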
DSP/microprocessor implementations
Further acceleration needed for real-time performance
• Matrix based massively parallel algorithms
• Detection of bits {+1,-1} : bit-level operations
DSPs
• Bit multiplications not needed - (add/subtract on FPGA)
• Bit storage not convenient
• Not fully able to exploit parallelism
FPGAs for acceleration
• Flexibility of ASICs
• Good for parallelism and bit-level operations
[Figure: task partitioning. Blocks: Code matched filter detector, Multiuser estimation, PIC (Stage 1), PIC (Stage 2). Input: received bits; output: detected bits. Processors: DSP1, DSP2, FPGA1, FPGA2.]
Multiprocessor simulations
[Figure: Execution time (in seconds, log scale from 10^-6 to 10^-2) versus number of users (0 to 35). Curves: Single DSP implementation; 2 DSP implementation; 2 DSPs + 2 FPGAs; Target data rate - 128 Kbps/user.]
Instruction Set Extensions
To accelerate Bit level computations in Wireless
Real/Complex Integer - Bit Multiplications
Used in Multiuser Detection, Decoding
Bit - Bit Multiplications
Used in Outer Product Updates
Correlation, Channel Estimation
Complex Integer-Integer Multiplications
Useful in other Signal Processing applications
Speech, video, etc.
SIMD Parallelism
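[Figure: SIMD datapath. 64-bit Registers A and B are split into 8-bit elements; the same operation (+ or x) is applied to each pair of elements in parallel and the results are written to 64-bit Register C.]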
Integer - Bit Multiplications
For i = 1..8, j = 1..8
D[i][j] = D[i][j] + b[i]*C[j] (Cross-Correlation)
[Figure: the 8-bit Control Register b[i] selects add or subtract (+/-) of each 8-bit element of 64-bit Register C[j] into 64-bit Register D[i][j].]
Computational Savings
Avoid bit multiplications and control structures
Original SIMD instruction set:
• 4 8-bit multiplies per instruction - latency 3 cycles
• 8 8-bit adds per instruction - latency 1 cycle
Cross-Correlation example: 64 multiplies, 64 adds
• Original SIMD instruction set: 16*3 + 8*1 = 56 cycles
• With extensions: 8*1 = 8 cycles
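The saving comes from replacing each multiply by a sign-controlled add or subtract. The sketch below checks that equivalence in software and reproduces the slide's cycle estimate. The encoding of the control word (bit i set meaning b[i] = +1) and all function names are assumptions for illustration; the cycle model is simply instruction count times latency, as in the slide's expression.

```python
def cross_correlate_multiply(D, b, C):
    """Baseline: D[i][j] += b[i] * C[j] with b[i] in {+1, -1}."""
    for i in range(len(b)):
        for j in range(len(C)):
            D[i][j] += b[i] * C[j]

def cross_correlate_addsub(D, control, C):
    """Same update without multiplies.

    control: integer whose bit i is 1 for b[i] = +1 and 0 for b[i] = -1.
    """
    for i in range(len(D)):
        if (control >> i) & 1:
            for j in range(len(C)):
                D[i][j] += C[j]      # add the whole row in one pass
        else:
            for j in range(len(C)):
                D[i][j] -= C[j]      # subtract the whole row in one pass

def simd_cycles(n_ops, lanes, latency):
    """Cycle estimate: number of SIMD instructions times their latency."""
    return (n_ops // lanes) * latency

# 64 multiplies on 4 lanes at 3 cycles, plus 64 adds on 8 lanes at 1 cycle.
baseline = simd_cycles(64, 4, 3) + simd_cycles(64, 8, 1)
# With the extension only the 64 add/subtracts remain, on 8 lanes.
extended = simd_cycles(64, 8, 1)
```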
Bit-Bit Multiplications
D = D + b*b^T
E.g. Auto-Correlation
64-bit Register A = b1, 64-bit Register B = b2
64-bit Register C = b1*b2 = A XNOR B
B1  B2  B1*B2
0   0   1
0   1   0
1   0   0
1   1   1
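The truth table corresponds to storing +1 as bit 1 and -1 as bit 0, so that the product of two bits is their XNOR. A minimal sketch of that equivalence on 64-bit words follows; the helper names and the choice of placing element k in bit k are assumptions for illustration.

```python
MASK64 = (1 << 64) - 1

def pack(values):
    """Pack +1/-1 values into an integer: +1 -> 1, -1 -> 0, element k in bit k."""
    word = 0
    for k, v in enumerate(values):
        if v == 1:
            word |= 1 << k
    return word

def unpack(word, n):
    """Inverse of pack for the lowest n bits."""
    return [1 if (word >> k) & 1 else -1 for k in range(n)]

def xnor64(a, b):
    """Bitwise XNOR restricted to 64 bits: 64 bit-bit products at once."""
    return ~(a ^ b) & MASK64

x = [1, -1] * 32
y = [1, 1, -1, -1] * 16
products = unpack(xnor64(pack(x), pack(y)), 64)   # equals [xi * yi for each pair]
```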
8-bit to 64-bit conversions
D = D + b*b^T
E.g. Auto-Correlation
b1 = b(1:8), b(1:8), ..., b(1:8)
b2 = b(1) b(1) ... b(8) b(8)
[Figure: the 8-bit Register b is expanded into a 64-bit Register A in two ways: the whole vector b(1)..b(8) repeated, and each bit b(1), ..., b(8) repeated.]
Increment/Decrement
D = D + b*b^T
E.g. Auto-Correlation
[Figure: each bit of the 8-bit Register b1*b2 increments or decrements (+/-) by 1 one 8-bit element of 64-bit Register D, giving 64-bit Register (D + b1*b2).]
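Taken together, the conversion, XNOR and increment/decrement steps compute the outer-product update D = D + b*b^T without any multiplier. The sketch below models the registers as Python lists of 0/1 bits, with bit 1 standing for +1 and bit 0 for -1; the function names and the list-based register model are assumptions for illustration.

```python
def replicate_vector(bits):
    """b1: the 8-bit vector b(1..8) repeated 8 times (64 bits)."""
    return list(bits) * 8

def replicate_each_bit(bits):
    """b2: each bit b(k) repeated 8 times (64 bits)."""
    return [bit for bit in bits for _ in range(8)]

def autocorrelation_update(D, bits):
    """Update the 8x8 matrix D in place with the outer product of the bits."""
    b1 = replicate_vector(bits)
    b2 = replicate_each_bit(bits)
    # XNOR of the two 64-bit words gives all 64 products b(i)*b(j).
    products = [1 - (x ^ y) for x, y in zip(b1, b2)]
    for i in range(8):
        for j in range(8):
            # Increment where the product bit is 1 (+1), decrement where it is 0 (-1).
            D[i][j] += 1 if products[8 * i + j] else -1
    return D

D = [[0] * 8 for _ in range(8)]
autocorrelation_update(D, [1, 0, 0, 1, 1, 1, 0, 1])
```

The diagonal of D always increases by 1, since every bit multiplied by itself gives +1.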
Truncated Multipliers
Many applications need approximate computations
Adaptive Algorithms: Y = Y + mu*(Y*C)
Truncate lower bits
Truncated Multipliers - half the area/half the delay
Can do 2 truncated multiplies in parallel with the regular multiplier
[Figure: datapath with ALU and Multipliers: Multiplier 1, Multiplier 2, Truncated Multiplier.]
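A truncated multiplier can be modelled as a multiplier that omits the low-order columns of the partial-product array. The sketch below keeps only the partial products in columns n and above for n-bit unsigned operands, which is 28 of the 64 partial products for n = 8. It is an illustrative model under those assumptions; practical truncated multipliers usually keep a few guard columns and add a correction constant, which this sketch leaves out.

```python
def truncated_multiply(a, b, n=8):
    """Approximate upper n bits of a*b for n-bit unsigned a and b.

    Only partial products a_i * b_j * 2**(i+j) with i + j >= n are summed.
    """
    total = 0
    for i in range(n):
        if (a >> i) & 1:
            for j in range(n):
                if (b >> j) & 1 and i + j >= n:
                    total += 1 << (i + j)
    return total >> n

def exact_high(a, b, n=8):
    """Exact upper n bits of the 2n-bit product, for comparison."""
    return (a * b) >> n

# Worst case for n = 8: the truncated result is low by 7 units.
worst = exact_high(255, 255) - truncated_multiply(255, 255)
```

The result never exceeds the exact value and is at most n-1 units low, an error that an adaptive update of the form Y = Y + mu*(Y*C) can typically absorb.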
Open Questions
VLIW simulator??
Showing performance improvement, for different algorithms
Compiler and software support
Conclusions
Data rates for advanced communication systems are limited by hardware, not by algorithms
Need to find efficient solutions to tackle this problem - Hardware-software co-design
Presented my ways of attacking this problem
Future Work
RENÉ: Single re-configurable hardware to switch between 2 communication standards
Designing algorithms, conditioned on the availability of only finite precision
http://www.ece.rice.edu/~sridhar/research.htm
http://cmc.rice.edu