Implementing algorithms for advanced communication systems -- My bag of tricks

Sridhar Rajagopal, Electrical and Computer Engineering

This work is supported by Nokia, TI, TATP and NSF

Motivation

Build wireless multimedia communication systems - Kbps to Mbps

Sophisticated algorithms - exponential complexity

Approaches: Sub-optimal algorithms - O(n^2), O(n^3) complexity

Better hardware implementations needed

Contributions

• Develop algorithms suitable for implementation

• Bit-level extensions to microprocessors

• Pipelining to reduce latency and memory

• On-line arithmetic for Most Significant Digit First Computations.

Outline

Advanced communication systems

• Algorithms for efficient implementation

• Pipelining

• On-line arithmetic

Bit-level extensions to microprocessors

Summary

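Communication System - Physical layer: Transmitter

[Block diagram: Information bits (from higher layers) -> Coding -> Spreading -> D/A -> RF unit -> Antenna. Coding and Spreading are in the digital domain; the stages after the D/A are analog.]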

Communication System - Physical layer: Channel

Multipath reflections, attenuations, noise, multiple user interference

Communication System - Physical layer: Receiver

[Block diagram: Antenna -> RF unit -> A/D -> Channel estimation and Detection -> Decoding -> Information bits (to higher layers). The stages before the A/D are analog; the rest is digital.]

Questions

Higher data rates => sophisticated algorithms => strain on hardware => lower data rates

1. Which is the best algorithm to use for implementation?

2. How best to do the digital part?
- VLSI, DSP, FPGA, microprocessor
- combination of these?

Multiuser Channel Estimation Algorithm

$R_{br} = \sum_i b_i r_i^H$

$R_{bb} = \sum_i b_i b_i^T$

$R_{bb} * A_i = R_{br}$

$b_i \in \mathbb{R}^{2K}$, entries in {+1, -1} : Training/Tracking bits

$r_i \in \mathbb{C}^N$, 8-bit integer (complex) : Received signal

N = spreading gain (typically fixed, e.g. 32)

K = number of users (variable, <= N)

$A_i$ = Maximum Likelihood channel estimate

Iterative hardware-efficient scheme

Bit-streaming : suitable for tracking (window length L)

Method of gradient descent

Stable convergence behavior

Simple fixed-point VLSI architecture

$R_{bb}^{(i)} = R_{bb}^{(i-1)} - b_0 * b_0^T + b_L * b_L^T$

$R_{br}^{(i)} = R_{br}^{(i-1)} - b_0 * r_0^H + b_L * r_L^H$

$A^{(i)} = A^{(i-1)} - \mu (R_{bb}^{(i)} * A^{(i-1)} - R_{br}^{(i)})$
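The sliding-window updates and the gradient-descent step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the fixed-point VLSI design: the function name `track_channel`, the step size `mu`, and the choice to store A as a 2K x N matrix (so that R_bb * A = R_br is dimensionally consistent) are assumptions made for this example, and floating-point complex numbers stand in for fixed-point hardware.

```python
def outer(u, v):
    """Outer product u * v^H (elements of v are conjugated)."""
    return [[ui * complex(vj).conjugate() for vj in v] for ui in u]

def add_scaled(X, Y, s):
    """Return X + s * Y for equally sized matrices."""
    return [[x + s * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def matmul(X, Y):
    """Plain matrix product X * Y."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def zeros(m, n):
    return [[0j] * n for _ in range(m)]

def track_channel(bits, received, window, mu):
    """One gradient step per received vector over a window of `window` bits.

    bits[i]     : list of +1/-1 values (length 2K in the slides' notation)
    received[i] : list of complex samples (length N)
    Returns A with as many rows as len(bits[0]) and len(received[0]) columns.
    """
    m, n = len(bits[0]), len(received[0])
    R_bb, R_br, A = zeros(m, m), zeros(m, n), zeros(m, n)
    for i, (b, r) in enumerate(zip(bits, received)):
        # Add the newest bit vector to both correlation matrices.
        R_bb = add_scaled(R_bb, outer(b, b), 1)
        R_br = add_scaled(R_br, outer(b, r), 1)
        # Once the window is full, subtract the oldest contribution.
        if i >= window:
            b_old, r_old = bits[i - window], received[i - window]
            R_bb = add_scaled(R_bb, outer(b_old, b_old), -1)
            R_br = add_scaled(R_br, outer(b_old, r_old), -1)
        # Gradient-descent step: A <- A - mu * (R_bb * A - R_br).
        gradient = add_scaled(matmul(R_bb, A), R_br, -1)
        A = add_scaled(A, gradient, -mu)
    return A
```

With noise-free data generated as r = H b, the estimate settles at the conjugate transpose of H, provided `mu` is small enough relative to the eigenvalues of R_bb.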

Simulations - Static multipath channel

SINR = 0 dB

Paths = 3

Preamble = 150

Spreading N = 31

Users K = 15

[Figure: Comparison of Bit Error Rates (BER). BER (log scale, 10^-3 to 10^-1) versus Signal to Noise Ratio (SNR, 4 to 12). Curves: MF, ActMF, ML, ActML. Complexity annotations: O(K^2 N) and O(K^3 + K^2 N).]

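Multiuser interference

[Figure: time axis with received vectors r_{i-2}, r_{i-1}, r_i, r_{i+1} shown for the Desired User, User 1, ..., User j. The received vector r_i spans bits b_i and b_{i+1} and carries interference from previous bits of other users and interference from future bits of other users.]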

Block Based Detector

[Figure: two blocks, each passed through Matched Filter, Stage 1, Stage 2 and Stage 3. The first block covers bits 1 to 12 and yields Bits 2-11; the second covers bits 11 to 22 and yields Bits 12-21.]

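Detection

Matched filter:

$y^0 = \mathrm{Re}[A^H r]$

$d^0 = \mathrm{sign}(y^0)$

Iterate for convergence:

$y^{l+1} = y^0 - \mathrm{Re}[A^H A - S] d^l$

$d^{l+1} = \mathrm{sign}(y^{l+1})$

$d = [d_{1,1}, \ldots, d_{1,K}, \ldots, d_{D,1}, \ldots, d_{D,K}]^T$

Per bit $i$:

$y_i^{l+1} = y_i^0 - L d_{i-1}^l - C d_i^l - R d_{i+1}^l$

[Equation: block-banded matrix whose nonzero blocks are products of the form $A_j^H A_k$ with $j, k \in \{0, 1\}$, and zero blocks elsewhere.]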
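A minimal sketch of the matched filter followed by the iterated interference-cancellation stages may help make the detection recursion concrete. It handles a single symbol period with a small dense matrix, so the block structure with L, C and R is not modelled. The function names are hypothetical, and taking S to be the diagonal of Re[A^H A] (so that only the other users' terms are subtracted) is an assumption, since the slides do not define S.

```python
def matched_filter(A, r):
    """y0 = Re[A^H r] for an N x K matrix A (list of rows) and a length-N vector r."""
    n, k = len(A), len(A[0])
    return [sum(A[i][j].conjugate() * r[i] for i in range(n)).real
            for j in range(k)]

def cross_correlation(A):
    """Re[A^H A], returned as a K x K real matrix."""
    n, k = len(A), len(A[0])
    return [[sum(A[i][p].conjugate() * A[i][q] for i in range(n)).real
             for q in range(k)] for p in range(k)]

def sign(x):
    return 1 if x >= 0 else -1

def pic_detect(A, r, stages=3):
    """Matched filter followed by `stages` rounds of interference cancellation."""
    y0 = matched_filter(A, r)
    H = cross_correlation(A)
    k = len(y0)
    d = [sign(v) for v in y0]            # matched-filter decisions
    for _ in range(stages):
        # Subtract the estimated interference of all other users (q != p),
        # which corresponds to removing the diagonal S from Re[A^H A].
        y = [y0[p] - sum(H[p][q] * d[q] for q in range(k) if q != p)
             for p in range(k)]
        d = [sign(v) for v in y]
    return d
```

For a strong interferer correlated with two weaker users, for example `A = [[1, 1, 3], [1, -1, 3], [1, 1, 3], [1, -1, -3]]` and `r = [1, 3, 1, -3]`, the matched filter alone decides both weak users wrongly, while the cancellation stages recover the transmitted bits.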

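Pipelined detection scheme

[Figure: time axis with received vectors r_{i-2}, r_{i-1}, r_i, r_{i+1} shown for the Desired User, User 1, ..., User j. The received vector r_i spans bits b_i and b_{i+1} and carries interference from previous bits of other users and interference from future bits of other users.]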

Pipelined Detector

[Figure: bits 1 to 12 flow through Matched Filter, Stage 1, Stage 2 and Stage 3 as a single continuous stream, with no block boundaries.]

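[Block diagram: the Received Signal r_{i+3} and the Channel Estimates A_0, A_1 feed the Matched Filter. Stage 1, Stage 2 and Stage 3 each contain an Adder that subtracts L \hat{d}_{i-1}, L^T \hat{d}_{i+1} and C \hat{d}_i, followed by Sign Detection, with C and L supplied to every stage and R = L^T. Delay elements carry y_i between stages. Signals shown per stage: y_{i+2}, \hat{d}_{i+2} (Stage 1); y_i, \hat{d}_i (Stage 2); y_{i-2}, \hat{d}_{i-2} (Stage 3); output y_{i-4}, \hat{d}_{i-4} (Detected bits).]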

Chip being built as part of the Elec 422 VLSI course project

On-line arithmetic

Sign of dot-product computations

High precision operations done to find the sign

Can be avoided with Most Significant Digit First computation using redundant number systems

$d_p = \mathrm{sign}(A^H r)$

[Figure: multipliers (*) form the products of $A^H_{p,1}, A^H_{p,2}, \ldots, A^H_{p,N-1}, A^H_{p,N}$ with $r_0, r_1, \ldots, r_{N-1}, r_N$; a chain of adders (+) accumulates them into $A^H r$.]
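The idea that the sign can be known before the full-precision result is available can be illustrated with a simplified model. The sketch below does not implement a redundant-number-system on-line adder; as a stated simplification it scans the sign-magnitude bit planes of r from the most significant bit down and stops as soon as the bits not yet seen can no longer change the sign. The function name and the 8-bit default are assumptions for this example.

```python
def msdf_sign_of_dot(a, r, bits=8):
    """Sign of sum(a[i] * r[i]), scanning the magnitude bits of r MSB first.

    Returns (sign, planes_used), where sign is +1, -1, or 0 for an exact zero,
    and planes_used is the number of bit planes examined before deciding.
    """
    weight = sum(abs(x) for x in a)      # bounds the effect of each unseen bit plane
    partial = 0
    for used, k in enumerate(range(bits - 1, -1, -1), start=1):
        # Contribution of bit plane k: each r[i] adds +/- a[i] if its bit k is set.
        plane = sum(ai * (1 if ri >= 0 else -1) * ((abs(ri) >> k) & 1)
                    for ai, ri in zip(a, r))
        partial += plane << k
        # Largest magnitude the remaining planes k-1 .. 0 could still add.
        remaining = weight * ((1 << k) - 1)
        if abs(partial) > remaining:
            return (1 if partial > 0 else -1), used
    return 0, bits
```

When the operands are well separated in magnitude, the decision is reached after only a few planes, which is the saving that most-significant-digit-first computation aims for.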

DSP/microprocessor implementations

Further acceleration needed for real-time performance

Matrix based massively parallel algorithms

Detection of bits {+1,-1} : bit-level operations

DSPs
- Bit multiplications not needed (add/subtract on FPGA)
- Bit storage not convenient
- Not fully able to exploit parallelism

FPGAs for acceleration
- Flexibility of ASICs
- Good for parallelism and bit-level operations

[Block diagram: Received bits pass through Code matched filter detector, Multiuser estimation, PIC (Stage 1) and PIC (Stage 2) to produce Detected bits; the blocks are mapped onto DSP1, DSP2, FPGA1 and FPGA2.]

Multiprocessor simulations

[Figure: Execution time (in seconds, log scale, 10^-6 to 10^-2) versus Users (0 to 35). Curves: Single DSP implementation; 2 DSP implementation; 2 DSPs + 2 FPGAs; Target data rate - 128 Kbps/user.]

Instruction Set Extensions

To accelerate Bit level computations in Wireless

Real/Complex Integer - Bit Multiplications

Used in Multiuser Detection, Decoding

Bit - Bit Multiplications

Used in Outer Product Updates

Correlation, Channel Estimation

Complex Integer-Integer Multiplications

Useful in other Signal Processing applications

Speech, Video, ...

SIMD Parallelism

[Figure: 64-bit Register A and 64-bit Register B, each split into 8-bit fields, are combined field by field (+ or x) into 64-bit Register C.]

Integer - Bit Multiplications

For i = 1..8, j = 1..8

D[i][j] = D[i][j] + b[i]*C[j] (Cross-Correlation)

[Figure: the 8-bit Control Register b[i] selects add or subtract (+/-) for the 8-bit fields of 64-bit Register C[j], which are accumulated into 64-bit Register D[i][j].]
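Because b[i] is always +1 or -1, the multiplication in the cross-correlation update reduces to a choice between adding and subtracting. A minimal sketch of that reduction follows; the function name is hypothetical, and the packing of eight 8-bit fields into one 64-bit register is not modelled here, only the arithmetic each lane would perform.

```python
def cross_correlation_update(D, b, C):
    """In-place update D[i][j] += b[i] * C[j], with b[i] in {+1, -1}.

    The bit b[i] acts purely as an add/subtract control, so no multiplier is used.
    """
    for i, bit in enumerate(b):
        row = D[i]
        for j, c in enumerate(C):
            if bit > 0:
                row[j] += c      # b[i] = +1: add the integer
            else:
                row[j] -= c      # b[i] = -1: subtract the integer
    return D
```

In the proposed extension, the inner loop over j would correspond to one instruction operating on all lanes of a register at once.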

Computational Savings

Avoid bit multiplications and control structures

Four 8-bit multiplies per instruction - latency 3 cycles

Eight 8-bit adds per instruction - latency 1 cycle

Cross-Correlation Example: 64 multiply, 64 add

Original SIMD Instruction Set: 16*3 + 8*1 = 56 cycles

With Extensions: 8*1 = 8 cycles
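The cycle counts follow from dividing the operation count by the number of lanes per instruction and multiplying by the latency. The small calculation below reproduces that arithmetic under the assumption that instructions execute back to back with no overlap, which is the model the formula on the slide implies; the helper name is hypothetical.

```python
from math import ceil

def simd_cycles(operations, lanes, latency):
    """Cycles to issue `operations` on a unit handling `lanes` at a time, without overlap."""
    return ceil(operations / lanes) * latency

# Original SIMD instruction set: 64 multiplies (4 per instruction, 3 cycles each)
# followed by 64 adds (8 per instruction, 1 cycle each).
baseline_cycles = simd_cycles(64, 4, 3) + simd_cycles(64, 8, 1)

# With the extension: 64 conditional add/subtracts (8 per instruction, 1 cycle each).
extended_cycles = simd_cycles(64, 8, 1)

speedup = baseline_cycles / extended_cycles
```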

Bit-Bit Multiplications

D = D + b*b^T

Eg: Auto-Correlation

[Figure: 64-bit Register A = b1 and 64-bit Register B = b2 are combined by XNOR into 64-bit Register C = b1*b2.]

B1   B2   B1*B2
0    0    1
0    1    0
1    0    0
1    1    1
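The truth table is that of XNOR, which equals the product of two {+1, -1} values when bit 1 stands for +1 and bit 0 for -1. That encoding is an inference from the table rather than something the slides state. A small sketch, with hypothetical helper names, using a Python integer masked to 64 bits as the register:

```python
MASK64 = (1 << 64) - 1

def xnor64(a, b):
    """Bitwise XNOR of two 64-bit words: 64 one-bit multiplications at once."""
    return ~(a ^ b) & MASK64

def encode(signs):
    """Pack +1/-1 values into a word, first value in the least significant bit."""
    word = 0
    for position, s in enumerate(signs):
        if s > 0:
            word |= 1 << position
    return word

def decode(word, count):
    """Unpack the lowest `count` bits of a word back into +1/-1 values."""
    return [1 if (word >> position) & 1 else -1 for position in range(count)]
```

For example, `decode(xnor64(encode(b1), encode(b2)), len(b1))` gives the element-wise product of two sign vectors `b1` and `b2`.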

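8-bit to 64-bit conversions

D = D + b*b^T

Eg: Auto-Correlation

b1 = b(1:8), b(1:8), ..., b(1:8)

b2 = b(1) b(1) ... b(8) b(8)

[Figure: the 8-bit Register b is expanded into 64-bit Register A in two ways: the whole pattern b(1)..b(8) repeated eight times, and each bit b(1), b(2), ..., b(8) repeated eight times. Steps are labelled 1.1, 1.2 and 2.1.]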

Increment/Decrement

D = D + b*b^T

Eg: Auto-Correlation

[Figure: the 8-bit Register b1*b2 selects increment or decrement (+/-) for each field of 64-bit Register D, producing 64-bit Register (D+b1*b2).]
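Putting the three steps together (8-bit to 64-bit conversion, XNOR, then increment/decrement) gives the whole auto-correlation update D = D + b*b^T without any multiplier. The sketch below is an illustration under assumptions: bit 1 encodes +1 and bit 0 encodes -1, D is kept as an 8x8 list of Python integers rather than packed fields, and the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def repeat_byte(b):
    """b1 = b(1:8), b(1:8), ..., b(1:8): the 8-bit pattern copied into all eight bytes."""
    return sum((b & 0xFF) << (8 * i) for i in range(8))

def spread_bits(b):
    """b2 = b(1) b(1) ... b(8) b(8): bit i of b replicated across the whole of byte i."""
    return sum(0xFF << (8 * i) for i in range(8) if (b >> i) & 1)

def autocorrelation_update(D, b):
    """In-place update D[i][j] += b_i * b_j for an 8x8 matrix D and an 8-bit word b."""
    # Bit 8*i + j of the XNOR result is 1 exactly when b_i == b_j, i.e. b_i * b_j = +1.
    products = ~(repeat_byte(b) ^ spread_bits(b)) & MASK64
    for i in range(8):
        for j in range(8):
            if (products >> (8 * i + j)) & 1:
                D[i][j] += 1     # product is +1: increment
            else:
                D[i][j] -= 1     # product is -1: decrement
    return D
```

A single XNOR yields all 64 products, so the remaining work per update is the 64 increment/decrement operations.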

Truncated Multipliers

Many applications need approximate computations

Adaptive Algorithms: Y = Y + mu*(Y*C)

Truncate lower bits

Truncated Multipliers - half the area/half the delay

Can do 2 truncated multiplies in parallel with regular

[Figure: datapath with an ALU and Multipliers: Multiplier 1, Multiplier 2 and a Truncated Multiplier.]
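To make "truncate lower bits" concrete, the sketch below models an unsigned n-bit truncated multiplier that simply never forms the partial-product bits falling in the lower n columns. This is a simplified column-dropping model chosen for illustration: practical truncated multipliers usually add a correction term or keep a few guard columns, and the function name is hypothetical.

```python
def truncated_multiply(x, y, n=8):
    """Approximate the upper n bits of the 2n-bit product of two unsigned n-bit values.

    Only partial-product bits in columns n and above are summed, so the result
    is never larger than the exact upper half, (x * y) >> n.
    """
    acc = 0
    for i in range(n):
        if (y >> i) & 1:
            # Row i is x shifted left by i; its bits in columns >= n are
            # exactly the bits of x at positions n - i and above.
            acc += x >> (n - i)
    return acc
```

In this model the result falls short of the exact upper half by at most n units, the kind of bounded error that an adaptive update such as Y = Y + mu*(Y*C) can typically absorb.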

Open Questions

VLIW simulator??

Showing performance improvement, for different algorithms

Compiler and software support

Conclusions

Data rates for advanced communication systems are limited by hardware, not by algorithms

Need to find efficient solutions to tackle this problem - Hardware-software co-design

Presented my ways of attacking this problem

Future Work

RENÉ: Single re-configurable hardware to switch between 2 communication standards

Designing algorithms, conditioned on the availability of only finite precision

http://www.ece.rice.edu/~sridhar/research.htm

http://cmc.rice.edu
