Techniques for Low Power Turbo Coding in Software Radio
Joe Antoon, Adam Barnett
Software Defined Radio
• Single transmitter for many protocols
• Protocols completely specified in memory
• Implementation:
– Microprocessors
– Field-programmable logic
Why Use Software Radio?
• Wireless protocols are constantly reinvented
– 5 Wi-Fi protocols
– 7 Bluetooth protocols
– Proprietary mouse and keyboard protocols
– Mobile-phone protocol alphabet soup
• Custom DSP logic for each protocol is costly
So Why Not Use Software Radio?
• Requires high-performance processors
• Consumes more power
[Diagram: three design paths — an inefficient general-purpose implementation, an efficient application-specific implementation, and an inefficient field-programmable implementation]
Turbo Coding
• Channel coding technique
• Throughput nears the theoretical limit
• Great for bandwidth-limited applications
– CDMA2000
– WiMAX
– NASA's MESSENGER probe
Turbo Coding Considerations
• Presents a design trade-off
• Turbo coding is computationally expensive
• But it reduces cost in other areas
– Bandwidth
– Transmission power
Reducing Power in Turbo Decoders
• FPGA turbo decoders
– Use dynamic reconfiguration
• General-processor turbo decoders
– Use a logarithmic number system
Generic Turbo Encoder
[Diagram: the data stream s feeds one component encoder, producing parity p1, and feeds a second component encoder through an interleaver, producing parity p2; the receiver sees these three streams as r, q1, and q2]
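The structure above can be sketched in Python; the taps, interleaver, and rate-1/3 arrangement here are illustrative, not a specific standard's:

```python
import random

def component_parity(bits, taps=(1, 1, 0)):
    """Toy component encoder: each parity bit is the XOR of the current
    input and the tapped shift-register contents (hypothetical taps)."""
    state = [0] * (len(taps) - 1)
    parity = []
    for b in bits:
        p = b & taps[0]
        for t, s in zip(taps[1:], state):
            p ^= t & s
        parity.append(p)
        state = [b] + state[:-1]   # shift the input into the registers
    return parity

def turbo_encode(s, interleaver):
    p1 = component_parity(s)                        # parity from s
    p2 = component_parity([s[i] for i in interleaver])  # parity from interleaved s
    return s, p1, p2   # transmitted; received as r, q1, q2

bits = [1, 0, 1, 1, 0, 0, 1, 0]
pi = list(range(len(bits)))
random.Random(0).shuffle(pi)       # fixed pseudo-random interleaver
s, p1, p2 = turbo_encode(bits, pi)
print(s, p1, p2)
```

The systematic stream is sent unchanged; only the two parity streams carry redundancy, giving the rate-1/3 output shown in the diagram.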
Generic Turbo Decoder
[Diagram: two component decoders exchange information through an interleaver]
Decoder Design Options
• Multiple algorithms used to decode
• Maximum A-Posteriori (MAP)
– Most accurate estimate possible
– Complex computations required
• Soft-Output Viterbi Algorithm (SOVA)
– Less accurate
– Simpler calculations
FPGA Design Options
• Goal: make an adaptive decoder
[Diagram: the decoder takes the received data and parity and outputs the original sequence; a tunable parameter trades low power and lower accuracy against high power and higher accuracy]
Component Encoder
• M blocks are 1-bit registers
• Memory provides encoder state
[Diagram: generator function fed by two M registers; state-transition table showing encoder states 00, 01, 10, 11 over time with generator-function outputs 0/1 for each input]
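A minimal sketch of such a component encoder, assuming two 1-bit registers and a hypothetical generator function (XOR of the input with both memory bits), which can regenerate a state table like the one on this slide:

```python
def generator_function(bit, state):
    # Hypothetical GF: XOR of the input bit with both 1-bit registers.
    return bit ^ state[0] ^ state[1]

def next_state(bit, state):
    # Shift the input bit into the M blocks (1-bit registers).
    return (bit, state[0])

# Enumerate the state-transition table: output and next state
# for every (state, input) pair.
for s0 in (0, 1):
    for s1 in (0, 1):
        for bit in (0, 1):
            st = (s0, s1)
            out = generator_function(bit, st)
            nxt = "".join(map(str, next_state(bit, st)))
            print(f"state {s0}{s1}, in {bit} -> out {out}, next {nxt}")
```

With 2 bits of memory there are four states, and each input bit moves the encoder along one of two branches out of the current state.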
Viterbi’s Algorithm
• Determine the most likely output
• Simulate encoder state given received values
[Trellis diagram over time: states s0, s1, s2; received values r0 p0, r1 p1, r2 p2; decoded data d0, d1, d2, …]
Viterbi’s Algorithm
• Write: compute branch metric (likelihood)
• Traceback: compute path metric, output data
• Update: compute distance between paths
• Rank paths by path metric and choose the best
• For N bits of memory:
– Must calculate 2^(N−1) paths for each state
Adaptive SOVA
• SOVA: the inflexible path system scales poorly
• Adaptive SOVA: a heuristic
– Limit to at most M paths
– Discard a path if its metric falls below threshold T
– When too many paths remain, discard all but the top M
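The pruning heuristic can be sketched as follows (the path representation and the values of M and T are illustrative):

```python
def prune_paths(paths, M=4, T=-10.0):
    """Adaptive SOVA pruning heuristic:
    1. discard any path whose metric falls below threshold T;
    2. if more than M paths survive, keep only the top M by metric."""
    survivors = [p for p in paths if p["metric"] >= T]
    survivors.sort(key=lambda p: p["metric"], reverse=True)
    return survivors[:M]

# Five candidate paths; one falls below the threshold.
paths = [{"bits": [1, 0], "metric": m} for m in (-3.2, -12.5, -1.1, -7.8, -4.4)]
kept = prune_paths(paths)
print([p["metric"] for p in kept])  # -> [-1.1, -3.2, -4.4, -7.8]
```

Capping the path count bounds both memory and the add-compare-select work per trellis step, which is what makes the decoder's power tunable.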
Implementing in Hardware
[Block diagram: inputs q and r feed a Branch Metric Unit, then an Add-Compare-Select unit, Survivor Memory, and a Control block]
Implementing in Hardware
• Controller
– Controls memory
– Selects paths
• Branch Metric Unit
– Computes likelihood
– Considers all possible “next” states
• Add, Compare, Select
– Appends to the path metric
– Discards paths
• Survivor Memory
– Stores / discards path bits
Implementing in Hardware
Add, Compare, Select Unit
[Block diagram: present-state path values and branch values are computed and compared to produce next-state path values; each path distance is checked against threshold T to discard paths]
Dynamic Reconfiguration
• Bit Error Rate (BER)
– Changes with signal strength
– Changes with the number of paths used
• Change hardware at runtime
– Weak signal: use many paths, preserve accuracy
– Strong signal: use few paths, save power
– Sample the SNR every 250k bits, then reconfigure
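A sketch of the reconfiguration policy; the SNR breakpoints and path budgets here are hypothetical, and only the sample-every-250k-bits idea comes from the slide:

```python
def choose_paths(snr_db, table=((2.0, 16), (6.0, 8), (10.0, 4))):
    """Pick a path budget from the measured SNR: weak signal -> many
    paths (accuracy), strong signal -> few paths (power). Breakpoints
    and counts are illustrative, not the paper's."""
    for threshold, n_paths in table:
        if snr_db < threshold:
            return n_paths
    return 2   # very strong signal: minimal decoder

BLOCK = 250_000   # bits between SNR samples
current = None
for snr in (1.5, 4.0, 11.0):      # hypothetical SNR measurements
    budget = choose_paths(snr)
    if budget != current:
        current = budget          # trigger FPGA reconfiguration here
print(current)  # -> 2
```

Reconfiguration only fires when the budget actually changes, so a stable channel costs nothing beyond the periodic SNR sample.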
Experimental Results
K (the number of encoder bits) is proportional to average speed and power
Experimental Results
• FPGA decoding has much higher throughput
• This is due to parallelism
Experimental Results
• ASOVA performs worse than commercial cores
• However, it is much better in other metrics
– Power
– Memory usage
– Complexity
Future Work
• Use the present reconfiguration scheme to add
– Partial reconfiguration
– Dynamic voltage scaling
• Compare to power-efficient software methods
Power-Efficient Implementation of a Turbo Decoder in SDR System
• Turbo coding systems are built on one of three general processor types
– Fixed Point (FXP)
• Cheapest, simplest to implement, fastest
– Floating Point (FLP)
• More precision than fixed point
– Logarithmic Number System (LNS)
• Simplifies complex operations
• Complicates simple add/subtract operations
Logarithmic Number System
• X = {s, x = log_b(|X|)}
– s is the sign bit; the remaining bits hold the number value
• Example
– Let b = 2; then the decimal number 8 is represented as log_2(8) = 3
– Numbers are stored in computer memory in 2's complement form (3 = 00000011, sign bit = 0)
Why use Logarithmic System?
• Greatly simplifies multiplication, division, roots, and exponents
– Multiplication simplifies to addition
• E.g., 8 × 4 = 32; in LNS, 3 + 2 = 5 (2^5 = 32)
– Division simplifies to subtraction
• E.g., 8 / 4 = 2; in LNS, 3 − 2 = 1 (2^1 = 2)
Why use Logarithmic System?
• Roots become right shifts
– E.g., sqrt(16) = 4; in LNS, 4 shifted right = 2 (2^2 = 4)
• Exponents become left shifts
– E.g., 8^2 = 64; in LNS, 3 shifted left = 6 (2^6 = 64)
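These identities are easy to check in a short sketch (positive magnitudes only; the sign bit is omitted):

```python
from math import log2

# LNS sketch: store log2 of the magnitude and operate on the logs.
def to_lns(x):   return log2(x)
def mul(a, b):   return a + b      # multiplication -> addition
def div(a, b):   return a - b      # division -> subtraction
def sqrt_(a):    return a / 2      # square root -> right shift (integer logs)
def square(a):   return a * 2      # squaring -> left shift (integer logs)

assert 2 ** mul(to_lns(8), to_lns(4)) == 32    # 3 + 2 = 5, 2^5 = 32
assert 2 ** div(to_lns(8), to_lns(4)) == 2     # 3 - 2 = 1, 2^1 = 2
assert 2 ** sqrt_(to_lns(16)) == 4             # 4 >> 1 = 2, 2^2 = 4
assert 2 ** square(to_lns(8)) == 64            # 3 << 1 = 6, 2^6 = 64
```

On real LNS hardware the logs are fixed-point words, so the halving and doubling really are single-bit shifts.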
So why not use LNS for all processors?
• Unfortunately, addition and subtraction are greatly complicated in LNS
– Addition: log_b(|X| + |Y|) = x + log_b(1 + b^z)
– Subtraction: log_b(|X| − |Y|) = x + log_b(1 − b^z)
– where z = y − x
• Turbo coding/decoding is computationally intense, but it requires more multiplies, divides, roots, and exponentiations than adds or subtracts
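The addition identity above can be verified numerically for b = 2:

```python
from math import log2

def lns_add(x, y):
    """Addition in the log domain via the identity
    log2(2**x + 2**y) = x + log2(1 + 2**(y - x))."""
    z = y - x
    return x + log2(1 + 2 ** z)

# 8 + 4 = 12: the logs are 3 and 2, so lns_add should return log2(12).
assert abs(lns_add(3, 2) - log2(12)) < 1e-12
print(lns_add(3, 2))  # ≈ 3.585 (= log2(12))
```

The log2(1 + 2^z) correction term is what LNS hardware implements with a lookup table, which is why addition is the expensive operation in this number system.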
Turbo Decoder block diagram
• Each bit decision requires a subtraction, a table lookup, and an addition
Proposed new block diagram
• As the difference between e^a and e^b grows, the error between the value stored in the lookup table and the exact computation becomes negligible
• For this simulation, a difference of > 5 was used
How it works
• For d > 5, the new mux (on the right) ignores the SRAM input and simply adds 0 to the MAX result
• For d > 5, the pre-decoder circuitry disables the SRAM to conserve power
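A sketch of this behavior, using the exact Max* for comparison (the d > 5 threshold is from the slide; everything else is illustrative):

```python
from math import log, exp

def max_star_exact(a, b):
    # Max*(a, b) = ln(e^a + e^b) = max(a, b) + ln(1 + e^-|a-b|)
    return log(exp(a) + exp(b))

def max_star_approx(a, b, d_threshold=5.0):
    """Max* with the table lookup skipped: when |a - b| > 5 the
    correction term ln(1 + e^-|a-b|) is negligible, so the mux adds 0
    to max(a, b) and the SRAM can stay disabled (saving read power)."""
    m, d = max(a, b), abs(a - b)
    if d > d_threshold:
        return m                      # SRAM disabled, add 0
    return m + log(1 + exp(-d))       # SRAM lookup path

assert abs(max_star_approx(2.0, 1.0) - max_star_exact(2.0, 1.0)) < 1e-12
assert abs(max_star_approx(10.0, 1.0) - max_star_exact(10.0, 1.0)) < 2e-4
```

At d = 5 the dropped correction is ln(1 + e^-5) ≈ 0.0067, small next to typical path-metric magnitudes, which is why the approximation barely moves the BER.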
Comparing the 3 simulations
• Comparisons were made between a 16-bit fixed-point microcontroller, a 16-bit floating-point processor, and a 20-bit LNS processor
• 11 bits would suffice for FXP and FLP, but 16-bit processors are much more common
• Similarly, 17 bits would suffice for the LNS processor, but 20-bit is the common width
Power Consumption
Latency
• Recall: Max*(a,b) = ln(e^a+e^b)
Power savings
• The pre-decoder circuitry adds 11.4% power consumption relative to an SRAM read
• So when an SRAM read is required, the modified system uses 111.4% of the unmodified system's power
• But when the SRAM is blocked, it uses only 11.4% of the power used before
Power savings
• The CACTI simulations reported that the Max* operation accounted for 40% of all operations in the decoder
• The Max* operations in the modified system required 69% less power than in the unmodified system
• This leads to an overall power savings of 69% × 40% = 27.6%
Conclusion
• Turbo codes are computationally intense, requiring more complex operations than simple ones
• LNS processors simplify complex operations at the expense of making adding and subtracting more difficult
Conclusion
• Using an LNS processor with slight modifications can reduce power consumption by 27.6%
• Overall latency is also reduced, because the complex operations are easy on an LNS processor compared to FXP or FLP processors