Markov, Shannon, and Turbo Codes: The Benefits of Hindsight


Markov, Shannon, and Turbo Codes: The Benefits of Hindsight

Professor Stephen B. Wicker

School of Electrical Engineering

Cornell University

Ithaca, NY 14853

Introduction

Theme 1: Digital Communications, Shannon and Error Control Coding

Theme 2: Markov and the Statistical Analysis of Systems with Memory

Synthesis: Turbo Error Control: Parallel Concatenated Encoding and Iterative Decoding

Digital Telecommunication

The classical design problem: transmitter power vs. bit error rate (BER)

Complications:
– Physical distance
– Co-channel and adjacent channel interference
– Nonlinear channels

Shannon and Information Theory

Noisy Channel Coding Theorem (1948):
– Every channel has a capacity C.
– If we transmit at a data rate less than capacity, there exists an error control code that provides arbitrarily low BER.

For an AWGN channel:

$C = W \log_2\!\left(1 + \dfrac{E_s}{N_0}\right)$ bits per second
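As a quick numerical illustration of the formula (a minimal sketch; the bandwidth and SNR values below are arbitrary examples, not taken from the talk):

```python
import math

def awgn_capacity(bandwidth_hz: float, es_over_n0: float) -> float:
    """Shannon capacity of an AWGN channel, C = W * log2(1 + Es/N0), in bits per second."""
    return bandwidth_hz * math.log2(1.0 + es_over_n0)

# Example: 1 MHz of bandwidth at an SNR of 3 dB (Es/N0 = 10**(3/10) ~ 2).
snr_db = 3.0
print(awgn_capacity(1e6, 10 ** (snr_db / 10)))  # ~1.58e6 bits/s
```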

Coding Gain

Coding Gain: $P_{\mathrm{UNCODED}} - P_{\mathrm{CODED}}$

– The difference in power required by the uncoded and coded systems to obtain a given BER.

NCCT: Almost 10 dB possible on an AWGN channel with binary signaling.

1993: NASA/ESA Deep Space Standard provides 7.7 dB.
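A minimal sketch of how coding gain can be read off a pair of BER curves: the uncoded curve below is the standard theoretical BPSK expression, while the "coded" curve is a hypothetical placeholder, not the DSN standard's actual performance:

```python
import math

def ber_uncoded_bpsk(eb_n0_db: float) -> float:
    """Theoretical BER of uncoded BPSK on an AWGN channel: 0.5 * erfc(sqrt(Eb/N0))."""
    return 0.5 * math.erfc(math.sqrt(10 ** (eb_n0_db / 10)))

def ber_coded_hypothetical(eb_n0_db: float) -> float:
    """Stand-in for a measured coded BER curve (purely illustrative)."""
    return ber_uncoded_bpsk(eb_n0_db + 5.0)  # pretend the code buys 5 dB everywhere

def required_eb_n0(ber_curve, target_ber: float) -> float:
    """Bisect for the Eb/N0 (in dB) at which the BER curve reaches the target BER."""
    lo, hi = -10.0, 30.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if ber_curve(mid) > target_ber:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

target = 1e-5
gain = required_eb_n0(ber_uncoded_bpsk, target) - required_eb_n0(ber_coded_hypothetical, target)
print(f"coding gain at BER {target}: {gain:.2f} dB")  # ~5 dB for this placeholder curve
```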

Classical Error Control Coding

MAP Sequence Decoding Problem:
– Find X that maximizes p(X|Y).
– Derive estimate of U from estimate of X.
– General problem is NP-Hard - related to many optimization problems.
– Polynomial time solutions exist for special cases.
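A minimal sketch of the general, exponential-time approach; the toy repetition code, Gaussian channel likelihood, and equally likely information words (so MAP reduces to maximizing p(Y|X)) are illustrative assumptions, not from the slides:

```python
import itertools
import math

def encode_repetition(u, n_rep=3):
    """Toy encoder: repeat each information bit n_rep times (rate 1/3)."""
    return [bit for bit in u for _ in range(n_rep)]

def log_likelihood(y, x, sigma=1.0):
    """log p(Y|X) for BPSK (bit 0 -> +1, bit 1 -> -1) over an AWGN channel."""
    return sum(-(yi - (1 - 2 * xi)) ** 2 / (2 * sigma ** 2) for yi, xi in zip(y, x))

def map_decode_brute_force(y, k):
    """Exhaustive MAP sequence decoding: score all 2^k information words."""
    best_u, best_metric = None, -math.inf
    for u in itertools.product([0, 1], repeat=k):
        metric = log_likelihood(y, encode_repetition(u))
        if metric > best_metric:
            best_u, best_metric = u, metric
    return best_u

# Noisy observation of U = (1, 0), transmitted as (-1, -1, -1, +1, +1, +1):
print(map_decode_brute_force([-0.9, -1.2, 0.3, 1.1, 0.8, -0.2], k=2))  # (1, 0)
```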

[Block diagram: U = (u_1, ..., u_k) → Encoder → X → Noisy Channel → Y]

Class P Decoding Techniques

Hard decision: MAP decoding reduces to minimum distance decoding.
– Example: Berlekamp algorithm (RS codes)

Soft decision: Received signals are quantized.
– Example: Viterbi algorithm (convolutional codes)

These techniques do NOT minimize information error rate.

Binary Convolutional Codes

Memory is incorporated into encoder in an obvious way.

Resulting code can be analyzed using state diagram.
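A minimal sketch of such an encoder, using the common rate-1/2, memory-2 generators (7, 5) in octal; the specific generators are my choice, not stated in the slides:

```python
def conv_encode(bits, g1=0b111, g2=0b101):
    """Rate-1/2 feedforward convolutional encoder with generators g1, g2 (octal 7, 5).
    The two-bit shift register is the encoder's memory; its contents are the state."""
    state = 0  # the two previous input bits
    out = []
    for b in bits:
        reg = (b << 2) | state                     # current bit plus the memory
        out.append(bin(reg & g1).count("1") % 2)   # parity from generator 1
        out.append(bin(reg & g2).count("1") % 2)   # parity from generator 2
        state = reg >> 1                           # shift: newest bit enters the memory
    return out

print(conv_encode([1, 0, 1, 1]))  # [1, 1, 1, 0, 0, 0, 0, 1]
```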

Trellis for a Convolutional Code

Trees and Sequential Decoding

Convolutional code can be depicted as a tree.

Tree and metric define a metric space.

Sequential decoding is a local search of a metric space.
– Search complexity is a polynomial function of memory order.
– May not terminate in a finite amount of time.
– Local search methodology to return...

Theme 2: Markov and Memory

Markov was, among many other things, a cryptanalyst.
– Interested in the structure of written text.
– Certain letters can only be followed by certain others.

Markov Chains:

– Let I be a countable set of states and let $\lambda$ be a probability measure on I.

– Let the random variable S range over I and set $\lambda_i = p(S = i)$.

– Let $P = \{p_{ij}\}$ be a stochastic matrix with rows and columns indexed by I.

– $S = (S_n)_{n \ge 0}$ is a Markov chain with initial distribution $\lambda$ and transition matrix P if

- $S_0$ has distribution $\lambda$

- $p(S_{n+1} = j \mid S_0, S_1, S_2, \ldots, S_{n-1}, S_n = i) = p(S_{n+1} = j \mid S_n = i) = p_{ij}$
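A minimal sketch of simulating such a chain; the two-state transition matrix is an arbitrary example, not from the talk:

```python
import random

def simulate_markov_chain(init_dist, P, n_steps, seed=0):
    """Draw S_0 from the initial distribution lambda, then step using the rows of P."""
    rng = random.Random(seed)
    states = range(len(init_dist))
    s = rng.choices(states, weights=init_dist)[0]
    path = [s]
    for _ in range(n_steps):
        s = rng.choices(states, weights=P[s])[0]  # next state depends only on the current state
        path.append(s)
    return path

# Two-state example: state 0 tends to stay put, state 1 tends to leave.
P = [[0.9, 0.1],
     [0.6, 0.4]]
print(simulate_markov_chain([0.5, 0.5], P, n_steps=10))
```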

Hidden Markov Models

HMM:

– Markov chain X = X1, X2, …

– Sequence of r.v.’s Y = Y1, Y2, … that are a probabilistic function f() of X.

Inference Problem: Observe Y and infer:
– Initial state of X
– State transition probabilities for X
– Probabilistic function f()
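A minimal sketch of the generative side of an HMM, i.e. drawing a hidden chain X and observations Y; the emission table stands in for the probabilistic function f() and is arbitrary:

```python
import random

def sample_hmm(init_dist, P, emit, n_steps, seed=1):
    """Generate (hidden states X, observations Y): X is a Markov chain,
    and each Y_n is drawn from a distribution that depends only on X_n."""
    rng = random.Random(seed)
    states = range(len(init_dist))
    symbols = range(len(emit[0]))
    x = rng.choices(states, weights=init_dist)[0]
    xs, ys = [], []
    for _ in range(n_steps):
        xs.append(x)
        ys.append(rng.choices(symbols, weights=emit[x])[0])
        x = rng.choices(states, weights=P[x])[0]
    return xs, ys

P = [[0.8, 0.2], [0.3, 0.7]]        # hidden-state transitions
emit = [[0.9, 0.1], [0.2, 0.8]]     # p(Y | X): state 0 mostly emits 0, state 1 mostly emits 1
xs, ys = sample_hmm([0.5, 0.5], P, emit, n_steps=8)
print(xs)  # hidden path (normally unobserved)
print(ys)  # observed sequence handed to the inference problem
```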

Hidden Markov Models are Everywhere...

– Duration of eruptions by Old Faithful
– Movement of locusts (Locusta migratoria)
– Suicide rate in Cape Town, SA
– Progress of epidemics
– Econometric models
– Decoding of convolutional codes

Baum-Welch Algorithm

Lloyd Welch and Leonard Baum developed an iterative solution to the HMM inference problem (~1962).

The application-specific solution was classified for many years.

Published in general form:
– L. E. Baum and T. Petrie, "Statistical Inference for Probabilistic Functions of Finite State Markov Chains," Ann. Math. Stat., 37:1554-1563, 1966.

BW Overview

Member of the class of algorithms now known as “Expectation-Maximization”, or “EM” algorithms.

– Initial hypothesis $\theta_0$

– Series of estimates generated by the mapping $\theta_i = T(\theta_{i-1})$

– $P(\theta_0) \le P(\theta_1) \le P(\theta_2) \le \ldots$, where $\lim_{i \to \infty} \theta_i$ is the maximum likelihood parameter estimate.

Forward - Backward Algorithm: Exploiting the Markov Property

Goal: Derive the probability measure $p(x_j, y)$.

The BW algorithm recursively computes the $\alpha$'s and $\beta$'s:

$$
\begin{aligned}
p(x_j, y) &= p(x_j, y_j^-) \cdot p(y_j \mid x_j, y_j^-) \cdot p(y_j^+ \mid x_j, y_j, y_j^-) \\
          &= p(x_j, y_j^-) \cdot p(y_j \mid x_j) \cdot p(y_j^+ \mid x_j) \\
          &= \underbrace{\alpha(x_j)}_{\text{past}} \cdot \underbrace{\gamma(x_j)}_{\text{present}} \cdot \underbrace{\beta(x_j)}_{\text{future}}
\end{aligned}
$$

Forward and Backward Flow

Define flow$(x_i, x_j)$ to be the probability that a random walk starting at $x_i$ will terminate at $x_j$.

$\alpha(x_j)$ is the forward flow to $x_j$ at time j.

$\beta(x_j)$ is the backward flow to $x_j$ at time j.

$$
\alpha(x_j) = p(x_j, y_j^-) = \sum_{x_{j-1} \in X_{j-1}} \alpha(x_{j-1})\, Q(x_j \mid x_{j-1})\, \gamma(x_{j-1})
$$

$$
\beta(x_j) = p(y_j^+ \mid x_j) = \sum_{x_{j+1} \in X_{j+1}} Q(x_{j+1} \mid x_j)\, \gamma(x_{j+1})\, \beta(x_{j+1})
$$
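A minimal sketch of the two recursions for a discrete HMM, using the slide's convention that α excludes the current observation, γ supplies the "present" term, and β covers the future; the unnormalized form below will underflow on long sequences and is purely illustrative:

```python
def forward_backward(init_dist, Q, emit, ys):
    """alpha(x_j) = p(x_j, y_j^-): forward flow (observations strictly before j);
    beta(x_j)  = p(y_j^+ | x_j):   backward flow (observations strictly after j);
    gamma(x_j) = p(y_j | x_j):     the 'present' term."""
    n_states, N = len(init_dist), len(ys)
    gamma = [[emit[x][ys[j]] for x in range(n_states)] for j in range(N)]

    alpha = [[0.0] * n_states for _ in range(N)]
    alpha[0] = list(init_dist)                     # no observations precede j = 0
    for j in range(1, N):                          # forward recursion (slide's alpha update)
        for x in range(n_states):
            alpha[j][x] = sum(alpha[j-1][xp] * Q[xp][x] * gamma[j-1][xp]
                              for xp in range(n_states))

    beta = [[0.0] * n_states for _ in range(N)]
    beta[N-1] = [1.0] * n_states                   # no observations follow j = N-1
    for j in range(N - 2, -1, -1):                 # backward recursion (slide's beta update)
        for x in range(n_states):
            beta[j][x] = sum(Q[x][xn] * gamma[j+1][xn] * beta[j+1][xn]
                             for xn in range(n_states))

    # p(x_j, y) = alpha * gamma * beta; summing over x_j gives p(y) at every j.
    return alpha, gamma, beta

Q = [[0.8, 0.2], [0.3, 0.7]]
emit = [[0.9, 0.1], [0.2, 0.8]]
a, g, b = forward_backward([0.5, 0.5], Q, emit, ys=[0, 0, 1, 0])
for j in range(4):
    print(sum(a[j][x] * g[j][x] * b[j][x] for x in range(2)))  # identical values: p(y)
```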

Earliest Reference to Backward-Forward Algorithm

Several of the woodsmen began to move slowly toward her and observing them closely, the little girl saw that they were turned backward, but really walking forward. “We have to go backward forward!” cried Dorothy. “Hurry up, before they catch us.”

– Ruth Plumly Thompson, The Lost King of Oz, pg. 120, The Reilly & Lee Co., 1925.

Generalization: Belief Propagation in Polytrees

Judea Pearl (1988)

Each node in a polytree separates the graph into two distinct subgraphs.

X D-separates upper and lower variables, implying conditional independence.

Spatial Recursion and Message Passing

Synthesis: BCJR

1974: Bahl, Cocke, Jelinek, and Raviv apply a portion of the BW algorithm to trellis decoding for convolutional and block codes.
– Forward and backward trellis flow: APP that a given branch is traversed.
– Info bit APP: sum of probabilities for branches associated with a particular bit value.
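A minimal sketch of the idea on a toy two-state accumulator code rather than a real convolutional code (the code, channel model, and noise values are illustrative assumptions): forward and backward flows are computed over the trellis, and each information-bit APP is the sum of α·γ·β over the branches carrying that bit value.

```python
import math

# Toy 2-state accumulator code: next_state = state ^ u, coded output bit = next_state.
# BPSK over AWGN: bit 0 -> +1, bit 1 -> -1.

def branch_prob(y, out_bit, sigma=1.0):
    """gamma: likelihood of the channel output y given the branch's coded bit."""
    s = 1.0 - 2.0 * out_bit
    return math.exp(-(y - s) ** 2 / (2 * sigma ** 2))

def bcjr_bit_app(ys, sigma=1.0):
    """Return p(u_j = 1 | y) for each j by summing alpha * gamma * beta over branches."""
    N, n_states = len(ys), 2
    # Branches: (state, input u, next state, coded output bit).
    branches = [(s, u, s ^ u, s ^ u) for s in range(n_states) for u in (0, 1)]

    alpha = [[0.0] * n_states for _ in range(N + 1)]
    alpha[0][0] = 1.0                                  # encoder starts in state 0
    for j in range(N):                                 # forward flow
        for s, u, s2, x in branches:
            alpha[j + 1][s2] += alpha[j][s] * 0.5 * branch_prob(ys[j], x, sigma)

    beta = [[0.0] * n_states for _ in range(N + 1)]
    beta[N] = [1.0] * n_states                         # unterminated: all end states allowed
    for j in range(N - 1, -1, -1):                     # backward flow
        for s, u, s2, x in branches:
            beta[j][s] += beta[j + 1][s2] * 0.5 * branch_prob(ys[j], x, sigma)

    apps = []
    for j in range(N):                                 # info-bit APP: sum over branches per u
        p = {0: 0.0, 1: 0.0}
        for s, u, s2, x in branches:
            p[u] += alpha[j][s] * 0.5 * branch_prob(ys[j], x, sigma) * beta[j + 1][s2]
        apps.append(p[1] / (p[0] + p[1]))
    return apps

# Transmitted u = (1, 1, 0) -> coded bits (1, 0, 0) -> BPSK (-1, +1, +1), observed noisily:
print([round(p, 3) for p in bcjr_bit_app([-0.8, 0.7, 1.1])])  # APPs favor u = (1, 1, 0)
```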

BW/BCJR

[Trellis figure: forward, present, and backward terms α(u_j), γ(u_j), β(u_j) combined to form each information-bit APP]

Synthesis Crescendo: Turbo Coding

May 25, 1993: C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon-Limit Error Correction Coding: Turbo Codes."

Two Key Elements:
– Parallel Concatenated Encoders
– Iterative Decoding

Parallel Concatenated Encoders

One “systematic” and two parity streams are generated from the information.

Recursive (IIR) convolutional encoders are used as “component” encoders.
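A minimal sketch of a recursive systematic (IIR) component encoder; the particular feedback and feedforward polynomials, (1, 5/7) in octal, are my choice rather than the ones used in the talk:

```python
def rsc_encode(bits):
    """Recursive systematic convolutional encoder, generators (1, 5/7) in octal:
    feedback 1 + D + D^2, feedforward 1 + D^2. Returns (systematic, parity) streams."""
    m1 = m2 = 0                      # shift-register (memory) contents
    systematic, parity = [], []
    for u in bits:
        a = u ^ m1 ^ m2              # IIR feedback term
        systematic.append(u)         # systematic stream: the information bit itself
        parity.append(a ^ m2)        # feedforward taps 1 + D^2
        m2, m1 = m1, a               # shift the register
    return systematic, parity

print(rsc_encode([1, 0, 1, 1]))      # ([1, 0, 1, 1], [1, 1, 0, 0])
```

In the parallel concatenation, the same information drives one such encoder directly and the other through the interleaver; the systematic stream is transmitted only once, alongside the two parity streams.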

Recursive Binary Convolutional Encoders

Impact of the Interleaver

Only a small number of low-weight input sequences are mapped to low-weight output sequences.

The interleaver ensures that if the output of one component encoder has low weight, the output of the other probably will not.

PCC emphasis: minimize number of low weight code words, as opposed to maximizing the minimum weight.
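A minimal sketch of a fixed pseudorandom interleaver of the kind placed between the component encoders (the seed and block length are illustrative):

```python
import random

def make_interleaver(k, seed=42):
    """Return a fixed pseudorandom permutation of length k and its inverse."""
    perm = list(range(k))
    random.Random(seed).shuffle(perm)
    inverse = [0] * k
    for i, p in enumerate(perm):
        inverse[p] = i
    return perm, inverse

def interleave(bits, perm):
    """Reorder the information so the two component encoders see different sequences."""
    return [bits[p] for p in perm]

perm, inv = make_interleaver(8)
u = [1, 1, 0, 0, 0, 0, 0, 0]                       # a low-weight input
print(interleave(u, perm))                         # same weight, but in permuted positions
print(interleave(interleave(u, perm), inv) == u)   # the inverse permutation restores order
```

The permutation leaves the input weight unchanged; the point is that an input driving one recursive encoder to a low-weight parity sequence will, after permutation, rarely do the same to the other.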

The PCE Decoding Problem

[Block diagram: the information word U = (u_1, ..., u_k) drives the CC1 encoder directly, producing X^1, and drives the CC2 encoder through the interleaver, producing X^2; U, X^1, and X^2 pass through the channel to give the received streams Y^s = (y^s_1, ..., y^s_k), Y^1 = (y^1_1, ..., y^1_k), and Y^2 = (y^2_1, ..., y^2_k).]

$$
\mathrm{BEL}_i(a) = p(u_i = a \mid y)
= \underbrace{\lambda_i(a)}_{\text{systematic term}}
\cdot \underbrace{\pi_i(a)}_{\text{a priori term}}
\cdot \underbrace{\sum_{u : u_i = a} p(y^1 \mid x^1)\, p(y^2 \mid x^2) \prod_{\substack{j=1 \\ j \ne i}}^{k} \lambda_j(u_j)\, \pi_j(u_j)}_{\text{extrinsic term}}
$$

Turbo Decoding

BW/BCJR decoders are associated with each component encoder.

Decoders take turns estimating and exchanging distributions on the information bits.

Alternating Estimates of Information APP

Decoder 1: BW/BCJR derives

$$
\mathrm{BEL}_i^1(a)
= \underbrace{\alpha\,\lambda_i(a)}_{\text{systematic term}}
\cdot \underbrace{\pi_i^2(a)}_{\text{updated term}}
\cdot \underbrace{\sum_{u : u_i = a} p(y^1 \mid x^1) \prod_{\substack{j=1 \\ j \ne i}}^{k} \lambda_j(u_j)\, \pi_j(u_j)}_{\text{extrinsic term}}
$$

Decoder 2: BW/BCJR derives

$$
\mathrm{BEL}_i^2(a)
= \underbrace{\alpha\,\lambda_i(a)}_{\text{systematic term}}
\cdot \underbrace{\pi_i^1(a)}_{\text{updated term}}
\cdot \underbrace{\sum_{u : u_i = a} p(y^2 \mid x^2) \prod_{\substack{j=1 \\ j \ne i}}^{k} \lambda_j(u_j)\, \pi_j(u_j)}_{\text{extrinsic term}}
$$

Converging Estimates

Information exchanged by the decoders must not be strongly correlated with systematic info or earlier exchanges.

$$
\pi_i^{(m)}(a) =
\begin{cases}
\dfrac{\alpha \Pr\{u_i = a \mid Y^s = y^s, Y^1 = y^1\}}{\lambda_i(a)\, \pi_i^{(m-1)}(a)} & \text{if } m \text{ is odd} \\[2.5ex]
\dfrac{\alpha \Pr\{u_i = a \mid Y^s = y^s, Y^2 = y^2\}}{\lambda_i(a)\, \pi_i^{(m-1)}(a)} & \text{if } m \text{ is even}
\end{cases}
$$
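A minimal sketch of this exchange at the probability level. The "component decoders" below are stand-ins that simply multiply the systematic term, a fixed per-bit parity opinion, and the current a priori, so the sketch shows only the data flow of the update above, not real BW/BCJR decoding; all names and numbers are illustrative.

```python
def extrinsic_update(app, systematic, prior_old):
    """pi^(m) for one bit: divide the decoder's APP by the systematic likelihood and
    the previous a priori term (as in the update above), then renormalize over a."""
    raw = [app[a] / (systematic[a] * prior_old[a]) for a in (0, 1)]
    total = sum(raw)
    return [r / total for r in raw]              # the alpha normalization constant

def turbo_exchange(systematic, decoder1, decoder2, n_iters=4):
    """Alternate the two component decoders; each receives the other's extrinsic
    output as its new a priori distribution pi."""
    k = len(systematic)
    prior = [[0.5, 0.5] for _ in range(k)]       # start from a uniform a priori
    for m in range(1, n_iters + 1):
        decode = decoder1 if m % 2 == 1 else decoder2          # odd iterations: decoder 1
        apps = decode(prior)                                   # per-bit APPs, BEL_i(a)
        prior = [extrinsic_update(apps[i], systematic[i], prior[i]) for i in range(k)]
    return apps

# Stand-in component "decoders": each multiplies the systematic term, its own parity-stream
# evidence, and the current a priori (a real decoder would run BW/BCJR on its own trellis).
systematic = [[0.9, 0.1], [0.2, 0.8]]            # lambda_i(a) from the systematic stream
parity1 = [[0.8, 0.2], [0.3, 0.7]]               # toy evidence from parity stream 1
parity2 = [[0.7, 0.3], [0.4, 0.6]]               # toy evidence from parity stream 2
dec1 = lambda prior: [[systematic[i][a] * parity1[i][a] * prior[i][a] for a in (0, 1)] for i in range(2)]
dec2 = lambda prior: [[systematic[i][a] * parity2[i][a] * prior[i][a] for a in (0, 1)] for i in range(2)]

print(turbo_exchange(systematic, dec1, dec2))
```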

Impact and Questions

Turbo coding provides coding gain near 10 dB.
– Within 0.3 dB of the Shannon limit.
– NASA/ESA DSN: 1 dB = $80M in 1996.

Issues:
– Sometimes turbo decoding fails to correct all of the errors in the received data. Why?
– Sometimes the component decoders do not converge. Why?
– Why does turbo decoding work at all?

Cross-Entropy Between the Component Decoders

Cross entropy, or the Kullback-Leibler distance, is a measure of the distance between two distributions.

Joachim Hagenauer et al. have suggested using a cross-entropy threshold as a stopping condition for turbo decoders.

$$
D = \sum_{j=1}^{N} \sum_{a=0}^{1} \pi^1(u_j = a \mid Y)\, \log \frac{\pi^1(u_j = a \mid Y)}{\pi^2(u_j = a \mid Y)}
$$
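A minimal sketch of such a stopping rule; the threshold value is arbitrary and purely illustrative:

```python
import math

def cross_entropy(pi1, pi2, eps=1e-12):
    """D between the two decoders' per-bit distributions (the formula above)."""
    return sum(pi1[j][a] * math.log((pi1[j][a] + eps) / (pi2[j][a] + eps))
               for j in range(len(pi1)) for a in (0, 1))

def should_stop(pi1, pi2, threshold=1e-4):
    """Stop iterating once the two component decoders (nearly) agree on every bit."""
    return cross_entropy(pi1, pi2) < threshold

pi1 = [[0.95, 0.05], [0.10, 0.90]]   # decoder 1's current bit distributions
pi2 = [[0.93, 0.07], [0.12, 0.88]]   # decoder 2's current bit distributions
print(cross_entropy(pi1, pi2), should_stop(pi1, pi2))  # small D, but above this threshold
```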

Correlating Decoder Errors with Cross-Entropy

Neural Networks do the Thinking

Neural networks can approximate any piecewise-continuous function.

Goal: Emulation of indicator functions for turbo decoder error and convergence.

Two Experiments:
– FEDN: Predict eventual error and convergence at the beginning of the decoding process.
– DEDN: Detect error and convergence at the end of the decoding process.

Network Performance

Missed detection occurs when the number of errors is small.

The average weight of error events in NN-assisted turbo is far less than that of CRC-assisted turbo decoding.

When coupled with a code combining protocol, NN-assisted turbo is extremely reliable.

What Did the Networks Learn?

Examined weights generated during training.

Network monitors slope of cross entropy (rate of descent).

Conjecture:
– Turbo decoding is a local search algorithm that attempts to minimize cross-entropy cycles.
– Topology of the search space is strongly determined by initial cross entropy.

Exploring the Conjecture

Turbo Simulated Annealing (Buckley, Hagenauer, Krishnamachari, Wicker)
– Nonconvergent turbo decoding is nudged out of local minimum cycles by randomization (heat).

Turbo Genetic Decoding (Krishnamachari, Wicker)
– Multiple processes are started in different places in the search space.

Turbo Coding: A Change in Error Control Methodology

"Classical" response to Shannon:
– Derive probability measure on transmitted sequence, not actual information.
– Explore optimal solutions to special cases of NP-Hard problem.
– Optimal, polynomial time decoding algorithms limit choice of codes.

"Modern": Exploit Markov property to obtain temporal/spatial recursion:
– Derive probability measure on information, not codeword.
– Explore suboptimal solutions to more difficult cases of NP-Hard problem.
– Iterative decoding
– Graph Theoretic Interpretation of Code Space
– Variations on Local Search

The Future

Relation of cross entropy to impact of cycles in belief propagation.

Near-term abandonment of PCE's as unnecessarily restrictive.

Increased emphasis on low density parity check codes and expander codes.
– Decoding algorithms that look like solutions to K-SAT problem.
– Iteration between subgraphs.
– Increased emphasis on decoding as local search.