
Introduction: Predistortion DPD - Implementation...

Date post: 29-Mar-2020

Copyright © 2003 Altera Corporation

Digital Predistortion using Polynomial Approach


Agenda

Introduction

Indirect Learning Architecture

Forward Path

Feedback Path

Block diagram of complete solution

Example Resource Estimate

Numerical Accuracy

Summary

Introduction: Predistortion

[Figure: V_RF versus V_in transfer characteristics for the PA, the predistorter and the overall response]

Ideal PA: V_rf = k V_in (linear response)

Real PA: V_rf = f_nl k V_d (nonlinear response)

Predistorter (DPD): V_d = (1/f_nl) V_in

Overall (DPD followed by PA): V_rf = k V_in (overall linear response)

DPD - Implementation Choices

Two basic approaches

Look-Up-Table (LUT) contains points on the transfer function

Polynomial representation of the transfer function

$$y = c_0 + c_1 x + c_2 x^2 + \dots + c_n x^n$$

Pros-Cons of the two approaches

| Polynomial | LUT |
|---|---|
| High complexity as higher order terms are used | Low complexity |
| Difficult to address different PA types | Same solution for different PAs (as long as error dependencies are same) |
| Requires Stratix | Low cost Cyclone implementation possible |
| Faster convergence time | Slower convergence time |
| Relatively easier to address memory effects & phase dependent errors | Memory effects & phase dependent errors require complex implementation |


Indirect Learning Architecture

The output of the PA is compared with the input of the PA rather than the input of the predistorter

Figure 1: Indirect learning architecture for Polynomial based predistortion

Forward path

Feedback path

Feedback path (predistorter training block A) can work offline

A block of [z(n), y(n)] pairs, i.e. [input to PA, output from PA], can be stored in a look-up table

The predistorter training block derives the inverse response of the PA by performing polynomial curve fitting on the data in the LUT (it produces the polynomial coefficients by solving the least squares problem)

The polynomial coefficients are computed so that the error e(n) = 0, in which case y(n) = x(n)

The coefficients generated in the feedback path are used by the predistorter in the forward path until the next update by the training block


Predistorter Model

Assume the predistorter is modelled by the memory polynomial model [1]

$$z(n) = \sum_{k=0}^{K} \sum_{q=0}^{Q} c_{(2k+1)q}\, x(n-q)\, |x(n-q)|^{2k}$$ [eq 1]

Written out term by term:

$$\begin{aligned}
z(n) ={}& c_{10}\,x(n) + c_{30}\,x(n)\,|x(n)|^{2} + c_{50}\,x(n)\,|x(n)|^{4} + \dots + c_{(2K+1)0}\,x(n)\,|x(n)|^{2K} \\
&+ c_{11}\,x(n-1) + c_{31}\,x(n-1)\,|x(n-1)|^{2} + c_{51}\,x(n-1)\,|x(n-1)|^{4} + \dots + c_{(2K+1)1}\,x(n-1)\,|x(n-1)|^{2K} \\
&+ c_{12}\,x(n-2) + c_{32}\,x(n-2)\,|x(n-2)|^{2} + c_{52}\,x(n-2)\,|x(n-2)|^{4} + \dots + c_{(2K+1)2}\,x(n-2)\,|x(n-2)|^{2K} \\
&+ \dots \\
&+ c_{1Q}\,x(n-Q) + c_{3Q}\,x(n-Q)\,|x(n-Q)|^{2} + c_{5Q}\,x(n-Q)\,|x(n-Q)|^{4} + \dots + c_{(2K+1)Q}\,x(n-Q)\,|x(n-Q)|^{2K}
\end{aligned}$$ [eq 2]

Implement using FIR filters
− Number of FIR Filters = K + 1 [eq 3]
− Each column of the expansion (one power term across all delays) maps to one filter: FIR 1, FIR 2, FIR 3, ..., FIR K+1

[1] L. Ding, G.T. Zhou, D.R. Morgan, Z. Ma, J.S. Kenney, J. Kim and C.R. Giardina, “Memory Polynomial predistorter based on the indirect learning architecture”, IEEE Global Telecommunications Conference, Taipei, Taiwan, Nov 2002
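The double sum in [eq 1] can be evaluated directly in software as a reference model. The following is a minimal Python sketch, not the FPGA implementation: the function names, the `coeffs[k][q]` layout (holding c(2k+1)q) and the choice to treat samples before n = 0 as zero are all assumptions made for illustration.

```python
def filter_inputs(sample, num_orders):
    """Return [x, x|x|^2, x|x|^4, ...] for one complex sample.

    The power terms are built recursively from |x|^2 = I^2 + Q^2.
    """
    mag2 = sample.real ** 2 + sample.imag ** 2
    power = 1.0
    terms = []
    for _ in range(num_orders):
        terms.append(sample * power)
        power *= mag2
    return terms


def memory_polynomial(x, coeffs):
    """Evaluate z(n) = sum_k sum_q coeffs[k][q] * x(n-q) * |x(n-q)|^(2k).

    x      : list of complex input samples
    coeffs : coeffs[k][q] is c_(2k+1)q, so there are K+1 rows of Q+1 taps
             (hypothetical layout chosen for this sketch)
    """
    num_orders = len(coeffs)       # K + 1 FIR filters
    num_taps = len(coeffs[0])      # Q + 1 taps per filter
    basis = [filter_inputs(s, num_orders) for s in x]  # basis[n][k]
    z = []
    for n in range(len(x)):
        acc = 0j
        for k in range(num_orders):
            for q in range(num_taps):
                if n - q >= 0:     # samples before n = 0 assumed zero
                    acc += coeffs[k][q] * basis[n - q][k]
        z.append(acc)
    return z
```

With c10 = 1 and every other coefficient zero the model passes x(n) through unchanged, which is a convenient sanity check before loading trained coefficients.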

Forward Path Block Diagram

[Figure: forward path block diagram. X(n) enters a filter input generation block whose outputs X(n), X(n)|X(n)|^2, ..., X(n)|X(n)|^2K feed FIR filter 1 (filter taps C10, C11, ..., C1Q), FIR filter 2 (taps C30, C31, ..., C3Q), ..., FIR filter K+1 (taps C(2K+1)0, C(2K+1)1, ..., C(2K+1)Q); the output is Z(n)]

− K+1 FIR filters
− Each filter has Q+1 taps
− Each tap performs a complex multiplication

Generating Inputs to FIR Filters

x(n) = I + jQ

$|x(n)|^2 = I^2 + Q^2$ (2 multiplications)

$|x(n)|^4 = |x(n)|^2 \cdot |x(n)|^2$ (1 multiplication)

$|x(n)|^{2k} = |x(n)|^2 \cdot |x(n)|^{2(k-1)}$ (1 multiplication)

[Figure: chain of multipliers (X) and delay elements fed by x(n). Input to FIR 1: x(n). Input to FIR 2: x(n) . |x(n)|^2. Input to FIR 3: x(n) . |x(n)|^4. Input to FIR K+1: x(n) . |x(n)|^2k. The diagram also carries a "2 multiplications" annotation on a multiplier stage.]

DSP Block Architecture

[Figure: DSP block showing input registers, optional pipelining, add/subtract/accumulate (+ - Σ) stages, output MUX and output registers]

High Performance DSP Operation
− 18x18 Functions at 282 MHz

Input, Output & Pipelining registers
− Reduce overall Logic usage

Add/Accumulate/Subtract
− Signed & unsigned operations
− Dynamically change between Add & Subtract

Supports real multiplications (required for generating the power terms that form the inputs to the FIR filters)
− e.g. |(I + jQ)|^2 = I^2 + Q^2

Supports complex multiplications (required in FIR filter taps)
− (Ar + jAi) x (Br + jBi) = (Ar Br – Ai Bi) + j(Ai Br + Ar Bi)
− 4 Multiplications, 1 Addition & 1 Subtraction
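The complex product above maps onto four real multipliers plus one adder and one subtractor. A short Python sketch of that decomposition follows; the function name is hypothetical and ordinary Python numbers stand in for the 18-bit fixed-point operands.

```python
def complex_multiply(ar, ai, br, bi):
    """Multiply (ar + j*ai) by (br + j*bi) using 4 real multiplications."""
    real = ar * br - ai * bi   # 2 multiplications, 1 subtraction
    imag = ai * br + ar * bi   # 2 multiplications, 1 addition
    return real, imag
```

For example, `complex_multiply(3, 4, 1, -2)` gives `(11, -2)`, matching (3 + 4j)(1 - 2j) = 11 - 2j.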


Forward Path Multiplier Requirement

Timeshare multipliers
− Stratix multipliers can run over 240MHz
− IF frequency should be less than 120MHz
− Halve the number of multipliers required by running at twice the speed

Our example with K=2 and Q=2 requires 3 FIR filters, each with 3 taps

Multiplier Requirement

| Forward Path Implementation | No. of 18x18 multipliers | No. of DSP blocks |
|---|---|---|
| Memory Polynomial model with K=2 and Q=2 | 22 | 6 |
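The figures for K=2, Q=2 can be reproduced with a small calculation. The sketch below is one reading of the slides rather than a confirmed derivation. It assumes 4 real 18x18 multipliers per complex tap, halves the FIR multipliers by running them at twice the sample rate, adds the 4 multipliers quoted elsewhere in the deck for generating the FIR inputs, and packs 4 multipliers into each DSP block. The function name and parameters are hypothetical.

```python
import math


def forward_path_multipliers(K, Q, input_gen_multipliers, timeshare_factor=2):
    """Estimate 18x18 multipliers and DSP blocks for the forward path.

    Assumptions (not confirmed by the source): 4 real multipliers per
    complex tap, FIR multipliers shared 'timeshare_factor' ways, and
    4 multipliers per DSP block.
    """
    taps = (K + 1) * (Q + 1)                    # K+1 filters, Q+1 taps each
    fir_multipliers = math.ceil(4 * taps / timeshare_factor)
    total = fir_multipliers + input_gen_multipliers
    dsp_blocks = math.ceil(total / 4)
    return total, dsp_blocks
```

Calling `forward_path_multipliers(2, 2, 4)` returns `(22, 6)`, which agrees with the table row above; other configurations may be budgeted differently in the original estimates.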


Feedback Path Overview

Coefficients Update Algorithms

QRD-RLS Algorithm

Using CORDIC for QR Decomposition

Using Nios for Back Substitution

Coefficients Update Algorithms

Least mean squares (LMS), Normalized LMS (NLMS)

Recursive least squares (RLS)

Pros:
− RLS converges faster than LMS and NLMS

Cons:
− Numerical stability issues
− Computationally expensive
− Operates on correlation matrix of input data

QR Decomposition based RLS (QRD-RLS) is more robust and well suited for hardware implementation

QR Decomposition (QRD)

Least Squares problem: solve for c where

$$Yc = z$$ [eq 4]

Y = M x N (M>N), c = N x 1, z = M x 1

Decompose Y into QR, where Q is a unitary matrix ($QQ^T = I$, i.e. $Q^T = Q^{-1}$) and R is an upper triangular matrix

Then,

$$Rc = Q^T z$$ [eq 5]

or

$$Rc = z'$$ [eq 6] where $z' = Q^T z$

R = N x N, c = N x 1, z’ = N x 1
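As a software illustration of how R and z' can be built up one data row at a time, the sketch below uses Givens rotations and then solves Rc = z' by back substitution. It is a simplified, real-valued model with no forgetting factor, so it performs plain QRD least squares rather than a full QRD-RLS on complex data, and it assumes Y has full column rank. All function names are hypothetical.

```python
import math


def qrd_update(R, z_prime, row, target):
    """Fold one row of Y and its entry of z into the system R c = z'."""
    row = list(row)
    n = len(R)
    for i in range(n):
        a, b = R[i][i], row[i]
        r = math.hypot(a, b)
        if r == 0.0:
            continue
        cs, sn = a / r, b / r             # rotation that zeroes row[i]
        for j in range(i, n):             # apply the rotation along the row
            rij, xj = R[i][j], row[j]
            R[i][j] = cs * rij + sn * xj
            row[j] = cs * xj - sn * rij
        zi = z_prime[i]                   # rotate the right-hand side too
        z_prime[i] = cs * zi + sn * target
        target = cs * target - sn * zi


def solve_least_squares(Y, z):
    """Return c minimising ||Y c - z|| (assumes full column rank)."""
    n = len(Y[0])
    R = [[0.0] * n for _ in range(n)]
    z_prime = [0.0] * n
    for row, target in zip(Y, z):
        qrd_update(R, z_prime, row, target)
    c = [0.0] * n
    for i in range(n - 1, -1, -1):        # back substitution on R c = z'
        acc = sum(R[i][j] * c[j] for j in range(i + 1, n))
        c[i] = (z_prime[i] - acc) / R[i][i]
    return c
```

Fitting the line z = 1 + 2t with rows [1, t] for t = 0..3 returns coefficients close to [1, 2]. Processing rows one by one mirrors the way samples stream into a triangular array, which is the reason this form was chosen over a batch factorisation.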

Back Substitution to solve for c

Since R is an upper triangular matrix, c can be solved from [eq 6] via back substitution

$$c_N = \frac{z'_N}{R_{NN}}$$ [eq 7]

$$c_i = \frac{1}{R_{ii}} \left( z'_i - \sum_{j=i+1}^{N} R_{ij}\, c_j \right) \quad \text{for } i = N-1, \dots, 1$$ [eq 8]

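Resource Mapping for QRD-RLS

[Figure: resource mapping. x(n) passes through the predistorter (DSP Blocks) to give z(n), which drives the PA to give y(n). z(n) and y(n) are captured in a LUT and supplied as Y, z to the QRD-RLS block, where CORDIC performs the QR decomposition and passes R, z’ to Nios, which performs back substitution and returns c.]

CORDIC for QR Decomposition

Nios for Back Substitution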

Using CORDIC for QRD: Overview

What is CORDIC?

Systolic Array Architecture

Direct Mapping Implementation

Mixed Mapping Implementation

Discrete Mapping Implementation

Summary of Implementing Systolic Array

Generating inputs to Systolic Array


What is CORDIC?

Hardware efficient algorithm for computing functions such as:
− Trigonometric, Hyperbolic, Logarithmic

Iterative solution that uses only shifts and adding/subtracting
− High performance / low complexity as no multiplications and divisions

Altera’s CORDIC
− operates in both Vectorise and Rotate Modes
− pipelined to allow new inputs to be applied on every clk cycle
− MATLAB GUI containing a bit accurate model of the RTL for CORDIC and a test environment for generating test data, viewing results and parameterising the RTL

[Figure: CORDIC block with inputs X_in, Y_in, Z_in and mode, and outputs X_out, Y_out, Z_out]
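The two modes can be illustrated with a floating-point model of the circular CORDIC iteration. This is a generic textbook sketch, not Altera's RTL or its bit accurate MATLAB model. The function name and the mode strings are hypothetical; the multiplications by 2^-i stand in for the shifts a fixed-point implementation would use, and the final division removes the constant CORDIC gain.

```python
import math


def cordic(x, y, z, mode, iterations=16):
    """Circular CORDIC.

    mode "rotate":    rotate (x, y) by the angle z (radians); z is driven to 0.
    mode "vectorise": rotate (x, y) onto the positive x axis (needs x > 0);
                      z accumulates the angle that was removed.
    """
    gain = 1.0
    for i in range(iterations):
        if mode == "rotate":
            d = 1.0 if z >= 0 else -1.0
        else:
            d = -1.0 if y >= 0 else 1.0
        shift = 2.0 ** -i                  # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)          # angle table entry for this stage
        gain *= math.sqrt(1.0 + shift * shift)
    return x / gain, y / gain, z           # gain is a constant for fixed iterations
```

In rotate mode, `cordic(1.0, 0.0, 0.5, "rotate")` returns approximately (cos 0.5, sin 0.5, 0). In vectorise mode, `cordic(3.0, 4.0, 0.0, "vectorise")` returns approximately (5, 0, atan2(4, 3)), i.e. the magnitude and phase of the input vector.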

CORDIC LEs usage

| Input Vector Width x, y & z | Number of Iterations | Logic Elements (LEs), cor_les (absolute no) | % of LEs in Stratix 1S10 | Fmax |
|---|---|---|---|---|
| 8 | 8 | 380 | 3% | 264MHz |
| 16 | 16 | 1300 | 12% | 219MHz |
| 24 | 24 | 2670 | 25% | 198MHz |
| 32 | 32 | 4600 | 43% | 189MHz |
| 40 | 40 | 7010 | 66% | 163MHz |

Systolic Array Architecture

[Figure: triangular systolic array with cells R11, R12, R13, R14, R22, R23, R24, R33, R34, R44 and a right-hand column z’1, z’2, z’3, z’4, labelled with "Inputs" and "Desired output". Cell detail: CORDIC in Vectorise mode and CORDIC in Rotate mode, with ports Real(Xin), Imag(Xin), øin, Өin and Real(Xout), Imag(Xout), øout, Өout. Variants shown for real inputs & outputs, for complex inputs & outputs (CORDIC 1, CORDIC 2, CORDIC 3), and for complex inputs & outputs with timesharing (single CORDIC).]

Example of Polynomial Model

Same example as used in the forward path, where the predistorter is assumed to be modelled by the memory polynomial model with K=2 and Q=2

The learning model in the Indirect Architecture has output z(n) related to the PA output y(n) by

$$\begin{aligned}
z(n) ={}& c_{10}\,y(n) + c_{30}\,y(n)\,|y(n)|^{2} + c_{50}\,y(n)\,|y(n)|^{4} \\
&+ c_{11}\,y(n-1) + c_{31}\,y(n-1)\,|y(n-1)|^{2} + c_{51}\,y(n-1)\,|y(n-1)|^{4} \\
&+ c_{12}\,y(n-2) + c_{32}\,y(n-2)\,|y(n-2)|^{2} + c_{52}\,y(n-2)\,|y(n-2)|^{4}
\end{aligned}$$ [eq 3]

Objective: Use this example to determine the number of CORDIC blocks required & the throughput for implementing the Systolic Array using the following schemes
− Direct Mapping Scheme
− Mixed Mapping Scheme
− Discrete Mapping Scheme

Example: Conditions

fclk = 150 MHz; tclk = 1 / fclk = 6.7ns

PA output/input y(n)/x(n) is 16 bits wide

CORDIC implements the full 16 iterations
− delay through CORDIC = 19 clk cycles (see Altera App Note 263 on CORDIC)

Each processor cell comprises a single CORDIC block, timesharing the three operations required for complex inputs.

Systolic architecture has 9 rows by 10 columns
− there are 9 c coefficient values to determine

Input matrix is m x p = 64 x 9

Time required before all cells in the systolic array are updated with their R and z values = tupdate_del

Throughput is the number of input matrices (each m x p) that are processed per second = thr_put
− thr_put = 1 / tupdate_del

Direct Mapping Implementation

Each cell in the Systolic Array is mapped to its own CORDIC block(s)

Total number of CORDIC blocks required is equal to the total number of cells in the array = 54 in this case
− 70K LEs!!!

Advantage of high throughput
− update delay = 5.1us
− throughput = 196,078 updates/s

Disadvantage of larger number of resources
− Mixed or Discrete Mapping schemes reduce the number of CORDIC blocks (resources) required by timesharing, at the cost of reduced throughput

Mixed Mapping Implementation

Move rows in bottom portion to top portion
− Try to result in same number of cells per row

[Figure: systolic array folded and mapped to 4 rows, served by Processor 1, Processor 2, Processor 3 and Processor 4]

Use each processor for one or more rows (in this case one processor per row)
− processors perform both boundary and internal cell operations -> “Mixed”
− processors 2 – 4 perform 13 cell operations while processor 1 performs 15 cell operations

4 processors = 4 CORDIC blocks = 5K LEs

Throughput
− update delay = 250.85us
− throughput = 3986 updates / second

Discrete Mapping Implementation

Minimum of 2 processors
− One performs only boundary cell operations
− Others perform internal cell operations

Max processors = max number of cells in any diagonal
− 5 processors for this example

[Figure: assignment of array cells to processors, shown with 5 processors (Proc1 to Proc5), with 2 processors (Proc1, Proc2), and with a single processor (Proc)]

As CORDIC iterations are 16, could timeshare and use 2 processors:

Resources = 2 CORDIC blocks = 2600 LEs

Throughput
− Update delay = 198.11us
− Throughput = 5047 updates / second

Hybrid scheme: 1 processor only

Example: Summary

| Implementation Technique | CORDIC usage: No. of Blocks | CORDIC usage: No of LEs | Throughput: update delay (us) | Throughput: updates/s | Cost LE/update |
|---|---|---|---|---|---|
| Direct Mapping | 54 | 70200 | 5.1 | 196078 | 0.358 |
| Mixed Mapping | 4 | 5200 | 250.85 | 3986 | 1.305 |
| Discrete Mapping | 2 | 2600 | 198.11 | 5047 | 0.515 |

Why Discrete Mapping is better than Mixed Mapping in this scenario
− Discrete Mapping nominally needs 5 processors, but because it can utilise the pipelining ability of CORDIC it uses only 2 processors.

[Note: results assume input matrix of 64 x 9; 16 CORDIC iterations; clk freq of 150MHz; 9 unknown coefficients to calculate]
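The derived columns of this summary follow from the block counts and update delays. The sketch below recomputes them in Python, assuming 1300 LEs per 16-iteration CORDIC block (the figure from the CORDIC LEs usage table), throughput = 1 / update delay, and cost = LEs divided by updates per second. The cell count assumes a triangular 9 x 10 array whose rows hold 10, 9, ..., 2 cells. Function names are hypothetical.

```python
LES_PER_CORDIC = 1300  # 16-bit, 16-iteration CORDIC (assumed per-block cost)


def systolic_cells(num_coefficients):
    """Cells in a triangular array with one extra column for z'."""
    n = num_coefficients
    return n * (n + 1) // 2 + n


def summarise(cordic_blocks, update_delay_us):
    """Return (LEs, updates per second, LEs per update/s)."""
    les = cordic_blocks * LES_PER_CORDIC
    throughput = 1.0 / (update_delay_us * 1e-6)
    cost = les / throughput
    return les, throughput, cost


for name, blocks, delay_us in [("Direct Mapping", 54, 5.1),
                               ("Mixed Mapping", 4, 250.85),
                               ("Discrete Mapping", 2, 198.11)]:
    les, throughput, cost = summarise(blocks, delay_us)
    print(f"{name}: {les} LEs, {throughput:.0f} updates/s, cost {cost:.3f}")
```

The recomputed values agree with the table to within rounding; the table's throughput figures appear to be truncated to whole updates per second, which accounts for small differences in the last digit of the cost column.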

Summary of Systolic Array Implementation

Direct Mapping offers the highest throughput at the expense of large resource usage

For approximately the same throughput and using CORDIC, Discrete Mapping will always require fewer resources than Mixed Mapping
− it can utilise the pipelining ability of CORDIC

Deploy Discrete Mapping hybrid technique
− use the discrete mapping technique to determine the scheduling of cell operations
− reduce processor cells to 1 (not 2) to minimise resources for a minimal reduction in throughput

Generating inputs to Systolic Array I

Same order of terms generated in forward path

y(n), y(n)|y(n)|^2, y(n)|y(n)|^4, y(n-1), y(n-1)|y(n-1)|^2, y(n-1)|y(n-1)|^4, y(n-2), y(n-2)|y(n-2)|^2, y(n-2)|y(n-2)|^4, z(n)

$|x(n)|^2 = I^2 + Q^2$

$|x(n)|^4 = |x(n)|^2 \cdot |x(n)|^2$

$|x(n)|^{2k} = |x(n)|^2 \cdot |x(n)|^{2(k-1)}$

[Figure: the same multiplier (X) and delay chain as in the forward path, with x(n) = I + jQ at the input and outputs x(n), x(n) . |x(n)|^2, x(n) . |x(n)|^4, ..., x(n) . |x(n)|^2k]

Generating inputs to Systolic Array II

Comparison to input generation for forward path:
− Same function
− lower speed requirement

Use NIOS with Multiply Custom Instruction

Alternative: Custom logic
− Timeshare multipliers more in feedback path: the forward path requires a result every clock cycle, whereas CORDIC in the feedback path takes n cycles to produce an output
− for Mixed or Discrete Mapping would require a new input once every n cycles (worst case)
− Also feedback processing performed offline
− reduce multiplier requirement and also throughput through R matrix by taking more clk cycles

Our Example
− Forward Path: generating inputs to FIR filters requires 4 18x18 multipliers (1 DSP block)
− Feedback Path: timesharing allows using a single 18x18 multiplier

Back Substitution using Nios

[Figure: Nios CPU with instruction (I) and data (D) ports on the Avalon Bus, together with I/O, CORDIC, and program & data memory holding Rr[N][N], Ri[N][N], z’r[N], z’i[N], cr[N] and ci[N]]

Back Substitution Example

Formula

$$c_N = \frac{z'_N}{R_{NN}}$$ [eq 7]

$$c_i = \frac{1}{R_{ii}} \left( z'_i - \sum_{j=i+1}^{N} R_{ij}\, c_j \right) \quad \text{for } i = N-1, \dots, 1$$ [eq 8]

Example (with real numbers)

$$R = \begin{bmatrix} 2 & 3 \\ 0 & 5 \end{bmatrix}, \qquad z' = \begin{bmatrix} 8 \\ 10 \end{bmatrix}$$

i.e., R is a 2 x 2 upper triangular matrix (N = 2)

$$c_2 = \frac{z'_2}{R_{22}} = \frac{10}{5} = 2$$ (from eq 7)

$$c_1 = \frac{1}{R_{11}} \left( z'_1 - R_{12}\, c_2 \right) = \frac{1}{2} \left( 8 - 3 \cdot 2 \right) = 1$$ (from eq 8)

Requires mainly multiply and divide operations

Back Substitution Results – Latency Comparison

[Figure: "Back Substitution Algorithm" chart. Horizontal axis: N, from 4 to 32. Vertical axis: clock cycle count, logarithmic from 100 to 1,000,000. Series: Nios 3.1, Nios II, Custom Logic. An arrow marks best performance (lowest latency). Callout values: 12000, 1500, 300.]


Block Diagram

[Figure: block diagram of the complete solution. Forward path: X(n) enters input generation, which produces X(n), X(n)3 and X(n)5 for FIR filter 1 (taps C10, C11, C12), FIR filter 2 (taps C30, C31, C32) and FIR filter 3 (taps C50, C51, C52), with updated coefficient sets C’10 to C’52; the output goes to the D/A and analogue upconverter and then to the PA. Feedback path: the PA output returns through the analogue downconverter and A/D to a sample buffer on the Avalon Bus, which also connects NIOS, program & data memory, the CORDIC accelerator, and the Multiply and Divide custom instructions (CI).]


Resource Estimate

| Polynomial type | CORDIC blocks | LEs for CORDIC + NIOS | Number of Multipliers: 18x18 mult | Number of Multipliers: DSP Blocks | Suitable Stratix device | Update delay (us) |
|---|---|---|---|---|---|---|
| K=2 (5th order), Q=2 (2 previous terms) | 2 | 4100 | 23 | 6 | 1S10 | 320 |
| K=2 (5th order), Q=5 (5 previous terms) | 2 | 4100 | 42 | 11 | 1S30 | 900 |

Predistorter described by Memory Polynomial model

Coefficients determined in feedback path using
− QRD-RLS algorithm
− CORDIC used in Discrete Mapping Scheme to implement Systolic Array
− NIOS used for back substitution

Estimates do NOT include
− memory requirements (abundant supply in all Stratix devices)
− control logic


Accuracy

Potential of large dynamic range of signals
− higher order polynomial terms require higher dynamic range

$$z(n) = \sum_{k=0}^{K} \sum_{q=0}^{Q} c_{(2k+1)q}\, x(n-q)\, |x(n-q)|^{2k}$$

Rounding & truncating errors
− impact on coefficients & predistorted values needs consideration

Errors affect system performance
− increase in-band distortion
− increase out-of-band distortion

Maintaining Accuracy

Fixed Point Solution
− Larger bit widths for CORDIC & in forward path intermediate products

Floating Point Solutions
− Forward Path: floating point operations (multiply, add)
− Feedback Path: implement Systolic array with floating point operations (NO CORDIC), or implement Systolic array with floating point CORDIC

Hybrid Solution
− Floating point operations: forward path; calculation of high order terms to be fed into the Systolic array in the feedback path
− Fixed point operations: CORDIC blocks (with floating point extension) in feedback path

Right Solution
− Depends on performance - cost requirements for individual customers

Resource comparison of options

Fixed Point Solution (e.g. extending bit widths from 16 to 32)
− CORDIC resources increase by ~250%
− Multipliers required increase by up to 300%

Floating Point Operators (multiply, divide, add)
− Require ~20-30% more resources than equivalent fixed point operators
− Employ the use of barrel shifters

Floating Point CORDIC vs Fixed Point CORDIC
− ~20-30% larger
− requires barrel shifters
− slower throughput as non-pipelined (serial architecture)


Summary

Stratix devices offer optimal resources for Polynomial DPD implementation
− DSP blocks
− Embedded RAM

QRD-RLS algorithm can be effectively addressed with Altera IP
− CORDIC
− NIOS with custom instructions

Flexible architecture capable of addressing a wide range of processing demands

Backup Slides

Least Squares using QRD

Assuming Y has been decomposed into Q, R, where Q is a unitary matrix and R is an upper triangular matrix

Now,

$$\begin{aligned}
Yc &= z \\
Y^T Y c &= Y^T z \\
c &= (Y^T Y)^{-1} Y^T z \\
c &= \left( (QR)^T QR \right)^{-1} (QR)^T z \\
c &= (R^T R)^{-1} R^T Q^T z && \text{(since } Q^T Q = 1\text{)} \\
c &= R^{-1} (R^T)^{-1} R^T Q^T z \\
c &= R^{-1} Q^T z && \text{(since } (R^T)^{-1} R^T = 1\text{)} \\
\text{or} \quad Rc &= Q^T z
\end{aligned}$$

7th order Polynomial Resource Estimates

Example Conditions

4 UMTS carriers

Assumed sample rate into DPD = 140Msps

16 or 18-bit inputs and coefficients

[Figure: spectrum of the 4 carriers, centred at -7.5 MHz, -2.5 MHz, 2.5 MHz and 7.5 MHz around 0 Hz, occupying a 20 MHz bandwidth from -10 MHz to 10 MHz]

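Example Conditions

Memory polynomial model (with both even and odd terms)

7th order with 2 previous samples for memory effects (K=7, Q=2)

$$z(n) = \sum_{k=1}^{K} \sum_{q=0}^{Q} c_{kq}\, x(n-q)\, |x(n-q)|^{k-1}$$ eq(1)

$$\begin{aligned}
z(n) ={}& c_{10}\,x(n) + c_{20}\,x(n)|x(n)| + c_{30}\,x(n)|x(n)|^{2} + c_{40}\,x(n)|x(n)|^{3} + c_{50}\,x(n)|x(n)|^{4} + c_{60}\,x(n)|x(n)|^{5} + c_{70}\,x(n)|x(n)|^{6} \\
&+ c_{11}\,x(n-1) + c_{21}\,x(n-1)|x(n-1)| + c_{31}\,x(n-1)|x(n-1)|^{2} + c_{41}\,x(n-1)|x(n-1)|^{3} + c_{51}\,x(n-1)|x(n-1)|^{4} + c_{61}\,x(n-1)|x(n-1)|^{5} + c_{71}\,x(n-1)|x(n-1)|^{6} \\
&+ c_{12}\,x(n-2) + c_{22}\,x(n-2)|x(n-2)| + c_{32}\,x(n-2)|x(n-2)|^{2} + c_{42}\,x(n-2)|x(n-2)|^{3} + c_{52}\,x(n-2)|x(n-2)|^{4} + c_{62}\,x(n-2)|x(n-2)|^{5} + c_{72}\,x(n-2)|x(n-2)|^{6}
\end{aligned}$$

7 FIR filters with 3 complex taps each: the seven columns of the expansion map to FIR 1, FIR 2, FIR 3, FIR 4, FIR 5, FIR 6 and FIR 7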

Total number of 18 x 18 multipliers

Number of FIR filter multipliers
− 7 x 3 = 21 complex multiplications
− Each complex multiplication requires 4 18x18 multipliers = 1 DSP Block
− 21 DSP blocks (84 multipliers) required
− cannot time-share (for Stratix) as data rate at 140Msps

Multipliers for generating FIR inputs
− Forward Path: 14 Multipliers + LEs for computing |x(n)| (time-sharing not possible for Stratix)
− Feedback Path: 1 Multiplier (with time-sharing)

Total number of multipliers required
− 84 + 14 + 1 = 99 multipliers

Feedback path conditions

QR decomposition
− 2 CORDIC blocks
− 150MHz clock
− Input matrix size 64 x 22
− Update delay = 504.27μs

Back substitution
− 1 Nios processor
− 100MHz clock
− 21 coefficients to compute
− Update delay = 544.51μs

Total feedback delay = 1048.78μs = 1.05ms

Total Resource Estimate

| Polynomial type | CORDIC blocks | LEs for CORDIC + NIOS | Number of Multipliers: 18x18 mult | Number of Multipliers: DSP Blocks | Suitable Stratix, Stratix II devices | Update delay (ms) |
|---|---|---|---|---|---|---|
| K=7 (7th order), Q=2 (2 previous terms) | 2 | 4100 | 99 | 25 | 1S80 (using LEs for 3 DSP blocks) | 1.05 |
| K=7 (7th order), Q=2 (2 previous terms) | 2 | 4100 | 50 (with time-sharing) | 13 | EP2S30 | less than 1.05 |

Predistorter described by Memory Polynomial model

Coefficients determined in feedback path using
− QRD-RLS algorithm
− CORDIC used in Discrete Mapping Scheme to implement Systolic Array
− NIOS used for back substitution

Estimates do NOT include
− memory requirements (abundant supply in all Stratix devices)
− control logic

Possible Queries by Customers

(do not include as a part of the presentation)

Can the Nios processor perform floating point operations?

We currently do not have a full floating point unit for the Nios. Any floating point calculations are done with the floating point library in software. If you have a hardware integer multiplier as part of the Nios, the software libraries will take advantage of it (and performance will increase as a result).

That being said, we have done a subset floating point unit that does add, negate, absolute value, and multiply, but it is more of an example than a supported part of the Nios package.

We would like to know if you would want to see floating point operations added to the Nios features.

How many carriers can you address with this design?

The multipliers on Stratix can run up to 280MHz. For a 4 carrier UMTS system with a total bandwidth of 20MHz and a polynomial of order 5, the required bandwidth would be 20x5=100MHz, which from the Nyquist sampling theorem translates to an IF frequency of 200MHz. This can be addressed with the multipliers. What order of the polynomial are you looking at?

What stage are you in your design? Do you have any performance numbers?

We are contemplating doing simulations to verify the performance. Would you be interested?

Possible Queries to Customers

(do not include as a part of the presentation)

What sort of PA model are you using?

What is the degree of the polynomial?

How many samples need to be considered for memory effect?

What is the adaptation algorithm you are considering in the feedback loop?

What is the update rate for the coefficients?

What is the IF frequency?

Are you operating in fixed or floating point?
− if fixed what are the bit widths in forward and feedback paths?
− for floating point what representation are you using?

What performance are you targeting for your predistortion unit operating with your PA (i.e. reduction in EVM and increase in ACLR specifically due to predistortion only)?

Why are you specifically implementing polynomial solution as opposed to LUT based one?

What device (i.e. Altera, ASIC, Xilinx) are you targeting for your solution and why?

If not using Altera, what would persuade you to switch?

