Copyright © 2003 Altera Corporation
Digital Predistortion using Polynomial Approach
Agenda
Introduction
Indirect Learning Architecture
Forward Path
Feedback Path
Block diagram of complete solution
Example Resource Estimate
Numerical Accuracy
Summary
Introduction: Predistortion
Ideal PA: linear response, Vrf = k Vin
Real PA: nonlinear response, Vrf = fnl k Vd
Predistorter (DPD): Vd = (1/fnl) Vin
Overall linear response: Vrf = k Vin
[Figure: transfer curves of VRF versus Vin for the ideal PA, VPA versus Vd for the real PA, Vd versus Vin for the predistorter, and VRF versus Vin for the overall cascade]
DPD - Implementation Choices
Two basic approaches
− Look-Up-Table (LUT) contains points on the transfer function (x, y)
− Polynomial representation of the transfer function: $y = c_0 + c_1 x + c_2 x^2 + \dots + c_n x^n$
Pros-Cons of the two approaches
Advantage
Polynomial | LUT
Relatively easier to address memory effects & phase dependent errors | Memory effects & phase dependent errors require complex implementation
Faster convergence time | Slower convergence time
Requires Stratix | Low cost Cyclone implementation possible
Difficult to address different PA types | Same solution for different PAs (as long as error dependencies are same)
High complexity when higher order terms are used | Low complexity
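As a rough illustration of the two approaches, the sketch below evaluates a memoryless transfer function both ways in Python. The coefficient values, the table size and the function names (`poly_eval`, `build_lut`, `lut_eval`) are assumptions made for this example only; they are not taken from the slides.

```python
def poly_eval(coeffs, x):
    """Evaluate y = c0 + c1*x + ... + cn*x**n using Horner's rule."""
    y = 0.0
    for c in reversed(coeffs):
        y = y * x + c
    return y


def build_lut(coeffs, size, x_max):
    """Sample the transfer function at `size` evenly spaced points on [0, x_max]."""
    step = x_max / (size - 1)
    return [poly_eval(coeffs, i * step) for i in range(size)], step


def lut_eval(lut, step, x):
    """Nearest-entry lookup: no arithmetic on x beyond forming the address."""
    index = min(len(lut) - 1, max(0, round(x / step)))
    return lut[index]


# Assumed example: a mildly compressive cubic, y = x - 0.1*x**3
coeffs = [0.0, 1.0, 0.0, -0.1]
lut, step = build_lut(coeffs, 65, 1.0)
```

The polynomial costs multiplications per sample but only stores n+1 coefficients; the LUT costs memory but no multiplications, which is one way to read the complexity row of the table.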
Indirect Learning Architecture
The output of the PA is compared with the input of the PA rather than the input of the predistorter
Figure 1: Indirect learning architecture for Polynomial based predistortion
Forward path
Feedback path
Indirect Learning Architecture
Feedback path (predistorter training block A) can work offline
A block of [z(n), y(n)], i.e., [input to PA, output from PA] can be stored in a look up table
The predistorter training block derives the inverse response of the PA by performing polynomial curve fitting on the data in the LUT (produces polynomial coefficients by solving the least squares problem)
The polynomial coefficients are computed to drive the error e(n) to zero; when e(n) = 0, y(n) = x(n)
The generated coefficients in the feedback path are used by the predistorter in the forward path until the next update by the training block
Predistorter Model
Assume the predistorter is modelled by the memory polynomial model [1]
$$z(n) = \sum_{k=0}^{K} \sum_{q=0}^{Q} c_{(2k+1)q} \, x(n-q) \, |x(n-q)|^{2k}$$ [eq 1]
Expanded:
$$\begin{aligned}
z(n) ={} & c_{10}\,x(n) + c_{30}\,x(n)\,|x(n)|^{2} + c_{50}\,x(n)\,|x(n)|^{4} + \dots + c_{(2K+1)0}\,x(n)\,|x(n)|^{2K} \\
& + c_{11}\,x(n-1) + c_{31}\,x(n-1)\,|x(n-1)|^{2} + c_{51}\,x(n-1)\,|x(n-1)|^{4} + \dots + c_{(2K+1)1}\,x(n-1)\,|x(n-1)|^{2K} \\
& + c_{12}\,x(n-2) + c_{32}\,x(n-2)\,|x(n-2)|^{2} + c_{52}\,x(n-2)\,|x(n-2)|^{4} + \dots + c_{(2K+1)2}\,x(n-2)\,|x(n-2)|^{2K} \\
& + \dots \\
& + c_{1Q}\,x(n-Q) + c_{3Q}\,x(n-Q)\,|x(n-Q)|^{2} + c_{5Q}\,x(n-Q)\,|x(n-Q)|^{4} + \dots + c_{(2K+1)Q}\,x(n-Q)\,|x(n-Q)|^{2K}
\end{aligned}$$ [eq 2]
Implement using FIR filters
− Number of FIR filters = K + 1 [eq 3]
− Each column of [eq 2] (terms with the same power 2k) maps to one filter: FIR 1, FIR 2, FIR 3, ..., FIR K+1
[1] L. Ding, G. T. Zhou, D. R. Morgan, Z. Ma, J. S. Kenney, J. Kim and C. R. Giardina, “Memory Polynomial predistorter based on the indirect learning architecture”, IEEE Global Telecommunications Conference, Taipei, Taiwan, Nov 2002
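The memory polynomial of [eq 1] can be sketched directly in Python as a reference model. This is a floating-point illustration, not the hardware implementation; the function name `memory_polynomial`, the nested-list coefficient layout and the choice to treat samples before the start of the block as zero are assumptions of this sketch.

```python
def memory_polynomial(x, coeffs):
    """Reference model of [eq 1].

    x      : list of complex input samples x(n)
    coeffs : coeffs[k][q] is the coefficient c_(2k+1)q, for k = 0..K and q = 0..Q
    """
    z = []
    for n in range(len(x)):
        acc = 0j
        for k, taps in enumerate(coeffs):          # one "FIR filter" per k
            for q, c in enumerate(taps):           # one tap per delay q
                if n - q >= 0:                     # assumed: zero history before sample 0
                    s = x[n - q]
                    acc += c * s * abs(s) ** (2 * k)
        z.append(acc)
    return z


# With only c10 = 1 the predistorter passes the input through unchanged.
identity = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]       # K = 2, Q = 2
samples = [1 + 1j, 0.5 - 0.25j, -0.3j]
passthrough = memory_polynomial(samples, identity)
```

Grouping the loops by k mirrors the FIR view: for a fixed k the inner loop over q is a (Q+1)-tap filter applied to the sequence x(n)|x(n)|^{2k}.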
Forward Path Block Diagram
[Figure: X(n) enters the filter input generation block, which produces X(n), X(n)|X(n)|^2, ..., X(n)|X(n)|^2K. These feed FIR filter 1 (taps C10, C11, ..., C1Q), FIR filter 2 (taps C30, C31, ..., C3Q), ..., FIR filter K+1 (taps C(2K+1)0, C(2K+1)1, ..., C(2K+1)Q), whose outputs combine to give Z(n).]
− K+1 FIR filters
− Each filter has Q+1 taps
− Each tap performs a complex multiplication
Generating Inputs to FIR Filters
x(n) = I + jQ
|x(n)|^2 = I^2 + Q^2
− 2 multiplications
|x(n)|^4 = |x(n)|^2 . |x(n)|^2
− 1 multiplication
|x(n)|^2k = |x(n)|^2 . |x(n)|^2(k-1)
− 1 multiplication
[Figure: x(n) passes through delay elements so that it lines up with the power terms; multipliers (one stage labelled "2 multiplications") form the products. Input to FIR 1 = x(n); input to FIR 2 = x(n) . |x(n)|^2; input to FIR 3 = x(n) . |x(n)|^4; input to FIR K+1 = x(n) . |x(n)|^2k]
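The recursion above can be sketched as follows, using |x(n)|^2 = I^2 + Q^2 as on the DSP Block Architecture slide. The function name `fir_inputs` and the running multiplication counter are assumptions added for illustration; the counter is a raw count of real multiplications per sample, before any timesharing, and is not a figure quoted in the slides.

```python
def fir_inputs(i, q, K):
    """Return the K+1 filter inputs x*|x|^(2k) as (real, imag) pairs,
    together with the number of real multiplications used."""
    inputs = [(i, q)]                  # input to FIR 1 is x(n) itself
    mults = 0
    if K >= 1:
        mag2 = i * i + q * q           # |x|^2 = I^2 + Q^2
        mults += 2
        power = mag2
        for k in range(1, K + 1):
            if k > 1:
                power = power * mag2   # |x|^(2k) = |x|^2 * |x|^(2(k-1))
                mults += 1
            # real value times complex sample: one multiply each for I and Q
            inputs.append((i * power, q * power))
            mults += 2
    return inputs, mults


terms, count = fir_inputs(3, 4, 2)     # x = 3 + 4j, so |x|^2 = 25
```

Each higher-order term reuses the previous power, so the cost grows linearly with K rather than with the polynomial order.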
DSP Block Architecture
[Figure: DSP block with input registers, multipliers, optional pipelining, add/subtract/accumulate (+ − Σ) stages, output MUX and output registers]
High Performance DSP Operation
− 18x18 functions at 282 MHz
Input, Output & Pipelining registers
− Reduce overall logic usage
Add/Accumulate/Subtract
− Signed & unsigned operations
− Dynamically change between Add & Subtract
Supports real multiplications (required for generating the power terms that form the inputs to the FIR filters)
− e.g. |I + jQ|^2 = I^2 + Q^2
Supports complex multiplications (required in FIR filter taps)
− (Ar + jAi) x (Br + jBi) = (Ar Br – Ai Bi) + j(Ai Br + Ar Bi)
− 4 multiplications, 1 addition & 1 subtraction
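The complex multiplication performed in each filter tap can be written out explicitly to show the four real multiplications, one subtraction and one addition. The function name `complex_multiply` is an assumption of this sketch.

```python
def complex_multiply(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi) using 4 real multiplies, 1 subtract, 1 add."""
    real = ar * br - ai * bi    # 2 multiplications, 1 subtraction
    imag = ai * br + ar * bi    # 2 multiplications, 1 addition
    return real, imag


product = complex_multiply(1, 2, 3, 4)   # (1 + 2j) * (3 + 4j)
```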
Forward Path Multiplier Requirement
Timeshare multipliers
− Stratix multipliers can run at over 240 MHz
− IF frequency should be less than 120 MHz
− Halve the number of multipliers required by running at twice the speed
Our example with K=2 and Q=2 requires 3 FIR filters, each with 3 taps
Multiplier Requirement
Forward Path Implementation | No. of 18x18 multipliers | No. of DSP blocks
Memory Polynomial model with K=2 and Q=2 | 22 | 6
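One way to arrive at the 22 multipliers and 6 DSP blocks quoted for K=2, Q=2 is sketched below. The breakdown is an assumption of this sketch rather than something the slide states: 4 multipliers per complex tap, halved by 2x timesharing, plus 4 multipliers for generating the filter inputs (the figure given later for the forward path), with 4 18x18 multipliers per DSP block. The function name and parameters are hypothetical.

```python
import math


def forward_path_multipliers(K, Q, input_gen_mults, timeshare=2):
    """Estimate 18x18 multipliers and DSP blocks for the forward path."""
    taps = (K + 1) * (Q + 1)                       # K+1 filters, Q+1 complex taps each
    tap_mults = math.ceil(4 * taps / timeshare)    # 4 real multiplies per complex tap
    total = tap_mults + input_gen_mults
    dsp_blocks = math.ceil(total / 4)              # 4 18x18 multipliers per DSP block
    return total, dsp_blocks


estimate = forward_path_multipliers(K=2, Q=2, input_gen_mults=4)
```

Under these assumptions the 9 complex taps need 36 multiplications per sample, 18 physical multipliers after timesharing, and 22 in total once input generation is included.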
Feedback Path Overview
Coefficients Update Algorithms
QRD-RLS Algorithm
Using CORDIC for QR Decomposition
Using Nios for Back Substitution
Coefficients Update Algorithms
Least mean squares (LMS), Normalized LMS (NLMS)
Recursive least squares (RLS)
Pros:
− RLS converges faster than LMS and NLMS
Cons:
− Numerical stability issues
− Computationally expensive
− Operates on correlation matrix of input data
QR Decomposition based RLS (QRD-RLS) is more robust and well suited for hardware implementation
QR Decomposition (QRD)
Least squares problem:
$$Yc = z$$ [eq 4]
where Y is M x N (M > N), c is N x 1 and z is M x 1
Decompose Y into QR, where Q is a unitary matrix ($QQ^T = 1$, i.e. $Q^T = Q^{-1}$) and R is an upper triangular matrix
Then,
$$Rc = Q^T z$$ [eq 5]
or
$$Rc = z'$$ [eq 6] where $z' = Q^T z$
Solve for c, where R is N x N, c is N x 1 and z' is N x 1
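A minimal sketch of the reduction from Yc = z to Rc = z' is shown below for real-valued data, using Givens (plane) rotations applied to the augmented matrix [Y | z]. Givens rotations are used here because they are the kind of rotation a CORDIC can perform; that pairing, the function name `givens_qr` and the real-valued simplification are assumptions of this sketch, not a description of the Altera design.

```python
import math


def givens_qr(Y, z):
    """Rotate [Y | z] into [R | z'] without forming Q explicitly.

    Y is a list of M rows of N values (M > N); z has M values.
    Returns (R, z_prime) with R upper triangular (N x N) and z_prime of length N.
    """
    m, n = len(Y), len(Y[0])
    A = [[float(v) for v in row] + [float(zi)] for row, zi in zip(Y, z)]
    for col in range(n):
        for row in range(col + 1, m):
            a, b = A[col][col], A[row][col]
            r = math.hypot(a, b)
            if r == 0.0:
                continue                      # nothing to eliminate
            cs, sn = a / r, b / r             # rotation that zeroes A[row][col]
            for j in range(col, n + 1):       # rotate the z column along with Y
                top, bot = A[col][j], A[row][j]
                A[col][j] = cs * top + sn * bot
                A[row][j] = -sn * top + cs * bot
    R = [A[i][:n] for i in range(n)]
    z_prime = [A[i][n] for i in range(n)]
    return R, z_prime


# Assumed test data: z = 1 + 2*t sampled at t = 0, 1, 2, 3, so c = [1, 2].
Y = [[1, 0], [1, 1], [1, 2], [1, 3]]
z = [1, 3, 5, 7]
R, z_prime = givens_qr(Y, z)
```

Because the same rotations are applied to Y and to z, the result satisfies Rc = z' for the least squares solution c, which can then be found by back substitution.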
Back Substitution to solve for c
Since R is an upper triangular matrix, c can be solved from [eq 6] via back substitution
$$c_N = \frac{z'_N}{R_{NN}}$$ [eq 7]
$$c_i = \frac{1}{R_{ii}} \left( z'_i - \sum_{j=i+1}^{N} R_{ij}\, c_j \right), \quad i = N-1, \dots, 1$$ [eq 8]
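Equations 7 and 8 translate almost line for line into code. The sketch below is a plain Python illustration (the function name `back_substitute` is an assumption); it works for real or complex entries and uses 0-based indices, so row N in the equations is index N-1 here.

```python
def back_substitute(R, z_prime):
    """Solve R c = z' for c, where R is N x N upper triangular."""
    n = len(z_prime)
    c = [0.0] * n
    c[n - 1] = z_prime[n - 1] / R[n - 1][n - 1]            # [eq 7]
    for i in range(n - 2, -1, -1):                         # i = N-1, ..., 1 in the slide's numbering
        partial = sum(R[i][j] * c[j] for j in range(i + 1, n))
        c[i] = (z_prime[i] - partial) / R[i][i]            # [eq 8]
    return c


# Small real-valued case: R = [[2, 3], [0, 5]], z' = [8, 10]
solution = back_substitute([[2, 3], [0, 5]], [8, 10])
```

Each coefficient costs one division plus a growing number of multiply-accumulates, which is why the operation count is dominated by multiplies and divides.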
Resource Mapping for QRD-RLS
[Figure: x(n) passes through the predistorter (DSP blocks) to give z(n), which drives the PA to give y(n). z(n) and y(n) are captured in a LUT; DSP blocks form Y and z for the QRD-RLS block, where CORDIC produces R and z' and Nios produces c.]
CORDIC for QR Decomposition
Nios for Back Substitution
Using CORDIC for QRD : Overview
What is CORDIC?
Systolic Array Architecture
Direct Mapping Implementation
Mixed Mapping Implementation
Discrete Mapping Implementation
Summary of Implementing Systolic Array
Generating inputs to Systolic Array
What is CORDIC?
Hardware efficient algorithm for computing functions such as:
− Trigonometric, Hyperbolic, Logarithmic
Iterative solution that uses only shifts and additions/subtractions
− High performance / low complexity as no multiplications or divisions
Altera’s CORDIC
− operates in both Vectorise and Rotate modes
− pipelined to allow new inputs to be applied on every clk cycle
− MATLAB GUI containing a bit accurate model of the RTL for CORDIC and a test environment for generating test data, viewing results and parameterising the RTL
[Figure: CORDIC block with inputs X_in, Y_in, Z_in and mode, and outputs X_out, Y_out, Z_out]
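The two modes can be illustrated with a floating-point model of the classic circular CORDIC iteration. This is a textbook sketch, not Altera's core: in hardware the multiplications by 2^-i are shifts, the arctangent values come from a small table, and the gain is usually compensated elsewhere. The function name `cordic` and its argument names are assumptions chosen to echo the X/Y/Z ports in the figure.

```python
import math


def cordic(x, y, z, mode, iterations=16):
    """Circular CORDIC.

    mode "rotate":    rotate (x, y) by the angle z; z is driven towards 0.
    mode "vectorise": rotate (x, y) onto the positive x axis; z accumulates
                      the angle, x ends up as the vector magnitude (needs x > 0).
    """
    gain = 1.0
    for i in range(iterations):
        if mode == "rotate":
            d = 1.0 if z >= 0 else -1.0
        else:
            d = -1.0 if y >= 0 else 1.0
        shift = 2.0 ** -i                       # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)               # table lookup in hardware
        gain *= math.sqrt(1.0 + shift * shift)  # accumulated CORDIC gain
    return x / gain, y / gain, z


mag, residual, angle = cordic(3.0, 4.0, 0.0, "vectorise")
cos30, sin30, _ = cordic(1.0, 0.0, math.pi / 6, "rotate")
```

In a QRD array, vectorise mode yields the angle that zeroes an element, and rotate mode applies that angle to the other elements in the same rows.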
CORDIC LEs usage
Input vector width (x, y & z) | Number of iterations | Logic Elements, cor_les (absolute no) | % of LEs in Stratix 1S10 | Fmax
8 | 8 | 380 | 3% | 264 MHz
16 | 16 | 1300 | 12% | 219 MHz
24 | 24 | 2670 | 25% | 198 MHz
32 | 32 | 4600 | 43% | 189 MHz
40 | 40 | 7010 | 66% | 163 MHz
Systolic Array Architecture
[Figure: triangular systolic array with cells R11, R12, R13, R14, R22, R23, R24, R33, R34, R44 and an output column z'1, z'2, z'3, z'4, fed by the inputs and the desired output. Cell types shown: Vectorise and Rotate CORDIC cells for real inputs & outputs; for complex inputs & outputs (CORDIC 1, CORDIC 2, CORDIC 3, with ports Real(Xin), Imag(Xin), Real(Xout), Imag(Xout), øin, øout, Өin, Өout); and for complex inputs & outputs with timesharing (a single CORDIC).]
Example of Polynomial Model
Same example as used in the forward path, where the predistorter is assumed to be modelled by the memory polynomial model with K=2 and Q=2
The learning model in the indirect architecture has output z(n) related to the PA output y(n) by
$$\begin{aligned}
z(n) ={} & c_{10}\,y(n) + c_{30}\,y(n)\,|y(n)|^{2} + c_{50}\,y(n)\,|y(n)|^{4} \\
& + c_{11}\,y(n-1) + c_{31}\,y(n-1)\,|y(n-1)|^{2} + c_{51}\,y(n-1)\,|y(n-1)|^{4} \\
& + c_{12}\,y(n-2) + c_{32}\,y(n-2)\,|y(n-2)|^{2} + c_{52}\,y(n-2)\,|y(n-2)|^{4}
\end{aligned}$$ [eq 3]
Objective: use this example to determine the number of CORDIC blocks required and the throughput for implementing the Systolic Array using the following schemes
− Direct Mapping Scheme
− Mixed Mapping Scheme
− Discrete Mapping Scheme
Example: Conditions
fclk = 150 MHz; tclk = 1 / fclk = 6.7 ns
PA output/input y(n)/x(n) is 16 bits wide
CORDIC implements the full 16 iterations
− delay through CORDIC = 19 clk cycles (see Altera App Note 263 on CORDIC)
Each processor cell comprises a single CORDIC block, timesharing the three operations required for complex inputs.
Systolic architecture has 9 rows by 10 columns
− there are 9 c coefficient values to determine
Input matrix is m x p = 64 x 9
Time required before all cells in the systolic array are updated with their R and z values = tupdate_del
Throughput is the number of input matrices (each m x p) that are processed per second = thr_put
thr_put = 1 / tupdate_del
Direct Mapping Implementation
Each cell in the Systolic Array is mapped to its own CORDIC block(s)
Total number of CORDIC blocks required is equal to the total number of cells in the array = 54 in this case
− 70K LEs!
Advantage of high throughput
− update delay = 5.1 us
− throughput = 196,078 updates/s
Disadvantage of larger number of resources
− Mixed or Discrete Mapping schemes reduce the number of CORDIC blocks (resources) required through timesharing, at the cost of reduced throughput
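The count of 54 cells follows from the triangular shape of the array: with 9 coefficients, row i holds the diagonal cell, the remaining R cells to its right, and one z' cell. The sketch below reproduces that count; the function name is hypothetical, and the 1300 LEs per CORDIC is taken from the 16-bit, 16-iteration row of the CORDIC LEs usage table, on the assumption that this is the configuration behind the 70K figure.

```python
def systolic_cells(n_coeffs):
    """Count cells in a triangular QRD array with n_coeffs rows and n_coeffs + 1 columns."""
    boundary = n_coeffs                                   # one diagonal (boundary) cell per row
    # row i (1-based) has n_coeffs - i off-diagonal R cells plus one z' cell
    internal = sum(n_coeffs + 1 - i for i in range(1, n_coeffs + 1))
    return boundary, internal, boundary + internal


boundary, internal, total = systolic_cells(9)
direct_mapping_les = total * 1300                         # one CORDIC block per cell
```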
Mixed Mapping Implementation
Move rows in bottom portion to top portion
− Try to result in same number of cells per row
Map to 4 rows
[Figure: folded array with its rows assigned to Processor 1, Processor 2, Processor 3 and Processor 4]
Use each processor for one or more rows (in this case one processor per row)
− processors perform both boundary and internal cell operations -> “Mixed”
− processors 2 – 4 perform 13 cell operations while processor 1 performs 15 cell operations
4 processors = 4 CORDIC blocks = 5K LEs
Throughput
− update delay = 250.85 us
− throughput = 3986 updates / second
Discrete Mapping Implementation
Minimum of 2 processors
− One performs only boundary cell operations
− Others perform internal cell operations
Max processors = max number of cells in any diagonal
− 5 processors for this example
[Figure: cell-to-processor assignment using Proc1 to Proc5]
As CORDIC iterations are 16, could timeshare and use 2 processors:
[Figure: cell-to-processor assignment using Proc1 and Proc2]
Resources = 2 CORDIC blocks = 2600 LEs
Throughput
− Update delay = 198.11 us
− Throughput = 5047 updates / second
Hybrid scheme: 1 processor only
[Figure: assignment using a single processor (Proc)]
Example : Summary
Implementation Technique | No. of CORDIC blocks | No. of LEs | Update delay (us) | Throughput (updates/s) | Cost (LEs per update/s)
Direct Mapping | 54 | 70200 | 5.1 | 196078 | 0.358
Mixed Mapping | 4 | 5200 | 250.85 | 3986 | 1.305
Discrete Mapping | 2 | 2600 | 198.11 | 5047 | 0.515
Why Discrete Mapping is better than Mixed Mapping in this scenario
− Discrete Mapping nominally needs 5 processors, but because it can exploit the pipelining ability of CORDIC it uses only 2 processors.
[Note: results assume input matrix of 64 x 9; 16 CORDIC iterations; clk freq of 150 MHz; 9 unknown coefficients to calculate]
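The throughput and cost columns follow from the update delay and LE count via thr_put = 1 / tupdate_del. The short sketch below recomputes them from the delays and LE counts quoted for the three schemes; the function name is an assumption, and small differences in the last digit of the cost can arise depending on whether the throughput is rounded first.

```python
def throughput_and_cost(update_delay_us, logic_elements):
    """Return (updates per second, LEs per update/s) for one mapping scheme."""
    throughput = 1.0 / (update_delay_us * 1e-6)
    cost = logic_elements / throughput
    return throughput, cost


schemes = {
    "Direct Mapping": throughput_and_cost(5.1, 70200),
    "Mixed Mapping": throughput_and_cost(250.85, 5200),
    "Discrete Mapping": throughput_and_cost(198.11, 2600),
}
```

On this metric Direct Mapping is the cheapest per unit of throughput, while Discrete Mapping beats Mixed Mapping on both absolute LEs and cost.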
Summary of Systolic Array Implementation
Direct Mapping offers the highest throughput at the expense of large resource usage
For approximately the same throughput and using CORDIC, Discrete Mapping will always require fewer resources than Mixed Mapping
− it can utilise the pipelining ability of CORDIC
Deploy Discrete Mapping hybrid technique
− use the discrete mapping technique to determine the scheduling of cell operations
− reduce processor cells to 1 (not 2) to minimise resources for a minimal reduction in throughput
Generating inputs to Systolic Array I
Same order of terms as generated in the forward path:
y(n), y(n) |y(n)|^2, y(n) |y(n)|^4, y(n-1), y(n-1) |y(n-1)|^2, y(n-1) |y(n-1)|^4, y(n-2), y(n-2) |y(n-2)|^2, y(n-2) |y(n-2)|^4, z(n)
|x(n)|^2 = I^2 + Q^2
|x(n)|^4 = |x(n)|^2 . |x(n)|^2
|x(n)|^2k = |x(n)|^2 . |x(n)|^2(k-1)
[Figure: same input generation structure as in the forward path: x(n) = I + jQ passes through delay elements and multipliers to produce x(n), x(n) . |x(n)|^2, x(n) . |x(n)|^4, ..., x(n) . |x(n)|^2k]
Generating inputs to Systolic Array II
Comparison to input generation for forward path:
− Same function
− Lower speed requirement
Use NIOS with Multiply Custom Instruction
Alternative: Custom logic
− Timeshare multipliers more in feedback path: the forward path requires a result every clock cycle, whereas CORDIC in the feedback path takes n cycles to produce an output
− for Mixed or Discrete Mapping would require a new input once every n cycles (worst case)
Also feedback processing performed offline
− reduce multiplier requirement, and also throughput through the R matrix, by taking more clk cycles
Our Example
− Forward Path: generating inputs to FIR filters requires 4 18x18 multipliers (1 DSP block)
− Feedback Path: timesharing allows using a single 18x18 multiplier
Back Substitution using Nios
[Figure: Nios CPU with instruction (I) and data (D) ports to program & data memory, connected over the Avalon bus to I/O and to the CORDIC block. Data held for the computation: Rr[N][N], Ri[N][N], z'r[N], z'i[N], cr[N], ci[N]]
Back Substitution Example
Formula
$$c_N = \frac{z'_N}{R_{NN}}$$ [eq 7]
$$c_i = \frac{1}{R_{ii}} \left( z'_i - \sum_{j=i+1}^{N} R_{ij}\, c_j \right), \quad i = N-1, \dots, 1$$ [eq 8]
Example (with real numbers)
$$R = \begin{bmatrix} 2 & 3 \\ 0 & 5 \end{bmatrix}, \quad z' = \begin{bmatrix} 8 \\ 10 \end{bmatrix}$$
i.e., R is a 2 x 2 upper triangular matrix (N = 2)
$$c_2 = \frac{z'_2}{R_{22}} = \frac{10}{5} = 2$$ (from eq 7)
$$c_1 = \frac{1}{R_{11}} \left( z'_1 - R_{12}\, c_2 \right) = \frac{1}{2} \left( 8 - 3 \times 2 \right) = 1$$ (from eq 8)
Requires mainly multiply and divide operations
Back Substitution Results – Latency Comparison
[Figure: Back Substitution Algorithm. Clock cycle count (log scale, 100 to 1,000,000) versus N (4 to 32) for Nios 3.1, Nios II and Custom Logic; lower is better (lowest latency = best performance). Values annotated on the plot: 12000, 1500, 300]
Block Diagram
[Figure: complete solution. X(n) enters input generation, which produces the X(n), X(n)^3 and X(n)^5 terms for FIR filter 1 (C10, C11, C12), FIR filter 2 (C30, C31, C32) and FIR filter 3 (C50, C51, C52); each filter also shows a primed coefficient set (C'10 to C'12, C'30 to C'32, C'50 to C'52). The forward path continues through the D/A and analogue upconverter to the PA. The PA output returns through the analogue downconverter and A/D to a sample buffer on the Avalon bus, which also connects NIOS (with program & data memory, Multiply CI and Divide CI) and the CORDIC accelerator.]
Resource Estimate
Polynomial type | CORDIC blocks | LEs for CORDIC + NIOS | 18x18 multipliers | DSP blocks | Suitable Stratix device | Update delay (us)
K=2 (5th order), Q=2 (2 previous terms) | 2 | 4100 | 23 | 6 | 1S10 | 320
K=2 (5th order), Q=5 (5 previous terms) | 2 | 4100 | 42 | 11 | 1S30 | 900
Predistorter described by Memory Polynomial model
Coefficients determined in feedback path using
− QRD-RLS algorithm
− CORDIC used in Discrete Mapping Scheme to implement Systolic Array
− NIOS used for back substitution
Estimates do NOT include
− memory requirements (abundant supply in all Stratix devices)
− control logic
Accuracy
$$z(n) = \sum_{k=0}^{K} \sum_{q=0}^{Q} c_{(2k+1)q} \, x(n-q) \, |x(n-q)|^{2k}$$
Potential of large dynamic range of signals
− higher order polynomial terms require higher dynamic range
Rounding & truncating errors
− impact on coefficients & predistorted values needs consideration
Errors affect system performance
− increase in-band distortion
− increase out-of-band distortion
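The growth in dynamic range with polynomial order can be made concrete with a short calculation. The sketch below assumes a signed fixed-point input whose magnitude is at most 2^(bits-1) - 1 and asks how many magnitude bits the term x|x|^{2k} would need if nothing were rounded or truncated; the function name and this worst-case model are assumptions of the sketch, not figures from the slides.

```python
def term_magnitude_bits(input_bits, K):
    """Magnitude bits needed to hold |x|^(2k+1) exactly, for k = 0..K."""
    max_mag = 2 ** (input_bits - 1) - 1          # largest positive signed value
    return [(max_mag ** (2 * k + 1)).bit_length() for k in range(K + 1)]


widths = term_magnitude_bits(16, 2)              # 1st, 3rd and 5th order terms
```

For 16-bit inputs the 5th order term already spans far more bits than an 18x18 multiplier handles in one pass, which is why rounding or truncation of intermediate products has to be planned.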
Maintaining Accuracy
Fixed Point Solution
− Larger bit widths for CORDIC & in forward path intermediate products
Floating Point Solutions
− Forward Path: floating point operations (multiply, add)
− Feedback Path: implement Systolic array with floating point operations (NO CORDIC), or implement Systolic array with floating point CORDIC
Hybrid Solution
− Floating point operations: forward path; calculation of high order terms to be fed into Systolic array in feedback path
− Fixed point operations: CORDIC blocks (with floating point extension) in feedback path
Right Solution
− Depends on performance - cost requirements for individual customers
Resource comparison of options
Fixed Point Solution (e.g. extending bit widths from 16 to 32)
− CORDIC resources increase by ~250%
− Multipliers required increase by up to 300%
Floating Point Operators (multiply, divide, add)
− Require ~20-30% more resources than equivalent fixed point operators
− Employ the use of barrel shifters
Floating Point CORDIC vs Fixed Point CORDIC
− ~20-30% larger
− requires barrel shifters
− slower throughput as non-pipelined (serial architecture)
Summary
Stratix devices offer optimal resources for Polynomial DPD implementation
− DSP blocks
− Embedded RAM
QRD-RLS algorithm can be effectively addressed with Altera IP
− CORDIC
− NIOS with custom instructions
Flexible architecture capable of addressing a wide range of processing demands
Backup Slides
Least Squares using QRD
Assuming Y has been decomposed into Q, R
− Q is a unitary matrix
− R is an upper triangular matrix
Now,
$$\begin{aligned}
Yc &= z \\
Y^T Y c &= Y^T z \\
c &= (Y^T Y)^{-1} Y^T z \\
c &= \left( (QR)^T QR \right)^{-1} (QR)^T z \\
c &= (R^T R)^{-1} R^T Q^T z && (\text{since } Q^T Q = 1) \\
c &= R^{-1} (R^T)^{-1} R^T Q^T z \\
c &= R^{-1} Q^T z && (\text{since } (R^T)^{-1} R^T = 1) \\
\text{or} \quad Rc &= Q^T z
\end{aligned}$$
7th order Polynomial Resource Estimates
Example Conditions
4 UMTS carriers
Assumed sample rate into DPD = 140 Msps
16 or 18-bit inputs and coefficients
[Figure: spectrum of the 4 carriers centred at −7.5 MHz, −2.5 MHz, 2.5 MHz and 7.5 MHz, spanning −10 MHz to 10 MHz about 0 Hz (20 MHz bandwidth)]
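Example Conditions
Memory polynomial model (with both even and odd terms)
7th order with 2 previous samples for memory effects (K=7, Q=2)
7 FIR filters with 3 complex taps each
$$z(n) = \sum_{k=1}^{K} \sum_{q=0}^{Q} c_{kq} \, x(n-q) \, |x(n-q)|^{k-1}$$ eq(1)
$$\begin{aligned}
z(n) ={} & c_{10}\,x(n) + c_{20}\,x(n)|x(n)| + c_{30}\,x(n)|x(n)|^{2} + c_{40}\,x(n)|x(n)|^{3} + c_{50}\,x(n)|x(n)|^{4} + c_{60}\,x(n)|x(n)|^{5} + c_{70}\,x(n)|x(n)|^{6} \\
& + c_{11}\,x(n-1) + c_{21}\,x(n-1)|x(n-1)| + c_{31}\,x(n-1)|x(n-1)|^{2} + c_{41}\,x(n-1)|x(n-1)|^{3} + c_{51}\,x(n-1)|x(n-1)|^{4} + c_{61}\,x(n-1)|x(n-1)|^{5} + c_{71}\,x(n-1)|x(n-1)|^{6} \\
& + c_{12}\,x(n-2) + c_{22}\,x(n-2)|x(n-2)| + c_{32}\,x(n-2)|x(n-2)|^{2} + c_{42}\,x(n-2)|x(n-2)|^{3} + c_{52}\,x(n-2)|x(n-2)|^{4} + c_{62}\,x(n-2)|x(n-2)|^{5} + c_{72}\,x(n-2)|x(n-2)|^{6}
\end{aligned}$$
Each column of the expansion (terms with the same power of |x|) maps to one filter: FIR 1, FIR 2, FIR 3, FIR 4, FIR 5, FIR 6, FIR 7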
Total number of 18 x 18 multipliers
Number of FIR filter multipliers
− 7 x 3 = 21 complex multiplications
− Each complex multiplication requires 4 18x18 multipliers = 1 DSP block
− 21 DSP blocks (84 multipliers) required
− cannot time-share (for Stratix) as the data rate is 140 Msps
Multipliers for generating FIR inputs
− Forward Path: 14 multipliers + LEs for computing |x(n)| (time-sharing not possible for Stratix)
− Feedback Path: 1 multiplier (with time-sharing)
Total number of multipliers required
− 84 + 14 + 1 = 99 multipliers
Feedback path conditions
QR decomposition
− 2 CORDIC blocks
− 150 MHz clock
− Input matrix size 64 x 22
− Update delay = 504.27 μs
Back substitution
− 1 Nios processor
− 100 MHz clock
− 21 coefficients to compute
− Update delay = 544.51 μs
Total feedback delay = 1048.78 μs = 1.05 ms
Total Resource Estimate
Polynomial type | CORDIC blocks | LEs for CORDIC + NIOS | 18x18 multipliers | DSP blocks | Suitable Stratix, Stratix II devices | Update delay (ms)
K=7 (7th order), Q=2 (2 previous terms) | 2 | 4100 | 99 | 25 | 1S80 (using LEs for 3 DSP blocks) | 1.05
K=7 (7th order), Q=2 (2 previous terms) | 2 | 4100 | 50 (with time-sharing) | 13 | EP2S30 | less than 1.05
Predistorter described by Memory Polynomial model
Coefficients determined in feedback path using
− QRD-RLS algorithm
− CORDIC used in Discrete Mapping Scheme to implement Systolic Array
− NIOS used for back substitution
Estimates do NOT include
− memory requirements (abundant supply in all Stratix devices)
− control logic
Possible Queries by Customers
(do not include as a part of the presentation)
Can the Nios processor perform floating point operations?
We currently do not have a full floating point unit for the Nios. Any floating point calculations are done with the floating point library in software. In the event that you have a hardware integer multiplier as part of the Nios, then the software libraries will take advantage of this (and the resulting performance will increase).
That being said, we have done a subset floating point unit that does add, negate, absolute value, and multiply – but it’s more of an example than a supported part of the Nios package.
We would like to know if you would want to see floating point operations added to the Nios features
How many carriers can you address with this design?
The multipliers on Stratix can run at up to 280 MHz. For a 4-carrier UMTS system with a total bandwidth of 20 MHz and a polynomial of order 5, the required bandwidth would be 20 x 5 = 100 MHz, which by the Nyquist sampling theorem translates to an IF frequency of 200 MHz. This can be addressed with the multipliers. What order of polynomial are you looking at?
What stage are you at in your design? Do you have any performance numbers?
We are contemplating doing simulations to verify the performance. Would you be interested?
Possible Queries to Customers
(do not include as a part of the presentation)
What sort of PA model are you using?
What is the degree of the polynomial?
How many samples need to be considered for memory effect?
What is the adaptation algorithm you are considering in the feedback loop?
What is the update rate for the coefficients?
What is the IF frequency?
Are you operating in fixed or floating point?
− if fixed, what are the bit widths in the forward and feedback paths?
− for floating point, what representation are you using?
What performance are you targeting for your predistortion unit operating with your PA (i.e. reduction in EVM and increase in ACLR specifically due to predistortion only)?
Why are you specifically implementing polynomial solution as opposed to LUT based one?
What device (i.e. Altera, ASIC, Xilinx) are you targeting for your solution and why?
If not using Altera, what would persuade you to switch?