Microprocessor implementation of number-theoretic transforms
S.C.P. Martin and B.J. Stanier
Indexing terms: Computerised signal processing, Transforms
Abstract: Consideration is given to the suitability of microprocessor systems for the fast implementation of number theoretic transforms (n.t.t.s). Fast-multiply instructions available on some microprocessors, or the use of external multipliers, relax the basic constraints on the choice of a particular n.t.t. A search was made for suitable moduli which allow fast computation of n.t.t.s using Winograd's algorithm. The search was extended for other moduli which allow increased dynamic range when combined using the Chinese remainder theorem. Finally, a description is given of how modular arithmetic may efficiently be performed using microprocessors.
1 Introduction
A numerical procedure important in digital signal processing, the convolution operator, is defined by the relation
$$y_i = \sum_{j=0}^{N-1} x_j \, h_{(i-j) \bmod N}, \qquad i \in (0, N-1) \tag{1}$$
and is denoted by $y_i = x_i * h_i$.
Certain transforms (T) possess the cyclic-convolution property (c.c.p.) which may be stated as
$$T(y) = T(h) \times T(x)$$
where $\times$ denotes pointwise multiplication, or
$$y = T^{-1}\big(T(h) \times T(x)\big) \tag{2}$$
Hence an isomorphism exists between the convolution operator and the pointwise multiplication operator under such transforms as T.
Transforms with the d.f.t. structure possess the c.c.p. Let $X_k = T(x_j)$ and hence $x_j = T^{-1}(X_k)$, then
$$X_k = \sum_{j=0}^{N-1} x_j \alpha^{jk}, \qquad x_j = N^{-1} \sum_{k=0}^{N-1} X_k \alpha^{-jk}, \qquad j, k \in (0, N-1) \tag{3}$$
where $\alpha$ is an element of order N. It has been shown [2] that the d.f.t. is the only such transform defined in the complex domain. However, many such transforms exist which are defined in finite rings of integers ($Z_M$). These transforms are collectively known as number theoretic transforms (n.t.t.s). Agarwal and Burrus [2] have shown that constraints limit practical choices of n.t.t.s and for clarity they are cited here:
(a) N must divide O(M), where O(M) is defined to be the greatest common divisor of the set $\{p_i - 1\}$ taken over the prime divisors $p_i$ of M, i.e. $O(M) \triangleq \mathrm{g.c.d.}(p_i - 1)$
Paper T290 E, first received 28th July and in revised form 1st November 1978
Mr. Martin and Dr. Stanier are with the Department of Applied Physics and Electronics, Science Laboratories, South Road, Durham DH1 3LE, England
(b) $\alpha$ must be an element of order N, i.e. $\alpha^N = 1 \bmod M$ and $\alpha^r \neq 1 \bmod M \;\; \forall \, r \in (1, N-1)$
(c) $N^{-1}$, a multiplicative inverse of N, must exist in the ring $Z_M$.
(d) N should be well factored for fast algorithms to exist.
(e) To facilitate fast and simple arithmetic mod M, M must have a simple binary representation, and to facilitate fast multiplication by powers of $\alpha$, $\alpha$ must also have a simple binary representation.

Considerable interest has been shown in the literature in at least two classes of numbers as choices for moduli. Using Fermat numbers [1, 4, 5, 17], which are numbers of the form $F_t = 2^{2^t} + 1$, it is possible to perform transforms requiring only bit-shifts and additions. However, such moduli do suffer from disadvantages: microprocessor systems handle data most efficiently in multiples of 8 bits (full bytes). By their definition Fermat number moduli may require an (8k + 1)th bit. During processing this may be supplied by a carry flag, but for data storage in memory this requirement can cause embarrassment. For $F_5$ and larger moduli, it is generally acceptable to ignore the extra bit since it has only a low probability of being required. Finally, the maximum transform length available for use with Fermat moduli is severely limited by the constraint upon a simple binary representation of $\alpha$.
Mersenne numbers [6] (numbers of the form $2^q - 1$, q prime) have also been considered for moduli. However, the corresponding sequence lengths are not well factored and so do not have fast algorithms comparable to the f.f.t.
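As an illustration of constraints (a) to (c) and of the c.c.p., the following is a minimal sketch of a direct (not fast) n.t.t. convolution. The choice of the small Fermat modulus $F_2 = 17$ with N = 8 and $\alpha = 2$, and all function names, are assumptions made for this example only; the modular inverse via `pow(x, -1, M)` assumes Python 3.8 or later.

```python
def order(a, M):
    """Multiplicative order of a in Z_M (a is assumed to be a unit)."""
    k, p = 1, a % M
    while p != 1:
        p = (p * a) % M
        k += 1
    return k

def ntt(x, alpha, M):
    """Direct O(N^2) forward transform, X_k = sum_j x_j alpha^(jk) mod M."""
    N = len(x)
    return [sum(x[j] * pow(alpha, j * k, M) for j in range(N)) % M
            for k in range(N)]

def intt(X, alpha, M):
    """Inverse transform, x_j = N^-1 sum_k X_k alpha^(-jk) mod M."""
    N = len(X)
    n_inv = pow(N, -1, M)        # constraint (c): N must be invertible in Z_M
    a_inv = pow(alpha, -1, M)
    return [(n_inv * sum(X[k] * pow(a_inv, j * k, M) for k in range(N))) % M
            for j in range(N)]

def cyclic_convolve_direct(x, h, M):
    """Reference cyclic convolution computed from the definition."""
    N = len(x)
    return [sum(x[j] * h[(i - j) % N] for j in range(N)) % M for i in range(N)]

def cyclic_convolve_ntt(x, h, alpha, M):
    """Cyclic convolution through the c.c.p.: transform, multiply pointwise, invert."""
    X, H = ntt(x, alpha, M), ntt(h, alpha, M)
    return intt([(a * b) % M for a, b in zip(X, H)], alpha, M)

M, N, ALPHA = 17, 8, 2           # F_2 = 2^(2^2) + 1; alpha = 2 is a one-bit shift
assert (M - 1) % N == 0          # constraint (a), since O(M) = M - 1 for prime M
assert order(ALPHA, M) == N      # constraint (b)

x = [1, 2, 0, 1, 0, 0, 0, 0]
h = [1, 1, 1, 0, 0, 0, 0, 0]
y = cyclic_convolve_ntt(x, h, ALPHA, M)
```

The result is exact only while the true convolution values stay below M, which is why the dynamic range of the modulus matters in the later Sections.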
Recently, a new class of potentially fast and efficient Fourier transform algorithms, the Winograd Fourier transform algorithms (w.f.t.a.), has been described [7-11]. Since the Fourier transform and number theoretic transforms are alike in structure, any Fourier-transform algorithm may in principle be applied to n.t.t.s.
Winograd has shown how short-N transforms may be performed efficiently. He suggests a technique whereby short-N algorithms may be combined to provide an algorithm for a transform whose length is a product of the short-Ns, provided the short-Ns be relatively prime. Let
$$N = N_1 N_2 \tag{4}$$
where, as is well known [12], the length-N transform matrix $T_N$ satisfies
$$T_N = P_1 \, (T_{N_1} \otimes T_{N_2}) \, P_2 \tag{5}$$
ELECTRONIC CIRCUITS AND SYSTEMS, JANUARY 1979, Vol. 3, No. 1 21
0308-6984/79/010021 + 06 $01-50/0
and $\otimes$ denotes the Kronecker product of matrices and $P_1$, $P_2$ are permutation matrices.
Silverman [8] has described short-N algorithms for N = 5, 7, (3 and 9), (2, 4, 8 and 16), and using the nested nature of eqn. 4 various transform lengths are possible between 2 and 5040. Reference 8 shows how the permutation matrices $P_1$ and $P_2$ may be derived. In a microprocessor implementation the permutation sequences may conveniently be stored in read only memory.
2 Microprocessors and n.t.t.s
Integer arithmetic can be performed on microprocessor systems more easily than real arithmetic, and using such systems n.t.t.s are in principle easier to implement than the Fourier transform for convolution. Further, microprocessors are becoming available with fast-multiply instructions, and for those that do not have this facility, fast hardware multiplier chips are available [13]. These trends allow fast generalised multiplications, and make such operations competitive with repeated bit-shift and subtract in carry operations such as are required with Fermat number transforms. By waiving constraint (e) and allowing nonsimple moduli and $\alpha$s, many more n.t.t.s become practicable for microprocessor implementation.
A search was made, as outlined in Reference 10, for suitable moduli that would satisfy constraints (a) to (d) and additionally would support general Winograd transform algorithms, allowing nonsimple Nth roots of unity. Since data is handled most efficiently in full bytes, this search was conducted from $2^{16}$ downwards.
It was found that the modulus M = 65521 satisfied all the constraints, and in particular it has two desirable properties:
(a) O(65521) = 65520 = (5 × 7 × 9 × 16) × 13. As can be seen, this modulus will support any Winograd transform length from 2 to 5040.
(b) The modulus is very close to $2^{16}$ ($65521 = 2^{16} - 15$), so little redundancy is incurred in the use of 16-bit arithmetic.
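Property (a) can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, primality is tested by plain trial division, and the set of Winograd lengths is taken to be every product of one factor from each of (1, 2, 4, 8, 16), (1, 3, 9), (1, 5) and (1, 7), following the short-N lengths listed earlier.

```python
M = 65521

def is_prime(n):
    """Trial division; adequate for 16-bit moduli."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def prime_factors(n):
    """Set of distinct prime factors of n."""
    out, d = set(), 2
    while d * d <= n:
        while n % d == 0:
            out.add(d)
            n //= d
        d += 1
    if n > 1:
        out.add(n)
    return out

def element_of_order(N, M):
    """Return some alpha of exact order N in Z_M, for prime M with N | M - 1."""
    assert is_prime(M) and (M - 1) % N == 0
    for a in range(2, M):
        alpha = pow(a, (M - 1) // N, M)       # alpha^N = a^(M-1) = 1 mod M
        # the order is exactly N iff alpha^(N/p) != 1 for every prime p | N
        if all(pow(alpha, N // p, M) != 1 for p in prime_factors(N)):
            return alpha
    raise ValueError("no element of order N found")

# Every product of mutually prime short lengths, excluding the trivial length 1.
WINOGRAD_LENGTHS = sorted(
    a * b * c * d
    for a in (1, 2, 4, 8, 16) for b in (1, 3, 9) for c in (1, 5) for d in (1, 7)
    if a * b * c * d > 1
)

supported = all((M - 1) % N == 0 for N in WINOGRAD_LENGTHS)
alpha60 = element_of_order(60, M)             # a root for the length-60 transform
```

Because M is prime, O(M) = M - 1 and $N^{-1}$ always exists, so constraint (a) reduces to N dividing 65520.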
Since any specific-N transform algorithm would be computationally more efficient than a general-N algorithm, we derived various algorithms for specific transform lengths. However, it was found that a general-N program was an invaluable tool for such algorithm development, since the program derives the multiplication coefficients, the pointers into workspace arrays and the permutation sequences required for the specific-N algorithms. This design technique allows rapid development of the specific-N algorithms. Certain points are of interest concerning the general-N program and the subsequent specific-N algorithms.
(a) Since the ultimate aim is to derive algorithms for a microprocessor environment where memory workspace should be minimised, the suggestions for reducing memory requirement given in Reference 8 were heeded.
(b) Arithmetic would generally be complex because the Fourier transform is defined in the complex domain. However, the algorithms described in Reference 8 involve only purely real or purely imaginary data, and so in the inner stages of their computation, certain savings are made by keeping flags to denote the data type. The concepts of real, imaginary and complex do not strictly apply in the number theoretic sense, and so it was not necessary for such flags to be kept.
(c) The points given in Reference 10 concerning the interpretation in the number theoretic sense of the trigonometrical expressions referred to in Reference 8, and the incorporation of the $N^{-1}$ normalising factor into the multiplication coefficients for the inverse transform, were also heeded.
The general-N program was run on an IBM 370, and the results of convolutions performed by such n.t.t.s were compared with reference techniques (direct-integer convolution with short lengths, and a Fourier transform technique for long-convolution lengths). In all cases, the n.t.t. convolutions gave results in exact agreement with the direct-convolution technique. The Fourier-transform technique produces 'real' results subject to roundoff error, and within the limits of accuracy of such a technique, the results of the n.t.t. convolutions were also in exact agreement.
A transform of length 60 was written for an Intel 8080 microprocessor using the FORTH programming technique [14], and this algorithm was employed in a real-time bandstop filtering application.
3 Extension to complex filtering
Vanwormhoudt [16] has shown that there exist two main classes of moduli. Since all moduli M that are useful for number theoretic transforms are odd, the two classes are those for which M = 1 mod 4 (type A) and those for which M = 3 mod 4 (type B). In Reference 16 it is shown that for type A moduli there exists an element j in the ring $Z_M$ such that $j^2 = -1 \bmod M$. This is an alternative to showing that there exists an element of order 4. It is also shown that no such element exists for type B moduli.
In References 15 and 17 it is shown that complex convolutions can be efficiently performed through two real convolutions. The ideas expressed in References 15 and 17 can be generalised for any type A modulus.
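Let $a_i = \bar{a}_i + j\hat{a}_i$
where a = complex sequence
$\bar{a}_i$ = real part of $a_i$
$\hat{a}_i$ = imaginary part of $a_i$
j = element such that $j^2 = -1 \bmod M$
The two real convolutions required to compute y where $y_i = x_i * h_i$ are given in eqn. 6:
$$\begin{aligned} \bar{y}_i + j\hat{y}_i &= ax_i * ah_i \bmod M \\ \bar{y}_i - j\hat{y}_i &= bx_i * bh_i \bmod M \end{aligned} \tag{6}$$
where the following are defined:
$$\begin{aligned} ax_i &= \bar{x}_i + j\hat{x}_i \bmod M; & ah_i &= \bar{h}_i + j\hat{h}_i \bmod M \\ bx_i &= \bar{x}_i - j\hat{x}_i \bmod M; & bh_i &= \bar{h}_i - j\hat{h}_i \bmod M \end{aligned} \tag{7}$$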
In this paper much interest is shown in the modulus M = 65521 for which the element j = 24297 satisfies $j^2 = -1 \bmod M$.
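The scheme of eqns. 6 and 7 can be sketched as follows for M = 65521 and j = 24297. This is an illustration rather than the authors' implementation: the real convolutions are computed directly instead of by n.t.t., the function names are hypothetical, and the final separation of the real and imaginary parts (using $2^{-1}$ and $j^{-1} = -j$) is an assumed step that the text does not spell out.

```python
M = 65521
J = 24297                    # value quoted in the text; J*J = -1 mod M
HALF = (M + 1) // 2          # 2^-1 mod M

def cyclic_convolve(x, h):
    """Real cyclic convolution mod M, computed from the definition."""
    N = len(x)
    return [sum(x[k] * h[(i - k) % N] for k in range(N)) % M for i in range(N)]

def complex_convolve_two_real(xr, xi, hr, hi):
    """Complex cyclic convolution via the two real convolutions of eqn. 6."""
    ax = [(r + J * i) % M for r, i in zip(xr, xi)]   # eqn. 7
    ah = [(r + J * i) % M for r, i in zip(hr, hi)]
    bx = [(r - J * i) % M for r, i in zip(xr, xi)]
    bh = [(r - J * i) % M for r, i in zip(hr, hi)]
    s = cyclic_convolve(ax, ah)                      # real + J * imag
    d = cyclic_convolve(bx, bh)                      # real - J * imag
    yr = [(HALF * (p + q)) % M for p, q in zip(s, d)]
    # (2J)^-1 = -J * 2^-1, because J * (-J) = 1 mod M
    yi = [(-J * HALF * (p - q)) % M for p, q in zip(s, d)]
    return yr, yi

def complex_convolve_direct(xr, xi, hr, hi):
    """Reference: four real convolutions, (xr + i xi) * (hr + i hi)."""
    N = len(xr)
    yr = [sum(xr[k] * hr[(i - k) % N] - xi[k] * hi[(i - k) % N]
              for k in range(N)) % M for i in range(N)]
    yi = [sum(xr[k] * hi[(i - k) % N] + xi[k] * hr[(i - k) % N]
              for k in range(N)) % M for i in range(N)]
    return yr, yi

xr, xi = [1, 2, 3, 4], [0, 1, 0, 5]
hr, hi = [2, 0, 1, 0], [1, 1, 0, 3]
result = complex_convolve_two_real(xr, xi, hr, hi)
```

Negative results appear as residues mod M, so the sign must be recovered by the usual convention of treating values above M/2 as negative.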
4 Extension to other moduli
For a given modulus the output must be limited to avoid overflow; hence a compromise exists between the data amplitude and the filter impulse digitisation [2]. In general, the digitisation will degrade the filter response, and so for a given choice of modulus there may be insufficient dynamic range for the filter design to achieve the limits set. In such cases, a larger modulus should be chosen;
Table 1: Table of dual moduli to pair with M = 65521 using c.r.t.

Convolution   Dual      Primary Winograd    Multidimensional
length        modulus   transform length    factor required
6             65497     6
10            65381     10
12            65497     12
14            65437     14
15            65101     15
18            65449     18
20            65381     20
21            65437     21
24            65497     24
28            65437     28
30            65101     30
35            65381     35
36            65449     36
40            64921     40
40            65497     8                   5
42            65437     42
45            64621     45
45            65449     9                   5
48            65281     48
56            65353     56
60            65101     60
63            65269     63
70            65381     70
72            65449     72
80            64081     80
80            65393     16                  5
84            65437     84
90            64621     90
90            65449     18                  5
105           65101     105
112           64849     112
112           65393     16                  7
120           64921     120
120           65497     24                  5
126           65269     126
140           65381     140
144           65089     144
168           65353     168
180           64621     180
180           65449     36                  5
210           65101     210
240           64081     240
240           65281     48                  5
252           65269     252
280           63841     280
280           65381     35                  8
315           59221     315
315           65381     35                  9
336           64849     336
336           65281     48                  7
360           64081     360
360           65449     72                  5
420           65101     420
504           64513     504
504           65449     72                  7
560           63841     560
560           65281     16                  5*7
630           59221     630
630           65381     70                  9
720           64081     720
720           65089     144                 5
840           63841     840
840           65353     168                 5
1008          64513     1008
1008          65089     144                 7
1260          59221     1260
1260          65381     140                 9
1680          63841     1680
1680          65281     48                  5*7
2520          55441     2520
2520          65449     72                  5*7
5040          55441     5040
5040          65089     144                 5*7
however, direct implementation of such a scheme would involve performing arithmetic with greater wordlength. This problem may be circumvented by the use of the Chinese remainder theorem (c.r.t.) [11]. This theorem states that if an integer x is such that $x = a_i \bmod m_i$, where the moduli $m_i$ are relatively prime, then
$$x = a_1\Big(b_1 \frac{M}{m_1}\Big) + a_2\Big(b_2 \frac{M}{m_2}\Big) + \cdots + a_n\Big(b_n \frac{M}{m_n}\Big) \bmod M \tag{8}$$
where
$$M = \prod_{i=1}^{n} m_i$$
The $b_i$ are defined such that
$$b_i \frac{M}{m_i} = 1 \bmod m_i \tag{9}$$
An interpretation of this theorem shows that if calculations are performed with respect to two or more relatively prime moduli ($m_1$ and $m_2$), then by using the c.r.t. the results may be determined mod ($m_1 m_2$). This technique has great potential for utilising a parallel processing technique.

Let us consider the case for two moduli
$$x = a_1 c_1 + a_2 c_2 \bmod M \tag{10}$$
where
$$c_1 = b_1 \frac{M}{m_1} \quad \text{and} \quad c_2 = b_2 \frac{M}{m_2}$$
Let
$$x = 1 \quad \text{and so} \quad a_1 = 1 = a_2$$
therefore
$$1 \cdot c_1 + 1 \cdot c_2 = 1 \bmod M$$
or
$$c_1 + c_2 = 1 \bmod M \tag{11}$$
This result will be discussed later.

A search was made for other moduli that would combine with 65521 over specific transform lengths. The search was conducted for various transform lengths by scanning from $2^{16}$ downwards for moduli, other than 65521, for which constraints (a) to (d) would be satisfied. These results are shown in Table 1.
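A search of this kind can be sketched in a few lines. The version below is an assumption about the procedure, not a record of it: it considers prime moduli only (so that O(M) = M - 1 and constraints (b) and (c) can always be met once N divides M - 1), and the function names are hypothetical.

```python
def is_prime(n):
    """Trial division; adequate for moduli below 2^16."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def dual_modulus(N, exclude=65521, limit=1 << 16):
    """Largest prime M < limit, other than `exclude`, with N dividing M - 1."""
    for M in range(limit - 1, 2, -1):
        if M != exclude and (M - 1) % N == 0 and is_prime(M):
            return M
    return None

m_5040 = dual_modulus(5040)   # modulus that supports length 5040 directly
m_144 = dual_modulus(144)     # modulus for a length-144 primary transform
```

Under these assumptions the search gives 55441 for N = 5040 and 65089 for N = 144, in agreement with the corresponding entries of Table 1.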
It can be seen from the last two entries in Table 1 that the highest suitable modulus below $2^{16}$ that will directly support a transform length of 5040 is M = 55441. The choice of such a low modulus is undesirable, since a great loss occurs of the possible dynamic range of $2^{16}$.
Agarwal and Cooley [3] described how the c.r.t. may also be employed to convert a 1-dimensional cyclic convolution to a multidimensional convolution which is cyclic in all dimensions. This may be applied to cases where a given long convolution length is a product of shorter mutually prime convolution lengths. They cite an example whereby a number-theoretic-transform technique can be used for convolution of length N, and by using the c.r.t. in this manner, convolutions of length $(N p_1 p_2 \ldots p_j)$ may be computed provided $N, p_1, p_2, \ldots, p_j$ are mutually prime. Efficient algorithms are described in Reference 3 for convolutions of lengths 5 and 7, (2, 4, 8) and (3, 9). Using such a scheme, it is possible to perform n.t.t. convolutions of length 144, and using the multidimensional mapping technique it is possible to derive convolutions of length 144 × 5 × 7 (= 5040). Constraints upon the choice of modulus arise only from the n.t.t. length used, and not from the multidimensional factors employed. Therefore, modulus M = 65089 (this is the choice for a length 144 transform) may be used for convolutions of length 5040, even though this modulus does not support such a transform length directly, by taking n.t.t. convolutions of length 144 and using the c.r.t. mapping technique to compute the 5040 length convolutions.
This technique can be used to factor transform lengths with inefficient moduli to lengths with more efficient moduli. The other entries in the Table have been derived in a similar manner.
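The index mapping behind this technique can be illustrated on a small case. The sketch below maps a length-15 cyclic convolution onto a 3 × 5 two-dimensional cyclic convolution using the c.r.t. index map n to (n mod $N_1$, n mod $N_2$). The function names and the small lengths are assumptions for the example, and the two-dimensional convolution is evaluated directly here, whereas in the scheme described above one dimension would be handled by an n.t.t.

```python
from math import gcd

def cyclic_convolve(x, h, M):
    """One-dimensional cyclic convolution mod M, from the definition."""
    N = len(x)
    return [sum(x[j] * h[(i - j) % N] for j in range(N)) % M for i in range(N)]

def cyclic_convolve_2d_map(x, h, N1, N2, M):
    """Same result via a 2-D cyclic convolution; needs gcd(N1, N2) = 1."""
    N = N1 * N2
    assert gcd(N1, N2) == 1 and len(x) == len(h) == N
    X = [[0] * N2 for _ in range(N1)]
    H = [[0] * N2 for _ in range(N1)]
    for n in range(N):                      # c.r.t. index map, one-to-one
        X[n % N1][n % N2] = x[n]
        H[n % N1][n % N2] = h[n]
    Y = [[0] * N2 for _ in range(N1)]
    for i1 in range(N1):
        for i2 in range(N2):
            Y[i1][i2] = sum(
                X[j1][j2] * H[(i1 - j1) % N1][(i2 - j2) % N2]
                for j1 in range(N1) for j2 in range(N2)
            ) % M
    return [Y[n % N1][n % N2] for n in range(N)]   # map back to one dimension

M = 65089
x = list(range(1, 16))
h = [(7 * n * n + 3) % M for n in range(15)]
y_2d = cyclic_convolve_2d_map(x, h, 3, 5, M)
```

The map works because subtraction of indices mod 15 corresponds to independent subtraction mod 3 and mod 5, so the modulus is constrained only by whichever dimension is computed by n.t.t.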
It is advantageous if the two moduli to be combined are used with the same transform lengths, since the same permutation sequences and essentially the same algorithms are used for both moduli, leading to economy in memory utilisation.
5 Microprocessors and modular arithmetic
The moduli previously described are optimally close to $2^{16}$ and therefore only a small duplication of values arises if all 16-bit binary patterns are allowed on input and output with an arithmetic procedure. In all examples in this Section the choice of M = 65521 will be taken.
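(a) Addition mod M: when two numbers are added together they may or may not generate an overflow. If there is no overflow then the result of the addition is returned. If a carry is generated, then
$$\begin{aligned} x = a + b &= c + \text{carry} \\ &= c + 2^{16} \\ &= c + (2^{16} - M) \\ &= c + 15 \bmod 65521 \end{aligned} \tag{12}$$
where c is the 16-bit partial sum.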
Therefore, if a carry is detected, 15 must be added into the partial sum. This may generate a further carry, but will not generate more than two carries. An example of suitable coding for such an operation is given for an Intel 8080 microprocessor.
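PLUS:   DAD  D          ; HL <- HL + DE, carry set on 16-bit overflow
        RNC             ; no carry: the sum is in HL, return
        LXI  D,15       ; carry: load the correction 2^16 - M = 15
        JMP  PLUS       ; add it in and test for a further carry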
This subroutine will add mod 65521 the two 16-bit numbers in register pairs (DE) and (HL), returning the answer in (HL).
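The behaviour of the routine can be modelled in a higher-level language to check the carry argument. The following is a sketch only; `add_mod` is a hypothetical name, and the loop mirrors the jump back to PLUS.

```python
M = 65521
MASK = 0xFFFF

def add_mod(a, b):
    """Model of PLUS: 16-bit add, folding each carry back in as 2^16 - M = 15."""
    carries = 0
    while True:
        s = a + b
        a, carry = s & MASK, s >> 16      # partial sum and carry flag
        if not carry:
            return a, carries             # any 16-bit pattern may be returned
        carries += 1
        b = 15                            # corresponds to LXI D,15 then JMP PLUS

worst, worst_carries = add_mod(0xFFFF, 0xFFFF)
```

The returned pattern is congruent to a + b mod M but is not necessarily below M, which is consistent with allowing all 16-bit patterns on input and output.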
(b) Subtraction mod M: for microprocessors with 16-bit subtraction instructions, the 16-bit addition instruction in the previous example may be directly replaced by such an instruction. However, few 8-bit microprocessors have such
instructions, and so for the majority a byte-orientated subtraction should be used.
(c) Multiplication mod M: this operation can be classified into subdivisions:
(i) Multiplication of a 16-bit number (x) by a 16-bit fixed constant (k).
This can be considered to be a mapping from a 16-bit number to another 16-bit number. Since multiplication is a distributive operation, the mapping may be achieved by treating separately the high and low bytes of x, and finally adding their respective outputs together. Fig. 1 shows how such a mapping may be achieved. The obvious design requires two $2^8 \times 16$-bit tables which may conveniently be held in read only memory (r.o.m.). If, however, the memory is restructured to be byte orientated, then a saving is obtained since part of the memory $R_1$ is duplicated in $R_2$. This minimised structure is shown in Fig. 2. R.O.M.s R are read once only and r.o.m. R' is read twice. The read operations are controlled by the microprocessor. A final modular add operation is required to bound the output within the 16-bit range.
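The unminimised two-table arrangement of Fig. 1 can be sketched as below. The table layout and names are assumptions made for illustration, the byte-orientated saving of Fig. 2 is not modelled, and the final modular add is written as a compare-and-subtract rather than the carry-folding routine given earlier.

```python
M = 65521

def build_tables(k):
    """Two 256-entry tables of 16-bit words for multiplication by constant k."""
    hi = [((h << 8) * k) % M for h in range(256)]   # contribution of the high byte
    lo = [(l * k) % M for l in range(256)]          # contribution of the low byte
    return hi, lo

def mul_const(x, tables):
    """x * k mod M for a 16-bit x, using two look-ups and one modular add."""
    hi, lo = tables
    s = hi[x >> 8] + lo[x & 0xFF]       # distributivity: x = 256*high + low
    return s - M if s >= M else s       # final modular add

TABLES_J = build_tables(24297)          # e.g. the constant j of Section 3
product = mul_const(12345, TABLES_J)
```

Each table holds 256 two-byte words, so this unminimised form occupies four 256-byte pages per constant.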
Fig. 1 Multiplication by fixed constant

Fig. 2 Memory minimisation for fixed constant multiplication

Fig. 3 Multiplication of 16-bit number by 32-bit constant
(ii) Multiplication of a 16-bit number by a 32-bit constant.
When combining results using two moduli, the Chinese remainder theorem requires that calculations be performed of the form
$$x = a_1 c_1 + a_2 c_2 \bmod M \tag{13}$$
where $a_1$ and $a_2$ are results mod $m_1$ and mod $m_2$, respectively, and $c_1$ and $c_2$ are 32-bit constants derived using the c.r.t. Therefore, this class of operation is useful for deriving results mod M ($M = m_1 m_2$) given the results mod $m_1$ and mod $m_2$. A memory reduction similar to that described previously can also be applied. A suitable structure for performing this operation is shown in Fig. 3. R.O.M.s R are read once and r.o.m.s R' are read twice. A final modular add is required to bound the output within the 32-bit range. This can be achieved using the same principle as has been described for the 16-bit modular add operation.
In general, two such operations would be required for multiplications by $c_1$ and $c_2$. However, if the result of eqn. 11 is recalled, then it will be seen that
$$1 = c_1 + c_2 \bmod M$$
$$\begin{aligned} x &= a_1 c_1 + a_2 (1 - c_1) \\ &= c_1 (a_1 - a_2) + a_2 \bmod M \end{aligned} \tag{14}$$
From eqn. 8 it can be seen that
$$c_1 m_1 = 0 \bmod M$$
and so
$$x = c_1 (k m_1 + a_1 - a_2) + a_2 \bmod M \tag{15}$$
If k is chosen such that $(k m_1 + a_1 - a_2)$ lies in the range zero to $2^{16}$, then only one 16 × 32-bit multiplication is required to derive x from eqn. 15. A flowchart to achieve this condition is shown in Fig. 4.
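Eqns. 11 and 15 can be checked with a short sketch. The pairing of 65521 with the dual modulus 65497 (the first entry of Table 1), the choice of k as 0 or 1, and the names used are assumptions for the example; `pow(x, -1, m)` assumes Python 3.8 or later.

```python
M1, M2 = 65521, 65497
M = M1 * M2

B1 = pow(M2, -1, M1)         # b1 * (M / m1) = 1 mod m1, with M / m1 = m2
B2 = pow(M1, -1, M2)         # b2 * (M / m2) = 1 mod m2, with M / m2 = m1
C1 = (B1 * M2) % M           # 32-bit constants of eqn. 10
C2 = (B2 * M1) % M

def crt_combine(a1, a2):
    """Recover x mod M from a1 = x mod m1 and a2 = x mod m2 using eqn. 15."""
    d = a1 - a2
    if d < 0:
        d += M1              # k = 1 brings (k*m1 + a1 - a2) into 0 .. 2^16
    return (C1 * d + a2) % M # a single 16 x 32-bit multiplication by c1

x_true = 123456789
x_back = crt_combine(x_true % M1, x_true % M2)
```

Only $c_1$ needs to be stored, since adding a multiple of $m_1$ to the bracketed term leaves the result unchanged mod M.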
(iii) Multiplication of two 16-bit variables mod M.
The multiplication of two 16-bit numbers generates a 32-bit answer.
Let
$$\begin{aligned} y = ab &= 2^{16} y_h + y_l, \qquad 0 \le a, b, y_h, y_l < 2^{16} \\ &= (2^{16} - M)\, y_h + y_l \bmod 65521 \end{aligned} \tag{16}$$
Therefore, a general 32-bit intermediate answer may be partially reduced mod M by a multiplication of $y_h$ by the fixed constant $(2^{16} - M)$. This can be achieved by the scheme described in (c), part (i).
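The partial reduction of eqn. 16 can be sketched as follows. Here the multiplication of $y_h$ by 15 is done directly rather than by look-up table, and the reduction is simply repeated until the value fits in 16 bits; both are simplifying assumptions, and `mul_mod` is a hypothetical name.

```python
M = 65521
K = (1 << 16) - M            # the fixed constant 2^16 - M = 15

def mul_mod(a, b):
    """Product of two 16-bit values, reduced to a 16-bit pattern mod M."""
    y = a * b                               # up to 32 bits
    while y >> 16:                          # high word still non-zero
        y = K * (y >> 16) + (y & 0xFFFF)    # eqn. 16: 2^16 * yh = 15 * yh mod M
    return y                                # congruent to a*b, may exceed M - 1

p = mul_mod(65535, 65535)
```

Each pass lowers the value by 65521 times the high word, so the loop terminates after a few iterations for any 32-bit product.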
(iv) Multiplication by $2^{-1}$.
This is the analogous operation to division by 2.
Let $y = 2^{-1} x$
By applying a shift right to x we may determine from the carry flag if x was even or odd. If x was even then we have already determined y. If x was odd then y may be derived by adding in (M + 1)/2.
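A minimal sketch of this halving step is given below; `halve` is a hypothetical name and the bit test stands in for the carry flag after the shift.

```python
M = 65521

def halve(x):
    """Return a 16-bit y with 2*y = x mod M, using a shift and a conditional add."""
    y = x >> 1                   # shift right; the bit shifted out marks odd x
    if x & 1:
        y += (M + 1) // 2        # odd x: (x - 1)/2 + (M + 1)/2 = (x + M)/2
    return y

y_odd = halve(7)
```

For odd x the sum equals (x + M)/2, which is an integer because M is odd, and it stays within 16 bits for every 16-bit x.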
The procedures described cover the general classes of arithmetic required for convolution performed within the ring $Z_M$, where M is a product of 65521 and one of the moduli shown in Table 1.
(d) Table 2 summarises the memory requirements for the multiplication look-up tables. The unit of one page is used to denote a quantity of memory of 256 bytes. In the
case of multiplications of two 16-bit variables, 3 pages are required for the Table to derive the term corresponding to the multiplication of the higher 16-bit intermediate answer with the fixed constant $(2^{16} - M)$. The multiplication by j can be most efficiently performed by the technique described in Section 5, (c), part (i), and this requires an additional 3 pages.
Fig. 4 Flowchart to perform c.r.t. combination
Table 2: Summary of memory requirements for multiplication look-up tables

                                     Memory pages required
                                     For each modulus    For 2 moduli
Real convolutions
  Generalised multiplications        3                   6
  C.R.T. combination                 -                   5
  Total                              3                   11
Complex convolutions
  Generalised multiplications        3                   6
  Multiplication by j                3                   6
  C.R.T. combination                 -                   5
  Total                              6                   17
6 Conclusions
With a view to utilising a microprocessor-based system for convolution, we have considered the constraints which govern the choice of a particular number theoretic transform (n.t.t.). We have found that recent hardware developments relax these basic constraints, allowing fast computation of much wider classes of n.t.t.s. In particular, the Winograd Fourier transform algorithm (w.f.t.a.) can be applied to n.t.t.s and the work in this paper has been concerned with implementing the w.f.t.a. using moduli just less than $2^{16}$. The modulus 65521 was found to have very desirable properties from this aspect since it will support any Winograd transform length. We have also considered other moduli to combine with 65521 using the Chinese remainder theorem for applications which require a greater dynamic range.
It can be seen that microprocessor systems can be used effectively to provide the hardware for small-scale convolution requirements, and therefore such systems can be used wherever digital convolution is required. The range of possible applications is broad, covering the computation of auto and cross correlation and power spectra, nonrecursive and recursive digital signal processing, and the solution of difference equations.
7 References
1 AGARWAL, R.C., and BURRUS, C.S.: 'Fast convolution using Fermat number transforms with applications to digital filtering', IEEE Trans., 1974, ASSP-22, pp. 87-99
2 AGARWAL, R.C., and BURRUS, C.S.: 'Number theoretic transforms to implement fast digital convolution', Proc. IEEE, 1975, 63, pp. 550-560
3 AGARWAL, R.C., and COOLEY, J.W.: 'New algorithms for digital convolution', IEEE Trans., 1977, ASSP-25, pp. 392-410
4 MELHUISH, P.: 'Fermat transform implementation by a minicomputer', Electron. Lett., 1975, 11, pp. 109-111
5 McCLELLAN, J.H.: 'Hardware realisation of a Fermat number transform', IEEE Trans., 1976, ASSP-24, pp. 216-225
6 RADER, C.M.: 'Discrete convolution via Mersenne transforms', ibid., 1972, C-21, pp. 1269-1273
7 WINOGRAD, S.: 'On computing the discrete Fourier transform', Proc. Nat. Acad. Sci., 1976, 73, pp. 1005-1006
8 SILVERMAN, H.F.: 'An introduction to programming the Winograd Fourier transform algorithm (WFTA)', IEEE Trans., 1977, ASSP-25, pp. 152-165
9 SILVERMAN, H.F.: 'Corrections and an addendum to An introduction to the Winograd Fourier transform algorithm (WFTA)', ibid., 1978, ASSP-26, p. 268
10 BAILEY, D.: 'Winograd's algorithm applied to number theoretic transforms', Electron. Lett., 1977, 13, pp. 548-549
11 SRIDHAR REDDY, N., and UMPATHI REDDY, V.: 'Implementation of Winograd's algorithm in modular arithmetic for digital convolutions', ibid., 1978, 14, pp. 228-229
12 NICHOLSON, P.J.: 'Algebraic theory of finite Fourier transforms', J. Comput. & Syst. Sci., 1971, 5, pp. 524-547
13 DAVIES, A.C., and FUNG, Y.T.: 'Interfacing a hardware multiplier to a general purpose microprocessor', Microprocessors, 1977, 1, pp. 425-432
14 MOORE, C.H.: 'FORTH: A new way to program a mini-computer', Astron. & Astrophys., 1974, 15, pp. 497-511
15 SRIDHAR REDDY, N., and UMPATHI REDDY, V.: 'Complex convolutions using rectangular transforms', Electron. Lett., 1978, 14, pp. 458-459
16 VANWORMHOUDT, M.C.: 'Structural properties of complex residue rings applied to number theoretic Fourier transforms', IEEE Trans., 1978, ASSP-26, pp. 99-104
17 NUSSBAUMER, H.J.: 'Complex convolutions via Fermat number transforms', IBM J. Res. & Dev., 1976, 20, pp. 282-284