7/30/2019 01354396


The 47th IEEE International Midwest Symposium on Circuits and Systems

A Regular Parallel RSA Processor

Qiang Liu, Fangzhen Ma², Dong Tong, Xu Cheng¹

1. The Department of Computer Science and Technology, Peking University, Beijing, P. R. China, 100871

2. School of Information Management, University of International Business and Economics, Beijing, P. R. China, 100029

Abstract- A high-performance VLSI implementation of the RSA algorithm using a systolic array is presented. High-speed applications of RSA systems require parallel implementations of modular multipliers. Besides using the systolic architecture, which is popular in hardware-based RSA systems, a block-based scheme is used to further eliminate global signals, with a pipelined bus to convey data globally. The control signals and intermediate results used for sequential multiplications are transmitted by shift registers. All signals, except for the clock signal, are confined to one block or run between two adjacent blocks. A Carry-Save-Adder structure is used for calculating the iterative step of the algorithm, which contributes to speed improvement and area saving. In addition, long modular multipliers suffer from the effect of large fanout. Novel architectures are proposed to eliminate the fanout bottleneck, which reduces the achievable minimum clock period of long modular multipliers. Compared to the original modular multiplier architecture with the fanout bottleneck, the proposed architectures achieve an increase of over 7% in throughput without an increase in area. The Chinese Remainder Theorem (CRT) technique increases the decryption data rate by a factor of four. Two redundant blocks are added to adapt to the on-line partition of the multiplier and the variation of the length of P and Q in CRT mode.

I. INTRODUCTION

The increasing use of e-commerce and the growing demand for secure communications over the Internet have led to a great demand for reliable high-speed security products. The RSA public-key cryptosystem, named after its inventors R. Rivest, A. Shamir, and L. Adleman [1], is widely used to provide both secrecy and digital signatures, and its security is based on the difficulty of the integer factorization problem [2]. An RSA operation is in essence a modular exponentiation, which in turn can be calculated by repeated multiplication under the same modulus. The bit length of the modulus and exponent is so large (in excess of 1024 now) that developing an inexpensive hardware device for real-time RSA operation is a big challenge.

The most popular method for modular multiplication is Montgomery's modular multiplication algorithm [3], which converts successive subtractions and comparisons to additions. The difficulty of implementing the Montgomery algorithm lies in the additions of long operands. Previous RSA designs have mainly focused on two particular strategies [4]. Firstly, the intermediate results are kept in redundant form to avoid long carry propagation [5], [6], in which a linear array is used. This implementation suffers from the problem of signal broadcasting and amplification [5]. Secondly, a systolic array is employed [7], [8], which eliminates the broadcasting problem by distributing carries between adjacent processing units. The drawbacks are higher latencies and possibly more resources. In Deep Sub-micron (DSM) technology, the delays and wiring associated with global data movement impose the restriction of neighbor communication. For Montgomery's algorithm, a systolic array makes this possible. For long-bit modular arithmetic, the broadcasting issue for signals means that the systolic implementation of Montgomery's algorithm should be faster.

Although the systolic implementation eliminates the problem of signal broadcasting, some data still have to be transported between the control module and some locations in the multiplier. A block-based scheme is used to resolve this problem [17]. A pipelined bus is used to propagate these data, such that all signals, except for the clock signal, are limited to the scope of a single block or run between two adjacent blocks. The control signals and intermediate results used for sequential multiplications are transmitted by shift registers. The global signal transportation problem has not been considered by most previous works. We use a CSA (Carry Save Adder) structure for calculating the iterative step of the algorithm.

As the operands of the multiplier are beyond 1024 bits long, it suffers from the effect of large fanout. In our previous implementations [17], the critical path is in the design's control module, where the access of a single bit from the 1024-bit long registers causes the longest path. Another long timing path is between the input data pins of the multiplier and the input pins of the long registers, where the input data have to be distributed to several 1088-bit registers. We propose a two-stage access scheme and a signal backup scheme to eliminate the fanout bottleneck, which reduces the achievable minimum clock period of long modular multipliers. Compared to the original modular multiplier architecture with the fanout bottleneck, the proposed architectures achieve a speedup of over 7%.

The modulus in the RSA cryptosystem is the product of two prime numbers, say P and Q. This allows utilizing the Chinese Remainder Theorem (CRT) to speed up the private key operations [9]. With the CRT technique, the multiplier should be partitioned into two smaller ones according to the length of P and Q, and this should be done on-line. The difficulty of the on-line partition is worsened by the variation of the length of P and Q. Two redundant blocks are added to adapt to the on-line partition. The redundancy scheme is quite different from those of previous works and further decreases the delay of critical paths.
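The CRT speed-up for private-key operations can be illustrated in software. The following is a minimal sketch with toy primes, assuming the standard Garner recombination; the function name `rsa_crt_decrypt` and all parameter values are hypothetical and are not taken from the design.

```python
def rsa_crt_decrypt(c, d, p, q):
    """Decrypt c with private exponent d using the CRT (Garner's recombination)."""
    dp = d % (p - 1)            # reduced exponent for the P half
    dq = d % (q - 1)            # reduced exponent for the Q half
    q_inv = pow(q, -1, p)       # q^-1 mod p (needs Python 3.8+)
    # Two independent half-length exponentiations; hardware can run them in parallel.
    mp = pow(c % p, dp, p)
    mq = pow(c % q, dq, q)
    # Recombine the two halves into the result modulo p*q.
    h = (q_inv * (mp - mq)) % p
    return mq + h * q


# Toy parameters for illustration only; real primes are hundreds of bits long.
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))
m = 1234
c = pow(m, e, n)
assert rsa_crt_decrypt(c, d, p, q) == pow(c, d, n) == m
```

Each half works with a modulus and an exponent of roughly half the length, which is why two half-size multipliers running side by side are attractive for decryption.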
The multiplier's datapath is reconfigurable to execute either one 1086-bit modular exponentiation or two 542-bit modular exponentiations in parallel.

As the Montgomery multiplication algorithm [3], [5], [10] and the normal square and multiply algorithm for exponentiation [11] are well documented, the rest of the paper does not describe them. Section II focuses on the architectures and implementation methods of the modular multiplier and exponentiator. The systolic array, the block-based scheme, the global bus to transport data, the dedicated shift registers, the two-stage access scheme, the signal-backup scheme, and the redundancy scheme for on-line partition are discussed in detail. Section III reports the timing and area results obtained. Comparisons with other representative designs are also shown. The paper concludes with Section IV, a summary of the main strategies used.

II. ARCHITECTURE

A. The Systolic Array

The design consists of a modular multiplier based on a systolic array and a control module that handles data input/output and scheduling. It follows the general systolic array design flow [12], [13] to obtain the systolic RSA crypto processor according to the Montgomery multiplication algorithm and the square and multiply algorithm. The resulting modular multiplier is shown in Fig. 1, where the signal q_i is generated by XORing a_i·B_0 and the LSB of S_i. An m-bit modular multiplier needs (m+1) bit slices besides the LSB slice used to generate q and c. An n-bit RSA crypto processor needs (n+2) bit slices besides the LSB.

Each cell in the array is idle on alternate cycles. Thus, full use of the hardware requires two modular multiplications to be processed in parallel. The square and multiply algorithm [11] for exponentiation can be programmed to compute squares in the even cycles and interleave any necessary multiplications in the odd cycles. This has some overhead in storage and switching between the two concurrent multiplications. Overall, with this added complexity, the linear array solution might be faster for small n, but for larger n the broadcasting problem for signals means that a systolic implementation of Montgomery's algorithm should be faster. This is feasible in RSA computation.

0-7803-8346-X/04/$20.00 ©2004 IEEE

Fig. 1. A 4-bit Montgomery modular multiplier. The normal square and multiply algorithm for exponentiation can be programmed to compute squares in the even cycles and interleave any necessary multiplications in the odd cycles.

Fig. 2. Modified architecture of the multiplier's slices.

The first output digit (LSB) of the multiplier appears 2(n+2) clock cycles after the input of a_0, and successive digits appear over the next (n+1) cycles. The latency of multiplication is thus 2(n+2) cycles.
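The recurrence evaluated by the array can be written as a bit-serial software model. The sketch below is an illustration under stated assumptions rather than the processor's RTL: it assumes the radix-2 Montgomery recurrence with n+2 iterations and no final subtraction, in the style of [10], and the function name and toy operands are hypothetical. The quotient bit is formed by XORing a_i·b_0 with the LSB of the running sum.

```python
def mont_mul_bitserial(a, b, m, n):
    """Return s with s * 2^(n+2) == a*b (mod m), for odd m < 2^n and a, b < 2m."""
    s = 0
    b0 = b & 1
    for i in range(n + 2):                # one iteration per bit slice beyond the LSB
        a_i = (a >> i) & 1
        q_i = (s & 1) ^ (a_i & b0)        # quotient bit: LSB(S_i) XOR a_i*b_0
        s = (s + a_i * b + q_i * m) >> 1  # the sum is even by construction, so the shift is exact
    return s


# Exhaustive check on a toy modulus (hypothetical values).
m, n = 13, 4                              # odd modulus with m < 2^n
R = 1 << (n + 2)
for a in range(2 * m):
    for b in range(2 * m):
        s = mont_mul_bitserial(a, b, m, n)
        assert s < 2 * m and (s * R - a * b) % m == 0
```

Because the result stays below 2m whenever both inputs do, an output can be fed straight back in as an operand of the next multiplication, which is what chained squarings and multiplications rely on.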

B. Global Control and Data Transportation

In the systolic implementation, the digits of a_i, the carries, and q_i are pumped through cells without broadcasting, which would cause the clock period to grow due to the increased wire loading. Two kinds of problems are overlooked by previous implementations in the literature: the pumping of the control signals and the global data transportation.

The control signals are used in each slice of the multiplier. They consist of the selection of B (selb), the reset-enable of s (rst), and the write-enables of r and p (wr, wp). Signal selb is set to 0 in the even cycles and 1 in the odd cycles to select p and r alternately, so that the interleaving scheme can work. Signals wr and wp enable or disable the writes to r and p respectively. These signals should be set in succession, after a certain digit of the result is generated by the array, starting from the lowest. In our design, special logic is added to pump these control signals, as shown in Fig. 2.

Some initial data (plaintext, modulus, etc.) have to be located in the multiplier, and the final values have to be transported to the control module and output ports. The signal transportation problem may cause great difficulty in the place and route process. A block-based scheme is proposed to resolve this problem (as in Fig. 3), with each block containing 32 bit slices. A pipelined 32-bit-wide bus is used to transport data between the control module and the corresponding blocks of the multiplier, in which data (thick horizontal lines in Fig. 3) and commands (slender horizontal lines) are buffered in registers (thick vertical lines) and pumped between adjacent blocks. In the initiation stage of the exponentiation operation, the initial data are transported to the proper locations, and at the end of the operation the results (the final value) are transported back to the control module and output ports.

Fig. 3. The block-based architecture and the pipelined global bus.
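The behaviour of such a pipelined bus can be sketched with a small simulation. It assumes one bus register per block and a destination tag carried with each word. Both are illustrative assumptions, as are the name `deliver` and the payload values; only the block count of 34 and the neighbour-to-neighbour pumping come from the text.

```python
def deliver(words, num_blocks=34):
    """Simulate a pipelined bus with one register stage per block.

    words: list of (dest_block, payload), injected one per cycle by the control module.
    Returns {dest_block: (payload, arrival_cycle)}.
    """
    stages = [None] * num_blocks          # the bus register inside each block
    delivered = {}
    pending = list(words)
    cycle = 0
    while pending or any(s is not None for s in stages):
        # Every register loads from its left neighbour, so all wiring stays local.
        stages = [pending.pop(0) if pending else None] + stages[:-1]
        cycle += 1
        for blk, word in enumerate(stages):
            if word is not None and word[0] == blk:
                delivered[blk] = (word[1], cycle)
                stages[blk] = None        # the word is consumed by its destination block
    return delivered


result = deliver([(0, 0xAAAA0000), (1, 0xBBBB0001), (33, 0xCCCC0021)])
assert result[0][1] == 1 and result[1][1] == 3 and result[33][1] == 36
```

In this model a word reaches block b exactly b cycles after it enters the bus, so latency grows with distance while no single wire ever spans more than one block.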


The global bus is not sufficient for the transmission of the intermediate values of the multiplications, each digit of which should be fed to the lowest slice of the multiplier in order. Shift registers (sp) are used here. When an output digit appears, it is written to registers p and sp, and sp pumps the digit back to the control module. Indeed, the output S[j] appears j cycles after S[0], and is pumped back to the control module 2j cycles after S[0]. So each digit of S can be fed to subsequent multiplications on time (see register sp in Fig. 2).

All signals, except for the clock signal, are now limited to the scope of one block or run between two adjacent blocks. This makes the design suitable for implementation in DSM technology.

C. Elimination of the Fanout Bottleneck

As the operands of the multiplier are beyond 1024 bits long, it suffers from the effect of large fanout. Large fanout can cause routability problems. Therefore EDA tools try to limit fanout by duplicating gates or by inserting buffers, which causes the clock period to increase (Fig. 4).

Fig. 4. Adding intermediate buffers (figure label: Inverting Buffers).

In our previous implementations [17], the critical path is in the control module, where the access of a single bit from the 1088-bit long registers causes the longest path. We propose a two-stage access scheme to resolve this problem (Fig. 5).

Fig. 5. Two-stage access of a single bit of the 1088-bit long integers (figure labels: Address, Register, Bit-out).
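The two-stage access of Fig. 5 can be modelled in a few lines. This is a sketch under assumptions: the 1088-bit register is treated as 34 rows of 32 bits with row 0 holding the least significant bits, the address is 11 bits wide with its high six bits selecting the row and its low five bits selecting the bit, and the names below are hypothetical.

```python
ROW_BITS = 32                      # width of one row and of the Column Register
NUM_ROWS = 34                      # 34 rows * 32 bits = 1088 bits


def read_bit_two_stage(reg, addr):
    """Return bit `addr` (0 = LSB) of the 1088-bit integer `reg` in two steps."""
    row_sel = addr >> 5            # high six bits of the 11-bit address
    bit_sel = addr & 0x1F          # low five bits
    # Stage 1: select one 32-bit row and latch it (the Column Register).
    column_register = (reg >> (row_sel * ROW_BITS)) & 0xFFFFFFFF
    # Stage 2: pick the target bit out of the latched row.
    return (column_register >> bit_sel) & 1


# Check against a direct bit read on a fixed 1088-bit pattern.
reg = int.from_bytes(bytes(range(136)), "little")   # 136 bytes = 1088 bits
assert all(read_bit_two_stage(reg, a) == (reg >> a) & 1
           for a in range(ROW_BITS * NUM_ROWS))
```

Splitting the selection this way replaces one 1088-to-1 selection with a 34-to-1 and a 32-to-1 selection that can sit in different clock cycles, which is the kind of path shortening the scheme aims for.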

For the 1088-bit (32*34) register, a single bit is accessed in two stages. Firstly, a 32-bit long row is selected with the high six bits of the address and stored in the Column Register. Secondly, the target bit is selected from the Column Register by the low five bits.

Another long timing path is between the input data pins of the multiplier and the input pins of the long registers and several other modules where the input data are distributed. We use a signal-backup scheme to eliminate this fanout bottleneck, which reduces the achievable minimum clock period of long modular multipliers. The input values are first distributed to a group of backup registers before they are sent to the registers and modules. Each of the backup registers drives only a certain segment of the large fanout. EDA tools can insert buffers to amplify these signals, but then the amplification of the signals and the calculation that uses them fall in the same cycle, which creates critical paths and increases hardware cost. The essence of the signal-backup scheme is that part of the amplification effort is moved to an earlier clock cycle, which cuts the potential critical paths.

Compared to the original modular multiplier architecture with the fanout bottleneck, the proposed architectures achieve a speedup of over 7%.

D. Architecture Adapted for CRT

Using the CRT technique, the multiplier should be partitioned into two smaller ones according to the length of P and Q, and this should be done on-line. P and Q are approximately of the same bit length (but not exactly the same). The difficulty of the on-line partition is worsened by the variation of the length of P and Q.

If the partition is done accurately according to the length of P and Q, as in [8], each cell in the array must be able to read the data from the primary inputs (or control module) directly, and be able to propagate the data to the primary outputs (or control module). Special control logic must be added to each bit slice of the array, with much time and area overhead. A redundancy scheme [17] is used in our design, where two redundant blocks are added to adapt to the on-line partition of the multiplier (Fig. 3). The multiplier, with 34 blocks and 1088 bits long, can be partitioned into two smaller multipliers, each containing 17 blocks and 544 bits long. Another control module (ctrl2 in Fig. 3) is added for the partition mechanism and the timing management of the derived multiplier.

In the redundancy scheme, the maximum length of P or Q is 542 (and the minimum is 1024-542+1=483) if n=1024. Since P and Q are approximately of the same length, a 59-bit difference should be enough.

The redundancy scheme is quite different from those of previous works and further decreases the delay of critical paths. The multiplier datapath is reconfigurable to execute either one 1086-bit modular exponentiation or two 542-bit modular exponentiations in parallel.

E. Modular Exponentiation

There are two common exponentiation algorithms [11]: the L-R (Left to Right) Binary Method, which is area optimized, and the R-L Binary Method (the square and multiply algorithm), which is speed optimized. The R-L algorithm is used here since the primary aim is to increase the data throughput of the exponentiator for real-time operation. Modular exponentiation is executed in three stages: mapping, exponentiation, and remapping. In our design, which uses the R-L binary method, the internal multiplications in each stage are performed in parallel.

Before operation begins, the processor is initialized. The control data are read from I/O and stored in the control modules. Then the modulus and the plaintext are read from I/O and transported to the proper blocks via the global bus.

In the mapping stage, the initial value is fed to the multiplier by the control module, from the lowest bit to the highest. In the exponentiation stage, multiplications and squares are performed in parallel. Intermediate results are pumped back to the control module and fed into the multiplier. In the last stage, the constant 0...01 is fed in and the mapping factor is eliminated by a multiplication. Finally, the result can be transported to the output via the global bus.
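The three stages can be sketched as a small software model. It is an illustration under assumptions, not the control module's actual schedule: the Montgomery product is modelled with plain integer arithmetic, R = 2^(n+2) is assumed to match the n+2 bit slices, the mapping constant R^2 mod m is assumed to be precomputed, and all names are hypothetical.

```python
def mont(a, b, m, r_inv):
    """Integer model of the Montgomery product a*b*R^-1 mod m."""
    return a * b * r_inv % m


def mont_exp_rl(x, e, m, n):
    """Compute x^e mod m with the right-to-left binary method in the Montgomery domain."""
    R = 1 << (n + 2)
    r_inv = pow(R, -1, m)             # needs Python 3.8+ and an odd modulus m
    r2 = R * R % m                    # assumed precomputed mapping constant
    # Mapping: bring x and 1 into the Montgomery domain.
    z = mont(x, r2, m, r_inv)         # running power, x^(2^i) * R
    p = mont(1, r2, m, r_inv)         # accumulated product, starts at 1 * R
    # Exponentiation: the multiply and the square of one step both read the old z,
    # so they are independent and can share an interleaved multiplier.
    while e:
        p_next = mont(p, z, m, r_inv) if e & 1 else p
        z = mont(z, z, m, r_inv)
        p = p_next
        e >>= 1
    # Remapping: a multiplication by the constant 1 removes the factor R.
    return mont(p, 1, m, r_inv)


assert mont_exp_rl(1234, 65537, 3233, 12) == pow(1234, 65537, 3233)
```

Scanning the exponent from the least significant bit is what makes the square and the multiply of each step independent; a left-to-right scan would force them to run one after the other.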


III. PERFORMANCE AND COMPARISON

The design has been implemented in TSMC 0.18-µm CMOS standard-cell technology using Verilog. According to the synthesis results, it operates at a frequency of 450 MHz and uses 148K gates. Table 1 shows the performance results in terms of the bit length of the modulus (n) and exponent (k), the clock cycles (T), and the baud rate (Kb/s). The multiplier is reconfigurable to execute either one 1086-bit modular exponentiation or two 542-bit modular exponentiations in parallel for the application of the CRT technique. To speed up encryption, the use of a short exponent E has been proposed [1]. Recommended by the ITU is the Fermat prime F4 = 2^16 + 1. Using F4, the encryption is executed in only 2*19*(n+2) cycles. Bit lengths of 1024 and 512 are commonly used and the corresponding results are listed in Table 1.

TABLE 1
PERFORMANCE RESULTS OF THE RSA CRYPTO PROCESSOR
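The cycle count quoted for F4 can be checked with a few lines of arithmetic. This is a back-of-envelope sketch and not a reproduction of Table 1: the breakdown of the factor 19 into one mapping step, 17 exponentiation steps (one per bit of F4), and one remapping step is an assumption, and the time estimate simply divides the cycle count by the quoted 450 MHz.

```python
F4 = 2**16 + 1
CLOCK_HZ = 450e6                       # synthesis frequency quoted in the text


def encryption_cycles(n, e=F4):
    """Cycles for one encryption at 2*(n+2) cycles per multiplication slot."""
    slots = e.bit_length() + 2         # assumed: mapping + one slot per exponent bit + remapping
    return 2 * slots * (n + 2)


for n in (1024, 512):
    cycles = encryption_cycles(n)
    print(n, cycles, f"{cycles / CLOCK_HZ * 1e6:.1f} us")

assert encryption_cycles(1024) == 2 * 19 * 1026
```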

The comparison with other designs is summarized in Table 2, which shows a clear advantage for our approach in terms of speed. We have previously reported a design that used a similar block-based global signal transportation scheme and redundancy scheme [17]. The current design has a higher clock frequency, hence a higher throughput.

TABLE 2
COMPARISON WITH PREVIOUS WORKS
(surviving column header: Baud Rate)

IV. CONCLUSION

We proposed efficient architectures which are used in a high-performance RSA crypto processor. The multiplier is based on a new systolic array, which is regular and can be cascaded for higher bit lengths. Special efforts are focused on the problems met in DSM technology. A block-based scheme and a pipelined bus are proposed to eliminate global communication. Control signals are pumped through the systolic array by shift registers. Two redundant blocks are used for dynamic partitions with much higher efficiency and shorter delay. We propose a two-stage access scheme and a signal backup scheme to eliminate the fanout bottleneck, which increases the achievable maximum clock frequency of long modular multipliers. These schemes are quite different from those of the past, and the high performance confirms the efficiency of the implemented architectures.

REFERENCES

[1] R. L. Rivest, A. Shamir, and L. Adleman, A method for obtaining digital signatures and public-key cryptosystems, Communications of the ACM, Feb. 1978, 21(2):120-126.
[2] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone, Handbook of applied cryptography, CRC Press series on discrete mathematics and its applications. Boca Raton: CRC Press, 1997, pp. 285-291.
[3] P. L. Montgomery, Modular multiplication without trial division, Mathematics of Computation, Apr. 1985, 44(170):519-521.
[4] T. Blum and C. Paar, Montgomery modular exponentiation on reconfigurable hardware, 14th IEEE Symposium on Computer Arithmetic (ARITH-14), Adelaide, Australia, Apr. 1999, pp. 70-77.
[5] S. E. Eldridge and C. D. Walter, Hardware implementation of Montgomery's modular multiplication algorithm, IEEE Transactions on Computers, Jun. 1993, 42(6):693-699.
[6] Taek-Won Kwon, Chang-Seok You, Won-Seok Heo, Yong-Kyu Kang, and Jun-Rim Choi, Two implementation methods of a 1024 bit RSA cryptoprocessor based on modified Montgomery algorithm, The 2001 IEEE International Symposium on Circuits and Systems (ISCAS 2001), May 2001, 4:650-653.
[7] C. D. Walter, Systolic modular multiplication, IEEE Transactions on Computers, Mar. 1993, 42(3):376-378.
[8] Chung-Hsien Wu, Jin-Hua Hong, and Cheng-Wen Wu, VLSI design of RSA cryptosystem based on the Chinese remainder theorem, Journal of Information Science and Engineering, Nov. 2001, 17(6):967-979.
[9] M. Shand and J. Vuillemin, Fast implementations of RSA cryptography, Proceedings of 11th IEEE Symposium on Computer Arithmetic, Los Alamitos, CA, 1993, pp. 252-259.
[10] C. D. Walter, Montgomery exponentiation needs no final subtractions, Electronics Letters, Oct. 1999, 35(21):1831-1832.
[11] D. Knuth, The Art of Computer Programming Volume 2: Seminumerical Algorithms (Third Edition), Massachusetts: Addison-Wesley, 1998, pp. 461-481.
[12] S.-Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[13] Jin-Hua Hong and Cheng-Wen Wu, Cellular-array modular multiplier for fast RSA public-key cryptosystem based on modified Booth's algorithm, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Jun. 2003, 11(3):474-484.
[14] Alan Daly and William Marnane, Efficient Architectures for implementing Montgomery Modular Multiplication and RSA Modular Exponentiation on Reconfigurable Logic, Tenth ACM International Symposium on Field-Programmable Gate Arrays (FPGA'02), Feb. 2002, Monterey, California, USA, pp. 44-49.
[15] C. McIvor, M. McLoone, and J. V. McCanny, A high-speed, low latency RSA decryption silicon core, Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS'03), May 2003, 4:133-136.
[16] C. McIvor, M. McLoone, J. McCanny, A. Daly, and W. Marnane, Fast Montgomery Modular Multiplication and RSA Cryptographic Processor Architectures, 37th Asilomar Conference on Signals, Systems, and Computers, Nov. 2003.
[17] Qiang Liu, Fangzhen Ma, Dong Tong, and Xu Cheng, Efficient Implementation of the RSA Crypto Processor in Deep Sub-micron Technology, Proceedings of ICISA, International Conference on Applied Cryptography and Network Security (ACNS'04), Jun. 2004, in press.
