
A PIPELINED ARCHITECTURE OF FAST MODULAR MULTIPLICATION FOR RSA CRYPTOGRAPHY

Jia-Lin Sheu, Ming-Der Shieh, Chien-Hsing Wu, and Ming-Hwa Sheu

Department of Electronic Engineering, National Yunlin University of Science & Technology

Touliu, Yunlin, Taiwan
Email: [email protected]. yuntech.edu.tw

ABSTRACT

In this paper, a fast algorithm and its corresponding VLSI architecture are proposed to speed up modular multiplication with a large modulus. By partitioning the operand (multiplier) into several equal-sized segments and performing the multiplication and residue calculation of each segment in a pipelined fashion, our algorithm achieves a performance improvement over previous work. We also present an efficient procedure to accelerate the residue calculation and use carry-save addition to implement the architecture, so that the critical path is independent of the size of the modulus. The resulting architecture and implementation are therefore well suited to high-speed RSA cryptosystems and can be easily realized in VLSI technology.

1. INTRODUCTION

With the explosion of electronic data communication and computer networks, ensuring their security has become an important research topic. In the encryption/decryption of RSA cryptography [1], as well as in other public-key cryptosystems [2, 3], the core operation is modular exponentiation, and the security rests on our inability to efficiently factor large numbers (usually > 500 bits). In general, modular exponentiation is accomplished by iterating modular multiplications, so the throughput of a cryptosystem depends entirely on the speed of modular multiplication of large numbers. The key to implementing modular exponentiation is therefore to develop an efficient algorithm for the modular multiplication of large numbers that uses a small amount of hardware.

A variety of algorithms for modular multiplication have been proposed in the literature. Most fall into two categories: the division-after-multiplication method and the division-during-multiplication method. The former requires more memory than the latter because it must divide a 2n-bit number by an n-bit number for the residue calculation. The latter, in which each subtraction step of the division is embedded in the repeated multiply-addition algorithm [2], requires more operations than the former. In addition, many algorithms require an accurate magnitude comparison before a possible subtraction of the modulus N, to keep the partial sum below |N| and ensure convergence of the algorithm. However, it has been shown that accurate quotient determination wastes both calculation time and silicon area. To simplify the problem of quotient determination, several techniques have been proposed to avoid the time-consuming magnitude comparison [4].
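As a software-level illustration of the two categories (a minimal sketch in ordinary integer arithmetic, not the authors' hardware formulation; the function names are ours), the division-after-multiplication method reduces the full double-width product once, while the division-during-multiplication method interleaves conditional subtractions of the modulus into the shift-and-add recurrence:

```python
def divide_after(b, a, n_mod):
    # Division-after-multiplication: form the full 2n-bit product,
    # then reduce it by the n-bit modulus in a single division step.
    return (b * a) % n_mod

def divide_during(b, a, n_mod, nbits):
    # Division-during-multiplication: each shift-and-add step is
    # followed by a conditional subtraction of the modulus, so the
    # partial sum never grows far beyond n bits (requires b < n_mod).
    p = 0
    for j in reversed(range(nbits)):
        p = 2 * p
        if p >= n_mod:
            p -= n_mod
        if (a >> j) & 1:
            p += b
            if p >= n_mod:
                p -= n_mod
    return p
```

Both return B*A mod N; the magnitude comparisons (`p >= n_mod`) are exactly the costly steps the algorithm proposed below avoids.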

By taking advantage of both the division-during-multiplication and division-after-multiplication methods, this paper first presents an efficient algorithm without magnitude comparison to speed up modular multiplication. The basic idea is to partition the operand (multiplier) into several equal-sized segments and perform the multiplication and residue calculation of each segment in a pipelined fashion, so a potential speedup of the modular multiplication can be achieved. Moreover, carry-save addition is used to implement the developed architecture, so that the critical path is independent of the size of the modulus. We also show that the presented algorithm and its corresponding architecture achieve good performance, making them applicable to the RSA cryptosystem.

This paper is organized as follows. Section 2 describes the proposed algorithm for fast modular multiplication. The corresponding pipelined architecture is given in Section 3. Section 4 compares the performance of our proposed algorithm and its VLSI implementation with previous work. Finally, Section 5 concludes the paper.

2. THE PROPOSED ALGORITHM FOR FAST MODULAR MULTIPLICATION

This section presents an efficient algorithm without magnitude comparison to speed up the modular multiplication, denoted BA mod N, where B, A, and N represent the n-bit multiplicand, multiplier, and modulus, respectively. The traditional repeated multiply-addition algorithm [11], shown in Fig. 1, has two basic disadvantages. First, it requires an accurate magnitude comparison, which increases both time complexity and hardware overhead. Second, the data dependency between addition and subtraction forces these two operations to execute sequentially for each single multiplier bit a_j.

P = 0;
for j = n-1 down to 0 do
begin
    P = 2*P;
    if (P >= N) then P = P - N;
    if (a_j = 1) then {
        P = P + B;
        if (P >= N) then P = P - N
    }
end

Figure 1: The traditional repeated multiply-addition algorithm.


0-7803-4455-3/98/$10.00 © 1998 IEEE

To overcome these two disadvantages, the basic idea is to partition the multiplier into several equal-sized segments and then perform the multiplication of the current segment and the residue calculation of its preceding segment concurrently. Thus, a potential speedup of the modular multiplication can be achieved. Specifically, let the multiplier A be partitioned into k equal-sized segments, each with n/k bits, denoted A_{k-1} down to A_0. Then, the modular multiplication can be rewritten as

R = ((...((B*A_{k-1} mod N)*2^{n/k} + B*A_{k-2}) mod N)*2^{n/k} + ... + B*A_0) mod N    (1)

As a result, the dependency between addition and subtraction no longer exists within each segment. This implies that the multiplication of the current segment and the residue calculation of its preceding segment can be performed concurrently, i.e., they can be evaluated in a pipelined fashion. From (1), the final result is obtained by repeating the iteration from m = k down to 0, where R_{k+1} = 0, and can be restated as

R_m = (R_{m+1} mod N)*2^{n/k} + B*A_{m-1}    for m > 0    (2.a)
R_0 = R_1 mod N                              for m = 0    (2.b)
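The recurrence in (2) can be sketched in a few lines of Python (an illustrative model with ordinary integers; the plain `%` stands in for the paper's comparison-free reduction):

```python
def segmented_mod_mul(b, a, n_mod, n, k):
    # Partition the n-bit multiplier a into k segments A_{k-1}..A_0 of
    # n/k bits each, then fold them in via
    # R_m = (R_{m+1} mod N)*2^(n/k) + B*A_{m-1}.
    s = n // k
    segs = [(a >> (i * s)) & ((1 << s) - 1) for i in range(k)]  # A_0..A_{k-1}
    r = 0                              # R_{k+1} = 0
    for m in range(k - 1, -1, -1):     # fold in A_{k-1} first
        r = (r % n_mod) * (1 << s) + b * segs[m]
    return r % n_mod                   # R_0 = R_1 mod N
```

Each loop iteration handles one segment; in the hardware described below, the reduction of one segment overlaps with the multiplication of the next.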

Based on this derivation, the proposed algorithm is shown in Fig. 2, in which a modified two-bit overlapping scanning without magnitude comparison is used to estimate the quotient, so that the problem of overflow is eliminated. For simplicity of explanation, Fig. 3 shows the reduction steps for N = (11010011)_2 = 211_10 and R = (101111010000)_2 = 3024_10, where the modulo reduction represents the value that should be added during the reduction procedure. It should be noted that the computation load of the multiplication differs from that of the residue calculation. Therefore, to balance the computation loads, two steps of the residue calculation are scheduled into the multiplication loop. This causes no problem because the basic function of the two distinct for-loops is the same.
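The precomputed constants used in the example of Fig. 3 can be checked directly (our notation; N2 denotes (2*2^n) mod N, and here Nkey = 2^n - N coincides with 2^n mod N since N > 2^{n-1}):

```python
n = 8
N = 0b11010011                 # 211
R = 0b101111010000             # 3024
Nkey = (1 << n) % N            # 2^8 mod N    = 0b00101101 = 45
N2 = (2 << n) % N              # (2*2^8) mod N = 0b01011010 = 90
N3 = (3 << n) % N              # (3*2^8) mod N = 0b10000111 = 135
print(Nkey, N2, N3, R % N)     # 45 90 135 70
```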

M[n+n/k-1 : 0] = R[n+n/k : 0] = 0;
Nkey = 2^n - N;  N3 = (3*2^n) mod N;
for m = k down to 0 parallel do {
begin  /* Concurrent execution of the following two for-loops */
    /* for each segment */
    for j = (n/k+1) down to 0 do {
        if (j >= 2) then M = 2*M + B*a[(m-1)*n/k + j - 2]
        else case (M[n+n/k+j : n+n/k+j-1])  /* Load balance */
            1: M = M[n+n/k+j-2 : 0] + Nkey*2^(n/k+j-1)  /* Shifting Nkey for alignment */
            2: M = M[n+n/k+j-2 : 0] + Nkey*2^(n/k+j)
            3: M = M[n+n/k+j-2 : 0] + N3*2^(n/k+j-1)
    }
    for i = (n/k+1) down to 0 do {
        if (i >= 1) then c = 0
        else c = 1  /* The correction step for convergence */
        case (R[n+i+c : n+i+c-1])
            1: R = R[n+i+c-2 : 0] + Nkey*2^(i+c-1)
            2: R = R[n+i+c-2 : 0] + Nkey*2^(i+c)
            3: R = R[n+i+c-2 : 0] + N3*2^(i+c-1)
    }
    R = M + R*2^(n/k);  M = 0;  /* Executed after the two for-loops */
end }

Figure 2: The proposed algorithm for concurrent modular multiplication.

R = (101111010000)_2 = 3024;  N = (11010011)_2 = 211
Nkey = 2^8 mod N = (00101101)_2 = 45
N2 = (2*2^8) mod N = (01011010)_2 = 90
N3 = (3*2^8) mod N = (10000111)_2 = 135
After the correction step: (01000110)_2 = 70 = 3024 mod 211

Figure 3: An example for modulo reduction.

In terms of hardware implementation, carry-save addition is commonly used for operands with a long bit-width to avoid unnecessary carry propagation. The proposed algorithm can also be applied to modular multiplication based on carry-save addition, as shown in Fig. 4(a), where Top2b-scan-mod-N(C, S) considers the intermediate sum and carry at the same time, based on the same concept as the modified two-bit overlapping scanning technique. The rule of the modulo reduction is shown in Fig. 4(b).
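The carry-save idea can be modeled on word-wide integers (a behavioral sketch in our own notation, not the paper's circuit):

```python
def csa(a, b, c):
    # 3-2 (carry-save) step: three operands in, a sum word and a carry
    # word out; no carry ripples across the word, so the delay is one
    # full-adder level regardless of the operand width.
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def resolve(s, c):
    # The single carry-propagate addition needed only at the very end.
    while c:
        s, c = csa(s, c, 0)
    return s
```

`csa` preserves the value: the returned pair always satisfies sum + carry == a + b + c, which is why the expensive carry propagation can be deferred to the final step.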

CM=SM=CR=SR=O; Nkey=2"-N; N3=(3*2" )mod N, N4=(4*2" )mod N form = k down to 0 parallel do {

begin for j = (n/k+l) down to 0 do {

if (j 2 2) then C&SM ~*(CM~SM~B*U[(~.I)*"P"+,-Z] else Top2b-scan-mod-N(C~ ,SM)

1 for i = (n/k+l) down to 0 do {

Top2b-scan-mod-N(C~ ,SR) 1

cR+sR t SM + CM + (cR+sR ) * 2 5 CM = 0; SM =o;

end 1

(a) The simplified algorithm

[The rule table is garbled in this transcript. Its columns are C_i, C_{i-1}, and S_{i-1}, which together select the modulo-reduction value to add, such as Nkey = 2^n mod N, N3 = (3*2^n) mod N, or N4 = (4*2^n) mod N.]

(b) The rule of modulo reduction for Top2b-scan-mod-N Figure 4: The proposed algorithm based on carry-save additions.

3. A PIPELINED ARCHITECTURE FOR THE MODULAR MULTIPLICATION

Based on the proposed algorithm, operations of the modular multiplication can then be accomplished by implementing two distinct blocks: the residue (Res) block and the multiplication (Mul) block. Fig. 5 depicts the conceptual framework of our proposed algorithm for hardware implementation with k=4. For each segment,


the residue block (b) calculates the residue based on the modified two-bit overlapping scanning technique. The multiplication block (a) takes n/k cycles to calculate B*A_i and a few additional cycles to balance the load. As explained above, the total operation cycles of block (a) equal those of block (b). These two blocks operate concurrently to reduce the time complexity. Between two segments, block (c) is used to accumulate the residue value, i.e., the shifted value of the old residue in the Res block is added to the value from the Mul block. The operations of the modular multiplication are repeated until all k segments are finished.

[Figure body: the Mul block (a) and Res block (b) operate on successive segments, with a Correct Residue block (c) between segments.]

Figure 5: The conceptual framework of the proposed algorithm.

Fig. 6 shows the pipelined architecture of the proposed algorithm based on carry-save addition. It takes six n-bit registers to store N, N3, N4, A, SB, and CB, respectively, and four (n+n/k+2)-bit registers for CM, SM, CR, and SR. The multiplication and residue blocks perform the two concurrent for-loops of the presented algorithm. In the derived architecture, the multiplication block, composed of 4-2 counters, accumulates the partial products of each segment. As shown in [12], a 4-2 counter can be implemented with two cascaded 3-2 counters, also known as full adders; we can therefore pipeline the 4-2 counter to further improve the performance of the modular multiplication at the expense of some hardware overhead. In this way, the critical path of the computation elements is reduced to the delay of a full adder.
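The 4-2 counter used in the multiplication block can be modeled as two cascaded 3-2 counters, as [12] describes (a behavioral sketch, not the pipelined circuit):

```python
def csa32(a, b, c):
    # One level of full adders (a 3-2 counter), applied bit-parallel.
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def counter42(a, b, c, d):
    # Two cascaded 3-2 levels compress four operands into a sum/carry
    # pair; in hardware each level costs one full-adder delay, which is
    # exactly what pipelining the 4-2 counter exploits.
    s1, c1 = csa32(a, b, c)
    return csa32(s1, c1, d)
```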

In the residue block, the 3-2 bi-function adder consists of 3-2 counters and multiplexers, so it can operate either as a carry-save adder or as a carry-propagation adder. Because the carry-propagation operation is still required at the last step to output the final result, using a bi-function adder reduces the overall hardware requirement. Note that if the developed architecture is used in the RSA cryptosystem, the carry-propagation operation is executed only once in our algorithm. Owing to this feature, the resulting architecture is very well suited to high-speed RSA cryptosystems. Further, due to its regular data flow, the architecture can be easily implemented in VLSI technology.

[Figure body: the multiplication block feeds the residue block through 5-to-1 multiplexers.]

Figure 6: The pipelined architecture of the proposed algorithm based on carry-save additions.

4. PERFORMANCE EVALUATION

Table I lists the performance of our proposed algorithm for modular multiplication in comparison with previous works [4-6]. For simplicity, the hardware requirement is calculated under the assumption that a carry-propagation adder is used, giving the minimum hardware requirement. The computation time is estimated from the number of operation cycles required to complete the task, where an operation cycle is the time to perform an n-bit addition. It is assumed that an addition has the same complexity as a subtraction, and shift operations are neglected. As shown in Table I, in terms of operation cycles, a speedup of 2.6 over Lu's work [5] is achieved for the optimal case of our algorithm, at the expense of less than 60% hardware overhead. The optimal k = 16 can be derived theoretically by considering all possible partitions in the presented algorithm.

In the RSA cryptosystem, both the encryption and decryption operations are modular exponentiations. The simplest way to calculate a modular exponentiation is the "square and multiply" approach, which reduces the total number of modular multiplications for encryption/decryption. Of the existing techniques for reducing the depth of the modular multiplication circuit for a large modulus, carry-save addition is known to be the most attractive for hardware implementation.
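The "square and multiply" approach can be sketched as follows (the left-to-right binary method; Python's built-in three-argument `pow` serves as a reference):

```python
def mod_exp(base, exp, n_mod):
    # Left-to-right square-and-multiply: one modular squaring per
    # exponent bit, plus one modular multiplication per 1-bit, so an
    # e-bit exponent costs at most about 2e modular multiplications.
    r = 1
    for i in reversed(range(exp.bit_length())):
        r = (r * r) % n_mod
        if (exp >> i) & 1:
            r = (r * base) % n_mod
    return r
```

Since every step is itself a modular multiplication, the cycle counts in Tables I-III translate directly into exponentiation throughput.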


For the purpose of comparison, the well-known H algorithm is assumed in this work. Tables II and III list comparisons of the hardware and time complexity, respectively. The comparisons are based on the computation elements on the data path; control circuits are ignored. In general, if a RAM is used to store all the modulo coefficients, it takes a large area, although some speed improvement can be achieved [8, 9]. It can be seen that, for a fixed hardware requirement, our developed architecture achieves the best performance compared with the works [7-10]. One key reason for this performance is that the carry-propagation operation is required only at the last step, to output the final result, namely the cipher.

Table I: Performance Comparisons of Modular Multiplication Algorithms

[The table body is garbled in this transcript. Recoverable entries: Lu [5]: 11.2n and 11.2n^2 + 11.2n; Takagi [6] is also listed; Ours I (optimal k = 16): 2n(n/k+4)(k+1), cycle time of a simple 2FA, 2.4n^2; Ours II (optimal k = 8): 2n(n/k+7)(k+1), cycle time of a simple FA, 2.5n^2.]

Table II: Comparisons of the Hardware Requirement

Authors            | FAs        | Latches   | MUXs      | Table
-------------------+------------+-----------+-----------+--------------
with RAM:          |            |           |           |
Su & Wu [9]        | 2n+C1***   | 12n       | 6n        | 512*512 RAM
Juang [8]          | 2n         | 14n       | 10n       | 10*512
w/o RAM:           |            |           |           |
Koc [7]            | 3n         | 15n+36    | 0         | No
Eldridge [10]      | 3n+C2***   | 16n       | 9n        | No
Ours I*            | 3(n+n/k)   | 10n+4n/k  | 9n        | No
 (optimal k=16)    | 3.18n      | 10.24n    | 9n        | No
Ours II**          | 3(n+n/k)   | 13n+7n/k  | 18n+9n/k  | No
 (optimal k=8)     | 3.38n      | 13.88n    | 19.1n     | No

* The nonpipelined 4-2 counter is used in our architecture.
** The pipelined 4-2 counter is applied in our architecture to increase the speed.
*** C1 = log2(n+1) and C2 = n represent the number of half adders used.

Table III: Comparisons of the Time Complexity

[The table body is garbled in this transcript. It compares the time complexity of the designs with RAM [8, 9] and without RAM [7, 10, and ours]; recoverable entries include cycle times of a simple FA or simple 2FA for Koc [7], Eldridge [10], and ours, and terms such as 5.2n, 3n, 15.6n^2, 8.4n, (n/k+4)(k+1), and 8.4n(n/k+4)(k+1).]

5. CONCLUSION

To speed up modular multiplication with a large modulus, a concurrent algorithm has been developed in this paper based on partitioning the multiplier. In addition, carry-save addition is used to implement the resulting architecture, and a modified two-bit overlapping scanning technique is applied to eliminate the magnitude comparison. We have shown that the presented algorithm achieves good performance compared with previous works. The corresponding VLSI architecture and hardware evaluation demonstrate the efficiency of our development, so it can be applied to high-speed RSA cryptosystems. Moreover, the developed architecture has a regular structure, a modular design, and expandability, making it suitable for VLSI implementation.

REFERENCES

[1] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Comm. of the ACM, vol. 21, no. 2, pp. 120-126, Feb. 1978.

[2] E. F. Brickell, "A fast modular multiplication algorithm with application to two key cryptography," Advances in Cryptology, Proc. of CRYPTO '82, pp. 51-60, New York, 1983.

[3] T. ElGamal, "A public key cryptosystem and a signature scheme based on discrete logarithms," IEEE Trans. Information Theory, vol. IT-31, no. 4, pp. 469-472, July 1985.

[4] C. D. Chiou and T. C. Yang, "Iteration modular multiplication algorithm without magnitude comparison," Electronics Letters, vol. 30, no. 30, Nov. 1994.

[5] E. Lu, L. Harn, J. Lee, and W. Hwang, "A programmable VLSI architecture for computing multiplication and polynomial evaluation modulo a positive integer," IEEE J. of Solid-State Circuits, vol. SC-23, pp. 204-207, 1988.

[6] N. Takagi, "A modular multiplication algorithm with triangle additions," in Proc. of 11th Symp. on Computer Arithmetic, pp. 272-276, July 1993.

[7] C. K. Koc and C. Y. Hung, "Carry-save adder for computing the product AB modulo N," Electronics Letters, vol. 26, no. 13, pp. 899-900, June 1990.

[8] Y. J. Juang, E. H. Lee, J. Y. Lee, and C. H. Chen, "A new architecture for fast modular multiplication," Int. Symp. on VLSI Technology, Systems and Applications, pp. 357-360, 1989.

[9] C. Y. Su and C. W. Wu, "A practical VLSI architecture for RSA public-key cryptosystem," Proc. Sixth VLSI Design/CAD Symp., pp. 273-276, August 1995.

[10] S. E. Eldridge, "A faster modular multiplication algorithm," Int. J. Computer Math., vol. 40, pp. 63-68, 1991.

[11] G. R. Blakley, "A computer algorithm for calculating the product AB modulo N," IEEE Trans. Computers, vol. C-32, pp. 497-500, May 1983.

[12] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley and Sons, 1979.


