[IEEE Second IEEE Asia Pacific Conference on ASICs AP-ASIC 2000 - Cheju, South Korea (28-30 Aug....

ASYNCHRONOUS IMPLEMENTATION OF MODULAR EXPONENTIATION FOR RSA CRYPTOGRAPHY

Ming-Der Shieh, Chien-Hsing Wu, Ming-Hwa Sheu, Jia-Lin Sheu and Che-Han Wu

Department of Electronic Engineering, National Yunlin University of Science & Technology, Touliu, Yunlin, Taiwan

Email: [email protected]

Abstract

This paper presents an efficient VLSI implementation of the modular exponentiation commonly used in RSA cryptography, based on the asynchronous behavior of the modular multiplication. The basic idea is to partition the operand (multiplier) into several equal-sized segments and then to perform the multiplication and residue calculation of each segment in a micropipelining fashion. Experimental results show that, on average, more than 20% of the operations can be saved by taking into account the asynchronous behavior of the modular multiplication. The resulting implementation has the characteristics of modular design, simple control, and expandable structure, and its critical path is independent of the size of the modulus.

I. INTRODUCTION

The increasing demand for data communication and computer network applications has made data security an important research topic. The idea of the public-key cryptosystem was originally presented by Diffie and Hellman [1]. In 1978, Rivest, Shamir and Adleman introduced the widely used RSA public-key cryptosystem [2], whose core operation is modular exponentiation and whose security rests on our inability to efficiently factor large integers (usually larger than 500 bits). For real-time applications, an efficient hardware implementation of modular exponentiation/multiplication is therefore the key to speeding up the RSA algorithm.

In RSA, both encryption and decryption are modular exponentiations, which are usually transformed into a number of consecutive modular multiplications. The well-known “square-and-multiply” approach [3] requires at most 2n modular multiplications to compute $M^e \bmod N$, where M, N, and e represent the data, modulus, and key, respectively, and n is the number of bits in the binary representation of e. As a result, the throughput rate of an RSA cryptosystem depends entirely on the speed of modular multiplication and the number of modular multiplications performed. The aim of this paper is to explore design techniques that shorten the overall computation time of a modular multiplication and reduce the number of modular multiplications executed in an RSA implementation.

To date, most algorithms for modular multiplication are based on one of two methodologies: the traditional repeated multiply-addition algorithm [7] or Montgomery’s algorithm [8]. Basically, they are all implemented in synchronous structures. This paper presents an efficient asynchronous algorithm to speed up modular multiplication and/or exponentiation. The basic idea is to partition the operand (multiplier) into several equal-sized segments and then to perform the multiplication and residue calculation of each segment in a pipelined fashion. At the same time, according to the asynchronous behavior of the RSA encryption/decryption function, we develop an asynchronous architecture of modular multiplication to further reduce the total number of operations. In the hardware implementation, carry-save addition is used to implement the developed architecture such that the critical path is independent of the size of the modulus. The asynchronous methodology adopted in our design is based on Sutherland’s “micropipeline” [4] and a local-synchronous-global-asynchronous structure. In this way, we can achieve the right balance of engineering cost and resulting performance compared with other asynchronous design methodologies.

0-7803-6470-8/00/$10.00 © 2000 IEEE

This paper is organized as follows. In Section 2, we address the properties of the binary algorithm and the proposed asynchronous modular multiplication algorithm. Section 3 describes the micropipeline structure and the derived asynchronous architecture. The performance evaluation of our resulting implementation is shown in Section 4. Finally, Section 5 concludes this paper.

II. CONCURRENT MODULAR MULTIPLICATION

Consider the binary representation of the exponent $e = [e_{n-1}, \ldots, e_0]$ with $e_{n-1} = 1$. In the binary algorithm, there are two ways to transform the computation of $M^e \bmod N$ into a sequence of squarings and multiplications: the L-algorithm processes the exponent e from the low-order bits to the high-order bits, and the H-algorithm processes it in the opposite direction. In a synchronous design, the binary algorithm takes n iterations, each consisting of one squaring and one multiplication. Therefore, both the H- and L-algorithms take 2n squaring and multiplication operations in a synchronous design. Because both algorithms include “if-else” statements in the modular exponentiation, the total number of modular multiplication operations becomes n+m in an asynchronous design, where n is the number of bits of e and m is the number of 1’s in the binary representation of e [3, 5].
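The operation counts above can be illustrated with a minimal Python sketch of the H-algorithm (left-to-right square-and-multiply). The function name `h_algorithm` and the two counters are illustrative assumptions, not part of the paper; the sketch only models the counts 2n and n+m, not the hardware.

```python
def h_algorithm(M, e, N):
    """Left-to-right (H-algorithm) binary exponentiation with operation counts.

    Returns (M**e mod N, synchronous count 2n, data-dependent count n+m).
    """
    bits = bin(e)[2:]                 # e_{n-1} ... e_0, leading bit is 1
    result = 1
    async_ops = 0                     # multiplications actually needed
    for bit in bits:
        result = (result * result) % N    # squaring, performed for every bit
        async_ops += 1
        if bit == "1":
            result = (result * M) % N     # multiplication only when e_i = 1
            async_ops += 1
    sync_ops = 2 * len(bits)          # fixed schedule: square + multiply per bit
    return result, sync_ops, async_ops


value, sync_ops, async_ops = h_algorithm(5, 0b1011, 221)
```

For the exponent 1011 in binary (n = 4, m = 3), the fixed schedule uses 8 operations while the data-dependent schedule uses 7.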

Based on the asynchronous behavior of the modular exponentiation, the objective of this paper is twofold. First, if $e_i = 0$ in the current iteration, then only the squaring operation is performed. Secondly, for each modular multiplication, we also skip unnecessary additions and/or subtractions to speed up its evaluation. In this way, we can take full advantage of the asynchronous behavior to increase the throughput rate of modular exponentiation in the RSA cryptosystem. Note that squaring can be treated as a special case of multiplication. Without loss of generality, an efficient asynchronous algorithm and the corresponding asynchronous architecture are proposed to speed up the modular multiplication denoted as $BA \bmod N$, where B, A and N represent the n-bit multiplicand, multiplier and modulus, respectively. The two-phase handshaking protocol is adopted at the higher level to control the overall operations.

To overcome the data dependency between addition and subtraction in the traditional division-during-multiplication algorithm, we partition the multiplier into several equal-sized segments and then perform the multiplication of the current segment and the residue calculation of its preceding segment concurrently [6]. Thus, a potential speedup of the modular multiplication can be achieved. Specifically, let the multiplier A be partitioned into k equal-sized segments, each with n/k bits, denoted from $A_{k-1}$ to $A_0$; then the modular multiplication can be rewritten as

$$R = (\cdots((BA_{k-1})2^{n/k} + BA_{k-2})2^{n/k} + \cdots + BA_0) \bmod N \qquad (1)$$

As a result, the multiplication of the current segment and the residue calculation of its preceding segment can be performed concurrently, and thus evaluated in a pipelined fashion. From (1), the final result can be obtained by iterating from i = k down to 0 with $R_{k+1} = 0$, and the recursive equation can be restated as

$$R_i = (R_{i+1} \bmod N) \times 2^{n/k} + BA_{i-1}, \quad \text{for } i > 0 \qquad (2.a)$$

$$R_i = R_1 \bmod N, \quad \text{for } i = 0 \qquad (2.b)$$

Based on the asynchronous behavior of modular multiplication and the presented partitioning scheme, the proposed algorithm is shown in Fig. 1. For simplicity of explanation, we use carry-propagation addition to illustrate the main concept of our approach, which can readily be applied to carry-save addition. It should be noted that the final implementation used for performance evaluation is based on carry-save addition.
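The recursion in (2.a) and (2.b) can be checked numerically with a short Python sketch. It is a purely sequential software model under the assumption that k divides n; the function name `segmented_modmul` is hypothetical, and the pipelining of the two steps is not modeled.

```python
def segmented_modmul(B, A, N, n, k):
    """Compute B*A mod N from k equal-sized segments of the n-bit multiplier A."""
    assert n % k == 0, "this sketch assumes k divides n"
    seg_bits = n // k
    mask = (1 << seg_bits) - 1
    # segments[i] is A_i; A_0 holds the least significant n/k bits of A
    segments = [(A >> (i * seg_bits)) & mask for i in range(k)]

    R = 0                                    # R_{k+1} = 0
    for i in range(k, 0, -1):                # i = k down to 1, Eq. (2.a)
        R = ((R % N) << seg_bits) + B * segments[i - 1]
    return R % N                             # i = 0, Eq. (2.b)


B, A, N = 0xBEEF, 0xCAFE, 0xF123
result = segmented_modmul(B, A, N, n=16, k=4)
```

In each iteration the term `R % N` belongs to the preceding segment and `B * segments[i - 1]` to the current one, which is what allows the two to be overlapped in hardware.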

Let the symbol $X[a:b]$ denote the radix-2 value of X counting from bit location a to bit location b, and let $N_k[i]$ be the ith element of the set $N_k$. As shown in Fig. 1, the scanning patterns {00, 01, 10} in the multiplication loop are used to check the conditions for removing unnecessary additions in each segment, such that the number of operation cycles can be reduced. When the two selected bits of A match one of the scanning patterns, an equivalent radix-4 operation can be performed; otherwise, a radix-2 operation is applied. The process continues until either the bits in the current segment of A are exhaustively evaluated or a boundary condition is met. The boundary condition occurs when only one bit of A is left for bit scanning. A similar concept is also applied during residue calculation, but a different set of scanning patterns is chosen for estimating the quotient, using a modified two-bit overlapping scanning technique [6]. Note that, as shown in Section 4, the performance of asynchronous implementations depends on how many bits are scanned and how many pre-computed values are stored. Nevertheless, the segmentation technique is used to maximize the probability of benefiting from the asynchronous behavior. The conceptual relationship between the partitioned segments and the pipelined schedules for synchronous and asynchronous implementations is illustrated in Fig. 2.
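The scanning rule for the multiplication loop can be sketched in Python as follows. This is a simplified software model rather than a transcription of Fig. 1: the function name `scan_multiply`, the bit indexing, and the cycle counter are assumptions made for illustration. The pattern 11 falls back to a radix-2 step, presumably because a radix-4 step would require the multiple 3B.

```python
def scan_multiply(B, A, nbits):
    """Multiply B by the nbits-wide value A, scanning A from the MSB.

    A pair of bits in {00, 01, 10} is consumed in one radix-4 step;
    otherwise (pattern 11, or only one bit left) one radix-2 step is used.
    Returns (product, number of operation cycles).
    """
    M = 0
    cycles = 0
    j = nbits - 1
    while j >= 0:
        if j >= 1:                               # at least two bits remain
            pair = (A >> (j - 1)) & 0b11
            if pair != 0b11:                     # matches a scanning pattern
                M = 4 * M + B * pair             # equivalent radix-4 operation
                j -= 2
                cycles += 1
                continue
        bit = (A >> j) & 1                       # boundary or pattern 11
        M = 2 * M + B * bit                      # radix-2 operation
        j -= 1
        cycles += 1
    return M, cycles


product, cycles = scan_multiply(7, 0b1001, 4)
```

For the 4-bit multiplier 1001 both pairs match a pattern, so the product is formed in 2 cycles instead of 4; for 1011 the second pair is 11 and 3 cycles are needed.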

MM-asyn(B, A, N)    /* result = BA mod N */
  M[n+n/k-1:0] = R[n+n/k+1:0] = N_k[0] = 0;
  N_k[1] = 2^n - N;  N_k[2] = (2*2^n) mod N;  N_k[3] = (3*2^n) mod N;
  for m = k down to 0 parallel do {
    /* concurrent execution of the two for-loops */
    for j = (n/k-1) down to 0 do {    /* for each segment */
      /* check boundary during scanning */
      if (A[(m-1)*n/k+j-2 : (m-1)*n/k+j-3] ∈ {00, 01, 10} and j ≠ 0) {
        /* an equivalent radix-4 operation */
        M = 4*M + B*A[(m-1)*n/k+j-2 : (m-1)*n/k+j-3];
        j = j - 1;
      }
      else    /* radix-2 */
        M = 2*M + B*A[(m-1)*n/k+j-2];
    }
    for i = (n/k+1) down to 0 do {
      if (i > 1) {
        c = 0;
        if (R[n+i+c : n+i+c-2] ∈ {000, 001, 010, 011, 100} and i ≠ 2) {
          /* radix-4 */
          R = R[n+i+c-3 : 0] + N_k[R[n+i+c : n+i+c-2]] * 2^(i+c-2);
          i = i - 1;
        }
        else
          R = R[n+i+c-2 : 0] + N_k[R[n+i+c : n+i+c-1]] * 2^(i+c-1);
      }
      else {
        c = 1;
        R = R[n+i+c-2 : 0] + N_k[R[n+i+c : n+i+c-1]] * 2^(i+c-1);
      }
    }
    R = M + R*2^(n/k);  M = 0;
  }

Fig. 1: The concurrent algorithm with asynchronous behavior for modular multiplication

Fig. 2: The conceptual diagram of the proposed algorithm: (a) partitioned multiplier; (b) synchronous pipelined operation; (c) asynchronous pipelined operation

III. ASYNCHRONOUS ARCHITECTURE OF MODULAR MULTIPLICATION

The asynchronous architecture of the developed algorithm is based on both the micropipeline and the local-synchronous-global-asynchronous (LSGA) structures. The two-phase handshaking protocol is used to control the data transfer between the multiplication block and the residue block. Within each block, all operations are performed in a synchronous manner, and the asynchronous rules are applied to eliminate unnecessary operations.

3.1 Operations of the Micropipeline

In general, the asynchronous communication interface between the sender and the receiver consists of two control signals, i.e., request and acknowledgement, and the transmitted data signals. Using the two-phase bundled data convention shown in Fig. 3, the interaction between the sender and the receiver is described as follows. The sender produces a request event to indicate that valid data are available. The receiver generates an acknowledgement event when the data have been accepted. The three events, data change, request event and acknowledge event, always repeat in cyclic order. The sequence defines only the order in which the events must occur; there is no upper bound on the delay between consecutive events.

The basic control circuit of the micropipeline structure consists of Muller-C elements and delay elements, as shown in Fig. 4. The Muller-C element acts as an AND function for events and follows a very simple stage state rule: if the predecessor and successor differ in state, then copy the predecessor’s state; otherwise hold the present state [4]. To provide correct timing signals among stages, the hardwired delay and delay cells should be larger than the time of the data operation in a stage. Therefore, keeping the hardware overhead small while respecting the physical bundling constraint should be handled in an automated way.
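The stage state rule can be modeled behaviorally in a few lines of Python. This is only a zero-delay logical sketch, assuming each stage is a Muller-C element whose second input is the inverted state of its successor; the names `c_element`, `stage_next`, and `settle` are hypothetical, and the matched delays of a real micropipeline are not modeled.

```python
def c_element(a, b, prev):
    """Muller-C element: output follows the inputs when they agree, else holds."""
    return a if a == b else prev


def stage_next(pred, succ, state):
    """Stage rule: copy the predecessor if it differs from the successor."""
    return c_element(pred, 1 - succ, state)


def settle(req_in, states, ack_out):
    """Sweep the control chain until no stage changes state."""
    states = list(states)
    changed = True
    while changed:
        changed = False
        for i in range(len(states)):
            pred = req_in if i == 0 else states[i - 1]
            succ = ack_out if i == len(states) - 1 else states[i + 1]
            new = stage_next(pred, succ, states[i])
            if new != states[i]:
                states[i] = new
                changed = True
    return states


first = settle(1, [0, 0, 0], ack_out=0)    # one request event enters an empty pipeline
second = settle(0, first, ack_out=0)       # a second event arrives, output not yet acknowledged
third = settle(0, second, ack_out=1)       # the receiver acknowledges the first event
```

In this model the first event travels through all three stages, the second event queues behind it because the last stage holds its state, and it advances only after the acknowledgement toggles.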

Fig. 3: Two-phase bundled data convention (timing diagram covering a first cycle and the next cycle)

3.2 LSGA Architecture

Based on the characteristics of the micropipeline structure, a simplified asynchronous architecture of modular exponentiation/multiplication for the RSA cryptosystem is shown in Fig. 5. Two arithmetic blocks, denoted as the M-block and R-block, are designed to perform the multiplication and residue calculation in a pipelined fashion for modular multiplication. These two blocks are implemented using carry-save addition such that the critical path is independent of the size of the operands. The other blocks are dedicated control circuits that generate the control signals needed to implement the asynchronous RSA cryptosystem.

Within the M-block and R-block, all operations are executed in synchronous mode; we can therefore achieve lower hardware cost and easier timing control than with a purely asynchronous implementation. Based on the asynchronous behavior of modular multiplication, the total number of operations in the M-block and R-block can be reduced by removing unnecessary operations. The three-level counter shown in Fig. 6 keeps track of the required iteration cycles and generates the necessary signals used in the micropipeline structure. The trigger circuit is designed to avoid metastability during clock synchronization. The 4-2 counter is implemented using two cascaded 3-2 counters for the purpose of pipelined operation to reduce the critical path delay. The stored values are the message M, the public key e, the intermediate result B held as two words (the carry-save form of the operand B), and the pre-computed values $N_k[i]$.
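The idea of building a 4-2 counter from two cascaded 3-2 counters can be sketched on Python integers, where each bit position plays the role of one full adder. The function names `csa_3_2` and `csa_4_2` are illustrative assumptions; the sketch shows only the arithmetic identity, not the pipelined circuit.

```python
def csa_3_2(x, y, z):
    """3-2 counter (carry-save adder): x + y + z == s + c, with no carry chain."""
    s = x ^ y ^ z                                # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1       # per-bit carry, shifted left
    return s, c


def csa_4_2(w, x, y, z):
    """4-2 counter built from two cascaded 3-2 counters."""
    s1, c1 = csa_3_2(w, x, y)
    return csa_3_2(s1, c1, z)


s, c = csa_4_2(13, 7, 9, 22)
total = s + c        # a single carry-propagate addition recovers the full sum
```

Because each output bit depends only on a fixed number of input bits, the delay of a step does not grow with the operand width, which is the property the text relies on for the critical path.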

Fig. 4: The micropipeline without processors

Fig. 5: The developed asynchronous architecture

The Second IEEE Asia Pacific Conference on ASICs / Aug 28-30, 2000

Table I (number of additions):

| Design       | Additions             | n = 512, k = 16 |
| Synchronous  | $2n(n/k+5)(k+1)$      | $2.46n^2$       |
| Asynchronous | $1.5n(f(n/k))(k+1)$   | $1.352n^2$      |

Table II (time, hardware, and time-hardware product T*H for different pre-stored values $N_k[i]$):

| Design       | Pre-stored values $N_k[i]$ | Time        | Hardware | T*H        |
| Synchronous  | i = 1, 2, 3, 4             | $2.460n^2$  | $28.26n$ | $69.52n^3$ |
| Asynchronous | i = 1, 2, 3, 4             | $1.352n^2$  | $30.95n$ | $41.84n^3$ |
| Asynchronous | i = 1, 2, ..., 5           | $1.221n^2$  | $32.10n$ | $39.16n^3$ |

Fig. 6: The three-level counter. The first-level counter counts the number of cycles in each segment, the second-level counter counts the number of segments in one modular multiplication, and the third-level counter counts the number of multiplications in one modular exponentiation.

IV. PERFORMANCE EVALUATION

In the RSA cryptosystem, both encryption and decryption are modular exponentiations. For the purpose of comparison, the well-known H-algorithm is assumed in this work. Tables I and II compare the hardware requirement and time complexity of the synchronous implementation reported in [6] with the asynchronous counterparts developed in this paper. In Table II, the hardware requirement is computed based on the Compass cell library and the 0.6 µm SPTM process. The total operation time, in terms of additions, for the different types of asynchronous implementation is estimated by simulation using Cadence tools with n = 512 and k = 16 in our final implementation. The estimated operation time is derived from 20 randomly selected paragraphs, which are verified through both the encryption and decryption processes.

As stated above, the performance of asynchronous implementations depends on how many bits are scanned and how many pre-computed values are stored. In Table II, we evaluate the performance when more information related to the value N is stored, where $N_k[i] = (i \cdot 2^n) \bmod N$. As seen in Table II, a significant speed improvement can be achieved by taking into account the asynchronous behavior of modular multiplication and modular exponentiation.
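As a rough arithmetic check of the tabulated figures, the Python sketch below evaluates the synchronous addition count of Table I at n = 512, k = 16 and recomputes the time-hardware products of Table II. The function and variable names are illustrative, the coefficients are copied from the tables, and the "asynchronous" row labels are an assumption based on the matching $1.352n^2$ entry in Table I.

```python
def sync_additions(n, k):
    """Synchronous addition count from Table I: 2n(n/k + 5)(k + 1)."""
    return 2 * n * (n // k + 5) * (k + 1)


n, k = 512, 16
sync_coeff = sync_additions(n, k) / n**2     # coefficient of n^2

# (time coefficient of n^2, hardware coefficient of n) as listed in Table II
rows = {
    "synchronous, i = 1..4": (2.460, 28.26),
    "asynchronous, i = 1..4": (1.352, 30.95),
    "asynchronous, i = 1..5": (1.221, 32.10),
    "asynchronous, i = 1..7": (1.147, 34.66),
}
th_products = {name: t * h for name, (t, h) in rows.items()}   # coefficient of n^3
```

Here `sync_coeff` comes out at about 2.457, which rounds to the $2.46n^2$ entry, and the recomputed products agree with the listed T*H column to within a few hundredths.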

V. CONCLUSION

We present a concurrent algorithm and its associated asynchronous architecture to speed up modular exponentiation/multiplication with a large modulus. The performance evaluation of our asynchronous architecture and VLSI implementation has demonstrated the efficiency of our development. The result is promising: if asynchronous operation can be adopted between the communication environment and the RSA cryptosystem, then our result is well suited to high-speed RSA cryptosystems. Moreover, given the modular design and simple control of our implementation, it is also appropriate for VLSI implementation.

Table I: Comparisons of the Time Complexity of Modular

... and pre-stored values $N_k[i]$.

| Pre-stored values $N_k[i]$ | Time        | Hardware | T*H        |
| i = 1, 2, ..., 7           | $1.147n^2$  | $34.66n$ | $39.76n^3$ |

REFERENCES

[1] W. Diffie and M. E. Hellman, “New Directions in Cryptography,” IEEE Trans. on Information Theory, Vol. IT-22, No. 6, pp. 644-654, 1976.

[2] R. L. Rivest, A. Shamir and L. Adleman, “A Method for Obtaining Digital Signatures and Public-Key Cryptosystems,” Communications of the ACM, Vol. 21, No. 2, pp. 120-126, Feb. 1978.

[3] D. E. Knuth, The Art of Computer Programming, Vol. 2, Seminumerical Algorithms, Addison-Wesley, 1973.

[4] I. E. Sutherland, “Micropipelines,” Communications of the ACM, pp. 720-738, 1988.

[5] K. Y. Lam and L. C. K. Hui, “Efficiency of SS(I) Square-and-Multiply Exponentiation Algorithms,” Electronics Letters, Vol. 30, No. 25, pp. 2115-2116, Dec. 1994.

[6] Jia-Lin Sheu, Ming-Der Shieh, Chien-Hsing Wu and Ming-Hwa Sheu, “A Pipelined Architecture of Fast Modular Multiplication for RSA Cryptography,” in Proc. 1998 IEEE ISCAS, pp. II-121-124, 1998.

[7] G. R. Blakley, “A Computer Algorithm for Calculating the Product AB Modulo N,” IEEE Trans. on Computers, Vol. C-32, pp. 497-500, May 1983.

[8] P. L. Montgomery, “Modular Multiplication without Trial Division,” Mathematics of Computation, Vol. 44, pp. 519-521, April 1985.
