01277455

7/30/2019 01277455


A Novel ASIC Implementation of RSA Algorithm

Zhu Kejia, Xu Ke, Wang Yang, and Min Hao, Member, IEEE

ASIC & System State Key Lab, Fudan University
220 Handan Road, Shanghai, 200433, P.R. China
E-mail: [email protected]

Abstract

In this paper, a novel ASIC implementation of the RSA algorithm is presented. By utilizing Yang's modified Montgomery algorithm, the over-large residue problem is eliminated. The multiplication and the Montgomery modular reduction in modular multiplication are handled identically to minimize hardware cost. Microprogrammed control makes the architecture very flexible to support variable key lengths. These features make the chip very suitable for smart card applications. An RSA coprocessor based on the new architecture has been fabricated with a 0.5 µm CMOS cell library. The coprocessor has a gate count of 14 K and a die size of 3 mm², with a maximum clock frequency of 40 MHz, and takes about 325 ms to encrypt or decrypt 1024-bit data.

I. Introduction

With the constant progress of the economy, the smart card is becoming more and more significant and is playing an important role in our daily life. Public key cryptosystems can offer the highest level of security together with maximum flexibility to smart card applications. Among the various public key cryptography algorithms, the RSA cryptosystem [1] is the best-known, most versatile, and most widely used public key cryptosystem today.

The basic RSA operation is a modular exponentiation on large numbers (typically 200 to 1000 bits), which can be split into successive modular multiplications. The Blakley algorithm [2] and the Montgomery algorithm [3] are the most commonly used modular multiplication methods. In the latter, the Montgomery method, the divisor is not N but a power of 2, which is especially suitable for hardware implementations.

Various algorithm modifications and hardware implementations of the Montgomery algorithm can be found. A high-radix technique is utilized to speed up the computation in [4] and [5]. In [6], quotient determination is avoided by shifting the multiplicand by 2 bits. These improvements achieve a significant speed-up of Montgomery modular multiplication. However, they all suffer from the over-large residue. Though Yang's modified Montgomery algorithm in [7] successfully solved the problem, its hardware design was too complex, resulting in large chip area and power consumption. Furthermore, it can only handle 512-bit RSA operation.

0-7803-7889-X/03/$17.00 ©2003 IEEE.

In this paper, we propose a novel implementation of Yang's modified Montgomery algorithm for the smart card environment. The new architecture unifies the multiplication and the modular reduction in Montgomery's algorithm, thus greatly decreasing the hardware requirements. A software-hardware co-design technique is utilized to minimize the chip size and maximize the flexibility. An MCU is designed as the controller. The new architecture is applied to a single-chip RSA coprocessor to achieve the best trade-off among speed, chip area and flexibility.

This paper is organized as follows. In Section II, we introduce Yang's modified Montgomery algorithm and propose our modifications. In Section III, the chip architecture is presented. The hardware design and test are detailed in Section IV. Finally, we conclude this paper in Section V.

II. Algorithm

A. Yang's modified Montgomery algorithm

There are many ways to realize the basic Montgomery algorithm. Yang's modified Montgomery algorithm [7] avoids the over-large residue problem and the additional subtraction procedure. It is shown as follows:

Output: $R = \mathrm{Mon\_pro\_yang}(A, B, N) = A \cdot B \cdot 2^{-(n+2)} \bmod N$
Input: $A$, $B$, $N$
Part 1: step 1. $A \cdot B = G = G_1 \cdot 2^{n+2} + G_0$,
$G_0 \ge 0$, $G_1$


$G_1$. Then we sum $P_{n+2}$ and $G_1$ to get the modular multiplication result $R$. We split the product $G$ into a higher $(n-2)$-bit part and a lower $(n+2)$-bit part, rather than into two $n$-bit numbers, to avoid the over-large residue problem and the additional subtraction procedure.
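The split-and-sum structure described above can be illustrated with a small Python sketch. The intermediate steps of the algorithm are not recoverable from the text, so the reduction loop below is an assumption: it is a textbook bit-serial Montgomery reduction in which the lower part $G_0$ enters one bit per step as the one-bit addend g. The function name `mon_pro_yang` follows the text; the parameter names and the 16-bit modulus are hypothetical.

```python
import random


def mon_pro_yang(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Return R with R * 2**(n_bits + 2) congruent to a * b modulo n_mod.

    n_mod must be odd. The loop is an assumed bit-serial reduction,
    not a transcription of the original hardware steps.
    """
    k = n_bits + 2
    product = a * b
    g1 = product >> k                 # higher part G_1
    g0 = product & ((1 << k) - 1)     # lower (n+2)-bit part G_0
    p = 0
    for i in range(k):
        g = (g0 >> i) & 1             # assumed: G_0 is fed in one bit per step
        q = (p & 1) ^ g               # parity of (p + g), an exclusive-OR
        p = (p + q * n_mod + g) >> 1  # the sum is even, so the shift is exact
    return p + g1                     # sum of the reduced lower part and G_1


n_bits = 16
n_mod = 0xC35B  # hypothetical odd 16-bit modulus
for _ in range(1000):
    a = random.randrange(n_mod)
    b = random.randrange(n_mod)
    r = mon_pro_yang(a, b, n_mod, n_bits)
    assert (r << (n_bits + 2)) % n_mod == (a * b) % n_mod
```

The check only confirms the congruence $R \cdot 2^{n+2} \equiv A \cdot B \pmod N$; it makes no claim about the exact residue bounds of the original algorithm.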

In [7], the algorithm is implemented in parallel, and the multiplication and the Montgomery modular reduction are realized by different hardware to achieve high throughput, resulting in a large hardware cost (74 K gates). Furthermore, it can only handle 512-bit RSA operation.

B. Algorithm modifications

In order to solve the problems in [7] and make our chip more suitable for smart card applications, we propose a new implementation of the algorithm, mainly with the following modifications:

(1) Unify the multiplication and the modular reduction in Yang's modified Montgomery algorithm into one operation, shift-and-add, greatly decreasing the hardware cost.

The modular multiplication consists of two operations: multiplication and Montgomery modular reduction. The multiplication can be implemented by shift-and-add. In the modular reduction, first, in step 3a, the quotient determination is a parity decision on the sum of the intermediate result and the carry, which can easily be realized by using an exclusive-OR gate. Next, in step 3b, although there are three addends, the addend g is just one bit long; we can take it as the carry-in of the lowest bit and use a common full adder to realize the 3-operand addition. The division can be realized by a shift because the divisor is 2. So the modular reduction can also be realized by shift-and-add, making the design more regular and saving hardware. Furthermore, pipelines are easily utilized to improve the performance.

(2) Utilize microprogrammed control instead of hardwired control, greatly increasing our chip's flexibility.

C. Multiplication realization

As the size of the multiplication operands is large (1024 bits or more), it is impossible to handle the multiplication in parallel. We split the process into several small blocks, each of which is a small multiplication. We handle these small blocks by using a simple multiplier and get some partial products. Then, under the microprogrammed control logic, we add up these partial products to get the final results.

The simple multiplier can be easily implemented by shift-and-add. In this paper we design a 264-bit accumulator, which produces one 264×264 block product after accumulating for 264 cycles. We can divide a 1056×1056 operation into 16 blocks and calculate them one by one. Generally speaking, we can handle a (264·K)-bit modular multiplication by $K^2$ blocks of accumulations. (Here we select the bit length 1056 rather than 1024 because in the RSA algorithm $M^d \bmod N$, $M < N$ is needed, so $N$ is larger than $M$, whose bit length is 1024 for usual applications.)
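The block decomposition can be sketched in Python as follows. This is a minimal illustration, assuming each operand is cut into K limbs of 264 bits and every limb pair is multiplied once; the helper names `split_blocks` and `blockwise_multiply` are hypothetical, and Python's built-in multiplication stands in for the 264-cycle shift-and-add multiplier.

```python
BLOCK_BITS = 264  # accumulator width stated in the text


def split_blocks(x: int, k: int) -> list[int]:
    """Cut x into k limbs of BLOCK_BITS bits, least significant first."""
    mask = (1 << BLOCK_BITS) - 1
    return [(x >> (BLOCK_BITS * i)) & mask for i in range(k)]


def blockwise_multiply(a: int, b: int, k: int) -> tuple[int, int]:
    """Multiply two (BLOCK_BITS * k)-bit numbers block by block.

    Returns the full product and the number of block multiplications.
    """
    a_blocks = split_blocks(a, k)
    b_blocks = split_blocks(b, k)
    total = 0
    count = 0
    for i, a_i in enumerate(a_blocks):
        for j, b_j in enumerate(b_blocks):
            partial = a_i * b_j  # one 264 x 264 block (stand-in for shift-and-add)
            total += partial << (BLOCK_BITS * (i + j))  # add at its weight
            count += 1
    return total, count


a = (1 << 1056) - 12345
b = (1 << 1040) + 987654321
product, blocks = blockwise_multiply(a, b, 4)
assert product == a * b
assert blocks == 16  # K = 4 gives K**2 = 16 blocks, as stated in the text
```

The sketch covers only the plain multiplication; in the chip the partial products are added up under microprogrammed control, and the order in which blocks are scheduled is not specified here.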

Of course, these improvements cost some speed, but in our smart card applications the speed is high enough to handle most circumstances.

III. System architecture

A. Hardware design

Fig. 1 shows the architecture of our RSA coprocessor, which includes four portions: Master MCU, RAM, Controller and Datapath. We use a master MCU to control the exponent scan, while our coprocessor focuses on accelerating the modular multiplication.

There are two bus access modes for RAM operations: an 8-bit bus for the master MCU and a 264-bit bus for RSA. A control mechanism is utilized to avoid bus access conflicts. In this way we can avoid the I/O buffers and parallel-serial conversions. This solution results in some cost in memory size due to the increased complexity of the memory decoder. However, because the memory is very regular, the final layout will not increase significantly.
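The division of labor between the master MCU and the coprocessor can be sketched as a binary exponent scan that delegates every modular multiplication to a callback. This is an illustrative sketch only: the scan direction (most significant bit first) and the names `exponent_scan` and `software_modmul` are assumptions, and the callback is a plain software stand-in for the coprocessor.

```python
def software_modmul(a: int, b: int, n_mod: int) -> int:
    """Software stand-in for the coprocessor's modular multiplication."""
    return (a * b) % n_mod


def exponent_scan(m: int, e: int, n_mod: int, modmul) -> tuple[int, int]:
    """Compute m**e mod n_mod by scanning the bits of e from the top.

    Returns the result and the number of modular multiplications issued.
    """
    result = 1
    calls = 0
    for bit in bin(e)[2:]:
        result = modmul(result, result, n_mod)  # one squaring per exponent bit
        calls += 1
        if bit == "1":
            result = modmul(result, m, n_mod)   # one multiplication per 1 bit
            calls += 1
    return result, calls


value, calls = exponent_scan(4, 13, 497, software_modmul)
assert value == pow(4, 13, 497)
assert calls == (13).bit_length() + bin(13).count("1")
```

In this variant the number of modular multiplications is the bit length of the exponent plus the number of its 1 bits, which is the quantity that the factor $(\log_2 E + V(E))$ in the performance estimate approximates.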

Fig. 1 Chip architecture

Fig. 2 shows the architecture of the Controller. Two kinds of control approaches are often utilized [8]: hardwired control and microprogrammed control. Both of them have their own advantages and disadvantages. Hardwired control can operate at high speed. However, it has little flexibility, and the complexity of the instruction set it can implement is limited. Microprogrammed control provides a means for simple, flexible and relatively inexpensive execution of instructions. One of its drawbacks is that it leads to a slower operating speed because of the time it takes to fetch microinstructions from the control store. In our RSA chip the speed bottleneck lies in the Datapath. Thus we choose microprogrammed control for its flexibility, and it can support different key lengths. The microinstruction set is variable length. Each instruction acts as a small state machine.

Fig. 2 Controller architecture

Fig. 3 Datapath architecture

The Datapath is the critical path of the chip and is based on a shift-accumulator. It is shown in Fig. 3. The operations of the Datapath are described below. It has a 264-bit internal bus. Acc is a 264-bit register for accumulation; Acsl is a 264-bit shifter; MI is a 264-bit register to latch the data from RAM. Each time we prepare for the accumulation, we put the previous partial product in Acc, the multiplicand in the MI register and the multiplier in Acsl. Then whether to add the multiplicand to Acc is decided according to whether the current bit of the multiplier is 1 or 0. At the same time the lowest bit of Acc is shifted into the highest bit of Acsl, and then Acsl is shifted. We choose Acsl to hold both the multiplier and the product because we do not need the multiplier after the multiplication processing. After one pass of processing, the higher 264 bits of the product are in Acc, while the lower 264 bits are in Acsl. The interim product is saved to RAM and the next pass begins. Through these repetitions the whole modular multiplication can be handled.
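The register-level behavior described above can be modeled with a short Python sketch. It is a behavioral model under assumptions, not the circuit: the variable names mirror the registers Acc, Acsl and MI from the text, the function name `shift_accumulate` is hypothetical, and the `width` parameter is added so the model can also be exercised at small sizes.

```python
def shift_accumulate(partial: int, multiplicand: int, multiplier: int,
                     width: int = 264) -> tuple[int, int]:
    """Model one pass of the shift-accumulator.

    Returns (acc, acsl): the higher and lower `width` bits of
    partial + multiplicand * multiplier.
    """
    acc = partial        # Acc: previous partial product
    mi = multiplicand    # MI: multiplicand latched from RAM
    acsl = multiplier    # Acsl: multiplier, gradually replaced by product bits
    for _ in range(width):
        if acsl & 1:
            acc += mi    # conditional add; may briefly need width + 1 bits
        # lowest bit of Acc enters the top of Acsl while Acsl shifts right
        acsl = (acsl >> 1) | ((acc & 1) << (width - 1))
        acc >>= 1
    return acc, acsl


high, low = shift_accumulate(77, 200, 123, width=8)
assert (high << 8) | low == 77 + 200 * 123
```

Because each consumed multiplier bit frees one position at the top of Acsl, the lower half of the product can reuse the multiplier's register, which is the storage saving the text points out.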

Fig. 4 shows the architecture of the 264-bit adder in Fig. 3. We split the long adder into eight 33-bit ones. There are two modes for these adders: CSA (Carry Save Adder) mode, in which the eight 33-bit adders work in parallel, thus eliminating the long carry chain; and CRA (Carry Ripple Adder) mode, in which the eight 33-bit adders work in series, so that a 264-bit add operation takes eight clock cycles, and which is mainly used for carry adjusting. A control bit is utilized to switch the mode. Because the chip works in CSA mode for more than 90 percent of the processing time, the slower CRA mode is not a serious limitation.
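The two adder modes can be illustrated with a simplified behavioral model in Python. Everything beyond the eight 33-bit segments is an assumption made for the sketch: the class name `SegmentedAdder`, the choice to feed each saved carry into the lowest bit of the next segment on the following add, and the way overflow above bit 263 is kept in the last carry slot.

```python
import random

SEG_BITS = 33
SEGMENTS = 8
SEG_MASK = (1 << SEG_BITS) - 1


class SegmentedAdder:
    def __init__(self) -> None:
        self.segs = [0] * SEGMENTS   # eight 33-bit sum segments
        self.saved = [0] * SEGMENTS  # carry saved out of each segment

    def value(self) -> int:
        """Integer represented by the segments plus the pending carries."""
        total = 0
        for i in range(SEGMENTS):
            total += self.segs[i] << (SEG_BITS * i)
            total += self.saved[i] << (SEG_BITS * (i + 1))
        return total

    def add_csa(self, operand: int) -> int:
        """CSA mode: all segments add in parallel. Returns cycles used."""
        new_saved = [0] * SEGMENTS
        for i in range(SEGMENTS):
            carry_in = self.saved[i - 1] if i > 0 else 0
            part = (operand >> (SEG_BITS * i)) & SEG_MASK
            t = self.segs[i] + part + carry_in
            self.segs[i] = t & SEG_MASK
            new_saved[i] = t >> SEG_BITS     # saved, not propagated
        new_saved[-1] += self.saved[-1]      # keep overflow above bit 263
        self.saved = new_saved
        return 1

    def adjust_cra(self) -> int:
        """CRA mode: ripple pending carries segment by segment."""
        carry = 0
        for i in range(SEGMENTS - 1):
            t = self.segs[i] + carry
            self.segs[i] = t & SEG_MASK
            carry = (t >> SEG_BITS) + self.saved[i]
            self.saved[i] = 0
        t = self.segs[-1] + carry
        self.segs[-1] = t & SEG_MASK
        self.saved[-1] += t >> SEG_BITS
        return SEGMENTS                      # one segment per clock cycle


adder = SegmentedAdder()
expected = 0
for _ in range(100):
    x = random.getrandbits(264)
    adder.add_csa(x)
    expected += x
    assert adder.value() == expected
adder.adjust_cra()
assert adder.value() == expected
assert all(c == 0 for c in adder.saved[:-1])
```

The model shows why carry adjusting is only needed occasionally: the saved carries keep the represented value correct across many parallel adds, and a single eight-step ripple pass resolves them when a conventional binary result is required.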

The 33-bit adder consists of a 32-bit common full adder and a one-bit specially designed adder. Because of the CSA mode, in the 33rd bit there is not only the carry from the 32nd bit, but also the carry saved by the previous processing step. Thus two carries may be generated, and they are handled separately.

Fig. 4 Adder architecture

B. Performance

By choosing different K, our chip can handle variable RSA operations. The total number of clock cycles in the processor is

$$(2 \cdot 256 \cdot K^{2} + 32 \cdot K^{*})\,(\log_2 E + V(E)), \qquad (1)$$

where V(E) is the number of 1 bits in the exponent. If K = 4, which means a 1056-bit RSA operation, it takes about 13 M clock cycles in the average case (each exponent bit being 0 or 1 with equal probability).

IV. Circuit implementation and test

Fig. 5 shows the die photo of the chip, including an RSA coprocessor, a master MCU core, RAM, ROM, E²PROM and a USB interface. All the digital circuits are in the central irregular parts. The whole chip is based on a 0.5 µm CMOS cell library. It has a gate count of about 14 K and a die size of 3 mm².

Fig. 5 Chip microphotograph


Fig. 6 shows the test system. In the middle of the board is the test chip. In our test vectors, we used 1000 different pairs of ciphers and keys, including the overflow case and some other corner cases. By comparing the results with software models, we are sure that the chip can work properly. The maximum clock frequency tested is about 40 MHz.

Our design has a very small die size, sufficient speed and flexibility, which make it a very good solution for low cost applications such as smart cards.

Acknowledgments

The authors wish to thank Zhongping Nie and Jian

Fig. 6 Test board

Table II shows a comparison with a commercial RSA chip [9].

Table II Compared with commercial traditional counterpart

References

... modular multiplication," in Proc. 12th Symp. on Computer Arithmetic, pp. 193-199, Jul. 1995.

[5] N. Shand and J. Vuillemin, "Fast implementations of RSA cryptography," in Proc. 11th Symp. on Computer Arithmetic, pp. 252-259, 1993.

[6] S. E. Eldridge and C. D. Walter, "Hardware implementation of Montgomery's modular multiplication algorithm," IEEE Trans. Comput., vol. 42, pp. 693-699, Jun. 1993.

[7] Ching-Chao Yang, Tian-Sheuan Chang, and Chein-Wei Jen, "A new RSA cryptosystem hardware design based on Montgomery's algorithm," IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, vol. 45, no. 7, pp. 908-913, Jul. 1998.

[8] H. Carl, V. Zvonko, and Z. Safwat, Computer Organization (McGraw-Hill, 2002), pp. 425.

[9] H. Handschuh, P. Paillier, Proc. of 3rd Inter. Conf. on CARDIS, 1998, pp. 372.

Our design has a very small die size, which makes it especially suitable for low cost applications such as smart cards. Besides, our chip is more flexible because it can be easily programmed for different key lengths to meet various security levels.

V. Conclusion

In this paper, we propose a new ASIC implementation of the RSA algorithm based on Yang's modified Montgomery algorithm. The multiplication and the Montgomery modular reduction in modular multiplication are handled identically to minimize hardware cost. Microprogrammed control is utilized for flexibility. The chip has been fabricated with a 0.5 µm CMOS library. The gate count is about 14 K and the die size is 3 mm². Test results show that the chip can work properly up to 40 MHz. The chip takes about 325 ms to finish a 1024-bit RSA operation and delivers a baud rate of 3.08 Kbps at 40 MHz in the average case.


Date post:	14-Apr-2018
Category:	Documents
Upload:	chitragows