[IEEE 2011 14th Euromicro Conference on Digital System Design (DSD) - Oulu, Finland...

Page 1: [IEEE 2011 14th Euromicro Conference on Digital System Design (DSD) - Oulu, Finland (2011.08.31-2011.09.2)] 2011 14th Euromicro Conference on Digital System Design - Efficient CRT

Efficient CRT RSA with SCA countermeasures

Apostolos P. Fournaris
VLSI Design Lab, Electrical and Computer Engineering Dpt.
University of Patras, Patra, Greece
[email protected]

Odysseas Koufopavlou
VLSI Design Lab, Electrical and Computer Engineering Dpt.
University of Patras, Patra, Greece
Email: [email protected]

Abstract: The RSA cryptographic algorithm, used as a security tool for many years, has long achieved cryptographic and market maturity. However, like all cryptographic algorithms, RSA implementations have, since the discovery and spread of Side Channel Attacks (SCA), been susceptible to a wide variety of attacks that target the hardware structure rather than the algorithm itself. While a wide range of countermeasures can be applied to the RSA structure to protect the algorithm from SCAs, combining several such measures to guarantee an SCA resistant RSA design is not an easy job. There are many incompatibility issues among SCA protection methods, and an SCA secure RSA implementation carries an extensive performance cost. In this paper, we address some very popular and potent SCAs against RSA, namely Fault attacks (FA), Simple Power attacks (SPA), Doubling attacks (DA) and Differential Power attacks (DPA), and propose an algorithmic modification of RSA based on the Chinese Remainder Theorem (CRT) that can thwart those attacks. We describe an implementation approach based on Montgomery modular multiplication and propose a hardware architecture for an SCA resistant CRT RSA that is structured on our proposed algorithm. The designed architecture is implemented in FPGA technology, and results on its time and space complexity are extracted and evaluated.

Index Terms: Public Key Cryptography; VLSI Design; Side Channel Attack Resistance; Modular Exponentiation

I. INTRODUCTION

Public key cryptographic algorithms, due to their high computational complexity, are considered difficult to implement, have significant computational time delay and consume a considerable amount of hardware resources (chip covered area, power dissipation). So, in order to design an efficient public key cryptography hardware architecture that can process information fast and fit in a resource constrained environment, special attention needs to be paid to the design approach [1]. Furthermore, since security enhanced devices nowadays usually operate in a hostile environment where they can be stolen or manipulated by untrusted users, security attacks on their hardware structure cannot be excluded as a potential threat. Such attacks, called side channel attacks (SCA), exploit an architecture's hardware characteristics (power dissipation, computation time, electromagnetic emission, etc.) to extract information about the processed data and use it to deduce cryptographic keys and messages [2]. While RSA is considered very secure against traditional cryptanalysis, side channel attacks have been successful in determining RSA keys using information leaking from a straightforward implementation of the algorithm. For the above reasons, SCA resistance is a highly desirable feature of an RSA encryption-decryption unit architecture, but it usually constitutes a performance bottleneck.

Footnote 0: The work reported in this paper is supported by the European Commission through the SECRICOM FP7 European project under contract FP7 SEC 218123.

SCAs can be mounted very easily using a PC, a simple oscilloscope and some probes, and can therefore be used by a wide range of attackers. This ease of use makes SCAs very potent. Among the most effective SCAs launched on RSA are Fault attacks (FA) and power attacks (PA). The goal of a fault attack is to disturb a hardware device during the execution of a cryptographic operation, analyze the faulty behavior of the disturbed device and, as a result, deduce sensitive information. By combining such an attack with a power attack, where a hardware device's power trace is measured and exploited for secret information leakage [2], an attacker can relatively easily deduce a cryptographic key of a secure hardware device. Protection of a hardware device against SCAs aims either at dissociating the leaked information from the computed secret data or at minimizing the information leakage itself. To achieve these goals, existing SCA countermeasures follow two main directions: changes to the cryptographic algorithm and the related computer arithmetic, or changes to the hardware architecture at hand.

Many researchers have proposed solutions for protecting RSA from FA and PA with relative success [3], [4], [5], [6]. However, those solutions focus on FA-PA resistance at the algorithmic level without taking into account the implementation cost of one SCA secure RSA encryption-decryption operation. This cost is associated with the arithmetic operations required for RSA and can be very high, even prohibitive, when applying the existing SCA resistant RSA solutions in real security devices. SCA countermeasures can be generic, at the circuit level (efficient on specific hardware implementations) [7], [8], or specialized, focused on specific cryptographic algorithms at the algorithmic level [9]. The second approach can be more effective since it uses techniques that better negate a cryptographic algorithm's specific SCA weaknesses.

The Chinese Remainder Theorem approach is widely used in the design of an RSA crypto core since it can speed up RSA calculations (modular exponentiation) up to four times compared to the original algorithm. Following this approach, the RSA modulus is divided into two independent sub-moduli that are used for encryption-decryption of the message. Gauss's combination algorithm is employed at the end of the computation in order to reconstruct the correct RSA outcome. The modular exponentiation unit is, in general, the main target of SCAs. The use of CRT increases this unit's vulnerability to such attacks, so strong countermeasures are needed to protect modular exponentiation. Simple PA resistance is achieved by making the arithmetic operations performed during the exponentiation algorithm indistinguishable to an external observer [5]. This countermeasure can be further enhanced by blinding the modulus N or the message using a random number (Differential PA resistance). Fault attack countermeasures are based on detecting a single fault injection and blocking further processing, thus prohibiting the release of secret information. Giraud in [3] proposed an FA-SPA resistant modular exponentiation algorithm, and later Kim and Quisquater [6] proposed an attack on the algorithm along with a way to thwart it. Recently, Fournaris [10] proposed a modification of Giraud's and Kim's RSA algorithm [6] that works using the Montgomery modular multiplication algorithm and results in very promising performance characteristics. The modified RSA algorithms of [10], [3] and [6] guarantee FA resistance by introducing two values, $S_0$ and $S_1$, and checking whether a known equation between them always holds. If this connection between $S_0$ and $S_1$ is disturbed, then a fault attack is detected and the cryptographic process stops.

2011 14th Euromicro Conference on Digital System Design

978-0-7695-4494-6/11 $26.00 © 2011 IEEE

DOI 10.1109/DSD.2011.81

In this paper, we investigate the possibility of embedding several SCA countermeasures in the CRT RSA algorithm in order to design a hardware architecture that is efficient in space and time complexity. To achieve this goal, we propose an FA-SPA resistant modular exponentiation algorithm and enhance it with message randomization (masking) in order to further protect it from more intricate attacks such as DPA and the relative doubling attack. The proposed algorithm uses the CRT RSA scheme proposed by Giraud [3] and extends the work of [10] by adopting Montgomery modular multiplication and exponentiation as its structural element. The efficiency of the proposed algorithms is evaluated by realizing them in FPGA technology using VHDL. The resulting implementation provided very interesting results in terms of space and time complexity that are not even matched by non SCA resistant designs, which do not bear the extra cost of SCA countermeasures. This evidence suggests that SCA resistance in the proposed structure can be achieved with few compromises in performance.

The paper is organized as follows. In section II, side channel attacks that can easily be mounted on CRT RSA are described. In section III, the Montgomery modular multiplication algorithm is briefly reviewed to give the reader the background of this paper. In section IV, the proposed algorithms are analyzed in detail and their SCA resistance is discussed. In section V, the hardware architecture derived from the proposed algorithms of section IV is presented and analyzed, and performance results and comparisons are given in section VI. Section VII concludes the paper and presents future goals.

II. POWER AND FAULT ATTACKS ON CRT RSA

In the RSA cryptographic scheme, three n-bit numbers are used: the public modulus $N$, the public key $e$ and the private key $d$. Let $N = p \cdot q$, where $p$, $q$ are secret prime numbers, and let $e \cdot d \equiv 1 \pmod{(p-1)(q-1)}$. Assuming that $m$ is the message to be encrypted (plaintext), the RSA encrypted outcome (ciphertext) is $c = m^e \bmod N$ and the decrypted outcome is $m = c^d \bmod N$. CRT is usually used during RSA decryption since the private key $d$ is required to be long. In CRT RSA, we compute $S_p = c^{d_p} \bmod p$ and $S_q = c^{d_q} \bmod q$, where $d_p = d \bmod (p-1)$ and $d_q = d \bmod (q-1)$. Then, the final result is computed by following Gauss's combination algorithm, meaning

$S = CRT(S_p, S_q) = (S_p \cdot q \cdot q_i + S_q \cdot p \cdot p_i) \bmod N$ (1)

where $q_i = q^{-1} \bmod p$ and $p_i = p^{-1} \bmod q$.

Assuming that a fault is introduced during the first exponentiation (with modulus $p$), the faulty output would be $\tilde{S}_p$ and the CRT reconstruction in (1) would be as follows:

$\tilde{S} = CRT(\tilde{S}_p, S_q) = (\tilde{S}_p \cdot q \cdot q_i + S_q \cdot p \cdot p_i) \bmod N$ (2)

Knowing a legitimate CRT-RSA outcome $S$ and a faulty one $\tilde{S}$, one can find the secret prime $q$ by calculating $q = \gcd(S - \tilde{S}, N)$. The deliberate insertion of a fault in the computation flow of one of the exponentiations constitutes a fault attack, originally proposed by Boneh et al. [11] and later enhanced by Lenstra [12] so that no legitimate outcome is needed, as can be observed from the equation $q = \gcd((\tilde{S}^e - m) \bmod N, N)$.
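The fault attack on equations (1) and (2) can be reproduced numerically. The following is a minimal sketch under stated assumptions: the primes, the public exponent, the message and the way the fault is injected (adding 1 to the value computed modulo p) are toy choices made for illustration and do not come from the paper.

```python
from math import gcd

# Toy parameters, assumed for illustration only (real RSA primes are far larger).
p, q = 1009, 1013
N = p * q
e = 17
d = pow(e, -1, (p - 1) * (q - 1))        # private exponent (Python 3.8+)
dp, dq = d % (p - 1), d % (q - 1)
qi, pi = pow(q, -1, p), pow(p, -1, q)

def crt_gauss(sp, sq):
    """Gauss recombination, as in equation (1)."""
    return (sp * q * qi + sq * p * pi) % N

def crt_exponentiate(c, fault_in_p=False):
    sp = pow(c, dp, p)
    sq = pow(c, dq, q)
    if fault_in_p:
        sp = (sp + 1) % p                # hypothetical fault: only S_p is corrupted
    return crt_gauss(sp, sq)

c = pow(123456, e, N)                    # ciphertext of a toy message
good = crt_exponentiate(c)               # S
bad = crt_exponentiate(c, fault_in_p=True)   # faulty S

# Boneh et al.: a correct and a faulty output together reveal q.
recovered_q = gcd((good - bad) % N, N)
# Lenstra: the faulty output and the known input are enough.
recovered_q_lenstra = gcd((pow(bad, e, N) - c) % N, N)
print(recovered_q, recovered_q_lenstra)
```

Both gcd computations return the prime q because the faulty result is still correct modulo q but wrong modulo p.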

To defeat such an attack, several researchers have proposed methods that mainly involve detecting the fault and only then releasing the correct RSA outcome. Giraud in [3] proposes the use of the Montgomery ladder for FA resistance as well as SPA resistance. Giraud takes advantage of the fact that two consecutive, temporary exponentiation outputs always have the same relationship ($m^{a}/m^{a-1} = m$). Instead of generating one output, Giraud's algorithm produces an output pair $(S_0, S_1) = (m^{d-1}, m^d)$ and checks whether the relationship inside the pair holds. The algorithm offers FA and SPA resistance but, due to the use of the Montgomery ladder, is susceptible to the relative doubling attack, introduced in [13] as a special case of the doubling attack [14]. In general, the main idea behind the doubling attack is to choose two strongly related inputs $m$ and $m^2 \bmod N$ and to observe the collision of the two computations $m^{2(2x+d_i)} \bmod N$ and $m^{4x} \bmod N$, which occurs if $d_i = 0$. In the doubling attack, even if the attacker cannot decide whether the computation being performed is a squaring or a multiplication (which is the case with the Montgomery ladder's square and always multiply approach), the attacker can still detect the collision of two operations (basically the squaring operation) within two related computations. More precisely, for two computations $A^2 \bmod N$ and $B^2 \bmod N$, even if the attacker cannot tell the values of $A$ or $B$, he can detect whether there was a collision between $A$ and $B$. The attack uses this information to derive the private key bit $d_i$ by checking whether $d_i = 0$.
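The collision idea can be illustrated with a small simulation. This sketch assumes a plain left-to-right square-and-multiply exponentiation (the classic doubling attack setting, not Giraud's ladder) and models the side channel as the ability to tell whether two squarings had the same operand. The exponent bits, modulus and message are hypothetical toy values.

```python
def squaring_operands(base, exponent_bits, N):
    """Left-to-right square-and-multiply; returns the operand of every squaring."""
    acc, operands = 1, []
    for bit in exponent_bits:
        operands.append(acc)             # value that is about to be squared
        acc = acc * acc % N
        if bit:
            acc = acc * base % N
    return operands

N = 1009 * 1013
d_bits = [1, 0, 1, 1, 0, 0, 1, 0, 1]     # hypothetical secret exponent, MSB first
m = 4242

ops_m = squaring_operands(m, d_bits, N)            # run on input m
ops_m2 = squaring_operands(m * m % N, d_bits, N)   # run on input m^2 mod N

# Squaring j+1 of the first run has operand m^(2x + d_j); squaring j of the
# second run has operand m^(2x). The operands collide when d_j = 0.
guessed = [0 if ops_m[j + 1] == ops_m2[j] else 1 for j in range(len(d_bits) - 1)]
print(guessed)
```

All bits except the last one are recovered from collisions alone, without ever learning the operand values themselves.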

While the generic doubling attack is not applicable to Giraud's CRT-RSA, since absolute $d_i$ values cannot be found, the relationship between two consecutive private key bits is revealed. By exploiting this relationship, that is, by checking whether $d_i = d_{i-1}$ or not, a relative doubling attack mounted as described in [13] can still lead to compromise of the private key $d$ (or $d_p$ and $d_q$ in the case of CRT-RSA).

An additional problem of Giraud's algorithm, related to its fault attack resistance, was pointed out by Kim et al. in [6]. They concluded that a fault can still be inserted undetected into the computation flow and compromise the private key if it is placed after fault detection and before CRT reconstruction in the Giraud CRT-RSA algorithm. To solve this issue, the authors of [6] introduce a random number $a$ into the computation flow that is only removed after CRT reconstruction (and therefore after fault detection). This variation of Giraud's CRT RSA algorithm is described below.

FA-SPA CRT RSA algorithm
Input: $m, a, p, q, d_p, d_q, i_q = q^{-1} \bmod p, N$
Output: $m^d \bmod N$

1) $(S_0^p, S_1^p) = FSPAME(m, a, d_p, p)$
2) $(S_0^q, S_1^q) = FSPAME(m, a, d_q, q)$
3) $S = S_0^q + q \cdot ((S_0^p - S_0^q) \cdot i_q \bmod p)$
4) $S' = S_1^q + q \cdot ((S_1^p - S_1^q) \cdot i_q \bmod p)$
5) $S = (m \cdot S + a) \bmod N$
6) $S' = (S' + a \cdot m) \bmod N$
7) If $(S = S')$ and $p$, $q$ not modified then return $(S - a - a \cdot m) \bmod N$ else return error

FSPAME is an exponentiation algorithm that uses the work of Joye et al. on the Montgomery ladder [5] but is further enhanced by the use of a 32 bit random number $a$. FA resistance is achieved by introducing two values, $S$ and $S'$, for which the equation $(m \cdot S + a) \bmod N = (S' + a \cdot m) \bmod N$ always holds. If this relationship is disturbed, then a fault attack is detected and the cryptographic process is canceled. The FSPAME algorithm is described below. Note that a subscript with a horizontal line over it refers to the logical NOT of the subscript's value.

FA-SPA Giraud Modular Exponentiation (FSPAME) algorithm
Input: $m, e = (1, e_{t-2}, ..., e_0), N, a$
Output: $(a + m^{e-1}) \bmod N, (a + m^e) \bmod N$

1) $S_0 = m$
2) $S_1 = m^2 \bmod N$
3) For $i = t-2$ to 1
   a) $S_{\bar{e_i}} = S_{\bar{e_i}} \cdot S_{e_i} \bmod N$
   b) $S_{e_i} = S_{e_i}^2 \bmod N$
4) $S_1 = (a + S_1 \cdot S_0) \bmod N$
5) $S_0 = (a + S_0^2) \bmod N$
6) If (loop counter $i$ and exponent $e$ are not modified) then return $(S_0, S_1)$ else return error

A similar approach is proposed by Fournaris in [10], where Giraud's algorithm is optimized in terms of hardware efficiency so as to be applicable in real applications. The optimization in [10] is based on the use of Montgomery modular multiplication as an RSA structural element.
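The two listings above can be sketched in software to check that the fault test in step 7 holds for fault-free runs. This is a minimal model, not the authors' implementation: it assumes an odd exponent of at least two bits (true for $d_p$ and $d_q$ here), uses plain modular arithmetic, and leaves out the "not modified" integrity checks on $p$, $q$, the loop counter and the exponent, which have no meaningful software counterpart. The toy primes and the value of `a` are assumptions.

```python
def fspame(m, e, N, a):
    """Ladder returning ((a + m^(e-1)) mod N, (a + m^e) mod N); assumes e odd, e >= 3."""
    bits = bin(e)[2:]                        # e = (1, e_{t-2}, ..., e_0), MSB first
    assert e >= 3 and bits[-1] == "1"
    s = [m % N, m * m % N]                   # s[0] = m, s[1] = m^2
    for ch in bits[1:-1]:                    # i = t-2 down to 1
        ei = int(ch)
        s[1 - ei] = s[1 - ei] * s[ei] % N    # S_{not e_i} = S_{not e_i} * S_{e_i}
        s[ei] = s[ei] * s[ei] % N            # S_{e_i} = S_{e_i}^2
    s1 = (a + s[1] * s[0]) % N
    s0 = (a + s[0] * s[0]) % N
    return s0, s1

def crt_rsa_fa_spa(m, a, p, q, dp, dq):
    N, iq = p * q, pow(q, -1, p)
    sp0, sp1 = fspame(m, dp, p, a)
    sq0, sq1 = fspame(m, dq, q, a)
    s = sq0 + q * ((sp0 - sq0) * iq % p)         # recombines to a + m^(d-1)
    s_chk = sq1 + q * ((sp1 - sq1) * iq % p)     # recombines to a + m^d
    s = (m * s + a) % N
    s_chk = (s_chk + a * m) % N
    if s != s_chk:
        raise ValueError("fault detected")
    return (s - a - a * m) % N

# Toy key, assumed for illustration only.
p, q, e_pub = 1009, 1013, 17
d = pow(e_pub, -1, (p - 1) * (q - 1))
result = crt_rsa_fa_spa(123456, 0xC0FFEE11, p, q, d % (p - 1), d % (q - 1))
print(result == pow(123456, d, p * q))
```

Because the random value $a$ stays inside both recombined values until the last step, the plain result $m^d \bmod N$ never exists in memory before the comparison has passed.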

III. MONTGOMERY MODULAR MULTIPLICATION AND EXPONENTIATION

The MMM algorithm [15] calculates the value $A = X \cdot Y \cdot R^{-1} \bmod N$, where $R$ is a constant, usually $R = 2^n$. The n-bit value $N$ has to be an integer satisfying the condition $\gcd(R, N) = 1$. In RSA, $N$ is odd, being the product of two primes, so this constraint always holds for $R = 2^n$. Walter in [16] proved that if the MMM algorithm is to be used for modular exponentiation, the final step of the algorithm (a subtraction operation) is not necessary as long as $N$ is odd, $Y < 2N$ and $2N < 2^{p-1}$, where $p$ here denotes the number of algorithmic rounds (not the RSA prime). So, to satisfy the above constraints it suffices that $p = n + 2$, since the input $Y$ will never be greater than $2N$ in a modular exponentiation process [16], [17]. Therefore the MMM algorithm has the form described below. We denote the i-th bit of a variable by $[i]$. Note that, to meet the $p = n + 2$ constraint, all n-bit numbers have to be extended to $n + 2$ bits by padding with zeros.

Montgomery Modular Multiplication (MMM) algorithm
Input: $X, Y, N$
Output: $A = X \cdot Y \cdot R^{-1} \bmod N$
Init: $A = 0$

1) For $k = 0$ to $n + 1$
   a) $q = (A[0] + X[k] \cdot Y[0]) \bmod 2$
   b) $A = (A + X[k] \cdot Y + q \cdot N)/2$
2) Return $A$
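The listing above translates almost line by line into a bit-serial software model. The sketch below is an illustration only: the function name `mmm`, the toy modulus and the operand values are assumptions. With $n + 2$ rounds the effective constant is $R = 2^{n+2}$, and the result is left unreduced (below $2N$) because the final subtraction is skipped.

```python
def mmm(X, Y, N, n):
    """Bit-serial Montgomery multiplication without the final subtraction.

    Runs n + 2 rounds, so the result is congruent to X * Y * 2^-(n+2) mod N.
    Expects N odd and X, Y < 2N.
    """
    A = 0
    for k in range(n + 2):
        xk = (X >> k) & 1
        qk = (A + xk * Y) & 1              # q = (A[0] + X[k] * Y[0]) mod 2
        A = (A + xk * Y + qk * N) >> 1     # the sum is even, so the shift is exact
    return A

# Toy values, assumed for illustration only.
N = 1009 * 1013
n = N.bit_length()
R = 1 << (n + 2)
X, Y = 777777, 1234567                     # both below 2N

A = mmm(X, Y, N, n)
print(A % N == X * Y * pow(R, -1, N) % N, A < 2 * N)

# Round trip through the Montgomery domain: x -> x*R -> x
x = 424242
x_mont = mmm(x, R * R % N, N, n)           # congruent to x * R mod N
x_back = mmm(x_mont, 1, N, n) % N
```

Multiplying by the precomputed constant $R^2 \bmod N$ moves a value into the Montgomery domain, and multiplying by 1 moves it back, which is why the initialization of Montgomery based exponentiation algorithms typically precomputes such constants.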

IV. PROPOSED FA-PA CRT RSA ALGORITHM

Our approach to an SCA resistant RSA module is based on the methodology described in [10] for Montgomery modular multiplication. The algorithm in [10] seems very promising in terms of security and hardware performance. It adapts the Montgomery multiplication algorithm to Giraud's modular exponentiation methodology, resulting in a CRT FA-SPA resistant RSA algorithm. However, the work of [10] only focuses on how FA-SPA modular exponentiation can be done and not on how a fully functional CRT based FA-SPA unit can be built. That work is, additionally, still vulnerable to the relative doubling power attack (rDA) and the Differential power attack (DPA). In this paper, we propose enhancements to the algorithm of [10] that achieve more intricate SCA resistance, including DPA and rDA protection. To achieve this, we introduce an additional random number $b$ into the algorithmic data flow that multiplicatively masks the message. This approach protects the cryptosystem from rDAs, since the connection between message $M$ and $M^2$ no longer exists, while the masking itself dissociates the processed data from the real message, thus providing DPA resistance (Coron's DPA countermeasures [18]). We propose using the random number $b$ in a similar fashion to the Fumaroli and Vigilant scheme [19].

We use the FSPAME algorithm of [6] and propose a hardware oriented, optimized algorithm for modular exponentiation that supports FA, SPA, rDA and DPA protection. This algorithm employs Montgomery modular multiplication as a structural element, chosen among other approaches due to its efficient realization in hardware. Following the above directive, we adopt the optimized version of the Montgomery modular multiplication algorithm (CSMMM) described in [20], [10], which employs carry-save logic (C-S logic) in all its inputs, outputs and intermediate values along with precomputation. The proposed FRDAME algorithm is described below.

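FA-PA-rDA Montgomery Modular Exponentiation (FRDAME) algorithm
Input: $m, a, b, b^{-1}, e = (1, e_{t-2}, ..., e_0), N$
Output: $(a + m^{e-1}) \bmod N, (a + m^e) \bmod N$
Initialization: $T = R^2 \bmod N$, $b_{R1} = b^2 \cdot R^2 \bmod N$, $b_{R2} = b \cdot R \bmod N$, $b_{R^{-1}} = b^{-1} \cdot R \bmod N$, $a_R = a \cdot R \bmod N$, where $R = 2^{n+2}$

1) $S_{-1} = b_{R1} \cdot m \cdot R^{-1} \bmod N$
2) $S_0 = b_{R2} \cdot m \cdot R^{-1} \bmod N$
3) $S_1 = S_{-1} \cdot S_{-1} \cdot R^{-1} \bmod N$
4) $S_2 = b_{R^{-1}}$
5) For $i = t-2$ to 1
   a) $S_2 = S_2^2 \cdot R^{-1} \bmod N$
   b) If $e_i = 1$ then
      i) $S_0 = S_0 \cdot S_1 \cdot R^{-1} \bmod N$
      ii) $S_1 = S_1^2 \cdot R^{-1} \bmod N$
   c) else
      i) $S_1 = S_0 \cdot S_1 \cdot R^{-1} \bmod N$
      ii) $S_0 = S_0^2 \cdot R^{-1} \bmod N$
6) $S_2 = S_2^2 \cdot R^{-1} \bmod N$
7) $S_1 = (S_0 \cdot S_1) \cdot S_2 \cdot R^{-1} \bmod N$
8) $S_0 = (S_0^2) \cdot S_2 \cdot R^{-1} \bmod N$
9) $S_1 = (S_1 \cdot 1 + a_R) R^{-1} \bmod N$
10) $S_0 = (S_0 \cdot 1 + a_R) R^{-1} \bmod N$
11) If (loop counter $i$ and exponent $e$ are not modified) then return $(S_0, S_1)$ else return error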

The above algorithm constitutes the core of the CRT RSA module and can be used to describe the CRT FA-rDA-PA RSA algorithm as follows:

FA-rDA-PA CRT RSA algorithm
Input: $m, a, b, p, q, d_p, d_q, i_q = q^{-1} \bmod p, N$
Output: $m^d \bmod N$

1) $(s_0^p, s_1^p) = FRDAME(m, a, b, d_p, p)$
2) $(s_0^q, s_1^q) = FRDAME(m, a, b, d_q, q)$
3) $S = s_0^q + q \cdot ((s_0^p - s_0^q) \cdot i_q \bmod p)$
4) $S' = s_1^q + q \cdot ((s_1^p - s_1^q) \cdot i_q \bmod p)$
5) $S = (m \cdot S + a) \bmod N$
6) $S' = (S' + a \cdot m) \bmod N$
7) If $(S = S')$ and $p$, $q$ not modified then return $(S - a - a \cdot m) \bmod N$ else return error

Note that in the above algorithm the Gauss CRT reconstruction equation (1) has been replaced with Garner's form $S = s_0^q + q \cdot ((s_0^p - s_0^q) \cdot i_q \bmod p)$, which is simpler. In the CRT RSA crypto process, two executions of the FRDAME algorithm are performed. The algorithm is run using the $d_p$ and $d_q$ exponents with the $p$ and $q$ moduli respectively. We can assume without loss of generality that $p$, $q$ are of similar bit length and therefore about half the bit length of $N$. Thus, using CRT, the tedious n-bit modular exponentiation is broken into two parallel n/2-bit modular exponentiations that give results faster. The price for this is a final CRT operation (steps 3 and 4) for both fault secure streams of data. Note that the use of the random number $a$ prohibits the discovery of the RSA decryption result before the fault attack check is performed (step 7).
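The role of the mask $b$ in the exponentiation can be sketched in software. The code below is not a transcription of the FRDAME listing: it works with plain modular arithmetic instead of the Montgomery domain, and the initial values ($b \cdot m$, $b \cdot m^2$ and $b^{-1}$) are an assumption chosen so that the mask cancels at the end, in the style of the Fumaroli and Vigilant ladder. The function name and toy values are hypothetical.

```python
def masked_ladder(m, e, N, a, b):
    """Returns ((a + m^(e-1)) mod N, (a + m^e) mod N).

    Every ladder value carries a power of the random mask b; the accumulator
    s2 tracks the inverse mask. Assumes e odd, e >= 3 and gcd(b, N) = 1.
    """
    bits = bin(e)[2:]
    assert e >= 3 and bits[-1] == "1"
    s0 = b * m % N                   # mask * m
    s1 = b * m * m % N               # mask * m^2
    s2 = pow(b, -1, N)               # inverse of the current mask
    for ch in bits[1:-1]:            # i = t-2 down to 1
        s2 = s2 * s2 % N             # the mask squares in every round
        if ch == "1":
            s0, s1 = s0 * s1 % N, s1 * s1 % N
        else:
            s1, s0 = s0 * s1 % N, s0 * s0 % N
    s2 = s2 * s2 % N                 # the final products square the mask once more
    out1 = (s0 * s1 % N) * s2 % N    # m^e, mask removed
    out0 = (s0 * s0 % N) * s2 % N    # m^(e-1), mask removed
    return (out0 + a) % N, (out1 + a) % N

# Toy values, assumed for illustration only.
p_mod, exp, msg, a = 1009, 593, 123456, 0xC0FFEE11
r1 = masked_ladder(msg, exp, p_mod, a, b=5)
r2 = masked_ladder(msg, exp, p_mod, a, b=777)
expected = ((a + pow(msg, exp - 1, p_mod)) % p_mod, (a + pow(msg, exp, p_mod)) % p_mod)
print(r1 == r2 == expected)
```

Two runs with different masks process different intermediate values yet return the same pair, which is the property that breaks the link between the inputs $m$ and $m^2$ exploited by doubling attacks.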

V. PROPOSED HARDWARE ARCHITECTURES

Designing an FA-SPA resistant RSA encryption/decryption unit poses several challenges, especially if the system at hand is resource constrained. Our primary goal is to minimize the chip covered area without compromising security. For this reason, we employ one FRDAME unit responsible for modular exponentiation using either $p$ or $q$. The FRDAME unit is structured around two Montgomery Modular Multipliers with Carry-Save logic input and output signals (CSMMU), as well as a Montgomery Modular squarer that is used for the squaring operations on the random number $b$. All the Montgomery units are connected through a series of input-output storage elements (registers) that are specially designed to feed those units with the appropriate input data. Each Montgomery multiplier is a 5 parallel (n/2 + 1)-bit and 2 serial (1-bit) input architecture that generates a Montgomery modular multiplication product in Carry-Save format every n/2 + 2 clock cycles. The functionality and full architecture of the CSMMU are described in [20], [10]. We try to maximize the parallel operations performed within the hardware structure. More specifically, parallelism can be applied to steps 5bi and 5bii, steps 5ci and 5cii, steps 7 and 8, as well as steps 9 and 10. The 3 Montgomery units (multipliers and squarer) are fully utilized during RSA computations. Multiplier 1 always performs Montgomery modular multiplication between $S_0$ and $S_1$, in parallel with multiplier 2, which performs Montgomery modular squaring of either $S_0$ or $S_1$. The Montgomery squarer also works in parallel with the other two units in order to modulo square the random number $b$ in each clock cycle.

The FRDAME unit needs to be used twice for one CRT RSA session since it implements steps 1 and 2 of the CRT RSA algorithm in a serial fashion. During the first use of the FRDAME unit, the outcome of $m^{d_p} \bmod p$ is calculated in carry-save format and stored in an intermediate value storage element (register file), while on the second use of the FRDAME unit the outcome $m^{d_q} \bmod q$ is calculated and stored in a similar way. Note that, following the FRDAME algorithm, there are two calculation streams for an exponentiation outcome, $S_0$ and $S_1$, and therefore each use of the FRDAME unit requires 4 storage registers (2 for $S_0$ and 2 for $S_1$). Thus, a total of 8 registers is needed. The stored values are selected through a multiplexer as inputs for the CRT transformation unit, which is responsible for implementing the operations of steps 3 and 4 of the CRT RSA algorithm. The CRT transformation unit is used twice to provide the outcome for both of those steps. In the first use of the CRT transformation unit, the inputs provided by the multiplexer are $s_0^q$ and $s_0^p$ in carry-save format, while on the second use the inputs are $s_1^q$ and $s_1^p$. The output of the CRT transformation unit is an n-bit value in carry-save format stored in two n-bit registers. Those values are also stored in the intermediate storage register file, replacing existing values (that are no longer needed). After the two uses of the CRT transformation unit, the 2 outcomes are passed to the Fault detection unit, which realizes the functionality of step 7 of the CRT RSA algorithm. This unit performs n-bit modular multiplication, which can be realized by an additional n-bit CSMMU, along with a C-S adder-subtractor unit and a comparator (comparing with the random number $a$). The outcome of this unit is either an error code, given as an all zero value, or the correct crypto-message of the RSA algorithm. The proposed CRT FRDAME RSA generic architecture is presented in Figure 1.

[Figure: block diagram. Recoverable block labels: Montgomery Multiplier 1, Montgomery Multiplier 2, Montgomery Squarer, Montgomery Multiplier 3, Storage Area Register File (S0p, S1p, S0q, S1q, each in Save and Carry form), random number b, random number aR, CS serial Multiplier, CS Full Adder, CS Adder-Subtractor Unit, n-bit Montgomery Multiplier, Selector, Comparator, CRT Reconstruction, Fault Detection Unit, Final Full Adder Structure, Final Result, External Environment.]

Fig. 1. The CRT FA-SPA resistant RSA architecture

All modern processing systems have certain restrictions on the information that can be processed within their data core. These limitations are imposed by their bus length, memory size and depth, etc. No modern processor can handle data with a bit length close to that of the RSA core (at least 1024 bits); modern systems work with at most 64-bit numbers. For this reason, we develop a mechanism for inserting data from the bus into the RSA architecture, so that the RSA algorithm can fully work with arbitrary inputs whose bit length is larger than that provided by the data bus. The data bus inputs are multiplexed using the input reconstruction unit. There are 3 digit-input, parallel-output shift registers (for the message $m$, the random number $b$ and the $p$ or $q$ modulus) and one digit-input, serial-output shift register (for the $d_p$ or $d_q$ exponent). As the data bus is accessed in digits matching the word length of a connected control processor ($w$ bits), information is stored in the appropriate register through a multiplexer structure.

Initially, the modulus register is filled, which takes n/(2w) clock cycles. Then the message register is filled over the next n/(2w) consecutive clock cycles. In each clock cycle a data bus word is inserted into the register and the register contents are right shifted by w bits, until the whole register length is full. The register's value is output in parallel using a parallel out control signal. Once the message and modulus registers hold the message and p values, the random number b is inserted in w-bit digits, and when this operation is concluded the storage of the exponent may begin. Input is provided to the exponent register in w-bit words and output is performed serially (the two least significant bits of the stored value are output). Note that the exponent register has two modes of operation: data insertion and data output processing. In the first mode the register data are right shifted by w bits to make space for the next data bus word, while in the second mode the register data are right shifted by 1 bit. Data insertion completes in n/(2w) clock cycles; however, RSA operations can begin after the first clock cycle of that period, since only the 2 least significant bits of the exponent are needed in every modular exponentiation round.
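The register loading scheme described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the authors' VHDL: the class name `DigitInputRegister`, the helper `load`, and the convention that the bus delivers the least significant word first are all hypothetical choices made for illustration. The overlap between exponent loading and the start of the RSA computation is not modelled.

```python
class DigitInputRegister:
    """Behavioural model (hypothetical) of a digit-input shift register.

    `length` is the register width in bits and `w` is the data bus word size.
    """

    def __init__(self, length: int, w: int):
        if length % w != 0:
            raise ValueError("register length must be a multiple of the word size")
        self.length = length
        self.w = w
        self.value = 0
        self.cycles = 0  # clock cycles spent in data insertion mode

    def insert_word(self, word: int) -> None:
        """Data insertion mode: shift right by w bits, place the new word at the top."""
        if not 0 <= word < (1 << self.w):
            raise ValueError("word does not fit in w bits")
        self.value = (self.value >> self.w) | (word << (self.length - self.w))
        self.cycles += 1

    def parallel_out(self) -> int:
        """Parallel output, as used by the message, modulus and random number registers."""
        return self.value

    def serial_out(self) -> int:
        """Output processing mode of the exponent register:
        emit the two least significant bits, then shift right by 1 bit."""
        two_lsbs = self.value & 0b11
        self.value >>= 1
        return two_lsbs


def load(register: DigitInputRegister, value: int) -> None:
    """Feed `value` into the register one w-bit word per clock cycle.

    Assumption: the least significant word is sent first, so that after
    length/w right shifts every word sits at its correct position.
    """
    mask = (1 << register.w) - 1
    for i in range(register.length // register.w):
        register.insert_word((value >> (i * register.w)) & mask)


if __name__ == "__main__":
    n, w = 1024, 32                       # RSA bit length and bus word size (example values)
    modulus_reg = DigitInputRegister(n // 2, w)
    load(modulus_reg, (1 << 511) | 0x1234567)
    print(modulus_reg.cycles)             # number of insertion cycles, n/(2w)
```

With n = 1024 and w = 32 the model needs 1024/(2·32) = 16 insertion cycles per n/2-bit register, matching the n/(2w) figure in the text.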

VI. PERFORMANCE

Assume that n-bit RSA encryption-decryption needs to be performed and that the data bus clock speed is considerably higher than the clock speed of the proposed CRT RSA unit. Then data insertion into the RSA unit's input registers (modulus, random number, exponent and message registers) adds only a small speed overhead to the whole system. Thus, ignoring the cost of data insertion, which depends on the speed of the external input device (memory write speed, bus read speed), the proposed CRT RSA architecture requires $C_{CRTRSA} = 2 \cdot (C_{FRDAME} + C_{CRT}) + C_{FA}$ clock cycles for one encryption-decryption operation, where $C_{FRDAME}$ is the number of clock cycles for modular exponentiation using the FRDAME algorithm, $C_{CRT}$ is the number of clock cycles for the CRT transformation and $C_{FA}$ is the number of clock cycles for Fault detection. $C_{CRTRSA}$ is dominated by the FRDAME unit, since $C_{FRDAME}$ has $O(nt)$ complexity, where $t$ is the Hamming weight of the FSME exponent. In the worst-case scenario $t = n/2$ (the usual case for CRT RSA) and $C_{FRDAME}$ has $O(n^2/2)$ time complexity. $C_{CRT}$ and $C_{FA}$ are of $O(n)$ time complexity. The Montgomery multiplication unit needs $n/2 + 2$ clock cycles to produce a result. The FRDAME algorithm concludes after $t - 2$ algorithmic rounds, where the CSMMM algorithm is executed three times in parallel in each round. In addition, FRDAME needs 3 executions of Montgomery multiplication for initialization (FRDAME steps 1, 2 and 3) and 3 parallel Montgomery executions for post-processing calculations (FRDAME parallel steps 6, 7 - 8 and 9 - 10). As a result, the total number of clock cycles for one modular exponentiation using FRDAME is $C_{FRDAME} = (t + 4) \cdot (n/2 + 2)$. Since the CRT RSA algorithm (steps 1 and 2) contains two FRDAME executions, all modular exponentiations are concluded after $2 \cdot C_{FRDAME} = (t + 4) \cdot (n + 4)$ clock cycles.
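The cycle counts above can be checked with a few lines of arithmetic. The sketch below only evaluates the closed-form expressions given in the text; the function names are hypothetical, and the $C_{CRT}$ and $C_{FA}$ terms are left out because the text gives only their $O(n)$ order, not exact values.

```python
def montgomery_cycles(n: int) -> int:
    """Clock cycles of one Montgomery multiplication on n/2-bit operands: n/2 + 2."""
    return n // 2 + 2


def frdame_cycles(n: int, t: int) -> int:
    """Clock cycles of one FRDAME modular exponentiation.

    (t - 2) algorithmic rounds, plus 3 initialization multiplications,
    plus 3 parallel post-processing steps, each costing one multiplication time.
    """
    multiplication_slots = (t - 2) + 3 + 3   # equals t + 4
    return multiplication_slots * montgomery_cycles(n)


def crt_exponentiation_cycles(n: int, t: int) -> int:
    """Both FRDAME executions of CRT RSA (steps 1 and 2): 2 * C_FRDAME."""
    return 2 * frdame_cycles(n, t)


if __name__ == "__main__":
    n = 1024
    t = n // 2                                # worst case stated in the text
    print(frdame_cycles(n, t))                # (t + 4) * (n/2 + 2)
    print(crt_exponentiation_cycles(n, t))    # (t + 4) * (n + 4)
```

For n = 1024 and the worst case t = 512, this gives 516 · 514 = 265224 cycles per FRDAME execution and 516 · 1028 = 530448 cycles for both exponentiations, which confirms that the two closed forms agree.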

To further evaluate the proposed system, we realized the FRDAME architecture in FPGA technology (Xilinx Virtex 5) using VHDL for n = 1024 RSA operations. Assuming that p and q are approximately n/2 bits long, a 512-bit FRDAME architecture was implemented, and measurements of chip covered area (FPGA slices) and maximum clock frequency (MHz) were taken. We also designed the CRT reconstruction unit, as described in Figure 1, and added it to the whole architecture in order to evaluate its contribution to the RSA complexity. Our design also includes the full functionality of the control unit, which handles both RSA control and input data reconstruction, as well as the chip covered area cost of the additional 1024-bit CSMMU module needed for Fault Detection. The synthesis results of these designs are provided in Table I.

TABLE I. RSA CRT IMPLEMENTATION RESULTS (n = 1024)

| Arch. | Technology | Area (slices) | Freq. (MHz) | FA-SPA |
|---|---|---|---|---|
| prop. FRDAME | 5vsx240tff1738 | 6074 | 235 | Yes |
| prop. CRT Reconstruction | 5vsx240tff1738 | 2691 | 227.3 | Yes |
| [21] | XC2V3000 | 12537 | 152.5 | No |
| [22] | XC2V6000 | 23208 | 96 | No |
| [20] | XC2V3000 | 7873 | 129 | No |

As Table I shows, the proposed FRDAME architecture reaches a higher clock frequency and occupies fewer slices than similar hardware architectures that do not bear the extra cost of SCA resistance. The proposed work remains fast even when the extra hardware cost of the CRT reconstruction unit is taken into account. In that case, the space complexity is similar to the results of [20]; however, that work does not use CRT, so it has a higher overall computational time delay than our work, which adopts CRT.
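The comparison can be made concrete by computing area and frequency ratios from the Table I figures. This is an illustrative sketch only: the values are copied from the table, the names `PROPOSED`, `REFERENCES` and `compare` are hypothetical, and the ratios should be read with care because the proposed design targets a Virtex 5 device while [20], [21] and [22] target Virtex-II devices, so slice counts are not directly equivalent.

```python
# (architecture, area in slices, maximum frequency in MHz), copied from Table I
PROPOSED = ("prop. FRDAME", 6074, 235.0)
REFERENCES = [
    ("[21]", 12537, 152.5),
    ("[22]", 23208, 96.0),
    ("[20]", 7873, 129.0),
]


def compare(proposed, other):
    """Return (name, area ratio, frequency ratio) of the proposed design relative to `other`.

    An area ratio below 1 means fewer slices; a frequency ratio above 1 means a faster clock.
    """
    _, p_area, p_freq = proposed
    name, o_area, o_freq = other
    return name, p_area / o_area, p_freq / o_freq


if __name__ == "__main__":
    for name, area_ratio, freq_ratio in (compare(PROPOSED, r) for r in REFERENCES):
        print(f"{name}: area x{area_ratio:.2f}, clock x{freq_ratio:.2f}")
```

Against every reference row the area ratio is below 1 and the frequency ratio is above 1, which is the sense in which the text calls the proposed FRDAME unit better.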

VII. CONCLUSIONS

In this paper we introduced a CRT RSA modification that is resistant against Fault attacks, Differential and Simple power attacks, as well as doubling and relative doubling attacks. We designed a hardware architecture based on the proposed algorithms and came up with a resource constrained plan in order to offer high efficiency in space and time complexity. To further prove the validity of our case, we implemented this architecture in modern FPGA technology and obtained very interesting results. The proposed approach to an SCA resistant RSA cryptosystem is a step toward the realization of a hardware structure that is fully protected against SCAs. Note that this is an ongoing process, since there is still a wide range of SCAs that our approach cannot yet counter. Our future plans also include optimizing the storage requirements of our design in order to further reduce space complexity, as well as minimizing the possibility of more intricate Fault injection attacks.

REFERENCES

[1] N. Sklavos, “On the hardware implementation cost of crypto-processors architectures,” Information Security Journal: A Global Perspective, vol. 19, no. 2, pp. 53–60, 2010.

[2] P. Kocher, J. Jaffe, and B. Jun, “Differential power analysis,” in Advances in Cryptology Proceedings of Crypto 99. Springer-Verlag, 1999, pp. 388–397.

[3] C. Giraud, “An RSA implementation resistant to fault attacks and to simple power analysis,” IEEE Transactions on Computers, vol. 55, no. 9, pp. 1116–1120, 2006.

[4] D. Vigilant, “RSA with CRT: A new cost-effective solution to thwart fault attacks,” in CHES, ser. Lecture Notes in Computer Science, E. Oswald and P. Rohatgi, Eds., vol. 5154. Springer, 2008, pp. 130–145.

[5] M. Joye and S.-M. Yen, “The Montgomery powering ladder,” in CHES ’02: Revised Papers from the 4th International Workshop on Cryptographic Hardware and Embedded Systems. London, UK: Springer-Verlag, 2003, pp. 291–302.

[6] C. H. Kim and J.-J. Quisquater, “Fault attacks for CRT based RSA: New attacks, new results, and new countermeasures,” in WISTP, ser. Lecture Notes in Computer Science, D. Sauveron, C. Markantonakis, A. Bilas, and J.-J. Quisquater, Eds., vol. 4462. Springer, 2007, pp. 215–228.

[7] K. Bhattacharya and N. Ranganathan, “A linear programming formulation for security-aware gate sizing,” in GLSVLSI ’08: Proceedings of the 18th ACM Great Lakes symposium on VLSI. New York, NY, USA: ACM, 2008, pp. 273–278.

[8] K. Tiri and I. Verbauwhede, “A digital design flow for secure integrated circuits,” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 25, no. 7, pp. 1197–1208, 2006.

[9] N. Ebeid and R. Lambert, “A new CRT-RSA algorithm resistant to powerful fault attacks,” in Proceedings of the 5th Workshop on Embedded Systems Security, ser. WESS ’10. New York, NY, USA: ACM, 2010, pp. 8:1–8:8. [Online]. Available: http://doi.acm.org/10.1145/1873548.1873556

[10] A. P. Fournaris, “Fault and simple power attack resistant RSA using Montgomery modular multiplication,” in Proc. of the IEEE International Symposium on Circuits and Systems (ISCAS 2010). IEEE, 30 May-02 June 2010.

[11] D. Boneh, R. A. DeMillo, and R. J. Lipton, “On the importance of checking cryptographic protocols for faults (extended abstract),” in EUROCRYPT’97, 1997, pp. 37–51.

[12] A. K. Lenstra, “Memo on RSA signature generation in the presence of faults,” September 1996.

[13] S.-M. Yen, W.-C. Lien, S.-J. Moon, and J. Ha, “Power analysis by exploiting chosen message and internal collisions - vulnerability of checking mechanism for RSA-decryption,” in Mycrypt, ser. Lecture Notes in Computer Science, E. Dawson and S. Vaudenay, Eds., vol. 3715. Springer, 2005, pp. 183–195.

[14] P.-A. Fouque and F. Valette, “The doubling attack why upwards is better than downwards,” in Cryptographic Hardware and Embedded Systems - CHES 2003, ser. Lecture Notes in Computer Science, C. Walter, C. Koc, and C. Paar, Eds. Springer Berlin / Heidelberg, vol. 2779, pp. 269–280.

[15] P. L. Montgomery, “Modular multiplication without trial division,” Mathematics of Computation, vol. 44, no. 170, pp. 519–521.

[16] C. Walter, “Montgomery exponentiation needs no final subtractions,” Electronics Letters, vol. 35, no. 21, pp. 1831–1832, 1999.

[17] G. Hachez and J.-J. Quisquater, “Montgomery exponentiation with no final subtractions: Improved results,” in CHES ’00: Proceedings of the Second International Workshop on Cryptographic Hardware and Embedded Systems. London, UK: Springer-Verlag, 2000, pp. 293–301.

[18] J.-S. Coron, “Resistance against differential power analysis for elliptic curve cryptosystems,” in Proceedings of the First International Workshop on Cryptographic Hardware and Embedded Systems, ser. CHES ’99. London, UK: Springer-Verlag, 1999, pp. 292–302. [Online]. Available: http://portal.acm.org/citation.cfm?id=648252.752381

[19] G. Fumaroli and D. Vigilant, “Blinded fault resistant exponentiation,” in FDTC, ser. Lecture Notes in Computer Science, L. Breveglieri, I. Koren, D. Naccache, and J.-P. Seifert, Eds., vol. 4236. Springer, 2006, pp. 62–70.

[20] A. P. Fournaris and O. G. Koufopavlou, “A new RSA encryption architecture and hardware implementation based on optimized Montgomery multiplication,” in ISCAS (5). IEEE, 2005, pp. 4645–4648.

[21] M.-D. Shieh, J.-H. Chen, H.-H. Wu, and W.-C. Lin, “A new modular exponentiation architecture for efficient design of RSA cryptosystem,” IEEE Trans. Very Large Scale Integr. Syst., vol. 16, no. 9, pp. 1151–1161, 2008.

[22] C. McIvor, M. McLoone, and J. McCanny, “Modified Montgomery modular multiplication and RSA exponentiation techniques,” IEE Proceedings - Computers and Digital Techniques, vol. 151, no. 6, pp. 402–408, 2004.
