Montgomery Algorithm for Modular Multiplication with ...math-sa-sara0050/space16/... · A systolic...


MRABET Amine

Montgomery Algorithm for Modular Multiplication with Systolic Architecture

LIASD Paris 8

ENIT-TUNIS EL MANAR University

SAS - CMP - Gardanne

SPACE 2016

1. Introduction for pairing

2. Montgomery Multiplication (CIOS)

3. Architecture

4. Results

5. Conclusion and Perspectives


General Context

This work is part of the hardware implementation of asymmetric cryptography primitives, such as the Optimal-Ate pairing on elliptic curves, elliptic-curve cryptosystems and RSA, which are the best-known methods in asymmetric encryption.

Definition

Let G1 and G2 be two additive groups and let G3 be a multiplicative group.

A pairing is a map e : G1 × G2 → G3 with the following properties:

Non-degeneracy:

for every P ∈ G1, P ≠ 0, there exists Q ∈ G2 such that e(P, Q) ≠ 1;

for every Q ∈ G2, Q ≠ 0, there exists P ∈ G1 such that e(P, Q) ≠ 1.

Bilinearity:

e(xP, yQ) = e(P, Q)^(xy),

e(xP, yQ)^z = e(yP, zQ)^x = e(zP, xQ)^y = e(P, Q)^(xyz).
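The two properties can be checked numerically on a toy map. The sketch below is only an illustration and is not taken from the talk: it uses G1 = G2 = (Z_r, +) and e(P, Q) = g^(P·Q) mod q, which is bilinear but cryptographically useless. The parameters q = 23, r = 11, g = 2 and the name `toy_pairing` are assumptions chosen for readability; a real pairing is defined on elliptic-curve groups.

```python
# Toy bilinear map (NOT secure): G1 = G2 = (Z_r, +), G3 = <g> inside Z_q^*.
Q_MOD = 23   # small prime modulus (illustrative assumption)
R_ORD = 11   # prime group order; 11 divides 23 - 1
G = 2        # element of order 11 modulo 23, since 2^11 = 2048 = 89*23 + 1


def toy_pairing(P, Q):
    """e(P, Q) = g^(P*Q) mod q, with P and Q taken modulo r."""
    return pow(G, (P * Q) % R_ORD, Q_MOD)


P, Q = 3, 7
x, y, z = 4, 5, 6

# Bilinearity: e(xP, yQ) = e(P, Q)^(xy)
bilinear = toy_pairing(x * P, y * Q) == pow(toy_pairing(P, Q), x * y, Q_MOD)

# e(xP, yQ)^z = e(P, Q)^(xyz)
triple = (pow(toy_pairing(x * P, y * Q), z, Q_MOD)
          == pow(toy_pairing(P, Q), x * y * z, Q_MOD))

# Non-degeneracy: every nonzero P pairs non-trivially with at least one Q.
non_degenerate = all(
    any(toy_pairing(p, q) != 1 for q in range(1, R_ORD))
    for p in range(1, R_ORD)
)

print(bilinear, triple, non_degenerate)
```

The exponent arithmetic works because g has order r, so exponents can be reduced modulo r without changing the result.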

Pairing protocols

The bilinearity of pairings has allowed the construction of new protocols:

Tripartite Diffie–Hellman key exchange (Joux 2001)

Identity-based cryptography (Boneh and Franklin)

Short signature schemes (Boneh, Lynn, Shacham)

Pairing protocols: Example of Cryptography Based on Identity

Trusted authority

s: the secret of the trusted authority.

The public keys are the identities of the users.

The private keys are constructed by the trusted authority and transmitted to the users.

Bob: P_B = s·I_B

Alice: P_A = s·I_A

e(P_A, I_B) = e(I_A, I_B)^s

e(P_B, I_A) = e(I_A, I_B)^s

Pairing protocols: Example of Cryptography Based on Identity

Encryption step of the clear message M

Alice wants to send a message to Bob:

She chooses a random integer a,

She retrieves Bob's public key I_B,

She computes the pairing e(I_B, Q0)^a,

She sends to Bob: [aP, M ⊕ H2(e(I_B, Q0)^a)] = [U, V].

Decryption step of the encrypted message

Bob follows these steps:

He contacts the trusted authority to retrieve his private key P_B = s·I_B,

He recovers the message M by computing V ⊕ H2(e(P_B, U)).

This works by the bilinearity of the pairing:

e(P_B, U) = e(s·I_B, aP) = e(I_B, P)^(as) = e(I_B, sP)^a

which is the value e(I_B, Q0)^a used by Alice when Q0 = sP.
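The encryption and decryption steps can be traced end to end with a toy bilinear map. This is a minimal sketch under loud assumptions, not the scheme as implemented by the author: the pairing is the insecure map e(P, Q) = g^(P·Q) mod q on (Z_r, +), Q0 is taken to be sP, and the helpers `h1`, `h2` and `xor` as well as the identity string are hypothetical names introduced for illustration.

```python
import hashlib
import secrets

Q_MOD, R_ORD, G = 23, 11, 2   # toy parameters (assumption); g has order 11 mod 23


def toy_pairing(P, Q):
    """Toy symmetric bilinear map e(P, Q) = g^(P*Q) mod q on (Z_r, +)."""
    return pow(G, (P * Q) % R_ORD, Q_MOD)


def h1(identity):
    """Hypothetical hash of an identity string to a nonzero element of Z_r."""
    digest = hashlib.sha256(identity.encode()).digest()
    return int.from_bytes(digest, "big") % (R_ORD - 1) + 1


def h2(pairing_value, length):
    """Hypothetical H2: derive a mask of `length` bytes from a pairing value."""
    return hashlib.shake_256(str(pairing_value).encode()).digest(length)


def xor(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


# Trusted authority: secret s, public generator P and public point Q0 = sP.
s = secrets.randbelow(R_ORD - 1) + 1
P = 1
Q0 = (s * P) % R_ORD

# Bob: public key I_B (his identity), private key P_B = s * I_B.
I_B = h1("bob@example.com")
P_B = (s * I_B) % R_ORD

# Alice encrypts M as [U, V] = [aP, M xor H2(e(I_B, Q0)^a)].
M = b"hello Bob"
a = secrets.randbelow(R_ORD - 1) + 1
U = (a * P) % R_ORD
V = xor(M, h2(pow(toy_pairing(I_B, Q0), a, Q_MOD), len(M)))

# Bob decrypts: M = V xor H2(e(P_B, U)).
recovered = xor(V, h2(toy_pairing(P_B, U), len(M)))
print(recovered)
```

Both sides compute g^(s·a·I_B), which is why the two masks coincide whatever random values s and a take.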

Different pairings

Weil pairing

e : E(Fp)[r] × E(Fp^k)/rE(Fp^k) → F*_(p^k)

(P, Q) → (-1)^r f_{r,P}(Q) / f_{r,Q}(P)

Its computation requires one Miller Lite (f_{r,P}), one Miller Full (f_{r,Q}), one inversion and one multiplication.

Tate pairing

e_T : E(Fp)[r] × E(Fp^k)/rE(Fp^k) → F*_(p^k)

(P, Q) → f_{r,P}(Q)^((p^k - 1)/r)

The Tate pairing is defined with the same parameters E, Fp, r, k as the Weil pairing.

Computing the Tate pairing takes log2(r) iterations of the Miller algorithm, where r is the order of the subgroups used.

Ate pairing

G1 = E[r] ∩ Ker(π_p - [1]) = E(Fp)[r], G2 = E[r] ∩ Ker(π_p - [p]), where π_p is the Frobenius endomorphism

e : G1 × G2 → F*_(p^k)

(P, Q) → f_{T,Q}(P)^((p^k - 1)/r)

The main advantage compared to the Tate pairing is the reduced number of iterations of the Miller algorithm: log2(T), where T = t - 1 and t is the trace of the Frobenius on E(Fp).

The disadvantage of the Ate pairing is that it requires a Miller Full computation.

Twisted Ate pairing

G1 = E[r] ∩ Ker(π_p - [1]) = E(Fp)[r], G2 = E[r] ∩ Ker(π_p - [p])

e : G1 × G2 → F*_(p^k)

(P, Q) → f_{T,P}(Q)^((p^k - 1)/r)

The calculation is made by an execution of Miller Lite, which alleviates the complexity of the calculations.

Optimal-Ate (OATE) pairing

The Optimal-Ate pairing improves the Ate pairing by further reducing the number of iterations of the Miller algorithm.

In the case of BN curves, the OATE pairing is built on f_{6t+2,Q}(P): the Miller loop runs over 6t+2, where t here denotes the parameter of the BN curve (not the trace of the Frobenius used above).

[The defining formula shown on the slide is not recoverable from this transcript.]
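To get a feel for the iteration counts quoted above, the bit lengths of r, T = t - 1 and 6u + 2 can be compared for a BN curve. The sketch assumes the standard Barreto-Naehrig parametrisation p(u), r(u), t(u) given in the comments; the curve parameter is called `u` here to avoid the clash with the trace t, and the sample value of `u` is an illustrative assumption (the code does not check that p and r are prime).

```python
def bn_parameters(u):
    """Standard BN polynomials (assumed here, not stated in the talk)."""
    p = 36 * u**4 + 36 * u**3 + 24 * u**2 + 6 * u + 1   # field characteristic
    r = 36 * u**4 + 36 * u**3 + 18 * u**2 + 6 * u + 1   # group order
    trace = 6 * u**2 + 1                                # trace of the Frobenius
    return p, r, trace


u = 4965661367192848881          # sample 63-bit parameter (illustrative)
p, r, trace = bn_parameters(u)

# Approximate Miller loop lengths, in bits, for each pairing.
loop_bits = {
    "Tate (r)": r.bit_length(),
    "Ate (T = t - 1)": (trace - 1).bit_length(),
    "Optimal-Ate (6u + 2)": (6 * u + 2).bit_length(),
}

for name, bits in loop_bits.items():
    print(f"{name:22s} about {bits} iterations")
```

With these polynomials the curve order p + 1 - t equals r, and the loop length drops from roughly the size of r to about half of it for Ate and about a quarter for Optimal-Ate.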

Basic operations

The basic operations in the finite field:

Addition

Subtraction

Multiplication

Inversion

account for most of the computation time of a pairing.

That is why optimizing these operations is the most important task.


Reminder: Montgomery algorithm

Conversion between the ordinary domain and the Montgomery domain:

Ordinary domain    Montgomery domain
a                  M(a) = a·R mod p
b                  M(b) = b·R mod p
a·b                M(a·b) = a·b·R mod p
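The correspondence between the two domains can be verified with plain integer arithmetic. This is a reference sketch only: it assumes R = 2^K with K = 256 and uses p = 2^255 - 19 as a sample odd modulus; the function names are hypothetical, and the modular inverse via `pow(R, -1, p)` needs Python 3.8 or later.

```python
K = 256
p = 2**255 - 19            # sample odd prime modulus (assumption)
R = 1 << K                 # R = 2^K, coprime to the odd modulus p
R_inv = pow(R, -1, p)      # R^-1 mod p


def to_mont(a):
    """Ordinary domain -> Montgomery domain: M(a) = a*R mod p."""
    return (a * R) % p


def from_mont(x):
    """Montgomery domain -> ordinary domain: x*R^-1 mod p."""
    return (x * R_inv) % p


def mont_mul(x, y):
    """Reference Montgomery product: x*y*R^-1 mod p."""
    return (x * y * R_inv) % p


a, b = 123456789, 987654321

# The Montgomery product of M(a) and M(b) is M(a*b).
same_domain = mont_mul(to_mont(a), to_mont(b)) == to_mont((a * b) % p)

# Converting back gives the ordinary product a*b mod p.
product = from_mont(mont_mul(to_mont(a), to_mont(b)))
print(same_domain, product == (a * b) % p)
```

The conversions cost one multiplication each, so the Montgomery form pays off when many multiplications are chained, as in a pairing or a scalar multiplication.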

The Coarsely Integrated Operand Scanning (CIOS) method [1]

The CIOS method improves the Montgomery algorithm by integrating multiplication and reduction.

Instead of computing the full product a×b and then performing the reduction, it alternates between iterations of multiplication and iterations of reduction.

[1] Çetin Kaya Koç, Tolga Acar and Burton S. Kaliski Jr., "Analyzing and Comparing Montgomery Multiplication Algorithms", IEEE Micro, June 1996.
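The alternation between multiplication and reduction can be seen in a word-level software model. The following is a sketch of the CIOS loop structure as described in [1], written sequentially in Python; it is not the systolic hardware design of this talk. The names `cios_montmul`, `to_words`, `from_words` and the sample modulus are assumptions for illustration, with s words of w bits and R = 2^(s·w).

```python
def to_words(x, s, w):
    """Split x into s words of w bits, least significant word first."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(s)]


def from_words(words, w):
    """Recombine little-endian w-bit words into an integer."""
    return sum(word << (w * i) for i, word in enumerate(words))


def cios_montmul(a, b, p, s, w):
    """Return a*b*R^-1 mod p with R = 2^(s*w), for odd p and 0 <= a, b < p."""
    mask = (1 << w) - 1
    A, B, P = to_words(a, s, w), to_words(b, s, w), to_words(p, s, w)
    p_prime = (-pow(p, -1, 1 << w)) & mask      # p' = -p^-1 mod 2^w
    t = [0] * (s + 2)
    for i in range(s):
        # Multiplication step: t = t + A * B[i]
        carry = 0
        for j in range(s):
            acc = t[j] + A[j] * B[i] + carry
            t[j], carry = acc & mask, acc >> w
        acc = t[s] + carry
        t[s], t[s + 1] = acc & mask, acc >> w
        # Reduction step: add m*P so the low word vanishes, then shift one word
        m = (t[0] * p_prime) & mask
        carry = (t[0] + m * P[0]) >> w
        for j in range(1, s):
            acc = t[j] + m * P[j] + carry
            t[j - 1], carry = acc & mask, acc >> w
        acc = t[s] + carry
        t[s - 1] = acc & mask
        t[s] = t[s + 1] + (acc >> w)
    result = from_words(t[:s + 1], w)
    return result - p if result >= p else result   # final conditional subtraction


p = 2**255 - 19                    # sample odd modulus (assumption)
a, b = pow(3, 200, p), pow(5, 300, p)
expected = (a * b * pow(1 << 256, -1, p)) % p
print(cios_montmul(a, b, p, 8, 32) == expected)
```

The same function covers both configurations discussed later for K = 256, namely s = 8 with w = 32 and s = 16 with w = 16, since only the word split changes.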

What is a systolic architecture?

It is a network composed of a large number of cells. Each cell receives data from its neighboring cells, performs a simple calculation, and then transmits the results, again to neighboring cells.

A systolic architecture is built from very simple elementary cells. Therefore, this architecture reduces resource requirements in hardware implementations.

Our contribution in this work is to combine a systolic architecture, which is regarded as the best solution for FPGA implementations, with the CIOS method of Montgomery modular multiplication.

Coarsely Integrated Operand Scanning

Partitioning the CIOS algorithm into cells (line numbers refer to the algorithm listing shown on the slide):

alpha: lines 5 and 6

alpha_2: lines 7, 8 and 9

beta: lines 11 and 12

gamma: lines 14 and 15

gamma_2: lines 16, 17 and 18


CIOS in Systolic for s=8

[Figure: systolic schedule for s = 8. Each iteration i consists of a multiplication step (a0·b0, a0·b1, ..., a0·b7 for j = 0..7) followed by a reduction step; the array produces a × b × R^-1 mod p.]

In this architecture, the successive iterations of the loop on i are also overlapped.

In our case, 3 iterations of i can be executed at the same time.

Data Flow

[Figure: animated data flow for s = 8. The words b0..b7 of B and p0..p7 of P are fed to the cells in blocks (B1, B2, B3), and the words of each block are rotated from one cycle to the next.]

During execution of this algorithm, three iterations of the loop on i are always executed at the same time, which gives a maximum of three alpha cells and three gamma cells running in parallel.

Following the blocks that repeat, we modeled our FSM with 3 states (S0, S1, S2), which allows us to perform the whole multiplication in just 33 cycles:

(8+3)*3 = 33

[Figure: schedule of the multiplication and reduction steps, with the FSM cycling through S0, S1, S2, S0, ... until a × b × R^-1 mod p is produced.]

[Figure: the same schedule and data flow for s = 16 (a0·b0, ..., a0·b15 for j = 0..15), producing a × b × R^-1 mod p. The words of B are grouped as (b0 b1 b2), (b3 b4 b5), (b6 b7 b8), (b9 b10 b11), (b12 b13 b14), and the words of P as (p2 p3 p4), (p5 p6 p7), (p8 p9 p10), (p11 p12 p13), (p14 p15).]

s = 8 (6 + 3 cells): 33 clock cycles

K = 256, w = 32

K = 512, w = 64

s = 16: 66 clock cycles

K = 256, w = 16

K = 512, w = 32

(Cell labels visible in the figures: alpha_2, alpha_f, gamma_f.)

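Comparison

Word size w (in bits) for each operand size K and number of words s, with the resulting number of cycles:

                    s=8    s=16   s=32
K=256               32     16     8
K=512               64     32     16
K=1024              128    64     32
Number of cycles    33     66     132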
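The word sizes in this comparison follow from w = K / s. A small sketch that regenerates the table is given below; the cycle counts are copied from the slide rather than derived, and the names `word_size` and `CYCLES` are illustrative.

```python
def word_size(K, s):
    """Word width w in bits when a K-bit operand is split into s words."""
    if K % s != 0:
        raise ValueError("K must be a multiple of s")
    return K // s


# Cycle counts as reported on the slide (not derived here).
CYCLES = {8: 33, 16: 66, 32: 132}

table = {K: [word_size(K, s) for s in CYCLES] for K in (256, 512, 1024)}

print("K      " + "".join(f"s={s:<5}" for s in CYCLES))
for K, widths in table.items():
    print(f"{K:<7}" + "".join(f"{w:<7}" for w in widths))
print("cycles " + "".join(f"{c:<7}" for c in CYCLES.values()))
```

Doubling s halves the word width handled by each cell, at the price of twice as many cycles according to the reported figures.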

The interest of each architecture

The interest of each architecture depends on our needs:

Security level

Resources

The method used

Architectures: Digital signal processing (DSP)

Modern FPGAs are equipped with hardware extensions for arithmetic calculation (DSP blocks).

They perform basic arithmetic calculations: multiplication, addition and subtraction of unsigned integers.

The arithmetic operations of each cell are designed to make maximum use of the DSP blocks.

Internal architectures - cells

[Figure: internal structure of the cells, including alpha_2 and gamma_2. Each cell stores the w least significant bits (REG LSB) and the w most significant bits (REG MSB) of its result. Signals visible in the figure: C_In, S_In, C_Out, S_Out, S1_2_In, S2_2_In, S1_2_Out, S2_2_Out and P[0].]

Internal architectures - Rotation

[Figure: rotation registers feeding the cells: A (K bits), B (blocks of 3w bits and one block of 2w bits) and P (blocks of 3w bits).]

Architectures

[Figure, built up over several slides: interconnection of the cells A - alpha1, B - alpha2, C - alpha3, D - gamma1, E - gamma2, F - gamma3, G - alpha_2, H - gamma_2 and I - beta. The cells are chained through their carry and sum ports (C_i_In, C_i_Out, S_i_In, S_i_Out), with multiplexers driven by sig_state. The signals p', P[0] and m appear around the beta cell.]


Results

Nexys 4            DSP   Frequency (MHz)   Cycles
MMM (s=8/K=256)    31    105.275           33
Alpha              4     291.023           1
Gamma              4     291.023           1
Beta               4     388.350           1
Alpha_2            1     459.918           1
Gamma_2            2     442.811           1

Nexys 4       DSP   LUTs   Reg    Occupied   Frequency (MHz)   Cycles
s=8/K=256     31    809    870    352        105.275           33
s=16/K=256    33    846    1123   402        145.892           66
s=8/K=512     87    2650   1614   878        64.825            33
s=16/K=512    57    1789   2164   798        105.594           66
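Frequency and cycle count can be combined into a time per multiplication, which makes the four designs easier to compare. The sketch below assumes that each design runs at its reported frequency and that latency is simply cycles divided by frequency; these derived times are not figures given in the talk.

```python
# (frequency in MHz, cycles) for each design, as reported in the results table.
RESULTS = {
    "s=8/K=256": (105.275, 33),
    "s=16/K=256": (145.892, 66),
    "s=8/K=512": (64.825, 33),
    "s=16/K=512": (105.594, 66),
}


def latency_us(freq_mhz, cycles):
    """Time of one modular multiplication in microseconds."""
    return cycles / freq_mhz


latencies = {name: latency_us(f, c) for name, (f, c) in RESULTS.items()}
fastest = min(latencies, key=latencies.get)

for name, t in latencies.items():
    print(f"{name:12s} {t:.3f} us")
print("fastest:", fastest)
```

Under these assumptions the higher clock frequency of the s = 16 designs does not compensate for their doubled cycle count, so the choice between them rests on resources rather than speed.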


Conclusion

We have implemented the Montgomery multiplication with a systolic architecture that runs in a fixed number of clock cycles.

We designed it to make maximum use of the DSPs on the FPGA board.

We implemented two architectures (s=8 and s=16).

We used these two designs to implement the scalar multiplication for the 128-bit security level.

Perspectives

Perform a mixed software/hardware implementation (co-design) of the Optimal-Ate pairing on BN curves in Jacobian coordinates using this multiplication algorithm.

Finalize the hardware implementation of the designs for:

s = 32

s = 64
