MRABET Amine
Montgomery Algorithm for Modular Multiplication
with Systolic Architecture
LIASD Paris 8
ENIT-TUNIS EL MANAR University
SAS - CMP - Gardanne
SPACE 2016
Plan
1. Introduction to pairings
2. Montgomery Multiplication (CIOS)
3. Architecture
4. Results
5. Conclusion and Perspectives
General Context
This work is part of the hardware implementation of asymmetric cryptography primitives: the Optimal-Ate pairing on elliptic curves, elliptic-curve cryptosystems, and RSA, which are among the best-known methods in asymmetric encryption.
Definition
Let G1 and G2 be two additive groups and let G3 be a multiplicative group.
A pairing is a map e : G1 × G2 → G3 with the following properties:
Non-degeneracy:
for every P ∈ G1 with P ≠ 0, there exists Q ∈ G2 such that e(P, Q) ≠ 1,
and
for every Q ∈ G2 with Q ≠ 0, there exists P ∈ G1 such that e(P, Q) ≠ 1.
Bilinearity:
e(xP, yQ) = e(P, Q)^{xy},
e(xP, yQ)^z = e(yP, zQ)^x = e(zP, xQ)^y = e(P, Q)^{xyz}.
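Both properties can be checked numerically on a toy bilinear map. The sketch below is an illustration only, under assumptions that are not in the slides: it takes G1 = G2 = (Z_r, +), G3 = the subgroup of order r of F_p^*, and e(x, y) = g^{xy} mod p. The names `toy_pairing`, `P_MOD`, `R_ORD` and `G` are hypothetical, and this map has no cryptographic value (it is not an elliptic-curve pairing).

```python
# Toy bilinear map for illustration only (NOT an elliptic-curve pairing).
# Assumed parameters: P_MOD = 2 * R_ORD + 1, both prime; G = 4 is a square
# different from 1, so it generates the subgroup of prime order R_ORD.
P_MOD = 2027
R_ORD = 1013
G = 4


def toy_pairing(x, y):
    """e(x, y) = G^(x*y) mod P_MOD, with x and y taken in Z_r."""
    return pow(G, (x * y) % R_ORD, P_MOD)


def is_bilinear(P, Q, x, y):
    """Check e(xP, yQ) == e(P, Q)^(xy) for one choice of points and scalars."""
    lhs = toy_pairing(x * P % R_ORD, y * Q % R_ORD)
    rhs = pow(toy_pairing(P, Q), x * y, P_MOD)
    return lhs == rhs


def is_non_degenerate():
    """Every non-zero P must have some Q with e(P, Q) != 1."""
    return all(
        any(toy_pairing(P, Q) != 1 for Q in range(1, R_ORD))
        for P in range(1, R_ORD)
    )


if __name__ == "__main__":
    print(is_bilinear(5, 7, 123, 456), is_non_degenerate())
```

Because both groups here are symmetric and tiny, the check is exhaustive for non-degeneracy; a real pairing would rely on the Miller algorithm discussed later in the talk.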
Pairing protocols
The bilinearity of pairings has enabled the construction of new protocols:
Tripartite Diffie–Hellman key exchange (Joux 2001)
Identity-based cryptography (Boneh and Franklin)
Short signature schemes (Boneh, Lynn, Shacham)
Pairing protocols: example of identity-based cryptography
Trusted authority: holds the secret s
Alice: identity IA
Bob: identity IB
The public keys are the identities of the users.
The private keys are constructed by the trusted authority and transmitted to the users:
PA = s·IA, PB = s·IB
By bilinearity:
e(PA, IB) = e(IA, IB)^s and e(PB, IA) = e(IA, IB)^s
Pairing protocols: example of identity-based cryptography
Encryption of the plaintext message M
Alice wants to send a message to Bob:
She chooses a random integer a,
She retrieves Bob's public key IB,
She computes the pairing e(IB, Q0)^a,
She sends to Bob: [U, V] = [aP, M ⊕ H2(e(IB, Q0)^a)]
Pairing protocols: example of identity-based cryptography
Decryption of the encrypted message
Bob proceeds as follows:
He contacts the trusted authority to retrieve his private key PB = s·IB,
He recovers the message M by computing V ⊕ H2(e(PB, U)).
This works because of the bilinearity of the pairing:
e(PB, U) = e(s·IB, aP) = e(IB, P)^{as} = e(IB, sP)^a
Different pairings
Weil pairing
eW : E(F_p)[r] × E(F_{p^k})/rE(F_{p^k}) → F*_{p^k}
(P, Q) → (-1)^r f_{r,P}(Q) / f_{r,Q}(P)
It requires a Miller Lite evaluation f_{r,P}(Q), a Miller Full evaluation f_{r,Q}(P), an inversion and a multiplication.
Tate pairing
eT : E(F_p)[r] × E(F_{p^k})/rE(F_{p^k}) → F*_{p^k}
(P, Q) → [f_{r,P}(Q)]^{(p^k - 1)/r}
The Tate pairing is defined with the same parameters E, F_p, r, k as the Weil pairing.
Computing the Tate pairing takes log2(r) iterations of the Miller algorithm, where r is the order of the subgroups used.
Ate pairing
G1 = E[r] ∩ Ker(π_p - [1]) = E(F_p)[r], G2 = E[r] ∩ Ker(π_p - [p])
eA : G1 × G2 → F*_{p^k}
(P, Q) → [f_{T,Q}(P)]^{(p^k - 1)/r}
The main advantage over the Tate pairing is the reduced number of iterations of the Miller algorithm: log2(T), where T = t − 1 and t is the trace of Frobenius on E(F_p).
The disadvantage of the Ate pairing is that it requires a Miller Full evaluation.
Twisted Ate pairing
G1 = E[r] ∩ Ker(π_p - [1]) = E(F_p)[r], G2 = E[r] ∩ Ker(π_p - [p])
eTA : G1 × G2 → F*_{p^k}
(P, Q) → [f_{T,P}(Q)]^{(p^k - 1)/r}
The computation uses a Miller Lite evaluation, which reduces the complexity of the calculations.
Optimal Ate (OATE) pairing
The Optimal Ate pairing improves the Ate pairing by reducing the number of iterations of the Miller algorithm.
In the case of BN curves, the OATE pairing is defined with a Miller loop parameter equal to 6t + 2, where t is the parameter of the BN curve.
Basic operations
The basic operations in the finite field:
Addition
Subtraction
Multiplication
Inversion
account for most of the computation time of a pairing.
That is why optimizing these operations matters most.
2. Montgomery Multiplication (CIOS)
Reminder: Montgomery algorithm
Conversion between the ordinary domain and the Montgomery domain
Ordinary domain      Montgomery domain
a                    M(a) = a·R mod p
b                    M(b) = b·R mod p
a·b                  M(a·b) = a·b·R mod p
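The correspondence in this table can be verified with a few lines of Python. This is a minimal sketch, assuming an example odd modulus p = 2^255 - 19 and R = 2^256 (neither value comes from the slides); the function names are hypothetical, and the product is computed with a plain modular inverse rather than with the word-level algorithm.

```python
# Assumed example values: any odd modulus P with P < R = 2^K would do.
P = 2**255 - 19
R = 2**256
R_INV = pow(R, -1, P)  # modular inverse, requires Python 3.8+


def to_montgomery(a):
    """M(a) = a * R mod p."""
    return a * R % P


def from_montgomery(x):
    """Inverse conversion: x * R^-1 mod p."""
    return x * R_INV % P


def montgomery_product(x, y):
    """Montgomery product x * y * R^-1 mod p."""
    return x * y * R_INV % P


if __name__ == "__main__":
    a, b = 123456789, 987654321
    m_ab = montgomery_product(to_montgomery(a), to_montgomery(b))
    print(m_ab == to_montgomery(a * b % P))
```

The point of the Montgomery domain is visible in the last check: multiplying M(a) by M(b) with the extra factor R^-1 lands directly on M(a·b), so a chain of multiplications never leaves the domain.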
The Coarsely Integrated Operand Scanning (CIOS) method [1]
The CIOS method improves the Montgomery algorithm by integrating multiplication and reduction.
How?
Instead of computing the full product a × b and then performing the reduction, it alternates between iterations of multiplication and iterations of reduction.
[1] Çetin Kaya Koç, Tolga Acar and Burton S. Kaliski Jr., "Analyzing and Comparing Montgomery Multiplication Algorithms", IEEE Micro, June 1996.
What is a systolic architecture?
It is a network composed of a large number of cells. Each cell receives data from its neighboring cells, performs a simple calculation, and then transmits the results, again to neighboring cells.
A systolic architecture relies on very simple elementary cells, which reduces resource requirements in hardware implementations.
Our contribution in this work is to combine a systolic architecture, which is regarded as the best solution for FPGA implementations, with the CIOS method of Montgomery modular multiplication.
Coarsely Integrated Operand Scanning
[Slide figure: numbered listing of the CIOS algorithm]
Cutting the CIOS algorithm into cells
alpha: lines 5 and 6
alpha_2: lines 7, 8 and 9
beta: lines 11 and 12
gamma: lines 14 and 15
gamma_2: lines 16, 17 and 18
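The cutting into cells can be mirrored in software. The block below is a minimal Python model, assuming the slide's listing follows the CIOS pseudocode of Koç, Acar and Kaliski [1]; the assignment of statements to the cells alpha, alpha_2, beta, gamma and gamma_2 is inferred from the number of lines given for each cell and is not confirmed by the transcript. It models arithmetic only, not timing, and all function names are hypothetical.

```python
def split(acc, w):
    """Split an accumulator into (carry, sum), the sum holding w bits."""
    return acc >> w, acc & ((1 << w) - 1)


def alpha(t_j, a_i, b_j, c_in, w):
    """Multiplication cell: (C, S) = t[j] + a[i] * b[j] + C."""
    return split(t_j + a_i * b_j + c_in, w)


def alpha_2(t_s, c_in, w):
    """End of the multiplication step: returns the new t[s] and t[s+1]."""
    c, s_out = split(t_s + c_in, w)
    return s_out, c


def beta(t_0, p_prime, p_0, w):
    """Compute m = t[0] * p' mod 2^w and the first carry of the reduction."""
    m = (t_0 * p_prime) & ((1 << w) - 1)
    c, _ = split(t_0 + m * p_0, w)  # the low word is zero by construction
    return m, c


def gamma(t_j, m, p_j, c_in, w):
    """Reduction cell: (C, S) = t[j] + m * p[j] + C."""
    return split(t_j + m * p_j + c_in, w)


def gamma_2(t_s, t_s1, c_in, w):
    """End of the reduction step: returns the new t[s-1] and t[s]."""
    c, s_out = split(t_s + c_in, w)
    return s_out, t_s1 + c


def cios(a, b, p, s, w):
    """Return a * b * R^-1 mod p with R = 2^(s*w), for odd p and a, b < p."""
    mask = (1 << w) - 1
    words = lambda x: [(x >> (w * k)) & mask for k in range(s)]
    A, B, P = words(a), words(b), words(p)
    p_prime = (-pow(p, -1, 1 << w)) % (1 << w)  # -p^-1 mod 2^w (Python 3.8+)
    t = [0] * (s + 2)
    for i in range(s):
        c = 0
        for j in range(s):                      # multiplication step
            c, t[j] = alpha(t[j], A[i], B[j], c, w)
        t[s], t[s + 1] = alpha_2(t[s], c, w)
        m, c = beta(t[0], p_prime, P[0], w)     # reduction step
        for j in range(1, s):
            c, t[j - 1] = gamma(t[j], m, P[j], c, w)
        t[s - 1], t[s] = gamma_2(t[s], t[s + 1], c, w)
    result = sum(t[k] << (w * k) for k in range(s + 1))
    return result - p if result >= p else result  # final conditional subtraction
```

For example, with the assumed modulus p = 2^255 - 19, `cios(a, b, p, 8, 32)` and `cios(a, b, p, 16, 16)` both return a·b·2^-256 mod p, which corresponds to the K=256 configurations with s=8 and s=16.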
3. Architecture
CIOS in Systolic for s=8
[Slide figure: systolic array for s=8. Each iteration i = 0 ... 7 is one row made of a Multiplication Step (cells computing a_i·b_j for j = 0 ... 7) followed by a Reduction Step; the cells of a row are labeled 1, 2, 3 and end with the "_2" cells. The last row outputs a × b × R^-1 mod p.]
In this architecture, the successive iterations of the loop on i are also interleaved.
In our case, 3 iterations of i can be executed at the same time.
CIOS in Systolic for s=8
[Slide figure: the rows i = 0 ... 7 of the array, each with a Multiplication Step and a Reduction Step. A multiplication cell receives a_i and b_j, a reduction cell receives m and p_j; each cell passes a sum S and a carry C to its neighbors. The output is a × b × R^-1 mod p.]
Data Flow
[Slide animation: the operand A supplies the words a0 ... a7 to the rows i = 0, 1, 2, ...; the operand B (b0 ... b7) is split into blocks B1, B2, B3 and the modulus P (p0 ... p7) into blocks including P2 and P3. The words of each block are rotated at every step (b0 b1 b2, then b1 b2 b0, then b2 b0 b1), while the sums S and carries C propagate from cell to cell.]
CIOS in Systolic for s=8
[Slide figure: the rows i = 0 ... 7 with their Multiplication and Reduction Steps, annotated with the FSM states S0, S1, S2 repeating along the schedule; the output is a × b × R^-1 mod p.]
During the execution of this algorithm, three iterations of the loop on i are always executed at the same time, which gives a maximum of three alpha cells and three gamma cells running in parallel.
Because the same blocks repeat, we modeled our FSM with 3 states, which allows us to perform the whole multiplication in just 33 cycles:
(8 + 3) × 3 = 33
CIOS in Systolic for s=16
[Slide figure: the same array for s=16. Each row i = 0 ... 15 computes a_i·b_j for j = 0 ... 15; the cell labels now range from 1 to 6. The last row outputs a × b × R^-1 mod p.]
CIOS in Systolic for s=16
[Slide figure: the words b0 ... b15 of B are grouped into the blocks B1 ... B6 (b0 b1 b2, b3 b4 b5, b6 b7 b8, b9 b10 b11, b12 b13 b14, b15), and the words p0 ... p15 of P into the blocks P1 ... P6; the blocks feed the cells numbered 1 to 6.]
CIOS in Systolic for s=8
[Slide figure: state diagram chaining alpha(1), alpha(2), alpha(3), alpha_2, beta, gamma(1), gamma(2), gamma(3), gamma_2, then i++]
K=256, w=32, s=8
K=512, w=64, s=8
33 clock cycles
CIOS in Systolic for s=16
[Slide figure: state diagram chaining alpha(1) ... alpha(6), alpha_f, beta, gamma(1) ... gamma(6), gamma_f, then i++]
K=256, w=16, s=16
K=512, w=32, s=16
66 clock cycles
Comparison
s       Cells     Clock cycles
8       6 + 3     33
16      12 + 3    66
32      24 + 3    132
64      48 + 3    264

Word size w (in bits) for each operand size K and number of words s:
                   s=8    s=16   s=32
K=256              32     16     8
K=512              64     32     16
K=1024             128    64     32
Number of cycles   33     66     132
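The figures in these two tables follow a regular pattern that can be written down as small formulas. The sketch below is an observation fitted to the table values, not a formula stated in the talk: it assumes 3s/8 alpha cells and as many gamma cells, plus the three cells alpha_2, beta and gamma_2, and a 3-state FSM, which reproduces (8 + 3) × 3 = 33 for s=8. The function names are hypothetical.

```python
def word_size(K, s):
    """Word size w in bits when a K-bit operand is cut into s words."""
    return K // s


def cell_count(s):
    """Assumed pattern: 3s/8 alpha cells, 3s/8 gamma cells, plus 3 others."""
    return 2 * (3 * s // 8) + 3


def cycle_count(s, fsm_states=3):
    """Assumed pattern: (s + number of alpha cells) FSM rounds of 3 states."""
    return (s + 3 * s // 8) * fsm_states


if __name__ == "__main__":
    for s in (8, 16, 32, 64):
        print(s, cell_count(s), cycle_count(s))
    for K in (256, 512, 1024):
        print(K, [word_size(K, s) for s in (8, 16, 32)])
```

Under this reading, doubling s doubles both the cycle count and the number of alpha and gamma cells, while halving the word size each cell has to multiply.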
The interest of each architecture
The interest of each architecture depends on our needs:
Security level
Resources
Speed
The method used
Architectures: Digital signal processing (DSP)
Modern FPGAs are equipped with hardware extensions for arithmetic calculation.
These DSP blocks perform basic arithmetic operations: multiplication, addition and subtraction of unsigned integers.
Internal architectures - cells
The arithmetic operations of each cell are designed to make maximum use of the DSPs.
[Slide figures: datapaths of the five cell types.
alpha: inputs a[i], b[j], C_In and S_In; one multiplier and two adders; the w least significant bits of the result are registered as S_Out and the w most significant bits as C_Out.
alpha_2: inputs S_In and C; one adder; the LSB and MSB halves (w bits each) are registered as S1_Out and S2_Out.
beta: inputs p', S_In and P[0]; two multipliers and one adder; registered outputs m and C_Out.
gamma: inputs m, p[j], C_In and S_In; one multiplier and two adders; the w least significant bits are registered as S_Out and the w most significant bits as C_Out.
gamma_2: inputs S1_In, S2_In and C; two adders; the LSB and MSB halves (w bits each) are registered as S1_Out and S2_Out.]
Internal architectures - Rotation
[Slide figure: rotation units, each made of a multiplexer (Mux) and a ROTATION register: one for A (K bits), and one per block of B and of P (3·w bits each, with a last block of B on 2·w bits).]
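The effect of a rotation unit on a block of words can be modeled on plain integers. This is a minimal sketch under assumptions: the register is taken to hold the word b0 in its least significant w bits, each step rotates the register by one word so that the next word becomes available at the low position, and the helper names `pack`, `unpack` and `rotate_words` are hypothetical.

```python
def pack(words, w):
    """Pack a list of w-bit words into one register, first word lowest."""
    return sum(word << (w * k) for k, word in enumerate(words))


def unpack(register, n_words, w):
    """Split a register back into its n_words words of w bits."""
    mask = (1 << w) - 1
    return [(register >> (w * k)) & mask for k in range(n_words)]


def rotate_words(register, n_words, w):
    """Rotate an (n_words * w)-bit register right by one word of w bits."""
    full_mask = (1 << (n_words * w)) - 1
    return ((register >> w) | (register << ((n_words - 1) * w))) & full_mask


if __name__ == "__main__":
    w = 32
    block = pack([0xB0, 0xB1, 0xB2], w)   # stands for (b0, b1, b2)
    for _ in range(3):
        block = rotate_words(block, 3, w)
        print([hex(x) for x in unpack(block, 3, w)])
```

After three rotations a 3-word block is back in its initial order, which matches a schedule that repeats every 3 FSM states.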
Architectures
[Slide figures: wiring of the processing elements (PE).
A - alpha1, B - alpha2, C - alpha3: each alpha PE has multiplexers, driven by sig_state, on its carry and sum inputs; the carry input of alpha(1) can be forced to zero, and the carry output of each PE feeds the next one.
D - gamma1, E - gamma2, F - gamma3: the gamma PEs receive m and p[j] (p[0] for gamma1) and are chained in the same way through their carry and sum signals.
G - alpha_2, H - gamma_2, I - beta: alpha_2 and gamma_2 produce the outputs S1_Out and S2_Out; beta receives p', P[0] and S_In and produces m and C_Out.]
4. Results
Results
Implementation on the Nexys 4 board:
Component           DSP   Frequency (MHz)   Cycles
MMM (s=8, K=256)    31    105.275           33
Alpha               4     291.023           1
Gamma               4     291.023           1
Beta                4     388.350           1
Alpha_2             1     459.918           1
Gamma_2             2     442.811           1

Design              DSP   LUTs   Registers   Occupied slices   Frequency (MHz)   Cycles
MMM (s=8, K=256)    31    809    870         352               105.275           33
MMM (s=16, K=256)   33    846    1123        402               145.892           66
MMM (s=8, K=512)    87    2650   1614        878               64.825            33
MMM (s=16, K=512)   57    1789   2164        798               105.594           66
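The frequency and cycle columns can be combined into a time per multiplication. The sketch below only does that arithmetic on the values reported in the table; it assumes each design actually runs at its reported maximum frequency, and the names `RESULTS` and `latency_us` are hypothetical.

```python
# (design label) -> (clock cycles, frequency in MHz), copied from the table.
RESULTS = {
    "MMM s=8, K=256": (33, 105.275),
    "MMM s=16, K=256": (66, 145.892),
    "MMM s=8, K=512": (33, 64.825),
    "MMM s=16, K=512": (66, 105.594),
}


def latency_us(cycles, frequency_mhz):
    """Time of one multiplication in microseconds: cycles / f (MHz)."""
    return cycles / frequency_mhz


if __name__ == "__main__":
    for design, (cycles, freq) in RESULTS.items():
        print(f"{design}: {latency_us(cycles, freq):.3f} us")
```

Read this way, the s=8 designs finish a multiplication sooner despite their lower clock frequency, while the s=16 design for K=512 uses fewer DSPs and LUTs, which is the trade-off between speed and resources mentioned earlier.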
5. Conclusion and Perspectives
Conclusion
We have implemented the Montgomery multiplication with a systolic architecture that runs in a fixed number of clock cycles.
Our design makes maximum use of the DSPs of the FPGA board.
We implemented two architectures (s=8 and s=16).
We used these two designs to implement scalar multiplication at the 128-bit security level.
Perspectives
Carry out a mixed software/hardware implementation (co-design) of the Optimal-Ate pairing on BN curves in Jacobian coordinates using this multiplication algorithm.
Finalize the hardware implementation of the designs for s=32 and s=64.