Hardware Acceleration of Elliptic Curve Cryptography

Federal Democratic Republic of Ethiopia

Ministry of Defense

Defense University, College Of Engineering

Office of Postgraduate Programs and Research

M-Tech Thesis

Hardware Acceleration of Elliptic Curve Cryptography

(ECC) Algorithm: Design and Simulation

By

Alemayehu Tilahun

Supervisor: Manoj V.N.V (Dr.)

Department of Computer Information and Technology

Computer Engineering Specialization

June, 2014

Bishoftu


Acknowledgments

I would like to express my deepest gratitude to my thesis guide, Dr. Manoj V.N.V, whose skillful advice and follow-up have been my motivation throughout the thesis. It is also a great pleasure to express my appreciation and many thanks to all the scholars whose publications are referenced in my work, as it could not have been completed without their contributions.

My boundless respect and gratitude go to Major Alemseged A. and Captain Abrham J. for their endless support and encouragement during the process.

Last but not least, my sincere thanks go to all staff members of the CIT Department of Defense Engineering College and to all my friends who have patiently extended all sorts of help towards accomplishing this undertaking.



DECLARATION

I hereby declare that the thesis project entitled “Hardware Acceleration of Elliptic Curve Cryptography (ECC) Algorithm: Design and Simulation” submitted for the M-Tech degree is my original work and that the thesis project has not formed the basis for the award of any degree, associateship, fellowship or any other similar title.

Signature of the student: Alemayehu Tilahun

Place: DUC, Bishoftu

Date: June 9, 2014



CERTIFICATE

This is to certify that the thesis project entitled “Hardware Acceleration of Elliptic Curve Cryptography (ECC) Algorithm: Design and Simulation” is the work carried out by Alemayehu Tilahun Haile, a student of M-Tech, Defense Engineering College, Bishoftu, during the year 2013/2014, in partial fulfillment of the requirements for the award of the degree of M-Tech in Computer Engineering, and that the project has not previously formed the basis for the award of any degree, diploma, associateship, fellowship or any other similar title.

Signature of the Advisor: Manoj V.N.V(Dr.)

Place: DUC, Bishoftu

Date: June 9, 2014



Approval by Members of BoE

Name Signature

1. External Examiner:

2. Internal Examiner:

3. Chairperson/HoD:

4. Advisor:



Abstract

In today’s dynamically changing world, technological advancement has resulted in explosive growth of communications and other related computer engineering fields. Applications such as online banking, personal digital assistants, mobile communication, and smartcards have emphasized the need for security in resource-constrained environments. Elliptic curve cryptography (ECC) is a well-suited cryptographic tool because of its short key sizes and security comparable to that of other standard public key algorithms. However, to match the ever-increasing requirement for speed in today’s applications, hardware acceleration of the cryptographic algorithms is a necessity. As a further challenge, the designs have to be robust against attacks.

This thesis explores hardware acceleration of elliptic curve cryptography over binary Galois fields. The efficiency is largely affected by the underlying arithmetic primitives. The thesis therefore explores field programmable gate array (FPGA) designs for two of the most important field primitives, namely multiplication and inversion. FPGAs are reconfigurable hardware platforms that offer flexibility and lower costs, much like software programs. However, designing on FPGA platforms is challenging because of the large granularity, limited resources, and large routing delay. The smallest programmable entity in an FPGA is the look-up table. The arithmetic algorithms proposed in this thesis maximize the utilization of look-up tables on the FPGA. A novel finite field multiplier based on the Karatsuba multiplication algorithm is proposed. The proposed multiplier combines two variants of Karatsuba, namely the general and the simple Karatsuba multipliers. The general Karatsuba multiplier has a large gate count, but for small multiplications it is compact because it utilizes look-up table resources efficiently. For large multiplications, the simple Karatsuba multiplier is efficient as it requires fewer gates. The proposed hybrid multiplier performs the initial recursion using the simple algorithm, while the final small multiplications are done using the general algorithm. The multiplier thus obtained has the best area-time product compared to reported literature. The proposed primitives, organized around the Karatsuba multiplier, have one of the best timings and area-time products compared to reported works. We conclude that the performance of our multiplier is significantly enhanced if the underlying primitives are carefully designed.

Key Words:- Cryptography, Elliptic Curve Cryptography, Karatsuba Multiplier, Hardware Acceleration



Table of contents Page

Acknowledgements II

Declaration III

Certificate IV

Approval of BoE’s V

Abstract VI

Table of Contents VII

Lists of Acronyms and Abbreviations XI

Lists of Tables XII

Lists of Figures XIII

Chapter -1 INTRODUCTION

1.1 Background 1

1.2 Quality and Needs for its Achievement 2

1.3 Statement Of the Problem 3

1.4 Objective of the Study 3

1.4.1 General Objective 3

1.4.2 Specific Objectives 3

1.5 Scope of the Study 3

1.6 Limitation of the Study 4

1.7 Significance of the Study 4

1.8 Organization of the study 4

Chapter-2 LITERATURE REVIEW

2.1 Literature Review at International Level 6

2.2 Literature Review at National Level 7

2.3 Concepts in Cryptography 7



2.4 ECC Cryptography 9

2.5 Mathematical Background of ECC 10

2.5.1 Groups 10

2.5.2 Rings 10

2.5.3 Fields and Vector Space 10

2.5.3.1 Finite Field 11

2.5.3.2 Prime Field Fp 12

2.5.3.3 Binary Field F(2m) 12

2.5.4 Polynomial Basis Representation of F(2m) 12

2.5.5 Normal basis Representation of F(2m) 13

2.6 Elliptic Curve Over Fp 14

2.7 Elliptic Curves Over F(2m) 15

2.8 Elliptic Curve Discrete Logarithm Problem 17

2.9 Application of Elliptic Curve in Key Exchange 17

2.9.1 ECC Domain Parameter 17

2.9.2 Elliptic Curve Protocols 17

2.9.3 Elliptic Curve Digital Signature Authentication 18

2.9.4 Elliptic Curve Authentication Encryption Scheme 21

2.10 Algorithms for Elliptic Curve Multiplication 22

2.11 Hierarchy of Elliptic Curve Cryptography 22

2.11.1 Point Multiplication 23

2.11.2 Point Addition 23

2.11.3 Point Doubling 24

2.12 Hardware Accelerator 24

2.13 FPGA Architecture 26

2.13.1 Look-Up Table 27

2.13.2 Configurable Logic Block 27

2.13.3 Input/Output Block 28

2.13.3 RAM Block 29

2.13.4 Programmable Routing 29



Chapter-3 METHODOLOGY

3.1 Design Cycle 30

3.2 Tools Used 32

3.2.1 Simulation and Verification 33

3.2.2 Synthesizing tools 33

3.2.3 Place and Route 34

3.2.4 Loading FPGA Program 35

Chapter- 4 DATA AND DATA ANALYSIS

4.1 Elliptic Curve Scalar Multiplication 36

4.2 Karatsuba Multiplier 37

4.2 Point Addition 40

4.3 Point Multiplication 41

4.4 Squaring 41

Chapter-5 RESULT AND DISCUSSIONS

5.1 Simulation Result for Karatsuba Multiplier 44

5.2 Resource Utilization of Polynomial Reducer 45

5.3 Simulation Result for Encryption Unit 46

5.4 Design of the Encryption Unit 47

5.5 Xpower Analysis for Karatsuba Multiplier 47

5.6 Resource Utilization for ECC Decryptor Unit 48

5.7 Xpower Analysis of ECC Component 49

5.8 Comparison with Other Related Works 49

Chapter-6 SUMMARY, CONCLUSION, RECOMMENDATION AND FUTURE RESEARCH WORK

6.1 Summary 51

6.2 Conclusion 52

6.3 Recommendation for Future Work 52

6.3.1 New Design Consideration 52

6.3.2 Implementation Alternatives 53

REFERENCE 54

Appendix-I Sample Algorithms 57

Appendix-II Sample Snapshots 59

Appendix-III Xilinx Synthesis Report 63

Appendix-IV Sample VHDL Code 71



Lists of Acronyms and Abbreviations

ASIC Application Specific Integrated Circuits

CLB Configurable Logic Block

DSA Digital Signature Algorithm

ECC Elliptic Curve Cryptography

ECDH Elliptic Curve Diffie-Hellman Protocol

ECDLP Elliptic Curve Discrete Logarithm Problem

ECDSA Elliptic Curve Digital Signature Authentication

ECAES Elliptic Curve Authentication Encryption Scheme

FPGA Field Programmable Gate Array

FDRE Federal Democratic Republic of Ethiopia

GF Galois Field (Finite Field)

GRM General Routing Matrix

HDL Hardware Description Language

JTAG Joint Test Action Group

LUT Look-Up Table

LSB Least Significant Bit

MUX Multiplexer

MoD Ministry Of Defense

PLD Programmable Logic Device

RTL Register Transfer Level

RSA Ron Rivest, Adi Shamir, and Leonard Adleman

SoC System On a Chip

VHDL Very High Speed Integrated Circuit Hardware Description Language



List of Tables

Table Number Description Page No

Table 2.1 Comparison of NIST Recommended Key size 9

Table 5.1 Resource Utilization of Karatsuba Multiplier 44

Table 5.2 Resource Utilization for Polynomial Reducer 45

Table 5.3 Resource Utilization of ECC Encryption Unit 46

Table 5.4 Xpower Report for Karatsuba Multiplier 47

Table 5.5 Decryption Unit Resource Utilization 48

Table 5.6 Xpower Analysis for ECC Components 49

Table 5.7 Comparison with other Related Works 50



List of Figures

Figure Number Description Page No

Figure 2.1 Two Party Communication 8

Figure 2.2 Illustration of Elliptic Curve Digital Signature 20

Figure 2.3 Illustration of Elliptic Curve Authentication 22

Figure 2.4 Hierarchy of Elliptic Curve Cryptography 23

Figure 2.5 Graph of Point Addition 24

Figure 2.6 Graph of Point Doubling 24

Figure 2.7 FPGA Architecture 26

Figure 2.8 Internal Architecture of FPGA 27

Figure 3.1 Design Flow Chart 31

Figure 3.2 FPGA Board 32

Figure 3.3 Synthesis Flow Diagram 33

Figure 3.4 Place and Route Processes 34

Figure 3.5 Programming FPGA Board 35

Figure 4.1 Typical Hierarchy of ECC 36

Figure 4.2 RTL Structure of Karatsuba Multiplier 40

Figure 4.3 RTL Structure of Squarer 43

Figure 5.1 Karatsuba based Encryptor 47

Figure 5.2 Karatsuba Based Decryptor 48



CHAPTER-1

INTRODUCTION

1.1 Background

The ever-increasing volume of communication over wired and wireless networks means that thousands of transactions take place over the World Wide Web every day. Many of these transactions carry critical data that must be kept confidential, transactions that must be validated, and users who must be authenticated. These requirements call for a robust security framework [17] [33] [35].

The need for information security led to the evolution of cryptography. Cryptography is the science of keeping information secure. It involves the encryption and decryption of messages. Encryption is the process of converting plain text into cipher text, and decryption is the process of recovering the original message from the encrypted text. In addition to confidentiality, cryptography also provides authentication, integrity and non-repudiation [1][5].

Many cryptographic algorithms are known. The crux of any cryptographic algorithm is the “seed” or “key” used for encrypting and decrypting the information [20][34]. Many cryptographic algorithms are publicly available, though some organizations believe in keeping the algorithm secret. The general practice is to use a publicly known algorithm while keeping the key secret.

Based on the key, cryptosystems can be classified into two categories: symmetric and asymmetric. In symmetric key cryptosystems, the same key is used for both encryption and the corresponding decryption [4][7][9][16].

Asymmetric (public key) cryptosystems use two different keys: one is used for encryption while the other is used for decryption. In some schemes, such as RSA, the two keys can be used interchangeably. One of the keys is made public (shared) while the other is kept secret; that is, let k1 and k2 be the public and private keys respectively (M. Kider and Manoj V.N.V, 2008). In general, symmetric key cryptosystems are preferred over public key systems due to the following factors:

I. Ease of computation

II. Smaller key length providing the same amount of security as compared to a larger key in public key systems.

Hence the common approach is to use a public key system to securely transmit a “secret key”. Once the key has been securely exchanged, it is used for encryption and decryption with a symmetric key algorithm.

The idea of using elliptic curves in cryptography was introduced independently by Victor Miller and Neal Koblitz in 1985 as an alternative to established public-key systems such as DSA and RSA. The Elliptic Curve Discrete Logarithm Problem (ECDLP) makes ECC harder to break than RSA and DSA, whose underlying problems (integer factorization and the discrete logarithm problem) can be solved in sub-exponential time. This means that significantly smaller parameters can be used in ECC than in competing systems such as RSA and DSA, which gives smaller key sizes and hence faster computations.

In this thesis we study the hardware acceleration of elliptic curve cryptography. We study the properties of finite fields and of elliptic curves over finite fields, and how these properties can be used for the efficient design and simulation of the encryption and decryption process.

1.2 Quality and the Need for its Achievement

FPGAs are an attractive choice for implementing cryptographic algorithms in hardware because of their low prototyping cost relative to ASICs. FPGAs are also flexible when adopting security protocol upgrades, as they can be re-programmed in place [13]. One FPGA series is the Xilinx Spartan®-6, which, according to the vendor, delivers a balance of low risk, low cost, and low power for cost-sensitive applications, with 42% less power consumption and 12% higher performance than previous-generation devices. Part of Xilinx’s All Programmable low-end portfolio, Spartan-6 FPGAs offer advanced power management technology, up to 150K logic cells, integrated PCI Express® blocks, advanced memory support, 250 MHz DSP slices, and 3.2 Gbps low-power transceivers. Xilinx ISE Design Suite 14.7 is the latest version of the hardware programming environment package, and it provides multiple features over the rival Quartus II Web Edition development package (www.Xilinx.com, last viewed May 23, 2014, 3 PM).



1.3 Statement of the Problem

Scalar multiplication is the most time-consuming operation in elliptic curve cryptography, in both hardware- and software-based cryptosystems, because it is mostly implemented as a sequence of successive point additions [4][11][19][25][33]. ECC algorithms can be implemented efficiently in hardware by applying multiplication optimization techniques such as Karatsuba. The Karatsuba technique makes ECC protocols more attractive by reducing the execution time and I/O usage of the multiplication process. Therefore, while the general-purpose microprocessor performs its routine tasks, the time-consuming operations can be executed on a Karatsuba co-processor designed on reprogrammable hardware such as an FPGA.
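As an illustration of the Karatsuba idea referred to above, the following Python sketch multiplies two binary polynomials (field elements before modular reduction) in software. It is only a minimal sketch under stated assumptions and not the hardware design of this thesis: the function names, the `threshold` parameter, and the use of plain shift-and-XOR multiplication for small operands are assumptions made for illustration. An integer encodes a polynomial over GF(2), with bit i holding the coefficient of x^i.

```python
def clmul_schoolbook(a: int, b: int) -> int:
    """Carry-less product of two GF(2) polynomials by shift-and-XOR."""
    result = 0
    while b:
        if b & 1:          # current coefficient of b is 1
            result ^= a    # addition over GF(2) is XOR
        a <<= 1            # multiply a by x
        b >>= 1
    return result


def clmul_karatsuba(a: int, b: int, nbits: int, threshold: int = 8) -> int:
    """Recursive two-way Karatsuba product of polynomials of degree < nbits.

    `threshold` (an illustrative parameter) is the operand size at which
    the recursion falls back to the schoolbook method.
    """
    if nbits <= max(threshold, 1):
        return clmul_schoolbook(a, b)
    half = (nbits + 1) // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = clmul_karatsuba(a_lo, b_lo, half, threshold)
    hi = clmul_karatsuba(a_hi, b_hi, nbits - half, threshold)
    mid = clmul_karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, half, threshold)
    # Over GF(2) subtraction is XOR, so the cross term is mid ^ lo ^ hi.
    return (hi << (2 * half)) ^ ((mid ^ lo ^ hi) << half) ^ lo


if __name__ == "__main__":
    # (x^3 + x + 1) * (x^2 + x) = x^5 + x^4 + x^3 + x
    print(bin(clmul_karatsuba(0b1011, 0b110, 4, threshold=1)))
```

Each recursion level replaces four half-size multiplications with three, at the cost of a few extra XOR additions. The product still has to be reduced modulo the reduction polynomial of the field to obtain a field element.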

1.4 Objectives of the Study

1.4.1 General Objective

The general objective of this study is to design and simulate hardware acceleration of Elliptic Curve Cryptography (ECC).

1.4.2 Specific Objectives

The specific objectives of the study are:

1. To design and simulate the Karatsuba multiplier.

2. To design and simulate finite field arithmetic units for binary fields using Xilinx ISE Design Suite.

3. To measure the efficiency of Karatsuba multiplication on Xilinx ISE Design Suite.

4. To integrate the finite field arithmetic units into an efficient hardware scalar multiplier.

5. To design and simulate the Karatsuba-based ECC encryptor/decryptor.

6. To compare the performance of the hardware multiplier with the software implementation and other related works.

1.5 Scope of the Study

In this thesis, hardware units are designed for Karatsuba multiplication and binary field arithmetic, and their performance is compared with that of software. These finite field arithmetic units are then integrated to create elliptic curve cryptographic hardware capable of computing scalar multiplication on elliptic curves and performing encryption and decryption.

To measure the efficiency of the hardware, the design is translated into a hardware description language, namely Verilog. Simulation is then done for functionality and timing analysis using the Xilinx design suite V14.5 software.

1.6 Limitations of the study

In conducting this thesis work, the researcher encountered the following challenges:

a) Windows 8 was the researcher’s original operating system, but it did not support any version of Xilinx ISE Design Suite.

b) Lack of current literature during the literature survey.

1.7 Significance of the study

With the rapid growth of the Internet and digital communication, the need to protect files and other information stored on, and transmitted between, computers has become vitally important. To this extent, the requirement for trusted computing and secure communication has become an important issue of the era. Therefore, this thesis on hardware acceleration of elliptic curve cryptography is believed to be of comparable importance in advancing the security and performance of information communication and dissemination activities in EFDRE MoND. Furthermore, the findings of the thesis and the results obtained from the analysis will be helpful to other researchers wishing to conduct experiments in this area.

1.8 Organization of the Thesis

The thesis is organized into six chapters. The first chapter introduces the thesis background, research objectives, scope, significance, contribution, and organization.

The second chapter reviews the background of the research and presents related works. A summary of the literature review is given to clarify the study rationale. The chapter also gives a brief introduction to cryptography and elliptic curve cryptography, together with the mathematical concepts of finite fields and elliptic curves. Various design styles of hardware accelerators for implementing elliptic curve arithmetic are described. The ECDH and ECAES protocol schemes are also discussed in this chapter.

Chapter three presents the methodology followed to design and simulate the ECC hardware accelerator for the ECC primitives, namely the Karatsuba multiplier, the field arithmetic level and the point arithmetic level. The activities followed to design hardware for the primitives of ECC are also explained in this chapter.

The fourth chapter presents the details of the design and implementation of the ECC hardware accelerator on reconfigurable hardware (FPGA). For each lower-level activity of elliptic curve cryptography, the respective circuit has been designed independently and then integrated into one accelerator module.

The fifth chapter covers the results and discussion of the thesis. The reports generated from Xilinx ISE Design Suite and the PlanAhead platform for design verification, synthesis and implementation, together with test results and performance studies on the hardware-based ECC, are presented in tables and charts. Test results of the elliptic curve cryptosystem from Xilinx ISE Design Suite 14.7 and results from related hardware accelerators are also compared and reported.

In the final chapter, the thesis work is summarized and concluded, and potential future works are indicated.



CHAPTER-2

LITERATURE REVIEW

2.1 Literature reviewed at International Level

Several high-performance FPGA processors for elliptic curve cryptography have been reported. Various acceleration techniques have been used, ranging from efficient implementations to parallel and pipelined architectures. In [29] the Montgomery multiplier [4] [13] [30] is used for scalar multiplication. The finite field multiplication is performed using a digit-serial multiplier proposed in [31]. The Itoh-Tsujii algorithm is used for finite field inversion [7] [19] [30] [35].

In [22], the ECC processor has squarers, adders, and multipliers in the data path. The authors use a hybrid coordinate representation in affine, Jacobian, and López-Dahab form.

In [34] an end-to-end system for ECC is developed, which has a hardware implementation for

ECC on an FPGA. The high performance is obtained with an optimized field multiplier. A

digit-serial shift-and-add multiplier is used for the purpose. Inversion is done with a dedicated

division circuit.

In [30], the finite field multiplier in the processor is prevented from becoming idle. The finite field multiplier is the bottleneck of the design; therefore, preventing it from becoming idle improves the overall performance. Our design of the elliptic curve crypto processor (ECCP) is on similar lines: the operations required for point addition and point doubling are scheduled so that the finite field multiplier is always utilized.

Hankerson, Hernandez and Menezes (Hankerson, et al. 2000) wrote an excellent survey discussing software algorithms for computing elliptic curve point multiplication. Many of these algorithms can be adapted for use in hardware, but the survey does not address multiplication optimization techniques for reconfigurable hardware, which this thesis does (with optimization parameters such as surface area, energy consumption and performance).


Großschädl and Kamendje (2003) propose a simple architectural change to the multiplier within a RISC processor, along with software algorithms that make use of the modified multiplier for point multiplication. These software implementations usually make use of a polynomial basis over binary fields. In addition, Chester Rebeiro and Debdeep Mukhopadhyay, in their publication on a high performance elliptic curve crypto processor, explained the same issue for software polynomial bases using the ordinary method for scalar multiplication.

Hardware implementations, on the other hand, often make use of an optimal normal basis over binary fields. They also generally target FPGAs for the realization of the proposed architectures. S. Janssens et al. (2003) propose an architecture that makes use of hardware/software co-design and targets the Atmel FPSLIC.

Okada, Torii, Itoh and Takenaka (2011) propose an elliptic curve coprocessor for arbitrary bit lengths to be implemented on an FPGA. Sutikno, Surya and Effendi propose a processor to compute point multiplication in $F_{2^{155}}$.

Leung, Ma, Wong and Leong propose an FPGA implementation of a micro-coded elliptic curve processor for arbitrary key sizes. Many other hardware implementations also exist. These implementations make use of microcode instructions to drive special-purpose arithmetic units and store intermediate results in standard registers.

The work by Bahram Hakhamaneshi (Islamic Azad University, Iran, 2000), Z. Guitouni, Chotin-Avot, M. Machhu, H. Mehrez and R. Tourki is one of the most similar works published and reviewed so far. The publication uses scalar multiplication over a finite field on an ASIC. Much higher performance may be achievable by using a multiplication optimization scheme as indicated in this thesis.

2.2 Literature Reviewed at National Level

A national-level work by Mubarek Kedir and Manoj V.N.V (April 2008) described hardware acceleration of the elliptic curve cryptography algorithm using the Montgomery multiplication scheme. However, higher efficiency is expected from implementing the scalar multiplication scheme described in this thesis.

2.3 Concepts in Cryptography

Cryptography uses mathematics to encrypt and decrypt data. It enables people to store or transmit sensitive information over insecure networks. Cryptanalysis, on the other hand, is the science of breaking secure communication. In the classic setting, two persons, Alice and Bob (“A” and “B” serve as handy abbreviations of the names), want to communicate securely over an insecure channel. A third person, the eavesdropper (Eve, abbreviated as E), should not be able to read the clear text or change it.

The goal of cryptography is to allow two people to exchange messages that cannot be understood by other people (Wang, et al.). Figure 2.1 provides a sample model of two-party communication using encryption. In this simple model, an entity is a person or thing that sends, receives or manipulates data. A sender is an entity that legitimately transmits the information. A receiver is an entity that is the intended recipient of the information. An adversary is an entity, neither the sender nor the receiver, that attempts to defeat the information security service provided between the sender and receiver. An adversary often attempts to play the role of either the sender or the receiver. Other synonymous names for an adversary are attacker, enemy, eavesdropper, opponent and intruder (Jesper 2006).

Figure 2.1 Two Party Communication

Cryptographic strength can be measured by the resources and time needed to recover the plain text. To encrypt the plaintext, a cryptographic algorithm works in combination with a key to produce the ciphertext. The ciphertext differs from one encryption to another when different key values are used. The security of encrypted data depends on the strength of the cryptographic algorithm and the confidentiality of the key (B. Schneier 1996).

[Figure 2.1 diagram: Plain Text -> Encryption -> Unsecured Channel -> Decryption -> Plain Text, with an Adversary on the channel]

Since Whitfield Diffie and Martin E. Hellman published their famous article “New Directions in Cryptography” [22], cryptographic algorithms have been divided into two categories: symmetric-key cryptography and public-key cryptography. Symmetric-key cryptography (also called private-key, single-key or one-key cryptography) is a cryptosystem in which both encryption and decryption are performed using the same key. In a public-key cryptosystem there are two different keys: one that is public (the public key) and another that is secret (the private key). The most famous public-key cryptosystem is probably RSA, which was presented by Rivest, Shamir and Adleman in 1978 [14].

2.4 ECC Cryptography

Elliptic curve cryptography (ECC) was proposed in 1985 by Neal Koblitz and Victor Miller. Elliptic curve cryptographic schemes are public-key mechanisms that provide the same functionality as RSA schemes. Their security, however, is based on the difficulty of a different problem, the Elliptic Curve Discrete Logarithm Problem (ECDLP). The best known algorithms for solving the ECDLP take fully exponential time, whereas the integer factorization problem can be solved by sub-exponential-time algorithms (Hankerson, et al. 2004). As a result, ECC offers security similar to that of other traditional public-key schemes in use today, but with smaller key sizes and memory requirements, as shown in Table 2.1 (Kumar 2006). For example, it is generally accepted that a 1024-bit RSA key provides the same level of security as a 160-bit elliptic curve key. The advantages gained from smaller key sizes include savings in storage, higher speed, and efficient use of power and bandwidth. Shorter keys mean lower space requirements for key storage and quicker arithmetic operations. These advantages are essential when public-key cryptography is applied in constrained devices, such as mobile devices or RFID tags. They are the reason for choosing ECC as the cryptosystem in this thesis.

Table 2.1 Comparison of NIST recommended key sizes (in bits)

Symmetric Key    ECC    RSA     Comment
64               128    700     Short Period Security
80               160    1024    Medium Period Security
128              256    2040    Long Period Security

In brief, ECC-based algorithms can be easily incorporated into existing protocols to obtain the same backward compatibility and security with fewer resources. Therefore, more low-end constrained devices, previously considered unsuitable for such systems, can use these protocols.

Elliptic curves defined over a finite field provide the group structure used to implement the cryptographic schemes. The elements of the group are the points on the elliptic curve, together with a special point at infinity that acts as the identity element of the group. The group operation is carried out by means of arithmetic operations in the underlying finite field, which is discussed in detail in the next section (Kumar 2006).

2.5 Mathematical Background of Elliptic Curve Cryptography

2.5.1 Groups

A mathematical structure consisting of a set $G$ and a binary operator $*$ on $G$ is a group if:

$\forall a, b \in G$, if $c = a * b$, then $c \in G$ (Closure)

$a * (b * c) = (a * b) * c$, $\forall a, b, c \in G$ (Associativity)

$\exists e \in G$ such that $\forall a \in G$, $a * e = e * a = a$ (Identity element)

$\forall a \in G$, $\exists a^{-1} \in G$ such that $a * a^{-1} = a^{-1} * a = e$. The element $a^{-1}$ is unique for each $a$ and is called the inverse of $a$.

The group is represented as $\langle G, * \rangle$. Additionally, a group is said to be abelian if it also satisfies the commutative property, i.e., $a * b = b * a$ for all $a, b \in G$.
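The four group axioms above can be checked by brute force for a small finite set. The following Python sketch is only an illustration; the helper names `is_group` and `is_abelian` and the example sets are assumptions introduced here, not part of the thesis design.

```python
from itertools import product


def is_group(elements, op):
    """Brute-force check of closure, associativity, identity and inverses."""
    elems = list(elements)
    members = set(elems)
    # Closure: a * b must stay inside the set.
    if any(op(a, b) not in members for a, b in product(elems, repeat=2)):
        return False
    # Associativity: a * (b * c) == (a * b) * c.
    if any(op(a, op(b, c)) != op(op(a, b), c)
           for a, b, c in product(elems, repeat=3)):
        return False
    # Identity: some e with a * e == e * a == a for every a.
    identities = [e for e in elems
                  if all(op(a, e) == a and op(e, a) == a for a in elems)]
    if not identities:
        return False
    e = identities[0]
    # Inverses: every a has some b with a * b == b * a == e.
    return all(any(op(a, b) == e and op(b, a) == e for b in elems)
               for a in elems)


def is_abelian(elements, op):
    """Check the commutative property a * b == b * a."""
    elems = list(elements)
    return all(op(a, b) == op(b, a) for a, b in product(elems, repeat=2))


if __name__ == "__main__":
    add_mod_6 = lambda a, b: (a + b) % 6
    mul_mod_6 = lambda a, b: (a * b) % 6
    print(is_group(range(6), add_mod_6))      # integers mod 6 under addition
    print(is_group(range(1, 6), mul_mod_6))   # fails closure: 2 * 3 = 0 mod 6
```

The check is cubic in the size of the set, so it is only practical for toy examples; it is meant to make the definitions concrete rather than to be used in an implementation.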

2.5.2 Rings

A ring is a set $R$ with two binary operations $+$ and $\times$ (addition and multiplication) defined on $R$ such that the following conditions are satisfied:

$\langle R, + \rangle$ is an abelian group

$a \times (b \times c) = (a \times b) \times c$, $\forall a, b, c \in R$ (Associativity of $\times$)

$a \times (b + c) = (a \times b) + (a \times c)$, $\forall a, b, c \in R$ (Distributivity of $\times$ over $+$)

A ring in which $\times$ is commutative is called a commutative ring. Further, if the ring contains an identity element with respect to $\times$, i.e. $\exists e \in R$ such that $\forall a \in R$, $a \times e = e \times a = a$, then $e$ is called the identity element or the unity element and is represented by 1. If $R$ contains a unity element, then $R$ is called a unitary ring.

2.5.3 Fields and Vector Spaces

A field $F$ is a commutative and unitary ring such that $F^* = \{a \mid a \in F \text{ and } a \neq 0\}$ is a multiplicative group. The ring $Z_p$ is a field if and only if $p$ is a prime.
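The statement that $Z_p$ is a field exactly when $p$ is prime can be checked numerically for small moduli. The sketch below is illustrative only; the function names `is_prime` and `is_field_zn` are assumptions introduced here. Since $Z_n$ is always a commutative unitary ring, it suffices to test whether every nonzero residue has a multiplicative inverse.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small n."""
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def is_field_zn(n: int) -> bool:
    """True if every nonzero element of Z_n has a multiplicative inverse."""
    return n > 1 and all(
        any((a * b) % n == 1 for b in range(1, n)) for a in range(1, n)
    )


if __name__ == "__main__":
    for n in range(2, 13):
        print(n, is_field_zn(n), is_prime(n))
```

For a composite modulus such as 6, the element 2 has no inverse because 2 times any residue is even modulo 6, so the nonzero residues do not form a multiplicative group.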


If $F$ is a field, a subset $K$ of $F$ that is also a field under the operations of $F$ (with restriction to $K$) is called a subfield of $F$. In this case, $F$ is called an extension field of $K$. If $K \neq F$ then $K$ is a proper subfield of $F$. A field is called prime if it has no proper subfield.

If $F$ is a field and $V$ is an additive abelian group, then $V$ is called a vector space over $F$ if an operation $F \times V \to V$ is defined such that:

$a(v + u) = av + au$

$(a + b)v = av + bv$

$a(bv) = (a \cdot b)v$

$1 \cdot v = v$

where $a, b \in F$ and $u, v \in V$.

The elements of $F$ are called scalars and the elements of $V$ are called vectors.

If $v_1, v_2, \ldots, v_m \in V$ and $f_1, f_2, \ldots, f_m \in F$, then the vector $v' = \sum_{i=1}^{m} f_i v_i$ is a linear combination of the vectors $v_1, v_2, \ldots, v_m$. The set of all such linear combinations is called the span of $v_1, v_2, \ldots, v_m$.

The vectors $v_1, v_2, \ldots, v_m \in V$ are said to be linearly independent over $F$ if there exist no scalars $f_1, f_2, \ldots, f_m \in F$, not all zero, such that $\sum_{i=1}^{m} f_i v_i = 0$.

A set $S = \{u_1, u_2, \ldots, u_n\}$ is said to be a basis of $V$ iff the elements of $S$ are linearly independent and span $V$. If a vector space $V$ over a field $F$ has a basis with a finite number of vectors, then this number is called the dimension of $V$ over $F$.

If $F$ is an extension field of a field $F_p$, then $F$ is a vector space over $F_p$. The dimension of $F$ over $F_p$ is called the degree of the extension of $F$ over $F_p$.

2.5.3.1 Finite Fields

A field of a finite number of elements is denoted Fq or GF(q), where q is the number

of elements. This is also known as a Galois Field.

The order of a finite field $F_q$ is the number of elements in $F_q$. Further, there exists a finite field $F_q$ of order q iff q is a prime power, i.e. either q is prime or $q = p^m$, where p is prime.

In the latter case, p is called the characteristic of $F_q$, m is called the extension degree of $F_q$, and every element of $F_q$ is a root of the polynomial $x^{p^m} - x$ over $Z_p$.

Let us consider two classes of finite fields: $F_p$ (prime field, where p is a prime number) and $F_{2^m}$ (binary finite field).

2.5.3.2 Prime Field Fp

The prime field $F_p$ consists of the set of integers $\{0, 1, 2, \ldots, p - 1\}$, with the following arithmetic operations defined over it.

Addition: $\forall a, b \in F_p$, $a + b = r \in F_p$, where $r = (a + b) \bmod p$

Multiplication: $\forall a, b \in F_p$, $a \cdot b = s \in F_p$, where $s = (a \cdot b) \bmod p$
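As a small illustration of these definitions, the following Python sketch implements addition and multiplication in a prime field, together with inversion by Fermat's little theorem. The prime p = 17 and the function names are illustrative assumptions and are not taken from this work.

```python
# Minimal sketch of prime-field arithmetic; P = 17 is an illustrative choice.
P = 17

def fp_add(a, b, p=P):
    """Return r = (a + b) mod p."""
    return (a + b) % p

def fp_mul(a, b, p=P):
    """Return s = (a * b) mod p."""
    return (a * b) % p

def fp_inv(a, p=P):
    """Return a^-1 mod p via Fermat's little theorem (a != 0, p prime)."""
    if a % p == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    return pow(a, p - 2, p)

if __name__ == "__main__":
    print(fp_add(12, 9))          # (12 + 9) mod 17
    print(fp_mul(12, 9))          # (12 * 9) mod 17
    print(fp_mul(5, fp_inv(5)))   # a * a^-1 is the unity element
```

Every nonzero element has an inverse only because p is prime, which is the reason $Z_p$ is a field exactly in that case.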

2.5.3.3 Binary Finite Field F2m

The finite field $F_{2^m}$, called a characteristic two finite field or a binary finite field, can be viewed as a vector space of dimension m over $F_2$, which consists of the two elements 0 and 1. There exist m elements $\alpha_0, \alpha_1, \ldots, \alpha_{m-1}$ in $F_{2^m}$ such that each element $\alpha \in F_{2^m}$ can be uniquely represented as

$\alpha = \sum_{i=0}^{m-1} a_i \alpha_i$, where $a_i \in \{0, 1\}$, $0 \le i < m$.

The set $\{\alpha_0, \alpha_1, \ldots, \alpha_{m-1}\}$ is called a basis of $F_{2^m}$ over $F_2$. Given such a basis, every field element can be represented as a bit string $(a_0 a_1 a_2 \ldots a_{m-1})$. Generally, two kinds of bases are used to represent binary finite fields: polynomial basis and normal basis.

2.5.4 Polynomial basis representation of F2m

Let $f(x) = x^m + f_{m-1}x^{m-1} + \ldots + f_2x^2 + f_1x + f_0$, where $f_i \in \{0, 1\}$, $0 \le i < m$, be an irreducible polynomial of degree m over $F_2$. f(x) is called the reduction polynomial of $F_{2^m}$.

The finite field $F_{2^m}$ comprises all polynomials over $F_2$ of degree less than m, i.e.:

$F_{2^m} = \{a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \ldots + a_2x^2 + a_1x + a_0 : a_i \in \{0, 1\}\}$.

The field element $a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \ldots + a_2x^2 + a_1x + a_0$ is usually represented by the bit string $(a_{m-1}a_{m-2} \ldots a_2a_1a_0)$ of length m, such that

$F_{2^m} = \{(a_{m-1}a_{m-2} \ldots a_2a_1a_0) : a_i \in \{0, 1\}\}$.

Thus, the elements of $F_{2^m}$ can be represented by the set of all binary strings of length m. The multiplicative identity 1 is represented by the bit string (00…001) and the bit string of all zeroes represents the additive identity 0.

The following operations are defined on the elements of $F_{2^m}$ when using f(x) as the reduction polynomial.

Addition: If $a = (a_{m-1}a_{m-2} \ldots a_2a_1a_0)$ and $b = (b_{m-1}b_{m-2} \ldots b_2b_1b_0)$ are elements of $F_{2^m}$, then $c = a + b = (c_{m-1}c_{m-2} \ldots c_2c_1c_0)$, where $c_i = (a_i + b_i) \bmod 2 = a_i \oplus b_i$.

Multiplication: If $a = (a_{m-1}a_{m-2} \ldots a_2a_1a_0)$ and $b = (b_{m-1}b_{m-2} \ldots b_2b_1b_0)$ are elements of $F_{2^m}$, then $c = a \cdot b = (c_{m-1}c_{m-2} \ldots c_2c_1c_0)$, where the polynomial $c_{m-1}x^{m-1} + c_{m-2}x^{m-2} + \ldots + c_2x^2 + c_1x + c_0$ is the remainder when the polynomial $(a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \ldots + a_1x + a_0)(b_{m-1}x^{m-1} + b_{m-2}x^{m-2} + \ldots + b_1x + b_0)$ is divided by f(x) over $F_2$.

Inversion: If a is a nonzero element in $F_{2^m}$, then the inverse of a, denoted $a^{-1}$, is the unique element $c \in F_{2^m}$ for which a.c = c.a = 1.
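The three operations above can be sketched in Python by storing a field element as an integer whose bit i holds the coefficient $a_i$. The field size m = 4 and the reduction polynomial $f(x) = x^4 + x + 1$ are illustrative assumptions chosen to keep the example small; inversion is done here by exponentiation, which is only one of several possible methods.

```python
# Sketch of polynomial-basis arithmetic in GF(2^m).
# M and F_POLY are illustrative assumptions, not parameters of this work.
M = 4
F_POLY = 0b10011  # f(x) = x^4 + x + 1, irreducible over F_2

def gf_add(a, b):
    """Addition is coefficient-wise XOR."""
    return a ^ b

def gf_mul(a, b, m=M, f=F_POLY):
    """Shift-and-add multiplication with reduction modulo f(x)."""
    c = 0
    while b:
        if b & 1:            # bit i of b set: accumulate a * x^i
            c ^= a
        b >>= 1
        a <<= 1              # a := a * x
        if (a >> m) & 1:     # degree reached m: reduce by f(x)
            a ^= f
    return c

def gf_inv(a, m=M, f=F_POLY):
    """Inverse as a^(2^m - 2), since a^(2^m - 1) = 1 for a != 0."""
    if a == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    result, base, e = 1, a, (1 << m) - 2
    while e:
        if e & 1:
            result = gf_mul(result, base, m, f)
        base = gf_mul(base, base, m, f)
        e >>= 1
    return result

if __name__ == "__main__":
    print(bin(gf_add(0b1010, 0b0110)))   # XOR of the two bit strings
    print(bin(gf_mul(0b0010, 0b1000)))   # x * x^3 = x^4 = x + 1 (mod f)
```

The addition needs no carry chain and the multiplier is a sequence of shifts and XORs, which is why this representation maps well onto hardware.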

2.5.5 Normal basis representation of F2m

A normal basis of $F_{2^m}$ over $F_2$ is a basis of the form $\{\beta, \beta^2, \beta^{2^2}, \ldots, \beta^{2^{m-1}}\}$, where $\beta \in F_{2^m}$.

Any element $a \in F_{2^m}$ can be written as $a = \sum_{i=0}^{m-1} a_i \beta^{2^i}$, where $a_i \in \{0, 1\}$.

Gaussian Normal Bases (GNB): A GNB representation of $F_{2^m}$ exists if there exists a positive integer T such that $p = Tm + 1$ is prime and $\gcd(Tm/k, m) = 1$, where k is the multiplicative order of 2 modulo p. The GNB representation is called a “type T GNB for $F_{2^m}$”.

The following operations are defined over $F_{2^m}$ when using a type T GNB representation.

Addition: If $a = (a_{m-1}a_{m-2} \ldots a_2a_1a_0)$ and $b = (b_{m-1}b_{m-2} \ldots b_2b_1b_0)$ are elements of $F_{2^m}$, then $c = a + b = (c_{m-1}c_{m-2} \ldots c_2c_1c_0)$, where $c_i = (a_i + b_i) \bmod 2 = a_i \oplus b_i$.


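Squaring: Let $a = (a_{m-1}a_{m-2} \ldots a_2a_1a_0) \in F_{2^m}$. Squaring is a linear operation in $F_{2^m}$. Hence

$a^2 = \left(\sum_{i=0}^{m-1} a_i \beta^{2^i}\right)^2 = \sum_{i=0}^{m-1} a_i \beta^{2^{i+1}} = \sum_{i=0}^{m-1} a_{i-1} \beta^{2^i} = (a_{m-2} a_{m-3} \ldots a_0 a_{m-1})$,

with indices reduced modulo m. Hence squaring a field element is simply a rotation of the vector representation.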

Multiplication: Let $p = Tm + 1$ and let $u \in F_p$ be an element of order T. Let us define a sequence F(1), F(2), …, F(p − 1) by $F(2^i u^j \bmod p) = i$, for $0 \le i < m$, $0 \le j < T$.

If $a = (a_{m-1}a_{m-2} \ldots a_2a_1a_0)$ and $b = (b_{m-1}b_{m-2} \ldots b_2b_1b_0)$ are elements of $F_{2^m}$, then the product $c = a \cdot b = (c_{m-1}c_{m-2} \ldots c_2c_1c_0)$, where

$c_i = \sum_{k=1}^{p-2} a_{F(k+1)+i}\, b_{F(p-k)+i}$, if T is even,

$c_i = \sum_{k=1}^{m/2} \left(a_{k-1+i}\, b_{m/2+k-1+i} + a_{m/2+k-1+i}\, b_{k-1+i}\right) + \sum_{k=1}^{p-2} a_{F(k+1)+i}\, b_{F(p-k)+i}$, if T is odd,

for each i, $0 \le i < m$, where indices are reduced modulo m.

Inversion: If a is a nonzero element in $F_{2^m}$, then the inverse of a, denoted $a^{-1}$, is the unique element $c \in F_{2^m}$ for which a.c = c.a = 1.

2.6 Elliptic Curves over Fp

An elliptic curve E(Fp) over a finite field $F_p$ is defined by the parameters $a, b \in F_p$ (where a and b satisfy the relation $4a^3 + 27b^2 \neq 0$) and consists of the set of points (x, y), with $x, y \in F_p$, satisfying the equation $y^2 = x^3 + ax + b$. The set of points on E(Fp) also includes the point O, which is the point at infinity and the identity element under addition.

The Addition operator is defined over E(Fp) and it can be seen that E(Fp) forms an abelian

group under addition.

The addition operation in E(Fp) is specified as follows.

P + O = O + P = P, $\forall P \in E(F_p)$

If $P = (x, y) \in E(F_p)$, then (x, y) + (x, −y) = O. (The point $(x, -y) \in E(F_p)$ is called the negative of P and is denoted −P.)

If $P = (x_1, y_1) \in E(F_p)$ and $Q = (x_2, y_2) \in E(F_p)$ with $P \neq \pm Q$, then $R = P + Q = (x_3, y_3) \in E(F_p)$, where $x_3 = \lambda^2 - x_1 - x_2$, $y_3 = \lambda(x_1 - x_3) - y_1$, and $\lambda = (y_2 - y_1)/(x_2 - x_1)$. Geometrically, the straight line passing through both points meets E(Fp) in a third point, and the sum is the reflection of that point about the x-axis.

Let $P = (x, y) \in E(F_p)$ with $y \neq 0$. Then the point $Q = P + P = 2P = (x_1, y_1) \in E(F_p)$, where $x_1 = \lambda^2 - 2x$, $y_1 = \lambda(x - x_1) - y$, and $\lambda = (3x^2 + a)/2y$. This operation is also called doubling of a point. Geometrically, the tangent at P meets the elliptic curve in one further point, and 2P is the reflection of that point about the x-axis.

Note that addition over E(Fp) requires one inversion, two multiplications, one squaring and six additions. Similarly, doubling a point on E(Fp) requires one inversion, two multiplications, two squarings and eight additions.
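The addition and doubling rules for E(Fp) can be sketched as follows. The toy curve $y^2 = x^3 + 2x + 2$ over $F_{17}$ and the point G = (5, 1) are illustrative assumptions (a small textbook-sized example, far too small for cryptographic use); `None` stands for the point at infinity O, and `pow(x, -1, p)` assumes Python 3.8 or later.

```python
# Sketch of the affine group law on y^2 = x^3 + ax + b over F_p.
# P_MOD, A, B and G are illustrative assumptions, not parameters of this work.
P_MOD, A, B = 17, 2, 2
G = (5, 1)

def on_curve(P):
    """Check the curve equation; O (None) is on the curve by definition."""
    if P is None:
        return True
    x, y = P
    return (y * y - (x ** 3 + A * x + B)) % P_MOD == 0

def ec_add(P, Q):
    """Return P + Q, covering the identity, inverse, chord and tangent cases."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                   # P + (-P) = O
    if P == Q:                                        # doubling: tangent slope
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:                                             # addition: chord slope
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

if __name__ == "__main__":
    two_g = ec_add(G, G)
    print(two_g, on_curve(two_g))
    print(ec_add(G, two_g))
```

Each call performs exactly one modular inversion, which is the costly step that projective coordinate systems are usually introduced to avoid.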

Consider the set E(Fp) under addition. We can see that

$\forall P, Q \in E(F_p)$, if R = P + Q, then $R \in E(F_p)$ (Closure)

P + (Q + R) = (P + Q) + R, $\forall P, Q, R \in E(F_p)$ (Associative)

$\exists O \in E(F_p)$ such that $\forall P \in E(F_p)$, P + O = O + P = P (Identity element)

$\forall P \in E(F_p)$, $\exists -P \in E(F_p)$ such that P + (−P) = (−P) + P = O (Inverse element)

$\forall P, Q \in E(F_p)$, P + Q = Q + P (Commutative)

Thus we see that E(Fp) forms an abelian group under addition.

2.7 Elliptic curves over F2m

An elliptic curve $E(F_{2^m})$ over a finite field $F_{2^m}$ is defined by the parameters $a, b \in F_{2^m}$ (with $b \neq 0$, which is what the non-singularity condition reduces to in characteristic two) and consists of the set of points (x, y), with $x, y \in F_{2^m}$, satisfying the equation $y^2 + xy = x^3 + ax^2 + b$. The set of points on $E(F_{2^m})$ also includes the point O, which is the point at infinity and the identity element under addition.

Similar to E(Fp), addition is defined over $E(F_{2^m})$, and we can similarly verify that $E(F_{2^m})$ forms an abelian group under addition.

The addition operation in $E(F_{2^m})$ is specified as follows.

P + O = O + P = P, $\forall P \in E(F_{2^m})$

If $P = (x, y) \in E(F_{2^m})$, then (x, y) + (x, x + y) = O. (The point $(x, x + y) \in E(F_{2^m})$ is called the negative of P and is denoted −P.)

If $P = (x_1, y_1) \in E(F_{2^m})$ and $Q = (x_2, y_2) \in E(F_{2^m})$ with $P \neq \pm Q$, then $R = P + Q = (x_3, y_3) \in E(F_{2^m})$, where $x_3 = \lambda^2 + \lambda + x_1 + x_2 + a$, $y_3 = \lambda(x_1 + x_3) + x_3 + y_1$, and $\lambda = (y_1 + y_2)/(x_1 + x_2)$. Geometrically, the straight line passing through both points meets $E(F_{2^m})$ in a third point, and the sum is the negative of that point.

Let $P = (x, y) \in E(F_{2^m})$ with $x \neq 0$. Then the point $Q = P + P = 2P = (x_1, y_1) \in E(F_{2^m})$, where $x_1 = \lambda^2 + \lambda + a$, $y_1 = \lambda(x + x_1) + x_1 + y$, and $\lambda = x + (y / x)$. This operation is also called doubling of a point; geometrically, the tangent at P meets the elliptic curve in one further point, and 2P is the negative of that point.

Note that addition over $E(F_{2^m})$ requires one inversion, two multiplications, one squaring and eight additions. Similarly, doubling a point on $E(F_{2^m})$ requires one inversion, two multiplications, one squaring and six additions.
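The binary-field group law can be checked on a toy example. The sketch below assumes the field GF(2^4) with reduction polynomial $x^4 + x + 1$ and arbitrarily chosen curve coefficients; all of these values, and the function names, are illustrative assumptions. Field elements are integers whose bits are polynomial coefficients, so field addition is XOR.

```python
# Sketch of the group law on y^2 + xy = x^3 + a*x^2 + b over GF(2^m).
# M, F_POLY, A and B are illustrative assumptions, not parameters of this work.
M, F_POLY = 4, 0b10011          # GF(2^4), f(x) = x^4 + x + 1
A, B = 0b1000, 0b1001           # curve coefficients, B != 0

def gf_mul(a, b):
    """Polynomial-basis multiplication modulo f(x)."""
    c = 0
    while b:
        if b & 1:
            c ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F_POLY
    return c

def gf_inv(a):
    """a^(2^m - 2) by repeated multiplication; adequate for a toy field."""
    r = 1
    for _ in range((1 << M) - 2):
        r = gf_mul(r, a)
    return r

def on_curve(P):
    if P is None:
        return True
    x, y = P
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(gf_mul(x, x), x) ^ gf_mul(A, gf_mul(x, x)) ^ B
    return lhs == rhs

def ec2_add(P, Q):
    """Return P + Q; None represents the point at infinity O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y2 == x1 ^ y1:                          # Q = -P (covers 2P with x = 0)
            return None
        lam = x1 ^ gf_mul(y1, gf_inv(x1))          # doubling: lambda = x + y/x
        x3 = gf_mul(lam, lam) ^ lam ^ A
    else:
        lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))     # addition: chord slope
        x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)

if __name__ == "__main__":
    pts = [(x, y) for x in range(16) for y in range(16) if on_curve((x, y))]
    print(len(pts) + 1, "points including O")
    print(all(on_curve(ec2_add(P, Q)) for P in pts for Q in pts))
```

Enumerating every affine point and confirming that all pairwise sums stay on the curve is a brute-force closure check that is only feasible because the field is tiny.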

Similar to E(Fp), consider addition on $E(F_{2^m})$:

$\forall P, Q \in E(F_{2^m})$, if R = P + Q, then $R \in E(F_{2^m})$ (Closure)

P + (Q + R) = (P + Q) + R, $\forall P, Q, R \in E(F_{2^m})$ (Associative)

$\exists O \in E(F_{2^m})$ such that $\forall P \in E(F_{2^m})$, P + O = O + P = P (Identity element)

$\forall P \in E(F_{2^m})$, $\exists -P \in E(F_{2^m})$ such that P + (−P) = (−P) + P = O (Inverse)

$\forall P, Q \in E(F_{2^m})$, P + Q = Q + P (Commutative)

Thus we see that $E(F_{2^m})$ forms an abelian group under addition.

Scalar Multiplication: Given an integer k and a point P on the elliptic curve, the elliptic scalar multiplication kP is the result of adding k copies of the point P together.

Order: The order of a point P on the elliptic curve is the smallest positive integer r such that rP = O. Further, if c and d are integers, then cP = dP iff $c \equiv d \pmod{r}$.

Curve Order: The number of points on the elliptic curve is called its curve order and is denoted #E.

2.8 Elliptic Curve Discrete Logarithm Problem

The strength of Elliptic Curve Cryptography lies in the Elliptic Curve Discrete Logarithm Problem (ECDLP). The statement of the ECDLP is as follows.

Let E be an elliptic curve and $P \in E$ be a point of order n. Given a point $Q \in E$ with Q = mP, for a certain $m \in \{2, 3, \ldots, n - 2\}$, find the m for which the above equation holds.

When E and P are properly chosen, the ECDLP is thought to be infeasible. Note that for m = 0, 1 and n − 1, Q takes the values O, P and −P respectively. One of the conditions is that the order of P, i.e. n, be large, so that it is infeasible to check all the possibilities of m.

The difference between the ECDLP and the Discrete Logarithm Problem (DLP) in the multiplicative group of a finite field is that the DLP, though a hard problem, is known to have a sub-exponential time solution, whereas no sub-exponential algorithm is known for the ECDLP on properly chosen curves. The DLP can therefore be solved faster than an ECDLP of comparable size. This property makes elliptic curves favorable for use in cryptography.

2.9 Application of Elliptic Curves in Key Exchange

2.9.1 Elliptic Curve Cryptography (ECC) domain parameters

Public key cryptographic systems based on elliptic curves involve arithmetic operations on an elliptic curve over a finite field, which is determined by the elliptic curve domain parameters.

The ECC domain parameters over Fq are defined by the septuple given below

D = (q, FR, a, b, G, n, h), where

q: prime power, that is q = p or $q = 2^m$, where p is a prime

FR: field representation, i.e. the method used for representing the elements of Fq

a, b: field elements; they specify the equation of the elliptic curve E over Fq, $y^2 = x^3 + ax + b$ for q = p (or $y^2 + xy = x^3 + ax^2 + b$ for $q = 2^m$)

G: a base point G = (xg, yg) on E(Fq)

n: order of the point G, that is, n is the smallest positive integer such that nG = O

h: cofactor, equal to the ratio #E(Fq)/n, where #E(Fq) is the curve order

The security of ECC depends primarily on the parameter n; therefore the length of an ECC key is taken to be the bit length of n. For a comparable key length, ECC offers much higher security than other public key cryptosystems. Equivalently, for the same level of security, an ECC key is much shorter than the keys of other cryptosystems.

2.9.2 Elliptic Curve protocols

Generally in the process of encryption and decryption, we have 2 entities, the one at the

encryption side and the other at the decryption side. Let us assume that Alice is the person

who is encrypting and Bob is the person decrypting.

Key generation: Alice’s (or Bob’s) public and private keys are associated with a particular set of elliptic curve domain parameters (q, FR, a, b, G, n, h).

Alice generates the public and private keys as follows

1. Select a random number d, $d \in [1, n - 1]$

2. Compute Q = dG.

3. Alice’s public key is Q and private key is d.

It should be noted that the public key generated needs to be validated to ensure that it satisfies the arithmetic requirements of an elliptic curve public key. A public key Q = (xq, yq) associated with the domain parameters (q, FR, a, b, G, n, h) is validated using the following procedure

1. Check that $Q \neq O$

2. Check that xq and yq are properly represented elements of Fq

3. Check that Q lies on the elliptic curve defined by a and b.

4. Check that nQ = O
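The key generation and validation steps above can be sketched as follows. The domain parameters are a toy example assumed purely for illustration ($y^2 = x^3 + 2x + 2$ over $F_{17}$, base point G = (5, 1) of order n = 19, cofactor h = 1) and offer no security; the helper names are hypothetical, and `pow(x, -1, p)` assumes Python 3.8 or later.

```python
import secrets

# Toy domain parameters (illustrative assumptions, not those of this work).
P_MOD, A, B = 17, 2, 2
G, N = (5, 1), 19

def ec_add(P, Q):
    """Affine group law on E(F_p); None represents the point at infinity O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ec_mul(k, P):
    """Left-to-right double-and-add; returns kP."""
    R = None
    for bit in bin(k)[2:]:
        R = ec_add(R, R)
        if bit == "1":
            R = ec_add(R, P)
    return R

def generate_keypair():
    d = 1 + secrets.randbelow(N - 1)       # step 1: d in [1, n - 1]
    return d, ec_mul(d, G)                 # steps 2 and 3: Q = dG

def validate_public_key(Q):
    if Q is None:                                        # 1. Q != O
        return False
    x, y = Q
    if not (0 <= x < P_MOD and 0 <= y < P_MOD):          # 2. coordinates in F_q
        return False
    if (y * y - (x ** 3 + A * x + B)) % P_MOD != 0:      # 3. Q lies on the curve
        return False
    return ec_mul(N, Q) is None                          # 4. nQ = O

if __name__ == "__main__":
    d, Q = generate_keypair()
    print(d, Q, validate_public_key(Q))
```

The fourth check costs a full scalar multiplication, so it dominates the validation time for realistic key sizes.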

2.9.3 Elliptic Curve Digital Signature Algorithm (ECDSA)

Alice, with domain parameters D = (q, FR, a, b, G, n, h), public key Q and private key d, performs the following steps to sign the message m

Step 1: Select a random number $k \in [1, n - 1]$

Step 2: Compute the point kG = (x, y) and r = x mod n; if r = 0 then go to Step 1

Step 3: Compute $t = k^{-1} \bmod n$

Step 4: Compute e = SHA-1(m), where SHA-1 denotes the 160-bit hash function

Step 5: Compute $s = k^{-1}(e + d \cdot r) \bmod n$; if s = 0 go to Step 1

Step 6: The signature of message m is the pair (r, s)

Step 7: Alice sends Bob the message m and her signature (r, s)

To verify Alice’s signature, Bob does the following (note that Bob knows the domain parameters D and Alice’s public key Q)

Step 1: Verify that r and s are integers in the range [1, n − 1]

Step 2: Compute e = SHA-1(m)

Step 3: Compute $w = s^{-1} \bmod n$

Step 4: Compute $u_1 = e \cdot w \bmod n$ and $u_2 = r \cdot w \bmod n$

Step 5: Compute the point $X = (x_1, y_1) = u_1G + u_2Q$

Step 6: If X = O, then reject the signature; else compute $v = x_1 \bmod n$

Step 7: Accept Alice’s signature iff v = r
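The signing and verification steps can be sketched in Python as shown below. The toy curve ($y^2 = x^3 + 2x + 2$ over $F_{17}$, G = (5, 1), n = 19), the private key value and the helper names are illustrative assumptions; with such a small n the signature offers no security, and the 160-bit hash is simply reduced modulo n. `pow(x, -1, n)` assumes Python 3.8 or later.

```python
import hashlib
import secrets

# Toy domain parameters (illustrative assumptions, not those of this work).
P_MOD, A = 17, 2
G, N = (5, 1), 19

def ec_add(P, Q):
    """Affine group law on E(F_p); None represents the point at infinity O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ec_mul(k, P):
    """Double-and-add scalar multiplication kP."""
    R = None
    for bit in bin(k)[2:]:
        R = ec_add(R, R)
        if bit == "1":
            R = ec_add(R, P)
    return R

def sha1_int(msg):
    return int.from_bytes(hashlib.sha1(msg).digest(), "big")

def sign(msg, d):
    e = sha1_int(msg)                            # Step 4
    while True:
        k = 1 + secrets.randbelow(N - 1)         # Step 1
        x, _ = ec_mul(k, G)                      # Step 2
        r = x % N
        if r == 0:
            continue
        t = pow(k, -1, N)                        # Step 3
        s = t * (e + d * r) % N                  # Step 5
        if s != 0:
            return r, s                          # Step 6

def verify(msg, sig, Q):
    r, s = sig
    if not (1 <= r < N and 1 <= s < N):          # Step 1
        return False
    e = sha1_int(msg)                            # Step 2
    w = pow(s, -1, N)                            # Step 3
    u1, u2 = e * w % N, r * w % N                # Step 4
    X = ec_add(ec_mul(u1, G), ec_mul(u2, Q))     # Step 5
    if X is None:                                # Step 6
        return False
    return X[0] % N == r                         # Step 7

if __name__ == "__main__":
    d = 7                                        # hypothetical private key
    Q = ec_mul(d, G)
    sig = sign(b"hello", d)
    print(sig, verify(b"hello", sig, Q))
```

Signing needs one scalar multiplication and verification needs two, so the scalar multiplier is the component whose speed determines the throughput of both operations.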


[Figure 2.2: flowchart of ECDSA. Alice generates k, computes P = kG = (x, y) and r = x mod n (repeating if r = 0), computes e = SHA-1(m) and s = k^{-1}(e + d·r) mod n (repeating if s = 0), and sends m with the signature (r, s). Bob verifies that r and s are integers in the range [1, n − 1], computes e = SHA-1(m), w = s^{-1} mod n, u1 = e.w, u2 = r.w and X = (x1, y1) = u1G + u2Q, rejects the signature if X = O, and accepts it if v = r.]

Figure 2.2 Illustration of Elliptic Curve Digital Signature Algorithm

Proof for verification

If the message is indeed signed by Alice, then $s = k^{-1}(e + d \cdot r) \bmod n$.

That is, $k = s^{-1}(e + d \cdot r) = s^{-1}e + s^{-1}d \cdot r = w \cdot e + w \cdot d \cdot r = (u_1 + u_2 \cdot d) \bmod n$ ……[1]

Now consider $u_1G + u_2Q = u_1G + u_2dG = (u_1 + u_2 \cdot d)G = kG$, from [1].

In Step 6 of the verification process, we have $v = x_1 \bmod n$, where the point $X = (x_1, y_1) = u_1G + u_2Q$. Thus we see that v = r, since r = x mod n, x is the x-coordinate of the point kG, and we have already seen that $u_1G + u_2Q = kG$.

2.9.4 Elliptic Curve Authentication Encryption Scheme (ECAES)

Alice has the domain parameters D = (q, FR, a, b, G, n, h) and public key Q. Bob has the domain parameters D. Bob’s public key is QB and private key is dB. The ECAES mechanism is as follows.

Alice performs the following steps

Step 1: Selects a random integer r in [1, n − 1]

Step 2: Computes R = rG

Step 3: Computes K = h.r.QB = (Kx, Ky) and checks that $K \neq O$

Step 4: Computes keys k1||k2 = KDF(Kx), where KDF is a key derivation function, which derives cryptographic keys from a shared secret

Step 5: Computes $c = ENC_{k_1}(m)$, where m is the message to be sent and ENC is a symmetric encryption algorithm

Step 6: Computes $t = MAC_{k_2}(c)$, where MAC is a message authentication code

Step 7: Sends (R, c, t) to Bob

To decrypt a ciphertext, Bob performs the following steps

Step 1: Performs a partial key validation on R (checks that $R \neq O$, that the coordinates of R are properly represented elements of Fq, and that R lies on the elliptic curve defined by a and b)

Step 2: Computes KB = h.dB.R = (Kx, Ky) and checks that $K_B \neq O$

Step 3: Computes k1||k2 = KDF(Kx)

Step 4: Verifies that $t = MAC_{k_2}(c)$

Step 5: Computes $m = ENC_{k_1}^{-1}(c)$

We can see that K = KB, since K = h.r.QB = h.r.dB.G = h.dB.r.G = h.dB.R = KB
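The encryption and decryption steps can be sketched end to end as follows. Everything concrete here is an assumption made for illustration: the toy curve ($y^2 = x^3 + 2x + 2$ over $F_{17}$, G = (5, 1), n = 19, h = 1), SHA-256 as the KDF, HMAC-SHA-256 as the MAC, and a hash-based XOR keystream standing in for the symmetric cipher ENC. None of these choices is prescribed by the scheme description above, and `pow(x, -1, p)` assumes Python 3.8 or later.

```python
import hashlib
import hmac
import secrets

# Toy domain parameters (illustrative assumptions, not those of this work).
P_MOD, A, B = 17, 2, 2
G, N, H = (5, 1), 19, 1

def ec_add(P, Q):
    """Affine group law on E(F_p); None represents the point at infinity O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ec_mul(k, P):
    R = None
    for bit in bin(k)[2:]:
        R = ec_add(R, R)
        if bit == "1":
            R = ec_add(R, P)
    return R

def kdf(kx):
    """Assumed KDF: split SHA-256(Kx) into k1 || k2."""
    digest = hashlib.sha256(kx.to_bytes(32, "big")).digest()
    return digest[:16], digest[16:]

def xor_stream(key, data):
    """Stand-in for ENC: XOR with a SHA-256 counter keystream (self-inverse)."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + (i // 32).to_bytes(4, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)

def encrypt(m, QB):
    r = 1 + secrets.randbelow(N - 1)                       # Step 1
    R = ec_mul(r, G)                                       # Step 2
    K = ec_mul(H * r, QB)                                  # Step 3
    if K is None:
        raise ValueError("K is the point at infinity")
    k1, k2 = kdf(K[0])                                     # Step 4
    c = xor_stream(k1, m)                                  # Step 5
    t = hmac.new(k2, c, hashlib.sha256).digest()           # Step 6
    return R, c, t                                         # Step 7

def decrypt(R, c, t, dB):
    if R is None or (R[1] ** 2 - (R[0] ** 3 + A * R[0] + B)) % P_MOD != 0:
        raise ValueError("partial key validation failed")  # Step 1
    K = ec_mul(H * dB, R)                                  # Step 2
    if K is None:
        raise ValueError("K is the point at infinity")
    k1, k2 = kdf(K[0])                                     # Step 3
    expected = hmac.new(k2, c, hashlib.sha256).digest()
    if not hmac.compare_digest(t, expected):               # Step 4
        raise ValueError("MAC check failed")
    return xor_stream(k1, c)                               # Step 5

if __name__ == "__main__":
    dB = 5                                                 # hypothetical private key
    QB = ec_mul(dB, G)
    R, c, t = encrypt(b"attack at dawn", QB)
    print(decrypt(R, c, t, dB))
```

Only the two scalar multiplications on each side involve the elliptic curve; the bulk data is handled by the symmetric primitives keyed from the shared x-coordinate.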


[Figure 2.3: flowchart of ECAES. Alice generates a random integer r in [1, n − 1], computes R = rG, K = hrQB = (Kx, Ky), k1||k2 = KDF(Kx), c = ENCk1(m) and t = MACk2(c), and sends (R, c, t). Bob performs partial key validation on R, computes KB = h.dB.R = (Kx, Ky) and k1||k2 = KDF(Kx), verifies that t = MACk2(c), and computes m = ENCk1^{-1}(c), where m is the decrypted plaintext message.]

Figure 2.3 Illustration of Elliptic Curve Authentication Encryption Scheme

2.10 Algorithms for Elliptic Scalar Multiplication

In all the protocols that were discussed (ECDH, ECDSA, ECAES), the most time-consuming part of the computation is the scalar multiplication, that is, the calculation of the form

Q = kP = P + P + … + P (k times)

Here P is a curve point and k is an integer in the range [1, n − 1], where n is the order of P. P is either a fixed point that generates a large, prime-order subgroup of E(Fq), or an arbitrary point in such a subgroup. Elliptic curves have some properties that allow optimization of scalar multiplications. The following sections describe some efficient algorithms for computing kP.

2.11 Hierarchy of Elliptic Curve Cryptography

Elliptic curve cryptosystems have a layered hierarchy, as shown in Figure 2.4. The bottom layer, which constitutes the arithmetic on the underlying finite field, most prominently influences the area and critical delay of the overall implementation.

Figure 2.4 Hierarchy of Elliptic Curve Cryptography

2.11.1 Point Multiplication

Scalar point multiplication is a building block of all elliptic curve cryptosystems. It is an operation of the form k.P, where ‘P’ is a point on the elliptic curve and ‘k’ is a positive integer. Computing k.P means adding the point ‘P’ to itself k − 1 times, which results in another point ‘Q’ on the elliptic curve. Point multiplication uses two basic elliptic curve operations:

1- Point addition (adding two points to find another point)

2- Point doubling (adding a point P to itself to find another point)

For example, to calculate kP = Q with k = 23, we write kP = 23P = 2(2(2(2P) + P) + P) + P, so point doubling and point addition are used repeatedly to obtain the result (Tata, 2007).
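The 23P example corresponds to scanning the binary expansion of k (23 = 10111 in binary) from the most significant bit, doubling at every step and adding P whenever the bit is 1. The sketch below records the operations performed; the toy curve ($y^2 = x^3 + 2x + 2$ over $F_{17}$, P = (5, 1)) and the function names are illustrative assumptions, and `pow(x, -1, p)` assumes Python 3.8 or later.

```python
# Sketch of left-to-right double-and-add on a toy curve over F_17.
# P_MOD, A and BASE are illustrative assumptions, not parameters of this work.
P_MOD, A = 17, 2
BASE = (5, 1)

def ec_add(P, Q):
    """Affine group law on E(F_p); None represents the point at infinity O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def double_and_add(k, P):
    """Return (kP, ops), where ops lists 'D' (doubling) and 'A' (addition)."""
    bits = bin(k)[2:]
    R, ops = P, []                 # the leading 1 bit initialises R = P
    for bit in bits[1:]:
        R = ec_add(R, R)           # always double
        ops.append("D")
        if bit == "1":
            R = ec_add(R, P)       # add P when the bit is set
            ops.append("A")
    return R, ops

if __name__ == "__main__":
    Q, ops = double_and_add(23, BASE)
    print(Q, "".join(ops))         # 4 doublings and 3 additions for k = 23
```

For an l-bit scalar this needs l − 1 doublings and, on average, about half as many additions, compared with k − 1 additions for the naive method.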

2.11.2 Point Addition

Suppose that P and Q are two distinct points on an elliptic curve, and that P is not −Q. To add the points P and Q, a line is drawn through the two points. This line will intersect the elliptic curve in exactly one more point, called −R. The point −R is reflected in the x-axis to the point R. The law for addition in an elliptic curve group is P + Q = R. For example:

[Figure 2.4 layers, from top to bottom: EC Primitives and Protocols; Scalar Multiplication (Karatsuba, Montgomery, ...); Elliptic Curve Group Operations (Point Addition and Point Doubling); Finite Field Operations (Addition, Multiplication, Inversion and Squaring)]

Figure 2.5 Point addition

2.11.3 Point Doubling

To add a point P to itself, a tangent line to the curve is drawn at the point P. If $y_P$ is not 0, then the tangent line intersects the elliptic curve at exactly one other point, −R. The point −R is reflected in the x-axis to R. This operation is called doubling the point P; the law for doubling a point on an elliptic curve group is defined by: P + P = 2P = R

Figure 2.6 Point Doubling

2.12 Hardware Accelerator

General purpose processors are not optimized for cryptographic arithmetic [4]. They also cannot provide the amount of parallelism that is required to compute the field arithmetic in scalar multiplication, on which elliptic curve based cryptographic systems depend. This results in degraded performance when compared to a hardware implementation. It is, therefore, important to use a hardware implementation to avoid such drawbacks. This can be done using two different hardware technologies:

I. Application Specific Integrated Circuits (ASICs)

II. Field Programmable Gate Arrays (FPGAs)

ASICs are typically used when a design is to be produced in mass or when performance

is of the utmost importance. FPGAs, on the other hand, lend themselves nicely to research

work where a design is being prototyped. The following attributes of the FPGA design flow

are particularly advantageous.

a. Relatively small initial setup cost: A single FPGA is inexpensive when compared to

the manufacturing cost of an ASIC design.

b. Simplified implementation flow: In most cases, the FPGA vendor will provide a fully

integrated tool flow. This flow will have been fully tested for compatibility with the

FPGA and as a result fewer tool related problems can be expected.

c. Fast turnaround time: An FPGA can be programmed in less than a minute and can also

be reprogrammed many times. An ASIC on the other hand may take months to

fabricate.

d. Simplified integration: Whether using an ASIC or FPGA design flow, the design must be integrated into a hardware/software system. It is common for FPGAs to be sold within such a system, minimizing the integration task required of the designer.

FPGAs are reconfigurable devices offering parallelism and flexibility on one hand while being

low cost and easy to use on the other. Moreover, they have much shorter design cycle times

compared to ASICs. FPGAs were initially used as prototyping devices and in high

performance scientific applications, but the short time-to-market and on-site reconfigurability

features have expanded their application space.

These devices can now be found in various consumer electronic devices, high performance networking applications, medical electronics and space applications. The reconfigurability of FPGAs also makes them suited for cryptographic applications. Reconfigurability results in flexible implementations, allowing operating modes, encryption algorithms, curve constants, etc. to be configured. FPGAs do not require sophisticated equipment for production; they can be programmed in-house. This is beneficial for cryptography, as no untrusted party is involved in the production cycle (Chester Rebeiro 2008).

2.13 FPGA Architecture

There are two main parts of the FPGA chip: the input/output (I/O) blocks and the core. The

I/O blocks are located around the periphery of the chip and are used to provide programmable

connectivity to the chip. The core of the chip consists of programmable logic blocks and

programmable routing architectures.

A popular architecture for the core, called island-style architecture, is shown in Fig 2.7 below. Logic blocks, also called configurable logic blocks (CLBs), consist of logic circuitry for implementing logic functions. Each CLB is surrounded by routing channels connected through switch blocks and connection blocks. A switch block connects wires in adjacent channels through programmable switches.

Fig 2.7 FPGA Architecture

Logic blocks and interconnects can be programmed by the designer, after the FPGA is manufactured, to implement any logical function, hence the name “field-programmable”. FPGAs are usually slower than their application-specific integrated circuit (ASIC) counterparts, cannot handle as complex a design, and draw more power (for any given semiconductor process). Their advantages, however, include a shorter time to market, the ability to be re-programmed in the field to fix bugs, and lower non-recurring engineering costs.

Through the years, FPGA features have improved and their density has grown. Current FPGAs have embedded processors, gigabit serial transceivers, clock managers, analog-to-digital converters, dedicated digital signal processing blocks, Ethernet controllers, substantial memory capacity, and other dedicated functional blocks beyond the basic arrays of simple logic elements they started out with in the mid-1980s. The current high density of FPGAs allows complete systems (System-on-Chip or SoC) to be implemented on them.

In addition, the reconfiguration capacity of FPGAs has increased. The main advantage of these devices, and the design opportunities they offer, reside in the way the reconfiguration is performed. FPGA reconfiguration is based on SRAM (Static Random Access Memory) technology. The configuration of the device is determined by data stored in the configuration memory. This content determines the interconnection among the configurable blocks and the function these blocks perform. Usually, the configuration memory stores just one configuration (one-context), but some devices can store more than one (multi-context). SRAM is volatile, so the FPGA must normally be configured from an external nonvolatile memory each time it is powered up.

Figure 2.8 Inter Architecture of FPGA

2.13.1 Look-Up Table

The way logic functions are implemented in an FPGA is another key feature. The logic blocks that carry out logical functions are look-up tables (LUTs), implemented as memory, or as multiplexers and memory. Figure 2.8 shows the internal architecture of common FPGAs, together with the internal components used for some basic operations.

2.13.2 Configurable Logic Blocks (CLBs)

The basic building block of Xilinx CLBs is the slice. Virtex and Spartan II hold two slices in one CLB, while Virtex II and Spartan III hold four slices per CLB. Each slice contains two 4-input function generators (F/G), carry logic, and two storage elements. Each function generator output drives both the CLB output and the D-input of a flip-flop. Besides the four basic function generators, the Virtex/Spartan II CLB contains logic that combines function generators to provide functions of five or six inputs. The look-up tables and storage elements of the CLB have the following characteristics:

i. Look-Up Tables (LUTs): Xilinx function generators are implemented as 4-input look-

up tables. Beyond operating as a function generator, each LUT can be programmed as

a (16x1)-bit synchronous RAM. Furthermore, the two LUTs can be combined within

a slice to create a (16x2)-bit or (32x1)-bit synchronous RAM, or a (16x1)-bit dual-port

synchronous RAM. Finally, the LUT can also provide a 16-bit shift register, ideal for

capturing high-speed data.

ii. Storage Elements: The storage elements in a slice can be configured either as edge-

triggered D-type flip-flops or as level-sensitive latches. The D-inputs can be driven

either by the function generators within the slice or directly from the slice inputs,

bypassing the function generators. As well as clock and clock enable signals, each

slice has synchronous set and reset signals.
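To make the look-up table idea concrete, the following Python sketch models a 4-input LUT as a 16-bit configuration word that is addressed by the four inputs. This is a generic behavioral model written for illustration; the function names are hypothetical and it is not a Xilinx primitive or tool API.

```python
# Behavioral sketch of a 4-input LUT: 16 stored bits indexed by the inputs.
def make_lut4(func):
    """Build the 16-bit configuration word for a 4-input Boolean function."""
    init = 0
    for addr in range(16):
        i0, i1, i2, i3 = ((addr >> k) & 1 for k in range(4))
        if func(i0, i1, i2, i3):
            init |= 1 << addr        # store a 1 at this truth-table row
    return init

def lut4_read(init, i0, i1, i2, i3):
    """Evaluate the LUT: the inputs form the address of the stored bit."""
    addr = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3)
    return (init >> addr) & 1

if __name__ == "__main__":
    xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
    and4 = make_lut4(lambda a, b, c, d: a & b & c & d)
    print(hex(xor4), hex(and4))
    print(lut4_read(xor4, 1, 0, 1, 1))
```

A 4-input XOR, which is the addition of four elements of $F_2$, therefore occupies a single LUT regardless of how complex the function looks as a gate network.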

2.13.3 Input/Output Blocks (IOBs)

The Xilinx IOB includes inputs and outputs that support a wide variety of I/O signaling

standards. The IOB storage elements act either as D-type flip-flops or as latches. For each

flip-flop, the set/reset (SR) signals can be independently configured as synchronous set,

synchronous reset, asynchronous preset, or asynchronous clear. Pull-up and pull-down

resistors and an optional weak-keeper circuit can be attached to each pad. IOBs are

programmable and can be categorized as follows:

a. Input Path: A buffer in the IOB input path routes the input signal either directly to internal logic or through an optional input flip-flop.

b. Output Path: The output path includes a 3-state output buffer that drives the output

signal onto the pad. The output signal can be routed to the buffer directly from the

internal logic or through an optional IOB output flip-flop. The 3-state control of the

output can also be routed directly from the internal logic or through a flip-flop that

provides synchronous enable and disable signals.



2.13.4 RAM Blocks

Xilinx FPGAs incorporate several large RAM memories (block select RAM). These memory

blocks are organized in columns along the chip. The number of blocks, ranging from 8 up to

more than 100, depends on the device size and family. In Virtex/Spartan II, each block is a

fully synchronous dual-ported 4096-bit RAM, with independent control signals for each port.

The data width of the two ports can be configured independently. In Virtex II/Spartan III, each

block provides 18-kbit storage.

2.13.5 Programmable Routing

Adjacent to each CLB stands a general routing matrix (GRM). The GRM is a switch matrix

through which resources are connected; the GRM is also the means by which the CLB gains

access to the general-purpose routing. Horizontal and vertical routing resources for each row

or column include:

i. Long Lines: bidirectional wires that distribute signals across the device. Vertical and horizontal long lines span the full height and width of the device.

ii. Hex Lines: route signals to every third or sixth block away in all four directions.

iii. Double Lines: route signals to every first or second block away in all four directions.

iv. Direct Lines: route signals to neighboring blocks vertically, horizontally, and diagonally.

v. Fast Lines: internal CLB local interconnections from LUT outputs to LUT inputs.

The routing performance factor of internal signals is the longest delay path, which limits the speed of any worst-case design. Consequently, the Xilinx routing architecture and its place-and-route software were defined in a single optimization process. Xilinx devices provide high-speed, low-skew clock distribution. Virtex provides four primary global nets that drive any clock pin, whereas Virtex II has 16 global clock lines, eight per quadrant.



CHAPTER-3

METHODOLOGY

This chapter describes how the thesis work was defined, implemented, and verified, together with the choice of tools and supporting devices. Some remarks on the performance and effectiveness of the tools are also given.

3.1 The Design Cycle

The general design cycle for this work consisted of the following steps:

1. Studying the arithmetic functions.

2. Studying the elliptic curve constructs

3. HDL (VHDL) implementation of arithmetic functions

4. Commitment to a specific implementation of elliptic curve field representation.

5. Design of point multiplication elliptic curve engine.

6. Logic verification of the design.

7. Synthesis and logic optimization.

8. Device specific realization (place and route).

9. Back-annotated verification of the design.

The order of the steps outlined above is approximate. At some points in the project, steps had to be retraced to ensure an optimal or correct implementation. Since not all algorithms can be easily implemented in hardware, careful consideration of the implementation was necessary before committing to a specific option. The initial research into Galois field arithmetic operations and their hardware implementations yielded a few guidelines that aided the choice of Galois field representation and elliptic curve point representation. More specifically, the standard basis representation for Galois field arithmetic was chosen and composite architectures were mapped to reconfigurable devices. Furthermore, for the scalar multiplication scheme, a Karatsuba-based multiplication optimization technique was chosen in order to avoid the most complex operation. Thus, at the end of the initial research, a commitment was made to realize the elliptic curve operation with maximum optimization and the standard basis representation.

The next stage was the actual design of the digital system that realized the elliptic curve group operation. During this stage many revisions were made to better fit the design to a specific device. The XILINX FPGA XC6SLX45 family of devices was chosen as the target platform.

Figure 3.1 Design Flow Chart

Verification of the design was first performed at the logic level. This step assured correct functionality when all combinatorial and net delays were ignored. Once the design was verified logically, synthesis and optimization were performed. Timing constraints were set for each component, and several iterations were carried out until the constraints were met. The next step was to map, place and route the design onto the reconfigurable device. The choice of a specific device within the XILINX FPGA XC6SLX45 family depends on the area utilization report obtained through synthesis. Finally, the output of the place and route step was used to perform back-annotated simulation. This step verified correct operation with the net and combinatorial delays that resulted from the place and route process.



3.2 Tools

The results presented in this thesis were obtained by implementing the hardware designs in FPGA technology. The targeted FPGA is the Spartan-6 XC6SLX45 from Xilinx. The basic logic unit of the Spartan-6 device is the slice. Each slice (see figure 5.1) contains four 6-input LUTs and eight storage elements, together with embedded multiplexers; some slice types additionally contain carry logic. (www.Xilinx.com last viewed May 27, 2014:4:00PM)

Configurable Logic Blocks (CLBs) in Spartan-6 FPGAs are made up of two slices. The function generators are configurable as 6-input look-up tables (LUTs). In a subset of the slices, the LUTs can also be configured as 32-bit shift registers or as 64-bit distributed RAM. In addition, the storage elements are either edge-triggered D-type flip-flops or level-sensitive latches. Each CLB connects to a switch matrix to access the general routing resources.

The entire design, with the exception of vendor-specific soft macros, was entered in VHDL.
Once the design was developed in VHDL, Boolean logic and major timing errors were checked by
simulating the gate-level description with the ISim (VHDL/Verilog) simulator. The next step
involved synthesis of the VHDL code with XST (VHDL/Verilog) version 14.7. The output of
this step was an optimized netlist describing the gate-level design in the Xilinx ISE suite 14.7.

Figure 3.2 FPGA Board



(Flow diagram labels: VHDL Libraries, NGC Netlist, NGD Netlist, UCF)

3.2.1 Simulation and Verification

As previously stated, verification of the design is done at two points. First, it is applied to the
initial VHDL design. This verifies only the logic, without delays. The inputs to this verification
process are a test bench written in VHDL and the VHDL model of the design. The test bench is
used together with the VHDL design to simulate the design. Then the results from the simulation
are compared against results obtained from other published works in the same area. The post
place-and-route verification uses the same test bench (with a few modifications), but the VHDL
input model at this stage is different: here the VHDL model is obtained from the Xilinx place
and route tools.

3.2.2 Synthesis

XST (VHDL/Verilog) Version 14.7 synthesis tools have been used; the documentation that

accompanied these tools was quite extensive and very helpful. This and other literature helped in

developing script files that could be launched from within the FPGA analyzer. One advantage of

running this XST (VHDL/Verilog) Version 14.7 was that multiple jobs could be run concurrently

resulting in faster turnaround and more time to try different optimization options.


Figure 3.3 Synthesis Flow diagram



3.2.3 Place and Route

The place and route tools were run on the Xilinx implementation workstation. The input to the
place and route tools is a design netlist and a constraints file generated by XST
(VHDL/Verilog) version 14.7, as well as an optional user constraints file. The user constraints
have higher priority. In the Xilinx implementation they include additional constraints relaxing
the clock period or implementing pin assignments. The output of this process is a bit-stream
file that can be used to directly program the device, and the back-annotated design that can be
simulated for timing verification.

Figure 3.4: Place and Route Process



3.2.4 Loading the FPGA board

Once the programming files (.bit files) are generated, the next step is programming the device.
For this, our target device, the Spartan-6 XC6SLX45, is connected to a computer running the
Xilinx ISE Design Suite, from which we launch the “iMPACT” tool to initialize the FPGA USB port.
After the cable port is initialized, Karatsuba.bit, ECC_Eryptography.bit and the other related
programming files are loaded onto the FPGA. To display the output, a computer HyperTerminal
can be used.

Figure 3.5 Programming the FPGA board



CHAPTER-4

Design and Implementation

As discussed in section 2.11 of chapter two, the efficiency of Elliptic Curve Cryptography is
highly dependent on the construction of the computationally intensive operations of the lower
levels of the hierarchy, namely scalar multiplication, elliptic curve group operations and
finite field operations. Therefore this chapter illustrates the design of the three bottom layers
of Elliptic Curve Cryptography on FPGAs. Furthermore, the control, data, and processing units
will be introduced as the basic building blocks of the EC implementation.

Figure 4.1 Typical Hierarchy of Elliptic Curve Cryptography

(Levels, top to bottom: ECC Schemes; Point Scalar Multiplication; Elliptic Point Operations:
Point Addition, Point Doubling; Field Operations: Addition, Subtraction, Multiplication,
Squarer; Large Integer Arithmetic Operation: Addition Operation.)



Clearly, the finite field operations in Figure 2.4 can be designed into hardware. One possibility
for a hardware design is to accelerate finite field arithmetic only, and then use an off-the-shelf
microprocessor to perform the higher-level functions of elliptic curve point arithmetic. It is
important to note that an efficient finite field multiplier does not necessarily yield an efficient
point multiplier: all layers of the hierarchy in Figure 4.1 need to be optimized. This is because
the parallel execution of field operations that is possible at the curve operation level in
hardware is not possible if those operations are implemented in software.

Moving point addition and doubling, and then point multiplication, to hardware provides a more
efficient ECC processor at the expense of more complexity. In all cases a combination of both
efficient algorithms and hardware architectures is required. Our design focuses on all but the
protocol level of the elliptic curve cryptosystem.

The basic method for computing scalar multiplication, or point multiplication, is the well-known
“double-and-add” method discussed in the literature survey, which requires m point doublings
and m/2 point additions on average. The authors of [27] proposed a fast algorithm for point
multiplication over GF(2^m) without pre-computation, based on the Montgomery ladder method [18].
One advantage of using this algorithm is that fewer field multiplications are involved on
average than in the traditional method. Secondly, since projective instead of affine
coordinates are used, inversion is performed only at the coordinate transformation step. In
addition, its regular sequence of operations (one addition and one doubling per key bit) makes
it more resistant to simple side-channel attacks. Therefore, we adopt it for our scalar
multiplier [1].
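The regular structure of the Montgomery ladder can be sketched in software. The following is a minimal illustration only, not the thesis design: it assumes a toy curve y^2 = x^3 + 2x + 3 over the prime field F_97 in affine coordinates (the thesis targets GF(2^m) with projective coordinates), and the names `ec_add`, `montgomery_ladder` and `double_and_add` are hypothetical.

```python
# Toy short-Weierstrass curve y^2 = x^3 + A*x + B over F_P (assumed, illustrative values only).
P_MOD, A, B = 97, 2, 3
INF = None  # point at infinity

def ec_add(p, q):
    """Affine group law; handles infinity, inverse points and doubling."""
    if p is INF:
        return q
    if q is INF:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF                                  # P + (-P) = infinity
    if p == q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def montgomery_ladder(k, p, bits):
    """One addition and one doubling per key bit, whatever the bit value."""
    r0, r1 = INF, p                                 # invariant: r1 - r0 = p
    for i in reversed(range(bits)):
        if (k >> i) & 1:
            r0, r1 = ec_add(r0, r1), ec_add(r1, r1)
        else:
            r0, r1 = ec_add(r0, r0), ec_add(r0, r1)
    return r0

def double_and_add(k, p):
    """Reference method: the addition is executed only for 1 bits."""
    q = INF
    for bit in bin(k)[2:]:
        q = ec_add(q, q)
        if bit == "1":
            q = ec_add(q, p)
    return q
```

Both functions return the same point k·P; the ladder differs in that the amount of work per bit does not depend on the key bit.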

4.1 Karatsuba multiplier

Field multiplication is the most costly basic arithmetic operation in a finite field. For an
extension field of degree m, m^2 subfield multiplications are required to multiply two values
using traditional polynomial multiplication. It is shown in [12] [17] [24] that this can be
reduced drastically in certain cases. Using a method developed by Karatsuba and Ofman [11],
the number of multiplications can be reduced in exchange for an increased number of
additions. As long as the time ratio for executing a multiplication vs. an addition is high, this
tradeoff is more efficient.



A basic example of Karatsuba is given here to demonstrate its usefulness.
Given two degree-1 polynomials, A(x) and B(x), we can demonstrate the traditional and
the Karatsuba methods.

A(x) = a_1 x + a_0

B(x) = b_1 x + b_0

For the traditional method, we must calculate the product of each possible pair of
coefficients.

D_0 = a_0 b_0

D_1 = a_0 b_1

D_2 = a_1 b_0

D_3 = a_1 b_1

Now we can calculate the product C(x) = A(x) · B(x) as:

C(x) = D_3 x^2 + (D_2 + D_1) x + D_0

The Karatsuba method begins by taking the same two polynomials, and calculating the
following three products:

E_0 = a_0 b_0

E_1 = a_1 b_1

E_2 = (a_0 + a_1)(b_0 + b_1)

These are then used to assemble the result C(x) = A(x) · B(x):

C(x) = E_1 x^2 + (E_2 − E_1 − E_0) x + E_0 --- Equation 4.1

We can now look at how many operations are required for each method. The traditional
method requires four multiplications and one addition, while the Karatsuba method
requires three multiplications and four additions. Thus we have traded a single multiplication
for three additions. If the cost to multiply on the target platform is at least three times the
cost to add, then the method is effective. While this basic form of Karatsuba was presented
in the original paper, there are a number of ways this method may be expanded to handle
larger degree polynomials. This is shown in [9], where the authors give an in-depth study of
this method and its variations.
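The degree-1 example above can be checked numerically. The following is a small sketch using ordinary integer coefficients; the function names `schoolbook_deg1` and `karatsuba_deg1` are illustrative and not taken from the thesis. Coefficients are passed as pairs (a_0, a_1) and results are returned as triples (c_0, c_1, c_2).

```python
def schoolbook_deg1(a, b):
    """Traditional method: four multiplications, one addition."""
    a0, a1 = a
    b0, b1 = b
    d0, d1, d2, d3 = a0 * b0, a0 * b1, a1 * b0, a1 * b1
    return (d0, d2 + d1, d3)            # C(x) = D3 x^2 + (D2 + D1) x + D0

def karatsuba_deg1(a, b):
    """Karatsuba method: three multiplications, four additions/subtractions."""
    a0, a1 = a
    b0, b1 = b
    e0 = a0 * b0
    e1 = a1 * b1
    e2 = (a0 + a1) * (b0 + b1)
    return (e0, e2 - e1 - e0, e1)       # C(x) = E1 x^2 + (E2 - E1 - E0) x + E0

# (5x + 3)(2x + 7) = 10x^2 + 41x + 21 with either method
print(karatsuba_deg1((3, 5), (7, 2)))   # (21, 41, 10)
```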



In order to reduce the complexity of polynomial multiplication, the method of Karatsuba is
applied [12].

Classically, the coefficients of the product

(a_1 x + a_0)(b_1 x + b_0) = a_1 b_1 x^2 + (a_1 b_0 + a_0 b_1) x + a_0 b_0

are computed from the four input coefficients a_0, a_1, b_0, b_1 with four multipliers and one
addition, whereas the Karatsuba formula uses only 3 multipliers and 4 additions in binary fields:

(a_1 x + a_0)(b_1 x + b_0) = a_1 b_1 x^2 + ((a_1 ⊕ a_0)(b_1 ⊕ b_0) ⊕ a_1 b_1 ⊕ a_0 b_0) x + a_0 b_0 --- Equation 4.2

By applying the Karatsuba method to larger polynomials, the cost of the extra additions vanishes
compared to other multiplication schemes.
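Equation 4.2 applies recursively when the coefficients are themselves half-length polynomials. The sketch below is one way to express this in software, under assumptions that are not from the thesis: binary polynomials are stored as Python integers (bit i is the coefficient of x^i), the operand length n is a power of two, and the recursion falls back to a schoolbook product at 8 bits. The names `clmul` and `karatsuba_gf2` are hypothetical.

```python
def clmul(a, b):
    """Schoolbook carry-less product of two binary polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a          # addition in GF(2)[x] is XOR
        a <<= 1
        b >>= 1
    return r

def karatsuba_gf2(a, b, n):
    """Product of two binary polynomials of fewer than n bits (n a power of two)."""
    if n <= 8:
        return clmul(a, b)
    h = n // 2
    mask = (1 << h) - 1
    al, ah = a & mask, a >> h
    bl, bh = b & mask, b >> h
    lo = karatsuba_gf2(al, bl, h)                          # a_0 b_0
    hi = karatsuba_gf2(ah, bh, h)                          # a_1 b_1
    mid = karatsuba_gf2(al ^ ah, bl ^ bh, h) ^ lo ^ hi     # middle term of Equation 4.2
    return (hi << n) ^ (mid << h) ^ lo
```

Each level replaces four half-size products by three, at the cost of a few extra XOR operations, which are cheap in hardware.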

Algorithm 4.1 Karatsuba Multiplier

Input: Two elements A, B ∈ GF(2^m), with m an arbitrary number, where A and B can be
expressed as A = x^(m/2) A^H + A^L, B = x^(m/2) B^H + B^L

Output: A polynomial C = AB with up to 2m−1 coordinates, where C = x^m C^H + C^L

Procedure BK(C, A, B)
Begin
  k = ⌊log2 m⌋
  d = m − 2^k
  If (d == 0) then
    C = Kmul2^k(A, B)
    Return
  End if
  For i from 0 to d−1 do
    MA_i = A^L_i + A^H_i
    MB_i = B^L_i + B^H_i
  End for
  mul2^k(A^L, B^L, C^L)
  mul2^k(MA, MB, M)
  BK(C^H, A^H, B^H)
  For i from 2 to 2^k−2 do
    M_i = M_i + C^L_i + C^H_i
  End for
  For i from 2 to 2^k−2 do
    C_(k+i) = C_(k+i) + M_i
  End for
End

Figure 4.2: RTL Structure of Karatsuba Multiplier

4.2 Point Addition

The addition in the finite field GF(2^m) is very easy to compute. For the chosen field, the
addition of two elements is the simplest operation, since it is only a bitwise XOR of the two
addends. Therefore we need only m XOR gates and one clock cycle for this operation.
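As a brief illustration of this point, assuming field elements are stored as integers whose bit i is the coefficient of x^i (a representation chosen here for the sketch, with the hypothetical name `gf2m_add`), field addition reduces to a single XOR:

```python
def gf2m_add(a, b):
    """Add two GF(2^m) elements; no carries and no modular reduction are needed."""
    return a ^ b

# (x^3 + x + 1) + (x^2 + x) = x^3 + x^2 + 1
print(bin(gf2m_add(0b1011, 0b0110)))   # 0b1101
```

Because every element is its own additive inverse, subtraction is the same operation.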

Algorithm 4.3 Double and Add/Subtract

Input: An integer k > 0 and a point P

Output: Q = k·P

1. k := (k_(n-1), …, k_1, k_0)_SD, k_i ∈ {0, 1, -1}
2. Q := P
3. for i from n - 2 downto 0 do
4.   Q := 2Q
5.   if k_i = "1" then
6.     Q := Q + P
7.   elseif k_i = "-1" then
8.     Q := Q - P
9. return Q
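Algorithm 4.3 can be exercised in software with a signed-digit recoding of k. The sketch below makes several assumptions that are not from the thesis: the non-adjacent form (NAF) is used as the signed-digit representation, and the group is a toy curve y^2 = x^3 + 2x + 3 over F_97 in affine coordinates. All function names are illustrative.

```python
P_MOD, A = 97, 2      # toy curve y^2 = x^3 + 2x + 3 over F_97 (assumed, illustrative only)
INF = None            # point at infinity

def ec_add(p, q):
    """Affine group law including infinity, inverse points and doubling."""
    if p is INF:
        return q
    if q is INF:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if p == q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ec_neg(p):
    """Negating a point only flips the sign of y, so subtraction is cheap."""
    return INF if p is INF else (p[0], (-p[1]) % P_MOD)

def naf(k):
    """Signed digits of k > 0 in {0, 1, -1}, least significant digit first."""
    digits = []
    while k > 0:
        d = 0
        if k & 1:
            d = 2 - (k % 4)          # +1 if k = 1 mod 4, -1 if k = 3 mod 4
            k -= d
        digits.append(d)
        k >>= 1
    return digits

def scalar_mul_sd(k, p):
    digits = naf(k)                  # step 1; the top digit is always 1
    q = p                            # step 2: Q := P
    for d in reversed(digits[:-1]):  # step 3: i from n-2 downto 0
        q = ec_add(q, q)             # step 4: Q := 2Q
        if d == 1:
            q = ec_add(q, p)         # step 6
        elif d == -1:
            q = ec_add(q, ec_neg(p)) # step 8
    return q
```

With NAF recoding no two adjacent digits are nonzero, so on average fewer additions or subtractions are executed than with the plain binary expansion.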

4.3 Point Multiplication

For the multiplication we chose a serial implementation in which the reduction by the
irreducible polynomial is integrated. We therefore need m − 1 XOR gates for the addition,
several additional gates for the integrated reduction by the irreducible polynomial, two shift
registers, one register for the multiplicand and two multiplexers.
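A bit-serial multiplier with integrated reduction can be modelled as a shift-and-add loop that processes one bit of the multiplier per clock cycle. The following sketch is a software model only; the example field GF(2^8) with the polynomial x^8 + x^4 + x^3 + x + 1 is chosen here purely for illustration and is not the field used in the thesis, and `gf2m_mul_serial` is a hypothetical name.

```python
def gf2m_mul_serial(a, b, m, poly):
    """MSB-first serial multiplication in GF(2^m); poly includes the x^m term."""
    acc = 0
    for i in reversed(range(m)):        # one loop iteration models one clock cycle
        acc <<= 1                       # shift register: multiply the accumulator by x
        if (acc >> m) & 1:              # degree m reached: reduce by the irreducible polynomial
            acc ^= poly
        if (b >> i) & 1:                # multiplexer: conditionally add the multiplicand
            acc ^= a
    return acc

# GF(2^8) with x^8 + x^4 + x^3 + x + 1 (0x11B), illustrative field only
print(hex(gf2m_mul_serial(0x57, 0x83, 8, 0x11B)))   # 0xc1
```

Interleaving the reduction keeps the accumulator at m bits, which is why the serial design needs only a small number of gates beyond the XOR array.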

4.4 Squaring

Extension field squaring is similar to multiplication, except that the two inputs are equal.
By modifying the standard multiplication routine, we are able to take advantage of identical
inner product terms. For example, c_2 = a_0 b_2 + a_1 b_1 + a_2 b_0 + ω c_19 can be simplified to
c_2 = 2 a_0 a_2 + a_1^2 + ω c_19. Further gain is accomplished by doubling only one coefficient,
reducing it, and storing the new value. This approach saves us from recalculating the doubled
coefficient when it is needed again.
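The saving from identical inner product terms can be illustrated with a small software model. This sketch assumes the parameters that appear in Algorithm 4.2 (subfield GF(239), extension degree 17, P(x) = x^17 − 2); the helper names `fold`, `oef_mul` and `oef_square` are hypothetical and the code is not the hardware design.

```python
P, M, OMEGA = 239, 17, 2     # GF(239^17) with P(x) = x^17 - 2, as in Algorithm 4.2

def fold(c):
    """Apply x^M = OMEGA to a length 2M-1 coefficient list and reduce modulo P."""
    return [(c[i] + (OMEGA * c[i + M] if i + M < len(c) else 0)) % P for i in range(M)]

def oef_mul(a, b):
    """General multiplication: M*M = 289 subfield products."""
    c = [0] * (2 * M - 1)
    for i in range(M):
        for j in range(M):
            c[i + j] += a[i] * b[j]
    return fold(c)

def oef_square(a):
    """Squaring: each cross term a_i*a_j (i < j) is computed once and doubled,
    giving M*(M+1)/2 = 153 subfield products."""
    c = [0] * (2 * M - 1)
    for i in range(M):
        c[2 * i] += a[i] * a[i]
        for j in range(i + 1, M):
            c[i + j] += 2 * a[i] * a[j]
    return fold(c)
```

For example, the coefficient c_2 receives 2·a_0·a_2 + a_1^2 from the low half and ω times the x^19 term from the high half, matching the simplified expression in the text.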

Algorithm 4.2 Squaring with Subfield Reduction

Require: A(x) = ∑ a_i x^i, B(x) = ∑ b_i x^i ∈ GF(239^17)/P(x), where P(x) = x^m − ω;
a_i, b_i ∈ GF(239); 0 ≤ i < 17

Ensure: C(x) = ∑ c_i x^i = A(x)B(x), c_i ∈ GF(239)

1: Define z[w] to mean the w-th 8-bit word of z
2: c_i ← 0
3: if i ≠ 16 then
4:   for j ← m − 1 downto i + 1 do
5:     c_i ← c_i + a_(i+m−j) b_j
6:   end for
7:   c_i ← 2c_i – multiply by ω = 2
8: end if
9: for j ← i downto 0 do
10:   c_i ← c_i + a_(i−j) b_j
11: end for
12: c_i ← c_i[2] ∗ 50 + c_i[1] ∗ 17 + c_i[0] – begin reduction, Equation (4.3)
13: t ← c_i[1] ∗ 17 – begin Equation (4.4)
14: if t ≥ 256 then
15:   t ← t[0] + 17
16: end if
17: c_i ← c_i[0] + t – end Equation (4.4)
18: if c_i ≥ 256 then
19:   c_i ← c_i[0] + 17
20:   if c_i ≥ 256 then
21:     c_i ← c_i[0] + 17
22:     terminate
23:   end if
24: end if
25: c_i ← c_i − 239
26: if c_i ≤ 0 then
27:   c_i ← c_i + 239
28: end if
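The constants 17 and 50 in steps 12 to 21 follow from 239 = 2^8 − 17: modulo 239, 2^8 ≡ 17 and 2^16 ≡ 17^2 = 289 ≡ 50. The sketch below is a simplified software variant of this byte-folding reduction, not a line-by-line transcription of Algorithm 4.2; it assumes an accumulator below 2^24, and `reduce_mod_239` is a hypothetical name.

```python
def reduce_mod_239(c):
    """Reduce 0 <= c < 2**24 modulo 239 = 2**8 - 17 without a division."""
    # Fold the three bytes using 2^8 = 17 and 2^16 = 50 (mod 239).
    c = (c >> 16) * 50 + ((c >> 8) & 0xFF) * 17 + (c & 0xFF)
    # Keep folding the high byte back in until the value fits in 8 bits.
    while c >= 256:
        c = (c >> 8) * 17 + (c & 0xFF)
    # A final conditional subtraction brings the result into [0, 238].
    if c >= 239:
        c -= 239
    return c

print(reduce_mod_239(256), reduce_mod_239(65536))   # 17 50
```

Only multiplications by small constants, additions and comparisons are used, which is what makes this form attractive for a hardware or 8-bit implementation.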



Figure 4.3 RTL Schematics of Squarer



CHAPTER-5

RESULTS AND DISCUSSIONS

5.1 Simulation Result for Karatsuba Multiplier

Based on the simulation results and the report generated by Xilinx ISE Suite 14.7, the device
utilization summary for the Karatsuba multiplier is presented in the table below. As the table
shows, the Karatsuba multiplier uses fewer resources than the designs in the surveyed
literature. This is illustrated by Table and Graph 5.1 shown below. According to the reported
values, the Karatsuba multiplier uses 25 (reported as 1%) of the 27,288 available look-up
tables. This is the smallest number of look-up tables among the literature reviewed
[9][18][20][33][35].

Table 5.1 Resource Utilization of Karatsuba

Slice Logic Utilization            Used   Available   Utilization
Number of Slice LUTs                 25      27,288            1%
Number used as logic                 25      27,288            1%
Number of occupied Slices            11       6,822            1%
Number with an unused Flip Flop      25          25          100%
Number of bonded IOBs                31         218           14%

Graph 5.1 Resource Utilization of Karatsuba Multiplier (bar chart of the Used, Available and
Utilization values of Table 5.1 per slice logic resource; vertical axis: Quantity)



5.2 Resource Utilization for Polynomial Reducer

As the table below illustrates, the utilization report produced for the polynomial reducer
shows 5 occupied slices (1%), 8 flip-flops (100%) and 8 look-up tables (1%). From this we can
conclude that the utilization of the mentioned device is efficient compared to the reports
in [4][21][20][32].

Table 5.2 Resource Utilization of Polynomial Reducer

Slice Logic Utilization            Used   Available   Utilization
Number of Slice LUTs                  8      27,288            1%
Number used as logic                  8      27,288            1%
Number of occupied Slices             5       6,822            1%
Number of Flip-Flops                  8           8          100%


Graph: Polynomial Reducer Resource Utilization (bar chart of the Used, Available and
Utilization values of Table 5.2; it additionally shows Number of bonded IOBs: 23 used,
218 available, 10% utilization)




5.3 Resource Utilization of the Encryption Unit

As Table 5.3 summarizes, the resource usage of the Karatsuba-based encryptor/decryptor unit is
low compared to the publications reviewed in this thesis [4][11][22][28]. This can be
justified by the number of slice registers used in the encryptor unit, 715 (1%).

Table 5.3 Resource Utilization of ECC Encryption Unit

Slice Logic Utilization                        Used   Available   Utilization
Number of Slice Registers                       715      54,576            1%
Number of Slice LUTs                          1,977      27,288            7%
Number used as logic                          1,970      27,288            7%
Number of MUXCYs used                         1,396      13,644           10%
Number with an unused Flip Flop               1,294       2,000           64%
Number with an unused LUT                        23       2,000            1%
Number of slice register sites lost
to control set restrictions                       7      54,576            1%

Graph: Resource Utilization of ECC Encryptor (bar chart of the Used, Available and
Utilization values of Table 5.3; it additionally shows Number of occupied Slices: 601 used,
6,822 available, 8%; and Number of fully used LUT-FF pairs: 683 used, 2,000 available, 34%)




5.4 Design of ECC Encryption

The designed Karatsuba-based ECC encryptor from the Xilinx environment is presented in
Figure 5.1 below.

Figure 5.1 Karatsuba based ECC Encryptor

As the figure above shows, the designed elliptic curve cryptography unit takes 15 bits of data
and 15 bits of public key for the encryption process. The output on the right side of the
circuit is also 15 bits long after the encryption process.

5.5 Xpower Analysis of the Karatsuba Multiplier

The XPower analyzer from the Xilinx ISE Design Suite reports the power consumption of the
Karatsuba-based ECC circuit unit.

Table 5.4 Power Report

On-Chip   Power (mW)   Used   Available   Utilization
Logic           0        25      27,288            0%
IOs         36.14        31         218           14%



As indicated in the above table, the power consumption of the on-chip logic is reported as
approximately 0 mW (25 of the 27,288 available logic resources are used, about 0%), while the
IOs consume 36.14 mW. This implies that the logic power consumption is low compared to the
publication results reviewed in this thesis [17][22][29].

5.6 Decryption Unit Resource Utilization Report

Figure 5.2 Karatsuba based ECC Decryptor

As Table 5.5 shows, every component in the unit uses only a small share of the available
resources. The amount of logic used for the decryption process is 1,980 LUTs (7%); this
analysis shows that our design uses the available resources efficiently compared to the
publications reviewed in this thesis.

Table 5.5 Decryption Unit Resource Utilization

Slice Logic Utilization                Used   Available   Utilization
Number of Slice Registers               726      54,576            1%
Number of Slice LUTs                  1,988      27,288            7%
Number used as logic                  1,980      27,288            7%
Number with an unused Flip Flop       1,364       2,077           65%
Number with an unused LUT                89       2,077            4%
Number of fully used LUT-FF pairs       624       2,077           30%




5.7 Computation Time Obtained from the Experiments

Table 5.6 lists the power consumption and computation time of three components inside the
hardware accelerator. According to the table, the decryption process consumes 0.037 W, which
is 0.001 W more than the Karatsuba multiplier and polynomial reducer units, and takes
7.933 ns, which is 0.914 ns longer.

Table 5.6 Xpower Analysis for ECC Components

Activities             Power consumption   Time Taken
Karatsuba Multiplier   0.036 W             7.019 ns
Polynomial Reducer     0.036 W             7.019 ns
Decryption             0.037 W             7.933 ns

5.8 Comparison with other Related Work

As can be seen from the comparison table below, our proposed system uses fewer resources than
the listed implementations and the literature reviewed in the thesis. For example, [4] used
1918 flip-flops and 14527 look-up tables on their target device XCV2000, considerably more
than our proposed system. Consequently, the reduction of resource consumption in our
experiment resulted in a reduction of power consumption, as illustrated in section 5.6 of this
chapter.

Graph: Decryption Unit Resource Utilization (bar chart of eight resource categories.
Used: 726; 1,988; 1,980; 607; 1,364; 89; 624; 52.
Available: 54,576; 27,288; 27,288; 6,822; 2,077; 2,077; 2,077; 218.
Utilization: 1%; 7%; 7%; 8%; 65%; 4%; 30%; 23%.)




Table 5.7 Comparison of Different Implementations

Implementation                  FPGA        Number of Flip-Flops   Number of LUTs   kP time (ns)
M. Kider and Manoj V.N.V, 2008  XCV2000     1918                   14527            47
Orlando & Paar (2011)           XCV400E     Unspecified            Unspecified      210
N. Gura et al. (2007)           XCV2000E    6442                   19508            144
J. Luarz (2009)                 XCV2000E    1930                   10017            75
Chang Chu (2013)                XCV2000E    7467                   25768            53
Our Design                      XC6SLX45    20                     20               9.05

Graph: Comparison of Different Works (bar chart of the Table 5.7 values; series: # of
Flip-Flops, Number of LUTs, K*P time in ns; method labels shown on the chart: Montgomery,
Classical Knuth multiplication, Schönhage-Strassen trick, Montgomery, Montgomery, Karatsuba)



CHAPTER-6

SUMMARY, CONCLUSION, RECOMMENDATIONS AND FUTURE

RESEARCH WORK

6.1 Summary

From a design point of view, FPGAs provide a suitable environment for our implementation. These

register rich devices can accommodate large memory structures and provide optimized macro cells

that improve the speed performance of the system. The fine grain device architecture allows for

synthesis tools to perform optimization almost at a gate level resulting in very efficient

implementations.

The concept of reconfigurable hardware for elliptic curves is very attractive for various reasons.

Reconfigurable hardware provides a versatile environment that is desirable when implementing

modern cryptographic protocols. In the work described here, we have shown that an elliptic curve

cryptosystem can principally be implemented on reconfigurable devices. There is however one

limitation. The long compile times required to place and route the EC design into a

specific device are currently a bottleneck during the development cycle. The available

tools are improving very rapidly and new, larger devices are being offered from many

vendors every year.

These improvements will make it possible to implement large and very complicated designs

in the near future.

With the synthesis tools available, it was possible to obtain estimated results for all
architectures. Furthermore, comparison of synthesis and implementation results for various
large modules of our design shows that synthesis results are very accurate. Thus the EC crypto
engine can be implemented on Xilinx FPGAs at the estimated computation time.



6.2 Conclusion

As summarized in section 6.1, our work provides some insight into the hardware implementation
of complex cryptographic algorithms. Point multiplication on elliptic curves is one of the most
challenging computations used to implement public-key protocols, namely Elliptic Curve
Cryptography. This holds especially true for hardware implementations, of which very few have
been reported in the literature. It is our intention to acquaint the reader with the issues
concerning hardware acceleration of elliptic curve cryptography. Moreover, one of our goals was
to show that cryptographic protocols can be implemented in reconfigurable hardware. The wide
data-paths associated with elliptic curve implementations in hardware are of concern when
trying to use FPGA devices. However, the limitation lies more in the tools than in the
resources available to us.

In this thesis, we have shown that reconfigurable hardware is a viable solution for public-key
cryptography. In principle, elliptic curve point multiplication can be achieved on FPGAs,
resulting in a very flexible implementation with increased speed performance over current
software solutions. As security issues become more and more pronounced in the next few years
and supporting FPGA tools improve, we hope that reconfigurable hardware and elliptic curves
will provide a viable solution.

6.3 Recommendations for Future works

This thesis concentrated on achieving point multiplication on elliptic curves in
reconfigurable hardware. To our knowledge, this approach has not been attempted before.
Below, we summarize some of the more important work that could still be done from a
design and implementation point of view.

6.3.1 New Design Considerations

We would recommend to investigate different alternatives for implementing the control

structure. For example, the possibility of using RAM and counters to generate the control

vectors could be implemented.

Also, we would have liked to implement the system using two clocks to speed up

computation times.


Another important design alternative that should be researched further is the implementation
of multiple arithmetic processing elements. This would allow for parallel operation,
effectively reducing the entire computation cycle by half. Such an alternative would
also require more routing resources.

Conversely, we would like to implement another design with a narrower datapath. Reducing

datapath would result in longer computation cycle. However, such a design would allow us to

use smaller FPGAs and possibly implement the general design on future smart cards.

6.3.2 Implementation Alternatives

From an implementation point of view, further research can be done to investigate other

reconfigurable devices. Soft macros can be remapped so that the design can be implemented in

EPLDs and CPLDs. Furthermore, devices from other vendors like ALTERA, AT&T and Motorola

could be used to implement our design. This would allow us to research other place and route

tools that may or may not perform better.

Future work could also concentrate on the actual system hardware implementation. For instance,

designing a PC plug-in board with reconfigurable cryptographic algorithms seems like an

attractive application.

Lastly, we would like to devote some time to try out one of the new devices that will be available

from XILINX in the near future. The new Virtex family of devices use 0.25 micron, five layer

metal process technology which will increase area, routing resources, and speed performance.



REFERENCES

[1] A. Karatsuba and Y. Ofman. Multiplication of Multidigit Numbers on Automata.

Sov. Phys. Dokl. (English translation), 7(7):595–596, 1963.

[2] A. Woodbury, D. V. Bailey, and C. Paar. Elliptic Curve Cryptography on Smart Cards
Without Coprocessors. In IFIP CARDIS 2000, Fourth Smart Card Research and Advanced
Application Conference, Bristol, UK, September 20–22 2000. Kluwer.

[3] B.Schneier. Applied Cryptography. John Wiley and Sons, second edition, 1996

[4] B. Sunar. Fast Galois Field Arithmetic for Elliptic Curve Cryptography and Error

Control Codes. PhD thesis, Department of Electrical & Computer Engineering, Oregon

State University, Corvallis, Oregon, USA, November 1998

[5] Cryptography and Elliptic Curves,

http://www.tcs.hut.fi/~helger/crypto/link/public/elliptic/

[6] David Seal. ARM Architecture Reference Manual. Addison-Wesley Longman

Publishing Co., Inc., Boston, MA, second edition, 2000.

[7] D. R. Stinson. Cryptography, Theory and Practice. Chapman & Hall/CRC, Boca Raton,

Florida, USA, second edition, 2002.

[8] D. V. Bailey and C. Paar. Optimal Extension Fields for Fast Arithmetic in Public-

Key Algorithms. In H. Krawczyk, editor, Advances in Cryptology — CRYPTO ’98,

volume LNCS 1462, pages 472–485, Berlin, Germany, 1998. Springer-Verlag.

[9] D. V. Bailey and C. Paar. Efficient Arithmetic in Finite Field Extensions with
Application in Elliptic Curve Cryptography. Journal of Cryptology, 14(3):153–176,
2001.

[10] I. Blake, G. Seroussi, and N. Smart. Elliptic Curves in Cryptography.

Cambridge University Press, London Mathematical Society Lecture Notes Series 265,

1999.

[11] J. Guajardo and C. Paar. Itoh-Tsujii Inversion in Standard Basis and Its

Application in Cryptography. Design, Codes, and Cryptography, (25):207–216, 2002.

[12] M. Kider and Manoj V.N.V. Hardware Acceleration of Elliptic Curve Cryptography,
Addis Ababa University, Ethiopia, 2008.

[13] T. ElGamal. A Public-Key Cryptosystem and a Signature Scheme Based on
Discrete Logarithms. IEEE Transactions on Information Theory, IT-31(4):469–472,
1985.

[14] S. T. J. Fenn, M. Benaissa, and D. Taylor. Finite Field Inversion Over the

Dual Base. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,

4(1):134– 136, March 1996.

[15] W. Geiselmann and D. Gollmann. Self-Dual Bases in F_(q^n). Designs, Codes and
Cryptography, 3:333–345, 1993.

[16] M. A. Hasan. Double-Basis Multiplicative Inversion Over GF (2m). IEEE

Transactions on Computers, 47(9):960–970, September 1998.

[17] I. S. Hsu, T. K. Truong, L. J. Deutsch, and I. S. Reed. A Comparison of VLSI
Architecture of Finite Field Multipliers Using Dual-, Normal-, or Standard Bases.
IEEE Transactions on Computers, 37(6):735–739, June 1988.

[18] T. Itoh and S. Tsujii. A Fast Algorithm for Computing Multiplicative Inverses

in GF (2m) Using Normal Bases. Information and Computation, 78:171–177, 1988.

[19] C. K. Koc and T. Acar. Montgomery Multiplication in GF(2^k). Designs, Codes
and Cryptography, 14(1):57–69, 1998.

[20] R. Lidl and H. Niederreiter. Finite Fields, volume 20 of Encyclopedia of Mathematics

and its Applications. Addison-Wesley, Reading, Massachusetts, USA, 1983.

[21] E. D. Mastrovito. VLSI Architectures for Computation in Galois Fields. PhD thesis,

Linkoping University, Department of Electrical Engineering, Linkoping, Sweden, 1991.

[22] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone. Handbook of Applied
Cryptography. CRC Press, Boca Raton, Florida, USA, 1997.

[23] R. L. Rivest, A. Shamir, and L. Adleman. A Method for Obtaining Digital Signatures

and Public-Key Cryptosystems. Communications of the ACM, 21(2):120–126,

February,1978.

[24] R. Schroeppel, H. Orman, S. O’Malley, and O. Spatscheck. Fast Key Exchange

with Elliptic Curve Systems. In D. Coppersmith, editor, Advances in Cryptology —

CRYPTO ’95, volume LNCS 963, pages 43–56, Berlin, Germany, 1995. Springer-

Verlag.

[25] Julio Lopez and Ricardo Dahab, “An overview of elliptic curve cryptography”, May

2000.



[26] V. Miller, “Uses of elliptic curves in cryptography”, Advances in Cryptology -

CRYPTO'85, LNCS 218, pp.417-426, 1986.

[27] Jeffrey L. Vagle, “A Gentle Introduction to Elliptic Curve Cryptography”, BBN

Technologies

[28] Mugino Saeki, “Elliptic curve cryptosystems”, M.Sc. thesis, School of Computer

Science, McGill University, 1996. http://citeseer.nj.nec.com/saeki97elliptic.html

[29] J. Borst, “Public key cryptosystems using elliptic curves”, Master's thesis,

Eindhoven University of Technology, Feb. 1997.

http://citeseer.nj.nec.com/borst97public.html

[30] http://world.std.com/~franl/crypto.html

[31] Aleksandar Jurisic and Alfred Menezes, “Elliptic Curves and Cryptography”, Dr.

Dobb's Journal, April 1997, pp 26ff

[32] Robert Milson, “Introduction to Public Key Cryptography and Modular

Arithmetic”

[33] Aleksandar Jurisic and Alfred J. Menezes, Elliptic Curves and Cryptography

[34] William Stallings, Cryptography and Network Security-Principles and Practice

second edition, Prentice Hall publications.

[35] R. Schroppel, H. Orman, S. O’Malley and O. Spatscheck, “Fast key exchange with

elliptic key systems”, Advances in Cryptography, Proc. Crypto’95, LNCS 963, pp. 43-56,

Springer-Verlag, 1995.



APPENDIX I

Algorithm 1: Point Doubling

input : P(x_1, y_1) ∈ E(F_q)

output: [2]P(x_3, y_3) ∈ E(F_q)

x_1^2 = x_1 · x_1, 2x_1^2 = x_1^2 + x_1^2;
A = 3x_1^2 = 2x_1^2 + x_1^2, B = A + a;
2y_1 = y_1 + y_1, inv(2y_1), λ = B / 2y_1;
λ^2 = λ · λ, 2x_1 = x_1 + x_1, x_3 = λ^2 − 2x_1;
C = x_1 − x_3, D = λ · C, y_3 = D − y_1

Algorithm 2: Point Addition

input : P(x_1, y_1), Q(x_2, y_2) ∈ E(F_q)

output: P + Q(x_3, y_3) ∈ E(F_q)

A = y_2 − y_1, B = x_2 − x_1, inv(B);
λ = A / B, λ^2 = λ · λ, C = λ^2 − x_1;
x_3 = C − x_2, D = x_1 − x_3, E = D · λ, y_3 = E − y_1
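The step sequences of Algorithms 1 and 2 can be followed in a few lines of software. The sketch below keeps the intermediate names of the algorithms where possible and assumes a toy curve y^2 = x^3 + 2x + 3 over the prime field F_97; these parameters and the function names are illustrative and not from the thesis. Special cases (point at infinity, y_1 = 0 in doubling, x_1 = x_2 in addition) are deliberately left out, as in the algorithms above.

```python
P_MOD, A_COEF, B_COEF = 97, 2, 3   # toy curve parameters (assumed, illustrative only)

def on_curve(pt):
    x, y = pt
    return (y * y - (x * x * x + A_COEF * x + B_COEF)) % P_MOD == 0

def point_double(p):
    """Algorithm 1; requires y1 != 0."""
    x1, y1 = p
    a_ = 3 * x1 * x1 % P_MOD                       # A = 3*x1^2
    b_ = (a_ + A_COEF) % P_MOD                     # B = A + a
    lam = b_ * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD   # lambda = B / (2*y1)
    x3 = (lam * lam - 2 * x1) % P_MOD              # x3 = lambda^2 - 2*x1
    y3 = (lam * (x1 - x3) - y1) % P_MOD            # y3 = lambda*(x1 - x3) - y1
    return (x3, y3)

def point_add(p, q):
    """Algorithm 2; requires x1 != x2."""
    (x1, y1), (x2, y2) = p, q
    a_ = (y2 - y1) % P_MOD                         # A = y2 - y1
    b_ = (x2 - x1) % P_MOD                         # B = x2 - x1
    lam = a_ * pow(b_, -1, P_MOD) % P_MOD          # lambda = A / B
    x3 = (lam * lam - x1 - x2) % P_MOD             # x3 = lambda^2 - x1 - x2
    y3 = (lam * (x1 - x3) - y1) % P_MOD            # y3 = (x1 - x3)*lambda - y1
    return (x3, y3)

G = (0, 10)                                        # 10^2 = 100 = 3 (mod 97), so G is on the curve
print(point_double(G))                             # (65, 32)
```

Each operation costs one field inversion, which is the reason the main text prefers projective coordinates for the hardware design.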


APPENDIX II : Sample Snapshot

A. ECC Encryptor

B. ECC Decryptor



C. Simulation for ECC Encryption

D. Simulation for Decryption

E. ISim Simulation of Karatsuba Multiplier



F. Design Result of Karatsuba Multiplier (Narrowed Design )

G. Detailed Karatsuba Multiplier RTL Circuit



H. Polynomial Reducer Circuit

I. Programming the Device on a Spartan-6 Family FPGA



APPENDIX III : Synthesis Report From Xilinx

Release 14.7 - xst P.20131013 (nt)

Copyright (c) 1995-2013 Xilinx, Inc. All rights reserved.

--> Parameter TMPDIR set to xst/projnav.tmp

Total REAL time to Xst completion: 0.00 secs

Total CPU time to Xst completion: 0.37 secs

--> Parameter xsthdpdir set to xst



--> Reading design: poly_reducer.prj

TABLE OF CONTENTS

1) Synthesis Options Summary

2) HDL Parsing

3) HDL Elaboration

4) HDL Synthesis

4.1) HDL Synthesis Report

5) Advanced HDL Synthesis

5.1) Advanced HDL Synthesis Report

6) Low Level Synthesis

7) Partition Report

8) Design Summary

8.1) Primitive and Black Box Usage

8.2) Device utilization summary

8.3) Partition Resource Summary

8.4) Timing Report

8.4.1) Clock Information

8.4.2) Asynchronous Control Signals Information

8.4.3) Timing Summary

8.4.4) Timing Details

8.4.5) Cross Clock Domains Report

===============================================================



* Synthesis Options Summary *

===============================================================

---- Source Parameters

Input File Name : "poly_reducer.prj"

Ignore Synthesis Constraint File : NO

---- Target Parameters

Output File Name : "poly_reducer"

Output Format : NGC

Target Device : xc6slx45-2-csg324

---- Source Options

Top Module Name : poly_reducer

Automatic FSM Extraction : YES

FSM Encoding Algorithm : Auto

Safe Implementation : No

FSM Style : LUT

RAM Extraction : Yes

RAM Style : Auto

ROM Extraction : Yes

Shift Register Extraction : YES

ROM Style : Auto

Resource Sharing : YES

Asynchronous To Synchronous : NO

Shift Register Minimum Size : 2

Use DSP Block : Auto

Automatic Register Balancing : No

---- Target Options

LUT Combining : Auto

Reduce Control Sets : Auto

Add IO Buffers : YES

Global Maximum Fanout : 100000

Add Generic Clock Buffer(BUFG) : 16

Register Duplication : YES

Optimize Instantiated Primitives : NO

Use Clock Enable : Auto

Use Synchronous Set : Auto

Use Synchronous Reset : Auto

Pack IO Registers into IOBs : Auto



Equivalent register Removal : YES

---- General Options

Optimization Goal : Speed

Optimization Effort : 1

Power Reduction : NO

Keep Hierarchy : No

Netlist Hierarchy : As_Optimized

RTL Output : Yes

Global Optimization : AllClockNets

Read Cores : YES

Write Timing Constraints : NO

Cross Clock Analysis : NO

Hierarchy Separator : /

Bus Delimiter : <>

Case Specifier : Maintain

Slice Utilization Ratio : 100

BRAM Utilization Ratio : 100

DSP48 Utilization Ratio : 100

Auto BRAM Packing : NO

Slice Utilization Ratio Delta : 5

===============================================================

===============================================================

* HDL Parsing *

===============================================================

Parsing VHDL file

"G:\Collection\Karatsuba_Monogomry_ECC_Cryptography\classic_multiplier.vhd" into

library work

Parsing package <classic_multiplier_parameters>.

Parsing package body <classic_multiplier_parameters>.

Parsing entity <poly_multiplier>.

Parsing architecture <simple> of entity <poly_multiplier>.

Parsing entity <poly_reducer>.

Parsing architecture <simple> of entity <poly_reducer>.

Parsing entity <classic_multiplication>.

Parsing architecture <simple> of entity <classic_multiplication>.

Parsing VHDL file

"G:\Collection\Karatsuba_Monogomry_ECC_Cryptography\classic_squarer.vhd" into

library work



Parsing package <classic_squarer_parameters>.

Parsing package body <classic_squarer_parameters>.

Parsing entity <poly_reducer>.

WARNING:HDLCompiler:685 -

"G:\Collection\Karatsuba_Monogomry_ECC_Cryptography\classic_squarer.vhd" Line

69: Overwriting existing primary unit poly_reducer

Parsing architecture <simple> of entity <poly_reducer>.

Parsing entity <classic_squarer>.

Parsing architecture <simple> of entity <classic_squarer>.

===============================================================

* HDL Elaboration *

===============================================================

Elaborating entity <poly_reducer> (architecture <simple>) from library <work>.

===============================================================

* HDL Synthesis *

===============================================================

Synthesizing Unit <poly_reducer>.

Related source file is

"G:\Collection\Karatsuba_Monogomry_ECC_Cryptography\classic_squarer.vhd".

Summary:

Unit <poly_reducer> synthesized.

===============================================================

HDL Synthesis Report

Macro Statistics

# Xors : 11

1-bit xor2 : 3

1-bit xor3 : 4

1-bit xor4 : 4

===============================================================

===============================================================

* Advanced HDL Synthesis *

===============================================================



===============================================================

Advanced HDL Synthesis Report

Macro Statistics

# Xors : 11

1-bit xor2 : 3

1-bit xor3 : 4

1-bit xor4 : 4

===============================================================

===============================================================

* Low Level Synthesis *

===============================================================

Optimizing unit <poly_reducer> ...

Mapping all equations...

Building and optimizing final netlist ...

Found area constraint ratio of 100 (+ 5) on block poly_reducer, actual ratio is 0.

Final Macro Processing ...

===============================================================

Final Register Report

Found no macro

===============================================================

===============================================================

* Partition Report *

===============================================================

Partition Implementation Status

-------------------------------

No Partitions were found in this design.

===============================================================

* Design Summary *

===============================================================



Top Level Output File Name : poly_reducer.ngc

Primitive and Black Box Usage:

------------------------------

# BELS : 9

# LUT2 : 1

# LUT4 : 5

# LUT5 : 2

# LUT6 : 1

# IO Buffers : 23

# IBUF : 15

# OBUF : 8

Device utilization summary:

---------------------------

Selected Device : 6slx45csg324-2

Slice Logic Utilization:

Number of Slice LUTs: 9 out of 27288 0%

Number used as Logic: 9 out of 27288 0%

Slice Logic Distribution:

Number of LUT Flip Flop pairs used: 9

Number with an unused Flip Flop: 9 out of 9 100%

Number with an unused LUT: 0 out of 9 0%

Number of fully used LUT-FF pairs: 0 out of 9 0%

Number of unique control sets: 0

IO Utilization:

Number of IOs: 23

Number of bonded IOBs: 23 out of 218 10%

Specific Feature Utilization:

---------------------------

Partition Resource Summary:

---------------------------



No Partitions were found in this design.

===============================================================

Timing Report

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.

FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE

REPORT

GENERATED AFTER PLACE-and-ROUTE.

Clock Information:

------------------

No clock signals found in this design

Asynchronous Control Signals Information:

----------------------------------------

No asynchronous control signals found in this design

Timing Summary:

---------------

Speed Grade: -2

Minimum period: No path found

Minimum input arrival time before clock: No path found

Maximum output required time after clock: No path found

Maximum combinational path delay: 7.019ns

Timing Details:

---------------

All values displayed in nanoseconds (ns)

===============================================================

Timing constraint: Default path analysis

Total number of paths / destination ports: 37 / 8

-------------------------------------------------------------------------

Delay: 7.019ns (Levels of Logic = 4)

Source: d<11> (PAD)

Destination: c<3> (PAD)

Data Path: d<11> to c<3>

                                Gate     Net
    Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)
    ----------------------------------------  ------------
     IBUF:I->O             4   1.328   0.912  d_11_IBUF (d_11_IBUF)
     LUT2:I0->O            1   0.250   0.682  Mxor_gen_xors[3].l1.aux_xo<0>_SW0 (N2)
     LUT6:I5->O            1   0.254   0.681  Mxor_gen_xors[3].l1.aux_xo<0> (c_3_OBUF)
     OBUF:I->O                 2.912          c_3_OBUF (c<3>)
    ----------------------------------------
    Total                      7.019ns (4.744ns logic, 2.275ns route)
                                       (67.6% logic, 32.4% route)

===============================================================

Cross Clock Domains Report:

--------------------------

===============================================================



-->

Total memory usage is 185824 kilobytes

Number of errors : 0 ( 0 filtered)

Number of warnings : 1 ( 0 filtered)

Number of infos : 0 ( 0 filtered)



APPENDIX IV: Sample VHDL Code

-- VHDL Code for Karatsuba Multiplier

library ieee;

use ieee.std_logic_1164.all;

use ieee.std_logic_arith.all;

use ieee.std_logic_unsigned.all;

entity karatsuba_multiplier_even is

generic (M: integer:= 8);

port (

a, b: in std_logic_vector(M-1 downto 0);

d: out std_logic_vector(2*M-2 downto 0)

);

end karatsuba_multiplier_even;

architecture simple of karatsuba_multiplier_even is

component polynom_multiplier is

generic (M: integer:= 8);

port (

a, b: in std_logic_vector(M-1 downto 0);

d: out std_logic_vector(2*M-2 downto 0)

);

end component polynom_multiplier;

constant half_M :integer := M/2;

signal x0y0, x01y01: std_logic_vector(2*half_M-2 downto 0);

signal x1y1: std_logic_vector(2*half_M-2 downto 0);

signal x0_p_X1, y0_p_y1: std_logic_vector(half_M-1 downto 0);

begin

mult1: polynom_multiplier generic map(M => half_M)

port map(a => a(half_M-1 downto 0),

b => b(half_M-1 downto 0), d=> x0y0);

mult2: polynom_multiplier generic map(M => half_M)

port map(a => a(M-1 downto half_M),

b => b(M-1 downto half_M), d=> x1y1);

mult3: polynom_multiplier generic map(M => half_M)

port map(a => x0_p_X1,

b => y0_p_y1, d=> x01y01);

gen_x0x1y0y1: for i in 0 to half_M-1 generate

x0_p_X1(i) <= a(i) xor a(i + half_M);

y0_p_y1(i) <= b(i) xor b(i + half_M);

end generate;

gen_prod1: for i in 0 to half_M-2 generate

d(half_M + i) <= x01y01(i) xor x0y0(i) xor x1y1(i) xor x0y0(i+half_M);

end generate;

d(2*half_M-1) <= x01y01(half_M-1) xor x0y0(half_M-1) xor x1y1(half_M-1);

gen_prod2: for i in half_M to 2*half_M-2 generate

d(half_M + i) <= x01y01(i) xor x0y0(i) xor x1y1(i) xor x1y1(i-half_M) ;

end generate;

d(3*half_M-1) <= x1y1(half_M-1);

d(half_M-1 downto 0) <= x0y0(half_M-1 downto 0);

d(2*M-2 downto 3*half_M) <= x1y1(2*half_M-2 downto half_M);

end simple;
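The recombination performed by the generate statements above can be checked against a bit-level software model. The following Python sketch is an illustrative reference model, not part of the thesis design. It represents polynomials over GF(2) as integers (bit i is the coefficient of x^i); the function names `clmul` and `karatsuba_even` are hypothetical, and one Karatsuba level with an even operand width is assumed.

```python
def clmul(a, b):
    """Schoolbook carry-less product of two GF(2)[x] polynomials held as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def karatsuba_even(a, b, m):
    """One Karatsuba level for m-bit operands (m even), three half-size products."""
    half = m // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    x0y0 = clmul(a0, b0)                  # low halves
    x1y1 = clmul(a1, b1)                  # high halves
    x01y01 = clmul(a0 ^ a1, b0 ^ b1)      # (a0 + a1)(b0 + b1)
    middle = x01y01 ^ x0y0 ^ x1y1         # a0*b1 + a1*b0
    return x0y0 ^ (middle << half) ^ (x1y1 << (2 * half))


# 8-bit example: (x^7 + x^5 + x^3 + x)^2 = x^14 + x^10 + x^6 + x^2.
print(format(karatsuba_even(0xAA, 0xAA, 8), "015b"))   # 100010001000100
```

The three products correspond to the signals `x0y0`, `x1y1` and `x01y01`, and the final XOR expression corresponds to the bit-wise assignments to `d`.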

--------------------------------------------------------------------------------

-- Simple testbench for "poly_multiplier" module (for m=8)

--

--------------------------------------------------------------------------------

LIBRARY ieee;

USE ieee.std_logic_1164.ALL;

USE ieee.std_logic_unsigned.all;

USE ieee.numeric_std.ALL;

use work.classic_multiplier_parameters.all;

ENTITY test_poly_mult_vhd IS

END test_poly_mult_vhd;

ARCHITECTURE behavior OF test_poly_mult_vhd IS



-- Component Declaration for the Unit Under Test (UUT)

COMPONENT poly_multiplier

PORT(

a : IN std_logic_vector(m-1 downto 0);

b : IN std_logic_vector(m-1 downto 0);

d : OUT std_logic_vector(2*m-2 downto 0)

);

END COMPONENT;

--Inputs

SIGNAL a : std_logic_vector(m-1 downto 0) := (others=>'0');

SIGNAL b : std_logic_vector(m-1 downto 0) := (others=>'0');

--Outputs

SIGNAL d : std_logic_vector(2*m-2 downto 0);

BEGIN

-- Instantiate the Unit Under Test (UUT)

uut: poly_multiplier PORT MAP( a => a, b => b, d => d );

tb : PROCESS

BEGIN

-- Wait 100 ns for global reset to finish

wait for 100 ns;

a <= "10101010";

b <= "10101010";

wait for 100 ns;

assert (d = "100010001000100") report "ERROR in mult" severity FAILURE;

a <= "10101010";

b <= "00000000";

wait for 100 ns;


a <= "11111111";

b <= "10101010";

wait for 100 ns;


a <= "10101010";

b <= "01010101";

wait for 100 ns;




a <= "01010101";

b <= "01010101";

wait for 100 ns;


wait; -- will wait forever

END PROCESS;

END;
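Expected outputs for the stimulus pairs used in this test bench can be derived with a small script. The following Python sketch is an assumption-laden helper, not part of the thesis code: it treats each 8-bit operand as a polynomial over GF(2), computes the unreduced product, and formats it as a 15-bit string (2*m - 1 bits for m = 8). The names `clmul`, `expected` and `VECTORS` are hypothetical.

```python
M = 8  # operand width used by the test bench

# Stimulus pairs (a, b) as they appear in the test bench.
VECTORS = [
    ("10101010", "10101010"),
    ("10101010", "00000000"),
    ("11111111", "10101010"),
    ("10101010", "01010101"),
    ("01010101", "01010101"),
]


def clmul(a, b):
    """Carry-less (GF(2)[x]) product of two non-negative integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def expected(a_bits, b_bits):
    """Return the (2*M - 1)-bit product string for two M-bit operand strings."""
    product = clmul(int(a_bits, 2), int(b_bits, 2))
    return format(product, "0{}b".format(2 * M - 1))


for a_bits, b_bits in VECTORS:
    print(a_bits, b_bits, expected(a_bits, b_bits))
```

The first pair reproduces the value "100010001000100" that the test bench asserts.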

--------------------------------------------------------------------------------

-- VHDL Code For Square

--------------------------------------------------------------------------------

LIBRARY ieee;

USE ieee.std_logic_1164.ALL;

USE IEEE.std_logic_arith.all;

USE ieee.std_logic_textio.ALL;

use ieee.math_real.all; -- for UNIFORM, TRUNC

USE std.textio.ALL;

--use work.classic_multiplier_parameters.all;

use work.LSB_first_squarer_package.all;

ENTITY test_square_comparac IS

END test_square_comparac;

ARCHITECTURE behavior OF test_square_comparac IS

-- Component Declaration for the Unit Under Test (UUT2)

COMPONENT classic_multiplication

PORT(

a : IN std_logic_vector(M-1 downto 0);

b : IN std_logic_vector(M-1 downto 0);

c : OUT std_logic_vector(M-1 downto 0)

);

END COMPONENT;

COMPONENT classic_squarer



PORT(

a : IN std_logic_vector(M-1 downto 0);

c : OUT std_logic_vector(M-1 downto 0)

);

END COMPONENT;

COMPONENT montgomery_squarer is

port (

a: in std_logic_vector (M-1 downto 0);

clk, reset, start: in std_logic;

z: out std_logic_vector (M-1 downto 0);

done: out std_logic

);

END COMPONENT;

COMPONENT montgomery_comb_squarer is

port (

a: in std_logic_vector (M-1 downto 0);

c: out std_logic_vector (M-1 downto 0)

);

END COMPONENT;

COMPONENT LSB_first_squarer is

port (

a: in std_logic_vector (M-1 downto 0);

clk, reset, start: in std_logic;

z: out std_logic_vector (M-1 downto 0);

done: out std_logic

);

END COMPONENT;

-- Internal signals

SIGNAL x, c, sq : std_logic_vector(M-1 downto 0) := (others=>'0');

SIGNAL clk, reset, start, done_montg, done_lsbf: std_logic;

SIGNAL montg_sq, r, montg_sq_adj, montg_sq_comb , montg_sq_comb_adj, lsbf_sq:

std_logic_vector(M-1 downto 0) := (others=>'0');

constant DELAY : time := 100 ns;

constant PERIOD : time := 200 ns;

constant DUTY_CYCLE : real := 0.5;

constant OFFSET : time := 0 ns;

constant NUMBER_TESTS: natural := 100;

BEGIN




uut0: classic_multiplication PORT MAP( a => x, b => x, c => c );

uut1: classic_squarer PORT MAP( a => x, c => sq );

uut2: montgomery_squarer PORT MAP(A => x,

clk => clk, reset => reset, start => start,

z => montg_sq, done => done_montg);

r <= F; --2**K mod F = F

uut2b: classic_multiplication PORT MAP( a => montg_sq, b => r, c => montg_sq_adj );

uut3: montgomery_comb_squarer PORT MAP( a => x, c => montg_sq_comb );

uut3b: classic_multiplication PORT MAP( a => montg_sq_comb, b => r, c =>

montg_sq_comb_adj );

uut4: LSB_first_squarer PORT MAP(A => x,

clk => clk, reset => reset, start => start,

z => lsbf_sq, done => done_lsbf);

PROCESS -- clock process for clk

BEGIN

WAIT for OFFSET;

CLOCK_LOOP : LOOP

clk <= '0';

WAIT FOR (PERIOD *(1.0 - DUTY_CYCLE));

clk <= '1';

WAIT FOR (PERIOD * DUTY_CYCLE);

END LOOP CLOCK_LOOP;

END PROCESS;

tb_proc : PROCESS --generate values

PROCEDURE gen_random(X : out std_logic_vector (M-1 DownTo 0); w: natural; s1, s2:

inout Natural) IS

VARIABLE i_x, aux: integer;

VARIABLE rand: real;

BEGIN

aux := W/16;

for i in 1 to aux loop

UNIFORM(s1, s2, rand);

i_x := INTEGER(TRUNC(rand * real(2**16)));



x(i*16-1 downto (i-1)*16) := CONV_STD_LOGIC_VECTOR (i_x, 16);

end loop;

UNIFORM(s1, s2, rand);

i_x := INTEGER(TRUNC(rand * real(2**(w-aux*16))));

x(w-1 downto aux*16) := CONV_STD_LOGIC_VECTOR (i_x, (w-aux*16));

END PROCEDURE;

VARIABLE TX_LOC : LINE;

VARIABLE TX_STR : String(1 to 4096);

VARIABLE seed1, seed2: positive;

VARIABLE i_x, i_y, i_p, i_z, i_yz_modp: integer;

VARIABLE cycles, max_cycles, min_cycles, total_cycles: integer := 0;

VARIABLE avg_cycles: real;

VARIABLE initial_time, final_time: time;

VARIABLE xx: std_logic_vector (M-1 DownTo 0) ;

BEGIN

min_cycles:= 2**20;

start <= '0'; reset <= '1';

WAIT FOR PERIOD;

reset <= '0';

WAIT FOR PERIOD;

for I in 1 to NUMBER_TESTS loop

gen_random(xx, M, seed1, seed2);

x <= xx;

start <= '1'; initial_time := now;

WAIT FOR PERIOD;

start <= '0';

wait until done_montg = '1';

final_time := now;

cycles := (final_time - initial_time)/PERIOD;

total_cycles := total_cycles+cycles;

--ASSERT (FALSE) REPORT "Number of Cycles: " & integer'image(cycles) & " TotalCycles: " & integer'image(total_cycles) SEVERITY WARNING;

if cycles > max_cycles then max_cycles:= cycles; end if;

if cycles < min_cycles then min_cycles:= cycles; end if;

WAIT FOR 2*PERIOD;

IF ( c /= sq or c/= montg_sq_adj or c /= montg_sq_comb_adj or c /=lsbf_sq) THEN

write(TX_LOC,string'("ERROR!!! C=")); write(TX_LOC, c);



write(TX_LOC,string'("/= Z=")); write(TX_LOC, c);

write(TX_LOC,string'("/= sq=")); write(TX_LOC, sq);

write(TX_LOC,string'("/= montg_sq=")); write(TX_LOC, montg_sq);

write(TX_LOC,string'("/= montg_sq_Adj=")); write(TX_LOC, montg_sq_adj);

write(TX_LOC,string'(" (montg_comb=")); write(TX_LOC, montg_sq_comb);

write(TX_LOC,string'(") /= montg_combAdj=")); write(TX_LOC, montg_sq_comb_adj);

write(TX_LOC,string'(") using: ( A =")); write(TX_LOC, x);

write(TX_LOC, string'(", F = 1")); write(TX_LOC, F);

write(TX_LOC, string'(" )"));

TX_STR(TX_LOC.all'range) := TX_LOC.all;

Deallocate(TX_LOC);

ASSERT (FALSE) REPORT TX_STR SEVERITY ERROR;

END IF;

end loop;

WAIT FOR DELAY;

avg_cycles := real(total_cycles)/real(NUMBER_TESTS);

ASSERT (FALSE) REPORT

"Simulation successful!. MinCycles: " & integer'image(min_cycles) &

" MaxCycles: " & integer'image(max_cycles) & " TotalCycles: " &

integer'image(total_cycles) &

" AvgCycles: " & real'image(avg_cycles)

SEVERITY FAILURE;

END PROCESS;

END;
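The test bench above compares several squarers against the product x * x. The equivalence it relies on can be illustrated in software: over GF(2), squaring a polynomial only spreads its coefficients to the even positions, after which the result is reduced modulo the field polynomial. The following Python sketch assumes m = 8 and the reduction polynomial x^8 + x^4 + x^3 + x + 1; both are illustrative assumptions, since the thesis packages that define M and F are not reproduced here. All function names are hypothetical.

```python
M = 8          # assumed field degree
F = 0x11B      # assumed reduction polynomial x^8 + x^4 + x^3 + x + 1


def clmul(a, b):
    """Carry-less product of two GF(2)[x] polynomials held as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce_poly(d):
    """Reduce polynomial d modulo F, clearing bits from the top down to x^M."""
    for bit in range(d.bit_length() - 1, M - 1, -1):
        if (d >> bit) & 1:
            d ^= F << (bit - M)
    return d


def gf_mul(a, b):
    """Field multiplication: polynomial product followed by reduction."""
    return reduce_poly(clmul(a, b))


def gf_square(a):
    """Field squaring: move coefficient i to position 2*i, then reduce."""
    spread = 0
    for i in range(M):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
    return reduce_poly(spread)


# The two paths agree for every field element.
print(all(gf_square(a) == gf_mul(a, a) for a in range(1 << M)))   # True
```

Because the spreading step needs no logic at all, a dedicated squarer reduces to the XOR network of the polynomial reducer.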

----------------------------------------------------------------------------------------------------------------

-- Test Division algorithm

LIBRARY ieee;

USE ieee.std_logic_1164.ALL;

USE IEEE.std_logic_arith.all;





USE ieee.std_logic_textio.ALL;

use ieee.math_real.all; -- for UNIFORM, TRUNC

USE std.textio.ALL;

use work.binary_algorithm_polynomials_parameters.all;

ENTITY test_binary_division IS

END test_binary_division;

ARCHITECTURE behavior OF test_binary_division IS

-- a multiplier is instantiated to check the results

COMPONENT classic_multiplication

PORT(

a : IN std_logic_vector(M-1 downto 0);

b : IN std_logic_vector(M-1 downto 0);

c : OUT std_logic_vector(M-1 downto 0)

);

END COMPONENT;

-- Component Declaration for the Unit Under Test (UUT2)

COMPONENT binary_algorithm_polynomials is

port (

g, h: in std_logic_vector (M-1 downto 0);

clk, reset, start: in std_logic;

Z: out std_logic_vector (M-1 downto 0);

done: out std_logic

);



END COMPONENT binary_algorithm_polynomials;

-- Internal signals

SIGNAL x, y, z, z_by_y : std_logic_vector(M-1 downto 0) := (others=>'0');

SIGNAL clk, reset, start, done: std_logic;

constant ZERO: std_logic_vector(M-1 downto 0) := (others=>'0');

constant DELAY : time := 100 ns;

constant PERIOD : time := 200 ns;

constant DUTY_CYCLE : real := 0.5;

constant OFFSET : time := 0 ns;

constant NUMBER_TESTS: natural := 100;

BEGIN


uut1: binary_algorithm_polynomials PORT MAP(g => x, h => y,

clk => clk, reset => reset, start => start,

z => z, done => done);

uut2: classic_multiplication PORT MAP( a => z, b => y, c => z_by_y );

PROCESS -- clock process for clk

BEGIN

WAIT for OFFSET;

CLOCK_LOOP : LOOP

clk <= '0';

WAIT FOR (PERIOD *(1.0 - DUTY_CYCLE));

clk <= '1';

WAIT FOR (PERIOD * DUTY_CYCLE);



END LOOP CLOCK_LOOP;

END PROCESS;

tb_proc : PROCESS --generate values

PROCEDURE gen_random(X : out std_logic_vector (M-1 DownTo 0); w: natural; s1, s2:

inout Natural) IS

VARIABLE i_x, aux: integer;

VARIABLE rand: real;

BEGIN

aux := W/16;

for i in 1 to aux loop

UNIFORM(s1, s2, rand);

i_x := INTEGER(TRUNC(rand * real(2**16)));

x(i*16-1 downto (i-1)*16) := CONV_STD_LOGIC_VECTOR (i_x, 16);

end loop;

UNIFORM(s1, s2, rand);

i_x := INTEGER(TRUNC(rand * real(2**(w-aux*16))));

x(w-1 downto aux*16) := CONV_STD_LOGIC_VECTOR (i_x, (w-aux*16));

END PROCEDURE;

VARIABLE TX_LOC : LINE;

VARIABLE TX_STR : String(1 to 4096);

VARIABLE seed1, seed2: positive;

VARIABLE i_x, i_y, i_p, i_z, i_yz_modp: integer;

VARIABLE cycles, max_cycles, min_cycles, total_cycles: integer := 0;

VARIABLE avg_cycles: real;

VARIABLE initial_time, final_time: time;

VARIABLE xx: std_logic_vector (M-1 DownTo 0) ;

BEGIN



min_cycles:= 2**20;

start <= '0'; reset <= '1';

WAIT FOR PERIOD;

reset <= '0';

WAIT FOR PERIOD;

for I in 1 to NUMBER_TESTS loop

gen_random(xx, M, seed1, seed2);

x <= xx;

gen_random(xx, M, seed1, seed2);

while (xx = ZERO) loop gen_random(xx, M, seed1, seed2); end loop;

y <= xx;

start <= '1'; initial_time := now;

WAIT FOR PERIOD;

start <= '0';

wait until done = '1';

final_time := now;

cycles := (final_time - initial_time)/PERIOD;

total_cycles := total_cycles+cycles;

--ASSERT (FALSE) REPORT "Number of Cycles: " & integer'image(cycles) & " TotalCycles: " & integer'image(total_cycles) SEVERITY WARNING;

if cycles > max_cycles then max_cycles:= cycles; end if;

if cycles < min_cycles then min_cycles:= cycles; end if;

WAIT FOR 2*PERIOD;

IF ( x /= z_by_y ) THEN

write(TX_LOC,string'("ERROR!!! z_by_y=")); write(TX_LOC, z_by_y);

write(TX_LOC,string'("/= x=")); write(TX_LOC, x);

write(TX_LOC,string'("( z=")); write(TX_LOC, z);



write(TX_LOC,string'(") using: ( A =")); write(TX_LOC, x);

write(TX_LOC, string'(", B =")); write(TX_LOC, y);

write(TX_LOC, string'(", F = 1")); write(TX_LOC, F);

write(TX_LOC, string'(" )"));

TX_STR(TX_LOC.all'range) := TX_LOC.all;

Deallocate(TX_LOC);

ASSERT (FALSE) REPORT TX_STR SEVERITY ERROR;

END IF;

end loop;

WAIT FOR DELAY;

avg_cycles := real(total_cycles)/real(NUMBER_TESTS);

ASSERT (FALSE) REPORT

"Simulation successful!. MinCycles: " & integer'image(min_cycles) &

" MaxCycles: " & integer'image(max_cycles) & " TotalCycles: " &

integer'image(total_cycles) &

" AvgCycles: " & real'image(avg_cycles)

SEVERITY FAILURE;

END PROCESS;

END;
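The division test bench verifies each quotient by multiplying it back: for z = x / y it checks that z * y equals x. The same check can be sketched in Python. This is only an illustrative model under stated assumptions: m = 8 and the reduction polynomial x^8 + x^4 + x^3 + x + 1 are hypothetical choices, and the inverse is computed by Fermat exponentiation (y^(2^m - 2)) rather than by the binary algorithm that the hardware unit under test implements.

```python
M = 8          # assumed field degree
F = 0x11B      # assumed reduction polynomial x^8 + x^4 + x^3 + x + 1


def gf_mul(a, b):
    """Multiply two GF(2^M) elements with interleaved reduction modulo F."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:      # degree reached M: subtract (XOR) F
            a ^= F
    return result


def gf_pow(a, e):
    """Square-and-multiply exponentiation in GF(2^M)."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        e >>= 1
    return result


def gf_div(x, y):
    """Return x / y in GF(2^M); y must be non-zero, as in the test bench."""
    if y == 0:
        raise ZeroDivisionError("division by zero in GF(2^M)")
    return gf_mul(x, gf_pow(y, (1 << M) - 2))


# Same check as the test bench: (x / y) * y must give back x.
x, y = 0xB7, 0x53
z = gf_div(x, y)
print(gf_mul(z, y) == x)   # True
```

As in the VHDL test bench, the divisor is required to be non-zero, which is why the stimulus generator redraws y until it differs from ZERO.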

----------------------------------------------------------------------------------------------------

library IEEE;

use IEEE.STD_LOGIC_1164.ALL;

use IEEE.STD_LOGIC_ARITH.ALL;

use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity multiplier is

port( B :in std_logic_vector(15 downto 0);

Product :out std_logic_vector(15 downto 0)

);

end multiplier;

architecture Behavioral of multiplier is

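signal P1:std_logic_vector(15 downto 0);

signal P2:std_logic_vector(15 downto 0);

signal P3:std_logic_vector(15 downto 0);

signal P4:std_logic_vector(15 downto 0);

signal P5:std_logic_vector(15 downto 0);

signal P6:std_logic_vector(15 downto 0);

signal P7:std_logic_vector(15 downto 0);

signal P8:std_logic_vector(15 downto 0);

signal P9:std_logic_vector(15 downto 0);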











component wallace_structure is

port(P1,P2,P3,P4,P5,P6,P7,P8,P9 :in std_logic_vector( 15 downto 0);

product :out std_logic_vector( 15 downto 0));

end component;

begin

-- partial products reduced to 9 from 16 due to the multiplication of B and 2B+1

P1 <= B;

gen1:for i in 13 downto 1 generate

p2(i+2) <= B(0) and B(i); --P2 = {{B[3:15] & {13{B[15]}}},1'd0,B[15],1'd0};

end generate;

p2(2 downto 0)<=('0' & B(0) & '0');


p3(i+3) <= B(1) and B(i); --P3 <= {{B[5:15] & {11{B[14]}}},1'd0,B[14],3'd0};

end generate;

p3(4 downto 0)<=( '0' & B(1) & "000");


p4(i+4) <= B(2) and B(i); -- P4 <= {{B[4:12] & {9{B[13]}}},1'd0,B[13],5'd0};

end generate;

p4(6 downto 0)<=('0'& B(2) & "00000");


p5(i+5) <= B(3) and B(i); -- P5 <= {{B[5:11] & {7{B[12]}}},1'd0,B[12],7'd0};

end generate;

p5(8 downto 0)<=('0' & B(3) & "0000000");




p6(i+6) <= B(4) and B(i); --P6 <= {{B[6:10] & {5{B[11]}}},1'd0,B[11],9'd0};

end generate;

p6(10 downto 0)<=( '0' & B(4) & "000000000");


p7(i+7) <= B(5) and B(i); -- P7 <= {{B[7:9] & {3{B[10]}}},1'd0,B[10],11'd0};

end generate;

p7(12 downto 0)<=( '0' & B(5) & "00000000000");

P8 <= ((B(7) AND B(6)) & '0'& B(6) & "0000000000000");

P9 <= (B(7) & "000000000000000");

w1: wallace_structure port map (P1,P2,P3,P4,P5,P6,P7,P8,P9,product); --Wallace tree

end Behavioral;

