Download - Energy-Efficient Critical Path Enhancement for Blowfish Cryptosystemwl/conf/asaP11reVieW/asap2011... · 2011-05-07 · Energy-Efficient Critical Path Enhancement for Blowfish Cryptosystem

Energy-Efficient Critical Path Enhancement for

Blowfish Cryptosystem

Abstract—Data security, power consumption, and execution

speed have all become crucial criteria in the new era of

computing and communication technology. In this paper we

present an implementation technique for energy-efficient

hardware acceleration of the Blowfish cipher on the

Virtex-5 Field-Programmable Gate Array (FPGA) platform. We

provide a system design that focuses on accelerating the execution

of the critical path of the Blowfish algorithm, which is the most

computation-intensive component. We carefully implement the

critical path as an embedded coprocessor to improve the overall

throughput. Subsequently, we compare the energy consumption of
the pure software implementation of the Blowfish algorithm with
that of our proposed approach. The results show that our critical
path enhancement design speeds up the execution by up to 59% and
reduces the energy by up to 56.2%, thus achieving our objective.

Keywords— Coprocessor, Cryptography, Blowfish, Hardware

Accelerator, Low-Power, Energy-Efficient.

I. INTRODUCTION

Nowadays, sensitive data are increasingly vulnerable in modern
information systems. The use of the Internet enables worldwide
data transfer, which makes it possible for outside parties to
gain access to the data. Consequently, various cryptographic

algorithms have been successfully implemented and used to

encrypt/decrypt the vital data at the software and hardware

level. Many hardware cryptographic processing cores have

been employed to speed up the overall system, which allows

the cryptosystem to execute at its fastest speed. This

methodology, however, requires more hardware resources, thus
increasing the power consumption. At the present time, power
consumption has a great impact on the semiconductor industry as
the complexity of hardware products keeps increasing, a problem
known as the “Power Wall” [14]. Therefore, we

explore a new way to reduce power consumption while

enhancing the overall performance and minimizing the energy

consumption for the Blowfish cryptosystem. Many of the
cryptographic implementations that have been developed are
widely adopted in systems such as ATM cards and mobile devices,
and are mostly used for electronic commerce (e-commerce). As the
demand for secure communication bandwidth continues to grow,
faster cryptographic processing is required. This serves as the
motivation for our

hardware acceleration approach.

Hardware acceleration speeds up specific operations,

allowing the overall system (including the general purpose

processor and the coprocessor) to execute concurrently in

order to achieve performance improvement. The processor

assigns a specific function to the coprocessor while executing

its own instructions concurrently. For example, Irwansyah et

al. [2] and Hodjat et al. [3] integrated the AES cryptography

[4] as a complete system and interfaced it with the general

purpose processor. This technique is very effective when the
goal is to accelerate as much of the computation as possible so
that the process finishes as quickly as possible. However, we
need to take into consideration that adding more hardware to the
system implies that it will consume more power. Clearly, there is
a trade-off between performance improvement and power
consumption. Moreover, if the system requires additional
hardware accelerators, this becomes a critical concern since they
incur even more power consumption.

Our design is based on an FPGA platform, which allows us to

customize our hardware implementation without going

through the process of realizing the hardware into a physical

chip. Nevertheless, the resources on the FPGA are limited. If

the system is complex enough with multiple accelerators

needed and has occupied most of the FPGA resources, we

might run out of space and probably will end up using several

FPGAs to accommodate the customized hardware. Thus, we

need a system that not only executes fast but also consumes

less power and takes less space or resources in the FPGA

platform. Nuan et al. [9], Hani et al. [10] and Zutter et al. [11]

focused on speeding up the RSA [1] cryptography core by

enhancing the modular exponentiation of square and

multiplication. H. Singpiel et al. [12] implemented the

Blowfish cipher as a coprocessor using the

microEnable FPGA platform. However, they all implemented

the whole cryptography algorithm as a coprocessor. We

followed a similar path, but instead of implementing the

Blowfish cryptography core in its entirety, we selectively

implemented only a single hardware accelerator targeted at the

critical part of the algorithm. We employed hardware in the

form of a customized IP that accelerates the computation, so

that we can observe a performance improvement when we

execute the software code with the new integrated hardware.

The rest of the paper is organized as follows. Section II

describes the Blowfish cryptography and the flow of the

software implementation. Section III covers the system

architecture of our proposed design, which describes all the

components that are used during the implementation of the

hardware accelerator and how it interfaces with the

microprocessor. Section IV presents the experiment results

when adding the customized hardware to our system

architecture. Finally, the conclusion is drawn in Section V.

II. BLOWFISH OVERVIEW

Blowfish [15] is a symmetric-key block cipher developed by
Bruce Schneier in 1993. The block size is 64 bits. The input and
output must be of the same size in bits, and the key length
ranges from 32 up to 448 bits. In

addition, it is a 16-round Feistel cipher [16] and utilizes key-

dependent Substitution-Box (S-Box). The pure software

execution flow of Blowfish is shown in Figure 1. The blf_key()
function initializes the sub-keys, namely the four S-Boxes of 256
entries each and the p-array of 18 entries. The blf_enc()
function encrypts the input data into cipher data. The input is
a 64-bit data element, which is divided into two 32-bit halves
(xl, xr). Next, in each of the 16 rounds, xl is XORed with the
p-array entry indexed by the round, xr is XORed with the output
of blf_f(xl), and the values of xl and xr are swapped. After the
sixteenth round, xl and xr are swapped again to undo the last
swap. Then xr is XORed with p-array entry p17 and xl is XORed
with p-array entry p18. Finally, xl and xr are merged to form the
64-bit cipher-text. The blf_dec() function performs the same
operation as blf_enc(), the only difference being that the
p-array is used in reverse order: p18, p17, …, p1.
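The round structure described above can be sketched in a few lines of Python. This is only an illustration of the Feistel flow, not the implementation used in this work: the p-array `toy_p` and the round function `toy_f` are hypothetical stand-ins (real Blowfish derives its p-array and S-Boxes from the key in blf_key()), and the function signatures are assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF


def blf_enc(block, p, f):
    """Encrypt one 64-bit block using an 18-entry p-array and a round function f."""
    xl = (block >> 32) & MASK32   # upper 32 bits
    xr = block & MASK32           # lower 32 bits
    for i in range(16):
        xl ^= p[i]                # XOR with the p-array entry of this round
        xr ^= f(xl)               # XOR with the round-function output
        xl, xr = xr, xl           # swap the halves
    xl, xr = xr, xl               # undo the swap of the last round
    xr ^= p[16]                   # p17 in 1-based numbering
    xl ^= p[17]                   # p18 in 1-based numbering
    return (xl << 32) | xr


def blf_dec(block, p, f):
    """Decryption is the same flow with the p-array in reverse order."""
    return blf_enc(block, p[::-1], f)


# Hypothetical toy sub-keys and round function, for illustration only.
toy_p = [(0x9E3779B9 * (i + 1)) & MASK32 for i in range(18)]


def toy_f(x):
    return (x * 0x01000193 + 0x811C9DC5) & MASK32


plain = 0x0123456789ABCDEF
cipher = blf_enc(plain, toy_p, toy_f)
recovered = blf_dec(cipher, toy_p, toy_f)
```

Because of the Feistel structure, `recovered` equals `plain` for any choice of round function, which is why the same data path can serve both encryption and decryption.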

Figure 1 - Blowfish Flow Chart

Figure 2 - blf_f() Data Flow

The blf_f() function, as shown in Figure 2, decomposes its
32-bit input into four 8-bit blocks. Each block indexes one
S-Box, and each S-Box entry is a 32-bit value. First, the outputs
of S-Box(0) and S-Box(1) are added (modulo 2^32), then the result
of the addition is XORed with the output of S-Box(2). Finally,
the output of S-Box(3) is added (modulo 2^32) to the result of
the XOR operation, which yields the 32-bit output. The blf_f()
function is the most frequently called function, since every
other Blowfish function calls it to encrypt or decrypt the input
data.
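As a concrete illustration of the data flow in Figure 2, the following Python sketch computes the function on caller-supplied S-Boxes. The argument layout (`sbox` as a list of four 256-entry lists) and the toy S-Box contents are assumptions made for the example; the real S-Boxes are key-dependent and produced by blf_key().

```python
MASK32 = 0xFFFFFFFF


def blf_f(x, sbox):
    """Blowfish round function: sbox is a list of four 256-entry tables."""
    a = (x >> 24) & 0xFF          # most significant byte indexes S-Box(0)
    b = (x >> 16) & 0xFF          # next byte indexes S-Box(1)
    c = (x >> 8) & 0xFF           # next byte indexes S-Box(2)
    d = x & 0xFF                  # least significant byte indexes S-Box(3)
    t = (sbox[0][a] + sbox[1][b]) & MASK32   # 32-bit addition
    t ^= sbox[2][c]                          # XOR
    return (t + sbox[3][d]) & MASK32         # 32-bit addition


# Hypothetical toy S-Boxes, chosen so the result is easy to check by hand.
toy_sbox = [[i + k * 1000 for i in range(256)] for k in range(4)]

# (1 + 1002) XOR 2003 = 1080, and 1080 + 3004 = 4084
result = blf_f(0x01020304, toy_sbox)
```

The function consists of four table lookups, two additions and one XOR, which is what makes it a compact candidate for a dedicated hardware block.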

III. SYSTEM ARCHITECTURE

The use of an FPGA eases the implementation of customized
hardware since it enables the system to be reconfigurable and
reprogrammable. There is no need to manufacture the actual
hardware to verify the functionality of the design, which
minimizes the time to production and the associated cost. We
used the Virtex-5 FPGA board [7] to implement our proposed
system.

The system architecture is shown in Figure 3.

Figure 3 - Hardware Architecture

The 32-bit general purpose processor MicroBlaze [6] has a

5-stage pipeline and runs at 125 MHz. MicroBlaze can access

the local memory Block RAM (BRAM) through the Local

Memory Bus (LMB). The BRAM provides 64 KB of memory space. If
there is no external memory, the BRAM contains all the
instructions and data to be executed by the processor.
MicroBlaze has a RISC architecture. It accesses instructions and
data via separate buses, the Instruction LMB (ILMB) and the Data
LMB (DLMB), respectively.

The ILMB and DLMB are each 32 bits wide, and MicroBlaze accesses
them in 1 clock cycle. Moreover, MicroBlaze communicates with
any connected peripheral through the Processor Local Bus (PLB),
which is also 32 bits wide.

The FPGA platform is a memory-mapped system and every

component is assigned a memory range. Therefore, if

MicroBlaze needs to access a peripheral, it must provide the

address location and/or the data. Besides the BRAM, instructions
and data can be stored in the off-chip memory, DDR_SDRAM. The
off-chip DDR_SDRAM has up to 256 MB of memory storage. Part of
this memory can be cached for instructions and data. MicroBlaze
has access to the memory ports through the Instruction Xilinx
Cache Link (IXCL) and Data Xilinx Cache Link (DXCL), both 32
bits in width. The

XPS Timer is used to measure the number of cycles a program

takes to execute either on MicroBlaze or any component

connected through the PLB bus. We also use the XPS

Interrupt to allow the execution of a program or event to be

interruptible at any time if a request with higher priority needs

to execute. The RS232_UART is used for displaying the

output on the host computer.

Finally, blf_hw_f() is our customized hardware that implements
the functionality of the blf_f() function from the pure software
implementation of Blowfish. MicroBlaze must provide the address
location of this IP and the 32-bit input data; the customized
hardware then performs the operation and returns a 32-bit
output, as shown in Figure 3.
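The memory-mapped calling convention can be pictured with a small behavioural model in Python. Everything specific here is hypothetical and only serves the illustration: the base address `BLF_HW_F_BASE`, the two register offsets, the `Bus` class, and the placeholder operation inside the peripheral (a bitwise NOT instead of the real S-Box logic). It is not a description of the actual PLB interface.

```python
MASK32 = 0xFFFFFFFF


class BlfHwFModel:
    """Behavioural stand-in for the custom IP: one input and one output register."""
    DATA_IN = 0x0    # hypothetical register offset for the 32-bit operand
    DATA_OUT = 0x4   # hypothetical register offset for the 32-bit result

    def __init__(self, f):
        self._f = f
        self._result = 0

    def write(self, offset, value):
        if offset == self.DATA_IN:
            self._result = self._f(value & MASK32) & MASK32

    def read(self, offset):
        return self._result if offset == self.DATA_OUT else 0


class Bus:
    """Minimal memory map: each peripheral owns an address range."""

    def __init__(self):
        self._regions = []   # list of (base, size, device)

    def attach(self, base, size, device):
        self._regions.append((base, size, device))

    def _find(self, addr):
        for base, size, device in self._regions:
            if base <= addr < base + size:
                return device, addr - base
        raise ValueError(f"no peripheral mapped at {addr:#010x}")

    def store(self, addr, value):
        device, offset = self._find(addr)
        device.write(offset, value)

    def load(self, addr):
        device, offset = self._find(addr)
        return device.read(offset)


BLF_HW_F_BASE = 0xC0000000   # hypothetical base address of the IP
bus = Bus()
bus.attach(BLF_HW_F_BASE, 8, BlfHwFModel(lambda x: x ^ MASK32))


def blf_hw_f(x):
    """Software-side wrapper: store the operand, then load the result."""
    bus.store(BLF_HW_F_BASE + BlfHwFModel.DATA_IN, x)
    return bus.load(BLF_HW_F_BASE + BlfHwFModel.DATA_OUT)
```

From the software's point of view the accelerator is just an address: one store delivers the operand and one load retrieves the result, so the call site of blf_f() can be swapped for blf_hw_f() without touching the rest of the code.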

IV. EXPERIMENTAL RESULTS

The experiment consists of encrypting and decrypting 64-bit
(8-byte) data using the Blowfish function blf_f() or blf_hw_f().
The program flow is shown in Figure 4. First, the pure software
code is stored in the local memory BRAM and executed without the
customized hardware. The flow goes as follows: the blf_key(),
blf_enc() and blf_dec() functions call blf_f(), and everything
is executed on MicroBlaze. We observe that it takes
approximately 576,425 clock cycles to finish the process of
encrypting and decrypting the 8-byte data set, as shown in
Table I. Since MicroBlaze runs at 125 MHz, it takes about 4.61 ms
to execute the entire process. Then, we applied the same
procedure to encrypt/decrypt larger data input sets, each time
increasing the size to $2^{n+2}$ bytes,
$n = 1, 2, 3, \ldots, \log_2(\text{Memory\_Size}) - 2$.
Thus, for a memory size of 64 KB, the test set is $2^{3}$,
$2^{4}$, and all the way to $2^{16}$ bytes. As the data size
increases, more clock cycles are required to finish the entire
process; hence the execution time increases accordingly.

Figure 4 - Blowfish Flow with/without Custom Hardware

Next, MicroBlaze interfaces with our customized IP, which is
embedded into the FPGA. Part of the pure software Blowfish code
is now replaced by the corresponding hardware. Here, the native
software function blf_f(), which takes a 32-bit input and
produces a 32-bit output as described in Figure 2, is replaced
by the coprocessor blf_hw_f(). The FPGA system is a
memory-mapped system, thus each peripheral or embedded IP is
accessed by providing its corresponding memory address location.
The blf_hw_f() coprocessor is invoked by providing its memory
location and the 32-bit data. Thus, we execute the modified
Blowfish software code, which calls blf_hw_f() instead, as shown
in Figure 4. The same data that was passed to the native
software function blf_f() is now loaded into blf_hw_f(). We
observe that encrypting/decrypting 8 bytes of data takes 367,225
clock cycles, as shown in Table II. This leads to an execution
time of 2.94 ms. The overhead of accessing the hardware
accelerator through the PLB bus is 7 clock cycles, plus 2 to 3
cycles to perform a load or store operation.

Table I - Blowfish without Acceleration and BRAM

# Bytes   Clk Cycles (10^6)   Exec Time (ms)   Energy (mJoules)   uJoules / Byte
8         0.5764              4.6114           5.7729             721.6149
16        0.5785              4.6287           5.7946             362.1595
32        0.5829              4.6632           5.8378             182.4321
64        0.5915              4.7324           5.9244             92.5681
128       0.6088              4.8706           6.0974             47.6361
256       0.6433              5.1471           6.4435             25.1701
512       0.7125              5.7001           7.1358             13.9371
1K        0.8507              6.8060           8.5203             8.3206
2K        1.1272              9.0178           11.2892            5.5123
4K        1.6801              13.4415          16.8271            4.1082
8K        2.7861              22.2888          27.9030            3.4061
16K       4.9979              39.9836          50.0546            3.0551
32K       9.4216              75.3730          94.3580            2.8796
64K       18.2689             146.1519         182.9646           2.7918

Table II - Blowfish with Acceleration and BRAM

# Bytes   Clk Cycles (10^6)   Exec Time (ms)   Energy (mJoules)   uJoules / Byte
8         0.3672              2.9378           3.7374             467.1690
16        0.3685              2.9486           3.7512             234.4495
32        0.3713              2.9704           3.7789             118.0902
64        0.3767              3.0139           3.8342             59.9101
128       0.3876              3.1010           3.9450             30.8201
256       0.4093              3.2750           4.1664             16.2751
512       0.4529              3.6232           4.6094             9.0026
1K        0.5399              4.3195           5.4952             5.3664
2K        0.7140              5.7122           7.2668             3.5483
4K        1.0621              8.4974           10.8101            2.6392
8K        1.7585              14.0680          17.8968            2.1847
16K       3.1511              25.2091          32.0701            1.9574
32K       5.9364              47.4914          60.4166            1.8438
64K       11.5069             92.0558          117.1098           1.7870

The blf_hw_f() coprocessor takes about 45 clock cycles to
perform the operation. The pure software function blf_f() takes
about 80 clock cycles to finish the execution and return the
result. Hence, the custom IP performs 1.78 times faster than its
pure software counterpart. If we compare the execution time of
the pure software approach with that of the hardware accelerator
approach, we observe that the overall speedup of our customized
hardware is more than 56%. As we increase the input data size,
the speedup achieved converges to a constant reading of 59%, as
illustrated in Figure 5.

We can apply Amdahl’s Law to verify the speedup we obtained
while adding the custom hardware. Based on our observation, 86%
of the total execution time of the Blowfish code is spent on the
blf_f() function, which is then converted into hardware, and
this conversion acquires a speedup of 1.78. Then we can apply
the given formula of Amdahl’s Law:

$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - p) + p/s}$$

where p = 0.86 and s = 1.78. The overall speedup we get is
1.6032, which also means the ideal speedup that can be obtained
is 60.32%. In the experiment we obtained an overall speedup of
up to 59%. The minor difference is due to the fact that the
custom IP is connected to the PLB bus; thus accessing the custom
IP takes about 7 cycles more. Therefore, this slightly affects
the overall execution time of the system.

We utilized XPower Analyzer [8] to get the power consumption of
the system. XPower Analyzer is a Xilinx tool dedicated to
analyzing the power consumption of a post-implemented place and
routed design. It collects information about the hardware design
and provides an accurate estimation of the power utilization.
Using the XPower Analyzer tool, the overall power consumption
for the hardware system without the customized IP and BRAM
memory is a total of 1.2565 Watts, of which 1.1058 is static
power and 0.1507 is dynamic power. After adding our customized
hardware, the overall power consumption is a total of 1.2754
Watts, of which 1.1108 is static and 0.1645 is dynamic. The
extra hardware corresponds to a 1.5% increase in power
consumption. The addition of the extra hardware consumes 188
slice registers used as flip-flops and 148 lookup tables (LUT).
The custom hardware corresponds to no more than 8% of hardware
resource on the FPGA when compared to the implementation without
the use of hardware acceleration.

Table I and Table II show the number of clock cycles and
execution time of the two different implementations. We can
accurately estimate the power consumption using the XPower
Analyzer tool to calculate the energy for the Blowfish execution
without the customized IP and with the customized IP,
respectively. From these two tables, we can observe that our
design executed the Blowfish algorithm effectively and the
energy consumption is reduced by more than 54% after adding the
embedded peripheral. Furthermore, it can also be observed from
the tables that by increasing the input data size the energy
consumed per byte decreases accordingly. Figure 5 shows the
performance of our hardware design over the software
implementation when executed on the local memory BRAM with a
maximum capacity of 64 KB. From the figure, we can observe that
for a data input set of 128 bytes or less, the performance is
about 1.57 times faster. For a data input set bigger than 128
bytes, the performance slightly increases to reach up to 1.59
times faster as it reaches the maximum memory capacity. Figure 6
shows the percentage of energy reduction by using our customized
hardware acceleration. Recall that the energy depends on the
execution time of the process and the overall system power
consumption. Thus, the addition of the hardware acceleration
contributes to a 1.5% increase in power consumption. On the
other hand, the execution time is reduced since part of Blowfish
now runs on the custom IP while the rest is executed on
MicroBlaze. The obtained speedup and energy reduction are shown
in Table II. We can also observe that for an input size of 128
bytes or less, the energy reduction is about 54.5%, and for
bigger input sizes the energy is slightly reduced further until
it achieves a 56.2% energy reduction for a data input set of 64
KB. Hence, the hardware acceleration design not only gained
speedup but also reduced the energy consumption.
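The Amdahl's Law estimate used in this section can be checked numerically. The sketch below is a plain evaluation of the formula in Python; the function name is an assumption made for the example. It uses the unrounded per-call speedup 80/45 (the cycle counts quoted for blf_f() and blf_hw_f()), which is what reproduces the reported 1.6032; plugging in the rounded value 1.78 gives a marginally higher result of about 1.605.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the run time is accelerated by s."""
    return 1.0 / ((1.0 - p) + p / s)


p = 0.86       # fraction of execution time spent in blf_f()
s = 80 / 45    # software cycles / hardware cycles per call, about 1.78

overall = amdahl_speedup(p, s)
print(f"{overall:.4f}")   # prints 1.6032
```

The remaining 14% of the run time is untouched by the accelerator, which is why the overall gain stays well below the 1.78 achieved on the function itself.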

Figure 5 - Speedup of with-Acceleration over without-Acceleration on BRAM (y-axis: Performance; x-axis: Input Size (Byte))

Figure 6 - Energy Reduction of with-Acceleration over without-Acceleration on BRAM (y-axis: Energy Reduction %; x-axis: Input Size (Byte))

Figure 7 illustrates the normalized energy consumed per byte
over the 8-byte input case. Figure 7 shows that by increasing
the data input size, the average energy consumed to
encrypt/decrypt a byte drops dramatically. For input size 64K,
it only takes 0.4% of the energy for a single byte when
comparing with the input size of 8 bytes case. Even though we
reached the size limit of the local memory, from the figures,
we can see that by incrementing the input size the speedup,
energy reduction and normalized energy per byte will
eventually converge to 1.59, 56.2%, and 0.004 respectively.

Figure 7 - Normalized Energy per Byte Consumption with Accelerator on BRAM (y-axis: Normalized Energy per Byte; x-axis: Input Size (Byte))
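The normalization behind Figure 7 can be illustrated with two rows of Table II. The Python sketch below recomputes the energy per byte for the 8-byte and 64 KB inputs and takes their ratio; the helper name is an assumption made for the example, and the energy values are copied from Table II.

```python
def uj_per_byte(energy_mj, n_bytes):
    """Energy per byte in microjoules, given total energy in millijoules."""
    return energy_mj * 1000.0 / n_bytes


e_8 = uj_per_byte(3.7374, 8)                 # 8-byte input, Table II
e_64k = uj_per_byte(117.1098, 64 * 1024)     # 64 KB input, Table II

normalized = e_64k / e_8   # energy per byte relative to the 8-byte case
```

The ratio comes out near 0.004, i.e. roughly 0.4% of the per-byte energy of the 8-byte case, because the fixed cost of the key setup is spread over many more bytes.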

Next, we store the Blowfish code in the 256 MB off-chip memory
and follow the same process as with the BRAM memory, executing
Blowfish with and without the use of our customized hardware.
Figure 3 shows the hardware architecture including the off-chip
memory. From the figure, we can observe that MicroBlaze can
access the DDR_SDRAM either through the Xilinx Cache Link (XCL)
bus or through the PLB bus. MicroBlaze accesses the instruction
and data cache on DDR_SDRAM directly through the IXCL and DXCL
respectively. For the other part of the memory, MicroBlaze
accesses any other address location through the PLB. Accessing
the off-chip memory through the PLB bus is very expensive even
though the DDR_SDRAM runs at 200 MHz [15]. MicroBlaze runs at
125 MHz and DDR_SDRAM runs 1.6 times faster. BRAM runs at the
same speed as MicroBlaze, hence we should expect the overall
execution of Blowfish to run faster in the off-chip memory.
However, MicroBlaze cannot access the memory directly. Xilinx
provides an external memory controller that connects to the PLB
bus to allow MicroBlaze to communicate with the off-chip memory
through this interface. The external memory controller is the
largest and most complex component inside the FPGA, thus it
requires several hundred cycles to complete a load/store
operation. Table III provides the number of clock cycles,
execution time and energy consumption for the entire execution
of Blowfish on MicroBlaze and the off-chip memory. We can see
that it takes 16,816,937 clock cycles to perform the encryption
and decryption of 8-byte input data. Next, running the same
input data set with our customized hardware enabled, we can see
that it takes 7,372,868 clock cycles for the process of
encrypting/decrypting the same 8-byte data, as shown in Table
IV. Once again, we used the XPower Analyzer tool to gather the
estimated power consumption with the off-chip memory.

Table III - Blowfish without Acceleration and DDR_SDRAM

# Bytes   Clk Cycles (10^6)   Exec Time (ms)   Energy (mJoules)   uJoules / Byte
8         16.8169             134.5355         429.7736           53721.7052
16        16.8800             135.0404         431.3866           26961.6615
32        17.0060             136.0485         434.6068           13581.4631
64        17.2577             138.0623         441.0400           6891.2505
128       17.7614             142.0917         453.9119           3546.1865
256       18.7684             150.1479         479.6474           1873.6227
512       20.7828             166.2625         531.1256           1037.3548
1K        24.8117             198.4942         634.0897           619.2282
2K        32.8688             262.9508         839.9962           410.1544
4K        48.9839             391.8716         1251.8339          305.6235
8K        81.2134             649.7075         2075.4906          253.3558
16K       145.6728            1165.3828        3722.8155          227.2226
32K       274.5913            2196.7311        7017.4574          214.1558
64K       532.4312            4259.4497        13606.8120         207.6235
128K      1048.1006           8384.8048        26785.2589         204.3553
256K      2079.4590           16635.6724       53142.6555         202.7231
512K      4142.1553           33137.2430       105856.9228        201.9061

Table IV - Blowfish with Acceleration and DDR_SDRAM

# Bytes   Clk Cycles (10^6)   Exec Time (ms)   Energy (mJoules)   uJoules / Byte
8         7.3728              58.9829          189.5151           23689.3935
16        7.4003              59.2028          190.2216           11888.8473
32        7.4535              59.6286          191.5896           5987.17488
64        7.5616              60.4929          194.3666           3036.9777
128       7.7759              62.2076          199.8763           1561.5335
256       8.2055              65.6441          210.9179           823.8979
512       9.0654              72.5240          233.0231           455.1233
1K        10.7840             86.2726          277.1982           270.7013
2K        14.2221             113.7771         365.5715           178.5017
4K        21.0974             168.7797         542.2976           132.3969
8K        34.8501             278.8002         895.7989           109.3505
16K       62.3531             498.8254         1602.7509          97.8242
32K       117.3595            938.8760         3016.6557          92.0610
64K       227.3736            1818.9889        5844.5021          89.1800
128K      447.4005            3579.2046        11500.1632         87.7393
256K      887.4548            7099.6391        22811.4953         87.0189
512K      1767.5622           14140.4979       45434.1268         86.6587

Using the same hardware components with hardware
accelerator as shown in Figure 3, we obtain a power reading
of 3.2129 Watts, of which 2.6159 being static power and
0.5970 being dynamic power. Adding the extra hardware also
constitutes an increase in power consumption and it takes the
same FPGA resources as reported for the BRAM
implementation. For the same hardware architecture except
this time without the hardware accelerator, the power reading
we get is a total of 3.1944 Watts, of which 2.6152 being static
and 0.5792 being dynamic. We can see the hardware
accelerator causes an increase in power consumption of 0.58%.
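The relationship between the power readings and the energy columns of Tables III and IV can be illustrated with a short Python sketch: energy is taken as power multiplied by execution time (Watts times milliseconds gives millijoules). The helper name and the choice of rows are assumptions made for the example, and the energy reduction is computed here as 1 - E_with / E_without, which is an assumed definition.

```python
P_WITHOUT_W = 3.1944   # total power without the accelerator (Watts)
P_WITH_W = 3.2129      # total power with the accelerator (Watts)


def compare(t_without_ms, t_with_ms):
    """Return (E_without, E_with, speedup, energy reduction) for one input size."""
    e_without = P_WITHOUT_W * t_without_ms   # W * ms = mJ
    e_with = P_WITH_W * t_with_ms
    speedup = t_without_ms / t_with_ms
    reduction = 1.0 - e_with / e_without
    return e_without, e_with, speedup, reduction


# Execution times (ms) from Tables III and IV
small = compare(134.5355, 58.9829)        # 8-byte input
large = compare(33137.2430, 14140.4979)   # 512 KB input
```

Under these assumptions the computed energies agree with the tabulated ones to within rounding, the speedup rises from about 2.28 for 8 bytes to about 2.34 for 512 KB, and the energy reduction for 512 KB is about 57.1%. The small increase in power is far outweighed by the shorter run time.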

Figure 8 - Speedup of with-Acceleration over without-Acceleration on DDR_SDRAM (y-axis: Performance; x-axis: Input Size (Byte))

Figure 9 - Energy Reduction of with-Acceleration over without-Acceleration on DDR_SDRAM (y-axis: Energy Reduction %; x-axis: Input Size (Byte))

Figure 10 - Normalized Energy per Byte Consumption with-Acceleration on DDR_SDRAM (y-axis: Normalized Energy per Byte; x-axis: Input Size (Byte))

Figure 8 shows the performance of the modified software

targeting the hardware acceleration over the pure software

implementation. It can be observed that for input data of 128

bytes or less the overall performance is about 2.28 times faster

and for bigger input size the performance increases. If we

increase the input size to more than 512 KB, then we can

accomplish a 2.34 performance since from the figure we can

observe that it merges to that point. Figure 9 shows the

percentage of the energy reduction as

increased. From the figure, we can perceive that increasing the

input size more than 512 KB, the energy reduction also

merges to the maximum energy reduction of 57.1%. Finally,

Figure 10 demonstrates the normalized energy reduction per

byte over the 8 byte input data case. From the figure, we can

see that by increasing the data input size, the average energy

consumed to encrypt/decrypt a byte drops dramatically.

input size 64K, it only takes 0.4% of the energy for a single

byte when comparing with the input size of 8 bytes case

V. CONCLUSION

Implementing an entire software algorithm in hardware is not always the best choice, since larger hardware requires more power to operate. We want not only to accelerate execution but also to reduce energy consumption. We therefore explore the technique of identifying the critical path function of a program, defined as its most computation-intensive component, and realizing it as a hardware accelerator, using Blowfish cryptography as an example design. The coprocessor executes this specific function while the main processor executes the remainder of the code. We achieve an overall performance of up to 1.59 times faster with our proposed system. In addition, we reduce the energy consumption by up to 56.2% when executing Blowfish on the BRAM and hardware accelerator. We also extended the implementation to use the off-chip memory, where the design achieved a maximum speedup of about 2.34 times and a maximum energy reduction of 57.1%. This technique can be applied to other systems to explore ways of minimizing hardware overhead and energy consumption while maximizing overall throughput.

REFERENCES

[1] Introduction of RSA for public-key cryptography. [Online]. Available: http://en.wikipedia.org/wiki/RSA

[2] A. Irwansyah, V. Nambiar and M. Khalil-Hani, “An AES Tightly Coupled Hardware Accelerator in an FPGA-based Embedded Processor Core,” Proc. ICCET’09, vol. 02, p. 521-525, 2009.

[3] S. Hodjat and I. Verbauwhede, “Interfacing a High Speed Crypto Accelerator to an Embedded CPU,” Asilomar SSC’04, vol. 1, p. 488-492, Nov. 2004.

[4] Announcing the Advanced Encryption Standard (AES). [Online]. Available: http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf

[5] P. Giusto and G. Martin, “Reliable Estimation of Execution Time of Embedded Software,” Proc. DATE’01, p. 580-588, Mar. 2001.

[6] MicroBlaze Processor Reference Guide. [Online]. Available: http://www.xilinx.com/support/documentation/sw_manuals/mb_ref_guide.pdf

[7] Virtex-5 FPGA User Guide. [Online]. Available: http://www.xilinx.com/support/documentation/user_guides/ug190.pdf

[8] Xilinx Xpower Analyzer. [Online]. Available: http://www.xilinx.com/products/design_tools/logic_design/verification/xpower_an.htm

[9] W. Nuan, D. Bin, and S. Fu, “FPGA Implementation of Alterable Parameters RSA Public-Key Cryptographic Coprocessor,” ASIC’05,

vol. 2, p. 769-773, 2005.

[10] M. Hani, T. Lin and N. Shaikh-Husin, “FPGA Implementation of RSA Public-Key Cryptographic Coprocessor,” TENCON’00, vol.3, p. 6-11,

2000.

[11] J. Zutter, M. Thalmaier, M. Klein and K. Laux, “Acceleration of RSA

Cryptographic Operations using FPGA Technology,” DEXA’09, vol. 1,

p. 20-25, 2009.

[12] H. Singpiel, H. Simmler, R. Manner, A. C. Castanon, F. Galver-Durand,

J.M.S. de Alcantara and V.C. Alves, “Implementation of

Cryptographic Applications on the Reconfigurable FPGA Coprocessor

microEnable,” SBCCI’00, p. 359-362, 2000.

[13] DDR SDRAM Controller Using Virtex-5 FPGA Devices. [Online].

Available: http://www.xilinx.com/support/documentation/application_notes/xapp851.pdf

[14] When Processors Hit the Power Wall. [Online]. Available:

http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=01493852

[15] Introduction of Blowfish for public-key cryptography. [Online].

Available: http://en.wikipedia.org/wiki/Blowfish_(cipher)

[16] Feistel Cipher definition and usage. [Online]. Available:

http://en.wikipedia.org/wiki/Feistel_cipher