Energy-Efficient Critical Path Enhancement for
Blowfish Cryptosystem
Abstract—Data security, power consumption, and execution
speed have all become crucial criteria in the new era of
computing and communication technology. In this paper we
present an implementation technique for energy-efficient
hardware acceleration of the Blowfish cipher based on the
Virtex-5 Field-Programmable Gate Array (FPGA) platform. We
provide a system design that focuses on accelerating the
execution of the critical path of the Blowfish algorithm, which
is its most computation-intensive component. We implement the
critical path as an embedded coprocessor to improve the overall
throughput. We then compare the energy consumption of the pure
software implementation of the Blowfish algorithm with that of
our proposed approach. The results show that our critical path
enhancement design speeds up execution by up to 59% and reduces
energy by up to 56.2%.
Keywords— Coprocessor, Cryptography, Blowfish, Hardware
Accelerator, Low-Power, Energy-Efficient.
I. INTRODUCTION
Sensitive data are increasingly vulnerable in modern information systems. The Internet enables worldwide data transfer, which also makes it possible for outside parties to gain access to the data. Consequently, various cryptographic algorithms have been implemented and used to encrypt/decrypt vital data at the software and hardware levels. Many hardware cryptographic processing cores have been employed to speed up the overall system, which allows the cryptosystem to execute at its fastest speed. This methodology, however, requires more hardware resources and thus increases the power consumption. Power consumption now has a great impact on the semiconductor industry as the complexity of hardware products keeps increasing, a problem known as the “Power Wall” [14]. Therefore, we explore a new way to limit power consumption while enhancing the overall performance and minimizing the energy consumption of the Blowfish cryptosystem. Many of the cryptographic implementations that have been developed are widely adopted in systems such as ATM cards and mobile devices, and are mostly used for electronic commerce (e-commerce). As the demand for secure communication bandwidth continues to grow, faster cryptographic processing is required. This serves as the motivation for our hardware acceleration approach.
Hardware acceleration speeds up specific operations by allowing the components of the overall system (the general purpose processor and the coprocessor) to execute concurrently. The processor assigns a specific function to the coprocessor while executing its own instructions. For example, Irwansyah et al. [2] and Hodjat et al. [3] integrated the AES cipher [4] as a complete system and interfaced it with the general purpose processor. This technique is effective when the goal is to accelerate as much of the computation as possible so that the process finishes as quickly as possible. However, adding more hardware to the system implies that it will consume more power. Clearly, there is a trade-off between performance improvement and power consumption. Moreover, if the system requires additional hardware accelerators, the problem becomes more critical, since these incur even more power consumption.
Our design is based on an FPGA platform, which allows us to customize our hardware implementation without going through the process of realizing the hardware as a physical chip. Nevertheless, the resources on the FPGA are limited. If the system is complex, needs multiple accelerators, and already occupies most of the FPGA resources, we might run out of space and end up using several FPGAs to accommodate the customized hardware. Thus, we need a system that not only executes fast but also consumes less power and takes fewer resources on the FPGA platform. Nuan et al. [9], Hani et al. [10] and Zutter et al. [11] focused on speeding up the RSA [1] cryptography core by enhancing the square and multiply operations of modular exponentiation. H. Singpiel et al. [12] implemented the Blowfish cipher as a coprocessor using the microEnable FPGA platform. However, they all implemented the whole cryptographic algorithm as a coprocessor. We followed a similar path, but instead of implementing the Blowfish core in its entirety, we selectively implemented only a single hardware accelerator targeted at the critical part of the algorithm. We employed hardware in the form of a customized IP that accelerates the computation, so that we can observe a performance improvement when we execute the software code with the newly integrated hardware.
The rest of the paper is organized as follows. Section II describes the Blowfish cipher and the flow of the software implementation. Section III covers the system architecture of our proposed design, describing the components used in the implementation of the hardware accelerator and how it interfaces with the microprocessor. Section IV presents the experimental results obtained when adding the customized hardware to our system architecture. Finally, the conclusion is drawn in Section V.
II. BLOWFISH OVERVIEW
Blowfish [15] is a symmetric-key block cipher developed by Bruce Schneier in 1993. The block size is 64 bits, so the input and output have the same size, and the key length ranges from 32 to 448 bits. It is a 16-round Feistel cipher [16] that uses key-dependent Substitution-Boxes (S-Boxes). The pure software execution flow of Blowfish is shown in Figure 1. The blf_key() function initializes the sub-keys, which consist of four S-Boxes of 256 entries each and a p-array of 18 entries. The blf_enc() function encrypts the input data into cipher data. The input is a 64-bit data element, which is divided into two 32-bit halves (xl, xr). In each of the 16 rounds, xl is XORed with the p-array entry indexed by the round, xr is XORed with the output of blf_f(xl), and then xl and xr are swapped. After the sixteenth round, xl and xr are swapped once more, which undoes the last swap. Then xr is XORed with the seventeenth p-array entry and xl is XORed with the eighteenth. Finally, xl and xr are merged to form the 64-bit cipher-text. The blf_dec() function performs the same operation as blf_enc(), the only difference being that the p-array entries are used in reverse order: p18, p17, …, p1.
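The round structure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names feistel_encrypt, feistel_decrypt, toy_p and toy_f are hypothetical, the round function is passed in as a parameter so the sketch stays generic, and the toy sub-keys are placeholders rather than real Blowfish key material.

```python
MASK32 = 0xFFFFFFFF

def feistel_encrypt(xl, xr, p, f):
    """Run the 16-round structure on two 32-bit halves.

    p is an 18-entry sub-key list; f is the 32-bit round function
    (blf_f() in the paper, passed in here as a parameter).
    """
    for i in range(16):
        xl = (xl ^ p[i]) & MASK32
        xr = (xr ^ f(xl)) & MASK32
        xl, xr = xr, xl          # swap halves after every round
    xl, xr = xr, xl              # undo the swap of the last round
    xr = (xr ^ p[16]) & MASK32   # seventeenth p-array entry
    xl = (xl ^ p[17]) & MASK32   # eighteenth p-array entry
    return xl, xr

def feistel_decrypt(xl, xr, p, f):
    """Decryption is the same routine with the p-array reversed."""
    return feistel_encrypt(xl, xr, p[::-1], f)

# Toy stand-ins, NOT real Blowfish sub-keys or the real F function.
toy_p = [(0x9E3779B9 * (i + 1)) & MASK32 for i in range(18)]
toy_f = lambda x: ((x * 0x01000193) ^ (x >> 7)) & MASK32

cl, cr = feistel_encrypt(0x01234567, 0x89ABCDEF, toy_p, toy_f)
assert feistel_decrypt(cl, cr, toy_p, toy_f) == (0x01234567, 0x89ABCDEF)
```

The round trip succeeds for any round function, which is the property that lets blf_dec() reuse the blf_enc() flow with only the sub-key order changed.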
Figure 1 - Blowfish Flow Chart
Figure 2 - blf_f() Data Flow
The blf_f() function, as shown in Figure 2, decomposes its 32-bit input into four 8-bit blocks. Each block indexes one S-Box, and each S-Box entry is a 32-bit value. First, the outputs of S-Box(0) and S-Box(1) are added modulo 2^32, and the result of the addition is XORed with the output of S-Box(2). Finally, the output of S-Box(3) is added modulo 2^32 to the result of the XOR operation, which gives the 32-bit output. The blf_f() function is the most frequently called function, since every other Blowfish function calls it to encrypt or decrypt the input data.
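The data flow of blf_f() can be written out as a short Python sketch. It assumes the usual Blowfish convention that the most significant byte of the input indexes S-Box(0); the table toy_sbox is a hypothetical placeholder used only to make the example runnable, since real S-Box contents are produced by blf_key().

```python
MASK32 = 0xFFFFFFFF

def blf_f(x, sbox):
    """Blowfish F function on a 32-bit word x.

    sbox is a list of four 256-entry tables of 32-bit values.
    """
    a = (x >> 24) & 0xFF   # most significant byte indexes S-Box(0)
    b = (x >> 16) & 0xFF
    c = (x >> 8) & 0xFF
    d = x & 0xFF           # least significant byte indexes S-Box(3)
    y = (sbox[0][a] + sbox[1][b]) & MASK32   # addition modulo 2^32
    y ^= sbox[2][c]                          # XOR with S-Box(2) output
    return (y + sbox[3][d]) & MASK32         # final addition modulo 2^32

# Toy tables for illustration only; real S-Boxes come from blf_key().
toy_sbox = [[(k << 8) | i for i in range(256)] for k in range(4)]
print(hex(blf_f(0x01020304, toy_sbox)))  # prints 0x604
```

The function is only two additions and one XOR around four table lookups, which is why it maps naturally onto a small hardware block.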
III. SYSTEM ARCHITECTURE
The use of an FPGA eases the implementation of customized hardware since it enables the system to be reconfigurable and reprogrammable. There is no need to manufacture the actual hardware to verify the functionality of the design, which minimizes the time to production and the associated cost. We used the Virtex-5 FPGA board [7] to implement our proposed system. The system architecture is shown in Figure 3.
Figure 3 - Hardware Architecture
The 32-bit general purpose processor MicroBlaze [6] has a 5-stage pipeline and runs at 125 MHz. MicroBlaze can access the local memory, Block RAM (BRAM), through the Local Memory Bus (LMB). The BRAM provides 64 KB of memory space. If there is no external memory, the BRAM contains all the instructions and data to be executed by the processor. MicroBlaze is a RISC architecture. It accesses instructions and data via separate buses, the Instruction LMB (ILMB) and the Data LMB (DLMB), respectively.
The ILMB and DLMB are 32 bits wide, and MicroBlaze accesses them in 1 clock cycle. Moreover, MicroBlaze communicates with any connected peripheral through the 32-bit Processor Local Bus (PLB). The FPGA platform is a memory-mapped system and every component is assigned a memory range. Therefore, if MicroBlaze needs to access a peripheral, it must provide the address location and/or the data. Besides the BRAM, instructions and data can be stored in the off-chip memory, DDR_SDRAM, which has up to 256 MB of storage. Part of this memory can be cached for instructions and data. MicroBlaze has access to the memory ports through the Instruction Xilinx Cache Link (IXCL) and Data Xilinx Cache Link (DXCL), both 32 bits wide. The XPS Timer is used to measure the number of cycles a program takes to execute either on MicroBlaze or on any component connected through the PLB bus. We also use the XPS Interrupt to allow the execution of a program or event to be interrupted at any time if a request with higher priority needs to execute. The RS232_UART is used for displaying the output on the host computer.
Finally, blf_hw_f() is our customized hardware that implements the functionality of the blf_f() function from the pure software implementation of Blowfish. MicroBlaze must provide the address location of this IP and the 32-bit data; the customized hardware then performs the operation and returns a 32-bit output, as shown in Figure 3.
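The memory-mapped interaction described above can be modelled with a toy address decoder. This is only a behavioural sketch: the classes MemoryMappedBus and FunctionPeripheral, the base address HW_F_BASE, and the register layout (operand at offset 0, result at offset 4) are all hypothetical and are not taken from the paper or from any Xilinx API.

```python
class FunctionPeripheral:
    """Toy peripheral: write an operand at offset 0, read the result at offset 4."""

    def __init__(self, func):
        self._func = func
        self._result = 0

    def write(self, offset, value):
        if offset != 0:
            raise ValueError("only offset 0 is writable in this sketch")
        self._result = self._func(value) & 0xFFFFFFFF

    def read(self, offset):
        if offset != 4:
            raise ValueError("only offset 4 is readable in this sketch")
        return self._result


class MemoryMappedBus:
    """Routes 32-bit accesses to whichever peripheral owns the address."""

    def __init__(self):
        self._regions = []  # entries are (base, size, peripheral)

    def attach(self, base, size, peripheral):
        self._regions.append((base, size, peripheral))

    def _decode(self, addr):
        for base, size, dev in self._regions:
            if base <= addr < base + size:
                return dev, addr - base
        raise ValueError(f"no peripheral mapped at {addr:#010x}")

    def write32(self, addr, value):
        dev, offset = self._decode(addr)
        dev.write(offset, value & 0xFFFFFFFF)

    def read32(self, addr):
        dev, offset = self._decode(addr)
        return dev.read(offset)


HW_F_BASE = 0xC0000000  # hypothetical base address, not taken from the paper

bus = MemoryMappedBus()
# A bitwise NOT stands in for the real blf_hw_f() logic.
bus.attach(HW_F_BASE, 8, FunctionPeripheral(lambda x: x ^ 0xFFFFFFFF))
bus.write32(HW_F_BASE, 0x0F0F0F0F)      # processor supplies address and 32-bit data
print(hex(bus.read32(HW_F_BASE + 4)))   # processor reads back the 32-bit result
```

In the real system each of these two accesses crosses the PLB, so the bus round trip is part of the cost of every call to the accelerator.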
IV. EXPERIMENTAL RESULTS
The experiment consists of encrypting and decrypting 64-bit (8-byte) data using the Blowfish function blf_f() or blf_hw_f(). The program flow is shown in Figure 4. First, the pure software code, stored in the local memory BRAM, is executed without the customized hardware. The flow is as follows: the blf_key(), blf_enc() and blf_dec() functions call blf_f(), and everything is executed on MicroBlaze. We observe that it takes approximately 576,425 clock cycles to finish encrypting and decrypting the 8-byte data set, as shown in Table I. Since MicroBlaze runs at 125 MHz, it takes about 4.61 ms to execute the entire process. We then applied the same procedure to encrypt/decrypt larger data input sets of $2^{i+2}$ bytes, where $i = 1, 2, 3, \ldots, \log_2(\mathrm{Memory\_Size}) - 2$. Thus, for a memory size of 64 KB, the test set sizes are $2^{3}$, $2^{4}$, and so on up to $2^{16}$ bytes. As the data size increases, more clock cycles are required to finish the entire process, and the execution time increases accordingly.
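The list of input sizes used in the experiment can be generated with a few lines of Python. The sketch assumes that the sizes double from 8 bytes up to the memory size, which is consistent with the rows of Table I; the function name input_set_sizes is hypothetical.

```python
import math

def input_set_sizes(memory_size_bytes):
    """Input sizes 2^(i+2) bytes for i = 1 .. log2(memory_size) - 2."""
    top = int(math.log2(memory_size_bytes)) - 2
    return [2 ** (i + 2) for i in range(1, top + 1)]

sizes = input_set_sizes(64 * 1024)
print(len(sizes), sizes[0], sizes[-1])  # prints 14 8 65536
```

For the 64 KB BRAM this gives 14 test points, one per row of the BRAM tables.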
Figure 4 - Blowfish Flow with/without Custom Hardware
Next, MicroBlaze interfaces with our customized IP, which is embedded into the FPGA. Part of the pure software Blowfish code is now replaced by the corresponding hardware. Here, the native software function blf_f(), which takes a 32-bit input and produces a 32-bit output as described in Figure 2, is replaced by the coprocessor blf_hw_f(). The FPGA system is memory-mapped, so each peripheral or embedded IP is accessed by providing its corresponding memory address location. The blf_hw_f() coprocessor is invoked by providing its memory location and the 32-bit data. We therefore execute the modified Blowfish software code, which calls blf_hw_f() instead, as shown in Figure 4. The same data used with the native software function blf_f() is now loaded to blf_hw_f(). We observe that encrypting/decrypting 8 bytes of data takes 367,225 clock cycles, as shown in Table II, which leads to an execution time of 2.94 ms. The overhead of accessing the hardware accelerator through the PLB bus is 7 clock cycles, plus 2 to 3 cycles to perform a load or store operation.
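The execution times quoted for the 8-byte case follow directly from the cycle counts and the 125 MHz clock, as the short calculation below shows. The helper name cycles_to_ms is hypothetical; the cycle counts and clock frequency are the ones given in the text.

```python
CLOCK_HZ = 125e6  # MicroBlaze clock frequency stated in the text

def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count into milliseconds at the given clock."""
    return cycles / clock_hz * 1e3

sw_cycles = 576_425   # pure software, 8-byte input
hw_cycles = 367_225   # with blf_hw_f(), 8-byte input

sw_ms = cycles_to_ms(sw_cycles)
hw_ms = cycles_to_ms(hw_cycles)
speedup = sw_cycles / hw_cycles

print(f"{sw_ms:.2f} ms, {hw_ms:.2f} ms, speedup {speedup:.2f}x")
# prints 4.61 ms, 2.94 ms, speedup 1.57x
```

Because both runs use the same clock, the speedup is simply the ratio of the two cycle counts.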
Table I - Blowfish without Acceleration on BRAM
# Bytes   Clk Cycles (10^6)   Exec Time (ms)   Energy (mJ)   Energy per Byte (uJ/Byte)
8         0.5764              4.6114           5.7729        721.6149
16        0.5785              4.6287           5.7946        362.1595
32        0.5829              4.6632           5.8378        182.4321
64        0.5915              4.7324           5.9244        92.5681
128       0.6088              4.8706           6.0974        47.6361
256       0.6433              5.1471           6.4435        25.1701
512       0.7125              5.7001           7.1358        13.9371
1K        0.8507              6.8060           8.5203        8.3206
2K        1.1272              9.0178           11.2892       5.5123
4K        1.6801              13.4415          16.8271       4.1082
8K        2.7861              22.2888          27.9030       3.4061
16K       4.9979              39.9836          50.0546       3.0551
32K       9.4216              75.3730          94.3580       2.8796
64K       18.2689             146.1519         182.9646      2.7918
Table II - Blowfish with Acceleration on BRAM
# Bytes   Clk Cycles (10^6)   Exec Time (ms)   Energy (mJ)   Energy per Byte (uJ/Byte)
8         0.3672              2.9378           3.7374        467.1690
16        0.3685              2.9486           3.7512        234.4495
32        0.3713              2.9704           3.7789        118.0902
64        0.3767              3.0139           3.8342        59.9101
128       0.3876              3.1010           3.9450        30.8201
256       0.4093              3.2750           4.1664        16.2751
512       0.4529              3.6232           4.6094        9.0026
1K        0.5399              4.3195           5.4952        5.3664
2K        0.7140              5.7122           7.2668        3.5483
4K        1.0621              8.4974           10.8101       2.6392
8K        1.7585              14.0680          17.8968       2.1847
16K       3.1511              25.2091          32.0701       1.9574
32K       5.9364              47.4914          60.4166       1.8438
64K       11.5069             92.0558          117.1098      1.7870
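The energy columns of Table I and Table II can be compared in more than one way, and it helps to be explicit about which ratio a percentage refers to. The sketch below computes two common metrics from the 8-byte and 64 KB rows; the function names are hypothetical and the numbers are copied from the two tables.

```python
# (energy in mJ, energy per byte in uJ) copied from Table I and Table II
sw = {8: (5.7729, 721.6149), 65536: (182.9646, 2.7918)}   # without acceleration
hw = {8: (3.7374, 467.1690), 65536: (117.1098, 1.7870)}   # with acceleration

def energy_ratio_gain(e_sw, e_hw):
    """Relative gain expressed as E_sw / E_hw - 1."""
    return e_sw / e_hw - 1.0

def energy_saving(e_sw, e_hw):
    """Fraction of the software energy that is saved: 1 - E_hw / E_sw."""
    return 1.0 - e_hw / e_sw

for n in (8, 65536):
    e_sw, e_hw = sw[n][0], hw[n][0]
    print(n, f"{energy_ratio_gain(e_sw, e_hw):.3f}", f"{energy_saving(e_sw, e_hw):.3f}")

# Energy per byte at 64 KB relative to the 8-byte case, with acceleration
normalized = hw[65536][1] / hw[8][1]
print(f"{normalized:.4f}")
```

On these rows the first metric evaluates to roughly 54.5% and 56.2%, which matches the BRAM energy-reduction percentages quoted in this section, while the second evaluates to roughly 35% and 36%. The normalized energy per byte at 64 KB comes out at about 0.004 of the 8-byte value.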
The blf_hw_f() coprocessor takes about 45 clock cycles to perform the operation, whereas the pure software function blf_f() takes about 80 clock cycles to finish the execution and return the result. Hence, the custom IP performs 1.78 times faster than its pure software counterpart. If we compare the execution time of the pure software approach with that of the hardware accelerator approach, the overall speedup of our customized hardware is more than 56%. As we increase the input data size, the speedup converges to a constant reading of 59%, as illustrated in Figure 5. We can apply Amdahl’s Law to verify the speedup we obtained while adding the custom hardware. Based on our observation, 86% of the total execution time of the Blowfish code is spent on the blf_f() function, which is then converted into hardware, and this conversion acquires a speedup of 1.78. We can then apply the formula of Amdahl’s Law:
$$\mathrm{Speedup}_{overall} = \frac{1}{(1 - p) + \frac{p}{s}}$$
where p = 0.86 and s = 80/45 ≈ 1.78. The overall speedup we get is 1.6032, which means the ideal speedup that can be obtained is 60.32%. In the experiment we obtained an overall speedup of up to 59%. The minor difference is due to the fact that the custom IP is connected to the PLB bus, so accessing the custom IP takes about 7 cycles more. This slightly affects the overall execution time of the system.
We used XPower Analyzer [8] to obtain the power consumption of the system. XPower Analyzer is a Xilinx tool dedicated to analyzing the power consumption of a post-implementation placed-and-routed design. It collects information about the hardware design and provides an accurate estimate of the power utilization. Using the XPower Analyzer tool, the overall power consumption of the hardware system with BRAM memory and without the customized IP is a total of 1.2565 Watts, of which 1.1058 Watts is static power and 0.1507 Watts is dynamic power. After adding our customized hardware, the overall power consumption is a total of 1.2754 Watts, of which 1.1108 Watts is static and 0.1645 Watts is dynamic. The extra hardware corresponds to a 1.5% increase in power consumption. The extra hardware consumes 188 slice registers used as flip-flops and 148 lookup tables (LUTs). The custom hardware corresponds to no more than 8% of the hardware resources on the FPGA when compared to the implementation without hardware acceleration.
Table I and Table II show the number of clock cycles and execution time of the two implementations. Using the power consumption estimated by the XPower Analyzer tool, we calculate the energy for the Blowfish execution without and with the customized IP, respectively. From these two tables, we can observe that our design executes the Blowfish algorithm effectively and that the energy consumption is reduced by more than 54% after adding the embedded peripheral. Furthermore, it can also be observed from the tables that by increasing the input data size the energy consumed per byte decreases accordingly. Figure 5 shows the performance of our hardware design over the software implementation when executed on the local memory BRAM with a maximum capacity of 64 KB. From the figure, we can observe that for data input sets of 128 bytes or less, the performance is about 1.57 times faster. For data input sets bigger than 128 bytes, the performance slightly increases, reaching up to 1.59 times faster at the maximum memory capacity. Figure 6 shows the percentage of energy reduction obtained by using our customized hardware acceleration. Recall that the energy depends on the execution time of the process and the overall system power consumption. The addition of the hardware acceleration contributes a 1.5% increase in power consumption. On the other hand, the execution time is reduced since part of Blowfish now runs on the custom IP while the rest is executed on MicroBlaze. The obtained speedup and energy reduction follow from Table I and Table II. We can also observe that for input sizes of 128 bytes or less, the energy reduction is about 54.5%, and larger input sizes slightly increase the reduction until it reaches 56.2% for a data input set of 64 KB. Hence, the hardware acceleration design not only gained speedup but also reduced the energy consumption.
Figure 5 - Speedup of with-Acceleration over without-Acceleration on BRAM (x-axis: Input Size (Byte); y-axis: Performance)
Figure 6 - Energy Reduction of with-Acceleration over without-Acceleration on BRAM (x-axis: Input Size (Byte); y-axis: Energy Reduction %)
Figure 7 illustrates the energy consumed per byte, normalized to the 8-byte input case. It shows that by increasing the data input size, the average energy consumed to encrypt/decrypt a byte drops dramatically. For an input size of 64 KB, a single byte takes only 0.4% of the energy it takes in the 8-byte case. Even though we reached the size limit of the local memory, the figures show that as the input size increases, the speedup, energy reduction, and normalized energy per byte eventually converge to 1.59, 56.2%, and 0.004, respectively.
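The Amdahl's Law estimate used in the BRAM analysis can be checked numerically. The sketch below uses the 86% fraction and the 80-cycle and 45-cycle figures quoted in the text; the function name amdahl_speedup is hypothetical.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the run time is accelerated by s."""
    return 1.0 / ((1.0 - p) + p / s)

p = 0.86            # fraction of time spent in blf_f(), from the text
s_exact = 80 / 45   # blf_f() cycles over blf_hw_f() cycles
s_rounded = 1.78    # the rounded value quoted in the text

print(f"{amdahl_speedup(p, s_exact):.4f}")    # prints 1.6032
print(f"{amdahl_speedup(p, s_rounded):.4f}")  # prints 1.6048
```

The quoted value of 1.6032 is reproduced when s is taken as the unrounded ratio 80/45; using the rounded 1.78 changes only the fourth significant digit. Either way the bound sits slightly above the measured 1.59, which is consistent with the bus access overhead mentioned in the text.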
Figure 7 - Normalized Energy per Byte Consumption with Accelerator on BRAM (x-axis: Input Size (Byte); y-axis: Normalized Energy per Byte)
Next, we store Blowfish in the 256 MB off-chip memory and follow the same process as with the BRAM memory, executing Blowfish with and without our customized hardware. Figure 3 shows the hardware architecture including the off-chip memory. From the figure, we can observe that MicroBlaze can access the DDR_SDRAM either through the Xilinx Cache Link (XCL) bus or through the PLB bus. MicroBlaze accesses the cached instructions and data on DDR_SDRAM directly through the IXCL and DXCL, respectively, and accesses any other address location of the memory through the PLB. Accessing the off-chip memory through the PLB bus is very expensive even though the DDR_SDRAM runs at 200 MHz [13]. MicroBlaze runs at 125 MHz, so the DDR_SDRAM runs 1.6 times faster. BRAM runs at the same speed as MicroBlaze, hence we might expect the overall execution of Blowfish to be faster in the off-chip memory. However, MicroBlaze cannot access the memory directly. Xilinx provides an external memory controller that connects to the PLB bus to allow MicroBlaze to communicate with the off-chip memory through this interface. The external memory controller is the largest and most complex component inside the FPGA, and it requires several hundred cycles to complete a load/store operation. Table III provides the number of clock cycles, execution time, and energy consumption for the entire execution of Blowfish on MicroBlaze with the off-chip memory. It takes 16,816,937 clock cycles to perform the encryption and decryption of the 8-byte input data. Next, running the same input data set with our customized hardware enabled, it takes 7,372,868 clock cycles to encrypt/decrypt the same 8-byte data, as shown in Table IV. Once again, we used the XPower Analyzer tool to gather the estimated power consumption with the off-chip memory.
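To put the cost of the off-chip memory in perspective, the 8-byte cycle counts for the two memories can be compared directly. The sketch below is a small calculation on the figures quoted in the text; the dictionary names are hypothetical.

```python
CLOCK_HZ = 125e6  # MicroBlaze clock

# 8-byte encrypt/decrypt cycle counts quoted in the text
bram = {"software": 576_425, "accelerated": 367_225}
ddr = {"software": 16_816_937, "accelerated": 7_372_868}

for mode in ("software", "accelerated"):
    slowdown = ddr[mode] / bram[mode]      # cost of moving off-chip
    ddr_ms = ddr[mode] / CLOCK_HZ * 1e3    # execution time in ms
    print(f"{mode}: {ddr_ms:.2f} ms on DDR_SDRAM, {slowdown:.1f}x the BRAM cycles")

ddr_speedup = ddr["software"] / ddr["accelerated"]
print(f"speedup with accelerator on DDR_SDRAM: {ddr_speedup:.2f}x")
```

The run from DDR_SDRAM needs roughly 29 times the BRAM cycle count in pure software and roughly 20 times with the accelerator, which illustrates how much the external memory controller dominates. Since every software blf_f() call also pays this memory cost, the accelerator's relative benefit is larger off-chip (about 2.28x for 8 bytes).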
Table III - Blowfish without Acceleration on DDR_SDRAM
# Bytes   Clk Cycles (10^6)   Exec Time (ms)   Energy (mJ)   Energy per Byte (uJ/Byte)
8         16.8169             134.5355         429.7736      53721.7052
16        16.8800             135.0404         431.3866      26961.6615
32        17.0060             136.0485         434.6068      13581.4631
64        17.2577             138.0623         441.0400      6891.2505
128       17.7614             142.0917         453.9119      3546.1865
256       18.7684             150.1479         479.6474      1873.6227
512       20.7828             166.2625         531.1256      1037.3548
1K        24.8117             198.4942         634.0897      619.2282
2K        32.8688             262.9508         839.9962      410.1544
4K        48.9839             391.8716         1251.8339     305.6235
8K        81.2134             649.7075         2075.4906     253.3558
16K       145.6728            1165.3828        3722.8155     227.2226
32K       274.5913            2196.7311        7017.4574     214.1558
64K       532.4312            4259.4497        13606.8120    207.6235
128K      1048.1006           8384.8048        26785.2589    204.3553
256K      2079.4590           16635.6724       53142.6555    202.7231
512K      4142.1553           33137.2430       105856.9228   201.9061
Table IV - Blowfish with Acceleration on DDR_SDRAM
# Bytes   Clk Cycles (10^6)   Exec Time (ms)   Energy (mJ)   Energy per Byte (uJ/Byte)
8         7.3728              58.9829          189.5151      23689.3935
16        7.4003              59.2028          190.2216      11888.8473
32        7.4535              59.6286          191.5896      5987.17488
64        7.5616              60.4929          194.3666      3036.9777
128       7.7759              62.2076          199.8763      1561.5335
256       8.2055              65.6441          210.9179      823.8979
512       9.0654              72.5240          233.0231      455.1233
1K        10.7840             86.2726          277.1982      270.7013
2K        14.2221             113.7771         365.5715      178.5017
4K        21.0974             168.7797         542.2976      132.3969
8K        34.8501             278.8002         895.7989      109.3505
16K       62.3531             498.8254         1602.7509     97.8242
32K       117.3595            938.8760         3016.6557     92.0610
64K       227.3736            1818.9889        5844.5021     89.1800
128K      447.4005            3579.2046        11500.1632    87.7393
256K      887.4548            7099.6391        22811.4953    87.0189
512K      1767.5622           14140.4979       45434.1268    86.6587
Using the same hardware components with the hardware accelerator, as shown in Figure 3, we obtain a power reading of 3.2129 Watts, of which 2.6159 Watts is static power and 0.5970 Watts is dynamic power. Adding the extra hardware also constitutes an increase in power consumption, and it takes the same FPGA resources as reported for the BRAM implementation. For the same hardware architecture without the hardware accelerator, the power reading is a total of 3.1944 Watts, of which 2.6152 Watts is static and 0.5792 Watts is dynamic. The hardware accelerator thus causes an increase in power consumption of 0.58%.
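The relationship between the power readings, the execution times, and the energy column for the DDR_SDRAM case can be reproduced with a short calculation. The sketch below uses the two power totals quoted above and the 8-byte execution times from Table III and Table IV; the function name energy_mj is hypothetical.

```python
def energy_mj(power_w, time_ms):
    """Energy in millijoules from power in watts and time in milliseconds."""
    return power_w * time_ms   # W * ms = mJ

P_SW, P_HW = 3.1944, 3.2129        # total power without / with the accelerator (W)
T_SW, T_HW = 134.5355, 58.9829     # 8-byte execution times from Table III / IV (ms)

e_sw = energy_mj(P_SW, T_SW)
e_hw = energy_mj(P_HW, T_HW)
power_increase = (P_HW - P_SW) / P_SW
saving = 1.0 - e_hw / e_sw

print(f"{e_sw:.1f} mJ, {e_hw:.1f} mJ")
print(f"power +{power_increase:.2%}, energy saving {saving:.1%}")
```

The results land within rounding of the 8-byte energy entries in the two tables (about 429.8 mJ and 189.5 mJ). The energy saving computed as 1 - E_hw / E_sw is about 55.9% for 8 bytes; the same formula applied to the 512 KB rows gives about 57.1%. A power increase of only 0.58% is easily outweighed by the shorter execution time.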
Figure 8 - Speedup of with-Acceleration over without-Acceleration on DDR_SDRAM (x-axis: Input Size (Byte); y-axis: Performance)
Figure 9 - Energy Reduction of with-Acceleration over without-Acceleration on DDR_SDRAM (x-axis: Input Size (Byte); y-axis: Energy Reduction %)
Figure 10 - Normalized Energy per Byte Consumption with-Acceleration on DDR_SDRAM (x-axis: Input Size (Byte); y-axis: Normalized Energy per Byte)
Figure 8 shows the performance of the modified software targeting the hardware acceleration over the pure software implementation. It can be observed that for input data of 128 bytes or less the overall performance is about 2.28 times faster, and for bigger input sizes the performance increases. If we increase the input size beyond 512 KB, we can expect a speedup of 2.34, since the figure shows the curve converging to that value. Figure 9 shows the percentage of energy reduction as the input size is increased. From the figure, we can see that as the input size grows beyond 512 KB, the energy reduction also converges, to a maximum of 57.1%. Finally, Figure 10 shows the energy consumed per byte, normalized to the 8-byte input case. From the figure, we can see that by increasing the data input size, the average energy consumed to encrypt/decrypt a byte drops dramatically. For an input size of 64 KB, a single byte takes only 0.4% of the energy it takes in the 8-byte case.
V. CONCLUSION
Implementing an entire software algorithm in hardware is not always the best choice, since larger hardware requires more power to operate. We want not only to accelerate our process but also to reduce the energy consumption. We therefore explored the technique of identifying the critical path function of a program, defined as its most computation-intensive component, and then realizing it as a hardware accelerator, using Blowfish cryptography as an example design. The coprocessor executed the specific function while the main processor executed the remainder of the code. We achieved an overall performance of up to 1.59 times faster using our proposed system. In addition, we reduced the energy consumption by up to 56.2% when executing Blowfish on the BRAM with the hardware accelerator. We also extended the implementation to the off-chip memory, where the design achieved a maximum speedup of about 2.34 and a maximum energy reduction of 57.1%. This technique can be applied to other systems to explore ways of minimizing the hardware overhead and energy consumption while maximizing overall throughput.
REFERENCES
[1] Introduction of RSA for public-key cryptography. [Online]. Available: http://en.wikipedia.org/wiki/RSA
[2] A. Irwansyah, V. Nambiar and M. Khalil-Hani, “An AES Tightly Coupled Hardware Accelerator in an FPGA-based Embedded Processor Core,” Proc. ICCET’09, vol. 02, p. 521-525, 2009.
[3] S. Hodjat and I. Verbauwhede, “Interfacing a High Speed Crypto Accelerator to an Embedded CPU,” Asilomar SSC’04, vol. 1, p. 488-492, Nov. 2004.
[4] Announcing the Advanced Encryption Standard (AES). [Online]. Available: http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf
[5] P. Giusto and G. Martin, “Reliable Estimation of Execution Time of Embedded Software,” Proc. DATE’01, p. 580-588, Mar. 2001.
[6] MicroBlaze Processor Reference Guide. [Online]. Available: http://www.xilinx.com/support/documentation/sw_manuals/mb_ref_guide.pdf
[7] Virtex-5 FPGA User Guide. [Online]. Available: http://www.xilinx.com/support/documentation/user_guides/ug190.pdf
[8] Xilinx XPower Analyzer. [Online]. Available: http://www.xilinx.com/products/design_tools/logic_design/verification/xpower_an.htm
[9] W. Nuan, D. Bin, and S. Fu, “FPGA Implementation of Alterable Parameters RSA Public-Key Cryptographic Coprocessor,” ASIC’05, vol. 2, p. 769-773, 2005.
[10] M. Hani, T. Lin and N. Shaikh-Husin, “FPGA Implementation of RSA Public-Key Cryptographic Coprocessor,” TENCON’00, vol. 3, p. 6-11, 2000.
[11] J. Zutter, M. Thalmaier, M. Klein and K. Laux, “Acceleration of RSA Cryptographic Operations using FPGA Technology,” DEXA’09, vol. 1, p. 20-25, 2009.
[12] H. Singpiel, H. Simmler, R. Manner, A. C. Castanon, F. Galver-Durand, J.M.S. de Alcantara and V.C. Alves, “Implementation of Cryptographic Applications on the Reconfigurable FPGA Coprocessor microEnable,” SBCCI’00, p. 359-362, 2000.
[13] DDR SDRAM Controller Using Virtex-5 FPGA Devices. [Online]. Available: http://www.xilinx.com/support/documentation/application_notes/xapp851.pdf
[14] When Processors Hit the Power Wall. [Online]. Available: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=01493852
[15] Introduction of Blowfish for symmetric-key cryptography. [Online]. Available: http://en.wikipedia.org/wiki/Blowfish_(cipher)
[16] Feistel Cipher definition and usage. [Online]. Available: http://en.wikipedia.org/wiki/Feistel_cipher