Parallel Hashing,
Compression and Encryption
with OpenCL under OS X
Vasileios Xouris
Master of Science
Computer Science
School of Informatics
University of Edinburgh
2010
Abstract
In this dissertation we examine the efficiency of GPUs with a limited number of
stream processors (up to 32), as found in desktops and laptops, in executing
hashing (MD5, SHA1), encryption (Salsa20) and compression (LZ78) algorithms.
For the implementation, the OpenCL framework was used under OS X. The graphics
cards tested were the NVIDIA GeForce 9400m and GeForce 9600m GT. We identified
an efficient block size for each algorithm that yields optimal GPU
performance. The results show that encryption and hashing algorithms can be
executed on these GPUs very efficiently, replacing or assisting CPU
computations. We achieved a throughput of 159 MB/s for Salsa20, 107.5 MB/s for
MD5 and 123.5 MB/s for SHA1. Compression showed a reduced compression ratio
due to GPU memory limitations and reduced speed due to divergent code paths.
Executing encryption and compression together on the GPU can improve execution
times by reducing the latency of data transfers between CPU and GPU. In
general, a GPU with 32 stream processors provides enough computational power
to replace the CPU in the execution of data-parallel, computation-intensive
algorithms.
Acknowledgements
I would like to thank my supervisor, Paul Anderson, for his invaluable help and
guidance. I would also like to thank Dr. Zhang Le for his helpful remarks.
I would like to thank my family, who always support me in everything I do.
Finally, I would like to thank Stefania for being patient and supportive during this
year.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has
not been submitted for any other degree or professional qualification except as
specified.
(Xouris Vasileios)
Table of Contents
Chapter 1 Introduction
Chapter 2 GPU and OpenCL
  2.1 GPU architecture
  2.2 Open Computing Language (OpenCL)
    2.2.1 Memory model
    2.2.2 Memory access patterns
    2.2.3 OpenCL execution model
Chapter 3 Encryption on GPU
  3.1 Background
  3.2 GPU advantages and disadvantages
  3.3 Relevant work
  3.4 Implementation of Salsa20 & Results
Chapter 4 Hashing on GPU
  4.1 Background
  4.2 GPU advantages and disadvantages
  4.3 Relevant work
  4.4 Implementation of MD5 and SHA1 & Results
Chapter 5 Compression on GPU
  5.1 Background
  5.2 GPU advantages and disadvantages
  5.3 Relevant Work
  5.4 Implementation of LZ78 & Results
Chapter 6 Putting it all together
Chapter 7 Discussion
  7.1 Project difficulties
  7.2 Future Work
Chapter 8 Conclusion
Bibliography
Chapter 1 Introduction
During the last few years, much research has focused on efficient
implementations of well-known algorithms optimized for execution on Graphics
Processor Units (GPUs). GPUs offer an architecture that can exploit data
parallelism very effectively. A mid-range GPU can have around 64 stream
processors, which provides considerable computational power. Entry-level and
mid-range GPUs are found in the laptops and desktops used every day; more
specialized, high-end GPUs offer far more stream processors and enormous
computational power. In this dissertation, we investigate whether the GPUs
found in desktops and laptops can efficiently execute computation-heavy
operations such as hashing, encryption and compression. Until now, published
work on these operations has used powerful GPUs with hundreds of stream
processors.
The motivation for this dissertation was recent research on a fast and secure
backup system for Mac laptops [1]. The main idea is to use the GPU for the
computations involved in a backup system, such as data hashing, encryption and
compression. We implement specific algorithms for each field and examine
whether we can obtain a speedup over a CPU implementation, or at least
execution times that are acceptable and can assist the CPU where possible. To
test our implementations, entry-level and mid-range GPUs with up to 32 stream
processors are used, a number much smaller than that of high-end GPUs
containing hundreds of stream processors. GPUs of this kind are found in most
laptops. The framework used for our implementations is
the OpenCL (Open Computing Language) framework [16]. The advantage of OpenCL
is that it gives the programmer control over all available processing units in
a system, including CPUs and GPUs.
This dissertation is organized in several sections. We start with general
background on GPU architecture and a brief description of the OpenCL
framework, its capabilities and restrictions. We then examine the operations
mentioned above (hashing, encryption and compression) one by one. For each, we
give a brief background and review relevant GPU implementations, discuss how
it fits the GPU architecture, and note its advantages and disadvantages; an
implementation and results section presents the outcome of our research in
each case. After looking at each operation in isolation, we present
conclusions from the combined execution of encryption and compression on the
GPU. The final chapters contain a discussion of the difficulties we faced
during our research and implementation, together with ideas for future work.
Chapter 2 GPU and OpenCL
2.1 GPU architecture
Graphics Processor Units (GPUs) are specialized processors originally designed
to render 3-dimensional graphics. The main difference between the two is that
a CPU core executes a single stream of instructions as fast as it can, while a
GPU executes the same stream of instructions in parallel over multiple data.
GPUs contain a number of streaming multiprocessors (SMs); each SM contains 8
stream processor cores (SPs), 2 special function units, an instruction cache,
a constant cache, a multithreaded instruction unit and a shared memory. GPUs
have a parallel throughput architecture that allows many threads to execute
concurrently. They are designed to handle the complex computations of computer
graphics quickly and efficiently, and they operate on vectors of data very
fast. Because of this, programmers started using them to execute more general
computation-intensive algorithms by exploiting data parallelism. With the
introduction of frameworks such as OpenCL and CUDA, developing GPU versions of
general algorithms became easier.
Until recently, the main problem of General Purpose computing on Graphics
Processor Units was that only floating-point arithmetic was supported inside
pixel shaders. Fortunately, with the introduction of NVIDIA's G80
architecture, integer data types and bitwise operations are now available
[21].
Figure 2.1.1 - CPU versus GPU design (source: [4])
Figure 2.1.1 shows why GPUs are so powerful: they sacrifice sophisticated
control flow in order to fit many stream processors on the chip. Their cache
memories are also much smaller, because GPUs hide memory latency by performing
calculations while waiting for memory accesses rather than by relying on large
caches.
2.2 Open Computing Language (OpenCL)
OpenCL was originally created by Apple Inc. and is now developed by the
Khronos Group. It is a framework that lets applications execute code on GPU
devices. In this section, we discuss how OpenCL maps onto the GPU
architecture.
In OpenCL, a data-parallel function (kernel) is written in a language similar
to C99. To create parallelism, OpenCL divides the total amount of work into
workgroups, and each workgroup is further divided into work items (threads).
Workgroups are executed on SMs, and each work item is executed by an SP. The
total amount of work, called the NDRange in OpenCL terminology, is a
collection of workgroups that will be executed in parallel. The distribution
of workgroups over the available SMs is handled dynamically by OpenCL itself.
Threads are further organized by the SM into groups of 32 called warps, and
all threads within a
warp are executed in parallel. When a warp is delayed for some reason, another
warp is selected for execution in order to hide latency. Because of the SIMT (Single
Instruction Multiple Thread) nature of the SPs, all threads within a warp must
execute the same instruction in order to take full advantage of parallelism [4].
GPUs handle memory latency by switching between workgroups: thousands of
threads are ready for execution at any time, and whenever a group of threads
needs to read from memory, another group immediately takes its place. Unlike
CPUs, where switching between threads costs hundreds of clock cycles, on a GPU
switching costs almost nothing: GPU threads are very lightweight.
2.2.1 Memory model
There are several different memory spaces in the OpenCL architecture. A diagram of
the locations of different memory spaces described in this section can be found in
figure 2.2.1.1. The main and largest memory is global memory, which is
off-chip and usually ranges from 128 MB to several GB on high-end GPUs.
Accessing global memory takes 400 to 600 cycles, which is why accesses to it
must be made carefully. Global memory is accessible to all work items of all
workgroups. A region of global memory is reserved for use as read-only memory
and is called constant memory.
In contrast, OpenCL’s local memory (referred to as shared memory in the CUDA
world) is on-chip, which makes it extremely fast: an access usually takes 4 to
6 cycles. However, shared memory is very small, usually 16 KB, so it should
hold data that are frequently read and updated during computation. Shared
memory is accessible to all work items of a workgroup, so it can also be used
for communication between work items of the same workgroup, which is ideal
when data must be shared among them.
Another useful memory space is the read-only constant cache, which is usually
64 KB and located on-chip. When many threads of a workgroup read the same
constant cache address, it takes just one transaction; reads of different
addresses are serialized. This cache speeds up reads from constant memory by
caching frequently used data. A similar cache, the texture cache, speeds up
reads of image objects.
Finally, private memory (registers) is the fastest memory and is distributed
privately among the work items of a workgroup by the SM. The register file per
multiprocessor is limited, between 8192 and 16384 32-bit registers (32 KB to
64 KB), and is partitioned among threads. If a workgroup needs more registers
than are available, performance suffers from what is known as “register
pressure”. Registers are the best place to store small amounts of frequently
used data [12].
Figure 2.2.1.1 - The different memory spaces of GPU (source: [4])
2.2.2 Memory access patterns
The way a group of threads accesses global GPU memory is very important. As
mentioned before, each global memory transaction can take 400 to 600 cycles,
so it pays to group the memory transactions requested by different work items.
GPU devices can read 4, 8 or 16 bytes in a single transaction, provided the
data are aligned to a multiple of the element size: data of type X must be
stored at an address that is a multiple of sizeof(X) [4].
Half warps (groups of 16 threads) that execute in parallel can be programmed
to read global memory in a coalesced way. This happens when all threads of the
half warp access different elements in an aligned segment of global memory (of
4-, 8- or 16-byte words), which can result in a single 64-byte transaction, a
single 128-byte transaction or two 128-byte transactions. On NVIDIA GPUs of
compute capability 1.0 or 1.1, the accesses must additionally be ordered: the
first thread of the half warp must access the first element of the segment,
the second thread the second element, and so on. On devices of compute
capability 1.2 or higher this restriction does not apply; threads within a
half warp may access addresses within a segment in any order and still produce
a single transaction.
Accessing shared memory requires slightly different behavior to achieve high
bandwidth. Shared memory is split into 16 memory banks, and a half warp's
access completes in a single transaction only if each of its 16 threads
accesses a different bank. If two or more threads of a half warp access the
same bank, those accesses are serialized. The exception is when all threads of
a half warp read the same address: this becomes a broadcast and takes just one
transaction.
At this point we should note that the graphics cards used for this
dissertation (NVIDIA GeForce 9400m, NVIDIA GeForce 9600m GT) have a compute
capability below 1.2.
2.2.3 OpenCL execution model
The OpenCL framework is responsible for the efficient execution of a
data-parallel algorithm. The total amount of work to be executed is called the
NDRange in OpenCL terminology. The NDRange is a grid of thread blocks
(workgroups), and each workgroup contains a number of work items (threads)
that execute in parallel. The OpenCL framework discovers how many SMs are
available on the current GPU and assigns workgroups to all of them, where they
execute in turns, so algorithms scale to a large number of SMs without
problems.
The NDRange should be large, because the bigger it is, the easier it becomes
to hide memory latency. Each SM executes one warp at a time in parallel, and
all workgroups are divided into warps for execution. To keep track of
workgroups and work items during execution, each workgroup has a unique group
ID, and each work item has:
- a unique local ID that distinguishes it from the other work items of the
  same workgroup;
- a unique global ID that distinguishes it from all other work items in the
  NDRange.
A warp consists of threads with consecutive local and global IDs. A
representation of the NDRange appears in figure 2.2.3.1.
Figure 2.2.3.1 - Representation of the NDRange (grid) of OpenCL (source: [4])
Chapter 3 Encryption on GPU
Since the appearance of General Purpose computing on Graphics Processor Units
(GPGPU), GPUs were traditionally used for algorithms dominated by
floating-point computation. Until recently there was no integer support on
GPUs, which made encryption algorithms very poor candidates for GPU execution,
because these algorithms consist of complex operations on integer data types.
With the introduction of the G80 architecture this is no longer a problem, and
encryption algorithms were ready to take a “crash test” on GPUs [21].
3.1 Background
Encryption algorithms are used when a message must be transferred over an
insecure communication channel. The output of the encryption process, usually
the same size as the input, is called the ciphertext. There are two kinds of
encryption: symmetric and asymmetric. In symmetric encryption, a secret key
possessed by both communicating parties is used to encrypt and decrypt the
data. In asymmetric encryption, each user possesses a secret key and a public
key: if user A wants to send a message to user B, A encrypts it with B's
public key, and B decrypts it with B's secret key, which only B knows.
In general, encryption algorithms break the original message into blocks of
equal size and process each block through a function that applies bitwise
operations to it, usually repeated for several rounds. The problem is that if
each block is encrypted independently with the same key, the ciphertext of a
given block will always be the same. This is a serious security problem, since
it can lead to replay attacks: someone might reuse an encrypted message to
impersonate someone else or to request a valid operation again. Malicious
users can also collect a large number of blocks encrypted with the same key
and look for patterns that reveal information about the original message.
For this reason, block ciphers have different modes of operation. A mode of
operation mixes each block's ciphertext with some additional information in
order to prevent replay attacks and keep the encrypted data consistent. For
example, Cipher Block Chaining (CBC) XORs each plaintext block with the
previous block's ciphertext before encrypting it (the first block is XORed
with an initialization vector). The problem with CBC and similar modes is that
the message must be processed sequentially.
Fortunately, there is a mode of operation that lets us exploit data
parallelism in encryption algorithms: Counter mode (CTR). CTR uses a nonce
(initialization data that differ for each run of the encryption algorithm) and
a counter; it combines them in some way (for example by XOR) and encrypts the
result with the secret key. The output of this encryption (the keystream) is
then XORed with the original message block, and the result is the ciphertext.
So in this mode we do not encrypt the message itself; we add to it the noise
produced by encrypting the counter and nonce. The counter is simply a value
guaranteed to be unique over a large number of blocks; the most popular choice
is an actual counter that starts at 0 and increases by 1 per block. The nonce
introduces randomness into the result of the combination with the counter and
prevents replay attacks; it must be unique for each encryption process. To
decrypt the encrypted data, the key and the nonce must be known.
Any encryption algorithm that is to operate in parallel on multiple blocks at
the same time has to run in CTR mode. The information needed for parallel
execution is thus the block number, the nonce, the key and the block of data.
Figure 3.1.1 below illustrates CTR mode, which is the mode we use in our
implementation.
Figure 3.1.1 - The CTR mode that can process blocks in parallel for encryption
(source: [9])
3.2 GPU advantages and disadvantages
First, we present the advantages and disadvantages of GPU implementations of
encryption algorithms compared to the CPU.
The main disadvantage of a GPU implementation is that keystream data must
repeatedly be transferred from the GPU device to the host. For good results,
the communication bandwidth over the PCI Express bus between the two devices
must be sufficient. The transfer operation is the bottleneck of many GPU
algorithms because it is very costly. The initialization latency of a transfer
is usually small, and transfer time grows roughly linearly with data size, so
moving data in very large chunks brings no real benefit and is in any case
impossible given the limited memory on GPUs [15]. Transferring data in very
small chunks is also inefficient, because of the initialization latency
mentioned above.
Fortunately, when the encryption algorithm runs in CTR mode, the only things
that must be transferred from the host to the GPU are the secret key, the
nonce and a counter offset, because in CTR mode we encrypt the combination of
the nonce and the counter rather than the original message. The time needed to
transfer this data is insignificant. We then transfer back to the host an
encrypted sequence (the keystream) for each block, which is combined with the
original text on the CPU, usually by XOR.
GPUs are designed for fast, parallel operations on vectors of floating-point
data, and this is where they are really unbeatable. With GPGPU-capable GPUs,
these benefits extend to more general operations, including integer
arithmetic. The computational power of GPUs far exceeds that of CPUs. For
example, the NVIDIA GeForce 9400 graphics card found in most Mac minis
delivers 54 GFLOPS (billions of floating-point operations per second), an
extremely large figure, while the Intel Core Duo processor paired with the
GeForce 9400m in those machines delivers 25 GFLOPS. GPUs clearly outperform
CPUs in raw computation, and this is their biggest advantage.
Another advantage is that encryption algorithms are very straightforward: they
contain no branches, which makes them ideal for execution on GPU devices. As
mentioned in previous chapters, all threads executed in parallel within a
workgroup must execute the same instruction to take full advantage of the
available parallelism. Since encryption algorithms contain no branches, at any
given time all threads execute the same instruction on different data, and no
thread has to wait for others to finish a different code path.
3.3 Relevant work
Several encryption algorithms have been ported to GPUs in recent years, with
various speedups. The results of these studies are very encouraging, and the
GPU appears to be an ideal platform for executing encryption algorithms.
Before the appearance of OpenCL and CUDA, the traditional OpenGL graphics
pipeline was used to harness GPU computation power. With the introduction of
these frameworks, things became easier: the GPU can now be treated as a device
similar to the CPU, and developers can distribute the encryption process
without knowing low-level graphics details. We will look briefly at some
traditional graphics
pipeline implementations and at some CUDA/OpenCL implementations in more
detail, since the latter is our approach in this dissertation. Most GPU
implementations of encryption choose the AES algorithm.
In [10], the Advanced Encryption Standard (AES) is implemented and tested on a
GeForce 7900 GT, yielding a 5-6x speedup over a CPU implementation running on
an Intel Core 2 Duo (1.86 GHz); the encryption rate achieved was 12 Mb/s. In
[14], an AES implementation built on the graphics pipeline and the Raster
Operations unit (ROP) achieves 108.86 Mb/s. Because fragment processors on
pre-DirectX 10 hardware lack XOR support, the XOR operation in that case takes
place in the ROP.
These implementations follow the traditional way of programming the graphics
pipeline, using the vertex and fragment processors for parallel computation.
Data are passed as texture elements to each fragment processor for independent
processing and are written to the screen frame buffer or to other textures;
the OpenGL API is used for these operations.
In [11], NVIDIA's Compute Unified Device Architecture (CUDA) is used to
implement AES-256, giving a peak performance of 8.28 Gbit/s on a GeForce 8800
GTX with 128 stream processors. The authors identified the bottleneck of their
implementation as the data transfer between the GPU and the host, due to the
limited bandwidth of PCI Express. They chose a block size of 1024 bytes, and
each block is loaded into shared memory for parallel processing. Their
implementation is also faster when a large number of blocks is transferred to
the GPU at a time. A very similar AES approach appears in [17], this time
using the OpenCL framework on a GeForce 8600 GT and an ATI FireStream 9270
(800 stream cores); the results show a speedup by a factor of 11 over a
sequential implementation on a dual-core Intel E8500.
In this dissertation, a cipher with a role similar to AES, Salsa20, is
implemented. Unfortunately there are no academic papers on GPU implementations
of Salsa20, but the work presented in this section serves as a starting point.
3.4 Implementation of Salsa20 & Results
For the purposes of this dissertation, we decided to implement the Salsa20
encryption algorithm in CTR mode, optimized for execution on the GPU [5].
Salsa20 is a stream cipher designed by Daniel Bernstein. We chose Salsa20
because it is faster than AES: the 20 rounds of Salsa20 are faster than the 14
rounds of AES, with Salsa20 requiring 3.93 cycles/byte against 9.2 cycles/byte
for AES at its best reported performance [27]. This makes Salsa20 ideal for
systems that require high throughput, such as backup systems. Salsa20 is also
a stream cipher, so its encrypted output is exactly the size of the input,
whereas AES is a block cipher whose input size must be a multiple of the block
size. Satisfying that condition usually requires padding the last block, so
the encrypted output is slightly larger than the input, which can be a problem
in systems that process a large number of files (such as backup systems).
Salsa20’s basic operations are XOR, rotation and 32-bit addition. To operate,
it needs a 128- or 256-bit key, a 64-bit initialization vector and a 64-bit
counter, and it consists of 20 rounds of mixing operations. Running in the CTR
mode described in the previous section, it can process different 64-byte
blocks in parallel. This, together with the fact that it consists of many
arithmetic and bitwise operations and no branches, makes it ideal for
execution on the GPU.
The first step is to split the Salsa20 code in two parts: the encryption
process that creates a block of keystream, and the actual mixing of the
keystream with the original data (by XOR). The keystream is independent of the
original data, so we can compute it on the GPU and transfer it back to the
host to be XORed there. This way the original data stay in CPU memory, and
transfer time is reduced because only the generated keystream travels back to
the host device.
Because each work item needs to know its counter value, the kernel takes, in
addition to the nonce and the secret key, the following parameters:
- a byte offset, indicating how many bytes have been processed so far;
- a block size, indicating how many bytes each work item is responsible for,
  so it can produce a keystream of the right length;
- the total number of bytes transferred to the GPU this time, used when the
  total work does not divide exactly by the block size, in which case the
  last work item must produce a shorter keystream.
To find its block number, a work item first computes its position among all
work items of all workgroups and then, taking the block size into account,
adds the block offset. The method is demonstrated below:
uint myGroupId = get_group_id(0);
uint myLocalId = get_local_id(0);
uint gsize = get_global_size(0);
uint lsize = get_local_size(0);

/* bytes handled by one workgroup */
uint groupBlockSize = lsize * blocksize;

/* byte range this work item is responsible for */
uint from = myGroupId * groupBlockSize + myLocalId * blocksize;
uint to = from + blocksize - 1;
if (from >= totalbytes) return;
if (to >= totalbytes) to = totalbytes - 1;

/* counter value; 64 is the Salsa20 keystream block size */
ulong myBlock = (bytesOffset / 64)
              + ((myGroupId * groupBlockSize) / 64)
              + ((myLocalId * blocksize) / 64);
Figure 3.4.1 - The calculation of the counter offset (“myBlock”) for each
work item
The nonce, bytes offset and block size are passed in global memory and are
used by all work items. All work items of a workgroup can read the nonce from the
same memory address, which results in just one transaction. Based on previous
work on other encryption algorithms, which found a relatively small optimal block
size of 1024 bytes [11], we chose to process data through registers rather than shared
memory for better performance. The results are written to global memory in
chunks of 16 bytes (128 bit).
A very important issue is that we need to define an optimal block size. By
saying “block size” we mean the amount of data that is distributed to each work
item. A large block size will not cause private memory problems since the keystream
is generated in blocks of fixed size (64 bytes), then it is written in global memory,
and the same private memory is used to generate the next keystream block. A large
block size, however, will cause global memory problems because of the amount of
data of the generated keystream. We need to try different block sizes and find the
optimal one for this method. Note that the optimal block size also
depends on the hardware, so it has to be decided at runtime after querying the
GPU device for the maximum number of work items within a
workgroup. For different GPUs this value may therefore vary, but not significantly. For
example, suppose that our GPU device supports a total of X parallel threads
and we choose a block size of Y; this is the amount of data that can be processed in parallel.
To hide latency, we need to pass to the GPU a multiple Z of this size. The product
X·Y·Z must be less than the total amount of available GPU memory, otherwise
buffer allocation will fail.
Finally, for the transfer of data between the host device and GPU global
memory, pinned memory was used. Pinned memory can provide higher transfer
rates between the two devices, which can reach 5 GB/s on PCIe x16 Gen2 cards [12].
To test our implementation of Salsa20 we used two different graphic cards,
NVIDIA GeForce 9400m and GeForce 9600m GT. We should note that these cards
can be found in the laptops and desktops of ordinary users. The results from these two
cards were compared to a single-threaded and a multithreaded implementation on
an Intel Core 2 Duo at 2.26 GHz. The specifications of GeForce 9400m and 9600m GT
appear below:
Model                  GeForce 9400m    GeForce 9600m GT
Streaming processors   16               32
Memory                 256 MB           256 MB
Clock                  1100 MHz         1250 MHz

Table 3.4.2 - Technical specifications of the graphic cards used for testing
In the next figure we present the resulting times of the encryption of a 200 MB
file. The times that appear include tests that used different block sizes. The block
size refers to the amount of data given to each work item for processing. Execution
times were measured by using the average of 10 executions. We tried different block
sizes in the range of 64 bytes to 16 KB. Bigger block sizes were not tested because of
the limited GPU memory. In the next figure, and generally in all figures from now
on, we will use the following abbreviations:
- GF 9600m GT: execution on the NVIDIA GeForce 9600m GT using OpenCL.
- GF 9400m: execution on the NVIDIA GeForce 9400m using OpenCL.
- CPU 1-thread: sequential execution of the algorithm on the CPU (Intel
Core 2 Duo).
- CPU OpenCL: multi-threaded parallel execution on the CPU (Intel
Core 2 Duo) using the OpenCL framework. As mentioned before, OpenCL
can handle parallel execution on heterogeneous devices, including CPUs,
by distributing work to the available cores, so using OpenCL for CPU
execution lets us take advantage of all available CPU cores and obtain
maximum performance on the CPU.
Figure 3.4.3 - Execution times of Salsa20 for all devices using different block sizes
The results are very interesting. The performance of both
GPU devices is better than that of the single-threaded CPU implementation. The
execution times of the 32 streaming processors of the 9600m GT can compete with
those of the multithreaded CPU implementation. The important point in this graph is
that GPU performance is maximized for relatively small block sizes between 64 and
256 bytes, but it remains acceptable for sizes up to 2048 bytes. For very large block
sizes the performance drops considerably. The main reason is that with large
block sizes each thread has to write more data to global memory, and it is more
difficult to hide memory latency. By using smaller block sizes we can exploit
data parallelism between work items more easily and make sure that we are not
losing performance to memory latency. The CPU implementations do not seem
to be much affected by the block size. The best execution time of the 9600m GT is very
close to the respective multithreaded CPU time. Finally, the best throughput
achieved by a GPU implementation was that of the 9600m GT, equal to 159
MB/s. The respective value for the multithreaded CPU implementation was 180
MB/s, for the single-threaded CPU implementation 49 MB/s, and for the 9400m
93 MB/s. The throughput of the 9600m GT is almost double that of the 9400m,
since the 9600m GT contains twice as many streaming processors.
Device               Throughput (MB/s)
GeForce 9400m        93
GeForce 9600m GT     159
CPU single thread    49
CPU OpenCL           180

Table 3.4.4 - Throughput measurements of the execution of Salsa20 on different devices
In general, our results show that GPUs with a small number of
streaming processors can be used effectively to achieve a high throughput
for the Salsa20 algorithm. The more stream processors available, the better
the throughput we can achieve; the results suggest that with more than 32
stream processors, even better times could be achieved on the GPU.
Finally, the results cannot be compared directly to related work, for two reasons:
first, there is no published work on the Salsa20 algorithm on GPU, and
second, the related work in this field uses GPUs with far greater computation
power and hundreds of cores, while we used GPUs with up to 32 stream processors.
For example, in [11] a throughput of 1035 MB/s is
achieved for AES-256 using a GPU with 128 stream processors, whereas our best
GPU used 32 stream processors for Salsa20 and achieved 159 MB/s.
Chapter 4 Hashing on GPU
4.1 Background
Hashing algorithms create a fixed-size data sequence from a
variable-sized one. In this section, we deal with hashing
algorithms that are used to compute a message digest (fingerprint) of data
sequences. The main characteristics of hashing algorithms are that they can compute
a fingerprint from a large data sequence quickly, that the reverse procedure is
computationally infeasible, and that it is extremely unlikely for two different
inputs to produce the same fingerprint.
These algorithms can help us to identify if there was a transmission error or
some other malfunction that resulted in the alteration of some of the original data.
For example, the digest of a file can be generated at some point; when someone else
wants to copy or download this file he can check if the downloaded file has the
same checksum as the original file. If not, then he knows that there was an error
during transmission and he can try again. The digest doesn’t have to be generated
for a whole file, but we can instead create and check the digests of different blocks of
data transmitted.
Another important point that we need to mention is that these algorithms
are not inherently parallelizable. The reason is that, to compute the message
digest of a file, we need to process all of its data through the hashing algorithm
sequentially, so we are not allowed to split the file into blocks and process them
independently in parallel. That would only work if we kept the digest of each
processed block of data, which could result in a lot of disk space occupied by
checksums of different blocks of the same file instead of a single fixed-size
digest for the whole file. Of course, this attribute of hashing algorithms is desirable
because files with the same content but in different order must generate different
digests. So every block of data processed must take into account the output of
previous blocks of the same data stream. In general, at a high level, hashing
algorithms have this form:
1. Initialize digest variables
2. Process next block of data stream (fixed size, usually 512 bits)
3. Apply the hashing function on this block (which modifies the digest
variables)
4. If there are more blocks to process from the same stream, go to step 2
5. Output digest variables (fixed size)
It is easy to understand that we cannot parallelize this algorithm by processing
different blocks because each new block must use the modified variables of the
previous block.
So how can we take advantage of the parallel nature of GPUs to
compute digests faster? There are two main approaches. The first is to give
each GPU thread a different block of the same data stream and keep a
digest for each of these blocks for later reference. As mentioned above, however,
this would need a lot of extra disk space to store all the computed digests. The second
is to use many independent data streams and let the GPU process one block at a
time from each data stream in parallel; at the next step, another block from each
data stream is processed. In this way, all blocks that depend on each other are
processed sequentially, while at the same time we take advantage of GPU
parallelism. Of course, this approach requires a large number of different data
streams that can be processed in parallel, and in fact this number must be much
larger than the maximum number of concurrent threads on the GPU device to
help hide memory latency.
4.2 GPU advantages and disadvantages
For the purposes of this project, the MD5 and SHA1 algorithms [7][8] were chosen
for testing on GPU. The structure of these algorithms is similar to the one described
in the encryption section. MD5 and SHA1 do not contain branches, and are based on
arithmetic and bitwise operations such as XOR, AND, OR, NOT, left bit rotation,
right shifting and addition modulo 2^32. Another very important advantage of
hashing algorithms is that the output of a large data sequence is very small (128 to
512 bits depending on the algorithm). This minimizes the time needed to transfer
the results from the GPU device back to the host. In fact, the MD5 algorithm
produces a 128-bit digest and the SHA1 algorithm a 160-bit digest. We already know
that data transfer to and from the GPU device can be a bottleneck but in the case of
hashing we do not have to worry a lot about moving data back to the host because
the output of each block has a small fixed size.
The biggest disadvantage of hashing algorithms is their sequential nature that
does not allow us to operate on different blocks of the same data stream in parallel.
We can, however, operate in parallel on different data streams. Another
disadvantage is the large number of blocks that we need to transfer to the GPU.
Although we do not need to transfer back a lot of information, the amount of data
transferred to the GPU can still be a bottleneck.
4.3 Relevant work
Much of the industry work on GPU hashing has focused on cracking
digests. Many programs available on the Internet can use
the GPU devices of a system to crack MD5 and SHA1
password digests. A digest cracker tries to find a data sequence that results in a
given digest when processed through a specific hashing function. It does this
by calculating the digests of many relatively small data sequences until one
matches the given digest. This is the approach discussed in the previous
section, which processes many different data streams in parallel. The most
well-known such program is the Lightning Hash Cracker by Elcomsoft, reaching a
brute-force peak performance of 608 million passwords per second on a GeForce
9800 GX2 (2 x 128 stream processors) [19].
In the academic literature, there is a limited number of published papers on MD5
or SHA1 hashing on the GPU; most academic work on such algorithms
has followed an FPGA-based approach. In [20], there is a detailed
implementation of the MD5 algorithm on GPU which computes MD5 digests of
small, equally sized blocks of data in parallel. Again, the main bottleneck of the
implementation appears to be the small bandwidth of PCI Express compared to the
computation power of the GPU device. Each thread is assigned a 512-bit area of
shared memory that it uses to store each processed chunk of data for further
processing. The main limitation of this approach is that, due to the limited shared
memory (16 KB), the implementation can only be tested with workgroups of fewer
than 256 work items; a larger number of work items would require a larger shared
memory. The results of this work show a peak performance of 1400 Mbps for a
large input size using an NVIDIA GeForce 9800 GTX+ (128 stream processors). Other
implementations [18] use the constant memory, which can be fast because its
cache is located on-chip. In that paper the SHA1 algorithm is
implemented on GPU and achieves a rate of 2.5 GB/s on an NVIDIA GeForce 9800
GTX+ (128 stream processors).
4.4 Implementation of MD5 and SHA1 & Results
For the MD5 algorithm, the "RSA Data Security, Inc. MD5 Message Digest
Algorithm" [7] was used as a starting point. Some modifications were needed so that
the code would compile for execution on the GPU. These modifications included
the removal of unsupported code and a duplication of the “Md5Update”
function so that it can support pointer parameters that refer to different address
spaces (vector variables in registers and GPU global memory). For the SHA1
algorithm, a simple implementation was used that can be found in [26].
For both algorithms a similar approach is used. Data is passed to the GPU's
global memory in large blocks. Then the hardware scheduler of the GPU creates
workgroups according to the given parameters. Each work item of a workgroup
identifies its position in a similar way as in the encryption implementation described
in the previous chapter. A large file of 200 MB was used to run the tests and to
simulate parallel operation on multiple data streams. The modified code was compiled as an
OpenCL kernel. We decided to use registers for the processing of our data. We knew
from the beginning that this would force us to use small block sizes, but the parallel
nature of the GPU can support this decision. By using registers we are sure that we
have very low latency when reading our data. Each thread reads its assigned block
in small pieces that it processes sequentially. The size we chose for these pieces was 16
bytes, because with this size we can use the built-in OpenCL vector
type “char16” and achieve aligned access to global memory. The
same vector type was used when storing the calculated digest back to global
memory. The digest of MD5 is exactly 16 bytes (128 bits); the digest of SHA1 is 20
bytes (160 bits).
Again for the transfer of data between the host device and GPU global memory,
pinned memory was used just like in the encryption implementation. Pinned
memory can provide higher bandwidth.
For the testing procedure, we used the same graphic cards and CPU as in the
encryption (GeForce 9400m, GeForce 9600m GT, Intel Core 2 Duo at 2.26 GHz). For
specifications please refer to table 3.4.2. In the next figures we present the resulting
times of the MD5 and SHA1 hashing of a 200 MB file. Please note that in both cases,
GPU and CPU implementation, each block of data was treated as a separate data
stream in order to simulate an environment with multiple independent data
streams. The times that appear include tests that used different block sizes. The
block size refers to the amount of data given to each work item for the calculation of
an independent MD5 hash. The execution times were acquired using the average of
10 executions.
Figure 4.4.1 - Execution times of MD5 for all devices using different block sizes
In figure 4.4.1, the results of the MD5 algorithm are presented. We can see that for
small block sizes, the single-threaded CPU implementation appears to be faster than
the 9400m GPU. As the block size grows, the 9400m takes a
significant lead over the single-threaded CPU implementation. Beyond a
4 KB block size, there is no significant improvement for the 9400m. The CPU
execution times appear to be almost independent of the block size: both the
single-threaded and the OpenCL CPU implementations are not affected much by it.
The 9600m GT performance is almost 2 times better than that of the 9400m. The big
difference in execution times between the 9400m and the 9600m GT comes from their
difference in the number of stream processors (16 vs 32) and in clock frequency.
Both GPU implementations are faster than the single-threaded CPU one. The
multithreaded CPU implementation seems to be the fastest, but by using a more
powerful GPU with more stream processors we could get a speedup.
From figure 4.4.1, we can say that an optimal block size for each work
item in the MD5 GPU implementation is between 1024 and 4096 bytes. Very small
block sizes are not good for GPU implementations, because more
and more work items require transactions with global memory in order to read
data. In this case, hiding latency is not very efficient because of the small number of
computations that each work item performs compared to the amount of data
read from and written back to global memory. For example, a block size of
8192 bytes requires one 128-bit write transaction per 8192 bytes of input,
while a block size of 64 bytes requires 128 times as many such transactions.
The difference between the hashing algorithm and the encryption algorithm
that we discussed in the previous chapter is that in this implementation each work
item also needs to read data from the global memory and this appears to be the
bottleneck here.
Table 4.4.2 shows the throughput achieved, measured in MB/s. The
maximum GPU throughput, 107.5 MB/s, was observed with the GeForce
9600m GT.
Device               Throughput (MB/s)
GeForce 9400m        57.2
GeForce 9600m GT     107.5
CPU single thread    48.8
CPU OpenCL           190.5

Table 4.4.2 - Throughput measurements of the execution of MD5 on different devices
To conclude the MD5 section, we can say that GPU devices with
a small number of stream processors, available in most desktops and laptops, can be
used efficiently for MD5 computations, and can also co-operate with the
CPU for maximum performance. At least 32 stream processors are
desirable in order to achieve good performance.
In figure 4.4.3 and table 4.4.4 below we present the results of the
SHA1 implementation. The results are quite similar to those of
MD5, which is natural since SHA1 is based on the principles of MD5. The analysis of
the results is also similar. The general trend is that execution times improve as
the block size grows, but beyond a block size of 512 bytes there is no
significant improvement. Again the multithreaded CPU implementation seems to be
the fastest, but execution times on the GPU devices improve as the number of
stream processors grows (16 vs 32 for the 9400m and 9600m GT
respectively). So a GPU device with 32 or more stream processors can really assist or
replace the CPU in SHA1 hashing computations; the 32 stream processors of the
9600m GT seem to be enough to replace the CPU in the calculation of SHA1 digests.
Figure 4.4.3 – Execution times of SHA1 for all devices using different block sizes
Device               Throughput (MB/s)
GeForce 9400m        51.9
GeForce 9600m GT     123.5
CPU single thread    30.4
CPU OpenCL           155

Table 4.4.4 - Throughput measurements of the execution of SHA1 on different devices
Chapter 5 Compression on GPU
5.1 Background
Compression is an essential operation. A lot of data are compressed every day in
order to reduce their size and make them more suitable for transfer over the
Internet. There are two different types of compression: Lossy and lossless
compression. Lossy compression refers to algorithms that reduce
the size of a file at a cost to its quality. It is used on photos,
sound, video and, more generally, on files whose main characteristics remain
recognizable at reduced quality.
On the other hand, lossless compression refers to algorithms that
reduce a file's size in such a way that, after decompression, we get
back exactly the file that was originally compressed. This kind of compression is
mostly used on files such as text files, executable files etc. In this section, we
explore further the prospects of lossless data compression on GPU. There are
many compression algorithms that take advantage of the fact that data
sequences contain large identical sub-sequences that we can encode with smaller
representations. We are going to implement the dictionary-based Lempel-Ziv 78
(LZ78) algorithm [13] for execution on GPU, so this is a good place for a brief
description of it. Dictionary-based algorithms are often used because of
their simplicity, and simple algorithms work better on the GPU.
The LZ78 algorithm uses a dictionary that is updated while traversing the
input data; it also keeps the longest sequence found so far in the
dictionary (called the prefix). Input is processed byte by byte. Each time a new
character is read, a search takes place to find out whether the sequence
{prefix + new character} is present in the dictionary. If it is, we append the
new character to the prefix and keep reading characters, following the same
procedure until a match in the dictionary cannot be found. At that point, we add
a new dictionary entry containing the sequence {prefix + new character}, we reset
the prefix, and we output the pair {position of the prefix in the dictionary,
new character}. This is a compressed sequence. The procedure continues,
constantly updating the dictionary with new sequences and outputting references
to it, until there is no more input. The opposite operation, decompression,
follows the same technique by constructing an identical dictionary and following
the references.
5.2 GPU advantages and disadvantages
After getting a clearer understanding of how lossless compression algorithms
work, we present which of their characteristics prevent the full exploitation of
the GPU's computation power and how we can deal with these problems. Here we
should note that the problems of moving data to and from the GPU, discussed in
the sections about encryption and hashing, also apply here.
Synchronization. The main idea behind compression algorithms is to find
repeated sequences of characters in a file and replace them with a shorter
representation depending on their frequency. This works best
when there is a central dictionary structure that controls the execution of the
algorithm and optimizes the compression ratio by keeping as much information as
possible. As mentioned in previous chapters, the GPU performs best when executing
many threads in parallel, which means that these threads must operate on
independent data. It also means that a thread cannot use information gathered by
other threads unless there is some kind of synchronization between them, which
would slow down the whole procedure; the shared data would then have to be moved
back and forth between the GPU and the host to feed subsequent blocks, making the
algorithm even more complex. The only efficient way to implement
a lossless compression algorithm on GPU is to sacrifice compression ratio in order
to get the desired parallelization. This can be done by compressing different blocks
of data independently, treating them as different data streams. This reduces the
compression ratio a little but speeds up the whole procedure.
Complex and branched algorithms. Compression algorithms contain many
branches, “if” and “while” statements that sometimes force different
threads to follow different execution paths. As a result, different
threads execute different instructions, which serializes execution in
some parts of the code. There is not much we can do to avoid
this in a GPU implementation, so this is an important disadvantage. Moreover,
compression algorithms contain few arithmetic operations; they are mostly about
searching for patterns, so we cannot take full advantage of the computation power
of GPUs.
Another important issue, when dealing with GPUs, is the limited memory
supplied and the restrictions for memory allocation of current parallel programming
frameworks for GPUs like OpenCL and CUDA. Dynamic memory allocation is not
supported in running kernels so we need to know in advance information about the
size of the current block. When dealing with compression and decompression, the
amount of memory needed for the compressed/decompressed data is not always
known in advance. A way to overcome this problem is to adopt some conventions
that help us deal with it. For example, the compressed form of a block
can be given a maximum size equal to the original size plus some header information
about the compressed block. To decompress a block, we need to know in
advance the size of the original block, by reading the appropriate header
information, so that we can easily allocate the memory required for decompression.
Apart from this, compression algorithms need to allocate memory for a number of
sub-operations. This requires a re-implementation of the compression algorithm
following the GPU framework's standards. A successful GPU implementation must
supply enough pre-allocated memory to the (de)compression kernel in order to
successfully (de)compress all blocks without running out of memory. The
limited GPU global memory and the large number of concurrent threads dealing
with different blocks is an important problem that needs to be solved.
All problems discussed above plus the complex nature of compression
algorithms must be taken into account. The main structure of the algorithm needs to
be optimized and modified in order to satisfy all GPU restrictions and to take
advantage of all GPU benefits.
5.3 Relevant Work
There are no directly relevant academic papers on lossless data compression on
GPU. On the contrary, there are many research papers on lossy compression, and
especially on lossy image compression on GPU, because GPUs are
optimized for handling image data. The absence of academic
papers can be explained by the nature of lossless compression algorithms:
as described in the previous section, they do not fit well on the GPU
architecture.
Nevertheless, there is relevant work on algorithms for parallel block
compression in general, which is the method that we will use in the implementation
part. In [22] a parallel block compression approach is used to speed up
dictionary-based compression algorithms. Because the parallel
processing of blocks may result in a reduced compression ratio with independent
dictionaries, a joint dictionary construction is proposed in which different
compression processes reference a shared dictionary.
A well-known block compression program is bzip2 [23], which combines several
algorithms, including the Burrows–Wheeler transform [24] and Huffman coding
[25]. It works on blocks and compresses each block independently. The problem is
that it operates on large blocks, usually between 100 and 900 KB, which makes it a
bad candidate for GPUs due to their limited memory.
5.4 Implementation of LZ78 & Results
For compression, the LZ78 algorithm was chosen. Before this choice,
several other compression libraries were examined, such as bzip, gzip and others,
but they were too complicated for the GPU architecture: large code bases with many
branches and heavy memory operations. This is why we decided to
implement an LZ78 version that fits well on the GPU and then test it in practice.
We must note that this implementation was created for the GPU architecture;
CPU implementations can be much faster because of the CPU's large memory
and freedom of memory allocation. For the purposes of this dissertation, we decided
to create an implementation that fits the GPU architecture and to test it on several
devices.
Our main concern was to find ways to speed up the compression process as
much as possible. From the beginning, it was clear that our bottleneck would be the
transferring process to and from the GPU. For this reason, we have to choose a
relatively large global size of data to be compressed each time with respect to the
total available memory of the GPU device. Of course, we must keep in mind that
these parameters depend on our hardware and the PCI express bandwidth. On
different systems, we need to make sure that the full bandwidth is used.
The main idea for compression on GPU is to split the data and process
blocks in parallel, compressing them independently. We can follow two
approaches here: either give a block of data to a workgroup, or give a block of
data to each work item. The first approach can lead to a better compression ratio,
but it needs some kind of synchronization between work items. The idea is to create
a shared dictionary for each workgroup that all work items within it can
update and reference. The problem with this approach is that synchronization
leads to delays and reduces the efficient use of parallelism. This approach is
not implemented here but can be considered as potential future work of this
dissertation.
The second approach, assigning an independent small block to each work item,
is faster but results in a reduced compression ratio. For the implementation
part, we will use this approach.
Another issue is the dictionary size of each work item. LZ78 uses a dynamic
dictionary that is created during the compression process, but because of memory
limits on the GPU we need to cap its size. The bigger the dictionary, the better the
compression ratio we can achieve, but due to the large number of threads that the
GPU platform needs, the dictionary size has to be small. When the dictionary is
full and we want to add a new entry, we do so by replacing the oldest entry
with the new sequence. Instead of using registers to store the dictionary, we could
also use the shared workgroup memory, which has a bigger capacity, usually 16 KB,
and can be as fast as registers when there are no memory bank conflicts
between threads requesting a transaction. Shared memory, unlike global memory,
can serve multiple transactions, up to 16, from different work items in parallel. For
our implementation, we chose to bypass shared memory and copy small chunks of
data into registers each time, for faster execution.
For the current implementation, we chose a small dictionary size of 256
entries for a number of reasons.
1. The first and most important reason is the limited GPU memory. Each
work item must have a small dictionary if we are to guarantee that we
will not run out of memory.
2. The second reason is that only a small number of bits is needed to
represent a dictionary reference: an entry in a 256-entry dictionary can
be referenced with 8 bits.
3. Finally, our implementation uses a sequential search to find a match in
the dictionary, so a larger dictionary would result in more search time.
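To make the dictionary design concrete, the following C sketch (an illustration written for this text, not the actual kernel code; the 16-byte cap on stored sequences and all names are our assumptions) shows a 256-entry dictionary with sequential search and oldest-entry replacement:

```c
#include <string.h>

#define DICT_SIZE 256   /* 8-bit references, as discussed in the text */
#define MAX_SEQ   16    /* assumed cap on the length of a stored sequence */

typedef struct {
    unsigned char seq[MAX_SEQ];
    int len;                      /* 0 means the slot is unused */
} Entry;

typedef struct {
    Entry entries[DICT_SIZE];
    int count;                    /* entries currently in use */
    int oldest;                   /* cursor for oldest-entry replacement */
} Dict;

/* Sequential search: return the 8-bit index of a matching entry, or -1. */
static int dict_find(const Dict *d, const unsigned char *seq, int len) {
    for (int i = 0; i < d->count; i++)
        if (d->entries[i].len == len &&
            memcmp(d->entries[i].seq, seq, (size_t)len) == 0)
            return i;
    return -1;
}

/* Insert a sequence; once the dictionary is full, overwrite the oldest entry. */
static void dict_add(Dict *d, const unsigned char *seq, int len) {
    int slot;
    if (d->count < DICT_SIZE) {
        slot = d->count++;
    } else {
        slot = d->oldest;
        d->oldest = (d->oldest + 1) % DICT_SIZE;
    }
    memcpy(d->entries[slot].seq, seq, (size_t)len);
    d->entries[slot].len = len;
}
```

Because `dict_find` scans at most 256 fixed-size entries, both functions need no dynamic allocation, which matches the OpenCL restriction discussed below.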
As we said before, the OpenCL framework does not support dynamic memory
allocation, which is a problem in compression/decompression functions because we
cannot know the compressed and decompressed sizes in advance. To bypass memory
allocation issues we adopt some conventions:
When each work item completes the compression of a block of data, it also
saves the compressed data size. In this way, at decompression time the
decompression function knows that the next compressed chunk of that size will
expand to a fixed block size, so the memory needed can be pre-allocated.
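The size convention can be illustrated with a small C sketch (illustrative only; the 4-byte size header and the record layout are our assumptions, not details taken from the implementation):

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 1024  /* fixed uncompressed block size per work item */

/* Append one compressed block as a [size][payload] record.
   Returns the number of bytes written to the output stream. */
static size_t write_record(unsigned char *out,
                           const unsigned char *payload, uint32_t size) {
    memcpy(out, &size, sizeof size);           /* 4-byte size header */
    memcpy(out + sizeof size, payload, size);  /* compressed payload */
    return sizeof size + size;
}

/* Read one record back. The caller knows every record decompresses to
   BLOCK_SIZE bytes, so a BLOCK_SIZE output buffer can be pre-allocated.
   Returns the number of bytes consumed from the input stream. */
static size_t read_record(const unsigned char *in,
                          unsigned char *payload, uint32_t *size) {
    memcpy(size, in, sizeof *size);
    memcpy(payload, in + sizeof *size, *size);
    return sizeof *size + *size;
}
```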
For the compressed output, buffers of the same size as the input data were pre-
allocated to store the compressed data. We adopt the convention that if the
compressed data turns out larger than the input, the input is stored
unchanged instead.
For the testing procedure, we used the same graphics cards and CPU as in the
encryption tests (GeForce 9400m, GeForce 9600m GT, Intel Core 2 Duo at 2.26 GHz).
For specifications, please refer to table 3.4.2.
In this section we present the times for compressing a 9.3 MB file using our
LZ78 implementation. The reported times cover tests with different block sizes;
the block size refers to the amount of data given to each work item for the
compression of a different block of data. For all tests, a dictionary of 256
entries was used.
Figure 5.4.1 - Execution times of LZ78 for all devices using different block sizes
In figure 5.4.1, we can see the results of the implemented LZ78 algorithm. On
the GPUs, execution time is lowest for relatively small block sizes between
128 and 1024 bytes. This is because small blocks do not contain enough
information to take full advantage of the dictionary; as a result there are
fewer replacements, which causes fewer threads to follow divergent paths. As the
block size grows, more and more threads follow different paths. The 9400m
is always slower than the sequential CPU implementation. The 9600m
GT fares better: its execution times are reduced by 50% compared
to those of the 9400m, which again can be explained by their difference in stream
processors (16 vs. 32). The performance of the 9600m GT is always better than the
sequential CPU implementation, but for large block sizes the performance drops. In
general, the LZ78 algorithm performs best when running as a multithreaded CPU
program (OpenCL CPU).
Before drawing any conclusions, we have to look at how these block sizes behave
in terms of the compression ratio achieved. The next figure presents the
compressed size achieved for the same 9.3 MB file after parallel block compression
with different block sizes.
Figure 5.4.2 - Compressed size achieved with different block sizes using our
LZ78 implementation with a small fixed-size dictionary
We can see that for very small block sizes, the compressed size is large and
nearly unaffected by compression. This is because small blocks do not allow the
algorithm to fill all available positions of the dictionary. The chosen dictionary
size was 256 entries, so blocks of 64, 128, 256, or 512 bytes cannot take full
advantage of it, because each entry can contain several bytes. Fewer dictionary
entries mean fewer possible compressed sequences. That is why we see an
improvement after a block
size of 512 bytes. A CPU implementation with an unbounded (or very large)
dictionary would give much better compressed sizes.
From figures 5.4.1 and 5.4.2, we can state that for the specific
parameters we selected, an optimal block size for each work item is between
512 and 1024 bytes, because these sizes give good execution times and a relatively
good compression ratio.
To conclude, the results show that GPU memory limitations can be very
harmful to the resulting compressed size. Moreover, the nature of compression
algorithms does not allow full exploitation of the GPU's computation power. GPUs
are not yet ready for this task.
Chapter 6 Putting it all together
In this chapter, we examine how some of the algorithms discussed earlier can be
combined for execution on the GPU in order to process a single stream of data more
efficiently. We already know that a stream of data can be divided into small blocks
for parallel encryption and compression. Hashing algorithms, on the other hand, are
strictly sequential and have to operate on each block in order; combining a
sequential algorithm with parallel algorithms is not optimal on the GPU.
So in this section we discuss how compression and encryption can be
combined on the GPU to get maximum performance. The idea is to
move blocks to the GPU, compress them, then encrypt them, and finally transfer
them back to the host. By combining the two operations on the GPU, we reduce
the time required to transfer data from the host to the GPU and back,
compared to executing encryption and compression independently (figure
6.1). A compressed stream of data also leaves less data to encrypt;
unfortunately, the exact compressed size cannot be known in advance, so buffers
must be allocated, and data transferred back, for the worst possible case. In
figure 6.1, the red arrows represent operations that require recurrent data
transfers and make heavy use of the PCI Express bandwidth, while the green arrows
indicate operations that happen immediately on the device. From figures 6.1a and
6.1b it is clear that by combining encryption and compression we reduce the total
time needed to move data between the two devices: heavy transfers over PCI
Express are reduced from 3 to 2.
Figure 6.1 - (a) Encryption and compression executed separately,
(b) Combined execution
Our goal at this point is to decide on an efficient block size that fits both
encryption and compression. According to the results presented in the
encryption chapter, small block sizes of up to 2048 bytes give the best performance.
On the other hand, the compression results indicate that block sizes smaller
than 1024 bytes suffer from a reduced compression ratio. In general, larger block
sizes yield a better compression ratio, but if we take into account the inability
of GPUs to supply enough memory, we soon realize that we cannot use very large
block sizes: we need a very large number of threads in flight, each assigned to
its own block of data, so the limited GPU memory prevents us from satisfying both
conditions. An efficient block size for both encryption and compression appears
to be between 1024 and 2048 bytes. The procedure of this combined operation
appears in figure 6.2. Each work item is responsible for a block of data equal to
the chosen block size. It compresses the block and then encrypts the output of
the compression. After this, it stores the final output size of the compressed
block and the compressed/encrypted block (C/E) in the appropriate place in global
memory. The size information is needed because the host must know the output size
of each block in bytes in order to recover it; this information is also needed
for the decryption/decompression operation.
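The per-work-item procedure can be sketched in C as follows. This is purely illustrative: `toy_compress` and `xor_encrypt` are hypothetical stand-ins for the real LZ78 and Salsa20 routines, used only to show the compress-then-encrypt-then-store-size ordering:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy "compression": drops repeated adjacent bytes (illustration only). */
static size_t toy_compress(const uint8_t *in, size_t n, uint8_t *out) {
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (m == 0 || out[m - 1] != in[i])
            out[m++] = in[i];
    return m;
}

/* Toy "encryption": XOR with a keystream byte (stand-in for Salsa20). */
static void xor_encrypt(uint8_t *buf, size_t n, uint8_t key) {
    for (size_t i = 0; i < n; i++)
        buf[i] ^= key;
}

/* What one work item does: compress its block, encrypt the result, and
   return the compressed size, which is stored alongside the C/E block in
   global memory so the host can recover each block. */
static size_t process_block(const uint8_t *in, size_t n,
                            uint8_t *out, uint8_t key) {
    size_t m = toy_compress(in, n, out);
    xor_encrypt(out, m, key);
    return m;
}
```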
Figure 6.2 - Each work item (Wn) compresses a block and then encrypts the
compressed output
Chapter 7 Discussion
In previous chapters we examined algorithms and developed GPU
implementations. In this chapter we discuss the derived results in detail and
evaluate them critically. The algorithms used were of a different nature, and
some of them (e.g., the compression part) had to be re-implemented from scratch
in order to fit the GPU architecture.
We researched three different categories of algorithms: hashing,
encryption and compression. We also examined how encryption and compression
can be executed on the GPU with just one call by determining a block size that
works well for both. For hashing and encryption, the available algorithms are
all very similar, so the results generalize somewhat beyond Salsa20, MD5 and
SHA1. The compression implementation, on the other hand, was the trickiest.
The reason is that many compression algorithms exist, each based on a different
approach, resulting in very different algorithms which may or may not fit the
GPU. We chose an algorithm that was relatively simple and could be parallelized
easily, at the cost of speed and compression ratio.
The results of hashing and encryption are very straightforward: the GPU
implementations are much more effective than a single-threaded CPU
version. The results also show that more powerful GPUs can easily outperform a
multithreaded CPU implementation. Mid-range GPUs can also be very efficient in
these tasks and assist or replace CPUs. For Salsa20, we achieved acceptable
results for small block sizes between 64 and 2048 bytes; in fact, block sizes of
64 and 128 bytes seem to be optimal for our implementation. MD5 and SHA1 peaked
at block sizes of 1024 or 2048 bytes, with acceptable performance in the range
of 512 to 4096 bytes.
The results for the compression part are not very encouraging. As explained in
the relevant section, some characteristics of compression algorithms, such as the
compression ratio, are degraded. The block sizes that gave an efficient balance
of speed and compression ratio were 1024 and 2048 bytes; bigger block sizes
improved the compression ratio but hurt speed.
Finally, the combined execution of the encryption and compression operations
improves performance. This is a natural result: every block of data stays on the
GPU longer and is used for more computations, so the ratio of computation to data
transferred increases, which is the whole point of parallelism on the GPU:
perform more computation per unit of data moved. It would be good if the hashing
part could also be combined with the other two operations on the GPU, but, as
mentioned before, its sequential nature prevents this: a large block of data can
be divided into sub-blocks for independent encryption and compression, but not
for the calculation of a digest with a hashing algorithm.
At this point, we would also like to discuss the results in light of our primary
motivation, which was to use GPU computation power to assist the CPU with the
operations required by a backup system (hashing, encryption, compression).
The hashing and encryption results were very promising on the GPU, but compression
had many problems. An efficient backup system could therefore use the CPU to
compress files and then send them to the GPU for encryption and hashing in a
pipelined way. According to the results, efficient systems need GPU devices
with 32 or more stream processors. In general, taking all the results of this
dissertation into account, we can state that the performance of each algorithm
improves by nearly 50% when the number of available stream processors is doubled
from 16 to 32.
7.1 Project difficulties
During the implementation phase of this project we ran into a number of
difficulties. In this section we discuss the most important of them.
For the purposes of this project we had to implement a number of algorithms of
different kinds. We found existing implementations of these algorithms and tried
to modify them to fit the GPU architecture. The problem was that they were
designed for optimal execution on the CPU, and the GPU compiler does not support
the entire C language. For example, memory-related functions such as memcpy are
not supported in GPU kernels, so whenever memory had to be copied we had to do it
manually, replacing these functions with explicit loops.
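A replacement of this kind is a plain element-by-element loop, as in the following sketch (shown as host-side C for illustration; in an actual kernel the pointers would carry OpenCL address-space qualifiers such as __global or __private):

```c
/* Manual byte copy of the kind used inside kernels where memcpy is
   unavailable: copy n bytes from src to dst, one element at a time. */
static void copy_bytes(unsigned char *dst, const unsigned char *src,
                       unsigned int n) {
    for (unsigned int i = 0; i < n; i++)
        dst[i] = src[i];
}
```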
Another difficulty was that the GPU has several different address spaces
(described in previous chapters), so for optimal execution we had to move data
between these address spaces explicitly.
The debugging process also turned out to be much more difficult than we
expected. At the time of writing, GPU devices do not support output functions
such as printf, so checking the contents of variables at runtime was not an easy
task. We had to create an extra buffer in GPU global memory where we stored any
information needed for debugging, and then inspected those values by transferring
them back and printing them on the host device. The problem with this approach is
that when a bug forced the kernel to crash, we could not reach the point where
the data is sent back to the host for examination. In that case, we had to
execute small parts of the kernel until we reached the point of the problem.
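The debug-buffer pattern can be sketched as follows (an illustration with invented names and values; the real kernels stored whatever intermediate state was under investigation, indexed by work-item ID):

```c
#include <stdint.h>

#define DEBUG_SLOTS 4  /* debug words reserved per work item (assumed) */

/* Kernel-style function: alongside its real output, each "work item"
   records intermediate values into its own slice of a global debug
   buffer, which the host later transfers back and prints. */
static uint32_t work_item(uint32_t gid, uint32_t input, uint32_t *debug_buf) {
    uint32_t *dbg = debug_buf + gid * DEBUG_SLOTS;  /* this item's slice */
    uint32_t step1 = input * 2;
    dbg[0] = input;   /* snapshot the values we would have printed */
    dbg[1] = step1;
    uint32_t result = step1 + 7;
    dbg[2] = result;
    return result;
}
```

On the host side, reading the buffer back after the kernel finishes reveals the recorded values; as noted above, a crashing kernel never reaches that transfer, which is the main weakness of the technique.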
As in most parallel and distributed systems, debugging many instances executing
in parallel was difficult. We had to coordinate the execution of hundreds of
threads, which was hard at the beginning, but only until we had our first
algorithm running; the same method of coordination and debugging was then used
for all algorithms. The compression algorithms were the most difficult to adapt
to the GPU because of their complex memory operations and their size, which is
why a simple implementation of the LZ78 compression algorithm was created.
7.2 Future Work
The subject of this dissertation spanned several areas of study: hashing,
encryption and compression. We did our best to create algorithms that execute
efficiently on the GPU, but there is always room for improvement.
During our study of the behavior of such algorithms on the GPU, we found very
limited information and academic work on data compression on the GPU. Because of
the limited time available for this project, we could not go very deep into this
area, but we think this research can serve as a starting point for future
implementations. The approach proposed for the LZ78 algorithm, with a shared,
synchronized dictionary between the work items of the same workgroup, could be
examined as future work. Other, more efficient dictionary search techniques,
such as hash tables, could also be tried instead of a sequential search; the
limited GPU memory and the lack of dynamic memory allocation prevented us from
following this approach. Research into building an efficient fixed-size hash
table for the GPU platform would be very helpful for the LZ78 algorithm and
could speed up the process by a large factor.
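As one possible direction, a fixed-size open-addressed hash table needs no dynamic allocation and could replace the sequential search. The following C sketch is hypothetical (all names, the table size and the hash constant are our choices, not part of the dissertation's code):

```c
#include <stdint.h>

#define TABLE_SIZE 256  /* fixed size: no dynamic allocation on the GPU */

typedef struct {
    uint32_t key;   /* e.g., a packed (prefix index, next byte) pair */
    int used;       /* slot occupancy flag */
    int value;      /* dictionary entry index */
} Slot;

/* Open-addressed insert with linear probing over a fixed array.
   Returns the slot index used, or -1 if the table is full. */
static int table_put(Slot *t, uint32_t key, int value) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        uint32_t h = (key * 2654435761u + (uint32_t)i) % TABLE_SIZE;
        if (!t[h].used) {
            t[h].key = key; t[h].used = 1; t[h].value = value;
            return (int)h;
        }
        if (t[h].key == key) { t[h].value = value; return (int)h; }
    }
    return -1;
}

/* Lookup with the same probe sequence; -1 on miss. */
static int table_get(const Slot *t, uint32_t key) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        uint32_t h = (key * 2654435761u + (uint32_t)i) % TABLE_SIZE;
        if (!t[h].used) return -1;
        if (t[h].key == key) return t[h].value;
    }
    return -1;
}
```

A lookup here costs a handful of probes on average instead of scanning all 256 entries, which is the speedup the text anticipates.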
As further future work, these algorithms could be tested on more powerful,
high-end GPUs. The GPUs used in our testing (NVIDIA GeForce 9400m, NVIDIA
GeForce 9600m GT) were entry-level and mid-range parts, but they served well the
purpose of this dissertation, which was to examine whether laptop and desktop
GPUs can be used to speed up these operations.
Another possible extension of this dissertation is to investigate in detail the
different ways in which the CPU and the GPU can cooperate to achieve maximum
performance for hashing, encryption and compression in a pipelined fashion: how
these operations can be synchronized, and what speedups can be achieved over a
pure CPU implementation.
Chapter 8 Conclusion
The computation power of GPU devices grows year by year. As this power grows,
more and more computationally intensive fields make use of it to achieve
greater speedups. Encryption and hashing algorithms have already been tried on
the GPU architecture and have shown great speedups, most of them achieved using
expensive high-end GPUs with very large numbers of stream processors and high
clock frequencies. In this dissertation, we showed that even entry-level and
mid-range GPUs can be used effectively for encryption and hashing; the results
we obtained from the Salsa20 and MD5 algorithms are very encouraging.
Unfortunately, there are fields, such as compression, that are not yet ready to
take full advantage of GPU devices: compression algorithms must be implemented
with many restrictions in mind in order to run on GPUs, and these restrictions
cost speed and compression ratio. In general, we can say that GPUs with 32 or
more stream processors can serve as a powerful computation device for any
algorithm that involves intensive, data-parallel computation.
There is a great deal of unexploited computation power in most users' desktop
and laptop GPUs at this moment. As our results show, this power can be used to
improve the performance of many algorithms. In previous chapters, we referred
many times to the limited GPU memory; we believe that, in a few years, this will
no longer be a problem, as GPUs gain bigger and faster memories. As a result, we
believe that, in the near future, GPUs will be an essential computation device
in every user's computer, either assisting CPUs in computation-intensive
problems or even replacing them.
Bibliography
[1] P. Anderson and L. Zhang, “Fast and Secure Laptop Backups with Encrypted
De-duplication”, under publication in 24th Large Installation System
Administration Conference (LISA 2010), San Jose, CA, November 7–12 2010.
[2] Intel, “Intel microprocessor export compliance metrics”,
http://www.intel.com/support/processors/sb/cs-023143.htm
[3] GPU Gems 2, “Chapter 32. Taking the Plunge into GPU Computing”,
NVIDIA Corporation, 2009,
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter32.html
[4] “OpenCL Programming Guide for the CUDA Architecture, Version 3.1”,
NVIDIA Corporation, 2009.
[5] D.J. Bernstein, “The Salsa20 Family of Stream Ciphers”, New Stream Cipher
Designs: The eSTREAM Finalists, Springer-Verlag, 2008, pp. 84-97.
[6] T. Xie and D. Feng, “How to Find Weak Input Differences for MD5 Collision
Attacks”, Cryptology ePrint Archive, Report 2009/223, 2009.
[7] R. Rivest, “The MD5 Message-Digest Algorithm”, RFC 1321, MIT and RSA
Data Security, Inc., 1992.
[8] D. Eastlake and P. Jones, “US Secure Hash Algorithm 1 (SHA1)”, RFC 3174,
Motorola and Cisco systems, 2001.
[9] Wikipedia, “Block cipher modes of operation”,
http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation
[10] N. Pilkington and B. Irwin “A Canonical Implementation Of The Advanced
Encryption Standard On The Graphics Processing Unit”, In the Innovative
Minds Conference, Johannesburg, South Africa, 7 - 9 July 2008.
[11] S. Manavski, “CUDA Compatible GPU as an Efficient Hardware Accelerator
for AES Cryptography”, Signal Processing and Communications, 2007.
ICSPC 2007. IEEE International Conference on, 2007, pp. 65-68.
[12] “NVIDIA OpenCL Best Practices Guide, Version 1.0”, NVIDIA Corporation,
2009.
[13] J. Ziv and A. Lempel, “Compression of individual sequences via variable-
rate coding”, Information Theory, IEEE Transactions on, vol. 24, 1978, pp.
530-536.
[14] O. Harrison and J. Waldron, “AES Encryption Implementation and Analysis
on Commodity Graphics Processing Units”, Proceedings of the 9th
international workshop on Cryptographic Hardware and Embedded
Systems, Vienna, Austria: Springer-Verlag, 2007, pp. 209-226.
[15] Accelereyes, “GPU Memory Transfer”,
http://wiki.accelereyes.com/wiki/index.php/GPU_Memory_Transfer/
[16] OpenCL - The open standard for parallel programming of heterogeneous
systems, Khronos Group, www.khronos.org/opencl/.
[17] O. Gervasi, D. Russo, and F. Vella, “The AES Implantation Based on OpenCL
for Multi/many Core Architecture”, Computational Science and Its
Applications (ICCSA), 2010 International Conference on, 2010, pp. 129-134.
[18] Lin Zhou and Wenbao Han, “A Brief Implementation Analysis of SHA-1 on
FPGAs, GPUs and Cell Processors”, Engineering Computation, 2009. ICEC
'09. International Conference on, 2009, pp. 101-104.
[19] Lightning Hash Cracker, ElcomSoft Co.Ltd.,
http://www.elcomsoft.com/lhc.html
[20] Guang Hu, Jianhua Ma, and Benxiong Huang, “High Throughput
Implementation of MD5 Algorithm on GPU”, Ubiquitous Information
Technologies & Applications, 2009. ICUT '09. Proceedings of the 4th
International Conference on, 2009, pp. 1-5.
[21] NVIDIA, “EXT_gpu_shader4 OpenGL extension”, 2007,
http://developer.download.nvidia.com/opengl/specs/GL_EXT_gpu_shader4.txt
[22] P. Franaszek, J. Robinson, and J. Thomas, “Parallel compression with
cooperative dictionary construction”, Data Compression Conference, 1996.
DCC '96. Proceedings, 1996, pp. 200-209.
[23] Bzip2 compression algorithm, Julian Seward, http://www.bzip.org/
[24] M. Burrows and D.J. Wheeler, “A block-sorting lossless data compression
algorithm”, Technical Report 124, Digital Equipment Corporation, 1994.
[25] D. Huffman, “A Method for the Construction of Minimum-Redundancy
Codes”, Proceedings of the IRE, vol. 40, 1952, pp. 1098-1101.
[26] Secure Hashing Algorithm (SHA-1) C implementation, Packetizer Inc.,
http://www.packetizer.com/security/sha1/
[27] D. J. Bernstein, “Why switch from AES to a new stream cipher?”,
http://cr.yp.to/streamciphers/why.html