1
StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems
NetSysLab, The University of British Columbia
Samer Al-Kiswany
with: Abdullah Gharaibeh, Elizeu Santos-Neto, George Yuan, Matei Ripeanu
2
Computation Landscape
Recent GPUs dramatically change the computation cost landscape.
Floating-Point Operations per Second for the CPU and GPU. (Source: CUDA 1.1 Guide)
A quiet revolution (GPU vs. CPU):
Computation: 367 vs. 32 GFLOPS (128 vs. 4 cores)
Memory bandwidth: 86.4 vs. 8.4 GB/s
Price: $220 vs. $290
HPDC ‘08
3
Computation Landscape
Affordable
Widely available in commodity desktops
Include 10s to 100s of cores (can support 1000s of threads)
General-purpose programming friendly
Recent GPUs dramatically change the computation cost landscape.
4
Exploiting GPUs’ Computational Power
Studies exploiting the GPU:
• Bioinformatics: [Liu06]
• Chemistry: [Vogt08]
• Physics: [Anderson08]
• And many more: [Owens07]
These report 4x to 50x speedups.
But: mostly scientific and specialized applications.
5
Motivating Question
System design is a balancing act in a multi-dimensional space: e.g., given certain objectives, say job turnaround time, minimize total system cost given component prices, I/O bottlenecks, bounds on storage and network traffic, energy consumption, etc.
Q: Does the 10x reduction in computation costs GPUs offer change the way we design/implement (distributed) system middleware?
6
Computationally Intensive Operations in Distributed Systems
• Hashing
• Erasure coding
• Encryption/decryption
• Compression
• Membership testing (Bloom filters)
These operations are computationally intensive and often avoided in existing systems.
They are used in:
• Storage systems
• Security protocols
• Data dissemination techniques
• Virtual machine memory management
• And many more …
7
Why Start with Hashing?
Popular -- used in many situations:
• Similarity detection
• Content addressability
• Integrity
• Copyright infringement detection
• Load balancing
8
How Hashing is Used in Similarity Detection?
[Figure: File A is hashed into block digests X, Y, Z; File B is hashed into W, Y, Z. Only the first block is different, so the two files share two of three blocks.]
ICDCS ‘08
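The compare-by-hash idea in the figure can be sketched in Python (an illustrative host-side sketch, not StoreGPU code; the 4 KB block size is an arbitrary choice):

```python
import hashlib

def block_hashes(data: bytes, block_size: int = 4096) -> list:
    """Hash each fixed-size block; equal blocks yield equal digests."""
    return [hashlib.md5(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]

def similarity_ratio(file_a: bytes, file_b: bytes, block_size: int = 4096) -> float:
    """Fraction of file_a's blocks whose hash also appears in file_b."""
    hashes_b = set(block_hashes(file_b, block_size))
    hashes_a = block_hashes(file_a, block_size)
    return sum(h in hashes_b for h in hashes_a) / len(hashes_a) if hashes_a else 1.0

# Files A and B from the figure: blocks X, Y, Z vs. W, Y, Z.
file_a = b"X" * 4096 + b"Y" * 4096 + b"Z" * 4096
file_b = b"W" * 4096 + b"Y" * 4096 + b"Z" * 4096
print(similarity_ratio(file_a, file_b))  # two of three blocks are shared
```

Only digests are compared, so similarity can be detected without shipping the file contents themselves.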
9
How Hashing is Used in Similarity Detection?
How to divide the file into blocks?
• Fixed-size blocks
• Content-based block boundaries
10
Detecting Content-based Block Boundaries
[Figure: a window of m bytes slides over file i at a fixed offset; each window is hashed, and a block boundary is declared wherever the last k bits of the hash value equal zero, splitting the file into blocks B1, B2, B3, B4.]
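The boundary-detection rule above can be sketched as follows (an illustrative sketch, not StoreGPU code: the window is re-hashed from scratch at every offset, whereas a production implementation would use a rolling hash, and k = 12 is an arbitrary choice):

```python
import hashlib

def find_boundaries(data: bytes, m: int = 20, offset: int = 4, k: int = 12) -> list:
    """Slide an m-byte window over the data in steps of `offset` bytes and
    declare a block boundary wherever the last k bits of the window's hash
    are all zero; the expected block size grows with k."""
    mask = (1 << k) - 1
    boundaries = []
    for pos in range(0, len(data) - m + 1, offset):
        digest = hashlib.md5(data[pos:pos + m]).digest()
        hash_tail = int.from_bytes(digest[-4:], "big")
        if hash_tail & mask == 0:
            boundaries.append(pos + m)  # block ends after this window
    return boundaries
```

Because a boundary depends only on the local window content, inserting bytes near the start of a file shifts early boundaries but leaves later ones aligned, which is what makes content-based chunking robust for similarity detection.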
11
Hashing Use in Similarity Detection – Two Scenarios
I. Computing block hashes: hashing large blocks of data (100s of KB to 10s of MB).
II. Computing block boundaries: hashing a large number of small data windows (a few bytes each).
12
StoreGPU
StoreGPU: a library that exploits GPUs to support distributed storage systems by offloading their computationally intensive functions.
One performance data point: in similarity detection, StoreGPU achieves an 8x speedup and 5x data compression for a checkpointing application.
StoreGPU v1.0 implements the hashing functions used in computing block hashes and block boundaries.
Implication: GPUs unleash a valuable set of optimization techniques into the high-performance systems design space, although GPUs were not designed with this usage in mind.
13
Outline
• GPU architecture
• GPU programming
• Typical application flow
• StoreGPU design
• Evaluation
14
NVIDIA CUDA GPU Architecture
SIMD architecture. Four memories:
• Device (a.k.a. global): slow (400-600 cycles access latency), large (256MB – 1GB)
• Shared: fast (4 cycles access latency), small (16KB)
• Texture: read only
• Constant: read only
15
GPU Programming
NVIDIA CUDA programming model:
• Abstracts the GPU architecture
• Is an extension to the C programming language
  • Compiler directives
  • Provides a GPU-specific API (device properties, timing, memory management, etc.)
Programming is still challenging:
• Parallel (SIMD) programming, and extracting parallelism at large scale
• Memory management
• Synchronization
• Immature debugging tools
16
Performance Tips
• Use 1000s of threads to best use the GPU hardware
  Challenge: limited shared memory and registers
• Optimize the use of the shared memory and the registers
  Challenge: shared memory is small and prone to bank conflicts
17
Shared Memory Complications
Shared memory is organized into 16 banks of 1KB each, interleaved in 4-byte words.
[Figure: banks 0..15; consecutive 4-byte words map to consecutive banks.]
Complication I: concurrent accesses to the same bank are serialized (bank conflict), causing a slowdown.
Complication II: banks are interleaved.
Tip: assign different threads to different banks.
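The bank mapping can be illustrated numerically (a host-side Python sketch of the 16-bank, 4-byte-interleaved layout described above, not device code):

```python
NUM_BANKS = 16   # shared memory banks on the GPUs discussed here
WORD_BYTES = 4   # banks are interleaved in 4-byte words

def bank_of(byte_addr: int) -> int:
    """Bank that serves a given shared-memory byte address."""
    return (byte_addr // WORD_BYTES) % NUM_BANKS

# 16 threads reading consecutive 4-byte words: one distinct bank each.
conflict_free = [bank_of(tid * WORD_BYTES) for tid in range(16)]
# 16 threads reading with a 16-word stride: all hit bank 0 and serialize.
conflicting = [bank_of(tid * 16 * WORD_BYTES) for tid in range(16)]
print(conflict_free, conflicting)
```

The second access pattern is the 16-way bank conflict the tip above is designed to avoid.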
18
Execution Path on GPU – Data Processing Application
T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc
1. Preprocessing
2. Data transfer in
3. GPU Processing
4. Data transfer out
5. Postprocessing
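The five-stage cost model above can be written out directly (a sketch with hypothetical stage times; the point is that for small inputs the transfer and pre/post-processing terms can outweigh T_Processing):

```python
def total_time(t_pre, t_data_h2g, t_proc, t_data_g2h, t_post):
    """T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc"""
    return t_pre + t_data_h2g + t_proc + t_data_g2h + t_post

# Hypothetical numbers (milliseconds): fast GPU kernel, non-trivial transfers.
t_total = total_time(t_pre=2.0, t_data_h2g=5.0, t_proc=3.0, t_data_g2h=1.0, t_post=1.0)
overhead_fraction = (t_total - 3.0) / t_total  # everything except T_Processing
print(t_total, overhead_fraction)
```

With these made-up numbers, three quarters of the total time is overhead around the GPU kernel, which is why the evaluation later dissects runtime by stage.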
19
Outline
• GPU architecture
• GPU programming
• Typical application flow
• StoreGPU design
• Evaluation
20
StoreGPU Design
I. Computing block hashes: hashing large blocks of data (100s of KB to 10s of MB).
II. Computing block boundaries: hashing a large number of small data windows (a few bytes each).
21
Computing Block Hash – Module Design
[Figure: input data on the host machine is split into chunks 1..i; after preprocessing, the data is transferred to the GPU, then from global to shared memory; the chunks are hashed in parallel, the results are transferred to global memory and back out to the host, where the final hash is executed.]
22
Computing Block Hash – Module Design
• The design is highly parallel
• The last step runs on the CPU to avoid synchronization
• The resulting hash is not compatible with standard MD5 and SHA1, but is equally collision resistant [Damgard89]
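A CPU-only sketch of this two-level scheme (thread-pool workers stand in for GPU threads, the chunk size is an arbitrary choice, and the output intentionally differs from plain MD5, as noted above):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def parallel_hash(data: bytes, chunk_size: int = 64 * 1024) -> bytes:
    """Hash fixed-size chunks independently (in parallel), then hash the
    concatenated per-chunk digests; the final step runs on the host,
    mirroring the module design above."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        digests = pool.map(lambda c: hashlib.md5(c).digest(), chunks)
    return hashlib.md5(b"".join(digests)).digest()
```

The result is deterministic and as collision resistant as the underlying hash, but it is not byte-for-byte equal to hashing the whole input sequentially, so both endpoints must use the same scheme.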
23
Detecting Block Boundaries – Module Design
[Figure: the same pipeline as the block-hash module: preprocessing, data transfer in, transfer to shared memory, parallel processing, result transfer to global memory, and result transfer out to the host, without a final host-side hash step.]
24
StoreGPU v1.0 Optimizations
• Optimized shared memory usage: StoreGPU's shared memory management mechanism assigns threads to different banks while providing a contiguous space abstraction.
• Memory pinning
• Reduced output size
25
Outline
• GPU architecture
• GPU programming
• Typical application flow
• StoreGPU design
• StoreGPU v1.0 optimizations
• Evaluation
26
Evaluation
Testbed: a machine with
• CPU: Intel Core2 Duo 6600, 2 GB RAM (priced at $290)
• GPU: GeForce 8600 GTS (32 cores, 256 MB RAM, PCIe x16) (priced at $100)
Experiment space:
• GPU vs. a single CPU core
• MD5 and SHA1 implementations
• Three optimizations
• Detecting block boundary configurations (m and offset)
27
Computing Block Hash
Over 4x speedup in computing block hashes
Computing Block Hash – MD5
28
Computing Block Boundary
Over 8x speedup in detecting block boundaries
Computing Block Boundary – MD5 (m = 20 bytes, offset = 4 bytes)
29
Dissecting GPU Execution Time
T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc
1. Preprocessing
2. Data transfer in
3. GPU Processing
4. Data transfer out
5. Postprocessing
30
Dissecting GPU Execution Time
[Figure: percentage of total runtime spent in each of stages 1-5, for data sizes from 8 KB to 96,000 KB.]
T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc
MD5 computing block hashes module with all optimizations enabled
31
Application Level Performance – Similarity Detection
Online similarity detection throughput and speedup using MD5:

                                Throughput (MBps)               Similarity ratio
                                StoreGPU  Standard  Speedup     detected
Fixed-size Compare-by-Hash         840      193      4.3x         23%
Content-based Compare-by-Hash      114      13.5     8.4x         80%

Implication: similarity detection can be used even on 10Gbps setups!

Application: similarity detection between checkpoint images.
Data: checkpoints from BLAST (bioinformatics), collected using BLCR; checkpoint interval: 5 minutes.
32
Summary
StoreGPU:
• Offloads computationally intensive operations from the CPU
• Achieves considerable speedups
Contributions:
• Feasibility of using GPUs to support (distributed) middleware
• A performance model
• The StoreGPU library
Implication: GPUs unleash a valuable set of optimization techniques into the high-performance systems design space.
33
Other GPU Applications
Current NetSysLab GPU-related projects explore GPUs to support other middleware primitives:
• Bloom filters (BloomGPU)
• Packet classification
• Medical imaging compression
Primitives of interest: hashing, erasure coding, encryption/decryption, compression, membership testing (Bloom filters).
35
References
[Damgard89] Damgard, I. A Design Principle for Hash Functions. in Advances in Cryptology - CRYPTO. 1989: Lecture Notes in Computer Science.
[Liu06] Liu, W., et al. Bio-sequence database scanning on a GPU. in Parallel and Distributed Processing Symposium, IPDPS. 2006
[Vogt08] Vogt, L, et al. Accelerating Resolution-of-the-Identity Second-Order Moller-Plesset Quantum Chemistry Calculations with Graphical Processing Units. J. Phys. Chem. A, 112 (10), 2049 -2057, 2008.
[Anderson08] Joshua A. Anderson, Chris D. Lorenz and A. Travesset, General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics, Volume 227, Issue 10, 1 May 2008, Pages 5342-5359.
[Owens07] Owens, J.D., et al., A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 2007. 26(1): p. 80-113