1
StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems
NetSysLab, The University of British Columbia
Samer Al-Kiswany
with: Abdullah Gharaibeh, Elizeu Santos-Neto, George Yuan, Matei Ripeanu
2
Computation Landscape
Recent GPUs dramatically change the computation cost landscape.
Floating-Point Operations per Second for the CPU and GPU. (Source: CUDA 1.1 Guide)
A quiet revolution (GPU vs. CPU):
Computation: 367 vs. 32 GFLOPS (128 vs. 4 cores)
Memory bandwidth: 86.4 vs. 8.4 GB/s
Price: $220 vs. $290
HPDC ‘08
3
Computation Landscape
Affordable
Widely available in commodity desktops
Include 10s to 100s of cores (can support 1000s of threads)
General-purpose programming friendly
Recent GPUs dramatically change the computation cost landscape.
4
Exploiting GPUs’ Computational Power
Studies exploiting the GPU:
• Bioinformatics: [Liu06]
• Chemistry: [Vogt08]
• Physics: [Anderson08]
• And many more: [Owens07]
These report 4x to 50x speedups.
But: mostly scientific and specialized applications.
5
Motivating Question
System design is a balancing act in a multi-dimensional space: e.g., given certain objectives, say job turnaround time, minimize total system cost given component prices, I/O bottlenecks, bounds on storage and network traffic, energy consumption, etc.
Q: Does the 10x reduction in computation costs GPUs offer change the way we design/implement (distributed) system middleware?
6
Computationally Intensive Operations in Distributed Systems
• Hashing
• Erasure coding
• Encryption/decryption
• Compression
• Membership testing (Bloom filters)
These operations are computationally intensive and often avoided in existing systems.
They are used in:
• Storage systems
• Security protocols
• Data dissemination techniques
• Virtual machine memory management
• And many more …
7
Why Start with Hashing?
Popular -- used in many situations:
• Similarity detection
• Content addressability
• Integrity
• Copyright infringement detection
• Load balancing
8
How Hashing is Used in Similarity Detection?
[Figure: File A is hashed into block digests X, Y, Z; File B is hashed into W, Y, Z. Only the first block is different, so the two files share two of three blocks.]
ICDCS ‘08
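The compare-by-hash idea in the figure can be sketched in Python (an illustrative host-side sketch, not StoreGPU code; the 4 KB block size is an arbitrary choice):

```python
import hashlib

def block_hashes(data: bytes, block_size: int = 4096) -> list:
    """Hash each fixed-size block; equal blocks yield equal digests."""
    return [hashlib.md5(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]

def similarity_ratio(file_a: bytes, file_b: bytes, block_size: int = 4096) -> float:
    """Fraction of file_a's blocks whose hash also appears in file_b."""
    hashes_b = set(block_hashes(file_b, block_size))
    hashes_a = block_hashes(file_a, block_size)
    return sum(h in hashes_b for h in hashes_a) / len(hashes_a) if hashes_a else 1.0

# Files A and B from the figure: blocks X, Y, Z vs. W, Y, Z.
file_a = b"X" * 4096 + b"Y" * 4096 + b"Z" * 4096
file_b = b"W" * 4096 + b"Y" * 4096 + b"Z" * 4096
print(similarity_ratio(file_a, file_b))  # two of three blocks are shared
```

Only digests are compared, so similarity can be detected without shipping the file contents themselves.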
9
How Hashing is Used in Similarity Detection?
How to divide the file into blocks?
• Fixed-size blocks
• Content-based block boundaries
10
Detecting Content-based Block Boundaries
[Figure: a window of m bytes slides over file i at a fixed offset; each window is hashed, and a block boundary is declared wherever the last k bits of the hash value equal zero, splitting the file into blocks B1, B2, B3, B4.]
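The boundary-detection rule above can be sketched as follows (an illustrative sketch, not StoreGPU code: the window is re-hashed from scratch at every offset, whereas a production implementation would use a rolling hash, and k = 12 is an arbitrary choice):

```python
import hashlib

def find_boundaries(data: bytes, m: int = 20, offset: int = 4, k: int = 12) -> list:
    """Slide an m-byte window over the data in steps of `offset` bytes and
    declare a block boundary wherever the last k bits of the window's hash
    are all zero; the expected block size grows with k."""
    mask = (1 << k) - 1
    boundaries = []
    for pos in range(0, len(data) - m + 1, offset):
        digest = hashlib.md5(data[pos:pos + m]).digest()
        hash_tail = int.from_bytes(digest[-4:], "big")
        if hash_tail & mask == 0:
            boundaries.append(pos + m)  # block ends after this window
    return boundaries
```

Because a boundary depends only on the local window content, inserting bytes near the start of a file shifts early boundaries but leaves later ones aligned, which is what makes content-based chunking robust for similarity detection.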
11
Hashing Use in Similarity Detection – Two Scenarios
I. Computing block hashes: hashing large blocks of data (100s of KB to 10s of MB).
II. Computing block boundaries: hashing a large number of small data windows (a few bytes each).
12
StoreGPU
StoreGPU: a library that exploits GPUs to support distributed storage systems by offloading their computationally intensive functions.
One performance data point: in similarity detection, StoreGPU achieves an 8x speedup and 5x data compression for a checkpointing application.
StoreGPU v1.0 implements the hashing functions used in computing block hashes and block boundaries.
Implication: GPUs unleash a valuable set of optimization techniques into the high-performance systems design space, although GPUs were not designed with this usage in mind.
13
Outline
• GPU architecture
• GPU programming
• Typical application flow
• StoreGPU design
• Evaluation
14
NVIDIA CUDA GPU Architecture
SIMD architecture. Four memories:
• Device (a.k.a. global): slow (400-600 cycles access latency), large (256MB – 1GB)
• Shared: fast (4 cycles access latency), small (16KB)
• Texture: read only
• Constant: read only
15
GPU Programming
NVIDIA CUDA programming model:
• Abstracts the GPU architecture
• Is an extension to the C programming language
  • Compiler directives
  • Provides a GPU-specific API (device properties, timing, memory management, etc.)
Programming is still challenging:
• Parallel (SIMD) programming, and extracting parallelism at large scale
• Memory management
• Synchronization
• Immature debugging tools
16
Performance Tips
• Use 1000s of threads to best use the GPU hardware
  Challenge: limited shared memory and registers
• Optimize the use of the shared memory and the registers
  Challenge: shared memory is small and prone to bank conflicts
17
Shared Memory Complications
Shared memory is organized into 16 banks of 1KB each, interleaved in 4-byte words.
[Figure: banks 0..15; consecutive 4-byte words map to consecutive banks.]
Complication I: concurrent accesses to the same bank are serialized (bank conflict), causing a slowdown.
Complication II: banks are interleaved.
Tip: assign different threads to different banks.
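The bank mapping can be illustrated numerically (a host-side Python sketch of the 16-bank, 4-byte-interleaved layout described above, not device code):

```python
NUM_BANKS = 16   # shared memory banks on the GPUs discussed here
WORD_BYTES = 4   # banks are interleaved in 4-byte words

def bank_of(byte_addr: int) -> int:
    """Bank that serves a given shared-memory byte address."""
    return (byte_addr // WORD_BYTES) % NUM_BANKS

# 16 threads reading consecutive 4-byte words: one distinct bank each.
conflict_free = [bank_of(tid * WORD_BYTES) for tid in range(16)]
# 16 threads reading with a 16-word stride: all hit bank 0 and serialize.
conflicting = [bank_of(tid * 16 * WORD_BYTES) for tid in range(16)]
print(conflict_free, conflicting)
```

The second access pattern is the 16-way bank conflict the tip above is designed to avoid.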
18
Execution Path on GPU – Data Processing Application
T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc
1. Preprocessing
2. Data transfer in
3. GPU Processing
4. Data transfer out
5. Postprocessing
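The five-stage cost model above can be written out directly (a sketch with hypothetical stage times; the point is that for small inputs the transfer and pre/post-processing terms can outweigh T_Processing):

```python
def total_time(t_pre, t_data_h2g, t_proc, t_data_g2h, t_post):
    """T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc"""
    return t_pre + t_data_h2g + t_proc + t_data_g2h + t_post

# Hypothetical numbers (milliseconds): fast GPU kernel, non-trivial transfers.
t_total = total_time(t_pre=2.0, t_data_h2g=5.0, t_proc=3.0, t_data_g2h=1.0, t_post=1.0)
overhead_fraction = (t_total - 3.0) / t_total  # everything except T_Processing
print(t_total, overhead_fraction)
```

With these made-up numbers, three quarters of the total time is overhead around the GPU kernel, which is why the evaluation later dissects runtime by stage.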
19
Outline
• GPU architecture
• GPU programming
• Typical application flow
• StoreGPU design
• Evaluation
20
StoreGPU Design
I. Computing block hashes: hashing large blocks of data (100s of KB to 10s of MB).
II. Computing block boundaries: hashing a large number of small data windows (a few bytes each).
21
Computing Block Hash – Module Design
[Figure: input data on the host machine is split into chunks 1..i; after preprocessing, the data is transferred to the GPU, then from global to shared memory; the chunks are hashed in parallel, the results are transferred to global memory and back out to the host, where the final hash is executed.]
22
Computing Block Hash – Module Design
• The design is highly parallel
• The last step runs on the CPU to avoid synchronization
• The resulting hash is not compatible with standard MD5 and SHA1, but is equally collision resistant [Damgard89]
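A CPU-only sketch of this two-level scheme (thread-pool workers stand in for GPU threads, the chunk size is an arbitrary choice, and the output intentionally differs from plain MD5, as noted above):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def parallel_hash(data: bytes, chunk_size: int = 64 * 1024) -> bytes:
    """Hash fixed-size chunks independently (in parallel), then hash the
    concatenated per-chunk digests; the final step runs on the host,
    mirroring the module design above."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        digests = pool.map(lambda c: hashlib.md5(c).digest(), chunks)
    return hashlib.md5(b"".join(digests)).digest()
```

The result is deterministic and as collision resistant as the underlying hash, but it is not byte-for-byte equal to hashing the whole input sequentially, so both endpoints must use the same scheme.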
23
Detecting Block Boundaries – Module Design
[Figure: the same pipeline as the block-hash module: preprocessing, data transfer in, transfer to shared memory, parallel processing, result transfer to global memory, and result transfer out to the host, without a final host-side hash step.]
24
StoreGPU v1.0 Optimizations
• Optimized shared memory usage: StoreGPU's shared memory management mechanism assigns threads to different banks while providing a contiguous space abstraction.
• Memory pinning
• Reduced output size
25
Outline
• GPU architecture
• GPU programming
• Typical application flow
• StoreGPU design
• StoreGPU v1.0 optimizations
• Evaluation
26
Evaluation
Testbed: a machine with
• CPU: Intel Core2 Duo 6600, 2 GB RAM (priced at $290)
• GPU: GeForce 8600 GTS (32 cores, 256 MB RAM, PCIe x16) (priced at $100)
Experiment space:
• GPU vs. a single CPU core
• MD5 and SHA1 implementations
• Three optimizations
• Detecting block boundary configurations (m and offset)
27
Computing Block Hash
Over 4x speedup in computing block hashes
Computing Block Hash – MD5
28
Computing Block Boundary
Over 8x speedup in detecting block boundaries
Computing Block Boundary – MD5 (m = 20 bytes, offset = 4 bytes)
29
Dissecting GPU Execution Time
T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc
1. Preprocessing
2. Data transfer in
3. GPU Processing
4. Data transfer out
5. Postprocessing
30
Dissecting GPU Execution Time
[Figure: percentage of total runtime spent in each of stages 1-5, for data sizes from 8 KB to 96,000 KB.]
T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc
MD5 computing block hashes module with all optimizations enabled
31
Application Level Performance – Similarity Detection
Online similarity detection throughput and speedup using MD5:

                                Throughput (MBps)               Similarity ratio
                                StoreGPU  Standard  Speedup     detected
Fixed-size Compare-by-Hash         840      193      4.3x         23%
Content-based Compare-by-Hash      114      13.5     8.4x         80%

Implication: similarity detection can be used even on 10Gbps setups!

Application: similarity detection between checkpoint images.
Data: checkpoints from BLAST (bioinformatics), collected using BLCR; checkpoint interval: 5 minutes.
32
Summary
StoreGPU:
• Offloads computationally intensive operations from the CPU
• Achieves considerable speedups
Contributions:
• Feasibility of using GPUs to support (distributed) middleware
• A performance model
• The StoreGPU library
Implication: GPUs unleash a valuable set of optimization techniques into the high-performance systems design space.
33
Other GPU Applications
Current NetSysLab GPU-related projects explore GPUs to support other middleware primitives:
• Bloom filters (BloomGPU)
• Packet classification
• Medical imaging compression
Primitives of interest: hashing, erasure coding, encryption/decryption, compression, membership testing (Bloom filters).
35
References
[Damgard89] Damgard, I. A Design Principle for Hash Functions. in Advances in Cryptology - CRYPTO. 1989: Lecture Notes in Computer Science.
[Liu06] Liu, W., et al. Bio-sequence database scanning on a GPU. in Parallel and Distributed Processing Symposium, IPDPS. 2006
[Vogt08] Vogt, L, et al. Accelerating Resolution-of-the-Identity Second-Order Moller-Plesset Quantum Chemistry Calculations with Graphical Processing Units. J. Phys. Chem. A, 112 (10), 2049 -2057, 2008.
[Anderson08] Joshua A. Anderson, Chris D. Lorenz and A. Travesset, General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics, Volume 227, Issue 10, 1 May 2008, Pages 5342-5359.
[Owens07] Owens, J.D., et al., A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 2007. 26(1): p. 80-113