Post on 19-Jan-2016

Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems

Tayler H. Hetheringtonɣ, Timothy G. Rogersɣ, Lisa Hsu*, Mike O'Connor*, Tor M. Aamodtɣ

ɣUBC *AMD

University of British Columbia

In Proc. 2012 ACM/IEEE Int'l Symp. on Performance Analysis of Systems and Software (ISPASS)

Rich Miler – www.datacenterknowledge.com

Motivation

• Server farms require a lot of power
– Need for efficient, cost-effective solutions
– GPU/APUs
• New types of workloads
– Non-HPC
– Server applications
• Server applications
– Memcached
• Programmer's initial intuition into an application's behavior

[Figure: SIMD Efficiency (0% to 50%), Intuition vs. Actual]

Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU

Bruno Giussani – www.wired.com

Background - Memcached

*Slide from HPCA-18, 2012 Facebook Keynote, Sanjeev Kumar

Memcached - Compatible with GPU?

• Irregular control flow
• Irregular memory access patterns
• Large memory requirements
• Highly input data dependent

Porting Memcached - Simple key-value lookup

[Diagram: GET request at the server: Hash, Memory, Key Comparison (Hit, or Miss with Hash chaining), Return Hit/Miss]

• READ (GET) requests on GPU
• WRITE (SET) requests on CPU
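
To make the lookup path in the diagram concrete, the following is a minimal sketch of a hash table with chaining, written in plain Python rather than the GPU code from the talk. The names `Item`, `ChainedTable`, and `num_buckets` are hypothetical, and Python's built-in `hash` stands in for whatever hash function Memcached actually uses.

```python
class Item:
    """One stored key-value pair; `next` links items that share a bucket."""

    def __init__(self, key, value, next_item=None):
        self.key = key
        self.value = value
        self.next = next_item


class ChainedTable:
    """Illustrative hash table with chaining (not Memcached's real layout)."""

    def __init__(self, num_buckets=8):
        self.buckets = [None] * num_buckets

    def _index(self, key):
        # Hash step: map the key to a bucket index.
        return hash(key) % len(self.buckets)

    def set(self, key, value):
        # WRITE (SET): update in place if the key exists, else prepend to the chain.
        i = self._index(key)
        item = self.buckets[i]
        while item is not None:
            if item.key == key:
                item.value = value
                return
            item = item.next
        self.buckets[i] = Item(key, value, self.buckets[i])

    def get(self, key):
        # READ (GET): hash, load the bucket head, then compare keys along the chain.
        item = self.buckets[self._index(key)]
        while item is not None:
            if item.key == key:        # key comparison succeeded: hit
                return True, item.value
            item = item.next           # comparison failed: follow the hash chain
        return False, None             # end of chain: miss
```

The length of the `while` loop in `get` depends on the key and on the table contents, which is the input-dependent behavior the later slides examine.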

Porting Memcached - Batching

[Diagram: the GET request flow (Hash, Memory, Key Comparison, Hash chaining, Return Hit/Miss) replicated for a batch of GET requests]
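
Batching can be sketched on the host side as grouping incoming GET keys into fixed-size batches and handing each batch to a single lookup call. This is only an illustration under assumed names (`batches`, `get_batch`, and a plain `dict` as the store); in the talk's setting the per-batch call would be a GPU kernel launch, which this sketch does not model.

```python
def batches(requests, batch_size):
    """Split a list of requests into consecutive batches of at most batch_size."""
    for start in range(0, len(requests), batch_size):
        yield requests[start:start + batch_size]


def get_batch(store, keys):
    """Serve one batch of GETs; element i stands in for the request handled in slot i.

    Returns a (hit, value) pair per key, in request order.
    """
    return [(key in store, store.get(key)) for key in keys]


def serve(store, requests, batch_size):
    """Process all requests batch by batch and return the flattened results."""
    results = []
    for batch in batches(requests, batch_size):
        results.extend(get_batch(store, batch))
    return results
```

A larger `batch_size` exposes more independent lookups per call, at the cost of each request waiting for its batch to fill and complete.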

Porting Memcached

• Main Goals
– Increase request throughput
– Keep request latency reasonable
• Main Challenges
– Irregular memory access patterns
– Irregular control flow
– Data transfer overheads

Methodology

• Hardware
– AMD Radeon HD 5870 (Discrete)
– AMD Llano A8-3850 (Fusion)
– AMD Zacate E-350 (Fusion)
• Simulators
– GPGPU-Sim v3.x
– In-house GPU control flow simulator
• Testing and Simulation
– Traces of Wikipedia accesses

Porting Memcached - Memory Access

• One request per work item
• Data accesses for GET requests are input data dependent
• Data can be anywhere in memory
– Poor performance on GPU?
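
One way to see why scattered data could hurt is to count how many distinct cache lines a group of work items touches in a single load. The sketch below is a simplified model, not the simulator used in the talk; the 64-byte line size, the group of 32 work items, and the name `lines_touched` are assumptions made for illustration.

```python
def lines_touched(addresses, line_bytes=64):
    """Count distinct cache lines covered by one load per work item.

    `addresses` holds one byte address per work item; `line_bytes` is an
    assumed cache line size.
    """
    return len({addr // line_bytes for addr in addresses})


# 32 work items reading adjacent 4-byte words: the accesses fall in few lines.
adjacent = [4 * i for i in range(32)]

# 32 work items each reading an item 4 KiB apart, as if each GET landed on
# an unrelated part of memory: every access needs its own line.
scattered = [4096 * i for i in range(32)]

print(lines_touched(adjacent), lines_touched(scattered))
```

When each work item serves a different key, its accesses resemble the `scattered` case more than the `adjacent` one.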

Porting Memcached - Memory Divergence

[Figure: Percentage of Peak IPC (0% to 35%) for the configurations No L1 Cache, 8k 8-way, 32k 8-way, 64k 8-way, 128k 8-way, 256k 8-way, 1M 8-way, 1M FA, No Mem Latency, No Mem Stalls]

Porting Memcached - Control Flow

• Recall the control flow graph
• Many branch outcomes are input data dependent

[Diagram: work items splitting into groups at successive branches]
Work item ID: 1 – 2 – 3 – 4 – 5
1 – 2 – 5 | 3 – 4
1 – 5 | 2 | 3 – 4
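
The effect of this splitting can be quantified as SIMD efficiency: the fraction of lanes doing useful work, averaged over issued instructions. The sketch below applies that definition to the five-work-item example, assuming (purely for illustration) that each group of work items issues exactly one instruction; the function name `simd_efficiency` is hypothetical.

```python
def simd_efficiency(issue_groups, width):
    """Average fraction of active lanes over all issued instructions.

    `issue_groups` lists, for each issued instruction, the work items that
    were active; `width` is the number of lanes available per issue.
    """
    active_lanes = sum(len(group) for group in issue_groups)
    return active_lanes / (width * len(issue_groups))


# The example from the slide, one instruction per group (an assumption):
# all five items together, then {1,2,5} and {3,4}, then {1,5}, {2}, {3,4}.
groups = [
    [1, 2, 3, 4, 5],
    [1, 2, 5], [3, 4],
    [1, 5], [2], [3, 4],
]
print(simd_efficiency(groups, width=5))
```

Each split forces the groups to issue one after another, so the same amount of useful work is spread over more issue slots.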

Porting Memcached - Control Flow

[Figure: SIMD Efficiency Breakdown (0% to 100%) for MC (Pes), MC (Aug), MC (Act); legend: # Active Work-items in bins 1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-28, 29-32; annotated values: 15%, 40%, 62%, 29%; Overall: 51%]

Porting Memcached - Data Management

• Dynamic memory manager
• Transfer memory regions to device
• Virtual addresses different on host and device
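
A common way to cope with differing host and device addresses is to link items by offsets from the start of a memory region instead of by raw pointers, so a region stays valid wherever it is copied. The slides do not say which approach the authors used; the sketch below only illustrates the offset idea, with a hypothetical record layout (`next_offset`, `key_len`, key bytes) and hypothetical names `write_item`, `walk_chain`, and `NULL`.

```python
import struct

NULL = 0xFFFFFFFF  # sentinel offset meaning "no next item" (assumed convention)


def write_item(region, offset, key, next_offset):
    """Store [next_offset:uint32][key_len:uint32][key bytes] at `offset`.

    Returns the offset just past the record, where the next one may go.
    """
    region[offset:offset + 8] = struct.pack("<II", next_offset, len(key))
    region[offset + 8:offset + 8 + len(key)] = key
    return offset + 8 + len(key)


def walk_chain(region, offset):
    """Follow next_offset links from `offset`, collecting the keys seen."""
    keys = []
    while offset != NULL:
        next_offset, key_len = struct.unpack_from("<II", region, offset)
        keys.append(bytes(region[offset + 8:offset + 8 + key_len]))
        offset = next_offset
    return keys


host_region = bytearray(64)
end = write_item(host_region, 0, b"a", NULL)   # first item, end of chain
write_item(host_region, end, b"bc", 0)         # second item links back to offset 0

# Copying the region stands in for a transfer to the device: the base
# address changes, but the offsets inside the region still resolve.
device_region = bytearray(host_region)
print(walk_chain(device_region, end))
```

Because every link is relative to the region's base, no pointer fix-up pass is needed after the copy.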

Porting Memcached - Data Transfer Reduction

• Fusion Systems
– Physical shared memory region between host and device
– Zero-copy data
• Discrete Systems
– Possible transfer reduction techniques
  • Reduction in unnecessary transfers
  • Acyclic data transfers (Overlap comm. with comp.)
  • Automatic data transfer frameworks
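
The benefit of overlapping communication with computation can be estimated with a simple two-stage pipeline model. This is an idealized sketch, not a measurement from the talk: it assumes every batch has the same transfer and compute time, that one transfer can run while the previous batch computes, and it ignores copying results back. The function names and the example numbers are hypothetical.

```python
def serial_time(num_batches, transfer, compute):
    """Total time when each batch is transferred, then computed, one after another."""
    return num_batches * (transfer + compute)


def overlapped_time(num_batches, transfer, compute):
    """Total time when the transfer of batch i+1 overlaps the compute of batch i.

    The first transfer and the last compute cannot be hidden; in between,
    the slower of the two stages sets the pace.
    """
    if num_batches == 0:
        return 0
    return transfer + compute + (num_batches - 1) * max(transfer, compute)


# Placeholder numbers: 4 batches, 2 time units to transfer, 3 to compute.
print(serial_time(4, 2, 3), overlapped_time(4, 2, 3))
```

In this model the overlap hides at most the shorter stage, so it helps most when transfer and compute times are comparable.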

Porting Memcached

[Figure: Performance vs. CPU, Speed up (X), with series No Data Transfers and Data Transfers, for AMD Radeon HD 5870, Llano A8-3850, and Zacate E-350]

[Figure: Execution Breakdown (0% to 100%), Data Transfer vs. Execution, for AMD Radeon HD 5870, Llano A8-3850, and Zacate E-350]

Results - Radeon HD 5870

[Figure: Normalized Throughput (Requests/Second) and Normalized Latency vs. Requests / Batch (0 to 90000); annotation: Latency - 0.5ms]

• ~8000 requests yields highest ratio of throughput to latency
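
Picking a batch size by the ratio of throughput to latency can be expressed in a few lines. The sketch below uses made-up placeholder measurements, not values read from the figure, and the name `best_batch_size` is hypothetical.

```python
def best_batch_size(measurements):
    """Return the batch size with the highest throughput-to-latency ratio.

    `measurements` is a list of (batch_size, throughput, latency) tuples,
    with throughput and latency normalized to the same baseline.
    """
    return max(measurements, key=lambda m: m[1] / m[2])[0]


# Placeholder data for illustration only (not from the slides).
samples = [
    (1000, 2.0, 1.0),    # ratio 2.0
    (8000, 6.0, 2.0),    # ratio 3.0
    (64000, 7.0, 10.0),  # ratio 0.7
]
print(best_batch_size(samples))
```

The ratio rewards batch sizes where throughput is still climbing faster than latency, which is the trade-off the two main goals describe.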

Summary

• Programmer intuition doesn't always paint the whole picture
• We exploited the available parallelism on GPUs by batching requests, showing a 7.5X performance increase on the Llano system
• Data transfer overheads can have a large impact on overall performance
• Thank you – Questions?

Rich Miler – www.datacenterknowledge.com