Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems Tayler...


Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems

Tayler H. Hetheringtonɣ, Timothy G. Rogersɣ, Lisa Hsu*, Mike O’Connor*, Tor M. Aamodtɣ

ɣUBC *AMD

University of British Columbia

In Proc. 2012 ACM/IEEE Int’l Symp. on Performance Analysis of Systems and Software (ISPASS)

Rich Miller – www.datacenterknowledge.com

• Server farms require a lot of power
– Need for efficient, cost-effective solutions
– GPU/APUs

• New types of workloads
– Non-HPC
– Server applications

• Server applications
– Memcached

Programmer’s initial intuition into an application’s behavior

[Figure: bar chart comparing "Intuition" and "Actual"; axis from 0% to 50%]

Motivation

Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU

Bruno Giussani – www.wired.com

Background - Memcached

*Slide from HPCA-18, 2012 Facebook Keynote, Sanjeev Kumar

Memcached - Compatible with GPU?

• Irregular control flow

• Irregular memory access patterns

• Large memory requirements

• Highly input data dependent

Porting Memcached - Simple key-value lookup

• READ (GET) requests on

• WRITE (SET) requests on

[Diagram: a GET request passes through Hash, Memory, Key Comparison, and Return Hit/Miss, with hash chaining; the same stages are shown for Server2]
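The lookup stages named on this slide (hash, memory access, key comparison, hit/miss, hash chaining) can be sketched as a toy chained hash table. This is a minimal illustration only, not Memcached's actual implementation; the class name `ChainedHashTable`, the bucket count, and the use of Python's built-in `hash` are assumptions made for the example.

```python
from typing import List, Optional, Tuple


class ChainedHashTable:
    """Toy key-value store that resolves collisions by chaining (hypothetical)."""

    def __init__(self, num_buckets: int = 8) -> None:
        # Each bucket holds a chain (list) of (key, value) pairs.
        self.buckets: List[List[Tuple[bytes, bytes]]] = [[] for _ in range(num_buckets)]

    def _chain(self, key: bytes) -> List[Tuple[bytes, bytes]]:
        # Hash: map the key to one bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def set(self, key: bytes, value: bytes) -> None:
        chain = self._chain(key)
        for i, (stored_key, _) in enumerate(chain):
            if stored_key == key:
                chain[i] = (key, value)  # overwrite an existing entry
                return
        chain.append((key, value))  # new entry joins the chain

    def get(self, key: bytes) -> Optional[bytes]:
        # Memory + key comparison: walk the chain until a stored key matches.
        for stored_key, value in self._chain(key):
            if stored_key == key:
                return value  # hit
        return None  # miss
```

The length of the chain walked by `get` depends on which keys collided, which is one reason the work per request is input dependent.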

Porting Memcached - Batching

[Diagram: batched GET requests pass through Hash, Memory, Key Comparison, and Return Hit/Miss, with hash chaining; shown for Server2 through Servern]
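Batching can be sketched as follows: incoming GET requests are grouped, and each position in a batch is handled independently, the way one work item would handle one request. This is a sequential Python stand-in under stated assumptions, not the authors' kernel; `batches`, `process_batch`, and the dictionary used as the store are hypothetical names for illustration.

```python
from typing import Dict, List, Optional


def batches(requests: List[bytes], batch_size: int) -> List[List[bytes]]:
    """Split the incoming request stream into batches of at most batch_size."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]


def process_batch(table: Dict[bytes, bytes], keys: List[bytes]) -> List[Optional[bytes]]:
    """Serve one batch of GET requests.

    Index i plays the role of work item i: it reads keys[i] and writes only
    results[i], so the iterations are independent of one another.
    """
    results: List[Optional[bytes]] = [None] * len(keys)
    for work_item_id, key in enumerate(keys):
        results[work_item_id] = table.get(key)  # None marks a miss
    return results
```

Larger batches expose more independent lookups at once, at the cost of each request waiting for its batch to fill and complete.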

Porting Memcached

• Main Goals
– Increase request throughput
– Keep request latency reasonable

• Main Challenges
– Irregular memory access patterns
– Irregular control flow
– Data transfer overheads

Methodology

• Hardware
– AMD Radeon HD 5870 (Discrete)
– AMD Llano A8-3850 (Fusion)
– AMD Zacate E-350 (Fusion)

• Simulators
– GPGPU-Sim v3.x
– In-house GPU control flow simulator

• Testing and Simulation
– Traces of Wikipedia accesses

Porting Memcached - Memory Access

• One request per work item

• Data accesses for GET requests are input data dependent

• Data can be anywhere in memory
– Poor performance on GPU?
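One way to see why scattered, input-dependent addresses are a concern is to count how many distinct memory lines a group of work items touches in a single access. The sketch below assumes a 128-byte line and a group of 32 work items; both values and the example addresses are illustrative assumptions, not figures from the slides.

```python
from typing import List


def lines_touched(addresses: List[int], line_bytes: int = 128) -> int:
    """Number of distinct memory lines touched by one group of work items."""
    return len({addr // line_bytes for addr in addresses})


# Regular pattern: 32 work items read consecutive 4-byte words (one line).
regular = [4 * i for i in range(32)]

# Input-dependent pattern (hypothetical addresses): every work item
# lands in a different line, so the access splits into 32 transactions.
scattered = [4096 * i for i in range(32)]

print(lines_touched(regular), lines_touched(scattered))
```

The more lines a single access touches, the more memory transactions the hardware must issue for it.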

[Figure: results across memory configurations: No L1 Cache, 8k 8-way, 32k 8-way, 64k 8-way, 128k 8-way, 256k 8-way, 1M 8-way, 1M FA, No Mem Latency, No Mem Stalls]

Porting Memcached - Memory Divergence

Porting Memcached - Control Flow

• Recall the control flow graph

• Many branch outcomes are input data dependent

[Diagram: work items 1 to 5 diverge at successive branches: {1, 2, 3, 4, 5}, then {1, 2, 5} and {3, 4}, then {1, 5}, {2}, and {3, 4}]

[Figure: SIMD Efficiency Breakdown for MC (Pes), MC (Aug), and MC (Act); y-axis 0% to 100%; legend: # Active Work-items in bins 1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-28, 29-32]
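A breakdown like the one in this figure can be computed from the number of active work items recorded each time an instruction is issued. The sketch below is a minimal illustration, assuming a SIMD width of 32 (suggested by the 29-32 bin) and bins of four work items; the function names are hypothetical and this is not the in-house simulator's code.

```python
from collections import Counter
from typing import Dict, List


def simd_efficiency(active_counts: List[int], simd_width: int = 32) -> float:
    """Mean fraction of lanes doing useful work per issued instruction."""
    return sum(active_counts) / (simd_width * len(active_counts))


def breakdown(active_counts: List[int], bin_size: int = 4) -> Dict[str, float]:
    """Fraction of issued instructions falling in each bin (1-4, 5-8, ...).

    Assumes every count is at least 1.
    """
    bins = Counter((n - 1) // bin_size for n in active_counts)
    total = len(active_counts)
    return {
        f"{b * bin_size + 1}-{(b + 1) * bin_size}": count / total
        for b, count in sorted(bins.items())
    }
```

For example, counts of `[32, 32, 16, 4]` give an efficiency of 84 / 128, with half of the instructions in the 29-32 bin.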

Porting Memcached - Control Flow

[Figure annotations: 15%, 40%; label: Overall]

Porting Memcached - Data Management

• Dynamic memory manager

• Transfer memory regions to device

• Virtual addresses different on host and device
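Because a raw host pointer is not meaningful on the device, one common workaround is to store links as offsets from the start of the transferred region. The slides do not say which approach was used, so the sketch below is an assumption for illustration; the `Region` class, its header layout, and the `NULL` sentinel are hypothetical.

```python
import struct


class Region:
    """A flat memory region whose internal links are base-relative offsets."""

    HEADER = struct.Struct("<II")  # (next_offset, payload_length)
    NULL = 0xFFFFFFFF              # sentinel meaning "no next item"

    def __init__(self, size: int) -> None:
        self.buf = bytearray(size)
        self.used = 0

    def alloc(self, payload: bytes, next_offset: int = NULL) -> int:
        """Append an item and return its offset from the region base."""
        offset = self.used
        end = offset + self.HEADER.size + len(payload)
        if end > len(self.buf):
            raise MemoryError("region full")
        self.HEADER.pack_into(self.buf, offset, next_offset, len(payload))
        self.buf[offset + self.HEADER.size:end] = payload
        self.used = end
        return offset

    def read(self, offset: int):
        """Return (payload, next_offset) for the item stored at offset."""
        next_offset, length = self.HEADER.unpack_from(self.buf, offset)
        start = offset + self.HEADER.size
        return bytes(self.buf[start:start + length]), next_offset
```

Copying `buf` byte for byte into another `Region` leaves every link valid, because no entry refers to an absolute address.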

Porting Memcached - Data Transfer Reduction

• Fusion Systems
– Physical shared memory region between host and device
– Zero-copy data

• Discrete Systems
– Possible transfer reduction techniques
  • Reduction in unnecessary transfers
  • Acyclic data transfers (Overlap comm. with comp.)
  • Automatic data transfer frameworks
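The benefit of overlapping communication with computation can be estimated with a simple timing model. This is a sketch under stated assumptions: transfers run back to back, a batch can start computing once its data has arrived and the previous batch has finished, and buffering is never a limit. The function names are hypothetical and the model is not taken from the slides.

```python
from typing import List


def serial_time(transfer: List[float], compute: List[float]) -> float:
    """Each batch is transferred and then processed, with no overlap."""
    return sum(transfer) + sum(compute)


def overlapped_time(transfer: List[float], compute: List[float]) -> float:
    """The transfer of the next batch proceeds while the current one computes."""
    transfer_done = 0.0
    compute_done = 0.0
    for t, c in zip(transfer, compute):
        transfer_done += t  # transfers are issued back to back
        # Compute starts when both the data and the compute unit are ready.
        compute_done = max(compute_done, transfer_done) + c
    return compute_done
```

With three batches that each take 2 time units to transfer and 3 to compute, the serial schedule takes 15 units and the overlapped one 11, since only the first transfer is fully exposed.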

[Figure: Performance vs. CPU for AMD Radeon HD 5870, Llano A8-3850, and Zacate E-350; series: No Data Transfers, Data Transfers; axis from 0 to 40]

[Figure: Execution Breakdown (0% to 100%) for AMD Radeon HD 5870, Llano A8-3850, and Zacate E-350; components: Data Transfer, Execution]

Porting Memcached

Results - Radeon HD 5870

[Figure: Normalized Throughput (Requests/Second) and Normalized Latency vs. Requests / Batch, from 0 to 90000; marker: Latency - 0.5ms]

• A batch of ~8000 requests yields the highest ratio of throughput to latency
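Given measured latencies for several batch sizes, the batch size with the best throughput-to-latency ratio can be picked as sketched below. The function name and the input format are assumptions for illustration, and throughput is taken to be requests per batch divided by batch latency, which the slides do not define explicitly.

```python
from typing import List, Tuple


def best_batch_size(samples: List[Tuple[int, float]]) -> int:
    """Return the batch size with the highest throughput/latency ratio.

    samples holds (requests_per_batch, batch_latency_seconds) pairs
    taken from measurements.
    """
    def ratio(sample: Tuple[int, float]) -> float:
        requests, latency = sample
        throughput = requests / latency  # requests per second
        return throughput / latency

    return max(samples, key=ratio)[0]
```

The ratio rises while adding requests barely increases latency and falls once latency grows faster than the batch does.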

Summary

• Programmer intuition doesn’t always paint the whole picture

• We exploited the available parallelism on GPUs by batching requests, showing a 7.5X performance increase on the Llano system

• Data transfer overheads can have a large impact on overall performance

• Thank you – Questions?

