Characterizing and Evaluating a Key-value Store Application on Heterogeneous
CPU-GPU Systems
Tayler H. Hetheringtonɣ
Timothy G. Rogersɣ
Lisa Hsu*
Mike O’Connor*
Tor M. Aamodtɣ
ɣUBC *AMD
University of British Columbia
In Proc. 2012 ACM/IEEE Int’l Symp. On Performance Analysis of Systems and Software (ISPASS)
Rich Miler – www.datacenterknowledge.com
Motivation
• Server farms require a lot of power
– Need for efficient, cost-effective solutions
– GPU/APUs
• New types of workloads
– Non-HPC
– Server applications
• Server applications
– Memcached
• Programmer’s initial intuition into an application’s behavior
[Figure: SIMD Efficiency (0% to 50%), Intuition vs. Actual]
Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU
Bruno Giussani – www.wired.com
Background - Memcached
*Slide from HPCA-18, 2012 Facebook Keynote, Sanjeev Kumar
Memcached - Compatible with GPU?
• Irregular control flow
• Irregular memory access patterns
• Large memory requirements
• Highly input data dependent
Porting Memcached - Simple key-value lookup
[Figure: flow of a GET request on a server. Stages: Hash, Memory, Key Comparison (Hit / Miss, with hash chaining), Return Hit/Miss]
• READ (GET) requests on GPU
• WRITE (SET) requests on CPU
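The lookup stages named on this slide (hash, memory access, key comparison, hash chaining, return hit or miss) can be sketched as a small chained hash table. This is an illustrative sketch only: `HashTable`, `set`, and `get` are hypothetical names and do not reflect Memcached's actual data structures or the authors' GPU kernel.

```python
# Minimal sketch of a chained hash table serving GET requests.
# HashTable, set, and get are illustrative names, not Memcached's API.

class HashTable:
    def __init__(self, num_buckets=8):
        # Each bucket holds a chain (list) of (key, value) pairs.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Hash step: map the key to one bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def set(self, key, value):
        # WRITE (SET) path: update in place or append to the chain.
        chain = self._bucket(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def get(self, key):
        # READ (GET) path: hash, then walk the chain comparing keys.
        for k, v in self._bucket(key):
            if k == key:           # key comparison succeeded
                return ("HIT", v)
        return ("MISS", None)      # chain exhausted without a match


table = HashTable(num_buckets=2)   # few buckets, so chains form
table.set("a", 1)
table.set("b", 2)
table.set("c", 3)
```

The length of the chain walked in `get` depends on the key, which is one reason the work done per request is input data dependent.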
Porting Memcached - Batching
[Figure: the GET request flow (Hash, Memory, Key Comparison, Hash chaining, Return Hit/Miss) repeated for multiple requests, with labels Server2 ... Servern]
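Batching can be pictured as grouping incoming GET requests and handing each group to the device in one launch. The sketch below is a CPU-only stand-in, assuming a plain `dict` as the store; `make_batches` and `run_batch` are hypothetical helper names, and the list comprehension only models "one request per work item", not real GPU execution.

```python
def make_batches(requests, batch_size):
    # Split the request stream into consecutive fixed-size batches;
    # the last batch may be smaller.
    return [requests[i:i + batch_size]
            for i in range(0, len(requests), batch_size)]


def run_batch(store, batch):
    # Stand-in for one kernel launch: logical work item i handles batch[i].
    # dict.get returns None on a miss.
    return [store.get(key) for key in batch]


store = {"a": 1, "b": 2}
batches = make_batches(["a", "x", "b", "a", "y"], batch_size=2)
results = [run_batch(store, b) for b in batches]
```

Larger batches expose more parallelism per launch, but each request waits for its batch to fill and complete, which is the throughput versus latency trade-off the talk evaluates.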
Porting Memcached
• Main Goals
– Increase request throughput
– Keep request latency reasonable
• Main Challenges
– Irregular memory access patterns
– Irregular control flow
– Data transfer overheads
Methodology
• Hardware
– AMD Radeon HD 5870 (Discrete)
– AMD Llano A8-3850 (Fusion)
– AMD Zacate E-350 (Fusion)
• Simulators
– GPGPU-Sim v3.x
– In-house GPU control flow simulator
• Testing and Simulation
– Traces of Wikipedia accesses
Porting Memcached - Memory Access
• One request per work item
• Data accesses for GET requests are input data dependent
• Data can be anywhere in memory
– Poor performance on GPU?
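One rough way to see why scattered, input-dependent accesses are a concern is to count how many distinct cache lines a group of work items touches in a single load. The sketch below assumes a 64-byte line and a group of 32 work items; both values and the function name `lines_touched` are assumptions for illustration, not parameters taken from the talk.

```python
def lines_touched(addresses, line_bytes=64):
    # Number of distinct cache lines covered by one load across the group.
    # Fewer lines means the accesses coalesce into fewer memory transactions.
    return len({addr // line_bytes for addr in addresses})


# Regular pattern: 32 work items read consecutive 4-byte words.
regular = [4 * i for i in range(32)]

# Irregular pattern: each work item reads from a different 4 KiB region,
# as might happen when each request hashes to an unrelated location.
scattered = [4096 * i for i in range(32)]

regular_lines = lines_touched(regular)
scattered_lines = lines_touched(scattered)
```

Under these assumptions the regular pattern needs 2 lines while the scattered pattern needs 32, one per work item.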
Porting Memcached - Memory Divergence
[Figure: Percentage of Peak IPC (0% to 35%) for: No L1 Cache, 8k 8-way, 32k 8-way, 64k 8-way, 128k 8-way, 256k 8-way, 1M 8-way, 1M FA, No Mem Latency, No Mem Stalls]
Porting Memcached - Control Flow
• Recall the control flow graph
• Many branch outcomes are input data dependent
[Figure: work items 1 – 5 diverging at successive branches: {1 – 2 – 3 – 4 – 5}, then {1 – 2 – 5} and {3 – 4}, then {1 – 5}, {2} and {3 – 4}]
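SIMD efficiency can be estimated as the fraction of lanes doing useful work, averaged over the serialized passes that divergence creates. The sketch below loosely mirrors the work-item grouping on this slide, assuming a 5-lane group and that each diverged subgroup executes as its own pass; both are simplifying assumptions, and `simd_efficiency` is a hypothetical helper, not the authors' simulator.

```python
def simd_efficiency(passes, width):
    # passes: one set of active work-item IDs per serialized execution pass.
    # Efficiency = active lane-slots / total lane-slots issued.
    active = sum(len(p) for p in passes)
    return active / (len(passes) * width)


# All five work items run together, then split at two data-dependent branches.
passes = [
    {1, 2, 3, 4, 5},   # before any divergence
    {1, 2, 5},         # first branch, one side
    {3, 4},            # first branch, other side
    {1, 5},            # second branch splits {1, 2, 5}
    {2},
    {3, 4},
]
eff = simd_efficiency(passes, width=5)
```

With these illustrative passes, 15 of 30 lane-slots are active, so `eff` is 0.5. The figure is a toy value for this example, not a measurement from the talk.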
Porting Memcached - Control Flow
[Figure: SIMD Efficiency Breakdown (0% to 100%) for MC (Pes), MC (Aug), MC (Act), by # Active Work-items: 1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-28, 29-32. Callouts: 15%, 40%, 62%, 29%; Overall 51%]
Porting Memcached - Data Management
• Dynamic memory manager
• Transfer memory regions to device
• Virtual addresses different on host and device
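One common way to cope with differing virtual addresses is to link items by offsets from the start of the memory region rather than by raw pointers, so a copied region stays valid at any base address. The sketch below illustrates that idea only; the item layout, the `NULL` sentinel, and the helper names are hypothetical, and the talk does not state that the authors used this scheme.

```python
import struct

NULL = 0xFFFFFFFF  # sentinel offset meaning "no next item" (assumed)
HEADER = "<IB"     # hypothetical layout: 4-byte next offset, 1-byte key length
HEADER_SIZE = struct.calcsize(HEADER)  # 5 bytes, no padding with "<"


def write_item(region, offset, key, next_offset):
    # Store the link as an offset within the region, not as an address.
    struct.pack_into(HEADER, region, offset, next_offset, len(key))
    start = offset + HEADER_SIZE
    region[start:start + len(key)] = key


def walk_chain(region, offset):
    # Follow offset links; works on any copy of the region.
    keys = []
    while offset != NULL:
        next_offset, key_len = struct.unpack_from(HEADER, region, offset)
        start = offset + HEADER_SIZE
        keys.append(bytes(region[start:start + key_len]))
        offset = next_offset
    return keys


host_region = bytearray(64)
write_item(host_region, 0, b"foo", 8)     # first item links to offset 8
write_item(host_region, 8, b"bar", NULL)  # last item in the chain

# Simulate a transfer: the copy lives at a different address,
# but the offset links remain valid.
device_region = bytearray(host_region)
device_keys = walk_chain(device_region, 0)
```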
Porting Memcached - Data Transfer Reduction
• Fusion Systems
– Physical shared memory region between host and device
– Zero-copy data
• Discrete Systems
– Possible transfer reduction techniques
• Reduction in unnecessary transfers
• Acyclic data transfers (Overlap comm. with comp.)
• Automatic data transfer frameworks
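The benefit of overlapping communication with computation can be approximated with a simple two-stage pipeline model. The sketch below assumes every batch has the same transfer and compute time, ignores copying results back, and uses arbitrary time units; the function names and the example numbers are hypothetical and are not measurements from the talk.

```python
def serialized_time(num_batches, transfer, compute):
    # Each batch is transferred, then computed, with no overlap.
    return num_batches * (transfer + compute)


def overlapped_time(num_batches, transfer, compute):
    # Two-stage pipeline: while batch i computes, batch i+1 transfers.
    # The slower stage sets the steady-state rate.
    if num_batches == 0:
        return 0
    return transfer + (num_batches - 1) * max(transfer, compute) + compute


# Illustrative numbers only (arbitrary units).
serial = serialized_time(4, transfer=2, compute=3)
overlap = overlapped_time(4, transfer=2, compute=3)
```

In this toy case the serialized schedule takes 20 units and the overlapped one 14. On a fused system with zero-copy data, the transfer term would shrink toward zero instead.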
Porting Memcached
[Figure: Performance vs. CPU. Speed up (X), with No Data Transfers and with Data Transfers, for AMD Radeon HD 5870, Llano A8-3850, and Zacate E-350]
[Figure: Execution Breakdown (0% to 100%), Data Transfer vs. Execution, for AMD Radeon HD 5870, Llano A8-3850, and Zacate E-350]
Results - Radeon HD 5870
[Figure: Normalized Throughput (Requests/Second) and Normalized Latency vs. Requests / Batch (0 to 90000); annotation: Latency - 0.5ms]
• ~8000 requests per batch yields the highest ratio of throughput to latency
Summary
• Programmer intuition doesn’t always paint the whole picture
• We exploited the available parallelism on GPUs by batching requests, showing a 7.5X performance increase on the Llano system
• Data transfer overheads can have a large impact on overall performance
• Thank you – Questions?