SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission
Haicheng Wu*, Gregory Diamos#, Jin Wang*, Srihari Cadambi^, Sudhakar Yalamanchili*, Srimat Chakradhar^
*Georgia Institute of Technology
#NVIDIA Research
^NEC Laboratories America
Sponsors: National Science Foundation, LogicBlox Inc., IBM, and NVIDIA
The General Purpose GPU
[Figure: a multi-core CPU (2-10 cores, ~128 GB main memory) and a GPU (~1500 cores, ~6 GB GPU memory) connected over PCI-E. Steps: ① Input Data, ② Launch Kernel, ③ Execute, ④ Result.]
• GPU is a many-core co-processor
  • 10s to 100s of cores
  • 1000s to 10,000s of concurrent threads
  • CUDA and OpenCL are the dominant programming models
• Well suited for data parallel apps
  • Molecular Dynamics, Options Pricing, Ray Tracing, etc.
• Commodity: led by NVIDIA, AMD, and Intel
Enterprise: Amazon EC2 GPU Instance
Amazon EC2 GPU Instances
Elements   Characteristics
OS         CentOS 5.5
CPU        2 x Intel Xeon X5570 (quad-core "Nehalem" arch, 2.93GHz)
GPU        2 x NVIDIA Tesla "Fermi" M2050 GPU
           Nvidia GPU driver and CUDA toolkit 3.1
Memory     22 GB
Storage    1690 GB
I/O        10 GigE
Price      $2.10/hour
NVIDIA Tesla
Data Warehousing Applications on GPUs
The good
• Lots of potential data parallelism
• If data fits in GPU memory, 2x to 27x speedup has been shown
The bad
• Very large data sets (will not even fit in host memory)
• I/O bound (GPU has no disk)
• PCI data transfer takes 15% to 90% of the total time*
Order   Price   Discount
0       10      10%
1       20      20%
2       10      15%
3       51      14%
4       33      13%
5       22      10%
…       …       …
* B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. In TODS, 2009.
This Work
Goal: Demonstrate the benefits of Kernel Fusion/Kernel Fission in enabling large data warehousing applications on GPUs
Assumptions
• In-memory system
  • Host memory, not GPU memory
• Not OLTP (Online Transaction Processing) type simple queries
  • Focus on data analysis instead of data entry/retrieval
Two Optimizations for Data Movement
Our solutions are:
Kernel Fusion – Aggregate computation to reuse data
Kernel Fission – Overlap computation with PCI transfer
[Figure: a multi-core CPU (2-10 cores, ~128 GB main memory) and a GPU (~1500 cores, ~6 GB GPU memory) connected by a PCI-E link (~16 GB/s). The link is marked "This is the problem!!!"]
Relational Algebra (RA) Operators
RA operators are the building blocks of database applications
UNION
  x = {(3,a), (4,a), (2,b)}, y = {(0,a), (2,b)}
  union x y -> {(3,a), (4,a), (2,b), (0,a)}
INTERSECTION
  x = {(3,a), (4,a), (2,b)}, y = {(0,a), (2,b)}
  intersection x y -> {(2,b)}
PRODUCT
  x = {(3,a), (4,a)}, y = {(True,2)}
  product x y -> {(3,a,True,2), (4,a,True,2)}
DIFFERENCE
  x = {(3,a), (4,a), (2,b)}, y = {(4,a), (3,a)}
  difference x y -> {(2,b)}
JOIN
  x = {(2,b), (3,a), (4,a)}, y = {(2,f), (3,c)}
  join x y -> {(3,a,c), (2,b,f)}
PROJECTION
  x = {(3,True,a), (4,True,a), (2,False,b)}
  project [0,2] x -> {(3,a), (4,a), (2,b)}
SELECT
  x = {(3,True,a), (4,True,a), (2,False,b)}
  select [field.0==2] x -> {(2,False,b)}
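The operator examples above can be reproduced with a minimal Python sketch. It assumes relations are modeled as Python sets of tuples and that JOIN matches on field 0; the function names are illustrative and are not taken from the authors' implementation.

```python
# Relations are modeled as sets of tuples (an assumption for illustration only).

def union(x, y):
    return x | y

def intersection(x, y):
    return x & y

def difference(x, y):
    return x - y

def product(x, y):
    # Concatenate every tuple of x with every tuple of y.
    return {a + b for a in x for b in y}

def join(x, y):
    # Join on field 0 (assumed key): keep matching pairs, drop the duplicate key.
    return {a + b[1:] for a in x for b in y if a[0] == b[0]}

def project(fields, x):
    # Keep only the listed field positions of each tuple.
    return {tuple(t[i] for i in fields) for t in x}

def select(pred, x):
    # Keep only the tuples that satisfy the predicate.
    return {t for t in x if pred(t)}

x = {(3, "a"), (4, "a"), (2, "b")}
y = {(0, "a"), (2, "b")}
z = {(3, True, "a"), (4, True, "a"), (2, False, "b")}

print(sorted(union(x, y)))
print(sorted(intersection(x, y)))
print(sorted(join({(2, "b"), (3, "a"), (4, "a")}, {(2, "f"), (3, "c")})))
print(sorted(project([0, 2], z)))
print(sorted(select(lambda t: t[0] == 2, z)))
```

Each operator touches every tuple independently, which is the data parallelism the earlier slides refer to.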
Common RA Combinations of TPC-H
[Figure: eight common operator patterns, panels (a) through (h), built from SELECT, JOIN, PROJECT, ARITH, and AGGREGATION operators over input relations A1, A2, A3, ..., An]
Experimental Environment
Using a sequence of SELECTs to demonstrate the benefits of Kernel Fusion/Fission
CPU      2 quad-core Xeon E5520 @ 2.27GHz
Memory   48 GB
GPU      1 Tesla C2070 (6GB GDDR5 memory)
OS       Ubuntu 10.04 Server
GCC      4.4.3
NVCC     4.0
PCI Bandwidth vs. GPU Computation Capacity
PCI Bandwidth < GPU Computation Capacity (1 SELECT)
Kernel Fusion
Kernel A (+): A1 + A2 -> temp
  A1:   1 2 3
  A2:   4 5 6
  temp: 5 7 9
Kernel B (-): temp - A3 -> Result
  A3:     2 4 6
  Result: 3 3 3
Fused Kernel A, B (+/-): A1, A2, A3 -> Result
  A1:     1 2 3
  A2:     4 5 6
  A3:     2 4 6
  Result: 3 3 3
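The slide's arithmetic example can be sketched in plain Python as follows. This is only an illustration of the idea on the CPU, not GPU code; the function names are hypothetical.

```python
def kernel_a(a1, a2):
    # Kernel A: elementwise add. The result is a full temporary array.
    return [p + q for p, q in zip(a1, a2)]

def kernel_b(temp, a3):
    # Kernel B: elementwise subtract. It has to re-read the temporary array.
    return [t - r for t, r in zip(temp, a3)]

def fused_kernel(a1, a2, a3):
    # Fused kernel: one pass over the inputs. The intermediate sum lives
    # only in a local value and is never written out as an array.
    return [(p + q) - r for p, q, r in zip(a1, a2, a3)]

a1, a2, a3 = [1, 2, 3], [4, 5, 6], [2, 4, 6]

temp = kernel_a(a1, a2)            # [5, 7, 9]
unfused = kernel_b(temp, a3)       # [3, 3, 3]
fused = fused_kernel(a1, a2, a3)   # [3, 3, 3]
print(unfused == fused)
```

Both paths give the same result; the fused version simply avoids storing and reloading the intermediate array.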
Benefits of Kernel Fusion-Reduce Data Footprint (1)
Spatial Locality
Traverse the data only ONCE
[Figure: GPU data flow from A1, A2, A3 through temp to Result]
Temporal Locality
[Figure: Kernel A and Kernel B each access A1; Fused Kernel A&B accesses A1 once]
Benefits of Kernel Fusion-Reduce Data Footprint (2)
Reduce Data Transfer
[Figure: input1, result1, input2, and result2 moving between CPU MEM and GPU MEM]
Memory Efficiency
[Figure: with separate Kernel A and Kernel B, GPU MEM holds A1, A2, Temp, and A3; with Fused Kernel A, B, GPU MEM holds only A1, A2, and A3]
Benefits of Kernel Fusion-Enlarge Optimization Scope
Eliminate Common Stages
• Kernel A: s1 s2 s3
• Kernel B: s1 s2 s4
• Fused Kernel A&B: s1 s2 s3 s4
Enable More Optimizations
• Larger code is good for other optimizations: a) instruction scheduling, b) register assignment, c) constant propagation, ...
[Figure: Kernel A and Kernel B combined into Fused Kernel A, B]
Examples of Kernel Fusion
Original 1 SELECT
[Figure: CTA0 through CTA3 between GPU MEM and GPU CORE; stages: Partition, Filter, Buffer, Gather; legend: unmatched element, matched element]
Fused 2 SELECTs
[Figure: CTA0 through CTA3; stages: Partition, Filter1, Filter2, Buffer, Gather; legend: unmatched element, partially matched element, completely matched element]
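The stage names in the two figures (Partition, Filter, Buffer, Gather) suggest the following Python sketch of a single SELECT and of two fused SELECTs. It is a sequential stand-in for what the CTAs would do in parallel; the helper names and the chunking scheme are assumptions, not the authors' code.

```python
def partition(data, num_ctas):
    # Partition: split the input into contiguous chunks, one per CTA.
    size = max(1, (len(data) + num_ctas - 1) // num_ctas)
    return [data[i:i + size] for i in range(0, len(data), size)]

def select(data, pred, num_ctas=4):
    # Filter + Buffer: each CTA keeps its matched elements in a private buffer.
    buffers = [[v for v in chunk if pred(v)] for chunk in partition(data, num_ctas)]
    # Gather: pack the per-CTA buffers into one dense result array.
    return [v for buf in buffers for v in buf]

def fused_select(data, pred1, pred2, num_ctas=4):
    buffers = []
    for chunk in partition(data, num_ctas):
        # Filter1 then Filter2 on the same element: a partially matched
        # element (passes pred1, fails pred2) is dropped here and is never
        # written out as an intermediate result.
        buffers.append([v for v in chunk if pred1(v) and pred2(v)])
    # A single Gather for both SELECTs.
    return [v for buf in buffers for v in buf]

data = list(range(20))
is_even = lambda v: v % 2 == 0
above_nine = lambda v: v > 9

two_pass = select(select(data, is_even), above_nine)
one_pass = fused_select(data, is_even, above_nine)
print(two_pass == one_pass)
```

The two-pass version runs Partition and Gather twice and materializes the output of the first SELECT; the fused version does each once.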
Kernel Fusion-Overall Performance
[Figure: performance including PCI and excluding PCI; annotations: "1.80x speedup", "PCI-e noise"]
Kernel Fusion-Breakdown Execution Time
[Figure: execution time breakdown; annotations: "Not needed", "Faster filter and gather"]
Kernel Fusion-Sensitivity
Fusing more kernels is better
Lower selected rate is better
Kernel Fission-CUDA Stream
• Commands (kernel or memcpy) in different CUDA streams can run in parallel
• Commands in the same CUDA stream have to run sequentially
[Figure: Kernel 1 through Kernel 6 distributed across Stream 1, Stream 2, and Stream 3]
Kernel Fission-Stream Pool
Stream Pool is a library that abstracts away the details of CUDA STREAM
API                    Comment
getAvailableStream()   Get an available stream
setStreamCommand()     Assign a command to a specific stream
startStreams()         Start the execution
selectWait()           Assign point-to-point synchronization between two specific streams
terminate()            End the execution immediately
Kernel Fission-Different Ways to Use CUDA Stream
Concurrently running two kernels is not always beneficial
[Figure annotation: the small kernel uses half the resources of the big kernel]
Example of Kernel Fission
[Figure: CTA0 through CTA2 working on data in GPU MEM; timeline with Cycle 0 and Cycle 1, each consisting of CPU->GPU transfer, GPU Computation, and GPU->CPU transfer]
1.37x speedup
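A small timing model can show why splitting the work into segments helps. The sketch below assumes three independent resources (CPU->GPU copy, GPU computation, GPU->CPU copy) that can each work on a different segment at the same time; the stage costs are made-up numbers and are not measurements from this work.

```python
def serial_time(n, h2d, compute, d2h):
    # No overlap: every segment pays for all three stages back to back.
    return n * (h2d + compute + d2h)

def pipelined_time(n, h2d, compute, d2h):
    # Each stage is its own resource, so stage k of segment i can run
    # while stage k-1 of segment i+1 is in flight.
    h2d_done = compute_done = d2h_done = 0
    for _ in range(n):
        h2d_done += h2d                                       # copies are queued back to back
        compute_done = max(h2d_done, compute_done) + compute  # needs its input and a free GPU
        d2h_done = max(compute_done, d2h_done) + d2h          # needs its result and a free link
    return d2h_done

# Hypothetical costs in arbitrary time units: the input copy dominates.
n, h2d, compute, d2h = 4, 4, 2, 1
t_serial = serial_time(n, h2d, compute, d2h)
t_pipelined = pipelined_time(n, h2d, compute, d2h)
print(t_serial, t_pipelined)
```

In this model the pipelined time is bounded by the slowest stage, which is consistent with the earlier observation that PCI bandwidth, not GPU computation, is the limiting factor.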
Kernel Fusion + Kernel Fission
[Figure: CTA0 through CTA5 process segments moved between CPU MEM and GPU MEM by CPU->GPU and GPU->CPU transfers; stages: Partition, Filter1, Filter2, Buffer, GPU Gather, CPU Gather; legend: unmatched element, partially matched element, completely matched element]
1.41x serial; 1.31x fusion only; 1.10x fission only
Real Queries-Q1
[Figure: Query Plan over inputs Date, Price, Tax, Discount, Quantity, Flag, Status; operators: Select, Join, Sort, Aggregate, Arithmetic (+); regions marked "Fusion + Fission" and "Fusion Only"]
[Figure: bar chart of Normalized Execution Time for Not Optimized, Fusion, and Fusion + Fission]
1.26x speedup in total
Real Queries-Q21
[Figure: Query Plan over inputs Status, Date1, Date2, Supplier, Nation; operators: Select, Join, Sort, Aggregate, Arithmetic (+), Unique; regions marked "Fusion + Fission" and "Fusion Only"]
[Figure: bar chart of Normalized Execution Time for Not Optimized, Fusion, and Fusion + Fission]
1.13x speedup in total
Conclusions
• Two data movement optimizations (Kernel Fusion and Kernel Fission) save memory transfer time and speed up computation for data warehousing applications.
• Kernel Fusion
  • Does not need to dump intermediate temporary data
  • Enlarges the optimization scope
• Kernel Fission works like double buffering: it overlaps data transfer with GPU computation
Thank You
Questions?