Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Haicheng Wu*, Gregory Diamos#, Jin Wang*, Srihari Cadambi^, Sudhakar Yalamanchili*, Srimat Chakradhar^

*Georgia Institute of Technology   #NVIDIA Research   ^NEC Laboratories America

Sponsors: National Science Foundation, LogicBlox Inc., IBM, and NVIDIA


The General Purpose GPU

[Figure: a multi-core CPU (2-10 cores, ~128 GB main memory) connected over PCI-E to a GPU (~1500 cores, ~6 GB GPU memory). Steps: ① Input Data, ② Launch Kernel, ③ Execute, ④ Result.]

GPU is a many-core co-processor
  10s to 100s of cores
  1000s to 10,000s of concurrent threads
  CUDA and OpenCL are the dominant programming models

Well suited for data-parallel apps
  Molecular Dynamics, Options Pricing, Ray Tracing, etc.

Commodity: led by NVIDIA, AMD, and Intel


Enterprise: Amazon EC2 GPU Instance

Amazon EC2 GPU Instances

Elements  Characteristics
OS        CentOS 5.5
CPU       2 x Intel Xeon X5570 (quad-core "Nehalem" arch, 2.93GHz)
GPU       2 x NVIDIA Tesla "Fermi" M2050 GPU
          Nvidia GPU driver and CUDA toolkit 3.1
Memory    22 GB
Storage   1690 GB
I/O       10 GigE
Price     $2.10/hour

[Image: NVIDIA Tesla]


Data Warehousing Applications on GPUs

The good
  Lots of potential data parallelism
  If data fits in GPU mem, 2x–27x speedup has been shown

The bad
  Very large data set (will not even fit in host memory)
  I/O bound (GPU has no disk)
  PCI data transfer takes 15–90% of the total time*

Order  Price  Discount
0      10     10%
1      20     20%
2      10     15%
3      51     14%
4      33     13%
5      22     10%
……     ……     ……

* B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. In TODS, 2009.


This Work

Goal: Demonstrate the benefits of Kernel Fusion/Kernel Fission in enabling large data warehousing applications on GPUs

Assumptions
  In-memory system
    Host memory, not GPU memory
  Not OLTP (Online Transaction Processing) type simple queries
    Focus on data analysis instead of data entry/retrieval


Two Optimizations for Data Movement

Our solutions are:

Kernel Fusion – Aggregate computation to reuse data

Kernel Fission – Overlap computation with PCI transfer

[Figure: CPU (multi-core, 2-10 cores, ~128 GB main memory) and GPU (~1500 cores, ~6 GB GPU memory) connected by PCI-E (~16 GB/s), with the annotation "This is the problem!!!"]


Relational Algebra (RA) Operators

RA operators are the building blocks of database applications

UNION
  x = {(3,a), (4,a), (2,b)}, y = {(0,a), (2,b)}
  union x y -> {(3,a), (4,a), (2,b), (0,a)}

INTERSECTION
  x = {(3,a), (4,a), (2,b)}, y = {(0,a), (2,b)}
  intersection x y -> {(2,b)}

PRODUCT
  x = {(3,a), (4,a)}, y = {(True, 2)}
  product x y -> {(3,a,True,2), (4,a,True,2)}

DIFFERENCE
  x = {(3,a), (4,a), (2,b)}, y = {(4,a), (3,a)}
  difference x y -> {(2,b)}

JOIN
  x = {(2,b), (3,a), (4,a)}, y = {(2,f), (3,c)}
  join x y -> {(3,a,c), (2,b,f)}

PROJECTION
  x = {(3,True,a), (4,True,a), (2,False,b)}
  project [0,2] x -> {(3,a), (4,a), (2,b)}

SELECT
  x = {(3,True,a), (4,True,a), (2,False,b)}
  select [field.0==2] x -> {(2,False,b)}
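The operator examples above can be reproduced with a small Python sketch that models a relation as a set of tuples. This is only an illustration of the semantics, not the GPU implementation from the talk. The function names are chosen here for readability, and the assumption that JOIN matches on the first field of each tuple is inferred from the JOIN example.

```python
# Relations are modeled as Python sets of tuples (an illustrative assumption).

def union(x, y):
    return x | y

def intersection(x, y):
    return x & y

def difference(x, y):
    return x - y

def product(x, y):
    # Concatenate every tuple of x with every tuple of y.
    return {a + b for a in x for b in y}

def join(x, y):
    # Assumed: match on the first field (the key) and append the
    # remaining fields of the matching tuple from y.
    return {a + b[1:] for a in x for b in y if a[0] == b[0]}

def project(fields, x):
    # Keep only the listed field positions.
    return {tuple(t[i] for i in fields) for t in x}

def select(predicate, x):
    # Keep only the tuples for which the predicate holds.
    return {t for t in x if predicate(t)}

x = {(3, True, "a"), (4, True, "a"), (2, False, "b")}
selected = select(lambda t: t[0] == 2, x)
projected = project([0, 2], x)
joined = join({(2, "b"), (3, "a"), (4, "a")}, {(2, "f"), (3, "c")})
```

With the inputs from the slide, `selected`, `projected`, and `joined` hold the same tuples as the SELECT, PROJECTION, and JOIN results listed above.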


Common RA Combinations of TPC-H

[Figure: eight common operator patterns in TPC-H, panels (a) to (h), built from SELECT, JOIN, PROJECT, ARITH, and AGGREGATION operators over input relations A1, A2, A3, ..., An.]


Experimental Environment

Using a sequence of SELECTs to demonstrate the benefits of Kernel Fusion/Fission

CPU     2 quad-core Xeon E5520 @ 2.27GHz
Memory  48 GB
GPU     1 Tesla C2070 (6GB GDDR5 memory)
OS      Ubuntu 10.04 Server
GCC     4.4.3
NVCC    4.0


PCI Bandwidth vs. GPU Computation Capacity

[Figure: PCI bandwidth next to GPU computation capacity for 1 SELECT.]

PCI Bandwidth < GPU Computation Capacity (1 SELECT)


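Kernel Fusion

Example with A1 = 1 2 3, A2 = 4 5 6, A3 = 2 4 6:

Unfused: Kernel A computes A1 + A2 = 5 7 9, then Kernel B subtracts A3 to give Result = 3 3 3
Fused: Fused Kernel A, B computes A1 + A2 - A3 = 3 3 3 in one kernel

[Figure: dataflow of Kernel A (inputs A1, A2) followed by Kernel B (input A3, output Result), next to Fused Kernel A, B (inputs A1, A2, A3, output Result).]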
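The arithmetic on this slide can be checked with a minimal Python sketch. It only mirrors the dataflow (two passes with an intermediate array versus one pass without it) and is not CUDA code from the talk; the function names are made up for the example.

```python
def kernel_a(a1, a2):
    # First pass: produce the intermediate array A1 + A2.
    return [x + y for x, y in zip(a1, a2)]

def kernel_b(temp, a3):
    # Second pass: read the intermediate array back and subtract A3.
    return [t - z for t, z in zip(temp, a3)]

def fused_kernel(a1, a2, a3):
    # Single pass: the intermediate value stays in a local variable
    # and no intermediate array is ever materialized.
    return [x + y - z for x, y, z in zip(a1, a2, a3)]

a1, a2, a3 = [1, 2, 3], [4, 5, 6], [2, 4, 6]
temp = kernel_a(a1, a2)
unfused_result = kernel_b(temp, a3)
fused_result = fused_kernel(a1, a2, a3)
```

Both paths produce the same result; the fused version avoids writing and re-reading `temp`.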


Benefits of Kernel Fusion-Reduce Data Footprint (1)

Spatial Locality
Temporal Locality
Traverse the data only ONCE

[Figure: data flow of Kernel A and Kernel B (inputs A1, A2, A3, intermediate temp and Result in GPU memory) compared with Fused Kernel A&B.]


Benefits of Kernel Fusion-Reduce Data Footprint (2)

Reduce Data Transfer
Memory Efficiency

[Figure: transfers of input1, result1, input2, and result2 between CPU memory and GPU memory. GPU memory holds A1, A2, Temp, and A3 when Kernel A and Kernel B run separately, and only A1, A2, and A3 with Fused Kernel A, B.]


Benefits of Kernel Fusion-Enlarge Optimization Scope

Eliminate Common Stages
  Kernel A: s1 s2 s3
  Kernel B: s1 s2 s4
  Fused Kernel A&B: s1 s2 s3 s4

Enable More Optimizations
  Larger code is good for other optimizations: a) instruction scheduling, b) register assignment, c) constant propagation ……


Examples of Kernel Fusion

[Figure: Original 1 SELECT. CTA0 to CTA3 on the GPU cores process data from GPU memory through the stages Partition, Filter, Buffer, and Gather; elements are marked as unmatched or matched.]

[Figure: Fused 2 SELECTs. CTA0 to CTA3 run the stages Partition, Filter1, Filter2, Buffer, and Gather; elements are marked as unmatched, partially matched, or completely matched.]
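The stage names in the two figures suggest the following structure, sketched here in sequential Python. This is an assumption-laden illustration: each "CTA" is modeled as a loop iteration over one partition, the per-CTA buffer is a Python list, and `num_ctas`, `select_staged`, and `fused_select` are hypothetical names, not identifiers from the talk.

```python
def select_staged(data, predicate, num_ctas=4):
    # Partition: give each CTA one contiguous chunk of the input.
    chunk = max(1, (len(data) + num_ctas - 1) // num_ctas)
    partitions = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    # Filter + Buffer: each CTA keeps its matched elements in its own buffer.
    buffers = [[v for v in part if predicate(v)] for part in partitions]
    # Gather: concatenate the per-CTA buffers into one dense result.
    return [v for buf in buffers for v in buf]

def fused_select(data, pred1, pred2, num_ctas=4):
    # Fused version: both filters run back to back on each element,
    # so only completely matched elements reach the buffer and there
    # is a single partition step and a single gather step.
    return select_staged(data, lambda v: pred1(v) and pred2(v), num_ctas)

data = list(range(20))
is_even = lambda v: v % 2 == 0
is_small = lambda v: v < 10

# Two separate SELECTs: the first result is materialized, then filtered again.
two_pass = select_staged(select_staged(data, is_even), is_small)
# One fused SELECT over the original input.
one_pass = fused_select(data, is_even, is_small)
```

In this model the fused version performs one gather instead of two and never builds the intermediate result of the first SELECT.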


Kernel Fusion-Overall Performance

[Figure: overall performance of kernel fusion, with series "Including PCI" and "Excluding PCI"; annotations: "1.80x speedup" and "PCI-e noise".]


Kernel Fusion-Breakdown Execution Time

[Figure: execution time breakdown for kernel fusion; annotations: "Not needed" and "Faster filter and gather".]


Kernel Fusion-Sensitivity

Fusing more kernels is better

Lower selected rate is better


Kernel Fission-CUDA Stream

• Commands (Kernel or Memcpy) in different CUDA streams can run in parallel

• Commands in the same CUDA stream have to run sequentially

[Figure: Kernel 1 to Kernel 6 distributed across Stream 1, Stream 2, and Stream 3.]


Kernel Fission-Stream Pool

Stream Pool is a library that abstracts away the details of CUDA streams

API                   Comment
getAvailableStream()  Get an available stream
setStreamCommand()    Assign a command to a specific stream
startStreams()        Start the execution
selectWait()          Assign point-to-point synchronization between two specific streams
terminate()           End the execution immediately
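The slides do not show the signatures of these calls, so the following is only a toy, single-threaded Python model of the queueing behavior the table describes. The class `ToyStreamPool`, its method signatures, and the round-robin scheduling are all assumptions made for illustration; `selectWait()` and `terminate()` are left out. It shows the one property the stream slides state: commands in the same stream keep their order, while commands in different streams may interleave.

```python
from collections import deque


class ToyStreamPool:
    """Toy model: each stream is a FIFO queue of commands (plain callables)."""

    def __init__(self, num_streams):
        self.queues = [deque() for _ in range(num_streams)]
        self.next_free = 0

    def get_available_stream(self):
        # Hand out stream ids in order until the pool is exhausted.
        if self.next_free >= len(self.queues):
            raise RuntimeError("no stream available")
        stream_id = self.next_free
        self.next_free += 1
        return stream_id

    def set_stream_command(self, stream_id, command):
        # Commands queued on one stream run in the order they were added.
        self.queues[stream_id].append(command)

    def start_streams(self):
        # Round-robin over the streams, one command per stream per step.
        # A real GPU may overlap these; here they are merely interleaved.
        while any(self.queues):
            for queue in self.queues:
                if queue:
                    queue.popleft()()


log = []

def record(name):
    # Stand-in for a kernel launch or memcpy: just log that it ran.
    return lambda: log.append(name)

pool = ToyStreamPool(num_streams=3)
s1, s2, s3 = (pool.get_available_stream() for _ in range(3))
for stream_id, label in ((s1, "s1"), (s2, "s2"), (s3, "s3")):
    pool.set_stream_command(stream_id, record(label + "-first"))
    pool.set_stream_command(stream_id, record(label + "-second"))
pool.start_streams()
```

After `start_streams()` returns, `log` lists every stream's first command before any stream's second command, and within each stream the first command always precedes the second.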


Kernel Fission-Different Ways to Use CUDA Stream

Concurrently running two kernels is not always beneficial

(the "small" kernel uses half the resources of the "big" kernel)


Example of Kernel Fission

[Figure: the data in GPU memory is split among CTA0, CTA1, and CTA2. Each cycle consists of a CPU->GPU transfer, GPU computation, and a GPU->CPU transfer; Cycle 0 and Cycle 1 are shown side by side.]

1.37x speedup
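Why splitting the work into chunks helps can be seen with a small timing model in Python. It is a sketch under stated assumptions: every chunk takes the same time in each stage, and the CPU->GPU copy, the GPU computation, and the GPU->CPU copy each have their own resource so that different chunks can occupy different stages at once. The stage times used below are made-up illustrative numbers, not measurements from the talk.

```python
def serial_time(num_chunks, h2d, compute, d2h):
    # No overlap: every chunk goes through all three stages back to back.
    return num_chunks * (h2d + compute + d2h)

def pipelined_time(num_chunks, h2d, compute, d2h):
    # Three-stage pipeline: a chunk enters a stage as soon as it has left
    # the previous stage AND the stage has finished the previous chunk.
    stage_times = (h2d, compute, d2h)
    stage_free = [0.0, 0.0, 0.0]  # time at which each stage becomes idle
    for _ in range(num_chunks):
        ready = 0.0  # time at which this chunk left its previous stage
        for stage, duration in enumerate(stage_times):
            start = max(ready, stage_free[stage])
            ready = start + duration
            stage_free[stage] = ready
    return stage_free[-1]

# Illustrative numbers only: 4 chunks, 1 time unit per copy, 2 per computation.
t_serial = serial_time(4, 1.0, 2.0, 1.0)
t_pipelined = pipelined_time(4, 1.0, 2.0, 1.0)
speedup = t_serial / t_pipelined
```

In this model the slowest stage bounds the pipelined time, so the achievable speedup depends on how evenly transfer and computation times are balanced.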


Kernel Fusion + Kernel Fission

[Figure: CTA0 to CTA5 process data through the stages Partition, Filter1, Filter2, Buffer, GPU Gather, and CPU Gather, with CPU->GPU and GPU->CPU transfers between CPU memory and GPU memory; elements are marked as unmatched, partially matched, or completely matched.]

Speedup: 1.41x vs. serial, 1.31x vs. fusion only, 1.10x vs. fission only


Real Queries-Q1

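[Figure: Query plan for Q1 over the columns Date, Price, Tax, Discount, Quantity, Flag, and Status. Legend: Select, Join, Sort, Aggregate, Arithmetic (+). Regions of the plan are marked "Fusion + Fission" and "Fusion Only".]

[Figure: normalized execution time of Q1 for Not Optimized, Fusion, and Fusion + Fission; vertical axis from 0 to 1.4.]

Total speedup: 1.26x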


Real Queries-Q21

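[Figure: Query plan for Q21 over the inputs Status, Date1, Date2, Supplier, and Nation. Legend: Select, Join, Sort, Aggregate, Unique, Arithmetic (+). Regions of the plan are marked "Fusion + Fission" and "Fusion Only".]

[Figure: normalized execution time of Q21 for Not Optimized, Fusion, and Fusion + Fission; vertical axis from 0.9 to 1.15.]

Total speedup: 1.13x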


Conclusions

Two data movement optimizations (Kernel Fusion and Kernel Fission) save memory transfer time and speed up computation for data warehousing applications.

Kernel Fusion
  Does not need to dump intermediate temporary data
  Enlarges the optimization scope

Kernel Fission
  Works like a double buffer that can overlap data transfer with GPU computation


Thank You

Questions?

