
Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Date post: 11-Feb-2016
Description:
Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission. Haicheng Wu*, Gregory Diamos#, Jin Wang*, Srihari Cadambi^, Sudhakar Yalamanchili*, Srimat Chakradhar^ (*Georgia Institute of Technology, #NVIDIA Research, ^NEC Laboratories America). PowerPoint presentation.
Transcript
Page 1: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Haicheng Wu*, Gregory Diamos#, Jin Wang*, Srihari Cadambi^, Sudhakar Yalamanchili*, Srimat Chakradhar^

*Georgia Institute of Technology, #NVIDIA Research, ^NEC Laboratories America

Sponsors: National Science Foundation, LogicBlox Inc., IBM, and NVIDIA

Page 2: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

The General Purpose GPU

Figure: a multi-core CPU (2-10 cores) with ~128 GB of main memory, connected over PCI-E to a GPU (~1500 cores) with ~6 GB of GPU memory. Steps: ① Input Data, ② Launch Kernel, ③ Execute, ④ Result.

GPU is a many-core co-processor
• 10s to 100s of cores
• 1000s to 10,000s of concurrent threads
• CUDA and OpenCL are the dominant programming models

Well suited for data-parallel apps
• Molecular Dynamics, Options Pricing, Ray Tracing, etc.

Commodity: led by NVIDIA, AMD, and Intel

Page 3: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Enterprise: Amazon EC2 GPU Instance

Amazon EC2 GPU Instances

Elements   Characteristics
OS         CentOS 5.5
CPU        2 x Intel Xeon X5570 (quad-core "Nehalem" arch, 2.93GHz)
GPU        2 x NVIDIA Tesla "Fermi" M2050 GPU
           Nvidia GPU driver and CUDA toolkit 3.1
Memory     22 GB
Storage    1690 GB
I/O        10 GigE
Price      $2.10/hour

Figure: NVIDIA Tesla

Page 4: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Data Warehousing Applications on GPUs

The good
• Lots of potential data parallelism
• If the data fits in GPU memory, speedups of 2x to 27x have been shown

The bad
• Very large data sets (will not even fit in host memory)
• I/O bound (the GPU has no disk)
• PCI data transfer takes 15-90% of the total time*

Order   Price   Discount
0       10      10%
1       20      20%
2       10      15%
3       51      14%
4       33      13%
5       22      10%
……      ……      ……

* B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. In TODS, 2009.

Page 5: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

This Work

Goal: Demonstrate the benefits of Kernel Fusion/Kernel Fission in enabling large data warehousing applications on GPUs

Assumptions
• In-memory system: host memory, not GPU memory
• Not OLTP (Online Transaction Processing) type simple queries: focus on data analysis instead of data entry/retrieval

Page 6: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Two Optimizations for Data Movement

Figure: CPU (multi-core, 2-10 cores; main memory ~128 GB) connected to GPU (~1500 cores; GPU memory ~6 GB) over PCI-E (~16 GB/s). The PCI-E link is labeled "This is the problem!!!"

Our solutions are:
• Kernel Fusion: aggregate computation to reuse data
• Kernel Fission: overlap computation with PCI transfer

Page 7: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Relational Algebra (RA) Operators

RA operators are the building blocks of database applications

UNION
x = {(3,a), (4,a), (2,b)}, y = {(0,a), (2,b)}
union x y -> {(3,a), (4,a), (2,b), (0,a)}

INTERSECTION
x = {(3,a), (4,a), (2,b)}, y = {(0,a), (2,b)}
intersection x y -> {(2,b)}

PRODUCT
x = {(3,a), (4,a)}, y = {(True, 2)}
product x y -> {(3,a,True,2), (4,a,True,2)}

DIFFERENCE
x = {(3,a), (4,a), (2,b)}, y = {(4,a), (3,a)}
difference x y -> {(2,b)}

JOIN
x = {(2,b), (3,a), (4,a)}, y = {(2,f), (3,c)}
join x y -> {(3,a,c), (2,b,f)}

PROJECTION
x = {(3,True,a), (4,True,a), (2,False,b)}
project [0,2] x -> {(3,a), (4,a), (2,b)}

SELECT
x = {(3,True,a), (4,True,a), (2,False,b)}
select [field.0==2] x -> {(2,False,b)}
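
The operator examples above can be reproduced with a small Python sketch. It assumes relations are modeled as Python sets of tuples and that JOIN matches on field 0; the function names are illustrative and are not the GPU implementation described in the talk.

```python
# Relations are modeled as sets of tuples (an assumption made for illustration).

def union(x, y):
    return x | y

def intersection(x, y):
    return x & y

def difference(x, y):
    return x - y

def product(x, y):
    # Concatenate every tuple of x with every tuple of y.
    return {a + b for a in x for b in y}

def join(x, y):
    # Join on field 0 (assumed key); the key appears once in each output tuple.
    return {a + b[1:] for a in x for b in y if a[0] == b[0]}

def project(fields, x):
    # Keep only the listed field positions of each tuple.
    return {tuple(t[i] for i in fields) for t in x}

def select(pred, x):
    # Keep only the tuples that satisfy the predicate.
    return {t for t in x if pred(t)}

x = {(3, "a"), (4, "a"), (2, "b")}
y = {(0, "a"), (2, "b")}
z = {(3, True, "a"), (4, True, "a"), (2, False, "b")}

print(union(x, y))
print(join({(2, "b"), (3, "a"), (4, "a")}, {(2, "f"), (3, "c")}))
print(select(lambda t: t[0] == 2, z))
```

Because Python sets are unordered, the printed order of tuples may differ from the order shown on the slide.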

Page 8: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Common RA Combinations of TPC-H

Figure: eight operator-combination patterns, labeled (a) through (h), composed of SELECT, JOIN, PROJECT, ARITH, and AGGREGATION operators applied to input relations A1, A2, A3, ..., An.

Page 9: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Experimental Environment

Using a sequence of SELECTs to demonstrate the benefits of Kernel Fusion/Fission

CPU      2 quad-core Xeon E5520 @ 2.27GHz
Memory   48 GB
GPU      1 Tesla C2070 (6GB GDDR5 memory)
OS       Ubuntu 10.04 Server
GCC      4.4.3
NVCC     4.0

Page 10: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

PCI Bandwidth vs. GPU Computation Capacity

Figure: PCI bandwidth < GPU computation capacity (1 SELECT)

Page 11: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Kernel Fusion

Inputs: A1: 1 2 3, A2: 4 5 6, A3: 2 4 6

Separate kernels
• Kernel A: A1 + A2 -> temp: 5 7 9
• Kernel B: temp - A3 -> Result: 3 3 3

Fused kernel
• Fused Kernel A, B: A1 + A2 - A3 -> Result: 3 3 3
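
The add-then-subtract example on this slide can be sketched in Python as follows. This is a CPU-side illustration of the idea only, using plain lists in place of GPU arrays; the function names are assumptions.

```python
a1 = [1, 2, 3]
a2 = [4, 5, 6]
a3 = [2, 4, 6]

def kernel_a(x, y):
    # Element-wise add; the result is a full-size temporary array.
    return [p + q for p, q in zip(x, y)]

def kernel_b(t, z):
    # Element-wise subtract; reads the temporary array back in.
    return [p - q for p, q in zip(t, z)]

def fused_kernel(x, y, z):
    # One pass over the inputs: the intermediate sum stays in a local value
    # and is never written out as a temporary array.
    return [(p + q) - r for p, q, r in zip(x, y, z)]

temp = kernel_a(a1, a2)           # [5, 7, 9]
unfused = kernel_b(temp, a3)      # [3, 3, 3]
fused = fused_kernel(a1, a2, a3)  # [3, 3, 3]
print(temp, unfused, fused)
```

Both paths produce the same result; the fused version simply avoids materializing `temp`.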

Page 12: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Benefits of Kernel Fusion: Reduce Data Footprint (1)

Figure: Kernel A and Kernel B run separately, exchanging a temp array through GPU memory, compared with Fused Kernel A&B over inputs A1, A2, A3. Panel labels: Spatial Locality; Traverse the data only ONCE; Temporal Locality.

Page 13: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Benefits of Kernel Fusion: Reduce Data Footprint (2)

Figure, panel "Reduce Data Transfer": input1, result1, input2, and result2 moving between CPU memory and GPU memory.

Figure, panel "Memory Efficiency": GPU memory holding A1, A2, Temp, and A3 for Kernel A followed by Kernel B, compared with A1, A2, and A3 for Fused Kernel A, B.

Page 14: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Benefits of Kernel Fusion: Enlarge Optimization Scope

Eliminate Common Stages
• Kernel A: s1 s2 s3
• Kernel B: s1 s2 s4
• Fused Kernel A&B: s1 s2 s3 s4

Enable More Opt
• Larger code is good for other optimizations: a) instruction scheduling, b) register assignment, c) constant propagation ……

Page 15: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Examples of Kernel Fusion

Figure, "Original 1 SELECT": CTA0 to CTA3 process the input in the stages Partition, Filter, Buffer, Gather. Elements are marked as unmatched or matched. Labels: GPU MEM, GPU CORE, GPU MEM.

Figure, "Fused 2 SELECTs": CTA0 to CTA3 process the input in the stages Partition, Filter1, Filter2, Buffer, Gather. Elements are marked as unmatched, partially matched, or completely matched. Labels: GPU MEM, GPU CORE, CPU MEM.
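
The stage names in these two figures (Partition, Filter, Buffer, Gather) suggest the following minimal Python sketch. It assumes each CTA handles one contiguous chunk and keeps its matches in a private buffer; the function names, chunking scheme, and predicates are illustrative assumptions, not the authors' kernels.

```python
def partition(data, num_ctas):
    # Split the input into contiguous chunks, one per CTA (thread block).
    size = max(1, -(-len(data) // num_ctas))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]

def gather(buffers):
    # Concatenate the per-CTA buffers into one dense output array.
    return [v for buf in buffers for v in buf]

def select(data, pred, num_ctas=4):
    # Original single SELECT: Partition -> Filter -> Buffer -> Gather.
    chunks = partition(data, num_ctas)
    buffers = [[v for v in chunk if pred(v)] for chunk in chunks]
    return gather(buffers)

def fused_select(data, pred1, pred2, num_ctas=4):
    # Fused 2 SELECTs: Partition -> Filter1 -> Filter2 -> Buffer -> Gather.
    # Partition, Buffer, and Gather run once instead of twice.
    chunks = partition(data, num_ctas)
    buffers = [[v for v in chunk if pred1(v) and pred2(v)] for chunk in chunks]
    return gather(buffers)

data = list(range(20))
is_even = lambda v: v % 2 == 0
above_ten = lambda v: v > 10

two_pass = select(select(data, is_even), above_ten)
one_pass = fused_select(data, is_even, above_ten)
print(two_pass, one_pass)
```

In the two-pass version the intermediate result of the first SELECT is gathered and then partitioned again; the fused version skips that round trip.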

Page 16: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Kernel Fusion: Overall Performance

Figure: performance of the fused SELECT kernels. Chart annotations: Including PCI; Excluding PCI; 1.80x speedup; PCI-e noise.

Page 17: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Kernel Fusion: Execution Time Breakdown

Figure: execution time breakdown. Chart annotations: Not needed; Faster filter and gather.

Page 18: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Kernel Fusion: Sensitivity

• Fusing more kernels is better
• A lower selection rate (fraction of elements selected) is better

Page 19: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Kernel Fission: CUDA Stream

• Commands (kernels or memcpys) in different CUDA streams can run in parallel

• Commands in the same CUDA stream run sequentially

Stream 1   Stream 2   Stream 3
Kernel 1   Kernel 2   Kernel 3
Kernel 4   Kernel 5   Kernel 6
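
The two ordering rules on this slide can be mimicked on the CPU with one Python thread per stream. This is only an analogy, not CUDA code, and the assignment of kernels to streams (Kernel 1 and 4 in stream 1, and so on) is an assumed reading of the diagram.

```python
import threading

def run_streams(streams):
    """Run each stream's commands in order; different streams run concurrently."""
    log = []
    lock = threading.Lock()

    def worker(stream_id, commands):
        for cmd in commands:  # same stream: strictly sequential
            with lock:
                log.append((stream_id, cmd))

    threads = [threading.Thread(target=worker, args=(sid, cmds))
               for sid, cmds in streams.items()]
    for t in threads:  # different streams: started together, may interleave
        t.start()
    for t in threads:
        t.join()
    return log

streams = {
    1: ["Kernel 1", "Kernel 4"],
    2: ["Kernel 2", "Kernel 5"],
    3: ["Kernel 3", "Kernel 6"],
}
log = run_streams(streams)
print(log)
```

The interleaving across streams can change from run to run, but within any one stream the commands always appear in issue order.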

Page 20: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Kernel Fission: Stream Pool

Stream Pool is a library that abstracts away the details of CUDA streams

API                    Comment
getAvailableStream()   Get an available stream
setStreamCommand()     Assign a command to a specific stream
startStreams()         Start the execution
selectWait()           Assign point-to-point synchronization between two specific streams
terminate()            End the execution immediately
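
To make the call sequence concrete, here is a toy Python analog of the Stream Pool interface. Only the method names come from the slide; the argument lists, the blocking behavior of `startStreams()`, and the thread-based implementation are assumptions made for illustration, and `terminate()` is left out.

```python
import threading

class StreamPool:
    """Toy CPU analog of the Stream Pool API; all signatures are hypothetical."""

    def __init__(self, num_streams):
        self._free = list(range(num_streams))
        self._commands = {s: [] for s in range(num_streams)}
        self._waits = {s: [] for s in range(num_streams)}
        self._done = {s: threading.Event() for s in range(num_streams)}

    def getAvailableStream(self):
        # Hand out the next unused stream id.
        return self._free.pop(0)

    def setStreamCommand(self, stream, command):
        # Queue a callable on the given stream.
        self._commands[stream].append(command)

    def selectWait(self, waiter, signaler):
        # The waiter stream starts only after the signaler stream has finished.
        self._waits[waiter].append(signaler)

    def startStreams(self):
        # Assumption: blocks until every stream has drained its queue.
        threads = [threading.Thread(target=self._run, args=(s,))
                   for s in self._commands]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    def _run(self, stream):
        for dep in self._waits[stream]:
            self._done[dep].wait()
        for command in self._commands[stream]:  # in-order within a stream
            command()
        self._done[stream].set()

pool = StreamPool(2)
copy_stream = pool.getAvailableStream()
compute_stream = pool.getAvailableStream()
trace = []
pool.setStreamCommand(copy_stream, lambda: trace.append("copy chunk to GPU"))
pool.setStreamCommand(compute_stream, lambda: trace.append("run SELECT kernel"))
pool.selectWait(compute_stream, copy_stream)
pool.startStreams()
print(trace)
```

The `selectWait` call is what forces the copy to finish before the kernel runs; without it the two streams would be free to execute in either order.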

Page 21: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Kernel Fission: Different Ways to Use CUDA Stream

Concurrently running two kernels is not always beneficial

Figure note: the "small" kernel uses half the resources of the "big" kernel

Page 22: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Example of Kernel Fission

Figure: the input is partitioned across CTA0, CTA1, and CTA2 in GPU memory. Each cycle consists of three steps (CPU->GPU transfer, GPU computation, GPU->CPU transfer), and the steps of Cycle 0 and Cycle 1 overlap.

1.37x speedup
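
A simple timing model shows why splitting the work into chunks and overlapping the three steps helps. The sketch below assumes the CPU->GPU copy, the GPU computation, and the GPU->CPU copy each use a separate resource that handles one chunk at a time; the stage times are made-up units, so it does not reproduce the measured 1.37x.

```python
def serial_time(n_chunks, h2d, compute, d2h):
    # No overlap: every chunk runs its three steps back to back.
    return n_chunks * (h2d + compute + d2h)

def pipelined_time(n_chunks, h2d, compute, d2h):
    # Three-stage pipeline: a chunk enters a stage once it has left the
    # previous stage and the stage's resource is free.
    stage_times = (h2d, compute, d2h)
    free_at = [0.0, 0.0, 0.0]  # time at which each resource becomes free
    for _ in range(n_chunks):
        ready = 0.0  # time at which this chunk finished its previous stage
        for s, t in enumerate(stage_times):
            start = max(ready, free_at[s])
            ready = start + t
            free_at[s] = ready
    return free_at[-1]

serial = serial_time(4, 1.0, 1.0, 1.0)
overlapped = pipelined_time(4, 1.0, 1.0, 1.0)
print(serial, overlapped, serial / overlapped)
```

With four chunks and equal stage times the model gives 12 time units serially and 6 when pipelined; real gains are smaller when one stage, typically the PCI transfer, dominates.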

Page 23: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Kernel Fusion + Kernel Fission

Figure: CTA0 to CTA5 run the fused SELECT stages (Partition, Filter1, Filter2, Buffer, GPU Gather) on the GPU cores between CPU->GPU and GPU->CPU transfers, followed by a CPU Gather in CPU memory. Elements are marked as unmatched, partially matched, or completely matched.

Speedup labels: 1.41x serial; 1.31x fusion only; 1.10x fission only

Page 24: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Real Queries: Q1

Figure, "Query Plan": inputs Date, Price, Tax, Discount, Quantity, Flag, Status. Legend: Select, Join, Sort, Aggregate, Arithmetic (+). Regions are marked "Fusion Only" and "Fusion + Fission".

Figure: normalized execution time (axis 0 to 1.4) for Not Optimized, Fusion, and Fusion + Fission.

Total speedup: 1.26x

Page 25: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Real Queries: Q21

Figure, "Query Plan": inputs Status, Date1, Date2, Supplier, Nation. Legend: Select, Join, Sort, Aggregate, Unique, Arithmetic (+). Regions are marked "Fusion Only" and "Fusion + Fission".

Figure: normalized execution time (axis 0.9 to 1.15) for Not Optimized, Fusion, and Fusion + Fission.

Total speedup: 1.13x

Page 26: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Conclusions

Two data movement optimizations (Kernel Fusion and Kernel Fission) save memory transfer time and speed up computation for data warehousing applications.

Kernel Fusion
• Does not need to dump intermediate temporary data
• Enlarges the optimization scope

Kernel Fission
• Works like double buffering: it overlaps data transfer with GPU computation

Page 27: Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Thank You

Questions?
