EECS 583 – Class 20
Research Topic 2: Stream Compilation, GPU Compilation
University of Michigan
December 3, 2012
Guest Speakers Today: Daya Khudia and Mehrzad Samadi
- 2 -
Announcements & Reading Material
Exams graded and will be returned in Wednesday’s class
This class
» "Orchestrating the Execution of Stream Programs on Multicore Platforms," M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008.
Next class – Research Topic 3: Security
» "Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software," James Newsome and Dawn Song, Proceedings of the Network and Distributed System Security Symposium, Feb. 2005.
Stream Graph Modulo Scheduling
- 4 -
Stream Graph Modulo Scheduling (SGMS)
Coarse grain software pipelining
» Equal work distribution
» Communication/computation overlap
» Synchronization costs
Target: Cell processor
» Cores with disjoint address spaces
» Explicit copy to access remote data
» DMA engine independent of PEs
Filters = operations, cores = function units
[Figure: Cell processor – eight SPEs (SPE0–SPE7), each with an SPU, 256 KB local store, and MFC (DMA engine), connected over the EIB to the PPE (PowerPC) and DRAM.]
- 5 -
Preliminaries
Synchronous Data Flow (SDF) [Lee '87], StreamIt [Thies '02]
int->int filter FIR(int N, int wgts[N]) {
  work pop 1 push 1 {
    int i, sum = 0;
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    push(sum);
    pop();
  }
}
Push and pop items from input/output FIFOs
Stateless: the filter above carries no values from one work invocation to the next
Stateful: adding persistent state, e.g.
  int wgts[N];
  wgts = adapt(wgts);
makes the filter stateful, since wgts is carried across invocations
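For readers newer to StreamIt, here is a rough, self-contained C sketch of the same stateless FIR work function, with a hand-rolled ring buffer standing in for StreamIt's input channel (the FIFO helpers and driver are made up for illustration, not part of StreamIt):

#include <stdio.h>

/* C sketch of the stateless FIR filter: each firing peeks N inputs,
 * pushes one output, and pops one input. */
#define CAP 64
static int fifo[CAP];
static int head = 0, tail = 0;

static int  peek(int i)      { return fifo[(head + i) % CAP]; }
static void pop_one(void)    { head = (head + 1) % CAP; }
static void push_out(int v)  { printf("out: %d\n", v); }  /* stand-in for the output channel */

static void fir_work(int N, const int *wgts) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += peek(i) * wgts[i];   /* read inputs without consuming them */
    push_out(sum);                  /* push 1 */
    pop_one();                      /* pop 1 */
}

int main(void) {
    int wgts[3] = {1, 2, 3};
    for (int v = 0; v < 8; v++) fifo[tail++ % CAP] = v;  /* preload the input FIFO */
    for (int f = 0; f < 4; f++) fir_work(3, wgts);       /* four firings */
    return 0;
}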
- 6 -
SGMS Overview
[Figure: SGMS overview – the stream graph is partitioned across PE0–PE3 and coarse-grain software pipelined, with DMA transfers overlapped against computation and explicit prologue and epilogue phases; the single-PE iteration time T1 is roughly 4× the steady-state time T4 on four PEs.]
- 7 -
SGMS Phases
Fission + processor assignment (load balance) → Stage assignment (causality, DMA overlap) → Code generation
- 8 -
Processor Assignment: Maximizing Throughput
• Assigns each filter to a processor (four processing elements PE0–PE3; W = workload of a filter)
• ILP formulation, with a_ij = 1 if filter i is assigned to PE j and w_i the workload of filter i (illustrated in the sketch below):
  Minimize II
  subject to:
    Σj a_ij = 1          for all filters i = 1, …, N   (each filter assigned to exactly one PE)
    Σi w_i · a_ij ≤ II   for all PEs j = 1, …, P       (no PE's total load exceeds II)
[Figure: an example stream graph with six filters A–F whose workloads total T1 = 170. On a single PE one iteration takes T1 = 170; a balanced assignment onto the four PEs brings the maximum per-PE load down to the minimum II = T2 = 50 (balanced workload, maximum throughput), for a gain of T1/T2 = 3.4.]
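To make the objective concrete, the short C sketch below computes II for a given filter → PE assignment as the maximum per-PE load (the workloads and the particular assignment are illustrative values chosen to match the totals in the example, not taken from the paper):

#include <stdio.h>

/* II for a given assignment = the largest total workload on any PE. */
static int compute_II(int nfilters, const int *work, const int *pe_of, int npes) {
    int load[16] = {0};                     /* assumes npes <= 16 */
    for (int i = 0; i < nfilters; i++)
        load[pe_of[i]] += work[i];          /* accumulate each PE's load */
    int II = 0;
    for (int j = 0; j < npes; j++)
        if (load[j] > II) II = load[j];     /* the bottleneck PE determines II */
    return II;
}

int main(void) {
    int work[6]  = {20, 20, 20, 30, 50, 30};  /* filters A..F, total work = 170 */
    int pe_of[6] = {0, 1, 2, 1, 3, 2};        /* one balanced 4-PE assignment */
    printf("II = %d\n", compute_II(6, work, pe_of, 4));  /* prints II = 50 */
    return 0;
}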
- 9 -
Need More Than Just Processor Assignment
Assign filters to processors
» Goal: equal work distribution
Graph partitioning? Bin packing?
[Figure: an original stream program with filters A (work 5), B (40), C (10), and D (5), total work 60. The best two-PE partition is limited by B: speedup = 60/40 = 1.5. In the modified stream program, B is fissed into B1 and B2 with split (S) and join (J) nodes, and the balanced mapping gives speedup = 60/32 ≈ 2.]
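Reading off the numbers: before fission the heaviest PE holds B alone, so each pipeline iteration is gated by 40 of the 60 total work units (60/40 = 1.5×); after B is split into two roughly 20-unit halves, the heaviest PE carries about 32 units once split/join overhead is counted, hence 60/32 ≈ 2×.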
- 10 -
Filter Fission Choices
[Figure: with four PEs (PE0–PE3), which filters should be fissed, and how many times, to approach speedup ≈ 4?]
- 11 -
Integrated Fission + PE Assignment
Exact solution based on Integer Linear Programming (ILP)
» Split/join overhead factored in
Objective function
» Minimize the maximal load on any PE
Result
» Number of times to "split" each filter
» Filter → processor mapping
- 12 -
Step 2: Forming the Software Pipeline
To achieve speedup:
» All chunks should execute concurrently
» Communication should be overlapped
Processor assignment alone is insufficient information
[Figure: filters A and B assigned to PE0 and PE1. Running Ai and Bi in the same time step is incorrect – Bi needs Ai's output, which must first be moved by DMA. Software pipelining staggers the iterations: the A→B DMA and the next A execute while an earlier B runs, so Bi overlaps with Ai+2.]
- 13 -
Stage Assignment
Preserve causality (producer–consumer dependence)
» For an edge i → j with both filters on the same PE: Sj ≥ Si
Communication–computation overlap
» For an edge i → j crossing PEs, the DMA gets its own stage: SDMA > Si and Sj = SDMA + 1
Data flow traversal of the stream graph
» Assign stages using the above two rules (see the sketch below)
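A minimal C sketch of that traversal follows, applying the two rules edge by edge on the fissed example graph (the edge list, PE mapping, and the choice SDMA = Si + 1 are illustrative assumptions, not the paper's implementation):

#include <stdio.h>

/* Stage assignment by data-flow (topological) traversal of the stream graph.
 * Filters: 0=A, 1=S, 2=B1, 3=B2, 4=C, 5=J, 6=D (the fissed example).
 * Rules, applied per edge i -> j:
 *   same PE:        S[j] >= S[i]
 *   different PEs:  S_DMA = S[i] + 1 (DMA gets its own stage), S[j] >= S_DMA + 1 */
#define NF 7
#define NE 8

int main(void) {
    int src[NE] = {0, 0, 1, 1, 2, 3, 4, 5};   /* A->S, A->C, S->B1, S->B2, B1->J, B2->J, C->D, J->D */
    int dst[NE] = {1, 4, 2, 3, 5, 5, 6, 6};
    int pe[NF]  = {0, 0, 0, 1, 1, 1, 0};      /* A,S,B1,D on PE0; B2,C,J on PE1 */
    int stage[NF] = {0};

    for (int e = 0; e < NE; e++) {            /* edges listed in topological order */
        int i = src[e], j = dst[e];
        if (pe[i] == pe[j]) {
            if (stage[j] < stage[i]) stage[j] = stage[i];
        } else {
            int s_dma = stage[i] + 1;         /* stage of the DMA on this edge */
            if (stage[j] < s_dma + 1) stage[j] = s_dma + 1;
        }
    }
    for (int f = 0; f < NF; f++)
        printf("filter %d: PE %d, stage %d\n", f, pe[f], stage[f]);
    return 0;                                 /* A,S,B1 -> stage 0; B2,C,J -> stage 2; D -> stage 4 */
}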
- 14 -
Stage Assignment Example
[Figure: stage assignment for the fissed example – Stage 0: A, S, B1 on PE0; Stage 1: DMA transfers A→C, S→B2, B1→J; Stage 2: B2, C, J on PE1; Stage 3: DMA transfers C→D, J→D; Stage 4: D back on PE0.]
- 15 -
Step 3: Code Generation for Cell
Target the Synergistic Processing Elements (SPEs)
» PS3 – up to 6 SPEs
» QS20 – up to 16 SPEs
One thread / SPE
Challenge
» Making a collection of independent threads implement a software pipeline
» Adapt the kernel-only code schema of a modulo schedule
- 16 -
Complete Example
void spe1_work()
{
  char stage[5] = {0};                  /* stage predicates; only stage 0 active at first */
  stage[0] = 1;
  for (i = 0; i < MAX; i++) {
    if (stage[0]) { A(); S(); B1(); }   /* stage 0: filters mapped to this SPE */
    if (stage[1]) { }
    if (stage[2]) { JtoD(); CtoD(); }   /* stage 2: start the DMAs that feed D */
    if (stage[3]) { }
    if (stage[4]) { D(); }              /* stage 4: consume the DMA-ed data */
    barrier();                          /* all SPEs advance stages in lockstep; in the full
                                           schema the stage flags are shifted each iteration
                                           to fill and drain the pipeline (prologue/epilogue) */
  }
}
[Figure: steady-state execution trace over time, with columns SPE1, DMA1, SPE2, DMA2 – each iteration SPE1 runs A, S, B1 (and later D); DMA1 carries AtoC, StoB2, B1toJ; SPE2 runs B2, J, C; DMA2 carries JtoD, CtoD. Successive iterations overlap, so computation and DMA proceed concurrently.]
- 17 -
SGMS (ILP) vs. Greedy
[Bar chart: relative speedup per benchmark (bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, radar) comparing ILP Partitioning, Greedy Partitioning, and Exposed DMA (MIT method, ASPLOS'06).]
• Solver time < 30 seconds for 16 processors
- 18 -
SGMS Conclusions
Streamroller
» Efficient mapping of stream programs to multicore
» Coarse grain software pipelining
Performance summary
» 14.7x speedup on 16 cores
» Up to 35% better than greedy solution (11% on average)
Scheduling framework
» Tradeoff memory space vs. load balance
» Memory constrained (embedded) systems
» Cache based systems
- 19 -
Discussion Points
Is it possible to convert stateful filters into stateless ones?
What if the application does not behave as you expect?
» Filters change execution time?
» Memory faster/slower than expected?
Could this be adapted for a more conventional multiprocessor with caches?
Can C code be automatically streamized?
Now you have seen 3 forms of software pipelining:
» 1) Instruction level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling
» Where else can it be used?
Compilation for GPUs
- 21 -
Why GPUs?
- 22 -
Efficiency of GPUs
High memory bandwidth
» GTX 285: 159 GB/s vs. i7: 32 GB/s
High flop rate
» Single precision: GTX 285: 1062 GFLOPS vs. i7: 102 GFLOPS
» Double precision: GTX 285: 88.5 GFLOPS, GTX 480: 168 GFLOPS vs. i7: 51 GFLOPS
High flops per watt
» GTX 285: 5.2 GFLOP/W vs. i7: 0.78 GFLOP/W
High flops per dollar
» GTX 285: 3.54 GFLOP/$ vs. i7: 0.36 GFLOP/$
- 23 -
GPU Architecture
[Figure: GPU architecture – 30 streaming multiprocessors (SM 0 … SM 29), each with shared memory, a register file, and 8 scalar cores, connected by an interconnection network to global (device) memory; the GPU attaches to the CPU and host memory over a PCIe bridge.]
- 24 -
CUDA
“Compute Unified Device Architecture”
General purpose programming model
» User kicks off batches of threads on the GPU (example below)
Advantages of CUDA
» Interface designed for compute: graphics-free API
» Orchestration of on-chip cores
» Explicit GPU memory management
» Full support for integer and bitwise operations
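As a minimal sketch of what "kicking off a batch of threads" looks like in CUDA (a generic vector-add written for this write-up, not taken from the lecture; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

// Host copies data to the device, launches a grid of thread blocks, and copies back.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;                          // explicit GPU memory management
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);   // kick off a batch of threads
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);                   // expect 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}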
- 25 -
Programming Model
[Figure: the host launches Kernel 1 on Grid 1, then Kernel 2 on Grid 2, on the device over time.]
- 26 -
GPU Scheduling
[Figure: the thread blocks of Grid 1 are distributed across the streaming multiprocessors (SM 0, SM 1, …), each SM with its own shared memory, register file, and 8 cores.]
- 27 -
Warp Generation
[Figure: thread blocks Block 0 – Block 3 resident on SM0, sharing its shared memory and registers; within a block, consecutive thread IDs are grouped into warps – threads 0–31 form Warp 0, threads 32–63 form Warp 1.]
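A tiny CUDA kernel makes that mapping explicit (illustrative; device-side printf assumes compute capability 2.0 or newer):

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: how thread IDs within a block map onto warps.
__global__ void showWarp() {
    int tid  = threadIdx.x;
    int warp = tid / warpSize;        // threads 0-31 -> warp 0, 32-63 -> warp 1, ...
    int lane = tid % warpSize;        // position of this thread inside its warp
    if (lane == 0)                    // one line of output per warp
        printf("thread %d is lane 0 of warp %d\n", tid, warp);
}

int main() {
    showWarp<<<1, 64>>>();            // one block of 64 threads = 2 warps
    cudaDeviceSynchronize();
    return 0;
}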
- 28 -
Memory Hierarchy
Per-thread register: int RegisterVar
Per-thread local memory: int LocalVarArray[10]
Per-block shared memory: __shared__ int SharedVar
Per-application global memory: __device__ int GlobalVar
Per-application constant memory: __constant__ int ConstVar
Per-application texture memory: Texture<float,1,ReadMode> TextureVar
[Figure: registers and local memory are per thread, shared memory is per block, and global, constant, and texture memory are per application on the device, set up from the host.]
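The declarations above fit together as in the small illustrative kernel below, which touches registers, shared, constant, and global memory (texture memory omitted; the names and values are made up for the example):

#include <cstdio>
#include <cuda_runtime.h>

__constant__ float scale;                 // per-application constant memory
__device__ float globalOut[256];          // per-application global (device) memory

// Illustrative kernel touching several levels of the hierarchy.
__global__ void hierarchyDemo(const float *in) {
    __shared__ float tile[256];           // per-block shared memory
    float v = in[threadIdx.x];            // v lives in a per-thread register
    tile[threadIdx.x] = v * scale;        // stage the scaled value through shared memory
    __syncthreads();
    globalOut[threadIdx.x] = tile[255 - threadIdx.x];   // write the reversed tile to global memory
}

int main() {
    float h_in[256], h_out[256], s = 2.0f;
    for (int i = 0; i < 256; i++) h_in[i] = (float)i;

    float *d_in;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(scale, &s, sizeof(float));        // initialize constant memory

    hierarchyDemo<<<1, 256>>>(d_in);
    cudaMemcpyFromSymbol(h_out, globalOut, sizeof(h_out));
    printf("out[0] = %f\n", h_out[0]);                   // expect 255 * 2 = 510
    cudaFree(d_in);
    return 0;
}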