Page 1: Dissertation Defense


Dissertation Defense
Robert Senser

October 29, 2014

GPU DECLARATIVE FRAMEWORK: DEFG

PhD Committee:
Gita Alaghband (chair)
Tom Altman (advisor)
Michael Mannino
Boris Stilman
Tam Vu

Page 2: Dissertation Defense


Presentation Outline
• Motivation for Work
• Background: Graphical Processing Units (GPUs) and OpenCL
• GPU DECLARATIVE FRAMEWORK: DEFG
• Diverse GPU Applications using DEFG
– Image Filters (Sobel and Median)
– Breadth-First Search
– Sorting Roughly Sorted Data
– Iterative Matrix Inversion
• Dissertation Accomplishments
• Future Research and Observations

Page 3: Dissertation Defense


Motivation for Work

• GPUs can provide high throughput
– Radeon HD 7990: 2 (double-precision) TFLOPS

• Developing parallel HPC software is difficult
• Parallel development for GPUs is even more difficult
• GPU HPC software development requires:
– Understanding of unique GPU hardware characteristics
– Use of specialized algorithms
– Use of GPU-specific, low-level APIs
• OpenCL
• CUDA
• Driving notion: Let software minimize the complexity and difficulty.

Page 4: Dissertation Defense


Background: GPUs and OpenCL

• Graphical Processing Unit (GPU)
– Highly specialized coprocessor
– Hundreds of cores
– Thousands of hardware-managed threads
– SIMT: Single Instruction, Multiple Thread
• Variant of the common Single Instruction, Multiple Data (SIMD) model
• Threads not on the execution path pause
– Code executed in a “kernel”
• Common GPU programming environments
– OpenCL, which is Open Source
– CUDA, which is NVIDIA proprietary
• DEFG is designed for OpenCL

Page 5: Dissertation Defense


High-Level GPU Architecture

[Figure: CPU (with virtual memory, cache, RAM) connected to GPU (with global RAM, local memory, optional cache) by a PCIe bus]

GPU Characteristics:
• Processors often connected by Peripheral Component Interconnect Express (PCIe) bus
• GPU has its own fast global RAM
• Threads have a small amount of fast local memory
• May or may not have a cache
• Many hardware-controlled threads
• Lacks CPU-style predictive branching, etc.

Page 6: Dissertation Defense


OpenCL Overview
• Specification provided by Khronos Group
• Open Source, multi-vendor
• Hardware device support
– GPUs
– CPUs
– Digital signal processors (DSPs)
– Field-programmable gate arrays (FPGAs)
• Device kernel normally written in C
• Each thread shares a common kernel
• CPU-side code
– C/C++
– Very low-level, detailed CPU-side application programming interface (API)
– Third-party bindings for Java, Python, etc.

Page 7: Dissertation Defense


GPU Applications
• Three components
– Application algorithms
– CPU-side code
• Moves kernel code to GPU
• Manages GPU execution and errors
• Moves application data between CPU and GPU
• May contain a portion of application algorithms
– GPU kernel code
• Can have multiple kernels per application
• Each kernel usually contains an algorithm or algorithm step
• Kernel code often uses GPU-specific techniques
• This work concentrates on the CPU-side code

Page 8: Dissertation Defense


GPU Performance
• Major Issues in GPU Performance
– Kernel Instruction Path Divergence
• Occurs with conditional instructions (ifs, loops, etc.)
• Causes some threads to pause
• Needs to be minimized, if not totally avoided
– High Memory Latency
• Each RAM access can consume the time of 200-500 instructions
• Accesses to global RAM should be coalesced
• “Bank conflicts” can occur with local thread memory
• Rob Farber GPU suggestions [1]:
– “Get the data on the GPU and leave it”
– “Give the GPU enough work to do”
– “Focus on data reuse to avoid memory limits”
• Existing HPC code is usually re-factored for GPU use

Page 9: Dissertation Defense


DEFG Overview

• GPU software development tool for OpenCL
• Generates the CPU side of GPU applications
• Uses a Domain-Specific Language (DSL)
– Developer writes CPU code in DEFG’s DSL
– DEFG generates the corresponding CPU C/C++ program
• Relative to hand-written CPU code
– Faster by using a declarative approach
– Simpler by using design patterns and abstraction
• Developer provides standard OpenCL GPU kernels

Page 10: Dissertation Defense


DEFG Translator Architecture [2,3]

• DEFG generates the C/C++ code for the CPU
• DEFG Translator:
– DEFG Source Input
– ANTLR-based Parser
– XML-based Tree
– Optimizer (Java)
– Code Generator (C++)
– Template Driven
– C/C++ Output

Page 11: Dissertation Defense


DEFG Benefits and Features
• Implement OpenCL applications with less effort
• Requires many fewer lines of code to be written
• Encourages the developer to focus on the kernels
• How is this done?
– With a Domain-Specific Language (DSL)
• Data characteristics are declared
• Uses one or more pre-defined DEFG design patterns
• Many details managed inside DEFG
– Technical Features
• Abstracts the OpenCL APIs and their details
• Automatic optimization of buffer transfers
• Handles error detection
• Supports multiple GPU cards
• Supports anytime algorithms

Page 12: Dissertation Defense


DEFG Code Sample

declare application sobel
declare integer Xdim (0)
declare integer Ydim (0)
declare integer BUF_SIZE (0)
declare gpu gpuone ( any )
declare kernel sobel_filter SobelFilter_Kernels ( [[ 2D,Xdim,Ydim ]] )
declare integer buffer image1 ( BUF_SIZE )
integer buffer image2 ( BUF_SIZE )
call init_input (image1(in) Xdim(out) Ydim(out) BUF_SIZE(out))
execute run1 sobel_filter ( image1(in) image2(out) )
call disp_output (image2(in) Xdim(in) Ydim(in))
end

Generated CPU-side OpenCL code (excerpt):

…
status = clSetKernelArg(sobel_filter, 1, sizeof(cl_mem), (void *)&buffer_image2);
if (status != CL_SUCCESS) { /* handle error */ }
// *** execution
size_t global_work_size[2];
global_work_size[0] = Xdim;
global_work_size[1] = Ydim;
status = clEnqueueNDRangeKernel(commandQueue, sobel_filter, 2, NULL, global_work_size, NULL, 0, NULL, NULL);
if (status != CL_SUCCESS) { /* handle error */ }
// *** result buffers
status = clEnqueueReadBuffer(commandQueue, buffer_image2, CL_TRUE, 0, BUF_SIZE * sizeof(int), image2, 0, NULL, NULL);
…

Page 13: Dissertation Defense


DEFG Design Patterns
• Invocation Patterns (Control Flow)
– Sequential Flow
– Single-Kernel Repeat Sequence
– Multiple-Kernel
• Concurrent-GPU Patterns (Multiple-GPU Support)
– Multiple-Execution
– Divide-Process-Merge
– Overlapped-Split-Process-Concatenate
• Prefix-Allocation (Buffer Allocation, without Locking)
• Dynamic-Swap (Buffer Swapping)
• Code-Morsel (C/C++ Code Insertion)
• Anytime Algorithm (Control Flow Change on External Event)
• Design patterns can be combined
– Example: Multiple-Kernel + Divide-Process-Merge + Code-Morsel

Page 14: Dissertation Defense


DEFG Implementation

• Lines of Code
– ANTLR-based parser: 580 lines
– Optimizer: 659 lines of Java
– Code Generator: 1,513 lines of C++
– Templates and includes: 1,572 lines of C++
• Number of Diagnostic Programs: 15+
• Testing investment: man-months
– Faults tended to be in the C/C++ code generation
– Most faults were in multi-GPU buffer management

Page 15: Dissertation Defense


Diverse New DEFG Applications
• Constructed four very diverse GPU applications
– Image Processing: Sobel and Median Image Filters
• Showcase for multiple-GPU support
– Graph Theoretic: Breadth-First Search (BFS), Large Graphs
• Novel use of prefix sum to avoid GPU low-level locking
• BFS processing with multiple GPUs
– Sorting: Sorting Roughly Sorted Data
• Implementation of novel sorting approach
• Use of parallel prefix calculations in sorting optimization
• Also shows multiple-GPU support
– Numerical: Iterative Matrix Inversion, M. Altman’s Method
• Demonstrates anytime algorithms and use of OpenCL clMath BLAS (Basic Linear Algebra Subprograms)
• When a measure is met, the anytime algorithm stops the process
• These four applications demonstrate DEFG’s applicability

Page 16: Dissertation Defense


Filter Application: Sobel Image Filter

• Sobel operator detects edges in images
• Pixel gradient calculated from 3x3 mask
• Uses a single GPU kernel, invoked once
• A base-line test application for multiple GPUs
• Example of DEFG Sobel operator processing:

[Sobel filter example images]

Page 17: Dissertation Defense


Filter Application: Median Filter
• Median filter removes “noise” from images
• Median determined for 3x3 or 5x5 mask
• Also uses a single GPU kernel, invoked once
• 2nd base-line test application for multiple GPUs
• Example of DEFG median 5x5 filter processing:

Page 18: Dissertation Defense


Application: Breadth First Search (BFS)

• Well-studied graph-theoretic problem
• Focus: BFS with Very Large Irregular (VLI) Graphs
– Social Networking, Routing, Citations, etc.
• Many published GPU BFS approaches, starting with Harish [4]
• Harish used “Dijkstra” BFS
– Vertex frontier as a Boolean array
• 1 = vertex on frontier
– A GPU thread assigned to each vertex
– Can result in poor thread utilization

Page 19: Dissertation Defense


BFS Vertex Frontier

• Merrill approach to vertex buffer management [5]
– Have a buffer with multiple update threads
– Uses prefix sum to allocate cells
• Generalized this buffer management in DEFG
– Provided as a set of kernel functions
– Useful for shared buffers with multiple GPU cards

Page 20: Dissertation Defense


Application: Sorting Roughly Sorted Data

• Goal: Improve on the O(n log n) sorting bound when the sequence is partially sorted
• Based on the prior sorting work by T. Altman, et al. [6]
• k is a measure of “sortedness”
• A sequence is k-sorted if no element is more than k positions out of sequence
• Knowing k allows for sorts of O(n log k)
• If k is small, then we obtain a substantial performance gain
• The k-sorted trait can be exploited by the GPU
– Prefix sum in calculating k
– Parallel sorts of sub-sequences

Page 21: Dissertation Defense


Parallel Roughly Sorting Algorithm

• Step LR: Left-to-right scan, compute running max
• Step RL: Right-to-left scan, compute running min
• Step DM:
– Uses LR-max array and RL-min array as inputs
– Computes each element’s distance measure
• Step UB: Finds distance measure upper bound
• Step Sorting: Using distance measure
– Perform sort pass one
– Perform sort pass two

Notion: Convert the large sort problem into many smaller, parallel sort operations.

Page 22: Dissertation Defense


Iterative Matrix Inversion (IMI)

• DEFG iterative matrix inversion application using M. Altman’s method [7]

• Use the Anytime-algorithm approach to manage the iterative inversion
– Inversion is stoppable at “anytime”
– Can balance run time against accuracy
– Anytime management in DEFG, not the application
• Requires GPU matrix operations
– Use OpenCL clMath (APPML)
– clMath integration into DEFG

Page 23: Dissertation Defense


M. Altman IMI Approach

The initial inverse approximation R0 can be formed by:

R0 = αI

where α = 1 / ||A||, with ||A|| the Euclidean norm of A and I the identity matrix.

To invert matrix A, each iteration calculates:

Rn+1 = Rn(3I − 3ARn + (ARn)^2)

with the result in Rn+1.

• A better R0 estimate provides for quicker convergence
• Method is self-correcting
• DEFG Anytime facility stops the iterations
– When the inversion quality measure is met
– When max iterations have occurred
– When max run time has occurred

Page 24: Dissertation Defense


Accomplishments: DEFG Framework
• Fully Implemented
– Consists of approximately 5,000 code lines
– 7 different applications and 15+ diagnostic programs
– Complete User’s Guide
– Packaged for general use
• Design Patterns
– 10+ patterns
– Patterns range from simple to complex
• Delineation of DEFG Limits
• Explanation for success of DSL and design patterns

Page 25: Dissertation Defense


DEFG’s Performance

• DEFG Papers
– Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA’13) [2]
– Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA’14) [3]
• Analysis
– Three existing OpenCL applications converted to DEFG
• CPU side re-coded in DEFG, reusing the existing GPU kernels
• Three applications:
– Breadth-First Search (BFS)
– All-Pairs Shortest Path (APSP/FW)
– Sobel Image Filter (SOBEL)
• Output results carefully verified
– Comparisons between “reference” and DEFG
• Lines-of-Code Comparison
• Run-time Performance Comparison

Page 26: Dissertation Defense


Lines-of-Code Comparison

         DEFG     DEFG
         Source   Gen.    Ref.
SOBEL    12       467     442
BFS      42       620     364
FW       12       481     478

On average, the DEFG code is 5.6 percent of the reference code.

Page 27: Dissertation Defense


Run-Time Performance Comparison

• Shown are original run times (average of 10 runs)
• Later, made manual changes to understand timing differences
• FW reference version was slow due to a vendor coding error
• Likely the CPU-based BFS-4096 was fast due to the CPU’s cache
• Summary: DEFG provided equal, or better, performance

Page 28: Dissertation Defense


New Applications

• Application Implementations
– Filtering, BFS, Roughly Sorting, Iterative Inversion
– Implementation Goals
• Show general applicability of DEFG
• Multiple-GPU: Filtering, BFS, and R. Sorting
• Novel Algorithm: R. Sorting and Iterative Inversion
• Proof of Concept: Iterative Inversion (BLAS usage)
• Application Performance results
– Run-time Performance
– Single-GPU and multiple-GPU configurations
– Problem-size characteristics
– Vast majority of tests run on the Hydra server

Page 29: Dissertation Defense


Image Filtering Results
• Image Applications Implementation
– Both Sobel operator and median filter:
• Overlapped-Split-Process-Concatenate design pattern
• Single and Multiple-GPU versions
• Analysis with large images and multiple GPUs
• Image neighborhoods
– Sobel operator: 3x3 grid
– Median filter:
• 3x3 grid: less computationally intense
• 5x5 grid: more computationally intense

Page 30: Dissertation Defense


Sobel Operator Application Results

• Single-GPU version refactored for multiple-GPU use
• Used existing OpenCL kernel
• Three simple DEFG code changes needed
• New version used two GPUs
– 50% of the image, plus a small overlap, given to each
– Produced the same resultant image
• Run-time performance was not impressive
– Not sufficiently computationally intense
– OpenCL transfer times went up
– Kernel execution times stayed the same

Page 31: Dissertation Defense


Median Filter Application Results
• CPU-side DEFG code very similar to Sobel
• Developed two new OpenCL kernels
– 3x3 grid kernel
– 5x5 grid kernel
• Performance with 3x3 grid similar to Sobel
• Performance with multiple-GPU, 5x5 median
– Run-time improvement with all test images
• With large image: 1.062 seconds (1 GPU) down to 0.794 (2 GPUs)
• Speedup: 1.34 with 2 GPUs, with 7k x 7k image
– Also, 2-GPU median filter handled larger images (22k x 22k)
• Performance Analysis with 2 GPUs
– Kernel execution run times dropped
– CPU-to-GPU OpenCL transfer times increased
– pthreads experiment showed need for O.S. threads

Page 32: Dissertation Defense


Breadth-First Search Results
• Breadth-First Search (BFS) Summary
– DEFG generalization of the Merrill approach
• Prefix-scan-based buffer allocation
• “Virtual pointers” to nodes between GPUs
– Used Harish sparse data structure approach
– Analysis of BFS application
• Characteristics
• Capabilities
• Run-time performance

Page 33: Dissertation Defense


Multiple-GPU BFS Implementation
• BFSDP2GPU DEFG Application
– Re-factoring of previously-ported DEFG BFS application
– DEFG implementation of a complex OpenCL application
• Management of shared buffers with prefix sum
• Run-time communications between GPUs
• Tested application against VLI graphs
– Test graphs from SNAP and DIMACS repositories
• Stanford Network Analysis Package (SNAP) [8]
• Center for Discrete Mathematics and Theoretical Computer Science [9]
– Very large graph datasets: millions of vertices and edges

Page 34: Dissertation Defense


BFSDP2GPU Results

• Analysis of BFSDP2GPU application
– Characteristics
• Kernel count went from 2 kernels to 6
• Used 2 GPUs
– Capabilities and Run-time performance
• Single-card versus multi-card performance
• Performance relative to existing BFS application
• Application results with VLI graphs
– Processed large graphs (4.8M nodes, 69M edges)
– However, unimpressive run-time performance
• Run times increased by factors of 6 to 17
• Issue: OpenCL’s lack of GPU-to-GPU communications
• Lesser issue: Mixing of sparse and dense data structures

Page 35: Dissertation Defense


Roughly Sorting

• RSORT application implementation
– Divide-Process-Merge pattern utilized
– Implementation contains five kernels: LRmax, RLmin, DM, UB, and comb_sort
– Sort selected for OpenCL kernel: comb sort
• sort-in-place design
• non-recursive
• similar to bubble sort but much faster
• elements are compared a gap apart

Page 36: Dissertation Defense


RSORT Results
• Run-Time Comparisons using large datasets
– Generated with a set k value and dataset size
– Fully perturbed dataset, or singly perturbed
– Example with a k value of 4, 16 perturbed items:
5 4 3 2 1 10 9 8 7 6 15 14 13 12 11 16
• Performance analysis over 3 configurations
– QSORT on CPU, used as base line
– RSORT with 1 GPU
– RSORT with 2 GPUs

Page 37: Dissertation Defense


RSORT Results Summary

• Application implemented in 1-GPU and 2-GPU forms
• RSORT run times generally faster when k is small
• At k = 1000, the 2-GPU to 1-GPU speedup is 1.73
• 2-GPU RSORT handles larger datasets than 1-GPU

Page 38: Dissertation Defense


Iterative Matrix Inversion

• IMIFLX application implementation
– Used the DEFG blas statement to access clMath functions
– Multiple-Kernel-Loop pattern used
• Multiple blas statements per iteration
• Blend of blas statements and kernels
– Anytime used to end iterations at a time limit
– Analysis of application
• Inversion accuracy
• Range of matrices: size and type
• Used data from the University of Florida Sparse Matrix Collection [10]

Page 39: Dissertation Defense


Application Sample Result

• IMIFLX uses 13 iterations for this 500 x 500 matrix
• Norm value: ||(A*Rn) − I||
• Graph shows convergence to a solution
• Run time was 0.259 seconds

M500 Norm Values
Iteration   Value
1           19.905700
2           16.298600
3           10.775400
4           6.320110
5           3.687980
6           2.186990
7           1.362090
8           0.943485
9           0.684855
10          0.320215
11          0.032834
12          0.000035
13          0.000000

Page 40: Dissertation Defense


Iterative Matrix Inversion Results

• Required BLAS support in DEFG
• Anytime support: traded accuracy for less run time
• Hydra’s NVIDIA Tesla T20 GPU
– Available RAM: 2,678 MB
– Limits double-precision matrix to just over 8,000 by 8,000

Name      Type        Size        Iterations  Seconds
H2        Hilbert     2x2         4           0.018
H12       Hilbert     12x12       70          0.089
M500      Generated   500x500     13          0.259
M500any   Generated   500x500     10          0.206
M8000     Generated   8000x8000   17          1380.320
M8500     Generated   8500x8500   n.a.        overflow
1138_bus  Repository  1138x1138   14          3.262
Kuu       Repository  7102x7102   9           605.310

Page 41: Dissertation Defense


Dissertation Accomplishments

• Designed, Implemented, and Tested DEFG
• Produced DEFG “Enabling” Design Patterns
• Compared DEFG Applications to Hand-Written Versions
– DEFG applications required less code
– DEFG applications produced equal run times
• Applied DEFG to Diverse GPU Applications
– Four diverse applications fully implemented
– Impressive application run-time results
• With the exception of BFS, due to an OpenCL limit

Page 42: Dissertation Defense


Future Research

• Additional DEFG Design Patterns
– Multiple-GPU Load Balancing
– Resource sharing
• Suggest DEFG support for NVIDIA’s CUDA
• Suggest a re-factored DEFG
– Internal DSL
– More-standard programming environment
– Enable support for more environments
• Not optimistic about a declarative approach for the GPU side
• Potential for other technical improvements

Page 43: Dissertation Defense


DEFG’s Success
• DEFG is a DSL focused on HPC with GPUs
– Note the Farber suggestions for GPU performance:
• “Get the data on the GPU and leave it”
• “Give the GPU enough work to do”
• “Focus on data reuse to avoid memory limits”
– The CPU becomes the orchestrator
• DEFG provides the CPU code to orchestrate
– Declarations to describe the data
– Design patterns to describe the orchestration
– Optimization to minimize data transfers

Page 44: Dissertation Defense


References

[1] Farber, R. CUDA Application Design and Development. Elsevier, 2011.

[2] Senser, R. and Altman, T. “DEF-G: Declarative Framework for GPUs” Proceedings of The 2013 International Conference on Parallel and Distributed Processing Techniques and Applications (2013): 490-496.

[3] Senser, R. and Altman, T. “A second generation of DEFG: Declarative Framework for GPUs” Proceedings of The 2014 International Conference on Parallel and Distributed Processing Techniques and Applications (To be published November, 2014).

[4] Harish, P. and Narayanan, P. "Accelerating large graph algorithms on the GPU using CUDA." High performance computing–HiPC 2007. Springer Berlin Heidelberg, 2007. 197-208.

[5] Merrill, D., and Andrew S. Grimshaw. "Revisiting sorting for GPGPU stream architectures." Proceedings of the 19th international conference on Parallel architectures and compilation techniques. ACM, 2010.

[6] Altman, T. and Yoshihide Igarashi. "Roughly sorting: Sequential and parallel approach." Journal of Information Processing 12.2 (1989): 154-158.

[7] Altman, M. "An optimum cubically convergent iterative method of inverting a linear bounded operator in Hilbert space." Pacific Journal of Mathematics 10.4 (1960): 297-300.

[8] SNAP URL: http://snap.stanford.edu/data
[9] DIMACS URL: http://www.dis.uniroma1.it/challenge9/download.shtml
[10] University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparce/matrices

Page 45: Dissertation Defense


Additional Slides

Page 46: Dissertation Defense


Raw Performance Numbers for Three Applications, in Milliseconds

            CPU              GPU (Tesla T20)
            DEF-G   Ref.     DEF-G   Ref.
BFS-4096    1.5     2.6      4.3     5.8
BFS-65536   12.3    14.2     8.0     11.3
FW          111.8   152.0    6.0     51.2
SOBEL       23.0    24.8     3.7     4.1

Page 47: Dissertation Defense


Sample DEFG Code Showing a Sequence

declare application floydwarshall
declare integer NODE_CNT (0)
declare integer BUF_SIZE (0)
declare gpu gpuone ( any )
declare kernel floydWarshallPass FloydWarshall_Kernels ( [[ 2D,NODE_CNT ]] )
declare integer buffer buffer1 ( BUF_SIZE )
integer buffer buffer2 ( BUF_SIZE )
call init_input (buffer1(in) buffer2(in) NODE_CNT(out) BUF_SIZE(out))
sequence NODE_CNT times
execute run1 floydWarshallPass ( buffer1(inout) buffer2(out) NODE_CNT(in) DEFG_CNT(in) )
call disp_output (buffer1(in) buffer2(in) NODE_CNT(in))
end

Page 48: Dissertation Defense


Sample DEFG Code Showing a Loop-While

declare application bfs
declare integer NODE_CNT (0)
declare integer EDGE_CNT (0)
declare integer STOP (0)
declare gpu gpuone ( any )
declare kernel kernel1 bfs_kernel ( [[ 1D,NODE_CNT ]] )
kernel kernel2 bfs_kernel ( [[ 1D,NODE_CNT ]] )
declare struct (4) buffer graph_nodes ( NODE_CNT )
integer buffer graph_edges ( EDGE_CNT )
integer buffer graph_mask ( NODE_CNT )
integer buffer updating_graph_mask ( $NODE_CNT )
integer buffer graph_visited ( NODE_CNT )
integer buffer cost ( NODE_CNT )
// note: init_input handles setting "source" node
call init_input (graph_nodes(out) graph_edges(out) graph_mask(out) updating_graph_mask(out) graph_visited(out) cost(out) NODE_CNT(out) EDGE_CNT(out))
loop
execute part1 kernel1 ( graph_nodes(in) graph_edges(in) graph_mask(in) updating_graph_mask(out) graph_visited(in) cost(inout) $NODE_CNT(in) )
// set STOP to zero each time thru...
set STOP (0)
// note: STOP value is returned...
execute part2 kernel2 ( graph_mask(inout) updating_graph_mask(inout) graph_visited(inout) STOP(inout) NODE_CNT(in) )
while STOP eq 1
call disp_output (cost(in) NODE_CNT(in))
end

Page 49: Dissertation Defense


RSORT Data

Page 50: Dissertation Defense


IMIFLX Data

Page 51: Dissertation Defense


DEFG 4-Way Mini-Experiment SpeedUp

GPUs   SpeedUp
1      1
2      1.947
4      3.622

Page 52: Dissertation Defense


Old Slides

Page 53: Dissertation Defense


DEFG Architecture (old)

[Flow: DEF-G Input Code → ANTLR-Based DEF-G Parser → XML Document → TinyXML2 & DEF-G Code Generator → OpenCL Code]

• The DEFG framework generates the CPU code
– Input: declarative statements
– Uses design patterns
– Output: OpenCL code
• DEFG “Translator”
– ANTLR-Based Parser
– Intermediate XML Document
– TinyXML2 Parser
– Code Generator written in C++

Page 54: Dissertation Defense


Accomplishments ???
• DEFG
– Inputs declarations
– Uses declared design patterns
– Generates CPU-side OpenCL code
• Summarized the Proof-of-Concept DEFG
• Described the Version 2 DEFG enhancements
• Described the diverse DEFG applications
– These show DEFG’s applicability and flexibility
– Each is a full application implementation
• Addressed DEFG research goals

Page 55: Dissertation Defense


Presentation Outline
• Research Accomplishments
– Designed and implemented DEFG
– Produced DEFG “Enabling” Design Patterns
– Applied DEFG to Diverse GPU Applications
– Analyzed the Performance of these Applications
• Future Research

Page 56: Dissertation Defense


Proposed Dissertation Work Plan (design patterns)

• DEFG Enhancements
– Three existing design patterns
• Execute kernel once, execute N times, and loop-while
• The current loop-while syntax is too “procedural”
– New design patterns
• Anytime algorithm support
• Multiple-GPU support
– Divide, Process, Merge
– Overlapped Split, Process, Concatenate
• Explicit Parallel
– Add other interesting DEF-G design patterns

Page 57: Dissertation Defense


(implementation)

• DEFG Enhancements
• ((get list from doc))

Page 58: Dissertation Defense


Image Filtering Results

• Developed proof-of-concept DEFG version
• Created a proof-of-concept, hand-written, multiple-GPU version
– Using two cards doubled throughput
– Complexity in managing overlapped sub-images
• Next steps:
– DEFG multiple-GPU version
• Due to 3x3 mask, sub-images overlap
• Uses Overlapped Split, Process, Concatenate design pattern
– Testing and analysis with large images and more than one GPU
– Expect to find (or emulate) a four-GPU environment

Page 59: Dissertation Defense


Accomplishments

• Declarative Approach to …• Design Patterns• DEFG Tools

– Produced DEFG parser, optimizer, and code generator– Diagnostics …

• DEFG Applications– Performance verification– Image Filters– Multiple-GPU graph processing– Rough Sorting– Iterative Matrix Inversion

Page 60: Dissertation Defense


Proposed New DEFG Features

• Additional DEFG Design Patterns
– Multiple-GPU Load Balancing
– Resource sharing
• DEFG Support for NVIDIA’s CUDA
• Re-factored DEFG (DEFG Version 3)
– Internal DSL
• P1
• P2
– More-standard programming environment
– Could support more environments

Page 61: Dissertation Defense


Roughly Sorting Implementation
• RSORT implementation contained five kernels
• Comb sort
• Analysis and Complications
– Choose an existing OpenCL sort for use
– Add support for multiple GPU cards using DEFG’s Divide, Process, Merge pattern
– Performance analysis of GPU Roughly Sorting

