1
Dissertation Defense
Robert Senser
October 29, 2014
GPU DECLARATIVE FRAMEWORK: DEFG
PhD Committee: Gita Alaghband (chair), Tom Altman (advisor), Michael Mannino, Boris Stilman, Tam Vu
2
Presentation Outline
• Motivation for Work
• Background: Graphical Processing Units (GPUs) and OpenCL
• GPU DECLARATIVE FRAMEWORK: DEFG
• Diverse GPU Applications using DEFG
– Image Filters (Sobel and Median)
– Breadth-First Search
– Sorting Roughly Sorted Data
– Iterative Matrix Inversion
• Dissertation Accomplishments
• Future Research and Observations
3
Motivation for Work
• GPUs can provide high throughput
– Radeon HD 7990: 2 (double-precision) TFLOPS
• Developing parallel HPC software is difficult
• Parallel development for GPUs is even more difficult
• GPU HPC software development requires:
– Understanding of unique GPU hardware characteristics
– Use of specialized algorithms
– Use of GPU-specific, low-level APIs
• OpenCL
• CUDA
• Driving notion: let software minimize the complexity and difficulty.
4
Background: GPUs and OpenCL
• Graphical Processing Unit (GPU)
– Highly specialized coprocessor
– Hundreds of cores
– Thousands of hardware-managed threads
– SIMT: Single Instruction, Multiple Thread
• Variant of the common Single Instruction, Multiple Data (SIMD) model
• Threads not on the execution path pause
– Code executed in a “kernel”
• Common GPU programming environments
– OpenCL, which is Open Source
– CUDA, which is NVIDIA proprietary
• DEFG is designed for OpenCL
5
High-Level GPU Architecture
[Figure: CPU (RAM, virtual memory, cache) connected to GPU (global RAM, local memory, cache?) over the PCIe bus]
GPU Characteristics:
• Processors often connected by Peripheral Component Interconnect Express (PCIe) bus
• GPU has own fast Global RAM
• Threads have a small amount of fast local memory
• May or may not have a cache
• Many hardware-controlled threads
• Lacks CPU-style predictive branching, etc.
6
OpenCL Overview
• Specification provided by Khronos Group
• Open Source, multi-vendor
• Hardware device support
– GPUs
– CPUs
– Digital signal processors (DSPs)
– Field-programmable gate arrays (FPGAs)
• Device kernel normally written in C
• Each thread shares a common kernel
• CPU-side code
– C/C++
– Very low-level, detailed CPU-side application programming interface (API)
– Third-party bindings for Java, Python, etc.
7
GPU Applications
• Three components
– Application algorithms
– CPU-side code
• Moves kernel code to GPU
• Manages GPU execution and errors
• Moves application data between CPU and GPU
• May contain a portion of application algorithms
– GPU kernel code
• Can have multiple kernels per application
• Each kernel usually contains an algorithm or algorithm step
• Kernel code often uses GPU-specific techniques
• This work concentrates on the CPU-side code
8
GPU Performance
• Major Issues in GPU Performance
– Kernel Instruction Path Divergence
• Occurs with conditional instructions (ifs, loops, etc.)
• Causes some threads to pause
• Needs to be minimized, if not totally avoided
– High Memory Latency
• Each RAM access can consume the time of 200–500 instructions
• Accesses to global RAM should be coalesced
• “Bank conflicts” can occur with local thread memory
• Rob Farber GPU suggestions [1]:
– “Get the data on the GPU and leave it”
– “Give the GPU enough work to do”
– “Focus on data reuse to avoid memory limits”
• Existing HPC code usually re-factored for GPU use
9
DEFG Overview
• GPU software development tool for OpenCL
• Generates the CPU side of GPU applications
• Uses a Domain-Specific Language (DSL)
– Developer writes CPU code in DEFG’s DSL
– DEFG generates the corresponding CPU C/C++ program
• Relative to hand-written CPU code
– Faster by using a declarative approach
– Simpler by using design patterns and abstraction
• Developer provides standard OpenCL GPU kernels
10
DEFG Translator Architecture [2,3]
• DEFG generates the C/C++ code for the CPU
• DEFG Translator:
– DEFG Source Input
– ANTLR-based Parser
– XML-based Tree
– Optimizer (Java)
– Code Generator (C++), Template-Driven
– C/C++ Output
11
DEFG Benefits and Features
• Implement OpenCL applications with less effort
• Requires many fewer lines of code to be written
• Encourages the developer to focus on the kernels
• How is this done?
– With a Domain-Specific Language (DSL)
• Data characteristics are declared
• Uses one or more pre-defined DEFG design patterns
• Many details managed inside DEFG
– Technical Features
• Abstracts the OpenCL APIs and their details
• Automatic optimization of buffer transfers
• Handles error detection
• Supports multiple GPU cards
• Supports anytime algorithms
12
DEFG Code Sample

declare application sobel
declare integer Xdim (0)
declare integer Ydim (0)
declare integer BUF_SIZE (0)
declare gpu gpuone ( any )
declare kernel sobel_filter SobelFilter_Kernels ( [[ 2D,Xdim,Ydim ]] )
declare integer buffer image1 ( BUF_SIZE )
        integer buffer image2 ( BUF_SIZE )
call init_input (image1(in) Xdim(out) Ydim(out) BUF_SIZE(out))
execute run1 sobel_filter ( image1(in) image2(out) )
call disp_output (image2(in) Xdim(in) Ydim(in))
end
…
status = clSetKernelArg(sobel_filter, 1, sizeof(cl_mem), (void *)&buffer_image2);
if (status != CL_SUCCESS) { handle error }
// *** execution
size_t global_work_size[2];
global_work_size[0] = Xdim;
global_work_size[1] = Ydim;
status = clEnqueueNDRangeKernel(commandQueue, sobel_filter, 2, NULL,
                                global_work_size, NULL, 0, NULL, NULL);
if (status != CL_SUCCESS) { handle error }
// *** result buffers
status = clEnqueueReadBuffer(commandQueue, buffer_image2, CL_TRUE, 0,
                             BUF_SIZE * sizeof(int), image2, 0, NULL, NULL);
…
13
DEFG Design Patterns
• Invocation Patterns (Control Flow)
– Sequential Flow
– Single-Kernel Repeat Sequence
– Multiple-Kernel
• Concurrent-GPU Patterns (Multiple-GPU Support)
– Multiple-Execution
– Divide-Process-Merge
– Overlapped-Split-Process-Concatenate
• Prefix-Allocation (Buffer Allocation, without Locking)
• Dynamic-Swap (Buffer Swapping)
• Code-Morsel (C/C++ Code Insertion)
• Anytime Algorithm (Control Flow Change on External Event)
• Design patterns can be combined
– Example: Multiple-Kernel + Divide-Process-Merge + Code-Morsel
14
DEFG Implementation
• Lines of Code
– ANTLR-based parser: 580 lines
– Optimizer: 659 lines of Java
– Code Generator: 1,513 lines of C++
– Templates and includes: 1,572 lines of C++
• Number of Diagnostic Programs: 15+
• Testing investment: man-months
– Faults tended to be in the C/C++ code generation
– Most faults were in multi-GPU buffer management
15
Diverse New DEFG Applications
• Constructed four very diverse GPU applications
– Image Processing: Sobel and Median Image Filters
• Showcase for multiple-GPU support
– Graph Theoretic: Breadth-First Search (BFS), Large Graphs
• Novel use of prefix sum to avoid GPU low-level locking
• BFS processing with multiple GPUs
– Sorting: Sorting Roughly Sorted Data
• Implementation of novel sorting approach
• Use of parallel prefix calculations in sorting optimization
• Also shows multiple-GPU support
– Numerical: Iterative Matrix Inversion, M. Altman’s Method
• Demonstrates anytime algorithms
• Uses OpenCL clMath BLAS (Basic Linear Algebra Subprograms)
• When a quality measure is met, the anytime algorithm stops the process
• These four applications demonstrate DEFG’s applicability
16
Filter Application: Sobel Image Filter
• Sobel operator detects edges in images
• Pixel gradient calculated from 3x3 mask
• Uses a single GPU kernel, invoked once
• A baseline test application for multiple GPUs
• Example of DEFG Sobel operator processing:
[Figure: input image and Sobel-filtered output]
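To make the operator concrete, here is a minimal pure-Python sketch of the per-pixel Sobel gradient (illustrative only; the DEFG application performs this in an OpenCL kernel, and the function name here is invented):

```python
# Standard 3x3 Sobel masks for the horizontal and vertical gradients.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(img, x, y):
    """Gradient magnitude at interior pixel (x, y); img is a 2-D list.
    On the GPU, one thread performs this computation per pixel."""
    gx = sum(GX[i][j] * img[y - 1 + i][x - 1 + j]
             for i in range(3) for j in range(3))
    gy = sum(GY[i][j] * img[y - 1 + i][x - 1 + j]
             for i in range(3) for j in range(3))
    return (gx * gx + gy * gy) ** 0.5
```

A vertical edge (dark columns beside bright ones) yields a large magnitude; a flat region yields zero.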
17
Filter Application: Median Filter
• Median filter removes “noise” from images
• Median determined for 3x3 or 5x5 mask
• Also uses a single GPU kernel, invoked once
• Second baseline test application for multiple GPUs
• Example of DEFG median 5x5 filter processing:
[Figure: noisy input image and median-filtered output]
18
Application: Breadth First Search (BFS)
• Well-studied graph-theoretic problem
• Focus: BFS with Very Large Irregular (VLI) Graphs
– Social Networking, Routing, Citations, etc.
• Many published GPU BFS approaches, starting with Harish [4]
• Harish used “Dijkstra” BFS
– Vertex frontier as a Boolean array
• 1 = vertex on frontier
– A GPU thread assigned to each vertex
– Can result in poor thread utilization
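The thread-per-vertex scheme with a Boolean frontier can be sketched sequentially in Python (illustrative; names invented). The inner loop over all vertices mirrors the GPU launching one thread per vertex, most of which find frontier[v] false and do nothing:

```python
def frontier_bfs(adj, source):
    """Level-synchronous BFS with a Boolean frontier array.
    adj[v] lists the neighbors of vertex v; returns BFS depth per vertex."""
    n = len(adj)
    cost = [-1] * n            # -1 = not yet visited
    frontier = [False] * n
    cost[source] = 0
    frontier[source] = True
    while any(frontier):
        next_frontier = [False] * n
        for v in range(n):     # on the GPU: one thread per vertex
            if not frontier[v]:
                continue       # this "thread" pauses (wasted work)
            for w in adj[v]:
                if cost[w] == -1:
                    cost[w] = cost[v] + 1
                    next_frontier[w] = True
        frontier = next_frontier
    return cost
```

When the frontier is small relative to the vertex count, most of the per-vertex threads are idle on each pass, which is the utilization problem noted above.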
19
BFS Vertex Frontier
• Merrill approach to vertex buffer management [5]
– Have a buffer with multiple update threads
– Uses prefix sum to allocate cells
• Generalize this buffer management in DEFG
– Provided as a set of kernel functions
• Useful for shared buffers with multiple GPU cards
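The prefix-sum allocation idea can be sketched in Python (illustrative; in DEFG this is a set of OpenCL kernel functions and the scan itself runs in parallel). Each thread first counts the items it will emit; an exclusive prefix sum over those counts gives every thread a private write offset, so all threads can write into the shared buffer without locks:

```python
def exclusive_prefix_sum(counts):
    """Exclusive scan: offsets[i] = sum of counts[0..i-1], plus the total."""
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    return offsets, total

def allocate_and_write(per_thread_items):
    """Lock-free shared-buffer allocation via prefix sum: 'thread' i writes
    its items at positions offsets[i] .. offsets[i] + len(items) - 1."""
    counts = [len(items) for items in per_thread_items]
    offsets, total = exclusive_prefix_sum(counts)
    out = [None] * total
    for i, items in enumerate(per_thread_items):   # parallel on the GPU
        for j, item in enumerate(items):
            out[offsets[i] + j] = item
    return out
```

Because each thread's write range is disjoint by construction, no atomic operations or low-level locking are needed.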
20
Application: Sorting Roughly Sorted Data
• Goal: improve on the O(n log n) sorting bound when the sequence is partially sorted
• Based on the prior sorting work by T. Altman, et al. [6]
• k is a measure of “sortedness”
– A sequence is k-sorted if no element is more than k positions out of sequence
• Knowing k allows for sorts of O(n log k)
• If k is small, we obtain a substantial performance gain
• The k-sorted trait can be exploited on the GPU
– Prefix sum in calculating k
– Parallel sorts of sub-sequences
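For intuition, the sortedness measure can be sketched in Python as a direct definition-checker (illustrative only, not the GPU prefix-sum computation; assumes distinct elements):

```python
def sortedness_k(seq):
    """Smallest k such that seq is k-sorted: no element sits more than
    k positions away from its place in the fully sorted sequence.
    Assumes distinct elements."""
    pos = {v: i for i, v in enumerate(sorted(seq))}
    return max(abs(i - pos[v]) for i, v in enumerate(seq))
```

For the sequence 5 4 3 2 1 10 9 8 7 6 this returns 4, so an O(n log 4) sort applies rather than O(n log n).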
21
Parallel Roughly Sorting Algorithm
• Step LR: left-to-right scan, compute running max
• Step RL: right-to-left scan, compute running min
• Step DM:
– Uses LR-max array and RL-min array as inputs
– Computes each element’s distance measure
• Step UB: finds the distance-measure upper bound
• Step Sorting: using the distance measure
– Perform sort pass one
– Perform sort pass two
Notion: convert the large sort problem into many smaller, parallel sort operations.
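A sequential Python sketch of the five steps (the GPU version computes the scans as prefix operations and runs the block sorts in parallel; the DM/UB formulation below is one plausible reading of the algorithm, and distinct elements are assumed):

```python
def roughly_sort(a):
    """Roughly-sorting pipeline: LR, RL, DM, UB, then two sorting passes
    over blocks of size 2k. Sequential sketch; assumes distinct elements."""
    n = len(a)
    # Step LR: left-to-right scan, running maximum
    lr_max = [a[0]] * n
    for i in range(1, n):
        lr_max[i] = max(lr_max[i - 1], a[i])
    # Step RL: right-to-left scan, running minimum
    rl_min = [a[-1]] * n
    for i in range(n - 2, -1, -1):
        rl_min[i] = min(rl_min[i + 1], a[i])
    # Steps DM + UB: per-element distance measure, and its upper bound k
    k = 0
    for i in range(n):
        j = 0
        while j < n and lr_max[j] <= rl_min[i]:
            j += 1
        if j < n:
            k = max(k, i - j)
    # Step Sorting: two passes of independent sorts over blocks of size 2k,
    # the second pass offset by k so the blocks overlap the first pass.
    b = 2 * max(k, 1)
    for start in range(0, n, b):        # pass one (parallel sorts on GPU)
        a[start:start + b] = sorted(a[start:start + b])
    for start in range(b // 2, n, b):   # pass two, offset by k
        a[start:start + b] = sorted(a[start:start + b])
    return a
```

For a k-sorted input, every element's final position lies within one of the overlapping 2k blocks, so the two passes produce a fully sorted sequence.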
22
Iterative Matrix Inversion (IMI)
• DEFG iterative matrix inversion application using M. Altman’s method [7]
• Uses the anytime-algorithm approach to manage the iterative inversion
– Inversion is stoppable at “anytime”
– Can balance run time against accuracy
– Anytime management in DEFG, not the application
• Requires GPU matrix operations
– Uses OpenCL clMath (APPML)
– clMath integration into DEFG
23
M. Altman IMI Approach
The initial inverse approximation R0 can be formed as:

R0 = αI, where α = 1 / ||A||

with ||A|| the Euclidean norm of A and I the identity matrix.

To invert matrix A, each iteration calculates:

Rn+1 = Rn(3I – 3ARn + (ARn)²)

with the result in Rn+1.
• A better R0 estimate provides quicker convergence
• The method is self-correcting
• The DEFG Anytime facility stops the iterations
– When the inversion quality measure is met
– When max iterations have occurred
– When max run time has occurred
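The iteration is easy to sketch in pure Python for small matrices (illustrative; the DEFG IMIFLX application performs these products through clMath BLAS calls, and convergence from R0 = I/||A|| is only guaranteed for suitable matrices, e.g. symmetric positive definite ones):

```python
def mat_mul(A, B):
    """Dense square matrix product (lists of lists)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def altman_invert(A, iterations):
    """M. Altman's cubically convergent inversion:
       R_{n+1} = R_n (3I - 3 A R_n + (A R_n)^2),  R_0 = I / ||A||."""
    n = len(A)
    I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    norm = sum(x * x for row in A for x in row) ** 0.5  # Euclidean norm of A
    R = [[I[i][j] / norm for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        AR = mat_mul(A, R)
        AR2 = mat_mul(AR, AR)
        T = [[3.0 * I[i][j] - 3.0 * AR[i][j] + AR2[i][j] for j in range(n)]
             for i in range(n)]   # T = 3I - 3*A*R + (A*R)^2
        R = mat_mul(R, T)
    return R
```

Each eigenvalue error contracts cubically (1 − t′ = (1 − t)³ for t an eigenvalue of ARn), which is why an anytime cutoff after a few iterations can still deliver a usable inverse.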
24
Accomplishments: DEFG Framework
• Fully Implemented
– Consists of approximately 5,000 code lines
– 7 different applications and 15+ diagnostic programs
– Complete User’s Guide
– Packaged for general use
• Design Patterns
– 10+ patterns
– Patterns range from simple to complex
• Delineation of DEFG limits
• Explanation for the success of DSL and design patterns
25
DEFG’s Performance
• DEFG Papers
– Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA’13) [2]
– Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA’14) [3]
• Analysis
– Three existing OpenCL applications converted to DEFG
• CPU side re-coded in DEFG, reusing the existing GPU kernels
• Three applications:
– Breadth-First Search (BFS)
– All-Pairs Shortest Path (APSP/FW)
– Sobel Image Filter (SOBEL)
• Output results carefully verified
– Comparisons between “reference” and DEFG
• Lines-of-Code Comparison
• Run-time Performance Comparison
26
Lines-of-Code Comparison

Application   DEFG Source   DEFG Gen.   Ref.
SOBEL                  12         467    442
BFS                    42         620    364
FW                     12         481    478

On average, the DEFG source code is 5.6 percent of the reference code.
27
• Shown are original run times (average of 10 runs)
• Later, manual changes were made to understand timing differences
• The FW reference version was slow due to a vendor coding error
• The CPU-based BFS-4096 was likely fast due to the CPU’s cache
• Summary: DEFG provided equal or better performance
Run-Time Performance Comparison
28
New Applications
• Application Implementations
– Filtering, BFS, Roughly Sorting, Iterative Inversion
– Implementation Goals
• Show general applicability of DEFG
• Multiple-GPU: Filtering, BFS, and R. Sorting
• Novel Algorithm: R. Sorting and Iterative Inversion
• Proof of Concept: Iterative Inversion (BLAS usage)
• Application Performance Results
– Run-time performance
– Single-GPU and multiple-GPU configurations
– Problem-size characteristics
– Vast majority of tests run on the Hydra server
29
Image Filtering Results
• Image Applications Implementation
– Both Sobel operator and median filter:
• Overlapped-Split-Process-Concatenate design pattern
• Single- and multiple-GPU versions
• Analysis with large images and multiple GPUs
• Image neighborhoods
– Sobel operator: 3x3 grid
– Median filter:
• 3x3 grid: less computationally intense
• 5x5 grid: more computationally intense
30
Sobel Operator Application Results
• Single-GPU version refactored for multiple-GPU use
• Used existing OpenCL kernel
• Three simple DEFG code changes needed
• New version used two GPUs
– 50% of the image, plus a small overlap, given to each
– Produced the same resultant image
• Run-time performance was not impressive
– Not sufficiently computationally intense
– OpenCL transfer times went up
– Kernel execution times stayed the same
31
Median Filter Application Results
• CPU-side DEFG code very similar to Sobel
• Developed two new OpenCL kernels
– 3x3 grid kernel
– 5x5 grid kernel
• Performance with 3x3 grid similar to Sobel
• Performance with multiple-GPU, 5x5 median
– Run-time improvement with all test images
• With large image: 1.062 seconds (1 GPU) down to 0.794 (2 GPUs)
• Speedup: 1.34 with 2 GPUs, with 7k x 7k image
– Also, 2-GPU median filter handled larger images (22k x 22k)
• Performance Analysis with 2 GPUs
– Kernel execution run times dropped
– CPU-to-GPU OpenCL transfer times increased
– pthreads experiment showed need for O.S. threads
32
Breadth-First Search Results
• Breadth-First Search (BFS) Summary
– DEFG generalization of the Merrill approach
• Prefix-scan-based buffer allocation
• “Virtual pointers” to nodes between GPUs
– Used Harish sparse data structure approach
– Analysis of BFS application
• Characteristics
• Capabilities
• Run-time performance
33
Multiple-GPU BFS Implementation
• BFSDP2GPU DEFG Application
– Refactoring of previously ported DEFG BFS application
– DEFG implementation of a complex OpenCL application
• Management of shared buffers with prefix sum
• Run-time communications between GPUs
• Tested application against VLI graphs
– Test graphs from SNAP and DIMACS repositories
• Stanford Network Analysis Package (SNAP) [8]
• Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) [9]
– Very large graph datasets: millions of vertices and edges
34
BFSDP2GPU Results
• Analysis of BFSDP2GPU application
– Characteristics
• Kernel count went from 2 kernels to 6
• Used 2 GPUs
– Capabilities and run-time performance
• Single-card versus multi-card performance
• Performance relative to existing BFS application
• Application results with VLI graphs
– Processed large graphs (4.8M nodes, 69M edges)
– However, unimpressive run-time performance
• Run times increased by factors of 6 to 17
• Issue: OpenCL’s lack of GPU-to-GPU communications
• Lesser issue: mixing of sparse and dense data structures
35
Roughly Sorting
• RSORT application implementation
– Divide-Process-Merge pattern utilized
– Implementation contains five kernels: LRmax, RLmin, DM, UB, and comb_sort
– Sort selected for OpenCL kernel: comb sort
• Sort-in-place design
• Non-recursive
• Similar to bubble sort but much faster
• Elements are compared a gap apart
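A minimal Python comb sort, equivalent in spirit to the OpenCL kernel's sort (the 1.3 shrink factor is the conventional choice, an assumption here):

```python
def comb_sort(a):
    """Comb sort: in-place, non-recursive; like bubble sort, but compares
    elements a shrinking 'gap' apart, which moves stragglers quickly."""
    n = len(a)
    gap = n
    swapped = True
    while gap > 1 or swapped:
        gap = max(1, int(gap / 1.3))   # shrink the gap each pass
        swapped = False
        for i in range(n - gap):
            if a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a
```

Once the gap reaches 1, the passes degenerate to bubble sort, but by then the data is nearly sorted, so few passes remain.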
36
RSORT Results
• Run-Time Comparisons using large datasets
– Generated with a set k value and dataset size
– Datasets fully perturbed, or singly perturbed
– Example with a k value of 4 and 16 perturbed items:
5 4 3 2 1 10 9 8 7 6 15 14 13 12 11 16
• Performance analysis over 3 configurations
– QSORT on CPU, used as baseline
– RSORT with 1 GPU
– RSORT with 2 GPUs
37
RSORT Results Summary
• Application implemented in 1-GPU and 2-GPU forms
• RSORT run times generally faster when k is small
• At k = 1000, the 2-GPU to 1-GPU speedup is 1.73
• 2-GPU RSORT handles larger datasets than 1-GPU
38
Iterative Matrix Inversion
• IMIFLX application implementation
– Used DEFG blas statement to access clMath functions
– Multiple-Kernel-Loop pattern used
• Multiple blas statements per iteration
• Blend of blas statements and kernels
– Anytime used to end iterations at a time limit
– Analysis of application
• Inversion accuracy
• Range of matrices: size and type
• Used data from the University of Florida Sparse Matrix Collection [10]
39
Application Sample Result
• IMIFLX uses 13 iterations for this 500 x 500 matrix
• Norm value: ||(A*Rn) - I||
• Graph shows convergence to a solution
• Run time was 0.259 seconds

M500 Norm Value by Iteration
Iteration   Norm Value
 1          19.905700
 2          16.298600
 3          10.775400
 4           6.320110
 5           3.687980
 6           2.186990
 7           1.362090
 8           0.943485
 9           0.684855
10           0.320215
11           0.032834
12           0.000035
13           0.000000
40
Iterative Matrix Inversion Results
• Required BLAS support in DEFG
• Anytime support: traded accuracy for less run time
• Hydra’s NVIDIA T20 GPU
– Available RAM: 2,678 MB
– Limits double-precision matrix to just over 8,000 by 8,000
Name Type Size Iterations Seconds
H2 Hilbert 2x2 4 0.018
H12 Hilbert 12x12 70 0.089
M500 Generated 500x500 13 0.259
M500any Generated 500x500 10 0.206
M8000 Generated 8000x8000 17 1380.320
M8500 Generated 8500x8500 n.a. overflow
1138_bus Repository 1138x1138 14 3.262
Kuu Repository 7102x7102 9 605.310
41
Dissertation Accomplishments
• Designed, Implemented, and Tested DEFG
• Produced DEFG “Enabling” Design Patterns
• Compared DEFG Applications to Hand-Written
– DEFG applications required less code
– DEFG applications produced equal run times
• Applied DEFG to Diverse GPU Applications
– Four diverse applications fully implemented
– Impressive application run-time results
• With the exception of BFS, due to an OpenCL limit
42
Future Research
• Additional DEFG Design Patterns
– Multiple-GPU Load Balancing
– Resource Sharing
• Suggest DEFG support for NVIDIA’s CUDA
• Suggest a re-factored DEFG
– Internal DSL
– More-standard programming environment
– Enable support for more environments
• Not optimistic about a declarative approach for the GPU side
• Potential for other technical improvements
43
DEFG’s Success
• DEFG is a DSL focused on HPC with GPUs
– Note the Farber suggestions for GPU performance:
• “Get the data on the GPU and leave it”
• “Give the GPU enough work to do”
• “Focus on data reuse to avoid memory limits”
– The CPU becomes the orchestrator
• DEFG provides the CPU code to orchestrate
– Declarations to describe the data
– Design patterns to describe the orchestration
– Optimization to minimize data transfers
44
References
[1] Farber, R. CUDA Application Design and Development. Elsevier, 2011.
[2] Senser, R. and Altman, T. “DEF-G: Declarative Framework for GPUs.” Proceedings of the 2013 International Conference on Parallel and Distributed Processing Techniques and Applications (2013): 490-496.
[3] Senser, R. and Altman, T. “A Second Generation of DEFG: Declarative Framework for GPUs.” Proceedings of the 2014 International Conference on Parallel and Distributed Processing Techniques and Applications (to be published November 2014).
[4] Harish, P. and Narayanan, P. “Accelerating Large Graph Algorithms on the GPU Using CUDA.” High Performance Computing – HiPC 2007. Springer Berlin Heidelberg, 2007. 197-208.
[5] Merrill, D. and Grimshaw, A. S. “Revisiting Sorting for GPGPU Stream Architectures.” Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2010.
[6] Altman, T. and Igarashi, Y. “Roughly Sorting: Sequential and Parallel Approach.” Journal of Information Processing 12.2 (1989): 154-158.
[7] Altman, M. “An Optimum Cubically Convergent Iterative Method of Inverting a Linear Bounded Operator in Hilbert Space.” Pacific Journal of Mathematics 10.4 (1960): 297-300.
[8] SNAP: http://snap.stanford.edu/data
[9] DIMACS: http://www.dis.uniroma1.it/challenge9/download.shtml
[10] University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparce/matrices
45
Additional Slides
46
Raw Performance Numbers for Three Applications, in Milliseconds

                  CPU               GPU (Tesla T20)
            DEF-G    Ref.         DEF-G    Ref.
BFS-4096      1.5     2.6           4.3     5.8
BFS-65536    12.3    14.2           8.0    11.3
FW          111.8   152.0           6.0    51.2
SOBEL        23.0    24.8           3.7     4.1
47
Sample DEFG Code Showing a Sequence
declare application floydwarshall
declare integer NODE_CNT (0)
declare integer BUF_SIZE (0)
declare gpu gpuone ( any )
declare kernel floydWarshallPass FloydWarshall_Kernels ( [[ 2D,NODE_CNT ]] )
declare integer buffer buffer1 ( BUF_SIZE )
        integer buffer buffer2 ( BUF_SIZE )
call init_input (buffer1(in) buffer2(in) NODE_CNT(out) BUF_SIZE(out))
sequence NODE_CNT times
   execute run1 floydWarshallPass ( buffer1(inout) buffer2(out) NODE_CNT(in) DEFG_CNT(in) )
call disp_output (buffer1(in) buffer2(in) NODE_CNT(in))
end
48
Sample DEFG Code Showing a Loop-While
declare application bfs
declare integer NODE_CNT (0)
declare integer EDGE_CNT (0)
declare integer STOP (0)
declare gpu gpuone ( any )
declare kernel kernel1 bfs_kernel ( [[ 1D,NODE_CNT ]] )
        kernel kernel2 bfs_kernel ( [[ 1D,NODE_CNT ]] )
declare struct (4) buffer graph_nodes ( NODE_CNT )
        integer buffer graph_edges ( EDGE_CNT )
        integer buffer graph_mask ( NODE_CNT )
        integer buffer updating_graph_mask ( $NODE_CNT )
        integer buffer graph_visited ( NODE_CNT )
        integer buffer cost ( NODE_CNT )
// note: init_input handles setting "source" node
call init_input (graph_nodes(out) graph_edges(out) graph_mask(out)
                 updating_graph_mask(out) graph_visited(out) cost(out)
                 NODE_CNT(out) EDGE_CNT(out))
loop
   execute part1 kernel1 ( graph_nodes(in) graph_edges(in) graph_mask(in)
                           updating_graph_mask(out) graph_visited(in)
                           cost(inout) $NODE_CNT(in) )
   // set STOP to zero each time thru...
   set STOP (0)
   // note: STOP value is returned...
   execute part2 kernel2 ( graph_mask(inout) updating_graph_mask(inout)
                           graph_visited(inout) STOP(inout) NODE_CNT(in) )
while STOP eq 1
call disp_output (cost(in) NODE_CNT(in))
end
49
RSORT Data
50
IMIFLX Data
51
DEFG 4-Way Mini-Experiment SpeedUp
GPUs   SpeedUp
1       1
2       1.947
4       3.622
52
Old Slides
53
[Diagram: DEF-G Input Code → ANTLR-Based DEF-G Parser → XML Document → TinyXML2 & DEF-G Code Generator → OpenCL Code]
• The DEFG framework generates the CPU code
– Input: declarative statements
– Uses design patterns
– Output: OpenCL code
• DEFG “Translator”
– ANTLR-Based Parser
– Intermediate XML Document
– TinyXML2 Parser
– Code Generator written in C++
DEFG Architecture-old
54
Accomplishments ???
• DEFG
– Inputs declarations
– Uses declared design patterns
– Generates CPU-side OpenCL code
• Summarized the Proof-of-Concept DEFG
• Described the Version 2 DEFG enhancements
• Described the diverse DEFG applications
– These show DEFG’s applicability and flexibility
– Each is a full application implementation
• Addressed DEFG research goals
55
Presentation Outline
• Research Accomplishments
– Designed and implemented DEFG
– Produced DEFG “Enabling” Design Patterns
– Applied DEFG to Diverse GPU Applications
– Analyzed the Performance of these Applications
• Future Research
56
Proposed Dissertation Work Plan(design patterns)
• DEFG Enhancements
– Three existing design patterns
• Execute kernel once, execute N times, and loop-while
• The current loop-while syntax is too “procedural”
– New design patterns
• Anytime algorithm support
• Multiple-GPU support
– Divide, Process, Merge
– Overlapped Split, Process, Concatenate
• Explicit Parallel
– Add other interesting DEF-G design patterns
57
(implementation)
• DEFG Enhancements• ((get list from doc))
58
Image Filtering Results
• Developed proof-of-concept DEFG version
• Created a proof-of-concept, hand-written, multiple-GPU version
– Using two cards doubled throughput
– Complexity in managing overlapped sub-images
• Next steps:
– DEFG multiple-GPU version
• Due to 3x3 mask, sub-images overlap
• Uses Overlapped Split, Process, Concatenate design pattern
– Testing and analysis with large images and more than one GPU
– Expect to find (or emulate) a four-GPU environment
59
Accomplishments
• Declarative Approach to …
• Design Patterns
• DEFG Tools
– Produced DEFG parser, optimizer, and code generator
– Diagnostics …
• DEFG Applications
– Performance verification
– Image Filters
– Multiple-GPU graph processing
– Rough Sorting
– Iterative Matrix Inversion
60
Proposed New DEFG Features
• Additional DEFG Design Patterns
– Multiple-GPU Load Balancing
– Resource Sharing
• DEFG Support for NVIDIA’s CUDA
• Re-factored DEFG (DEFG Version 3)
– Internal DSL
• P1
• P2
– More-standard programming environment
– Could support more environments
61
Roughly Sorting Implementation
• RSORT implementation contained five kernels
• Comb sort
• Analysis and Complications
– Choose an existing OpenCL sort for use
– Add support for multiple GPU cards using DEFG’s Divide, Process, Merge pattern
– Performance analysis of GPU Roughly Sorting