Date posted: 13-Jan-2015
Category: Technology
Upload: amd-developer-central
Case Study: Accelerating Full Waveform Inversion via OpenCL™ on AMD GPUs
©2014 Acceleware Ltd. All rights reserved.
Chris Mason, Acceleware Product Manager
March 5, 2014
About Acceleware
Software and services company specializing in HPC product development, developer training and consulting services
OpenCL training for AMD GPUs
– Progressive lectures and hands-on lab exercises
– Experienced instructors
– Delivered worldwide
High performance consulting
– Feasibility studies
– Porting and optimization
– Code commercialization
Acceleware Software
Seismic Applications
– Survey design and 3D modeling
– Reverse Time Migration
Electromagnetics
– FDTD Solver
Radio Frequency Heating
– Simulation application for the RF heating of hydrocarbon reserves
Outline
What is Full Waveform Inversion?
The Project
OpenCL
Optimizations
– Coalescing
– Iterative kernel for stencil operations
– Fusing kernels together to eliminate redundant memory accesses
Key Performance Results
What is Full Waveform Inversion?
Seismic inversion technique
Used to build Earth models from recorded seismic data
Uses a finite-difference solution to the acoustic wave equation
Computationally expensive
What is FWI?
From a basic starting point... to an accurate velocity model
FWI Algorithm
1) Initial Model Estimate
2) Forward Propagate Source → Residuals
3) Back Propagate Residuals → Gradient
4) Forward Propagation(s) → Step Length
5) Update Model
6) Increase Frequency
Loops: over shots, over frequencies, until convergence
FWI Compute Cost
Cluster size of 10s to 100s of CPU nodes
Many days of runtime
Accuracy and quality reduced to keep runtime acceptable
The Project
GeoTomo develops high-end geophysical software products that help geophysicists around the world image the subsurface
GeoTomo had a pre-existing, cluster-ready, multi-threaded (OpenMP-based) CPU FWI solution
GeoTomo required their FWI application to run faster so they could deliver results to their clients more quickly
– Looked to AMD GPUs to potentially accelerate their FWI and approached Acceleware for help to make it happen
Why use GPUs? Performance!
                                     AMD Opteron 6386 SE   AMD FirePro W9000   AMD FirePro S10000
Memory Bandwidth                     59.7 GB/s             264 GB/s            480 GB/s
Peak Gflops (single)                 ~410                  4000                5910
Peak Gflops (double)                 ~205                  1000                1480
Total Memory                         >>6 GB                6 GB                6 GB
Power Consumption                    140 W                 274 W               375 W
Gflops per Watt (single precision)   <3                    14.59               15.76
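The Gflops-per-Watt row follows from the two rows above it: peak single-precision Gflops divided by power consumption. A minimal sketch of that arithmetic is below; the function name gflops_per_watt is chosen here for illustration and is not from the slides.

```c
#include <assert.h>

/* Peak single-precision Gflops divided by board power in watts.
 * Illustrative helper; the name is not from the presentation. */
double gflops_per_watt(double peak_gflops, double watts)
{
    assert(watts > 0.0);
    return peak_gflops / watts;
}
```

With the table's figures, gflops_per_watt(5910, 375) gives 15.76 for the S10000, gflops_per_watt(4000, 274) gives about 14.6 for the W9000, and gflops_per_watt(410, 140) gives about 2.9 for the Opteron, consistent with the "<3" entry.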
OpenCL Overview
Parallel computing architecture standardized by the Khronos Group
OpenCL:
– Is a royalty-free standard
– Provides an API to coordinate parallel computation across heterogeneous processors
– Defines a cross-platform programming language
– Used on handheld/embedded devices through supercomputers
Of interest because heterogeneous devices can significantly accelerate certain (primarily data-parallel) workloads
OpenCL Programming Model
Heterogeneous model, including provisions for a host connected to one or more devices
– Example: GPUs, CPUs
[Diagram: a host connected to Device 1 (GPU), Device 2 (GPU), … Device N (GPU)]
The OpenCL Programming Model
Data-parallel portions of an algorithm are executed on the device as kernels
– Kernels are C functions with some restrictions and a few language extensions
– Many (parallel) work-items execute the kernel
The host executes serial code between device kernel launches
– Memory management
– Data exchange to/from device (usually)
– Error handling
[Diagram: the host launches kernels on the device. Each launch is an ND range divided into work-groups: one shown as a 2 × 3 grid, Work-Group (0,0) to (1,2), the other as a 3 × 2 grid, Work-Group (0,0) to (2,1)]
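The relationship between work-groups and work-items can be sketched in plain C. For a one-dimensional ND range with no global offset, a work-item's global index is its work-group index times the work-group size plus its index within the group. The helper names below are hypothetical stand-ins written for this note; they are not OpenCL API calls.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side stand-in for the value get_global_id(0) would return in a
 * 1D ND range with no global offset. Hypothetical helper, not OpenCL API. */
size_t global_id_1d(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/* Number of work-groups needed to cover n elements (rounds up). */
size_t num_groups(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return (n + local_size - 1) / local_size;
}
```

For example, work-item 5 of work-group 2 with 64 work-items per group has global index 133, and covering 1000 elements with groups of 64 takes 16 groups (1024 work-items). In that case a kernel would typically guard against indices at or beyond 1000.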
OpenCL Memory Model
OpenCL kernels have access to four distinct memory regions:
– Global: allows read/write access from all work-items in all work-groups; persistent across kernels
– Local: memory shared by all work-items within a work-group
– Constant: region of memory that remains constant (read-only) during the execution of a kernel
– Private: memory that is private to a work-item
OpenCL vendors map memory regions into physical resources
– Local/constant/private memory usually has several orders of magnitude lower capacity than global memory, but is orders of magnitude faster
OpenCL Syntax – Memory Spaces
Host and device have separate memory spaces
– Data is explicitly moved between them, typically over the PCIe bus
Host functions to allocate, copy, and free memory on the device, e.g.
– clCreateBuffer()
– clEnqueueReadBuffer()
– clEnqueueWriteBuffer()
– clReleaseMemObject()
Putting It All Together
Operation: Cx = Ax + Bx, one work-item per element

    A0 A1 A2 A3 A4 A5 A6 A7
    B0 B1 B2 B3 B4 B5 B6 B7
    C0 C1 C2 C3 C4 C5 C6 C7

Each work-item has a unique index, typically used to index into arrays:

    __kernel
    void VectorAdd(__global float* a,
                   __global float* b,
                   __global float* c)
    {
        /* Unique index of this work-item in dimension 0 */
        int idx = get_global_id(0);
        c[idx] = a[idx] + b[idx];
    }
Vector Add – Host Code
    void VectorAdd(float* aH, float* bH, float* cH, int N)
    {
        size_t N_BYTES = N * sizeof(float);
        size_t globalSize = (size_t)N;   // global work size is passed as a size_t*

        // Device management code
        …

        // Allocate memory on device
        cl_mem aD = clCreateBuffer(…, N_BYTES, …);
        cl_mem bD = clCreateBuffer(…, N_BYTES, …);
        cl_mem cD = clCreateBuffer(…, N_BYTES, …);

        // Transfer input arrays to device
        clEnqueueWriteBuffer(…, aD, …, N_BYTES, aH, …);
        clEnqueueWriteBuffer(…, bD, …, N_BYTES, bH, …);

        // Pass kernel arguments and launch kernel
        …
        clEnqueueNDRangeKernel(…, &globalSize, …);

        // Transfer output array to host
        clEnqueueReadBuffer(…, cD, …, N_BYTES, cH, …);
    }
Project Steps
1) Profiling
– Acquired code, datasets and reference benchmarks from GeoTomo
– Set up local machines with near-equivalent hardware, compiled code and confirmed reference benchmark numbers
– Augmented code with timers to determine time spent in parallel regions, areas of interest
Project Steps
2) Feasibility Analysis
– Investigated memory footprint for FWI jobs
  GPU memory limited to 6 GB per card
– Investigated potential speedup / time to port code
  Maximum speedup determined by time spent in parallel regions (Amdahl’s Law)
  Time to port dependent on feature set, e.g. domain decomposition across multiple GPUs
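Amdahl's Law bounds the overall speedup by the fraction of runtime spent in the regions being accelerated. If a fraction p of the runtime is parallel and that part is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below is only this formula; the function name and the example numbers are illustrative and are not measurements from the project.

```c
#include <assert.h>

/* Overall speedup predicted by Amdahl's Law.
 * parallel_fraction: share of original runtime in the accelerated regions (0..1)
 * parallel_speedup:  speedup achieved on those regions alone (> 0) */
double amdahl_speedup(double parallel_fraction, double parallel_speedup)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(parallel_speedup > 0.0);
    return 1.0 / ((1.0 - parallel_fraction)
                  + parallel_fraction / parallel_speedup);
}
```

As a hypothetical example, if profiling showed 95% of the time in parallel regions, a 20X speedup of those regions would yield about 10.3X overall, and no amount of acceleration could exceed 20X. This is why the profiling timers from step 1 matter for the feasibility estimate.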
Project Steps
3) Implementation
– Creating test harnesses
– Kernel implementation
– Resolving hardware driver issues
– Enabling multi-GPU device support
– Optimization iterations
4) Wrap-up
– Delivery of port, along with installation documentation
– Trained GeoTomo developer on OpenCL
Key GeoTomo Optimizations
1) Coalescing
– Changing memory access patterns in the kernels to those best suited for GPUs
  Global memory is accessed via a request for a multi-byte word
  Combine load/store requests from consecutive work-items to reduce the number of requested words
  – Fewer requests mean less contention for global memory
  Make one big multi-word burst request to global memory whenever possible
  – Contiguous bursts mean less global memory overhead
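The effect of access pattern on the number of memory requests can be illustrated with a small host-side model. It counts how many aligned memory segments are touched when consecutive work-items each read one element at a given stride. The segment size is a parameter of the model and an assumption made here; it is not a figure from the slides or a statement about any particular AMD GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Count distinct aligned segments touched when work-item i reads the
 * element at index i * stride, starting from an aligned base address.
 * segment_bytes is a hypothetical transaction size for this model. */
size_t segments_touched(size_t n_items, size_t stride,
                        size_t elem_bytes, size_t segment_bytes)
{
    assert(stride > 0 && elem_bytes > 0 && segment_bytes >= elem_bytes);
    assert(segment_bytes % elem_bytes == 0);
    size_t count = 0;
    size_t last_segment = 0;
    for (size_t i = 0; i < n_items; ++i) {
        /* Addresses increase with i, so a change of segment is a new one. */
        size_t segment = (i * stride * elem_bytes) / segment_bytes;
        if (i == 0 || segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

Under this model, 64 work-items reading consecutive 4-byte floats with 64-byte segments touch 4 segments, while the same work-items reading with a stride of 16 elements touch 64 segments, one per work-item. Coalescing aims for the first pattern.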
Key GeoTomo Optimizations
2) Iterative kernel for stencil operations
[Diagram: input volumes and stencil kernels]
– Outputs are weighted combinations of surrounding elements from input volumes
– Off-axis weights are zero
Acknowledgement: Paulius Micikevicius, 2009
Key GeoTomo Optimizations
A naïve implementation would have each work-item read all of its neighboring elements directly from global memory
– It can reach maximum GPU memory bandwidth, but the redundant reads of the same elements hurt performance
Key GeoTomo Optimizations
Alternative: iterate over 2D slices along the slowest dimension
– A single work-item is responsible for one column of the output array
– The work-group caches a 2D plane of input in local memory
– Work-items keep inputs along the direction of iteration in registers
– Significantly reduces the required number of global memory reads
[Diagram: single work-item view, showing which inputs are held in registers and which in local memory]
Acknowledgement: Paulius Micikevicius, 2009
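The marching pattern can be sketched in plain C, with each (x, y) column standing in for one work-item. This is a host-side illustration under stated assumptions, not the GeoTomo kernel: it assumes a radius-1, on-axis (7-point) stencil with a single neighbor weight, and the names stencil_march_z, idx3, c0 and c1 are invented here. The three scalars behind, current and front play the role of registers. The in-plane reads are where a GPU version would use the work-group's local-memory copy of the current plane.

```c
#include <assert.h>
#include <stddef.h>

/* Linear index into an nx * ny * nz volume; x is fastest, z is slowest. */
static size_t idx3(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return (z * ny + y) * nx + x;
}

/* Radius-1 on-axis stencil over interior points (assumed shape).
 * c0 weights the centre, c1 weights each of the six axis neighbours. */
void stencil_march_z(const float *in, float *out,
                     size_t nx, size_t ny, size_t nz,
                     float c0, float c1)
{
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    for (size_t y = 1; y + 1 < ny; ++y) {
        for (size_t x = 1; x + 1 < nx; ++x) {
            /* "Registers" holding this column's values along z. */
            float behind  = in[idx3(x, y, 0, nx, ny)];
            float current = in[idx3(x, y, 1, nx, ny)];
            for (size_t z = 1; z + 1 < nz; ++z) {
                /* Only one new read along z per step. */
                float front = in[idx3(x, y, z + 1, nx, ny)];
                /* In-plane neighbours: local memory on a GPU. */
                float plane = in[idx3(x - 1, y, z, nx, ny)]
                            + in[idx3(x + 1, y, z, nx, ny)]
                            + in[idx3(x, y - 1, z, nx, ny)]
                            + in[idx3(x, y + 1, z, nx, ny)];
                out[idx3(x, y, z, nx, ny)] =
                    c0 * current + c1 * (plane + behind + front);
                behind  = current;   /* slide the register window */
                current = front;
            }
        }
    }
}
```

Each step along z reuses two of the three z-direction values from the previous step, which is the source of the reduction in global memory reads compared with reading every neighbor afresh.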
Key GeoTomo Optimizations
3) Kernel Fusion
– Reduce redundant memory accesses by fusing kernels that operate on the same volume together
– Improves performance by reducing redundant global memory reads
4) Kernel Fission
– Improve occupancy by lowering kernel resource requirements (registers) via kernel simplification
– Allows for more work-items to run concurrently on GPU, improving masking of global memory latency
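Kernel fusion can be illustrated with two loops standing in for two kernels that operate on the same volume. The operation (scale, then add) and the function names are invented for this sketch; the slides do not say which FWI kernels were fused.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: two passes. The intermediate array tmp is written by the
 * first pass and read back by the second. */
void scale_then_add(const float *a, const float *b, float *tmp,
                    float *out, float s, size_t n)
{
    assert(a && b && tmp && out);
    for (size_t i = 0; i < n; ++i) tmp[i] = s * a[i];
    for (size_t i = 0; i < n; ++i) out[i] = tmp[i] + b[i];
}

/* Fused: one pass. The intermediate value stays in a scalar. */
void scale_add_fused(const float *a, const float *b,
                     float *out, float s, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; ++i) {
        float t = s * a[i];
        out[i] = t + b[i];
    }
}
```

Per element, the unfused version makes five array accesses (read a, write tmp, read tmp, read b, write out) and the fused version makes three. The cost of fusion is a larger kernel that may need more registers, which is the pressure that kernel fission relieves in the opposite direction.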
Performance Results
FWI 15 Hz, 15 shots
– GPU version: 7997 seconds
– CPU (5 cores per shot): 67086 seconds [GPU is 8.4X faster]
– CPU (30 cores per shot): 166948 seconds [GPU is 20.9X faster]
GPU: Sapphire Radeon HD 7970 GHz Edition
– 6GB model
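The bracketed factors are the CPU runtime divided by the GPU runtime. A minimal sketch of that calculation, with a function name chosen here for illustration:

```c
#include <assert.h>

/* Speedup of an accelerated run relative to a baseline run. */
double speedup(double baseline_seconds, double accelerated_seconds)
{
    assert(accelerated_seconds > 0.0);
    return baseline_seconds / accelerated_seconds;
}
```

With the reported timings, speedup(67086, 7997) is about 8.39 and speedup(166948, 7997) is about 20.88, which round to the 8.4X and 20.9X shown.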
Performance Results
“Using GPU’s we can use higher frequencies and more if not all of the shots to improve the resolution and coverage.”
James Jackson, President, GeoTomo
Questions?
Contact Us
– Tel: +1 403.249.9099
– Email: [email protected]
OpenCL Courses
– June 3-6, 2014, Calgary, Canada
– Private onsite classes also available
OpenCL Consulting
– Feasibility studies
– Code commercialization
– Porting and optimization
– Mentoring