1. We model faults that occur in GPU computation components, i.e., the ALU, the FPU, and memory address calculation.
2. The failure categories are: Crash, Silent Data Corruption (SDC), Hang, and Benign.
§ GPUs were originally designed for error-resilient workloads.
§ Today, GPUs are used in error-sensitive applications.
Evaluating Error Resilience of GPGPU Applications
Bo Fang, Jiesheng Wei, Karthik Pattabiraman, Matei Ripeanu
University of British Columbia
Towards Building Error Resilient GPGPU Applications
§ To reduce SDCs, we embed heuristic-based error detectors at the program level and evaluate their efficacy by running fault injection on 3 instrumented benchmarks.
Motivation
Research Question
Spot the 5 different pixels between these pictures!
ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTTCGTACTTTTACACAGTATATCGTGT
Error
The error (A->T) seriously affects the usability of the result
Fault Injection Tool
Features:
§ Breakpoint-based fault injection
• On top of CUDA toolkit 4.1 (cuda-gdb 4.1)
• On real GPU hardware
ATATTTTTTCTTGTTTTTTATATCCACAATCTCTTTTCGTACTTTTACACAGTATATCGTGT
Evaluation Results
Overview of each benchmark
Take-aways:
§ Crashes are the most frequent outcome.
§ SDCs (10% to 40%) are much more common in GPGPU applications than in CPU applications.
§ SDC rates vary more widely across GPGPU applications than across CPU applications.
Crash breakdown for each benchmark
Take-aways:
§ Memory-related errors are the main cause of crashes.
§ Out-of-bound memory accesses dominate.
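The four outcome categories can be told apart with a simple comparison against a fault-free ("golden") run. The following is a minimal sketch of that bookkeeping, not the tool's actual code: the function names, the argument layout, and the order of the checks are assumptions made for illustration.

```python
from collections import Counter

CATEGORIES = ("Crash", "Hang", "SDC", "Benign")


def classify(crashed, timed_out, output, golden_output):
    """Classify one fault-injection run (assumed decision order)."""
    if timed_out:
        return "Hang"      # the run did not finish within the time limit
    if crashed:
        return "Crash"     # the run terminated abnormally
    if output != golden_output:
        return "SDC"       # finished normally, but the result is wrong
    return "Benign"        # the fault had no visible effect


def outcome_percentages(outcomes):
    """Turn a list of per-run outcomes into a percentage per category."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {name: 100.0 * counts[name] / total for name in CATEGORIES}


if __name__ == "__main__":
    runs = [
        classify(True, False, None, b"ok"),
        classify(False, False, b"bad", b"ok"),
        classify(False, False, b"ok", b"ok"),
        classify(False, True, None, b"ok"),
    ]
    print(outcome_percentages(runs))
```

Checking for a hang before a crash is a design choice of this sketch; a run that is killed after a timeout would otherwise be counted as a crash.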
§ How can we better understand the fundamental reliability characteristics of GPGPU applications?
• Specifically, their tolerance to transient faults in processing logic
§ Methodology:
• Design a fault injection tool to study the behavior of GPGPU applications under faults.
Design Goals for the Fault Injection Tool
1. Has visibility into the runtime information of the executed instruction stream
2. Minimally interferes with the original GPGPU applications
3. Uniformly injects faults in dynamic instructions of applications
Profiles one application (Goal 1)
Chooses one run-time instruction randomly (Goal 3)
Sets a conditional breakpoint and injects a fault¹ at run-time (Goal 2)
Collects failure results²
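The selection step of this workflow can be sketched as follows. This is an illustrative model only, not the tool itself: it assumes the profile is a list of (static instruction, dynamic execution count) pairs and that the injected fault is a single bit flip in a 32-bit value, neither of which is stated on the poster. All function names are hypothetical, and the sketch does not issue any cuda-gdb commands.

```python
import random


def pick_dynamic_instruction(profile, rng):
    """Pick one dynamic instruction uniformly at random.

    profile: list of (static_instruction_id, dynamic_execution_count).
    Returns (static_instruction_id, occurrence), where occurrence is the
    0-based index of the chosen execution of that static instruction.
    """
    total = sum(count for _, count in profile)
    target = rng.randrange(total)      # uniform over all dynamic instructions
    for instr_id, count in profile:
        if target < count:
            return instr_id, target
        target -= count
    raise ValueError("empty profile")


def flip_bit(value, bit, width=32):
    """Flip one bit of a width-bit unsigned value (assumed fault model)."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def plan_injection(profile, seed=None, width=32):
    """Describe where a conditional breakpoint would fire and which bit to flip."""
    rng = random.Random(seed)
    instr_id, occurrence = pick_dynamic_instruction(profile, rng)
    return {"instruction": instr_id,
            "occurrence": occurrence,
            "bit": rng.randrange(width)}


if __name__ == "__main__":
    profile = [("add", 1000), ("ld.global", 250), ("mul", 50)]
    print(plan_injection(profile, seed=42))
    print(hex(flip_bit(0x0000FFFF, 31)))
```

Weighting each static instruction by its dynamic execution count is what makes the choice uniform over the executed instruction stream rather than over the program text.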
Figure: Outcome percentage per benchmark program (AES, MAT, MUM, BFS, LIB). Legend: Crash, Hang, Silent Data Corruption, Benign.
Figure: Percentage of crash root cause per benchmark program (AES, MAT, MUM, BFS, LIB). Legend: Stack overflow, Global out-of-bound, Misaligned, Local/shared out-of-bound.
Experiment setup
§ NVIDIA C2075 (Fermi architecture)
§ 5 benchmarks
§ Preliminary results
Figure: Fault detection experimental results for instrumented benchmarks (BFS, MAT, LIB, Average). Series: SDC rate, SDC rate with detectors.
Contact: [email protected]
§ Heuristic categories:
• Category I: Loop conditions
• Category II: Branches with block or thread identifier
• Category III: Computation statements that pertain to:
1) Initialization of block and thread identifier
2) Computation involving block or thread id
3) Computation involving loop iterator variables
4) Data movement between global memory and other memory regions
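As an illustration of a Category I (loop condition) detector, the sketch below keeps a shadow iteration counter and checks it against the expected trip count when the loop exits. It is a plain Python model of the idea, not the detectors used in the experiments: the `fault` parameter, the `DetectedError` exception, and the specific check are assumptions introduced here to make the effect of a corrupted loop iterator visible.

```python
class DetectedError(Exception):
    """Raised when the detector sees an inconsistent loop state."""


def checked_sum(values, fault=None):
    """Sum values with a loop-condition detector.

    fault: optional (iteration, bit) pair. When given, one bit of the loop
    iterator is flipped at the end of that iteration to emulate a
    transient fault (hypothetical test hook).
    """
    n = len(values)
    total = 0
    iterations = 0                 # shadow counter, never touched by the fault
    i = 0
    while i < n:
        total += values[i]
        i += 1
        if fault is not None and iterations == fault[0]:
            i ^= 1 << fault[1]     # emulate a bit flip in the iterator
        iterations += 1
    # Detector: a fault-free loop runs exactly n times and ends with i == n.
    if iterations != n or i != n:
        raise DetectedError(f"expected {n} iterations, saw {iterations}")
    return total


if __name__ == "__main__":
    data = list(range(1, 9))
    print(checked_sum(data))
    try:
        checked_sum(data, fault=(2, 2))
    except DetectedError as err:
        print("detected:", err)
```

Without the final check, the faulty run would skip several elements and return a wrong sum without any visible failure, which is the SDC case the detectors aim to reduce.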