[IEEE 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC) - Salt...


1. We model faults that occur in GPU computation components, i.e., the ALU, the FPU, and memory address calculation.

2. Failures are categorized as Crash, Silent Data Corruption (SDC), Hang, and Benign.

§  GPUs were originally designed for error-resilient workloads.

§  Today, GPUs are used in error-sensitive applications.

Evaluating Error Resilience of GPGPU Applications

Bo Fang, Jiesheng Wei, Karthik Pattabiraman, Matei Ripeanu

University of British Columbia

Towards Building Error Resilient GPGPU Applications

§  To reduce SDCs, we embed heuristic-based error detectors at the program level and evaluate their efficacy by running fault injection on 3 instrumented benchmarks.

Motivation

Research Question

Spot the 5 different pixels between these pictures!

ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTTCGTACTTTTACACAGTATATCGTGT

Error

The error (A->T) seriously affects the usability of the result

Features:

§  Breakpoint-based fault injection

•  On top of CUDA toolkit 4.1 (cuda-gdb 4.1)

•  On real GPU hardware

Fault Injection Tool

ATATTTTTTCTTGTTTTTTATATCCACAATCTCTTTTCGTACTTTTACACAGTATATCGTGT

Evaluation Results

Overview of each benchmark

Take-aways:

§  Crashes are the most frequent outcome.

§  SDCs (10%~40%) are much more common in GPGPU applications than in CPU applications.

§  SDC rates vary more widely across GPGPU applications than across CPU applications.

Crash breakdown for each benchmark

Take-aways:

§  Memory-related crashes are the main cause of crashes.

§  Out-of-bound memory accesses dominate.

§  What are the fundamental reliability characteristics of GPGPU applications?

•  Specifically, their tolerance to transient faults in processing logic

§  Methodology:

•  Design a fault injection tool to study the behavior of GPGPU applications under faults.

Design Goals for the Fault Injection Tool

1.  Has visibility into run-time information about the executed instruction stream

2.  Minimally interferes with the original GPGPU applications

3.  Uniformly injects faults in dynamic instructions of applications
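Goal 3 can be illustrated with a small sampling sketch. This is not the authors' tool: it assumes a hypothetical profile format, a list of (static instruction, dynamic execution count) pairs, and picks one dynamic instruction with equal probability across the whole run. The names `locate_dynamic_instruction` and `choose_injection_point` are illustrative only.

```python
import bisect
import random
from itertools import accumulate


def locate_dynamic_instruction(profile, target):
    """Map a 0-based dynamic instruction index to (static_id, occurrence).

    profile: list of (static_id, execution_count) pairs (hypothetical format).
    occurrence is 1-based: the n-th time that static instruction executes.
    """
    cumulative = list(accumulate(count for _, count in profile))
    if not cumulative or not 0 <= target < cumulative[-1]:
        raise ValueError("target is outside the profiled instruction stream")
    # First static instruction whose cumulative count exceeds the target.
    idx = bisect.bisect_right(cumulative, target)
    executed_before = cumulative[idx - 1] if idx > 0 else 0
    return profile[idx][0], target - executed_before + 1


def choose_injection_point(profile, rng=random):
    """Pick one dynamic instruction uniformly at random."""
    total = sum(count for _, count in profile)
    if total <= 0:
        raise ValueError("profile contains no executed instructions")
    return locate_dynamic_instruction(profile, rng.randrange(total))
```

Sampling over the dynamic stream, rather than over static instructions, means an instruction inside a hot loop is chosen in proportion to how often it runs. The returned occurrence number is the kind of value a conditional breakpoint could test, although how the tool encodes that condition is not stated on the poster.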

Fault injection workflow:

§  Profiles one application

§  Chooses one run-time instruction randomly

§  Sets a conditional breakpoint and injects a fault (see note 1) at run-time

§  Collects failure results (see note 2)

Diagram annotations, in order: Goal 1, Goal 3, Goal 2
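The injection and result-collection steps can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the poster's implementation: it assumes the injected fault is a single-bit flip in a 32-bit floating-point value (the poster does not state the bit-level fault model), and it assumes outcomes are classified by comparing a run's output against a fault-free "golden" run. Both function names are hypothetical.

```python
import struct


def flip_bit_float32(value, bit):
    """Flip one bit (0 = least significant) of value's IEEE-754 float32 encoding."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float as a 32-bit unsigned integer, flip, and convert back.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def classify_outcome(crashed, timed_out, output, golden_output):
    """Assign one of the four failure categories to a single injection run."""
    if crashed:
        return "Crash"
    if timed_out:
        return "Hang"
    if output != golden_output:
        return "SDC"  # run completed, but the result is silently wrong
    return "Benign"
```

For example, flipping the sign bit of `1.0` with `flip_bit_float32(1.0, 31)` gives `-1.0`. If the program then finishes normally with output that differs from the golden run, `classify_outcome` labels the run as an SDC.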

[Figure: Outcome percentage (0% to 100%) per benchmark program (AES, MAT, MUM, BFS, LIB). Legend: Crash, Hang, Silent Data Corruption, Benign.]

[Figure: Percentage of crash root cause (0% to 100%) per benchmark program (AES, MAT, MUM, BFS, LIB). Legend: Stack overflow, Global out-of-bound, Misaligned, Local/shared out-of-bound.]

Experiment setup

§  NVIDIA C2075 (Fermi architecture)

§  5 benchmarks

§  Preliminary results

[Figure: SDC rate and SDC rate with detectors (y-axis 0.0% to 30.0%) for BFS, MAT, LIB, and Average.]

Fault detection experimental results for instrumented benchmarks
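The quantities plotted in this figure can be derived from raw injection outcomes as follows. This is an illustrative sketch: the outcome labels follow the four categories named on the poster, while the function names and any inputs passed to them are hypothetical and do not reproduce the poster's measurements.

```python
from collections import Counter

CATEGORIES = ("Crash", "Hang", "SDC", "Benign")


def outcome_rates(outcomes):
    """Return the fraction of injection runs that fall into each category."""
    if not outcomes:
        raise ValueError("at least one injection run is required")
    counts = Counter(outcomes)
    return {category: counts[category] / len(outcomes) for category in CATEGORIES}


def relative_sdc_reduction(baseline_rate, detector_rate):
    """Fraction of the baseline SDC rate that the detectors eliminate."""
    if baseline_rate <= 0:
        raise ValueError("baseline SDC rate must be positive")
    return (baseline_rate - detector_rate) / baseline_rate
```

With made-up inputs, a baseline SDC rate of 0.25 that drops to 0.125 once detectors are added corresponds to a relative reduction of 0.5.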

Contact: [email protected]

§  Heuristic categories:

•  Category I: Loop conditions

•  Category II: Branches with block or thread identifier

•  Category III: Computation statements that pertain to:

1)  Initialization of block and thread identifier

2)  Computation involving block or thread id

3)  Computation involving loop iterator variables

4)  Data movement between global memory and other memory regions
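The poster does not show what the detectors look like, so the following is only a guess at their general shape, written in Python rather than CUDA for brevity. It assumes a recompute-and-compare check for a global index derived from block and thread identifiers (Category III) and an exit check on a loop iterator (Category I). All names, including the `corrupt_iterator` hook used to emulate a fault, are hypothetical.

```python
class DetectedError(Exception):
    """Raised when a detector observes an inconsistent value."""


def check_global_index(index, block_id, thread_id, block_dim, grid_dim):
    """Category III style check: recompute the index and verify its range."""
    expected = block_id * block_dim + thread_id
    if index != expected:
        raise DetectedError("index disagrees with recomputed value")
    if not 0 <= index < grid_dim * block_dim:
        raise DetectedError("index is outside the launched grid")
    return index


def check_loop_exit(iterator, bound, trip_count):
    """Category I style check: the loop must end at its bound after bound trips."""
    if iterator != bound or trip_count != bound:
        raise DetectedError("loop exited with an inconsistent iterator")


def guarded_sum(values, corrupt_iterator=None):
    """Sum values with a loop detector; corrupt_iterator emulates a fault."""
    n = len(values)
    total, i, trips = 0, 0, 0
    while 0 <= i < n:
        total += values[i]
        i += 1
        trips += 1
        if corrupt_iterator is not None:
            i = corrupt_iterator(i)  # hypothetical fault hook, for testing only
        if trips > n:
            raise DetectedError("loop ran more iterations than its bound")
    check_loop_exit(i, n, trips)
    return total
```

A fault that makes the iterator skip ahead lets the loop finish early with a wrong sum, which would otherwise be an SDC. The exit check turns that silent corruption into a detected error.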

Error
