2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC), IEEE, Salt Lake City, UT, 2012.11.10-2012.11.16


Notes:

1. We model faults that occur in GPU computation components, i.e., the ALU, the FPU, and memory address calculation.

2. The failure categories are Crash, Silent Data Corruption (SDC), Hang, and Benign.
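To make the fault model and the four outcome categories concrete, here is a minimal Python sketch. It assumes a single-bit-flip fault model and a timeout-based hang check; the poster states neither, so both are assumptions. The names `flip_bit`, `RunResult`, and `classify_outcome` are hypothetical and are not part of the authors' tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Outcome(Enum):
    CRASH = "Crash"
    SDC = "Silent Data Corruption"
    HANG = "Hang"
    BENIGN = "Benign"


def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Flip one bit of a width-bit value (assumed fault model)."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside register width")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


@dataclass
class RunResult:
    exit_code: int          # 0 means the program finished normally
    timed_out: bool         # assumed hang criterion: run exceeded a time limit
    output: Sequence[int]   # program output to compare with a fault-free run


def classify_outcome(run: RunResult, golden: Sequence[int]) -> Outcome:
    """Map one fault-injection run to one of the four categories."""
    if run.timed_out:
        return Outcome.HANG
    if run.exit_code != 0:
        return Outcome.CRASH
    if list(run.output) != list(golden):
        return Outcome.SDC      # finished normally but the output is wrong
    return Outcome.BENIGN       # the fault had no visible effect
```

The order of the checks matters: a run only counts as an SDC if it neither hung nor crashed, which is what makes the corruption silent.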

§  GPUs were originally designed for error-resilient workloads.

§  Today, GPUs are used in error-sensitive applications.

Evaluating Error Resilience of GPGPU Applications

Bo Fang, Jiesheng Wei, Karthik Pattabiraman, Matei Ripeanu

University of British Columbia

Towards Building Error Resilient GPGPU Applications

§  To reduce SDCs, we embed heuristic-based error detectors at the program level and evaluate their efficacy by running fault injection on 3 instrumented benchmarks.

Motivation

Research Question

Spot the 5 different pixels between these pictures!

ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTTCGTACTTTTACACAGTATATCGTGT

The error (A->T) seriously affects the usability of the result.

Fault Injection Tool

Features:

§  Breakpoint-based fault injection

•  Built on top of CUDA toolkit 4.1 (cuda-gdb 4.1)

•  Runs on real GPU hardware

ATATTTTTTCTTGTTTTTTATATCCACAATCTCTTTTCGTACTTTTACACAGTATATCGTGT

Evaluation Results

Overview of each benchmark

Take-aways:

§  Crashes are the most frequent outcome.

§  SDCs (10% to 40%) are much more common in GPGPU applications than in CPU applications.

§  SDC rates vary more widely across GPGPU applications than across CPU applications.

Crash breakdown for each benchmark

Take-aways:

§  Memory-related crashes are the main cause of crashes.

§  Out-of-bound memory accesses dominate.
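One way to build intuition for why out-of-bound accesses could dominate is a toy model that flips each bit of a valid address in turn and checks where the result lands. This is a sketch under assumptions, not the authors' analysis: the 32-bit address width, the 4-byte alignment rule, the buffer layout, and the names `classify_address` and `flip_outcomes` are all hypothetical.

```python
def classify_address(addr: int, base: int, size: int, align: int = 4) -> str:
    """Classify an address against one buffer [base, base + size)."""
    if addr % align != 0:
        return "misaligned"
    if not base <= addr < base + size:
        return "out-of-bound"
    return "in-bound"


def flip_outcomes(addr: int, base: int, size: int,
                  width: int = 32, align: int = 4) -> dict:
    """Flip each address bit once and count where the faulty address lands."""
    counts = {"misaligned": 0, "out-of-bound": 0, "in-bound": 0}
    for bit in range(width):
        faulty = addr ^ (1 << bit)
        counts[classify_address(faulty, base, size, align)] += 1
    return counts


# A 4 KiB buffer at 0x1000 and a valid, aligned address inside it.
print(flip_outcomes(addr=0x1040, base=0x1000, size=0x1000))
```

In this toy setting, flips in the two lowest bits break alignment, flips in bits 2 to 11 stay inside the buffer, and flips in any higher bit leave it, so most single-bit corruptions land out of bounds. The exact split depends on the assumed buffer size and address width.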

§  How can we better understand the fundamental reliability characteristics of GPGPU applications?

•  Specifically, their tolerance to transient faults in processing logic

§  Methodology:

•  Design a fault injection tool to study the behavior of GPGPU applications under faults.

Design Goals for the Fault Injection Tool

1.  Has visibility into runtime information about the executed instruction stream

2.  Minimally interferes with the original GPGPU application

3.  Injects faults uniformly across the dynamic instructions of the application

Workflow: Profiles one application -> Chooses one run-time instruction randomly -> Sets a conditional breakpoint and injects a fault (note 1) at run-time -> Collects failure results (note 2)

Diagram labels, in the order they appear: Goal 1, Goal 3, Goal 2
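The "choose one run-time instruction randomly" step can be sketched as follows. The sketch assumes the profiling run yields an execution count per static instruction; the poster does not describe the profile format, so the `profile` dictionary and the function name `choose_injection_point` are hypothetical. Picking an index uniformly over the total count and mapping it back to a (static instruction, occurrence) pair gives every dynamic instruction the same probability, which is the property Goal 3 asks for.

```python
import bisect
import random
from itertools import accumulate
from typing import Dict, Optional, Tuple


def choose_injection_point(
    profile: Dict[str, int], rng: Optional[random.Random] = None
) -> Tuple[str, int]:
    """Pick one dynamic instruction uniformly at random.

    profile maps a static instruction label (hypothetical, e.g. a program
    counter) to its execution count from the profiling run. Returns the
    label and the 1-based occurrence at which a fault would be injected.
    """
    rng = rng or random.Random()
    labels = list(profile)
    cumulative = list(accumulate(profile[label] for label in labels))
    if not cumulative or cumulative[-1] <= 0:
        raise ValueError("profile contains no executed instructions")
    k = rng.randrange(cumulative[-1])        # uniform over all dynamic instructions
    i = bisect.bisect_right(cumulative, k)   # static instruction that owns index k
    executed_before = cumulative[i - 1] if i else 0
    return labels[i], k - executed_before + 1


# Example: "ld" runs 3 times and "add" twice, so "ld" is picked 60% of the time.
print(choose_injection_point({"add": 2, "ld": 3}, random.Random(1)))
```

The returned occurrence number is the kind of value a conditional breakpoint could test so that it triggers only on the chosen execution. How the real tool expresses that condition in cuda-gdb, and how it selects a thread, is not stated in the poster.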

[Figure: Outcome percentage (0% to 100%) per benchmark program (AES, MAT, MUM, BFS, LIB). Legend: Crash, Hang, Silent Data Corruption, Benign.]

[Figure: Percentage of crash root cause (0% to 100%) per benchmark program (AES, MAT, MUM, BFS, LIB). Legend: Stack overflow, Global out-of-bound, Misaligned, Local/shared out-of-bound.]

Experiment setup

§  NVIDIA C2075 (Fermi architecture)

§  5 benchmarks

§  Preliminary results

[Figure: Fault detection experimental results for instrumented benchmarks. Bars for BFS, MAT, LIB, and Average compare SDC rate with SDC rate with detectors; axis from 0.0% to 30.0%.]

Contact: [email protected]

§  Heuristic categories:

§  Category I: Loop conditions

§  Category II: Branches with block or thread identifier

§  Category III: Computation statements that pertain to:

1)  Initialization of block and thread identifier

2)  Computation involving block or thread id

3)  Computation involving loop iterator variables

4)  Data movement between global memory and other memory regions
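The poster does not say how the detectors are implemented, so the following is only an illustration of what program-level checks at two of these locations might look like. It emulates a single GPU thread in Python rather than CUDA, and it uses two swapped-in techniques that are assumptions: recomputing the thread's global index and comparing (a Category III location), and checking the loop trip count against a precomputed value (a Category I location). All names, including the `fault` hook, are hypothetical.

```python
class DetectedError(Exception):
    """Raised when an embedded detector check fails."""


def scale_thread(data, out, block_idx, block_dim, thread_idx, stride, fault=None):
    """Emulate one thread that doubles every stride-th element of data.

    fault(site, value) optionally corrupts a value at the named site
    ("index" or "iterator") to emulate a fault in the computation.
    """
    fault = fault or (lambda site, value: value)

    idx = fault("index", block_idx * block_dim + thread_idx)
    # Category III check: recompute the global index and compare.
    if idx != block_idx * block_dim + thread_idx:
        raise DetectedError("thread index mismatch")

    expected_trips = len(range(idx, len(data), stride))
    trips, i = 0, idx
    while i < len(data):
        out[i] = 2 * data[i]
        i = fault("iterator", i + stride)
        trips += 1
        if trips > expected_trips:   # stop a corrupted loop that runs too long
            raise DetectedError("loop ran too long")
    # Category I check: the loop must run the expected number of times.
    if trips != expected_trips:
        raise DetectedError("loop trip count mismatch")


data, out = list(range(1, 9)), [0] * 8
scale_thread(data, out, block_idx=0, block_dim=4, thread_idx=1, stride=4)
print(out)
```

Passing `fault=lambda site, v: v ^ 8 if site == "iterator" else v` makes the loop exit one iteration early, which the trip-count check reports instead of letting the incomplete output pass silently.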
