Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

Daniel Oliveira, Laercio Pilla, Mauricio Hanzich, Vinicius Fratin, Fernando Santos, Caio Lunardi, José Maria Cela, Philippe Navaux, Luigi Carro, Paolo Rech

WMC 2017

Daniel Oliveira – WMC 2017

HPC reliability importance



Available Accelerators

Kepler K40, Xeon-Phi

Modern parallel accelerators offer:

- Low cost
- Flexible platform
- High efficiency (low per-thread consumption)
- High computational power and frequency
- Huge amount of resources
- Reliability? (error rate)


Titan

Titan (Oak Ridge National Lab): 18,688 GPUs

High probability of having a GPU corrupted.

Titan Detected Uncorrectable Errors MTBF is ~44h*

*(field and experimental data from HPCA’15)
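To make the ~44h figure concrete, the sketch below estimates how likely a job is to see at least one detected uncorrectable error. It assumes the job uses the whole machine and that failures arrive with exponentially distributed gaps; both are simplifying assumptions that the slide does not state. The function names are illustrative.

```python
import math

TITAN_MTBF_H = 44.0   # system-level MTBF for detected uncorrectable errors (from the slide)
TITAN_GPUS = 18_688   # number of GPUs in Titan (from the slide)


def prob_at_least_one_failure(job_hours: float, mtbf_hours: float) -> float:
    """P(at least one failure during the job), assuming exponential inter-arrival times."""
    return 1.0 - math.exp(-job_hours / mtbf_hours)


def per_device_mtbf(system_mtbf_hours: float, devices: int) -> float:
    """Per-device MTBF, assuming all devices fail independently at the same rate."""
    return system_mtbf_hours * devices


if __name__ == "__main__":
    for hours in (1, 12, 24, 44):
        p = prob_at_least_one_failure(hours, TITAN_MTBF_H)
        print(f"{hours:>3} h job: P(>=1 error) = {p:.2f}")
    print(f"per-GPU MTBF ~ {per_device_mtbf(TITAN_MTBF_H, TITAN_GPUS):,.0f} h")
```

Under these assumptions a single GPU looks very reliable, yet a full-machine job lasting one MTBF still has a probability of roughly 0.63 of being hit.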



Outline

Radiation Effects Essentials

Error Criticality in HPC

Experimental Procedure

K40 vs Xeon Phi FIT rates

Qualify SDCs for HPC applications

What’s the Plan?



Terrestrial Radiation Environment

Galactic cosmic rays interacting with the atmosphere generate neutrons.

13 n/(cm^2 h) at sea level

[Figure: bit values stored in a flip-flop (FF) and logic]

Soft Errors: the device is not permanently damaged, but the particle may generate bit flips or logic errors.
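A small sketch of what a single bit flip does to a stored value, assuming the value is an IEEE-754 double. The helper names `flip_bit` and `relative_error` are illustrative and do not come from the slides.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 double encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 hold the
    exponent, and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 64)")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def relative_error(observed: float, expected: float) -> float:
    """Distance from the expected value, as a fraction of the expected value."""
    return abs(observed - expected) / abs(expected)


if __name__ == "__main__":
    x = 1.5
    for bit in (0, 30, 51, 52, 63):
        y = flip_bit(x, bit)
        print(f"bit {bit:2d}: {x} -> {y!r} (relative error {relative_error(y, x):.3e})")
```

The position of the flipped bit decides the impact: a low mantissa bit changes the value by a negligible amount, while an exponent or sign bit changes it drastically.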


Silent Data Corruption vs Crash

Silent Data Corruption: soft errors in

- data cache
- register files
- logic gates (ALU)
- scheduler

DUE (Crash): soft errors in

- instruction cache
- scheduler / dispatcher
- PCI-e bus controller
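A minimal sketch of how one execution can be labeled by comparing it with a pre-computed golden output. The function `classify_run` and the three toy kernels are hypothetical; here a crash is modeled as a Python exception, and a real setup would also need a watchdog to catch hangs.

```python
from typing import Callable, Sequence


def classify_run(run: Callable[[], Sequence[float]], golden: Sequence[float]) -> str:
    """Label one execution as 'masked', 'SDC', or 'DUE'.

    DUE:    the run crashes (modeled as an exception), so the error is detected.
    SDC:    the run finishes but its output differs from the golden output.
    masked: the run finishes with the expected output.
    """
    try:
        output = list(run())
    except Exception:
        return "DUE"
    if len(output) != len(golden) or any(o != g for o, g in zip(output, golden)):
        return "SDC"
    return "masked"


def healthy() -> list:
    return [1.0, 2.0, 3.0]


def silently_wrong() -> list:
    return [1.0, 2.5, 3.0]  # one corrupted element, no crash


def crashing() -> list:
    raise RuntimeError("simulated crash")


if __name__ == "__main__":
    golden = [1.0, 2.0, 3.0]
    for kernel in (healthy, silently_wrong, crashing):
        print(kernel.__name__, "->", classify_run(kernel, golden))
```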


Output Correctness in HPC

A single fault on shared resources or on the scheduler affects several parallel threads: multiple corrupted elements.

- The error can fall within the intrinsic variance of floating-point values
- Values in a given range are accepted as correct in physical simulations
- Imprecise computation is being applied to HPC

Not all SDCs are critical for HPC applications.

Goal: quantify and qualify SDCs in NVIDIA and Intel architectures.
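The following toy example illustrates why one fault in a shared value shows up as several corrupted output elements. It is only a sketch: it corrupts one input element of a small matrix product, which every element of one output row reads, and then lists the outputs that changed. The matrices and the injected offset are made up for illustration.

```python
def matmul(a, b):
    """Naive matrix product of two lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def corrupted_positions(observed, golden):
    """(row, column) of every output element that differs from the golden copy."""
    return [(i, j)
            for i, golden_row in enumerate(golden)
            for j, expected in enumerate(golden_row)
            if observed[i][j] != expected]


def inject_and_compare(n=4, row=2, col=1, delta=8.0):
    """Corrupt one input element read by several outputs, then diff the outputs."""
    a = [[float(i + j + 1) for j in range(n)] for i in range(n)]
    b = [[float(i * j + 1) for j in range(n)] for i in range(n)]
    golden = matmul(a, b)
    faulty_a = [r[:] for r in a]
    faulty_a[row][col] += delta  # hypothetical upset in a value shared by a whole output row
    return corrupted_positions(matmul(faulty_a, b), golden)


if __name__ == "__main__":
    print(inject_and_compare())  # all positions listed belong to output row 2
```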



Radiation Test Facilities

Irradiation of Chips Electronics

GPU Radiation Test Setup

GPU power control circuitry is out of beam.

[Figure: test setup with boards labeled NVIDIA K40 and Intel Xeon-Phi, and desktop PCs]


At LANSCE: 1.8 x 10^6 n/(cm^2 h)
At NYC: 13 n/(cm^2 h)

We test each architecture for 800h, simulating 9.2 x 10^8 h of natural radiation (~91,000 years).

[Figure: neutron spectrum]

All the collected SDCs are publicly available: https://github.com/UFRGS-CAROL/HPCA2017-log-data
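A hedged sketch of how beam results are usually scaled to the natural environment: the observed error count divided by the received fluence gives a cross-section, and multiplying by the natural flux gives a failure rate, expressed in FIT (failures per 10^9 hours). The two flux constants come from the slide; the error count and beam time in the usage example are hypothetical. The equivalent natural exposure quoted on the slide depends on the fluence actually delivered during the campaign, so it is not recomputed here.

```python
LANSCE_FLUX = 1.8e6  # n/(cm^2 h), beam flux from the slide
NYC_FLUX = 13.0      # n/(cm^2 h), natural sea-level flux from the slide


def acceleration_factor(beam_flux: float = LANSCE_FLUX,
                        natural_flux: float = NYC_FLUX) -> float:
    """How many hours of natural exposure one beam hour stands for."""
    return beam_flux / natural_flux


def fit_rate(errors: int, beam_hours: float,
             beam_flux: float = LANSCE_FLUX,
             natural_flux: float = NYC_FLUX) -> float:
    """Expected failures per 10^9 hours of operation at the natural flux."""
    fluence = beam_flux * beam_hours   # n/cm^2 received under the beam
    cross_section = errors / fluence   # cm^2, error probability per unit fluence
    return cross_section * natural_flux * 1e9


if __name__ == "__main__":
    print(f"acceleration factor: {acceleration_factor():.3e}")
    # Hypothetical campaign: 100 errors observed in 10 beam hours.
    print(f"FIT: {fit_rate(errors=100, beam_hours=10.0):.1f}")
```

The slides report FIT in arbitrary units, so only ratios between configurations are meaningful there.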


Selected Algorithms

We select a set of benchmarks that:

- stimulate different resources
- are representative of HPC applications
- minimize error masking (high AVF)

Benchmarks:

- DGEMM: matrix multiplication
- lavaMD: particle interactions
- Hotspot: heat simulation
- CLAMR: DOE's workload


Xeon Phi vs K40 FIT rate

[Figure: relative FIT (a.u., log scale from 1 to 1000) for the Xeon Phi and the K40 on lavaMD (input sizes 15, 19, 23), DGEMM (2^10, 2^11, 2^12), Hotspot, and CLAMR; one entry is marked N/A]

Xeon Phi error rate seems lower than Kepler, but:

- Xeon Phi is built in 3D Tri-gate, Kepler in planar CMOS
- Xeon Phi and K40 have different throughput


Parallelism Management Reliability

[Figure: two panels (K40, Xeon Phi) showing relative FIT (a.u.) for lavaMD (input sizes 15, 19, 23) and DGEMM (2^10, 2^11, 2^12)]

What about parallel thread management?

Increasing the input size (and the number of threads):

- Xeon-Phi error rate remains constant (<20% variation)
- K40 SDC error rate increases with input size


Parallelism Management Reliability

K40: FIT increases with input size: the HW scheduler is prone to corruption! The data of 2048 active threads is maintained in the register file.

Xeon-Phi: constant FIT rate: the embedded OS is OK! Only 4 threads/core are maintained. Other threads' data is in the main memory (not exposed).


Parallelism Management Reliability

[Figure: DGEMM GFlops (0 to 1.2E+03) for the Xeon Phi and the K40, for matrix sizes 2^9 x 2^9, 2^10 x 2^10, 2^11 x 2^11, 2^12 x 2^12, and 2^13 x 2^13. Xeon-Phi GFlops are almost constant; K40 GFlops rapidly increase.]

K40 throughput increases with input size. The reliability vs performance trade-off should be considered (in the paper: Mean Workload Between Failures).
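A rough sketch of the trade-off, assuming that Mean Workload Between Failures can be approximated as throughput multiplied by the mean time between failures, with MTBF = 10^9 / FIT hours. The exact definition is the one given in the paper; this simplified form and all the numbers in the usage example are assumptions made for illustration.

```python
def mean_workload_between_failures(throughput_per_hour: float, fit: float) -> float:
    """Amount of work completed, on average, between two failures.

    Assumed form: throughput x MTBF, with MTBF = 1e9 / FIT hours.
    """
    if fit <= 0:
        raise ValueError("fit must be positive")
    mtbf_hours = 1e9 / fit
    return throughput_per_hour * mtbf_hours


if __name__ == "__main__":
    # Hypothetical devices: B fails 5x more often than A but is 10x faster.
    device_a = mean_workload_between_failures(throughput_per_hour=100.0, fit=10.0)
    device_b = mean_workload_between_failures(throughput_per_hour=1000.0, fit=50.0)
    print(f"A: {device_a:.2e} work units between failures")
    print(f"B: {device_b:.2e} work units between failures")
```

With these made-up numbers the device with the higher FIT still completes more work between failures, which is why the error rate alone is not enough to compare the two architectures.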


Quantify and Qualify SDCs

- Number of incorrect elements
- Relative error: how different the corrupted value is from the expected value
- Spatial locality: line, square, or random pattern of corrupted elements (in the paper)
- Potentially masked errors: a relative error < 2% is tolerable

[Figure: examples of line, square, and random error patterns]
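The four metrics can be sketched for a 2-D output as follows. This is an illustration, not the analysis code used for the experiments: the pattern rules in `spatial_locality` are assumed definitions, and an execution is treated as potentially masked only when every corrupted element stays below the 2% tolerance.

```python
def spatial_locality(positions):
    """Rough pattern label for the corrupted positions (assumed definitions)."""
    if not positions:
        return "none"
    if len(positions) == 1:
        return "single"
    rows = {i for i, _ in positions}
    cols = {j for _, j in positions}
    if len(rows) == 1 or len(cols) == 1:
        return "line"
    filled = len(set(positions)) == len(rows) * len(cols)
    contiguous = (max(rows) - min(rows) + 1 == len(rows)
                  and max(cols) - min(cols) + 1 == len(cols))
    return "square" if filled and contiguous else "random"


def sdc_metrics(observed, golden, tolerance=0.02):
    """Summarize one corrupted 2-D output against the golden output."""
    wrong = []
    for i, golden_row in enumerate(golden):
        for j, expected in enumerate(golden_row):
            value = observed[i][j]
            if value != expected:
                rel = abs(value - expected) / abs(expected) if expected != 0 else float("inf")
                wrong.append((i, j, rel))
    return {
        "incorrect_elements": len(wrong),
        "max_relative_error": max((r for _, _, r in wrong), default=0.0),
        "potentially_masked": bool(wrong) and all(r < tolerance for _, _, r in wrong),
        "locality": spatial_locality([(i, j) for i, j, _ in wrong]),
    }


if __name__ == "__main__":
    golden = [[1.0] * 4 for _ in range(4)]
    observed = [row[:] for row in golden]
    observed[1] = [1.01] * 4  # a whole row off by about 1%
    print(sdc_metrics(observed, golden))
```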


Number of Incorrect Elements vs Relative Error

[Figure: scatter plots for DGEMM and lavaMD on the K40 and the Xeon Phi. One direction shows a greater difference from the expected value, the other a higher number of corrupted elements.]

BAD: a high number of corrupted elements that are very different from the expected output.

DGEMM: the K40 has few corrupted elements, with values similar to the expected ones. The Xeon Phi has a lot of corrupted elements, which are very different from the expected values.

lavaMD: both the K40 and the Xeon Phi have few corrupted elements. The K40 corruptions are very different from the expected values.

Purely arithmetic operations are more reliable (and faster) on the K40 (GPUs have shorter and faster pipelines).

Xeon Phi is more reliable for Finite Difference Methods (lavaMD), which are based on transcendental functions (exp).


Potentially Masked Errors

[Figure: relative FIT (a.u., log scale from 1 to 1000) for the K40 and the Xeon Phi on lavaMD (input sizes 15, 19, 23), DGEMM (2^10, 2^11, 2^12), Hotspot, and CLAMR (one entry N/A), with a separate series for errors < 2%]

- lavaMD: at most 5% of errors are potentially masked. Exponentiation exacerbates the error magnitude.
- DGEMM: ~64% of K40 errors are potentially masked, 0% for the Xeon Phi! The K40's short and fast pipelines are reliable for arithmetic operations.


Hotspot

[Figure: relative FIT (a.u., log scale from 1 to 100) for the K40 and the Xeon Phi on Hotspot, with a separate series for errors < 2%]

Hotspot (stencil-like): most errors are potentially masked, 97% for the K40 and 81% for the Xeon Phi.

Temperature is calculated considering nearby cells. The error dissipates (and spreads) as equilibrium is reached.

Stencil-like code: a lot of elements are corrupted, but the error is small.
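The dissipate-and-spread behavior can be reproduced with a toy stencil. The sketch below is not the Hotspot benchmark: it is a 1-D three-point averaging stencil with fixed boundary cells, a uniform initial temperature, and one corrupted cell, all chosen for illustration.

```python
def step(grid):
    """One averaging step of a 1-D three-point stencil with fixed boundary cells."""
    new = grid[:]
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new


def propagate_fault(size=41, steps=20, delta=30.0):
    """Track (corrupted cells, max relative error) after each step.

    The golden run starts from a uniform grid; the faulty run starts from the
    same grid with one cell offset by `delta` (a hypothetical upset).
    """
    golden = [300.0] * size
    faulty = golden[:]
    faulty[size // 2] += delta
    history = []
    for _ in range(steps):
        golden, faulty = step(golden), step(faulty)
        diffs = [abs(f - g) / g for f, g in zip(faulty, golden) if f != g]
        history.append((len(diffs), max(diffs)))
    return history


if __name__ == "__main__":
    for n, (cells, worst) in enumerate(propagate_fault(), start=1):
        if n in (1, 5, 10, 20):
            print(f"step {n:2d}: {cells:2d} corrupted cells, max relative error {worst:.2%}")
```

In this toy setup the number of corrupted cells grows at every step while the largest relative error shrinks, which matches the pattern described for stencil-like code.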


What’s the Plan?

Exascale = 55x Titan. Can we afford a 55x error rate? Probably not.
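A back-of-the-envelope sketch of what a 55x error rate would mean, assuming the Titan figure of ~44h between detected uncorrectable errors and a rate that grows linearly with system size. The helper name is illustrative.

```python
def scaled_mtbf(base_mtbf_hours: float, scale: float) -> float:
    """MTBF of a system whose error rate is `scale` times the base rate."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return base_mtbf_hours / scale


if __name__ == "__main__":
    hours = scaled_mtbf(44.0, 55.0)
    print(f"MTBF at 55x Titan: {hours:.2f} h ({hours * 60:.0f} min)")
```

Under this linear assumption the interval between errors drops below one hour.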



- We can show how SDCs appear at the output, to ease detection
- Understand SDC criticality. Not all errors significantly affect the output: there are “acceptable” SDCs
- Fault injection to better understand error propagation
  - SASSIFI: NVIDIA architectural-level fault injector
  - CAROL-FI: UFRGS fault injector for Xeon Phi and x86
- Propose selective-hardening solutions (duplicate only what matters, what REALLY matters)
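As an illustration of the selective-hardening idea, the sketch below duplicates only one critical computation and compares the two results, turning a would-be silent corruption into a detected error. It is a generic duplication-with-comparison example, not a solution proposed in the slides; `run_duplicated`, `make_flaky`, and the injected offset are hypothetical.

```python
class MismatchDetected(RuntimeError):
    """Raised when two redundant executions disagree."""


def run_duplicated(compute):
    """Run `compute` twice and return the result only if both runs agree."""
    first = compute()
    second = compute()
    if first != second:
        raise MismatchDetected("redundant executions disagree")
    return first


def critical_sum(values):
    """Stand-in for the part of an application that really matters."""
    return sum(values)


def make_flaky(values, corrupt_on_call=2, delta=1.0):
    """Build a hypothetical kernel whose N-th invocation returns a corrupted sum."""
    calls = {"count": 0}

    def kernel():
        calls["count"] += 1
        result = critical_sum(values)
        return result + delta if calls["count"] == corrupt_on_call else result

    return kernel


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(run_duplicated(lambda: critical_sum(data)))
    try:
        run_duplicated(make_flaky(data))
    except MismatchDetected as err:
        print("detected:", err)
```

Duplication doubles the cost of whatever it wraps, so it only pays off when applied to the portions of the code whose corruption is critical for the output.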

Sponsors

Research has received funding from the EU H2020 Programme and from MCTI/RNP-Brazil under the HPC4E Project, grant agreement 689772.


Acknowledgments

Caio Lunardi, Caroline Aguiar, Daniel Oliveira, Fernando Santos, Vinicius Fratin, Paolo Rech, Philippe Navaux, Luigi Carro

Chris Frost

Nathan DeBardeleben, Sean Blanchard, Heather Quinn, Thomas Fairbanks, Steve Wender

Timothy Tsai, Siva Hari, Steve Keckler

David Kaeli, NUCAR group

Matteo Sonza Reorda, Luca Sterpone

Laercio Pilla

Israel Koren, Sandip Kundu
