Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators
Daniel Oliveira, Laercio Pilla, Mauricio Hanzich, Vinicius Fratin, Fernando Santos, Caio Lunardi, José Maria Cela, Philippe Navaux, Luigi Carro, Paolo Rech
WMC 2017
Daniel Oliveira – WMC 2017
HPC reliability importance
Available Accelerators
Modern parallel accelerators offer:
- Low cost
- Flexible platform
- High efficiency (low per-thread consumption)
- High computational power and frequency
- Huge amount of resources
- Reliability? (Error rate)
Examples: Kepler K40, Xeon-Phi
Titan
Titan (Oak Ridge National Lab): 18,688 GPUs
High probability of having a GPU corrupted
Titan Detected Uncorrectable Errors MTBF is ~44 h*
*(field and experimental data from HPCA’15)
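As a back-of-the-envelope reading of these numbers, the sketch below converts the system-level MTBF into a per-GPU figure. It assumes independent devices with equal, constant failure rates (so rates add across devices), which the slide does not state; the function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365


def per_device_mtbf(system_mtbf_h: float, n_devices: int) -> float:
    """Per-device MTBF, ASSUMING independent devices with equal constant failure rates."""
    return system_mtbf_h * n_devices


def system_mtbf(device_mtbf_h: float, n_devices: int) -> float:
    """System MTBF under the same assumption: failure rates add up across devices."""
    return device_mtbf_h / n_devices


titan_gpus = 18_688    # from the slide
titan_mtbf_h = 44.0    # from the slide (approximate)

gpu_mtbf_h = per_device_mtbf(titan_mtbf_h, titan_gpus)
gpu_mtbf_years = gpu_mtbf_h / HOURS_PER_YEAR

print(f"Per-GPU MTBF: {gpu_mtbf_h:,.0f} h (about {gpu_mtbf_years:.0f} years)")
```

Under this assumption a single GPU looks very reliable on its own; the short system-level MTBF comes from the sheer number of devices.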
Outline
Radiation Effects Essentials
Error Criticality in HPC
Experimental Procedure
K40 vs Xeon Phi FIT rates
Qualify SDCs for HPC applications
What’s the Plan?
Terrestrial Radiation Environment
The interaction of galactic cosmic rays with the atmosphere generates neutrons.
Neutron flux: 13 n/(cm²·h) at sea level
[Figure: a neutron strike flips stored bits (0 1 1 0) in flip-flops (FF) or disturbs logic]
Soft errors: the device is not permanently damaged, but the particle may generate bit flips or logic errors.
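To make the effect of a bit flip concrete, the following sketch inverts one bit of a double-precision value and measures the resulting relative error. It is only an illustration in software: the helper names are hypothetical, and real radiation-induced faults can hit any storage or logic element, not just a floating-point register.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding inverted.

    Bit 0 is the least significant mantissa bit, bits 52-62 hold the exponent,
    and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 63]")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def relative_error(observed: float, expected: float) -> float:
    """Relative difference between a corrupted value and the expected one."""
    return abs(observed - expected) / abs(expected)


x = 1.0
for bit in (0, 51, 52, 63):
    corrupted = flip_bit(x, bit)
    print(f"bit {bit:2d}: {corrupted!r:>24}  relative error = {relative_error(corrupted, x):.3e}")
```

The same single-bit fault ranges from negligible (low mantissa bits) to severe (exponent or sign bits), which is why the position of the flip matters as much as its occurrence.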
Silent Data Corruption vs Crash
Silent Data Corruption: soft errors in
- data cache
- register files
- logic gates (ALU)
- scheduler
DUE (Crash): soft errors in
- instruction cache
- scheduler / dispatcher
- PCI-e bus controller
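The distinction above can be sketched as a classifier over the outcome of one run compared with a fault-free ("golden") output. This is a simplified assumption about how outcomes are told apart, not the authors' test harness: the names are hypothetical, and a real setup would also need a watchdog to catch hangs, which this sketch does not do.

```python
from enum import Enum


class Outcome(Enum):
    MASKED = "masked"  # the run finished and the output matches the golden copy
    SDC = "sdc"        # the run finished, but the output is silently wrong
    DUE = "due"        # the run crashed (modelled here as a raised exception)


def classify_run(run, golden):
    """Execute `run()` and classify the result against the expected output."""
    try:
        output = run()
    except Exception:
        # Hangs are not detected here; a real harness would add a timeout.
        return Outcome.DUE
    return Outcome.MASKED if output == golden else Outcome.SDC


def crashing_run():
    raise RuntimeError("simulated crash")


golden = [1.0, 2.0, 3.0]
print(classify_run(lambda: [1.0, 2.0, 3.0], golden))
print(classify_run(lambda: [1.0, 2.5, 3.0], golden))
print(classify_run(crashing_run, golden))
```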
Output Correctness in HPC
A single fault in shared resources or in the scheduler affects several parallel threads, producing multiple corrupted elements.
Not all SDCs are critical for HPC applications:
- The error can fall within the intrinsic variance of floating-point arithmetic
- Values in a given range are accepted as correct in physical simulations
- Imprecise computation is being applied to HPC
Goal: quantify and qualify SDCs in NVIDIA and Intel architectures.
Radiation Test Facilities
Irradiation of Chips Electronics
GPU Radiation Test Setup
GPU power control circuitry is out of beam
[Figure: test setup; labels: NVIDIA K40, Intel Xeon-Phi, desktop PCs]
Neutron Spectrum
Neutron flux: 1.8 × 10^6 n/(cm²·h) at LANSCE vs. 13 n/(cm²·h) at NYC
We test each architecture for 800 h, simulating 9.2 × 10^8 h of natural radiation (~91,000 years)
All the collected SDCs are publicly available: https://github.com/UFRGS-CAROL/HPCA2017-log-data
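The two fluxes give the beam's acceleration factor, and the same ratio lets an error count observed under beam be scaled to a FIT rate (failures per 10^9 hours) in the natural environment. The sketch below uses the usual cross-section formulation, which is assumed here rather than taken from the slides; the error count and beam time in the example are hypothetical placeholders, not measured data.

```python
LANSCE_FLUX = 1.8e6  # n/(cm^2 h), from the slide
NYC_FLUX = 13.0      # n/(cm^2 h), from the slide


def acceleration_factor(beam_flux: float = LANSCE_FLUX, natural_flux: float = NYC_FLUX) -> float:
    """Hours of natural exposure represented by one hour under beam."""
    return beam_flux / natural_flux


def fit_rate(n_errors: int, beam_hours: float,
             beam_flux: float = LANSCE_FLUX, natural_flux: float = NYC_FLUX) -> float:
    """Scale errors observed under beam to failures per 10^9 h of natural exposure."""
    fluence = beam_flux * beam_hours    # neutrons per cm^2 received by the device
    cross_section = n_errors / fluence  # cm^2, probability-like error area
    return cross_section * natural_flux * 1e9


print(f"Acceleration factor: {acceleration_factor():.3e}")
# Hypothetical example: 10 errors observed in 1 hour of beam time.
print(f"FIT for 10 errors in 1 beam hour: {fit_rate(10, 1.0):.1f}")
```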
Selected Algorithms
We select a set of benchmarks that:
- stimulate different resources
- are representative of HPC applications
- minimize error masking (high AVF)
Benchmarks:
- DGEMM: matrix multiplication
- lavaMD: particle interactions
- Hotspot: heat simulation
- CLAMR: DOE's workload
Xeon Phi vs K40 FIT rate
[Figure: relative FIT (a.u., log scale from 1 to 1000) for Xeon Phi and K40; lavaMD with input sizes 15, 19, 23; DGEMM with input sizes 2^10, 2^11, 2^12; Hotspot; CLAMR; one entry is marked N/A]
Xeon Phi error rate seems lower than Kepler, but:
- Xeon Phi is built in 3D Tri-Gate, Kepler in planar CMOS
- Xeon Phi and K40 have different throughput
Parallelism Management Reliability
What about parallel threads management?
[Figure: two panels, K40 and Xeon Phi, showing relative FIT (a.u.) for lavaMD (input sizes 15, 19, 23) and DGEMM (input sizes 2^10, 2^11, 2^12)]
Increasing the input size (and #threads):
- Xeon-Phi error rate remains constant (<20% variation)
- K40 SDC error rate increases with input size
Parallelism Management Reliability
K40: FIT increases with input size, so the HW scheduler is prone to be corrupted!
- data of 2048 active threads is maintained in the register file
Xeon-Phi: constant FIT rate, so the embedded OS is OK!
- only 4 threads/core are maintained; other threads' data is kept in the main memory (not exposed)
Parallelism Management Reliability
[Figure: DGEMM GFlops (0 to 1.2 × 10^3) for Xeon Phi and K40 at matrix sizes 2^9x2^9, 2^10x2^10, 2^11x2^11, 2^12x2^12, 2^13x2^13. Xeon-Phi GFlops stay almost constant; K40 GFlops rapidly increase.]
K40 throughput increases with input size.
The reliability vs. performance trade-off should be considered (in the paper: Mean Workload Between Failures).
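One way to express this trade-off is to multiply throughput by the mean time between failures, giving the amount of work completed between failures. The sketch below assumes that definition (throughput × MTBF); the paper's exact formulation may differ, and the two devices and their numbers are hypothetical, chosen only to show how a higher FIT rate can still yield more useful work.

```python
SECONDS_PER_HOUR = 3600


def mtbf_hours(fit: float) -> float:
    """MTBF in hours for a FIT rate expressed in failures per 10^9 hours."""
    return 1e9 / fit


def mwbf_flops(throughput_gflops: float, fit: float) -> float:
    """Floating-point operations completed between failures.

    ASSUMED definition: sustained throughput multiplied by MTBF.
    """
    return throughput_gflops * 1e9 * SECONDS_PER_HOUR * mtbf_hours(fit)


# Hypothetical devices: B fails three times as often but is five times faster.
device_a = {"gflops": 200.0, "fit": 100.0}
device_b = {"gflops": 1000.0, "fit": 300.0}

mwbf_a = mwbf_flops(device_a["gflops"], device_a["fit"])
mwbf_b = mwbf_flops(device_b["gflops"], device_b["fit"])
print(f"Device A: {mwbf_a:.2e} operations between failures")
print(f"Device B: {mwbf_b:.2e} operations between failures")
```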
Quantify and Qualify SDCs
- Number of incorrect elements
- Relative error: how different the corrupted value is from the expected value
- Potentially masked errors: relative error < 2% is tolerable
- Spatial locality (line, square, random): in the paper
[Figure: example patterns of corrupted elements (x) arranged as a line, a square, and randomly]
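The first three metrics can be sketched as a comparison between a corrupted output and the expected one. The interpretation used here is an assumption: an execution counts as potentially masked only when every corrupted element is within the 2% tolerance. Function and key names are hypothetical, and NaN outputs are not handled specially.

```python
def sdc_metrics(observed, golden, tolerance=0.02):
    """Summarize one corrupted output against the expected (golden) output."""
    if len(observed) != len(golden):
        raise ValueError("outputs must have the same length")
    rel_errors = []
    for obs, exp in zip(observed, golden):
        if obs != exp:
            # A corrupted element whose expected value is zero has no finite relative error.
            rel_errors.append(abs(obs - exp) / abs(exp) if exp != 0 else float("inf"))
    return {
        "incorrect_elements": len(rel_errors),
        "max_relative_error": max(rel_errors, default=0.0),
        # ASSUMPTION: masked only if all corrupted elements are below the tolerance.
        "potentially_masked": bool(rel_errors) and all(e < tolerance for e in rel_errors),
    }


golden = [1.0, 2.0, 4.0, 8.0]
small = sdc_metrics([1.0, 2.01, 4.0, 8.0], golden)   # one element off by 0.5%
large = sdc_metrics([1.0, 3.0, 4.0, 16.0], golden)   # two elements off by 50% and 100%
print(small)
print(large)
```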
Number of Incorrect Elements vs Relative Error
[Figure: scatter plots for DGEMM and lavaMD, K40 vs Xeon Phi. One axis shows the number of corrupted elements, the other the difference from the expected value.]
BAD region: a high number of corrupted elements, which are very different from the expected output.
DGEMM:
- K40: few corrupted elements, with values similar to the expected ones
- Xeon Phi: a lot of corrupted elements, which are very different from the expected values
lavaMD:
- Both K40 and Xeon Phi have few corrupted elements
- K40 corruptions are very different from the expected values
Purely arithmetic operations are more reliable (and faster) on the K40 (GPUs have shorter and faster pipelines).
Xeon Phi is more reliable for Finite Difference Methods (lavaMD), which are based on transcendental functions (exp).
Potentially Masked Errors
Potentially masked errors: relative error < 2% is tolerable
[Figure: relative FIT (a.u., log scale from 1 to 1000) for K40 and Xeon Phi, with the share of errors < 2% highlighted; lavaMD with input sizes 15, 19, 23; DGEMM with input sizes 2^10, 2^11, 2^12; Hotspot; CLAMR; one entry is marked N/A]
- lavaMD: at most 5% of errors are potentially masked. Exponentiation exacerbates the error magnitude.
- DGEMM: ~64% of K40 errors are potentially masked, 0% for the Xeon Phi! K40’s short and fast pipelines are reliable for arithmetic operations.
Hotspot
[Figure: relative FIT (a.u., log scale from 1 to 100) for Hotspot on K40 and Xeon Phi, with the share of errors < 2% highlighted]
Stencil-like code: a lot of elements are corrupted, but the error is small.
Most errors are potentially masked: 97% for K40, 81% for Xeon Phi.
Temperature is calculated considering nearby cells. The error dissipates (and spreads) as equilibrium is reached.
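The "dissipates and spreads" behaviour can be reproduced with a toy stencil. The sketch below is a simplified 1D averaging stencil with fixed boundaries, not the Hotspot benchmark itself; the grid size, injected error, and step count are arbitrary illustrative choices.

```python
def stencil_step(grid):
    """One step of a 1D averaging stencil; the two boundary cells stay fixed."""
    new = grid[:]
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new


def simulate(grid, steps):
    """Apply `steps` stencil iterations and return the final grid."""
    for _ in range(steps):
        grid = stencil_step(grid)
    return grid


def compare(faulty, golden, eps=1e-9):
    """Return (number of corrupted cells, largest relative error)."""
    errors = [abs(f - g) / abs(g) for f, g in zip(faulty, golden) if abs(f - g) > eps]
    return len(errors), max(errors, default=0.0)


initial = [300.0 + i for i in range(21)]  # arbitrary temperature profile
corrupted_initial = initial[:]
corrupted_initial[10] += 30.0             # inject one error of roughly 10%

cells_before, err_before = compare(corrupted_initial, initial)
cells_after, err_after = compare(simulate(corrupted_initial, 10), simulate(initial, 10))

print(f"before: {cells_before} corrupted cell(s), max relative error {err_before:.2%}")
print(f"after : {cells_after} corrupted cell(s), max relative error {err_after:.2%}")
```

In this toy case a single large error turns into many small ones, which is the pattern the slide describes: more corrupted elements, each closer to the expected value.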
What’s The Plan?
Exascale = 55x Titan. Can we afford a 55x error rate? Probably not.
- We can show how SDCs appear at the output, to ease detection
- Understand SDC criticality. Not all errors significantly affect the output: there are “acceptable” SDCs
- Fault injection to better understand error propagation
  - SASSIFI: NVIDIA architectural-level fault injector
  - CAROL-FI: UFRGS fault injector for Xeon Phi and x86
- Propose selective-hardening solutions (duplicate only what matters, what REALLY matters)
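As an illustration of the last point, the sketch below applies duplication with comparison to a single function judged critical, leaving the rest of a program unprotected. It is a generic software pattern under a hypothetical fault model, not the authors' hardening solution and not the SASSIFI or CAROL-FI interface; all names are illustrative.

```python
class MismatchDetected(RuntimeError):
    """Raised when two executions of the same computation disagree."""


def duplicate_and_compare(fn, *args):
    """Run `fn` twice and accept the result only if both runs agree."""
    first = fn(*args)
    second = fn(*args)
    if first != second:
        raise MismatchDetected(f"results differ: {first!r} vs {second!r}")
    return first


def dot(a, b):
    """The computation chosen (hypothetically) as critical enough to duplicate."""
    return sum(x * y for x, y in zip(a, b))


def make_faulty(fn, fail_on_call, delta):
    """Hypothetical fault model: corrupt the result of one specific call."""
    calls = {"n": 0}

    def wrapper(*args):
        calls["n"] += 1
        result = fn(*args)
        return result + delta if calls["n"] == fail_on_call else result

    return wrapper


print(duplicate_and_compare(dot, [1, 2, 3], [4, 5, 6]))

try:
    duplicate_and_compare(make_faulty(dot, 2, 1.0), [1, 2, 3], [4, 5, 6])
except MismatchDetected as exc:
    print("detected:", exc)
```

Duplication roughly doubles the cost of whatever it covers, which is the motivation for applying it only to the parts whose corruption would be critical.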
Sponsors
Research has received funding from the EU H2020 Programme and from MCTI/RNP-Brazil under the HPC4E Project, grant agreement 689772.
Acknowledgments
Caio Lunardi, Caroline Aguiar, Daniel Oliveira, Fernando Santos, Vinicius Fratin, Paolo Rech, Philippe Navaux, Luigi Carro
Chris Frost
Nathan DeBardeleben, Sean Blanchard, Heather Quinn, Thomas Fairbanks, Steve Wender
Timothy Tsai, Siva Hari, Steve Keckler
David Kaeli, NUCAR group
Matteo Sonza Reorda, Luca Sterpone
Laercio Pilla
Israel Koren, Sandip Kundu