Modeling Soft-Error Propagation in Programs
Guanpeng (Justin) Li, Karthik Pattabiraman, Siva Hari, Michael Sullivan, Timothy Tsai
Motivation: Soft Errors
[Figure: a stored value shown as = 0001 and as = 0101 (one bit flipped) [1]]
Soft errors are becoming more common in processors
[1] http://aviral.lab.asu.edu/soft-error-resilience/
Silent Data Corruption (SDC)
[Diagram: Normal Execution → Fault → Error Propagation → one of three outcomes]
• SDC: Incorrect Output
• Crash: Exceptions, No Output
• Benign: Correct Output
Example: Amazon S3 Incident
Software Solutions
[Diagram: a Soft Error rising through the system stack: Device/Circuit Level, Architectural Level, Operating System Level, Application Level; axis labels: Impactful Errors, Protection Overhead, Increasing]
Software protection techniques are more flexible and cost-effective!
Selective Instruction Duplication
[Chart: “The Golden Curve”: SDC Coverage vs. Protection Overhead. Application Specific!]
*Measured in Libquantum, SPEC
[Diagram: Instruction Sequence → Instruction Duplication]
• Each instruction: SDC Rate = X%, Overhead = Y%
• Selected Instructions for Given Target SDC Coverage
• A Knapsack Problem
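The selection step can be sketched as a 0/1 knapsack. This is only an illustration, not the authors' implementation: the instruction names, SDC contributions, and overhead units below are hypothetical, and the sketch maximizes covered SDC under an overhead budget (the dual of picking instructions for a target coverage).

```python
from typing import List, Tuple

def select_instructions(
    candidates: List[Tuple[str, float, int]],  # (name, SDC contribution %, overhead units)
    budget: int,
) -> Tuple[float, List[str]]:
    """0/1 knapsack: maximize covered SDC contribution within an overhead budget."""
    # best[b] = (covered SDC, chosen names) using at most b overhead units
    best: List[Tuple[float, List[str]]] = [(0.0, [])] * (budget + 1)
    for name, sdc, cost in candidates:
        # Iterate budgets downward so each instruction is duplicated at most once.
        for b in range(budget, cost - 1, -1):
            covered, chosen = best[b - cost]
            if covered + sdc > best[b][0]:
                best[b] = (covered + sdc, chosen + [name])
    return best[budget]

# Hypothetical per-instruction profile (not measured data).
profile = [("load_a", 12.0, 4), ("add_b", 5.0, 1), ("mul_c", 9.0, 3), ("cmp_d", 7.0, 2)]
covered, chosen = select_instructions(profile, budget=5)
print(covered, chosen)  # 17.0 ['load_a', 'add_b']
```

The per-instruction SDC rates are the expensive input here, which is why measuring them quickly matters.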
Developing Fault-Tolerant Applications
[Workflow diagram: Development of Application, Evaluate Program SDC Rate, Acceptable, New Release, Measure Instruction SDC Rates, Selective Protection]
1. Thousands of fault injections need to be done
2. Repeat every time code is modified
Estimating SDC Rate
[Chart: Accuracy vs. Speed, with prior techniques and Our Goal marked]
• AVF/PVF/ePVF [MICRO’03, HPCA’10, DSN’16]
• SymPLFIED/Relyzer/GangES [DSN’08, ASPLOS’12, ISCA’14]
No existing technique models error propagation in a way that is both fast and accurate!
Our goal: fast prediction of SDC without fault injection!
Challenges
• Tracking SDC propagation is hard
  • Over billions of executed instructions
  • Every instruction may propagate errors with different probabilities
• Dynamic nature of program execution
  • Control-flow divergence
[Diagram: a branch (BR) with T and F directions; a corrupted branch ends up corrupting subsequent states]
Trident: Key Insight
• Error propagation can be decomposed into modules, which can be abstracted into probabilistic events
  • Decomposition
  • Abstraction
Trident: Workflow
[Diagram: inputs (Source Code, Program Input, Output Insn., Insn. for Prediction) → Profiling → Prediction → outputs (Insn. SDC Rates, Overall SDC Rate)]
Trident: Our Approach
• Three-level modeling
  • Register-communication (Reg., fS)
  • Control-flow (Contl., fC)
  • Memory dependency (Mem., fM)
Example control-flow graph (edge labels T1/F1 and T2/F2 mark the true/false directions of two branches):
BB4:
  $2 = LOAD 0x04
  $3 = ADD $2, 4
  CMP $4, $3, 4
  BR $4, BB5, BB10
BB5:
  $5 = MUL $6, 16
  … …
BB10:
  … …
BB11:
  STORE …, 0x08
BB12:
  … …
BB102:
  ... = LOAD 0x08
Trident: Register Communication (fS)
[Same example control-flow graph as on the previous slide, with propagation probabilities annotated on the instructions of BB4]
Propagation probability within BB4?
  $2 = LOAD 0x04      <100%>
  $3 = ADD $2, 4      <100%>
  CMP $4, $3, 4       <25%>
  BR $4, BB5, BB10    <100%>
fS = 100% * 100% * 25% * 100% = 25%
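The fS product for BB4 can be reproduced in a few lines. This is a minimal sketch: the four probabilities come from the slide, but pairing them with the instructions in top-to-bottom order is an assumption, and `sequence_propagation` is a hypothetical helper name.

```python
from math import prod

# Per-instruction propagation probabilities for BB4 (slide values; the
# instruction-to-probability pairing is assumed to be top-to-bottom).
bb4 = [
    ("$2 = LOAD 0x04", 1.00),
    ("$3 = ADD $2, 4", 1.00),
    ("CMP $4, $3, 4", 0.25),
    ("BR $4, BB5, BB10", 1.00),
]

def sequence_propagation(instructions):
    """Probability that an error entering the sequence survives every instruction."""
    return prod(p for _, p in instructions)

f_s = sequence_propagation(bb4)
print(f"fS = {f_s:.0%}")  # fS = 25%
```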
Trident: Control-Flow (fC)
[Same example control-flow graph as before, with the BB4 probabilities <100%>, <100%>, <25%>, <100%> and branch probabilities T1 = 80%, F1 = 20%, T2 = 30%, F2 = 70%; the STORE in BB11 is marked Corrupted]
Corruption probability of STORE?
• STORE exec. prob.: F1 * T2
• BR dom. prob.: F1
*For non-loop-terminating branches
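The two quantities named on this slide can be worked through numerically. The branch probabilities are the slide's values. How the two quantities combine is not recoverable from the extracted text, so the ratio below (execution probability divided by the probability of the dominating branch direction) is an assumption, not a statement of the authors' formula.

```python
import math

# Branch probabilities from the slide (assumed mapping: T1/F1 belong to the
# branch in BB4, T2/F2 to the second branch on the way to the STORE).
T1, F1 = 0.80, 0.20
T2, F2 = 0.30, 0.70

# Execution probability of the STORE: reached via F1, then T2.
store_exec_prob = F1 * T2

# Probability of the branch direction that dominates the STORE.
br_dom_prob = F1

# ASSUMPTION: the slide's "=" relates the two as a ratio.
store_corruption_prob = store_exec_prob / br_dom_prob

print(f"STORE exec. prob. = {store_exec_prob:.0%}")
print(f"STORE corruption prob. (assumed ratio) = {store_corruption_prob:.0%}")
```

Under that assumption the F1 factor cancels, leaving the conditional probability of reaching the STORE once the F1 direction is taken.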
Trident: Memory-Dependency (fM)
[Same example control-flow graph as before: the STORE to 0x08 in BB11 and the LOAD from 0x08 in BB102 form a dependent LOAD & STORE pair]
P(I_n) = fS(I_n) * fC(I_n2) * fS(I_n3) * fC(I_n4) … …
*n corresponds to the index of dynamic instructions
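One way to picture "dependent LOAD & STORE" is to pair each dynamic LOAD with the most recent STORE to the same address. The sketch below does this over a hypothetical dynamic trace; the trace and the `dependent_pairs` helper are illustrative assumptions and do not describe how Trident profiles memory.

```python
# Hypothetical dynamic trace: (dynamic index n, opcode, memory address).
trace = [
    (1, "STORE", 0x08),
    (2, "LOAD", 0x04),
    (3, "STORE", 0x04),
    (4, "LOAD", 0x08),
    (5, "LOAD", 0x04),
]

def dependent_pairs(trace):
    """Pair each LOAD with the most recent earlier STORE to the same address."""
    last_store = {}  # address -> dynamic index of the latest STORE
    pairs = []
    for n, op, addr in trace:
        if op == "STORE":
            last_store[addr] = n
        elif op == "LOAD" and addr in last_store:
            pairs.append((last_store[addr], n))
    return pairs

print(dependent_pairs(trace))  # [(1, 4), (3, 5)]
```

An error that reaches a STORE can then continue from each paired LOAD, which is where the chained product of sub-model probabilities picks up again.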
Experimental Setup
[Table: Benchmark, Application Domains]
• Fault Model
  • Single bit-flip injections (accurate [DSN’17])
  • Random insn., one per program execution
• Benchmarks
  • 11 open-source benchmarks from various domains
• Comparison with fault injection
  • Accuracy
  • Speed (wall clock time)
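A single bit-flip fault and the three outcome classes can be illustrated on a toy function. This is a deliberately simplified sketch: `program` is hypothetical, and the loop flips every bit of one input exhaustively, whereas the setup above injects into one random instruction per program execution.

```python
def flip_bit(value: int, bit: int) -> int:
    """Single bit-flip fault model: invert one bit of an integer value."""
    return value ^ (1 << bit)

assert flip_bit(0b0001, 2) == 0b0101  # the 0001 -> 0101 flip from the motivation slide

def program(x: int) -> int:
    """Hypothetical toy program: index a small table using the low bits of x."""
    table = [10, 20, 30, 40]
    if x < 0 or x >= 16:
        raise IndexError("out of range")  # stands in for a crash
    return table[x & 0b11]

def classify(golden_input: int, bit: int) -> str:
    """Inject one bit flip into the input and compare with the fault-free run."""
    golden = program(golden_input)
    try:
        faulty = program(flip_bit(golden_input, bit))
    except Exception:
        return "Crash"
    return "Benign" if faulty == golden else "SDC"

outcomes = [classify(0b0001, bit) for bit in range(8)]
sdc_rate = outcomes.count("SDC") / len(outcomes)
print(outcomes, sdc_rate)
```

Here two of the eight flips change the output silently, two are masked, and four crash, giving an SDC rate of 25% for this toy case.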
Experimental Methodology
Reminder: Goal is to predict SDC rate as per fault injection
• Baseline: fault injection results obtained with LLFI [1]
  • The closer the predicted SDC rate is to fault injection, the better the prediction
• Created two simpler models
  • Accuracy of each sub-model
  • As proxy to prior work
[Diagram: Two Simpler Models for Comparison: fS and fS+fC, shown against the Reg., Contl., and Mem. sub-models]
[1] LLVM Fault Injector [DSN’14]
Evaluation: Accuracy
[Chart: Program SDC Rate; 3,000 sampled instructions; error bars: +/-0.07% ~ +/-1.76% at 95% confidence interval]
• Mean Absolute Error
  • Trident: 4.75%
  • Simpler Models: 15.13% and 19.13%
• t-Test on Individual Instructions
  • Trident: 8 out of 11 are statistically indistinguishable
  • Simpler Models (fS and fS+fC): only 2 and 4
3,000 randomly sampled instructions for fault injection and the models
Trident is close to fault injection results, and significantly better than the simpler models!
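For a sense of scale on the error bars, the sketch below computes the half-width of a normal-approximation confidence interval for a measured proportion. The 3,000-sample count and 95% level are from the slide; the SDC rates in the loop are hypothetical, and whether the authors used this particular interval formula is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sdc_rate: float, samples: int, confidence: float = 0.95) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * sqrt(sdc_rate * (1.0 - sdc_rate) / samples)

# Hypothetical SDC rates; 3,000 samples matches the slide.
for rate in (0.01, 0.10, 0.50):
    print(f"SDC rate {rate:.0%}: +/-{margin_of_error(rate, 3000):.2%}")
```

The margin peaks at a 50% rate (just under 1.8% for 3,000 samples), which is in line with the upper end of the reported error bars.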
Evaluation: Speed
[Chart: Wall-Clock Time of Estimating Program SDC Rate]
• Program’s Overall SDC Rate:
  • 6.7x faster at 3,000 samples
• Per-Instruction SDC Rate:
  • On average, 380x faster at 100 samples per instruction
  • Benchmarks: FI takes nearly 100 hours whereas Trident takes < 20 mins
Trident is faster than fault injection by 2 orders of magnitude!
Use Case: Selective Instruction Duplication
Recap: Selective Instruction Duplication, “The Golden Curve”
[Chart: SDC Coverage vs. Protection Overhead; curves: By Fault Injections, By Trident, By fS+fC, By fS]
*Measured in Libquantum, SPEC
Extension
• Understand how error propagation is affected by multiple inputs
• Extension for bounding SDC rate with multiple inputs
“Modeling Input-Dependent Error Propagation in Programs”
Session 6: Modeling and Verification, Wednesday, June 27th
Summary
• Fault injections are too slow to integrate into the software development cycle
• Trident is both accurate and fast in predicting SDC rates
• Can guide selective protection of instructions in programs: comparable to fault injection in accuracy for a fraction of the cost
• Open Source: https://github.com/DependableSystemsLab/Trident
Guanpeng (Justin) Li, University of British Columbia (UBC)