Research Article
Detecting Silent Data Corruptions in Aerospace-Based Computing Using Program Invariants

Junchi Ma,1,2 Dengyun Yu,3 Yun Wang,1,2 Zhenbo Cai,3 Qingxiang Zhang,3 and Cheng Hu1,2

1 School of Computer Science & Engineering, Southeast University, Nanjing 211189, China
2 Key Laboratory of Computer Network and Information Integration, Ministry of Education, Nanjing 211189, China
3 Beijing Institute of Spacecraft System Engineering, Beijing 100094, China

Correspondence should be addressed to Junchi Ma; bjbzmjc@126.com

Received 20 April 2016; Revised 20 September 2016; Accepted 10 October 2016

Academic Editor: Christopher J. Damaren

Copyright © 2016 Junchi Ma et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Hindawi Publishing Corporation, International Journal of Aerospace Engineering, Volume 2016, Article ID 8213638, 10 pages, http://dx.doi.org/10.1155/2016/8213638

Soft error caused by single event upset has been a severe challenge to aerospace-based computing. Silent data corruption (SDC) is one of the results incurred by soft error. SDC occurs when a program generates erroneous output with no indications. SDC is the most insidious type of result and is very difficult to detect. To address this problem, we design and implement an invariant-based system called Radish. Invariants describe certain properties of a program; for example, the value of a variable equals a constant. Radish first extracts invariants at key program points and converts invariants into assertions. It then hardens the program by inserting the assertions into the source code. When a soft error occurs, assertions will be found to be false at run time and warn the users of the soft error. To increase the coverage of SDC, we further propose an extension of Radish, named Radish_D, which applies a software-based instruction duplication mechanism to protect the uncovered code sections. Experiments using architectural fault injections show that Radish achieves high SDC coverage with very low overhead. Furthermore, Radish_D provides higher SDC coverage than either Radish or pure instruction duplication.

1. Introduction

A single event upset (SEU) is a change of state caused by one single ionizing particle (ions, electrons, photons, etc.) striking a sensitive node in a microelectronic device [1, 2]. The error in device output or operation caused as a result of an SEU is called a soft error. Because this type of error does not reflect a permanent failure, it is termed soft [3]. The first reports of failures attributed to cosmic rays emerged in 1975, when space-borne electronics malfunctioned [4]. In 1993, neutron-induced soft errors were even observed in airborne computers at commercial aircraft flight altitudes [5]. Soft error has emerged as a key challenge in aerospace-based computing [6, 7].

The raw error rate per device (e.g., latch, SRAM cell) in a bulk CMOS process is projected to remain roughly constant or decrease slightly; thus the soft error rate per processor will grow with Moore's law in direct proportion to the number of devices added to a processor in the next generation [8]. Unless we develop and apply more effective soft error mitigation techniques, the trend is inevitable.

The result of a soft error is categorized into four types [9]: benign, crash, hang, and silent data corruption (SDC), shown in Figure 1. Benign means the error is masked and the program gets the right output; crash means the error causes the program to stop execution; hang means that resources are exhausted but the program still cannot finish execution; silent data corruption means that the program generates erroneous output. When a crash or hang occurs, the system is aware that the program is executed abnormally. Compared with the others, SDC is more insidious since it occurs without any indications. Applying the erroneous output incurred by SDC may lead to loss of property and even casualties. Erroneous output is more dangerous than no output, since users cannot be aware of errors until a serious consequence occurs. This paper mainly focuses on eliminating SDC.

Symptom-based fault detection mechanisms provide low-cost solutions [10, 11]. These mechanisms treat anomalous software behavior as symptoms of hardware faults and detect them by placing very low-cost symptom monitors in hardware or software. However, faults incurring SDC escape detection since they do not cause symptoms at all. To address this limitation, software-based instruction duplication is a possible alternative. With this approach, instructions are duplicated and their results are validated within a single thread of execution [12–15]. This solution has the advantage of being purely software-based, requiring no specialized hardware, and can achieve high coverage. However, the overheads in terms of performance and power are quite high, since a large fraction of the program is replicated. Future missions will require much greater computational power than is available in today's processors [4]; thus a low-cost fault detection solution is desired by future aerospace-based computing.

Figure 1: Classification of the result of soft error. (Decision flow: if the execution does not end, the result is a hang; if it ends but not peacefully, a crash; if it ends peacefully with the right output, benign; otherwise, silent data corruption.)
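The four outcome categories of Figure 1 can be sketched as a small decision function. This is only an illustrative Python sketch: the names `Outcome` and `classify`, and the boolean inputs, are assumptions introduced here and are not part of the paper's tooling.

```python
from enum import Enum


class Outcome(Enum):
    BENIGN = "benign"
    CRASH = "crash"
    HANG = "hang"
    SDC = "silent data corruption"


def classify(ended: bool, ended_peacefully: bool, output, golden_output) -> Outcome:
    """Classify one faulty run by following the decision flow of Figure 1."""
    if not ended:
        return Outcome.HANG      # the execution never finishes
    if not ended_peacefully:
        return Outcome.CRASH     # the execution stops abnormally
    if output == golden_output:
        return Outcome.BENIGN    # the fault was masked
    return Outcome.SDC           # wrong output without any indication


# Hypothetical run: the program finished normally but printed 41 instead of 42.
result = classify(ended=True, ended_peacefully=True, output=41, golden_output=42)
```

Only the last case requires comparing against a fault-free ("golden") output, which is why SDC cannot be noticed by the system itself.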

To address the problem of detecting SDC, this paper proposes an assertion-based detection mechanism. An assertion is a statement with a predicate (a boolean-valued function, that is, a true-false expression). If an assertion is found to be false at run time, an assertion failure arises, which typically causes the program to throw an assertion exception. Assertions in this paper are based on program invariants [16], which are properties that are true at a particular program point or points. For example, $x = 2y$ is an invariant about the variables $x$ and $y$, which represents that they satisfy a linear relationship. This invariant is satisfied whenever the program is executed normally but seldom satisfied if a soft error affects the value of $x$ or $y$. Based on this principle, we design and implement the system Radish, which can harden the program against soft errors. Radish can extract invariants from a C program and insert invariant-based assertions back into the source code. Once an assertion is found to be false, it suggests that a soft error is detected. Then the execution is stopped and a warning is given.
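As a rough illustration of this principle, the following Python sketch checks the invariant $x = 2y$ after a value has possibly been corrupted. Radish itself targets C source code; the function names and the way the bit flip is emulated here are assumptions made only for the example.

```python
def check_invariant(x: int, y: int) -> None:
    """Assertion derived from the invariant x = 2y; raises if it is violated."""
    if x != 2 * y:
        raise AssertionError("soft error suspected: x != 2*y")


def run(y: int, flip_bit=None) -> int:
    x = 2 * y
    if flip_bit is not None:
        x ^= 1 << flip_bit      # emulate a single-bit upset in x
    check_invariant(x, y)       # the inserted assertion
    return x


fault_free = run(21)            # the invariant holds, execution continues
try:
    run(21, flip_bit=3)         # x is corrupted, the invariant is violated
    detected = False
except AssertionError:
    detected = True
```

Any single-bit change to `x` breaks the equality, so an equality invariant of this kind catches every such corruption of `x` that is present when the assertion executes.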

Radish merely adds a few lines of code to the original source code, and thus it is easy to implement. Besides, it does not need to modify the underlying hardware and hardly increases the complexity of the system. Furthermore, the overhead of Radish turns out to be very low, since the overhead of a single assertion is low and the number of assertions in a program is small.

To further increase the SDC coverage, we extend Radish by incorporating the mechanism of software-based instruction duplication. The code sections that are not covered by Radish are protected by deploying instruction duplication. Experimental results show that Radish achieves high coverage with low cost, and Radish_D achieves even higher coverage than that of Radish or pure instruction duplication. The techniques of Radish and Radish_D offer new solutions to soft error mitigation.

2. Definitions and Models

This section describes important definitions and models used in this paper.

Definition 1. A program is defined as $\langle F, E, \mathrm{IN}, \mathrm{OUT} \rangle$. $F$ represents the functions in the program. $E$ is the set of edges that denote dependencies between functions, s.t. $E = \{e_{xy} \mid f_x \text{ call } f_y,\ f_x \in F,\ f_y \in F\}$. IN and OUT denote the input and the output. Soft computation [17] is not considered in this paper; therefore, if $F$, $E$, and IN are determined, OUT can be uniquely determined.

Table 1: Relationships of invariants considered in this paper.

Category | Expression
Unary | $x = a$, $x > a$, $x < a$, $x \bmod a = 0$, $x \neq 0$, $x \in \{a, b, c\}$, $x[k] < a$, $x[k] > a$
Binary | $y = ax + b$, $x < y$, $x = y$, $x = y^2$, $x[k] < x[k+1]$, $x[k] > x[k+1]$, $x[\,] \subset y[\,]$, $x[k] < y[k]$, $y[k] = a x[k] + b$, $x \in y[\,]$
Ternary | $z = ax + by + c$, $x = y \text{ and } z$, $x = y \text{ or } z$, $x = \mathrm{Lshift}(y, z)$, $x = \mathrm{Rshift}(y, z)$, $x = \max(y, z)$, $x = \min(y, z)$, $x = y \times z$, $x = y \div z$

Definition 2. A function $F$ is composed of a set of basic blocks $B$ and variables $V$; thus $F = \{B, V\}$. A basic block is a single-entrance, single-exit sequence of instructions. For a single instruction $i_j$, $i_j = \langle \theta, S, D \rangle$, where $j$ denotes the sequence number of the dynamic instruction during the execution, $\theta$ denotes the program point, which equals the offset from the start position of the assembly file, and $S$ and $D$ denote the source operands and the destination operands.

Definition 3. $\forall i_m \in f_y$, if $\exists i_k \in f_x$, $e_{yx} \in E$, and $i_k \cdot S = i_m \cdot D$, and also $\exists i_l \in f_x$, $l < k$, and $i_l \cdot D = i_k \cdot S$, then $i_{\max m}$ in $f_y$ is defined as the connector instruction. Literally, the connector instruction transmits data from one function to another. The variable that a connector instruction writes is defined as the connector variable, $\mathrm{CV} = \{v \mid v \in i_{\max m} \cdot D\}$. Connector variables include function argument variables, function return variables, and global variables. By definition, the connector instruction is the last to write a connector variable in the function.

Definition 4. The execution profile is denoted by $\Gamma$, which is given as a tuple $\Gamma = \langle \theta, V, L \rangle$. The execution profile defines the values of the variables at given program points. $\theta$ represents the given program point. $L$ is the acquired value set of the variables that appear in $V$.

Definition 5. An invariant $Q$ is defined as $Q = \langle \theta, \psi, r \rangle$, where $\theta$ represents the program point, $\psi = \langle v_1, \ldots, v_j, \ldots, v_n \rangle$ is the ordered set of variables, and $r$ represents the relationship of the variables that appear in $\psi$; $r \in R$, where $R$ is the relationship set considered in the paper, shown in Table 1. $R$ can be categorized into unary, binary, and ternary.

For instance, suppose an invariant $q_1 = \langle \theta_1, \psi_1, r_1 \rangle$, where $\theta_1 = \mathrm{0x10}$, $\psi_1 = \langle \mathrm{tmp1}, \mathrm{tmp2} \rangle$, and $r_1 = \{\langle x, y \rangle \mid y = x + 1\}$. $q_1$ represents that, at the program point 0x10, the ordered set $\langle \mathrm{tmp1}, \mathrm{tmp2} \rangle \in r_1$; that is, tmp1 and tmp2 satisfy the condition tmp2 = tmp1 + 1.
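The tuple $Q = \langle \theta, \psi, r \rangle$ of Definition 5 maps naturally onto a small record type. The sketch below is one possible Python encoding; the class name `Invariant`, the `holds` method, and the choice of a callable for $r$ are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Invariant:
    theta: int                  # program point
    psi: Tuple[str, ...]        # ordered set of variable names
    r: Callable[..., bool]      # relationship, encoded as a predicate

    def holds(self, values: Dict[str, int]) -> bool:
        """Evaluate r on the current values of the variables in psi, in order."""
        return self.r(*(values[name] for name in self.psi))


# The example q1 from the text: at program point 0x10, tmp2 = tmp1 + 1.
q1 = Invariant(theta=0x10, psi=("tmp1", "tmp2"), r=lambda x, y: y == x + 1)

ok = q1.holds({"tmp1": 4, "tmp2": 5})
violated = not q1.holds({"tmp1": 4, "tmp2": 7})
```

Keeping $\psi$ ordered matters because relationships such as $y = x + 1$ are not symmetric in their arguments.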

The fault model we assume is a single bit flip within the register file. Most faults in other portions of the processor eventually manifest as corrupted state in the register file [18]. Moreover, we assume that at most one fault occurs during a program's execution.
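A minimal Python sketch of this fault model is given below. The register file is modelled as a plain list of integers, and the 32-bit width, the function name, and the seed are assumptions chosen for the example rather than details taken from the paper.

```python
import random

WORD_BITS = 32  # assumed register width


def inject_single_bit_flip(registers, rng):
    """Return a copy of the register file with exactly one bit flipped."""
    faulty = list(registers)
    reg = rng.randrange(len(faulty))    # which register is hit
    bit = rng.randrange(WORD_BITS)      # which bit inside that register
    faulty[reg] ^= 1 << bit
    return faulty, reg, bit


rng = random.Random(2016)               # fixed seed so the run is repeatable
golden = [0, 7, 42, 0xFFFFFFFF]
faulty, reg, bit = inject_single_bit_flip(golden, rng)

# XOR of the two register files exposes the single corrupted bit.
diff = [g ^ f for g, f in zip(golden, faulty)]
```

Calling the function once per run corresponds to the assumption that at most one fault occurs during a program's execution.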

3. Radish

This paper implements Radish, a system which can harden a program against soft error. Radish enhances the resilience of the program to soft error by inserting assertions into the source code. The assertions are based on program invariants. If the statement of an assertion is not satisfied during the execution, the execution is stopped and a warning reports the occurrence of soft error.

The input of Radish is a C source file and the output is a new C source file. The new source file can be compiled and executed just as the original source file. They are identical in functionality but vary in reliability.

This section introduces the workflow of Radish, which can be divided into three phases, that is, preprocessing, detecting, and selecting. Figure 2 shows the details of each phase. In the preprocessing phase, we extract the execution profiles $\Gamma$ of the critical program points. Then $\Gamma$ is used to extract potential invariants in the detecting phase. After that, the invariants $Q_{\mathrm{pot}}$ are obtained and a fraction of them are converted to assertions in the selecting phase. In the end, the hardened source code is output. We describe each phase below.

3.1. Preprocessing Phase. In the preprocessing phase, we find the critical program points and extract their execution profiles. The profiles are used to extract invariants in the next phase. Finally, assertions will be placed at those program points to prevent faults from propagating. The SDC coverage varies with the program points at which assertions are placed, and therefore we analyze the propagation of SDC and find the critical program points for propagation. A fault may propagate through data flow or control flow to incur SDC. Due to the distinction between the two categories of propagation, we analyze and search for their critical program points separately.

When a fault propagates through data flow, the same static instructions are executed just as in the fault-free execution, but the data that the instructions read or write are corrupted. To incur SDC, the corrupted data need to be transmitted to other functions, especially the output function. Only connector instructions can perform this operation; thus they must be executed, and the data they transmit are corrupted. This makes the connector instructions efficient for fault detection, and therefore they are selected as the critical program points against data flow propagation.

Next, we discuss fault propagation in control flow. The compare instruction performs a comparison between two values, and the result of the comparison impacts the bits of the flag register, which determines the consequent jump performed by a branch instruction. Propagation through control flow means that an erroneous jump is performed by a branch instruction. Assume that in the fault-free execution $i_k$ is a branch instruction and the next instruction is $i_{k+1} \in b_u$, which means $i_k$ chooses $b_u$ as the next basic block. When the flag register is corrupted in the presence of a soft error, then $i_{k+1} \in b_w$, which means $i_k$ chooses the erroneous branch $b_w$ instead of $b_u$. To avoid this, we should check whether the right branch is taken after the execution of $i_k$. Therefore, branch instructions are selected as the critical program points of control flow propagation.

Figure 2: The workflow of Radish. (The source code passes through the preprocessing phase, which yields $\Theta_{\mathrm{cri}}$ and $\Gamma$; the detecting phase, which yields $V_\theta$, the ordered sets $P^1_\theta$, $P^2_\theta$, $P^3_\theta$, their value sets $N(P^1_\theta)$, $N(P^2_\theta)$, $N(P^3_\theta)$, and $Q_{\mathrm{pot}}$; and the selecting phase, which yields $Q_{\mathrm{ass}}$ and the assertions inserted into the hardened source code.)

According to the analysis above, the critical program points of data flow and control flow propagation refer to connector instructions and branch instructions. It takes two steps to extract the execution profiles of the critical program points.

Step 1. We compile the source code and translate it into an assembly file, and then locate connector instructions and branch instructions in the assembly file. Their program points are recorded and added to the program point set $\Theta_{\mathrm{cri}}$.

Step 2. The execution profile is acquired by using Kvasir [16]. Kvasir executes C and C++ programs and creates data trace files of variables and their values by examining the operation of the binary at run time. Using Kvasir makes it possible to interrupt the program's execution and read the values of all variables manifest at the program points of interest. Once it finishes executing, we get the profiles $\Gamma$ at the target program points in $\Theta_{\mathrm{cri}}$.

3.2. Detecting Phase. In the detecting phase, the ordered sets of variables and the corresponding ordered sets of values are generated according to the execution profiles $\Gamma$. We check whether the values satisfy any relationship of $R$ listed in Table 1. The detecting phase has 4 steps in total.

Step 1. For each program point $\theta$ of $\Theta_{\mathrm{cri}}$, we get the set of accessible variables $V_\theta$ from the execution profile $\Gamma$. Then the unary, binary, and ternary ordered sets $P^1_\theta$, $P^2_\theta$, and $P^3_\theta$ are created. The superscript digits refer to the number of variables of the ordered set. For example, $P^2_\theta$ is an arrangement of two variables in $V_\theta$, that is, $P^2_\theta(v_k, v_j) = \{\langle v_k, v_j \rangle \mid v_k \in V_\theta \wedge v_j \in V_\theta\}$.

Step 2. Find the corresponding values of the variables appearing in $P^1_\theta$, $P^2_\theta$, $P^3_\theta$ and generate the ordered sets of values $N(P^1_\theta)$, $N(P^2_\theta)$, $N(P^3_\theta)$. For example, $N(P^2_\theta(v_k, v_j))$ is the ordered value set of $P^2_\theta(v_k, v_j)$: $N(P^2_\theta(v_k, v_j)) = \{\langle l_u, l_w \rangle \mid \langle \theta, v_k, l_u \rangle \in \Gamma \wedge \langle \theta, v_j, l_w \rangle \in \Gamma\}$.

Step 3. For the relationships that have undetermined parameters, we use a part of the ordered set of values to calculate those parameters. Thus the entire expression is determined.

Step 4. Test whether each element of the ordered set of values satisfies the condition of the relationship. If so, create a new invariant and put it into the potential invariant set $Q_{\mathrm{pot}}$.
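Steps 1 and 2 can be sketched in Python as follows. The profile layout (one record per visit of a program point, pairing values observed in the same visit) and all function names are assumptions made for this illustration; they do not reflect Kvasir's trace format.

```python
from itertools import permutations

# Hypothetical profile: one (theta, {variable: value}) record per visit.
profile = [
    (0x10, {"i": 0, "n": 1}),
    (0x10, {"i": 3, "n": 7}),
    (0x20, {"k": 5}),
]


def accessible_variables(profile, theta):
    """V_theta: every variable observed at program point theta."""
    names = set()
    for point, values in profile:
        if point == theta:
            names.update(values)
    return sorted(names)


def ordered_variable_sets(variables, arity):
    """P^arity_theta: all arrangements of `arity` distinct variables."""
    return list(permutations(variables, arity))


def ordered_value_sets(profile, theta, var_tuple):
    """N(P_theta(...)): one value tuple per visit of theta."""
    return [tuple(values[v] for v in var_tuple)
            for point, values in profile if point == theta]


V = accessible_variables(profile, 0x10)
P2 = ordered_variable_sets(V, 2)
N = ordered_value_sets(profile, 0x10, ("i", "n"))
```

Because arrangements are ordered, both `("i", "n")` and `("n", "i")` are generated, so asymmetric relationships such as $y = ax + b$ are tested in both directions.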

Take a binary relationship $r_{\mathrm{lin}} = \{\langle x, y \rangle \mid y = ax + b\}$ as an example. We shall show each step of the detecting phase. Since it is a binary relationship, only the binary ordered sets $P^2_\theta$ and $N(P^2_\theta)$ are considered in this example.

In the first step, we get $V_\theta = \{v_k \mid \exists l,\ \langle \theta, v_k, l \rangle \in \Gamma\}$ by searching the execution profile $\Gamma$. Then $P^2_\theta(v_k, v_j) = \{\langle v_k, v_j \rangle \mid v_k \in V_\theta \wedge v_j \in V_\theta\}$ is obtained by creating the arrangement of every two variables in $V_\theta$.

In the second step, $N(P^2_\theta(v_k, v_j)) = \{\langle l_u, l_w \rangle \mid \langle \theta, v_k, l_u \rangle \in \Gamma \wedge \langle \theta, v_j, l_w \rangle \in \Gamma\}$ is obtained by finding the values of $v_k$ and $v_j$ in the execution profile $\Gamma$. There may be many value pairs of $v_k$ and $v_j$, because certain code sections can be invoked many times in a single execution and each invocation produces one value instance.

In the third step, we calculate the parameters $a$, $b$ in $r_{\mathrm{lin}}$. To this end, we need to use at least 2 elements of $N(P^2_\theta(v_k, v_j))$. Assuming the two elements are $\langle l_1, l_2 \rangle$ and $\langle l_3, l_4 \rangle$, it can easily be obtained that $a = (l_4 - l_2)/(l_3 - l_1)$ and $b = (l_2 l_3 - l_1 l_4)/(l_3 - l_1)$.

In the last step, all elements in $N(P^2_\theta(v_k, v_j))$ are checked as to whether they satisfy $r_{\mathrm{lin}} = \{\langle x, y \rangle \mid y = ((l_4 - l_2)/(l_3 - l_1))\,x + (l_2 l_3 - l_1 l_4)/(l_3 - l_1)\}$. If all of them pass this validation, the invariant $\langle \theta, \langle v_k, v_j \rangle, r_{\mathrm{lin}} \rangle$ holds and it is added to the potential invariant set $Q_{\mathrm{pot}}$.
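The third and fourth steps for $r_{\mathrm{lin}}$ can be sketched in Python as below, using the two formulas for $a$ and $b$ given above. The function names are assumptions, and exact rational arithmetic is a choice made here to avoid rounding issues; the paper does not say how the parameters are represented.

```python
from fractions import Fraction


def fit_linear(pairs):
    """Step 3: derive a and b from two value pairs with distinct x values."""
    l1, l2 = pairs[0]
    for l3, l4 in pairs[1:]:
        if l3 != l1:
            a = Fraction(l4 - l2, l3 - l1)
            b = Fraction(l2 * l3 - l1 * l4, l3 - l1)
            return a, b
    return None                 # not enough distinct points to fix a and b


def detect_linear(pairs):
    """Step 4: keep (a, b) only if every observed pair satisfies y = a*x + b."""
    params = fit_linear(pairs)
    if params is None:
        return None
    a, b = params
    if all(y == a * x + b for x, y in pairs):
        return a, b
    return None


holds = detect_linear([(0, 1), (3, 7), (5, 11)])      # y = 2x + 1 throughout
rejected = detect_linear([(0, 1), (3, 7), (5, 12)])   # last pair breaks it
```

A single counterexample in the profile is enough to discard the candidate, which is why the quality of the extracted invariants depends on how representative the profiled executions are.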

3.3. Selecting Phase. It is often observed that the number of elements of the potential invariant set $Q_{\mathrm{pot}}$ is very large. If all of them are converted into assertions and inserted into the source file, very high performance overhead will be incurred. In the selecting phase, proper invariants are selected according to their capability of detecting SDC. Heuristics about the selection criteria are formulated on the basis of the propagation of SDC. These heuristics are generic and can be applied to any invariants. We list the heuristics first and then describe the selecting steps.

Heuristic 1. There are certain types of variables that should be monitored at each target program point.

A fraction of variables are capable of telling whether the execution is going well, and thus monitoring these variables is able to detect SDC. The target program point set $\Theta_{\mathrm{cri}}$ can be categorized into program points of connector instructions and branch instructions. At the program points of connector instructions, it is the connector variables that deserve special attention, since they reflect whether the results of functions are correct. At the program points of branch instructions, branch-controlling variables, which appear in the statement of an if, while, or for structure, reflect the status of these structures and thus should be monitored. Therefore, for all target program points, we find certain variables to monitor.

Heuristic 2. The likelihood of detecting SDC increases if the number of valid values defined by an assertion decreases.

Invalid values cannot pass the examination of assertions in the presence of soft error. Therefore, having more invalid values (fewer valid values) means that the likelihood of detecting SDC increases. The number of valid values of an invariant is determined by its relationship. The equality relationship, using "=" as operator, has only one valid value; then come inclusion ($\in$, $\subseteq$), range ($>$, $<$), and inequality ($\neq$) relationships, in order of ascending number of valid values.

Heuristic 3. The likelihood of detecting SDC increases if more variables are included in an assertion.

The more variables appear in an assertion, the more variables it can monitor. If any of the variables gets corrupted due to soft error, the assertion will be able to catch the error. Thus having more variables in an assertion leads to higher coverage of SDC. So far the largest number of variables is 3, which refers to ternary relationships.

Utilizing these heuristics, we are able to reduce the number of invariants and obtain more effective assertions. The selecting phase has three steps.

Step 1. The invariants which contain connector variables at the program points of connector instructions or branch-controlling variables at the program points of branch instructions are selected based on Heuristic 1.

Step 2. The invariants with the relationship that has fewer valid values are picked according to Heuristic 2.

Step 3. The invariants which contain the largest number of variables are selected according to Heuristic 3.

The selecting process stops when there is only one invariant left or when all the steps have been performed. Then we convert the chosen invariants into assertions, which is basically a string conversion problem. For brevity's sake, we do not discuss it in this paper. Finally, we include the assertion header file at the beginning of the new source file to make sure that the assertions work.
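The three selecting steps and the stopping rule can be sketched in Python as a chain of filters over the candidates at one program point. Everything concrete here is assumed for illustration: the dictionary layout, the relationship labels, the ranking table derived from Heuristic 2, and the way an invariant is rendered as a C `assert` string (the paper leaves that conversion unspecified).

```python
# Fewer valid values first (Heuristic 2): equality, inclusion, range, inequality.
RANK = {"equality": 0, "inclusion": 1, "range": 2, "inequality": 3}


def keep_monitored(cands, monitored):
    """Step 1: keep invariants that mention a monitored variable."""
    return [q for q in cands if monitored & set(q["vars"])]


def keep_fewest_valid_values(cands):
    """Step 2: keep invariants whose relationship admits the fewest valid values."""
    best = min(RANK[q["kind"]] for q in cands)
    return [q for q in cands if RANK[q["kind"]] == best]


def keep_most_variables(cands):
    """Step 3: keep invariants that involve the most variables."""
    most = max(len(q["vars"]) for q in cands)
    return [q for q in cands if len(q["vars"]) == most]


def select(cands, monitored):
    steps = (lambda c: keep_monitored(c, monitored),
             keep_fewest_valid_values,
             keep_most_variables)
    for step in steps:
        if len(cands) <= 1:     # stop early once at most one invariant is left
            break
        cands = step(cands)
    return cands


def to_assertion(q):
    """Hypothetical string conversion of an invariant into a C assertion."""
    return f"assert({q['expr']});"


# Hypothetical candidates at the program point of a connector instruction.
candidates = [
    {"vars": ("tmp",), "kind": "equality", "expr": "tmp == 4"},
    {"vars": ("ret",), "kind": "range", "expr": "ret > 0"},
    {"vars": ("ret",), "kind": "inequality", "expr": "ret != 0"},
    {"vars": ("ret",), "kind": "equality", "expr": "ret == 9"},
    {"vars": ("ret", "n"), "kind": "equality", "expr": "ret == 2 * n + 1"},
]
chosen = select(candidates, monitored={"ret"})
assertions = [to_assertion(q) for q in chosen]
```

In this example each step narrows the set: Step 1 drops the invariant on `tmp`, Step 2 keeps the two equalities, and Step 3 keeps the one that covers two variables.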

4. Radish_D

The assertions generated by Radish cannot fully monitor all the variables and program points; thus certain faults might propagate through unprotected code sections. To further increase the coverage of SDC, we introduce a software-based instruction duplication mechanism to protect the code sections that are not covered by Radish.

This paper utilizes the instruction duplication mechanism of SWIFT [15] for comparison and also for our own duplication in Radish_D. SWIFT duplicates all computation instructions along the path of replication, and the replica instructions use different registers and different memory locations. At certain synchronization points, comparison instructions are inserted to check whether the original instructions and their replicas have identical values.

Rather than deploying the full instruction duplication mechanism of SWIFT, Radish_D applies a selective instruction duplication mechanism. Because a portion of the instructions have been protected by assertions, we only need to duplicate the others.

Before deploying duplications, we need to determine which variables are safe under the protection of assertions. An assertion is capable of protecting the variables which appear in its statement. However, the protection does not last for the entire lifetime of those variables. Only the fraction from the beginning of the local function until the variable's host assertion is considered safe, since the variable's value is checked during the execution of the assertion.

We partition each variable's lifetime by assertions and identify the safe periods. Then duplications are deployed at the instruction level. The targets of duplication are the instructions which do not contain a variable in the safe period as an operand. A replica instruction is created by copying the opcode and operands of the original instruction. The destination operand is changed into an unused register, the copy of the original destination operand. Next, we decide whether there is a need to change the source operands of the replica instruction. If there has already been a copy of the source operand, which means this source operand was some instruction's destination operand and thus got a copy, we replace the replica instruction's source operand with its copy. The replica instruction is inserted before the original instruction in the same basic block.

Besides, store, branch, and call instructions are chosen as the synchronization points. If any source operand of these instructions has a copy, we compare its value with that of its copy by inserting a compare instruction. According to the type of the operand (int or float), the compare instruction can be either an icmp instruction or an fcmp instruction. After the compare instruction, a branch instruction using the predicate neq is inserted into the code. If the two values show a discrepancy, it will jump to a function called faultDetected; otherwise it will continue to execute the previous store, branch, or call instruction. The function faultDetected outputs error messages and returns with an exit code, which will inform the system of the soft error and end the execution.

We use an example to show the distinction between our method and the full instruction duplication mechanism in Figure 3. For consistency, we make use of the LLVM [19] assembly language to present the assembly code. Figure 3(a) shows the original assembly code and Figure 3(b) shows the assembly code after full instruction duplication. It can be found in Figure 3(b) that R3' is the replica of R3 and the duplication is accomplished by the instruction i2. Similarly, R5' is the replica of R5 through the duplication by i4. i9 is the synchronization point, and the source operands of i9, R3 and R5, need to be examined. i5 and i7 compare R3 and R5 with their replicas separately. If the values of R3 and R3' are not equal, i6 will call faultDetected to report a soft error.

(a) Original assembly code

    i1: R3 = xor R1, R2
    i2: R5 = add R4, R2
    i3: R6 = icmp eq R3, R5
    i4: br R6, label b2

(b) Assembly code after full instruction duplication

    i1:  R3 = xor R1, R2
    i2:  R3' = xor R1', R2'
    i3:  R5 = add R4, R2
    i4:  R5' = add R4', R2'
    i5:  R7 = icmp neq R3, R3'
    i6:  br R7, label faultDetected
    i7:  R8 = icmp neq R5, R5'
    i8:  br R8, label faultDetected
    i9:  R6 = icmp eq R3, R5
    i10: br R6, label b2

(c) Assembly code of Radish_D

    i1: R3 = xor R1, R2
        assert(R1 > R3)
    i2: R5 = add R4, R2
    i3: R5' = add R4', R2'
    i4: R7 = icmp neq R5, R5'
    i5: br R7, label faultDetected
    i6: R6 = icmp eq R3, R5
    i7: br R6, label b2

Figure 3: A sample assembly code before and after the transformations of full instruction duplication and Radish_D.

The assembly code generated by Radish_D is shown in Figure 3(c). Assume that we have already obtained an assertion about R1 and R3 by utilizing Radish, which is shown in the line of code "assert(R1 > R3)". Owing to the assertion, R1 and R3 are considered safe during the execution of this example. R1' and R3' are no longer necessary, and the instructions used for their duplication are eliminated. Variables other than R1 and R3 still need to be duplicated and checked; thus R5 is duplicated by i3 and checked at the synchronization point i6. The efficiency of Radish_D and the full instruction duplication mechanism is examined in the next section.
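The Radish_D fragment can also be read as ordinary C. The sketch below is an illustration under stated assumptions: the function name, the outcome codes, and the choice to return a code instead of calling faultDetected are introduced here only so that the behaviour can be observed, and placing the assertion directly after i1 is an assumption of the sketch.

```c
#include <stdint.h>

/* Outcome codes used only in this sketch. */
enum { FAULT = -1, FALL_THROUGH = 0, TAKE_B2 = 1 };

/* Radish_D-style version of the Figure 3 fragment: R3 is guarded by the
   assertion R1 > R3, while R5 is still guarded by a duplicate. */
int radish_d_fragment(uint32_t r1, uint32_t r2, uint32_t r4,
                      uint32_t r2_copy, uint32_t r4_copy) {
    uint32_t r3 = r1 ^ r2;                  /* i1, no duplicate needed */
    if (!(r1 > r3)) return FAULT;           /* assert(R1 > R3) */
    uint32_t r5 = r4 + r2;                  /* i2 */
    uint32_t r5_copy = r4_copy + r2_copy;   /* i3 */
    if (r5 != r5_copy) return FAULT;        /* i4, i5 */
    return (r3 == r5) ? TAKE_B2 : FALL_THROUGH;   /* i6, i7 */
}
```

Compared with full duplication, the xor is executed once and its result is covered by a single comparison against R1, which is where the saving in instructions comes from.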

5. Experiment

This paper applies fault injection experiments to validate the effectiveness of Radish and Radish_D. The fault injection experiment is performed on the original executable first. The executables hardened using Radish, Radish_D, and full instruction duplication are targeted subsequently. We compare the results of the fault injection experiments and calculate the SDC coverage and performance overhead. To ensure a fair comparison among these mechanisms, we use a metric called the SDC detection efficiency, which is defined in prior work [9] as the ratio between SDC coverage and overhead for a detection mechanism.

The platform for validation is Ubuntu 14.04 (AMD64 architecture). LLFI [20], an LLVM-based fault injection tool, is applied to perform fault injections. The source code is translated into an intermediate representation (IR), and faults are then injected into the IR code. The faults can be injected at specific program points, and their effects can easily be traced back to the source code. LLFI is configured to inject faults into destination registers. In a single fault injection, LLFI randomly picks one instruction and injects one soft error into its destination operand. One fault injection experiment continues until the fault injection has been repeated 1000 times. The injected faults may affect data flow or control flow. We take the following LLVM IR code to explain the effect on control flow:

    %judge1 = icmp ne i32 %1, %2
    br i1 %judge1, label %BB1, label %BB2

judge1 determines the outcome of the branch. If a fault is injected into judge1, the branch instruction may choose the wrong branch and thus affect control flow. The mechanism of full instruction duplication is implemented by developing a new pass under the LLVM infrastructure. The pass is also used by Radish_D for the operation of instruction duplication, with certain conditions for duplication modified.

Figure 4: The comparison of performance overheads among full instruction duplication, Radish, and Radish_D. (Bar chart; x-axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng; y-axis: performance overhead (%), 0 to 120; series: Duplication, Radish, Radish_D.)

The programs used for evaluation are from the MiBench benchmark suite [21]. These programs are qsort (which performs the quick sort algorithm), isqrt (which is a base-two analogue of the square root algorithm), cubic (which solves a cubic polynomial), rad2deg (which converts between radians and degrees), crc (which computes a 32-bit CRC to detect accidental changes to raw data), and bitstrng (which prints the bit pattern of bytes formatted as a string). These are C programs consisting of a few hundred lines of code. We use 25 inputs to extract invariants and randomly choose one input for the injection.

5.1. Comparison between Radish and Full Instruction Duplication. Figure 4 shows the performance overheads of Radish and full instruction duplication. We use the execution time of the original program as the baseline for comparison. Compared with the baseline, the average overhead incurred by Radish is 30.4%, and the overhead incurred by full instruction duplication is 52.8%. For the studied programs, the overhead of the full instruction duplication mechanism is thus 22.4 percentage points higher than that of Radish.

Figure 5 shows the SDC coverages. The average SDC coverage of Radish is 77.1%, and that of full instruction duplication is 84.3%, which is 7.2 percentage points higher. For most of the benchmarks, the SDC coverages of full instruction duplication and Radish are very close.

Full instruction duplication does not achieve nearly 100% coverage, since it does not check the results of store and branch instructions. For example, in Figure 3(b), which shows full instruction duplication, if a fault is injected into R6 in i9, i10 is affected and may choose the wrong branch. SWIFT [15] raises the coverage to nearly 100%, since it assumes that the hardware applies ECC and it adds a control flow checking mechanism. The SDC detection efficiency can be observed in Figure 6. Radish has the higher SDC detection efficiency, which is 1.6 times that of full instruction duplication. This is because the mechanism of full instruction duplication protects all executed instructions, which yields high SDC coverage at very high overhead. Radish, in contrast, obtains relatively high SDC coverage with much lower overhead. Radish achieves this by curbing the number of program points that generate assertions. Further, the execution cost of assertions is relatively low, and assertions have good SDC coverage, since they are seldom satisfied when soft errors occur.

5.2. The Experimental Results of Radish_D. The average SDC coverage of Radish_D is 92.5%, which is 8.2 percentage points higher than that of full instruction duplication and 15.5 percentage points higher than that of Radish. This confirms that the instruction duplication of Radish_D protects unsafe code sections that are not covered by assertions. Radish_D may generate assertions that check the variable stored in memory after the store instruction (see Heuristic 1). Moreover, at the program points of branch instructions, branch-controlling variables are checked. Therefore, the assertions of Radish_D catch some of the faults that escape the detection of the duplication mechanism, and the coverage of Radish_D is higher than that of full instruction duplication.

The average overhead of Radish_D is 76.3%, which is lower than the sum of the overheads of full instruction duplication and Radish, because we eliminate the duplication of instructions that are already protected by assertions.

The average SDC detection efficiency of Radish_D is lower than that of full instruction duplication or Radish. For Radish_D, there are overlapping soft errors that can be detected by both instruction duplication and assertions. For these soft errors, deploying instruction duplication increases the overhead but does not increase the SDC coverage. Since the SDC detection efficiency is the ratio between SDC coverage and overhead, it is lowered.
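As a quick check of these statements, the efficiencies can be recomputed from the reported averages, taking the efficiency as coverage divided by overhead with both expressed in percent. The symbol $\eta$ is introduced here only for illustration, and Figure 6 may aggregate per-benchmark values differently, so the figures below are approximate.

```latex
\begin{align*}
\eta_{\text{Radish}}      &= \frac{77.1}{30.4} \approx 2.54, \\
\eta_{\text{Duplication}} &= \frac{84.3}{52.8} \approx 1.60, \\
\eta_{\text{Radish\_D}}   &= \frac{92.5}{76.3} \approx 1.21, \\
\frac{\eta_{\text{Radish}}}{\eta_{\text{Duplication}}} &\approx \frac{2.54}{1.60} \approx 1.6 .
\end{align*}
```

The ordering Radish, full instruction duplication, Radish_D matches the discussion, and the last line agrees with the factor of 1.6 quoted for Radish.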

5.3. False Positives of Invariants. A false positive for an input can occur when the values at the assertion points for this input do not satisfy the condition of an assertion learned from the training inputs. We use 25 inputs for training and 100 inputs for testing. No faults are injected in these runs. We test all the programs that were used to evaluate SDC coverage in the fault injection experiment. The result shows that the average false positive rate of the studied programs is 4.8%.

Figure 5: The comparison of SDC coverages among full instruction duplication, Radish, and Radish_D. (Bar chart; x-axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng; y-axis: SDC coverage (%), 0 to 100; series: Duplication, Radish, Radish_D.)

Figure 6: The comparison of SDC detection efficiencies among full instruction duplication, Radish, and Radish_D. (Bar chart; x-axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng; y-axis: SDC detection efficiency, 0 to 4; series: Duplication, Radish, Radish_D.)

We also conduct an experiment to examine the effect of the training set size. The result for qsort is shown in Figure 7. The training set consists of 25, 50, and 75 inputs, and false positives are computed across 100 inputs.

The false positive rate decreases from 5% to 3% as the training set size is increased from 25 to 50, and to 2% for 75 inputs. The SDC coverage also decreases as the training set grows from 25 to 75 inputs. Increasing the training set size therefore has a significant impact on both SDC coverage and false positive rate. Hence, the training set size should be chosen according to the user's target. If the user specifies bounds on SDC coverage and overhead, with the false positive rate converted into overhead, we can choose a training set size that achieves the target.

Besides, reexecution can reduce the overhead incurred by false positives. When an assertion raises an alarm, we can determine whether it is a false positive by reexecuting the program. A transient soft error is unlikely to recur in the second run, so if the assertion raises an alarm again, it is a false positive. In this case, the alarm can be ignored and the program can continue.
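The reexecution rule can be sketched as a small decision procedure. This is an illustration only: the verdict names, the callback type run_fn, and the demo callback alarm_n_times are hypothetical and are not part of Radish.

```c
#include <stdbool.h>

/* Verdicts used only in this sketch. */
typedef enum { NO_ALARM, SOFT_ERROR, FALSE_POSITIVE } verdict_t;

/* Hypothetical callback: runs the guarded code on the same input and
   returns true when an assertion raises an alarm. */
typedef bool (*run_fn)(void *ctx);

verdict_t classify_alarm(run_fn run_once, void *ctx) {
    if (!run_once(ctx)) return NO_ALARM;
    /* A transient fault is not expected to recur, so a repeated alarm on the
       same input points to an invariant that does not hold for this input. */
    return run_once(ctx) ? FALSE_POSITIVE : SOFT_ERROR;
}

/* Demo callback: raises an alarm for the first *remaining calls. */
bool alarm_n_times(void *ctx) {
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return true;
    }
    return false;
}
```

The sketch trades one extra execution per alarm for the ability to keep running on inputs that merely fall outside the training set.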

From the discussion above, it can be concluded that Radish_D has higher SDC coverage than Radish or full instruction duplication. Its overhead is also higher, however, which suggests that Radish_D should be used in situations where SDC coverage takes priority over overhead. Further, the SDC detection efficiency of Radish is far higher than that of Radish_D or full instruction duplication, which means it is more cost-effective. Radish may, however, incur overhead due to false positives. Users can choose Radish or Radish_D according to their preferred tradeoff between SDC coverage and performance overhead.

6. Related Work

Prior research [8, 22, 23] applies invariants with a single variable, and most of these invariants are based on a bounded range. We apply invariants with more variables, which can achieve better coverage on many occasions. For example, we can always extract the invariant $n - k + 1 = 0$ from a typical loop structure, shown as follows:

    for (k = 1; k <= n; k++) {
        ...
    }
    /* <- extracting invariant from here */

Figure 7: The SDC coverage and false positive rate for varied training set sizes. (x-axis: training set size, 25, 50, and 75; left y-axis: SDC coverage, 0.72 to 0.78; right y-axis: false positive rate, 0 to 0.06.)

It is found that assert($n - k + 1 = 0$) is often better than the bounded-range-based invariant assert($k_{\min} \le k \le k_{\max}$) at detecting errors, since assert($n - k + 1 = 0$) checks both $n$ and $k$, while assert($k_{\min} \le k \le k_{\max}$) checks only $k$.
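The difference between the two kinds of invariant can be made concrete with a short C sketch. The function names are introduced here for illustration, and the bounds passed to the range check stand for values that would be learned from training runs.

```c
#include <stdbool.h>

/* Runs "for (k = 1; k <= n; k++)" with an empty body and returns the final k. */
int run_loop(int n) {
    int k;
    for (k = 1; k <= n; k++) {
        /* loop body omitted */
    }
    return k;
}

/* Multi-variable invariant checked after the loop: n - k + 1 = 0. */
bool relation_holds(int n, int k) {
    return n - k + 1 == 0;
}

/* Bounded-range invariant on k alone. */
bool range_holds(int k, int k_min, int k_max) {
    return k_min <= k && k <= k_max;
}
```

With n = 10 the loop ends with k = 11. If a bit flip turns n into 14 or k into 15, relation_holds fails, whereas a range check with bounds such as 1 and 101 still passes, because the corrupted k remains inside the range and n is never inspected.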

A typical criterion for the selection of detectors is the tightness defined in [22], that is, the probability that the detector detects an error given that there is an error in the value of the variable it checks. The notion of tightness is based on the value of a single variable. The invariants in this paper may include 2 or 3 variables, and the notion of tightness cannot be used to describe an invariant with more than one variable. For example, if a bit of $x$ is flipped in the invariant $x < y$, there are multiple possible values of $y$, so it cannot be decided whether the invariant is still satisfied, and thus the tightness cannot be calculated. Since the tightness cannot be used, we apply certain heuristics to choose invariants, and these heuristics prove effective in our experiments.

7. Conclusion

To address the problem of detecting SDC, we propose an approach that applies invariant-based assertions and implement a system called Radish. Radish neither requires hardware modifications to add error detection capability to the original system nor requires knowledge of the semantics of the program, and thus it scales well. Experiments show that Radish achieves high SDC coverage with low overhead.

Furthermore, we propose Radish_D, which adds instruction duplication to the unsafe code sections that are not covered by assertions. Radish_D achieves higher SDC coverage than Radish or the full instruction duplication mechanism. Both Radish and Radish_D offer feasible alternatives for soft error mitigation.

Competing Interests

The authors declare no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Basic Research Program of China ("973" Project).

References

[1] H. Schirmeier, C. Borchert, and O. Spinczyk, "Avoiding pitfalls in fault-injection based comparison of program susceptibility to soft errors," in Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '15), pp. 319–330, IEEE, Rio de Janeiro, Brazil, June 2015.

[2] A. O. Daniel, L. P. Laercio, S. Thiago et al., "Evaluation and mitigation of radiation-induced soft errors in graphics processing units," IEEE Transactions on Computers, vol. 65, no. 3, pp. 791–804, 2016.

[3] S. S. Mukherjee, J. Emer, and S. K. Reinhardt, "The soft error problem: an architectural perspective," in Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA '05), pp. 243–247, San Francisco, Calif, USA, February 2005.

[4] D. Binder, E. C. Smith, and A. B. Holman, "Satellite anomalies from galactic cosmic rays," IEEE Transactions on Nuclear Science, vol. 22, no. 6, pp. 2675–2680, 1975.

[5] J. Olsen, P. E. Becher, P. B. Fynbo, P. Raaby, and J. Schultz, "Neutron-induced single event upsets in static RAMS observed at 10 km flight altitude," IEEE Transactions on Nuclear Science, vol. 40, no. 2, pp. 74–77, 1993.

[6] J. P. Walters, K. M. Zick, and M. French, "A practical characterization of a NASA SpaceCube application through fault emulation and laser testing," in Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '13), pp. 1–8, June 2013.

[7] S. Mittal and J. S. Vetter, "A survey of techniques for modeling and improving reliability of computing systems," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 4, pp. 1226–1238, 2016.

[8] P. Racunas, K. Constantinides, S. Manne, and S. S. Mukherjee, "Perturbation-based fault screening," in Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture, pp. 169–180, Scottsdale, Ariz, USA, February 2007.

[9] Q. Lu, K. Pattabiraman, M. S. Gupta et al., "SDCTune: a model for predicting the SDC proneness of an application for configurable protection," in Proceedings of the Compilers, Architecture and Synthesis for Embedded Systems, pp. 1–10, Uttar Pradesh, India, 2014.

[10] N. J. Wang and S. J. Patel, "ReStore: symptom-based soft error detection in microprocessors," IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 3, pp. 188–201, 2006.

[11] M.-L. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve, and Y. Zhou, "Understanding the propagation of hard errors to software and implications for resilient system design," ACM SIGARCH Computer Architecture News, vol. 36, no. 1, pp. 265–276, 2008.

[12] N. Oh, P. P. Shirvani, and E. J. McCluskey, "Error detection by duplicated instructions in super-scalar processors," IEEE Transactions on Reliability, vol. 51, no. 1, pp. 63–75, 2002.

[13] M. Shafique, S. Rehman, P. V. Aceituno, and J. Henkel, "Exploiting program-level masking and error propagation for constrained reliability optimization," in Proceedings of the 50th Annual Design Automation Conference (DAC '13), pp. 1–17, Austin, Tex, USA, June 2013.

[14] S. Rehman, M. Shafique, P. V. Aceituno, F. Kriebel, J.-J. Chen, and J. Henkel, "Leveraging variable function resilience for selective software reliability on unreliable hardware," in Proceedings of the 16th Design, Automation and Test in Europe Conference and Exhibition (DATE '13), pp. 1759–1764, Grenoble, France, March 2013.

[15] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, "SWIFT: software implemented fault tolerance," in Proceedings of the International Symposium on Code Generation and Optimization, pp. 243–254, IEEE Computer Society, San Jose, Calif, USA, 2005.

[16] M. D. Ernst, J. H. Perkins, P. J. Guo et al., "The Daikon system for dynamic detection of likely invariants," Science of Computer Programming, vol. 69, no. 1–3, pp. 35–45, 2007.

[17] A. Thomas and K. Pattabiraman, "Error detector placement for soft computation," in Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '13), pp. 1–12, IEEE Computer Society, June 2013.

[18] F. Shuguang, G. Shantanu, A. Amin et al., "Shoestring: probabilistic soft error reliability on the cheap," in Proceedings of the ASPLOS, pp. 385–396, Pittsburgh, Pa, USA, 2010.

[19] C. Lattner and V. Adve, "LLVM: a compilation framework for lifelong program analysis & transformation," in Proceedings of the International Symposium on Code Generation and Optimization (CGO '04), pp. 75–86, San Jose, Calif, USA, March 2004.

[20] A. Thomas and K. Pattabiraman, "LLFI: an intermediate code level fault injector for soft computing applications," in Proceedings of the Workshop on Silicon Errors in Logic System Effects (SELSE '13), pp. 1–8, Palo Alto, Calif, USA, 2013.

[21] M. R. Guthaus, J. S. Ringenberg, D. Ernst et al., "MiBench: a free, commercially representative embedded benchmark suite," in Proceedings of the Workload Characterization, pp. 3–14, Austin, Tex, USA, 2001.

[22] K. Pattabiraman, S. Giacinto, C. Daniel et al., "Dynamic derivation of application-specific error detectors and their implementation in hardware," in Proceedings of the 6th European Dependable Computing Conference (EDCC '06), pp. 97–108, Coimbra, Portugal, October 2006.

[23] S. K. Sahoo, M.-L. Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y. Zhou, "Using likely program invariants to detect hardware errors," in Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '08), pp. 70–79, IEEE Computer Society, Anchorage, Alaska, USA, June 2008.



Figure 1: Classification of the result of soft error. (Flowchart: if the execution does not end, the result is a hang; if it ends but not peacefully, a crash; if it ends peacefully with the right output, the result is benign; otherwise it is a silent data corruption.)

detection since they do not cause symptoms at all. To address this limitation, software-based instruction duplication is a possible alternative. With this approach, instructions are duplicated and their results are validated within a single thread of execution [12–15]. This solution has the advantage of being purely software-based, requiring no specialized hardware, and it can achieve high coverage. However, the overheads in terms of performance and power are quite high, since a large fraction of the program is replicated. Future missions will require much greater computational power than is available in today's processors [4]; thus, a low-cost fault detection solution is desired for future aerospace-based computing.

To address the problem of detecting SDC, this paper proposes an assertion-based detection mechanism. An assertion is a statement with a predicate (a boolean-valued function, that is, a true-false expression). If an assertion is found to be false at run time, an assertion failure arises, which typically causes the program to throw an assertion exception. Assertions in this paper are based on program invariants [16], which are properties that are true at a particular program point or points. For example, $x = 2y$ is an invariant about the variables $x$ and $y$, which states that they satisfy a linear relationship. This invariant is satisfied whenever the program executes normally but is seldom satisfied if a soft error affects the value of $x$ or $y$. Based on this principle, we design and implement the system Radish, which can harden a program against soft errors. Radish extracts invariants from a C program and inserts invariant-based assertions back into the source code. Once an assertion is found to be false, it suggests that a soft error has been detected. The execution is then stopped and a warning is given.
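A minimal sketch of what such an inserted assertion might look like in C is given below. The macro name INVARIANT_ASSERT and the function doubled are hypothetical, chosen only to illustrate the invariant $x = 2y$; the text does not specify the exact form of the assertions Radish emits.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical assertion macro: warn and stop when an invariant is violated. */
#define INVARIANT_ASSERT(cond)                                      \
    do {                                                            \
        if (!(cond)) {                                              \
            fprintf(stderr, "possible soft error: %s\n", #cond);    \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

/* Example function in which the invariant x = 2y holds before returning. */
int doubled(int y) {
    int x = y + y;
    INVARIANT_ASSERT(x == 2 * y);   /* inserted invariant-based assertion */
    return x;
}
```

In a fault-free run the condition always holds and the only cost is one comparison; a bit flip in x or y between the addition and the check would make the condition false and stop the program.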

Radish merely adds a few lines of code to the original source code, and thus it is easy to deploy. Besides, it does not need to modify the underlying hardware and hardly increases the complexity of the system. Furthermore, the overhead of Radish turns out to be low, since the overhead of a single assertion is low and the number of assertions in a program is small.

To further increase the SDC coverage, we extend Radish by incorporating the mechanism of software-based instruction duplication. The code sections that are not covered by Radish are protected by deploying instruction duplication. Experimental results show that Radish achieves high coverage at low cost, and Radish_D achieves even higher coverage than Radish or pure instruction duplication. The techniques of Radish and Radish_D offer new solutions to soft error mitigation.

2. Definitions and Models

This section describes important definitions and models used in this paper.

Definition 1. A program is defined as $\langle F, E, \mathrm{IN}, \mathrm{OUT} \rangle$. $F$ represents the functions in the program. $E$ is the set of edges that denote dependencies between functions, such that $E = \{ e_{xy} \mid f_x \text{ calls } f_y,\ f_x \in F,\ f_y \in F \}$. IN and OUT denote the input and the output. Soft computation [17] is not considered in this paper; therefore, if $F$, $E$, and IN are determined, OUT can be uniquely determined.

Table 1: Relationships of invariants considered in this paper.

| Category | Expression |
| --- | --- |
| Unary | $x = a$, $x > a$, $x < a$, $x \bmod a = 0$, $x \ne 0$, $x \in \{a, b, c\}$, $x[k] < a$, $x[k] > a$ |
| Binary | $y = ax + b$, $x < y$, $x = y$, $x = y^2$, $x[k] < x[k+1]$, $x[k] > x[k+1]$, $x[\,] \subset y[\,]$, $x[k] < y[k]$, $y[k] = a\,x[k] + b$, $x \in y[\,]$ |
| Ternary | $z = ax + by + c$, $x = y \wedge z$, $x = y \vee z$, $x = \mathrm{Lshift}(y, z)$, $x = \mathrm{Rshift}(y, z)$, $x = \max(y, z)$, $x = \min(y, z)$, $x = y \times z$, $x = y \div z$ |

Definition 2. A function $F$ is composed of a set of basic blocks $B$ and variables $V$; thus $F = \{B, V\}$. A basic block is a single-entrance, single-exit sequence of instructions. A single instruction $i_j$ is given as $i_j = \langle \theta, S, D \rangle$, where $j$ denotes the sequence number of the dynamic instruction during the execution, $\theta$ denotes the program point, which equals the offset from the start position of the assembly file, and $S$ and $D$ denote the source operands and the destination operands.

Definition 3. $\forall i_m \in f_y$, if $\exists i_k \in f_x$, $e_{yx} \in E$ and $i_k \cdot S = i_m \cdot D$, also $\exists i_l \in f_x$, $l < k$ and $i_l \cdot D = i_k \cdot S$, then $i_{\max m}$ in $f_y$ is defined as the connector instruction. Literally, the connector instruction transmits data from one function to another. The variable that a connector instruction writes is defined as the connector variable, $\mathrm{CV} = \{ v \mid v \in i_{\max m} \cdot D \}$. Connector variables include function argument variables, function return variables, and global variables. By definition, the connector instruction is the last instruction to write a connector variable in the function.

Definition 4. The execution profile is denoted by $\Gamma$, which is given as a tuple $\Gamma = \langle \theta, V, L \rangle$. The execution profile defines the values of the variables at given program points. $\theta$ represents the given program point, and $L$ is the acquired value set of the variables that appear in $V$.

Definition 5. An invariant $Q$ is defined as $Q = \langle \theta, \psi, r \rangle$, where $\theta$ represents the program point, $\psi = \langle v_1, \ldots, v_j, \ldots, v_n \rangle$ is the ordered set of variables, and $r$ represents the relationship of the variables that appear in $\psi$. $r \in R$, where $R$ is the relationship set considered in the paper, shown in Table 1. $R$ can be categorized into unary, binary, and ternary relationships.

For instance, suppose an invariant $q_1 = \langle \theta_1, \psi_1, r_1 \rangle$, where $\theta_1 = \mathrm{0x10}$, $\psi_1 = \langle \mathrm{tmp1}, \mathrm{tmp2} \rangle$, and $r_1 = \{ \langle x, y \rangle \mid y = x + 1 \}$. $q_1$ represents that, at the program point 0x10, the ordered set $\langle \mathrm{tmp1}, \mathrm{tmp2} \rangle \in r_1$; that is, tmp1 and tmp2 satisfy the condition tmp2 = tmp1 + 1.
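One way to picture Definition 5 in code is a small record holding the program point, the ordered variable names, and a predicate for the relationship. The sketch below is restricted to binary invariants; the type and function names are assumptions made for illustration and do not describe how Radish stores invariants internally.

```c
#include <stdbool.h>
#include <stdint.h>

/* Relationship r over an ordered pair of values. */
typedef bool (*binary_relation)(int64_t x, int64_t y);

/* Q = <theta, psi, r>, restricted to two variables. */
typedef struct {
    uint32_t        theta;    /* program point, e.g. 0x10 */
    const char     *psi[2];   /* ordered variable names */
    binary_relation r;        /* relationship the values must satisfy */
} binary_invariant;

/* r1 = { <x, y> | y = x + 1 } */
bool succ_relation(int64_t x, int64_t y) {
    return y == x + 1;
}

/* True when the values observed at the program point satisfy the invariant. */
bool invariant_holds(const binary_invariant *q, int64_t v0, int64_t v1) {
    return q->r(v0, v1);
}
```

The invariant $q_1$ of the example then corresponds to the initializer { 0x10, { "tmp1", "tmp2" }, succ_relation }, and invariant_holds returns true for the value pair (4, 5) and false for (4, 7).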

The fault model we assume is a single bit flip within the register file. Most faults in other portions of the processor eventually manifest as corrupted state in the register file [18]. Moreover, we assume that at most one fault occurs during a program's execution.
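The fault model is easy to emulate in software. The helper below, whose name flip_bit is an assumption of this sketch, flips exactly one bit of a 32-bit register value; it is meant only to illustrate the model and is not how a fault injection tool is implemented.

```c
#include <stdint.h>

/* Flip one bit (position 0..31) of a 32-bit register value. */
uint32_t flip_bit(uint32_t value, unsigned bit) {
    return value ^ (UINT32_C(1) << (bit & 31u));
}
```

Applying the same flip twice restores the original value, which reflects the transient nature of the fault: only the stored value is disturbed, not the hardware that holds it.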

3. Radish

This paper implements Radish, a system that hardens a program against soft errors. Radish enhances the resilience of the program to soft errors by inserting assertions into the source code. The assertions are based on program invariants. If the condition of an assertion is not satisfied during the execution, the execution is stopped and a warning reports the occurrence of a soft error.

The input of Radish is a C source file, and the output is a new C source file. The new source file can be compiled and executed just like the original source file. The two are identical in functionality but differ in reliability.

This section introduces the workflow of Radish, which can be divided into three phases: preprocessing, detecting, and selecting. Figure 2 shows the details of each phase. In the preprocessing phase, we extract the execution profiles $\Gamma$ of the critical program points. Then $\Gamma$ is used to extract potential invariants in the detecting phase. After that, the invariants $Q_{pot}$ are obtained, and a fraction of them are converted to assertions in the selecting phase. In the end, the hardened source code is output. We describe each phase below.

3.1. Preprocessing Phase. In the preprocessing phase, we find the critical program points and extract their execution profiles. The profiles are used to extract invariants in the next phase. Finally, assertions will be placed at those program points to prevent faults from propagating. The SDC coverage varies with the program points at which assertions are placed; therefore, we analyze the propagation of SDC and find the critical program points for propagation. A fault may propagate through data flow or control flow to incur SDC. Because the two categories of propagation differ, we analyze and search for their critical program points separately.

When a fault propagates through data flow, the same static instructions are executed as in the fault-free execution, but the data that the instructions read or write are corrupted. To incur SDC, the corrupted data need to be transmitted to other functions, especially the output function. Only connector instructions can perform this operation; thus they must be executed, and the data they transmit are corrupted. This makes the connector instructions efficient points for fault detection, and therefore they are selected as the critical program points against data flow propagation.

Next, we discuss fault propagation in control flow. The compare instruction performs a comparison between two values, and the result of the comparison sets bits of the flag register, which determine the subsequent jump performed by a branch instruction. Propagation through control flow means that an erroneous jump is performed by a branch instruction. Assume that in the fault-free execution $i_k$ is a branch instruction and the next instruction is $i_{k+1} \in b_u$, which means $i_k$ chooses $b_u$ as the next basic block. When the flag register is corrupted by a soft error, $i_{k+1} \in b_w$, which means $i_k$ chooses the erroneous branch $b_w$ instead of $b_u$. To avoid this, we should check whether the right branch is taken after the execution of $i_k$. Therefore, branch instructions are selected as the critical program points of control flow propagation.

Figure 2: The workflow of Radish. (Flow diagram: the source code enters the preprocessing phase, which produces $\Theta_{cri}$ and $\Gamma$; the detecting phase derives $V_\theta$, the ordered sets $P^1_\theta$, $P^2_\theta$, $P^3_\theta$, their value sets $N(P^1_\theta)$, $N(P^2_\theta)$, $N(P^3_\theta)$, and $Q_{pot}$; the selecting phase yields $Q_{ass}$, whose assertions are inserted to give the hardened source code.)

According to the analysis above, the critical program points of data flow and control flow propagation are connector instructions and branch instructions. It takes two steps to extract the execution profiles of the critical program points.

Step 1. We compile the source code and translate it into an assembly file and then locate connector instructions and branch instructions in the assembly file. Their program points are recorded and added to the program point set $\Theta_{cri}$.

Step 2. The execution profile is acquired by using Kvasir [16]. Kvasir executes C and C++ programs and creates data trace files of variables and their values by examining the operation of the binary at run time. Using Kvasir makes it possible to interrupt the program's execution and read the values of all variables visible at the program points of interest. Once it finishes executing, we get the profiles $\Gamma$ at the target program points in $\Theta_{cri}$.

3.2. Detecting Phase. In the detecting phase, the ordered sets of variables and the corresponding ordered sets of values are generated according to the execution profiles $\Gamma$. We check whether the values satisfy any relationship of $R$ listed in Table 1. The detecting phase has 4 steps in total.

Step 1. For each program point $\theta$ of $\Theta_{cri}$, we get the set of accessible variables $V_\theta$ from the execution profile $\Gamma$. Then the unary, binary, and ternary ordered sets $P^1_\theta$, $P^2_\theta$, and $P^3_\theta$ are created. The superscript digits refer to the number of variables in the ordered set. For example, $P^2_\theta$ is an arrangement of two variables in $V_\theta$, that is, $P^2_\theta(v_k, v_j) = \{ \langle v_k, v_j \rangle \mid v_k \in V_\theta \wedge v_j \in V_\theta \}$.

Step 2 Find the corresponding values of the variables appear-ing in1198751120579 119875

2120579 1198753120579 and generate the ordered sets of value119873(1198751120579 )

119873(1198752120579 )119873(1198753120579 ) For example119873(1198752120579 (V119896 V119895)) is the ordered value

set of 1198752120579 (V119896 V119895) 119873(1198752120579 (V119896 V119895)) = ⟨119897119906 119897119908⟩ | ⟨120579 V119896 119897119906⟩ isinΓ and ⟨120579 V119895 119897119908⟩ isin Γ

Step 3. For the relationships that have undetermined parameters, we use a part of the ordered set of values to calculate those parameters. Thus the entire expression is determined.

Step 4. Test if each element of the ordered set of values satisfies the condition of the relationship. If so, then create a new invariant and put it into the potential invariant set $Q_{\mathrm{pot}}$.

Take a binary relationship $r_{\mathrm{lin}} = \{\langle x, y \rangle \mid y = ax + b\}$ as an example. We shall show each step of the detecting phase. Since it is a binary relationship, only the binary ordered sets $P^2_\theta$ and $N(P^2_\theta)$ are considered in this example.

In the first step, we get $V_\theta = \{v_k \mid \exists l,\ \langle \theta, v_k, l \rangle \in \Gamma\}$ by searching the execution profile $\Gamma$. Then $P^2_\theta(v_k, v_j) = \{\langle v_k, v_j \rangle \mid v_k \in V_\theta \wedge v_j \in V_\theta\}$ is obtained by creating the arrangement of every two variables in $V_\theta$.

In the second step, $N(P^2_\theta(v_k, v_j)) = \{\langle l_u, l_w \rangle \mid \langle \theta, v_k, l_u \rangle \in \Gamma \wedge \langle \theta, v_j, l_w \rangle \in \Gamma\}$ is obtained by finding the values of $v_k$ and $v_j$ in the execution profile $\Gamma$. There may be many value pairs of $v_k$ and $v_j$ because certain code sections can be invoked many times in a single execution and each invocation produces one value instance.

In the third step, we calculate the parameters $a$ and $b$ in $r_{\mathrm{lin}}$. To this end, we need to use at least 2 elements of $N(P^2_\theta(v_k, v_j))$. Assuming the two elements are $\langle l_1, l_2 \rangle$ and $\langle l_3, l_4 \rangle$, it is easily obtained that $a = (l_4 - l_2)/(l_3 - l_1)$ and $b = (l_2 l_3 - l_1 l_4)/(l_3 - l_1)$.

In the last step, all elements in $N(P^2_\theta(v_k, v_j))$ are checked for whether they satisfy $r_{\mathrm{lin}} = \{\langle x, y \rangle \mid y = ((l_4 - l_2)/(l_3 - l_1))x + (l_2 l_3 - l_1 l_4)/(l_3 - l_1)\}$. If all of them pass this validation, the invariant $\langle \theta, \langle v_k, v_j \rangle, r_{\mathrm{lin}} \rangle$ holds and it is added to the potential invariant set $Q_{\mathrm{pot}}$.
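The four steps of the linear example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the profile layout (a dictionary from program point to per-variable value lists), the function names, and the choice to pair values recorded on the same visit are all hypothetical and are not taken from Radish itself. Exact rational arithmetic is used so that the validation in the last step does not suffer from rounding.

```python
from fractions import Fraction
from itertools import permutations


def fit_linear(first, second):
    """Return (a, b) of y = a*x + b through two samples, or None if x repeats."""
    (l1, l2), (l3, l4) = first, second
    if l3 == l1:
        return None
    a = Fraction(l4 - l2, l3 - l1)
    b = Fraction(l2 * l3 - l1 * l4, l3 - l1)
    return a, b


def detect_linear_invariants(profile):
    """profile: {program point: {variable name: [one value per visit]}}."""
    potential = []  # stands in for Q_pot
    for theta, values in profile.items():
        # Step 1: every ordered pair of distinct accessible variables.
        for vk, vj in permutations(values, 2):
            # Step 2: pair up the values observed on the same visit (assumption).
            samples = list(zip(values[vk], values[vj]))
            if not samples:
                continue
            # Step 3: fix a and b from two samples with different x.
            first = samples[0]
            second = next((s for s in samples if s[0] != first[0]), None)
            if second is None:
                continue
            a, b = fit_linear(first, second)
            # Step 4: keep the invariant only if every sample satisfies it.
            if all(y == a * x + b for x, y in samples):
                potential.append((theta, (vk, vj), (a, b)))
    return potential


# Hypothetical profile for one program point with three variables.
profile = {0x10: {"tmp1": [1, 2, 5], "tmp2": [2, 3, 6], "n": [7, 1, 4]}}
found = detect_linear_invariants(profile)
```

With this toy profile only the pairs (tmp1, tmp2) and (tmp2, tmp1) survive the validation; every pair involving n is rejected in the last step.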

3.3. Selecting Phase. It is often observed that the number of elements of the potential invariant set $Q_{\mathrm{pot}}$ is very large. If all of them were converted into assertions and inserted into the source file, they would incur very high performance overhead. In the selecting phase, proper invariants are selected according to their capability of detecting SDC. Heuristics about selection criteria are formulated on the basis of the propagation of SDC. These heuristics are generic and can be applied to any invariants. We list the heuristics first and then describe the selecting steps.

Heuristic 1. There are certain types of variables that should be monitored at each target program point.

A fraction of variables are capable of telling if the execution is going well, and thus monitoring these variables is able to detect SDC. The target program point set $\Theta_{\mathrm{cri}}$ can be categorized into program points of connector instructions and branch instructions. At the program points of connector instructions, it is the connector variables that should be given special attention, since they reflect whether the results of functions are correct. At the program points of branch instructions, branch-controlling variables, which appear in the statement of an if, while, or for structure, reflect the status of these structures and thus should be monitored. Therefore, for all target program points, we find certain variables to monitor.

Heuristic 2. The likelihood of detecting SDC increases if the number of valid values defined by an assertion decreases.

Invalid values cannot pass the examination of assertions in the presence of a soft error. Therefore, having more invalid values (fewer valid values) means the likelihood of detecting SDC increases. The number of valid values of an invariant is determined by its relationship. An equality relationship, using "=" as operator, has only one valid value; then come inclusion ($\in$, $\subseteq$), range ($>$, $<$), and inequality ($\neq$) relationships in order of ascending number of valid values.
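A small experiment makes the ordering in Heuristic 2 concrete. The sketch below is an illustration only: it assumes an 8-bit variable whose correct value is 5, corrupts it with every possible single bit flip, and counts how many corruptions each kind of assertion rejects. The concrete assertions and constants are hypothetical.

```python
def single_bit_flips(value, width=8):
    """All values reachable from `value` by flipping exactly one bit."""
    return [value ^ (1 << bit) for bit in range(width)]


def detected_fraction(assertion, value, width=8):
    """Fraction of single-bit corruptions of `value` that fail the assertion."""
    corrupted = single_bit_flips(value, width)
    caught = sum(1 for v in corrupted if not assertion(v))
    return caught / len(corrupted)


# One hypothetical assertion per relationship category, all true for x = 5.
assertions = {
    "equality": lambda x: x == 5,
    "inclusion": lambda x: x in {1, 5, 9},
    "range": lambda x: x < 16,
    "inequality": lambda x: x != 0,
}
rates = {name: detected_fraction(check, 5) for name, check in assertions.items()}
```

For this value the equality assertion rejects all eight corruptions, the inclusion assertion seven of them, the range assertion four, and the inequality assertion none, which follows the ordering stated above.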

Heuristic 3. The likelihood of detecting SDC increases if more variables are included in an assertion.

The more variables appear in an assertion, the more variables it can monitor. If any of the variables gets corrupted due to a soft error, the assertion will be able to catch the error. Thus, having more variables in an assertion leads to higher coverage of SDC. So far, the largest number of variables is 3, which corresponds to ternary relationships.

Utilizing these heuristics, we are able to reduce the number of invariants and obtain more effective assertions. The selecting phase has three steps.

Step 1. The invariants which contain connector variables at the program points of connector instructions or branch-controlling variables at the program points of branch instructions are selected based on Heuristic 1.

Step 2. The invariants with the relationship that has fewer valid values are picked according to Heuristic 2.

Step 3. The invariants which contain the largest number of variables are selected according to Heuristic 3.

The selecting process continues until only one invariant is left or all the steps have been performed. Then we convert the chosen invariants into assertions, which is basically a string conversion problem. For brevity's sake, we do not discuss it in this paper. Finally, we include the assertion header file at the beginning of the new source file to make sure the assertions can work.
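The three selecting steps amount to a chain of filters that stops early once a single invariant remains. The following Python sketch shows one way to express this for the candidates of a single program point. The `Invariant` record, the relation names, and the numeric ranks are assumptions made for illustration; only the ordering of the ranks follows Heuristic 2.

```python
from dataclasses import dataclass

# Smaller rank means fewer valid values (ordering taken from Heuristic 2).
RELATION_RANK = {"equality": 0, "inclusion": 1, "range": 2, "inequality": 3}


@dataclass(frozen=True)
class Invariant:
    point: str        # program point identifier
    variables: tuple  # names of the variables in the invariant
    relation: str     # a key of RELATION_RANK


def keep_monitored(candidates, monitored):
    """Step 1: keep invariants that mention a connector or branch variable."""
    return [q for q in candidates if set(q.variables) & monitored]


def keep_tightest(candidates):
    """Step 2: keep invariants whose relationship has the fewest valid values."""
    best = min(RELATION_RANK[q.relation] for q in candidates)
    return [q for q in candidates if RELATION_RANK[q.relation] == best]


def keep_widest(candidates):
    """Step 3: keep invariants that cover the largest number of variables."""
    most = max(len(q.variables) for q in candidates)
    return [q for q in candidates if len(q.variables) == most]


def select(candidates, monitored):
    chosen = list(candidates)
    steps = (lambda c: keep_monitored(c, monitored), keep_tightest, keep_widest)
    for step in steps:
        if len(chosen) <= 1:  # stop once at most one invariant is left
            break
        chosen = step(chosen)
    return chosen


candidates = [
    Invariant("p1", ("ret",), "range"),
    Invariant("p1", ("ret", "x"), "equality"),
    Invariant("p1", ("ret", "x", "y"), "equality"),
    Invariant("p1", ("tmp",), "equality"),
]
selected = select(candidates, monitored={"ret"})
```

Here the invariant on tmp is dropped in Step 1, the range invariant in Step 2, and the binary equality in Step 3, leaving the ternary equality on ret, x, and y.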

4. Radish_D

The assertions generated by Radish cannot fully monitor all the variables and program points; thus, certain faults might propagate through unprotected code sections. To further increase the coverage of SDC, we introduce a software-based instruction duplication mechanism to protect the code sections that are not covered by Radish.

This paper utilizes the instruction duplication mechanism of SWIFT [15] for comparison and also for our own duplication in Radish_D. SWIFT duplicates all computation instructions along the path of replication, and the replica instructions use different registers and different memory locations. At certain synchronization points, comparison instructions are inserted to check if the original instructions and their replicas have identical values.

Rather than deploying the full instruction duplication mechanism of SWIFT, Radish_D applies a selective instruction duplication mechanism. Because a portion of the instructions have been protected by assertions, we only need to duplicate the others.

Before deploying duplications, we need to determine which variables are safe under the protection of assertions. An assertion is capable of protecting the variables which appear in its statement. However, the protection does not last for the entire lifetime of those variables. Only the fraction from the beginning of the local function until the variable's host assertion is considered safe, since the variable's value is checked during the execution of the assertion.

We partition each variable's lifetime by assertions and identify the safe periods. Then duplications are deployed at the instruction level. The targets of duplication are the instructions which do not contain a variable in its safe period as an operand. A replica instruction is created by copying the opcode and operands of the original instruction. The destination operand is changed into an unused register, which serves as the copy of the original destination operand. Next, we decide if there is a need to change the source operands of the replica instruction. If there is already a copy of a source operand, which means this source operand was some instruction's destination operand and thus got a copy, we replace the replica instruction's source operand with its copy. The replica instruction is inserted before the original instruction in the same basic block.

Besides, store, branch, and call instructions are chosen as the synchronization points. If any source operand of these instructions has a copy, we compare its value with that of its copy by inserting a compare instruction. According to the type of the operand (int or float), the compare instruction can be either an icmp instruction or an fcmp instruction. After the compare instruction, a branch instruction using the predicate neq is inserted into the code. If the two values show a discrepancy, it will jump to a function called faultDetected; otherwise, it will continue to execute the original store, branch, or call instruction. The function faultDetected outputs error messages and returns with an exit code, which informs the system of the soft error and ends the execution.

We use an example to show the distinction between our method and the full instruction duplication mechanism in Figure 3. For consistency, we make use of the LLVM [19] assembly language to present the assembly code. Figure 3(a) shows the original assembly code and Figure 3(b) shows the assembly code after full instruction duplication. It can be found in Figure 3(b) that R3' is the replica of R3 and the duplication is accomplished by the instruction i2. Similarly, R5' is the replica of R5 through the duplication by i4. i9 is the synchronization point, and the source operands of i9, R3 and R5, need to be examined. i5 and i7 compare R3 and R5 with their replicas separately. If the values of R3 and R3' are not equal, i6 will call faultDetected to report a soft error.

(a) Original assembly code

    i1: R3 = xor R1, R2
    i2: R5 = add R4, R2
    i3: R6 = icmp eq R3, R5
    i4: br R6, label b2

(b) Assembly code after full instruction duplication

    i1:  R3 = xor R1, R2
    i2:  R3' = xor R1', R2'
    i3:  R5 = add R4, R2
    i4:  R5' = add R4', R2'
    i5:  R7 = icmp neq R3, R3'
    i6:  br R7, label faultDetected
    i7:  R8 = icmp neq R5, R5'
    i8:  br R8, label faultDetected
    i9:  R6 = icmp eq R3, R5
    i10: br R6, label b2

(c) Assembly code of Radish_D

    i1: R3 = xor R1, R2
        assert(R1 > R3)
    i2: R5 = add R4, R2
    i3: R5' = add R4', R2'
    i4: R7 = icmp neq R5, R5'
    i5: br R7, label faultDetected
    i6: R6 = icmp eq R3, R5
    i7: br R6, label b2

Figure 3: A sample assembly code before and after transformation of full instruction duplication and Radish_D.

The assembly code generated by Radish_D is shown in Figure 3(c). Assume that we have already obtained an assertion about R1 and R3 by utilizing Radish, which is shown in the line of code "assert(R1 > R3)". Due to the assertion, R1 and R3 are considered safe during the execution of this example. R1' and R3' are no longer necessary, and the instructions used for their duplication are eliminated. Variables other than R1 and R3 still need to be duplicated and checked; thus R5 is duplicated by i3 and checked at the synchronization point i6. The efficiency of Radish_D and the full instruction duplication mechanism will be examined in the next section.
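The difference between full and selective duplication in Figure 3 can be reproduced with a toy transformation over a list of instruction tuples. The sketch below is an assumption-laden illustration, not the LLVM pass used in the paper: instructions are plain `(dest, op, sources)` tuples, the safe set is treated as valid for the whole snippet rather than for a partitioned lifetime, replicas are emitted after the original as drawn in the figure, and the comparison feeding the branch is treated as a synchronization point to mirror the figure. All names are hypothetical.

```python
# "icmp eq" is included so that the snippet behaves like Figure 3.
SYNC_OPS = {"icmp eq", "br", "store", "call"}


def harden(instructions, safe, copies):
    """Duplicate unprotected instructions and check copies at sync points.

    instructions: list of (dest, op, sources) tuples.
    safe: registers assumed to be covered by an assertion in this snippet.
    copies: shadow registers that already exist on entry, e.g. {"R2": "R2'"}.
    """
    copies = dict(copies)
    hardened = []
    checks = 0
    for dest, op, sources in instructions:
        if op in SYNC_OPS:
            # Compare every source that has a shadow copy before the sync point.
            for src in sources:
                if src in copies:
                    flag = f"chk{checks}"
                    checks += 1
                    hardened.append((flag, "icmp neq", (src, copies[src])))
                    hardened.append((None, "br", (flag, "faultDetected")))
            hardened.append((dest, op, sources))
            continue
        hardened.append((dest, op, sources))
        if safe & ({dest} | set(sources)):
            continue  # an operand is protected by an assertion: no replica
        shadow = dest + "'"
        shadow_sources = tuple(copies.get(src, src) for src in sources)
        hardened.append((shadow, op, shadow_sources))
        copies[dest] = shadow
    return hardened


snippet = [
    ("R3", "xor", ("R1", "R2")),
    ("R5", "add", ("R4", "R2")),
    ("R6", "icmp eq", ("R3", "R5")),
    (None, "br", ("R6", "b2")),
]
full = harden(snippet, safe=set(), copies={"R1": "R1'", "R2": "R2'", "R4": "R4'"})
radish_d = harden(snippet, safe={"R1", "R3"}, copies={"R2": "R2'", "R4": "R4'"})
```

Under these assumptions the full variant grows from four to ten instructions and the selective variant to seven, matching the instruction counts of Figures 3(b) and 3(c).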

5. Experiment

This paper applies fault injection experiments to validate the effectiveness of Radish and Radish_D. The fault injection experiment is performed on the original executable first. The hardened executables using Radish, Radish_D, and full instruction duplication are targeted subsequently. We compare the results of the fault injection experiments and calculate the SDC coverage and performance overhead. To ensure a fair comparison among these mechanisms, we use a metric called the SDC detection efficiency, which is defined in prior work [9] as the ratio between SDC coverage and overhead for a detection mechanism.

The platform for validation is Ubuntu 14.04 (AMD64 architecture). LLFI [20] is applied to perform fault injections. LLFI is an LLVM-based fault injection tool. The source code is translated into an intermediate representation (IR), and faults are then injected into the IR code. The faults can be injected into specific program points, and the effect can be easily tracked back to the source code. LLFI is configured to inject faults into destination registers. In a single fault injection, LLFI randomly picks one instruction and injects one soft error into its destination operand. One fault injection experiment consists of 1000 such fault injections. The injected faults may affect data flow or control flow. We take the following LLVM IR code to explain the effect on control flow:

(1) %judge1 = icmp ne i32 1, 2

(2) br i1 %judge1, label %BB1, label %BB2

judge1 determines the outcome of the branch. If a fault is injected into judge1, the branch instruction may choose the wrong branch and thus affect control flow. The mechanism of full instruction duplication is implemented by developing a new pass under the LLVM infrastructure. The pass is also used by Radish_D for the operation of instruction duplication by modifying certain conditions for duplication.

Figure 4: The comparison of performance overheads among full instruction duplication, Radish, and Radish_D. [Bar chart; x-axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng; y-axis: performance overhead (%), 0 to 120; series: Duplication, Radish, Radish_D.]
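Returning to the two-line IR example with judge1, the control-flow effect of a single injected fault can be mimicked in plain Python. This is a minimal sketch and not LLFI itself: `flip_bit` and `taken_branch` are hypothetical helpers, and the one-bit register width simply reflects the `i1` type of the branch condition.

```python
def flip_bit(value, bit, width):
    """Flip one bit of a register value that is `width` bits wide."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def taken_branch(judge1):
    """Model of: br i1 judge1, label BB1, label BB2."""
    return "BB1" if judge1 & 1 else "BB2"


golden = int(1 != 2)                     # icmp ne i32 1, 2 evaluates to true
faulty = flip_bit(golden, 0, width=1)    # single bit flip in the destination

golden_target = taken_branch(golden)
faulty_target = taken_branch(faulty)
```

The fault-free run proceeds to BB1, while the run with the flipped condition bit is diverted to BB2, which is the wrong-branch behaviour described above.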

The programs used for evaluation are from the MiBench benchmark suite [21]. These programs are qsort (which performs the quick sort algorithm), isqrt (which is a base-two analogue of the square root algorithm), cubic (which solves a cubic polynomial), rad2deg (which converts between radians and degrees), crc (which computes a 32-bit CRC to detect accidental changes to raw data), and bitstrng (which prints the bit pattern of bytes formatted to string). These are C programs consisting of a few hundred lines of C code. We use 25 inputs to extract invariants and randomly choose one input for the injection.

5.1. Comparison between Radish and Full Instruction Duplication. Figure 4 shows the performance overheads of Radish and full instruction duplication. We use the execution time of the original program as the baseline for comparison. Compared with the baseline, the average overhead incurred by Radish is 30.4% and the overhead incurred by full instruction duplication is 52.8%. The overhead of the full instruction duplication mechanism is 22.4% higher than the overhead of Radish for the studied programs.

Figure 5 shows the SDC coverages. The average SDC coverage of Radish is 77.1% and that of full instruction duplication is 84.3%. The average SDC coverage of full instruction duplication is 7.2% higher than that of Radish. For most of the benchmarks, the SDC coverages of full instruction duplication and Radish are very close.

Full instruction duplication does not achieve nearly 100% coverage since it does not check the result of store and branch instructions. For example, in Figure 3(b), which denotes the full instruction duplication, if R6 in i9 is injected, i10 is affected and may choose the wrong branch. SWIFT [15] raises the coverage to nearly 100% since it assumes that the hardware applies ECC and it adds a control flow checking mechanism. The SDC detection efficiency can be observed in Figure 6. Radish has higher SDC detection efficiency, which is 1.6 times as much as that of full instruction duplication. This is because the mechanism of full instruction duplication protects all instructions executed, which yields high SDC coverage at very high overhead. Radish, however, obtains relatively high SDC coverage with much lower overhead. Radish achieves this by curbing the number of program points that generate assertions. Further, the execution cost of assertions is relatively low, and assertions have good SDC coverage since they are seldom satisfied when soft errors occur.

5.2. The Experimental Results of Radish_D. The average SDC coverage of Radish_D is 92.5%, which is 8.2% higher than that of full instruction duplication and 15.5% higher than that of Radish. It can be validated that the instruction duplication of Radish_D protects unsafe code sections that are not covered by assertions. Radish_D may generate assertions that check the variable which is stored in memory after the store instruction (see Heuristic 1). Moreover, at the program points of branch instructions, branch-controlling variables are checked. Therefore, the assertions of Radish_D catch some of the faults that escape the detection of the duplication mechanism, and the coverage of Radish_D is higher than that of full instruction duplication.

The average overhead of Radish_D is 76.3%, lower than the sum of the overheads of full instruction duplication and Radish, because we eliminate the duplication deployed to the instructions that have already been protected by assertions.

The average SDC detection efficiency of Radish_D is lower than that of full instruction duplication or Radish. For Radish_D, there are overlapping soft errors that can be detected by both instruction duplication and assertions. For these soft errors, the overhead increases by deploying instruction duplication but the SDC coverage does not increase. The SDC detection efficiency is the ratio between SDC coverage and overhead, and thus it is lowered.
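As a rough cross-check of the efficiency comparison, the metric can be recomputed from the reported averages. The sketch below reads the averages quoted in Sections 5.1 and 5.2 as percentages (coverage 77.1, 84.3, 92.5 and overhead 30.4, 52.8, 76.3) and divides them directly. This is an approximation under that assumption: the values plotted in Figure 6 may have been averaged per benchmark, so they need not coincide exactly with these ratios.

```python
# (average SDC coverage in %, average performance overhead in %)
reported = {
    "Radish": (77.1, 30.4),
    "Full duplication": (84.3, 52.8),
    "Radish_D": (92.5, 76.3),
}

# SDC detection efficiency = SDC coverage / overhead.
efficiency = {name: cov / ovh for name, (cov, ovh) in reported.items()}

# How many times more efficient Radish is than full duplication.
radish_vs_duplication = efficiency["Radish"] / efficiency["Full duplication"]
```

The resulting ratio rounds to 1.6, and the efficiency of Radish_D comes out below that of both other mechanisms, consistent with the text.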

5.3. False Positives of Invariants. A false positive for an input can occur when the values at the assertion points for this input do not satisfy the condition of the assertion learned from the training inputs. We use 25 inputs for training and 100 inputs for testing. No faults are injected in these runs. We test all the programs that were used to evaluate SDC coverage in the fault injection experiment. The result shows that the average false positive rate of the studied programs is 4.8%.


Figure 5: The comparison of SDC coverages among full instruction duplication, Radish, and Radish_D. [Bar chart; x-axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng; y-axis: SDC coverage (%), 0 to 100; series: Duplication, Radish, Radish_D.]

Figure 6: The comparison of SDC detection efficiencies among full instruction duplication, Radish, and Radish_D. [Bar chart; x-axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng; y-axis: SDC detection efficiency, 0 to 4; series: Duplication, Radish, Radish_D.]

We also conduct an experiment to examine the effect of the training set size. The result for qsort is shown in Figure 7. The training set consists of 25, 50, and 75 inputs, and false positives are computed across 100 inputs.

The false positive rate decreases from 5% to 3% as the training set size is increased from 25 to 50, and to 2% for 75 inputs. The SDC coverage also decreases as the training set increases from 25 to 75 inputs. The impact on both SDC coverage and false positive rate from increasing the training set size is significant. Hence, we should choose the training set size according to the user's target. If the user specifies bounds on SDC coverage and overhead, then, by turning the false positive rate into overhead, we can choose a training set size that achieves the target.
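The relation between training set size and false positives can be illustrated with a toy range invariant. The sketch below is not the paper's experiment: `program` is a hypothetical stand-in for a benchmark, the inputs are pseudo-random with a fixed seed, and the only invariant learned is the minimum and maximum output seen during training. Because the three training sets are nested, the learned range can only widen, so the measured false positive rate cannot increase with training set size.

```python
import random


def program(x):
    """Hypothetical stand-in for a benchmark's monitored output."""
    return (x * x) % 1000


def learn_range(training_inputs):
    """Learn a bounded-range invariant from fault-free training runs."""
    outputs = [program(x) for x in training_inputs]
    return min(outputs), max(outputs)


def false_positive_rate(bounds, test_inputs):
    """Fraction of fault-free test runs that violate the learned range."""
    low, high = bounds
    alarms = sum(1 for x in test_inputs if not low <= program(x) <= high)
    return alarms / len(test_inputs)


rng = random.Random(0)
pool = [rng.randrange(10_000) for _ in range(75)]         # training pool
test_inputs = [rng.randrange(10_000) for _ in range(100)]  # held-out inputs

rates = {
    size: false_positive_rate(learn_range(pool[:size]), test_inputs)
    for size in (25, 50, 75)
}
```

A wider learned range also accepts more corrupted values, which is one way to see why SDC coverage drops as the training set grows.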

Besides, reexecution can reduce the overhead incurred by false positives. When an assertion raises an alarm, we can determine if it is a false positive by reexecuting it. If the assertion raises an alarm again, it is a false positive, since a transient soft error is unlikely to recur during the reexecution. In this case, the alarm can be ignored and the program can continue.
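The reexecution idea can be sketched as a small wrapper around a protected code region. Everything here is hypothetical scaffolding for illustration: `region` stands for the code guarded by an assertion, `assertion` for the learned check, and `make_faulty_once` simulates a transient fault that corrupts only the first execution.

```python
def run_with_reexecution(region, value, assertion):
    """Run a region; on an alarm, rerun it once to classify the alarm."""
    result = region(value)
    if assertion(value, result):
        return result, "ok"
    retry = region(value)
    if assertion(value, retry):
        # The alarm vanished on reexecution: it was caused by a transient fault.
        return retry, "soft error"
    # The alarm persists without a fault: the learned assertion is too strict.
    return retry, "false positive"


def make_faulty_once(region, bit):
    """Wrap a region so that its first result suffers a single bit flip."""
    state = {"injected": False}

    def wrapped(value):
        result = region(value)
        if not state["injected"]:
            state["injected"] = True
            return result ^ (1 << bit)
        return result

    return wrapped


def double(x):
    return 2 * x


def learned_assertion(x, y):
    return y <= 100  # hypothetical range learned from training inputs


clean = run_with_reexecution(double, 10, learned_assertion)
transient = run_with_reexecution(make_faulty_once(double, 7), 10, learned_assertion)
persistent = run_with_reexecution(double, 75, learned_assertion)
```

In this sketch the transient case returns the correct result after the rerun, while the persistent alarm for the untrained input is classified as a false positive and can be ignored.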

From the discussion above, it can be concluded that Radish_D has higher SDC coverage than Radish or full instruction duplication. But its overhead is also higher, which suggests that Radish_D should be used in situations where SDC coverage has priority over overhead. Further, the SDC detection efficiency of Radish is far higher than that of Radish_D or full instruction duplication, which means it is more cost-effective. But Radish may incur overhead due to false positives. Users can choose Radish or Radish_D according to their preferred tradeoff between SDC coverage and performance overhead.

6. Related Work

Prior research [8, 22, 23] applies invariants with a single variable, and most of the invariants are based on a bounded range. We apply invariants with more variables, which can achieve better coverage on many occasions. For example, we can always extract an invariant $n - k + 1 = 0$ from a typical loop structure shown as follows:

    for (k = 1; k <= n; k++) {
        ...
    }
    ← Extracting invariant from here


Figure 7: The SDC coverage and false positive rate for varied training set sizes. [Chart for qsort; x-axis: training set size (25, 50, 75); y-axes: SDC coverage (0.72 to 0.78) and false positive rate (0 to 0.06); series: SDC coverage, False positive.]

It is found that assert($n - k + 1 = 0$) is often better than the bounded-range-based invariant assert($k_{\min} \le k \le k_{\max}$) at detecting errors, since assert($n - k + 1 = 0$) checks both $n$ and $k$ while assert($k_{\min} \le k \le k_{\max}$) only checks $k$.
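The comparison between the two assertions can be tried out on the loop itself. The sketch below is illustrative only: the range bounds are learned from hypothetical training runs with n from 1 to 20, and the corruptions are single bit flips applied to n or k after the loop has finished.

```python
def run_loop(n):
    """Same shape as: for (k = 1; k <= n; k++) { ... }"""
    k = 1
    while k <= n:
        k += 1
    return n, k


def relation_holds(n, k):
    return n - k + 1 == 0


def range_holds(k, k_min, k_max):
    return k_min <= k <= k_max


def flip(value, bit):
    return value ^ (1 << bit)


# Hypothetical training runs give the learned range for k at the loop exit.
exits = [run_loop(n)[1] for n in range(1, 21)]
K_MIN, K_MAX = min(exits), max(exits)

n, k = run_loop(10)

# Corrupt n: only the relational assertion can notice.
n_fault_caught_by_relation = not relation_holds(flip(n, 2), k)
n_fault_caught_by_range = not range_holds(k, K_MIN, K_MAX)

# Corrupt a low bit of k: the value stays inside the learned range.
k_fault_caught_by_relation = not relation_holds(n, flip(k, 0))
k_fault_caught_by_range = not range_holds(flip(k, 0), K_MIN, K_MAX)
```

The range assertion still catches large deviations of k, such as a flip of a high-order bit, but it misses the small corruption of k and every corruption of n that the relational assertion detects.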

A typical criterion for the selection of detectors is defined in [22]: the tightness is the probability that the detector detects an error given that there is an error in the value of the variable that it checks. The notion of tightness is based on the value of a single variable. The invariants in this paper may include 2 or 3 variables, and the notion of tightness cannot be used to describe an invariant with more than one variable. For example, if $x$ is flipped in the invariant $x < y$, since there are multiple possible values of $y$, it cannot be decided whether the invariant is still satisfied, and thus the tightness cannot be calculated. Since the tightness cannot be used, we apply certain heuristics to choose invariants, and they prove to be effective.

7. Conclusion

To address the problem of detecting SDC, we propose an approach which applies invariant-based assertions and implement a system called Radish. Radish neither requires any hardware modifications to add error detection capability to the original system nor needs knowledge of the semantics of the program, and thus it scales well. Experiments show that Radish achieves high SDC coverage with very low overhead.

Furthermore, we propose Radish_D by adding instruction duplication to the unsafe code sections which are not covered by assertions. Radish_D achieves higher SDC coverage than Radish or the full instruction duplication mechanism. Both Radish and Radish_D offer feasible alternatives for soft error mitigation.

Competing Interests

The authors declare no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Basic Research Program of China ("973" Project).

References

[1] H. Schirmeier, C. Borchert, and O. Spinczyk, "Avoiding pitfalls in fault-injection based comparison of program susceptibility to soft errors," in Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '15), pp. 319–330, IEEE, Rio de Janeiro, Brazil, June 2015.

[2] A. O. Daniel, L. P. Laercio, S. Thiago et al., "Evaluation and mitigation of radiation-induced soft errors in graphics processing units," IEEE Transactions on Computers, vol. 65, no. 3, pp. 791–804, 2016.

[3] S. S. Mukherjee, J. Emer, and S. K. Reinhardt, "The soft error problem: an architectural perspective," in Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA '05), pp. 243–247, San Francisco, Calif, USA, February 2005.

[4] D. Binder, E. C. Smith, and A. B. Holman, "Satellite anomalies from galactic cosmic rays," IEEE Transactions on Nuclear Science, vol. 22, no. 6, pp. 2675–2680, 1975.

[5] J. Olsen, P. E. Becher, P. B. Fynbo, P. Raaby, and J. Schultz, "Neutron-induced single event upsets in static RAMS observed at 10 km flight altitude," IEEE Transactions on Nuclear Science, vol. 40, no. 2, pp. 74–77, 1993.

[6] J. P. Walters, K. M. Zick, and M. French, "A practical characterization of a NASA SpaceCube application through fault emulation and laser testing," in Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '13), pp. 1–8, June 2013.

[7] S. Mittal and J. S. Vetter, "A survey of techniques for modeling and improving reliability of computing systems," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 4, pp. 1226–1238, 2016.

[8] P. Racunas, K. Constantinides, S. Manne, and S. S. Mukherjee, "Perturbation-based fault screening," in Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture, pp. 169–180, Scottsdale, Ariz, USA, February 2007.

[9] Q. Lu, K. Pattabiraman, M. S. Gupta et al., "SDCTune: a model for predicting the SDC proneness of an application for configurable protection," in Proceedings of the Compilers, Architecture and Synthesis for Embedded Systems, pp. 1–10, Uttar Pradesh, India, 2014.

[10] N. J. Wang and S. J. Patel, "ReStore: symptom-based soft error detection in microprocessors," IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 3, pp. 188–201, 2006.


[11] M.-L. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve, and Y. Zhou, "Understanding the propagation of hard errors to software and implications for resilient system design," ACM SIGARCH Computer Architecture News, vol. 36, no. 1, pp. 265–276, 2008.

[12] N. Oh, P. P. Shirvani, and E. J. McCluskey, "Error detection by duplicated instructions in super-scalar processors," IEEE Transactions on Reliability, vol. 51, no. 1, pp. 63–75, 2002.

[13] M. Shafique, S. Rehman, P. V. Aceituno, and J. Henkel, "Exploiting program-level masking and error propagation for constrained reliability optimization," in Proceedings of the 50th Annual Design Automation Conference (DAC '13), pp. 1–17, Austin, Tex, USA, June 2013.

[14] S. Rehman, M. Shafique, P. V. Aceituno, F. Kriebel, J.-J. Chen, and J. Henkel, "Leveraging variable function resilience for selective software reliability on unreliable hardware," in Proceedings of the 16th Design, Automation and Test in Europe Conference and Exhibition (DATE '13), pp. 1759–1764, Grenoble, France, March 2013.

[15] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, "SWIFT: software implemented fault tolerance," in Proceedings of the International Symposium on Code Generation and Optimization, pp. 243–254, IEEE Computer Society, San Jose, Calif, USA, 2005.

[16] M. D. Ernst, J. H. Perkins, P. J. Guo et al., "The Daikon system for dynamic detection of likely invariants," Science of Computer Programming, vol. 69, no. 1–3, pp. 35–45, 2007.

[17] A. Thomas and K. Pattabiraman, "Error detector placement for soft computation," in Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '13), pp. 1–12, IEEE Computer Society, June 2013.

[18] F. Shuguang, G. Shantanu, A. Amin et al., "Shoestring: probabilistic soft error reliability on the cheap," in Proceedings of the ASPLOS, pp. 385–396, Pittsburgh, Pa, USA, 2010.

[19] C. Lattner and V. Adve, "LLVM: a compilation framework for lifelong program analysis & transformation," in Proceedings of the International Symposium on Code Generation and Optimization (CGO '04), pp. 75–86, San Jose, Calif, USA, March 2004.

[20] A. Thomas and K. Pattabiraman, "LLFI: an intermediate code level fault injector for soft computing applications," in Proceedings of the Workshop on Silicon Errors in Logic System Effects (SELSE '13), pp. 1–8, Palo Alto, Calif, USA, 2013.

[21] M. R. Guthaus, J. S. Ringenberg, D. Ernst et al., "MiBench: a free, commercially representative embedded benchmark suite," in Proceedings of the Workload Characterization, pp. 3–14, Austin, Tex, USA, 2001.

[22] K. Pattabiraman, S. Giacinto, C. Daniel et al., "Dynamic derivation of application-specific error detectors and their implementation in hardware," in Proceedings of the 6th European Dependable Computing Conference (EDCC '06), pp. 97–108, Coimbra, Portugal, October 2006.

[23] S. K. Sahoo, M.-L. Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y. Zhou, "Using likely program invariants to detect hardware errors," in Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '08), pp. 70–79, IEEE Computer Society, Anchorage, Alaska, USA, June 2008.



Table 1: Relationships of invariants considered in this paper.

Category    Expression
Unary       $x = a$, $x > a$, $x < a$, $x \bmod a = 0$, $x \neq 0$, $x \in \{a, b, c\}$, $x[k] < a$, $x[k] > a$
Binary      $y = ax + b$, $x < y$, $x = y$, $x = y^2$, $x[k] < x[k+1]$, $x[k] > x[k+1]$, $x[\,] \subset y[\,]$, $x[k] < y[k]$, $y[k] = ax[k] + b$, $x \in y[\,]$
Ternary     $z = ax + by + c$, $x = y \wedge z$, $x = y \vee z$, $x = \mathrm{Lshift}(y, z)$, $x = \mathrm{Rshift}(y, z)$, $x = \max(y, z)$, $x = \min(y, z)$, $x = y \times z$, $x = y \div z$

that denote dependencies between functions st 119864 = 119890119909119910 |119891119909 call 119891119910 119891119909 isin 119865 119891119910 isin 119865 IN and OUT denote the inputand the output Soft computation [17] is not considered in thispaper therefore if 119865 119864 and IN are determined OUT can beuniquely determined

Definition 2 A function119865 is composed of a set of basic blocks119861 and variables 119881 thus 119865 = 119861 119881 A basic block is a singleentrance single exit sequence of instructions For a singleinstruction 119894119895 119894119895 = ⟨120579 119878 119863⟩ where 119895 denotes the sequencenumber of the dynamic instruction during the execution 120579denotes the program point which equals the offset from thestart position of the assembly file 119878 and 119863 denote the sourceoperands and the destination operands

Definition 3. $\forall i_m \in f_y$, if $\exists i_k \in f_x$, $e_{yx} \in E$ and $i_k \cdot S = i_m \cdot D$, also $\exists i_l \in f_x$, $l < k$ and $i_l \cdot D = i_k \cdot S$, then $i_{\max m}$ in $f_y$ is defined as the connector instruction. Literally, the connector instruction transmits data from one function to another. The variable that a connector instruction writes is defined as the connector variable, $CV = \{v \mid v \in i_{\max m} \cdot D\}$. Connector variables include function argument variables, function return variables, and global variables. By definition, the connector instruction is the last instruction in the function to write a connector variable.

Definition 4. The execution profile is denoted by $\Gamma$, which is given as a tuple $\Gamma = \langle \theta, V, L \rangle$. The execution profile defines the values of the variables at given program points. $\theta$ represents the given program point, and $L$ is the acquired value set of the variables that appear in $V$.

Definition 5. An invariant $Q$ is defined as $Q = \langle \theta, \psi, r \rangle$, where $\theta$ represents the program point, $\psi = \langle v_1, \ldots, v_j, \ldots, v_n \rangle$ is the ordered set of variables, and $r$ represents the relationship of the variables that appear in $\psi$. $r \in R$, where $R$ is the relationship set considered in this paper, shown in Table 1. $R$ can be categorized into unary, binary, and ternary relationships.

For instance, suppose an invariant $q_1 = \langle \theta_1, \psi_1, r_1 \rangle$, where $\theta_1 = \mathrm{0x10}$, $\psi_1 = \langle \mathrm{tmp1}, \mathrm{tmp2} \rangle$, and $r_1 = \{\langle x, y \rangle \mid y = x + 1\}$. $q_1$ represents that at the program point 0x10 the ordered set $\langle \mathrm{tmp1}, \mathrm{tmp2} \rangle \in r_1$; that is, tmp1 and tmp2 satisfy the condition tmp2 = tmp1 + 1.
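As an illustration of Definition 5, the following minimal Python sketch encodes the example invariant as a triple and evaluates it against observed values. The class name `Invariant`, the helper `holds`, and the dictionary of observed values are hypothetical and are not part of Radish.

```python
from typing import Callable, NamedTuple, Tuple


class Invariant(NamedTuple):
    theta: int              # program point (offset in the assembly file)
    psi: Tuple[str, ...]    # ordered set of variable names
    r: Callable[..., bool]  # relationship over the values of psi


def holds(q: Invariant, theta: int, values: dict) -> bool:
    """Return True if q is satisfied by the values observed at program point theta."""
    if theta != q.theta:
        return True  # q constrains only its own program point
    return q.r(*(values[name] for name in q.psi))


# q1 = <0x10, <tmp1, tmp2>, {<x, y> | y = x + 1}>
q1 = Invariant(theta=0x10, psi=("tmp1", "tmp2"), r=lambda x, y: y == x + 1)

print(holds(q1, 0x10, {"tmp1": 4, "tmp2": 5}))  # relationship satisfied
print(holds(q1, 0x10, {"tmp1": 4, "tmp2": 7}))  # relationship violated
```

Keeping the relationship as a predicate over an ordered tuple mirrors the definition: the same predicate can be reused for any ordered set of variables of matching arity.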

The fault model we assume is a single bit flip within the register file. Most faults in other portions of the processor eventually manifest as corrupted state in the register file [18]. Moreover, we assume that at most one fault occurs during a program's execution.
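The fault model can be pictured with a short Python sketch that flips exactly one bit of one register. The 32-bit register width, the dictionary used as a register file, and the function names are assumptions made only for illustration.

```python
import random

REGISTER_BITS = 32  # assumed register width


def flip_bit(value: int, bit: int, width: int = REGISTER_BITS) -> int:
    """Return value with the given bit inverted, truncated to the register width."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


def inject_single_fault(registers: dict, rng: random.Random) -> tuple:
    """Flip one random bit of one randomly chosen register; return (name, bit)."""
    name = rng.choice(sorted(registers))
    bit = rng.randrange(REGISTER_BITS)
    registers[name] = flip_bit(registers[name], bit)
    return name, bit


regs = {"R1": 10, "R2": 3}
print(inject_single_fault(regs, random.Random(0)), regs)
```

Calling `inject_single_fault` once per run corresponds to the assumption that at most one fault occurs during an execution.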

3. Radish

This paper implements Radish, a system that hardens programs against soft errors. Radish enhances the resilience of a program to soft errors by inserting assertions into the source code. The assertions are based on program invariants. If the statement of an assertion is not satisfied during execution, the execution is stopped and a warning reports the occurrence of a soft error.

The input of Radish is a C source file and the output is a new C source file. The new source file can be compiled and executed just as the original source file. The two files are identical in functionality but differ in reliability.

This section introduces the workflow of Radish, which can be divided into three phases: preprocessing, detecting, and selecting. Figure 2 shows the details of each phase. In the preprocessing phase we extract the execution profiles $\Gamma$ of the critical program points. Then $\Gamma$ is used to extract potential invariants in the detecting phase. After that, the potential invariants $Q_{pot}$ are obtained and a fraction of them are converted to assertions in the selecting phase. In the end, the hardened source code is output. We describe each phase below.

3.1. Preprocessing Phase. In the preprocessing phase we find the critical program points and extract their execution profiles. The profiles are used to extract invariants in the next phase. Finally, assertions will be placed at those program points to prevent faults from propagating. The SDC coverage varies with the program points at which assertions are placed; therefore we analyze the propagation of SDC and find the critical program points for propagation. A fault may propagate through data flow or control flow to incur SDC. Because the two categories of propagation differ, we analyze and search for their critical program points separately.

When a fault propagates through data flow, the same static instructions are executed as in the fault-free execution, but the data that the instructions read or write are corrupted. To incur SDC, the corrupted data need to be transmitted to other functions, especially the output function. Only connector instructions can perform this operation; thus they must be executed, and the data they transmit are corrupted. This makes the connector instructions efficient for fault detection, and therefore they are selected as the critical program points against data flow propagation.

Next we discuss fault propagation in control flow. The compare instruction performs a comparison between two values, and the result of the comparison sets bits of the flag register, which determine the subsequent jump performed by a branch instruction. Propagation through control flow means that an erroneous jump is performed by a branch instruction. Assume that in the fault-free execution $i_k$ is a branch instruction and the next instruction is $i_{k+1} \in b_u$, which means $i_k$ chooses $b_u$ as the next basic block. When the flag register is corrupted by a soft error, $i_{k+1} \in b_w$, which means $i_k$ chooses the erroneous branch $b_w$ instead of $b_u$. To avoid this, we should check whether the right branch is taken after the execution of $i_k$. Therefore, branch instructions are selected as the critical program points of control flow propagation.

Figure 2: The workflow of Radish. The source code passes through the preprocessing phase ($\Theta_{cri}$, $\Gamma$), the detecting phase ($V_\theta$; $P^1_\theta$, $P^2_\theta$, $P^3_\theta$; $N(P^1_\theta)$, $N(P^2_\theta)$, $N(P^3_\theta)$; $Q_{pot}$), and the selecting phase ($Q_{ass}$), which yields the assertions and the hardened source code.

According to the analysis above, the critical program points of data flow and control flow propagation are connector instructions and branch instructions, respectively. It takes two steps to extract the execution profiles of the critical program points.

Step 1. We compile the source code into an assembly file and then locate connector instructions and branch instructions in the assembly file. Their program points are recorded and added to the program point set $\Theta_{cri}$.

Step 2. The execution profile is acquired by using Kvasir [16]. Kvasir executes C and C++ programs and creates data trace files of variables and their values by examining the operation of the binary at runtime. Using Kvasir makes it possible to interrupt the program's execution and read the values of all variables visible at the program points of interest. Once the program finishes executing, we get the profiles $\Gamma$ at the target program points in $\Theta_{cri}$.

3.2. Detecting Phase. In the detecting phase, the ordered sets of variables and the corresponding ordered sets of values are generated according to the execution profiles $\Gamma$. We check whether the values satisfy any relationship of $R$ listed in Table 1. The detecting phase has four steps in total.

Step 1. For each program point $\theta$ of $\Theta_{cri}$, we get the set of accessible variables $V_\theta$ from the execution profile $\Gamma$. Then the unary, binary, and ternary ordered sets $P^1_\theta$, $P^2_\theta$, and $P^3_\theta$ are created. The superscript digits refer to the number of variables in the ordered set. For example, $P^2_\theta$ is an arrangement of two variables in $V_\theta$, that is, $P^2_\theta(v_k, v_j) = \{\langle v_k, v_j \rangle \mid v_k \in V_\theta \wedge v_j \in V_\theta\}$.

Step 2. Find the corresponding values of the variables appearing in $P^1_\theta$, $P^2_\theta$, and $P^3_\theta$, and generate the ordered sets of values $N(P^1_\theta)$, $N(P^2_\theta)$, and $N(P^3_\theta)$. For example, $N(P^2_\theta(v_k, v_j))$ is the ordered value set of $P^2_\theta(v_k, v_j)$: $N(P^2_\theta(v_k, v_j)) = \{\langle l_u, l_w \rangle \mid \langle \theta, v_k, l_u \rangle \in \Gamma \wedge \langle \theta, v_j, l_w \rangle \in \Gamma\}$.

Step 3. For the relationships that have undetermined parameters, we use a part of the ordered set of values to calculate those parameters. Thus the entire expression is determined.

Step 4. Test whether each element of the ordered set of values satisfies the condition of the relationship. If so, create a new invariant and put it into the potential invariant set $Q_{pot}$.
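Steps 1, 2, and 4 can be sketched in Python for a parameter-free relationship such as $x < y$. The profile layout (one dictionary of variable values per visit to a program point), the function names, and the omission of pairs that repeat the same variable are assumptions for illustration; Kvasir's actual trace format is different.

```python
from itertools import permutations

# Assumed profile layout: program point -> one {variable: value} dict per visit.
profile = {
    0x10: [{"i": 0, "n": 5}, {"i": 1, "n": 5}, {"i": 4, "n": 5}],
}


def accessible_variables(visits):
    """Step 1: V_theta, the variables observed on every visit."""
    return sorted(set.intersection(*(set(v) for v in visits)))


def ordered_pairs(variables):
    """Step 1: binary ordered sets, all arrangements of two distinct variables."""
    return list(permutations(variables, 2))


def value_set(visits, pair):
    """Step 2: ordered value set of a pair, one value tuple per visit."""
    return [(v[pair[0]], v[pair[1]]) for v in visits]


def detect_less_than(profile):
    """Step 4 for the relationship x < y, which has no parameters to fit."""
    found = []
    for theta, visits in profile.items():
        for pair in ordered_pairs(accessible_variables(visits)):
            if all(x < y for x, y in value_set(visits, pair)):
                found.append((theta, pair, "x < y"))
    return found


print(detect_less_than(profile))
```

Because the pairs are ordered, both $\langle i, n \rangle$ and $\langle n, i \rangle$ are tested, and only the arrangement that satisfies the relationship on every visit is kept.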

Take the binary relationship $r_{lin} = \{\langle x, y \rangle \mid y = ax + b\}$ as an example. We show each step of the detecting phase. Since it is a binary relationship, only the binary ordered sets $P^2_\theta$ and $N(P^2_\theta)$ are considered in this example.

In the first step, we get $V_\theta = \{v_k \mid \exists l,\ \langle \theta, v_k, l \rangle \in \Gamma\}$ by searching the execution profile $\Gamma$. Then $P^2_\theta(v_k, v_j) = \{\langle v_k, v_j \rangle \mid v_k \in V_\theta \wedge v_j \in V_\theta\}$ is obtained by creating the arrangement of every two variables in $V_\theta$.

In the second step, $N(P^2_\theta(v_k, v_j)) = \{\langle l_u, l_w \rangle \mid \langle \theta, v_k, l_u \rangle \in \Gamma \wedge \langle \theta, v_j, l_w \rangle \in \Gamma\}$ is obtained by finding the values of $v_k$ and $v_j$ in the execution profile $\Gamma$. There may be many value pairs of $v_k$ and $v_j$ because certain code sections can be invoked many times in a single execution, and each invocation produces one value instance.

In the third step, we calculate the parameters $a$ and $b$ in $r_{lin}$. To this end we need at least two elements of $N(P^2_\theta(v_k, v_j))$. Assuming the two elements are $\langle l_1, l_2 \rangle$ and $\langle l_3, l_4 \rangle$, it is easily obtained that $a = (l_4 - l_2)/(l_3 - l_1)$ and $b = (l_2 l_3 - l_1 l_4)/(l_3 - l_1)$.

In the last step, all elements in $N(P^2_\theta(v_k, v_j))$ are checked against $r_{lin} = \{\langle x, y \rangle \mid y = ((l_4 - l_2)/(l_3 - l_1))\,x + (l_2 l_3 - l_1 l_4)/(l_3 - l_1)\}$. If all of them pass this validation, the invariant $\langle \theta, \langle v_k, v_j \rangle, r_{lin} \rangle$ holds and it is added to the potential invariant set $Q_{pot}$.
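The third and last steps of the linear example can be sketched as follows. This is an illustrative Python version that assumes integer samples and uses exact rational arithmetic to avoid rounding; the function names are hypothetical.

```python
from fractions import Fraction


def fit_linear(samples):
    """Step 3: derive a and b from the first two samples with different x values."""
    if len(samples) < 2:
        return None
    l1, l2 = samples[0]
    for l3, l4 in samples[1:]:
        if l3 != l1:
            a = Fraction(l4 - l2, l3 - l1)
            b = Fraction(l2 * l3 - l1 * l4, l3 - l1)
            return a, b
    return None  # x never changes, so a and b are not determined


def detect_linear(samples):
    """Steps 3 and 4: return (a, b) if every sample satisfies y = a*x + b."""
    params = fit_linear(samples)
    if params is None:
        return None
    a, b = params
    if all(y == a * x + b for x, y in samples):
        return a, b
    return None


print(detect_linear([(1, 3), (2, 5), (4, 9)]))   # consistent with y = 2x + 1
print(detect_linear([(1, 3), (2, 5), (4, 10)]))  # last sample breaks the line
```

Using `Fraction` is a design choice of this sketch: with floating-point division, a valid sample could fail the equality test because of rounding alone.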

3.3. Selecting Phase. It is often observed that the number of elements in the potential invariant set $Q_{pot}$ is very large. If all of them were converted into assertions and inserted into the source file, the performance overhead would be very high. In the selecting phase, proper invariants are selected according to their capability of detecting SDC. Heuristics about selection criteria are formulated on the basis of the propagation of SDC. These heuristics are generic and can be applied to any invariants. We list the heuristics first and then describe the selecting steps.

Heuristic 1. There are certain types of variables that should be monitored at each target program point.

A fraction of the variables are capable of telling whether the execution is going well, and thus monitoring these variables is able to detect SDC. The target program point set $\Theta_{cri}$ can be categorized into program points of connector instructions and program points of branch instructions. At the program points of connector instructions, the connector variables deserve special attention since they reflect whether the results of functions are correct. At the program points of branch instructions, branch-controlling variables, which appear in the condition of an if, while, or for structure, reflect the status of these structures and thus should be monitored. Therefore, for all target program points we find certain variables to monitor.

Heuristic 2. The likelihood of detecting SDC increases if the number of valid values defined by an assertion decreases.

When a soft error produces an invalid value, that value fails the assertion. Therefore, having more invalid values (fewer valid values) means that the likelihood of detecting SDC increases. The number of valid values of an invariant is determined by its relationship. An equality relationship, which uses "=" as its operator, has only one valid value; then come inclusion ($\in$, $\subseteq$), range ($>$, $<$), and inequality ($\neq$) relationships, in ascending order of the number of valid values.

Heuristic 3. The likelihood of detecting SDC increases if more variables are included in an assertion.

The more variables appear in an assertion, the more variables it can monitor. If any of the variables gets corrupted due to a soft error, the assertion will be able to catch the error. Thus, having more variables in an assertion leads to higher SDC coverage. So far the largest number of variables is 3, which corresponds to ternary relationships.

Utilizing these heuristics, we are able to reduce the number of invariants and obtain more effective assertions. The selecting phase has three steps.

Step 1. The invariants that contain connector variables at the program points of connector instructions, or branch-controlling variables at the program points of branch instructions, are selected based on Heuristic 1.

Step 2. The invariants whose relationship has fewer valid values are selected according to Heuristic 2.

Step 3. The invariants that contain the largest number of variables are selected according to Heuristic 3.

The selecting process stops when there is only one invariant left or all the steps have been performed. Then we convert the chosen invariants into assertions, which is basically a string conversion problem. For brevity's sake, we do not discuss it in this paper. Finally, we include the assertion header file at the beginning of the new source file to make sure that the assertions work.
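The three selecting steps can be sketched as successive filters over the candidate invariants at one program point. The tuple layout `(variables, kind)`, the `KIND_RANK` table, and the function names are assumptions for illustration; the ranking simply follows the order given in Heuristic 2.

```python
# Lower rank means fewer valid values (Heuristic 2).
KIND_RANK = {"equality": 0, "inclusion": 1, "range": 2, "inequality": 3}


def keep_monitored(candidates, monitored):
    """Step 1: keep invariants that mention a monitored variable."""
    return [q for q in candidates if monitored & set(q[0])]


def keep_fewest_valid_values(candidates):
    """Step 2: keep invariants whose relationship kind has the lowest rank."""
    best = min(KIND_RANK[kind] for _, kind in candidates)
    return [q for q in candidates if KIND_RANK[q[1]] == best]


def keep_most_variables(candidates):
    """Step 3: keep invariants with the largest number of variables."""
    most = max(len(variables) for variables, _ in candidates)
    return [q for q in candidates if len(q[0]) == most]


def select(candidates, monitored):
    """Apply the steps in order, stopping early once at most one invariant is left."""
    steps = (
        lambda qs: keep_monitored(qs, monitored),
        keep_fewest_valid_values,
        keep_most_variables,
    )
    chosen = list(candidates)
    for step in steps:
        if len(chosen) <= 1:
            break
        chosen = step(chosen)
    return chosen


candidates = [
    (("ret",), "range"),
    (("ret", "a"), "equality"),
    (("ret", "a", "b"), "equality"),
    (("tmp",), "equality"),
]
print(select(candidates, monitored={"ret"}))
```

In this example the connector variable `ret` removes the `tmp` invariant, the equality kind removes the range invariant, and the ternary invariant wins the last step.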

4. Radish_D

The assertions generated by Radish cannot monitor all the variables and program points; thus certain faults might propagate through unprotected code sections. To further increase SDC coverage, we introduce a software-based instruction duplication mechanism to protect the code sections that are not covered by Radish.

This paper uses the instruction duplication mechanism of SWIFT [15] both for comparison and for our own duplication in Radish_D. SWIFT duplicates all computation instructions along the path of replication, and the replica instructions use different registers and different memory locations. At certain synchronization points, comparison instructions are inserted to check whether the original instructions and their replicas have identical values.

Rather than deploying the full instruction duplication mechanism of SWIFT, Radish_D applies a selective instruction duplication mechanism. Because a portion of the instructions have already been protected by assertions, we only need to duplicate the others.

Before deploying duplications, we need to determine which variables are safe under the protection of assertions. An assertion is capable of protecting the variables that appear in its statement. However, the protection does not last for the entire lifetime of those variables. Only the period from the beginning of the local function until the variable's host assertion is considered safe, since the variable's value is checked during the execution of the assertion.

We partition each variable's lifetime by assertions and identify the safe periods. Then duplications are deployed at the instruction level. The targets of duplication are the instructions that do not contain a variable in its safe period as an operand. A replica instruction is created by copying the opcode and operands of the original instruction. The destination operand is changed into an unused register, which becomes the copy of the original destination operand. Next we decide whether the source operands of the replica instruction need to be changed. If a copy of a source operand already exists, which means this source operand was the destination operand of some earlier instruction and thus got a copy, we replace the replica instruction's source operand with its copy. The replica instruction is inserted before the original instruction in the same basic block.

In addition, store, branch, and call instructions are chosen as the synchronization points. If any source operand of these instructions has a copy, we compare its value with that of its copy by inserting a compare instruction. According to the type of the operand (int or float), the compare instruction is either an icmp instruction or an fcmp instruction. After the compare instruction, a branch instruction using the predicate neq is inserted into the code. If the two values differ, it jumps to a function called faultDetected; otherwise execution continues with the original store, branch, or call instruction. The function faultDetected outputs error messages and returns with an exit code, which informs the system of the soft error and ends the execution.

We use an example to show the distinction between our method and the full instruction duplication mechanism in Figure 3. For consistency, we use the LLVM [19] assembly language to present the assembly code. Figure 3(a) shows the original assembly code and Figure 3(b) shows the assembly code after full instruction duplication. In Figure 3(b), $R3'$ is the replica of $R3$, and the duplication is accomplished by instruction $i_2$. Similarly, $R5'$ is the replica of $R5$ through the duplication by $i_4$. $i_9$ is the synchronization point, and its source operands, $R3$ and $R5$, need to be examined. $i_5$ and $i_7$ compare $R3$ and $R5$ with their replicas, respectively. If the values of $R3$ and $R3'$ are not equal, $i_6$ will call faultDetected to report a soft error.

(a) Original assembly code

    i1: R3 = xor R1, R2
    i2: R5 = add R4, R2
    i3: R6 = icmp eq R3, R5
    i4: br R6, label b2

(b) Assembly code after full instruction duplication

    i1:  R3 = xor R1, R2
    i2:  R3' = xor R1', R2'
    i3:  R5 = add R4, R2
    i4:  R5' = add R4', R2'
    i5:  R7 = icmp neq R3, R3'
    i6:  br R7, label faultDetected
    i7:  R8 = icmp neq R5, R5'
    i8:  br R8, label faultDetected
    i9:  R6 = icmp eq R3, R5
    i10: br R6, label b2

(c) Assembly code of Radish_D

    i1: R3 = xor R1, R2
        assert(R1 > R3)
    i2: R5 = add R4, R2
    i3: R5' = add R4', R2'
    i4: R7 = icmp neq R5, R5'
    i5: br R7, label faultDetected
    i6: R6 = icmp eq R3, R5
    i7: br R6, label b2

Figure 3: A sample assembly code before and after transformation by full instruction duplication and Radish_D.

The assembly code generated by Radish_D is shown in Figure 3(c). Assume that we have already obtained an assertion about $R1$ and $R3$ by using Radish, which is shown in the line of code "assert($R1 > R3$)". Due to the assertion, $R1$ and $R3$ are considered safe during the execution of this example. $R1'$ and $R3'$ are no longer necessary, and the instructions used for their duplication are eliminated. Variables other than $R1$ and $R3$ still need to be duplicated and checked; thus $R5$ is duplicated by $i_3$ and checked at the synchronization point $i_6$. The efficiency of Radish_D and the full instruction duplication mechanism is examined in the next section.
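The difference between full and selective duplication in Figure 3 can be reproduced with a small Python sketch. It is not the LLVM pass used in the paper: instructions are plain tuples, the check registers are named `C1`, `C2` instead of `R7`, `R8`, the comparison that feeds a branch is treated as a synchronization point (as in the figure), replicas are placed after the original instruction (also as in the figure), and safe variables are treated as safe for the whole block rather than only until their host assertion.

```python
SYNC_OPS = {"store", "br", "call", "icmp"}  # "icmp" added to match Figure 3


def harden(instrs, live_in_copies, safe=frozenset()):
    """Duplicate instructions that touch no safe register.

    instrs: list of (dest, opcode, sources); dest is None for br/store/call.
    live_in_copies: replica names assumed to exist for live-in registers.
    safe: registers protected by an assertion (empty means full duplication).
    """
    copies = {r: c for r, c in live_in_copies.items() if r not in safe}
    out, n_checks = [], 0
    for dest, op, srcs in instrs:
        if op.split()[0] in SYNC_OPS:
            # Synchronization point: compare each source operand with its copy.
            for s in srcs:
                if s in copies:
                    n_checks += 1
                    flag = f"C{n_checks}"
                    out.append((flag, "icmp neq", (s, copies[s])))
                    out.append((None, "br", (flag, "faultDetected")))
            out.append((dest, op, srcs))
            continue
        out.append((dest, op, srcs))
        if {dest, *srcs} & set(safe):
            continue  # already protected by an assertion
        replica_dest = dest + "'"
        out.append((replica_dest, op, tuple(copies.get(s, s) for s in srcs)))
        copies[dest] = replica_dest
    return out


original = [
    ("R3", "xor", ("R1", "R2")),
    ("R5", "add", ("R4", "R2")),
    ("R6", "icmp eq", ("R3", "R5")),
    (None, "br", ("R6", "b2")),
]
live_in = {"R1": "R1'", "R2": "R2'", "R4": "R4'"}
full = harden(original, live_in)
selective = harden(original, live_in, safe={"R1", "R3"})
print(len(full), len(selective))
```

With the empty safe set the sketch emits ten instructions, as in Figure 3(b); with $R1$ and $R3$ marked safe it emits seven, as in Figure 3(c) without the assertion line.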

5. Experiment

This paper applies fault injection experiments to validate the effectiveness of Radish and Radish_D. The fault injection experiment is performed on the original executable first. The executables hardened using Radish, Radish_D, and full instruction duplication are targeted subsequently. We compare the results of the fault injection experiments and calculate the SDC coverage and performance overhead. To ensure a fair comparison among these mechanisms, we use a metric called the SDC detection efficiency, which is defined in prior work [9] as the ratio between SDC coverage and overhead for a detection mechanism.

The platform for validation is Ubuntu 14.04 (AMD64 architecture). LLFI [20] is applied to perform fault injections. LLFI is an LLVM-based fault injection tool. The source code is translated into an intermediate representation (IR), and the faults are then injected into the IR code. The faults can be injected at specific program points, and the effect can be easily traced back to the source code. LLFI is configured to inject into destination registers. In a single fault injection, LLFI randomly picks one instruction and injects one soft error into its destination operand. One fault injection experiment continues until the fault injection has been repeated 1000 times. The injected faults may affect data flow or control flow. We take the following LLVM IR code to explain the effect on control flow:

(1) %judge1 = icmp ne i32 1, 2

(2) br i1 %judge1, label %BB1, label %BB2

judge1 determines the outcome of the branch. If judge1 is injected, the branch instruction may choose the wrong branch and thus affect control flow. The mechanism of full instruction duplication is implemented by developing a new pass under the LLVM infrastructure. The pass is also used by Radish_D for instruction duplication, with certain conditions for duplication modified.

Figure 4: The comparison of performance overheads among full instruction duplication, Radish, and Radish_D. (x-axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng; y-axis: performance overhead (%); series: Duplication, Radish, Radish_D.)

The programs used for evaluation are from the MiBench benchmark suite [21]. These programs are qsort (which performs the quick sort algorithm), isqrt (which is a base-two analogue of the square root algorithm), cubic (which solves a cubic polynomial), rad2deg (which converts between radians and degrees), crc (which computes a 32-bit CRC to detect accidental changes to raw data), and bitstrng (which prints the bit pattern of bytes formatted as a string). These are C programs consisting of a few hundred lines of C code. We use 25 inputs to extract invariants and randomly choose one input for the injection.

5.1. Comparison between Radish and Full Instruction Duplication. Figure 4 shows the performance overheads of Radish and full instruction duplication. We use the execution time of the original program as the baseline for comparison. Compared with the baseline, the average overhead incurred by Radish is 30.4% and the average overhead incurred by full instruction duplication is 52.8%. The overhead of the full instruction duplication mechanism is 22.4 percentage points higher than the overhead of Radish for the studied programs.

Figure 5 shows the SDC coverages. The average SDC coverage of Radish is 77.1% and that of full instruction duplication is 84.3%. The average SDC coverage of full instruction duplication is 7.2 percentage points higher than that of Radish. For most of the benchmarks, the SDC coverages of full instruction duplication and Radish are very close.

Full instruction duplication does not achieve nearly 100% coverage since it does not check the results of store and branch instructions. For example, in Figure 3(b), which shows full instruction duplication, if $R6$ in $i_9$ is injected, $i_{10}$ is affected and may choose the wrong branch. SWIFT [15] raises the coverage to nearly 100% since it assumes that the hardware applies ECC and it adds a control flow checking mechanism. The SDC detection efficiency can be observed in Figure 6. Radish has higher SDC detection efficiency, which is 1.6 times that of full instruction duplication. This is because the mechanism of full instruction duplication protects all executed instructions, which yields high SDC coverage at very high overhead. In contrast, Radish obtains relatively high SDC coverage with much lower overhead. Radish achieves this by curbing the number of program points that generate assertions. Further, the execution cost of assertions is relatively low, and assertions have good SDC coverage since they are seldom satisfied when soft errors occur.

5.2. The Experimental Results of Radish_D. The average SDC coverage of Radish_D is 92.5%, which is 8.2 percentage points higher than that of full instruction duplication and 15.5 percentage points higher than that of Radish. This confirms that the instruction duplication of Radish_D protects unsafe code sections that are not covered by assertions. Radish_D may generate assertions that check the variable that is stored in memory after the store instruction (see Heuristic 1). Moreover, at the program points of branch instructions, branch-controlling variables are checked. Therefore, the assertions of Radish_D catch some of the faults that escape detection by the duplication mechanism, and the coverage of Radish_D is higher than that of full instruction duplication.

The average overhead of Radish_D is 76.3%, lower than the sum of the overheads of full instruction duplication and Radish, because we eliminate the duplication of instructions that have already been protected by assertions.

The average SDC detection efficiency of Radish_D is lower than that of full instruction duplication or Radish. For Radish_D, there are overlapping soft errors that can be detected by both instruction duplication and assertions. For these soft errors, deploying instruction duplication increases the overhead but does not increase the SDC coverage. The SDC detection efficiency is the ratio between SDC coverage and overhead, and thus it is lowered.
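As a rough cross-check of these comparisons, the SDC detection efficiency can be computed from the averages reported above, reading them as percentages (77.1% coverage at 30.4% overhead for Radish, 84.3% at 52.8% for full duplication, and 92.5% at 76.3% for Radish_D). The Python sketch below is only illustrative: it takes the ratio of the averages, whereas the per-benchmark efficiencies plotted in Figure 6 may average to somewhat different values.

```python
def sdc_detection_efficiency(coverage_pct: float, overhead_pct: float) -> float:
    """Ratio between SDC coverage and performance overhead."""
    return coverage_pct / overhead_pct


# (average SDC coverage %, average overhead %) as reported in Sections 5.1 and 5.2
averages = {
    "Radish": (77.1, 30.4),
    "Duplication": (84.3, 52.8),
    "Radish_D": (92.5, 76.3),
}

efficiency = {
    name: sdc_detection_efficiency(cov, ovh) for name, (cov, ovh) in averages.items()
}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f}")
print(f"Radish / Duplication: {efficiency['Radish'] / efficiency['Duplication']:.1f}")
```

Under this reading, the ratio between Radish and full duplication rounds to the 1.6 quoted in Section 5.1, and Radish_D comes out lowest, in line with the discussion above.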

5.3. False Positives of Invariants. A false positive for an input can occur when the values at the assertion points for this input do not satisfy the condition of the assertion learned from the training inputs. We use 25 inputs for training and 100 inputs for testing. No faults are injected in these runs. We test all the programs that were used to evaluate SDC coverage in the fault injection experiment. The result shows that the average false positive rate of the studied programs is 4.8%.

Figure 5: The comparison of SDC coverages among full instruction duplication, Radish, and Radish_D. (x-axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng; y-axis: SDC coverage (%); series: Duplication, Radish, Radish_D.)

Figure 6: The comparison of SDC detection efficiencies among full instruction duplication, Radish, and Radish_D. (x-axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng; y-axis: SDC detection efficiency; series: Duplication, Radish, Radish_D.)

We also conduct an experiment to examine the effect of the training set size. The result for qsort is shown in Figure 7. The training set consists of 25, 50, and 75 inputs, and false positives are computed across 100 inputs.

The false positive rate decreases from 5% to 3% as the training set size is increased from 25 to 50, and to 2% for 75 inputs. The SDC coverage also decreases as the training set increases from 25 to 75 inputs. The impact of increasing the training set size on both SDC coverage and false positive rate is significant. Hence we should choose the training set size according to the user's target. If the user specifies bounds on SDC coverage and overhead, then by converting the false positive rate into overhead we can choose a training set size that achieves the target.
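The interplay between training set size and false positives can be illustrated with a toy Python sketch. It learns a bounded-range invariant from fault-free training values and counts how many fault-free test values violate it. The single-value-per-input simplification, the sample numbers, and the function names are assumptions and do not come from the experiments.

```python
def learn_range(training_values):
    """Learn the bounded-range invariant lo <= v <= hi from fault-free runs."""
    return min(training_values), max(training_values)


def false_positive_rate(invariant, test_values):
    """Fraction of fault-free test values that violate the learned invariant."""
    lo, hi = invariant
    violations = sum(1 for v in test_values if not lo <= v <= hi)
    return violations / len(test_values)


small_training = [3, 5, 8, 10]
large_training = small_training + [1, 12]
test_values = [4, 9, 11, 2, 7]

print(false_positive_rate(learn_range(small_training), test_values))
print(false_positive_rate(learn_range(large_training), test_values))
```

The larger training set widens the learned range, which removes the false alarms in this toy case; a wider range also accepts more corrupted values, which matches the observed drop in SDC coverage.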

In addition, re-execution can reduce the overhead incurred by false positives. When an assertion raises an alarm, we can determine whether it is a false positive by re-executing the program. If the assertion raises an alarm again, it is a false positive, since a transient soft error is not expected to recur in the second run. In this case the alarm can be ignored and the program can continue.
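The re-execution rule can be sketched in a few lines of Python. The callables `run` and `assertion_holds` and the returned labels are hypothetical; the sketch assumes, as in the fault model, that a soft error does not recur in the second run.

```python
def run_with_reexecution(run, assertion_holds):
    """Run once; on an alarm, run again to classify the alarm."""
    first = run()
    if assertion_holds(first):
        return first, "no alarm"
    second = run()
    if assertion_holds(second):
        # The alarm did not recur, so the first run was hit by a transient fault.
        return second, "soft error"
    # The alarm recurred without a fault, so the assertion itself is too strict.
    return second, "false positive"


# A run whose first result is corrupted and whose second result is clean.
results = iter([-1, 4])
print(run_with_reexecution(lambda: next(results), lambda v: v >= 0))
```

A recurring alarm costs one extra run but lets the program continue with its result instead of aborting.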

From the discussion above, it can be concluded that Radish_D has higher SDC coverage than Radish or full instruction duplication. But its overhead is also higher, which suggests that Radish_D should be used in situations where SDC coverage has priority over overhead. Further, the SDC detection efficiency of Radish is far higher than that of Radish_D or full instruction duplication, which means it is more cost-effective. But Radish may incur overhead due to false positives. Users can choose Radish or Radish_D according to the tradeoff they want between SDC coverage and performance overhead.

6. Related Work

Prior research [8, 22, 23] applies invariants with a single variable, and most of those invariants are based on bounded ranges. We apply invariants with more variables, which can achieve better coverage on many occasions. For example, we can always extract the invariant $n - k + 1 = 0$ from a typical loop structure, shown as follows:

    for (k = 1; k <= n; k++) {
        ...
    }
    ← Extracting invariant from here

Figure 7: The SDC coverage and false positive rate for varied training set sizes. (x-axis: training set size, 25, 50, 75; y-axes: SDC coverage and false positive rate; series: SDC coverage, False positive.)

It is found that assert($n - k + 1 = 0$) is often better than the bounded-range-based invariant assert($k_{min} \le k \le k_{max}$) at detecting errors, since assert($n - k + 1 = 0$) checks both $n$ and $k$, while assert($k_{min} \le k \le k_{max}$) only checks $k$.
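The difference between the two assertions can be seen in a small Python sketch that runs the loop and then evaluates both checks at the loop exit, once with clean values and once with a single bit flipped. The bounds `k_min = 1` and `k_max = 101` are assumed stand-ins for a range learned from training runs.

```python
def run_loop(n: int) -> int:
    """Python equivalent of `for (k = 1; k <= n; k++)`; returns k at loop exit."""
    k = 1
    while k <= n:
        k += 1
    return k


def check_at_loop_exit(n: int, k: int, k_min: int = 1, k_max: int = 101):
    """Evaluate the relational and the bounded-range assertion."""
    relational = (n - k + 1 == 0)
    bounded = (k_min <= k <= k_max)
    return relational, bounded


n = 10
k = run_loop(n)
print(check_at_loop_exit(n, k))       # fault-free values
print(check_at_loop_exit(n ^ 16, k))  # bit 4 of n flipped
print(check_at_loop_exit(n, k ^ 16))  # bit 4 of k flipped
```

In both corrupted cases the relational assertion fails while the range assertion still passes, because the corrupted value of $k$ stays inside the learned bounds and $n$ is not checked by the range assertion at all.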

A typical criterion for the selection of detectors, defined in [22], is tightness: the probability that the detector detects an error given that there is an error in the value of the variable that it checks. The notion of tightness is based on the value of a single variable. The invariants in this paper may include 2 or 3 variables, and the notion of tightness cannot be used to describe an invariant with more than one variable. For example, if $x$ is flipped in the invariant $x < y$, since there are multiple possible values of $y$, it cannot be decided whether the invariant is still satisfied, and thus the tightness cannot be calculated. Since tightness cannot be used, we apply certain heuristics to choose invariants, and the experiments show them to be effective.

7. Conclusion

To address the problem of detecting SDC, we propose an approach that applies invariant-based assertions and implement a system called Radish. Radish requires neither hardware modifications to add error detection capability to the original system nor knowledge of the semantics of the program, and thus scales well. Experiments show that Radish achieves high SDC coverage with low overhead relative to full instruction duplication.

Furthermore, we propose Radish_D, which adds instruction duplication to the unsafe code sections that are not covered by assertions. Radish_D achieves higher SDC coverage than Radish or the full instruction duplication mechanism. Both Radish and Radish_D offer feasible alternatives for soft error mitigation.

Competing Interests

The authors declare no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Basic Research Program of China ("973" Project).

References

[1] H. Schirmeier, C. Borchert, and O. Spinczyk, "Avoiding pitfalls in fault-injection based comparison of program susceptibility to soft errors," in Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '15), pp. 319–330, IEEE, Rio de Janeiro, Brazil, June 2015.

[2] A. O. Daniel, L. P. Laercio, S. Thiago et al., "Evaluation and mitigation of radiation-induced soft errors in graphics processing units," IEEE Transactions on Computers, vol. 65, no. 3, pp. 791–804, 2016.

[3] S. S. Mukherjee, J. Emer, and S. K. Reinhardt, "The soft error problem: an architectural perspective," in Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA '05), pp. 243–247, San Francisco, Calif, USA, February 2005.

[4] D. Binder, E. C. Smith, and A. B. Holman, "Satellite anomalies from galactic cosmic rays," IEEE Transactions on Nuclear Science, vol. 22, no. 6, pp. 2675–2680, 1975.

[5] J. Olsen, P. E. Becher, P. B. Fynbo, P. Raaby, and J. Schultz, "Neutron-induced single event upsets in static RAMS observed at 10 km flight altitude," IEEE Transactions on Nuclear Science, vol. 40, no. 2, pp. 74–77, 1993.

[6] J. P. Walters, K. M. Zick, and M. French, "A practical characterization of a NASA SpaceCube application through fault emulation and laser testing," in Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '13), pp. 1–8, June 2013.

[7] S. Mittal and J. S. Vetter, "A survey of techniques for modeling and improving reliability of computing systems," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 4, pp. 1226–1238, 2016.

[8] P. Racunas, K. Constantinides, S. Manne, and S. S. Mukherjee, "Perturbation-based fault screening," in Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture, pp. 169–180, Scottsdale, Ariz, USA, February 2007.

[9] Q. Lu, K. Pattabiraman, M. S. Gupta et al., "SDCTune: a model for predicting the SDC proneness of an application for configurable protection," in Proceedings of the Compilers, Architecture and Synthesis for Embedded Systems, pp. 1–10, Uttar Pradesh, India, 2014.

[10] N. J. Wang and S. J. Patel, "ReStore: symptom-based soft error detection in microprocessors," IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 3, pp. 188–201, 2006.

10 International Journal of Aerospace Engineering

[11] M.-L. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve, and Y. Zhou, "Understanding the propagation of hard errors to software and implications for resilient system design," ACM SIGARCH Computer Architecture News, vol. 36, no. 1, pp. 265–276, 2008.

[12] N. Oh, P. P. Shirvani, and E. J. McCluskey, "Error detection by duplicated instructions in super-scalar processors," IEEE Transactions on Reliability, vol. 51, no. 1, pp. 63–75, 2002.

[13] M. Shafique, S. Rehman, P. V. Aceituno, and J. Henkel, "Exploiting program-level masking and error propagation for constrained reliability optimization," in Proceedings of the 50th Annual Design Automation Conference (DAC '13), pp. 1–17, Austin, Tex, USA, June 2013.

[14] S. Rehman, M. Shafique, P. V. Aceituno, F. Kriebel, J.-J. Chen, and J. Henkel, "Leveraging variable function resilience for selective software reliability on unreliable hardware," in Proceedings of the 16th Design, Automation and Test in Europe Conference and Exhibition (DATE '13), pp. 1759–1764, Grenoble, France, March 2013.

[15] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, "SWIFT: software implemented fault tolerance," in Proceedings of the International Symposium on Code Generation and Optimization, pp. 243–254, IEEE Computer Society, San Jose, Calif, USA, 2005.

[16] M. D. Ernst, J. H. Perkins, P. J. Guo et al., "The Daikon system for dynamic detection of likely invariants," Science of Computer Programming, vol. 69, no. 1–3, pp. 35–45, 2007.

[17] A. Thomas and K. Pattabiraman, "Error detector placement for soft computation," in Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '13), pp. 1–12, IEEE Computer Society, June 2013.

[18] F. Shuguang, G. Shantanu, A. Amin et al., "Shoestring: probabilistic soft error reliability on the cheap," in Proceedings of the ASPLOS, pp. 385–396, Pittsburgh, Pa, USA, 2010.

[19] C. Lattner and V. Adve, "LLVM: a compilation framework for lifelong program analysis & transformation," in Proceedings of the International Symposium on Code Generation and Optimization (CGO '04), pp. 75–86, San Jose, Calif, USA, March 2004.

[20] A. Thomas and K. Pattabiraman, "LLFI: an intermediate code level fault injector for soft computing applications," in Proceedings of the Workshop on Silicon Errors in Logic System Effects (SELSE '13), pp. 1–8, Palo Alto, Calif, USA, 2013.

[21] M. R. Guthaus, J. S. Ringenberg, D. Ernst et al., "MiBench: a free, commercially representative embedded benchmark suite," in Proceedings of the Workload Characterization, pp. 3–14, Austin, Tex, USA, 2001.

[22] K. Pattabiraman, S. Giacinto, C. Daniel et al., "Dynamic derivation of application-specific error detectors and their implementation in hardware," in Proceedings of the 6th European Dependable Computing Conference (EDCC '06), pp. 97–108, Coimbra, Portugal, October 2006.

[23] S. K. Sahoo, M.-L. Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y. Zhou, "Using likely program invariants to detect hardware errors," in Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '08), pp. 70–79, IEEE Computer Society, Anchorage, Alaska, USA, June 2008.



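Figure 2: The workflow of Radish. [Diagram labels: Source code; Preprocessing phase; Detecting phase; Selecting phase; Assertions; Hardened source code. Intermediate sets shown: $\Theta_{cri}$, $\Gamma$, $V_\theta$, $P^1_\theta$, $P^2_\theta$, $P^3_\theta$, $N(P^1_\theta)$, $N(P^2_\theta)$, $N(P^3_\theta)$, $Q_{pot}$, $Q_{ass}$.]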

means $i_k$ chooses $b_u$ as the next basic block. When the flag register is corrupted by a soft error, $i_{k+1} \in b_w$, which means $i_k$ takes the erroneous branch $b_w$ instead of $b_u$. To avoid this, we should check whether the right branch is taken after the execution of $i_k$. Therefore, branch instructions are selected as the critical program points of control flow propagation.

According to the analysis above, the critical program points of data flow and control flow propagation are connector instructions and branch instructions. It takes two steps to extract the execution profiles of the critical program points.

Step 1. We compile the source code into an assembly file and then locate the connector instructions and branch instructions in that file. Their program points are recorded and added to the program point set $\Theta_{cri}$.

Step 2. The execution profile is acquired by using Kvasir [16]. Kvasir executes C and C++ programs and creates data trace files of variables and their values by examining the operation of the binary at runtime. Using Kvasir makes it possible to interrupt the program's execution and read the values of all variables manifest at the program points of interest. Once the program finishes executing, we get the profiles $\Gamma$ at the target program points in $\Theta_{cri}$.

3.2. Detecting Phase. In the detecting phase, the ordered sets of variables and the corresponding ordered sets of values are generated from the execution profiles $\Gamma$. We check whether the values satisfy any relationship of $R$ listed in Table 1. The detecting phase has four steps in total.

Step 1. For each program point $\theta$ of $\Theta_{cri}$, we get the set of accessible variables $V_\theta$ from the execution profile $\Gamma$. Then the unary, binary, and ternary ordered sets $P^1_\theta$, $P^2_\theta$, and $P^3_\theta$ are created. The superscript digits refer to the number of variables in the ordered set. For example, $P^2_\theta$ is an arrangement of two variables in $V_\theta$, that is, $P^2_\theta(v_k, v_j) = \{\langle v_k, v_j \rangle \mid v_k \in V_\theta \wedge v_j \in V_\theta\}$.

Step 2. Find the corresponding values of the variables appearing in $P^1_\theta$, $P^2_\theta$, and $P^3_\theta$, and generate the ordered sets of values $N(P^1_\theta)$, $N(P^2_\theta)$, and $N(P^3_\theta)$. For example, $N(P^2_\theta(v_k, v_j))$ is the ordered value set of $P^2_\theta(v_k, v_j)$: $N(P^2_\theta(v_k, v_j)) = \{\langle l_u, l_w \rangle \mid \langle \theta, v_k, l_u \rangle \in \Gamma \wedge \langle \theta, v_j, l_w \rangle \in \Gamma\}$.

Step 3. For the relationships that have undetermined parameters, we use a part of the ordered set of values to calculate those parameters. Thus the entire expression is determined.

Step 4. Test whether each element of the ordered set of values satisfies the condition of the relationship. If so, create a new invariant and put it into the potential invariant set $Q_{pot}$.
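Steps 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions, not the Radish implementation: the profile $\Gamma$ is assumed to be a list of (program point, variable, value) triples recorded in execution order, the i-th value recorded for each variable is assumed to belong to the i-th visit of the program point, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def accessible_variables(profile, theta):
    """V_theta: variables with at least one recorded value at theta."""
    return sorted({var for (point, var, _) in profile if point == theta})


def ordered_pairs(variables):
    """P^2_theta: every arrangement of two variables.

    Assumption: the two variables are distinct.
    """
    return list(permutations(variables, 2))


def value_pairs(profile, theta, vk, vj):
    """N(P^2_theta(vk, vj)): values of vk and vj, paired visit by visit."""
    per_var = defaultdict(list)
    for point, var, value in profile:
        if point == theta:
            per_var[var].append(value)
    return list(zip(per_var[vk], per_var[vj]))


# Toy profile: the hypothetical program point "f:exit" is visited three times.
profile = [
    ("f:exit", "x", 1), ("f:exit", "y", 5),
    ("f:exit", "x", 2), ("f:exit", "y", 7),
    ("f:exit", "x", 4), ("f:exit", "y", 11),
]
variables = accessible_variables(profile, "f:exit")
pairs = ordered_pairs(variables)
samples = value_pairs(profile, "f:exit", "x", "y")
```

Each ordered pair of variables then carries one value pair per visit, which is the input that Steps 3 and 4 test against the candidate relationships.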

Take a binary relationship $r_{lin} = \{\langle x, y \rangle \mid y = ax + b\}$ as an example. We shall show each step of the detecting phase. Since it is a binary relationship, only the binary ordered sets $P^2_\theta$ and $N(P^2_\theta)$ are considered in this example.

In the first step, we get $V_\theta = \{v_k \mid \exists l, \langle \theta, v_k, l \rangle \in \Gamma\}$ by searching the execution profile $\Gamma$. Then $P^2_\theta(v_k, v_j) = \{\langle v_k, v_j \rangle \mid v_k \in V_\theta \wedge v_j \in V_\theta\}$ is obtained by creating the arrangement of every two variables in $V_\theta$.

In the second step, $N(P^2_\theta(v_k, v_j)) = \{\langle l_u, l_w \rangle \mid \langle \theta, v_k, l_u \rangle \in \Gamma \wedge \langle \theta, v_j, l_w \rangle \in \Gamma\}$ is obtained by finding the values of $v_k$ and $v_j$ in the execution profile $\Gamma$. There may be many value pairs of $v_k$ and $v_j$ because certain code sections can be invoked many times in a single execution, and each invocation produces one value instance.

In the third step, we calculate the parameters $a$ and $b$ in $r_{lin}$. To this end, we need to use at least two elements of $N(P^2_\theta(v_k, v_j))$. Assuming the two elements are $\langle l_1, l_2 \rangle$ and $\langle l_3, l_4 \rangle$, it is easily obtained that $a = (l_4 - l_2)/(l_3 - l_1)$ and $b = (l_2 l_3 - l_1 l_4)/(l_3 - l_1)$.

In the last step, all elements in $N(P^2_\theta(v_k, v_j))$ are checked to see whether they satisfy $r_{lin} = \{\langle x, y \rangle \mid y = ((l_4 - l_2)/(l_3 - l_1))x + (l_2 l_3 - l_1 l_4)/(l_3 - l_1)\}$. If all of them pass this validation, the invariant $\langle \theta, \langle v_k, v_j \rangle, r_{lin} \rangle$ holds and it is added to the potential invariant set $Q_{pot}$.
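The third and last steps of this example can be written down in a few lines. The sketch below assumes integer-valued samples and uses exact rational arithmetic so that rounding cannot cause a spurious mismatch; both choices, and the function names, are assumptions made for illustration.

```python
from fractions import Fraction


def fit_linear(first, second):
    """Compute a and b of y = a*x + b from two elements <l1, l2> and <l3, l4>."""
    (l1, l2), (l3, l4) = first, second
    a = Fraction(l4 - l2, l3 - l1)
    b = Fraction(l2 * l3 - l1 * l4, l3 - l1)
    return a, b


def detect_linear_invariant(samples):
    """Return (a, b) if every sample satisfies y = a*x + b, otherwise None."""
    if len(samples) < 2:
        return None
    first = samples[0]
    # The two fitting elements need different x values (l3 != l1).
    second = next((s for s in samples[1:] if s[0] != first[0]), None)
    if second is None:
        return None
    a, b = fit_linear(first, second)
    if all(y == a * x + b for x, y in samples):
        return a, b
    return None


holds = detect_linear_invariant([(1, 5), (2, 7), (4, 11)])    # y = 2x + 3
broken = detect_linear_invariant([(1, 5), (2, 7), (4, 12)])   # last pair violates it
```

Only the first case would add an invariant to $Q_{pot}$; a single violating element is enough to reject the candidate relationship.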

3.3. Selecting Phase. The potential invariant set $Q_{pot}$ often contains a very large number of elements. If all of them were converted into assertions and inserted into the source file, the performance overhead would be very high. In the selecting phase, proper invariants are selected according to their capability of detecting SDC. Heuristics about selection criteria are formulated on the basis of the propagation of SDC. These heuristics are generic and can be applied to any invariants. We list the heuristics first and then describe the selecting steps.

Heuristic 1. There are certain types of variables that should be monitored at each target program point.

A fraction of the variables can tell whether the execution is going well, and thus monitoring these variables makes it possible to detect SDC. The target program point set $\Theta_{cri}$ can be categorized into program points of connector instructions and program points of branch instructions. At the program points of connector instructions, the connector variables deserve special attention since they reflect whether the results of functions are correct. At the program points of branch instructions, the branch-controlling variables, which appear in the statement of an if, while, or for structure, reflect the status of these structures and thus should be monitored. Therefore, for all target program points, we find certain variables to monitor.

Heuristic 2. The likelihood of detecting SDC increases if the number of valid values defined by an assertion decreases.

Invalid values cannot pass the examination of assertions in the presence of a soft error. Therefore, having more invalid values (fewer valid values) means that the likelihood of detecting SDC increases. The number of valid values of an invariant is determined by its relationship. An equality relationship, using "=" as the operator, has only one valid value; then come inclusion ($\in$, $\subseteq$), range ($>$, $<$), and inequality ($\neq$) relationships, in order of ascending number of valid values.

Heuristic 3. The likelihood of detecting SDC increases if more variables are included in an assertion.

The more variables appear in an assertion, the more variables it can monitor. If any of the variables gets corrupted due to a soft error, the assertion will be able to catch the error. Thus, having more variables in an assertion leads to higher coverage of SDC. So far, the largest number of variables is 3, which corresponds to ternary relationships.

Utilizing these heuristics, we are able to reduce the number of invariants and obtain more effective assertions. The selecting phase has three steps.

Step 1. The invariants that contain connector variables at the program points of connector instructions, or branch-controlling variables at the program points of branch instructions, are selected based on Heuristic 1.

Step 2. The invariants whose relationship has fewer valid values are picked according to Heuristic 2.

Step 3. The invariants that contain the largest number of variables are selected according to Heuristic 3.

The selecting process stops when only one invariant is left or all the steps have been performed. Then we convert the chosen invariants into assertions, which is basically a string conversion problem. For brevity's sake, we do not discuss it in this paper. Finally, we include the assertion header file at the beginning of the new source file to make sure that the assertions work.
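The three selecting steps amount to a chain of filters over the potential invariants of one program point. The sketch below is a hedged illustration: the `Invariant` record, the relationship labels, and the `monitored` set (the connector or branch-controlling variables of that program point) are hypothetical names, and the ranking table only encodes the ordering stated in Heuristic 2.

```python
from dataclasses import dataclass

# Relationship kinds in ascending number of valid values (Heuristic 2).
RELATION_RANK = {"equality": 0, "inclusion": 1, "range": 2, "inequality": 3}


@dataclass(frozen=True)
class Invariant:
    variables: tuple   # variables appearing in the invariant
    relation: str      # one of the keys of RELATION_RANK
    text: str          # human-readable form, later turned into an assertion


def by_monitored(invs, monitored):
    """Step 1: keep invariants that mention a monitored variable."""
    return [i for i in invs if set(i.variables) & monitored]


def by_fewest_valid_values(invs):
    """Step 2: keep invariants whose relationship has the fewest valid values."""
    best = min(RELATION_RANK[i.relation] for i in invs)
    return [i for i in invs if RELATION_RANK[i.relation] == best]


def by_most_variables(invs):
    """Step 3: keep invariants that contain the largest number of variables."""
    most = max(len(i.variables) for i in invs)
    return [i for i in invs if len(i.variables) == most]


def select_invariants(candidates, monitored):
    selected = by_monitored(list(candidates), monitored)
    for step in (by_fewest_valid_values, by_most_variables):
        if len(selected) <= 1:   # stop as soon as at most one invariant is left
            break
        selected = step(selected)
    return selected


candidates = [
    Invariant(("ret",), "range", "ret > 0"),
    Invariant(("ret",), "equality", "ret == 4"),
    Invariant(("ret", "n"), "equality", "ret == 2 * n"),
    Invariant(("tmp",), "equality", "tmp == 1"),
]
chosen = select_invariants(candidates, monitored={"ret"})
```

In this toy run, Step 1 drops the invariant on `tmp`, Step 2 drops the range invariant, and Step 3 keeps the two-variable equality.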

4. Radish_D

The assertions generated by Radish cannot fully monitor all the variables and program points; thus certain faults might propagate through unprotected code sections. To further increase the coverage of SDC, we introduce a software-based instruction duplication mechanism to protect the code sections that are not covered by Radish.

This paper utilizes the instruction duplication mechanism of SWIFT [15] both for comparison and for our own duplication in Radish_D. SWIFT duplicates all computation instructions along the path of replication, and the replica instructions use different registers and different memory locations. At certain synchronization points, comparison instructions are inserted to check whether the original instructions and their replicas have identical values.

Rather than deploying the full instruction duplication mechanism of SWIFT, Radish_D applies a selective instruction duplication mechanism. Because a portion of the instructions have already been protected by assertions, we only need to duplicate the others.

Before deploying duplications, we need to determine which variables are safe under the protection of assertions. An assertion is capable of protecting the variables that appear in its statement. However, the protection does not last for the entire lifetime of those variables. Only the fraction from the beginning of the local function until the variable's host assertion is considered safe, since the variable's value is checked during the execution of the assertion.

We partition each variable's lifetime by assertions and identify the safe periods. Then duplications are deployed at the instruction level. The targets of duplication are the instructions that do not contain a variable in its safe period as an operand. A replica instruction is created by copying the opcode and operands of the original instruction. The destination operand is changed into an unused register, which becomes the copy of the original destination operand. Next, we decide whether the source operands of the replica instruction need to be changed. If a copy of a source operand already exists, which means that this source operand was the destination operand of some earlier instruction and thus got a copy, we replace the replica instruction's source operand with its copy. The replica instruction is inserted before the original instruction in the same basic block.

In addition, store, branch, and call instructions are chosen as the synchronization points. If any source operand of these instructions has a copy, we compare its value with that of its copy by inserting a compare instruction. According to the type of the operand (int or float), the compare instruction is either an icmp instruction or an fcmp instruction. After the compare instruction, a branch instruction that uses the neq predicate is inserted into the code. If the two values differ, execution jumps to a function called faultDetected; otherwise, it continues with the original store, branch, or call instruction. The faultDetected function outputs error messages and returns with an exit code, which informs the system of the soft error and ends the execution.

We use an example to show the distinction between our method and the full instruction duplication mechanism in Figure 3. For consistency, we make use of the LLVM [19] assembly language to present the assembly code. Figure 3(a) shows the original assembly code and Figure 3(b) shows the assembly code after full instruction duplication. It can be seen in Figure 3(b) that R3' is the replica of R3 and that the duplication is accomplished by the instruction i2. Similarly, R5' is the replica of R5 through the duplication by i4. i9 is the synchronization point, and the source operands of i9, R3 and R5, need to be examined. i5 and i7 compare R3 and R5 with their replicas separately. If the values of R3 and R3' are not equal, i6 will call faultDetected to report a soft error.

(a) Original assembly code

    i1: R3 = xor R1, R2
    i2: R5 = add R4, R2
    i3: R6 = icmp eq R3, R5
    i4: br R6, label b2

(b) Assembly code after full instruction duplication

    i1:  R3 = xor R1, R2
    i2:  R3' = xor R1', R2'
    i3:  R5 = add R4, R2
    i4:  R5' = add R4', R2'
    i5:  R7 = icmp neq R3, R3'
    i6:  br R7, label faultDetected
    i7:  R8 = icmp neq R5, R5'
    i8:  br R8, label faultDetected
    i9:  R6 = icmp eq R3, R5
    i10: br R6, label b2

(c) Assembly code of Radish_D

    i1: R3 = xor R1, R2
    i2: R5 = add R4, R2
    i3: R5' = add R4', R2'
    i4: R7 = icmp neq R5, R5'
    i5: br R7, label faultDetected
    i6: R6 = icmp eq R3, R5
    i7: br R6, label b2
    assert(R1 > R3)

Figure 3: A sample assembly code before and after transformation of full instruction duplication and Radish_D.

The assembly code generated by Radish_D is shown in Figure 3(c). Assume that we have already obtained an assertion about R1 and R3 by utilizing Radish, which is shown in the line of code "assert(R1 > R3)". Due to the assertion, R1 and R3 are considered safe during the execution of this example. R1' and R3' are no longer necessary, and the instructions used for their duplication are eliminated. Variables other than R1 and R3 still need to be duplicated and checked; thus R5 is duplicated by i3 and checked at the synchronization point i6. The efficiency of Radish_D and of the full instruction duplication mechanism is evaluated in the next section.
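The difference between Figures 3(b) and 3(c) can be reproduced with a toy transformation over a list of instructions. This is a simplified sketch, not the LLVM pass used in the paper. It assumes that replicas of the input registers already exist, it emits each replica right after its original as the figure does, and, also following the figure, it treats the icmp that feeds the branch as a synchronization point. The `Instr` type, the `harden` function, and the check-flag names are hypothetical, and the assertion itself is not emitted.

```python
from typing import NamedTuple, Optional, Tuple

# Synchronization points; "icmp" is included only to mirror Figure 3.
SYNC_OPS = {"store", "br", "call", "icmp"}


class Instr(NamedTuple):
    dest: Optional[str]       # None for instructions without a destination
    op: str                   # e.g. "xor", "add", "icmp eq", "br"
    srcs: Tuple[str, ...]
    target: str = ""          # branch label, if any

    def render(self):
        body = " ".join((self.op,) + self.srcs)
        if self.target:
            body += " label " + self.target
        return f"{self.dest} = {body}" if self.dest else body


def harden(block, safe, copies):
    """Duplicate instructions that touch no safe variable; check at sync points."""
    copies = dict(copies)     # register -> name of its replica
    out, checks = [], 0
    for ins in block:
        if ins.op.split()[0] in SYNC_OPS:
            for src in ins.srcs:
                if src in copies:   # compare the operand with its replica
                    checks += 1
                    flag = f"c{checks}"
                    out.append(Instr(flag, "icmp neq", (src, copies[src])).render())
                    out.append(Instr(None, "br", (flag,), "faultDetected").render())
            out.append(ins.render())
            continue
        out.append(ins.render())
        if ins.dest in safe or any(s in safe for s in ins.srcs):
            continue                # protected by an assertion: no replica
        replica = ins.dest + "'"
        replica_srcs = tuple(copies.get(s, s) for s in ins.srcs)
        out.append(Instr(replica, ins.op, replica_srcs).render())
        copies[ins.dest] = replica
    return out


block = [
    Instr("R3", "xor", ("R1", "R2")),
    Instr("R5", "add", ("R4", "R2")),
    Instr("R6", "icmp eq", ("R3", "R5")),
    Instr(None, "br", ("R6",), "b2"),
]
full = harden(block, safe=set(), copies={"R1": "R1'", "R2": "R2'", "R4": "R4'"})
selective = harden(block, safe={"R1", "R3"}, copies={"R2": "R2'", "R4": "R4'"})
```

With an empty safe set the sketch yields the ten instructions of Figure 3(b); marking R1 and R3 as safe yields the seven instructions of Figure 3(c).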

5. Experiment

This paper applies fault injection experiments to validate the effectiveness of Radish and Radish_D. The fault injection experiment is performed on the original executable first. The executables hardened by Radish, Radish_D, and full instruction duplication are targeted subsequently. We compare the results of the fault injection experiments and calculate the SDC coverage and performance overhead. To ensure a fair comparison among these mechanisms, we use a metric called the SDC detection efficiency, which is defined in prior work [9] as the ratio between SDC coverage and overhead for a detection mechanism.

The platform for validation is Ubuntu 14.04 (AMD64 architecture). LLFI [20] is applied to perform fault injections. LLFI is an LLVM-based fault injection tool. The source code is translated into an intermediate representation (IR), and the IR code is then injected. The faults can be injected into specific program points, and the effect can be easily tracked back to the source code. LLFI is configured to inject into the destination register. In a single fault injection, LLFI randomly picks one instruction and injects one soft error into the destination operand. One fault injection experiment continues until the fault injection has been repeated 1000 times. The injected faults may affect data flow or control flow. We take the following LLVM IR code to explain the effect on control flow:

    %judge1 = icmp ne i32 1, 2

    br i1 %judge1, label %BB1, label %BB2

judge1 determines the outcome of the branch. If judge1 is injected, the branch instruction may choose the wrong branch and thus affect control flow. The mechanism of full instruction duplication is implemented by developing a new pass under the LLVM infrastructure. The pass is also used by Radish_D for the operation of instruction duplication, with certain conditions for duplication modified.

Figure 4: The comparison of performance overheads among full instruction duplication, Radish, and Radish_D. [Bar chart. Horizontal axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng. Vertical axis: performance overhead (%), 0 to 120. Series: Duplication, Radish, Radish_D.]
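The control-flow effect described for judge1 can be imitated with a single-bit flip. The sketch below does not use LLFI; `flip_bit` and `branch_target` are hypothetical helpers that only model a one-bit soft error in a destination value and the branch that consumes it.

```python
def flip_bit(value, bit, width=32):
    """Model a single-bit soft error in a register of the given width."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def branch_target(judge1):
    """br i1 %judge1, label %BB1, label %BB2"""
    return "BB1" if judge1 else "BB2"


golden = int(1 != 2)                    # icmp ne i32 1, 2 evaluates to true
faulty = flip_bit(golden, 0, width=1)   # the i1 result has a single bit to flip

golden_path = branch_target(golden)
faulty_path = branch_target(faulty)
```

The fault-free run takes BB1, while the injected run takes BB2, which is the kind of wrong-branch outcome the branch-point assertions are meant to catch.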

The programs used for evaluation are from the MiBench benchmark suite [21]. These programs are qsort (which performs the quick sort algorithm), isqrt (which is a base-two analogue of the square root algorithm), cubic (which solves a cubic polynomial), rad2deg (which converts between radians and degrees), crc (which computes a 32-bit CRC to detect accidental changes to raw data), and bitstrng (which prints the bit pattern of bytes formatted to a string). These are C programs consisting of a few hundred lines of C code. We use 25 inputs to extract invariants and randomly choose one input for the injection.

5.1. Comparison between Radish and Full Instruction Duplication. Figure 4 shows the performance overheads of Radish and full instruction duplication. We use the execution time of the original program as the baseline for comparison. Compared with the baseline, the average overhead incurred by Radish is 30.4% and the overhead incurred by full instruction duplication is 52.8%. The overhead of the full instruction duplication mechanism is 22.4 percentage points higher than the overhead of Radish for the studied programs.

Figure 5 shows the SDC coverages. The average SDC coverage of Radish is 77.1% and that of full instruction duplication is 84.3%. The average SDC coverage of full instruction duplication is 7.2 percentage points higher than that of Radish. For most of the benchmarks, the SDC coverages of full instruction duplication and Radish are very close.

Full instruction duplication does not achieve nearly 100% coverage since it does not check the results of store and branch instructions. For example, in Figure 3(b), which shows full instruction duplication, if R6 in i9 is injected, i10 is affected and may choose the wrong branch. SWIFT [15] raises the coverage to nearly 100% since it assumes that the hardware applies ECC and it adds a control flow checking mechanism.

The SDC detection efficiency can be observed in Figure 6. Radish has the higher SDC detection efficiency, which is 1.6 times that of full instruction duplication. This is because the mechanism of full instruction duplication protects all instructions executed, which yields high SDC coverage at very high overhead. Radish, however, obtains relatively high SDC coverage with much lower overhead. Radish achieves this by curbing the number of program points that generate assertions. Further, the execution cost of assertions is relatively low, and assertions have good SDC coverage since they are seldom satisfied when soft errors occur.

5.2. The Experimental Results of Radish_D. The average SDC coverage of Radish_D is 92.5%, which is 8.2 percentage points higher than that of full instruction duplication and 15.5 percentage points higher than that of Radish. This confirms that the instruction duplication of Radish_D protects unsafe code sections that are not covered by assertions. Radish_D may generate assertions that check the variable which is stored in memory after the store instruction (see Heuristic 1). Moreover, at the program points of branch instructions, branch-controlling variables are checked. Therefore, the assertions of Radish_D catch some of the faults that escape the detection of the duplication mechanism, and the coverage of Radish_D is higher than that of full instruction duplication.

The average overhead of Radish_D is 76.3%, lower than the sum of the overheads of full instruction duplication and Radish, because we eliminate the duplication of the instructions that have already been protected by assertions.

The average SDC detection efficiency of Radish_D is lower than that of full instruction duplication or Radish. For Radish_D, there are overlapping soft errors that can be detected by both instruction duplication and assertions. For these soft errors, deploying instruction duplication increases the overhead but does not increase the SDC coverage. The SDC detection efficiency is the ratio between SDC coverage and overhead, and thus it is lowered.
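As a rough arithmetic check of these comparisons, the efficiency metric can be recomputed from the reported averages, read here as 77.1%, 84.3%, and 92.5% coverage against 30.4%, 52.8%, and 76.3% overhead. This is only an approximation: a ratio of averages is not the same as an average of per-benchmark ratios, so the values need not match Figure 6 exactly.

```python
# (average SDC coverage in %, average performance overhead in %)
reported = {
    "Radish": (77.1, 30.4),
    "Full duplication": (84.3, 52.8),
    "Radish_D": (92.5, 76.3),
}

# SDC detection efficiency = SDC coverage / overhead
efficiency = {name: cov / ovh for name, (cov, ovh) in reported.items()}
radish_vs_duplication = efficiency["Radish"] / efficiency["Full duplication"]
```

Under this reading, Radish comes out at about 2.54, full duplication at about 1.60, and Radish_D at about 1.21, which agrees with the ordering stated in the text and with the factor of about 1.6 between Radish and full duplication.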

5.3. False Positives of Invariants. A false positive for an input can occur when the values at the assertion points for this input do not satisfy the condition of the assertion learned from the training inputs. We use 25 inputs for training and 100 inputs for testing. No faults are injected in these runs. We test all the programs that were used to evaluate SDC coverage in the fault injection experiment. The result shows that the average false positive rate of the studied programs is 4.8%.
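The measurement itself is simple to state in code. The sketch below assumes a single range assertion learned from the training values of one variable; `learn_range`, `false_positive_rate`, and the numbers are illustrative and are not taken from the experiments.

```python
def learn_range(training_values):
    """Learn a bounded-range assertion from fault-free training runs."""
    low, high = min(training_values), max(training_values)
    return lambda value: low <= value <= high


def false_positive_rate(assertion, fault_free_values):
    """Fraction of fault-free test runs on which the assertion fires."""
    alarms = sum(1 for value in fault_free_values if not assertion(value))
    return alarms / len(fault_free_values)


training = [3, 5, 8, 13, 21]
testing = [4, 7, 21, 22, 2, 9, 10, 11, 12, 20]   # no faults injected

rate = false_positive_rate(learn_range(training), testing)
rate_more_training = false_positive_rate(learn_range(training + [2, 22]), testing)
```

In this toy case two of the ten fault-free test values fall outside the learned range, and adding those values to the training set removes both alarms at the price of a looser assertion.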

Figure 5: The comparison of SDC coverages among full instruction duplication, Radish, and Radish_D. [Bar chart. Horizontal axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng. Vertical axis: SDC coverage (%), 0 to 100. Series: Duplication, Radish, Radish_D.]

Figure 6: The comparison of SDC detection efficiencies among full instruction duplication, Radish, and Radish_D. [Bar chart. Horizontal axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng. Vertical axis: SDC detection efficiency, 0 to 4. Series: Duplication, Radish, Radish_D.]

We also conduct an experiment to examine the effect of the training set size. The result for qsort is shown in Figure 7. The training set consists of 25, 50, and 75 inputs, and false positives are computed across 100 inputs.

The false positive rate decreases from 5% to 3% as the training set size is increased from 25 to 50, and to 2% for 75 inputs. The SDC coverage also decreases as the training set increases from 25 to 75 inputs. The impact of increasing the training set size on both SDC coverage and false positive rate is significant. Hence, we should choose the training set size according to the user's target. If the user specifies bounds on SDC coverage and overhead (with the false positive rate converted into overhead), we can choose a training set size that meets the target.

In addition, reexecution can reduce the overhead incurred by false positives. When an assertion raises an alarm, we can determine whether it is a false positive by reexecuting the program. If the assertion raises the alarm again, it is a false positive. In this case, the alarm can be ignored and the program can continue.
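The reexecution rule can be sketched as follows, assuming that a soft error is transient and therefore does not recur in the second run. The toy program, the fault mask, and the names are hypothetical.

```python
def toy_program(x, fault_mask=None):
    """Doubles its input; an optional mask models a bit flip in the result."""
    value = 2 * x
    if fault_mask is not None:
        value ^= fault_mask
    return value


def run_with_reexecution(program, x, assertion, fault_mask=None):
    first = program(x, fault_mask)
    if assertion(first):
        return "ok", first
    second = program(x)   # the transient fault is assumed not to recur
    if assertion(second):
        return "soft_error_detected", second
    return "false_positive", second   # same alarm twice: ignore it and continue


in_learned_range = lambda value: value <= 20   # assumed to come from training

clean = run_with_reexecution(toy_program, 5, in_learned_range)
injected = run_with_reexecution(toy_program, 5, in_learned_range, fault_mask=32)
unseen_input = run_with_reexecution(toy_program, 15, in_learned_range)
```

The injected run fails the assertion once and passes on reexecution, whereas the unseen input fails both times and is classified as a false positive.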

From the discussion above, it can be concluded that Radish_D has higher SDC coverage than Radish or full instruction duplication. But its overhead is also higher, which suggests that Radish_D should be used in situations where SDC coverage has priority over overhead. Further, the SDC detection efficiency of Radish is far higher than that of Radish_D or full instruction duplication, which means that it is more cost-effective. But Radish may incur overhead due to false positives. Users can choose Radish or Radish_D according to the tradeoff they want between SDC coverage and performance overhead.

6. Related Work

Prior research [8, 22, 23] applies invariants with a single variable, and most of the invariants are based on bounded ranges. We apply invariants with more variables, which can achieve better coverage on many occasions. For example, we can always extract the invariant $n - k + 1 = 0$ from a typical loop structure, shown as follows:

    for (k = 1; k <= n; k++) {
        ...
    }
    ← Extracting invariant from here

Figure 7: The SDC coverage and false positive rate for varied training set sizes. [Chart. Horizontal axis: training set size (25, 50, 75). Left vertical axis: SDC coverage, 0.72 to 0.78. Right vertical axis: false positive rate, 0 to 0.06. Series: SDC coverage, False positive.]

It is found that assert($n - k + 1 = 0$) is often better at detecting errors than the bounded-range-based invariant assert($k_{min} \le k \le k_{max}$), since assert($n - k + 1 = 0$) checks both $n$ and $k$, while assert($k_{min} \le k \le k_{max}$) only checks $k$.

A typical criterion for the selection of detectors is the tightness defined in [22]: the probability that the detector detects an error, given that there is an error in the value of the variable that it checks. The notion of tightness is based on the value of a single variable. The invariants in this paper may include two or three variables, and the notion of tightness cannot be used to describe an invariant with more than one variable. For example, if $x$ is flipped in the invariant $x < y$, since there are multiple possible values of $y$, it cannot be decided whether the invariant is still satisfied, and thus the tightness cannot be calculated. Since the tightness cannot be used, we apply certain heuristics to choose invariants, and they prove to be effective in our experiments.
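A small experiment makes the difference concrete. The sketch assumes that the loop body leaves $n$ and $k$ untouched and that the range bounds 2 and 101 were learned from training runs with $n$ between 1 and 100; both assumptions and all names are illustrative.

```python
def loop_exit_state(n):
    """State after: for (k = 1; k <= n; k++) { ... }"""
    k = 1
    while k <= n:
        k += 1
    return n, k


def multi_variable_assert(n, k):
    return n - k + 1 == 0


def range_assert(k, k_min=2, k_max=101):
    return k_min <= k <= k_max


n, k = loop_exit_state(10)   # fault-free: n = 10, k = 11
corrupted_n = n ^ (1 << 2)   # single-bit flip in n: 14
corrupted_k = k ^ (1 << 4)   # single-bit flip in k: 27

caught_n = not multi_variable_assert(corrupted_n, k)
caught_k = not multi_variable_assert(n, corrupted_k)
range_sees_n = not range_assert(k)              # n never enters the range check
range_sees_k = not range_assert(corrupted_k)    # 27 still lies within the bounds
```

Both flips violate the two-variable invariant, while the range assertion notices neither of them in this example.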

7. Conclusion

To address the problem of detecting SDC, we propose an approach that applies invariant-based assertions and implement a system called Radish. Radish neither requires any hardware modifications to add error detection capability to the original system nor requires knowledge of the semantics of the program, and thus it scales well. Experiments show that Radish achieves high SDC coverage with low overhead.

Furthermore, we propose Radish_D, which adds instruction duplication to the unsafe code sections that are not covered by assertions. Radish_D achieves higher SDC coverage than Radish or the full instruction duplication mechanism. Both Radish and Radish_D offer feasible alternatives for soft error mitigation.

Competing Interests

The authors declare no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Basic Research Program of China ("973" Project).

References

[1] H Schirmeier C Borchert and O Spinczyk ldquoAvoiding pitfallsin fault-injection based comparison of program susceptibilityto soft errorsrdquo in Proceedings of the 45th Annual IEEEIFIPInternational Conference on Dependable Systems and Networks(DSN rsquo15) pp 319ndash330 IEEE Rio de Janeiro Brazil June 2015

[2] A O Daniel L P Laercio S Thiago et al ldquoEvaluationand mitigation of radiation-induced soft errors in graphicsprocessing unitsrdquo IEEE Transactions on Computers vol 65 no3 pp 791ndash804 2016

[3] S S Mukherjee J Emer and S K Reinhardt ldquoThe soft errorproblem an architectural perspectiverdquo in Proceedings of the11th International Symposium on High-Performance ComputerArchitecture (HPCA rsquo05) pp 243ndash247 San Francisco CalifUSA February 2005

[4] D Binder E C Smith and A B Holman ldquoSatellite anomaliesfrom galactic cosmic raysrdquo IEEE Transactions on NuclearScience vol 22 no 6 pp 2675ndash2680 1975

[5] J Olsen P E Becher P B Fynbo P Raaby and J SchultzldquoNeutron-induced single event upsets in static RAMS observedat 10 km flight altituderdquo IEEE Transactions on Nuclear Sciencevol 40 no 2 pp 74ndash77 1993

[6] J P Walters K M Zick and M French ldquoA practical char-acterization of a NASA SpaceCube application through faultemulation and laser testingrdquo in Proceedings of the 43rd AnnualIEEEIFIP International Conference on Dependable Systems andNetworks (DSN rsquo13) pp 1ndash8 June 2013

[7] S Mittal and J S Vetter ldquoA survey of techniques for modelingand improving reliability of computing systemsrdquo IEEE Trans-actions on Parallel and Distributed Systems vol 27 no 4 pp1226ndash1238 2016

[8] P Racunas K Constantinides S Manne and S S MukherjeeldquoPerturbation-based fault screeningrdquo in Proceedings of the IEEE13th International Symposium on High Performance ComputerArchitecture pp 169ndash180 Scottsdale Ariz USA February 2007

[9] Q Lu K Pattabiraman M S Gupta et al ldquoSDCTune amodel for predicting the SDC proneness of an applicationfor configurable protectionrdquo in Proceedings of the CompilersArchitecture and Synthesis for Embedded Systems pp 1ndash10 UttarPradesh India 2014

[10] N J Wang and S J Patel ldquoReStore symptom-based soft errordetection in microprocessorsrdquo IEEE Transactions on Depend-able and Secure Computing vol 3 no 3 pp 188ndash201 2006

10 International Journal of Aerospace Engineering

[11] M-L Li P Ramachandran S K Sahoo S V Adve V S Adveand Y Zhou ldquoUnderstanding the propagation of hard errorsto software and implications for resilient system designrdquo ACMSIGARCH Computer Architecture News vol 36 no 1 pp 265ndash276 2008

[12] N Oh P P Shirvani and E J McCluskey ldquoError detectionby duplicated instructions in super-scalar processorsrdquo IEEETransactions on Reliability vol 51 no 1 pp 63ndash75 2002

[13] M Shafique S Rehman P V Aceituno and J HenkelldquoExploiting program-level masking and error propagation forconstrained reliability optimizationrdquo in Proceedings of the 50thAnnual Design Automation Conference (DAC rsquo13) pp 1ndash17Austin Tex USA June 2013

[14] S Rehman M Shafique P V Aceituno F Kriebel J-J Chenand J Henkel ldquoLeveraging variable function resilience for selec-tive software reliability on unreliable hardwarerdquo in Proceedingsof the 16th Design Automation and Test in Europe Conferenceand Exhibition (DATE rsquo13) pp 1759ndash1764 Grenoble FranceMarch 2013

[15] G A Reis J Chang N Vachharajani R Rangan and DI August ldquoSWIFT software implemented fault tolerancerdquo inProceedings of the International Symposium on Code Generationand Optimization pp 243ndash254 IEEE Computer Society SanJose Calif USA 2005

[16] M D Ernst J H Perkins P J Guo et al ldquoThe Daikon systemfor dynamic detection of likely invariantsrdquo Science of ComputerProgramming vol 69 no 1ndash3 pp 35ndash45 2007

[17] A Thomas and K Pattabiraman ldquoError detector placement forsoft computationrdquo in Proceedings of the 43rd Annual IEEEIFIPInternational Conference on Dependable Systems and Networks(DSN rsquo13) pp 1ndash12 IEEE Computer Society June 2013

[18] F Shuguang G Shantanu A Amin et al ldquoShoestring proba-bilistic soft error reliability on the cheaprdquo in Proceedings of theASPLOS pp 385ndash396 Pittsburgh Pa USA 2010

[19] C Lattner and V Adve ldquoLLVM a compilation framework forlifelong program analysis amp transformationrdquo in Proceedings ofthe International Symposium onCode Generation andOptimiza-tion (CGO rsquo04) pp 75ndash86 San Jose Calif USA March 2004

[20] A Thomas and K Pattabiraman ldquoLLFI an intermediate codelevel fault injector for soft computing applicationsrdquo in Proceed-ings of the Workshop on Silicon Errors in Logic System Effects(SELSE rsquo13) pp 1ndash8 Palo Alto Calif USA 2013

[21] M RGuthaus J S RingenbergD Ernst et al ldquoMiBench a freecommercially representative embedded benchmark suiterdquo inProceedings of the Workload Characterization pp 3ndash14 AustinTex USA 2001

[22] K Pattabiraman S Giacinto C Daniel et al ldquoDynamicderivation of application-specific error detectors and theirimplementation in hardwarerdquo inProceedings of the 6th EuropeanDependable Computing Conference (EDCC rsquo06) pp 97ndash108Coimbra Portugal October 2006

[23] S K Sahoo M-L Li P Ramachandran S V Adve V SAdve and Y Zhou ldquoUsing likely program invariants to detecthardware errorsrdquo in Proceedings of the 38th Annual IEEEIFIPInternational Conference on Dependable Systems and Networks(DSN rsquo08) pp 70ndash79 IEEE Computer Society AnchorageAlaska USA June 2008

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 5: Research Article Detecting Silent Data Corruptions in ...downloads.hindawi.com/journals/ijae/2016/8213638.pdf · Research Article Detecting Silent Data Corruptions in Aerospace-Based

International Journal of Aerospace Engineering 5

criteria are formulated on the basis of the propagation of SDC. These heuristics are generic and can be applied to any invariants. We list the heuristics first and then describe the selection steps.

Heuristic 1. There are certain types of variables that should be monitored at each target program point.

A fraction of variables are capable of telling whether the execution is going well, and thus monitoring these variables makes it possible to detect SDC. The target program point set $\Theta_{\mathrm{cri}}$ can be categorized into program points of connector instructions and program points of branch instructions. At the program points of connector instructions, it is the connector variables that deserve special attention, since they reflect whether the results of functions are correct. At the program points of branch instructions, branch-controlling variables, which appear in the statement of an if, while, or for structure, reflect the status of these structures and thus should be monitored. Therefore, for all target program points, we find certain variables to monitor.

Heuristic 2. The likelihood of detecting SDC increases if the number of valid values defined by an assertion decreases.

Invalid values cannot pass the examination of assertions in the presence of a soft error. Therefore, having more invalid values (fewer valid values) means that the likelihood of detecting SDC increases. The number of valid values of an invariant is determined by its relationship. The equality relationship, which uses "=" as its operator, has only one valid value; then come inclusion (∈, ⊆), range (>, <), and inequality (≠) relationships, in order of ascending number of valid values.

Heuristic 3. The likelihood of detecting SDC increases if more variables are included in an assertion.

The more variables that appear in an assertion, the more variables it can monitor. If any of the variables gets corrupted due to a soft error, the assertion will be able to catch the error. Thus, having more variables in an assertion leads to higher coverage of SDC. So far, the largest number of variables is 3, which corresponds to ternary relationships.

Utilizing these heuristics, we are able to reduce the number of invariants and obtain more effective assertions. The selecting phase has three steps.

Step 1. The invariants which contain connector variables at the program points of connector instructions, or branch-controlling variables at the program points of branch instructions, are selected based on Heuristic 1.

Step 2. The invariants with the relationship that has fewer valid values are picked according to Heuristic 2.

Step 3. The invariants which contain the largest number of variables are selected according to Heuristic 3.

The selecting process stops when there is only one invariant left or all the steps have been performed. Then we convert the chosen invariants into assertions, which is basically a string conversion problem. For brevity's sake, we do not discuss it in this paper. Finally, we include the assertion header file at the beginning of the new source file to make sure that the assertions can work.
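The three selection steps can be sketched as a small filter. This is only an illustration under stated assumptions: the `Invariant` record, the `RELATION_RANK` table, and the sample invariants are hypothetical and are not taken from the Radish implementation. The early stop when one invariant remains follows the description above.

```python
from dataclasses import dataclass

# Heuristic 2 ordering: equality admits the fewest valid values,
# inequality the most (hypothetical encoding of the relationships).
RELATION_RANK = {"eq": 0, "in": 1, "range": 2, "ne": 3}


@dataclass(frozen=True)
class Invariant:
    text: str         # readable form, e.g. "n - k + 1 == 0"
    variables: tuple  # variables mentioned by the invariant
    relation: str     # one of the keys of RELATION_RANK


def select_invariants(candidates, key_variables):
    """Apply Steps 1-3 in order, stopping early when one invariant is left."""
    # Step 1: keep invariants that mention a connector or
    # branch-controlling variable of this program point.
    pool = [inv for inv in candidates if set(inv.variables) & set(key_variables)]
    if len(pool) <= 1:
        return pool
    # Step 2: keep the relationship with the fewest valid values.
    best_rank = min(RELATION_RANK[inv.relation] for inv in pool)
    pool = [inv for inv in pool if RELATION_RANK[inv.relation] == best_rank]
    if len(pool) <= 1:
        return pool
    # Step 3: keep the invariants with the largest number of variables.
    most = max(len(inv.variables) for inv in pool)
    return [inv for inv in pool if len(inv.variables) == most]


candidates = [
    Invariant("k >= 1", ("k",), "range"),
    Invariant("k != 0", ("k",), "ne"),
    Invariant("k == 11", ("k",), "eq"),
    Invariant("n - k + 1 == 0", ("n", "k"), "eq"),
    Invariant("buf != 0", ("buf",), "ne"),
]
chosen = select_invariants(candidates, key_variables={"k"})
print([inv.text for inv in chosen])
```

In this hypothetical run, Step 1 drops the invariant on `buf`, Step 2 keeps the two equalities, and Step 3 prefers the one that monitors both `n` and `k`.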

4. Radish_D

The assertions generated by Radish cannot fully monitor all the variables and program points; thus certain faults might propagate through unprotected code sections. To further increase the coverage of SDC, we introduce a software-based instruction duplication mechanism to protect the code sections that are not covered by Radish.

This paper utilizes the instruction duplication mechanism of SWIFT [15] for comparison and also for our own duplication in Radish_D. SWIFT duplicates all computation instructions along the path of replication, and the replica instructions use different registers and different memory locations. At certain synchronization points, comparison instructions are inserted to check whether the original instructions and their replicas have identical values.

Rather than deploying the full instruction duplication mechanism of SWIFT, Radish_D applies a selective instruction duplication mechanism. Because a portion of the instructions have already been protected by assertions, we only need to duplicate the others.

Before deploying duplications, we need to determine which variables are safe under the protection of assertions. An assertion is capable of protecting the variables which appear in its statement. However, the protection does not last for the entire lifetime of those variables. Only the fraction from the beginning of the local function until the variable's host assertion is considered safe, since the variable's value is checked during the execution of the assertion.

We partition each variable's lifetime by assertions and identify the safe periods. Then duplications are deployed at the instruction level. The targets of duplication are the instructions which do not contain a variable in its safe period as an operand. A replica instruction is created by copying the opcode and operands of the original instruction. The destination operand is changed into an unused register, which becomes the copy of the original destination operand. Next, we decide whether there is a need to change the source operands of the replica instruction. If there is already a copy of a source operand, which means that this source operand was some instruction's destination operand and thus got a copy, we replace the replica instruction's source operand with its copy. The replica instruction is inserted before the original instruction in the same basic block.

Besides, store, branch, and call instructions are chosen as the synchronization points. If any source operand of these instructions has a copy, we compare its value with that of its copy by inserting a compare instruction. According to the type of the operand (int or float), the compare instruction can be either an icmp instruction or an fcmp instruction. After the compare instruction, a branch instruction using the predicate of neq is inserted into the code. If the two values show a discrepancy, it will jump to a function called faultDetected; otherwise, it will continue to execute the previous store, branch, or call instruction. The function faultDetected outputs error messages and returns with an exit code, which will inform the system of the soft error and end the execution.
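The duplication rules above can be tried out on a toy instruction list. The sketch below is not the LLVM pass used in this work; the `Instr` record, the `harden` function, and the temporary names `t1`, `t2` are hypothetical. Two assumptions are made loudly: registers that are live on entry already have replicas (passed in through `copies`), and the compare that feeds a branch is treated as part of the synchronization point, as in the four-instruction example of Figure 3(a). The emitted text uses the paper's pseudo-assembly notation, not strict LLVM IR.

```python
from dataclasses import dataclass

# Assumption: the compare feeding a branch counts as a synchronization point.
SYNC_OPS = {"store", "br", "call", "icmp"}


@dataclass
class Instr:
    dest: str        # destination register, "" if there is none
    op: str          # opcode, e.g. "xor", "add", "icmp eq", "br"
    srcs: tuple      # source registers
    label: str = ""  # branch target, if any

    def render(self):
        text = f"{self.op} {', '.join(self.srcs)}"
        if self.label:
            text += f", label {self.label}"
        return f"{self.dest} = {text}" if self.dest else text


def harden(instrs, safe, copies):
    """Return the hardened code as a list of strings.

    safe   -- registers protected by an assertion (never duplicated)
    copies -- existing replicas of live-in registers, e.g. {"R2": "R2'"}
    """
    copies = dict(copies)
    out, checks = [], 0
    for ins in instrs:
        if ins.op.split()[0] in SYNC_OPS:
            # Check every source operand that has a replica.
            for src in ins.srcs:
                if src in copies:
                    checks += 1
                    out.append(f"t{checks} = icmp neq {src}, {copies[src]}")
                    out.append(f"br t{checks}, label faultDetected")
        elif not ({ins.dest, *ins.srcs} & safe):
            # Replica: same opcode, fresh destination, replicated sources.
            replica = Instr(ins.dest + "'", ins.op,
                            tuple(copies.get(s, s) for s in ins.srcs))
            copies[ins.dest] = replica.dest
            out.append(replica.render())  # replica goes before the original
        out.append(ins.render())
    return out


original = [
    Instr("R3", "xor", ("R1", "R2")),
    Instr("R5", "add", ("R4", "R2")),
    Instr("R6", "icmp eq", ("R3", "R5")),
    Instr("", "br", ("R6",), "b2"),
]
full = harden(original, safe=set(),
              copies={"R1": "R1'", "R2": "R2'", "R4": "R4'"})
selective = harden(original, safe={"R1", "R3"},
                   copies={"R2": "R2'", "R4": "R4'"})
print("\n".join(selective))
print(len(full), len(selective))
```

With no safe registers the sketch emits ten instructions; marking `R1` and `R3` as safe removes the replica of the `xor` and its check, leaving seven.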

We use an example to show the distinction between our method and the full instruction duplication mechanism in Figure 3. For consistency, we make use of the LLVM [19] assembly language to present the assembly code. Figure 3(a) shows the original assembly code and Figure 3(b) shows the assembly code after full instruction duplication. It can be found in Figure 3(b) that $R3'$ is the replica of $R3$ and that the duplication is accomplished by the instruction $i2$. Similarly, $R5'$ is the replica of $R5$ through the duplication by $i4$. $i9$ is the synchronization point, and the source operands of $i9$, $R3$ and $R5$, need to be examined. $i5$ and $i7$ compare $R3$ and $R5$ with their replicas, respectively. If the values of $R3$ and $R3'$ are not equal, $i6$ will call faultDetected to report a soft error.

(a) Original assembly code

    i1: R3 = xor R1, R2
    i2: R5 = add R4, R2
    i3: R6 = icmp eq R3, R5
    i4: br R6, label b2

(b) Assembly code after full instruction duplication

    i1:  R3 = xor R1, R2
    i2:  R3' = xor R1', R2'
    i3:  R5 = add R4, R2
    i4:  R5' = add R4', R2'
    i5:  R7 = icmp neq R3, R3'
    i6:  br R7, label faultDetected
    i7:  R8 = icmp neq R5, R5'
    i8:  br R8, label faultDetected
    i9:  R6 = icmp eq R3, R5
    i10: br R6, label b2

(c) Assembly code of Radish_D

    i1: R3 = xor R1, R2
    i2: R5 = add R4, R2
    i3: R5' = add R4', R2'
    i4: R7 = icmp neq R5, R5'
    i5: br R7, label faultDetected
        assert(R1 > R3)
    i6: R6 = icmp eq R3, R5
    i7: br R6, label b2

Figure 3: A sample assembly code before and after transformation of full instruction duplication and Radish_D.

The assembly code generated by Radish_D is shown in Figure 3(c). Assume that we have already obtained an assertion about $R1$ and $R3$ by utilizing Radish, which is shown in the line of code "assert($R1 > R3$)". Due to the assertion, $R1$ and $R3$ are considered safe during the execution of this example. $R1'$ and $R3'$ are no longer necessary, and the instructions used for their duplication are eliminated. Variables other than $R1$ and $R3$ still need to be duplicated and checked; thus $R5$ is duplicated by $i3$ and checked at the synchronization point $i6$. The efficiency of Radish_D and of the full instruction duplication mechanism will be examined in the next section.

5. Experiment

This paper applies fault injection experiments to validate the effectiveness of Radish and Radish_D. The fault injection experiment is performed on the original executable first. The hardened executables using Radish, Radish_D, and full instruction duplication are targeted subsequently. We compare the results of the fault injection experiments and calculate the SDC coverage and performance overhead. To ensure a fair comparison among these mechanisms, we use a metric called the SDC detection efficiency, which is defined in prior work [9] as the ratio between SDC coverage and overhead for a detection mechanism.

The platform for validation is Ubuntu 14.04 (AMD64 architecture). LLFI [20] is applied to perform fault injections. LLFI is an LLVM-based fault injection tool. The source code is translated into an intermediate representation (IR) and the faults are then injected into the IR code. The faults can be injected into specific program points and the effect can be easily traced back to the source code. LLFI is configured to inject faults into destination registers. In a single fault injection, LLFI randomly picks one instruction and injects one soft error into its destination operand. One fault injection experiment continues until the fault injection has been repeated 1000 times. The injected faults may affect data flow or control flow. We take the following LLVM IR code to explain the effect on control flow:

    %judge1 = icmp ne i32 %1, %2
    br i1 %judge1, label %BB1, label %BB2

judge1 determines the outcome of the branch. If judge1 is injected, the branch instruction may choose the wrong branch and thus affect control flow. The mechanism of full instruction duplication is implemented by developing a new pass under the LLVM infrastructure. The pass is also used by Radish_D for the operation of instruction duplication by modifying certain conditions for duplication.

Figure 4: The comparison of performance overheads among full instruction duplication, Radish, and Radish_D (bar chart over qsort, isqrt, cubic, rad2deg, crc, and bitstrng; y-axis: performance overhead (%)).
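The effect of a fault in judge1 described above can be mimicked outside LLVM. The sketch below does not use LLFI; it assumes a single-bit-flip fault model, and the helper names `flip_bit`, `branch_target`, and `campaign`, as well as the operand values, are hypothetical. The small campaign additionally shows that some flips in a data-flow value are masked and therefore do not corrupt the output.

```python
import random


def flip_bit(value, bit, width=32):
    """Flip one bit of an unsigned integer of the given width."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def branch_target(a, b, fault_bit=None):
    """Model `judge1 = icmp ne a, b` followed by `br judge1, BB1, BB2`.

    judge1 is a 1-bit value; when fault_bit is given, that bit of judge1
    is flipped before the branch reads it.
    """
    judge1 = int(a != b)
    if fault_bit is not None:
        judge1 = flip_bit(judge1, fault_bit, width=1)
    return "BB1" if judge1 else "BB2"


def campaign(runs=1000, seed=0):
    """Count how many single-bit flips in x change the output y = x >> 4."""
    rng = random.Random(seed)
    golden = (1000 + 234) >> 4
    corrupted = 0
    for _ in range(runs):
        x = flip_bit(1000 + 234, rng.randrange(32))  # fault in the add result
        if (x >> 4) != golden:
            corrupted += 1
    return corrupted


print(branch_target(1, 2))               # fault-free run takes BB1
print(branch_target(1, 2, fault_bit=0))  # flipped judge1 takes BB2
print(campaign())                        # flips in the low 4 bits are masked
```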

The programs used for evaluation are from the MiBench benchmark suite [21]. These programs are qsort (which performs the quick sort algorithm), isqrt (which is a base-two analogue of the square root algorithm), cubic (which solves a cubic polynomial), rad2deg (which converts between radians and degrees), crc (which computes a 32-bit CRC to detect accidental changes to raw data), and bitstrng (which prints the bit pattern of bytes formatted as a string). These are C programs consisting of a few hundred lines of C code. We use 25 inputs to extract invariants and randomly choose one input for the injection.

5.1. Comparison between Radish and Full Instruction Duplication. Figure 4 shows the performance overheads of Radish and full instruction duplication. We use the execution time of the original program as the baseline for comparison. Compared with the baseline, the average overhead incurred by Radish is 30.4% and the overhead incurred by full instruction duplication is 52.8%. The overhead of the full instruction duplication mechanism is 22.4 percentage points higher than the overhead of Radish for the studied programs.

Figure 5 shows the SDC coverages. The average SDC coverage of Radish is 77.1% and that of full instruction duplication is 84.3%. The average SDC coverage of full instruction duplication is 7.2 percentage points higher than that of Radish. For most of the benchmarks, the SDC coverages of full instruction duplication and Radish are very close.

Full instruction duplication does not achieve nearly 100% coverage since it does not check the results of store and branch instructions. For example, in Figure 3(b), which denotes the full instruction duplication, if $R6$ in $i9$ is injected, $i10$ is affected and may choose the wrong branch. SWIFT [15] raises the coverage to nearly 100% since it assumes that the hardware applies ECC and it adds a control flow checking mechanism. The SDC detection efficiency can be observed in Figure 6. Radish has higher SDC detection efficiency, which is 1.6 times that of full instruction duplication. This is because the mechanism of full instruction duplication protects all instructions executed, which yields high SDC coverage at the cost of very high overhead. Radish, however, obtains relatively high SDC coverage with much lower overhead. Radish achieves this by curbing the number of program points that generate assertions. Further, the execution cost of assertions is relatively low, and assertions have good SDC coverage since they are seldom satisfied when soft errors occur.

5.2. The Experimental Results of Radish_D. The average SDC coverage of Radish_D is 92.5%, which is 8.2 percentage points higher than that of full instruction duplication and 15.5 percentage points higher than that of Radish. It can be validated that the instruction duplication of Radish_D protects unsafe code sections that are not covered by assertions. Radish_D may generate assertions that check a variable which is stored in memory after the store instruction (see Heuristic 1). Moreover, at the program points of branch instructions, branch-controlling variables are checked. Therefore, the assertions of Radish_D catch some of the faults that escape the detection of the duplication mechanism, and the coverage of Radish_D is higher than that of full instruction duplication.

The average overhead of Radish_D is 76.3%, lower than the sum of the overheads of full instruction duplication and Radish, because we eliminate the duplication deployed to the instructions that have already been protected by assertions.

The average SDC detection efficiency of Radish_D is lower than that of full instruction duplication or Radish. For Radish_D, there are overlapping soft errors that can be detected by both instruction duplication and assertions. For these soft errors, the overhead increases by deploying instruction duplication but the SDC coverage does not increase. The SDC detection efficiency is the ratio between SDC coverage and overhead, and thus it is lowered.
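As a quick check of this ordering, the efficiency can be recomputed from the average figures reported in this section, read as 77.1% coverage at 30.4% overhead for Radish, 84.3% at 52.8% for full instruction duplication, and 92.5% at 76.3% for Radish_D. The function name is hypothetical, and the ratio of averages computed here is only an approximation: the per-benchmark efficiencies plotted in Figure 6 need not average to exactly the same values.

```python
def sdc_detection_efficiency(coverage_pct, overhead_pct):
    """Ratio between SDC coverage and performance overhead (both in percent)."""
    return coverage_pct / overhead_pct


# (average SDC coverage %, average overhead %) as reported in Section 5
averages = {
    "Radish": (77.1, 30.4),
    "Duplication": (84.3, 52.8),
    "Radish_D": (92.5, 76.3),
}
eff = {name: sdc_detection_efficiency(c, o) for name, (c, o) in averages.items()}

for name, value in eff.items():
    print(f"{name}: {value:.2f}")
print(f"Radish / Duplication: {eff['Radish'] / eff['Duplication']:.2f}")
```

Under these inputs Radish_D has the lowest ratio and Radish the highest, and the Radish-to-duplication ratio comes out close to the factor of 1.6 quoted in Section 5.1.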

5.3. False Positives of Invariants. A false positive for an input can occur when the values at the assertion points for this input do not satisfy the condition of the assertion learned from the training inputs. We use 25 inputs for training and 100 inputs for testing. No faults are injected in these runs. We test all the programs that were used to evaluate SDC coverage in the fault injection experiment. The result shows that the average false positive rate of the studied programs is 4.8%.


Figure 5: The comparison of SDC coverages among full instruction duplication, Radish, and Radish_D (bar chart over qsort, isqrt, cubic, rad2deg, crc, and bitstrng; y-axis: SDC coverage (%)).

Figure 6: The comparison of SDC detection efficiencies among full instruction duplication, Radish, and Radish_D (bar chart over qsort, isqrt, cubic, rad2deg, crc, and bitstrng; y-axis: SDC detection efficiency).

We also conduct an experiment to examine the effect of the training set size. The result for qsort is shown in Figure 7. The training set consists of 25, 50, and 75 inputs, and false positives are computed across 100 inputs.

The false positive rate decreases from 5% to 3% as the training set size is increased from 25 to 50, and to 2% for 75 inputs. The SDC coverage also decreases as the training set increases from 25 to 75 inputs. The impact on both SDC coverage and false positive rate from increasing the training set size is significant. Hence, we should choose the training set size according to the user's target. If the user specifies the bounds on SDC coverage and overhead, then, by turning the false positive rate into overhead, we can choose a training set size that achieves the target.
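The way the false positive rate depends on the training set can be reproduced with a toy invariant learner. The sketch below is not Daikon and does not use the benchmark inputs; it assumes a single bounded-range invariant learned from synthetic, fault-free values, and all names and the Gaussian input distribution are hypothetical. Because the three training sets are nested, the learned range can only widen, so the measured rate cannot increase with the training set size.

```python
import random


def learn_range(values):
    """Learn a bounded-range invariant lo <= v <= hi from observed values."""
    return min(values), max(values)


def false_positive_rate(train, test):
    """Fraction of fault-free test values that violate the learned invariant."""
    lo, hi = learn_range(train)
    alarms = sum(1 for v in test if not lo <= v <= hi)
    return alarms / len(test)


rng = random.Random(1)
pool = [rng.gauss(0.0, 1.0) for _ in range(75)]   # training candidates
test = [rng.gauss(0.0, 1.0) for _ in range(100)]  # 100 fault-free test inputs

rates = [false_positive_rate(pool[:n], test) for n in (25, 50, 75)]
for n, rate in zip((25, 50, 75), rates):
    print(f"training set size {n}: false positive rate {rate:.2f}")
```

A wider range also accepts more corrupted values, which is consistent with the drop in SDC coverage observed when the training set grows.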

Besides, reexecution can reduce the overhead incurred by false positives. When an assertion raises an alarm, we can determine whether it is a false positive by reexecuting it. If the assertion raises an alarm again, it is a false positive. In this case, the alarm can be ignored and the program can continue.
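The reexecution rule can be written down as a short decision procedure. The sketch assumes that a soft error is transient and therefore does not recur when the checked code is executed again; the function names, the range invariant, and the values are hypothetical.

```python
def assertion_holds(value, lo, hi):
    """The learned invariant: lo <= value <= hi."""
    return lo <= value <= hi


def run(value, lo, hi, fault_mask=0):
    """Execute the checked computation; fault_mask models a transient bit flip."""
    observed = value ^ fault_mask
    return assertion_holds(observed, lo, hi)


def classify_alarm(value, lo, hi):
    """Called after a first alarm; reexecutes once without any injected fault."""
    if run(value, lo, hi):
        return "soft error"      # the alarm did not repeat: transient fault
    return "false positive"      # the alarm repeats on a fault-free run


lo, hi = 0, 100

# Case 1: a bit flip pushes a legitimate value out of range.
assert not run(50, lo, hi, fault_mask=1 << 10)
print(classify_alarm(50, lo, hi))

# Case 2: a fault-free input that the training runs never covered.
assert not run(150, lo, hi)
print(classify_alarm(150, lo, hi))
```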

From the discussion above, it can be concluded that Radish_D has higher SDC coverage than Radish or full instruction duplication. But its overhead is also higher, which suggests that Radish_D should be used in situations where SDC coverage has higher priority than overhead. Further, the SDC detection efficiency of Radish is far higher than that of Radish_D or full instruction duplication, which means that it is more cost-effective. But Radish may incur overhead due to false positives. Users can choose Radish or Radish_D according to their consideration of the tradeoff between SDC coverage and performance overhead.

6. Related Work

Prior research [8, 22, 23] applies invariants with a single variable, and most of the invariants are based on a bounded range. We apply invariants with more variables, which can achieve better coverage on many occasions. For example, we can always extract an invariant $n - k + 1 = 0$ from a typical loop structure shown as follows:

    for (k = 1; k <= n; k++)
        ...
    ← extracting invariant from here


Figure 7: The SDC coverage and false positive rate for varied training set sizes (x-axis: training set size, 25, 50, and 75; left y-axis: SDC coverage; right y-axis: false positive rate).

It is found that assert($n - k + 1 = 0$) is often better than the bounded-range-based invariant assert($k_{\min} \le k \le k_{\max}$) at detecting errors, since assert($n - k + 1 = 0$) checks both $n$ and $k$, while assert($k_{\min} \le k \le k_{\max}$) only checks $k$.
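The difference between the two assertions can be counted directly for single-bit flips. The sketch assumes the state right after the loop exits, where $k = n + 1$; the concrete values of `n`, `k_min`, and `k_max`, and the restriction to the low 8 bits, are hypothetical choices made only for illustration.

```python
def flip_bit(value, bit):
    """Flip one bit of a non-negative integer."""
    return value ^ (1 << bit)


def detected_by_relation(n, k):
    """True if assert(n - k + 1 == 0) would fail."""
    return n - k + 1 != 0


def detected_by_range(k, k_min, k_max):
    """True if assert(k_min <= k <= k_max) would fail."""
    return not (k_min <= k <= k_max)


n = 10
k = n + 1              # value of k right after `for (k = 1; k <= n; k++)`
k_min, k_max = 1, 64   # hypothetical range learned from training runs
bits = range(8)        # inject into each of the low 8 bits in turn

# The relational invariant sees faults in both k and n.
relation_hits = (sum(detected_by_relation(n, flip_bit(k, b)) for b in bits)
                 + sum(detected_by_relation(flip_bit(n, b), k) for b in bits))
# The range invariant never reads n, so only faults in k can be caught.
range_hits = sum(detected_by_range(flip_bit(k, b), k_min, k_max) for b in bits)

print(f"relation invariant: {relation_hits} of 16 flips detected")
print(f"range invariant:    {range_hits} of 16 flips detected")
```

In this setting every flip in `n` or `k` breaks the equality, while the range check only fires when a flip in `k` moves it outside the learned bounds.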

A typical criterion for the selection of detectors, defined in [22], is the tightness: the probability that the detector detects an error given that there is an error in the value of the variable that it checks. The notion of tightness is based on the value of a single variable. The invariants in this paper may include 2 or 3 variables, and the notion of tightness cannot be used to describe an invariant with more than one variable. For example, if $x$ is flipped in the invariant $x < y$, since there are multiple possible values of $y$, it cannot be decided whether the invariant is still satisfied, and thus the tightness cannot be calculated. Since the tightness cannot be used, we apply certain heuristics to choose invariants, and they prove to be effective.

7. Conclusion

To address the problem of detecting SDC, we propose an approach which applies invariant-based assertions and implement a system called Radish. Radish neither requires any hardware modifications to add error detection capability to the original system nor needs knowledge of the semantics of the program, and thus possesses good scalability. Experiments show that Radish achieves high SDC coverage with very low overhead.

Furthermore, we propose Radish_D by adding instruction duplication to the unsafe code sections which are not covered by assertions. Radish_D achieves higher SDC coverage than Radish or the full instruction duplication mechanism. Both Radish and Radish_D offer feasible alternatives for soft error mitigation.

Competing Interests

The authors declare no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Basic Research Program of China ("973" Project).

References

[1] H. Schirmeier, C. Borchert, and O. Spinczyk, "Avoiding pitfalls in fault-injection based comparison of program susceptibility to soft errors," in Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '15), pp. 319–330, IEEE, Rio de Janeiro, Brazil, June 2015.

[2] A. O. Daniel, L. P. Laercio, S. Thiago et al., "Evaluation and mitigation of radiation-induced soft errors in graphics processing units," IEEE Transactions on Computers, vol. 65, no. 3, pp. 791–804, 2016.

[3] S. S. Mukherjee, J. Emer, and S. K. Reinhardt, "The soft error problem: an architectural perspective," in Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA '05), pp. 243–247, San Francisco, Calif, USA, February 2005.

[4] D. Binder, E. C. Smith, and A. B. Holman, "Satellite anomalies from galactic cosmic rays," IEEE Transactions on Nuclear Science, vol. 22, no. 6, pp. 2675–2680, 1975.

[5] J. Olsen, P. E. Becher, P. B. Fynbo, P. Raaby, and J. Schultz, "Neutron-induced single event upsets in static RAMS observed at 10 km flight altitude," IEEE Transactions on Nuclear Science, vol. 40, no. 2, pp. 74–77, 1993.

[6] J. P. Walters, K. M. Zick, and M. French, "A practical characterization of a NASA SpaceCube application through fault emulation and laser testing," in Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '13), pp. 1–8, June 2013.

[7] S. Mittal and J. S. Vetter, "A survey of techniques for modeling and improving reliability of computing systems," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 4, pp. 1226–1238, 2016.

[8] P. Racunas, K. Constantinides, S. Manne, and S. S. Mukherjee, "Perturbation-based fault screening," in Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture, pp. 169–180, Scottsdale, Ariz, USA, February 2007.

[9] Q. Lu, K. Pattabiraman, M. S. Gupta et al., "SDCTune: a model for predicting the SDC proneness of an application for configurable protection," in Proceedings of the Compilers, Architecture and Synthesis for Embedded Systems, pp. 1–10, Uttar Pradesh, India, 2014.

[10] N. J. Wang and S. J. Patel, "ReStore: symptom-based soft error detection in microprocessors," IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 3, pp. 188–201, 2006.

[11] M.-L. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve, and Y. Zhou, "Understanding the propagation of hard errors to software and implications for resilient system design," ACM SIGARCH Computer Architecture News, vol. 36, no. 1, pp. 265–276, 2008.

[12] N. Oh, P. P. Shirvani, and E. J. McCluskey, "Error detection by duplicated instructions in super-scalar processors," IEEE Transactions on Reliability, vol. 51, no. 1, pp. 63–75, 2002.

[13] M. Shafique, S. Rehman, P. V. Aceituno, and J. Henkel, "Exploiting program-level masking and error propagation for constrained reliability optimization," in Proceedings of the 50th Annual Design Automation Conference (DAC '13), pp. 1–17, Austin, Tex, USA, June 2013.

[14] S. Rehman, M. Shafique, P. V. Aceituno, F. Kriebel, J.-J. Chen, and J. Henkel, "Leveraging variable function resilience for selective software reliability on unreliable hardware," in Proceedings of the 16th Design, Automation and Test in Europe Conference and Exhibition (DATE '13), pp. 1759–1764, Grenoble, France, March 2013.

[15] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, "SWIFT: software implemented fault tolerance," in Proceedings of the International Symposium on Code Generation and Optimization, pp. 243–254, IEEE Computer Society, San Jose, Calif, USA, 2005.

[16] M. D. Ernst, J. H. Perkins, P. J. Guo et al., "The Daikon system for dynamic detection of likely invariants," Science of Computer Programming, vol. 69, no. 1–3, pp. 35–45, 2007.

[17] A. Thomas and K. Pattabiraman, "Error detector placement for soft computation," in Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '13), pp. 1–12, IEEE Computer Society, June 2013.

[18] F. Shuguang, G. Shantanu, A. Amin et al., "Shoestring: probabilistic soft error reliability on the cheap," in Proceedings of the ASPLOS, pp. 385–396, Pittsburgh, Pa, USA, 2010.

[19] C. Lattner and V. Adve, "LLVM: a compilation framework for lifelong program analysis & transformation," in Proceedings of the International Symposium on Code Generation and Optimization (CGO '04), pp. 75–86, San Jose, Calif, USA, March 2004.

[20] A. Thomas and K. Pattabiraman, "LLFI: an intermediate code level fault injector for soft computing applications," in Proceedings of the Workshop on Silicon Errors in Logic System Effects (SELSE '13), pp. 1–8, Palo Alto, Calif, USA, 2013.

[21] M. R. Guthaus, J. S. Ringenberg, D. Ernst et al., "MiBench: a free, commercially representative embedded benchmark suite," in Proceedings of the Workload Characterization, pp. 3–14, Austin, Tex, USA, 2001.

[22] K. Pattabiraman, S. Giacinto, C. Daniel et al., "Dynamic derivation of application-specific error detectors and their implementation in hardware," in Proceedings of the 6th European Dependable Computing Conference (EDCC '06), pp. 97–108, Coimbra, Portugal, October 2006.

[23] S. K. Sahoo, M.-L. Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y. Zhou, "Using likely program invariants to detect hardware errors," in Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '08), pp. 70–79, IEEE Computer Society, Anchorage, Alaska, USA, June 2008.


6 International Journal of Aerospace Engineering

(a) Original assembly code

    i1: R3 = xor R1, R2
    i2: R5 = add R4, R2
    i3: R6 = icmp eq R3, R5
    i4: br R6, label b2

(b) Assembly code after full instruction duplication

    i1: R3 = xor R1, R2
    i2: R3′ = xor R1′, R2′
    i3: R5 = add R4, R2
    i4: R5′ = add R4′, R2′
    i5: R7 = icmp neq R3, R3′
    i6: br R7, label faultDetected
    i7: R8 = icmp neq R5, R5′
    i8: br R8, label faultDetected
    i9: R6 = icmp eq R3, R5
    i10: br R6, label b2

(c) Assembly code of Radish_D

    i1: R3 = xor R1, R2
    assert(R1 > R3)
    i2: R5 = add R4, R2
    i3: R5′ = add R4′, R2′
    i4: R7 = icmp neq R5, R5′
    i5: br R7, label faultDetected
    i6: R6 = icmp eq R3, R5
    i7: br R6, label b2

Figure 3: A sample assembly code before and after transformation of full instruction duplication and Radish D.

Figure 3. For consistency, we make use of the LLVM [19] assembly language to present the assembly code. Figure 3(a) shows the original assembly code and Figure 3(b) shows the assembly code after full instruction duplication. It can be seen in Figure 3(b) that R3′ is the replica of R3 and that the duplication is accomplished by instruction i2. Similarly, R5′ is the replica of R5 through the duplication by i4. i9 is the synchronization point, and the source operands of i9, R3 and R5, need to be examined. i5 and i7 compare R3 and R5 with their replicas separately. If the values of R3 and R3′ are not equal, i6 will call faultDetected to report a soft error.

The assembly code generated by Radish D is shown in Figure 3(c). Assume that we have already obtained an assertion about R1 and R3 by utilizing Radish, which is shown in the line of code "assert(R1 > R3)". Due to the assertion, R1 and R3 are considered safe during the execution of this example. R1′ and R3′ are no longer necessary, and the instructions used for their duplication are eliminated. Variables other than R1 and R3 still need to be duplicated and checked; thus R5 is duplicated by i3 and checked before the synchronization point i6. The efficiency of Radish D and of the full instruction duplication mechanism is examined in the next section.
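The two transformations of Figure 3 can be mimicked in a few lines of Python. This is only an illustrative sketch, not the LLVM pass used in the paper: the functions `inject`, `run_full_duplication`, and `run_radish_d` are hypothetical names, registers are modelled as 32-bit integers, the replicas R1′, R2′, and R4′ are assumed to hold fault-free copies of the inputs, and a fault is a single bit flip applied to one destination register.

```python
MASK = 0xFFFFFFFF  # model 32-bit registers


def inject(fault, name, value):
    """Flip one bit of `value` if `fault` targets destination register `name`."""
    if fault is not None and fault[0] == name:
        return (value ^ (1 << fault[1])) & MASK
    return value


def run_full_duplication(r1, r2, r4, fault=None):
    """Figure 3(b): every computation is duplicated and compared."""
    r3 = inject(fault, "R3", r1 ^ r2)                 # i1
    r3_dup = inject(fault, "R3'", r1 ^ r2)            # i2 (replica)
    r5 = inject(fault, "R5", (r4 + r2) & MASK)        # i3
    r5_dup = inject(fault, "R5'", (r4 + r2) & MASK)   # i4 (replica)
    if r3 != r3_dup:                                  # i5, i6
        return "faultDetected"
    if r5 != r5_dup:                                  # i7, i8
        return "faultDetected"
    r6 = inject(fault, "R6", int(r3 == r5))           # i9 (result is not checked)
    return "b2" if r6 & 1 else "fallthrough"          # i10


def run_radish_d(r1, r2, r4, fault=None):
    """Figure 3(c): R3 is guarded by an assertion, only R5 is duplicated."""
    r3 = inject(fault, "R3", r1 ^ r2)                 # i1
    if not (r1 > r3):                                 # assert(R1 > R3)
        return "faultDetected"
    r5 = inject(fault, "R5", (r4 + r2) & MASK)        # i2
    r5_dup = inject(fault, "R5'", (r4 + r2) & MASK)   # i3 (replica)
    if r5 != r5_dup:                                  # i4, i5
        return "faultDetected"
    r6 = inject(fault, "R6", int(r3 == r5))           # i6
    return "b2" if r6 & 1 else "fallthrough"          # i7


# Inputs chosen so that the fault-free run satisfies R1 > R3 and takes branch b2.
print(run_full_duplication(12, 4, 4))                   # fault-free
print(run_radish_d(12, 4, 4, fault=("R3", 4)))          # caught by the assertion
print(run_radish_d(12, 4, 4, fault=("R3", 0)))          # small flip escapes the assertion
```

In this toy model a high-order flip in R3 violates the assertion, while a low-order flip can still satisfy R1 > R3 and go unnoticed, which reflects the fact that an assertion checks a property of the value rather than the value itself.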

5. Experiment

This paper applies fault injection experiments to validate the effectiveness of Radish and Radish D. The fault injection experiment is performed on the original executable first. The hardened executables using Radish, Radish D, and full instruction duplication are targeted subsequently. We compare the results of the fault injection experiments and calculate the SDC coverage and performance overhead. To ensure a fair comparison among these mechanisms, we use a metric called the SDC detection efficiency, which is defined in prior work [9] as the ratio between SDC coverage and overhead for a detection mechanism.

The platform for validation is Ubuntu 14.04 (AMD64 architecture). LLFI [20] is applied to perform fault injections. LLFI is an LLVM-based fault injection tool. The source code is translated into an intermediate representation (IR), and faults are then injected into the IR code. The faults can be injected into specific program points, and their effect can be easily tracked back to the source code. LLFI is configured to inject faults into destination registers. In a single fault injection, LLFI randomly picks one instruction and injects one soft error into its destination operand. One fault injection experiment continues until the fault injection has been repeated 1000 times. The injected faults may affect data flow or control flow. We take the following LLVM IR code to explain the effect on control flow:

    %judge1 = icmp ne i32 %1, %2

    br i1 %judge1, label %BB1, label %BB2

judge1 determines the outcome of the branch. If a fault is injected into judge1, the branch instruction may choose the wrong branch and thus affect control flow. The mechanism of full instruction duplication is implemented by developing a new pass under the LLVM infrastructure. The pass is also used by Radish D for the operation of instruction duplication by modifying certain conditions for duplication.

Figure 4: The comparison of performance overheads among full instruction duplication, Radish, and Radish D. (Bar chart; x-axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng; y-axis: performance overhead (%), 0 to 120; series: Duplication, Radish, Radish_D.)
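The structure of such a fault injection campaign can be sketched as follows. This is not LLFI and does not use its interface; it is a toy stand-in in which the "program" computes the maximum of 2·x over a list, each loop iteration plays the role of one dynamic instruction with destination value `t`, and `check` stands in for an inserted assertion. The names `run` and `campaign`, the 32-bit fault model, and the definition of coverage as the fraction of SDC-causing faults that the check catches are all assumptions made for illustration.

```python
import random
from collections import Counter


def run(data, fault=None, check=None):
    """Toy program: return the maximum of 2*x over data.

    fault: optional (instruction_index, bit) single-bit flip applied to the
           destination value t of that dynamic instruction.
    check: optional predicate standing in for an inserted assertion on t.
    Returns the program output, or None if the assertion fired.
    """
    best = 0
    for index, x in enumerate(data):
        t = x * 2
        if fault is not None and fault[0] == index:
            t ^= 1 << fault[1]            # inject one soft error
        if check is not None and not check(t):
            return None                   # assertion failed: fault detected
        best = max(best, t)
    return best


def campaign(data, check, trials=1000, seed=0):
    """Inject `trials` random single-bit faults and classify the outcomes."""
    rng = random.Random(seed)
    golden = run(data)                    # fault-free reference output
    counts = Counter()
    for _ in range(trials):
        fault = (rng.randrange(len(data)), rng.randrange(32))
        if run(data, fault) == golden:
            counts["benign"] += 1         # fault masked, output unchanged
        elif run(data, fault, check) is None:
            counts["sdc_detected"] += 1   # would be an SDC, but the check fires
        else:
            counts["sdc_missed"] += 1     # wrong output with no indication
    sdc_total = counts["sdc_detected"] + counts["sdc_missed"]
    coverage = counts["sdc_detected"] / sdc_total if sdc_total else 0.0
    return counts, coverage


data = [3, 7, 2, 9, 4]
counts, coverage = campaign(data, check=lambda t: 0 <= t <= 18)
print(dict(counts), round(coverage, 3))
```

The range check used here catches flips that push `t` above the largest fault-free value but misses flips that lower the maximum element, so the measured coverage stays below 1.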

The programs used for evaluation are from the MiBench benchmark suite [21]. These programs are qsort (which performs the quick sort algorithm), isqrt (which is a base-two analogue of the square root algorithm), cubic (which solves a cubic polynomial), rad2deg (which converts between radians and degrees), crc (which computes a 32-bit CRC to detect accidental changes to raw data), and bitstrng (which prints the bit pattern of bytes formatted to a string). These are C programs consisting of a few hundred lines of C code. We use 25 inputs to extract invariants and randomly choose one input for the injection.

5.1. Comparison between Radish and Full Instruction Duplication. Figure 4 shows the performance overheads of Radish and full instruction duplication. We use the execution time of the original program as the baseline for comparison. Compared with the baseline, the average overhead incurred by Radish is 30.4% and the overhead incurred by full instruction duplication is 52.8%. The overhead of the full instruction duplication mechanism is 22.4 percentage points higher than the overhead of Radish for the studied programs.

Figure 5 shows the SDC coverages. The average SDC coverage of Radish is 77.1% and that of full instruction duplication is 84.3%. The average SDC coverage of full instruction duplication is 7.2 percentage points higher than that of Radish. For most of the benchmarks, the SDC coverages of full instruction duplication and Radish are very close.

Full instruction duplication does not achieve nearly 100% coverage since it does not check the results of store and branch instructions. For example, in Figure 3(b), which shows full instruction duplication, if a fault is injected into R6 in i9, i10 is affected and may choose the wrong branch. SWIFT [15] raises the coverage to nearly 100% since it assumes that the hardware applies ECC and it adds a control flow checking mechanism. The SDC detection efficiency can be observed in Figure 6. Radish has higher SDC detection efficiency, which is 1.6 times that of full instruction duplication. This is because the mechanism of full instruction duplication protects all instructions executed, which yields high SDC coverage at very high overhead. Radish, however, obtains relatively high SDC coverage with much lower overhead. Radish achieves this by curbing the number of program points that generate assertions. Further, the execution cost of assertions is relatively low, and assertions have good SDC coverage since they are seldom satisfied when soft errors occur.

5.2. The Experimental Results of Radish D. The average SDC coverage of Radish D is 92.5%, which is 8.2% higher than that of full instruction duplication and 15.5% higher than that of Radish. This confirms that the instruction duplication of Radish D protects unsafe code sections that are not covered by assertions. Radish D may generate assertions that check the variable which is stored in memory after the store instruction (see Heuristic 1). Moreover, at the program points of branch instructions, branch-controlling variables are checked. Therefore, the assertions of Radish D catch some of the faults that escape the detection of the duplication mechanism, and the coverage of Radish D is higher than that of full instruction duplication.

The average overhead of Radish D is 76.3%, which is lower than the sum of the overheads of full instruction duplication and Radish, because we eliminate the duplication deployed to the instructions that have already been protected by assertions.

The average SDC detection efficiency of Radish D is lower than that of full instruction duplication or Radish. For Radish D, there are overlapping soft errors that can be detected by both instruction duplication and assertions. For these soft errors, the overhead increases by deploying instruction duplication, but the SDC coverage does not increase. The SDC detection efficiency is the ratio between SDC coverage and overhead, and thus it is lowered.
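As a rough cross-check, the efficiency metric can be recomputed from the average coverage and overhead figures quoted in this section, read as percentages (77.1 and 30.4 for Radish, 84.3 and 52.8 for full instruction duplication, 92.5 and 76.3 for Radish D). The sketch below takes the ratio of the averages, which is an assumption: the per-benchmark efficiencies plotted in Figure 6 need not average to exactly these values.

```python
# Average SDC coverage (%) and performance overhead (%) as quoted in the text.
reported = {
    "Radish": {"coverage": 77.1, "overhead": 30.4},
    "Full duplication": {"coverage": 84.3, "overhead": 52.8},
    "Radish_D": {"coverage": 92.5, "overhead": 76.3},
}


def detection_efficiency(coverage, overhead):
    """SDC detection efficiency: ratio between SDC coverage and overhead."""
    return coverage / overhead


efficiency = {name: detection_efficiency(**m) for name, m in reported.items()}
ratio = efficiency["Radish"] / efficiency["Full duplication"]

for name, value in efficiency.items():
    print(f"{name}: {value:.2f}")
print(f"Radish vs. full duplication: {ratio:.2f}x")
```

Under this reading the ratio comes out close to the reported factor of 1.6, and Radish D ranks below both other mechanisms, in line with the discussion above.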

5.3. False Positives of Invariants. A false positive for an input can occur when the values at the assertion points for this input do not satisfy the condition of the assertion learned from the training inputs. We use 25 inputs for training and 100 inputs for testing. No faults are injected in these runs. We test all the programs that were used to evaluate SDC coverage in the fault injection experiment. The result shows that the average false positive rate of the studied programs is 4.8%.

Figure 5: The comparison of SDC coverages among full instruction duplication, Radish, and Radish D. (Bar chart; x-axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng; y-axis: SDC coverage (%), 0 to 100; series: Duplication, Radish, Radish_D.)

Figure 6: The comparison of SDC detection efficiencies among full instruction duplication, Radish, and Radish D. (Bar chart; x-axis: qsort, isqrt, cubic, rad2deg, crc, bitstrng; y-axis: SDC detection efficiency, 0 to 4; series: Duplication, Radish, Radish_D.)

We also conduct an experiment to examine the effect of training set size. The result for qsort is shown in Figure 7. The training set consists of 25, 50, and 75 inputs, and false positives are computed across 100 inputs.

The false positive rate decreases from 5% to 3% as the training set size is increased from 25 to 50, and to 2% for 75 inputs. The SDC coverage also decreases as the training set increases from 25 to 75 inputs. The impact on both SDC coverage and false positive rate from increasing the training set size is significant. Hence, we should choose the training set size according to the user's target. If the user specifies bounds on SDC coverage and overhead, then, by converting the false positive rate into overhead, we can choose a training set size that achieves the target.
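The trend can be reproduced with a small, self-contained sketch. It assumes a simple bounded-range invariant learned as the minimum and maximum of a value observed over the training runs; the helper names `learn_range` and `false_positive_rate` and the synthetic uniformly random values are hypothetical and are not data from the paper. Because the training sets are nested, the learned range can only widen, so the false positive rate cannot increase; a wider range also accepts more corrupted values, which is consistent with the drop in SDC coverage.

```python
import random


def learn_range(training_values):
    """Learn a bounded-range invariant low <= v <= high from fault-free runs."""
    return min(training_values), max(training_values)


def false_positive_rate(invariant, test_values):
    """Fraction of fault-free test runs whose value violates the invariant."""
    low, high = invariant
    alarms = sum(1 for v in test_values if not (low <= v <= high))
    return alarms / len(test_values)


rng = random.Random(7)
# One observed value per input at a single assertion point (synthetic data).
training_pool = [rng.randint(0, 1000) for _ in range(75)]
test_values = [rng.randint(0, 1000) for _ in range(100)]

rates, widths = {}, {}
for size in (25, 50, 75):
    low, high = learn_range(training_pool[:size])   # nested training sets
    widths[size] = high - low
    rates[size] = false_positive_rate((low, high), test_values)
    print(size, (low, high), rates[size])
```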

Besides, re-execution can reduce the overhead incurred by false positives. When an assertion raises an alarm, we can determine whether it is a false positive by re-executing the checked code. Because a soft error is transient, it is unlikely to recur in the second run; if the assertion raises an alarm again, it is a false positive. In this case, the alarm can be ignored and the program can continue.
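A minimal sketch of this re-execution policy is given below. It assumes that a soft error is transient and does not recur in the second run; the function names, the squaring computation, and the learned bound of 100 are hypothetical choices made only to illustrate the decision logic.

```python
def checked_run(compute, assertion, x, fault=None):
    """Run the computation once; `fault` optionally corrupts the result."""
    value = compute(x)
    if fault is not None:
        value = fault(value)
    return value, assertion(value)


def run_with_reexecution(compute, assertion, x, fault=None):
    """Classify an alarm as a soft error or a false positive by re-executing."""
    value, ok = checked_run(compute, assertion, x, fault)
    if ok:
        return value, "no alarm"
    # Second run: the transient fault is assumed not to strike again.
    value, ok = checked_run(compute, assertion, x)
    if ok:
        return value, "soft error"        # alarm gone: keep the re-executed result
    return value, "false positive"        # alarm repeats: ignore it and continue


def square(x):
    return x * x


def within_learned_bound(value):
    # Hypothetical invariant learned from training inputs no larger than 10.
    return value <= 100


print(run_with_reexecution(square, within_learned_bound, 3))
print(run_with_reexecution(square, within_learned_bound, 3, fault=lambda v: v ^ (1 << 10)))
print(run_with_reexecution(square, within_learned_bound, 12))
```

The third call uses an input outside the training range, so the assertion fails in both runs and the alarm is classified as a false positive.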

From the discussion above, it can be concluded that Radish D has higher SDC coverage than Radish or full instruction duplication. But its overhead is also higher, which suggests that Radish D should be used in situations where SDC coverage has priority over overhead. Further, the SDC detection efficiency of Radish is far higher than that of Radish D or full instruction duplication, which means it is more cost-effective. But Radish may incur overhead due to false positives. Users can choose Radish or Radish D according to their preferred tradeoff between SDC coverage and performance overhead.

6. Related Work

Prior research [8, 22, 23] applies invariants with a single variable, and most of the invariants are based on a bounded range. We apply invariants with more variables, which can achieve better coverage on many occasions. For example, we can always extract an invariant n − k + 1 = 0 from a typical loop structure shown as follows:

    for (k = 1; k <= n; k++)
        ...
    /* <- extracting invariant from here */

Figure 7: The SDC coverage and false positive rate for varied training set sizes. (x-axis: training set size, 25, 50, and 75 inputs; left y-axis: SDC coverage, 0.72 to 0.78; right y-axis: false positive rate, 0 to 0.06; series: SDC coverage, False positive.)

It is found that assert(n − k + 1 = 0) is often better than the bounded-range-based invariant assert(k_min ≤ k ≤ k_max) at detecting errors, since assert(n − k + 1 = 0) checks both n and k, while assert(k_min ≤ k ≤ k_max) only checks k.
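The difference between the two kinds of assertion can be illustrated with a short sketch. It assumes that the fault is a single bit flip in n or k at the loop exit, just before the assertion is evaluated, and that the bounded range k_min = 2, k_max = 21 was learned from training runs with n between 1 and 20; these values and the helper names are hypothetical.

```python
def loop_exit_state(n, fault=None):
    """Run the loop and return the variables seen at the assertion point.

    fault: optional (variable_name, bit) single-bit flip applied after the loop.
    """
    k = 1
    while k <= n:
        k += 1
    state = {"n": n, "k": k}
    if fault is not None:
        name, bit = fault
        state[name] ^= 1 << bit
    return state


def relational_invariant(state):
    """assert(n - k + 1 = 0): relates both variables."""
    return state["n"] - state["k"] + 1 == 0


def bounded_range_invariant(state, k_min=2, k_max=21):
    """assert(k_min <= k <= k_max): looks at k only."""
    return k_min <= state["k"] <= k_max


for fault in (None, ("n", 2), ("k", 0), ("k", 5)):
    state = loop_exit_state(10, fault)
    print(fault, state,
          "relational ok" if relational_invariant(state) else "relational ALARM",
          "range ok" if bounded_range_invariant(state) else "range ALARM")
```

In this model a flip in n, or a small flip in k that stays inside the learned range, is flagged only by the relational invariant, whereas a large flip in k is flagged by both.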

A typical criterion for the selection of detectors is the tightness, defined in [22] as the probability that the detector detects an error given that there is an error in the value of the variable that it checks. The notion of tightness is based on the value of a single variable. The invariants in this paper may include 2 or 3 variables, and the notion of tightness cannot be used to describe an invariant with more than one variable. For example, if x is flipped in the invariant x < y, since there are multiple possible values of y, it cannot be decided whether the invariant is still satisfied, and thus the tightness cannot be calculated. Since the tightness cannot be used, we apply certain heuristics to choose invariants, and they prove to be effective.
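To make the argument concrete, the sketch below estimates tightness under one specific, assumed fault model: every single-bit flip of a 32-bit value is equally likely. The function `tightness` and the example values are hypothetical, and [22] may define the probability over a different error distribution. For a single-variable range detector the result is one number, whereas for x < y the same flipped x yields different values depending on y.

```python
def tightness(value, detector, bits=32):
    """Fraction of single-bit flips of `value` that the detector rejects."""
    detected = sum(1 for b in range(bits) if not detector(value ^ (1 << b)))
    return detected / bits


# Single-variable detector: 2 <= k <= 21, evaluated at k = 11.
print(tightness(11, lambda k: 2 <= k <= 21))

# Two-variable invariant x < y with x = 5: the answer depends on y.
for y in (8, 100):
    print(y, tightness(5, lambda x, y=y: x < y))
```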

7. Conclusion

To address the problem of detecting SDC, we propose an approach which applies invariant-based assertions and implement a system called Radish. Radish requires neither hardware modifications to add error detection capability to the original system nor knowledge of the semantics of the program, and thus scales well. Experiments show that Radish achieves high SDC coverage (77.1% on average) with relatively low overhead (30.4% on average).

Furthermore, we propose Radish D by adding instruction duplication to the unsafe code sections which are not covered by assertions. Radish D achieves higher SDC coverage than Radish or the full instruction duplication mechanism. Both Radish and Radish D offer feasible alternatives for soft error mitigation.

Competing Interests

The authors declare no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Basic Research Program of China (“973” Project).

References

[1] H. Schirmeier, C. Borchert, and O. Spinczyk, “Avoiding pitfalls in fault-injection based comparison of program susceptibility to soft errors,” in Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’15), pp. 319–330, IEEE, Rio de Janeiro, Brazil, June 2015.

[2] A. O. Daniel, L. P. Laercio, S. Thiago et al., “Evaluation and mitigation of radiation-induced soft errors in graphics processing units,” IEEE Transactions on Computers, vol. 65, no. 3, pp. 791–804, 2016.

[3] S. S. Mukherjee, J. Emer, and S. K. Reinhardt, “The soft error problem: an architectural perspective,” in Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA ’05), pp. 243–247, San Francisco, Calif, USA, February 2005.

[4] D. Binder, E. C. Smith, and A. B. Holman, “Satellite anomalies from galactic cosmic rays,” IEEE Transactions on Nuclear Science, vol. 22, no. 6, pp. 2675–2680, 1975.

[5] J. Olsen, P. E. Becher, P. B. Fynbo, P. Raaby, and J. Schultz, “Neutron-induced single event upsets in static RAMS observed at 10 km flight altitude,” IEEE Transactions on Nuclear Science, vol. 40, no. 2, pp. 74–77, 1993.

[6] J. P. Walters, K. M. Zick, and M. French, “A practical characterization of a NASA SpaceCube application through fault emulation and laser testing,” in Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’13), pp. 1–8, June 2013.

[7] S. Mittal and J. S. Vetter, “A survey of techniques for modeling and improving reliability of computing systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 4, pp. 1226–1238, 2016.

[8] P. Racunas, K. Constantinides, S. Manne, and S. S. Mukherjee, “Perturbation-based fault screening,” in Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture, pp. 169–180, Scottsdale, Ariz, USA, February 2007.

[9] Q. Lu, K. Pattabiraman, M. S. Gupta et al., “SDCTune: a model for predicting the SDC proneness of an application for configurable protection,” in Proceedings of the Compilers, Architecture and Synthesis for Embedded Systems, pp. 1–10, Uttar Pradesh, India, 2014.

[10] N. J. Wang and S. J. Patel, “ReStore: symptom-based soft error detection in microprocessors,” IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 3, pp. 188–201, 2006.

[11] M.-L. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve, and Y. Zhou, “Understanding the propagation of hard errors to software and implications for resilient system design,” ACM SIGARCH Computer Architecture News, vol. 36, no. 1, pp. 265–276, 2008.

[12] N. Oh, P. P. Shirvani, and E. J. McCluskey, “Error detection by duplicated instructions in super-scalar processors,” IEEE Transactions on Reliability, vol. 51, no. 1, pp. 63–75, 2002.

[13] M. Shafique, S. Rehman, P. V. Aceituno, and J. Henkel, “Exploiting program-level masking and error propagation for constrained reliability optimization,” in Proceedings of the 50th Annual Design Automation Conference (DAC ’13), pp. 1–17, Austin, Tex, USA, June 2013.

[14] S. Rehman, M. Shafique, P. V. Aceituno, F. Kriebel, J.-J. Chen, and J. Henkel, “Leveraging variable function resilience for selective software reliability on unreliable hardware,” in Proceedings of the 16th Design, Automation and Test in Europe Conference and Exhibition (DATE ’13), pp. 1759–1764, Grenoble, France, March 2013.

[15] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, “SWIFT: software implemented fault tolerance,” in Proceedings of the International Symposium on Code Generation and Optimization, pp. 243–254, IEEE Computer Society, San Jose, Calif, USA, 2005.

[16] M. D. Ernst, J. H. Perkins, P. J. Guo et al., “The Daikon system for dynamic detection of likely invariants,” Science of Computer Programming, vol. 69, no. 1–3, pp. 35–45, 2007.

[17] A. Thomas and K. Pattabiraman, “Error detector placement for soft computation,” in Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’13), pp. 1–12, IEEE Computer Society, June 2013.

[18] F. Shuguang, G. Shantanu, A. Amin et al., “Shoestring: probabilistic soft error reliability on the cheap,” in Proceedings of the ASPLOS, pp. 385–396, Pittsburgh, Pa, USA, 2010.

[19] C. Lattner and V. Adve, “LLVM: a compilation framework for lifelong program analysis & transformation,” in Proceedings of the International Symposium on Code Generation and Optimization (CGO ’04), pp. 75–86, San Jose, Calif, USA, March 2004.

[20] A. Thomas and K. Pattabiraman, “LLFI: an intermediate code level fault injector for soft computing applications,” in Proceedings of the Workshop on Silicon Errors in Logic System Effects (SELSE ’13), pp. 1–8, Palo Alto, Calif, USA, 2013.

[21] M. R. Guthaus, J. S. Ringenberg, D. Ernst et al., “MiBench: a free, commercially representative embedded benchmark suite,” in Proceedings of the Workload Characterization, pp. 3–14, Austin, Tex, USA, 2001.

[22] K. Pattabiraman, S. Giacinto, C. Daniel et al., “Dynamic derivation of application-specific error detectors and their implementation in hardware,” in Proceedings of the 6th European Dependable Computing Conference (EDCC ’06), pp. 97–108, Coimbra, Portugal, October 2006.

[23] S. K. Sahoo, M.-L. Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y. Zhou, “Using likely program invariants to detect hardware errors,” in Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’08), pp. 70–79, IEEE Computer Society, Anchorage, Alaska, USA, June 2008.


International Journal of Aerospace Engineering 7

qsort is cubic rad2deg crc bitstrngqrt

DuplicationRadishRadish_D

0

20

40

60

80

120

100

Perfo

rman

ce o

verh

ead

()

Figure 4 The comparison of performance overheads among full instruction duplication Radish and Radish D

Radish D for the operation of instruction duplication bymodifying certain conditions for duplication

The programs used for evaluation are from MiBenchbenchmark suite [21] These programs are qsort (whichperforms the algorithm of quick sort) isqrt (which is basetwo analogue of the square root algorithm) cubic (whichsolves a cubic polynomial) rad2deg (which converts betweenradians and degrees) crc (which computes 32-bit crc to detectaccidental changes to raw data) and bitstrng (which prints bitpattern of bytes formatted to string) These are C programsconsisting of a few hundred lines of C code We use 25 inputsto extract invariants and randomly choose one input for theinjection

51 Comparison between Radish and Full Instruction Dupli-cation Figure 4 shows the performance overheads of Radishand full instruction duplicationWe use the execution time ofthe original program as baseline for comparison Comparedwith the baseline the average overhead incurred by Radish is304 and the overhead incurred by full instruction dupli-cation is 528 The overhead of full instruction duplicationmechanism is 224 higher than the overhead of Radish forthe studied programs

Figure 5 shows the SDC coverages The average SDCcoverage of Radish is 771 and that of full instruction dupli-cation is 843The average SDC coverage of full instructionduplication is 72 higher than that of Radish Among mostof the benchmarks the SDC coverages of full instructionduplication and Radish are very close

Full instruction duplication does not achieve nearly 100coverage since it does not check the result of store and branchinstruction For example in Figure 3(b) which denotes thefull instruction duplication if 1198776 in 1198949 is injected 11989410 isaffected andmay choose the wrong branch SWIFT [15] raisesthe coverage to nearly 100 since it assumes that the hardwareapplies ECC and it adds control flow checking mechanismThe SDC detection efficiency can be observed in Figure 6Radish has higher SDCdetection efficiency which is 16 timesas much as that of full instruction duplicationThis is becausethe mechanism of full instruction duplication protects allinstructions executed which incurs high SDC coverage with

very high overhead However Radish obtains relatively highSDC coverage with much lower overhead Radish achievesthis by curbing the number of program points that generateassertions Further the execution cost of assertions is rela-tively low and assertions have good SDC coverage since theyare seldom satisfied when soft errors occur

52 The Experimental Results of Radish D The average SDCcoverage of Radish D is 925 which is 82 higher thanthat of full instruction duplication and 155 higher than thatof Radish It can be validated that instruction duplication ofRadish D protects unsafe code sections that are not coveredby assertions Radish D may generate assertions that checkthe variable which is stored in the memory after the storeinstruction (see Heuristic 1) Moreover at the program pointsof branch instructions branch-controlling variables arechecked Therefore the assertions of Radish D catch some offaults that escape the detection of duplicationmechanism andthe coverage of Radish D is higher than that of full instruc-tion duplication

The average overhead of Radish D is 763 lower thanthe sum of the overhead of full instruction duplication andRadish because we eliminate the duplication deployed to theinstructions that have already been protected by assertions

The average SDC detection efficiency of Radish D islower than that of full instruction duplication or RadishFor Radish D there are overlapping soft errors that can bedetected by both instruction duplication and assertions Tothese soft errors the overhead increases by deploying instruc-tion duplication but the SDC coverage does not increaseTheSDC detection efficiency is the ratio between SDC coverageand overhead and thus it is lowered

53 False Positives of Invariants A false positive for an inputcan occur when the values at the assertion points for thisinput do not satisfy the condition of the assertion learnedfrom the training inputs We use 25 inputs for training and100 inputs for testing No faults are injected in these runs Wetest all the programs that were used to evaluate SDC coveragein the fault injection experiment The result shows that theaveraged false positive rate of the studied programs is 48

8 International Journal of Aerospace Engineering

qsort is cubic rad2deg crc bitstrngqrt

DuplicationRadishRadish_D

0

10

20

30

40

50

60

70

80

90

100

SDC

cove

rage

()

Figure 5 The comparison of SDC coverages among full instruction duplication Radish and Radish D

qsort is cubic rad2deg crc bitstrngqrt

DuplicationRadishRadish_D

0

05

1

15

2

25

3

35

4

SDC

dete

ctio

n effi

cien

cy

Figure 6 The comparison of SDC detection efficiencies among full instruction duplication Radish and Radish D

We also conduct the experiment to exam the effect oftraining set size The result of qsort is shown in Figure 7 Thetraining set consists of 25 50 and 75 inputs and false positivesare computed across 100 inputs

The false positive rate decreases from 5 to 3 as thetraining set size is increased from 25 to 50 and to 2 for75 inputs The SDC coverage also decreases as the trainingset increases from 25 to 75 inputs The impact on both SDCcoverage and false positive rate from increasing the trainingset size is significant Hence we should choose the training setsize according to the user target If user specifies the boundof SDC coverage and overhead by turning false positive rateinto overhead we can choose a training set size to achieve thetarget

Besides reexecution can reduce the overhead incurredby fault positive When an assertion raises an alarm we candetermine if it is a false positive by reexecuting it If theassertion raises an alarm again it is a false positive In thiscase the alarm can be ignored and the program can continue

From the discussion above it can be concluded thatRadish D has higher SDC coverage than that of Radish or fullinstruction duplication But its overhead is also higher whichsuggests that Radish D should be used in the situation where

SDC coverage is considered to have more priority than over-head Further the SDC detection efficiency of Radish is farhigher than that of Radish D or full instruction duplicationwhich means it is more cost-effective But Radish may incuroverhead due to false positives Users can choose Radishor Radish D according to their consideration of tradeoffbetween the SDC coverage and performance overhead

6 Related Work

Prior research [8 22 23] applies invariants with a singlevariable and most of the invariants are based on boundedrange We apply invariants with more variables which canachieve better coverage in many occasions For example wecan always extract an invariant 119899 minus 119896 + 1 = 0 from a typicalloop structure shown as follows

for (119896 = 1 119896 lt= 119899 119896 + +)

sdot sdot sdot

larr 119864119909119905119903119886119888119905119894119899119892 119894119899V119886119903119894119886119899119905 119891119903119900119898 ℎ119890119903119890

International Journal of Aerospace Engineering 9

072

073

074

074075

075

076

077

078078

SDC

cove

rage

25 50 75

Training set size

SDC coverageFalse positive

006

005

005004

003

003

002

002001

0

False

pos

itive

rate

Figure 7 The SDC coverage and false positive rate for varied training set sizes

It is found that assert(119899 minus 119896 + 1 = 0) is often better than thebounded-range-based invariant assert(119896min le 119896 le 119896max) atdetecting errors since assert(119899 minus 119896 + 1 = 0) checks both 119899 and119896 while assert(119896min le 119896 le 119896max) only checks 119896

A typical criterion for selection of detectors defined in[22] the tightness is the probability that the detector detectsan error given that there is an error in the value of the variablethat it checks The notion of tightness is based on the valueof a single variable The invariant in this paper may include2 or 3 variables and the notion of tightness cannot be usedto describe an invariant with more than one variable Forexample if 119909 is flipped in the invariant 119909 lt 119910 since there aremultiple possible values of119910 it cannot be decidedwhether theinvariant is still satisfied and thus the tightness cannot be cal-culated Since the tightness cannot be used we apply certainheuristics to choose invariants and it is proved to be effective

7. Conclusion

To address the problem of detecting SDC, we propose an approach that applies invariant-based assertions and implement a system called Radish. Radish requires neither hardware modifications to add error detection capability to the original system nor knowledge of the semantics of the program, and thus scales well. Experiments show that Radish achieves high SDC coverage with very low overhead.

Furthermore, we propose Radish_D, which adds instruction duplication to the unsafe code sections that are not covered by assertions. Radish_D achieves higher SDC coverage than either Radish or the full instruction duplication mechanism. Both Radish and Radish_D offer feasible alternatives for soft error mitigation.

Competing Interests

The authors declare no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Basic Research Program of China (“973” Project).

References

[1] H. Schirmeier, C. Borchert, and O. Spinczyk, “Avoiding pitfalls in fault-injection based comparison of program susceptibility to soft errors,” in Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’15), pp. 319–330, IEEE, Rio de Janeiro, Brazil, June 2015.

[2] A. O. Daniel, L. P. Laercio, S. Thiago et al., “Evaluation and mitigation of radiation-induced soft errors in graphics processing units,” IEEE Transactions on Computers, vol. 65, no. 3, pp. 791–804, 2016.

[3] S. S. Mukherjee, J. Emer, and S. K. Reinhardt, “The soft error problem: an architectural perspective,” in Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA ’05), pp. 243–247, San Francisco, Calif, USA, February 2005.

[4] D. Binder, E. C. Smith, and A. B. Holman, “Satellite anomalies from galactic cosmic rays,” IEEE Transactions on Nuclear Science, vol. 22, no. 6, pp. 2675–2680, 1975.

[5] J. Olsen, P. E. Becher, P. B. Fynbo, P. Raaby, and J. Schultz, “Neutron-induced single event upsets in static RAMS observed at 10 km flight altitude,” IEEE Transactions on Nuclear Science, vol. 40, no. 2, pp. 74–77, 1993.

[6] J. P. Walters, K. M. Zick, and M. French, “A practical characterization of a NASA SpaceCube application through fault emulation and laser testing,” in Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’13), pp. 1–8, June 2013.

[7] S. Mittal and J. S. Vetter, “A survey of techniques for modeling and improving reliability of computing systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 4, pp. 1226–1238, 2016.

[8] P. Racunas, K. Constantinides, S. Manne, and S. S. Mukherjee, “Perturbation-based fault screening,” in Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture, pp. 169–180, Scottsdale, Ariz, USA, February 2007.

[9] Q. Lu, K. Pattabiraman, M. S. Gupta et al., “SDCTune: a model for predicting the SDC proneness of an application for configurable protection,” in Proceedings of the Compilers, Architecture and Synthesis for Embedded Systems, pp. 1–10, Uttar Pradesh, India, 2014.

[10] N. J. Wang and S. J. Patel, “ReStore: symptom-based soft error detection in microprocessors,” IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 3, pp. 188–201, 2006.

[11] M.-L. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve, and Y. Zhou, “Understanding the propagation of hard errors to software and implications for resilient system design,” ACM SIGARCH Computer Architecture News, vol. 36, no. 1, pp. 265–276, 2008.

[12] N. Oh, P. P. Shirvani, and E. J. McCluskey, “Error detection by duplicated instructions in super-scalar processors,” IEEE Transactions on Reliability, vol. 51, no. 1, pp. 63–75, 2002.

[13] M. Shafique, S. Rehman, P. V. Aceituno, and J. Henkel, “Exploiting program-level masking and error propagation for constrained reliability optimization,” in Proceedings of the 50th Annual Design Automation Conference (DAC ’13), pp. 1–17, Austin, Tex, USA, June 2013.

[14] S. Rehman, M. Shafique, P. V. Aceituno, F. Kriebel, J.-J. Chen, and J. Henkel, “Leveraging variable function resilience for selective software reliability on unreliable hardware,” in Proceedings of the 16th Design, Automation and Test in Europe Conference and Exhibition (DATE ’13), pp. 1759–1764, Grenoble, France, March 2013.

[15] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, “SWIFT: software implemented fault tolerance,” in Proceedings of the International Symposium on Code Generation and Optimization, pp. 243–254, IEEE Computer Society, San Jose, Calif, USA, 2005.

[16] M. D. Ernst, J. H. Perkins, P. J. Guo et al., “The Daikon system for dynamic detection of likely invariants,” Science of Computer Programming, vol. 69, no. 1–3, pp. 35–45, 2007.

[17] A. Thomas and K. Pattabiraman, “Error detector placement for soft computation,” in Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’13), pp. 1–12, IEEE Computer Society, June 2013.

[18] F. Shuguang, G. Shantanu, A. Amin et al., “Shoestring: probabilistic soft error reliability on the cheap,” in Proceedings of the ASPLOS, pp. 385–396, Pittsburgh, Pa, USA, 2010.

[19] C. Lattner and V. Adve, “LLVM: a compilation framework for lifelong program analysis & transformation,” in Proceedings of the International Symposium on Code Generation and Optimization (CGO ’04), pp. 75–86, San Jose, Calif, USA, March 2004.

[20] A. Thomas and K. Pattabiraman, “LLFI: an intermediate code level fault injector for soft computing applications,” in Proceedings of the Workshop on Silicon Errors in Logic System Effects (SELSE ’13), pp. 1–8, Palo Alto, Calif, USA, 2013.

[21] M. R. Guthaus, J. S. Ringenberg, D. Ernst et al., “MiBench: a free, commercially representative embedded benchmark suite,” in Proceedings of the Workload Characterization, pp. 3–14, Austin, Tex, USA, 2001.

[22] K. Pattabiraman, S. Giacinto, C. Daniel et al., “Dynamic derivation of application-specific error detectors and their implementation in hardware,” in Proceedings of the 6th European Dependable Computing Conference (EDCC ’06), pp. 97–108, Coimbra, Portugal, October 2006.

[23] S. K. Sahoo, M.-L. Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y. Zhou, “Using likely program invariants to detect hardware errors,” in Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’08), pp. 70–79, IEEE Computer Society, Anchorage, Alaska, USA, June 2008.


Figure 5: The comparison of SDC coverages among full instruction duplication, Radish, and Radish_D. (x-axis: benchmarks qsort, isqrt, cubic, rad2deg, crc, and bitstrng; y-axis: SDC coverage (%), 0 to 100.)

Figure 6: The comparison of SDC detection efficiencies among full instruction duplication, Radish, and Radish_D. (x-axis: benchmarks qsort, isqrt, cubic, rad2deg, crc, and bitstrng; y-axis: SDC detection efficiency, 0 to 4.)

We also conduct an experiment to examine the effect of training set size. The result for qsort is shown in Figure 7. The training sets consist of 25, 50, and 75 inputs, and false positives are computed across 100 inputs.

The false positive rate decreases from 5% to 3% as the training set size is increased from 25 to 50 inputs, and to 2% for 75 inputs. The SDC coverage also decreases as the training set grows from 25 to 75 inputs. Increasing the training set size therefore has a significant impact on both SDC coverage and false positive rate. Hence, we should choose the training set size according to the user's target. If the user specifies bounds on SDC coverage and overhead, we can convert the false positive rate into overhead and choose a training set size that achieves the target.
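The selection step can be sketched as a simple search over measured operating points. This is a minimal illustration under stated assumptions: the struct, its field names, and `choose_training_size` are hypothetical, and the sketch bounds the false positive rate directly instead of first converting it into overhead as described above.

```c
#include <assert.h>
#include <stddef.h>

/* One measured operating point; the field names are hypothetical. */
struct training_point {
    int    inputs;          /* training set size (number of inputs) */
    double sdc_coverage;    /* measured SDC coverage, 0.0 to 1.0 */
    double false_pos_rate;  /* measured false positive rate, 0.0 to 1.0 */
};

/* Return the index of the first point that meets both user bounds,
 * or -1 when no measured training set size satisfies the target. */
int choose_training_size(const struct training_point *pts, size_t count,
                         double min_coverage, double max_false_pos_rate)
{
    assert(pts != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (pts[i].sdc_coverage >= min_coverage &&
            pts[i].false_pos_rate <= max_false_pos_rate)
            return (int)i;
    }
    return -1;
}
```

Because coverage and false positive rate both fall as the training set grows, a target that asks for high coverage and a low false positive rate at the same time may have no solution, which is why the function can return -1.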

Besides, re-execution can reduce the overhead incurred by false positives. When an assertion raises an alarm, we can determine whether it is a false positive by re-executing the code that it checks. If the assertion raises the alarm again, it is a false positive, since a transient soft error is unlikely to recur on re-execution. In this case, the alarm can be ignored and the program can continue.
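The re-execution check can be sketched in C as follows. This is an illustrative outline, not Radish's implementation: the `region_fn` type, the verdict names, and the three demo regions are hypothetical, and the sketch assumes the protected region can safely be run a second time from the same starting state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum alarm_verdict {
    NO_ALARM,       /* assertion held on the first run */
    SOFT_ERROR,     /* alarm disappeared on re-execution */
    FALSE_POSITIVE  /* alarm repeated on re-execution */
};

/* A protected region returns true when its invariant assertion holds. */
typedef bool (*region_fn)(void *state);

/* Run a region; on an alarm, re-execute it once to classify the alarm. */
enum alarm_verdict run_with_reexecution(region_fn region, void *state)
{
    assert(region != NULL);
    if (region(state))
        return NO_ALARM;
    if (region(state))
        return SOFT_ERROR;      /* transient fault did not recur */
    return FALSE_POSITIVE;      /* same alarm again: ignore and continue */
}

/* Demo regions, for illustration only. */
bool region_holds(void *state)
{
    (void)state;
    return true;
}

bool region_always_violates(void *state)
{
    (void)state;
    return false;
}

/* Models a transient fault: the check fails only on the first call. */
bool region_transient(void *state)
{
    int *calls = state;
    return (*calls)++ > 0;
}
```

A caller would continue normally on `NO_ALARM` or `FALSE_POSITIVE`. On `SOFT_ERROR` it can report the fault and use the result of the second run.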

From the discussion above, it can be concluded that Radish_D has higher SDC coverage than Radish or full instruction duplication. However, its overhead is also higher, which suggests that Radish_D should be used in situations where SDC coverage takes priority over overhead. Further, the SDC detection efficiency of Radish is far higher than that of Radish_D or full instruction duplication, which means that it is more cost-effective. However, Radish may incur additional overhead due to false positives. Users can choose Radish or Radish_D according to the tradeoff they want between SDC coverage and performance overhead.

6 Related Work

Prior research [8 22 23] applies invariants with a singlevariable and most of the invariants are based on boundedrange We apply invariants with more variables which canachieve better coverage in many occasions For example wecan always extract an invariant 119899 minus 119896 + 1 = 0 from a typicalloop structure shown as follows

for (119896 = 1 119896 lt= 119899 119896 + +)

sdot sdot sdot

larr 119864119909119905119903119886119888119905119894119899119892 119894119899V119886119903119894119886119899119905 119891119903119900119898 ℎ119890119903119890

International Journal of Aerospace Engineering 9

072

073

074

074075

075

076

077

078078

SDC

cove

rage

25 50 75

Training set size

SDC coverageFalse positive

006

005

005004

003

003

002

002001

0

False

pos

itive

rate

Figure 7 The SDC coverage and false positive rate for varied training set sizes

It is found that assert(119899 minus 119896 + 1 = 0) is often better than thebounded-range-based invariant assert(119896min le 119896 le 119896max) atdetecting errors since assert(119899 minus 119896 + 1 = 0) checks both 119899 and119896 while assert(119896min le 119896 le 119896max) only checks 119896

A typical criterion for selection of detectors defined in[22] the tightness is the probability that the detector detectsan error given that there is an error in the value of the variablethat it checks The notion of tightness is based on the valueof a single variable The invariant in this paper may include2 or 3 variables and the notion of tightness cannot be usedto describe an invariant with more than one variable Forexample if 119909 is flipped in the invariant 119909 lt 119910 since there aremultiple possible values of119910 it cannot be decidedwhether theinvariant is still satisfied and thus the tightness cannot be cal-culated Since the tightness cannot be used we apply certainheuristics to choose invariants and it is proved to be effective

7 Conclusion

To address the problem of detecting SDC we proposean approach which applies invariant-based assertions andimplement a system called Radish Radish neither requiresany hardware modifications to add error detection capabilityto the original system nor needs to acknowledge the seman-tics of the program and thus possesses a good scalabilityExperiments show that Radish achieves high SDC coveragewith very low overhead

Furthermore we propose Radish D by adding instruc-tion duplication to the unsafe code sections which arenot covered by assertions Radish D achieves higher SDCcoverage than that of Radish or full instruction duplicationmechanism Both Radish and Radish D offer feasible alter-natives for soft error mitigation

Competing Interests

The authors declare no conflict of interests regarding thepublication of this paper

Acknowledgments

This work was supported by the National Basic ResearchProgram of China (ldquo973rdquo Project)

References

[1] H Schirmeier C Borchert and O Spinczyk ldquoAvoiding pitfallsin fault-injection based comparison of program susceptibilityto soft errorsrdquo in Proceedings of the 45th Annual IEEEIFIPInternational Conference on Dependable Systems and Networks(DSN rsquo15) pp 319ndash330 IEEE Rio de Janeiro Brazil June 2015

[2] A O Daniel L P Laercio S Thiago et al ldquoEvaluationand mitigation of radiation-induced soft errors in graphicsprocessing unitsrdquo IEEE Transactions on Computers vol 65 no3 pp 791ndash804 2016

[3] S S Mukherjee J Emer and S K Reinhardt ldquoThe soft errorproblem an architectural perspectiverdquo in Proceedings of the11th International Symposium on High-Performance ComputerArchitecture (HPCA rsquo05) pp 243ndash247 San Francisco CalifUSA February 2005

[4] D Binder E C Smith and A B Holman ldquoSatellite anomaliesfrom galactic cosmic raysrdquo IEEE Transactions on NuclearScience vol 22 no 6 pp 2675ndash2680 1975

[5] J Olsen P E Becher P B Fynbo P Raaby and J SchultzldquoNeutron-induced single event upsets in static RAMS observedat 10 km flight altituderdquo IEEE Transactions on Nuclear Sciencevol 40 no 2 pp 74ndash77 1993

[6] J P Walters K M Zick and M French ldquoA practical char-acterization of a NASA SpaceCube application through faultemulation and laser testingrdquo in Proceedings of the 43rd AnnualIEEEIFIP International Conference on Dependable Systems andNetworks (DSN rsquo13) pp 1ndash8 June 2013

[7] S Mittal and J S Vetter ldquoA survey of techniques for modelingand improving reliability of computing systemsrdquo IEEE Trans-actions on Parallel and Distributed Systems vol 27 no 4 pp1226ndash1238 2016

[8] P Racunas K Constantinides S Manne and S S MukherjeeldquoPerturbation-based fault screeningrdquo in Proceedings of the IEEE13th International Symposium on High Performance ComputerArchitecture pp 169ndash180 Scottsdale Ariz USA February 2007

[9] Q Lu K Pattabiraman M S Gupta et al ldquoSDCTune amodel for predicting the SDC proneness of an applicationfor configurable protectionrdquo in Proceedings of the CompilersArchitecture and Synthesis for Embedded Systems pp 1ndash10 UttarPradesh India 2014

[10] N J Wang and S J Patel ldquoReStore symptom-based soft errordetection in microprocessorsrdquo IEEE Transactions on Depend-able and Secure Computing vol 3 no 3 pp 188ndash201 2006

10 International Journal of Aerospace Engineering

[11] M-L Li P Ramachandran S K Sahoo S V Adve V S Adveand Y Zhou ldquoUnderstanding the propagation of hard errorsto software and implications for resilient system designrdquo ACMSIGARCH Computer Architecture News vol 36 no 1 pp 265ndash276 2008

[12] N Oh P P Shirvani and E J McCluskey ldquoError detectionby duplicated instructions in super-scalar processorsrdquo IEEETransactions on Reliability vol 51 no 1 pp 63ndash75 2002

[13] M Shafique S Rehman P V Aceituno and J HenkelldquoExploiting program-level masking and error propagation forconstrained reliability optimizationrdquo in Proceedings of the 50thAnnual Design Automation Conference (DAC rsquo13) pp 1ndash17Austin Tex USA June 2013

[14] S Rehman M Shafique P V Aceituno F Kriebel J-J Chenand J Henkel ldquoLeveraging variable function resilience for selec-tive software reliability on unreliable hardwarerdquo in Proceedingsof the 16th Design Automation and Test in Europe Conferenceand Exhibition (DATE rsquo13) pp 1759ndash1764 Grenoble FranceMarch 2013

[15] G A Reis J Chang N Vachharajani R Rangan and DI August ldquoSWIFT software implemented fault tolerancerdquo inProceedings of the International Symposium on Code Generationand Optimization pp 243ndash254 IEEE Computer Society SanJose Calif USA 2005

[16] M D Ernst J H Perkins P J Guo et al ldquoThe Daikon systemfor dynamic detection of likely invariantsrdquo Science of ComputerProgramming vol 69 no 1ndash3 pp 35ndash45 2007

[17] A Thomas and K Pattabiraman ldquoError detector placement forsoft computationrdquo in Proceedings of the 43rd Annual IEEEIFIPInternational Conference on Dependable Systems and Networks(DSN rsquo13) pp 1ndash12 IEEE Computer Society June 2013

[18] F Shuguang G Shantanu A Amin et al ldquoShoestring proba-bilistic soft error reliability on the cheaprdquo in Proceedings of theASPLOS pp 385ndash396 Pittsburgh Pa USA 2010

[19] C Lattner and V Adve ldquoLLVM a compilation framework forlifelong program analysis amp transformationrdquo in Proceedings ofthe International Symposium onCode Generation andOptimiza-tion (CGO rsquo04) pp 75ndash86 San Jose Calif USA March 2004

[20] A Thomas and K Pattabiraman ldquoLLFI an intermediate codelevel fault injector for soft computing applicationsrdquo in Proceed-ings of the Workshop on Silicon Errors in Logic System Effects(SELSE rsquo13) pp 1ndash8 Palo Alto Calif USA 2013

[21] M RGuthaus J S RingenbergD Ernst et al ldquoMiBench a freecommercially representative embedded benchmark suiterdquo inProceedings of the Workload Characterization pp 3ndash14 AustinTex USA 2001

[22] K Pattabiraman S Giacinto C Daniel et al ldquoDynamicderivation of application-specific error detectors and theirimplementation in hardwarerdquo inProceedings of the 6th EuropeanDependable Computing Conference (EDCC rsquo06) pp 97ndash108Coimbra Portugal October 2006

[23] S K Sahoo M-L Li P Ramachandran S V Adve V SAdve and Y Zhou ldquoUsing likely program invariants to detecthardware errorsrdquo in Proceedings of the 38th Annual IEEEIFIPInternational Conference on Dependable Systems and Networks(DSN rsquo08) pp 70ndash79 IEEE Computer Society AnchorageAlaska USA June 2008

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 9: Research Article Detecting Silent Data Corruptions in ...downloads.hindawi.com/journals/ijae/2016/8213638.pdf · Research Article Detecting Silent Data Corruptions in Aerospace-Based

International Journal of Aerospace Engineering 9

072

073

074

074075

075

076

077

078078

SDC

cove

rage

25 50 75

Training set size

SDC coverageFalse positive

006

005

005004

003

003

002

002001

0

False

pos

itive

rate

Figure 7 The SDC coverage and false positive rate for varied training set sizes

It is found that assert(119899 minus 119896 + 1 = 0) is often better than thebounded-range-based invariant assert(119896min le 119896 le 119896max) atdetecting errors since assert(119899 minus 119896 + 1 = 0) checks both 119899 and119896 while assert(119896min le 119896 le 119896max) only checks 119896

A typical criterion for selection of detectors defined in[22] the tightness is the probability that the detector detectsan error given that there is an error in the value of the variablethat it checks The notion of tightness is based on the valueof a single variable The invariant in this paper may include2 or 3 variables and the notion of tightness cannot be usedto describe an invariant with more than one variable Forexample if 119909 is flipped in the invariant 119909 lt 119910 since there aremultiple possible values of119910 it cannot be decidedwhether theinvariant is still satisfied and thus the tightness cannot be cal-culated Since the tightness cannot be used we apply certainheuristics to choose invariants and it is proved to be effective

7 Conclusion

To address the problem of detecting SDC we proposean approach which applies invariant-based assertions andimplement a system called Radish Radish neither requiresany hardware modifications to add error detection capabilityto the original system nor needs to acknowledge the seman-tics of the program and thus possesses a good scalabilityExperiments show that Radish achieves high SDC coveragewith very low overhead

Furthermore we propose Radish D by adding instruc-tion duplication to the unsafe code sections which arenot covered by assertions Radish D achieves higher SDCcoverage than that of Radish or full instruction duplicationmechanism Both Radish and Radish D offer feasible alter-natives for soft error mitigation

Competing Interests

The authors declare no conflict of interests regarding thepublication of this paper

Acknowledgments

This work was supported by the National Basic ResearchProgram of China (ldquo973rdquo Project)

References

[1] H Schirmeier C Borchert and O Spinczyk ldquoAvoiding pitfallsin fault-injection based comparison of program susceptibilityto soft errorsrdquo in Proceedings of the 45th Annual IEEEIFIPInternational Conference on Dependable Systems and Networks(DSN rsquo15) pp 319ndash330 IEEE Rio de Janeiro Brazil June 2015

[2] A O Daniel L P Laercio S Thiago et al ldquoEvaluationand mitigation of radiation-induced soft errors in graphicsprocessing unitsrdquo IEEE Transactions on Computers vol 65 no3 pp 791ndash804 2016

[3] S S Mukherjee J Emer and S K Reinhardt ldquoThe soft errorproblem an architectural perspectiverdquo in Proceedings of the11th International Symposium on High-Performance ComputerArchitecture (HPCA rsquo05) pp 243ndash247 San Francisco CalifUSA February 2005

[4] D Binder E C Smith and A B Holman ldquoSatellite anomaliesfrom galactic cosmic raysrdquo IEEE Transactions on NuclearScience vol 22 no 6 pp 2675ndash2680 1975

[5] J Olsen P E Becher P B Fynbo P Raaby and J SchultzldquoNeutron-induced single event upsets in static RAMS observedat 10 km flight altituderdquo IEEE Transactions on Nuclear Sciencevol 40 no 2 pp 74ndash77 1993

[6] J P Walters K M Zick and M French ldquoA practical char-acterization of a NASA SpaceCube application through faultemulation and laser testingrdquo in Proceedings of the 43rd AnnualIEEEIFIP International Conference on Dependable Systems andNetworks (DSN rsquo13) pp 1ndash8 June 2013

[7] S Mittal and J S Vetter ldquoA survey of techniques for modelingand improving reliability of computing systemsrdquo IEEE Trans-actions on Parallel and Distributed Systems vol 27 no 4 pp1226ndash1238 2016

[8] P Racunas K Constantinides S Manne and S S MukherjeeldquoPerturbation-based fault screeningrdquo in Proceedings of the IEEE13th International Symposium on High Performance ComputerArchitecture pp 169ndash180 Scottsdale Ariz USA February 2007

[9] Q Lu K Pattabiraman M S Gupta et al ldquoSDCTune amodel for predicting the SDC proneness of an applicationfor configurable protectionrdquo in Proceedings of the CompilersArchitecture and Synthesis for Embedded Systems pp 1ndash10 UttarPradesh India 2014

[10] N J Wang and S J Patel ldquoReStore symptom-based soft errordetection in microprocessorsrdquo IEEE Transactions on Depend-able and Secure Computing vol 3 no 3 pp 188ndash201 2006

10 International Journal of Aerospace Engineering

[11] M-L Li P Ramachandran S K Sahoo S V Adve V S Adveand Y Zhou ldquoUnderstanding the propagation of hard errorsto software and implications for resilient system designrdquo ACMSIGARCH Computer Architecture News vol 36 no 1 pp 265ndash276 2008

[12] N Oh P P Shirvani and E J McCluskey ldquoError detectionby duplicated instructions in super-scalar processorsrdquo IEEETransactions on Reliability vol 51 no 1 pp 63ndash75 2002

[13] M Shafique S Rehman P V Aceituno and J HenkelldquoExploiting program-level masking and error propagation forconstrained reliability optimizationrdquo in Proceedings of the 50thAnnual Design Automation Conference (DAC rsquo13) pp 1ndash17Austin Tex USA June 2013

[14] S Rehman M Shafique P V Aceituno F Kriebel J-J Chenand J Henkel ldquoLeveraging variable function resilience for selec-tive software reliability on unreliable hardwarerdquo in Proceedingsof the 16th Design Automation and Test in Europe Conferenceand Exhibition (DATE rsquo13) pp 1759ndash1764 Grenoble FranceMarch 2013

[15] G A Reis J Chang N Vachharajani R Rangan and DI August ldquoSWIFT software implemented fault tolerancerdquo inProceedings of the International Symposium on Code Generationand Optimization pp 243ndash254 IEEE Computer Society SanJose Calif USA 2005

[16] M D Ernst J H Perkins P J Guo et al ldquoThe Daikon systemfor dynamic detection of likely invariantsrdquo Science of ComputerProgramming vol 69 no 1ndash3 pp 35ndash45 2007

[17] A Thomas and K Pattabiraman ldquoError detector placement forsoft computationrdquo in Proceedings of the 43rd Annual IEEEIFIPInternational Conference on Dependable Systems and Networks(DSN rsquo13) pp 1ndash12 IEEE Computer Society June 2013

[18] F Shuguang G Shantanu A Amin et al ldquoShoestring proba-bilistic soft error reliability on the cheaprdquo in Proceedings of theASPLOS pp 385ndash396 Pittsburgh Pa USA 2010

[19] C Lattner and V Adve ldquoLLVM a compilation framework forlifelong program analysis amp transformationrdquo in Proceedings ofthe International Symposium onCode Generation andOptimiza-tion (CGO rsquo04) pp 75ndash86 San Jose Calif USA March 2004

[20] A Thomas and K Pattabiraman ldquoLLFI an intermediate codelevel fault injector for soft computing applicationsrdquo in Proceed-ings of the Workshop on Silicon Errors in Logic System Effects(SELSE rsquo13) pp 1ndash8 Palo Alto Calif USA 2013

[21] M RGuthaus J S RingenbergD Ernst et al ldquoMiBench a freecommercially representative embedded benchmark suiterdquo inProceedings of the Workload Characterization pp 3ndash14 AustinTex USA 2001

[22] K Pattabiraman S Giacinto C Daniel et al ldquoDynamicderivation of application-specific error detectors and theirimplementation in hardwarerdquo inProceedings of the 6th EuropeanDependable Computing Conference (EDCC rsquo06) pp 97ndash108Coimbra Portugal October 2006

[23] S K Sahoo M-L Li P Ramachandran S V Adve V SAdve and Y Zhou ldquoUsing likely program invariants to detecthardware errorsrdquo in Proceedings of the 38th Annual IEEEIFIPInternational Conference on Dependable Systems and Networks(DSN rsquo08) pp 70ndash79 IEEE Computer Society AnchorageAlaska USA June 2008

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 10: Research Article Detecting Silent Data Corruptions in ...downloads.hindawi.com/journals/ijae/2016/8213638.pdf · Research Article Detecting Silent Data Corruptions in Aerospace-Based

10 International Journal of Aerospace Engineering

[11] M-L Li P Ramachandran S K Sahoo S V Adve V S Adveand Y Zhou ldquoUnderstanding the propagation of hard errorsto software and implications for resilient system designrdquo ACMSIGARCH Computer Architecture News vol 36 no 1 pp 265ndash276 2008

[12] N Oh P P Shirvani and E J McCluskey ldquoError detectionby duplicated instructions in super-scalar processorsrdquo IEEETransactions on Reliability vol 51 no 1 pp 63ndash75 2002

[13] M Shafique S Rehman P V Aceituno and J HenkelldquoExploiting program-level masking and error propagation forconstrained reliability optimizationrdquo in Proceedings of the 50thAnnual Design Automation Conference (DAC rsquo13) pp 1ndash17Austin Tex USA June 2013

[14] S. Rehman, M. Shafique, P. V. Aceituno, F. Kriebel, J.-J. Chen, and J. Henkel, “Leveraging variable function resilience for selective software reliability on unreliable hardware,” in Proceedings of the 16th Design Automation and Test in Europe Conference and Exhibition (DATE ’13), pp. 1759–1764, Grenoble, France, March 2013.

[15] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, “SWIFT: software implemented fault tolerance,” in Proceedings of the International Symposium on Code Generation and Optimization, pp. 243–254, IEEE Computer Society, San Jose, Calif, USA, 2005.

[16] M. D. Ernst, J. H. Perkins, P. J. Guo et al., “The Daikon system for dynamic detection of likely invariants,” Science of Computer Programming, vol. 69, no. 1–3, pp. 35–45, 2007.

[17] A. Thomas and K. Pattabiraman, “Error detector placement for soft computation,” in Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’13), pp. 1–12, IEEE Computer Society, June 2013.

[18] F. Shuguang, G. Shantanu, A. Amin et al., “Shoestring: probabilistic soft error reliability on the cheap,” in Proceedings of the ASPLOS, pp. 385–396, Pittsburgh, Pa, USA, 2010.

[19] C. Lattner and V. Adve, “LLVM: a compilation framework for lifelong program analysis & transformation,” in Proceedings of the International Symposium on Code Generation and Optimization (CGO ’04), pp. 75–86, San Jose, Calif, USA, March 2004.

[20] A. Thomas and K. Pattabiraman, “LLFI: an intermediate code level fault injector for soft computing applications,” in Proceedings of the Workshop on Silicon Errors in Logic System Effects (SELSE ’13), pp. 1–8, Palo Alto, Calif, USA, 2013.

[21] M. R. Guthaus, J. S. Ringenberg, D. Ernst et al., “MiBench: a free, commercially representative embedded benchmark suite,” in Proceedings of the Workload Characterization, pp. 3–14, Austin, Tex, USA, 2001.

[22] K. Pattabiraman, S. Giacinto, C. Daniel et al., “Dynamic derivation of application-specific error detectors and their implementation in hardware,” in Proceedings of the 6th European Dependable Computing Conference (EDCC ’06), pp. 97–108, Coimbra, Portugal, October 2006.

[23] S. K. Sahoo, M.-L. Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y. Zhou, “Using likely program invariants to detect hardware errors,” in Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’08), pp. 70–79, IEEE Computer Society, Anchorage, Alaska, USA, June 2008.
