Recent Advances and New Avenues in Hardware-level Reliability Support


The deployment of computer systems in complex mission- and life-critical applications has increased the significance and impact of transient errors. Hardware designers have handled these errors at the device, circuit, and architectural levels, employing information redundancy, hardware redundancy, time redundancy, or a combination of these techniques. This article analyzes techniques developed at the circuit and architectural levels, both in experimental academic research and industry.

On the basis of past studies, we observe that most low-level errors don’t translate to errors in the application’s outcome, which is the user’s primary concern. Therefore, we propose an alternative paradigm called application-aware runtime checking. In this approach, we analyze the application either statically or through dynamic profiling to extract its reliability-sensitive characteristics. Using extracted application properties, we devise hardware checker modules and embed them in a processor-level framework to enable runtime error detection and recovery. The architecture of the Illinois Reliability and Security Engine is a possible implementation of such a framework.

Ravishankar K. Iyer
Nithin M. Nakka
Zbigniew T. Kalbarczyk
University of Illinois at Urbana-Champaign

Subhasish Mitra
Intel

DISCUSSION ON SOME OF THE RECENT ADVANCES IN LOGIC- AND ARCHITECTURAL-LEVEL TECHNIQUES TO DEAL WITH TRANSIENT ERRORS SERVES AS A “SPRINGBOARD” TO MOTIVATE THE NEED FOR HARDWARE-LEVEL APPLICATION-AWARE RUNTIME CHECKS, WHICH THE APPLICATION CAN INVOKE FROM WITHIN ITS INSTRUCTION STREAM PER ITS DEPENDABILITY REQUIREMENTS. IN CONTRAST WITH TRADITIONAL APPROACHES OF SOFTWARE AND HARDWARE DUPLICATION, ALTERNATIVE TECHNIQUES SUCH AS FINE-GRAINED APPLICATION-AWARE RUNTIME CHECKING OFFER MORE EFFICIENT, LOW-OVERHEAD DETECTION, CORRECTION, AND RECOVERY.

Published by the IEEE Computer Society 0272-1732/05/$20.00 © 2005 IEEE

Background and motivation

Transient faults have traditionally been associated with the corruption of computer memory content. This phenomenon was reported as early as 1954 in applications operating in such adverse conditions as proximity to nuclear bomb test sites and, later, space. Since 1978, dense memory circuits, both DRAM and SRAM, have been known to be susceptible to soft errors caused by alpha particles from IC packaging and cosmic rays. While soft errors are removed when the erroneous location is overwritten, such errors can be catastrophic for correct program execution, because a corrupted intermediate value, if not handled, can corrupt subsequent computations.

Continuously decreasing device feature sizes and supply voltages reduce capacitive node charge and noise margin, making even flip-flop circuits susceptible to soft errors. At the circuit level, available data on flip-flop error rate trends is inconsistent. Researchers at Texas Instruments have observed an increase in nominal soft-error rates of flip-flops and latches,1 whereas experiments at Intel have shown fairly constant or even slightly decreasing nominal soft-error rates.2 Nevertheless, with the increasing integration of devices on a chip, the chip-level soft-error rate contribution from sequential elements is clearly rising. The high clock rate of modern processors exacerbates the problem by increasing the probability that an incorrect signal in a combinational circuit is latched by a flip-flop. This places us in an unfamiliar realm in which logically correct implementations alone cannot ensure correct program execution. Consequently, soft errors in flip-flops need immediate attention, and for technology readiness in future processor generations, solutions to handle combinational-logic errors are essential.

Over the years, three main types of measures to protect against soft errors at the hardware platform level have evolved: circuit-level, logic-level, and architectural techniques.

Circuit-level techniques include

• forward body bias: forward biasing the MOSFET body-source junction increases the junction capacitance, decreases the junction depletion charge collection volume, and creates a stronger feedback loop, all of which decrease the probability that a particle strike can flip the output of the device;

• transistor sizing: increasing the transistor aspect ratio increases output capacitance and hardens the transistor against charge injected by cosmic radiation;

• conservative design practices: using high-reliability components and excluding radiation-sensitive circuit styles such as dynamic-logic styles; and

• incorporating a sufficient functional margin in circuit designs to account for anticipated shifts in circuit characteristics.

Logic-level techniques detect and recover from errors in combinational circuits by using a redundant or a self-checking circuit, such as output parity generation, to validate the output of the combinational circuit; and in flip-flop elements by providing redundant latches or by reusing scan flip-flops to hold redundant copies of flip-flop data.

Architectural techniques include providing duplicate functional units or independent hardware to mimic and verify pipeline execution, or replicating application execution through multiple communicating threads.

Reliability support approaches exploit one (or some combination) of three forms of redundancy: information, hardware, or time. Use of information redundancy, such as parity or error correction code (ECC), allows detection and/or correction of certain classes of bit errors. Information redundancy requires additional storage for the encoded data, and it incurs hardware overhead in the form of encoding and checking logic. Typically, designers reserve information redundancy for protecting memory, caches, and perhaps register files and use hardware- and time-redundant techniques elsewhere in the processor.

Systems achieve hardware redundancy by carrying out the same computation on multiple, independent hardware at the same time and corroborating the redundant results to expose errors. Systems with triple (or higher) redundancy can obtain a correct answer through a majority-voting scheme. For duplex systems, computation must restart to recover from an error. Techniques that exploit hardware redundancy exhibit sensitivity to common-mode or correlated failures, in which the same fault affects multiple redundant hardware units in the same way, making the error undetectable.
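The comparison and voting steps described above can be sketched in a few lines. This is only an illustration in software, assuming each redundant unit delivers a hashable result; the names `duplex_check`, `majority_vote`, and `DuplexMismatch` are hypothetical and do not come from any of the systems discussed here.

```python
from collections import Counter
from typing import Hashable, Sequence


class DuplexMismatch(Exception):
    """The two copies of a duplex system produced different results."""


def duplex_check(result_a: Hashable, result_b: Hashable) -> Hashable:
    """Duplex comparison: detects an error but cannot tell which copy is wrong."""
    if result_a != result_b:
        raise DuplexMismatch(f"{result_a!r} != {result_b!r}; computation must restart")
    return result_a


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise ValueError("no strict majority among replica results")
    return value


# One faulty replica out of three is outvoted.
print(majority_vote([42, 42, 43]))
# A common-mode fault corrupts two replicas identically: the wrong value wins.
print(majority_vote([43, 43, 42]))
```

The second call shows why correlated failures are a concern: the voter has no way to distinguish two identical wrong answers from two identical right ones.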

Time redundancy avoids the large hardware overhead of hardware redundancy. This approach obtains redundant computation by repeating the same operation multiple times on the same hardware. Time redundancy techniques incur high performance overhead and require additional hardware for collecting and comparing the multiple-execution results.

NOVEMBER–DECEMBER 2005

Software techniques often suffer from relatively high performance overheads and high error-detection latency. Hardware-implemented techniques offer low-latency detection. However, they usually target specific hardware structures and consequently are unaware of the application executing on the system and don’t give the application a choice of techniques to use. Furthermore, several independent fault-injection-based studies of computing systems have shown that only a small percentage of hardware-level errors propagate and are observable at the application level.3,4 We can attribute this to two factors: natural masking in the system such as electrical masking, logical masking, or corrupted data not used by the application; and low-level detection and recovery such as ECC embedded in the system.

Errors that escape low-level detection and errors caused by application design and implementation flaws call for more efficient, low-overhead detection and recovery techniques that recognize application characteristics and that can be automatically configured into the system to meet application needs. In addressing this challenge, we present the Illinois Reliability and Security Engine, an experimental processor-level framework for embedding application-aware, hardware-based error detection, security attack masking, and recovery. The framework can configure itself to embed hardware that provides application-aware runtime support. The hardware modules embedded in the framework can execute in parallel with instruction execution in the pipeline, thus minimizing overhead. Also, the application can invoke these modules at specific points in its execution stream to derive a desired dependability level.

Logic-level soft-error resilience

The conventional notion is that flip-flops are more vulnerable than combinational logic to soft errors.5,6 So we first discuss techniques for protection of flip-flops and latches and then those for combinational circuits.

Flip-flop and latch protection techniques

The two main flip-flop protection techniques are the following:

Circuit or flip-flop hardening.7 Two fundamental concepts underlie the design of single-event-upset (SEU)-immune storage cells using conventional CMOS processes. First, redundancy in the memory circuit maintains a source of uncorrupted data after an SEU. The circuit achieves this redundancy by using two specifically designed latch sections that store the same data. Second, data in the uncorrupted section provides specific state-restoring feedback to recover the corrupted data.

Built-in soft-error resilience.5 Scan-based designs use the inherent redundancy of latches, along with the Muller C-element, to detect and tolerate soft errors in latches and flip-flops. A C-element has two inputs and one output. If the two inputs match, the C-element acts as an inverter. If the inputs don’t match, the previous value is retained. The C-element tolerates any errors in the latches and flip-flops when the clock is 0 (when the latch is vulnerable to error).
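The C-element behavior described above (invert when the inputs agree, hold otherwise) can be captured in a small behavioral model. This is a sketch at the logic level only, assuming single-bit inputs; the class name `CElement` and its `update` method are illustrative choices, not part of the cited design.

```python
class CElement:
    """Behavioral model of an inverting C-element with state retention."""

    def __init__(self, initial_output: int = 0) -> None:
        self.output = initial_output

    def update(self, a: int, b: int) -> int:
        """Apply inputs a and b (each 0 or 1) and return the output."""
        if a == b:
            # Inputs agree: the element acts as an inverter.
            self.output = 1 - a
        # Inputs disagree: the previous output is retained.
        return self.output


c = CElement()
print(c.update(1, 1))  # both latches hold 1
print(c.update(1, 0))  # a soft error flips one latch; output is held
print(c.update(1, 1))  # the latches agree again
```

Because a single upset makes the two inputs disagree, the output keeps its last correct value until the inputs match again.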

Combinational-logic techniques

Nicolaidis proposes a combination of hardware redundancy and time redundancy techniques to provide soft-error detection and correction in combinational logic.7 This approach largely exploits the fact that soft errors affect circuit outputs only for a short time, resulting in a temporal occurrence of an incorrect value, whereas for the rest of the time, the outputs are correct.

Nicolaidis presents two ways to deal with soft errors in combinational logic. One way is to use a shadow latch. A shadow latch is a redundant latch that is clocked out of phase or whose inputs arrive after a delay from the main output latch to provide redundancy. The phase difference between the clocks or the delay between the latches’ inputs is determined as a function of the duration of transient pulses that must be tolerated, such that not more than one of the latches can be affected by the transient pulse. The outputs of the main latch and the redundant latch(es) are either compared (in the case of one redundant latch) or voted upon by a simple majority voter (in the case of two redundant latches), depending on whether the goal is only to detect an error or to tolerate it. The other way to handle soft errors is to use a C-element. The combinational logic outputs and a delayed version of the same combinational logic are fed into a C-element. Unlike the method described by Mitra et al.,5 the C-element method tolerates any transient errors in the combinational logic.

IEEE MICRO
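The timing argument behind shadow latches (the delay between latches must exceed the transient pulse width so that at most one latch captures the glitch) can be illustrated with a toy simulation. This is a sketch under simplifying assumptions: time is in arbitrary units, the signal is a single bit, and the function names are hypothetical.

```python
from typing import Callable, List


def signal_with_glitch(correct: int, glitch_start: float,
                       glitch_width: float) -> Callable[[float], int]:
    """Return a signal that is inverted during [glitch_start, glitch_start + width)."""
    def value_at(t: float) -> int:
        in_glitch = glitch_start <= t < glitch_start + glitch_width
        return 1 - correct if in_glitch else correct
    return value_at


def sample_latches(value_at: Callable[[float], int], t_main: float,
                   delay: float, n_latches: int) -> List[int]:
    """Main latch samples at t_main; each shadow latch samples `delay` later."""
    return [value_at(t_main + i * delay) for i in range(n_latches)]


def mismatch(samples: List[int]) -> bool:
    """Detection: one redundant latch, outputs compared."""
    return len(set(samples)) > 1


def vote(samples: List[int]) -> int:
    """Tolerance: two redundant latches, simple majority vote."""
    return 1 if 2 * sum(samples) > len(samples) else 0


sig = signal_with_glitch(correct=1, glitch_start=9.5, glitch_width=1.0)
print(mismatch(sample_latches(sig, 10.0, 1.5, 2)))  # delay > pulse width
print(vote(sample_latches(sig, 10.0, 1.5, 3)))      # majority masks the glitch
print(mismatch(sample_latches(sig, 10.0, 0.4, 2)))  # delay < pulse width
```

In the last call the delay is shorter than the pulse, so both latches capture the corrupted value and the comparison sees no mismatch, which is the case the delay must be chosen to avoid.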

Razor8 is another method for detecting soft errors in combinational logic. In addition to error detection, soft-error resilience also requires correction. The techniques presented by Ernst et al. for recovering from timing errors in microprocessor pipelines can be adapted for recovering from soft errors in the pipeline.8 Table 1 summarizes logic-level techniques, distinguishing those targeted at sequential elements and combinational circuits.


Table 1. Logic-level techniques for soft-error resilience (SER).

Protection for sequential elements

Technique: Circuit hardening
Architecture: Special circuit design techniques reduce flip-flop sensitivity to soft errors.
Detection and tolerance mechanisms: Redundant nodes provided in a latch. Gates of pull-up PMOS and pull-down NMOS transistors connected to different nodes prevent particle strike on node affecting complementary node.
Merits/Disadvantages: Protects flip-flops via selective insertion of redundant components. High area and power overheads. Difficult to reuse design for low-power applications not needing protection.

Technique: Built-in SER exploiting inherent redundancy in scan-based designs5
Architecture: Scan flip-flop for testing circuit is used during normal operation to hold copy of system latch state.
Detection and tolerance mechanisms: Scan flip-flop and system latch outputs corrected by C-element.
Merits/Disadvantages: Exploits redundancy already present in scan-based designs. Selectively applicable to flip-flop with scan support. Error-trapping scan cell decreases routing and hardware overhead. Doesn’t protect combinational logic.

Technique: Time redundancy using shadow latches7
Architecture: Single latch replaced by two or three latches clocked with a fixed phase delay or whose inputs arrive after fixed delay.
Detection and tolerance mechanisms: Redundant latches’ outputs are compared or voted upon by a simple majority voter.
Merits/Disadvantages: Small space overhead. Time overhead to generate delayed clock signals or delayed input signals and to wait for all latches to be clocked before any inputs can change. Phase delay between clocks determines width of transient pulse tolerated.

Protection for combinational circuits

Technique: Duplicate or self-checking combinational circuits
Architecture: Combinational circuit replaced by duplicate copies or self-checking circuit (such as circuit generating output along with output parity).
Detection and tolerance mechanisms: C-element detects error when outputs of two copies don’t match or when output of self-checking circuit corresponds to a noncode word. On detecting an error, maintains previous state.
Merits/Disadvantages: Tolerating errors without requiring three copies of the circuit using a C-element can be useful. Space and time overhead to generate self-checking code. Single error in circuit with fan-out can cause multiple errors, foiling checking code. Assumes that transient error disappears before circuit inputs change, eventually producing correct output. Needs precise timing-hazard analysis.

Technique: Razor: tolerating timing errors due to dynamic voltage scaling using time-redundant shadow latch8
Architecture: Pipeline latches augmented with shadow latches hold signals that arrive after a delay.
Detection and tolerance mechanisms: On a mismatch between main latch and shadow latch, clock gating or counterflow pipelining performs recovery.
Merits/Disadvantages: Allows dynamic voltage scaling to reduce power consumption. Low performance impact from dynamic voltage scaling. Uses two recovery mechanisms to recover from pipeline timing errors. Same recovery mechanisms can also correct soft errors in pipeline. Short and long paths feeding pipeline latch must be balanced.



Error protection in current-generation processors

The most commonly used soft-error protection mechanisms in modern processors use information redundancy techniques, including parity, ECC, and Chipkill memory protection mechanisms. Here, we present basic information redundancy techniques and examine them in recent commercial microprocessors.

Information redundancy

Coding techniques using parity, Hamming, single-error-correction/double-error-detection (SEC/DED), or other complex codes protect data words by providing information redundancy. These codes, generically termed EDC (error detection code) or ECC (error correction code), allow certain classes of bit errors to be detected and corrected. ECC protection of a memory array is relatively efficient because the coding logic’s cost can be amortized over the array. However, applying ECC to individual processor registers might require a significant amount of overhead and increase critical-path delay.
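To make SEC/DED concrete, the following sketch implements the smallest textbook instance: a Hamming(7,4) code extended with one overall parity bit. It is an illustration only; the 4-bit word size and the function names are assumptions chosen for brevity, and production memories use much wider words (for example, 8 check bits per 64 data bits).

```python
from typing import List, Optional, Tuple


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into an 8-bit SEC/DED codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming positions, with check bits at positions 1, 2, and 4.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    for bit in code[1:]:
        code[0] ^= bit
    return code


def decode(received: List[int]) -> Tuple[Optional[List[int]], str]:
    """Return (data, status); data is None when a double error is detected."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall = 0
    for bit in code:
        overall ^= bit               # 0 when the overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1          # single error; syndrome 0 means bit 0 itself
        status = "corrected"
    else:
        return None, "double error detected"
    return [code[3], code[5], code[6], code[7]], status


word = encode([1, 0, 1, 1])
word[6] ^= 1                         # inject a single-bit error
print(decode(word))
word[2] ^= 1                         # inject a second error
print(decode(word))
```

The check bits here double the storage for a 4-bit word, which illustrates why per-register ECC is costly, whereas wide memory words amortize the check bits far better.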

IBM Chipkill technology

IBM Chipkill is an advanced form of ECC that can withstand multibit errors in a DRAM chip, including errors that would affect all the data bits in the chip, whether that is 4, 8, or 16 bits. Standard ECC memory detects and corrects single-bit errors. The risk of multibit memory errors grows higher with increased memory capacity (for example, 32 Gbytes), memory density on a single dual in-line memory module (for example, 1 Gbyte), and memory subsystem speed.

Figure 1 illustrates the results of an IBM analysis comparing server outages due to memory failures in systems equipped with parity, ECC, and Chipkill memory protection mechanisms. In a three-year period, the 32-Mbyte parity-memory-equipped server had seven outages per 100 servers; the 1-Gbyte ECC-memory-equipped server, nine outages per 100 servers; and the 4-Gbyte Chipkill-equipped server, six outages per 10,000 servers. Standard ECC DRAM corrects all single-bit errors, but Chipkill provides protection for multibit errors, over and above the protection provided by standard ECC. The fact that the Chipkill-equipped server had a failure rate two orders of magnitude lower than systems with regular ECC DRAM, as shown in Figure 1, implies that multibit errors constitute a significant percentage of memory errors.
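The "two orders of magnitude" comparison follows directly from normalizing the quoted outage counts to a common base of 10,000 servers. The short calculation below uses only the figures stated in the text; the dictionary layout and the helper name `per_10k` are illustrative.

```python
import math

# (outages, servers) over three years, as quoted in the text
outages = {
    "32-Mbyte parity": (7, 100),
    "1-Gbyte ECC": (9, 100),
    "4-Gbyte Chipkill": (6, 10_000),
}


def per_10k(count: int, servers: int) -> float:
    """Normalize an outage count to outages per 10,000 servers."""
    return count * 10_000 / servers


rates = {name: per_10k(*pair) for name, pair in outages.items()}
ratio = rates["1-Gbyte ECC"] / rates["4-Gbyte Chipkill"]

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} outages per 10,000 servers")
print(f"ECC-to-Chipkill ratio: {ratio:.0f}x, about 10^{math.log10(ratio):.1f}")
```

Note that the three systems also differ in memory size, so the ratio compares whole configurations rather than protection schemes in isolation.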

Recent commercial microprocessors

The study of several commercial microprocessors provides examples of the detection and recovery techniques adapted in real implementations. Designers target processors at either the home PC market or the high-end server market. The Intel P6 processor family represents the class of processors that target a home desktop environment. They operate as single-core processors. Processors based on the AMD Hammer architecture also target the high-performance PC market but can operate in multicore architectures.

The Intel Itanium processor is designed for high-end servers, but its primary design driver is high performance. Consequently, it trades reliability for performance as much as possible. On the other hand, the main goal of the IBM G5, with dual instruction and execution units and a hierarchy of error-handling and recovery mechanisms, is to provide high availability. The cost is lower performance caused by restricting the processor pipeline to issue at most one instruction per cycle. The IBM Power4, a high-end server processor that includes many reliability features, primarily supports multicore architectures. Table 2 summarizes the salient reliability features of these processors.

Figure 1. Comparison of server outages due to memory failures in systems with parity, ECC, and Chipkill protection (source: IBM9). [Bar chart. Vertical axis: cumulative failures per 10,000 servers over three years, 0 to 1,000. Bars: 32-Mbyte parity (7 failures per 100); 1-Gbyte ECC (9 failures per 100); 1-Gbyte IBM Chipkill (6 failures per 10,000).]

Architectural error protection techniques

Several academic researchers have proposed techniques for handling transient faults through hardware, time redundancy, or both. One technique, simultaneously and redundantly threaded processors with recovery (SRTR), achieves fault tolerance and recovery by executing two copies of the same program on a simultaneous multithreading (SMT) processor.10 The Dynamic Instruction Verification Architecture (DIVA) comprises an aggressive, out-of-order superscalar processor and, on the same die, a simple, in-order checker processor.11 The checker processor verifies the complex out-of-order processor’s output and triggers a recovery action when it finds an inconsistency. Table 3 summarizes the key concepts of these and other techniques, their merits, and their disadvantages.

Application-aware runtime checking

Current hardware-based protection techniques based on coding, duplication, and custom solutions such as shadow latches often suffer from high area, time, and power overheads. For instance, simulation-based experiments have shown up to 27 percent overhead caused by synchronization of duplicate processor threads.10 An independent experiment triplicating only the ALU of a superscalar processor showed a 9 percent overhead in hardware and a 20 percent increase in clock cycle time. Thus, duplication suffers not only from overhead due to synchronizing duplicate threads, but also from inherent performance overhead due to additional hardware.


Table 2. Reliability features in modern microprocessors.

Internal registers
Intel P6 family: Parity
AMD Hammer: No protection
Intel Itanium: No protection
IBM S/390 G5: ECC
IBM Power4: Parity

L1 data
Intel P6 family: Parity
AMD Hammer: I cache: parity; D cache: ECC
Intel Itanium: Parity
IBM S/390 G5: Parity; store buffer protected by ECC
IBM Power4: Parity

L1 tag
Intel P6 family: Parity
AMD Hammer: Parity
Intel Itanium: Parity
IBM S/390 G5: Parity
IBM Power4: Parity

L2 data
Intel P6 family: ECC
AMD Hammer: ECC
Intel Itanium: 8-bit ECC/64 data bits
IBM S/390 G5: ECC
IBM Power4: ECC

L2 tag
Intel P6 family: Parity
AMD Hammer: ECC
Intel Itanium: Parity
IBM S/390 G5: ECC
IBM Power4: ECC

L3 data
Intel P6 family: N/A
AMD Hammer: N/A
Intel Itanium: 8-bit ECC/64 data bits
IBM S/390 G5: N/A
IBM Power4: ECC

L3 tag
Intel P6 family: N/A
AMD Hammer: N/A
Intel Itanium: 3 parity bits
IBM S/390 G5: N/A
IBM Power4: ECC

TLBs
Intel P6 family: Parity
AMD Hammer: Parity
Intel Itanium: Parity
IBM S/390 G5: Parity
IBM Power4: Parity

Buses
Intel P6 family: ECC on CPU-L2 bus
AMD Hammer: No protection
Intel Itanium: No protection
IBM S/390 G5: No protection
IBM Power4: Data bus: ECC; address and control bus: parity

Other features
Intel P6 family: Machine check architecture (MCA) to detect and correct errors in processor logic
AMD Hammer: MCA
Intel Itanium: Multilevel MCA: local and global MCA, hardware bus reset
IBM S/390 G5: Duplicate instruction and execution units’ outputs compared every clock cycle
IBM Power4: Logic checkers for fault isolation and recovery

Unique features
Intel P6 family: Functional redundancy checking using master/slave processors
AMD Hammer: Chipkill memory controller to support memory scrubbing; NX virus protection for Windows XP SP2
Intel Itanium: Multilevel error containment; watchdog timer; error logging and corrected error notification; NX virus protection
IBM S/390 G5: Single-issue processor for robust recovery
IBM Power4: Careful placement and identification of logic checkers aids automated recovery or isolation of faulty unit for deferred repair

Recovery features
Intel P6 family: Error correction in L2 cache and CPU-L2 bus
AMD Hammer: Chipkill memory controller performs memory scrubbing on ECC-protected arrays
Intel Itanium: Processor abstraction layer handles errors at various layers; ECC errors corrected on the fly
IBM S/390 G5: Hierarchical recovery levels: processor hardware error recovery, system recovery, transparent processor sparing
IBM Power4: Invalidate-and-retry protocol, invalidates erroneous array location and fetches from next-level storage



Whereas hardware mechanisms provide low-latency error detection, software-based reliability techniques suffer from high detection latency in addition to the usual high performance overhead and complexity.

Furthermore, architectural techniques are unaware of the application executing in the system. Hence, the application cannot selectively choose among the system’s available fault tolerance features. Current processors don’t facilitate application-aware runtime checks that the application can invoke from within its instruction stream to meet its dependability requirements. In a desirable protection scheme, the application would dictate what it wants, and the processor would adapt to meet those requirements.

Detailed latch-level fault injection to observe error effects at higher layers (processor pin level and software level) demonstrates the need for application-aware checking. Choi, Iyer, and Carreno performed fault injections at the device level in a 16-bit microcontroller for jet engines to study fault propagation at the gate and higher levels. The study showed that about 80 percent of errors had no impact on the chip. Wang et al. studied a model of the Alpha 21264 processor through fault injection and found that less than 15 percent of errors resulted in software-visible failures. Recently, Saggese et al. conducted extensive gate-level injection in a superscalar DLX-like processor and found that only 4 percent of errors manifested as pin-level mismatches.

The fact that most processor-level errors do not lead to errors in application outcome may be one of the reasons why chip-level duplication, as in the HP NonStop Himalaya processor, has shown higher levels of dependability than on-chip replication. However, chip-level duplication has a high overhead in hardware. The synchronization (including voting) of the replicated processors incurs a performance overhead. Moreover, an apparent lack of experimental studies on quantitative characterization of on-chip replication makes it difficult to judge what level of reliability such processors can achieve. It is therefore not surprising that IBM systems such as the G5 use a backup processor (for chip-level replication) in addition to the on-chip replication of the instruction and execution units and the high-availability features of the Multiple Virtual Storage (MVS) operating system running on top of the G5-based platform.

Table 3. Architectural reliability techniques.

Category: Hardware redundancy
Technique: DIVA11
Description: Simple in-order checker processor checks complex out-of-order core processor.
Merits: Checker uses intermediate results from core processor. Simple checker can be formally verified.
Disadvantages: Design of checker for superscalar core is nontrivial. Errors causing omission of prediction stream entries not covered.

Category: Time redundancy in hardware
Technique: Multithreaded processors: SRTR10
Description: Application process duplicated into two communicating threads executing at a phase lag.
Merits: Utilizes spare hardware in multithreaded processors. Leading thread provides intermediate results to trailing thread.
Disadvantages: Accessing new data structures increases cycle time. Error detection limited to replication sphere, which includes only the execute stage. Application cannot choose reliability through duplication.

Category: Time redundancy in software
Technique: Duplicated instructions in superscalar processors12
Description: Instructions duplicated in software at compile time. Hardware executes two copies of instruction and compares results. Data diversified to maintain mapping between original and duplicated data and results.
Merits: No additional hardware required. Easily used with commercial off-the-shelf components. Idle superscalar processor hardware can be used. Targets energy efficiency.
Disadvantages: High performance penalty. High power penalty. High error detection latency. Selective insertion of duplicated instructions is difficult.

Category: Hardware and time redundancy
Technique: Dual use of superscalar datapath13
Description: Instructions replicated when dispatched to reservation station, and results in reorder buffer compared before commit.
Merits: Dispatch mechanism modified to support replication. Recovery through same mechanism that recovers from branch misprediction.
Disadvantages: Introduced redundancy reduces effective dispatch and commit bandwidths. Decreases effective utilization of reorder buffer and register rename table. Doesn’t provide selective replication.

The discussion above indicates that techniques such as duplication, which are unaware of application characteristics, are inefficient in protecting application outcome or can do so only at a high hardware and performance overhead. This motivates the need for techniques that are aware of the application characteristics, and that can be implemented in hardware to minimize performance overhead and provide low latency of detection.

RSE framework

The Illinois Reliability and Security Engine (RSE) is a processor-level hardware framework that addresses the need for application-aware runtime checking and provides more reliability than on-chip replication can offer.14 The RSE provides a variety of techniques for error detection, security vulnerability masking, and recovery under one umbrella in a uniform, low-overhead way and makes adapting or reconfiguring error checking to the application’s needs possible. Implementing application-aware detection techniques at the processor level has several advantages:

• Additional hardware enhancements can be simple and application neutral.

• Minimal instrumentation of the application is needed.

• Low detection latency, to the level of a few instructions, limits error propagation.

• The hardware monitoring mechanism is automatically available to the operating system.

We implemented the RSE as an integral part of the processor, residing on the same die. Hardware modules providing error detection and security services are embedded in the RSE and execute in parallel with the core pipeline. Figure 2 depicts the RSE framework in the context of a superscalar processor. The top of the figure shows the five-stage main processor pipeline; the RSE framework is at the bottom of the figure. The RSE consists of two main parts: the framework interface fabric (input interface) and the internal hardware modules that perform reliability and security checking.

Figure 2. RSE framework organization.14 [Block diagram. Top: the main processor pipeline (instruction fetch, decode, execute, memory, commit) with an instruction queue, a memory-ready signal, and memory. Pipeline outputs tapped by the framework: Fetch_Out, RegFile_Data, Execute_Out, Memory_Out, and Commit_Out, carrying the instruction, register numbers, register values, ALU result, address/next PC, data loaded from memory, and commit/squash signals. Bottom: the RSE framework, comprising the manager, the framework interface fabric, and the hardware modules (preemptive control-flow checking, application/OS health monitor, selective replication, pointer taintedness tracking).]

The framework interface consists of a set of input queues, each containing the outputs of a particular pipeline stage. Additional fan-outs from the core processor’s pipeline stage outputs provide input to the framework. The application interfaces with the engine via special CHECK instructions. We instrument the application at compile time to include these special instructions. This requires an extension of the processor’s instruction set architecture. As the processor fetches instructions from the pipeline, it forwards the check instructions to the RSE to invoke the security and reliability hardware checker modules.
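The routing of CHECK instructions to checker modules can be pictured with a toy software model. Everything in this sketch is hypothetical: the instruction encoding, the module identifier `"cfc"`, and the `ToyRSE` class are invented for illustration and do not reflect the actual RSE instruction format or hardware interface. The real framework also runs its modules in parallel with the pipeline, whereas this model is sequential.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Instruction:
    opcode: str
    operands: Tuple = ()


class ToyRSE:
    """Hypothetical stand-in for the engine: maps module ids to checkers."""

    def __init__(self) -> None:
        self.modules: Dict[str, Callable[..., None]] = {}

    def register(self, module_id: str, checker: Callable[..., None]) -> None:
        self.modules[module_id] = checker

    def check(self, module_id: str, *args) -> None:
        if module_id not in self.modules:
            raise KeyError(f"no checker module configured for {module_id!r}")
        self.modules[module_id](*args)


def run(program: List[Instruction], rse: ToyRSE) -> List[str]:
    """Forward CHECK instructions to the engine; 'execute' everything else."""
    executed = []
    for instr in program:
        if instr.opcode == "CHECK":
            module_id, *args = instr.operands
            rse.check(module_id, *args)
        else:
            executed.append(instr.opcode)
    return executed


checked_targets: List[int] = []
rse = ToyRSE()
rse.register("cfc", checked_targets.append)  # stand-in control-flow checker
program = [
    Instruction("ADD"),
    Instruction("CHECK", ("cfc", 0x40)),     # inserted at compile time
    Instruction("LOAD"),
]
print(run(program, rse), checked_targets)
```

The point of the model is the division of labor: the application decides where checks occur by placing CHECK instructions, and the set of registered modules decides what each check does.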

We implemented the framework and its interface to the pipeline, along with the processor, on a reconfigurable portion of the die. Depending on the modules required by the set of applications currently executing, only part of the framework interface fabric must be instantiated. In Figure 2, for example, the complete framework interface fabric is enabled to provide inputs for four embedded hardware modules: preemptive control flow checking, which verifies whether an application follows a correct execution flow; an application/OS health monitor, which detects application and/or operating system hangs and crashes; selective replication, which enables selective duplication of parts of the application code; and pointer taintedness tracking, which detects the dereference of a pointer value derived from user input, possibly an indication of malicious tampering with the system (a security attack). These modules illustrate the framework’s versatility in providing runtime application support through various embedded mechanisms.

To estimate the framework’s area and performance overhead, we simulated the proposed architecture and synthesized an initial implementation in the context of a DLX-like superscalar processor. Compared with the processor, the framework incurred area overhead of 9.4 percent and increased clock cycle time by 5.9 percent. The increased clock cycle time was mainly due to tapping the pipeline stages’ outputs for inputs to the framework, thus increasing the pipeline outputs’ capacitive load.

Using an application-centered error detection approach and hardware-implemented recovery means that we need not include the detection techniques at the time of hardware design. Rather, we transfer this effort to application design time, when we analyze the application to understand its reliability characteristics and implement suitable hardware modules. The hardware framework, as mentioned earlier, adapts by embedding the checkers and monitoring the application at runtime.

Example modules

Several hardware modules explored in the current implementation of the framework illustrate the RSE’s versatility.

Application/OS health monitor. A heartbeat mechanism can be fairly efficient in detecting many types of errors. But software implementations of heartbeat mechanisms require careful application instrumentation and exhibit high crash and hang detection latency. To address this challenge, we have devised the in-processor application/OS health monitor to reduce error detection latency and instrumentation overhead in detecting process and operating system crash and hang. The monitoring mechanism is OS neutral and can be applied to both application processes and the OS. This unique design requires no application/OS-level instrumentation for detecting application crashes or OS hangs and allows an automated, lightweight, compile-time instrumentation methodology to enable runtime detection of application hangs, including infinite loops. To monitor the execution of different application code sections for detecting crashes and hangs, the health monitor implements three separate techniques: instruction count heartbeat (ICH), infinite-loop hang detection (ILHD), and sequential-code hang detection (SCHD).

ICH supports the detection of all process crashes and the class of hangs in which the process exists but is not executing instructions. This approach exploits existing performance counters in modern processors and eliminates OS and application-level instrumentation. ICH does this by monitoring whether the processor executes a fixed number of instructions of the process within a constant time. Whereas information regarding an abnormal process termination is available at the OS level, the per-process data collected in the ICH module has value in the broader context of locally detecting OS hangs. In contrast, current application/OS hang detection techniques require custom instrumentation and, possibly, an external monitoring entity on a remote node.

Together, ILHD and SCHD provide process and kernel infinite-loop detection: ILHD captures infinite execution of legitimate loops, and SCHD captures illegal loops caused by errors that subvert a program’s control flow. To detect infinite execution of legitimate loops, we instrument the entry and exit points of loops in the application via a compiler. (Most compilers can detect loops and can be modified to instrument the application at deterministic points corresponding to the loops’ entry and exit points.) In addition, we use an application-profiling methodology to determine the heartbeats’ timeout values on a per-loop basis rather than the usual approach of using a fixed timeout for the entire application.

An illegal loop can occur when an error corrupts the target address of the terminating branch in a basic block and, as a result, control is diverted to an instruction within the same block. While the application is executing sequential code, the SCHD module detects such a loop by maintaining a log of recently committed instructions and searching for a repetition of an instruction sequence.
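The log-and-search idea behind SCHD can be illustrated in software. The following is only a minimal sketch, not the hardware design: the class name, the log capacity of 16 entries, and the rule of flagging any immediately repeated tail of the log are all assumptions made for illustration.

```python
from collections import deque


class SequentialCodeHangDetector:
    """Toy model of a commit-address log that is searched for repetition."""

    def __init__(self, log_size=16):
        # log_size is an assumed capacity, not a figure from the article.
        self.log = deque(maxlen=log_size)

    def commit(self, address):
        """Log one committed instruction address.

        Returns True when the newest entries exactly repeat the entries
        just before them, i.e. the same sequence committed twice in a row.
        """
        self.log.append(address)
        entries = list(self.log)
        n = len(entries)
        for length in range(1, n // 2 + 1):
            if entries[n - length:] == entries[n - 2 * length:n - length]:
                return True
        return False


def first_repetition(trace, log_size=16):
    """Return the index of the first commit flagged as a repetition, or None."""
    detector = SequentialCodeHangDetector(log_size)
    for index, address in enumerate(trace):
        if detector.commit(address):
            return index
    return None
```

In this sketch a straight-line trace is never flagged, while a trace in which a corrupted branch keeps returning to an earlier address is flagged once the repeated sequence has committed twice. A loop longer than half the log capacity would go unnoticed, which is one reason the log size matters in any real design.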

Figure 3. Crash and hang detection hardware (ILHD: infinite-loop hang detector; SCHD: sequential-code hang detector; ICH: instruction count heartbeat module).

Figure 3 shows how the modules implementing the ILHD, SCHD, and ICH techniques interface with the main pipeline, as well as the internal hardware implementation details of the ICH and ILHD modules. The ICH module requires only a set of timers, one for each process (including the OS) it is monitoring. Similarly, the ILHD module requires a timer to hold the estimated execution time of the loop it is monitoring. Monitoring multiple (nested) loops simultaneously requires a set of timers controlled by a finite-state machine. The FSM’s state denotes the nesting depth of the current loop being checked for hangs.
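The nested-loop arrangement described above, with one timer per nesting level and an FSM whose state is the current depth, can be sketched as a small software model. This is a hedged illustration only: the class and method names are hypothetical, the maximum depth of 4 is arbitrary, and the choice to keep every enclosing loop's timer counting while an inner loop runs is an assumption, not a detail taken from the article.

```python
class NestedLoopHangDetector:
    """Toy model: one countdown timer per active loop nesting level."""

    def __init__(self, max_depth=4):
        # max_depth stands in for the number of hardware timers (assumed).
        self.max_depth = max_depth
        self.timers = []      # remaining cycles; index = nesting level - 1
        self.error = False

    @property
    def depth(self):
        """FSM state: nesting depth of the loop currently being checked."""
        return len(self.timers)

    def loop_entry(self, timeout_cycles):
        """Model the check instruction at loop entry: load and start a timer."""
        if self.depth == self.max_depth:
            raise RuntimeError("loop nesting exceeds the available timers")
        self.timers.append(timeout_cycles)

    def loop_exit(self):
        """Model the check instruction at loop exit: disable the inner timer."""
        if self.timers:
            self.timers.pop()

    def tick(self, cycles=1):
        """Advance time; flag an error if any active timer has expired."""
        self.timers = [remaining - cycles for remaining in self.timers]
        if any(remaining <= 0 for remaining in self.timers):
            self.error = True
        return self.error
```

The per-loop timeout passed to `loop_entry` corresponds to the profiled estimate of the loop's execution time; a loop that exits before its timer expires leaves the detector in a non-error state.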

Because the three techniques either automatically instrument the application at loop entry and exit points or monitor the number of instructions committed and their order, they don’t require knowledge of the application source code. This contrasts with present techniques for infinite-loop hang detection, which require a detailed knowledge of the source code to instrument the application.
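As an illustration of monitoring committed-instruction counts without touching the application, the ICH check (did each monitored process commit at least a fixed number of instructions since the last check?) might be modeled as follows. This is a sketch under stated assumptions: the class name, the per-process dictionaries, and the threshold value are hypothetical, and the model assumes every monitored process is scheduled during each checking interval.

```python
class InstructionCountHeartbeat:
    """Toy model of a per-process committed-instruction heartbeat."""

    def __init__(self, min_instructions):
        # Minimum instructions expected per checking interval (assumed value).
        self.min_instructions = min_instructions
        self.current = {}     # running commit count per process
        self.previous = {}    # snapshot taken at the last check

    def register(self, pid):
        """Start monitoring a process (or the OS) under the given identifier."""
        self.current[pid] = 0
        self.previous[pid] = 0

    def commit(self, pid, count=1):
        """Record committed instructions attributed to a process."""
        self.current[pid] += count

    def check(self):
        """Run at a fixed interval; return processes that made too little progress."""
        stalled = []
        for pid, count in self.current.items():
            if count - self.previous[pid] < self.min_instructions:
                stalled.append(pid)
            self.previous[pid] = count
        return sorted(stalled)
```

A process that has crashed or hung without executing instructions stops advancing its counter and appears in the list returned by the next `check`.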

We implemented the most complex of the three modules, the SCHD module, in hardware on an FPGA device. The module incurred area overhead of about 5 percent with respect to a DLX double-issue superscalar processor. Our evaluation of the hang detection techniques found a 1.6 percent performance overhead and a 6 percent memory overhead for instrumentation. The crash detection technique incurred no performance overhead and had a latency of a few instructions.

Security extensions. We can easily extend the RSE framework’s open (or configurable) architecture to embed hardware modules for providing security. Two examples are the pointer taintedness tracking and detection (PTTD) module and the memory layout randomization (MLR) module. As a basis for detecting memory corruption attacks, we introduce the notion of pointer taintedness. We say a pointer is tainted if the pointer value comes directly or indirectly from user input. A tainted pointer allows the user to specify the target memory address to read, write, or transfer control to, and this can lead to system security compromise. The PTTD module tracks data taintedness and detects an attack if tainted data is dereferenced.15
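The notion of pointer taintedness can be made concrete with a small model in which every value carries a taint bit. This is a simplified sketch, not the PTTD design: the `Word` type, the helper names, and the single propagation rule (the result of an addition is tainted if either operand is) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


class TaintedDereference(Exception):
    """Raised when a value derived from user input is used as an address."""


@dataclass(frozen=True)
class Word:
    value: int
    tainted: bool = False

    def __add__(self, other):
        # Assumed propagation rule: taint spreads through arithmetic.
        return Word(self.value + other.value, self.tainted or other.tainted)


def from_user_input(value):
    """Mark data arriving from outside the program as tainted."""
    return Word(value, tainted=True)


def load(memory, address):
    """Dereference an address, refusing to follow a tainted pointer."""
    if address.tainted:
        raise TaintedDereference(hex(address.value))
    return memory[address.value]
```

Here a pointer computed only from program constants can be loaded through, while one that mixes in user-supplied data raises `TaintedDereference` at the point of use, which mirrors detecting the attack at dereference time.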

The MLR module transparently randomizes the memory layout of an application process to defeat a large class of security attacks, including buffer overflow and format string attacks. These attacks, which amount to over 60 percent of all attacks reported by the CERT Internet security center, are based on an attacker’s knowledge of a target application’s memory layout. Randomizing the process’s memory layout foils the attacker’s assumptions in engineering an attack.14
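The effect of layout randomization can be sketched as choosing a fresh, page-aligned offset for each memory region when a process is loaded. This is an illustrative model only: the function name, the region names, the nominal base addresses, the 4-KB page size, and the amount of randomness are all assumptions, not parameters of the MLR module.

```python
import secrets

PAGE_SIZE = 4096  # assumed page size


def randomize_layout(nominal_bases, entropy_pages=1 << 16):
    """Shift each region's nominal base by a random whole number of pages.

    nominal_bases maps a region name to its default base address. The bases
    are assumed to be far enough apart that shifted regions cannot overlap.
    """
    layout = {}
    for name, base in nominal_bases.items():
        offset = secrets.randbelow(entropy_pages) * PAGE_SIZE
        layout[name] = base + offset
    return layout


# Hypothetical default layout, used only for this example.
DEFAULT_BASES = {
    "heap": 0x1000_0000,
    "libraries": 0x4000_0000,
    "stack": 0x7000_0000,
}
```

Calling `randomize_layout(DEFAULT_BASES)` at each process start yields a different set of base addresses, so an exploit that hard-codes an address observed in one run is unlikely to find the same data or code there in the next.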

With hardware becoming ever cheaper, there is a trend toward using computing resources as a utility to run long-running, critical applications. The reliability and security demands differ from application to application. For such utility computing to be successful, the hardware must adapt to the needs of the application. The concept of application-aware checking in hardware is well suited to this model of computing. There is ongoing work on compiler-based static and dynamic analysis of applications to automatically extract the reliability and security characteristics of the application that can be translated into hardware modules to be embedded in the RSE framework.

Acknowledgments

This work was supported in part by the Gigascale Systems Research Center, NSF grants ACI-0121658 ITR/AP and CNS-0406351 ACI Next-Generation Software Program, and MURI grant N00014-01-1-0576. We thank Fran Baker for her careful reading of our manuscript.

References

1. R. Baumann, “The Impact of Technology Scaling on Soft Error Rate Performance and Limits to the Efficacy of Error Correction,” Proc. Int’l Electron Devices Meeting (IEDM 02), IEEE Press, 2002, pp. 329-332.

2. P. Hazucha et al., “Neutron Soft Error Rate Measurements in a 90-nm CMOS Process and Scaling Trends in SRAM from 0.25-µm to 90-nm Generation,” Proc. Int’l Electron Devices Meeting (IEDM 03), IEEE Press, 2003, pp. 21.5.1-21.5.4.

3. G. Choi, R. Iyer, and V. Carreno, “FOCUS: An Experimental Environment for Validation of Fault Sensitivity Analysis,” IEEE Trans. Computers, vol. 41, no. 12, Dec. 1992, pp. 1,515-1,526.

4. G. Saggese et al., “Microprocessor Sensitivity to Failures: Control vs Execution and Combinational vs Sequential Logic,” Proc. Int’l Conf. Dependable Systems and Networks (DSN 05), IEEE Press, 2005, pp. 760-769.

5. S. Mitra et al., “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, Feb. 2005, pp. 43-52.

6. S. Mitra et al., “Logic Soft Errors in Sub-65 nm Technologies: Design and CAD Challenges,” Proc. 42nd Ann. Conf. Design Automation (DAC 05), ACM Press, 2005, pp. 2-4.

7. M. Nicolaidis, “Time Redundancy Based Soft-Error Tolerance to Rescue Nanometer Technologies,” Proc. 17th VLSI Test. Symp. (VTS 99), IEEE Press, 1999, pp. 86-94.

8. D. Ernst et al., “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation,” Proc. 36th Ann. Int’l Symp. Microarchitecture (Micro-36), IEEE Press, 2003, pp. 7-18.

9. “IBM Chipkill Memory,” http://www-4.ibm.com/software/is/mp/windows2000/insider/library/pdf/chipkill_white_paper.pdf.

10. T.N. Vijaykumar, I. Pomeranz, and K. Cheng, “Transient-Fault Recovery Using Simultaneous Multithreading,” Proc. 29th Ann. Int’l Symp. Computer Architecture (ISCA 02), IEEE Press, 2002, pp. 87-98.

11. C. Weaver and T. Austin, “A Fault Tolerant Approach to Microprocessor Design,” Proc. Int’l Conf. Dependable Systems and Networks (DSN 01), IEEE Press, 2001, pp. 411-420.

12. N. Oh, P.P. Shirvani, and E.J. McCluskey, “Error Detection by Duplicated Instructions in Super-Scalar Processors,” IEEE Trans. Reliability, vol. 51, no. 1, Mar. 2002, pp. 63-75.

13. J. Ray, J.C. Hoe, and B. Falsafi, “Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery,” Proc. 34th Ann. Int’l Symp. Microarchitecture (Micro-34), IEEE Press, 2001, pp. 214-224.

14. N. Nakka et al., “An Architectural Framework for Providing Reliability and Security Support,” Proc. Int’l Conf. Dependable Systems and Networks (DSN 04), IEEE Press, 2004, pp. 544-553.

15. S. Chen et al., “Defeating Memory Corruption Attacks via Pointer Taintedness Detection,” Proc. Int’l Conf. Dependable Systems and Networks (DSN 05), IEEE Press, 2005, pp. 378-387.

Ravishankar K. Iyer is the George and Ann Fisher Distinguished Professor of Engineering and Director of the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign. His research interests include reliable and secure computing. Iyer has a PhD in electrical engineering from the University of Queensland, Australia. He is a Fellow of the IEEE and the ACM, and an Associate Fellow of the American Institute of Aeronautics and Astronautics.

Nithin M. Nakka is a doctoral student in the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign. His research interests include hardware-implemented techniques for reliability, security, and recovery. Nakka has an MS in electrical and computer engineering from the University of Illinois at Urbana-Champaign.

Zbigniew T. Kalbarczyk is a research professor in the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign. His research interests include automated design, implementation, and evaluation of dependable and secure computing systems. Kalbarczyk has a PhD in computer science from the Bulgarian Academy of Sciences. He is a member of the IEEE and the IEEE Computer Society.

Subhasish Mitra is a principal engineer at Intel and a consulting assistant professor in the electrical engineering department of Stanford University. His research interests include robust system design, VLSI design and test, VLSI CAD, and computer architecture. Mitra has a PhD in electrical engineering from Stanford University.

Direct questions and comments about this article to Nithin M. Nakka, University of Illinois at Urbana-Champaign, Coordinated Science Laboratory, 1308 West Main St., Urbana, IL 61801; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.

