Hardware-Integrated Approaches to Failure Advanced Warning Ralph H. Castain, Ph.D. Los Alamos...


Hardware-Integrated Approaches to Failure Advanced Warning

Ralph H. Castain, Ph.D.
Los Alamos National Laboratory

Outline

• Little history and perspective
  • What do we mean by “resilient”?
  • Traditional vs embedded approach
  • DARPA “built-in-test” program

• Cisco resilient router project
  • Brief overview of project
  • Our approach and partnership with OpenMPI

• Open Cluster Manager (OpenCM)

Motivation

• Head of new business unit for integrated diagnostics and control

• World’s largest customer
  • If system fails, will search out root cause
  • If your system, you pay cost of lost batch!
  • Rough cost/failure: $10M
  • Rough value of system: $200k

Resiliency

• Fault
  • Events that hinder the correct operation of a process
  • May not actually be a “failure” of a component, but can cause system-level failure or performance degradation below specified level
  • Effect may be immediate or some time in the future
  • Usually rare; may not have many data examples

• Fault prediction
  • Estimate probability of incipient fault within some time period in the future

• Fault tolerance (reactive, static)
  • Ability to recover from a fault

• Robustness (metric)
  • How much can the system absorb without catastrophic consequences

• Resilience (proactive, dynamic)
  • Dynamically configure system to minimize impact of potential faults

Traditional Approach to Faults: The “Bathtub”

[Figure: bathtub curve with regions labeled “Infant Mortality”, “Floor” Region, and “Defined Lifetime?”]

What’s Wrong With That?

• Infant mortality
  • Resolved by extensive burn-in: costly

• Where to define “lifetime”?
  • A: Units decommissioned with considerable unused life
  • B: High probability of failures in advance
  • MTBF: ~50% of units fail before reaching it

• Bathtub floor does not sit at “zero”
  • Still significant probability of failure

• Can’t reliably estimate system lifetime due to multi-component degradation
  • Component-component interactions not reflected in individual component lifetime statistics

• Failures can be costly
  • Operational impact
  • Replacement costs
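
A bathtub-shaped failure rate is commonly modeled as the sum of a decreasing term (infant mortality), a constant term (the floor), and an increasing term (wear-out). The Python sketch below assumes Weibull-form hazards for the first and last terms. The function names and all shape, scale, and rate values are illustrative assumptions, not figures from the talk. It shows why a floor above zero still leaves a real chance of failure in any interval.

```python
import math

def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t, floor_rate=0.002):
    """Sum of three hazards; all parameter values are assumed for illustration."""
    infant = weibull_hazard(t, shape=0.5, scale=200.0)   # shape < 1: decreasing
    wearout = weibull_hazard(t, shape=5.0, scale=100.0)  # shape > 1: increasing
    return infant + floor_rate + wearout

def failure_probability(t_start, t_end, steps=1000):
    """P(fail in (t_start, t_end] | alive at t_start) = 1 - exp(-integral of hazard)."""
    dt = (t_end - t_start) / steps
    # Midpoint rule for the integral of the hazard over the interval.
    cumulative = sum(bathtub_hazard(t_start + (i + 0.5) * dt) for i in range(steps)) * dt
    return 1.0 - math.exp(-cumulative)
```

With these assumed parameters the hazard is high near t = 1, lowest in the middle of life, and high again past t = 100, yet `failure_probability(20, 40)` stays above the value implied by the floor rate alone.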

DARPA BIT Program

• Multi-year program in 1990s
  • Focus on electronic, mechanical failures
  • Create a “resilient war fighting” capability
  • Enable better maintenance support of increasingly complex systems

• Objectives
  • Push-button “good box/bad box” readout
    • Eliminate diagnostic “carts”, “toolboxes”,…
  • Pre-emptive switch from failing systems
  • “Okay for mission” test
    • Reduce probability of failures during mission

Results Encouraging

• Vibration signatures
  • Impending bearing failures
    • Fans, axles, transmissions

• Thermal patterns
  • Mechanical failures
    • Existence of hot spots
    • Patterns revealed root causes, better prediction
  • Electronic failures
    • Patterns across boards, surface of chips

• Electrical frequency composition
  • Breakdowns in power transistors, other devices
  • IC internal wire connection degradation

General Conclusions

• Exploit access to internals
  • Investigate optimal location, number of sensors
  • Embed intelligence, communications capability

• Integrate data from all available sources
  • Engineering design tests
  • Reliability life tests
  • Production qualification tests

• Utilize learning algorithms to improve performance
  • Both embedded, post process
  • Seed with expert knowledge

Objective

[Figure: probability of failure versus time interval (intervals 1 to 10)]

Motivation

• Head of new business unit for integrated diagnostics and control

• World’s largest customer
  • If system fails, will search out root cause
  • If your system, you pay cost of lost batch!
  • Rough cost/failure: $10M
  • Rough value of system: $200k

Questions

• Can we develop technologies that would…
  • Warn of impending failure
    • Provide time to reconfigure, respond
    • Allow switch to backup systems for continuous operation
    • Provide an opportunity to pace ourselves
  • “Stretch” life of system
  • With minimal overhead
    • Cannot significantly impact performance

• How would we use them?

Direct Detection

[Figure labels: Spectral Filter; Voltage; Current; FDDP Analyzer; Good Box / Bad Box; Problem Diagnosis; Fault Prediction]

Integrate All Factors

Results (generalized)

• Prediction
  • Better than 97% of faults predicted within specified response time (hours)
  • Less than 5% “bad” prediction rate

• Diagnosis
  • Better than 80% correct localization

• Detection (good/bad box)
  • Better than 99% correct identification
  • Less than 5% false positive rate
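
The detection figures above are rates computed over a table of outcomes. The Python sketch below shows one way to compute them, assuming “correct identification” means overall accuracy and “false positive” means a good box flagged as bad. Both readings are assumptions, since the slides do not define the terms. The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Counts:
    true_pos: int   # bad box reported as bad
    false_pos: int  # good box reported as bad
    true_neg: int   # good box reported as good
    false_neg: int  # bad box reported as good

def correct_identification_rate(c: Counts) -> float:
    """Fraction of all boxes classified correctly (accuracy)."""
    total = c.true_pos + c.false_pos + c.true_neg + c.false_neg
    return (c.true_pos + c.true_neg) / total

def false_positive_rate(c: Counts) -> float:
    """Fraction of good boxes that were wrongly reported as bad."""
    return c.false_pos / (c.false_pos + c.true_neg)
```

For example, made-up counts of 48 true positives, 4 false positives, 946 true negatives, and 2 false negatives give an accuracy of 0.994 and a false positive rate of 4/950.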

Outline

1) Internet traffic and interconnect requirements are growing faster than the capabilities available from silicon and software.

2) One approach is to build a larger more Distributed System.

3) The result is increased requirements on System Software in terms of:

a) High Availability across a multi-component system

b) Coherent view of intra-component messaging

c) Fast Convergence amongst components during change

d) Distributed Failover and effective sharing of load.

e) SW/HW maintenance w/o service impact

Problem Statements

[Figure labels: System BW; MHz-gate/mW; Mbps/W; System Power; “Shortfall!”]

Shortfall is overcome by architectural innovation and trading off: performance, functionality, programmability, physical size/density

Very hard to sustain long-term

Technology is falling behind Demand Curve

Problem Drivers

Product example

• Largest Routing System available today

Each Linecard Chassis: 1.28Tbps, 13.6kW

Switch Fabric Chassis: 8kW

Hardware Details

Product example

• Maximum HW configuration: 92Tbps
  • Switching capacity across millions of interfaces
  • 48 x LC chassis + 8 x Fabric chassis
  • => System messaging across all control CPUs to manage switch fabric and interface control

Hardware Details

System Software Requirements

1) Turn on once with remote access thereafter

2) Non-Stop == max 20 events/day lasting < 200ms each

3) Hitless SW Upgrades and Downgrades

4) Upgrade/downgrade SW components across delta versions

5) Field Patchable

6) Beta Test New Features in situ

7) Extensive Trace Facilities: on Routes, Tunnels, Subscribers,…

8) Configuration

9) Clear APIs; minimize application awareness

10) Extensive remote capabilities for fault management, software maintenance and software installations
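
Requirement 2 above puts an upper bound on daily downtime. The Python sketch below works through that arithmetic, assuming each “event” is a full service interruption and every event lasts the maximum 200 ms. That worst-case reading is an assumption, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def worst_case_availability(max_events_per_day, max_event_seconds):
    """Availability if every allowed event occurs and lasts its full duration."""
    downtime = max_events_per_day * max_event_seconds  # seconds of outage per day
    return 1.0 - downtime / SECONDS_PER_DAY

availability = worst_case_availability(20, 0.200)
print(f"{availability:.5%}")
```

Twenty events of 200 ms add up to 4 seconds per day out of 86,400, so under these assumptions availability stays above 99.995%.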

Software Details

Our Approach: Use OpenRTE

• Setup for new frameworks
  • Sensor - monitor hardware, software
  • FDDP - use sensor inputs to compute sliding window or probabilities

• Contribute back to OpenMPI
  • Proprietary modules as binary plug-ins

• Write new cluster manager
  • Exploit new capabilities
  • Create as non-centralized application
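
The sensor and FDDP frameworks above pair periodic readings with a sliding-window computation. The Python sketch below shows one way such a window could turn readings into an advance warning: keep the last N samples, fit a trend, and extrapolate to a failure threshold. A plain least-squares line stands in for the B-spline trend fit named elsewhere in the talk. The class name, window size, and threshold are hypothetical, and none of this is the ORTE API.

```python
from collections import deque

class SlidingTrend:
    """Least-squares trend over the most recent `window` sensor samples."""

    def __init__(self, window, threshold):
        self.samples = deque(maxlen=window)  # (time, value); oldest dropped automatically
        self.threshold = threshold

    def add(self, t, value):
        self.samples.append((t, value))

    def slope_intercept(self):
        n = len(self.samples)
        if n < 2:
            return None
        mean_t = sum(t for t, _ in self.samples) / n
        mean_v = sum(v for _, v in self.samples) / n
        var_t = sum((t - mean_t) ** 2 for t, _ in self.samples)
        if var_t == 0:
            return None  # all samples share one timestamp; no trend defined
        cov = sum((t - mean_t) * (v - mean_v) for t, v in self.samples)
        slope = cov / var_t
        return slope, mean_v - slope * mean_t

    def time_to_threshold(self):
        """Estimated time from the last sample until the fitted line reaches
        the threshold, or None if the trend is flat, falling, or undefined."""
        fit = self.slope_intercept()
        if fit is None:
            return None
        slope, intercept = fit
        if slope <= 0:
            return None
        last_t = self.samples[-1][0]
        crossing = (self.threshold - intercept) / slope
        return max(0.0, crossing - last_t)
```

Feeding five readings that rise by 2 units per time step from 60 toward a threshold of 80 gives a fitted crossing 6 time steps after the last reading, which is the kind of lead time a cluster manager could act on.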

ORTE Extensions

• Software sensors
  • Memory footprint, CPU utilization (upper and lower), output file size

• Hardware sensors
  • Temperature, vibration

• FDDP
  • B-spline trend fit

• Resilient mapper
  • Fault groups
    • Nodes with common failure mode
    • Node can belong to multiple fault groups
  • Map replicas across fault groups
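
Mapping replicas across fault groups can be sketched as a greedy placement: each new replica goes to the node that shares the fewest fault groups with nodes already chosen. The Python sketch below is an illustration under that assumption, not the ORTE resilient mapper itself. The function name, node names, and group names are hypothetical.

```python
def map_replicas(node_groups, num_replicas):
    """Pick nodes for replicas so they share as few fault groups as possible.

    node_groups maps a node name to the set of fault groups it belongs to
    (a node may belong to several). Returns the chosen node names in order.
    """
    if num_replicas > len(node_groups):
        raise ValueError("not enough nodes for the requested replicas")
    chosen = []
    used_groups = set()
    candidates = dict(node_groups)
    for _ in range(num_replicas):
        # Fewest overlaps with groups already in use; node name breaks ties
        # so the result is deterministic.
        best = min(candidates, key=lambda n: (len(candidates[n] & used_groups), n))
        chosen.append(best)
        used_groups |= candidates.pop(best)
    return chosen

nodes = {
    "n1": {"rack1", "psuA"},
    "n2": {"rack1", "psuB"},
    "n3": {"rack2", "psuA"},
    "n4": {"rack2", "psuB"},
}
placement = map_replicas(nodes, 2)
```

With four nodes split across two racks and two power supplies, requesting two replicas picks n1 and n4, which share no fault group, so no single rack or power supply failure takes out both.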

Cluster Manager

• Orted auto-starts upon node power-up
  • Auto-detect and connect to CM

• CM launches specified number of replicas of each application
  • Resilient mapper => minimize single point failures

• Applications auto-wireup
  • Plug-and-play inspired approach
  • Application decides which input to declare “leader”

Application Failure

• Orted detects (or predicts) failure and notifies CM

• CM utilizes resilient mapper to determine location of replacement
  • Future extension: probability of failure modes to help drive fault group selection
  • New replica is launched, does auto-wireup

• Connected applications
  • Loss of communication from “leader”
  • Independently select new “leader”
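
For connected applications to pick a new “leader” independently and still agree, each one can apply the same deterministic rule to the replicas it can still hear. The talk does not say which rule is used. The Python sketch below assumes the lowest surviving replica id wins and treats a replica as lost when nothing has arrived from it within a timeout. The function names and the timeout are hypothetical.

```python
def alive_replicas(last_heard, now, timeout):
    """Replica ids whose most recent message arrived within `timeout` seconds.

    last_heard maps replica id -> timestamp of the last message received.
    """
    return {rid for rid, seen in last_heard.items() if now - seen <= timeout}

def select_leader(last_heard, now, timeout):
    """Lowest id among replicas still heard from, or None if none remain."""
    alive = alive_replicas(last_heard, now, timeout)
    return min(alive) if alive else None
```

If replica 0 was last heard at time 10.0 and replicas 1 and 2 at 14.5 and 14.8, then at time 15.0 with a 1-second timeout every application that holds this view selects replica 1. Agreement depends on the applications sharing the same view of which replicas are alive.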

Outline

OpenCM

• Transition Cisco work to open source

• Broaden mission
  • Extend to HPC, other embedded operations
  • Manage any collection of nodes
  • Resilient operation with hooks
    • MPI
    • Other application layers

• Released under the OpenMPI license
  • BSD-like, open use

http://www.open-mpi.org/

Concluding Remarks

