Date post: 22-Dec-2015
Page 1:

ExtraVirt: Detecting and recovering from transient processor faults

Dominic Lucchetti, Steve Reinhardt, Peter Chen

University of Michigan

Page 2:

Flips Happen

Similar die area + decreasing transition energy = increasing risk of transient failure

Page 3:

Multi-Processors & Virtual Machine

Multi-Processor
    Ensure error independence
    Enable fault detection
    Efficient resource sharing

Virtual Machine
    No changes to OS or applications
    VM replay
        Synchronize replicas
        Recover correct state

[Figure: block diagram with boxes labeled Replica 1, Replica 2, Hypervisor, Device Drivers, and Replication Management Layer (RML).]
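
The fault-detection idea on this slide (run two replicas and compare them) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the slides do not say how ExtraVirt represents or compares replica state, so `state_digest`, `replicas_agree`, and `flip_bit` are hypothetical helpers, and a replica is modeled as a dict of integer registers plus a byte array of memory.

```python
import hashlib


def state_digest(registers, memory):
    """Hash a replica's state (hypothetical model: register dict + memory bytes)."""
    h = hashlib.sha256()
    for name in sorted(registers):          # fixed order so equal states hash equally
        h.update(name.encode())
        h.update(registers[name].to_bytes(8, "little"))
    h.update(bytes(memory))
    return h.hexdigest()


def replicas_agree(replica_a, replica_b):
    """Two replicas that executed the same inputs should have identical state."""
    return state_digest(*replica_a) == state_digest(*replica_b)


def flip_bit(value, bit):
    """Model a transient fault as a single flipped bit."""
    return value ^ (1 << bit)


good = ({"r0": 7, "r1": 42}, bytearray(b"\x00\x01\x02"))
same = ({"r0": 7, "r1": 42}, bytearray(b"\x00\x01\x02"))
faulty = ({"r0": 7, "r1": flip_bit(42, 3)}, bytearray(b"\x00\x01\x02"))

print(replicas_agree(good, same))    # identical state
print(replicas_agree(good, faulty))  # one flipped bit changes the digest
```

Comparing digests rather than full state keeps the comparison cheap, but it only detects a deviation. Deciding which replica is correct needs extra information, such as a further execution.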

Page 4:

Example: Memory

Copy on write
    Reduces overhead
    Protects checkpoints

Merge on checkpoint
    Verify correctness
    Re-execute on deviation

Memory Fault Protection
    ECC against RAM faults
    MMU against CPU faults

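[Figure: memory pages for the Memory Checkpoint, Replica 1, Replica 2, and Replica 3. One column holds pages A, B, C, D, E; one holds A, B, C, X, E; one holds A, B, C, E. A "Verify" step precedes Replica 3, which holds A, B, C, D, E.]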
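
The copy-on-write, merge-on-checkpoint, and re-execute-on-deviation steps can be sketched as follows. This is a toy model under stated assumptions, not ExtraVirt's implementation: `CowMemory`, `run_epoch`, and `checkpoint_epoch` are hypothetical names, pages are plain strings keyed by letter (loosely following the page labels in the figure), and accepting the result that two executions agree on is one possible recovery policy that the slides do not spell out.

```python
class CowMemory:
    """Replica view of memory: reads fall through to the shared checkpoint,
    writes land in a private overlay (copy on write)."""

    def __init__(self, checkpoint):
        self.checkpoint = checkpoint  # shared; replicas never mutate it
        self.dirty = {}               # page name -> private copy

    def read(self, page):
        return self.dirty.get(page, self.checkpoint[page])

    def write(self, page, data):
        self.dirty[page] = data


def run_epoch(checkpoint, workload, fault=None):
    """Execute one replica from the checkpoint and return its dirty pages."""
    mem = CowMemory(checkpoint)
    workload(mem)
    if fault is not None:
        fault(mem)  # inject a simulated transient fault
    return mem.dirty


def checkpoint_epoch(checkpoint, workload, fault=None):
    """Run two replicas, verify their dirty pages, and merge on agreement.
    On deviation, re-execute from the untouched checkpoint (assumed policy:
    accept whichever result two executions agree on)."""
    first = run_epoch(checkpoint, workload, fault)
    second = run_epoch(checkpoint, workload)
    reexecuted = False
    agreed = first
    if first != second:
        third = run_epoch(checkpoint, workload)
        reexecuted = True
        if third == second:
            agreed = second
        elif third == first:
            agreed = first
        else:
            raise RuntimeError("no two executions agree")
    merged = dict(checkpoint)  # build the next checkpoint without mutating the old one
    merged.update(agreed)
    return merged, reexecuted


def workload(mem):
    mem.write("D", mem.read("C").upper())  # creates page D from page C


def corrupt_d(mem):
    mem.write("D", "X")  # simulated fault: page D holds a wrong value


base = {"A": "a", "B": "b", "C": "c", "E": "e"}
clean, retried_clean = checkpoint_epoch(base, workload)
recovered, retried = checkpoint_epoch(base, workload, fault=corrupt_d)
```

Because replicas write only to their private overlays, the checkpoint stays valid after a fault, which is what makes re-execution possible. Only the dirty pages need to be compared and merged.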

Page 5:

Status

Present
    VM Replay
    Beginnings of Replication Management Layer (RML)
    Still much to do…

Future
    Replicate the un-replicated
    Handle faults in device drivers
    Expanded fault model

[Figure: block diagram with boxes labeled Replica 1, Replica 2, Hypervisor/RML, and Device Drivers.]

Page 6:

Questions?

