
Microreboot

References

1. George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, Armando Fox, “Microreboot – A Technique for Cheap Recovery”, Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), pp. 31–44.

The Problem

• Software has bugs

– Memory leaks, race conditions, environment-dependent bugs

– Many bugs that appear in production have no fix at the time of failure

– In enterprise-scale systems, application-level failures are more frequent

• Modern operating systems are comparatively more reliable

– Desktop operating systems continue to have problems

• Testing eliminates some of the bugs, but not all of them

[1]

Recovery - Urgency

• Enterprise failures focus operators on recovering from the failure and restoring operations

– Diagnosis is for later; there is no time for real-time diagnosis

– Studies show that rebooting is often adequate even if the cause of the failure is unknown

• Server clusters increase reliability through redundancy to withstand failure

– Isolate the failed node

– Reboot the failed node

– Reintegrate the recovered node into the cluster

[1]

Recovery using reboots

• There is high confidence that a reboot will reclaim stale or leaked resources

• Does not require correct functioning of the rebooted system

• Easy to implement and automate

• Returns the system to its “best” (most thoroughly) tested state

[1]

Recovery using reboots - 2

• Unexpected reboots can result in data loss and unpredictable recovery times

– Data recovery and process recovery are not isolated

– Example: write-back buffer caches

• Data destined for persistent storage is kept in volatile memory

• An unexpected reboot will lose the buffer contents, i.e. the files will not be updated

[1]

Microreboot

• Individual rebooting of fine-grain application components

• Potentially the same results as a system reboot, but much faster and with less lost work

• Safe microreboots require

– Well-isolated, stateless components

– Application state saved in specialized state stores

– Consequence: isolation of data recovery from process recovery

• Rejuvenate the system without shutting it down

[1]
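The requirement above (stateless components plus external state stores) can be illustrated with a minimal Python sketch. All names here (`StateStore`, `CartComponent`, `microreboot`) are hypothetical and do not come from [1]; the point is only that throwing away a component instance loses nothing that matters when the state lives elsewhere.

```python
# Illustrative sketch only: StateStore, CartComponent and microreboot are
# hypothetical names, not part of the system described in [1].

class StateStore:
    """Stands in for a dedicated state store that outlives component restarts."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class CartComponent:
    """A component that keeps no application state of its own."""

    def __init__(self, store):
        self.store = store
        self.scratch = []  # volatile working memory; may leak or get corrupted

    def add_item(self, session_id, item):
        items = list(self.store.get(session_id, []))
        items.append(item)
        self.store.put(session_id, items)  # durable state goes to the store
        self.scratch.append(item)          # simulates buildup inside the component

    def items(self, session_id):
        return list(self.store.get(session_id, []))


def microreboot(component):
    """Discard the old instance and create a fresh one bound to the same store."""
    return type(component)(component.store)


store = StateStore()
cart = CartComponent(store)
cart.add_item("s1", "book")
cart = microreboot(cart)      # volatile scratch data is dropped here
cart.add_item("s1", "pen")
```

After the microreboot the session still contains both items, while the component's volatile `scratch` list starts from empty again.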

Microrebooting: High Availability

• Try microrebooting first

– Even if false positives are expected

– Even if the failure is not expected to be fixed by a microreboot

• Even if a system reboot turns out to be required, the microreboot adds only a small amount of time to the process

• In server clusters, microrebooting should be tried before failover

– Avoids loading the non-failed servers

[1]

Microrebootable Internet Services Software

• Many relatively short tasks

– Work lost because of microrebooting is small

– Few requests lost per day

• Design goals

– Fast and correct component recovery

– Strongly localized recovery

– Fast and correct reintegration of recovered components

[1]

Microrebootable Internet Services Software -2

• Fine-grain components

– Program logic and start-up time

– Software tools can help build such systems

• State segregation

– Consistent state across microrebooting

– Keep state info in state stores outside the application

• Transaction databases

• Session state managers

– Isolate data recovery from application recovery

[1]

Session Persistence: Sticky Sessions

[Figure: users 1 to N connect through a load balancer (traffic distribution, session persistence, SSL termination) to servers 1 to N in a server cluster.]




Virtual Server Implementation: Session Replication

[Figure: virtual servers 1, 2 and 3 replicate session state through a persistent db (disk or memory resident), multicasting, or shared memory.]

• Virtual servers are short lived.

• Persistent db is easy.

• Multicasting

– Additional network traffic.

– Reduce traffic through smaller clusters.

• Shared memory is not recommended.

Storing Session State

• FastS is an in-memory store in the JBoss embedded web server

– If the JVM fails, session state is lost

• SSM maintains state on a separate machine

– Slower, but session state is retained even if the JVM fails

Microrebootable Internet Services Software -3

• Decoupling of components helps to microreboot gracefully

– Direct references, e.g. pointers, are stored outside the components

• Retryable requests

– Timeouts

• Leases

– Persistent state would have longer-term leases

– CPU execution time: hanging computations lead to non-renewal of leases, and the program is terminated by a microreboot

[1]
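The lease idea on this slide can be sketched roughly as follows. This is an assumption-laden illustration, not the mechanism from [1]: the `Lease` class, the function `components_to_microreboot`, and the 5-second duration are all hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
# Illustrative sketch only: the names and the 5-second lease are assumptions.

class Lease:
    """A time-limited grant that the holder must renew to stay alive."""

    def __init__(self, duration, now):
        self.duration = duration
        self.expires_at = now + duration

    def renew(self, now):
        self.expires_at = now + self.duration

    def expired(self, now):
        return now >= self.expires_at


def components_to_microreboot(leases, now):
    """Return the names of components whose lease was not renewed in time."""
    return sorted(name for name, lease in leases.items() if lease.expired(now))


leases = {
    "healthy": Lease(5.0, now=0.0),
    "hung": Lease(5.0, now=0.0),
}
leases["healthy"].renew(now=4.0)  # a live component keeps renewing
# "hung" is stuck in a computation and never renews its lease

expired = components_to_microreboot(leases, now=6.0)
```

At time 6.0 only the component that stopped renewing is selected, which is how a hanging computation would end up being microrebooted.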

Crash Only Design

• Programs that can be safely crashed in whole or in part and recover quickly every time

• Fine-grain components

– Fast restart and reinitialization

• State segregation

– E-commerce applications handle 3 types of state

• Presentation (http, gif, jsp, etc. – stateless)

• Session persistence (session related – lost on session end)

• Long-term data persistence (db – customer info)

• Isolation and decoupling of EJBs

– Compiler-enforced interfaces

– EJBs cannot use each other's internal variables

– Microreboot all the “connected” EJBs together

[1]

Evaluation Framework

• Test application: EBid

– eBay-style auction

• Client emulator

– Human clients are modeled with a 25-state Markov chain

• Example states: Login, BuyNow or AboutMe

• Between URL clicks, model user think time

– Exponential with a mean of 7 seconds and a maximum of 70 seconds

[1]
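A client emulator of this kind could look roughly like the sketch below. Only the three state names and the think-time parameters (mean 7 s, maximum 70 s) come from the slide; the transition probabilities are invented for illustration, the real chain has 25 states, and capping the draw at 70 s (rather than resampling) is an assumption.

```python
import random

# Hypothetical 3-state excerpt; the probabilities below are made up.
TRANSITIONS = {
    "Login": [("AboutMe", 0.6), ("BuyNow", 0.4)],
    "AboutMe": [("BuyNow", 0.5), ("Login", 0.5)],
    "BuyNow": [("AboutMe", 0.7), ("Login", 0.3)],
}
MEAN_THINK_SECONDS = 7.0
MAX_THINK_SECONDS = 70.0


def think_time(rng):
    """Draw an exponential think time and cap it at the maximum (assumed policy)."""
    return min(rng.expovariate(1.0 / MEAN_THINK_SECONDS), MAX_THINK_SECONDS)


def next_state(state, rng):
    """Pick the next page according to the transition probabilities."""
    targets, weights = zip(*TRANSITIONS[state])
    return rng.choices(targets, weights=weights, k=1)[0]


def simulate(steps, seed=0):
    """Return a list of (state, think_time_seconds) pairs for one emulated user."""
    rng = random.Random(seed)
    state = "Login"
    trace = []
    for _ in range(steps):
        trace.append((state, think_time(rng)))
        state = next_state(state, rng)
    return trace


trace = simulate(1000)
```

Each entry in `trace` is one emulated click followed by the pause before the next one.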

Evaluation Framework - 2

• Failure detection

– Network-level error (File not found, Service not available, … HTTP 4xx or 5xx)

– Text analysis: look for keywords like “error”, “failed”, “exception”

– Application-induced error

• A login is requested when the user is already logged in

– Compare the system under test with a known-good instance: much more complex

[1]
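The first three detection checks can be sketched as a single function over an HTTP response. This is a simplified illustration: the function name `detect_failure`, the returned labels, and the `"please log in"` marker text are assumptions, and the comparison against a known-good instance is left out.

```python
# Illustrative sketch only: names, labels and the login marker are assumptions.

ERROR_KEYWORDS = ("error", "failed", "exception")
LOGIN_PROMPT_MARKER = "please log in"  # hypothetical page text


def detect_failure(status, body, logged_in=False):
    """Classify a response as a failure type, or return None if it looks fine."""
    if 400 <= status <= 599:
        return "network"        # HTTP 4xx or 5xx
    text = body.lower()
    if any(word in text for word in ERROR_KEYWORDS):
        return "keyword"        # suspicious text in the returned page
    if logged_in and LOGIN_PROMPT_MARKER in text:
        return "application"    # login prompt shown to a logged-in user
    return None
```

For example, `detect_failure(503, "")` is classified as a network-level failure, while a 200 response whose body contains a stack trace is caught by the keyword check.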


• Fault diagnosis and Recovery Manager (RM) – recursive recovery policy

– RM microreboots or reboots as required

– RM monitors the failure reports

– Try the cheapest reboot first

[1]
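One way to picture a "cheapest first" recursive policy is the sketch below. The escalation levels in `RECOVERY_LADDER` are illustrative assumptions based on terms used elsewhere in these slides, `failure_persists` is a hypothetical callback standing in for the RM's monitoring, and the final "notify operator" step is also an assumption.

```python
# Illustrative sketch only: the ladder levels and callback are assumptions.

RECOVERY_LADDER = ["microreboot EJB", "restart EBid", "restart JVM"]


def recover(levels, failure_persists, tried=None):
    """Try the cheapest action first and escalate while the failure persists."""
    tried = [] if tried is None else tried
    if not levels:
        return tried + ["notify operator"]  # nothing automatic is left to try
    action = levels[0]
    tried = tried + [action]
    if not failure_persists(action):
        return tried                         # recovered at this level
    return recover(levels[1:], failure_persists, tried)


# Example: a fault that only a JVM restart clears.
actions = recover(RECOVERY_LADDER, lambda action: action != "restart JVM")
```

In the example the RM ends up performing all three actions in order, which matches the earlier point that a failed microreboot only adds a small delay before the coarser reboot.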


• Application availability is measured

– User session: login to logout

– Each session consists of multiple user actions

– Each action is a sequence of operations (HTTP requests) that ends in a commit point, which indicates successful completion

– All operations must succeed for the action to be marked successful; any failed operation marks the whole action as failed

[1]
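The all-or-nothing rule for actions can be written down in a few lines. The data layout (a session as a list of actions, each action a list of per-operation success flags) and the function names are assumptions made for this sketch.

```python
# Illustrative sketch only: the data layout and names are assumptions.

def action_succeeded(operations):
    """An action succeeds only if every one of its operations succeeded."""
    return all(operations)


def count_actions(session):
    """Return (successful_actions, failed_actions) for one user session."""
    good = sum(1 for operations in session if action_succeeded(operations))
    return good, len(session) - good


# One session with three actions; the second has a single failed operation.
session = [
    [True, True],
    [True, False, True],
    [True],
]
good, failed = count_actions(session)
```

Here the second action counts as failed even though two of its three operations succeeded.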


Evaluation Questions

• Are microreboots successful in recovering from failure?

• Are microreboots better than JVM restart?

• Are microreboots useful in clusters?

• What is the performance overhead?

[1]

What faults to induce?

• Key question but very little supporting evidence on critical software faults in production systems

• Anecdotal reports of faults

– Deadlocked threads

– Leak-induced resource exhaustion

– Bug-induced corruption of volatile metadata

– Incorrect handling of Java exceptions

• Compare to CVE, CWE. Malicious vs internal

[1]

Faults injected

• Set value to null – NullPointerException on access

• Invalid value that passes type check – UserID of size greater than Max UserID size

• Wrong value – valid for application but incorrect

• Recover using recursive recovery policy

[1]
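The three kinds of injected data faults can be mimicked on a plain record as in the sketch below. The function `inject_fault`, the record layout, and the maximum UserID length of 8 are hypothetical; in Python a null value would surface as a `TypeError` or `AttributeError` on use rather than a Java `NullPointerException`.

```python
# Illustrative sketch only: names and MAX_USER_ID_LEN are assumptions.

MAX_USER_ID_LEN = 8


def inject_fault(record, field, kind, wrong_value=None):
    """Return a corrupted copy of the record; the original is left untouched."""
    corrupted = dict(record)
    if kind == "null":
        corrupted[field] = None                          # fails on access
    elif kind == "invalid":
        corrupted[field] = "u" * (MAX_USER_ID_LEN + 1)   # right type, too long
    elif kind == "wrong":
        corrupted[field] = wrong_value                   # valid but incorrect
    else:
        raise ValueError(f"unknown fault kind: {kind}")
    return corrupted


bid = {"user_id": "alice", "amount": 10}
null_fault = inject_fault(bid, "user_id", "null")
invalid_fault = inject_fault(bid, "user_id", "invalid")
wrong_fault = inject_fault(bid, "user_id", "wrong", wrong_value="bob")
```

The "wrong" case is the hardest to detect, since the corrupted record still passes every validity check the application performs.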


Resuscitation – resume request processing, without fixing db corruptions

Recovery – resume + 100% correct db

• Microreboot successful

• Microrebooting not successful

• Additional effort – more coarse operation or manual

Microrebooting vs Full Reboot

• Availability – assessed by requests denied during downtime

• Using microreboots instead of JVM restarts reduced failures by 98%

– The width of the dip gives an estimate of failed requests; see Figure 1

• Microreboots recover faster – see Table 3

• Microreboots reduce functional disruption – see the depth of the dip in Figure 1 (more requests lost), and the analysis in Figure 2

• Microreboots reduce lost work

[1]


• Entity Group – 5 EJBs: Category, Region, User, Item and Bid

• Restarting EBid does not restart all the EJBs

• Initialization dominates the time taken

[1]

Microrebooting in Clusters

• Failover under normal load

– Sticky session servers

– Number of failed requests depends on the number of failed sessions

– More requests failed with a JVM restart

– More nodes per cluster reduce the microrebooting advantage

[1]

Microrebooting in Clusters - 2

• Behavior under changing load

– Peak may be several times the average

– In the experiment, peak = 2 × average

– Consumers are distracted if a request takes more than 8 seconds to get a response

[1]

Performance Impact

• Impact of microrebooting on steady-state, fault-free throughput and latency

• Throughput changes are small

• Latency increase is more serious

– Human-perceptible delay is 100 ms

[1]

