
Microreboot. References 1.George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, Armando...

Microreboot

References

1. George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, Armando Fox, “Microreboot – A Technique for Cheap Recovery”, Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), pp. 31–44.

The Problem

• Software has bugs
– Memory leaks, race conditions, environment-dependent bugs
– Many bugs that appear in production have no fix at the time of failure
– In enterprise-scale systems, application-level failures are more frequent
• Modern operating systems are comparatively more reliable
– Desktop operating systems continue to have problems
• Testing eliminates some of the bugs, but not all of them

[1]

Recovery - Urgency

• Enterprise failures focus operators on recovering from the failure and restoring operations
– Diagnosis is for later; there is no time for real-time diagnosis
– Studies show that rebooting is often adequate even if the cause of the failure is unknown
• Server clusters increase reliability through redundancy to withstand failure
– Isolate the failed node
– Reboot the failed node
– Reintegrate the recovered node into the cluster

[1]

Recovery using reboots

• There is high confidence that a reboot will reclaim stale or leaked resources

• Does not require correct functioning of the rebooted system

• Easy to implement and automate

• Returns the system to its “best” (most thoroughly) tested state

[1]

Recovery using reboots - 2

• Unexpected reboots can result in data loss and unpredictable recovery times
– Data recovery and process recovery are not isolated
– Example: write-back buffer caches
• Data destined for persistent storage is kept in volatile memory
• An unexpected reboot will lose the buffer contents, i.e. the files will not be updated

[1]
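The write-back example above can be sketched in a few lines. This is only an illustration: the `WriteBackCache` class, the `crash` method and the dictionary standing in for the disk are assumptions made for the sketch, not anything from the paper.

```python
class WriteBackCache:
    """Toy write-back cache: writes land in volatile memory first."""

    def __init__(self, disk):
        self.disk = disk      # dict standing in for persistent storage
        self.dirty = {}       # volatile buffer, lost on an unexpected reboot

    def write(self, name, data):
        self.dirty[name] = data

    def flush(self):
        # Push buffered writes to persistent storage.
        self.disk.update(self.dirty)
        self.dirty.clear()

    def crash(self):
        # An unexpected reboot discards whatever was still buffered.
        self.dirty.clear()


disk = {}
cache = WriteBackCache(disk)
cache.write("a.txt", "saved")
cache.flush()                 # reaches persistent storage
cache.write("b.txt", "lost")
cache.crash()                 # reboot before the next flush
print(sorted(disk))           # b.txt never reached the disk
```

Because the process and its unflushed data live in the same memory, recovering the process also decides the fate of the data, which is the coupling the slide describes.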

Microreboot

• Individual rebooting of fine-grain application components

• Potentially the same results as a system reboot, but much faster and with less lost work
• Safe microreboots require
– Well-isolated, stateless components
– Application state saved in specialized state stores
– Consequence: data recovery is isolated from process recovery
• Rejuvenate the system without shutting it down

[1]
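A minimal sketch of why state segregation makes a microreboot safe, assuming a hypothetical `CartComponent` and `StateStore` (neither name comes from the paper). The component keeps only disposable data in its own fields, so throwing the instance away loses nothing important.

```python
class StateStore:
    """Lives outside the component, so it survives a microreboot."""

    def __init__(self):
        self.data = {}


class CartComponent:
    """Stateless component: everything important goes to the store."""

    def __init__(self, store):
        self.store = store
        self.scratch = []   # volatile; stands in for leaked or stale resources

    def add_item(self, session_id, item):
        self.scratch.append(item)  # simulated leak inside the component
        self.store.data.setdefault(session_id, []).append(item)

    def items(self, session_id):
        return list(self.store.data.get(session_id, []))


def microreboot(component):
    """Discard the instance and build a fresh one on the same store."""
    return type(component)(component.store)


store = StateStore()
cart = CartComponent(store)
cart.add_item("s1", "book")
cart = microreboot(cart)      # leaked scratch data is gone
print(cart.items("s1"))       # session data is still there
```

Only the component instance is replaced; the store is never touched, which is what keeps data recovery separate from process recovery.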

Microrebooting: High Availability

• Try microrebooting first
– Even if false positives are expected
– Even if the failure is not expected to be fixed by a microreboot
• Even if a system reboot turns out to be required, the microreboot adds only a small amount of time to the process
• In server clusters, microrebooting should be tried before failover
– Avoids putting extra load on the non-failed servers

[1]

Microrebootable Internet Services Software

• Many relatively short tasks
– Work lost because of microrebooting is small
– Few requests lost per day
• Design goals
– Fast and correct component recovery
– Strongly localized recovery
– Fast and correct reintegration of recovered components

[1]

Microrebootable Internet Services Software - 2

• Fine-grain components
– Program logic and start-up time
– Software tools can help build such systems
• State segregation
– Consistent state across microreboots
– Keep state information in state stores outside the application
• Transactional databases
• Session state managers
– Isolate data recovery from application recovery

[1]

Session Persistence: Sticky Sessions

User 1 User 2 User 3 User N

Load Balancer Traffic Distribution, Session Persistence, SSL Termination

Server 1 Server 2 Server 3 Server N

Server Clusters

User 1 User 2 User 3 User N

Load Balancer Traffic Distribution, Session Persistence, SSL Termination

Server 1 Server 2 Server 3 Server N

• Persistent db: disk or memory resident
• Multicasting
• Shared memory

Virtual Server Implementation: Session Replication

User 1 User 2 User 3 User N

Load Balancer Traffic Distribution, Session Persistence, SSL Termination

Server 1 Server 2 Server 3 Server N

Virtual Server 1 Virtual Server 2 Virtual Server 3

• Virtual servers are short lived.

• Persistent db is easy.

• Multicasting

– Additional network traffic.

– Reduce traffic through smaller clusters.

• Shared memory is not recommended.

Storing Session State

• FastS is an in-memory session state store inside the JBoss embedded web server
– If the JVM fails, session state is lost
• SSM maintains state on a separate machine
– Slower, but session state is retained even if the JVM fails

Microrebootable Internet Services Software - 3

• Decoupling of components helps to microreboot gracefully
– Direct references, e.g. pointers, are stored outside the components
• Retryable requests
– Timeouts
• Leases
– Persistent state would have longer-term leases
– CPU execution time: hanging computations would lead to non-renewal of leases, and the program would be terminated by a microreboot

[1]
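The lease idea can be illustrated with a small sketch. The `Lease` class, the `components_to_microreboot` helper and the use of a logical clock are assumptions chosen to keep the example deterministic; they are not taken from the paper.

```python
class Lease:
    """Time-limited grant; the holder must renew it before it expires."""

    def __init__(self, duration, now):
        self.duration = duration
        self.expires_at = now + duration

    def renew(self, now):
        self.expires_at = now + self.duration

    def expired(self, now):
        return now >= self.expires_at


def components_to_microreboot(leases, now):
    """Return names of components whose lease lapsed (e.g. a hung computation)."""
    return sorted(name for name, lease in leases.items() if lease.expired(now))


# A logical clock instead of wall time keeps the example deterministic.
leases = {"healthy": Lease(5, now=0), "hung": Lease(5, now=0)}
leases["healthy"].renew(now=4)     # the healthy component keeps renewing
victims = components_to_microreboot(leases, now=6)
print(victims)                     # only the component that stopped renewing
```

A hung component cannot renew, so its lease simply runs out and it becomes a microreboot candidate without any explicit failure report.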

Crash Only Design

• Programs that can be safely crashed in whole or in part and recover quickly every time
• Fine-grain components
– Fast restart and reinitialization
• State segregation
– E-commerce applications handle 3 types of state
• Presentation (HTML, GIF, JSP, etc.: stateless)
• Session (session related: lost when the session ends)
• Long-term persistent data (db: customer info)
• Isolation and decoupling of EJBs
– Compiler-enforced interfaces
– EJBs cannot use each other's internal variables
– Microreboot all the “connected” EJBs together

[1]

Evaluation Framework

• Test application: EBid
– eBay-style auction
• Client emulator
– Human clients are modeled with a 25-state Markov chain
• Example states: Login, BuyNow or AboutMe
• Between URL clicks, model user think time
– Exponential with a mean of 7 seconds and a maximum of 70 seconds

[1]
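A rough sketch of such a client emulator follows. The three-state chain and its transition probabilities are invented for illustration (the real emulator uses 25 states), and capping the exponential sample at 70 seconds is one assumed way to enforce the maximum; the slides do not say how the limit is applied.

```python
import random

# Hypothetical 3-state chain; the emulator in the slides uses 25 states.
TRANSITIONS = {
    "Login":   [("BuyNow", 0.5), ("AboutMe", 0.5)],
    "BuyNow":  [("AboutMe", 0.3), ("BuyNow", 0.7)],
    "AboutMe": [("BuyNow", 0.6), ("Login", 0.4)],
}
MEAN_THINK_S = 7.0
MAX_THINK_S = 70.0


def think_time(rng):
    """Exponential think time with mean 7 s, capped at 70 s (assumed cap)."""
    return min(rng.expovariate(1.0 / MEAN_THINK_S), MAX_THINK_S)


def simulate(rng, steps, start="Login"):
    """Walk the chain, returning (state, think time before next click) pairs."""
    state, trace = start, []
    for _ in range(steps):
        trace.append((state, think_time(rng)))
        targets, weights = zip(*TRANSITIONS[state])
        state = rng.choices(targets, weights=weights)[0]
    return trace


trace = simulate(random.Random(42), steps=5)
for state, pause in trace:
    print(f"{state:8s} then think for {pause:.1f} s")
```

Passing in a seeded `random.Random` makes a run repeatable, which helps when comparing recovery strategies under the same workload.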

Evaluation Framework - 2

• Failure detection
– Network-level errors (File not found, Service not available, … HTTP 4xx or 5xx)
– Text analysis: look for keywords like “error”, “failed”, “exception”
– Application-specific errors
• e.g. being asked to log in when the user is already logged in
– Compare the system under test with a known-good instance: much more complex

[1]
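The first three detection checks could be combined roughly as follows. The function name, the returned reason strings and the "please log in" marker are hypothetical; the keyword match is deliberately naive and will produce false positives.

```python
import re

# Keywords from the slide; a plain substring match, so false positives are possible.
KEYWORDS = re.compile(r"error|failed|exception", re.IGNORECASE)


def detect_failure(status, body, logged_in=False):
    """Return a short reason string if the response looks like a failure, else None."""
    if 400 <= status < 600:
        return "http-error"
    if KEYWORDS.search(body):
        return "keyword"
    # Application-specific check: a login prompt for an already logged-in user.
    # The marker text is an assumption about what such a page would contain.
    if logged_in and "please log in" in body.lower():
        return "unexpected-login-prompt"
    return None


print(detect_failure(503, ""))
print(detect_failure(200, "NullPointerException at Bid.place"))
print(detect_failure(200, "Please log in to continue", logged_in=True))
print(detect_failure(200, "Bid placed"))
```

The checks run from cheapest to most application-specific; the comparison against a known-good instance is left out because it needs a second running system.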

Evaluation Framework - 3

• Fault diagnosis and Recovery Manager (RM): recursive recovery policy
– RM monitors the failure reports
– RM microreboots or reboots as required
– Try the cheapest reboot first

[1]
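"Try the cheapest reboot first" amounts to an escalation loop. The sketch below writes it iteratively; the four level names and the `try_level` / `is_healthy` callbacks are assumptions made for illustration, not the RM's actual interface.

```python
# Hypothetical recovery levels, ordered from cheapest to most expensive.
LEVELS = ["microreboot EJB", "restart application", "restart JVM", "reboot OS"]


def recover(try_level, is_healthy):
    """Escalate through LEVELS until the system looks healthy.

    try_level(name) performs the recovery action; is_healthy() re-checks
    the system afterwards. Returns the list of levels that were attempted.
    """
    attempted = []
    for level in LEVELS:
        try_level(level)
        attempted.append(level)
        if is_healthy():
            break
    return attempted


# Simulated fault that only a JVM restart clears.
state = {"healthy": False}


def try_level(level):
    if level == "restart JVM":
        state["healthy"] = True


attempted = recover(try_level, lambda: state["healthy"])
print(attempted)
```

When a cheap level does not help, the only cost is the time spent on it before moving to the next, more expensive one.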

Evaluation Framework - 4

• Application availability is measured
– User session: login to logout
– Each session consists of multiple user actions
– Each action is a sequence of operations (HTTP requests) that ends in a commit point, which indicates successful completion
– All operations must succeed for the action to be marked successful; any failed operation marks the whole action as failed

[1]
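The all-or-nothing rule for actions can be written down directly. In this sketch a session is assumed to be a list of actions, each a list of booleans (one per HTTP request); that representation is an assumption, not the paper's.

```python
def action_succeeded(operation_results):
    """An action succeeds only if every one of its operations succeeded."""
    return all(operation_results)


def summarize(session):
    """Count successful and failed actions in a session.

    session is a list of actions; each action is a list of booleans,
    one per HTTP request, True meaning the request succeeded.
    """
    good = sum(1 for action in session if action_succeeded(action))
    return {"succeeded": good, "failed": len(session) - good}


session = [
    [True, True, True],    # all requests fine
    [True, False, True],   # one failed request fails the whole action
    [True],
]
summary = summarize(session)
print(summary)
```

Counting at the action level means a single lost request also discards the successful requests that belonged to the same action.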

Evaluation Questions

• Are microreboots successful in recovering from failure?

• Are microreboots better than JVM restart?

• Are microreboots useful in clusters?

• What is the performance overhead?

[1]

What faults to induce?

• Key question but very little supporting evidence on critical software faults in production systems

• Anecdotal reports of faults
– Deadlocked threads
– Leak-induced resource exhaustion
– Bug-induced corruption of volatile metadata
– Incorrect handling of Java exceptions

• Compare to CVE, CWE. Malicious vs internal

[1]

Faults injected

• Set value to null – NullPointerException on access

• Invalid value that passes type check – UserID of size greater than Max UserID size

• Wrong value – valid for application but incorrect

• Recover using recursive recovery policy

[1]
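The three kinds of injected corruption can be mimicked on a plain record. This is only a sketch: `inject`, `MAX_USER_ID_LEN` and the sample values are hypothetical, and in Python a null value does not raise the `NullPointerException` that the Java application would see.

```python
import copy

MAX_USER_ID_LEN = 8  # hypothetical application limit


def inject(record, field, kind, wrong_value="bob"):
    """Return a corrupted copy of record; the original is left untouched."""
    bad = copy.deepcopy(record)
    if kind == "null":
        bad[field] = None                           # null: fails on access in Java
    elif kind == "invalid":
        bad[field] = "u" * (MAX_USER_ID_LEN + 1)    # right type, illegal size
    elif kind == "wrong":
        bad[field] = wrong_value                    # legal value, but incorrect
    else:
        raise ValueError(f"unknown fault kind: {kind}")
    return bad


record = {"user_id": "alice"}
for kind in ("null", "invalid", "wrong"):
    print(kind, inject(record, "user_id", kind))
```

The "wrong" case is the hardest to detect, since the corrupted value passes every validity check the application performs.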

[1]

Resuscitation – resume request processing, without fixing db corruptions

Recovery – resume + 100% correct db

Microreboot successful

• Microrebooting not successful
• Additional effort – coarser-grained operation or manual intervention

Microrebooting vs Full Reboot

• Availability – assessed by requests denied during down time

• Microreboots instead of JVM restarts reduced failures by 98%
– The width of the dip estimates the failed requests; see Figure 1
• Microreboots recover faster – see Table 3
• Microreboots reduce functional disruption – see the depth of the dip in Figure 1 (a deeper dip means more requests lost), and the analysis in Figure 2

• Microreboots reduce lost work

[1]

[1]

• Entity Group – 5 EJBs: Category, Region, User, Item and Bid
• Restarting EBid does not restart all the EJBs
• Initialization dominates the time taken

Microrebooting in Clusters

• Failover under normal load
– Sticky session servers
– Number of failed requests depends on the number of failed sessions
– More requests failed with JVM restart
– More nodes per cluster reduces the microrebooting advantage

[1]

Microrebooting in Clusters - 2

• Behavior under changing load
– Peak may be several times the average
– In the experiment, peak = 2 × average
– Consumers are distracted if the response to a request takes more than 8 seconds

[1]

Performance Impact

• Impact of microrebooting on steady-state, fault-free throughput and latency
• Throughput changes are small
• Latency increase is more serious
– Human-perceptible delay is 100 ms

[1]
