Integrating Fault-Tolerance Techniques in Grid Applications

A Dissertation
Presented to
the Faculty of the School of Engineering and Applied Science
at the
University of Virginia

In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy (Computer Science)

by
Anh Nguyen-Tuong

August 2000
Abstract
The contribution of this thesis is the development of a framework for simplifying the
construction of computational grid applications. The framework provides a generic
extension mechanism for incorporating functionality into applications and consists of two
models: (1) the reflective graph and event model, and (2) the exoevent notification model.
These models provide a platform for extending user applications with additional
capabilities via composition. While the models are generic and can be used for a variety of
purposes, including security, resource accounting, debugging, and application monitoring
[VILE97, FERR99, LEGI99, MORG99], we apply the models in this dissertation towards the
integration of fault-tolerance techniques.
Using the framework, fault-tolerance experts can encapsulate algorithms using the two
reflective models developed in this dissertation. Developers incorporate these algorithms
into their tools and augment the set of services provided to application programmers.
Application programmers then use these augmented tools to increase the likelihood that
their programs will complete successfully.
We claim that the framework enables the easy integration of fault-tolerance techniques
into object-based grid applications. To support this claim, we have mapped onto our
models five different fault-tolerance algorithms from the literature: 2PCDC and SPMD
checkpointing, passive and stateless replication, and pessimistic method logging. We
incorporated these algorithms into three common grid programming tools: Message
Passing Interface (MPI), Mentat, and Stub Generator (SG). MPI is the de facto standard
for message passing; Mentat is a C++-based parallel programming environment; and SG
is a popular tool for writing client/server applications.
We measured the ease with which techniques can be integrated into applications based
on the number of additional lines of code that a programmer would have to write. In the
best case, programmers needed to add three lines of code. In the worst case, programmers
had to write functions to save and restore the local state of their objects. However, such
functions are simple to write and exploit programmers’ knowledge of their applications.
Acknowledgements
To my ancestors, who have trekked down this path,
and cleared a road for others to follow,
three centuries is not that long after all

To that turtle in Hanoi,
forever gazing at the pond,
the smell of incense on a hot summer day

To the committee, for helping me to ascertain,
the inside from the outside, the lines delicately drawn

To John Knight,
for ensuring a smooth landing

To Andrew, my advisor and mentor,
for showing me the difference between a millisecond and a microsecond,
and for taking me along on his adventures

To Karine, my eternal accomplice,
whose support and love,
are the real foundation of this research

To my parents, whose journey I have yet to fully appreciate,
cam on nhieu

To my sister, Vi,
the dancer, the musician, the pharmacist, the photographer,
who never ceases to amaze me,
may she appreciate her roots on her voyage home

To Madgy, Bootsy, Noushka, Kona,
rain or shine, eyes always sparkling,
heart purring and tail wagging

Special thanks to Nuts,
whose wit is as sharp as his intellect,
for all his insights, technical, culinary and otherwise

And to all my friends, Chenxi, Dave, John, Karp, Glenn, Matt, Mike, Paco, Rashmi, the Dinner Gang,
who have made this trip so enjoyable
Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables

Chapter 1 Introduction
1.1 Current support for fault tolerance in grids
1.2 Properties of the framework
1.3 Evaluation
1.4 Background
1.4.1 Grid models
1.4.2 Reflection
1.4.3 Legion grid environment
1.5 Framework foundation
1.5.1 Framework summary
1.6 Constraints and assumptions
1.7 Outline

Chapter 2 Related Work
2.1 Computational grids
2.1.1 PVM and MPI
2.1.1.1 DOME
2.1.1.2 CVMULUS
2.1.1.3 Other extensions to PVM and MPI
2.1.2 Isis, Horus and Ensemble
2.1.3 Linda, Piranha and JavaSpaces
2.2 Reflection
2.3 Events
2.3.1 Local events
2.3.1.1 Protocol stacks
2.3.1.2 Graphical user interface
2.3.1.3 JavaBeans
2.3.2 Distributed events
2.4 Aspect-oriented programming
2.5 Integrating fault tolerance in distributed systems
2.6 Summary

Chapter 3 Reflective Graph and Event Model
3.1 Graphs
3.1.1 Graph API
3.2 Events
3.2.1 Event API
3.3 Overhead for graphs and events
3.4 Structure of an object
3.4.1 Overview of a protocol stack
3.4.2 Example of incorporating new functionality
3.5 Summary

Chapter 4 Exoevent Notification Model
4.1 Description
4.1.1 Registering interest in an exoevent
4.1.2 Object scope
4.1.3 Method scope
4.2 Policies
4.2.1 The notify-root policy
4.2.2 The notify-client policy
4.2.3 The notify-third-party policy
4.2.4 The notify-hybrid policy
4.3 Application programmer interface
4.4 Overhead
4.5 Example exoevents
4.6 Examples
4.6.1 Failure detection – push model
4.6.2 Failure detection – pull model
4.6.3 Failure detection – service model
4.7 Summary

Chapter 5 Mappings of Algorithms
5.1 Checkpointing
5.1.1 SPMD checkpointing
5.1.1.1 Algorithm
5.1.1.2 Mapping SPMD checkpointing
5.1.1.3 Summary of SPMD checkpointing
5.1.2 2-phase commit distributed checkpointing
5.1.2.1 Checkpointing
5.1.2.2 Recovery
5.1.2.3 Mapping 2-phase commit distributed checkpointing
5.1.2.4 Summary of 2PCDC algorithm
5.2 Logging
5.2.1 Pessimistic message logging
5.2.2 Mapping pessimistic message logging
5.2.3 Optimization: pessimistic method logging
5.2.4 Legion system-level support
5.2.5 Summary of pessimistic logging
5.3 Replication
5.3.1 Passive replication
5.3.1.1 Mapping passive replication
5.3.1.2 Legion system-level support
5.3.1.3 Summary of passive replication
5.3.2 Stateless replication
5.3.2.1 Mapping stateless replication
5.3.2.2 Duplicate method suppression
5.3.2.3 Summary of stateless replication
5.4 Summary

Chapter 6 Integration into Programming Tools
6.1 MPI (SPMD and 2PCDC Checkpointing)
6.1.1 Legion MPI (LMPI)
6.1.2 Legion MPI-FT
6.1.3 Example
6.1.4 Summary
6.2 Stub generator (passive replication and pessimistic method logging)
6.2.1 Modifications to the stub generator
6.2.2 Integration with pessimistic method logging
6.2.3 Example
6.2.4 Summary
6.2.5 Integration with passive replication
6.2.6 Summary
6.3 MPL – Stateless replication
6.3.1 Stateless replication
6.3.2 Summary
6.4 Summary

Chapter 7 Evaluation
7.1 Stub Generator
7.1.1 RPC
7.1.2 Context
7.2 MPI
7.2.1 RPC
7.2.2 BT-MED
7.3 Mentat
7.3.1 RPC
7.3.2 Complib
7.4 Summary

Chapter 8 Conclusion
8.1 Limitations
8.2 Future Work

References
List of Figures

Figure 1: Grid layered implementation models (adapted from [FOST99], pg. 30)
Figure 2: Code fragment and RGE graph
Figure 3: Example use of the graph API
Figure 4: Graph interface
Figure 5: Example use of events
Figure 6: Event interface
Figure 7: Structure of an object: sample protocol stack
Figure 8: Adding a handler for logging methods (pseudo-code)
Figure 9: The notify-client policy
Figure 10: Propagating exoevents to a catcher object
Figure 11: Example propagation of exoevents in the notify-hybrid policy
Figure 12: API for exoevents
Figure 13: Failure detection using the push model
Figure 14: Failure detection using a pull model
Figure 15: Generic failure detection service
Figure 16: Structure of a fault-tolerant application
Figure 17: Lost and orphan messages
Figure 18: Insertion of checkpoint in SPMD code
Figure 19: Recovery example
Figure 20: Interface for checkpoint server
Figure 21: Interface for application manager
Figure 22: Raising the “CheckpointTaken” exoevent
Figure 23: Interface for participants
Figure 24: Interface for coordinator
Figure 25: 2PCDC code
Figure 26: Interface for participants
Figure 27: Pessimistic message logging (PML)
Figure 28: Interface for pessimistic message logging
Figure 29: Handlers for pessimistic message logging
Figure 30: Handler for intercepting outgoing communication
Figure 31: Pessimistic method logging
Figure 32: Passive replication example
Figure 33: Passive replication interface (primary)
Figure 34: Handlers for passive replication (primary)
Figure 35: Server lookup with primary replication
Figure 36: Stateless replication
Figure 37: Interface for proxy object
Figure 38: Sending a method to a replica
Figure 39: Simple MPI program (myprogram)
Figure 40: Legion MPI architecture augmented with FT modules
Figure 41: Example of MPI application with checkpointing
Figure 42: Example of saving and restoring user state
Figure 43: Creating objects using the stub generator
Figure 44: Specification of READONLY methods
Figure 45: Modified client-side stubs
Figure 46: Interface and code for myApp
Figure 47: Example of MPL application
Figure 48: Declaring a Mentat class as stateless
Figure 49: Specifying parameters for the stateless replication policy
Figure 50: Interface for context object
Figure 51: Context application structure
Figure 52: BT-MED application structure
Figure 53: Complib application structure
Figure 54: Complib main loop
List of Tables

Table 1: Overhead of graphs and events
Table 2: Sample set of events for building the protocol stack of an object
Table 3: Example of typical exoevent
Table 4: Exoevent interest for notify-root policy
Table 5: Exoevent interest for notify-client policy
Table 6: Exoevent interest for notify-third-party policy
Table 7: Exoevent interest for notify-hybrid policy for object AppA
Table 8: Exoevent interest for notify-hybrid policy for object catcher
Table 9: Exoevent interest for notify-hybrid policy for object B
Table 10: Overhead in creating and raising exoevents
Table 11: Sample exoevents
Table 12: “I am Alive” exoevent raised by application objects
Table 13: Exoevent raised on object creation
Table 14: Exoevent raised by failure detector
Table 15: Data structures for FT modules
Table 16: Summary of SPMD checkpointing
Table 17: 2PCDC algorithm
Table 18: Recovery in 2PCDC
Table 19: Summary of 2PCDC algorithm
Table 20: Summary of pessimistic logging algorithm
Table 21: Summary of the passive replication algorithm
Table 22: “Object:MethodDone” notification by replica
Table 23: Summary of the stateless replication algorithm
Table 24: Summary of algorithms
Table 25: Sample MPI functions
Table 26: Functions to support checkpoint/restart
Table 27: Options for legion_mpi_run
Table 28: Summary of work required for integration of checkpointing algorithms
Table 29: Parameters for legion_set_ft
Table 30: Summary of work required for integration of PML
Table 31: Parameters for legion_set_ft
Table 32: Summary of work required for integration of passive replication
Table 33: Summary of work required for integration of stateless replication
Table 34: Stub generator – RPC performance (n = 100, α = 0.05)
Table 35: Context performance (n = 100, α = 0.05)
Table 36: Context performance with one induced failure (n = 5, α = 0.05)
Table 37: Send and receive performance (n = 20, α = 0.05)
Table 38: BT-MED performance (n = 20, α = 0.05)
Table 39: Performance with one induced failure (n = 10, α = 0.05)
Table 40: RPC performance (1 worker, n = 100, α = 0.05)
Table 41: Complib performance (n = 20, α = 0.05)
Table 42: Complib performance with failure induced (n = 10, α = 0.05)
Table 43: Application summary
Table 44: Framework overhead based on RPC application
in-fra-struc-ture \'in-fre-,strek-cher\ n (1927)
The basic facilities, services, and installations needed for the functioning of a community or society, such as transportation and communications systems, water and power lines, and public institutions including schools, post offices, and prisons.
— American Heritage Dictionary
Chapter 1
Introduction
Throughout history, the development of infrastructures has catalyzed and shaped the
evolution of human progress. Roman roads, the telegraph, the telephone, the modern banking system, the railroad, the interstate highway system, the electrical power grid, and the Internet are all successful infrastructures that have revolutionized how people communicate and interact. At the dawn of the new millennium,
we are witnessing the birth of what promises to be the next revolutionary infrastructure.
Funded in the United States by several governmental agencies, including the National
Science Foundation (NSF), the Defense Advanced Research Project Agency (DARPA),
the Department of Energy (DOE), and the National Aeronautics and Space Administration
(NASA), this new infrastructure is often referred to as a metasystem or computational grid
[GRIM97A, SMAR97, GRIM98, FOST99, LEIN99].
A computational grid is a specialized instance of a distributed system [MULL93,
TANE94] with the following characteristics: compute and data resources are
geographically distributed; they are under the control of different administrative domains
with different security and accounting policies; and the hardware resource base is
heterogeneous and consists of PCs, workstations and supercomputers from different
manufacturers. The ability to develop applications over this environment is sometimes
referred to as the wide-area computing problem [GRIM99].
Computational grids present a complex environment in which to develop applications.
Writing a grid application is at least as difficult as writing an application for traditional distributed systems: since both are fundamentally distributed-memory systems, programmers must deal with issues of application distribution, communication, and synchronization. Furthermore, grids present additional challenges, as programmers may be
required to deal with issues such as security, disjoint file systems, fault tolerance and
placement, to name only a few [GRIM98, FOST99, GRIM99]. Without additional higher
level abstractions, all but the best programmers will be overwhelmed by the complexity of
the environment.
The contribution of this work is the development of a framework for simplifying the
construction of grid applications. The framework provides a generic extension mechanism
for incorporating functionality into applications and consists of two models: (1) the
reflective graph and event model, and (2) the exoevent notification model. These models
provide a platform for extending user applications with additional capabilities via
composition. While the models are generic and can be used for a variety of purposes,
including security, resource accounting, debugging, and application monitoring [VILE97,
FERR99, LEGI99, MORG99], we apply the models in this dissertation towards the
integration of fault-tolerance techniques. Support for the development of fault-tolerant
applications has been identified as one of the major technical challenges to address for the
successful deployment of computational grids [GRIM98, FOST99, LEIN99].
Consider application reliability in a grid. As applications scale to take advantage of a grid’s vast available resources, the probability of failure is no longer negligible and must be taken into account. For example, consider an application decomposed into 100 objects, with each object requiring one week of processing time and placed on its own workstation. Assuming that each workstation has an exponentially distributed failure mode with a mean-time-to-failure of 120 days, the mean-time-to-failure of the entire application would be only 1.2 days; thus, the application would rarely finish!
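The figure follows directly from the exponential failure model assumed in the example: with $n$ independent hosts, each with mean-time-to-failure $\theta$, the time to the first failure is itself exponentially distributed with mean $\theta/n$, and the chance that every host survives the full running time $t$ is correspondingly small:

$$\mathrm{MTTF}_{\mathrm{app}} = \frac{\theta}{n} = \frac{120\ \text{days}}{100} = 1.2\ \text{days}, \qquad P(\text{success}) = e^{-nt/\theta} = e^{-(100)(7)/120} \approx 0.3\%.$$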
Using the framework, fault-tolerance experts can encapsulate algorithms using the two
reflective models developed in this dissertation. Developers incorporate these algorithms
into their tools and augment the set of services provided to application programmers.
Application programmers then use these augmented tools to increase the likelihood that
their programs will complete successfully.
We claim that the framework enables the easy integration of fault-tolerance techniques
into object-based grid applications. To support this claim, we have mapped onto our
models five different fault-tolerance algorithms from the literature: 2PCDC and SPMD
checkpointing, passive and stateless replication, and pessimistic method logging. We
chose these algorithms to illustrate the applicability of our framework to a range of fault-
tolerance techniques. Furthermore, we selected these algorithms because we believe that
they are likely to be used in grid applications. We incorporated these algorithms into three
common grid programming tools: Message Passing Interface (MPI), Mentat, and Stub
Generator (SG). MPI is the de facto standard for message passing; Mentat is a C++-based
parallel programming environment; and SG is a popular tool for writing client/server
applications.
We measured the ease with which techniques can be integrated into applications based
on the number of additional lines of code that a programmer would have to write. In the
best case, programmers needed to add three lines of code. In the worst case, programmers
had to write functions to save and restore the local state of their objects. However, such
functions are simple to write and exploit programmers’ knowledge of their applications.
Furthermore, tools to automate save and restore state functions have already been
demonstrated in the literature [BEGU97, FERR97, FABR98].
To the best of our knowledge, we are the first to advocate and use a reflective
architecture to structure applications in computational grids. Moreover, we are the first to
demonstrate the integration of a wide range of fault-tolerance techniques into grid
applications using a single framework.
1.1 Current support for fault tolerance in grids
Until recently, the foremost priority for grid developers has been to develop working
prototypes and to show that applications can be written over a grid environment
[GRIM97B, BRUN98, FOST98]. To date, there has been limited support for application-level
fault tolerance in computational grids. Support has consisted mainly of failure detection
services [STEL98, GROP99] or fault-tolerance capabilities in specialized grid toolkits
[NGUY96, CASA97]. Neither solution is satisfactory in the long run. The former places the
burden of incorporating fault-tolerance techniques into the hands of application
programmers, while the latter only works for specialized applications. Even in cases
where fault-tolerance techniques have been integrated into programming tools, these
solutions have generally been point solutions, i.e., tool developers have started from
scratch in implementing their solutions and have neither shared nor reused any fault-tolerance code.
As these tools are ported to grid environments, or as new tools are developed for grid
environments, the continued development of fault-tolerant tools as point solutions
represents wasteful expenditure. We believe a better approach is to provide a structural
framework in which tool developers can integrate fault-tolerance solutions via a
compositional approach in which fault-tolerance experts write algorithms and encapsulate
them into reusable code artifacts, or modules. Tool developers can then integrate these
modules in their environments.
1.2 Properties of the framework
Our long-term goal is to simplify the construction of fault-tolerant grid applications.
We believe that a good solution for achieving this goal should exhibit the following
properties:
• P1. Separation of concerns and composition. Designing and writing fault-
tolerance code are complex and error-prone tasks and should be done by experts,
not application programmers or tool developers. Thus, fault-tolerance experts
should be able to encapsulate algorithms into reusable and composable code
artifacts [NGUY99]. Furthermore, the incorporation of fault-tolerance techniques
should not interfere with other non-functional concerns such as security or
accounting.
• P2. Localized cost. By localized cost, we mean that the use of resources or services
to implement fault-tolerance techniques should not be charged to applications that
do not require those resources or services—users should pay only for the level of
services that they need. In general, localized cost is an important attribute for any
grid service [GRIM97A].
• P3. Working proof of concept. We should be able to demonstrate the integration of
fault-tolerance techniques in running applications on a working grid prototype and
using multiple programming tools. Further, applications with fault-tolerance
techniques integrated should be able to tolerate more failures than applications that
do not use any fault-tolerance techniques.
1.3 Evaluation
Based on our goal of simplifying the construction of fault-tolerant applications and the properties listed in §1.2, we have derived several criteria by which to evaluate our framework (next to each criterion, we note in parentheses its related properties):
• Multiple programming tools. A successful solution should promote and enable the
incorporation of fault-tolerance techniques into multiple programming tools,
including legacy tools such as MPI or PVM. Legacy tools are already familiar to
programmers and should ease the transition from traditional distributed systems to
grid environments. (P1, P3)
• Breadth of fault-tolerance techniques. A successful solution should support a wide
range of fault-tolerance techniques so that application programmers may use the
one that is most appropriate for their needs. (P1, P2)
• Ease of use. Incorporating fault-tolerance techniques should require only trivial
or small modifications to applications. (P1, P3)
• Localized cost. Application programmers should select and pay only for the level
of fault tolerance that they require. A good framework should not impose a
system-wide solution. Instead, the cost of using fault-tolerance techniques should
be localized to the applications that use these techniques. (P2)
• Overhead. Is the overhead of using fault-tolerance techniques due to the algorithm or to the framework itself? The framework itself should add negligible overhead, so that in deciding whether to incorporate a fault-tolerance technique, users need only consider the algorithmic overhead, i.e., the cost of the algorithm itself. (P2, P3)
1.4 Background
1.4.1 Grid models
Before describing our framework, we present the implementation models of
computational grids. As shown in Figure 1, a grid consists of services that run on top of
native operating systems. These services provide functionality such as authentication,
failure detection, object and process management, and remote input/output, and are
accessed via grid libraries. Typically, an application programmer will not access these
libraries directly, but will use a programming tool such as MPI [GROP99],
NetSolve [CASA97], Ninf [SATO97] or MPL [GRIM97B], which in turn will call the
underlying grid libraries. The advantage of this layered model is that application
programmers can use familiar programming tools and interfaces and are shielded from the
complexity of accessing grid services.
FIGURE 1: Grid layered implementation models (adapted from [FOST99], pg. 30). The figure shows five layers, from top to bottom: Applications; Programming Tools (e.g., MPI, PVM, NetSolve, DOME, MPL, Fortran); Grid Libraries (e.g., the Globus API and Legion API); Grid Services (security, object/process management, scheduling, failure detection, storage); and Native Operating Systems (e.g., Windows NT, Unix).
There are currently three approaches to building grids: the commodity approach, the
service approach, and the integrated architecture approach [FOST99]. In the commodity
approach, existing commodity technologies, e.g. HTTP, CORBA, COM, Java, serve as the
basic building blocks of the grid [ALEX96, BALD96, FOX96, CHRI97]. The primary
advantages of this approach are the use of industry standard protocols, allowing
programmers to ride the technology curve as improvements are made to these protocols.
Furthermore, standard protocols stand a better chance of being adopted by a large
community of developers. The problem with this approach is that the current set of
protocols may not be adequate to meet the requirements of computational grids. In the
service approach, as exemplified by the Globus project, a set of basic services such as
security, communication, and process management are provided and exported to
developers in the form of a toolkit [FOST97]. In the integrated architecture approach,
resources are treated and accessed through a uniform model of abstraction [GRIM98]. As
we describe in §1.4.3, our framework targets the integrated approach.
1.4.2 Reflection
Our framework relies on the observation that although fault-tolerance techniques are
diverse by nature, their implementation is not. Indeed, the implementation of the major families of fault-tolerance techniques relies on common basic primitives such as:
• intercepting the message stream
• piggybacking information on the message stream
• acting upon the information contained in the message stream
• saving and restoring state
• detecting failure
• exchanging protocol information between participants of an algorithm
Thus, by providing an execution model whereby these primitives can be expressed and
manipulated as first-class entities, it is possible to achieve our goals of developing fault-
tolerance capabilities independently and integrating them into programming tools.
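As a concrete illustration of the first two primitives, the sketch below shows how a message-interception and piggybacking handler might look when expressed against an event-based protocol stack. All identifiers here are hypothetical; the actual RGE interfaces are presented in Chapter 3.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical message type: a payload plus protocol information
// piggybacked by fault-tolerance handlers.
struct Message {
    std::string payload;
    std::vector<std::string> piggyback;
};

using Handler = std::function<void(Message&)>;

// Minimal event dispatcher: announcing an event runs every handler
// registered for it, in registration order.
class EventDispatcher {
public:
    void subscribe(const std::string& event, Handler h) {
        handlers_[event].push_back(std::move(h));
    }
    void announce(const std::string& event, Message& m) {
        for (auto& h : handlers_[event]) h(m);
    }
private:
    std::map<std::string, std::vector<Handler>> handlers_;
};

int main() {
    EventDispatcher stack;
    int seq = 0;

    // Fault-tolerance module: intercept the outgoing message stream
    // and piggyback a sequence number for a logging protocol.
    stack.subscribe("MessageSend", [&seq](Message& m) {
        m.piggyback.push_back("seq=" + std::to_string(seq++));
    });

    // Transport layer: registered last, so it sees the augmented message.
    stack.subscribe("MessageSend", [](Message&) {
        // hand the message to the network layer
    });

    Message m{"method invocation", {}};
    stack.announce("MessageSend", m);  // interception happens here
}
```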
We use reflection as the architectural principle behind our execution models. Smith
introduced the concept of reflection as a computational process that can reason about itself
and manipulate representations of its own internal structure [SMIT82]. Two properties
characterize reflective systems: introspection and causal connection.* Introspection
allows a computational process to have access to its own internal structures. Causal
connection enables the process to modify its behavior directly by modifying its internal
data structures—there is a cause-and-effect relationship between changing the values of
the data structures and the behavior of the process. The internal data structures are said to
reside at the metalevel while the computation itself resides at the baselevel. The metalevel
controls the behavior at the baselevel. In our case, the fault-tolerance capabilities are
expressed at the metalevel and control the underlying baselevel computation.
1.4.3 Legion grid environment
Our work targets the Legion environment for multiple reasons: (1) Legion is object-
based, (2) it already uses graphs for inter-object communication, (3) it is an existing grid
prototype, and (4) multiple programming tools are available. None of the other
environments considered, such as Globus and CORBA-based systems, possess all these
attributes. However, our framework is also relevant to these other environments. For
example, it could be used to structure CORBA applications. Recent research has been oriented towards extending the functionality of CORBA systems through a reflective architecture [BLAI98, HAYT98, LEDO99]. Our work suggests that structuring CORBA-reflective architectures using an event-based and/or graph-based paradigm is an idea worth pursuing.

* Note that the term causal is used differently in the distributed systems literature, where it refers to the “happen-before” relationship as defined by Lamport [LAMP78].
Legion treats all resources in a computational grid as objects that communicate via
asynchronous method invocations. Objects are address-space-disjoint, i.e., they are
logically-independent collections of data and associated methods. Objects contain a thread
of control, and are named entities identified by a Legion Object IDentifier (LOID). Objects
are persistent and can be in one of two states: active or inert. Active objects contain a
thread of control and are ready to service method calls. They are implemented with
running processes over a message passing layer. Inert objects exist as passive object state
representations on persistent storage. Legion moves objects between active and inert states
to use resources efficiently, to support object mobility, and to enable failure resilience.
Legion objects are under the control of a Class Manager object that is responsible for
the management of its instances. A Class Manager defines policies for its instances and
regulates how an object is created or deleted, and when it should be migrated, activated, or
deactivated. By defining new Class Managers, grid developers can change the
management policies of object instances. Class Managers themselves are managed by
higher-order class managers, forming a rooted hierarchy.
Legion provides several default objects to manage its resource base. The two basic
objects are Host Objects and Vault Objects, which correspond to processor and storage
resources in a traditional operating system. Host objects are responsible for running an
active object while vault objects are used to store inert objects. Legion allows
customization of all its objects. Thus, a host object could represent compute resources that exhibit varying degrees of reliability and performance, e.g., a personal computer, a workstation, a server, a cluster, or a queue-controlled supercomputer. Similarly, a vault
object could represent a local disk, a RAID disk, or tertiary storage. A full description of
the Legion object model can be found in the literature [GRIM98].
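A minimal sketch of the active/inert lifecycle described above may help fix the ideas. The types and method names below are invented for exposition and are not Legion's actual interfaces.

```cpp
#include <string>

// Illustrative model of the Legion object lifecycle; these names are
// hypothetical, not Legion's actual API.
enum class ObjectState { Active, Inert };

struct GridObject {
    std::string loid;    // Legion Object IDentifier
    std::string host;    // where the running process lives (if active)
    std::string vault;   // where the passive state lives (if inert)
    ObjectState state = ObjectState::Inert;
};

// A Class Manager defines policies for its instances: when to
// activate, deactivate, or migrate them.
class ClassManager {
public:
    // Start a process on a host from the state stored in the vault.
    void activate(GridObject& o, const std::string& host) {
        o.host = host;
        o.state = ObjectState::Active;
    }
    // Save state to the vault and release the host's resources.
    void deactivate(GridObject& o) {
        o.state = ObjectState::Inert;
    }
    // Migration (and recovery after a host failure) composes the two.
    void migrate(GridObject& o, const std::string& new_host) {
        deactivate(o);
        activate(o, new_host);
    }
};
```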
1.5 Framework foundation
The key contribution of this work is the development of two reflective models that are
the foundations of our framework, the reflective graph and event model, and the exoevent
notification model. Together these models provide flexible mechanisms for structuring
applications and specifying the flow of information between objects that comprise an
application. Furthermore, the models enable information propagation policies to be bound
to applications at run-time. The flexibility of the models and the ability to defer the binding of policy decisions are the differentiating features of our framework.
The reflective graph and event model (RGE) reflects two aspects of our target environment: (1) objects are implemented by running processes that communicate via message passing, and (2) the environment is object-based, so an application consists of a set of cooperating objects. The RGE model employs graphs and events to expose the structure of objects to fault-tolerance developers. The model specifies both an object’s external aspect (interactions between objects) and its internal aspect (interactions inside objects). Graphs
and events are the building blocks with which fault-tolerance implementors can
incorporate functionality inside objects and exchange fault-tolerance protocol information
between objects. Graphs represent interactions between objects; a graph node is either a
member function call on an object or another graph, arcs model data or control
dependencies, and each input to a node corresponds to a formal parameter of the member
function. Events specify interactions inside objects and are used to structure their protocol
stack.
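To make the graph abstraction concrete, here is a small, single-process sketch of a data-driven program graph in which a node fires once all of its formal parameters have arrived. The API is hypothetical; the actual graph interface appears in §3.1.1.

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Single-process sketch of an RGE-style program graph: nodes are
// method invocations, arcs are data dependencies, and a node fires
// when all of its formal parameters have arrived.
class ProgramGraph {
public:
    using Method = std::function<std::string(const std::vector<std::string>&)>;

    int add_invocation(Method m, int arity) {
        nodes_.push_back({std::move(m), arity, {}});
        return static_cast<int>(nodes_.size()) - 1;
    }
    // Arc: the result of node `from` becomes parameter `param` of `to`.
    void add_arc(int from, int to, int param) {
        arcs_[from].push_back({to, param});
    }
    // Binding a constant may trigger execution (data-driven semantics).
    void bind_constant(int node, int param, std::string v) {
        deliver(node, param, std::move(v));
    }

private:
    struct Node { Method m; int arity; std::map<int, std::string> inputs; };
    std::vector<Node> nodes_;
    std::map<int, std::vector<std::pair<int, int>>> arcs_;

    void deliver(int node, int param, std::string v) {
        Node& n = nodes_[node];
        n.inputs[param] = std::move(v);
        if (static_cast<int>(n.inputs.size()) < n.arity) return;  // wait for more inputs
        std::vector<std::string> args;
        for (auto& kv : n.inputs) args.push_back(kv.second);
        std::string result = n.m(args);
        for (auto& arc : arcs_[node]) deliver(arc.first, arc.second, result);
    }
};

int main() {
    ProgramGraph g;
    // Two-stage pipeline: the result of A.f feeds parameter 0 of B.g.
    int f = g.add_invocation(
        [](const std::vector<std::string>& a) { return a[0] + " -> A.f"; }, 1);
    int h = g.add_invocation(
        [](const std::vector<std::string>& a) {
            std::cout << a[0] << " -> B.g\n";
            return a[0];
        }, 1);
    g.add_arc(f, h, 0);
    g.bind_constant(f, 0, "input");  // fires A.f, whose result fires B.g
}
```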
Our second model, the exoevent notification model, is a distributed event model.
Similarly to the event model defined by CORBA [BENN95] and the Java Distributed Event
Specification [SUN99A], the exoevent notification model provides a flexible mechanism
for objects to communicate. However, unlike the CORBA and Java models, the salient and
distinguishing features of the exoevent notification model are that it unifies the concept of
exceptions and events—an exception is a special case of an event—and it allows event propagation policies to be set on a per-application, per-object, or per-method basis at run-time. In our model, exoevents denote object state transitions and are associated with program graphs. Raising an exoevent results in method invocations on remote objects through the execution of associated program graphs—hence the term exoevent. The ability to specify handlers as program graphs allows
developers to specify more complex policies than with a traditional event model.
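A sketch of the intended usage, again with invented identifiers (the actual API appears in §4.3): an object registers interest in a named exoevent, associating a handler that stands in for the program graph to execute when the exoevent is raised.

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of the exoevent notification model. An exoevent
// denotes an object state transition; raising it executes the program
// graphs registered for it (represented here by plain callbacks that
// would, in the real model, invoke methods on remote objects).
struct Exoevent {
    std::string name;    // e.g. "ObjectFailed", "CheckpointTaken"
    std::string source;  // LOID of the raising object
};

class ExoeventInterestSet {
public:
    using GraphHandler = std::function<void(const Exoevent&)>;

    // Interest can be registered per-application, per-object, or
    // per-method, and rebound at run-time.
    void register_interest(const std::string& name, GraphHandler graph) {
        interests_[name].push_back(std::move(graph));
    }
    // Raising an exoevent fires every associated graph. An exception
    // is simply an exoevent with failure semantics.
    void raise(const Exoevent& e) {
        for (auto& g : interests_[e.name]) g(e);
    }

private:
    std::map<std::string, std::vector<GraphHandler>> interests_;
};

int main() {
    ExoeventInterestSet interests;
    // Notify-client style policy: route failure notifications back to
    // the client that issued the call.
    interests.register_interest("ObjectFailed", [](const Exoevent& e) {
        std::cout << "notify client: " << e.source << " failed\n";
    });
    interests.raise({"ObjectFailed", "loid-of-worker-17"});
}
```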
The use of reflection to incorporate non-functional requirements has been proposed by
Stroud [STRO96]. Its use for integrating fault-tolerance capabilities into systems has been
successfully employed in many object-based systems, including FRIENDS [FABR98] and
GARF [GUER97]. Reflection has also been used as the basis for extending object
functionality in CORBA-based systems (OpenORB [BLAI98], FlexiNet [HAYT98],
OpenCorba [LEDO99]). The novelty of this dissertation is to suggest the use of events as
the primary structuring mechanism for designing object request brokers, the use of generic
program graphs to describe distributed event propagation policy and bind policy at run-
time, and the use of reflection to specify inter- and intra-object communication as generic
and flexible means of extending grid applications with additional functionality. In
particular, we focus on using the models to extend applications with fault-tolerance
capabilities.
1.5.1 Framework summary
In order to enable the integration of fault-tolerance techniques with applications, our
framework requires that both fault-tolerance experts and tool developers target the
reflective graph and event model and the exoevent notification model. Note that the
framework does not make any assumptions about the failure model used by the underlying
system, or the failure assumptions made by a given fault-tolerance algorithm. The
framework is an integration framework only; the decision as to whether a given algorithm
is suitable for a given application is not part of the framework proper.
Our framework imposes a unified structure on the way grid libraries are organized.
Specifically, our framework requires that library components use an event paradigm for
intra-object communication. The advantages of events in terms of flexibility and extensibility are well-known. Events have been used in such diverse areas as graphical
user interfaces [NYE92], protocol stacks [BHAT97, HAYD98], operating system kernels
[BERS95] and integrated systems [SULL96]. Using events for building the protocol stack
of an object provides natural hooks for inserting fault-tolerance capabilities. In fact, the
events required to build a protocol stack for objects are those that are needed for
incorporating fault-tolerance functionality.
For inter-object communications, our model provides a data-driven, graph-based
abstraction. Graphs have been used successfully in parallel and distributed systems
[BABA92, BEGU92, GRIM96A]. Graphs enable the expression of traditional client/server
interactions, such as CORBA, as well as more complex interactions, such as pipelined
flow.
1.6 Constraints and assumptions
The fault-tolerance algorithms discussed in this dissertation make use of three common assumptions: fail-stop, availability of reliable storage, and reliable networks. However, Legion only provides an approximation of these assumptions. Detecting a crashed object is approximated using conservatively-set timeouts; reliable storage is approximated with standard disks; and a high-level retry mechanism for sending messages is used to mask transient network partitions. Thus, it is possible for an application using a given fault-tolerance technique to violate its failure assumptions. To increase the likelihood that these assumptions are met, Legion could be configured to use hosts and storage devices with higher reliability, e.g., hosts such as those provided by the NonStop™ Compaq®† or Stratus® architectures, storage such as RAID disks, and possibly hosts configured with redundant network paths. However, we do not expect this configuration to be common in grids in the near future. Thus, application developers should be aware of the possibility of violating the failure assumptions—if the cost of violating these assumptions is too high, e.g., as would be the case with safety-critical applications, then these applications should not be used on Legion.‡ The framework described here is an integration framework only, and does not make any guarantees as to the suitability of using a given algorithm. However, to increase the likelihood that the failure assumptions are met, we configured applications to run within a site [DOCT99].

† Formerly known as Tandem®, acquired by Compaq Corporation.
‡ Note that this comment applies to any computational grid.
In this dissertation, the algorithms we have mapped onto our framework are designed to tolerate host failures. Computational grids use hardware resources owned by various entities, including research labs, governmental agencies, and universities. At any moment in time, it is thus not surprising to find that some hosts used by a grid system have crashed, whether because someone rebooted a machine or tripped over a power cord, or simply by chance; a host may also be down for maintenance. While the crash failure of hosts represents an important class of failures in grids, we note that such crashes are not the only source of failures—unreliable software or operator error could also result in the failure of applications [GRAY85]. Furthermore, we do not concern ourselves with non-fault-masking techniques such as reconfiguration and presentation of alternative services to cope with failures [HOFM94, KNIG98, GART99]. We are only concerned with the integration of fault-masking techniques in grid applications. Once a host fails, we assume that it does not recover. Furthermore, we seek only to integrate fault-tolerance techniques into user applications and do not address fault tolerance for system-level objects.** We assume that Legion services are always available.

** Legion system-level objects already tolerate transient host failures.
1.7 Outline
We have organized the rest of the dissertation as follows. In Chapter 2, we present an
overview of related work in the areas of computational grids, reflection, event-driven systems, aspect-oriented programming, and the integration of fault-tolerance techniques in
distributed systems. In Chapter 3, we provide an overview of our execution model, the
reflective graph and event model. In Chapter 4, we describe the development of a
distributed event notification model that is used as a flexible communication model to
exchange protocol information between objects. In Chapter 5, we illustrate mappings from
several well-known fault-tolerance techniques onto the reflective graph and event model
and the distributed event notification model. In Chapter 6, we present the integration of
several of the mappings described in Chapter 5 into programming tools available in the
Legion grid. In Chapter 7, we tie the previous chapters together and provide a working
proof that our models have been successfully integrated into several tools and
applications. We also evaluate the performance of these applications. In Chapter 8, we
conclude by presenting lessons we learned and opportunities for future research.
There is only one nature – the division into science and engineering is a human imposition, not a natural one. Indeed, the division is a human failure; it reflects our limited capacity to comprehend the whole.
— Bill Wulf
Chapter 2
Related Work
We present a broad overview of computational grids and potential grid tools to provide
context for our work (§2.1). We discuss reflective systems (§2.2) as our reflective graph
and event model is based on a reflective architecture. We discuss the event model and its
use in various settings to support extensibility and flexibility (§2.3). We consider aspect-
oriented programming and its potential relationship with event-based extension
mechanisms (§2.4). Finally, we present several approaches to integrating fault-tolerance
techniques into distributed systems, including CORBA-based systems (§2.5).
2.1 Computational grids
Foster et al. have identified three approaches to building computational grids: the
commodity approach, the service approach, and the integrated architecture approach
[FOST99]. In the commodity approach, existing commodity technologies, e.g., HTTP,
CORBA, COM, Java, serve as the basic building blocks of the grid [ALEX96, BALD96,
FOX96, CHRI97]. In the service approach, as exemplified by the Globus project, a set of
basic services such as security, communication, and process management are provided and
exported to developers in the form of a toolkit [FOST97]. In the integrated architecture
approach, resources are accessed through a uniform model of abstraction [GRIM98]. For
example, Legion enables the development of grid applications by providing a uniform
object abstraction to encapsulate and represent grid resources, e.g., compute, data, and
people resources. A motivating factor for both the service and integrated architecture
approach is that the set of commodity services provided by current technology does not
suffice to meet the requirements of computational grids [FOST99].
We present several systems below and comment on the suitability of these systems for
developing grid applications.
2.1.1 PVM and MPI
PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) are the two
best-known message passing environments in grid computing [GEIS94, GEIS97]. They
provide programmers with library support for writing applications with explicit message
send and receive operations. In addition to message passing, PVM and MPI provide the
illusion of an abstract virtual machine that supports the creation and deletion of processes
or tasks. As of this writing, MPI has eclipsed PVM to become the primary message
passing standard, and is supported by all major computer manufacturers.
Both Legion and Globus provide support for MPI [FOST99]. Legion also provides
support for PVM. We describe below several systems layered on top of PVM or MPI that
provide fault-tolerance capabilities. While these systems have not yet been ported to grid
prototypes, they are representative of the kind of systems that are likely to be incorporated
into grids. It is interesting to note that many of these systems are geared towards scientific computing; they provide support for a style of application known as SPMD (Single Program Multiple Data) applications, in which identical processes each operate on a separate subdomain of the application data. SPMD applications are often time-stepped, with periodic exchanges of information at well-defined intervals.
2.1.1.1 DOME
DOME (Distributed Object Migration Environment) runs on top of PVM and
supports application-level fault-tolerance in heterogeneous networks of workstations
[BEGU97]. DOME defines a collection of data parallel objects such as arrays of integers or
floats that are automatically distributed over a network of workstations. DOME supports
the writing of SPMD applications in which a process is replicated on multiple nodes and
executes its computation over a different subset of the data. DOME provides support for
the checkpointing of SPMD applications. Similarly to the checkpointing techniques that
we use, DOME’s checkpoints support the recovery of applications on heterogeneous
architectures.
2.1.1.2 CVMULUS
CVMULUS is a library package for visualization and steering of fault-tolerant SPMD
applications for use on top of PVM [GEIS97]. In CVMULUS, programmers specify the
data decomposition of their applications. CVMULUS automatically uses this information
for checkpoint/recovery and is able to reconfigure applications even if the recovered
application uses fewer workers or tasks. Since CVMULUS is geared towards SPMD
applications, the consistency of application-wide checkpoints is easily maintained.
2.1.1.3 Other extensions to PVM and MPI
Fail-Safe PVM is an extension of PVM to provide application-transparent fault
tolerance based on checkpoint and recovery [LEON93]. While it achieves transparency,
Fail-Safe PVM requires modifications to the PVM daemons to monitor the flow of
messages between PVM tasks. Silva et al. provide a user-level library called PUL-RD to
support checkpointing and recovery of SPMD applications on top of MPI [SILV95].
Programmers are responsible for describing the data layout of their applications. Similarly
to CVMULUS, the PUL-RD library supports the recovery of applications with fewer
processes.
2.1.2 Isis, Horus and Ensemble
Isis, Horus and Ensemble are representative of systems that use a process group
abstraction to structure distributed applications [BIRM93, RENE96, HAYD98]. The central
tenet of such systems is that support for programming with distributed groups is the key to
writing reliable applications.
Process groups enable the realization of a virtually synchronous model of computation
wherein the notion of time is defined based on the ordering of messages [LAMP78].
Typically, a programmer uses various forms of multicast primitives for communication
with members of a group, e.g., causal multicast or totally ordered multicast. The receipt of
messages within a group may be ordered with respect to group membership changes,
thereby enabling programmers to write algorithms such that group members can logically
take some actions “at the same time” with respect to failures. Failures of processes are
treated as changes in the membership of a group. Only processes that are members of a
group are allowed to process messages. Thus, group membership as seen in Isis simulates
a fail-stop model in which processes fail by halting [SCHN83, SABE94].
The process group model has often been criticized on the basis of the end-to-end
argument [SALT90]. Critics of the model argue that the ordering properties guaranteed by
group communication primitives are provided at too low a level of abstraction, and in
some cases, may be unnecessary to meet the specifications of an application [CHER93].
Proponents of the model argue that the services provided by the model are invaluable in
developing fault-tolerant distributed applications [RENE93, BIRM94, RENE94].
It is interesting to view the progression of systems developed at Cornell University,
from Isis to Horus, and then to Ensemble, as a response to the end-to-end argument. While
Isis was a monolithic system, both Horus and Ensemble allow developers to configure and
customize the protocol stacks of processes to meet the needs of applications. In Ensemble,
the protocol stack of processes can be configured at run-time using an event-driven
paradigm, unlike the protocol stack of Horus which has to be configured statically.
The process group model has found acceptance in several domains, including
finance, groupware applications, telecommunications, military systems, factory automation
and production control [BIRM93]. For more information on the model and its application
to Internet systems, please see the recent book by Birman [BIRM96].
Our framework differs in that its focus is on integrating fault-tolerance techniques in
object-based systems, whereas the focus of Isis, Horus and Ensemble is on supporting the
process group abstraction. The two are not mutually exclusive; it is possible to layer a
reflective framework on top of ordered group communication primitives [FABR98].
For grid applications, it is too early to determine how much of a role the process group
model will play. However, the evolution from Isis to Ensemble points to a common design
goal of supporting flexibility and extensibility (§2.3).
2.1.3 Linda, Piranha and JavaSpaces
In Linda, processes in an application cooperate by communicating through an
associative shared memory abstraction called tuple space [CARR89]. A tuple in tuple space
names a data element that consists of a sequence of basic data types such as integers,
floats, characters and arrays. Linda defines four basic operations, out, in, rd and eval, to
access tuple space. Out is used to deposit tuples in tuple space; in and rd are used to search
tuple space. A useful property of in and rd is that they can specify a generic pattern with
which to search tuple space. Finally, eval is used to create a new process. The primary
advantages of Linda are that its four operations are simple to learn and that its shared
memory abstraction is easy for programmers to use. PLinda is an extension to Linda that
provides fault tolerance through the checkpointing and recovery of tuple space and the use
of a commit protocol to deposit and read tuples from tuple space [JEON94]. Another
fault-tolerant version of Linda is Piranha [CARR95]. Piranha supports a style of
computation known as master-worker parallelism, in which a master process generates a
set of tasks to be consumed by workers. Piranha enables users to treat a collection of hosts
as a computational resource base on which to assign tasks. When a user reclaims a host,
e.g., by pressing a key or clicking on the mouse, Piranha automatically reassigns the task
to another host, thus ensuring that an application eventually completes. The act of
reclaiming a host can be treated as a failure and is analogous to leaving a group in a
system with group membership.
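To make the four operations concrete, the following is a minimal, single-process C++ sketch of a Linda-style tuple space; the TupleSpace class and its restriction to <name, integer> tuples are our own illustration, not part of any Linda implementation, and real Linda adds pattern matching over arbitrary tuples, blocking semantics for in and rd, and the eval operation for process creation.

    #include <iostream>
    #include <map>
    #include <optional>
    #include <string>
    #include <vector>

    // A toy associative shared memory: tuples are <name, int> pairs.
    class TupleSpace {
    public:
        // out: deposit a tuple into tuple space
        void out(const std::string& name, int value) { space_[name].push_back(value); }

        // in: withdraw a tuple matching the given name, if one exists
        std::optional<int> in(const std::string& name) {
            auto it = space_.find(name);
            if (it == space_.end() || it->second.empty()) return std::nullopt;
            int v = it->second.back();
            it->second.pop_back();
            return v;
        }

        // rd: read a matching tuple without removing it
        std::optional<int> rd(const std::string& name) const {
            auto it = space_.find(name);
            if (it == space_.end() || it->second.empty()) return std::nullopt;
            return it->second.back();
        }

    private:
        std::map<std::string, std::vector<int>> space_;
    };

    int main() {
        TupleSpace ts;
        ts.out("task", 42);                                              // master deposits a task
        if (auto v = ts.rd("task")) std::cout << "rd: " << *v << "\n";   // still in tuple space
        if (auto v = ts.in("task")) std::cout << "in: " << *v << "\n";   // withdrawn by a worker
    }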
Linda and its derivatives are particularly well-suited to a master-worker style of
computation—a style that is prevalent in grid applications. We expect that over time, a
Linda-like abstraction will be ported to computational grids. We note that Linda is
currently a commercial product supported by Scientific Computing Associates, Inc.,
under the trade name Paradise®.
The Linda tuple model heavily influenced the development of the Jini JavaSpaces™
Specification [SUN99A]. Similarly to Linda, JavaSpaces provide the abstraction of an
associative shared memory in which Java programs can deposit and retrieve information.
JavaSpaces improve upon the Linda model in that Java programs can be automatically
notified of changes in the JavaSpace through events [SUN99A]. Both Linda tuple space
and JavaSpaces can be viewed as instances of a blackboard architecture in which
different components interact and coordinate actions based on state changes in a shared
repository [SHAW96].
2.2 Reflection
Smith introduced the concept of reflection: that of a computational process that can
reason about itself and manipulate representations of its own internal structure [SMIT82].
Two properties characterize reflective systems: introspection and causal connection.
Introspection enables a computational process to have access to its own internal structures.
Causal connection enables the computational process to modify its behavior directly by
modifying its internal data structures, i.e., there is a cause-and-effect relationship between
changing the values of the data structures and the behavior of the process. The internal
data structures are said to reside at the metalevel while the computation itself resides at the
baselevel; thus the metalevel controls the behavior of the baselevel.
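As a deliberately tiny illustration of these two properties, consider the following C++ sketch; BaseObject and MetaObject are hypothetical names and do not correspond to any particular reflective system. The metaobject's data structures are open to inspection (introspection), and changing them directly changes the behavior of the baselevel computation (causal connection).

    #include <iostream>

    struct BaseObject {                        // the baselevel computation
        int op(int x) { return x + 1; }
    };

    struct MetaObject {                        // resides at the metalevel
        bool trace = false;                    // internal structure open to inspection
        int invoke(BaseObject& base, int x) {  // the metalevel controls the baselevel
            if (trace) std::cout << "op(" << x << ") invoked\n";
            return base.op(x);
        }
    };

    int main() {
        BaseObject base;
        MetaObject meta;
        meta.invoke(base, 41);    // default behavior: no tracing
        meta.trace = true;        // modify a metalevel data structure...
        meta.invoke(base, 41);    // ...and the behavior of the process changes
    }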
Reflection provides a principled means of achieving open engineering, i.e., of
extending the functionality of a system in a disciplined manner [BLAI98]. A key attribute
of reflective systems is that of separation of concerns between the metalevel and the
baselevel. For example, Fabre et al. incorporated replication techniques into objects using
the reflective programming language Open C++ [FABR95]. The implementation of the
replication techniques was performed at the metalevel with little change to the underlying
baselevel application. The design and implementation of the replication techniques were
separated from the design and implementation of the actual application, thus allowing the
replication techniques to be composable with many applications. In general, reflective
architectures enable the composition of non-functional concerns with the underlying
computational process [STRO96].
Another advantage of reflective architectures is that they enable flexibility and
extensibility of functionality. Reflective architectures have been used in such diverse areas
as programming languages [MAES87, WATA88, KICZ91, AKSI98, TATS98, MOSS99,
WELC99], operating systems [YOKO92], real-time systems [SING97, STAN98, STAN99],
fault-tolerant real-time systems [BOND93], agent-based systems [CHAR96], dependable
systems [AGHA94], and distributed middleware systems, e.g., OpenORB [BLAI98],
FlexiNet [HAYT98], OpenCorba [LEDO99] and Legion [NGUY99].
A feature common to all reflective systems is that they answer two questions: What
internal structure or metalevel information (meta-information) is exposed to developers?
How does one access the metalevel? The answer to the first question is
application-dependent. For example, in real-time systems such as FERT or Spring [BOND93, STAN98]
the meta-information includes timing constraints of tasks, deadlines, and precedence
constraints. In a programming language such as CLOS, the meta-information includes
slots and methods [KICZ91]. In object-based distributed systems, meta-information can
include methods, arguments and replies [BLAI98, HAYT98, LEDO99, VILE97]. The answer
to the second question also varies. A popular method of programming the metalevel is
through an object-oriented paradigm in which a metalevel object defines and controls the
behavior of baselevel objects [MAES87, KICZ91]. Other means of accessing meta-
information include using compiler technology [FABR95, CHIB95, TATS98], configuration
files [MOSS99, WELC99], and events [NGUY98, PAWL98].
The reflective models developed in this dissertation reflect our target environment of a
computational grid. Incorporating fault-tolerance techniques in a distributed application—
a set of cooperating objects—requires manipulation of the internal as well as external
aspects of an object. Our models regulate both intra-object interactions, i.e., interactions
between modules inside an object, and inter-object interactions, i.e., interactions between
objects. The dual aspect of our models enables the integration of application-wide
algorithms such as checkpointing, in contrast to other reflective systems whose focus has
been on integrating techniques such as replication in server objects [FABR95, GUER97,
BLAI98, HAYT98].
A further difference between our architecture and other reflective middleware
architectures is that we do not use a metaobject protocol to control the behavior of the
baselevel [AGHA94, FABR95, GUER97, FABR98, HAYT98, LEDO99]. Instead, we present a
graph-and-event-based interface accessible through simple C++ library calls. In contrast,
other reflective approaches such as OpenCorba [LEDO99] and Garf [GUER97] rely on the
Smalltalk programming language. We believe that presenting a C++-based interface
expands our potential community of developers.
2.3 Events
Events have been used in a variety of contexts [SHAW96]: in graphical user interfaces,
to build protocol stacks [BERS95, BHAT97, HAYD98, VILE97], in integrated systems
[SULL96], and as a generic mechanism for component interactions [BENN95]. We separate
our discussion of events into two sections: local events and distributed events. Local events
propagate within the same address space whereas distributed events propagate to a
different address space.
2.3.1 Local events
2.3.1.1 Protocol stacks
Many projects such as SPIN [BERS95], Coyote [BHAT97] and Ensemble [HAYD98],
use an event-based paradigm for flexibility and extensibility. SPIN is a dynamically
extensible operating system that uses events as its extension mechanism. A SPIN event is
used to notify the system of a state change or to request a service. For example, an IP
extension to the kernel could announce the event PacketArrived. Events in SPIN are fine-
grained, reflecting their use in an operating system. Likewise, events in the Coyote project
are fine-grained, reflecting their use in a kernel designed for network protocols. Coyote
extends the x-kernel [HUTC91] and enables the construction of micro-protocols that
communicate via events. Micro-protocols implement low-level properties, e.g.,
acknowledging that a message has been received or maintaining a membership list of live
processes. By composing micro-protocols, the Coyote protocol stack can be easily
configured to implement higher-level properties, e.g., group remote procedure calls with
acknowledgment. Coyote was designed primarily for network protocols, and so the set of
pre-defined events relates mostly to messages, e.g., Message_Inserted_Into_Bag or
Message_Ready_To_Be_Sent. Ensemble uses events as the primary mechanism for
composing micro-protocols and supporting the process group abstraction. Example events
in Ensemble include Send-Message and Leave-Group.
The set of events exported by a system depends on the target environment and defines
the extension vocabulary with which developers can extend functionality. Since we target
an object-based system implemented over a message-passing communication layer, we
export events such as MessageSend and MethodReceive. Approaches such as Coyote or
our own, in which events manipulate data structures (e.g., messages) contained in shared
data structures (e.g., a message repository), can be viewed as a blackboard architecture
augmented with implicit invocations [SHAW96].
2.3.1.2 Graphical user interface
Events have been widely popular in implementing graphical user interfaces, e.g., the
MacOS®, Microsoft Windows®, and Java's Abstract Window Toolkit. Events enable the
separation of the visual aspects of a program from the actual computation. Typical events
in these systems deal with various aspects of the desktop metaphor, e.g., mouse, windows,
buttons, menus, keyboard input. Programmers can register event handlers to be notified of
user actions and take appropriate actions. However, coordinating events may be a difficult
task. Thus, most environments provide tools to facilitate the development of graphical
user interfaces, e.g., Java Swing, Visual Basic.
2.3.1.3 JavaBeans
JavaBeans™ is the component technology developed by Sun Microsystems for use
within the Java platform [SUN99B]. A bean is a reusable software artifact that can be
manipulated visually using a builder tool. Beans can communicate with one another using
an event paradigm. The advantages of using Beans are that they are portable across
heterogeneous architectures and that many tool builders are actively developing products
to support the development of JavaBeans.
2.3.2 Distributed events
Distributed events are used to communicate information between remote objects or
processes. In CORBA, the Event Service allows an object to register its interest in events
raised by other objects [BENN95]. CORBA defines two roles for objects: suppliers and
consumers. Suppliers produce events; consumers process them. Suppliers and
consumers may be directly linked, in which case events flow directly from the suppliers to
the consumers. Alternatively, an event channel may be defined to serve as an intermediary
object between suppliers and consumers. Using an event channel fully decouples suppliers
from consumers—consumers need not be active when suppliers deposit events on an
event channel. Furthermore, event channels may provide added functionality such as
filtering and persistence. The Jini Distributed Event Specification provides similar
functionality to CORBA's Event Service [SUN99A]. It also provides additional features,
such as the ability to bound, via leasing, the time during which an object is interested in an
event raised by some other object [SUN99A]. In Jini terminology, an event listener may
register to be notified of an event on a one-time basis, forever, or for a specified time
period.
The exoevent notification model developed in this dissertation is similar to both the
CORBA and the Java Distributed Event specifications in that it supports the flexible
propagation of events between objects. The distinguishing features of our model are that it
unifies the concepts of exceptions and events, i.e., an exception is simply a special kind of
event, and that it allows programmers to specify the propagation of events on a
per-application, per-object or per-method basis. The exoevent notification model does not
support the concept of leasing.
While we use distributed events in our work for the dissemination of data to support
fault-tolerance algorithms, we note that the publish/subscribe model supported by events
is generic. As an example, the Department of Defense’s High Level Architecture uses the
publish/subscribe model to propagate information about entities in distributed simulations
[DMSO98]. As another example, the Jini Discovery and Join Specification regulates how
devices can discover the presence of other devices on a network [SUN99A].
2.4 Aspect-oriented programming
The use of the event paradigm to extend functionality for middleware systems is
related to the issues of crosscutting and weaving in aspect-oriented programming [KICZ97].
Crosscutting is the concept that extensions to a modularly-designed program cannot be
constrained within the bounds of the original program decomposition. An example of
crosscutting in an object-oriented program would be the addition of synchronization
primitives at the beginning of each method. Kiczales' thesis is that crosscutting is
common in large software systems. Our experiences with middleware systems corroborate
his thesis; aside from implementing its functional requirements, an object may also handle
issues such as argument marshalling, security, debugging, performance monitoring and
synchronization. In aspect-oriented programming technology, these issues are called
aspects. Aspect-oriented programming languages elevate aspects to first-class status and
provide a clean separation between the functional decomposition of a program—objects
or modules—and non-functional requirements which pertain to the way objects and
modules relate to one another [HIGH99].
After aspects are elevated to first-class status, they must be composed with the
underlying program. This process is known as weaving and seems closely related to events
in the sense that events can be used to implement weaving. For example, an aspect for
debugging could be implemented easily in an object-based system by inserting an event
handler to intercept methods and log them to storage for future replay. An interesting
avenue of research would be to investigate the use of an aspect-oriented programming
language to extend the functionality of objects in computational grids, or alternatively, to
investigate the suitability of the event paradigm for weaving aspects. Pawlak et al. are
currently investigating this line of research [PAWL98].
2.5 Integrating fault tolerance in distributed systems
Fabre et al. present an excellent analysis of different approaches to integrating fault
tolerance in distributed systems [FABR95, FABR98]. They distinguish between three main
approaches: the system approach, the library approach and the inheritance approach. In
the system approach, the runtime system provides support for fault tolerance. For
example, Delta-4 [POWE94] offers several replication strategies such as passive,
semi-active and active replication to Delta-4 application programmers. In the library approach,
a set of functions is provided at the application-level to support a set of fault-tolerance
algorithms. For example, Isis [BIRM93], Horus [RENE96] and Ensemble [HAYD98]
provide developers with various forms of ordered communication primitives. In the
inheritance approach, an object can inherit fault-tolerance properties such as persistence
and recoverability from a base class. Examples of this approach include Avalon/C++
[DETL88] and Arjuna [ARJU92]. Fabre analyzes these approaches in terms of transparency,
reusability and composability, and argues that none meets these criteria simultaneously.
Fabre proposes the use of reflective techniques to meet these criteria and shows how to
integrate replication techniques into distributed objects using the reflective language
Open C++ [FABR95, FABR98]. Other systems that advocate the use of reflection to
incorporate fault-tolerance techniques include MAUD [AGHA94] and Garf [GUER97].
A fertile area of research has been to integrate fault-tolerance techniques into CORBA.
Moser et al. propose a fault-tolerance framework that implements fault-tolerance
management services both above and below an object request broker (ORB) [MOSE99].
Other projects such as Electra and Orbix+Isis integrate replication and group mechanisms
inside the ORB itself [MAFF95, LAND97]. DOORS (Distributed Object-Oriented Reliable
Service) provides fault-tolerance services as CORBA horizontal services [SCHO98].
Elnozahy et al. provide a library of fault-tolerance techniques that can be used in both
CORBA and DCE environments [ELNO95]. Except for DOORS, which is implemented
above the ORB layer, all the other projects use interception methods to implement
replication services. Interception is implemented by modifying the ORB itself [LAND97],
by providing a library to be called from within the ORB [ELNO95], or by using features of
the operating system [MOSE99]. The Orbix ORB includes the notion of filters to intercept
method calls. However, Marzullo's group at the University of California, San Diego,
reported difficulties in integrating the message-logging fault-tolerance technique with
Orbix [NAMP99]. Marzullo et al. suggest that an event-driven model would have
alleviated the reported difficulties [NAMP99].
The need to extend the functionality of ORBs has led several researchers to adopt a
reflective architecture to structure ORB implementations [BLAI98, HAYT98, LEDO99]. Our
development of the RGE and exoevent notification models also provides an extension
mechanism. The novelty of this work is to suggest the use of events as the primary
structuring mechanism for designing object request brokers and to specify both inter- and
intra-object communication within a unified model.
2.6 Summary
In designing our models, we drew inspiration from reflective systems as well as
previous work on flexible protocol stacks. Our approach differs in two respects from most
CORBA-based reflective middleware approaches: (1) we use a simple graph- and event-
based interface for extending object functionality instead of a metaobject protocol, and
(2), our reflective models are designed to extend the functionality of applications, not just
single server objects. In the next chapter, we present the cornerstone of our framework, the
reflective graph and event model. We show an application of our model in designing a
protocol stack and extending it with new functionality.
Make everything as simple as possible, but not simpler. — Albert Einstein (1879-1955)
Chapter 3
Reflective Graph and Event Model
The cornerstone of our framework is the specification of the reflective graph and event
(RGE) execution model. It provides a structural framework for providing basic object
functionality such as invoking methods, and marshalling and unmarshalling parameters,
similar to an object request broker (ORB) in CORBA systems [OMG95]. In addition, the
model provides a generic extension mechanism for incorporating new functionality into
objects—such functionality is encapsulated into reusable code artifacts, or modules. Thus,
the RGE model provides a common framework for fault-tolerance designers and tool
developers, and enables the integration and composition of fault-tolerance modules into
programming tools.
The novelty of this work is to suggest the use of events as the primary structuring
mechanism for designing object request brokers and to use a single model to specify both
inter- and intra-object communication. The RGE model employs graphs for inter-object
communication and events for intra-object interactions. Graphs represent interactions
between objects; a graph node is either a member function call on an object or another
graph, arcs model data and control dependencies, and each input to a node corresponds to
a formal parameter of the member function. Events specify interactions between modules
inside objects. Graphs and events are the building blocks with which fault-tolerance
developers can incorporate functionality inside objects and exchange protocol information
between objects.
The RGE model is reflective because it exposes the structure of objects (introspection)
and enables the extension of an object’s functionality through the modification of its
structure (causal connection). In an object-based system in which method invocation is
implemented over a message-passing layer, the structure of an object consists of data
structures to represent methods and messages. The distinguishing feature of the RGE
model is that it not only specifies the structure of a single object, but also the interactions
between objects. In other words, the RGE model specifies both inter- and intra-object
communication, and enables the incorporation of functionality at the application level.
We describe graphs in §3.1 and events in §3.2. We present the overhead of creating
and using graphs and events in §3.3. We explore the structure of an object by describing an
example protocol stack configured with events and illustrate the ease with which
developers can incorporate new functionality in §3.4.
3.1 Graphs
We use an existing graph model, macro data flow (MDF), to specify method
invocations on objects. MDF is a proven model and was first deployed in Mentat, an
object-based parallel processing system [GRIM96B]. In MDF, graph nodes are called
actors and represent method invocations on objects, arcs denote data dependencies
between actors, and tokens flowing across arcs represent data or control information.
MDF differs from most other data-flow models in that it allows for persistent actors—
actors that can retain state information between firings [BABB84, BROW90, BABA92,
BEGU92]. When an actor has a token on each of its input arcs, it may fire, i.e., execute its
corresponding method, and deposit a result token on each output arc. A special token, the
bottom token, represents an error value. If a bottom token is present on an input arc when
an actor fires, it may propagate the bottom token on its output arcs, or it may mask the
bottom token and output a normal token.
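The firing rule and bottom-token behavior can be sketched in a few lines; representing tokens as std::optional values is our own shorthand, not the MDF implementation.

    #include <iostream>
    #include <optional>
    #include <vector>

    using Token = std::optional<int>;   // std::nullopt stands for the bottom (error) token

    // An actor may fire once every input arc holds a token. This actor
    // propagates bottom if any input is bottom; an actor could instead
    // mask the bottom token and output a normal token.
    Token fire(const std::vector<Token>& inputs) {
        for (const Token& t : inputs)
            if (!t) return std::nullopt;          // propagate the bottom token
        int sum = 0;
        for (const Token& t : inputs) sum += *t;  // the actor's method: here, a sum
        return sum;                               // deposit a result token on the output arc
    }

    int main() {
        std::cout << fire({1, 2}).value() << "\n";                  // fires normally: 3
        std::cout << fire({1, std::nullopt}).has_value() << "\n";   // bottom propagated: 0
    }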
Graphs may be annotated with meta-level information in the form of <name, type,
value> triples. The name field is an arbitrary string, the type field indicates its type, and
the value field consists of arbitrary data. The name and type fields dictate the
interpretation of the value field. Annotations may propagate through the object method
invocation chain, in which case we call them implicit parameters. Implicit parameters
provide a mechanism for adding meta-level information transitively. They are similar to
CORBA's contexts in that they denote meta-level information and are part of the
environment when executing a method [OMG95]. However, unlike CORBA's contexts,
implicit parameters propagate through the method invocation call chain automatically. If
object A annotates its graph with an implicit parameter, invokes a method on object B, and
B invokes a method on object C, A's implicit parameter propagates to C. The ability to
propagate protocol information enables objects to receive generic contextual
information—information that is determined and specified at run-time—and behave
differently based on the presence or absence of such information.*
* Implicit parameters are similar to environment variables in Unix systems.
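The following sketch conveys the intent of implicit parameters; the types and the explicit passing of the annotation map are ours (in the RGE model, propagation through the method invocation chain is automatic).

    #include <iostream>
    #include <map>
    #include <string>

    // An annotation map standing in for <name, type, value> triples
    // (the type field is elided for brevity).
    using Implicits = std::map<std::string, std::string>;

    void methodOnC(const Implicits& implicits) {
        // C sees A's annotation even though B added nothing
        std::cout << "C sees Logger = " << implicits.at("Logger") << "\n";
    }

    void methodOnB(const Implicits& implicits) {
        methodOnC(implicits);   // implicit parameters propagate down the call chain
    }

    int main() {
        Implicits graphAnnotations{{"Logger", "<identity of logger object>"}};  // A annotates its graph
        methodOnB(graphAnnotations);
    }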
Figure 2 illustrates a fragment of code written in C++-like syntax and the
corresponding graph representation. If A.op1 outputs a bottom token, its successor node,
A.op2, may propagate the bottom token or it may mask it. In practice, bottom-token
propagation is useful for unblocking an object when its thread of control is blocked
waiting on a return value (line 7).
A benefit of using program graphs is that opportunities for parallel execution are
captured implicitly [BABB84, BROW90, BABA92, BEGU92, GRIM96B]. In Figure 2, the calls to
A.op1 and B.op1 (lines 4-5) may proceed concurrently because there are no data
dependencies between them. Furthermore, unlike in a traditional client/server model, the
results from the method invocations on lines 4 and 5 can be forwarded to A.op2 directly,
instead of returning to the Main object. For more details about the MDF model and its use
in exploiting parallelism, please see the literature [GRIM96B].
The salient feature of RGE graphs is that they are first-class entities. They may be
assembled at run-time, transformed, passed as arguments to other objects, and executed
remotely. The interface to the graph facili ties consists of library routines to build graph
nodes, add tokens, add arcs, annotate graphs, execute graphs, and wait on return values.
FIGURE 2: Code fragment and RGE graph

    1  int a, b;
    2  int x, y, z;
    3  ...
    4  x = A.op1(a);
    5  y = B.op1(b);
    6  z = A.op2(x, y);
    7  printf("z = %d\n", z);

    [Graph: nodes A.op1 and B.op1, fed by input tokens a and b, both feed
    node A.op2, which outputs the result token z]
Calls to these routines can be hand-coded, generated by a compiler front-end or other
automated tool [FERR98, GRIM96A], or embedded into libraries [LEGI99].
3.1.1 Graph API
We show the implementation of the example code fragment of Figure 2 to illustrate the
use of the graph interface (Figure 3). A LegionLOID is an object identifier and names an
object. A LegionInvocation denotes a graph node, a LegionParameter
corresponds to a token, and a LegionImplicitParameter is used to annotate the graph.
A LegionParameter is assigned an integer value that identifies the arc to which it
belongs. A LegionBuffer is a data structure that stores generic typed values and
enables type conversion between heterogeneous architectures.
Lines 1 through 14 consist of variable declarations. On lines 8 and 9, we declare two
instances of the object MyObject, A and B. On line 12, we create a program graph and use
Legion.getMyLoid() to obtain the identity of the graph creator. On line 13,
LegionCoreHandles are used as handles to objects A and B.
Lines 15 through 20 illustrate the implementation of x = A.op1(a). On line 16, we
create a graph node with a call to invoke(). The first argument, "op1", is the method
signature (recall that a graph node corresponds to a method invocation). The second
argument specifies the number of input arcs, or parameters. The third argument specifies
the number of output arcs, or return values. After creating the graph node, we add it to the
graph (line 17). On lines 18 and 19, we create a parameter and add it to the graph.
The implementation of y = B.op1(b) on lines 21 through 25 is similar, so we do
not describe it further.
The implementation of z = A.op2(x,y) is shown on lines 27 through 33. We use
add_invocation_parameter() to specify that the output arcs from A.op1 and
B.op1 correspond to the input arcs of A.op2. Note that the constant
METHOD_RETURN_VALUE denotes the return value of a method.
FIGURE 3: Example use of the graph API

     1  // declarations
     2  LegionInvocation inv1, inv2, inv3;       // graph nodes
     3  LegionParameter param;
     4  int a, b, z;
     5
     6
     7  // create two instances of MyObject
     8  LegionLOID A = LegionCreateObject("MyObject");
     9  LegionLOID B = LegionCreateObject("MyObject");
    10
    11  // create the program graph; getMyLoid() returns the creator's identity
    12  LegionProgramGraph G(Legion.getMyLoid());
    13  LegionCoreHandle A_handle(A), B_handle(B);
    14
    15  // x = A.op1(a)
    16  inv1 = A_handle.invoke("op1", 1, 1);
    17  G.add_invocation(inv1);
    18  param = make_parameter(a);
    19  G.add_constant_parameter(inv1, param, 1);
    20
    21  // y = B.op1(b)
    22  inv2 = B_handle.invoke("op1", 1, 1);
    23  G.add_invocation(inv2);
    24  param = make_parameter(b);
    25  G.add_constant_parameter(inv2, param, 1);
    26
    27  // z = A.op2(x, y)
    28  inv3 = A_handle.invoke("op2", 2, 1);
    29  G.add_invocation(inv3);
    30  // x: the output of A.op1 is the first parameter
    31  G.add_invocation_parameter(inv3, inv1, 1, METHOD_RETURN_VALUE);
    32  // y: the output of B.op1 is the second parameter
    33  G.add_invocation_parameter(inv3, inv2, 2, METHOD_RETURN_VALUE);
    34
    35  G.execute();                             // execute the graph
    36
    37  // retrieve the return value of A.op2;
    38  // buffers hide data structures and enable type conversion
    39  LegionBuffer buffer;
    40  buffer = G.get_value(inv3, METHOD_RETURN_VALUE);
    41  buffer.get_int(&z);
    42  printf("z = %d\n", z);
On line 35, we execute the graph. On lines 37 through 42, we retrieve and print the
return value from A.op2 .
In Figure 4, we show the full graph interface, including functions to annotate the
program graph. Note that in practice, we do not hand-generate code to implement graphs
but instead rely on a stub generator tool.
FIGURE 4: Graph interface

    class LegionProgramGraph {
    public:
        // constructor; assign an identity for the graph
        LegionProgramGraph(LegionLOID me);

        // add a graph node
        LegionInvocation add_invocation(LegionInvocation invocation);

        // add a parameter to a graph node
        ParameterStatus add_constant_parameter(LegionInvocation target,
                                               LegionParameter parameter,
                                               int parameter_number);

        // link the output of a graph node to the input of another graph node
        ParameterStatus add_invocation_parameter(LegionInvocation target,
                                                 LegionInvocation source,
                                                 int parameter_number,
                                                 int result_parameter_number);

        // execute the graph
        int execute();

        // get output values from a graph node
        LegionBuffer get_value(LegionInvocation sink, int parameter_number);

        // annotate an output of a graph node
        void annotate_output(LegionInvocation inv, int output_number,
                             LegionImplicitParameter annotation);

        // annotate a graph node
        void annotate_invocation(LegionInvocation inv,
                                 LegionImplicitParameter annotation);

        // annotate a result
        void annotate_result(LegionInvocation inv, int result_number,
                             LegionImplicitParameter annotation);
    };
3.2 Events
Flexibility and extensibility are key requirements in computational grids to support a
wide range of functionality, including fault tolerance, security and scheduling [GRIM98].
These requirements drive our adoption of the event paradigm for structuring the internal
implementation of objects. Events provide a unifying mechanism for intra-object
interactions; they are conceptually easy to understand and are familiar to programmers;
and they allow the independent development of modules. Furthermore, they enable the
easy addition or deletion of functionality, providing a basis for extending the behavior of
objects.
Our event model is defined by events, event kinds, event handlers and event managers.
An event is a data structure that represents a state transition inside an object. It is used to
notify interested parties that something of interest has occurred. An event contains user-
defined data as well as a tag to denote its event kind. An event kind serves as a template for
an event. An event kind contains a set of event handlers—functions that are invoked upon
the occurrence of an event. In this dissertation, we name events by their event kinds. For
example, a MethodReceive event means an event whose event kind is MethodReceive. An
event manager regulates when handlers are invoked. Events are announced, or raised, in
one of two modes, asynchronous or synchronous. In the former case, an event manager
stores the event in an internal queue for later delivery. In the latter, the handlers are
invoked immediately. The order in which an event manager invokes handlers is
determined by the priority assigned to handlers upon registration with an event kind. Note
that an event handler can postpone or prevent the execution of handlers with lower
priorities.
In Figure 5, we illustrate how a module Y can be notified of the event MethodReceive
announced by a module X: (1) Y registers the handler, HandlerForY, with the event kind
MethodReceive. Note that we assign HandlerForY a higher priority than the previous
handler, SomeHandler, thereby ensuring that HandlerForY will be the first handler
invoked. (2) X creates a MethodReceive event—an event whose event kind is
MethodReceive. Upon creation of the event, X could attach event-specific data. (3) X
announces a MethodReceive event using an event manager. In this example, the event is
simply enqueued. (4) The event manager dequeues and processes the event by calling the
associated handlers. Upon processing the MethodReceive event just enqueued, the
manager invokes the handler HandlerForY, thereby notifying module Y that a
MethodReceive event has been announced. Note that apart from application-specific data
manipulation, each of these actions requires developers to write only one or two lines of
code.
    (1) MethodReceive.addHandler(HandlerForY, HighPRIO);
    (2) data_ptr = ...;   // set according to application
        myEvent = new LegionEvent(MethodReceive, data_ptr);
    (3) EventManager.announce(myEvent);
    (4) EventManager.flushEvents();

    [Diagram: modules X and Y; the event manager's handler list for the
    MethodReceive event kind holds HandlerForY() ahead of SomeHandler(), and
    announced events wait in the event manager's queue until flushed]

FIGURE 5: Example use of events
The event model enables flexibility and extensibility by allowing modules to add,
modify and remove handlers. New event kinds may be added, and handler priorities may
be changed to affect the order in which handlers are processed. In subsequent chapters, we
will use events to incorporate fault-tolerance functionality inside objects.
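The semantics just described—priority-ordered handlers and the two announcement modes—can be captured in a small, self-contained C++ sketch; the classes below are our own reduction of the model, not the Legion interface of §3.2.1.

    #include <functional>
    #include <iostream>
    #include <map>
    #include <queue>
    #include <string>
    #include <utility>

    using Handler = std::function<void(const std::string&)>;   // event data reduced to a string

    class EventKind {
    public:
        void addHandler(Handler h, int priority) { handlers_.emplace(priority, std::move(h)); }
        void dispatch(const std::string& data) const {
            for (const auto& [prio, h] : handlers_) h(data);    // highest priority first
        }
    private:
        std::multimap<int, Handler, std::greater<int>> handlers_;
    };

    class EventManager {
    public:
        void announceNow(const EventKind& k, const std::string& data) {  // synchronous
            k.dispatch(data);
        }
        void announceLater(const EventKind* k, std::string data) {       // asynchronous
            queue_.push({k, std::move(data)});
        }
        void flushEvents() {
            while (!queue_.empty()) {
                auto [k, data] = queue_.front();
                queue_.pop();
                k->dispatch(data);
            }
        }
    private:
        std::queue<std::pair<const EventKind*, std::string>> queue_;
    };

    int main() {
        EventKind methodReceive;
        methodReceive.addHandler([](const std::string& d) { std::cout << "SomeHandler: "  << d << "\n"; }, 1);
        methodReceive.addHandler([](const std::string& d) { std::cout << "HandlerForY: " << d << "\n"; }, 10);
        EventManager mgr;
        mgr.announceLater(&methodReceive, "MethodReceive");   // queued for later delivery
        mgr.flushEvents();                                    // HandlerForY runs before SomeHandler
    }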
3.2.1 Event API
We present the interface to our event model in Figure 6. A LegionEvent represents an
event, a LegionEventHandler corresponds to an event handler, a LegionEventKind denotes
an event kind, and a LegionEventManager represents an event manager.
On line 2, we define a LegionEventHandler as a function that takes a LegionEvent as
argument and returns a LegionHandlerStatus. LegionHandlerStatus specifies whether
subsequent handlers should be invoked. Lines 4 through 13 show the interface to a
LegionEventKind. The constructor for a LegionEventKind takes an integer argument
which is used to identify the LegionEventKind. The functions addHandler and
deleteHandler are used to register and remove event handlers. On lines 15 through 22, we show the
the interface to a LegionEvent. Associated with a LegionEvent is an integer identifying the
LegionEventKind and an optional data field. On lines 24 through 31, we show the
interface for a LegionEventManager. Announcing an event is done via the announce
function. An event can be announced asynchronously, in which case
LegionEventQueuingDiscipline is set to LegionEventAnnounceLater.
Alternatively, an event can be announced synchronously, in which case
LegionEventQueuingDiscipline is set to LegionEventAnnounceNow.
3.3 Overhead for graphs and events
We present performance overhead for graphs and events in Table 1. All numbers were
obtained on a 400 MHz dual-processor Pentium II running the Linux operating system and
are averaged over 10000 calls.
FIGURE 6: Event interface

     1  // an event handler is a function that takes a LegionEvent as input;
        // it returns whether subsequent handlers should be invoked
     2  typedef LegionHandlerStatus (*LegionEventHandler)(LegionEvent);
     3
     4  class LegionEventKind {
     5  public:
     6      // constructor for LegionEventKind; assigns an integer identifier
     7      LegionEventKind(int kind);
     8      // add a handler with a given priority
     9      int addHandler(LegionEventHandler h,
    10                     LegionEventHandlerPriority priority);
    11      // delete a handler
    12      int deleteHandler(LegionEventHandler h);
    13  };
    14
    15  class LegionEvent {
    16  public:
    17      // create a new event; associate an event kind and data
    18      LegionEvent(int kind, void *data = NULL);
    19
    20      // get the data associated with the event
    21      void *getData();
    22  };
    23
    24  class LegionEventManager {
    25  public:
    26      // announce an event immediately or queue it for deferred announcement
    27      int announce(LegionEvent e,
    28                   LegionEventQueuingDiscipline d = LegionEventAnnounceLater);
    29      // flush all queued events and execute their handlers
    30      int flushEvents();
    31  };
Creating an event requires less than 4 µs. Announcing an event synchronously with up
to 16 null handlers requires 1.4 µs. There is an order of magnitude difference between
announcing events synchronously and asynchronously (1.4 µs vs 44.3 µs). In the
asynchronous case, the additional overhead consists of queuing and dequeuing events
from the event manager’s internal queue. Creating a graph with no arguments takes 257
µs. Creating a graph with one argument requires 364 µs. Each additional argument adds
about 90 µs of overhead to the graph creation time.
Executing a graph with no arguments takes 1.267 ms. Each additional argument adds
about 150 µs to the graph execution time. Measurements for graph executions include the
graph execution time and the time to traverse the protocol stack immediately prior to the
Network Module (§3.4.1). Execution times for a full remote invocation are provided in
Chapter 7.
TABLE 1: Overhead of graphs and events

    Test name                                                   Overhead
    Create event                                                3.6 µs
    Synchronously announce 1 event (with 16 null handlers)      1.4 µs
    Asynchronously announce 1 event (with 16 null handlers)     44.3 µs
    Create graph (0 arguments)                                  257 µs
    Create graph (1 argument)                                   364 µs
    Create graph (2 arguments)                                  454 µs
    Execute graph (0 arguments)                                 1267 µs
    Execute graph (1 argument)                                  1412 µs
    Execute graph (2 arguments)                                 1561 µs
3.4 Structure of an object
To understand how a fault-tolerance developer would incorporate functionality into
applications, we first present an example of a protocol stack configured using the event
paradigm [VILE97]. Then, we show an example of incorporating new functionality.
3.4.1 Overview of a protocol stack
Only a few events are needed to implement the core features of a protocol stack.† We
classify these events into three broad categories: message-related, method-related and
object management-related events. These events reflect our assumptions of an object-
based system in which communication is implemented over a message-passing fabric.
Table 2 describes the major event kinds used in configuring the protocol stack. The set of
events defines the vocabulary that designers can use to implement their algorithms.
† A more accurate description would be that of a protocol graph, as events allow arbitrary connections between modules. Nevertheless, we reuse the term protocol stack because of its familiarity to most readers.
TABLE 2: Sample set of events for building protocol stack of an object

    Category                  Event Kind        Description
    Message-related events    MessageReceive    Object has received a message
                              MessageSend       Object is sending a message
                              MessageComplete   Object has sent a message successfully
                              MessageError      Error in sending message
    Method-related events     MethodReceive     Object has received a complete method
                                                invocation; all parameters have been
                                                received
                              MethodSend        Object is invoking a method on another
                                                object
                              MethodDone        Object is done servicing a method
    Object-related events     ObjectCreated     An object has been created
                              ObjectDeleted     An object has been deleted
Figure 7 illustrates the major components of a protocol stack. In order to invoke a
method on a remote object, the GraphModule announces a MethodSend event for each
node in the graph that has the sender as a source of an input token. In turn, the
MessageLayerModule bundles parameters into a message and announces a MessageSend
event. Finally, the NetworkModule sends the message over the network. On the receiving
side, the NetworkModule announces a MessageReceive event upon receipt of a message
from the network. The MethodAssemblyModule determines whether the received message
is sufficient to form a complete method invocation (recall that in data flow, multiple
messages may be required to trigger a method execution). If the message results only in a
partial method invocation, the object stores the message in an internal database. When the
required messages arrive to complete the method invocation, a MethodReceive event is
raised. At this point, the MethodInvocationModule stores the complete method in a
database of ready methods. A server loop may then extract ready methods from the
database and execute them.
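As a toy rendering of this flow, the sketch below wires two handlers together using nothing but event announcements; the event kinds follow Table 2, while the handler registry and string-valued messages are our simplifications.

    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    std::map<std::string, std::vector<std::function<void(const std::string&)>>> handlers;

    void announce(const std::string& kind, const std::string& data) {
        for (auto& h : handlers[kind]) h(data);
    }

    int main() {
        // MessageLayerModule: bundle parameters into a message
        handlers["MethodSend"].push_back([](const std::string& method) {
            announce("MessageSend", "msg(" + method + ")");
        });
        // NetworkModule: put the message on the wire
        handlers["MessageSend"].push_back([](const std::string& message) {
            std::cout << "network <- " << message << "\n";
        });
        // GraphModule: a graph node fires, invoking a remote method
        announce("MethodSend", "op1");
    }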
3.4.2 Example of incorporating new functionality
We now show the ease with which a developer can add functionality to a user
application. Consider the case wherein a developer wishes to incorporate logging facilities
to record the exchange of methods in an application, perhaps to support post-mortem
debugging [MORG99] or fault-tolerance [NAMP99]. A simple way to implement this
functionality is to use implicit parameters to propagate the identity of a logger object.
Upon receiving a method, an object searches for the identity of the logger object in its
implicit parameter list. If the object finds the identity of the logger, it forwards the method
to the logger object prior to servicing the method.
    [Diagram — protocol stack of an object using modules: on the sending side,
    the GraphModule announces a MethodSend event, the MessageLayerModule
    announces a MessageSend event, and the NetworkModule sends the message
    over the network; on the receiving side, the NetworkModule announces a
    MessageReceive event, the MethodAssemblyModule announces a MethodReceive
    event, and the MethodInvocationModule queues the ready method]

FIGURE 7: Structure of an object: sample protocol stack
To implement logging, a developer can add a handler with the MethodReceive event
kind to intercept incoming methods. The handler extracts the identity of the logger object
from the method, builds and executes a graph that corresponds to a method invocation on
the logger object to log the method. Figure 8 shows the body of the handler and the
registration of the handler with MethodReceive. A more detailed example is described in
Chapter 5.
This example illustrates a typical implementation of a fault-tolerance technique.
Events are used to intercept and manipulate methods. Within a handler, we make method
invocations on remote objects. Note that we do not show the graph associated with the call
on the logger object. Developers may hand-generate calls to the graph interface or use an
automated tool [LEGI99].

    LogMethodHandler(LegionEvent event) {
        METHOD method = event.getMethod();     // extract method data structure
        LOGGER logger = method.getLogger();    // extract identity of logger
        if (logger)
            logger.logMethod(method);          // forward method to logger (remote call)
    }

    // the MethodReceive event kind has been declared elsewhere;
    // we add a new handler
    MethodReceive.addHandler(LogMethodHandler);

FIGURE 8: Adding a handler for logging methods (pseudo-code)

3.5 Summary
We have presented the reflective graph and event model and provided examples of its
use in building a protocol stack (or object request broker) and in incorporating new
functionality. The RGE model exposes the structure of applications to fault-tolerance
designers and programming tool developers and provides both parties with a common set
of abstractions. In the next chapter, we present a distributed event notification model, the
exoevent notification model, that is based on the RGE model. Using the RGE and
exoevent notification models, we will then show how to encapsulate fault-tolerance
algorithms into modules and integrate them into programming tools.
The best way to predict the future is to invent it. — Alan Kay
Chapter 4
Exoevent Notification Model
In Chapter 3, we presented the RGE model and showed how to incorporate new
functionality by using graphs for inter-object communication and events for intra-object
interaction. We now present the exoevent notification model, a flexible distributed event
notification model based on the RGE model. We show how to use the model to propagate
information between objects to support the construction of fault-tolerance algorithms.
The exoevent notification model supports the abstraction of a distributed event
notification service [OMG95, SUN99A]. In a distributed event notification service, events
can cross object boundaries—an object A can register to be notified of events raised by
another object B. We call such events exoevents. An object raises an exoevent to notify
other objects that an event of interest has occurred. Raising an exoevent causes the
execution of associated exoevent handlers—RGE graphs that describe method
invocations on objects. Thus, raising an exoevent may result in the invocation of methods
on remote objects. Unlike the CORBA and Java event models, the exoevent notification
model permits the run-time specification of propagation policies—where to propagate
exoevents—on a per-application, per-object or per-method basis.
We describe the exoevent notification model in §4.1 and illustrate several notification
policies in §4.2. We show the interface to the model in §4.3. In §4.4, we measure its
performance overhead. In §4.5, we show an example set of exoevents exported by objects.
In §4.6, we provide three applications of the model in implementing a simple failure
detector.
4.1 Description
Before presenting the exoevent notification model, we first define the following terms:
exoevent, exoevent interest, and exoevent interest set.
An exoevent is a data structure that consists of a set of descriptors. A descriptor is a
2-tuple, <name, data>. The name field identifies the descriptor while the data field contains
arbitrary data. By convention, an exoevent must have exactly one descriptor whose name
field is set to “ExoEventType”. The data associated with this descriptor is a string that
categorizes the exoevent, e.g., “Exception:ObjectCrash”. As a convention, we delineate
categories and subcategories with a “:”, with the leftmost category being the most generic.
We say that an exoevent is of type Z, or that it is a Z exoevent, when the descriptor
“ExoEventType” is set to Z. In Table 3, we show an example of an “Exception” exoevent.
By convention, descriptors should include a description of the event, the identity of the
raiser, and the signature of the method that raises the exoevent.
An exoevent interest is a 2-tuple, <category string, exoevent handler>. The category
string serves as a filter to specify interest in a specific type of exoevent, e.g., interest in an
“Exception” exoevent. An exoevent handler is an RGE graph that is to be executed if there
is a match between an exoevent interest and an exoevent. We say that a match has
occurred when the category string of an exoevent interest is a prefix of the descriptor
“ExoEventType” of an exoevent. For example, “Exception” matches
“Exception:ObjectCrash”. With this convention it is simple to specify interest in an entire
category, e.g., the exoevent interest “Exception” matches all exceptions. As a convention,
we use the category string “All” to denote interest in all exoevents. Finally, an exoevent
interest set is a set of exoevent interests.
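A self-contained sketch of the matching rule and of raising an exoevent against an interest set follows; the types and function names are ours, and the std::function handler merely stands in for an RGE exoevent handler graph.

    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct ExoEvent {
        std::map<std::string, std::string> descriptors;  // e.g. "ExoEventType" -> "Exception:ObjectCrash"
    };

    struct ExoEventInterest {
        std::string category;                          // filter, e.g. "Exception"
        std::function<void(const ExoEvent&)> handler;  // stand-in for an RGE handler graph
    };

    // A match occurs when the category string is a prefix of the exoevent's
    // "ExoEventType" descriptor, or is the special string "All".
    bool matches(const std::string& category, const ExoEvent& e) {
        if (category == "All") return true;
        const std::string& type = e.descriptors.at("ExoEventType");
        return type.compare(0, category.size(), category) == 0;   // prefix match
    }

    // Raising an exoevent executes the handler of every matching interest.
    void raiseExoEvent(const std::vector<ExoEventInterest>& interestSet, const ExoEvent& e) {
        for (const auto& interest : interestSet)
            if (matches(interest.category, e)) interest.handler(e);
    }

    int main() {
        std::vector<ExoEventInterest> interestSet = {
            {"Exception", [](const ExoEvent& e) {
                std::cout << "caught " << e.descriptors.at("ExoEventType") << "\n";
            }},
        };
        ExoEvent crash{{{"ExoEventType", "Exception:ObjectCrash"}}};
        raiseExoEvent(interestSet, crash);   // prints: caught Exception:ObjectCrash
    }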
We also define the following roles for objects: registrar, monitored, and catcher. A
registrar object registers interest in an exoevent, i.e., it specifies an exoevent interest. A
catcher object is an object that is invoked as a result of raising an exoevent. An object that
raises an exoevent is said to be a monitored object. Since an object may have multiple
roles, the terms registrar, monitored, and catcher may describe a single object.
TABLE 3: Example of a typical exoevent

    Descriptor name    Descriptor data
    “ExoEventType”     “Exception”
    “Description”      “Error: Exception raised at 12:40:30”
    “Raiser”           <identity of raising object>
    “Method”           “int methodFoo(int, int)”

4.1.1 Registering interest in an exoevent
The exoevent notification model defines two scoping levels for specifying interest in
an exoevent, object scope and method scope. In object scope, an exoevent interest is valid
across method calls on a monitored object, while with method scope, an exoevent interest
is valid only during the execution of one method. Further, with method scope, the
exoevent interest propagates transitively, i.e., if a monitored object invokes a method
start() on object A, and A invokes a method go() on object B, then the exoevent interest
specified by monitored would be valid during the execution of B.go(). If upon raising an
exoevent a monitored object finds a match at both the object and method scopes, the
exoevent handlers from the method scope level are executed first, followed by the
exoevent handlers from the object scope level.
In a computational grid environment where an application may be composed of
dynamically-created objects, e.g., PVM [GEIS94], MPI-2 [GROP99], MPL [GRIM96A], the
propagation of an exoevent interest provides fault-tolerance developers with a mechanism
for obtaining and propagating information to all objects within an application, including
objects that are created at run-time.
Irrespective of the method used to specify interest in an exoevent, the functions
specified in an exoevent handler graph must take as their first argument an exoevent. The
signature of such a function is of the form:
void SomeCatcherFunction(LegionExoEvent, ...)
A catcher object can then retrieve the data contained in the exoevent. This restriction
on the signature of the catcher function is similar to the Unix convention for signal
handlers.
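As an illustration, a hypothetical catcher function might look as follows (the body is ours; only the signature convention comes from the model, and the descriptor accessor is the one listed in Figure 12):

  void MyExceptionCatcher(LegionExoEvent exo) {
      // The exoevent is always the first argument, per the convention above.
      Data description = exo.getDescriptor("Description");  // retrieve descriptor data
      // ... application-specific handling of the exception ...
  }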
4.1.2 Object scope
To support object scope, a monitored object exports the methods:
void RegisterExoEventHandler(LegionLOID registrar, ExoEventInterestSet set);
void UnregisterExoEventHandler(LegionLOID registrar);
RegisterExoEventHandler() specifies a set of exoevent interests. When a
monitored object raises an exoevent, the exoevent is matched against all exoevent interests
in the exoevent interest set. When a match is made, the exoevent handler contained in the
exoevent interest is executed. Execution of an exoevent handler may result in method calls
on one or more catcher objects. UnregisterExoEventHandler() removes the
exoevent interest set previously specified by a registrar object.
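A hypothetical registrar might use these methods as follows (the interest methods are those of Figure 12; the interest-set insertion call and the graph catcherGraph are assumed for illustration):

  LegionExoEventInterest interest;
  interest.set_categoryInterest("Exception");             // match any "Exception:*" exoevent
  interest.set_exoeventHandler(catcherGraph);             // RGE graph that invokes the catcher
  ExoEventInterestSet set;
  set.insert(interest);                                   // assumed insertion method
  monitored.RegisterExoEventHandler(registrarLOID, set);  // object-scope registration
  // ... later, when notifications are no longer needed:
  monitored.UnregisterExoEventHandler(registrarLOID);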
4.1.3 Method scope
In order to specify and propagate an exoevent interest, a registrar object inserts one or
more exoevent interests in an exoevent interest set and uses implicit parameters to
annotate its program graph with the exoevent interest set. Since implicit parameters
propagate automatically (§3.1), the interest set will be available to all objects in the call
chain. Thus, each object in the call chain becomes a monitored object.
We configure monitored objects such that upon raising an exoevent, we search for a
match in the exoevent interest set. If a match is found, we execute the corresponding
exoevent handlers.
4.2 Policies
We illustrate the flexibility of the exoevent notification model by demonstrating
several exoevent propagation policies: notify-root (§4.2.1), notify-client (§4.2.2), notify-
third-party (§4.2.3), and notify-hybrid (§4.2.4). The notify-root and notify-client policies
use method scope, the notify-third-party uses object scope, and the notify-hybrid policy
uses both method and object scope.
4.2.1 The notify-root policy
In this policy, all exoevents of interest propagate to a root object. The root object is the
object from which all other objects in an application are transitively created. In
computational grid environments, the root object is often the object that is invoked at the
command line. In this policy, the root object is both a registrar and a catcher object. This
policy is useful when the root object monitors the execution of an application. As an
example, the root object could be notified of all exceptions raised during the execution of
an application, including security exceptions or communication failure exceptions. As
another example, the root object could monitor the progress of an application by catching
“I am Alive” exoevents raised by objects periodically. Based on this monitoring activity,
the root object can take actions based on the notification (or lack of notification) of
exoevents. For example, a root object could terminate an application if any exceptions are
encountered. Furthermore, such a root object could be written as a generic application
manager to monitor and control any user applications.
To implement this policy, the root object creates the exoevent interest shown in Table
4 and uses implicit parameters to propagate the interest through the method invocation
chain.
TABLE 4: Exoevent interest for notify-root policy
  category string    “All”
  exoevent handler   Graph: root.notify
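A minimal sketch of this setup, using the calls from §4.3 and Figure 12 (rootNotifyGraph stands for a handler graph, built as described in §3.1.1, that invokes root.notify):

  LegionExoEventInterest interest;
  interest.set_categoryInterest("All");           // match every exoevent
  interest.set_exoeventHandler(rootNotifyGraph);  // graph that invokes root.notify
  LegionExoEventCatcherEnable(interest);          // register; implicit parameters then
                                                  // propagate the interest set (§4.1.3)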
4.2.2 The notify-client policy
In the notify-client policy, the exoevent raised by a monitored object propagates to its
immediate caller, i.e., its client. This policy can be used to implement the traditional style
of exception handling wherein the caller is notified of exceptions [STRO97]. Upon receipt
of an exoevent, the caller can take several actions, including retrying the request or re-
raising the exoevent.
Figure 9 shows three objects, root, A, and B. Root invokes the method A.start() and
within A.start(), A invokes B.go().
For object A to be the catcher for exoevents raised by B.go(), A specifies the exoevent
interest shown in Table 5.

TABLE 5: Exoevent interest for notify-client policy
  category string    “All”
  exoevent handler   Graph: A.notify
A possible application of the notify-client policy is masking the propagation of
exoevents. In our example of Figure 9, assume that root uses the notify-client policy upon
FIGURE 9: The notify-client policy (root invokes A.start(); within A.start(), A invokes B.go(); exoevents raised by B.go() propagate to A)
invoking A.start(), and A.start() uses the notify-client policy upon invoking B.go(). When
A catches an exoevent raised by B.go(), it can handle the exoevent or it can re-raise the
exoevent so that root can be notified.
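A sketch of such a masking handler on A follows (canHandleLocally and handle are hypothetical helpers; re-raising uses the call from §4.3):

  void A_notify(LegionExoEvent exo) {
      if (canHandleLocally(exo)) {
          handle(exo);               // handle and mask: root is never notified
      } else {
          LegionRaiseExoEvent(exo);  // re-raise so that root's interest can match
      }
  }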
4.2.3 The notify-third-party policy
In the previous two policies, the catcher object was part of the application defined by
the set of objects created transitively from a root object. This view of an application maps
well to several common programming environments in metasystems [GEIS94, GRIM96A].
However, in the case of server objects—objects that provide services to multiple
applications—the catcher object may not be part of the application that is requesting the
service. Consider Figure 10, in which a server object S is used by two applications. The
first application consists of the objects AppA and A, while the second application consists
of the objects AppB and B.

FIGURE 10: Propagating exoevents to a catcher object (server S is invoked by applications AppA/A and AppB/B; exoevents raised by S propagate to a third-party catcher object)
Using the notify-third-party policy, a catcher object can register to be notified of all
exoevents raised by S by registering the exoevent interest shown in Table 6 using object
scope (§4.1.2).

TABLE 6: Exoevent interest for notify-third-party policy
  category string    “All”
  exoevent handler   Graph: catcher.notify
4.2.4 The notify-hybrid policy
The notify-hybrid policy illustrates the flexibility of the model by combining the three
policies shown previously: notify-root, notify-client, and notify-third-party. For example,
Figure 11 illustrates the combined use of all three policies. Object S is a server object that
is used by two applications. The first application uses the notify-root policy so that
exoevents raised by S while servicing a call from A propagate to AppA. The second
application uses the notify-client policy so that exoevents raised by S propagate to B. The
catcher object uses the notify-third-party policy so that all exoevents raised by any
applications propagate back to the catcher object.
With this policy, different applications can specify their own policies with respect to
exoevent propagation. Furthermore, shared objects, such as S in the figure, can support
multiple propagation policies.
We show the exoevent interests specified by objects AppA, catcher, and B in Tables 7,
8, and 9.
TABLE 7: Exoevent interest for notify-hybrid policy for object AppA
  category string    “All”
  exoevent handler   Graph: AppA.notify

FIGURE 11: Example propagation of exoevents in the notify-hybrid policy (applications AppA/A and AppB/B share server S; exoevents propagate to AppA, to B, and to the third-party catcher)
While in this example all category strings have been set to “All”, it would be trivial to
specify a policy in which AppA, catcher, and B specify different category strings. For
example, AppA could specify interest in exceptions with the category string “Exceptions”,
catcher could specify interest in all exoevents with the category string “All”, and B could
specify interest in security exceptions with the category string “Exceptions:Security”.
TABLE 8: Exoevent interest for notify-hybrid policy for object catcher
  category string    “All”
  exoevent handler   Graph: catcher.notify

TABLE 9: Exoevent interest for notify-hybrid policy for object B
  category string    “All”
  exoevent handler   Graph: B.notify
4.3 Application programmer interface
The interface for using exoevents is designed to be simple. Raising an exoevent is a
three-step process that consists of (1) creating the exoevent, (2) inserting descriptors, and
(3) raising the exoevent:
LegionExoEvent exo("Exceptions");                             // specify type
exo.insertDescriptor("Description", "This is an exception"); // insert descriptor
LegionRaiseExoEvent(exo);                                     // raise exoevent
Registering to catch an exoevent using object scope consists of calling the following
function:
LegionRegisterExoEventHandler(LegionLOID monitored, ExoEventInterestSet set)
A commonly-used policy is for a root object to register to catch exoevents raised in an
application with the function:
LegionExoEventCatcherEnable(ExoEventInterest);
For more complex policies, e.g., masking exoevents (§4.2.2), users must create and
register the appropriate graphs with an exoevent interest using the interface described in
§3.1.1.
The full interface for using exoevents is shown in Figure 12.
4.4 Overhead
Table 10 shows the overhead of creating and raising exoevents. The time required to
create an exoevent is 166 µs. The time to raise an exoevent is linearly proportional to the
number of exoevent interests in the exoevent interest set as we must inspect each exoevent
interest to find a match (~120 µs per exoevent interest).
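As a rough cost model, raising an exoevent against an interest set of size n therefore costs about 60 µs + (112 to 124 µs) × n; for n = 10 this predicts roughly 1200 µs, consistent with the 1181 µs measured in Table 10.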
FIGURE 12: API for exoevents
class LegionExoEvent {
  // add a descriptor to the exoevent
  insertDescriptor(String name, Data data);
  // remove descriptor
  removeDescriptor(String name);
  // return the data associated with "name"
  Data getDescriptor(String name);
  // set the type of the exoevent; equivalent to insertDescriptor("exoeventType", String)
  set_type(String);
  // get the type
  String get_type();
  // constructor -- typeString is the type of the exoevent; equivalent to set_type(typeString)
  LegionExoEvent(String typeString);
};

class LegionExoEventInterest {
  // specify the kind of exoevent that we are interested in
  set_categoryInterest(String);
  // set the graph to be executed if there's a match between an exoevent and this interest
  set_exoeventHandler(Graph);
  // execute graph if there is a match
  executeIfInterested(LegionExoEvent);
};

// catch exoevents of type "String" and set up function "fid" as the callback function
void LegionExoEventCatcherEnable(String, FunctionIdentifier fid);

// raise an exoevent: attempt to match the exoevent against each exoevent interest
// specified in the set and execute all matches
LegionRaiseExoEvent(LegionExoEvent, LegionExoEventInterestSet = NULL);
4.5 Example exoevents
In Table 11, we list seven examples of exoevents that developers can export. In
general, developers of programming tools should decide which of these exoevents are
relevant in their environments. The list below is not exhaustive; developers may export
other exoevents that are not listed here. Furthermore, except for “ExoEventType”, all
other descriptors shown are optional and can be incorporated at the discretion of tool
developers.
TABLE 10: Overhead in creating and raising exoevents
  Test name                                         Overhead
  Time to create exoevent                           166 µs
  Time to raise exoevent (0 exoevent interests)     60 µs
  Time to raise exoevent (1 exoevent interest)      184 µs
  Time to raise exoevent (10 exoevent interests)    1181 µs
TABLE 11: Sample exoevents

  Exoevent                                               Description
  “ExoEventType” = “Object:MethodStarted”                Raised before executing a method
    “Method” = “userFunction(int,int)”
  “ExoEventType” = “Object:MethodDone”                   Raised after execution of a method
    “Method” = “userFunction(int,int)”
  “ExoEventType” = “Object:ObjectCreated”                Raised after object creation
    “Loid” = <LOID of created object>
  “ExoEventType” = “Object:ObjectDeleted”                Raised after object deletion
    “Loid” = <LOID of deleted object>
  “ExoEventType” = “Object:IamAlive”                     Raised periodically
    “Loid” = <LOID of object>
  “ExoEventType” = “Exception:Interface:NoSuchMethod”    Raised when a non-existent method is requested
    “Method” = “userFunction(int,int)”
  “ExoEventType” = “Exception:Security:AccessDenied”     Raised by the security layer when a user is not
    “Method” = “userFunction(int,int)”                     authorized to invoke a method
4.6 Examples
We use the policies described in §4.2 to present three possible implementations of a
failure detector.
4.6.1 Failure detection – push model
We present a failure detector based on a push model of exoevent propagation: objects
periodically raise an “I am Alive” exoevent. The catcher for the “I am Alive” exoevent is
the root object of the application. If the root object does not receive an “I am Alive”
notification within a specified time interval, it treats the object as having failed (Figure 13).
To implement this policy, objects in an application raise the following exoevent
periodically:
TABLE 12: “I am Alive” exoevent raised by application objects
  Descriptor name    Descriptor data
  “ExoEventType”     “ObjectNotification:IamAlive”
  “Loid”             Object identifier of the object raising the exoevent
The “Loid” descriptor contains the identity of the raiser so that a catcher object can
keep track of its status. The root object registers the exoevent interest <category string =
“ObjectNotification:IamAlive”, exoevent handler = root.notify(LegionExoEvent)> to
catch the proper exoevents.
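A sketch of the liveness check that root might run periodically (lastAlive and markFailed are our illustrative names; root.notify would refresh lastAlive[i] each time object i raises the exoevent):

  void checkForFailures(long now, long interval) {
      for (int i = 0; i < numObjects; i++) {
          if (now - lastAlive[i] > interval)  // no "I am Alive" within the interval
              markFailed(i);                  // treat the object as having failed
      }
  }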
4.6.2 Failure detection – pull model
An alternative implementation of a failure detector is for the root object to ping objects
in an application periodically (Figure 14). We use the exoevent notification model to
determine the set of objects in an application.
An advantage of this method is that monitored objects are passive and do not need to
raise an explicit “I am Alive” exoevent. A disadvantage is that this method requires a
round-trip method invocation between the root and each object.
Root registers to be notified of the creation and deletion of objects by specifying the
exoevent interest <category string = “ObjectNotification”, exoevent handler =
root.notify(LegionExoEvent)>. The root object can then ping each object; if it does not
FIGURE 13: Failure detection using the push model (objects A, B, C, and D raise “I am Alive” exoevents that propagate to root)
receive a timely reply from an object, it marks the object as having failed. To keep track of
created objects, the object’s creator raises the following exoevent:

TABLE 13: Exoevent raised on object creation
  Descriptor name    Descriptor data
  “ExoEventType”     “ObjectNotification:ObjectCreated”
  “Loid”             Object identifier of the newly created object
Figure 14 illustrates the propagation of the “ObjectNotification:ObjectCreated”
exoevent. In this example, the object root is the creator of objects A and B. Object A is the
creator of objects C and D.
The two dashed arrows from root to itself illustrate the propagation of the exoevent
“ObjectNotification:ObjectCreated” as root creates A and B. The two dashed arrows from
object A to the root object illustrate the propagation of the exoevent
“ObjectNotification:ObjectCreated” as A creates objects C and D.
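A sketch of how root might maintain its membership list from these notifications (the members container and its operations are illustrative):

  void root_notify(LegionExoEvent exo) {
      if (exo.get_type() == "ObjectNotification:ObjectCreated")
          members.insert(exo.getDescriptor("Loid"));   // begin pinging this object
      else if (exo.get_type() == "ObjectNotification:ObjectDeleted")
          members.remove(exo.getDescriptor("Loid"));   // a deletion is not a failure
  }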
4.6.3 Failure detection – service model
Our third option is to design a generic failure detection service that is shared by
multiple applications [FELB99, STEL98]. The advantages of a generic service are that
FIGURE 14: Failure detection using a pull model (root pings objects; “ObjectNotification:ObjectCreated” exoevents propagate to root along the object creation path)
developers do not need to implement their own failure detection service and can select
from among different types of failure detectors. For example, some failure detectors may
be aggressive in declaring failure, while others may rely on special knowledge, such as
network topology or network latency.
Figure 15 shows a failure detector object FD monitoring the status of four objects, A,
B, C, and D, using both the push and pull models as described in §4.6.1 and §4.6.2. FD
catches the “I am Alive” exoevent raised by A and B, and pings objects C and D
periodically. In the figure, object A has crashed and no longer raises the “I am Alive”
exoevent. The failure detector FD notices the absence of the “I am Alive” exoevent from
A and raises the exoevent “FD:Failure:ObjectFailedToReport”. Objects (not shown in the
figure) that wish to be notified of such failures must have previously registered their
interest with FD.
Table 14 shows the exoevent raised by FD upon detecting the death of object A.
FIGURE 15: Generic failure detection service (FD catches “I am Alive” exoevents from A and B and pings C and D; when A crashes, FD raises the exoevent shown below)

TABLE 14: Exoevent raised by failure detector
  Descriptor name    Descriptor data
  “ExoEventType”     “FD:Failure:ObjectFailedToReport”
  “Loid”             Object A
4.7 Summary
The combination of the exoevent notification and the reflective graph and event
models provides developers with a flexible framework for implementing fault-tolerance
algorithms. Salient features of the exoevent notification model include the notion of
graphs as event handlers and the run-time specification of interest in exoevents on a per-
application, per-method, or per-object basis. In subsequent chapters, we map fault-
tolerance algorithms onto these models and incorporate them into user applications.
I find that the harder I work, the more luck I seem to have.
— Thomas Jefferson
Chapter 5
Mappings of Algorithms
We have mapped several fault-tolerance algorithms onto our models. Since the
algorithms we chose are well-known and varied, we show the applicability and flexibility
of the RGE and exoevent notification models. We selected algorithms from rollback-
recovery and replication protocols. In rollback-recovery techniques, the state of an
application is rolled back to an error-free state in the event of failure. In replication
techniques, failures are masked through the redundancy of components. We mapped
algorithms representative of rollback-recovery techniques from a survey published by
Elnozahy et al. [ELNO96]. For replication, we illustrate the use of our models in
encapsulating a passive replication algorithm as well as a specialized replication algorithm
that works with stateless objects—objects whose methods are side-effect free.
Figure 16 illustrates the architecture of our design. We transform an application to
incorporate fault-tolerance techniques using FT objects and FT modules. FT objects,
such as an application manager, a checkpoint server, and a failure detector, manage
and support the fault-tolerant application. FT modules encapsulate fault-tolerance
algorithms. FT objects and FT modules cooperate to implement an algorithm. The
advantage of our architecture is the ability to integrate fault-tolerance functionality by
using different FT objects and FT modules with user applications. Fault-tolerance
designers encapsulate their algorithms inside FT modules. Developers of programming
tools incorporate the FT modules to enable the construction of reliable grid applications.
The correctness of using an algorithm depends on the correctness of the algorithm
itself as well as the correctness of its implementation. Regarding the
correctness of the algorithms: they have been described at length in the
literature. Regarding the correctness of the implementation, we defer to standard software
engineering techniques, e.g., code walkthroughs, inspection, and testing, to ensure that the
specification of an algorithm is met by its implementation. We have tested the integration
of the algorithms presented in this chapter using synthetic test cases and real applications
(Chapter 7).
We present algorithms that cope with permanent host failures. Once a host has
crashed, it does not recover and is taken out of the system. All objects that are running on
FIGURE 16: Structure of a fault-tolerant application (FT objects such as the ApplicationManager, ObjectMonitor, and CheckpointServer communicate over FT channels with application objects O1 … On, each augmented with an FT module)
the crashed host also fail and exhibit fail-stop behavior [SCHN83]. We assume the
existence of a stable storage facility on which objects may store data. In all our mappings,
an object that is assumed never to fail serves as stable storage.
We present mappings for the following rollback-recovery algorithms: checkpointing
(§5.1) and logging (§5.2). We then present mappings for failure masking using replication (§5.3).
For each algorithm, we present a brief overview, the failure assumptions underlying the
algorithm, and the mapping to our models. We also present possible
extensions to the algorithms that relax the failure assumptions of fail-stop behavior, a
reliable network, and reliable storage.
In presenting the API for FT modules, we use the data structures shown in Table 15.
We also present the interface to FT modules in a C++-like syntax. Methods that are
visible to other objects are denoted by the keyword exports. Methods and variables that
are internal to objects are denoted by the keywords private and public. We note that
all code examples shown are very close to actual code. However, to simplify our
exposition, we have removed unnecessary details. For examples of actual code, interested
readers may refer to the Legion documentation [LEGI99].

TABLE 15: Data structures for FT modules
  Data structure   Description
  MESSAGE          represents a message
  METHOD           represents a method, including its signature and argument list
  TAG              unique identifier for a METHOD invocation
  WORK_REQUEST     contains a METHOD and additional data fields
  BUFFER           holds arbitrary data; data stored in a BUFFER is compatible across heterogeneous architectures
  RESULTS          represents the values returned from a method invocation
  INFO             represents protocol-specific information
5.1 Checkpointing
A common method of ensuring the progress of a long-running application is to take a
checkpoint, i.e., save its state on stable storage periodically. A checkpoint is an insurance
policy against failures—in the event of a failure, the application can be rolled back and
restarted from its last checkpoint—thereby bounding the amount of lost work to be
recomputed.
The state of a distributed application consists of the instantaneous snapshot of the local
state of processes and communication channels. However, in an asynchronous distributed
system with no global clocks or shared memory, we can only devise algorithms to
approximate this global state [CHAN85]. A snapshot is deemed consistent if it could have
occurred during the execution of an application [CHAN85, MATT93]. To yield a consistent
snapshot, or checkpoint, an algorithm must ensure that all messages received by a process
are recorded as having been sent [CHAN85, JAL94]. Figure 17 illustrates two processes
whose local checkpoints do not form a consistent checkpoint. Message m1 from O1 to O2
is a lost message; it is marked as having been sent in O1’s checkpoint but not as having
been received in O2’s checkpoint.* Message m2 from O1 to O2 is an orphan message; it is
recorded as being received by O2 but not as having been sent in O1’s checkpoint. Lost
messages may occur when in-transit messages between two processes are not captured by
* Note that if a checkpointing protocol runs on top of a lossy communication channel, a consistent checkpoint may allow in-transit messages [ELNO96]. In our model, protocols run on top of a reliable communication protocol.
a checkpointing algorithm. If O2 fails after receiving message m1 from O1 (denoted by X
on O2’s timeline) and restarts executing from its local checkpoint, m1 will be lost if O1
does not retransmit it. Orphan messages may occur upon restart of a process. If O1 fails
after sending message m2 (X on O1’s timeline) and restarts from its checkpoint, it would
be as if O2 had received a message that O1 had not yet sent; clearly an impossible situation
in a failure-free execution of the application.
There are two broad categories of checkpointing algorithms: uncoordinated and
coordinated checkpointing. In uncoordinated checkpointing algorithms, objects establish
local checkpoints autonomously. Uncoordinated checkpointing potentially provides lower
overhead during normal execution because objects need not coordinate checkpoints.
However, establishing a consistent application state requires non-trivial work during
recovery. Recovery algorithms for uncoordinated checkpoints must establish a consistent
set of local checkpoints to recover from [CAO92, WANG95], and deal with the possibility
of the domino effect [RAND75, RUSS80], where the restart of one process triggers the
rollback of other processes to avoid orphan messages. Coordinated checkpointing
algorithms avoid the domino effect by coordinating the taking of local checkpoints and
blocking interprocess communication temporarily to establish only consistent
FIGURE 17: Lost and orphan messages (m1 is lost: sent in O1’s checkpoint but not received in O2’s; m2 is an orphan: received by O2 but not sent in O1’s checkpoint; X marks process failures)
checkpoints. The primary advantage of coordinated checkpointing is its simple recovery
characteristics, albeit at the potential cost of greater overhead during normal execution.
We focus on coordinated checkpointing because of its simpler design and recovery
characteristics. We present mappings for two algorithms: SPMD checkpointing (§5.1.1)
and 2-phase commit distributed checkpointing (2PCDC) (§5.1.2). The former is named
after a style of applications known as Single Program Multiple Data applications. SPMD
applications are prevalent in parallel computing and exhibit a regular communication
structure that can be exploited to ensure consistency among checkpoints [GEIS97]. The
latter, 2PCDC, is an adaptation of an algorithm proposed by Koo and Toueg that can be
used for applications with arbitrary communication structures [KOO87].
The local state of a process should consist of all the data structures necessary to restart
that process. In computational grids, an object may be restarted on a host of a different
architecture. Thus, we do not use system-level checkpoints—core images of running
processes—because they are not portable across heterogeneous architectures. Instead, we
require that developers identify and save the relevant state. Given our object-based model
of computation, the state of an application consists of protocol-related data, user-defined
data, partial methods† and complete methods. Note that we do not include the program
counter in our state; upon restart, developers are responsible for restoring the program
counter to an appropriate point. Developers may provide programmers with tools for
automatic stack recovery [BEGU97, FERR97] or may require them to structure their code
appropriately [GEIS97].
† Recall from Chapter 3 that multiple messages may be needed to assemble a complete method invocation.
We design these algorithms to cope with permanent host failures. We assume that a
host will fail by crashing and that it may never recover. Any objects running on the
crashed host will also crash and any data contained in volatile memory is lost. We use
pings and heartbeat pulses as our failure detection mechanism.
One of the advantages of checkpointing is that once the application state is consistent
and stored on stable storage, the application can always be restarted. A checkpoint server
object serves as stable storage. Since we are interested in coping with permanent host
failure, we require that the checkpoint server be on a separate host from any of the
application objects. We assume that the checkpoint server never crashes, nor does the host
in which the application is started (as it is responsible for coordinating the checkpointing
algorithm).‡
Note that the assumptions underlying the checkpointing algorithms can be relaxed. For
example, the checkpoint server (reliable storage) could be allowed to crash given a
transient failure model in which we assume that hosts eventually recover. Furthermore, we
could tolerate network partitioning of an application if we assume that the checkpoint
server does not crash or is recoverable because an application could then be restarted from
within the partition in which the checkpoint server resides.
5.1.1 SPMD checkpointing
SPMD (Single Program Multiple Data) applications are prevalent in computational
grids [FOST94, QUIN94]. Typically, an SPMD application consists of multiple processes
that are responsible for a subdomain of the application. An SPMD application exhibits a
‡ If the coordinating host crashes, the application can still recover from the last saved consistent checkpoint.
regular structure: it contains a loop that performs calculations on a subset of the data and
exchanges information periodically. Thus, it is simple to exploit the regular structure of
SPMD applications to implement application-consistent checkpointing
[GEIS97, BEGU97].
5.1.1.1 Algorithm
To obtain a consistent checkpoint, a user inserts checkpoints in such a manner as to
guarantee that there will be no lost and no orphan messages. In general, this is a difficult
task. However, in an SPMD application, the periodic exchange of boundary information
establishes natural points for taking application consistent checkpoints, e.g., at the top or
the bottom of the main loop. The set of checkpoints at each local process defines an epoch.
By inserting a checkpoint at the top or bottom of the loop, we constrain the exchange of
messages to within an epoch, and thus guarantee no lost and no orphan messages. The
skeleton of a typical SPMD application, with a checkpoint inserted at line 2, is shown
in Figure 18.
Recovery is relatively straightforward (Figure 19). Upon starting the application,
programmers determine whether they should restart from a previously-saved checkpoint
(lines 1-2). If so, they can call the appropriate routines to restore their state. Saving the
loop index as part of the state ensures that programmers restart from the correct iteration
FIGURE 18: Insertion of checkpoint in SPMD code
  (1) loop i = 1 to N
  (2)    take checkpoint
  (3)    exchange boundary information (send/receive pair)
  (4)    do some work
  (5) end loop
(line 5). Note that SPMD checkpointing is often hand-coded; programmers use restart files
to save application data.
5.1.1.2 Mapping SPMD checkpointing
Figure 20 illustrates the interface for the checkpoint server. The checkpoint server
defines methods to store and retrieve the object state and protocol-related data for each
participant. The checkpoint server also has a method, setStableCheckpoint(), to
specify that a set of checkpoints form a consistent state. When setStableCheckpoint()
is called, the checkpoint server can garbage-collect data associated with all previously
taken checkpoints. Note that the notion of consistency is not determined by the checkpoint
server but is set externally.
An application manager controls the creation of objects and is responsible for
determining when a checkpoint is consistent. During initialization, it registers to be
FIGURE 19: Recovery example
  (1) if restarting from a checkpoint
  (2)    restore local state (including loop index M)
  (3) else M = 1
  (4) loop i = M to N
  (5)    take checkpoint (saving loop index i as part of the state)
  (6)    exchange boundary information
  (7)    do some work
  (8) end loop
FIGURE 20: Interface for checkpoint server
class CheckpointServer {
exports:
  storeCheckpoint(int objID, int ckptID, BUFFER state);   // store state of objects
  BUFFER restoreCheckpoint(int objID, int ckptID);        // restore state
  storeMessage(int objID, int ckptID, MESSAGE msg);       // store messages
  MESSAGE restoreMessage(int objID, int ckptID);          // restore messages
  storeProtocolData(int objID, int ckptID, BUFFER data);  // store protocol data
  BUFFER restoreProtocolData(int objID, int ckptID);      // restore protocol data
  int setStableCheckpoint(int ckptID);                    // set checkpoint as consistent
  int getStableCheckpoint(int objID);                     // retrieve the last consistent ckptID
};
notified of the “CheckpointTaken” and “I am Alive” exoevents exported by participants.
We show the interface to the application manager in Figure 21. The class INFO maintains
internal data structures required for the algorithm.
As participants successfully forward checkpoints to the checkpoint server, they raise a
“CheckpointTaken” exoevent with their objectID and current checkpointID as data
(Figure 22).
The application manager catches this exoevent with notifyCheckpointTaken().
Once the manager receives an exoevent from each participant, it informs the checkpoint
server that the checkpoint is consistent (set_stable()).
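A hypothetical body for notifyCheckpointTaken() illustrates this bookkeeping (the descriptor layout and the reported counter are assumptions; the calls are those of Figures 21 and 22):

  void ApplicationManager::notifyCheckpointTaken(LegionExoEvent exo) {
      int objID  = *(int*) exo.getDescriptor("ObjID");   // assumed descriptor layout
      int ckptID = *(int*) exo.getDescriptor("CkptID");
      reported[ckptID]++;                                // one more participant done
      if (reported[ckptID] == info.numObjects)           // all participants reported:
          set_stable(ckptID);                            // checkpoint is consistent
  }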
FIGURE 21: Interface for application manager
class INFO {
  int checkpointAlgorithm;          // algorithm id (SPMD or 2PCDC)
  int numObjects;                   // number of objects in application
  LegionLOID loids[];               // identity of objects
  int objID;                        // object id
  int ckptID;                       // checkpoint id
  int mode;                         // normal or recovery
  LegionLOID checkpointServer;      // identity of storage server
};

class ApplicationManager {
exports:
  void notifyCheckpointTaken(LegionExoEvent exo);  // notification that a checkpoint has been taken
  void notifyObjectAlive(LegionExoEvent exo);      // notification of liveness
private:
  INFO info;                                       // protocol info
  void set_stable(int ckptID);                     // inform storage server checkpoint is consistent
  void send_spmd_info(int objID, INFO info);       // initialize protocol info
public:
  int check_liveness();                            // monitor health of application
  int recover_application(int ckptID);             // restart application from checkpoint
};
FIGURE 22: Raising the “CheckpointTaken” exoevent
LegionExoEvent exo;
exo.set_type("CheckpointTaken");                        // set the exoevent type
exo.insertDescriptor("ObjID", &myID);                   // insert object ID
exo.insertDescriptor("CkptID", &currentCheckpointID);   // insert checkpoint ID
LegionRaiseExoEvent(exo);
The interface for participants is shown in Figure 23 and consists of functions to save
and restore the local state, to notify the manager that a participant is alive, to notify the
manager that a checkpoint has been taken successfully, and to determine whether a
participant is in recovery mode.
The application manager maintains a record of the last known live time—a timestamp
of the last successful communication—for each object. The manager updates the record
when it receives a message from an object. For example, the manager may update records
upon successfully pinging an object using check_liveness(), or upon catching the
“I am Alive” and “CheckpointTaken” exoevents. The manager marks an object as failed if
its last known live time exceeds a user-defined threshold. The manager then proceeds to
restart the application by killing and restarting each object. Once all objects have been
restarted, the coordinator informs participants that they should restart from a given
checkpoint through the call send_spmd_info(). The participants can then request the
necessary state from the checkpoint server and restart.
FIGURE 23: Interface for participants
class spmd_participant_module {
exports:
  void get_spmd_info(INFO info);      // receive protocol information
private:
  INFO info;                          // protocol information
  int checkpointID;                   // current checkpoint id
public:
  void get_mode();                    // normal or recovery
  void save_local_state();            // save state
  void restore_local_state();         // restore state
  void i_am_alive();                  // notify that object is alive
  void checkpoint_taken(int objID, int ckptID);  // notify that checkpoint has been taken
};
5.1.1.3 Summary of SPMD checkpointing
Table 16 provides a summary of the use of the RGE and exoevent notification models
in mapping the SPMD checkpointing algorithm.
5.1.2 2-phase commit distributed checkpointing
The SPMD checkpointing algorithm requires that developers insert checkpoints at
consistent points in their program. For SPMD programs this is not a difficult task. We now
present 2-Phase Commit Distributed Checkpointing (2PCDC), an algorithm that
relieves developers of the burden of establishing consistent checkpoints. The basic idea
behind 2PCDC is to produce a consistent application checkpoint atomically—all objects
in an application checkpoint or none do. Atomicity ensures that the algorithm can tolerate
failures while it is in progress; it also ensures the existence of at least one consistent
checkpoint at any given time.
The algorithm presented here is an adaptation of an algorithm proposed by Koo and
Toueg [KOO87]. The original algorithm prevented only orphan messages and relied on the
underlying communication channels to retransmit lost messages. We make no such
assumption; we ensure that no in-transit messages are lost by capturing them
with a counter-based approach [MATT93].
TABLE 16: Summary of SPMD checkpointing
  Functionality                    Model
  Notification of checkpoints      Exoevent notification model
  Notification of liveness        Exoevent notification model
  Communication between objects    RGE model (graphs)
5.1.2.1 Checkpointing
The algorithm proceeds in two phases (Table 17). In the first phase, the coordinator
requests that participants take a checkpoint. To reject the request, a participant sends a
“No” reply to the coordinator. Otherwise, a participant sends a “Yes” reply. Along with the
“Yes” reply, a participant also sends a counter (s,r) where s denotes the number of
messages sent and r denotes the number of messages received by the participant since its
last checkpoint. The participant then awaits the coordinator's decision.
While in the wait stage, a participant Pi may receive a message that was sent from Pj
prior to Pj taking a local checkpoint. This message is said to be in-transit and must be
recorded to prevent a lost message. Upon receipt of an in-transit message, Pi forwards the
message to the checkpoint server and informs the coordinator that it has received an in-
transit message.
TABLE 17: 2PCDC algorithm

Coordinator:
  Request that participants take a local checkpoint
  Await all replies
  if all replies = YES then
    based on message counts, determine number of in-transit messages
    if in-transit messages > 0 then
      wait until no more in-transit messages
    Decide YES
  else
    Decide NO
  Inform checkpoint server that checkpoint is consistent (on YES)
  Inform participants of decision

Participant:
  if accept request then
    Forward state to checkpoint server
    Reply YES and send message count
    Await coordinator’s decision
    if in-transit message received then
      Forward message to checkpoint server and send new message count to coordinator
  else
    Reply NO
  if decision = “YES” then
    Reset message count
If and only if all participants reply “Yes” does the coordinator also decide “Yes”.
Otherwise, the coordinator decides “No”. The coordinator's authoritative decision marks
the end of the first phase. If the decision is “Yes”, the coordinator informs the checkpoint
server that the checkpoint is consistent and sends its decision to all participants.
Otherwise, the coordinator informs the checkpoint server to discard the local checkpoints
just stored.
To prevent orphan messages, a participant may not initiate communication with
another once it has taken a local checkpoint. The algorithm handles lost messages by
including a message count with each participant’s reply. To determine whether all in-
transit messages have been caught, the coordinator sums the counts from each participant.
If the total number of sent messages equals the number of received messages, then all in-
transit messages have been caught, and the set of local checkpoints and in-transit messages
form a consistent checkpoint.
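To make the counting rule concrete: if three participants report (sent, received) counts of (4, 3), (2, 3), and (1, 1), the totals are 7 sent and 7 received, so no messages remain in transit and the local checkpoints form a consistent checkpoint. Totals of 7 sent and 6 received would instead indicate one in-transit message still to be caught and forwarded.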
5.1.2.2 Recovery
The recovery protocol also proceeds in two phases (Table 18). In the first phase, the
coordinator sends protocol information to each participant; this information tells
participants that they are in recovery mode. Each participant retrieves its state from the
checkpoint server (including in-transit messages) and informs the coordinator that it is
ready to proceed. The coordinator then awaits the ready notification from each participant.
In the second phase, the coordinator informs each participant to proceed.
5.1.2.3 Mapping 2-phase commit distributed checkpointing
We show the interface to the coordinator in Figure 24. The class INFO maintains
internal data structures required for the algorithm. As part of the initialization phase, the
coordinator sends this information to participants. The coordinator initiates the algorithm
with a call to take_2pc_checkpoint(timeout). If any outgoing calls to the participants
do not terminate within the specified time interval, the coordinator aborts the protocol by
sending a NO decision to the participants.
TABLE 18: Recovery in 2PCDC

Coordinator:
  Send protocol information to each participant
  Await READY notification from each participant
  Inform participants to start executing

Participant:
  Await protocol information from coordinator
  if in recovery mode then
    retrieve state from checkpoint server
  Notify coordinator that participant is READY
  Await GO signal from coordinator
Figure 25 illustrates the implementation of take_2pc_checkpoint(timeout). The
coordinator first requests that all participants take a checkpoint and awaits the
participants’ answers (await_answers()). If all participants reply “Yes”, the coordinator
waits for potential in-transit messages (await_in_transits()). When all in-transit messages have
been caught, the coordinator commits the checkpoint (set_stable()). Regardless of the
final outcome, the coordinator notifies participants of its decision
(notify_vote_result()). The calls await_answers() and await_in_transits() are
FIGURE 24: Interface for coordinator
class INFO {
  int checkpointAlgorithm;
  int numObjects;
  LegionLOID loids[];
  int objID;
  int ckptID;
  int mode;
  LegionLOID storageServer;
};

class coordinator {
exports:
  // notification of reply (phase I)
  void notifyAnswer(int objID, int ckptID, int answer, int numSent, int numRcvd);
  void notifyInTransit(int objID, int ckptID);   // notification of in-transit message
  void notifyObjectAlive(LegionExoEvent exo);    // notification of liveness
private:
  INFO info;
  send_2pcdc_info(int objID, INFO info);         // send protocol info
  request_checkpoints(int ckptID);               // request that participants take a checkpoint
  int await_participant_reply(int ckptID, long timeout);  // await answer
  int await_in_transits(int ckptID, long timeout);        // await in-transit messages
  void notify_vote_results(int ckptID, int result);  // send final decision to participants
  void set_stable(int ckptID);                   // inform checkpoint server checkpoint is consistent
public:
  take_2pc_checkpoint(long timeout);             // initiate 2-phase algorithm
  int check_liveness();                          // monitor liveness
  int recover_application(int ckptID);           // restart application
};
implemented with a loop that waits for the functions notifyAnswer() and
notifyInTransit() to be invoked.
The interface for participants is shown in Figure 26. Participants poll for the
checkpoint request from the coordinator with the function checkpointRequested().
When the coordinator requests a checkpoint, participants forward their state to the
checkpoint server and await a decision from the coordinator (do_2pcdc_phaseI()). Note
that this is an optimistic protocol, as there are no guarantees that the checkpoint will
succeed. In do_2pcdc_phaseII(), the participant awaits the final decision from the
coordinator.
In order to count the number of sent and received messages, participants register
handlers with the MessageReceive and MessageSend events. To ensure that participants
count only application-level messages, these handlers use
isApplicationLevelFunction(). Programming tool developers should have identified
user functions as application-level a priori. In addition to counting the number of
messages, the handler for MessageReceive is also responsible for catching in-transit
FIGURE 25: 2PCDC code
coordinator::take_2pc_checkpoint(long timeout) {
  request_checkpoints(timeout);
  vote_result = await_answers(currentCheckpointID, timeout);
  if (vote_result == YES) {
    stable = await_in_transits(currentCheckpointID, timeout);
    if (stable != TRUE)
      vote_result = NO;
  }
  // End of Phase I -- the coordinator has decided
  if (vote_result == YES)
    set_stable(currentCheckpointID);
  notify_vote_result(currentCheckpointID, vote_result);
};
messages. If the participant is in the process of performing the algorithm and has already
voted YES, then the handler forwards the in-transit message to the checkpoint server and
notifies the coordinator.
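A sketch of this MessageReceive handler (accessors such as getMessage are our stand-ins for the RGE event interface; the counters and predicate are those of Figure 26):

  int MessageReceiveHandler(Event e) {
      MESSAGE msg = getMessage(e);                      // assumed event accessor
      if (!isApplicationLevelFunction(getFunction(e)))  // ignore protocol traffic
          return CONTINUE;
      num_msgs_rcvd++;                                  // per-participant counter
      if (inPhaseI && votedYes) {                       // message is in transit
          forwardToCheckpointServer(msg);               // record it: no lost message
          notifyInTransitToCoordinator();               // cf. notifyInTransit (Figure 24)
      }
      return CONTINUE;                                  // let the message proceed
  }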
Restarting the application is similar to the SPMD checkpointing algorithm (§5.1.1)
except that in the 2PCDC algorithm, the state of participants includes any recorded in-
transit messages.
5.1.2.4 Summary of the 2PCDC algorithm
Table 19 provides a summary of the use of the RGE and exoevent notification models
in mapping the 2PCDC algorithm.
TABLE 19: Summary of the 2PCDC algorithm
  Functionality                                               Model
  Catching in-transit messages and forwarding them to the
    checkpoint server                                         RGE model (events)
  Notification of liveness                                    Exoevent notification model
  Communication between objects                               RGE model (graphs)
FIGURE 26: Interface for participants
class 2pcdc_participant_module {
exports:
  void get_2pcdc_info(INFO info);        // receive initial protocol information from coordinator
  void checkpoint_request(int ckptID);   // coordinator requests a checkpoint
  void notifyVoteResult(int ckptID, int vote);  // decision from the coordinator
private:
  INFO info;                             // protocol info
  int num_msgs_sent;                     // number of messages sent since last ckpt
  int num_msgs_rcvd;                     // number of messages received since last ckpt
  boolean isApplicationLevelFunction(FunctionIdentifier fid);  // is this an application-level function?
public:
  boolean checkpointRequested();         // was a checkpoint requested by the coordinator?
  void save_local_state();               // save local state
  void restore_local_state();            // restore local state
  int do_2pcdc_phaseI();                 // phase I of algorithm
  int do_2pcdc_phaseII();                // phase II of algorithm
  void i_am_alive();                     // raise exoevent to notify that object is alive
};
5.2 Logging
We now explore the second form of rollback-recovery, namely log-based rollback-
recovery. In log-based rollback-recovery, a process can be recreated from its checkpointed
state and message log. A common assumption is that of a piecewise deterministic model
of computation—the execution of a process consists of a series of non-deterministic
events that delineate deterministic state intervals [ELNO96]. In message-based systems,
non-deterministic events typically correspond to the ordering of message delivery. By
logging messages and their ordering, a process can recover from a crash by replaying
messages in the same order as it originally delivered them. Typically, a process logs both
the delivery order of messages and their content, though logging both is not a necessary
condition as messages may be regenerated upon recovery [ALVI98].
There are three types of log-based rollback-recovery techniques: pessimistic logging,
optimistic logging and causal logging. All guarantee that upon recovery the state of a
failed process is consistent with the state of other processes. This consistency requirement
is expressed in terms of orphan processes, i.e., processes that contain orphan messages.
Alvisi et al. provide a formal definition of the always-no-orphans condition and derive a
characterization for all three classes of logging protocols [ALVI98]. Elnozahy et al.
provide a practical and less formal comparison of logging protocols [ELNO96].
In pessimistic logging, a process synchronously logs messages prior to delivering
them in order to ensure that no message that can affect the state of a process is lost. This
algorithm is pessimistic because it assumes that failures are likely between the time a
message is logged and the time it is delivered. Logging messages synchronously ensures
that upon recovery, a process can replay all messages that have previously affected the
state. The advantage of this technique is that recovery is simple and localized: a process
recovers by retrieving its last checkpoint and replaying its message log. It does not need to
coordinate recovery with other processes in the application. The drawback of pessimistic
logging is the high failure-free overhead of logging messages synchronously.
In contrast, optimistic logging protocols log messages asynchronously. The implicit
assumption is that failure is unlikely to occur between the time a message is logged and
the time it is delivered. A process does not block to perform the logging of messages; thus
the potential for higher failure-free performance. The problem is that sometimes an
optimistic assumption can be wrong. If a process crashes before a message has been
logged, information such as message delivery order or message content will be lost. To
compound the problem, if the crashed process has sent messages to other processes (and
potentially affected their state), these processes will become orphans and must be rolled
back during recovery. Thus, optimistic protocols require tracking dependencies during a
failure-free run to support a consistent recovery. Furthermore, processes in an optimistic
protocol may be required to rollback to a previous checkpoint whereas rollback for
pessimistic protocols is bounded to the last checkpoint.
Causal logging techniques strike a balance between pessimistic and optimistic
protocols. They do not require blocking during a failure-free run, nor do they create orphan
processes. Causal logging maintains information about events that have a causal effect on
the state of processes [ELNO92, ALVI93]. This information can be used to reestablish the
delivery order of messages upon recovery and limit the extent of rollbacks to the last
saved checkpoint. Causal logging techniques do not suffer a high failure-free performance
cost as they do not synchronously log messages to stable storage. Furthermore, causal
logging bounds the rollback of any failed process to its last checkpoint. As with optimistic
logging, the drawback of causal logging is its complex recovery protocols.
For a detailed analysis of the similarities and differences between logging protocols
please see the literature [ALVI98]. There are other issues related to logging that we have
not discussed, e.g., interactions with the outside world, asynchronous vs. synchronous
recovery and garbage collection. For a treatment of these issues, please see the survey by
Elnozahy [ELNO96].
For the purpose of mapping algorithms to the RGE and exoevent notification models,
we focus on pessimistic logging because of its simplicity and the fact that, despite its high
overhead, most commercial implementations of message logging use pessimistic logging
[HUAN95]. As in Ho’s master’s thesis, we adapt a pessimistic message
logging protocol to an object-based system [HO99].
We design our system to tolerate a single permanent host failure. We use a checkpoint
server object as stable storage for storing checkpoints and message logs. Thus, the
algorithm can tolerate either the failure of the server or of the checkpoint server, but not
both. We further assume that no network partitioning occurs.
5.2.1 Pessimistic message logging
We have discussed the piecewise deterministic model in terms of processes and
messages. In an object-based system, the non-deterministic events of interest correspond to the order
in which methods are delivered. By logging the delivery of methods, we can recreate the
execution of an object by replaying its methods. We implement the logging of methods by
logging messages.
Pessimistic message logging (PML) enables the abstraction of a resilient object, an
object that can mask failures. Object failure is masked by the PML protocol; other objects
should only see a pause while PML recovers an object. We implement PML by logging
messages onto stable storage. An advantage of PML is the ability to recover an object
locally, without needing to coordinate recovery with other application objects. However,
the simple recovery characteristic of PML comes at the cost of logging messages during
normal execution.
In Figure 27 we show a client invoking the method foo on object A (1). For this
example, we assume that a single message is sufficient to form a complete method
invocation for foo. Upon receipt of the message from the client, the PML module sends the
message to the CheckpointServer object (2). Once PML receives an acknowledgement
from CheckpointServer that the message has been stored successfully (3), PML allows the
message to flow to the MethodAssembly module (4). Since the message forms a
complete method, A can execute the method foo (5). Object A then returns the reply to the
client (6).
In order to recover an object, we restart it from its last checkpoint, retrieve the
message log, and replay messages in their original order. While replaying the message log,
we intercept outgoing messages in order to prevent sending duplicate messages. If object
A received a reply during its original execution, e.g., as a result of making a method
invocation on other objects, we retrieve the reply from the log. Once all messages have
been replayed, we let outgoing messages proceed normally, at which point the object has
been recovered successfully.
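The recovery procedure can be summarized in a short sketch. The following is a minimal illustration in plain C++ and is not the actual PML module code: the Message type, the deliver() helper, and the suppressOutgoing flag are our own stand-ins for the mechanisms just described.

#include <string>
#include <vector>

struct Message { std::string payload; };   // stand-in for a logged message

struct RecoverableObject {
    bool suppressOutgoing = false;         // trap outgoing messages during replay

    void deliver(const Message&) {}        // stand-in: re-executes the logged method

    // Assumes the object's state has already been restored from its last
    // checkpoint before recover() is called.
    void recover(const std::vector<Message>& log) {
        suppressOutgoing = true;           // prevent duplicate sends
        for (const Message& m : log)       // replay in original delivery order;
            deliver(m);                    //   replies are read back from the log
        suppressOutgoing = false;          // resume normal processing
    }
};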
Clients that expect a reply should see a pause while a recovery protocol is in progress.
In practice, clients should retry an invocation after a certain amount of time, in case an
object fails before logging a message. The possibility of retries implies that objects must
handle duplicate method invocations.
5.2.2 Mapping pessimistic message logging
Figure 28 shows the interface to the module for implementing pessimistic message
logging. To intercept messages, we register the handler LogMessageHandler with the
event MessageReceive. Inside the handler, we forward the message to the checkpoint
server and await acknowledgement that the message has been stored successfully. To store
return values, we register the handler MethodStartHandler with the MethodReady event
and the handler StoreRetainedResultHandler with the MethodDone event.

FIGURE 27: Pessimistic message logging (PML). A client invokes A.foo() (1); the PML module sends the message to the CheckpointServer (2); once the CheckpointServer acknowledges that the message has been stored (3), the message flows to the MethodAssembly module (4); A services the method (5) and returns the reply to the client (6).
Inside MethodStartHandler, we insert the computation tag of the method into an
associative array that maps computation tags to return values. Since the method is about to
start executing, the tag maps to an empty value. When the method finishes executing and
StoreRetainedResultHandler is invoked, we update the associative array to store the
returned values and forward them to the CheckpointServer object. The code
for these handlers is shown in Figure 29.
class PessimisticMessageLogging_module {
exports:
    setCheckpointServer(LOID);              // set checkpoint server
private:
    LOID CheckpointServer;                  // name of checkpoint server
    MAP<TAG, RESULTS> retainedResults;      // keep track of invoked methods and replies
    int LogMessage(MESSAGE);                // log message to checkpoint server
    int LogRetainedResults(RESULTS);        // log return values to checkpoint server
    int RecoverLogs(MESSAGE, RESULTS);      // recover all logs
public:
    int LogMessageHandler(Event);           // handler invoked upon receipt of a message
    int MethodStartHandler(Event);          // handler invoked when a method is about to execute
    int StoreRetainedResultHandler(Event);  // handler for when a method finishes executing
    int InterceptOutgoingMessages(Event);   // handler to stop outgoing messages during recovery
};

FIGURE 28: Interface for pessimistic message logging
To recover, an object retrieves its last saved checkpoint and all logs from the
CheckpointServer. Next, it replays each message in order to recreate the original execution
of the object. We trap outgoing communications so that other objects do not receive
duplicate requests (Figure 30). Whenever an object is blocked waiting on a return value
from some other object, the result values can be found in the message log. Once all
messages have been replayed and all return values extracted, an object stops intercepting
outgoing method invocations and resumes normal processing.
int PessimisticMessageLoggingModule::LogMessageHandler(Event ev) {
    MESSAGE m = ev.getData();
    LogMessage(m);
    return EventContinue;
}

int PessimisticMessageLoggingModule::MethodStartHandler(Event ev) {
    WORK_REQUEST w = ev.getData();
    TAG t = w.getTag();
    RESULTS results = NULL;
    if (results = retainedResults.lookup(t)) {
        sendResults(w);                      // duplicate request: resend saved results
    } else {
        retainedResults.insert(t, results);  // method about to execute: tag maps to empty value
    }
    return EventContinue;
}

int PessimisticMessageLoggingModule::StoreRetainedResults(Event ev) {
    WORK_REQUEST w = ev.getData();
    TAG t = w.getTag();
    RESULTS results = w.getResults();
    retainedResults.insert(t, results);      // update with the returned values
    LogRetainedResults(results);             // forward the returned values to the CheckpointServer
    return EventContinue;
}

FIGURE 29: Handlers for pessimistic message logging
5.2.3 Optimization: pessimistic method logging
We present an optimization to pessimistic message logging that relies on the following
two assumptions: (1) an object receives complete method invocations only, i.e., all its
arguments are contained in a single message, and (2) an object does not call other objects
while servicing a request. Based on these assumptions, we modify the pessimistic message
logging algorithm into pessimistic method logging.
The differences between pessimistic method logging and pessimistic message logging
are that instead of forwarding messages to the checkpoint server, we forward complete
method invocations; and instead of replaying messages during recovery, we replay
methods. In Figure 31, we show a client invoking A.foo() (1). Instead of logging
messages as in §5.2.2, we log methods (2). Forwarding complete method invocations to
the checkpoint server (3-4) is implemented by registering a handler with the MethodReady
event. The handler assembles and executes a graph to store the method invocation at the
checkpoint server. Once the checkpoint server has acknowledged receipt of the method, A
services the method foo (5).
int PessimisticMessageLoggingModule::InterceptOutgoingMessageHandler(Event ev) {
    return EventStop;   // prevent further handlers from processing the event,
                        // thereby shutting off outgoing communication
}

FIGURE 30: Handler for intercepting outgoing communication
5.2.4 Legion system-level support
We define a new Legion class object such that upon failure of an object instance, the
class restarts another copy on a different host. Furthermore, we define a new method on
the class object, set_logger(LegionLOID instance, LegionLOID logger),
to associate an object with its logger object. Upon detecting that an object has
failed, the class object restarts a new copy, and forwards to the copy the identity of its
logger object. Upon starting up, the new object retrieves its state from the logger object
and replays its log before accepting any methods.
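As a rough sketch of this restart path (illustrative only: the map, the restart() shape, and the helpers createCopyOnDifferentHost and forwardLogger are our stand-ins, not the Legion class-object interface):

#include <map>

using LOID = int;   // stand-in for a Legion object identifier

struct LoggingClassObject {
    std::map<LOID, LOID> loggerOf;   // instance -> logger object

    // set_logger, as defined on the class object
    void set_logger(LOID instance, LOID logger) { loggerOf[instance] = logger; }

    // On failure, restart a copy on a different host and forward the identity
    // of its logger; the copy replays its log before accepting any methods.
    LOID restart(LOID failed) {
        LOID copy = createCopyOnDifferentHost(failed);   // stand-in helper
        forwardLogger(copy, loggerOf[failed]);           // stand-in helper
        loggerOf[copy] = loggerOf[failed];
        return copy;
    }

    LOID createCopyOnDifferentHost(LOID) { return 0; }   // stub
    void forwardLogger(LOID, LOID) {}                    // stub
};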
Supporting pessimistic method logging requires overloading the methods that create and
delete objects, and the methods that look up object names. Object creation is modified to
send the identity of a logger object to the newly created instance; object deletion is
modified to clean up internal data structures, i.e., to remove the association of an object
with its logger; and object naming is modified to trigger the failover protocol.

FIGURE 31: Pessimistic method logging. A client invokes A.foo() (1); instead of the message, the PML module logs the complete method (2), sending it to the CheckpointServer (3) and awaiting acknowledgement (4); the MethodAssembly module then delivers the method and A services foo (5); the reply is returned to the client (6).
5.2.5 Summary of pessimistic logging
Table 20 provides a summary of the use of the RGE and exoevent notification models
in mapping pessimistic logging. We did not implement the full pessimistic message
logging algorithm, but instead implemented the pessimistic method logging algorithm, as
pessimistic method logging is well-suited for client/server interactions.

TABLE 20: Summary of the pessimistic logging algorithm (Functionality: Model)
- Intercepting messages: RGE model (events)
- Turning off communications: RGE model (events)
- Storing return values: RGE model (events)
- Detection of duplicate requests and sending previously saved return values: RGE model (graphs & events)
- Communication between objects: RGE model (graphs)

5.3 Replication

Replication techniques can be classified broadly into closely-synchronized techniques
and loosely-synchronized techniques [CHRI91]. In the former, the state of the replicas is kept
closely synchronized; replicas service the same requests in parallel and undergo the same
state transitions. This algorithm is sometimes referred to as the state machine approach or
active replication [SCHN90]. In the latter, a primary replica services requests on behalf of
clients. Other replicas are kept as spares and can take over in the case of a primary failure
[BUDH93]. This is sometimes referred to as passive replication.
In the state machine approach the following properties must hold [SCHN90]:
Agreement — all replicas receive and process the same sequence of requests
Order — every non-faulty state machine replica processes the requests it receives in
the same relative order
A common approach to implement these properties has been to use order-preserving
communication protocols such as atomic multicast [BIRM93, HAYD98, RENE96].
In passive replication, the following properties must hold [BUDH93]:
Property 1 — there is only one primary at any given time
Property 2 — clients communicate only with the primary
Property 3 — if a backup replica receives a client request, it ignores the request
Passive replication algorithms are simpler to implement because they do not require
complex ordering communication primitives. The disadvantage of passive replication is
that the failover time—the time it takes to elect a new primary in case of failure—may be
unacceptably high.
Transparently incorporating both kinds of replication techniques into applications has
been investigated in many projects [ELNO95, FABR95, FABR98, GARB95, HO99,
MOSE99]. Fabre et al. use a reflective language to encapsulate replication algorithms
[FABR95, FABR98]. Elnozahy et al. and Moser et al. extend a CORBA object request
broker [ELNO95, MOSE99]. Ho exploits the extension facilities of Orbix, a CORBA object
request broker, to incorporate replication in the Nile project [HO99]. The CORBA 3.0
specification defines interception facilities to extend the functionality of objects. This
recent development is very important, as CORBA is an architecture specification over
which many systems can be implemented. CORBA's approach is similar to ours in that we
provide developers of multiple programming environments with the ability to insert and
extend object functionality, not as an afterthought but as a primary feature of an
architecture.
The RGE and exoevent notification models provide facilities for implementing
replication techniques. However, these models also provide facilities for a more
comprehensive solution for incorporating techniques into user applications. They seek to
encompass not just replication but also checkpointing and message logging techniques. To
the best of our knowledge, the RGE and exoevent notification models are the only models
that serve as a unified model for these three families of techniques.
In mapping replication techniques, we focus on two: passive replication (§5.3.1) and
stateless replication (§5.3.2). These techniques are simple to understand and implement,
and they aptly illustrate the capabilities of our models. The failure assumptions for each
algorithm are given in their respective sections.

Note that we do not map active replication techniques. The primary reason for this
decision is that active replication is generally used to achieve availability, while our focus
is on reliability. Furthermore, the current Legion prototype system does not support
ordered communication primitives.
5.3.1 Passive replication
The basic idea in passive replication is to keep the state of the primary and backups
synchronized so that upon failure of the primary, a backup can take over and process client
requests [BUDH93]. We consider the case of one primary and one backup only, though a
generalization to multiple backups is straightforward [BUDH93]. Figure 32 illustrates a
method call on a replicated object A (1). The module PR encapsulates the passive
replication algorithm. After servicing the method foo, control returns to the module PR
(2-3). PR sends the state of the primary to the backup object (4). The backup updates the
state and sends an acknowledgement back to the primary (5-6). Once the primary has
received the acknowledgement, it sends the result of A.foo to the client (7).
Passive replication is designed to tolerate a single crash failure. Either the primary or
the backup is permitted to fail, but not both at the same time. Passive replication also
assumes a reliable network, i.e., no network partitioning.
Furthermore, we assume that a naming service is available for looking up the name of
the primary and backup objects. In particular, the client should be able to use just one
name; the fact that a request is sent to the primary, or to a backup that has just been
elected primary, should be transparent to the client. Naming and binding issues are
orthogonal to our models and depend on the target grid environment. In Legion, such
issues are the responsibilities of class objects (§5.3.1.2).

FIGURE 32: Passive replication example. A client invokes A.foo() on the replicated object A (1); the PR module services the method (2-3) and sends the state of the primary to the backup (4); the backup updates its state and acknowledges (5-6); the primary then returns the reply to the client (7).
5.3.1.1 Mapping passive replication
The interface to the PR module is shown in Figure 33. Upon startup, if the object is a
primary, we register the handlers HandlePassiveReplication and HandleMethodDone
with the MethodDone event. Inside HandlePassiveReplication, we look up the invoked
method in a table to determine whether it is a state-updating method. If this information is
not available, we take the conservative approach of assuming a state-updating method. If
the method is state-updating, we assemble the state via SaveUserState(BUFFER) and
send it to the backup. Upon receiving the state, the backup calls
AssignUserState(BUFFER) and replies to the primary. Inside HandleMethodDone, we
send the return values back to the client.
class PassiveReplication_module {
exports:
    SetStatus(PR_STATUS);
    SetState(BUFFER);
private:
    HandlePassiveReplication(Event);   // handler
    HandleMethodDone(Event);           // handler
    PR_STATUS status;                  // primary or backup
    SaveUserState(BUFFER);             // save the state
    AssignUserState(BUFFER);           // assign the state
public:
    serverLoop();
};

FIGURE 33: Passive replication interface (primary)
The primary then returns the results of the original function invocation to the client
(Figure 34).
5.3.1.2 Legion system-level support
Integrating passive replication requires support from the class object. Recall that in
Legion, class objects are responsible for object-management functions such as creation,
deletion, naming, and binding. We modify the class object so that on object creation, the
class creates two objects, a primary and a backup. Upon failure of the primary, the
class makes the backup the new primary object.
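A minimal sketch of this behavior follows (illustrative only: the integer LOID, createObject(), and the handler names are our stand-ins, not the actual Legion class-object code):

using LOID = int;                  // stand-in for a Legion object identifier

LOID createObject() {              // stand-in for the class object's create operation
    static LOID next = 1;
    return next++;
}

struct ReplicatedClassObject {
    LOID primary = 0, backup = 0;

    void create() {                // on object creation, make a primary and a backup
        primary = createObject();
        backup  = createObject();
    }

    void onFailure(LOID failed) {
        if (failed == primary) {       // failover: promote the backup...
            primary = backup;
            backup  = createObject();  // ...and create a fresh backup (cf. §6.2.5)
        }
    }
};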
EventStatus PassiveReplication::HandlePassiveReplication(Event ev) {
    WORK_REQUEST w;
    METHOD m;
    w = (WORK_REQUEST) ev.getData();
    m = w->getMethod();
    if (isStateUpdating(m)) {
        BUFFER state = saveUserState();
        success = sendStateToBackup(state, timeout);
        if (success) return ContinueHandleEvent;
        else {
            reconfigureBackup();
            success = sendStateToBackup(state, timeout);
            if (success) return ContinueHandleEvent;
            else raiseException("ERROR");
        }
    }
    return ContinueHandleEvent;
}

EventStatus PassiveReplication::HandleMethodDone(Event ev) {
    WORK_REQUEST w;
    w = (METHOD) ev.getData();
    SEND_RESULTS(w);
    return ContinueHandleEvent;
}

FIGURE 34: Handlers for passive replication (primary)
Figure 35 illustrates the process of invoking a method foo on object S. A client object
first contacts the class of S to obtain a binding for S (1). The binding contains an object
address, i.e., a low-level name, with which to communicate with S (2). Normally, the
binding returned by Class S corresponds to the primary object (3). However, if the primary
object crashes, Class S initiates a failover protocol that consists of making the backup the
new primary object. On subsequent binding requests, Class S returns a binding that
corresponds to the new primary object.

FIGURE 35: Server lookup with primary replication. A client asks Class S for a binding, lookup(LOID(S)) (1); Class S returns a binding (2), which normally corresponds to the primary; the client then invokes S.foo() on the bound object (3).
5.3.1.3 Summary of passive replication
Table 21 provides a summary of the use of the RGE and exoevent notification models
in mapping the passive replication algorithm.
TABLE 21: Summary of the passive replication algorithm (Functionality: Model)
- Updating state of backup when a method finishes execution: RGE model (graphs & events)
- Raising exceptions: Exoevent notification model
- Detection of duplicate methods: RGE model (events)
- State transfer: provided by developer
5.3.2 Stateless replication
Stateless objects—objects whose methods are side-effect free—can be replicated to
provide higher performance [GRIM96A], higher availability, or both [BABA92, CASA97,
NGUY95, NGUY96]. Stateless objects are used in several applications, including file
servers, mathematical libraries, graphical rendering, biochemistry, and pipe-and-filter
applications. The original stateless replication algorithm was designed to achieve higher
performance through the load-balancing of parallel requests on stateless objects. The
problem was that the failure of any replica would lead to the failure of the application that
uses the stateless objects. We modified the algorithm to tolerate failures of replicas through a
retry mechanism.
We present stateless replication, an algorithm for managing stateless replicas. The
architecture of this algorithm is shown in Figure 36. Note the presence of a proxy object
that intercepts method calls intended for the replicas. The algorithm tolerates the crash
failures of replicas. We assume that the proxy object never crashes and that the network is
reliable. However, the assumption of a reliable network could be relaxed. If the network
partitions, workers that are outside of the primary partition can be treated as having failed.
The proxy object would reassign the failed computation to workers that reside within the
primary partition.
When the proxy object receives a work request, i.e., a method call intended for the
replicas, it stores the request in an internal queue. The proxy object maintains a capacity
count for its replicas, i.e., the maximum number of work requests that can be issued at any
given time. The proxy dequeues work requests and selects replicas for performing the
work until the maximum capacity is reached.
The selection algorithm can be a simple one such as random or round-robin, or it can
be a more complex algorithm such as least-loaded. When a replica finishes a method
invocation, it notifies the proxy (dashed arrow labeled "done" in Figure 36). This
notification is the basis for monitoring the progress of an invocation; if a method that has
been assigned to a replica fails to finish executing within a specified time interval, the
proxy can reassign the work to another replica. Furthermore, the arrival of the notification
triggers the assignment of another work request to a replica. Thus, this architecture
achieves a form of self-scheduling: replicas that execute fastest, whether because they are
inherently faster or are servicing less computationally demanding methods, receive on
average more work from the proxy. Note that other replication algorithms could be
implemented. For example, a form of active replication could be implemented by having
the proxy schedule N duplicates for each work request [NGUY95].
FIGURE 36: Stateless replication. A client invokes O.foo(); the proxy intercepts the method and dispatches it to one of the replicas O1..On; a replica notifies the proxy when it is done, and the reply is returned to the client.

Relying on a timeout value for reassigning work requests may lead to multiple results
being sent back to clients. Thus, clients must be able to handle the possibility of duplicate
replies. In environments in which client objects are waiting on a specific reply, this task is
easy. In others, duplicates should be detected and discarded (§5.3.2.2).
5.3.2.1 Mapping stateless replication
The proxy object implements the stateless replication algorithm. The proxy object
exports methods for registering and unregistering replicas, setting the queue capacity, and
specifying a time interval after which to reassign work requests (Figure 37).
We register the handler methodInvokeHandler with the MethodReady event. Inside
methodInvokeHandler, we determine whether the method is intended for the proxy object
itself or for the replicated object. If it is intended for the proxy object, we route it to the
appropriate function and update various data structures such as the list of candidate
replicas. If it is intended for the replicated object, we store the method in a queue of work
requests. The work request contains the method, its arguments, and other information such
as timestamps.

class WORK_REQUEST {
    METHOD method;       // function signature, arguments, computation tag
    long timeReceived;   // time work request received
    long timeStarted;    // time work request sent to replica
    long timeEnded;      // time work request done
    int numTries;        // how many times have we tried this work request
    int retryable;       // should we retry this work request if it doesn't
                         // finish executing in a timely manner?
};

class Proxy {
exports:
    void notifyMethodDone(LegionExoEvent);   // notification that a replica has finished
                                             // executing a method
    void join(LOID);                         // add replica to the pool
    void leave(LOID);                        // remove replica from the pool
    void setTimeout(long timeout);           // set timeout value after which to reassign work
    void setCapacity(int);                   // set the queue capacity
private:
    boolean isReplicaFunction(FunctionIdentifier fid);   // determine whether a call is for
                                                         // the proxy or a replica
    void invoke_replica_function(WORK_REQUEST);   // send work to replica
    void handleProxyMethod(WORK_REQUEST);         // a method for the proxy; handle it
    WORK_REQUEST_QUEUE work_requests;             // queue of work requests
    WORK_REQUEST_QUEUE in_progress;               // queue of work requests already
                                                  // assigned to replicas
    HANDLER methodInvokeHandler;                  // handler for catching MethodReady events
    LOID replicas[];                              // list of replicas
public:
    handleMethodInvocation(WORK_REQUEST);         // handle an incoming method invocation
};

FIGURE 37: Interface for proxy object
Provided there is spare capacity, the proxy dequeues work requests, sends them to the
replicas, and stores them in the in_progress queue (Figure 38).
Upon finishing a method invocation, replicas raise the exoevent
“Object:MethodDone”. Descriptors for the exoevent contain the function signature and its
computation tag. To be notified of this exoevent, the proxy object sets an exoevent interest
to catch “Object:MethodDone” exoevents (Table 22):
TABLE 22: "Object:MethodDone" notification by replica
- Exoevent interest, categoryString: "Object:MethodDone"
- exoeventHandler: a graph that invokes Proxy.notifyMethodDone
// code to send work request to replica
invoke_replica_function(WORK_REQUEST w) {
    Event ev = newEvent(MethodSend);   // create a MethodSend event
    METHOD m = w->getMethod();         // get the method data structure
    ev->setData(m);                    // set data field of event
    EventManager.announce(ev);         // raise event
}

void handleMethodInvocation(WORK_REQUEST work) {
    if (not isReplicaFunction(work.fid)) {
        // this is a method for the proxy
        handleProxyMethod(work);
    } else {
        work_requests->enqueue(work);
        spare_capacity = MAXCAPACITY - currentWorkLoad;
        while (spare_capacity--) {
            invoke_replica_function(work);
            in_progress->enqueue(work);
        }
    }
}

FIGURE 38: Sending a method to a replica
Upon receiving a notifyMethodDone() call, the proxy object dequeues another work
request, assigns it to the same replica, and stores the request in the in_progress queue.
The proxy periodically scans the in_progress queue to determine whether any work
requests have exceeded the specified time interval. If so, the proxy considers the work
request and the replica to have failed. If a work request fails and is not retryable, the proxy
raises an “Exception:RequestFailed” exoevent that contains the function signature of the
failed request and its computation tag. If the work request is retryable, the proxy updates
the numTries field and resubmits the request to the work_requests queue. If the number
of allowable retries has been reached, the proxy gives up retrying and raises the exoevent
“Exception:RequestFailed:MaximumRetriesReached”.
To detect the failure of replicas, the proxy object registers the event handler
FailureDetectionHandler with the MessageSendError event. When this handler is
called, the proxy object removes the failed replica from its set of available replicas. Work
requests that were assigned to the failed replica are reassigned to other replicas.
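A condensed sketch of this scan-and-retry logic in plain C++ (WorkRequest mirrors a few of the WORK_REQUEST fields from Figure 37; the queues, parameters, and the stubbed-out exoevents are our simplifications, not the proxy's actual code):

#include <ctime>
#include <deque>

struct WorkRequest {
    long timeStarted;    // when the request was sent to a replica
    int  numTries;       // how many times this request has been tried
    bool retryable;      // may the proxy retry it on timeout?
};

void scanInProgress(std::deque<WorkRequest>& in_progress,
                    std::deque<WorkRequest>& work_requests,
                    long timeout, int maxTries) {
    long now = static_cast<long>(std::time(nullptr));
    for (auto it = in_progress.begin(); it != in_progress.end(); ) {
        if (now - it->timeStarted < timeout) { ++it; continue; }
        WorkRequest w = *it;              // request (and its replica) deemed failed
        it = in_progress.erase(it);
        if (!w.retryable) {
            // raise the "Exception:RequestFailed" exoevent (stubbed out)
        } else if (++w.numTries >= maxTries) {
            // raise "Exception:RequestFailed:MaximumRetriesReached" (stubbed out)
        } else {
            work_requests.push_back(w);   // resubmit to the work queue
        }
    }
}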
5.3.2.2 Duplicate method suppression
It is possible for an object to receive duplicate method invocations when using
stateless replication. Consider the case of a computation S.foo() where the notification of
the "MethodDone" exoevent is delayed by network congestion. As a result, the proxy
object could potentially retry the computation. The end result is that the computation
S.foo() is invoked twice. If the return value of S.foo() is used as a parameter to
X.bar(), then X.bar() could also be invoked twice. While invoking S.foo() twice is
safe because S is a stateless object, invoking X.bar() twice may not be safe and could
result in the erroneous execution of X.
To detect duplicates, objects register a DuplicateHandler with the MethodReady
event. Inside DuplicateHandler, we check for the presence of the computation tag of the
method in an internal hash table. If the tag is already present, we have a duplicate and thus
delete the method. Otherwise, we insert the tag in the hash table to detect subsequent
duplicates.
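The logic inside DuplicateHandler amounts to a set-membership test on computation tags. A minimal stand-alone rendering in plain C++ (the Tag type and the handler shape are our assumptions, not the RGE interfaces):

#include <string>
#include <unordered_set>

using Tag = std::string;   // stand-in for a Legion computation tag

class DuplicateSuppressor {
    std::unordered_set<Tag> seen;   // tags of methods already delivered
public:
    // Returns true if the method should be delivered,
    // false if it is a duplicate that should be deleted.
    bool onMethodReady(const Tag& t) {
        return seen.insert(t).second;   // insert() reports whether t was new
    }
};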
5.3.2.3 Summary of stateless replication
Table 23 provides a summary of the use of the RGE and exoevent notification models
in mapping the stateless replication algorithm.
Table 23: Summary of the stateless replication algorithm (Functionality: Model)
- Notification that a replica has finished executing a method: Exoevent notification model
- Determining whether a method is invoked on the proxy object or a replica: RGE model (events)
- Forwarding method invocation to replica: RGE model (graphs & events)
- Detecting duplicate return values: RGE model (events)
- Raising exceptions: Exoevent notification model
- Failure detection: RGE model (events)
5.4 Summary
We have shown the application of the RGE and exoevent notification models in
mapping the following fault-tolerance algorithms: SPMD checkpointing, 2PCDC
checkpointing, pessimistic message logging, pessimistic method logging, passive
replication and stateless replication. Table 24 summarizes the faults tolerated by these
algorithms and their assumptions. Note that for the checkpointing algorithms we have
assumed that the host that starts the application (the checkpoint coordinator) does not
crash. If the coordinator crashes, the application can still be restarted from the saved
checkpoints.
Table 24: Summary of algorithms (Algorithm: number of worker failures tolerated; assumptions; comments)
- SPMD checkpointing: tolerates n failures; assumes a reliable store, a reliable network, and that the checkpoint coordinator does not crash; the reliable network assumption can be relaxed
- 2PCDC checkpointing: tolerates n failures; assumes a reliable store, a reliable network, and that the checkpoint coordinator does not crash; the reliable network assumption can be relaxed
- Pessimistic method logging: tolerates 1 failure; assumes a reliable store and a reliable network
- Passive replication: tolerates 1 failure; assumes a reliable network; the backup is represented by an object that is allowed to crash
- Stateless replication: tolerates n-1 failures; assumes a reliable store and a reliable network; the reliable network assumption can be relaxed
The implementation of pessimistic method logging and passive replication required
system support to change the behavior of object creation or other object-management
services. In the next chapter, we show the incorporation of fault-tolerance algorithms in
programming tools and the API that developers present to programmers.
Obstacles are those frightful things you see when you take your eyes off your goal.
— Henry Ford
Chapter 6
Integration into Programming Tools
We present the integration of the fault-tolerance algorithms presented in Chapter 5 into
the following programming tools: the Message Passing Interface (MPI), the de facto
message passing standard in the grid community [GROP99]; the Stub Generator, a tool for
writing client/server applications; and the Mentat Programming Language (MPL), an
object-based parallel processing language. We describe the integration of fault-tolerance
algorithms into these environments and describe the interface presented to application
programmers.
We show that the burden placed on application programmers is manageable, ranging
from inserting a few extra lines of code or setting command-line options to writing
routines to save and restore state. For tool developers, incorporating the fault-tolerance
algorithms requires targeting the RGE and exoevent notification models and linking in the
proper fault-tolerance libraries.
We present the integration of the SPMD and 2PCDC checkpointing techniques into the
MPI environment (§6.1). For the Stub Generator, we present the integration of passive
replication and pessimistic method logging (§6.2). For MPL, we present the integration of
stateless replication (§6.3). For each environment, we present a high-level overview so
that readers may compare the interface to programmers both before and after the
integration of fault-tolerance algorithms.
All three environments use the reflective graph and event model and the exoevent
notification model and have been deployed for over 2 years. The algorithms have been
tested using synthetic test cases designed to stress various parts of the algorithms, e.g.,
ensuring that invariants hold, that the output of a program after recovery is correct, as well
as using several real-world applications (Chapter 7).
6.1 MPI (SPMD and 2PCDC Checkpointing)
The Message Passing Interface (MPI) is a message-passing standard that is used
widely on parallel machines and networks of workstations to develop parallel and
distributed applications [GROP99]. The goals of the MPI designers were to achieve
portability, flexibility, and ease of use through the specification of a standard application
programmer interface based on the familiar message passing paradigm. MPI is supported
by all major computer manufacturers.
Our goal in augmenting the Legion MPI implementation (LMPI) is to provide MPI
programmers with a simple interface for supporting application checkpoint/restart. We
add only six new functions and refer to our augmented implementation as LMPI-FT. We
present a brief overview of MPI by describing several of its most commonly-used
functions and show an example program (§6.1.1). Next, we describe the architecture and
interface of LMPI-FT (§6.1.2) and illustrate its use with a simple program (§6.1.3). We
conclude this subsection by summarizing the efforts required from both developers and
programmers (§6.1.4).
6.1.1 Legion MPI (LMPI)
Table 25 shows six of the most commonly used MPI functions [FOST94].
TABLE 25: Sample MPI functions
- mpi_init(): initiate an MPI computation
- mpi_finalize(): terminate a computation
- mpi_comm_size(comm, size): determine the number of tasks (comm: communicator; size: number of tasks inside the communicator)
- mpi_comm_rank(comm, rank): determine my task identifier (comm: communicator; rank: id within the communicator)
- mpi_send(buf, count, datatype, target, tag, comm): blocking send of a message (buf: address of buffer; count: number of items to send; datatype: type of the items; target: rank id of the target task; tag: id of the message; comm: communicator used)
- mpi_recv(buf, count, datatype, source, tag, comm, status): blocking receive of a message (buf: address of buffer; count: number of items to receive; datatype: type of the items; source: rank id of the source task; tag: id of the message; comm: communicator used; status: status/error values)
Note that MPI uses the concept of a communicator to group related tasks. A global
communicator, MPI_COMM_WORLD, groups all tasks in an application. For more
information about communicators and other communication primitives, please refer to the
MPI standard [GROP99].
An MPI application typically consists of a fixed number of tasks (or processes) that
are started from the command line. For example, in Legion MPI (LMPI), an application is
started with the command-line utility legion_mpi_run, which takes as arguments the
number of tasks to be created and the name of the program, e.g., legion_mpi_run -n
4 myprogram.
Figure 39 shows a simple MPI program. MPI tasks are logically organized in a ring
and are denoted as task 0..n-1. At each iteration a task sends an integer to its left neighbor
and receives an integer from its right neighbor (lines 14-25). Note that the left neighbor of
task 0 is task n-1 and the right neighbor of task n-1 is task 0.
Running this program with the command legion_mpi_run -n 4 myprogram
yields the output:
I am task 0 and I have received the value 0 from my neighbor
I am task 1 and I have received the value 0 from my neighbor
I am task 2 and I have received the value 0 from my neighbor
I am task 3 and I have received the value 0 from my neighbor
I am task 0 and I have received the value 1 from my neighbor
...
FIGURE 39: Simple MPI program (myprogram)

(1)  main(int argc, char **argv)
(2)  {
(3)    // variable declarations omitted
(4)    MPI_Init(&argc, &argv);
(5)    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
(6)    MPI_Comm_size(MPI_COMM_WORLD, &num_tasks);
(7)
(8)    for (iteration = start_iteration; iteration < NUM_ITERATIONS; ++iteration)
(9)      doSomeWork(myid, num_tasks, iteration);
(10)
(11)   MPI_Finalize();
(12) }
(13)
(14) // send and receive from neighboring tasks; tasks are arranged in a logical ring
(15) doSomeWork(int mynode, int num_tasks, int iteration) {
(16)   // variable declarations omitted
(17)   info = iteration;
(18)
(19)   // send to left neighbor
(20)   MPI_Send(&info, 1, MPI_INT, leftneighbor, 0, MPI_COMM_WORLD);
(21)   // receive from right neighbor
(22)   MPI_Recv(&info, 1, MPI_INT, rightneighbor, 0, MPI_COMM_WORLD);
(23)   // do some work here
(24)   printf("I am task %d and I have received the value %d from my neighbor\n", mynode, info);
(25) }

6.1.2 Legion MPI-FT

Our extensions to LMPI provide programmers with optional functionality. MPI
programmers are exposed only to the additional functions defined by LMPI-FT when they
need to use the checkpoint/restart facilities. The relationship between programmers,
LMPI-FT, and the RGE and exoevent notification models is shown in Figure 40.

FIGURE 40: Legion MPI architecture augmented with FT modules. MPI programmers program against the MPI implementation (LMPI-FT), which incorporates an FT module for SPMD and 2PCDC checkpointing; both are built on the RGE and exoevent notification models via graphs, events, and exoevents.

LMPI-FT exports the standard MPI interface to programmers as well as several new
functions to support checkpoint/restart (Table 26). The internal implementation of the
standard MPI interface targets the RGE and exoevent notification models. Calls such as
mpi_send() and mpi_recv() are implemented by raising events and executing
graphs. Similarly, the FT modules also target the models, thus enabling the composition of
the checkpointing algorithms within LMPI-FT.

To support checkpointing, application programmers insert code to save and restore
state. Table 26 describes the extensions to MPI to support checkpoint and restart.

TABLE 26: Functions to support checkpoint/restart (LMPI-FT function: Description)
- int mpi_ft_on(): returns 0 if no checkpointing is specified, 1 for SPMD checkpointing, and 2 for 2PCDC checkpointing
- mpi_ft_init(int rank, int &recovery): initiate the checkpoint/recovery library; rank is the id of the MPI task; <recovery> is true if in recovery mode
- mpi_ft_save(char *buffer, int size): save data onto storage
- mpi_ft_save_done(): done saving data for this checkpoint
- mpi_ft_restore(char *buffer, int size): restore data from storage
- int mpi_ft_checkpoint_request(int &ckptid): returns true if a checkpoint has been requested by the coordinator; also sets the checkpoint id (only used with 2PCDC checkpointing)

Furthermore, we add several flags to legion_mpi_run to specify parameters for
the checkpoint and restart algorithms (Table 27):

TABLE 27: Options for legion_mpi_run (Option: Description)
- -ft [-spmd | -2pc <ckptFreq>]: specify either SPMD or 2PCDC checkpointing; with 2PCDC checkpointing, <ckptFreq> specifies how often to request a checkpoint
- -s <checkpoint server>: specify the checkpoint server on which checkpoints will be stored and from which they will be retrieved; this option may be repeated to specify multiple checkpoint servers
- -g <ping interval>: specify the ping frequency for each MPI task
- -r <reconfigurationTime>: if we have not heard from an MPI task in the last <reconfigurationTime> seconds, restart the application from the last consistent checkpoint
- -R: specify recovery mode; restart the application from the last consistent checkpoint

For example, the command

    legion_mpi_run -n 2 -ft -spmd -s myCheckpointServer -g 200 -r 500 myapp

specifies that the application should run with 2 tasks, that it uses the checkpoint server
called myCheckpointServer, that the ping interval is 200 seconds, and that the
reconfiguration time is 500 seconds. We provide a command-line tool,
legion_create_checkpoint_server <name>, to create a checkpoint server.
6.1.3 Example
We illustrate the use of the checkpointing library for the SPMD and 2PCDC
algorithms using the same example MPI program as above (Figure 41). This toy
application is representative of SPMD programs and illustrates the amount of work
required from programmers.
The code required to support checkpointing is shown italicized in Figure 41. This code
consists of functions to set up the checkpointing libraries (lines 9-10) and functions to call
the checkpointing routines (lines 11, 15-24). Where to insert code to take a checkpoint
depends on the algorithm used. For SPMD checkpointing, the programmer is responsible
for specifying when to checkpoint, e.g., every tenth iteration (lines 16-20, 23-24). For
2PCDC checkpointing, the participant periodically polls to determine whether a
checkpoint has been requested by the coordinator, i.e., legion_mpi_run (lines 19-24).
Upon recovery, the programmer is responsible for restarting the program from an
appropriate point in the code. In this example, the programmer can restart from the proper
loop index because the loop index is saved when taking a checkpoint.
FIGURE 41: Example of MPI application with checkpointing

(1)  int
(2)  main(int argc, char **argv)
(3)  {
(4)    // variable declarations omitted
(5)    MPI_Init(&argc, &argv);
(6)    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
(7)    MPI_Comm_size(MPI_COMM_WORLD, &num_tasks);
(8)
(9)    recovery = MPI_FT_Init(myid);
(10)   algorithm = MPI_FT_On();
(11)   if (recovery) restore_state(&start_iteration);
(12)   else start_iteration = 0;
(13)
(14)   for (iteration = start_iteration; iteration < NUM_ITERATIONS; ++iteration) {
(15)     int take_checkpoint = FALSE;
(16)     if (algorithm == SPMD) {
(17)       if (iteration % 10 == 0 && iteration > 0)
(18)         take_checkpoint = TRUE;
(19)     } else {  // algorithm is 2PCDC
(20)       if (mpi_ft_checkpoint_request())
(21)         take_checkpoint = TRUE;
(22)     }
(23)     if (take_checkpoint)
(24)       save_state(iteration);
(25)     doSomeWork(myid, num_tasks, iteration);
(26)   }
(27)   MPI_Finalize();
(28) }
(29)
(30) // send and receive from neighboring tasks
(31) // tasks are arranged in a logical ring
(32) doSomeWork(int mynode, int num_tasks, int iteration) {
(33)   // variable declarations omitted
(34)   info = iteration;
(35)
(36)   // send to left neighbor
(37)   MPI_Send(&info, 1, MPI_INT, leftneighbor, 0, MPI_COMM_WORLD);
(38)   // receive from right neighbor
(39)   MPI_Recv(&info, 1, MPI_INT, rightneighbor, 0, MPI_COMM_WORLD);
(40)   // do some work here
(41) }
Figure 42 shows the functions that a programmer would write to save and restore state
(lines 1-14). The MPI_FT_Save() and MPI_FT_Restore() functions take as
arguments a buffer and a size. We use the standard MPI functions MPI_Pack() and
MPI_Unpack() to store non-contiguous data in a user-allocated buffer.

(1)  save_state(int iteration) {
(2)    int pos = 0, someValue;
(3)    MPI_Pack(&iteration, 1, MPI_INT, user_buffer, 1024, &pos, MPI_COMM_WORLD);
(4)    MPI_Pack(&someValue, 1, MPI_INT, user_buffer, 1024, &pos, MPI_COMM_WORLD);
(5)    MPI_FT_Save(user_buffer, sizeof(iteration) + sizeof(someValue));
(6)    MPI_FT_Save_done();
(7)  }
(8)
(9)  restore_state(int &start_iteration) {
(10)   int pos = 0, someValue;
(11)   size = MPI_FT_Restore(user_buffer, 1024);
(12)   MPI_Unpack(user_buffer, 1024, &pos, &start_iteration, 1, MPI_INT, MPI_COMM_WORLD);
(13)   MPI_Unpack(user_buffer, 1024, &pos, &someValue, 1, MPI_INT, MPI_COMM_WORLD);
(14) }

FIGURE 42: Example of saving and restoring user state
6.1.4 Summary
Table 28 provides a summary description of the work required to incorporate and use
the checkpointing techniques in LMPI-FT. From a programmer's point of view, the most
difficult aspect of using LMPI-FT is writing the code to save and restore the relevant data
structures. However, we note that many applications already have save and restore state
functions defined. Integrating the SPMD and 2PCDC algorithms required 544 additional
lines of C++ code, most of which consisted of mapping the LMPI-FT interface presented
in §6.1.2 to the checkpointing modules in §5.1, and modifying legion_mpi_run to
support additional flags.

TABLE 28: Summary of work required for integration of checkpointing algorithms (Whom: Description of work; Lines of code)
- Developers of LMPI-FT: incorporation of the checkpointing modules as described in §5.1; addition of several flags to legion_mpi_run; modification of initialization to pass algorithm-specific information to tasks. Lines of code: 230 for the MPI tasks and 314 for legion_mpi_run.
- Programmers: learning the new flags to legion_mpi_run; learning six new functions; writing code to save and restore state; structuring code so as to restart properly; learning a new command-line utility to create a checkpoint server. Lines of code: application-dependent.
6.2 Stub generator (passive replication and pessimistic
method logging)
The stub generator (SG) provides programmers with a tool for developing Legion
client and server objects. SG is a tool that takes as input a C++ header file and produces
server-side and client-side stubs (Figure 43). Before the development of SG, Legion
programmers had to hand-generate the client-side and server-side stubs, a tedious
programming task.

FIGURE 43: Creating objects using the stub generator. The stub generator takes a C++ header file (myserver.h) and produces server-side stubs (myserver.stub.h, myserver.stub.c) and client-side stubs (myserver.client.h, myserver.client.c); the server-side stubs are compiled with the server implementation (myserver.c) to make the server object, and the client-side stubs are compiled with client code (client.c) to make the client object.

The server-side stub files generated by SG contain a server loop to service incoming
method calls (myserver.stub.c). For each method, SG generates stubs to unmarshall
arguments, call the appropriate user-supplied back-end functions, and send the return
values back to the caller (myserver.c). On the client side, the generated stub files consist
of a set of functions that handle the tedious details of invoking methods on the remote
object, namely, creating and executing program graphs and waiting on return values.
Programmers link the stub files with their own code to produce an object.
SG is well suited for writing passive server objects—objects that typically provide
services for multiple clients and do not themselves make calls on other objects. An
example of a passive server object would be a directory service.
We present modifications made to the stub generator (§6.2.1), and the integration of the
pessimistic method logging (§6.2.2) and passive replication (§6.2.5) techniques into
passive server objects created with the stub generator.
6.2.1 Modifications to the stub generator
We made two changes to the stub generator. The first is to allow programmers to
specify that a method is read-only, i.e., that it does not update state. Specifying read-only
semantics on a per-method basis enables the optimization of the passive replication and
pessimistic method logging algorithms. A sample interface file is shown in Figure 44,
with the READONLY modifier preceding the standard function declaration:

class myApp {
public:
    READONLY int add(int, int);
    int setSecret(int);
};

FIGURE 44: Specification of READONLY methods
Our second modification was to produce different client-side stubs so that
programmers can specify a timeout value and the number of times a call should be
invoked (Figure 45). The default values restore the original blocking semantics: the
timeout is set to INFINITY and the number of times a computation should be tried is 1.

FIGURE 45: Modified client-side stubs

Original client-side code:

    result = myApp.add(5, 6);

New client-side code (retry after 200 seconds):

    struct timeval timeout = {200, 0};
    int num_tries = 1;
    result = myApp.add(5, 6, &timeout, num_tries);
6.2.2 Integration with pessimistic method logging
To specify the parameters for the pessimistic method logging algorithm, programmers
must create an object and link it with the PML library. Also, programmers must create a
checkpoint server with the command-line tool:

    legion_create_checkpoint_server <name>
Next, programmers invoke the command-line tool legion_set_ft to set various
parameters (Table 29):

TABLE 29: Parameters for legion_set_ft (Option: Description)
- -c <object>: specify the object to which to apply the pessimistic method logging algorithm
- -ft -pml: specify the use of the pessimistic method logging algorithm
- -s <checkpoint server>: specify the checkpoint server on which checkpoints and methods will be stored and from which they will be retrieved
- -auto_trim_log <sleepTime>: after <sleepTime> of no activity, save the entire state onto the checkpoint server and delete the method log

Upon startup, an object obtains the identity of the checkpoint server from its class. If
none is specified, then the object is not running the pessimistic method logging algorithm,
i.e., the programmer has not yet invoked legion_set_ft. Otherwise, the object
attempts to retrieve its state and its method log from the checkpoint server. If the method
log contains entries, the object replays the log to bring its state up to date. During replay
of the log, the object does not accept any method invocations from clients. It services
client requests only once its state has been fully restored.

Whereas the PML module automatically initiates the transfer of state information
between the object and the checkpoint server, programmers are responsible for saving and
restoring the state. Programmers must define two functions,
SaveUserState(BUFFER) and RestoreUserState(BUFFER). The first
function saves the state of an object in a data structure called BUFFER and the second sets
the state of an object based on BUFFER. Note that in Legion, BUFFER is a self-
describing data structure that performs data conversion between heterogeneous
architectures automatically [VILE97].
6.2.3 Example
We illustrate the use of pessimistic method logging with a simple application called
myApp. We show the interface and implementation of myApp in Figure 46.

Setting up myApp to use pessimistic method logging is a two-step process. The
programmer creates a checkpoint server and then calls legion_set_ft:

    legion_create_checkpoint_server /home/joe/ckptServer
    legion_set_ft -ft -pml -s /home/joe/ckptServer -c myApp

Clients should modify their code to specify the timeout value and the number of times
a method should be tried, e.g., myApp.setSecret(7, &timeout, numtries=3).
FIGURE 46: Interface and code for myApp

myApp.idl:

class myApp {
private:
    int mySecret;
public:
    READONLY int add(int, int);
    int setSecret(int);
};

myApp.c:

int myApp::add(int i, int j) {
    return i + j;
}

int myApp::setSecret(int secret) {
    mySecret = secret;
}

// Users must define save/restore state functions
SaveUserState(BUFFER state) {
    state->put_int(mySecret, 1);   // store integer in state
}

RestoreUserState(BUFFER state) {
    state->get_int(mySecret, 1);   // retrieve integer from state
}
6.2.4 Summary
For the tool developer, integrating the pessimistic method logging protocol consists
mainly of modifying the stub generator to understand the READONLY specifier as well as
generating different client-side stubs. For application programmers, using PML consists of
linking in the PML library, specifying a timeout value and the number of times a method
should be invoked, writing routines to save and restore state, and invoking the command-
line tool legion_set_ft. For the programmer, the most difficult aspect of integrating
PML is writing the code to save and restore the relevant state. However, we note that with
a more sophisticated stub generator, we could generate the functions to save and restore
state on behalf of programmers automatically, provided that programmers identify the
variable declarations to be saved [FABR95].
TABLE 30: Summary of work required for integration of PML (Whom: Description of work; Lines of code)
- Developers of Stub Generator: incorporation of pessimistic message logging as described in §5.2; modification of client-stub generation to retry computations after a set time interval; modification of the interface file to allow the specification of READONLY semantics; development of the command-line tool legion_set_ft. Lines of code: 190 for modifications to the stub generator.
- Programmers: learning the command-line utility to specify parameters; learning the command-line utility to create a checkpoint server; writing code to save and restore state. Lines of code: 2 additional lines per remote procedure call (to specify the timeout and number of tries); additional lines to save and restore state are application-dependent.
6.2.5 Integration with passive replication
To specify the parameters for the passive replication algorithm, programmers must
create an object and link it with the FT_PassiveReplication library. Next,
programmers invoke the command-line tool legion_set_ft to set various parameters
(Table 30):
Upon startup, an object assumes that it is a primary object and attempts to obtain the
identity of its backup from its class. If none is specified, then the object is not running the
passive replication algorithm, i.e., the programmer has not yet invoked
legion_set_ft . Otherwise, the object starts forwarding its state to the backup after
each state-updating method. As in the pessimistic method logging algorithm,
programmers are responsible for saving and restoring state through the functions
SaveUserState(BUFFER) and RestoreUserState(BUFFER). For an example
of the modifications required to run passive replication, see Figure 46.
Upon the failure of the primary object, the class object is responsible for the failover
protocol and makes the backup object the new primary object. The class object also
creates a new backup and assigns it to the new primary.
TABLE 31: Parameters for legion_set_ft
Options Descriptions
-c <object1> Specify the object to which to apply the passive replication algorithm. This object is the PRIMARY.
-backup <object2> Create a new backup object and name it <object2>
-ft -passivereplication Specify the use of the passive replication algorithm
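Putting the options of Table 31 together, a complete invocation might read as follows;
the object names are illustrative:

    legion_set_ft -ft -passivereplication -c myApp -backup myAppBackup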
6.2.6 Summary
Table 32 summarizes the work required to implement and use passive replication
with the stub generator.
6.3 MPL – Stateless replication
The Mentat Programming Language (MPL) is a parallel, object-based programming
language based on C++ that was designed to facilitate the construction of parallel and
distributed applications [GRIM96A]. The philosophy behind Mentat is to exploit the
relative strengths of programmers and compilers: programmers make decomposition
and granularity decisions, while the compiler takes care of data dependencies and
synchronization.
TABLE 32: Summary of work required for integration of passive replication
Whom Description of work Lines of code
Developers of Stub Generator
• incorporation of passive replication as described in §5.3
• modification of client-stub generation to retry computations after a set time interval
• modification of interface file to allow the specification of READONLY semantics
• development of command-line tool, legion_set_ft
• 190 lines of code for modifications to the stub generator
Programmers
• learning command line utility to specify parameters
• writing code to save and restore state
• 2 additional lines of code per remote procedure call (to specify the timeout and number of tries)
• the number of additional lines to write save/restore state is application-dependent
The granule of computation in MPL is the Mentat class instance, which consists of
contained objects (local and member variables), their procedures, and a thread of control.
Programmers are responsible for identifying those object classes that are of sufficient
computational complexity to allow efficient parallel execution. Instances of Mentat
classes are used just like ordinary C++ classes, freeing the programmer to concentrate on
the algorithm, not on managing the environment. The data and control dependencies
between Mentat class instances involved in invocation, communication, and
synchronization are detected automatically and managed by the compiler and run-time
system without further programmer intervention.
The basic idea in MPL is to let the programmer specify those C++ classes that are of
sufficient computational complexity to warrant parallel execution. This is accomplished
using the mentat keyword in the class definition. Instances of Mentat classes are called
Mentat objects. The programmer uses instances of Mentat classes much as she would any
other C++ class instance. The compiler generates code to construct and execute data
dependency graphs in which the nodes are Mentat object member function invocations,
and the arcs are the data dependencies found in the program. All of the communication
and synchronization is managed by the compiler.
Figure 47 shows an example MPL class declaration. The class declaration and
implementation are identical to C++ except for the keyword mentat (lines 1-15). The
main program (lines 17-27) illustrates code to create and use a Math object. The
declaration of a Math instance results in the creation of a Mentat object (line 18). The call
to doSomeWork() results in a remote method invocation on the object myMathWorker
(line 23).
An MPL class may be declared as stateless, meaning that all its methods are free of
side-effects. In Figure 48, we show the declaration of the stateless class Math. The
advantage of a stateless object is that it may be replicated to service method calls in
parallel, thereby increasing performance [GRIM96B]. For example, in the loop of
Figure 47, the calls to myMathWorker may be executed in parallel (line 23). Through the
FIGURE 47: Example of MPL application
    (1)  mentat class Math {
    (2)  public:
    (3)      int doSomeWork(int workID);
    (4)  };
    (5)
    (6)  int
    (7)  Math::doSomeWork(int workID) {
    (8)      int i;
    (9)      float result;
    (10)
    (11)     for (i = 0; i < MAX_ITERATIONS; ++i)
    (12)         result = result + someFunction(i);
    (13)
    (14)     return result;
    (15) }
    (16)
    (17) main() {
    (18)     Math myMathWorker;
    (19)     int i = MAX_ITERATIONS;
    (20)     int results[MAX_ITERATIONS];
    (21)
    (22)     for (i = 0; i < MAX_ITERATIONS; ++i)
    (23)         results[i] = myMathWorker.doSomeWork(i);
    (24)
    (25)     for (i = 0; i < MAX_ITERATIONS; ++i)
    (26)         printf("result[%d] = %d\n", i, results[i]);
    (27) }
use of a command line utility, programmers may set the level of replication for stateless
objects [LEGI99].
6.3.1 Stateless replication
While the original design goal for stateless objects was to improve performance
through parallel execution and load-balancing of method calls, we can improve the
reliability of stateless objects as well by integrating into MPL the stateless replication
algorithm described in §5.3.2.
Figure 49 shows how MPL programmers can specify the parameters for the stateless
replication algorithm through the use of AutoStack_StatelessRetrySetting.
Programmers can set the timeout value and the number of times a computation should
be retried (lines 6-10). These parameters apply to all calls on stateless objects within the
scope of the AutoStack_StatelessRetrySetting declaration. Furthermore, these
parameters apply transitively to all methods that are invoked. For example, the parameters
would apply to any calls made on stateless objects inside of myMathWorker.doSomeWork().
A simple way of specifying a stateless replication policy for an entire application is to set
it in the root object, i.e., the first object, of an application; a sketch of this pattern
follows Figure 49.
FIGURE 48: Declaring a Mentat class as stateless

    stateless mentat class Math {
    public:
        float doSomeWork(int workID);
    };

FIGURE 49: Specifying parameters for the stateless replication policy

    (1)  main() {
    (2)      Math myMathWorker;
    (3)      int i = MAX_ITERATIONS;
    (4)      int results[MAX_ITERATIONS];
    (5)
    (6)      int max_num_retries = 3;  // try for a request a maximum of three times
    (7)      int timeout = 300;        // if a request has not completed within 300 seconds, restart it
    (8)
    (9)      // specify the stateless replication policy
    (10)     AutoStack_StatelessRetrySetting sr(max_num_retries, timeout);
    (11)
    (12)     for (i = 0; i < MAX_ITERATIONS; ++i)
    (13)         results[i] = myMathWorker.doSomeWork(i);
    (14)
    (15)     for (i = 0; i < MAX_ITERATIONS; ++i)
    (16)         printf("result[%d] = %d\n", i, results[i]);
    (17) }
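Following the pattern of Figure 49, an application-wide policy could be set in the root
object's main(). This is a sketch only: AutoStack_StatelessRetrySetting and its two
parameters are taken from the figure, while run_application() merely stands in for the
rest of the program.

    main() {
        int max_num_retries = 3;      // retry each request at most three times
        int timeout = 300;            // seconds before a request is retried

        // Declared in the root object, the policy covers every call on a
        // stateless object made, directly or transitively, below this scope.
        AutoStack_StatelessRetrySetting sr(max_num_retries, timeout);

        run_application();            // hypothetical entry point for the program
    }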
6.3.2 Summary
Using the stateless replication policy requires programmers to add only three lines of
code. For the developer, implementing stateless replication entails adding the necessary
capabilities to retry computations. Incorporation of stateless replication is relatively
simple because MPL already replicates stateless objects to increase performance. Table 33
summarizes the work required in implementing and using stateless replication in MPL:
6.4 Summary
We have shown the integration of various fault-tolerance algorithms into multiple
programming tools in Legion. The tools chosen are already deployed and support the
current Legion user base.
TABLE 33: Summary of work required for integration of stateless replication
Whom Description of work Additional lines of code
Developers of MPL
• incorporation of stateless replication as described in §5.3.2
• 33 lines to implement specification of stateless replication policy
Programmers
• learning one new function to set the parameters of the algorithm
• 3 lines to set the parameters of the stateless replication algorithm (timeouts, number of retries)
We have shown the burden placed on programmers to be manageable. The most
difficult aspect of incorporating fault-tolerance techniques for programmers consisted of
writing routines to save and restore the local state of objects. Furthermore, tools could be
developed to automate the task of saving and restoring state. For environment developers,
integration of algorithms consisted mainly of linking and using the proper library.
A distributed system is one that stops you from getting any work done when a machine you’ve never even heard of crashes.
— Leslie Lamport
Chapter 7
Evaluation
The goals of this chapter are to evaluate the overhead of the framework and to
demonstrate the successful integration of fault-tolerance techniques into grid
applications. We evaluate our framework based on the criteria outlined in §1.3: multiple
tool support, breadth of fault-tolerance techniques, ease-of-use, localized cost and
framework overhead. To demonstrate multiple tool support and breadth of techniques, we
present three applications written using the different tools and techniques described in
Chapters 5 and 6. To evaluate ease-of-use, we show the number of additional lines of code
inserted by programmers to incorporate fault tolerance. Our framework supports localized
cost as techniques are only integrated in applications that need them. To evaluate the
performance of our framework, we measured the overhead of processing events and event
handlers introduced by the integration of fault-tolerance techniques without measuring the
algorithmic cost—the cost inherent to running the algorithms themselves. Furthermore,
for each technique, we present performance numbers on a real-world application. We
show that the incorporation of fault-tolerance techniques enables these applications to
tolerate more crash faults than if no techniques had been used.
We used four applications: RPC, Context, BT-MED and Complib. RPC is a simple
application that performs a series of remote procedure calls and serves to estimate the
overhead of the framework. Context is a directory service that maps string names to
Legion Object IDentifiers and is written using the stub generator. BT-MED is a barotropic
ocean model written in MPI and was developed at the Naval Oceanographic Office.
Complib is a biochemistry application that compares libraries of protein or DNA
sequences and is written in the Mentat Programming Language.
We present the integration of the pessimistic method logging and passive replication
algorithms into Context (§7.1.2), of SPMD and 2PCDC checkpointing into BT-MED
(§7.2.2), and of stateless replication into Complib (§7.3.2). For each we ran three
experiments: (1) a baseline run without any incorporated fault-tolerance techniques, (2) a
failure-free run with a fault-tolerance technique incorporated, and (3), a run in which we
induced a permanent host failure.
Our testbed consisted of a homogeneous Legion environment with twenty 400 MHz
dual-processor Pentium II machines running the Linux operating system, connected by a 100 Mb
Ethernet network. Storage for this Legion configuration was provided through NFS. We
shared CPU and storage resources with other users. In general, the hosts were lightly
loaded, and contention for the NFS storage was variable. We simulated the crash failure of
a host by killing all our processes running on the target host. Note that the experiments in
this section were not based on an experimental design (in the statistical sense). Instead,
they were designed to illustrate the behavior of applications with various fault-tolerance
techniques integrated and to show that applications can survive a single crash failure
whereas they would not if no fault-tolerance had been integrated.
7.1 Stub Generator
We measure the overhead of the framework using the RPC application (§7.1.1). We
estimate the overhead of integrating the pessimistic method logging and passive
replication algorithms into the stub generator by comparing the time for a read remote
procedure call. A read call measures the overhead of using events to process incoming
methods but does not incorporate the algorithmic cost of invoking methods to a logger or
backup object. We then present the integration of pessimistic method logging and passive
replication into the Context application (§7.1.2).
7.1.1 RPC
RPC consists of a series of remote procedure calls between a client and a server. Table 34
presents the performance of plain RPC (SG-RPC), RPC in conjunction with pessimistic
method logging (PML-RPC) and RPC in conjunction with passive replication (PR-RPC).
We measured performance in terms of the amount of time to complete a remote procedure
call. Each number reported represents the mean and 95% confidence interval for 100 runs.
All three versions of RPC contained 100 KB of state data.
TABLE 34: Stub generator – RPC performance (n = 100, α = 0.05)

Test name   Read/write   Performance (msec/iter),   Performance (msec/iter),
                         payload 0 KB               payload 100 KB
SG-RPC      read         8.13 ± 0.01                30.68 ± 0.55
PML-RPC     read         8.69 ± 0.01                33.64 ± 1.30
PR-RPC      read         8.38 ± 0.01                32.53 ± 1.39
SG-RPC      write        8.15 ± 0.01                30.60 ± 0.55
PML-RPC     write        25.01 ± 0.13               81.95 ± 3.04
PR-RPC      write        40.46 ± 0.89               66.00 ± 2.39
For read calls and a payload of 0 KB, PML-RPC and PR-RPC are within 0.56 msec, or 7%,
of SG-RPC. For read calls and a payload of 100 KB, PML-RPC and PR-RPC are within 3
msec or 10% of SG-RPC. In these test cases, PML-RPC and PR-RPC do not perform
operations such as method logging or state transfer. Therefore, we estimate the overhead
of the framework in implementing pessimistic method logging and passive replication by
attributing the overhead for these test cases to the framework itself.
For write calls, PML-RPC and PR-RPC perform considerably worse than SG-RPC,
with overheads of 17 msec for PML-RPC and 32 msec for PR-RPC in the no payload case.
In the 100K payload case, the overhead was 51 msec for PML-RPC and 35 msec for PR-
RPC. For each remote procedure call, PR-RPC transfers the state (100KB) to the backup
server while PML-RPC transfers a copy of each method to a logger object. Thus, PR-RPC
and PML-RPC incur the cost of an additional remote procedure call as well as any
processing required by the algorithm itself, e.g., updating the backup state or logging
methods onto disk.
The overall performance of using PML-RPC and PR-RPC depends on the ratio of read
to write calls. Whether the additional overhead of pessimistic method logging and passive
replication is acceptable depends on the application to which they are applied. In the next
section, we apply both of these techniques to the Context application. In general, pessimistic
method logging is preferable to passive replication when the state is relatively large.
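As a rough illustration of this tradeoff, consider the 0 KB payload figures from Table 34
under an assumed workload of 90% reads (the mix is hypothetical): the expected per-call
overhead of PML-RPC over SG-RPC is then

    $\Delta \approx r\,\Delta_{read} + (1-r)\,\Delta_{write} = 0.9 \times 0.56 + 0.1 \times 16.86 \approx 2.2~\mathrm{msec},$

roughly a 27% slowdown on the 8.13 msec baseline; a write-heavy mix would be
penalized far more.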
7.1.2 Context
We present and analyze the overhead of pessimistic method logging (PML) and
passive replication (PR) using the application Context. Context is a commonly-used
Legion application that provides a directory service to map human-readable string names
to Legion object identifiers (LOIDs). Context can be viewed as analogous to a standard
Unix file system: instead of mapping filenames to inodes, a context maps names to
LOIDs.
Contexts provide Legion users with a hierarchical directory service. The interface for a
Context object is shown in Figure 50. The state of a context object consists of a set of
entries, where each entry maps a string name to a Legion object identifier (LegionLOID).
Incorporating the save and restore state functions to support pessimistic method logging
and passive replication required an additional 16 lines of code.
FIGURE 50: Interface for context object
    class ContextObject {
    private:
        AssociationSet<String, LegionLOID> loids;   // maps strings to LegionLOIDs
    public:
        int add(String, LegionLOID);                // add an entry
        int remove(String);                         // remove an entry
        READONLY LegionLOID lookup(String);         // lookup an entry
        READONLY String reverseLookup(LegionLOID);  // reverse lookup of an entry
    };
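To give a flavor of the 16 lines mentioned above, the following sketch shows what the save
and restore routines might look like for the entry set of Figure 50. The BUFFER accessors
(put_int, put_string, put_loid and their get_ counterparts) and the iteration interface are
assumptions made for illustration, not the actual Legion or stub-generated API.

    // Sketch only: the BUFFER accessors and iterator interface are assumed
    // names, not the actual Legion or SG-generated API.
    SaveUserState(BUFFER state) {
        state.put_int(loids.size());                   // number of entries
        for (AssociationSetIterator it(loids); !it.done(); it.next()) {
            state.put_string(it.name());               // entry name
            state.put_loid(it.loid());                 // entry LOID
        }
    }

    RestoreUserState(BUFFER state) {
        int n = state.get_int();
        for (int i = 0; i < n; ++i) {
            String name     = state.get_string();
            LegionLOID loid = state.get_loid();
            loids.add(name, loid);                     // rebuild the entry set
        }
    }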
We ran three versions of Context: the baseline version (SG-Context, Figure 51a), a
version with pessimistic method logging (PML-Context, Figure 51b), and a version with
passive replication (PR-Context, Figure 51c).
In Table 35, we present performance numbers for a context server object with 1000
entries, which corresponds to a state of 281 KB. Note that 1000 entries is a conservative
scenario since Legion context objects typically contain less than 100 entries. Each number
reported represents the mean and 95% confidence interval of 100 runs.
TABLE 35: Context performance (n = 100, α = 0.05)

Test name     Read/write   Performance (msec/iteration)
SG-Context    read         8.73 ± 0.02
PML-Context   read         9.01 ± 0.02
PR-Context    read         9.34 ± 0.02
SG-Context    write        9.00 ± 0.01
PML-Context   write        24.66 ± 0.06
PR-Context    write        1944 ± 1.06
FIGURE 51: Context application structure. (a) SG-Context: client and context object.
(b) PML-Context: client, context object, and logger. (c) PR-Context: client, context
object, and backup.
Note that the performance for SG-Context is lower than that for our standard remote
procedure calls baseline (SG-RPC) from §7.1.1. The reason is that Context objects save
their state on a local disk for every state-updating method invocation.
For read calls, the overhead of using pessimistic method logging and passive
replication is within 0.61 msec (7%) of the baseline case. For write calls, the overhead of
using PML-Context is 15 msec (174%) and PR-Context is 1935 msec (21500%). For this
application, the overhead of using pessimistic method logging is acceptable. However, the
overhead of using passive replication is too high. Thus, passive replication is not suitable
for context objects with a large number of entries.
The PML-Context and PR-Context applications are designed to tolerate a single host
failure. If this assumption is violated, e.g., 2 host failures, the applications would fail. In
Table 36, we show the performance characteristics of PML-Context and PR-Context under
a failure scenario by inducing a server crash approximately 5 seconds after the start of the
test. We set up the client to time-out and retry a remote procedure call after 200 seconds.
The number of entries in the Context object was 100.
TABLE 36: Context performance with one induced failure (n = 5, α = 0.05)

Test name     Write ratio   Recovery time (seconds)
PML-Context   100%          247 ± 2
PR-Context    100%          245 ± 2
The recovery time for PML-Context and PR-Context is determined by the amount of
time required by a Legion class object to declare its instances as having failed. By
default, a Legion class object requires up to 330 seconds to detect failure when a host fails,
hence the relatively long recovery times for both tests. Future work consists of reducing the
failover time by allowing programmers to set their own timeouts.
7.2 MPI
We present the overhead of the framework in integrating the SPMD and 2PCDC
checkpointing techniques by measuring the time required for a send and receive operation
(§7.2.1). We then present performance numbers for the BT-MED application (§7.2.2).
7.2.1 RPC
In Table 37 we show the time required to perform a send and receive operation. The
numbers shown represent the mean and 95% confidence interval for 20 runs. To measure
the cost of integrating the SPMD and 2PCDC algorithms (and not the algorithmic cost of
taking checkpoints), we set the checkpoint interval arbitrarily high so that no checkpoints
were taken. In our tests, there appears to be no significant difference between the baseline
case and the cases where the SPMD and 2PCDC algorithms are integrated. In the SPMD
case, there is no extra processing that is required; the algorithm only takes effect when the
programmer requests a checkpoint. In the 2PCDC case, the event handlers used to count
the number of messages sent and received do not add any significant overhead.
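To see why these handlers are cheap, note that counting amounts to incrementing an
integer per message. The sketch below illustrates the idea; the Event type, the
EventManager, and the add_handler registration interface are hypothetical names,
not the framework's actual API.

    // Hypothetical sketch of 2PCDC message counting; the Event type and
    // registration interface are assumed names, not the framework's real API.
    static int messages_sent = 0;
    static int messages_received = 0;

    void count_send(Event &e)    { messages_sent++; }      // O(1) per message
    void count_receive(Event &e) { messages_received++; }  // O(1) per message

    void install_counters(EventManager &events) {
        events.add_handler(MESSAGE_SEND, count_send);
        events.add_handler(MESSAGE_RECEIVE, count_receive);
    }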
7.2.2 BT-MED
BT-MED is a barotropic ocean model that simulates sea surface height and
temperature. It is used at the Naval Oceanographic Office as a benchmarking program and
is representative of a full-scale ocean model. BT-MED is written in Fortran and MPI and is
a typical 2-dimensional SPMD code. Figure 52 shows BT-MED configured with four
workers and one checkpoint server. Each worker is responsible for a sub-domain of the
entire data grid and periodically exchanges information with its nearest neighbor.
Programmer modifications to incorporate the checkpointing algorithms consist of
146 additional lines of code: 36 lines for initializing the checkpointing algorithm and
taking checkpoints, and 110 lines for saving and restoring state.
TABLE 37: Send and receive performance (n = 20, α = 0.05)

Test name   Performance (msec/iteration)
MPI-RPC     9.49 ± 0.05
SPMD-RPC    9.49 ± 0.04
2PCDC-RPC   9.49 ± 0.06
FIGURE 52: BT-MED application structure. Four workers (W1 through W4), each
responsible for a sub-domain of the data grid, exchange boundary data with their
nearest neighbors and save state to a checkpoint server.
We ran BT-MED under three configurations, with 4, 9 and 16 workers (Table 38). For
each configuration, we ran three versions of BT-MED, the baseline version with no fault
tolerance (BT-MED), a version with the SPMD checkpointing algorithm (SPMD-BTMED)
and a version with the 2PCDC checkpointing algorithm (2PCDC-BTMED). As the
number of workers increased, we scaled the problem size so that the workload for each
worker was kept constant. The amount of data saved in a checkpoint for each worker was
32,065,348 bytes or approximately 30MB. Thus, the amount of data saved for each
application checkpoint was about 120MB, 270MB and 480MB, for 4, 9, and 16 workers,
respectively. As the amount of data scaled up, we increased the number of checkpoint
servers to avoid the obvious bottleneck had we used only a single checkpoint server. For
SPMD-BTMED we took a checkpoint on every 30th iteration of the main loop, for a total
of 2 checkpoints for the duration of the program. For 2PCDC-BTMED we initiated a
checkpoint every 125 seconds, for a total of 2 checkpoints. We selected 125 seconds to
ensure the same number of checkpoints as in the SPMD-BTMED version.
TABLE 38: BT-MED performance (n = 20, α = 0.05)

Test name     Number of   Number of            Elapsed time   Checkpoint overhead
              workers     checkpoint servers   (seconds)      (seconds/checkpoint)
BT-MED        4           n/a                  270 ± 1        n/a
BT-MED        9           n/a                  282 ± 1        n/a
BT-MED        16          n/a                  293 ± 1        n/a
SPMD-BTMED    4           1                    345 ± 9        37
SPMD-BTMED    9           2                    511 ± 3        114
SPMD-BTMED    16          3                    662 ± 11       185
2PCDC-BTMED   4           1                    431 ± 4        81
2PCDC-BTMED   9           2                    538 ± 3        125
2PCDC-BTMED   16          3                    680 ± 5        194
The overhead of checkpointing is significant—up to 194 seconds to transfer 480 MB
of data (2.47 MB/s). As the number of workers increases, the overhead of taking
checkpoints also increases. For SPMD-BTMED and 2PCDC-BTMED, the elapsed time
was dominated by the checkpoint overhead. In practice, a production run of a full-scale
ocean model would execute on the order of 10,000 or more iterations and checkpoint
about every 1000 iterations, thus the application would perform much more work in
relation to the overhead time of taking checkpoints.
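To make the point concrete, take the 16-worker SPMD numbers from Table 38 and
assume, as an approximation, that the per-checkpoint cost stays near the measured
185 seconds and that the 293-second baseline run corresponds to two 30-iteration
checkpoint intervals, i.e., roughly 147 seconds of computation per interval. At a
production interval of 1000 iterations, the same checkpoint would be amortized over
about 33 times more computation:

    $\frac{t_{ckpt}}{T_{interval}} \approx \frac{185}{(1000/30) \times 147} \approx 3.8\%,$

compared with roughly 125% (185/147) at the experimental interval.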
We cannot draw any conclusions as to the relative performance of SPMD and 2PCDC
checkpointing due to experimental conditions. Between the test cases for SPMD and
2PCDC checkpointing, other users were using the system and competing for CPU and
network resources. However, the intent of presenting this data is primarily to show the
successful integration of checkpointing into BT-MED.
The SPMD and 2PCDC checkpointing algorithms are designed to tolerate up to n
workers failing (assuming 1 worker per host). We assume that the hosts on which the
checkpoint servers are located and the host that starts the application do not fail. Thus, an
application can be restarted from the last saved consistent checkpoint. If this failure
assumption is violated, i.e., the host on which a checkpoint server is located crashes
permanently, then the application will cease to be restartable.
In Table 39, we present performance numbers with one failure induced during a test
run. For each test, we ensured that we crashed the target host only after the completion of
a complete checkpoint. Note that if a worker crashes while checkpointing is in progress,
then the checkpoint would not be committed and the application would be rolled back to
the previous consistent checkpoint.*
We varied the time at which we induced failure so that each application would be
killed at about the same iteration. The ping interval was set to 37 seconds; the
reconfiguration time was set to 60 seconds.
As BT-MED is a tightly synchronized application, its rate of progress is determined by
its slowest worker. As we competed for CPU and storage resources with other users, the
elapsed times for SPMD-BTMED and 2PCDC-BTMED exhibited a wide variance. The
intent behind Table 39 is to show that the applications recovered successfully.

* This has been done to confirm correct behavior.

TABLE 39: Performance with one induced failure (n = 10, α = 0.05)

Test name     Number of workers   Elapsed time (seconds)
SPMD-BTMED    4                   634 ± 49
SPMD-BTMED    9                   905 ± 43
SPMD-BTMED    16                  1138 ± 69
2PCDC-BTMED   4                   619 ± 15
2PCDC-BTMED   9                   832 ± 69
2PCDC-BTMED   16                  1000 ± 75

In general,
the time required to complete an application with a failure induced depends on the
following factors: the time to detect and initiate recovery, the time for each worker to
retrieve the state from the checkpoint server and restore its state, and the time to
recompute the work lost since the last consistent checkpoint.
7.3 Mentat
To demonstrate the performance overhead of stateless replication, we use two
applications: RPC and Complib. RPC consists of a series of remote procedure calls while
Complib is a biochemistry application that compares two libraries of protein or DNA
sequences.
7.3.1 RPC
In Table 40, we show the performance of RPC without and with fault tolerance (SR-
RPC) and with payloads of 0 and 100 KB. We used no replication, i.e., the number of
workers was one, and configured the proxy object with a queue depth of two (a queue depth
of two means that each worker will be issued at most two work requests at a time). As a
worker finishes servicing a call, the proxy object will send it another work request.
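The proxy's self-scheduling discipline can be sketched as follows. All names here
(Proxy, Reply, the work queue, and the helper calls) are illustrative assumptions; this
is not the Mentat implementation.

    // Illustrative sketch of self-scheduling with a queue depth of two.
    const int QUEUE_DEPTH = 2;             // outstanding requests per worker

    void Proxy::dispatch() {
        // Seed each worker with QUEUE_DEPTH work requests.
        for (int w = 0; w < num_workers; w++)
            for (int q = 0; q < QUEUE_DEPTH && !work_queue.empty(); q++)
                send_request(w, work_queue.pop());

        // As each reply arrives, deliver it and hand that worker its next
        // request, so no worker sits idle waiting on the proxy.
        while (outstanding_requests() > 0) {
            Reply r = wait_for_reply();
            deliver_to_client(r);
            if (!work_queue.empty())
                send_request(r.worker_id, work_queue.pop());
        }
    }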
This configuration allows us to measure the overhead of the stateless replication
algorithm. In all cases, the overhead of SR-RPC was within 2 msec, or 5% of RPC. It is
TABLE 40: RPC performance (1 worker, n = 100, α = 0.05)

Test name   Performance (msec/iteration),   Performance (msec/iteration),
            payload 0 KB                    payload 100 KB
RPC         29.29 ± 0.17                    67.62 ± 0.56
SR-RPC      30.75 ± 0.20                    69.53 ± 0.56
interesting to note the performance of the stub-generated SG-RPC from §7.1.1 (8 msec/
iteration) and RPC (29 msec/iteration). We attribute the performance difference to the
following two facts: (1) the proxy object imposes an additional level of indirection, and
(2), the self-scheduling algorithm imposes additional delays because a worker must first
notify the proxy after it finishes servicing a request so that the proxy can send it another.
7.3.2 Complib
In Figure 53, we show the architecture of Complib. The source and target libraries are
divided into equi-sized chunks. Each comparison consists of comparing a chunk from the
source library against a chunk from the target library. After each worker finishes a
comparison, it forwards the results to a collector object. After all chunks have been
compared, the application is finished.
FIGURE 53: Complib application structure. Stateless Complib workers (W1 through
Wn) receive comparison requests from a Complib proxy, which draws chunks from the
source and target libraries; results are forwarded to a collector.
For efficiency reasons, the designer of Complib used the source library as the object
from which to initiate the comparisons. Although we would have designed the architecture
differently—the main program would have initiated all the computations—we reuse the
existing code to show the incorporation of fault-tolerance techniques using an existing
application. The heart of Complib is shown in Figure 54.
We ran Complib to compare a library of 287 protein sequences against itself. This is a
small library; a standard library would include on the order of 10,000 sequences.
However, a small library suffices to gain an understanding of the performance of Complib
when incorporated with our fault-tolerance techniques. We ran Complib with 8 and 16
workers.
The libraries chosen resulted in 100 method calls to perform the comparisons. Under
the 8 worker configuration and a queue depth of 2, each worker initially received 2 work
requests, for a total of 16 work requests. The other 84 work requests were assigned to
workers as they finished working. Under the 16 worker configuration, 32 work requests
were initially assigned. The remaining 68 work requests were assigned to workers as they
finished working. Failure-free performance numbers were the mean and 95% confidence
intervals for 20 runs. Specifying the fault-tolerance policy required the programmer to add
three lines of code.
FIGURE 54: Complib main loop
    rcl_genome_lib source;      // MPL object - source library
    rcl_genome_lib target;      // MPL object - target library
    rcl_compare_sw worker;      // MPL stateless object
    rcl_collector collector;    // MPL object - gathers results from all comparisons

    for (i = 0; i < NUM_SOURCE_CHUNKS; ++i)
        for (j = 0; j < NUM_TARGET_CHUNKS; ++j)
            collector.gather(i, j, worker.compare(source.get_chunk(i), target.get_chunk(j)));
In Table 41, we show the performance of Complib with and without fault tolerance.
The performance overhead of incorporating fault tolerance was not observable. Thus, by
exploiting the semantics of stateless objects, we were able to replicate workers for both
performance and fault tolerance reasons.
The stateless replication algorithm is designed to tolerate the crash failure of up to n-1
workers (assuming one worker per host). We assumed that the hosts on which the
collector, library and proxy objects are located do not fail. If this failure assumption is
violated, then the application will not complete successfully.
We induced failure by killing a host 100 seconds after starting the application. We set
the retry time to 90 seconds, i.e., the proxy allowed 90 seconds for a work request to
complete once the request is sent to a worker. After 90 seconds, the proxy object
considered a work request to have failed and reassigned it to another worker.
TABLE 41: Complib performance (n = 20, α = 0.05)

Test name    Number of replicas   Elapsed time (seconds)
Complib      8                    321 ± 2
FT-Complib   8                    319 ± 1
Complib      16                   174 ± 1
FT-Complib   16                   174 ± 1
TABLE 42: Complib performance with failure induced (n = 10, α = 0.05)

Test name                Number of replicas   Elapsed time (seconds)
FT-Complib (1 failure)   8                    365 ± 3
FT-Complib (1 failure)   16                   225 ± 5
Retrying a work request that has failed to complete in a timely manner occurs
concurrently with the running of the application. Thus, as can be seen from our data, the
additional time required to run the application to completion can be less than the retry time
(51 seconds vs. 90 seconds). In general, the recovery time for the stateless replication
algorithm depends on the retry time and the time it takes to recompute the failed
computation.
7.4 Summary
In this chapter, we have shown the successful integration of fault-tolerance techniques
into grid applications written using multiple programming tools. In Table 43, we
summarize the number of lines required from programmers for the incorporation of
various techniques.
Table 43: Application summary

Application   Tool             Technique                    Lines of code   Number of failed workers tolerated
Context       Stub generator   Pessimistic method logging   16              1
Context       Stub generator   Passive replication          16              1
BT-MED        MPI              SPMD checkpointing           146             n
BT-MED        MPI              2PCDC checkpointing          146             n
Complib       Mentat           Stateless replication        3               n-1
Programmer modifications consisted of incorporating 16 lines of code for Context (out
of a total of 173 lines), 146 lines for BT-MED (1039 lines), and 3 lines of code for
Complib (1857 lines). For Context and BT-MED, most of the additional code entailed
writing routines to save and restore state. The integration of pessimistic method logging or
passive replication enables Context to tolerate the crash failure of 1 host. The integration
of SPMD checkpointing and 2PCDC checkpointing enables BT-MED to tolerate the crash
failure of up to n workers. If any worker crashes, BT-MED rolls back to its last consistent
checkpoint. The integration of stateless replication enables Complib to tolerate the crash
failure of up to n-1 workers.
In Table 44, we summarize the overhead inherent to the framework itself. We
measured the overhead to range between 0 and 3 msec, or in percentage terms, between
0% and 10%, for a remote procedure call. Measuring the overhead in terms of a remote
procedure call provides a conservative estimate—the true overhead depends on the
communication pattern and granularity of an application.
Table 44: Framework overhead based on RPC application

Tool                        Technique                    Framework overhead   Framework overhead
                                                         (msec/iteration)     (%)
Stub Generator              Pessimistic method logging   3                    10%
Stub Generator              Passive replication          2                    6%
Message Passing Interface   SPMD Checkpointing           0                    0%
Message Passing Interface   2PCDC Checkpointing          0                    0%
Mentat                      Stateless replication        2                    5%
In practice, the algorithmic overhead dominates. In the case of the Context application,
we show that the overhead of passive replication is too high (2 seconds for a remote
procedure call) while the overhead of pessimistic method logging is acceptable (15 msec).
For BT-MED, the frequency of taking checkpoints, and thus the overhead, can be
configured by users. For Complib, the overhead of using stateless replication is negligible.
in-fra-struc-ture \'in-fre-,strek-cher\ n (1927) The basic facilities, services, and installations needed for the functioning of a community
or society, such as transportation and communications systems, computational resources, water and power lines, and public institutions including schools, post offices, and prisons.
— Possible future definition for the American Heritage Dictionary
Chapter 8
Conclusion
This dissertation has addressed the problem of integrating fault-tolerance techniques
into grid applications. Our primary contribution is the development of a reflective
framework for easily incorporating fault-tolerance techniques into object-based grid
applications. To support this claim, we have demonstrated the integration of several fault-
tolerance techniques—checkpointing, passive replication, pessimistic logging and
stateless replication—with several grid programming tools, the Message Passing
Interface, the Mentat Programming Language and the Stub Generator, in the Legion grid
environment. Using these programming tools augmented with fault-tolerance capabilities,
we have shown how applications can be written to tolerate crash failures. To demonstrate
ease of use, we have shown that programmers needed only to insert a few lines of
additional code or write routines to save and restore the local state of objects.
A secondary contribution is the development of a flexible event notification model to
propagate events between objects. The salient features of the model are that it enables the
specification of event propagation policies to be set on a per-application, per-object, or
per-method basis, and that it unifies the concepts of events and exceptions—an exception
is simply a special kind of event.
To our knowledge, we are the first to advocate the use of reflection to structure grid
applications. Furthermore, we are the first to show the integration of multiple fault-
tolerance techniques in grid applications using a single framework. Prior to our work, the
development and integration of fault-tolerance techniques in computational grids had
been provided through point solutions, i.e., tool developers designed their own fault-
tolerant solutions (if any).
8.1 Limitations
In this dissertation, we have only considered fault-tolerance techniques designed to
mask the crash failure of objects. We have not looked extensively at techniques designed
to cope with other failure assumptions, e.g., network partitioning, or non-masking fault-
tolerance techniques [KNIG98].
Furthermore, we have assumed that Legion objects fail by crashing and that their
failure is eventually detectable, that network partitions do not occur within a site, and that
objects have access to reliable storage. If these assumptions are violated, then applications
that integrate techniques based on these assumptions may not complete successfully. Thus,
the implication is that some applications, e.g., life-critical applications, may not be
suitable for this environment. Note that these observations are generic and not specific to
Legion; they would apply to any other computational grid as well.
8.2 Future Work
The limitations presented above naturally lead to several areas of future research. We
would like to incorporate more techniques into grid applications using our framework. For
example, we would like to incorporate techniques to cope with network partitioning. A
first step would be to extend the checkpointing and stateless replication algorithms
presented in this dissertation to tolerate network partitioning. Informally, in the case of
checkpointing, one could restrict the storage of checkpoints to a single primary site. As
long as the checkpoints are available, an application can be restarted successfully. For
stateless replication, workers outside the primary partition could be treated as
having failed. The work that they were responsible for could be reassigned to other
workers within the primary partition. Furthermore, we would like to incorporate
additional techniques within our framework, e.g., causal message logging, nested
checkpointing techniques, as well as other tools, e.g., tools to automate the saving and
restoring of application state.
A second area of research would be to investigate the failure models that are most
appropriate for grids and provide experimental validation for any proposed models. As of
this writing, the grid community has not yet settled on a failure model.
A third area of research would be to develop new algorithms designed specifically for
grids. For example, a richer interface description language could lead to algorithms that
exploit semantic information. The stateless replication algorithm presented in this
dissertation is an example of an algorithm that exploits the side-effect free nature of
stateless objects for both fault tolerance and performance. We believe that with additional
semantic information, new and efficient algorithms could be designed for grids.
Finally, another area of research is to investigate failure detection in grids. The current
Legion system is conservatively configured and employs relatively long timeouts
(upwards of 300 seconds) to detect and mark an object as having failed. We believe a more
flexible model is required in which application programmers can set their own policy
regarding the aggressiveness of the failure detection mechanism and the type of failure
detector used [CHAN96]. Furthermore, we would like to incorporate network diagnostic
tools such as SNMP in our failure detection mechanisms.
He who wonders discovers that this in itself is wonderful.
— M. C. Escher
References
AGHA94 Agha, G., Sturman, D. C., A Methodology for Adapting Patterns of Faults,
Foundations of Dependable Computing: Models and Frameworks for
Dependable Systems, Kluwer Academic Publishers, Vol. 1, pp. 23-60,
1994.
AKSI98 Aksit, M., Tekinerdogan, B., Solving the Modeling Problems of Object-
Oriented Languages by Composing Multiple Aspects using Composition
Filters, (ECOOP ‘98), 1998.
ALEX96 Alexandrov, A. D., Ibel, M., Schauser, K., Scheiman, C. J., SuperWeb:
Research Issues in Java-Based Global Computing, Proceedings of the
Workshop on Java for High Performance Scientific and Engineering
Computing Simulation and Modelling, Syracuse University, New York,
1996.
ALVI93 Alvisi, L., Hoppe, B., Marzullo, K., Nonblocking and Orphan-free
Message Logging Protocols, Proceedings of the 23rd Fault-Tolerant
Computing Symposium, pp. 145-154, June 1993.
ALVI98 Alvisi, L., Marzullo, K., Message Logging: Pessimistic, Optimistic, Causal
and Optimal, IEEE Transactions on Software Engineering, Vol. 24, No. 2,
pp. 149-159, February 1998.
ANDE81 Anderson, T., Lee, P. A., Fault Tolerance Principles and Practice, Prentice
Hall, Englewood Cliffs, 1981.
ARJU92 —, The Arjuna System Programmer’s Guide, Department of Computer
Science, University of Newcastle-upon-Tyne, UK, July 1992.
BABA92 Babaoglu, O., et al., Paralex: An Environment for Parallel Programming
in Distributed Systems, Technical Report UBLCS-92-4, Laboratory for
Computer Science, University of Bologna, October 1992.
BABB84 Babb, R. F., Parallel Processing with Large-Grain Data Flow Techniques,
IEEE Computer, pp. 55-61, July 1984.
BALD96 Baldeschwieler, J. E., Blumofe, R. D., Brewer, E. A., ATLAS: An
Infrastructure for Global Computing, Proceedings of the Seventh ACM
SIGOPS European Workshop on System Support for Worldwide
Applications, 1996.
BEGU92 Beguelin, A., et al., HeNCE: Graphical Development Tools for Network-
Based Concurrent Computing, Proceedings SHPCC-92, Williamsburg, VA,
pp. 129-36, May 1992.
BEGU97 Beguelin, A., Seligman E., Stephan P., Application Level Fault Tolerance
in Heterogeneous Networks of Workstations, Journal of Parallel and
Distributed Computing on Workstation Clusters and Network-based
Computing, June 1997.
BENN95 Ben-Natan, R., CORBA, a Guide to the Common Object Request Broker
Architecture, McGraw-Hill, 1995.
BERS95 Bershad, B., et al., Extensibility, Safety and Performance in the SPIN
Operating System, Proceedings of the Fifteenth ACM Symposium on
Operating System Principles, pp. 267-284, Copper Mountain, CO, 1995.
BIRM93 Birman, K. P., The Process Group Approach to Reliable Distributed
Computing, Communications of the ACM, Vol. 36, No. 12, pp. 127-133,
December 1993.
BIRM94 Birman, K. P., A Response to Cheriton and Skeen’s Criticism of Causal and
Totally Ordered Communication, Operating Systems Review, Vol. 28, No.
1, pp. 11-21, January 1994.
BIRM96 Birman, K. P., Building Secure and Reliable Network Applications,
Prentice Hall, ISBN: 0137195842, October 1996.
BHAT97 Bhatti, N. T., et al., Coyote: A System for Constructing Fine-Grain
Configurable Communication Services, Department of Computer Science
Technical Report TR 97-12, University of Arizona, July 1997.
BLAI98 Blair, G. S., et al., An Architecture for Next Generation Middleware,
Proceedings of Middleware ‘98, Springer-Verlag, pp. 191-206, September
1998.
BOND93 Bondavalli, A., Stankovic, J., Strigini, L., Adaptable Fault Tolerance for
Real-Time Systems, Proc. of Predictably Dependable Computing Systems,
September 1993.
BROW90 Browne, J. C., Lee, T., Werth, J., Experimental Evaluation of a Reusability-
Oriented Parallel Programming Environment, IEEE Transactions on
Software Engineering, pp. 111-120, February 1990.
BRUN98 Brunett, S., Davis, D., Gottschalk, T., Messina, P., Kesselman, C.,
Implementing distributed synthetic forces simulations in metacomputing
environments, Proceedings Heterogeneous Computing Workshop, 1998.
BUDH93 Budhiraja, N., Marzullo, K., Schneider, F. B., The Primary-Backup
Approach, Distributed Systems, ACM Press, pp. 199-215, 1993.
CAO92 Cao, J., Wang, K. C., An Abstract Model of Rollback Recovery Control in
Distributed Systems, ACM Operating Systems Review, pp. 62-76, October
1992.
CARR89 Carriero, N., Gelernter, D., Linda in Context, Communications of the
ACM, Vol. 32, No. 4, pp. 444-458, April 1989.
CARR95 Carriero, N., Freeman, E., Gelernter, D., Kaminsky, D., Adaptive
Parallelism and Piranha, IEEE Computer, pp. 40-49, January 1995.
CASA97 Casanova, H., Dongarra, J., NetSolve: A Network-Enabled Server for
Solving Computational Science Problems, The International Journal of
Supercomputer Applications and High Performance Computing, Vol. 11,
No. 3, pp. 212-223, Fall 1997.
CHAN85 Chandy, K. M., Lamport, L., Distributed Snapshots: Determining Global
States of Distributed Systems, ACM Transactions on Computer Systems,
pp. 63-75, February 1985.
CHAN96 Chandra, T. D, Toueg, S., Unreliable Failure Detectors for Reliable
Distributed Systems, Journal of the ACM, Vol. 43, No. 2, pp. 225-267,
1996.
CHAR96 Charlton, P., Self-Configurable Software Agents, Advances in Object-
Oriented Metalevel Architectures and Reflection, CRC Press, pp. 103-127,
1996.
CHER93 Cheriton, D., Skeen, D., Understanding the Limitations of Causally and
Totally Ordered Communication, Proceedings of the Thirteenth ACM
Symposium on Operating Systems Principles, ACM Press, pp. 44-57,
December 1993.
CHIB95 Chiba, S., A Metaobject Protocol for C++, Proceedings of OOPSLA,
Austin, TX, USA, pp. 285-299, 1995.
CHRI91 Cristian, F., Understanding Fault-Tolerant Distributed Systems,
Communications of the ACM, Vol. 34, No. 2, pp. 57-78, Feb. 1991.
CHRI97 Christiansen, B. O., et al., Javelin: Internet-Based Parallel Computing
Using Java, Concurrency: Practice and Experience, Vol. 9, No. 11, November 1997.
DETL88 Detlefs, D. L., Herlihy, M. P., Wing, J. M., Inheritance of Synchronization
and Recovery Properties in Avalon/C++, Computer, pp. 57-69, December
1988.
DMSO98 DMSO, HLA Object Model Template Specification, Defense Modeling &
Simulation Office, http://hla.dmso.mil, version 1.3, Feb. 1998.
DOCT99 Distributed Object Computing Testbed, http://www.sdsc.edu/DOCT, Reagan
Moore, Principal Investigator, Enabling Technologies Group, San Diego
Supercomputing Center, July 1999.
ELNO92 Elnozahy, E. N., Zwaenepoel, W., Manetho: Transparent Rollback-
recovery with Low Overhead, Limited Rollback and Fast Output Commit,
IEEE Transactions on Computers, pp. 526-531, May 1992.
ELNO95 Elnozahy, E. N., Ratan, V., Segal, M. E., Experiences using DCE and
CORBA to Build Tools for Creating Highly-Available Distributed Systems,
International Conference on Open Distributed Processing, February 1995.
ELNO96 Elnozahy, E. N., Johnson, D. B., Wang, Y. M., A Survey of Rollback-
Recovery Protocols in Message Passing Systems, Technical Report CMU-
CS-96-181, Department of Computer Science, Carnegie Mellon
University, September 1996.
FABR95 Fabre, J. C., Nicomette, V., Perennou, T., Stroud, R. J., Wu, Z.,
Implementing Fault-Tolerant Applications using Reflective Object-
Oriented Programming, Proceedings of the 25th Symposium on Fault-
Tolerant Computing, pp. 489-498, June 1995.
FABR98 Fabre, J. C., Perennou, T., A Metaobject Architecture for Fault-Tolerant
Distributed Systems: The FRIENDS Approach, IEEE Transactions on
Computers, pp. 78-95, January 1998.
FELB99 Felber, P., Guerraoui, R., Fayad, M. E., Putting OO Distributed
Programming to Work, Communications of the ACM, pp. 97-101,
November 1999.
FERR97 Ferrari, A. J., Chapin, S. J., Grimshaw, A. S., Process Introspection: A
Heterogeneous Checkpoint/Restart Mechanism Based on Automatic Code
Modification, University of Virginia Computer Science Technical Report,
CS-97-05, March 25, 1997.
FERR98 Ferrari, A., Grimshaw, A. S., Basic Fortran Support in Legion, University
of Virginia Computer Science Technical Report, CS-98-11, March 1998.
FERR99 Ferrari, A., Knabe, F., Humphrey, M., Chapin, S., Grimshaw, A. S., A
Flexible Security System for Metacomputing Environments, High
Performance Computing and Networking Europe (HPCN Europe 99),
April 1999.
FOST94 Foster, I., Designing and Building Parallel Programs, Addison-Wesley
Publishing Company, 1994.
FOST97 Foster, I., Kesselman, C., Globus: A Metacomputing Infrastructure Toolkit,
International Journal of Supercomputing Applications, pp. 115-128, 1997.
FOST98 Foster, I., Geisler, J., Nickless, W., Smith, W., Tuecke, S., Software
Infrastructure for the I-WAY metacomputing experiment, IEEE
Concurrency: Practice and Experience, 1998.
FOST99 Foster, I., Kesselman, C., The Grid: Blueprint for a New Computing
Infrastructure, Morgan Kaufmann, pp. 15-51, 1999.
FOX96 Fox, G., Furmanski, W., Towards Web/Java based High Performance
Distributed Computing - An Evolving Virtual Machine, Proceedings of the
Fifth IEEE International Symposium on High Performance Distributed
Computing, Syracuse, NY, August 1996.
GARB95 Garbinato, B., Guerraoui, R., Mazouni, K., Implementation of the GARF
Replicated Objects Platform, Distributed Systems Engineering Journal,
Vol. 2, pp. 14-27, March 1995.
GART99 Gartner, F. C., Fundamentals of Fault-Tolerant Distributed Computing in
Asynchronous Environments, ACM Computing Surveys, Vol. 31 No. 1, pp.
1-26, March 1999.
GELE89 Gelernter, D., Multiple Tuple Spaces in Linda, volume 366 of Lecture
Notes in Computer Science, Proceedings of Parallel Architectures and
Languages, Europe, Volume 2, Springer-Verlag, Berlin/New York, pp. 20-
27, June 1989.
GEIS94 Geist, G. A., et al., PVM : Parallel Virtual Machine : A Users’ Guide and
Tutorial for Networked Parallel Computing, Scientific and Engineering
Computation Series, MIT Press, December 1994.
GEIS97 Geist, G. A., Kohl, J. A., Papadopoulos, P. M., CUMULVS: Providing
Fault-Tolerance, Visualization and Steering of Parallel Applications,
International Journal of High Performance Computing Applications,
Vol. 11, No. 3, August 1997, pp. 224-236.
GRAY85 Gray, J., Why Do Computers Stop and What Can Be Done About It?,
Tandem Technical Report 85.7, June 1985.
GRIM96A Grimshaw, A. S., Ferrari, A., West, E., Mentat, Parallel Programming Using
C++, The MIT Press, Cambridge, Massachusetts, pp. 383-427, 1996.
GRIM96B Grimshaw, A. S., Weissman, J. B., Strayer T., Portable Run-Time Support
for Dynamic Object-Oriented Parallel Processing, ACM Transactions on
Computer Systems, Vol. 14, No. 2, 1996.
GRIM97A Grimshaw, A. S., Wulf, W., The Legion Vision of a Worldwide Virtual
Computer, Communications of the ACM, pp. 39-45, January 1997.
GRIM97B Grimshaw, A. S., Nguyen-Tuong, A., Lewis, M., Hyett, M., Campus-Wide
Computing: Early Results Using Legion at the University of Virginia,
Journal of Supercomputing Applications and High Performance
Computing, Vol. 11, No. 2, pp. 129-143, Summer 1997.
GRIM98 Grimshaw, A. S., et al., Metasystems, Communications of the ACM, pp.
46-55, November 1998.
GRIM99 Grimshaw, A. S., Ferrari, A., Knabe, F., Humphrey, M., Wide-Area
Computing: Resource Sharing on a Large Scale, IEEE Computer, Vol. 32,
No. 5, pp. 29-37, May 1999.
GROP99 Gropp, W., Lusk, E., Skjellum, A., Using MPI: Portable Parallel
Programming with the Message-Passing Interface, Scientific and
Engineering Computation Series, MIT Press, December 1999.
GUER97 Guerraoui, R., Garbinato, B., Mazouni, K. R., Garf: A Tool for
Programming Reliable Distributed Applications, IEEE Concurrency, pp.
32-39, October-December 1997.
HAYD98 Hayden, M., The Ensemble System, Cornell University Technical Report,
TR98-1662, January 1998.
HAYT98 Hayton, R., Herbert, A., Donaldson, D., FlexiNet – A flexible component
oriented middleware system, Proceedings of ACM SIGOPS European
Workshop, Sintra, Portugal, September 1998.
HIGH99 Highley, T., Lack, M., Myers, P., Aspect Oriented Programming: A Critical
Analysis of a New Programming Paradigm, University of Virginia,
Department of Computer Science Technical Report CS-99-29, May 1999.
HOFM94 Hofmeister, C., Dynamic Reconfiguration of Distributed Applications,
Ph.D. Dissertation, Technical Report CS-TR-3210, Department of
Computer Science, University of Maryland, January 1994.
HO99 Ho, E. D., Retrofitting Fault-Tolerance into CORBA-Based Applications,
Master’s Thesis, University of California, San Diego, 1999.
HUAN95 Huang, Y., Wang, Y. M., Why Optimistic Message Logging Has Not Been
Used in Telecommunication Systems, Proceedings of IEEE Fault-Tolerant
Computing Symposium, pp. 459-463, June 1995.
HUTC91 Hutchinson, N. C., Peterson, L. L., The x-kernel: an Architecture for
Implementing Network Protocols, IEEE Transactions on Software
Engineering, pp. 64-76, January 1991.
IONA95 IONA, ORBIX Programming Guide, IONA Technologies Ltd., 1995.
JAL94 Jalote, P., Fault Tolerance in Distributed Systems, Prentice-Hall, 1994.
JEON94 Jeong, K., Shasha, D., PLinda 2.0: A Transactional/checkpointing
Approach to Fault Tolerant Linda, Proceedings of the Thirteenth
Symposium on Reliable Distributed Systems, pp. 96-105, 1994.
KICZ91 Kiczales, G., des Rivieres, J., Bobrow, D. G., The Art of the Metaobject
Protocol, MIT Press, 1991.
KICZ97 Kiczales, G., Lamping, J., Mendhekar, A., et al., Aspect-Oriented
Programming, Xerox PARC, Palo Alto, California, June 1997.
KNIG98 Knight, J., Elder, M., and Du, X., Error Recovery in Critical Infrastructure
Systems, Proceedings of Computer Security, Dependability, and Assurance,
IEEE Computer Society Press, Los Alamitos, CA, pp. 49-71, 1999.
KOO87 Koo, R., and Toueg, S., Checkpointing and Rollback-Recovery for
Distributed Systems, IEEE Transactions on Software Engineering, pp. 23-
31, January 1987.
LAMP78 Lamport, L., Time, Clocks, and the Ordering of Events in a Distributed
System, Communications of the ACM, Vol. 21, No.7, pp. 558-565, July
1978.
LAND97 Landis, S., Maffeis, S., Building Reliable Distributed Systems with
CORBA, Theory and Practice of Object Systems, Vol. 3, No. 1, pp. 31-43,
April 1997.
LEDO99 Ledoux, T., OpenCorba: A Reflective Open Broker, Proceedings of Meta-
Level Architectures and Reflections, (Reflections ‘99), Lecture Notes in
Computer Science 1616, pp. 197-214, Springer, 1999.
LEGI99 Legion Research Group, Developer’s Manual, http://legion.virginia.edu,
1999.
LEIN99 Leinberger, W., Kumar, V., Information Power Grid: The New Frontier in
Parallel Computing?, IEEE Concurrency, pp. 75-84, October-December
1999.
LEON93 Leon, J., Fisher, A. L., Steenkiste, P., Fail-Safe PVM: a Portable Package
for Distributed Programming with Transparent Recovery, Carnegie Mellon
University Technical Report, CMU-CS-93-124, February 1993.
LIN90 Lin, L., Ahamad, M., Checkpointing and Rollback-Recovery in Distributed
Object Based Systems, 20th International Symposium on Fault-Tolerant
Computing, pp. 97-104, June 1990.
LITT94 Little, M. C., McCue, D. L., The Replica Management System: a Scheme
for Flexible and Dynamic Replication, Proceedings 2nd International
Workshop on Configurable Distributed Systems, pp. 46-57, 1994.
MAES87 Maes, P., Concepts and Experiments in Computational Reflection,
Proceedings of the ACM Conference on Object-Oriented Programming
Systems, Languages and Applications, pp. 147-55, October 1987.
MAFF95 Maffeis, S., Adding Group Communication and Fault Tolerance to
CORBA, Proceedings of the 1995 USENIX Conference on Object-Oriented
Technologies, Monterey, CA, June 1995.
MATT93 Mattern, F., Efficient Algorithms for Distributed Snapshots and Global
Virtual Time Approximation, Journal of Parallel and Distributed
Computing, pp. 423-434, 1993.
MORG99 Morgan, M., Post Mortem Debugger for Legion, Master’s Thesis,
University of Virginia, May 1999.
MOSE99 Moser, L. E., Melliar-Smith, P. M., Narasimhan, P., A Fault Tolerance
Framework for CORBA, 29th International Symposium on Fault-Tolerant
Computing, pp. 150-157, June 1999.
MOSS99 Mössenböck, H., Steindl, C., The Oberon-2 Reflection Model and Its
Application, Proceedings of Meta-Level Architectures and Reflections
(Reflections ‘99), Lecture Notes in Computer Science 1616, pp. 2-21,
Springer, 1999.
MULL93 Mullender, S. (ed.), Distributed Systems, Addison-Wesley, ISBN:
0201624273, 1993.
NAMP99 Namprempre, C., Sussman, J., Marzullo, K., Implementing Causal Logging
using OrbixWeb Interception, The Fifth USENIX Conference on Object-
Oriented Technologies and Systems, (COOTS ‘99), May 1999.
NGUY95 Nguyen-Tuong, A., Grimshaw, A. S., Karpovich, J. F., Fault Tolerance via
Replication in Coarse Grain Data Flow, Lecture Notes in Computer
Science 1068, Proceedings of Parallel Symbolic Languages and Systems,
October 1995.
NGUY96 Nguyen-Tuong, A., Grimshaw, A. S., Hyett, M., Exploiting Data-Flow for
Fault-Tolerance in a Wide-Area Parallel System, Proceedings 15th
Symposium on Reliable Distributed Systems, pp. 2-11, October 1996.
NGUY98 Nguyen-Tuong, A., Chapin, S. J., Grimshaw, A. S., Viles, C., Using
Reflection for Flexibility and Extensibility in a Metacomputing
Environment, University of Virginia Technical Report CS-98-33,
November 19, 1998.
NGUY99 Nguyen-Tuong, A., Grimshaw, A. S., Using Reflection for Incorporating
Fault-Tolerance Techniques into Distributed Applications, Parallel
Processing Letters, Vol. 9, No. 2, pp. 291-301, 1999.
NYE92 Nye, A., O’Reilly, T., X Toolkit Intrinsics Programming Manual for X11,
Release 5, O’Reilly & Associates, 1992.
OMG95 OMG, The Common Object Request Broker: Architecture and
Specification, OMG, 1995.
PAWL98 Pawlak, R., Seinturier, L., Implementation of an Event-Based RT-MOP,
Research Report CNAM-CEDRIC 98-04, June 1998.
POWE83 Powell, M. L., Presotto, D. L., Publishing: A Reliable Broadcast
Communication Mechanism, Proceedings of the 9th ACM Symposium on
Operating Systems Principles, Operating Systems Review, pp. 100-109, 1983.
POWE94 Powell, D., Distributed Fault-Tolerance — Lessons Learnt from Delta-4,
Hardware and Software Architecture for Fault Tolerance: Experiences and
Perspectives, LNCS 774, pp. 199-217, New York, Springer-Verlag, 1994.
QUIN94 Quinn, M. J., Parallel Computing: Theory and Practice, McGraw-Hill,
1994.
RAND75 Randell, B., System Structure for Software Fault Tolerance, IEEE
Transactions on Software Engineering, pp. 220-232, June 1975.
RENE93 Renesse, R. V., Causal Controversy at Le Mont St.-Michel, Operating
Systems Review, Vol. 27, No. 2, pp. 44-53, April 1993.
RENE94 Renesse, R. V., Why Bother with CATOCS?, Operating Systems Review,
Vol. 28, No. 1, pp. 22-27, January 1994.
RENE96 Renesse, R. V., Birman, K. P., Maffeis, S., Horus, a Flexible Group
Communication System, Communications of the ACM, April 1996.
RUSS80 Russell, D. L., State Restoration in Systems of Communicating Processes,
IEEE Transactions on Software Engineering, pp. 183-194, March 1980.
SABE94 Sabel, L., Marzullo, K., Simulating Fail-Stop in Asynchronous Distributed
Systems, Proceedings of the Thirteenth Symposium on Reliable Distributed
Systems, pp. 138-147, October 1994.
SALT90 Saltzer, J. H., Reed, D. P., Clark, D. D., End-to-End Arguments in System
Design, ACM Transactions on Computer Systems, Vol. 39, No. 4, April
1990.
SATO97 Sato, M., et al., Ninf: A Network based Information Library for a Global
World-Wide Computing Infrastructure, Proceedings of High Performance
Computing and Networking (HPCN ’97), Lecture Notes in Computer
Science 1225, pp. 491-502, 1997.
SCHN83 Schneider, F. B., Fail-Stop Processors, Digest of Papers, COMPCON 83,
pp. 66-70, 1983.
SCHN90 Schneider, F. B., Implementing Fault-Tolerant Services Using the State
Machine Approach: A Tutorial, ACM Computing Surveys, pp. 299-319,
December 1990.
SCHO98 Schönwälder, J., Garg, S., Huang, Y., van Moorsel, A. P. A., Yajnik, S., A
Management Interface for Distributed Fault Tolerant CORBA Services,
Proceedings of the IEEE Third International Workshop on Systems
Management, Newport, RI, pp. 98-107, April 1998.
SHAW96 Shaw, M., Garlan, D., Software Architecture: Perspectives on an Emerging
Discipline, Prentice Hall, 1996.
SILV95 Silva, L. M., Silva, J. G., Chapple, S., Clarke, L., Portable Checkpointing
and Recovery, Fourth International Symposium on High Performance
Distributed Computing, pp. 188-195, Pentagon City, Virginia, August
1995.
SING97 Singhai, A., Sane, A., Campbell, R., Reflective ORBs: Supporting Robust,
Time-critical Distribution, Proceedings of Workshop on Reflective Real-
Time Object-Oriented Programming and Systems, (ECOOP ‘97),
Jyväskylä, Finland, June 1997.
SMAR97 Smarr, L., Computational Infrastructure: Toward the 21st Century,
Communications of the ACM, November 1997.
SMIT82 Smith, B. C., Procedural Reflection in Programming Languages, PhD
Thesis, MIT, available as MIT Laboratory for Computer Science Technical
Report 272, Cambridge, Mass., 1982.
STAN98 Stankovic, J. A., Son, S. H., An Architecture and Object Model for
Distributed Object-Oriented Real-Time Databases, IEEE International
Symposium on Object-Oriented Real-Time Distributed Computing,
(ISORC'98), Kyoto, Japan, August 1998.
STAN99 Stankovic, J. A., Ramamritham, K., Niehaus, D., Humphrey, M., Wallace,
G., The Spring System: Integrated Support for Complex Real-Time Systems,
Real-Time Systems, Vol. 16, No. 2/3, pp. 97-125, May 1999.
STEL98 Stelling, P., Foster, I., Kesselman, C., Lee, C., von Laszewski, G., A Fault
Detection Service for Wide Area Distributed Computations, Proceedings
of the 7th IEEE Symposium on High Performance Distributed Computing,
pp. 268-278, 1998.
STRO96 Stroud, R. J., Wu, Z., Using Metaobject Protocols to Satisfy Non-
Functional Requirements, Advances in Object-Oriented Metalevel
Architectures and Reflection, Chapter 3, CRC Press, pp. 31-52, 1996.
STRO97 Stroustrup, B., The C++ Programming Language, Addison-Wesley, July
1997.
SULL96 Sullivan, K., Kalet, I. J., Notkin, D., Mediators in a Radiation Treatment
Planning Environment, IEEE Transactions on Software Engineering, Vol.
22, No. 8, pp. 563-579, August 1996.
SUN99A Sun Microsystems, Jini Specification, http://www.sun.com/jini/specs/,
1999.
SUN99B Sun Microsystems, JavaBeans Specification, http://java.sun.com/beans/,
1999.
TANE94 Tanenbaum, A. S., Distributed Operating Systems, Prentice Hall, ISBN:
0132199084, 1994.
TATS98 Tatsubori, M., Chiba, S., Programming Support of Design Patterns with
Compile-time Reflection, Proceedings of Workshop on Reflective
Programming in C++ and Java, UTCCP Report 98-4, Center for
Computational Physics, University of Tsukuba, Japan, 1998.
VILE97 Viles, C. L., et al., Enabling Flexibility in the Legion Run-Time Library,
International Conference on Parallel and Distributed Processing
Techniques, Las Vegas, NV, 1997.
WANG95 Wang, Y. M., The Maximum and Minimum Consistent Global Checkpoints
and their Applications, IEEE Symposium on Reliable and Distributed
Systems, pp. 86-95, September 1995.
WATA88 Watanabe, T., Yonezawa, A., Reflection in an Object-oriented Concurrent
Language, Proceedings of Object Oriented Programming, Systems,
Languages, and Applications, (OOPSLA ’88), pp. 306-315, 1988.
WELC99 Welch, I., Stroud, R., From Dalang to Kava - the Evolution of a Reflective
Java Extension, Proceedings of Meta-Level Architectures and Reflections,
(Reflections ‘99), Lecture Notes in Computer Science 1616, pp. 2-21,
Springer, 1999.
YOKO92 Yokote, Y., The Apertos Reflective Operating System: The Concept and its
Implementation, Proceedings of the 7th Conference on Object-Oriented
Programming Systems, Languages and Applications, pp. 414-434, 1992.