
An Adaptive Quality of Service Aware Middleware for Replicated Services

Sudha Krishnamurthy, William H. Sanders, Fellow, IEEE, and Michel Cukier, Member, IEEE

Abstract—A dependable middleware should be able to adaptively share the distributed resources it manages in order to meet diverse application requirements, even when the quality of service (QoS) is degraded due to uncertain variations in load and unanticipated failures. In this paper, we have addressed this issue in the context of a dependable middleware that adaptively manages replicated servers to deliver a timely and consistent response to time-sensitive client applications. These applications have specific temporal and consistency requirements, and can tolerate a certain degree of relaxed consistency in exchange for better response time. We propose a flexible QoS model that allows clients to specify their timeliness and consistency constraints. We also propose an adaptive framework that dynamically selects replicas to service a client's request based on the prediction made by probabilistic models. These models use the feedback from online performance monitoring of the replicas to provide probabilistic guarantees for meeting a client's QoS specification. The experimental results we have obtained demonstrate the role of feedback and the efficacy of simple analytical models for adaptively sharing the available replicas among the users under different workload scenarios.

Index Terms—Replica consistency, middleware, quality of service, timeliness, probabilistic modeling.

1 INTRODUCTION

OUR motivation for building a QoS-aware middleware stems from two main observations. First, distributed systems have different degrees of uncertainty arising from factors such as transient overloads and failures. Second, different distributed applications have diverse requirements. Hence, it is useful to design middleware-based solutions for sharing access to distributed services based on the QoS requirements of the clients. In our work, we target time-sensitive clients. Our goal is to develop a middleware-based approach to mediate a client's access and to allocate servers based on their ability to meet the quality of service requirements of the client. Simple as the goal seems, the problem is challenging, because the timeliness of a service depends on the performance characteristics of the servers, the distributed environment in which those services are deployed, and the number of users accessing a service. All of these factors vary with time in an unpredictable manner. As such, access to servers that is based on a simple directory lookup will not suffice for meeting the temporal constraints. Rather, the lookup has to be based on actively monitoring the changes in the dynamic properties of the servers. Further, in order to cope with the unpredictability, the middleware has to be designed to meet the demands of the clients under stable conditions as well as when there is a change in the availability of a service due to transient overloads and server failures. In short, the middleware has to be adaptive in order to provide both fault tolerance and timeliness.

The approach we use for providing fault-tolerant and responsive services makes use of replication. Replicating the servers provides robustness in times of failure by allowing access to a service even when some of the servers are not functioning, and improves the response time by allowing multiple clients to be serviced concurrently. However, replication by itself is not a solution for meeting the different QoS requirements. Rather, the available replica resources have to be managed and allocated to service the clients based on the QoS requested by the clients. This requires an understanding of the trade-offs between the different quality of service measures and an ability to map the requirements appropriately onto properties of the replicated resources. This mapping is often not straightforward, especially when some of the QoS requirements may be conflicting. For example, in order to provide good fault tolerance, we could allocate all the available replicas to service a client (e.g., [1], [9], [20], [5], [21]). However, such an approach would not be scalable, as it would increase the load on all the replicas and result in higher response times for the remaining clients. On the other hand, assigning a single replica to service each client would allow multiple clients to be serviced concurrently [3], [7]. However, if the replica failed while servicing a request, the failure might result in an unacceptable delay for the client being serviced. Hence, neither approach is suitable when a client has specific timing constraints and failure to meet those constraints results in a penalty for the client.

Furthermore, when the replicated state is modified by the clients, there is the additional challenge of permitting client operations to execute with the greatest possible concurrency to provide good response times, while ensuring that the replicated state does not diverge in an uncontrolled manner. We can ensure immediate convergence of replicated state by forcing all the replicas to commit the modifications at the same time (e.g., [2], [24], [21], [25]).


S. Krishnamurthy is with the Department of Computer Science, University of Virginia, Charlottesville, VA 22904. E-mail: [email protected].

W.H. Sanders is with the Department of Electrical and Computer Engineering and the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1308 W. Main Street, Urbana, IL 61801. E-mail: [email protected].

M. Cukier is with the Center for Reliability Engineering, Department of Mechanical Engineering, University of Maryland, College Park, MD 20742. E-mail: [email protected].

Manuscript received 8 Dec. 2002; revised 31 July 2003; accepted 3 Aug. 2003. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 118749.


However, such a strategy that ensures strong replica consistency limits the degree of concurrency and results in reduced responsiveness. On the other hand, in the weak consistency model (e.g., [27], [13], [8]), operations are performed on some subset of replicas, and the updates are propagated to the other replicas either lazily or on demand. Typically, the only guarantee provided to the clients is that the replicated state will eventually converge if update activity ceases. Several optimistic replication algorithms (e.g., [6], [22]) have been proposed for applications that can tolerate relaxed consistency. These algorithms allow a client to access any replica in order to provide better responsiveness, unlike the pessimistic algorithms, which allow access to only those servers that have the most up-to-date state. However, if the clients access different servers before their states converge, the resulting inconsistency may lead to conflicts.

Finally, when the currently available replicas are insufficient to meet the requirements of the clients, the middleware has to decide how to react appropriately. For example, should the middleware inform the client applications about the insufficiency and leave it to them to adapt? Should it limit the number of clients that it admits? Should it increase the size of the replica pool? If the middleware decides to add more replicas, it also has to decide how many to add and where to place them.

To summarize, in order to build a dependable, QoS-aware middleware for meeting a client's QoS specification, we need an approach that adaptively selects the appropriate replicas from the available replica pool. The replicas must be chosen to service the client, based on an understanding of the client's requirements and the dynamic properties of the replicas. Furthermore, we also need to enable the middleware to react appropriately when the available replicas are insufficient to meet the demands of the clients.

1.1 Paper Contributions

To address the above issues for managing replicated resources, we have developed a middleware-based framework that allows us to construct customized protocols tailored to the semantics of specific applications. We have implemented this framework in AQuA, a CORBA-based middleware that supports transparent replication of objects across a LAN [24]. The framework we have built uses simple analytical models to establish a relationship between a client's QoS specification and properties of the replicas. In [14] and [15], we described an adaptive replica allocation scheme that uses a probabilistic approach for providing temporal guarantees to the clients in two different cases: 1) when the replicated state is static and 2) when the replicated state is dynamic. In the first case, we assume that the replicas are always consistent and therefore do not address the issue of maintaining replica consistency. This is useful in applications such as compute servers, search engines, and directory servers, which mainly export interfaces for information retrieval. In the second case, in which the replicated state is time-varying, some of the replicas may have obsolete state. We target time-sensitive applications that can tolerate a certain degree of relaxed consistency in exchange for better response time and express their timeliness and consistency requirements in the form of a QoS specification. In order to select replicas to meet those requirements, we need to take into account the state of a replica when estimating its responsiveness. To do this, we developed an adaptive framework that supports tunable consistency and timeliness. Some of the applications that motivate the need for such a framework include real-time database applications, such as electronic patient recording systems and ticket reservation systems. In this paper, we compare and contrast the replica selection approaches used for the static and dynamic replicated states, and present additional experimental results that extend our earlier performance evaluation [16], [17].

1.2 Paper Organization

The remainder of this paper is organized as follows: In Section 2, we describe our QoS model that allows a broad spectrum of applications to express their timeliness and consistency requirements. Section 3 provides a brief overview of the AQuA architecture. In Section 4, we describe the replica organization that allows us to build protocols for providing different consistency guarantees and to use them on demand. These protocols use a combination of immediate and lazy update propagation to ensure that the states of the replicas do not diverge in an unacceptable manner. As specific examples, we describe the protocols we have implemented that allow the replicated services to provide sequential and FIFO ordering guarantees. In Section 5, we compare the probabilistic approach that uses the performance history of the replicas to predict the ability of the replicas to meet a client's QoS requirement for static and dynamic replicated states. In Section 6, we summarize the algorithms that use the prediction made by the probabilistic models to select replicas to meet the QoS requirements of the clients. In Section 7, we present experimental results. Finally, we discuss ideas for future extensions in Section 8 and present our conclusions in Section 9.

2 QOS MODEL FOR ACCESSING REPLICATED SERVICES

Our QoS model allows a broad spectrum of applications to express their requirements at a fairly high level of abstraction using a uniform interface. Applications may either specify their QoS requirements at start-up time or negotiate them at runtime as often as they want. In order to distinguish invocations that modify the state of an object from those that merely retrieve state, our QoS model allows a client application to identify all the read-only methods it invokes on an object by their names at the beginning of a session. If an operation is not specified as read-only, then our middleware considers it to be an update operation. An update operation is any invocation that modifies the state of the object on which the operation is performed, and may be either a write-only operation or a read-write operation. In order to provide access to replicated servers, we are mainly interested in providing quality of service along two dimensions: timeliness of response and consistency of replicated data.

2.1 Timeliness

Time-sensitive applications require timely execution of operations and timely responses to their requests. However, due to the uncertainty in the distributed environment, it is impossible to provide deterministic guarantees for meeting the temporal requirements. Instead, our goal is to provide probabilistic temporal guarantees. To achieve this, our QoS model allows a client to specify its temporal requirements as a pair of attributes: <response time, probability of timely response>. This pair specifies the time by which a client expects a response after it has transmitted its read request, and the minimum probability with which it expects its temporal constraint to be met. Failure to meet a client's response time constraint results in a timing failure for the client. The advantage of this probabilistic QoS model is that it allows the temporal requirements of applications to be treated as a continuous spectrum, instead of classifying them as hard real-time and soft real-time.

2.2 Consistency

Replica inconsistency may arise when multiple clients access an object concurrently, as some of the accesses result in modifications to the replicated state. In order for the responses to be meaningful to the clients, it is important to bound the degree of inconsistency when the replicated information is time-varying. Since different applications have different views of consistency, it is hard to capture the different consistency requirements using a single metric. Instead of using qualitative measures, such as strong and weak consistency, we believe that several applications will benefit from intermediate degrees of consistency that can be more precisely quantified [26], [30], [23].

Several researchers have extended traditional consistency models by incorporating the notion of time in order to bound the degree of inconsistency. For example, the notion of epsilon-serializability (defined in [23]) and timed consistency models (defined in [28], [18]) require that if a write is executed at time t, then the effect of the write should be visible to others by t + x, where x is the maximum acceptable delay for propagating the effect of the write. The TACT middleware [30] is another related work that attempts to provide a middleware framework for tunable consistency and availability. The consistency measures used by TACT to bound the level of inconsistency include the order error, which limits the number of tentative writes that can be outstanding at any replica; the numerical error, which bounds the difference between the value delivered to the client and the most consistent value; and staleness, which places a real-time bound on the delay for propagating the writes among the replicas.

Our QoS model regards consistency as a two-dimensional attribute: <ordering guarantee, staleness threshold>. The ordering guarantee is a service-specific attribute that denotes the guarantee that a service provides to all of its clients about the order in which their requests will be processed by the servers, so as to prevent conflicts between operations. Some well-known ordering guarantees that a service can offer are sequential (or total), causal, and FIFO [2], [4]. In our work, we target services that provide sequential and FIFO ordering guarantees. The staleness threshold, which is specified by the client, is a measure of the maximum degree of staleness a client is willing to tolerate in the response it receives. In our framework, the staleness of a response denotes the staleness of the state of the replica that sent the response. In order to meet a client's QoS specification, a response delivered to the client should be no more stale than the staleness threshold specified by the client. We compute the staleness of a replica by associating a timestamp with each update operation. We use timestamps based on "logical clocks" [19] because this obviates the need for synchronized clocks across the distributed replicas. These logical timestamps make it possible to specify the staleness in terms of "versions." Like the timeliness QoS model described above, the consistency QoS specification accommodates the needs of a broad spectrum of applications. For example, a client that requires strong consistency can request sequential ordering with staleness 0. On the other hand, in a scenario in which the replicated state is either absent or static (for example, when the client transactions are read-only), clients can allow their accesses to be unordered and ignore the staleness threshold.

As an example of the use of the above QoS model, consider a document-sharing application in which multiple readers and writers concurrently access a document that is updated in sequential mode. Using the above model, a client of such an application can specify that it wishes to obtain a copy of the document that is no more than five versions old within 2.0 seconds with a probability of at least 0.7. Our goal is to meet the above QoS requirements even when the availability of a service is degraded due to the failure of a replica.
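To make the two-dimensional specification concrete, the following sketch shows one way a client-side QoS specification could be represented; the QoSSpec type and its field names are our own illustration, not AQuA's actual interface.

```python
from dataclasses import dataclass

# Illustrative representation of the QoS model described above
# (hypothetical names; not AQuA's actual API).
@dataclass
class QoSSpec:
    deadline: float           # d: response time bound, in seconds
    min_probability: float    # Pc(d): minimum probability of a timely response
    ordering: str             # service-wide ordering guarantee, e.g. "sequential" or "fifo"
    staleness_threshold: int  # a: maximum tolerable staleness, in versions

# The document-sharing example from the text: a copy no more than five
# versions old, within 2.0 seconds, with probability at least 0.7.
spec = QoSSpec(deadline=2.0, min_probability=0.7,
               ordering="sequential", staleness_threshold=5)
```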

3 OVERVIEW OF AQUA

We now briefly describe the AQuA architecture. AQuA enhances the capabilities of CORBA objects by transparently replicating the objects across a LAN. A dependability manager manages the replication level for different applications based on their dependability requirements. Replicas offering the same service are organized into a group. Communication between members of a group takes place through the Maestro-Ensemble group communication layer [29], [11], above which AQuA is layered. The use of group communication in AQuA is transparent to the end applications. Hence, each of the clients, which are all CORBA objects, is given the perception that it is communicating with a single server object using CORBA's remote method invocation, although the client's request may be processed by multiple server replicas. This transparency is achieved using an AQuA gateway, which transparently intercepts a local application's CORBA message and forwards it to the destination replica group through Maestro-Ensemble. While previous work in AQuA has focused on gateway handlers for providing fault tolerance using the active and passive handlers [24], we have enhanced AQuA by developing gateway handlers that provide tunable consistency and timeliness guarantees for time-sensitive applications.

4 HIERARCHICAL REPLICA ORGANIZATION

Given the above QoS model, our goal is to build a framework that can be easily tuned to support the different application-specific requirements at the middleware layer. In order to design this framework, we address three main issues: 1) organization of the replicas, 2) development of protocols that implement different consistency semantics and design of an infrastructure that would allow the protocols to be used on demand, and 3) development of a mechanism to select replicas to service a client dynamically based on the client's QoS requirements. We will now describe the approach we have used to address these issues in the context of the AQuA middleware.


All the replicas offering the same service are organized into two groups: a primary replication group and a secondary replication group. We also have a QoS group, which encompasses all of the replicas of a service and their clients. The QoS group allows clients and servers to exchange messages, and it allows the servers to publish their performance updates to the client subscribers. In our implementation, all of these groups are derived from Maestro groups [29], and members of a group communicate with each other by making use of the Maestro-Ensemble group communication protocol [11], above which AQuA is layered. For each group, Ensemble elects one of the members of the group as the leader. We depend on Maestro-Ensemble to provide reliable, virtual synchrony and FIFO messaging guarantees, and build upon these guarantees to provide the different end-to-end consistency guarantees. We also depend on Maestro-Ensemble to inform the group members when changes in the group membership occur.

The primary and secondary replication groups may be used to organize the replicas of an object adaptively to implement different consistency semantics. The primary replication group is used to implement strong consistency semantics, whereas the secondary group implements weaker consistency semantics. The size of these groups can be tuned to implement a range of consistency semantics. For example, when the secondary group is empty and all the replicas are placed in the primary group, the replica organization supports an active replication approach (e.g., [24], [21]), in which all the replicas implement strong consistency semantics. On the other hand, by placing one of the replicas in the primary group and all the remaining replicas in the secondary group, one can implement a primary/backup protocol with multiple backup replicas.

In the case of static replicated state, in which the servers permit only read transactions, the primary group is empty and we place all the replicas offering a service in the secondary group. However, in the case of dynamic replicated state, we organize the replicas into the primary and secondary tiers. This two-level replica organization was motivated by the need to favor the read operations that can tolerate relaxed consistency to a certain degree, in exchange for a timely response. While a write-all scheme that writes to all the replicas concurrently always provides access to the latest updates, it may result in higher response times for the read operations. We therefore reduce the overheads incurred by a write-all scheme by performing the updates on the smaller primary group, while allowing the secondary replicas, which are greater in number, to handle the read-only operations of different clients. The primary replicas subsequently bring the state of the secondary replicas up-to-date using lazy update propagation. The degree of divergence between the states of primary and secondary replicas can be bounded by choosing an appropriate frequency for the lazy update propagation. Thus, while clients that need the most up-to-date state to be reflected in their response may have to depend more on the response from a primary replica, clients that are willing to tolerate a certain degree of staleness in their response can achieve better response times, due to the higher availability of the secondary replicas. Although in our work we restrict ourselves to a two-tier organization of replicas in order to study the trade-offs between timeliness and consistency, it should be easy to extend our architecture to multiple tiers representing intermediate degrees of staleness in the replica states.

4.1 Ordering Guarantees

We now describe how we maintain consistency across the replicas in the case of dynamic state. As mentioned in Section 2, in order to maintain replica consistency, we need to ensure that the replicas service their clients by respecting the ordering guarantee associated with the service. Our framework allows different ordering guarantees to be implemented as timed consistency handlers within the AQuA gateway, as shown in Fig. 1. We have implemented gateway handlers that provide sequential and FIFO ordering. The sequential handler was motivated by applications, such as document-sharing applications, in which all the clients access a common replicated state, and the servers globally order the requests of the clients in order to prevent conflicts. On the other hand, the FIFO handler, which provides weaker consistency, was designed to support applications, such as banking transactions, in which the replicated servers maintain states that are specific to each client. A client can communicate with a replicated service by using the gateway handler appropriate for the service. For example, Fig. 1 shows a client communicating with Service A using a sequential handler and with Service B using a FIFO handler. We have designed the protocols to ensure that the ordering guarantees are provided even when replica failures occur. The symbols S, W, and G in Fig. 1 represent different performance parameters of our probabilistic model. They will be elaborated later in Section 5.3.

Fig. 1. Timed consistency handlers in the AQuA gateway.

We now compare and contrast the sequential and FIFO handlers with respect to the way they service a client's update and read-only requests. In the case of both handlers, a client's update request is forwarded by the client gateway handler to all the primary replicas. The secondary replicas do not directly service a client's update request. Instead, the secondary replicas update their state when one of the members of the primary group lazily propagates its updated state to the secondary group. We call this member the lazy publisher. When a client invokes a read-only request, the client gateway handler forwards the request to a subset of primary and secondary replicas. In Section 6, we will describe the selection of this subset. A selected replica responds to the read request immediately if its most recently updated state is no more than x versions old, where x is the staleness threshold specified by the client in its QoS specification. In other words, the replica performs an immediate read operation if it meets the staleness specification of the client. However, a secondary replica may have a state that is more stale than the staleness threshold specified by the client, the reason being that the secondary replicas update their state only upon receiving the state update from the lazy publisher. In such a case, the replica performs a deferred read by buffering the read request and responding to the client upon receiving the next state update from the lazy publisher.

The sequential and FIFO handlers differ in the order in which the replicas commit the updates and the manner in which a replica determines if its state meets the staleness threshold specified by a client. In the sequential consistency case, all the replicas see the effects of the updates in the same sequential order. The order in which the replicas commit updates is determined by the Global Sequence Number (GSN) of the update operation, which is assigned by the leader of the primary group and broadcast by the leader to the other primary replicas. The leader merely serves as the sequencer and does not actually service the client's request. The value of the GSN at any instant of time may be considered to be the value of the leader's logical clock. For read-only operations, this GSN serves as the basis for determining the staleness of a replica. In contrast, the FIFO handler does not use a dedicated sequencer to determine the order in which the replicas commit their updates. Instead, the clients send a sequence number along with their invocations. The primary replicas commit the updates of each client in increasing order of this sequence number. In the case of a read request, the replicas use this client-specific sequence number to determine if their state with respect to a client's updates is within the client-specified staleness threshold. They then perform an immediate or deferred read accordingly.
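The commit and read paths can be summarized schematically. The sketch below is our own simplification, not AQuA's code: it models a replica that commits updates in strictly increasing GSN order, answers a read immediately when its state is within the client's staleness threshold, and otherwise buffers the read until the next state update arrives. The current_gsn argument, standing for the most recent GSN known at request time, is an assumption of this sketch, and the deferred-read flush conflates the primary commit and the secondary lazy update for brevity.

```python
import heapq

# Schematic replica for the sequential handler (illustrative, not AQuA code).
class Replica:
    def __init__(self):
        self.committed_gsn = 0    # logical clock: GSN of the last committed update
        self.pending = []         # updates that arrived out of GSN order
        self.deferred_reads = []  # reads buffered until the next state update

    def on_update(self, gsn, apply_op):
        # Commit updates strictly in increasing GSN order (sequential ordering).
        heapq.heappush(self.pending, (gsn, apply_op))
        while self.pending and self.pending[0][0] == self.committed_gsn + 1:
            _, op = heapq.heappop(self.pending)
            op()  # apply the update to the replicated state
            self.committed_gsn += 1
        # A fresh state update lets buffered deferred reads be answered.
        for reply in self.deferred_reads:
            reply(self.committed_gsn)
        self.deferred_reads.clear()

    def on_read(self, current_gsn, staleness_threshold, reply):
        # Immediate read if this replica's state is no more than the client's
        # staleness threshold (in versions) behind; otherwise defer the read.
        if current_gsn - self.committed_gsn <= staleness_threshold:
            reply(self.committed_gsn)
        else:
            self.deferred_reads.append(reply)
```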

Failure handling is another point in which the FIFO handler differs from the sequential handler. In our work, we assume that replicas fail by crashing. Since both the leader of the primary group and the lazy publisher play a crucial role in providing sequential consistency semantics, our algorithm handles their failures to ensure that the consistency guarantees are not violated. The Maestro-Ensemble group communication protocol greatly simplifies our handling of replica failures. First, if any of the group members fail, Maestro-Ensemble notifies the remaining group members about the failure. Further, if the leader of a group fails, Ensemble elects a new leader and notifies the other group members about the election. By virtue of this election protocol, we can guarantee that when the current sequencer, which in our case is the leader of the primary group, fails, then another member of the primary group will be elected the sequencer. The new sequencer first checks if the lazy publisher is still alive. If the lazy publisher has crashed, the sequencer designates one of the surviving members of the primary group as the new lazy publisher. The sequencer then notifies all the clients that it is the new sequencer. The sequencer also has the responsibility of preserving the sequential ordering guarantees during the transition. Since we depend on Maestro-Ensemble to provide virtual synchrony and reliability, we assume that all the replicas would have received the messages that were sent before the crash. Hence, to ensure sequential ordering, the new sequencer begins by assigning the GSN to the pending requests, if any, before making the GSN assignment for the newly incoming requests. Failure handling for FIFO ordering is relatively simpler because the FIFO handler does not use a dedicated sequencer. Hence, the FIFO handler has to handle only the failure of the lazy publisher. In the case of the FIFO protocol, the leader of the primary group is designated as the lazy publisher. When a new leader is elected by Ensemble to replace a failed leader, the leader-elect takes over as the new lazy publisher by first propagating its state to the secondary replicas. It then schedules the subsequent lazy updates with the appropriate frequency, and then continues to service the requests from the clients.

5 PROBABILISTIC MODELING OF THE RESPONSE TIME DISTRIBUTION

Having described the processing involved in the gateway handler on the server side, we now describe the processing done on the client side in order to meet the QoS specification of the client. As mentioned in Section 2, our work targets clients that have specific consistency and timeliness constraints. Each client expresses its constraints in the form of a QoS specification that includes the response time constraint, d, and the minimum probability of meeting this constraint, $P_c(d)$. In the case of dynamic replicated state, the client also specifies the maximum staleness, a, that it can tolerate in its response. If a response fails to meet the deadline constraint of the client, it results in a timing failure for the client. Hence, one of the important responsibilities of the client gateway handlers is to select an appropriate subset of replicas that can deliver a timely and consistent response to the clients, thereby reducing the occurrence of timing failures.

In our model, the constraints specified by a client apply only to the read transactions invoked by the client. For an update transaction, the only constraint that applies is that it has to be committed by the replicas in a manner that respects the ordering guarantee associated with the service. Hence, our selection algorithm handles an update request of a client by simply multicasting the request to all the primary replicas. The handler on the server side takes care of committing these updates in the appropriate order, as described in Section 4.1. For the read-only requests, the selection algorithm has to choose from among the primary and secondary replicas based on their ability to meet the client's temporal requirements, as well as on whether the state of the replica is within the staleness threshold specified by the client. However, the uncertainty in the environment and in the availability of the replicas due to transient overload and failures makes it impossible for a client to know with certainty if a set of replicas can meet its deadline. Further, while a client can be certain that the state of the primary replicas is always up-to-date, because all of the clients propagate their updates directly to them, the client cannot be certain about the state of the secondary replicas. The reason is that the secondary replicas update their state only when they receive the lazy updates propagated by the lazy publisher.

Hence, our selection approach makes use of probabilistic models to estimate a replica's staleness and to predict the probability that the replica will be able to meet the client's deadline. These models make their prediction based on information gathered by monitoring the replicas at runtime. A selection algorithm then uses this online prediction to choose a subset of replicas that can together meet the client's timing constraints with at least the probability requested by the client. While the algorithm ensures that the response delivered to the client will meet the staleness constraint, it can only provide probabilistic guarantees about meeting the temporal constraint. We first present the probabilistic model we have developed for the static replicated state, and then describe how we extend it for the dynamic state.

5.1 Static State

Let M be the set of replicas offering the service requested by a client and $R_i$ be the random variable denoting the time to receive a response from a replica $i \in M$, after a request was transmitted to it. We now need to determine the probability that a response from a subset $K \subseteq M$, consisting of $k > 0$ replicas, will arrive by the client's deadline, d, and thereby avoid the occurrence of a timing failure. This probability is denoted by $P_K(d)$. Each replica in the subset independently processes the client's request and sends back its response. However, only the first response received for a request is delivered to the client. Therefore, a timing failure occurs only if no response was received from any of the replicas in the set K within d time units after the request was sent. Computing the distribution of the time until a response is received is straightforward if we assume that the response times of individual replicas are independent of one another. While this assumption may not be strictly true in some cases (e.g., if the network delays are correlated), it does result in a model that is fast enough to solve online, which is especially helpful for the time-sensitive applications we target in our work. Furthermore, the experimental results we obtained show that the resulting model makes reasonably good predictions most of the time [16], [17]. We use the independence assumption to compute the probability, $P_K(d)$, for the replicas in subset K, as follows:

$$P_K(d) = 1 - P(\text{no replica in } K \text{ responds before } d) = 1 - \prod_{i \in K} P(R_i > d) = 1 - \prod_{i \in K} \left(1 - F^I_{R_i}(d)\right), \quad (1)$$

where $F^I_{R_i}(d)$ is the response time distribution function for replica i, under the condition that the replica responds to the request without waiting for a state update.
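Equation (1) is simple to evaluate online. As a minimal sketch (our own illustration, with the per-replica distributions estimated from the sliding windows of measured response times described in Section 5.3):

```python
# Sketch of equation (1): probability that at least one replica in K
# responds by the deadline d (illustrative; not AQuA's implementation).
def empirical_cdf(samples, d):
    # F^I_{R_i}(d), estimated from a window of measured response times.
    return sum(1 for s in samples if s <= d) / len(samples)

def p_timely(window_by_replica, K, d):
    # P_K(d) = 1 - prod_{i in K} (1 - F^I_{R_i}(d))
    miss = 1.0
    for i in K:
        miss *= 1.0 - empirical_cdf(window_by_replica[i], d)
    return 1.0 - miss

# Usage: recent response-time measurements per replica, in seconds.
windows = {"r1": [0.8, 1.2, 2.5], "r2": [1.9, 2.1, 2.4], "r3": [0.5, 0.6, 3.0]}
print(p_timely(windows, ["r1", "r3"], d=2.0))  # 1 - (1/3)(1/3) ~= 0.889
```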

5.2 Dynamic State

We now explain how we extend the above model to take into account the state of the replica when estimating its responsiveness. Let t denote the time at which a request is transmitted. Since replicas are selected at the time a request is transmitted, we also use t to denote the time at which the replica selection is done. Let $A_i(t)$ denote the staleness of the state of replica i at time t, and $P(A_i(t) \le a)$ be the probability that the state of replica i at time t is within the staleness threshold, a, specified by the client. We call this the staleness factor for replica i. Let $P(R_i \le d)$ be the probability that a response from replica i will be received by the client within the client's deadline, d. As before, let $P_K(d)$ be the probability that at least one response from the set K, consisting of $k > 0$ replicas, will arrive by the client's deadline, d. The probability that a replica can meet the client's time constraint, d, and thereby prevent a timing failure depends on whether the replica is functioning and has a state that can satisfy the client-specified staleness threshold. We can make use of the probabilities of the individual replicas to choose a subset K of replicas such that $P_K(d) \ge P_c(d)$. The replicas in the set K will then form the final set selected to service the request.

We now derive the expression for $P_K(d)$. Unlike the static case, which made the selection from a single tier of replicas, the set K in the case of dynamic state is made up of a subset $K_p$ of primary replicas and a subset $K_s$ of secondary replicas (i.e., $K = K_p \cup K_s$). While each replica in K processes the client's request and returns its response, only the first response received for a request is delivered to the client. Hence, a timing failure occurs only if no response is received from any of the replicas in the selected set K within d time units after the request was transmitted. Therefore, we have

$$P_K(d) = 1 - P(\text{no replica } i \in K \text{ such that } R_i \le d).$$

As in the case of the static state, we assume that the response times of the replicas are independent because they process their requests independently. Thus, using the independence assumption, we obtain

$$P_K(d) = 1 - \left[ P(\text{no } i \in K_p \text{ such that } R_i \le d) \cdot P(\text{no } j \in K_s \text{ such that } R_j \le d) \right]. \quad (2)$$

5.2.1 Primary Replicas

In Section 4.1, we mentioned that the update requests of the clients are propagated to the primary group immediately. Hence, for a primary replica i, the staleness factor $P(A_i(t) \le a) = 1$, and the replica always has a state that can satisfy the staleness threshold of the client. Therefore, in the case of the primary replicas, we have

$$P(\text{no } i \in K_p \text{ such that } R_i \le d) = \prod_{i \in K_p} P(R_i > d) = \prod_{i \in K_p} \left(1 - F^I_{R_i}(d)\right), \quad (3)$$

where $F^I_{R_i}$, as in the case of the model for static state, denotes the response time distribution function for replica i, given that it can respond immediately to a read request without waiting for a state update.

5.2.2 Secondary Replicas

The response time of a secondary replica depends on whether it has a state that can satisfy the client-specified staleness threshold, a. If the replica's staleness is within the specified staleness threshold, then the replica can perform an immediate read. Otherwise, as mentioned in Section 4.1, the replica has to perform a deferred read. At the time of replica selection, the client gateway that selects the replicas does not know for certain how stale the secondary replicas are. Hence, the client gateway uses a probabilistic approach to estimate the staleness of the secondary replicas. The probabilistic approach allows us to express the responsiveness of a replica $j \in K_s$ as a conditional probability using the following equation:

$$P(R_j > d) = P(R_j > d \mid A_j(t) \le a) \cdot P(A_j(t) \le a) + P(R_j > d \mid A_j(t) > a) \cdot P(A_j(t) > a),$$

where $P(A_j(t) \le a)$ is the staleness factor of replica j as defined earlier. Since the lazy update is propagated to all the secondary replicas at the same time, it is reasonable to assume that their degrees of staleness at the time of request transmission, t, are identical. Hence, rather than associate staleness with an individual replica j as above, we associate staleness with the entire secondary group of replicas. We use $A_s(t)$ to denote the staleness of the secondary group at the time of request transmission t, and express the probability that no secondary replica can respond within the deadline d as follows:

$$P(\text{no } j \in K_s \text{ such that } R_j \le d) = \left[ \prod_{j \in K_s} P(R_j > d \mid A_s(t) \le a) \right] \cdot P(A_s(t) \le a) + \left[ \prod_{j \in K_s} P(R_j > d \mid A_s(t) > a) \right] \cdot P(A_s(t) > a)$$

$$= \left[ \prod_{j \in K_s} \left(1 - F^I_{R_j}(d)\right) \right] \cdot P(A_s(t) \le a) + \left[ \prod_{j \in K_s} \left(1 - F^D_{R_j}(d)\right) \right] \cdot \left(1 - P(A_s(t) \le a)\right), \quad (4)$$

where $F^I_{R_j}$, as before, denotes the response time distribution function for the replica j, given that j can respond immediately to a request without waiting for a state update, and $F^D_{R_j}$ is the response time distribution function, given that the replica defers the read until it has received the lazy state update. We now describe how we compute the staleness factor, $P(A_s(t) \le a)$, for the secondary replicas, and then follow that with a description of how we compute the values of the response time distribution functions $F^I_{R_i}$ and $F^D_{R_i}$ for a replica i.

5.2.3 Staleness Factor

The staleness of a secondary replica, at the instant t, is the number of update requests that have been received by the primary group since the time of the last lazy update. Let $t_l$ denote the duration elapsed between the time of request transmission, t, and the time of the last lazy update. Let $N_u(t_l)$ be the total number of update requests received by the primary group from all the clients in the duration $t_l$. Since $A_s(t) = N_u(t_l)$, we have $P(A_s(t) \le a) = P(N_u(t_l) \le a)$. Our approach estimates the staleness of the secondary replicas based on a probabilistic model, rather than using the prohibitively costlier method of probing the primary group at the time of request transmission in order to obtain the value of $N_u(t_l)$. Using the assumption that the arrival of update requests from the clients follows a Poisson distribution with rate $\lambda_u$, we obtain

$$P(A_s(t) \le a) = P(N_u(t_l) \le a) = \sum_{n=0}^{a} \frac{(\lambda_u t_l)^n e^{-\lambda_u t_l}}{n!}. \quad (5)$$

The sequential and FIFO handlers differ slightly in the way they evaluate the update arrival rate, $\lambda_u$. In sequential ordering, the replica state is shared by all the clients, and therefore $\lambda_u$ is the rate at which updates are received by the primary replicas from all the clients. However, in the case of FIFO ordering, since the updates to the replicated object are specific to the individual clients, $\lambda_u$ is the rate at which the client that is making the replica selection updates the replicated object. In either case, the staleness of the secondary replicas can be determined probabilistically if we know the arrival rate of the update requests and the time elapsed since the last lazy update. We measure those two parameters at runtime by instrumenting the gateway handlers, as we have explained in detail in [15]. Although we have assumed Poisson arrivals in our work, it should be possible to evaluate the staleness factor when the arrival of update requests follows a distribution that is not Poisson. Finally, we can use the expressions in (3), (4), and (5) in (2) to evaluate the probability $P_K(d)$ that at least one of the replicas in the selected set K can deliver a timely and consistent response.
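Putting (2)-(5) together, a client gateway can score a candidate set online. A minimal sketch follows (our own illustration; the distribution estimates F_imm and F_def come from the measurements described in Section 5.3):

```python
import math

# Staleness factor per equation (5), under Poisson update arrivals.
def staleness_factor(lam_u, t_l, a):
    mu = lam_u * t_l
    return sum(mu**n * math.exp(-mu) / math.factorial(n) for n in range(a + 1))

# P_K(d) per equations (2)-(4) for a mixed primary/secondary candidate set.
# F_imm[r] and F_def[r] are callables returning F^I_R(d) and F^D_R(d).
def p_timely_dynamic(F_imm, F_def, Kp, Ks, d, sf):
    miss_p = 1.0
    for i in Kp:                       # primaries: always fresh, equation (3)
        miss_p *= 1.0 - F_imm[i](d)
    imm = dfr = 1.0
    for j in Ks:                       # secondaries: immediate or deferred, equation (4)
        imm *= 1.0 - F_imm[j](d)
        dfr *= 1.0 - F_def[j](d)
    miss_s = imm * sf + dfr * (1.0 - sf)
    return 1.0 - miss_p * miss_s       # equation (2)
```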

5.3 Evaluating the Response Time Distribution

We now explain how we determine the values of the conditional response time distributions, $F^I_{R_i}(d)$ and $F^D_{R_i}(d)$, for a replica i. To do this, we make use of the performance history recorded by online performance monitoring to compute the value of the distribution function for a replica i. In the case in which a replica can respond to a request without waiting for a state update, the response time random variable for a replica i is given by (6):

$$R_i = S_i + W_i + G_i. \quad (6)$$

For a deferred read, in which the replica has to buffer the read request until it has received the next state update in order to respond to the request, the response time random variable is given by (7):

$$R_i = S_i + W_i + G_i + U_i, \quad (7)$$

where $S_i$ is the random variable denoting the service time for a read request serviced by replica i, $W_i$ is the random variable denoting the queuing delay experienced by a request waiting to be serviced by i, $G_i$ is the random variable denoting the two-way gateway-to-gateway delay between the client and replica i, and $U_i$ is the duration of time the replica spends waiting for the next lazy update. In the case of sequential ordering, the queuing delay includes the time the replica spends waiting for the sequencer to send the GSN for the request. The service time and queuing delay are specific to the individual replicas, while the gateway delay is specific to a client-replica pair. These three parameters are depicted in Fig. 1 by the terms S, W, and G, respectively.

For each read request, we experimentally measure the values of the above performance parameters by instrumenting the gateway handlers. The values of $S_i$, $W_i$, and $U_i$ for a read request are measured by the server-side handler. The server handler then publishes the new measurements to all the clients. The value of the two-way gateway delay, $G_i$, is measured by the client-side handler when it receives a response from replica i. For each replica, the client handlers record the most recent l measurements of these parameters in separate sliding windows in an information repository that is local to each client. The size of the sliding window, l, is chosen so as to include a reasonable number of recent requests, while eliminating obsolete measurements. The details of the gateway instrumentation and online performance monitoring are provided in [15].


Given that we can measure the performance parameters and record them at runtime, we can now compute the value of the distribution function for a replica i. To do this, we first compute the probability mass function (pmf) of $S_i$ and $W_i$ based on the relative frequency of their values recorded in the sliding window, l. We then use the pmf of $S_i$, the pmf of $W_i$, and the recently recorded value of $G_i$ to compute the pmf of the response time $R_i$ as a discrete convolution of $W_i$, $S_i$, and $G_i$. The pmf of $R_i$ can then be used to compute the value of the distribution function $F^I_{R_i}(d)$. We follow a similar procedure to compute $F^D_{R_i}(d)$, although, in this case, we record a performance history of $U_i$ and include the pmf of $U_i$ in the convolution.
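The pmf construction and discrete convolution are straightforward; the following sketch (our own illustration, with made-up measurements in milliseconds) shows the computation of $F^I_{R_i}(d)$:

```python
from collections import Counter

# Build a pmf from a sliding window of measurements.
def pmf(samples):
    n = len(samples)
    return {v: c / n for v, c in Counter(samples).items()}

# Discrete convolution of two pmfs: the pmf of the sum of the two variables.
def convolve(p, q):
    out = Counter()
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] += px * qy
    return dict(out)

# F_R(d): probability that the response time is at most d.
def cdf_at(p, d):
    return sum(prob for v, prob in p.items() if v <= d)

# Immediate read: R = S + W + G, with G the most recent gateway delay.
S, W, G = pmf([3, 4, 4, 5]), pmf([0, 1, 1, 2]), pmf([2])  # milliseconds
F_I = convolve(convolve(S, W), G)
print(cdf_at(F_I, 8))  # F^I_R(8) = 0.9375 for these samples
# For a deferred read, the pmf of U would be convolved in as well, per (7).
```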

6 REPLICA SELECTION ALGORITHM

Given the ability to predict the probability that an individual replica will meet a client's time constraint based on the replica's state, we designed two algorithms that use this prediction to select a set of replicas that can meet the time constraint with the probability the client has requested. We call these two algorithms BEST_PROBABILITY_FIRST and LEAST_USED_FIRST; we presented the former in [14] and the latter in [15]. The selection algorithms are executed by each client gateway when the client associated with it performs a read-only request on a server object. If the client makes an update request, the gateway sends the request to all the primary replicas. In this section, we summarize these algorithms and highlight their key differences. The BEST_PROBABILITY_FIRST algorithm selects replicas in decreasing order of the probability that they can individually meet the client's response time requirement. It includes just enough replicas in K such that the condition P_K(d) ≥ P_c(d) is satisfied, where P_K(d) is computed using the models presented in the previous section. The LEAST_USED_FIRST algorithm, on the other hand, selects replicas in decreasing order of their elapsed time of response (ETR). The ETR of a replica is the duration that has elapsed since a reply was last received by the client from that replica, and is measured at runtime by instrumenting the gateway handler on the client side. Like the BEST_PROBABILITY_FIRST algorithm, the LEAST_USED_FIRST algorithm includes just enough replicas in K such that the condition P_K(d) ≥ P_c(d) is satisfied. Both algorithms are designed to choose replicas in such a way that the QoS requirements of the client can be met even if one of the selected replicas fails before responding.
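Both algorithms share the same greedy accumulation loop and differ only in how candidates are ranked. The following sketch assumes a model function prob_set_meets(K, d) that computes P_K(d) for a candidate set (the failure-tolerant estimate from the previous section); the function and table names are illustrative.

    def select_replicas(candidates, d, p_c, prob_set_meets, order_key):
        # Rank candidates; BEST_PROBABILITY_FIRST ranks by each replica's
        # individual probability of meeting the deadline, LEAST_USED_FIRST
        # by elapsed time of response (ETR), both in decreasing order.
        ranked = sorted(candidates, key=order_key, reverse=True)
        K = []
        for replica in ranked:
            K.append(replica)
            if prob_set_meets(K, d) >= p_c:
                return K  # just enough replicas so that P_K(d) >= P_c(d)
        return K          # pool saturated; the client may renegotiate

For example, BEST_PROBABILITY_FIRST would pass order_key=lambda i: prob_meets_deadline(histories[i], d), while LEAST_USED_FIRST would pass order_key=lambda i: etr[i], where histories and etr are the client-side tables sketched above.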

The BEST_PROBABILITY_FIRST algorithm is a greedy algorithm because it always picks the best replicas first. While this results in a smaller subset of replicas, it also increases the potential for the occurrence of hot-spots, for the following reason. The model used by the algorithm makes use of the performance information broadcast by a replica to estimate the replica's ability to meet a client's QoS specification. Since the performance information is broadcast to all the clients and the gateway delays of different client-replica pairs are not significantly different in a LAN, the information repositories of different clients may contain almost identical performance histories for the replicas. That may cause the clients to select the same or common replicas for their requests, resulting in hot-spots. In the case of the LEAST_USED_FIRST algorithm, while the response time distributions of a replica, which are computed from the performance history, are nearly identical in all the client information repositories, the ETR information is specific to each client-replica pair and is likely to be different for different clients. That results in a more balanced utilization of the available replicas and thereby reduces the occurrence of hot-spots.

7 EXPERIMENTAL RESULTS

We have conducted experiments to study the overhead of the selection algorithm, as well as the effectiveness and adaptability of the probabilistic model under different workload scenarios for static and dynamic replicated state [16], [15]. We also experimentally analyzed the trade-offs between timeliness and consistency, using the sequential and FIFO ordering handlers we implemented in AQuA [17]. All of our experiments were conducted on a set of uniprocessor Linux machines with processor speeds ranging from 300 MHz to 1 GHz, distributed over a 100 Mbps LAN. All confidence intervals for the results presented are at the 95 percent level and have been computed under the assumption that the number of timing failures follows a binomial distribution [12]. We now summarize the key results of the experiments we have published previously and present additional experimental results that provide a further evaluation of our work.

Our experiments showed that both the BEST_PROBABILITY_FIRST and the LEAST_USED_FIRST algorithms adapt to less stringent QoS requirements by choosing fewer replicas. The reason is that the algorithms use the model's prediction to select just enough replicas that can meet a client's QoS request, even if a replica failure should occur. The less stringent a client's QoS specification is, the higher the probability that a chosen replica will meet the client's specification. Hence, as the QoS requirement becomes less stringent, fewer replicas are needed to satisfy the request. We were also able to validate our models experimentally and show that while the observed probability of timing failures increases when the requested QoS is more stringent, the replicas selected by the model were able to keep the observed failure probability within the threshold specified by the client, for the workloads we considered. To experimentally justify the need for a hierarchical replica organization, we compared the performance of a single-tier replica organization in which all the replicas were in the primary group with a two-tier organization in which 40 percent of the replicas were in the primary group and the remainder in the secondary group. The size of the primary group represents a trade-off between the buffering delay due to deferred reads and the queuing delay due to update requests. When there are more replicas in the primary group, a greater number of replicas have consistent state, and therefore the buffering delays are smaller. However, since more replicas are involved in committing the updates, the queuing delays experienced by the read requests are higher. Our experiments showed that for smaller update rates, the single- and two-tier replica organizations perform comparably. However, for larger update rates, it is possible to tune the lazy update interval (LUI) such that the two-tier scheme results in a lower probability of timing failures than the single-tier scheme. The LUI is the periodicity with which the lazy publisher publishes its state to the secondary group of replicas. In Section 7.2, we will discuss the cost/performance trade-offs associated with lazy updates in more detail.

7.1 Performance under Load

We now describe the experiments we carried out to determine how well our model adapts to meet the client's QoS specification under different client-induced workloads. We present experimental results for the dynamic state using the sequential consistency guarantee, but we observed similar behavior for the dynamic state with FIFO order and for static replicated state as well. Our experimental setup had 10 server replicas in addition to the sequencer, of which 40 percent were in the primary group and the remainder in the secondary group. The service time was normally distributed with a mean of 100 milliseconds and a variance of 50 milliseconds. We present results obtained by varying two different parameters: 1) the number of clients accessing a service and 2) the think time between the requests.

In the first case, the client-induced load increases with the number of clients accessing a service. Each client sent 1,000 alternating update and read requests with a think time of 1,000 milliseconds between successive requests. One of the clients specified a staleness threshold of 2. It varied its deadline from 100 to 200 milliseconds and requested that its deadline be met with a probability ≥ 0.5. All of the remaining clients specified a staleness threshold of 4 and a deadline of 200 milliseconds, and requested that this deadline be met with a probability ≥ 0.1 in each run. The lazy update interval was 2 seconds. Figs. 2a and 2b evaluate the performance of the probabilistic scheme using two, four, and eight clients. Fig. 2a shows the timing failure probability for each case, as measured at the client that specified that its probability of timely response should be at least 0.5, and Fig. 2b shows the average number of replicas selected by the probabilistic scheme to meet the QoS specifications of this client in each case. As expected, the observed timing failure probability increased as the number of clients requesting service increased, because of the higher queuing delays. However, we find that for the range of workloads we considered, the model was able to adapt appropriately to select a subset of replicas that could meet the client's QoS specification.

Fig. 2. Performance under load: multiple clients. (a) Probability of timing failures with varying clients and (b) number of replicas selected with varying clients.

In the second case, when we varied the client-induced load by varying the think time, we used a constant number of clients in our experimental setup, which in our case was two. In all of the runs, Client1 specified a staleness threshold of 4, a deadline of 200 milliseconds, and a minimum probability of timely response of 0.1. Client2 specified a staleness threshold of 2 and a minimum probability of timely response of 0.9 in all of the runs, but varied its deadline from 100 to 200 milliseconds. The clients used different think times between their requests. The induced load on the servers was higher for smaller think times. Figs. 3a and 3b present the results, using a lazy update interval of 2 seconds, for two different values of the think time: 1,000 milliseconds and 250 milliseconds.

Fig. 3. Performance under load: variable think time. (a) Probability of timing failures with varying think time and (b) number of replicas selected with varying think time.

The first observation from Fig. 3a is that the observed failure probability increases as the think time reduces from 1,000 milliseconds to 250 milliseconds. The reason is that as the think time reduces from 1,000 milliseconds and approaches values closer to the mean service time of 100 milliseconds, the number of requests that experience queuing delays at the servers increases. We also observe from the graphs in Fig. 3b that as the queuing delay increases, the probabilistic scheme is sometimes unable to find enough replicas to meet the deadline with the probability requested by the client. For instance, when the think time is 250 milliseconds, the replica subset chosen by the probabilistic scheme is unable to meet deadline values ≤ 140 milliseconds with a probability ≥ 0.9, although the request is sent to all 10 available replicas. In such cases, the selection handler can inform the client that there are insufficient resources to satisfy its QoS requirement, so that the client can choose either to renegotiate its QoS specification or to send its requests at a later time when the system is less loaded. Alternatively, the middleware can choose to create more replicas to meet the demand; we describe how we do that in AQuA later in this section.

7.2 Cost/Performance Trade Offs of Lazy Updates

We now look at the cost/performance trade-offs associated with lazy update propagation. Our earlier results showed that increasing the frequency of lazy updates resulted in smaller buffering delays for deferred reads, and thereby reduced the occurrence of timing failures [15]. However, there is a cost associated with lazy update propagation. It arises from timer interrupts, network load, and processing of the lazy updates, and the cost increases as the frequency of lazy update propagation increases. To study how the cost affects the performance, we repeated our experiments using a more intense workload than the one we had used in our earlier studies. In the following experiment, we used four clients, which is twice the number we used in our earlier study. The think time was 250 milliseconds, which is one-fourth the value we used in our earlier study. This resulted in a client update rate of five updates/second, which is nearly five times the update rate of our earlier experiments.

Fig. 4a shows the observed timing failure probability as the lazy update interval increases from 1 second to 6 seconds. We see that as the LUI increases, the failure probability reduces and stabilizes around 4 seconds, which is in contrast to the behavior we observed under the less intense workload in our earlier study. From these results, we conclude that increasing the lazy update frequency has the potential to reduce the buffering delay for deferred reads and thereby improve the responsiveness of the replicas. However, increasing the frequency beyond a certain threshold value causes the overheads associated with the lazy update propagation to become more dominant, nullifying any performance gains. The threshold value is specific to each workload. Thus, our experiments show that the lazy update interval has to be chosen to balance the cost/performance trade-offs, depending on the update rate of the clients, the think time, the QoS specification of the clients, and the primary/secondary group size.

Fig. 4. Timeliness/consistency trade-offs. (a) Impact of lazy update propagation and (b) impact of the primary group size.

7.3 Impact of the Primary Group Size

We now present experimental results that show how the size of the primary group impacts the performance for a given number of server replicas. We used the same experimental setup with 10 servers and four clients having a think time of 250 milliseconds, as in the previous experiment. We used two different values of the LUI: 1 second and 2 seconds. Fig. 4b shows the probability of timing failures observed by Client1 as the percentage of replicas in the primary group is varied from 10 percent to 100 percent (i.e., all 10 replicas are in the primary group). From Fig. 4b, we see that the observed probability of timing failures reduces as the size of the primary group increases, up to the point at which 80 percent of the replicas are in the primary group. However, increasing the size of the primary group beyond that results in an increase in the number of timing failures. Those observations can be explained as follows.

As mentioned earlier, the size of the primary group represents a trade-off between two different delay factors: the buffering delay introduced by the deferred reads and the queuing delay caused by the update operations. Increasing the size of the primary group reduces the buffering delay, because more replicas have consistent state. On the other hand, when the arrival rate of updates from the clients is high, increasing the primary group size causes more replicas to be involved in update operations. That results in higher queuing delays, and thereby reduces the availability of the replicas for the read operations. Applying this reasoning to the results in Fig. 4b, we see that the queuing delay begins to play a more dominant role when more than 80 percent of the replicas are in the primary group. Although a larger percentage of the replicas have the appropriate state to meet the client's staleness threshold in that region, there are not enough replicas available that can respond within the client's deadline. That is the reason for the increase in timing failures. The above results show that there is a certain optimal ratio between the sizes of the primary and secondary groups that can deliver the best balance between the buffering delay and queuing delay. That ratio is specific to each workload and can be used to configure the size of the two groups according to the workload.

Another observation from Fig. 4b is that the observed failure probability is lower for the smaller of the two lazy update frequencies (i.e., LUI = 2 seconds). The reason is that beyond a certain threshold frequency, the overhead of the lazy update propagation becomes dominant. We explained this when we analyzed the result shown in Fig. 4a, which used 40 percent of the replicas in the primary group. In effect, Fig. 4b shows that as we increase the size of the primary group, it may be more beneficial to reduce the frequency of lazy updates, because a larger fraction of the replicas are consistent.

7.4 Time-Varying Workload

In the experiments we presented so far, the service time was normally distributed and the mean service time was stationary. The results we presented showed that the probabilistic scheme was able to use the performance history of the replicas effectively and adapt the selection of replicas to meet the QoS requested by the clients, when the service time distribution was stationary in the stochastic sense. We now discuss the experimental evaluation of our probabilistic framework using a time-varying workload. In the experiments we present below, we used a workload with heavy-tailed properties. This was motivated by the evidence that the workloads of many well-known distributed services exhibit heavy-tailed distributions [10]. A heavy-tailed distribution is characterized by high variability and is defined as follows:

Heavy-Tailed Distribution: A random variable X follows a heavy-tailed distribution with tail index α if P[X > x] ~ x^(-α), 0 < α < 2. The variance increases as α decreases. A simple example of a heavy-tailed distribution is the Pareto distribution.

In a typical client/server application, the service time has some upper bound. Hence, we model the service time using a Bounded Pareto distribution [10]. The Bounded Pareto distribution is characterized by three parameters: α, which controls the variance and mean of the distribution; k, which is the lower bound for the samples in the distribution; and p, which is the upper bound for the samples in the distribution. The probability density function of the Bounded Pareto distribution is given by

    f(x) = (α k^α / (1 - (k/p)^α)) x^(-α-1),    k ≤ x ≤ p.

The Bounded Pareto distribution has finite moments, and therefore does not strictly conform to the above definition of a heavy-tailed distribution. However, it does display high variability when k is significantly less than p. In our experiments, we set k to 50 milliseconds and p to 250 milliseconds. Thus, the service time varied between 50 and 250 milliseconds. To generate a time-varying workload, we conducted experiments using different values of α. In general, for a Bounded Pareto distribution, the smaller the value of α, the higher the mean. We now discuss the results when we varied α between two values: 0.1 and 1.9. When α = 0.1, the mean of the samples was 121 milliseconds, and when α = 1.9, the mean was 84 milliseconds.
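Since the Bounded Pareto CDF inverts in closed form, service times with these properties can be drawn by inverse-transform sampling. A small sketch (the function name is illustrative):

    import random

    def bounded_pareto_sample(alpha, k, p):
        # Inverse-transform sampling from
        # F(x) = (1 - (k/x)^alpha) / (1 - (k/p)^alpha), k <= x <= p.
        u = random.random()
        return k / (1.0 - u * (1.0 - (k / p) ** alpha)) ** (1.0 / alpha)

    # Parameters used in the experiments: k = 50 ms, p = 250 ms, with
    # alpha = 0.1 (mean ~121 ms) or alpha = 1.9 (mean ~84 ms).
    sample_ms = bounded_pareto_sample(0.1, 50.0, 250.0)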

As before, we used two clients, Client1 and Client2, each of which sent 1,000 alternating read and update requests with a think time of 250 milliseconds. Client1 specified a staleness threshold of 4, a deadline of 200 milliseconds, and a probability of timely response of 0.1, while Client2 specified a staleness threshold of 2, varied its deadline from 100 to 200 milliseconds, and requested a probability of timely response of 0.9. Forty percent of the replicas were in the primary group, and the lazy updates were propagated to the secondary group at intervals of 2 seconds.

We generated the time-varying workload by varying the service time of a replica between two states, NORMAL and HIGH. For the workload parameters we used, the HIGH state corresponds to α = 0.1, where the mean service time was 121 milliseconds, and the NORMAL state corresponds to α = 1.9, in which the mean service time was 84 milliseconds. For every 100 requests it received, each replica serviced the first f requests in the HIGH state and then made a transition to the NORMAL state, in which it serviced the remaining 100 - f requests. The replica then made a transition back to the HIGH state to service the next f requests, and repeated the cycle. Thus, the length of the cycle is 100 requests.

Fig. 5. Performance using a time-varying workload. (a) Probability of timing failures and (b) number of replicas selected.

Fig. 5 presents the observed probability of timing failures and the average number of replicas selected for Client2 for different values of f, for the sequential consistency case. The first observation from Fig. 5a is that the probability of timing failures increased as the percentage of requests processed in the HIGH state increased from 10 percent to 90 percent. This is to be expected, as an increasing percentage of requests experience higher service times with a mean close to 121 milliseconds. The second observation is that as the value of f increased, the observed failure probability exceeded the client's expectation for deadline values close to the mean service time. We considered two possible explanations for this. Our first hypothesis was that the model is unable to adapt to a time-varying workload. To verify this hypothesis, we conducted experiments with an equivalent, non-time-varying workload that used a Bounded Pareto distribution with the same parameter values as described above for the time-varying case. In the non-time-varying workload, the transition between the HIGH state and the NORMAL state was controlled probabilistically as follows. Before servicing a request, each replica used a uniform random number generator to generate a value p between 0 and 1. As in the time-varying case, we studied the performance using a non-time-varying workload for the following three cases (a sketch of both state-selection schemes follows the list):

1. When p > 0.9, the replica serviced the request in the HIGH state; when p ≤ 0.9, it serviced the request in the NORMAL state. This is equivalent to the time-varying case in which 10 percent of requests were serviced in the HIGH state (f = 10).

2. When p > 0.5, the replica serviced the request in the HIGH state; when p ≤ 0.5, it serviced the request in the NORMAL state. This is equivalent to the time-varying case in which 50 percent of requests were serviced in the HIGH state (f = 50).

3. When p > 0.1, the replica serviced the request in the HIGH state; when p ≤ 0.1, it serviced the request in the NORMAL state. This is equivalent to the time-varying case in which 90 percent of requests were serviced in the HIGH state (f = 90).
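The two workload generators differ only in how a replica picks its state for each request. A combined sketch under these assumptions (names are illustrative):

    import random

    def state_for_request(n, f, time_varying=True):
        # Time-varying: the first f requests of every 100-request cycle
        # are serviced in the HIGH state, the remaining 100 - f in NORMAL.
        # Non-time-varying: each request is HIGH independently with
        # probability f/100, matching cases 1-3 above.
        if time_varying:
            return "HIGH" if (n % 100) < f else "NORMAL"
        return "HIGH" if random.random() > 1.0 - f / 100.0 else "NORMAL"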

Our reasoning was that if the replicas selected by the model were able to maintain a failure probability within the acceptable threshold of 0.1 in the case of the non-time-varying workload, that would indicate that our model is unable to cope with a time-varying workload. However, we observed that the behavior in the non-time-varying case was nearly identical to that presented in Fig. 5 for the time-varying workload.

Having ruled out the first hypothesis, we considered a second possible explanation, which was that there are not enough replicas that can deliver a timely response with a probability ≥ 0.9 for smaller deadline values under a higher workload. From Fig. 5b, we see that the model tries to meet strict requirements under a higher workload by choosing more replicas. However, we see that in certain cases, there are not enough replicas available to deliver a timely and consistent response with the requested probability. In such cases, our model saturates the entire pool of replicas. Therefore, we repeated the experiments in the case of the time-varying workload after reducing the lazy update interval from 2 seconds to 1 second. That helped reduce the timing failure probability significantly. These results show that the model can adapt to a time-varying workload. However, under stringent demands, there may not be enough replicas available to meet the demands. In such cases, we can adapt by propagating the lazy updates more frequently, so that we have more replicas with up-to-date state. Alternatively, we can increase the size of the available replica pool by creating more replicas on demand.

We now briefly describe how our middleware addresses the problem of creating replicas on demand. In order to support dynamic replica creation, the middleware needs to determine when, where, and how many replicas to create. In our approach, the replica selector in a client's gateway requests that the dependability manager, which is one of the components of the AQuA middleware, create a replica when the selector is unable to find enough replicas to provide a timely response with the probability specified by the client. The new replica is placed on the least loaded host. The new replica joins the secondary group, but does not initially have the most up-to-date state. When the lazy publisher subsequently disseminates its state to the secondary group, the new replica inherits the correct state and then begins to service clients' requests.
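The control flow at the selector can be summarized as follows; the dependability-manager and client-callback interfaces shown are hypothetical stand-ins for the actual AQuA components [24]:

    def handle_selection_shortfall(K, d, p_c, prob_set_meets, dep_mgr, client):
        # Invoked when the greedy selection loop has saturated the replica
        # pool without satisfying P_K(d) >= P_c(d).
        if prob_set_meets(K, d) >= p_c:
            return K
        if dep_mgr.can_create_replica():
            # The new replica is placed on the least loaded host and joins
            # the secondary group; it becomes usable once the lazy publisher
            # next disseminates its state.
            dep_mgr.create_replica(placement="least_loaded_host", group="secondary")
        else:
            client.notify_insufficient_resources()  # client may renegotiate its QoS
        return K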

8 FUTURE EXTENSIONS

Our work motivates some interesting avenues for future work. First, in our current QoS model, the clients express their timeliness requirements by specifying their deadlines and probabilities of timely response. While it is easy for the clients to specify the deadline values for their requests, the way they should choose appropriate probabilities of timely response may not be very intuitive. It is easy to extend our framework so that the clients can replace the probability of timely response with a higher-level specification, such as an “importance” level, which takes on an integer value between 0 and 10. Alternatively, the client can specify the cost it is willing to pay for timely delivery. The middleware can then internally map these higher-level inputs to an appropriate probability value and perform adaptive replica selection as described.
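One possible mapping, purely illustrative since the exact mapping policy is left to the middleware:

    def importance_to_probability(importance):
        # Map a client's 0..10 "importance" level to a probability of
        # timely response; a linear mapping is the simplest choice.
        if not 0 <= importance <= 10:
            raise ValueError("importance must be between 0 and 10")
        return importance / 10.0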

Second, our middleware currently admits all the clients. If the observed timing failure probability exceeds a client's expectations, the middleware informs the client through a callback. The client can then renegotiate its QoS requirements. An alternative approach would be to incorporate some kind of admission control at the middleware layer, in order to determine which clients can be admitted based on the current availability of the replicas.

Third, we currently associate QoS attributes with read operations only. Although we allow different ordering guarantees for write operations, we do not currently support QoS requirements for write operations that can be specified at runtime. Our work can be enhanced to incorporate write-specific QoS attributes, such as the maximum tolerable delay in propagating an update to a specified fraction of the replicas.

Finally, in a large-scale system, in which the replicas and clients are more numerous and more widespread, it may not be feasible to propagate the performance updates to all of the clients in a timely manner, on account of larger latencies. That may result in a greater degree of inaccuracy in the performance histories. Hence, in order to extend our work to large-scale networks, we need a way to track the performance histories of the replicas in a scalable manner. One way to address this issue is to organize the replicas into groups based on their geographic proximity, and to propagate the performance updates in such a way that clients that are closer to a replica group can track the performance information of the replicas in that group more accurately. At the time of replica selection, the inaccuracy in the performance histories can be factored in by weighting the response time distribution functions of the replicas in proportion to their accuracy.
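A minimal sketch of that weighting, reusing the prob_meets_deadline helper from the earlier sketch; the grouping structure and per-group accuracy weights are hypothetical:

    def weighted_deadline_probs(groups, d, accuracy):
        # groups:   {group_id: {replica_id: ReplicaHistory}}
        # accuracy: {group_id: weight in (0, 1]}, larger for groups whose
        #           performance updates reach this client more promptly.
        return {rid: accuracy[gid] * prob_meets_deadline(hist, d)
                for gid, members in groups.items()
                for rid, hist in members.items()}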

9 CONCLUSIONS

The framework we have developed enables a middleware to accommodate diverse application requirements by implementing them as protocols tailored to different application-specific requirements. The framework allows a dependable middleware to assign replicated servers to clients adaptively, based on the QoS requirements of the clients and the current responsiveness and state of the replicas. It actively monitors the replicas at runtime and uses the feedback to guide the adaptation. The experimental results we obtained demonstrate the role of feedback and the efficacy of analytical models for adaptively sharing the available resources among the users in a range of different scenarios. While a static selection scheme or a round-robin scheme would suffice when the primary goal is load balancing and the clients do not have specific timing constraints, we believe that a dynamic scheme, like the probabilistic model-based replica selection scheme we have developed, is useful in an environment in which time-sensitive clients with different QoS requirements access servers that display significant variability in their response times. Our experiments also helped us understand the trade-offs between timeliness and consistency for different consistency semantics, and our results show that the frequency of lazy updates is an important parameter for tuning the trade-off between the desired levels of consistency and timeliness.

Although our probabilistic approach was developed mainly to adaptively share replicated servers in uncertain environments, similar techniques can be applied to a range of problems, including scheduling and other resource allocation problems. Given the diversity of the requirements of client applications when accessing distributed services, such adaptive frameworks that rely on feedback-based control are likely to play an increasing role in solving a range of problems related to building dependable systems.

ACKNOWLEDGMENTS

The authors are thankful to the anonymous reviewers for their detailed feedback, which helped them to improve their work. They thank the rest of the AQuA team for their contributions to the AQuA project. They are thankful to Jenny Applequist for her editorial comments. This research has been supported by the Defense Advanced Research Projects Agency (DARPA) contract F30602-98-C-0187.

REFERENCES

[1] K. Birman, “Replication and Fault Tolerance in the ISIS System,” Proc. 10th ACM Symp. Operating Systems Principles, pp. 79-86, Dec. 1985.
[2] K. Birman, Building Secure and Reliable Network Applications. Manning, 1996.
[3] R. Carter and M. Crovella, “Dynamic Server Selection Using Bandwidth Probing in Wide Area Networks,” Technical Report BU-CS-96-007, Boston Univ., 1996.
[4] G.V. Chockler, R. Vitenberg, and R. Friedman, “Consistency Conditions for a CORBA Caching Service,” Proc. Int'l Symp. Distributed Computing, Oct. 2000.
[5] M. Cukier et al., “AQuA: An Adaptive Architecture that Provides Dependable Distributed Objects,” Proc. IEEE Symp. Reliable Distributed Systems, pp. 245-253, Oct. 1998.
[6] A. Demers, D. Greene, C. Hauser, W. Irish, and J. Larson, “Epidemic Algorithms for Replicated Database Maintenance,” Proc. ACM Symp. Principles of Distributed Computing, pp. 1-12, 1987.
[7] Z. Fei, S. Bhattacharjee, E. Zegura, and M. Ammar, “A Novel Server Selection Technique for Improving the Response Time of a Replicated Service,” Proc. IEEE INFOCOM, Mar. 1998.
[8] R. Golding, “A Weak-Consistency Architecture for Distributed Information Services,” Computing Systems, vol. 5, no. 4, pp. 379-405, 1992.
[9] R. Guerraoui and A. Schiper, “Software-Based Replication for Fault Tolerance,” Computer, pp. 68-74, Apr. 1997.
[10] M. Harchol-Balter, M. Crovella, and C. Murta, “On Choosing a Task Assignment Policy for a Distributed Server System,” Proc. Performance Tools Conf., pp. 231-242, Sept. 1998.
[11] M. Hayden, “The Ensemble System,” PhD thesis, Cornell Univ., Jan. 1998.
[12] N. Johnson, S. Kotz, and A. Kemp, Univariate Discrete Distributions, chapter 3, second ed., pp. 129-130, Addison-Wesley, 1992.
[13] B. Kantor and P. Lapsley, “Network News Transfer Protocol,” http://www.cis.ohio-state.edu/htbin/rfc/rfc977.html, Feb. 1986.
[14] S. Krishnamurthy, W.H. Sanders, and M. Cukier, “A Dynamic Replica Selection Algorithm for Tolerating Timing Faults,” Proc. Int'l Conf. Dependable Systems and Networks, pp. 107-116, July 2001.
[15] S. Krishnamurthy, W.H. Sanders, and M. Cukier, “An Adaptive Framework for Tunable Consistency and Timeliness Using Replication,” Proc. Int'l Conf. Dependable Systems and Networks, pp. 17-26, June 2002.
[16] S. Krishnamurthy, W.H. Sanders, and M. Cukier, “Performance Evaluation of a Probabilistic Replica Selection Algorithm,” Proc. Workshop Object-Oriented Real-Time Dependable Systems, pp. 119-127, Jan. 2002.
[17] S. Krishnamurthy, W.H. Sanders, and M. Cukier, “Performance Evaluation of a QoS-Aware Framework for Providing Tunable Consistency and Timeliness,” Proc. Int'l Workshop Quality of Service, pp. 214-223, May 2002.
[18] V. Krishnaswamy, M. Raynal, D. Bakken, and M. Ahamad, “Shared State Consistency for Time-Sensitive Distributed Applications,” Proc. Int'l Conf. Distributed Computing Systems, pp. 606-614, Apr. 2001.
[19] L. Lamport, “Time, Clocks, and the Ordering of Events in Distributed Systems,” Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[20] M. Little, “Object Replication in a Distributed System,” PhD thesis, Univ. of Newcastle upon Tyne, Sept. 1991.
[21] L. Moser, P. Melliar-Smith, and P. Narasimhan, “A Fault Tolerance Framework for CORBA,” Proc. IEEE Int'l Symp. Fault-Tolerant Computing, pp. 150-157, June 1999.
[22] K. Petersen, M. Spreitzer, D. Terry, M. Theimer, and A. Demers, “Flexible Update Propagation for Weakly Consistent Replication,” Proc. 16th ACM Symp. Operating Systems Principles, pp. 288-301, Oct. 1997.
[23] C. Pu and A. Leff, “Replica Control in Distributed Systems: An Asynchronous Approach,” Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 377-386, May 1991.
[24] Y.(J.) Ren, T. Courtney, M. Cukier, C. Sabnis, W.H. Sanders, M. Seri, D.A. Karr, P. Rubel, R.E. Schantz, and D.E. Bakken, “AQuA: An Adaptive Architecture that Provides Dependable Distributed Objects,” IEEE Trans. Computers, vol. 52, no. 1, pp. 31-50, Jan. 2003.
[25] P. Rubel, “Passive Replication in the AQuA System,” Master's thesis, Univ. of Illinois at Urbana-Champaign, 2000.
[26] D. Terry, “Towards a Quality of Service Model for Replicated Data Access,” Proc. Second Int'l Workshop Services in Distributed and Networked Environments, pp. 118-122, June 1995.
[27] D. Terry, A. Demers, K. Petersen, M. Spreitzer, M. Theimer, and B. Welch, “Session Guarantees for Weakly Consistent Replicated Data,” Proc. Int'l Conf. Parallel and Distributed Information Systems, pp. 140-149, Sept. 1994.
[28] F. Torres-Rojas, M. Ahamad, and M. Raynal, “Timed Consistency for Shared Distributed Objects,” Proc. ACM Symp. Principles of Distributed Computing, pp. 163-172, May 1999.
[29] A. Vaysburd, “Building Reliable Interoperable Distributed Applications with Maestro Tools,” PhD thesis, Cornell Univ., May 1998.
[30] H. Yu and A. Vahdat, “Design and Evaluation of a Continuous Consistency Model for Replicated Services,” Proc. Fourth Symp. Operating Systems Design and Implementation, Oct. 2000.

Sudha Krishnamurthy received the PhD degree in computer science in 2002 from the University of Illinois at Urbana-Champaign, Illinois. She is currently a research associate in the Department of Computer Science at the University of Virginia, Charlottesville. Her research interests include the design and experimental analysis of software protocols for networked and wireless distributed systems.

William H. Sanders is currently a professor in the Department of Electrical and Computer Engineering and the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign, Illinois. He is vice-chair of the IFIP Working Group 10.4 on Dependable Computing. In addition, he serves on the editorial board of IEEE Transactions on Reliability, and is the area editor for Simulation and Modeling of Computer Systems for the ACM Transactions on Modeling and Computer Simulation. He is a past chair of the IEEE Technical Committee on Fault-Tolerant Computing. His research interests include performance/dependability evaluation, dependable computing, and reliable distributed systems. He has published more than 140 technical papers in these areas. He is currently serving as the general chair of the 2003 Illinois International Multiconference on Measurement, Modelling, and Evaluation of Computer-Communication Systems. He has served as cochair of the program committees of the 29th International Symposium on Fault-Tolerant Computing (FTCS-29), the Sixth IFIP Working Conference on Dependable Computing for Critical Applications, Sigmetrics 2003, PNPM 2003, and Performance Tools 2003, and has served on the program committees of numerous conferences and workshops. He is a codeveloper of three tools for assessing the performability of systems represented as stochastic activity networks: METASAN, UltraSAN, and Mobius. Mobius and UltraSAN have been distributed widely to industry and academia; more than 300 licenses for the tools have been issued to universities, companies, and NASA for evaluating the performance, dependability, security, and performability of a variety of systems. He is also a codeveloper of the Loki distributed system fault injector and the AQuA/ITUA middlewares for providing dependability/security to distributed and networked applications. He is a fellow of the IEEE.

Michel Cukier is currently an assistant professor in the Center for Reliability Engineering, Department of Mechanical Engineering at the University of Maryland, College Park, Maryland. His research interests include intrusion tolerance, fault tolerance in distributed systems, modeling, and fault injection. He is a member of the IEEE and the IEEE Computer Society.
