Fault-tolerant data management in the Gaston peer-to-peer file system

Vladimír Dynda, Pavel Rydlo

Dipl.-Ing. Vladimír Dynda, Dipl.-Ing. Pavel Rydlo, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, Karlovo namesti 13, 121 35 Prague 2, Czech Republic, E-Mail: {xdynda|xrydlo}@fel.cvut.cz

WIRTSCHAFTSINFORMATIK 45 (2003) 3, S. 273–283

1 Introduction

Ordinary file systems are designed to provide a file service for a limited number of users using machines located in relatively small geographic areas, usually within the range of a local network. These services are fast, efficient and highly suitable for local users. However, mobile users, who often change their locations, do not have location-transparent file access; such access is usually realized through an ftp service or another substitute method. Moreover, standard file systems do not provide a sufficient and transparent level of fault-tolerance and backup. Although local file systems are adequate for many applications, the evolution of the Internet and increasing storage and computation capacity allow the design of systems solving the above-mentioned problems.

Gaston is a peer-to-peer large-scale file system designed to provide a fault-tolerant and highly available file service for a virtually unlimited number of users. It uses the Internet and its resources to create the system infrastructure and employs massive data replication and efficient update mechanisms in cooperation with suitable cryptographic techniques to create a reliable, scalable and secure system. The core of the Gaston design is data management, which has a huge impact on the overall system characteristics and involves the definition of suitable data structures, replication schema management and elementary operations including creation, update and consistency maintenance.

The essential features of data management in the Gaston file system are its simplicity, scalability, fault-tolerance and security. The dissemination of data object replicas is based on user-specified parameters, the current characteristics of individual nodes and their autonomous settings. A replication schema of tree topology ensures efficient update propagation among replicas, while the proposed technique for increasing the reliability of the tree structure guarantees the desired level of fault-tolerance with acceptable memory overhead. To achieve data confidentiality of all replicas, cryptographic mechanisms are used to protect data so that updates can be performed at possibly malicious nodes without knowledge of the plain data.

The remainder of this paper is structured as follows. Section 2 describes the underlying infrastructure of the system; Section 3 presents the structure and components of elementary data objects. Section 4 introduces the tree-topology structure for connecting replicas of data objects and describes algorithms for its creation and maintenance according to specified parameters. Section 5 focuses on consistency control and update propagation among replicas. Section 6 presents a technique improving the fault-tolerance of data management in Gaston. Section 7 compares our approach to data management with related work. Section 8 contains conclusions and sets out some future directions.

2 System overview

Gaston is intended to provide a file service to a virtually unlimited number of users utilizing a nearly unlimited number of heterogeneous machines. All nodes that participate in the system are organized in an overlay system network (SN) that is established on top of the physical interconnection infrastructure (the Internet) and is used to maintain distributed knowledge of all nodes in the system. For this purpose, Gaston makes use of Tapestry [ZhKu01], a self-organizing overlay network based on a hashed-suffix routing structure [Plax97], which provides object location and node-to-node communication under failure conditions while transparently managing network topology. Alternatives to Tapestry include Pastry [Rows01], Chord [Stoi01] and CAN [Ratn01]. The nodes involved in the system may include common desktop computers, mobile computers, PDAs, etc. Every machine connected to the Internet is allowed to become a part of the SN; it is just a matter of installing client software on it. Thus the system network can span users' homes and offices, schools, libraries, internet cafes and so on.

To achieve a highly available large-scale file service, the data objects are replicated across the system network, substantially improving fault-tolerance, availability and load balancing. On the other hand, replication poses several challenges to data management, since a replication schema (i.e., the set of system network nodes holding a replica of a particular data object) has to be built up and adapted according to the current system state while preserving minimum overhead. Data consistency and efficient update propagation must also be guaranteed (see Section 5). To enable a direct connection between replicas, the replica network (RN) of a tree topology connecting all members of a replication schema is constructed independently of the system network (see Section 4). In the Gaston file system, all copies of a data object, including caches, are considered equivalent replicas and thus all nodes holding these copies are part of the RN. This way a user has transparent access to the current versions of her files, no matter where they are physically located. It means that the user can access the files, for example, from her office in the daytime and from home in the evening without needing to care about explicit file transfers using a floppy disk or an ftp service.

3 Data object

To enable efficient, scalable and secure data control, every file in Gaston consists of one or more segments, which are the smallest replicable data objects (i.e., elements of data management) that can be located, read, written or moved independently. As data operations in the system are performed in a large-scale environment, where the communication cost plays a significant role, the format of the file segment is designed so that nodes exchange the minimum amount of data to perform the requested data operation.

Let's imagine that a user writes a letter for her business partners and stores it in the file letter.txt. As the size of this file may exceed the size limit of a single file segment, it can be internally divided into several segments treated independently by data management in the Gaston file system. As segments are replicated individually, it is sufficient that the system updates just one respective file segment when needed and disseminates the update using the segment's replica network.

Each file segment can be viewed as an incrementally modifiable block of data and is divided into four areas. The modification table TM contains a record of the changes applied to the segment, the data area contains the encrypted original data, whereas the area of changes holds the changes applied to the original data (for more detail see Section 5). The signature area contains the digital signature of the whole file segment. The structure of the file segment is presented in Fig. 6.

The modification table contains entries representing the particular changes performed on the previous version of the file segment. An entry is a structure formed by the ID of the updating node and pointers to the previous and new versions of the data in the segment. The proposed format has several advantages with respect to data management; in particular, it preserves versions of file segments while retaining low storage overhead. However, as the size of a single file segment cannot grow indefinitely, its overall size and the number of the most recent versions retained are limited.
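As an illustration only, the segment layout and the modification-table entries described above could be represented along the following lines (a Python sketch; the field names, the version limit and the append_change helper are our own and not taken from the Gaston implementation):

from dataclasses import dataclass, field
from typing import List

@dataclass
class ModEntry:
    """One record of the modification table TM: who updated the segment
    and which data versions the update connects."""
    updater_id: str      # ID of the updating node
    prev_version: int    # pointer to the previous version of the data
    new_version: int     # pointer to the new version of the data

@dataclass
class FileSegment:
    """A file segment with its four areas (Section 3)."""
    mod_table: List[ModEntry] = field(default_factory=list)   # TM
    data_area: bytes = b""                                     # encrypted original data E_K(PT)
    changes: List[bytes] = field(default_factory=list)         # encrypted updated blocks
    signature: bytes = b""                                     # signature S of the whole segment

    MAX_VERSIONS = 16   # illustrative bound on the number of retained versions

    def append_change(self, entry: ModEntry, encrypted_block: bytes, new_signature: bytes) -> None:
        # record the incremental update, append the encrypted block to the
        # area of changes and replace the segment signature
        if len(self.mod_table) >= self.MAX_VERSIONS:
            raise ValueError("segment full: the number of retained versions is limited")
        self.mod_table.append(entry)
        self.changes.append(encrypted_block)
        self.signature = new_signature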

Since the intended environment is a large set of uncertified machines, strong emphasis is laid on overall security, so data management is extended with cryptographic techniques. To provide the strongest possible data protection, it would be necessary to encrypt data in a chaining manner so that every encrypted byte depends on some prior ones [Schn96]; however, the performance penalties would be too high. The update mechanism and the format of the file segment in Gaston significantly increase the performance of update operations, since the current version of the file segment is formed by several blocks of data that do not form a chaining sequence. This, however, slightly decreases the level of data security in comparison with the highest theoretically possible level, contiguous encryption.

To provide access control, the digital signature of the modification table, the data area and the area of changes is attached, since the ability to create a valid signature serves as the writer's authentication and also as authorization to write the data. On the other hand, incorporating the modification table into the signing process secures it against malicious changes by an attacker, since any modification is easily detectable. The signature is calculated as follows:

S = sign_SK(hash(E_K(PT) || E_K(PT_D1) || ... || E_K(PT_Dn) || TM))

SK is the writer's private key, K is an encryption key used to confidentially protect the data, PT, PT_D1, ..., PT_Dn are the plain text of the original data and the plain texts of the updated blocks respectively, TM is the current modification table and || is the concatenation operation.
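A minimal sketch of this computation, assuming SHA-256 as the hash and leaving the private-key operation behind an abstract sign_sk callable (both assumptions; the paper does not fix concrete algorithms):

import hashlib
from typing import Callable, Iterable

def segment_signature(sign_sk: Callable[[bytes], bytes],
                      enc_original: bytes,
                      enc_updates: Iterable[bytes],
                      mod_table_bytes: bytes) -> bytes:
    # S = sign_SK(hash(E_K(PT) || E_K(PT_D1) || ... || E_K(PT_Dn) || TM))
    h = hashlib.sha256()
    h.update(enc_original)          # E_K(PT): encrypted original data
    for block in enc_updates:       # E_K(PT_Di): encrypted updated blocks
        h.update(block)
    h.update(mod_table_bytes)       # serialized modification table TM
    return sign_sk(h.digest())      # writer's private-key signature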

4 Replica network

To allow efficient message traffic and achieve smooth update propagation among replicas, the member nodes of the replication schema of each file segment are connected by a replica network RN of a tree topology (secured against node failures, see Section 6). The replication schema is not static; rather, it has to dynamically react and adapt itself to current network conditions and the data object's state in order to minimize the cost of replication in terms of weighted criteria that should be complex enough to adjust the system according to the autonomous needs of users and administrators. As follows from the nature of the system, decisions concerning adaptation must be fully distributed and based on local knowledge.

RN structure. Based on their activity, nodes in the replica network of each data object (e.g., a single file segment of our file letter.txt) are classified into several categories. First, ordinary members of the RN are nodes where data object replicas are stored (primarily for data availability reasons) and no read or write requests are generated by these nodes. Reading nodes are nodes that have recently generated one or more read requests to the replica and, finally, writing nodes (wn) are those which have recently generated write requests. Writing nodes may become primary nodes (pn) that are responsible for data updates. Note that despite this classification, all nodes are functionally equivalent and their assignment to a particular class changes over time according to their activity.

In the case of our letter.txt file, reading nodes are the computers used to read the content of the file (these could include the user's PDA, home and office PCs and the computers of other potential users reading the letter). Similarly, the writing nodes of the replica network of a particular file segment are the computers where the user has recently changed the content of the letter, causing this file segment to be updated.

As data updates are disseminated into the replica network starting at all available pn nodes (see Section 5), each update creates a parent–child relationship between each two intermediate nodes until it reaches a node that has already received this update. Thus, the relation between nodes alters as network characteristics or the set of pn nodes change. The replica network can then be seen as a set of sub-trees rooted at primary nodes and connected by boundary nodes.

Figure 1 shows an example of a replica network RN containing four primary nodes pn1, ..., pn4 disseminating updates in all directions along RN edges. As update messages travel, they establish a parent–child relationship and delimit a sub-tree of each pni (areas bounded by dotted lines). Boundary nodes having RN-neighbors in other sub-trees (nodes a, b, c, d and e) have multiple parents (each from a different sub-tree).

Criteria for RN modification. The system creates and adapts the replica network according to the following requirements and constraints imposed by data object users and administrators of system nodes:

- Data object administrator requirements (data durability, data reliability, service charge, ...);

- Requirements of other data object users (latency of update delivery, ...);

- Constraints specified by administrators of particular nodes (load, bandwidth capacity, storage capacity, ...), further denoted as dni (constraints imposed on node i).

To get the required availability of our letter.txt, the file is generally stored at more nodes than it is accessed from. Besides the availability, the user can also specify, for example, the maximum service charge for replicas stored at nodes providing charged storage. On the other hand, administrators of computers involved in the system can similarly constrain, for instance, the storage capacity provided for replicas of file segments used by other users.

Each requirement or constraint is represented as a structure of its minimum and maximum value (where applicable), a weight to express its importance and a coefficient to make requirements mutually comparable. Each category forms a set of individual requirements, which is treated as one criterion for RN modification.
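One possible encoding of such a record is sketched below; the field names and the simple scoring helper are illustrative assumptions, not the structure actually used in Gaston:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Requirement:
    """A single requirement or constraint from one of the categories above."""
    name: str                        # e.g. "durability", "service_charge", "latency"
    minimum: Optional[float] = None  # minimum acceptable value, where applicable
    maximum: Optional[float] = None  # maximum acceptable value, where applicable
    weight: float = 1.0              # importance of this requirement
    coefficient: float = 1.0         # scaling factor making requirements comparable

    def satisfied_by(self, measured: float) -> bool:
        # check whether a locally measured value stays within the bounds
        if self.minimum is not None and measured < self.minimum:
            return False
        if self.maximum is not None and measured > self.maximum:
            return False
        return True

    def score(self, measured: float) -> float:
        # weighted, normalized contribution of this requirement to one criterion
        return self.weight * self.coefficient * measured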

To modify the RN appropriately, the system uses several types of information available at each node:

- Information concerning nodes and other replicas in the node's neighborhood in the respective RN (distance to boundary nodes and primary nodes, aggregated durability of replicas in sub-trees, service charge for replica storage in a sub-tree, ...), further denoted as snai (local information known at node i);

- Measurements at each node, further denoted as snli (actual load, bandwidth and storage consumption, node reliability, ...).

Similarly to the requirements on replicas, both categories of information form a set of individual values that can be compared with the sets of requirements and constraints in the process of replica network adaptation.

RN construction. When a user creates a file, it is (based on its initial size) divided into several file segments internally treated independently. At this moment, each segment is stored only locally at the user's computer and has to be replicated to other computers in the system while creating its tree-topology replica network, since an RN consisting of one node does not usually meet the desired parameters. The process of RN construction is virtually the same as the adaptation process that modifies an RN when it no longer meets the requirements.

The sets of data requirements are involved in the process of RN construction and modification by means of a single argument (denoted the r-parameter), which is a complex structure comprising all requirements of data administrators and data users. When a data object is created, the value of the r-parameter for the whole replica network is defined directly by the creating user. During the RN tree construction process, the r-parameter value is divided among all sub-trees of each node. Thus, each sub-tree in the RN is qualitatively described by the portion of the initial r-parameter (i.e., the initial requirements) assured by this sub-tree. The fraction of the r-parameter ensured by each individual RN-member node i is called the affinity factor, further denoted ari, expressing the qualitative rate of the replica stored at this node.

Let's look at one of the file segments of the file letter.txt that a user has just created. The segment is initially stored locally at the user's computer and it is assigned an r-parameter of value 1000 (expressed here as an integer value for the sake of simplicity) resulting from the user's requirements for the whole file.

Algorithm 1 Activity of a node receiving a REPLICA_LEAVE message

Algorithm 2 Activity of a node receiving a REPLICA_STORE message

The affinity factor for this locally stored segment is set to, say, 15; it is based on the locally measured state of the network and of the computer itself. Now, the segment has to be replicated to other nodes to cover the remaining requested r-parameter of value 985, so several candidate nodes from the system network SN are selected to become part of the replica network of this file segment, the r-parameter value 985 is divided among them and they are asked to store the replica with the respective r-parameter fraction. If they cannot, they are supposed to distribute the store request further. In this way the replica network of our file segment is created hierarchically. A more complex and slightly more formal continuation of this process can be found below in the example illustrated by Figure 2, assuming that the original node where the file has been created is formally identified as node p and one of the candidate nodes for storing the replica with r-parameter fraction 100 is identified as node i.

Neither the requirements (r-parameter) nor the affinity factor of each node are static; they may change during the data object's existence, so the replica network has to be adapted to reflect these changes. The mechanism of replica network modification is based primarily on two types of messages, REPLICA_STORE and REPLICA_LEAVE. The former is used to distribute an r-parameter down the RN tree and to extend the replica network; the latter is used to return the portion of an r-parameter that cannot be ensured by a node (or its sub-tree) to the parent and to reduce the RN. For readers who are more interested in this mechanism, Algorithm 1 and Algorithm 2 describe the basic process of RN modification performed on RN-member nodes after receiving the respective type of message.

When it is no longer possible for a node i to ensure its actual affinity factor for its replica, it has to delegate Δari in the form of an r-parameter to other nodes. It first sends a REPLICA_STORE message to those of its child nodes from which it has not received a REPLICA_LEAVE message for a period t1 (see below) or to selected RN-candidate nodes. If there is no such node or if all possible candidates reply REJECT, the node sends a REPLICA_LEAVE message to all its current parents. This is the method by which a change in the r-parameter or affinity factor is propagated and processed in the replica network.
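Algorithms 1 and 2 themselves are given only as figures; the following sketch merely reconstructs the store/leave behaviour described in the text, with invented helper names (local_affinity, select_candidates, split_among and the message plumbing):

def on_replica_store(node, requested_r: float, sender) -> None:
    # Sketch of Algorithm 2: a node receives an r-parameter fraction to ensure.
    affinity = node.local_affinity(requested_r)     # portion this node can ensure itself
    if affinity <= 0 and not node.select_candidates():
        node.send(sender, "REJECT")
        return
    node.store_replica(affinity)
    node.parents.add(sender)
    remaining = requested_r - affinity
    if remaining > 0:
        # children that have not sent REPLICA_LEAVE within period t1,
        # plus freshly selected RN candidates, receive the rest
        for child, share in node.split_among(node.select_candidates(), remaining):
            node.send(child, "REPLICA_STORE", share)

def on_replica_leave(node, returned_r: float, child) -> None:
    # Sketch of Algorithm 1: a child returns a portion of the r-parameter.
    absorbed = node.increase_affinity(returned_r)   # may be 0 if the local state does not allow it
    remaining = returned_r - absorbed
    if remaining > 0:
        candidates = node.select_candidates()
        if candidates:
            for c, share in node.split_among(candidates, remaining):
                node.send(c, "REPLICA_STORE", share)
        else:
            for parent in node.parents:             # push the rest up the tree
                node.send(parent, "REPLICA_LEAVE", remaining)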

Figure 2 illustrates a typical example of the RN creation and adaptation process. After receiving a REPLICA_STORE message (step 1), node i decides to store the replica with an affinity parameter ari of value 10, sets node p as its parent and sends the rest of the requested r-parameter to selected candidates (nodes b, j and a, step 2). Nodes b and a accept the request and continue to further distribute the data object. Node j replies REJECT (step 3), forcing node i to select another candidate, node c (step 4). At the same time node a sends the message REPLICA_LEAVE to node i, since it could not ensure the whole requested r-parameter any more. Now, node i decides to increase its affinity ari by 5 (its state may have changed meanwhile) and, as it does not have any more suitable candidates, it forwards the REPLICA_LEAVE message with the remaining r-parameter of value 5 to its parent p (step 5).

The algorithms for REPLICA_LEAVE and REPLICA_STORE can be tuned by the parameter t1, representing the minimal time period in which a child node must not have sent a REPLICA_LEAVE message to its parent in order to be selected as a candidate for replica storage (see the description of Algorithm 1 above); a lower value of t1 represents a more optimistic strategy and vice versa. Figure 3 shows the impact of the t1 period (parameterized by the average branching factor, i.e., the average number of neighbors of each node in the RN) on the average number of messages exchanged between a node and its sub-tree during the process of replica dissemination, assuming that the node's RN-neighbors in the sub-tree regularly reject messages with a constant probability p < 1. The longer t1 is, the lower the ratio of rejects that can be expected. On the other hand, a longer t1 period lowers the number of candidates for replica placement, so a trade-off between the number of exchanged messages and the probability of successful replica placement in a sub-tree has to be chosen.

To ensure the stability of the RN modification algorithm, the decision whether to modify the ar affinity parameter (the change_affinity() function) is also based on hysteresis of locally measured parameters, whose two thresholds h1 and h2 can be adjusted as well. The number of rejects of REPLICA_STORE requests (shown in Figure 4) at each node depends on the aggregated value q of the input parameters (ari, snli, snai) of the change_affinity() function.
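The hysteresis itself can be sketched as a standard two-threshold decision (our own rendering; only q, h1 and h2 come from the text):

def change_affinity(q: float, h1: float, h2: float, affinity_raised: bool) -> bool:
    # Two-threshold hysteresis on the aggregated value q of (ari, snli, snai).
    # Returns the new decision; it only flips when q leaves the band [h1, h2],
    # so small oscillations around a single threshold do not trigger repeated
    # RN modifications.
    if not affinity_raised and q > h2:
        return True            # q crossed the upper threshold: raise the affinity
    if affinity_raised and q < h1:
        return False           # q fell below the lower threshold: lower it again
    return affinity_raised     # inside the band: keep the current decision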

Since the introduced procedures represent generic algorithms for RN creation and adaptation, they can be deployed with several different strategies for RN-candidate selection, affinity factor modification and r-parameter distribution. Another advantage of these algorithms is the use of an affinity factor defining the quality of a node with respect to the stored replica. Thus, when new additional requirements on the replica are to be ensured by a node already holding this replica, only the node's affinity for this replica is updated, instead of creating another replica (or replicas). The affinity can also be seen as a preference factor of a node when handling requests to a data object, which can be used as an optimization hint for request distribution.

5 Data updates

There are several requirements that have to be met by a data update mechanism. The update of a data object has to be distributed to all respective replicas currently stored in the system in order to keep all data up to date. Data consistency must be maintained as well, since simultaneous modifications of the same data from different nodes are likely to occur in a large-scale system. Efficiency of the update mechanism is also an important issue, because the performance of the whole system is influenced by data updates to a great extent.

Modifying the data of a particular file segment causes the update to be distributed through its replica network to all replicas of the segment. To keep data consistent, a primary network (PN) responsible for the consistency maintenance of file segment data is used. The primary network is an overlay structure built on top of the respective replica network, connecting all primary nodes pn. When a node intends to change data, it sends an update request directly to the closest primary node in its RN (which is at the same time the closest member of the PN), which initiates an agreement process with the other available primary nodes through the PN; only after successful completion do all primary nodes disseminate the update to all their neighbors in the RN. The agreement is based on modified version vectors [Park83] derived from the file segment modification table.

Primary nodes pn are the writing nodes that most frequently perform file segment updates; the portion of writing nodes that are also primary nodes is denoted as the ratio k. This ratio influences the cost of the agreement protocol in the PN, the communication overhead of update propagation and the propagation latency, since a higher number of primary nodes means that the update dissemination process in the RN is initiated in parallel at more nodes and thus the propagation latency to other nodes in the replica network is lower. On the other hand, a higher number of primary nodes induces higher costs of the consistency agreement protocol. As these aspects are contradictory, the optimal threshold level kopt has to be found. This value influences the threshold level of the write request rate that is crucial for a writing node to become a primary node and vice versa (i.e., to join or to leave the primary network PN). Figure 5 depicts the dependence of the overall update cost on the ratio k and the optimal value kopt. Note that the exact graph shape depends on the specification of the criteria weights.
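The trade-off can be illustrated with a toy cost model (entirely an assumption on our part; Figure 5 only fixes the qualitative shape): the agreement cost grows with the number of primary nodes while the propagation cost shrinks, and kopt minimizes the weighted sum.

def overall_update_cost(k: float, n_writers: int,
                        w_agree: float = 1.0, w_latency: float = 1.0) -> float:
    # Toy model: k is the fraction of writing nodes that are primary (0 < k <= 1).
    # The PN agreement cost grows with the number of primaries, the propagation
    # cost falls because updates are injected at more points of the RN in parallel.
    primaries = max(1, round(k * n_writers))
    return w_agree * primaries + w_latency / primaries

# locate kopt for a given number of writing nodes by a simple scan
k_opt = min((k / 100 for k in range(1, 101)),
            key=lambda k: overall_update_cost(k, n_writers=20))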

Let's imagine that our letter.txt is an offer edited by a product manager, a salesman and an assistant. Thus a single file segment of the letter may be updated simultaneously at three different machines, the writing nodes. The most active user is the assistant, so her machine is also the primary node for the segment (let's assume it is the only primary node, since the other users do not generate enough write requests according to the kopt value). All updates have to be performed in the correct order by all replicas, thus they are directed to the primary network PN for the consistency agreement process, done in this case only by the assistant's computer.

Figure 3 Impact of the period t1 (parameterized by the average branching factor, i.e. the average number of neighbors of each node in the RN) on the average number of messages exchanged between a node and its sub-tree during the replica dissemination process

Figure 4 Number of rejects of replica store requests as a function of the aggregated value q of the input parameters of the change_affinity() function (h1, h2: lower and upper threshold levels)

Figure 5 Dependence of the overall update cost on the ratio k (kopt: optimal value of the ratio k)

Table 1 Modification table operations

Operator     Description
TM1 < TM2    TM1 represents an earlier version of the file segment than TM2. All updates in TM1 are contained in TM2 in the same order.
TM1 = TM2    Versions represented by TM1 and TM2 are the same. Both tables contain the same entries in the same order.
TM1 ≠ TM2    File segment versions represented by TM1 and TM2 are conflicting. Corresponding entries do not match.
TM1 − TM2    Subtraction. Represents the updates that were applied to the file segment after the updates represented by TM2, i.e. the set of updates that are contained in TM1 but not in TM2. The order of entries in the result is the same as in TM1.
TM1 + TM2    Union. Represents the updates contained in both TM1 and TM2 and the updates contained in one table and not in the other. The order of entries is preserved. This operation is applicable only if TM1 < TM2 or TM2 < TM1.
|TM|         Cardinality of table TM (i.e. the number of entries in the table).

Only if the agreement process for a particular update is successful does our primary node disseminate the agreed update to all other replicas through the replica network.

The update request structure is presented in Figure 6, which also illustrates the update process where an old segment of version i is updated with a write request containing the update data for version i+1, the modification table TM_{i+1} and the new signature S_{i+1} (the signature of the already updated file segment). Updated data are always placed at the end of the area of changes and the new signature S_{i+1} is attached right after the last update.

A node receiving an UPDATE message uses the received modification table and its own table to detect which modifications to apply to the file segment. This decision is based on operations with modification tables, which can be seen as ordered sets of updates where every two update elements are distinguishable. The modification table operations are summarized and described in Table 1.
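The operations of Table 1 can be sketched over a modification table represented as an ordered list of entries (a simplification; the real entries also carry the version pointers described in Section 3):

def is_earlier(tm1: list, tm2: list) -> bool:
    # TM1 < TM2: TM1 is a prefix of TM2, i.e. an earlier version
    return len(tm1) < len(tm2) and tm2[:len(tm1)] == tm1

def is_same(tm1: list, tm2: list) -> bool:
    # TM1 = TM2: same entries in the same order
    return tm1 == tm2

def is_conflicting(tm1: list, tm2: list) -> bool:
    # TM1 != TM2: neither table is a prefix of the other
    return not (is_same(tm1, tm2) or is_earlier(tm1, tm2) or is_earlier(tm2, tm1))

def subtract(tm1: list, tm2: list) -> list:
    # TM1 - TM2: updates contained in TM1 but not in TM2, in TM1 order
    return [e for e in tm1 if e not in tm2]

def union(tm1: list, tm2: list) -> list:
    # TM1 + TM2: defined only if one table is an earlier version of the other;
    # the result is then simply the longer (more recent) table
    if is_earlier(tm1, tm2) or is_same(tm1, tm2):
        return list(tm2)
    if is_earlier(tm2, tm1):
        return list(tm1)
    raise ValueError("union is defined only for non-conflicting tables")

# |TM| is simply len(tm)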

The activity of a node receiving an UPDATE message is described in Algorithm 3. When node c receives an UPDATE message from node o, it checks whether it has all previous updates (step 2). If not, it queries node o (step 3), and only if it has (or gets) all of them and the signature from the update is valid does node c update its replica of the file segment (steps 5, 8) and distribute the update to all its neighbors except node o (steps 6, 9). In all valid cases, node c sets node o as its parent (steps 10, 11).
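Since Algorithm 3 itself is given only as a figure, the following sketch reconstructs the behaviour described above, reusing the modification-table helpers from the previous sketch; fetch_from, verify_signature, apply and the messaging calls are hypothetical:

def on_update(node, update, sender) -> None:
    # Node c (node) receives an UPDATE message from node o (sender).
    if is_same(node.mod_table, update.mod_table) or is_earlier(update.mod_table, node.mod_table):
        node.parent = sender        # nothing new to apply or propagate
        return
    # fetch any missing intermediate updates from the sender first
    missing = subtract(update.mod_table, node.mod_table)[:-1]
    if missing and not node.fetch_from(sender, missing):
        return
    # apply only updates whose signature over the whole segment is valid
    if not node.verify_signature(update):
        return
    node.apply(update)
    for neighbor in node.rn_neighbors:
        if neighbor != sender:      # forward to all RN-neighbors except the sender
            node.send(neighbor, "UPDATE", update)
    node.parent = sender            # in all valid cases the sender becomes the parent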

This mechanism ensures that for each pair of pn nodes any update propagation to inner nodes is equally distributed in terms of dissemination time. However, the balancing of other parameters (e.g., load, number of replicas, etc., as well as the dissemination time to the leaves of the RN) is ensured by the replica network adaptation mechanism.

A data update represents an elementary process used as a part of the file system operations write file and delete file. A data update is performed in the replica networks of all file segments that the file currently consists of, and thus the request reaches all relevant replicas. The file delete operation is a special case of a write operation causing the file to be logically deleted; however, the replicas of the file segments of the last valid version are not physically removed until the chosen retention policy decides to do so.

One of the most important advantages of the described update mechanism is the fact that consistency control is performed by the members of the primary network, which always consists of nodes heavily involved in the update process due to their high write request rate. Next, the boundaries between the update dissemination sub-trees of primary nodes dynamically migrate to balance the update propagation time in the replica network. Additionally, the update process has a small communication overhead, since only changed data are transmitted among replicas and the format of the update request (together with the data object format) allows operations to be performed with encrypted data without a need to decrypt the data object at distrusted nodes.

Figure 6 File segment structure and an update process. The old file segment (version i) consists of the data area holding the encrypted original data E_K(PT), the area of changes holding the encrypted updated blocks E_K(PT_D) up to version i, the modification table TM and the signature S_i. A write request for version i+1 carries the newly encrypted updated block E_K(PT_D), the modification table TM_{i+1} and the new signature S_{i+1}; applying it appends the block to the area of changes and replaces the table and the signature. (E_K: encryption using a key K; TM: modification table; PT: plain text; PT_D: plain text of a new version of updated data; S: signature.)

Algorithm 3 Activity of a node receiving an UPDATE message

6 Multicast fault-tolerance

Although a tree-topology replica network is well suited to multicast updates to all replicas in the replica network, it is not immune to node or link failure (what happens to the replica network if node e from Figure 1 fails?). To achieve maximum fault-tolerance, the data management in the Gaston file system uses both a pessimistic and an optimistic strategy to eliminate network partitioning. The optimistic method exploits heartbeat messages [Well00] to detect partitioning, similarly to the D-tree framework [MeCh00]. If a node fails to receive a heartbeat from its parent within a threshold time period, it assumes it is no longer connected to the replica network and tries to rejoin it as if it were joining anew.

Since rejoining the whole sub-tree after an accidental node disconnection can be a relatively slow process, we have proposed a scalable pessimistic method that eliminates partitioning when it occurs and is independent of the message source and the update propagation direction in the RN. This method is based on the construction of virtual bypass rings BR(r) of radii 1, ..., r (counted in hops) around each member of the replica network, which are used as alternative paths to divert message traffic in case of node failure.

One of the motivations for designing the BR(r) fault-tolerant scheme may be the following rather typical simple scenario. Suppose that our user stores the file letter.txt at her mobile computer (thus it is part of the respective replica networks) and suddenly her machine is unintentionally disconnected (e.g., due to loss of the wireless network signal). Now, all the replica networks that the machine has been part of are broken and have to be reconnected. For this purpose, the bypass rings previously constructed around this currently disconnected machine are employed, using the mechanism described below.

An example of the rings BR(1) and BR(2) centered at node c is shown in Figure 7. The original replica network is illustrated with a bold solid line, the bypass rings are depicted by thin dot-dashed lines. BR(1) consists of all nodes at distance 1 from its center node c, BR(2) contains all nodes at distance 2. Note that the members of ring BR(1) are sorted according to their identifications (network addresses) and the ring ordering is indicated by the orientation of the bypass ring edges (arrows in Figure 7). Each ring-member node has two ring-neighbors; the right one is the neighbor in the direction of the ring orientation, the left one is the other.

The higher the maximum radius of the bypass rings used, the higher the level of fault-tolerance reached, since more adjacent node failures can be repaired. On the other hand, higher-radius rings induce higher communication and memory overhead. Nevertheless, a trade-off between overhead and fault-tolerance can be achieved by choosing an appropriate radius and possibly by specifying the number of member nodes of each ring of radius r > 1.

Next, the algorithms for bypass ring construction and tree repair are described. For clarity, only BR(1) rings are considered. Each node holds a BR table with information about all bypass rings that the node is a member of. Entries in the table consist of the IDs of the left and right ring-neighbors, the ID of the center node and some additional information described below.

Bypass ring construction and update. A node c that is to create its bypass ring of radius 1 sorts all its m neighbors according to their IDs and sends to each neighbor j a CREATE_BL message with the IDs of the nodes adjacent to node j in the sorted list of neighbor nodes. Upon receiving a CREATE_BL message, each node j saves the sender's ID as the center node of the ring and both received IDs as its left and right ring-neighbors in its BR table. Thus, the construction of BR(1) is similar to the construction of a circular bidirectional linked list. The important aspect here is to prevent incomplete rings in cases where the center node fails during the process. To detect these failures, each node sends a HALLO_BL message to its ring-neighbors upon receiving CREATE_BL.
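A sketch of BR(1) construction as a sorted circular doubly linked list; the message plumbing and the BR-table representation are assumptions:

def create_bypass_ring(center, neighbor_ids: list) -> None:
    # Center node c sorts its neighbors by ID and tells each neighbor j who its
    # left and right ring-neighbors are (CREATE_BL).
    ring = sorted(neighbor_ids)
    m = len(ring)
    for idx, j in enumerate(ring):
        left = ring[(idx - 1) % m]      # predecessor on the oriented ring
        right = ring[(idx + 1) % m]     # successor on the oriented ring
        center.send(j, "CREATE_BL", {"center": center.id, "left": left, "right": right})

def on_create_bl(node, msg) -> None:
    # Receiving node j stores one BR-table entry for the ring centered at the
    # sender and greets its ring-neighbors so that an incomplete ring is detected.
    node.br_table[msg["center"]] = {"left": msg["left"], "right": msg["right"]}
    node.send(msg["left"], "HALLO_BL", {"center": msg["center"]})
    node.send(msg["right"], "HALLO_BL", {"center": msg["center"]})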

The update of a bypass ring (when a neighbor of the center node is either added or deleted) is again similar to the corresponding operations on a sorted circular bidirectional linked list. The center node sends CHANGE_BLL and CHANGE_BLR messages causing the target node to change the ID of its left or right neighbor while retaining the node ordering and the orientation of the ring. To deal with a center node failure, which can lead to bypass ring damage, nodes receiving a CHANGE_BLx message have to agree with each other to perform the change atomically and simultaneously.

Tree repair. When a node f, during message routing through the replica network, realizes that one of its child neighbors, node c, is down, it initiates tree repair. First, it excludes node c from its own bypass ring. Next, it sends REPAIR (ID of node c, ID of node f) messages to both its ring-neighbors on the ring centered at the faulty node c.

Upon receiving a REPAIR message from its ring-neighbor, each node i runs the OnReceive_REPAIR_message() procedure (see Algorithm 4). This algorithm substitutes bypass edges with core RN tree edges using the CREATE_CE message (step 3) and thus eliminates the failed node c from the replica network and restores the connectivity of the tree. The important requirement is to avoid forming cycles during the process, since several ring members can initiate the repair simultaneously. This is solved by the ring orientation (steps 9, 12), by keeping the ring members ordered on the ring and by comparing the IDs of the initiating nodes of incoming REPAIR messages (steps 6, 8).
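Algorithm 4 is given only as a figure, so the following is an approximation of the repair step assembled from the description above and from the example in Figure 8; the state kept in repairs_seen and the exact refusal rule are our assumptions:

def on_receive_repair(node, failed_id, initiator_id, sender_id) -> None:
    # A BR(1) member handles REPAIR(failed node c, initiator) arriving from a ring-neighbor.
    entry = node.br_table[failed_id]            # left/right ring-neighbors on the ring around c
    seen = node.repairs_seen.get(failed_id)     # initiator of a previously handled REPAIR, if any

    arrived_from_right = (sender_id == entry["right"])
    if seen is not None and seen < initiator_id and arrived_from_right:
        # a repair with a smaller initiator ID has already rebuilt this part of the
        # ring; accepting another edge here would close a cycle, so refuse it
        return
    node.repairs_seen[failed_id] = initiator_id if seen is None else min(seen, initiator_id)

    # replace the traversed bypass edge by a core RN tree edge (CREATE_CE)
    node.send(sender_id, "CREATE_CE", {"failed": failed_id})
    node.rn_neighbors.add(sender_id)

    # forward the REPAIR along the ring, away from the sender
    forward_to = entry["left"] if arrived_from_right else entry["right"]
    node.send(forward_to, "REPAIR", {"failed": failed_id, "initiator": initiator_id})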

A simple example of distributed tree repair is shown in Figure 8. In the first step, node 18 finds node c not to be available, so it uses the bypass ring centered at node c to eliminate the tree partition and sends REPAIR (c, 18) messages along the ring (steps 2 and 3) to replace ring edges by core tree edges. In step 4, another node (8B) similarly sends a REPAIR (c, 8B) message to its ring-neighbors. Since the first REPAIR message received by node A0 originated at node 18 (which is less than A0; comparison in step 8 of Algorithm 4) and the second one (from node 8B) has been received from its right ring-neighbor, node A0 refuses to substitute the ring edge (A0, 8B) with a new RN edge and thus prevents cycle forming. Since all other edges of the bypass ring have been substituted with new RN edges (bold dot-dashed lines in the figure), all partitions induced by the failure of node c are reconnected into one connected tree.

Bypass rings of higher radii. To deal with simultaneous failures of adjacent nodes, bypass rings of higher radius can also be deployed. The construction process differs in the sorting of ring members, which is based on a partial ordering of the nodes at the same distance from the center node in each sub-tree. The node ordering on the ring BR(r) is based on information known to the ring-members of BR(r − 1). The ring BR(r − 1) is also used for the CREATE_BL message exchange. A repair is always initiated by a node on ring BR(1), which routes the REPAIR message along it. Only when BR(1) is corrupted is the message sent to a higher-radius ring or to the BR(1) of another faulty node and routed along it. If a member of BR(r), r > 1, receives the REPAIR message, it performs an algorithm similar to Algorithm 4 except that it checks the ring BR(r − 1) before it forwards the message to its ring-neighbor. It is thus assured that the bypass edges in the closest neighborhood of the cluster of faulty nodes are replaced by core tree edges and the tree is reconnected.

Bypass rings BR(r) are sufficient for the repair of a cluster of faulty nodes with diameter less than or equal to 2(r − 1). However, because of the suitable repair message routing mechanism, bypass rings BR(r) can repair an even larger cluster of faulty nodes under convenient circumstances. For any repair to be successful, there has to be a path not containing faulty nodes in the underlying routing infrastructure between any two neighboring nodes on BR(1) or the bypass cycle. Provided that the number of failing RN-member nodes is less than the actual number of replicas, the respective data object is still available to all clients in the same network partition, since Tapestry provides a reliable object location and routing service.

The schema is independent of the message source and the propagation direction in the replica network, and it assures that the repair process operates properly no matter where and how many nodes simultaneously initiate the repair of a failed node or a cluster of failed nodes. Moreover, the repair process performance is directly proportional to the number of nodes concurrently launching it, and thus the higher the communication load in the replica network is, the faster the repair process completes.

7 Related work

There are currently several peer-to-peer systems providing a large-scale data service. These projects differ in the intended purpose of use, the characterization of the stored data and the overall system functionality. They are designed to support or create services such as WWW content distribution [ChKa02; RaAg99; Fran00], read-only archives (e.g., PAST [DrRo01] and CFS [Dabe01]), data sharing (Gnutella [Gnut02], Morpheus [Morp02] and KaZaA [Kaza03]) and distributed read-write file systems (OceanStore [Kubi00; Rhea01], Silverback [Weat01], FarSite [Fars01], Gaston [DyRy02]). These systems also differ in the functionality they primarily focus on, e.g., user anonymity (Eternity [Ande96]), persistent storage of read-only data snapshots, i.e. versions (Internet Archive [Inte01]), permanent data backup and data availability [Kubi00; DyRy02], and content sharing [Gnut02; Kaza03; Morp02].

When creating a file, PAST and CFS determine a replication factor k that is constant during the existence of the file. In PAST, this factor depends on the availability and persistence requirements of the file. In Gaston the file replication factor depends on the r-parameter representing the whole set of requirements of data administrators and users. Moreover, the factor can vary in time, since the replication schemas of file segments adapt to the current system conditions.

Similarly to the Gaston file system, ADR [WoJa97] uses a tree topology to connect the nodes in the replication schema; however, its adaptation is based solely on read/write access counts at particular nodes. The RaDaR [RaAg99] project uses network topology information to construct the replication schema and does not involve write operation statistics in the adaptation process. The Gaston file system uses a complex set of weighted parameters to adapt the replication schema, including user requirements and constraints autonomously specified by node administrators.

Dissemination trees in OceanStore are independent of each other, while Gaston uses a single tree that can be logically divided into sub-trees for every primary replica. The boundaries between sub-trees dynamically migrate, which helps to ensure load balancing in the replica network. However, the update process is similar, since both systems use a primary tier of replicas to ensure consistency and disseminate updates to the rest of the replica network.

To achieve data confidentiality, Gaston uses encryption and digital signatures that serve as proof of a user's authorization to modify data and also as an integrity validation mechanism. A similar mechanism is applied in PAST, where a certificate is created for every file. Data integrity in CFS, OceanStore and also in PAST is partially assured by cryptographic hashes that at the same time serve as file identifiers. This approach is applicable in Gaston as well.

In the dissemination mechanism used in OceanStore [MeCh00], a node failure causes the whole dissemination sub-tree (rooted at this node) to be removed, forcing all its nodes to rejoin. In Gaston, optional bypass rings are constructed so that a failed node can be eliminated by reconnecting the tree through the appropriate bypass ring.

8 Conclusions

The increased usage of computer technology around the world has boosted the demand for ubiquitous data. In this paper, we have presented the fault-tolerant and secure data management in the Gaston file system, a global file system designed to provide users with a highly available, fault-tolerant and secure file service.

We showed how user data are stored and managed in the system and presented mechanisms for secure and fault-tolerant update propagation. Data management in the Gaston file system uses replication to assure data availability, and creates and adapts a tree-topology replica network according to a set of specified requirements and constraints. We proposed the structure of the data object designed to provide data confidentiality and security of data updates, described how the replica network is used to efficiently propagate updates from primary nodes to the other replicas and, finally, proposed a bypass ring fault-tolerance scheme to improve the reliability of the tree-topology network of replicas.

Future work in this research involves improving the replica network modification algorithms, particularly efficient reconfiguration according to the specific requirements of read-only clients. There are also several unresolved security issues in our architecture, especially means of preventing malicious nodes from participating in the replica network. The next steps in the project also involve simulation of the presented mechanisms under various workloads [Bolo00; Voge99], evaluation of the overall performance and effectiveness, and comparison of the data management overhead normalized to minimum storage cost with other similar approaches.

References

[Ande96] Anderson, R. J.: The Eternity Service. In: Pribyl, J. (Ed.): Proc. of Pragocrypt '96. Praha 1996, pp. 242–252.

[Bolo00] Bolosky, W. et al.: Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. In: ACM SIGMETRICS Performance Evaluation Review 28 (2000) 1, pp. 34–43.

[Dabe01] Dabek, F. et al.: Wide-area cooperative storage with CFS. In: ACM SIGOPS Operating Systems Review 35 (2001) 5, pp. 202–215.

[DrRo01] Druschel, P.; Rowstron, A.: PAST: Persistent and anonymous storage in a peer-to-peer networking environment. In: Proc. of the 8th IEEE Workshop on Hot Topics in Operating Systems HotOS 2001. 2001, pp. 65–70.

[DyRy02] Dynda, V.; Rydlo, P.: P2P Large-scale File System Architecture. In: Baca, J. (Ed.): Proc. of the Fifth International Scientific Conference – Electronic Computers and Informatics ECI '02. Kosice 2002, pp. 262–267.

[Fars01] FarSite website: http://www.research.microsoft.com/sn/Farsite/publications.htm, as of 2001-09-20.

[Fran00] Francis, P.: Yoid: Your own internet distribution. Technical report, ACIRI. http://www.aciri.org/yoid, as of 2000-12-16.

[Gnut02] Gnutella website: http://gnutella.wego.com, as of 2002-08-12.

[ChKa02] Chen, Y.; Katz, R. H.; Kubiatowicz, J. D.: Dynamic Replica Placement for Scalable Content Delivery. In: Druschel, P.; Kaashoek, F.; Rowstron, A. (Eds.): Proc. of the First International Workshop on Peer-to-Peer Systems IPTPS 2002. 2002, pp. 306–318.

[Inte01] Internet Archive website: http://www.archive.org, as of 2001-09-16.

[Kaza03] KaZaA website: http://www.kazaa.com, as of 2003-03-03.

[Kubi00] Kubiatowicz, J. et al.: OceanStore: An architecture for global-scale persistent storage. In: ACM SIGPLAN Notices 35 (2000) 11, pp. 190–201.

[MeCh00] Mehra, P.; Chatterjee, S.: Efficient Data Dissemination in OceanStore. http://www-video.eecs.berkeley.edu/~pmehra/classes/cs262/paper.pdf, as of 2000-12-16.

[Morp02] Morpheus website: http://www.morpheus.com, as of 2002-08-06.

[Park83] Parker, D. et al.: Detection of Mutual Inconsistency in Distributed Systems. In: IEEE Transactions on Software Engineering 9 (1983) 3, pp. 240–247.

[Plax97] Plaxton, G. et al.: Accessing nearby copies of replicated objects in a distributed environment. In: Leiserson, C. E.; Culler, D. E. (Eds.): Proc. of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures. ACM Press, New York 1997, pp. 311–320.

[RaAg99] Rabinovich, M.; Aggarwal, A.: RaDaR: A scalable architecture for a global Web hosting service. In: Mendelzon, A. (Ed.): Proc. of the 8th Int'l World Wide Web Conf. 1999, pp. 1545–1561.

[Ratn01] Ratnasamy, S. et al.: A scalable content-addressable network. In: Proc. of the ACM SIGCOMM Symposium on Communication, Architecture, and Protocols. ACM Press, New York 2001, pp. 161–172.

[Rhea01] Rhea, S. et al.: Maintenance-free global storage in OceanStore. In: IEEE Internet Computing 5 (2001) 5, pp. 40–49.

[Rows01] Rowstron, A. et al.: Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In: Guerraoui, R. (Ed.): Proc. of the 18th IFIP/ACM International Conference on Distributed Systems Platforms (Middleware 2001). Springer, Berlin Heidelberg 2001, pp. 329–350.

[Schn96] Schneier, B.: Applied Cryptography. Wiley, New York 1996.

[Stoi01] Stoica, I. et al.: Chord: A scalable peer-to-peer lookup service for Internet applications. In: Proc. of the ACM SIGCOMM Symposium on Communication, Architecture, and Protocols. ACM Press, New York 2001, pp. 149–160.

[Voge99] Vogels, W.: File system usage in Windows NT 4.0. In: ACM Operating Systems Review 35 (1999) 5, pp. 93–109.

[Weat01] Weatherspoon, H. et al.: Silverback: A global-scale archival system. Technical Report UCB/CSD-01-1139. http://oceanstore.cs.berkeley.edu/publications/papers/pdf/silverback_sosp_tr.pdf, as of 2001-09-24.

[Well00] Wells, C.: The OceanStore Archive: Goals, Structures, and Self-Repair. http://oceanstore.cs.berkeley.edu/publications/papers/pdf/cwells_masters.pdf, as of 2000-12-10.

[WoJa97] Wolfson, O.; Jajodia, S.; Huang, Y.: An adaptive replication algorithm. In: ACM Trans. on Database Systems 22 (1997) 2.

[ZhKu01] Zhao, B. Y.; Kubiatowicz, J.; Joseph, A. D.: Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141. http://www.cs.berkeley.edu/~ravenben/tapestry.pdf, as of 2001-06-12.

Abstract

Fault-tolerant data management in the Gaston peer-to-peer file system

Gaston is a peer-to-peer large-scale file system designed to provide a fault-tolerant and highly available file service for a virtually unlimited number of users. Data management in Gaston disseminates and stores replicas of files on multiple machines to achieve the requested level of data availability and uses a dynamic tree-topology structure to connect replication schema members. We present generic algorithms for replication schema creation and maintenance according to file user requirements and autonomous constraints that are set on individual nodes. We also show the specific data object structure as well as mechanisms for secure and efficient update propagation among replicas with data consistency control. Finally, we introduce a scalable and efficient technique improving the fault-tolerance of the tree-topology structure connecting replicas.

Keywords: peer-to-peer, data management, file system, replication, reliability, fault-tolerance, security, update propagation, availability, adaptation, multicasting
