Int. J. Internet Technology and Secured Transactions, Vol. 2, Nos. 1/2, 2010

Copyright © 2010 Inderscience Enterprises Ltd.

LH*RS P2P: a fast and high churn resistant scalable distributed data structure for P2P systems

Hanafi Yakouben* and Sahri Soror
CERIA Lab, Université Paris-Dauphine, Place Maréchal de Lattre de Tassigny, 75016 Paris Cedex, France
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

Abstract: LH*RS P2P is a new scalable distributed data structure (SDDS) for P2P applications. It deals with two major issues in P2P systems. One is the efficient location of the peers with the searched data records. The other is the protection against data unavailability due to the churn of peers. The LH*RS P2P properties permit reducing key search messaging to at most one forwarding message (hop). This is also the least number of worst-case hops for any SDDS known at present, and likely the least possible. The scheme provides in fact the fastest key search of any known SDDS. Also, a scan of the file requires at most two rounds. To deal efficiently with churn, LH*RS P2P reuses and expands the LH*RS parity management principles. As a result, the file transparently supports the unavailability or withdrawal of up to any k ≥ 1 peers, where k is a parameter that can scale dynamically with the file.

Keywords: scalable distributed data structure; SDDS; P2P system; linear hashing; LH.

Reference to this paper should be made as follows: Yakouben, H. and Soror, S. (2010) ‘LH*RS P2P: a fast and high churn resistant scalable distributed data structure for P2P systems’, Int. J. Internet Technology and Secured Transactions, Vol. 2, Nos. 1/2, pp.5–31.

Biographical notes: H. Yakouben has been a PhD student in Computer Science at the Université Paris-Dauphine, Centre d’Etudes et de Recherche en Informatique Appliquée (CERIA) laboratory, since October 2006. He obtained his Master’s degree in September 2006 at the same university.

S. Sahri received her PhD in Computer Science from Université Paris-Dauphine in 2006. She is currently a Researcher at the Université Paris-Dauphine, Centre d’Etudes et de Recherche en Informatique Appliquée (CERIA). During her PhD work, she proposed a new scalable distributed database system called SD-SQL Server. Her research areas also include scalable distributed data structures, web services and peer-to-peer systems.

1 Introduction

Recent years have seen the emergence of new architectures involving multiple computers. These are primarily popular workstations or PCs interconnected by high-speed networks. Such configurations offer quasi-unlimited cumulated storage and computing power. They become increasingly present in organisations due to rapidly falling hardware costs. New concepts were proposed for the design of such systems. Among the most popular are those of a multi-computer, or of a network of workstations, and, more recently, of peer-to-peer (P2P) and of grid computing.

The distributed nature of the new systems imposes new requirements on data files, not well satisfied by the traditional data structures. Scalable and distributed data structures (SDDSs) are a new class of data structures proposed specifically for this purpose. An SDDS distributes its data in buckets spread over the nodes of a multi-computer. These nodes can form a P2P or grid network. Some SDDS nodes are clients, interfacing to applications. Others are servers, storing the data in buckets and accessed only by clients. An SDDS file is distributed over the server nodes. The number of nodes dynamically scales with the file growth. The application accesses the file through a client node that makes the data distribution transparent. For scalability, the data address calculations do not involve any centralised directory. The data are also basically stored in distributed RAM, for faster access than with the traditional disk-based structures. A search or an insert of data can consequently be hundreds of times faster than a disk access (Bennour et al., 2000; Bennour, 2002).

Several SDDS schemes have been proposed. The most studied are the LH* schemes for hash partitioning and the RP* schemes for range partitioning. We also mention CHORD, presented in (Stoica et al., 2001), as well as BATON (Jagadish et al., 2005) and VBI-Tree (Jagadish et al., 2006), proposed for P2P architectures. Here, we are interested in the LH* SDDS and particularly in its variant LH*RS.

LH* creates scalable, distributed, hash-partitioned files. Each server stores the records in a bucket. The buckets split when the file grows. The splits follow the linear hashing (LH) principles (Litwin, 1980; Litwin, 1994). At times, an LH* server can become unavailable. It may fail as a result of a software or hardware failure. Either way, access to its data becomes impossible. The situation may not be acceptable for an application, limiting the utility of the LH* scheme. Data unavailability can be very costly.

To provide an efficient scalable availability scheme, LH*RS was proposed (Litwin and Schwarz, 2000). It structures the LH* data buckets into groups of size m, providing each with k ≥ 1 parity buckets. The value of k transparently grows with the file to offset the reliability decline. Only the number of storage nodes potentially limits the file growth. The high-availability management uses a novel parity calculus based on Reed-Solomon erasure correcting coding. The LH*RS parity records enable the application to get the values of data records stored on up to any k unavailable servers.

As in an LH* file, each LH*RS client caches its private image and applies LH to it. A split makes every existing image outdated. A client with an outdated image may direct a key-based request towards an incorrect address, instead of the correct one given by LH for the request. The addressee recognises its status from the received key and from its bucket level. The incorrectly addressed bucket then forwards the request to the bucket that is possibly the correct one. For LH*RS, it has been shown that a key search may need at most two forwarding messages (hops) to find the correct server, regardless of the growth of the file. This property makes the LH* scheme and its subsequent variants a very efficient tool for applications requiring fast growing and large files: distributed databases in general, warehousing, document repositories, e.g., for e-government, stream data repositories, etc.

To improve LH*RS performance by reducing its forwarding messages, we propose a new SDDS called LH*RS P2P. We designed it specifically for the P2P environment. It reduces the forwarding to at most one hop in such an environment. This result is probably impossible to improve, because ‘zero forwarding’ seems to require a centralised architecture. The LH*RS P2P scheme offers the same functional capabilities as LH*RS: key search, scan, record insert, update and delete. It also has a coordinator, which becomes a super-peer in this terminology. In contrast, while an LH*RS node was either a client, without data and free to go offline, or a server, supporting a file bucket, an LH*RS P2P node is always a peer, with both a client and a server component, carrying a data bucket or at least being a candidate for one. In consequence, LH*RS P2P peers strive to be always online. In this scenario, it becomes reasonable to assume that a server forwards from time to time some metadata to selected clients. It may always send such data to its local client, but also to a few remote ones. To improve the already excellent LH*RS P2P routing (typically, but not always, one hop), pushing information on bucket splits and merges appears especially useful. The analysis that follows shows how this decreases the worst case forwarding to a single message.

The rest of this paper is structured as follows. In Section 2, we discuss the related work. Section 3 presents our LH*RS P2P architecture. Next, we present the basic addressing algorithms in Section 4. Section 5 presents the LH*RS P2P file evolution. Then, Section 6 discusses the record manipulation. In Section 7, we detail the churn management. In Section 8, we present the parity encoding. Next, we present data decoding in Section 9 and experimental analysis in Section 10. Section 11 presents our current implementation. Finally, we show some variants of LH*RS P2P in Section 12 and summarise our contribution in Section 13.

2 Related work

We first overview the related work in P2P systems. Next, we discuss the work in the domain of SDDS. Finally, we recall principles of LH*RS.

2.1 P2P systems

In the early 2000s, P2P systems made their appearance and became a popular topic. In a P2P system, each participant node functions as both a data client and a server. The earliest P2P systems implemented file sharing and used flooding of all nodes for every search. To avoid the resulting message storm, structured P2P systems appeared, with additional data structures to make search more efficient, as shown in (Crainiceanu et al., 2004; Stoica et al., 2001). Structured P2P systems are in fact specific distributed data structures under a new brand name. Distributed hash table (DHT) based structures are the most popular, presented in (Devine, 1993; Gribble et al., 2000). The typical number of hops is O(log N), where N is the number of peers storing the file. Some peers may be super-peers performing additional coordination functions. However, with skewed data distributions, this approach may lead to very unevenly distributed workloads for the peers. In classical approaches to database indexing this problem is addressed by using balanced tree structures, like B*-trees. Such an approach applied to P2P environments might pose substantial problems in terms of coordination among peers. Among the works proposed to remedy this issue, we quote P-Grid (Aberer et al., 2003). P-Grid is a scalable data access structure resulting from the distribution of a binary prefix tree. It consists of binary search trees constructed with randomised algorithms such that the storage space required at each peer is balanced. As a consequence, the search trees will no longer be balanced if the data distribution is not uniform.

A typical P2P system suffers not only from the temporary unavailability of some of its constituent nodes, but also from churn, the continuous leaving and entering of machines into the system. For example, a P2P system (a.k.a. a desktop grid) such as Farsite, presented in (Bolosky et al., 2007) and currently in the process of being commercialised, uses the often underused, considerable desktop computing resources of a large organisation to build a distributed, scalable file server. Even though the participant nodes are under the control of the organisation, the natural cycle of replacing old systems with new ones will appear to the P2P storage system as random churn. In other words, at any time some peers leave the system and new peers join it. However, an application, whether a file server or a database, needs the availability of all its data, regardless of the fate of some participant nodes. Some redundancy of the stored data is then necessary.

Research has started addressing the high-availability needs of scalable disk farms (Xin et al., 2003). These should soon be necessary for grid computing and for very large internet databases. Some simple techniques are already in everyday use. There are also open research proposals for high-availability distributed data structures over large clusters specifically intended for internet access. One is a distributed hash table scheme with built-in specific replication (Gribble et al., 2000). There was also a research project with the goal of a scalable, distributed, highly-available linked B-tree (Boxwood Project, 2003).

Emerging P2P applications, including the Wi-Fi ones, also lead to compelling high-availability storage needs (Anderson and Kubiatowicz, 2002; Kubiatowicz, 2003; Dingledine et al., 2000). In this new environment, the availability of the nodes should be more ‘chaotic’ than typically supposed in the past. Their number and geographical spread should also often be orders of magnitude larger, possibly running in the near future into hundreds of thousands and soon reaching millions, spread worldwide. This thinking clearly shares some of the rationale for scalable distributed data structures.

2.2 Scalable distributed data structures

Within the database community, a new concept of SDDS appeared in 1992. The first scheme proposed was LH* (Litwin et al., 1993). Later on, SDDS principles were found useful as the basis for the already mentioned structured P2P schemes. Several SDDS schemes have been proposed (see Google). LH*RS was one of the most recent, responding to scalable high-availability needs (Litwin et al., 2005). The scheme uses for this purpose an original parity calculus embedded into the LH* SDDS structure. It is done over so-called groups of size n = 2^i of its application data buckets. The value of n is arbitrary, though in practice it should be something like 8–32. The scheme tolerates the unavailability of up to any k ≥ 1 servers per group. The value of k may scale dynamically. The parity calculus is based on an original variant of Reed-Solomon erasure correcting coding, developed for LH*RS. The storage overhead is small, on the order of k/n. These properties seem attractive for dealing not only with high-availability files and databases, but with a reasonable amount of churn as well. All this was our rationale for the LH*RS P2P design. We address the details of our scheme in Section 3 and further. Before that, we briefly overview the aspects of the LH*RS scheme that we reuse.


2.3 Overview of LH*RS

As we mentioned, the LH*RS scheme provides high availability to the LH* scheme (Litwin et al., 1993; Litwin et al., 1996; Kaarlson et al., 1996; Litwin et al., 1999). LH* itself is the scalable distributed generalisation of LH (Litwin, 1980). Below, we recall the LH*RS file structure and its architecture.

The LH*RS scheme is described in detail in (Litwin and Schwarz, 2000; Moussa and Litwin, 2002; Litwin et al., 2005). An LH*RS file is subdivided into groups. Each group is composed of m data buckets and k parity buckets (Figure 1). The data buckets store the data records, the parity buckets store the parity records. Every data record fills a rank r in its data bucket, as shown in Figure 1. A record group consists of all the data records with the same rank in a bucket group. The parity records are constructed from the data records having the same rank within the data buckets forming a bucket group. The record grouping has an impact on the data structure of a parity record: the latter keeps track of the data records it is computed from. The parity calculus uses an original variant of Reed-Solomon codes.

Figure 1 LH*RS file structure
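To make the grouping concrete, here is a minimal C++ sketch of the layout just described; the type and field names are ours, not the prototype's.

    #include <cstdint>
    #include <vector>

    // A bucket group holds m data buckets and k parity buckets. A record's rank r
    // is its position in its data bucket; the parity record of rank r is computed
    // from the data records of rank r across the group's data buckets.
    struct DataRecord {
        uint64_t key;
        std::vector<uint8_t> non_key;            // the non-key field, seen as GF symbols
    };

    struct ParityRecord {
        uint32_t rank;                           // rank r shared by the whole record group
        std::vector<uint64_t> keys;              // keys of the data records it is computed from
        std::vector<uint8_t> parity;             // parity of their non-key fields
    };

    struct BucketGroup {
        std::vector<std::vector<DataRecord>> data_buckets;      // m data buckets
        std::vector<std::vector<ParityRecord>> parity_buckets;  // k parity buckets
    };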

The file starts with one data bucket and k parity buckets. It scales up through data bucket splits, as the data buckets get overloaded. Each data bucket contains a maximum number of b records. The value of b is the bucket capacity. When the number of records within a data bucket exceeds b, the bucket reports to a special entity coordinating the splits, the coordinator. The latter designates a data bucket to split. During a data bucket split, half of the splitting data bucket's contents move to a newly created data bucket. As a consequence of the data record transfer, the data records remaining in the splitting data bucket and those moving get new ranks. The LH*RS architecture is conceived on top of the SDDS-2000 prototype, which originally supports the LH* and RP* structures. The LH*RS architecture comprises two principal components: SDDS clients and SDDS servers. The SDDS servers can be data servers or parity servers, as shown in Figure 2. The LH*RS components communicate via the TCP/IP and UDP protocols. Each component has an IP address and a listener/sender port. Both clients and servers accomplish their functionalities using threads. The clients send requests and receive their responses using three threads: an application thread, a UDP listener thread and a working thread. The server (bucket) functionalities are organised in two threads: a UDP listener thread and a working thread. Each bucket has a TCP port connection. Details are in (Moussa and Litwin, 2002; Litwin et al., 2005).


Figure 2 Prototype architecture of LH*RS

3 LH*RS P2P architecture

Every LH*RS P2P peer node has a client component (the client in what follows) that interacts with applications. It also has a server component that carries, or is waiting to carry, a data or a parity bucket, see Figure 3. Every peer joining the file is expected to store some data, as part of its service to the peer community. As for LH*RS, the server at a peer can be a data server, called simply a server in what follows, managing a data bucket. Likewise, it can be a parity server managing a parity bucket. Finally, a new peer can be a candidate peer. It acts as a client and serves as a hot spare for a bucket. This situation occurs when there is no pending storage need when the new peer appears. The client acts as an intermediary between an application and the data servers as in LH*RS. This includes the LH hash functions and the image adjustment messages (IAMs). It also has additional functions we detail below. The data servers are the only ones to interact with the parity servers, basically during updates. The data and parity servers behave basically as for LH*RS, with additional capabilities we present soon.

Figure 3 LH*RS P2P architecture overview


Every server peer informs its client component of each new split or merge. We will show how shortly. It also sends this information to the parity peers of its reliability group. Finally, it may send it to selected candidate peers. We then say that the peer is a tutor to its pupils. All the client components that receive the information adjust their images accordingly, using the usual LH* IAM algorithm. Merges are rarely implemented; we do not deal further with this case here. Finally, one peer acts as the coordinator through its coordinator component. This peer behaves like the LH*RS file coordinator, with additional capabilities we will show. The coordinator may eventually be replicated.
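A minimal sketch, in C++, of the state a peer keeps under this architecture; the names are ours and the prototype's actual classes surely differ.

    #include <cstdint>
    #include <optional>
    #include <string>
    #include <vector>

    struct FileImage { uint32_t i_prime = 0; uint64_t n_prime = 0; };  // client image (i', n')

    enum class ServerRole { Candidate, DataServer, ParityServer };

    struct Peer {
        // Client component: interfaces with applications, keeps a possibly outdated image.
        FileImage image;
        std::vector<std::string> known_addresses;  // physical (IP) addresses of buckets it knows

        // Server component: carries a data or parity bucket, or waits for one (candidate/spare).
        ServerRole role = ServerRole::Candidate;
        std::optional<uint64_t> bucket;            // data bucket address a, if any
        uint32_t bucket_level = 0;                 // bucket level j, if any

        // Tutoring: a server peer pushes split information to its pupils.
        std::vector<std::string> pupils;           // candidate peers this peer tutors
        std::optional<std::string> tutor;          // set while this peer is itself a pupil
        bool coordinator = false;                  // one peer also hosts the coordinator component
    };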

4 Records addressing

4.1 Global rule

The addressing scheme of LH*RS P2P is that of LH*, with differences we will now explain.

As usual, a record of an LH*RS P2P file is identified by its primary key. The key C determines the record location (the bucket number a = 0, 1, 2, …) according to the LH algorithm (Litwin et al., 1993):

Algorithm 1 LH*RS P2P global addressing rule

a ← h_i(C);
If a < n then a ← h_{i+1}(C).

We recall that (i, n) is the file state. Here, i = 0, 1, … is the file level. It determines the linear hash (LH) function h_i applied. The basic LH functions are h_i(C) = C mod 2^i. Likewise, n = 0, 1, … is the split pointer, indicating the next bucket to split.
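A minimal C++ sketch of Algorithm 1, with the basic LH function h_i(C) = C mod 2^i; the function names are ours. For instance, with the file state (i, n) = (1, 1) of Example 1 in Section 5, key 8 maps to bucket 0.

    #include <cstdint>

    // Basic LH function: h_i(C) = C mod 2^i.
    static uint64_t h(uint32_t i, uint64_t c) { return c & ((1ULL << i) - 1); }

    // Algorithm 1: the global addressing rule, applied to a file state (i, n)
    // or, on a peer, to its image (i', n').
    uint64_t lh_address(uint32_t i, uint64_t n, uint64_t c) {
        uint64_t a = h(i, c);
        if (a < n) a = h(i + 1, c);   // bucket a has already split at level i: rehash
        return a;                     // e.g., lh_address(1, 1, 8) == 0
    }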

4.2 Key-based addressing

As is the case for LH*RS, only the coordinator peer in the LH*RS P2P file always knows the file state. Any other peer uses its local image (i′, n′) of the file state for addressing. The image may be outdated, showing fewer buckets than actually exist in a growing file. (Since heavily shrinking files are infrequent in practice, bucket merges are rare or even not implemented.) The peer uses the image to find the location of a record given its key for a key-based query, or to scan all the records. We now review the key-based addressing. The next section deals with scans.

The primary location of a record identified by its key C is the bucket with the address a given by Algorithm 1. However, the peer applies Algorithm 1 to its client image only. It sends its key-based query Q accordingly to some bucket. Q may search for a record, insert it, update it or delete it. It always includes C. For reasons we explain soon, it also includes the image (i′, n′).

An outdated image could result in an address a′ < a. The peer then sends Q to an incorrect bucket. In every case, Q reaches the server component at the receiving peer a′. That server component starts with the following algorithm. It first verifies whether its own address is the correct one, by checking the guessed bucket level j′ derived from the received client image against its actual level j (the level of the LH function last used to split or create the bucket). It calculates j′ as i′ for a′ ≥ n′ and as i′ + 1 otherwise.


If needed, the server forwards Q. We will demonstrate later that Q always reaches the correct bucket a in this step. This is not true for LH* in general, nor for LH*RS in particular, which may need an additional hop. Finally, failure of the j′ test for forwarding implies (as we will show below) that the image was outdated because of a communication failure with the tutor, churn, or some other error. The addressee then returns the error information.

Algorithm 2 LH*RS P2P server key-based addressing

If a′ ≥ n′ then j′ = i′ else j′ = i′ + 1;
If j = j′ then process Q; exit;
Else if j – j′ = 1 then a ← h_j(C); if a > a′ then forward Q to bucket a; exit;
Else send the ‘erroneous image’ message to the sender;

If forwarding occurs, the new address a has to be the correct one. Hence the addressee does not perform the check of Algorithm 2. (This is not the case for forwarding in LH*RS.) As usual for an SDDS, it only sends the IAM. The IAM informs the sender that the initial address was incorrect. It includes the level j of the correct bucket a. In LH*RS P2P, it also has to include the level j of bucket a′. The sender then adjusts its image, reusing the LH* image adjustment algorithm:

Algorithm 3 LH*RS P2P client image adjustment

i′ = j; n′ = a′ + 1; If n′ = 2^{i′} then i′ = i′ + 1; n′ = 0;

As usual for LH*, the adjusted image is more accurate with respect to the file state. It takes into account at least one more split that happened since the image creation or its last adjustment. In particular, the addressing error that triggered the IAM cannot happen again.

Server peer a also has the physical (IP) addresses of buckets that the sender does not know about. These buckets are those beyond the last one in the received image (i′, n′), up to a, namely n′ + 2^{i′}, …, a. Server peer a attaches these IP addresses to the IAM.
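The following C++ sketch puts Algorithms 2 and 3 together: the server-side level check with at most one forward, and the client-side image adjustment on receiving the IAM. Message passing and the IP address list are omitted, and the handling of the case a = a′ after the rehash is our own reading of the algorithm.

    #include <cstdint>

    struct Image { uint32_t i; uint64_t n; };     // a client image (i', n')

    static uint64_t h(uint32_t i, uint64_t c) { return c & ((1ULL << i) - 1); }

    enum class Outcome { Process, Forward, ErroneousImage };

    // Algorithm 2, run by the server component of peer a_prime with bucket level j
    // on a query for key c carrying the sender's image img.
    Outcome server_check(uint64_t a_prime, uint32_t j, const Image& img, uint64_t c,
                         uint64_t& forward_to) {
        uint32_t j_guess = (a_prime >= img.n) ? img.i : img.i + 1;   // guessed level j'
        if (j == j_guess) return Outcome::Process;                   // the image was exact here
        if (j == j_guess + 1) {                                      // exactly one split behind
            uint64_t a = h(j, c);
            if (a > a_prime) { forward_to = a; return Outcome::Forward; }  // at most one hop
            return Outcome::Process;                                 // the rehash confirmed a_prime
        }
        return Outcome::ErroneousImage;                              // e.g., a peer back after churn
    }

    // Algorithm 3, run by the sender when the IAM arrives with bucket a_prime and a level j.
    void adjust_image(Image& img, uint64_t a_prime, uint32_t j) {
        img.i = j;
        img.n = a_prime + 1;
        if (img.n == (1ULL << img.i)) { img.i += 1; img.n = 0; }
    }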

4.3 Scan search

When a peer performs a scan S, it unicasts S to each bucket a in its image, namely to buckets 0, 1, …, n′ + 2^{i′} – 1. Every message contains j′. Every bucket that receives S verifies whether S needs to be forwarded, because of a bucket that it knows about but the originating peer did not. It executes the following algorithm:

Algorithm 4 LH*RS P2P server scan processing

If j = j′ + 1 then a′ = a + 2^{j–1}; forward S to peer a′; exit;
If j = j′ then process S; exit;
Else send the ‘erroneous image’ message to the sender;

The client peer may finally wish to execute the termination protocol, to ensure that all the addressees have received S. This protocol is the same as for LH*RS.
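A C++ sketch of the scan send and of Algorithm 4's server-side check; record selection and the termination protocol are left out, and the helper names are ours.

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Image { uint32_t i; uint64_t n; };   // the scanning peer's image (i', n')

    // The issuing peer unicasts S, with the guessed level j' of each addressee,
    // to every bucket 0 ... n' + 2^i' - 1 of its image.
    std::vector<std::pair<uint64_t, uint32_t>> scan_targets(const Image& img) {
        std::vector<std::pair<uint64_t, uint32_t>> targets;
        for (uint64_t a = 0; a < img.n + (1ULL << img.i); ++a)
            targets.emplace_back(a, a >= img.n ? img.i : img.i + 1);  // guessed level j' of bucket a
        return targets;
    }

    // Algorithm 4, run by bucket a with level j on receiving S carrying j'.
    // Returns the descendant to forward S to, if any; -1 means process S locally,
    // -2 means the sender's image is erroneous.
    long long scan_check(uint64_t a, uint32_t j, uint32_t j_guess) {
        if (j == j_guess + 1) return (long long)(a + (1ULL << (j - 1)));  // forward to the split-off bucket
        if (j == j_guess) return -1;                                      // process S; no forwarding needed
        return -2;                                                        // 'erroneous image' message
    }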


5 LH*RS P2P file expansion

5.1 Peer joins

A peer wishing to join the file contacts the coordinator. The coordinator adds the peer to its peer location tables and checks whether there is a pending request for bucket space. This is typically not the case. If the peer wishes to be a file client, it implicitly commits to host a data bucket. The coordinator declares the peer a candidate and chooses its tutor. The candidate becomes a pupil. To find the tutor, the coordinator uses the IP address of the candidate as if it were a key. Using Algorithm 1, it hashes the value and uses the result as the tutor’s address. The coordinator sends a message to the tutor, which in turn contacts the candidate. In particular, it sends its image, accompanied by the physical locations of the peers in the image. We recall from LH* principles that these are at least the locations of buckets 0, …, a + 2^{j–1} for a bucket a. The pupil stores the addresses, initialises its image, starts working as a client and acts as a spare, waiting for a bucket need.
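A sketch of the tutor choice in C++: the coordinator hashes the candidate's IP address as if it were a key and applies Algorithm 1 with the current file state. std::hash stands in for whatever key hashing the prototype actually uses.

    #include <cstdint>
    #include <functional>
    #include <string>

    static uint64_t h(uint32_t i, uint64_t c) { return c & ((1ULL << i) - 1); }

    // Returns the address of the (data) peer that will tutor the new candidate.
    uint64_t choose_tutor(const std::string& candidate_ip, uint32_t i, uint64_t n) {
        uint64_t key = std::hash<std::string>{}(candidate_ip);  // treat the IP address as a key
        uint64_t a = h(i, key);                                 // Algorithm 1 on the file state (i, n)
        if (a < n) a = h(i + 1, key);
        return a;
    }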

A split may necessitate finding a new parity bucket. The coordinator finds a peer willing to host it. This can be a candidate peer or a server peer. A candidate peer getting only a parity bucket remains a pupil until it gets a data bucket. Alternatively, as we have mentioned, there may be (philanthropic) peers willing to host parity buckets only, without being clients. A candidate, upon becoming a (data) server, informs its tutor that it is no longer its pupil. The tutor ceases monitoring it.

5.2 Peer splits

When a peer splits its data bucket a, it updates its local image, using the j value from before the split. If it has pupils, it sends them the information about the split, including j. The client at the peer and the pupils adjust their images using Algorithm 3. This time, however, the image adjusted by Algorithm 3 reflects the file state exactly.

With the request to split, the coordinator informs peer a not only of the physical address of the new bucket, but also of those of the buckets that have been created since the last split of the peer. The latter are buckets a + 2^{j–1}, …, a + 2^j – 1, where the j value is the one before the split. The peer is not aware of their existence if it did not get any IAM. A peer that is a tutor forwards the addresses to all its pupils. If the peer or a pupil got an IAM since the last split of the peer, it may already have some of these addresses, possibly even all. It may happen that there are two different addresses for the same bucket. This means that the bucket became unavailable and was recovered since the peer got its address. The peer uses the address sent by the coordinator.

The peer with the new data bucket initialises its image to the same values as in the after-image at the splitting peer. The splitting peer informs the new server of all the physical addresses of the predecessors of the new bucket. The images may remain unchanged until the next split of the tutor or until the first split of the newly created bucket. Alternatively, each image may get further adjusted by IAMs in the meantime.

When the splitting peer is a tutor, a pupil might receive the peer of the newly split-off bucket as its new tutor. This happens for a pupil whose address hashes to the new bucket after the split. If this happens, then the new tutor informs all its (new) pupils of the change of assignment. The scheme flexibly and uniformly spreads the tutoring load as more servers become available. The file can thus handle efficiently an increase in the number of pupils. As the file grows, we can expect this increase to occur. A peer getting a parity bucket during a split behaves like the LH*RS parity server.

Figure 4 LH*RS P2P peer split, (a) before, (b) after

Example 1

Assume the file with data buckets distributed over three peers, as shown in Figure 4(a). We neglect parity buckets. Peer addresses are 0, 1, 2 and the file state is i = 1 and n = 1. The images at the peers are those created with their buckets, for peers 1 and 2, or adjusted during the last split of bucket 0. The image of peer 1 is outdated. Peer 0 and peer 1 are tutors, each with one pupil, numbered here after their tutors. Peer 0 has reached its storage capacity. Peer 1 inserts record 8, i.e., with key C = 8. With its client image, it calculates 8 mod 2^1 = 0 and sends record 8 to address 0. Peer 0 executes Algorithm 2 and inserts the record. This creates an overflow and the peer contacts the coordinator. Given n = 1, the coordinator requests peer 1 to split. The split creates data bucket 3. Once the split completes, peer 1 adjusts its image to (2, 0). The coordinator does the same. Assume that pupil 0 gets bucket 3. It becomes (server) peer 3. It initialises its client image accordingly to (2, 0). It informs its tutor of its new status, and the tutor takes note that this peer is no longer its pupil. After this, the coordinator updates its image (i, n). The final structure is shown in Figure 4(b).

6 Records manipulation

We now prove the following basic properties of an LH*RS P2P file. They determine the access performance of the scheme, under the assumption that all the manipulated data buckets are available.

Property 1 The maximal number of forwarding messages for key-based addressing is one.

Property 2 The maximal number of rounds for the scan search is two.


Property 3 The worst case access performance of LH*RS P2P as defined by Property 1 is the fastest possible for any SDDS or a practical structured P2P addressing scheme.

Figure 5 Addressing the region with no forward

Proof of Property 1

Assume that peer a has the client image (i′, n′). Assume further that peer a did not receive any IAM since its last split, using h_{i+1}. Hence, we have i′ = j – 1 and n′ = a + 1. At the time of the split, this image was the file state (i, n). Any key-based addressing issued by peer a before the next file split, i.e., that of bucket a + 1, needs no forwarding, as shown in Figure 5. Let us suppose now that the file grew further. The first possible case with a possibility of forwarding is depicted in Figure 6. Some buckets with addresses beyond a have split, but the file level i did not change, i.e., the split pointer n did not come back to n = 0. We thus have n > n′ and i′ = i. The figure shows the corresponding addressing regions. Suppose now that peer a addresses key C using Algorithm 2. Let a′ be the address receiving C from a. If a′ is anywhere outside [n′, n), then the bucket level j must be the same as when peer a created its image. There cannot be any forwarding of C. Otherwise, a′ is the address of a bucket that split in the meantime. The split could move C to a new bucket beyond 2^{i′} – 1. One forwarding is thus possible and would be generated by Algorithm 2. But this bucket could not have split since the last split of bucket a that created its current image. No other forwarding can occur, and none is generated by the addressing scheme presented above.

The second case of addressing regions is illustrated by Figure 7. Here, the split pointer n came back to 0 and some splits occurred using h_{i+1}. However, peer a did not yet split again, i.e., n ≤ a. Peer a searches again for key C and sends the search to peer a′. There are three situations:

a We have n ≤ a′ ≤ a. A bucket in this interval did not split again yet. Hence the bucket level j(a′) is as it was when the image at peer a was created. There cannot be any forwarding, since j(a′) = i′ + 1 = j′ (recall that n′ = a + 1 > a′).

b We have a′ < n. Bucket a′ has split again, hence we may have a forwarding of C towards the bucket a′ + 2^{i′+1} that this split created (one of the buckets at the right side of Figure 7). That bucket could not have split yet again: bucket a would need to split first, since a < 2^{i′}. Hence, there cannot be a second forwarding of C.



Figure 6 Addressing the region with possibility of forward

Figure 7 Addressing regions in 2nd case

These are the only possible cases for LH*RS P2P. Hence, there cannot be more than one forwarding during any key-based addressing. If peer a got any IAM in the meantime, then its image could only have become more accurate with respect to the actual file state. That is, n′ moves closer to n, or i′ becomes the new i = i′ + 1 (the case of Figure 7). No additional forwarding is possible; only the single forwarding becomes less likely.

Proof of Property 2

Assume that peer a now issues a scan search. It sends it to every peer a′ it has in its image. The file situation can be as in Figure 5 or Figure 6 above. In each case, peer a′ could have split once. It recognises this case through Algorithm 4 and then forwards the scan to its descendant. These messages constitute the second round. No descendant could have split in turn; they lie beyond bucket a itself, which would need to split first. Hence, two rounds is the worst case for the LH*RS P2P scan.

Proof of Property 3

The only better bound is zero messages for any key addressing. The peer a that issues a key-based query can be any peer in the file. To reach zero forwards for any peer a would require to propagate synchronously the information on a split to every peer in the file. This would violate the goal of scalability, basic to any SDDS. The same restriction stands for structured P2P systems. Hence no SDDS, or practical structured P2P scheme can improve the worst case addressing performance than LH*RS

P2P.



7 Churn management

LH*RS P2P copes with churn through the LH*RS management of unavailable data and parity buckets. In more detail, it recovers up to k data or parity buckets found unavailable for any reason in a single bucket group. Globally, if K is the file availability level, K = 1, 2, …, then k = K or k = K – 1. In addition, LH*RS P2P allows a server or parity peer to quit with notice. The peer notifies the coordinator and stops dealing with incoming record manipulation requests. The data or parity bucket at the peer is transferred elsewhere. Every bucket in the bucket (reliability) group gets the new address. The quitting peer is finally notified of the success of the operation. The whole operation should be faster than a recovery, but the quitting peer now has to wait for it.

A pupil may always leave without notice. When its tutor, or the coordinator, does not receive a reply to a message, it simply drops the pupil from its data structures. A special case arises if a network failure or similar happenstance disconnects the pupil from its tutor, but the pupil does not discover this disconnection. It will therefore no longer receive any updates. Without the benefit of these updates, a query by this pupil is essentially an LH* query and might therefore take two forwarding messages.

A similar situation may happen to a server peer that, not aware of having been unavailable or having left, comes back and issues a query. Our solution for this situation does not fall back on LH* addressing, with its potentially two forwarding messages, but instead includes the value of j′ in each query (Section 4). The addressee executing Algorithm 2 tests in fact whether j – j′ ≤ 1. If not, the peer concludes that the query had to come from a peer that is not up to date, most likely because it was unavailable. It refuses the query and informs the sender about it. The sender in turn contacts the coordinator. The coordinator processes the sender as a new peer. The resent query will be dealt with as usual, with one forwarding at most. The peer may learn in the process that its data or parity bucket was recovered somewhere else as well.

It may finally happen that a peer appears unavailable for a while, e.g., because of a communication failure or an unplanned system shutdown, and that its data is recovered elsewhere. The new address is not posted to the other existing peers. When the peer becomes available again, another peer might not be aware of the recovery and send a search to the ‘former’ peer. To prevent this somewhat unlikely possibility of an erroneous response, we consider two types of search. The usual one, as in (Gribble et al., 2000) in particular, does not prevent the occurrence of this type of error. In contrast, the sure search, new to LH*RS P2P, prevents this scenario. A sure search query triggers a message to one of the parity servers of the bucket group. These have the actual addresses of all the data buckets in the group. The parity server getting the sure search informs the outdated peer about its (outdated) status. It also provides it with the correct peer address. The outdated peer avoids the incorrect reply by resending the query to the correct peer, which replies instead, with the IAM piggybacked. Again, only one hop is needed to complete the query.

8 Parity encoding

We now present the parity management used in our LH*RS P2P prototype for churn management. We adapted the LH*RS calculus for this purpose. We use an erasure correcting code (ECC). We thus use Galois field (GF) calculations based on a parity matrix, which is a submatrix of the generator matrix. When we insert a data record, or modify it, we encode the parity through the parity matrix and store the resulting parity data in the related parity records. To recover unavailable data or parity records, we decode some data and parity, using the inversion of a square submatrix of the generator matrix. We now overview this aspect of our scheme in more depth.

8.1 Galois field

The GF has 2^f elements, called symbols. Each symbol is a bit-string of length f. One symbol is the zero element, noted 0, consisting of f zero bits. Another is the one element, noted 1, with f – 1 bits 0 followed by bit 1. Symbols can be added (+), multiplied (⋅), subtracted (–) and divided (/). These operations in a GF possess the usual properties of their analogues in the field of real or complex numbers, including the properties of 0 and 1. As usual, we may omit the ‘⋅’ symbol.

The GFs GF(2^8) and GF(2^16) are potentially the most practical candidates for the parity calculus. For LH*RS P2P we use f = 16, following the results reported in (Litwin et al., 2005). We note our GF as F. The symbols of F are 2B words. F thus has 64K symbols, 0, 1, …, ffff in hexadecimal notation.

The addition and the subtraction in F are both the bit-wise XOR (exclusive-or) operation on 2B words. That is, a + b = a – b = b – a = a ⊕ b = a XOR b. This operation is widely available, e.g., as the ^ operator in C and Java, i.e., a XOR b = a ^ b. The multiplication and division are more complex operations. There are several methods for their calculus. We use a variant of the log/antilog table calculus, as in (Litwin and Schwarz, 2000; MacWilliams and Sloane, 1997).

The calculus exploits the existence in every GF of primitive elements. If α is primitive, then any element ξ ≠ 0 is α^i for some integer power i, 0 ≤ i < 2^f – 1. We call i the logarithm of ξ and write i = log_α(ξ). Likewise, ξ = α^i is then the antilogarithm of i, which we write as ξ = antilog(i). The successive powers α^i for any i, including i ≥ 2^f – 1, form a cyclic group of order 2^f – 1, with α^i = α^{i′} exactly if i′ = i mod (2^f – 1). Using the logarithms and the antilogarithms, we can calculate multiplication and division through the following formulae. They apply to symbols ξ, ψ ≠ 0. If one of the symbols is 0, then the product is obviously 0. The addition and subtraction in the formulae are the usual ones on integers:

$$\xi \cdot \psi = \mathrm{antilog}\big((\log \xi + \log \psi) \bmod (2^f - 1)\big)$$

$$\xi / \psi = \mathrm{antilog}\big((\log \xi - \log \psi + 2^f - 1) \bmod (2^f - 1)\big)$$

In order to implement these formulae, the symbols are stored as unsigned integers 0 to ffff. The logarithms and antilogarithms are also stored, in two arrays. The logarithm array log has 2^f entries. Its offsets are the symbols 0x0000 … 0xffff, and entry i contains log(i), an unsigned integer. Since element 0 has no logarithm, that entry is a dummy value such as 0xffffffff. The multiplication algorithm applies the antilogarithm to sums of logarithms modulo 2^f – 1. To avoid the modulus calculation, all possible sums of logarithms are used as offsets. The resulting antilog array then stores antilog[i] = antilog(i mod (2^f – 1)) for entries i = 0, 1, 2, …, 2(2^f – 2). This speeds up both encoding and decoding times. Figure 8 shows the resulting multiplication algorithm.

Figure 8 Galois field multiplication algorithm

    GFElement mult(GFElement left, GFElement right) {
        if (left == 0 || right == 0) return 0;
        return antilog[log[left] + log[right]];
    }
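The mult() of Figure 8 presupposes the log and antilog arrays. The C++ sketch below builds them for GF(2^8), using the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d, our choice for the example; the prototype's GF(2^16) tables are built the same way from a degree-16 primitive polynomial). The antilog table is doubled, as explained above, so mult() needs no modulo; the array is named logt here only to avoid the C library log().

    #include <cstdint>
    #include <cstdio>

    static const int F = 8;                      // symbol length f; the prototype uses f = 16
    static const int N = (1 << F) - 1;           // 2^f - 1 non-zero symbols
    static uint32_t logt[1 << F];                // 'log' array
    static uint32_t antilog[2 * N];              // doubled antilog array, offsets 0 .. 2(2^f - 2)

    static void build_tables() {
        uint32_t x = 1;                          // successive powers of the primitive element alpha
        for (int i = 0; i < N; ++i) {
            antilog[i] = x;
            logt[x] = (uint32_t)i;
            x <<= 1;
            if (x & (1u << F)) x ^= 0x11d;       // reduce modulo the primitive polynomial
        }
        for (int i = N; i < 2 * N; ++i)          // antilog[i] = antilog(i mod (2^f - 1))
            antilog[i] = antilog[i - N];
        logt[0] = 0xffffffff;                    // element 0 has no logarithm: dummy value
    }

    static uint32_t mult(uint32_t left, uint32_t right) {   // as in Figure 8
        if (left == 0 || right == 0) return 0;
        return antilog[logt[left] + logt[right]];
    }

    int main() {
        build_tables();
        std::printf("%02x\n", (unsigned)mult(0x49, 0x1a));  // one GF(2^8) product, in hexadecimal
        return 0;
    }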

8.2 Parity matrix

8.2.1 Parity calculus

The parity record contains the keys of the data records and the parity data of the non-key fields of the data records in a record group. We encode the parity data from the non-key data as follows. We number the data records in the record group 0, 1, …, m – 1. We represent the non-key field of data record j as a sequence a_{0,j}, a_{1,j}, a_{2,j}, … of symbols. We give all the records in the group the same length l, at least formally, by padding with zero symbols if necessary. If the record group does not contain m records, then we assume the virtual presence of enough dummy (zero, null…) records consisting of zeroes.

We then consider all the data records in the group as the columns of an l by m matrix A = (a_{i,j}). We also number the parity records in the record group 0, 1, …, k – 1. We write b_{0,j}, b_{1,j}, b_{2,j}, … for the B-field symbols of the jth parity record. We arrange the parity records also in a matrix with l rows and k columns, B = (b_{i,j}). Finally, we consider the parity matrix P = (p_{λ,μ}), a matrix of symbols p forming m rows and k columns. We address the construction of P in the next sections, especially Section 9. Its key property is the linear relationship between the non-key fields of the data records in the group and the non-key fields of the parity records:

$$A \cdot P = B$$

More in depth, each jth row of A is a vector a_j = (a_{j,0}, a_{j,1}, a_{j,2}, …, a_{j,m–1}) of the symbols with the same offset j in all the successive data records. Likewise, every jth line b_j of B contains the parity symbols with the same offset j in the successive parity records of the record group. The above relationship means that b_j = a_j P, i.e.:

$$b_{j,\lambda} = \sum_{v=0}^{m-1} a_{j,v} \cdot p_{v,\lambda} \qquad (1)$$

Thus, each parity symbol is the sum of m products of data symbols with the same offset times m coefficients of a column of the parity matrix.

The LH*RS P2P parity calculus does not use P directly. Instead, we use the logarithmic parity matrix Q with coefficients q_{i,j} = log_α(p_{i,j}). The implementation of equation (1) takes the form:

$$b_{t,\lambda} = \bigoplus_{v=0}^{m-1} \mathrm{antilog}\big(q_{v,\lambda} + \log a_{t,v}\big) \qquad (2)$$


Here, ⊕ designates XOR and antilog designates the calculus using our antilog table, which avoids the mod (2^f – 1) computation. Using (2) and Q instead of (1) and P clearly speeds up the encoding. While that is our actual approach, we continue to present the parity calculus in terms of P, for ease of presentation.
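A C++ sketch of the encoding of equation (2), under the same GF(2^8)/0x11d assumption as the previous sketch (the prototype uses GF(2^16)); a holds the data symbols of the record group and q the logarithmic parity matrix Q.

    #include <cstdint>
    #include <vector>

    static const int F = 8, N = (1 << F) - 1;
    static uint32_t logt[1 << F], antilog[2 * N];

    void build_tables() {                        // as in the previous sketch
        uint32_t x = 1;
        for (int i = 0; i < N; ++i) { antilog[i] = x; logt[x] = (uint32_t)i; x <<= 1; if (x & (1u << F)) x ^= 0x11d; }
        for (int i = N; i < 2 * N; ++i) antilog[i] = antilog[i - N];
    }

    // a[t][v]: symbol at offset t of data record v (v = 0..m-1; dummy records are all zeroes).
    // q[v][lambda]: logarithmic parity matrix Q, m rows by k columns.
    // Returns b[t][lambda], the parity symbols of the k parity records, per equation (2).
    std::vector<std::vector<uint8_t>> encode(const std::vector<std::vector<uint8_t>>& a,
                                             const std::vector<std::vector<uint32_t>>& q) {
        const size_t l = a.size(), m = q.size(), k = q[0].size();
        std::vector<std::vector<uint8_t>> b(l, std::vector<uint8_t>(k, 0));
        for (size_t t = 0; t < l; ++t)
            for (size_t lam = 0; lam < k; ++lam)
                for (size_t v = 0; v < m; ++v)
                    if (a[t][v] != 0)            // a zero data symbol contributes nothing
                        b[t][lam] ^= (uint8_t)antilog[q[v][lam] + logt[a[t][v]]];
        return b;
    }

Since the first column of Q is all zeroes (ones in P), the first parity bucket degenerates to the plain XOR of the data symbols, as in Example 2 below.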

8.2.2 Generic parity matrix

LH*RS P2P files may differ by their (bucket) group size m and availability level k. A smaller m speeds up the recovery time, but increases the storage overhead, and vice versa. The parity matrix P for a group needs m rows and k columns, with k = K or k = K – 1. Different LH*RS P2P files may in this way need different matrices P. We derive any such P from a single generic parity matrix P′ and from its logarithmic parity matrix Q′. The m′ and k′ dimensions of P′ and Q′ should be big enough for any application of our system. Any actual P and Q we use are then the m ≤ m′ by k ≤ k′ top left corners of P′ and of Q′. Their columns are derived dynamically when needed. Our P′ itself may reach m′ = 32K by k′ = 32K + 1. This allows for LH*RS P2P files with, in practice, arbitrarily large groups, e.g., with more than 128 buckets per group. Ultimately, even a very large file could consist of a single group, if such an approach would ever prove useful.

A specific optimisation of our P′ construction algorithm, and hence of any P we cut from it, is that the first column and the first row contain only coefficients 1. The column of ones allows us to calculate the records of the first parity bucket of a group using XOR only, as for the ‘traditional’ RAID-like parity calculus. Thus, if only one data bucket in a group has failed and the first parity bucket is available, then we can decode the unavailable records using XOR only. Otherwise, we also need GF multiplications, through Q in our case. The row of ones allows us to use XOR calculations exclusively for each first record of a record group. This speeds up the overall calculus with respect to known parity calculus schemes without this property.

Example 2

As in (Litwin et al., 2005), we consider for the sake of the example 1B symbols. We see them as elements of GF(2^8). We write them below as hexadecimal numbers. We cut our parity matrix P and the logarithmic parity matrix Q in Figure 9 from the generic parity matrix P′ shown in (Litwin et al., 2005). We number the data buckets in a bucket group D0…D3 and consider three parity buckets, Figure 10. We have one data record per bucket. The record in D0 has the D-field: ‘En arche en o logos …’. The other D-fields are ‘In the beginning was the word …’ in D1, ‘Au commencement était le mot …’ in D2, and ‘Am Anfang war das Wort …’ in D3. Assuming the ASCII coding, the D-fields translate to (hex) GF symbols in Figure 10(c), for example, ‘45 6e 20 61 72 63 68 …’ for the record in D0. We obtain the parity symbols in P0 from the vector a_0 = (45, 49, 41, 41) multiplied by P. The result b_0 = a_0 · P is (c, d2, d0). We calculate the first symbol of b_0 simply as 45 ⊕ 49 ⊕ 41 ⊕ 41 = c. This is the conventional parity, as in a RAID. The second symbol of b_0 is 45 · 1 ⊕ 49 · 1a ⊕ 41 · 3b ⊕ 41 · ff, and so on. Notice that we need no multiplication to calculate the first addend. We now insert these records one by one. Figure 10(a) shows the result of inserting the first record into D0. The record is in fact replicated at all parity buckets, since updates to the first bucket translate to XOR operations at the parity buckets, and there are not yet any other data records. Inserting the second record into D1, at Figure 10(b), still leads to an XOR operation only at P0, but involves GF multiplications using the respective P columns at the other parity buckets. Notice that, in the latter case, we operationally use Q columns only. Figure 10(c) shows the insert of the last two records. Finally, Figure 10(d) shows the result of changing the first record to ‘In the beginning was …’. Operationally, we first calculate the Δ-field at D0: (45 ⊕ 49, 6e ⊕ 6e, 20 ⊕ 20, 61 ⊕ 74, …) = (c, 0, 0, 15, …) and then forward it to all the parity buckets. Since this is an update to the first data bucket, we update the P-fields of all parity records by only XORing the Δ-field with the current contents.

Figure 9 Matrices P and Q derived from P′ and Q′ for a 3-available record group of size m = 4

$$P = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1a & 1c \\ 1 & 3b & 37 \\ 1 & ff & fd \end{pmatrix} \qquad Q = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 105 & 200 \\ 0 & 120 & 185 \\ 0 & 175 & 80 \end{pmatrix}$$

Figure 10 Example of parity updating calculus

9 Data decoding

9.1 Parity matrix

The decoding calculus uses the concept of a generator matrix. Let I be an m × m identity matrix and P a parity matrix. The generator matrix G for P is the concatenation I|P. We recall that we organise the data records in a matrix A. Let U denote the matrix A ⋅ G. U is the concatenation (A|B) of matrix A and matrix B from the previous section. We refer to each line u = (a_1, a_2, …, a_m, a_{m+1}, …, a_n) of U as a code word. The first m coordinates of u are the coordinates of the corresponding line vector a of A. We recall that these are the data symbols with the same offset in all the data records in the record group. The remaining k coordinates of u are the newly generated parity codes. A column u′ of U corresponds to an entire data or parity record.

A crucial property of G is that any m × m square submatrix H is invertible. We use this property for reconstructing up to k unavailable data or parity records. Consider first that we wish to recover only data records. We form a matrix H from any m columns of G that do not correspond to the unavailable records. Let S be A ⋅ H. The columns of S are the m available data and parity records we picked in order to form H. Using any matrix inversion algorithm, we compute H^{–1}. Since A ⋅ H = S, we have A = S ⋅ H^{–1}. We thus can decode all the data records in the record group, and hence in particular the up to k unavailable ones. In contrast, we cannot perform the decoding if more than k data or parity records are unavailable: we would not be able to form any square matrix H. In general, if there are unavailable parity records, we can decode the data records first and then re-encode the unavailable parity records. Alternatively, we may recover these records in a single pass.

We form the recovery matrix R = H^{–1} ⋅ G. Since S = A ⋅ H, we have A = S ⋅ H^{–1}, hence U = A ⋅ G = S ⋅ H^{–1} ⋅ G = S ⋅ R. Although the recovery matrix has m rows and n columns, we only need the columns of the unavailable data and parity records. The following example, based on the LH*RS discussion in (Litwin et al., 2005), illustrates this calculus.
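A C++ sketch of the recovery calculus: Gauss-Jordan inversion of H over the GF, then A = S · H^{-1} column by column. The same GF(2^8)/0x11d tables stand in for the prototype's GF(2^16); names and structure are ours. Call build_tables() before using the other functions.

    #include <cstdint>
    #include <utility>
    #include <vector>

    static const int F = 8, N = (1 << F) - 1;
    static uint32_t logt[1 << F], antilog[2 * N];

    void build_tables() {                        // log/antilog tables, as in Section 8.1
        uint32_t x = 1;
        for (int i = 0; i < N; ++i) { antilog[i] = x; logt[x] = (uint32_t)i; x <<= 1; if (x & (1u << F)) x ^= 0x11d; }
        for (int i = N; i < 2 * N; ++i) antilog[i] = antilog[i - N];
    }
    static uint8_t gf_mul(uint8_t a, uint8_t b) { return (a && b) ? (uint8_t)antilog[logt[a] + logt[b]] : 0; }
    static uint8_t gf_inv(uint8_t a) { return (uint8_t)antilog[N - logt[a]]; }   // a^-1 = alpha^(2^f - 1 - log a)

    using Matrix = std::vector<std::vector<uint8_t>>;

    // Gauss-Jordan inversion of an m x m matrix over the GF; any m columns of G
    // form an invertible H, so a pivot always exists.
    Matrix invert(Matrix h) {
        const size_t m = h.size();
        Matrix r(m, std::vector<uint8_t>(m, 0));
        for (size_t i = 0; i < m; ++i) r[i][i] = 1;
        for (size_t c = 0; c < m; ++c) {
            size_t p = c;
            while (h[p][c] == 0) ++p;                          // find a pivot row
            std::swap(h[p], h[c]); std::swap(r[p], r[c]);
            uint8_t f = gf_inv(h[c][c]);                       // scale the pivot row to 1
            for (size_t j = 0; j < m; ++j) { h[c][j] = gf_mul(h[c][j], f); r[c][j] = gf_mul(r[c][j], f); }
            for (size_t i = 0; i < m; ++i) {                   // eliminate the pivot column elsewhere
                if (i == c || h[i][c] == 0) continue;
                uint8_t g = h[i][c];
                for (size_t j = 0; j < m; ++j) { h[i][j] ^= gf_mul(g, h[c][j]); r[i][j] ^= gf_mul(g, r[c][j]); }
            }
        }
        return r;
    }

    // s: one row of S, i.e., the symbols at one offset of the m available records
    // used to build H. Column 'target' of H^-1 yields the symbol, at that offset,
    // of one unavailable data record (A = S * H^-1).
    uint8_t recover_symbol(const std::vector<uint8_t>& s, const Matrix& h_inv, size_t target) {
        uint8_t x = 0;
        for (size_t v = 0; v < s.size(); ++v) x ^= gf_mul(s[v], h_inv[v][target]);
        return x;
    }

With H and H^{-1} as in Figure 11 and s = (44, 5, fa, f2), recover_symbol(s, h_inv, 0) would reproduce the 49 of Example 3, provided the tables are built over the same GF(2^8) as used there (our 0x11d choice may differ).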

Example 3: Assume that the data buckets D0, D1 and D2 in Figure 10, in some LH*RS P2P file, are unavailable. We want to read a record in D0. We collect the columns of G for the available buckets D3, P0, P1 and P2 of Figure 10 in matrix H, Figure 11. We invert H, e.g., using the Gaussian calculus. The last column of H^{−1} is a unit vector, since the fourth data record is among the available ones. To reconstruct the first symbols simultaneously in each data bucket, we form a vector s from the first symbols in the available records of D3, P0, P1 and P2: s = (44, 5, fa, f2). This vector is the first row of S. To recover the first symbol in D0, we multiply s by the first column of H^{−1} and obtain 49 = 1 · 44 ⊕ 46 · 5 ⊕ 91 · fa ⊕ d6 · f2. Notice again that we actually use the logarithmic form of H^{−1}. We iterate over the other rows of S to obtain the other symbols in D0. If we were to read our record in D1, we would use S with the second column of H^{−1}.

Figure 11 Matrix for correcting erasure of buckets D0, D1, D2

        ( 0   1   1   1  )              ( 1   a7  a7  1 )
    H = ( 0   1  1a  1c  )       H−1 =  ( 46  7a  3d  0 )
        ( 0   1  3b  37  )              ( 91  c8  59  0 )
        ( 1   1  ff  fd  )              ( d6  b2  64  0 )

9.2 Constructing a generator matrix

The construction of a generic generator matrix G′ (Figure 13) is as follows. Note that the matrix G above is derived from G′. Let a_j, 0 ≤ j ≤ l − 1, be l elements of any field. We recall that, according to Vandermonde, the determinant of the l-by-l matrix that has the ith power of element a_j in row i and column j is:


    det( (a_j^i)_{0 ≤ i, j ≤ l−1} ) = ∏_{0 ≤ i < j ≤ l−1} (a_j − a_i)                (3)

If the elements a_i are all different, then the determinant is not zero and the matrix is invertible. We start constructing G′ by forming a matrix V with m′ rows and n + 1 columns, see Figure 12. The first n columns contain the successive powers of all the n different elements of the Galois field GF(n), starting with the element 0. The first column thus has a 1 in the first row and zeroes below. The final column consists of all zeroes except for a 1 in row m′ − 1. V is the extended Vandermonde matrix (MacWilliams and Sloane, 1997). It has the property that any submatrix S formed of m′ different columns is invertible. This follows from (3) if S does not contain the last column of V. If S contains the last column of V, then we can apply (3) to the submatrix of S obtained by removing its last row and the column inherited from the last column of V. This submatrix has, up to sign, the determinant of S and is invertible, so S is invertible.

Figure 12 An extended Vandermonde matrix V with m′ rows and n + 1 = 2^f + 1 columns

        ( 1   1            1            …   1               0 )
        ( 0   a_1          a_2          …   a_{n−1}         0 )
    V = ( 0   a_1^2        a_2^2        …   a_{n−1}^2       0 )
        ( ⋮   ⋮            ⋮                ⋮               ⋮ )
        ( 0   a_1^{m′−1}   a_2^{m′−1}   …   a_{n−1}^{m′−1}  1 )

Figure 13 Generic generator matrix G′

         ( 1  0  …  0 | 1   1            …   1            )
    G′ = ( 0  1  …  0 | 1   p_{1,1}      …   p_{1,n−m′}   )
         ( ⋮  ⋮  ⋱  ⋮ | ⋮   ⋮                ⋮            )
         ( 0  0  …  1 | 1   p_{m′−1,1}   …   p_{m′−1,n−m′})

Note: The left m′ columns form the identity matrix I. The P′ matrix follows, with its first column and first row consisting of ones.

We transform V into G′ (Figure 13) as follows. Let U be the m′ by m′ matrix formed by the leftmost m′ columns of V. We form an intermediate matrix W = U−1 ⋅ V. The leftmost m′ columns of W form the identity matrix, i.e., W already has the form W = I|R. If we pick any m′ columns of W and form a submatrix S, then S is the product U−1 ⋅ T, with T the submatrix of V picked from the same columns as S. Hence, S is invertible. If we transform W by multiplying a single column or a single row by a non-zero element, we retain the property that any m′ by m′ submatrix of the transformed matrix is still invertible. The coefficients w_{m′,j} of W located in the leftmost column of R (column m′ of W) are all non-zero. For, if this were not the case and w_{m′,j} = 0 for some index j, then the submatrix formed by the first m′ columns of W, with the sole exception of column j, together with the leftmost column of R, would have only zero coefficients in row j and hence be singular, a contradiction. We now transform W into the generic generator matrix G′, first by multiplying every row j by the inverse of w_{m′,j}. As a result of these multiplications, column m′ now contains only coefficients 1, but the left m′ columns no longer form the identity matrix. Second, we multiply every column j ∈ {0, …, m′ − 1} by w_{m′,j} to restore the identity matrix in these columns. Third, we multiply every column m′, …, n by the inverse of its coefficient in the first row. The resulting matrix now also has 1-entries in its first row. This is our generic generator matrix G′.
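As a complement, the following Python sketch carries out this construction, again over GF(2^8) to keep it small (the prototype works with GF(2^16)). The GF helpers and the names extended_vandermonde and generic_generator are ours; the sketch only illustrates the three steps just described and is not the prototype's code.

    def gf_mul(a: int, b: int, poly: int = 0x11d) -> int:
        """Multiplication in GF(2^8): carry-less product reduced modulo poly."""
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= poly
            b >>= 1
        return r

    def gf_inv(a: int) -> int:
        """Multiplicative inverse as a^(2^8 - 2); a must be non-zero."""
        r, base, e = 1, a, 254
        while e:
            if e & 1:
                r = gf_mul(r, base)
            base = gf_mul(base, base)
            e >>= 1
        return r

    def extended_vandermonde(m_: int, n: int = 256):
        """m_ x (n + 1) matrix V: column j < n holds the powers 0 .. m_-1 of
        field element j (element 0 yields the column 1, 0, ..., 0); the last
        column is 0, ..., 0, 1."""
        V = []
        for i in range(m_):
            row = []
            for j in range(n):
                p = 1
                for _ in range(i):          # p = j^i in GF(2^8)
                    p = gf_mul(p, j)
                row.append(p)
            row.append(1 if i == m_ - 1 else 0)
            V.append(row)
        return V

    def generic_generator(m_: int, n: int = 256):
        """Build G' = I|P' with an all-ones first parity column and first row."""
        W = extended_vandermonde(m_, n)
        # Row-reduce the leftmost m_ columns to the identity; this computes
        # W = U^-1 * V without forming U^-1 explicitly.
        for col in range(m_):
            piv = next(r for r in range(col, m_) if W[r][col])
            W[col], W[piv] = W[piv], W[col]
            f = gf_inv(W[col][col])
            W[col] = [gf_mul(x, f) for x in W[col]]
            for r in range(m_):
                if r != col and W[r][col]:
                    c = W[r][col]
                    W[r] = [x ^ gf_mul(c, y) for x, y in zip(W[r], W[col])]
        # First step: scale each row j by the inverse of w_{m',j};
        # column m_ becomes all ones.
        for j in range(m_):
            f = gf_inv(W[j][m_])
            W[j] = [gf_mul(x, f) for x in W[j]]
        # Second step: rescaling the left columns only touches the diagonal,
        # so we simply restore the identity there.
        for j in range(m_):
            W[j][j] = 1
        # Third step: scale every remaining column by the inverse of its
        # first-row entry, so the first row of P' becomes all ones as well.
        for c in range(m_, n + 1):
            f = gf_inv(W[0][c])
            for j in range(m_):
                W[j][c] = gf_mul(W[j][c], f)
        return W

    # For instance, generic_generator(4) yields a 4 x 257 matrix G' whose left
    # part is I and whose parity part has ones in its first column and first row.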

10 Experimental analysis

10.1 Bucket recovery times

We now discuss some experimental measures, following the results in (Litwin et al., 2005) for LH*RS, which let us predict the performance of LH*RSP2P for churn management. We basically analyse the recovery time per record or bucket. We then show the time per MB of data. The results predict good practical performance.

The experimental configuration consisted of P4 machines with 1 GB of RAM, running Windows Server 2000, over a 1 Gb/s Ethernet. The recovery manager was located at a parity bucket, not at a spare, for implementation-related reasons. The measurements concerned 4 data buckets and 1, 2, or 3 parity buckets. The group contained 125,000 = 4 * 31,250 data records, each consisting of a 4 B key and 100 B of non-key data. The experiments consisted of the reconstruction of 1, 2, and 3 ‘unavailable’ buckets. This is thus the case of unexpected unavailability, which should be the most frequent. The recovery manager loops conceptually over all the existing record groups. In fact, it recovers records by slices of a given size s. It requests s successive records from each of the m data/parity buckets and recovers the s record groups. Then, it requests the next s records from each bucket. While waiting for them, it sends the recovered slice to the spare(s).
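This slice-by-slice loop can be pictured with a minimal Python sketch. It only illustrates the pipelining just described: the buckets are modelled as in-memory lists, the record-group decoding is passed in as a callback, and the messaging to the spare(s) is left out; the names avail_buckets, decode_group and recover_by_slices are ours, not the prototype's.

    def recover_by_slices(avail_buckets, s, decode_group):
        """Recover record groups slice by slice.

        avail_buckets -- list of the m available data/parity buckets, each a
                         list of records in rank order (equal lengths assumed)
        s             -- slice size (number of record groups handled per round)
        decode_group  -- callback taking one record per available bucket and
                         returning the recovered records of that record group
        Yields one recovered slice at a time, e.g. to be shipped to the
        spare(s) while the next slice is being requested."""
        total = len(avail_buckets[0])
        for first in range(0, total, s):
            batch = [bucket[first:first + s] for bucket in avail_buckets]
            yield [decode_group(group) for group in zip(*batch)]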

Figure 14 presents the effect of the slice size on the recovery of a data bucket, in the sample case of using the first parity bucket with 1’s only and GF(2^16). We measured the total recovery time T, the processing time P, and the communication time C. The figure lists only s ≥ 100. Once the value of s is above a thousand, T drops under 1 s, and P and C under 0.5 s. All the times decrease slightly even further, while becoming quite constant once we choose s over 3,000. This is a consequence of our latest communication architecture that uses passive TCP connections; details are in (Moussa, 2004). The result also means that a server may efficiently work with buffers much smaller than the bucket capacity b, e.g., 10 times smaller.

Table 1 completes Figure 14 by listing the T, P, C times for the s values minimising T and k = 1, 2, 3. The difference between a T value and the related P + C is the thread synchronisation and switching time. For s ≥ 1,250, the differences to the times listed here were under 15% for 1-DB recovery, 5% for 2-DB recovery and 2% for 3-DBs. The first line of the table presents the recovery of a single data bucket (1-DB), using the XOR decoding only, as in Figure 14. The second line of the table shows 1-DB recovery using the RS decoding (with the XORing and multiplications). The XOR calculus proves notably faster for both GFs used. The gain was expected, but not its actual magnitude. P becomes indeed almost 1.5 times smaller for GF(2^16). T decreases less, given the incidence of the C value. That value is naturally rather stable and remains relatively important with respect to P, despite our fast 1 Gb/s network. For the RS decoding we have C > 0.5P at least. Even more interestingly, we reach C > P for the XOR decoding.

The numbers prove the efficiency of the LH*RSP2P bucket recovery mechanism. It takes only 0.667 s to recover 1 DB in our experiments, and less than 1.5 s to recover 3 DBs, i.e., 9.375 MB of data in three buckets. Notice that the growth of T appears sublinear with respect to the number of buckets recovered. The numbers also prove the advantages of using GF(2^16). It halves P for every recovery measured, except the one using XOR only. This was the rationale for the choice of this field for the basic LH*RS scheme, given also its encoding behaviour, in practice as good as that of GF(2^8). Notice that C in Table 1 increases more moderately than T as a function of the number of DBs recovered.

Figure 14 A single data bucket recovery time (milliseconds) as a function of the slice size s (see online version for colours)

[Chart plotting the total, process and communication times (msec) against the slice size]

The flat character of the charts in Figure 14 for larger values of s confirms the scalability of the scheme. It also allows us to estimate the recovery times for larger buckets. We can infer from the above numbers that we recover a data bucket group of size m = 4 from 1-unavailability at the rate (speed) of 5.89 MB/sec of data. Next, we recover two data buckets of the group at the rate of 7.43 MB/sec. Finally, we recover the group from 3-unavailability at the rate of 8.21 MB/sec.

If we have thus, for instance, 1 GB of data per bucket, the figures imply a value of T of about 170 sec for 1-DB recovery, 270 sec for 2 GB recovered, and 365 sec, about 6 min, for 3 GB recovered, respectively. If we choose the group size m = 8, to halve the storage overhead, the recovery rates will halve as well, while the recovery time will double, and so on.

Table 2 presents a single parity bucket recovery time, again for 31,250 records to recover and s = 31,250. For other values of s > 1,000, the times were about the same. The time T to recover the first parity bucket, using XOR only, analysed in the line noted PB (XOR), is faster than for the other buckets using the RS calculus. We observe again fast performance. The XOR-only recovery uses 2 B symbols, for GF(2^16). The picture reverses for the other parity buckets, as the RS line in the table shows. As for the data buckets, we can infer, for various values of m, the parity bucket recovery rates per MB of data stored, and the recovery times of parity buckets of various sizes.


Table 1 Best data bucket recovery times (milliseconds) and slice sizes

                          GF(2^16)
                  s        T       P     C
    1-DB (XOR)    6,250    667     260   360
    1-DB (RS)     6,250    828     385   303
    2-DBs         31,250   1,088   599   442
    3-DBs         15,625   1,468   906   495

Table 2 Parity bucket recovery times (seconds) for the slice size s = 31,250 records

                    GF(2^16)
                T       P       C
    PB (XOR)    2.062   1.484   0.322
    PB (RS)     2.103   1.531   0.322

10.2 Record search and insert

We measured the search time in a file of 125,000 records, distributed over four buckets and servers. A record had a 4 B key and 100 B of non-key data. We measured the individual and bulk search or insert times. The average individual search time was 0.24 ms. The bulk one was 0.06 ms, four times faster. The individual search is thus about 40 times faster than a single disk access, whereas the bulk one is about 200 times faster.

We timed a series of 10,000 inserts into an initially empty bucket of b = 10,000, thus avoiding the split. Again, a record consisted of a 4 B key and 100 B of non-key data. We recorded 0.29 ms for k = 0, 0.33 ms for k = 1 and 0.36 ms for k = 2.

The insert time allows evaluating the individual ‘sure search’ time. This one has to be at most that of an insert, and probably a little lower, since there is no parity calculus, only the verification with the group manager (but the messaging time should remain dominant). There is thus a penalty for the sure search, as compared to the usual one, but it should not be a sensitive one. For instance, it should be less than about 20% in the measured case (an insert with k = 0 took 0.29 ms, versus 0.24 ms for a usual individual search).

11 Implementation

Our current implementation reuses the existing LH*RS prototype software. We have augmented it with the original capabilities of the LH*RSP2P scheme. We thus added the capability of having the LH*RS client and data server on the same node, with the local adjustment of the client image during the split. We further added the tutoring capabilities to the LH*RS servers and to the coordinator. We extended the client with the capability of being a pupil. All these extensions required the design and implementation of the related original messaging protocol. The sure search remains to be added in the near future. In more depth, we first ported the reused parts of the LH*RS prototype from Windows Server 2000 to Windows Server 2003 and its environment. Then, we did some code evolution to integrate our LH*RSP2P functions. For the peer–pupil communication, we use the Windows Sockets API and the TCP/IP protocol. The LH*RSP2P server communicates with its local client using the local procedure call (LPC) facility. In this case, each peer communicates through the creation of a port using the NtCreatePort() function. The name of the port is published and first used by the server and its local client. Then any process, i.e., any peer, can send connection requests on this port and get a port handle for communication as well.

A specific issue is the k-availability of the tutoring data at a server peer. Our approach is to create dedicated LH*RSP2P records within each data and parity bucket, called the tutoring records, with a (unique) key and the pupils’ addresses in the non-key part of the tutoring data record. These addresses may get split during a bucket split. The keys of the tutoring records form a subset of the key space that is forbidden to applications. For instance, for the integer key space 0…2^32 – 1, the key for the tutoring record at data bucket l could be 2^32 – l – 1. A 1M-node file would need about 0.5/1,000 of the key space for tutoring records. All the tutoring records have furthermore the same rank r, which is r = 0 at present. We recall that r is the key of all the k related parity records under the LH*RS scheme, with our version of Reed-Solomon-based erasure correcting encoding. The tutoring records are thus recovered together with any data or parity records under LH*RS.
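The reserved-key convention is simple enough to show in a one-line sketch; the function below only illustrates it, following the example above, and its name and the 32-bit default are ours.

    def tutoring_record_key(bucket_no: int, key_bits: int = 32) -> int:
        """Reserved key of the tutoring record stored in data bucket
        `bucket_no`: the topmost keys 2^key_bits - l - 1 of the integer key
        space are forbidden to applications and used only for tutoring data."""
        return (1 << key_bits) - bucket_no - 1

    # E.g., tutoring_record_key(0) == 2**32 - 1, tutoring_record_key(1) == 2**32 - 2, etc.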

12 Variants

Up to now we have presented the basic version of LH*RSP2P. The first interesting variant is one where a peer joins the file only when the coordinator creates its data bucket. Then, every peer is a server from the beginning. Some also have parity buckets. There are no candidate peers, hence no need for tutoring.

Another simplified variant uses an LH*RS file, thus perhaps with (pure) clients and servers, extended only with the basic capabilities of the LH*RSP2P peer nodes. A server node thus becomes able to also host the client component, and both become able to adjust the client image during the split. Optionally, the server component is able to perform the sure search. In contrast, there is no tutoring. The peers are rewarded with the faster worst-case addressing performance. The (LH*) clients continue with the slower, basic LH* performance, which is – we recall – at most two messages per key-based addressing and possibly several rounds per scan.

We can also improve the accuracy of the client image after an IAM. The server peer then sends the image of its client component, in lieu of the j value of its server component as in the basic scheme. This image may be more accurate, because of the IAMs received by the sender, than what the client may guess from j only. The client then simply takes this image instead of its own. The received image must be at least as accurate. This variant requires a tighter integration of the client and server components at a peer and derives more advantages from the existence of peers. We chose not to implement it in order to reuse our existing LH*RS prototype. The metadata of each peer component thus remains internal to the component, and the server component does not have access to the image of its local client when composing an IAM.

In the basic scheme, the coordinator sends 2^(j–1) – 1 physical addresses of buckets to a peer with a split request. This can be a large number, and the peer might already have received most or even all of these addresses through IAMs. In the integrated variant just mentioned, we could choose to let the splitting peer request only the missing bucket addresses from the coordinator. If (i″, n″) is the before-image at the client of the splitting peer, then the only missing addresses are those of buckets n″ + 2^i″ … a + 2^(j–1) – 1. However, in this version, the peer will not receive any addresses of recovered buckets. The practical implications of both variants remain for further study.

Yet another interesting variant concerns parity management. Ideally, each peer should uniformly provide data access, data storage, and parity management. This is basically not the case for LH*RSP2P. A peer may indeed support only a parity bucket, only a data bucket (the usual case), or both data and parity buckets. The latter situation is that of about k/m peers. A parity bucket has at least as many records as the biggest data bucket in the bucket group. Next, an update to any data bucket in the bucket group results in an update to the parity bucket. The peer with a parity bucket and, especially, with both buckets should thus typically be more loaded, with respect to its storage or processing capabilities, than the one with a data bucket only. A variant of LH*RS intended to get rid of these shortcomings is sketched in (Litwin et al., 2008). A more in-depth analysis, described in what follows, leads us to the following variant. We decluster the parity records on basically all the peers. More precisely, all peers, except perhaps the last m – 1, typically carry the parity and data records uniformly.

For this purpose, we number the peers 0, 1, ... according to the data buckets they carry or should carry. Each peer is ready to carry both data and parity records in respective buckets. We continue to group the peers. Peer n belongs to group g = n \ m, where ‘\’ denotes the integer division. We decluster the parity records of the data records in every even group g on all the peers of the (odd) group g′ = g + 1. In turn, we decluster the parity records of group g′ into the parity buckets on all the peers of group g. To uniformly distribute the parity records, record groups with successive ranks map to different buckets, unlike for LH*RS, we recall. More precisely, the k parity records of a group with rank r = 1 map to the 1st, 2nd, … k-th peer of group g′ or g. Rank r = 2 maps to the 2nd, 3rd, … (k + 1)-th (mod m) peer.

In general thus, if p0 … pm–1 are the successive peers within the group that stores the parity records declustered there, then the k records of record group r = 1, 2, … go successively to peers p((r–1) mod m), p(r mod m), …, p((r+k–2) mod m).
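This placement rule translates into a few lines of Python. The sketch below only illustrates the mapping just described, assuming, as above, that peers are numbered like the data buckets they carry; the function name and signature are ours.

    def parity_peers(data_bucket: int, rank: int, m: int, k: int) -> list[int]:
        """Peers holding the k parity records of the record group with rank
        `rank`, whose data records live in the bucket group of `data_bucket`.
        Even groups decluster their parity on the next (odd) group, odd groups
        on the previous one; rank r starts at peer p_((r-1) mod m)."""
        g = data_bucket // m                      # bucket (peer) group of the data records
        target = g + 1 if g % 2 == 0 else g - 1   # group carrying the parity records
        first = (rank - 1) % m                    # offset of the first parity peer
        return [target * m + (first + i) % m for i in range(k)]

    # Example with m = 4 and k = 2, as in Figure 15: the record group of rank 1
    # in group 0 has its parity records on peers 4 and 5, rank 2 on peers 5 and 6.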

Figure 15 illustrates this architecture. Initially, the file consists of peer 0, carrying a data bucket, and of all the peers in group 1, each carrying only a parity bucket. These buckets carry the parity records of the record groups in group 0, as described above.

The splits progressively extend the peer (data bucket) group by data buckets 1,…m – 1. Figure 15(a) illustrates the situation in the file with the data buckets on peers 0 and 1.

Once peer m gets its data bucket, group 1 starts getting data records. The parity records start filling the parity buckets at group 0. Figure 15(b) illustrates this situation. It lasts till a split creates the first data bucket of group 2, i.e., bucket 8 at peer 8 in the figure. The peers of group 3 then join the file as well. They start to carry the parity records of group 2. Figure 15(c) finally shows this phase of the file evolution. Later on, when the splits start group 4, the parity buckets at group 3 will start to fill, etc.

The scheme obviously ensures that the parity records of a record group cannot be stored on the same node as some data records of the group, provided that the number of parity records k is less than the maximum number of records m in a record group. The declustering provided by the scheme potentially balances the storage and processing load better than the basic LH*RSP2P. It also potentially speeds up the recovery up to m times. The recovery may indeed proceed in parallel on m peers, with each peer dealing with 1/m of the bucket(s) to recover. If we have thus, for instance, again 1 GB of data per bucket and m = 4 as in our experiments, we end up with T possibly under 43 sec for 1-DB recovery, under 70 sec for 2 GB recovered, and roughly 90 sec for 3 GB recovered, respectively.

Figure 15 LH*RSP2P with declustered parity management (m = 4 and k = 2)


13 Conclusions

We have intended the LH*RSP2P scheme for P2P files, where every node both uses the data and serves its storage for them, or is at least willing to serve the storage when needed. Under this assumption, it is the fastest SDDS and P2P addressing scheme known. It should in particular protect the file efficiently against churn. Current work consists in the implementation of our scheme, including the sure search and the k-available tutoring functions. We plan to evolve the prototype to the variant with the declustered parity management. We also continue with the in-depth performance analysis.

Acknowledgements

The work on LH*RSP2P was partly supported by the EEC eGov-Bus IST project, number FP6-IST-4-026727-STP. The declustered parity architecture follows discussions with Prof. W. Litwin and Prof. T. Schwarz.

References

Aberer, K., Cudré-Mauroux, P., Datta, A., Despotovic, Z., Hauswirth, M., Punceva, M. and Schmidt, R. (2003) ‘P-Grid: a self-organizing structured P2P system’, SIGMOD Record, Vol. 32, No. 3, pp.29–33.

Anderson, D. and Kubiatowicz, J. (2002) ‘The worldwide computer’, in Scientific American, Vol. 286, No. 3, March 2002.

Bennour, F., Diène, A., Ndiaye, Y. and Litwin, W. (2000) ‘Scalable and distributed linear hashing LH*LH under Windows NT’, in Systemics, Cybernetics, and Informatics, Orlando, Florida.

Bennour, F. (2002) ‘Performance of the SDDS LH*LH under SDDS-2000’, in Distributed Data and Structures 4 (Proceedings of WDAS), Carleton Scientific, pp.1–12.

Bolosky, W.J., Douceur, J.R. and Howell, J. (2007) ‘The Farsite Project: a retrospective’, Operating Systems Review, pp.17–26.

Boxwood Project (2003) Available at http://research.microsoft.com/research/sv/Boxwood/.

Crainiceanu, A., Linga, P., Gehrke, J. and Shanmugasundaram, J. (2004) ‘Querying peer-to-peer networks using p-trees’, in Proceedings of the Seventh International Workshop on the Web and Databases (WebDB), Paris, France.

Devine, R. (1993) ‘Design and implementation of DDH: a distributed dynamic hashing algorithm’, Proc. of the 4th Intl. Conf. on Foundations of Data Organization and Algorithms (FODO).

Dingledine, R., Freedman, M. and Molnar, D. (2000) ‘The Free Haven Project: distributed anonymous storage service’, Workshop on Design Issues in Anonymity and Unobservability, July 2000.

Gribble, S., Brewer, E.A., Hellerstein, J. and Culler, D. (2000) ‘Scalable, distributed data structures for internet service construction’, 4th Symp. on Operating Systems Design and Implementation (OSDI 2000).

Jagadish, H.V., Ooi, B.C. and Vu, Q.H. (2005) ‘BATON: a balanced tree structure for peer-to-peer networks’, VLDB ’05: Proceedings of the 31st International Conference on Very Large Data Bases.

Jagadish, H.V., Ooi, B.C., Vu, Q.H., Zhang, R. and Zhou, A. (2006) ‘VBI-Tree: a peer-to-peer framework for supporting multi-dimensional indexing schemes’, ICDE ’06: Proceedings of the 22nd International Conference on Data Engineering.

Karlsson, J., Litwin, W. and Risch, T. (1996) ‘LH*LH: a scalable high performance data structure for switched multicomputers’, in Apers, P., Gardarin, G. and Bouzeghoub, M. (Eds.), Extending Database Technology, EDBT96, Lecture Notes in Computer Science, Vol. 1057, Springer Verlag.

Kubiatowicz, J. (2003) ‘Extracting guarantees from chaos’, in Communications of the ACM, Vol. 46, No. 2, February 2003.

Litwin, W. (1980) ‘Linear hashing: a new algorithm for files and tables addressing’, International Conference on Databases, Aberdeen, Heyden, pp.260–275.

Litwin, W., Neimat, M-A. and Schneider, D. (1993) ‘LH*: linear hashing for distributed files’, ACM-SIGMOD Int. Conf. on Management of Data.

Litwin, W. (1994) ‘Linear hashing: a new tool for file and table addressing’, reprinted from VLDB80 in Readings in Databases, edited by M. Stonebraker, 2nd edn., Morgan Kaufmann Publishers.

Litwin, W., Neimat, M-A. and Schneider, D. (1996) ‘LH*: a scalable distributed data structure’, ACM-TODS.

Litwin, W., Menon, J., Risch, T. and Schwarz, T. (1999) ‘Design issues for scalable availability LH* schemes with record grouping’, DIMACS Workshop on Distributed Data and Structures, Princeton U., Carleton Scientific.

Litwin, W. and Schwarz, T. (2000) ‘LH*RS: a high-availability scalable distributed data structure using Reed Solomon codes’, ACM-SIGMOD International Conference on Management of Data.

Litwin, W., Moussa, R. and Schwarz, T. (2005) ‘LH*RS – a highly available scalable distributed data structure’, ACM-TODS.

Litwin, W., Yakouben, H. and Schwarz, T. (2008) ‘LH*RSP2P: a scalable distributed data structure for P2P environment’, Proc. of the 8th International Conference on New Technologies in Distributed Systems (NOTERE ’08), Lyon, France.

MacWilliams, F.J. and Sloane, N.J.A. (1997) The Theory of Error Correcting Codes, Elsevier/North Holland, Amsterdam.

Moussa, R. and Litwin, W. (2002) ‘Experimental performance analysis of LH*RS parity management’, Distributed Data and Structures 4, Records of the 4th International Meeting (WDAS 2002), Paris, France.

Moussa, R. (2004) ‘Experimental performance analysis of LH*RS’, CERIA Res. Rep. [CERIA].

Stoica, I., Morris, R., Karger, D., Kaashoek, F. and Balakrishnan, H. (2001) ‘Chord: a scalable peer-to-peer lookup service for internet applications’, SIGCOMM ’01, San Diego, California, USA.

Xin, Q., Miller, E., Schwarz, T., Brandt, S., Long, D. and Litwin, W. (2003) ‘Reliability mechanisms for very large storage systems’, 20th IEEE Mass Storage Systems and Technologies (MSST 2003), San Diego, CA, pp.146–156.

