Date post: 29-Dec-2015
Pond: the OceanStore Prototype Sean Rhea, Patrick Eaton, Dennis Geels, Hakim Weatherspoon, Ben Zhao, and John Kubiatowicz University of California, Berkeley Proc. of the 2nd USENIX Conf. on File and Storage Technologies (FAST '03) Presented by Park, Seon-Yeong

Pond: the OceanStore Prototype

Sean Rhea, Patrick Eaton, Dennis Geels,

Hakim Weatherspoon, Ben Zhao, and John Kubiatowicz

University of California, Berkeley

Proc. of the 2nd USENIX Conf. on File and Storage Technologies (FAST '03)

Presented by Park, Seon-Yeong

2/26

Ubiquitous Computing

< Figure. Devices (Telephone, SPO Watch, PDA, Cell Phone, Digital TV, PC) and a Storage Pool >

3/26

OceanStore Overview

Internet-scale, Cooperative File System

Application

Calendars, Email, Contact Lists, Large Digital Libraries, Repositories for Scientific Data, Distributed Design Tools, etc.

Requirements

Universal Availability

Durability

Understandable Consistency Model

Privacy vs. Information Sharing

4/26

Data Model (1/2)

Data Object

Similar to a File in a Traditional File System

Named by an Active Globally-Unique Identifier (AGUID)

– Location Independent

– Prevents Name Space Collisions

AGUID = SHA-1(Application-specified Name + Owner's Public Key)
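
The AGUID construction on this slide can be sketched in a few lines. This is a minimal illustration, assuming the name and the key are simply concatenated as bytes before hashing; the function name `compute_aguid` and the exact byte layout are hypothetical, not taken from the Pond implementation.

```python
import hashlib


def compute_aguid(app_name: str, owner_public_key: bytes) -> str:
    """Return a hex AGUID: SHA-1 over the name followed by the owner's key.

    The concatenation order and encoding are assumptions made for this sketch.
    """
    digest = hashlib.sha1()
    digest.update(app_name.encode("utf-8"))
    digest.update(owner_public_key)
    return digest.hexdigest()


# Two owners using the same application-level name get different AGUIDs,
# which is how the owner's key prevents name space collisions.
alice = compute_aguid("calendar", b"alice-public-key-bytes")
bob = compute_aguid("calendar", b"bob-public-key-bytes")
```

Because the identifier depends only on the name and the key, it says nothing about where the object is stored, which matches the "Location Independent" property above.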

5/26

Data Model (2/2)

Data Object

Sequence of Read-only Versions

Block Reference

– Cryptographically-secure Hash of Child Block's Contents

< Structure of Data Object >
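
A rough sketch of how hash-based block references make versions read-only and allow unchanged blocks to be shared between versions. The helper names (`block_ref`, `build_version`), the fan-out of 2, and the newline-joined parent format are assumptions for illustration only; the slide does not specify the block layout.

```python
import hashlib


def block_ref(block: bytes) -> str:
    """A block reference is the secure hash of the block's contents."""
    return hashlib.sha1(block).hexdigest()


def build_version(data_blocks: list[bytes], fanout: int = 2) -> tuple[str, dict[str, bytes]]:
    """Build a tree of blocks bottom-up and return (root reference, block store).

    Each parent block holds only the references of its children, so changing
    any data block changes every reference on the path up to the root.
    """
    if not data_blocks or fanout < 2:
        raise ValueError("need at least one data block and fanout >= 2")
    store: dict[str, bytes] = {}
    level = []
    for block in data_blocks:
        ref = block_ref(block)
        store[ref] = block
        level.append(ref)
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            parent = "\n".join(level[i:i + fanout]).encode("ascii")
            ref = block_ref(parent)
            store[ref] = parent
            parents.append(ref)
        level = parents
    return level[0], store


# Version 2 changes only the last data block.
root_v1, store_v1 = build_version([b"d1", b"d2", b"d3", b"d4"])
root_v2, store_v2 = build_version([b"d1", b"d2", b"d3", b"d4-modified"])
shared = set(store_v1) & set(store_v2)  # blocks both versions can reuse
```

In this sketch the two versions have different root references but share every block that did not change, so a new version only needs to store the modified path.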

6/26

Underlying Technology

Access Control

Data Update

Primary Replica

Archival Storage

Secondary Replica

Data Read

Data Location & Routing: Tapestry

7/26

Access Control

Reader Restriction

Encrypt All Data

Distribute Encryption Key to Users with Read Permission

Writer Restriction

Access Control List (ACL) for an Object

All Writes Must be Signed so that Well-behaved Servers and Clients Can Verify Them against the ACL

9/26

Data Update (1/2)

Update

Adding a New Version to the Head of the Version Stream

Array of Potential Actions, each Guarded by a Predicate

– Predicate Examples

• Checking the Latest Version Number, Comparing a Region of Bytes to an Expected Value, etc.

– Action Examples

• Replacing a Set of Bytes, Appending New Data, Truncating the Object, etc.

< Update Message Format >

Timestamp | Client ID | <Predicate 1, Action 1> | <Predicate 2, Action 2> | . . . | <Predicate N, Action N> | Client Signature
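
The predicate/action structure can be sketched as follows. This assumes that predicates are evaluated in order and that the action of the first predicate that holds is applied, with the update aborting if none holds; that evaluation rule, the `Update` class, and `apply_update` are assumptions for this sketch, and signature checking is not modelled.

```python
from dataclasses import dataclass, field
from typing import Callable

Predicate = Callable[[bytes, int], bool]  # (latest contents, latest version number)
Action = Callable[[bytes], bytes]         # latest contents -> new contents


@dataclass
class Update:
    timestamp: float
    client_id: str
    clauses: list = field(default_factory=list)  # ordered (predicate, action) pairs
    signature: bytes = b""                       # placeholder only, never verified here


def apply_update(versions: list, update: Update) -> bool:
    """Append a new version if some predicate holds; otherwise change nothing."""
    current = versions[-1]
    version_num = len(versions) - 1
    for predicate, action in update.clauses:
        if predicate(current, version_num):
            versions.append(action(current))  # old versions stay read-only
            return True
    return False


versions = [b"hello"]
append_if_v0 = Update(
    timestamp=1.0,
    client_id="client-1",
    clauses=[(lambda data, num: num == 0, lambda data: data + b" world")],
)
first = apply_update(versions, append_if_v0)   # predicate holds at version 0
second = apply_update(versions, append_if_v0)  # latest version is now 1, so no clause fires
```

Guarding the append with a version-number check is what lets a client detect that someone else updated the object first.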

10/26

Data Update (2/2)

< OceanStore Update Path: Application, Primary Replica (Inner Ring), Archival Storages, Secondary Replicas >

11/26

Primary Replica

Inner Ring

A Set of Servers that Implement an Object's Primary Replica

Applies Updates and Creates New Versions

– Serialization

– Access Control

– Create Archival Fragments

Update Agreements

– Byzantine Agreement Protocol

• Distributed Decision Process in which All Non-faulty Participants Reach the Same Decision

• A Group of Size 3f+1 Tolerates no more than f Faulty Servers
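
The 3f+1 bound is easy to turn into arithmetic. The two helper names below are hypothetical; they only restate the relation between inner-ring size and the number of faulty servers it can tolerate.

```python
def max_faulty(group_size: int) -> int:
    """Largest f such that 3f + 1 <= group_size."""
    if group_size < 1:
        raise ValueError("group_size must be positive")
    return (group_size - 1) // 3


def min_group_size(faulty: int) -> int:
    """Smallest inner ring that tolerates `faulty` Byzantine servers."""
    if faulty < 0:
        raise ValueError("faulty must be non-negative")
    return 3 * faulty + 1


ring_of_four = max_faulty(4)    # 1
ring_of_six = max_faulty(6)     # still 1: two extra servers do not help
ring_of_seven = max_faulty(7)   # 2
```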

12/26

Archival Storage

Simple Replication

Tolerates One Failure for an Additional 100% Storage Cost

Erasure Codes

Efficient and Stable Storage for Archival Copies

Storage Cost Increases by a Factor of N/M

Original Block can be Reconstructed from Any M Fragments

< Figure. A Block is Encoded by an Erasure Code into Fragments 1 . . . N; Any M Fragments (M < N) Reconstruct the Block >
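
To make the "any M of N fragments" property concrete, here is a toy erasure code with M = 2 and N = 3 built from XOR parity. It is a stand-in chosen for brevity, not the code used by Pond; the functions `encode` and `decode` and the zero padding are assumptions of this sketch.

```python
from itertools import combinations


def _xor(p: bytes, q: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(p, q))


def encode(block: bytes) -> list:
    """Split the block into two halves and add one XOR parity fragment (N = 3)."""
    half = (len(block) + 1) // 2
    a = block[:half]
    b = block[half:].ljust(half, b"\x00")  # pad so both halves have equal length
    return [a, b, _xor(a, b)]


def decode(fragments: dict, length: int) -> bytes:
    """Rebuild the block from any two fragments, keyed by index 0, 1, or 2."""
    if len(fragments) < 2:
        raise ValueError("need at least M = 2 fragments")
    if 0 in fragments and 1 in fragments:
        a, b = fragments[0], fragments[1]
    elif 0 in fragments:
        a = fragments[0]
        b = _xor(a, fragments[2])  # b = a XOR parity
    else:
        b = fragments[1]
        a = _xor(b, fragments[2])  # a = b XOR parity
    return (a + b)[:length]


block = b"OceanStore!"
frags = encode(block)
recovered = [
    decode({i: frags[i], j: frags[j]}, len(block))
    for i, j in combinations(range(3), 2)
]
```

Each fragment is half the size of the block, so three fragments cost 1.5 times the original storage (N/M = 3/2) while still surviving the loss of any one fragment; simple replication needs 2 times the storage for the same tolerance.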

13/26

Secondary Replica

Whole-block Caching to Avoid the Costs of Erasure Codes on Frequently-read Objects

Push-based Update

Pushed Every Time the Primary Replica Applies an Update

Dissemination Tree

Application-level Multicast Tree

Rooted at Primary Replica

Parent Nodes are Pre-existing Replicas that Serve the Object
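
A small sketch of push-based updates flowing down a dissemination tree. The `Replica` class and its methods are hypothetical, and parent selection is reduced to an explicit `attach` call; how a real node picks its parent is not covered here.

```python
class Replica:
    """One node of the dissemination tree; the root stands in for the primary replica."""

    def __init__(self, name: str):
        self.name = name
        self.children = []
        self.versions = []

    def attach(self, child: "Replica") -> "Replica":
        """Add a new secondary replica below this pre-existing one."""
        child.versions = list(self.versions)  # catch up on what the parent already has
        self.children.append(child)
        return child

    def push(self, version: bytes) -> None:
        """Record the new version, then forward it to every child."""
        self.versions.append(version)
        for child in self.children:
            child.push(version)


primary = Replica("primary")
near = primary.attach(Replica("secondary-near"))
far = near.attach(Replica("secondary-far"))  # its parent is an existing secondary
primary.push(b"version-1")
```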

15/26

Data Read

< Figure. Read Path: Application, Primary Replica (Inner Ring), Secondary Replica, Archival Storages >

1. AGUID

2. Latest VGUID

3. Search for Blocks in Secondary Replicas

4. Search for Enough Fragments in Archival Storages
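
The four steps can be sketched as a lookup chain. This assumes that step 4 is a fallback used only when no secondary replica holds the block; the dictionaries, the `reconstruct` callback, and the function name `read_object` are all hypothetical stand-ins for the real services.

```python
def read_object(aguid, latest_version, secondary_cache, archive, min_fragments, reconstruct):
    """Follow the read path: AGUID -> latest VGUID -> cached block or archival fragments."""
    vguid = latest_version[aguid]        # steps 1 and 2: map the AGUID to the latest VGUID
    block = secondary_cache.get(vguid)   # step 3: whole block cached on a secondary replica?
    if block is not None:
        return block
    fragments = archive.get(vguid, [])   # step 4: gather fragments from archival storage
    if len(fragments) < min_fragments:
        raise LookupError("not enough fragments to reconstruct the block")
    return reconstruct(fragments[:min_fragments])


latest_version = {"aguid-1": "vguid-7", "aguid-2": "vguid-9"}
secondary_cache = {"vguid-7": b"cached block"}
archive = {"vguid-9": [b"arch", b"ived", b"spare"]}
join = lambda frags: b"".join(frags)     # placeholder for real erasure decoding

from_cache = read_object("aguid-1", latest_version, secondary_cache, archive, 2, join)
from_archive = read_object("aguid-2", latest_version, secondary_cache, archive, 2, join)
```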

17/26

Data Location & Routing (1/4)

Tapestry

Decentralized Object Location and Routing System

Assigns Globally Unique Identifiers (GUIDs) to Hosts and Resources

Location Independent

Locality Aware

18/26

Data Location & Routing (2/4)

Routing Example

Messages are Routed to the Destination ID Digit by Digit

***8 => **98 => *598 => 4598

< Figure. Nodes B4F8, 9098, 0325, 2BB8, 7598, 4598, 87CA, 0098, 3E98, 1598, D598, 2118, with Links Labeled by Routing Level L1 to L4 >
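
The digit-by-digit rule can be sketched with the node IDs from the figure. This is a simplification: each hop here picks the first listed node that matches one more trailing digit, using a global node list, whereas a real Tapestry node consults its own neighbor map and prefers nearby nodes. The function `route` and the choice of 0325 as the source are assumptions for illustration.

```python
def route(source: str, dest: str, nodes: list) -> list:
    """Return the hops from source to dest, matching one more trailing digit per level."""
    path = [source]
    current = source
    for level in range(1, len(dest) + 1):
        suffix = dest[-level:]
        if current.endswith(suffix):
            continue  # the current node already matches this many digits
        candidates = [n for n in nodes if n.endswith(suffix)]
        if not candidates:
            raise LookupError("no node matches suffix " + suffix)
        current = candidates[0]
        path.append(current)
    return path


NODES = ["B4F8", "9098", "0325", "2BB8", "7598", "4598",
         "87CA", "0098", "3E98", "1598", "D598", "2118"]

hops = route("0325", "4598", NODES)
# hops == ["0325", "B4F8", "9098", "7598", "4598"]
```

Each hop fixes one more digit of the destination (***8, **98, *598, 4598), so the number of hops is bounded by the number of digits in the ID.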

19/26

Data Location & Routing (3/4)

Location Independent & Locality Aware

< Figure. Replica and Location Pointers on the Routing Mesh, with Links Labeled L1 to L4 >

20/26

Data Location & Routing (4/4)

Routing Table

< Neighbor Map in Memory for Tapestry Node 0642 >

21/26

Prototype

Prototype Software Architecture

22/26

Experimental Results (1/2)

Update Performance

< Table. Results of Latency Microbenchmark >

< Figure. Throughput in Local Area >

23/26

Experimental Results (2/2)

Comparison with NFS

< Figure. Andrew Benchmark; Labels: Write, Read, Read/Write >

24/26

Related Work

Other Peer-to-peer File Systems

PAST [Rows01] and CFS [Dabe01]

– No Write Sharing

IVY [Muth02], Pangaea [Sait02]

– Provide Both Read and Write Sharing, but with No Single Point of Consistency

25/26

Conclusion

Operational OceanStore Prototype

Universal Accessibility, Fault Tolerance, Security, and Information Sharing

Future Research

Improving Performance

– Efficient Threshold Schemes and Archival Data Generation

Self-Maintenance

Stability and Fault-tolerance

Supporting More Applications

26/26

Discussion

System Design Choice

Security vs. Fast Response

Simple vs. Complicated Design

Storage Service Provider (SSP)

Independent SSP vs. Confederation of Companies such as IBM, AT&T

Efficient Storage Usage

27/26

Primary Replica (Ext.)

Modification of Byzantine Agreement Protocol

Public Key Cryptography

– Symmetric-key Message Authentication Codes (MACs) for Inner Ring

– Public-key Cryptography for All Other Machines

Proactive Threshold Signatures

– Flexibility in Choosing the Membership of Inner Ring

– Single Public Key with l Private Key Shares

– Any k Correctly Generated Signature Shares among l can be Combined into a Full Signature

– Independent Sets of Key Shares can be Used to Control Membership

Responsible Party

– Chooses the Hosts that Make Up Inner Rings

