OceanStore: In Search of Global-Scale, Persistent Storage

John Kubiatowicz
UC Berkeley

OceanStore:2FDIS 2002 ©2002 John Kubiatowicz/UC Berkeley

OceanStore Context: Ubiquitous Computing

• Computing everywhere:
  – Desktop, Laptop, Palmtop
  – Cars, Cellphones
  – Shoes? Clothing? Walls?

• Connectivity everywhere:
  – Rapid growth of bandwidth in the interior of the net
  – Broadband to the home and office
  – Wireless technologies such as CDMA, satellite, laser

Utility-based Infrastructure?

[Figure: federation of providers: Pac Bell, Sprint, IBM, AT&T, Canadian OceanStore]

• Data service provided by federation of companies
• Cross-administrative domain
• Metric: MOLE OF BYTES (6 × 10^23)

OceanStore Assumptions

• Untrusted Infrastructure:
  – The OceanStore is comprised of untrusted components
  – Only ciphertext within the infrastructure
• Responsible Party:
  – Some organization (i.e. service provider) guarantees that your data is consistent and durable
  – Not trusted with content of data, merely its integrity
• Mostly Well-Connected:
  – Data producers and consumers are connected to a high-bandwidth network most of the time
  – Exploit multicast for quicker consistency when possible
• Promiscuous Caching:
  – Data may be cached anywhere, anytime

Key Observation: Want Automatic Maintenance

• Can’t possibly manage billions of servers by hand!
• System should automatically:
  – Adapt to failure
  – Repair itself
  – Incorporate new elements
• Introspective Computing/Autonomic Computing
• Can data be accessible for 1000 years?
  – New servers added from time to time
  – Old servers removed from time to time
  – Everything just works

Outline

• Motivation
• Assumptions of the OceanStore
• Specific Technologies and approaches:
  – Routing and Data Location
  – Naming
  – Conflict resolution on encrypted data
  – Replication and Deep archival storage
  – Introspection for optimization and repair
• Conclusion

Basic Structure: Irregular Mesh of “Pools”

Bringing Order to this Chaos

• How do you find information?
  – Must be scalable and provide maximum flexibility
• How do you name information?
  – Must provide global uniqueness
• How do you ensure consistency?
  – Must scale and handle intermittent connectivity
  – Must prevent unauthorized update of information
• How do you protect information?
  – Must preserve privacy
  – Must provide deep archival storage (continuous repair)
• How do you tune performance?
  – Locality very important

Throughout all of this: how do you maintain it???

Location and Routing

Locality, Locality, Locality
One of the defining principles

• “The ability to exploit local resources over remote ones whenever possible”
• “-Centric” approach
  – Client-centric, server-centric, data source-centric
• Requirements:
  – Find data quickly, wherever it might reside
  – Locate nearby object without global communication
  – Permit rapid object migration
  – Verifiable: can’t be sidetracked
• Locality yields: Performance, Availability, Reliability

Enabling Technology: DOLR (Decentralized Object Location and Routing)

[Figure: Tapestry DOLR locating objects by GUID (labels: GUID1, GUID1, GUID2)]

Stability under Changes

• Unstable, unreliable, untrusted nodes are the common case!
  – Network never fully stabilizes
  – What is half-life of a routing node?
  – Must provide stable routing in these circumstances

• Redundancy and adaptation fundamental:
  – Make use of alternative paths when possible
  – Incrementally remove faulty nodes
  – Route around network faults
  – Continuously tune neighbor links

The Tapestry DOLR

• Routing to Objects, not Locations!
  – Replacement for IP?
  – Very powerful abstraction
• Built as overlay network, but not fundamental
  – Randomized prefix routing + distributed object location index
  – Routing nodes have links to nearby neighbors
  – Additional state tracks objects
• Massive parallel insert (SPAA 2002)
  – Construction of nearest-neighbor mesh links
    • Log² n message complexity for new node
  – New nodes integrated, faulty ones removed
  – Objects kept available during this process
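
A minimal sketch of the prefix-routing idea, for illustration only. It assumes fixed-length hexadecimal node IDs and, unlike Tapestry, gives every hop a global view of the node list instead of a per-node routing table; the names `shared_prefix_len`, `next_hop`, and `route` are hypothetical. Each hop moves to a node whose ID matches the target GUID in at least one more leading digit.

```python
from typing import List, Optional, Sequence


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits two IDs have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def next_hop(current: str, target: str, nodes: Sequence[str]) -> Optional[str]:
    """Pick a node that matches the target in more leading digits than `current`.

    Assumption: global knowledge of `nodes`; a real DOLR consults a local
    routing table of nearby neighbors instead.
    """
    have = shared_prefix_len(current, target)
    candidates = [n for n in nodes if shared_prefix_len(n, target) > have]
    if not candidates:
        return None  # no better match exists: `current` is the end of the route
    # Extend the matched prefix by as few digits as possible per hop.
    return min(candidates, key=lambda n: (shared_prefix_len(n, target), n))


def route(start: str, target: str, nodes: Sequence[str]) -> List[str]:
    """Follow next_hop until no node improves on the current prefix match."""
    path = [start]
    while (hop := next_hop(path[-1], target, nodes)) is not None:
        path.append(hop)
    return path
```

Because every hop lengthens the matched prefix, a route visits at most one node per ID digit after the starting node.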

OceanStore Naming

Model of Data

• Ubiquitous object access from anywhere
  – Undifferentiated “Bag of Bits”
• Versioned Objects
  – Every update generates a new version
  – Can always go back in time (Time Travel)
• Each Version is Read-Only
  – Can have permanent name (SHA-1 Hash)
  – Much easier to repair
• An Object is a signed mapping between permanent name and latest version
  – Write access control/integrity involves managing these mappings

[Figure: Comet Analogy: updates, versions]
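
The versioning model can be sketched roughly as follows. This is an illustration under assumptions, not the OceanStore implementation: the class and method names are hypothetical, and the signature on the name-to-latest-version mapping is omitted entirely.

```python
import hashlib


class VersionedObject:
    """Append-only history: every update creates a new read-only version."""

    def __init__(self) -> None:
        self._versions = {}  # version name (hex SHA-1) -> immutable bytes
        self._history = []   # version names in update order

    def update(self, data: bytes) -> str:
        """Store a new version and return its permanent name (SHA-1 of content)."""
        name = hashlib.sha1(data).hexdigest()
        self._versions[name] = bytes(data)
        self._history.append(name)
        return name

    def latest(self) -> bytes:
        """The mapping 'object -> latest version' (unsigned in this sketch)."""
        return self._versions[self._history[-1]]

    def at(self, index: int) -> bytes:
        """Time travel: read the version produced by the index-th update."""
        return self._versions[self._history[index]]
```

Old versions are never overwritten, so going back in time is a lookup, not a restore.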

Secure Hashing

• Read-only data: GUID is hash over actual information
  – Uniqueness and Unforgeability: the data is what it is!
  – Verification: check hash over data

• Changeable data: GUID is combined hash over a human-readable name + public key
  – Uniqueness: GUID space selected by public key
  – Unforgeability: public key is indelibly bound to GUID
  – Verification: check signatures with public key

[Figure: DATA → SHA-1 → 160-bit GUID]
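
A small sketch of both kinds of GUID using Python's standard `hashlib`. The function names are hypothetical, and the way the name and public key are combined (plain concatenation here) is an assumption; the slides do not give the exact encoding.

```python
import hashlib


def block_guid(data: bytes) -> str:
    """GUID of read-only data: SHA-1 over the content (160 bits, 40 hex digits)."""
    return hashlib.sha1(data).hexdigest()


def verify_block(guid: str, data: bytes) -> bool:
    """Self-verification: recompute the hash and compare it with the GUID."""
    return block_guid(data) == guid


def object_guid(name: str, public_key: bytes) -> str:
    """GUID of changeable data: hash over human-readable name + public key.

    Assumption: simple concatenation of the UTF-8 name and the raw key bytes.
    """
    return hashlib.sha1(name.encode("utf-8") + public_key).hexdigest()
```

Any holder of a block can check it against its GUID without trusting the server that supplied it.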

Secure Naming

• Naming hierarchy:
  – Users map from names to GUIDs via hierarchy of OceanStore objects (à la SDSI)
  – Requires set of “root keys” to be acquired by user

[Figure: naming hierarchy Foo, Bar, Baz, Myfile, reached through an out-of-band “Root link”]

The Write Path

The Path of an OceanStore Update

[Figure: Clients, Inner-Ring Servers, Second-Tier Caches, Multicast trees]

OceanStore Consistency via Conflict Resolution

• Consistency is form of optimistic concurrency
  – An update packet contains a series of predicate-action pairs which operate on encrypted data
  – Each predicate tried in turn:
    • If none match, the update is aborted
    • Otherwise, action of first true predicate is applied

• Inner Ring must securely:
  – Pick serial order of updates
  – Apply them
  – Sign result (threshold signature)
  – Disseminate results to active users
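
The predicate-action rule can be sketched as below. This is a simplification: it runs on plaintext bytes, whereas the slides describe predicates and actions that operate on encrypted data, and the names `apply_update`, `Predicate`, and `Action` are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Predicate = Callable[[bytes], bool]
Action = Callable[[bytes], bytes]


def apply_update(data: bytes,
                 update: List[Tuple[Predicate, Action]]) -> Optional[bytes]:
    """Try each predicate in turn and apply the action of the first true one.

    Returns the new data, or None if no predicate matched (update aborted).
    """
    for predicate, action in update:
        if predicate(data):
            return action(data)
    return None


# Example update: append a line only if the object still ends with the
# content the client last saw; otherwise fall back to appending a marker.
example_update = [
    (lambda d: d.endswith(b"line 1\n"), lambda d: d + b"line 2\n"),
    (lambda d: len(d) > 0, lambda d: d + b"(conflict) line 2\n"),
]
```

An update whose predicates all fail leaves the object untouched, which is what makes the scheme optimistic: clients submit without locking and find out afterwards whether they committed.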

Automatic Maintenance

• Byzantine Commitment for inner ring:
  – Tolerates fewer than 1/3 malicious servers in inner ring
  – Continuous refresh of set of inner-ring servers
    • Proactive threshold signatures
    • Use of Tapestry: membership of inner ring unknown to clients

• Secondary tier self-organized into overlay dissemination tree
  – Use of Tapestry routing to suggest placement of replicas in the infrastructure
  – Automatic choice between update vs invalidate
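
The one-third figure follows from the classical bound for Byzantine agreement: n servers can tolerate f malicious ones only if n ≥ 3f + 1. A short sketch of that arithmetic, with hypothetical function names; the 2f + 1 quorum size is the one used by PBFT-style protocols, which is an assumption here because the slides do not name a specific protocol.

```python
def max_faulty(n: int) -> int:
    """Largest f with n >= 3f + 1, i.e. how many malicious servers n can tolerate."""
    return (n - 1) // 3


def quorum_size(n: int) -> int:
    """Matching replies needed in a PBFT-style protocol (assumption): 2f + 1."""
    return 2 * max_faulty(n) + 1
```

For example, an inner ring of 4 servers tolerates 1 malicious member, and going from 4 to 6 servers buys no extra tolerance; the next step up is at 7.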

Self-Organizing Soft-State Replication

• Simple algorithms for placing replicas on nodes in the interior
  – Intuition: locality properties of Tapestry help select positions for replicas
  – Tapestry helps associate parents and children to build multicast tree
• Preliminary results show that this is effective

Deep Archival Storage

Two Types of OceanStore Data

• Active Data: “Floating Replicas”
  – Per object virtual server
  – Logging for updates/conflict resolution
  – Interaction with other replicas for consistency
  – May appear and disappear like bubbles

• Archival Data: OceanStore’s Stable Store
  – m-of-n coding: Like hologram
    • Data coded into n fragments, any m of which are sufficient to reconstruct (e.g. m=16, n=64)
    • Coding overhead is proportional to n/m (e.g. 4)
    • Other parameter, rate, is 1/overhead
  – Fragments are cryptographically self-verifying

• Most data in the OceanStore is archival!

Archival Dissemination of Fragments

Fraction of Blocks Lost per Year (FBLPY)

• Exploit law of large numbers for durability!
• 6 month repair, FBLPY:
  – Replication: 0.03
  – Fragmentation: 10^-35
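
A rough way to see why fragmentation wins so decisively is a binomial model: if each fragment is lost independently with some probability before the next repair, a block is lost only when fewer than m of its n fragments survive. The sketch below uses a hypothetical per-fragment loss probability and is not intended to reproduce the FBLPY figures above, which depend on the failure model behind the slide.

```python
from math import comb


def loss_probability(n: int, m: int, p_fail: float) -> float:
    """P(fewer than m of n fragments survive), with independent fragment failures.

    Plain replication with r copies is the special case n = r, m = 1.
    """
    first_fatal = n - m + 1  # smallest number of failures that loses the block
    return sum(comb(n, k) * p_fail ** k * (1 - p_fail) ** (n - k)
               for k in range(first_fatal, n + 1))


P_FAIL = 0.1  # assumed per-fragment loss probability between repairs (hypothetical)
replication_4x = loss_probability(4, 1, P_FAIL)        # 4 full copies
fragmentation_4x = loss_probability(64, 16, P_FAIL)    # 16-of-64, same 4x overhead
```

At the same 4x storage overhead, losing a replicated block requires only 4 failures, while losing a 16-of-64 block requires at least 49 of 64 fragments to fail.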

The Dissemination Process: Achieving Failure Independence

[Figure: Model Builder, Set Creator, Introspection, Human Input, Network Monitoring, Inner Ring (two instances); arrows labeled model, set, probe, type, fragments]

Automatic Maintenance

• Continuous Entropy Suppression – i.e. repair!
  – Erasure coding gives flexibility in timing repair

• Data continuously transferred from physical medium to physical medium
  – No “tapes decaying in basement”

• Actual Repair
  – Recombine fragments, then send out copies again
  – DOLR permits efficient heartbeat mechanism
    • Permits infrastructure to notice:
      – Servers going away for a while
      – Or, going away forever!
  – Continuous sweep through data
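
One pass of such a sweep might look like the sketch below. It is an illustration only: the data layout (a map from block ID to the servers holding its fragments), the set of servers with recent heartbeats, and the `repair_threshold` parameter are all assumptions, not details from the slides.

```python
from typing import Dict, Iterable, List, Set, Tuple


def sweep(blocks: Dict[str, Iterable[str]],
          live_servers: Set[str],
          m: int,
          repair_threshold: int) -> Tuple[List[str], List[str]]:
    """Classify blocks by how many of their fragments sit on live servers.

    Returns (to_repair, lost): blocks still reconstructible (>= m live
    fragments) but below repair_threshold, and blocks with fewer than m.
    """
    to_repair, lost = [], []
    for block_id, holders in blocks.items():
        alive = sum(1 for server in holders if server in live_servers)
        if alive < m:
            lost.append(block_id)        # too late: cannot be reconstructed
        elif alive < repair_threshold:
            to_repair.append(block_id)   # recombine fragments and re-disseminate
    return to_repair, lost
```

Setting `repair_threshold` well above m is what gives the flexibility in timing: repair can be deferred and batched as long as the live count stays above m.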

Introspective Tuning

On the use of Redundancy

• Question: Can we use Moore’s law gains for something other than just raw performance?
  – Growth in computational performance
  – Growth in network bandwidth
  – Growth in storage capacity

• Physical systems are unreliable and untrusted
  – Can we use multiple faulty elements instead of one?
  – Can we devote resources to monitoring and analysis?
  – Can we devote resources to repairing systems?

• Complexity of systems growing rapidly
  – Can no longer debug systems entirely
  – How to handle this?

The Biological Inspiration

• Biological Systems are built from (extremely) faulty components, yet:
  – They operate with a variety of component failures → Redundancy of function and representation
  – They have stable behavior → Negative feedback
  – They are self-tuning → Optimization of common case

• Introspective Computing:
  – Components for computing
  – Components for monitoring and model building
  – Components for continuous adaptation

[Figure: cycle of Compute, Monitor, Adapt]

The Thermodynamic Analogy

• A system such as OceanStore has a variety of latent order
  – Connections between elements
  – Mathematical structure (erasure coding, etc.)
  – Distributions peaked about some desired behavior

• Permits “Stability through Statistics”
  – Exploit the behavior of aggregates

• Subject to Entropy
  – Servers fail, attacks happen, system changes

• Requires continuous repair
  – Apply energy (i.e. through servers) to reduce entropy

Introspective Optimization

• Adaptation of routing substrate
  – Optimization of Tapestry Mesh
  – Fault-tolerant routing mechanisms
  – Adaptation of second-tier multicast tree

• Monitoring of access patterns:
  – Clustering algorithms to discover object relationships
  – Time-series analysis of user and data motion

• Observations of system behavior
  – Extraction of failure correlations

• Continuous testing and repair of information
  – Slow sweep through all information to make sure there are sufficient erasure-coded fragments
  – Continuously reevaluate risk and redistribute data

PondStore [Java]:

• Event-driven state-machine model
• Included Components
  – Initial floating replica design
    • Conflict resolution and Byzantine agreement
  – Routing facility (Tapestry)
    • Bloom Filter location algorithm
    • Plaxton-based locate and route data structures
  – Introspective gathering of tacit info and adaptation
    • Language for introspective handler construction
    • Clustering, prefetching, adaptation of network routing
  – Initial archival facilities
    • Interleaved Reed-Solomon codes for fragmentation
    • Methods for signing and validating fragments
• Target Applications
  – Unix file-system interface under Linux (“legacy apps”)
  – Email application, proxy for web caches, streaming multimedia applications

We have Things Running!

• Latest: it is up to 7 MB/sec
• Still a ways to go, but working

Update Latency

• Cryptography in critical path (not surprising!)
• New metric: Avoid hashes (like avoid copies)

OceanStore Goes Global!

• OceanStore components running “globally:”
  – Australia, Georgia, Washington, Texas, Boston
  – Able to run the Andrew File-System benchmark with inner ring spread throughout US
  – Interface: NFS on OceanStore

• Word on the street: it was easy to do
  – The components were debugged locally
  – Easily set up remotely

• I am currently talking with people in:
  – England, Maryland, Minnesota, ….
  – PlanetLab testbed will give us access to much more

Reality: Web Caching through OceanStore

Other Apps

• Better file system support
  – NFS (working – reimplementation in progress)
  – Windows Installable file system (soon)

• Email through OceanStore
  – IMAP and POP proxies
  – Let normal mail clients access mailboxes in OS

• Palm-pilot synchronization
  – Palm database as an OceanStore DB

OceanStore Conclusions

• OceanStore: everyone’s data, one big utility
  – Global Utility model for persistent data storage
• OceanStore assumptions:
  – Untrusted infrastructure with a responsible party
  – Mostly connected with conflict resolution
  – Continuous on-line optimization
• OceanStore properties:
  – Provides security, privacy, and integrity
  – Provides extreme durability
  – Lower maintenance cost through redundancy, continuous adaptation, self-diagnosis and repair
  – Large scale system has good statistical properties

For more info:

• OceanStore vision paper for ASPLOS 2000: “OceanStore: An Architecture for Global-Scale Persistent Storage”
• Tapestry algorithms paper (SPAA 2002): “Distributed Object Location in a Dynamic Network”
• Bloom Filters for Probabilistic Routing (INFOCOM 2002): “Probabilistic Location and Routing”
• OceanStore web site: http://oceanstore.org/

