FDIS 2002 ©2002 John Kubiatowicz/UC Berkeley
OceanStore Context: Ubiquitous Computing
• Computing everywhere:
– Desktop, laptop, palmtop
– Cars, cellphones
– Shoes? Clothing? Walls?
• Connectivity everywhere:
– Rapid growth of bandwidth in the interior of the net
– Broadband to the home and office
– Wireless technologies such as CDMA, satellite, laser
Utility-based Infrastructure?

(Figure: an OceanStore data service spanning providers such as Pac Bell, Sprint, IBM, AT&T, and a Canadian OceanStore)

• Data service provided by a federation of companies
• Crosses administrative domains
• Metric: a mole of bytes (6×10²³)
OceanStore Assumptions
• Untrusted Infrastructure:
– The OceanStore is composed of untrusted components
– Only ciphertext within the infrastructure
• Responsible Party:
– Some organization (e.g., a service provider) guarantees that your data is consistent and durable
– Not trusted with the content of data, merely its integrity
• Mostly Well-Connected:
– Data producers and consumers are connected to a high-bandwidth network most of the time
– Exploit multicast for quicker consistency when possible
• Promiscuous Caching:
– Data may be cached anywhere, anytime
Key Observation: Want Automatic Maintenance
• Can't possibly manage billions of servers by hand!
• System should automatically:
– Adapt to failure
– Repair itself
– Incorporate new elements
• Introspective Computing / Autonomic Computing
• Can data be accessible for 1000 years?
– New servers added from time to time
– Old servers removed from time to time
– Everything just works
Outline
• Motivation
• Assumptions of the OceanStore
• Specific technologies and approaches:
– Routing and data location
– Naming
– Conflict resolution on encrypted data
– Replication and deep archival storage
– Introspection for optimization and repair
• Conclusion
Bringing Order to this Chaos
• How do you find information?
– Must be scalable and provide maximum flexibility
• How do you name information?
– Must provide global uniqueness
• How do you ensure consistency?
– Must scale and handle intermittent connectivity
– Must prevent unauthorized update of information
• How do you protect information?
– Must preserve privacy
– Must provide deep archival storage (continuous repair)
• How do you tune performance?
– Locality very important

Throughout all of this: how do you maintain it???
Locality, Locality, Locality
One of the defining principles:
• "The ability to exploit local resources over remote ones whenever possible"
• "-Centric" approach
– Client-centric, server-centric, data source-centric
• Requirements:
– Find data quickly, wherever it might reside
– Locate nearby objects without global communication
– Permit rapid object migration
– Verifiable: can't be sidetracked
• Locality yields: performance, availability, reliability
Enabling Technology: DOLR (Decentralized Object Location and Routing)

(Figure: the Tapestry overlay locating replicas of objects GUID1 and GUID2)
Stability under Changes
• Unstable, unreliable, untrusted nodes are the common case!
– Network never fully stabilizes
– What is the half-life of a routing node?
– Must provide stable routing in these circumstances
• Redundancy and adaptation are fundamental:
– Make use of alternative paths when possible
– Incrementally remove faulty nodes
– Route around network faults
– Continuously tune neighbor links
The Tapestry DOLR
• Routing to objects, not locations!
– Replacement for IP?
– Very powerful abstraction
• Built as an overlay network, but not fundamental
– Randomized prefix routing + distributed object location index
– Routing nodes have links to nearby neighbors
– Additional state tracks objects
• Massively parallel insert (SPAA 2002)
– Construction of nearest-neighbor mesh links
– log² n message complexity for a new node
– New nodes integrated, faulty ones removed
– Objects kept available during this process
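The prefix-routing idea above can be sketched in a few lines. This is a hedged simplification: every node here can see the whole node set, whereas real Tapestry keeps per-node routing tables that resolve one GUID digit per hop; the GUIDs and node names are made up for illustration.

```python
# Minimal sketch of Plaxton-style prefix routing, the idea underlying the
# Tapestry mesh. Simplifying assumption: global knowledge of all nodes.

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits two GUIDs share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(source: str, target: str, nodes: set) -> list:
    """Greedily hop to nodes matching ever-longer prefixes of `target`."""
    path, current = [source], source
    while current != target:
        p = shared_prefix_len(current, target)
        better = [n for n in nodes if shared_prefix_len(n, target) > p]
        if not better:
            break  # no closer node: `current` acts as the surrogate root
        current = max(better, key=lambda n: shared_prefix_len(n, target))
        path.append(current)
    return path

nodes = {"4227", "4228", "42a1", "4b00", "9098", "4220"}
print(route("9098", "4227", nodes))  # prefix match improves every hop
```

When no node matches a longer prefix, the loop stops at the deterministically chosen best match, which is how a node can serve as an object's root without global agreement.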
Model of Data
• Ubiquitous object access from anywhere
– Undifferentiated "bag of bits"
• Versioned objects
– Every update generates a new version
– Can always go back in time ("time travel")
• Each version is read-only
– Can have a permanent name (SHA-1 hash)
– Much easier to repair
• An object is a signed mapping between a permanent name and the latest version
– Write access control/integrity involves managing these mappings

(Figure: comet analogy: updates stream off as a tail of versions)
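A toy sketch of this data model: every update creates a new read-only version whose permanent name is the SHA-1 hash of its content, and the object maps a permanent name to the latest version. A real system signs that mapping with the owner's key; signing is omitted here, and the class name is hypothetical.

```python
import hashlib

# Toy versioned object: read-only versions named by content hash,
# plus a mutable "latest" pointer that writes must update.

class VersionedObject:
    def __init__(self):
        self.versions = {}   # version GUID -> (content, previous GUID)
        self.latest = None   # the mapping from permanent name to newest version

    def update(self, content: bytes) -> str:
        guid = hashlib.sha1(content).hexdigest()   # permanent, self-verifying
        self.versions[guid] = (content, self.latest)
        self.latest = guid
        return guid

    def time_travel(self):
        """Walk back through every version, newest first."""
        guid = self.latest
        while guid is not None:
            content, prev = self.versions[guid]
            yield guid, content
            guid = prev

obj = VersionedObject()
obj.update(b"v1: hello")
obj.update(b"v2: hello world")
print([c for _, c in obj.time_travel()])  # newest to oldest
```

Because old versions are immutable and named by their own hash, repair can verify and recopy any version without consulting the owner.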
Secure Hashing
• Read-only data: GUID is a hash over the actual information
– Uniqueness and unforgeability: the data is what it is!
– Verification: check the hash over the data
• Changeable data: GUID is a combined hash over a human-readable name + public key
– Uniqueness: GUID space selected by public key
– Unforgeability: public key is indelibly bound to the GUID
– Verification: check signatures with the public key

(Figure: SHA-1 over DATA yields a 160-bit GUID)
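The two GUID constructions can be sketched directly. The exact serialization used for changeable data (simple concatenation of name and key below) is an illustrative assumption, not the deployed format.

```python
import hashlib

# Two GUID flavors: content hash for read-only data,
# name + public key hash for changeable data.

def data_guid(data: bytes) -> str:
    """Self-verifying 160-bit GUID for read-only data."""
    return hashlib.sha1(data).hexdigest()

def object_guid(name: str, public_key: bytes) -> str:
    """GUID for changeable data, indelibly bound to name and key."""
    return hashlib.sha1(name.encode() + public_key).hexdigest()

def verify(data: bytes, guid: str) -> bool:
    """Anyone holding the data can check it against its GUID."""
    return data_guid(data) == guid

block = b"some archived bytes"
g = data_guid(block)
assert verify(block, g)
assert not verify(block + b"tampered", g)   # the data is what it is!
```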
Secure Naming
• Naming hierarchy:
– Users map from names to GUIDs via a hierarchy of OceanStore objects (à la SDSI)
– Requires a set of "root keys" to be acquired by the user

(Figure: directories Foo/Bar/Baz leading to Myfile, reached from an out-of-band "root link")
The Path of an OceanStore Update

(Figure: an update flows from clients to the inner-ring servers, then out through multicast trees to second-tier caches)
OceanStore Consistency via Conflict Resolution
• Consistency is a form of optimistic concurrency
– An update packet contains a series of predicate-action pairs which operate on encrypted data
– Each predicate is tried in turn:
• If none match, the update is aborted
• Otherwise, the action of the first true predicate is applied
• The Inner Ring must securely:
– Pick a serial order of updates
– Apply them
– Sign the result (threshold signature)
– Disseminate results to active users
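The predicate-action scheme above can be sketched in a few lines. In OceanStore the predicates operate on encrypted data; the plaintext state and the example predicates here are simplifications for clarity.

```python
# Minimal sketch of a predicate-action update: predicates are tried in
# order, the action of the first true predicate is applied, and if none
# match the update aborts.

def apply_update(state, update):
    for predicate, action in update:
        if predicate(state):
            return action(state), True    # committed
    return state, False                   # aborted

# e.g., "append line3, but only if the object still ends where I last saw it"
update = [
    (lambda s: s.endswith("line2\n"), lambda s: s + "line3\n"),
    (lambda s: True, lambda s: s + "<<conflict>>\n"),  # fallback action
]

state, ok = apply_update("line1\nline2\n", update)
assert ok and state.endswith("line3\n")
```

The fallback pair shows how an application can resolve, rather than merely detect, a conflict: a predicate that always matches turns "abort" into an application-chosen merge action.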
Automatic Maintenance
• Byzantine commitment for the inner ring:
– Tolerates up to 1/3 malicious servers in the inner ring
– Continuous refresh of the set of inner-ring servers
• Proactive threshold signatures
• Use of Tapestry: membership of the inner ring is unknown to clients
• Secondary tier self-organized into an overlay dissemination tree
– Tapestry routing suggests placement of replicas in the infrastructure
– Automatic choice between update vs. invalidate
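The "up to 1/3 malicious" bound comes from the standard arithmetic of Byzantine agreement, sketched below; the helper names are made up, and real inner-ring protocols involve far more than this counting.

```python
# Arithmetic behind "tolerates up to 1/3 malicious servers": Byzantine
# agreement among n servers withstands at most f faults when n >= 3f + 1,
# and a client can trust a result vouched for by a quorum of 2f + 1.

def max_faulty(n: int) -> int:
    """Largest f with n >= 3f + 1."""
    return (n - 1) // 3

def quorum(n: int) -> int:
    """Matching replies needed before a result is believable."""
    return 2 * max_faulty(n) + 1

for n in (4, 7, 10):
    print(f"ring of {n}: tolerates {max_faulty(n)} faults, quorum {quorum(n)}")
```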
Self-Organizing Soft-State Replication
• Simple algorithms for placing replicas on nodes in the interior
– Intuition: locality properties of Tapestry help select positions for replicas
– Tapestry helps associate parents and children to build the multicast tree
• Preliminary results show that this is effective
Two Types of OceanStore Data
• Active data: "floating replicas"
– Per-object virtual server
– Logging for updates/conflict resolution
– Interaction with other replicas for consistency
– May appear and disappear like bubbles
• Archival data: OceanStore's stable store
– m-of-n coding: like a hologram
• Data coded into n fragments, any m of which are sufficient to reconstruct (e.g., m=16, n=64)
• Coding overhead is proportional to n/m (e.g., 4)
• The other parameter, rate, is 1/overhead
– Fragments are cryptographically self-verifying
• Most data in the OceanStore is archival!
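A toy m-of-n code over a prime field shows the idea behind Reed-Solomon coding: the m data words fix a degree-(m-1) polynomial, the n fragments are its values at n points, and any m fragments recover it by Lagrange interpolation. Pond actually used interleaved Reed-Solomon codes over bytes; the prime field and the tiny parameters here are simplifications.

```python
# Toy m-of-n erasure code: polynomial evaluation to encode,
# Lagrange interpolation over a prime field to decode.

P = 2**31 - 1  # prime modulus; data words must be smaller than this

def lagrange_at(x0, points):
    """Evaluate, at x0, the unique polynomial through `points` (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x0 - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # modular inverse
    return total

def encode(data, n):
    """Systematic encoding: data words at x = 1..m, parity at x = m+1..n."""
    pts = list(enumerate(data, start=1))
    return [(x, lagrange_at(x, pts)) for x in range(1, n + 1)]

def decode(frags, m):
    """Rebuild the m data words from any m surviving fragments."""
    return [lagrange_at(x, frags[:m]) for x in range(1, m + 1)]

data = [11, 22, 33, 44]              # m = 4 words
frags = encode(data, n=8)            # overhead n/m = 2, rate 1/2
assert decode(frags[3:7], m=4) == data   # any 4 of the 8 suffice
```

The encoding is systematic: the first m fragments are the data words themselves, so reads need no decoding when those fragments survive.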
Fraction of Blocks Lost per Year (FBLPY)
• Exploit the law of large numbers for durability!
• With a 6-month repair epoch, FBLPY:
– Replication: 0.03
– Fragmentation: 10⁻³⁵
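The gap between those two numbers is a binomial-tail effect, sketched below. The per-fragment failure probability used here is hypothetical, and independence of failures is assumed; the slide's exact 0.03 and 10⁻³⁵ figures come from the paper's own failure model, not this toy one.

```python
from math import comb

# A fragmented block dies only if more than n - m fragments are lost
# before repair, so the loss probability is a binomial tail that shrinks
# astronomically even though individual failures are common.

def p_block_lost(n: int, m: int, p: float) -> float:
    """P(fewer than m of n fragments survive one repair epoch)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n - m + 1, n + 1))

p = 0.1  # hypothetical per-fragment failure probability per epoch
print(p_block_lost(2, 1, p))     # 2-way replication: lost iff both copies die
print(p_block_lost(64, 16, p))   # 16-of-64 coding: vanishingly small
```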
The Dissemination Process: Achieving Failure Independence

(Figure: a model builder and set creator, driven by introspection, human input, and network monitoring, probe servers and choose the sets that receive fragments from the inner ring)
Automatic Maintenance
• Continuous entropy suppression, i.e., repair!
– Erasure coding gives flexibility in the timing of repair
• Data continuously transferred from physical medium to physical medium
– No "tapes decaying in a basement"
• Actual repair
– Recombine fragments, then send out copies again
– DOLR permits an efficient heartbeat mechanism
– Continuous sweep through the data
• Permits the infrastructure to notice:
– Servers going away for a while
– Or, going away forever!
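The repair sweep can be sketched as a loop over blocks: count fragments on servers still answering heartbeats, and when too few remain, recombine the block and disseminate fresh fragments. The threshold, margin, and placement policy below are hypothetical placeholders, not the deployed algorithm.

```python
# Sketch of the continuous repair sweep over archived blocks.

def sweep(blocks, alive, m, n, repair_margin=2):
    """blocks: {block_id: set of servers holding a fragment};
    alive: servers currently answering heartbeats."""
    repaired = []
    for block_id, servers in blocks.items():
        live = servers & alive
        if len(live) < m + repair_margin:   # drifting toward unrecoverable
            # recombine from any m live fragments, re-encode to n, re-place
            blocks[block_id] = set(sorted(alive)[:n])  # placeholder policy
            repaired.append(block_id)
    return repaired

blocks = {"blk1": {"s1", "s2", "s3", "s4", "s5"},
          "blk2": {"s1", "s6", "s7", "s8", "s9"}}
alive = {"s1", "s2", "s3", "s4", "s5", "s6"}
print(sweep(blocks, alive, m=2, n=5))   # only blk2 has fallen below threshold
```

The margin is the point of erasure coding's flexibility: repair fires while m live fragments still exist, so the sweep can be slow and cheap rather than reactive.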
On the Use of Redundancy
• Question: can we use Moore's-law gains for something other than just raw performance?
– Growth in computational performance
– Growth in network bandwidth
– Growth in storage capacity
• Physical systems are unreliable and untrusted
– Can we use multiple faulty elements instead of one?
– Can we devote resources to monitoring and analysis?
– Can we devote resources to repairing systems?
• Complexity of systems is growing rapidly
– Can no longer debug systems entirely
– How to handle this?
The Biological Inspiration
• Biological systems are built from (extremely) faulty components, yet:
– They operate with a variety of component failures ⇒ redundancy of function and representation
– They have stable behavior ⇒ negative feedback
– They are self-tuning ⇒ optimization of the common case
• Introspective Computing:
– Components for computing
– Components for monitoring and model building
– Components for continuous adaptation

(Figure: a loop of Compute, Monitor, Adapt)
The Thermodynamic Analogy
• A system such as OceanStore has a variety of latent order
– Connections between elements
– Mathematical structure (erasure coding, etc.)
– Distributions peaked about some desired behavior
• Permits "stability through statistics"
– Exploit the behavior of aggregates
• Subject to entropy
– Servers fail, attacks happen, the system changes
• Requires continuous repair
– Apply energy (i.e., through servers) to reduce entropy
Introspective Optimization
• Adaptation of the routing substrate
– Optimization of the Tapestry mesh
– Fault-tolerant routing mechanisms
– Adaptation of the second-tier multicast tree
• Monitoring of access patterns:
– Clustering algorithms to discover object relationships
– Time-series analysis of user and data motion
• Observations of system behavior
– Extraction of failure correlations
• Continuous testing and repair of information
– Slow sweep through all information to make sure there are sufficient erasure-coded fragments
– Continuously reevaluate risk and redistribute data
PondStore [Java]:
• Event-driven state-machine model
• Included components:
– Initial floating-replica design
• Conflict resolution and Byzantine agreement
– Routing facility (Tapestry)
• Bloom filter location algorithm
• Plaxton-based locate-and-route data structures
– Introspective gathering of tacit info and adaptation
• Language for introspective handler construction
• Clustering, prefetching, adaptation of network routing
– Initial archival facilities
• Interleaved Reed-Solomon codes for fragmentation
• Methods for signing and validating fragments
• Target applications:
– Unix file-system interface under Linux ("legacy apps")
– Email application, proxy for web caches, streaming multimedia applications
We Have Things Running!
• Latest: throughput is up to 7 MB/sec
• Still a ways to go, but working
Update Latency
• Cryptography is in the critical path (not surprising!)
• New metric: avoid hashes (like "avoid copies")
OceanStore Goes Global!
• OceanStore components running "globally":
– Australia, Georgia, Washington, Texas, Boston
– Able to run the Andrew file-system benchmark with the inner ring spread throughout the US
– Interface: NFS on OceanStore
• Word on the street: it was easy to do
– The components were debugged locally
– Easily set up remotely
• I am currently talking with people in:
– England, Maryland, Minnesota, ….
– The PlanetLab testbed will give us access to much more
Other Apps
• Better file-system support
– NFS (working; reimplementation in progress)
– Windows installable file system (soon)
• Email through OceanStore
– IMAP and POP proxies
– Let normal mail clients access mailboxes in OceanStore
• Palm Pilot synchronization
– Palm database as an OceanStore DB
OceanStore Conclusions
• OceanStore: everyone's data, one big utility
– Global utility model for persistent data storage
• OceanStore assumptions:
– Untrusted infrastructure with a responsible party
– Mostly connected, with conflict resolution
– Continuous online optimization
• OceanStore properties:
– Provides security, privacy, and integrity
– Provides extreme durability
– Lower maintenance cost through redundancy, continuous adaptation, self-diagnosis, and repair
– Large-scale system has good statistical properties
For More Info:
• OceanStore vision paper (ASPLOS 2000): "OceanStore: An Architecture for Global-Scale Persistent Storage"
• Tapestry algorithms paper (SPAA 2002): "Distributed Object Location in a Dynamic Network"
• Bloom filters for probabilistic routing (INFOCOM 2002): "Probabilistic Location and Routing"
• OceanStore web site: http://oceanstore.org/