Toward Achieving Tapeless Backup at PB Scales
Hakim Weatherspoon, University of California, Berkeley
Frontiers in Distributed Information Systems (FDIS), San Francisco. Thursday, July 31, 2003
OceanStore Context: Ubiquitous Computing
• Computing everywhere:
  – Desktop, laptop, palmtop.
  – Cars, cellphones.
  – Shoes? Clothing? Walls?
• Connectivity everywhere:
  – Rapid growth of bandwidth in the interior of the net.
  – Broadband to the home and office.
  – Wireless technologies such as CDMA, satellite, laser.
Archival Storage
• Where is persistent information stored?
  – Want: geographic independence for availability, durability, and freedom to adapt to circumstances.
• How is it protected?
  – Want: encryption for privacy, secure naming and signatures for authenticity, and Byzantine commitment for integrity.
• Is it available and durable?
  – Want: redundancy with continuous repair and redistribution for long-term durability.
Path of an Update
Questions about Data?
• How do we use redundancy to protect against data loss?
• How do we verify data?
• How many resources are needed to keep data durable? Storage? Bandwidth?
Archival Dissemination Built into Update
• Erasure codes:
  – Redundancy without the overhead of strict replication.
  – Produce n fragments, of which any m suffice to reconstruct the data (m < n).
  – Rate r = m/n; storage overhead is 1/r (see the toy sketch below).
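A toy sketch of the m-of-n idea (a 2-of-3 XOR code invented for illustration, not the coder the real system uses): any m = 2 of the n = 3 fragments reconstruct the data, so r = 2/3 and the storage overhead is 1/r = 1.5x.

```python
# Toy 2-of-3 XOR erasure code (illustration only): any two fragments rebuild the data.

def encode(data: bytes):
    """Split data into halves A and B and add a parity fragment P = A xor B."""
    half = len(data) // 2
    a, b = data[:half], data[half:half * 2]          # assume even length for simplicity
    p = bytes(x ^ y for x, y in zip(a, b))
    return {"A": a, "B": b, "P": p}

def decode(frags: dict) -> bytes:
    """Rebuild the original data from any two surviving fragments."""
    if "A" in frags and "B" in frags:
        a, b = frags["A"], frags["B"]
    elif "A" in frags:                               # B lost: B = A xor P
        a = frags["A"]
        b = bytes(x ^ y for x, y in zip(a, frags["P"]))
    else:                                            # A lost: A = B xor P
        b = frags["B"]
        a = bytes(x ^ y for x, y in zip(b, frags["P"]))
    return a + b

frags = encode(b"tapeless")                          # fragments: "tape", "less", parity
del frags["A"]                                       # lose one fragment
assert decode(frags) == b"tapeless"
```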
Durability
• Fraction of Blocks Lost Per Year (FBLPY)*
  – r = 1/4, erasure-encoded block (e.g., m = 16, n = 64).
  – Increasing the number of fragments increases the durability of a block (see the model sketch below).
    • Same storage cost and repair time.
  – The n = 4 fragment case is equivalent to replication on four servers.

* Erasure Coding vs. Replication, H. Weatherspoon and J. Kubiatowicz, in Proc. of IPTPS 2002.
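A back-of-the-envelope durability model, assuming independent fragment failures (an assumption for illustration; the IPTPS 2002 paper's analysis is more detailed): a block is lost only when fewer than m of its n fragments survive.

```python
# Simple durability model under independent fragment survival (illustrative only).
from math import comb

def p_block_lost(p_frag: float, m: int, n: int) -> float:
    """Probability that fewer than m of n fragments survive, i.e. the block is lost."""
    return sum(comb(n, k) * p_frag**k * (1 - p_frag)**(n - k) for k in range(m))

p_frag = 0.9                                     # hypothetical per-fragment survival rate
for m, n in [(1, 4), (4, 16), (16, 64)]:         # all rate r = 1/4: same storage cost
    print(f"m={m:2d}, n={n:2d}: P(loss) = {p_block_lost(p_frag, m, n):.2e}")
# (m=1, n=4) corresponds to four-way replication; at the same rate, larger n
# drives the loss probability down by orders of magnitude.
```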
Naming and Verification Algorithm
• Use a cryptographically secure hash algorithm to detect corrupted fragments.
• Verification tree (code sketch after the figure):
  – n is the number of fragments.
  – Store log(n) + 1 hashes with each fragment.
  – Total of n·(log(n) + 1) hashes.
• The top hash is the block GUID (B-GUID).
  – Fragments and blocks are self-verifying.
(Diagram: verification tree over encoded fragments F1–F4. Leaf hashes H1–H4 are the hashes of F1–F4; H12 = hash(H1, H2), H34 = hash(H3, H4), H14 = hash(H12, H34); the B-GUID combines H14 with Hd, the hash of the data. Each piece is stored with its verification path:
  Fragment 1: H2, H34, Hd, F1 (fragment data)
  Fragment 2: H1, H34, Hd, F2 (fragment data)
  Fragment 3: H4, H12, Hd, F3 (fragment data)
  Fragment 4: H3, H12, Hd, F4 (fragment data)
  Data: H14, data)
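A minimal sketch of the verification tree for n = 4 fragments; SHA-1 and the concatenation order are assumptions made here, the figure only fixes which hashes travel with each fragment.

```python
# Verification tree for n = 4 fragments (hash choice and ordering assumed for illustration).
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha1(b"".join(parts)).digest()

def bguid(data: bytes, frags: list) -> bytes:
    """Top hash: tree over the fragment hashes, combined with Hd = hash(data)."""
    h1, h2, h3, h4 = (h(f) for f in frags)
    h12, h34 = h(h1, h2), h(h3, h4)
    h14 = h(h12, h34)
    return h(h14, h(data))

def verify_fragment1(top: bytes, f1: bytes, h2: bytes, h34: bytes, hd: bytes) -> bool:
    """Check fragment 1 using only the log(n) + 1 = 3 hashes stored with it."""
    return h(h(h(h(f1), h2), h34), hd) == top

frags = [b"F1-data", b"F2-data", b"F3-data", b"F4-data"]
data = b"original block"
top = bguid(data, frags)
h2, h34, hd = h(frags[1]), h(h(frags[2]), h(frags[3])), h(data)   # stored alongside F1
assert verify_fragment1(top, frags[0], h2, h34, hd)
```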
Enabling Technology
(Diagram: a GUID and the encoded fragments it names.)
Complex Objects I
(Diagram: a data block d is the unit of coding; its verification tree yields the GUID of d, and the encoded fragments are the unit of archival storage.)
Complex Objects II
(Diagram: data blocks d1–d9 sit under a data B-tree of indirect blocks rooted at M, and the version is named by a VGUID. Each block, e.g. d1 with its GUID, is the unit of coding; its encoded fragments are the unit of archival storage.)
Complex Objects III
(Diagram: two versions, VGUIDi and VGUIDi+1, of the same object. The new version copies on write only the modified data blocks d'8 and d'9 plus the indirect blocks and root M above them; a backpointer links the new root M to the previous version. AGUID = hash{name + keys}.)
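A hedged sketch of the copy-on-write step over a simplified flat block list (the real object is a B-tree of indirect blocks): the new version rewrites only the changed blocks and keeps a backpointer to the previous version.

```python
# Copy-on-write versioning over a simplified flat block list; the structure here is
# an illustration of block sharing between versions, not the full B-tree layout.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Version:
    blocks: tuple                          # data-block contents (d1..d9 in the figure)
    backpointer: Optional["Version"] = None

def copy_on_write(prev: Version, changes: dict) -> Version:
    """Build the next version, replacing only the blocks listed in `changes`."""
    new_blocks = tuple(changes.get(i, b) for i, b in enumerate(prev.blocks))
    return Version(blocks=new_blocks, backpointer=prev)

v_i = Version(tuple(f"d{k}".encode() for k in range(1, 10)))    # d1..d9
v_next = copy_on_write(v_i, {7: b"d'8", 8: b"d'9"})             # only d8, d9 rewritten
assert v_next.blocks[:7] == v_i.blocks[:7]                      # unmodified blocks shared
assert v_next.backpointer is v_i                                # link to the prior version
```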
Mutable Data
• Need mutable data for a real system.
  – An entity in the network maintains the A-GUID to V-GUID mapping.
  – Byzantine commitment for integrity:
    • verifies client privileges,
    • creates a serial order,
    • atomically applies the update.
• Versioning system:
  – Each version is inherently read-only (sketch below).
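A minimal sketch of the naming relationship, with SHA-1 and the field layout assumed: the A-GUID permanently names the object, each read-only version has its own V-GUID, and the A-GUID to latest-V-GUID mapping changes only after the commitment step.

```python
# Sketch of the A-GUID / V-GUID relationship; hash choice and layout are assumptions.
import hashlib

def aguid(name: bytes, owner_key: bytes) -> bytes:
    """A-GUID = hash{name + keys}: a permanent, self-certifying object name."""
    return hashlib.sha1(name + owner_key).digest()

latest_vguid = {}   # A-GUID -> V-GUID of the newest read-only version

def commit_update(ag: bytes, new_vguid: bytes) -> None:
    # In the real system this runs only after the Byzantine commitment step has
    # verified the client's privileges, serialized the update, and applied it
    # atomically; here we just record the resulting version.
    latest_vguid[ag] = new_vguid
```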
Deployment
• PlanetLab global network:
  – 98 machines at 42 institutions in North America, Europe, Asia, and Australia.
  – 1.26 GHz Pentium III (1 GB RAM) and 1.8 GHz Pentium 4 (2 GB RAM) machines.
  – North American machines (2/3) are on Internet2.
Deployment
• Deployed the storage system in November 2002.
  – ~50 physical machines.
  – 100 virtual nodes:
    • 3 clients, 93 storage servers, 1 archiver, 1 monitor.
  – Supports the OceanStore API:
    • NFS, IMAP, etc.
  – Fault injection.
  – Fault detection and repair.
Performance
• Performance of the archival layer:
  – Performance of an OceanStore server when archiving objects.
  – Analyze the operations of archiving data (including signing updates in a BFT protocol):
    • No archiving.
    • Archiving (synchronous), with m = 16, n = 32.
• Experiment environment:
  – OceanStore servers were analyzed on a 42-node cluster.
  – Each machine in the cluster is an IBM xSeries 330 1U rackmount PC with:
    • two 1.0 GHz Pentium III CPUs,
    • 1.5 GB ECC PC133 SDRAM,
    • two 36 GB IBM UltraStar 36LZX hard drives,
    • a single Intel PRO/1000 XF gigabit Ethernet adaptor connecting to a Packet Engines switch,
    • a Linux 2.4.17 SMP kernel.
Performance: Throughput
• Data throughput:
  – No archive: 8 MB/s.
  – Archive: 2.8 MB/s.
Performance: Latency
• Latency:
  – Fragmentation: y-intercept 3 ms, slope 0.3 ms/kB (0.3 s/MB).
  – Archive ≈ No archive + Fragmentation.
(Plot: Update Latency vs. Update Size; latency in ms against update size in kB, 2-32 kB. Linear fits: Archive y = 1.2x + 36.4, No Archive y = 0.6x + 29.6, Fragmentation y = 0.3x + 3.0.)
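Reading the linear fits off the plot gives rough latency predictors; the assignment of fits to curves follows the bullets above and is approximate.

```python
# Latency predictors from the plot's linear fits (x in kB, result in ms).
def no_archive_ms(kb):
    return 0.6 * kb + 29.6

def fragmentation_ms(kb):
    return 0.3 * kb + 3.0

def archive_ms(kb):
    return 1.2 * kb + 36.4

# The slide's point: archiving adds only the (cheap) fragmentation cost on top of
# the no-archive path.
for kb in (4, 32):
    print(kb, no_archive_ms(kb), fragmentation_ms(kb), archive_ms(kb))
```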
Closer Look: Update Latency
• Threshold signature dominates small-update latency.
  – Common RSA tricks are not applicable.
• Batch updates to amortize the signature cost (rough sketch after the tables).
• Tentative updates hide latency.
Update Latency (ms):

  Key Size   Update Size   5% Time   Median Time   95% Time
  512 b      4 kB              39            40          41
  512 b      2 MB            1037          1086        1348
  1024 b     4 kB              98            99         100
  1024 b     2 MB            1098          1150        1448

Latency Breakdown:

  Phase       Time (ms)
  Check            0.3
  Serialize        6.1
  Apply            1.5
  Archive          4.5
  Sign            77.8
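A rough cost sketch of the batching idea, using the breakdown above; how a batch would share one signature is an assumption for illustration, not a measured result.

```python
# Amortizing the threshold-signature cost across a batch of updates.
SIGN_MS = 77.8
OTHER_MS = 0.3 + 6.1 + 1.5 + 4.5        # check + serialize + apply + archive

def per_update_ms(batch_size: int) -> float:
    """One signature covers the whole batch; the remaining phases run per update."""
    return OTHER_MS + SIGN_MS / batch_size

print(per_update_ms(1))    # ~90 ms per update, signature-dominated
print(per_update_ms(10))   # ~20 ms per update once the signature is amortized
```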
Current Situation
• Stabilized the routing layer under churn and extraordinary circumstances.
• NSF infrastructure grant:
  – Deploy the code as a service for Berkeley.
  – Target: 1/3 PB.
• Future collaborations:
  – CMU for a PB store.
  – Internet Archive?
Conclusion
• Storage-efficient, self-verifying mechanism.
  – Erasure codes are good.
• Self-verifying data assists in:
  – secure read-only data,
  – secure caching infrastructures,
  – continuous adaptation and repair.

For more information: http://oceanstore.cs.berkeley.edu/
Papers:
- Pond: the OceanStore Prototype
- Naming and Integrity: Self-Verifying Data in P2P Systems