Opportunistic Use of Content Addressable
Storage for Distributed File Systems
Niraj Tolia†*, Michael Kozuch†, M. Satyanarayanan†*, Brad Karp†, Thomas Bressoud†‡, and Adrian Perrig*
*Carnegie Mellon University, †Intel Research Pittsburgh and ‡Denison University
Introduction

Using a Distributed File System on a Wide Area Network is slow!
However, the number of Content Addressable Storage (CAS) providers appears to be growing.
Can we therefore make opportunistic use of these CAS providers to benefit client-server file systems like NFS, AFS, and Coda?
Content Addressable Storage

Content Addressable Storage is data that is identified by its contents instead of a name.

[Figure: a file (Foo.txt) is mapped by a cryptographic hash to its content addressable name (0xd58e23b71b1b...)]

An example of a CAS provider is a Distributed Hash Table (DHT).
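The naming scheme above can be sketched in a few lines. This is an illustrative helper (the name `content_address` is an assumption, not CASPER's code): the content addressable name is simply the cryptographic hash of the data.

```python
import hashlib

def content_address(data: bytes) -> str:
    """Name data by the SHA-1 hash of its contents (CAS-style naming)."""
    return hashlib.sha1(data).hexdigest()

# Identical contents yield identical names, whatever the file is called.
assert content_address(b"lorem ipsum") == content_address(b"lorem ipsum")
```

Because the name is derived only from the bytes, any provider holding a matching block can serve it, no matter where the block came from.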
Motivation

Use CAS as a performance enhancement when the file server is remote: convert data transfers from WAN to LAN.

[Figure: the client fetches data from nearby CAS providers on the LAN instead of from the distant file server]
Talk Outline

• Introduction
• The CASPER File System
• Building Blocks: Recipes, Jukeboxes, Recipe Servers
• Architecture
• Benchmarks and Performance
• Fuzzy Matching
• Conclusions
The CASPER File System

CASPER can make use of any available CAS provider to improve read performance, but it does not depend on CAS providers.
• In the absence of a useful CAS provider, you are no worse off than you originally were
Writes are sent directly to the file server, not to the CAS provider.
Recipes

[Figure: each block of a file's data is mapped by a cryptographic hash to a content addressable name (e.g. 0x330c7eb274a4...); the list of these names forms the file's recipe]
Building Blocks: Recipes

A recipe describes an object in a content addressable way.
Recipes are first-class entities in the file system:
• Can be cached
XML is used for data representation:
• Compression is used over the network
Recipes can be maintained lazily because they contain version information.
Recipe Example (XML)

<recipe type="file">
  <metadata>
    <version>00 00 01 04 01</version>
    …
  </metadata>
  <recipe_choice>
    <hash_list hash_type="SHA-1" block_type="variable" number="5">
      <hash size="4189">330c7eb274a4...</hash>
      …
    </hash_list>
  </recipe_choice>
</recipe>
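A recipe like the one above can be generated by hashing each block of the file. The sketch below is illustrative, not CASPER's implementation: it assumes fixed-size blocks for brevity, whereas the block_type attribute shows CASPER also supports variable-size blocks, and the `make_recipe` helper name is hypothetical.

```python
import hashlib

def make_recipe(data: bytes, block_size: int = 4096) -> dict:
    """Describe a file content-addressably: one SHA-1 hash per block.
    Fixed-size blocks are an assumption made here to keep the sketch
    short; real recipes may use variable-size blocks."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return {
        "hash_type": "SHA-1",
        "number": len(blocks),
        "hashes": [
            {"size": len(b), "hash": hashlib.sha1(b).hexdigest()}
            for b in blocks
        ],
    }
```

The dictionary mirrors the XML structure: a hash count plus a sized hash list, which is all a client needs to query CAS providers for the blocks.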
CASPER Architecture

[Figure: the client's DFS Client talks to the DFS File Server and Recipe Server over a WAN connection, and to a CAS provider (jukebox) over a LAN connection]
Building Blocks: Jukeboxes

Jukeboxes are abstractions of a Content Addressable Storage provider:
• Provide access to data based on its hash value
They give consumers no guarantee of persistence or reliability.
They support a Query() and Fetch() interface:
• MultiQuery() and MultiFetch() are also available
Examples include your desktop, a departmental jukebox, P2P systems, etc.
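The interface described above might look like the following in-memory sketch. The method names follow the slide; the class itself, its `add` helper, and the dictionary backing store are illustrative assumptions, not a published API.

```python
import hashlib

class Jukebox:
    """A best-effort CAS provider: serves blocks by hash, with no
    persistence or reliability guarantees to its consumers."""

    def __init__(self):
        self._blocks = {}  # hash -> data; entries may vanish at any time

    def add(self, data: bytes) -> str:
        """Store a block under its content hash (hypothetical helper)."""
        h = hashlib.sha1(data).hexdigest()
        self._blocks[h] = data
        return h

    def query(self, h: str) -> bool:
        return h in self._blocks            # "do you hold this block?"

    def fetch(self, h: str):
        return self._blocks.get(h)          # block data, or None on a miss

    def multi_query(self, hashes):
        return [h for h in hashes if self.query(h)]

    def multi_fetch(self, hashes):
        return {h: self._blocks[h] for h in hashes if self.query(h)}
```

The batched MultiQuery/MultiFetch calls matter in practice: they let a client resolve all of a recipe's hashes in one round trip instead of one per block.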
Building Blocks: Recipe Server

This module generates recipe representations of files present in the underlying file system.
It can be placed either on the Distributed File System server or on any other machine well connected to it.
It helps maintain consistency by informing the client of changes in files that the client has reconstructed.
CASPER Details

The file system is based on Coda:
• Whole-file caching, open-close consistency
A proxy-based layering approach is used.
Coda takes care of consistency, conflict detection, resolution, etc.:
• The file server is the final authoritative source
CASPER allows us to service cache misses faster than would usually be possible.
CASPER Architecture

[Figure: on the client, a DFS Proxy sits between the Coda Client and the network; it talks to the Coda File Server and Recipe Server over a WAN connection, and to a CAS provider (jukebox) over a LAN connection]
CASPER Implementation

[Figure: request flow among the Coda Client, DFS Proxy, Recipe Server, Coda File Server, and jukebox:
1. File Request (client to proxy)
2. Recipe Request (proxy to Recipe Server, WAN)
3. Recipe Response
4. CAS Request (proxy to jukebox, LAN)
5. CAS Response
6. Missed Block Request (proxy to Coda File Server, WAN)
7. Missed Block Response]
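The request flow above can be sketched end-to-end. The `jukebox` and `file_server` objects below are hypothetical stand-ins (any objects offering a batched fetch would do); the point is the ordering: try the LAN jukebox first, fetch only the misses over the WAN, and verify every block against its content hash before use.

```python
import hashlib

def reconstruct(recipe_hashes, jukebox, file_server):
    """Rebuild a file from its recipe's ordered block hashes."""
    # LAN first: ask the jukebox for every block in one batched call.
    blocks = jukebox.multi_fetch(recipe_hashes)
    # WAN fallback: fetch only the missed blocks from the file server.
    missed = [h for h in recipe_hashes if h not in blocks]
    blocks.update(file_server.fetch_blocks(missed))
    # Verify each block against its content hash, then reassemble.
    out = []
    for h in recipe_hashes:
        data = blocks[h]
        if hashlib.sha1(data).hexdigest() != h:
            raise ValueError("CAS provider returned a corrupt block")
        out.append(data)
    return b"".join(out)
```

The hash check is what makes untrusted CAS providers safe to use: a jukebox can fail to deliver a block, but it cannot silently substitute wrong data.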
Talk Outline

• Introduction
• The CASPER File System
• Building Blocks: Recipes, Jukeboxes, Recipe Servers
• Architecture
• Benchmarks and Performance
• Fuzzy Matching
• Conclusions
Experimental Setup

[Figure: the client reaches the File Server + Recipe Server through a NIST Net router; the jukebox is on the client's side of the network]

WAN bandwidth limitations between the server and client were controlled using NIST Net:
• 10 Mb/s, 1 Mb/s + 10 ms, and 100 Kb/s + 100 ms
The hit ratio on the jukebox was set to 100%, 66%, 33%, and 0%.
Clients began with a cold cache (no files or recipes).
Benchmarks

• Binary Install Benchmark
• Virtual Machine Migration
• Modified Andrew Benchmark
Benchmark Description

Binary Install (RPM based):
• Installed RPMs for the Mozilla 1.1 browser
• 6 RPMs, total size of 13.5 MB
Virtual Machine migration:
• Time taken to resume a migrated Virtual Machine and execute an MS Office-based benchmark
• Trace accesses ~1000 files @ 256 KB each
• No think time modeled
Benchmark Description (II)

Modified Andrew Benchmark:
• Phases include Create Directory, Copy, Scan Directory, Read All, and Make
• However, only the Copy phase will exhibit an improvement
Uses Apache 1.3.27:
• 11.36 MB source tree, 977 files
• 53% of the files are less than 4 KB and 71% are less than 8 KB in size
Mozilla (RPM) Install

Time is normalized against vanilla Coda.
The gain is most pronounced at lower bandwidth.
Very low overhead is seen for these experiments (between 1-5%).

[Chart: normalized runtime at 100 Kb/s, 1 Mb/s, and 10 Mb/s for jukebox hit ratios of 100%, 66%, 33%, and 0%. Baseline runtimes: 1238 sec at 100 Kb/s, 150 sec at 1 Mb/s, 44 sec at 10 Mb/s]
Virtual Machine Migration

Time is normalized against vanilla Coda.
Large amounts of data show benefit even at higher bandwidths.
The high overhead seen at 10 Mb/s is an artifact of data buffering.

[Chart: normalized runtime at 100 Kb/s, 1 Mb/s, and 10 Mb/s for jukebox hit ratios of 100%, 66%, 33%, and 0%. Baseline runtimes: 21523 sec at 100 Kb/s, 2046 sec at 1 Mb/s, 203 sec at 10 Mb/s]
Andrew Benchmark

Only the Copy phase shows benefits.
More than half of the files are fetched over the WAN without talking to the jukebox because of their small size.

[Chart: normalized runtime at 100 Kb/s, 1 Mb/s, and 10 Mb/s for jukebox hit ratios of 100%, 66%, 33%, and 0%; at 10 Mb/s some bars exceed the axis (1.5, 1.6, and 1.8). Baseline runtimes: 1150 sec at 100 Kb/s, 103 sec at 1 Mb/s, 13 sec at 10 Mb/s]
Commonality

The question of where and how much commonality can be found is still open. However, a number of applications will benefit from this approach, including:
• Virtual Machine Migration
• Binary Installs and Upgrades
• Software Development
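Commonality between two versions of a data set can be estimated by comparing their block hashes. The helper below is an illustrative sketch (the `commonality` name and fixed-size blocks are assumptions made for brevity):

```python
import hashlib

def commonality(old: bytes, new: bytes, block_size: int = 4096) -> float:
    """Fraction of the new version's blocks already present in the old
    version, i.e. the share of data a jukebox holding the old version
    could serve locally."""
    def hashes(data):
        return [hashlib.sha1(data[i:i + block_size]).digest()
                for i in range(0, len(data), block_size)]
    old_set = set(hashes(old))
    new_list = hashes(new)
    if not new_list:
        return 0.0
    return sum(h in old_set for h in new_list) / len(new_list)
```

This is the kind of pairwise measurement behind the commonality charts that follow: each cell asks how much of one release could be reconstructed from another.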
Mozilla Binary Commonality

[Chart: pairwise commonality (0-100%) among the Mozilla releases mozilla-16 through mozilla-25]
Linux Kernel Commonality – 2.2

[Chart: pairwise commonality (0-100%) among Linux kernel versions 2.2.0 through 2.2.24]
Related Work
Delta Encoding• rsync, HTTP, etc.
Distributed File Systems• NFS, AFS, Coda, etc.
P2P Content Addressable Networks• Chord, Pastry, Freenet, CAN, etc.
Hash based storage and file systems• Venti, LBFS, Ivy, EMC’s Centera, Farsite, etc.
Conclusions

• Introduced the concept of recipes
• Demonstrated the benefits of opportunistic use of CAS providers by traditional distributed file systems on WANs
• Introduced "Fuzzy Matching"
Backup Slides
Where Did the Time Go?

For the Andrew benchmark:
• Reconstruction of a large number of small files takes 4 round trips
• There is also the overhead of compression, verification, etc.
• Part of the system (the CAS requests) can be optimized by performing work in parallel
Number of Round Trips

[Figure: the four round trips among the Coda Client, DFS Proxy, Recipe Server, Coda File Server, and jukebox:
1. Recipe Request / Recipe Response (WAN, to the Recipe Server)
2 & 3. CAS Request / CAS Response (LAN, to the jukebox)
4. Missed Block Request / Missed Block Response (WAN, to the Coda File Server)]
Absolute Andrew

Andrew Benchmark: Copy Performance in seconds, mean (std dev), by jukebox hit ratio and network bandwidth:

Hit Ratio    100 Kb/s        1 Mb/s         10 Mb/s
Baseline     1150.7 (0.5)    103.3 (0.5)    13.3 (0.5)
0%           1069.3 (1.7)    108.7 (0.5)    23.7 (0.5)
33%           762.7 (0.5)     85.0 (0.8)    21.3 (0.5)
66%           520.7 (0.5)     64.0 (0.8)    20.0 (1.6)
100%          261.3 (0.5)     40.3 (0.9)    17.3 (1.2)
NFS Implementation?

The benchmark results would not change significantly (with the possible exception of the Virtual Machine migration benchmark).
It is definitely possible to adopt a similar approach; in fact, an NFS proxy (without CASPER) exists.
However, the semantics of such a system are still unclear…
Fuzzy Matching

Question: can we convert an incorrect block into what we need?
If there is a block that is "near" to what is needed, treat the difference as a transmission error and fix it by applying an error-correcting code.
Fuzzy Matching needs three components:
• Exact hash
• Fuzzy hash
• ECC information
Fuzzy Hashes

[Figure: shingleprints (e.g. 4 7 8 9 2 0 3) are computed over the file data and sorted (0 2 3 4 7 8 9) to form the fuzzy hash]
Fuzzy Hashing and ECCs

A fuzzy hash could simply be a shingle:
• Hash a number of features with a sliding window
• Take the first m hashes after sorting to be representative of the data
• These m shingles are used to find "near" blocks
After finding a similar block, an ECC that tolerates a number of changes could be applied to recover the original block.
There is a definite tradeoff between recipe size and Fuzzy Matching, but this approach is promising.
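The shingling steps above can be sketched as follows. This is an illustrative sketch only: the `fuzzy_hash` name, the window size, and m are arbitrary assumptions, and the ECC recovery step is omitted.

```python
import hashlib

def fuzzy_hash(data: bytes, window: int = 8, m: int = 4):
    """Hash every `window`-byte substring (a shingle), sort the shingle
    values, and keep the smallest m as the block's fuzzy hash. Blocks
    that differ in only a few bytes share most of their shingles, so
    their fuzzy hashes are likely to overlap."""
    shingles = {
        hashlib.sha1(data[i:i + window]).digest()[:8]
        for i in range(len(data) - window + 1)
    }
    return tuple(sorted(shingles)[:m])
```

Keeping only the m smallest shingle values is a min-hash-style trick: it gives a fixed-size fingerprint whose overlap between two blocks approximates the similarity of their full shingle sets.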
Linux Kernel Commonality – 2.4

[Chart: pairwise commonality (0-100%) among Linux kernel versions 2.4.0 through 2.4.20]
Linux Kernel – 2.2 (B/W)

[Chart: black-and-white rendering of the Linux 2.2 kernel commonality chart]