A transparently scalable metadata service for the Ursa Minor storage system

Shafeeq Sinnamohideen, Raja Sambasivan, James Hendricks, Likun Liu, Gregory R. Ganger
Parallel Data Laboratory, Carnegie Mellon University
Ursa Minor
• Prototype of a Self-* storage system [FAST05]
• Direct-access system model
  • Data path for bulk data
  • Metadata path for attributes
  • Similar to NASD, Panasas, PVFS, Lustre, etc.
• Research questions
  • How to automate management?
  • How to build a versatile system?
• This talk: one hard problem with a simple solution
Ursa Minor Overview
[Figure: a client looks up /foo at the Metadata Server, which returns the file’s Object-ID, attributes, and storage node list; the client then reads and writes data directly on the Object-based Storage Devices.]
Desired properties
• Scalability
  • Adding servers increases capacity
  • Ideally the increase is proportional
• Transparency
  • Users don’t care which server is used
  • Always provide consistent semantics
• Atomic operations are a useful building block
  • Standards compliance
  • Difficult for programmers to do without
Maintaining semantics
Easy for the data path:
• Operations affect a single file
• Only one server involved in each op
Some metadata ops can affect two items:
• Renaming a file to a different directory
  • Parent & child
  • Could involve two servers
Handling multi-server ops
1. Only allow single-server ops
  • e.g., AFS, NFS, ONTAP GX
  • Volume abstraction → limited transparency
2. Use a distributed transaction protocol
  • e.g., Farsite
  • Complex to implement
3. Use distributed locking & shared state
  • e.g., GPFS
  • Pushes complexity into the lock manager
Our approach to multi-server ops
• Use the simplest possible solution
• System can already:
  • Perform single-server atomic operations
  • Migrate items for load balancing
• Reuse these features to support multi-server ops
The idea
When a request needs multiple files:
• Migrate the files’ metadata to one server
• Execute the single-server code path
• Fix any load imbalance afterwards
  • Return metadata to the original server
  • Move other files
Core tradeoff
• Gain simplicity through reuse
  • Unmodified single-server execution
  • Unmodified migration path
• Lose some performance
  • Migration latency added to op latency
• Expect this to be a worthwhile tradeoff
What do we expect?
Traces of large file systems show that:
• Multi-object ops are a tiny fraction
• Most multi-object ops are parent-child
  • CREATE, DELETE
  • Parent & child on the same server for locality
• Other multi-object ops are extremely rare
  • RENAME: 0.005% involve 2 dirs
  • LINK: 0.120% possible (0.005% actual)
  • Most of these will be close in the directory tree
• The rare case doesn’t have to be fast
Metadata distribution
• Distributed key-value store for “inodes”
  • Key: Object-ID
  • Value: object metadata (attributes & layout)
• Distribute by Object-ID
  • Partition the Object-ID space into ranges
  • Assign each range to a server (see the lookup sketch below)
[Figure: the Object-ID space partitioned into the ranges 0000–0999, 1000–1999, 2000–2999, 3000–3999, 4000–4999, 5000–7499, and 7500–9999, divided between Metadata Server 1 and Metadata Server 2.]
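A minimal C sketch of the range-table lookup described above; struct range_entry, mds_for_oid, and the exact server assignment are hypothetical, for illustration only:

#include <stdint.h>
#include <stdio.h>

struct range_entry {
    uint64_t first_oid;   /* inclusive lower bound of the range */
    uint64_t last_oid;    /* inclusive upper bound of the range */
    int      server_id;   /* server the range is delegated to   */
};

/* Binary search over a table kept sorted by first_oid, with
 * non-overlapping ranges, as assigned by the delegation coordinator. */
static int mds_for_oid(const struct range_entry *tbl, size_t n, uint64_t oid)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (oid < tbl[mid].first_oid)
            hi = mid;
        else if (oid > tbl[mid].last_oid)
            lo = mid + 1;
        else
            return tbl[mid].server_id;
    }
    return -1;            /* OID not covered by any delegated range */
}

int main(void)
{
    /* Ranges from the figure; the split across servers 1 and 2 is illustrative. */
    const struct range_entry tbl[] = {
        {   0,  999, 1}, {1000, 1999, 1}, {2000, 2999, 1},
        {3000, 3999, 1}, {4000, 4999, 1}, {5000, 7499, 2},
        {7500, 9999, 2},
    };
    printf("OID 2500 -> server %d\n",
           mds_for_oid(tbl, sizeof tbl / sizeof tbl[0], 2500));
    return 0;
}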
Metadata distribution
• Delegation coordinator assigns ranges to servers
• Range is the unit of migration
• Metadata persistently stored in the data path
[Figure: the Delegation Coordinator assigns the Object-ID ranges across Metadata Server 1 and Metadata Server 2.]
Multi-server operations
• When a metadata server needs a range:
  1. Borrow it from the server that has it
  2. Perform the operation
  3. Return it to the original server
[Figure: a range (2000–2999) migrates between Metadata Server 1 and Metadata Server 2 under the Delegation Coordinator, then migrates back to its original server.]
A toy model of this cycle follows below.
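To make the borrow/execute/return cycle concrete, here is a runnable toy model in C; the names (owner, migrate, multi_server_op) are hypothetical, and the real system moves ranges via the delegation coordinator and shared storage rather than an in-memory table:

#include <stdio.h>

#define NRANGES 7

/* owner[i] = metadata server currently delegated range i
 * (illustrative initial assignment). */
static int owner[NRANGES] = {1, 1, 1, 1, 1, 2, 2};

/* Stand-in for a coordinator-driven range migration. */
static void migrate(int range, int to)
{
    printf("migrate range %d: server %d -> server %d\n",
           range, owner[range], to);
    owner[range] = to;
}

/* Hypothetical multi-server op touching objects in ranges ra and rb. */
static void multi_server_op(int ra, int rb)
{
    int home   = owner[ra];      /* server that will execute the op */
    int lender = owner[rb];
    if (lender != home)
        migrate(rb, home);       /* 1. borrow the range              */
    printf("server %d runs the single-server path on ranges %d,%d\n",
           home, ra, rb);        /* 2. unmodified single-server code */
    if (lender != home)
        migrate(rb, lender);     /* 3. return it to the original server */
}

int main(void)
{
    multi_server_op(2, 5);       /* e.g., a cross-directory RENAME */
    return 0;
}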
Object-IDs
• Object-ID determines which server to use
• Assign Object-IDs to minimize multi-server ops
• Directory tree determines operation locality
  • Multi-file ops involve nearby directories
  • Nearby files should get similar Object-IDs
  • Fall into the same range
  • Served by the same server → locality benefits
• Encode hierarchy into the Object-ID
  • Analogous to IP address subnetting (sketch below)
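A minimal C sketch of this subnet-style encoding, assuming one byte per directory-tree level; the format and the make_oid helper are illustrative, not the actual Ursa Minor encoding:

#include <stdint.h>
#include <stdio.h>

/* Build an Object-ID from per-level child indices, most significant first,
 * e.g. /dir1/dir1/dir2/file3 -> indices {1,1,2,3} -> 0x0101020300000000. */
static uint64_t make_oid(const uint8_t *idx, int depth)
{
    uint64_t oid = 0;
    for (int i = 0; i < 8; i++)
        oid = (oid << 8) | (uint64_t)(i < depth ? idx[i] : 0);
    return oid;
}

int main(void)
{
    uint8_t file3[] = {1, 1, 2, 3};   /* /dir1/dir1/dir2/file3 */
    uint8_t dir2[]  = {1, 1, 2};      /* its parent directory  */
    printf("%016llx\n%016llx\n",
           (unsigned long long)make_oid(file3, 4),
           (unsigned long long)make_oid(dir2, 3));
    return 0;
}

Because the file and its parent share the prefix 01 01 02, they fall into the same Object-ID range and are served by the same server, just as hosts sharing an IP prefix fall into the same subnet.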
Example tree
[Figure: an example directory tree containing /dir1, /dir1/dir1, /dir1/dir2, /dir1/dir1/dir2, /dir1/dir1/dir2/file3, /dir1/dir1/dir1/dir1/dir1/file1, and /dir1/dir2/file1.]
Object-ID assignment
[Figure: the Object-ID 01 01 02 03 assigned to /dir1/dir1/dir2/file3, with one segment per level of the path through /dir1, /dir1/dir1, and /dir1/dir1/dir2.]
Evaluation
1. Is the Metadata Service scalable?
2. Sensitivity to workload characteristics
3. Sensitivity to system parameters
4. Headroom for future workloads
Experimental setup
• Modified SpecSFS97 NFS benchmark
  • Applied to Ursa Minor NFS head-ends
  • NFS head-end translates NFS requests into Ursa Minor operations
  • Configured to maximize MDS load
• 8.3 million files & directories
  • 26 GB of metadata (158 GB of file data)
• Vary the number of metadata servers
  • Rest of the system is constant
Metadata traffic
[Figure: measurement setup: NFS requests enter through NFS head-ends (NFS1–NFS4), which send the measured metadata traffic to the metadata servers (MDS1, MDS2) and bulk data to the object storage devices (OSD1–OSD3); the tiers are annotated with the counts 24, 1–32, and 48.]
Scalability w/o multi-server ops
About multi-server ops
SpecSFS97 doesn’t produce any:
• Simple directory structure
• No multi-directory ops in the workload
• The OID-assignment policy does perfectly
Adding multi-server ops
Artificially introduce them:
• Replace CREATEs with cross-directory LINKs
  • Same work for each operation
• Use a “bad” OID-assignment policy
• 1% multi-server ops
  • 100× the rate seen in traces!
Scalability with multi-server ops
Causes of slowdown
• Latency of migration
• Side-effects on other operations
  • Migration makes a table unavailable
  • Servers flush cache on migration
• Granularity of migration is significant
  • The smaller, the better
  • Extreme case is single-object
  • Encountered very rarely in practice
Implementation

Component                  Lines of C
Base metadata server           47,000
Multi-server operations           820
Multi-server using 2PC          2,587

• Half of our implementation is a simple lock manager
• Our 2PC implementation is not robust
Conclusion
• Feasible to reuse migration to support multi-server operations
• Almost no overhead with shared storage
  • Harvard, NetApp, SpecSFS97 workloads
  • Even at higher multi-server operation rates
• Good choice for system designers
  • Transparent scalability made easy