
Cloud Filesystem

Jeff Darcy for BBLISA, October 2011

What is a Filesystem?

• “The thing every OS and language knows”
• Directories, files, file descriptors
• Directories within directories
• Operate on single record (POSIX: single byte) within a file
• Built-in permissions model (e.g. UID, GID, ugo·rwx)
• Defined concurrency behaviors (e.g. fsync)
• Extras: symlinks, ACLs, xattrs (see the sketch below)
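
The bullets above are mostly the POSIX contract. A minimal sketch of those behaviors using Python's os module, assuming a local path (/tmp/demo.txt is just a placeholder):

    import os

    path = "/tmp/demo.txt"                      # placeholder path on any local filesystem

    # File descriptors, single-byte operations within a file, and fsync
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o640)   # ugo.rwx bits set at create time
    os.pwrite(fd, b"X", 4096)                   # write a single byte at an arbitrary offset
    os.fsync(fd)                                # defined durability behavior
    print(os.pread(fd, 1, 4096))                # read that byte back: b'X'
    os.close(fd)

    # Built-in permissions model and the xattr "extra" (Linux only)
    os.chmod(path, 0o600)
    if hasattr(os, "setxattr"):
        os.setxattr(path, "user.comment", b"hello")
        print(os.getxattr(path, "user.comment"))

    # Symlinks are another extra
    if not os.path.islink(path + ".link"):
        os.symlink(path, path + ".link")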

Are Filesystems Relevant?

• Supported by every language and OS natively
• Shared data with rich semantics
• Graceful and efficient handling of multi-GB objects
• Permission model missing in some alternatives
• Polyglot storage, e.g. DB to index data in FS (sketched below)
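
The last bullet, "polyglot storage", deserves one concrete example: keep the bulk data as ordinary files and let a small database index them. A minimal sketch with sqlite3; /srv/data is a hypothetical data directory, not something from the talk.

    import os
    import sqlite3

    DATA_DIR = "/srv/data"                      # hypothetical directory full of bulk data

    conn = sqlite3.connect("index.db")          # the DB holds only the index, not the data
    conn.execute("CREATE TABLE IF NOT EXISTS files"
                 " (path TEXT PRIMARY KEY, size INTEGER, mtime REAL)")
    for root, _dirs, names in os.walk(DATA_DIR):
        for name in names:
            full = os.path.join(root, name)
            st = os.stat(full)
            conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                         (full, st.st_size, st.st_mtime))
    conn.commit()

    # Rich queries come from the DB; the multi-GB objects stay in the filesystem.
    for row in conn.execute("SELECT path FROM files ORDER BY size DESC LIMIT 5"):
        print(row[0])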

Network Filesystems

• Extend filesystem to multiple clients
• Awesome idea so long as total required capacity/performance doesn't exceed a single server
  o ...otherwise you get server sprawl
• Plenty of commercial vendors, community experience
• Making NFS highly available brings extra headaches

Distributed Filesystems

• Aggregate capacity/performance across servers
• Built-in redundancy
  o ...but watch out: not all deal with HA transparently
• Among the most notoriously difficult kinds of software to set up, tune and maintain
  o Anyone want to see my Lustre scars?
• Performance profile can be surprising
• Result: seen as specialized solution (esp. HPC)

Example: NFS4.1/pNFS

• pNFS distributes data access across servers
• Referrals etc. offload some metadata
• Only a protocol, not an implementation
  o OSS clients, proprietary servers
• Does not address metadata scaling at all
• Conclusion: partial solution, good for compatibility; full solution might layer on top of something else

Example: Ceph

• Two-layer architecture
• Object layer (RADOS) is self-organizing (see the librados sketch below)
  o can be used alone for block storage via RBD
• Metadata layer provides POSIX file semantics on top of RADOS objects
• Full-kernel implementation
• Great architecture, some day it will be a great implementation
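
A hedged sketch of "the object layer can be used alone", using the python-rados bindings that ship with Ceph. It assumes a reachable cluster configured in /etc/ceph/ceph.conf and a pool named "demo" (both are assumptions); no file or kernel layer is involved.

    import rados                                 # python-rados bindings from the Ceph project

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("demo")       # "demo" is an assumed pool name
        ioctx.write_full("greeting", b"a bare RADOS object, no filesystem involved")
        print(ioctx.read("greeting"))
        ioctx.close()
    finally:
        cluster.shutdown()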

Ceph Diagram

[Diagram: a Client talks to Data nodes in the RADOS layer and to Metadata servers in the Ceph layer; the metadata servers sit on top of RADOS.]

Example: GlusterFS

• Single-layer architecture
  o sharding instead of layering
  o one type of server – data and metadata
• Servers are dumb, smart behavior driven by clients (hashing sketch below)
• FUSE implementation
• Native, NFSv3, UFO, Hadoop
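
"Smart behavior driven by clients" means a client can compute where a file lives instead of asking a metadata server. The sketch below shows the general idea with a plain hash over invented brick names; GlusterFS's actual elastic hashing assigns hash ranges per directory, so treat this as an illustration only.

    import hashlib

    BRICKS = ["brick-a", "brick-b", "brick-c", "brick-d"]   # hypothetical brick names

    def pick_brick(filename: str) -> str:
        # Every client runs the same deterministic hash, so they all agree
        # on placement without any central metadata service.
        digest = hashlib.md5(filename.encode()).digest()
        return BRICKS[int.from_bytes(digest[:4], "big") % len(BRICKS)]

    for name in ("notes.txt", "photo.jpg", "build.log"):
        print(name, "->", pick_brick(name))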

GlusterFS Diagram

[Diagram: the Client talks directly to Bricks A–D; each brick stores both data and metadata.]

OK, What About HekaFS?

• Don't blame me for the name
  o trademark issues are a distraction from real work
• Existing DFSes solve many problems already
  o sharding, replication, striping
• What they don't address is cloud-specific deployment
  o lack of trust (user/user and user/provider)
  o location transparency
  o operationalization

Why Start With GlusterFS?

• Not going to write my own from scratch
  o been there, done that
  o leverage existing code, community, user base
• Modular architecture allows adding functionality via an API (translator sketch below)
  o separate licensing, distribution, support
• By far the best configuration/management
• OK, so it's FUSE
  o not as bad as people think
  o + add more servers
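
"Adding functionality via an API" refers to GlusterFS translators: stackable modules that intercept filesystem operations between the client and the bricks. The real API is C; the Python below is only a conceptual sketch of the stacking idea, with invented names.

    class Translator:
        """Pass-through layer: forwards every operation to the layer below."""
        def __init__(self, below):
            self.below = below

        def write(self, path, data):
            return self.below.write(path, data)

    class TenantPrefix(Translator):
        """Toy added feature: confine one tenant to its own subtree."""
        def write(self, path, data):
            return self.below.write("/tenant42" + path, data)

    class Brick:
        """Stand-in for the storage server."""
        def write(self, path, data):
            print("storing", len(data), "bytes at", path)

    # New behavior is added by stacking another translator, not by patching the server.
    stack = TenantPrefix(Brick())
    stack.write("/reports/q3.txt", b"...")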

HekaFS Current Features

• Directory isolation
• ID isolation
  o “virtualize” between server ID space and tenants'
• SSL
  o encryption useful on its own
  o authentication is needed by other features
• At-rest encryption (ESSIV sketch below)
  o Keys ONLY on clients
  o AES-256 through AES-1024, “ESSIV-like”
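
"ESSIV-like" means each block's IV is derived by encrypting the block number under a hash of the data key, so both the key and IV generation stay on the client and nothing secret lives server-side. A minimal sketch of that IV derivation with the pyca/cryptography package; it illustrates the ESSIV idea, not HekaFS's actual cipher stack.

    import hashlib
    from cryptography.hazmat.backends import default_backend
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def essiv_iv(key: bytes, block_number: int) -> bytes:
        # IV = AES_{SHA-256(key)}(block_number): deterministic per block,
        # unpredictable without the key, nothing stored on the server.
        salt = hashlib.sha256(key).digest()
        counter = block_number.to_bytes(16, "big")
        enc = Cipher(algorithms.AES(salt), modes.ECB(), default_backend()).encryptor()
        return enc.update(counter) + enc.finalize()

    def encrypt_block(key: bytes, block_number: int, plaintext: bytes) -> bytes:
        # CBC-encrypt one block-aligned chunk using its ESSIV-derived IV.
        iv = essiv_iv(key, block_number)
        enc = Cipher(algorithms.AES(key), modes.CBC(iv), default_backend()).encryptor()
        return enc.update(plaintext) + enc.finalize()     # plaintext must be 16-byte aligned

    key = bytes(32)                                       # demo AES-256 key (all zeros)
    print(encrypt_block(key, 7, b"sixteen byte blk").hex())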

HekaFS Future Features

• Enough of multi-tenancy, now for other stuff
• Improved (local/sync) replication
  o lower latency, faster repair
• Namespace (and small-file?) caching
• Improved data integrity
• Improved distribution
  o higher server counts, smoother reconfiguration
• Erasure codes?

HekaFS Global Replication

• Multi-site asynchronous
• Arbitrary number of sites
• Write from any site, even during partition
  o ordered, eventually consistent with conflict resolution (vector-clock sketch below)
• Caching is just a special case of replication
  o interest expressed (and withdrawn), not assumed
• Some infrastructure being done early for local replication
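
"Ordered, eventually consistent with conflict resolution" across many writable sites is the classic multi-master problem. One standard building block is a vector clock: each write carries per-site counters, and two writes whose clocks don't dominate one another are concurrent and must be resolved. The sketch below shows that general technique, not HekaFS's actual replication design.

    from collections import Counter

    def merge(a: Counter, b: Counter) -> Counter:
        # Element-wise maximum of two vector clocks.
        return Counter({site: max(a[site], b[site]) for site in set(a) | set(b)})

    def dominates(a: Counter, b: Counter) -> bool:
        # True if clock a has already seen everything clock b has.
        return all(a[site] >= n for site, n in b.items())

    # Two sites accept writes during a partition; neither clock dominates
    # the other, so the writes are concurrent and need conflict resolution.
    boston = Counter({"boston": 3, "tokyo": 1})
    tokyo = Counter({"boston": 2, "tokyo": 2})
    if not dominates(boston, tokyo) and not dominates(tokyo, boston):
        print("conflict detected; merged clock:", dict(merge(boston, tokyo)))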

Project Status

• All open source
  o code hosted by Fedora, bugzilla by Red Hat
  o Red Hat also pays me (and others) to work on it
• Close collaboration with Gluster
  o they do most of the work
  o they're open-source folks too
  o completely support their business model
• “current” = Fedora 16
• “future” = Fedora 17+ and Red Hat product

Contact Info

• Project
  o http://hekafs.org
  o [email protected]
• Personal
  o http://pl.atyp.us
  o [email protected]
