Agenda
● What is it?
● Architecture
● Integration with OpenNebula
● What's new?
What is Ceph?
● Highly available resilient data store
● Free Software (LGPL)
● 10 years since inception
● Flexible object, block and filesystem interfaces
● Especially popular in private clouds as a VM image service and an S3-compatible object storage service
Interfaces to storage
● OBJECT STORAGE (RGW): S3 & Swift APIs, native API, multi-tenant, Keystone integration, geo-replication
● BLOCK STORAGE (RBD): snapshots, clones, Linux kernel driver, iSCSI, OpenStack integration
● FILE SYSTEM (CephFS): POSIX, Linux kernel client, CIFS/NFS, HDFS, distributed metadata
Ceph Architecture
Architectural Components
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RGW: a web services gateway for object storage, compatible with S3 and Swift
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
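A quick way to touch each layer from the command line (a sketch only: the pool, image, bucket and monitor names here are made up, and RGW/CephFS must already be configured):

rados -p mypool put greeting greeting.txt   # store an object directly in RADOS
rbd create --size 10240 mypool/vm-disk      # create a 10 GiB RBD image
mount -t ceph mon1:6789:/ /mnt/cephfs       # mount CephFS with the kernel client
s3cmd ls s3://mybucket                      # talk to RGW with any S3-compatible client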
Object Storage Daemons
Each OSD daemon serves one disk, storing objects through a local filesystem (btrfs, xfs or ext4); a small number of monitor daemons (M) run alongside.
RADOS Components
● OSDs: 10s to 10,000s in a cluster
– One per disk (or one per SSD, RAID group…)
– Serve stored objects to clients
– Intelligently peer for replication & recovery
● Monitors: maintain cluster membership and state
– Provide consensus for distributed decision-making
– Small, odd number
– Do not serve stored objects to clients
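Both daemon types are visible in a running cluster, for example (output formats vary by release):

ceph -s          # overall health, monitor quorum, OSDs up/in
ceph osd tree    # OSDs arranged by host/rack, as CRUSH sees them
ceph mon stat    # monitor membership and quorum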
RADOS Cluster
Applications talk directly to the RADOS cluster: many OSDs plus a handful of monitors (M).
Where do objects live?
Given an object and a cluster of many OSDs and monitors (M), how does the application find out which OSD holds it?
A Metadata Server?
One option: (1) ask a central metadata/lookup server where the object lives, then (2) go and fetch it – an extra hop on every request.
Calculated placement
Better: the application calculates placement itself, e.g. by hashing object names into fixed ranges (A-G, H-N, O-T, U-Z), each served by one node – no lookup needed.
Even better: CRUSH
CRUSH hashes each object and maps it pseudo-randomly onto a set of OSDs spread across the RADOS cluster.
CRUSH is a quick calculation
CRUSH: Dynamic data placement
● CRUSH: pseudo-random placement algorithm
– Fast calculation, no lookup
– Repeatable, deterministic
– Statistically uniform distribution
● Stable mapping
– Limited data migration on change
● Rule-based configuration
– Infrastructure topology aware
– Adjustable replication
– Weighting
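You can ask the cluster to show CRUSH's decision for any object name, or decompile the CRUSH map itself (pool and object names below are illustrative):

ceph osd map rbd my-object                   # print the placement group and OSD set chosen for this object
ceph osd getcrushmap -o crushmap.bin         # fetch the compiled CRUSH map
crushtool -d crushmap.bin -o crushmap.txt    # decompile it to inspect the topology and rules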
RBD: Virtual disks in Ceph
● RADOS BLOCK DEVICE: storage of disk images in RADOS
● Decouples VMs from hosts
● Images are striped across the cluster (pool)
● Snapshots
● Copy-on-write clones
● Support in: mainline Linux kernel (2.6.39+), Qemu/KVM, OpenStack, CloudStack, OpenNebula, Proxmox
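A minimal tour of those features with the rbd CLI (pool and image names are placeholders):

rbd create --size 10240 one/vm-disk           # 10 GiB image striped across the 'one' pool
rbd snap create one/vm-disk@base              # point-in-time snapshot
rbd snap protect one/vm-disk@base             # snapshots must be protected before cloning
rbd clone one/vm-disk@base one/vm-disk-clone  # copy-on-write clone, e.g. for a new VM
rbd ls one                                    # list images in the pool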
Storing virtual disks
Each VM's virtual disk is accessed through librbd inside the hypervisor process, which talks directly to the RADOS cluster.
Using Ceph with OpenNebula
Storage in OpenNebula deployments
OpenNebula Cloud Architecture Survey 2014 (http://c12g.com/resources/survey/)
RBD and libvirt/qemu
● librbd (user space) client integration with libvirt/qemu
● Support for live migration, thin clones
● Get recent versions!
● Directly supported in OpenNebula since 4.0 with the Ceph Datastore (wraps `rbd` CLI)
More info online:
http://ceph.com/docs/master/rbd/libvirt/
http://docs.opennebula.org/4.10/administration/storage/ceph_ds.html
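As a sketch, registering the Ceph Datastore in OpenNebula 4.x looks roughly like this (pool name, monitor hosts, bridge host and secret UUID are placeholders; see the OpenNebula docs above for the full attribute list):

cat > ceph_ds.conf <<EOF
NAME        = "ceph_ds"
DS_MAD      = ceph
TM_MAD      = ceph
DISK_TYPE   = RBD
POOL_NAME   = one
CEPH_HOST   = "mon1:6789 mon2:6789"
CEPH_USER   = libvirt
CEPH_SECRET = "uuid-of-the-libvirt-secret"
BRIDGE_LIST = "cephbridge1"
EOF
onedatastore create ceph_ds.conf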
Other hypervisors
● OpenNebula is flexible, so can we also use Ceph with non-libvirt/qemu hypervisors?
● Kernel RBD: can present RBD images in /dev/ on the hypervisor host for software unaware of librbd (see the sketch after this list)
● Docker: can exploit RBD volumes with a local filesystem for use as data volumes – maybe CephFS in future...?
● For unsupported hypervisors, can adapt to Ceph using e.g. iSCSI for RBD, or NFS for CephFS (but test re-exports carefully!)
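For software that only understands block devices, a sketch of mapping an image with the kernel client (image and mount names are illustrative):

rbd map one/vm-disk                 # appears as e.g. /dev/rbd0 (also /dev/rbd/one/vm-disk)
mkfs.xfs /dev/rbd0                  # put a local filesystem on it, e.g. as a Docker data volume
mount /dev/rbd0 /var/lib/myvolume
rbd unmap /dev/rbd0                 # detach when finished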
Choosing hardware
Testing/benchmarking/expert advice is needed, but there are general guidelines:
● Prefer many cheap nodes to few expensive nodes (10 is better than 3)
● Include small but fast SSDs for OSD journals
● Don't simply buy biggest drives: consider IOPs/capacity ratio
● Provision network and IO capacity sufficient for your workload plus recovery bandwidth from node failure.
What's new?
Ceph releases
● Ceph 0.80 firefly (May 2014)
– Cache tiering & erasure coding
– Key/value OSD backends
– OSD primary affinity
● Ceph 0.87 giant (October 2014)
– RBD cache enabled by default
– Performance improvements
– Locally recoverable erasure codes
● Ceph x.xx hammer (2015)
Additional components
● Ceph FS – scale-out POSIX filesystem service, currently being stabilized
● Calamari – monitoring dashboard for Ceph
● ceph-deploy – easy SSH-based deployment tool (example after this list)
● Puppet, Chef modules
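For example, ceph-deploy can bootstrap a small test cluster roughly like this (host and device names are placeholders; the OSD subcommand syntax varies between ceph-deploy versions):

ceph-deploy new mon1                  # write an initial ceph.conf for the new cluster
ceph-deploy install mon1 osd1 osd2    # install Ceph packages on the nodes
ceph-deploy mon create-initial        # create the monitors and gather keys
ceph-deploy osd create osd1:sdb       # prepare and activate an OSD on osd1's /dev/sdb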
Get involved
Evaluate the latest releases:
http://ceph.com/resources/downloads/
Mailing list, IRC:
http://ceph.com/resources/mailing-list-irc/
Bugs:
http://tracker.ceph.com/projects/ceph/issues
Online developer summits:
https://wiki.ceph.com/Planning/CDS
Questions?
Spare slides
Ceph FS
CephFS architecture
● Dynamically balanced scale-out metadata
● Inherit flexibility/scalability of RADOS for data
● POSIX compatibility
● Beyond POSIX: Subtree snapshots, recursive statistics
Weil, Sage A., et al. "Ceph: A scalable, high-performance distributed file system." Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, 2006.
http://ceph.com/papers/weil-ceph-osdi06.pdf
Components
● Client: kernel, fuse, libcephfs
● Server: MDS daemon
● Storage: RADOS cluster (mons & OSDs)
A Linux host running the ceph.ko kernel client sends metadata operations to the MDS daemons and file data directly to the OSDs of the Ceph cluster.
From application to disk
Application → client (kernel client, ceph-fuse, or an app linked against libcephfs) → client network protocol → ceph-mds and RADOS → disk
Scaling out FS metadata
● Options for distributing metadata?
– by static subvolume
– by path hash
– by dynamic subtree
● Consider performance, ease of implementation
Dynamic subtree placement
● Locality: get the dentries in a dir from one MDS
● Support read heavy workloads by replicating non-authoritative copies (cached with capabilities just like clients do)
● In practice, work at the directory-fragment level in order to handle large directories
Data placement
● Stripe file contents across RADOS objects
– Get full RADOS cluster bandwidth from clients
– Fairly tolerant of object losses: reads return zeros
● Control striping with layout vxattrs (see the example after this list)
– Layouts also select between multiple data pools
● Deletion is a special case: client deletions mark files 'stray'; the RADOS delete ops are sent by the MDS
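A sketch of reading and setting layouts through the virtual xattrs (mount point, paths and pool name are illustrative; an extra data pool must first be added to the filesystem):

getfattr -n ceph.file.layout /mnt/cephfs/somefile                   # show stripe_unit, stripe_count, object_size, pool
setfattr -n ceph.dir.layout.stripe_count -v 4 /mnt/cephfs/scratch   # stripe new files in this dir more widely
setfattr -n ceph.dir.layout.pool -v fast_ssd /mnt/cephfs/scratch    # send new files in this dir to another data pool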
Clients
● Two implementations:
– ceph-fuse / libcephfs
– kclient (kernel)
● Interplay with the VFS page cache; efficiency is harder with FUSE (extraneous stats, etc.)
● Client performance matters for single-client workloads
● A slow client can hold up others if it is hogging metadata locks: include clients in troubleshooting
● Future: more per-client performance stats and perhaps per-client metadata QoS; clients will probably be grouped into jobs or workloads
● Future: may want to tag client I/O with a job id (e.g. HPC workload, Samba client id, container/VM id)
Journaling and caching in MDS
● Metadata ops initially journaled to striped journal "file" in the metadata pool.
– I/O latency on metadata ops is sum of network latency and journal commit latency.
– Metadata remains pinned in in-memory cache until expired from journal.
● In some workloads we expect almost all metadata to stay in cache; in others it's more of a stream
● Control cache size with mds_cache_size
● Cache eviction relies on client cooperation
● MDS journal replay not only recovers data but also warms up cache. Use standby replay to keep that cache warm.
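A sketch of the knobs involved (the MDS name and values are illustrative; mds_cache_size counts inodes, not bytes):

ceph daemon mds.a config show | grep mds_cache_size   # current cache limit
ceph daemon mds.a config set mds_cache_size 1000000   # raise it at runtime
# for a warm standby, set "mds standby replay = true" in the standby's [mds] section of ceph.conf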
Lookup by inode
● Sometimes we need the inode → path mapping:
– Hard links
– NFS handles
● Costly to store this: mitigate by piggybacking paths (backtraces) onto data objects
– Con: storing metadata in the data pool
– Con: extra I/Os to set backtraces
– Pro: disaster recovery from the data pool
● Future: improve backtrace writing latency
CephFS in practice
ceph-deploy mds create myserver
ceph osd pool create fs_data 64       # pg_num is required; size it for your cluster
ceph osd pool create fs_metadata 64
ceph fs new myfs fs_metadata fs_data
mount -t ceph x.x.x.x:6789:/ /mnt/ceph -o name=admin,secret=<key>   # kernel client; auth options needed with cephx
Managing CephFS clients
● New in giant: see hostnames of connected clients
● Client eviction is sometimes important:
– Skip the wait during the reconnect phase on MDS restart
– Allow others to access files locked by a crashed client
● Use the OpTracker to inspect ongoing operations
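These are all available through the MDS admin socket; a sketch (the MDS name and session id are placeholders):

ceph daemon mds.a session ls             # connected clients, with hostnames and state
ceph daemon mds.a session evict 4305     # evict one client by its session id
ceph daemon mds.a dump_ops_in_flight     # OpTracker view of in-progress metadata operations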
CephFS tips
● Choose MDS servers with lots of RAM
● Investigate clients when diagnosing stuck/slow access
● Use recent Ceph and a recent kernel
● Use a conservative configuration:
– Single active MDS, plus one standby
– Dedicated MDS server
– Kernel client
– No snapshots, no inline data
Towards a production-ready CephFS
● Focus on resilience:
1. Don't corrupt things
2. Stay up
3. Handle the corner cases
4. When something is wrong, tell me
5. Provide the tools to diagnose and fix problems
● Achieve this first within a conservative single-MDS configuration
Giant → Hammer timeframe
● Initial online fsck (a.k.a. forward scrub)
● Online diagnostics (`session ls`, MDS health alerts)
● Journal resilience & tools (cephfs-journal-tool)
● flock in the FUSE client
● Initial soft quota support
● General resilience: full OSDs, full metadata cache
FSCK and repair
● Recover from damage:
– Loss of data objects (which files are damaged?)
– Loss of metadata objects (what subtree is damaged?)
● Continuous verification:
– Are recursive stats consistent?
– Does metadata on disk match cache?
– Does file size metadata match data on disk?
● Repair:
– Automatic where possible
– Manual tools to enable support
Client management
● Current eviction is not 100% safe against rogue clients
– Update the client protocol to wait for the OSD blacklist
● Client metadata
– Initially domain name, mount point
– Extension to other identifiers?
Online diagnostics
● Bugs exposed so far relate to one client failing to release resources needed by another client: "my filesystem is frozen". Introduce new health messages:
– "client xyz is failing to respond to cache pressure"
– "client xyz is ignoring capability release messages"
– Add client metadata so messages can show domain names instead of IP addresses
● Opaque behaviour in the face of dead clients. Introduce `session ls`:
– Which clients does the MDS think are stale?
– Identify clients to evict with `session evict`
Journal resilience
● A bad journal prevents MDS recovery: "my MDS crashes on startup"
– Data loss
– Software bugs
● Updated the on-disk format to make recovery from damage easier
● New tool: cephfs-journal-tool
– Inspect the journal, search/filter
– Chop out unwanted entries/regions
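A sketch of a typical damage-inspection session (subcommands are from the giant-era tool; check cephfs-journal-tool --help on your version before doing any surgery):

cephfs-journal-tool journal inspect             # look for missing or corrupt journal regions
cephfs-journal-tool journal export backup.bin   # keep a copy before attempting any repair
cephfs-journal-tool event get list              # list the events held in the journal
# 'event splice' can then cut out a damaged range – see the help output for the filter syntax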
Handling resource limits
● Write a test, see what breaks!
● Full MDS cache:
– Require some free memory to make progress
– Require client cooperation to unpin cache objects
– Anticipate tuning required for cache behaviour: what should we evict?
● Full OSD cluster:
– Require explicit handling to abort with -ENOSPC
● MDS → RADOS flow control:
– Contention between I/O to flush cache and I/O to journal
Test, QA, bug fixes
● The answer to "Is CephFS production ready?"
● teuthology test framework:
– Long-running/thrashing tests
– Third-party FS correctness tests
– Python functional tests
● We dogfood CephFS internally
– Various kclient fixes discovered
– Motivation for new health monitoring metrics
● Third-party testing is extremely valuable
What's next?
● You tell us!
● Recent survey highlighted:
– FSCK hardening
– Multi-MDS hardening
– Quota support
● Which use cases will the community test with?
– General purpose
– Backup
– Hadoop
Reporting bugs
● Does the most recent development release or kernel fix your issue?
● What is your configuration? MDS config, Ceph version, client version, kclient or fuse
● What is your workload?
● Can you reproduce with debug logging enabled?
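For the last point, a sketch of turning up MDS logging (the MDS name is a placeholder; the log-and-debug link below lists all subsystems):

ceph daemon mds.a config set debug_mds 20   # verbose MDS logging at runtime
ceph daemon mds.a config set debug_ms 1     # message-level logging
# or set "debug mds = 20" under [mds] in ceph.conf, restart, and attach the log to your report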
http://ceph.com/resources/mailing-list-irc/
http://tracker.ceph.com/projects/ceph/issues
http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
Future
● Ceph Developer Summit:
– When: 8 October
– Where: online
● Post-Hammer work:
– Recent survey highlighted multi-MDS, quota support
– Testing with clustered Samba/NFS?