Date posted: 25-May-2015 · Category: Technology · Uploaded by: ceph-community
Ceph Day London - CephFS Update
Agenda
● Introduction to distributed filesystems
● Architectural overview
● Recent development
● Test & QA
Distributed filesystems...and why they are hard.
Interfaces to storage
● Object: Ceph RGW, S3, Swift
● Block (aka SAN): Ceph RBD, iSCSI, FC, SAS
● File (aka scale-out NAS): CephFS, GlusterFS, Lustre, proprietary filers
Interfaces to storage
[Diagram: feature matrix:
● OBJECT STORAGE (RGW): Native API, Multi-tenant, S3 & Swift, Keystone, Geo-Replication
● BLOCK STORAGE (RBD): OpenStack, Linux Kernel, iSCSI, Clones, Snapshots
● FILE SYSTEM (CephFS): Linux Kernel, POSIX, CIFS/NFS, HDFS, Distributed Metadata]
Object stores scale out well
● Last writer wins consistency
● Consistency rules only apply to one object at a time
● Clients are stateless (unless explicitly doing lock ops)
● No relationships exist between objects
● Objects have exactly one name
● Scale-out accomplished by mapping objects to nodes
● Single objects may be lost without affecting others
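The point about scale-out by mapping objects to nodes can be sketched in a few lines. The following is an illustrative stand-in (a simple hash-modulo scheme, not RADOS's actual CRUSH algorithm), showing why stateless clients and single-name objects make placement easy:

```python
import hashlib

def place_object(obj_name, nodes):
    """Map an object name to a storage node by stable hashing.
    Illustrative only: RADOS uses CRUSH, not hash-modulo placement."""
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["osd0", "osd1", "osd2", "osd3"]
# Placement depends only on the object's single name, so any stateless
# client can compute it independently and get the same answer:
assert place_object("object-42", nodes) == place_object("object-42", nodes)
```

Note that hash-modulo placement migrates most data when the node set changes, which is one reason real systems use schemes like CRUSH instead.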
POSIX filesystems are hard to scale out
● Extents written from multiple clients must win or lose on all-or-nothing basis → locking
● Inodes depend on one another (directory hierarchy)
● Clients are stateful: holding files open
● Users have local-filesystem latency expectations: applications assume FS client will do lots of metadata caching for them.
● Scale-out requires spanning inode/dentry relationships across servers
● Loss of data can damage whole subtrees
Failure cases increase complexity further
● What should we do when...?
– The filesystem is full
– A client goes dark
– An MDS goes dark
– Memory is running low
– Clients are competing for the same files
– Clients misbehave
● These are hard problems in distributed systems generally, and especially hard when we must uphold POSIX semantics designed for local systems.
Terminology
● inode: a file. Has unique ID, may be referenced by one or more dentries.
● dentry: a link between an inode and a directory
● directory: special type of inode that has 0 or more child dentries
● hard link: many dentries referring to the same inode
● These terms originate from the original (local disk) filesystems, where they described how a filesystem was represented on disk.
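The relationships between these terms can be made concrete with a toy model (a hypothetical illustration, not Ceph code):

```python
class Inode:
    """A file: has a unique ID, may be referenced by one or more dentries."""
    def __init__(self, ino):
        self.ino = ino

class Directory(Inode):
    """Special type of inode holding 0 or more child dentries."""
    def __init__(self, ino):
        super().__init__(ino)
        self.dentries = {}  # name -> Inode: each entry is a dentry

    def link(self, name, inode):
        self.dentries[name] = inode  # create a dentry

root = Directory(1)
f = Inode(2)
root.link("a.txt", f)
root.link("b.txt", f)  # hard link: two dentries referring to the same inode
assert root.dentries["a.txt"] is root.dentries["b.txt"]
```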
Architectural overview
CephFS architecture
● Dynamically balanced scale-out metadata
● Inherit flexibility/scalability of RADOS for data
● POSIX compatibility
● Beyond POSIX: Subtree snapshots, recursive statistics
Weil, Sage A., et al. "Ceph: A scalable, high-performance distributed file system." Proceedings of the 7th Symposium on Operating Systems Design and Implementation. USENIX Association, 2006. http://ceph.com/papers/weil-ceph-osdi06.pdf
Components
● Client: kernel, fuse, libcephfs
● Server: MDS daemon
● Storage: RADOS cluster (mons & OSDs)
Components
[Diagram: a Linux host running ceph.ko sends metadata to the MDS daemons and data to the OSDs; monitors (M) track cluster state]
From application to disk
[Diagram: Application → libcephfs / ceph-fuse / kernel client → client network protocol → ceph-mds → RADOS → disk]
Scaling out FS metadata
● Options for distributing metadata:
– by static subvolume
– by path hash
– by dynamic subtree
● Consider performance, ease of implementation
DYNAMIC SUBTREE PARTITIONING
Dynamic subtree placement
● Locality: get all the dentries in a directory from one MDS
● Support read-heavy workloads by replicating non-authoritative copies (cached with capabilities, just like clients do)
● In practice, work at the directory-fragment level in order to handle large directories
Data placement
● Stripe file contents across RADOS objects
– get full RADOS cluster bandwidth from clients
– delegate all placement/balancing to RADOS
● Control striping with layout vxattrs
– layouts also select between multiple data pools
● Deletion is a special case: clients mark deleted files 'stray'; the MDS sends the RADOS delete ops
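The striping idea can be illustrated with the simple default case. This sketch assumes the default layout (4 MB objects, trivial striping) and the conventional "<inode-hex>.<index>" data-object naming; real layouts add stripe_unit and stripe_count:

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed default CephFS object size

def data_object_for(ino, offset):
    """Which RADOS object holds byte `offset` of the file with inode `ino`?
    Sketch of the simple (non-striped) layout only."""
    index = offset // OBJECT_SIZE
    return "%x.%08x" % (ino, index)

# The first 4 MiB of inode 0x10000000000 land in one object,
# the next 4 MiB in the following one:
print(data_object_for(0x10000000000, 0))                # 10000000000.00000000
print(data_object_for(0x10000000000, 5 * 1024 * 1024))  # 10000000000.00000001
```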
Clients
● Two implementations:
– ceph-fuse/libcephfs
– kclient
● Interplay with the VFS page cache; efficiency is harder with FUSE (extraneous stats etc.)
● Client performance matters for single-client workloads
● A slow client can hold up others if it is hogging metadata locks: include clients in troubleshooting
● Future: more per-client performance stats, and perhaps per-client metadata QoS. Clients probably group into jobs or workloads.
● Future: may want to tag client I/O with a job ID (e.g. HPC workload, Samba client ID, container/VM ID)
Journaling and caching in MDS
● Metadata ops initially journaled to striped journal "file" in the metadata pool.
● I/O latency on metadata ops is sum of network latency and journal commit latency.
● Metadata remains pinned in in-memory cache until expired from journal.
Journaling and caching in MDS
● In some workloads we expect almost all metadata to stay in cache; in others it's more of a stream.
● Control cache size with mds_cache_size
● Cache eviction relies on client cooperation
● MDS journal replay not only recovers data but also warms up cache. Use standby replay to keep that cache warm.
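The cache options mentioned above might look like this in ceph.conf (an illustrative fragment; the values here are assumptions to tune per workload, not recommendations):

```ini
[mds]
; maximum number of inodes held in the MDS cache
mds cache size = 100000

; on a standby MDS: continuously replay the active MDS's journal
; so its cache is already warm at failover
mds standby replay = true
```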
Lookup by inode
● Sometimes we need an inode → path mapping:
– Hard links
– NFS handles
● Costly to store: mitigate by piggybacking paths ("backtraces") onto data objects
– Con: stores metadata in the data pool
– Con: extra I/Os to set backtraces
– Pro: disaster recovery from the data pool
● Future: improve backtrace writing latency?
Extra features
● Snapshots:
– Exploit RADOS snapshotting for file data
– … plus some clever code in the MDS
– Fast petabyte-scale snapshots
● Recursive statistics:
– Lazily updated
– Accessed via a vxattr
– Avoid spurious client I/O for df
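Recursive statistics are exposed as virtual xattrs on directories; for example (illustrative commands against a CephFS mount, using the recursive-stat attribute names from the Ceph docs):

```shell
# total bytes under the whole subtree, maintained lazily by the MDS
getfattr -n ceph.dir.rbytes /mnt/ceph/some/dir

# recursive file and subdirectory counts
getfattr -n ceph.dir.rfiles /mnt/ceph/some/dir
getfattr -n ceph.dir.rsubdirs /mnt/ceph/some/dir
```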
CephFS in practice
ceph-deploy mds create myserver
ceph osd pool create fs_data 64
ceph osd pool create fs_metadata 64
ceph fs new myfs fs_metadata fs_data
mount -t ceph x.x.x.x:6789:/ /mnt/ceph
Managing CephFS clients
● New in Giant: see hostnames of connected clients
● Client eviction is sometimes important:
– Skip the wait during the reconnect phase on MDS restart
– Allow others to access files locked by a crashed client
● Use OpTracker to inspect ongoing operations
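These diagnostics are driven through the MDS admin socket; for example (illustrative commands, assuming a daemon named mds.a):

```shell
ceph daemon mds.a session ls            # connected clients and their state
ceph daemon mds.a dump_ops_in_flight    # OpTracker: inspect ongoing operations
ceph daemon mds.a session evict <id>    # forcibly remove a client session
```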
CephFS tips
● Choose MDS servers with lots of RAM
● Investigate clients when diagnosing stuck/slow access
● Use a recent Ceph release and a recent kernel
● Use a conservative configuration:
– Single active MDS, plus one standby
– Dedicated MDS server
– Kernel client
– No snapshots, no inline data
Development update
[Diagram: the Ceph stack, with APP / HOST/VM / CLIENT consumers on top of:]
● RADOS: A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
● LIBRADOS: A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● RBD: A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
● RADOSGW: A bucket-based REST gateway, compatible with S3 and Swift
● CEPH FS: A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
(The slide rates RBD, RADOSGW, and LIBRADOS "AWESOME"; CephFS is "NEARLY AWESOME".)
Towards a production-ready CephFS
● Focus on resilience:
1. Don't corrupt things
2. Stay up
3. Handle the corner cases
4. When something is wrong, tell me
5. Provide the tools to diagnose and fix problems
● Achieve this first within a conservative single-MDS configuration
Giant → Hammer timeframe
● Initial online fsck (a.k.a. forward scrub)
● Online diagnostics (`session ls`, MDS health alerts)
● Journal resilience & tools (cephfs-journal-tool)
● flock in the FUSE client
● Initial soft quota support
● General resilience: full OSDs, full metadata cache
FSCK and repair
● Recover from damage:
– Loss of data objects (which files are damaged?)
– Loss of metadata objects (what subtree is damaged?)
● Continuous verification:
– Are recursive stats consistent?
– Does metadata on disk match cache?
– Does file size metadata match data on disk?
● Repair:
– Automatic where possible
– Manual tools to enable support
Client management
● Current eviction is not 100% safe against rogue clients
– Update to the client protocol to wait for OSD blacklist
● Client metadata
– Initially domain name, mount point
– Extension to other identifiers?
Online diagnostics
● Bugs exposed relate to one client failing to release resources for another: “my filesystem is frozen”. Introduce new health messages:
– “client xyz is failing to respond to cache pressure”
– “client xyz is ignoring capability release messages”
– Add client metadata so we can report domain names instead of IP addresses in messages
● Opaque behaviour in the face of dead clients. Introduce `session ls`:
– Which clients does the MDS think are stale?
– Identify clients to evict with `session evict`
Journal resilience
● A bad journal prevents MDS recovery (“my MDS crashes on startup”); causes include:
– Data loss
– Software bugs
● Updated the on-disk format to make recovery from damage easier
● New tool: cephfs-journal-tool
– Inspect the journal, search/filter
– Chop out unwanted entries/regions
Handling resource limits
● Write a test, see what breaks!
● Full MDS cache:
– Require some free memory to make progress
– Require client cooperation to unpin cache objects
– Anticipate tuning required for cache behaviour: what should we evict?
● Full OSD cluster:
– Requires explicit handling to abort with -ENOSPC
● MDS → RADOS flow control:
– Contention between I/O to flush the cache and I/O to the journal
Test, QA, bug fixes
● The answer to “Is CephFS production ready?”
● teuthology test framework:
– Long-running/thrashing tests
– Third-party FS correctness tests
– Python functional tests
● We dogfood CephFS internally
– Various kclient fixes discovered
– Motivation for new health monitoring metrics
● Third-party testing is extremely valuable
What's next?
● You tell us!
● Recent survey highlighted:
– FSCK hardening
– Multi-MDS hardening
– Quota support
● Which use cases will matter to the community?
– Backup
– Hadoop
– NFS/Samba gateway
– Other?
Reporting bugs
● Does the most recent development release or kernel fix your issue?
● What is your configuration? MDS config, Ceph version, client version, kclient or fuse
● What is your workload?
● Can you reproduce with debug logging enabled?
http://ceph.com/resources/mailing-list-irc/
http://tracker.ceph.com/projects/ceph/issues
http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
Future
● Ceph Developer Summit:
– When: 8 October
– Where: online
● Post-Hammer work:
– Recent survey highlighted multi-MDS and quota support
– Testing with clustered Samba/NFS?
Questions?
A STORAGE REVOLUTION
[Diagram: proprietary hardware + proprietary software + support & maintenance, versus standard hardware + open source software + enterprise products & services]
Copyright © 2014 by Inktank | Private and Confidential
ARCHITECTURAL COMPONENTS
● RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RGW: A web services gateway for object storage, compatible with S3 and Swift
● RBD: A reliable, fully-distributed block device with cloud platform integration
● CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management
OBJECT STORAGE DAEMONS
[Diagram: each OSD daemon serves one disk through a local filesystem (btrfs, xfs, or ext4); monitors (M) sit alongside]
RADOS CLUSTER
[Diagram: an application talks to a RADOS cluster of OSDs and monitors (M)]
RADOS COMPONENTS
● OSDs:
– 10s to 10,000s in a cluster
– One per disk (or one per SSD, RAID group…)
– Serve stored objects to clients
– Intelligently peer for replication & recovery
● Monitors:
– Maintain cluster membership and state
– Provide consensus for distributed decision-making
– Small, odd number
– Do not serve stored objects to clients
WHERE DO OBJECTS LIVE?
[Diagram: an application holds an object; which node in the cluster does it live on?]
A METADATA SERVER?
[Diagram: (1) the application asks a metadata server where the object lives, then (2) contacts the right node]
CALCULATED PLACEMENT
[Diagram: the application computes placement itself with a function F, e.g. hashing names into ranges A-G, H-N, O-T, U-Z]
EVEN BETTER: CRUSH!
[Diagram: CRUSH maps an object onto a set of OSDs in the RADOS cluster]
CRUSH IS A QUICK CALCULATION
[Diagram: the client computes the object's location itself; no lookup service is consulted]
CRUSH: DYNAMIC DATA PLACEMENT
● CRUSH: pseudo-random placement algorithm
– Fast calculation, no lookup
– Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
– Limited data migration on change
● Rule-based configuration
– Infrastructure topology aware
– Adjustable replication
– Weighting
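These properties can be illustrated with rendezvous (HRW) hashing, a simpler cousin of CRUSH that shares the same deterministic, lookup-free, stable flavor (a sketch, not the CRUSH algorithm itself):

```python
import hashlib

def score(obj, osd):
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    return int(hashlib.sha256(("%s:%s" % (obj, osd)).encode()).hexdigest(), 16)

def place(obj, osds, replicas=3):
    """Pick the `replicas` highest-scoring OSDs: fast calculation,
    repeatable, and requiring no lookup table."""
    return sorted(osds, key=lambda o: score(obj, o), reverse=True)[:replicas]

osds = ["osd%d" % i for i in range(6)]
before = place("myobject", osds)
# Stable mapping: removing an OSD only re-homes data that lived on it;
# placements that did not involve osd5 are unchanged.
after = place("myobject", [o for o in osds if o != "osd5"])
if "osd5" not in before:
    assert after == before
```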
ACCESSING A RADOS CLUSTER
[Diagram: an application links LIBRADOS, which talks to the RADOS cluster over a socket]
LIBRADOS: RADOS ACCESS FOR APPS
● Direct access to RADOS for applications
● C, C++, Python, PHP, Java, Erlang
● Direct access to storage nodes
● No HTTP overhead
THE RADOS GATEWAY
[Diagram: applications speak REST to RADOSGW instances, which use LIBRADOS over a socket to reach the RADOS cluster]
RADOSGW MAKES RADOS WEBBY
● RADOSGW: a REST-based object storage proxy
● Uses RADOS to store objects
● API supports buckets, accounts
● Usage accounting for billing
● Compatible with S3 and Swift applications
STORING VIRTUAL DISKS
[Diagram: a VM's virtual disk is provided by LIBRBD in the hypervisor, backed by the RADOS cluster]
SEPARATE COMPUTE FROM STORAGE
[Diagram: because images live in the RADOS cluster, a VM can be moved between hypervisors]
KERNEL MODULE FOR MAX FLEXIBLE!
[Diagram: a Linux host maps an RBD image directly via the KRBD kernel module]
RBD STORES VIRTUAL DISKS
● RADOS BLOCK DEVICE:
– Storage of disk images in RADOS
– Decouples VMs from host
– Images are striped across the cluster (pool)
– Snapshots
– Copy-on-write clones
● Support in:
– Mainline Linux kernel (2.6.39+)
– QEMU/KVM; native Xen coming soon
– OpenStack, CloudStack, Nebula, Proxmox
SEPARATE METADATA SERVER
[Diagram: a Linux host's kernel module sends metadata ops to the MDS and file data directly to the RADOS cluster]
SCALABLE METADATA SERVERS
● METADATA SERVER:
– Manages metadata for a POSIX-compliant shared filesystem
– Directory hierarchy
– File metadata (owner, timestamps, mode, etc.)
– Stores metadata in RADOS
– Does not serve file data to clients
– Only required for the shared filesystem
CEPH AND OPENSTACK
[Diagram: OpenStack (Keystone, Cinder, Glance, Nova, Swift) integrates with Ceph: Swift via RADOSGW/LIBRADOS, and Cinder/Glance/Nova via LIBRBD in the hypervisor, all on top of the RADOS cluster]
GETTING STARTED WITH CEPH
● Read about the latest version of Ceph: the latest stuff is always at http://ceph.com/get
● Deploy a test cluster using ceph-deploy: read the quick-start guide at http://ceph.com/qsg
● Read the rest of the docs: find docs for the latest release at http://ceph.com/docs
● Ask for help when you get stuck: community volunteers are waiting for you at http://ceph.com/help