WHAT’S COMING IN CEPH OCTOPUS
Douglas Fuller
SC19 Ceph BoF, 2019.11.19
CEPH IS A UNIFIED STORAGE SYSTEM
● OBJECT: RGW (S3 and Swift object storage)
● BLOCK: RBD (virtual block device)
● FILE: CEPHFS (distributed network file system)
● LIBRADOS: low-level storage API
● RADOS: reliable, elastic, distributed storage layer with replication and erasure coding
FIVE THEMES
● Usability
● Performance
● Ecosystem
● Multi-site
● Quality
FIVE THEMES: USABILITY
ORCHESTRATOR API
● mgr API to interface with deployment tool
  ○ rook
  ○ ssh (run this command on that host)
● Expose provisioning functions to CLI, GUI
  ○ Create, destroy, start, stop daemons
  ○ Blink disk lights
● Replace ceph-deploy with ssh backend
  ○ Bootstrap: create mon + mgr on local host
  ○ “Day 2” operations to provision rest of cluster
  ○ Provision containers exclusively
● Pave way for cleanup of docs.ceph.com
● Automated upgrades
[Diagram: CLI and DASHBOARD drive the ceph-mgr orchestrator API, which dispatches to pluggable backends (Rook, ssh, ...) that manage the ceph-mon, ceph-osd, ceph-mds, ... daemons]
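To make the pluggable-backend idea concrete, here is a minimal Python sketch; the class and method names are hypothetical illustrations, not the actual ceph-mgr orchestrator module API.

```python
# Minimal sketch of the pluggable-backend idea: one interface for the
# CLI/dashboard, multiple deployment-tool backends behind it. Names are
# hypothetical, not the real ceph-mgr orchestrator API.
from abc import ABC, abstractmethod


class Orchestrator(ABC):
    """What the CLI/dashboard call, independent of deployment tool."""

    @abstractmethod
    def create_daemon(self, daemon_type: str, host: str) -> None: ...

    @abstractmethod
    def blink_device_light(self, host: str, device: str, on: bool) -> None: ...


class SSHOrchestrator(Orchestrator):
    """Backend that simply runs the right command on the target host."""

    def create_daemon(self, daemon_type: str, host: str) -> None:
        self._run(host, f"systemctl start ceph-{daemon_type}.target")

    def blink_device_light(self, host: str, device: str, on: bool) -> None:
        state = "on" if on else "off"
        self._run(host, f"blink-ident-led --{state} {device}")  # placeholder tool

    def _run(self, host: str, cmd: str) -> None:
        # A real backend would execute this over ssh; we just log it here.
        print(f"[ssh {host}] {cmd}")


class RookOrchestrator(Orchestrator):
    """Backend that would translate the same calls into Kubernetes CRDs."""

    def create_daemon(self, daemon_type: str, host: str) -> None:
        print(f"[rook] patch CephCluster CRD: add {daemon_type} on {host}")

    def blink_device_light(self, host: str, device: str, on: bool) -> None:
        print(f"[rook] request ident LED {'on' if on else 'off'} for {device} on {host}")


SSHOrchestrator().create_daemon("mon", "node1")
```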
DASHBOARD
● Integration with orchestrator
  ○ Initial parts done
  ○ Adding OSDs next
● Sidebar to display notifications and progress bars
● CephFS management features
  ○ Snapshots and quotas
● RGW multisite management
● Password improvements
  ○ Strength indicator
  ○ Enforce change after first login
MISC
● Improvements to progress bars (ceph -s output)
  ○ More RADOS events/processes supported (mark out, rebalancing)
  ○ Time estimates
● Health alert muting
  ○ TTL on mutes
  ○ Auto-unmute when alert changes or increases in severity
● Hands-off defaults
  ○ PG autoscaler on by default
  ○ Balancer on by default
● ‘ceph tell’ and ‘ceph daemon’ unification
  ○ Same expanded command set via either interface (over-the-wire or local unix socket)
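A small Python sketch of the mute semantics described above (hypothetical structure, not the actual ceph-mon code): a mute may carry a TTL and is dropped automatically when it expires or when the muted alert escalates.

```python
# Sketch of health-alert mute semantics: a mute may carry a TTL, and it
# is dropped automatically on expiry or when the alert gets worse.
# Hypothetical structure; not the actual ceph-mon implementation.
import time
from typing import Optional

SEVERITY = {"HEALTH_WARN": 1, "HEALTH_ERR": 2}


class MuteTable:
    def __init__(self) -> None:
        # alert code -> (severity at mute time, expiry timestamp or None)
        self._mutes: dict = {}

    def mute(self, code: str, severity: str, ttl_sec: Optional[float] = None) -> None:
        expiry = time.time() + ttl_sec if ttl_sec else None
        self._mutes[code] = (SEVERITY[severity], expiry)

    def is_muted(self, code: str, current_severity: str) -> bool:
        if code not in self._mutes:
            return False
        muted_sev, expiry = self._mutes[code]
        expired = expiry is not None and time.time() > expiry
        escalated = SEVERITY[current_severity] > muted_sev
        if expired or escalated:
            del self._mutes[code]  # auto-unmute
            return False
        return True


table = MuteTable()
table.mute("OSD_NEARFULL", "HEALTH_WARN", ttl_sec=4 * 3600)   # mute for 4 hours
print(table.is_muted("OSD_NEARFULL", "HEALTH_WARN"))  # True while mute holds
print(table.is_muted("OSD_NEARFULL", "HEALTH_ERR"))   # False: escalated, unmuted
```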
FIVE THEMES: QUALITY
TELEMETRY AND CRASH REPORTS
● Opt-in
  ○ Require re-opt-in if telemetry content expanded
  ○ Explicitly acknowledge data sharing license
● Telemetry channels
  ○ basic - cluster size, version, etc.
  ○ ident - contact info (off by default)
  ○ crash - anonymized crash metadata
  ○ device - device health (SMART) data
● Dashboard nag to enable?
● Backend tools to summarize, query, browse telemetry data
● Initial focus on crash reports
  ○ Identify crash signatures by stack trace (or other key properties)
  ○ Correlate crashes with ceph version or other properties
● Improved device failure prediction model
  ○ Predict error rate instead of binary failed/not-failed or life expectancy
  ○ Evaluating value of some vendor-specific data
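A hedged sketch of the crash-signature idea: hash the anonymized stack trace so the same crash reported by many clusters collapses into one signature that can be correlated with Ceph versions. Field names here are illustrative, not the actual telemetry schema.

```python
# Sketch of crash-signature grouping: hash the (anonymized) stack trace
# plus the daemon type into a stable signature, then correlate each
# signature with the versions it was seen in. Illustrative only.
import hashlib
import json


def crash_signature(frames: list, daemon_type: str) -> str:
    """Stable signature derived from the stack trace and daemon type."""
    key = json.dumps({"frames": frames, "daemon": daemon_type}, sort_keys=True)
    return hashlib.sha256(key.encode()).hexdigest()[:16]


reports = [
    {"frames": ["PrimaryLogPG::do_op", "OSD::dequeue_op"], "daemon": "osd", "version": "15.1.0"},
    {"frames": ["PrimaryLogPG::do_op", "OSD::dequeue_op"], "daemon": "osd", "version": "15.1.1"},
]

# Group reports by signature and record which versions each crash hits:
# the kind of query the backend summarize/browse tools would answer.
by_sig: dict = {}
for r in reports:
    sig = crash_signature(r["frames"], r["daemon"])
    by_sig.setdefault(sig, set()).add(r["version"])
print(by_sig)  # one signature, seen in 15.1.0 and 15.1.1
```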
FIVE THEMES: PERFORMANCE
RADOS QoS PROGRESS
● Partially implemented dmclock-based quality-of-service in OSD/librados
● Blocked because a deep queue in BlueStore obscures scheduling decisions
● Goal: understand, tune, and (hopefully) autotune BlueStore queue depth
  ○ Device type (HDD, SSD, hybrid)
  ○ Workload (IO size, type)
● Current status:
  ○ No luck yet with autotuning, but we have a semi-repeatable process to manually calibrate to a particular device
  ○ Pivot to rebasing dmclock patches, evaluate effectiveness for
    ■ Background vs client
    ■ Client vs client
[Diagram: client IO enters the OSD through a priority/QoS queue, drains into BlueStore’s ordered transaction queue, with replication traffic sent to each replica]
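A toy Python model of why the deep queue matters (all numbers made up for illustration): once ops sit in a deep downstream FIFO, the scheduler’s priority decisions were made long before the IO reaches the device, so a late-arriving high-priority op gets stuck; a shallow downstream queue keeps the decision close to the device.

```python
# Toy model: a priority/QoS queue dispatching into a bounded FIFO
# "transaction queue". Deep downstream queue -> a late high-priority op
# waits behind everything already queued; shallow -> it jumps ahead.
from collections import deque


def run(downstream_depth: int) -> list:
    low = deque(f"L{i}" for i in range(8))   # queued background ops
    high = deque()                           # latency-sensitive op, arrives later
    downstream = deque()                     # BlueStore-like ordered FIFO
    completed, t = [], 0
    while low or high or downstream:
        if t == 2:
            high.append("H0")                # high-priority op shows up at t=2
        # QoS scheduler: always prefer high-priority ops when dispatching.
        while len(downstream) < downstream_depth and (high or low):
            downstream.append(high.popleft() if high else low.popleft())
        if downstream:
            completed.append(downstream.popleft())  # device completes in FIFO order
        t += 1
    return completed


print(run(downstream_depth=8))  # deep queue: H0 completes last
print(run(downstream_depth=1))  # shallow queue: H0 completes right after arriving
```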
CEPHFS ASYNC CREATE, UNLINK
● Each CephFS metadata operation is a round-trip to the MDS
● untar, rm -r tend to be dominated by client/MDS network latency
● CephFS aggressively leases/delegates state/capabilities to the clients
● Allow async creates
  ○ Linux client can immediately return, queue async operation with MDS
  ○ Same for unlink
  ○ tar xf, rm -r, etc. become much faster!
● Except it’s complex
  ○ Current ordering between request, locks in MDS, and client capabilities
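A toy Python model of the fast path (hypothetical structure, not the kernel client): if the client already holds delegated capabilities on the directory, the create completes locally and the MDS round-trip is queued for later.

```python
# Toy model of async create: with delegated capabilities on a directory,
# the client completes the create locally and flushes the metadata op to
# the MDS later instead of blocking on a round-trip per file.
class Client:
    def __init__(self):
        self.dir_caps = set()     # dirs where the MDS delegated create rights
        self.pending = []         # async ops queued for the MDS

    def create(self, dirpath: str, name: str) -> str:
        if dirpath in self.dir_caps:
            # Fast path: return immediately; MDS round-trip happens later.
            self.pending.append(("create", dirpath, name))
            return "created (async)"
        # Slow path: synchronous round-trip to the MDS.
        return self._mds_request("create", dirpath, name)

    def flush(self):
        """Send queued async ops; ordering vs. MDS locks/caps is the hard part."""
        for op in self.pending:
            self._mds_request(*op)
        self.pending.clear()

    def _mds_request(self, *op) -> str:
        return f"MDS handled {op}"


c = Client()
c.dir_caps.add("/src")
print(c.create("/src", "a.c"))   # async fast path: no network wait
print(c.create("/tmp", "b.c"))   # sync: no caps held on /tmp
c.flush()
```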
PROJECT CRIMSON
What:
● Rewrite IO path using Seastar
  ○ Preallocate cores
  ○ One thread per core
  ○ Explicitly shard all data structures and work over cores
  ○ No locks and no blocking
  ○ Message passing between cores
  ○ Polling for IO
● DPDK, SPDK
  ○ Kernel bypass for network and storage IO
Why:
● Not just about how many IOPS we do…
● More about IOPS per CPU core
● Current Ceph is based on traditional multi-threaded programming model
● Context switching is too expensive when storage is almost as fast as memory
● New hardware devices coming
  ○ DIMM form-factor persistent memory
  ○ ZNS - zone-based SSDs
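A rough Python transliteration of the shard-per-core model (Crimson itself is C++ on Seastar; Python threads stand in for per-core reactors here, purely to illustrate the structure): every object maps to exactly one shard, each shard owns its data outright, and cross-shard work is a message on that shard’s queue rather than a lock around shared state.

```python
# Illustration of shard-per-core: static placement of objects onto
# shards, shard-private data (no locks), and message passing between
# shards. Python threads stand in for Seastar's one-thread-per-core
# reactors; this is a structural sketch, not Crimson code.
import queue
import threading

NUM_SHARDS = 4
shards = [queue.Queue() for _ in range(NUM_SHARDS)]


def owner(obj_id: str) -> int:
    return hash(obj_id) % NUM_SHARDS   # static placement: no shared state


def shard_loop(idx: int):
    store = {}                         # owned by this shard only: no locks
    while True:
        op, obj_id, data, reply = shards[idx].get()
        if op == "stop":
            break
        if op == "write":
            store[obj_id] = data
            reply.put("ok")


threads = [threading.Thread(target=shard_loop, args=(i,)) for i in range(NUM_SHARDS)]
for t in threads:
    t.start()

reply = queue.Queue()
shards[owner("rbd_data.1")].put(("write", "rbd_data.1", b"...", reply))
print(reply.get())                     # "ok" from the owning shard

for q in shards:
    q.put(("stop", None, None, None))
```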
FIVE THEMES: MULTI-SITE
CEPHFS MULTI-SITE REPLICATION
● Scheduling of snapshots and snapshot pruning
● Automate periodic snapshot + sync to remote cluster
  ○ Arbitrary source tree, destination in remote cluster
  ○ Sync snapshots via rsync
  ○ May support non-CephFS targets
● Discussing more sophisticated models
  ○ Bidirectional, loosely/eventually consistent sync
  ○ Simple conflict resolution behavior?
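A minimal sketch of the periodic snapshot-and-sync loop: CephFS snapshots are created by making a directory under .snap (and dropped by removing it); the paths, rsync flags, and retention policy below are illustrative assumptions.

```python
# Minimal snapshot + rsync loop: snapshot the source tree, sync the
# immutable snapshot (not the live tree) to the remote target, prune old
# snapshots. Paths and retention are illustrative.
import os
import subprocess
import time

SRC = "/mnt/cephfs/projects"            # arbitrary source tree
DEST = "backup-site:/srv/mirror"        # remote (possibly non-CephFS) target
KEEP = 24                               # snapshots to retain


def take_snapshot() -> str:
    name = time.strftime("sync-%Y%m%d-%H%M%S")
    os.mkdir(os.path.join(SRC, ".snap", name))    # mkdir in .snap = snapshot
    return name


def sync(name: str):
    subprocess.run(
        ["rsync", "-a", "--delete", os.path.join(SRC, ".snap", name) + "/", DEST],
        check=True,
    )


def prune():
    snaps = sorted(os.listdir(os.path.join(SRC, ".snap")))
    for old in snaps[:-KEEP]:
        os.rmdir(os.path.join(SRC, ".snap", old))  # rmdir drops the snapshot


while True:
    sync(take_snapshot())
    prune()
    time.sleep(3600)                    # hourly schedule
```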
RBD SNAPSHOT-BASED MIRRORING
● Today: RBD mirroring provides async replication to another cluster
  ○ Point-in-time (“crash”) consistency
  ○ Perfect for disaster recovery
  ○ Managed on per-pool or per-image basis
● rbd-nbd runner improvements to drive multiple images from one instance
● Vastly simplified setup procedure
  ○ One command on each cluster; copy+paste string blob
● New: snapshot-based mirroring mode
  ○ (Just like CephFS)
  ○ Same rbd-mirror daemon, same overall infrastructure/architecture
  ○ Will work with kernel RBD
    ■ (RBD mirroring today requires librbd, rbd-nbd, or similar)
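To show the underlying idea, here is a sketch using the long-standing export-diff/import-diff pattern: snapshot the image, ship only the delta since the last synced snapshot, repeat. The real rbd-mirror daemon automates this; the image name, remote cluster name, and interval are illustrative.

```python
# Sketch of snapshot-based replication via rbd export-diff/import-diff:
# each cycle takes a snapshot and ships only the changes since the last
# synced snapshot to the remote cluster's copy of the image.
import subprocess
import time

IMAGE = "rbd/vm-disk-1"


def snap(name: str):
    subprocess.run(["rbd", "snap", "create", f"{IMAGE}@{name}"], check=True)


def ship_delta(prev: str, cur: str):
    # Export the delta between two snapshots, pipe it into the remote
    # cluster's image.
    export = subprocess.Popen(
        ["rbd", "export-diff", "--from-snap", prev, f"{IMAGE}@{cur}", "-"],
        stdout=subprocess.PIPE,
    )
    subprocess.run(
        ["rbd", "--cluster", "remote", "import-diff", "-", IMAGE],
        stdin=export.stdout,
        check=True,
    )
    export.wait()


last = "base"
snap(last)
while True:
    cur = time.strftime("mirror-%Y%m%d-%H%M%S")
    snap(cur)
    ship_delta(last, cur)
    last = cur
    time.sleep(300)   # crash-consistent point-in-time every 5 minutes
```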
FIVE THEMES: ECOSYSTEM
NEW WITH CEPH-CSI AND ROOK
● Much investment in ceph-csi
  ○ RWO and RWX support via RBD and/or CephFS
  ○ Snapshots, clones, and so on
● Rook 1.1
  ○ Turn-key ceph-csi by default
  ○ Dynamic bucket provisioning
    ■ ObjectBucketClaim
  ○ External cluster mode
  ○ Run mons or OSDs on top of other PVs
  ○ Upgrade improvements
    ■ Wait for healthy between steps
    ■ Pod disruption budgets
  ○ Improved configuration
● Rook: RBD mirroring
  ○ Manage RBD mirroring via CRDs
  ○ Investment in better rbd-nbd support to provide RBD mirroring in Kubernetes
  ○ New, simpler snapshot-based mirroring
● Rook: RGW multisite
  ○ Federation of multiple clusters into single namespace
  ○ Site-granularity replication
● Rook: CephFS mirroring
  ○ Eventually...
SAMBA + CEPHFS
● Expose inode ‘birth time’
● Expose snapshot creation time (birth time)
● Protect snapshots from deletion
● Supplementary group handling
PROJECT ZIPPER
● Internal abstraction layer for buckets -- a bucket “VFS”
● Traditional RADOS backend
  ○ Index buckets in RADOS; stripe object data over RADOS objects
● Pass-through to external store
  ○ “Stateless” pass-through of bucket “foo” to external (e.g., S3) bucket
  ○ Auth credential translation
  ○ API translation (e.g., Azure Blob Storage backend)
● Layering
  ○ Compose a bucket from multiple layers
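A hedged Python sketch of the bucket-“VFS” shape: one abstract bucket interface with interchangeable backends (RADOS-native, pass-through, layered composite). All names are illustrative; Zipper itself lives inside RGW in C++.

```python
# Sketch of a bucket "VFS": one interface, swappable backends, and a
# layered composite. Illustrative names only.
from abc import ABC, abstractmethod


class Bucket(ABC):
    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...


class RadosBucket(Bucket):
    """Traditional backend: index + object data live in RADOS
    (a dict stands in for RADOS objects here)."""
    def __init__(self):
        self._objects = {}

    def get(self, key): return self._objects[key]
    def put(self, key, data): self._objects[key] = data


class PassthroughBucket(Bucket):
    """Stateless pass-through to an external (e.g., S3) bucket, where
    auth and API translation would happen."""
    def __init__(self, external_client, external_bucket: str):
        self._client, self._bucket = external_client, external_bucket

    def get(self, key): return self._client.download(self._bucket, key)
    def put(self, key, data): self._client.upload(self._bucket, key, data)


class LayeredBucket(Bucket):
    """Compose a bucket from layers: read from the first layer that has
    the key, write to the top layer."""
    def __init__(self, *layers: Bucket):
        self._layers = layers

    def get(self, key):
        for layer in self._layers:
            try:
                return layer.get(key)
            except KeyError:
                continue
        raise KeyError(key)

    def put(self, key, data): self._layers[0].put(key, data)


base = RadosBucket()
base.put("hello.txt", b"hi")
composed = LayeredBucket(RadosBucket(), base)   # overlay writes, fall-through reads
print(composed.get("hello.txt"))                # b"hi" from the lower layer
```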
FOR MORE INFORMATION
● https://ceph.io/
● Twitter: @ceph
● Docs: http://docs.ceph.com/
● Mailing lists: http://lists.ceph.io/
  ○ ceph-announce@ceph.io → announcements
  ○ ceph-users@ceph.io → user discussion
  ○ dev@ceph.io → developer discussion
● IRC: irc.oftc.net
  ○ #ceph, #ceph-devel
● GitHub: https://github.com/ceph/
● YouTube ‘Ceph’ channel
CEPHALOCON SEOUL 2020
● March 4-5 (Developer Summit: March 3)
● CFP open until December 6
● https://ceph.io/cephalocon/seoul-2020