WHAT’S COMING IN CEPH OCTOPUS
Douglas Fuller
SC19 Ceph BoF, 2019.11.19
CEPH IS A UNIFIED STORAGE SYSTEM
● OBJECT: RGW (S3 and Swift object storage)
● BLOCK: RBD (virtual block device)
● FILE: CEPHFS (distributed network file system)
● LIBRADOS: low-level storage API
● RADOS: reliable, elastic, distributed storage layer with replication and erasure coding
FIVE THEMES
● Usability
● Performance
● Ecosystem
● Multi-site
● Quality
FIVE THEMES: USABILITY
ORCHESTRATOR API
● mgr API to interface with deployment tool
  ○ rook
  ○ ssh (run this command on that host)
● Expose provisioning functions to CLI, GUI
  ○ Create, destroy, start, stop daemons
  ○ Blink disk lights
● Replace ceph-deploy with ssh backend
  ○ Bootstrap: create mon + mgr on local host
  ○ “Day 2” operations to provision rest of cluster
  ○ Provision containers exclusively
● Pave way for cleanup of docs.ceph.com
● Automated upgrades
[Diagram: CLI and DASHBOARD drive the ceph-mgr orchestrator API, which dispatches to pluggable backends (Rook, ssh, ...) that manage the ceph-mon, ceph-osd, ceph-mds, ... daemons]
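To make the pluggable-backend idea concrete, here is a minimal Python sketch; the class and method names are hypothetical illustrations, not the actual ceph-mgr orchestrator module API.

```python
# Minimal sketch of the pluggable-backend idea: one interface for the
# CLI/dashboard, multiple deployment-tool backends behind it. Names are
# hypothetical, not the real ceph-mgr orchestrator API.
from abc import ABC, abstractmethod


class Orchestrator(ABC):
    """What the CLI/dashboard call, independent of deployment tool."""

    @abstractmethod
    def create_daemon(self, daemon_type: str, host: str) -> None: ...

    @abstractmethod
    def blink_device_light(self, host: str, device: str, on: bool) -> None: ...


class SSHOrchestrator(Orchestrator):
    """Backend that simply runs the right command on the target host."""

    def create_daemon(self, daemon_type: str, host: str) -> None:
        self._run(host, f"systemctl start ceph-{daemon_type}.target")

    def blink_device_light(self, host: str, device: str, on: bool) -> None:
        state = "on" if on else "off"
        self._run(host, f"blink-ident-led --{state} {device}")  # placeholder tool

    def _run(self, host: str, cmd: str) -> None:
        # A real backend would execute this over ssh; we just log it here.
        print(f"[ssh {host}] {cmd}")


class RookOrchestrator(Orchestrator):
    """Backend that would translate the same calls into Kubernetes CRDs."""

    def create_daemon(self, daemon_type: str, host: str) -> None:
        print(f"[rook] patch CephCluster CRD: add {daemon_type} on {host}")

    def blink_device_light(self, host: str, device: str, on: bool) -> None:
        print(f"[rook] request ident LED {'on' if on else 'off'} for {device} on {host}")


SSHOrchestrator().create_daemon("mon", "node1")
```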
DASHBOARD
● Integration with orchestrator
  ○ Initial parts done
  ○ Adding OSDs next
● Sidebar to display notifications and progress bars
● CephFS management features
  ○ Snapshots and quotas
● RGW multisite management
● Password improvements
  ○ Strength indicator
  ○ Enforce change after first login
MISC
● Improvements to progress bars (ceph -s output)
  ○ More RADOS events/processes supported (mark out, rebalancing)
  ○ Time estimates
● Health alert muting
  ○ TTL on mutes
  ○ Auto-unmute when alert changes or increases in severity
● Hands-off defaults
  ○ PG autoscaler on by default
  ○ Balancer on by default
● ‘ceph tell’ and ‘ceph daemon’ unification
  ○ Same expanded command set via either interface (over-the-wire or local unix socket)
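A small Python sketch of the mute semantics described above (hypothetical structure, not the actual ceph-mon code): a mute may carry a TTL and is dropped automatically when it expires or when the muted alert escalates.

```python
# Sketch of health-alert mute semantics: a mute may carry a TTL, and it
# is dropped automatically on expiry or when the alert gets worse.
# Hypothetical structure; not the actual ceph-mon implementation.
import time
from typing import Optional

SEVERITY = {"HEALTH_WARN": 1, "HEALTH_ERR": 2}


class MuteTable:
    def __init__(self) -> None:
        # alert code -> (severity at mute time, expiry timestamp or None)
        self._mutes: dict = {}

    def mute(self, code: str, severity: str, ttl_sec: Optional[float] = None) -> None:
        expiry = time.time() + ttl_sec if ttl_sec else None
        self._mutes[code] = (SEVERITY[severity], expiry)

    def is_muted(self, code: str, current_severity: str) -> bool:
        if code not in self._mutes:
            return False
        muted_sev, expiry = self._mutes[code]
        expired = expiry is not None and time.time() > expiry
        escalated = SEVERITY[current_severity] > muted_sev
        if expired or escalated:
            del self._mutes[code]  # auto-unmute
            return False
        return True


table = MuteTable()
table.mute("OSD_NEARFULL", "HEALTH_WARN", ttl_sec=4 * 3600)   # mute for 4 hours
print(table.is_muted("OSD_NEARFULL", "HEALTH_WARN"))  # True while mute holds
print(table.is_muted("OSD_NEARFULL", "HEALTH_ERR"))   # False: escalated, unmuted
```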
FIVE THEMES: QUALITY
TELEMETRY AND CRASH REPORTS
● Opt-in
  ○ Require re-opt-in if telemetry content expanded
  ○ Explicitly acknowledge data sharing license
● Telemetry channels
  ○ basic - cluster size, version, etc.
  ○ ident - contact info (off by default)
  ○ crash - anonymized crash metadata
  ○ device - device health (SMART) data
● Dashboard nag to enable?
● Backend tools to summarize, query, browse telemetry data
● Initial focus on crash reports
  ○ Identify crash signatures by stack trace (or other key properties)
  ○ Correlate crashes with ceph version or other properties
● Improved device failure prediction model
  ○ Predict error rate instead of binary failed/not-failed or life expectancy
  ○ Evaluating value of some vendor-specific data
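A hedged sketch of the crash-signature idea: hash the anonymized stack trace so the same crash reported by many clusters collapses into one signature that can be correlated with Ceph versions. Field names here are illustrative, not the actual telemetry schema.

```python
# Sketch of crash-signature grouping: hash the (anonymized) stack trace
# plus the daemon type into a stable signature, then correlate each
# signature with the versions it was seen in. Illustrative only.
import hashlib
import json


def crash_signature(frames: list, daemon_type: str) -> str:
    """Stable signature derived from the stack trace and daemon type."""
    key = json.dumps({"frames": frames, "daemon": daemon_type}, sort_keys=True)
    return hashlib.sha256(key.encode()).hexdigest()[:16]


reports = [
    {"frames": ["PrimaryLogPG::do_op", "OSD::dequeue_op"], "daemon": "osd", "version": "15.1.0"},
    {"frames": ["PrimaryLogPG::do_op", "OSD::dequeue_op"], "daemon": "osd", "version": "15.1.1"},
]

# Group reports by signature and record which versions each crash hits:
# the kind of query the backend summarize/browse tools would answer.
by_sig: dict = {}
for r in reports:
    sig = crash_signature(r["frames"], r["daemon"])
    by_sig.setdefault(sig, set()).add(r["version"])
print(by_sig)  # one signature, seen in 15.1.0 and 15.1.1
```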
FIVE THEMES: PERFORMANCE
RADOS QoS PROGRESS
● Partially implemented dmclock-based quality-of-service in OSD/librados
● Blocked because a deep queue in BlueStore obscures scheduling decisions
● Goal: understand, tune, and (hopefully) autotune BlueStore queue depth
  ○ Device type (HDD, SSD, hybrid)
  ○ Workload (IO size, type)
● Current status:
  ○ No luck yet with autotuning, but we have a semi-repeatable process to manually calibrate to a particular device
  ○ Pivot to rebasing dmclock patches, evaluate effectiveness for
    ■ Background vs client
    ■ Client vs client
[Diagram: client IO enters the OSD through a priority/QoS queue, drains into BlueStore’s ordered transaction queue, with replication traffic sent to each replica]
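A toy Python model of why the deep queue matters (all numbers made up for illustration): once ops sit in a deep downstream FIFO, the scheduler’s priority decisions were made long before the IO reaches the device, so a late-arriving high-priority op gets stuck; a shallow downstream queue keeps the decision close to the device.

```python
# Toy model: a priority/QoS queue dispatching into a bounded FIFO
# "transaction queue". Deep downstream queue -> a late high-priority op
# waits behind everything already queued; shallow -> it jumps ahead.
from collections import deque


def run(downstream_depth: int) -> list:
    low = deque(f"L{i}" for i in range(8))   # queued background ops
    high = deque()                           # latency-sensitive op, arrives later
    downstream = deque()                     # BlueStore-like ordered FIFO
    completed, t = [], 0
    while low or high or downstream:
        if t == 2:
            high.append("H0")                # high-priority op shows up at t=2
        # QoS scheduler: always prefer high-priority ops when dispatching.
        while len(downstream) < downstream_depth and (high or low):
            downstream.append(high.popleft() if high else low.popleft())
        if downstream:
            completed.append(downstream.popleft())  # device completes in FIFO order
        t += 1
    return completed


print(run(downstream_depth=8))  # deep queue: H0 completes last
print(run(downstream_depth=1))  # shallow queue: H0 completes right after arriving
```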
CEPHFS ASYNC CREATE, UNLINK
● Each CephFS metadata operation is a round-trip to the MDS
● untar, rm -r tend to be dominated by client/MDS network latency
● CephFS aggressively leases/delegates state/capabilities to the clients
● Allow async creates
  ○ Linux client can immediately return, queue async operation with MDS
  ○ Same for unlink
  ○ tar xf, rm -r, etc. become much faster!
● Except it’s complex
  ○ Current ordering between request, locks in MDS, and client capabilities
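A toy Python model of the fast path (hypothetical structure, not the kernel client): if the client already holds delegated capabilities on the directory, the create completes locally and the MDS round-trip is queued for later.

```python
# Toy model of async create: with delegated capabilities on a directory,
# the client completes the create locally and flushes the metadata op to
# the MDS later instead of blocking on a round-trip per file.
class Client:
    def __init__(self):
        self.dir_caps = set()     # dirs where the MDS delegated create rights
        self.pending = []         # async ops queued for the MDS

    def create(self, dirpath: str, name: str) -> str:
        if dirpath in self.dir_caps:
            # Fast path: return immediately; MDS round-trip happens later.
            self.pending.append(("create", dirpath, name))
            return "created (async)"
        # Slow path: synchronous round-trip to the MDS.
        return self._mds_request("create", dirpath, name)

    def flush(self):
        """Send queued async ops; ordering vs. MDS locks/caps is the hard part."""
        for op in self.pending:
            self._mds_request(*op)
        self.pending.clear()

    def _mds_request(self, *op) -> str:
        return f"MDS handled {op}"


c = Client()
c.dir_caps.add("/src")
print(c.create("/src", "a.c"))   # async fast path: no network wait
print(c.create("/tmp", "b.c"))   # sync: no caps held on /tmp
c.flush()
```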
PROJECT CRIMSON
What:
● Rewrite IO path using Seastar
  ○ Preallocate cores
  ○ One thread per core
  ○ Explicitly shard all data structures and work over cores
  ○ No locks and no blocking
  ○ Message passing between cores
  ○ Polling for IO
● DPDK, SPDK
  ○ Kernel bypass for network and storage IO
Why:
● Not just about how many IOPS we do…
● More about IOPS per CPU core
● Current Ceph is based on traditional multi-threaded programming model
● Context switching is too expensive when storage is almost as fast as memory
● New hardware devices coming
  ○ DIMM form-factor persistent memory
  ○ ZNS - zone-based SSDs
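A rough Python transliteration of the shard-per-core model (Crimson itself is C++ on Seastar; Python threads stand in for per-core reactors here, purely to illustrate the structure): every object maps to exactly one shard, each shard owns its data outright, and cross-shard work is a message on that shard’s queue rather than a lock around shared state.

```python
# Illustration of shard-per-core: static placement of objects onto
# shards, shard-private data (no locks), and message passing between
# shards. Python threads stand in for Seastar's one-thread-per-core
# reactors; this is a structural sketch, not Crimson code.
import queue
import threading

NUM_SHARDS = 4
shards = [queue.Queue() for _ in range(NUM_SHARDS)]


def owner(obj_id: str) -> int:
    return hash(obj_id) % NUM_SHARDS   # static placement: no shared state


def shard_loop(idx: int):
    store = {}                         # owned by this shard only: no locks
    while True:
        op, obj_id, data, reply = shards[idx].get()
        if op == "stop":
            break
        if op == "write":
            store[obj_id] = data
            reply.put("ok")


threads = [threading.Thread(target=shard_loop, args=(i,)) for i in range(NUM_SHARDS)]
for t in threads:
    t.start()

reply = queue.Queue()
shards[owner("rbd_data.1")].put(("write", "rbd_data.1", b"...", reply))
print(reply.get())                     # "ok" from the owning shard

for q in shards:
    q.put(("stop", None, None, None))
```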
FIVE THEMES: MULTI-SITE
CEPHFS MULTI-SITE REPLICATION
● Scheduling of snapshots and snapshot pruning
● Automate periodic snapshot + sync to remote cluster
  ○ Arbitrary source tree, destination in remote cluster
  ○ Sync snapshots via rsync
  ○ May support non-CephFS targets
● Discussing more sophisticated models
  ○ Bidirectional, loosely/eventually consistent sync
  ○ Simple conflict resolution behavior?
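A minimal sketch of the periodic snapshot-and-sync loop: CephFS snapshots are created by making a directory under .snap (and dropped by removing it); the paths, rsync flags, and retention policy below are illustrative assumptions.

```python
# Minimal snapshot + rsync loop: snapshot the source tree, sync the
# immutable snapshot (not the live tree) to the remote target, prune old
# snapshots. Paths and retention are illustrative.
import os
import subprocess
import time

SRC = "/mnt/cephfs/projects"            # arbitrary source tree
DEST = "backup-site:/srv/mirror"        # remote (possibly non-CephFS) target
KEEP = 24                               # snapshots to retain


def take_snapshot() -> str:
    name = time.strftime("sync-%Y%m%d-%H%M%S")
    os.mkdir(os.path.join(SRC, ".snap", name))    # mkdir in .snap = snapshot
    return name


def sync(name: str):
    subprocess.run(
        ["rsync", "-a", "--delete", os.path.join(SRC, ".snap", name) + "/", DEST],
        check=True,
    )


def prune():
    snaps = sorted(os.listdir(os.path.join(SRC, ".snap")))
    for old in snaps[:-KEEP]:
        os.rmdir(os.path.join(SRC, ".snap", old))  # rmdir drops the snapshot


while True:
    sync(take_snapshot())
    prune()
    time.sleep(3600)                    # hourly schedule
```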
RBD SNAPSHOT-BASED MIRRORING
● Today: RBD mirroring provides async replication to another cluster
  ○ Point-in-time (“crash”) consistency
  ○ Perfect for disaster recovery
  ○ Managed on per-pool or per-image basis
● rbd-nbd runner improvements to drive multiple images from one instance
● Vastly simplified setup procedure
  ○ One command on each cluster; copy+paste string blob
● New: snapshot-based mirroring mode
  ○ (Just like CephFS)
  ○ Same rbd-mirror daemon, same overall infrastructure/architecture
  ○ Will work with kernel RBD
    ■ (RBD mirroring today requires librbd, rbd-nbd, or similar)
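To show the underlying idea, here is a sketch using the long-standing export-diff/import-diff pattern: snapshot the image, ship only the delta since the last synced snapshot, repeat. The real rbd-mirror daemon automates this; the image name, remote cluster name, and interval are illustrative.

```python
# Sketch of snapshot-based replication via rbd export-diff/import-diff:
# each cycle takes a snapshot and ships only the changes since the last
# synced snapshot to the remote cluster's copy of the image.
import subprocess
import time

IMAGE = "rbd/vm-disk-1"


def snap(name: str):
    subprocess.run(["rbd", "snap", "create", f"{IMAGE}@{name}"], check=True)


def ship_delta(prev: str, cur: str):
    # Export the delta between two snapshots, pipe it into the remote
    # cluster's image.
    export = subprocess.Popen(
        ["rbd", "export-diff", "--from-snap", prev, f"{IMAGE}@{cur}", "-"],
        stdout=subprocess.PIPE,
    )
    subprocess.run(
        ["rbd", "--cluster", "remote", "import-diff", "-", IMAGE],
        stdin=export.stdout,
        check=True,
    )
    export.wait()


last = "base"
snap(last)
while True:
    cur = time.strftime("mirror-%Y%m%d-%H%M%S")
    snap(cur)
    ship_delta(last, cur)
    last = cur
    time.sleep(300)   # crash-consistent point-in-time every 5 minutes
```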
FIVE THEMES: ECOSYSTEM
NEW WITH CEPH-CSI AND ROOK
● Much investment in ceph-csi
  ○ RWO and RWX support via RBD and/or CephFS
  ○ Snapshots, clones, and so on
● Rook 1.1
  ○ Turn-key ceph-csi by default
  ○ Dynamic bucket provisioning
    ■ ObjectBucketClaim
  ○ External cluster mode
  ○ Run mons or OSDs on top of other PVs
  ○ Upgrade improvements
    ■ Wait for healthy between steps
    ■ Pod disruption budgets
  ○ Improved configuration
● Rook: RBD mirroring
  ○ Manage RBD mirroring via CRDs
  ○ Investment in better rbd-nbd support to provide RBD mirroring in Kubernetes
  ○ New, simpler snapshot-based mirroring
● Rook: RGW multisite
  ○ Federation of multiple clusters into single namespace
  ○ Site-granularity replication
● Rook: CephFS mirroring
  ○ Eventually...
SAMBA + CEPHFS
● Expose inode ‘birth time’
● Expose snapshot creation time (birth time)
● Protect snapshots from deletion
● Supplementary group handling
PROJECT ZIPPER
● Internal abstraction layer for buckets -- a bucket “VFS”
● Traditional RADOS backend
  ○ Index buckets in RADOS; stripe object data over RADOS objects
● Pass-through to external store
  ○ “Stateless” pass-through of bucket “foo” to external (e.g., S3) bucket
  ○ Auth credential translation
  ○ API translation (e.g., Azure Blob Storage backend)
● Layering
  ○ Compose a bucket from multiple layers
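A hedged Python sketch of the bucket-“VFS” shape: one abstract bucket interface with interchangeable backends (RADOS-native, pass-through, layered composite). All names are illustrative; Zipper itself lives inside RGW in C++.

```python
# Sketch of a bucket "VFS": one interface, swappable backends, and a
# layered composite. Illustrative names only.
from abc import ABC, abstractmethod


class Bucket(ABC):
    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...


class RadosBucket(Bucket):
    """Traditional backend: index + object data live in RADOS
    (a dict stands in for RADOS objects here)."""
    def __init__(self):
        self._objects = {}

    def get(self, key): return self._objects[key]
    def put(self, key, data): self._objects[key] = data


class PassthroughBucket(Bucket):
    """Stateless pass-through to an external (e.g., S3) bucket, where
    auth and API translation would happen."""
    def __init__(self, external_client, external_bucket: str):
        self._client, self._bucket = external_client, external_bucket

    def get(self, key): return self._client.download(self._bucket, key)
    def put(self, key, data): self._client.upload(self._bucket, key, data)


class LayeredBucket(Bucket):
    """Compose a bucket from layers: read from the first layer that has
    the key, write to the top layer."""
    def __init__(self, *layers: Bucket):
        self._layers = layers

    def get(self, key):
        for layer in self._layers:
            try:
                return layer.get(key)
            except KeyError:
                continue
        raise KeyError(key)

    def put(self, key, data): self._layers[0].put(key, data)


base = RadosBucket()
base.put("hello.txt", b"hi")
composed = LayeredBucket(RadosBucket(), base)   # overlay writes, fall-through reads
print(composed.get("hello.txt"))                # b"hi" from the lower layer
```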
FOR MORE INFORMATION
● https://ceph.io/
● Twitter: @ceph
● Docs: http://docs.ceph.com/
● Mailing lists: http://lists.ceph.io/
  ○ ceph-announce@ceph.io → announcements
  ○ ceph-users@ceph.io → user discussion
  ○ dev@ceph.io → developer discussion
● IRC: irc.oftc.net
  ○ #ceph, #ceph-devel
● GitHub: https://github.com/ceph/
● YouTube ‘Ceph’ channel
CEPHALOCON SEOUL 2020
● March 4-5 (Developer Summit: March 3)
● CFP open until December 6
● https://ceph.io/cephalocon/seoul-2020