Pan Liu
2017/06/10

Our journey to a high-performance, large-scale Ceph cluster at Alibaba
Customer Use Model
Use Model One
[Diagram: Ceph cluster with WAL groups (wal_group0, wal_group1) taking sync writes, and checkpoint queues (cp_TPX_Q0, cp_TPY_Q1) taking writes and reads]
Use Model Two
[Diagram: Broker1 and Broker2]
• WAL (like HBase's HLog or MySQL's binlog)
– Merges small requests into big ones.
– Create more WALs to improve broker throughput.
– Doesn't need much storage space; use high-performance SSDs to reduce RT (see the placement sketch below).
• Checkpoint (like HBase's SSTable or MySQL's data pages)
– Triggered by the WAL.
– Doesn't require high performance; HDDs are enough.
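The division of labor above maps naturally onto two pools with different CRUSH rules. A minimal librados sketch of that placement, assuming hypothetical pool and object names (wal_ssd_pool, cp_hdd_pool); this illustrates the model, it is not code from the talk:

```cpp
// Minimal librados sketch of the model above: WAL objects go to an
// SSD-backed pool for low RT, checkpoint objects to an HDD-backed pool.
// Pool/object names are hypothetical; build with: g++ wal.cc -lrados
#include <rados/librados.hpp>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init("admin");            // connect as client.admin
  cluster.conf_read_file(nullptr);  // read the default ceph.conf
  if (cluster.connect() < 0) return 1;

  librados::IoCtx wal_io, cp_io;
  cluster.ioctx_create("wal_ssd_pool", wal_io);  // CRUSH rule -> SSDs
  cluster.ioctx_create("cp_hdd_pool", cp_io);    // CRUSH rule -> HDDs

  // WAL: merge many small records into one buffer, one sync write.
  std::string recs = "record1record2record3";
  librados::bufferlist wal_bl;
  wal_bl.append(recs.data(), recs.size());
  wal_io.write_full("wal_group0.seg0", wal_bl);

  // Checkpoint: triggered by WAL growth; bulk write to cheap HDDs.
  std::string page = "materialized data page";
  librados::bufferlist cp_bl;
  cp_bl.append(page.data(), page.size());
  cp_io.write_full("cp_TPX_Q0.chunk0", cp_bl);

  cluster.shutdown();
  return 0;
}
```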
Improve the performance of recovery
Test environment
• HW/SW
– 3 servers, 24 OSDs
– ceph 10.2.5 + patches
– 100G rbd image, fio 4k randwrite
• Test timeline (seconds): fio starts at 0; one OSD is stopped, then restarted, at which point recovery begins (timeline marks at 60, 120, 180, and 300).
[Charts: client IOPS over the test timeline with partial recovery, async recovery, and partial + async recovery]
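Partial recovery pushes only the extents an object changed while the OSD was down instead of re-copying whole objects, and async recovery takes that work off the client I/O path. A toy sketch of the dirty-extent bookkeeping that partial recovery relies on (illustrative only; the real tracking in Ceph lives alongside the PG log):

```cpp
// Toy sketch of partial-recovery bookkeeping: remember which byte ranges
// of an object were modified while a replica was down, so recovery pushes
// only those extents instead of the whole object.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>

class DirtyExtents {
  std::map<uint64_t, uint64_t> extents_;  // offset -> length, disjoint
public:
  void record_write(uint64_t off, uint64_t len) {
    uint64_t end = off + len;
    auto it = extents_.lower_bound(off);
    // Merge with a preceding extent that overlaps or touches [off, end).
    if (it != extents_.begin()) {
      auto prev = std::prev(it);
      if (prev->first + prev->second >= off) it = prev;
    }
    // Absorb every following extent that starts inside [off, end].
    while (it != extents_.end() && it->first <= end) {
      off = std::min(off, it->first);
      end = std::max(end, it->first + it->second);
      it = extents_.erase(it);
    }
    extents_[off] = end - off;
  }
  uint64_t bytes_to_recover() const {
    uint64_t total = 0;
    for (const auto& e : extents_) total += e.second;
    return total;
  }
};

int main() {
  DirtyExtents d;
  d.record_write(4096, 4096);  // two 4K writes land while the OSD is down
  d.record_write(8192, 4096);  // adjacent, so they merge into one extent
  // Recovery pushes 8K instead of the whole (say, 4M) RBD object.
  std::cout << d.bytes_to_recover() << " bytes to push\n";
}
```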
Bug Fixes
• Bug 1: map data lost after PG remap.
• Bug 2: data inconsistency after reweight.
• A pull request with the complete solution will follow.
Commit Majority
Commit_majority PR
• commit_majority (quorum-ack sketch below)
– https://github.com/ceph/ceph/pull/15027
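The idea behind commit_majority, as I read the PR above: with three replicas, acknowledge the client once a majority of copies have committed instead of waiting for the slowest one, which is what trims the tail latencies in the results that follow. A stand-alone sketch of that quorum rule, with hypothetical names (this mirrors the policy in spirit, not the actual patch):

```cpp
// Simplified sketch of the commit_majority idea: ack the client when a
// majority of replicas (primary included) have committed, instead of
// waiting for all of them.
#include <functional>
#include <iostream>

class InFlightOp {
  int replicas_;                    // total copies, e.g. 3
  int commits_ = 0;                 // commit acks received so far
  bool acked_ = false;
  std::function<void()> on_commit_; // reply to client
public:
  InFlightOp(int replicas, std::function<void()> cb)
    : replicas_(replicas), on_commit_(std::move(cb)) {}

  void handle_commit_ack(bool majority_enabled) {
    ++commits_;
    int needed = majority_enabled ? replicas_ / 2 + 1  // e.g. 2 of 3
                                  : replicas_;         // classic: all 3
    if (!acked_ && commits_ >= needed) {
      acked_ = true;
      on_commit_();  // latency no longer bound by the slowest replica
    }
  }
};

int main() {
  InFlightOp op(3, [] { std::cout << "client acked\n"; });
  op.handle_commit_ack(true);  // primary commits
  op.handle_commit_ack(true);  // first replica commits: majority, ack now
  op.handle_commit_ack(true);  // slowest replica; already acked
}
```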
Test environment
• HW/SW
– 3 servers, 24 OSDs
– ceph 10.2.5 + patches
– 100G rbd image, fio 16k randwrite
• Test timeline (400 seconds): fio starts at 0 and ends at 400.
[Charts: IOPS and latency with commit_majority disabled vs. enabled, plus the fio summary]
commit_majority fio results (latency columns in us)

             min    max    avg    95%    99%    99.9%  99.99%  IOPS
disable      1711   14892  2401   2608   3824   9408   13504   415
enable       1610   6465   2124   2352   2480   2672   3856    469
optimize(%)  5.90   56.59  11.53  9.81   35.14  71.59  71.44   13.01
Async Queue Transaction
Motivation
• Currently the PG worker is doing heavy work:
– do_op() is a long, heavy function.
– PG_LOCK is held for the entire path.
• Can we offload some of the functions within do_op() to other thread pools, so the PG worker pipelines with those threads? (See the lock-scope sketch below.)
– Start by looking at objectstore->queue_transaction().
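To make the lock-scope point concrete, a minimal sketch contrasting the two shapes of do_op(): today the (simulated) slow queue_transaction() runs under the PG lock; the proposal just enqueues the prepared transaction and returns. pg_lock and tx_queue are stand-ins, not Ceph's actual types:

```cpp
// Sketch of the lock-scope change. The sleep fakes a slow
// queue_transaction(); a store-side thread pool would drain tx_queue.
#include <chrono>
#include <deque>
#include <mutex>
#include <thread>

std::mutex pg_lock;
std::deque<int> tx_queue;             // transactions handed to the store
std::mutex tx_queue_lock;

void slow_queue_transaction(int tx) { // stand-in for ObjectStore work
  std::this_thread::sleep_for(std::chrono::milliseconds(5));
  (void)tx;
}

// Before: PG_LOCK is held across the whole store submission.
void do_op_before(int tx) {
  std::lock_guard<std::mutex> l(pg_lock);
  // ...prepare op, update pg state...
  slow_queue_transaction(tx);         // other ops on this PG wait here
}

// After: the op is really just "queued"; PG_LOCK is released sooner.
void do_op_after(int tx) {
  std::lock_guard<std::mutex> l(pg_lock);
  // ...prepare op, update pg state...
  std::lock_guard<std::mutex> q(tx_queue_lock);
  tx_queue.push_back(tx);             // O(1); no store work under PG_LOCK
}

int main() {
  do_op_before(1);
  do_op_after(2);
}
```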
Offload some work from PG worker
[Diagram (current): Messenger → PG workers (×N) → OBJECT STORE; each PG worker prepares the op and queues the transaction itself]
[Diagram (proposed): Messenger → PG workers (×N) → OBJECT STORE; each PG worker prepares the op and really just "queues" it; queue_transaction() runs asynchronously, with the ObjectStore layer allocating a thread pool to execute the logic currently inside queue_transaction()]
Offload queue_transaction() to a thread pool at the ObjectStore layer, so the PG worker returns and releases the PG lock sooner.
OBJECT STORE (BlueStore)
[Diagram (current): PG worker creates the BlueStore transaction, reserves disk space, and submits aio; a RocksDB ksync worker batch-syncs RocksDB metadata and BlueStore small-data writes; completions go to the Finisher]
[Diagram (proposed): PG worker hands off to transaction workers (×N); each worker creates the BlueStore transaction, reserves disk space, submits aio, and syncs RocksDB metadata and small-data writes individually; completions go to the Finisher]
Deploy transaction workers to handle the transaction requests enqueued by the PG worker, and submit each individual transaction (both data and metadata) within a transaction worker's context (toy sketch below).
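A stand-alone sketch of that transaction-worker model: PG workers enqueue prepared transactions, and each worker commits its own transaction's data and metadata individually instead of funneling all metadata through one shared ksync batcher. All names are hypothetical; BlueStore's real path is far more involved:

```cpp
// Toy transaction-worker pool: PG workers enqueue transactions; each
// worker "submits aio" and "syncs metadata" for its own transaction.
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

struct Txn { int id; };

std::mutex m;
std::condition_variable cv;
std::deque<Txn> q;
bool done = false;

void transaction_worker(int wid) {
  for (;;) {
    Txn t;
    {
      std::unique_lock<std::mutex> l(m);
      cv.wait(l, [] { return !q.empty() || done; });
      if (q.empty()) return;
      t = q.front();
      q.pop_front();
    }
    // Per transaction, in this worker's context: reserve disk space,
    // submit aio for data, then commit this transaction's metadata
    // individually (no shared ksync batching).
    std::cout << "worker " << wid << " committed txn " << t.id << "\n";
  }
}

int main() {
  std::vector<std::thread> pool;
  for (int i = 0; i < 3; ++i) pool.emplace_back(transaction_worker, i);

  for (int id = 0; id < 6; ++id) {   // PG workers enqueue and return
    std::lock_guard<std::mutex> l(m);
    q.push_back({id});
    cv.notify_one();
  }
  {
    std::lock_guard<std::mutex> l(m);
    done = true;
  }
  cv.notify_all();
  for (auto& t : pool) t.join();
}
```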
Evaluations (1)
• Systems (roughly):
– 4 servers: 1 running the mon and fio processes, 3 running OSD processes.
– 12 OSD processes on the OSD servers, each managing one Intel NVMe drive.
– 25Gb NIC.
• fio workload:
– numjobs=32 or 64
– bs=4KB
– Sequential write and random write
Evaluations (2)
[Chart: Bandwidth (MB/s)]
Note: the difference between the "orange" and "grey" bars is that the orange bars still use the ksync thread to commit RocksDB transactions, while the grey bars commit RocksDB transactions within the transaction worker context.
Thanks