Pan Liu
2017/06/10

Our journey to a high-performance, large-scale Ceph cluster at Alibaba
Customer Use Model
Use Model One
[Diagram: Ceph cluster with WAL groups (wal_group0, wal_group1) taking sync writes, and checkpoint queues (cp_TPX_Q0, cp_TPY_Q1) taking writes and reads]
Use Model Two
[Diagram: Broker1 and Broker2]
• WAL (like HBase's HLog or MySQL's binlog)
– Merges small requests into big ones.
– Create more WALs to improve broker throughput.
– Doesn't need much storage space; use high-performance SSDs to reduce RT (see the placement sketch below).
• Checkpoint (like HBase's SSTable or MySQL's data pages)
– Triggered by the WAL.
– Doesn't require high performance; HDDs are enough.
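The division of labor above maps naturally onto two pools with different CRUSH rules. A minimal librados sketch of that placement, assuming hypothetical pool and object names (wal_ssd_pool, cp_hdd_pool); this illustrates the model, it is not code from the talk:

```cpp
// Minimal librados sketch of the model above: WAL objects go to an
// SSD-backed pool for low RT, checkpoint objects to an HDD-backed pool.
// Pool/object names are hypothetical; build with: g++ wal.cc -lrados
#include <rados/librados.hpp>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init("admin");            // connect as client.admin
  cluster.conf_read_file(nullptr);  // read the default ceph.conf
  if (cluster.connect() < 0) return 1;

  librados::IoCtx wal_io, cp_io;
  cluster.ioctx_create("wal_ssd_pool", wal_io);  // CRUSH rule -> SSDs
  cluster.ioctx_create("cp_hdd_pool", cp_io);    // CRUSH rule -> HDDs

  // WAL: merge many small records into one buffer, one sync write.
  std::string recs = "record1record2record3";
  librados::bufferlist wal_bl;
  wal_bl.append(recs.data(), recs.size());
  wal_io.write_full("wal_group0.seg0", wal_bl);

  // Checkpoint: triggered by WAL growth; bulk write to cheap HDDs.
  std::string page = "materialized data page";
  librados::bufferlist cp_bl;
  cp_bl.append(page.data(), page.size());
  cp_io.write_full("cp_TPX_Q0.chunk0", cp_bl);

  cluster.shutdown();
  return 0;
}
```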
Improve the performance of recovery
Test environment
• HW/SW
– 3 servers, 24 OSDs
– ceph 10.2.5 + patches
– 100G rbd image, fio 4k randwrite
• Test timeline (seconds): fio starts at 0; one OSD is stopped, then restarted, at which point recovery begins (timeline marks at 60, 120, 180, and 300).
[Charts: client IOPS over the test timeline with partial recovery, async recovery, and partial + async recovery]
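Partial recovery pushes only the extents an object changed while the OSD was down instead of re-copying whole objects, and async recovery takes that work off the client I/O path. A toy sketch of the dirty-extent bookkeeping that partial recovery relies on (illustrative only; the real tracking in Ceph lives alongside the PG log):

```cpp
// Toy sketch of partial-recovery bookkeeping: remember which byte ranges
// of an object were modified while a replica was down, so recovery pushes
// only those extents instead of the whole object.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>

class DirtyExtents {
  std::map<uint64_t, uint64_t> extents_;  // offset -> length, disjoint
public:
  void record_write(uint64_t off, uint64_t len) {
    uint64_t end = off + len;
    auto it = extents_.lower_bound(off);
    // Merge with a preceding extent that overlaps or touches [off, end).
    if (it != extents_.begin()) {
      auto prev = std::prev(it);
      if (prev->first + prev->second >= off) it = prev;
    }
    // Absorb every following extent that starts inside [off, end].
    while (it != extents_.end() && it->first <= end) {
      off = std::min(off, it->first);
      end = std::max(end, it->first + it->second);
      it = extents_.erase(it);
    }
    extents_[off] = end - off;
  }
  uint64_t bytes_to_recover() const {
    uint64_t total = 0;
    for (const auto& e : extents_) total += e.second;
    return total;
  }
};

int main() {
  DirtyExtents d;
  d.record_write(4096, 4096);  // two 4K writes land while the OSD is down
  d.record_write(8192, 4096);  // adjacent, so they merge into one extent
  // Recovery pushes 8K instead of the whole (say, 4M) RBD object.
  std::cout << d.bytes_to_recover() << " bytes to push\n";
}
```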
Bug Fixes
• Bug 1: map data lost after PG remap.
• Bug 2: data inconsistency after reweight.
• A pull request with the complete solution will follow.
Commit Majority
Commit_majority PR
• commit_majority (quorum-ack sketch below)
– https://github.com/ceph/ceph/pull/15027
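The idea behind commit_majority, as I read the PR above: with three replicas, acknowledge the client once a majority of copies have committed instead of waiting for the slowest one, which is what trims the tail latencies in the results that follow. A stand-alone sketch of that quorum rule, with hypothetical names (this mirrors the policy in spirit, not the actual patch):

```cpp
// Simplified sketch of the commit_majority idea: ack the client when a
// majority of replicas (primary included) have committed, instead of
// waiting for all of them.
#include <functional>
#include <iostream>

class InFlightOp {
  int replicas_;                    // total copies, e.g. 3
  int commits_ = 0;                 // commit acks received so far
  bool acked_ = false;
  std::function<void()> on_commit_; // reply to client
public:
  InFlightOp(int replicas, std::function<void()> cb)
    : replicas_(replicas), on_commit_(std::move(cb)) {}

  void handle_commit_ack(bool majority_enabled) {
    ++commits_;
    int needed = majority_enabled ? replicas_ / 2 + 1  // e.g. 2 of 3
                                  : replicas_;         // classic: all 3
    if (!acked_ && commits_ >= needed) {
      acked_ = true;
      on_commit_();  // latency no longer bound by the slowest replica
    }
  }
};

int main() {
  InFlightOp op(3, [] { std::cout << "client acked\n"; });
  op.handle_commit_ack(true);  // primary commits
  op.handle_commit_ack(true);  // first replica commits: majority, ack now
  op.handle_commit_ack(true);  // slowest replica; already acked
}
```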
Test environment
• HW/SW
– 3 servers, 24 OSDs
– ceph 10.2.5 + patches
– 100G rbd image, fio 16k randwrite
• Test timeline (400 seconds): fio starts at 0 and ends at 400.
[Charts: IOPS and latency with commit_majority disabled vs. enabled, plus the fio summary]
commit_majority fio results (latency columns in us)

             min    max    avg    95%    99%    99.9%  99.99%  IOPS
disable      1711   14892  2401   2608   3824   9408   13504   415
enable       1610   6465   2124   2352   2480   2672   3856    469
optimize(%)  5.90   56.59  11.53  9.81   35.14  71.59  71.44   13.01
Async Queue Transaction
Motivation
• Currently the PG worker is doing heavy work:
– do_op() is a long, heavy function.
– PG_LOCK is held for the entire path.
• Can we offload some of the functions within do_op() to other thread pools, so the PG worker pipelines with those threads? (See the lock-scope sketch below.)
– Start by looking at objectstore->queue_transaction().
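To make the lock-scope point concrete, a minimal sketch contrasting the two shapes of do_op(): today the (simulated) slow queue_transaction() runs under the PG lock; the proposal just enqueues the prepared transaction and returns. pg_lock and tx_queue are stand-ins, not Ceph's actual types:

```cpp
// Sketch of the lock-scope change. The sleep fakes a slow
// queue_transaction(); a store-side thread pool would drain tx_queue.
#include <chrono>
#include <deque>
#include <mutex>
#include <thread>

std::mutex pg_lock;
std::deque<int> tx_queue;             // transactions handed to the store
std::mutex tx_queue_lock;

void slow_queue_transaction(int tx) { // stand-in for ObjectStore work
  std::this_thread::sleep_for(std::chrono::milliseconds(5));
  (void)tx;
}

// Before: PG_LOCK is held across the whole store submission.
void do_op_before(int tx) {
  std::lock_guard<std::mutex> l(pg_lock);
  // ...prepare op, update pg state...
  slow_queue_transaction(tx);         // other ops on this PG wait here
}

// After: the op is really just "queued"; PG_LOCK is released sooner.
void do_op_after(int tx) {
  std::lock_guard<std::mutex> l(pg_lock);
  // ...prepare op, update pg state...
  std::lock_guard<std::mutex> q(tx_queue_lock);
  tx_queue.push_back(tx);             // O(1); no store work under PG_LOCK
}

int main() {
  do_op_before(1);
  do_op_after(2);
}
```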
Offload some work from PG worker
[Diagram (current): Messenger → PG workers (×N) → OBJECT STORE; each PG worker prepares the op and queues the transaction itself]
[Diagram (proposed): Messenger → PG workers (×N) → OBJECT STORE; each PG worker prepares the op and really just "queues" it; queue_transaction() runs asynchronously, with the ObjectStore layer allocating a thread pool to execute the logic currently inside queue_transaction()]
Offload queue_transaction() to a thread pool at the ObjectStore layer, so the PG worker returns and releases the PG lock sooner.
OBJECT STORE (BlueStore)
[Diagram (current): PG worker creates the BlueStore transaction, reserves disk space, and submits aio; a RocksDB ksync worker batch-syncs RocksDB metadata and BlueStore small-data writes; completions go to the Finisher]
[Diagram (proposed): PG worker hands off to transaction workers (×N); each worker creates the BlueStore transaction, reserves disk space, submits aio, and syncs RocksDB metadata and small-data writes individually; completions go to the Finisher]
Deploy transaction workers to handle the transaction requests enqueued by the PG worker, and submit each individual transaction (both data and metadata) within a transaction worker's context (toy sketch below).
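A stand-alone sketch of that transaction-worker model: PG workers enqueue prepared transactions, and each worker commits its own transaction's data and metadata individually instead of funneling all metadata through one shared ksync batcher. All names are hypothetical; BlueStore's real path is far more involved:

```cpp
// Toy transaction-worker pool: PG workers enqueue transactions; each
// worker "submits aio" and "syncs metadata" for its own transaction.
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

struct Txn { int id; };

std::mutex m;
std::condition_variable cv;
std::deque<Txn> q;
bool done = false;

void transaction_worker(int wid) {
  for (;;) {
    Txn t;
    {
      std::unique_lock<std::mutex> l(m);
      cv.wait(l, [] { return !q.empty() || done; });
      if (q.empty()) return;
      t = q.front();
      q.pop_front();
    }
    // Per transaction, in this worker's context: reserve disk space,
    // submit aio for data, then commit this transaction's metadata
    // individually (no shared ksync batching).
    std::cout << "worker " << wid << " committed txn " << t.id << "\n";
  }
}

int main() {
  std::vector<std::thread> pool;
  for (int i = 0; i < 3; ++i) pool.emplace_back(transaction_worker, i);

  for (int id = 0; id < 6; ++id) {   // PG workers enqueue and return
    std::lock_guard<std::mutex> l(m);
    q.push_back({id});
    cv.notify_one();
  }
  {
    std::lock_guard<std::mutex> l(m);
    done = true;
  }
  cv.notify_all();
  for (auto& t : pool) t.join();
}
```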
Evaluations (1)
• Systems (roughly):
– 4 servers: 1 running the mon and fio processes, 3 running OSD processes.
– 12 OSD processes on the OSD servers, each managing one Intel NVMe drive.
– 25Gb NIC.
• fio workload:
– numjobs=32 or 64
– bs=4KB
– Sequential write and random write
Evaluations (2)
[Chart: Bandwidth (MB/s)]
Note: the difference between the "orange" and "grey" bars is that the orange bars still use the ksync thread to commit RocksDB transactions, while the grey bars commit RocksDB transactions within the transaction worker context.
Thanks