
Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Page 1: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Pan Liu, 2017/06/10

Our journey to high performance large scale Ceph cluster at Alibaba

Page 2: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Customer Use Model

Page 3: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Use Model One

Page 4: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

[Diagram: Broker1 and Broker2 on top of Ceph; sync writes go to the WAL groups (wal_group0, wal_group1), and writes/reads go to the checkpoint queues (cp_TPX_Q0, cp_TPY_Q1).]

Use Model Two

Page 5: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

• WAL (like HBase's HLog or MySQL's binlog)

– Merges small requests into big ones.

– Create more WALs to improve broker throughput.

– Doesn't need much storage space; use high-performance SSDs to reduce RT (see the librbd sketch below).

• Checkpoint (like HBase's SSTable or MySQL's data pages)

– The WAL triggers checkpoints.

– Doesn't require high performance; HDDs are sufficient.

Use Model Two
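The WAL/checkpoint split maps naturally onto two RBD pools backed by different media. Below is a minimal sketch of the WAL half using the librados/librbd C API; the pool and image names ("ssd-wal", "wal_group0", "hdd-checkpoint") and the merged-record buffer are hypothetical illustrations, not something from the deck.

```cpp
/* Sketch: sync-append merged broker records to a WAL image on an
 * SSD-backed pool; checkpoint data would go to an HDD-backed pool.
 * Uses the librados/librbd C API. Build with: -lrados -lrbd */
#include <rados/librados.h>
#include <rbd/librbd.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t wal_io;
    rbd_image_t wal_img;
    uint64_t wal_off = 0;

    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, NULL) < 0 ||
        rados_connect(cluster) < 0) {
        fprintf(stderr, "cannot connect to cluster\n");
        return 1;
    }

    /* "ssd-wal" is a hypothetical SSD-backed pool holding wal_group0.
     * Error checks on the next two calls omitted for brevity. */
    rados_ioctx_create(cluster, "ssd-wal", &wal_io);
    rbd_open(wal_io, "wal_group0", &wal_img, NULL);

    /* Merge several small broker requests into one buffer ... */
    char buf[16 * 1024];
    memset(buf, 0, sizeof(buf));

    /* ... then issue a single sync write: write + flush. */
    ssize_t n = rbd_write(wal_img, wal_off, sizeof(buf), buf);
    if (n == (ssize_t)sizeof(buf) && rbd_flush(wal_img) == 0)
        wal_off += n;   /* WAL append durable; ack the brokers */

    /* A checkpoint would later replay the WAL into larger, colder
     * objects on an HDD-backed pool (e.g. "hdd-checkpoint"). */
    rbd_close(wal_img);
    rados_ioctx_destroy(wal_io);
    rados_shutdown(cluster);
    return 0;
}
```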

Page 6: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Improve the performance of recovery

Page 7: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Test environment

• HW/SW

– 3 servers, 24 OSDs

– ceph 10.2.5 + patches

– 100G rbd image, fio 4k randwrite

• Test timeline (seconds): marks at 0, 60, 120, 180, 300 — fio start, stop 1 OSD, restart the OSD, recovery begins.

Page 8: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

partial recovery

Page 9: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

async recovery

Page 10: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

partial + async recovery

Page 11: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Bug Fixes

• Bug 1: Map data is lost after a PG remap.

• Bug 2: Data becomes inconsistent after reweight.

• A pull request with the complete solution will be submitted later.

Page 12: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Commit Majority

Page 13: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Commit_majority PR

• commit_majority

– https://github.com/ceph/ceph/pull/15027
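The slide only links the PR; as a rough, hypothetical illustration of the idea (not the actual code in the PR), the sketch below completes the client callback once a majority of replica commits have arrived, instead of waiting for every replica. All names here are made up.

```cpp
// Rough illustration of the commit-majority idea (not the PR's code):
// complete the client callback once a majority of the replicas of an
// op have committed, rather than waiting for all of them.
#include <functional>
#include <iostream>

struct ReplicatedOp {
    int total_replicas;                 // e.g. pool size = 3
    int commit_acks = 0;                // commit acks received so far
    bool client_acked = false;
    std::function<void()> on_commit;    // client completion callback

    // Majority for N replicas: floor(N/2) + 1 (2 of 3, 3 of 5, ...).
    int majority() const { return total_replicas / 2 + 1; }

    // Called when the primary or one replica reports commit.
    void handle_commit_ack() {
        ++commit_acks;
        if (!client_acked && commit_acks >= majority()) {
            client_acked = true;        // ack early: majority reached;
            on_commit();                // with the feature disabled this
        }                               // would wait for all replicas
    }
};

int main() {
    ReplicatedOp op{3};
    op.on_commit = [] { std::cout << "ack client (majority committed)\n"; };
    op.handle_commit_ack();             // primary commit: 1/3
    op.handle_commit_ack();             // first replica:  2/3 -> ack now
    op.handle_commit_ack();             // second replica: 3/3 (arrives late)
}
```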

Page 14: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Test environment

• HW/SW

– 3 servers, 24 OSDs

– ceph 10.2.5 + patches

– 100G rbd image, fio 16k randwrite

• Test timeline (seconds): fio starts at 0 and ends at 400.

Page 15: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

commit_majority (IOPS)

[IOPS chart]

Page 16: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

commit_majority (latency)

[Latency chart: disable vs. enable]

Page 17: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

commit_majority (FIO)

                min     max     avg     95%     99%     99.9%   99.99%   IOPS
disable (us)    1711    14892   2401    2608    3824    9408    13504    415
enable (us)     1610    6465    2124    2352    2480    2672    3856     469
optimize (%)    5.90    56.59   11.53   9.81    35.14   71.59   71.44    13.01

(The first seven columns are fio latency values in microseconds; the last column is IOPS. The "optimize (%)" row is the improvement of enable over disable.)

Page 18: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Async Queue Transaction

Page 19: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Motivation

• Currently the PG worker does heavy work:

• do_op() is a long, heavy function.

• PG_LOCK is held during the entire path.

• Can we offload some of the work inside do_op() to other thread pools and pipeline the PG worker with those threads?

• Start by looking at objectstore->queue_transaction().

Page 20: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Offload some work from PG worker

[Diagram, current: Messenger → PG workers (…) → OBJECT STORE. Each PG worker prepares the op and queues the transactions itself.]

[Diagram, proposed: Messenger → PG workers (…) → OBJECT STORE. The PG worker prepares the op and the transaction is really just "queued": queue_transaction() becomes asynchronous, and the objectstore layer allocates a thread pool to execute the logic currently inside queue_transaction().]

Offload queue_transaction() to a thread pool at the objectstore layer, so the PG worker returns and releases the PG lock sooner.
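A generic sketch of this offload pattern, making no claims about the real Ceph classes: the PG worker pushes the transaction body onto an objectstore-side thread pool and returns at once (so the PG lock is released sooner), and a pool thread later executes the logic that queue_transaction() runs inline today. TxnPool and the Transaction alias are illustrative names only.

```cpp
// Sketch: PG worker hands the body of queue_transaction() to an
// objectstore-side thread pool and returns immediately, so the PG
// lock can be dropped sooner. Illustrative names, not Ceph code.
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Transaction = std::function<void()>;  // stand-in for an objectstore transaction

class TxnPool {
public:
    explicit TxnPool(int nthreads) {
        for (int i = 0; i < nthreads; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~TxnPool() {
        { std::lock_guard<std::mutex> l(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    // What the PG worker calls: the transaction is "really just queued".
    void queue_transaction(Transaction t) {
        { std::lock_guard<std::mutex> l(m_); q_.push(std::move(t)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            Transaction t;
            {
                std::unique_lock<std::mutex> l(m_);
                cv_.wait(l, [this] { return stop_ || !q_.empty(); });
                if (stop_ && q_.empty()) return;
                t = std::move(q_.front());
                q_.pop();
            }
            t();   // the logic currently inside queue_transaction()
        }
    }
    std::vector<std::thread> workers_;
    std::queue<Transaction> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
};

int main() {
    TxnPool pool(4);
    std::mutex pg_lock;                           // stand-in for PG_LOCK
    {
        std::lock_guard<std::mutex> l(pg_lock);   // do_op() holds the PG lock...
        pool.queue_transaction([] {
            std::cout << "execute transaction in a pool thread\n";
        });
    }   // ...and releases it right after queuing, not after the whole I/O path
}
```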

Page 21: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

OBJECT STORE (BlueStore)

[Diagram, current: PG worker → create the BlueStore transaction, reserve disk space, submit aio → RocksDB ksync worker batch-syncs RocksDB metadata and BlueStore small-data writes → Finisher.]

[Diagram, proposed: PG worker → transaction workers (…) → each worker creates the BlueStore transaction, reserves disk space, submits aio, and syncs RocksDB metadata and small-data writes individually → Finisher.]

Deploy transaction workers to handle the transaction requests enqueued by the PG worker, and submit each individual transaction (both data and metadata) within the transaction-worker context.
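The per-worker commit on the proposed side can be illustrated with the RocksDB C++ API: instead of a single ksync thread batching many transactions' metadata, each transaction worker syncs its own write batch. This is only a sketch of that design choice; the keys, values, and DB path are made up.

```cpp
// Sketch: each transaction worker syncs its own RocksDB write batch
// (metadata plus deferred small-data payload) instead of handing it
// to a shared ksync thread that batches many transactions.
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// What one transaction worker does at commit time.
void commit_txn(rocksdb::DB* db, int txn_id) {
    rocksdb::WriteBatch batch;
    batch.Put("onode::obj" + std::to_string(txn_id), "updated-metadata");
    batch.Put("deferred::obj" + std::to_string(txn_id), "small-write-payload");

    rocksdb::WriteOptions wo;
    wo.sync = true;                       // this worker fsyncs its own batch
    rocksdb::Status s = db->Write(wo, &batch);
    assert(s.ok());
}

int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options opts;
    opts.create_if_missing = true;
    rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/txn-worker-kv", &db);
    assert(s.ok());

    // Several transaction workers committing independently, in parallel,
    // rather than funneling through one batching ksync worker.
    std::vector<std::thread> workers;
    for (int i = 0; i < 3; ++i)
        workers.emplace_back(commit_txn, db, i);
    for (auto& t : workers) t.join();

    delete db;
}
```

The orange vs. grey bars on the evaluation slide compare exactly these two choices: a shared ksync thread batching RocksDB commits versus per-worker commits inside the transaction-worker context.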

Page 22: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Evaluations (1)

• Systems (roughly):

• 4 servers: 1 running the mon and fio processes, 3 running OSD processes.

• 12 OSD processes run on the OSD servers, each managing one Intel NVMe drive.

• 25Gb NIC

• Fio workload:

• numjobs = 32 or 64

• bs = 4KB

• Sequential write and random write

Page 23: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Evaluations (2)

[Chart: bandwidth (MB/s) for the fio workloads]

Note: the difference between the orange and grey bars is that the orange bars still use the ksync thread to commit RocksDB transactions, while the grey bars commit RocksDB transactions within the transaction-worker context.

Page 24: Ceph Day Beijing - Our journey to high performance large scale Ceph cluster at Alibaba

Thanks

