
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Transcript
Page 1: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Unleash Ceph over Flash Storage Potential with Mellanox High-Performance Interconnect

Ceph Day Berlin – Apr 28th, 2015

Oren Duer, Director of Storage Software, Software R&D

Page 2: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance


Leading Supplier of End-to-End Interconnect Solutions

Diagram: end-to-end interconnect across server/compute, switch/gateway, and storage tiers, front- and back-end, using Virtual Protocol Interconnect: 56Gb InfiniBand & FCoIB, and 10/40/56GbE & FCoE.

Comprehensive End-to-End InfiniBand and Ethernet Portfolio

ICs Adapter Cards Switches/Gateways Host/Fabric Software Metro / WAN Cables/Modules


Page 3: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance


How Customers Deploy Ceph with Mellanox Interconnect

Building Scalable, High-Performance Storage Solutions

Cluster network @ 40Gb Ethernet

Clients @ 10G/40Gb Ethernet

High performance at Low Cost

Allows more capacity per OSD

Lower cost/TB

Flash Deployment Options

• All HDD (no flash)

• Flash for OSD journals (see the sketch after this list)

• 100% flash in OSDs

Faster Cluster Network Improves Price/Capacity and Price/Performance
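The flash-for-journals option typically means putting each OSD's journal on an SSD or NVMe partition while the data stays on HDD. Below is a minimal sketch of how that might be scripted with the era-appropriate ceph-disk tool; the device paths and the one-journal-partition-per-OSD layout are illustrative assumptions, not details from this deck.

```python
#!/usr/bin/env python3
"""Hedged sketch: prepare HDD-backed OSDs whose journals live on a flash device.

Assumes the pre-ceph-volume 'ceph-disk' tooling (Hammer era) and hypothetical
device names; run as root and adjust to the actual hardware layout.
"""
import subprocess

# Hypothetical devices: spinning disks for data, one NVMe device carved into
# one journal partition per OSD.
DATA_DEVICES = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]
JOURNAL_PARTITIONS = ["/dev/nvme0n1p1", "/dev/nvme0n1p2", "/dev/nvme0n1p3"]


def prepare_osd(data_dev: str, journal_dev: str) -> None:
    """'ceph-disk prepare <data> <journal>' builds the OSD data filesystem on
    the HDD and points its journal at the flash partition."""
    subprocess.run(["ceph-disk", "prepare", data_dev, journal_dev], check=True)


if __name__ == "__main__":
    for data_dev, journal_dev in zip(DATA_DEVICES, JOURNAL_PARTITIONS):
        prepare_osd(data_dev, journal_dev)
```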

Page 4: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Ceph Deployment Using 10GbE and 40GbE

Cluster (Private) Network @ 40/56GbE

• Smooth HA, unblocked heartbeats, efficient data balancing

Throughput Clients @ 40/56GbE

• Guarantees line rate for high ingress/egress clients

IOPS Clients @ 10GbE or 40/56GbE (public network @ 10/40GbE)

• 100K+ IOPS/client @ 4K blocks

Topology: client nodes reach the Ceph nodes (monitors, OSDs, MDS) and the admin node over the 10/40GbE public network, while the OSDs replicate over the 40GbE cluster network; a hedged ceph.conf sketch of that split follows.
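The public/cluster split shown on this slide maps onto two standard ceph.conf options. The sketch below writes a minimal [global] section; the 10.x subnets and the sample file name are placeholders, not values from the deck.

```python
#!/usr/bin/env python3
"""Hedged sketch: generate the [global] section separating Ceph's public
(client-facing) network from its cluster (replication/heartbeat) network.
Subnets are hypothetical placeholders."""
import configparser

conf = configparser.ConfigParser()
conf["global"] = {
    # Clients and monitors talk on the public network (10/40GbE in the deck).
    "public network": "10.10.0.0/24",
    # OSD replication, heartbeats, and recovery use the cluster network (40GbE).
    "cluster network": "10.20.0.0/24",
}

# Written to a sample file so a real /etc/ceph/ceph.conf is never clobbered.
with open("ceph.conf.sample", "w") as f:
    conf.write(f)
```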

Throughput testing: fio benchmark, 8M blocks, 20GB file, 128 parallel jobs, RBD kernel driver with Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2. IOPS testing: fio benchmark, 4K blocks, 20GB file, 128 parallel jobs, same software stack.

20x higher throughput and 4x higher IOPS with 40Gb Ethernet clients! (http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Ceph_over_High_Performance_Networks.pdf)
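The footnote above gives enough detail to approximate the test: map an image through the kernel RBD driver and drive it with fio at the quoted block sizes and job count. The sketch below is a reconstruction under assumptions; the pool and image names, the libaio engine with direct I/O, and the sequential-vs-random access choices are not stated in the deck.

```python
#!/usr/bin/env python3
"""Hedged sketch of the slide's fio tests: 8M blocks for throughput, 4K blocks
for IOPS, 20GB per file, 128 parallel jobs, against a kernel-mapped RBD device.
Pool/image names and the libaio/direct=1/access-pattern choices are assumptions."""
import subprocess

POOL, IMAGE = "rbd", "bench-img"   # hypothetical names


def map_rbd() -> str:
    """Map the image through the kernel RBD driver and return the block device."""
    out = subprocess.run(["rbd", "map", f"{POOL}/{IMAGE}"],
                         check=True, capture_output=True, text=True)
    return out.stdout.strip()          # e.g. /dev/rbd0


def run_fio(device: str, name: str, block_size: str, rw: str) -> None:
    subprocess.run([
        "fio",
        f"--name={name}",
        f"--filename={device}",
        "--ioengine=libaio", "--direct=1",
        f"--rw={rw}",
        f"--bs={block_size}",
        "--size=20g",
        "--numjobs=128",
        "--group_reporting",
    ], check=True)


if __name__ == "__main__":
    dev = map_rbd()
    run_fio(dev, "throughput-8m", "8m", "read")      # throughput test
    run_fio(dev, "iops-4k", "4k", "randread")        # IOPS test
```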


Page 5: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance


OK, But How Do We Further Improve IOPS? We Use RDMA!

Diagram: without RDMA, data travels from the application buffer in user space through OS kernel buffers and the NIC, over TCP/IP, and back up through the remote kernel into the remote application buffer; with RDMA over InfiniBand or Ethernet, the HCA moves data directly between the two application buffers, bypassing the kernel on both sides (the figure marks the user, kernel, and hardware layers in each rack).

Page 6: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Ceph Throughput using 40Gb and 56Gb Ethernet

One OSD, One Client, 8 Threads

Chart: throughput in MB/s (axis up to 6000) for 64KB and 256KB random reads, comparing 40Gb TCP (MTU=1500), 56Gb TCP (MTU=4500), and 56Gb RDMA (MTU=4500).

Page 7: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance


Optimizing Ceph for Flash

By SanDisk & Mellanox

Page 8: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Ceph Flash Optimization

Highlights Compared to Stock Ceph

• Read performance up to 8x better

• Write performance up to 2x better with tuning

Optimizations

• All-flash storage for OSDs

• Enhanced parallelism and lock optimization

• Optimization for reads from flash

• Improvements to Ceph messenger

SanDisk InfiniFlash Test Configuration

• InfiniFlash storage with IFOS 1.0 EAP3

• Up to 4 RBDs

• 2 Ceph OSD nodes, connected to InfiniFlash

• 40GbE NICs from Mellanox


Page 9: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

8K Random - 2 RBD/Client with File System

Charts: IOPS and latency (ms) for 2 LUNs per client (4 clients total), swept over queue depths 1, 2, 4, 8, 16, and 32 at read percentages of 0, 25, 50, 75, and 100, comparing IFOS 1.0 with stock Ceph; IOPS are plotted up to 300,000 and latency up to 120 ms. A hedged sketch of the same parameter sweep follows.
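The chart's test matrix, queue depths 1-32 crossed with read percentages 0-100%, can be reproduced with fio's native RBD engine. The sketch below is a reconstruction under assumptions: pool and image names, the 60-second runtime, sequential execution of the test points, and the use of the rbd ioengine (rather than a kernel mapping) are illustrative choices, not details given in the deck.

```python
#!/usr/bin/env python3
"""Hedged sketch of the 8K random sweep: queue depths 1-32 at read mixes of
0/25/50/75/100%, using fio's rbd ioengine. Names and runtime are hypothetical;
the original test ran 2 LUNs per client concurrently across 4 clients, while
this sketch walks the matrix sequentially for simplicity."""
import subprocess

POOL = "rbd"
IMAGES = ["lun1", "lun2"]          # 2 RBDs per client, per the slide

QUEUE_DEPTHS = [1, 2, 4, 8, 16, 32]
READ_PERCENTS = [0, 25, 50, 75, 100]


def run_point(image: str, qd: int, read_pct: int) -> None:
    subprocess.run([
        "fio",
        f"--name=rand8k-{image}-qd{qd}-r{read_pct}",
        "--ioengine=rbd", "--clientname=admin",
        f"--pool={POOL}", f"--rbdname={image}",
        "--rw=randrw", f"--rwmixread={read_pct}",
        "--bs=8k", f"--iodepth={qd}",
        "--runtime=60", "--time_based",
        "--group_reporting",
    ], check=True)


if __name__ == "__main__":
    for read_pct in READ_PERCENTS:
        for qd in QUEUE_DEPTHS:
            for image in IMAGES:
                run_point(image, qd, read_pct)
```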

Page 10: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Performance: 64K Random - 2 RBD/Client with File System

Charts: IOPS and latency (ms) for 2 LUNs per client (4 clients total), with the same queue-depth (1-32) and read-percentage (0-100%) sweep as the 8K case, comparing IFOS 1.0 with stock Ceph; IOPS are plotted up to 160,000 and latency up to 180 ms.

Page 11: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance


Adding RDMA To Ceph

XioMessenger

Page 12: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

I/O Offload Frees Up CPU for Application Processing

Without RDMA: ~53% CPU efficiency, ~47% CPU overhead/idle.

With RDMA and offload: ~88% CPU efficiency, ~12% CPU overhead/idle.

(The original chart breaks the busy CPU time into user-space and system-space shares for each case.)

Page 13: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Adding RDMA to Ceph

RDMA Beta in Hammer

• Mellanox, Red Hat, CohortFS, and community collaboration

• Full RDMA support expected in Infernalis

Messaging Layer

• New RDMA messenger layer called XioMessenger

• New class hierarchy allowing multiple transports (the simple one is TCP)

• Async design, reduced locks, reduced number of threads

• Introduced non-sharable messages

Buffers Management

On Top of Accelio

• Accelio is an RDMA abstraction layer

• Integrated into all Ceph user-space components, daemons and clients, on both the “public network” and the “cluster network” (a hedged ceph.conf sketch follows below)
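In the Hammer-era beta, enabling XioMessenger came down to switching the messenger type in ceph.conf on both daemons and clients. The sketch below is a hedged reconstruction based on the xio development branch: the `ms_type = xio` setting and the subnets shown are assumptions meant to illustrate the shape of the change, not a supported, documented configuration.

```python
#!/usr/bin/env python3
"""Hedged sketch: a ceph.conf [global] section for the XioMessenger RDMA beta.
'ms_type = xio' reflects the xio development branch; subnets are placeholders.
This was experimental code, not a production-supported setting."""
import configparser

conf = configparser.ConfigParser()
conf["global"] = {
    # Switch both daemons and clients from the simple TCP messenger to XioMessenger.
    "ms_type": "xio",
    # RDMA runs on both the client-facing and the replication networks.
    "public network": "192.168.10.0/24",
    "cluster network": "192.168.20.0/24",
}

with open("ceph.conf.xio-sample", "w") as f:
    conf.write(f)
```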


Page 14: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Accelio, High-Performance Reliable Messaging and RPC Library

Open source!

• https://github.com/accelio/accelio/ and www.accelio.org

Faster RDMA integration into applications

Asynchronous

Maximizes message and CPU parallelism

Enables >10GB/s from a single node

Enables <10µs latency under load

In Giant and Hammer

http://wiki.ceph.com/Planning/Blueprints/Giant/Accelio_RDMA_Messenger


Page 15: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Ceph 4KB Read IOPS: 40Gb TCP vs. 40Gb RDMA

Chart: thousands of IOPS (axis up to 450) for 40Gb TCP vs. 40Gb RDMA at three cluster sizes: 2 OSDs, 4 OSDs, and 8 OSDs, each with 4 clients. Annotations on the bars record how many CPU cores were busy in the OSD nodes (roughly 34-38) and in the clients (roughly 24-32) for each TCP and RDMA run.

Page 16: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Ceph RDMA Performance Summary – Work In Progress

Normalized to per-core

BW is @ 256K IO size, IOPS is @ 4K IO size


READ: IOPS up to 250% better, BW up to 50% better

WRITE: IOPS up to 20% better, BW up to 7% better

Page 17: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

What’s next?

XIO-Messenger to GA

• XIO-Messenger can do much more as a transport!

• Infernalis?

Ceph Bottlenecks

• Collaborate to resolve; performance work group

Erasure Coding

• Erasure coding is really needed to reduce the capacity overhead of redundancy

• Erasure coding is complicated math for the CPU, demanding high-end storage nodes

• The new ConnectX-4 can offload erasure coding (see the pool-creation sketch below)
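For context on the erasure-coding point: the capacity savings come from k+m encoding rather than full replicas (k=4, m=2 stores data at 1.5x raw overhead instead of 3x with triple replication). The sketch below uses standard Ceph CLI commands of that era to create an erasure-coded pool; the k/m split, profile and pool names, and the PG count are illustrative assumptions, and any ConnectX-4 offload would be transparent to this layer.

```python
#!/usr/bin/env python3
"""Hedged sketch: create an erasure-coded pool (k=4 data + m=2 coding chunks).
Profile name, k/m values, failure domain, and PG count are illustrative."""
import subprocess


def ceph(*args: str) -> None:
    subprocess.run(["ceph", *args], check=True)


if __name__ == "__main__":
    # Define the erasure-code profile: 4 data chunks + 2 coding chunks per object.
    ceph("osd", "erasure-code-profile", "set", "ec42",
         "k=4", "m=2", "ruleset-failure-domain=host")
    # Create a pool that uses the profile (128 placement groups as an example).
    ceph("osd", "pool", "create", "ecpool", "128", "128", "erasure", "ec42")
```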

Page 18: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance


Deployment Examples

Ceph-Powered Solutions

Page 19: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Ceph For Large Scale Storage – Fujitsu Eternus CD10000

Hyperscale Storage

4 to 224 nodes

Up to 56 PB raw capacity

Runs Ceph with Enhancements

3 different storage node types

Object, block, and file storage

Mellanox InfiniBand Cluster Network

• 40Gb InfiniBand cluster network

• 10Gb Ethernet front-end network


Page 20: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Media & Entertainment Storage – StorageFoundry Nautilus

Turnkey Object Storage

Built on Ceph

Pre-configured for rapid deployment

Mellanox 10/40GbE networking

High-Capacity Configuration

6-8TB Helium-filled drives

Up to 2PB in 18U

High-Performance Configuration

Single client read 2.2 GB/s

SSD caching + Hard Drives

Supports Ethernet, IB, FC, FCoE front-end ports

More information: www.storagefoundry.net


Page 21: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

SanDisk InfiniFlash

Flash Storage System

Announced March 3, 2015

InfiniFlash OS uses Ceph

512 TB (raw) in one 3U enclosure

Tested with 40GbE networking

High Throughput

Up to 7GB/s

Up to 1M IOPS with two nodes

More information:

• http://bigdataflash.sandisk.com/infiniflash


Page 22: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance


More Ceph Solutions

Cloud – OnyxCCS ElectraStack

• Turnkey IaaS

• Multi-tenant computing system

• 5x faster node/data restoration

• https://www.onyxccs.com/products/8-series

ISS Storage Supercore

• Healthcare solution

• 82,000 IOPS on 512B reads

• 74,000 IOPS on 4KB reads

• 1.1GB/s on 256KB reads

• http://www.iss-integration.com/supercore.html

Flextronics CloudLabs

• OpenStack on CloudX design

• 2 SSD + 20 HDD per node

• Mix of 1Gb/40GbE network

• http://www.flextronics.com/

Scalable Informatics Unison

• High availability cluster

• 60 HDD in 4U

• Tier 1 performance at archive cost

• https://scalableinformatics.com/unison.ht


Page 23: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Summary

Ceph scalability and performance benefit from high-performance networks

Ceph is being optimized for flash storage

End-to-end 40/56 Gb/s transport accelerates Ceph today

100Gb/s testing has begun!

Available in various Ceph solutions and appliances

RDMA is next to optimize flash performance—beta in Hammer


Page 24: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Thank You

Page 25: Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Performance

Setup

Two 28-core Haswell servers:

• 64GB of memory

• Hyper-threading enabled

Mellanox ConnectX-3 EN 40Gb/s NICs, firmware 2.33.5000

Mellanox SX1012 40Gb/s Ethernet switch

MLNX_OFED_LINUX-2.4-1.0.0

Accelio version 1.3 (master branch tag v1.3-rc3)

Ceph upstream branch hammer

Ubuntu 14.04 LTS, stock kernel

Default MTU = 1500

1st server runs as a single-node Ceph cluster:

• One monitor

• One OSD (using XFS on ramdisk /dev/ram0)

2nd server runs the Ceph fio_rbd clients (test-image creation sketched below):

• BW is measured at 256K IOs

• IOPS is measured at 4K IOs
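Before the fio_rbd clients can run, a test image has to exist on the single-node cluster. Below is a minimal sketch using the python-rbd bindings; the pool name "rbd" and the image name are assumptions, the 20GB size mirrors the fio file size quoted earlier, and the 256K/4K measurements themselves follow the same fio rbd-engine pattern sketched on the earlier slides.

```python
#!/usr/bin/env python3
"""Hedged sketch: create the RBD image the fio_rbd clients exercise, using the
python-rados and python-rbd bindings. Pool and image names are assumptions."""
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")                    # assumed pool name
    try:
        rbd.RBD().create(ioctx, "test-img", 20 * 1024**3)  # 20GB image
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```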


