Unleash Ceph over Flash Storage Potential with
Mellanox High-Performance Interconnect
Ceph Day Berlin – Apr 28th, 2015
Oren Duer, Director of Storage Software, Software R&D
Leading Supplier of End-to-End Interconnect Solutions
[Diagram: end-to-end fabric from server/compute to storage through switches/gateways, with Virtual Protocol Interconnect on both the front end and the back end: 56G InfiniBand and FCoIB on one side, 10/40/56GbE and FCoE on the other]
Comprehensive End-to-End InfiniBand and Ethernet Portfolio: ICs, adapter cards, switches/gateways, host/fabric software, metro/WAN, cables/modules
How Customers Deploy Ceph with Mellanox Interconnect
Building Scalable, High-Performance Storage Solutions
• Cluster network @ 40Gb Ethernet
• Clients @ 10Gb/40Gb Ethernet
High Performance at Low Cost
• Allows more capacity per OSD
• Lower cost/TB
Flash Deployment Options
• All HDD (no flash)
• Flash for OSD journals
• 100% flash in OSDs
A faster cluster network improves both price/capacity and price/performance (a hedged ceph.conf sketch of this two-network layout follows below).
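As a rough illustration of the deployment above, the ceph.conf fragment below separates the client-facing public network from the 40GbE cluster network and puts OSD journals on flash. This is a minimal sketch; the subnets, device path, and journal size are placeholder assumptions, not values from the deck.

    [global]
    # client traffic (10/40GbE public network)
    public network = 10.10.10.0/24
    # OSD replication, recovery, and heartbeat traffic (40GbE cluster network)
    cluster network = 192.168.10.0/24

    [osd]
    # "Flash for OSD journals" option: place each journal on an SSD partition
    osd journal = /dev/disk/by-partlabel/journal-$id
    # journal size in MB
    osd journal size = 10240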
Ceph Deployment Using 10GbE and 40GbE
Cluster (private) network @ 40/56GbE
• Smooth HA, unblocked heartbeats, efficient data balancing
Throughput clients @ 40/56GbE
• Guarantees line rate for high ingress/egress clients
IOPS clients @ 10GbE or 40/56GbE
• 100K+ IOPS per client @ 4K blocks
Topology: client nodes reach the Ceph nodes (monitors, OSDs, MDS) and the admin node over a 10/40GbE public network; the Ceph nodes are interconnected by a 40GbE cluster network.
Throughput testing results are based on the fio benchmark: 8M blocks, 20GB file, 128 parallel jobs, RBD kernel driver with Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2. IOPS testing results are based on the fio benchmark: 4K blocks, 20GB file, 128 parallel jobs, same stack (a hedged fio job sketch follows below).
20x higher throughput and 4x higher IOPS with 40Gb Ethernet clients! (http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Ceph_over_High_Performance_Networks.pdf)
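The benchmark parameters cited above map directly onto an fio job file. The sketch below is an assumed reconstruction of the throughput case against a kernel-mapped RBD image; the device path and job name are illustrative, not taken from the deck.

    ; throughput test: 8M blocks, 20GB per job, 128 parallel jobs,
    ; against a kernel-mapped RBD device (path is an assumption)
    [global]
    ioengine=libaio
    direct=1
    filename=/dev/rbd0
    size=20g
    numjobs=128
    group_reporting

    [throughput-8m]
    bs=8m
    rw=read

    ; for the IOPS test, switch to bs=4k and rw=randread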
OK, But How Do We Further Improve IOPS? We Use RDMA!
[Diagram: two racks exchanging data. With TCP/IP, the payload is copied between application buffers and OS buffers in user space and kernel space on both sides before it reaches the NIC. With RDMA over InfiniBand or Ethernet, the HCA moves data by DMA directly between the application buffers across the network, bypassing the kernel; the hardware, kernel, and user-space layers are shown for both paths.]
Ceph Throughput Using 40Gb and 56Gb Ethernet
One OSD, One Client, 8 Threads
[Chart: throughput in MB/s (0-6000) for 64KB and 256KB random reads, comparing 40Gb TCP (MTU=1500), 56Gb TCP (MTU=4500), and 56Gb RDMA (MTU=4500)]
Optimizing Ceph for Flash
By SanDisk & Mellanox
Highlights compared to stock Ceph
• Read performance up to 8x better
• Write performance up to 2x better with tuning
Optimizations
• All-flash storage for OSDs
• Enhanced parallelism and lock optimization
• Optimization for reads from flash
• Improvements to the Ceph messenger
SanDisk InfiniFlash test configuration
• InfiniFlash storage with IFOS 1.0 EAP3
• Up to 4 RBDs
• 2 Ceph OSD nodes, connected to InfiniFlash
• 40GbE NICs from Mellanox
8K Random I/O, 2 RBDs per Client with a File System
[Charts: IOPS (up to ~300,000) and latency in ms (up to ~120), 2 LUNs per client with 4 clients total; the x-axis sweeps queue depths 1-32 at read percentages of 0, 25, 50, 75, and 100; IFOS 1.0 vs. stock Ceph]
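The sweep above varies read percentage and queue depth. One point of such a sweep could be expressed as the fio fragment below; this is an assumed example rather than SanDisk's actual job file, and the device path is a placeholder for a mapped RBD LUN.

    ; one point of the sweep: 8K random I/O, 75% reads, queue depth 16
    ; /dev/rbd0 is a hypothetical mapped RBD LUN
    [randrw-8k-qd16]
    ioengine=libaio
    direct=1
    filename=/dev/rbd0
    bs=8k
    rw=randrw
    rwmixread=75
    iodepth=16
    time_based
    runtime=300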
Performance: 64K Random I/O, 2 RBDs per Client with a File System
[Charts: IOPS (up to ~160,000) and latency in ms (up to ~180), 2 LUNs per client with 4 clients total; queue depths 1-32 at read percentages of 0, 25, 50, 75, and 100; IFOS 1.0 vs. stock Ceph]
Adding RDMA to Ceph: XioMessenger
I/O Offload Frees Up CPU for Application Processing
• Without RDMA: ~53% CPU efficiency, ~47% CPU overhead/idle
• With RDMA and offload: ~88% CPU efficiency, ~12% CPU overhead/idle
[Chart: CPU time split between user space and system space, with and without RDMA offload]
Adding RDMA to Ceph
RDMA beta in Hammer
• Mellanox, Red Hat, CohortFS, and community collaboration
• Full RDMA support expected in Infernalis
Messaging layer
• New RDMA messenger layer called XioMessenger
• New class hierarchy allowing multiple transports (the simple one is TCP)
• Async design, reduced locks, reduced number of threads
Buffer management
• Introduced non-sharable messages
On top of Accelio
• Accelio is the RDMA abstraction layer
• Integrated into all Ceph user-space components: daemons and clients
• Covers both the "public network" and the "cluster network"
A hedged configuration sketch for enabling the RDMA messenger follows below.
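In the Hammer-era RDMA beta, the XioMessenger transport was selected through the messenger-type option in ceph.conf. The fragment below is a minimal sketch based on the upstream Accelio/XIO work; the exact option names were still in flux at the time, so treat it as an assumption rather than a supported interface.

    [global]
    ; switch from the default SimpleMessenger (TCP) to the Accelio/XIO messenger
    ms_type = xio
    ; both the "public network" and the "cluster network" then run over
    ; the RDMA-capable interfaces defined for those subnets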
Accelio, High-Performance Reliable Messaging and RPC Library
• Open source: https://github.com/accelio/accelio/ and www.accelio.org
• Faster RDMA integration into applications
• Asynchronous: maximizes message and CPU parallelism
• Enables >10GB/s from a single node
• Enables <10usec latency under load
• In Ceph Giant and Hammer
• http://wiki.ceph.com/Planning/Blueprints/Giant/Accelio_RDMA_Messenger
Ceph 4KB Read IOPS: 40Gb TCP vs. 40Gb RDMA
[Chart: thousands of IOPS (0-450) for 2 OSDs, 4 OSDs, and 8 OSDs, each with 4 clients, comparing 40Gb TCP and 40Gb RDMA; each bar is annotated with the number of cores busy in the OSDs and in the clients (roughly 24-38 cores)]
Ceph RDMA Performance Summary (Work in Progress)
Normalized per core; BW measured at 256K IO size, IOPS at 4K IO size
• READ: IOPS up to 250% better, BW up to 50% better
• WRITE: IOPS up to 20% better, BW up to 7% better
What's Next?
XIO-Messenger to GA
• Collaborate to resolve remaining issues in the performance work group
• Infernalis?
Ceph Bottlenecks
• XIO-Messenger can do much more as a transport!
Erasure Coding
• Erasure coding is really needed to reduce the redundancy capacity overhead
• Erasure coding is complicated math for the CPU, demanding high-end storage nodes
• The new ConnectX-4 can offload erasure coding
A short worked example of the capacity-overhead arithmetic follows below.
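To make the capacity argument concrete, here is a small worked comparison; the erasure-code parameters (k = 8 data chunks, m = 3 coding chunks) are chosen only for illustration and are not from the deck.

    Raw capacity consumed per usable TB:
      3-way replication:         3 / 1     = 3.000  (200% overhead)
      erasure code (k=8, m=3):   (8+3) / 8 = 1.375  (37.5% overhead)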
Deployment Examples
Ceph-Powered Solutions
Ceph for Large-Scale Storage: Fujitsu Eternus CD10000
Hyperscale storage
• 4 to 224 nodes
• Up to 56 PB raw capacity
Runs Ceph with enhancements
• 3 different storage nodes
• Object, block, and file storage
Mellanox InfiniBand cluster network
• 40Gb InfiniBand cluster network
• 10Gb Ethernet front-end network
Media & Entertainment Storage: StorageFoundry Nautilus
Turnkey object storage
• Built on Ceph
• Pre-configured for rapid deployment
• Mellanox 10/40GbE networking
High-capacity configuration
• 6-8TB helium-filled drives
• Up to 2PB in 18U
High-performance configuration
• Single-client read of 2.2 GB/s
• SSD caching + hard drives
• Supports Ethernet, IB, FC, and FCoE front-end ports
More information: www.storagefoundry.net
SanDisk InfiniFlash
Flash storage system
• Announced March 3, 2015
• InfiniFlash OS uses Ceph
• 512 TB (raw) in one 3U enclosure
• Tested with 40GbE networking
High throughput
• Up to 7GB/s
• Up to 1M IOPS with two nodes
More information: http://bigdataflash.sandisk.com/infiniflash
More Ceph Solutions
Cloud: OnyxCCS ElectraStack
• Turnkey IaaS
• Multi-tenant computing system
• 5x faster node/data restoration
• https://www.onyxccs.com/products/8-series
ISS Storage Supercore
• Healthcare solution
• 82,000 IOPS on 512B reads
• 74,000 IOPS on 4KB reads
• 1.1GB/s on 256KB reads
• http://www.iss-integration.com/supercore.html
Flextronics CloudLabs
• OpenStack on CloudX design
• 2 SSDs + 20 HDDs per node
• Mix of 1GbE/40GbE networks
• http://www.flextronics.com/
Scalable Informatics Unison
• High-availability cluster
• 60 HDDs in 4U
• Tier 1 performance at archive cost
• https://scalableinformatics.com/unison.ht
Summary
• Ceph scalability and performance benefit from high-performance networks
• Ceph is being optimized for flash storage
• End-to-end 40/56 Gb/s transport accelerates Ceph today
• 100Gb/s testing has begun!
• Available in various Ceph solutions and appliances
• RDMA is next to optimize flash performance; beta in Hammer
Thank You
Setup
• Two 28-core (Haswell) servers
• 64GB of memory
• Hyper-threading enabled
• Mellanox ConnectX-3 EN 40Gb/s, fw-2.33.5000
• Mellanox SX1012 40Gb/s Ethernet switch
• MLNX_OFED_LINUX-2.4-1.0.0
• Accelio version 1.3 (master branch tag v1.3-rc3)
• Ceph upstream branch hammer
• Ubuntu 14.04 LTS stock kernel
• Default MTU = 1500
• 1st server runs as a single-node Ceph cluster: one monitor and one OSD (using XFS on a ramdisk, /dev/ram0)
• 2nd server runs the Ceph fio_rbd clients (a hedged fio job sketch follows below)
• BW is measured at 256K IOs
• IOPS is measured at 4K IOs
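The fio_rbd clients use fio's userspace librbd engine. The job file below is a minimal sketch of the two measurements described above; the cephx user, pool, and image names are assumptions, not values from the deck.

    ; fio with the userspace librbd engine; user, pool, and image names are placeholders
    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=fio_test
    direct=1
    time_based
    runtime=120

    [iops-4k]
    bs=4k
    rw=randread
    iodepth=32

    ; for the bandwidth figure, use bs=256k and rw=read instead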