© Supermicro 2014
Ceph, a physical perspective
Ceph Day New York, October 8th 2014
Alan Johnson
Storage Applications Manager
Supermicro Storage Group
Agenda
Objectives
Hardware Description
Ceph Environment
Test Results
Conclusions
Further areas for testing
Objectives
Generate well-balanced solutions without compromising on availability, performance, or cost.
Derive configurations from empirical data gathered during testing.
Hardware Description
[Diagram: cluster topology. Monitor Nodes, Client Nodes, and OSD Nodes share the Public Network (172.27.50); OSD Nodes also connect to the Private Network (192.168.50)]
Notes:
1. 3x Monitor Nodes
2. 7x OSD Nodes with 36x 3.5" bays
3. 8x Client Nodes
4. Clients and Monitors only access the public network
5. All networks are 10G
6. OSD nodes have bonded (2x) private network interfaces (configuration sketch below)
7. OSD nodes have a 5:1 HDD:SSD ratio, 30x HDD + 6x SSD
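Note 6 describes bonded private interfaces on the OSD nodes. On CentOS 6.5 bonding is typically configured through ifcfg files; the sketch below is illustrative only, since the deck does not state the bonding mode, device names, or exact addressing:

  # /etc/sysconfig/network-scripts/ifcfg-bond0 (hypothetical OSD node)
  DEVICE=bond0
  BOOTPROTO=none
  ONBOOT=yes
  IPADDR=192.168.50.11      # private (cluster) network per the diagram; host octet assumed
  NETMASK=255.255.255.0
  BONDING_OPTS="mode=802.3ad miimon=100"  # LACP assumed; the deck does not state the mode

  # /etc/sysconfig/network-scripts/ifcfg-eth2 (repeat for the second slave, e.g. eth3)
  DEVICE=eth2
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none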
Hardware Description
Servers: Dual Intel E5-2630, 6-core; 64/128 GB memory
Storage: 2x LSI 2308 (IT mode) per 36-bay OSD node
  3 TB HDD: HGST HUA723030ALA640
  400 GB SSD: Intel SSDSC2BA400G3
Hardware Description

Cluster Role: Monitor Node
- Server Model: SYS-6017R-MON1
- Key Features: 4x 3.5" HDD bays; dual 10G (SFP+)
- Processor: Dual Intel E5-2630 V2, 6-core, 2.6 GHz, 15M cache, 7.2 GT/s QPI
- Memory: 64 GB per node
- Networking: on-board dual-port 10G (SFP+)
- Drive Configuration: 4x 300 GB HDDs (SAS)
- Form Factor: 1U with redundant hot-swap 700W power supplies

Cluster Role: OSD Node
- Server Model: SSG-6027R-OSD040H
- Key Features: 12x 3.5" HDD bays; rear 2.5" hot-swap OS drives (mirrored 80 GB SSDs); dual 10G (SFP+)
- Processor: Single Intel E5-2630 V2, 6-core, 2.6 GHz, 15M cache, 7.2 GT/s QPI
- Memory: 64 GB per node
- Networking: AOC-STGN-i2S dual-port 10G (SFP+)
- Drive Configuration: 2x 400 GB SATA3 SSDs, 10x 4 TB HDDs (SATA3)
- Form Factor: 2U with redundant hot-swap 920W power supplies

Cluster Role: OSD Node
- Server Model: SSG-6047R-OSD120H
- Key Features: 36x 3.5" HDD bays; rear 2.5" hot-swap OS drives (mirrored 80 GB SSDs); x8 SAS2 connectivity; quad 10G (SFP+)
- Processor: Dual Intel E5-2630 V2, 6-core, 2.6 GHz, 15M cache, 7.2 GT/s QPI
- Memory: 128 GB per node
- Networking: 2x AOC-STGN-i2S dual-port 10G (SFP+)
- Drive Configuration: 6x 400 GB SATA3 SSDs, 30x 4 TB HDDs (SATA3)
- Form Factor: 4U with redundant hot-swap 1280W power supplies

Cluster Role: OSD Node
- Server Model: SSG-6047R-OSD240H
- Key Features: 72x 3.5" HDD bays; rear 2.5" hot-swap OS drives (mirrored 80 GB SSDs); x8 SAS2 connectivity; quad 10G (SFP+)
- Processor: Dual Intel E5-2630 V2, 6-core, 2.6 GHz, 15M cache, 7.2 GT/s QPI
- Memory: 128 GB per node
- Networking: 2x AOC-STGN-i2S dual-port 10G (SFP+)
- Drive Configuration: 12x 400 GB SATA3 SSDs, 60x 4 TB HDDs (SATA3)
- Form Factor: 4U with redundant hot-swap 2000W power supplies
Ceph Environment
Operating System: CentOS 6.5
- No modifications made to the kernel (2.6.32)
- Yum updates applied
Ceph: Firefly, 0.80.5
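These versions can be confirmed on any node with standard commands:

  ceph --version           # should report 0.80.5 (Firefly)
  uname -r                 # should report a 2.6.32-series kernel
  cat /etc/centos-release  # should report CentOS 6.5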
Test Results
Initial test configuration; RADOS bench was used for testing.
[Diagram: initial test topology. Monitor Nodes, a single Client Node, and the OSD Nodes all sit on the Public Network (172.27.50)]
Notes:
1. 3x Monitor Nodes
2. 7x OSD Nodes with 36x 3.5" bays
3. 1x Client Node
4. Everything on the public network
5. All nodes are 10G
6. All nodes have a single connection
Test Results
Sequential write bandwidth on a single replicated pool, with journal and data on the same device, using the rados bench defaults (b=4M, t=16). OSDs were built up over the OSD nodes in a breadth-first manner, in increments of 6 OSDs, with a new pool created each time (a representative command sequence follows the chart).
[Chart: Bandwidth (MB/s) and Max Bandwidth (MB/s) vs. number of OSDs, from 6 to 84]
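For reference, each increment above amounts to creating a fresh pool and re-running the benchmark. A representative sequence, where the pool name, PG count, and 60-second duration are assumptions not stated in the deck:

  # create a fresh replicated pool for this OSD increment
  ceph osd pool create bench_rep 2048 2048

  # sequential writes; -b 4194304 (4 MB objects) and -t 16 are the rados bench defaults
  rados bench -p bench_rep 60 write -b 4194304 -t 16 --no-cleanup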
Test Results
Latency during the sequential write test.
[Chart: Average Latency and Max Latency vs. number of OSDs, from 6 to 75]
Test Results
Sequential write bandwidth on a single replicated pool, with journals on SSDs, in increments of 6 OSDs with a new pool created each time (a journal-placement sketch follows the chart).
[Chart: Bandwidth (MB/s) and Max Bandwidth (MB/s) vs. number of OSDs, from 6 to 54]
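Placing journals on SSDs happens when each OSD is created. A minimal sketch, assuming ceph-deploy and hypothetical host/device names (with the 5:1 ratio above, each SSD would carry roughly five journal partitions):

  # data on an HDD (sdc), journal on a pre-made SSD partition (sdy1)
  ceph-deploy osd create osd-node1:sdc:/dev/sdy1

  # equivalent ceph.conf setting for a manually prepared OSD
  [osd.0]
      osd journal = /dev/sdy1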
Test Results
Latency during the sequential write test, journals on SSDs.
[Chart: Average Latency and Max Latency vs. number of OSDs, from 6 to 54]
Test Results
Sequential write bandwidth on a single erasure-coded pool (k=5, m=2), with OSDs built up over the OSD nodes in a breadth-first manner and journals on SSDs, in increments of 7 OSDs with a new pool created each time (pool-creation commands are sketched after the chart).
[Chart: Bandwidth (MB/s) and Max Bandwidth (MB/s) vs. number of OSDs, from 7 to 49]
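On Firefly an erasure-coded pool is created from an erasure-code profile. A sketch matching k=5, m=2; the profile name, pool name, PG counts, and host-level failure domain are assumptions:

  # define the k=5, m=2 profile
  ceph osd erasure-code-profile set k5m2 k=5 m=2 ruleset-failure-domain=host

  # create the erasure-coded pool from the profile
  ceph osd pool create bench_ec 2048 2048 erasure k5m2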
Test Results
Erasure-coded pool: latency during the sequential write test, journals on SSDs.
[Chart: Average Latency and Max Latency vs. number of OSDs, from 7 to 49]
Incremental Object Size Test
One client, 180 OSDs, 1x replicated pool (a parameter-sweep sketch follows the diagram).
[Diagram: Monitor Nodes and Client Node on the Public Network (172.27.50); OSD Nodes also on the Private Network (192.168.50). Bandwidth annotations: 1 GB/s, 3 GB/s, 6 GB/s, 12 GB/s]
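The object-size sweep maps to varying the -b parameter of rados bench. The specific sizes and duration below are assumptions, as the slide does not list them:

  # sweep object sizes from 64 KB to 32 MB (values in bytes) against the replicated pool
  for bs in 65536 1048576 4194304 33554432; do
      rados bench -p bench_rep 60 write -b $bs -t 16 --no-cleanup
  done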
Incremental Object Size Test
One client, 180 OSDs, 1x erasure-coded pool (k=4, m=2).
[Diagram: same topology as above, with bandwidth annotations: 1 GB/s, 3 GB/s, 6 GB/s, 12 GB/s]
Aggregate Large File Sequential Performance (incremental clients and OSDs)
[Diagram: Monitor Nodes, Client Nodes, and OSD Nodes on the Public Network (172.27.50); OSD Nodes also on the Private Network (192.168.50). Bandwidth annotations: 8 GB/s, 3 GB/s, 7 GB/s, 14 GB/s]
[Chart: Bandwidth (MB/s) vs. number of OSDs; the data is tabulated below]
# of OSDs   Bandwidth (MB/s)   # of clients   pg setting   Notes
12          235.00             1              512
24          518.00             1              512
36          769.00             2              1024
48          830.00             2              1024
60          920.00             2              1024
72          955.00             2              2048
84          1130.00            2              4096
96          1306.00            3              4096
108         1446.00            3              4096
120         1516.60            3              4096
132         1726.00            4              4096
144         1817.15            4              4096
144         3043.40            4              4096         Reads
144         2401.00            4              4096         EC writes (k=10, m=2)
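The pg setting column shows placement groups being raised as OSDs were added. On Firefly this is an in-place change (pool name hypothetical; pg_num can only be increased, and pgp_num must follow for data to rebalance):

  ceph osd pool set bench_rep pg_num 4096
  ceph osd pool set bench_rep pgp_num 4096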
OSD Node I/O Bandwidth
- SAS I/O: 4.8 GB/s
- Disk I/O: 2.7 GB/s
- Frontend networking: 1 GB/s to clients
- Backend networking: 2 GB/s OSD to OSD
Conclusions
- Use of SSDs is near mandatory for performance purposes.
- Erasure-coded pools provide reasonably good performance at a lower price point; testing is still required for EC failure conditions.
- A single client can easily saturate the network with two 36-bay OSD nodes; real-world deployments are more likely to use multiple clients.
- Monitor nodes show low CPU utilization.
- The network becomes extremely critical as the configuration scales.
Replication Path as the Cluster Scales
[Diagram: Failure Domain 0 (Datacenter 1), Failure Domain 1 (Datacenter 1), and Failure Domain 3 (Datacenter 2), with replication paths labeled 1, 2, and 3]
Considerations:
1. Base solution: copies are made within the rack.
2. Copies are made to hardware in another rack (a CRUSH rule sketch follows).
3. Copies go to the other datacenter.
The inter-datacenter connection will depend on the organization.
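Consideration 2 maps naturally to a CRUSH rule whose failure domain is the rack. A minimal sketch for Firefly, assuming the CRUSH map already declares rack buckets (rule and pool names are illustrative):

  # replicated rule that places each copy in a different rack
  ceph osd crush rule create-simple rep-across-racks default rack

  # point a pool at the new rule (ruleset id 1 assumed; verify with 'ceph osd crush rule dump')
  ceph osd pool set bench_rep crush_ruleset 1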
Further Areas for Investigation – Future Testing
- RDMA support
- Use of 40 Gigabit Ethernet
- Use of InfiniBand
- Use of cache tiering
- Ceph Enterprise 1.2, Calamari with Red Hat 7
- Testing with the Giant release
- Block testing with more comprehensive benchmarks
- Degraded mode (erasure-coded and replicated)