
The Consequences of Infinite Storage Bandwidth: Allen Samuels, SanDisk

Transcript

May 5, 2016 1

Allen Samuels

The Consequences of Infinite Storage Bandwidth

Engineering Fellow, Systems and Software Solutions

May 5, 2016

May 5, 2016 2

Disclaimer

During the presentation today, we may make forward-looking statements.

Any statement that refers to expectations, projections, or other characterizations of future events or circumstances is a forward-looking statement, including those relating to industry predictions and trends, future products and their projected availability, and evolution of product capacities. Actual results may differ materially from those expressed in these forward-looking statements due to a number of risks and uncertainties, including among others: industry predictions may not occur as expected, products may not become available as expected, and products may not evolve as expected; and the factors detailed under the caption “Risk Factors” and elsewhere in the documents we file from time to time with the SEC, including, but not limited to, our annual report on Form 10-K for the year ended January 3, 2016. This presentation contains information from third parties, which reflects their projections as of the date of issuance. We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof or the date of issuance by a third party.

May 5, 2016 3

What Do I Mean By Infinite Bandwidth?

May 5, 2016 4

Log scale

• Use DRAM Bandwidth as a proxy for CPU throughput

• Reasonable approximation for DMA-heavy and/or poor-cache-hit workloads (e.g., storage)

Big difference in slope!

Data is for informational purposes only and may contain errors

Network, Storage and DRAM Trends

May 5, 2016 5

Linear scale

Infinite Storage Bandwidth

• Same data as last slide, but for the Log-impaired

• Storage Bandwidth is not literally infinite

• But the ratio of Network and Storage to CPU throughput is widening very quickly
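To make that concrete, here is a minimal back-of-the-envelope sketch in Python. The starting bandwidths and growth rates are illustrative assumptions of mine, not the data behind the charts that follow; the point is only the shape of the trend, i.e. how quickly the number of SSDs a single socket can keep fed shrinks.

# Illustrative sketch only: made-up starting points and growth rates,
# not the measured data plotted on the following slides.
dram_bw_gbps = 60.0   # per-socket DRAM bandwidth (assumed)
ssd_bw_gbps = 1.0     # per-SSD bandwidth (assumed)

for year in range(2016, 2027, 2):
    ssds_per_socket = dram_bw_gbps / ssd_bw_gbps  # SSDs one socket can saturate
    print(f"{year}: ~{ssds_per_socket:5.1f} SSDs per CPU socket")
    dram_bw_gbps *= 1.3   # CPU/DRAM throughput grows slowly per generation
    ssd_bw_gbps *= 2.0    # storage (and network) bandwidth grows much faster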

Data is for informational purposes only and may contain errors

Network, Storage and DRAM Trends

May 5, 2016 6

[Chart: SSDs / CPU Socket vs. year, 1990–2025]

Data is for informational purposes only and may contain errors

May 5, 2016 7

[Chart: SSDs / CPU Socket @ 20% Max BW vs. year, 1995–2025]

Data is for informational purposes only and may contain errors

May 5, 2016 8

What happens as we get closer to the limit?

May 5, 2016 9

New Denser Server Form Factors

– Blades

– Sleds

Good short term solutions

Let’s Get Small!

May 5, 2016 10

Storage Cost = Media + Access + Management

Shared nothing architecture conflates access and management

Storage costs will become dominated by Management cost

Storage costs become CPU/DRAM costs
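A toy cost model (my own illustrative numbers, not SanDisk's) makes the point: in a shared-nothing design every drive drags along a slice of a server big enough to move its bandwidth, so once drive bandwidth outgrows CPU/DRAM bandwidth the management slice dominates the cost of the media.

# Toy shared-nothing cost model with invented numbers, for illustration only.
media_cost = 2000.0      # $ per drive, held flat for simplicity
server_cost = 8000.0     # $ per shared-nothing storage server
drive_bw, server_bw = 2.0, 20.0   # GB/s per drive vs. per server, generation 0

for gen in range(6):
    mgmt_cost = server_cost * (drive_bw / server_bw)  # server share needed per drive
    total = media_cost + mgmt_cost
    print(f"gen {gen}: management ${mgmt_cost:,.0f} of ${total:,.0f} per drive "
          f"({100 * mgmt_cost / total:.0f}%)")
    drive_bw *= 2.0      # storage bandwidth grows fast
    server_bw *= 1.3     # CPU/DRAM bandwidth grows slowly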

Effects Of The CPU/DRAM Bottleneck

May 5, 2016 11

Move management to upper layers where CPU can be right-sized by client

What kind of media access do I want?

– Simple enough functionality to be done directly in drive hardware – NO CPU

– Allow direct access throughout the compute cluster over a network

– Just enough machinery to enable coarse-grained sharing

Embracing The CPU/DRAM Bottleneck

In short, you really want a SAN!

– Or more technically, Fabric Connected Storage

May 5, 2016 12

Not Your Father’s SAN

Three problems with current SAN

– Fibre channel transport

– SCSI access protocol

– Drive oriented storage allocation

All of these want to be updated

– Fibre channel is brittle and costly

– SCSI initiators have long code paths catering to seldom used configurations

– Storage allocation needs to become robust at sub-drive granularity

May 5, 2016 13

SAN 2.0

NVMe over Fabrics

1.0 Spec is out for review, hopefully done in May

Simple enough for direct hardware execution of data path ops

Minimal initiator code path lengths improve performance

Namespaces allow sub-drive allocations

Not mature enough for enterprise deployment – yet
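As a hedged sketch of what "namespaces allow sub-drive allocations" looks like from a Linux host, the fragment below drives nvme-cli from Python to attach an NVMe-oF subsystem and carve out a namespace. It assumes nvme-cli is installed, the target supports namespace management, and root privileges; the addresses, NQN, device path, and sizes are placeholders, and flag spellings can vary by nvme-cli version.

import subprocess

def connect_subsystem(traddr, nqn, transport="rdma", trsvcid="4420"):
    """Attach this host to an NVMe-over-Fabrics subsystem (placeholder values)."""
    subprocess.run(
        ["nvme", "connect", "-t", transport, "-a", traddr,
         "-s", trsvcid, "-n", nqn],
        check=True,
    )

def create_namespace(ctrl_dev, size_blocks):
    """Carve a sub-drive namespace on a controller and attach it to controller 0."""
    subprocess.run(
        ["nvme", "create-ns", ctrl_dev, "--nsze", str(size_blocks),
         "--ncap", str(size_blocks), "--flbas", "0"],
        check=True,
    )
    subprocess.run(
        ["nvme", "attach-ns", ctrl_dev, "--namespace-id", "1", "--controllers", "0"],
        check=True,
    )

# Example (hypothetical target and sizes):
# connect_subsystem("192.168.1.10", "nqn.2016-06.io.example:subsys0")
# create_namespace("/dev/nvme1", size_blocks=262144)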

May 5, 2016 14

SAN 2.0

What storage network?

– Current candidates are FC, Infiniband and Ethernet

Ethernet has best economics – if you can make it work

RoCE is easy on the edge, but hard on the interior

– Only controlled environments have shown multi-switch scalability

– General scalability in a multi-vendor environment likely to be difficult

– Wonderful for intra-rack storage networking

iWARP is hard on the edge, but easy on the interior

– Scarcity of implementations inhibits deployment

Storage over IP will see limited cross rack deployment until this is resolved

May 5, 2016 15

Implementations using off-the-shelf (OTS) components are in progress

Server side implementations look pretty conventional too

4-5 MIOPS have been shown

Seems like 10 MIOPS isn’t unreasonable to expect

First Generation Of SAN 2.0

[Diagram: server with NIC, CPU, and DRAM, with SSDs attached over PCIe]

May 5, 2016 16

Soon, NICs will forward NVMe operations to local PCIe devices

CPU removed from the software part of the data path

CPU is still needed for the hardware part of the data path

IOPS improve, BW is unchanged

Significant CPU freed for application processing

Getting closer to the wall!

Second Generation SAN 2.0

May 5, 2016 17

New generation of combined SSD controller and NIC

– Rethink of interfaces eliminates DRAM buffering

Network goes right into the drive

No CPU to be found

Works well with rack scale architecture

Third Generation SAN 2.0, Imagined

May 5, 2016 18

Disaggregated / Rack Scale Architecture

– Fabric connected

– Independently scale compute, networking and storage

Let’s Get Really Small

May 5, 2016 19

Call To Action

Fabric-connected storage isn’t well managed by existing FOSS

Lots of upper layer management software is available

– OpenStack, Ceph, Gluster, Cassandra, MongoDB, SheepDog, etc.

Lower layer cluster management still primitive

May 5, 2016 20

What’s It All Mean?

New form factors are in everybody's future

The coming avalanche of storage bandwidth wants to be free

– Not imprisoned by a CPU

Rack Scale Architecture allows new Storage/Compute configs

Storage will be increasingly “Software Defined” as the HW evolves

May 5, 2016 21

Product Pitch!

May 5, 2016 22

Old Model
– Monolithic, large upfront investments, and fork-lift upgrades
– Proprietary storage OS
– Costly: $$$$$

New SD-AFS Model
– Disaggregate storage, compute, and software for better scaling and costs
– Best-in-class solution components
– Open source software, no vendor lock-in
– Cost-efficient: $

Software-defined All-Flash Storage: the disaggregated model for scale

May 5, 2016 23

InfiniFlash™ Storage Platform

Scalable Raw Performance
• 2M IOPS, 1–3 ms latency, 12–15 GB/s throughput

Capacity
• 512TB raw, all flash
• All-flash 3U JBOD of Flash (JBOF), up to 64 x 8TB SAS drive cards; 4TB cards also available soon

8TB Flash Drive Card Innovations
• Enterprise grade, power-fail safe
• Alerts & monitoring
• Latching integrated & monitored
• Directly samples air temperature
• Form factor enables lowest-cost SSD

Operational Efficiency & Resilience
• Hot-swappable architecture, easy FRU
• Low power: typical workload 400–500W; 150W (idle) to 750W (max)
• MTBF 1.5+ million hours
• Hot swappable: fans, SAS expander boards, power supplies, flash cards

Host Connectivity
• Connect up to 8 servers through 8 SAS ports, multi-path enabled


May 5, 2016 24

InfiniFlash IF500 All-Flash Storage System: Block and Object Storage Powered by Ceph

Ultra-dense High Capacity Flash storage

– 512TB in 3U, Scale-out software for PB scale capacity

Highly scalable performance

– Industry leading IOPS/TB

Cinder, Glance and Swift storage

– Add/remove server & capacity on-demand

Enterprise-Class storage features

– Automatic rebalancing

– Hot Software upgrade

– Snapshots, replication, thin provisioning

– Fully hot swappable, redundant

Ceph Optimized for SanDisk flash

– Tuned & Hardened for InfiniFlash
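For a sense of how the block side is consumed, here is a minimal sketch using the standard python-rados and python-rbd bindings to create and write an RBD volume on a Ceph pool; it applies to any Ceph cluster, InfiniFlash-backed or not. The pool name, image name, conffile path, and size are placeholder assumptions.

import rados
import rbd

# Connect using the local ceph.conf and default keyring (paths are assumptions).
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")              # placeholder pool name
    try:
        rbd.RBD().create(ioctx, "vm-volume-0", 100 * 1024**3)   # 100 GiB image
        with rbd.Image(ioctx, "vm-volume-0") as image:
            image.write(b"hello infiniflash", 0)   # write at offset 0
    finally:
        ioctx.close()
finally:
    cluster.shutdown()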

May 5, 2016 25

InfiniFlash SW + HW Advantage

Software Storage System

Software tuned for hardware
• Ceph modifications for flash
• Both Ceph and the host OS tuned for InfiniFlash
• SW defects that impact flash identified & mitigated

Hardware configured for software
• Right balance of CPU, RAM, and storage
• Rack-level designs for optimal performance & cost

Software designed for all systems does not work well with any system.

Ceph has over 50 tuning parameters that together yield a 5x–6x performance improvement (see the sketch below).

Fixed-CPU, fixed-RAM hyperconverged nodes do not work well for all workloads.
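As a small, concrete example of the kind of knobs that "50 tuning parameters" claim refers to, the sketch below renders a ceph.conf [osd] fragment from a dict of FileStore-era options. The option names exist in Jewel-era Ceph, but the values are placeholders; real tuning is workload- and release-specific and is not taken from this talk.

# Illustrative FileStore-era OSD options with placeholder values.
FLASH_OSD_TUNING = {
    "osd_op_num_shards": "8",
    "osd_op_num_threads_per_shard": "2",
    "filestore_op_threads": "8",
    "filestore_max_sync_interval": "10",
    "journal_max_write_entries": "1000",
    "journal_max_write_bytes": "1048576000",
}

def render_osd_fragment(options):
    """Render an [osd] section fragment suitable for appending to ceph.conf."""
    return "\n".join(["[osd]"] + [f"{k} = {v}" for k, v in options.items()]) + "\n"

if __name__ == "__main__":
    print(render_osd_fragment(FLASH_OSD_TUNING))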

May 5, 2016 26

InfiniFlash for OpenStack with Dis-Aggregation

Compute & Storage Disaggregation enables Optimal Resource utilization

Allows more CPU to be provisioned for OSDs serving small-block workloads

Allows higher bandwidth to be provisioned as required for large-object workloads

Independent Scaling of Compute and Storage

Higher storage capacity needs don't force you to add more compute, and vice versa

Leads to optimal ROI for PB-scale OpenStack deployments

[Diagram: a compute farm running Nova with Cinder & Glance (QEMU/KVM via librbd), a Swift object store (RGW behind web servers), and iSCSI storage (krbd with iSCSI targets), connected to a storage farm of HSEB A/B nodes running OSDs over SAS]


May 5, 2016 27

IF500 - Enhancing Ceph for Enterprise Consumption

IF500 provides usability and performance utilities without sacrificing Open Source principles

• SanDisk Ceph Distro ensures packaging with stable, production-ready code with consistent quality
• All Ceph performance improvements developed by SanDisk are contributed back to the community


SanDisk Distribution or Community Distribution

Out-of-the Box configurations tuned for performance with Flash

Sizing & planning tool

InfiniFlash drive management integrated into Ceph management (Coming Soon)

Ceph installer specifically built for InfiniFlash

High-performance iSCSI storage

Better diagnostics with a log collection tool

Enterprise-hardened SW + HW QA

May 5, 2016 28

InfiniFlash Performance Advantage: 900K random-read IOPS with 384TB of storage

Flash performance unleashed
• Out-of-the-box configurations tuned for performance with flash
• Read & write data-path changes for flash
• 3x–12x block performance improvement, depending on workload
• Almost linear performance scaling with the addition of InfiniFlash nodes
• Write performance work in progress with NV-RAM journals

• Measured with 3 InfiniFlash nodes with 128TB each
• Average latency with 4K blocks is ~2ms; 99.9th-percentile latency is under 10ms (see the note below)
• For smaller block sizes, performance is CPU-bound at the storage node
• Maximum bandwidth of 12.2GB/s measured toward 64KB blocks

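Note on the latency figures above: they are typically summarized by taking the per-IO completion latencies from the benchmark run (e.g. parsed from fio output) and reporting the mean and the 99.9th percentile. The sketch below uses synthetic lognormal latencies purely to show the calculation; the numbers are not measurements.

import numpy as np

# Synthetic per-IO latencies (milliseconds); replace with parsed benchmark data.
rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=0.6, sigma=0.5, size=1_000_000)

print(f"average latency  : {latencies_ms.mean():.2f} ms")
print(f"99.9th percentile: {np.percentile(latencies_ms, 99.9):.2f} ms")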

May 5, 2016 29

InfiniFlash Ceph Performance Advantage

Single InfiniFlash unit Performance

– 1 x 512TB InfiniFlash unit connected with 8 nodes

– 4K RR IOPS: ~1 million IOPs - 85% of bare metal perf.

• Corresponding Bare metal IF100 IOPS is 1.1 million

– All 8 hosts CPU saturated for 4K Random read.

• More performance potential with higher CPU cycles

– With 64k IO size we are able to utilize full IF150 bandwidth of over 12GB/s.

– Librbd and Krbd performance are comparable.

– Write Performance is on 3x copy configuration. The more common 2x copy will result in 33% improvement.

Random Write (librbd)
– 4k random write: 54k IOPS
– 64k random write: 34k IOPS
– 256k random write: 11.3k IOPS

[Chart: Random Read Block Performance, librbd: 1,123,175 IOPS at 4k, 349,247 at 64k, 87,369 at 256k, with bandwidth (GBps) on the secondary axis]

May 5, 2016 30

InfiniFlash Ceph Performance Advantage

Linear Scaling with 2 InfiniFlash units

– 2 x 512TB InfiniFlash unit connected with 16 nodes

– 1.8M 4K IOPS – 80% of the bare metal performance

– Performance scales almost linearly: almost double the performance of a single IF150 with Ceph

– Write performance is 2x with the 16-node cluster compared with the 8-node cluster

Random Read (librbd)
– 4k RR: 1800k IOPS, 7194 MB/s
– 64k RR: 225k IOPS, 14412 MB/s
– 256k RR: 53k IOPS, 13366 MB/s

May 5, 2016 31

InfiniFlash OS – Hardened Enterprise Class Ceph

Hardened and tested for hyperscale deployments and workloads

Platform focused testing enables us to deliver a complete and hardened storage solution

Single Vendor support for both Hardware & Software

Enterprise-level hardening, testing at scale, and failure testing:

9,000 hours of cumulative IO tests

1,100+ unique test cases

1,000 hours of Cluster Rebalancing tests

1,000 hours of IO on iSCSI

Over 100 server node clusters

Over 4PB of Flash Storage

2,000 Cycle Node Reboot

1,000 times Node Abrupt Power Cycle

1,000 times Storage Failure

1,000 times Network Failure

IO for 250 hours at a stretch

May 5, 2016 32

IF500 Reference Configurations

Model: Entry / Mid / High

– InfiniFlash: 128TB / 256TB / 512TB
– Servers [1]: 2 x Dell R630-2U / 4 x Dell R630-2U / 4 x Dell R630-2U [2]
– Processor per server: dual-socket Intel Xeon E5-2690 v3 (all configurations)
– Memory per server: 128GB RAM (all configurations)
– HBA per server: (1) LSI 9300-8e PCIe 12Gbps (all configurations)
– Network per server: (1) Mellanox ConnectX-3 dual-port 40GbE (all configurations)
– Boot drives per server: (2) SATA 120GB SSD (all configurations)

[1] For larger-block or less CPU-intensive workloads, the OSD node could use a single-socket server. Dell servers can be substituted with other vendors' servers that match the specs.
[2] For small-block workloads, 8 servers are recommended.

May 5, 2016 33

InfiniFlash TCO Advantage

[Chart: 3-year TCO comparison* (TCA plus 3-year opex, roughly $0M to $80M) and total racks required (0 to 100) for a traditional object store on HDD, IF500 object store with 3 full replicas on flash, IF500 with erasure coding on all flash, and IF500 with flash primary & HDD copies]

Reduce the replica count with higher reliability of flash

- 2 copies on InfiniFlash vs. 3 copies on HDD

InfiniFlash disaggregated architecture reduces compute usage, thereby reducing HW & SW costs

- Flash allows the use of erasure coded storage pool without performance limitations

- Protection equivalent of 2x storage with only 1.2x storage (see the sketch below)

Power, real estate, maintenance cost savings over 5 year TCO

* TCO analysis based on a US customer’s OPEX & Cost data for a 100PB deployment
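The erasure-coding arithmetic behind "protection equivalent of 2x storage with only 1.2x storage" is generic k+m math rather than anything measured here; a quick sketch:

# Generic erasure-coding overhead: k data chunks plus m coding chunks consume
# (k + m) / k of raw capacity and survive the loss of any m chunks.
def raw_overhead(k, m):
    return (k + m) / k

print(f"EC 10+2        : {raw_overhead(10, 2):.1f}x raw, survives 2 failures")
print(f"2-way replicas : {raw_overhead(1, 1):.1f}x raw, survives 1 failure")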


May 5, 2016 34

©2016 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. Other brands mentioned herein are for identification purposes only and may be the trademarks of their holder(s).

