An Exploration into Object Storage for Exascale Supercomputers
Raghu Chandrasekar
Agenda
● Introduction
● Trends and Challenges
● Design and Implementation of SAROJA
● Preliminary evaluations
● Summary and Conclusion
CUG 2017 Copyright 2017 Cray Inc. 2
Safe Harbor Statement
This presentation may contain forward-looking statements that are based on our current expectations. Forward-looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.
Storage Hierarchy Data Path Concepts
NVRAM
CPUs
DRAM
HBM
IO Nodes
NAND
Cold Storage Servers
HDD
Archive Storage Servers
Tape
High Speed Compute Fabric
Site Network
WAN / Cloud
…Switch
…Switch
…Switch
O(1µs) Nonvolatile Storage: Private Scratch Namespace; Relaxed POSIX or Key-Value API; Sharable Namespace (upon flush to backing store); App-Controlled Caching; Close-to-Open Consistency; Inter-Node Cache-Consistent in Small Clusters
O(100µs) Nonvolatile Storage: Sharable Across Computers; Lightweight Object or POSIX Interface; Primary Resilient Random Storage For HPC / Analytics / General
Distributed Storage Client: Node-Local Cache & Working Memory; Relaxed POSIX or Key-Value API; Local Cache Control; Sharding and Resiliency Controls
O(10ms) Backing Store: Site-Wide Access; Bulk Object/Cloud Storage API; High-9’s Data Resiliency; Streaming Sequential Throughput
Namespace Servers
NAND
Scalable Metadata Service: Serves Metadata to One Or More Tiers; Provide Key-Value, POSIX, namespace APIs; Attributes and Rich Metadata Structures; Map App Structures to One Another; Collections and Manifests of Large Data Sets
Multi-Second Data Distribution & Protection: Multi-Site Reach; Bulk Object Storage API; Cloud Bursting, Disaster Recovery, Archival
O(5ns) In-Package Memory: Customer Workload
O(50ns) DDR Memory: Customer Workload
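The latency labels on this slide span about six orders of magnitude. As a rough illustration, the sketch below tabulates the nominal O(...) values from the slide and computes the gap between adjacent tiers. The names `TIERS` and `latency_gaps` are illustrative only, and the values are order-of-magnitude labels rather than measurements.

```python
# Nominal access latencies in seconds, copied from the O(...) labels on the slide.
# These are order-of-magnitude figures, not measured values.
TIERS = [
    ("In-Package Memory", 5e-9),
    ("DDR Memory", 50e-9),
    ("Nonvolatile Storage, O(1us)", 1e-6),
    ("Nonvolatile Storage, O(100us)", 100e-6),
    ("Backing Store, O(10ms)", 10e-3),
]


def latency_gaps(tiers):
    """Return (faster, slower, ratio) for each pair of adjacent tiers."""
    return [
        (fast[0], slow[0], round(slow[1] / fast[1]))
        for fast, slow in zip(tiers, tiers[1:])
    ]


for faster, slower, ratio in latency_gaps(TIERS):
    print(f"{slower} is roughly {ratio}x slower than {faster}")
```

Each step down the hierarchy costs one to two orders of magnitude in latency, which is the motivation for caching at the tiers in between.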
Storage Media Latencies and IOPS
[Figure: latency breakdown (Storage, Network, Software) per storage medium]
Disk-Based (~5 msec): 200 IOPS*
Flash+RDMA (~0.05 msec): 20k IOPS*
With Pmem (~0.03 msec): 32K IOPS*
Flash/pmem (~0.0x msec)
Labeled components: Storage 5000us, Network 100us, Software 200us
* Max potential 1-thread random sector
Software becomes the largest fraction of latency when using persistent memory, even with 4x improved software efficiency
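The single-thread IOPS figures on this slide follow from the latencies: with one outstanding request at a time, IOPS is about the reciprocal of the per-request latency. The sketch below reproduces that arithmetic. It also computes the software share of latency for the one breakdown that is legible on the slide (5000us storage, 100us network, 200us software), which appears to belong to the disk-based case. The function names are illustrative.

```python
def single_thread_iops(latency_ms: float) -> float:
    """IOPS for one thread issuing one request at a time: 1 / latency."""
    return 1000.0 / latency_ms


def software_fraction(storage_us: float, network_us: float, software_us: float) -> float:
    """Share of total request latency spent in software."""
    return software_us / (storage_us + network_us + software_us)


disk = single_thread_iops(5.0)     # slide: 200 IOPS
flash = single_thread_iops(0.05)   # slide: 20k IOPS
pmem = single_thread_iops(0.03)    # slide: 32K IOPS; the rounded latency gives ~33K

# Component values as labeled on the slide, assumed to be the disk-based case.
disk_sw = software_fraction(storage_us=5000, network_us=100, software_us=200)

print(f"disk  ~{disk:,.0f} IOPS, software share {disk_sw:.1%}")
print(f"flash ~{flash:,.0f} IOPS")
print(f"pmem  ~{pmem:,.0f} IOPS")
```

With disk, software is a few percent of the total. As the media latency shrinks toward tens of microseconds, the same software cost comes to dominate, which is the point the slide makes.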
Cray Compute and Fabric Topology
Group0 Group1 Group2 Group3 Group4 Group5 Group6 Group7
Flexible compute
High density compute
Compute Node-Local Storage: Potential 256k Nodes
Enclosure-Based Storage: Potentially 64k (or more) Devices
High bandwidth Dragonfly Fabric
Analytics and HPC Software Convergence
[Figure: analytics and HPC software stacks sharing compute-side and store-side storage over one fabric. Diagram labels:]
256k Node Management, Monitor, Service Infrastructure
Analytics Framework with Local Caching
HPC File or Object with Optional Caching
Scalable Metadata Services
User Application (shown twice)
RDMA Transport (shown twice)
High-speed dragonfly fabric
Compute
Store
Flash (many devices)
Pmem (shown twice)
POSIX Files, HDF5 Containers, K/V
Spark RDDs, K/V, or Other
Discover, Query
SAROJA Proof-of-Concept
Scalable And Resilient ObJect StorAge
Object Storage Cluster
NoSQL Service
Persistent Memory
NVMe Flash
NVMe Flash
Compute Nodes
Parallel Applications
POSIX Object
SAROJA client (libsaroja)
Native API
PAXOS/RAFT/Zab Cluster
Datapath, Metadata, Consensus
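The diagram separates a metadata path (NoSQL service) from a data path (object storage cluster) behind a single client library. The toy sketch below illustrates only that split, using in-memory dictionaries. Every class and method name here is hypothetical. The slides do not show the libsaroja API, and the consensus cluster is omitted entirely.

```python
class ToyObjectClient:
    """Hypothetical illustration of a metadata/data split; not the libsaroja API."""

    def __init__(self):
        self.metadata = {}   # stands in for the NoSQL service: path -> object id
        self.objects = {}    # stands in for the object storage cluster: id -> bytes
        self._next_id = 0

    def create(self, path: str) -> int:
        """Metadata-path operation: register a name and allocate an empty object."""
        if path in self.metadata:
            raise FileExistsError(path)
        object_id = self._next_id
        self._next_id += 1
        self.metadata[path] = object_id
        self.objects[object_id] = b""
        return object_id

    def write(self, path: str, data: bytes) -> None:
        """Data-path operation: resolve the name once, then replace the object."""
        self.objects[self.metadata[path]] = bytes(data)

    def read(self, path: str) -> bytes:
        """Data-path operation: resolve the name, then fetch the object."""
        return self.objects[self.metadata[path]]


client = ToyObjectClient()
client.create("/scratch/run1/output.dat")
client.write("/scratch/run1/output.dat", b"checkpoint")
print(client.read("/scratch/run1/output.dat"))
```

Because a file create touches only the metadata store in this model, the metadata tier can in principle be scaled independently of the data tier.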
Preliminary Evaluations
Metadata Evaluations
[Figure: Ceph vs Lustre: File Creation Rates. Series: ceph-fuse, cephfs (kernel), Lustre. X-axis: Number of client processes (1, 2, 4, 8, 16, 32, 64, 128). Y-axis: Creates per second (0 to 12,000).]
Ceph POSIX support still has a long way to go
(Higher is better)
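The slides do not name the benchmark tool behind these numbers. As a minimal single-process sketch of what "creates per second" measures, the code below creates empty files in a directory and divides the count by the elapsed wall-clock time. The function name and the file-name pattern are illustrative. A real comparison would run many client processes against a mounted Ceph or Lustre file system rather than a local temporary directory.

```python
import os
import tempfile
import time


def measure_create_rate(directory: str, n_files: int) -> float:
    """Create n_files empty files in directory and return creates per second."""
    start = time.perf_counter()
    for i in range(n_files):
        path = os.path.join(directory, f"file.{i:06d}")
        # O_EXCL makes each call a true create rather than an open of an existing file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        os.close(fd)
    elapsed = time.perf_counter() - start
    return n_files / elapsed


with tempfile.TemporaryDirectory() as scratch:
    rate = measure_create_rate(scratch, 500)
    created = len(os.listdir(scratch))

print(f"{created} files created at {rate:,.0f} creates/s")
```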
Metadata Evaluations
Scaling trends are not ideal, but the approach is functionally promising
[Figure: SAROJA File Creates vs. Cassandra Servers. X-axis: Number of Cassandra Servers (1, 2, 4, 8). Y-axis: Creates per second (0 to 160,000). Reference line: Peak Lustre file creation rate (w/o DNE).]
● POSIX over SAROJA
● 4480 MPI ranks
● 56 XC compute nodes
● 500 files/rank
● TCP over GNI
● Replication disabled
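The test parameters above fix the size of the workload. The sketch below works out the totals they imply and shows how an aggregate create rate would be derived from a measured wall-clock time. The variable and function names are illustrative, and no elapsed time is given on the slide, so none is assumed here.

```python
MPI_RANKS = 4480        # from the slide
COMPUTE_NODES = 56      # from the slide
FILES_PER_RANK = 500    # from the slide

ranks_per_node = MPI_RANKS // COMPUTE_NODES
total_files = MPI_RANKS * FILES_PER_RANK


def aggregate_create_rate(total_creates: int, elapsed_s: float) -> float:
    """Aggregate creates per second across all ranks for one timed run."""
    return total_creates / elapsed_s


print(f"{ranks_per_node} ranks per node, {total_files:,} files in total")
```

Passing `total_files` and the slowest rank's wall-clock time to `aggregate_create_rate` gives a figure in the units plotted on the chart.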
Data Path Evaluation
[Figure: Ceph vs Lustre Throughput. Series: Ceph (mean), Lustre (mean), Ceph (peak), Lustre (peak). X-axis: Number of client processes (1, 2, 4, 8, 16, 32, 64, 128). Y-axis: Throughput (MB/s) (0 to 12,000).]
Viable for use in the data path; plenty of opportunities for tuning
Summary
● Inflection point in storage system design
● Three-tier storage topology for supercomputers
● Promising early investigations with object storage tech
● Gradual transition
● Call for feedback
Legal Disclaimer
Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document.
Cray Inc. may make changes to specifications and product descriptions at any time, without notice.
All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.
Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, and URIKA. The following are trademarks of Cray Inc.: APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, REVEAL, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners.