Brent Gorda
General Manager, High Performance Data Division
Legal Disclaimer
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase.
For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.
Intel, Xeon, Xeon Phi and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2016 Intel Corporation. All rights reserved.
Today: ML Data Movement with Lustre*
Collection (remote, ruggedized): terabytes of data, 100’s of MB/s of I/O
Data ingress: asynchronous upload, formatting / authentication / security, up to GB/s of I/O
Data center / cloud (Lustre parallel file system): petabytes of data, 100’s of GB/s of bandwidth (20 GB/s+ per PB), 10’s to 100’s of PB, data growth > 2 PB/month
Machine learning: rack-scale compute; data = memories, CPUs = brain; more data == better results, more CPU = faster results
Lustre with OpenStack is a growing area
https://www.openstack.org/videos/video/lustre-integration-for-hpc-on-openstack-at-cambridge-and-monash
HPE Scalable Storage with Intel Enterprise Edition for Lustre*
Designed for PB-Scale Data Sets
Density Optimized Design for Scale
• Dense storage design translates to lower $/GB
• Limitless scale capability – solution grows linearly
Innovative Software Features
Leading Edge Yet Enterprise Ready Solution
• ZFS software RAID provides snapshot, compression & error correction
• ZFS reduces hardware costs with uncompromised performance
• Rigorously tested for stability & efficiency
High Performance Storage Solution
Meets Demanding I/O Requirements (performance measured for an Apollo 4520 building block)
• Up to 17 GB/s reads / 15 GB/s writes with EDR¹
• Up to 16 GB/s reads and writes with OPA¹
• Up to 21 GB/s reads and 15 GB/s writes with all SSDs²
“Appliance Like” Delivery Model
Pre-Configured Flexible Solution
• Deployment services for installation
• Simplified wizard-driven management through Intel Manager for Lustre
[Diagram: HPE Scalable Storage with Intel EE for Lustre*: HPC clients (Apollo 6000) connect over Omni-Path to the solution built on Apollo 4520, alongside a DL360 Intel management server and a DL360 + MSA 2040.]
¹ Different conditions and workloads can affect performance. ² 24 x 1.6 TB MU SSDs in the A4520, no JBOD.
Byte-addressable Persistent Memory
[Chart: frequency of access (hot / warm / cold) plotted against random access time for rotational, NAND, and 3D XPoint media; hot data maps to NVDIMMs and NVMe/PCIe SSDs, warm data to SSDs, and cold data to disk drives, with 3D XPoint filling the random-access-time gap.]
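To make the byte-addressable tier concrete, here is a minimal sketch of byte-granular access to persistent memory, assuming the NVDIMM is exposed to the application as a DAX-mapped file (the path /mnt/pmem/log is hypothetical, and the file is assumed to already exist and be at least one page long):

```c
/* Minimal sketch: byte-granular access to persistent memory via mmap.
 * Assumes an NVDIMM exposed as a DAX-mounted file (path is hypothetical). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;
    int fd = open("/mnt/pmem/log", O_RDWR);              /* hypothetical DAX file */
    if (fd < 0) { perror("open"); return 1; }

    char *pmem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pmem == MAP_FAILED) { perror("mmap"); return 1; }

    /* Store data with ordinary CPU stores: no block read-modify-write. */
    strcpy(pmem + 64, "byte-granular update");

    /* Ensure the update reaches persistent media. */
    msync(pmem, len, MS_SYNC);

    munmap(pmem, len);
    close(fd);
    return 0;
}
```

Persistent-memory libraries such as NVML (referenced later in these slides) wrap the flush and ordering details that a raw mmap leaves to the programmer.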
3D Xpoint + Omni-Path
Byte granular
Ultra low latency
– ~0.01 µs storage latency
– + ~1 µs network latency
= ~1 µs hardware latency
Conventional Storage Stack
Block/page granular and locking
High overhead
– ~1 µs kernel/user context switch
– + ~10 µs communications software
– + ~100 µs filesystem & block I/O stack
= ~100 µs software latency
Disruptive Technologies
Entirely masks HW capabilities!
~ = order of magnitude; µs = microseconds
3D Xpoint + Omni-Path
Byte granular
Ultra low latency
– ~0.01 µs storage latency
– + ~1 µs network latency
= ~1 µs hardware latency
New Exascale Storage Stack
Arbitrary alignment and granularity
Ultra low overhead
– OS bypass comms + storage
– Shared nothing
Disruptive Technologies
Deliver HW performance! 100x/1000x increase in data velocity!
~ = order of magnitude; µs = microseconds
Lightweight Storage Stack
End-to-end OS bypass
Mercury userspace function shipping
– MPI equivalent communications latency
– Built over libfabric
Applications link directly with DAOS lib
– Direct call, no context switch (see the sketch after this slide)
– No locking, caching or data copy
Userspace DAOS server
– Mmap non-volatile memory (NVML)
– NVMe access through SPDK*
– User-level thread with Argobots**
– FPGA offload
[Diagram: HPC application linked with the DAOS library, communicating with the DAOS server over Mercury/libfabric; bulk transfers move data directly to NVMe and NVRAM on the storage side.]
* https://01.org/spdk ** https://github.com/pmodels/argobots/wiki
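As a toy illustration of the "direct call, no context switch" point above, the sketch below stands a plain userspace function call in for a Mercury RPC over libfabric; every identifier is hypothetical, and nothing here is the actual DAOS interface (see the DAOS Resources slide for the real code):

```c
/* Toy sketch of userspace function shipping: the client "ships" a write
 * request to a server routine entirely in userspace; a direct function
 * call stands in for a Mercury RPC over libfabric.  No system call, no
 * kernel context switch, no page cache.  All names are hypothetical. */
#include <stdio.h>
#include <string.h>

#define OBJ_SIZE 256

static char object_store[OBJ_SIZE];              /* stands in for NVRAM */

/* "Server side": runs in userspace; would normally execute on a storage node. */
static int srv_obj_update(size_t off, const void *buf, size_t len)
{
    if (off + len > OBJ_SIZE)
        return -1;
    memcpy(object_store + off, buf, len);        /* byte-granular, no locking */
    return 0;
}

/* "Client side": the application links with the library and calls it directly. */
int main(void)
{
    const char payload[] = "checkpoint data";
    if (srv_obj_update(16, payload, sizeof(payload)) != 0)
        return 1;
    printf("stored: %s\n", object_store + 16);
    return 0;
}
```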
Ultra-fine grained I/O Index
Mix of storage technologies
NVRAM (3D Xpoint DIMMs)
– DAOS metadata & application metadata
– Byte-granular application data
NVMe (NAND, 3D NAND, 3D Xpoint)
– Cheaper storage for bulk data
– Multi-KB
I/Os are logged & inserted into persistent index
All I/O operations tagged/indexed by version
Non-destructive write: log blob@version
Consistent read: blob@version (see the sketch below)
No alignment constraints
[Diagram: extents in the index are tagged with a version (epoch); extents at v1, v2, and v3 are marked as either being written or committed and are stored across NVRAM and NVMe, and a read@v3 returns a consistent view.]
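A toy version of the index sketched above: writes never overwrite in place but are appended to a log tagged with a version (epoch), and a read at version v replays only entries at or below v. The structure and sizes are purely illustrative, not the DAOS on-media layout:

```c
/* Toy versioned extent log: non-destructive writes tagged with a version,
 * and reads that only see entries at or below the requested version. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_LOG 16
#define OBJ_LEN 64

struct extent {
    uint64_t version;             /* epoch the write belongs to */
    size_t   off, len;            /* arbitrary alignment and length */
    char     data[OBJ_LEN];
};

static struct extent log_[MAX_LOG];
static size_t nlog;

static void write_at(uint64_t version, size_t off, const char *buf, size_t len)
{
    struct extent *e = &log_[nlog++];         /* append: never overwrite */
    e->version = version;
    e->off = off;
    e->len = len;
    memcpy(e->data, buf, len);
}

/* Replay the log in order, applying only extents with version <= v. */
static void read_at(uint64_t v, char *out)
{
    memset(out, '.', OBJ_LEN);
    for (size_t i = 0; i < nlog; i++)
        if (log_[i].version <= v)
            memcpy(out + log_[i].off, log_[i].data, log_[i].len);
}

int main(void)
{
    char buf[OBJ_LEN + 1] = {0};
    write_at(1, 0,  "AAAA", 4);               /* v1 */
    write_at(2, 2,  "BB",   2);               /* v2 overlaps v1, not in place */
    write_at(3, 10, "CCC",  3);               /* v3, not visible to a reader at v2 */
    read_at(2, buf);                          /* consistent view at version 2 */
    printf("read@v2: %.64s\n", buf);
    return 0;
}
```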
Extreme Scale-out
Scalable I/O
Lockless, no read-modify-write
Producers not blocked by consumers
– And vice-versa
Conflict resolution in I/O middleware
– No system-imposed, worst case serialization
– Ad hoc concurrency control mechanism
Scalable communications
Track process groups/jobs
– not individual compute nodes
Tree-based broadcast
Scalable metadata
Collective open/close
Tree-based caching, refcount, open handles, …
Distributed/global transactions
No object metadata maintained by default
Shared-nothing distribution schema
Algorithmic placement
– Scales with # storage nodes (see the sketch below)
Data-driven placement
– Scales with volume of data
Performance domains
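A sketch of what algorithmic placement means in practice: the location of each object shard is computed from its identifier and the current number of storage targets, so no per-object placement metadata has to be stored or looked up. The hash and layout below are illustrative and are not the DAOS placement algorithm:

```c
/* Sketch of algorithmic placement: shard locations are a pure function of
 * (object id, shard index, number of targets), so placement needs no stored
 * per-object metadata and scales with the number of storage nodes. */
#include <stdint.h>
#include <stdio.h>

/* Cheap 64-bit hash (splitmix64 finalizer). */
static uint64_t mix(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Target holding shard 'idx' of object 'oid' across 'ntargets' storage targets. */
static unsigned place(uint64_t oid, unsigned idx, unsigned ntargets)
{
    return (unsigned)(mix(oid + idx) % ntargets);
}

int main(void)
{
    const unsigned ntargets = 128, nshards = 3;   /* e.g. 3-way replication */
    uint64_t oid = 0x1234abcdULL;

    for (unsigned i = 0; i < nshards; i++)
        printf("object %#llx shard %u -> target %u\n",
               (unsigned long long)oid, i, place(oid, i, ntargets));
    return 0;
}
```

A real placement map would additionally keep shards in distinct fault and performance domains, which is presumably what the "performance domains" bullet above alludes to.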
DAOS Ecosystem
DAOS (Apache 2.0): HPC applications, storage services/engines, machine learning, …
• HDF5 + extensions (HPC)
• Legion (HPC)
• NetCDF (HPC)
• MPI-IO (HPC)
• POSIX (HPC)
• USD (computer animation)
• HDFS/Spark (analytics)
• Cloud integration
• KV store (various)
• …
DAOS Roadmap
ESSIO (Q3’15 - Q2’17): alpha quality, N-way replication, online rebuild, multi-tier prototype
Follow-on projects: DAOS productization, next-gen HW support, system integration
Future: progressive layout, erasure code, security model
DAOS Resources (Apache 2.0)
Public git repository
git clone http://review.whamcloud.com/daos/daos_m
Browsable source code: http://git.whamcloud.com/daos/daos_m.git
Mirror on github: https://github.com/daos-stack/daos
Released under the open source Apache 2.0 license
Leveraging other open source projects
– DoE: Mercury, Argobots
– Intel: Libfabric, PMIx, NVML, SPDK, ISA-L
High IOPS Lustre Configurations with Intel SSDs
Utilize single-port SSDs to build a high-performance Lustre scratch file system
Design with SATA SSDs for object storage targets and NVMe SSDs for metadata targets
An all-flash configuration on the Intel Endeavor cluster with 350 SSDs delivers remarkable performance for parallel access:
• SSDs give 4x the throughput, at 44 GB/s
• Base cost at that 4x throughput is less than a commercial HDD solution
• Continued interleaved scaling beyond 32 clients (IOzone*)
https://www-ssl.intel.com/content/www/us/en/solid-state-drives/hpc-ssd-dc-family-lustre-file-system-study.html
LFS09: Intel SSD DC S3500 Series Based Lustre* System
• 1x Metadata Server (MDS): 2x Intel® Xeon® Processor E5-2680 + 64 GB RAM + 2x SATA RAID0 MDT, FDR InfiniBand* (56 Gb/s)
• 8x Storage Servers (OSS): 2x Intel® Xeon® Processor E5-2680 + 64 GB RAM, 1x Intel® SSD 320 Series for OS, 3x RAID controllers with 8 SAS/SATA targets each, 6x OSTs per server, each target = 4x Intel® SSD DC S3500 Series @ 600 GB, SSDs over-provisioned to 75% for 450 GB usable
• Software stack: Red Hat* Enterprise Linux* 6.4 + Lustre* 2.1.5
*Other names and brands may be claimed as the property of others.
Emerging trends
Vast majority of storage objects are tiny…
Kilobytes and smaller
Trending to larger proportion of system capacity
High performance => Billions of IOPS @ lowest possible latency
Resilience => Replication improves concurrency/scalability @ acceptable storage overhead
Poorly supported by today’s filesystems
…but vast majority of space is used by large storage objects
Megabytes and larger
High performance => efficient streaming @ Terabytes per second
Resilience => Erasure codes required for space efficiency & to limit system cost (worked example below)
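To see why erasure coding matters for the large-object tier, compare the raw capacity consumed per byte of user data under n-way replication and under a k+p erasure code; the 3-way and 10+2 parameters below are illustrative choices, not figures from these slides:

```c
/* Worked example of the space-efficiency argument: raw capacity consumed
 * per byte of user data for n-way replication versus a k+p erasure code.
 * The 3-way and 10+2 parameters are illustrative only. */
#include <stdio.h>

int main(void)
{
    const double replicas = 3.0;            /* 3-way replication */
    const double k = 10.0, p = 2.0;         /* 10 data + 2 parity fragments */

    printf("replication overhead : %.1fx raw per byte stored\n", replicas);
    printf("erasure code overhead: %.1fx raw per byte stored\n", (k + p) / k);
    return 0;
}
```

Under these assumptions replication consumes 3.0x raw capacity per stored byte while the erasure code consumes 1.2x, which is why erasure coding is the natural fit for the large, streaming-dominated objects while replication remains attractive for the tiny, IOPS-dominated ones.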