NVMe/TCP Standards-Based, Fault-Tolerant Clustered Storage with LightOS
Alex Shpiner, System Architect, Lightbits
● 80 employees
● We are hiring!
● High-performance, low-latency NVMe/TCP target
● High-performance, low-latency Global Flash Translation Layer (Global FTL) with rich data services
● Optional hardware acceleration for SSD management and data services
● Standard TCP/IP network (no RDMA required)
● Standard NVMe/TCP client driver
● With application replication: LightOS v1.x
● No application replication: LightOS v2.x
● SSD-level protection: SSD failure is handled by Global FTL erasure coding (v1.x)
● Storage-server-level protection: storage server failure is handled by LightOS Clustering (v2.x)
● All clients continue working!
● Inherits storage services from LightOS 1.x
● High performance and low latency
○ Single-hop reads
○ Two-hop writes (user + replications)
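The hop counts above can be illustrated with a toy model. This is only a sketch under the assumption that the client talks to a single primary server, which then forwards each write to the secondaries; the `Server` and `ReplicatedVolume` names are hypothetical and not part of LightOS.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical storage server holding blocks keyed by LBA."""
    name: str
    blocks: dict = field(default_factory=dict)


class ReplicatedVolume:
    """Toy model: the client talks to the primary, which fans out writes."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = list(secondaries)

    def write(self, lba, data):
        hops = 1  # hop 1: client -> primary (the user write)
        self.primary.blocks[lba] = data
        for server in self.secondaries:
            server.blocks[lba] = data  # hop 2: primary -> secondaries (replication)
        if self.secondaries:
            hops += 1
        return hops

    def read(self, lba):
        # Reads are served by the primary alone: one hop.
        return self.primary.blocks[lba], 1


volume = ReplicatedVolume(Server("primary"), [Server("sec-1"), Server("sec-2")])
write_hops = volume.write(0, b"data")
data, read_hops = volume.read(0)
print(write_hops, read_hops)
```

In this model the replication hop is counted once because the primary can send to all secondaries in parallel.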
● Standard unmodified clients and network
○ Leveraging standard NVMe 1.4 and NVMe-oF 1.1
○ Transparent failover via multipath with Asymmetric Namespace Access (ANA)
● Distributed and fault-tolerant storage servers
○ Automatic volume assignment
○ Failure domains
○ Management
○ Discovery service
● Multi-replica volumes
● Each replica is stored on a separate storage server
[Figure: LightOS Cluster of storage servers; vol1_replica_1, vol1_replica_2, and vol1_replica_3 each reside on a different storage server]
● Different groups of storage servers can be impacted by common elements that share a point of failure:
○ Network
○ Power
○ Geographical location
● User-defined server assignments to specific failure domain groups
● Configured via labels assigned to servers, reflecting common dependencies
○ rack_01, rack_02, …
○ power_0, power_1, …
● Replicas are placed in different failure domains
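Label-based placement can be sketched as follows. This is a minimal illustration, not the LightOS placement algorithm: it assumes that two servers are in the same failure domain whenever they share any label, and it uses a simple greedy pass. The server names and the `place_replicas` function are hypothetical.

```python
def place_replicas(servers, replica_count):
    """Pick servers so that no two chosen servers share a failure-domain label.

    servers: dict mapping server name -> set of labels (e.g. {"rack_01", "power_0"}).
    Greedy and order-dependent: it can miss a valid placement that a full
    search would find.
    """
    chosen = []
    used_labels = set()
    for name in sorted(servers):
        labels = servers[name]
        if labels & used_labels:
            continue  # shares a rack or power feed with an already chosen server
        chosen.append(name)
        used_labels |= labels
        if len(chosen) == replica_count:
            return chosen
    raise ValueError("not enough independent failure domains")


servers = {
    "server-1": {"rack_01", "power_0"},
    "server-2": {"rack_01", "power_1"},
    "server-3": {"rack_02", "power_1"},
    "server-4": {"rack_03", "power_2"},
    "server-5": {"rack_04", "power_0"},
}
placement = place_replicas(servers, 3)
print(placement)
```

Here `server-2` is skipped because it shares `rack_01` with `server-1`, so the three replicas end up with no rack or power feed in common.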
[Figure: LightOS Cluster with storage servers grouped under labels rack_01 through rack_05; vol1_replica_1, vol1_replica_2, and vol1_replica_3 are placed in different racks]
[Figure: NVMe/TCP client and a LightOS Cluster of three storage servers (Secondary, Secondary, Primary), showing the write and read paths]
● Temporary failure: "partial rebuild", only the necessary data is sent
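One way to picture a partial rebuild: while a secondary is unreachable, the primary remembers which blocks that secondary missed, and on recovery it resends only those. The sketch below assumes exactly that bookkeeping; the `Replica` and `Primary` classes are hypothetical and say nothing about how LightOS tracks missed writes internally.

```python
class Replica:
    """Hypothetical secondary replica."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.online = True


class Primary:
    """Hypothetical primary that tracks writes missed by offline secondaries."""

    def __init__(self, secondaries):
        self.blocks = {}
        self.secondaries = secondaries
        self.missed = {s.name: set() for s in secondaries}

    def write(self, lba, data):
        self.blocks[lba] = data
        for s in self.secondaries:
            if s.online:
                s.blocks[lba] = data
            else:
                self.missed[s.name].add(lba)  # remember what must be resent

    def partial_rebuild(self, secondary):
        """Resend only the missed blocks; return how many were sent."""
        secondary.online = True
        missed = self.missed[secondary.name]
        for lba in missed:
            secondary.blocks[lba] = self.blocks[lba]
        sent = len(missed)
        missed.clear()
        return sent


a, b = Replica("a"), Replica("b")
primary = Primary([a, b])
for lba in range(100):
    primary.write(lba, b"v1")
b.online = False          # temporary failure of one secondary
primary.write(7, b"v2")
primary.write(42, b"v2")
sent = primary.partial_rebuild(b)
print(sent)
```

Only the two blocks written during the outage are transferred, rather than all 100 blocks of the volume.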
● Symmetric
● Asymmetric
● LightOS leverages NVMe ANA for clustering
○ Failure handling
[Figure: NVMe/TCP client and a LightOS Cluster of three storage servers. Initial roles: Secondary, Secondary, Primary. After a failure: Secondary, Primary, Secondary.]
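Client-side path selection under ANA can be sketched as below. The state names follow the NVMe ANA states (the persistent-loss and change states are omitted for brevity). The mapping of the primary to "optimized" and the secondaries to "inaccessible" is an assumption made for illustration, and `select_path` is a hypothetical helper, not the Linux multipath implementation.

```python
from enum import Enum


class AnaState(Enum):
    OPTIMIZED = "optimized"
    NON_OPTIMIZED = "non-optimized"
    INACCESSIBLE = "inaccessible"


# Usable states, best first. Inaccessible paths are never selected.
PREFERENCE = [AnaState.OPTIMIZED, AnaState.NON_OPTIMIZED]


def select_path(paths):
    """Return the name of the best usable path, or None if there is none.

    paths: dict mapping path name -> AnaState reported by that controller.
    """
    for state in PREFERENCE:
        for name in sorted(paths):
            if paths[name] is state:
                return name
    return None


# Initial roles: Secondary, Secondary, Primary.
paths = {
    "server-1": AnaState.INACCESSIBLE,
    "server-2": AnaState.INACCESSIBLE,
    "server-3": AnaState.OPTIMIZED,
}
before = select_path(paths)

# The primary fails and server-2 takes over the primary role.
paths["server-3"] = AnaState.INACCESSIBLE
paths["server-2"] = AnaState.OPTIMIZED
after = select_path(paths)
print(before, after)
```

Because the client only follows the reported ANA states, the failover needs no LightOS-specific logic on the client side.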
[Figure: NVMe/TCP client and a LightOS Cluster of three storage servers, with Cluster Management, Discovery, and API components]
[Figure panels: Initial state, Missing; Initial state, Failover]
● Listing devices on the client
○ lsblk, nvme list
● Path states: optimized, inaccessible
○ nvme list-subsys <dev>