"How multifarious and how mutually complicated are the considerations which the working of such an engine involve. There are frequently several distinct sets of effects going on simultaneously; all in a manner independent of each other, and yet to a greater or less degree exercising a mutual influence. To adjust each to every other, and indeed even to perceive and trace them out with perfect correctness and success, entails difficulties whose nature partakes to a certain extent of those involved in every question where conditions are very numerous and inter-complicated."

— Ada Lovelace
Hardware outgrowing software
+ CPU clocks are not getting faster.
+ More cores, but they are hard to use.
+ Locks have costs even with no contention.
+ Data is allocated on one core, then copied and used on others.
+ Result: software can't keep up with new hardware (SSDs, 10 Gbps networking, …).
[Diagram: the traditional stack — application threads above the kernel's scheduler and TCP/IP stack, per-connection queues, NIC queues, and shared kernel memory]
Workloads changing
+ Complex, multi-layered applications
+ NoSQL data stores
+ More users
+ Lower latencies needed
+ Microservices
- 81% of Redis processing time is spent in the kernel.
- If a page needs 100 requests, the "99% latency" tail affects 63% of pageviews.
Benchmark hardware
■ 2x Xeon E5-2695v3, 2.3 GHz, 35 MB cache, 14 cores each (28 cores total, 56 hyperthreads)
■ 8x 8 GB DDR4 Micron memory
■ Intel Ethernet CNA XL710-QDA1
A new model
Threads
+ Uses available skills and tools
- Costly locking (example: POSIX requires that multiple threads be able to use the same socket)
Shared-nothing
+ Fewer wasted cycles
- Cross-core communication must be explicit, so it is harder to program
How
■ A single-threaded async engine runs on each CPU
■ No threads
■ No shared data
■ All inter-CPU communication by message passing
Linear scaling
+ Each core runs its own engine
+ Shared-nothing per-core design
+ Fits the existing shared-nothing distributed applications model
+ Full kernel bypass, supports zero-copy
+ No threads, no context switches, and no locks!
+ Instead, asynchronous lambda invocation
[Diagram: SeaStar's sharded stack — one application shard per core, each with its own task scheduler and userspace TCP/IP stack, connected by smp queues; DPDK drives the NIC queue directly and the kernel isn't involved]
Comparison with old school
[Diagram: side by side — the traditional stack (application threads over the kernel's scheduler, TCP/IP stack, and shared memory) versus SeaStar's sharded stack (per-core userspace engines over DPDK, kernel not involved), scaling to millions of connections]
[Diagram: in SeaStar, each CPU runs chains of promise/task pairs; in a traditional design, each CPU's scheduler runs threads, each with its own stack]
■ A promise is a pointer to an eventually computed value.
■ A task is a pointer to a lambda function.
■ A thread is a function pointer; a stack is a byte array from 64 KB to megabytes.
But how can you program it?
■ Ada Lovelace's problem today
■ Need the maximum possible "easy" without giving up any "fast."
If the answer were "no", would this book be 467 pages long?
F-P-C defined: Future
A future is the result of a computation that may not be available yet:
■ a data buffer that we are reading from the network
■ the expiration of a timer
■ the completion of a disk write
■ the result of a computation that requires the values from one or more other futures
F-P-C defined: Promise
A promise is an object or function that
provides you with a future, with the
expectation that it will fulfill the future.
Basic future/promise

future<int> get();   // promises an int will be produced eventually
future<> put(int);   // promises to store an int

void f() {
    get().then([] (int value) {
        put(value + 1).then([] {
            std::cout << "value stored successfully\n";
        });
    });
}
Chaining

future<int> get();   // promises an int will be produced eventually
future<> put(int);   // promises to store an int

void f() {
    get().then([] (int value) {
        return put(value + 1);
    }).then([] {
        std::cout << "value stored successfully\n";
    });
}
Zero copy friendly

future<temporary_buffer> socket::read(size_t n);

■ temporary_buffer points at driver-provided pages if possible
■ stack can linearize scatter-gather buffers using page tables
■ discarded after use
Zero copy friendly (2)

pair<future<size_t>, future<temporary_buffer>> socket::write(temporary_buffer);

■ The first future becomes ready when the TCP window allows sending more data (usually immediately)
■ The second future becomes ready when the buffer can be discarded (after TCP ACK)
■ The two may complete in any order
Fully async filesystem
No threads.

read_metadata().then([] {
    return lock_pages();
}).then([] {
    return read_data();
});
Shared state: networking
■ No shared state except the index of net channels (one per CPU)
■ No migration of existing TCP connections
Handling shared state: block storage
■ Each CPU is responsible for handling specific files/directories/free blocks (by hash)
■ Can delegate access to another CPU for locality, but never concurrent shared access
■ Flash-optimized - no fancy layout
■ DMA only
Deployment models
[Diagram: deployment options — in a Linux process, Seastar TCP over DPDK or plain Linux sockets; on OSv, Seastar TCP over virtio or raw device access, or OSv networking]
Performance results
■ Linear scaling to 20 cores and beyond
■ 250,000 transactions/core (memcached)
■ Currently limited by the client; more client development is in progress.
Applications
■ HTTP server
■ NoSQL system
■ Distributed filesystem
■ Object store
■ Transparent proxy
■ Cache (memcached, CDN, …)
■ NFV