Hardware Acceleration of Key-Value Stores
Sagar Karandikar, Howard Mao, Albert Ou, Yunsup Lee
Advisor: Krste Asanovic
Motivation
▶ In datacenter applications, the path through the CPU, kernel, and application accounts for 86% of total request latency
▶ Goal: serve popular requests without interrupting the CPU
▶ Solution: a hardware key-value store attached to the network interface controller
▶ Many workloads have an access pattern suitable for a small dedicated cache:
  – Per a Facebook study, 10% of keys represent 90% of requests
  – Most values are relatively small in size (1 kB)
Related Work
▶ A 2013 paper by Lim et al. proposed a system dubbed “Thin Servers with Smart Pipes,” which served memcached GET requests from FPGA hardware.
▶ However, that FPGA hardware handled GET requests by accessing DRAM, not a local SRAM cache.
Infrastructure
[Figure: Xilinx ZC706 evaluation platform, annotated with the SFP cage (1 GbE SFP, not pictured), DDR3 SDRAM, and the Xilinx XC7Z045 FPGA hosting the Rocket core, accelerator, traffic manager, and DMA engine]
▶ Zynq-7000 SoC
▶ Brocade 1 GbE copper SFP transceiver
▶ Xilinx Tri-Mode Ethernet MAC
▶ Xilinx 1000BASE-X PCS/PMA
▶ 64-bit RISC-V Rocket core (50 MHz)
  – Single-issue, in-order, 6-stage pipeline
  – ASIC version most nearly comparable to the ARM Cortex-A5
[Figure: Rocket 6-stage pipeline: PC generation; I$ access (with ITLB); instruction decode (with integer register file read); integer execute (with DTLB); D$ access; commit. A floating-point path adds FP register file read and FP EX1–EX3 stages]
▶ No pre-existing I/O peripherals for the Rocket core
▶ Built the first RISC-V hardware device: a register-mapped NIC (see the sketch below)
  – Programmed I/O with a custom Linux kernel driver
  – First telnet/ssh session into a physical RISC-V machine
▶ Evolved into a DMA-based NIC for performance
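As a rough illustration of what driving such a register-mapped NIC looks like from the kernel driver, here is a minimal C sketch. The MMIO base address, register offsets, and word-at-a-time send protocol are assumptions for illustration; the poster does not give the actual register map.

```c
/* Hypothetical register map for the register-mapped NIC; the base address,
 * offsets, and send protocol below are assumptions for illustration. */
#include <stddef.h>
#include <stdint.h>

#define NIC_BASE     0x40000000UL   /* assumed MMIO base address */
#define NIC_TX_DATA  0x00           /* write: next 64-bit word of the frame */
#define NIC_TX_LEN   0x08           /* write: frame length in bytes; starts send */
#define NIC_RX_LEN   0x10           /* read: length of pending frame, 0 if none */
#define NIC_RX_DATA  0x18           /* read: next 64-bit word of the frame */

static inline volatile uint64_t *nic_reg(uintptr_t off)
{
    return (volatile uint64_t *)(NIC_BASE + off);
}

/* Programmed I/O transmit: the CPU copies the frame word by word,
 * then writes the length register to kick off transmission. */
static void nic_send(const uint64_t *frame, size_t len_bytes)
{
    for (size_t i = 0; i < (len_bytes + 7) / 8; i++)
        *nic_reg(NIC_TX_DATA) = frame[i];
    *nic_reg(NIC_TX_LEN) = len_bytes;
}
```

Every outgoing word crosses the bus under CPU control, which is exactly the overhead that motivated the move to a DMA-based NIC.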
Software
▶ Manages which keys and values are stored on the accelerator
▶ Controls the accelerator through the RoCC co-processor interface, which provides custom instructions for setting keys and values (see the sketch below)
▶ Responsible for implementing cache replacement policies:
  – Identification of the most popular keys as candidates for offloading
  – Invalidation of stale entries
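A hedged sketch of how software might issue those RoCC commands from C follows. The custom-0 opcode is the standard RoCC entry point, but the funct7 values (0 = set key, 1 = set value) and the operand meanings (buffer pointer and length) are assumptions, not taken from the poster.

```c
/* RoCC command sketch: funct7 values and operand semantics are assumed.
 * funct3 = 0b011 marks both rs1 and rs2 as used, with no destination. */
#include <stdint.h>

#define ROCC_CMD(funct7, rs1, rs2)                                   \
    asm volatile (".insn r CUSTOM_0, 0x3, " #funct7 ", x0, %0, %1"   \
                  :: "r"(rs1), "r"(rs2) : "memory")

/* Hypothetical commands: point the accelerator at a key or value buffer. */
static inline void kv_set_key(const void *key, uint64_t len)
{
    ROCC_CMD(0, (uintptr_t)key, len);
}

static inline void kv_set_value(const void *val, uint64_t len)
{
    ROCC_CMD(1, (uintptr_t)val, len);
}
```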
System Architecture
[Figure: Baseline architecture. Tile: Rocket scalar core with 16 KiB I$ and 32 KiB D$ behind a 2:1 arbiter. Uncore: HTIF and the DMA engine feed an L2 coherence agent over TileLink, backed by 512 MiB SDRAM; the NIC attaches over AXI]
[Figure: Enhanced architecture. Tile: Rocket scalar core plus the key-value store accelerator attached over RoCC, sharing the 16 KiB I$ and 32 KiB D$ through 2:1 arbiters. Uncore: HTIF and the DMA engine feed the L2 coherence agent over TileLink, backed by 512 MiB SDRAM; the traffic manager sits between the NIC and the AXI interconnect]
[Figure: Accelerator datapath. A controller drives two hashers on Key In; the key compare unit checks the current key against the key cache (all keys); hits read the value memory to produce Value Out. A writer and memory handler service the RoCC command and RoCC memory interfaces through a mux]
[Figure: Traffic manager datapath. A splitter steers frames from MAC Rx into a main buffer and a defer buffer; the main writer, arbiter, responder, and controller assemble responses for MAC Tx, with DMA Rx/Tx paths to the host and Key Out / Result In ports to the accelerator]
▶ The accelerator accepts a key and computes primary and secondary hash values, which it uses to retrieve the value from its local block RAM (modeled in the sketch below).
▶ The traffic manager, interposed between the NIC and the DMA engine, implements the specialized memcached logic.
▶ For intercepted memcached GET requests, the traffic manager queries the accelerator and constructs the response packet if the key is found.
▶ Unhandled frames are forwarded to the DMA engine for processing by the operating system.
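The following C sketch is a software model of that lookup path: two hash functions pick candidate slots in a small table standing in for the on-chip block RAM, and a full key comparison guards against collisions. The hash function (FNV-1a with two seeds) and the table geometry are illustrative assumptions, not the hardware's actual design.

```c
/* Software model of the accelerator lookup: primary and secondary hashes
 * pick two candidate slots; a stored-key comparison confirms a hit. */
#include <stdint.h>
#include <string.h>

#define SLOTS    1024
#define KEY_MAX    64
#define VAL_MAX  1024

struct slot {
    uint32_t klen, vlen;
    char key[KEY_MAX];
    char val[VAL_MAX];
};
static struct slot table[SLOTS];

static uint32_t fnv1a(const char *k, uint32_t n, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    for (uint32_t i = 0; i < n; i++) {
        h ^= (uint8_t)k[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns the value on a hit; NULL on a miss (the frame would then be
 * forwarded to the DMA engine for the OS to handle). */
static const char *kv_get(const char *key, uint32_t klen, uint32_t *vlen)
{
    uint32_t idx[2] = {
        fnv1a(key, klen, 0)           % SLOTS,   /* primary hash   */
        fnv1a(key, klen, 0x9e3779b9u) % SLOTS,   /* secondary hash */
    };
    for (int i = 0; i < 2; i++) {
        struct slot *s = &table[idx[i]];
        if (s->klen == klen && memcmp(s->key, key, klen) == 0) {
            *vlen = s->vlen;
            return s->val;
        }
    }
    return NULL;
}
```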
DMA Engine
▶ Performs uncached memory accesses via the TileLink protocol
▶ Transfers a 512-bit cache block per request
▶ Front-end/back-end decoupling allows load prefetching to hide memory latency
▶ Buffer descriptor rings exposed as queues through processor control registers (see the sketch after the figure below)
▶ Provides 250× the query throughput of programmed I/O
[Figure: DMA engine Tx and Rx datapaths. Each direction pairs a front-end and a back-end exchanging data/last signals; the back-ends issue TileLink acquire/grant/finish transactions. Descriptor rings hold address/count descriptors and are managed through CSRs exposing rptr, wptr, and cptr pointers, with IRQ lines for completion]
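A minimal sketch of how a driver might feed the Tx descriptor ring through those pointer CSRs, following the rptr/wptr scheme in the figure; the CSR addresses, descriptor layout, and ring depth are assumptions.

```c
/* Tx descriptor ring sketch: software fills a descriptor, then publishes it
 * by advancing the write pointer; the engine consumes until rptr == wptr.
 * CSR addresses, descriptor format, and ring depth are assumptions. */
#include <stdint.h>

#define RING_DEPTH 16

struct dma_desc {
    uint64_t addr;    /* physical address of the buffer */
    uint64_t count;   /* length in bytes */
};

static volatile struct dma_desc tx_ring[RING_DEPTH];
static volatile uint64_t *const tx_wptr = (volatile uint64_t *)0x40001000; /* assumed CSR */
static volatile uint64_t *const tx_rptr = (volatile uint64_t *)0x40001008; /* assumed CSR */

/* Post one buffer for transmission; returns -1 if the ring is full. */
static int dma_tx_post(uint64_t addr, uint64_t count)
{
    uint64_t w = *tx_wptr, r = *tx_rptr;
    if (w - r == RING_DEPTH)
        return -1;
    tx_ring[w % RING_DEPTH].addr  = addr;
    tx_ring[w % RING_DEPTH].count = count;
    *tx_wptr = w + 1;   /* hand the descriptor to the engine */
    return 0;
}
```

Because software only touches the CSRs to publish and retire descriptors, the engine's front-end/back-end pair can prefetch and stream whole cache blocks without per-word CPU involvement.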
Floorplan
[Figure: FPGA floorplan]
Utilization
Resource           w/o accelerator + TM    w/ accelerator + TM
Slice LUTs               17.09%                  21.79%
Slice registers           6.18%                   8.01%
Memory                   21.65%                  63.85%
Latency Evaluation
[Figures: Request latency by percentile (0–90th) under four key-popularity distributions: uniform, normal, Pareto, and Facebook ETC. Each plot compares requests served by memcached on Rocket against requests served by the accelerator; y-axis is latency in µs]
Conclusion
▶ By moving some of the keys to the accelerator and serving them directly from hardware at the NIC, we gained an order-of-magnitude speedup over the memcached software running on the Rocket core (1700 µs vs. 150 µs response latency).
▶ For the Facebook ETC distribution, the accelerator serves 40% of keys at this reduced latency.
▶ However, we still have a long way to go before reaching production quality.
Future Work
▶ Place the DMA engine, traffic manager, and accelerator in faster clock domains with asynchronous FIFOs, rather than being constrained by the core frequency
▶ Widen I/O interfaces for greater throughput
▶ Investigate replacing the fixed-function traffic manager with a programmable co-processor (reminiscent of mainframe channel I/O)
▶ Conduct torture testing for reliability
▶ Explore opportunities for measuring and improving energy efficiency
University of California, Berkeley · ASPIRE Winter 2015 Retreat · Department of Electrical Engineering & Computer Sciences