BlueDBM: An Appliance for Big Data Analytics
Sang-Woo Jun* Ming Liu* Sungjin Lee* Jamey Hicks+
John Ankcorn+ Myron King+ Shuotao Xu* Arvind*
*MIT Computer Science and Artificial Intelligence Laboratory   +Quanta Research Cambridge
June 15, 2015. ISCA 2015, Portland, OR.
This work is funded by Quanta, Samsung, and Lincoln Laboratory. We also thank Xilinx for their hardware and expertise donations.
Big data analytics

Analysis of previously unimaginable amounts of data can provide deep insight:
- Google has predicted flu outbreaks a week earlier than the Centers for Disease Control and Prevention (CDC)
- Analyzing a personal genome can determine predisposition to diseases
- Social network chatter analysis can identify political revolutions before newspapers
- Scientific datasets can be mined to extract accurate models

Likely to be the biggest economic driver for the IT industry for the next decade.
A currently popular solution: RAMCloud

A cluster of machines with large DRAM capacity and a fast interconnect:
+ Fastest as long as data fits in DRAM
- Power hungry and expensive
- Performance drops when data doesn't fit in DRAM

What if enough DRAM isn't affordable?

Flash-based solutions may be a better alternative:
+ Faster than disk, cheaper than DRAM
+ Lower power consumption than both
- The legacy storage access interface is burdensome
- Slower than DRAM
Related work

Use of flash: SSDs, FusionIO, Pure Storage, ZetaScale; SSD for database buffer pool and metadata [SIGMOD 2008], [IJCA 2013]
Networks: QuickSAN [ISCA 2013]; Hadoop/Spark on InfiniBand RDMA [SC 2012]
Accelerators: SmartSSD [SIGMOD 2013], Ibex [VLDB 2014], Catapult [ISCA 2014], GPUs
Latency profile of distributed flash-based analytics

Distributed processing involves many system components:
- Flash device access: 75 μs (devices range 50~100 μs)
- Storage software (OS, FTL, …): 100 μs (typically 100~1000 μs)
- Network interface (10GbE, InfiniBand, …): 20 μs (typically 20~1000 μs)
- Actual processing

Latency is additive: each additional storage-software or network hop adds its full cost to the end-to-end total.
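To make the additive-latency point concrete, here is a small back-of-the-envelope sketch in Python using the representative numbers above. The two-software-layer decomposition on the conventional path is an assumption for illustration, not a measurement from the talk:

```python
# Back-of-the-envelope model of one remote flash access, using the
# representative per-component latencies above (all in microseconds).
FLASH_ACCESS = 75   # raw flash device read
STORAGE_SW   = 100  # OS / FTL / driver overhead per storage hop
NETWORK      = 20   # network interface traversal

# Conventional stack (an illustrative decomposition): storage software
# runs both on the remote node serving the flash and on the requester.
conventional = FLASH_ACCESS + STORAGE_SW + NETWORK + STORAGE_SW
print(f"conventional remote access: ~{conventional} us")

# Streamlined stack: near-storage processing and a dedicated controller
# network keep storage software off the critical path (< 20 us per hop).
streamlined = FLASH_ACCESS + NETWORK
print(f"streamlined remote access:  ~{streamlined} us")
```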
Latency profile of distributed flash-based analytics

Architectural modifications can remove unnecessary overhead:
- Near-storage processing
- Cross-layer optimization of the flash management software
- A dedicated storage area network
- Accelerators

With these, the storage-software layer disappears from the critical path, leaving flash access (75 μs) and a network that adds less than 20 μs per hop.

This is difficult to explore using flash packaged as off-the-shelf SSDs, so a custom flash card had to be built.
[Photo: the custom flash card. An Artix-7 FPGA manages the flash array (chips on both sides of the board) over four buses (Bus 0~3); the card exposes network ports and connects to the host-side VC707.]
BlueDBM: Platform with near-storage processing and inter-controller networks

- 20 24-core Xeon servers, 20 BlueDBM storage devices
- 1TB flash storage per device
- x4 20Gbps controller network
- Xilinx VC707 FPGA, 2GB/s PCIe

[Photos: 1 of the 2 racks (10 nodes); a BlueDBM storage device.]
BlueDBM node architecture

[Diagram: each flash device combines an in-storage processor, flash controller, and network interface, and attaches to its host server over PCIe.]

- Lightweight flash management with very low overhead: adds almost no latency; ECC support
- Custom network protocol with low latency and high bandwidth: x4 20Gbps links at 0.5 μs latency; virtual channels with flow control
- Software has very low-level access to flash storage: high-level information can be used for low-level management, with the FTL implemented inside the file system

No time to go into the gritty details!
BlueDBM software view

BlueDBM provides a generic file system interface as well as an accelerator-specific interface, aided by Connectal (by Quanta).

[Diagram: in user space, hardware-assisted applications call a Connectal-generated proxy; in kernel space, the file system and block device driver sit alongside the Connectal wrapper and accelerator manager; on the FPGA, the HW accelerator, flash controller, and network interface sit above the NAND flash.]
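As a rough illustration of those two access paths, here is a minimal Python sketch. Every name in it (the mount point, the proxy class, the request signature) is hypothetical, not the actual Connectal-generated API:

```python
# Minimal sketch of the two software access paths, under assumed names.
import os

DATASET = "/mnt/bluedbm/dataset.bin"  # hypothetical mount point

# Path 1: the generic file system interface. Since the FTL lives inside
# the file system, this is ordinary POSIX I/O through the block device driver.
if os.path.exists(DATASET):
    with open(DATASET, "rb") as f:
        chunk = f.read(8192)

# Path 2: the accelerator-specific interface. The host sends a small
# request and receives a small result; bulk data stays near the flash.
class AcceleratorProxy:
    """Stand-in for a Connectal-generated user-space proxy."""
    def request(self, opcode: int, payload: bytes) -> bytes:
        # In the real system this crosses PCIe via the Connectal wrapper
        # to the HW accelerator; here it is only a placeholder.
        return b""

accel = AcceleratorProxy()
result = accel.request(opcode=1, payload=b"query histogram")
```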
Power consumption is low

Component             Power (Watts)
VC707                 30
Flash Board (x2)      10
Storage Device Total  40

Component             Power (Watts)
Storage Device        40
Xeon Server           200+
Node Total            240+

The storage device figure is a very conservative estimate; a GPU-based accelerator would double the power.
Applications

- Content-based image search*: faster flash with accelerators as a replacement for DRAM-based systems
- BlueCache, an accelerated memcached*: a dedicated network and accelerated caching with larger capacity
- Graph analytics: the benefits of lower-latency access into distributed flash for computation on large graphs

* Results obtained since the paper submission
Content-based image retrieval

Takes a query image and returns similar images from a dataset of tens of millions of pictures. Image similarity is determined by measuring the distance between the histograms of the images, where each histogram is generated using RGB, HSV, "edgeness", and so on. Better algorithms are available!
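A minimal sketch of this histogram-distance idea, assuming simple per-channel color histograms and an L1 distance (the slides do not specify the exact features or metric used):

```python
import numpy as np

def color_histogram(image: np.ndarray, bins: int = 16) -> np.ndarray:
    """Concatenated per-channel histogram of an HxWx3 uint8 image,
    normalized so images of different sizes are comparable."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(np.float64)
    return h / h.sum()

def histogram_distance(h1: np.ndarray, h2: np.ndarray) -> float:
    """L1 distance between normalized histograms; smaller = more similar."""
    return float(np.abs(h1 - h2).sum())

# Rank a toy dataset by similarity to a query image.
rng = np.random.default_rng(0)
dataset = [rng.integers(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(100)]
query_hist = color_histogram(dataset[42])
ranked = sorted(range(len(dataset)),
                key=lambda i: histogram_distance(query_hist,
                                                 color_histogram(dataset[i])))
print("closest matches:", ranked[:5])  # index 42 ranks first (distance 0)
```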
Image search accelerator (Sang-Woo Jun, Chanwoo Chung)

[Pipeline diagram: on the FPGA, data streams from the flash controller through a Sobel filter and a histogram generator into a comparator against the query histogram; only the comparison results are returned to software.]
Image query performance without sampling

[Chart: an off-the-shelf M.2 SSD is CPU-bottlenecked; BlueDBM + CPU is faster; BlueDBM + FPGA is fastest.]

Faster flash with acceleration can perform at DRAM speed.
Sampling to improve performance

Intelligent sampling methods (e.g., Locality-Sensitive Hashing) improve performance by dramatically reducing the search space, but they introduce random access patterns: the data accesses corresponding to a single hash table entry result in many random accesses.

[Diagram: a locality-sensitive hash table whose entries point to scattered locations in the data.]
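A minimal sketch of the LSH idea, assuming random-hyperplane hashing over feature vectors (the slides do not name a specific LSH family):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)
DIM, N_PLANES = 48, 12  # vector dimensionality, hash bits per table

# Random-hyperplane LSH: each bit records which side of a random
# hyperplane a vector falls on, so nearby vectors tend to share buckets.
planes = rng.standard_normal((N_PLANES, DIM))

def bucket(vec: np.ndarray) -> int:
    bits = planes @ vec > 0
    return int(bits @ (1 << np.arange(N_PLANES)))

# Index a toy dataset of feature vectors (e.g., image histograms).
vectors = rng.random((100_000, DIM))
codes = (vectors @ planes.T > 0).astype(np.int64) @ (1 << np.arange(N_PLANES))
table = defaultdict(list)
for idx, code in enumerate(codes):
    table[int(code)].append(idx)    # buckets hold indices into the dataset

# A query touches one bucket, dramatically shrinking the search space --
# but the surviving indices point to scattered locations on storage,
# which is exactly the random access pattern the slide mentions.
query = vectors[123] + 0.01 * rng.standard_normal(DIM)
candidates = table[bucket(query)]
print(f"{len(candidates)} candidates out of {len(vectors)}")
```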
Image query performance with sampling

A disk-based system cannot take advantage of the reduced search space.
memcached service

A distributed in-memory key-value store that caches DB results indexed by query strings. It is accessed via socket communication and uses system DRAM for caching (~256GB). Extensively used by database-driven websites: Facebook, Flickr, Twitter, Wikipedia, YouTube, …

[Diagram: browsers and mobile apps send web requests to application servers, which issue memcached requests to memcached servers and return the cached data.]

Networking contributes 90% of the overhead.
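For concreteness, this is what that socket round trip looks like with memcached's standard ASCII protocol (a minimal sketch; it assumes a memcached server is listening on localhost:11211):

```python
import socket

def memcached_set(sock: socket.socket, key: bytes, value: bytes) -> bytes:
    # "set <key> <flags> <exptime> <bytes>\r\n<data>\r\n" -> b"STORED\r\n"
    sock.sendall(b"set %s 0 0 %d\r\n%s\r\n" % (key, len(value), value))
    return sock.recv(4096)

def memcached_get(sock: socket.socket, key: bytes) -> bytes:
    # "get <key>\r\n" -> b"VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n"
    sock.sendall(b"get %s\r\n" % key)
    return sock.recv(4096)

# Every operation pays a full network round trip plus kernel socket
# overhead on both ends -- the source of the 90% figure above.
with socket.create_connection(("localhost", 11211)) as s:
    print(memcached_set(s, b"user:42", b"cached-db-row"))
    print(memcached_get(s, b"user:42"))
```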
BlueCache: accelerated memcached service (Shuotao Xu)

- The memcached server is implemented in hardware: hashing and flash management are implemented in the FPGA, with 1TB of hardware-managed flash cache per node
- The hardware server is accessed via local PCIe, with a direct inter-controller network between the hardware nodes

[Diagram: each web server reaches its local BlueCache accelerator (flash controller plus 1TB flash) over PCIe; the accelerators are connected to each other by the inter-controller network.]
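A toy sketch of the kind of flash-backed key-value lookup this implies. It is purely illustrative: bucket collision handling, eviction, and the real hardware data layout are all omitted:

```python
import hashlib

PAGE = 8192
flash = bytearray(1 << 20)  # stand-in for the 1TB flash array
index = {}                  # bucket -> (offset, length); the hash table kept in hardware

def bucket_of(key: bytes, buckets: int = 1 << 16) -> int:
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "little") % buckets

next_free = 0
def kv_set(key: bytes, value: bytes) -> None:
    """Append the value to flash and point the key's bucket at it.
    (Bucket collisions simply overwrite here; real hardware must chain.)"""
    global next_free
    flash[next_free:next_free + len(value)] = value
    index[bucket_of(key)] = (next_free, len(value))
    next_free += ((len(value) + PAGE - 1) // PAGE) * PAGE  # page-aligned, flash-friendly

def kv_get(key: bytes) -> bytes | None:
    loc = index.get(bucket_of(key))
    if loc is None:
        return None  # cache miss: fall back to the database
    offset, length = loc
    return bytes(flash[offset:offset + length])  # a single flash page read

kv_set(b"user:42", b"cached-db-row")
print(kv_get(b"user:42"))
```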
Effect of architecture modification (no flash, only DRAM)

Get operations (key size = 64 bytes, value size = 64 bytes):

System              Throughput (KOps/s)
BlueCache           4012
Local memcached     357
Remote memcached    273

PCIe DMA and the inter-controller network reduce access overhead, and FPGA acceleration of memcached is effective: an 11X performance improvement.
High cache-hit rates outweigh slow flash accesses (small DRAM vs. large flash)

[Chart: throughput (KOps/s) vs. miss rate (0~50%) for BlueCache (0.5TB flash)* and local memcached (50GB DRAM); key size = 64 bytes, value size = 8K bytes, 5ms penalty per cache miss.]

BlueCache starts performing better once local memcached misses 5% of requests: a "sweet spot" for large flash caches exists.

* Assuming no cache misses for BlueCache
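A toy serial model of where such a crossover comes from. Only the 5 ms penalty is from the slide; the per-hit throughputs are illustrative assumptions, and the real system overlaps many outstanding requests, so this shows the shape of the trade-off rather than reproducing the measured 5% point:

```python
# Toy model: a cache's effective throughput collapses as misses
# (each costing the slide's 5 ms penalty) accumulate.
MISS_PENALTY = 5e-3  # seconds per cache miss (from the slide)

def effective_kops(hit_kops: float, miss_rate: float) -> float:
    """Throughput when a fraction miss_rate of requests also pay the penalty."""
    hit_time = 1.0 / (hit_kops * 1e3)
    return 1.0 / (hit_time + miss_rate * MISS_PENALTY) / 1e3

dram_kops, flash_kops = 330.0, 150.0  # assumed hit throughputs (DRAM vs. flash)

for miss in (0.0, 0.001, 0.01, 0.05):
    print(f"DRAM cache at miss={miss:5.1%}: {effective_kops(dram_kops, miss):6.1f} KOps/s")
print(f"flash cache, assumed miss-free: {effective_kops(flash_kops, 0.0):6.1f} KOps/s")

# Crossover miss rate: where the small fast cache falls below the big one.
crossover = (1 / (flash_kops * 1e3) - 1 / (dram_kops * 1e3)) / MISS_PENALTY
print(f"serial-model crossover at {crossover:.3%} extra misses")
```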
Graph traversal

A very latency-bound problem, because the next node to visit often cannot be predicted; it is therefore beneficial to reduce latency by moving computation closer to the data.

[Diagram: an in-store processor traverses the graph directly across Flash 1~3 over the controller network, bypassing Hosts 1~3.]
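A minimal sketch of why traversal is latency-bound: every fetch depends on the result of the previous one, so each hop pays the full storage access latency. Here an in-memory dict stands in for adjacency lists living on distributed flash:

```python
from collections import deque

# Stand-in for adjacency lists stored on distributed flash.
storage = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}

def bfs(start: int) -> list[int]:
    order, seen, frontier = [], {start}, deque([start])
    while frontier:
        node = frontier.popleft()
        order.append(node)
        # This lookup is a dependent storage access: which nodes to fetch
        # next is known only after the previous fetch returns, so every
        # hop pays the full access latency and cannot be prefetched.
        for nbr in storage[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append(nbr)
    return order

print(bfs(0))  # [0, 1, 2, 3, 4, 5]
```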
Graph traversal performance

[Chart: nodes traversed per second (0~18,000) for Software + DRAM, Software + separate network*, Software + controller network, and Accelerator + controller network.]

A flash-based system can achieve performance comparable to DRAM with a much smaller cluster.

* The fast BlueDBM network was used even for the separate-network configuration, for fairness.
Other potential applications

Genomics, deep machine learning, complex graph analytics, and platform acceleration (Spark, MATLAB, SciDB, …).

Suggestions and collaboration are welcome!
Conclusion

Fast flash-based distributed storage systems with low-latency random access may be a good platform for supporting complex queries on Big Data. Reducing access latency for distributed storage requires architectural modifications, including in-storage processors and fast storage networks. Flash-based analytics hold a lot of promise, and we plan to continue demonstrating more application acceleration.

Thank you
Backup: Near-data accelerator is preferable

[Diagrams: in the traditional approach, data flows from the flash through the CPU, DRAM, and NIC on the motherboard before reaching the accelerator FPGA, so hardware and software latencies are additive; in BlueDBM, the accelerator FPGA sits directly with the flash and the network.]
[Backup photo: the BlueDBM storage device, showing the VC707 (Virtex-7) with DRAM and PCIe, the Artix-7 flash card with flash and network ports, and the network cable.]