BlueDBM: An Appliance for Big Data Analytics
Sang-Woo Jun* Ming Liu* Sungjin Lee* Jamey Hicks+
John Ankcorn+ Myron King+ Shuotao Xu* Arvind*
*MIT Computer Science and Artificial Intelligence Laboratory   +Quanta Research Cambridge
June 15, 2015. ISCA 2015, Portland, OR.
This work is funded by Quanta, Samsung, and Lincoln Laboratory. We also thank Xilinx for their hardware and expertise donations.
Big data analytics

Analysis of previously unimaginable amounts of data can provide deep insight:
- Google has predicted flu outbreaks a week earlier than the Centers for Disease Control and Prevention (CDC)
- Analyzing a personal genome can determine predisposition to diseases
- Social network chatter analysis can identify political revolutions before newspapers
- Scientific datasets can be mined to extract accurate models

Likely to be the biggest economic driver for the IT industry for the next decade.
A currently popular solution: RAMCloud

A cluster of machines with large DRAM capacity and a fast interconnect:
+ Fastest as long as data fits in DRAM
- Power hungry and expensive
- Performance drops when data doesn't fit in DRAM

What if enough DRAM isn't affordable?

Flash-based solutions may be a better alternative:
+ Faster than disk, cheaper than DRAM
+ Lower power consumption than both
- The legacy storage access interface is burdensome
- Slower than DRAM
Related work

Use of flash: SSDs, FusionIO, Pure Storage, ZetaScale; SSD for database buffer pool and metadata [SIGMOD 2008], [IJCA 2013]
Networks: QuickSAN [ISCA 2013]; Hadoop/Spark on InfiniBand RDMA [SC 2012]
Accelerators: SmartSSD [SIGMOD 2013], Ibex [VLDB 2014], Catapult [ISCA 2014], GPUs
Latency profile of distributed flash-based analytics

Distributed processing involves many system components:
- Flash device access: 75 μs (devices range 50~100 μs)
- Storage software (OS, FTL, …): 100 μs (typically 100~1000 μs)
- Network interface (10GbE, InfiniBand, …): 20 μs (typically 20~1000 μs)
- Actual processing

Latency is additive: each additional storage-software or network hop adds its full cost to the end-to-end total.
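To make the additive-latency point concrete, here is a small back-of-the-envelope sketch in Python using the representative numbers above. The two-software-layer decomposition on the conventional path is an assumption for illustration, not a measurement from the talk:

```python
# Back-of-the-envelope model of one remote flash access, using the
# representative per-component latencies above (all in microseconds).
FLASH_ACCESS = 75   # raw flash device read
STORAGE_SW   = 100  # OS / FTL / driver overhead per storage hop
NETWORK      = 20   # network interface traversal

# Conventional stack (an illustrative decomposition): storage software
# runs both on the remote node serving the flash and on the requester.
conventional = FLASH_ACCESS + STORAGE_SW + NETWORK + STORAGE_SW
print(f"conventional remote access: ~{conventional} us")

# Streamlined stack: near-storage processing and a dedicated controller
# network keep storage software off the critical path (< 20 us per hop).
streamlined = FLASH_ACCESS + NETWORK
print(f"streamlined remote access:  ~{streamlined} us")
```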
Latency profile of distributed flash-based analytics

Architectural modifications can remove unnecessary overhead:
- Near-storage processing
- Cross-layer optimization of the flash management software
- A dedicated storage area network
- Accelerators

With these, the storage-software layer disappears from the critical path, leaving flash access (75 μs) and a network that adds less than 20 μs per hop.

This is difficult to explore using flash packaged as off-the-shelf SSDs, so a custom flash card had to be built.
[Photo: the custom flash card. An Artix-7 FPGA manages the flash array (chips on both sides of the board) over four buses (Bus 0~3); the card exposes network ports and connects to the host-side VC707.]
BlueDBM: Platform with near-storage processing and inter-controller networks

- 20 24-core Xeon servers, 20 BlueDBM storage devices
- 1TB flash storage per device
- x4 20Gbps controller network
- Xilinx VC707 FPGA, 2GB/s PCIe

[Photos: 1 of the 2 racks (10 nodes); a BlueDBM storage device.]
BlueDBM node architecture

[Diagram: each flash device combines an in-storage processor, flash controller, and network interface, and attaches to its host server over PCIe.]

- Lightweight flash management with very low overhead: adds almost no latency; ECC support
- Custom network protocol with low latency and high bandwidth: x4 20Gbps links at 0.5 μs latency; virtual channels with flow control
- Software has very low-level access to flash storage: high-level information can be used for low-level management, with the FTL implemented inside the file system

No time to go into the gritty details!
BlueDBM software view

BlueDBM provides a generic file system interface as well as an accelerator-specific interface, aided by Connectal (by Quanta).

[Diagram: in user space, hardware-assisted applications call a Connectal-generated proxy; in kernel space, the file system and block device driver sit alongside the Connectal wrapper and accelerator manager; on the FPGA, the HW accelerator, flash controller, and network interface sit above the NAND flash.]
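As a rough illustration of those two access paths, here is a minimal Python sketch. Every name in it (the mount point, the proxy class, the request signature) is hypothetical, not the actual Connectal-generated API:

```python
# Minimal sketch of the two software access paths, under assumed names.
import os

DATASET = "/mnt/bluedbm/dataset.bin"  # hypothetical mount point

# Path 1: the generic file system interface. Since the FTL lives inside
# the file system, this is ordinary POSIX I/O through the block device driver.
if os.path.exists(DATASET):
    with open(DATASET, "rb") as f:
        chunk = f.read(8192)

# Path 2: the accelerator-specific interface. The host sends a small
# request and receives a small result; bulk data stays near the flash.
class AcceleratorProxy:
    """Stand-in for a Connectal-generated user-space proxy."""
    def request(self, opcode: int, payload: bytes) -> bytes:
        # In the real system this crosses PCIe via the Connectal wrapper
        # to the HW accelerator; here it is only a placeholder.
        return b""

accel = AcceleratorProxy()
result = accel.request(opcode=1, payload=b"query histogram")
```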
Power consumption is low

Component             Power (Watts)
VC707                 30
Flash Board (x2)      10
Storage Device Total  40

Component             Power (Watts)
Storage Device        40
Xeon Server           200+
Node Total            240+

The storage device figure is a very conservative estimate; a GPU-based accelerator would double the power.
Applications

- Content-based image search*: faster flash with accelerators as a replacement for DRAM-based systems
- BlueCache, an accelerated memcached*: a dedicated network and accelerated caching with larger capacity
- Graph analytics: the benefits of lower-latency access into distributed flash for computation on large graphs

* Results obtained since the paper submission
Content-based image retrieval

Takes a query image and returns similar images from a dataset of tens of millions of pictures. Image similarity is determined by measuring the distance between the histograms of the images, where each histogram is generated using RGB, HSV, "edgeness", and so on. Better algorithms are available!
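A minimal sketch of this histogram-distance idea, assuming simple per-channel color histograms and an L1 distance (the slides do not specify the exact features or metric used):

```python
import numpy as np

def color_histogram(image: np.ndarray, bins: int = 16) -> np.ndarray:
    """Concatenated per-channel histogram of an HxWx3 uint8 image,
    normalized so images of different sizes are comparable."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(np.float64)
    return h / h.sum()

def histogram_distance(h1: np.ndarray, h2: np.ndarray) -> float:
    """L1 distance between normalized histograms; smaller = more similar."""
    return float(np.abs(h1 - h2).sum())

# Rank a toy dataset by similarity to a query image.
rng = np.random.default_rng(0)
dataset = [rng.integers(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(100)]
query_hist = color_histogram(dataset[42])
ranked = sorted(range(len(dataset)),
                key=lambda i: histogram_distance(query_hist,
                                                 color_histogram(dataset[i])))
print("closest matches:", ranked[:5])  # index 42 ranks first (distance 0)
```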
Image search accelerator (Sang-Woo Jun, Chanwoo Chung)

[Pipeline diagram: on the FPGA, data streams from the flash controller through a Sobel filter and a histogram generator into a comparator against the query histogram; only the comparison results are returned to software.]
Image query performance without sampling

[Chart: an off-the-shelf M.2 SSD is CPU-bottlenecked; BlueDBM + CPU is faster; BlueDBM + FPGA is fastest.]

Faster flash with acceleration can perform at DRAM speed.
Sampling to improve performance

Intelligent sampling methods (e.g., Locality-Sensitive Hashing) improve performance by dramatically reducing the search space, but they introduce random access patterns: the data accesses corresponding to a single hash table entry result in many random accesses.

[Diagram: a locality-sensitive hash table whose entries point to scattered locations in the data.]
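A minimal sketch of the LSH idea, assuming random-hyperplane hashing over feature vectors (the slides do not name a specific LSH family):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)
DIM, N_PLANES = 48, 12  # vector dimensionality, hash bits per table

# Random-hyperplane LSH: each bit records which side of a random
# hyperplane a vector falls on, so nearby vectors tend to share buckets.
planes = rng.standard_normal((N_PLANES, DIM))

def bucket(vec: np.ndarray) -> int:
    bits = planes @ vec > 0
    return int(bits @ (1 << np.arange(N_PLANES)))

# Index a toy dataset of feature vectors (e.g., image histograms).
vectors = rng.random((100_000, DIM))
codes = (vectors @ planes.T > 0).astype(np.int64) @ (1 << np.arange(N_PLANES))
table = defaultdict(list)
for idx, code in enumerate(codes):
    table[int(code)].append(idx)    # buckets hold indices into the dataset

# A query touches one bucket, dramatically shrinking the search space --
# but the surviving indices point to scattered locations on storage,
# which is exactly the random access pattern the slide mentions.
query = vectors[123] + 0.01 * rng.standard_normal(DIM)
candidates = table[bucket(query)]
print(f"{len(candidates)} candidates out of {len(vectors)}")
```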
Image query performance with sampling

A disk-based system cannot take advantage of the reduced search space.
memcached service

A distributed in-memory key-value store that caches DB results indexed by query strings. It is accessed via socket communication and uses system DRAM for caching (~256GB). Extensively used by database-driven websites: Facebook, Flickr, Twitter, Wikipedia, YouTube, …

[Diagram: browsers and mobile apps send web requests to application servers, which issue memcached requests to memcached servers and return the cached data.]

Networking contributes 90% of the overhead.
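For concreteness, this is what that socket round trip looks like with memcached's standard ASCII protocol (a minimal sketch; it assumes a memcached server is listening on localhost:11211):

```python
import socket

def memcached_set(sock: socket.socket, key: bytes, value: bytes) -> bytes:
    # "set <key> <flags> <exptime> <bytes>\r\n<data>\r\n" -> b"STORED\r\n"
    sock.sendall(b"set %s 0 0 %d\r\n%s\r\n" % (key, len(value), value))
    return sock.recv(4096)

def memcached_get(sock: socket.socket, key: bytes) -> bytes:
    # "get <key>\r\n" -> b"VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n"
    sock.sendall(b"get %s\r\n" % key)
    return sock.recv(4096)

# Every operation pays a full network round trip plus kernel socket
# overhead on both ends -- the source of the 90% figure above.
with socket.create_connection(("localhost", 11211)) as s:
    print(memcached_set(s, b"user:42", b"cached-db-row"))
    print(memcached_get(s, b"user:42"))
```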
BlueCache: accelerated memcached service (Shuotao Xu)

- The memcached server is implemented in hardware: hashing and flash management are implemented in the FPGA, with 1TB of hardware-managed flash cache per node
- The hardware server is accessed via local PCIe, with a direct inter-controller network between the hardware nodes

[Diagram: each web server reaches its local BlueCache accelerator (flash controller plus 1TB flash) over PCIe; the accelerators are connected to each other by the inter-controller network.]
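A toy sketch of the kind of flash-backed key-value lookup this implies. It is purely illustrative: bucket collision handling, eviction, and the real hardware data layout are all omitted:

```python
import hashlib

PAGE = 8192
flash = bytearray(1 << 20)  # stand-in for the 1TB flash array
index = {}                  # bucket -> (offset, length); the hash table kept in hardware

def bucket_of(key: bytes, buckets: int = 1 << 16) -> int:
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "little") % buckets

next_free = 0
def kv_set(key: bytes, value: bytes) -> None:
    """Append the value to flash and point the key's bucket at it.
    (Bucket collisions simply overwrite here; real hardware must chain.)"""
    global next_free
    flash[next_free:next_free + len(value)] = value
    index[bucket_of(key)] = (next_free, len(value))
    next_free += ((len(value) + PAGE - 1) // PAGE) * PAGE  # page-aligned, flash-friendly

def kv_get(key: bytes) -> bytes | None:
    loc = index.get(bucket_of(key))
    if loc is None:
        return None  # cache miss: fall back to the database
    offset, length = loc
    return bytes(flash[offset:offset + length])  # a single flash page read

kv_set(b"user:42", b"cached-db-row")
print(kv_get(b"user:42"))
```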
Effect of architecture modification (no flash, only DRAM)

Get operations (key size = 64 bytes, value size = 64 bytes):

System              Throughput (KOps/s)
BlueCache           4012
Local memcached     357
Remote memcached    273

PCIe DMA and the inter-controller network reduce access overhead, and FPGA acceleration of memcached is effective: an 11X performance improvement.
High cache-hit rates outweigh slow flash accesses (small DRAM vs. large flash)

[Chart: throughput (KOps/s) vs. miss rate (0~50%) for BlueCache (0.5TB flash)* and local memcached (50GB DRAM); key size = 64 bytes, value size = 8K bytes, 5ms penalty per cache miss.]

BlueCache starts performing better once local memcached misses 5% of requests: a "sweet spot" for large flash caches exists.

* Assuming no cache misses for BlueCache
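A toy serial model of where such a crossover comes from. Only the 5 ms penalty is from the slide; the per-hit throughputs are illustrative assumptions, and the real system overlaps many outstanding requests, so this shows the shape of the trade-off rather than reproducing the measured 5% point:

```python
# Toy model: a cache's effective throughput collapses as misses
# (each costing the slide's 5 ms penalty) accumulate.
MISS_PENALTY = 5e-3  # seconds per cache miss (from the slide)

def effective_kops(hit_kops: float, miss_rate: float) -> float:
    """Throughput when a fraction miss_rate of requests also pay the penalty."""
    hit_time = 1.0 / (hit_kops * 1e3)
    return 1.0 / (hit_time + miss_rate * MISS_PENALTY) / 1e3

dram_kops, flash_kops = 330.0, 150.0  # assumed hit throughputs (DRAM vs. flash)

for miss in (0.0, 0.001, 0.01, 0.05):
    print(f"DRAM cache at miss={miss:5.1%}: {effective_kops(dram_kops, miss):6.1f} KOps/s")
print(f"flash cache, assumed miss-free: {effective_kops(flash_kops, 0.0):6.1f} KOps/s")

# Crossover miss rate: where the small fast cache falls below the big one.
crossover = (1 / (flash_kops * 1e3) - 1 / (dram_kops * 1e3)) / MISS_PENALTY
print(f"serial-model crossover at {crossover:.3%} extra misses")
```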
Graph traversal

A very latency-bound problem, because the next node to visit often cannot be predicted; it is therefore beneficial to reduce latency by moving computation closer to the data.

[Diagram: an in-store processor traverses the graph directly across Flash 1~3 over the controller network, bypassing Hosts 1~3.]
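A minimal sketch of why traversal is latency-bound: every fetch depends on the result of the previous one, so each hop pays the full storage access latency. Here an in-memory dict stands in for adjacency lists living on distributed flash:

```python
from collections import deque

# Stand-in for adjacency lists stored on distributed flash.
storage = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}

def bfs(start: int) -> list[int]:
    order, seen, frontier = [], {start}, deque([start])
    while frontier:
        node = frontier.popleft()
        order.append(node)
        # This lookup is a dependent storage access: which nodes to fetch
        # next is known only after the previous fetch returns, so every
        # hop pays the full access latency and cannot be prefetched.
        for nbr in storage[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append(nbr)
    return order

print(bfs(0))  # [0, 1, 2, 3, 4, 5]
```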
Graph traversal performance

[Chart: nodes traversed per second (0~18,000) for Software + DRAM, Software + separate network*, Software + controller network, and Accelerator + controller network.]

A flash-based system can achieve performance comparable to DRAM with a much smaller cluster.

* The fast BlueDBM network was used even for the separate-network configuration, for fairness.
Other potential applications

Genomics, deep machine learning, complex graph analytics, and platform acceleration (Spark, MATLAB, SciDB, …).

Suggestions and collaboration are welcome!
Conclusion

Fast flash-based distributed storage systems with low-latency random access may be a good platform for supporting complex queries on Big Data. Reducing access latency for distributed storage requires architectural modifications, including in-storage processors and fast storage networks. Flash-based analytics hold a lot of promise, and we plan to continue demonstrating more application acceleration.

Thank you
Backup: Near-data accelerator is preferable

[Diagrams: in the traditional approach, data flows from the flash through the CPU, DRAM, and NIC on the motherboard before reaching the accelerator FPGA, so hardware and software latencies are additive; in BlueDBM, the accelerator FPGA sits directly with the flash and the network.]
[Backup photo: the BlueDBM storage device, showing the VC707 (Virtex-7) with DRAM and PCIe, the Artix-7 flash card with flash and network ports, and the network cable.]