
Pelican: A Building Block for Exascale Cold Data Storage

Austin Donnelly Microsoft Research

and: Shobana Balakrishnan, Richard Black, Adam Glass, Dave Harper, Sergey Legtchenko, Aaron Ogus, Eric Peterson, Ant Rowstron

2015 Storage Developer Conference. © Microsoft Research. All Rights Reserved.

Outline

• Background
• Pelican co-design
• Research challenges
• Demo
• Performance results


Background: Cold Data in the Cloud

[Figure: storage tier spectrum from Hot to Archival. Media and relative cost: SSD ($$$$$), 15K RPM HDD ($$$$), 7200 RPM HDD ($$$), ? (the cold-tier gap), Tape ($).]

Hot tier • Provisioned for peak • High throughput • Low latency • High cost

Archive tier • Low cost • High latency (hours)


Background: Cold Data in the Cloud

[Figure: the same tier spectrum, with Pelican filling the cold-tier gap between 7200 RPM HDD and Tape.]

Hot tier • Provisioned for peak • High throughput • Low latency • High cost

Cold tier • High density • Low hardware cost • Low operating cost • Lower latency than tape


Pelican: Rack-scale Co-design

Hardware & software co-designed: power, cooling, mechanical, HDDs & software. Trade latency for lower cost.

Massive density, low per-drive overhead: 1152 3.5” HDDs per 52U rack, 2 servers, PCIe bus stretched rack-wide, 4x 10G links out of the rack. Only 8% of the disks can spin at a time.


Interconnect Details

[Figure: rack interconnect hierarchy. The 1152 disks (at most 96 active) attach through x4 SATA port multipliers (288 total) to SATA controllers (72 total), which connect via PCIe backplane switches (6 total) and two PCIe server switches to the two servers and out to the data center network. Annotated link bandwidths range from 1152 Gb/s at the disks down to 40 Gb/s out of the rack (intermediate levels: 864, 576, 192, and 128 Gb/s). Bandwidth required at every level: 96 Gb/s.]
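A quick sanity check of these numbers, assuming each active disk streams at roughly 1 Gb/s (the per-disk rate is an assumption, not stated on the slide):

```python
# Back-of-envelope check of the Pelican interconnect provisioning.
# Assumption: each spinning disk streams sequential data at ~1 Gb/s.

TOTAL_DISKS = 1152
MAX_ACTIVE_DISKS = 96        # only 8% of disks may spin at once
PER_DISK_GBPS = 1.0          # assumed per-disk streaming rate
RACK_EGRESS_GBPS = 4 * 10    # 4x 10G links out of the rack

internal_needed = MAX_ACTIVE_DISKS * PER_DISK_GBPS    # 96 Gb/s at every level
all_disks_peak = TOTAL_DISKS * PER_DISK_GBPS          # 1152 Gb/s if all could spin

print(f"internal bandwidth needed: {internal_needed:.0f} Gb/s")
print(f"rack egress provisioned:   {RACK_EGRESS_GBPS} Gb/s")
# The fabric inside the rack must carry ~96 Gb/s (for example rebuild traffic),
# while at most 40 Gb/s ever needs to leave the rack.
```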


Research Challenges

There is not enough cooling, power, or bandwidth for every disk: how do we manage these resource limits?

Which disks do we use for the data? The data layout problem.

How do we schedule requests to get good performance?

[“Pelican: A building block for exascale cold data storage”, OSDI 2014]

[Figure: Pelican software stack. Requests arrive at the blob store API in userspace, where the Placement and Scheduler components decide which disks to use and when; IOs are issued to the disks through a kernel driver (*.sys).]


Resource use

Traditional systems: any disk can be active at any time.

Pelican: each disk is part of a domain for each resource. Domains and limits for a disk d: the power domain of d (2 of 16 disks active), the cooling domain of d (1 of 12), the vibration domain of d (1 of 2), and a bandwidth tree.

[Figure: the rack as a 3D array of disks, highlighting a disk d with its power and cooling domains.]
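A minimal sketch of how these per-domain limits could be enforced before spinning disks up (the data layout and names are illustrative, not Pelican's actual implementation):

```python
from collections import Counter
from dataclasses import dataclass

# Per-domain limits from the slide: at most 2 of 16 disks per power domain,
# 1 of 12 per cooling domain, and 1 of 2 per vibration domain may be active.
LIMITS = {"power": 2, "cooling": 1, "vibration": 1}

@dataclass(frozen=True)
class Disk:
    disk_id: int
    power: int      # power domain index
    cooling: int    # cooling domain index
    vibration: int  # vibration domain index

def can_activate(active: list[Disk], candidates: list[Disk]) -> bool:
    """True if all candidate disks can spin up without violating any domain limit."""
    for resource, limit in LIMITS.items():
        counts = Counter(getattr(d, resource) for d in active)
        counts.update(getattr(d, resource) for d in candidates)
        if any(c > limit for c in counts.values()):
            return False
    return True
```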


Data placement

A blob is erasure-encoded across a set of concurrently active disks.

In traditional systems: any two sets can be active together, so placement has no impact on concurrency.

In Pelican: sets can conflict in their resource requirements, and conflicting sets cannot be active concurrently.

Challenge: form the sets so as to minimize Pconflict, the probability that two sets conflict.


Data placement: random

[Figure: two randomly placed blobs whose disk sets conflict in a shared domain.]

Random placement: storing each blob on n disks, Pconflict grows as O(n²).
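A toy Monte Carlo illustration of why random-placement conflicts grow so quickly (the single-constraint domain geometry is a simplification assumed here, not the real rack layout):

```python
import random

# Toy model: 1152 disks in cooling domains of 12, where only one disk per
# domain may be active. Two blobs conflict if any pair of their disks shares
# a cooling domain. (Pelican also has power, vibration and bandwidth
# constraints; this only illustrates the O(n^2) trend.)
NUM_DISKS, COOLING_SIZE = 1152, 12

def cooling_domain(disk: int) -> int:
    return disk // COOLING_SIZE

def conflict_probability(n: int, trials: int = 20_000) -> float:
    hits = 0
    for _ in range(trials):
        a = random.sample(range(NUM_DISKS), n)
        b = random.sample(range(NUM_DISKS), n)
        domains_a = {cooling_domain(d) for d in a}
        if any(cooling_domain(d) in domains_a for d in b):
            hits += 1
    return hits / trials

for n in (3, 6, 12, 18):
    # For small n the estimate grows roughly as n^2; by n = 18 two
    # randomly placed blobs almost always conflict.
    print(n, round(conflict_probability(n), 3))
```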


Data placement: Pelican

[Figure: schematic side-view of the rack, showing a power domain, a cooling domain, and one class of 12 fully-conflicting groups.]

Intuition: concentrate the conflicts over a few sets of disks.

• 48 groups of 24 disks – 4 classes of 12 fully-conflicting groups – each class is independent, so concurrency = 4
• A blob is stored over 18 disks in one group – 15+3 erasure coding
• Storing a blob within one group gives Pconflict O(n)
• Groups encapsulate the constraints: the group is the unit of IO scheduling, and no constraints have to be managed at runtime
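A minimal sketch of group-based placement with these numbers (the group-selection policy and names are illustrative assumptions; the real system also balances capacity and handles failures):

```python
import random

# Layout from the slide: 48 groups of 24 disks, arranged as 4 independent
# classes of 12 fully-conflicting groups. A blob occupies 18 disks of one
# group, erasure-coded as 15 data + 3 parity fragments.
CLASSES, GROUPS_PER_CLASS, DISKS_PER_GROUP = 4, 12, 24
DATA_FRAGMENTS, PARITY_FRAGMENTS = 15, 3

def place_blob(blob_id: int) -> dict:
    """Pick one group and 18 of its 24 disks for a blob's fragments."""
    cls = random.randrange(CLASSES)              # illustrative: pick a random class
    group = random.randrange(GROUPS_PER_CLASS)   # illustrative: pick a random group
    disks = random.sample(range(DISKS_PER_GROUP),
                          DATA_FRAGMENTS + PARITY_FRAGMENTS)
    return {
        "blob": blob_id,
        "class": cls,
        "group": group,
        "data_disks": disks[:DATA_FRAGMENTS],
        "parity_disks": disks[DATA_FRAGMENTS:],
    }
```

Because all 18 disks sit in one group, conflicts are decided at the group level rather than per disk, which is what keeps Pconflict at O(n).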


IO Scheduling: “spin up is the new seek”

Four independent schedulers; each scheduler owns 12 groups, of which only one can be active at a time. A naïve FIFO scheduler pays the average group activation time of 14.2 seconds with high probability after every request, so most of the time is spent doing spin-ups.

The Pelican scheduler batches requests per group, with a limit on the maximum re-ordering (a trade-off between throughput and fairness) and a weighted fair-share between client and rebuild traffic.

[Figure: timelines contrasting FIFO (spin up, IO, spin up, IO, ...) with batching (spin up, IO batch, spin up, IO batch, ...).]
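A simplified sketch of the batching idea for one of the four schedulers (the queue structure and reorder bound are assumptions for illustration; the weighted fair-share between client and rebuild traffic is omitted):

```python
from collections import deque

MAX_REORDER = 64   # assumed bound on how many older requests a batch may overtake

class ClassScheduler:
    """One of the four per-class schedulers: 12 groups, only one active at a time."""

    def __init__(self) -> None:
        self.queue: deque = deque()   # (group_id, request) tuples in arrival order

    def submit(self, group_id: int, request: object) -> None:
        self.queue.append((group_id, request))

    def next_batch(self) -> tuple[int, list]:
        """Spin up the group at the head of the queue and serve a batch of its
        queued requests, without overtaking more than MAX_REORDER older ones."""
        group_id, first = self.queue.popleft()
        batch, skipped, kept = [first], 0, deque()
        while self.queue and skipped <= MAX_REORDER:
            g, req = self.queue.popleft()
            if g == group_id:
                batch.append(req)       # same group: amortize this spin-up
            else:
                kept.append((g, req))   # other group: keep for a later spin-up
                skipped += 1
        self.queue = kept + self.queue  # unserved requests keep their order
        return group_id, batch
```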


Outline

• Background
• Pelican co-design
• Research challenges: data placement (constraint-aware); scheduler (batching to amortize spin-ups)
• Demo
• Performance results


Demo


Performance

Compare Pelican vs. all disks active (FP). Cross-validate the simulator.

Metrics: throughput, latency (time to first byte), power consumption.

Open-loop workload: Poisson arrivals, read requests, 1 GB blobs.


First step: simulator cross-validation

Burst workload, varying burst size

[Figure: simulator vs. rack measurements for a burst workload, with burst size varied from 8 to 2048 requests. Left panel: throughput (Gbps), 0 to 10. Right panel: time to first byte (s), log scale 1 to 1000. X-axes: burst size (#reqs).]


Rack throughput

[Figure: average throughput (Gbps, 0 to 40) vs. workload rate (0.0625 to 8 req/s), showing the random-placement curve.]


Rack throughput

[Figure: same axes as above, adding the FP (all disks active) curve alongside random placement.]


Rack throughput

[Figure: same axes as above, adding the Pelican curve; series shown are FP, Pelican, and random placement.]


Time to first byte

[Figure: time to first byte (s, log scale from 0.0001 to 10000) vs. workload rate (0.0625 to 8 req/s), for FP and Pelican.]

14.2 seconds: average time to activate group


Power consumption

[Figure: aggregate disk power draw (kW, 0 to 12) vs. workload rate (0.0625 to 8 req/s). Baseline marked: all disks spun down = 1.8 kW.]


Power consumption

[Figure: same axes as above, adding the all-disks-active baseline at 10.8 kW alongside the 1.8 kW spun-down baseline.]


Power consumption

[Figure: same axes as above, adding the Pelican average power curve between the two baselines.]


Power consumption: 3x lower peak

[Figure: same axes as above, adding the Pelican peak at 3.7 kW, roughly 3x lower than the 10.8 kW all-disks-active draw; the all-spun-down baseline is 1.8 kW.]
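The per-disk figures implied by these rack-level numbers (a back-of-envelope derivation from the slide, not values measured in the talk):

```python
# Derive approximate per-disk power from the rack-level numbers above.
TOTAL_DISKS = 1152

SPUN_DOWN_RACK_KW = 1.8     # all 1152 disks in standby
ALL_ACTIVE_RACK_KW = 10.8   # all 1152 disks spinning
PELICAN_PEAK_KW = 3.7       # Pelican's peak draw

standby_w = SPUN_DOWN_RACK_KW * 1000 / TOTAL_DISKS    # ~1.6 W per disk
active_w = ALL_ACTIVE_RACK_KW * 1000 / TOTAL_DISKS    # ~9.4 W per disk

print(f"standby: ~{standby_w:.1f} W/disk, active: ~{active_w:.1f} W/disk")
print(f"Pelican peak is {ALL_ACTIVE_RACK_KW / PELICAN_PEAK_KW:.1f}x below all-active")
```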


Trace replay

European Centre for Medium-Range Weather Forecasts [FAST 2015]. The ECFS trace covers every request over 2.4 years; we run it through a tiering simulator.


Tiering model

[Figure: tiering model. File requests are served by primary storage and a cache; de-stage policies push cold files, plus new and cheap-to-store files, down to cold storage, and retrievals bring them back. Warm files stay in the cache. The trace is captured from below the hot tier.]
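For concreteness, a toy de-stage policy in the spirit of this model (the age threshold, size cut-off, and all names are assumptions for illustration; the actual policies from the FAST 2015 study are not described here):

```python
from dataclasses import dataclass

# Toy de-stage policy: files untouched for a long time, plus new files that
# are cheap to keep in cold storage, are pushed from the cache to cold storage.
COLD_AGE_SECONDS = 30 * 24 * 3600     # assumed: untouched for 30 days => cold
CHEAP_TO_STORE_BYTES = 256 * 2**20    # assumed: small new files de-stage early

@dataclass
class CachedFile:
    name: str
    size_bytes: int
    last_access: float
    created: float

def files_to_destage(cache: list[CachedFile], now: float) -> list[CachedFile]:
    """Return the cached files that this policy would move to cold storage."""
    destage = []
    for f in cache:
        is_cold = now - f.last_access > COLD_AGE_SECONDS
        is_new_and_cheap = (now - f.created < COLD_AGE_SECONDS
                            and f.size_bytes <= CHEAP_TO_STORE_BYTES)
        if is_cold or is_new_and_cheap:
            destage.append(f)
    return destage
```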


Requests per second, over 2.4 years

We replay two 2-hour segments: G1, the segment with the highest response time, and G2, the segment with the deepest queues.


G1: Highest response time


G2: Deepest queues


War stories

Booting a system with 1152 disks: BIOS changes were needed.

Object store vs. file system.

Data model for the system: serial numbers on all FRUs; disks, volumes, media.


Thank you!

Questions?

