2015 Storage Developer Conference. © Microsoft Research. All Rights Reserved.
Pelican: A Building Block for Exascale Cold Data Storage
Austin Donnelly, Microsoft Research
with Shobana Balakrishnan, Richard Black, Adam Glass, Dave Harper, Sergey Legtchenko, Aaron Ogus, Eric Peterson, and Ant Rowstron
Outline
• Background
• Pelican co-design
• Research challenges
• Demo
• Performance results
Background: Cold Data in the Cloud
[Diagram: the cloud storage tiering ladder. Tiers: hot, warm, cold, archival. Media: SSD ($$$$$), 15K RPM HDD ($$$$), 7200 RPM HDD ($$$), tape ($). The cold tier is marked "?": a gap between disk and tape.]
Hot tier:
• Provisioned for peak
• High throughput
• Low latency
• High cost
Archive tier:
• Low cost
• High latency (hours)
Background: Cold Data in the Cloud
[Diagram: the same tiering ladder, with Pelican filling the "?" gap alongside SSD, 15K RPM HDD, and 7200 RPM HDD; cost falls from $$$$$ to $ down the ladder.]
Hot tier:
• Provisioned for peak
• High throughput
• Low latency
• High cost
Cold tier (Pelican):
• High density
• Low hardware cost
• Low operating cost
• Lower latency than tape
Pelican: Rack-scale Co-design
Hardware & software co-designed: power, cooling, mechanical, HDDs & software. Trade latency for lower cost.
• Massive density, low per-drive overhead: 1152 3.5" HDDs in a 52U rack
• 2 servers, with a PCIe bus stretched rack-wide
• 4x 10G links out of the rack
• Only 8% of the disks can be spinning at any time
Interconnect Details
[Diagram: the rack's interconnect, a tree from the disks up to the data center network.]

Layer                                     Aggregate bandwidth   Required
Disks (1152 total, at most 96 active)     1152 Gb/s             96 Gb/s
x4 SATA port multipliers (288 total)      864 Gb/s              96 Gb/s
SATA controllers (72 total)               576 Gb/s              96 Gb/s
Backplane switches, PCIe (6 total)        192 Gb/s              96 Gb/s
Server switches, PCIe (2)                 128 Gb/s              96 Gb/s
Servers (2) to the data center network    40 Gb/s

Every internal layer is oversubscribed relative to all 1152 disks, but each carries the 96 Gb/s that the 96 concurrently active disks can deliver.
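A minimal sanity check of the fan-in arithmetic, as a Python sketch: with at most 96 active disks, each layer of the tree must carry at least 96 Gb/s. The layer aggregates are taken from the table above; the ~1 Gb/s sequential rate per HDD is an assumption.

# Sketch: check each interconnect layer against the bandwidth the
# 96 concurrently active disks can generate. The per-disk rate is
# an assumed ~1 Gb/s of sequential HDD throughput.
ACTIVE_DISKS = 96
PER_DISK_GBPS = 1.0
required = ACTIVE_DISKS * PER_DISK_GBPS  # 96 Gb/s

layers_gbps = {
    "disks (1152)": 1152,
    "x4 SATA port multipliers (288)": 864,
    "SATA controllers (72)": 576,
    "PCIe backplane switches (6)": 192,
    "PCIe server switches (2)": 128,
    "data center network": 40,
}

for name, capacity in layers_gbps.items():
    status = "ok" if capacity >= required else "UNDER-PROVISIONED"
    print(f"{name}: {capacity} Gb/s vs {required:.0f} Gb/s required: {status}")
# Every internal layer clears 96 Gb/s; only the 40 Gb/s uplink out of
# the rack is below it, so rack egress, not the tree, is the limit.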
Research Challenges
There is not enough cooling, power, or bandwidth for every disk. How do we manage these resource limits?
• Which disks to use for data? The data layout problem.
• How to schedule requests to get good performance?
["Pelican: A building block for exascale cold data storage", OSDI 2014]
[Diagram: the software stack. Requests arrive at the blob store API in userspace; placement and scheduler components issue IOs to the disks through a kernel driver (*.sys).]
Resource use

Traditional systems: any disk can be active at any time.
Pelican: each disk belongs to a domain for every resource, and each domain has a limit:
• Power: at most 2 of 16 disks spinning
• Cooling: at most 1 of 12 disks spinning
• Vibration: 1 of 2
• Bandwidth: the interconnect tree
[Diagram: the rack as a 3D array of disks, highlighting the power domain and the cooling domain of a disk d.]
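A minimal sketch, in Python, of how a set of disks could be checked against the per-resource domain limits above. The 2-of-16 power and 1-of-12 cooling limits come from the slide; the disk-to-domain mapping by consecutive IDs is illustrative, not Pelican's actual wiring.

from collections import Counter

# Limits from the slide: at most 2 of the 16 disks in a power domain,
# and 1 of the 12 disks in a cooling domain, may spin concurrently.
LIMITS = {"power": 2, "cooling": 1}

def domains(disk_id):
    # Illustrative mapping: consecutive disk IDs share a domain.
    return {"power": disk_id // 16, "cooling": disk_id // 12}

def can_spin_together(disk_ids):
    """True if activating all these disks violates no domain limit."""
    for resource, limit in LIMITS.items():
        counts = Counter(domains(d)[resource] for d in disk_ids)
        if any(c > limit for c in counts.values()):
            return False
    return True

print(can_spin_together([0, 16, 32]))  # True: three distinct domains
print(can_spin_together([0, 1]))       # False: same 1-of-12 cooling domain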
Data placement
A blob is erasure-encoded over a set of concurrently active disks.
In traditional systems: any two sets can be active, so placement has no impact on concurrency.
In Pelican: sets can conflict in their resource requirements, and conflicting sets cannot be concurrently active.
Challenge: form sets so as to minimize Pconflict.
Data placement: random
[Diagram: the disks of blob 1 and the disks of blob 2 scattered across the rack, conflicting where their domains overlap.]
Random placement: storing blobs on n disks gives Pconflict = O(n²).
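A small Monte Carlo experiment (illustrative, not from the paper) makes the quadratic growth visible: two randomly placed blobs conflict whenever any of their disks fall in the same 1-of-12 cooling domain.

import random

DISKS, DOMAIN_SIZE = 1152, 12          # 96 cooling domains of 12 disks

def conflict_prob(n, trials=20000):
    """Estimate P(two random n-disk blobs share a cooling domain)."""
    hits = 0
    for _ in range(trials):
        a = {d // DOMAIN_SIZE for d in random.sample(range(DISKS), n)}
        b = {d // DOMAIN_SIZE for d in random.sample(range(DISKS), n)}
        hits += bool(a & b)            # any shared domain is a conflict
    return hits / trials

for n in (2, 4, 9, 18):
    print(f"n={n}: Pconflict ~ {conflict_prob(n):.3f}")
# The estimate grows roughly as n^2 until it saturates near 1,
# matching the O(n^2) behaviour of random placement.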
Data placement: Pelican
[Figure: schematic side view of the rack, showing a power domain, a cooling domain, and a class of 12 fully-conflicting groups.]
Intuition: concentrate the conflicts over a few sets of disks.
Store each blob within a single group: Pconflict = O(n).
Groups encapsulate the constraints: the group is the unit of IO scheduling, and no constraints have to be managed at runtime.
• 48 groups of 24 disks
  – 4 classes of 12 fully-conflicting groups
  – Classes are independent: concurrency = 4
• A blob is stored over 18 disks
  – 15+3 erasure coding
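A minimal placement sketch under this layout, in Python, assuming a simple least-loaded choice of group; the in-group disk selection, the per-group capacity, and the free-space bookkeeping are illustrative, not the paper's algorithm.

import random

GROUPS, DISKS_PER_GROUP = 48, 24
DATA, PARITY = 15, 3                   # 15+3 erasure coding, 18 fragments

free_gb = {g: 24 * 6000.0 for g in range(GROUPS)}  # assumed 6 TB per disk

def place_blob(size_gb):
    """Pick one group, then 18 of its 24 disks for the fragments.

    Policy (an assumption for this sketch): the group with the most
    free space, so load stays balanced across the 48 groups.
    """
    group = max(free_gb, key=free_gb.get)
    disks = random.sample(range(DISKS_PER_GROUP), DATA + PARITY)
    free_gb[group] -= size_gb
    # All fragments live in one group, so a read only ever needs that
    # one group spun up: conflicts stay O(n) rather than O(n^2).
    return group, sorted(disks)

print(place_blob(1.0))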
IO Scheduling: “spin up is the new seek”
Four independent schedulers; each serves 12 groups, of which only one can be active at a time.
Naïve scheduler (FIFO): the average group activation time is 14.2 sec, and there is a high probability of a spin-up after each request, so the time is spent doing spin-ups.
Pelican scheduler:
• Request batching
• Limit on maximum re-ordering
• Trade-off between throughput and fairness
• Weighted fair-share between client and rebuild traffic
[Timeline diagram: FIFO pays a spin-up before almost every request; batching pays one spin-up per IO batch.]
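A sketch of the batching core only, in Python: queue requests per group, spin up the group with the deepest queue, and serve the whole batch, so one 14.2 s spin-up is amortized over many IOs. The re-ordering bound and the weighted fair-share with rebuild traffic are deliberately elided.

import collections

SPIN_UP_S = 14.2                       # average group activation time

class BatchingScheduler:
    """Batching sketch: one spin-up per batch instead of per request."""

    def __init__(self):
        self.queues = collections.defaultdict(collections.deque)

    def submit(self, group, request):
        self.queues[group].append(request)

    def next_batch(self):
        """Activate the group with the deepest queue, drain its batch."""
        if not self.queues:
            return None, []
        group = max(self.queues, key=lambda g: len(self.queues[g]))
        return group, list(self.queues.pop(group))

sched = BatchingScheduler()
for i in range(5):
    sched.submit(group=i % 2, request=f"read-{i}")
group, batch = sched.next_batch()
print(f"spin up group {group} ({SPIN_UP_S}s), then serve {batch}")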
Outline
• Background
• Pelican co-design
• Research challenges:
  – Data placement: constraint-aware
  – Scheduler: batching to amortize spin-ups
• Demo
• Performance results
Performance
Compare Pelican against a rack with all disks active (FP, fully provisioned). Cross-validate the simulator.
Metrics:
• Throughput
• Latency (time to first byte)
• Power consumption
Open-loop workload: Poisson arrivals; read requests for 1 GB blobs.
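A minimal sketch of the open-loop generator: exponential inter-arrival gaps at the given rate, with requests issued at their arrival times regardless of when earlier ones complete (which is what makes it open-loop).

import random

def poisson_arrivals(rate_req_s, duration_s, seed=1):
    """Yield arrival times (s) for an open-loop Poisson workload."""
    rng = random.Random(seed)
    t = 0.0
    while True:
        t += rng.expovariate(rate_req_s)  # exponential inter-arrival gap
        if t > duration_s:
            return
        yield t

# One of the workload points from the plots below: 0.5 req/s for an
# hour, each request being a 1 GB blob read.
arrivals = list(poisson_arrivals(0.5, 3600))
print(len(arrivals), "requests; first:", [round(t, 1) for t in arrivals[:3]])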
First step: simulator cross-validation
Burst workload, varying burst size from 8 to 2048 requests.
[Two plots vs. burst size: throughput (Gbps, 0 to 10) and time to first byte (s, log scale 1 to 1000), each comparing the simulator against the rack.]
Rack throughput
[Plot: average throughput (Gbps, 0 to 40) vs. workload rate (0.0625 to 8 req/s), built up over three slides: random placement, then FP, then Pelican.]
Time to first byte
[Plot: time to first byte (s, log scale from 0.0001 to 10000) vs. workload rate (0.0625 to 8 req/s) for FP and Pelican.]
14.2 seconds: the average time to activate a group.
Power consumption: 3x lower peak
[Plot: aggregate disk power draw (kW, 0 to 12) vs. workload rate (0.0625 to 8 req/s), built up over four slides. All disks spun down: 1.8 kW. All disks active: 10.8 kW. Pelican average is also shown; Pelican peak: 3.7 kW.]
Trace replay
European Centre for Medium-Range Weather Forecasts (ECMWF) [FAST 2015]. The ECFS trace contains every request over 2.4 years; we run it through a tiering simulator.
Tiering model
[Diagram: file requests arrive at primary storage, which acts as a cache in front of cold storage. De-stage policies move cold files, and new, cheap-to-store files, down to cold storage; retrievals bring them back. Warm files stay in primary storage. The trace is captured from below the hot tier.]
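A toy de-stage rule along the lines the diagram suggests, as a Python sketch; the thresholds and the "cheap to store" test are assumptions for illustration, not the policy used in the paper or at ECMWF.

import time

COLD_AFTER_S = 30 * 24 * 3600          # assumed: unread for 30 days = cold
CHEAP_TO_STORE_BYTES = 1 << 30         # assumed: new files >= 1 GB go down

def destage(meta, now=None):
    """True if a file should move from the cache to cold storage."""
    now = time.time() if now is None else now
    if now - meta["last_access"] > COLD_AFTER_S:
        return True                    # cold file
    if meta["new"] and meta["size"] >= CHEAP_TO_STORE_BYTES:
        return True                    # new and cheap-to-store
    return False                       # warm file: keep it in the cache

print(destage({"last_access": 0.0, "size": 1 << 20, "new": False}))  # True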
Requests per second, over 2.4 years
We replay two 2-hour segments:
• G1: highest response time
• G2: deepest queues
G1: Highest response time
War stories
• Booting a system with 1152 disks: BIOS changes needed
• Object store vs. file system
• Data model for the system: serial numbers on all FRUs (disks, volumes, media)