
Haryadi Gunawi and Andrew Chien - SyNeRGy...


in collaboration with Gokul Soundararajan and Deepak Kenchammana (NetApp)

Rob Ross and Dries Kimpe (Argonne National Labs)

Haryadi Gunawi and Andrew Chien


- Complete fail-stop

- Fail-partial

- Corruption

- Performance degradation (“limpware”)?

Rich literature

6/1/15 LigHTS @ XPS Workshop 2015


“… a 1 Gb NIC card on a machine that suddenly starts transmitting at 1 kbps; this one slow machine caused a chain reaction … making a 100-node cluster crawl at a snail’s pace.” – Facebook engineers

A “limping” NIC: a 1,000,000x slowdown!

Cascading impact!


- Disks: “… 4 servers having high I/O wait times, up to 103 seconds. This was left uncorrected for 50 days.” @ Argonne
  Causes: weak disk head, bad packaging, missing screws, broken/old fans, too many disks per box, firmware bugs, bad sector remapping, …

- SSDs: Samsung firmware bug (reduced bandwidth by 4x)

- Network cards and switches: “On Intrepid, a bad batch of optical transceivers with an extremely high error rate caused an effective throughput of 1–2 Kbps.” @ Argonne
  Causes: broken adapter, error correction, driver bugs, power fluctuation, …

- Memory: runs at only 25% of normal speed – HBase operators

- Processors: 26% performance variation; aging transistors, overheating, self-throttling, …

- Many others: “Yes, we’ve seen that in production.” More anecdotes in our paper [SoCC ’13]


- Introduction

- Impact of limpware on scale-out cloud systems? [HotCloud ’13, SoCC ’13]

- Progress summary
  - What bugs live in the cloud? [SoCC ’14]
  - Detecting performance bugs [HotCloud ’15]
  - The Tail at Store [In Submission]
  - Other ongoing work


- Anecdote: “The performance of a 100-node cluster was crawling at a snail’s pace” – Facebook

- But, … why?


- Goals: measure system-level impacts; find design flaws

- Run distributed systems/protocols (e.g., a 3-node write in HDFS)

- Measure slowdowns under: no failure, a crash, a limping NIC

[Figure: execution slowdown of the workload under a limping NIC at 10, 1, and 0.1 Mbps; y-axis from 1x to 1000x slower]
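The slowdown trend can be reproduced with a back-of-the-envelope model. This is our sketch, not the paper's measurement harness, and the 100 Mbps healthy effective rate is an assumption chosen purely for illustration:

```python
# Back-of-the-envelope model (our sketch, not the paper's test harness)
# of a pipelined 3-node write: end-to-end throughput is capped by the
# slowest link, so a single limping NIC slows the whole pipeline.
# The 100 Mbps healthy effective rate is an illustrative assumption.

HEALTHY_MBPS = 100.0

def pipeline_write_time(block_mb, link_rates_mbps):
    """Seconds to push one block through a replication pipeline whose
    throughput is bounded by its slowest link."""
    return (block_mb * 8) / min(link_rates_mbps)

def slowdown(limping_rate_mbps, block_mb=64):
    healthy = pipeline_write_time(block_mb, [HEALTHY_MBPS] * 3)
    limping = pipeline_write_time(
        block_mb, [HEALTHY_MBPS, limping_rate_mbps, HEALTHY_MBPS])
    return limping / healthy

for rate_mbps in (10, 1, 0.1):
    print(f"{rate_mbps} Mbps NIC -> {slowdown(rate_mbps):.0f}x slower")
```

Because the pipeline runs at the pace of its slowest link, the slowdown is simply the ratio of the healthy rate to the limping rate.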


Systems examined: HDFS, Hadoop, ZooKeeper, Cassandra, and HBase (on ZooKeeper).


Fail-stop tolerant, but not limpware tolerant (no failover recovery)
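The "no failover recovery" point can be made concrete. A heartbeat-style failure detector (sketched below with illustrative numbers, not any particular system's implementation) reacts only when heartbeats stop, and a limping node still heartbeats on time:

```python
# Sketch of why fail-stop tolerance is not limpware tolerance: a
# heartbeat failure detector declares a node dead only when heartbeats
# stop arriving. A limping node still sends its tiny heartbeats on
# time, so no failover ever triggers. (Timeout value is illustrative.)

HEARTBEAT_TIMEOUT_S = 30.0

def declared_dead(seconds_since_last_heartbeat):
    return seconds_since_last_heartbeat > HEARTBEAT_TIMEOUT_S

crashed = 10_000.0  # crashed node: heartbeats stopped long ago
limping = 1.0       # limping node: data path is 1000x slow, but
                    # heartbeats are small enough to still arrive

print(declared_dead(crashed))  # True  -> failover recovery runs
print(declared_dead(limping))  # False -> the cluster just limps along
```

The detector sees liveness, not performance, so the degraded node is never failed over.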


- Ran Hadoop on 6+ hours of a Facebook workload, on two setups: a healthy 30-node cluster, and a 30-node cluster with one slow node (0.1 Mbps NIC)

- The limping cluster collapses after ~4 hours, down to 1 job/hour

- The same collapse also happens in HDFS and ZooKeeper


- Single point of performance failure

- Coarse-grained timeouts

- Bounded thread/queue pools → resource exhaustion

- Unbounded thread/queue pools → out of memory (OOM)

- No throttling or back-pressure

- Limp-oblivious background jobs

- Unexploited parallelism of small transactional I/Os

- Long lock/resource contention

- …
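The bounded-pool flaw in particular is easy to demonstrate. A minimal sketch (illustrative pool size and delays, not taken from any of the studied systems):

```python
# Sketch of "bounded thread/queue pool -> resource exhaustion": with a
# fixed-size worker pool, a few requests stuck on a limping node occupy
# every worker, so requests bound for healthy nodes stall behind them.
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4

def rpc(node, delay_s):
    time.sleep(delay_s)  # stand-in for a (possibly slow) remote call
    return node

pool = ThreadPoolExecutor(max_workers=POOL_SIZE)

# Requests to one limping node grab all four workers...
stuck = [pool.submit(rpc, "limping-node", 2.0) for _ in range(POOL_SIZE)]

# ...so a request to a perfectly healthy node queues behind them.
start = time.time()
pool.submit(rpc, "healthy-node", 0.0).result()
elapsed = time.time() - start
print(f"healthy request waited {elapsed:.1f}s")  # ~2s, not ~0s
pool.shutdown()
```

One slow dependency thus degrades requests that never touch the limping node, which is exactly the cascading behavior in the anecdotes above.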


- Introduction

- Impact of limpware [SoCC ’13]

- Progress summary



- Study/Analysis
  - Limplock/limpware [HotCloud ’13, SoCC ’13]
  - What bugs live in the cloud? [SoCC ’14]
    - Study of 3000+ bugs in scale-out distributed systems
    - New bug classes: scalability bugs, single-point-of-failure bugs, …


  - The Tail at Store [In Submission]
    - Goal: turn anecdotes into real statistics
    - Collaboration with Gokul Soundararajan and Deepak Kenchammana (NetApp)
    - Study of over 450,000 disks, 4,000 SSDs, and 240 EBS drives
    - Questions: How many slow drives? How often? Are the slowdowns transient?
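One simple way to quantify "how many slow drives" is a peer-relative metric. The sketch below compares each drive's hourly latency to the median of its RAID group; the metric and numbers are illustrative, and the study's exact method may differ:

```python
# Peer-relative slowdown sketch: within one RAID group, compare each
# drive's hourly average latency to the group median. A healthy drive
# scores ~1x; a limping drive stands out by orders of magnitude.
# (Illustrative metric and numbers; the study's exact method may differ.)
from statistics import median

def slowdowns(latencies_ms):
    group_median = median(latencies_ms)
    return [lat / group_median for lat in latencies_ms]

hourly_latency_ms = [5.1, 4.9, 5.0, 5.2, 120.0]  # last drive is limping
scores = slowdowns(hourly_latency_ms)
print([round(s, 1) for s in scores])  # four ~1.0x peers, one >20x outlier
```

Comparing against peers in the same group controls for workload, since all drives in a RAID group see similar request streams.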



  - Findings: limping disks and SSDs are real
    - 2-digit (10–99x) slowdowns occurred in 0.01% of disk and SSD hours
    - 4-digit and 3-digit disk slowdowns occurred in 124 and 2,461 disk hours respectively; 3-digit SSD slowdowns in 10 SSD hours


- Study/Analysis (above)

- Towards limpware-tolerant systems
  - Detecting limpware-intolerant designs in distributed systems [HotCloud ’15]
  - Tail-tolerant (TT) storage [In Progress]
    - In the flash controller, operating system/RAID, and distributed storage
    - Plus coordination with MapReduce speculative execution (a cross-cutting approach)

[Figure: TT flash controller, TT OS/RAID, and TT distributed FS layers, coordinated with MapReduce speculative execution]


XPS → Exploit Scale

Limpware → Underexploit Scale


ucare.cs.uchicago.edu ceres.uchicago.edu

