in collaboration with Gokul Soundararajan and Deepak Kenchammana (NetApp)
Rob Ross and Dries Kimpe (Argonne National Laboratory)
Haryadi Gunawi and Andrew Chien
- Complete fail-stop
- Fail-partial
- Corruption
- Performance degradation ("limpware")?
Rich literature exists on the first three failure modes; limpware is the focus here.
- "… 1 Gb NIC card on a machine that suddenly starts transmitting at 1 kbps … this one slow machine caused a chain reaction … making a 100-node cluster crawl at a snail's pace" – Facebook engineers
A "limping" NIC! (a 1,000,000x slowdown)
Cascading impact!
- Disks
  - "… 4 servers having high wait times on I/O, up to 103 seconds. This was left uncorrected for 50 days." @ Argonne
  - Causes: weak disk head, bad packaging, missing screws, broken/old fans, too many disks per box, firmware bugs, bad sector remapping, …
- SSDs
  - Samsung firmware bug (reduced bandwidth by 4x)
- Network cards and switches
  - "On Intrepid, a bad batch of optical transceivers with an extremely high error rate caused an effective throughput of 1-2 Kbps." @ Argonne
  - Causes: broken adapters, error correction, driver bugs, power fluctuation, …
- Memory
  - Runs at only 25% of normal speed – HBase operators
- Processors
  - 26% variation
  - Causes: aging transistors, overheating, self-throttling, …
- Many others: "Yes, we've seen that in production"
  - More anecdotes in our paper [SoCC ’13]
- Introduction
- Impact of limpware on scale-out cloud systems [HotCloud ’13, SoCC ’13]
- Progress Summary
  - What bugs live in the cloud? [SoCC ’14]
  - Detecting performance bugs [HotCloud ’15]
  - The Tail at Store [In Submission]
  - Other ongoing work
- Anecdotes
  - "The performance of a 100-node cluster was crawling at a snail's pace" – Facebook
- But … why?
- Goals:
  - Measure system-level impacts
  - Find design flaws
- Run distributed systems/protocols
  - E.g., a 3-node write in HDFS
- Measure slowdowns under:
  - No failure, a crash, and a limping NIC (see the sketch after the figure below)
[Figure: execution slowdown of the workload under a 10 Mbps, 1 Mbps, and 0.1 Mbps NIC, roughly 10x, 100x, and 1000x slower than the no-failure baseline]
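Below is a minimal sketch of how such an experiment can be driven, assuming a Linux node where the tester has root, the NIC is `eth0`, an `hdfs` client is on the PATH, and `/tmp/1GB.bin` and `/bench/out.bin` are illustrative paths; it throttles the NIC with `tc tbf` and times a replicated HDFS write at each rate.

```python
import subprocess, time

DEV = "eth0"                                   # assumed NIC on the node under test
RATES = [None, "10mbit", "1mbit", "100kbit"]   # None = healthy baseline

def set_rate(rate):
    # Clear any existing root qdisc, then (optionally) throttle with a token bucket.
    subprocess.run(["tc", "qdisc", "del", "dev", DEV, "root"],
                   stderr=subprocess.DEVNULL)
    if rate is not None:
        subprocess.run(["tc", "qdisc", "add", "dev", DEV, "root", "tbf",
                        "rate", rate, "burst", "32kbit", "latency", "400ms"],
                       check=True)

def timed_hdfs_write(local_file, dst):
    # A replicated HDFS write pipelines through the throttled node if it is chosen.
    start = time.time()
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, dst], check=True)
    return time.time() - start

baseline = None
for rate in RATES:
    set_rate(rate)
    secs = timed_hdfs_write("/tmp/1GB.bin", "/bench/out.bin")
    baseline = baseline or secs
    print(f"NIC rate {rate or 'full'}: {secs:.1f}s ({secs / baseline:.1f}x slowdown)")
set_rate(None)   # restore the NIC
```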
Fail-stop tolerant, but not limpware tolerant: a limping node triggers no failover or recovery
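A toy model (not from the paper) of why fail-stop tolerance does not help here: in a pipelined, fully replicated write, a crashed replica is detected and replaced after a short failover, but a limping replica stays in the pipeline and bounds the whole write by its degraded speed. The speeds, failover time, and data size below are made-up numbers for illustration.

```python
DATA_MB = 1000          # size of one replicated write (illustrative)
FAILOVER_SECS = 10      # assumed time to detect a crash and re-form the pipeline

def write_time(speeds_mb_s, crashed=None, limp=None):
    # speeds_mb_s: per-replica throughput; crashed/limp: index of the affected node.
    speeds = list(speeds_mb_s)
    extra = 0.0
    if crashed is not None:
        speeds.pop(crashed)       # fail-stop: node is removed after failover
        extra = FAILOVER_SECS
    if limp is not None:
        speeds[limp] /= 1000.0    # limpware: node stays in the pipeline, 1000x slower
    # A pipelined write is acknowledged at the pace of its slowest replica.
    return extra + DATA_MB / min(speeds)

healthy = [100.0, 100.0, 100.0]   # 3 replicas at 100 MB/s each
print("no failure :", write_time(healthy), "s")             # 10 s
print("crash      :", write_time(healthy, crashed=2), "s")  # ~20 s: failover works
print("limping NIC:", write_time(healthy, limp=2), "s")     # 10,000 s: no recovery
```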
- Ran Hadoop on 6+ hours of a Facebook workload
  - 30-node cluster (healthy)
  - 30-node cluster with 1 slow node @ 0.1 Mbps
- With the slow node, the cluster collapsed after ~4 hours (down to roughly 1 job/hour)
- The same collapse also happens in HDFS and ZooKeeper
- Single point of performance failure
- Coarse-grained timeouts (see the sketch below)
- Bounded thread/queue pool → resource exhaustion
- Unbounded thread/queue pool → OOM
- No throttling or back-pressure
- Limp-oblivious background jobs
- Unexploited parallelism of small transactional I/Os
- Long lock/resource contention
- …
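To make the coarse-grained-timeout flaw concrete, here is a small self-contained illustration (not code from any of the systems studied): the receiver uses a generous per-recv timeout, so a peer that trickles one byte at a time never trips it, and the transfer limps along indefinitely.

```python
import socket, threading, time

def limping_sender(port, nbytes=20, delay=0.5):
    # A "limping" peer: alive and making progress, but only one byte every 0.5 s.
    srv = socket.socket()
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    conn, _ = srv.accept()
    for _ in range(nbytes):
        conn.sendall(b"x")
        time.sleep(delay)
    conn.close()
    srv.close()

PORT = 54321
threading.Thread(target=limping_sender, args=(PORT,), daemon=True).start()
time.sleep(0.2)                      # let the sender start listening

sock = socket.create_connection(("127.0.0.1", PORT))
sock.settimeout(10.0)                # coarse per-recv timeout, typical of RPC layers
start, received = time.time(), 0
while received < 20:
    chunk = sock.recv(4096)          # each recv returns well within 10 s ...
    if not chunk:
        break
    received += len(chunk)
print(f"timeout never fired, yet 20 bytes took {time.time() - start:.1f} s")
```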
- Introduction
- Impact of limpware [SoCC ’13]
- Progress Summary
- Study/Analysis
  - Limplock/limpware [HotCloud ’13, SoCC ’13]
  - What bugs live in the cloud? [SoCC ’14]
    - Study of 3000+ bugs in scale-out distributed systems
    - New classes: scalability bugs, single-point-of-failure bugs, …
  - The Tail at Store [In Submission]
    - Goal: turn anecdotes into real statistics
    - Collaboration with Gokul Soundararajan and Deepak Kenchammana
    - Study of over 450,000 disks, 4,000 SSDs, and 240 EBS drives
    - Questions: How many slow drives? How often? Are the slowdowns transient?
    - Finding: limping disks and SSDs are real
    - 2-digit slowdowns occurred in 0.01% of disk and SSD hours
    - 4- and 3-digit slowdowns occurred in 124 and 2,461 disk hours, respectively, and 3-digit SSD slowdowns in 10 SSD hours
- Towards Limpware-Tolerant Systems
  - Detecting limpware-intolerant designs in distributed systems [HotCloud ’15]
  - Tail-tolerant storage [In Progress]
    - In the flash controller, the operating system, and distributed storage
    - Coordinated with MapReduce speculative execution (a cross-cutting approach)
[Diagram: tail-tolerant (TT) flash controller, TT OS/RAID, and TT distributed FS, coordinated with MapReduce speculative execution]
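As a rough illustration of the tail-tolerant idea, here is a sketch of the generic hedged-request technique (not the project's actual implementation): read from the primary replica, and if it has not answered within a small hedge delay, also ask a backup and return whichever response arrives first, so one limping drive or node does not set the latency. The replica latencies and hedge delay below are made-up numbers.

```python
import concurrent.futures as cf
import time

def read_replica(replica_id, block):
    # Stand-in for a replica read; replica 0 is "limping" (1000x slower here).
    time.sleep(1.0 if replica_id == 0 else 0.001)
    return f"block {block} from replica {replica_id}"

def hedged_read(block, hedge_after=0.01):
    # Ask the primary; if no answer within `hedge_after` seconds, also ask a
    # backup replica and return whichever response arrives first.
    pool = cf.ThreadPoolExecutor(max_workers=2)
    primary = pool.submit(read_replica, 0, block)        # the limping primary
    done, _ = cf.wait([primary], timeout=hedge_after)
    if not done:
        backup = pool.submit(read_replica, 1, block)
        done, _ = cf.wait([primary, backup], return_when=cf.FIRST_COMPLETED)
    pool.shutdown(wait=False)
    return next(iter(done)).result()

start = time.time()
print(hedged_read(7), f"in {time.time() - start:.3f} s")  # fast despite the limping primary
```

The usual trade-off is a small amount of extra load (the occasional duplicate request) in exchange for a much shorter latency tail.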