Date post: | 02-Jul-2015 |
Category: | Technology |
Upload: | locaweb |
distributed systems
diego souza @ infra-dev
agenda
● the basics
● models
● practical aspects
the basics
what is a distributed system? (cont.)
● a distributed system is a piece of software that ensures that a collection of independent computers appears to its users as a single coherent system;
the basics
what is a distributed system? (cont.)
● a distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages;
the basics
what is a distributed system?
● a distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable [Lamport];
the basics
fallacies of a distributed system
1. the network is reliable;
2. latency is zero;
3. bandwidth is infinite;
4. the network is secure;
5. topology doesn't change;
6. there is one administrator;
7. transport cost is zero;
8. the network is homogeneous;
the basics
examples:
● cassandra
● hadoop
● www
● internet
● etc.
the basics
why?
● things no longer fit in a single machine;
● scalability [size, geographic, organizational];
● availability;
● fault tolerance;
● performance;
the basics
scalability
● the ability of a system, network, or process to handle a growing amount of work in a capable manner, or its ability to be enlarged to accommodate that growth;
the basics
performance
● depends on the context and what we want to achieve:
  ○ response time/low latency;
  ○ throughput;
  ○ utilization of computer resources;
the basics
latency
● the state of being latent; delay, a period between the initiation of something and the occurrence;
● a wise man once said:
  ○ Bandwidth is easy. Engineers build bandwidth. But latency is hard. Only God gives us latency;
the basics
availability
● the proportion of time a system is in a functioning condition. If a user cannot access the system, it is said to be unavailable;
the basics
fault tolerance
● ability of a system to behave in a well-defined manner once faults occur;
models
availability metrics
availability = uptime / (uptime + downtime)
availability = mtbf / (mtbf + mttr)
mtbf: mean time between failures
mttr: mean time to repair
● q: is every second the same?
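A minimal sketch of the mtbf/mttr formula above, assuming both values are given in hours. The function names and the sample numbers are illustrative only; they are not from the slides.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: mtbf / (mtbf + mttr)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_per_year_hours(avail: float, hours_per_year: float = 365 * 24) -> float:
    """Yearly downtime implied by an availability figure."""
    return (1.0 - avail) * hours_per_year


# Hypothetical system: fails about every 999 hours, takes 1 hour to repair.
a = availability(mtbf_hours=999.0, mttr_hours=1.0)
print(f"availability: {a:.3%}")
print(f"downtime per year: {downtime_per_year_hours(a):.2f} hours")
```

Note that this metric treats every hour of downtime as equal, which is exactly what the question on the slide challenges.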
models
availability metrics
yield = successes / requests
● a: very unlikely!
models
availability metrics
harvest = data_available / total_data
● how incomplete is this [think of web search]?
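A small sketch of why yield and harvest can tell a different story than uptime. The hourly traffic numbers are made up for illustration: two one-hour outages give the same uptime-based availability (23/24), but very different yield depending on when they happen.

```python
def yield_ratio(successes: int, requests: int) -> float:
    """yield = successes / requests (1.0 when there was no traffic)."""
    return successes / requests if requests else 1.0


def harvest(data_available: int, total_data: int) -> float:
    """harvest = data_available / total_data (1.0 for an empty dataset)."""
    return data_available / total_data if total_data else 1.0


def yield_with_outage(traffic: list, failed_hour: int) -> float:
    """Yield for one day if every request in `failed_hour` fails."""
    total = sum(traffic)
    return yield_ratio(total - traffic[failed_hour], total)


# Hypothetical requests per hour: quiet at 03:00, peak at 20:00.
traffic = [1000] * 24
traffic[3] = 100
traffic[20] = 10_000

off_peak = yield_with_outage(traffic, failed_hour=3)
peak = yield_with_outage(traffic, failed_hour=20)

# Harvest: a query answered from 9 of 10 shards (assumed equal in size).
partial = harvest(data_available=9, total_data=10)

print(f"yield, off-peak outage: {off_peak:.4f}")
print(f"yield, peak outage:     {peak:.4f}")
print(f"harvest with one shard down: {partial:.2f}")
```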
models
distributing the dataset
● partition
● replication
models
partition
● improves performance [reduces dataset];
● improves availability [partial failures];
● usually application specific [random, time, user];
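As one possible illustration of the "user" strategy above, here is a sketch of hash partitioning by user key. The key format, the partition count, and the helper names are assumptions for the example, not something the slides prescribe.

```python
import hashlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash (not Python's hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def assign(keys, num_partitions: int) -> dict:
    """Group keys by the partition that owns them."""
    placement = defaultdict(list)
    for key in keys:
        placement[partition_for(key, num_partitions)].append(key)
    return dict(placement)


# Hypothetical user keys spread over 4 partitions.
users = [f"user-{i}" for i in range(10)]
placement = assign(users, num_partitions=4)
for partition, owned in sorted(placement.items()):
    print(partition, owned)
```

Each partition holds only part of the dataset, and losing one partition affects only the keys it owns. A plain modulo scheme like this moves most keys when the partition count changes, which is the problem consistent hashing addresses.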
models
replication
● improves performance [full copy];
● improves availability [full copy, reed-solomon codes];
  ○ synchronous, asynchronous;
  ○ single copy, multi-master;
  ○ crdts
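To make the crdts bullet concrete, here is a sketch of one of the simplest CRDTs, a grow-only counter (G-Counter). The class and the replica names are illustrative; the point is that replicas can accept writes independently and still converge, because merge is commutative, associative, and idempotent.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged with element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Fold in another replica's state; safe to apply repeatedly."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Two replicas take writes independently, then exchange state.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing
print(a.value(), b.value())
```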
models
replication [strong consistency]
● primary/copy [e.g. mysql master]
● 2pc [e.g. mysql cluster]
● paxos, zab, raft
models
replication [weak consistency]
● amazon dynamo
  ○ consistent hashing [partitioning]
  ○ partial quorums
  ○ failure detection and read repair
  ○ gossip protocol
● note: r + w > n != strong consistency
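A brute-force sketch of what r + w > n does guarantee: every read set shares at least one replica with every write set. The function name is hypothetical. As the note above says, overlap alone is not strong consistency; for example, concurrent writes or sloppy quorums can still leave readers with stale or conflicting values.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every r-replica read set intersects every w-replica write set."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


print(quorums_always_overlap(n=3, r=2, w=2))  # r + w > n
print(quorums_always_overlap(n=3, r=1, w=2))  # r + w == n
print(quorums_always_overlap(n=3, r=1, w=1))  # r + w < n
```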
models
time
● global clock [ntp, total order]
● local clock [partial order]
● logical clock [partial order; lamport clock, vector clocks]
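A minimal sketch of vector clocks, one of the logical clocks listed above, assuming clocks are plain dicts from node name to counter (a missing entry counts as 0). The helper names are illustrative. The comparison shows why the order is only partial: some pairs of events are simply concurrent.

```python
def vc_increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def vc_merge(a: dict, b: dict) -> dict:
    """Element-wise max, applied when a node receives a message."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a: dict, b: dict) -> str:
    """Classify a relative to b: 'equal', 'before', 'after' or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


e1 = vc_increment({}, "A")                # local event on node A
e2 = vc_increment({}, "B")                # independent event on node B
e3 = vc_increment(vc_merge(e1, e2), "B")  # B receives A's message

print(compare(e1, e2))
print(compare(e1, e3))
print(compare(e3, e2))
```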
models
consensus & atomic broadcast
● consensus: vote & agreement;
● atomic broadcast: reliable message transmission and order guarantees;
● they are equivalent
models
flp impossibility
● there is no deterministic algorithm that solves the consensus problem in an asynchronous system subject to failures, even if messages can never be lost, at most one process may fail, and it can only fail by crashing;
● note: it's not that bad! :)
models
cap [note: "pick only two" is misleading]
● consistency: the same data at the same time;
● availability;
● partition tolerance: continues to operate despite message loss [network or node failure];
practical aspects
I find latency one of the most important aspects of performance
distributed systems are hard to develop and even harder to operate: they are not unbreakable
consensus is a hard problem
failures are the norm
metrics, metrics, metrics
what to do in the presence of failures
think about backpressure mechanisms
think about timeouts
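One way the backpressure and timeout advice can look in code is a bounded queue whose producers give up after a deadline instead of waiting forever. This is only a sketch: the queue size, the timeout value, and the `submit` helper are assumptions for the example.

```python
import queue


def submit(work_queue: queue.Queue, item, timeout_s: float = 0.05) -> bool:
    """Try to enqueue; return False (shed load) if the queue stays full."""
    try:
        work_queue.put(item, timeout=timeout_s)
        return True
    except queue.Full:
        return False


# A bounded queue pushes back on producers once consumers fall behind.
work_queue = queue.Queue(maxsize=2)
results = [submit(work_queue, i) for i in range(3)]
print(results)
```

With no consumer draining the queue, the third submission is rejected after the timeout, so the caller learns about the overload quickly and can retry later, degrade, or fail fast.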
feature flag as a deploy mechanism
think hard about scalability
thanks :)
questions or comments?
appendix
appendix: what we have here
● cassandra
● zookeeper
● ceph
● etcd
● consul
● leela