Distributed Systems

distributed systems
diego souza @ infra-dev

agenda

● the basics
● models
● practical aspects

the basics

what is a distributed system?
● a distributed system is a piece of software that ensures that a collection of independent computers appears to its users as a single coherent system;

what is a distributed system? (cont.)
● a distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages;

what is a distributed system? (cont.)
● a distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable [Lamport];

fallacies of distributed systems
1. the network is reliable;
2. latency is zero;
3. bandwidth is infinite;
4. the network is secure;
5. topology doesn't change;
6. there is one administrator;
7. transport cost is zero;
8. the network is homogeneous;

examples:
● cassandra
● hadoop
● www
● internet
● etc.

why?
● things no longer fit in a single machine;
● scalability [size, geographic, organizational];
● availability;
● fault tolerance;
● performance;

scalability
● the ability of a system, network, or process to handle a growing amount of work in a capable manner, or its ability to be enlarged to accommodate that growth;

performance
● depends on the context and what we want to achieve:
○ response time/low latency;
○ throughput;
○ utilization of computer resources;

latency
● the state of being latent; delay, a period between the initiation of something and its occurrence;
● a wise man once said:
○ Bandwidth is easy. Engineers build bandwidth. But latency is hard. Only God gives us latency;

availability
● the proportion of time a system is in a functioning condition. If a user cannot access the system, it is said to be unavailable;

fault tolerance
● ability of a system to behave in a well-defined manner once faults occur;

models

availability metrics

availability = uptime / (uptime + downtime)

availability = mtbf / (mtbf + mttr)

mtbf: mean time between failures

mttr: mean time to repair
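The two formulas above can be sketched in a few lines. This is a minimal illustration, not a monitoring tool: the function names and the sample numbers (one failure every 1000 hours, one hour to repair) are assumptions made for the example.

```python
def availability_from_times(uptime: float, downtime: float) -> float:
    """availability = uptime / (uptime + downtime)"""
    total = uptime + downtime
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime / total


def availability_from_mtbf(mtbf: float, mttr: float) -> float:
    """availability = mtbf / (mtbf + mttr)"""
    return availability_from_times(mtbf, mttr)


def downtime_per_year_hours(availability: float) -> float:
    """Hours of downtime per 365-day year implied by an availability ratio."""
    return (1.0 - availability) * 365 * 24


# Hypothetical numbers: a failure every 1000 hours, 1 hour to repair each.
a = availability_from_mtbf(mtbf=1000.0, mttr=1.0)
print(f"availability: {a:.5f}")
print(f"downtime per year: {downtime_per_year_hours(a):.2f} hours")
```

Converting the ratio into hours per year tends to make the cost of each missing "nine" easier to discuss.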

● q: is every second the same?

yield = successes / requests

● a: very unlikely! [downtime at peak hits more requests than downtime off-peak]

harvest = data_available / total_data

● how incomplete is the answer [think of web search]?
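A small sketch of why yield and harvest say more than uptime alone. The traffic profile below is invented for illustration (100 requests per hour, except a peak hour with 10,000), and the function names are assumptions; `yield` is a reserved word in Python, hence `yield_ratio`.

```python
def yield_ratio(successes: int, requests: int) -> float:
    """yield = successes / requests"""
    return successes / requests if requests else 1.0


def harvest_ratio(data_available: int, total_data: int) -> float:
    """harvest = data_available / total_data"""
    return data_available / total_data


# Hypothetical traffic: 24 hourly buckets, hour 20 is the daily peak.
HOURLY_REQUESTS = [100] * 24
HOURLY_REQUESTS[20] = 10_000


def daily_yield_with_outage(outage_hour: int) -> float:
    """Yield for the day if every request in one hour fails."""
    requests = sum(HOURLY_REQUESTS)
    failed = HOURLY_REQUESTS[outage_hour]
    return yield_ratio(requests - failed, requests)


# Both outages last one hour, so uptime-based availability is 23/24 in both
# cases, but the yield differs a lot.
print("outage off-peak:", daily_yield_with_outage(3))
print("outage at peak: ", daily_yield_with_outage(20))

# A search answered by 3 of 4 index shards returns a partial result.
print("harvest:", harvest_ratio(3, 4))
```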

distributing the dataset
● partition
● replication

partition
● improves performance [reduces dataset];
● improves availability [partial failures];
● usually application specific [random, time, user];
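As a rough sketch of two of the application-specific schemes named above, the functions below pick a partition by user and by time. The names, the choice of SHA-256, and the one-day bucket size are assumptions for the example, not something the slides prescribe.

```python
import hashlib


def partition_by_user(user_id: str, num_partitions: int) -> int:
    """Map a user id to a partition with a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and would not agree across machines.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_by_time(timestamp: int, bucket_seconds: int = 86_400) -> int:
    """Map a Unix timestamp to a time bucket (one bucket per day by default)."""
    return timestamp // bucket_seconds


print(partition_by_user("alice", 8))
print(partition_by_time(3 * 86_400 + 5))
```

Plain modulo hashing moves most keys when `num_partitions` changes, which is one reason systems such as Dynamo prefer consistent hashing.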

replication
● improves performance [full copy];
● improves availability [full copy, reed-solomon codes];
○ synchronous, asynchronous;
○ single copy, multi-master
○ crdts
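To make the "crdts" bullet concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter (G-Counter). The class and method names are chosen for the example; a real system would also need serialization and a transport between replicas.

```python
class GCounter:
    """Grow-only counter: each replica only increments its own slot."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments seen from that replica

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Take the per-replica maximum.

        The merge is commutative, associative and idempotent, so replicas
        converge no matter how often or in which order they exchange state.
        """
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


a, b = GCounter("a"), GCounter("b")
a.increment(2)      # two increments accepted by replica a
b.increment(3)      # three increments accepted by replica b
a.merge(b)
b.merge(a)
print(a.value(), b.value())
```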

replication [strong consistency]
● primary/copy [e.g. mysql master]
● 2pc [e.g. mysql cluster]
● paxos, zab, raft

replication [weak consistency]
● amazon dynamo
○ consistent hashing [partitioning]
○ partial quorums
○ failure detection and read repair
○ gossip protocol
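A minimal sketch of the consistent hashing idea, assuming a ring of virtual nodes and a SHA-256 based hash. The `HashRing` class, the `vnodes` parameter and the node names are hypothetical; this is not Dynamo's actual implementation.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes: int = 64):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node is placed on the ring many times to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the next node point."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


before = HashRing(["n1", "n2", "n3"])
after = HashRing(["n1", "n2", "n3", "n4"])
keys = [f"user-{i}" for i in range(1000)]
moved = sum(1 for k in keys if before.node_for(k) != after.node_for(k))
print(f"{moved} of {len(keys)} keys moved after adding n4")
```

The point of the design is that adding a node only takes keys over from existing nodes; keys never shuffle between the nodes that were already there.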

● note: r + w > n != strong consistency
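The quorum condition in the note can be checked by brute force. The sketch below (the function name is an assumption) enumerates every read and write quorum and confirms that they always share a replica exactly when r + w > n. As the note says, that overlap by itself is not strong consistency: it does not, for instance, order concurrent writes.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum of size r shares at least one replica
    with every write quorum of size w, out of n replicas."""
    replicas = range(n)
    return all(
        set(read_q) & set(write_q)
        for read_q in combinations(replicas, r)
        for write_q in combinations(replicas, w)
    )


print(quorums_always_overlap(3, 2, 2))  # r + w > n
print(quorums_always_overlap(3, 1, 2))  # r + w == n
```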

time
● global clock [ntp, total order]
● local clock [partial order]
● logical clock [partial order; lamport clock, vector clocks]
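The logical clocks mentioned above can be sketched briefly. The class and function names are assumptions for the example, and vector clocks are represented as plain dicts from process id to counter, with missing entries read as zero.

```python
class LamportClock:
    """Scalar logical clock: a counter that follows causality."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, msg_time: int) -> int:
        """Jump past the sender's timestamp on delivery."""
        self.time = max(self.time, msg_time) + 1
        return self.time


def vc_happened_before(a: dict, b: dict) -> bool:
    """a -> b: a is <= b in every entry and strictly smaller in at least one."""
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    return b_ahead and not a_ahead


def vc_concurrent(a: dict, b: dict) -> bool:
    """Neither event happened before the other."""
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    return a_ahead and b_ahead


p, q = LamportClock(), LamportClock()
stamp = p.send()
print(q.receive(stamp))
print(vc_concurrent({"p": 2, "q": 0}, {"p": 1, "q": 1}))
```

A Lamport timestamp alone cannot tell concurrent events apart from causally ordered ones; the vector clock comparison can, at the cost of one counter per process.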

consensus & atomic broadcast
● consensus: vote & agreement;
● atomic broadcast: reliable message transmission and order guarantees;
● they are equivalent

flp impossibility
● there is no deterministic algorithm that solves the consensus problem in an asynchronous system subject to failures, even if messages are never lost, at most one process may fail, and it can only fail by crashing;
● note: it's not that bad! :)

cap: [note: "pick only two" is misleading]
● consistency: the same data at the same time;
● availability;
● partition tolerance: continues to operate despite message loss [network or node failure];

practical aspects

I find latency one of the most important aspects of performance

hard to develop, even harder to operate: they are not unbreakable

consensus is a hard problem

failures are the norm

metrics, metrics, metrics

what to do in the presence of failures

think about backpressure mechanisms
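One simple form of backpressure is a bounded queue that refuses work once it is full, so the producer finds out immediately instead of the consumer drowning later. The `BoundedInbox` class below is a hypothetical sketch built on the standard library `queue.Queue`.

```python
import queue


class BoundedInbox:
    """Accepts work up to a fixed capacity and rejects the rest."""

    def __init__(self, capacity: int):
        if capacity < 1:
            # queue.Queue treats maxsize <= 0 as unbounded, which would
            # defeat the purpose here.
            raise ValueError("capacity must be >= 1")
        self._queue = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def offer(self, item) -> bool:
        """Try to enqueue without blocking; False tells the caller to back off."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take(self):
        """Dequeue the oldest item; raises queue.Empty if there is none."""
        return self._queue.get_nowait()


inbox = BoundedInbox(capacity=2)
for job in ("a", "b", "c"):
    print(job, "accepted" if inbox.offer(job) else "rejected")
```

Whether a rejected producer should drop, retry later, or propagate the error upstream is an application decision; the sketch only makes the overload visible.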

think about timeouts
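A small sketch of bounding how long a caller waits on a dependency, using `concurrent.futures` from the standard library. `slow_dependency` is a hypothetical stand-in for a remote call, and the fallback value is an assumption for the example.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in a worker thread and wait at most timeout_s seconds.

    Raises concurrent.futures.TimeoutError when the deadline passes. The
    worker thread is not killed and keeps running in the background, so
    fn should also set its own I/O timeouts.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Do not block the caller waiting for a slow call to finish.
        pool.shutdown(wait=False)


def slow_dependency(delay_s: float) -> str:
    """Hypothetical remote call that takes delay_s seconds to answer."""
    time.sleep(delay_s)
    return "ok"


def fetch_or_fallback(delay_s: float, timeout_s: float, fallback: str = "fallback") -> str:
    """Return the dependency's answer, or a fallback if it is too slow."""
    try:
        return call_with_timeout(slow_dependency, timeout_s, delay_s)
    except concurrent.futures.TimeoutError:
        return fallback


print(fetch_or_fallback(delay_s=0.0, timeout_s=1.0))
print(fetch_or_fallback(delay_s=0.5, timeout_s=0.05))
```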

feature flags as a deploy mechanism

think hard about scalability

thanks :)
questions or comments?

appendix

appendix: what we have here

● cassandra
● zookeeper
● ceph
● etcd
● consul
● leela

links
● http://book.mixu.net/distsys/


Date post:	02-Jul-2015
Category:	Technology
Upload:	locaweb