Post on 16-Dec-2015
transcript
© DEEDS – OS
Distributed Operating Systems
Coverage
• Distributed Systems (DS) Paradigms
– DS … NOS, DOS
– DS Services: communication, synchronization, coordination, replication, …
What is a Distributed System?
“A distributed system is one preventing you from working because of the failure of a machine that you had never heard of.”
Leslie Lamport
Multiple computers sharing (same) state and interconnected by a network
… collection of autonomous entities appearing to users as a single OS
shared memory multiprocessor
message passing multicomputer
distributed system
Distribution: Example Pro/Cons
The Good Stuff: Resource Sharing (concurrency, performance), Distributed Access (matching the spatial distribution of applications), Scalability, Load Balancing (migration, relocation), Fault Tolerance.
Bank account database (DB) example
– Naturally centralized: easy consistency and performance
– Fragment the DB among regions: exploit locality of reference and security; reduce reliance on the network for remote access
– Replicate each fragment for fault tolerance
But we now need (additional) DS techniques
– Route each request to the right fragment
– Maintain access to/consistency of the fragments as a whole database
– Maintain access to/consistency of each fragment’s replicas
– …
OSs for DSs
Loosely-coupled OS
– A collection of computers, each running its own OS, with the OSs allowing sharing of resources across machines
– A.K.A. Network Operating System (NOS)
 Provides local services to remote clients via remote login
 Data transfer from the remote OS to the local OS via FTP (File Transfer Protocol)
Tightly-coupled OS
– The OS tries to maintain a single global view of the resources it manages
– A.K.A. Distributed Operating System (DOS)
 “Local access feel,” as in a non-distributed, standalone OS
 Data migration or computation migration modes (entire processes or threads)
Network Operating Systems (NOS)
Provide an environment where users are (explicitly) aware of the multiplicity of machines. Users can access remote resources by
logging into the remote machine OR transferring data from the remote machine to their own machine.
Users should know where the required files and directories are and mount them. Each machine can act as a server and a client at the same time. E.g., NFS from Sun Microsystems, CMU AFS, etc.
Distributed Operating Systems (DOS)
Runs on a cluster of machines with no shared memory
Users get the feel of a single processor: a virtual uniprocessor
Transparency is the driving force
Requires:
 A single global IPC mechanism
 Identical process management and system calls at all nodes
 A common file system at all nodes
 State, service, and data consistency
Basic Client-Server Model for DOS & NOS
– Non-blocking communication!
– File based vs. communication/object based
Middleware
• Can we have the best of both worlds?– Scalability and openness of a NOS
– Transparency and common-state of a DOS
• Solution: an additional layer of SW above the OS (middleware)
– Mask heterogeneity
– Improve distribution transparency (and others)
Middleware Openness Basis
• Document-based middleware (e.g., WWW)
• Coordination-based MW (e.g., Linda, publish/subscribe, Jini, etc.)
• File system based MW (upload/download, remote access)
• Shared object based MW
Global Access Transparency
Illusion of a single computer across a DS
Distribution transparency: all of the above + performance + flexibility (modification, enhancements for kernel/devices) + balancing/scheduling + scaling (allowing systems to expand without disrupting users) + …
Fragmentation: hide whether the resource is fragmented or not
Reliability, Performance, Scalability
Reliability
 Faults (fail-stop, transient, Byzantine)
 Fault avoidance (de-cluster, rejuvenate)
 Fault tolerance
  Redundancy techniques (tolerate k failures?)
  Distributed control
 Fault detection & recovery
  Atomic transactions
  Stateless servers
  Acknowledgements and timeout-based retransmissions of messages
Performance
 Batch if possible
 Cache whenever possible
 Minimize copying of data
 Minimize network traffic
 Take advantage of fine-grain parallelism for multiprocessing
Scalability
 Avoid centralized entities
  They provide no/limited fault tolerance
  They lead to system bottlenecks
  Network traffic capacity issues around a centralized entity
 Avoid centralized algorithms
 Perform most operations on client workstations
Design Issues
Resource management
 Hard to obtain consistent information about utilization or availability of resources.
 Has to be calculated (costly!!) approximately using heuristic methods.
Processor allocation
 Load balancing
 Hierarchical organization of processors: if a processor cannot handle a request, it asks its parent for help.
 … BUT a crash of a higher-level processor isolates all processors attached to it.
Process scheduling
 Communication dependency, causality, linearizability, … to consider
Fault tolerance
 Consider distribution of control and data.
Services provided
 Typical services include name, directory, file, time, etc.
Process Addressing ~ NOS Flavor
Explicit addressing
 Send(process_id, message)
Implicit addressing (functional addressing)
 Send_any(service_id, message)
 Ex: machine_id@local_id (Berkeley UNIX)
 - Limited with process migration
Link-based process addressing
 Ex: machine_id@local_id@machine_id
 - Overhead of locating a process
 - Intermediate node failure
System-wide unique identifier (location transparency)
 High-level machine-independent and low-level machine-dependent parts
 - Centralized naming server for the high-level (functional) id
So what services do we need to realize DS?
• Communication
• Coordination (Stateful? Stateless?) & Synchronization
• Replication
• Failure Handling
• Consistency
• Liveness
• Storage
Communication (Group Comm)
One-to-many communication (blocking or non-blocking?)
 Multicast/broadcast
 Open group / closed group
Flexible reliability
 0-reliable, 1-reliable, m-out-of-n-reliable, all-reliable
Atomic multicast
Many-to-one communication
Many-to-many communication
 Absolute ordering (global clock)
 Consistent ordering (sequencer / ABCAST protocol)
 Causal ordering
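The sequencer-based consistent ordering mentioned above can be sketched as follows: every message is stamped with a global sequence number by one sequencer process, and each receiver delivers strictly in stamp order, buffering anything that arrives early. This is a minimal single-process simulation; all class and variable names are illustrative.

```python
# Sketch of sequencer-based consistent (total) ordering: one sequencer
# stamps every multicast message; receivers deliver in stamp order.
class Sequencer:
    def __init__(self):
        self.next_seq = 0

    def stamp(self, msg):
        stamped = (self.next_seq, msg)   # attach the global sequence number
        self.next_seq += 1
        return stamped

class Receiver:
    def __init__(self):
        self.expected = 0
        self.pending = {}                # out-of-order buffer: seq -> msg
        self.delivered = []

    def receive(self, seq, msg):
        self.pending[seq] = msg
        while self.expected in self.pending:   # deliver in sequence order
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1

seq = Sequencer()
r1, r2 = Receiver(), Receiver()
m1 = seq.stamp("a")
m2 = seq.stamp("b")
r1.receive(*m2); r1.receive(*m1)   # r1 sees the messages out of order
r2.receive(*m1); r2.receive(*m2)   # r2 sees them in order
print(r1.delivered, r2.delivered)  # ['a', 'b'] ['a', 'b']
```

Both receivers deliver in the same order regardless of arrival order, which is exactly the consistency guarantee; the cost is that the sequencer is a centralized entity (see the scalability caveats earlier).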
Communication Failure Handling
Delivers messages despite
– communication link(s) failures
– process failures
Main kinds of failures to tolerate
– timing (link and process)
– omission (link and process)
– value
 Loss of the request message
 Loss of the response message
 Unsuccessful execution of the request (system crash)
Inter-Process Communication (IPC)
 Two-message IPC (request, reply)
 Three-message reliable IPC (request, reply, ack)
 Four-message reliable IPC (request, ack, reply, ack)
Failure handling
 At-least-once (timeout)
 Idempotency (no side effects no matter how many times performed)
 Non-idempotent (exactly-once semantics)
 • Reply from a cache keyed by a unique request id
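The reply-cache idea for non-idempotent operations can be sketched as below: the server remembers the reply under the request's unique id, so a retransmitted request is answered from the cache instead of being re-executed. The bank-deposit operation and all names are illustrative.

```python
# Sketch of exactly-once semantics for a non-idempotent operation:
# cache the reply by unique request id and replay it on duplicates.
class BankServer:
    def __init__(self, balance=100):
        self.balance = balance
        self.reply_cache = {}                  # request_id -> cached reply

    def deposit(self, request_id, amount):
        if request_id in self.reply_cache:     # retransmitted request
            return self.reply_cache[request_id]
        self.balance += amount                 # executed exactly once
        self.reply_cache[request_id] = self.balance
        return self.balance

s = BankServer()
print(s.deposit("req-1", 50))   # 150
print(s.deposit("req-1", 50))   # 150: replayed from cache, no double deposit
```

Note that a real server must eventually garbage-collect the cache, which is safe once the client acknowledges the reply (the fourth message in four-message reliable IPC).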
Communication: Reliable Delivery
• Omission failure tolerance (degree k)
• Design choices:
a) Error masking (spatial): several (> k) links
b) Error masking (temporal): repeat k+1 times
c) Error recovery: detect the error and recover
Reliable Delivery (cont.)
Error detection and recovery: ACK’s and timeouts
• Positive ACK: sent when a message is received– Timeout on sender without ACK: sender retransmits
• Negative ACK: sent when a message loss detected– Needs sequence #s or time-based reception semantics
• Tradeoffs
– Positive ACKs: usually faster failure detection
– NACKs: fewer messages …
Q: what kinds of situations are good for
– Spatial error masking?
– Temporal error masking?
– Error detection and recovery with positive ACKs?
– Error detection and recovery with NACKs?
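The positive-ACK scheme above can be sketched as a sender that retransmits on every timeout until the message gets through. The lossy channel, seeded RNG, and function names are illustrative stand-ins for a real network and timer.

```python
# Sketch of error recovery with positive ACKs and timeouts:
# retransmit until delivery succeeds (or give up after max_tries).
import random

def lossy_deliver(inbox, msg, loss_prob, rng):
    """Simulated link: True iff the message got through (ACK would follow)."""
    if rng.random() < loss_prob:
        return False                   # message lost -> sender will time out
    inbox.append(msg)
    return True

def reliable_send(msg, inbox, loss_prob=0.5, max_tries=50, rng=None):
    rng = rng or random.Random(42)     # fixed seed for a repeatable demo
    for attempt in range(1, max_tries + 1):
        if lossy_deliver(inbox, msg, loss_prob, rng):
            return attempt             # positive ACK received: done
        # timeout expired without an ACK: loop and retransmit
    raise TimeoutError("no ACK after max_tries retransmissions")

inbox = []
tries = reliable_send("hello", inbox)
print(inbox, tries)
```

If the ACK (rather than the message) is lost, the receiver sees duplicates, which is why retransmission pairs naturally with the sequence numbers or reply caching discussed above.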
Resilience to Sender Failure
• Multicast FT-Communication harder than point-to-point– Basic problem is of failure detection
– Subsets of senders may receive msg, then sender fails
• Solutions depend on the flavor of multicast reliability:
a) Unreliable: no effort to overcome link failures
b) Best-effort: some steps taken to overcome link failures
c) Reliable: participants coordinate to ensure that all or none of correct recipients get it
Coverage
• DS Paradigms– DS & OS’s
– Services and models
– Communication
• Coordination– Distributed ME
– Distributed Coordination
Co-ordination protocols in DOS/DS
• Distributed ME
• Distributed atomicity
• Distributed synchronization & ordering
How do we coordinate the distributed resources for ME, CS access, consistency, etc.?
Co-ordination in Distributed Systems
• Event ordering– centralized system: ~ easy (common clock & memory)
– distributed system: hard (convergent/consistent dist. time)
• Example: the Unix “make” program
– source files/object files; make compiles & links based on the last version
– a.o @99 & a.c @100: re-compile a.c [assuming a common time base]
– “make” in a DS?
[Figure: machine A holds a.o @ 98; machine B, with a slow clock, edits a.c’ @ 97. Since 97 < 98, make wrongly skips recompilation.]
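Lamport logical clocks give exactly the "happened before" ordering the make example needs without synchronized physical clocks: every local event ticks the clock, and a receive advances it past the message's timestamp. This is a minimal sketch; the make-style event names are illustrative.

```python
# Sketch of Lamport logical clocks: causal ordering without a global clock.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                  # local event
        self.time += 1
        return self.time

    def send(self):                  # timestamp an outgoing message
        return self.tick()

    def receive(self, msg_time):     # merge the sender's timestamp on receipt
        self.time = max(self.time, msg_time) + 1
        return self.time

A, B = LamportClock(), LamportClock()
t_edit = B.tick()            # B edits a.c
t_msg  = B.send()            # B ships a.c to A
t_recv = A.receive(t_msg)    # A receives it
t_comp = A.tick()            # A compiles a.o
print(t_edit, t_comp)        # 1 4: the edit provably precedes the compile
```

Because the edit causally precedes the compile through the message, its logical timestamp is guaranteed smaller, regardless of how skewed the machines' physical clocks are.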
Synchronization
Blocking (the send primitive blocks until an acknowledgment is received)
 Timeout
Non-blocking (send copies to a buffer and returns)
 Polling at the receive primitive
 Interrupt
Synchronous (send and receive primitives are both blocking)
Asynchronous
Distributed synchronization with failures?
 DB/control apps where “order” is essential for consistency
ME
• TSL, Semaphores, monitors… (Single OS)
• Do they work in DS given timing delays, ordering issues +++?
Let’s start with TSL for Multiprocessors
• TSL is no longer atomic: it runs over the bus! CPU #1/#2 TSL read/write sequencing?
• Both CPU #1 & #2 think they have CS access: no ME
– Single CPU: disabling interrupts; multiprocessor?
– Is TSL atomic at the distributed/networked level?
The TSL instruction can fail unless bus locking is made part of the TSL op.
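A test-and-set spinlock built on an atomic TSL can be sketched as below. Real hardware must lock the memory bus during the read-modify-write; here a private `threading.Lock` emulates that atomicity, so this illustrates the protocol rather than the hardware.

```python
# Sketch of a TSL (test-and-set) spinlock; bus locking is emulated.
import threading

class TestAndSetLock:
    def __init__(self):
        self._flag = False
        self._atomic = threading.Lock()   # stands in for bus locking

    def _test_and_set(self):
        with self._atomic:                # atomic read-modify-write
            old, self._flag = self._flag, True
            return old

    def acquire(self):
        while self._test_and_set():       # spin while someone holds the lock
            pass

    def release(self):
        with self._atomic:
            self._flag = False

# Two threads increment a shared counter under the spinlock.
lock = TestAndSetLock()
counter = 0

def worker():
    global counter
    for _ in range(10_000):
        lock.acquire()
        counter += 1                      # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 20000: mutual exclusion held
```

Without the atomic read-modify-write, both threads could observe `_flag == False` between the read and the write and both enter the critical section, which is exactly the failure mode described above.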
Progressive TSL – Private Locks per CPU
* possible to make a separate decision each time a locked mutex is encountered
* multiple locks needed to avoid cache thrashing
• A CPU needing a locked mutex just waits for it, either by
– polling continuously,
– polling intermittently,
– or attaching itself to a list of waiting CPUs
Distributed Lock Problems
[Figure: processes p1–p4 send LOCK requests to a lock server; LOCK GRANTED replies interleave with new LOCK requests.]
What happens? Solutions?
Distributed Mutual Exclusion
• Solution #1: build a lock server in a centralized manner (generally simple)
 Works? Under what assumptions?
 What is the state of the lock server? For stateless servers?
• Lock server solution problems
– Server is a single point of failure
– Server is a performance bottleneck
– Failure of the client holding the lock also causes problems: no unlock sent
• Similar to garbage collection problems in DSs … validity conditions, etc.
Distributed Mutual Exclusion (cont.)
• Solution #2: decentralized algorithm
– Replicate the central server’s state on all processes
– Requesting process sends a LOCK message to all others
– Then waits for LOCK_GRANTED from all
– To release the critical region, send UNLOCK to all
Works? Under what assumptions?
Co-ordination in Distributed Systems
1. Given distributed entities, how does one obtain resource “coordination” to result in an agreed action (such as CS access/ME, shared memory “writes”, producer/consumer modes or decisions)?
2. How are distributed tasks/requests “ordered”?
3. Given distributed resources, how do they “all” agree on a “single” course of action?
• Asynchronous coordination: 2PC, leader elections, etc.
• Synchronous coordination: clocks, ordering, serialization, etc.
Consistency & Distributed State Machines
Distributed State Machine
Consensus
Asynch: “Single” Decision - Commit
2PC: Two-Phase Commit Protocol
– coordinator (pre-specified or selected dynamically)
– multiple secondary sites (“cohorts”)
Objective: all nodes agree on and execute a single decision [all act or no action is taken … banking transactions]
Two-Phase Commit (2PC) Protocol

Coordinator actions:
1. send PREPARE to all
....... “bounded waiting” .......
------------------------------------
4. receive OK from all: put COMMIT in the log & send COMMIT to all
4'. receive any ABORT: send ABORT to all
5. ACK from all? DONE

Each cohort's actions:
2. get msg (PREPARE)
3. if ready, send OK (write undo/redo logs); else, send NOT-OK
------------------------------------
4. receive COMMIT: release resources, send ACK
4'. receive ABORT: undo actions
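The message flow above can be sketched as a minimal in-process simulation: the coordinator collects votes in phase one and sends COMMIT only if all cohorts voted OK. This is a hedged sketch with no real network, durable logging, or timeout handling; the class names are illustrative.

```python
# Minimal 2PC sketch: phase 1 gathers votes, phase 2 commits or aborts.
class Cohort:
    def __init__(self, ready=True):
        self.ready = ready            # will this cohort vote OK?
        self.state = "INIT"

    def prepare(self):                # steps 2/3: receive PREPARE, vote
        self.state = "READY" if self.ready else "ABORTED"
        return self.ready             # True = OK, False = NOT-OK

    def commit(self):                 # step 4: receive COMMIT
        self.state = "COMMITTED"

    def abort(self):                  # step 4': receive ABORT
        self.state = "ABORTED"

def two_phase_commit(cohorts):
    votes = [c.prepare() for c in cohorts]   # phase 1: PREPARE to all
    if all(votes):                           # all OK -> COMMIT to all
        for c in cohorts:
            c.commit()
        return "COMMIT"
    for c in cohorts:                        # any NOT-OK -> ABORT to all
        if c.state == "READY":
            c.abort()
    return "ABORT"

print(two_phase_commit([Cohort(), Cohort()]))              # COMMIT
print(two_phase_commit([Cohort(), Cohort(ready=False)]))   # ABORT
```

The all-or-nothing property follows from the unanimous-vote rule; what the sketch deliberately omits is the blocking problem the next slide raises, where cohorts in state READY cannot decide alone if the coordinator crashes.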
Two-phase commit (cont.)
Problem: coordinator failure, after PREPARE & before COMMIT, blocks participants waiting for decision (a)
• Three-phase commit overcomes this (b) … but is slow
– delay the final decision until enough processes “know” which decision will be taken
Q: can this somehow block?
Comments...
• Time lag in making decisions: RT applications??
• Resources stay locked until voting/decision is completed
• Message overhead
• Relies on assumptions of reliable communication
• Possibilities of deadlock/livelock
• Limited fault tolerance (coordinator dependency)
• New coordinator initiation per new request
Distributed ME (all ACK)
• To request CS: send REQ msg. M to ALL; enQ M in local Q
• Upon receiving M from Pi
– if it does not want (or have) CS, send ACK
– if it has CS, enQ request Pi : M
– if it wants CS, enQ/ACK based on the lowest ID (a timestamp would be much nicer, but with no common time base there is no basis for timestamps)
• To release CS: send ACK to all in Q, deQ [diff. from 2PC]
• To enter CS: enter CS when ACK received from all
[Figure: A and C concurrently request the CS with ids 8 and 12; B ACKs both. C also ACKs A’s lower-id request (8 < 12) and queues its own, so A (queue {8,12}) gets all ACKs and enters the CS. On release, A ACKs C, and C ({12}) enters the CS.]
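The all-ACK rule above can be sketched as a deterministic, single-threaded simulation in the Ricart-Agrawala style, with process ids standing in for timestamps. The three-process scenario mirrors the figure; all names are illustrative.

```python
# Simulation of all-ACK distributed ME: defer ACKs to higher-id requests.
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.requesting = False
        self.in_cs = False
        self.deferred = []    # requests enqueued while we hold/want the CS
        self.acks = 0

def request_cs(p, others):
    p.requesting = True
    for q in others:          # send REQ to ALL
        on_request(q, p)

def on_request(q, requester):
    if q.in_cs or (q.requesting and q.pid < requester.pid):
        q.deferred.append(requester)   # lower id wins: defer the ACK
    else:
        requester.acks += 1            # send ACK immediately

def release_cs(p, n_others):
    p.in_cs = False
    for r in p.deferred:               # on release: ACK everyone queued
        r.acks += 1
        r.in_cs = (r.acks == n_others)
    p.deferred.clear()

# A (id 8) and C (id 12) both want the CS; B (id 5) is idle and ACKs both.
A, B, C = Process(8), Process(5), Process(12)
request_cs(A, [B, C])   # C is not requesting yet, so A gets both ACKs
request_cs(C, [A, B])   # A defers C (8 < 12); C gets only B's ACK
A.in_cs = (A.acks == 2)
print(A.in_cs, C.acks)  # True 1: A enters the CS first
release_cs(A, 2)
print(C.in_cs)          # True: C enters after A releases
```

Note the scheme's fragility, flagged earlier for decentralized algorithms: every process must answer, so a single crashed process blocks all future entries.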
DS Solutions: State Machine Replication
State machine replication implements the abstraction of a single reliable server
Efficient State Machines (with Crashes)
• Observation: worst-case failures are the exception
– Replicas often have an a-priori consistent view (w/o coordination)
– There is a correct replica (called leader) known to every replica
“Fast” consensus = short latency
• Observation: latency optimizations have overhead
– Significant latency overhead if assumptions are not met
• Observation: messages & crypto are expensive
Minimal latency + no crypto + trade latency for message complexity
Example: Web Applications
• Ideal setting for applying replication
– Exposed to the Internet, strong reliability requirements
[Figure: a client talks to a web server consisting of a front-end server and application code, backed by a database]
Practical Replication
[Figure: client sends REQUEST to the primary; primary and backups run ORDER, AGREE, and COMMIT phases; the client delivers after f+1 matching REPLIES]
• Seminal work on efficient replication
– Optimal resilience
– Three phases: non-optimal
– O(n²) message complexity: non-optimal
Motivation
[PNUTS , ZooKeeper]
[Cassandra][GFS, Bigtable]
[Dynamo]
15K+ commodity servers
Internet Datacenters
• Large-scale crashes are the common case to handle
• Need high performance for ALMOST ALL requests
– Example: Dynamo’s SLA specifies a worst-case latency for 99.9% of the requests under high load
– ALL = also in the presence of crashes
• Need low replication costs
– 100s to 1,000s of replicated services
– Additional replication costs are multiplied over the number of services
– Diagnosis, repair & re-configurations
– Speed with unresponsive replicas, e.g., WAN replication
• Replicas can be located at geographically remote sites
• Some sites can become temporarily unreachable
Large Scale DS Goals
• Consistency (Safety)
– Linearizability
• Availability (Liveness)
– Wait-freedom
• Performance
– Latency, throughput, ...
Despite
– Failures
– Concurrency
– Asynchrony
Challenges
• Crash failures– Detectable
– Very popular
• Byzantine failures– Nodes under adversarial control
– Worst-case needs...but costly (# replicas, latency, complexity)
• Asynchronous communication– Reflects real networks (e.g. WAN)
Efficient distributed solutions
• Main efficiency metrics
– Resilience: # of replicas
– Latency: # of communication steps
– Crypto: use of signatures
– Message complexity: # of messages
[Figure: a client contacts the leader replica, which coordinates the other replicas; e.g., 3t+1 replicas]
Advocated Solutions
• Key advocated abstractions:
– Consensus: State Machine Replication (SMR)
– Distributed Storage: Reliable Shared Memory
[Figure: clients send requests to a service; an unreplicated service may give no reply or a “bad” reply, while the replicated service replies correctly]
Practical implementations of these abstractions?
Distributed Storage
[Figure: previously, clients sent requests to a single storage server, risking a “bad” reply or no reply; nowadays, clients talk to distributed storage and receive a reply with a certificate]
Storage is a state machine with operations read and write (SWMR/MWMR …)