Post on 16-Dec-2015
transcript
© DEEDS – OS
Distributed Operating Systems
Coverage
• Distributed Systems (DS) Paradigms
– DS … NOS, DOS
– DS Services: communication, synchronization, coordination, replication, …
What is a Distributed System?
“A distributed system is one preventing you from working because of the failure of a machine that you had never heard of.”
Leslie Lamport
Multiple computers sharing (same) state and interconnected by a network
… collection of autonomous entities appearing to users as a single OS
shared memory multiprocessor
message passing multicomputer
distributed system
Distribution: Example Pro/Cons
The Good Stuff: Resource Sharing (concurrency, performance), Distributed Access (matching the spatial distribution of applications), Scalability, Load Balancing (migration, relocation), Fault Tolerance.
Bank account database (DB) example
– Naturally centralized: easy consistency and performance
– Fragment the DB among regions: exploit locality of reference and security; reduce reliance on the network for remote access
– Replicate each fragment for fault tolerance
But we now need (additional) DS techniques
– Route each request to the right fragment
– Maintain access to/consistency of the fragments as a whole database
– Maintain access to/consistency of each fragment’s replicas
– …
OSs for DSs
Loosely-coupled OS
– A collection of computers, each running its own OS, with the OSs allowing sharing of resources across machines
– A.K.A. Network Operating System (NOS)
 Provides local services to remote clients via remote login
 Data transfer from the remote OS to the local OS via FTP (File Transfer Protocol)
Tightly-coupled OS
– The OS tries to maintain a single global view of the resources it manages
– A.K.A. Distributed Operating System (DOS)
 “Local access feel,” as in a non-distributed, standalone OS
 Data migration or computation migration modes (entire processes or threads)
Network Operating Systems (NOS)
Provide an environment where users are (explicitly) aware of the multiplicity of machines. Users can access remote resources by
logging into the remote machine OR transferring data from the remote machine to their own machine.
Users should know where the required files and directories are and mount them. Each machine can act as a server and a client at the same time. E.g., NFS from Sun Microsystems, CMU AFS, etc.
Distributed Operating Systems (DOS)
Runs on a cluster of machines with no shared memory
Users get the feel of a single processor: a virtual uniprocessor
Transparency is the driving force
Requires:
 A single global IPC mechanism
 Identical process management and system calls at all nodes
 A common file system at all nodes
 State, service, and data consistency
Basic Client-Server Model for DOS & NOS
– Non-blocking communication!
– File based vs. communication/object based
Middleware
• Can we have the best of both worlds?– Scalability and openness of a NOS
– Transparency and common-state of a DOS
• Solution: an additional layer of SW above the OS (middleware)
– Mask heterogeneity
– Improve distribution transparency (and others)
Middleware Openness Basis
• Document-based middleware (e.g., WWW)
• Coordination-based MW (e.g., Linda, publish/subscribe, Jini, etc.)
• File system based MW (upload/download, remote access)
• Shared object based MW
Global Access Transparency
Illusion of a single computer across a DS
Distribution transparency: all of the above + performance + flexibility (modification, enhancements for kernel/devices) + balancing/scheduling + scaling (allowing systems to expand without disrupting users) + …
Fragmentation: hide whether the resource is fragmented or not
Reliability, Performance, Scalability
Reliability
 Faults (fail-stop, transient, Byzantine)
 Fault avoidance (de-cluster, rejuvenate)
 Fault tolerance
  Redundancy techniques (tolerate k failures?)
  Distributed control
 Fault detection & recovery
  Atomic transactions
  Stateless servers
  Acknowledgements and timeout-based retransmissions of messages
Performance
 Batch if possible
 Cache whenever possible
 Minimize copying of data
 Minimize network traffic
 Take advantage of fine-grain parallelism for multiprocessing
Scalability
 Avoid centralized entities
  They provide no/limited fault tolerance
  They lead to system bottlenecks
  Network traffic capacity issues around a centralized entity
 Avoid centralized algorithms
 Perform most operations on client workstations
Design Issues
Resource management
 Hard to obtain consistent information about utilization or availability of resources.
 Has to be calculated (costly!!) approximately using heuristic methods.
Processor allocation
 Load balancing
 Hierarchical organization of processors: if a processor cannot handle a request, it asks its parent for help.
 … BUT a crash of a higher-level processor isolates all processors attached to it.
Process scheduling
 Communication dependency, causality, linearizability, … to consider
Fault tolerance
 Consider distribution of control and data.
Services provided
 Typical services include name, directory, file, time, etc.
Process Addressing ~ NOS Flavor
Explicit addressing
 Send(process_id, message)
Implicit addressing (functional addressing)
 Send_any(service_id, message)
 Ex: machine_id@local_id (Berkeley UNIX)
 - Limited with process migration
Link-based process addressing
 Ex: machine_id@local_id@machine_id
 - Overhead of locating a process
 - Intermediate node failure
System-wide unique identifier (location transparency)
 High-level machine-independent and low-level machine-dependent parts
 - Centralized naming server for the high-level (functional) id
So what services do we need to realize DS?
• Communication
• Coordination (Stateful? Stateless?) & Synchronization
• Replication
• Failure Handling
• Consistency
• Liveness
• Storage
Communication (Group Comm)
One-to-many communication (blocking or non-blocking?)
 Multicast/broadcast
 Open group / closed group
Flexible reliability
 0-reliable, 1-reliable, m-out-of-n-reliable, all-reliable
Atomic multicast
Many-to-one communication
Many-to-many communication
 Absolute ordering (global clock)
 Consistent ordering (sequencer / ABCAST protocol)
 Causal ordering
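The sequencer-based consistent ordering mentioned above can be sketched as follows: every message is stamped with a global sequence number by one sequencer process, and each receiver delivers strictly in stamp order, buffering anything that arrives early. This is a minimal single-process simulation; all class and variable names are illustrative.

```python
# Sketch of sequencer-based consistent (total) ordering: one sequencer
# stamps every multicast message; receivers deliver in stamp order.
class Sequencer:
    def __init__(self):
        self.next_seq = 0

    def stamp(self, msg):
        stamped = (self.next_seq, msg)   # attach the global sequence number
        self.next_seq += 1
        return stamped

class Receiver:
    def __init__(self):
        self.expected = 0
        self.pending = {}                # out-of-order buffer: seq -> msg
        self.delivered = []

    def receive(self, seq, msg):
        self.pending[seq] = msg
        while self.expected in self.pending:   # deliver in sequence order
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1

seq = Sequencer()
r1, r2 = Receiver(), Receiver()
m1 = seq.stamp("a")
m2 = seq.stamp("b")
r1.receive(*m2); r1.receive(*m1)   # r1 sees the messages out of order
r2.receive(*m1); r2.receive(*m2)   # r2 sees them in order
print(r1.delivered, r2.delivered)  # ['a', 'b'] ['a', 'b']
```

Both receivers deliver in the same order regardless of arrival order, which is exactly the consistency guarantee; the cost is that the sequencer is a centralized entity (see the scalability caveats earlier).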
Communication Failure Handling
Delivers messages despite
– communication link(s) failures
– process failures
Main kinds of failures to tolerate
– timing (link and process)
– omission (link and process)
– value
 Loss of the request message
 Loss of the response message
 Unsuccessful execution of the request (system crash)
Inter-Process Communication (IPC)
 Two-message IPC (request, reply)
 Three-message reliable IPC (request, reply, ack)
 Four-message reliable IPC (request, ack, reply, ack)
Failure handling
 At-least-once (timeout)
 Idempotency (no side effects no matter how many times performed)
 Non-idempotent (exactly-once semantics)
 • Reply from a cache keyed by a unique request id
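The reply-cache idea for non-idempotent operations can be sketched as below: the server remembers the reply under the request's unique id, so a retransmitted request is answered from the cache instead of being re-executed. The bank-deposit operation and all names are illustrative.

```python
# Sketch of exactly-once semantics for a non-idempotent operation:
# cache the reply by unique request id and replay it on duplicates.
class BankServer:
    def __init__(self, balance=100):
        self.balance = balance
        self.reply_cache = {}                  # request_id -> cached reply

    def deposit(self, request_id, amount):
        if request_id in self.reply_cache:     # retransmitted request
            return self.reply_cache[request_id]
        self.balance += amount                 # executed exactly once
        self.reply_cache[request_id] = self.balance
        return self.balance

s = BankServer()
print(s.deposit("req-1", 50))   # 150
print(s.deposit("req-1", 50))   # 150: replayed from cache, no double deposit
```

Note that a real server must eventually garbage-collect the cache, which is safe once the client acknowledges the reply (the fourth message in four-message reliable IPC).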
Communication: Reliable Delivery
• Omission failure tolerance (degree k)
• Design choices:
a) Error masking (spatial): several (> k) links
b) Error masking (temporal): repeat k+1 times
c) Error recovery: detect the error and recover
Reliable Delivery (cont.)
Error detection and recovery: ACK’s and timeouts
• Positive ACK: sent when a message is received– Timeout on sender without ACK: sender retransmits
• Negative ACK: sent when a message loss detected– Needs sequence #s or time-based reception semantics
• Tradeoffs
– Positive ACKs: usually faster failure detection
– NACKs: fewer messages …
Q: what kinds of situations are good for
– Spatial error masking?
– Temporal error masking?
– Error detection and recovery with positive ACKs?
– Error detection and recovery with NACKs?
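The positive-ACK scheme above can be sketched as a sender that retransmits on every timeout until the message gets through. The lossy channel, seeded RNG, and function names are illustrative stand-ins for a real network and timer.

```python
# Sketch of error recovery with positive ACKs and timeouts:
# retransmit until delivery succeeds (or give up after max_tries).
import random

def lossy_deliver(inbox, msg, loss_prob, rng):
    """Simulated link: True iff the message got through (ACK would follow)."""
    if rng.random() < loss_prob:
        return False                   # message lost -> sender will time out
    inbox.append(msg)
    return True

def reliable_send(msg, inbox, loss_prob=0.5, max_tries=50, rng=None):
    rng = rng or random.Random(42)     # fixed seed for a repeatable demo
    for attempt in range(1, max_tries + 1):
        if lossy_deliver(inbox, msg, loss_prob, rng):
            return attempt             # positive ACK received: done
        # timeout expired without an ACK: loop and retransmit
    raise TimeoutError("no ACK after max_tries retransmissions")

inbox = []
tries = reliable_send("hello", inbox)
print(inbox, tries)
```

If the ACK (rather than the message) is lost, the receiver sees duplicates, which is why retransmission pairs naturally with the sequence numbers or reply caching discussed above.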
Resilience to Sender Failure
• Multicast FT-Communication harder than point-to-point– Basic problem is of failure detection
– Subsets of senders may receive msg, then sender fails
• Solutions depend on the flavor of multicast reliability:
a) Unreliable: no effort to overcome link failures
b) Best-effort: some steps taken to overcome link failures
c) Reliable: participants coordinate to ensure that all or none of correct recipients get it
Coverage
• DS Paradigms– DS & OS’s
– Services and models
– Communication
• Coordination– Distributed ME
– Distributed Coordination
Co-ordination protocols in DOS/DS
• Distributed ME
• Distributed atomicity
• Distributed synchronization & ordering
How do we coordinate the distributed resources for ME, CS access, consistency, etc.?
Co-ordination in Distributed Systems
• Event ordering– centralized system: ~ easy (common clock & memory)
– distributed system: hard (convergent/consistent dist. time)
• Example: the Unix “make” program
– source files/object files; make compiles & links based on the last version
– a.o @99 & a.c @100: re-compile a.c [assuming a common time base]
– “make” in a DS?
[Figure: machine A holds a.o @ 98; machine B, with a slow clock, edits a.c’ @ 97. Since 97 < 98, make wrongly skips recompilation.]
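Lamport logical clocks give exactly the "happened before" ordering the make example needs without synchronized physical clocks: every local event ticks the clock, and a receive advances it past the message's timestamp. This is a minimal sketch; the make-style event names are illustrative.

```python
# Sketch of Lamport logical clocks: causal ordering without a global clock.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                  # local event
        self.time += 1
        return self.time

    def send(self):                  # timestamp an outgoing message
        return self.tick()

    def receive(self, msg_time):     # merge the sender's timestamp on receipt
        self.time = max(self.time, msg_time) + 1
        return self.time

A, B = LamportClock(), LamportClock()
t_edit = B.tick()            # B edits a.c
t_msg  = B.send()            # B ships a.c to A
t_recv = A.receive(t_msg)    # A receives it
t_comp = A.tick()            # A compiles a.o
print(t_edit, t_comp)        # 1 4: the edit provably precedes the compile
```

Because the edit causally precedes the compile through the message, its logical timestamp is guaranteed smaller, regardless of how skewed the machines' physical clocks are.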
Synchronization
Blocking (the send primitive blocks until an acknowledgment is received)
 Timeout
Non-blocking (send copies to a buffer and returns)
 Polling at the receive primitive
 Interrupt
Synchronous (send and receive primitives are both blocking)
Asynchronous
Distributed synchronization with failures?
 DB/control apps where “order” is essential for consistency
ME
• TSL, Semaphores, monitors… (Single OS)
• Do they work in DS given timing delays, ordering issues +++?
Let’s start with TSL for Multiprocessors
• TSL is no longer atomic: it runs over the bus! CPU #1/#2 TSL read/write sequencing?
• Both CPU #1 & #2 think they have CS access: no ME
– Single CPU: disabling interrupts; multiprocessor?
– Is TSL atomic at the distributed/networked level?
The TSL instruction can fail unless bus locking is made part of the TSL op.
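A test-and-set spinlock built on an atomic TSL can be sketched as below. Real hardware must lock the memory bus during the read-modify-write; here a private `threading.Lock` emulates that atomicity, so this illustrates the protocol rather than the hardware.

```python
# Sketch of a TSL (test-and-set) spinlock; bus locking is emulated.
import threading

class TestAndSetLock:
    def __init__(self):
        self._flag = False
        self._atomic = threading.Lock()   # stands in for bus locking

    def _test_and_set(self):
        with self._atomic:                # atomic read-modify-write
            old, self._flag = self._flag, True
            return old

    def acquire(self):
        while self._test_and_set():       # spin while someone holds the lock
            pass

    def release(self):
        with self._atomic:
            self._flag = False

# Two threads increment a shared counter under the spinlock.
lock = TestAndSetLock()
counter = 0

def worker():
    global counter
    for _ in range(10_000):
        lock.acquire()
        counter += 1                      # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 20000: mutual exclusion held
```

Without the atomic read-modify-write, both threads could observe `_flag == False` between the read and the write and both enter the critical section, which is exactly the failure mode described above.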
Progressive TSL – Private Locks per CPU
* possible to make a separate decision each time a locked mutex is encountered
* multiple locks needed to avoid cache thrashing
• A CPU needing a locked mutex just waits for it, either by
– polling continuously,
– polling intermittently,
– or attaching itself to a list of waiting CPUs
Distributed Lock Problems
[Figure: processes p1–p4 send LOCK requests to a lock server; LOCK GRANTED replies interleave with new LOCK requests.]
What happens? Solutions?
Distributed Mutual Exclusion
• Solution #1: build a lock server in a centralized manner (generally simple)
 Works? Under what assumptions?
 What is the state of the lock server? For stateless servers?
• Lock server solution problems
– Server is a single point of failure
– Server is a performance bottleneck
– Failure of the client holding the lock also causes problems: no unlock sent
• Similar to garbage collection problems in DSs … validity conditions, etc.
Distributed Mutual Exclusion (cont.)
• Solution #2: decentralized algorithm
– Replicate the central server’s state on all processes
– Requesting process sends a LOCK message to all others
– Then waits for LOCK_GRANTED from all
– To release the critical region, send UNLOCK to all
Works? Under what assumptions?
Co-ordination in Distributed Systems
1. Given distributed entities, how does one obtain resource “coordination” to result in an agreed action (such as CS access/ME, shared memory “writes”, producer/consumer modes or decisions)?
2. How are distributed tasks/requests “ordered”?
3. Given distributed resources, how do they “all” agree on a “single” course of action?
• Asynchronous coordination: 2PC, leader elections, etc.
• Synchronous coordination: clocks, ordering, serialization, etc.
Consistency & Distributed State Machines
Distributed State Machine
Consensus
Asynch: “Single” Decision - Commit
2PC: Two-Phase Commit Protocol
– coordinator (pre-specified or selected dynamically)
– multiple secondary sites (“cohorts”)
Objective: all nodes agree on and execute a single decision [all act or no action is taken … banking transactions]
Two-Phase Commit (2PC) Protocol

Coordinator actions:
1. send PREPARE to all
....... “bounded waiting” .......
------------------------------------
4. receive OK from all: put COMMIT in the log & send COMMIT to all
4'. receive any ABORT: send ABORT to all
5. ACK from all? DONE

Each cohort's actions:
2. get msg (PREPARE)
3. if ready, send OK (write undo/redo logs); else, send NOT-OK
------------------------------------
4. receive COMMIT: release resources, send ACK
4'. receive ABORT: undo actions
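The message flow above can be sketched as a minimal in-process simulation: the coordinator collects votes in phase one and sends COMMIT only if all cohorts voted OK. This is a hedged sketch with no real network, durable logging, or timeout handling; the class names are illustrative.

```python
# Minimal 2PC sketch: phase 1 gathers votes, phase 2 commits or aborts.
class Cohort:
    def __init__(self, ready=True):
        self.ready = ready            # will this cohort vote OK?
        self.state = "INIT"

    def prepare(self):                # steps 2/3: receive PREPARE, vote
        self.state = "READY" if self.ready else "ABORTED"
        return self.ready             # True = OK, False = NOT-OK

    def commit(self):                 # step 4: receive COMMIT
        self.state = "COMMITTED"

    def abort(self):                  # step 4': receive ABORT
        self.state = "ABORTED"

def two_phase_commit(cohorts):
    votes = [c.prepare() for c in cohorts]   # phase 1: PREPARE to all
    if all(votes):                           # all OK -> COMMIT to all
        for c in cohorts:
            c.commit()
        return "COMMIT"
    for c in cohorts:                        # any NOT-OK -> ABORT to all
        if c.state == "READY":
            c.abort()
    return "ABORT"

print(two_phase_commit([Cohort(), Cohort()]))              # COMMIT
print(two_phase_commit([Cohort(), Cohort(ready=False)]))   # ABORT
```

The all-or-nothing property follows from the unanimous-vote rule; what the sketch deliberately omits is the blocking problem the next slide raises, where cohorts in state READY cannot decide alone if the coordinator crashes.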
Two-phase commit (cont.)
Problem: coordinator failure, after PREPARE & before COMMIT, blocks participants waiting for decision (a)
• Three-phase commit overcomes this (b) … but is slow
– delay the final decision until enough processes “know” which decision will be taken
Q: can this somehow block?
Comments...
• Time lag in making decisions: RT applications??
• Resources stay locked until voting/decision is completed
• Message overhead
• Relies on assumptions of reliable communication
• Possibilities of deadlock/livelock
• Limited fault tolerance (coordinator dependency)
• New coordinator initiation per new request
Distributed ME (all ACK)
• To request CS: send REQ msg. M to ALL; enQ M in local Q
• Upon receiving M from Pi
– if it does not want (or have) CS, send ACK
– if it has CS, enQ request Pi : M
– if it wants CS, enQ/ACK based on the lowest ID (a timestamp would be much nicer, but with no common time base there is no basis for timestamps)
• To release CS: send ACK to all in Q, deQ [diff. from 2PC]
• To enter CS: enter CS when ACK received from all
[Figure: A and C concurrently request the CS with ids 8 and 12; B ACKs both. C also ACKs A’s lower-id request (8 < 12) and queues its own, so A (queue {8,12}) gets all ACKs and enters the CS. On release, A ACKs C, and C ({12}) enters the CS.]
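The all-ACK rule above can be sketched as a deterministic, single-threaded simulation in the Ricart-Agrawala style, with process ids standing in for timestamps. The three-process scenario mirrors the figure; all names are illustrative.

```python
# Simulation of all-ACK distributed ME: defer ACKs to higher-id requests.
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.requesting = False
        self.in_cs = False
        self.deferred = []    # requests enqueued while we hold/want the CS
        self.acks = 0

def request_cs(p, others):
    p.requesting = True
    for q in others:          # send REQ to ALL
        on_request(q, p)

def on_request(q, requester):
    if q.in_cs or (q.requesting and q.pid < requester.pid):
        q.deferred.append(requester)   # lower id wins: defer the ACK
    else:
        requester.acks += 1            # send ACK immediately

def release_cs(p, n_others):
    p.in_cs = False
    for r in p.deferred:               # on release: ACK everyone queued
        r.acks += 1
        r.in_cs = (r.acks == n_others)
    p.deferred.clear()

# A (id 8) and C (id 12) both want the CS; B (id 5) is idle and ACKs both.
A, B, C = Process(8), Process(5), Process(12)
request_cs(A, [B, C])   # C is not requesting yet, so A gets both ACKs
request_cs(C, [A, B])   # A defers C (8 < 12); C gets only B's ACK
A.in_cs = (A.acks == 2)
print(A.in_cs, C.acks)  # True 1: A enters the CS first
release_cs(A, 2)
print(C.in_cs)          # True: C enters after A releases
```

Note the scheme's fragility, flagged earlier for decentralized algorithms: every process must answer, so a single crashed process blocks all future entries.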
DS Solutions: State Machine Replication
State machine replication implements the abstraction of a single reliable server
Efficient State Machines (with Crashes)
• Observation: worst-case failures are the exception
– Replicas often have an a-priori consistent view (w/o coordination)
– There is a correct replica (called leader) known to every replica
“Fast” consensus = short latency
• Observation: latency optimizations have overhead
– Significant latency overhead if assumptions are not met
• Observation: messages & crypto are expensive
Minimal latency + no crypto + trade latency for message complexity
Example: Web Applications
• Ideal setting for applying replication
– Exposed to the Internet, strong reliability requirements
[Figure: a client talks to a web server consisting of a front-end server and application code, backed by a database]
Practical Replication
[Figure: client sends REQUEST to the primary; primary and backups run ORDER, AGREE, and COMMIT phases; the client delivers after f+1 matching REPLIES]
• Seminal work on efficient replication
– Optimal resilience
– Three phases: non-optimal
– O(n²) message complexity: non-optimal
Motivation
[PNUTS , ZooKeeper]
[Cassandra][GFS, Bigtable]
[Dynamo]
15K+ commodity servers
Internet Datacenters
• Large-scale crashes are the common case to handle
• Need high performance for ALMOST ALL requests
– Example: Dynamo’s SLA specifies a worst-case latency for 99.9% of the requests under high load
– ALL = also in the presence of crashes
• Need low replication costs
– 100s to 1,000s of replicated services
– Additional replication costs are multiplied over the number of services
– Diagnosis, repair & re-configurations
– Speed with unresponsive replicas, e.g., WAN replication
• Replicas can be located at geographically remote sites
• Some sites can become temporarily unreachable
Large Scale DS Goals
• Consistency (Safety)
– Linearizability
• Availability (Liveness)
– Wait-freedom
• Performance
– Latency, throughput, ...
Despite
– Failures
– Concurrency
– Asynchrony
Challenges
• Crash failures– Detectable
– Very popular
• Byzantine failures– Nodes under adversarial control
– Worst-case needs...but costly (# replicas, latency, complexity)
• Asynchronous communication– Reflects real networks (e.g. WAN)
Efficient distributed solutions
• Main efficiency metrics
– Resilience: # of replicas
– Latency: # of communication steps
– Crypto: use of signatures
– Message complexity: # of messages
[Figure: a client contacts the leader replica, which coordinates the other replicas; e.g., 3t+1 replicas]
Advocated Solutions
• Key advocated abstractions:
– Consensus: State Machine Replication (SMR)
– Distributed Storage: Reliable Shared Memory
[Figure: clients send requests to a service; an unreplicated service may give no reply or a “bad” reply, while the replicated service replies correctly]
Practical implementations of these abstractions?
Distributed Storage
[Figure: previously, clients sent requests to a single storage server, risking a “bad” reply or no reply; nowadays, clients talk to distributed storage and receive a reply with a certificate]
Storage is a state machine with operations read and write (SWMR/MWMR …)