Distributed Systems: scalability and high availability

Renato Lucindo - lucindo.github.com - @rlucindo

Renato Lucindo

Call me Lucindo (or Linus)
2002 - Bachelor Computer Science
2007 - M.Sc. Computer Science (Combinatorial Optimization)
7+ years developing Distributed Systems

My default answer: "I don't know."

Agenda

Scalability

High Availability

Problems

Tips and Tricks

Learning More

Distributed Systems

Multiple computers that interact with each other over a network to achieve a common goal

Purpose
  Scalability
  High availability

source: http://www.cnds.jhu.edu/

Scalability

A system's ability to gracefully handle a growing amount of work

Scale up (vertical)
  Add resources to a single node
  Improve existing code to handle more work

Scale out (horizontal)
  Add more nodes to a system
  Linear (or better) scalability

Scalability - Vertical

Add: CPU, memory, disks (bigger box)

Handling more simultaneous:
  Connections
  Operations
  Users

Choose a good I/O and concurrency model
  Non-blocking I/O
  Asynchronous I/O
  Threads (single, pool, per-connection)
  Event handling patterns (Reactor, Proactor, ...)

Memory model?
  STM (software transactional memory)
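To make the Reactor item concrete, here is a minimal sketch of one non-blocking, event-driven iteration using Python's standard `selectors` module. The names `run_once`, `echo_handler` and `demo` are hypothetical, and a local socket pair stands in for a real client connection.

```python
import selectors
import socket


def run_once(sel: selectors.BaseSelector, timeout: float = 1.0) -> int:
    """One reactor iteration: dispatch every ready socket to its handler."""
    handled = 0
    for key, _events in sel.select(timeout):
        handler = key.data  # callback stored when the socket was registered
        handler(key.fileobj)
        handled += 1
    return handled


def echo_handler(conn: socket.socket) -> None:
    """Read whatever is available and send it straight back."""
    data = conn.recv(4096)
    if data:
        conn.sendall(data)


def demo() -> bytes:
    """Wire a connected socket pair through the reactor and echo one message."""
    client, server = socket.socketpair()
    server.setblocking(False)
    sel = selectors.DefaultSelector()
    sel.register(server, selectors.EVENT_READ, echo_handler)
    try:
        client.sendall(b"ping")
        run_once(sel)
        return client.recv(4096)
    finally:
        sel.unregister(server)
        sel.close()
        client.close()
        server.close()
```

A real server would call `run_once` in a loop and register one handler per accepted connection. The point is only that a single thread waits on many sockets at once.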

Scalability - Vertical

Careful with numbers
  Requests per second
  # of connections
  Simultaneous operations

Event handling
  Think front-end
  Slow connections/clients
  It's slower than other options

In doubt, go async

Back-end
  Thread pool (thread per-connection)
  No events
  Process per-core

Scalability - Horizontal

Add nodes to handle more work

Front-end
  Straightforward
  Stateless

Back-end
  Master/Slave(s)
  Partitioning
    DHT
    Volatile Index


Master/Slave
  Write on a single Master
  Read on Slaves (one or more)
  Scales reads


Partitioning (Sharding)
  Distribute data across nodes
  Generally involves data de-normalization

Where is some specific data?
  Master Index
  Hash (DHT, Consistent Hashing)
  Volatile Index

Joins done at the application level
NoSQL friendly


Volatile Index: build and maintain the data index as cached information (on all clients)
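The "Hash (DHT, Consistent Hashing)" lookup option can be sketched as a small consistent-hash ring. This is an illustrative sketch rather than any particular product's implementation. The class name `ConsistentHashRing`, the virtual-node count and the use of SHA-256 are all assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to nodes so that adding or removing a node moves few keys."""

    def __init__(self, nodes, vnodes: int = 64):
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node, vnodes)

    @staticmethod
    def _position(key: str) -> int:
        # Any well-mixed hash works; SHA-256 is an arbitrary choice here.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, node: str, vnodes: int = 64) -> None:
        # Several virtual points per node smooth out the key distribution.
        for i in range(vnodes):
            bisect.insort(self._ring, (self._position(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring has no nodes")
        # First ring point clockwise from the key, wrapping around at the end.
        idx = bisect.bisect_right(self._ring, (self._position(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

When a node is removed, only the keys it owned are reassigned. Keys owned by the other nodes stay where they were, which is what makes this attractive for partitioned back-ends.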

High Availability

"Processes, as well as people, die"

Handle hardware and software failures

Eliminate single points of failure
  Redundancy
  Failover
  Replicas

High Availability - Failover/Redundancy

High Availability - Replicas

Two or more copies of the same data

Replica granularity
  From node replica to "row" replica

Load balancing
Write concurrency
Replica updates
Key for high availability and root of several problems

Problems

Problems - CAP Theorem


Consistency: all operations (reads/writes) yield a globally consistent state

Availability: all requests (on non-failed servers) must have a response

Partition Tolerance: the system keeps working even when nodes cannot communicate with each other

Pick Two


C + A: network problems might stop the system

Examples:
  Oracle RAC, IBM DB2 Parallel
  RDBMS (Master/Slave)
  Google File System
  HDFS (Hadoop)


C + P: clients can't always perform operations

Examples:
  Distributed lock systems: Chubby, ZooKeeper
  Paxos protocol (consensus)
  BigTable, HBase
  Hypertable
  MongoDB


A + P: clients may read inconsistent (old or undone) data

Examples:
  Amazon Dynamo
  Cassandra
  Voldemort
  CouchDB
  Riak
  Caches

Problem with CAP Theorem

In practice, C + A and C + P systems are the same.
  C + A: not tolerant of network partitions
  C + P: not available when a network partition occurs

Big problem: network partition
  Not so big (how often does it happen?)

Pick two
  Availability
  Consistency

The forgotten: Latency
  Or: how long does the system wait before considering the network partitioned?

Problems - Real World

Every component may fail:
  Network failure
  Hardware failure
  Electricity
  Natural disasters
  Code failure

Tips & Tricks

Tips & Tricks - Pyramid

Capacity (connections, operations, ...) Pyramid

Tips & Tricks - Reply Fast

Fail fast
Break complex requests into smaller ones
Use timeouts
No transactions
Be aware that a single slow operation or component can generate contention
Self-denial attack
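As a rough sketch of "use timeouts" combined with "fail fast", the helper below bounds how long a caller waits for a slow dependency. The name `call_with_timeout` and the pool size are assumptions for illustration. Python cannot interrupt the worker thread, so the slow call still runs to completion in the background.

```python
import concurrent.futures
import time

# Shared pool for calls to slow dependencies (size chosen arbitrarily).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s: float, *args):
    """Return fn(*args), or raise TimeoutError if no reply arrives in time."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort: this only cancels work that has not started yet.
        future.cancel()
        raise TimeoutError(f"no reply within {timeout_s}s") from None
```

The caller gets an answer or an error within `timeout_s` either way, so one slow component does not hold every request hostage.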

Tips & Tricks - Cache

Cache: component location, data, DNS lookups, previous requests, etc.
Use negative cache for failed requests (low expiration)
Don't rely on cache
Your system must work with no cache
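A negative cache with a low expiration might look like the sketch below. The class name `TTLCache`, its parameters and the injected `clock` are hypothetical. The `loader` callback is the uncached path, so the system still works when the cache is empty.

```python
import time


class TTLCache:
    """Caches successes for ttl_s and failures for a much shorter negative_ttl_s."""

    def __init__(self, ttl_s=300.0, negative_ttl_s=5.0, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._negative_ttl_s = negative_ttl_s
        self._clock = clock
        self._entries = {}  # key -> (expires_at, succeeded, value_or_exception)

    def get(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            _expires_at, succeeded, payload = entry
            if succeeded:
                return payload
            raise payload  # remembered failure: do not hit the backend again
        try:
            value = loader(key)
        except Exception as exc:
            # Negative entry: short-lived, so recovery is noticed quickly.
            self._entries[key] = (now + self._negative_ttl_s, False, exc)
            raise
        self._entries[key] = (now + self._ttl_s, True, value)
        return value
```

Remembering failures briefly keeps a broken dependency from being hammered by retries, while the short expiry means the cache stops lying soon after the dependency comes back.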

Tips & Tricks - Queues

An easy way to add asynchronous processing and decouple your system.
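A minimal in-process sketch of the idea, using only the standard `queue` and `threading` modules. The function name `start_worker` and the `None` shutdown sentinel are assumptions. A production system would more likely use a message broker between processes.

```python
import queue
import threading


def start_worker(jobs: queue.Queue, handle) -> threading.Thread:
    """Start a background thread that applies handle() to each queued job."""

    def run() -> None:
        while True:
            job = jobs.get()
            try:
                if job is None:  # sentinel: stop the worker
                    return
                handle(job)
            finally:
                jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker
```

Producers only call `jobs.put(...)` and return immediately. They never wait for, or even know about, the component that does the work. A bounded `queue.Queue(maxsize=...)` adds back-pressure when the consumer falls behind.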

Tips & Tricks - DNS

Tips & Tricks - Logs

Log everything
Use several log levels

On every log message
  User
  Request host
  Component involved
  Version
  Filename and line

If a log level is not enabled, do not process the log message
Avoid lookup calls (gettimeofday)
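With Python's standard `logging` module, skipping the work for disabled levels could look like this sketch. The logger name `component` and the helper `expensive_dump` are hypothetical.

```python
import logging

log = logging.getLogger("component")


def expensive_dump(state: dict) -> str:
    """Stand-in for a costly serialization we only want at DEBUG level."""
    return ", ".join(f"{key}={value}" for key, value in sorted(state.items()))


def handle(state: dict) -> None:
    # %-style arguments are only formatted if the record is actually emitted.
    log.info("request user=%s host=%s", state.get("user"), state.get("host"))
    # Guard explicitly when building the arguments is itself expensive.
    if log.isEnabledFor(logging.DEBUG):
        log.debug("full state: %s", expensive_dump(state))
```

At INFO level, `expensive_dump` is never called, so verbose diagnostics cost almost nothing until they are switched on.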

Tips & Tricks - Domino Effect

Make sure your load balancer won't overload components

Use smart algorithms
  Load balancing
  Resource allocation
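One possible "smart" policy, sketched under assumptions: send each request to the least-loaded backend relative to its capacity, and refuse new work once every backend is full rather than pushing one over the edge. The class name `LeastLoadedBalancer` and the capacity map are hypothetical.

```python
class LeastLoadedBalancer:
    """Picks the backend with the lowest in-flight/capacity ratio."""

    def __init__(self, capacity: dict):
        self._capacity = dict(capacity)  # backend name -> max concurrent requests
        self._in_flight = {name: 0 for name in capacity}

    def acquire(self):
        """Return a backend name, or None if every backend is at capacity."""
        candidates = [
            name for name in self._capacity
            if self._in_flight[name] < self._capacity[name]
        ]
        if not candidates:
            return None  # shed load instead of overloading a component
        choice = min(
            candidates,
            key=lambda name: self._in_flight[name] / self._capacity[name],
        )
        self._in_flight[choice] += 1
        return choice

    def release(self, name: str) -> None:
        """Call when the request sent to this backend has finished."""
        self._in_flight[name] -= 1
```

Returning `None` lets the caller fail fast. Without the cap, traffic from a failed node gets redistributed onto the survivors and can knock them over one by one.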

Tips & Tricks - (Zero) Configuration

No configuration files
Use good defaults
Auto-discovery (multicast, gossip, ...)

Make everything configurable
  Administrative command
  No need to stop for changes

Self-adjust automatically when possible

Tips & Tricks - STOP Test

With your system under load: kill -STOP <component>

Tips & Tricks - Know your tools

load average (uptime)

stats tools
  vmstat
  iostat
  mpstat
  tcpstat, tcprstat, etc.

tcpdump, nc, netstat

tuning
  /proc/net/*
  ulimit
  sysctl

oprofile
debugging tools (gdb, valgrind)
...

Tips & Tricks - Count

Count everything
  Connections
  Operations
  Failures
  Successes
  Request times (granularity)

Total, average, standard deviation
Monitor counters
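Total, average and standard deviation can be kept per counter without storing every sample. The sketch below uses Welford's online algorithm. The class name `RunningStats` is hypothetical, and the population standard deviation is an assumed choice.

```python
import math


class RunningStats:
    """Running count, total, mean and standard deviation in O(1) memory."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0
        self._mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        delta = value - self._mean
        self._mean += delta / self.count
        self._m2 += delta * (value - self._mean)

    @property
    def mean(self) -> float:
        return self._mean

    @property
    def stddev(self) -> float:
        """Population standard deviation of the values seen so far."""
        return math.sqrt(self._m2 / self.count) if self.count else 0.0
```

One such object per operation type (for example, request times per endpoint) is cheap enough to update on every request and export to a monitoring system.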

Tips & Tricks - Stability Patterns

Use Timeouts
Circuit Breaker
Bulkheads
Steady State
Fail Fast
Handshaking
Test Harness
Decoupling Middleware
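As an illustration of one pattern from this list, here is a minimal Circuit Breaker sketch. The names `CircuitBreaker` and `CircuitOpenError`, the thresholds and the injected `clock` are all assumptions. It is not thread-safe and omits many refinements a production breaker would need.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self._max_failures = max_failures
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn(*args)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate error instead of waiting on a dependency that is known to be failing. This gives that dependency room to recover.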

Tips & Tricks - Don't Panic!

Learning More - Books

TCP/IP Illustrated, Vol. 1: The Protocols


Unix Network Programming, Vol. 1: The Sockets Networking API


Pattern Oriented Software Architecture, Vol. 2


Release It!

Learning More - Papers

The Google File System
Bigtable: A Distributed Storage System for Structured Data
Dynamo: Amazon's Highly Available Key-Value Store
PNUTS: Yahoo!’s Hosted Data Serving Platform
MapReduce: Simplified Data Processing on Large Clusters

Towards robust distributed systems
Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services
BASE: An Acid Alternative
Looking up data in P2P systems

Thanks!!! Questions?

lucindo.github.com - @rlucindo

Date post:	15-Jan-2015
Category:	Technology
Upload:	renato-lucindo