
Distributed Systems: scalability and high availability

Date post: 15-Jan-2015
Upload: renato-lucindo
Description:
QconSP 2010
Transcript
Page 1: Distributed Systems: scalability and high availability

Distributed Systems

scalability and high availability

Renato Lucindo - lucindo.github.com - @rlucindo

Page 2: Distributed Systems: scalability and high availability

Renato Lucindo

Call me Lucindo (or Linus)
2002 - Bachelor Computer Science
2007 - M.Sc. Computer Science (Combinatorial Optimization)
7+ years developing Distributed Systems

My default answer: "I don't know."

Page 3: Distributed Systems: scalability and high availability

Agenda

Scalability

High Availability

Problems

Tips and Tricks

Learning More

Page 4: Distributed Systems: scalability and high availability

Distributed Systems

Multiple computers that interact with each other over a network to achieve a common goal

Purpose
  Scalability
  High availability

source: http://www.cnds.jhu.edu/

Page 5: Distributed Systems: scalability and high availability

Scalability

A system's ability to gracefully handle a growing amount of work

Scale up (vertical)
  Add resources to a single node
  Improve existing code to handle more work

Scale out (horizontal)
  Add more nodes to a system
  Linear (or better) scalability

Page 6: Distributed Systems: scalability and high availability

Scalability - Vertical

Add: CPU, Memory, Disks (bigger box)

Handling more simultaneous:
  Connections
  Operations
  Users

Choose a good I/O and concurrency model
  Non-blocking I/O
  Asynchronous I/O
  Threads (single, pool, per-connection)
  Event handling patterns (Reactor, Proactor, ...)

Memory model?
  STM

Page 7: Distributed Systems: scalability and high availability

Scalability - Vertical

Careful with numbers
  Requests per second
  # of connections
  Simultaneous operations

Event handling
  Think front-end
  Slow connections/clients
  It's slower than other options

In doubt, go async

Back-end
  Thread pool (thread per-connection)
  No events
  Process per-core
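The advice to use event handling for front-ends with many slow clients can be illustrated with a minimal sketch. It assumes Python's asyncio as the event loop; `handle_client`, `serve_all`, `run`, and the simulated delay are hypothetical stand-ins for real network I/O, not part of the original talk.

```python
import asyncio
import time


async def handle_client(client_id, delay):
    # Simulates a slow client: awaiting hands control back to the event
    # loop, so other clients are served while this one is idle.
    await asyncio.sleep(delay)
    return client_id


async def serve_all(n_clients, delay):
    # One thread, one event loop, n_clients concurrent "connections".
    tasks = [handle_client(i, delay) for i in range(n_clients)]
    return await asyncio.gather(*tasks)


def run(n_clients=100, delay=0.05):
    # Returns the per-client results and the wall-clock time taken.
    start = time.monotonic()
    results = asyncio.run(serve_all(n_clients, delay))
    return results, time.monotonic() - start
```

Served one after another, 100 clients with a 0.05 s delay would take about 5 s; on the event loop the waits overlap, so the total stays close to a single delay.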

Page 8: Distributed Systems: scalability and high availability

Scalability - Horizontal

Add nodes to handle more work

Front-end
  Straightforward
  Stateless

Back-end
  Master/Slave(s)
  Partitioning
    DHT
    Volatile Index

Page 9: Distributed Systems: scalability and high availability

Scalability - Horizontal

Master/Slave
  Write on single Master
  Read on Slaves (one or more)
  Scales reads
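A minimal sketch of the read/write split described on this slide, using in-memory dictionaries in place of real database nodes. The `Node` and `MasterSlaveStore` names are hypothetical, and replication is done synchronously here only to keep the example short.

```python
import itertools


class Node:
    """In-memory stand-in for a database node (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class MasterSlaveStore:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        # Round-robin iterator spreads reads across the slaves.
        self._next_slave = itertools.cycle(self.slaves)

    def write(self, key, value):
        # All writes go to the single master...
        self.master.data[key] = value
        # ...and are then copied to every slave. Real systems usually
        # replicate asynchronously, so slaves may lag behind the master.
        for slave in self.slaves:
            slave.data[key] = value

    def read(self, key):
        # Reads are served by slaves, so adding slaves adds read capacity.
        slave = next(self._next_slave)
        return slave.name, slave.data.get(key)
```

Adding slaves increases read capacity only; every write still passes through the one master.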

Page 10: Distributed Systems: scalability and high availability

Scalability - Horizontal

Partitioning (Sharding)
  Distribute data across nodes
  Generally involves data de-normalization

Where is some specific data?
  Master Index
  Hash (DHT, Consistent Hashing)
  Volatile Index

Joins done at the application level
NoSQL friendly
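One way to answer "where is some specific data?" by hashing is a consistent hash ring. The sketch below is a minimal illustration, not a production implementation: the `ConsistentHashRing` class, the choice of MD5, and the default of 100 virtual points per node are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value):
    # MD5 is used only to spread keys evenly, not for security.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual points per node, smooths the spread
        self._hashes = []     # sorted positions on the ring
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            h = _hash(f"{node}#{i}")
            bisect.insort(self._hashes, h)
            self._owner[h] = node

    def remove_node(self, node):
        self._hashes = [h for h in self._hashes if self._owner[h] != node]
        self._owner = {h: n for h, n in self._owner.items() if n != node}

    def get_node(self, key):
        if not self._hashes:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._hashes, _hash(key)) % len(self._hashes)
        return self._owner[self._hashes[idx]]
```

When a node joins, only the keys that now fall just before one of its ring positions change owner; every other key stays where it was. A plain `hash(key) % n` scheme would instead remap most keys.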

Page 11: Distributed Systems: scalability and high availability

Scalability - Horizontal

Volatile Index: build and maintain data index as cached information (all clients)
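A minimal sketch of the volatile-index idea, assuming each client keeps a key-to-node map that is treated purely as a hint and rebuilt on demand. The `Cluster` and `VolatileIndexClient` classes, and the "ask every node" fallback, are hypothetical and chosen only for illustration.

```python
class Cluster:
    """Hypothetical cluster: maps node name -> dict of stored keys."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.lookups = 0  # counts expensive "where is this key?" scans

    def locate(self, key):
        # Expensive fallback: ask every node whether it holds the key.
        self.lookups += 1
        for name, data in self.nodes.items():
            if key in data:
                return name
        return None


class VolatileIndexClient:
    def __init__(self, cluster):
        self.cluster = cluster
        self.index = {}  # cached key -> node name; may be stale or empty

    def get(self, key):
        node = self.index.get(key)
        # The index is only a hint: verify it, and rebuild the entry if wrong.
        if node is None or key not in self.cluster.nodes.get(node, {}):
            node = self.cluster.locate(key)
            if node is None:
                self.index.pop(key, None)
                raise KeyError(key)
            self.index[key] = node
        return self.cluster.nodes[node][key]
```

Because the index is only cached information, losing it costs extra lookups but never correctness: the client works from an empty index and repairs stale entries as it goes.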

Page 12: Distributed Systems: scalability and high availability

High Availability

"Processes, as well as people, die"

Handle hardware and software failures

Eliminate single point of failure
  Redundancy
  Failover
  Replicas

Page 13: Distributed Systems: scalability and high availability

High Availability - Failover/Redundancy

Page 14: Distributed Systems: scalability and high availability

High Availability - Replicas

Two or more copies of the same data

Replica granularity
  From node replica to "row" replica

Load balancing
Write concurrency
Replica updates
Key for high availability and root of several problems

Page 15: Distributed Systems: scalability and high availability

Problems

Page 16: Distributed Systems: scalability and high availability

Problems - CAP Theorem

Page 17: Distributed Systems: scalability and high availability

Problems - CAP Theorem

Consistency: all operations (reads/writes) yield a globally consistent state

Availability: all requests (on non-failed servers) must have a response

Partition Tolerance: nodes may not be able to communicate with each other.

Pick Two

Page 18: Distributed Systems: scalability and high availability

Problems - CAP Theorem

C + A: network problems might stop the system

Examples:
  Oracle RAC, IBM DB2 Parallel
  RDBMS (Master/Slave)
  Google File System
  HDFS (Hadoop)

Page 19: Distributed Systems: scalability and high availability

Problems - CAP Theorem

C + P: clients can't always perform operations

Examples:
  Distributed lock systems: Chubby, ZooKeeper
  Paxos protocol (consensus)
  BigTable, HBase
  Hypertable
  MongoDB

Page 20: Distributed Systems: scalability and high availability

Problems - CAP Theorem

A + P: clients may read inconsistent (old or undone) data

Examples:
  Amazon Dynamo
  Cassandra
  Voldemort
  CouchDB
  Riak
  Caches

Page 21: Distributed Systems: scalability and high availability

Problem with CAP Theorem

In practice, C + A and C + P systems are the same.
  C + A: not tolerant of network partitions
  C + P: not available when a network partition occurs

Big problem: network partition
  Not so big (how often does it happen?)

Pick two
  Availability
  Consistency

The forgotten: Latency
  Or, how long does the system wait before considering the network partitioned?

Page 22: Distributed Systems: scalability and high availability

Problems - Real World

Every component may fail:
  Network failure
  Hardware failure
  Electricity
  Natural disasters
  Code failure

Page 23: Distributed Systems: scalability and high availability

Tips & Tricks

Page 24: Distributed Systems: scalability and high availability

Tips & Tricks - Pyramid

Capacity (connections, operations, ...) Pyramid

Page 25: Distributed Systems: scalability and high availability

Tips & Tricks - Reply Fast

FAIL Fast
Break complex requests into smaller ones
Use timeouts
No transactions
Be aware that a single slow operation or component can generate contention
Self-denial attack
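The "use timeouts" and "fail fast" advice can be sketched as a wrapper that stops waiting for a slow component after a deadline. This is a minimal illustration under stated assumptions: `call_with_timeout` and `slow_backend` are hypothetical names, and a thread pool is used only because it is the simplest way to bound a blocking call in Python.

```python
import concurrent.futures
import time

# Shared worker pool; the size of 4 is an arbitrary choice for the example.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout, *args):
    """Run fn(*args) but give up waiting after `timeout` seconds."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # Fail fast: report the failure to the caller now instead of
        # making it wait. The worker thread still runs to completion.
        future.cancel()
        raise TimeoutError(f"no reply within {timeout}s") from None


def slow_backend(delay):
    # Stand-in for a slow operation or component.
    time.sleep(delay)
    return "done"
```

The caller gets an answer, success or failure, within a bounded time, which keeps a single slow component from holding up everything queued behind it.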

Page 26: Distributed Systems: scalability and high availability

Tips & Tricks - Cache

Cache: component location, data, DNS lookups, previous requests, etc.
Use negative cache for failed requests (low expiration)
Don't rely on cache
  Your system must work with no cache
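A minimal sketch of a cache with a negative cache for failed requests, assuming failures are stored with a much shorter expiration than successes. The `Cache` class, its `ttl` and `negative_ttl` parameters, and the injectable `clock` are hypothetical choices made for the example.

```python
import time


class Cache:
    """Caches successes and failures; failures expire much sooner."""

    def __init__(self, fetch, ttl=60.0, negative_ttl=2.0, clock=time.monotonic):
        self.fetch = fetch                # function that does the real lookup
        self.ttl = ttl                    # lifetime of a successful result
        self.negative_ttl = negative_ttl  # lifetime of a cached failure
        self.clock = clock
        self._entries = {}  # key -> (expires_at, ok, value_or_exception)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self.clock():
            _, ok, payload = entry
            if ok:
                return payload
            raise payload  # negative hit: fail without calling the backend
        try:
            value = self.fetch(key)
        except Exception as exc:
            self._entries[key] = (self.clock() + self.negative_ttl, False, exc)
            raise
        self._entries[key] = (self.clock() + self.ttl, True, value)
        return value
```

The short negative expiration keeps a failing backend from being hammered with repeated requests while still letting it recover quickly. If the cache is empty, every call simply goes to `fetch`, so the system still works with no cache.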

Page 27: Distributed Systems: scalability and high availability

Tips & Tricks - Queues

Easy way to add asynchronous processing and decouple your system.
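A minimal sketch of a queue decoupling a producer from a consumer, using Python's standard `queue` and `threading` modules inside a single process. The `start_worker` helper and the squaring "job" are hypothetical; a real deployment would more likely use a message broker between separate components.

```python
import queue
import threading


def start_worker(jobs, results):
    """Consume jobs in the background until a None sentinel arrives."""

    def loop():
        while True:
            job = jobs.get()
            if job is None:
                jobs.task_done()
                break
            results.append(job * job)  # stand-in for slow processing
            jobs.task_done()

    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    return worker
```

The producer only calls `jobs.put(...)` and returns immediately; it does not need to know how fast the worker is, or how many workers there are.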

Page 28: Distributed Systems: scalability and high availability

Tips & Tricks - DNS

Page 29: Distributed Systems: scalability and high availability

Tips & Tricks - Logs

Log everything
Use several log levels

On every log message
  User
  Request host
  Component involved
  Version
  Filename and line

If a log level is not enabled, do not process the log message
Avoid lookup calls (gettimeofday)
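The rule about not processing messages for disabled log levels can be sketched with Python's standard `logging` module. The logger name, the `expensive_dump` helper, the `dump_calls` list, and the version string are hypothetical; `isEnabledFor` and the `%(filename)s` and `%(lineno)d` format fields are part of the standard library.

```python
import logging

# Format that records the file name and line number of every message.
FORMAT = "%(asctime)s %(levelname)s %(filename)s:%(lineno)d %(message)s"

logger = logging.getLogger("frontend")  # hypothetical component name
dump_calls = []  # records each time the expensive formatting actually runs


def expensive_dump(request):
    # Stand-in for costly formatting work (serialization, lookups, ...).
    dump_calls.append(1)
    return ", ".join(f"{k}={v}" for k, v in sorted(request.items()))


def handle(request):
    # Guard: skip building the message entirely when DEBUG is disabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("request dump: %s", expensive_dump(request))
    # %-style arguments are only formatted if the record is actually emitted.
    logger.info("user=%s host=%s component=%s version=%s",
                request["user"], request["host"], "frontend", "1.0")
```

Passing `FORMAT` to `logging.basicConfig(format=FORMAT)` would attach the file name and line to every message without any per-call work in application code.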

Page 30: Distributed Systems: scalability and high availability

Tips & Tricks - Domino Effect

Make sure your load balancer won't overload components

Use smart algorithms
  Load Balance
  Resource Allocation
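One possible reading of "smart algorithms" is a balancer that tracks in-flight requests and refuses new work once every backend is at its limit, so that one failure does not cascade. The sketch below is an assumption-laden illustration: `LeastLoadedBalancer`, `Overloaded`, and the per-backend limits are hypothetical.

```python
class Overloaded(Exception):
    """Raised when every backend is already at its limit."""


class LeastLoadedBalancer:
    def __init__(self, limits):
        self.limits = dict(limits)  # backend -> max in-flight requests
        self.in_flight = {b: 0 for b in self.limits}

    def acquire(self):
        # Only consider backends with spare capacity.
        candidates = [b for b in self.limits
                      if self.in_flight[b] < self.limits[b]]
        if not candidates:
            # Shed load instead of pushing a struggling backend over the edge.
            raise Overloaded("all backends at capacity")
        # Pick the backend with the lowest fraction of its capacity in use.
        backend = min(candidates,
                      key=lambda b: self.in_flight[b] / self.limits[b])
        self.in_flight[backend] += 1
        return backend

    def release(self, backend):
        # Call when the request finishes, successfully or not.
        self.in_flight[backend] -= 1
```

Rejecting requests at the balancer is the point of the design: if one backend dies, its traffic is not blindly piled onto the survivors until they fall over too.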

Page 31: Distributed Systems: scalability and high availability

Tips & Tricks - (Zero) Configuration

No configuration files
Use good defaults
Auto-discovery (multicast, gossip, ...)

Make everything configurable
  Administrative command
  No need to stop for changes

Adjust automatically when possible

Page 32: Distributed Systems: scalability and high availability

Tips & Tricks - STOP Test

With your system under load: kill -STOP <component>
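The STOP test can also be scripted. The sketch below assumes a POSIX system and uses Python's `os.kill` with `signal.SIGSTOP` and `signal.SIGCONT`; the `pause`, `resume`, and `spawn_dummy_component` helpers are hypothetical, and the sleeping child process merely stands in for a real component.

```python
import os
import signal
import subprocess
import sys


def pause(pid):
    # Same effect as `kill -STOP <pid>`: the process freezes but keeps its
    # sockets open, so peers see a hung component, not a dead one.
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    # Same effect as `kill -CONT <pid>`.
    os.kill(pid, signal.SIGCONT)


def spawn_dummy_component():
    # Hypothetical stand-in for a real component: a child that just sleeps.
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"])
```

A stopped process is harder on the rest of the system than a killed one: connections stay open and nothing is refused, so only timeouts in the callers will detect it.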

Page 33: Distributed Systems: scalability and high availability

Tips & Tricks - Know your tools

load average (uptime)

stats tools
  vmstat
  iostat
  mpstat
  tcpstat, tcprstat, etc.

tcpdump, nc, netstat

tuning
  /proc/net/*
  ulimit
  sysctl

oprofile
debugging tools (gdb, valgrind)
...

Page 34: Distributed Systems: scalability and high availability

Tips & Tricks - Count

Count everything
  Connections
  Operations
  Failures
  Successes
  Request times (granularity)

Total, average, standard deviation
Monitor counters
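Total, average, and standard deviation can be maintained incrementally, without storing every sample. The sketch below uses Welford's online algorithm, which is one common choice rather than anything prescribed by the slides; the `RequestStats` class name is hypothetical.

```python
import math


class RequestStats:
    """Running total, average and standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self._mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def record(self, value):
        self.count += 1
        self.total += value
        delta = value - self._mean
        self._mean += delta / self.count
        self._m2 += delta * (value - self._mean)

    @property
    def average(self):
        return self._mean

    @property
    def stddev(self):
        # Population standard deviation; 0.0 until there is data.
        return math.sqrt(self._m2 / self.count) if self.count else 0.0
```

Each `record` call is constant time and constant memory, so a counter like this can sit on every request path and be read by monitoring at any moment.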

Page 35: Distributed Systems: scalability and high availability

Tips & Tricks - Stability Patterns

Use Timeouts
Circuit Breaker
Bulkheads
Steady State
Fail Fast
Handshaking
Test Harness
Decoupling Middleware
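Of the patterns listed, the circuit breaker is the easiest to show in a few lines. The sketch below is a simplified illustration, not the implementation from any particular library: the `CircuitBreaker` class, its `max_failures` and `reset_after` parameters, and the injectable `clock` are assumptions.

```python
import time


class CircuitOpen(Exception):
    """Raised when calls are rejected without reaching the backend."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("backend recently failed; not calling it")
            # Half-open: let one trial call through after the cool-down.
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of waiting on a component that is known to be unhealthy, which also gives that component room to recover.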

Page 36: Distributed Systems: scalability and high availability

Tips & Tricks - Don't Panic!

Page 37: Distributed Systems: scalability and high availability

Learning More - Books

TCP/IP Illustrated, Vol. 1: The Protocols

Page 38: Distributed Systems: scalability and high availability

Learning More - Books

Unix Network Programming, Vol. 1: The Sockets Networking API

Page 39: Distributed Systems: scalability and high availability

Learning More - Books

Pattern Oriented Software Architecture, Vol. 2

Page 40: Distributed Systems: scalability and high availability

Learning More - Books

Release It!

Page 41: Distributed Systems: scalability and high availability

Learning More - Papers

The Google File System
Bigtable: A Distributed Storage System for Structured Data
Dynamo: Amazon's Highly Available Key-Value Store
PNUTS: Yahoo!’s Hosted Data Serving Platform
MapReduce: Simplified Data Processing on Large Clusters

Towards robust distributed systems
Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services
BASE: An Acid Alternative
Looking up data in P2P systems

Page 42: Distributed Systems: scalability and high availability

Thanks!!! Questions?

lucindo.github.com - @rlucindo

