Date posted: 14-Jul-2015
Category: Software
Uploaded by: ahmed-soliman
Distributed Systems and why you should care!
Ahmed Soliman (أحمد سليمان)
Why the hassle?
Wrestling With Capacity Constraints
Cannot add more (CPU Cores, RAM, Disk, Network Bandwidth, etc.) to a single box
Wrestling With Availability Constraints
Dealing with failures of (Network, Power, Hard-drives, Faulty Ram, etc.)
Wrestling With Latency/Performance Constraints
Not the entire world is wired with optical fiber.
Sometimes even the speed of light is not fast enough: it takes about 39 ms for light to travel from Cairo to Dallas.
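The Cairo-to-Dallas figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes approximate city-centre coordinates, a spherical Earth, and light travelling in a vacuum along the great circle; none of these inputs come from the slides. Under those assumptions the one-way time lands in the high 30s of milliseconds, the same ballpark as the number quoted above.

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
EARTH_RADIUS_KM = 6371.0           # mean Earth radius (spherical approximation)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-centre coordinates (assumed for illustration).
CAIRO = (30.0444, 31.2357)
DALLAS = (32.7767, -96.7970)

distance_km = great_circle_km(*CAIRO, *DALLAS)
one_way_ms = distance_km / SPEED_OF_LIGHT_KM_S * 1000
print(f"{distance_km:.0f} km, {one_way_ms:.1f} ms one way in vacuum")
```

Real links are slower: light in glass fiber travels at roughly two thirds of its vacuum speed, cables do not follow the great circle, and a request needs a round trip.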
Welcome to Distributed Systems
The world is a mesh of very large and very small computers with an utterly sophisticated networking medium in between.
It’s everywhere!
The web is currently the world’s largest distributed system, mobile is getting even larger, and wearables will make web scale look like an ant next to an elephant!
Computing was never that sophisticated
and it’s getting even more sophisticated every single day.
Some Challenges
• Heterogeneity and Abstractions
• Transparency and Abstractions
• Concurrency and Coordination
• Scalability
• Resilience to Failures
• Security
Heterogeneity
• Different systems will be written in different languages and will run on different operating systems, computer architectures, and networks with different characteristics.
• Clear boundaries and a set of abstractions must be defined. Think of Micro-service/SOA architecture, RESTful APIs, Thrift/Protobuf/Avro/msgpack/etc. for efficient binary message serialization across systems
Heterogeneity and Abstractions
• Abstractions through interfaces exposed through the network (think of sockets as low-level abstraction)
• Higher-level abstractions: data formats (JSON/Thrift/Protobuf/Avro/etc.) and protocols like HTTP, XML-RPC, Thrift-RPC, etc.
• Effective separation offers great flexibility for large-scale development teams and allows the use of the right tool for the right job
• Generates a new set of challenges: think of protocol versioning and cascading failures that are hard to trace; tracking down congestion or malfunctions becomes an order of magnitude harder
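Protocol versioning is one of the challenges that can be sketched in a few lines. The example below is a minimal illustration using JSON; the `"v"` field, the two message versions, and the `encode_user`/`decode_user` names are all hypothetical and not taken from any particular system.

```python
import json

CURRENT_VERSION = 2  # hypothetical wire-format version


def encode_user(name, email):
    """Serialize a user message in the (hypothetical) version-2 format."""
    return json.dumps({"v": CURRENT_VERSION, "name": name, "email": email})


def decode_user(raw):
    """Decode a message of any known version into the current in-memory shape."""
    msg = json.loads(raw)
    version = msg.get("v", 1)  # assumption: messages without "v" predate versioning
    if version == 1:
        # v1 carried no email field; fill in a default instead of failing.
        return {"name": msg["name"], "email": None}
    if version == 2:
        return {"name": msg["name"], "email": msg["email"]}
    # Unknown versions are rejected loudly rather than half-parsed.
    raise ValueError(f"unsupported message version: {version}")


old = decode_user('{"name": "amira"}')
new = decode_user(encode_user("omar", "omar@example.com"))
```

Keeping old decoders around lets producers and consumers be upgraded independently, which is the flexibility the separation is meant to buy.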
Transparency
• Systems need to know how to reach other systems
• Central Registry, discovery protocols, gossip protocols
• Components living in the same process vs. inter-process communication (mobility)
• shared-memory or IPC/Unix Sockets/TCP Sockets/etc?
• Higher-level abstraction means that there is no conceptual difference between scaling vertically on multicore or horizontally on the cluster
• Think of failures, restarts, and relocation of components, and their effect on consumers.
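A central registry can be sketched as a table of leases that instances must keep renewing. This is a toy, single-process illustration; the `ServiceRegistry` class, the service name, and the addresses are made up for the example, and a real registry would itself need to be replicated.

```python
import time


class ServiceRegistry:
    """Toy in-memory registry: instances register with a TTL and must renew it."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock       # injectable so the example can use a fake clock
        self._instances = {}      # (service, address) -> expiry timestamp

    def register(self, service, address):
        """Register an instance, or renew its lease (heartbeat)."""
        self._instances[(service, address)] = self._clock() + self._ttl

    def lookup(self, service):
        """Return the addresses whose lease has not expired, sorted."""
        now = self._clock()
        return sorted(addr for (svc, addr), expiry in self._instances.items()
                      if svc == service and expiry > now)


# Usage with a fake clock so the outcome does not depend on real time.
fake_now = [0.0]
registry = ServiceRegistry(ttl_seconds=10.0, clock=lambda: fake_now[0])
registry.register("orders", "10.0.0.1:8080")
registry.register("orders", "10.0.0.2:8080")
fake_now[0] = 8.0
registry.register("orders", "10.0.0.2:8080")  # only the second instance heartbeats
fake_now[0] = 12.0
alive = registry.lookup("orders")             # the first instance has expired
```

Consumers that always go through `lookup` do not care whether an instance crashed, restarted, or moved to another host; they only see the addresses that are currently alive.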
Concurrency
• A multi-user system means that users will be competing for system resources.
• Don’t confuse concurrency and parallelism.
• A tricky business: if you think that threading is the best way to handle concurrency, you will pay the cost for that… a lot!
• heisenbugs™ everywhere
• Please welcome: race conditions, deadlocks, locking, barriers, compiler reordering, compiler optimizations, etc.
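The classic race condition is an unprotected read-modify-write on a shared value. The sketch below shows the guarded version: a hypothetical `SafeCounter` whose increment is wrapped in a lock, hammered by several threads. The class and the thread counts are assumptions chosen for illustration.

```python
import threading


class SafeCounter:
    """Counter whose read-modify-write is protected by a mutex."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # "+= 1" is a read, an add, and a write; the lock makes the trio atomic.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def hammer(counter, times):
    for _ in range(times):
        counter.increment()


counter = SafeCounter()
threads = [threading.Thread(target=hammer, args=(counter, 10_000))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = counter.value  # 8 threads x 10,000 increments each
```

Without the lock, two threads can read the same old value and one update is silently lost, which is the kind of bug that tends to vanish the moment you add logging to look for it.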
Concurrency
• Mutable shared state is the root of most evil™
• Low-level abstractions (threading)
• Higher-level abstractions in programming languages (co-routines, async in C#, etc.)
• Actor Model (Erlang OTP, Scala/Java Akka, etc.)
• Concurrent but not Parallel using Async IO (NodeJS)
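"Concurrent but not parallel" can be seen directly with an event loop. The slides name NodeJS; the sketch below uses Python's `asyncio` as an analogous single-threaded event loop, with `asyncio.sleep` standing in for real IO. The worker names and delays are arbitrary.

```python
import asyncio
import threading


async def worker(name, delay, log):
    log.append((name, "start", threading.get_ident()))
    # Awaiting hands control back to the event loop while this task "waits on IO".
    await asyncio.sleep(delay)
    log.append((name, "done", threading.get_ident()))


async def main():
    log = []
    # Both workers are in flight at the same time on a single thread.
    await asyncio.gather(worker("slow", 0.05, log), worker("fast", 0.01, log))
    return log


log = asyncio.run(main())
events = [(name, what) for name, what, _ in log]
thread_ids = {tid for _, _, tid in log}
```

Both tasks start before either finishes, and the fast one completes first, yet every entry in `log` was recorded by the same thread: the tasks interleave, they never run simultaneously.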
Coordination
• Low-level coordination primitives from the operating system like mutexes, read-write locks, reentrant locks, flock, etc.
• Cross-system coordination is sometimes needed
• central registry with atomic semantics
• consensus protocols (paxos, raft, etc.)
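What "atomic semantics" buys you can be shown with a compare-and-set operation used for leader election. This is a single-process toy: `AtomicRegistry` and `try_become_leader` are hypothetical names, and a real deployment would need the registry itself to be replicated, typically with a consensus protocol like the ones named above.

```python
import threading


class AtomicRegistry:
    """Toy stand-in for a central registry that offers compare-and-set."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def compare_and_set(self, key, expected, new):
        """Set key to new only if its current value equals expected."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True

    def get(self, key):
        with self._lock:
            return self._data.get(key)


def try_become_leader(registry, node_id):
    # Succeeds only while nobody holds the "leader" key.
    return registry.compare_and_set("leader", None, node_id)


registry = AtomicRegistry()
winners = []


def campaign(node_id):
    if try_become_leader(registry, node_id):
        winners.append(node_id)


threads = [threading.Thread(target=campaign, args=(f"node-{i}",))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However the five campaigns interleave, exactly one compare-and-set succeeds, so exactly one node believes it is the leader.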
Coordination
• Classical master-writer, slave-reader clustering is an easy fix but has proven to scale poorly in write-intensive environments (e.g., social networks)
• Sharding/partitioning may help in avoiding coordination altogether, but rebalancing and healing are extra work.
• CRDTs (Conflict-free Replicated Data Types) may be one potential solution too.
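One of the simplest CRDTs is the grow-only counter (G-Counter): each replica increments only its own slot, and replicas merge by taking the element-wise maximum. The sketch below is a minimal in-memory illustration; the replica ids and the `GCounter` class are made up for the example.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        """Fold another replica's state into this one."""
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(3)   # replica a accepts writes without talking to b
b.increment(2)   # and vice versa
a.merge(b)
b.merge(a)
a.merge(b)       # merging the same state again changes nothing
```

Because the merge is commutative, associative, and idempotent, replicas can exchange state in any order and any number of times and still converge, with no coordination on the write path.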
Scalability
• Horizontal Scalability vs Vertical Scalability
• Optimization of per-node capacity (req/s) is a third dimension of your scalability design plan
• Single Points of Failure
• Performance Bottlenecks
• Commutative Operations
• Consistency Levels (Strong, Weak, Eventual)
Resilience
• Ability to sustain high loads without crashing
• Ability to recover from partial crashes
• Failures should be properly reported to the operation team and gracefully delivered to the client
• Avoidance of cascading failures as far as possible
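One common way to limit cascading failures is a circuit breaker: after repeated failures, callers stop hitting the unhealthy dependency and fail fast for a cool-down period. The slides do not name this pattern, so treat the sketch below as one possible approach; `CircuitBreaker`, its thresholds, and the fake clock are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without touching the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self._max_failures = max_failures
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            self._opened_at = None  # cool-down elapsed: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0  # any success resets the failure count
        return result


# Usage with a fake clock and a dependency that is down.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def flaky():
    raise RuntimeError("dependency is down")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")
    except RuntimeError:
        outcomes.append("failed")
```

The first two calls reach the dependency and fail; the third is rejected immediately, so the struggling service gets room to recover and the client gets a quick, reportable error instead of a pile-up of timeouts.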
Security
• Authentication and authorization across different systems
• Central authorization vs. state-less authorization
• data confidentiality across different systems
• JWT (JSON Web Tokens) as one solution to decentralized authenticity verification
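The idea behind JWT-style stateless verification can be sketched with the standard library alone: a token carries its claims plus a signature, and any service holding the key can check it without calling a central authority. The sketch below builds an HS256-style token by hand for illustration only; `sign_token` and `verify_token` are hypothetical helpers, the claims and secret are made up, and expiry checks and header validation are deliberately left out. Production code should use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, secret: bytes) -> dict:
    """Return the claims if the signature matches, else raise ValueError."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode("ascii"),
                        hashlib.sha256).digest()
    # compare_digest avoids leaking how many leading bytes matched.
    if not hmac.compare_digest(_b64url_decode(signature_b64), expected):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(payload_b64))


secret = b"demo-secret-not-for-production"  # assumed shared key
token = sign_token({"sub": "user-42", "role": "admin"}, secret)
claims = verify_token(token, secret)
```

With a shared HMAC secret, every verifier can also mint tokens; signing with an asymmetric key pair instead lets services verify tokens without being able to issue them.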
In Short…
• Distributed Systems are much more fun than you think
• It’s a mixture of science and art
• It’s about making the right tradeoffs
• It has an immediate business impact, so be careful.
Thank You!