CSC 536 Lecture 8
Outline
Reactive Streams: streams, reactive streams, Akka streams
Case study: Google infrastructure (part I)
Reactive Streams
Streams
Stream: a process involving data flow and transformation
Data possibly of unbounded size
Focus on describing the transformation
Examples: bulk data transfer, real-time data sources, batch processing of large data sets, monitoring and analytics
Needed: Asynchrony
For fault tolerance: encapsulation, isolation
For scalability: distribution across nodes, distribution across cores
Problem: Managing data flow across an async boundary
Types of Async Boundaries
between different applications
between network nodes
between CPUs
between threads
between actors
Possible solutions
Traditional way: Synchronous/blocking (possibly remote) method calls
Does not scale
Push way: Asynchronous/non-blocking message passing
Scales! Problem: message buffering and message dropping
Reactive way: non-blocking and non-dropping
Reactive way
View slides 24-55 of http://www.slideshare.net/ktoso/reactive-streams-akka-streams-geecon-prague-2014
Supply and Demand
Data items flow downstream
Demand flows upstream
Data items flow only when there is demand
The recipient is in control of the incoming data rate
Data in flight is bounded by the signaled demand
Dynamic Push-Pull
"Push" behavior when the consumer is faster
"Pull" behavior when the producer is faster
Switches automatically between the two
Batching demand allows batching data
Tailored Flow Control
Splitting the data means merging the demand
Tailored Flow Control
Merging the data means splitting the demand
Reactive Streams
Back-pressured asynchronous stream processing
Asynchronous, non-blocking data flow
Asynchronous, non-blocking demand flow
Goal: minimal coordination and contention
Message passing allows for distribution: across applications, across nodes, across CPUs, across threads, across actors
Reactive Streams Projects
A standard implemented by many libraries
Engineers from Netflix, Oracle, Red Hat, Twitter, Typesafe, ... See http://reactive-streams.org
Reactive Streams
All participants had the same basic problem
All are building tools for their community
A common solution benefits everybody
Interoperability to make the best use of efforts:
minimal interfaces
rigorous specification of semantics
full TCK for verification of implementations
complete freedom for many idiomatic APIs
The underlying (internal) API
trait Publisher[T] {
  def subscribe(sub: Subscriber[T]): Unit
}

trait Subscription {
  def requestMore(n: Int): Unit   // signal demand for n more elements
  def cancel(): Unit              // stop the stream
}

trait Subscriber[T] {
  def onSubscribe(s: Subscription): Unit   // receives the Subscription used to signal demand
  def onNext(elem: T): Unit                // receive one data element
  def onError(thr: Throwable): Unit        // the stream failed
  def onComplete(): Unit                   // the stream finished successfully
}
The Process
Reactive Streams
All calls on a Subscriber must be dispatched asynchronously
All calls on Subscription must not block
Publisher is just there to create Subscriptions
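As a concrete sketch of these rules, here is what a Subscriber that signals bounded demand in batches might look like, written against the simplified traits above (the final org.reactivestreams API uses request(n: Long) instead of requestMore; the class and parameter names here are illustrative):

class PrintingSubscriber[T](batchSize: Int) extends Subscriber[T] {
  private var subscription: Subscription = _
  private var received = 0

  def onSubscribe(s: Subscription): Unit = {
    subscription = s
    s.requestMore(batchSize)                 // signal initial demand
  }

  def onNext(elem: T): Unit = {
    println(elem)                            // "process" the element
    received += 1
    if (received % batchSize == 0)
      subscription.requestMore(batchSize)    // re-signal demand once a batch is consumed
  }

  def onError(thr: Throwable): Unit = thr.printStackTrace()
  def onComplete(): Unit = println("stream completed")
}

The publisher may never send more elements than have been requested, so the data in flight stays bounded by batchSize.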
Akka Streams
Powered by Akka Actors
Type-safe streaming through Actors with bounded buffering
Akka Streams API is geared towards end-users
Akka Streams implementation uses the Reactive Streams interfaces (Publisher/Subscriber) internally to pass data between the different processing stages
Examples
View slides 62-80 of http://www.slideshare.net/ktoso/reactive-streams-akka-streams-geecon-prague-2014
basic.scala
TcpEcho.scala
WritePrimes.scala
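The linked example files are not reproduced here; as a rough illustration of the style of pipeline they demonstrate, the following is a minimal sketch of a source-to-sink Akka Streams program (assuming an Akka 2.4-era Streams API; the object and system names are illustrative):

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Source

object BasicSketch extends App {
  implicit val system = ActorSystem("streams-demo")
  implicit val materializer = ActorMaterializer()
  import system.dispatcher

  // Source -> transformation stages -> sink; demand propagates upstream,
  // so a slow sink automatically back-pressures the source.
  Source(1 to 100)
    .map(_ * 2)
    .filter(_ % 3 == 0)
    .runForeach(println)
    .onComplete(_ => system.terminate())
}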
Overview of Google’s distributed systems
Original Google search engine architecture
More than just a search engine
Organization of Google's physical infrastructure
40-80 PCs per rack (terabytes of disk space each)
30+ racks per cluster
Hundreds of clusters spread across data centers worldwide
System architecture requirements
Scalability
Reliability
Performance
Openness (at the beginning, at least)
Overall Google systems architecture
Google infrastructure
Design philosophy
Simplicity: software should do one thing and do it well
Provable performance: "every millisecond counts"; estimate performance costs (accessing memory and disk, sending a packet over the network, locking and unlocking a mutex, etc.)
Testing: "if it ain't broke, you're not trying hard enough"; stringent testing
Data and coordination services
Google File System (GFS): broadly similar to NFS and AFS, but optimized for the types of files and data access used by Google
BigTable: a distributed database that stores (semi-)structured data, with just enough organization and structure for the type of data Google uses
Chubby: a locking service (and more) for GFS and BigTable
GFS requirements
Must run reliably on the physical platform
Must tolerate failures of individual components
So application-level services can rely on the file system
Optimized for Google's usage patterns
Huge files (100+ MB, up to 1GB)
Relatively small number of files
Accesses dominated by sequential reads and appends
Appends done concurrently
Meets the requirements of the whole Google infrastructure: scalable, reliable, high-performance, open
Important: throughput has a higher priority than latency
GFS architecture
Files are stored in 64MB chunks in a cluster with
a master node (whose operations log is replicated on remote machines)
hundreds of chunk servers
Chunks replicated 3 times
Reading and writing
When the client wants to access a particular offset in a file
the GFS client translates this to a (file name, chunk index) pair
and then sends this to the master
When the master receives the (file name, chunk index) pair
it replies with the chunk identifier and replica locations
The client then accesses the closest chunk replica directly
No client-side caching: caching would not help in the type of (streaming) access GFS has
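The read path just described could be sketched as follows (the traits and helper functions are illustrative stubs, not the real GFS client interfaces):

case class ChunkLocation(chunkId: Long, replicas: Seq[String])

trait GfsMaster {
  // (file name, chunk index) -> chunk identifier and replica locations
  def lookup(fileName: String, chunkIndex: Long): ChunkLocation
}

trait ChunkServer {
  def read(chunkId: Long, offsetInChunk: Long, length: Int): Array[Byte]
}

class GfsClient(master: GfsMaster,
                connect: String => ChunkServer,
                closest: Seq[String] => String) {
  private val ChunkSize = 64L * 1024 * 1024                  // 64MB chunks

  def read(fileName: String, offset: Long, length: Int): Array[Byte] = {
    val chunkIndex = offset / ChunkSize                      // translate byte offset to chunk index
    val loc        = master.lookup(fileName, chunkIndex)     // master returns metadata only
    val server     = connect(closest(loc.replicas))          // contact the closest replica directly
    server.read(loc.chunkId, offset % ChunkSize, length)     // no client-side caching of data
  }
}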
Keeping chunk replicas consistent
When the master receives a mutation request from a client
the master grants one chunk replica a lease (that replica becomes the primary)
and returns the identity of the primary and the other replicas to the client
The client sends the mutation directly to all the replicas
Replicas cache the mutation and acknowledge receipt
The client then sends a write request to the primary
The primary assigns an order to the mutations and applies them accordingly
The primary then requests that the other replicas apply the mutations in the same order
When all the replicas have acknowledged success, the primary reports an ack to the client
What consistency model does this seem to implement?
GFS (non-)guarantees
Writes (at a file offset) are not atomic
Concurrent writes to the same location may corrupt replicated chunks
If any replica is left inconsistent, the write fails (and is retried a few times)
Appends are executed atomically "at least once"
The offset is chosen by the primary
Replicated chunks may end up non-identical, with some having duplicate appends
GFS does not guarantee that the replicas are identical; it only guarantees that some file regions are consistent across replicas
When needed, GFS uses an external locking service (Chubby), as well as a leader election service (also Chubby) to select the primary replica
Bigtable
GFS provides raw data storage
Also needed: storage for structured data ...
... optimized to handle the needs of Google's apps ...
... that is reliable, scalable, high-performance, open, etc.
Examples of structured data
URLs: content, crawl metadata, links, anchors, PageRank, ...
Per-user data: user preference settings, recent queries/search results, ...
Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...
Commercial DB
Why not use a commercial database?
Not scalable enough
Too expensive
A full-featured relational database is not required
Low-level optimizations may be needed
Bigtable table
Implementation: Sparse distributed multi-dimensional map (row, column, timestamp) → cell contents
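The (row, column, timestamp) → contents map can be pictured as a nested sorted map; the sketch below is illustrative only (the real implementation is a distributed structure, not an in-memory map), using the webtable example that appears later in these slides:

import scala.collection.immutable.TreeMap

object BigtableModel {
  type Row       = String   // row key, up to 64KB, rows sorted lexicographically
  type Column    = String   // "family:qualifier"
  type Timestamp = Long

  type Table = TreeMap[Row, TreeMap[Column, TreeMap[Timestamp, String]]]

  val newestFirst: Ordering[Timestamp] = Ordering[Long].reverse

  // One cell: row "com.cnn.www", column "anchor:com.cnn.www/sport", newest value "CNN Sports"
  val webtable: Table = TreeMap(
    "com.cnn.www" -> TreeMap(
      "anchor:com.cnn.www/sport" -> TreeMap(1L -> "CNN Sports")(newestFirst)
    )
  )
}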
Rows
Each row has a key: a string up to 64KB in size
Access to data in a row is atomic
Rows are ordered lexicographically
Rows that are close together lexicographically reside on one machine or on nearby machines (locality)
Columns
Example row "com.cnn.www":
'contents:' → "<html>…"
'anchor:com.cnn.www/sport' → "CNN Sports"
'anchor:com.cnn.www/world' → "CNN world"
Columns have a two-level name structure: family:qualifier
Column family
a logical grouping of data
groups an unbounded number of columns (named with qualifiers)
may have a single column with no qualifier
Timestamps
Used to store different versions of data in a cell
defaults to the current time
can also be set explicitly by the client
Garbage collection: per-column-family GC settings
"Only retain the most recent K values in a cell", "Keep values until they are older than K seconds", ...
API
Create / delete tables and column families
Table *T = OpenOrDie("/bigtable/web/webtable");

RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:com.cnn.www/sport", "CNN Sports");
r1.Delete("anchor:com.cnn.www/world");

Operation op;
Apply(&op, &r1);
Bigtable architecture
An instance of BigTable is a cluster that stores tables
a library on the client side
a master server
tablet servers
A table is decomposed into tablets
Tablets
A table is decomposed into tablets
a tablet holds a contiguous range of rows
100MB - 200MB of data per tablet
a tablet server is responsible for ~100 tablets
Each tablet is represented by
a set of files stored in GFS; the files use the SSTable format, a mapping of (string) keys to (string) values
log files
Tablet Server
Master assigns tablets to tablet servers
Tablet server handles read/write requests to its tablets from clients
No data goes through the master
The Bigtable client requires a naming/locator service (Chubby) to find the root tablet, which is part of the metadata table
The metadata table contains metadata about the actual tablets
including location information of associated SSTables and log files
Master
Upon startup, must grab the master lock to ensure it is the single master of a set of tablet servers
the lock is provided by the locking service (Chubby)
Monitors tablet servers
periodically scans the directory of tablet servers provided by the naming service (Chubby)
keeps track of the tablets assigned to its tablet servers
obtains a lock on each tablet server from the locking service (Chubby); the lock is the communication mechanism between master and tablet server
Assigns unassigned tablets in the cluster to the tablet servers it monitors, and moves tablets around to achieve load balancing
Garbage collects underlying files stored in GFS
BigTable tablet architecture
Each SSTable is an ordered and immutable mapping of keys to values
Tablet Serving
Writes are committed to the log
Memtable: an ordered log of recent commits (in memory)
SSTables really store a snapshot
When the Memtable gets too big
create a new empty Memtable
merge the old Memtable with the SSTables and write the result to GFS
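A minimal sketch of this write path (the class, parameters, and callbacks below are illustrative stubs, not the real Bigtable interfaces):

import scala.collection.immutable.TreeMap

class TabletWriter(flushThreshold: Int,
                   appendToLog: ((String, String)) => Unit,
                   writeToGfs: TreeMap[String, String] => Unit) {
  private var memtable = TreeMap.empty[String, String]   // ordered, in-memory recent commits

  def write(key: String, value: String): Unit = {
    appendToLog((key, value))           // commit to the log first, for recovery
    memtable += (key -> value)          // then apply to the memtable
    if (memtable.size >= flushThreshold) {
      writeToGfs(memtable)              // write the old memtable out to GFS (merged into SSTable data)
      memtable = TreeMap.empty          // start a new, empty memtable
    }
  }
}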
SSTable
Operations
Look up the value for a key
Iterate over all key/value pairs in a specified range
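Modeling an SSTable as an immutable sorted map, the two operations might look like this (illustrative only; real SSTables are block-structured files with an index):

import scala.collection.immutable.TreeMap

final class SSTableView(entries: TreeMap[String, String]) {
  // Look up the value for a key
  def lookup(key: String): Option[String] = entries.get(key)

  // Iterate over all key/value pairs whose keys fall in [from, until)
  def scan(from: String, until: String): Iterator[(String, String)] =
    entries.range(from, until).iterator
}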
Bigtable relies on the lock service (Chubby) to
ensure there is at most one active master
handle tablet server death
store column family information
store access control lists
Chubby
Chubby provides to the infrastructure
a locking service
a file system for reliable storage of small files
a leader election service (e.g., to select a primary replica)
a name service
Seemingly violates “simplicity” design philosophy but...
... Chubby really provides an asynchronous distributed agreement service
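For example, leader (primary) election can be built on top of a Chubby-style exclusive lock; the following is a sketch only, with an illustrative stub in place of the real Chubby API:

trait LockService {
  def tryAcquireExclusive(path: String): Boolean   // attempt to grab an exclusive lock on a file
  def write(path: String, data: String): Unit
  def read(path: String): String
}

object LeaderElection {
  // Every candidate tries to grab the same well-known lock file; the winner
  // becomes the primary and records its identity where the others can find it.
  def electPrimary(chubby: LockService, myAddress: String): String =
    if (chubby.tryAcquireExclusive("/ls/cell/service/master-lock")) {
      chubby.write("/ls/cell/service/master-address", myAddress)
      myAddress
    } else {
      chubby.read("/ls/cell/service/master-address")
    }
}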
Chubby API
Overall architecture of Chubby
Cell: a single instance of the Chubby system
5 replicas, one of which is the master replica
Each replica maintains a database of directories and files/locks
Consistency is achieved using Lamport's Paxos consensus protocol, which uses an operation log
Chubby internally supports snapshots to periodically garbage-collect the operation log
Paxos distributed consensus algorithm
A distributed consensus protocol for asynchronous systems
Used by servers managing replicas in order to reach agreement on an update when
messages may be lost, re-ordered, or duplicated
servers may operate at arbitrary speed and fail
servers have access to stable persistent storage
Fact: consensus is not always possible in asynchronous systems
Paxos works by ensuring safety (correctness), not liveness (termination)
Paxos algorithm - step 1
Paxos algorithm - step 2
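The two steps above correspond to Paxos's prepare/promise and accept/accepted exchanges. A minimal sketch of the acceptor's rules for single-decree Paxos (message and class names are illustrative, not from any particular library):

case class Prepare(n: Long)
case class Promise(n: Long, accepted: Option[(Long, String)])
case class Accept(n: Long, value: String)
case class Accepted(n: Long, value: String)

class Acceptor {
  private var promisedN: Long = -1                            // highest proposal number promised
  private var acceptedProposal: Option[(Long, String)] = None

  // Step 1: on Prepare(n), promise not to accept any proposal numbered below n,
  // and report any value already accepted so the proposer can adopt it.
  def onPrepare(p: Prepare): Option[Promise] =
    if (p.n > promisedN) {
      promisedN = p.n
      Some(Promise(p.n, acceptedProposal))
    } else None                                               // ignore (or NACK) stale proposals

  // Step 2: on Accept(n, v), accept unless a higher-numbered Prepare has been
  // promised in the meantime.
  def onAccept(a: Accept): Option[Accepted] =
    if (a.n >= promisedN) {
      promisedN = a.n
      acceptedProposal = Some((a.n, a.value))
      Some(Accepted(a.n, a.value))
    } else None
}

A value is chosen once a majority of acceptors have accepted the same proposal; safety holds even if messages are lost or servers crash and recover from stable storage.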
The Big Picture
Customized solutions for Google-type problems
GFS: stores data reliably, but just raw files
BigTable: provides a key/value map; database-like, but doesn't provide everything we need
Chubby: locking mechanism; handles all synchronization problems
Common Principles
One master, multiple workers
MapReduce: master coordinates work among map/reduce workers
Chubby: master among five replicas
Bigtable: master knows the locations of tablet servers
GFS: master coordinates data across chunkservers