Building Distributed Systems with Riak Core
Steve Vinoski
Architect, Basho Technologies
Cambridge, MA USA
http://www.basho.com/
http://steve.vinoski.net/
Monday, December 5, 11
What We’ll Cover
•Origins of Riak Core
•Abstractions and Functionality
•Getting started with Riak Core
20 Years Ago: Client-Server
[Diagram: many clients connecting to a single server]
Early-ish Web Apps
[Diagram: many web servers in front of a few app servers, all sharing a single database]
Scaling Up
•Scaling up meant getting bigger boxes
•Worked for client/server and early web apps
•But couldn’t keep up with web growth
Scaling Up
•Software architectures crafted for client/server...
•And later stretched for early web...
•Simply didn’t cut it for scale-out web apps
•Resulted in fragile systems with serious operational problems
Scaling Out
•As businesses went from “having” websites to “being” websites:
• they ran on increasing numbers of commodity boxes
•eventually spread across multiple data centers
Scaling Out Changed Everything
•Dealing with more concurrency
•And more distribution
•And more operational issues
•As well as more system failures
•While also needing higher reliability and uptime
CAP Theorem
•A conjecture put forth in 2000 by Dr. Eric Brewer
• Formally proven in 2002
• In any distributed system, pick two:
•Consistency
•Availability
• Partition tolerance
Partition Tolerance
•Guarantees continued system operation even when the network breaks and messages are lost
•Systems generally tend to support P
•Leaves choice of either C or A
Consistency
•Distributed nodes see the same updates at the same logical time
•Hard to guarantee across a distributed system
Availability
•Guarantees the system will service every read and write sent to it
•Even when things are breaking
Choosing AP
•Provides read/write availability even when network breaks or nodes die
•Provides eventual consistency
•Example: Domain Name System (DNS) is an AP system
•Requires a careful look at tradeoffs
Example AP Systems
•Amazon Dynamo
•Cassandra
•CouchDB
•Voldemort
•Basho Riak
Handling Tradeoffs for AP Systems
Assumptions
•We want to scale out
•We’re choosing to lean toward AP
•We have a networked cluster of nodes, each with local storage
• Problem: how to make the system available even if nodes die or the network breaks?
• Solution:
• allow reading and writing from multiple nodes in the system
• avoid master nodes, instead make all nodes peers
• Problem: if multiple nodes are involved, how do you reliably know where to read or write?
• Solution:
• assign virtual nodes (vnodes) to physical nodes
• use consistent hashing to find vnodes for reads/writes
Consistent Hashing
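The ring mechanics can be sketched in a few lines. This is a toy Python model for illustration only (Riak Core itself is Erlang, and its real ring divides a 160-bit SHA-1 key space into partitions claimed by nodes); the `Ring` class and its method names are assumptions of this sketch, not Riak Core's API.

```python
import hashlib

class Ring:
    """Toy consistent-hash ring: the 160-bit SHA-1 key space is split
    into equal partitions (vnodes), each claimed by a physical node."""
    def __init__(self, nodes, num_vnodes=64):
        self.num_vnodes = num_vnodes
        self.partition = 2 ** 160 // num_vnodes
        # Claim vnodes round-robin so each node owns an interleaved set.
        self.owners = [nodes[i % len(nodes)] for i in range(num_vnodes)]

    def vnode_for(self, key):
        # Hash the key onto the ring and see which partition it lands in.
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return h // self.partition

    def preflist(self, key, n=3):
        # The N vnodes holding replicas of this key: the key's own
        # vnode plus the next N-1 vnodes clockwise around the ring.
        start = self.vnode_for(key)
        return [(i % self.num_vnodes, self.owners[i % self.num_vnodes])
                for i in range(start, start + n)]

ring = Ring(["node1", "node2", "node3", "node4"])
print(ring.preflist("artist/REM"))   # three (vnode, node) replica targets
```

Because adjacent vnodes are claimed by different physical nodes, the three replica targets land on three distinct machines, so losing one machine loses only one replica of any key.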
Consistent Hashing and Multi-Vnode Benefits
•Data is stored in multiple locations
•Loss of a node means only a single replica is lost
•No master to lose
•Adding nodes is trivial, data gets rebalanced minimally and automatically
• Problem: what about availability? What if the node you write to dies or becomes inaccessible?
• Solution: sloppy quorums
• write to multiple vnodes
• attempt reads from multiple vnodes
N/R/W Values
•N = number of replicas to store (on distinct nodes)
•R = number of replica responses needed for a successful read (specified per-request)
•W = number of replica responses needed for a successful write (specified per-request)
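A read under these values can be sketched as follows: a minimal Python toy in which `quorum_read` and the replica callables are hypothetical names invented for this sketch (a real Riak read issues the requests in parallel rather than sequentially).

```python
def quorum_read(replicas, r):
    """Toy quorum read: try all N replicas, succeed once R have answered."""
    responses = []
    for replica in replicas:          # a real system queries in parallel
        try:
            responses.append(replica())
        except ConnectionError:
            continue                  # an unreachable vnode simply doesn't count
        if len(responses) >= r:
            return responses          # R responses in hand: the read succeeds
    raise RuntimeError("quorum failed: %d < R=%d" % (len(responses), r))

def alive():
    return "value"

def down():
    raise ConnectionError("vnode unreachable")

# N=3 replicas, one of them down; R=2 still succeeds.
print(quorum_read([alive, down, alive], r=2))
```

Writes work the same way with W in place of R, which is how an application tunes each request toward consistency (high R/W) or availability (low R/W).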
• Problem: what happens if a key hashes to vnodes that aren’t available?
• Solution:
• read from or write to the next available vnode (hence “sloppy” not “strict” quorums)
• eventually repair via hinted handoff
Hinted Handoff
•Surrogate vnode holds data for unavailable actual vnode
•Surrogate vnode keeps checking for availability of actual vnode
•Once the actual vnode is again available, surrogate hands off data to it
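The cycle those bullets describe can be sketched like this: a Python toy with hypothetical names (`Vnode`, `Surrogate`, `try_handoff`); the real mechanism lives inside riak_core's handoff machinery.

```python
class Vnode:
    """Toy vnode: a dict of key/value data."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value

class Surrogate(Vnode):
    """Toy surrogate vnode: accepts writes on behalf of a down vnode,
    remembering which vnode the data is really destined for (the 'hint')."""
    def __init__(self, intended_owner):
        super().__init__()
        self.intended_owner = intended_owner

    def try_handoff(self, owner_is_up):
        # Called periodically: once the real vnode is reachable again,
        # hand the hinted data off to it and drop the local copy.
        if not owner_is_up():
            return False
        for key, value in self.data.items():
            self.intended_owner.put(key, value)
        self.data.clear()
        return True

owner = Vnode()
surrogate = Surrogate(owner)
surrogate.put("k", "v")                # write lands on the surrogate
surrogate.try_handoff(lambda: False)   # owner still down: keep holding
surrogate.try_handoff(lambda: True)    # owner back: data handed off
print(owner.data)                      # {'k': 'v'}
```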
Quorum Benefits
•Allows applications to tune consistency, availability, reliability per read or write
• Problem: how do the nodes in the ring keep track of ring state?
• Solution: gossip protocol
•Nodes “gossip” their view of the state of the ring to other nodes
• If a node changes its claim on the ring, it lets others know
•The overall state of the ring is thus kept consistent among all nodes in the ring
Gossip Protocol
• Problem: what happens if vnode replicas get out of sync?
• Solution:
• vector clocks
• read repair
Vector Clocks
•Reasoning about time and causality in distributed systems is hard
• Integer timestamps don’t necessarily capture causality
•Vector clocks provide a happens-before relationship between two events
Vector Clocks
•Simple data structure: [{ActorID,Counter}]
•All data has an associated vector clock, actors update their entry when making changes
•ClockA happened-before ClockB if all actor-counters in A are less than or equal to those in B
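That comparison rule can be shown directly from the [{ActorID,Counter}] shape. Below is a small Python illustration; riak_core ships a real Erlang vclock module, and the names `descends` and `increment` here are chosen to echo it, but this is a sketch, not its API.

```python
def descends(a, b):
    """True if clock `a` happened-after (or equals) clock `b`:
    every actor's counter in b is <= that actor's counter in a."""
    counters = dict(a)
    return all(counters.get(actor, 0) >= n for actor, n in b)

def increment(clock, actor):
    # An actor bumps its own entry whenever it makes a change.
    counters = dict(clock)
    counters[actor] = counters.get(actor, 0) + 1
    return sorted(counters.items())

tuesday  = [("Alice", 1), ("Ben", 1), ("Dave", 1)]
thursday = [("Alice", 1), ("Ben", 1), ("Cathy", 1), ("Dave", 2)]

print(descends(thursday, tuesday))   # True: Thursday supersedes Tuesday
print(descends(tuesday, thursday))   # False
# Neither descends the other: a genuine conflict the app must resolve.
print(descends(tuesday, [("Alice", 1), ("Cathy", 1)]))   # False
```

When neither clock descends the other, the two values are concurrent siblings and must be reconciled, which is exactly the conflict in the dinner example that follows.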
Vector Clocks are Easy
•Bryan Fink’s blog post: http://basho.com/blog/technical/2010/01/29/why-vector-clocks-are-easy/
•Explains vector clocks using a dinner invitation example
Dinner Example
•Alice, Ben, Cathy, Dave exchange some email to decide when to meet for dinner
•Alice emails everyone to suggest Wednesday
Dinner Example
•Ben and Dave email each other and decide Tuesday
•Cathy and Dave email each other and Cathy prefers Thursday, and Dave changes his mind and agrees
Dinner Example
•Ann then pings everyone to check that Wednesday is still OK
•Ben says he and Dave prefer Tuesday
•Cathy says she and Dave prefer Thursday
•Dave doesn’t answer
Conflict!
[Diagram sequence: the dinner example as vector clocks]
•Alice suggests Wednesday: [{Alice,1}] Wednesday (sent to Ben, Cathy, and Dave)
•Ben proposes Tuesday to Dave: [{Alice,1},{Ben,1}] Tuesday
•Dave agrees: [{Alice,1},{Ben,1},{Dave,1}] Tuesday
•Cathy proposes Thursday to Dave: [{Alice,1},{Cathy,1}] Thursday
•Dave already holds [{Alice,1},{Ben,1},{Dave,1}] Tuesday; neither clock descends the other, so Dave resolves the conflict and agrees to Thursday: [{Alice,1},{Ben,1},{Cathy,1},{Dave,2}] Thursday
•Comparing Alice’s [{Alice,1}] Wednesday and Ben’s [{Alice,1},{Ben,1},{Dave,1}] Tuesday against Dave’s final clock, the Thursday clock descends from both, so Thursday wins
See: Easy!
[{Alice,1},{Ben,1},{Cathy,1},{Dave,2}] Thursday
Vector Clocks are Hard
• Justin Sheehy’s blog post: http://basho.com/blog/technical/2010/04/05/why-vector-clocks-are-hard/
Vector Clocks are Hard
•Our example shows how vclocks can quickly grow
•Tradeoffs to keep them bounded:
•mark each entry with a timestamp
•occasionally drop old entries
•also trim vclock if too many entries
• Problem: what happens if vnode replicas get out of sync?
• Solution:
• vector clocks
• read repair
Read Repair
• If a read detects that a vnode has stale data, it is repaired via asynchronous update
•Helps implement eventual consistency
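The repair step can be sketched as below: a deliberately simplified Python toy that orders versions by a single integer counter (real Riak compares vector clocks, and pushes the update asynchronously); `Replica` and `read_with_repair` are names invented for this sketch.

```python
class Replica:
    """Toy replica holding a (version, value) pair."""
    def __init__(self, version, value):
        self.version, self.value = version, value
    def get(self):
        return self.version, self.value
    def put(self, version, value):
        self.version, self.value = version, value

def read_with_repair(replicas):
    # Fetch from every replica and find the newest version...
    versions = [r.get() for r in replicas]
    latest_version, latest_value = max(versions)
    # ...then push it back to any replica that was behind
    # (asynchronously, in a real system).
    for r in replicas:
        if r.get()[0] < latest_version:
            r.put(latest_version, latest_value)
    return latest_value

rs = [Replica(2, "new"), Replica(1, "old"), Replica(2, "new")]
assert read_with_repair(rs) == "new"
assert all(r.get() == (2, "new") for r in rs)   # stale replica repaired
```

Repairing on every read is what nudges the replicas back into agreement over time: the "eventual" in eventual consistency.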
This is Riak Core
•consistent hashing
•vector clocks
•sloppy quorums
•gossip protocols
•virtual nodes (vnodes)
•hinted handoff
Riak Core Implementation
•Open source
•https://github.com/basho/riak_core
• Implemented in Erlang
•Helps you build AP systems
Why Erlang?
•Erlang started in the mid-80s at Ericsson Computer Science Laboratories
•Needed a better way to program telephone switches for concurrency, fault tolerance, and hot upgrade
•Erlang released as open source in 1998 (www.erlang.org)
Concurrency with Erlang
•A single Erlang VM instance can support millions of processes
•The VM schedules these onto CPU cores
• Processes communicate via message passing
•No locks or condition variables, which makes programming easier
Reliability with Erlang
• Apps typically consist of numerous Erlang processes (very lightweight threads)
• Some processes supervise others
• If a process dies, its supervisor can restart it
• “Let It Crash” philosophy
• Hot code loading for upgrades and fixes
Distribution with Erlang
•Messaging primitives the same whether in same VM or different VM, even across a network
•No “extra” packages or libraries needed for distribution, it’s just built in
Erlang Applications
• Erlang systems are composed of applications
• Erlang provides tools for creating and bundling applications, managing app dependencies
• Numerous apps can run within a single VM
• See rebar, an Erlang project build tool from Basho: https://github.com/basho/rebar
Getting Started with Riak Core
Riak Core Applications
•The Riak key-value database (Riak KV) is an application built on riak_core
•Riak Search, a full-text search capability supplied with Riak, is also an app built on riak_core
Riak KV
[Diagram: Riak KV layered on Riak Core, with HTTP and Protobuf interfaces above and pluggable storage backends below]
Speaking of rebar...
$ git clone git://github.com/basho/rebar.git
$ cd rebar
$ ./bootstrap
$ cp rebar ~/bin
Riak Core Templates
•https://github.com/rzezeski/rebar_riak_core
•A set of rebar templates that create a skeleton riak_core project for you
Installing rebar_riak_core
$ git clone git://github.com/rzezeski/rebar_riak_core.git
$ cd rebar_riak_core
$ mkdir -p ~/.rebar/templates
$ cp riak* ~/.rebar/templates
Using rebar_riak_core
$ mkdir myapp
$ cd myapp
$ rebar create template=riak_core appid=myapp
•Creates a bunch of files, including rebar.config to control rebar builds
•Automatically makes riak_core a project dependency
Multinode
•riak_core_multinode template (part of rebar_riak_core templates)
•Creates multiple Erlang distributed nodes to simulate physical nodes
•See https://github.com/rzezeski/try-try-try/tree/master/2011/riak-core-first-multinode for details
riak_id
• Implements something like Twitter’s Snowflake, a highly available unique ID generator for tweet identifiers
• IDs must be 64 bits and roughly sortable
•https://github.com/seancribbs/riak_id
Operations and Vnodes
•For a riak_core app, you need:
•to decide what operations your app needs to perform
•to implement vnodes that perform those operations
riak_id Operations
•next_id: generate a new unique ID
•ping: vnode communication check (comes from rebar_riak_core templates, we’ll ignore it)
riak_id_vnode handle_command
•The riak_id_vnode:handle_command function implements the next_id operation
•File apps/riak_id/src/riak_id_vnode.erl
riak_pipe
•A new framework for Riak 1.0 that handles MapReduce and other tasks
•Allows tasks to be configured and spread out across the cluster
•Outputs from one task become inputs for the next
•https://github.com/basho/riak_pipe
Riak Core
•consistent hashing
•vector clocks
•sloppy quorums
•gossip protocols
•virtual nodes (vnodes)
•hinted handoff
https://github.com/basho/riak_core
Thanks