Copyright PrismTech, 2016
Angelo Corsaro, PhD
CTO, ADLINK Tech. Inc.
Co-Chair, OMG DDS-SIG
Distributed Computing with DDS
Distributed System Definition
A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components.
— Wikipedia
Distributed System Definition
Well… this may be true at the transport level, but the components may coordinate using different models, as we'll see later.
Distributed System Definition
A distributed system is a model in which components located on networked computers communicate and coordinate their actions to achieve a common goal.
Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components.
— Adapted from Wikipedia
Distributed System Definition
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.
— Leslie Lamport, 28 May 1987
Symmetric vs Asymmetric
Computation models are symmetric if the processes involved in the distributed computation don't assume special roles; in other words, they are all peers.
Computation models are asymmetric if some processes assume a special role, e.g. server or client.
Anonymous vs Named
Some distributed computation/coordination models support anonymous communication, in the sense that the communicating parties are unaware of each other.
Others require explicit knowledge of the parties with which communication has to happen.
Message Passing
Message Passing is a symmetric computation model in which distributed processes communicate and cooperate by asynchronously sending messages to each other.
Examples: Sockets, Agents, MPI
Client/Server
Client/Server is an asymmetric computation model in which distributed processes communicate and cooperate by requesting services (often synchronously) from special processes called servers.
Examples: Java RMI, CORBA, OPC-UA
Message Queues
Message Queues is a symmetric and anonymous computation model in which distributed processes communicate and coordinate by asynchronously putting and getting messages on named queues.
Examples: AMQP, JMS Queues, AWS SQS
Tuple Spaces
The Tuple Space is a symmetric and anonymous computation model in which distributed processes communicate and coordinate by asynchronously reading and writing tuples, i.e. data, into a tuple space.
Examples: DBMS, DDS, Linda
The DDS Model
DDS provides a Tuple Space inspired symmetric computation model in which distributed processes communicate and coordinate by asynchronously reading and writing data into an eventually consistent data space.
Virtualised Data Space
Applications can autonomously and asynchronously read and write data, enjoying spatial and temporal decoupling.
[Figure: the DDS Global Data Space, with DataWriters and DataReaders communicating through Topics A–D, each with its own QoS]
Consistency Model
DDS's Data Space is eventually consistent with respect to writes.
That means that readers of some kind of data will eventually see a write, but they may not observe it at the “same time”.
Eventual Properties
Given a property P(t), we say that the property is eventually true iff there exists a time t* such that P(t) holds for every t ≥ t*.
Understanding Eventual Consistency
Consistency with respect to a datum means that anything/anybody looking at the datum will see exactly the same value.
Eventually Consistent means that consistency will “eventually” be asserted, but before t* (which is unknown in asynchronous and partially synchronous systems), anything/anybody looking at the datum may see different values.
Topic
A Topic defines a domain-wide class of information by a <name, type, qos> triple.
DDS Topics allow expressing the functional and non-functional properties of a system's information model.
Topic types can be expressed using different syntaxes, including IDL and ProtoBuf. Topic type in IDL:

struct TemperatureSensor {
    @key long sid;
    float temp;
    float hum;
};
Instances
Each unique key value identifies a unique stream of data.
DDS demultiplexes these “streams” and provides per-instance lifecycle information.
A Writer can write multiple instances, e.g. sid = “12345”, sid = “54321”, sid = “15243”.

struct TemperatureSensor {
    @key long sid;
    float temp;
    float hum;
};
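As a plain-Scala sketch (illustrative, not the DDS API), the key-based demultiplexing above can be pictured like this: each distinct value of the @key field sid identifies one instance, i.e. one stream of samples.

```scala
// Illustrative sketch, not DDS API: samples demultiplex into per-instance
// streams keyed by the @key field `sid`.
case class TemperatureSensor(sid: Long, temp: Float, hum: Float)

val samples = List(
  TemperatureSensor(12345L, 21.5f, 40.0f),
  TemperatureSensor(54321L, 19.0f, 55.0f),
  TemperatureSensor(12345L, 21.7f, 41.0f) // same instance as the first sample
)

// Each distinct key value is one instance; DDS tracks these streams separately
val instances: Map[Long, List[TemperatureSensor]] = samples.groupBy(_.sid)
```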
Data Cache
Each Writer and Reader has an associated Data Cache.
Writer Cache
The writer's cache stores (a subset of) the data written.
Reader Cache
The reader's cache contains a projection of the global data space that reflects the reader's “interest”.
Data Cache
A Reader/Writer Cache can store the last n ∈ ℕ∞ samples for each relevant instance, where ℕ∞ = ℕ ∪ {∞}.
The cache properties are configured via QoS.
Reading Samples
The action of reading samples from a Reader Cache is non-destructive: samples are not removed from the cache.
Taking Samples
The action of taking samples from a Reader Cache is destructive: samples are removed from the cache.
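The read/take distinction can be modelled in a few lines of plain Scala (a sketch, not the DDS API): read returns the samples and leaves the cache untouched; take returns them and drains the cache.

```scala
import scala.collection.mutable.ListBuffer

// Sketch of a reader cache holding samples (just Ints here for brevity)
val cache = ListBuffer(1, 2, 3)

def read(): List[Int] = cache.toList // non-destructive: cache unchanged

def take(): List[Int] = {            // destructive: cache is drained
  val samples = cache.toList
  cache.clear()
  samples
}
```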
Sample Selectors
Samples can be selected using composable content and status predicates.
Data Filters
Filters allow controlling what gets into a DataReader cache.
Filters are expressed as SQL WHERE clauses or as Java/C/JavaScript predicates.
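Functionally, a filter is just a predicate applied before samples enter the reader's cache. A minimal plain-Scala sketch (the predicate stands in for a SQL-style filter such as "temp > 25.0"; names are illustrative):

```scala
case class TemperatureSensor(sid: Long, temp: Float, hum: Float)

// Stand-in for the SQL filter "temp > 25.0": only matching samples
// ever reach the DataReader cache (the filter sits between network
// and application)
val filter: TemperatureSensor => Boolean = s => s.temp > 25.0f

val arriving = List(
  TemperatureSensor(1L, 30.0f, 40.0f),
  TemperatureSensor(2L, 20.0f, 50.0f)
)
val cache = arriving.filter(filter)
```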
Data Queries
Queries allow controlling what gets out of a DataReader Cache.
Queries are expressed as SQL WHERE clauses or as Java/C/JavaScript predicates.
State Selectors
State-based selection allows controlling what gets out of a DataReader Cache based on sample state (read or not), instance state (alive or not), and view state (known or not).
QoS Enabled
QoS policies allow the expression of, and control over, data's temporal and availability constraints.
RxO QoS Policies
QoS Policies controlling end-to-end properties follow a Requested vs. Offered (RxO) model: the DataWriter side offers a QoS, the DataReader side requests one, and they match only if the offered QoS is at least as strong as the requested one.
RxO policies include: DURABILITY, OWNERSHIP, DEADLINE, LATENCY BUDGET, LIVELINESS, RELIABILITY, DESTINATION ORDER. The PARTITION policy applies to the Publisher/Subscriber, which a DataWriter produces into and a DataReader consumes from; DomainParticipants join a Domain by Domain Id.
Channel Properties
We can think of a DataWriter and its matching DataReaders as connected by a logical typed communication channel, whose properties are controlled by means of QoS Policies.
At the two extremes, this logical communication channel can be:
- a Best-Effort/Reliable Last n-values Channel
- a Best-Effort/Reliable FIFO Channel
Last n-values Channel
The last n-values channel is useful when modelling distributed state.
When n = 1, the last-value channel provides a way of modelling an eventually consistent distributed state.
This abstraction is very useful when what matters is the current value of a given topic instance.
The QoS Policies that give a Last n-values Channel are:
- RELIABILITY = RELIABLE
- HISTORY = KEEP_LAST(n)
- DURABILITY = TRANSIENT | PERSISTENT [in most cases]
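A last-value (KEEP_LAST(1)) cache can be sketched as a map from instance key to latest sample: replaying any sequence of writes converges to the same final state, which is the sense in which it models eventually consistent distributed state. This is a plain-Scala illustration, not the DDS API:

```scala
case class Sample(key: Long, value: Float)

// KEEP_LAST(1): the cache retains only the most recent sample per instance,
// so every reader eventually converges to the writer's latest values
def lastValueCache(writes: List[Sample]): Map[Long, Float] =
  writes.foldLeft(Map.empty[Long, Float]) { (cache, s) =>
    cache + (s.key -> s.value)
  }
```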
FIFO Channel
The FIFO Channel is useful when we care about every single sample produced for a given topic, as opposed to only the “last value”.
This abstraction is very useful when writing distributed algorithms over DDS.
Depending on QoS Policies, DDS provides:
- a Best-Effort/Reliable FIFO Channel
- an FT-Reliable FIFO Channel (using an OpenSplice-specific extension)
The QoS Policies that give a FIFO Channel are:
- RELIABILITY = RELIABLE
- HISTORY = KEEP_ALL
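The difference between the two history settings reduces to what the cache retains per instance; a small plain-Scala sketch (illustrative, not the DDS API):

```scala
// HISTORY = KEEP_LAST(n): only the n most recent samples survive in the cache
def keepLast(n: Int)(samples: List[Int]): List[Int] = samples.takeRight(n)

// HISTORY = KEEP_ALL: every sample is retained until taken by the reader,
// giving the FIFO-channel behaviour
def keepAll(samples: List[Int]): List[Int] = samples
```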
Membership
We can think of a DDS Topic as defining a group whose members are the matching DataReaders and DataWriters.
DDS's dynamic discovery manages this group membership; however, it provides only a low-level interface to group management and eventual consistency of views.
In addition, the group view provided by DDS exposes matched readers on the writer side and matched writers on the reader side. This is not sufficient for certain distributed algorithms.
Fault Detection
DDS provides a built-in mechanism for detecting DataWriter faults through the LivelinessChangedStatus.
A writer is considered to have lost its liveliness if it has failed to assert it within its lease period.
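The lease rule itself is simple enough to state in code; a self-contained sketch (not the DDS API), assuming times in milliseconds:

```scala
// A writer is considered alive iff it asserted liveliness within its lease
case class WriterLiveliness(lastAsserted: Long, leasePeriod: Long)

def isAlive(w: WriterLiveliness, now: Long): Boolean =
  now - w.lastAsserted <= w.leasePeriod
```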
System Model
Partially Synchronous:
- after a Global Stabilisation Time (GST), communication latencies are bounded, yet the bound is unknown
Non-Byzantine Fail/Recovery:
- processes can fail and restart, but don't perform malicious actions
Programming Environment
The algorithms shown next are implemented on OpenSplice using the Moliere Scala API.
All algorithms are available as part of the open-source project dada, the DDS-based Advanced Distributed Algorithms Toolkit: github.com/kydos/dada
Group Management Abstraction
A Group Management abstraction should provide the ability to join/leave a group, provide the current view, and detect failures of group members.
Ideally, group management should also provide the ability to elect leaders.
A Group Member should represent a process.

abstract class Group {
  // Join/Leave API
  def join(mid: Int)
  def leave(mid: Int)

  // Group View API
  def size: Int
  def view: List[Int]
  def waitForViewSize(n: Int)
  def waitForViewSize(n: Int, timeout: Int)

  // Leader Election API
  def leader: Option[Int]
  def proposeLeader(mid: Int, lid: Int)
}
Eventually Consistent Group Views
The group management algorithm that follows provides eventually consistent views as well as eventual leaders.
Whilst eventual consistency may seem to weaken the abstraction, there are plenty of situations in which it is actually more than enough.
It is also worth noticing that these algorithms are very efficient, thanks to the eventual consistency assumption.
Topic Types
To implement the Group abstraction with support for Leader Election, it is sufficient to rely on the following topic types:

enum TMemberStatus {
    JOINED, LEFT, FAILED, SUSPECTED
};

struct TMemberInfo {
    long mid; // member-id
    TMemberStatus status;
};
#pragma keylist TMemberInfo mid

struct TEventualLeaderVote {
    long long epoch;
    long mid;
    long lid; // voted leader-id
};
#pragma keylist TEventualLeaderVote mid
Topics
Group Management: the TMemberInfo topic is used to advertise presence and manage members' state transitions.
Leader Election: the TEventualLeaderVote topic is used to cast votes for leader election.
This leads us to:

Topic(name = MemberInfo, type = TMemberInfo,
      QoS = {Reliability.Reliable, History.KeepLast(1), Durability.TransientLocal})
Topic(name = EventualLeaderVote, type = TEventualLeaderVote,
      QoS = {Reliability.Reliable, History.KeepLast(1), Durability.TransientLocal})
Observation
Notice that we are using two Last-Value Channels to implement both the (eventual) group management and the (eventual) leader election.
This makes it possible to:
- let DDS provide our latest known state automatically, thanks to the TransientLocal Durability
- avoid periodically asserting our liveliness: DDS does that for our DataWriter
(Eventual) Leader Election
At the beginning of each epoch the leader is None; at each new epoch a leader election algorithm is run.
Example timeline: M1 joins (epoch 0), M2 joins (epoch 1), M0 joins (epoch 2), then M1 crashes (epoch 3):
epoch = 0: Leader: None => M1
epoch = 1: Leader: None => M1
epoch = 2: Leader: None => M0
epoch = 3: Leader: None => M0
(Eventual) Leader Election Algorithm
An eventual leader election algorithm can be implemented by simply casting a vote each time there is a group epoch change.
A Group Epoch change takes place each time there is a change in the group view.
The leader is eventually elected only if a majority of the processes currently in the view agree; otherwise the group leader is set to “None”.

object EventualLeaderElection {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("USAGE: GroupMember <gid> <mid>")
      sys.exit(1)
    }
    val gid = args(0).toInt
    val mid = args(1).toInt

    val group = Group(gid)
    group.join(mid)

    group listen {
      case EpochChange(e) => {
        val lid = group.view.min
        group.proposeLeader(mid, lid)
      }
      case NewLeader(l) =>
        println(">> NewLeader = " + l)
    }
  }
}
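The majority rule the algorithm depends on can be isolated as a pure function. This is a sketch: in dada the vote counting happens inside the Group implementation, and the names here are illustrative.

```scala
object MajorityVote {
  // votes: member-id -> voted leader-id. A leader is elected only if a
  // strict majority of the current view agrees; otherwise the result is None.
  def electedLeader(view: List[Int], votes: Map[Int, Int]): Option[Int] = {
    // Only count votes cast by members of the current view
    val cast = votes.filter { case (mid, _) => view.contains(mid) }
    val majority = view.size / 2 + 1
    cast.values.groupBy(identity).collectFirst {
      case (lid, vs) if vs.size >= majority => lid
    }
  }
}
```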
Segregating Groups
To isolate the traffic generated by different groups, we use the group-id gid to name the partition in which all the group-related traffic will take place.
[Figure: partitions “1”, “2”, “3” inside a DDS Domain; partition “2” is the one associated with the group with gid = 2]
Barriers
Barriers are a useful construct in parallel and distributed computing, used to coordinate the phases of a distributed computation.
Barrier Abstraction
A Barrier abstraction should provide a way to assert the desired size, along with waiting for it.
It is also useful to be able to list who is waiting on a given barrier.

abstract class Barrier {
  def name: String
  def size: Int
  def waitingList: List[Int]

  def wait(): Unit
  def wait(timeout: Duration): Unit
}
Topic Types
To implement the Barrier abstraction, it is sufficient to rely on the following topic types:

struct Barrier {
    string name;
    long long epoch;
    short count;
};
#pragma keylist Barrier name epoch

struct BarredProcess {
    string name;
    long long epoch;
    long pid;
};
#pragma keylist BarredProcess name epoch pid
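The condition under which the barrier opens follows directly from these types; a self-contained plain-Scala sketch (not the dada implementation):

```scala
case class Barrier(name: String, epoch: Long, count: Short)
case class BarredProcess(name: String, epoch: Long, pid: Int)

// The barrier for (name, epoch) opens once `count` distinct processes
// have published a matching BarredProcess sample
def barrierOpen(b: Barrier, waiting: List[BarredProcess]): Boolean =
  waiting.filter(p => p.name == b.name && p.epoch == b.epoch)
         .map(_.pid).distinct.size >= b.count
```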
Example: P1, P2, P3 wait on barrier “Foo” (epoch 1, count 3); the barrier opens once all three BarredProcess samples are present, and the waiting list is then cleared:
Barrier = [(“Foo”, 1, 3)]
BarredProcess = [(“Foo”, 1, 2)]
BarredProcess = [(“Foo”, 1, 2), (“Foo”, 1, 1)]
BarredProcess = [(“Foo”, 1, 2), (“Foo”, 1, 1), (“Foo”, 1, 3)]
BarredProcess = []
Example: the same barrier reused across epochs; after the barrier opens, the epoch is incremented and the cycle starts again:
Barrier = [(“Foo”, 1, 3)]
BarredProcess = [(“Foo”, 1, 1)]
BarredProcess = [(“Foo”, 1, 1), (“Foo”, 1, 2)]
BarredProcess = [(“Foo”, 1, 1), (“Foo”, 1, 2), (“Foo”, 1, 3)]
BarredProcess = []
Barrier = [(“Foo”, 2, 3)]
...
Concluding Remarks
DDS provides a computation/coordination model inspired by tuple spaces.
This is a symmetric and anonymous model of computation in which processes coordinate by reading and writing data in an eventually consistent data space.
While amenable to very high-performance implementations, this abstraction is quite powerful and greatly eases the development of distributed systems.