NICTA Copyright 2012 From imagination to impact
Design Consequences of DevOps Practices
Len Bass
Introductions
• Me
• You
Overview of Tutorial
• DevOps practices when taken to the limit for
internet scale organizations => continuous
delivery
• Economics of deployment when there are many
instances of services => rolling upgrade
Outline
• What is DevOps?
– Definitions
– Deriving architecturally significant requirement
• Architectural style elaboration
• Deployment
• Summary
What is DevOps?
• “DevOps is a software development method that stresses
communication, collaboration, and integration between software
developers and IT professionals” – Wikipedia
• From an architect’s or developer’s perspective it means treating
system administrators and operators as first-class stakeholders.
What is DevOps - 2
• DevOps is accompanied by a certain amount of
mysticism.
– “Be Self-Aware
– Be aware of a project’s maturity
– Be aware of others” http://architects.dzone.com/articles/zen-and-art-collaborative
• Similar to the early days of agile.
What problem is DevOps trying to solve?
• Poor communication between developers and
operations personnel
• Slow release schedule
• Limited capacity of operations staff
• Limited organizational insight into operations
Communication between developers and
operations staff
• Log messages
– What information is needed to do monitoring and
error diagnosis?
– Where is the best place to put particular types of
information?
• Release planning
– What is the scheduling for the next release?
– What capacity is needed for the next release?
– What are the infrastructure compatibility requirements
for the next release?
Release plan
1. Define and agree release and deployment plans with
customers/stakeholders.
2. Ensure that each release package consists of a set of related assets and
service components that are compatible with each other.
3. Ensure that integrity of a release package and its constituent components is
maintained throughout the transition activities and recorded accurately in
the configuration management system.
4. Ensure that all release and deployment packages can be tracked, installed,
tested, verified, and/or uninstalled or backed out, if appropriate.
5. Ensure that change is managed during the release and deployment
activities.
6. Record and manage deviations, risks, issues related to the new or changed
service, and take necessary corrective action.
7. Ensure that there is knowledge transfer to enable the customers and users
to optimise their use of the service to support their business activities.
8. Ensure that skills and knowledge are transferred to operations and support
staff to enable them to effectively and efficiently deliver, support and
maintain the service, according to required warranties and service levels.
*http://en.wikipedia.org/wiki/Deployment_Plan
Limited capacity of operations staff
• The number of physical servers that can be
administered by a single sys admin varies
with context, but some data points*:
– As low as 10 per admin
– Norm of 30 per admin at small-medium businesses
• Depends on whether admin performs just
maintenance or whether admin is also involved
in other projects
*http://www.computerworld.com.au/article/352635/there_best_practice_server_system_administrator_ratio_/
Limited Organizational insight into
operations
• An organization has budgetary insight into
operations.
• The impact of various operational activities on
business value is difficult to discern.
• This is a long running complaint that goes under
the heading of “aligning IT with the
business”. There are differences in
– Objectives
– Culture
– Incentives
DevOps can also be a role
• DevOps practices rely on a high degree of
automation and standardization of tools
• Someone has to be responsible for these tools.
• Person filling this role is “DevOps Engineer”
My Take on DevOps
• DevOps is a set of practices intended to
– Reduce management overhead
– Speed up deployment
– Move some (formerly) IT responsibilities to developers
– Increase communication between developers and operations
– Reduce operations costs
• Are there architecturally significant requirements
in these practices?
Architecturally significant requirement
• Speed up deployment through minimizing
synchronous coordination among development
teams.
• Synchronous coordination such as a meeting
adds time since it requires
– Ensuring that all parties are available
– Ensuring that all parties have the background to make
the coordination productive.
– Following up on decisions made during the meeting.
Summary of this section
• DevOps is a collection of practices designed, among
other things, to reduce time to deploy new features.
• Reducing time to deploy new features can be
accomplished by reducing synchronous coordination
among development teams
– This is an architecturally significant requirement that we will carry
forward.
Questions
Outline
• What is DevOps?
• Architectural Style Elaboration
– Micro Service Oriented Architecture
– Categories of design decisions
– How micro SOA specifies or delegates the categories
of design decisions
• Deployment
• Summary
Deployment pipeline
• Developers commit code
• Code is compiled
• Binary is processed by a build and unit test tool which
builds the service
• Integration tests are run followed by performance tests.
• Result is a machine image (assuming virtualization)
• The service (its image) is deployed to production.
Continuous Deployment
• Deployment pipeline is triggered by commit of
code
• All gates from one phase to the next are
automatic.
Requirements that drive the design in this
section
• Reduce synchronous communication among
development teams
– Continuous deployment
– Individual developers can commit to production (as
long as automated tests are passed)
• Scalability and performance
• Reliability
• A different ordering of requirements will produce
a different design
Architectural Style
• An architectural style (pattern) can specify many
decisions that might otherwise require
synchronous coordination among development
teams.
• The remainder of this section will justify why the
Micro Service Oriented Architecture style
satisfies our identified Architecturally Significant
Requirement.
Amazon design rules - 1
• All teams will henceforth expose their data and
functionality through service interfaces.
• Teams must communicate with each other
through these interfaces.
• There will be no other form of inter-process
communication allowed: no direct linking, no
direct reads of another team’s data store, no
shared-memory model, no back-doors
whatsoever. The only communication allowed is
via service interface calls over the network.
Amazon design rules - 2
• It doesn’t matter what technology they [services]
use.
• All service interfaces, without exception, must be
designed from the ground up to be
externalizable.
• Amazon is optimizing for its workload with these
requirements
– Mainly searching and browsing and web page
delivery
– Some transactions but not the dominant portion of the
workload.
Micro service oriented architecture
Service
• Each user request is
satisfied by some sequence
of services.
• Most services are not
externally available.
• Each service communicates
with other services through
service interfaces.
• Service depth may be 70,
e.g. LinkedIn
Relation of teams and services
• Each service is the responsibility of a single
development team
• Individual developers can deploy new version without
coordination with other developers.
• It is possible that a single development team is
responsible for multiple services
• Team size
– Coordination among team members
must be high bandwidth and low
overhead.
– Typically is done with small teams –
as in agile.
Design decisions
• Seven categories of design decisions*.
1. Allocation of responsibilities.
2. Coordination model.
3. Data model.
4. Management of resources.
5. Mapping among architectural elements.
6. Binding time decisions.
7. Choice of technology
*Software Architecture in Practice 3rd edition, Chap 4
Design decisions made or delegated by
choice of micro SOA
• Micro service oriented architecture either
specifies or delegates to the development team
five out of the seven categories of design
decisions.
1. Allocation of responsibilities.
2. Coordination model.
3. Data model.
4. Management of resources.
5. Mapping among architectural elements.
6. Binding time decisions.
7. Choice of technology
Roadmap for next several slides
• Micro service oriented architectural style will
either specify or allow delegation of five different
categories of design decisions.
• Each decision category will be discussed
separately.
Decision 1 – allocation of responsibilities
• This decision is neither specified by the style nor
delegated to an individual team.
• Development teams must coordinate to divide
responsibilities for features that are to be added.
• Typically this happens at the beginning of each
iteration cycle.
Decision 2 - coordination model
• Elements of service interaction
– Services communicate asynchronously through
message passing
– Each service could (in principle) be deployed
anywhere on the net.
• Latency requirements will probably force particular
deployment location choices.
• Services must discover location of dependent services.
Service discovery
• When an instance of a
service is launched, it
registers with a
registry/load balancer
• When a client wishes
to utilize a service, it
gets the location of an
instance from the
registry/load balancer.
• Eureka is an open
source registry/load
balancer
[Figure: an instance of a service registers with the registry/load balancer; a client queries the registry and then invokes the instance.]
Subtleties of registry/load balancer
• When multiple instances of the same service
have registered, the load balancer can rotate
through them to equalize number of requests to
each instance.
• Each instance must renew its registration
periodically (~90 seconds) so that the load balancer
does not route messages to a failed instance.
• Registry can keep other information as well as
address of instance. For example, version
number of service instance.
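The registration, renewal, and rotation behavior described above can be sketched as follows. This is a minimal in-memory illustration and not Eureka's API: the class `Registry`, its method names, the injected `clock`, and the fixed 90-second lease are all assumptions made for the example.

```python
import itertools
import time


class Registry:
    """Toy registry/load balancer: instances register, renew, and are rotated."""

    def __init__(self, lease_seconds=90.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds  # assumed renewal window
        self.clock = clock
        self.instances = {}  # service -> {address: (version, last_renewal)}
        self.counters = {}   # service -> round-robin counter

    def register(self, service, address, version):
        # Registering and renewing both refresh the instance's timestamp.
        self.instances.setdefault(service, {})[address] = (version, self.clock())

    renew = register

    def lookup(self, service):
        """Return the address of a live instance, rotating through them."""
        now = self.clock()
        live = sorted(
            addr
            for addr, (_, seen) in self.instances.get(service, {}).items()
            if now - seen <= self.lease_seconds  # skip instances that stopped renewing
        )
        if not live:
            raise LookupError(f"no live instance of {service}")
        counter = self.counters.setdefault(service, itertools.count())
        return live[next(counter) % len(live)]
```

Because the version number is stored next to the address, the same structure could also answer questions such as "which versions of this service are currently running".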
Decision 3 – Data model
• Schema based database system (relational).
Requires coordination.
– Development teams must coordinate when schema is
defined or modified.
– Schema definition happens once when the
architecture is defined. Schema modification should
be rare occurrence. Schema extensions (new fields or
tables) do not cause problems.
• NoSQL systems. Will still require coordination
over semantics of data.
– Data written by one service is typically read by others,
so they must agree on semantics.
Decision 4 – Resource Management
• Each instance of a service can process a certain
workload.
– Could be expressed in terms of requests
– Could be expressed in terms of resource
requirements – e.g. CPU
• Each client instance will require resources from
the service to process its requests.
• Service Level Agreements (SLAs) are a means
for automating the resource assumptions of the
clients and the resource requirements of the
service.
Managing SLAs
• A requirement for each service is to provide an SLA for
its response time in terms of the workload asked of it.
– E.g. For a workload of Y requests per second, I will
provide a response within X seconds.
• A requirement for each client is to provide an estimate of
the requests it will make of each dependent service.
– E.g. for each request I receive, I will make Z requests
for your service per second.
• This combination will enable a run time determination of
the number of instances required for each service to
meet its SLA.
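As a worked illustration of that run time determination, the sketch below combines the two declarations. It assumes the simplest possible model (load adds up linearly and every instance has the same capacity Y); the function name `instances_needed` and its parameters are hypothetical.

```python
import math


def instances_needed(client_rates, calls_per_request, capacity_per_instance):
    """Instances a service needs to stay within its SLA.

    client_rates: requests per second received by each client instance
    calls_per_request: Z, calls a client makes to the service per request
    capacity_per_instance: Y, requests per second one service instance can
        handle while still responding within X seconds
    """
    offered_load = sum(rate * calls_per_request for rate in client_rates)
    # Always keep at least one instance so the service stays reachable.
    return max(1, math.ceil(offered_load / capacity_per_instance))
```

For example, two client instances receiving 100 and 50 requests per second, each making Z = 3 calls, offer 450 requests per second; with Y = 200 that requires 3 instances.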
Provisioning new instances
• When the desired workload of a service is greater than
can be provided by the existing number of instances of
that service, new instances can be instantiated (at
runtime).
• Four possibilities for initiating new instance of a service:
1. Client. Client determines whether service is adequately
provisioned for its needs based on service SLA and the service's
current workload.
2. Service. Service determines whether it is adequately
provisioned based on number of requests it expects from
clients.
3. Registry/load balancer determines appropriate number of
instances of a service based on SLA and client instance
requests.
4. External entity can initiate creation of new instances
Responsibilities of development teams.
• SLA determination of a service is done by the
service development team prior to deployment
augmented by run time discovery.
• Determination of a client's requirements for a
service is done by the client’s development
team.
• Choice of which component has responsibility
for instantiating/deinstantiating instances of a
service is done as a portion of the architecture
definition.
Decision 5 – Mapping among architectural
elements
• Decisions about packaging modules into
processes and processes into a service are
delegated to the service development team.
• Decisions about deployment of a service will be
discussed in the next section.
Decision 6 – Binding time
• Configuration information binding time is
decided during the development of architecture
and the deployment pipeline.
• Other binding time decisions are delegated to
the service development team.
Decision 7 – Technology choices
• All technology choices are delegated to the
service development team.
Questions about Micro SOA
• /Q/ Isn’t it possible that different teams will implement the
same functionality, likely differently?
• /A/ Yes, but so what? Major duplications are avoided
through assignment of responsibilities to services. Minor
duplications are the price to be paid to avoid necessity
for synchronous coordination.
• /Q/ What about transactions?
• /A/ Micro SOA privileges flexibility above reliability and
performance. Transactions are recoverable through
logging of service interactions. This may introduce some
delays if failures occur.
Summary
• Synchronous coordination among development
teams is avoided by
– Using a micro SOA architecture
– Having the architecture specify the coordination
model and resource management techniques used by
the application.
– Delegating to the development team mapping,
binding time, and technology decisions.
– Having each service be the responsibility of a single
development team.
• Micro SOA privileges flexibility and development
team independence over performance and
reliability.
Questions
Outline
• What is DevOps?
• Overall Architectural Style
• Deployment
– Deployment strategies
– Maintaining Logical Consistency.
• Summary
Deployment Overview
• Multiple instances of a service are executing
• Red is service being replaced with new version
• Blue are clients
• Green are dependent services
[Figure: service instances labeled VA, VB, VB, VB; UAT / staging / performance tests]
Deployment goal and constraints
• Goal of a deployment is to move from current
state (N instances of version A of a service) to a
new state (N instances of version B of a service)
• Constraints:
– Any development team can deploy their service at
any time. I.e. New version of a service can be
deployed either before or after a new version of a
client. (no synchronization among development
teams)
– It takes time to replace one instance of version A with
an instance of version B (order of minutes)
– Service to clients must be maintained while the new
version is being deployed.
Deployment strategies
• Two basic all-or-nothing strategies
– Big Flip – leave N instances with version A as they
are, allocate and provision N instances with version B
and then switch to version B and release instances
with version A.
– Rolling Upgrade – allocate one instance, provision it
with version B, release one version A instance.
Repeat N times.
• Other deployment topics
– Partial strategies (canary testing, A/B testing). We
will discuss them later. For now we are discussing
all-or-nothing deployment.
– Rollback
– Packaging services into machine images
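The two all-or-nothing strategies can be compared with a small simulation. This is only a sketch: the fleet is modeled as a list of version labels, and the function names are hypothetical. No real provisioning takes place.

```python
def big_flip(n):
    """Return (final fleet, peak instance count) for a big flip."""
    fleet = ["A"] * n
    fleet += ["B"] * n                       # provision N new instances alongside the old
    peak = len(fleet)                        # 2N instances exist before the switch
    fleet = [v for v in fleet if v == "B"]   # switch to B, then release version A
    return fleet, peak


def rolling_upgrade(n):
    """Return (final fleet, peak instance count, whether versions were mixed)."""
    fleet = ["A"] * n
    peak, mixed = len(fleet), False
    for _ in range(n):
        fleet.append("B")                    # allocate and provision one version B instance
        peak = max(peak, len(fleet))
        mixed = mixed or len(set(fleet)) > 1 # both versions are serving at this point
        fleet.remove("A")                    # release one version A instance
    return fleet, peak, mixed
```

With N = 4, the big flip peaks at 8 instances, while the rolling upgrade peaks at 5 but serves both versions at once during the upgrade.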
Trade offs - Big Flip and Rolling Upgrade
• Big Flip
– Only one version available
to the client at any
particular time.
– Requires 2N instances
(additional costs)
• Rolling Upgrade
– Multiple versions are
available for service at the
same time
– Requires N+1 instances.
• Rolling upgrade is
commonly preferred.
[Figure: Rolling Upgrade in EC2. Boxes: Update Auto Scaling Group; Sort Instances; Remove & Deregister Old Instance from ELB; Confirm Upgrade Spec; Terminate Old Instance; Wait for ASG to Start New Instance; Register New Instance with ELB]
Types of failures during rolling upgrade
• Rolling upgrade failure
– Provisioning: see references at end
– Logical failure: inconsistencies, to be discussed
– Instance failure: handled by Auto Scaling Group in EC2
What are the problems with Rolling
Upgrade?
• Recall that any development team can deploy
their service at any time.
• Three concerns
– Maintaining consistency between different versions of
the same service when performing a rolling upgrade
– Maintaining consistency among different services
– Maintaining consistency between a service and
persistent data
Maintaining consistency between different
versions of the same service
• Key idea – differentiate between installing a new
version and activating a new version
• Involves “feature toggles” (described
momentarily)
• Sequence
– Develop version B with new code under control of
feature toggle
– Install each instance of version B with the new code
toggled off.
– When all of the instances of version A have been
replaced with instances of version B, activate new
code through toggling the feature.
Issues
• What is a feature toggle?
• How do I manage features that extend across
multiple services?
• How do I activate all relevant instances at once?
Feature toggle
• Place feature dependent new code inside of an
“if” statement where the code is executed if an
external variable is true. Removed code would
be the “else” portion.
• Used to allow developers to check in
uncompleted code. Uncompleted code is
toggled off.
• During deployment, until new code is activated,
it will not be executed.
• Removing feature toggles when a new feature
has been committed is important.
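A minimal sketch of a feature toggle, assuming the external variable is a dictionary entry. The toggle name `new_pricing` and the pricing logic are hypothetical and only illustrate the "if"/"else" structure.

```python
FEATURE_TOGGLES = {"new_pricing": False}  # external variable; the name is hypothetical


def price(amount, toggles=FEATURE_TOGGLES):
    if toggles.get("new_pricing", False):
        # New code: installed with version B but dormant until activated.
        return round(amount * 0.9, 2)
    else:
        # Old code: deleted together with the toggle once the feature is committed.
        return amount
```

Calling `price(100.0)` runs the old path; passing `{"new_pricing": True}` runs the new one without redeploying anything.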
Multi service features
• Most features will involve multiple services.
• Each service has some code under control of a
feature toggle.
• Activate feature when all instances of all
services involved in a feature have been
installed.
– Maintain a catalog with feature vs service version
number.
– A feature toggle manager determines when all old
instances of each service have been replaced. This
could be done using registry/load balancer.
– The feature toggle manager activates the feature.
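The check performed by the feature toggle manager might look like the sketch below. It assumes versions are comparable integers, that the catalog maps each feature to the minimum version required of every involved service, and that the registry can report the versions of live instances. All names are hypothetical.

```python
def ready_to_activate(required_versions, running_versions):
    """True when no live instance of any involved service is below the required version.

    required_versions: catalog entry for one feature, service -> minimum version
    running_versions: registry view, service -> versions of its live instances
    """
    return all(
        bool(running_versions.get(service))            # at least one live instance
        and min(running_versions[service]) >= minimum  # and no old instance remains
        for service, minimum in required_versions.items()
    )
```

The manager would poll this check during a rolling upgrade and flip the toggle only once it returns true.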
Activating feature
• The feature toggle manager changes the value
of the feature toggle. Two possible techniques to
get new value to instances.
– Push. Broadcasting the new value will instruct each
instance to use new code. If a lag of several seconds
between the first service to be toggled and the last
can be tolerated, there is no problem. Otherwise
synchronizing value across network must be done.
– Pull. Querying the manager by each instance to get
latest value may cause performance problems.
• A coordination mechanism such as Zookeeper
will overcome both problems. I will discuss
Zookeeper if I have time at the end.
Maintaining consistency across versions
(summary)
• Install all instances before activating any new
code
• Use feature toggles to activate new code
• Use feature toggle manager to determine when
to activate new code
• Use Zookeeper to coordinate activation with low
overhead
Maintaining consistency among different
services
• Use case:
– Wish to deploy new version of service A without
coordinating with development team for clients of
service A.
• I.e. new version of service A should be backward compatible
in terms of its interfaces.
• May also require forward compatibility in certain
circumstances, e.g. rollback
Achieving Backwards Compatibility
• APIs can be extended but must always be
backward compatible.
• Leads to a translation layer
[Figure: clients call external APIs (unchanging but with ability to extend or add new ones); a translation layer maps them to internal APIs (changes require changes to translation layer but do not propagate further).]
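A sketch of such a translation layer, under assumed names: `get_customer` stands for a stable external API and `_internal_lookup_v2` for an internal API that is free to change. Neither comes from a real system.

```python
def _internal_lookup_v2(customer_id, include_inactive):
    # Hypothetical internal API; its signature and field names may change freely.
    return {"id": customer_id, "active": True, "tier": "gold"}


def get_customer(customer_id, fields=None):
    """External API: parameters may be added, but existing ones never change."""
    record = _internal_lookup_v2(customer_id, include_inactive=False)
    # Translation: map internal names onto the stable external contract.
    external = {"customer_id": record["id"], "is_active": record["active"]}
    if fields is not None:  # a later extension; older clients simply omit it
        external = {k: v for k, v in external.items() if k in fields}
    return external
```

If the internal lookup is replaced, only the body of `get_customer` changes; clients written against the original signature keep working.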
What about dependent services?
• Dependent services that are within your control
should maintain backward compatibility
• Dependent services not within your control (third
party software) cannot be forced to maintain
backward compatibility.
– Minimize impact of changes by localizing interactions
with third party software within a single module.
– Keeping services independent and packaging as
much as possible into a virtual machine means that
only third party software accessed through message
passing will cause problems.
Forward Compatibility
• Gracefully handle unknown calls and database schema
information
– Suppose your service receives a method call it does
not recognize. It could be intended for a later version
where this method is supported.
– Suppose your service retrieves a database table with
an unknown field. It could have been added to
support a later version.
• Forward compatibility allows a version of a service to be
upgraded or rolled back independently from its clients. It
involves both
– The service handling unrecognized information
– The client handling returns that indicate unrecognized
information.
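Both halves of forward compatibility can be sketched in a few lines. The method names, field names, and reply format below are assumptions chosen for the example, not a prescribed protocol.

```python
KNOWN_FIELDS = {"id", "name"}
HANDLERS = {"get_name": lambda row: row["name"]}


def handle_call(method, row):
    """Service side: tolerate unknown methods and unknown fields."""
    if method not in HANDLERS:
        # Possibly a call meant for a later version: report it, do not crash.
        return {"status": "unrecognized", "method": method}
    known = {k: v for k, v in row.items() if k in KNOWN_FIELDS}  # ignore extra fields
    return {"status": "ok", "result": HANDLERS[method](known)}


def client_call(method, row, fallback=None):
    """Client side: handle a reply that indicates unrecognized information."""
    reply = handle_call(method, row)
    return reply["result"] if reply["status"] == "ok" else fallback
```

A client that calls a method the rolled-back service no longer knows receives its fallback value instead of an error.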
Maintaining consistency between a service
and persistent data
• Assume new version is correct – we will discuss the
situation where it is incorrect in a moment.
• Inconsistency in persistent data can come about
because data schema or semantics change.
• Effect can be minimized by the following practices (if
possible).
– Only extend schema – do not change semantics of
existing fields. This preserves backwards
compatibility.
– Treat schema modifications as features to be toggled.
This maintains consistency among various services
that access data.
I really must change the schema
• In this case, apply pattern for backward
compatibility of interfaces to schemas.
• Use features of database system (I am
assuming a relational DBMS) to restructure data
while maintaining access to not yet restructured
data.
Summary of consistency discussion so far.
• Feature toggles are used to maintain
consistency within instances of a service
• Backward compatibility pattern is used to
maintain consistency between a service and its
clients.
• Discouraging modification of schema will
maintain consistency between services and
persistent data.
– If schema must be modified, then synchronize
modifications with feature toggles.
Canary testing
• Canaries are a small number of instances of a new
version placed in production in order to perform live
testing in a production environment.
• Canaries are observed closely to determine whether the
new version introduces any logical or performance
problems. If not, roll out new version globally. If so, roll
back canaries.
• Named after canaries
in coal mines.
Implementation of canaries
• Designate a collection of instances as canaries. They do
not need to be aware of their designation.
• Designate a collection of customers as testing the
canaries. Can be, for example
– Organizationally based
– Geographically based
• Then
– Activate feature or version to be tested for canaries.
Can be done through feature activation
synchronization mechanism
– Route messages from canary customers to canaries.
Can be done through making registry/load balancer
canary aware.
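A canary-aware registry/load balancer could route as in the sketch below. The class name, the instance identifiers, and the idea of keying on a customer name are assumptions for illustration.

```python
import itertools


class CanaryAwareRouter:
    """Toy registry/load balancer that sends canary customers to canary instances."""

    def __init__(self, regular, canaries, canary_customers):
        self.pools = {
            False: itertools.cycle(regular),   # rotate through regular instances
            True: itertools.cycle(canaries),   # rotate through canary instances
        }
        self.canary_customers = set(canary_customers)

    def route(self, customer):
        # Instances are unaware of their designation; only the router knows it.
        return next(self.pools[customer in self.canary_customers])
```

The same structure would serve A/B testing, with the two pools holding variants A and B.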
A/B testing
• Suppose you wish to test user response to a
system variant. E.g. UI difference or marketing
effort. A is one variant and B is the other.
• You simultaneously make available both
variants to different audiences and compare the
responses.
• Implementation is the same as canary testing.
Rollback
• New versions of a service may be unacceptable
either for logical or performance reasons.
• Two options in this case
– Roll back (undo deployment)
– Roll forward (discontinue current deployment and
create a new release without the problem).
• Decision to roll back or roll forward is almost
never automated because there are multiple
factors to consider.
– Forward or backward recovery
– Consequences and severity of problem
– Importance of upgrade
States of upgrade.
• An upgrade can be in one of two states when an
error is detected.
– Installed (fully or partially) but new features not
activated
– Installed and new features activated.
Possibilities
• Initially we will discuss the situation where
persistent data is correct. Later we will
discuss incorrect persistent data.
• Installed but new features not activated
– Error must be in backward compatibility
– Halt deployment
– Roll back by reinstalling old version
– Roll forward by creating new version and installing
that
• Installed with new features activated
– Turn off new features
– If that is insufficient, we are at prior case.
Persistent data
• Keep log of user requests (each with their own
identification)
• Identification of incorrect persistent data
– Tag each data item with metadata that identifies
• the service and version that wrote that data
• the user request that caused the data to be written
• Correction of incorrect persistent data (simplistic
version)
– Remove data written by incorrect version of a service
– Install correct version
– Replay user requests that caused incorrect data to be
written
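The simplistic correction procedure can be sketched as follows, assuming each stored item carries the metadata described above. The store layout, the `writer` tuple, and the `fixed_service` callable are hypothetical.

```python
def correct_persistent_data(store, request_log, bad_writer, fixed_writer, fixed_service):
    """Remove items written by a faulty version, then replay their requests.

    store: key -> {"value", "writer", "request_id"}; writer is (service, version)
    request_log: request_id -> original user request
    fixed_service: callable(request) -> (key, value), the corrected version
    """
    suspect = {k: v for k, v in store.items() if v["writer"] == bad_writer}
    for key in suspect:
        del store[key]  # remove data written by the incorrect version
    # Replay each affected request once, in request-id order.
    for request_id in sorted({v["request_id"] for v in suspect.values()}):
        key, value = fixed_service(request_log[request_id])
        store[key] = {"value": value, "writer": fixed_writer, "request_id": request_id}
    return store
```

Data written by other versions is left untouched, which is why the metadata tags matter.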
Persistent data correction problems
I will not present good solutions to these problems.
1. Replaying user requests may involve requesting
features that are not in the current version.
– Requests can be queued until they can be correctly
re-executed
– User can be informed of error (after the fact)
2. There may be domino effects from incorrect data, i.e.
other calculations may be affected.
– Keep pedigree for data items that allows determining
which additional data items are incorrect. Remove
them and regenerate them when requests replayed.
– Data that escaped the system, e.g. sent to other
system or shown to a user, cannot be retrieved.
Summary of rollback options
• Can roll back or roll forward
• Rolling back without consideration of persistent
data is relatively straightforward.
• Managing erroneous persistent data is
complicated and will likely require manual
processing.
Packaging of services
• The last portion of the deployment pipeline is
packaging services into machine images for
installation.
• Two dimensions
– Flat vs deep service hierarchy
– One service per virtual machine vs many services per
virtual machine
Flat vs Deep Service Hierarchy
• Trading off independence of teams and
possibilities for reuse.
• Flat Service Hierarchy
– Limited dependence among services & limited
coordination needed among teams
– Difficult to reuse services
• Deep Service Hierarchy
– Provides possibility for reusing services
– Requires coordination among teams to discover
reuse possibilities. This can be done during
architecture definition.
Services per VM Image
[Figure: with one service per VM, a service is developed and embedded into its own VM image; with multiple services per VM, Service 1 and Service 2 are each developed and embedded into the same VM image.]
One Possible Race Condition with Multiple
Services per VM
Initial state: VM image with Version N of Service 1 and Version N of Service 2
Timeline:
• Developer 1: build new image with VN+1|VN; begin provisioning process with new image
• Developer 2: build new image with VN|VN+1; begin provisioning process with new image without new version of Service 1
Result: Version N+1 of Service 1 is not updated until next build of VM image
Could be prevented by VM image build tool
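One way a build tool might prevent this race is to resolve service versions from a single shared manifest under a lock, instead of from each developer's possibly stale view. This approach and the names `ImageBuildTool` and `build` are assumptions for illustration; the slides do not say how the tool should do it.

```python
import threading


class ImageBuildTool:
    """Toy build tool that serializes builds against one shared manifest."""

    def __init__(self, manifest):
        self._manifest = dict(manifest)  # service -> latest committed version
        self._lock = threading.Lock()

    def build(self, service, new_version):
        """Record the new version and return the contents of the built image."""
        with self._lock:
            # Update the manifest first, then build from the latest version of
            # every service rather than from the caller's snapshot.
            self._manifest[service] = new_version
            return dict(self._manifest)
```

With this scheme, the second developer's image contains version N+1 of both services regardless of which build started first.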
Another Possible Race Condition with Multiple
Services per VM
Initial state: VM image with Version N of Service 1 and Version N of Service 2
Timeline:
• Developer 1: build new image with VN+1|VN; begin provisioning process with new image, which overwrites image created by developer 2
• Developer 2: build new image with VN+1|VN+1; begin provisioning process with new image
Result: Version N+1 of Service 2 is not updated until next build of VM image
Could be prevented by provisioning tool
Trade offs
• One service per VM
– A message from one service to another must go
through an inter-VM communication mechanism, which
adds latency
– No possibility of a deployment race condition
• Multiple services per VM
– Inter-VM communication is reduced, which
reduces latency
– Adds the possibility of a race condition caused by
simultaneous deployment
Summary of Deployment
• Rolling upgrade is a common deployment strategy
• Introduces requirements for consistency among
– Different versions of the same service
– Different services
– Services and persistent data
• Other deployment considerations include
– Canary deployment
– A/B testing
– Rollback
Question
Zookeeper
• What purpose does Zookeeper serve?
• Use cases
– Leader election
– Group membership
– Distributed locks
– Synchronization
– Configuration
• In our case, we will use Zookeeper to manage
activating features
Distributed applications
• Zookeeper provides a (mostly) consistent data
structure to every instance of a distributed
application.
– “Mostly” means consistent to within the eventual
consistency lag, which is small
• Zookeeper manages failure as well as
consistency.
– Done using a Paxos-like consensus algorithm (Zab).
• Zookeeper guarantees that updates are applied in a
single total order and that each client’s requests are
processed in FIFO order
Model
• Zookeeper maintains a file-system-like data structure
– Hierarchical
– Data in every node (called a znode)
– Amount of data in each node assumed small (< 1 MB)
– Intended for metadata
• Configuration
• Location
• Group
Zookeeper znode structure
/
<data>
/b1
<data>
/b1/c1
<data>
/b1/c2
<data>
/b2
<data>
/b2/c1
<data>
API
Function        Type
create          write
delete          write
exists          read
get children    read
get data        read
set data        write
+ others
• All calls are atomic. A read returns a complete view of
the state, never a partial one. A write either succeeds or
fails, and a failed write has no side effects.
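The data model and the six calls listed above can be illustrated with a small in-memory toy. This is only a sketch for intuition: `ZNodeTree` is a hypothetical class, not the Zookeeper client API, and it ignores sessions, watches, and replication.

```python
class ZNodeTree:
    """In-memory toy model of a znode hierarchy (not the Zookeeper client API)."""

    def __init__(self):
        self._data = {"/": b""}  # path -> data stored at that znode

    @staticmethod
    def _parent(path):
        return path.rsplit("/", 1)[0] or "/"

    def create(self, path, data=b""):
        # A failed create raises before any state is changed (no side effects).
        if path in self._data:
            raise KeyError(f"node exists: {path}")
        if self._parent(path) not in self._data:
            raise KeyError(f"no parent for: {path}")
        self._data[path] = data

    def delete(self, path):
        if self.get_children(path):
            raise ValueError(f"node has children: {path}")
        del self._data[path]

    def exists(self, path):
        return path in self._data

    def get_children(self, path):
        return sorted(p.rsplit("/", 1)[1] for p in self._data
                      if p != "/" and self._parent(p) == path)

    def get_data(self, path):
        return self._data[path]

    def set_data(self, path, data):
        if path not in self._data:
            raise KeyError(f"no such node: {path}")
        self._data[path] = data


tree = ZNodeTree()
tree.create("/b1", b"top-level data")
tree.create("/b1/c1")
tree.create("/b1/c2")
tree.set_data("/b1/c1", b"config")
```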
Use Case – leader election
• Many distributed applications have master
(leader)/slave structure
– One master, many slaves
– Master
• Sends work to slaves
• Monitors health of slaves and creates new ones as needed.
Using Zookeeper to elect master
• Suppose the master fails. A new master must then
be chosen.
• All candidates issue a “create” call with the node
name “master”.
• Only one of these create requests will succeed;
the rest will fail. This is one of the consistency
guarantees enforced by Zookeeper.
• The client that successfully creates the znode named
“master” becomes the new master.
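The election rule above (only one create succeeds) can be simulated locally. The sketch assumes a hypothetical `ToyCoordinator` whose `create` is atomic; it stands in for the coordination service and is not the Zookeeper client API.

```python
import threading


class ToyCoordinator:
    """In-memory stand-in for the coordination service; create() is atomic."""

    def __init__(self):
        self._nodes = {}
        self._lock = threading.Lock()

    def create(self, path, data):
        """Create `path` holding `data`; return False if it already exists."""
        with self._lock:
            if path in self._nodes:
                return False
            self._nodes[path] = data
            return True

    def get_data(self, path):
        with self._lock:
            return self._nodes[path]


def run_election(coordinator, candidate_ids):
    """Every candidate races to create /master; return the ids whose create succeeded."""
    winners = []

    def campaign(candidate_id):
        if coordinator.create("/master", candidate_id):
            winners.append(candidate_id)

    threads = [threading.Thread(target=campaign, args=(cid,)) for cid in candidate_ids]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return winners


coordinator = ToyCoordinator()
winners = run_election(coordinator, ["a", "b", "c"])
```

Which candidate wins depends on thread scheduling, but exactly one create succeeds, and the data stored at /master identifies the new master.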
Using Zookeeper to manage group membership
• App connects to Zookeeper
– Get list of Zookeeper servers
– Create session (if server fails – automatic fail over)
• Known group name
– Create /group_name
– If already exists get a failure
• Client joins group by creating /group_name/my_id
• Client can list children of /group_name and get members of group.
• Watcher will inform client if group members fail or leave.
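The membership steps above can be mimicked with a toy in which each member node is tied to a session and disappears when that session ends. `ToyGroup` and its methods are hypothetical names for illustration only. Unlike this toy, whose callbacks stay registered, Zookeeper watches fire once and must be set again.

```python
class ToyGroup:
    """In-memory toy of group membership; member nodes vanish when their session ends."""

    def __init__(self, group_name):
        self.path = "/" + group_name
        self._members = {}   # member id -> session id that owns the member node
        self._watchers = []  # callbacks invoked with the new member list

    def join(self, member_id, session_id):
        if member_id in self._members:
            raise KeyError(f"{self.path}/{member_id} already exists")
        self._members[member_id] = session_id
        self._notify()

    def members(self):
        return sorted(self._members)

    def watch(self, callback):
        self._watchers.append(callback)

    def session_closed(self, session_id):
        """Simulate a client failing or leaving: drop every node its session owns."""
        self._members = {m: s for m, s in self._members.items() if s != session_id}
        self._notify()

    def _notify(self):
        for callback in self._watchers:
            callback(self.members())


events = []
group = ToyGroup("workers")
group.watch(events.append)       # record every membership change
group.join("w1", session_id=1)
group.join("w2", session_id=2)
group.session_closed(2)          # w2 fails, so its node disappears
```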
Using Zookeeper to manage distributed locks - 1
• Naïve solution
– All clients attempt to create /lockname
– Successful client has lock.
• Client will delete znode when finished with lock
• Znode will be deleted if client fails
– Unsuccessful clients will watch /lockname. If it is
deleted then they will attempt to create it.
– Repeat
Distributed locks – 2
• Problem with naïve solution is “herd effect”.
– If many clients all wake up and try to grab lock at
once there will be an impact on the system load.
• Better solution is for each client to watch its
predecessor.
– Each client creates its own node under /lockname;
Zookeeper enforces the order
– When the predecessor deletes its node, the client
acquires the lock.
– If the predecessor fails, the client is informed and
watches the predecessor’s predecessor, and so on.
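The predecessor-watching scheme can be sketched as an ordered queue of waiting clients. `ToyLock` is a hypothetical in-memory model, not the Zookeeper API; sequence numbers stand in for the ordering that Zookeeper enforces.

```python
class ToyLock:
    """In-memory toy of a lock queue: one ordered node per waiting client."""

    def __init__(self):
        self._next_seq = 0
        self._queue = {}  # sequence number -> client id

    def enqueue(self, client_id):
        """Add a node for the client and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._queue[seq] = client_id
        return seq

    def holder(self):
        """The client with the lowest sequence number holds the lock."""
        return self._queue[min(self._queue)] if self._queue else None

    def predecessor(self, seq):
        """Node this client should watch, or None if the client holds the lock."""
        earlier = [s for s in self._queue if s < seq]
        return max(earlier) if earlier else None

    def remove(self, seq):
        """Called when a client releases the lock or fails."""
        del self._queue[seq]


lock = ToyLock()
a, b, c = (lock.enqueue(name) for name in ("A", "B", "C"))
watched_by_c_before = lock.predecessor(c)  # C watches B
lock.remove(b)                             # B fails while waiting
watched_by_c_after = lock.predecessor(c)   # C now watches A
lock.remove(a)                             # A releases the lock
```

Each removal wakes at most one waiting client, which is what avoids the herd effect.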
Using Zookeeper for distributed synchronization
• Create new synchronization client.
– It creates synchronization node
– Other clients register on synchronization node at beginning of computation.
– At end of computation they remove themselves from synchronization node
• Synchronization client watches clients that have registered themselves. If one fails, it removes it from synchronization node.
• When synchronization node is empty, synchronization client deletes it and other clients (who are watching) can proceed.
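The synchronization steps above amount to a barrier that opens once every registered client has finished or been removed. A minimal sketch, with the hypothetical name `ToyBarrier` standing in for the synchronization node and its client:

```python
class ToyBarrier:
    """In-memory toy of the synchronization node described above."""

    def __init__(self):
        self._registered = set()  # clients still computing
        self.open = False         # True once the synchronization node is deleted

    def register(self, client_id):
        self._registered.add(client_id)

    def leave(self, client_id):
        """A client finished its computation, or was removed after failing."""
        self._registered.discard(client_id)
        if not self._registered:
            self.open = True  # node is empty: delete it so watchers can proceed


barrier = ToyBarrier()
for client in ("c1", "c2", "c3"):
    barrier.register(client)
barrier.leave("c1")
barrier.leave("c2")
open_before_last = barrier.open
barrier.leave("c3")  # c3 failed; the synchronization client removes it
```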
Using Zookeeper for configuration
Each client records configuration information as
data in a child node it creates under a main
configuration node.
Checking configuration is a matter of getting data
from all of the children of the configuration node.
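As a rough illustration of this pattern, the sketch below uses a plain dictionary in place of the main configuration node. The names `config_root`, `record_config`, and `check_config`, and the JSON encoding, are assumptions made for the example.

```python
import json

config_root = {}  # child node name -> serialized configuration data


def record_config(client_id, settings):
    """Each client writes its configuration as data in its own child node."""
    config_root[client_id] = json.dumps(settings, sort_keys=True).encode()


def check_config():
    """Read the data of every child of the configuration node."""
    return {child: json.loads(data) for child, data in config_root.items()}


record_config("client-1", {"threads": 4})
record_config("client-2", {"threads": 8})
all_settings = check_config()
```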
Using Zookeeper to synchronize activation
of features.
• Feature manager creates Znode containing
– <feature flag name, feature flag value>
– Written only when all services available.
• Service retrieves feature flag value from Znode
– if Znode_read_value(feature flag name) then
feature is active
else
feature is inactive
• Feature flag value guaranteed to be consistent across
services.
• Latency is low (order of microseconds) since
Zookeeper keeps data structures in memory.
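The activation protocol above can be sketched as follows. `FeatureManager` and its methods are hypothetical, and a dictionary stands in for the znode that holds the <feature flag name, feature flag value> pairs.

```python
class FeatureManager:
    """Toy feature manager: stores <flag name, flag value> pairs in one shared place."""

    def __init__(self, required_services):
        self._required = set(required_services)
        self._available = set()
        self._flags = {}  # flag name -> bool, standing in for znode data

    def service_available(self, service):
        self._available.add(service)

    def activate(self, flag_name):
        """Write the flag only when every required service is available."""
        if self._available >= self._required:
            self._flags[flag_name] = True
            return True
        return False

    def read_value(self, flag_name):
        """What a service calls to decide whether the feature is active."""
        return self._flags.get(flag_name, False)


manager = FeatureManager(["cart", "checkout"])
manager.service_available("cart")
early = manager.activate("new_pricing")   # checkout is not yet available
manager.service_available("checkout")
late = manager.activate("new_pricing")    # all services are available
```

Every service reads the same stored value, so all of them see the feature switch on together.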
Summary of tutorial
• DevOps practices lead to requirement to
minimize inter team coordination
• Continuous deployment has no human
intervention from developer commit until
deployment to production
• Micro SOA architectural style determines or
delegates 5 of 7 design decision categories
• Deployment strategies raise issues of
consistency. Separation of installation and
activation enables turning features on or off.
• Zookeeper is one tool to manage synchronizing
the activation of features.
NICTA Team
• Anna Liu
• Alan Fekete
• Min Fu
• Daniel Sun
• Hiroshi Wada
• Ingo Weber
• Xiwei Xu
• Liming Zhu
Readings
• http://www.slideshare.net/lenbass/what-is-dev-ops-for-review
• http://www.slideshare.net/lenbass/02-team-practcies-and-overall-architecture
• http://www.slideshare.net/lenbass/03-build-structure-and-testing
• NICTA research papers.
https://ssrg.nicta.com.au/projects/cloud/