WICSA 2012 tutorial

NICTA Copyright 2012 From imagination to impact

Architecting Highly Dependable Cloud Applications

Anna Liu

Len Bass


The Land Down Under


Sydney


About NICTA

National ICT Australia

• Federal and state funded research company established in 2002

• Largest ICT research resource in Australia

• National impact is an important success metric

• ~700 staff/students working in 5 labs across major capital cities

• 7 university partners

• Providing R&D services, knowledge transfer to Australian (and global) ICT industry

NICTA technology is in over 1 billion mobile phones


Research Areas at NICTA

Networks

Aruna Seneviratne

Anna Liu, Gernot Heiser

Software Systems

Machine Learning

Bob Williamson

Computer Vision

Nick Barnes, Richard Hartley, Peter Corke

Rob Evans

Control & Signal Processing

Mark Wallace, Sylvie Thiebaux, Toby Walsh

Optimisation


Our applied R&D capability spans cloud computing, web, SOA, distributed systems, data management, analytics, performance monitoring, DR, automated reasoning, ontologies, AI…

Intelligent management

Business continuity

Dynamic

Cost optimised

High availability

High performance

Disaster recovery

Systems resilience

Real-time monitoring

Actionable analytics

Hybrid cloud

Onsite/offsite

Elastic

Real time

Our team’s mission: help enterprises take full advantage as software extends into cloud!


Who are we?

• Anna

• Len


Who are you?

What would you like from this tutorial?



Outline

• Introduction

• Cloud Computing Platforms

• Nature and causes of outages and down-time

• Characteristics of Dependability in Cloud

• Achieving high dependability

• The importance of stateless components

• Techniques to handle performance problems

• Techniques to handle availability problems

• Techniques to handle security problems

• Case Studies: Netflix, Yuruware

• Conclusions


Introduction

• Intro to the cloud – xxx as a service, regions/zones

• What is dependability

• Why is dependability a concern in the cloud

• Types of dependability and high level problem descriptions
– Performance
– Availability
– Security


Introduction to Cloud Computing


What is Cloud Computing?

Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

This cloud model is composed of five essential characteristics, three service models, and four deployment models.

- US National Institute of Standards and Technology


Characterising Cloud Computing

Elasticity


Five Characteristics – NIST Definition

• On-demand Self-Service
– A consumer can provision computing capabilities without human interaction

• Broad network access
– Computing capabilities are available over the network and accessed through standard mechanisms

• Resource pooling
– Provider’s computing resources are pooled to serve multiple consumers with different resources dynamically assigned according to consumers’ demands

• Rapid elasticity
– Computing capabilities can be rapidly and elastically provisioned to quickly scale out and rapidly released to scale in

• Measured service
– Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer


Leading Provider: Amazon EC2

Let’s see how Amazon EC2, a leading commercial cloud, looks

I want my cloud!


1. Grab your credit card and create an account. (10 min) Then, access to a console

2. Select where you want to create your virtual machines (US East, US West, Ireland or Singapore)

3. Hit this button


4. Select a machine image
• Many pre-configured images are available
• You can register your machine images as well


5. Determine the amount of resources to allocate
• <1.0 GHz CPU + 600 MB RAM: 0.01 USD/hour
• 1.0 GHz CPU + 1.7 GB RAM: 0.04 USD/hour
• 3.0 GHz x 8 CPUs + 68 GB RAM: 1.1 USD/hour
• You can pay Win/SQL Server license fees per hour


6. Define a set of access control rules


7. Done! (< 5 minutes in total)
• You have your virtual machine at ec2-184-74-14-28.us-west-1.compute.amazonaws.com

I got my virtual machine!


8. Connect to my virtual machine
• Just SSH to the address
• You have root access!!

You’re in an Amazon Datacenter in CA

This is my desktop in Sydney


If you like Windows, just launch a Windows virtual machine and remote-desktop to it

You’re in an Amazon Datacenter in NV

This is my desktop in Sydney

Connected through a VPN connection


9. Terminate or hibernate virtual machines when they are not in use
• In some systems, we use a script to hibernate virtual machines at 8:00 PM
• Restart instances in the morning if necessary. It takes just a couple of minutes


10. Check a bill in real-time
• Hours to run virtual machines
• Network in/out
• VPN
• Disk access
• # of requests made
• …


Three Service Models – NIST definition

[Figure: the three service models layered on the provider's datacenter infrastructure: Infrastructure as a Service, Platform as a Service, Software as a Service. Labels: "Technology exposed to customers", "Providers"]


Three Delivery Models

• Infrastructure as a Service (IaaS)
– The consumer has control over operating systems, storage and deployed applications

• Platform as a Service (PaaS)
– Consumers can deploy applications created using programming languages and tools supported by the provider (e.g., Java Servlet)
– The provider shields the complexity of its infrastructure
  • Scale up/down, load balancing, replication, disaster recovery, database management, …

• Software as a Service (SaaS)
– Consumers use the provider’s applications
– The consumer does not manage the underlying cloud infrastructure


Leading Provider: Google App Engine

Let’s see how Google App Engine, a leading commercial PaaS, looks

I want my PaaS!


1. Create an account. (5 min) GAE offers a large amount of quota for free

2. Write an application using GAE’s framework


3. Deploy your application on GAE!

Scale up/down, load balancing, replication, disaster recovery, database management, … many functions are implemented by GAE’s framework


4. Check your resource usage (CPU, storage, # of API calls, …)

Pay only when usage exceeds the free quota


Provider Services - 1

• Consumer is allocated some number of virtual machine instances.
– Number of instances is under the control of the consumer
– Provider allows consumer to set rules for “autoscaling”, automatically creating and removing instances
– When a new instance is launched it has
  • Software as specified by either the consumer or the provider
  • Private IP address available only from within cloud. Private IP address exists for life of instance and will not change
  • Public IP address. Addressable from outside the cloud. May change under certain circumstances


Provider Services – 2

• Cloud data centers – hosted in different geographic regions
– Cloud provider responsible for physical security

• SLAs from cloud providers are for 99.9%+ up time for the cloud. No guarantee for any individual instance

• Cloud provider will replicate databases to different regions or within a region.
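To make the 99.9% figure concrete, the arithmetic can be sketched as below. The function name and the choice of a 365-day year are assumptions made for illustration; real SLAs define their own measurement periods and exclusions.

```python
def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = 365 * 24) -> float:
    """Hours of downtime an availability target permits over a period.

    Assumption: the period is a 365-day year unless stated otherwise.
    """
    return (1.0 - availability_percent / 100.0) * period_hours


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

A 99.9% target therefore still leaves room for several hours of cloud-level downtime per year, and it says nothing about how often an individual instance may fail.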



Questions



Dependability


What is dependability?

• Dependability of a computing system is the ability to deliver service that can justifiably be trusted.
– The service delivered by a system is its behaviour as it is perceived by its user(s);
– a user is another system (physical, human) that interacts with the former at the service interface.
– The function of a system is what the system is intended for, and is described by the system specification.

[ A. Avizienis, J.-C. Laprie and B. Randell: Fundamental Concepts of Dependability. Research Report No 1145, LAAS-CNRS, April 2001]


http://en.wikipedia.org/wiki/Brian_Randell



Parsing the definition

• Dependability is relative
– “justifiably be trusted”

• May be different users with different expectations

• Users can be systems or humans

• Systems may deliver many services and dependability may be different for each service


Dependability subsumes many other attributes



Questions



Dependability concerns in the cloud for the consumer


Cloud vis a vis private data center

• Cloud providers remove some of the problems of operating a private data center
– Acquisition of physical hardware
– Hiring/training data center staff
– Physical security

• Other problems remain basically the same
– Security threats from internet connections
– Separation of production/test environments
– Patch installation

• Other problems are new or exist in changed form
– It is these other problems that we now focus on.


Cloud Specific Dependability Problems

• Failure
– Instance failure
– Data failure/consistency
– Operator error
– Upgrade error

• Performance
– Latency of provisioning
– Over/under provisioning
– Latency of communication

• Security/privacy
– Credentials and keys
– Multi-tenancy
– Location dependency/governance

• Disaster Recovery


Provisioning

• Consumer or cloud infrastructure can launch or delete instance of virtual machine

• When new instance launched it consists of
– Virtual hardware with public and private IP address
– Executable image
– Virtual hard disk

• Provisioning is important both in failure recovery and performance


Elasticity - Over or Under Provisioning

• Elasticity is the defining characteristic of cloud
– Traditional ‘scalability’ or ‘throughput’ measures no longer helpful
– “the ability of software to meet changing capacity demands, deploying and releasing relevant necessary resources on-demand”

• There is often over or under provisioning


Failure


Instance Failure – recognition

• Basic failure recognition mechanism is “heartbeat”.

• Instance must periodically show it is still alive
– Send a message
– Respond to query

• Must be an entity that is responsible for monitoring “aliveness” of instance
– Entity can be infrastructure
– Entity can be other portion of the application
– Entity can be client

• Failed instances are not automatically deleted
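The heartbeat mechanism above can be sketched as a small monitor that records when each instance last reported in and flags those that have been silent too long. This is a minimal illustration, not any provider's API: the class name, the `timeout_seconds` parameter and the injectable `clock` are all assumptions.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat of each instance (hypothetical helper)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # instance id -> time of last heartbeat

    def record_heartbeat(self, instance_id):
        """Called whenever an instance sends a message or answers a query."""
        self.last_seen[instance_id] = self.clock()

    def failed_instances(self):
        """Instances whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            instance_id
            for instance_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

The monitor only reports suspected failures; as noted above, deleting the failed instance remains a separate step.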



Monitoring for Pending Failure

• Besides PING…

• A dashboard of flashing lights

• Monitoring ongoing CPU, memory utilization, disk activities, network activities

• Environmental controls, water/coolant flow, power and temperature

Akamai’s NOC in Cambridge, Massachusetts


State

• An instance can be stateful or stateless

• A stateful instance remembers information from one message to another. State can be stored either within instance memory or on external memory device

• A stateless instance must be sent necessary state associated with the message.

• HTTP is a stateless protocol so every message must contain information allowing the instance to understand the context.

• Recovery process is different for stateful instances than for stateless instances.



Stateful Recovery

• Strategy depends on how much loss of computation and events can be tolerated.

• Strategy - 1
– Checkpoint image periodically
– On recovery, provision with checkpointed image and computation will restart from last checkpoint
– Any computation and messages between last checkpoint and failure will be lost.
– Assumes no state stored on external device.

• Only for cloud because of checkpointing image


Stateful Recovery Strategy – 2

• Periodically save important state on persistent external device.

• When image is activated, it checks whether any state has been saved. If so, it reads that state and resumes computation

• Any computation and messages between last checkpoint and failure will be lost

• The difference from the prior strategy is that this one does not assume an image exists; state is explicitly checkpointed by the application


Stateful Recovery Strategy – 3

• Periodically save important state on persistent external device

• Log incoming messages on persistent external device

• When image is activated, it checks whether any state has been saved. If so, it reads that state.

• Activated image then reads log and replays activity.

• No computation or messages will be lost unless there is failure between message arrival and recording that message on log. Acks to client will allow client to resend message if necessary.
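Strategy 3 can be sketched with a toy stateful component that logs every message before applying it and replays the tail of the log after recovery. Everything here is an illustrative assumption: `DurableStore` stands in for the persistent external device, and the "computation" is just a running sum.

```python
class DurableStore:
    """Stand-in for a persistent external device (assumed to survive failures)."""

    def __init__(self):
        self.checkpoint = None   # (state, number of log entries it covers)
        self.log = []            # incoming messages, in arrival order


class SummingInstance:
    """Toy stateful instance: its state is the sum of the messages received."""

    def __init__(self, store):
        self.store = store
        self.total = 0
        self.applied = 0
        self._recover()

    def _recover(self):
        # Read the saved state, if any, then replay the messages logged after it.
        if self.store.checkpoint is not None:
            self.total, self.applied = self.store.checkpoint
        for message in self.store.log[self.applied:]:
            self._apply(message)

    def _apply(self, message):
        self.total += message
        self.applied += 1

    def handle(self, message):
        # Log first; an ack to the client would be sent only after this line.
        self.store.log.append(message)
        self._apply(message)

    def save_checkpoint(self):
        self.store.checkpoint = (self.total, self.applied)
```

Creating a new `SummingInstance` on the same store models activating a replacement image: it resumes from the checkpoint and replays only the messages logged after it.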



Comments on Stateful recovery strategies

• Only strategy 1 (provision with checkpointed image) is specific to cloud

• Other strategies apply also to non-cloud environments.

• Strategy 3 achieves least data loss since messages are logged and replayed upon recovery.



Stateless images

• If instance is stateless then
– Infrastructure can send any message to any instance
– Can create new instances for performance or reliability reasons.
– Router/load balancer/controller is responsible for getting messages to instances

[Figure: clients, load balancer, and servers in the cloud]


How do messages get to instances?

• Two models
– Push. Load balancer decides which instance should get message
– Pull. Load balancer maintains queue of messages and instances retrieve messages from queue.


Push Architecture Pattern

[Figure: clients, load balancer, servers, and monitor]


Push Pattern Description

Client sends a request (e.g. HTTP message) to the app in the cloud.

Request arrives at a load balancer

Load balancer forwards request to one of the VMs

Load balancer uses scheduling strategy to decide which VM gets the request, e.g. round robin
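A round-robin scheduling strategy of the kind mentioned above might look like the following sketch. The class and method names are hypothetical; `mark_failed` is an added assumption that models a monitor telling the balancer to stop using a VM.

```python
class RoundRobinBalancer:
    """Push-style balancer: it decides which VM receives each request."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._index = 0

    def mark_failed(self, instance):
        """Stop routing to an instance that a monitor reported as failed."""
        self.healthy.discard(instance)

    def route(self, request):
        """Return the instance that should process this request."""
        for _ in range(len(self.instances)):
            candidate = self.instances[self._index]
            self._index = (self._index + 1) % len(self.instances)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances available")
```

Round robin ignores load; a balancer that also consults monitor data could weight its choice by CPU utilization or recent response time.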


Monitor

The load balancer knows
– CPU utilization for each VM through monitor
– how many requests each VM has gotten
– possibly how long it took to service the requests.

The monitor decides (based on rules) when new resources are needed


Failure management within Push Pattern

• Monitor will recognize failure of instance through non-responsiveness.

• Load Balancer will not send further messages to instance

• Messages currently being processed by failed instance are lost

• Client must detect message not processed (through timeout) and resend message.



Pull architecture pattern (aka Producer-Consumer)

[Figure: clients, load balancer/queue manager, servers, and monitor]


Pull architecture description

Each request from the client is application specific and typed.

The queue keeps separate queues for each application running on the VMs.

A VM requests the next message of a particular type (pull) and processes it.

When the VM has processed a message, it informs the controller to remove the message from the queue.


Monitor

The monitor can now see
– how long a request waits in a queue
– the average queue length

This is an indication of the load on the VMs that have applications that service requests of that type.

Allows better scheduling of messages to VMs.


Failure Management within Pull Pattern

• Controller knows when message has been processed.

• If message is not processed within time interval, controller can reassign it.

• Failed instances will not request further messages and so take themselves out of service.

• It is possible for a failed instance to recover and continue processing on a message that has been rescheduled so checks must be in place to keep a message from being double processed.
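The failure handling above can be sketched as a queue that leases each message for a fixed interval and accepts only the first completion. The names (`LeasedQueue`, `lease_seconds`) and the injectable `clock` are assumptions for illustration, not a particular provider's queue API.

```python
class LeasedQueue:
    """Pull-style queue: a pulled message is hidden until its lease expires."""

    def __init__(self, lease_seconds, clock):
        self.lease_seconds = lease_seconds
        self.clock = clock
        self.pending = {}    # message id -> payload, not yet completed
        self.leases = {}     # message id -> time at which the lease expires
        self.done = set()    # ids already completed, to reject duplicates

    def put(self, message_id, payload):
        self.pending[message_id] = payload

    def pull(self):
        """Hand out the first message that is not currently leased."""
        now = self.clock()
        for message_id, payload in self.pending.items():
            if message_id not in self.leases or self.leases[message_id] <= now:
                self.leases[message_id] = now + self.lease_seconds
                return message_id, payload
        return None

    def complete(self, message_id):
        """Return True for the first completion, False for any later one."""
        if message_id in self.done:
            return False
        self.done.add(message_id)
        self.pending.pop(message_id, None)
        self.leases.pop(message_id, None)
        return True
```

If a slow instance finishes after its message was reassigned, `complete` returns False for whichever instance reports second, which gives the application a hook to discard the duplicate result.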



Cleaning up

When instance fails it is not automatically deallocated

Consumer must deallocate failed instance.

When instance deallocated
– Public and private IP address available for reallocation
– Possible to tell infrastructure that public IP address is to be assigned to replacement instance

• Within AWS charging continues until instance deallocated.



Data Failure

• Data storage can be “ephemeral” or “persistent”

• Ephemeral storage disappears if instance fails

• Persistent storage is maintained by cloud provider
– Replicated automatically
– Replicas may be geographically separated

• May lead to problems with data consistency


Data Consistency

• Takes time to replicate data

• Means that different replicas of the data may not be instantaneously consistent

• CAP Theorem. A distributed data store cannot simultaneously be
– Consistent
– Fully available
– Partition tolerant (still operating when the network between data stores is partitioned)

• May take ½ second for data to become consistent

• Most cloud providers offer “consistent reads” but at a potential cost in latency
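The difference between an eventually consistent read and a consistent read can be illustrated with a toy model in which writes reach a replica only after a fixed delay. This is a deliberately simplified sketch with hypothetical names; it does not reflect how any real provider implements replication, and it always serves eventual reads from the lagging replica.

```python
class ReplicatedStore:
    """Toy store with one primary and one replica that lags behind it."""

    def __init__(self, replication_delay):
        self.replication_delay = replication_delay
        self.primary = {}
        self.replica = {}
        self.in_flight = []   # (time at which visible on replica, key, value)

    def write(self, key, value, now):
        self.primary[key] = value
        self.in_flight.append((now + self.replication_delay, key, value))

    def _apply_replication(self, now):
        remaining = []
        for visible_at, key, value in self.in_flight:
            if visible_at <= now:
                self.replica[key] = value
            else:
                remaining.append((visible_at, key, value))
        self.in_flight = remaining

    def read(self, key, now, consistent=False):
        """Consistent reads go to the primary; eventual reads to the replica."""
        self._apply_replication(now)
        source = self.primary if consistent else self.replica
        return source.get(key)
```

With a delay of 0.5 seconds, an eventual read issued 0.1 seconds after a write misses the new value while a consistent read sees it; in a real system the consistent read is the one that may cost extra latency.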



Characterising Eventual Consistency in Amazon SimpleDB

• The probability to read updated data in SimpleDB in US West
– An application reads data X (ms) after it has written data

• SimpleDB has two read operations
– Eventual Consistent Read
– Consistent Read

• This pattern is consistent regardless of the time of day

[Figure: probability of reading updated data against time since the write, for Eventual Consistent Read and Consistent Read]


Operator error

• After trying out something in AWS, may want to go back to original state

• Not always that straightforward:
– Attaching volume is no problem while the instance is running, detaching might be problematic
– Creating / changing auto-scaling rules has effect on number of running instances
  • Cannot terminate additional instances, as the rule would create new ones!
– Deleted / terminated / released resources are gone!


Undo for System Operators

[Figure: an Administrator issues a sequence of "do" operations between begin-transaction and rollback; the figure adds commit and pseudo-delete]

Approach

[Figure: Administrator and Undo System; the Undo System senses cloud resource states while the Administrator issues "do" operations]

Approach

[Figure: Administrator and Undo System; labels: Initial state, Goal state]

Approach

[Figure: Administrator and Undo System; labels: Initial state, Goal state, Plan, Set of actions, Generate code, Execute]


Location of instances

• Amazon divides the cloud into
– Regions (currently eight)
  • US – east (Northern Va), west (Oregon, Northern Calif), gov
  • Asia Pacific – Singapore, Tokyo
  • Europe – Ireland
  • South America – Sao Paulo
– Each region has some number of availability zones.
  • Each availability zone has distinct physical location, power sources
  • Communication
    – within availability zones is high speed
    – across availability zones is lower speed
    – across regions is lowest speed

• Availability zones and regions can be exploited to improve availability


User Visible Failures

• Operator error is largest cause of user visible errors in large Internet systems

• Largest cause of operator error is configuration errors during upgrade
– Data may be dated
– Data is based on a world where monthly updates were considered frequent. Updates may be as frequent as weekly (Facebook) or even more frequently – Jan Bosch talks about “continuous deployment”.
– I have not seen recent data describing sources of operator error


Upgrade Frequency

Upgrades to systems are a very common occurrence

Upgrade frequency of some common systems

Application            Average release interval
Facebook (platform)    < 7 days
Google Docs            < 50 days
MediaWiki              21 (171 schema updates in 4.5 years)
Joomla                 30

This frequency would suggest it is important to get the updates correct


Configuration parameters

• Options are extensive
– Hadoop – 206
– Cassandra – 36
– HBase – 64

• Massive numbers of dependencies, many hidden
– File path
– Network address
– Dynamically loaded libraries
– Database schema
– …


Basic upgrade strategies

• Rolling Upgrade
– Perform upgrade one node at a time
  • Does not require additional resources
  • Allows for determination of correctness in an incremental fashion
  • Implies that multiple versions may be simultaneously in service
  • Takes time

• Big flip
– Perform upgrade to a cluster at a time
  • Keep users from accessing cluster until upgrade completed
  • Takes resources out of service until upgrade is completed

• General industrial practice is Rolling Upgrade
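A rolling upgrade can be sketched as a loop that upgrades one node, checks it, and stops at the first unhealthy node. The function and its two callbacks (`upgrade_node`, `is_healthy`) are hypothetical placeholders for whatever deployment and health-check mechanisms a system actually uses.

```python
def rolling_upgrade(nodes, new_version, upgrade_node, is_healthy):
    """Upgrade nodes one at a time, stopping at the first unhealthy node.

    Returns (upgraded_nodes, failed_node); failed_node is None on success.
    """
    upgraded = []
    for node in nodes:
        upgrade_node(node, new_version)
        if not is_healthy(node):
            # Stop here: the remaining nodes keep serving the old version.
            return upgraded, node
        upgraded.append(node)
    return upgraded, None
```

Checking each node before moving on is what makes correctness incremental, and at any point before the loop finishes the cluster is serving more than one version.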



Potential error condition during rolling upgrade

• Multiple versions are simultaneously active during rolling upgrade

• Opens door to errors resulting from version incompatibility

• During a single session a client can deal with multiple versions of a single component.

• May result in “mixed-version” race condition

• “…these race conditions occur frequently during rolling updates of large Internet systems, such as Facebook”

From “To Upgrade or Not to Upgrade”


Mixed Version Race Condition

[Figure: numbered message sequence (1 to 5) between Client (browser) and Server during a rolling upgrade. Labels: Start rolling upgrade, Initial request, HTTP reply with embedded JavaScript, AJAX callback, New Version, Old Version, ERROR]


Assumptions/Requirements for a Solution

• Requirements
– Clients never interact with decreasing versions, i.e. once a client interacts with version xxx, it will never interact with a version less than xxx.
– Messages are balanced across all instances of an application, whether new or old versions.

• Assumptions
– Versions are backwards compatible, i.e. any message can be processed by the latest version without creating mixed-version race condition
– Client behavior with respect to the versions with which it interacts is governed by mobile code sent to the browser from the server side.


Key Ideas of Proposed Solution - 1

• Consider different versions as separate endpoints for a message. Each version is www.sample.com/<version number>

• Each instance knows its version number.

• Client knows the largest version number with which it has interacted.


Key ideas of Proposed Solution - 2

• Load Balancer portion
– Use a load balancer that routes messages to different endpoints
– The load balancer is the entry point for messages.
– Messages with /<version number> in the header are routed to an instance whose version is greater than or equal to that version number, according to the load balancing algorithm for those instances.
– Messages without version information are routed according to normal load balancing

• Load balancers are hierarchical
– Ensure that top level is updated before it is used to route messages
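The routing rule above can be sketched as follows. This is an illustration of the idea under stated assumptions, not the authors' implementation: the class names are hypothetical, instances are represented as a name-to-version mapping, and round robin stands in for "the load balancing algorithm".

```python
class VersionAwareBalancer:
    """Routes a message to an instance at or above the version it carries."""

    def __init__(self, instances):
        self.instances = dict(instances)   # instance name -> version number
        self._counter = 0

    def route(self, min_version=None):
        eligible = sorted(
            name for name, version in self.instances.items()
            if min_version is None or version >= min_version
        )
        if not eligible:
            raise LookupError("no instance at or above the requested version")
        chosen = eligible[self._counter % len(eligible)]
        self._counter += 1
        return chosen, self.instances[chosen]


class Client:
    """Remembers the largest version it has interacted with."""

    def __init__(self, balancer):
        self.balancer = balancer
        self.max_version_seen = None

    def send(self):
        name, version = self.balancer.route(self.max_version_seen)
        if self.max_version_seen is None or version > self.max_version_seen:
            self.max_version_seen = version
        return name, version
```

Because the client always sends the largest version it has seen, the versions it interacts with never decrease, which is the first requirement stated earlier.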


Performance


Achieving Elasticity

• Elasticity means the ability to create new (virtual) resources on demand

• Providers allow consumer to set up “autoscaling” rules. These rules make the demand automatic without necessity for operator manual action.
– E.g. create a new instance when an existing instance is utilizing greater than 75% of CPU for more than 5 minutes.

• Correct strategy for autoscaling is a matter of research because of the time it takes to create a new instance, provision it, boot it, and start an application.
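The example rule (more than 75% CPU for more than 5 minutes) can be expressed as a check over recent utilization samples. The function name, the sample format and the defaults are assumptions for illustration; real providers express such rules through their own configuration interfaces.

```python
def should_scale_out(samples, threshold=75.0, window_seconds=300):
    """Decide whether an instance has exceeded the CPU threshold for the window.

    samples: list of (timestamp_seconds, cpu_percent), oldest first.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    # Require history covering the whole window before acting.
    if latest - samples[0][0] < window_seconds:
        return False
    recent = [cpu for timestamp, cpu in samples if latest - timestamp <= window_seconds]
    return all(cpu > threshold for cpu in recent)
```

Even when the rule fires promptly, the new instance only helps after it has been provisioned and booted, which is why the choice of threshold and window matters.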



Provisioning Latency

• Small Instance
– 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform with a base install of CentOS 5.3 AMI
– Between 5 and 6 minutes in us-east-1c from launch to availability

• Large Instance
– 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform with a base install of CentOS 5.3 AMI
– Between 11 and 18 minutes in us-east-1c

[http://www.philchen.com/2009/04/21/how-long-does-it-take-to-launch-an-amazon-ec2-instance]


Provisioning Forecasting

• Approaches to predict appropriate number of instances

• Technique 1 (due to Sadeka Islam)
– Calculate cost of having instances that are unused (overprovisioning)
– Calculate cost of having requests go unsatisfied (underprovisioning)
– Allocate additional instances to optimize costs under various usage scenarios

• Technique 2 (due to Matthew Sladescu)
– Sniff out events that might lead to surge in demand and use that to predict appropriate number of instances
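The trade-off behind Technique 1 can be illustrated with a toy expected-cost calculation. This is not the cited technique itself, only a sketch of the idea of weighing the cost of idle instances against the cost of unmet requests; all names, prices and the scenario format are assumptions.

```python
def provisioning_cost(instances, demand_scenarios, capacity_per_instance,
                      instance_cost, unmet_request_cost):
    """Expected cost of running a fixed number of instances.

    demand_scenarios: list of (probability, requests) pairs.
    """
    total = 0.0
    capacity = instances * capacity_per_instance
    for probability, requests in demand_scenarios:
        unmet = max(0, requests - capacity)            # underprovisioning
        total += probability * (instances * instance_cost
                                + unmet * unmet_request_cost)
    return total


def best_instance_count(max_instances, demand_scenarios, capacity_per_instance,
                        instance_cost, unmet_request_cost):
    """Instance count (1..max_instances) with the lowest expected cost."""
    return min(
        range(1, max_instances + 1),
        key=lambda n: provisioning_cost(n, demand_scenarios, capacity_per_instance,
                                        instance_cost, unmet_request_cost),
    )
```

Paying for idle instances is the overprovisioning cost; it shows up here as the `instances * instance_cost` term that is charged even in low-demand scenarios.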



Latency of Communication

• Measurements by Robin Meehan based on http-ping

• Within EU region but across availability zones
– Roundtrip to local host within cloud (control) avg = 1.0 ms
– Roundtrip to public IP in same AZ avg = 1.4 ms

• Out of cloud (local England facility) to within cloud
– us-east = 231 ms
– eu-west = 96 ms

http://smart421.wordpress.com/2011/02/15/amazon-web-services-inter-az-latency-measurements/

http://smart421.wordpress.com/2011/01/17/which-amazon-web-services-region-should-you-use-for-your-service/





Security/Privacy


Security topics

• Credentials and keys

• Management of credentials and keys in the cloud

• Multi-tenancy

• Location dependency/governance


Credentials and keys

• A credential identifies you
– As an individual
– As having certain privileges
– As having certain qualifications

• Credentials are used in
– Authentication (you are who you say you are)
– Authorization (you have the rights to perform certain actions)
– Non-repudiation (you cannot deny you did something)

• A key is a magic number used in cryptography for
– Encrypting/decrypting data
– Digital credentials


Basic Data protection

[Figure: data protection path]
– App outside of cloud (data unencrypted)
– https: data is encrypted for transfer into the cloud
– App inside of cloud (data unencrypted, communication encrypted)
– Data is stored encrypted (by vendor)


What can go wrong with the Basic Data Protection?

• Suppose cloud provider has to respond to subpoena for data. Your data may, potentially, be included.

• Cloud provider must decrypt data to respond to subpoena.

• You may wish to encrypt your data (double encryption) so that cloud provider can only provide encrypted data.

• Of course, if subpoena is directed at you, you must comply with decrypted data.



Use of credentials

• Log into app in the cloud

• Attach a disk volume

• Download application from a non-public location

• Access particular databases.

• For non-public applications, protect your credentials and your data will be protected.


Vulnerabilities to Credentials

• Compromised inadvertently through social engineering means or carelessness

• Held by disgruntled employee

• Compromised through some sort of attack


Goals for credential storage

• Easy to do. If it is difficult to store credentials, people will avoid their use. A script can automate the provisioning of credentials but then the script needs to be protected.

• Possible to change in a running instance? Once an instance has been launched, can the credentials it uses be changed?

• Possible to change for instances launched in the future? This issue is related to building credentials into scripts. If scripts have credentials built in, it is difficult to change them in the future.


Options for getting credentials to App in the cloud

• Send credentials from client outside the cloud
– HTTPS will negotiate encryption of credentials over the internet
– Assumes credentials can be kept private on clients that have them.
– Credentials need to be sent every time there is a new instance

• Pass credentials in as a parameter during launch of instance
– Credentials persist for the life of the instance so if credentials change, can re-instantiate instance
– Means credentials are stored on a server – itself a vulnerability


More options for getting credentials to App server

• Build credentials into the image
– App server is instantiated from an image in the image library
– Could install credentials in the image when building it
– Makes it difficult to change credentials
– Prevents reuse of image (or makes reusing image a very bad idea)

• Keep credentials in persistent storage.
– Access control list for persistent storage provides protection based on credentials
– Credentials may be based on a different account


Conclusion with respect to credential management

• No insurmountable problem

• Needs to be thought through
– Who has access to credentials?
– Will I ever need to change credentials?


What is Multi-tenancy?

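[Figure: a server whose hypervisor hosts VMs for customers 1, 2 and 3; a local network; storage holding data. Labels: Storage, Local Network, Server, Data, Hypervisor, VM for customer 1, VM for customer 2, VM for customer 3]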


Multi Tenancy Gets More Complicated

[Figure: as before, with end users added. Labels: Hypervisor, VM for customer 1, VM for customer 2, VM for customer 3, End users]


Multi Tenancy Means “Sharing”

• Consumers share hardware
– CPU
– Network
– Storage media

• Consumers share software
– Hypervisor

• End users share applications
– E.g. Salesforce.com


What are the problems with Multi-tenancy?

• Performance – other users or consumers will consume resources and, potentially, keep you from achieving your performance requirements.
– Some providers allow consumers to reserve complete machines that would prevent multi-tenancy from occurring.

• Security – other users could potentially break confidentiality or integrity
– Provider uses isolation for security. Consumer must have trust in provider
– Consumer uses encryption to protect data.


Isolation assumptions

• Virtual machines are isolated based on virtual memory technology and addressing scheme
– Processor manufacturers have specialized hardware to support virtualization
– Hypervisor introduces a new layer of privileged software that could be attacked.

• Hypervisors provide facilities to isolate networks.

• Disk isolation is the same as in a non-cloud environment. OSs or shared software provide facilities.


Personally Identifiable Information

• Personally identifiable (US NIST)
– Information which can be used to distinguish or trace an individual's identity, such as their name, social security number, biometric records, etc. alone, or when combined with other personal or identifying information which is linked or linkable to a specific individual, such as date and place of birth, mother’s maiden name, etc.

• Personal data (EU)
– 'personal data' shall mean any information relating to an identified or identifiable natural person ('data subject'); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity


Location dependency/governance

• Some jurisdictions require that personal information for their jurisdiction is not stored outside of the jurisdiction
– The EU requires that personal information can leave the EU only for locations that have equivalent privacy guarantees
– Australia has a similar policy
– “If offshore cloud compromises your data, we’ll sue you, not them”, Victoria Privacy Commissioner

• Some jurisdictions claim rights to access any data stored within their borders
– US Patriot Act gives US government right to examine any data stored in the US.


What does this mean in the cloud?

• Knowing location of data centers
– Amazon provides locations of their data centers
– Google does not

• Does this mean just use Amazon data center in region compliant with your requirements?
– Not so fast!
– Backup locations may be chosen by provider. Could be anywhere
– A complicated problem is to control backup location based on data content.

• Amazon does have a gov region that almost certainly complies with US government regulations


Use tokens as a replacement for PII

• A token is an identifier that has no mathematical mapping to the individual being identified
– E.g. number people in tutorial arbitrarily
– Your number becomes a unique identifier for your PII stored in the cloud
– I keep mapping between you and your token privately according to jurisdictional laws


Example of token use

• Original data
– John Doe
– Sensitive information

• Token table (kept locally to conform to privacy laws)
– John Doe
– Token for John Doe

• Data stored in cloud
– Token
– Sensitive information

• Take join of token table and data table in cloud and the original data is restored
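The token example can be sketched as a locally held token table plus a join that restores the original data. `TokenVault` and `restore` are hypothetical names, and random tokens from the standard `secrets` module are one assumed way of getting an identifier with no mathematical mapping to the individual.

```python
import secrets


class TokenVault:
    """Token table kept locally, outside the cloud (hypothetical helper)."""

    def __init__(self):
        self._token_by_identity = {}
        self._identity_by_token = {}

    def tokenize(self, identity):
        """Return the token for an identity, creating a random one if needed."""
        if identity not in self._token_by_identity:
            token = secrets.token_hex(16)   # random, so not derivable from the name
            self._token_by_identity[identity] = token
            self._identity_by_token[token] = identity
        return self._token_by_identity[identity]

    def detokenize(self, token):
        return self._identity_by_token[token]


def restore(cloud_rows, vault):
    """Join cloud rows (token, sensitive info) with the local token table."""
    return [(vault.detokenize(token), info) for token, info in cloud_rows]
```

Only `(token, sensitive information)` rows would be stored in the cloud; without the vault, a token alone does not identify the person.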



How about jurisdictional problem?

• Tokens
– Technique for decoupling PII from identifier.
– Adds a level of indirection and protects that level locally

• Does this solve jurisdictional problems?
– I don’t know
– PerspecSys says it does: http://www.perspecsys.com/how-we-help/data-residency/


Questions



Case Studies


Netflix Corporation

• Launched in 1998 after founder was irritated at having to pay late fees on a DVD rental.

• DVD Model
– Pay monthly membership fee that includes rentals, shipping and no late fees
– Maintain online queue of desired rentals
– When you return your last rental (depending on service plan), the next item in the queue is mailed to you together with a return envelope.

• Customers rate movies and Netflix recommends based on your preferences


Streaming video - 1

• Streaming video service introduced in 2007

• Customers can watch Netflix streaming video on a wide variety of devices, many of which feed into a TV
– Roku set top box
– Blu-ray disc players
– Xbox 360
– TV directly
– PlayStation 3
– …

• Customers can stop and restart video at will. Netflix calls these locations in the films “bookmarks”.


Streaming video - 2

• Initially, one hour of streaming video was available to customers for every dollar they spent on their plan

• In Jan, 2008, every customer was entitled to unlimited streaming video.

• In Nov, 2011 Netflix changed billing model to have separate charges for DVDs and streaming


Internet statistics

• In May, 2011, Netflix streaming video accounted for 22% of all internet traffic, and 30% of traffic during peak usage hours.

• Three bandwidth tiers
– Continuous bandwidth to the client of 5 Mbit/s: HDTV, surround sound
– Continuous bandwidth to the client of 3 Mbit/s: better than DVD
– Continuous bandwidth to the client of 1.5 Mbit/s: DVD quality


Netflix’s move to the cloud

• In late 2008, Netflix had a single data center with Oracle as the main database system.

• With the growth of subscriptions and streaming video, it was clear that they would soon outgrow the data center.

• Two options:
– Build more data centers
– Use the cloud

• Netflix chose the Amazon EC2 platform


Why EC2?

• Four reasons cited by Netflix for moving to the cloud

1. Every layer of the software stack needed to scale horizontally, be more reliable, redundant, and fault tolerant. This leads to reason #2

2. Outsourcing data center infrastructure to Amazon allowed Netflix engineers to focus on building and improving their business.

3. Netflix is not very good at predicting customer growth or device engagement. They underestimated their growth rate. The cloud supports rapid scaling.

4. Cloud computing is the future. This will help Netflix with recruiting engineers who are interested in honing their skills, and will help scale the business. It will also ensure competition among cloud providers helping to keep costs down.

• Why Amazon and EC2? In 2008, Amazon was the leading supplier. Netflix wanted an IaaS so they could focus on their core competencies.


Netflix applications

Video ratings, reviews, and recommendations

Video streaming

User registration, log-in

Video queues

Billing

DVD disc management – inventory and shipping

Video metadata management – movie cast information


Netflix Reliability

• Deep service dependency hierarchy

• 1 billion incoming calls/day

• Across 1000s of instances

• Intermittent failure guaranteed



Approach to detecting faults

• Fast network timeouts and retries

• Separate threads on per-dependency thread pools

• Semaphores instead of threads for services that do not perform network calls

• Circuit breaker
– Service calls are decorated with code to test whether service is failing too often
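
The circuit breaker idea can be sketched as a small wrapper around a service call. This is a minimal illustration, not Netflix's implementation: the class name `CircuitBreaker`, the thresholds `max_failures` and `reset_after`, and the injectable `clock` are assumptions made for this sketch.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures   # consecutive failures before opening
        self.reset_after = reset_after     # seconds to wait before a trial call
        self.clock = clock                 # injectable so the sketch can be tested
        self.failures = 0
        self.opened_at = None              # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Service has been failing too often: do not even try.
                raise CircuitOpenError("service is failing too often")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0                  # a success closes the breaker fully
        return result
```

While the breaker is open, callers fail immediately instead of tying up threads waiting on a dependency that is known to be unhealthy.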



If failure detected

• Custom fallback
– Each service has specific fallback plan

• Fail silent
– Service returns a null value and invoking service knows it has failed

• API should be able to show what is happening now, in real time, not from some past time. Dashboard shown to operator has red/yellow/green lights for important services
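
The custom fallback and fail silent responses can be sketched as one wrapper. The function names (`call_with_fallback`, `personalised_recommendations`, `popular_titles`) are hypothetical, chosen only to illustrate the idea.

```python
def call_with_fallback(primary, fallback=None):
    """Try the primary call; on failure use the fallback, or return None."""
    try:
        return primary()
    except Exception:
        if fallback is not None:
            return fallback()   # custom fallback: a service-specific plan
        return None             # fail silent: the caller sees a null value

def personalised_recommendations():
    # Hypothetical dependency that is currently failing.
    raise TimeoutError("recommendation service did not answer")

def popular_titles():
    # Hypothetical fallback: a generic list that needs no remote call.
    return ["Popular title A", "Popular title B"]

with_plan = call_with_fallback(personalised_recommendations, popular_titles)
silent = call_with_fallback(personalised_recommendations)
```

With a fallback the user still gets a degraded but useful answer; without one, the invoking service receives `None` and has to handle it.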



Netflix test suite - 1

• Netflix has a variety of test programs they call the Simian Army. These programs include

– Chaos Monkey. Randomly kill a process and monitor the effect.

– Latency Monkey. Randomly introduce latency and monitor the effect.

– Doctor Monkey. The Doctor Monkey taps into health checks that run on each instance as well as monitors other external signs of health (e.g. CPU load) to detect unhealthy instances.

– Janitor Monkey. The Janitor Monkey ensures that the Netflix cloud environment is running free of clutter and waste. It searches for unused resources and disposes of them.
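
The Chaos Monkey idea (kill something at random, then check that the system still works) can be sketched against a toy model. This is not Netflix's tool: the `Fleet` class, the instance names and the `chaos_monkey` function are all invented for this sketch, which runs entirely in memory.

```python
import random

class Fleet:
    """Toy stand-in for a group of service instances behind a load balancer."""
    def __init__(self, names):
        self.alive = set(names)

    def kill(self, name):
        self.alive.discard(name)

    def handle_request(self):
        if not self.alive:
            raise RuntimeError("no healthy instance left")
        # Route to any surviving instance (here: the first by name).
        return f"served by {min(self.alive)}"

def chaos_monkey(fleet, rng):
    """Kill one randomly chosen live instance and report which one."""
    victim = rng.choice(sorted(fleet.alive))
    fleet.kill(victim)
    return victim

rng = random.Random(42)                 # seeded so a run can be repeated
fleet = Fleet(["i-a", "i-b", "i-c"])    # hypothetical instance names
victim = chaos_monkey(fleet, rng)
# Monitor the effect: the fleet should still answer requests.
still_serving = fleet.handle_request().startswith("served by")
```

The value of the exercise is in the monitoring step: if a single random kill makes requests fail, the redundancy is not doing its job.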


Netflix test suite - 2

– Conformity Monkey. The Conformity Monkey finds instances that don’t adhere to best practices and shuts them down. For example, if an instance does not belong to an auto-scaling group, that is a potential problem.

– Security Monkey. The Security Monkey is an extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all SSL and DRM certificates are valid and are not coming up for renewal.

– 10-18 Monkey. The 10-18 Monkey (Localization-Internationalization) detects configuration and run time problems in instances serving customers in multiple geographic regions, using different languages and character sets. The name 10-18 comes from L10n and I18n, where 10 and 18 are the numbers of letters omitted between the first and last letters of the words localization and internationalization.


Performance

• Create new auto-scaling group for each new version of code
– Copy entire configuration to new group
– Test behaviour under load by squeezing traffic in production to a smaller set of servers or generating artificial load against a single server


SmugMug

• Photo sharing site

• Survived April 2011 AWS outage

• Recommendations
– Spread across as many availability zones as possible
– Spread across regions if possible
– Build for failure (like Chaos Monkey)
– Understand how components fail (yours and the cloud provider's services)


Others

• Bizo
– Use circuit breakers. Assume services will fail, cache data and monitor extensively to detect failure.

• SimpleGeo
– Share nothing, redundancy, automated failover, automated replication

• Twilio
– Unit of failure is a single host
  • Simple services, replicable
– Short timeouts and quick retries
– Idempotent service interfaces (stateless)
– Relax consistency requirements
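
Short timeouts with quick retries are only safe when the service interface is idempotent, because a timed-out request may already have been applied. The sketch below illustrates that combination; the `MessageService` class, its request IDs and the `call_with_retries` helper are assumptions made for illustration and do not describe Twilio's actual API.

```python
import time

class MessageService:
    """Toy service with an idempotent interface: a request ID is applied at most once."""
    def __init__(self, fail_first=2):
        self.applied = {}             # request_id -> message text
        self.fail_first = fail_first  # how many replies get "lost"
        self.calls = 0

    def send(self, request_id, text):
        self.calls += 1
        if request_id not in self.applied:
            self.applied[request_id] = text   # the work happens only once
        if self.calls <= self.fail_first:
            # The work was done but the reply was lost: the caller sees a timeout.
            raise TimeoutError("no reply within the short timeout")
        return self.applied[request_id]

def call_with_retries(func, attempts=3, delay=0.01):
    """Retry quickly on timeout; give up after the last attempt."""
    for attempt in range(attempts):
        try:
            return func()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

service = MessageService()
result = call_with_retries(lambda: service.send("req-1", "hello"))
```

The request reaches the service three times, yet the message is recorded once, which is what makes aggressive retrying harmless.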



Disaster Recovery As a Service


Cost of DR is increasing…

Improving business continuity (BC) and DR is 2nd highest priority for enterprises for 2010/2011

BC/DR typically claims 6-7% of total IT budget

32% of enterprises plan to increase spending on BC/DR by at least 5% in 2010/2011.

Source: Forrester global survey of 2,803 IT decision-makers, Sept 2010

Enterprise DR under pressure?


Issues…

DR requirement is growing, driven by (a) changing customer expectations, and associated reputational risks; (b) Government & industry regulations

Infrastructure for DR is expensive: sophisticated DR is only affordable for a small % of applications; forces compromises/prioritisation

Confidence in initiating a recovery often less than it should be (too long, too much loss), uncertain integrity

DR Solutions often too ‘local’, insufficiently resilient

Enterprise IT becoming more complex

[Figure: Good DR is only affordable for a few applications. Coverage levels: Good DR coverage, Limited coverage, No cover. Axis label: Higher priority applications.]

Hypothesis: We can use cloud to extend DR at 1/10th cost.


Using Cloud for Business Continuity

• Two main usages of cloud for Business Continuity:
– Provides highly available systems for day-to-day business
– Serves as a technology platform to implement disaster recovery

• Some definitions:
– Business Continuity: “Activity performed by an organisation to ensure that critical business functions will be available to customers, suppliers, regulators and other entities…”
– Disaster Recovery: “A small subset of business continuity. The process, policies and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organisation after a natural or human-induced disaster”
– Fault Tolerance: “The property that enables a system to continue operating properly, possibly at a reduced quality level…”


Building Highly Reliable Systems with Cloud

• Must address potential failures at two levels:

– Hardware/Infrastructure
  • Prevent Single-Point-of-Failure (SPOF) by adding redundancy in all hardware components (i.e., redundant disks, redundant network devices, redundant power supply, etc.)
  • NOT all cloud providers provide 100% availability. Check your SLA!!

– Application
  • Prepare fail-over system to take over in case of a failure
  • Replicate databases to minimise downtime and loss of data
  • Replicate to geographically different location (e.g., to avoid natural disasters such as floods)
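
The effect of redundancy on availability can be estimated with simple arithmetic. The sketch below assumes that replicas fail independently, which real deployments only approximate (a shared power feed or a common software bug breaks the assumption); the function names and the 99% figure are illustrative and not taken from any provider's SLA.

```python
def parallel_availability(a, n):
    """Availability of n redundant replicas, each with availability a.

    Assumes independent failures: the group is down only if all n are down.
    """
    return 1 - (1 - a) ** n

def serial_availability(*parts):
    """Availability of a chain in which every part must be up."""
    result = 1.0
    for a in parts:
        result *= a
    return result

def downtime_hours_per_year(a):
    """Expected hours of downtime per year at availability a."""
    return (1 - a) * 365 * 24

single = 0.99                               # illustrative figure only
pair = parallel_availability(single, 2)     # two redundant replicas
chain = serial_availability(single, single) # two dependent, non-redundant parts
```

Under these assumptions a second replica turns roughly 87.6 hours of expected downtime per year into under one hour, while chaining two non-redundant parts makes availability worse than either part alone.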



DR As A Service – Requirements

• Cost-effective DR-As-A-Service is essential to get the DR solution deployed

• Deep architectural expertise does not exist in many businesses

• Needs solutions that achieve dependability and that
– Are non-intrusive at runtime
– Do not require changes to application architecture
– Work across platforms
– Are cheaper and easier to use than current state of practice


Case Study: Building Reliable System using EC2

• The highly replicated architecture of clouds makes them great foundations for business continuity solutions

• Globally distributed nature further enhances the disaster recovery capability of cloud

• Availability limitations mean you need to be realistic about Hot vs Warm vs Cold standby options

[Figure: Single-instance setup. An Auto Scaling Rule (Minimum Size = 1, Availability Zones = A, B, C) creates an EC2 Instance in one of Availability Zones A, B and C; an Elastic IP address (xxx.xxx.xxx.xxx) is allocated.]

[Figure: Load-balanced setup. An Auto Scaling Rule (Minimum Size = 2, Availability Zones = A, B, C) maintains EC2 Instances across Availability Zones A, B and C; an Elastic Load Balancer (Availability Zones = A, B, C) forwards requests from clients to the EC2 Instances.]


Case Study: Building Reliable System using EC2 (Contd)

• Data backup in AWS

– Amazon S3 is best for off-site data backup
  • Stores large binary files
  • Designed to provide 99.999999999% durability
  • Objects are redundantly stored in multiple facilities in a Region

– Back up using EBS
  • Uses a regular file system
  • Takes image (or snapshot) of the partition

– VM Import
  • Allows for easy replication from on-premise to cloud
  • Not trivial to replicate various configuration such as network configuration and disk drives
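
To get a feel for the 99.999999999% durability figure quoted for S3, the arithmetic below reads it as an average annual per-object figure. That reading, the function name and the object count of ten million are assumptions made for this sketch; the result is a design-target estimate, not a guarantee.

```python
def expected_annual_losses(num_objects, durability=0.99999999999):
    """Expected number of objects lost per year.

    Assumes the durability figure is an average annual per-object probability
    of survival.
    """
    return num_objects * (1 - durability)

losses = expected_annual_losses(10_000_000)   # illustrative object count
years_per_lost_object = 1 / losses
```

Under that reading, storing ten million objects corresponds to losing about one object every ten thousand years on average, which is why S3 is attractive as an off-site backup target.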



The Business Opportunity

[Figure: Cost versus downtime for Hot Standby, Warm Standby and Cold Standby, comparing Traditional DR with Cloud DR.]

• Downtime scale in the figure
– seconds (auto failover)
– minutes to a few hours (auto failover, minimum data loss)
– hours to a few days (manual failover, little data loss)
– days to weeks (large data loss)

• Annotations in the figure
– Run transactions on multiple sites but use only one
– Mirror data via dedicated high speed network (e.g., SANs)
– Regularly back up app/data in a backup site
– Launch systems upon a disaster
– Ship backup to offsite
– Hardware is not already set up
– Recover systems after disaster
– Cost of warm and cold is comparable
– “Always-on” costs in cloud. Also, a very hot one is not feasible


Yuruware Bolt



Questions



Conclusions

• Cloud Computing brings unique dependability challenges
– Latency across the global links
– Full automation means faster than ever error propagation
– Multi-tenancy issues

• Many traditional dependability patterns still work, but some new techniques are needed in the cloud era
– Traditional patterns: stateless, etc.
– Upgrade, undo/redo
– Simian armies, DR-As-A-Service


References

• How to keep your AWS credentials on an EC2 Instance Securely, Shlomo Swidler, http://shlomoswidler.com/2009/08/how-to-keep-your-aws-credentials-on-ec2.html

• http://techblog.netflix.com/

• Cloud Performance Benchmark Series, Network Performance: Rackspace.com, Sumit Sanghrajka, Radu Sion, http://www.cs.stonybrook.edu/~sion/research/sion2011cloud-net2.pdf

• How long does it take to launch an Amazon EC2 instance, Phil Chen, http://www.philchen.com/2009/04/21/how-long-does-it-take-to-launch-an-amazon-ec2-instance

• Basic Concepts and Taxonomy of Dependable and Secure Computing, Avizienis, Laprie, Randell, Landwehr, IEEE Transactions on Dependable and Secure Computing, Vol 1, No 1, Jan-March 2004


References - 2

• Cloud Software Upgrades: Challenges and Opportunities, Neamtiu, Dumitras, http://www.ece.cmu.edu/~tdumitra/public_documents/neamtiu11cloudupgrades11.pdf

• To upgrade or not to Upgrade, Dumitras, Narasimhan, Tilevich, Onward! 2010

• Cloud Application Architectures, George Reese, O’Reilly, 2009

• Why do internet services fail and what can be done about it? Oppenheimer, et al. Usenix Symposium on Internet Technologies and Systems, 2003

• Data Consistency properties and the trade-offs in commercial cloud storages: the consumers’ perspectives, Wada, et al. 5th Biennial Conference on Innovative Data Systems Research, CIDR, 2011, http://www.nicta.com.au/pub?id=4341



References - 3

• Why do upgrades fail and what can we do about it? Tudor Dumitras and Priya Narasimhan, Proceedings of the ACM/IFIP/USENIX 10th International Conference on Middleware (Middleware'09), 2009

• Using Program Analysis to Reduce Misconfiguration in Open Source Systems Software, Ariel Rabkin, PhD thesis, Univ of Calif, Berkeley, 2012

• A method for preventing mixed version race conditions, Bass, Wada, 2012, https://docs.google.com/open?id=0ByLr8SO1MsAiaXVxcmNNcDhVczg

• Automatic Undo for Cloud Management via AI Planning, Ingo Weber, Hiroshi Wada, Alan Fekete, Anna Liu, Len Bass, Proceedings of the 12th Hot Topics in System Dependability, http://www.nicta.com.au/pub?id=5994



References - 4

• How a consumer can measure elasticity for cloud platforms, Sadeka Islam, Kevin Lee, Alan Fekete, Anna Liu, Proceedings of the 3rd Joint WOSP/SIPEW International Conference on Performance Engineering, p.85-96, 2012

• Empirical prediction models for adaptive resource provisioning in the cloud, Sadeka Islam, Jacky Keung, Kevin Lee, Anna Liu, Future Generation Computer Systems, Vol 28, No.1, p.155-162, 2012


Q&A


Research study opportunities in dependable cloud computing:
• Software Architecture
• Data Management
• Performance Engineering
• Autonomic Computing

To find out more, send your CV and undergraduate details [email protected]

Thank You!
