Tuesday, July 10, 12
Inside the Atlassian OnDemand private cloud
George Barnett, SaaS Platform Architect
In 2010 a team of engineers moved into our secret lair (above a pub) to re-imagine our hosted platform.
Launch - October 2011: 1,000 VMs
6 months later: 13,500 VMs
We have a cloud. So what?
We also had a cloud... and:
• Poor performance
• Slow deployments
• VM sprawl
• Over provisioning
• Low visibility into the full stack
Virtualisation often creates new challenges but does nothing about existing ones.
Focus
Be less flexible about what infrastructure you provide.
#summit12
“You can use any database you like, as long as it’s PostgreSQL 8.4.”
• Stop trying to be everything to everyone
• (we have other clouds within Atlassian)
• Lower operational complexity
• Easier to provide a deeply integrated, well supported toolchain
• Small test surface matrix
Fail fast. Learn quickly.
Do as little as possible. Deploy and use it.
Block-1
A small scale model of the initial proposed platform architecture: 4 desktop machines and a switch.
Purpose: Validate design, evaluate failure modes.
http://history.nasa.gov/Apollo204/blocks.html
Block-1
Creation of VMs over NFS was too resource and time intensive. (more on this later)
Network boot assumptions validated.
Applications do not fall over.
Block-2
A large scale model of the platform architecture.
Purpose: Validate hardware resource assumptions and compare CPU vendors.
http://history.nasa.gov/Apollo204/blocks.html
Block-2
Initial specs of compute hardware were too conservative. Decided to add 50% more RAM.
VM distribution and failover tools work.
Customers-per-GB-of-RAM metric validated.
Hardware
Challenge
Existing platform hardware was a poor fit for our workload.
Memory and IO were heavily constrained, but CPU was not.
Monitoring
We took 6 months worth of monitoring data from our existing platform. We used this data to determine the right mix of hardware.
• 10 x Compute nodes (144 GB RAM, 12 cores, NO disks)
• 3 x Storage nodes (24 disks)
• Each rack delivered fully assembled
• Unwrap, provide power, networking
• Connected to customers in ~2 hours
Advantage #1: Reliable.
Each machine goes through a 2-day burn-in before it goes into the rack.
Advantage #2: Neat.
Advantage #3: Consistent.
Advantage #4: Easy to deploy.
No disks.
Wait. What?
Challenge
Existing compute infrastructure used local disk for swap and hypervisor boot. Once we got the memory density right, it’s only boot.
• No disks in compute infrastructure
• Avoid spinning 20 more disks per rack for a hypervisor OS
• Evaluated booting from:
• USB drives (unreliable and slow!)
• NFS (what if the network goes away?)
• Custom binary initrd image + kernel
• Image is a ~170 MB gzipped filesystem
• Download on boot, extract into RAM (~400 MB)
• No external dependencies after boot
• All compute nodes boot from the same image
• Reboot to known state
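As a rough sketch of that cycle (all paths and names here are illustrative stand-ins, not the real tooling; /tmp/ram stands in for the tmpfs a real node would extract into):

```shell
#!/bin/sh
set -e

# -- Build side (illustrative): pack a hypervisor root tree into one image --
ROOTFS=/tmp/hypervisor-rootfs        # assumed: a prepared root filesystem tree
IMAGE=/tmp/hypervisor-image.tar.gz

mkdir -p "$ROOTFS/etc"
echo "hypervisor-v1" > "$ROOTFS/etc/image-version"

# One gzipped filesystem image; every compute node boots the same bytes.
tar -czf "$IMAGE" -C "$ROOTFS" .

# -- Boot side (illustrative): download and extract into RAM --
RAMROOT=/tmp/ram
mkdir -p "$RAMROOT"
tar -xzf "$IMAGE" -C "$RAMROOT"

# After extraction there are no external dependencies; a reboot
# re-fetches the image and returns the node to a known state.
cat "$RAMROOT/etc/image-version"
```

Because the running root lives entirely in RAM, "upgrade" is just publishing a new image and rebooting.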
Boot sequence (Compute Node -> Netboot Server):
1. PXE firmware sends a DHCP request; the DHCP response points it at the TFTP server.
2. The node fetches gPXE (Etherboot) over TFTP.
3. gPXE sends a second DHCP request; the DHCP response now points it at an HTTP boot script.
4. gPXE fetches the boot script, then the kernel & boot image, over HTTP.
5. Boot.
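The two DHCP passes follow the standard gPXE chain-loading pattern; a hypothetical ISC dhcpd fragment (server names and addresses are made up) looks like:

```conf
# Hypothetical dhcpd.conf fragment for the two-pass netboot above.
# First pass: plain PXE firmware asks; hand it gPXE over TFTP.
# Second pass: gPXE itself asks (it tags requests with user-class "gPXE");
# hand it an HTTP boot script, which then fetches the kernel & boot image.
if exists user-class and option user-class = "gPXE" {
    filename "http://netboot.internal/bootscript.gpxe";   # assumed URL
} else {
    filename "undionly.kpxe";        # gPXE/Etherboot payload on the TFTP server
}
next-server 192.0.2.10;              # assumed TFTP server address
```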
Sharp Edges
• No swap == provision carefully
• Not a problem if you automate provisioning
• Treat the running hypervisor image like an appliance
• Don’t change code - rebuild the image and reboot
• Doing this often? Too many services in the hypervisor
Software
Challenge
Virtualisation is often inefficient. There’s a memory and CPU penalty which is hard to avoid.
OpenVZ
• Linux containers
• Basis for Parallels Virtuozzo Containers
• LXC isn’t there yet
• No guest OS kernels
• No performance hit
• Better resource sharing
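For orientation, the basic OpenVZ container lifecycle with vzctl looks roughly like this (the CTID, template and address are invented; run() only prints each command, so the sketch is safe to execute anywhere):

```shell
#!/bin/sh
# Sketch of the OpenVZ container lifecycle with vzctl.
# run() only prints the command; drop the echo to execute for real.
run() { echo "$@"; }

CTID=101                              # assumed container ID

run vzctl create "$CTID" --ostemplate centos-5-x86_64   # assumed OS template
run vzctl set "$CTID" --ipadd 192.0.2.101 --save        # assumed address
run vzctl start "$CTID"
```

The container shares the host's OpenVZ kernel, which is what makes the resource sharing in the bullets above possible.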
Performance
http://wiki.openvz.org/Performance/vConsolidate-SMP
http://wiki.openvz.org/Performance/LAMP
Resource de-duping
“Don’t load the same thing twice”
Challenge
Java VMs aren’t lightweight.
• Full virtualisation does a poor job at this
• 50 VMs = 50 kernels + 50 caches + 50 shared libs!
• Memory de-dupe combats this, but burns CPU.
• Memory de-dupe works across all OSes
• We don’t use Windows.
• By being less flexible, we can exploit Linux-specific features.
OpenVZ containers all share the same kernel.
• Provide a single OS image to all - free benefits:
• Shared libraries only load once.
• OS is cached only once.
• OS image is the same on every instance.
Challenge
If all containers share the same OS image, then managing state is a nightmare! One bad change in one container would break them all!
• But managing state on multiple machines is a solved problem!
• What if you have >10,000 machines?
• Why are you modifying the OS anyway?
Does your iPhone upgrade iOS when you install an app?
“Fix problems by removing them, not by adding systems to manage them.”
Read-only OS images
Data classes in a system
• OS and system daemon code
• Application code
• Application and user data
Container layout (on the OpenVZ kernel):
• / - Read Only: OS tools, system supplied code
• /sw - Read Only: Applications, JVMs, configs
• /data - R/W: Application and user data (e.g. /data/service/)
How?
• Storage nodes export /e/ro/ & /e/rw
• Build an OS distro inside a chroot.
• Use whatever tools you are comfortable with.
• Put this chroot tree in the RO location on storage nodes
• Make a “data” dir in the RW location for each container
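A toy version of that layout, with /tmp standing in for the storage-node exports and a stub in place of a real distro build (which would use debootstrap or similar inside the chroot):

```shell
#!/bin/sh
set -e

# Mock storage-node export roots (in reality these are NFS exports).
RO=/tmp/e/ro
RW=/tmp/e/rw

# Build an "OS distro" inside a chroot tree. A real build would run
# debootstrap/yum/etc. against this directory; here a stub file suffices.
CHROOT=/tmp/build-chroot
mkdir -p "$CHROOT/etc" "$CHROOT/bin"
echo "linux-image-v1" > "$CHROOT/etc/os-release-stub"

# Place the finished tree in the read-only export, named by version.
mkdir -p "$RO/os"
cp -a "$CHROOT" "$RO/os/linux-image-v1"

# One private, writable data dir per container in the RW export.
for ctid in 101 102 103; do
    mkdir -p "$RW/data/$ctid"
done

ls "$RW/data"
```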
How?
• On container start, bind mount: /net/storage-n/e/ro/os/linux-image-v1/ -> /vz/<ctid>/root
• Replace etc, var & tmp with a memfs
• Linux expects to be able to write to these
• Mount the container’s data dir (RW) to /data
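Sketched as commands (the mnt() helper only prints, since real bind and tmpfs mounts need root and a live OpenVZ hypervisor; "storage-1" and CTID 101 are invented):

```shell
#!/bin/sh
# Sketch of the per-container mount setup described above.
# mnt() only prints; on a real hypervisor you would execute these as root.
mnt() { echo mount "$@"; }

CTID=101                                             # assumed container ID
OSIMG=/net/storage-1/e/ro/os/linux-image-v1          # shared read-only OS image
DATA=/net/storage-1/e/rw/data/$CTID                  # this container's R/W dir
ROOT=/vz/$CTID/root

# The shared OS image becomes the container's root.
# (The export itself is read-only, so the bind inherits that.)
mnt --bind "$OSIMG" "$ROOT"

# Linux expects to write to these, so back them with RAM-based filesystems.
for d in etc var tmp; do
    mnt -t tmpfs tmpfs "$ROOT/$d"
done

# Finally, the container's private writable data at /data.
mnt --bind "$DATA" "$ROOT/data"
```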
More benefits
• Distribute OS images as a simple directory.
• Prove that environments (Dev, Stg, Prd) are identical using MD5sum.
• Flip between OS versions by changing a variable.
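The MD5sum check can be as small as this sketch: checksum every file in an image tree, then checksum the sorted list, yielding one fingerprint per environment (the /tmp paths are illustrative):

```shell
#!/bin/sh
set -e

# Fingerprint an OS image directory: checksum every file, sort for a
# stable order, then checksum the list, giving one MD5 per image tree.
fingerprint() {
    (cd "$1" && find . -type f -exec md5sum {} + | sort -k2) \
        | md5sum | cut -d' ' -f1
}

# Two toy "environments" holding byte-identical images should match.
mkdir -p /tmp/stg/image /tmp/prd/image
echo "libfoo-1.0" > /tmp/stg/image/lib.txt
echo "libfoo-1.0" > /tmp/prd/image/lib.txt

fingerprint /tmp/stg/image
fingerprint /tmp/prd/image
```

Any drift between Dev, Stg and Prd shows up as differing fingerprints, with the offending file findable by diffing the two sorted checksum lists.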
The Swear Wall
The swear wall helps prevent death by a thousand cuts.
Your team has a gut feeling about what’s hurting them - this helps you quantify that feeling and act on the pain.
1. !@&*^# Solaris!
2. Solaris gets a mark
3. Repeat
4. Periodically throw out offensive technology
5. ...
6. PROFIT!! (swear less)
Optimise for the task at hand.
Don’t layer solutions onto problems. Get rid of them.
Thank you!