Tuesday, July 10, 12
Inside the Atlassian OnDemand private cloud
George Barnett, SaaS Platform Architect
In 2010 a team of engineers moved into our secret lair (above a pub) to re-imagine our hosted platform.
Launch - October 2011: 1,000 VMs
6 months later: 13,500 VMs
We have a cloud. So what?
We also had a cloud... and:
• Poor performance
• Slow deployments
• VM sprawl
• Over provisioning
• Low visibility into the full stack
Virtualisation often creates new challenges but does nothing about existing ones.
Focus
Be less flexible about what infrastructure you provide.
#summit12
“You can use any database you like, as long as it’s PostgreSQL 8.4.”
• Stop trying to be everything to everyone
• (we have other clouds within Atlassian)
• Lower operational complexity
• Easier to provide a deeply integrated, well supported toolchain
• Small test surface matrix
Fail fast. Learn quickly.
Do as little as possible. Deploy and use it.
Block-1
A small scale model of the initial proposed platform architecture: 4 desktop machines and a switch.
Purpose: Validate design, evaluate failure modes.
http://history.nasa.gov/Apollo204/blocks.html
Block-1
Creation of VMs over NFS was too resource and time intensive. (more on this later)
Network boot assumptions validated.
Applications do not fall over.
Block-2
A large scale model of the platform architecture.
Purpose: Validate hardware resource assumptions and compare CPU vendors.
http://history.nasa.gov/Apollo204/blocks.html
Block-2
Initial specs of compute hardware were too conservative. Decided to add 50% more RAM.
VM distribution and failover tools work.
Customers-per-GB-of-RAM metric validated.
Hardware
Challenge
Existing platform hardware was a poor fit for our workload.
Memory and IO were heavily constrained, but CPU was not.
Monitoring
We took 6 months worth of monitoring data from our existing platform. We used this data to determine the right mix of hardware.
• 10 x Compute nodes (144 GB RAM, 12 cores, NO disks)
• 3 x Storage nodes (24 disks)
• Each rack delivered fully assembled
• Unwrap, provide power, networking
• Connected to customers in ~2 hours
Advantage #1: Reliable.
Each machine goes through a 2-day burn-in before it goes into the rack.
Advantage #2: Neat.
Advantage #3: Consistent.
Advantage #4: Easy to deploy.
No disks.
Wait. What?
Challenge
Existing compute infrastructure used local disk for swap and hypervisor boot. Once we got the memory density right, it’s only boot.
• No disks in compute infrastructure
• Avoid spinning 20 more disks per rack for a hypervisor OS
• Evaluated booting from:
• USB drives (unreliable and slow!)
• NFS (what if the network goes away?)
• Custom binary initrd image + kernel
• Image is a ~170 MB gzipped filesystem
• Download on boot, extract into RAM (~400 MB)
• No external dependencies after boot
• All compute nodes boot from the same image
• Reboot to known state
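As a rough sketch of that cycle (all paths and names here are illustrative stand-ins, not the real tooling; /tmp/ram stands in for the tmpfs a real node would extract into):

```shell
#!/bin/sh
set -e

# -- Build side (illustrative): pack a hypervisor root tree into one image --
ROOTFS=/tmp/hypervisor-rootfs        # assumed: a prepared root filesystem tree
IMAGE=/tmp/hypervisor-image.tar.gz

mkdir -p "$ROOTFS/etc"
echo "hypervisor-v1" > "$ROOTFS/etc/image-version"

# One gzipped filesystem image; every compute node boots the same bytes.
tar -czf "$IMAGE" -C "$ROOTFS" .

# -- Boot side (illustrative): download and extract into RAM --
RAMROOT=/tmp/ram
mkdir -p "$RAMROOT"
tar -xzf "$IMAGE" -C "$RAMROOT"

# After extraction there are no external dependencies; a reboot
# re-fetches the image and returns the node to a known state.
cat "$RAMROOT/etc/image-version"
```

Because the running root lives entirely in RAM, "upgrade" is just publishing a new image and rebooting.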
Boot sequence (Compute Node -> Netboot Server):
1. PXE firmware sends a DHCP request; the DHCP response points it at the TFTP server.
2. The node fetches gPXE (Etherboot) over TFTP.
3. gPXE sends a second DHCP request; the DHCP response now points it at an HTTP boot script.
4. gPXE fetches the boot script, then the kernel & boot image, over HTTP.
5. Boot.
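The two DHCP passes follow the standard gPXE chain-loading pattern; a hypothetical ISC dhcpd fragment (server names and addresses are made up) looks like:

```conf
# Hypothetical dhcpd.conf fragment for the two-pass netboot above.
# First pass: plain PXE firmware asks; hand it gPXE over TFTP.
# Second pass: gPXE itself asks (it tags requests with user-class "gPXE");
# hand it an HTTP boot script, which then fetches the kernel & boot image.
if exists user-class and option user-class = "gPXE" {
    filename "http://netboot.internal/bootscript.gpxe";   # assumed URL
} else {
    filename "undionly.kpxe";        # gPXE/Etherboot payload on the TFTP server
}
next-server 192.0.2.10;              # assumed TFTP server address
```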
Sharp Edges
• No swap == provision carefully
• Not a problem if you automate provisioning
• Treat the running hypervisor image like an appliance
• Don’t change code - rebuild the image and reboot
• Doing this often? Too many services in the hypervisor
Software
Challenge
Virtualisation is often inefficient. There’s a memory and CPU penalty which is hard to avoid.
OpenVZ
• Linux containers
• Basis for Parallels Virtuozzo Containers
• LXC isn’t there yet
• No guest OS kernels
• No performance hit
• Better resource sharing
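For orientation, the basic OpenVZ container lifecycle with vzctl looks roughly like this (the CTID, template and address are invented; run() only prints each command, so the sketch is safe to execute anywhere):

```shell
#!/bin/sh
# Sketch of the OpenVZ container lifecycle with vzctl.
# run() only prints the command; drop the echo to execute for real.
run() { echo "$@"; }

CTID=101                              # assumed container ID

run vzctl create "$CTID" --ostemplate centos-5-x86_64   # assumed OS template
run vzctl set "$CTID" --ipadd 192.0.2.101 --save        # assumed address
run vzctl start "$CTID"
```

The container shares the host's OpenVZ kernel, which is what makes the resource sharing in the bullets above possible.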
Performance
http://wiki.openvz.org/Performance/vConsolidate-SMP
http://wiki.openvz.org/Performance/LAMP
Resource de-duping
“Don’t load the same thing twice”
Challenge
Java VMs aren’t lightweight.
• Full virtualisation does a poor job at this
• 50 VMs = 50 kernels + 50 caches + 50 shared libs!
• Memory de-dupe combats this, but burns CPU.
• Memory de-dupe works across all OSes
• We don’t use Windows.
• By being less flexible, we can exploit Linux-specific features.
OpenVZ containers all share the same kernel.
• Provide a single OS image to all - free benefits:
• Shared libraries only load once.
• OS is cached only once.
• OS image is the same on every instance.
Challenge
If all containers share the same OS image, then managing state is a nightmare! One bad change in one container would break them all!
• But managing state on multiple machines is a solved problem!
• What if you have >10,000 machines?
• Why are you modifying the OS anyway?
Does your iPhone upgrade iOS when you install an app?
“Fix problems by removing them, not by adding systems to manage them.”
Read-only OS images
Data classes in a system
• OS and system daemon code
• Application code
• Application and user data
Container layout (on the OpenVZ kernel):
• / - Read Only: OS tools, system supplied code
• /sw - Read Only: Applications, JVMs, configs
• /data - R/W: Application and user data (e.g. /data/service/)
How?
• Storage nodes export /e/ro/ & /e/rw
• Build an OS distro inside a chroot.
• Use whatever tools you are comfortable with.
• Put this chroot tree in the RO location on storage nodes
• Make a “data” dir in the RW location for each container
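A toy version of that layout, with /tmp standing in for the storage-node exports and a stub in place of a real distro build (which would use debootstrap or similar inside the chroot):

```shell
#!/bin/sh
set -e

# Mock storage-node export roots (in reality these are NFS exports).
RO=/tmp/e/ro
RW=/tmp/e/rw

# Build an "OS distro" inside a chroot tree. A real build would run
# debootstrap/yum/etc. against this directory; here a stub file suffices.
CHROOT=/tmp/build-chroot
mkdir -p "$CHROOT/etc" "$CHROOT/bin"
echo "linux-image-v1" > "$CHROOT/etc/os-release-stub"

# Place the finished tree in the read-only export, named by version.
mkdir -p "$RO/os"
cp -a "$CHROOT" "$RO/os/linux-image-v1"

# One private, writable data dir per container in the RW export.
for ctid in 101 102 103; do
    mkdir -p "$RW/data/$ctid"
done

ls "$RW/data"
```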
How?
• On container start, bind mount: /net/storage-n/e/ro/os/linux-image-v1/ -> /vz/<ctid>/root
• Replace etc, var & tmp with a memfs
• Linux expects to be able to write to these
• Mount the container’s data dir (RW) to /data
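Sketched as commands (the mnt() helper only prints, since real bind and tmpfs mounts need root and a live OpenVZ hypervisor; "storage-1" and CTID 101 are invented):

```shell
#!/bin/sh
# Sketch of the per-container mount setup described above.
# mnt() only prints; on a real hypervisor you would execute these as root.
mnt() { echo mount "$@"; }

CTID=101                                             # assumed container ID
OSIMG=/net/storage-1/e/ro/os/linux-image-v1          # shared read-only OS image
DATA=/net/storage-1/e/rw/data/$CTID                  # this container's R/W dir
ROOT=/vz/$CTID/root

# The shared OS image becomes the container's root.
# (The export itself is read-only, so the bind inherits that.)
mnt --bind "$OSIMG" "$ROOT"

# Linux expects to write to these, so back them with RAM-based filesystems.
for d in etc var tmp; do
    mnt -t tmpfs tmpfs "$ROOT/$d"
done

# Finally, the container's private writable data at /data.
mnt --bind "$DATA" "$ROOT/data"
```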
More benefits
• Distribute OS images as a simple directory.
• Prove that environments (Dev, Stg, Prd) are identical using MD5sum.
• Flip between OS versions by changing a variable.
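The MD5sum check can be as small as this sketch: checksum every file in an image tree, then checksum the sorted list, yielding one fingerprint per environment (the /tmp paths are illustrative):

```shell
#!/bin/sh
set -e

# Fingerprint an OS image directory: checksum every file, sort for a
# stable order, then checksum the list, giving one MD5 per image tree.
fingerprint() {
    (cd "$1" && find . -type f -exec md5sum {} + | sort -k2) \
        | md5sum | cut -d' ' -f1
}

# Two toy "environments" holding byte-identical images should match.
mkdir -p /tmp/stg/image /tmp/prd/image
echo "libfoo-1.0" > /tmp/stg/image/lib.txt
echo "libfoo-1.0" > /tmp/prd/image/lib.txt

fingerprint /tmp/stg/image
fingerprint /tmp/prd/image
```

Any drift between Dev, Stg and Prd shows up as differing fingerprints, with the offending file findable by diffing the two sorted checksum lists.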
The Swear Wall
The swear wall helps prevent death by a thousand cuts.
Your team has a gut feeling about what’s hurting them - this helps you quantify that feeling and act on the pain.
1. !@&*^# Solaris!
2. Solaris gets a mark
3. Repeat
4. Periodically throw out offensive technology
5. ...
6. PROFIT!! (swear less)
Optimise for the task at hand.
Don’t layer solutions onto problems. Get rid of them.
Thank you!