Cloud present, future & trajectory
Brendan Bouffler (@boofla), #scico
Global Scientific Computing
02-Mar-16
*Does not apply to mathematicians with specialties in Cantorian set theory who should immediately ask for a copy of my very long disclaimer.
We are Psycho SciCo
No, not that one.
Scientific Computing Group (SciCo)
Science is one of the greatest areas of computation and can benefit from a democratization in cost and global accessibility that the cloud brings.
It’s also where we think Amazon can make a huge, really disruptive, impact on the world by participating - which is, at the most basic level, what we are about as a company.
“… the online book and decorative pillow seller Amazon.com swooped in and, in 2006, launched its own computer rental system—the future Amazon Web Services. The once-fledgling service has since turned cloud computing into a mainstream phenomenon …”
Source: Bloomberg Business - April 22, 2015
$7B retail business
10,000 employees
A whole lot of servers
2006 2015
Every day, AWS adds enough server capacity to power
this $7B enterprise
Global AWS Regions
Existing:
1. Oregon
2. California
3. Virginia
4. Dublin
5. Frankfurt
6. Singapore
7. Sydney
8. Seoul
9. Tokyo
10. Sao Paulo
11. Beijing
12. US GovCloud
2016/17:
13. Ohio
14. India
15. UK
16. Canada
17. China+1
AWS Region = A cluster of Availability Zones
Availability Zone = A cluster of data centers
All regions are sovereign, meaning your data never leaves that location unless you cause it to.
Map of scientific collaboration between researchers - Olivier H. Beauchesne - http://bit.ly/e9ekP2
Science means Collaboration
Public Data Sets
Cray Supercomputer
Beowulf Cluster
A top500 supercomputer
Ready in ~100 seconds
For ~ $100/hr
Time travel for job queues
Wall clock time: ~1 hour Wall clock time: ~1 week
Cost: the same
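The "cost: the same" point follows from pay-per-use pricing: the bill depends on core-hours, not on how they are spread over wall clock time. A minimal sketch, assuming a job that scales perfectly across cores; the core counts and the price per core-hour are hypothetical values chosen for illustration:

```python
def job_cost(cores: int, hours: float, price_per_core_hour: float) -> float:
    """Cost of a pay-per-use run: cores x hours x unit price."""
    return cores * hours * price_per_core_hour


PRICE = 0.05             # hypothetical $/core-hour, not a real AWS price
HOURS_PER_WEEK = 7 * 24  # 168

# Same work, two shapes: few cores for a week, or many cores for an hour.
narrow = job_cost(cores=100, hours=HOURS_PER_WEEK, price_per_core_hour=PRICE)
wide = job_cost(cores=100 * HOURS_PER_WEEK, hours=1, price_per_core_hour=PRICE)

print(f"1 week on 100 cores:     ${narrow:.2f}")
print(f"1 hour on {100 * HOURS_PER_WEEK} cores:   ${wide:.2f}")
```

Real jobs rarely scale perfectly, so the wide run may need somewhat more core-hours than the narrow one.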
Cost Control & Budgeting
Spot Market AWSome-ness
Spot Bid Advisor
The Spot Bid Advisor analyzes Spot price history to help you determine a bid price that suits your needs.
You should weigh your application’s tolerance for interruption and your cost saving goals when selecting a Spot instance and bid price.
The lower your frequency of being outbid, the longer your Spot instances are likely to run without interruption.
https://aws.amazon.com/ec2/spot/bid-advisor/
Bid Price & Savings
Your bid price affects your ranking when it comes to acquiring resources in the Spot market, and is the maximum price you will pay.
But frequently you’ll pay a lot less.
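To illustrate why the bid is a ceiling rather than the price paid, here is a deliberately simplified sketch. It assumes an hourly market price, that the instance runs in any hour where the price is at or below the bid, and that it is charged the market price for that hour. The price history is hypothetical, and a real outbid instance is interrupted rather than paused and resumed.

```python
from typing import List, Tuple


def simulate_spot(bid: float, hourly_prices: List[float]) -> Tuple[float, int]:
    """Return (total paid, hours run) under the simplified model above."""
    paid = 0.0
    hours_run = 0
    for price in hourly_prices:
        if price <= bid:       # not outbid this hour
            paid += price      # charged the market price, not the bid
            hours_run += 1
    return paid, hours_run


prices = [0.10, 0.12, 0.11, 0.40, 0.13]  # hypothetical $/hour history
paid, hours = simulate_spot(bid=0.25, hourly_prices=prices)

print(f"ran {hours} hours, paid ${paid:.2f}, "
      f"ceiling at bid price ${0.25 * hours:.2f}")
```

In this made-up history the instance sits out the one hour where the price spikes above the bid and pays well under the bid for the rest.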
Choices
When you only pay for what you use …
• If you’re only able to use your compute, say, 30% of the time, you only pay for that time.
1 Pocket the savings
• Buy chocolate
• Buy a spectrometer
• Hire a scientist.
2 Go faster
• Use 3x the cores to run your jobs at 3x the speed.
3 Go Large
• Do 3x the science, or consume 3x the data.
… you have options.
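The three options come out of the same arithmetic. A small sketch, assuming the 30% utilization figure from the slide and a hypothetical always-on monthly cost:

```python
utilization = 0.30       # fraction of time the compute is busy (from the slide)
always_on_bill = 1000.0  # hypothetical monthly cost of paying for 100% of the time

# Option 1: pay only for the busy time and pocket the difference.
pay_per_use_bill = always_on_bill * utilization
savings = always_on_bill - pay_per_use_bill

# Options 2 and 3: keep the original budget and spend it on more cores
# (go faster) or more work (go large).
headroom = 1 / utilization

print(f"pay-per-use bill: ${pay_per_use_bill:.0f}, savings: ${savings:.0f}")
print(f"same budget buys about {headroom:.1f}x the compute")
```

The "3x" on the slide is the rounded value of 1 / 0.30.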
[Diagram: EC2 and S3 in the AWS Frankfurt and AWS Dublin regions. Labelled paths: over research network (Janet/GÉANT), over commercial internet, inter-region. Legend: data egress / not data egress. Annotation: data egress waiver applies.]
Data egress is: data transferred out from AWS, over the Internet, to the end user
Global Data Egress Waiver at a Glance
Who
• Available to degree-granting / research institutions
• Permanent program, unlike previous pilots
• North America, Europe, APAC, Japan & GovCloud regions (but not including Latin America, Middle East, China, India, and Africa)
• Excludes MOOCs or other egress-as-a-service situations.
• Must use a Research Network we peer with (e.g. Janet or GÉANT)
How
• Contract addendum required
• Can also procure through reseller (e.g. Arcus)
What
• Waives data egress charges from qualified accounts
• Capped at waiving no more than 15% of the customer’s bill
Why
• Researchers strongly need predictable budgets
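One way to read the 15% cap is as a simple minimum. This is a sketch of that reading, not the contractual formula: it assumes the cap is taken against the total bill including egress, and the dollar amounts are hypothetical.

```python
def waived_egress(egress_charge: float, total_bill: float, cap: float = 0.15) -> float:
    """Egress waived: all of it, up to cap x total bill (assumed interpretation)."""
    return min(egress_charge, cap * total_bill)


# Typical case: egress is a small share of the bill, so all of it is waived.
small = waived_egress(egress_charge=100.0, total_bill=1000.0)

# Heavy-egress case: only the first 15% of the bill is waived.
heavy = waived_egress(egress_charge=300.0, total_bill=1000.0)

print(f"waived ${small:.0f} of $100 egress")
print(f"waived ${heavy:.0f} of $300 egress")
```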
39 years of computational chemistry in 9 hours
Novartis ran a project that involved virtually screening 10 million compounds against a common cancer target in less than a week. They calculated that it would take 50,000 cores and close to a $40 million investment if they wanted to run the experiment internally.
Partnering with Cycle Computing and Amazon Web Services (AWS), Novartis built a platform that ran across 10,600 Spot Instances (~87,000 cores) and allowed Novartis to conduct 39 years of computational chemistry in 9 hours for a cost of $4,232. Out of the 10 million compounds screened, three were successfully identified.
Novartis
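As a rough sanity check on the Novartis figures, the numbers quoted above imply the following speedup and unit cost. This is back-of-envelope arithmetic only; it assumes every one of the 10,600 instances ran for the full 9 hours, which the source does not state.

```python
HOURS_PER_YEAR = 365.25 * 24

compute_years = 39  # from the slide
wall_hours = 9      # from the slide
instances = 10_600  # from the slide
cost_usd = 4_232    # from the slide

# How much faster than a single sequential run of the same work.
speedup = compute_years * HOURS_PER_YEAR / wall_hours

# Assumption: all instances were up for the whole 9 hours.
cost_per_instance_hour = cost_usd / (instances * wall_hours)

print(f"speedup vs. sequential: about {speedup:,.0f}x")
print(f"average cost: about ${cost_per_instance_hour:.3f} per instance-hour")
```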
Stars in the Cloud
CHILES will produce the first HI deep field, to be carried out with the VLA in B array and covering a redshift range from z=0 to z=0.45. The field is centered at the COSMOS field. It will produce neutral hydrogen images of at least 300 galaxies spread over the entire redshift range.
The team at ICRAR in Australia have been able to implement the entire processing pipeline in the cloud for around $2,000 per month by exploiting the Spot market, which means the $1.75M they otherwise needed to spend on an HPC cluster can be spent on way cooler things that impact their research … like astronomers.
Finding what you’re not looking for
http://blog.csiro.au/wtf-is-that-how-were-trawling-the-universe-for-the-unknown/
WTF’s cloud-based backend is hosted on Amazon Web Services servers, where the researchers are able to access software for data reduction, calibration and viewing right from their desktop. The team is currently issuing a challenge using data peppered with “EMU (Easter) Eggs” – objects that might pose a challenge to data mining algorithms.
This way they hope to train the system to recognise things that systematically depart from known categories of astronomical objects, to help better prepare for unanticipated discoveries that would otherwise remain hidden.
Zooniverse
“The Zooniverse is heavily reliant on Amazon Web Services (AWS), particularly Elastic Compute Cloud (EC2) virtual private servers and Simple Storage Service (S3) data storage. AWS is the most cost-effective solution for the dynamic needs of Zooniverse’s infrastructure …”
http://wwwconference.org/proceedings/www2014/companion/p1049.pdf
The World’s Largest Citizen Science Platform
… cost is a factor – running a central API means that when the Zooniverse is quiet and there aren’t many people about we can scale back the number of servers we’re running (automagically on Amazon Web Services) to a minimal level.
C4
Intel Xeon E5-2666 v3, custom built for AWS.
Intel Haswell, 16 FLOPS/tick
2.9 GHz, turbo to 3.5 GHz
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html
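The "16 FLOPS/tick" and clock figures combine into a theoretical peak. A minimal sketch, assuming double-precision FLOPs at the 2.9 GHz base clock; the 18-core count is a hypothetical instance size used only for illustration, and sustained clocks under heavy AVX load may differ from the base frequency.

```python
def peak_gflops(cores: int, ghz: float, flops_per_cycle: int = 16) -> float:
    """Theoretical peak in GFLOPS: cores x clock (GHz) x FLOPs per cycle."""
    return cores * ghz * flops_per_cycle


per_core = peak_gflops(cores=1, ghz=2.9)
per_node = peak_gflops(cores=18, ghz=2.9)  # hypothetical 18 physical cores

print(f"per core: {per_core:.1f} GFLOPS")
print(f"18 cores: {per_node:.1f} GFLOPS")
```

Measured application performance will usually land well below this ceiling.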
Feature                                                      Specification
Processor Number                                             E5-2666 v3
Intel® Smart Cache                                           25 MiB
Instruction Set                                              64-bit
Instruction Set Extensions                                   AVX 2.0
Lithography                                                  22 nm
Processor Base Frequency                                     2.9 GHz
Max All Core Turbo Frequency                                 3.2 GHz
Max Turbo Frequency                                          3.5 GHz (available on c4.2xLarge)
Intel® Turbo Boost Technology                                2.0
Intel® vPro Technology                                       Yes
Intel® Hyper-Threading Technology                            Yes
Intel® Virtualization Technology (VT-x)                      Yes
Intel® Virtualization Technology for Directed I/O (VT-d)     Yes
Intel® VT-x with Extended Page Tables (EPT)                  Yes
Intel® 64                                                    Yes
cfnCluster - provision an HPC cluster in minutes
#cfncluster
https://github.com/awslabs/cfncluster
cfncluster is a sample code framework that deploys and maintains clusters on AWS. It is reasonably agnostic to what the cluster is for and can easily be extended to support different frameworks. The CLI is stateless; everything is done using CloudFormation or resources within AWS.
10 minutes
http://boofla.io/u/cfnCluster – (Boof’s HOWTO slides)
[Diagram: a head node instance and four compute node instances, with the compute nodes in an Auto Scaling group, connected by a 10G network inside a Virtual Private Cloud, with a /shared filesystem.]
Head Instance
• 2 or more cores (as needed)
• CentOS 6.x
• OpenMPI, gcc etc…
• Choice of scheduler: Torque, SGE, OpenLava, Slurm
Compute Instances
• 2 or more cores (as needed)
• CentOS 6.x
Auto Scaling group driven by scheduler queue length.
Can start with 0 (zero) nodes and only scale when there are jobs.
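The idea of scaling on queue length can be sketched as a small sizing function. This is an illustration of the concept only, not cfncluster's actual scaling logic; the function name, parameters and limits are all hypothetical.

```python
import math


def desired_nodes(pending_slots: int, slots_per_node: int,
                  min_nodes: int = 0, max_nodes: int = 10) -> int:
    """Compute nodes needed to drain the queue, clamped to the group limits."""
    if slots_per_node <= 0:
        raise ValueError("slots_per_node must be positive")
    needed = math.ceil(pending_slots / slots_per_node)
    return max(min_nodes, min(max_nodes, needed))


print(desired_nodes(pending_slots=0, slots_per_node=8))     # empty queue
print(desired_nodes(pending_slots=17, slots_per_node=8))    # a few queued jobs
print(desired_nodes(pending_slots=1000, slots_per_node=8))  # capped at max_nodes
```

With `min_nodes=0` an idle cluster costs nothing for compute nodes, which matches the "start with 0 nodes" point above.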
It's a real cluster
Infrastructure as code
The creation process might take a few minutes (maybe up to 5 mins or so, depending on how you configured it).
Because the API to CloudFormation (the service that does all the orchestration) is asynchronous, we can kill the terminal session if we want to and watch the whole show from the AWS console (where you’ll find it all under the “CloudFormation” dashboard, in the events tab for this stack).
$ cfncluster create boof-cluster
Starting: boof-cluster
Status: cfncluster-boof-cluster - CREATE_COMPLETE
Output:"MasterPrivateIP"="10.0.0.17"
Output:"MasterPublicIP"="54.66.174.113"
Output:"GangliaPrivateURL"="http://10.0.0.17/ganglia/"
Output:"GangliaPublicURL"="http://54.66.174.113/ganglia/"
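Because stack creation is asynchronous, the same progress can be polled from a script instead of the console. A minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper names are hypothetical, and the stack name is taken from the status line in the output above.

```python
import time
from typing import Callable

# CloudFormation statuses at which a create operation has finished.
TERMINAL = {"CREATE_COMPLETE", "CREATE_FAILED",
            "ROLLBACK_COMPLETE", "ROLLBACK_FAILED"}


def wait_for_stack(get_status: Callable[[], str],
                   poll_seconds: float = 15.0, max_polls: int = 120) -> str:
    """Poll get_status() until the stack reaches a terminal state."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("stack did not reach a terminal state")


def boto3_status(stack_name: str) -> str:
    """Fetch the current stack status via the CloudFormation API."""
    import boto3  # imported lazily so the polling helper works without AWS
    cfn = boto3.client("cloudformation")
    return cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]
```

Usage would look like `wait_for_stack(lambda: boto3_status("cfncluster-boof-cluster"))`, run from any machine with access to the account.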
This cluster intentionally left blank.
Your cluster is ephemeral.
Yes, that’s right, you’ve created a disposable cluster.
But it’s 100% recyclable.
It’s worth noting that anything you put into this cluster will vaporize when you issue the command
$ cfncluster delete <your cluster name>
… which might not be what you first expect.
It’s easy to save your data tho, and pick up from where you left off later.
Before you delete your cluster, take a snapshot of the EBS (block storage) volume that you used for your /shared filesystem using the AWS EC2 console (see the pic on the right).
The EBS volume you care most about is the one attached to the headnode instance (hint: it’s probably the largest one).
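The snapshot step can also be scripted rather than done in the console. A minimal sketch, assuming a boto3 EC2 client is passed in and that the /shared volume really is the largest one attached to the head node (the hint above, not a guarantee); the function names and the description string are hypothetical.

```python
def largest_volume(volumes: list) -> dict:
    """Pick the volume with the largest Size (GiB) from describe_volumes output."""
    return max(volumes, key=lambda v: v["Size"])


def snapshot_shared_volume(ec2, head_instance_id: str,
                           description: str = "cfncluster /shared backup") -> str:
    """Snapshot the largest EBS volume attached to the head node instance."""
    resp = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [head_instance_id]}]
    )
    volume = largest_volume(resp["Volumes"])  # raises ValueError if none attached
    snap = ec2.create_snapshot(VolumeId=volume["VolumeId"], Description=description)
    return snap["SnapshotId"]
```

With boto3 this would be called as `snapshot_shared_volume(boto3.client("ec2"), "<head node instance id>")`. Check that the snapshot has completed before deleting the cluster.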
How do I join the Data Egress Waiver Program?
• Your AWS account manager will work with you to sign you up.
• Sign up for an AWS Account using the Jisc/Arcus Portal (coming soon)
Peter Meagher, AWS
How will this impact me?
• Simple, predictable budgets: you will not be charged for data egress out from AWS over the internet to you. This makes it easier to write grant proposals, and plan your research budget.
• Discount: this program lowers your monthly bill.
• Retrieving data: there is no cost to access your data or to retrieve it to your local site.
• Tailored to academia: We understand that predictable budgets are important because of how research funding works. And we know that National Research and Education Networks provide most research institutions with a reliable, fast network connection to the AWS cloud for your compute and big data needs.
• Volume Discount: AWS will apply the waiver to your institution’s aggregated AWS account, which averages out data egress use and gives you access to further volume discounts.
** Data egress charges waived up to 15% of your total bill, or >3x typical usage.
** Data ingress (uploading data to AWS) is always free.
** Data egress waived is from AWS out over the internet. Glacier, CloudFront, DirectConnect port speed fees, or traffic between AWS regions are not waived.