This work is licensed under a Creative Commons Attribution 3.0 United States License.
Programming the Cloud: Cloud Services for Science
Tony Hey, Corporate Vice President
Microsoft External Research
Tony Hey – An Introduction
Commander of the British Empire
Worldwide External Research
Core Computer Science
Earth, Energy & Environment
Education & Scholarly Communication
Health & Wellbeing
Advanced Research Tools and Services
Community and Geographic Outreach
Emergence of a Fourth Research Paradigm
1. A thousand years ago – Experimental Science
  – Description of natural phenomena
2. Last few hundred years – Theoretical Science
  – Newton’s Laws, Maxwell’s Equations…
3. Last few decades – Computational Science
  – Simulation of complex phenomena
4. Today – Data-Intensive Science
  – Scientists overwhelmed with data sets from many different sources
    • Data captured by instruments
    • Data generated by simulations
    • Data generated by sensor networks
eScience is the set of tools and technologies to support data federation and collaboration
• For analysis and data mining
• For data visualization and exploration
• For scholarly communication and dissemination
(With thanks to Jim Gray)
Astronomy has been one of the first disciplines to embrace data-intensive science with the Virtual Observatory (VO), enabling highly efficient access to data and analysis tools at a centralized site. The image shows the Pleiades star cluster from the Digitized Sky Survey combined with an image of the moon, synthesized within the WorldWide Telescope service.
Science must move from data to information to knowledge
Accelerating time to insight with Advanced Research Tools and Services
Our goal is to accelerate research by collaborating with academic communities to use computer science research technologies
We also aim to help scientists spend less time on IT issues and more time on discovery by creating open tools and services based on Microsoft platforms and productivity software
• Data Acquisition and Modeling
• Collaboration and Visualization
• Analysis and Data Mining
• Disseminate and Share
• Archiving and Preservation
What is Cloud Computing?
A Definition:
– Cloud Computing means using a remote data center to manage scalable, reliable, on-demand access to applications
– Providing Applications and Infrastructure over the Internet
– Scalable means:
  • Possibly millions of simultaneous users of the app.
  • Exploiting thousand-fold parallelism in the app.
– Reliable means on-demand; 5 “nines” available right now
– Applications span the continuum from client to the cloud
Three New Aspects to Cloud Computing:
– Illusion of infinite computing resources available on demand
– Elimination of an upfront commitment by cloud users
– Ability to pay for use of computing resources on a short-term basis as needed
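To make the 5 “nines” figure concrete, the short sketch below converts an availability target into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration; the slide itself only states the target.

```python
# Downtime budget implied by an availability target expressed in "nines".
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: ignores leap years

def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for `nines` nines of availability (5 -> 99.999%)."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability

for n in (3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year.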
The Data Center Landscape
Range in size from “edge” facilities to mega scale.
Unprecedented economies of scale
Approximate costs for a small size center (1K servers) and a larger, 50K server center.
Each data center is 11.5 times the size of a football field

Technology      Cost in small-sized Data Center   Cost in Large Data Center     Ratio
Network         $95 per Mbps/month                $13 per Mbps/month            7.1
Storage         $2.20 per GB/month                $0.40 per GB/month            5.7
Administration  ~140 servers/Administrator        >1000 servers/Administrator   7.1
Data courtesy of James Hamilton
Advances in Data Center Deployment
Conquering complexity
– Building racks of servers & complex cooling systems all separately is not efficient.
– Package and deploy into bigger units
– 3 Sockets: Power, Cooling, Bandwidth
Containers: Separating Concerns
Why is this not just the same as Supercomputing?
• A Supercomputer is designed to scale a single application for a single user.
  – Optimized for peak performance of hardware
  – Batch operation is not “on-demand”
  – Reliability is secondary
    • If MPI fails, application crashes
    • Build check-pointing into application
  – Most Data Center applications run continuously (as services)
• Most Cloud Applications are immediate, scalable and persistent
• The Cloud is also a platform for massive data analysis
  – Not a replacement for leading edge supercomputers
• The Programming model must support scalability in two dimensions
  – Thousands of simultaneous users of the same applications
  – Applications that require thousands of cores for each use
* Dan Reed
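The point about building check-pointing into the application can be illustrated with a minimal sketch. It is not taken from the talk: the function names, the pickle-based file format, and the simulated failure are all assumptions chosen to keep the example self-contained.

```python
import os
import pickle
import tempfile

def run_with_checkpoints(total_steps, checkpoint_path, fail_at=None):
    """Sum 0..total_steps-1, saving progress after every step.

    `fail_at` simulates a crash at a given step (hypothetical, for illustration).
    """
    # Resume from the last checkpoint if one exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"next_step": 0, "total": 0}

    for step in range(state["next_step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure at step {step}")
        state["total"] += step
        state["next_step"] = step + 1
        # Write to a temporary file and rename it, so a crash in the middle
        # of a write cannot corrupt the previous checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]

def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "job.ckpt")
        try:
            run_with_checkpoints(10, path, fail_at=6)
        except RuntimeError:
            pass  # the "job" crashed part-way through
        # The restart resumes at step 6 instead of starting again at step 0.
        return run_with_checkpoints(10, path)

print(demo())
```

The restart logic lives inside the application itself, which is the burden the slide attributes to the supercomputing model.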
Programming the Cloud
• Infrastructure as a Service (IaaS): Provide a way to host virtual machines on demand
• Platform as a Service (PaaS): You write an Application to Cloud APIs and the platform manages and scales it for you.
• Software as a Service (SaaS): Delivery of software to the desktop from the Cloud
Azure™ Services Platform
A Spectrum of Application Models
• Microsoft Azure
  – .NET CLR/Windows Only
  – Choice of Language
  – Some Auto Failover/Scale (but needs declarative application properties)
• Google App Engine
  – Traditional Web Apps
  – Auto Scaling and Provisioning
• Force.com
  – SalesForce Biz Apps
  – Auto Scaling and Provisioning
• Amazon AWS
  – VMs Look Like Hardware
  – No Limit on App Model
  – User Must Implement Scalability and Failover
Constraints in the App Model: from More Constrained to Less Constrained
Automated Management Services: from More Automation to Less Automation
Azure Abstract Programming Model
[Diagram labels: Public Internet; Load Balancer; Front-end Web Role; Worker Role(s); Azure Services (storage); Switches; Load-balancers; Highly-available Fabric Controller; In-band communication – software control]
Roles: Scalable, Fault Tolerant, Stateless
• Roles are mostly stateless processes running on a core
• Web Roles provide web service access to the app by the users. Web roles generate tasks for worker roles
• Worker Roles do “heavy lifting” and manage data in tables/blobs
• Communication is through queues. The number of role instances should dynamically scale with load
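The web role and worker role pattern can be sketched with Python's standard library. This is not Azure API code: `queue.Queue` stands in for an Azure queue, a dictionary stands in for table/blob storage, and all function names are hypothetical.

```python
import queue
import threading

def web_role(work_queue, requests):
    """Turn each incoming user request into a task message on the queue."""
    for request_id, payload in requests:
        work_queue.put((request_id, payload))

def worker_role(work_queue, results, lock):
    """Process tasks until a None stop signal arrives.

    The worker keeps no state between tasks, so any instance can take any task.
    """
    while True:
        task = work_queue.get()
        if task is None:
            break
        request_id, payload = task
        outcome = sum(payload)  # placeholder for the real "heavy lifting"
        with lock:
            results[request_id] = outcome  # stand-in for a table/blob write

def run(requests, n_workers=3):
    work_queue = queue.Queue()  # stand-in for an Azure queue
    results, lock = {}, threading.Lock()
    workers = [
        threading.Thread(target=worker_role, args=(work_queue, results, lock))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    web_role(work_queue, requests)
    for _ in workers:
        work_queue.put(None)  # one stop signal per worker instance
    for w in workers:
        w.join()
    return results

print(run([("a", [1, 2, 3]), ("b", [10, 20])]))
```

Because the roles only share the queue, the number of worker instances (`n_workers` here) can change without touching the web role.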
Scalability
• Queue length directly reflects how well backend processing is keeping up with overall workload
• Queues decouple different parts of the application, making it easier to scale parts of the application independently
• Flexible resource allocation, different priority queues and separation of backend servers to process different queues
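Since queue length reflects how well the backend is keeping up, it can serve as a scaling signal. The sketch below shows one possible policy, not one described in the talk: the function name, the per-worker throughput, the drain time, and the worker limits are all assumed parameters.

```python
import math

def target_worker_count(queue_length, tasks_per_worker_per_min, drain_minutes,
                        min_workers=1, max_workers=50):
    """Workers needed to drain the current backlog within `drain_minutes`.

    All parameters are illustrative assumptions, not Azure settings.
    """
    capacity_per_worker = tasks_per_worker_per_min * drain_minutes
    needed = math.ceil(queue_length / capacity_per_worker)
    # Clamp to the allowed range so the pool never drops to zero or runs away.
    return max(min_workers, min(max_workers, needed))

# A backlog of 1,000 tasks, 10 tasks per worker per minute, 5-minute target.
print(target_worker_count(1000, 10, 5))
```

A separate queue per priority level would simply run this calculation once per queue, each with its own pool of workers.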
The Azure Fabric
• Consists of a (large) group of machines all managed by software called the fabric controller
• The fabric controller is replicated across a group of five to seven machines and owns all of the resources in the fabric
• Because it can communicate with a fabric agent on every computer, the controller is aware of every Windows Azure application running on the fabric
Azure Storage: Blobs, Tables, Queues, and SQL Data Services (relational)
• The simplest way to store data in Windows Azure storage is to use blobs
  – A blob contains binary data
  – A storage account can have one or more containers, each of which holds one or more blobs
  – Blobs can be big (up to 50 gigabytes each) and can also have associated metadata
• Tables hold some number of entities.
  – An entity contains zero or more properties
• Queues provide scalability
  – Queues provide a buffer to absorb traffic bursts
  – Reduce the impact of individual component failures
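The containment hierarchy described on this slide (account, containers, blobs; tables, entities, properties) can be modeled as plain data classes. This is a conceptual sketch only: the class and method names are hypothetical and do not correspond to the Azure storage API, and reading "50 gigabytes" as decimal gigabytes is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

BLOB_SIZE_LIMIT = 50 * 10**9  # "up to 50 gigabytes"; decimal GB is an assumption

@dataclass
class Blob:
    data: bytes                                             # binary content
    metadata: Dict[str, str] = field(default_factory=dict)  # associated metadata

@dataclass
class Container:
    blobs: Dict[str, Blob] = field(default_factory=dict)

@dataclass
class StorageAccount:
    containers: Dict[str, Container] = field(default_factory=dict)

    def put_blob(self, container: str, name: str, data: bytes, **metadata: str) -> None:
        """Store a blob, creating the container on first use."""
        if len(data) > BLOB_SIZE_LIMIT:
            raise ValueError("blob exceeds the size limit")
        target = self.containers.setdefault(container, Container())
        target.blobs[name] = Blob(data, dict(metadata))

@dataclass
class Entity:
    properties: Dict[str, Any] = field(default_factory=dict)  # zero or more

@dataclass
class Table:
    entities: List[Entity] = field(default_factory=list)

account = StorageAccount()
account.put_blob("inputs", "tree.txt", b"example bytes", owner="lab")
print(account.containers["inputs"].blobs["tree.txt"].metadata)
```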
Scientific Applications in the Cloud: SaaS for Science
Investigating the use of commercial Cloud services for scientific research
Example Applications:
• PhyloD, a computationally intensive science application that was not previously available as a service. Moving it to the Cloud gives researchers around the world access.
• MATLAB, a client application making use of Azure blob storage. MATLAB could also be hosted in the Cloud, with the appropriate licensing.
• Excel, demonstrating seamless interaction between familiar client tools and the Cloud, where data is stored in Azure tables and Azure computations can be invoked for analysis. The data is from sensors distributed throughout one of our Data Centers.
PhyloD as an Azure Service
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has had high impact
• Small but important group of researchers
  – 100’s of HIV and HepC researchers actively use it
  – 1000’s of research communities rely on these results
• Typical job: 10–20 CPU hours, with extreme jobs requiring 1K–2K CPU hours
  – Very CPU efficient
  – Requires a large number of test runs for a given job (1–10M tests)
  – Highly compressed data per job (~100 KB per job)
Highlights Azure’s potential for agile deployment of science-related services that scale
Cover of PLoS Biology November 2008
Courtesy of Roger Barga
PhyloD as an Azure Service
Web role copies input tree, predictor and target files to blob storage, enqueues INITIAL work item and updates tracking tables.
[Diagram labels: Job Created; INITIAL Created; Blob Storage; Work Item Queue; Tracking Tables; (INITIAL)]
PhyloD as an Azure Service
Worker role copies the input files to its local storage, computes p-values for a subset of the allele-codon pairs, copies the partial results back to blob storage and updates the tracking tables.
PhyloD as an Azure Service
Web role serves the final results from blob storage and status reports from tracking tables.
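The three PhyloD steps (submit, compute, serve) can be sketched end to end with in-memory stand-ins. Nothing here is PhyloD or Azure code: the dictionaries and deque replace blob storage, tracking tables and the work item queue; the way the INITIAL item is split into per-subset items is an assumption; and `placeholder_p_value` is a dummy, not the PhyloD statistic.

```python
from collections import deque

blob_storage = {}     # stand-in for Azure blob storage
work_queue = deque()  # stand-in for the work item queue
tracking = {}         # stand-in for the tracking tables

def web_role_submit(job_id, tree, predictor, target, pairs):
    """Copy inputs to blob storage, enqueue INITIAL, update tracking."""
    blob_storage[f"{job_id}/inputs"] = {
        "tree": tree, "predictor": predictor, "target": target, "pairs": pairs,
    }
    work_queue.append({"job": job_id, "kind": "INITIAL"})
    tracking[job_id] = {"status": "INITIAL", "done": 0, "parts": 0}

def placeholder_p_value(pair):
    """NOT the PhyloD statistic; a dummy value so the sketch runs."""
    return 1.0

def worker_role_step(parts=2):
    """Process one work item from the queue."""
    item = work_queue.popleft()
    job = item["job"]
    local = dict(blob_storage[f"{job}/inputs"])  # copy inputs to "local storage"
    if item["kind"] == "INITIAL":
        # Assumption: the INITIAL item is split into per-subset work items.
        for part in range(parts):
            work_queue.append({"job": job, "kind": "COMPUTE",
                               "part": part, "parts": parts})
        tracking[job].update(status="RUNNING", parts=parts)
    else:
        subset = local["pairs"][item["part"]::item["parts"]]
        partial = {pair: placeholder_p_value(pair) for pair in subset}
        blob_storage[f"{job}/partial/{item['part']}"] = partial
        record = tracking[job]
        record["done"] += 1
        if record["done"] == record["parts"]:
            record["status"] = "COMPLETE"

def web_role_results(job_id):
    """Serve status from tracking and merged results from blob storage."""
    record = tracking[job_id]
    merged = {}
    for part in range(record["parts"]):
        merged.update(blob_storage.get(f"{job_id}/partial/{part}", {}))
    return record["status"], merged

def demo():
    # Made-up pair labels, used only to exercise the workflow.
    pairs = [("allele1", "codon1"), ("allele1", "codon2"), ("allele2", "codon1")]
    web_role_submit("job1", b"tree", b"predictor", b"target", pairs)
    while work_queue:
        worker_role_step()
    return web_role_results("job1")

print(demo())
```

In this sketch any number of workers could call `worker_role_step` concurrently against a real queue, since each COMPUTE item names its own subset.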
PhyloD Scalability on Azure Cloud
Workers   Clock Duration   Total running time   Computational running time
25        0:12:00          2:19:39              1:49:43
16        0:15:00          2:25:12              1:53:47
8         0:26:00          2:33:23              2:00:14
4         0:47:00          2:34:17              2:01:06
2         1:27:00          2:31:39              1:59:13
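The clock durations in the table can be turned into speedup and parallel efficiency figures. The sketch below assumes that "Clock Duration" means elapsed wall-clock time and uses the 2-worker run as the baseline; both choices are assumptions, and the function names are illustrative.

```python
def to_minutes(hms):
    """Convert an 'H:MM:SS' string to minutes."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + s / 60

# Workers -> clock duration, copied from the table.
clock = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}

def speedup_vs_baseline(workers, baseline=2):
    """How many times faster the run is than the baseline run."""
    return to_minutes(clock[baseline]) / to_minutes(clock[workers])

def efficiency(workers, baseline=2):
    """Speedup divided by the factor by which the worker count grew."""
    return speedup_vs_baseline(workers, baseline) / (workers / baseline)

for w in sorted(clock):
    print(f"{w:>2} workers: speedup {speedup_vs_baseline(w):.2f}, "
          f"efficiency {efficiency(w):.2f}")
```

Under these assumptions, going from 2 to 25 workers shortens the elapsed time by a factor of 7.25, while total running time stays roughly constant across all runs.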
Project JUNIOR: Demonstrating the Value of Cloud Services for Science
Led by Newcastle University (Prof. Watson) and supported by External Research.
Goal: Investigate applicability of Clouds for scientific research
• Build a working prototype for a thin slice (use-cases in chemo-informatics)
• Utilize Microsoft technologies to build science-related services
• Investigate additional Scientific Cloud Services to raise the abstraction level for applications
Reference Scientific Data Sets on Azure
Ocean Science data on Azure SDS-relational
• Two terabytes of coastal and model data
Computational finance data on SDS-relational
• BATS, daily tick data for stocks (10 years)
• XBRL call report for banks (10,000 banks)
Currently working with IRIS to store select seismic data on Azure. The IRIS consortium, based in Seattle (NSF), collects and distributes global seismological data.
• Data sets requested by researchers worldwide
• Includes HD videos, seismograms, images, data from major seismic events.
High research value worldwide, frequently used in academia and lectures
Consume in any language, any tool, any platform in S+S scenarios
A world where all data is linked …
• A knowledge ecosystem:
  – A richer authoring experience
  – An ecosystem of services
  – Semantic storage
  – Open, Collaborative, Interoperable, and Automatic
• Data/information is inter-connected through machine-interpretable information (e.g. paper X is about star Y)
• Social networks are a special case of ‘data meshes’
Attribution: Chris Bizer
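The "paper X is about star Y" example can be expressed as a subject-predicate-object triple, which is one common way to make such links machine-interpretable. The sketch below is illustrative only: the identifiers, predicate names, and helper functions are hypothetical and do not follow any particular vocabulary.

```python
# Each fact is a (subject, predicate, object) triple; all names are made up.
triples = {
    ("paper:X", "isAbout", "star:Y"),
    ("paper:Z", "isAbout", "star:Y"),
    ("paper:Z", "cites", "paper:X"),
}

def subjects(predicate, obj):
    """All subjects linked to `obj` by `predicate`."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)

def objects(subject, predicate):
    """All objects that `subject` links to by `predicate`."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

print(subjects("isAbout", "star:Y"))  # every paper about star Y
print(objects("paper:Z", "cites"))    # everything paper Z cites
```

Once facts are stored this way, a question such as "which papers discuss this star" becomes a simple query rather than a text search.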
Vision of Future Research Environment with both Software + Services
…and stored/processed/analyzed in the Cloud
[Diagram labels: scholarly communications; domain-specific services; instant messaging; identity; document store; blogs & social networking; notification; search; books; citations; visualization and analysis services; storage/data services; compute services; virtualization; project management; reference management; knowledge management; knowledge discovery]
The Microsoft Technical Computing mission to reduce time to scientific insights is exemplified by the June 13, 2007 release of a set of four free software tools designed to advance AIDS vaccine research. The code for the tools is available now via CodePlex, an online portal created by Microsoft in 2006 to foster collaborative software development projects and host shared source code. Microsoft researchers hope that the tools will help the worldwide scientific community take new strides toward an AIDS vaccine.
Additional Resources
research.microsoft.com/en-us/collaboration/tools
Dryad; DryadLINQ
Computational Biology Toolkit: Enables and accelerates fundamental advances in biology
F#: Collaboration with the academic and research community on F#’s typed functional and object-oriented programming on the .NET platform
Plug-ins for Office:
• Ontology Add-in for Word
• Article Authoring Add-in for Word
• Chem4Word – Chemistry Drawing in Word
• Microsoft Electronic Journals Service
• Open XML Document Viewer
Software Engineering Tools:
• Spec#: Program verifier for C# extended with design by contract
• VCC: Program verifier for Concurrent C
• PEX: Automatic unit testing tool for .NET
• CHESS: Unit testing tool for concurrent Win32 executables and .NET
Windows Azure: http://www.windowsazure.com