Cloud On-Ramp Project
August 27th, 2015
Overview, Concepts and Capabilities
Contact: Robert McDermott [email protected]
Agenda
High-level project goals
Why AWS?
Overview of AWS services
Virtual Private Clouds (VPC) and networking
Account and VPC options
Metadata tagging
Cost accountability and reporting
Elastic Compute Cloud (EC2) overview
Infrastructure as code
FHCRC directory services integration (AD/DNS)
Cloud security overview
Cloud computing for scientific applications overview
AWS support options
HIPAA BAA details
Cloud benefits
Cloud challenges
Capabilities we gained during this project
High-level AWS roadmap
2
High-Level Project Goals
Gain experience and competency operating securely in a public cloud environment
Design and implement a cloud based virtual datacenter
Logically extend the Center’s internal IP network to secured subnets in the cloud datacenter
Explore various use cases (Servers, HPC, application hosting, database hosting, etc…)
Stand up at least one production server/service by the conclusion of this project
Develop a roadmap for future use and enhancements of the architecture
Gain operational flexibility to respond quickly to emerging needs 3
Cloud Basics
4
We Manage vs. They Manage
On-premises (we manage everything): server rooms, physical servers, virtual servers, network gear, storage systems
IaaS: Amazon AWS, Rackspace, Microsoft Azure, Google Compute
SaaS: Office 365, DNAnexus, DropBox, Google Docs
PaaS: Amazon AWS, Microsoft Azure, Google App Engine, Heroku
Why Amazon Web Services?
Market leader
o Overwhelming market share (Gartner)
o 10 times the compute capacity of all competitors combined (Gartner)
Greatest breadth and depth of services
Maturity: IaaS service launched in 2006
o Microsoft announced its Azure IaaS (VMs) service preview on June 7, 2012
o Google Compute Engine limited preview started June 28, 2012
Rapid pace of innovation: 449 new service and feature rollouts in 2014 alone
Cost competitive
o 48 price reductions since 2006
o Reserved instances and the spot market
Broad adoption
o Government, financial, healthcare, education, research, entertainment, etc.
Security certifications
o FISMA, HIPAA, SOC, PCI DSS, ITAR, DOD CSM, FedRAMP, ISO, FERPA, etc.
Most reliable cloud provider (CloudHarmony.com)
o AWS: 2.4 hours of total downtime in 2014 (zero downtime in our region)
o Google Compute Engine: 4.4 hours of total downtime in 2014
o Azure: 39.6 hours of total downtime in 2014
Why AWS? Why Not? 5
Gartner Magic Quadrant for IaaS 2010 - 2015
6
Are we locked in to AWS?
We are not locked in; it's possible to use other IaaS providers at the same time as AWS if there are compelling reasons to do so.
Using both Microsoft Office 365 (SaaS) and Amazon AWS (IaaS) at the same time works with no overlap in effort or waste; it's likely even a great strategy: Microsoft Office 365 for SaaS and Amazon AWS for IaaS.
Dumping AWS to move to another IaaS provider (Azure, Google, Rackspace) would require reworking everything (networking, automation, chargebacks, etc.). It's possible, but recreating the architecture would be a lot of work.
Chris Drumgoole, COO of IT at General Electric:
"…we really view ourselves to be a service provider to our businesses, so our businesses can buy [AWS cloud services] from us or they can buy from others. You can go to Amazon directly or you can go to Azure directly. If you want to come through me, by definition, you're going to live and operate in this safe environment. I have already taken care of the things that GE holds dear and our requirements around regulation, security, data privacy and so on. I pre-built and pre-instrumented the environment so that those things are not something you have to worry about. That's the benefit of coming to me. If you decide to go on your own, you certainly can. We're never going to stop you, but understand that now those things are on you and you have to take care of them."
http://www.infoworld.com/article/2824508/cloud-computing/ges-head-of-it-were-going-all-in-with-the-public-cloud.html
7
Overview of AWS Services
Used during project
Compute
Storage and Content Delivery
Databases
Networking
Administration & Security
Deployment & Management
Analytics
Enterprise Applications
Mobile Services
Application Services
8
Overview of AWS Services
Research Computing
Compute
Storage and Content Delivery
Databases
Networking
Administration & Security
Deployment & Management
Analytics
Enterprise Applications
Mobile Services
Application Services
9
AWS is similar to Lego Mindstorms…
Lego Mindstorm building blocks + imagination & code = a Sudoku-solving robot
Amazon AWS building blocks + imagination & code = a service with $5.5 billion in annual revenue that consumes 37% of Internet traffic at peak
10
Region Location AZs
us-east-1 N. Virginia 5
us-west-1 N. California 3
us-west-2 Oregon 3
eu-west-1 Ireland 3
eu-central-1 Frankfurt 2
ap-southeast-1 Singapore 2
ap-southeast-2 Sydney 2
ap-northeast-1 Tokyo 3
sa-east-1 Sao Paulo 3
us-gov-west-1 Pacific northwest 2
cn-north-1 Beijing 2
In addition to regions and zones, there are currently 64 edge locations:
Atlanta, GA - Ashburn, VA (3) - Dallas/Fort Worth, TX (2) - Hayward, CA - Jacksonville, FL - Los Angeles, CA (2) - Miami, FL - New York, NY (3) - Newark, NJ - Palo Alto, CA - San Jose, CA - Seattle, WA - South Bend, IN - St. Louis, MO - Amsterdam, The Netherlands (2) - Dublin, Ireland - Frankfurt, Germany (3) - London, England (3) - Madrid, Spain - Marseille, France - Milan, Italy - Paris, France (2) - Stockholm, Sweden - Warsaw, Poland - Chennai, India - Hong Kong (2) - Mumbai, India - Manila, the Philippines - Osaka, Japan - Seoul, Korea (2) - Singapore (2) - Taipei, Taiwan - Tokyo, Japan (2) - Melbourne, Australia - Sydney, Australia - São Paulo, Brazil - Rio de Janeiro, Brazil
Regions and Availability Zones
11
Availability Zone Basics
(Diagram: the US-WEST-2 Oregon region with Availability Zones A, B and C)
Each region has at least 2 availability zones
Each availability zone is in a separate location, miles apart, that shares nothing with other zones
Latency between availability zones in the same region is less than 2 ms
Systems requiring high availability should be designed to take advantage of multiple AZs
Elastic load balancers (ELBs) live at the region level, across all AZs
A typical HA design pattern is shown here:
(Diagram: a typical HA pattern, with App.com traffic going through an ELB to App and DB instances in both AZ A and AZ B)
12
Virtual Private Cloud (VPC) Overview
VPCs are customer-defined private networks that provide isolation from other VPCs (even your own) and other customers.
A VPC can only reside in one region, but can span all AZs in that region.
A VPC is subdivided into subnets. Subnets are located in AZs; a subnet can't span multiple AZs.
Subnets can be completely isolated, connected to the Internet, or connected to your corporate network via a VPN or a direct connection offered by a number of AWS partners.
VPCs can be peered with other VPCs, even other customers' VPCs, for collaboration.
The VPC service provides the following building blocks to design a network to fit your needs:
VPCs
Routers
Internet Gateways
Customer Gateways
VPNs
Virtual Private Gateways
VPC Peering
Subnets
Route Tables
DHCP
Network ACLs
Security Groups
Stateful Firewall
NAT
13
High-Level Virtual Private Cloud Patterns
(Diagrams: four common VPC patterns)
Public-only subnets
Public and private subnets
Public and private subnets with a VPN connection to the corporate datacenter
Private-only subnets with a VPN connection to the corporate datacenter
14
Example Account, VPC, Peering and Billing Configurations
(Diagram: four configurations, A–D)
A. Single account, single VPC; single bill. No isolation. Very simple but no flexibility.
B. Single account, multiple VPCs; single bill. Isolation between environments but no account isolation. Still simple, with some flexibility.
C. Multiple accounts, multiple VPCs; single consolidated bill. High level of isolation; intra-account VPC peering. Flexible but moderately complex.
D. Multiple accounts, multiple VPCs; separate billing. High level of isolation; intra- and inter-organization peering (e.g. to another organization's VPC). Most flexible but very complex.
15
Example of a Possible Account & VPC Architecture
(Diagram: a Production Environment account and a Test Environment account under consolidated billing, each containing Test, Enterprise, Research, High Security and Collaboration VPCs, with VPC peering to Org A and Org B VPCs)
VPC purposes:
Test VPC: dev/test systems
Enterprise VPC: administrative computing
Research VPC: research computing
High Security VPC: sensitive systems (PHI)
Collaboration VPC: scientific collaboration
Benefits:
Infrastructure testing in an isolated test account
Network-level isolation between environments for flexibility and some independence
Single consolidated bill covering both accounts
16
Cloud On-Ramp Datacenter: Extension of the Hutch Network
(Diagram: current topology vs. the Hutch network extended to our AWS VPC; the Hutch network (192.168.0.0/16, 10.168.0.0/16) sits behind Hutch firewalls in an active/standby configuration, connected to the Internet)
Traffic between the Hutch network and our VPC must pass through the Center's Internet firewall, which is an endpoint for the VPN connection between the two networks.
A firewall instance filters and logs all traffic in and out of the VPC via the AWS Internet Gateway.
Virtual systems residing in VPC subnets are referred to as instances of the machine images from which they are launched.
The AWS Internet Gateway provides direct access between the VPC and the Internet (avoiding having to route traffic through the Hutch campus, AKA "hairpinning").
17
Cloud On-Ramp Datacenter – Detailed
(Diagram: the Hutch network (192.168.0.0/16, 10.168.0.0/16) connected to the Cloud On-Ramp Virtual Private Cloud (172.16.0.0/16) via a VPN tunnel; the Hutch firewalls serve as the VPN tunnel endpoint (customer gateway) and the AWS Virtual Private Gateway terminates the tunnel in the VPC; an AWS Elastic IP (EIP) is assigned to the firewall; some components are marked "Post Phase 1")
The VPC spans three availability zones (us-west-2a, us-west-2b, us-west-2c):
Internal subnets 172.16.160.0/24, 172.16.168.0/24 and 172.16.176.0/24: instances directly accessible via Hutchnet
Firewall Inside subnet: 172.16.8.0/24
Firewall Outside subnet: 172.16.0.0/24
Connectivity from internal Hutch systems in the VPC to the Internet is via the firewall/UTM instance and the AWS Internet Gateway.
Notes:
• Internal subnets are logical extensions of the Hutch network.
• Instances in the internal subnets can communicate freely with other systems in the Hutch network via the VPN tunnel using their private IP addresses (172.16.x.x).
• The Firewall Outside subnet is the only subnet with direct access to the AWS Internet Gateway.
• The firewall is a FortiGate instance with interfaces in both the Firewall Outside and Firewall Inside subnets. It can communicate with any VPC subnet via its inside interface.
• The firewall performs NAT for outbound-initiated Internet access.
18
Active Directory Cloud Integration Design
(Diagram: the on-premises AD site "SELU" (networks 192.168.0.0/16 and 10.168.0.0/16; domain controllers DC1, DC2, DC3) replicating with the new AWS AD site "AWS_USW2" (network 172.16.0.0/16; DC-USW2A in US-WEST-2A at 172.16.160.x and DC-USW2B in US-WEST-2B at 172.16.168.x); a server provides software updates)
Create a new AD "site" named for the AWS region: "AWS_USW2"
Create two new domain controllers in AWS, each in a different availability zone
This architecture was tested during the project (twice) using test domains
19
Tags are key/value annotations that can be attached to every type of object in AWS
Tags are used for inventory, security, cost accounting, backups and automation
Tag restrictions:
Maximum number of tags per resource: 10
Maximum key length: 127 Unicode characters
Maximum value length: 255 Unicode characters
Tag keys and values are case sensitive
Mandatory Cloud On-Ramp tagging scheme:
Name: name of the server
Owner: department/customer that owns/pays for the server <_div/dept>
Technical_contact: who provides technical support for this system; who to send alerts and reports to
Billing_contact: who to send chargeback invoices to
Description: short description of the server's purpose
SLE: business_hours=? / grant_critical=? / publicly_accessible=?
Metadata Tagging Scheme
20
Metadata Tagging Scheme Example
Name : skyshield
InstanceId : i-32b8edc1
InstanceType : m3.medium
ImageId : ami-b9c98181
State : running
PrivateIpAddress : 172.16.0.18
PublicIpAddress : 52.16.139.222
SecurityGrps : sg-8e7221e1
AvailabilityZone : us-west-2a
SubnetId : subnet-29549451
VpcId : vpc-f3e23491
owner : _adm/infosec
technical_contact : [email protected]
billing_contact : [email protected]
description : firewall - inside interface in dedicated subnet
sle : business_hours=24x7 / grant_critical=no / publicly_accessible=no
Tenancy : default
LaunchTime : 6/3/2015 1:34:59 PM
KeyName : cloud on-ramp test keypair
Platform : linux
21
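The restrictions and mandatory keys above can be checked mechanically. A minimal sketch (illustrative, not the project's actual tooling; key names follow the lowercase form used in the example above):

```python
# Sketch of validating the mandatory Cloud On-Ramp tagging scheme and the
# AWS tag restrictions described above. Illustrative only.

MANDATORY_KEYS = {"Name", "owner", "technical_contact", "billing_contact",
                  "description", "sle"}

MAX_TAGS = 10          # AWS limit per resource (at the time of writing)
MAX_KEY_LEN = 127      # Unicode characters
MAX_VALUE_LEN = 255    # Unicode characters

def validate_tags(tags):
    """Return a list of problems with a resource's tag dict (empty = OK)."""
    problems = []
    if len(tags) > MAX_TAGS:
        problems.append(f"too many tags: {len(tags)} > {MAX_TAGS}")
    for key, value in tags.items():
        if len(key) > MAX_KEY_LEN:
            problems.append(f"key too long: {key!r}")
        if len(value) > MAX_VALUE_LEN:
            problems.append(f"value too long for key {key!r}")
    # Tag keys are case sensitive, so the comparison is exact.
    for key in sorted(MANDATORY_KEYS - set(tags)):
        problems.append(f"missing mandatory tag: {key}")
    return problems
```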
Cost Accounting During the Project
79.7% of AWS costs on average could be directly tied to an owner/department
After accounting for sales tax and a proportional amount of AWS support costs, we would have been able to assign 89.2% of AWS costs to owners/departments for potential chargebacks in the future
Strict tagging of servers, network interfaces, volumes, snapshots, etc. is critical
Resources attached to servers (volumes & NICs) need to automatically inherit tags from their parent to ensure all costs are captured
Tag creation, maintenance and enforcement needs to be fully automated
(Chart: daily spend report, invoiced vs. chargebacks, showing end-of-month taxes and excessive "leakage")
22
Cost Reporting and Potential Chargebacks

owner          invoice_date  bill
-----          ------------  -------
_adm/custserv  2015-07-01    $401.96
_adm/iops      2015-07-01    $370.46
_adm/solarch   2015-07-01    $213.45
_adm/infosec   2015-07-01    $102.43
_adm/ess       2015-07-01    $77.04
_adm/scicomp   2015-07-01    $1.71
Chargebacks: Monthly Charges by Owner Tag
Monthly Charges by Owner Tag – Last Month vs This Month
Pulled via API
23
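The per-owner rollup above can be sketched as a simple aggregation over tagged line items. The input format here is hypothetical; the real numbers were pulled via the AWS billing API:

```python
# Sketch of a per-owner chargeback rollup. Input format is hypothetical.
from collections import defaultdict

def chargebacks(line_items):
    """Sum costs by 'owner' tag; untagged spend is kept visible under
    'untagged' so billing leakage shows up instead of silently vanishing."""
    totals = defaultdict(float)
    for item in line_items:
        totals[item.get("owner") or "untagged"] += item["cost"]
    return dict(totals)
```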
Elastic Compute Cloud (EC2) Overview
EC2 is Amazon's virtual server (instance) service
38 instance types are available, ranging from 10% of a CPU core and 1GB of RAM to 40 CPU cores and 244GB of RAM
General purpose, fractional CPU (burstable), compute optimized, memory optimized, storage optimized (both high IO and high density) and GPU instance classes
On-demand, reserved and spot market pricing options
Some instances come with "free" ephemeral storage
General purpose SSD EBS volumes up to 16TB each
Provisioned IOPS volumes can provide up to 20,000 IO operations per second per volume
Shared or dedicated tenancy models available
(Icons: instances, AMIs (images), network interfaces, auto scaling, magnetic disks, general purpose SSD disks, provisioned IOPS disks, encrypted disks, snapshots, monitoring, alerting, load balancing)
24
EC2: On-Demand Pricing
Zero upfront costs with no long-term commitments
Charged hourly (fractional hours rounded up) for the time the instance is running
Each instance type has a different hourly rate
Availability of specific instance types in specific AZs can fluctuate with demand
Best for short-term workloads that can't be interrupted while running
Best for systems that can be shut down when not in use (test, monthly jobs, experiments)
Most flexible option but also most expensive
Type CPUs RAM Temp Storage Rate Annual
t2.micro 1* 1GB none $0.013 $114
m3.medium 1 3.75GB 1 x 4GB SSD $0.067 $587
t2.large 2* 8GB none $0.104 $911
m4.large 2 8GB none $0.126 $1,104
m4.2xlarge 4 16GB none $0.252 $2,207
c3.4xlarge 16 30GB 2 x 160GB SSD $0.84 $7,359
m4.10xlarge 40 160GB none $2.52 $22,075
i2.8xlarge 32 244GB 8 x 800GB SSD** $6.82 $59,743
* Fractional CPU with credit-based burst
** 365,000 random read IOPS total
25
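The annual column in the table above is simply the hourly rate multiplied by 8,760 hours (24 × 365). A quick sanity check, ignoring the rounding-up of fractional billing hours:

```python
# Sanity check of the on-demand annual column: hourly rate x 8,760 hours.
# Ignores the rounding-up of fractional billing hours.

HOURS_PER_YEAR = 24 * 365   # 8,760

def annual_cost(hourly_rate):
    """Estimated cost of leaving an on-demand instance running all year."""
    return hourly_rate * HOURS_PER_YEAR
```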
EC2: Reserved Instances
Reserved instances are commitments for 1 or 3 years and provide guaranteed availability
All-upfront, partial-upfront and no-upfront purchasing options
Purchased for specific availability zones
Best for long-term production servers
Unwanted reserved instances can be sold on the reserved instance market
Loss of flexibility, but cost savings can be significant (up to 75%)
T2.Large Example
26
EC2: Spot Market Pricing
Save up to 90% by bidding on unused capacity
Spot instances are functionally identical to on-demand and reserved instances
Requested instances are launched when your bid matches or exceeds the market rate
Market rate fluctuates based on current supply and demand in a particular zone
When the market rate exceeds the bid, instances are terminated after a two-minute notice
Good for short-running jobs or long-running processes that can checkpoint their state
HPC and ad-hoc testing are good candidates for spot instances

Example: m4.10xlarge (40 CPUs, 160GB RAM)
On-demand rate: $2.52/hour
Current spot rate: $0.27 (us-west-2a)
89% cost savings
27
Spot Market Rate History for M4.10xlarge instance in Oregon
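The 89% figure above can be reproduced directly from the two rates; a sketch:

```python
# Reproduces the spot savings figure from the two hourly rates.

def spot_savings_pct(on_demand_rate, spot_rate):
    """Percent saved by running at the spot rate instead of on-demand."""
    return round(100 * (1 - spot_rate / on_demand_rate))
```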
Infrastructure as Code
28
Infrastructure as Code
Networks, servers, storage, security, databases, monitoring, etc. can be defined in code
Code "stacks" can be written to build entire complex infrastructures or multi-tier application stacks
Your infrastructure or application stack is always documented, versioned and consistently repeatable
Infrastructure code is strictly managed and tracked via a source code management system
Disaster recovery and business continuity can be an order of magnitude faster/simpler
This is the future of IT operations
Requires a different set of skills than traditional IT operations
Example: Infrastructure changes are automatically documented and versioned
29
Infrastructure as Code Demo
Q (Management): How long would it take to migrate the Center's public web site to the cloud while increasing security, performance and availability? Will 6 months and a budget of $250K work?
A (DevOps Engineer): I can do it in 30 minutes for less than $5 per day. Where would you like it to reside? Oregon, California, or North Virginia?
30
Infrastructure as Code Example
This code does the following:
Creates a new VPC wherever you like
Creates 2 public subnets (each in a different AZ)
Creates 2 private subnets (each in a different AZ)
Creates an Internet gateway
Creates a route table to route public traffic to the Internet gateway
Creates a NAT instance in a public subnet
Creates a route table to route Internet-bound traffic from the private subnets to the NAT instance
Creates security groups and network ACLs for both the private and public subnets
Creates two Linux instances, one in each private subnet
Installs and configures the NGINX web server on each Linux instance (via a user-data script pulled from a GitHub repo and provided to each instance at boot)
Each Linux web server pulls (via the NAT connection) a 1.4GB tarball containing the public www.fredhutch.org website from an S3 bucket and extracts it into the NGINX web root
Creates an elastic load balancer (ELB) and attaches interfaces to the public subnets
Creates ELB health checks to verify the health of the new Fred Hutch web servers
Adds the web server instances to the load balancer
Adds a DNS CNAME to the fredhutch.center DNS zone that points to the ELB public DNS name
Sends an SMS text message to my iPhone when it's done
31
Fredhutch.org Website Migration Demo
(Diagram: orchestration code builds, in each of two availability zones, a public subnet containing a NAT instance and a private subnet containing a web server, each protected by security groups and NACLs; base images (AMIs) and webserver config code from GitHub configure the instances; the website archive is pulled from S3; an ELB with health checks and an EIP fronts the web servers; Route 53 (R53) serves www.fredhutch.center; public and private route tables, DHCP options and a key pair complete the VPC; SNS sends an SMS text to Robert's iPhone when complete)
32
Custom Cloud Automation Code
EC2 instance provisioning to build and configure Windows and Linux instances
o Tags instance with all mandatory tags
o Configures OS during bootstrap
o Optionally enables monitoring and alerting
o Optionally enables daily backup rotation for all attached volumes
o Optionally registers instance in DNS
o Optionally creates and attaches additional "data" disk
o Optionally configures data disk for encryption
o Optionally configures instance for scheduled retirement
EC2 instance reporting
o Gathers all information on instances to find, filter and report on instances
EC2 tag primer
o Finds all instances without tags and creates all tag key stubs
o Reports to ops group that instances without tags were found and tagged
EC2 tag inheritance
o Ensures that EBS volumes and NICs attached to an instance inherit the parent instance's tags
EC2 tag enforcer
o Finds instances that are missing the mandatory tags
o Reports them to ops group and optionally shuts them down
33
Custom Cloud Automation Code (cont.)
EC2 backups
o Finds all volumes that are tagged with a backup retention and snapshots them
o Tags the snapshots to identify the owner, parent instance and retention date
o Purges snapshots that are past their designated retention date
EC2 instance lifecycle
o Finds instances that are scheduled for retirement in the next 30 days and reports on them
o Retires instances that have reached their retirement date
Virtual datacenter creation
o Creates a VPC, subnets, security groups, NACLs, gateway, routing tables, DHCP options
"GrabCloudNode" research compute node provisioning
o Researcher-facing tool to provision a cloud-based HPC node
o Similar to the existing "grabnode" functionality that researchers use to access on-premises compute resources
o Tags the instance with all mandatory metadata tags
o Facilitates transferring data to and from the cloud node
o Sets up monitoring to automatically shut down the node if it's idle for more than 1 hour
34
Cloud Security
35
AWS Shared Responsibility Model
Amazon AWS certifications and accreditations:
PCI DSS Level 1, SOC 1 / ISAE 3402, SOC 2, SOC 3, FIPS 140-2, CSA, FedRAMP, DIACAP, FISMA, ISO 27001, MPAA, Section 508 / VPAT, HIPAA, DOD CSM Levels 1-2 and 3-5, ISO 9001, CJIS, FERPA, G-Cloud, IT-Grundschutz, IRAP (Australia), MTCS Tier 3 Certification, ITAR
Amazon is responsible for the security of the cloud; we are responsible for our security in the cloud
(Diagram: our responsibility vs. their responsibility)
36
Proposed Security Choices
37
What we think are good ideas based on what we learned during the project
See appendix for details
These are not Policies
o we do not understand our actual AWS use cases well enough to have policies and procedures
o we have not tested these proposed choices in real operations mode
Proposed Security Choice Highlights
38
2-factor authentication for AWS administrator accounts with passwords
For service accounts making API calls with access keys, implement IP restrictions
Maintain team-level (as opposed to individual) access key repositories
Turn on CloudTrail and CloudConfig auditing everywhere. Send logs to Splunk
Clearly defined governance model for VPC-level design and changes—subnets, ACLs, EIPs, VPNs, VPC-peering…
Protect traffic between our EC2 instances and the Internet with a virtual firewall/IPS appliance, or some host-level alternative
Security-wise, AWS adoption brings…
39
Potential benefits
Challenges and uncertainties
For certain areas of IT operations, no change
Potential Security Benefits
40
Complete audit trail of infrastructure access & changes
Improved detection and alerting of security exceptions at the infrastructure level—faster, more precise incident response and recovery
Security goals such as physical security of the data center, protection of backup media, and secure disposal of unwanted storage media are simply easier to accomplish with AWS
Potential Security Benefits (cont)
41
Relatively easy to compartmentalize IT resources with well-defined technical and administrative boundaries
o Optimized to create discrete computing/storage/application instances on demand without having to maintain a common infrastructure
o Create separate "networks" with VPCs
o Buy storage space in the form of S3 buckets
o Separate database instances with RDS
o Control admin access via granular access control rules
IP & port filtering on a per-server basis with security groups
Security Challenges and Uncertainties
42
The “everything-as-code” paradigm bundles the different layers of the IT stack together in a way that is not necessarily compatible with our current separation of duties. Combining services such as networking, server, OS, apps, and security filtering into one set of code blurs the lines between different teams’ responsibilities, leading to confusion, unmet expectations, and lost opportunities for cross-checking.
Our current team structure in Center IT is optimized for our physical IT environment. It is not necessarily efficient for managing the software-defined world of AWS. AWS is not a virtualized copy of our IT infrastructure.
New frontier. It will take time to establish new policies, expectations, and norms in order to operate AWS securely and smoothly. Potential for friction and dropped balls.
Security Challenges and Uncertainties (cont)
43
Rapid evolution of AWS features—long-term investment in staff time for learning.
AWS represents not a replacement of existing infrastructure, but a parallel one. We must duplicate resources to secure it.
Security - No Change
44
OS and application patching (but we may end up maintaining fewer servers, if we purchase things like storage and database “as-services” from AWS, instead of running our own)
Need for firewall/IPS/WAF protections (must be purchased via 3rd party vendors).
Cloud Computing for
Scientific Applications
45
Ad-Hoc Capacity
46
When compute capacity needs (cores, memory, storage) exceed what is available in-house
o Reduce time-to-solution
o Scale wide/short:
• 100 cores for 10 hours has same cost as 1000 cores for 1 hour
o Rent-a-terabyte:
• Short term analyses and interim storage options won’t require large capital investment
Ad-Hoc Capability
47
Use of technologies not currently available in-house
o GPU
o Low-latency interconnect (AWS “enhanced networking”)
o Short term or one-off analyses won’t require large capital investment
Sandbox
48
Provide a sandbox for prototyping and evaluation
o Easily provisioned ephemeral environment
o Allows researcher to try new algorithms and evaluate methods without constraints
o Docker and AMIs are popular mechanisms for distributing data, tools, and pipelines
Container Solutions
49
Containers are
o “a server-virtualization method where the kernel of an operating system allows for multiple isolated user-space instances”
o Docker containers and AMIs allow distribution of tools and data in a portable container.
o Reproducibility and distribution of results
o Difficult and cumbersome (security) to deploy in-house
• Easy to pop into a sandbox in the cloud!
Science DMZ
Transferring data into the cloud is free; transferring data out of the cloud is charged by the GB
Download large datasets quickly and inexpensively using Amazon's big network pipes
Analyze and process data in the cloud using cloud resources (EC2, EMR, …)
Download the results of the analysis or experiment to the Hutch
(Diagram: data repositories such as dbGaP and EMBL-ENA feed the Fred Hutch Amazon VPC, where data is stored in S3 and analyzed with EC2 compute; a researcher on the Fred Hutch campus retrieves the results to /fh/fast)
"…designed such that the equipment, configuration, and security policies are optimized for high-performance scientific applications rather than for general purpose business systems" - ESnet
50
Collaboration
Environments providing compute, application, and storage for collaborations between the Hutch and others
o Resources are independent
o Access from one to the other via peering
o Uses AWS high-throughput networking
o Data transfer does incur cost
o Good for bringing in outside expertise
(Diagram: a Hutch intercloud VPC with S3 storage peered with a partner VPC providing EC2 compute)
In this example, a group with compute expertise provides their computational resources, accessing Hutch-produced data via a VPC peering relationship
51
Meet-Me
52
A self-contained VPC for collaboration
o Custom environment
o Isolated from other Hutch resources
o Limits need for shipping data between organizations and VPCs (c.f. intercloud)
o IAM controls access and authorization
(Diagram: Hutch and partner users both access a self-contained Meet-Me VPC containing EC2 compute and S3 storage)
Illumina Data for External Customers
Upload from HiSeq into S3 (implemented today)
Processing in EC2
Download by customer, or transfer of the bucket to the customer via VPC peering or S3 copy
(Diagram: Genomics Shared Resources on the Fred Hutch campus uploads basecalls & alignments from Gerald/Bustard/etc. into "raw" data storage in S3 within the Fred Hutch Amazon VPC; EC2 compute processes the data; results are delivered to an external SR customer directly or to the customer's VPC via S3; Glacier archives the data, raw & cooked)
This is simply an example of a possibility; no plans or proposals are in place at this time!
53
Proteomics in the Cloud
The Proteomics Lab is currently testing Proteome Discoverer in the cloud
Run time on the current local system: 150 hours
Cloud comparison:
8 CPU cloud server
o Run time: 123.24 hours
o Cost: $124.99
36 CPU cloud server
o Run time: 42.47 hours
o Cost: $132.91
Key concepts:
1 server running for 100 hours costs about the same as 100 servers running for 1 hour
Running an 8 CPU system for 4 hours costs about the same as running a 32 CPU system for 1 hour
(Chart: total cost, 8 CPUs ≈ $125 vs. 36 CPUs ≈ $133)
54
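The key concept above follows from cost scaling with core-hours rather than wall-clock time; a sketch using a hypothetical hourly rate:

```python
# Cloud compute cost scales with server-hours (servers x hours), not
# wall-clock time. The $0.10/hour rate below is hypothetical.

def compute_cost(servers, hours, rate_per_server_hour):
    """Total cost of running N identical servers for H hours each."""
    return servers * hours * rate_per_server_hour
```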
AWS HIPAA BAA Details
55
We must identify the AWS account IDs that we want covered by the BAA
We are responsible for implementing appropriate privacy and security safeguards in order to protect our PHI in compliance with HIPAA
The following are the current HIPAA-eligible services:
o Amazon Elastic Compute Cloud (EC2)
o Amazon Simple Storage Service (S3)
o Amazon Elastic Block Store (EBS)
o Amazon Glacier
o Amazon Redshift
o Amazon RDS (MySQL and Oracle engines only)
o Amazon Elastic Map Reduce (EMR)
o Amazon DynamoDB
o Elastic Load Balancing
All compute instances processing, storing, or transmitting PHI must be dedicated instances
Dedicated instances won't share a hypervisor host with any other customers
Dedicated tenancy costs an extra $2 per hour but covers all EC2 instances in a region
AWS will report all security incidents and breaches to us
We must enable all auditing and logging (CloudTrail, CloudConfig)
All PHI data must be encrypted at rest and in transmission
Set the ELB load balancer protocol to TCP for sessions containing PHI, and the TCP session must be encrypted end-to-end (no SSL termination on the ELB)
NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing (GDS) Policy *
Information security in cloud environments is still the responsibility of the institution; the implementation of that security is shared between the institution and the cloud service provider
You and your institution are accountable for ensuring the security of this data, not the cloud service provider.
The NIH strongly recommends that investigators consult with institutional IT leaders, including the Chief Information Officer (CIO) and the institutional Information Systems Security Officer (ISSO)
* http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=dbgap_2b_security_procedures.pdf
** https://d0.awsstatic.com/whitepapers/compliance/AWS_dBGaP_Genomics_on_AWS_Best_Practices.pdf
Whitepaper: “Architecting for Genomic Data Security and Compliance in AWS” **
Guidance for working with controlled-access datasets from dbGaP, GWAS, and other individual-level genomic research repositories
Co-authored by Chris Whalley, formerly of the Fred Hutch regulatory compliance office
dbGaP Data in the Cloud
56
AWS Support Options
57
Basic: Trusted Advisor (4 checks); support for health checks; primary case handling by a Technical Customer Service Associate; cost: free
Developer: Trusted Advisor (4 checks); email support during local business hours; Cloud Support Associate; 1 user can create support requests; response time <12 hours; cost: $49/month
Business: Trusted Advisor (41 checks); phone, chat, email and live screen sharing (24/7); Cloud Support Engineer; unlimited users (IAM supported); response time <1 hour; cost: 10% of the monthly bill (rate goes down at higher spending tiers)
Enterprise: Trusted Advisor (41 checks); phone, chat, email, live screen sharing and a TAM (24/7); Sr. Cloud Support Engineer; unlimited users (IAM supported); response time <15 minutes; cost: $15,000 (rate goes down at higher spending tiers)
- We used "Basic" support during the cloud on-ramp project
- AWS has excellent documentation, so we didn't need to contact support during the project
- I recommend that we upgrade to a Business support plan prior to production use
AWS Benefits
- An opportunity to build a high-security computing environment for our current and future security needs
- A complete, up-to-the-minute inventory of all cloud resources
- Complete visibility and accountability of all IT costs associated with the cloud
- We can accurately calculate chargebacks for almost 90% of cloud costs
- Disaster recovery and business continuity are a reality
- Rapidly respond to urgent or unplanned IT needs
- Audit all access and configuration changes
- A documented, versioned, repeatable IT infrastructure is possible
- Everything can be automated via well-documented APIs and SDKs
- No physical infrastructure equipment (servers, switches, routers, PDUs, etc.) to maintain
- Collaborate with other institutions in the cloud via VPC peering
- Take advantage of Amazon's fat network pipes to download large datasets to the cloud
- With the brokering layer in place, we could offer self-service IT to CIT, divisional IT and research staff across the Center

58
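The cost-accountability benefit rests on rolling billing line items up by tag. The following is a minimal sketch of that rollup; the instance records are made-up examples, and in practice they would come from the AWS billing report joined with each resource's cost-center tag.

```python
# Sketch: tag-based chargeback rollup. The billing records below are
# hypothetical; real data would come from AWS detailed billing reports.
from collections import defaultdict

def chargebacks(line_items):
    """Sum cost per cost-center tag; untagged spend is grouped under
    'unallocated' so the coverage gap stays visible."""
    totals = defaultdict(float)
    for item in line_items:
        center = item.get("tags", {}).get("CostCenter", "unallocated")
        totals[center] += item["cost"]
    return dict(totals)

billing = [
    {"resource": "i-0001", "cost": 120.0, "tags": {"CostCenter": "genomics"}},
    {"resource": "i-0002", "cost": 80.0,  "tags": {"CostCenter": "genomics"}},
    {"resource": "i-0003", "cost": 50.0,  "tags": {"CostCenter": "infra"}},
    {"resource": "vol-01", "cost": 10.0,  "tags": {}},
]
print(chargebacks(billing))
# {'genomics': 200.0, 'infra': 50.0, 'unallocated': 10.0}
```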
AWS Challenges
59
The highly abstracted nature of the cloud combined with the infrastructure as code paradigm results in a blurring/elimination of the boundaries between traditional IT roles and separation of duties
Our current team structure in Center IT is optimized for our physical IT environment. It is not necessarily efficient for managing the software-defined world of the Cloud. The Cloud is not a virtualized copy of our IT infrastructure.
AWS is evolving at a very rapid pace. IT staff responsible for the cloud infrastructure/service will need to stay abreast of all changes and incorporate these changes into our architecture/service when beneficial
There is currently ~10ms of network latency between the Center and our Oregon VPC. Data and compute should be co-located for the best performance:
o On campus: ~1ms
o Campus to Oregon AWS VPC: ~10ms
o Campus to Europe: ~150ms
o Campus to Africa: ~300ms
Our VPN can currently only encrypt network traffic to and from our VPC at a rate of 300Mb/s (37.5MB/s). Moving large data sets between campus and our VPC will be very slow until we upgrade to a dedicated direct connect or other solution.
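A quick back-of-envelope calculation shows why the 300 Mb/s VPN throughput limit matters for large data sets:

```python
# Back-of-envelope: time to move a dataset through the 300 Mb/s
# (37.5 MB/s) VPN tunnel described above.
def transfer_hours(dataset_bytes, rate_bytes_per_sec=37.5e6):
    """Hours to transfer dataset_bytes at the given sustained rate."""
    return dataset_bytes / rate_bytes_per_sec / 3600

one_tb = 1e12  # 1 TB (decimal)
print(round(transfer_hours(one_tb), 1))  # ~7.4 hours per terabyte
```

At that rate a multi-terabyte genomics dataset takes the better part of a day each way, which is the motivation for the direct connect upgrade mentioned above.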
Not Everything Works Well in the Cloud

The cloud is not possible for anything that…
- Is physically connected to an instrument
- Requires a licensing "dongle"
- Has very specific hardware requirements (model XYZ only)
Examples: Aperio, OnBase

The cloud is not currently a good fit for anything that…
- Requires very low latency access to systems/data on Campus
- Requires high throughput access to systems/data on Campus *
Examples: BI, PeopleSoft, Varonis, Hyperion

The cloud might not be cost effective for anything that…
- Requires a large server (many CPUs) and runs 24/7

* This limitation can be removed by implementing a 10GbE direct connect (costs $5-7K / month)

60
Capabilities Gained During This Project
- Create complex cloud-based datacenters and networks
- Logically integrate cloud networks with our campus network
- Secure cloud resources with security groups, NACLs and third-party firewalls
- Limit access with fine-grained security policies and multi-factor authentication
- Consistently provision and configure Windows and Linux server instances
- Audit configuration changes to enforce change management
- Audit all AWS access (web console, CLI, API)
- Backup and recover servers in the cloud
- Implement and enforce a metadata tagging scheme
- Cost accounting and reporting to facilitate chargebacks
- Monitor systems, trend metrics and alert support staff
- Load balance private and public network traffic
- Vertically scale (up/down) or horizontally scale (out/in) systems
- Log all network traffic in/out of the VPC
- Peer with other organizations for collaboration in the cloud
- Automate everything via AWS APIs, CLI tools, CloudFormation, Packer
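Enforcing a metadata tagging scheme, one of the capabilities listed above, usually means validating tags before (or shortly after) provisioning. A minimal sketch follows; the required tag keys are illustrative, not the actual FHCRC scheme.

```python
# Sketch: validating a resource's tags against a required tagging scheme.
# The required keys below are hypothetical examples.
REQUIRED_TAGS = {"Name", "Owner", "CostCenter", "Environment"}

def missing_tags(tags):
    """Return the required tag keys that are absent or empty."""
    return {k for k in REQUIRED_TAGS if not tags.get(k)}

tags = {"Name": "web01", "Owner": "rmcdermott", "CostCenter": ""}
print(sorted(missing_tags(tags)))  # ['CostCenter', 'Environment']
```

A check like this could run on a schedule against the EC2 API and flag (or stop) non-compliant instances.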
61
FHCRC AWS Roadmap

Determine cloud operations model (who and what)
Develop cloud governance model
Select production account and VPC architecture
Extend the production FHCRC Active Directory with an AWS AD site
Integrate AWS user authentication (IAM) with FHCRC AD via SAML
Implement chargebacks (not decided yet)
Offer brokered self service to Center IT departments
Implement a direct connect network to AWS or other solution (Internet2)
Integrate EC2 into scientific computing service offering for researchers
Offer brokered self service to the research community

62
Key Takeaways

The cloud is no longer just hype; it's a very capable, mature platform that can offer increased agility, flexibility, security and capabilities
The cloud is not a traditional IT operating environment and requires a different approach to operate effectively
Not every server or application can or should move to the cloud
It won’t happen overnight; the journey to the cloud will take several years
Center IT is not currently offering the cloud as a service, but we may in the future
63
Appendix
65
Proposed Security Choices
66
Identity and Access Management for AWS
- API-based access to AWS for automation tasks should be done using service accounts instead of the individual accounts of technicians.
- All AWS user accounts belonging to humans must use two-factor authentication.
- Permissions granted to service accounts should be restricted to the source IPs or subnets of the servers needing those permissions. In AWS parlance, IAM policies granting permissions to service accounts should use source IP as a condition.
- Service accounts should be used with access keys only; they should not be associated with passwords. Access keys must be stored in encrypted, team-level key repositories, not in the personal storage space of individual technicians.
- The ability to modify IAM settings should be restricted to the ITSO. Exceptions (e.g. service accounts requiring IAM permissions) must be approved by the ITSO.
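The source-IP condition described above is expressed in the IAM policy document attached to the service account. Here is a minimal sketch of such a policy, built as a Python dict; the bucket name and CIDR are placeholders, not our real resources.

```python
# Sketch: an IAM policy for a service account, restricted by source IP.
# The bucket name and subnet CIDR below are hypothetical placeholders.
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::example-backup-bucket/*",
        # Requests are only allowed when the API call originates
        # from this subnet (the aws:SourceIp condition key)
        "Condition": {"IpAddress": {"aws:SourceIp": "172.16.160.0/21"}},
    }],
}
print(json.dumps(policy, indent=2))
```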
Logging and Auditing
- CloudTrail must be turned on for all regions. CloudTrail logs must be forwarded to Splunk.
- Splunk should be set up to monitor CloudTrail logs and alert [the cloud operations team] of notable activities, including activities that have security implications, such as account creation and permission changes. The exact set of events to be monitored will be defined as we operationalize our AWS environment, and continuously updated as we accumulate knowledge of AWS.
- CloudTrail logs must be retained in Splunk for at least a year.
Proposed Security Choices (cont)
67
VPC-Level Security
- The permissions to modify VPC-level configurations must be limited to data-ops staff. Changes should be made in consultation with the ITSO. VPC-level configurations include, but are not limited to:
o Creation/removal of subnets
o Assignment/removal of Elastic IPs (EIPs)
o Changes related to Access Control Lists (ACLs)
o Changes related to internet gateways, VPNs, and VPC peering
o Changes related to VPC Endpoints
- All traffic between the FHCRC campus and the private IP space of our AWS VPC will go through a VPN tunnel. This covers the scenario where hosts on the FHCRC campus access non-publicly accessible hosts in our VPC, and vice versa.
- A Fortigate virtual appliance will be deployed within our VPC and managed by the ITSO. Firewall, IPS, anti-virus, and application control features will be enabled on the Fortigate. The appliance will inspect traffic in the following scenarios:
o All traffic originating from the VPC to non-FHCRC addresses. This covers the scenario where hosts within the VPC need to initiate connections to the internet at large, for reasons such as patching.
o All connections to publicly accessible hosts within the VPC, including connections originating from our campus network.

EC2-Level Security
- By default, when EC2 instances are created they should be associated with one of the pre-defined security groups created by the ITSO. Network administrators should not create new security groups unless there is a specific need to do so, and it should be done in consultation with the ITSO.
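A pre-defined security group of the kind described above might allow SSH only from campus address space while exposing HTTPS publicly. Below is a sketch of the ingress rules as they would be passed to boto3's `ec2.authorize_security_group_ingress()`; the group ID and campus CIDR are placeholders.

```python
# Sketch: ingress rules for one pre-defined security group.
# The group ID and CIDRs below are hypothetical placeholders.
ingress_args = {
    "GroupId": "sg-xxxxxxxx",
    "IpPermissions": [
        {   # SSH, only from the campus/VPN address space
            "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
            "IpRanges": [{"CidrIp": "192.168.0.0/16"}],
        },
        {   # HTTPS from anywhere, for publicly accessible hosts
            "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
        },
    ],
}
# With credentials configured:
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(**ingress_args)
```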
VPC Peering Example
[10.99.1.50]$ traceroute 192.168.1.156
 1  192.168.1.156  1.472 ms
[10.99.1.50]$ curl http://192.168.1.156/
Welcome to Organization A!!

[192.168.1.156]$ traceroute 10.99.1.50
 1  10.99.1.50  1.417 ms
[192.168.1.156]$ curl http://10.99.1.50/
Welcome to Organization B!

[Diagram: two peered VPCs belonging to Organization A and Organization B (AWS accounts 123456789 and 987654321), each with a peering connection and a route table entry pointing at the other VPC. It works in both directions.]

68
DNS Cloud Integration

[Diagram: DNS grid integration. The IBX0 DNS master (albite10.fhcrc.org) sends grid updates to members albite01.fhcrc.org (IBX-J4, IBX-E2) and to grid members in the AWS VPC at US-WEST-2A (172.16.160.11) and US-WEST-2B (172.16.168.11).]

DNS views:
- Internal: 192.168/16, 10.168/16, 172.16/16 (AWS VPC)
- External: !(internal)

AWS VPC resolv.conf: 172.16.160.11, 172.16.168.11, 192.168.116.A

69
Project Scope

In Scope
- Develop functional and security requirements
- Design and implement virtual datacenter architecture (regions, zones, subnets, etc.)
- Extend the Center's IP network to the virtual datacenter (VPC)
- Develop security policies and/or guidelines on appropriate use of the environment
- Active Directory and DNS services
- Create FHCRC server templates (AMIs) and standards
- Server pricing strategy (on-demand vs. reserved instances)
- Develop and test various use cases
- Train operational staff on the use of the environment
- Determine RBAC/account strategy
- Develop accounting strategy to support future chargeback functionality
- Select and pilot at least one production server/service in the new virtual datacenter
- Pilot researcher use of EC2
- Develop a roadmap for this environment
Out of Scope
- Implementing a high-speed "Direct Connect" network connection
- Implementation of chargebacks
- Chef automated server builds (Chef implementation project still in progress)
- Customer self-service

70
Cloud Architecture Requirements

During this project we determined that we can satisfy all of the following "must have" requirements:
- Secure, logical network extension of FHCRC IP space into the cloud
- Encrypted transport between FHCRC and the cloud
- Design that allows HA-architected services
- Separate public network to run services outside of the FHCRC network
- Support both enterprise and research computing
- Cost tracking and reporting
- System metrics, monitoring and logging
- FHCRC Active Directory access/integration for servers
- FHCRC DNS access/integration for servers
- Ability for servers to log to Splunk
- Role-based administrative access (RBAC)
- Ability to backup/restore servers
- Secure storage media wipe on deletion
- Support fully automated provisioning and configuration
- Support Windows 2012, CentOS 6/7 and Ubuntu 14.04 operating systems
- Pre-cooked FHCRC server templates (CentOS, Ubuntu and Windows 2012)
- Stateful ingress/egress firewall capability
- Advanced intrusion prevention firewall
- Vendor support
- Granular cost reporting (per server, application type, owner) to support future chargeback implementation
- Metadata tagging capability to identify and group AWS objects

71