Cloud On-Ramp Project
August 27th, 2015
Overview, Concepts and Capabilities
Contact: Robert McDermott [email protected]
Agenda
High-level project goals
Why AWS?
Overview of AWS services
Virtual Private Clouds (VPC) and networking
Account and VPC options
Metadata tagging
Cost accountability and reporting
Elastic Compute Cloud (EC2) overview
Infrastructure as code
FHCRC directory services integration (AD/DNS)
Cloud security overview
Cloud computing for scientific applications overview
AWS support options
HIPAA BAA details
Cloud benefits
Cloud challenges
Capabilities we gained during this project
High-level AWS roadmap
2
High-Level Project Goals
Gain experience and competency operating securely in a public cloud environment
Design and implement a cloud based virtual datacenter
Logically extend the Center’s internal IP network to secured subnets in the cloud datacenter
Explore various use cases (Servers, HPC, application hosting, database hosting, etc…)
Stand up at least one production server/service by the conclusion of this project
Develop a roadmap for future use and enhancements of the architecture
Gain operational flexibility to respond quickly to emerging needs 3
Cloud Basics
4
We Manage vs. They Manage
On-premises (we manage everything): server rooms, physical servers, virtual servers, network gear, storage systems
IaaS: Amazon AWS, Rackspace, Microsoft Azure, Google Compute
SaaS: Office 365, DNAnexus, DropBox, Google Docs
PaaS: Amazon AWS, Microsoft Azure, Google App Engine, Heroku
Why Amazon Web Services?
Market leader
o Overwhelming market share (Gartner)
o 10 times the compute capacity of all competitors combined (Gartner)
Greatest breadth and depth of services
Maturity: IaaS service launched in 2006
o Microsoft announced its Azure IaaS (VMs) service preview on June 7, 2012
o Google Compute Engine limited preview started June 28, 2012
Rapid pace of innovation: 449 new service and feature rollouts in 2014 alone
Cost competitive
o 48 price reductions since 2006
o Reserved instances and the spot market
Broad adoption
o Government, financial, healthcare, education, research, entertainment, etc.
Security certifications
o FISMA, HIPAA, SOC, PCI DSS, ITAR, DOD CSM, FedRAMP, ISO, FERPA, etc.
Most reliable cloud provider (CloudHarmony.com)
o AWS: 2.4 hours of total downtime in 2014 (zero downtime in our region)
o Google Compute Engine: 4.4 hours of total downtime in 2014
o Azure: 39.6 hours of total downtime in 2014
Why AWS? Why Not? 5
Gartner Magic Quadrant for IaaS 2010 - 2015
6
Are we locked in to AWS?
We are not locked in; it's possible to use other IaaS providers at the same time as AWS if there are compelling reasons to do so.
Using both Microsoft Office 365 (SaaS) and Amazon AWS (IaaS) at the same time works with no overlap in effort or waste; it's likely even a great strategy: Microsoft Office 365 for SaaS and Amazon AWS for IaaS.
Dumping AWS to move to another IaaS provider (Azure, Google, Rackspace) would require reworking everything (networking, automation, chargebacks, etc.). It's possible, but recreating the architecture would be a lot of work.
Chris Drumgoole, COO of IT at General Electric:
"…we really view ourselves to be a service provider to our businesses, so our businesses can buy [AWS cloud services] from us or they can buy from others. You can go to Amazon directly or you can go to Azure directly. If you want to come through me, by definition, you're going to live and operate in this safe environment. I have already taken care of the things that GE holds dear and our requirements around regulation, security, data privacy and so on. I pre-built and pre-instrumented the environment so that those things are not something you have to worry about. That's the benefit of coming to me. If you decide to go on your own, you certainly can. We're never going to stop you, but understand that now those things are on you and you have to take care of them."
http://www.infoworld.com/article/2824508/cloud-computing/ges-head-of-it-were-going-all-in-with-the-public-cloud.html
7
Overview of AWS Services
Used during project
Compute
Storage and Content Delivery
Databases
Networking
Administration & Security
Deployment & Management
Analytics
Enterprise Applications
Mobile Services
Application Services
8
Overview of AWS Services
Research Computing
Compute
Storage and Content Delivery
Databases
Networking
Administration & Security
Deployment & Management
Analytics
Enterprise Applications
Mobile Services
Application Services
9
AWS is similar to Lego Mindstorms…
Lego Mindstorm building blocks + imagination & code = a Sudoku-solving robot
Amazon AWS building blocks + imagination & code = a service with $5.5 billion in annual revenue that consumes 37% of Internet traffic at peak
10
Region Location AZs
us-east-1 N. Virginia 5
us-west-1 N. California 3
us-west-2 Oregon 3
eu-west-1 Ireland 3
eu-central-1 Frankfurt 2
ap-southeast-1 Singapore 2
ap-southeast-2 Sydney 2
ap-northeast-1 Tokyo 3
sa-east-1 Sao Paulo 3
us-gov-west-1 Pacific northwest 2
cn-north-1 Beijing 2
In addition to regions and zones, there are currently 64 edge locations:
Atlanta, GA - Ashburn, VA (3) - Dallas/Fort Worth, TX (2) - Hayward, CA - Jacksonville, FL - Los Angeles, CA (2) - Miami, FL - New York, NY (3) - Newark, NJ - Palo Alto, CA - San Jose, CA - Seattle, WA - South Bend, IN - St. Louis, MO - Amsterdam, The Netherlands (2) - Dublin, Ireland - Frankfurt, Germany (3) - London, England (3) - Madrid, Spain - Marseille, France - Milan, Italy - Paris, France (2) - Stockholm, Sweden - Warsaw, Poland - Chennai, India - Hong Kong (2) - Mumbai, India - Manila, the Philippines - Osaka, Japan - Seoul, Korea (2) - Singapore (2) - Taipei, Taiwan - Tokyo, Japan (2) - Melbourne, Australia - Sydney, Australia - São Paulo, Brazil - Rio de Janeiro, Brazil
Regions and Availability Zones
11
Availability Zone Basics
(Diagram: the US-WEST-2 Oregon region with Availability Zones A, B and C)
Each region has at least 2 availability zones
Each availability zone is in a separate location, miles apart, that shares nothing with other zones
Latency between availability zones in the same region is less than 2 ms
Systems requiring high availability should be designed to take advantage of multiple AZs
Elastic load balancers (ELBs) live at the region level, across all AZs
A typical HA design pattern is shown here:
(Diagram: a typical HA pattern, with App.com traffic going through an ELB to App and DB instances in both AZ A and AZ B)
12
Virtual Private Cloud (VPC) Overview
VPCs are customer-defined private networks that provide isolation from other VPCs (even your own) and other customers.
A VPC can only reside in one region, but can span all AZs in that region.
A VPC is subdivided into subnets. Subnets are located in AZs; a subnet can't span multiple AZs.
Subnets can be completely isolated, connected to the Internet, or connected to your corporate network via a VPN or a direct connection offered by a number of AWS partners.
VPCs can be peered with other VPCs, even other customers' VPCs, for collaboration.
The VPC service provides the following building blocks to design a network to fit your needs:
VPCs
Routers
Internet Gateways
Customer Gateways
VPNs
Virtual Private Gateways
VPC Peering
Subnets
Route Tables
DHCP
Network ACLs
Security Groups
Stateful Firewall
NAT
13
High-Level Virtual Private Cloud Patterns
(Diagrams: four common VPC patterns)
Public-only subnets
Public and private subnets
Public and private subnets with a VPN connection to the corporate datacenter
Private-only subnets with a VPN connection to the corporate datacenter
14
Example Account, VPC, Peering and Billing Configurations
(Diagram: four configurations, A–D)
A. Single account, single VPC; single bill. No isolation. Very simple but no flexibility.
B. Single account, multiple VPCs; single bill. Isolation between environments but no account isolation. Still simple, with some flexibility.
C. Multiple accounts, multiple VPCs; single consolidated bill. High level of isolation; intra-account VPC peering. Flexible but moderately complex.
D. Multiple accounts, multiple VPCs; separate billing. High level of isolation; intra- and inter-organization peering (e.g. to another organization's VPC). Most flexible but very complex.
15
Example of a Possible Account & VPC Architecture
(Diagram: a Production Environment account and a Test Environment account under consolidated billing, each containing Test, Enterprise, Research, High Security and Collaboration VPCs, with VPC peering to Org A and Org B VPCs)
VPC purposes:
Test VPC: dev/test systems
Enterprise VPC: administrative computing
Research VPC: research computing
High Security VPC: sensitive systems (PHI)
Collaboration VPC: scientific collaboration
Benefits:
Infrastructure testing in an isolated test account
Network-level isolation between environments for flexibility and some independence
Single consolidated bill covering both accounts
16
Cloud On-Ramp Datacenter: Extension of the Hutch Network
(Diagram: current topology vs. the Hutch network extended to our AWS VPC; the Hutch network (192.168.0.0/16, 10.168.0.0/16) sits behind Hutch firewalls in an active/standby configuration, connected to the Internet)
Traffic between the Hutch network and our VPC must pass through the Center's Internet firewall, which is an endpoint for the VPN connection between the two networks.
A firewall instance filters and logs all traffic in and out of the VPC via the AWS Internet Gateway.
Virtual systems residing in VPC subnets are referred to as instances of the machine images from which they are launched.
The AWS Internet Gateway provides direct access between the VPC and the Internet (avoiding having to route traffic through the Hutch campus, AKA "hairpinning").
17
Cloud On-Ramp Datacenter – Detailed
(Diagram: the Hutch network (192.168.0.0/16, 10.168.0.0/16) connected to the Cloud On-Ramp Virtual Private Cloud (172.16.0.0/16) via a VPN tunnel; the Hutch firewalls serve as the VPN tunnel endpoint (customer gateway) and the AWS Virtual Private Gateway terminates the tunnel in the VPC; an AWS Elastic IP (EIP) is assigned to the firewall; some components are marked "Post Phase 1")
The VPC spans three availability zones (us-west-2a, us-west-2b, us-west-2c):
Internal subnets 172.16.160.0/24, 172.16.168.0/24 and 172.16.176.0/24: instances directly accessible via Hutchnet
Firewall Inside subnet: 172.16.8.0/24
Firewall Outside subnet: 172.16.0.0/24
Connectivity from internal Hutch systems in the VPC to the Internet is via the firewall/UTM instance and the AWS Internet Gateway.
Notes:
• Internal subnets are logical extensions of the Hutch network.
• Instances in the internal subnets can communicate freely with other systems in the Hutch network via the VPN tunnel using their private IP addresses (172.16.x.x).
• The Firewall Outside subnet is the only subnet with direct access to the AWS Internet Gateway.
• The firewall is a FortiGate instance with interfaces in both the Firewall Outside and Firewall Inside subnets. It can communicate with any VPC subnet via its inside interface.
• The firewall performs NAT for outbound-initiated Internet access.
18
Active Directory Cloud Integration Design
(Diagram: the on-premises AD site "SELU" (networks 192.168.0.0/16 and 10.168.0.0/16; domain controllers DC1, DC2, DC3) replicating with the new AWS AD site "AWS_USW2" (network 172.16.0.0/16; DC-USW2A in US-WEST-2A at 172.16.160.x and DC-USW2B in US-WEST-2B at 172.16.168.x); a server provides software updates)
Create a new AD "site" named for the AWS region: "AWS_USW2"
Create two new domain controllers in AWS, each in a different availability zone
This architecture was tested during the project (twice) using test domains
19
Tags are key/value annotations that can be attached to every type of object in AWS
Tags are used for inventory, security, cost accounting, backups and automation
Tag restrictions:
Maximum number of tags per resource: 10
Maximum key length: 127 Unicode characters
Maximum value length: 255 Unicode characters
Tag keys and values are case sensitive
Mandatory Cloud On-Ramp tagging scheme:
Name: name of the server
Owner: department/customer that owns/pays for the server <_div/dept>
Technical_contact: who provides technical support for this system; who to send alerts and reports to
Billing_contact: who to send chargeback invoices to
Description: short description of the server's purpose
SLE: business_hours=? / grant_critical=? / publicly_accessible=?
Metadata Tagging Scheme
20
Metadata Tagging Scheme Example
Name : skyshield
InstanceId : i-32b8edc1
InstanceType : m3.medium
ImageId : ami-b9c98181
State : running
PrivateIpAddress : 172.16.0.18
PublicIpAddress : 52.16.139.222
SecurityGrps : sg-8e7221e1
AvailabilityZone : us-west-2a
SubnetId : subnet-29549451
VpcId : vpc-f3e23491
owner : _adm/infosec
technical_contact : [email protected]
billing_contact : [email protected]
description : firewall - inside interface in dedicated subnet
sle : business_hours=24x7 / grant_critical=no / publicly_accessible=no
Tenancy : default
LaunchTime : 6/3/2015 1:34:59 PM
KeyName : cloud on-ramp test keypair
Platform : linux
21
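The restrictions and mandatory keys above can be checked mechanically. A minimal sketch (illustrative, not the project's actual tooling; key names follow the lowercase form used in the example above):

```python
# Sketch of validating the mandatory Cloud On-Ramp tagging scheme and the
# AWS tag restrictions described above. Illustrative only.

MANDATORY_KEYS = {"Name", "owner", "technical_contact", "billing_contact",
                  "description", "sle"}

MAX_TAGS = 10          # AWS limit per resource (at the time of writing)
MAX_KEY_LEN = 127      # Unicode characters
MAX_VALUE_LEN = 255    # Unicode characters

def validate_tags(tags):
    """Return a list of problems with a resource's tag dict (empty = OK)."""
    problems = []
    if len(tags) > MAX_TAGS:
        problems.append(f"too many tags: {len(tags)} > {MAX_TAGS}")
    for key, value in tags.items():
        if len(key) > MAX_KEY_LEN:
            problems.append(f"key too long: {key!r}")
        if len(value) > MAX_VALUE_LEN:
            problems.append(f"value too long for key {key!r}")
    # Tag keys are case sensitive, so the comparison is exact.
    for key in sorted(MANDATORY_KEYS - set(tags)):
        problems.append(f"missing mandatory tag: {key}")
    return problems
```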
Cost Accounting During the Project
79.7% of AWS costs on average could be directly tied to an owner/department
After accounting for sales tax and a proportional amount of AWS support costs, we would have been able to assign 89.2% of AWS costs to owners/departments for potential chargebacks in the future
Strict tagging of servers, network interfaces, volumes, snapshots, etc. is critical
Resources attached to servers (volumes & NICs) need to automatically inherit tags from their parent to ensure all costs are captured
Tag creation, maintenance and enforcement needs to be fully automated
(Chart: daily spend report, invoiced vs. chargebacks, showing end-of-month taxes and excessive "leakage")
22
Cost Reporting and Potential Chargebacks

owner          invoice_date  bill
-----          ------------  -------
_adm/custserv  2015-07-01    $401.96
_adm/iops      2015-07-01    $370.46
_adm/solarch   2015-07-01    $213.45
_adm/infosec   2015-07-01    $102.43
_adm/ess       2015-07-01    $77.04
_adm/scicomp   2015-07-01    $1.71
Chargebacks: Monthly Charges by Owner Tag
Monthly Charges by Owner Tag – Last Month vs This Month
Pulled via API
23
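The per-owner rollup above can be sketched as a simple aggregation over tagged line items. The input format here is hypothetical; the real numbers were pulled via the AWS billing API:

```python
# Sketch of a per-owner chargeback rollup. Input format is hypothetical.
from collections import defaultdict

def chargebacks(line_items):
    """Sum costs by 'owner' tag; untagged spend is kept visible under
    'untagged' so billing leakage shows up instead of silently vanishing."""
    totals = defaultdict(float)
    for item in line_items:
        totals[item.get("owner") or "untagged"] += item["cost"]
    return dict(totals)
```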
Elastic Compute Cloud (EC2) Overview
EC2 is Amazon's virtual server (instance) service
38 instance types are available, ranging from 10% of a CPU core and 1GB of RAM to 40 CPU cores and 244GB of RAM
General purpose, fractional CPU (burstable), compute optimized, memory optimized, storage optimized (both high IO and high density) and GPU instance classes
On-demand, reserved and spot market pricing options
Some instances come with "free" ephemeral storage
General purpose SSD EBS volumes up to 16TB each
Provisioned IOPS volumes can provide up to 20,000 IO operations per second per volume
Shared or dedicated tenancy models available
(Icons: instances, AMIs (images), network interfaces, auto scaling, magnetic disks, general purpose SSD disks, provisioned IOPS disks, encrypted disks, snapshots, monitoring, alerting, load balancing)
24
EC2: On-Demand Pricing
Zero upfront costs with no long-term commitments
Charged hourly (fractional hours rounded up) for the time the instance is running
Each instance type has a different hourly rate
Availability of specific instance types in specific AZs can fluctuate with demand
Best for short-term workloads that can't be interrupted while running
Best for systems that can be shut down when not in use (test, monthly jobs, experiments)
Most flexible option but also most expensive
Type CPUs RAM Temp Storage Rate Annual
t2.micro 1* 1GB none $0.013 $114
m3.medium 1 3.75GB 1 x 4GB SSD $0.067 $587
t2.large 2* 8GB none $0.104 $911
m4.large 2 8GB none $0.126 $1,104
m4.2xlarge 4 16GB none $0.252 $2,207
c3.4xlarge 16 30GB 2 x 160GB SSD $0.84 $7,359
m4.10xlarge 40 160GB none $2.52 $22,075
i2.8xlarge 32 244GB 8 x 800GB SSD** $6.82 $59,743
* Fractional CPU with credit-based burst
** 365,000 random read IOPS total
25
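The annual column in the table above is simply the hourly rate multiplied by 8,760 hours (24 × 365). A quick sanity check, ignoring the rounding-up of fractional billing hours:

```python
# Sanity check of the on-demand annual column: hourly rate x 8,760 hours.
# Ignores the rounding-up of fractional billing hours.

HOURS_PER_YEAR = 24 * 365   # 8,760

def annual_cost(hourly_rate):
    """Estimated cost of leaving an on-demand instance running all year."""
    return hourly_rate * HOURS_PER_YEAR
```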
EC2: Reserved Instances
Reserved instances are commitments for 1 or 3 years and provide guaranteed availability
All-upfront, partial-upfront and no-upfront purchasing options
Purchased for specific availability zones
Best for long-term production servers
Unwanted reserved instances can be sold on the reserved instance market
Loss of flexibility, but cost savings can be significant (up to 75%)
T2.Large Example
26
EC2: Spot Market Pricing
Save up to 90% by bidding on unused capacity
Spot instances are functionally identical to on-demand and reserved instances
Requested instances are launched when your bid matches or exceeds the market rate
Market rate fluctuates based on current supply and demand in a particular zone
When the market rate exceeds the bid, instances are terminated after a two-minute notice
Good for short-running jobs or long-running processes that can checkpoint their state
HPC and ad-hoc testing are good candidates for spot instances

Example: m4.10xlarge (40 CPUs, 160GB RAM)
On-demand rate: $2.52/hour
Current spot rate: $0.27 (us-west-2a)
89% cost savings
27
Spot Market Rate History for M4.10xlarge instance in Oregon
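The 89% figure above can be reproduced directly from the two rates; a sketch:

```python
# Reproduces the spot savings figure from the two hourly rates.

def spot_savings_pct(on_demand_rate, spot_rate):
    """Percent saved by running at the spot rate instead of on-demand."""
    return round(100 * (1 - spot_rate / on_demand_rate))
```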
Infrastructure as Code
28
Infrastructure as Code
Networks, servers, storage, security, databases, monitoring, etc. can be defined in code
Code "stacks" can be written to build entire complex infrastructures or multi-tier application stacks
Your infrastructure or application stack is always documented, versioned and consistently repeatable
Infrastructure code is strictly managed and tracked via a source code management system
Disaster recovery and business continuity can be an order of magnitude faster/simpler
This is the future of IT operations
Requires a different set of skills than traditional IT operations
Example: Infrastructure changes are automatically documented and versioned
29
Infrastructure as Code Demo
Q (Management): How long would it take to migrate the Center's public web site to the cloud while increasing security, performance and availability? Will 6 months and a budget of $250K work?
A (DevOps Engineer): I can do it in 30 minutes for less than $5 per day. Where would you like it to reside? Oregon, California, or North Virginia?
30
Infrastructure as Code Example
This code does the following:
Creates a new VPC wherever you like
Creates 2 public subnets (each in a different AZ)
Creates 2 private subnets (each in a different AZ)
Creates an Internet gateway
Creates a route table to route public traffic to the Internet gateway
Creates a NAT instance in a public subnet
Creates a route table to route Internet-bound traffic from the private subnets to the NAT instance
Creates security groups and network ACLs for both the private and public subnets
Creates two Linux instances, one in each private subnet
Installs and configures the NGINX web server on each Linux instance (via a user-data script pulled from a GitHub repo and provided to each instance at boot)
Each Linux web server pulls (via the NAT connection) a 1.4GB tarball containing the public www.fredhutch.org website from an S3 bucket and extracts it into the NGINX web root
Creates an elastic load balancer (ELB) and attaches interfaces to the public subnets
Creates ELB health checks to verify the health of the new Fred Hutch web servers
Adds the web server instances to the load balancer
Adds a DNS CNAME to the fredhutch.center DNS zone that points to the ELB public DNS name
Sends an SMS text message to my iPhone when it's done
31
Fredhutch.org Website Migration Demo
(Diagram: orchestration code builds, in each of two availability zones, a public subnet containing a NAT instance and a private subnet containing a web server, each protected by security groups and NACLs; base images (AMIs) and webserver config code from GitHub configure the instances; the website archive is pulled from S3; an ELB with health checks and an EIP fronts the web servers; Route 53 (R53) serves www.fredhutch.center; public and private route tables, DHCP options and a key pair complete the VPC; SNS sends an SMS text to Robert's iPhone when complete)
32
Custom Cloud Automation Code
EC2 instance provisioning to build and configure Windows and Linux instances
o Tags instance with all mandatory tags
o Configures OS during bootstrap
o Optionally enables monitoring and alerting
o Optionally enables daily backup rotation for all attached volumes
o Optionally registers instance in DNS
o Optionally creates and attaches additional "data" disk
o Optionally configures data disk for encryption
o Optionally configures instance for scheduled retirement
EC2 instance reporting
o Gathers all information on instances to find, filter and report on instances
EC2 tag primer
o Finds all instances without tags and creates all tag key stubs
o Reports to ops group that instances without tags were found and tagged
EC2 tag inheritance
o Ensures that EBS volumes and NICs attached to an instance inherit the parent instance's tags
EC2 tag enforcer
o Finds instances that are missing the mandatory tags
o Reports them to ops group and optionally shuts them down
33
Custom Cloud Automation Code (cont.)
EC2 backups
o Finds all volumes that are tagged with a backup retention and snapshots them
o Tags the snapshots to identify the owner, parent instance and retention date
o Purges snapshots that are past their designated retention date
EC2 instance lifecycle
o Finds instances that are scheduled for retirement in the next 30 days and reports on them
o Retires instances that have reached their retirement date
Virtual datacenter creation
o Creates a VPC, subnets, security groups, NACLs, gateway, routing tables, DHCP options
"GrabCloudNode" research compute node provisioning
o Researcher-facing tool to provision a cloud-based HPC node
o Similar to the existing "grabnode" functionality that researchers use to access on-premises compute resources
o Tags the instance with all mandatory metadata tags
o Facilitates transferring data to and from the cloud node
o Sets up monitoring to automatically shut down the node if it's idle for more than 1 hour
34
Cloud Security
35
AWS Shared Responsibility Model
Amazon AWS certifications and accreditations:
PCI DSS Level 1, SOC 1 / ISAE 3402, SOC 2, SOC 3, FIPS 140-2, CSA, FedRAMP, DIACAP, FISMA, ISO 27001, MPAA, Section 508 / VPAT, HIPAA, DOD CSM Levels 1-2 and 3-5, ISO 9001, CJIS, FERPA, G-Cloud, IT-Grundschutz, IRAP (Australia), MTCS Tier 3 Certification, ITAR
Amazon is responsible for the security of the cloud; we are responsible for our security in the cloud
(Diagram: our responsibility vs. their responsibility)
36
Proposed Security Choices
37
What we think are good ideas based on what we learned during the project
See appendix for details
These are not Policies
o we do not understand our actual AWS use cases well enough to have policies and procedures
o we have not tested these proposed choices in real operations mode
Proposed Security Choice Highlights
38
2-factor authentication for AWS administrator accounts with passwords
For service accounts making API calls with access keys, implement IP restrictions
Maintain team-level (as opposed to individual) access key repositories
Turn on CloudTrail and CloudConfig auditing everywhere. Send logs to Splunk
Clearly defined governance model for VPC-level design and changes—subnets, ACLs, EIPs, VPNs, VPC-peering…
Protect traffic between our EC2 instances and the Internet with a virtual firewall/IPS appliance, or some host-level alternative
Security-wise, AWS adoption brings…
39
Potential benefits
Challenges and uncertainties
For certain areas of IT operations, no change
Potential Security Benefits
40
Complete audit trail of infrastructure access & changes
Improved detection and alerting of security exceptions at the infrastructure level—faster, more precise incident response and recovery
Security goals such as physical security of the data center, protection of backup media, and secure disposal of unwanted storage media are simply easier to accomplish with AWS
Potential Security Benefits (cont)
41
Relatively easy to compartmentalize IT resources with well-defined technical and administrative boundaries
o Optimized to create discrete computing/storage/application instances on demand without having to maintain a common infrastructure
o Create separate "networks" with VPCs
o Buy storage space in the form of S3 buckets
o Separate database instances with RDS
o Control admin access via granular access control rules
IP & port filtering on a per-server basis with security groups
Security Challenges and Uncertainties
42
The “everything-as-code” paradigm bundles the different layers of the IT stack together in a way that is not necessarily compatible with our current separation of duties. Combining services such as networking, server, OS, apps, and security filtering into one set of code blurs the lines between different teams’ responsibilities, leading to confusion, unmet expectations, and lost opportunities for cross-checking.
Our current team structure in Center IT is optimized for our physical IT environment. It is not necessarily efficient for managing the software-defined world of AWS. AWS is not a virtualized copy of our IT infrastructure.
New frontier. It will take time to establish new policies, expectations, and norms in order to operate AWS securely and smoothly. Potential for friction and dropped balls.
Security Challenges and Uncertainties (cont)
43
Rapid evolution of AWS features—long-term investment in staff time for learning.
AWS represents not a replacement of existing infrastructure, but a parallel one. We must duplicate resources to secure it.
Security - No Change
44
OS and application patching (but we may end up maintaining fewer servers, if we purchase things like storage and database “as-services” from AWS, instead of running our own)
Need for firewall/IPS/WAF protections (must be purchased via 3rd party vendors).
Cloud Computing for
Scientific Applications
45
Ad-Hoc Capacity
46
When compute capacity needs (cores, memory, storage) exceed what is available in-house
o Reduce time-to-solution
o Scale wide/short:
• 100 cores for 10 hours has same cost as 1000 cores for 1 hour
o Rent-a-terabyte:
• Short term analyses and interim storage options won’t require large capital investment
Ad-Hoc Capability
47
Use of technologies not currently available in-house
o GPU
o Low-latency interconnect (AWS “enhanced networking”)
o Short term or one-off analyses won’t require large capital investment
Sandbox
48
Provide a sandbox for prototyping and evaluation
o Easily provisioned ephemeral environment
o Allows researcher to try new algorithms and evaluate methods without constraints
o Docker and AMIs are popular mechanisms for distributing data, tools, and pipelines
Container Solutions
49
Containers are
o “a server-virtualization method where the kernel of an operating system allows for multiple isolated user-space instances”
o Docker containers and AMIs allow distribution of tools and data in a portable container.
o Reproducibility and distribution of results
o Difficult and cumbersome (security) to deploy in-house
• Easy to pop into a sandbox in the cloud!
Science DMZ
Transferring data into the cloud is free; transferring data out of the cloud is charged by the GB
Download large datasets quickly and inexpensively using Amazon's big network pipes
Analyze and process data in the cloud using cloud resources (EC2, EMR, …)
Download the results of the analysis or experiment to the Hutch
(Diagram: data repositories such as dbGaP and EMBL-ENA feed the Fred Hutch Amazon VPC, where data is stored in S3 and analyzed with EC2 compute; a researcher on the Fred Hutch campus retrieves the results to /fh/fast)
"…designed such that the equipment, configuration, and security policies are optimized for high-performance scientific applications rather than for general purpose business systems" - ESnet
50
Collaboration
Environments providing compute, application, and storage for collaborations between the Hutch and others
o Resources are independent
o Access from one to the other via peering
o Uses AWS high-throughput networking
o Data transfer does incur cost
o Good for bringing in outside expertise
(Diagram: a Hutch intercloud VPC with S3 storage peered with a partner VPC providing EC2 compute)
In this example, a group with compute expertise provides their computational resources, accessing Hutch-produced data via a VPC peering relationship
51
Meet-Me
52
A self-contained VPC for collaboration
o Custom environment
o Isolated from other Hutch resources
o Limits need for shipping data between organizations and VPCs (c.f. intercloud)
o IAM controls access and authorization
(Diagram: Hutch and partner users both access a self-contained Meet-Me VPC containing EC2 compute and S3 storage)
Illumina Data for External Customers
Upload from HiSeq into S3 (implemented today)
Processing in EC2
Download by customer, or transfer of the bucket to the customer via VPC peering or S3 copy
(Diagram: Genomics Shared Resources on the Fred Hutch campus uploads basecalls & alignments from Gerald/Bustard/etc. into "raw" data storage in S3 within the Fred Hutch Amazon VPC; EC2 compute processes the data; results are delivered to an external SR customer directly or to the customer's VPC via S3; Glacier archives the data, raw & cooked)
This is simply an example of a possibility; no plans or proposals are in place at this time!
53
Proteomics in the Cloud
The Proteomics Lab is currently testing Proteome Discoverer in the cloud
Run time on the current local system: 150 hours
Cloud comparison:
8 CPU cloud server
o Run time: 123.24 hours
o Cost: $124.99
36 CPU cloud server
o Run time: 42.47 hours
o Cost: $132.91
Key concepts:
1 server running for 100 hours costs about the same as 100 servers running for 1 hour
Running an 8 CPU system for 4 hours costs about the same as running a 32 CPU system for 1 hour
(Chart: total cost, 8 CPUs ≈ $125 vs. 36 CPUs ≈ $133)
54
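The key concept above follows from cost scaling with core-hours rather than wall-clock time; a sketch using a hypothetical hourly rate:

```python
# Cloud compute cost scales with server-hours (servers x hours), not
# wall-clock time. The $0.10/hour rate below is hypothetical.

def compute_cost(servers, hours, rate_per_server_hour):
    """Total cost of running N identical servers for H hours each."""
    return servers * hours * rate_per_server_hour
```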
AWS HIPAA BAA Details
55
We must identify the AWS account IDs that we want covered by the BAA
We are responsible for implementing appropriate privacy and security safeguards in order to protect our PHI in compliance with HIPAA
The following are the current HIPAA-eligible services:
o Amazon Elastic Compute Cloud (EC2)
o Amazon Simple Storage Service (S3)
o Amazon Elastic Block Store (EBS)
o Amazon Glacier
o Amazon Redshift
o Amazon RDS (MySQL and Oracle engines only)
o Amazon Elastic Map Reduce (EMR)
o Amazon DynamoDB
o Elastic Load Balancing
All compute instances processing, storing, or transmitting PHI must be dedicated instances
Dedicated instances won't share a hypervisor host with any other customers
Dedicated tenancy costs an extra $2 per hour but covers all EC2 instances in a region
AWS will report all security incidents and breaches to us
We must enable all auditing and logging (CloudTrail, CloudConfig)
All PHI data must be encrypted at rest and in transmission
Set the ELB load balancer protocol to TCP for sessions containing PHI, and the TCP session must be encrypted end-to-end (no SSL termination on the ELB)
NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing (GDS) Policy *
Information security in cloud environments is still the responsibility of the institution; the implementation of that security is shared between the institution and the cloud service provider
You and your institution are accountable for ensuring the security of this data, not the cloud service provider.
The NIH strongly recommends that investigators consult with institutional IT leaders, including the Chief Information Officer (CIO) and the institutional Information Systems Security Officer (ISSO)
* http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=dbgap_2b_security_procedures.pdf
** https://d0.awsstatic.com/whitepapers/compliance/AWS_dBGaP_Genomics_on_AWS_Best_Practices.pdf
Whitepaper: “Architecting for Genomic Data Security and Compliance in AWS” **
Guidance for working with controlled-access datasets from dbGaP, GWAS, and other individual-level genomic research repositories
Co-authored by Chris Whalley, formerly of the Fred Hutch regulatory compliance office
dbGaP Data in the Cloud
56
AWS Support Options
57
Basic: Trusted Advisor (4 checks); support for health checks; primary case handling by a Technical Customer Service Associate; cost: free
Developer: Trusted Advisor (4 checks); email support during local business hours; Cloud Support Associate; 1 user can create support requests; response time <12 hours; cost: $49/month
Business: Trusted Advisor (41 checks); phone, chat, email and live screen sharing (24/7); Cloud Support Engineer; unlimited users (IAM supported); response time <1 hour; cost: 10% of the monthly bill (rate goes down at higher spending tiers)
Enterprise: Trusted Advisor (41 checks); phone, chat, email, live screen sharing and a TAM (24/7); Sr. Cloud Support Engineer; unlimited users (IAM supported); response time <15 minutes; cost: $15,000 (rate goes down at higher spending tiers)
- We used "Basic" support during the cloud on-ramp project
- AWS has excellent documentation, so we didn't need to contact support during the project
- I recommend that we upgrade to a Business support plan prior to production use
AWS Benefits
- An opportunity to build a high-security computing environment for our current and future security needs
- A complete, up-to-the-minute inventory of all cloud resources
- Complete visibility and accountability of all IT costs associated with the cloud
- We can accurately calculate chargebacks for almost 90% of cloud costs
- Disaster recovery and business continuity are a reality
- Rapidly respond to urgent or unplanned IT needs
- Audit all access and configuration changes
- A documented, versioned, repeatable IT infrastructure is possible
- Everything can be automated via well-documented APIs and SDKs
- No physical infrastructure equipment (servers, switches, routers, PDUs, etc.) to maintain
- Collaborate with other institutions in the cloud via VPC peering
- Take advantage of Amazon's fat network pipes to download large datasets to the cloud
- With the brokering layer in place, we could offer self-service IT to CIT, divisional IT and research staff across the Center

58
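The cost-accountability benefit rests on rolling billing line items up by tag. The following is a minimal sketch of that rollup; the instance records are made-up examples, and in practice they would come from the AWS billing report joined with each resource's cost-center tag.

```python
# Sketch: tag-based chargeback rollup. The billing records below are
# hypothetical; real data would come from AWS detailed billing reports.
from collections import defaultdict

def chargebacks(line_items):
    """Sum cost per cost-center tag; untagged spend is grouped under
    'unallocated' so the coverage gap stays visible."""
    totals = defaultdict(float)
    for item in line_items:
        center = item.get("tags", {}).get("CostCenter", "unallocated")
        totals[center] += item["cost"]
    return dict(totals)

billing = [
    {"resource": "i-0001", "cost": 120.0, "tags": {"CostCenter": "genomics"}},
    {"resource": "i-0002", "cost": 80.0,  "tags": {"CostCenter": "genomics"}},
    {"resource": "i-0003", "cost": 50.0,  "tags": {"CostCenter": "infra"}},
    {"resource": "vol-01", "cost": 10.0,  "tags": {}},
]
print(chargebacks(billing))
# {'genomics': 200.0, 'infra': 50.0, 'unallocated': 10.0}
```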
AWS Challenges
59
The highly abstracted nature of the cloud combined with the infrastructure as code paradigm results in a blurring/elimination of the boundaries between traditional IT roles and separation of duties
Our current team structure in Center IT is optimized for our physical IT environment. It is not necessarily efficient for managing the software-defined world of the Cloud. The Cloud is not a virtualized copy of our IT infrastructure.
AWS is evolving at a very rapid pace. IT staff responsible for the cloud infrastructure/service will need to stay abreast of all changes and incorporate these changes into our architecture/service when beneficial
There is currently ~10ms of network latency between the Center and our Oregon VPC. Data and compute should be co-located for the best performance:
o On campus: ~1ms
o Campus to Oregon AWS VPC: ~10ms
o Campus to Europe: ~150ms
o Campus to Africa: ~300ms
Our VPN can currently only encrypt network traffic to and from our VPC at a rate of 300Mb/s (37.5MB/s). Moving large data sets between campus and our VPC will be very slow until we upgrade to a dedicated direct connect or other solution.
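A quick back-of-envelope calculation shows why the 300 Mb/s VPN throughput limit matters for large data sets:

```python
# Back-of-envelope: time to move a dataset through the 300 Mb/s
# (37.5 MB/s) VPN tunnel described above.
def transfer_hours(dataset_bytes, rate_bytes_per_sec=37.5e6):
    """Hours to transfer dataset_bytes at the given sustained rate."""
    return dataset_bytes / rate_bytes_per_sec / 3600

one_tb = 1e12  # 1 TB (decimal)
print(round(transfer_hours(one_tb), 1))  # ~7.4 hours per terabyte
```

At that rate a multi-terabyte genomics dataset takes the better part of a day each way, which is the motivation for the direct connect upgrade mentioned above.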
Not Everything Works Well in the Cloud

The cloud is not possible for anything that…
- Is physically connected to an instrument
- Requires a licensing "dongle"
- Has very specific hardware requirements (model XYZ only)
Examples: Aperio, OnBase

The cloud is not currently a good fit for anything that…
- Requires very low latency access to systems/data on Campus
- Requires high throughput access to systems/data on Campus *
Examples: BI, PeopleSoft, Varonis, Hyperion

The cloud might not be cost effective for anything that…
- Requires a large server (many CPUs) and runs 24/7

* This limitation can be removed by implementing a 10GbE direct connect (costs $5-7K / month)

60
Capabilities Gained During This Project
- Create complex cloud-based datacenters and networks
- Logically integrate cloud networks with our campus network
- Secure cloud resources with security groups, NACLs and third-party firewalls
- Limit access with fine-grained security policies and multi-factor authentication
- Consistently provision and configure Windows and Linux server instances
- Audit configuration changes to enforce change management
- Audit all AWS access (web console, CLI, API)
- Backup and recover servers in the cloud
- Implement and enforce a metadata tagging scheme
- Cost accounting and reporting to facilitate chargebacks
- Monitor systems, trend metrics and alert support staff
- Load balance private and public network traffic
- Vertically scale (up/down) or horizontally scale (out/in) systems
- Log all network traffic in/out of the VPC
- Peer with other organizations for collaboration in the cloud
- Automate everything via AWS APIs, CLI tools, CloudFormation, Packer
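Enforcing a metadata tagging scheme, one of the capabilities listed above, usually means validating tags before (or shortly after) provisioning. A minimal sketch follows; the required tag keys are illustrative, not the actual FHCRC scheme.

```python
# Sketch: validating a resource's tags against a required tagging scheme.
# The required keys below are hypothetical examples.
REQUIRED_TAGS = {"Name", "Owner", "CostCenter", "Environment"}

def missing_tags(tags):
    """Return the required tag keys that are absent or empty."""
    return {k for k in REQUIRED_TAGS if not tags.get(k)}

tags = {"Name": "web01", "Owner": "rmcdermott", "CostCenter": ""}
print(sorted(missing_tags(tags)))  # ['CostCenter', 'Environment']
```

A check like this could run on a schedule against the EC2 API and flag (or stop) non-compliant instances.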
61
FHCRC AWS Roadmap

Determine cloud operations model (who and what)
Develop cloud governance model
Select production account and VPC architecture
Extend the production FHCRC Active Directory with an AWS AD site
Integrate AWS user authentication (IAM) with FHCRC AD via SAML
Implement chargebacks (not decided yet)
Offer brokered self service to Center IT departments
Implement a direct connect network to AWS or other solution (Internet2)
Integrate EC2 into scientific computing service offering for researchers
Offer brokered self service to the research community

62
Key Takeaways

The cloud is no longer just hype; it's a very capable, mature platform that can offer increased agility, flexibility, security and capabilities
The cloud is not a traditional IT operating environment and requires a different approach to operate effectively
Not every server or application can or should move to the cloud
It won’t happen overnight; the journey to the cloud will take several years
Center IT is not currently offering the cloud as a service, but we may in the future
63
Appendix
65
Proposed Security Choices
66
Identity and Access Management for AWS
- API-based access to AWS for automation tasks should be done using service accounts instead of the individual accounts of technicians.
- All AWS user accounts belonging to humans must use two-factor authentication.
- Permissions granted to service accounts should be restricted to the source IPs or subnets of the servers needing those permissions. In AWS parlance, IAM policies granting permissions to service accounts should use source IP as a condition.
- Service accounts should be used with access keys only; they should not be associated with passwords. Access keys must be stored in encrypted, team-level key repositories, not in the personal storage space of individual technicians.
- The ability to modify IAM settings should be restricted to the ITSO. Exceptions (e.g. service accounts requiring IAM permissions) must be approved by the ITSO.
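The source-IP condition described above is expressed in the IAM policy document attached to the service account. Here is a minimal sketch of such a policy, built as a Python dict; the bucket name and CIDR are placeholders, not our real resources.

```python
# Sketch: an IAM policy for a service account, restricted by source IP.
# The bucket name and subnet CIDR below are hypothetical placeholders.
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::example-backup-bucket/*",
        # Requests are only allowed when the API call originates
        # from this subnet (the aws:SourceIp condition key)
        "Condition": {"IpAddress": {"aws:SourceIp": "172.16.160.0/21"}},
    }],
}
print(json.dumps(policy, indent=2))
```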
Logging and Auditing
- CloudTrail must be turned on for all regions. CloudTrail logs must be forwarded to Splunk.
- Splunk should be set up to monitor CloudTrail logs and alert [the cloud operations team] of notable activities, including activities that have security implications, such as account creation and permission changes. The exact set of events to be monitored will be defined as we operationalize our AWS environment, and continuously updated as we accumulate knowledge of AWS.
- CloudTrail logs must be retained in Splunk for at least a year.
Proposed Security Choices (cont)
67
VPC-Level Security
- The permissions to modify VPC-level configurations must be limited to data-ops staff. Changes should be made in consultation with the ITSO. VPC-level configurations include, but are not limited to:
o Creation/removal of subnets
o Assignment/removal of Elastic IPs (EIPs)
o Changes related to Access Control Lists (ACLs)
o Changes related to internet gateways, VPNs, and VPC peering
o Changes related to VPC Endpoints
- All traffic between the FHCRC campus and the private IP space of our AWS VPC will go through a VPN tunnel. This covers the scenario where hosts on the FHCRC campus access non-publicly accessible hosts in our VPC, and vice versa.
- A Fortigate virtual appliance will be deployed within our VPC and managed by the ITSO. Firewall, IPS, anti-virus, and application control features will be enabled on the Fortigate. The appliance will inspect traffic in the following scenarios:
o All traffic originating from the VPC to non-FHCRC addresses. This covers the scenario where hosts within the VPC need to initiate connections to the internet at large, for reasons such as patching.
o All connections to publicly accessible hosts within the VPC, including connections originating from our campus network.

EC2-Level Security
- By default, when EC2 instances are created they should be associated with one of the pre-defined security groups created by the ITSO. Network administrators should not create new security groups unless there is a specific need to do so, and it should be done in consultation with the ITSO.
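A pre-defined security group of the kind described above might allow SSH only from campus address space while exposing HTTPS publicly. Below is a sketch of the ingress rules as they would be passed to boto3's `ec2.authorize_security_group_ingress()`; the group ID and campus CIDR are placeholders.

```python
# Sketch: ingress rules for one pre-defined security group.
# The group ID and CIDRs below are hypothetical placeholders.
ingress_args = {
    "GroupId": "sg-xxxxxxxx",
    "IpPermissions": [
        {   # SSH, only from the campus/VPN address space
            "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
            "IpRanges": [{"CidrIp": "192.168.0.0/16"}],
        },
        {   # HTTPS from anywhere, for publicly accessible hosts
            "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
        },
    ],
}
# With credentials configured:
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(**ingress_args)
```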
VPC Peering Example
[10.99.1.50]$ traceroute 192.168.1.156
 1  192.168.1.156  1.472 ms
[10.99.1.50]$ curl http://192.168.1.156/
Welcome to Organization A!!

[192.168.1.156]$ traceroute 10.99.1.50
 1  10.99.1.50  1.417 ms
[192.168.1.156]$ curl http://10.99.1.50/
Welcome to Organization B!

[Diagram: two peered VPCs belonging to Organization A and Organization B (AWS accounts 123456789 and 987654321), each with a peering connection and a route table entry pointing at the other VPC. It works in both directions.]

68
DNS Cloud Integration

[Diagram: DNS grid integration. The IBX0 DNS master (albite10.fhcrc.org) sends grid updates to members albite01.fhcrc.org (IBX-J4, IBX-E2) and to grid members in the AWS VPC at US-WEST-2A (172.16.160.11) and US-WEST-2B (172.16.168.11).]

DNS views:
- Internal: 192.168/16, 10.168/16, 172.16/16 (AWS VPC)
- External: !(internal)

AWS VPC resolv.conf: 172.16.160.11, 172.16.168.11, 192.168.116.A

69
Project Scope

In Scope
- Develop functional and security requirements
- Design and implement virtual datacenter architecture (regions, zones, subnets, etc.)
- Extend the Center's IP network to the virtual datacenter (VPC)
- Develop security policies and/or guidelines on appropriate use of the environment
- Active Directory and DNS services
- Create FHCRC server templates (AMIs) and standards
- Server pricing strategy (on-demand vs. reserved instances)
- Develop and test various use cases
- Train operational staff on the use of the environment
- Determine RBAC/account strategy
- Develop accounting strategy to support future chargeback functionality
- Select and pilot at least one production server/service in the new virtual datacenter
- Pilot researcher use of EC2
- Develop a roadmap for this environment
Out of Scope
- Implementing a high-speed "Direct Connect" network connection
- Implementation of chargebacks
- Chef automated server builds (Chef implementation project still in progress)
- Customer self-service

70
Cloud Architecture Requirements

During this project we determined that we can satisfy all of the following "must have" requirements:
- Secure, logical network extension of FHCRC IP space into the cloud
- Encrypted transport between FHCRC and the cloud
- Design that allows HA-architected services
- Separate public network to run services outside of the FHCRC network
- Support both enterprise and research computing
- Cost tracking and reporting
- System metrics, monitoring and logging
- FHCRC Active Directory access/integration for servers
- FHCRC DNS access/integration for servers
- Ability for servers to log to Splunk
- Role-based administrative access (RBAC)
- Ability to backup/restore servers
- Secure storage media wipe on deletion
- Support fully automated provisioning and configuration
- Support Windows 2012, CentOS 6/7 and Ubuntu 14.04 operating systems
- Pre-cooked FHCRC server templates (CentOS, Ubuntu and Windows 2012)
- Stateful ingress/egress firewall capability
- Advanced intrusion prevention firewall
- Vendor support
- Granular cost reporting (per server, application type, owner) to support future chargeback implementation
- Metadata tagging capability to identify and group AWS objects

71