Talk at NCRR P41 Director's Meeting

Date post: 15-Jan-2015
Upload: deepak-singh
Description:
Invited Talk given at the NCRR P41 Director's meeting on October 12, 2010
Transcript
Page 1: Talk at NCRR P41 Director's Meeting

Amazon Web Services
A platform for life science research

Deepak Singh, Ph.D.
Amazon Web Services

NCRR P41 PI meeting, October 2010

Page 2: Talk at NCRR P41 Director's Meeting

the new reality

Page 3: Talk at NCRR P41 Director's Meeting

lots and lots and lots and lots and lots of data

Page 4: Talk at NCRR P41 Director's Meeting

lots and lots and lots and lots and lots of

people

Page 5: Talk at NCRR P41 Director's Meeting

lots and lots and lots and lots and lots of

places

Page 6: Talk at NCRR P41 Director's Meeting

constant change

Page 7: Talk at NCRR P41 Director's Meeting

science in a new reality

Page 8: Talk at NCRR P41 Director's Meeting

science in a new reality

Page 9: Talk at NCRR P41 Director's Meeting

data science in a new reality

Page 11: Talk at NCRR P41 Director's Meeting

goal

Page 12: Talk at NCRR P41 Director's Meeting

optimize the most valuable resource

Page 13: Talk at NCRR P41 Director's Meeting

compute, storage, workflows, memory,

transmission, algorithms, cost, …

Page 15: Talk at NCRR P41 Director's Meeting

enter the cloud

Page 16: Talk at NCRR P41 Director's Meeting

what is the cloud?

Page 17: Talk at NCRR P41 Director's Meeting

infrastructure

Page 18: Talk at NCRR P41 Director's Meeting
Page 19: Talk at NCRR P41 Director's Meeting

scalable

Page 20: Talk at NCRR P41 Director's Meeting

3000 CPUs for one firm’s risk management application

[Chart: Number of EC2 Instances, by day, Wednesday 4/22/2009 through Tuesday 4/28/2009. Levels marked: 3000 and 300. Annotation: 300 CPUs on weekends]
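A rough worked example of what this slide implies, as a sketch only. It assumes (the talk does not state this) that the firm ran about 3000 instances on each of the five weekdays and 300 on each of the two weekend days, around the clock. The function name and the 24-hour day are illustrative choices, not figures from the talk.

```python
HOURS_PER_DAY = 24  # assumption: instances run around the clock


def instance_hours(weekday_count, weekend_count, weekdays=5, weekend_days=2):
    """Instance-hours for one week with separate weekday and weekend fleet sizes."""
    return (weekday_count * weekdays + weekend_count * weekend_days) * HOURS_PER_DAY


elastic = instance_hours(3000, 300)    # scale down to 300 on weekends
fixed = instance_hours(3000, 3000)     # provisioned for the weekday peak all week
saved_fraction = 1 - elastic / fixed   # share of instance-hours avoided

print(elastic, fixed, round(saved_fraction, 3))
```

Under these assumptions the elastic fleet uses 374,400 instance-hours against 504,000 for a fleet sized for the peak, roughly a quarter fewer.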

Page 21: Talk at NCRR P41 Director's Meeting

highly available

Page 22: Talk at NCRR P41 Director's Meeting

US East Region

Availability Zone A

Availability Zone B

Availability Zone C

Availability Zone D

Page 23: Talk at NCRR P41 Director's Meeting

durable

Page 24: Talk at NCRR P41 Director's Meeting

99.999999999%
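To give the figure some scale, here is a small arithmetic sketch. It assumes the number is a per-object, per-year durability, which the slide itself does not spell out; the function names are illustrative.

```python
from fractions import Fraction

# Assumption: 99.999999999% is the chance a given object survives one year.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(num_objects):
    """Expected number of objects lost per year across num_objects."""
    return num_objects * (1 - DURABILITY)


def years_per_expected_loss(num_objects):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(num_objects)


print(expected_annual_losses(10_000))
print(years_per_expected_loss(10_000))
```

With these assumptions, 10,000 stored objects work out to one expected loss every 10,000,000 years.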

Page 25: Talk at NCRR P41 Director's Meeting

dynamic

Page 26: Talk at NCRR P41 Director's Meeting

extensible

Page 27: Talk at NCRR P41 Director's Meeting
Page 28: Talk at NCRR P41 Director's Meeting

secure

Page 29: Talk at NCRR P41 Director's Meeting

a utility

Page 30: Talk at NCRR P41 Director's Meeting

on-demand instances
reserved instances
spot instances

Page 31: Talk at NCRR P41 Director's Meeting
Page 32: Talk at NCRR P41 Director's Meeting
Page 33: Talk at NCRR P41 Director's Meeting

infrastructure as code

Page 34: Talk at NCRR P41 Director's Meeting

class Instance
  attr_accessor :aws_hash, :elastic_ip

  def initialize(hash, elastic_ip = nil)
    @aws_hash = hash
    @elastic_ip = elastic_ip
  end

  def public_dns
    @aws_hash[:dns_name] || ""
  end

  def friendly_name
    public_dns.empty? ? status.capitalize : public_dns.split(".")[0]
  end

  def id
    @aws_hash[:aws_instance_id]
  end
end

Page 35: Talk at NCRR P41 Director's Meeting

include_recipe "packages"
include_recipe "ruby"
include_recipe "apache2"

if platform?("centos","redhat")
  if dist_only?
    # just the gem, we'll install the apache module within apache2
    package "rubygem-passenger"
    return
  else
    package "httpd-devel"
  end
else
  %w{ apache2-prefork-dev libapr1-dev }.each do |pkg|
    package pkg do
      action :upgrade
    end
  end
end

gem_package "passenger" do
  version node[:passenger][:version]
end

execute "passenger_module" do
  command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
  creates node[:passenger][:module_path]
end

Page 36: Talk at NCRR P41 Director's Meeting

import boto
import boto.emr
from boto.emr.step import StreamingStep
from boto.emr.bootstrap_action import BootstrapAction
import time

# set your aws keys and S3 bucket, e.g. from environment or .boto
AWSKEY = ""       # placeholder: your AWS access key
SECRETKEY = ""    # placeholder: your AWS secret key
S3_BUCKET = ""    # placeholder: name of a bucket you own
NUM_INSTANCES = 1

# Connect to Elastic MapReduce
conn = boto.connect_emr(AWSKEY, SECRETKEY)

# Install packages
bootstrap_step = BootstrapAction("download.tst",
                                 "s3://elasticmapreduce/bootstrap-actions/download.sh",
                                 None)

# Set up mappers & reducers
step = StreamingStep(name='Wordcount',
                     mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
                     cache_files=["s3n://" + S3_BUCKET + "/boto.mod#boto.mod"],
                     reducer='aggregate',
                     input='s3n://elasticmapreduce/samples/wordcount/input',
                     output='s3n://' + S3_BUCKET + '/output/wordcount_output')

jobid = conn.run_jobflow(
    name="testbootstrap",
    log_uri="s3://" + S3_BUCKET + "/logs",
    steps=[step],
    bootstrap_actions=[bootstrap_step],
    num_instances=NUM_INSTANCES)

print "finished spawning job (note: starting still takes time)"

# job state: poll until the job flow reaches a terminal state
state = conn.describe_jobflow(jobid).state
print "job state = ", state
print "job id = ", jobid
while state not in (u'COMPLETED', u'FAILED', u'TERMINATED'):
    print time.localtime()
    time.sleep(30)
    state = conn.describe_jobflow(jobid).state
    print "job state = ", state
    print "job id = ", jobid

print "final output can be found in s3://" + S3_BUCKET + "/output/wordcount_output"
print "try: $ s3cmd sync s3://" + S3_BUCKET + "/output/wordcount_output ."
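The "job state" polling step on this slide can be factored into a small helper, sketched here under a few assumptions. The helper name, the `max_polls` limit, and the injected `get_state` and `sleep` callables are illustrative and not part of the talk or of boto. The three terminal state names are Elastic MapReduce job flow states; stopping on all of them avoids waiting forever on a failed job.

```python
import time

# Job flow states after which no further progress is expected.
TERMINAL_STATES = {"COMPLETED", "FAILED", "TERMINATED"}


def wait_for_terminal_state(get_state, poll_seconds=30, sleep=time.sleep,
                            max_polls=None):
    """Poll get_state() until it returns a terminal state, then return it.

    get_state is any zero-argument callable returning the current state
    string; max_polls optionally bounds how many times we sleep.
    """
    polls = 0
    state = get_state()
    while state not in TERMINAL_STATES:
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("job flow still in state %r" % state)
        sleep(poll_seconds)
        polls += 1
        state = get_state()
    return state


# Usage with a scripted sequence of states instead of a live job flow.
scripted = iter(["STARTING", "RUNNING", "RUNNING", "FAILED"])
final_state = wait_for_terminal_state(lambda: next(scripted),
                                      sleep=lambda seconds: None)
print(final_state)
```

Against a live job flow, `get_state` could be a lambda wrapping the `describe_jobflow(jobid).state` call shown on the slide.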

Page 37: Talk at NCRR P41 Director's Meeting

a data science platform

Page 38: Talk at NCRR P41 Director's Meeting

dataspaces

Further reading: Jeff Hammerbacher, Information Platforms and the rise of the data scientist, Beautiful Data

Page 39: Talk at NCRR P41 Director's Meeting

accept all data formats

Page 40: Talk at NCRR P41 Director's Meeting

evolve APIs

Page 41: Talk at NCRR P41 Director's Meeting

beyond the database and the data warehouse

Page 42: Talk at NCRR P41 Director's Meeting

move compute to the data

Page 43: Talk at NCRR P41 Director's Meeting

data is a royal garden

Page 44: Talk at NCRR P41 Director's Meeting

compute is a fungible commodity

Page 45: Talk at NCRR P41 Director's Meeting

“I terminate the instance and relaunch it. Thats my error handling”

Source: @jtimberman on Twitter
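The quote treats instances as disposable: rather than repairing a broken machine, replace it. A minimal sketch of that pattern follows, using in-memory stand-ins. The function name and the `is_healthy`, `terminate`, and `launch` callables are hypothetical placeholders for whatever health check and provisioning calls a real setup would use.

```python
import itertools


def replace_unhealthy(instance_ids, is_healthy, terminate, launch):
    """Return the fleet after swapping every unhealthy instance for a new one."""
    fleet = []
    for instance_id in instance_ids:
        if is_healthy(instance_id):
            fleet.append(instance_id)
        else:
            terminate(instance_id)   # discard the broken instance
            fleet.append(launch())   # and start a fresh one in its place
    return fleet


# In-memory stand-ins so the sketch runs without any cloud account.
counter = itertools.count(1)
terminated = []


def fake_launch():
    return "i-new%d" % next(counter)


fleet = replace_unhealthy(["i-a", "i-b", "i-c"],
                          is_healthy=lambda i: i != "i-b",
                          terminate=terminated.append,
                          launch=fake_launch)
print(fleet, terminated)
```

The design choice is that no state worth keeping lives on the instance itself, which is what makes termination a safe default.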

Page 46: Talk at NCRR P41 Director's Meeting

the cloud is an architectural and

cultural fit for data science

Page 47: Talk at NCRR P41 Director's Meeting

amazon web services

Page 48: Talk at NCRR P41 Director's Meeting

your data science platform

Page 49: Talk at NCRR P41 Director's Meeting

s3://1000genomes
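A sketch of how a dataset published at an S3 URI like this one might be browsed from code. Several things here are assumptions not taken from the talk: the use of boto3 (a later library than the boto shown earlier), anonymous access to the bucket, the `us-east-1` region, and the helper names. The listing function is defined but not called, since it needs network access and the third-party boto3 package.

```python
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % uri)
    return parsed.netloc, parsed.path.lstrip("/")


def list_public_keys(uri, max_keys=10):
    """List up to max_keys object keys under uri without credentials."""
    # Third-party imports kept local so the rest of the sketch runs without them.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, prefix = parse_s3_uri(uri)
    s3 = boto3.client("s3",
                      region_name="us-east-1",  # assumed region
                      config=Config(signature_version=UNSIGNED))
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=max_keys)
    return [item["Key"] for item in response.get("Contents", [])]


bucket, prefix = parse_s3_uri("s3://1000genomes")
print(bucket, repr(prefix))
```

Calling `list_public_keys("s3://1000genomes")` would then return the first few keys, assuming the bucket still permits anonymous listing.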

Page 50: Talk at NCRR P41 Director's Meeting
Page 52: Talk at NCRR P41 Director's Meeting

Credit: Angel Pizarro, U. Penn

Page 53: Talk at NCRR P41 Director's Meeting

http://usegalaxy.org/cloud

Page 55: Talk at NCRR P41 Director's Meeting
Page 56: Talk at NCRR P41 Director's Meeting

AWS knows scalable infrastructure

Page 57: Talk at NCRR P41 Director's Meeting

you know the science

Page 58: Talk at NCRR P41 Director's Meeting

we can make this work together

Page 60: Talk at NCRR P41 Director's Meeting

[email protected]
Twitter: @mndoci

http://slideshare.net/mndoci
http://mndoci.com

Inspiration and ideas from Matt Wood, James Hamilton

& Larry Lessig

Credit: Oberazzi under a CC-BY-NC-SA license

