RNASeq Pipeline
• FASTQ File
• Map Reads to Genome (Tophat)
• Assemble Transcripts (Cufflinks)
• Gene Level Estimate (FPKM)
• SNP Discovery
• Isoform Analysis
• Significant Genes
• Gene Fusions
Pipeline development
• Run sequencing simulation to determine optimal coverage for typical experiments
• Vary number of reads, read length, paired-end
• Run through Tophat/Cufflinks
• Determine optimal coverage
• Want to generate 100 files x 5 replicates
• 500 - 700 files with 18 billion sequence reads
• ~ 1 TB total data
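The totals above imply a rough per-file size, which helps when choosing instance storage. A minimal sketch of that arithmetic, assuming the low end of 500 files and treating ~1 TB as 1,000 GB (both are assumptions, not figures from the simulation itself):

```shell
# Back-of-the-envelope sizing from the numbers on this slide.
files=500                 # 100 files x 5 replicates
total_reads=18000000000   # 18 billion sequence reads
total_gb=1000             # ~1 TB, taken as 1,000 GB (assumption)

reads_per_file=$((total_reads / files))
gb_per_file=$((total_gb / files))

echo "reads per file: $reads_per_file"
echo "GB per file:    $gb_per_file"
```

Under these assumptions each file holds about 36 million reads and about 2 GB of data.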
Computing Resources
• Personal Laptop/Desktop
• Small server cluster at work (with limited access)
Amazon Elastic Compute Cloud
Outline
• Look before you leap
• Launching an Amazon EC2 instance
• Launching Multiple Instances for HPC Applications
Look Before You Leap
Banko & Brill, 2001
More Data vs. Better Algorithms
“The Unreasonable Effectiveness of Data” - Peter Norvig 2010
http://www.youtube.com/watch?v=yvDCzhbjYWs
Thoughts on Big Data
• Sure, Big Data Is Great. But So Is Intuition.
  http://www.nytimes.com/2012/12/30/technology/big-data-is-great-but-dont-forget-intuition.html
• What statistics should do about big data: problem forward not solution backward
  http://simplystatistics.org/2013/05/29
• Most data isn’t “big”
  http://qz.com/81661/most-data-isnt-big-and-businesses-are-wasting-money-pretending-it-is/
Look before you leap
• Optimize code (e.g. use apply instead of a for loop)
• Performance testing (I/O, memory, or CPU bound)
• Work with subset of data
• Quality control testing using positive and negative controls
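As one way to work with a subset of the data, the sketch below takes the first N reads from a FASTQ file. It relies only on the fact that a FASTQ record is four lines long; the function name `take_reads` and the tiny demo file are made up for illustration.

```shell
# take_reads N FILE: print the first N records of a FASTQ file.
# Each FASTQ record is 4 lines: header, sequence, '+', qualities.
take_reads() {
    head -n "$(( $1 * 4 ))" "$2"
}

# Tiny made-up FASTQ file for demonstration.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' > "$demo"

take_reads 2 "$demo" > "$demo.subset"
wc -l < "$demo.subset"
```

Note that the first reads in a file are not a random sample of the run, so this is better suited to testing that the pipeline works and timing it than to estimating results.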
Getting started with Amazon EC2
Instance Store
Amazon EBS
Amazon S3
Bucket
Instance (from AMI)
Virtual Machine
EC2 Instances

Instance Type   vCPU   Memory (GiB)   Instance Storage (GB)   Network Performance
t1.micro        1      0.615          EBS only                Very Low
m1.small        1      1.7            1 x 160                 Low
m1.medium       1      3.75           1 x 410                 Moderate
m1.large        2      7.5            2 x 420                 Moderate
m1.xlarge       4      15             4 x 420                 High
m3.xlarge       4      15             EBS only                Moderate
m3.2xlarge      8      30             EBS only                High
c1.medium       2      1.7            1 x 350                 Moderate
c1.xlarge       8      7              4 x 420                 High
cc2.8xlarge     32     60.5           4 x 840                 10 Gigabit
m2.xlarge       2      17.1           1 x 420                 Moderate
m2.2xlarge      4      34.2           1 x 850                 Moderate
m2.4xlarge      8      68.4           2 x 840                 High
cr1.8xlarge     32     244            2 x 120 SSD             10 Gigabit
hi1.4xlarge     16     60.5           2 x 1,024 SSD           10 Gigabit
http://aws.amazon.com/ec2/instance-types/
EC2 Pricing (per hour)

Instance Type   On Demand   Spot        Reserved Instance (1 Year of Medium Use)
                Instance    Instance    Upfront     Hourly
t1.micro        $0.020      $0.003      $54         $0.007
m1.small        $0.060      $0.007      $139        $0.021
m1.medium       $0.120      $0.013      $277        $0.042
m1.large        $0.240      $0.026      $554        $0.084
m1.xlarge       $0.480      $0.052      $1,108      $0.168
m3.xlarge       $0.500      $0.0575     $1,217      $0.184
m3.2xlarge      $1.000      $0.115      $2,434      $0.368
c1.medium       $0.145      $0.018      $370        $0.054
c1.xlarge       $0.580      $0.070      $1,480      $0.216
cc2.8xlarge     $1.300      $0.270      $4,146      $0.540
m2.xlarge       $0.410      $0.035      $651        $0.103
m2.2xlarge      $0.820      $0.070      $1,302      $0.206
m2.4xlarge      $1.640      $0.140      $2,604      $0.412
cr1.8xlarge     $3.500      $0.343      $5,958      $0.930
hi1.4xlarge     $3.100      $0.208      $5,973      $0.909
* As of 6/18/2013 for US East
** For accurate pricing visit http://aws.amazon.com/ec2/pricing/
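To compare options before launching anything, the hourly prices can be turned into a rough total. This is only a sketch: the helper name `estimate_cost`, the instance count, and the 24-hour duration are made up, and the prices are the 6/18/2013 m1.large figures from the table (spot prices change over time).

```shell
# estimate_cost INSTANCES HOURS PRICE_PER_HOUR: total cost in dollars.
estimate_cost() {
    awk -v n="$1" -v h="$2" -v p="$3" 'BEGIN { printf "%.2f\n", n * h * p }'
}

# 10 m1.large instances for 24 hours (count and duration are hypothetical).
estimate_cost 10 24 0.240   # on-demand price from the table
estimate_cost 10 24 0.026   # spot price from the table
```

With these inputs the on-demand run comes to $57.60 and the spot run to $6.24, before storage and data transfer.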
EC2 Pricing

Data Transfer IN To Amazon EC2 From
Internet                 $0.00 per GB

Data Transfer OUT From Amazon EC2 To Internet
First 1 GB / month       $0.00 per GB
Up to 10 TB / month      $0.12 per GB
Next 40 TB / month       $0.09 per GB
Next 100 TB / month      $0.07 per GB
Next 350 TB / month      $0.05 per GB

Amazon EBS Standard Volumes
$0.10 per GB-month of provisioned storage
$0.10 per 1 million I/O requests

Amazon EBS Snapshots to Amazon S3
$0.095 per GB-month of data stored
* As of 6/18/2013 for US East
** For accurate pricing visit http://aws.amazon.com/ec2/pricing/
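The transfer and storage prices above can be folded into the same kind of estimate. A sketch under stated assumptions: the function names are made up, the transfer calculation only covers the first two tiers (so it is valid up to 10 TB per month), and EBS I/O request charges are left out.

```shell
# transfer_out_cost GB: cost of sending GB to the internet in one month,
# using the first two tiers above (first 1 GB free, then $0.12 per GB).
transfer_out_cost() {
    awk -v gb="$1" 'BEGIN {
        billable = gb - 1
        if (billable < 0) billable = 0
        printf "%.2f\n", billable * 0.12
    }'
}

# ebs_storage_cost GB MONTHS: provisioned storage only, I/O requests not included.
ebs_storage_cost() {
    awk -v gb="$1" -v m="$2" 'BEGIN { printf "%.2f\n", gb * m * 0.10 }'
}

transfer_out_cost 1000   # download 1,000 GB of results
ebs_storage_cost 100 1   # keep a 100 GB volume for one month
```

At the listed prices, downloading 1,000 GB costs $119.88 and a 100 GB volume costs $10.00 per month.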
Snail Mail is Sometimes Faster
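To see why shipping disks can beat the network, it helps to estimate the upload time for the data set. A rough sketch, assuming a sustained transfer rate with no protocol overhead; the function name `upload_days` and both link speeds are hypothetical.

```shell
# upload_days GB MBITS_PER_SEC: days needed to move GB at a sustained rate.
upload_days() {
    awk -v gb="$1" -v mbps="$2" 'BEGIN {
        bits = gb * 8 * 1000 * 1000 * 1000      # decimal GB to bits
        seconds = bits / (mbps * 1000 * 1000)
        printf "%.1f\n", seconds / 86400
    }'
}

upload_days 1000 10    # 1 TB over a 10 Mbit/s uplink (hypothetical)
upload_days 1000 100   # same data over 100 Mbit/s (hypothetical)
```

Under these assumptions 1 TB takes about 9.3 days at 10 Mbit/s and about 0.9 days at 100 Mbit/s.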
Let’s do it!
example_keypair
-----BEGIN RSA PRIVATE KEY-----9q0d4md52izXfu27xbAnb3uAz5SBSIauqZo2C9L2c3JM2lUcj3kz6j8ErjjSg+hKv9kPcdPNqqYbfRNzaMU1XlVCIlm0F9ctEbwehXRE1FBwjLMIQ0AE9y0Um9/pmvU840MmR/C5Btgaf3dfHShWfO41sbnJr/wqUsdFvdPx87OEIDp6ifp5/GC7CXFLimVbOD+gHGe/B1Ge+qG5LXKsrjQjA9tbdeAzD0bN7JvYbXiWRR9wnEUne5Kzj/zVY/BO7Z2f++HLotQQSpajfDwOuDYsw+tYRoQvLgRiDC74glg3GJuOvMa1FPGB48Fu8iR6fxywHBeCzK2xiLNECgJAhB1PDiI4Wc1ECA4kaj6KC+/AXbJeLS98i18ATZKM0+FKznpuH7JIomPFyDVQiAHRdLh292V+sqP914aqI+G8IweajYntwfYv+ie7U7nqqUhypTYixWboq1oy/xOGlEEmJopfTFsuwix4nXJ4PGHn4xBBzEQ+H0EpUMdYkAA/szT9na0zxKJQzdbkf0YVCjEtWpw6axU1rja0K55ULg3jju6DKM/uz+lg9vAvdPKpubX9LwwBiAIYnOgYxBnIB4CfqhmqrgAWoRoHbqU8KovKw+rwBYl6dslDwwvEk0oX8Lg67t920RXOCOcXgCnCkgz6PUDWkbxf8f9AKWK21+u8ShuarH4Z+j5V58SyZxoYKHLzVO+LgeSCwBPrjyVn9icTAx5kWvJQot4VBSe3a0aFk/QUWMwecnz7OZCLfZ6VO8uj9QCpAGxD6NsvzmyLLhSjgyylHm6+zbX35WzdcH44507lgG0DbSxb3q6i3Sc3vsYpFjPgI0z4LuzeC3Bs8rw9qcKh4cpjhpBQyc/S1g6+vbkwII9OEjh6eBZrhfkd1s7eCWwrl865OJp+6WpcBMYRAjOlL6fOPzDLjhAeSWr5G/FuN1p1EOWib63HfsvxKU9DkLKdHdnL4B3lA8Eor/dZ8iYbiTx2xasv2SVHcYLCenG8X811LXiyMEQ25lm+XUu+FTXfhd3XaRq0Q+KHUMpxM77Bjj6omubDR96rDmCw1x156haFqfwHgDgWddyJj4wEB58G7kSj2U4Y+swVRfejrNlmgd4z/dz8OyvNVWS1+egbclVI7oL+R5TuMZyQoFbHuyPlExDUaESGqP0GZwMjtyGc1jij/9AJBzONJj3nH/Mf+l1Y1iPnk/87qM+h1LEJJol0p+aR3HggYPQNdc+D+aVm7rEbCh5THVHGUUxz7h787kfUKLcTcblqac21iE7Qy7XA09AOLJFFOwiUMpfr1A2jFEG5mHFQToRprEmy7D6j3Qbyn1b9DgVPdV+4kMTKaX2K1NasUJc+0VaUc1xFt8/+ErN1mLTcQBRDcn7mf1uZ7CMd+4ChigGJrQKDQClhxdS6ZgLDd89W2lhGZ+YoB3tYmaUAm4kIJpDFILPcU4tnUk8X9Z2akEjC9o4Q8s/9xvGFmna4tZKxmQQub1qlDb5rLTi5aLxcKz/mROFtB6GkYs9GU+yU7KttKQGKFAsFMxpGbfacjJqGvy0z/U+IkPUzcpc3ioFt1QD9kzWnbJx0YkyIShzdT6i4-----END RSA PRIVATE KEY-----
example_keypair.pem
note: this is not a real key pair
[astling@laptop ~]$ ssh -i example_keypair.pem \
    [email protected]
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for 'example_keypair.pem' are too open.
It is recommended that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: example_keypair.pem
Permission denied (publickey).
[astling@laptop ~]$ chmod 400 example_keypair.pem
[astling@laptop ~]$
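The permission fix can be checked before retrying ssh. A small sketch using a temporary stand-in file rather than a real key; the helper name `perms` is made up.

```shell
# perms FILE: print the permission string from ls (e.g. -rw-r--r--).
perms() {
    ls -l "$1" | cut -c1-10
}

# Stand-in key file with the permissions that ssh rejected (0644).
key=$(mktemp)
chmod 644 "$key"
perms "$key"

# Restrict it to read-only for the owner, as in the chmod 400 step.
chmod 400 "$key"
perms "$key"
```

The first call prints `-rw-r--r--` (readable by everyone) and the second prints `-r--------`, which is what ssh expects for a private key.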
[astling@laptop ~]$ ssh -i example_keypair.pem \
    [email protected]

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
There are 1 security update(s) out of 3 total update(s) available
Run "sudo yum update" to apply all updates.
[ec2-user@domU-blah ~]$
Success!!
Package Managers
• yum - Yellowdog Updater Modified (Red Hat, Fedora, CentOS)
• apt-get - Advanced Packaging Tool (Ubuntu, Debian)
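If a setup script has to run on either family of AMI, it can check which package manager is present. A minimal sketch; the function name `detect_pkg_manager` is made up, and it only looks for the two tools listed above.

```shell
# detect_pkg_manager: print the first package manager found on PATH.
detect_pkg_manager() {
    if command -v yum >/dev/null 2>&1; then
        echo "yum"
    elif command -v apt-get >/dev/null 2>&1; then
        echo "apt-get"
    else
        echo "unknown"
    fi
}

detect_pkg_manager
```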
For Amazon/Red Hat
[ec2-user@domU-blah ~]$ sudo yum update
[ec2-user@domU-blah ~]$ sudo yum install R
[ec2-user@domU-blah ~]$ sudo yum install svn mercurial gcc-c++ [other packages...]
[ec2-user@domU-blah ~]$ R

R version 2.15.2 (2012-10-26) -- "Trick or Treat"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
For Ubuntu
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install build-essential
$ sudo apt-get install r-base
$ sudo apt-get install r-cran-cluster r-cran-lattice r-cran-mass r-cran-mgcv r-cran-nlme r-cran-nnet r-cran-survival r-cran-rodbc
Upload Code/Data
[astling@laptop ~]$ scp -r -i example_keypair.pem \
    path/to/stuff \
    [email protected]:path/to/folder
Mount Ephemeral Storage
includes 2x 420 GB instance storage
Mounting the Instance Store
[ec2-user@ip-blah ~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda1            7.9G  1.6G  6.3G  20% /
tmpfs                 3.7G     0  3.7G   0% /dev/shm
/dev/xvdb             414G  199M  393G   1% /media/ephemeral0
[ec2-user@ip-blah ~]$
Where is /media/ephemeral1?
Mounting the Instance Store
[ec2-user@ip-blah ~]$ ls /dev
autofs           full  input  loop7               ppp     stderr  ttyS2    vcsa         xvda1
block            fuse  kmsg   loop-control        psaux   stdin   ttyS3    vcsa1        xvdb
btrfs-control    hvc0  log    mapper              ptmx    stdout  urandom  vcsa2        xvdc
char             hvc1  loop0  mem                 pts     tty     vcs      vcsa3        zero
console          hvc2  loop1  net                 random  tty0    vcs1     vcsa4
core             hvc3  loop2  network_latency     root    tty1    vcs2     vcsa5
cpu              hvc4  loop3  network_throughput  sda1    tty10   vcs3     vcsa6
cpu_dma_latency  hvc5  loop4  null                sdb     tty11   vcs4     vga_arbiter
disk             hvc6  loop5  oldmem              sdc     tty12   vcs5     vhost-net
fd               hvc7  loop6  port                shm     tty13   vcs6     xen
[ec2-user@ip-blah ~]$
Mounting the Instance Store
$ sudo mkfs -t ext4 /dev/xvdc
$ sudo mkdir /media/ephemeral1
$ sudo mount -t ext4 /dev/xvdc /media/ephemeral1

$ sudo chown ec2-user /media/ephemeral0
$ sudo chown ec2-user /media/ephemeral1
Mounting the Instance Store
[ec2-user@ip-blah ~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda1            7.9G  1.6G  6.3G  20% /
tmpfs                 3.7G     0  3.7G   0% /dev/shm
/dev/xvdb             414G  199M  393G   1% /media/ephemeral0
/dev/xvdc             414G   71M  393G   1% /media/ephemeral1
[ec2-user@ip-blah ~]$
Automated Mounting
[ec2-user@ip-blah ~]$ sudo vim /etc/fstab
#
LABEL=/     /                   ext4    defaults,noatime              1 1
tmpfs       /dev/shm            tmpfs   defaults                      0 0
devpts      /dev/pts            devpts  gid=5,mode=620                0 0
sysfs       /sys                sysfs   defaults                      0 0
proc        /proc               proc    defaults                      0 0
/dev/sdb    /media/ephemeral0   auto    defaults,comment=cloudconfig  0 2
/dev/sdc    /media/ephemeral1   auto    defaults,comment=cloudconfig  0 2
[ec2-user@ip-blah ~]$
Caution: if instance storage fails to mount (e.g. not configured correctly), you may not be able to boot and ssh into the instance
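One way to reduce that risk is to check, before rebooting, that every device named in the fstab file actually exists. The sketch below is an illustration rather than a complete validator: the function name `missing_fstab_devices` is made up, and it is demonstrated on a scratch file instead of the real /etc/fstab. Adding the standard `nofail` mount option to an entry is another commonly used safeguard.

```shell
# missing_fstab_devices FILE: list /dev/... entries in FILE whose device node is absent.
missing_fstab_devices() {
    awk '$1 ~ /^\/dev\// { print $1 }' "$1" | while read -r dev; do
        [ -e "$dev" ] || echo "missing: $dev"
    done
}

# Demonstration on a scratch file; /dev/no_such_disk is a made-up name.
scratch=$(mktemp)
printf '%s\n' \
    '/dev/null          /media/ephemeral0  auto  defaults  0 2' \
    '/dev/no_such_disk  /media/ephemeral1  auto  defaults  0 2' > "$scratch"

missing_fstab_devices "$scratch"
```

On an instance, the same function could be pointed at /etc/fstab after editing it; no output means every listed device was found.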
Add EBS Storage
Mounting the EBS Volume
$ sudo mkfs -t ext4 /dev/xvdf
$ sudo mkdir -p /Volumes/things_n_stuff
$ sudo mount -t ext4 /dev/xvdf /Volumes/things_n_stuff

$ sudo chown ec2-user /Volumes/things_n_stuff
Mounting the EBS Volume
[ec2-user@ip-blah ~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda1            7.9G  1.6G  6.3G  20% /
tmpfs                 3.7G     0  3.7G   0% /dev/shm
/dev/xvdb             414G  199M  393G   1% /media/ephemeral0
/dev/xvdc             414G   71M  393G   1% /media/ephemeral1
/dev/xvdf              99G   60M   94G   1% /Volumes/things_n_stuff
[ec2-user@ip-blah ~]$
Creating an AMI
Launching Multiple Instances for HPC Applications
[Diagram: launch instances from custom AMI (instance_1, instance_2, ... instance_n); each instance runs simulations and writes a result file; download results]
For Independent Processes
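For independent processes, the launch, run, and download pattern can be scripted with plain ssh and scp. The sketch below only prints the commands it would run, so it can be inspected first; the host names are placeholders in the style of these slides, and `simulate.R`, the result file names, and the `results/` folder are hypothetical.

```shell
KEY=example_keypair.pem
HOSTS="ec2-blah1.compute-1.amazonaws.com ec2-blah2.compute-1.amazonaws.com"

# plan_jobs: print one ssh command (run) and one scp command (fetch) per instance.
plan_jobs() {
    i=1
    for host in $HOSTS; do
        echo "ssh -i $KEY ec2-user@$host 'Rscript simulate.R > result_$i.txt'"
        echo "scp -i $KEY ec2-user@$host:result_$i.txt results/"
        i=$((i + 1))
    done
}

plan_jobs
```

Printing the plan keeps the example safe to run anywhere; once the commands look right, the output could be piped to `sh` to execute them.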
For HPC Applications
1. Building a Cluster in 10 min
   http://www.youtube.com/embed/YfCgK1bmCjw
2. MIT StarCluster
   http://star.mit.edu/cluster/
3. MPI Cluster Setup
   http://glennklockwood.blogspot.com/2013/04/quick-mpi-cluster-setup-on-amazon-ec2.html
Configure User
$ sudo groupadd data
$ sudo useradd -s /bin/bash -m cluster -G data -p ""
$ su - cluster
$ cd ~
$ ssh-keygen -t dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ chmod 644 ~/.ssh/authorized_keys

$ R
> install.packages("doSNOW")
Run Analysis
library(doSNOW)

# Five workers on each of four machines (20 workers in total)
servers <- c(rep("localhost", 5),
             rep("ec2-blah1.compute-1.amazonaws.com", 5),
             rep("ec2-blah2.compute-1.amazonaws.com", 5),
             rep("ec2-blah3.compute-1.amazonaws.com", 5))
cl <- makeCluster(servers)
registerDoSNOW(cl)
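Run Analysis
# Toy CPU-bound workload: invert 100,000 random 10 x 10 matrices
my.test <- function(n) {
  for (i in 1:100000) {
    X <- matrix(rnorm(100), ncol = 10, nrow = 10)
    solve(X)
  }
}

times <- 100
# Serial baseline
system.time(for (i in 1:times) { my.test(i) })
# Same work spread across the registered cluster
system.time(x <- foreach(i = 1:times) %dopar% my.test(i))

stopCluster(cl)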
Run Analysis
> system.time(for (i in 1:times) { my.test(i) })
    user   system  elapsed
1142.071    0.000 1142.021
> system.time(x <- foreach(i = 1:times) %dopar% my.test(i))
   user  system elapsed
  0.096   0.016  58.227
> stopCluster(cl)
Run Analysis
Parallelizing for a single instance (or on your desktop):
> number.of.cores <- 5
> cl <- makeCluster(number.of.cores)
> registerDoSNOW(cl)
> system.time(x <- foreach(i = 1:times) %dopar% my.test(i))
   user  system elapsed
  0.388   0.016 232.580
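The elapsed times reported in these slides can be turned into speedup and parallel efficiency figures. A small sketch of that arithmetic: the function name `speedup` is made up, the serial time is 1142.021 s, and the worker counts are 20 for the four-machine cluster (4 x 5) and 5 for the single instance.

```shell
# speedup SERIAL PARALLEL WORKERS: print speedup and parallel efficiency.
speedup() {
    awk -v s="$1" -v p="$2" -v w="$3" 'BEGIN {
        x = s / p
        printf "%.2fx speedup, %.0f%% efficiency\n", x, 100 * x / w
    }'
}

speedup 1142.021 58.227 20    # four machines x 5 workers
speedup 1142.021 232.580 5    # 5 workers on one instance
```

Both runs come out at roughly 98% efficiency (about 19.6x on 20 workers and 4.9x on 5), which is expected for a workload made of fully independent tasks.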
Bonus Material
Running RStudio on Amazon EC2
Follow instructions here:
http://www.r-bloggers.com/instructions-for-installing-using-r-on-amazon-ec2/
... and then you can run R and access results through a web browser
Getting Help
• Amazon EC2 Documentation
  http://aws.amazon.com/documentation/ec2/
• Getting Started Guide
  http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html
• More HPC Resources
  http://aws.amazon.com/hpc-applications/
• Manage/Configure Multiple Users
  http://www.youtube.com/watch?v=XuRM4Id6uDY