R Analytics in the Cloud
Radek Maciaszek
DataMine Lab (www.dataminelab.com): data mining, business intelligence and data warehouse consultancy.
MSc in Bioinformatics at Birkbeck, University of London.
Project at the UCL Institute of Healthy Ageing under the supervision of Dr Eugene Schuster.
Introduction
Primer in Bioinformatics
Bioinformatics: applying computer science to biology (DNA, proteins, drug discovery, etc.)
Ageing strategy – solve it in a simple organism and apply the findings to more complex organisms (e.g. humans).
Goal: find genes responsible for ageing
Caenorhabditis elegans
Genes are encoded by DNA.
Microarray (100 x 100)
Central dogma of molecular biology
• Database of 50 curated experiments.
• 10k genes compared to each other.
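To get a sense of scale, the number of gene-to-gene comparisons can be worked out directly. A quick sketch, taking the "10k genes" above as exactly 10,000 (an assumption; only the rounded figure is given):

```r
# Assumption: "10k genes" is taken as exactly 10,000.
n.genes <- 10000

# Every gene against every gene, counting self-pairs and both orderings.
all.pairs <- n.genes^2

# Unordered pairs of distinct genes only.
unique.pairs <- choose(n.genes, 2)

print(format(all.pairs, big.mark = ",", scientific = FALSE))
print(format(unique.pairs, big.mark = ",", scientific = FALSE))
```

Either way the count is in the tens of millions of correlation tests, which is what makes the job a candidate for a cluster.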
Why R?
• Very popular in bioinformatics
• Functional, scripting programming language
• Swiss-army knife for statisticians
• Designed by statisticians for statisticians
• Lots of ready-to-use packages (CRAN)
R limitations & Hadoop
• Data needs to fit in memory
• Single-threaded
• Hadoop integration:
  • Hadoop Streaming
  • Rhipe: http://ml.stat.purdue.edu/rhipe/
  • Segue: http://code.google.com/p/segue/
Segue
• Works with Amazon Elastic MapReduce.
• Creates a cluster for you.
• Designed for Big Computations (rather than Big Data).
• Implements a cloud version of the lapply() function.
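Because the cloud version mirrors lapply(), a job can be prototyped locally first. A minimal sketch of a "big computation, small data" task; the function SimulateMean and the list of seeds are hypothetical, invented here for illustration and not part of the talk:

```r
# Hypothetical compute-heavy task: tiny input (a seed), lots of work per element.
SimulateMean <- function(seed) {
  set.seed(seed)
  # Average of 1000 sample means, each taken over 1000 standard normal draws.
  mean(replicate(1000, mean(rnorm(1000))))
}

seeds <- as.list(1:4)

# Local run: lapply() returns one result per list element.
local.results <- lapply(seeds, SimulateMean)

# On a Segue cluster the same job would be submitted with emrlapply(),
# passing the cluster object as the first argument:
#   emrlapply(myCluster, seeds, SimulateMean)
```

Each list element is independent of the others, which is the property that lets the work be farmed out to separate machines.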
Segue workflow (emrlapply)
[Diagram labels: R, List (local), S3, Elastic MapReduce, Amazon AWS, List (remote)]
R very quick example
> m <- list(a = 1:10, b = exp(-3:3))
> lapply(m, mean)
$a
[1] 5.5

$b
[1] 4.535125
lapply(X, FUN) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
Segue – large scale example
> AnalysePearsonCorelation <- function(probe) {
    A.vector <- experiments.matrix[probe,]
    p.values <- c()
    for(probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name,]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return (p.values)
  }
> pearson.cor <- lapply(probes, AnalysePearsonCorelation)
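The function above can be tried on a laptop with a small stand-in matrix before any cluster is involved. A hedged sketch: the synthetic data, the 6 x 50 dimensions and the name PearsonPValues are assumptions made for illustration, and vapply() is swapped in for the growing-vector loop:

```r
# Synthetic stand-in for the real data: 6 probes (rows) x 50 experiments (columns).
set.seed(1)
experiments.matrix <- matrix(rnorm(6 * 50), nrow = 6,
                             dimnames = list(paste0("probe", 1:6), NULL))
probes <- as.list(rownames(experiments.matrix))

# Same idea as the slide's function, but vapply() preallocates the result
# instead of growing a vector inside a loop.
PearsonPValues <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  vapply(rownames(experiments.matrix), function(probe.name) {
    cor.test(A.vector, experiments.matrix[probe.name, ])$p.value
  }, numeric(1))
}

pearson.cor <- lapply(probes, PearsonPValues)

# Bind the per-probe vectors into a probes x probes matrix of p-values.
p.matrix <- do.call(rbind, pearson.cor)
rownames(p.matrix) <- rownames(experiments.matrix)
```

Each call only reads experiments.matrix and one probe name, so the list elements can be processed in any order or on different machines.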
Moving to the cloud in 3 lines of code!
RNA Probes
Segue – large scale example
> AnalysePearsonCorelation <- function(probe) {
    A.vector <- experiments.matrix[probe,]
    p.values <- c()
    for(probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name,]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return (p.values)
  }
> # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
> myCluster <- createCluster(numInstances=5,
    masterBidPrice="0.68", slaveBidPrice="0.68",
    masterInstanceType="c1.xlarge", slaveInstanceType="c1.xlarge",
    copy.image=TRUE)
> pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
> stopCluster(myCluster)
Discovering genes
Topomaps of clustered genes
This work was based on a similar approach to: "A Gene Expression Map for Caenorhabditis elegans", Stuart K. Kim et al., Science 293, 2087 (2001)
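For readers who want to see how genes can be grouped by correlated expression, here is a rough sketch on synthetic data. It uses plain hierarchical clustering with a correlation distance, which is a swapped-in, simpler technique and not the topomap method of Kim et al.; all data and names below are made up for illustration:

```r
# Synthetic expression data: 8 genes x 30 experiments, built as two groups
# of 4 genes that each follow a shared profile plus noise.
set.seed(42)
profile.a <- rnorm(30)
profile.b <- rnorm(30)
expr <- rbind(
  t(replicate(4, profile.a + rnorm(30, sd = 0.3))),
  t(replicate(4, profile.b + rnorm(30, sd = 0.3)))
)
rownames(expr) <- paste0("gene", 1:8)

# Correlation distance: strongly co-expressed genes end up close together.
gene.dist <- as.dist(1 - cor(t(expr)))

# Average-linkage hierarchical clustering, cut into two groups.
clusters <- cutree(hclust(gene.dist, method = "average"), k = 2)
print(clusters)
```

Genes that share a profile should land in the same cluster, which is the intuition behind grouping co-expressed genes.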
Conclusions
• R is great for statistics.
• It's easy to scale up R using Segue.
• We are all going to live very long.
Thanks! Questions?
References:
http://code.google.com/r/radek-segue/
http://www.dataminelab.com