
R Analytics in the Cloud

Date post: 22-Feb-2016
Page 1: R Analytics  in the Cloud

R Analytics in the Cloud

Page 2: R Analytics  in the Cloud

Introduction

Radek Maciaszek
• DataMine Lab (www.dataminelab.com): data mining, business intelligence and data warehouse consultancy.
• MSc in Bioinformatics at Birkbeck, University of London.
• Project at UCL Institute of Healthy Ageing under supervision of Dr Eugene Schuster.

Page 3: R Analytics  in the Cloud

Primer in Bioinformatics

• Bioinformatics: applying computer science to biology (DNA, proteins, drug discovery, etc.).
• Ageing strategy: solve it in a simple organism and apply the findings to more complex organisms (e.g. humans).
• Goal: find the genes responsible for ageing.

Caenorhabditis elegans

Page 4: R Analytics  in the Cloud

Genes are encoded by the DNA.

Microarray (100 x 100)

Central dogma of molecular biology

• Database of 50 curated experiments.
• 10k genes compared to each other.
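To give a feel for the size of "10k genes compared to each other", the arithmetic can be sketched in R. The variable names are illustrative, and the figure of 10,000 genes is taken from the slide as a round number rather than an exact count.

```r
# Number of genes quoted on the slide (10k, a round figure).
n.genes <- 10000

# Every ordered pair, which is what a nested loop over all probes visits.
ordered.pairs <- n.genes^2

# Unique unordered pairs, excluding a gene compared with itself.
unique.pairs <- choose(n.genes, 2)

cat("Ordered pairs:", format(ordered.pairs, big.mark = ",", scientific = FALSE), "\n")
cat("Unique pairs: ", format(unique.pairs, big.mark = ",", scientific = FALSE), "\n")
```

Either way the number of correlation tests runs into the tens of millions, which is why this is a compute-bound rather than a data-bound problem.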

Page 5: R Analytics  in the Cloud

Why R?

• Very popular in bioinformatics
• Functional, scripting programming language
• Swiss-army knife for statisticians
• Designed by statisticians for statisticians
• Lots of ready-to-use packages (CRAN)

Page 6: R Analytics  in the Cloud

R limitations & Hadoop

• Data needs to fit in memory
• Single-threaded
• Hadoop integration:
  • Hadoop Streaming
  • Rhipe: http://ml.stat.purdue.edu/rhipe/
  • Segue: http://code.google.com/p/segue/

Page 7: R Analytics  in the Cloud

Segue

• Works with Amazon Elastic MapReduce.
• Creates a cluster for you.
• Designed for Big Computations (rather than Big Data).
• Implements a cloud version of the lapply() function.

Page 8: R Analytics  in the Cloud

Segue workflow (emrlapply)

[Diagram labels: R, List (local), S3, Elastic MapReduce, Amazon AWS, List (remote)]

Page 9: R Analytics  in the Cloud

A very quick R example

> m <- list(a = 1:10, b = exp(-3:3))
> lapply(m, mean)
$a
[1] 5.5

$b
[1] 4.535125

lapply(X, FUN) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
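As a further illustration of the lapply() contract, here is a minimal sketch that applies an anonymous function to the same list and then uses vapply() to get a plain numeric vector instead of a list. The names `ranges` and `means` are illustrative only.

```r
m <- list(a = 1:10, b = exp(-3:3))

# FUN can be any function of one element, including an anonymous one.
# The result is a list with the same length and names as m.
ranges <- lapply(m, function(x) max(x) - min(x))

# vapply() does the same job but returns a named numeric vector,
# and checks that every result is a single number.
means <- vapply(m, mean, numeric(1))

str(ranges)
print(means)
```

The important property for what follows is that each element is processed independently, so the calls can run in any order or on different machines.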

Page 10: R Analytics  in the Cloud

Segue – large scale example

> AnalysePearsonCorelation <- function(probe) {
    A.vector <- experiments.matrix[probe,]
    p.values <- c()
    for(probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name,]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return (p.values)
  }
> pearson.cor <- lapply(probes, AnalysePearsonCorelation)

Moving to the cloud in 3 lines of code!

[Slide callout: RNA Probes]
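The slide assumes that `experiments.matrix` and `probes` already exist. To try the function locally, a minimal sketch on simulated data could look like this. The 6 x 50 matrix of random numbers and the probe names are assumptions made purely for illustration, and the inner loop is written with vapply() instead of growing a vector, which changes the style but not the result.

```r
set.seed(42)  # reproducible toy data

# Hypothetical stand-in for experiments.matrix: 6 probes x 50 experiments.
experiments.matrix <- matrix(rnorm(6 * 50), nrow = 6,
                             dimnames = list(paste0("probe", 1:6), NULL))
probes <- rownames(experiments.matrix)

AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  # One p-value per probe; vapply() preallocates the result vector.
  vapply(rownames(experiments.matrix), function(probe.name) {
    cor.test(A.vector, experiments.matrix[probe.name, ])$p.value
  }, numeric(1))
}

pearson.cor <- lapply(probes, AnalysePearsonCorelation)
names(pearson.cor) <- probes

# Stack the per-probe vectors into a probes x probes matrix of p-values.
p.matrix <- do.call(rbind, pearson.cor)
print(dim(p.matrix))
```

Running on a handful of probes first is a cheap way to check the function before paying for a cluster.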

Page 11: R Analytics  in the Cloud

Segue – large scale example

> AnalysePearsonCorelation <- function(probe) {
    A.vector <- experiments.matrix[probe,]
    p.values <- c()
    for(probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name,]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return (p.values)
  }
> # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
> myCluster <- createCluster(numInstances=5,
    masterBidPrice="0.68", slaveBidPrice="0.68",
    masterInstanceType="c1.xlarge", slaveInstanceType="c1.xlarge",
    copy.image=TRUE)
> pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
> stopCluster(myCluster)

[Slide callout: RNA Probes]
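The same create / apply / stop pattern can be rehearsed on a single machine before any Amazon resources are involved. The sketch below uses R's bundled `parallel` package as a local stand-in. It is not Segue and does not touch Elastic MapReduce; the toy data and the function name `CorrelationPValues` are hypothetical.

```r
library(parallel)  # ships with R; used here as a local stand-in for a cluster

set.seed(1)
# Hypothetical toy data standing in for experiments.matrix.
experiments.matrix <- matrix(rnorm(4 * 30), nrow = 4,
                             dimnames = list(paste0("probe", 1:4), NULL))
probes <- rownames(experiments.matrix)

CorrelationPValues <- function(probe) {
  vapply(rownames(experiments.matrix), function(other) {
    cor.test(experiments.matrix[probe, ], experiments.matrix[other, ])$p.value
  }, numeric(1))
}

localCluster <- makeCluster(2)  # two local worker processes
local.result <- tryCatch({
  # Workers start with empty workspaces, so ship the data explicitly.
  clusterExport(localCluster, "experiments.matrix")
  parLapply(localCluster, probes, CorrelationPValues)
}, finally = parallel::stopCluster(localCluster))  # always release the workers

# The serial version should give the same answer.
serial.result <- lapply(probes, CorrelationPValues)
print(all.equal(local.result, serial.result))
```

Wrapping the work in tryCatch(..., finally = ...) ensures the workers are shut down even if the computation fails; the same habit is worth keeping with a paid cluster.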

Page 12: R Analytics  in the Cloud

Discovering genes

Topomaps of clustered genes

This work was based on a similar approach to: A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim, et al., Science 293, 2087 (2001)
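The slides do not show how the genes were clustered. As a rough illustration of grouping genes by correlation, the sketch below uses plain hierarchical clustering on toy data. This is a swapped-in technique chosen for brevity, not the topomap method of Kim et al. and not necessarily what the project used; all data and gene names are made up.

```r
set.seed(7)

# Hypothetical toy expression data: two groups of co-expressed genes,
# each measured across 20 experiments.
base.a <- rnorm(20)
base.b <- rnorm(20)
expr <- rbind(
  geneA1 = base.a + rnorm(20, sd = 0.1),
  geneA2 = base.a + rnorm(20, sd = 0.1),
  geneB1 = base.b + rnorm(20, sd = 0.1),
  geneB2 = base.b + rnorm(20, sd = 0.1)
)

# Turn gene-to-gene Pearson correlation into a distance:
# highly correlated genes end up close together.
gene.dist <- as.dist(1 - cor(t(expr)))

# Average-linkage hierarchical clustering, cut into two groups.
tree <- hclust(gene.dist, method = "average")
clusters <- cutree(tree, k = 2)
print(clusters)
```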

Page 13: R Analytics  in the Cloud

Conclusions

• R is great for statistics.
• It’s easy to scale up R using Segue.
• We are all going to live very long.

Page 14: R Analytics  in the Cloud

Thanks! Questions?

References:
http://code.google.com/r/radek-segue/
http://www.dataminelab.com

