Computational Techniques for the Statistical Analysis of Big Data in R
A Case Study of the rlme Package
Herb Susmann, Yusuf Bilgic
April 12, 2014
Workflow: Identify, Rewrite, Benchmark, Test
Case Study: rlme (Identify, Wilcoxon Tau Estimator, Pairup, Covariance Estimator)
Summary
Keeping Ahead
Motivation
- Case study: rlme package
- Rank-based regression and estimation of two- and three-level nested effects models
- Goals: faster, less memory, more data
- Before: 5,000 rows of data
- After: 50,000 rows of data
Section 1
Workflow
Workflow
- Identify
- Rewrite
- Benchmark
- Test
Identify
- Know your big O! (O(n²) memory usage? Probably not so good for big data)
- Look for error messages
- Profiling with Rprof
Rewrite
High level design
- Algorithm design
- Statistical techniques: bootstrapping
Rewrite
Microbenchmarking
- Know what R is good at
- Avoid loops in favor of vectorization
- Preallocation
- Arguments are passed by value, not by reference
- Embrace C++
Be careful!
Vectorizing
## Bad: explicit loop with element-wise assignment
vec = 1:100
for (i in 1:length(vec)) {
  vec[i] = vec[i]^2
}

## Better: apply a function over the vector
sapply(vec, function(x) x^2)

## Best: a single vectorized operation
vec^2
Preallocation
## Bad: growing the vector forces a copy on every iteration
vec = c()
for (i in 1:100) {
  vec = c(vec, i)
}

## Better: preallocate, then fill in place
vec = numeric(100)
for (i in 1:100) {
  vec[i] = i
}
Pass by value
## Arguments are copied on modification: the caller's x is unchanged
square <- function(x) {
  x <- x^2
  return(x)
}

x <- 1:100
square(x)  # returns the squared values; x itself is not modified
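For the "Embrace C++" point, Rcpp lets you compile a small C++ function directly from R. A minimal sketch (sum_sq_cpp is an illustrative function, not part of rlme):

library(Rcpp)

## Compile a tiny C++ loop from R; for hot inner loops this can beat
## both the R loop and, occasionally, the vectorized form.
cppFunction('
double sum_sq_cpp(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) {
    total += x[i] * x[i];
  }
  return total;
}
')

sum_sq_cpp(rnorm(100))  # equivalent to sum(x^2) for a numeric vector x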
Benchmark
- Write several versions of a slow function
- Test them against each other
- Package: microbenchmark
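A minimal sketch of what such a comparison looks like with microbenchmark (the two square_* functions are illustrative):

library(microbenchmark)

## Two candidate implementations of the same operation
square_loop <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- x[i]^2
  out
}
square_vec <- function(x) x^2

## Run each expression repeatedly and compare the timing distributions
x <- rnorm(10000)
microbenchmark(square_loop(x), square_vec(x), times = 100)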
Test
- Regressions
- Unit testing
- Package: testthat
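A minimal testthat sketch to guard against regressions when rewriting a function (reusing the illustrative square_* functions from the benchmark sketch):

library(testthat)

test_that("rewritten version matches the original", {
  x <- rnorm(100)
  expect_equal(square_vec(x), square_loop(x))
})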
Section 2
Case Study: rlme
Identify
Over to R!
Rprof("profile")          # start the sampling profiler, write to file "profile"
fit.rlme = rlme(...)      # run the slow call under the profiler
Rprof(NULL)               # stop profiling
summaryRprof("profile")   # summarize time spent per function
Wilcoxon Tau Estimator
- Rank-based scale estimator of residuals
- Uses pairup (so already O(n²))
Wilcoxon Tau Estimator
Original:
dresd <- sort(abs(temp[, 1] - temp[, 2]))
dresd = dresd[(p + 1):choose(n, 2)]
What's wrong? Bad algorithm (the sort is at least O(n log n)), and the variable gets copied multiple times.
Updated with C++:
dresd = remove.k.smallest(dresd)
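The key observation is that only the p smallest differences need to be dropped, so a full sort is unnecessary; a selection step suffices. A rough R-level sketch of the idea (illustrative only; rlme's remove.k.smallest does this in C++, and its exact signature may differ):

## Rough R-level sketch: drop the k smallest values without fully sorting.
## Note: unlike the original code, the result is not returned in sorted order.
remove_k_smallest_r <- function(x, k) {
  cutoff <- sort(x, partial = k)[k]   # k-th smallest found by partial sorting
  larger <- x[x > cutoff]             # keep everything strictly above the cutoff
  ties <- length(x) - k - length(larger)
  c(rep(cutoff, ties), larger)        # keep ties at the cutoff so exactly k values are dropped
}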
Wilcoxon Tau Estimator
Test with 2,000 residuals: better!
Wilcoxon Tau
- But what about really huge inputs?
- Bootstrapping: when over 5,000 rows, repeat the estimate on 1,000 sampled points 100 times
- Not about speed, but about memory
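A rough sketch of the subsampling idea described above; the thresholds come from the slide, while the tau_estimate argument and the final averaging step are assumptions for illustration, not the package's exact code:

## Rough sketch: above a size threshold, estimate on repeated small random
## samples instead of the full residual vector, then combine the estimates.
## (tau_estimate stands in for the actual Wilcoxon tau computation.)
bootstrap_tau <- function(residuals, tau_estimate,
                          threshold = 5000, sample_size = 1000, reps = 100) {
  if (length(residuals) <= threshold) {
    return(tau_estimate(residuals))
  }
  estimates <- replicate(reps, tau_estimate(sample(residuals, sample_size)))
  mean(estimates)
}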
Pairup
- Pairup function: generates every possible pair from an input vector
- Some rank-based estimators require pairwise operations
- O(n²) complexity
Pairup
- Original version: vectorized (14 LOC)
- Loop version (12 LOC)
- "combn" version (core R function, 1 LOC)
- C++ version (12 LOC)
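A sketch of what the one-line combn-based version looks like (the actual rlme function may differ in argument order and output shape):

## Sketch of a combn-based pairup: all unordered pairs of elements of x,
## returned as a two-column matrix with one row per pair.
pairup_combn <- function(x) {
  t(combn(x, 2))
}

pairup_combn(1:4)  # 6 rows: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)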
Over to R!
Covariance Estimator
- n × n covariance matrix
- Change to preallocation
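A minimal sketch of the preallocation change, assuming the matrix is filled element by element (cov_entry is a hypothetical placeholder, not the package's actual computation):

## Minimal sketch: allocate the full n x n matrix once, then fill it in place
## instead of growing it with rbind()/cbind() inside the loop.
cov_entry <- function(i, j) as.numeric(i == j)  # hypothetical placeholder
n <- 100
sigma <- matrix(0, nrow = n, ncol = n)
for (i in 1:n) {
  for (j in 1:n) {
    sigma[i, j] <- cov_entry(i, j)
  }
}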
Summary
- Identify
- Rewrite
- Benchmark
- Test
Keeping Ahead
- Parallelism
  - Cluster: Rmpi, snow
  - GPU: rpud
  - Probably not Hadoop, maybe Apache Spark?
- Julia Language
- Hadley Wickham (plyr, ggplot, testthat, ...)
  - "Advanced R Programming"
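For the cluster direction, the snow-style API now lives in base R's parallel package; a small sketch (the cluster size and workload are illustrative):

library(parallel)

## Start two worker processes and farm out independent tasks
cl <- makeCluster(2)
results <- parLapply(cl, 1:8, function(i) {
  mean(rnorm(1e5))   # stand-in for an expensive, independent computation
})
stopCluster(cl)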
Questions?