Computational Techniques for the Statistical Analysis of Big Data in R
A Case Study of the rlme Package
Herb Susmann, Yusuf Bilgic
April 12, 2014
Workflow: Identify, Rewrite, Benchmark, Test
Case Study: rlme (Identify, Wilcoxon Tau Estimator, Pairup, Covariance Estimator)
Summary
Keeping Ahead
Motivation
- Case study: rlme package
- Rank-based regression and estimation of two- and three-level nested effects models
- Goals: faster, less memory, more data
- Before: 5,000 rows of data
- After: 50,000 rows of data
Section 1
Workflow
Workflow
- Identify
- Rewrite
- Benchmark
- Test
Identify
- Know your big O! (O(n²) memory usage? Probably not so good for big data)
- Look for error messages
- Profiling with Rprof
Rewrite
High level design
- Algorithm design
- Statistical techniques: bootstrapping
Rewrite
Microbenchmarking
- Know what R is good at
- Avoid loops in favor of vectorization
- Preallocation
- Arguments are passed by value, not by reference
- Embrace C++
Be careful!
Vectorizing
## Bad: explicit loop with element-wise assignment
vec = 1:100
for (i in 1:length(vec)) {
  vec[i] = vec[i]^2
}

## Better: apply a function over the vector
sapply(vec, function(x) x^2)

## Best: a single vectorized operation
vec^2
Preallocation
## Bad: growing the vector forces a copy on every iteration
vec = c()
for (i in 1:100) {
  vec = c(vec, i)
}

## Better: preallocate, then fill in place
vec = numeric(100)
for (i in 1:100) {
  vec[i] = i
}
Pass by value
## Arguments are copied on modification: the caller's x is unchanged
square <- function(x) {
  x <- x^2
  return(x)
}

x <- 1:100
square(x)  # returns the squared values; x itself is not modified
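For the "Embrace C++" point, Rcpp lets you compile a small C++ function directly from R. A minimal sketch (sum_sq_cpp is an illustrative function, not part of rlme):

library(Rcpp)

## Compile a tiny C++ loop from R; for hot inner loops this can beat
## both the R loop and, occasionally, the vectorized form.
cppFunction('
double sum_sq_cpp(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) {
    total += x[i] * x[i];
  }
  return total;
}
')

sum_sq_cpp(rnorm(100))  # equivalent to sum(x^2) for a numeric vector x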
Benchmark
- Write several versions of a slow function
- Test them against each other
- Package: microbenchmark
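A minimal sketch of what such a comparison looks like with microbenchmark (the two square_* functions are illustrative):

library(microbenchmark)

## Two candidate implementations of the same operation
square_loop <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- x[i]^2
  out
}
square_vec <- function(x) x^2

## Run each expression repeatedly and compare the timing distributions
x <- rnorm(10000)
microbenchmark(square_loop(x), square_vec(x), times = 100)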
Test
- Regressions
- Unit testing
- Package: testthat
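A minimal testthat sketch to guard against regressions when rewriting a function (reusing the illustrative square_* functions from the benchmark sketch):

library(testthat)

test_that("rewritten version matches the original", {
  x <- rnorm(100)
  expect_equal(square_vec(x), square_loop(x))
})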
Section 2
Case Study: rlme
Identify
Over to R!
Rprof("profile")          # start the sampling profiler, write to file "profile"
fit.rlme = rlme(...)      # run the slow call under the profiler
Rprof(NULL)               # stop profiling
summaryRprof("profile")   # summarize time spent per function
Wilcoxon Tau Estimator
- Rank-based scale estimator of residuals
- Uses pairup (so already O(n²))
Wilcoxon Tau Estimator
Original:
dresd <- sort(abs(temp[, 1] - temp[, 2]))
dresd = dresd[(p + 1):choose(n, 2)]
What's wrong? Bad algorithm (the sort is at least O(n log n)), and the variable gets copied multiple times.
Updated with C++:
dresd = remove.k.smallest(dresd)
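The key observation is that only the p smallest differences need to be dropped, so a full sort is unnecessary; a selection step suffices. A rough R-level sketch of the idea (illustrative only; rlme's remove.k.smallest does this in C++, and its exact signature may differ):

## Rough R-level sketch: drop the k smallest values without fully sorting.
## Note: unlike the original code, the result is not returned in sorted order.
remove_k_smallest_r <- function(x, k) {
  cutoff <- sort(x, partial = k)[k]   # k-th smallest found by partial sorting
  larger <- x[x > cutoff]             # keep everything strictly above the cutoff
  ties <- length(x) - k - length(larger)
  c(rep(cutoff, ties), larger)        # keep ties at the cutoff so exactly k values are dropped
}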
Wilcoxon Tau Estimator
Test with 2,000 residuals: better!
Wilcoxon Tau
- But what about really huge inputs?
- Bootstrapping: when over 5,000 rows, repeat the estimate on 1,000 sampled points 100 times
- Not about speed, but about memory
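A rough sketch of the subsampling idea described above; the thresholds come from the slide, while the tau_estimate argument and the final averaging step are assumptions for illustration, not the package's exact code:

## Rough sketch: above a size threshold, estimate on repeated small random
## samples instead of the full residual vector, then combine the estimates.
## (tau_estimate stands in for the actual Wilcoxon tau computation.)
bootstrap_tau <- function(residuals, tau_estimate,
                          threshold = 5000, sample_size = 1000, reps = 100) {
  if (length(residuals) <= threshold) {
    return(tau_estimate(residuals))
  }
  estimates <- replicate(reps, tau_estimate(sample(residuals, sample_size)))
  mean(estimates)
}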
Pairup
- Pairup function: generates every possible pair from an input vector
- Some rank-based estimators require pairwise operations
- O(n²) complexity
Pairup
- Original version: vectorized (14 LOC)
- Loop version (12 LOC)
- "combn" version (core R function, 1 LOC)
- C++ version (12 LOC)
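A sketch of what the one-line combn-based version looks like (the actual rlme function may differ in argument order and output shape):

## Sketch of a combn-based pairup: all unordered pairs of elements of x,
## returned as a two-column matrix with one row per pair.
pairup_combn <- function(x) {
  t(combn(x, 2))
}

pairup_combn(1:4)  # 6 rows: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)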
Over to R!
Covariance Estimator
- n × n covariance matrix
- Change to preallocation
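A minimal sketch of the preallocation change, assuming the matrix is filled element by element (cov_entry is a hypothetical placeholder, not the package's actual computation):

## Minimal sketch: allocate the full n x n matrix once, then fill it in place
## instead of growing it with rbind()/cbind() inside the loop.
cov_entry <- function(i, j) as.numeric(i == j)  # hypothetical placeholder
n <- 100
sigma <- matrix(0, nrow = n, ncol = n)
for (i in 1:n) {
  for (j in 1:n) {
    sigma[i, j] <- cov_entry(i, j)
  }
}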
Summary
- Identify
- Rewrite
- Benchmark
- Test
Keeping Ahead
- Parallelism
  - Cluster: Rmpi, snow
  - GPU: rpud
  - Probably not Hadoop, maybe Apache Spark?
- Julia Language
- Hadley Wickham (plyr, ggplot, testthat, ...)
  - "Advanced R Programming"
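For the cluster direction, the snow-style API now lives in base R's parallel package; a small sketch (the cluster size and workload are illustrative):

library(parallel)

## Start two worker processes and farm out independent tasks
cl <- makeCluster(2)
results <- parLapply(cl, 1:8, function(i) {
  mean(rnorm(1e5))   # stand-in for an expensive, independent computation
})
stopCluster(cl)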
Questions?