Parallel computing with R
R course, Master 2 Statistics and Econometrics, TSE / Formation OMP
Thibault Laurent ([email protected])
Toulouse School of Economics, CNRS
October 24, 2019
1 General information
2 Master program
3 How to code a program in parallel?
4 Random Forest algorithm
5 Balancing the distribution of tasks
6 Reproducing the results: choice of the seed
Packages and software versions
> install.packages(c("parallel", "snow", "snowFT", "VGAM"),
dependencies = TRUE)
> require("parallel")
> require("snow")
> require("snowFT")
> require("VGAM")
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] splines stats4 parallel stats graphics grDevices
[7] utils datasets methods base
other attached packages:
[1] VGAM_1.1-1 snowFT_1.6-0 rlecuyer_0.3-4 snow_0.4-3
loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1
Principle of parallel computing
Initialize a vector of N seeds:
initialize.rng(...)
Repeat N times a function with a different seed for each simulation:
for (iteration in 1:N) {
result[iteration] <- myfunc(...)
}
Summarize the N results obtained:
process(result,...)
Sequential to parallel
Sequential: execute iteration 1, then iteration 2, etc., one after the other.
Parallel: split the N iterations into as many groups as there are cores available and execute the groups on the cores in parallel.
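This division can be sketched with the splitIndices() helper; this particular function is not used later in the course, but it is part of the parallel package and shows how N iterations are chunked over the cores:

```r
library("parallel")

# divide N = 100 iterations into 4 groups, one per core
groups <- splitIndices(100, 4)
length(groups)  # one element per core
groups[[1]]     # the iterations assigned to the first core
```

Each element of the returned list is the set of iteration indices that one core will execute.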
Example of algorithm: Random forest
Repeat N regression or classification trees with respect to the N samples:
How many cores are available on my machine?
What is the difference between a CPU and a core?
> library("parallel")
> detectCores(logical = FALSE) # number of cores
[1] 40
> detectCores() # number of logical cores
[1] 40
2 Master program
What is a master program?
Definition: the master program specifies the division of tasks and summarizes the parallelized results.
Example: in the previous algorithm, we have N processes to replicate. Suppose that N = 100 and that 4 cores are available. The master program must indicate how to divide the tasks across the cores. For example:
Core 1 will execute iterations 1, 5, 9, ..., 93, 97
Core 2 will execute iterations 2, 6, 10, ..., 94, 98
Core 3 will execute iterations 3, 7, 11, ..., 95, 99
Core 4 will execute iterations 4, 8, 12, ..., 96, 100
Once the tasks are computed, the master program summarizes the results obtained.
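The round-robin assignment above can be reproduced in base R; a minimal sketch with N = 100 iterations and P = 4 cores:

```r
N <- 100
P <- 4
# iteration i goes to core ((i - 1) mod P) + 1, i.e. round-robin
tasks <- split(1:N, rep(1:P, length.out = N))
tasks[["1"]]  # iterations executed by core 1: 1, 5, 9, ..., 97
tasks[["2"]]  # iterations executed by core 2: 2, 6, 10, ..., 98
```

This is the scheduling that clusterApply() uses, which is why each core above receives every 4th iteration.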
Summary of a master program
3 How to code a program in parallel?
Example
We consider the function myfun(), which computes the mean of a sample of size r simulated from a normal distribution with parameters mean and sd.
> myfun <- function(r, mean = 0, sd = 1) {
mean(rnorm(r, mean = mean, sd = sd))
}
We would like to repeat this function 100 times for different values of r:
25 times with r = 10
25 times with r = 1000
25 times with r = 100000
25 times with r = 10000000
with mean = 5 and sd = 10. The objective is to compare the standard deviation of the results with respect to the values of r.
Non-parallel case
In the non-parallel case we can use the function sapply(), whose 1st argument corresponds to the values of r to be evaluated by the function myfun():
> r_values <- rep(c(10, 1000, 100000, 10000000), each = 25)
> system.time(
res_non_par <- sapply(r_values, FUN = myfun,
mean = 5, sd = 10) # options of function myfun
)
Then compute the standard deviation of the results:
> tapply(res_non_par, r_values, sd)
Parallel case
In the parallel case, we first define the number of cores:
> P <- 4
> cl <- makeCluster(P) # allocate 4 cores
Use the function clusterApply(), whose 1st argument gives the cluster to use, the 2nd the values over which to iterate, and the 3rd the function to execute on the different cores:
> system.time(
res_par <- clusterApply(cl, r_values, fun = myfun, # evaluate myfun on r_values
mean = 5, sd = 10) # options of myfun
)
To finish, it is necessary to free the cores:
> stopCluster(cl)
Then compute the standard deviation of the results:
> tapply(unlist(res_par), r_values, sd)
Computational time with respect to cores
Remark 1: the elapsed time decreases with the number of cores, but not linearly.
Remark 2: in this example, using 10 of the 40 available cores is enough.
Remark 3: in parallel computing, the computational time depends both on the efficiency of the algorithm and on the quantity of information that flows between the master program and the cores.
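A curve of elapsed time versus number of cores can be produced by timing the same job for increasing values of P; a sketch (the core counts tried here are illustrative, adapt them to your machine):

```r
library("parallel")

r_values <- rep(c(10, 1000, 100000, 10000000), each = 25)
cores <- c(1, 2, 4, 8)

# elapsed time of the same parallel job for each number of cores
timings <- sapply(cores, function(P) {
  cl <- makeCluster(P)
  t <- system.time(
    clusterApply(cl, r_values, fun = function(r) mean(rnorm(r, 5, 10)))
  )["elapsed"]
  stopCluster(cl)
  t
})

plot(cores, timings, type = "b",
     xlab = "number of cores", ylab = "elapsed time (s)")
```

Note that each makeCluster() call itself has a fixed start-up cost, which is part of why the gain is not linear.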
How to call packages across the cores?
Parallel computing on B cores ⇐⇒ opening B new R sessions: packages need to be loaded in each core.
Solution 1: use library() or require() inside the function to parallelize:
> myfun_pareto <- function(r, scale = 1, shape = 10) {
library("VGAM")
mean(rpareto(r, scale = scale, shape = shape))
}
Solution 2: use the package::function syntax, which avoids loading all the functions of the package:
> myfun_pareto <- function(r, scale = 1, shape = 10) {
mean(VGAM::rpareto(r, scale = scale, shape = shape))
}
Application:
> r_values <- rep(c(10, 1000, 100000), each = 25)
> cl <- makeCluster(P)
> res_par <- clusterApply(cl, r_values, fun = myfun_pareto,
scale = 1, shape = 10)
> stopCluster(cl)
How to call packages across the cores?
Solution 3: indicate in the master program the code to be evaluated in each core, thanks to the function clusterEvalQ():
> myfun_pareto <- function(r, scale = 1, shape = 10) {
mean(rpareto(r, scale = scale, shape = shape))
}
> cl <- makeCluster(P)
> clusterEvalQ(cl, library("VGAM"))
> res_par <- clusterApply(cl, r_values, fun = myfun_pareto,
scale = 1, shape = 10)
> stopCluster(cl)
How to share R objects across the cores?
Here the function myfun_pareto() calls two objects which are not defined inside the function:
> myfun_pareto <- function(r) {
mean(rpareto(r, scale = scale, shape = shape))
}
As for the packages, the objects scale and shape can be defined in each core thanks to the function clusterEvalQ():
> cl <- makeCluster(P)
> clusterEvalQ(cl, {
library("VGAM")
scale <- 1
shape <- 10
})
> res_par <- clusterApply(cl, r_values, fun = myfun_pareto)
> stopCluster(cl)
How to share R objects across the cores?
They can also be defined in the master program and then exported to the different cores with clusterExport():
> scale <- 1
> shape <- 10
> cl <- makeCluster(P)
> clusterExport(cl, c("scale", "shape"))
> clusterEvalQ(cl, library("VGAM"))
> res_par <- clusterApply(cl, r_values, fun = myfun_pareto)
> stopCluster(cl)
lapply(), sapply(), apply(), mapply()
There exist parallelized versions of the lapply(), sapply(), apply() and mapply() functions:
parLapply(),
parSapply(),
parApply(),
clusterMap().
Non-parallel:
> res_non_par <- sapply(r_values, FUN = myfun,
mean = 5, sd = 10)
Parallel:
> P <- 4 # define the number of cores
> cl <- makeCluster(P) # reserve 4 cores - start of the computation
> system.time(
res_par <- parSapply(cl, r_values, FUN = myfun,
mean = 5, sd = 10)
)
> stopCluster(cl)
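For clusterMap(), the parallel counterpart of mapply(), the vectors of arguments are passed after the function and the fixed options go in MoreArgs; a short sketch reusing myfun from above (redefined here so the block is self-contained):

```r
library("parallel")

myfun <- function(r, mean = 0, sd = 1) mean(rnorm(r, mean = mean, sd = sd))
r_values <- rep(c(10, 1000, 100000), each = 25)

cl <- makeCluster(4)
# one call of myfun per element of r_values; mean and sd are fixed
res_map <- clusterMap(cl, myfun, r_values,
                      MoreArgs = list(mean = 5, sd = 10))
stopCluster(cl)
unlist(res_map)
```

Unlike parSapply(), clusterMap() accepts several vectors varying in parallel (one per argument of the function), like mapply().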
Other packages
See: https://rawgit.com/PPgp/useR2017public/master/tutorial.html
snowFT: useful for randomizing the seed,
foreach and doParallel: based on the for syntax.
> require("doParallel")
> P <- 4 # define the number of cores
> registerDoParallel(cores = P) # allocate the desired number of cores
> getDoParWorkers() # check that all the cores are indeed used
> system.time(
res_par_foreach <- foreach(i = r_values) %dopar%
myfun(i, mean = 5, sd = 10)
)
> # to go back to the initial number of cores
> registerDoParallel(cores = 1)
> # display the results
> unlist(res_par_foreach)
doMPI: parallel computing based on MPI.
4 Random Forest algorithm
RF algorithm
Input:
test sets: test sets
train sets: training sets
B: number of replications
m: number of variables to keep for splitting a node
Algorithm: repeat B times:
1 Draw a bootstrap sample from the training sets
2 Build a CART model where each split is chosen by minimizing the CART cost function among m variables chosen among the p initial variables. We denote the resulting tree h(., θk)
Output: estimate ĥ(x) = (1/B) ∑_{k=1}^{B} h(x, θk) on the test sets
Application
We consider the iris data; the objective is to predict the variable Species (coded into 3 different levels) with respect to the numeric variables (Sepal.Length, Sepal.Width, Petal.Length and Petal.Width). First, we build the training and the test sets; we choose to sample 25 observations for the test set.
> set.seed(1)
> id_pred <- sample(1:150, 25, replace = F)
> test_sets <- iris[id_pred, ]
> train_sets <- iris[-id_pred, ]
We choose m = 2 variables at each iteration. The possibilities:
> list_model <- list(
f1 = formula(Species ~ Sepal.Length + Sepal.Width),
f2 = formula(Species ~ Sepal.Length + Petal.Length),
f3 = formula(Species ~ Sepal.Length + Petal.Width),
f4 = formula(Species ~ Sepal.Width + Petal.Length),
f5 = formula(Species ~ Sepal.Width + Petal.Width),
f6 = formula(Species ~ Petal.Length + Petal.Width)
)
Basic function to replicate
We define the function to be replicated B times
> class_tree <- function(k) {
set.seed(k)
# sample the observations
train_sets_bootstrap <- train_sets[sample(1:125, replace = T), ]
# sample the variable
res_rf <- rpart(sample(list_model, 1)[[1]],
data = train_sets_bootstrap)
# prediction
max.col(predict(res_rf, newdata = test_sets))
}
Run the function once:
> require("rpart")
> class_tree(1)
Replications
We repeat the function 100 times using parallel computing:
> cl <- makeCluster(P)
> clusterExport(cl, c("test_sets", "train_sets", "list_model"))
> clusterEvalQ(cl, library("rpart"))
> res_par <- clusterApply(cl, 1:100, fun = class_tree)
> stopCluster(cl)
Summarize the results:
> tab_res_par <- sapply(res_par, function(x) x)
> prop_res_par <- apply(tab_res_par, 1, function(x)
c(length(which(x == 1)), length(which(x == 2)),
length(which(x == 3))))
> prop_res_par
Predictions
For each observation, the prediction is the most frequent predicted value across the B replicates:
> pred_vecteur <- apply(prop_res_par, 2,
function(x) which.max(x))
Confusion matrix:
> (tab_res <- table(pred_vecteur, test_sets$Species))
Percentage of good predictions:
> sum(diag(tab_res))/sum(tab_res)
5 Balancing the distribution of tasks
The problem
When splitting the tasks between cores, it can happen that some tasks take more time than others.
Example: the function rnmean() returns the mean of a Gaussian sample of size r, where r is the input argument.
> rnmean <- function(r, mean = 0, sd = 1) {
mean(rnorm(r, mean = mean, sd = sd))
}
We create heterogeneous values of r so that there is an imbalance between tasks with respect to the values of r:
> N <- 40
> set.seed(50)
> r.seq <- sample(ceiling(exp(seq(10, 14, length = 50))), N)
Applications on 4 cores
If we parallelize the tasks between 4 cores with respect to the vector r.seq, the division of tasks will be done like this:
core 1 will make the computations for the following values of r.seq: 903730 (1st position), 679133 (5th position), 12439 (9th position), etc.
core 2 will make the computations for the following values of r.seq: 4576 (2nd position), 1460 (6th position), 44995 (10th position), etc.
core 3 will make the computations for the following values of r.seq: 79676 (3rd position), 2981 (7th position), 19095 (11th position), etc.
core 4 will make the computations for the following values of r.seq: 1202605 (4th position), 9348 (8th position), 332464 (12th position), etc.
Computational time per core
> library("snow")
> cl <- makeCluster(P)
> ctime <- snow.time(clusterApply(cl, r.seq, fun = rnmean))
> plot(ctime, title = "Usage with clusterApply")
> stopCluster(cl)
[Figure: per-node usage over elapsed time, "Usage with clusterApply"]
Computational time per core (2)
When using the clusterApplyLB() function, tasks are assigned dynamically: each core receives the next task as soon as it has finished its current one, which balances the load between cores.
> cl <- makeCluster(P)
> ctimeLB <- snow.time(clusterApplyLB(cl, r.seq, fun = rnmean))
> plot(ctimeLB, title = "Usage with clusterApplyLB")
> stopCluster(cl)
[Figure: per-node usage over elapsed time, "Usage with clusterApplyLB"]
6 Reproducing the results: choice of the seed
The problem
How can we obtain the same results when working on different computers and at different dates?
Example: in the following code, each iteration generates a sample which cannot be reproduced:
> cl <- makeCluster(P)
> res_par <- parSapply(cl, 1:100, function(x) mean(rnorm(100)))
> stopCluster(cl)
Solution 1: Fix a seed inside the function
> rnmean <- function(x, r, mean = 0, sd = 1) {
set.seed(x)
mean(rnorm(r, mean = mean, sd = sd))
}
> cl <- makeCluster(P)
> res_par <- parSapply(cl, 1:100, rnmean, r = 100,
mean = 0, sd = 1)
> stopCluster(cl)
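Another option, not covered in these slides but built into the parallel package itself: clusterSetRNGStream() gives each core an independent L'Ecuyer-CMRG random-number stream derived from a single seed, so the function to parallelize does not need to be modified:

```r
library("parallel")

cl <- makeCluster(4)
# reproducible, independent RNG streams on every core from one seed
clusterSetRNGStream(cl, iseed = 123)
res_par <- parSapply(cl, 1:100, function(x) mean(rnorm(100)))
stopCluster(cl)
```

Note that the results are reproducible only when the same number of cores is used, since the mapping of tasks to streams depends on the cluster size.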
Package snowFT (1)
Solution 2: use the function performParallel() from the package snowFT, which derives the seeds of the different replicates from a single seed given in the master program:
> rnmean <- function(r, mean = 0, sd = 1) {
mean(rnorm(r, mean = mean, sd = sd))
}
> library("snowFT")
> seed <- 1
> r_values <- rep(c(10, 1000, 100000, 10000000), each = 10)
> res <- performParallel(P, r_values, fun = rnmean,
seed = seed)
> unlist(res)
Package snowFT (2)
When using package snowFT, the objects to export and the packages to load in the cores are given through the arguments export and initexpr. For example:
> myfun_pareto <- function(r) {
mean(rpareto(r, scale = scale, shape = shape))
}
> seed <- 1
> scale <- 1
> shape <- 10
> r_values <- rep(c(10, 1000, 100000, 10000000), each = 10)
> res <- performParallel(P, r_values, fun = myfun_pareto,
seed = seed,
initexpr = require("VGAM"),
export = c("scale", "shape"))
> unlist(res)