Converting R Loops to Parallel Loops Using Slurm (The Easy Way)

So your research requires carrying out a task multiple times, and you’ve written an R loop to do it. Great! That saved you a lot of time writing code, but it doesn’t speed up running the code at all. But what if, instead of completing the first task, then the next, then the next, all of them could be run at once? That’s absolutely possible if you send them to the SSCC’s Slurm cluster.

This article assumes you know R well enough to write loops but don’t know a lot about Slurm; in fact, this article will cover everything you need to know about Slurm. Thus we won’t use advanced features of Slurm like job arrays, and will instead use R to manage submitting jobs to Slurm.

Example: Converting CSV Files to R Data Sets

Suppose you need to convert ten CSV files into RDS files. Assuming we have all of those in our working directory, we could do that with a for loop:

# find all files ending with ".csv"
myFiles <- list.files(pattern = "\\.csv$")

for (i in 1:length(myFiles)) {
  # import data set
  dat <- read.csv(myFiles[i])
  
  # change ".csv" to ".rds"
  new_filename <- gsub(".csv", ".rds", myFiles[i], fixed = T)
  
  # save file with new extension
  saveRDS(dat, new_filename)
}

If each file takes ten seconds to import, then the whole process will be done in one hundred seconds (one minute and forty seconds) and it’s probably not worth spending any more of your time on the problem. But if the files are enormous and take ten minutes to import, importing the ten files in parallel instead of one at a time would make this a ten minute job rather than a one hour and forty minute job.
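If you’re not sure how long one iteration takes, time a single import before deciding whether parallelizing is worth the trouble. Here is a quick sketch using system.time() (it simply times the first file in myFiles):

# time one import to estimate how long each worker will take
system.time(dat <- read.csv(myFiles[1]))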

Workers and Their Manager

To do so, you’ll split the loop into two R scripts. First, you need a worker script that imports just one file (it will also need a way to know which file it is to import). More generally, the worker script will carry out whatever task is inside your loop, but only once. Second, you need a manager script that carries out the loop, but instead of executing the code inside the loop itself, it will submit a new worker to Slurm each time it goes through the loop.

For this example, the worker script will look like the following:

filename <- commandArgs(trailingOnly = T)[1]
dat <- read.csv(filename)
new_filename <- gsub(".csv", ".rds", filename, fixed = T)
saveRDS(dat, new_filename)

The commandArgs() function gets arguments from the command line. The argument trailingOnly = T tells it to return only the arguments that come after the name of your script, ignoring the ones R itself uses. If you run:

Rscript worker.R 1 2 3

Then commandArgs(trailingOnly = T) will return the character vector c("1", "2", "3"). If we only want the first element, we can index our vector with [1]: commandArgs(trailingOnly = T)[1].

(Normally, we would recommend R CMD BATCH since it creates an output file with commands, output, and errors. However, we suggest Rscript in this case because Rscript’s argument parser can handle spaces within arguments, so we can import CSV files with spaces in their names.)

And that’s it!
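Before involving Slurm at all, you can test the worker from a Linux terminal by giving it a file name yourself (the name here is just an example; the quotes matter if the name contains spaces):

Rscript worker.R "data 1.csv"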

The manager script will look like the following:

myFiles <- list.files(pattern = "\\.csv$")

for (i in 1:length(myFiles)) {
  
  # create Slurm submission command
  myCommand <- paste0("ssubmit --cores=1 --mem=5g ",
                      "\"Rscript ",
                      "worker.R ",
                      "\\\"", myFiles[i], "\\\"",
                      "\"",
                      collapse = "")
  
  # submit job
  system(myCommand)
}

The system() function tells Linux to run a command. In this case, we run this command:

ssubmit --cores=1 --mem=5g "Rscript worker.R \"myFiles[i]\""

where

  • ssubmit submits a job to Slurm
  • --cores=1 tells Slurm each job needs just one core
  • --mem=5g tells Slurm each job needs 5GB of memory

and the part in double quotes is the command to be executed, Rscript worker.R \"myFiles[i]\", where

  • Rscript tells Linux to run an R script
  • worker.R is the name of our script
  • \"myFiles[i]\" is replaced with the name of the file we want to convert.
    • Note that myFiles[i] has escaped quotes (\") around it, which keep files with spaces in their names together (i.e., “data 1.csv” is a single argument, instead of two arguments, “data” and “1.csv”). In R, we had to write \\\" (an escaped backslash followed by an escaped quote) so that \" reaches the command line.
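If you want to see exactly what will be sent to Linux, you can print the command inside the loop instead of (or before) calling system(). For a file named “data 1.csv”, a quick check might look like this:

# print the assembled command to check the quoting
cat(myCommand, "\n")
# ssubmit --cores=1 --mem=5g "Rscript worker.R \"data 1.csv\""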

While the worker script should not do anything except the one task assigned to it, the manager can be just part of a larger script. However, ssubmit returns as soon as a job has been submitted, not when it finishes, so system() will not wait for the workers to complete before R runs the next command in the script. In this case, that means the script will run any code that comes after the loop before the converted data sets have been created. If your next step is to append the ten data sets into a single data set, you should put that in a separate script that you run after you know all the jobs that were submitted to Slurm are complete. If you want Slurm to send you an email when each job is complete, add --email=your_email_address to your ssubmit command, where your_email_address should be replaced by your actual email address.
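For example, a follow-up script to append the converted files might look like the sketch below. The rbind() call assumes all of the data sets have the same columns, and combined.rds is just a name we made up:

# append.R: run this only after all of the Slurm jobs have finished
rdsFiles <- list.files(pattern = "\\.rds$")
datasets <- lapply(rdsFiles, readRDS)
combined <- do.call(rbind, datasets)
saveRDS(combined, "combined.rds")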

If you’re thinking “It would be nice if the process knew when all the Slurm jobs were complete so it could start the next step automatically,” you’re right, and Slurm has tools for doing that. But that’s outside the scope of this article.

Worker Resources

Identifying the Computing Resources Used by a Linux Job talks about tools you can use to identify the computing resources your worker needs to run successfully. But here are some additional considerations for parallelizing a loop.

The Slurm cluster will run as many of your workers as it can, but if it runs out of resources workers will wait in the queue until resources become available. So the more resources you assign to each worker, the fewer workers it can run at once.
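You can see which of your jobs are running and which are still waiting in the queue with Slurm’s standard squeue command, run from a Linux terminal (replace your_username with your actual username):

squeue -u your_username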

Cores

Most of the Slurm servers have 44 cores and you are welcome to use all of them. However, having multiple cores work on the same task always involves some overhead. You should experiment, but if you have many tasks to carry out it’s likely that 44 tasks using one core each but running all at the same time will get work done faster than running one task at a time using 44 cores. That’s why in the example we only asked for one core.

Memory

R jobs normally need just a little more memory than the size of all the data sets they work with. Workers will crash if they run out of memory, so you can’t skimp here. But don’t use (much) more than you need. Most of the servers in the Slurm cluster have 384GB of memory and 44 cores, so about 8.7GB per core. If your workers need more memory than that per core, then memory will limit the number of workers Slurm can run at the same time rather than cores.
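One way to get a rough sense of how much memory a worker will need is to check the size of the imported data. Here is a quick sketch using object.size(); keep in mind that peak memory use during the import itself is usually somewhat higher:

# check how much memory one imported data set occupies
format(object.size(dat), units = "auto")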

Reading and Writing Files

Each Slurm server has just one network connection to the file server, and there’s just one file server. If your workers spend a lot of their time reading and writing files, running too many workers on one server could overwhelm that server’s network connection, and running too many total workers could overwhelm the file server itself. In that case, you’ll get better performance by reducing the number of workers you run at the same time.

More on Running Workers in Parallel

Running many workers at the same time can cause some complications.

Logs and Other Files

Scripts run with R CMD BATCH will have log files created by default, but scripts run with Rscript will not. If you need your workers to create log files for debugging purposes, use the sink() function and include a job identifier in the log name. In our example, that might be the name of the CSV we are converting, appended with .txt.

Add a few lines to the worker script to save output and errors:

filename <- commandArgs(trailingOnly = T)[1]

# set up log file as filename.csv.txt
logfile <- file(paste0(filename, ".txt"), open = "wt")
sink(logfile) # save output
sink(logfile, type = "message") # save errors
filename # print the name of the CSV file

dat <- read.csv(filename)
new_filename <- gsub(".csv", ".rds", filename, fixed = T)
saveRDS(dat, new_filename)

# close connections
sink(type = "message")
sink()
close(logfile)

After you’re done debugging, delete or comment out the lines that save the log files to avoid cluttering your working directory.

Random Seeds

If your worker does anything random (simulation, bootstrapping, multiple imputation, etc.) then you need to be careful about setting seeds.

Every worker should get the same seed every time it is run for reproducibility. However, different workers should never get the same seed. Also, you don’t want to reuse seeds across projects.

An easy way to accomplish all these goals is to set the seed equal to the iteration identifier times an arbitrary number that’s different for every project:

set.seed(798153*i)
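To see that this gives each worker its own reproducible stream of random numbers, you can try something like:

set.seed(798153 * 3)
rnorm(2)  # draws for worker 3
set.seed(798153 * 3)
rnorm(2)  # the same draws again
set.seed(798153 * 4)
rnorm(2)  # different draws for worker 4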

Earlier, we only passed myFiles[i] to our worker script. Now, we need to also give i as an additional argument. Our updated scripts would look like these:

Manager:

myFiles <- list.files(pattern = "\\.csv$")

for (i in 1:length(myFiles)) {
  
  myCommand <- paste0("ssubmit --cores=1 --mem=5g ",
                      "\"Rscript ",
                      "worker.R ",
                      "\\\"", myFiles[i], "\\\" ", i,
                      "\"",
                      collapse = "")
  
  system(myCommand)
}

Worker:

filename <- commandArgs(trailingOnly = T)[1]
i <- as.numeric(commandArgs(trailingOnly = T)[2])
set.seed(798153*i)
dat <- read.csv(filename)
new_filename <- gsub(".csv", ".rds", filename, fixed = T)
saveRDS(dat, new_filename)

(Recall that commandArgs() returns a character vector, so be sure to convert numbers to numeric values.)

Efficiency

Your worker script will be run many times (perhaps many, many times) so do not have it do anything that it doesn’t absolutely need to do.

When you first wrote your loop you may have loaded a data set, cleaned it up a bit, and then started the loop. The worker script will have to load the data it needs, but don’t have every worker repeat that data cleaning. Instead, do the data cleaning once and save the result as a data set the workers can use immediately. And if the data set contains twenty variables and the worker only needs to use five, consider only including those five in the data set the worker needs to load so it loads faster. It won’t matter much if you’re only going to run tens of workers, but if you will run tens of thousands of workers every little bit counts.
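For example, a one-time preparation script might look like the sketch below; the file names and variable names are made up for illustration:

# prepare.R: run once, before submitting any workers
full <- read.csv("raw_data.csv")              # made-up raw file name
full <- subset(full, !is.na(outcome))         # an example cleaning step
keep <- c("id", "outcome", "x1", "x2", "x3")  # only the variables the workers need
saveRDS(full[, keep], "analysis_data.rds")    # workers load this smaller file instead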

Reading and writing to disk is slow, so avoid it whenever possible.