Integrating R with SLURM job submission
Sometimes working with R within an High-Performance Computing (HPC) environment is difficult. If you use an HPC or server that uses the Slurm Workload Manager (SLURM) system for job submission, I will present an alternative that helps me a lot when I need to submit some analysis to the job queue. Let’s create a situation here: imagine that you have 5 different single-cell data or 5 samples from different patients, and, starting from the raw count file or the files provided by 10X genomics, you want to pre-process the data using the Seurat standard pipeline.
First, we will have to create a file that here we will call sampleName.txt, which contains the names of the samples. Below is an example of this very simple file:
sample_01
sample_02
sample_03
sample_04
sample_05
Then a file called SeuratPreProcess.R should be created, which will contain the script with the standard Seurat commands. An example of the script is below:
#!/usr/bin/env Rscript
arguments = commandArgs(trailingOnly=TRUE)
options(future.globals.maxSize = 20 * 1024 ^ 3)
plan("multiprocess", workers = 6)
sample <- as.character(arguments)
# Loading packages
library(Seurat)
library(future)
# Check if the input is a csv file or a 10x directory
test <- dir.exists(file.path("/path/to/datasets", sample))
if (test) {
data <- CreateSeuratObject(counts = Read10X(data.dir = file.path("/path/to/datasets", sample)),
project = sample,
min.cells = 3,
min.features = 200)
} else {
tmp <- list.files("/path/to/datasets", full.names = T)
data <- CreateSeuratObject(counts = read.table(grep(sample, tmp, value = T), header = T, row.names = 1),
project = sample,
min.cells = 3,
min.features = 200)
}
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes
dir.create(file.path("/path/to/output", sample))
data[["percent.mt"]] <- PercentageFeatureSet(data, pattern = "^MT-")
data <- subset(data, subset = nFeature_RNA < 3000 & percent.mt < 10)
data <- NormalizeData(data, verbose = F)
data <- FindVariableFeatures(data, verbose = F)
data <- CellCycleScoring(data, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)
data <- ScaleData(data, vars.to.regress = c("nCount_RNA", "percent.mt", "S.Score", "G2M.Score"))
data <- RunPCA(data, verbose = F)
data <- FindNeighbors(data)
data <- FindClusters(data)
data <- RunTSNE(data, perplexity = 30, dims = 1:30)
data <- RunUMAP(data, dims = 1:30)
# Save seurat object
saveRDS(data, file = file.path("/path/to/output", sample, paste0(sample, ".rds")))
Here I draw your attention to some details of the script:
- The first line is the shebang line, and the second one indicates that the script will receive the sample name as arguments as input.
#!/usr/bin/env Rscript
arguments = commandArgs(trailingOnly=TRUE)
- Changes the /path/to/output variable pointing to the directory that contains the sample folders.
The next step is to create a script named submitJob_SeuratPreProcess.R. This script has the function of creating the btc file for the job submission.
library(optparse)
library(digest)
option_list = list(
make_option(c("-f", "--file"), type="character", default=NULL,
help="sample names file [default= %default]", metavar="character"),
make_option(c("-p", "--script"), type="character", default=NULL,
help="path to script file [default= %default]", metavar="character"),
make_option(c("-c", "--cpu"), type="numeric", default=NULL,
help="number of cpus per task [default= %default]", metavar="numeric"),
make_option(c("-m", "--mem"), type="character", default=NULL,
help="number of minimum amount of real memory [default= %default]", metavar="character"),
make_option(c("-t", "--time"), type="numeric", default=NULL,
help="time limit in minutes [default= %default]", metavar="numeric")
);
opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser);
if (is.null(opt$file)){
print_help(opt_parser)
stop("Input a sample names file.", call.=FALSE)
} else if (is.null(opt$script)){
print_help(opt_parser)
stop("Input a path to a script file.", call.=FALSE)
} else if (is.null(opt$cpu)) {
print_help(opt_parser)
stop("Input the number of cpus per task.", call.=FALSE)
} else if (is.null(opt$mem)) {
print_help(opt_parser)
stop("Input the number of minimum amount of real memory.", call.=FALSE)
} else if (is.null(opt$time)) {
print_help(opt_parser)
stop("time limit in minutes.", call.=FALSE)
}
df <- readLines(opt$file)
jobname <- paste0("tmp_", digest(Sys.time(), algo = "xxhash32"))
for (i in seq_along(df)) {
sink(paste0("/scratch/biagi/", jobname, "_", i, ".btc"))
cat("#! /bin/bash -l", "\n")
cat(paste0("#SBATCH --job-name ", paste0(jobname, "_", i)), "\n")
cat(paste0("#SBATCH --error ", paste0("/scratch/biagi/", jobname, "_", i, ".err")), "\n")
cat(paste0("#SBATCH --output ", paste0("/scratch/biagi/", jobname, "_", i, ".out"), "\n"))
cat(paste0("#SBATCH --cpus-per-task=", opt$cpu), "\n")
cat(paste0("#SBATCH --mem=", opt$mem), "\n")
cat(paste0("#SBATCH --time=", opt$time), "\n")
cat("\n")
cat(paste0("R --vanilla -f ", opt$script, " --args ", df[i]), "\n")
cat("\n")
cat("echo [$(date)] Starting execution of sample", "\n")
sink()
system(paste0("sbatch /scratch/biagi/", jobname, "_", i, ".btc"))
}
The above script has the following options:
- -f or –file: sample name file (for example the previously created file named sampleName.txt);
- -p or –script: path to script file (for example the previously created file named SeuratPreProcess.R);
- -c or –cpu: number of cpus per task to be used in HPC/server;
- -m or –mem: number of minimum amount of real memory to be used in HPC/server;
- -t ot –time: time limit in minutes to run the script in HPC/server.
The script will create an random btc and logs files name (xxhash32 algorithm) to save in any chosen directory. I prefer to save in the server’s scratch folder, because after a certain time these files are deleted. Remembering that the options in the btc file can be modified in the script, just follow the same pattern as the existing commands.
Finally, we will gather all the scripts and information generated above, to submit our jobs. Running this command, 5 jobs from the 5 samples of the different patients will be submitted to the queue, and as soon as they have resources they will be run. Submission is very simple, and can be done using the following command:
Rscript --vanilla /path/to/folder/submitJob_SeuratPreProcess.R \
--file /projects/cangen/coliveir/sampleName.txt \
--script /projects/cangen/coliveir/scripts/SeuratPreProcess.R \
--cpu 24 \
--mem 120G \
--time 720
Remembering again that this is an application that can be easily customized for any script in R. Each step can be modified according to your needs and the available resources. I hope it may have helped you to understand this integration between R scripts and job submission. It really helps me daily in the analysis and I found it interesting to share. Any questions, comments, suggestions, criticisms, etc., feel free!