High-throughput computing and workflows
High-throughput computing (HTC) refers to running a large number of jobs, typically enabled by automation, scripting and workflow managers. Workflow automation saves human time and reduces manual errors. However, workflows tend to be very specific, and you will seldom find a method that works out of the box for a particular application.
This page introduces some critical aspects to consider when designing high-throughput workflows and helps you narrow down the right set of tools for your use case. By carefully selecting the most appropriate technology stack, your jobs will spend less time idling in the queue, your IO operations will be more efficient, and the performance of the whole HPC system will remain stable and fast for all users.
General guidelines
Running and managing high-throughput jobs
Does your workflow involve running a substantial number of (short) batch jobs? This is a characteristic feature of high-throughput computing, frequently referred to as "task farming". However, it poses problems for batch job schedulers such as Slurm used on HPC systems. A large number of jobs (launched with `sbatch`) and job steps (launched with `srun`) generates excess log data and slows down Slurm. Short jobs also have a large scheduling overhead, meaning that an increasing fraction of the time is actually spent in the queue instead of computing.
To enable high-throughput computing while avoiding the above issues, jobs and job steps should be packed so that they can be executed with minimal invocations of `sbatch` and `srun`. The first and best option is to check whether the software you are using comes with a built-in option for farming-type workloads. This applies to applications such as CP2K, GROMACS, LAMMPS, Python and R.
If integrated support for farming-type workloads is unavailable in your software, another option is to use external tools such as HyperQueue or GNU Parallel. Be aware that some tools, for example FireWorks, still create a lot of job steps, even though they allow you to conveniently pack your possibly interdependent subtasks into a single batch job.
Note

You do not need to issue `srun` if you intend to run serial jobs as a part of your HTC workflow. A lot of job steps can be avoided just by dropping unnecessary calls of `srun`.
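For example, packing many short serial tasks into a single batch job with GNU Parallel could look roughly like the sketch below. The partition, resource requests, module name and the task list file `tasks.txt` are placeholders to adapt to your system and use case; note that the serial program is started directly, without `srun`.

```bash
#!/bin/bash
#SBATCH --partition=small        # placeholder, choose a suitable partition
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40       # number of tasks to run concurrently
#SBATCH --time=01:00:00

module load parallel             # assumption: GNU Parallel provided as a module

# Run one serial task per line of tasks.txt (hypothetical file listing inputs),
# at most $SLURM_CPUS_PER_TASK tasks at a time. No srun is used, so no extra
# job steps are created.
parallel -j "$SLURM_CPUS_PER_TASK" -a tasks.txt ./my_serial_program {}
```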
You can use the flowchart below to narrow down the most appropriate technologies for your high-throughput computing workflow. Please note that this is not a complete list and other tools may also work for your use case. These tools usually work fine in HTC use cases with around 100 subtasks (or even more if each subtask uses at most one node, see HyperQueue). If your workflow contains hundreds or thousands of multi-node subtasks, please contact CSC Service Desk, as this may require special solutions. However, don't hesitate to contact us also if you have any other concerns about how to implement your workflow.
%%{init: {'theme': 'default', 'themeVariables': { 'fontSize': '0.6rem'}}}%%
graph TD
C(Does your software have a built-in HTC option?) -->|Yes| D("Use if suitable for use case:<br><a href='/support/tutorials/gromacs-throughput/'>GROMACS</a>, <a href='/apps/cp2k/#high-throughput-computing-with-cp2k'>CP2K</a>, <a href='/apps/lammps/#high-throughput-computing-with-lammps'>LAMMPS</a>, <a href='/apps/amber/#high-throughput-computing-with-amber'>Amber</a>,<br> Python, R ")
C -->|No| E(Serial or parallel subtasks?)
E -->|Serial| F(<a href='/support/tutorials/many/'>GNU Parallel</a><br><a href='/computing/running/array-jobs/'>Array jobs</a><br><a href='/apps/hyperqueue/'>HyperQueue</a>)
E -->|Parallel| G(Single- or multinode subtasks?)
G -->|Single| H(Dependencies between subtasks?)
G -->|Multi| I(<a href='/computing/running/fireworks/'>FireWorks</a>)
H -->|Yes| J(<a href='https://snakemake.readthedocs.io/en/stable/'>Snakemake</a><br><a href='/support/tutorials/nextflow-puhti/'>Nextflow</a><br><a href='/computing/running/fireworks/'>FireWorks</a>)
H -->|No| K(<a href='/apps/hyperqueue/'>HyperQueue</a>)
A qualitative overview of the features and capabilities of some of the workflow tools recommended by CSC is presented below.
| | Nextflow | Snakemake | HyperQueue | FireWorks | Array jobs | GNU Parallel |
|---|---|---|---|---|---|---|
| No excessive IO | | | | | | |
| Packs jobs/job steps | | | | | | |
| Easy to set up | | | | | | |
| Dependency support | | | | | | |
| Container support | | | | | | |
| Error recovery | | | | | | |
| Parallelization support | | | | | | |
| Slurm integration | | | | | | |
Input/output efficiency
Often when running many parallel jobs, the problem of input/output (IO) efficiency arises. If your high-throughput workflow performs a lot of IO operations (reading and writing files), you should pay special attention to where these operations are performed. CSC supercomputers use Lustre as the parallel distributed file system. Lustre is designed for efficient parallel IO of large files, but when dealing with many small files, IO quickly becomes a bottleneck. Importantly, intensive IO operations risk degrading the file system performance for all users and should thus be moved away from Lustre.
If you need to read and write thousands of files in your HTC workflow, please use:

- Fast local NVMe disk on Puhti and on Mahti GPU nodes (see the sketch after this list)
- Ramdisk (`/dev/shm`) on Mahti CPU nodes
- If your application can be run as a Singularity container, another good option is to mount your datasets with SquashFS. Creating a SquashFS image from your dataset, possibly composed of thousands of files, reduces it to a single file from the point of view of Lustre, while mounting the image in your Singularity execution makes it appear as an ordinary directory inside the container.
- If you have to use Lustre for IO-heavy tasks, make sure to leverage file striping
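As an illustration of the first option, here is a minimal sketch of staging data to the node-local NVMe disk on Puhti. The partition, resource requests, paths and program names are placeholders, and `<project>` stands for your own project directory; see the linked disk documentation for the exact way to request local storage on each system.

```bash
#!/bin/bash
#SBATCH --partition=small        # placeholder, choose a suitable partition
#SBATCH --cpus-per-task=4
#SBATCH --time=02:00:00
#SBATCH --gres=nvme:100          # request 100 GB of node-local NVMe disk (Puhti syntax)

# Stage the input data to the fast local disk, run the IO-heavy work there
# and copy only the final results back to Lustre.
cp /scratch/<project>/input.tar "$LOCAL_SCRATCH"
cd "$LOCAL_SCRATCH"
tar xf input.tar
./my_io_heavy_program input/
cp -r results /scratch/<project>/results
```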
Whether or not you are running HTC workflows, another important aspect affecting IO efficiency is how your application is installed. CSC has deprecated the direct use of Conda environments due to the huge number of files they bring along. A large fraction of these files is read each time a Conda application is run, causing excessive load on Lustre and system-wide slowdowns. Conda environments and other applications that read thousands of files at start-up should therefore be containerized. To make this easy, the container wrapper tool Tykky has been made available; a brief usage sketch is shown below.
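For example, a Conda environment defined in an `env.yml` file (hypothetical name) can be containerized roughly as follows; see the Tykky documentation for the exact module, command and path names on each system.

```bash
# Load the Tykky container wrapper (module name as in CSC documentation)
module load tykky

# Build a containerized Conda environment under a chosen installation
# directory; env.yml and <project> are placeholders for your own
# environment file and project directory.
conda-containerize new --prefix /projappl/<project>/my-env env.yml

# Add the wrapper's bin directory to PATH to use the environment's commands.
export PATH="/projappl/<project>/my-env/bin:$PATH"
```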
Further details on how to work efficiently with Lustre are documented here. Please also consider the flowchart below as a guideline for selecting the most appropriate technologies for your IO-intensive workflows.
%%{init: {'theme': 'default', 'themeVariables': { 'fontSize': '0.6rem'}}}%%
graph TD
A(Is your application containerized?) -->|Yes| B(<a href='/computing/containers/run-existing/#mounting-datasets-with-squashfs'>Mount your dataset with SquashFS</a>)
A -->|No| C(Are you running a Conda/pip environment?)
C -->|Yes| D(<a href='/computing/containers/tykky/'>Containerize it with Tykky</a>)
D --> B
C -->|No| E(Are you running on Puhti or Mahti?)
E -->|Mahti| F(CPU or GPU job?)
F -->|CPU| G(<a href='/computing/disk/#compute-nodes-without-local-ssd-nvme-disks'>Use ramdisk</a>)
F -->|GPU| H(<a href='/computing/disk/#compute-nodes-with-local-ssd-nvme-disks'>Use fast local NVMe disk</a>)
E -->|Puhti| H
B -->|No| E
B -->|Yes| I(Done)
I -.-> E
G -->|No, I have to use Lustre| J(<a href='/computing/lustre/#file-striping-and-alignment'>Use file striping</a>)
H -->|No, I have to use Lustre| J
Note
Please do not reserve GPU nodes just to utilize the node's NVMe disk. To run on GPUs, your code must be GPU-enabled and benefit from using the resources, see usage policy. Remember that the CPU nodes of Puhti also have NVMe disks. If you have questions about your particular workflow, please contact CSC Service Desk.
More information on workflows and efficient IO
General tools that run multiple jobs with one script
- Array jobs are a native Slurm tool to submit several independent jobs with one command (a minimal sketch is shown after this list)
- The GNU Parallel tutorial shows how to efficiently run a very large number of serial jobs without bloating the Slurm log. You can also replace GNU Parallel with `xargs`, see xargsjob.sh for an example.
- FireWorks is a flexible tool for defining, managing and executing workflows with multiple steps and complex dependencies
- HyperQueue is a tool for efficient sub-node task scheduling
- Nextflow workflows using HyperQueue as an executor can be leveraged to run large workflows involving thousands of processes efficiently
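As a concrete reference for the array job item above, here is a minimal sketch; the partition, resource requests and input file names are placeholders you would adapt to your own case. Keep in mind that each array task is still a separate Slurm job, so this approach is best suited for subtasks that are not very short.

```bash
#!/bin/bash
#SBATCH --partition=small        # placeholder, choose a suitable partition
#SBATCH --array=1-100            # 100 independent subtasks
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00

# Each array task processes its own input file, selected by the task ID.
# input_1.txt ... input_100.txt are hypothetical file names.
./my_serial_program "input_${SLURM_ARRAY_TASK_ID}.txt"
```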
Science specific workflow tools and tutorials
- Nextflow singularity container-based bioinformatics pipelines on Puhti
- Data storage guide for machine learning explains where to work with ML data and how to use the shared file system efficiently
- Farming Gaussian jobs with HyperQueue
Workflow tools integrated into common simulation software
The following built-in tools allow running multiple simulations in parallel within a single Slurm job step. If you're using any of the applications below, please consider these as the first option for implementing your high-throughput workflows.
- GROMACS multidir option (see the sketch after this list)
- FARMING mode of CP2K (supports dependencies between subjobs)
- LAMMPS multi-partition switch
- Amber multi-pmemd
- Python:
- R:
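As an example of such a built-in option, the sketch below runs four independent GROMACS simulations side by side within a single `srun` call using the `-multidir` option. The partition, resource requests and module name are placeholders; each listed directory is assumed to contain its own input (`topol.tpr`).

```bash
#!/bin/bash
#SBATCH --partition=medium       # placeholder, choose a suitable partition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --time=01:00:00

module load gromacs              # assumption: module name varies by system

# One srun launches all four simulations; GROMACS divides the MPI ranks
# evenly between the listed directories.
srun gmx_mpi mdrun -multidir run1 run2 run3 run4
```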