GPU-accelerated machine learning

This guide explains the basics of using GPUs in CSC's supercomputers. It is part of our Machine learning guide.

Puhti, Mahti or LUMI?

Puhti and Mahti are CSC's two national supercomputers. Of the two, Puhti has the larger number of GPUs (NVIDIA V100) and offers the widest selection of installed software, while Mahti has a smaller number of faster, newer-generation NVIDIA A100 GPUs. The CSC-hosted European supercomputer LUMI provides a massive GPU resource based on AMD GPUs.

The main GPU-related statistics are summarized in the table below.

        GPU type             GPU memory    GPU nodes   GPUs/node   Total GPUs
Puhti   NVIDIA Volta V100    32 GB         80          4           320
Mahti   NVIDIA Ampere A100   40 GB         24          4           96
LUMI    AMD MI250x           64 (128) GB   2560        8 (4)       20480 (10240)

Note

Each LUMI node has 4 MI250x GPUs; however, 8 GPUs are available through Slurm because each MI250x card features 2 GPU dies (GCDs). The table above shows the GPU-die-specific numbers, with the MI250x-card-specific numbers in parentheses.

Please read our usage policy for the GPU nodes. Also note that the Slurm queuing situation may vary between the supercomputers at different times, so it may be worth checking out all the options. For example, LUMI has a huge number of GPUs available, and queuing times are very short (as of summer 2023).

Note that the supercomputers have distinct file systems, so you need to copy your files manually if you wish to switch systems. If you are unsure which supercomputer to use, Puhti is a good default as it has the widest selection of supported software.

Available machine learning software

We support a number of applications for GPU-accelerated machine learning on CSC's supercomputers, including TensorFlow and PyTorch. Please read the detailed instructions for the specific application that you are interested in.

You need to use the module system to load the application you want, for example:

module load tensorflow/2.12

Please note that our modules already include the CUDA and cuDNN libraries, so there is no need to load the cuda and cudnn modules separately!

On LUMI you need to first enable the module repository for CSC's installations:

module use /appl/local/csc/modulefiles/

Finally, on Puhti, we provide some special applications which are not shown by default in the module system. These have been made available due to user requests, but with limited support. You can enable them by running:

module use /appl/soft/ai/singularity/modulefiles/

Installing your own software

In many cases our existing modules provide the required framework but lack some specific packages. In that case, you can often load the appropriate module and then install the additional packages for personal use with the pip package manager.
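For example, a minimal sketch (the module name and package are placeholders; replace them with what you actually need):

module load pytorch
python3 -m pip install --user <package>

The --user flag installs the package under your home directory (~/.local) instead of the system-wide location.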

For more complex software requirements, we recommend using tykky or creating your own Apptainer container.
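With tykky, a containerized pip environment can be created roughly as follows. This is a sketch only; the installation prefix and requirements file are placeholders, so see the Tykky documentation for details:

module load tykky
pip-containerize new --prefix /projappl/<project>/my-env requirements.txt

After the installation, add the bin directory under the chosen prefix to your PATH so that the containerized environment is used.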

Running GPU jobs

To submit a GPU job to the Slurm workload manager, you need to use the gpu partition on Puhti, or the gpusmall or gpumedium partition on Mahti, and specify the type and number of GPUs required using the --gres flag.

On LUMI you need to use one of the GPU partitions, such as dev-g, small-g or standard-g, and request GPUs with the --gpus-per-node flag.

Below are example batch scripts for reserving one GPU and a corresponding proportion of the CPU cores and memory of a single node on each system.

Puhti:

#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=64G
#SBATCH --time=1:00:00
#SBATCH --gres=gpu:v100:1

srun python3 myprog.py <options>

Mahti:

#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=gpusmall
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --time=1:00:00
#SBATCH --gres=gpu:a100:1

srun python3 myprog.py <options>

LUMI:

#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=small-g
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-node=1
#SBATCH --mem=64G
#SBATCH --time=1:00:00

srun python3 myprog.py <options>

Mahti's gpusmall partition supports only jobs with 1-2 GPUs. If you need more GPUs, use the gpumedium queue. You can read more about multi-GPU and multi-node jobs in our separate tutorial.
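As a rough sketch only (the exact task and CPU counts here are assumptions and depend on how your program uses the GPUs; check the multi-GPU tutorial for recommended settings), a full-node four-GPU job on gpumedium could look like this:

#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=gpumedium
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:a100:4
#SBATCH --time=1:00:00

srun python3 myprog.py <options>

Here srun launches one task per GPU, which suits distributed training setups; if your program instead uses all four GPUs from a single process, request a single task.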

For more detailed information about the different partitions, see our page about the available batch job partitions on CSC's supercomputers and Slurm partitions on LUMI.

GPU utilization

GPUs are a very expensive resource compared to CPUs, so once a GPU has been allocated it should be utilized as fully as possible. We provide some tools for monitoring the utilization of GPU jobs on the different supercomputers. The GPU utilization should ideally be close to 100%. If your utilization is consistently low (for example, under 50%), there can be several reasons:

  • You may have a data processing bottleneck. For example, you should use a data loading framework (and reserve enough CPU cores for it) to be able to feed the GPU with data fast enough. See our documentation on using multiple CPU cores for data loading.

  • Alternatively, it might simply be that the computational problem is "too small" for the GPU, for example if the neural network is relatively simple. This is not a problem as such, but if your utilization is really low, you might consider whether using CPUs would be more cost-efficient.

As always, don't hesitate to contact our service desk if you have any questions regarding GPU utilization.

Tools for monitoring GPU utilization

seff command for a completed job (Puhti and Mahti)

The easiest way to check the GPU utilization on a completed job is to use the seff command:

seff <job_id>

In the example below, the maximum utilization is 100% and the average is 92%, which is a good level:

GPU load 
     Hostname        GPU Id      Mean (%)    stdDev (%)       Max (%) 
       r01g07             0         92.18         19.48           100 
------------------------------------------------------------------------
GPU memory 
     Hostname        GPU Id    Mean (GiB)  stdDev (GiB)     Max (GiB) 
       r01g07             0         16.72          1.74         16.91 

nvidia-smi for a running job (Puhti and Mahti)

When the job is running, you can check the GPU utilization by running nvidia-smi over ssh on the node where the job is executing. First, check the node's hostname with the squeue command:
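squeue --me

The output can look something like this: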

   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
17273947       gpu puhti-gp mvsjober  R       0:07      1 r01g06

The node's hostname is shown in the NODELIST column; in this case it is r01g06. You can now check the GPU utilization with the following command (replace <nodename> with the actual hostname of the node running your job):

ssh <nodename> nvidia-smi

The output will look something like this:

Wed Jun 14 09:53:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   57C    P0   232W / 300W |   5222MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2312753      C   /appl/soft/ai/bin/python3        5219MiB |
+-----------------------------------------------------------------------------+

From this we can see that our process is using around 5 GB (out of 32 GB) of GPU memory, and the current GPU utilization is 100% (which is very good).

If you want a continually updating view:

ssh r01g06 -t watch nvidia-smi

This will update every 2 seconds; press Ctrl-C to exit.

rocm-smi for a running job (LUMI)

The LUMI supercomputer uses AMD GPUs, so the command is slightly different: rocm-smi. On LUMI you need to use srun to run the command on a node where you have a running job:

srun --interactive --pty --jobid=<jobid> rocm-smi

Replace <jobid> with the actual Slurm job ID. You can also use watch rocm-smi to get a continually updated view, for example:
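srun --interactive --pty --jobid=<jobid> watch rocm-smi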

Using multiple CPUs for data pre-processing

One common reason for low GPU utilization is that the CPU cannot load and pre-process the data fast enough, and the GPU has to wait for the next batch to process. It is then common practice to reserve more CPUs to perform data loading and pre-processing in several parallel threads or processes. A good rule of thumb on Puhti is to reserve 10 CPUs per GPU (as there are 4 GPUs and 40 CPUs on each node). On Mahti you can reserve up to 32 cores per GPU, as that corresponds to 1/4 of the node. On LUMI we recommend using 7 CPU cores per GPU, as there are 63 cores for 8 GPUs. Remember that CPUs are a much cheaper resource than GPUs!

You might have noticed that we have already followed this advice in our example job scripts:

#SBATCH --cpus-per-task=10

Your code also has to support parallel pre-processing, but most high-level machine learning frameworks support this out of the box. For example, in TensorFlow you can use tf.data, set num_parallel_calls to the number of reserved CPU cores, and utilize prefetch:

dataset = dataset.map(..., num_parallel_calls=10)
dataset = dataset.prefetch(buffer_size)

In PyTorch, you can use torch.utils.data.DataLoader, which supports data loading with multiple processes:

train_loader = torch.utils.data.DataLoader(..., num_workers=10)

If you are using multiple data loading workers but data loading is still slow, it is also possible that you are using the shared file system inefficiently. A common error is to read a huge number of small files. You can read more about how to store and load data in the most efficient way for machine learning in our separate tutorial.

Profilers

TensorFlow Profiler and PyTorch Profiler are available as TensorBoard plugins. The profilers can be found on the PROFILE and PYTORCH_PROFILER tabs in TensorBoard, respectively. Note that the tabs may not be visible by default, but they can be found in the pull-down menu on the right-hand side of the interface. The profilers can be used to analyze resource consumption and to resolve performance bottlenecks, in particular in the data input pipeline.
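As an illustration, a trace for the PyTorch Profiler plugin can be produced roughly as follows. This is a minimal sketch with a toy model, and the log directory path is just an example; replace the loop body with your own training steps:

import torch
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

# Toy model and data used only for illustration.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device)
data = torch.randn(64, 128, device=device)

# Profile a few steps and write the trace into a directory that TensorBoard reads.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             schedule=schedule(wait=1, warmup=1, active=3),
             on_trace_ready=tensorboard_trace_handler("logs/profiler")) as prof:
    for _ in range(5):
        model(data)
        prof.step()  # signal the profiler that one step has finished

Pointing TensorBoard at the chosen log directory then shows the recorded trace under the PYTORCH_PROFILER tab.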

See also how to launch TensorBoard using the Puhti web interface. The TensorFlow module tensorflow/2.8 or later is required to use the profilers.