# GPU-accelerated machine learning
This guide explains the basics of using GPUs in CSC's supercomputers. It is part of our Machine learning guide.
## Puhti, Mahti or LUMI?
Puhti and Mahti are CSC's two national supercomputers. Of the two, Puhti has the larger number of GPUs (NVIDIA V100) and the wider selection of installed software, while Mahti has a smaller number of faster, newer-generation NVIDIA A100 GPUs. The CSC-hosted European supercomputer LUMI provides a massive GPU resource based on AMD GPUs.
The main GPU-related statistics are summarized in the table below.
|       | GPU type           | GPU memory  | GPU nodes | GPUs/node | Total GPUs    |
|-------|--------------------|-------------|-----------|-----------|---------------|
| Puhti | NVIDIA Volta V100  | 32 GB       | 80        | 4         | 320           |
| Mahti | NVIDIA Ampere A100 | 40 GB       | 24        | 4         | 96            |
| LUMI  | AMD MI250x         | 64 (128) GB | 2560      | 8 (4)     | 20480 (10240) |
**Note:** Each LUMI node has 4 MI250x GPUs; however, 8 GPUs per node are available through Slurm because each MI250x card contains two GPU dies (GCDs). The table above shows the per-die numbers, with the per-card numbers in parentheses.
Please read our usage policy for the GPU nodes. Also note that the Slurm queuing situation may vary between the supercomputers at different times, so it may be worth checking out all the options; for example, LUMI has a huge number of GPUs available and, as of summer 2023, very short queuing times.

Note that the supercomputers have distinct file systems, so you need to copy your files manually if you wish to change systems. If you are unsure which supercomputer to use, Puhti is a good default as it has the widest selection of supported software.
## Available machine learning software
We support a number of applications for GPU-accelerated machine learning on CSC's supercomputers, including TensorFlow and PyTorch. Please read the detailed instructions for the specific application that you are interested in.
You need to use the module system to load the application you want, for example:
```bash
module load tensorflow/2.12
```
Please note that our modules already include the CUDA and cuDNN libraries, so there is no need to load the `cuda` and `cudnn` modules separately!
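If you want to see which versions are installed, you can list the matching modules first. These are standard module system (Lmod) commands, and the module name here is just an example:

```bash
module avail tensorflow    # versions visible in the currently enabled module repositories
module spider tensorflow   # search all installed versions
```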
On LUMI you need to first enable the module repository for CSC's installations:
```bash
module use /appl/local/csc/modulefiles/
```
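After enabling the repository, modules are loaded in the usual way. For example, a sketch assuming a PyTorch module is available (check the exact names and versions with `module avail`):

```bash
module use /appl/local/csc/modulefiles/
module load pytorch
```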
Finally, on Puhti, we provide some special applications which are not shown by default in the module system. These have been made available due to user requests, but with limited support. You can enable them by running:
```bash
module use /appl/soft/ai/singularity/modulefiles/
```
## Installing your own software
In many cases, our existing modules provide the required framework, but some packages are missing. In this case you can often load the appropriate module and then install additional packages for personal use with the `pip` package manager.
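For example, a typical pattern is to load a module and then add a missing package on top of it with `pip install --user`, which installs it under your home directory. The package name below is just a placeholder:

```bash
module load pytorch
pip install --user some-missing-package   # installed under ~/.local by default
```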
For more complex software requirements, we recommend using tykky or creating your own Apptainer container.
## Running GPU jobs
To submit a GPU job to the Slurm workload manager, you need to use the `gpu` partition on Puhti, or `gpusmall` or `gpumedium` on Mahti, and specify the type and number of GPUs required using the `--gres` flag. On LUMI you need to use one of the GPU partitions, such as `dev-g`, `small-g` or `standard-g`.
Below are example batch scripts for reserving one GPU and a corresponding proportion of the CPU cores and memory of a single node:
**Puhti**

```bash
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=64G
#SBATCH --time=1:00:00
#SBATCH --gres=gpu:v100:1

srun python3 myprog.py <options>
```
**Mahti**

```bash
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=gpusmall
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --time=1:00:00
#SBATCH --gres=gpu:a100:1

srun python3 myprog.py <options>
```
**LUMI**

```bash
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=small-g
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-node=1
#SBATCH --mem=64G
#SBATCH --time=1:00:00

srun python3 myprog.py <options>
```
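Assuming the script has been saved as, for example, `run.sh` (the file name is arbitrary), it can be submitted and monitored with the usual Slurm commands:

```bash
sbatch run.sh   # submit the batch job
squeue --me     # check the state of your queued and running jobs
```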
Mahti's `gpusmall` partition supports only jobs with 1-2 GPUs. If you need more GPUs, use the `gpumedium` queue. You can read more about multi-GPU and multi-node jobs in our separate tutorial.
For more detailed information about the different partitions, see our page about the available batch job partitions on CSC's supercomputers and Slurm partitions on LUMI.
## GPU utilization
GPUs are a very expensive resource compared to CPUs, so they should be maximally utilized once they have been allocated. We provide some tools to monitor the utilization of GPU jobs on the different supercomputers. The GPU utilization should ideally be close to 100%. If your utilization is consistently low (for example, under 50%), it might be due to one of several reasons:
- You may have a processing bottleneck: for example, you may need to use a data loading framework (and reserve enough CPU cores for it) to be able to feed the GPU with data fast enough. See our documentation on using multiple CPU cores for data loading.
- Alternatively, it might simply be that the computational problem is "too small" for the GPU, for example if the neural network is relatively simple. This is not a problem as such, but if your utilization is really low, you might consider whether using CPUs would be more cost-efficient.
As always, don't hesitate to contact our service desk if you have any questions regarding GPU utilization.
### Tools for monitoring GPU utilization
#### `seff` command for a completed job (Puhti and Mahti)

The easiest way to check the GPU utilization of a completed job is to use the `seff` command:
```bash
seff <job_id>
```
In the example output below, we can see that the maximum utilization is 100% and the average 92%, which is a good level:
```
GPU load
    Hostname       GPU Id      Mean (%)    stdDev (%)       Max (%)
      r01g07            0         92.18         19.48           100
------------------------------------------------------------------------
GPU memory
    Hostname       GPU Id    Mean (GiB)  stdDev (GiB)     Max (GiB)
      r01g07            0         16.72          1.74         16.91
```
#### `nvidia-smi` for a running job (Puhti and Mahti)

While a job is running, you can execute `nvidia-smi` over `ssh` on the node where the job is running. You can check the node's hostname with the `squeue --me` command, whose output can look something like this:
```
   JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
17273947       gpu puhti-gp mvsjober  R   0:07      1 r01g06
```
You can see the node's hostname in the `NODELIST` column, in this case it is `r01g06`. You can now check the GPU utilization with the following command (replace `<nodename>` with the actual hostname of the node):
```bash
ssh <nodename> nvidia-smi
```
The output will look something like this:
```
Wed Jun 14 09:53:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   57C    P0   232W / 300W |   5222MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2312753      C   /appl/soft/ai/bin/python3        5219MiB |
+-----------------------------------------------------------------------------+
```
From this we can see that our process is using around 5 GB (out of 32 GB) of GPU memory, and the current GPU utilization is 100% (which is very good).
If you want a continually updating view:

```bash
ssh r01g06 -t watch nvidia-smi
```

This will update every 2 seconds; press Ctrl-C to exit.
#### `rocm-smi` for a running job (LUMI)

The LUMI supercomputer uses AMD GPUs, and hence the command is a bit different: `rocm-smi`. On LUMI you need to use `srun` to log in to a node where you have a running job:

```bash
srun --interactive --pty --jobid=<jobid> rocm-smi
```
Replace `<jobid>` with the actual Slurm job ID. You can also use `watch rocm-smi` to get a continually updated view.
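For example, a sketch combining the two (the `srun` options are the same as above):

```bash
srun --interactive --pty --jobid=<jobid> watch rocm-smi
```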
### Using multiple CPUs for data pre-processing
One common reason for low GPU utilization is that the CPU cannot load and pre-process the data fast enough, so the GPU has to wait for the next batch to process. It is then common practice to reserve more CPUs to perform data loading and pre-processing in several parallel threads or processes. A good rule of thumb on Puhti is to reserve 10 CPUs per GPU (as there are 4 GPUs and 40 CPUs on each node). On Mahti you can reserve up to 32 cores, as that corresponds to 1/4 of a node. On LUMI we recommend using 7 CPU cores, as there are 63 cores available for 8 GPUs. Remember that CPUs are a much cheaper resource than GPUs!
You might have noticed that we have already followed this advice in our example job scripts:
```bash
#SBATCH --cpus-per-task=10
```
Your code also has to support parallel pre-processing. However, most high-level machine learning frameworks support this out of the box. For example, in TensorFlow you can use `tf.data`, set `num_parallel_calls` to the number of CPUs reserved, and utilize `prefetch`:
```python
dataset = dataset.map(..., num_parallel_calls=10)
dataset = dataset.prefetch(buffer_size)
```
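If you prefer not to hard-code the number of parallel calls, recent TensorFlow 2.x versions can also tune it automatically. Below is a minimal, self-contained sketch; the dummy dataset and `preprocess_fn` are placeholders for your own data and preprocessing:

```python
import tensorflow as tf

def preprocess_fn(x):
    # Placeholder preprocessing step; replace with your own logic.
    return tf.cast(x, tf.float32) / 255.0

# Dummy data standing in for your real dataset.
dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([1000, 28, 28], dtype=tf.uint8))

# Let tf.data pick the level of parallelism and the prefetch buffer size.
dataset = dataset.map(preprocess_fn, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
```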
In PyTorch, you can use `torch.utils.data.DataLoader`, which supports data loading with multiple processes:

```python
train_loader = torch.utils.data.DataLoader(..., num_workers=10)
```
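A slightly fuller sketch is shown below; the tensors are dummy stand-ins for your own `Dataset`, and `pin_memory=True` is an optional extra that speeds up CPU-to-GPU transfers:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for your own Dataset implementation.
train_dataset = TensorDataset(torch.zeros(1000, 3, 224, 224),
                              torch.zeros(1000, dtype=torch.long))

train_loader = DataLoader(train_dataset,
                          batch_size=64,
                          shuffle=True,
                          num_workers=10,    # matches --cpus-per-task=10 on Puhti
                          pin_memory=True)   # faster host-to-GPU copies
```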
If you are using multiple data loader workers but data loading is still slow, it is also possible that you are using the shared file system inefficiently. A common error is to read a huge number of small files. You can read more about how to store and load data in the most efficient way for machine learning in our separate tutorial.
### Profilers
TensorFlow Profiler and PyTorch Profiler are available as TensorBoard plugins. The profilers can be found on the PROFILE and PYTORCH_PROFILER tabs in TensorBoard, respectively. Note that the tabs may not be visible by default, but can be found in the pull-down menu on the right-hand side of the interface. The profilers can be used to identify resource consumption and to resolve performance bottlenecks, in particular in the data input pipeline.
See also how to launch TensorBoard using the Puhti web interface.
The TensorFlow module `tensorflow/2.8` or later is required to use the profilers.
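As an illustration of how profiling data is typically produced for TensorBoard, the sketch below uses the standard Keras TensorBoard callback to profile a range of training batches. The tiny model, random data and log directory are placeholders, and this is just one possible way to enable the profiler:

```python
import tensorflow as tf

# Tiny stand-in model and random data; replace with your own.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x = tf.random.uniform([2048, 32])
y = tf.random.uniform([2048], maxval=10, dtype=tf.int32)

# Profile batches 10-20 and write the trace to logs/profile, which can then be
# opened in TensorBoard's PROFILE tab.
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs/profile",
                                             profile_batch=(10, 20))

model.fit(x, y, batch_size=32, epochs=1, callbacks=[tb_callback])
```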