Python
The Python programming language on CSC's supercomputers Puhti and Mahti.
Available
- Puhti: 3.x versions
- Mahti: 3.x versions
The basic system Python (`/usr/bin/python3`) available by default on both Puhti and Mahti (without loading any modules) is Python version 3.6.8. This can be launched simply with the command `python3`, but this environment contains only a basic set of standard Python packages. You can install additional packages yourself with the pip command; see the section below explaining how to install packages to existing modules.
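For instance, before loading any modules you can confirm that you are using the system Python (a quick check; the expected output is shown as comments):
which python3       # /usr/bin/python3
python3 --version   # Python 3.6.8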
Using science area-specific Python modules
If you need a newer version of Python, or a wider set of Python packages, Puhti and Mahti have several pre-installed modules containing Python environments made for different science areas:
- python-data - for data analytics and machine learning
- PyTorch - PyTorch deep learning framework
- TensorFlow - TensorFlow deep learning framework
- JAX - Autograd and XLA for high-performance machine learning
- geoconda - for spatial data analysis
- BioPython (Puhti only) - biopython and other bioinformatics related Python libraries
To use any of the above-mentioned modules, just load the appropriate module, for example:
module load python-data
For more details about available Python versions and included libraries, check the corresponding application documentation.
Typically, after activating a Python-based module, you can continue using the `python3` command, but this will now point to a newer version of Python with a wider set of Python packages available. You can always check the Python version with the command `python3 --version`, and the full path of the command with `which python3` (to see if you are using the system Python or one from the modules listed above).
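For example, after loading the python-data module (a minimal sketch; the exact version number and installation path depend on the module version):
module load python-data
python3 --version   # a newer 3.x release than the system 3.6.8
which python3       # now points under /appl/... instead of /usr/bin/python3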
Installing Python packages to existing modules
If there is a CSC-provided module that covers almost everything you need, but it is missing a few Python packages, you may be able to install those yourself with the Pip package manager.
If you think that some important package should be included by default in a module provided by CSC, don't hesitate to contact our Service Desk.
Using venv
The recommended way to add packages on top of an existing environment is to use venv, which is a standard Python module for creating a lightweight "virtual environment". You can have multiple virtual environments, for example one for each project.
For example, to install a package called `whatshap` on top of the CSC-provided python-data module:
cd /projappl/<your_project> # change this to the appropriate path for your project
module load python-data
python3 -m venv --system-site-packages venv
source venv/bin/activate
pip install whatshap
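If you want to verify the installation, you can query the package from the activated environment (a hypothetical check using the same example package):
pip show whatshap              # lists the installed version and location
python3 -c "import whatshap"   # should finish without an ImportError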
Warning
Don't forget to use the `--system-site-packages` flag when creating the virtual environment, otherwise the environment will not find the pre-installed packages from the base module (for example numpy from python-data).
Later when you wish to use the virtual environment you only need to load the module and activate the environment:
module load python-data
source /projappl/<your_project>/venv/bin/activate
Naturally, this also applies to slurm job scripts.
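In a batch job this could look roughly as follows (a minimal sketch; the account, partition, resource values and script name are placeholders to adapt to your own case):
#!/bin/bash
#SBATCH --account=<your_project>
#SBATCH --partition=small
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

module load python-data
source /projappl/<your_project>/venv/bin/activate
python3 myscript.py   # your own script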
Note: some older CSC installations are not compatible with Python virtual environments. We are still working to update those. For these you need to use the `pip install --user` approach described below.
Using pip install --user
Another approach to installing additional packages is to do a "user installation" with the command `pip install --user`. This approach is easy to use, as it doesn't require setting up a virtual environment, but it can easily fill up your home directory if you install a lot of packages. There are also other drawbacks, such as package-provided commands not working out of the box.
With this approach, packages are by default installed to your home directory under `.local/lib/pythonx.y/site-packages` (where x.y is the version of Python being used). If you would like to change the installation folder, for example to make a project-wide installation instead of a personal one, you need to define the `PYTHONUSERBASE` environment variable with the new installation location. For example, to add the package `whatshap` to the `python-data` module:
module load python-data
export PYTHONUSERBASE=/projappl/<your_project>/my-python-env
pip install --user whatshap
In the example, the package is now installed inside the `my-python-env` directory in the project's `projappl` directory. Run `unset PYTHONUSERBASE` if you wish to later install into your home directory again.
When later using those libraries you need to define `PYTHONUSERBASE` again. Naturally, this also applies to slurm job scripts. For example:
module load python-data
export PYTHONUSERBASE=/projappl/<your_project>/my-python-env
Note that if the package you installed also contains executable files, these may not work, as they refer to a Python path internal to the container (most of our Python modules are installed using containers). You might see an error message like this:
whatshap --help
whatshap: /CSC_CONTAINER/miniconda/envs/env1/bin/python3.9: bad interpreter: No such file or directory
You can fix this by editing the first line of the executable (check with `which whatshap` in our example) to point to the real Python interpreter (check with `which python3`). In our example we would edit the file `~/.local/bin/whatshap` to have this as the first line:
#!/appl/soft/ai/tykky/python-data-2022-09/bin/python3
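If you prefer not to edit the file by hand, the same fix can be applied with sed (a sketch, assuming the module is loaded and `PYTHONUSERBASE` is set as above so that `which whatshap` and `which python3` resolve to the right files):
sed -i "1s|.*|#!$(which python3)|" "$(which whatshap)"   # rewrite the shebang line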
Creating your own Python environments
It is also possible to create your own Python environments.
Tykky
The easiest option is to use Tykky for Conda or pip installations.
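A containerized environment is built roughly like this (a sketch of typical Tykky usage; the environment file and installation directory are placeholders, and you should check the Tykky documentation for the exact options):
module load tykky
conda-containerize new --prefix /projappl/<your_project>/my-env env.yml
export PATH="/projappl/<your_project>/my-env/bin:$PATH"   # make the wrapped commands available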
Custom Apptainer container
In some cases, for example if you know of a suitable ready-made Apptainer or Docker container, using a custom Apptainer container is also an option.
Please see our Apptainer documentation:
- Running Apptainer containers
- Creating Apptainer containers, including how to convert a Docker container to an Apptainer container.
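Running Python from a custom container then looks roughly like this (a sketch; the image and script names are placeholders):
apptainer exec my_image.sif python3 my_script.py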
Conda
Conda is easy to use and flexible, but it usually creates a huge number of files, which is inefficient on shared file systems. This can cause very slow library imports and, in the worst case, slowdowns in the whole file system. Therefore, CSC has deprecated the direct use of Conda installations on CSC supercomputers. You can, however, still use Conda environments provided that they are containerized. To easily containerize your Conda (or pip) environments, please see the Tykky container wrapper tool.
- CSC Conda tutorial describes in more detail what Conda is and how to use it. Some parts of this tutorial may be helpful also for Tykky installations.
Python development environments
Python code can be edited with a console-based text editor directly on the supercomputer. Code can also be edited on your local machine and copied to the supercomputer with scp or graphical file transfer tools. You can also edit Python scripts on Puhti from your local PC with some code editors, such as Visual Studio Code.
Finally, several graphical programming environments can be used directly on the supercomputer, such as Jupyter Notebooks, Spyder and Visual Studio Code, through the Puhti web interface.
Jupyter Notebooks
Jupyter Notebooks allow you to run Python code via a web browser running on your local PC. The notebooks can combine code, equations, visualizations and narrative text in a single document. Many of our modules, including python-data, the deep learning modules and geoconda, include the Jupyter Notebook package. See the tutorial on how to set up and connect to a Jupyter Notebook for using Jupyter in the CSC environment.
Spyder
Spyder is a scientific Python development environment. The python-data and geoconda modules include Spyder. The best option for using it is through the Puhti web interface remote desktop.
Python parallel jobs
Python has several different packages for parallel processing:
- multiprocessing
- joblib
- dask
- mpi4py - Python interface to MPI
The `multiprocessing` package is likely the easiest to use, and as it is part of the Python standard library it is included in all Python installations. `joblib` provides some more flexibility. `multiprocessing` and `joblib` are suitable for one node (max 40 cores). `dask` is the most versatile and has several options for parallelization. Please see CSC's Dask tutorial, which includes both single-node (max 40 cores) and multi-node examples.
See our GitHub repository for some examples for using the different parallelization options with Puhti.
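As a rough single-node sketch, a batch job using one of these packages only needs to reserve the cores of one node and run the script; the resource values and script name below are placeholders, and the Python script itself (using, for example, multiprocessing or joblib) decides how many worker processes to start:
#!/bin/bash
#SBATCH --account=<your_project>
#SBATCH --partition=small
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

module load python-data
srun python3 my_parallel_script.py   # a script parallelized with e.g. multiprocessing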
`mpi4py` is not included in the current Python environments on CSC supercomputers; however, for multi-node jobs with non-trivial parallelization it is generally the most efficient option. For a short tutorial on `mpi4py`, along with other approaches to improving the performance of Python programs, see the free online course Python in High Performance Computing.
License
Python packages are usually licensed under various free and open-source (FOSS) licenses. Python itself is licensed under the PSF License, which is also open source.