Running MaxQuant software on the Puhti supercomputer

MaxQuant is a quantitative proteomics software package designed for analyzing large mass-spectrometric data sets. A high-performance computing environment such as Puhti is well suited to running compute-intensive MaxQuant jobs in proteomics research.

MaxQuant is free to use, but each user needs to register and download MaxQuant from the developer site themselves.

This tutorial provides instructions for running MaxQuant software on Puhti.

Configure parameter file

Even if you are going to run the MaxQuant pipeline on Puhti, you first have to configure the parameters of your MaxQuant job on your local Windows machine. Then upload the parameter file (mqpar.xml), the raw data samples (.raw files) and the sequence file (.fasta file) to the Puhti computing environment.

Edit XML configuration file

The parameter file (mqpar.xml), created for example on a local Windows machine, needs some modifications to comply with the HPC environment.

These modifications include:

Changing the Windows paths to Linux paths for the sample files (tip: search for <filePaths> in the XML file)
Changing the Windows path to a Linux path for the FASTA sequence file (tip: search for <fastaFilePath> in the XML file)
Setting the number of threads according to the number of samples (tip: search for <numThreads> in the XML file)
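
The three changes listed above can also be scripted once the parameter file is on Puhti. The following is a minimal sketch using sed, not a definitive recipe. The Windows folder C:\data\maxquant, the target directory /scratch/project_xxx/maxquant and the thread count 6 are hypothetical placeholders, and the sketch assumes that all input files were stored in a single Windows folder.

```shell
#!/bin/bash
# Hypothetical values: adjust all three to your own setup.
WIN_PREFIX='C:/data/maxquant'                 # Windows folder, written with forward slashes
LINUX_PREFIX='/scratch/project_xxx/maxquant'  # directory holding the same files on Puhti
THREADS=6                                     # typically matched to --cpus-per-task

# fix_mqpar INPUT OUTPUT: write a Linux-ready copy of a Windows mqpar.xml.
# 1st expression: on lines with a drive-letter path, turn backslashes into slashes.
# 2nd expression: swap the Windows folder for the Linux directory.
# 3rd expression: set the thread count.
fix_mqpar() {
    sed -e '/[A-Za-z]:\\/ s|\\|/|g' \
        -e "s|${WIN_PREFIX}|${LINUX_PREFIX}|g" \
        -e "s|<numThreads>[0-9]*</numThreads>|<numThreads>${THREADS}</numThreads>|" \
        "$1" > "$2"
}
```

For example, fix_mqpar mqpar_windows.xml mqpar.xml writes a new file and leaves the original untouched, so the two can be compared with diff before the job is submitted.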

Submit as a batch job to Puhti cluster

First log in to Puhti (see instructions here).
Change to your project directory on Puhti and copy your input files there (tips on how to transfer files).

This is the project directory (on scratch) where your .xml file, .fasta file and raw data files are located.
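
After copying, it can be worth confirming that every input file named in the parameter file is really present. The helper below is a sketch under a few assumptions: the paths in mqpar.xml have already been converted to Linux paths, each path sits on its own line, and raw files end in lowercase .raw. The function name check_mqpar_inputs is made up for this example.

```shell
# check_mqpar_inputs MQPAR: report input files named in MQPAR that do not exist.
# Returns 0 if every file was found, 1 otherwise.
check_mqpar_inputs() {
    missing=0
    # Collect the .raw entries and the FASTA path, one per line.
    paths=$(sed -n \
        -e 's|.*<string>\(.*\.raw\)</string>.*|\1|p' \
        -e 's|.*<fastaFilePath>\(.*\)</fastaFilePath>.*|\1|p' "$1")
    while IFS= read -r path; do
        [ -n "$path" ] || continue
        if [ ! -e "$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done <<EOF
$paths
EOF
    [ "$missing" -eq 0 ]
}
```

Running check_mqpar_inputs mqpar.xml in the project directory prints one line per missing file and prints nothing when all files are in place.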

Learn how to enable MaxQuant environment

MaxQuant needs the Mono runtime to run on Linux. Because only Mono is provided centrally, you can choose which version of MaxQuant to use. CSC provides a module for Mono:

module load mono

Download a Linux-compatible version of MaxQuant (e.g., v2.0.3.0) to your scratch directory on Puhti and run the following to verify that MaxQuant is installed properly:

mono MaxQuant\ 2.0.3.0/bin/MaxQuantCmd.exe --help

Note that the directory name contains a space, so you need to either escape it with a backslash (\) or enclose the path in quotes. For ease of use, you may wish to rename the directory so that it has, e.g., an underscore instead of the space.
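
As a small illustration of the renaming suggestion, the sketch below replaces spaces in a directory name with underscores. The function name underscore_name is hypothetical, and the function is meant to be run from the parent directory with a plain directory name rather than a longer path.

```shell
# underscore_name DIR: rename DIR so that spaces in its name become underscores.
# Prints the resulting name; leaves DIR alone if its name contains no space.
underscore_name() {
    new=$(printf '%s' "$1" | tr ' ' '_')
    if [ "$new" != "$1" ]; then
        mv -- "$1" "$new"
    fi
    printf '%s\n' "$new"
}
```

After underscore_name "MaxQuant 2.0.3.0", the executable can be addressed without quoting, as in mono MaxQuant_2.0.3.0/bin/MaxQuantCmd.exe --help.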

Note

The MaxQuant version used to create the .xml parameter file should match the version you run on Puhti so that the job runs smoothly on the cluster. Other recent versions may also work.
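
A quick way to compare versions is to read the version string from the parameter file itself. The sketch below assumes that your mqpar.xml records it in a <maxQuantVersion> element. Check your own file first, as this is an assumption about the file layout rather than something this tutorial guarantees. Both function names are made up for the example.

```shell
# mqpar_version MQPAR: print the MaxQuant version recorded in MQPAR (empty if none).
mqpar_version() {
    sed -n 's|.*<maxQuantVersion>\(.*\)</maxQuantVersion>.*|\1|p' "$1" | head -n 1
}

# same_version MQPAR VERSION: succeed only if MQPAR records exactly VERSION.
same_version() {
    [ "$(mqpar_version "$1")" = "$2" ]
}
```

For instance, same_version mqpar.xml 2.0.3.0 || echo "version mismatch" prints a warning when the parameter file was written by a different release than the one you downloaded.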

Finally submit your script

Create a batch script according to the instructions for shared memory jobs and make sure the script is in the same directory as your mqpar.xml file and other data files.

To get started, you may use the following minimal example script (called, e.g., maxquant.sh):

#!/bin/bash
#SBATCH --job-name=maxquant
#SBATCH --output=output_%j.txt
#SBATCH --error=errors_%j.txt
#SBATCH --account=project_xxx
#SBATCH --time=01:20:00
#SBATCH --ntasks=1
#SBATCH --partition=small
#SBATCH --cpus-per-task=6
#SBATCH --mem=16000

# load maxquant environment

module load mono

# adjust file paths here

mono /path_of_MaxQuant/bin/MaxQuantCmd.exe /path/MaxQuant/mqpar.xml

Modify the resource allocations depending on the number of samples, then submit your script as below:

sbatch maxquant.sh

When the MaxQuant job has finished, your output files will be in this same directory.
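
To pick a core count that follows the number of samples, you could count the raw files before submitting. This is only a sketch: it assumes one thread per .raw file, and it caps the result at 40, the number of cores in a Puhti node, unless you pass a different limit. The function name suggest_cpus is hypothetical.

```shell
# suggest_cpus DIR [MAX]: print the number of .raw files in DIR,
# capped at MAX (default 40) and never less than 1.
suggest_cpus() {
    max=${2:-40}    # assumed upper limit for a single-node job
    count=$(find "$1" -maxdepth 1 -name '*.raw' | wc -l)
    count=$((count))    # normalise any padding in the wc output
    if [ "$count" -gt "$max" ]; then count=$max; fi
    if [ "$count" -lt 1 ]; then count=1; fi
    echo "$count"
}
```

The printed value can then be used both for --cpus-per-task in the batch script and for <numThreads> in mqpar.xml, so that the two stay consistent.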

Tutorial example

You can download example tutorial data for running MaxQuant as below:

wget https://a3s.fi/proteomics/MaxQuant_tutorial.tar.gz

and then untar the downloaded archive file as below:

tar -xavf MaxQuant_tutorial.tar.gz

The tutorial has example raw files and other necessary files to run MaxQuant for testing.

Look at the used resources once your job is finished

Once the MaxQuant job has finished, you can check how efficiently it used the reserved computing resources, such as memory and CPU. This helps you choose better resource requests for later jobs.

You can use the following commands with the job ID:

seff <jobid>
sacct -l -j <jobid>
sacct -o jobid,jobname,maxrss,maxvmsize,state,elapsed -j <jobid>
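
The maxrss column is usually printed with a unit suffix such as K, which makes it awkward to compare with the --mem request given in megabytes. The helper below is a small sketch for that conversion. It assumes whole-number values with an optional K, M or G suffix and treats values without a suffix as bytes. The function name to_mb is made up for this example.

```shell
# to_mb VALUE: convert a memory figure such as 2097152K, 512M or 16G
# to whole megabytes (rounded down). Values without a suffix are read as bytes.
to_mb() {
    value="$1"
    number="${value%[KMG]}"    # strip one trailing unit letter, if present
    case "$value" in
        *K) echo $(( number / 1024 )) ;;
        *M) echo $(( number )) ;;
        *G) echo $(( number * 1024 )) ;;
        *)  echo $(( value / 1024 / 1024 )) ;;
    esac
}
```

For example, a job step that reports a maxrss of 2097152K used about 2048 MB, well below the 16000 MB requested in the example batch script, so the memory request could be lowered for similar runs.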