Kraken
Description
Kraken is a sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor of all genomes known to contain a given k-mer.
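As a toy illustration of the first step described above, the sketch below lists the overlapping k-mers of a short sequence. The function name `list_kmers`, the example sequence, and the value k=5 are hypothetical choices made for illustration. This is not Kraken's implementation, and the database lookup is not shown.

```shell
# Toy sketch: print every overlapping k-mer of a sequence, one per line.
# list_kmers is a hypothetical helper, not part of Kraken.
list_kmers() {
    kmer_seq=$1
    kmer_k=$2
    kmer_len=${#kmer_seq}
    kmer_i=1
    # cut -c uses 1-based, inclusive character positions
    while [ $((kmer_i + kmer_k - 1)) -le "$kmer_len" ]; do
        printf '%s\n' "$kmer_seq" | cut -c "${kmer_i}-$((kmer_i + kmer_k - 1))"
        kmer_i=$((kmer_i + 1))
    done
}

list_kmers "ACGTACGT" 5
```

A sequence of length n yields n - k + 1 k-mers, so the 8-base example produces four 5-mers.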
License
Free to use and open source under the MIT License.
Version
- Puhti: 2.1.2
Usage
Kraken is included in the biokit module. To set it up, run the command:
module load biokit
Kraken2 can then be started with the command kraken2. For example:
kraken2 --help
Several Kraken2 reference databases are available in Puhti. By default, Kraken2 uses the standard database, which is based on taxonomic information and the complete genomes in RefSeq for the bacterial, archaeal, and viral domains, along with the human genome and a collection of known vectors (UniVec_Core).
Available databases in Puhti are:
Name | Memory request | Description |
---|---|---|
standard | 40 GB | NCBI taxonomic information, as well as the complete genomes in RefSeq for the bacterial, archaeal, and viral domains, along with the human genome and a collection of known vectors (UniVec_Core). |
krak_microb | 44 GB | RefSeq bacteria, archaea, viruses, fungi and protozoa |
16S_Greengenes_k2db | 1 GB | Greengenes 16S data |
16S_RDP_k2db | 1 GB | RDP 16S data |
16S_SILVA132_k2db | 1 GB | Silva 132 16S data |
16S_SILVA138_k2db | 1 GB | Silva 138 16S data |
minikraken_8GB_20200312 | 1 GB |
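The memory requests in the table above can be looked up in a script when preparing a batch job. The sketch below is a minimal example; the function name `kraken_db_mem_gb` is hypothetical and the values are simply copied from the table, so they need to be updated by hand if the databases change.

```shell
# Hypothetical helper: print the memory request (in GB) that the table
# above lists for a given Kraken2 database name.
kraken_db_mem_gb() {
    case "$1" in
        standard) echo 40 ;;
        krak_microb) echo 44 ;;
        16S_Greengenes_k2db|16S_RDP_k2db|16S_SILVA132_k2db|16S_SILVA138_k2db) echo 1 ;;
        minikraken_8GB_20200312) echo 1 ;;
        *) echo "unknown database: $1" >&2; return 1 ;;
    esac
}

# Example: print a matching memory line for a batch job file.
db=krak_microb
mem_gb=$(kraken_db_mem_gb "$db")
echo "#SBATCH --mem=${mem_gb}G"
```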
Using Kraken2 with a large reference database requires plenty of memory. For example, jobs with the standard Kraken2 database require 40 GB of memory. Thus, Kraken2 should in practice always be executed as a batch job. Below is a sample Kraken2 job using 4 cores, 40 GB of memory and 6 hours of runtime:
#!/bin/bash -l
#SBATCH --job-name=kraken2
#SBATCH --output=output_%j.txt
#SBATCH --error=errors_%j.txt
#SBATCH --time=06:00:00
#SBATCH --partition=small
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --account=project_123456
#SBATCH --mem=40000
#
module load biokit
kraken2 --db standard --threads $SLURM_CPUS_PER_TASK --output results.txt input.fasta
You can submit the batch job file to the batch job system with the command:
sbatch batch_job_file.bash
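Once the job has finished, the output file can be given a quick sanity check. The sketch below assumes the standard Kraken2 output format, in which each line is tab-separated and the first column is C for a classified sequence or U for an unclassified one. The function name `count_kraken_calls` is hypothetical, and results.txt is the output file name used in the sample job above.

```shell
# Hypothetical helper: count classified (C) and unclassified (U) sequences
# in a Kraken2 output file, based on the first tab-separated column.
count_kraken_calls() {
    awk -F '\t' '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END { printf "classified=%d unclassified=%d\n", c, u }
    ' "$1"
}

# Run only if the results file from the batch job exists.
if [ -f results.txt ]; then
    count_kraken_calls results.txt
fi
```

A very low share of classified sequences may indicate that the chosen database does not match the sample.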