BWA
Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome. It implements three algorithms, BWA-MEM (mem
), BWA-Backtrack (aln
) and BWA-SW (bwasw
). BWA-Backtrack works for query sequences shorter than 200bp. The other two algorithms are used longer reads up to around 100kbp. BWA-MEM is recommended for reads longer than 70gb. All algorithms do gapped alignment.
BWA can be used to align both single-end and paired-end reads to a reference genome or sequence set.
License
Free to use and open source under GNU GPLv3.
Available
- Puhti: 0.7.17
- Chipster graphical user interface
Usage
On Puhti, BWA can be taken in use as part of the biokit
module collection:
module load biokit
The biokit modules set up a set of commonly used bioinformatics tools, including BWA. Note however that there are other bioinformatics tools on Puhti that have separate setup commands.
The basic syntax of BWA commands is:
bwa <command> [options]
BWA indexes
CSC does not maintain pre-compiled BWA indexes for reference genomes on Puhti, but you can check if genomes used in Chipster can provide you a ready-made index for a genome you want use. This can be done with the command:
chipster_genomes bwa
If a suitable genome index is not found, the fist step in creating alignment with BWA is downloading the reference genome and indexing it. Please note that your $HOME
directory is often too small for working with complete genomes. Instead, you should do the analysis in the scratch directory of your Puhti project.
You can use for example command ensemblfetch
or wget
to download a reference genome to Puhti. For example:
ensemblfetch homo_sapiens
The command above retrieves the human genome sequence to a file called Homo_sapiens.GRCh38.dna.toplevel.fa
. You can calculate the BWA indexes for this file with the command:
bwa index -a bwtsw Homo_sapiens.GRCh38.dna.toplevel.fa
Note that for small less than 2 GB reference genomes you could use a faster "is" indexing algorithm (bwa index -a is
)
Single-end alignment
Once the indexing is ready you can carry out the alignment for singe-end reads with the command:
bwa mem Homo_sapiens.GRCh38.dna.toplevel.fa reads.fastq > aln.sam
If you wish to use the aln
(BWA-Backtrack) algorithm, you need to do the alignment in two steps.
First calculate the actual alignment:
bwa aln Homo_sapiens.GRCh38.dna.toplevel.fa reads.fastq > aln_sa.sai
The result file is in BWA-specific .sai
format that you can convert to SAM format with bwa samse
command:
bwa samse Homo_sapiens.GRCh38.dna.toplevel.fa aln_sa.sai reads.fastq > aln.sam
Paired-end alignment
If you use the MEM algorithm, you can do the paired-end alignment with just one command:
bwa mem Homo_sapiens.GRCh38.dna.toplevel.fa read1.fq read2.fq > aln.sam
In the case of BWA-Backtrack algorithm, you should first do a separate alignment run for each read file:
bwa aln Homo_sapiens.GRCh38.dna.toplevel.fa reads1.fq > aln1.sai
bwa aln Homo_sapiens.GRCh38.dna.toplevel.fa reads2.fq > aln2.sai
The two .sai
alignment files are combined with command bwa sampe
:
bwa sampe Homo_sapiens.GRCh38.dna.toplevel.fa aln1.sai aln2.sai reads1.fq reads2.fq > aln.sam
Running BWA batch jobs on Puhti
In Puhti, BWA jobs should be run as batch jobs. Below is a sample batch job file for running a BWA job on Puhti:
#!/bin/bash
#SBATCH --job-name=bwa
#SBATCH --output=output_%j.txt
#SBATCH --error=errors_%j.txt
#SBATCH --time=12:00:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32000
#SBATCH --account=your_project_name
#load the bio tools
module load biokit
# Index the reference genome
bwa index -a bwtsw Homo_sapiens.GRCh38.dna.toplevel.fa
# Run the alignnments
bwa mem -t $SLURM_CPUS_PER_TASK Homo_sapiens.GRCh38.dna.toplevel.fa reads1.fq reads2.fq > aln.sam
In the batch job example above, one BWA task (--ntasks=1
) is executed. The BWA job uses 8 cores (--cpus-per-task=8
) with a total of 32 GB of memory (--mem=32000
). The maximum duration of the job is twelve hours (--time 12:00:00
). All the cores are assigned from one computing node (--nodes=1
). In addition to the resource reservations, you have to define the billing project for your batch job. This is done by replacing your_project_name
with the name of your project. You can use command csc-projects
to see what projects you have on Puhti.
You can submit the batch job file to the batch job system with the command:
sbatch batch_job_file.bash
See the Puhti user guide for more information about running batch jobs.
More information
More information about BWA can be found from: