See http://htslib.org/ for the new 1.x releases of SAMtools, BCFtools, and HTSlib. This website contains information pertaining to the old 0.1.19 samtools release, and so is useful but somewhat out of date. As time permits, this information will be updated for the new samtools/bcftools versions and moved to the new website.


Quick Overview

From a sorted BAM alignment, raw SNP and indel calls are acquired by: The resultant output should be further filtered by: to rule out error-prone variant calls caused by factors not considered in the statistical model.


Consensus Calling

The pileup command is able to optionally generate the consensus sequence with the model implemented in MAQ.

When consensus calling is switched on, pileup command will insert the consensus base, consensus quality, SNP quality and maximum mapping quality of the reads covering the sites between the `reference base' and the `read bases' columns. The output looks like this:

Note that consensus quality is the Phred-scaled probability that the consensus is wrong. SNP quality is the Phred-scaled probability that the consensus is identical to the reference. They are different in concept. For SNP calling, SNP quality is of more importance.


Short Indel Calling

Pileup also summarises short indels information by correcting the effect of flanking tandem repeats. It is important to note that SAMtools' indel caller is not perfect. A better way would be to do local de novo assembly or local multiple alignment around the candidate indel sites.

Short indels tend to occur around tandem repeats, but the alignment is much harder in these regions given short reads. Reads aligned without gaps may actually contain indels due to wrong alignment. The pileup command fixes this. Here is an example of a 2bp insertion to the reference:

The line with the 3rd column a star indicates that the AG insertion is supported by 3 reads; 8 reads agree with the reference according to the raw alignment; no reads support a third allele. However, SAMtools infers a AG homozygous insertion with a high score 252 because when we realign the reads with the prior of an insertion, we found that the 8 reads mapped without gaps are due to a tandam repeat. This scenario is clearer from the alignment viewer:



Here is another example of a 5bp heterozygous insertion (in this case, 3 reads are aligned with gaps, but 13 reads show the evidence of the insertion):