Consensus/Indel Calling

See http://htslib.org/ for the new 1.x releases of SAMtools, BCFtools, and HTSlib. This website contains information pertaining to the old 0.1.19 samtools release, and so is useful but somewhat out of date. As time permits, this information will be updated for the new samtools/bcftools versions and moved to the new website.

Quick Overview

From a sorted BAM alignment, raw SNP and indel calls are acquired by:

  samtools pileup -vcf ref.fa aln.bam > raw.pileup

The resultant output should be further filtered by:

  samtools.pl varFilter raw.pileup | awk '$6>=20' > final.pileup

to rule out error-prone variant calls caused by factors not considered in the statistical model.
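The awk step keeps lines whose sixth column is at least 20; the Consensus Calling section describes that column as the SNP quality. A minimal Python sketch of the same threshold follows. The function name `filter_by_snp_quality` is hypothetical, and the handling of short or non-numeric lines is an assumption that may differ from awk's behaviour.

```python
def filter_by_snp_quality(lines, min_quality=20):
    """Keep pileup lines whose 6th whitespace-separated field is >= min_quality."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # assumption: drop blank or truncated lines
        try:
            quality = int(fields[5])
        except ValueError:
            continue  # assumption: drop lines with a non-numeric 6th field
        if quality >= min_quality:
            kept.append(line)
    return kept


# Two lines taken from the indel example later on this page.
example = [
    "seq2  156 *  +AG/+AG  71  252  99  11  +AG  *  3  8  0",
    "seq2  158 A  R  18  18 99  8   GG$G.....   <;;<<<<<",
]
for line in filter_by_snp_quality(example):
    print(line)  # only the seq2 156 line passes (252 >= 20, 18 < 20)
```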

Consensus Calling

The pileup command can optionally generate the consensus sequence using the model implemented in MAQ.

When consensus calling is switched on, the pileup command inserts the consensus base, consensus quality, SNP quality and maximum mapping quality of the reads covering the site between the 'reference base' and the 'read bases' columns. The output looks like this:

seq1  60  T  T  66  0  99  13  ...........^~.^~.   9<<55<;<<<<<<
seq1  61  G  G  72  0  99  15  .............^~.^y. (;975&;<<<<<<<<
seq1  62  T  T  72  0  99  15  .$..............    <;;,55;<<<<<<<<
seq1  63  G  G  72  0  99  15  .$.............^~.  4;2;<7:+<<<<<<<
seq1  64  G  G  69  0  99  14  ..............  9+5<;;;<<<<<<<
seq1  65  A  A  69  0  99  14  .$............. <5-2<;;<<<<<<;
seq1  66  C  C  66  0  99  13  .............   &*<;;<<<<<<8<
seq1  67  C  C  69  0  99  14  .............^~.    ,75<.4<<<<<-<<
seq1  68  C  C  69  0  99  14  ..............  576<;7<<<<<8<<
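The column layout described above can be sketched as a small parser. The field names below are hypothetical labels chosen for this example. The eighth column is not named in the text; it is assumed here to be the number of covering reads, which agrees with the length of the quality string in the sample lines.

```python
from typing import NamedTuple


class ConsensusPileupRecord(NamedTuple):
    chrom: str
    pos: int
    ref_base: str
    consensus_base: str
    consensus_quality: int
    snp_quality: int
    max_mapping_quality: int
    depth: int              # assumption: number of reads covering the site
    read_bases: str
    base_qualities: str


def parse_consensus_line(line):
    """Split one 10-column consensus pileup line into named fields."""
    f = line.split()
    if len(f) != 10:
        raise ValueError("expected 10 columns, got %d" % len(f))
    return ConsensusPileupRecord(
        f[0], int(f[1]), f[2], f[3],
        int(f[4]), int(f[5]), int(f[6]), int(f[7]),
        f[8], f[9],
    )


rec = parse_consensus_line(
    "seq1  60  T  T  66  0  99  13  ...........^~.^~.   9<<55<;<<<<<<"
)
print(rec.consensus_quality, rec.snp_quality, rec.depth)
```

Indel lines (those with a star in the third column) have a different layout and would be rejected by this sketch.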

Note that consensus quality is the Phred-scaled probability that the consensus is wrong. SNP quality is the Phred-scaled probability that the consensus is identical to the reference. They are different in concept. For SNP calling, SNP quality is of more importance.
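Phred scaling means Q = -10 log10(P), so a quality can be converted back to a probability with P = 10^(-Q/10). The helper names in this short sketch are hypothetical.

```python
import math


def phred_to_probability(q):
    """Convert a Phred-scaled quality to the probability it encodes."""
    return 10 ** (-q / 10)


def probability_to_phred(p):
    """Convert a probability (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p)


# Consensus quality 66: probability that the consensus is wrong, about 2.5e-07.
print(phred_to_probability(66))

# SNP quality 20: probability that the consensus equals the reference, about 0.01.
print(phred_to_probability(20))
```

Read this way, a SNP quality of 0 corresponds to a probability of 1 that the consensus is identical to the reference, which is why the non-variant sites in the example all show 0 in that column.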

Short Indel Calling

Pileup also summarises short indel information, correcting for the effect of flanking tandem repeats. It is important to note that SAMtools' indel caller is not perfect. A better way would be to do local de novo assembly or local multiple alignment around the candidate indel sites.

Short indels tend to occur around tandem repeats, but alignment is much harder in these regions with short reads. Reads aligned without gaps may actually contain indels that the aligner missed. The pileup command attempts to correct for this. Here is an example of a 2bp insertion relative to the reference:

seq2  151 G  G  36  0  99  12  ...........A    :9<;;7=<<<<<
seq2  152 A  A  63  0  99  12  ............    :9<;;<;<<<<<
seq2  153 A  A  63  0  99  12  ............    :7<877=<<<<<
seq2  154 A  A  66  0  99  13  .$...........^~.    :7<97<7<<<<<<
seq2  155 A  A  63  0  99  12  .$...........   7<77<;<<<<<<
seq2  156 A  A  10  0  99  11  .$......+2AG.+2AG.+2AGGG <975;:<<<<<
seq2  156 *  +AG/+AG  71  252  99  11  +AG  *  3  8  0
seq2  157 A  A  57  0  99  10  .$.$........    97<<<<<<<<
seq2  158 A  R  18  18 99  8   GG$G.....   <;;<<<<<
seq2  159 T  T  8   0  99  7   A$A$.....   3:<<<<<
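The read counts in the `seq2 156` lines above can be checked by decoding the read-bases column. The sketch below assumes the usual pileup conventions: `^` followed by one mapping-quality character marks a read start, `$` marks a read end, and `+N` or `-N` followed by N bases marks an insertion or deletion after the preceding read base. The function name is hypothetical.

```python
import re
from collections import Counter


def summarise_read_bases(bases):
    """Return (number_of_reads, Counter of indel strings) for one read-bases column."""
    reads = 0
    indels = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            i += 2  # skip the marker and the mapping-quality character
        elif ch == "$":
            i += 1  # end-of-read marker carries no base
        elif ch in "+-":
            m = re.match(r"\d+", bases[i + 1:])
            if m is None:
                raise ValueError("indel marker without a length at offset %d" % i)
            length = int(m.group())
            start = i + 1 + len(m.group())
            indels[ch + bases[start:start + length].upper()] += 1
            i = start + length
        else:
            reads += 1  # '.', ',', or a mismatching base: one read
            i += 1
    return reads, indels


reads, indels = summarise_read_bases(".$......+2AG.+2AG.+2AGGG")
print(reads, dict(indels))  # 11 {'+AG': 3}
```

The result agrees with the star line: 11 reads cover the site, and 3 of them carry the AG insertion in the raw alignment.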

The line whose third column is a star indicates that the AG insertion is supported by 3 reads, that 8 reads agree with the reference according to the raw alignment, and that no reads support a third allele. However, SAMtools infers a homozygous AG insertion with a high score of 252 because, when the reads are realigned with the prior of an insertion, the 8 reads mapped without gaps turn out to be explained by a tandem repeat. This scenario is clearer when the region is examined in an alignment viewer.

Another example is a 5bp heterozygous insertion: in that case, 3 reads are aligned with gaps, but 13 reads show evidence of the insertion.
