Calling SNPs/INDELs with SAMtools/BCFtools

The basic command line

Suppose we have reference sequences in ref.fa, indexed by samtools faidx, and position-sorted alignment files aln1.bam and aln2.bam. The following command lines call SNPs and short INDELs:

where the -D option sets the maximum read depth to call a SNP.

SAMtools acquires sample information from the SM tag in the @RG header lines. One alignment file can contain multiple samples, and reads from one sample can be distributed across different alignment files; SAMtools regroups the reads by sample either way. If no @RG lines are present, each alignment file is taken as one sample.
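The regrouping rule can be illustrated with a small Python sketch. It is not SAMtools code: the function names and the example header text are hypothetical, and the fallback of using the file name as the sample name is an assumption made for illustration. The sketch reads @RG lines from SAM header text, maps each read group to its SM value, and then collects the read groups that feed each sample.

```python
def samples_from_header(header_text, file_name):
    """Map read-group IDs to sample names using the @RG header lines.

    Returns {read_group_id: sample}. If the header has no @RG line, the
    whole file is treated as one sample, stored under the key None.
    """
    rg_to_sample = {}
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are TAB-delimited TAG:VALUE pairs after the record type.
        tags = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "ID" in tags:
            # Assumption: a read group without SM falls back to the file name.
            rg_to_sample[tags["ID"]] = tags.get("SM", file_name)
    if not rg_to_sample:
        rg_to_sample[None] = file_name
    return rg_to_sample


def regroup(headers):
    """Collect, for each sample, the (file, read group) pairs that feed it."""
    by_sample = {}
    for file_name, header_text in headers.items():
        for rg, sample in samples_from_header(header_text, file_name).items():
            by_sample.setdefault(sample, []).append((file_name, rg))
    return by_sample


# Hypothetical headers: aln1.bam holds two samples, and sample NA1 is
# spread over both files.
headers = {
    "aln1.bam": "@HD\tVN:1.0\tSO:coordinate\n@RG\tID:rg1\tSM:NA1\n@RG\tID:rg2\tSM:NA2\n",
    "aln2.bam": "@HD\tVN:1.0\tSO:coordinate\n@RG\tID:rg3\tSM:NA1\n",
}
grouped = regroup(headers)
```

Here `grouped` lists two sources for NA1 (one from each file) and one for NA2, which mirrors the two cases described above.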

Since r865, it is possible to generate the consensus sequence with

Understanding the command line

In the command line above, samtools collects summary information in the input BAMs, computes the likelihood of data given each possible genotype and stores the likelihoods in the BCF format (see below). It does not call variants.

Bcftools applies the prior and does the actual calling. It can also concatenate BCF files, index BCFs for fast random access and convert BCF to VCF. In addition, bcftools can operate on some VCFs (e.g. calling SNPs from GL-tagged VCFs), but not on all VCFs; VCF to BCF conversion is not working at the moment either.
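The split between the two steps (likelihoods first, prior and calling second) can be sketched in Python. This is a simplified textbook model with two alleles, a diploid sample and independent base errors; it is an assumption made for illustration and not the model SAMtools or BCFtools implements. The function names and the prior values are hypothetical.

```python
import math


def genotype_likelihoods(observations):
    """Return P(data | genotype) for 0, 1 and 2 copies of the reference allele.

    observations: list of (is_ref, base_quality) pairs, with base_quality > 0.
    """
    likelihoods = []
    for ref_copies in (0, 1, 2):
        log_l = 0.0
        for is_ref, qual in observations:
            err = 10 ** (-qual / 10)  # Phred quality to error probability
            # Probability of observing the reference base given the genotype.
            p_ref = (ref_copies / 2) * (1 - err) + (1 - ref_copies / 2) * err
            log_l += math.log(p_ref if is_ref else 1 - p_ref)
        likelihoods.append(math.exp(log_l))
    return likelihoods


def call_genotype(likelihoods, prior):
    """Apply a genotype prior and return (best ref-copy count, posterior)."""
    weighted = [l * p for l, p in zip(likelihoods, prior)]
    total = sum(weighted)
    posterior = [w / total for w in weighted]
    best = max(range(3), key=lambda g: posterior[g])
    return best, posterior


# Three reference bases and one non-reference base, all at quality 20.
obs = [(True, 20)] * 3 + [(False, 20)]
likes = genotype_likelihoods(obs)

flat_call, _ = call_genotype(likes, [1 / 3, 1 / 3, 1 / 3])
# Hypothetical prior that strongly favours homozygous reference.
prior_call, _ = call_genotype(likes, [0.0005, 0.001, 0.9985])
```

With a flat prior the heterozygous genotype wins, while the prior favouring homozygous reference overturns that call. This is why storing likelihoods without calling leaves the decision open until a prior is applied.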

Tuning the parameters

Consider applying the following parameters to mpileup in different scenarios:

Understanding the output: the VCF/BCF format

The VCF format

The Variant Call Format (VCF) is the emerging standard for storing variant data. Originally designed for SNPs and short INDELs, it also works for structural variations.

VCF consists of a header section and a data section. The header must contain a line starting with one '#', showing the name of each field, and then the sample names starting at the 10th column. The data section is TAB delimited, with each line consisting of at least 8 mandatory fields (the first 8 fields in the table below). The FORMAT field and sample information may be absent. See the official VCF spec for a more rigorous description of the format.

Col  Field      Description
1    CHROM      Chromosome name
2    POS        1-based position. For an indel, this is the position preceding the indel.
3    ID         Variant identifier. Usually the dbSNP rsID.
4    REF        Reference sequence at POS involved in the variant. For a SNP, it is a single base.
5    ALT        Comma delimited list of alternative sequence(s).
6    QUAL       Phred-scaled probability of all samples being homozygous reference.
7    FILTER     Semicolon delimited list of filters that the variant fails to pass.
8    INFO       Semicolon delimited list of variant information.
9    FORMAT     Colon delimited list of the format of individual genotypes in the following fields.
10+  Sample(s)  Individual genotype information defined by FORMAT.
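The column layout above can be exercised with a short Python sketch that splits one data line into its fields. The function names, the example line and the sample name `s1` are hypothetical, and the parser is a minimal illustration rather than a full implementation of the VCF spec.

```python
VCF_FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line, sample_names=()):
    """Split one TAB-delimited VCF data line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("a VCF data line needs at least 8 fields")
    rec = dict(zip(VCF_FIXED, cols[:8]))
    rec["POS"] = int(rec["POS"])          # 1-based position
    rec["ALT"] = rec["ALT"].split(",")    # comma delimited alternatives
    rec["QUAL"] = None if rec["QUAL"] == "." else float(rec["QUAL"])
    info = {}
    for item in rec["INFO"].split(";"):   # semicolon delimited KEY=VALUE or flag
        if item and item != ".":
            key, _, value = item.partition("=")
            info[key] = value if value else True
    rec["INFO"] = info
    rec["samples"] = {}
    if len(cols) > 9:                     # FORMAT plus per-sample columns
        keys = cols[8].split(":")
        for name, col in zip(sample_names, cols[9:]):
            rec["samples"][name] = dict(zip(keys, col.split(":")))
    return rec


def qual_to_prob(qual):
    """Convert a Phred-scaled value Q to the probability 10^(-Q/10)."""
    return 10 ** (-qual / 10)


line = "chr1\t1000\t.\tA\tG\t30\t.\tDP=12;AF1=0.5\tGT:GQ\t0/1:45"
rec = parse_vcf_line(line, ["s1"])
p_all_hom_ref = qual_to_prob(rec["QUAL"])
```

Under the QUAL definition in the table, a QUAL of 30 corresponds to a probability of 0.001 that all samples are homozygous reference.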

The BCF format

BCF, or the binary variant call format, is the binary version of VCF. It keeps the same information as VCF while being much more efficient to process, especially for many samples. The relationship between BCF and VCF is similar to that between BAM and SAM. A detailed description of the BCF format can be found in bcf.tex, included in the samtools source code package.

SAMtools/BCFtools specific information

SAMtools/BCFtools may write the following tags in the INFO field in VCF/BCF.

Tag    Description
I16    16 integers:
        1  number of reference Q13 bases on the forward strand
        2  number of reference Q13 bases on the reverse strand
        3  number of non-ref Q13 bases on the forward strand
        4  number of non-ref Q13 bases on the reverse strand
        5  sum of reference base qualities
        6  sum of squares of reference base qualities
        7  sum of non-ref base qualities
        8  sum of squares of non-ref base qualities
        9  sum of ref mapping qualities
       10  sum of squares of ref mapping qualities
       11  sum of non-ref mapping qualities
       12  sum of squares of non-ref mapping qualities
       13  sum of tail distance for ref bases
       14  sum of squares of tail distance for ref bases
       15  sum of tail distance for non-ref bases
       16  sum of squares of tail distance for non-ref bases
INDEL  Indicates that the variant is an INDEL.
DP     The number of reads covering or bridging POS.
DP4    Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles, used in variant calling. Sum can be smaller than DP because low-quality bases are not counted.
PV4    P-values for 1) strand bias (exact test); 2) baseQ bias (t-test); 3) mapQ bias (t-test); 4) tail distance bias (t-test).
FQ     Consensus quality. If positive, FQ equals the phred-scaled probability of there being two or more different alleles. If negative, FQ equals the minus phred-scaled probability of all chromosomes being identical. Notably, given one sample, FQ is positive at hets and negative at homs.
AF1    EM estimate of the site allele frequency of the strongest non-reference allele.
CI95   Equal-tail (Bayesian) credible interval of the site allele frequency at the 95% level.
PC2    Phred-scaled probability of the alternate allele frequency of group1 samples being larger (respectively smaller) than that of group2 samples.
PCHI2  Posterior weighted chi^2 P-value between group1 and group2 samples. This P-value is conservative.
QCHI2  Phred-scaled PCHI2.
RP     Number of permutations yielding a smaller PCHI2.
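Because I16 stores sums and sums of squares, per-site means and standard deviations can be recovered from it. The Python sketch below shows the arithmetic. It assumes that the sums in items 5 to 16 run over the same bases counted in items 1 to 4, which the table does not state explicitly; the function names and the example I16 value are hypothetical.

```python
import math


def mean_and_sd(total, total_sq, n):
    """Mean and population standard deviation from a sum and a sum of squares."""
    if n == 0:
        return None, None
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)  # guard against rounding
    return mean, math.sqrt(variance)


def summarize_i16(i16_value):
    """Turn the comma-separated I16 value into per-allele summaries."""
    v = [float(x) for x in i16_value.split(",")]
    if len(v) != 16:
        raise ValueError("I16 must hold exactly 16 numbers")
    n_ref = v[0] + v[1]  # forward + reverse reference bases
    n_alt = v[2] + v[3]  # forward + reverse non-reference bases
    return {
        "n_ref": int(n_ref),
        "n_alt": int(n_alt),
        "baseQ_ref": mean_and_sd(v[4], v[5], n_ref),
        "baseQ_alt": mean_and_sd(v[6], v[7], n_alt),
        "mapQ_ref": mean_and_sd(v[8], v[9], n_ref),
        "mapQ_alt": mean_and_sd(v[10], v[11], n_alt),
        "tail_ref": mean_and_sd(v[12], v[13], n_ref),
        "tail_alt": mean_and_sd(v[14], v[15], n_alt),
    }


# Hypothetical site: 4 reference bases (all base quality 30) and
# 2 non-reference bases (base qualities 20 and 40).
summary = summarize_i16("3,1,1,1,120,3600,60,2000,240,14400,120,7200,40,500,10,58")
```

In this example the non-reference base qualities have mean 30 and standard deviation 10, recovered from the sum 60 and the sum of squares 2000.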

Performing Association Test

The following command performs an association test:

where `xxx' is a file containing the list of samples, with the first `yyy' samples being cases (or controls) and the rest being controls (or cases). In the output, the PCHI2 INFO field gives the P-value of association. This P-value is known to be conservative, in the sense that the true P-value is probably smaller (more significant). Nonetheless, in practice PCHI2 appears to be a good indicator of associations. On a QQ-plot for negative controls, expected P-values and observed PCHI2 fall on a straight line, though not on the diagonal.

To further calibrate the P-value, one may perform a permutation test with the `-U' option. The `RP' INFO field gives the number of permutations that yield a smaller PCHI2. The permutation test is very slow, due to the genotype ambiguity in sequencing data.
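The text does not say how RP should be turned into a calibrated P-value. A common estimator for permutation tests, shown here as an assumption rather than as what bcftools does, divides the count by the number of permutations, optionally adding one to both to avoid an estimate of exactly zero. The function name and the numbers in the example are hypothetical.

```python
def empirical_p_value(rp, n_permutations, add_one=True):
    """Estimate a permutation P-value from the RP count.

    rp: number of permutations yielding a smaller PCHI2 (the RP INFO field).
    n_permutations: total number of permutations that were run.
    add_one: use (rp + 1) / (n + 1) so the estimate is never exactly 0.
    """
    if n_permutations <= 0 or not 0 <= rp <= n_permutations:
        raise ValueError("need 0 <= rp <= n_permutations and n_permutations > 0")
    if add_one:
        return (rp + 1) / (n_permutations + 1)
    return rp / n_permutations


# Hypothetical site: 4 of 999 permutations gave a smaller PCHI2.
p_site = empirical_p_value(4, 999)
```

The resolution of the estimate is limited by the number of permutations, which is one reason a slow permutation test is costly when small P-values are of interest.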


Estimating the Allele Frequency Spectrum

While calling SNPs, bcftools prints the estimated AFS to the error output on lines starting with [afs]. However, to estimate the AFS accurately, we need to iterate the procedure. In most applications, we are only interested in the AFS conditional on a list of loci. Suppose we have the list of loci in file cond.txt, with the first two columns giving the coordinates. The procedure to estimate the AFS is:

until the AFS converges, which usually takes fewer than 10 rounds of EM iterations. The first command line above extracts sites in cond.txt for efficiency in later steps. Option -P specifies the initial AFS (in SNP calling, this is the prior), which can be a file (as in the 3rd and 4th command lines) or 'full', 'cond2' or 'flat' (as in the 2nd command line). Choosing the right initial AFS helps accuracy and reduces the number of iterations and potential overfitting.

Another way to estimate the AFS is to get the list of site allele frequencies from the AF1 INFO field and then derive the histogram. Note that AF1 does not use the initial AFS, so no iterations are needed. However, the histogram of AF1 is usually noisy. It is up to the user to decide which method to use.
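Deriving the histogram from a list of AF1 values can be sketched in Python as follows. The sketch assumes diploid samples, so n samples give 2n chromosomes, and it bins each site by the nearest whole number of non-reference alleles. The function name and the AF1 values are hypothetical.

```python
from collections import Counter


def afs_from_af1(af1_values, n_chromosomes):
    """Histogram of per-site AF1 values as an allele frequency spectrum.

    Returns a list of length n_chromosomes + 1 whose k-th entry is the
    fraction of sites with about k non-reference alleles.
    """
    if not af1_values:
        raise ValueError("need at least one site")
    counts = Counter(round(af * n_chromosomes) for af in af1_values)
    total = len(af1_values)
    return [counts.get(k, 0) / total for k in range(n_chromosomes + 1)]


# Hypothetical AF1 values from six sites in two diploid samples (4 chromosomes).
af1 = [0.0, 0.24, 0.26, 0.5, 1.0, 0.49]
afs = afs_from_af1(af1, 4)
```

Each site contributes to exactly one bin, so sampling noise in AF1 goes straight into the histogram. The iterative procedure above is the alternative when that noise matters.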


Base Alignment Quality (BAQ)

Base Alignment Quality (BAQ) is a new concept deployed in samtools-0.1.9+. It aims to provide an efficient and effective way to rule out false SNPs caused by nearby INDELs. The following shows the alignments of 6 reads by a typical read mapper in the presence of a 4bp homozygous INDEL:

 coor     12345678901234    5678901234567890123456 
 ref      aggttttataaaac----aattaagtctacagagcaacta
 sample   aggttttataaaacAAATaattaagtctacagagcaacta 
 read1    aggttttataaaac****aaAtaa
 read2     ggttttataaaac****aaAtaaTt
 read3         ttataaaacAAATaattaagtctaca
 read4             CaaaT****aattaagtctacagagcaac
 read5               aaT****aattaagtctacagagcaact
 read6                 T****aattaagtctacagagcaacta

where capital bases represent differences from the reference, `*' marks padding opposite the insertion, and AAAT in sample and read3 are the inserted bases. The alignments except for read3 are wrong because the 4bp insertion is misplaced. The mapper produces such alignments because, when doing a pairwise alignment, it prefers one or two mismatches over a 4bp insertion. Worse, the wrong alignments lead to recurrent mismatches, which are likely to deceive most site-independent SNP callers into calling false SNPs.

One way to avoid such an artifact is to do multi-sequence realignment, but the current implementations are very computationally demanding. SAMtools seeks another solution. It assigns each base a BAQ, which is the Phred-scaled probability of the base being misaligned. BAQ is low if the base is aligned to a different reference base in a suboptimal alignment, and in this case a mismatch should contribute little to SNP calling even if the base quality is high. With BAQ, the mismatches in the example above are significantly downweighted, and SAMtools will not call SNPs from them.

The BAQ strategy is invoked by default in mpileup. To make other SNP callers take advantage of BAQ, one should first cap base quality by BAQ and then give the resulting aln.baq.bam to the SNP callers as the input. For high-coverage single-sample SNP calling, BAQ appears to be as effective as multi-sequence realignment, while being much faster and easier to use. Currently the BAQ strategy is the only practical way to avoid the INDEL artifact in low-coverage multi-sample SNP calling.
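Capping base quality by BAQ amounts to taking, for each base, the smaller of the two values. The Python sketch below shows that step on Phred+33 encoded quality strings. It does not compute BAQ itself; the BAQ string here is a hypothetical input, and the function names are made up for the example.

```python
def decode_phred33(quality_string):
    """Decode a Phred+33 ASCII quality string into integer qualities."""
    return [ord(c) - 33 for c in quality_string]


def cap_by_baq(base_quals, baqs):
    """Cap each base quality by the BAQ of the same base."""
    if len(base_quals) != len(baqs):
        raise ValueError("need one BAQ per base")
    return [min(q, b) for q, b in zip(base_quals, baqs)]


def phred_to_prob(q):
    """Convert a Phred-scaled quality to an error probability."""
    return 10 ** (-q / 10)


# Hypothetical read of three bases: high base quality everywhere, but the
# middle base sits next to a misplaced INDEL and has a low BAQ.
base_quals = decode_phred33("DDD")   # 35, 35, 35
baqs = decode_phred33("]$I")         # 60, 3, 40
capped = cap_by_baq(base_quals, baqs)
```

After capping, the middle base has quality 3, so a mismatch at that position carries an error probability of about 0.5 instead of about 0.0003 and contributes little to SNP calling.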


Limitations

Appendix: Use Cases

SNP/INDEL calling for hundreds of exomes

The following shows the detailed procedure for calling SNPs/INDELs for hundreds of exomes. It only aims to provide an overview of how to handle huge data sets. Some command lines given below may not work on all systems. Advanced users may also want to modify them based on their own system configurations.

In the following, the key and most difficult part is the command line calling samtools mpileup. Once that is done, one can use third-party tools or write their own scripts to do the rest.

Input and preparation

Procedure