Suppose we have reference sequences in ref.fa, indexed by samtools faidx, and position sorted alignment files aln1.bam and aln2.bam, the following command lines call SNPs and short INDELs: where the -D option sets the maximum read depth to call a SNP. SAMtools acquires sample information from the SM tag in the @RG header lines. One alignment file can contain multiple samples; reads from one sample can also be distributed in different alignment files. SAMtools will regroup the reads anyway. In addition, if no @RG lines are present, each alignment file is taken as one sample.
Since r865, it is possible to generate the consensus sequence with
In the command line above, samtools collects summary information in the input BAMs, computes the likelihood of data given each possible genotype and stores the likelihoods in the BCF format (see below). It does not call variants.
Bcftools applies the prior and does the actual calling. It can also concatenate BCF files, index BCFs for fast random access and convert BCF to VCF. In addition, bcftools can operate on some VCFs (e.g. calling SNPs from GL-tagged VCFs), but not for all VCFs; VCF to BCF conversion is not working at the moment, either.
One should consider to apply the following parameters to mpileup in different scenarios:
The Variant Call Format (VCF) is the emerging standard for storing variant data. Originally designed for SNPs and short INDELs, it also works for structural variations.
VCF consists of a header section and a data section. The header must contain a line starting with one '#', showing the name of each field, and then the sample names starting at the 10th column. The data section is TAB delimited with each line consisting of at least 8 mandatory fields (the first 8 fields in the table below). The FORMAT field and sample information are allowed to be absent. We refer to the official VCF spec for a more rigorous description of the format.
|2||POS||1-based position. For an indel, this is the position preceding the indel.|
|3||ID||Variant identifier. Usually the dbSNP rsID.|
|4||REF||Reference sequence at POS involved in the variant. For a SNP, it is a single base.|
|5||ALT||Comma delimited list of alternative seuqence(s).|
|6||QUAL||Phred-scaled probability of all samples being homozygous reference.|
|7||FILTER||Semicolon delimited list of filters that the variant fails to pass.|
|8||INFO||Semicolon delimited list of variant information.|
|9||FORMAT||Colon delimited list of the format of individual genotypes in the following fields.|
|10+||Sample(s)||Individual genotype information defined by FORMAT.|
BCF, or the binary variant call format, is the binary version of VCF. It keeps the same information in VCF, while much more efficient to process especially for many samples. The relationship between BCF and VCF is similar to that between BAM and SAM. The detailed format description of the BCF format can be found in bcf.tex included in the samtools source code package.
SAMtools/BCFtools may write the following tags in the INFO field in VCF/BCF.
|INDEL||Indicating the variant is an INDEL.|
|DP||The number of reads covering or bridging POS.|
|DP4||Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles, used in variant calling. Sum can be smaller than DP because low-quality bases are not counted.|
|PV4||P-values for 1) strand bias (exact test); 2) baseQ bias (t-test); 3) mapQ bias (t); 4) tail distance bias (t)|
|FQ||Consensus quality. If positive, FQ equals the phred-scaled probability of there being two or more different alleles. If negative, FQ equals the minus phred-scaled probability of all chromosomes being identical. Notably, given one sample, FQ is positive at hets and negative at homs.|
|AF1||EM estimate of the site allele frequency of the strongest non-reference allele.|
|CI95||Equal-tail (Bayesian) credible interval of the site allele frequency at the 95% level.|
|PC2||Phred-scaled probability of the alternate allele frequency of group1 samples being larger (,smaller) than of group2 samples.|
|PCHI2||Posterior weighted chi^2 P-value between group1 and group2 samples. This P-value is conservative.|
|RP||Number of permutations yeilding a smaller PCHI2|
The following command performs association test: where `xxx' is a file containing the list of samples with the first `yyy' samples being cases (or controls) and the rest being controls (or cases). In the output, the PCHI2 INFO field gives the P-value of association. This P-value is known to be conservative, in the sense that the true P-value is probably smaller (more significant). Nonetheless, in practice PCHI2 appears to be a good indicator of associations. On a QQ-plot for negative controls, expected P-value and observed PCHI2 fall on a straight line, though not on the diagonal.
To further calibrate P-value, one may perform a permutation test with the `-U' option. The `RP' INFO field will give the number of permutations which yield a smaller PCHI2 test statistics. Permutation test is very slow, due to the genotype ambiguity in sequencing data.
While calling SNPs, bcftools will print the estimated AFS to the error output at lines starting with [afs]. However, to accurately estimate AFS, we need to iterate the procedure. In most applications, we are only interested in the AFS conditional on a list of loci. Suppose we have the list of loci in file cond.txt with the first two columns in the file giving the coordinates, the procedure to estimate AFS is: until the AFS converges, which usually takes less than 10 rounds of EM iterations. The first command line above extracts sites in cond.txt for efficiency in later steps. Option -P specifies the initial AFS (in SNP calling, this is prior), which can be a file (as in the 3rd and 4th command lines) or 'full', 'cond2' or 'flat' (as in the 2nd command line). Choosing the right initial AFS helps accuracy and reduces iterations and potential overfitting.
Another way to estimate AFS is to get the list of site allele frequency by and then derive the histogram. It should be noted that AF1 does not use the initial AFS. No iterations are needed. However, the histogram of AF1 usually has a lot of noises. It is up to the users to decide which method to use.
Base Alignment Quality (BAQ) is a new concept deployed in samtools-0.1.9+. It aims to provide an efficient and effective way to rule out false SNPs caused by nearby INDELs. The following shows the alignments of 6 reads by a typical read mapper in the presence of a 4bp homozygous INDEL:
coor 12345678901234 5678901234567890123456 ref aggttttataaaac----aattaagtctacagagcaacta sample aggttttataaaacAAATaattaagtctacagagcaacta read1 aggttttataaaac****aaAtaa read2 ggttttataaaac****aaAtaaTt read3 ttataaaacAAATaattaagtctaca read4 CaaaT****aattaagtctacagagcaac read5 aaT****aattaagtctacagagcaact read6 T****aattaagtctacagagcaacta
where capital bases represent differences from the reference and underlined bases are the inserted bases. The alignments except for read3 are wrong because the 4bp insertion is misplaced. The mapper produces such alignments because when doing a pairwise alignment, the mapper prefers one or two mismatches over a 4bp insertion. What is hurting more is that the wrong alignments lead to recurrent mismatches, which are likely to deceive most site-independent SNP callers into calling false SNPs.
One way to avoid such an artifact is to do multi-sequence realignment, but the current implementations are very computationally demanding. SAMtools seeks another solution. It assigns each base a BAQ which is the Phred-scaled probability of the base being misaligned. BAQ is low if the base is aligned to a different reference base in a suboptimal alignment, and in this case a mismatch should contribute little to SNP calling even if the base quality is high. With BAQ, the mismatches in the example above are significantly downweighted. SAMtools will not call SNPs from that.
The BAQ strategy is invoked by default in mpileup. To make other SNP callers take advantage of BAQ, one should run to cap base quality by BAQ and then give aln.baq.bam to the SNP callers as the input. For high-coverage single-sample SNP calling, BAQ appears to be as effective as multi-sequence realignment, while being much faster and easier to use. Currently the BAQ strategy is the only practical way to avoid the INDEL artifact in low-coverage multi-sample SNP calling.
The following shows the detailed procedure on how to call SNPs/INDELs for hundreds of exomes. It only aims to provide an overview of how to handle huge data sets. Some command lines given below may not work for all systems. Advanced users may also want to modify based on their own system configurations.
In the following, the key and the most difficult part is the command line calling samtools mpileup. Once that is done, one can use 3rd party tools or write their own scsripts to achieve the rest.