See http://htslib.org/ for the new 1.x releases of SAMtools, BCFtools, and HTSlib. This website contains information pertaining to the old 0.1.19 samtools release, and so is useful but somewhat out of date. As time permits, this information will be updated for the new samtools/bcftools versions and moved to the new website.


Open Tasks for SAMtools-C

  1. Duplicate removal (samtools' rmdup is library aware now, but the algorithms should be improved further)
    • Priority: Low
    • Difficulty: Hard
    • Background: Samtools rmdup only works for paired-end reads with both ends mapped to the same chromosome. It does not work for single-end data, for pairs with one end unmapped, or for pairs whose ends map to different chromosomes.
    • Description: Implement a proper rmdup for mixed single-end and paired-end reads that makes better use of the library information. The proposed interface is:
    • Temporary solution for end users: use Picard for rmdup.
  2. Converting phrap ACE and/or CAF and/or AMOS formats to SAM (the development version of AMOS now supports AMOS->SAM conversion)
    • Priority: Low
    • Difficulty: Medium
    • Background: By design, SAM is able to describe de novo assemblies. Once an assembly is converted to SAM, we can build the consensus and view the assembly very efficiently.
    • Description: Implement a converter for the phrap ACE and/or the CAF and/or AMOS assembly format. A perl or python standalone script would be ideal.
  3. Converting the PSL format to SAM
    • Priority: Low
    • Difficulty: Easy
    • Background: PSL is widely used by UCSC. Samtools provides a simple converter, but it only translates coordinates.
    • Description: Implement a proper converter for PSL. A perl/python script would be ideal.
    • Note: It would be better for someone to maintain the converters for other formats. It is hard for one person to keep track of the development of all the aligners.
  4. Converting BAM to FASTQ
    • Priority: Low
    • Difficulty: Easy/Hard (depending on functionality)
    • Description: Ideally a C function with an interface: or a Perl script.
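
For the BAM-to-FASTQ task, the core per-record logic can be sketched in a few lines of Python. This is only an illustration, not the proposed C interface: it works on SAM text lines (for example the output of samtools view) rather than on binary BAM, the function name sam_to_fastq is hypothetical, and it assumes that secondary and supplementary alignments, as well as records without stored sequence or qualities, should simply be skipped. Reads aligned to the reverse strand are reverse-complemented so that the FASTQ shows the read as sequenced.

```python
# Hypothetical helper; operates on SAM text lines, not on binary BAM.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fastq(sam_line):
    """Return a 4-line FASTQ record for one SAM alignment line, or None."""
    fields = sam_line.rstrip("\n").split("\t")
    name, flag = fields[0], int(fields[1])
    seq, qual = fields[9], fields[10]

    # Assumption: skip secondary (0x100) and supplementary (0x800) records,
    # and records with no stored sequence or base qualities.
    if flag & 0x900 or seq == "*" or qual == "*":
        return None

    # Reverse-strand alignments store the reverse complement of the read.
    if flag & 0x10:
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]

    # For paired reads, mark first (0x40) and second (0x80) in pair.
    if flag & 0x1:
        if flag & 0x40:
            name += "/1"
        elif flag & 0x80:
            name += "/2"

    return "@{}\n{}\n+\n{}\n".format(name, seq, qual)
```

The "Hard" variant of the task (for example, writing mates to two synchronised files) needs the two records of a pair to be brought together first, which this sketch does not attempt.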

Closed Tasks for SAMtools-C

  1. Parsing SAM headers (DONE by Petr Danecek)
    • Priority: High
    • Difficulty: Hard
    • Background: Currently samtools only parses part of the SAM header, which causes problems in maintaining a proper header in merging, sorting and several other operations.
    • Description: Write a set of routines that parse the SAM header into a dictionary structure, write the dictionary back out in the SAM header format, merge dictionaries, add or remove tags, and so on. Proposed function prototypes may be:
    • Temporary solution for end users: use Picard for merging.
  2. Generic program/library to support fast record retrieval.
    • Priority: Low for samtools (but will be very useful to the community)
    • Difficulty: Hard
    • Background: Samtools implements an indexable compression format (BGZF) and a novel indexing scheme that typically requires one seek() function call per interval query (i.e. retrieving records overlapping a specified region). These techniques can surely be used for other formats like GLF, VCF and GFF.
    • Description: A practical implementation needs more thought.
  3. Converting samtools pileup output to VCF and implementing VCF filtration
    • Priority: Medium
    • Difficulty: Easy
    • Background: VCF is the new format for storing SNP calls, but samtools does not support it at the moment.
    • Description: A perl/python script to convert the pileup output to VCF and to filter a VCF with rules similar to those of samtools.pl varFilter.
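
To illustrate the header-parsing task (item 1 of the closed tasks), the dictionary idea can be sketched in Python as follows. This is not the C implementation that was contributed; the function names parse_header and format_header are hypothetical, and the layout (record type mapped to a list of ordered tag/value dictionaries, with @CO comment lines kept as plain strings) is just one possible design.

```python
from collections import OrderedDict


def parse_header(text):
    """Parse SAM header text into {record type: [records]}.

    Each record is an OrderedDict of tag -> value, except @CO comment
    lines, which are stored as plain strings.
    """
    header = OrderedDict()
    for line in text.splitlines():
        if not line.startswith("@"):
            continue  # not a header line
        fields = line.split("\t")
        rtype = fields[0][1:]  # e.g. "HD", "SQ", "RG", "PG", "CO"
        if rtype == "CO":
            record = line[4:]  # free text after "@CO" and the tab
        else:
            record = OrderedDict(f.split(":", 1) for f in fields[1:])
        header.setdefault(rtype, []).append(record)
    return header


def format_header(header):
    """Write the dictionary back out as SAM header text."""
    lines = []
    for rtype, records in header.items():
        for record in records:
            if rtype == "CO":
                lines.append("@CO\t" + record)
            else:
                tags = [tag + ":" + value for tag, value in record.items()]
                lines.append("\t".join(["@" + rtype] + tags))
    return "\n".join(lines) + "\n"
```

With this layout, adding a tag is an ordinary dictionary assignment, for example header["RG"][0]["LB"] = "lib1". Note that format_header groups lines by record type, so a header whose line types are interleaved will be reordered on output; merging two headers (for example reconciling @SQ lists and renaming clashing @RG IDs) is not covered by this sketch.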