See http://htslib.org/ for the new 1.x
releases of SAMtools, BCFtools, and HTSlib. This website contains information
pertaining to the old 0.1.19 samtools release, and so is useful but somewhat
out of date. As time permits, this information will be updated for the new
samtools/bcftools versions and moved to the new website.
Open Tasks for SAMtools-C
Duplicate removal (samtools' rmdup is library aware now, but the algorithms should be improved further)
Priority: Low
Difficulty: Hard
Background: Samtools rmdup only works for paired-end
reads with both ends mapped to the same chromosome. It does not
work for single-end data, for pairs with one end unmapped, or for
pairs whose ends map to different chromosomes.
Description: Implement a proper rmdup for mixed
single-end and paired-end reads that makes better use of the
library information. The proposed interface is:
Temporary solution for end users: use Picard for duplicate removal.
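The grouping idea behind a library-aware rmdup for mixed single-end and
paired-end data can be sketched as follows. This is a minimal illustration,
not the proposed samtools interface: the record layout (plain dicts with
`library`, `chrom`, `pos5`, `strand`, `score`, and an optional `mate`) and
the function names are assumptions made for the example. Single-end reads
and pairs are grouped separately here, which a fuller implementation would
probably refine.

```python
from collections import defaultdict

def duplicate_key(read):
    """Build a grouping key; reads with equal keys are treated as duplicates.

    `read` is a hypothetical dict describing one sequenced fragment. `pos5` is
    the 5' mapping coordinate and `mate` is None for single-end reads or when
    the other end is unmapped.
    """
    end1 = (read["chrom"], read["pos5"], read["strand"])
    mate = read.get("mate")
    if mate is None:
        return (read["library"], "SE", end1)
    end2 = (mate["chrom"], mate["pos5"], mate["strand"])
    # Sort the two ends so the key does not depend on which end is listed
    # first; the ends may sit on different chromosomes.
    return (read["library"], "PE", *sorted([end1, end2]))

def mark_duplicates(reads):
    """Return the names of reads to discard, keeping the best-scoring read
    in each group of duplicates."""
    groups = defaultdict(list)
    for read in reads:
        groups[duplicate_key(read)].append(read)
    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["score"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates
```

Because the library name is part of the key, reads from different libraries
that map to the same position are never collapsed into one.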
Converting phrap ACE and/or CAF and/or AMOS formats to SAM (the
development version of AMOS now supports AMOS->SAM conversion)
Priority: Low
Difficulty: Medium
Background: By design SAM is able to describe de novo
assemblies. Once an assembly is converted to SAM, we can build
the consensus and view the assembly very efficiently.
Description: Implement a converter for the phrap ACE
and/or the CAF and/or AMOS assembly format. A perl or python
standalone script would be ideal.
Converting the PSL format to SAM
Priority: Low
Difficulty: Easy
Background: PSL is widely used by UCSC. Samtools
provides a simple converter, but it only translates coordinates.
Description: Implement a proper converter for PSL. A
perl/python script would be ideal.
Note: It would be better for someone to maintain the
converters for other formats. It is hard for one person to keep
track of the development of all the aligners.
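For the PSL conversion described above, going beyond coordinate translation
mainly means deriving a CIGAR string from the PSL block lists. The sketch
below assumes the standard 21-column, tab-separated nucleotide PSL layout
(no header lines, no translated alignments). The function name is
hypothetical, and the choice of hard clipping and of mapping quality 255
("unavailable") are assumptions made because PSL carries neither the read
sequence nor a mapping quality.

```python
def psl_to_sam(psl_line):
    """Convert one nucleotide PSL line into a SAM line without SEQ/QUAL."""
    f = psl_line.rstrip("\n").split("\t")
    strand, qname, qsize = f[8], f[9], int(f[10])
    tname, tstart = f[13], int(f[15])
    # The three block lists are comma-terminated, e.g. "50,40,".
    sizes = [int(x) for x in f[18].rstrip(",").split(",")]
    qstarts = [int(x) for x in f[19].rstrip(",").split(",")]
    tstarts = [int(x) for x in f[20].rstrip(",").split(",")]

    cigar = []
    # For '-' strand hits qStarts are given on the reverse-complemented
    # query, which is also the orientation SAM uses, so no flipping is needed.
    if qstarts[0] > 0:
        cigar.append(f"{qstarts[0]}H")
    for i, size in enumerate(sizes):
        if i > 0:
            qgap = qstarts[i] - (qstarts[i - 1] + sizes[i - 1])
            tgap = tstarts[i] - (tstarts[i - 1] + sizes[i - 1])
            if qgap > 0:
                cigar.append(f"{qgap}I")  # bases present only in the query
            if tgap > 0:
                cigar.append(f"{tgap}D")  # bases present only in the target
        cigar.append(f"{size}M")
    tail = qsize - (qstarts[-1] + sizes[-1])
    if tail > 0:
        cigar.append(f"{tail}H")

    flag = 16 if strand == "-" else 0
    # PSL tStart is zero-based; SAM POS is one-based.
    return "\t".join([qname, str(flag), tname, str(tstart + 1), "255",
                      "".join(cigar), "*", "0", "0", "*", "*"])
```

Long target gaps in spliced alignments might be better written as `N` than
`D`; that distinction is left out of this sketch.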
Converting BAM to FASTQ
Priority: Low
Difficulty: Easy/Hard (depending on functionality)
Description: Preferably a C function with an interface:
or a Perl script.
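The easy end of this task can be illustrated on SAM text records, such as
those printed by `samtools view`. The sketch below is not the C interface
mentioned above. It restores the original read orientation for
reverse-strand alignments and tags the two ends of a pair; the function
name and the `/1`, `/2` naming suffixes are assumptions. The harder parts
(regrouping mates from a coordinate-sorted file, recovering hard-clipped
bases) are not addressed.

```python
# Translation table for complementing bases; other characters pass through.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def sam_to_fastq(sam_line):
    """Convert one SAM alignment line to a FASTQ record.

    Returns None for secondary (0x100) and supplementary (0x800) alignments
    so that each read is emitted only once.
    """
    f = sam_line.rstrip("\n").split("\t")
    name, flag, seq, qual = f[0], int(f[1]), f[9], f[10]
    if flag & 0x900:
        return None
    if flag & 0x10:
        # SAM stores reverse-strand reads reverse-complemented; undo that.
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    if flag & 0x40:
        name += "/1"
    elif flag & 0x80:
        name += "/2"
    return f"@{name}\n{seq}\n+\n{qual}\n"
```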
Closed Tasks for SAMtools-C
Parsing SAM headers (DONE by Petr Danecek)
Priority: High
Difficulty: Hard
Background: Currently samtools only parses part of the SAM
header, which causes problems in maintaining a proper header in
merging, sorting and several other operations.
Description: Write a set of routines that parse the
SAM header into a dictionary structure, write the dictionary
into the SAM header format, merge dictionaries, add or remove
tags and so on. Proposed function prototypes may be:
Temporary solution for end users: use Picard for merging.
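The operations listed in the description (parse, write back, merge) can be
sketched on a simple in-memory representation. This is an illustration
only and does not reflect the C routines that were actually implemented.
The function names and the list-of-records layout are assumptions, and the
merge shown here only drops exact duplicates. A real merge also has to
reconcile @SQ ordering and colliding @RG or @PG identifiers.

```python
def parse_header(text):
    """Parse SAM header text into a list of (record_type, body) pairs.

    body is a dict of TAG -> value, except for @CO lines, where it is the
    comment string.
    """
    header = []
    for line in text.splitlines():
        if not line.startswith("@"):
            continue
        rtype, *fields = line.split("\t")
        if rtype == "@CO":
            header.append(("CO", "\t".join(fields)))
        else:
            header.append((rtype[1:], dict(fld.split(":", 1) for fld in fields)))
    return header

def format_header(header):
    """Write the parsed structure back in SAM header format."""
    lines = []
    for rtype, body in header:
        if rtype == "CO":
            lines.append("@CO\t" + body)
        else:
            tags = [f"{tag}:{value}" for tag, value in body.items()]
            lines.append("\t".join(["@" + rtype] + tags))
    return "\n".join(lines) + "\n"

def merge_headers(first, second):
    """Append records of `second` that are not already present in `first`."""
    merged = list(first)
    for record in second:
        if record not in merged:
            merged.append(record)
    return merged
```

With this layout, adding or removing a tag is an ordinary dict operation on
the body of the relevant record.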
Generic program/library to support fast record retrieval.
Priority: Low for samtools (but would be very useful to
the community)
Difficulty: Hard
Background: Samtools implements an indexable
compression format (BGZF) and a novel indexing scheme that
typically requires one seek() function call per interval query
(i.e. retrieving records overlapping a specified region). These
techniques can surely be used for other formats like GLF, VCF
and GFF.
Description: A practical implementation needs more thought.
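Part of what makes the indexing scheme reusable for other formats is that
its binning step depends only on interval coordinates, not on the record
format. The sketch below is a Python transcription of the binning
calculation as published in the SAM specification, for zero-based,
half-open intervals. It covers only the bin arithmetic; the linear index
and the BGZF virtual file offsets that turn bins into seek positions are
left out.

```python
def reg2bin(beg, end):
    """Return the smallest bin that fully contains [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return 4681 + (beg >> 14)  # 16 kb bins
    if beg >> 17 == end >> 17:
        return 585 + (beg >> 17)   # 128 kb bins
    if beg >> 20 == end >> 20:
        return 73 + (beg >> 20)    # 1 Mb bins
    if beg >> 23 == end >> 23:
        return 9 + (beg >> 23)     # 8 Mb bins
    if beg >> 26 == end >> 26:
        return 1 + (beg >> 26)     # 64 Mb bins
    return 0                       # the single 512 Mb bin

def reg2bins(beg, end):
    """List every bin that may hold records overlapping [beg, end)."""
    end -= 1
    bins = [0]
    for offset, shift in ((1, 26), (9, 23), (73, 20), (585, 17), (4681, 14)):
        bins.extend(range(offset + (beg >> shift), offset + (end >> shift) + 1))
    return bins
```

A record is stored under `reg2bin` of its own interval, and a query only
needs to inspect the bins returned by `reg2bins` for the requested region.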
Converting samtools pileup output to VCF and implementing VCF
filtration
Priority: Medium
Difficulty: Easy
Background: VCF is the new format for storing SNP calls, but
samtools does not support that at the moment.
Description: A perl/python script to convert the pileup
output to VCF and to filter a VCF with the rules similar to
samtools.pl varFilter.
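A conversion along these lines might look like the sketch below. It assumes
the consensus pileup layout of the 0.1.x releases (chromosome, position,
reference base, IUPAC consensus base, consensus quality, SNP quality,
maximum mapping quality, depth, then read bases and qualities), with indel
lines marked by `*` in the reference column. The function name and the
single `min_snp_qual` threshold are hypothetical and much simpler than the
varFilter rules.

```python
# IUPAC consensus codes expanded to the two alleles of a diploid genotype.
IUPAC = {"A": "AA", "C": "CC", "G": "GG", "T": "TT", "M": "AC", "R": "AG",
         "W": "AT", "S": "CG", "Y": "CT", "K": "GT"}

def pileup_to_vcf(line, min_snp_qual=20):
    """Convert one consensus pileup line to a VCF data line.

    Returns None for indel lines, uncalled sites, sites matching the
    reference, and SNPs below the quality threshold.
    """
    f = line.rstrip("\n").split("\t")
    chrom, pos, ref, cns = f[0], f[1], f[2].upper(), f[3].upper()
    snp_qual, depth = int(f[5]), int(f[7])
    if ref == "*" or cns not in IUPAC:
        return None
    alleles = IUPAC[cns]
    alts = sorted(set(alleles) - {ref})
    if not alts or snp_qual < min_snp_qual:
        return None
    # VCF genotype indices: 0 is REF, 1.. follow the ALT order.
    order = [ref] + alts
    gt = "/".join(str(i) for i in sorted(order.index(a) for a in alleles))
    return "\t".join([chrom, pos, ".", ref, ",".join(alts), str(snp_qual),
                      "PASS", f"DP={depth}", "GT", gt])
```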