Duplicate removal (samtools' rmdup is now library-aware, but the algorithm should be improved further)
Background: Samtools rmdup only works for paired-end
reads with both ends mapped to the same chromosome. It does not
work for single-end data, pairs with one end unmapped, or ends mapped to
Description: Implement a proper rmdup for mixed
single-end and paired-end reads that makes better use of library
information. The proposed interface is:
Temporary solution for end users: use Picard for rmdup.
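The duplicate-removal idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not the proposed samtools interface: the Fragment record and both function names are hypothetical, each fragment is assumed to carry one end (single-end, or mate unmapped) or two ends (a pair, possibly on different chromosomes), and the best fragment is assumed to be the one with the highest summed base quality.

```python
from collections import namedtuple

# Hypothetical record for illustration; a real tool would build it from BAM.
# ends: tuple of one or two (chrom, unclipped_5prime_pos, strand) tuples.
Fragment = namedtuple("Fragment", "name library ends qual_sum")


def dup_key(frag):
    """Fragments sharing this key are treated as duplicates of each other."""
    # Sorting the ends makes the key independent of mate order.
    # Keys are per library, so equal coordinates in two libraries never clash.
    return (frag.library, tuple(sorted(frag.ends)))


def remove_duplicates(fragments):
    """Keep one fragment per key: the one with the highest summed quality."""
    best = {}
    for frag in fragments:
        key = dup_key(frag)
        if key not in best or frag.qual_sum > best[key].qual_sum:
            best[key] = frag
    return list(best.values())
```

Because single-end fragments have a one-element key and pairs a two-element key, the two kinds never compete with each other in this sketch. Whether a single-end read should count as a duplicate of a pair that shares one end is a design decision left open here.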
Converting phrap ACE and/or CAF and/or AMOS formats to SAM (the
now supports AMOS->SAM conversion)
Background: By design SAM is able to describe de novo
assemblies. Once an assembly is converted to SAM, we can build
the consensus and view the assembly very efficiently.
Description: Implement a converter for the phrap ACE
and/or the CAF and/or the AMOS assembly formats. A standalone
Perl or Python script would be ideal.
Converting the PSL format to SAM
Background: PSL is widely used by UCSC. Samtools
provides a simple converter, but it only translates coordinates.
Description: Implement a proper converter for PSL. A
Perl/Python script would be ideal.
Note: It would be better for someone to maintain the
converters for other formats. It is hard for one person to keep
track of the development of all the aligners.
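For the PSL converter described above, one thing a fuller conversion could do beyond translating coordinates is build a CIGAR string from the PSL block lists. The sketch below is an assumption about what such a script might look like, not the existing samtools converter. The function name psl_to_sam is hypothetical, only nucleotide (non-translated) PSL lines are handled, and because PSL carries no sequence, SEQ and QUAL are written as "*" with hard clipping for the unaligned query ends.

```python
def psl_to_sam(line):
    """Convert one nucleotide PSL line (21 tab-separated columns) to SAM."""
    f = line.rstrip("\n").split("\t")
    strand, qname, qsize, tname = f[8], f[9], int(f[10]), f[13]
    sizes = [int(x) for x in f[18].rstrip(",").split(",")]
    qstarts = [int(x) for x in f[19].rstrip(",").split(",")]
    tstarts = [int(x) for x in f[20].rstrip(",").split(",")]

    cigar = []
    # On the '-' strand PSL gives qStarts on the reverse-complemented query,
    # which is also how SAM lays out a reverse-strand read.
    if qstarts[0] > 0:
        cigar.append("%dH" % qstarts[0])
    for i, size in enumerate(sizes):
        if i > 0:
            qgap = qstarts[i] - (qstarts[i - 1] + sizes[i - 1])
            tgap = tstarts[i] - (tstarts[i - 1] + sizes[i - 1])
            if qgap > 0:
                cigar.append("%dI" % qgap)  # bases present only in the query
            if tgap > 0:
                cigar.append("%dD" % tgap)  # bases present only in the target
        cigar.append("%dM" % size)
    tail = qsize - (qstarts[-1] + sizes[-1])
    if tail > 0:
        cigar.append("%dH" % tail)

    flag = 16 if strand.startswith("-") else 0
    pos = tstarts[0] + 1  # PSL is 0-based, SAM is 1-based
    return "\t".join([qname, str(flag), tname, str(pos), "255",
                      "".join(cigar), "*", "0", "0", "*", "*"])
```

Long target gaps in spliced alignments would arguably be better written as N than D. That choice, and the mapping quality of 255 (unavailable), are left as simple defaults in this sketch.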
Converting BAM to FASTQ
Difficulty: Easy/Hard (depending on functionality)
Description: Preferably a C function with an interface:
or a Perl script.
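The easy end of this task can be sketched as a per-record conversion working on SAM text lines. The function name sam_to_fastq is hypothetical and this is not the proposed C interface. It assumes that reverse-strand reads should be restored to their original orientation, that mates are marked with /1 and /2 suffixes, and that secondary or supplementary records and records without SEQ or QUAL are skipped.

```python
# Complement table for reverse-complementing reads mapped to the '-' strand.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fastq(sam_line):
    """Return a 4-line FASTQ record for one SAM line, or None if skipped."""
    f = sam_line.rstrip("\n").split("\t")
    name, flag, seq, qual = f[0], int(f[1]), f[9], f[10]
    # Skip secondary (0x100) and supplementary (0x800) records, and records
    # that carry no sequence or no qualities.
    if flag & 0x900 or seq == "*" or qual == "*":
        return None
    if flag & 0x10:
        # SAM stores reverse-strand reads reverse-complemented; undo that.
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    if flag & 0x40:
        name += "/1"
    elif flag & 0x80:
        name += "/2"
    return "@%s\n%s\n+\n%s\n" % (name, seq, qual)
```

The hard end of the task (pairing mates that are far apart in a coordinate-sorted file, or writing two synchronized output files) is not covered by this sketch.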
Closed Tasks for SAMtools-C
Parsing SAM headers (DONE by Petr Danecek)
Background: Currently samtools only parses part of the SAM
header, which makes it hard to maintain a proper header when
merging, sorting and performing several other operations.
Description: Write a set of routines that parse the
SAM header into a dictionary structure, write the dictionary
back into the SAM header format, merge dictionaries, add or remove
tags and so on. Proposed function prototypes may be:
Temporary solution for end users: use Picard for merging.
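The parse, write and merge operations listed above can be illustrated with a small dictionary-based sketch. It is not the C API that was eventually implemented. All three function names are hypothetical, and the merge rule is an assumption: keep the first @HD line, and deduplicate @SQ by SN and @RG/@PG by ID.

```python
# Tag that identifies a record within its type (assumed merge key).
_ID_TAG = {"SQ": "SN", "RG": "ID", "PG": "ID"}


def parse_header(text):
    """Parse SAM header text into {record_type: [{tag: value}, ...]}."""
    header = {}
    for line in text.splitlines():
        if not line.startswith("@"):
            continue
        fields = line.split("\t")
        rtype = fields[0][1:]
        if rtype == "CO":  # free-text comment, no TAG:VALUE structure
            rec = {"text": "\t".join(fields[1:])}
        else:
            rec = dict(field.split(":", 1) for field in fields[1:])
        header.setdefault(rtype, []).append(rec)
    return header


def format_header(header):
    """Write the dictionary back out in SAM header format."""
    lines = []
    for rtype, records in header.items():
        for rec in records:
            if rtype == "CO":
                lines.append("@CO\t" + rec["text"])
            else:
                tags = "\t".join("%s:%s" % kv for kv in rec.items())
                lines.append("@%s\t%s" % (rtype, tags))
    return "".join(line + "\n" for line in lines)


def merge_headers(a, b):
    """Merge b into a copy of a, dropping records already present in a."""
    merged = {rtype: list(recs) for rtype, recs in a.items()}
    for rtype, recs in b.items():
        if rtype == "HD" and "HD" in merged:
            continue  # keep only the first @HD line
        out = merged.setdefault(rtype, [])
        id_tag = _ID_TAG.get(rtype)
        seen = {r.get(id_tag) for r in out} if id_tag else set()
        for rec in recs:
            if id_tag and rec.get(id_tag) in seen:
                continue
            out.append(rec)
            if id_tag:
                seen.add(rec.get(id_tag))
    return merged
```

A real merge would also have to deal with two files that use the same @RG ID for different read groups. This sketch simply keeps the first one.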
Generic program/library to support fast record retrieval.
Priority: Low for samtools (but will be very useful to
Background: Samtools implements an indexable
compression format (BGZF) and a novel indexing scheme that
typically requires one seek() function call per interval query
(i.e. retrieving records overlapping a specified region). These
techniques can surely be used for other formats like GLF, VCF
Description: A practical implementation needs more thought.
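The indexing scheme itself is not spelled out in this list. As one illustration of what a format-independent library would need, the SAM specification describes a hierarchical binning scheme in which each record is assigned to the smallest bin that fully contains it. The sketch below ports that bin calculation to Python. Whether a generic retrieval library would reuse exactly this layout is an assumption.

```python
def reg2bin(beg, end):
    """Bin number for the 0-based, half-open interval [beg, end).

    Follows the binning layout given in the SAM specification:
    bin 0 spans 512 Mbp, and each further level splits a bin into 8,
    down to 16 kbp bins at the finest level.
    """
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins
    return 0
```

Any format whose records have a sequence name plus start and end coordinates could be binned this way, which is what makes the approach a candidate for formats other than BAM.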
Converting samtools pileup output to VCF and implementing VCF
Background: VCF is the new format for storing SNP calls, but
samtools does not support it at the moment.
Description: A Perl/Python script to convert the pileup
output to VCF and to filter a VCF with rules similar to