samtools / hts-specs Goto Github PK

Specifications of SAM/BAM and related high-throughput sequencing file formats

Home Page: http://samtools.github.io/hts-specs/

TeX 97.81% Makefile 0.29% HTML 0.13% CSS 0.10% Shell 0.14% Dockerfile 0.04% JavaScript 0.10% Perl 1.02% Jupyter Notebook 0.38%

hts-specs's Introduction

samtools

This is the official development repository for samtools.

The original samtools package has been split into three separate but tightly coordinated projects:

htslib: C-library for handling high-throughput sequencing data
samtools: mpileup and other tools for handling SAM, BAM, CRAM
bcftools: calling and other tools for handling VCF, BCF

Building Samtools

See INSTALL for complete details. Release tarballs contain generated files that have not been committed to this repository, so building the code from a Git repository requires extra steps:

autoheader            # Build config.h.in (this may generate a warning about
                      # AC_CONFIG_SUBDIRS - please ignore it).
autoconf -Wno-syntax  # Generate the configure script
./configure           # Needed for choosing optional functionality
make
make install

By default, this will build against an HTSlib source tree in ../htslib. You can alter this to a source tree elsewhere or to a previously-installed HTSlib by configuring with --with-htslib=DIR.

Citing

Please cite this paper when using SAMtools for your publications.

Twelve years of SAMtools and BCFtools
Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li
GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008

@article{10.1093/gigascience/giab008,
    author = {Danecek, Petr and Bonfield, James K and Liddle, Jennifer and Marshall, John and Ohan, Valeriu and Pollard, Martin O and Whitwham, Andrew and Keane, Thomas and McCarthy, Shane A and Davies, Robert M and Li, Heng},
    title = "{Twelve years of SAMtools and BCFtools}",
    journal = {GigaScience},
    volume = {10},
    number = {2},
    year = {2021},
    month = {02},
    abstract = "{SAMtools and BCFtools are widely used programs for processing and analysing high-throughput sequencing data. They include tools for file format conversion and manipulation, sorting, querying, statistics, variant calling, and effect analysis amongst other methods.The first version appeared online 12 years ago and has been maintained and further developed ever since, with many new features and improvements added over the years. The SAMtools and BCFtools packages represent a unique collection of tools that have been used in numerous other software projects and countless genomic pipelines.Both SAMtools and BCFtools are freely available on GitHub under the permissive MIT licence, free for both non-commercial and commercial use. Both packages have been installed \\&gt;1 million times via Bioconda. The source code and documentation are available from https://www.htslib.org.}",
    issn = {2047-217X},
    doi = {10.1093/gigascience/giab008},
    url = {https://doi.org/10.1093/gigascience/giab008},
    note = {giab008},
    eprint = {https://academic.oup.com/gigascience/article-pdf/10/2/giab008/36332246/giab008.pdf},
}

hts-specs's People

Contributors

Stargazers

Watchers

Forkers

peterjc wtsi-hgi heuermh jdiez htc502 beaumontlab jkbonfield carsonhh laserson pcingola rpoplin gdtm86 pditommaso kortschak l337x911 mdavy86 frankmusacchia mp15 karel-brinda lindenb dslituiev weese gabrielnicolasavellaneda simexin lbergelson cyenyxe bjpop limbus-medtec yanzhaowu semal diekhans cyang-2014 boratonaj vicular neonaldo xingwenhao92 heathsc cooleel drashley lh3 knausb zilhua jmarshall geckolml bicycle1885 realityjunky alenzhao anderspitman jenniferliddle hy714335634 superligen jeromekelleher xiongxu daviesrob wulj2016 pd3 polojacky amilamanoj sunthink thefferon scchess sher10 pmarks pryvkin10x melonfish haoziyeung gangchen bioinformagik dkj cltidwe juzheng87 nunofonseca nathanweeks xiangarima ctsa siniperca-chuatsi d-cameron blankenberg klmr hefferon liulei6696 namratha2913 pythseq dayedepps tomwhite darrenldl smaegol sunbymoon labonaboat raskoleinonen lairdm jlotempiojr dlaehnemann thomascolthurst jmmut hanluu cmnbroad enformatik nh13 sogada

hts-specs's Issues

CRAM: Multi-base and multi-quality events.

Checking the recent commits to make CRAMv3.0.tex I see disagreements between the specification and the original CRAMv3_proposed_changes.txt file.

Specifically the proposal was:

1. New read features:
Code Data type Data series Description
A byte[], byte[] BA, QS a stretch of bases and quality scores
b byte[] BA a stretch of bases
q byte[] QS a stretch of quality scores

The tex document now contains:

Bases & a (0x61) & byte[ ] & BA & a stretch of bases\tabularnewline
\hline
Scores & b (0x62) & byte[ ] & QS & a stretch of scores\tabularnewline
\hline
Bases and scores & A (0x41) & byte[ ],byte[ ] & BA,QS & A a stretch of bases and
quality scores score\tabularnewline
\hline

I don't particularly object to the change of b to a and q to b, but am wondering if this was deliberate change or an accident? My prototype implementation for this code using b/q rather than a/b. Also for some reason I didn't implement 'A' at all. I can't think why, unless we decided it was superfluous. Obviously I can add it, but only if either of us feel a need for it. Do you use the A feature at all for your CRAM3 implementation?

Re "Representing variation in VCF records", multiallelic variants and haplotypes

Hi,

I've a question about the examples in the VCF 4.2 spec (same as in 4.1 I think). The first example is

#CHROM POS ID REF ALT QUAL FILTER INFO 
20  3   .   C   G   .   PASS    DP=100 
20  2   .   TC  T   .   PASS    DP=100 
20  2   .   TC  TCA .   PASS    DP=100

The third variant here (20:2 TC->TCA) seems to me to be better repesented as

20  3   .   C  CA .   PASS    DP=100

since the ref base T as 20:2 doesn't change. The minimal common denominator is the C at 20:3.

The next example

#CHROM POS ID REF ALT QUAL FILTER INFO
20  2   .   TC  TG,T    .   PASS    DP=100

Could (should?) be represented as two variants

#CHROM POS ID REF ALT QUAL FILTER INFO
20  2   .   TC  T    .   PASS    DP=100 ## deletetion of C
20  3   .   C  G    .   PASS    DP=100 ## SNP C->G

The complex example

#CHROM POS ID REF ALT QUAL FILTER INFO
20  2   .   TCG TG,T,TCAG   .   PASS    DP=100

would then be

#CHROM POS ID REF ALT QUAL FILTER INFO
20  2   .   TCG TG,T   .   PASS    DP=100 ## the deletions
20  3   .   C   CA  .   PASS    DP=100 ## insertion of A at after the C at 20:3

The common theme to these changes is to evaluate each alt allele with the ref separately to determine the POS of that allele. REF/ALT are then trimmed to the minimal common base. This means that matching variants between VCFs become easier since different VCFs can have different ALT alleles in them.

I guess the overall question here is how a locus (in "2. Understanding the VCF format and the haplotype representation") is defined. To me, this is ambigous, and the spec should strive towards an unambigous represtentation. Evaluating each ALT separately does this, but can make haplotypes end up on separate lines in a VCF.

Thanks!

Daniel

spec: add full reference to Kent et al. (2002)

A footnote with a full reference to Kent et al. Gen. Res. (2002)12:996-1006, and pointer to the relevant part of the paper would be helpful.

Reference contains IUPAC ambiguity codes but not legal in REF or ALT for VCF

Affects: VCF 4.2 and prior versions
The reference contains IUPAC ambiguity codes:
$ samtools faidx ../../ref/GCA_000001405.15_GRCh38_full_analysis_set.fna chr3:16902883-16902883
>chr3:16902883-16902883
B

dbSNP contains IUPAC codes because the reference does for example entry rs56708014:

chr3    16902883    rs56708014  B   BGC .   .   RS=56708014;RSPOS=16902887;dbSNPBuildID=129;SSR=0;SAO=0;VP=0x050000080005000002000200;WGT=1;VC=DIV;INT;ASP;OTHERKG

However the VCF spec says: "REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive)" and for ALT "Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (“”) or a breakend replacement string as described in the section on breakends". Unfortunately this conflict means that InDels or SNPs that contain ref sites with IUPAC codes are not legally representable in VCF.

More than 65535 cigar opcodes

There was discussion on the samtools-dev mailing list about this last year:

http://sourceforge.net/p/samtools/mailman/message/30672431/

The main crux of the discussion there was to reuse the 16bit bin field to act as the top 16 bits of ncigar, possibly using the top flag bit as an indicator.

There are some other discussions (internal) regarding this, including possibly removing bin completely given that it has no meaning with CSI indices, so this issue is largely just a note to track the issue and collate ideas/fixes.

Note that the problem is definitely a real one. I have hit this with real projects, caused when a user merged consensus sequences from an assembly into a BAM file, but it is also not too far away with actual sequence reads too. Newer technologies (PacBio, ONT, maybe more) offer substantially longer read lengths but also with higher indel rates leading to substantially more cigar opcodes. A 320Kb with with a 10% indel rate would lead to 2 changes (D/I to M) every 10 bases, giving 64k cigar ops. (Those aren't figures for real technologies, but it's not inconceivable.)

String vectors in BCF

BCF2 does not directly support vectors of strings, because strings themselves are already vectors of characters. Instead, a list of strings is collapsed into a single comma-separated string and encoded as a regular BCF2 vector of characters. For efficiency reasons, such a string is prepended with ','. For example, the vector of strings "A,B,C,D" is therefore internally represented as a single string ",A,B,C,D".

This looks like an unnecessary complication, because the information about a tag being a scalar or a vector is already present in the header via the Number=number field.

Add PM tag to @RG header

Hi,

As PL is a restricted list could PM, 'Platform Model', be added to the RG header definition? This would be an unrestricted text field allowing for more detailed information about the machine generating the data.

Regards,
Keiran

wrong footnotes in SAM spec

on page 6 and 7 of SAMv1.pdf
the footnotes for MD and CT tags are inverted

Floats vs doubles in VCF and BCF

While the VCF file format does not specify the range of numeric types, BCF format allows maximum 32-bit integers and 32-bit floats. The general consensus is that 32-bit integers are sufficient but opinions about floats vary. Please see this wiki page for further details:
http://github.com/samtools/hts-specs/wiki/floats-vs-doubles-in-VCF-BCF

Issues with definition/use of "read" and "segment"

Dear all,

We have been going through the SAM specification with a fine comb, and
found a few issues.

The main issue is that there seems to be a confusion about ‘read’ and
‘segment’ at various places, and most importantly, the constituent parts
of a chimeric alignment are not currently chained together, so that to
find all records of a chimeric alignment you’ll have to read the entire
BAM file.

Another, smaller issue is that flag bit 0x1 appears to be superfluous
given the state of 0x40 and 0x80; and in the proposed spec, you cannot
work out whether a template has multiple reads (rather than segments)
contributing to it, other than by recovering all records.

Just to be clear, we think that the intended concepts of template, read
and segment are:

template: the physical molecule that has been sequenced;
read: a unit of sequence information that came off the sequencing
machine;
segment: a unit of sequence information, part of a read, that can be
contiguously aligned (with only short indels) to a reference sequence;

and that a SAM record refers to an alignment of a segment.

Here is a list of proposed edits to the spec:

Page 2, section 1.2:

Edit: Linear alignment: an alignment of a segment to a single
reference sequence ….

Page 2, section 1.2:

Edit: Chimeric alignment: an alignment of a read that cannot be represented as a single linear alignment. […] All the SAM records in a chimeric alignment have the same QNAME and the same
values for 0x40 and ox80 tags (see Section 1.4)
The 0x40 and 0x80 flag definitions refer to segments rather than reads, so different segments should be able to have the appropriate flags.

Page 2, section 1.2:

Edit: Multiple mapping: replace “read” by “segment” throughout.

Page 4, section 1.4:

Col 7, RNEXT […] Ref. name of the ~~mate /~~ next segment
Having RNEXT/PNEXT point to the next segment rather than the next read, allows all SAM records related to a single read to be efficiently recovered from a large BAM file. In addition, the concept of a "mate" is never defined, and it refers to a read not a segment.

Col 8, PNEXT […] Position of the ~~mate /~~ next segment

Page 4, section 1.4. As it stands, bit 0x1 signifies whether a template
has multiple segments. If it does not, the first segment is also the
last, and therefore bits 0x40 and 0x80 are both set; and if they’re both
set, there can only be one segment. Therefore bit 0x1 is redundant. If
its meaning is changed into “template having multiple reads in
sequencing”, it allows users to determine whether a template has
multiple reads; this information can be inferred from the whole set of
chimeric reads, but these first need to be obtained from the BAM file.
In addition, the proposed meaning is identical to the current meaning
of this bit.

Page 4, section 1.4. For each read ~~/ contig~~ in a
SAM file, it is required that one and only one […].
The concept of a "contig" is not defined.

Page 5, section 1.4. “If 0x40 and 0x80 are both set, the read is part of
a linear template, but is neither the first nor the last read”.

This does not seem to make sense — this redefines the meaning of 0x40 ==
0x80 == 1 to be the same as 0x40 == 0x80 == 0, for no reason that is
apparent to us. It also clashes with the proposed redefinition of the
meaning of the 0x1 bit.

Page 5, section 1.4, “…This may happen for a chimeric alignment or
if the index is lost in data processing”.

Page 5, section 1.4, “If 0x1 is unset, no assumption can be made about
0x2, 0x8, 0x20, 0x40 and 0x80”. If the proposed redefinition of the
meaning of 0x1 is accepted, this line should be removed altogether, as
single reads can still consist of multiple segments.

Page 5, section 1.4. “RNEXT: Reference sequence name of the primary
alignment of the NEXT segment in the template. For the last
segment, the next segment is the first segment in the
template. […]

Page 5, “PNEXT: Postition of the primary alignment of the NEXT
segment in the template. […]

Gerton and Daniel.

{tabix,csi}: specs are incomplete

The specifications for tabix and csi are incomplete, currently being variously only essentially a struct layout with no semantic explanation or restrictions, or a description of the underlying storage format. Files in the wild in CSI and tabix (produced by samtools) include data that is not described in the specification at all.

It is clear from examination of .csi files that they are stored as BGZF (why?), although this is not mentioned and is at odds with the current behaviour of BAI.
Both formats include a stats dummy bin, though this is not mentioned in either spec. In CSI indexes it is possible for valid bins to be at or above 0x924a (in the SAM spec this is explained as: "bin number 37450 (which is beyond the normal range)", but it is not explained how an index stats dummy bin should be encoded in CSI with values for min_shift/depth that, currently legally, break this invariant of BAI (e.g. indexes with a depth >=6 - if this is not allowed it should be specified).
TBI has potentially conflicting fields. The spec does not describe the precedence of these fields (the format field may conflict with col_{seq,beg,end}, meta and skip).
The format field of tabix has a 0-based half-open flag, but the semantics of this flag are not explained: is it an indication to the client or to the tabix handling code?

Typo in conceptual BAI example table

At https://github.com/samtools/hts-specs/blob/master/SAMv1.tex#L963 s/1 (96--144kbp)/3 (96--144kbp)/.

Allowing missing values in BCF vectors

The following extension to BCF has been agreed on the vcftools-spec mailling list which introduces end-of-vector byte to allow missing values in vectors. This is the proposed text which will appear in BCFv2.2:

"For integer types, the values 0x80, 0x8000, 0x80000000 are interpreted
as missing values and 0x81, 0x8001, 0x80000001 as end-of-vector
indicators. Similarly for floats, the value of 0x7F800001 is
interpreted as a missing value and 0x7F800002 as the end-of-vector
indicator. The end-of-vector byte is not part of the vector."

SAM: Clarification of primary vs secondary reads

When we have secondary reads, do we need to have TLEN defined?

The recommended practices section under "multiple mappings" states: RNEXT and PNEXT point to the primary line of the next read in the template.

This doesn't cover what TLEN should be on secondary reads. I assume it is computed from primary reads only, but mirrored over to secondary reads, or is it simply that the field is undefined for secondary reads? The specification doesn't explicitly state that TLEN comes from only primary alignments either, although maybe it is implied in "observed template length" with the assumption that in our observations we'd not look at secondary mappings.

VCF: Ambiguity in ##contig specification

The meta "contig" field format is specified as:

As with chromosomal sequences it is highly recommended (but not required) that the header include tags describing the contigs referred to in the VCF file. This furthermore allows these contigs to come from different files. The format is identical to that of a reference sequence, but with an additional URL tag to indicate where that sequence can be found. For example:

##contig=<ID=ctg1,URL=ftp://somewhere.org/assembly.fa,...>

But all the examples follow this format:

##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens">

From the first paragraph, one would infer that the contig is optional, but the rest is quite ambiguous. Could you please clarify the following and, if possible, include the answers in the next version of the spec?

Whether the ellipsis after URL represents "whatever you want"
Whether URL is optional
Whether the "long style" (length, assembly, md5, species) is a replacement for just the URL

Request for clearer documentation to explain bitwise FLAG values, ideally with documentation

If you are new to bioinformatics, and are asked to work with any SAM file, then you might reasonably turn to this documentation to help to better understand how the format works.

I feel that many people have trouble understanding what is meant by bitwise FLAG values. The documentation is very technical and not very transparent to people who may be new to bioinformatics.

Many people might be turning to the documentation after looking at their SAM output file. Maybe they see that their output file has a range of integer values in column 2 and are puzzled by the explanation in the documentation (this is very likely if you have no familiarity with bit patterns).

I think this section would be greatly helped by the following:

A reminder that the SAM file itself stores an integer value (this is mentioned in the overview section for what all the columns mean, but it is not as obvious as it could be)
An explicit description that a bitwise value of zero means that your read has mapped to the forward strand of the reference (t is not intuitive to work this out because the 'segment unmapped' bit has not been set)
Some specific examples that explain what various integer values correspond to.

VCFv4.3 branch: what version of BCF is being described?

The document subheading says "VCFv4.3 and BCFv2.2", but the BCF Header subsection still has

The BCF2 header begins with the ``BCF2 magic'' 5 bytes that encode
{\tt BCF\em XY} where {\em X} and {\em Y} are bytes indicating the major
number (currently 2) and the minor number (currently 1).

Add support for supplementary alignment flag in CRAM3

What does "CRAM" mean ?

Hello,

sorry if I missed the obvious, but I could not find what CRAM means. Isn't it an acronym ?

Have a nice day,

Charles Plessy
Tsurumi, Kanagawa, Japan

Multiple values

Hi. I have a question concerning my vcf files. In some rows I have multiple values in some of the columns. For example, in the GI column, there are some rows that read

BRCA2

and in some others I get

BRCA1,BRCA1,BRCA1,BRCA1,BRCA1,BRCA1

Could someone explain me why this happens? This also happens with some other more important columns

Missense_V162G,Silent

SAM spec: define placed

The term placed is undefined but used.

flag for first line in example should be 99 not 163

Apologies if I've got this wrong, but I think the SAM/BAM format spec has a mistake. The first alignment line in the example SAM file (first page of the spec) assigns 163 to the mapping that is labeled "r001/1" in the alignment shown above the example. The problem is that 163 is the code for the second member of a pair, but the "/1" implies that "r001/1" is the first member of the pair and "r001/2" is the second member. The code for a first-member, forward-strand, both-mapped paired read is 99. So I think the paired reads should be 99 (first r001 line) and 147 (last r001 line) rather than 163 and 83 as given.

Unify VCF spec

At the moment, there are separate VCFv4.1 and VCFv4.2 specs with very little difference between them -- and it is hard to figure out by reading them just what the difference between them is.

Having two near-duplicate documents also means that typos have to be fixed twice, etc.

It would be better if these were unified into a single document, so that the common things could be seen to be common. And it would note where there are differences between the point-releases, e.g., it would say that Number=R is a 4.2 feature.

VCF: explain the order of Number=G fields more clearly

Explain more clearly what j and k stand for in the formula F(j/k)=(k*(k+1)/2)+j.
Mention that j<=k is assumed.

Standardise the description of Reference and Alternative alleles in sample details

At the moment various variant callers use a wide range of different formats and methods for describing the number of reference and alternative alleles seen at a site for each sample.

Freebayes - Uses RO,AO - two fields one for Ref observations one for Alt observations
GATK - Uses AD one field with the Ref & Alt alleles comma separated
Platypus - Uses NR, NA - With NR being number of reads and NA being number of alts thus Ref = NR-NA

These annotations are useful for a range of different use cases, as they give some indication of what evidence the variant caller made the call off. Which when combined with additional information from the population or pedigree structure can allow for later improvements to the variant call with out going back to the BAM files.

As such it would be useful if a standard way of describing this information could be decided on and included into the VCF spec, this would greatly simplify the use of VCF files after variant calling.

INFO keys section intro repeat itself

this bit has a rephrasing left in, i like the second version fwiw.

When the INFO keys reserved for encoding structural variants are used for imprecise variants, the values should be best estimates. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles).

The following INFO keys are reserved for encoding structural variants. In general, when these keys are used by imprecise variants, the values should be best estimates. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles).

SAM: PLatform field in @RG for Oxford Nanopore

The SAM @rg header has a PL attribute for the platform. This needs updating to support more instruments, eg Oxford Nanopore.

Should it be PL:ONT, NANPORE, OXFORD_NANOPORE? I think I prefer ONT as the easiest yet descriptive one. (There could be other nanopore vendors in time.)

Modify VCF to support GRC assembly model (GRCh37, GRCh38, GRCm38)

There are a couple of limitations to the current VCF that make it difficult to fully represent data using the full GRC assemblies, GRCh37, GRCh38 and GRCm38 specifically. These are:

lack of representation of the relationship between alternate alleles and their chromosome locations (i.e. maintaining the allelic relationships of the alternate sequences).
inability for an ID to have >1 location. While this is a valid requirement within a specific assembly unit, it needs to be relaxed when describing data on the full assembly as a feature can validly be on the Primary assembly as well as an alternate locus.
To get around this issue, for example, NCBI emits 2 VCF files- one for the Primary assembly and one for the alternate sequences. This also modifies how the ID column is used (and it is used differently in the two files as some rsIDs can be on more than 1 alternate locus in a biologically consistent way).

This issue was discussed at a workshop put on by the GRC at Genome Informatics 2014 and there are a series of proposals we'd like to put forth. A set of coherent examples can be found here: http://www.slideshare.net/GenomeRef/variant-calling-ii

Augment VCF header with more specific assembly information:

##seq-info=<name=chr17, id=CM000679.2>
##region-info=<name=MAPT, id=GL000258.2, assoc_id=CM000679.2, reg=45309498-46836265>

Introduce new reserved VCF Info tags:

##INFO=<ID=ALTLOCS, Number=.,Type=String,Description=“A list of the alternate
loci in the reference genome that are associated with this locus”>

##INFO=<ID=ALTHAPS, Number=.,Type=String,Description=“A list of the known
haplotypes that are associated with this locus”>

##FORMAT=<ID=HT,Number=1,Type=String,Description=“Haplotype combination based on ALTHAPS">

There may be additional improvements/suggestions that can be made, but these seem like a reasonable start. Making this types of modifications will be an important part of helping groups migrate to GRCh38.

CRAM: unclear spec for BYTE_ARRAY_LEN

These lines don't quite make sense:

hts-specs/CRAMv2.1.tex

Lines 1622 to 1635 in c923d4f

 Byte arrays are captured as a sequence of bytes teminated by a special stop byteFor 

 example this could be a golomb encoding. The parameter for BYTE\_ARRAY\_STOP are 

 listed below: 

 \begin{tabular}{|>{\raggedright}p{100pt}|>{\raggedright}p{100pt}|>{\raggedright}p{230pt}|} 

 \hline 

 \textbf{Data type} & \textbf{Name} & \textbf{Comment} 

 \tabularnewline 

 \hline 

 byte & stop byte & a special byte treated as a delimiter\tabularnewline 

 \hline 

 itf8 & external id & id of an external block containing the byte stream\tabularnewline 

 \hline 

 \end{tabular}

The parameters don't include any encoding, though they should according to the informal description.

Format of alleles in VCF

The format of alleles in VCF is not formally defined in the specifications. In particular, the following edge cases exist in the current specifications:
A*A* is a valid alt allele but should not.
A<contig>A is unclear as to whether it is (or should be) valid.

I propose the following grammar productions as a formal definition for the format of alleles:

ref:
 base-string

alt:
 allele
 allele , alt

allele:
 base-string
 missing-allele
 symbolic-allele

missing-allele:
 *

symbolic-allele:
 id-string
 symbolic-insertion
 breakend
 breakpoint

symbolic-insertion:
 base id-string
 id-string base
 id-string

breakend:
 base-string null-allele
 null-allele base-string

breakpoint:
 base-string [ breakpoint-reference [
 [ breakpoint-reference [ base-string
 base-string ] breakpoint-reference ]
 ] breakpoint-reference ] base-string

breakpoint-reference:
 contig-reference
 contig-reference : digits

contig-reference:
 id-string
 contig-identifier

digits:
 digit
 digits digit

digit: one of
 0 1 2 3 4 5 6 7 8 9 0

base: one of
 A C G T N a c g t n

base-string:
 base
 base-string base

id-string:
 < contig-identifier >

contig-identifier:
 string-containing-no-whitespace-or-colon

contig-identifier is a problematic definition. Currently the spec allows [ ] < > . * as part of contig identifiers. Inclusion of the brackets as valid characters are especially likely to cause difficulties with implementations as alleles such as N[<[>[>[ are currently valid, and the string is an ambiguous reference either to a (contig-identifier) reference contig "" or a (id-string) named contig CHR.

sam spec change: ambiguous reverse complement bit for unmapped reads

In the sam spec http://samtools.github.io/hts-specs/SAMv1.pdf, it says the following about the flag:

0x10 SEQ being reverse complemented

and further down in the notes:

Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no
assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10, 0x100
and 0x800, and the bit 0x20 of the previous read in the template.

Thus for an unmapped read, we dont know if the reverse complement flag is valid. This has important implications for extracting the original read data from a bam file, since we dont know whether unmapped reads are represented as-is in the bam, or have been reverse complemented. It would be fairly simple to add the requirement that 0x10 is always valid, regardless of 0x4.

If I understand the bwa code correctly, the reverse complement flag is always set for a reverse complemented read (though confusingly, that flag takes on the same value of its mate if its mate is mapped, reverse complementing the read unnecessarily). If we updated the spec, bwa would already be within spec. I am not sure about other aligners.

In the communities I work with bam files are ubiquitous for storing sequencing data, and with good reason since they are smaller than compressed fastqs. For people relying on bams as a storage format, and wishing to use unmapped reads to predict, for instance, structural variants, it is important the bams allows for consistent extraction of the read sequences as they came off the sequencer.

Clarify when VCF meta entries are mandatory

The specifications states the following:

It is strongly encouraged that information lines describing the INFO, FILTER and FORMAT entries used in the body of the VCF file be included in the meta-information section. Although they are optional, if these lines are present then they must be completely well-formed.

But vcftools at least creates an error when they are not present. I found it in the code (http://sourceforge.net/p/vcftools/code/HEAD/tree/trunk/perl/Vcf.pm#l2366), not from the console output. What would be the preferred behaviour?

For ALT metadata describing alleles such as <DEL> or <INS>, nothing is stated and vcftools doesn't raise an error if they are not described. Should I assume these are optional too? Anyway, if some are mandatory but others don't it would be interesting to highlight this difference.

Request: block-oriented positioning for BCF spec

The GATK single-sample pipeline produces GVCF files with reference blocks, which indicate that they correspond to a range of genomic locations, rather than a single position. In order to make these files conform to the VCF spec, this is implemented using the INFO field "END". Here's an example showing a reference block spanning 1:12141-1:12277.

1       12141   .       C       <NON_REF>       .       .       END=12277       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0

We'd now like to move this process to BCF, and we're starting out by using the same approach. However, we've noticed a significant performance difference between reading/writing top-level BCF record elements like POS as opposed to INFO fields. It would help us a great deal if there were a field like POS to record the end of a block, perhaps BLOCK_END. In the usual case of single-location positions, this field would be unused.

Short Tandem Repeats

Can there be a specification be added for Short Tandem Repeats? Similar to what lobSTR uses.

CRAM: slice header record_counter too small

Section 7.5 defines slice header record_counter as itf8. This implicitly is a 32-bit signed integer (ltf8 is the 64-bit one), which in practice means 2 billion sequences before the number overflows.

That's a large file, but not impossibly large. I broke the 1 billion fragment BAM file in production in 2009 and I'd be suprised if people haven't gone an order of magnitude higher than that since then.

We should try and squeeze this fix in before CRAM v3.0 goes live.

SAM/BAM sorting order clarification

Reading samtools source code revealed that the queryname sorting order is far from being lexicographic: https://github.com/samtools/samtools/blob/develop/bam_sort.c#L13

It would also be good to require reads with same position to be sorted by strand.

VCF: Clarify GL Genotype Field Ordering

The description of the GL field says that:

If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j.

It is unclear to me how one is supposed to interpret this equation as defining an ordering of anything. What do j and k represent? Why are we dividing one by the other? What does the value returned by the F function represent?

I think what is happening is that j represents the number of the first allele of a diploid pair (with 0 for the ref allele) k represents the number of the second, the "/" is really denoting that the argument of the function is the unphased genotype composed from those two alleles, and the result of the function is the index in the GL array at which the likelihood of that genotype is to be found. If that is the case, this should be described more clearly in the spec. If that is not the case, this should definitely be described more clearly in the spec.

Furthermore, for triploid or higher sites, the spec merely says genotype likelihoods should appear in "the canonical order". What order is that, exactly?

VCF: Phasing and Alt Loci

If I want to call a SNP at human chr6_GL000255v2_alt:1,144,634, and phase my (potentially haploid) call at that SNP for a sample with other calls on the main chr6, how would I express that in VCF? The spec says that phase sets are restricted to things with matching "chromosome" fields, so I can't put both the variants into a phase set, and anyway the first allele in this haploid call might be phased with the second allele in the diploid call on chr6.

Do I need to wire up the appropriate copy of chr6 to the alt locus with a breakend that's phased at both ends?

Basically, the issue is that the spec should specify how to describe variants that fall into the alt loci which are suddenly very common in GRCh38, especially with regard to phasing. Perhaps with an example.

SAM: "next" vs "next primary".

For RNEXT and PNEXT fields we explicitly state the next primary alignment. It is a bit unclear what the interpretation of these fields on a secondary alignment, but that's not a big issue.

For FLAG field the 0x8 and 0x20 bits mention "next segment" with no qualification of it being the next primary segment. Clearly we need primary data as if we are looking at the first read in a pair then we use FLAG, RNEXT and PNEXT jointly to obtain information about the "other end". If RNEXT/PNEXT refer to the other end but FLAG may potentially not then these fields lose their usefulness.

I believe the table of bits in 1.4 should be modified as s/next segment/next primary segment/;

Describe @HD's GO field

Could 'collated' be added as a valid sort order for reads? In many cases it isn't important to fully sort by name only bring pairs together and it would be useful if this was easy to detect from the header.

The CIGAR B operator is absent from the spec.

(SAM specification)

Not only does it need to be documented, but we need to codify all the various constraints. The code already disallows B going beyond the bounds of the sequence (1M2B1M) but what about B in conjunction with other cigar ops, like soft-clipping, reference skips, or indels?

I'm not sure if the forum discussions ever got that far.

1-1 correspondence for ploidy greater than 2.

For ploidy 2, the VCF specification gives the ordering F(j/k) = (k*(k+1)/2)+j. This works for any number of alleles as long as ploidy is 2.

Is there a suggested ordering for variants where the number of alleles is less than the ploidy? Say ploidy = 5 and n_allele = 3?

String Encoding

VCF should either have an option in the header to choose a specific string encoding (e.g: utf-8, latin-1, ascii) with a default option set, or it should be documented which encoding VCF should be in.

clarify section of SAM1 spec explaining combining bins with linear index

In the text of SAMv1.tex:

Given a region [rbeg,rend), we only need to visit a chunk whose
end file offset is larger than the file offset of the 16kbp window
containing rbeg.

clarify whether the given region, [rbeg,rend) is a query or an alignment. If the region is a query the statement is not true since we need to visit all chunks whose end file offset is larger than the file offset of the 16kbp window containing any position in [rbeg,rend) since we may have alignments in the specified region that are not in the first tile of the linear index.

Questions raised by the proposed VCFv4.3 encoding changes

I'm in favor of the change to make VCF encoding UTF-8 and to add URL/percent encoding of INFO and FORMAT values. However, these changes do raise some practical issues:

UTF-8 encoding is not propagated to BCF. The draft spec continues to refer to ASCII encoded strings, which would require application of percent coding to store UTF-8 strings. The definition of Character still states that, "[a]s with Strings, UNICODE characters are not supported." Should the definition of scalar Character encoding be changed to support code points < 127 and String encoding be changed to unrestricted UTF-8?
URL/percent coding is defined in multiple standards, some only supporting ASCII, some any 8-bit character set, and different versions for UTF-8 and other multi-byte encodings. It would be helpful if the specs referred to a specific standard for URL/percent coding (e.g. RFC 3986 or the variant that encodes higher code points as %uxxxx as in ECMA-262).

SAM spec versioning

The @HD::VN tag tells the "format version" for the SAM file. It has to match \d+.\d+ . For example, samtools sort writes "VN:1.3" to files it creates.

Prior to commit 321f786 , the SAM spec had its version and revision written in the title. It would say something like "The SAM Format Specification (v1.3-r624)". Now, there's no version number, but there is a git commit hash. However, that cannot be written into the SAM file.

So, if I produce a SAM file that I'm sure conforms to some copy of the spec, how should I indicate that in the VN tag? Alternatively, if a SAM file says its version is 1.5, how do I find the corresponding spec?

Integer conversion from BAM to SAM

The footnote on integer values of a B-type array says that all integer types should be mapped to int32_t in SAM. However this is not possible with uint32_t elements since it is not possible to encode them with signed 32 bit integers. What should be done for instance when converting from BAM to SAM?

BAM: undefined segment sequence

The current SAM spec states in section 1.4 that for SAM the cigar string does not need to reflect the segment sequence if the segment sequence is undefined. At the time being there seems to be no way of denoting an undefined segment sequence in BAM, which is a problem when converting a SAM line with a defined cigar field but undefined query sequence to BAM. I would propose to define an empty (l_seq == 0) query sequence as undefined in section 4.2.

Best,
German

In section 5.2.4 Ref?

Should REF in section 5.2.4 be GCG instead of CGC? (VCF format)

BAM Index Ref Seq Length Limitation Hit with Wheat

Our group at ACPFG work with wheat. The 3B chromosome has been released as a psuedomolecule (http://wheat-urgi.versailles.inra.fr/Projects/3BSeq). The length of this sequence is within the length supported by the SAM specification (2³¹-1 or ~2.1Gbp) but is larger than the upper limit imposed by the current BAM indexing spec (2²⁹-1 or ~536Mbp).

The following thread from 2011 tries to clarify the reason for this difference and to have the differences more explicitly stated within the SAM spec:
http://sourceforge.net/p/samtools/mailman/message/28141607/

However, we have actually now hit this limit with the 3B psuedomolecule and it is causing problems. Can someone provide an update on whether the BAM indexing spec can/will be brought in line with the upper limit defined in the SAM spec?

	Byte arrays are captured as a sequence of bytes teminated by a special stop byteFor
	example this could be a golomb encoding. The parameter for BYTE\_ARRAY\_STOP are
	listed below:

	\begin{tabular}{\|>{\raggedright}p{100pt}\|>{\raggedright}p{100pt}\|>{\raggedright}p{230pt}\|}
	\hline
	\textbf{Data type} & \textbf{Name} & \textbf{Comment}
	\tabularnewline
	\hline
	byte & stop byte & a special byte treated as a delimiter\tabularnewline
	\hline
	itf8 & external id & id of an external block containing the byte stream\tabularnewline
	\hline
	\end{tabular}

samtools / hts-specs Goto Github PK

hts-specs's Introduction

samtools

Building Samtools

Citing

hts-specs's People

Contributors

Stargazers

Watchers

Forkers

hts-specs's Issues

Recommend Projects

Recommend Topics

Recommend Org