samtools / hts-specs Goto Github PK
View Code? Open in Web Editor NEWSpecifications of SAM/BAM and related high-throughput sequencing file formats
Home Page: http://samtools.github.io/hts-specs/
Specifications of SAM/BAM and related high-throughput sequencing file formats
Home Page: http://samtools.github.io/hts-specs/
At the moment, there are separate VCFv4.1 and VCFv4.2 specs with very little difference between them -- and it is hard to figure out by reading them just what the difference between them is.
Having two near-duplicate documents also means that typos have to be fixed twice, etc.
It would be better if these were unified into a single document, so that the common things could be seen to be common. And it would note where there are differences between the point-releases, e.g., it would say that Number=R
is a 4.2 feature.
Apologies if I've got this wrong, but I think the SAM/BAM format spec has a mistake. The first alignment line in the example SAM file (first page of the spec) assigns 163 to the mapping that is labeled "r001/1" in the alignment shown above the example. The problem is that 163 is the code for the second member of a pair, but the "/1" implies that "r001/1" is the first member of the pair and "r001/2" is the second member. The code for a first-member, forward-strand, both-mapped paired read is 99. So I think the paired reads should be 99 (first r001 line) and 147 (last r001 line) rather than 163 and 83 as given.
The current SAM spec states in section 1.4 that for SAM the cigar string does not need to reflect the segment sequence if the segment sequence is undefined. At the time being there seems to be no way of denoting an undefined segment sequence in BAM, which is a problem when converting a SAM line with a defined cigar field but undefined query sequence to BAM. I would propose to define an empty (l_seq == 0) query sequence as undefined in section 4.2.
Best,
German
Could 'collated' be added as a valid sort order for reads? In many cases it isn't important to fully sort by name only bring pairs together and it would be useful if this was easy to detect from the header.
At the moment various variant callers use a wide range of different formats and methods for describing the number of reference and alternative alleles seen at a site for each sample.
Freebayes - Uses RO,AO - two fields one for Ref observations one for Alt observations
GATK - Uses AD one field with the Ref & Alt alleles comma separated
Platypus - Uses NR, NA - With NR being number of reads and NA being number of alts thus Ref = NR-NA
These annotations are useful for a range of different use cases, as they give some indication of what evidence the variant caller made the call off. Which when combined with additional information from the population or pedigree structure can allow for later improvements to the variant call with out going back to the BAM files.
As such it would be useful if a standard way of describing this information could be decided on and included into the VCF spec, this would greatly simplify the use of VCF files after variant calling.
The description of the GL field says that:
If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j.
It is unclear to me how one is supposed to interpret this equation as defining an ordering of anything. What do j
and k
represent? Why are we dividing one by the other? What does the value returned by the F
function represent?
I think what is happening is that j
represents the number of the first allele of a diploid pair (with 0 for the ref allele) k
represents the number of the second, the "/" is really denoting that the argument of the function is the unphased genotype composed from those two alleles, and the result of the function is the index in the GL array at which the likelihood of that genotype is to be found. If that is the case, this should be described more clearly in the spec. If that is not the case, this should definitely be described more clearly in the spec.
Furthermore, for triploid or higher sites, the spec merely says genotype likelihoods should appear in "the canonical order". What order is that, exactly?
Can there be a specification be added for Short Tandem Repeats? Similar to what lobSTR uses.
on page 6 and 7 of SAMv1.pdf
the footnotes for MD and CT tags are inverted
Section 7.5 defines slice header record_counter as itf8. This implicitly is a 32-bit signed integer (ltf8 is the 64-bit one), which in practice means 2 billion sequences before the number overflows.
That's a large file, but not impossibly large. I broke the 1 billion fragment BAM file in production in 2009 and I'd be suprised if people haven't gone an order of magnitude higher than that since then.
We should try and squeeze this fix in before CRAM v3.0 goes live.
The format of alleles in VCF is not formally defined in the specifications. In particular, the following edge cases exist in the current specifications:
A*A*
is a valid alt allele but should not.
A<contig>A
is unclear as to whether it is (or should be) valid.
I propose the following grammar productions as a formal definition for the format of alleles:
ref:
base-string
alt:
allele
allele , alt
allele:
base-string
missing-allele
symbolic-allele
missing-allele:
*
symbolic-allele:
id-string
symbolic-insertion
breakend
breakpoint
symbolic-insertion:
base id-string
id-string base
id-string
breakend:
base-string null-allele
null-allele base-string
breakpoint:
base-string [ breakpoint-reference [
[ breakpoint-reference [ base-string
base-string ] breakpoint-reference ]
] breakpoint-reference ] base-string
breakpoint-reference:
contig-reference
contig-reference : digits
contig-reference:
id-string
contig-identifier
digits:
digit
digits digit
digit: one of
0 1 2 3 4 5 6 7 8 9 0
base: one of
A C G T N a c g t n
base-string:
base
base-string base
id-string:
< contig-identifier >
contig-identifier:
string-containing-no-whitespace-or-colon
contig-identifier is a problematic definition. Currently the spec allows [ ] < > . * as part of contig identifiers. Inclusion of the brackets as valid characters are especially likely to cause difficulties with implementations as alleles such as N[<[>[>[ are currently valid, and the string is an ambiguous reference either to a (contig-identifier) reference contig "" or a (id-string) named contig CHR.
When we have secondary reads, do we need to have TLEN defined?
The recommended practices section under "multiple mappings" states: RNEXT and PNEXT point to the primary line of the next read in the template.
This doesn't cover what TLEN should be on secondary reads. I assume it is computed from primary reads only, but mirrored over to secondary reads, or is it simply that the field is undefined for secondary reads? The specification doesn't explicitly state that TLEN comes from only primary alignments either, although maybe it is implied in "observed template length" with the assumption that in our observations we'd not look at secondary mappings.
Hi,
As PL is a restricted list could PM, 'Platform Model', be added to the RG header definition? This would be an unrestricted text field allowing for more detailed information about the machine generating the data.
Regards,
Keiran
Affects: VCF 4.2 and prior versions
The reference contains IUPAC ambiguity codes:
$ samtools faidx ../../ref/GCA_000001405.15_GRCh38_full_analysis_set.fna chr3:16902883-16902883
>chr3:16902883-16902883
B
dbSNP contains IUPAC codes because the reference does for example entry rs56708014:
chr3 16902883 rs56708014 B BGC . . RS=56708014;RSPOS=16902887;dbSNPBuildID=129;SSR=0;SAO=0;VP=0x050000080005000002000200;WGT=1;VC=DIV;INT;ASP;OTHERKG
However the VCF spec says: "REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive)" and for ALT "Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (“”) or a breakend replacement string as described in the section on breakends". Unfortunately this conflict means that InDels or SNPs that contain ref sites with IUPAC codes are not legally representable in VCF.
BCF2 does not directly support vectors of strings, because strings themselves are already vectors of characters. Instead, a list of strings is collapsed into a single comma-separated string and encoded as a regular BCF2 vector of characters. For efficiency reasons, such a string is prepended with ','. For example, the vector of strings "A,B,C,D" is therefore internally represented as a single string ",A,B,C,D".
This looks like an unnecessary complication, because the information about a tag being a scalar or a vector is already present in the header via the Number=number field.
The document subheading says "VCFv4.3 and BCFv2.2", but the BCF Header subsection still has
The BCF2 header begins with the ``BCF2 magic'' 5 bytes that encode
{\tt BCF\em XY} where {\em X} and {\em Y} are bytes indicating the major
number (currently 2) and the minor number (currently 1).
At https://github.com/samtools/hts-specs/blob/master/SAMv1.tex#L963 s/1 (96--144kbp)/3 (96--144kbp)/.
Dear all,
We have been going through the SAM specification with a fine comb, and
found a few issues.
The main issue is that there seems to be a confusion about ‘read’ and
‘segment’ at various places, and most importantly, the constituent parts
of a chimeric alignment are not currently chained together, so that to
find all records of a chimeric alignment you’ll have to read the entire
BAM file.
Another, smaller issue is that flag bit 0x1 appears to be superfluous
given the state of 0x40 and 0x80; and in the proposed spec, you cannot
work out whether a template has multiple reads (rather than segments)
contributing to it, other than by recovering all records.
Just to be clear, we think that the intended concepts of template, read
and segment are:
and that a SAM record refers to an alignment of a segment.
Here is a list of proposed edits to the spec:
Page 2, section 1.2:
Edit: Linear alignment: an alignment of a segment to a single
reference sequence ….
Page 2, section 1.2:
Edit: Chimeric alignment: an alignment of a read that cannot be represented as a single linear alignment. […] All the SAM records in a chimeric alignment have the same QNAME and the same
values for 0x40 and ox80 tags (see Section 1.4)
The 0x40 and 0x80 flag definitions refer to segments rather than reads, so different segments should be able to have the appropriate flags.
Page 2, section 1.2:
Edit: Multiple mapping: replace “read” by “segment” throughout.
Page 4, section 1.4:
Col 7, RNEXT […] Ref. name of the mate / next segment
Having RNEXT/PNEXT point to the next segment rather than the next read, allows all SAM records related to a single read to be efficiently recovered from a large BAM file. In addition, the concept of a "mate" is never defined, and it refers to a read not a segment.
Col 8, PNEXT […] Position of the mate / next segment
Page 4, section 1.4. As it stands, bit 0x1 signifies whether a template
has multiple segments. If it does not, the first segment is also the
last, and therefore bits 0x40 and 0x80 are both set; and if they’re both
set, there can only be one segment. Therefore bit 0x1 is redundant. If
its meaning is changed into “template having multiple reads in
sequencing”, it allows users to determine whether a template has
multiple reads; this information can be inferred from the whole set of
chimeric reads, but these first need to be obtained from the BAM file.
In addition, the proposed meaning is identical to the current meaning
of this bit.
Page 4, section 1.4. For each read / contig in a
SAM file, it is required that one and only one […].
The concept of a "contig" is not defined.
Page 5, section 1.4. “If 0x40 and 0x80 are both set, the read is part of
a linear template, but is neither the first nor the last read”.
This does not seem to make sense — this redefines the meaning of 0x40 ==
0x80 == 1 to be the same as 0x40 == 0x80 == 0, for no reason that is
apparent to us. It also clashes with the proposed redefinition of the
meaning of the 0x1 bit.
Page 5, section 1.4, “…This may happen for a chimeric alignment or
if the index is lost in data processing”.
Page 5, section 1.4, “If 0x1 is unset, no assumption can be made about
0x2, 0x8, 0x20, 0x40 and 0x80”. If the proposed redefinition of the
meaning of 0x1 is accepted, this line should be removed altogether, as
single reads can still consist of multiple segments.
Page 5, section 1.4. “RNEXT: Reference sequence name of the primary
alignment of the NEXT segment in the template. For the last
segment, the next segment is the first segment in the
template. […]
Page 5, “PNEXT: Postition of the primary alignment of the NEXT
segment in the template. […]
Gerton and Daniel.
this bit has a rephrasing left in, i like the second version fwiw.
When the INFO keys reserved for encoding structural variants are used for imprecise variants, the values should be best estimates. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles).
The following INFO keys are reserved for encoding structural variants. In general, when these keys are used by imprecise variants, the values should be best estimates. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles).
There was discussion on the samtools-dev mailing list about this last year:
http://sourceforge.net/p/samtools/mailman/message/30672431/
The main crux of the discussion there was to reuse the 16bit bin field to act as the top 16 bits of ncigar, possibly using the top flag bit as an indicator.
There are some other discussions (internal) regarding this, including possibly removing bin completely given that it has no meaning with CSI indices, so this issue is largely just a note to track the issue and collate ideas/fixes.
Note that the problem is definitely a real one. I have hit this with real projects, caused when a user merged consensus sequences from an assembly into a BAM file, but it is also not too far away with actual sequence reads too. Newer technologies (PacBio, ONT, maybe more) offer substantially longer read lengths but also with higher indel rates leading to substantially more cigar opcodes. A 320Kb with with a 10% indel rate would lead to 2 changes (D/I to M) every 10 bases, giving 64k cigar ops. (Those aren't figures for real technologies, but it's not inconceivable.)
The specifications states the following:
It is strongly encouraged that information lines describing the INFO, FILTER and FORMAT entries used in the body of the VCF file be included in the meta-information section. Although they are optional, if these lines are present then they must be completely well-formed.
But vcftools at least creates an error when they are not present. I found it in the code (http://sourceforge.net/p/vcftools/code/HEAD/tree/trunk/perl/Vcf.pm#l2366), not from the console output. What would be the preferred behaviour?
For ALT metadata describing alleles such as <DEL> or <INS>, nothing is stated and vcftools doesn't raise an error if they are not described. Should I assume these are optional too? Anyway, if some are mandatory but others don't it would be interesting to highlight this difference.
Checking the recent commits to make CRAMv3.0.tex I see disagreements between the specification and the original CRAMv3_proposed_changes.txt file.
Specifically the proposal was:
1. New read features:
Code Data type Data series Description
A byte[], byte[] BA, QS a stretch of bases and quality scores
b byte[] BA a stretch of bases
q byte[] QS a stretch of quality scores
The tex document now contains:
Bases & a (0x61) & byte[ ] & BA & a stretch of bases\tabularnewline
\hline
Scores & b (0x62) & byte[ ] & QS & a stretch of scores\tabularnewline
\hline
Bases and scores & A (0x41) & byte[ ],byte[ ] & BA,QS & A a stretch of bases and
quality scores score\tabularnewline
\hline
I don't particularly object to the change of b to a and q to b, but am wondering if this was deliberate change or an accident? My prototype implementation for this code using b/q rather than a/b. Also for some reason I didn't implement 'A' at all. I can't think why, unless we decided it was superfluous. Obviously I can add it, but only if either of us feel a need for it. Do you use the A feature at all for your CRAM3 implementation?
Reading samtools
source code revealed that the queryname
sorting order is far from being lexicographic: https://github.com/samtools/samtools/blob/develop/bam_sort.c#L13
It would also be good to require reads with same position to be sorted by strand.
A footnote with a full reference to Kent et al. Gen. Res. (2002)12:996-1006, and pointer to the relevant part of the paper would be helpful.
(SAM specification)
Not only does it need to be documented, but we need to codify all the various constraints. The code already disallows B going beyond the bounds of the sequence (1M2B1M) but what about B in conjunction with other cigar ops, like soft-clipping, reference skips, or indels?
I'm not sure if the forum discussions ever got that far.
While the VCF file format does not specify the range of numeric types, BCF format allows maximum 32-bit integers and 32-bit floats. The general consensus is that 32-bit integers are sufficient but opinions about floats vary. Please see this wiki page for further details:
http://github.com/samtools/hts-specs/wiki/floats-vs-doubles-in-VCF-BCF
Explain more clearly what j and k stand for in the formula F(j/k)=(k*(k+1)/2)+j.
Mention that j<=k is assumed.
Should REF in section 5.2.4 be GCG instead of CGC? (VCF format)
The term placed is undefined but used.
In the sam spec http://samtools.github.io/hts-specs/SAMv1.pdf, it says the following about the flag:
0x10 SEQ being reverse complemented
and further down in the notes:
Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no
assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10, 0x100
and 0x800, and the bit 0x20 of the previous read in the template.
Thus for an unmapped read, we dont know if the reverse complement flag is valid. This has important implications for extracting the original read data from a bam file, since we dont know whether unmapped reads are represented as-is in the bam, or have been reverse complemented. It would be fairly simple to add the requirement that 0x10 is always valid, regardless of 0x4.
If I understand the bwa code correctly, the reverse complement flag is always set for a reverse complemented read (though confusingly, that flag takes on the same value of its mate if its mate is mapped, reverse complementing the read unnecessarily). If we updated the spec, bwa would already be within spec. I am not sure about other aligners.
In the communities I work with bam files are ubiquitous for storing sequencing data, and with good reason since they are smaller than compressed fastqs. For people relying on bams as a storage format, and wishing to use unmapped reads to predict, for instance, structural variants, it is important the bams allows for consistent extraction of the read sequences as they came off the sequencer.
There are a couple of limitations to the current VCF that make it difficult to fully represent data using the full GRC assemblies, GRCh37, GRCh38 and GRCm38 specifically. These are:
This issue was discussed at a workshop put on by the GRC at Genome Informatics 2014 and there are a series of proposals we'd like to put forth. A set of coherent examples can be found here: http://www.slideshare.net/GenomeRef/variant-calling-ii
##seq-info=<name=chr17, id=CM000679.2>
##region-info=<name=MAPT, id=GL000258.2, assoc_id=CM000679.2, reg=45309498-46836265>
##INFO=<ID=ALTLOCS, Number=.,Type=String,Description=“A list of the alternate
loci in the reference genome that are associated with this locus”>
##INFO=<ID=ALTHAPS, Number=.,Type=String,Description=“A list of the known
haplotypes that are associated with this locus”>
##FORMAT=<ID=HT,Number=1,Type=String,Description=“Haplotype combination based on ALTHAPS">
There may be additional improvements/suggestions that can be made, but these seem like a reasonable start. Making this types of modifications will be an important part of helping groups migrate to GRCh38.
The meta "contig" field format is specified as:
As with chromosomal sequences it is highly recommended (but not required) that the header include tags describing the contigs referred to in the VCF file. This furthermore allows these contigs to come from different files. The format is identical to that of a reference sequence, but with an additional URL tag to indicate where that sequence can be found. For example:
##contig=<ID=ctg1,URL=ftp://somewhere.org/assembly.fa,...>
But all the examples follow this format:
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens">
From the first paragraph, one would infer that the contig is optional, but the rest is quite ambiguous. Could you please clarify the following and, if possible, include the answers in the next version of the spec?
VCF should either have an option in the header to choose a specific string encoding (e.g: utf-8, latin-1, ascii) with a default option set, or it should be documented which encoding VCF should be in.
The following extension to BCF has been agreed on the vcftools-spec mailling list which introduces end-of-vector byte to allow missing values in vectors. This is the proposed text which will appear in BCFv2.2:
"For integer types, the values 0x80, 0x8000, 0x80000000 are interpreted
as missing values and 0x81, 0x8001, 0x80000001 as end-of-vector
indicators. Similarly for floats, the value of 0x7F800001 is
interpreted as a missing value and 0x7F800002 as the end-of-vector
indicator. The end-of-vector byte is not part of the vector."
For ploidy 2, the VCF specification gives the ordering F(j/k) = (k*(k+1)/2)+j. This works for any number of alleles as long as ploidy is 2.
Is there a suggested ordering for variants where the number of alleles is less than the ploidy? Say ploidy = 5 and n_allele = 3?
If you are new to bioinformatics, and are asked to work with any SAM file, then you might reasonably turn to this documentation to help to better understand how the format works.
I feel that many people have trouble understanding what is meant by bitwise FLAG values. The documentation is very technical and not very transparent to people who may be new to bioinformatics.
Many people might be turning to the documentation after looking at their SAM output file. Maybe they see that their output file has a range of integer values in column 2 and are puzzled by the explanation in the documentation (this is very likely if you have no familiarity with bit patterns).
I think this section would be greatly helped by the following:
The GATK single-sample pipeline produces GVCF files with reference blocks, which indicate that they correspond to a range of genomic locations, rather than a single position. In order to make these files conform to the VCF spec, this is implemented using the INFO field "END". Here's an example showing a reference block spanning 1:12141-1:12277.
1 12141 . C <NON_REF> . . END=12277 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
We'd now like to move this process to BCF, and we're starting out by using the same approach. However, we've noticed a significant performance difference between reading/writing top-level BCF record elements like POS as opposed to INFO fields. It would help us a great deal if there were a field like POS to record the end of a block, perhaps BLOCK_END. In the usual case of single-location positions, this field would be unused.
Hi. I have a question concerning my vcf files. In some rows I have multiple values in some of the columns. For example, in the GI column, there are some rows that read
BRCA2
and in some others I get
BRCA1,BRCA1,BRCA1,BRCA1,BRCA1,BRCA1
Could someone explain me why this happens? This also happens with some other more important columns
Missense_V162G,Silent
I'm in favor of the change to make VCF encoding UTF-8 and to add URL/percent encoding of INFO and FORMAT values. However, these changes do raise some practical issues:
If I want to call a SNP at human chr6_GL000255v2_alt:1,144,634, and phase my (potentially haploid) call at that SNP for a sample with other calls on the main chr6, how would I express that in VCF? The spec says that phase sets are restricted to things with matching "chromosome" fields, so I can't put both the variants into a phase set, and anyway the first allele in this haploid call might be phased with the second allele in the diploid call on chr6.
Do I need to wire up the appropriate copy of chr6 to the alt locus with a breakend that's phased at both ends?
Basically, the issue is that the spec should specify how to describe variants that fall into the alt loci which are suddenly very common in GRCh38, especially with regard to phasing. Perhaps with an example.
These lines don't quite make sense:
Lines 1622 to 1635 in c923d4f
Hi,
I've a question about the examples in the VCF 4.2 spec (same as in 4.1 I think). The first example is
#CHROM POS ID REF ALT QUAL FILTER INFO
20 3 . C G . PASS DP=100
20 2 . TC T . PASS DP=100
20 2 . TC TCA . PASS DP=100
The third variant here (20:2 TC->TCA) seems to me to be better repesented as
20 3 . C CA . PASS DP=100
since the ref base T as 20:2 doesn't change. The minimal common denominator is the C
at 20:3
.
The next example
#CHROM POS ID REF ALT QUAL FILTER INFO
20 2 . TC TG,T . PASS DP=100
Could (should?) be represented as two variants
#CHROM POS ID REF ALT QUAL FILTER INFO
20 2 . TC T . PASS DP=100 ## deletetion of C
20 3 . C G . PASS DP=100 ## SNP C->G
The complex example
#CHROM POS ID REF ALT QUAL FILTER INFO
20 2 . TCG TG,T,TCAG . PASS DP=100
would then be
#CHROM POS ID REF ALT QUAL FILTER INFO
20 2 . TCG TG,T . PASS DP=100 ## the deletions
20 3 . C CA . PASS DP=100 ## insertion of A at after the C at 20:3
The common theme to these changes is to evaluate each alt allele with the ref separately to determine the POS of that allele. REF/ALT are then trimmed to the minimal common base. This means that matching variants between VCFs become easier since different VCFs can have different ALT alleles in them.
I guess the overall question here is how a locus (in "2. Understanding the VCF format and the haplotype representation") is defined. To me, this is ambigous, and the spec should strive towards an unambigous represtentation. Evaluating each ALT separately does this, but can make haplotypes end up on separate lines in a VCF.
Thanks!
Daniel
The SAM @rg header has a PL attribute for the platform. This needs updating to support more instruments, eg Oxford Nanopore.
Should it be PL:ONT, NANPORE, OXFORD_NANOPORE? I think I prefer ONT as the easiest yet descriptive one. (There could be other nanopore vendors in time.)
For RNEXT and PNEXT fields we explicitly state the next primary alignment. It is a bit unclear what the interpretation of these fields on a secondary alignment, but that's not a big issue.
For FLAG field the 0x8 and 0x20 bits mention "next segment" with no qualification of it being the next primary segment. Clearly we need primary data as if we are looking at the first read in a pair then we use FLAG, RNEXT and PNEXT jointly to obtain information about the "other end". If RNEXT/PNEXT refer to the other end but FLAG may potentially not then these fields lose their usefulness.
I believe the table of bits in 1.4 should be modified as s/next segment/next primary segment/;
Our group at ACPFG work with wheat. The 3B chromosome has been released as a psuedomolecule (http://wheat-urgi.versailles.inra.fr/Projects/3BSeq). The length of this sequence is within the length supported by the SAM specification (231-1 or ~2.1Gbp) but is larger than the upper limit imposed by the current BAM indexing spec (229-1 or ~536Mbp).
The following thread from 2011 tries to clarify the reason for this difference and to have the differences more explicitly stated within the SAM spec:
http://sourceforge.net/p/samtools/mailman/message/28141607/
However, we have actually now hit this limit with the 3B psuedomolecule and it is causing problems. Can someone provide an update on whether the BAM indexing spec can/will be brought in line with the upper limit defined in the SAM spec?
The specifications for tabix and csi are incomplete, currently being variously only essentially a struct layout with no semantic explanation or restrictions, or a description of the underlying storage format. Files in the wild in CSI and tabix (produced by samtools) include data that is not described in the specification at all.
The @HD::VN
tag tells the "format version" for the SAM file. It has to match \d+.\d+ . For example, samtools sort writes "VN:1.3" to files it creates.
Prior to commit 321f786 , the SAM spec had its version and revision written in the title. It would say something like "The SAM Format Specification (v1.3-r624)". Now, there's no version number, but there is a git commit hash. However, that cannot be written into the SAM file.
So, if I produce a SAM file that I'm sure conforms to some copy of the spec, how should I indicate that in the VN tag? Alternatively, if a SAM file says its version is 1.5, how do I find the corresponding spec?
Hello,
sorry if I missed the obvious, but I could not find what CRAM means. Isn't it an acronym ?
Have a nice day,
Charles Plessy
Tsurumi, Kanagawa, Japan
The footnote on integer values of a B-type array says that all integer types should be mapped to int32_t
in SAM. However this is not possible with uint32_t
elements since it is not possible to encode them with signed 32 bit integers. What should be done for instance when converting from BAM to SAM?
In the text of SAMv1.tex:
Given a region [rbeg,rend), we only need to visit a chunk whose
end file offset is larger than the file offset of the 16kbp window
containing rbeg.
clarify whether the given region, [rbeg,rend) is a query or an alignment. If the region is a query the statement is not true since we need to visit all chunks whose end file offset is larger than the file offset of the 16kbp window containing any position in [rbeg,rend) since we may have alignments in the specified region that are not in the first tile of the linear index.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.