When we have secondary reads, do we need to have TLEN defined? The r

SAM: Clarification of primary vs secondary reads,about samtools/hts-specs

Comments (15)

jkbonfield commented on June 26, 2024

Further to this is understanding the phrase "If all [primary?] segments are mapped to the same reference". What if we have 3 segments for seq X. I'll call them X/1, X/2 and X/3 respectively. X/1 and X/3 have flag 0x40 and 0x80 respectively with X/2 being a middle one having flag 0xC0. Maybe X/1 and X/3 map to the same reference as a proper pair, allowing TLEN to be computed from them, but X/2 maps to another reference.

According to the spec, this is not valid. Does this mean we set it to 0 in such circumstances, although technically that only happens when we have a single-segment or "the information is unavailable". Perhaps that phrase needs to be updated to "or when the information is unavailable or all primary segments do not map to the same reference".

from hts-specs.

nh13 commented on June 26, 2024

What about if we use the secondary's alignment information along with the next primary segment's alignment information to compute TLEN, almost as though viewing TLEN from the "secondary alignment" 's perspective?

from hts-specs.

jkbonfield commented on June 26, 2024

I'm not entirely sure I follow, but are you saying that when we have, say, two alignments (primary+secondary) for one end and primary only for the other end, that the secondary read (with PNEXT/RNEXT pointing to the "other end" primary) has TLEN based on its own alignment? If so I agree.

It still fits the geenral goal of TLEN: this read + TLEN == other end. The fact that the other end + it's TLEN may get you to a 3rd read is irrelevant really. (Ie /1sec to /2prim to /1prim)

The only fly in the ointment comes from dealing with ambiguous alignments of two reads for the same template having the same coordinate (but perhaps different alignments). This is a very unlikely and contrived example though, and if we really need to address this it could be resolved by a shared auxiliary tag (template alignment number) to disambiguate.

from hts-specs.

jkbonfield commented on June 26, 2024

I'm trying to write some validation for SAM format and this is, yet again, tripping me up.

Technically I have to produce completely illogical and IMO wrong test data if we want to validate what the spec states, but I'm still adamant the spec is plain wrong on this front.

The most extreme example is from these snippets. Firstly some definitions:

\item[Segment] A contiguous sequence or subsequence.

\item[Read] ... A read may consist of multiple segments.

\item[Linear alignment] ... A linear alignment can be represented in a single SAM record.

\item[Chimeric alignment] An alignment of a read that cannot be represented as a linear alignment. ...

For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies \mbox{`{\sf FLAG} {\tt & 0x900 == 0}'}. This line is called the \emph{primary line} of the read.

So it's clear here that "segment" is potentially a portion of a read and if we map a read in more than one place, either as supplementary or secondary, then only the non-supp non-secondary read is considered as primary.

However it gets muddied. RNEXT/PNEXT are based on primary line (IMO wrong, but let's ignore for now). TLEN is not. It's "all segments":

\item {\sf TLEN}: signed observed Template LENgth. If all segments are mapped to the same reference sequence, the absolute value of TLEN equals the distance between the mapped end of the template and the mapped start of the template, inclusively (i.e., $\mbox{end}-\mbox{start}+1$).%

Hence a strictly this means if we include a secondary alignment in another contig, then the alignment records for the primary non-secondary mappings must now have TLEN 0 as it's only valid is "all segments" (not all primary lines) map to the same reference. Even worse if the secondary alignment happened to map to the same reference as the primary alignment then TLEN must include that secondary map location.

It's all total twaddle and I know of no aligner which works this way. :/ Obviously we can just change this so TLEN also states primary alignments, but then what if the read mapping somewhere else is supplementary, not secondary? It's part of the same template alignment so surely should be included. This all stems from the decision that supplementary reads cannot be primary.

It gets even more rediculous too. If you have an ALT contig and the entire read-pair is mapped to Chr1 and also to Chr1_alt1 as a secondary pair, then the Chr1_alt records have to point back to the primary locations on Chr1 and not "their" mate. Illogical and rather worthless. Add supplementary + secondary together and you're just in a world of pain. It all ends up as just a free for all and being unable to make any assumptions on what a tool may have done, which begs the question why we have a spec to describe what should be done.

I gave up trying to fight this battle in the past, but I can't do validation files by ignoring it, unless I just conveniently refuse to validate certain parts of the specification.

I see two ways forward.

Fix the spec to take the sensible approach instead of the (IMO) ill-thought out approach.
Fix the spec as we did for TLEN with a footnote to state that these fields are basically implementation defined and users shouldn't rely on them.

from hts-specs.

yfarjoun commented on June 26, 2024

Firstly, I want to commend your courage for seeking yet another hill to die on (after TLEN). I agree 💯 that Primary/Secondary/Supplementary issue is a big mess, and we have seen other issues come up with that as the root cause.

I think that the current TLEN definition should be considered to apply only to the primary records. we should probably amend the writing regarding secondary and supplementary reads.
*NEXT field with secondaries and supplementaries is a mess. but I think that the idea should be that they point to the primary since form the primaries you should be able to find a list of all the records with that queryname, and try to figure out what is going on. I don't think that we have enough fields to describe all the options currently.

from hts-specs.

jkbonfield commented on June 26, 2024

TLEN at least should be consistent with the RNEXT and PNEXT fields. They all need to apply to the same data with consistent language. I expect this was the intention and at least making it apply to primary only wouldn't be contentious in as far as making things consistent.

I hadn't though about the second point, but I see what you mean. Maybe it needs a footnote to explain the rationale and state that aux tags such as SA may be used to identify the full list of other locations. Again this is maintaining the status quo, albeit better explained.

However it may not surprise you that I simply disagree with this philosophy. If a paired template can map in its entirety somewhere else, then it doesn't help when decoding a secondary mapping of READ1 to be told where the primary mapping of READ2 was. What am I meant to do? Do a random access seek on the BAM file to pull out all the data at the primary READ2, sift through it to find the correct query name, decode the aux tags, and then get the coordinate for the secondary mapping of READ2? It's preposterous and inefficient when there is a perfectly acceptable field to do this and I'd be amazed if anyone went to such lengths.

We could of course list all the sites in the secondary mapped read's XA tag (note we have no official one for this), but my recollection - perhaps wrong - is that we did this for primary reads only. I'd need to check. It feels the wrong way around though. If we go down this road, then we need an official tag for describing where the list of secondary alignments are, or at least where the template-specific RNEXT/PNEXT/TLEN figures are for this secondary alignment.

However IMO we should flip it around - permit the RNEXT/PNEXT/TLEN fields to be mapping specific (so leaving it up to the aligner to decide what's sensible) and use tags to indicate the location of the primary mapping. Thus the pnext/rnext/tlen data is internally consistent for a read-pair and the aux data is what tells you about other sites. This is nice and symmetric; primary mappings point to secondaries via aux tags and secondaries point to primaries via aux tags, with the core 11 SAM columns being indications about alignments of this template.

The reason I think we should relax it and make this the aligner's choice is that there are too many combinations and sometimes it may not make sense. Eg READ1 secondary map but no READ2 secondary map. Or READ1+READ2 have a secondary mapping, but not as a "proper pair" so there's no real value in those fields. Arguably we could say they should be */0 when they don't make sense.

from hts-specs.

jkbonfield commented on June 26, 2024

If we change TLEN to be the length of the primary alignment instead of the length of all segments whether primary, second, supplementary, etc, then what do we do about FLAG? Bit 0x20 is the orientation of the next segment in the template. I assume "next" here is the same "next" as used in PNEXT/RNEXT, in which case it is the next segment in the primary alignment of the template.

For this TLEN change to work though we do also need to state that supplementary may be part of the primary alignment (may because we can also have supplementary secondaries). A chimeric alignment, eg cDNA to genomic, may have multiple segments but jointly they are the complete alignment for this template. I don't understand why only part of it is primary. I get that for a "proper-pair" read-pair you'd want the outer two reads to be non-supplementary READ1 and READ2 with all the inner bits being supplementary, but imagine an ONT chimeric alignment where there is no READ2. It'd be tragically unhelpful to simply omit RNEXT/PNEXT/TLEN because there is no next primary segment. TLEN there really should be the complete length of the whole chimeric alignment on the reference genome.

from hts-specs.

jkbonfield commented on June 26, 2024

I did some exploring using a read-pair that could map to two different chromosomes (with differing locations and template length too).

Bwa mem reports the secondary alignment with RNEXT/PNEXT pointing to the primary alignment, but TLEN is zero. Hence TLEN isn't being taken from the primary, rather it is listed as unavailable.

Bowtie2, STAR and subread report the secondary alignment with RNEXT/PNEXT/TLEN using figures from that secondary alignment.

Minimap2 didn't fill out RNEXT/PNEXT/TLEN at all, but maybe it doesn't like just 2 reads?

Smalt output both sets of alignments as primary, but with internally consistent figures per chromosome; I guess it doesn't support secondary alignment flag.

Bwa is the outlier here, but also possibly the dominant aligner. Either way it shows the real world usage of these fields is often at odds with the specification. Trying to codify what should happen is a case of shutting the stable door after the horse has bolted, so we need to take the pragmatic approach, as we did with TLEN before.

Edit: supplementary reads are even odder. With bwa, aligning a pair of chimeric read against a single reference (first 9 columns only):

s1	97	r1	51	60	50M40S	=	451	450
s1	2145	r1	151	60	50H40M	=	451	350
s1	145	r1	451	60	40S50M	=	51	-450
s1	2193	r1	361	60	40M50H	=	51	-350

So both non-supp and supp point to the other end coordinate.

Also primary alignment has TLEN +/- 450. Supplementary alignment has TLEN +/- 350. That's just... wrong!

It's basically working out TLEN if this read and other end read are in same ref, as this start/end vs other read start/end. Noddy, and incorrect. It doesn't care if this alignment is supplementary or not. So it's invalidating the spec however you choose to slice it.

If I align against both r1 and r2 (nearly identical refs, but one starts a bit later and has a chunk inserted in the middle too), it gets totally wonky. We have "primary" reads, secondary reads, supplementary reads, but no secondary-supplementaries. They're incorrectly reported as secondary only despite clearly actually being supplementary.

Bowtie2 won't show any alignments on the supplementary test data unless I use --local, and then it shows all 4 permutations with each having it's own pairing figures as secondary alignments. I never seem to get supplementaries reported.

In short, I think it doesn't really work properly for supplementaries. Maybe there's an aligner out there that supports them, but I haven't found it yet.

from hts-specs.

yfarjoun commented on June 26, 2024

I think we need a figure to understand all this.... I'll try to cook something up.

from hts-specs.

d-cameron commented on June 26, 2024

then we need an official tag for describing where the list of secondary alignments are, or at least where the template-specific RNEXT/PNEXT/TLEN figures are for this secondary alignment

Isn't this what the NH/CC/CP/HI/IH/FI tags are for?

with the core 11 SAM columns being indications about alignments of this template.

I had a rant about this a few years ago and I never got a clear answer about what primary/secondary alignments actually mean. There's two interpretations:
Model A: primary and secondary alignments are independent sets of alignments.
Model B: primary and secondary alignments jointly form a graph of alignments

The reason I think we should relax it and make this the aligner's choice

It's already relatxed - that's part of the problem. There's nothing actually stopping a model B alignment graph being represented in SAM using the current fields and tags.

TLEN at least should be consistent with the RNEXT and PNEXT fields. They all need to apply to the same data with consistent language.

In my example, under model B, the fragment size listed at R1-B2 and R2 will be inconsistent.

With bwa, aligning a pair of chimeric read against a single reference (first 9 columns only):

BWA also writes independent alignments on ALT contigs using the supplementary flag. I think this is a voilation of the sam specs (SA alignments with one record having a full-length unclipped alignment messes up downstream split read analysis), but lh3 disagrees (lh3/bwa#282).

from hts-specs.

jkbonfield commented on June 26, 2024

Regarding CC et al, that depends on what a "hit" is. The spec is pretty vague there, but I assumed it's all hits and not just hits for this particular template alignment. It's also not a list you can get from any single read to all others, and you can't record things like TLEN in there either. CC/CP aren't a circular list either, so even if you had random access, you cannot get from every record to every other record. Basically it's not fit for this role.

As for it being relaxed, no it's not. The spec specifically states that all reads MUST link to the primary alignments, irrespective of whether they are part of a secondary alignment. So there isn't any choice or nuance. In reality most aligner authors just ignore it and "do the right thing" for whatever definition of "right" they hold. This is a problem as it makes the spec worthless. You can't rely on what it says. It's better to not be explicit and pretend it's all hunky-dory in that case. Lipstick and pigs spring to mind ;-)

Your example shows consistent TLEN, but it's not what BWA does. TLEN differs from alignment to alignment, and some of them make no sense whatsoever. According to the spec it should be one of two things only: length of primary alignment or zero. Again it's a case of alignment authors ignoring the spec - even the author who wrote both aligners and the spec. :)

I had a rant about this a few years ago and I never got a clear answer about what primary/secondary alignments actually mean. There's two interpretations:

I'm with you! I've been banging this drum since 2014 and never managed to get any traction. People don't want to rock the boat, so we get left with useless fields which most people just ignore as they're unreliable.

Related issues: #44, #53. Somewhere I have a PDF with a load of discussion on this and diagrams too, but can't find it now.

from hts-specs.

jkbonfield commented on June 26, 2024

Ah found it: https://sourceforge.net/p/samtools/mailman/message/33003353/ and the attachment https://sourceforge.net/p/samtools/mailman/attachment/20141104165115.GZ29816%40sanger.ac.uk/1/

I'd need to re-read through that and see what makes sense still. Some of the language was clarified, and we've maybed gained soe more aux tags meanwhile.

from hts-specs.

tfenne commented on June 26, 2024

I've been trying to avoid wading into this, but since there's now a PR I think I have to add my 2c. My sense is that we made things way more confusing years ago when the spec went from talking about "read pairs" and "first of pair"/"second of pair" to templates with arbitrarily many reads and "previous" and "next".

I also think this all gets confounded by a large fraction of the user community not fully understanding the difference between secondary and supplementary alignments.

My take is that we should attempt to describe the cases that common aligners produce for common sequencing setup and data types (e.g. paired-end DNA and RNA seq). In my mind this is relatively clear - let's see if I can write it out in a way that is also relatively clear 😉

I think that aligners should produce 1 or more sets of alignments per template (or read pair). A single set would comprise a logically consistent set of alignment records for the read pair. This could be:

No alignment of either read
A single alignment record for one read
A single alignment record for both reads
A single logical alignment of one read comprised of a) one_not supplementary_ record and b) one or more supplementary records
A single logical alignment of both reads (each alignment comprised of two or more records as in the previous point)

In all cases above, all the reads should be either a) not secondary or b) secondary. The most likely set should not be secondary, and if multiple sets exist and two or more are equally likely then one should be randomly selected and marked as not secondary. A tag (HI I think) should be used to tie together alignment records that belong to the same logical set of alignments.

I understand that some aligners may produce e.g. a single logical alignment for R1 and then multiple for R2, but I think this is a bad idea for exactly the reasons described in this issue - it makes it very hard to reason about the alignment of the template. As such I think that should be discouraged.

So, if you accept what I've described above, then I think defining what to store in RNEXT/PNEXT and TLEN becomes simpler and perhaps even relatively simple. I would suggest something like the following:

RNEXT/PNEXT should contain the information for the single non-supplementary alignment record of the other read in the read pair from the same logical alignment
TLEN should be computed by the aligner independently for each logical template alignment. In the absence of a better algorithm the TLEN should be computed as the distance from the first sequenced-and-mapped base in R1 to the first sequenced-and-mapped-base in R2 (or insert your favorite definition here since I can't recall if TLEN is still controversial or not).

from hts-specs.

jkbonfield commented on June 26, 2024

I agree, which is one reason I've battled this one in the past, but it just goes nowhere sadly. It doesn't help that it's pretty much most aligners doing one thing vs bwa doing something totally different, but bwa has the dominant usage. The community is split, which makes the spec very hard to codify too other than "do whatever the ^&$! you like as it's already gone to hell in a handbasket" - what we did elsewhere infact. :-(

Hence my PR was to make the smallest change which may be acceptable, given the likelihood that the above will, unfortunately, not be.

I'd love to be proven wrong though, but I don't really have the willpower to argue for a larger change again having been burnt on it before.

from hts-specs.

jkbonfield commented on June 26, 2024

The reason I think we should relax it and make this the aligner's choice

It's already relatxed - that's part of the problem. There's nothing actually stopping a model B alignment graph being represented in SAM using the current fields and tags.

Rereading this again, and spotted this.

Unfortunately you're wrong here (I wish you weren't). Your figure for model A is sadly banned by the SAM spec. It states that PNEXT is "position of the primary alignment of the NEXT read in the template". So a complete secondary alignment for both reads must have PNEXT pointing to their primaries, not their mate secondary position. This is how bwa works, but it's not how bowtie2 or STAR work.

Hence why I say about relaxing it. There are pros and cons to both, and we clearly cannot come down on one side of this argument without technically labelling the output from the other aligners as invalid (as we do currently). The best we can hope to do at this juncture is simply to document the possible alternative interpretations and label all of them as valid, but choice of which interpretation to use is implementation defined. (Just as we did with 3' to 5' vs left to right for TLEN.)

Yes you can handle all of this with a myriad of other SAM aux tags instead, but that's basically the same as saying give up on PNEXT/RNEXT/TLEN and start afresh elsewhere. Most downstream tools would need rewriting.

from hts-specs.

SAM: Clarification of primary vs secondary reads about hts-specs HOT 15 OPEN

Comments (15)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent