Comments (5)
The exception encountered (Empty array passed to setUnsignedArrayAttribute for tag [tagname]) is not specific to ML but would be triggered for any tagged field. Zero-length arrays are explicitly valid in general since 2018 (see #135 and #326), and HTSJDK (and hence Picard and GATK) was adjusted accordingly in samtools/htsjdk#1194. Unfortunately setUnsignedArrayAttribute()
was overlooked in that pull request, but it's clear that its check should have been removed too and this exception being thrown is an HTSJDK bug.
So much for the proximate cause. However there may be HTSJDK base modification code that would have been run for ML in particular but did not execute as this exception was thrown first, that might not handle a zero-length array; i.e., there may be additional reasons why HTSJDK might not handle a zero-length ML field — so we still need an answer to whether an empty ML array is intended to be valid.
In my opinion, such an ML value is very much valid and HTSJDK ought to accept it without issue.
SAMtags.pdf §1.7 also says
[ML] uses a byte array of type ‘C’ with the number of elements matching the summation of the number of modifications listed as being present in the MM tag
This is quite explicit that for a record that contains MM:Z:C+m;
(which ¶8 of the MM description explicitly says is valid), that MM value lists a total of zero modifications, so the corresponding ML array has zero elements.
Practically speaking, it would be silly to require methylation calling software to include code to check whether their MM tag being emitted lists no modifications and suppress emitting an ML field accordingly. Far better to regularise the corner case of a correspondingly empty ML value.
So the correct interpretation is your (1), and this is simply an HTSJDK bug.
One commenter on the GATK discussion considered that it is unclear whether the SAM spec allows an empty ML value or not. We could add explicit text noting that an empty ML:B:C
array would correspond to the largely empty MM:Z:C+m;
value that is explicitly discussed. However personally I don't think that's necessary, as it is fairly clear (from the equality in the paragraph I quoted above) — easily derivable albeit not explicit — and so the only reason the GATK commenter felt that there was any doubt was that they were misled by the faulty HTSJDK / GATK behaviour.
from hts-specs.
We are closing this as we do not think there is merit in singling out ML tags for additional clarification in the specification, but thank you for raising the issue. This is an bug in a specific implementation, for which a fix has already been provided and will be merged in due course.
from hts-specs.
The HTSJDK fix has now been released in HTSJDK 4.0.0.
from hts-specs.
PR samtools/htsjdk#1669 addresses the immediate issue, namely the incorrectly raised exception that has been encountered when using HTSJDK.
from hts-specs.
Fully agree with @jmarshall's comment.
I don't really see the need to be explicit here. Zero length arrays are legal, and we don't need to mention it for ML as it may lead people to the wrong conclusion that permitting zero-length here is an exception rather than the norm.
from hts-specs.
Related Issues (20)
- VCF: "Genotype fields" vs "FORMAT" and per-sample HOT 1
- primary, secondary, and supplementary alignments with optional MM tags HOT 5
- Modified base single letter codes update HOT 7
- test/sam: Duplicate aux field tags
- test/vcf: Duplicate contig header record ID
- FAIRsharing Record Query - BED format
- CRAM: Need to improve feature positions description HOT 1
- is `*` better than `\*`? HOT 1
- cram: interpretation of "unmapped" flag in a pseudocode seems incorrect HOT 1
- SVCLAIM: VCF4.4 and backward compatibility with VCF4.3 HOT 1
- How to retrieve the primary alignment for secondary and supplementary reads HOT 8
- Is there a semantic difference between GT=./. and GT=0/0 + GQ=0 ? HOT 15
- cram: Inconsistent descriptions of auxiliary tag types HOT 1
- SA tag CIGAR format
- vcf: Handling structured header records with missing IDs in VCF 4.1/4.2 HOT 1
- bcf: First phasing indicators not set in genotype (GT) value examples
- CSI file is BGZF compressed but this is not mentioned in the CSV1 spec HOT 2
- Questions about third-party use of test data HOT 6
- VCF Draft 4.5 and Modified Bases HOT 27
- VCF4.4 SVLEN requirement across different variant representations HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from hts-specs.