Comments (5)
It's also worth noting that even if this were being generated by MGI and we hadn't previously decided on DNBSEQ, it would be rejected. The problem with MGIG400 is that it conflates platform with model, which are two different tags (PL and PM).
from hts-specs.
We did also discuss this sort of issue in the most recent conference call.
The issue was one of validation. Having an invalid field here does not invalidate the rest of the file. Syntactically it would all make sense. The point was raised: what if we check these fields and reject files that don't match, but the specification then gets updated? We haven't updated the SAM version number when we've added extra fields here as the syntax is identical, so programs cannot check that either. That's a valid point. We felt the correct process would be, if validation is performed, to make it a warning only. This fits with the point above that unknown data here does not invalidate any remainder of the file.
You could argue then what's the point of having a controlled vocabulary, as it doesn't stop people from just adding anything there anyway (as demonstrated). We feel there is still merit in having PL as a controlled vocabulary, as it gives vendors and users alike a clue as to what is expected. Without it we're highly likely to get ONT, OxfordNanopore, OxfordNanoporeTechnology, Oxford_Nanopore_Technology, etc. That's even ignoring the issue of case sensitivity. With it, well, we may still get invalid fields, but hopefully the number is significantly reduced. Note that this also ties in with sequence submissions, as the archives have a controlled vocabulary in their schemas.
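A minimal sketch of the "warning only" approach, in Python. The function name `check_platform` is hypothetical, and `KNOWN_PLATFORMS` is an abbreviated, illustrative subset rather than the full list from the specification. Case handling is deliberately left out.

```python
import warnings

# Illustrative subset only (an assumption); the authoritative list is the
# PL description in the SAM specification.
KNOWN_PLATFORMS = {"CAPILLARY", "DNBSEQ", "ILLUMINA", "ONT", "PACBIO", "ULTIMA"}


def check_platform(value):
    """Return True if value is a recognised PL keyword.

    An unrecognised value produces a warning, never an exception, so the
    caller can carry on reading the rest of the file.
    """
    if value in KNOWN_PLATFORMS:
        return True
    warnings.warn(f"unrecognised PL value {value!r}; continuing", stacklevel=2)
    return False
```

With this shape, `check_platform("MGIG400")` would report the problem but leave the decision about whether to continue with the caller.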
from hts-specs.
Never mind, MGI doesn't seem to ship aligners. The PL field was most probably introduced erroneously by a third-party pipeline downstream.
from hts-specs.
Having an invalid field here does not invalidate the rest of the file.
A file can either be well-formed or not. I'm not sure why you would want or trust a somewhat valid file w.r.t. a specification, especially in the scientific domain.
We haven't updated the SAM version number when we've added extra fields here as the syntax is identical
Appending to a list of known values changes the syntax. `PL:(CAPILLARY|DNBSEQ|etc)` is not identical to `PL:[ -~]+`.
We feel there is still merit in having PL as a controlled vocabulary, as it gives vendors and users alike a clue as to what is expected. Without it we're highly likely to get ONT, OxfordNanopore, OxfordNanoporeTechnology, Oxford_Nanopore_Technology, etc.
In this case, the spec shouldn't define them as valid values but as suggested values.
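To make the distinction between the two patterns concrete, here is a small Python sketch. The enumerated pattern uses an abbreviated keyword list (an assumption for illustration, standing in for the "etc"), and the names `ENUMERATED` and `PRINTABLE` are hypothetical.

```python
import re

# Abbreviated keyword list, for illustration only.
ENUMERATED = re.compile(r"PL:(CAPILLARY|DNBSEQ|ONT|ULTIMA)")
# Any run of printable ASCII characters, space through tilde.
PRINTABLE = re.compile(r"PL:[ -~]+")

field = "PL:MGIG400"
print(bool(ENUMERATED.fullmatch(field)))  # False
print(bool(PRINTABLE.fullmatch(field)))   # True
```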
from hts-specs.
The syntax of header lines is described in the first paragraph of §1.3: the fields are TAB-separated, and the line matches the regexp shown (notwithstanding the minor UTF-8-related issue you noted elsewhere). So regardless of whatever characters are in the PL field value, there is no difficulty parsing it: there is a TAB before the `PL:` and the value extends to (but does not include) either end-of-line or the next TAB, whichever comes first.
The list of keywords in the PL description is a list of semantically valid values. The syntax of the header line is unchanged when this list is appended to, as doing that does not affect parsing of that line or of the rest of that file or even (generally speaking) the interpretation of the rest of the file.
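As a rough illustration of the parsing described above, a Python sketch follows. The function name `header_fields` and the sample `ID` and `SM` values are hypothetical. It assumes a `TAG:VALUE` header line, so free-text `@CO` lines are out of scope.

```python
def header_fields(line):
    """Split one SAM header line into (record_type, {tag: value})."""
    record_type, *fields = line.rstrip("\n").split("\t")
    tags = {}
    for field in fields:
        # Split at the first colon only, so values may themselves contain colons.
        tag, _, value = field.partition(":")
        tags[tag] = value
    return record_type, tags


# An unrecognised PL keyword parses exactly like a recognised one.
rtype, tags = header_fields("@RG\tID:rg1\tPL:MGIG400\tSM:sample1")
print(rtype, tags["PL"])  # @RG MGIG400
```

Nothing in the split depends on which keyword appears after `PL:`, which is the sense in which the line's syntax is unaffected by additions to the list.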
Note for example that ULTIMA was added to the spec when the SAM VN version number was already 1.6. Nonetheless a SAM file that says `@HD VN:1.3` // `@RG … PL:ULTIMA …` is a perfectly valid SAM file. This is very intentional: parsing and understanding (the remainder of) the file is unaffected, so it would be silly for it to be invalid.
We've discussed this any number of times, e.g. on #454. I don't see any particular need to relitigate it, but if we do, let's do it on a new open issue rather than a closed WONTFIX based on someone mistakenly specifying MGIG400 on their `bwa` command line.
from hts-specs.