Comments (6)
END and SVLEN represent the same information. The new VCFv4.5 definition reflects this with END officially deprecated and redefined as a computed field based on what's in SVLEN. Think of it in terms of END=POS+SVLEN (except for <INS>).
I see two definitions of SVLEN: Length of the Structural Variant or length of the reference allele and I find this confusing. (I'm expecting I won't be the only one). Was this intended?
Yes. There are indeed two definitions:
For <DEL>, [POS+1, POS+SVLEN] is deleted
For <DUP>, [POS+1, POS+SVLEN] is tandemly duplicated
For <INV>, [POS+1, POS+SVLEN] is inverted
For <CNV>, [POS+1, POS+SVLEN] has a copy number of whatever's in the INFO CN field.
For <INS>, there are SVLEN additional bases between POS and POS+1
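The two meanings can be put side by side in a minimal sketch. The function names `sv_type` and `affected_interval` are hypothetical, and the code assumes SVLEN is a positive length as in the VCFv4.4+ definition. It returns the reference interval for the interval-style alleles and `None` for `<INS>`, where SVLEN counts inserted bases instead:

```python
from typing import Optional, Tuple


def sv_type(alt: str) -> str:
    """Return the top-level symbolic type, e.g. '<CNV:TR>' -> 'CNV'."""
    if not (alt.startswith("<") and alt.endswith(">")):
        raise ValueError(f"not a symbolic allele: {alt!r}")
    return alt[1:-1].split(":")[0]


def affected_interval(pos: int, svlen: int, alt: str) -> Optional[Tuple[int, int]]:
    """1-based inclusive reference interval covered by a symbolic SV allele.

    Assumes a positive SVLEN (hypothetical helper, not part of any library).
    Returns None for <INS>: there SVLEN is the number of bases inserted
    between POS and POS+1, not a span on the reference.
    """
    kind = sv_type(alt)
    if kind in {"DEL", "DUP", "INV", "CNV"}:
        return (pos + 1, pos + svlen)
    if kind == "INS":
        return None
    raise ValueError(f"unhandled symbolic allele type: {kind}")
```

For example, `affected_interval(100, 50, "<DEL>")` gives `(101, 150)`, while `affected_interval(130, 60, "<INS>")` gives `None`.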
We could have split it up into SVLEN defining END (thus always SVLEN=1 for <INS>) and had a separate field (e.g. INSLEN) to define the number of additional bases, but that didn't happen so we're stuck being as backwards compatible as is reasonable^.
Example below
Did I interpret the specs correctly ?
Almost. Your example is a bit unusual in that you're defining a novel TR insertion. Generally speaking, TR callers report expansion/contraction of existing TRs, and the <CNV:TR> records are defined over the length of the TR in the reference (see example in Section 5.7). For CNV:TR, SVLEN is just defining how long that TR is in the reference (i.e. what's in the TR catalogue). The actual expansion/contraction is defined in the RUS/RUL/RB/RUC/RUB INFO fields. SVLEN is just there so the CNV END is defined, and CN is only there so it's still a valid <CNV> record. Where your example is wrong is that you're defining a 60bp insertion over a 1bp long copy number interval, so you have CN=61 copies of this interval, not CN=20. The CN field is defined for all <CNV> records. Yes, this is a bit weird, but it is done this way so <CNV> parsers can handle <CNV:TR> records without needing to know anything about the R* fields.
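The CN=61 arithmetic can be written out as a small sketch. The helper name `copies_of_interval` is hypothetical, and the formula is an assumption generalised from the 1bp example above: it treats copy number as the bases present after the insertion divided by the length of the reference interval, for a single allele and ignoring ploidy.

```python
def copies_of_interval(interval_len: int, inserted_bp: int) -> float:
    """Copies of a reference interval implied by inserting extra bases into it.

    Hypothetical helper: (reference bases + inserted bases) / reference bases,
    for one allele. A sketch of the arithmetic, not a spec-defined formula.
    """
    if interval_len <= 0:
        raise ValueError("interval_len must be positive")
    return (interval_len + inserted_bp) / interval_len
```

With a 1bp interval and a 60bp insertion, `copies_of_interval(1, 60)` evaluates to `61.0`, matching the CN=61 figure rather than CN=20.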
The other issue is that you've defined both a sequence allele and a symbolic <INS> allele for the same variant. Copy number records are in their own category, so you can write a <CNV:TR> DNA abundance record as well as a direct sequence record without issue, but the sequence record and <INS> will be treated as two records. You're therefore saying there's 120bp inserted just after POS 130 (in an unknown order of insertion; use PSO to disambiguate this). If you're just demonstrating that you can write the same 60bp insertion multiple ways then yes, sequence allele and <INS> are fine as ALT alleles.
Is it necessary to enforce absence of SVLEN value for non symbolic allele ?
Technically it's not enforced - the wording is "should" not "must". If an implementation-defined field has a meaningful implementation-defined interpretation of SVLEN then the specs will allow it. If you've got a file that has SVLEN defined then it's not an invalid VCF, it's just not following the recommendations. (There are draft 'SAM/VCF strict' specs designed as a set of validation rules to highlight issues with technically-compliant SAM/VCF files, but they're still sitting as a PR as nobody's writing a validator that would use them.)
^ Up to VCFv4.3 SVLEN was defined as the length difference between REF and ALT so was meaningless for <INV> and in practice every caller that reported <INV> used the VCFv4.4 redefinition as the length of the SV anyway.
from hts-specs.
END and SVLEN represent the same information. The new VCFv4.5 definition reflects this with END officially deprecated and redefined as computed field based on what's in SVLEN. Think of it in terms of END=POS+SVLEN (except for <INS>).
@d-cameron Can you please point to the issue/pull request that discusses deprecation of END? I completely missed it.
I see two problems with it:
- backward compatibility: programs rely on END to find overlaps and index VCF records
- complexity brought to incorrect place: I appreciate SVLEN is preferred because it is more informative and better reflects the complexity of the SV world. However, it also requires increasing complexity in programs that can be blissfully unaware. Specifically, indexing only needs to know where the allele which modifies the longest chunk of reference sequence starts and ends. With SVLEN it needs to start learning about all the SV categories.
The backward-incompatibility is what worries me more. What is the advantage of deprecating it? I understand the desire to remove certain amount of redundancy; however, existing programs will stop functioning with such files.
@davmlaw Sure. But that's a commit, not a discussion. An important decision like this should be discussed publicly.
There were lots of discussions on this that went on for months, plus long GA4GH file formats committee discussions over zoom.
Also see #758
I would urge you to take part in the discussions and track the VCF PRs here if you wish to be kept in the loop on upcoming changes.
As for why it was done, the fact is END was pretty broken when there was more than one sample, as every sample could have its own end. It's never really worked well, and this isn't just an SV issue (although it gets worse there).
If I recall, the policy was that END would be for indexing only, representing the largest size, with the expectation that tools will have to post-filter if they wish to do sample-specific queries within a region, as the format simply doesn't allow that to be done correctly when using INFO/END. I'm not that familiar with all the ins and outs though, so mostly left the discussions to people better informed. (Hopefully they will correct me if my recollection is wrong.)
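To make the "largest size" idea concrete, here is a minimal sketch of how an indexer might derive a single end coordinate for a record. The function name `index_end` and its argument layout are hypothetical, it ignores any per-sample lengths, and it relies on the END=POS+SVLEN reading (except for `<INS>`) discussed earlier in the thread:

```python
from typing import Iterable, Optional


def index_end(pos: int, ref: str, alts: Iterable[str],
              svlens: Iterable[Optional[int]]) -> int:
    """Largest 1-based end coordinate implied by any allele of a record.

    Hypothetical helper for illustration only. Sequence alleles span the
    REF bases; symbolic alleles other than <INS> extend to POS + SVLEN.
    """
    end = pos + len(ref) - 1  # span of the REF allele itself
    for alt, svlen in zip(alts, svlens):
        is_symbolic = alt.startswith("<")
        is_insertion = alt.startswith("<INS")
        if is_symbolic and not is_insertion and svlen is not None:
            end = max(end, pos + svlen)
    return end
```

A region query would then use this end for the overlap test and still post-filter per sample, since one coordinate cannot capture sample-specific extents.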