Comments (19)
I think it would be best if the Exomiser printed the variant as it appears in the initial VCF format. It makes no sense to remove the reference allele (and doing so leads to position errors, as Orion noticed).
The frequency of rs11382548 was added in dbSNP build 142 (the SNP has been present since build 120). So maybe Exomiser does not have the newest dbSNP version.
On 03.02.2015 at 16:09, Orion Buske wrote:
It seems that indels are not getting parsed correctly, causing the AF and dbSNP lookups to fail.
For example, the VCF file contains:
chr11 61165731 . C CA
This results in the following annotation in the exomiser output: chr11:g.61165731->A, which is incorrect. It should be g.61165732
The output lists no frequency data, but this is actually rs11382548 with MAF 14%
Not sure how common this is? Can anyone confirm?
from exomiser.
To be precise:
it is an insertion of A between positions 61165731 and 61165732 of the reference sequence. Therefore g.61165732 is also incorrect, because the position refers to the reference sequence.
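An insertion like this is written in HGVS genomic notation by naming both flanking reference positions. A minimal sketch, assuming a VCF-style insertion whose REF is a proper prefix of ALT; the helper name `insertion_to_hgvs_g` is hypothetical, and the 3' shifting and `dup` handling that full HGVS requires are not attempted because they need the reference sequence:

```python
def insertion_to_hgvs_g(pos: int, ref: str, alt: str) -> str:
    """Format a VCF-style insertion as an HGVS g. string (hypothetical helper).

    pos is the 1-based VCF POS; ref must be a proper prefix of alt.
    """
    ref, alt = ref.upper(), alt.upper()
    if not alt.startswith(ref) or len(alt) <= len(ref):
        raise ValueError("not a simple insertion: REF must be a proper prefix of ALT")
    inserted = alt[len(ref):]
    left = pos + len(ref) - 1   # last reference base before the inserted sequence
    right = left + 1            # first reference base after it
    return f"g.{left}_{right}ins{inserted}"


print(insertion_to_hgvs_g(61165731, "C", "CA"))  # g.61165731_61165732insA
```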
from exomiser.
I would actually tend to suggest using the stripped version for output and all database lookups as it is unambiguous. Otherwise, the same variant can occur with different numbers of prefix letters (and different positions), leading to further database lookup errors.
from exomiser.
@visze Okay, fair point on the position. So it sounds like it's not getting looked up because of the dbSNP version, and there is no underlying cause for concern.
from exomiser.
Not entirely sure what is happening here but the db has the data:
nsfpalizer=# select * from frequency where chromosome = 11 and position = 61165731;
 chromosome | position | ref | alt |   rsid   | dbsnpmaf | espeamaf | espaamaf  | espallmaf
------------+----------+-----+-----+----------+----------+----------+-----------+-----------
         11 | 61165731 | C   | CA  | 11382548 |          |    14.11 | 23.857599 |   17.2379
What version of Exomiser are you using, Orion? I know Jannovar always chooses to strip off the ref allele - there was a long debate a while back about whether this was right or not. And at some point the code was fine, i.e. it knew what Jannovar was doing and still managed to look up the indels.
This definitely needs investigating.
from exomiser.
and in the variant table that stores pathogenicity data it is:
nsfpalizer=# select * from variant where chromosome = 11 and position = 61165731;
 chromosome | position | ref | alt | aaref | aaalt | aapos | sift | polyphen | mut_taster | phylop | cadd  | cadd_raw
------------+----------+-----+-----+-------+-------+-------+------+----------+------------+--------+-------+----------
         11 | 61165731 | -   | A   |       |       |     0 |   -5 |       -5 |         -5 |     -5 |  20.3 | 3.973177
I have a feeling we used to consistently store the position and alleles the same way Jannovar does, but at some point the frequency table got broken for indels?
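The two query results suggest why a lookup could miss: the same insertion is spelled differently in the two tables. A minimal sketch, using plain dictionaries as stand-ins for the `frequency` and `variant` tables (the dictionary layout is an assumption for illustration, not Exomiser's data access code):

```python
# Stand-ins for the two tables, keyed by (chromosome, position, ref, alt).
frequency = {(11, 61165731, "C", "CA"): {"rsid": 11382548, "espallmaf": 17.2379}}
variant = {(11, 61165731, "-", "A"): {"cadd": 20.3}}

# A key in the stripped form, as stored in the variant table.
stripped_key = (11, 61165731, "-", "A")

print(variant.get(stripped_key))    # {'cadd': 20.3}
print(frequency.get(stripped_key))  # None: same variant, different spelling
```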
from exomiser.
The variant representation in Jannovar has changed in my updated versions. I was planning to integrate the new version with the Exomiser this month.
If the fix is not easy, it might be simpler to just integrate the new Jannovar version and make sure that works properly.
from exomiser.
I think the problem may be in VCF2FrequencyParser.transformVCF2AnnovarCoordinates.
It looks as though nothing is ever done with the reset positions and ref and alt alleles.
from exomiser.
Does that mean it would be represented as chr11:g.61165731C->CA in the new version of Jannovar and we would no longer need to do any conversion of coordinates/alleles when we generate the database?
from exomiser.
Jannovar uses the HTSJDK internally, i.e. you can see the VCF file directly if you want to.
Each alternative allele of the HTSJDK VariantContext is converted into a Jannovar GenomeChange object. These objects describe changes in the sequence using zero-based coordinates, in pseudo-code:
GenomeChange(1, 123, "A", "C") describes a SNV of the 124-th base (in 1-based notation) on chromosome 1.
GenomeChange(1, 234, "", "CGT") describes the insertion of the string CGT before the 234-th base (1-based).
GenomeChange(1, 456, "CGT", "") describes the deletion of the 457th to the 459th base (1-based).
Of course, any conversion can be made on an update.
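The zero-based to 1-based conversion for the SNV and deletion cases can be sketched as below. This `GenomeChange` is a toy stand-in modelled on the pseudo-code, not Jannovar's actual class, and `one_based_ref_range` is a hypothetical method name; pure insertions are left out of the conversion because they affect no reference base:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GenomeChange:
    """Toy stand-in for the pseudo-code above; not Jannovar's actual class."""
    chrom: int
    pos: int   # zero-based position of the first affected reference base
    ref: str
    alt: str

    def one_based_ref_range(self) -> Optional[Tuple[int, int]]:
        """Inclusive 1-based range of the reference bases in `ref`.

        Returns None when `ref` is empty (a pure insertion touches no bases).
        """
        if not self.ref:
            return None
        return (self.pos + 1, self.pos + len(self.ref))


print(GenomeChange(1, 123, "A", "C").one_based_ref_range())   # (124, 124)
print(GenomeChange(1, 456, "CGT", "").one_based_ref_range())  # (457, 459)
print(GenomeChange(1, 234, "", "CGT").one_based_ref_range())  # None
```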
from exomiser.
Thanks for pointing this out, Orion. Another thing I noticed recently that we need to handle better is a deletion or insertion right in the splice donor or acceptor site. Jannovar currently annotates this as c.123->C, for instance, which is incorrect. Manuel, can we discuss this?
Thanks, Peter
From: Orion Buske
Sent: Tuesday, 3 February 2015 16:09
To: exomiser/Exomiser
Subject: [Exomiser] Exomiser parsing of indels is incorrect? (#36)
from exomiser.
@buske the current Jannovar version (upcoming v0.13 definitely, v0.11 should give the same HGVS annotation albeit with the old effect names) yields the following:
# java -jar jannovar-cli/target/jannovar-cli-0.13-SNAPSHOT.jar annotate-pos data/hg19_ucsc.ser 'chr1:61165732>A'
[...]
#change effect hgvs_annotation
chr1:61165732>A NON_CODING_TRANSCRIPT_INTRON_VARIANT AK097193:uc001czt.1:n.463+29515_463+29516insT
@pnrobinson the current Jannovar version should handle this correctly:
# java -jar jannovar-cli/target/jannovar-cli-0.13-SNAPSHOT.jar annotate-pos data/hg19_ucsc.ser 'chr1:61127145C>A' 'chr1:61127145>A'
[...]
#change effect hgvs_annotation
chr1:61127145C>A SPLICE_REGION_VARIANT+NON_CODING_TRANSCRIPT_INTRON_VARIANT AK097193:uc001czt.1:n.505-3G>T
chr1:61127145>A SPLICE_REGION_VARIANT+NON_CODING_TRANSCRIPT_INTRON_VARIANT AK097193:uc001czt.1:n.505-3_505-2insT
from exomiser.
@damiansm, this happened on both the 6.0.0 binary downloaded directly, and the latest build of our phenomecentral branch (built off of development commit: adb7bb0).
from exomiser.
Here are a few more examples we ran into. All of these appear to be technically "fixed" in Jannovar v0.12 because it does not appear to do prefix/suffix stripping. If Exomiser continues to do variant normalization, I'd suggest doing something like: http://genome.sph.umich.edu/wiki/Variant_Normalization.
- The variant is improperly reduced, resulting in both an incorrect position (+2 instead of +1) and an incorrect alt allele (TCCGCCGCC instead of CCTCCGCCG).
  - VCF:
    chr1 120612040 . Tcc TCCTCCGCCGcc 217 . INDEL;DP=58;VDB=0.0430 AF1=0.5;AC1=1;DP4=11,13,10,13;MQ=43;FQ=217;PV4=1,1,0.18,1 GT:PL:DP:SP:GQ 0/1:255,0,255:47:0:99
  - exomiser.variant.tsv:
    chr1 120612042 - TCCGCCGCC 217.0 Pathogenicity 0/1 58 UTR5 NOTCH2:uc001eil.3:c.-22_-21insGGCGGCGGA NOTCH2 . .. . . 0.0 . . . 0.0 0.6022212 1.0 0.8961792
- The variant is not left-aligned, resulting in an incorrect position (+5 instead of +1)
  - VCF:
    chr6 31242257 . accccc acccc 9.9 . INDEL;DP=3;VDB=0.0191;AF1=1;AC1=2;DP4=0,0,0,3;MQ=60;FQ=-43.5 GT:PL:DP:SP:GQ 1/1:49,9,0:3:0:12
  - exomiser.variant.tsv:
    chr6 31242262 C - 9.9 Target 1/1 3 INTRONIC HLA-B:uc003ntf.2:intron2:c.344-3137G>- HLA-B . . .. . . . . . 0.0 0.5999689 1.0 0.8939807
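The kind of normalization suggested here (trim shared trailing bases, extend to the left when an allele runs empty, then trim shared leading bases) can be sketched as follows. This is a minimal sketch for a single biallelic record, assuming the chromosome sequence is available as an in-memory string; the function name `normalize` and the toy reference are hypothetical:

```python
def normalize(pos, ref, alt, reference):
    """Left-align and trim one biallelic variant.

    pos is 1-based; reference is the chromosome sequence as a string,
    so reference[pos - 1] is the base at position pos.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    # Step 1: drop shared trailing bases; when an allele runs empty,
    # extend both alleles one reference base to the left and try again.
    changed = True
    while changed:
        changed = False
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend past the start of the reference")
            pos -= 1
            base = reference[pos - 1].upper()
            ref, alt = base + ref, base + alt
            changed = True
    # Step 2: drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference (hypothetical): a CA repeat flanked by G's.
print(normalize(8, "CAC", "C", "GGGCACACACAGGG"))   # (3, 'GCA', 'G')

# The two records above never run an allele empty, so the reference is not
# consulted and an empty string is passed purely for illustration.
print(normalize(31242257, "accccc", "acccc", ""))       # (31242257, 'AC', 'A')
print(normalize(120612040, "Tcc", "TCCTCCGCCGcc", ""))  # (120612040, 'T', 'TCCTCCGCCG')
```

With this form the chr6 deletion sits at +1 and the chr1 insertion reads CCTCCGCCG after the anchor base, matching the expectations stated in the two bullet points.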
from exomiser.
Hi @buske, thanks for the link.
The different representation of variants in the input VCF and the output TSV/VCF in Exomiser is something that Jules and I already talked about. Our proposal is to write out the same REF and ALT fields as is in the input VCF file.
The Exomiser and Jannovar do not know about the reference sequence (IIRC for the Exomiser) and thus proper normalization is currently not easily possible. However, this could be added later since the normalization is trivial given the proof and algorithm in your link.
Within the Exomiser, variant trimming is done by Jannovar, which first trims the alleles (not following the algorithm in your link) and then performs 3' alignment for each transcript, as required for the HGVS output. This is based on the mRNA sequence.
Do you think the following is a sensible approach for now?
- the Exomiser itself is agnostic about variant normalization by default for now in its input and output and for the database lookup and requires normalized VCF files
- Jannovar continues to perform its variant normalization for now based on the mRNA and in 3' direction but uses the algorithm in your link
- we provide a little precompiled Java standalone tool for variant normalization if people know the pain about installing
from exomiser.
@holtgrewe Sorry for the slow response. Yes, I think that sounds good. Honestly, however you handle it is fine with me, I just wanted to make sure it was all being discussed so we don't run into erroneous results downstream.
I believe we would be very interested in a little standalone variant normalization tool, as well. :)
from exomiser.
Current status on 7.0.0.BETA with Jannovar 0.13 incorporated.
(i) Seems Exomiser just outputs what was input i.e. the 11:61165731 C CA variant comes out as that.
(ii) The internal representation within Exomiser/Jannovar seems to be pos=61165731, ref="-", alt="A"
Therefore it seems to me that the indels in the frequency table should be normalized before loading into the database so they are correctly loaded. This used to happen but at some stage got broken. I will fix this and then hopefully we are there.
The only gotcha is my normalization should correspond to Jannovar normalization.
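One way to read "normalized before loading" is to re-key each frequency row into the stripped form before it goes into the table. The sketch below infers the stripping convention from the outputs quoted in this thread (an insertion keeps the position of the last shared base; a deletion moves to the first deleted base); it is an assumption for illustration, not Jannovar's or Exomiser's code, and `strip_shared_prefix` is a hypothetical name:

```python
import os


def strip_shared_prefix(pos, ref, alt):
    """Strip shared leading bases (case-insensitively); "-" marks an empty allele.

    Position convention inferred from the outputs quoted in this thread.
    """
    ref, alt = ref.upper(), alt.upper()
    k = len(os.path.commonprefix([ref, alt]))
    ref, alt = ref[k:], alt[k:]
    if k > 0 and not ref:
        return pos + k - 1, "-", alt          # insertion
    return pos + k, ref or "-", alt or "-"    # deletion or substitution


# Re-key a frequency row before loading it, so a stripped query can find it.
raw_rows = {(11, 61165731, "C", "CA"): 17.2379}  # espallmaf from the frequency query earlier in the thread
loaded = {(c, *strip_shared_prefix(p, r, a)): maf for (c, p, r, a), maf in raw_rows.items()}
print(loaded.get((11, 61165731, "-", "A")))  # 17.2379

# The same convention reproduces the positions reported for Exomiser 6.0 above.
print(strip_shared_prefix(31242257, "accccc", "acccc"))       # (31242262, 'C', '-')
print(strip_shared_prefix(120612040, "Tcc", "TCCTCCGCCGcc"))  # (120612042, '-', 'TCCGCCGCC')
```

Because prefix stripping alone does not left-align, both the loader and the query side would need to apply exactly the same function for the keys to agree.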
from exomiser.
I have now tested this fix in 7.0.0.BETA and all seems fine
from exomiser.
Here's another variant for testing:
On Exomiser 6.0 this variant gets shifted to a point where it's called as novel, despite the allele frequency of around 30%.
Here's the line from the VCF file:
chr15 100252709 . Ccagcagcagcagcagcagcagcagcagcagcagc Ccagcagcagcagcagcagcagcagcagc 999 . INDEL;DP=521;VDB=0.0390;AF1=0.25;AC1=1;DP4=66,106,47,95;MQ=50;FQ=999;PV4=0.35,1,1,1 GT:PL:GQ 0/1:255,0,255:99
And the line from the Exomiser 6.0 output tsv file:
chr15 100252738 AGCAGC - 999.0 PASS 0/1 521 NON_FS_DELETION MEF2A:uc010bot.3:exon8:c.1052_1057del:p.Q351_Q352del
Which is the same as this variant:
http://exac.broadinstitute.org/variant/15-100252709-CCAGCAG-C
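That the VCF line and the ExAC entry describe the same variant can be checked by trimming the shared trailing bases. A minimal sketch: the alleles are rebuilt with 11 and 9 "cag" repeats, which is a count read off the VCF line and worth rechecking, and `trim_shared_suffix` is a hypothetical helper that is only sufficient here because neither allele runs empty:

```python
import os


def trim_shared_suffix(pos, ref, alt):
    """Drop trailing bases shared by both alleles, keeping one base in each."""
    ref, alt = ref.upper(), alt.upper()
    shared = len(os.path.commonprefix([ref[::-1], alt[::-1]]))
    shared = min(shared, len(ref) - 1, len(alt) - 1)
    return pos, ref[:len(ref) - shared], alt[:len(alt) - shared]


# Alleles rebuilt from the VCF line (repeat counts are an assumption).
ref = "C" + "cag" * 11 + "c"
alt = "C" + "cag" * 9 + "c"
chrom, pos = "15", 100252709

_, trimmed_ref, trimmed_alt = trim_shared_suffix(pos, ref, alt)
variant_id = f"{chrom}-{pos}-{trimmed_ref}-{trimmed_alt}"
print(variant_id)  # 15-100252709-CCAGCAG-C
```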
from exomiser.