Exomiser parsing of indels is incorrect?,about exomiser/exomiser

Comments (19)

visze commented on June 2, 2024

I think it will be the best, if the Exomiser prints out the variant like in the initial VCF format. It makes no sense to remove the reference allele (and this leads to position errors, as Orion noticed).

The frequency of rs11382548 was added in dbSNP build 142 (snp present since version 120). So maybe Exomiser has not the newest dbSNP version.

Am 03.02.2015 um 16:09 schrieb Orion Buske [email protected]:

It seems that indels are not getting parsed correctly, resulting in the AF and dbSNP lookups to fail.

For example, the VCF file contains:
chr11 61165731 . C CA

This results in the following annotation in the exomiser output: chr11:g.61165731->A, which is incorrect. It should be g.61165732

The output lists there as being no frequency data, but this is actually rs11382548 with MAF 14%

Not sure how common this is? Can anyone confirm?

—
Reply to this email directly or view it on GitHub.

from exomiser.

visze commented on June 2, 2024

To be correct:

it is an insertion of A between 61165731 and 61165732 of the reference sequence. Therefore also g.61165732 is incorrect, because the position is related to the reference sequence.

from exomiser.

buske commented on June 2, 2024

I would actually tend to suggest using the stripped version for output and all database lookups as it is unambiguous. Otherwise, the same variant can occur with different numbers of prefix letters (and different positions), leading to further database lookup errors.

from exomiser.

buske commented on June 2, 2024

@visze Okay, fair point on the position. So it sounds like it's not getting looked up because of the dbSNP version, and there is no underlying cause for concerns.

from exomiser.

damiansm commented on June 2, 2024

Not entirely sure what is happening here but the db has the data:

nsfpalizer=# select * from frequency where chromosome = 11 and position =
61165731;
chromosome | position | ref | alt | rsid | dbsnpmaf | espeamaf |
espaamaf | espallmaf
------------+----------+-----+-----+----------+----------+----------+-----------+-----------
11 | 61165731 | C | CA | 11382548 | | 14.11 |
23.857599 | 17.2379

What version of Exomiser are you using Orion? I know Jannovar always
chooses to strip off the ref allele - there was a long debate a while back
about whether this was right or not. And at some point the code was fine
i.e. it knew what Jannovar was doing and still managed to look up the
indels.

This definitely needs investigating

On Tue, Feb 3, 2015 at 3:26 PM, Orion Buske [email protected]
wrote:

@visze https://github.com/visze Okay, fair point on the position. So it
sounds like it's not getting looked up because of the dbSNP version, and
there is no underlying cause for concerns.

—
Reply to this email directly or view it on GitHub
#36 (comment).

from exomiser.

damiansm commented on June 2, 2024

and in the variant table that stores pathogenicity data it is:

nsfpalizer=# select * from variant where chromosome = 11 and position =
61165731;
chromosome | position | ref | alt | aaref | aaalt | aapos | sift |
polyphen | mut_taster | phylop | cadd | cadd_raw
------------+----------+-----+-----+-------+-------+-------+------+----------+------------+--------+-------+----------
11 | 61165731 | - | A | | | 0 | -5 |
-5 | -5 | -5 | 20.3 | 3.973177

I have a feeling we used to consistently store the position and alleles the
same way Jannovar does but at some point the frequency table has got broken
for indels?

On Tue, Feb 3, 2015 at 3:35 PM, Damian Smedley [email protected]
wrote:

Not entirely sure what is happening here but the db has the data:

nsfpalizer=# select * from frequency where chromosome = 11 and position =
61165731;
chromosome | position | ref | alt | rsid | dbsnpmaf | espeamaf |
espaamaf | espallmaf

------------+----------+-----+-----+----------+----------+----------+-----------+-----------
11 | 61165731 | C | CA | 11382548 | | 14.11 |
23.857599 | 17.2379

What version of Exomiser are you using Orion? I know Jannovar always
chooses to strip off the ref allele - there was a long debate a while back
about whether this was right or not. And at some point the code was fine
i.e. it knew what Jannovar was doing and still managed to look up the
indels.

This definitely needs investigating

On Tue, Feb 3, 2015 at 3:26 PM, Orion Buske [email protected]
wrote:

@visze https://github.com/visze Okay, fair point on the position. So
it sounds like it's not getting looked up because of the dbSNP version, and
there is no underlying cause for concerns.

—
Reply to this email directly or view it on GitHub
#36 (comment).

from exomiser.

holtgrewe commented on June 2, 2024

The variant representation in Jannovar has changed in my updated versions. I was planning to integrate the new version with the Exomiser in this month.

If the fix is not easy it might be simpler to just integrate the new Jannovar version and make sure that works properly.

from exomiser.

damiansm commented on June 2, 2024

I think the problem may be in
VCF2FrequencyParser.transformVCF2AnnovarCoordinates.

It looks as though nothing is ever done with teh reset positions and ref
and alt alleles

On Tue, Feb 3, 2015 at 3:40 PM, Damian Smedley [email protected]
wrote:

and in the variant table that stores pathogenicity data it is:

nsfpalizer=# select * from variant where chromosome = 11 and position =
61165731;
chromosome | position | ref | alt | aaref | aaalt | aapos | sift |
polyphen | mut_taster | phylop | cadd | cadd_raw

------------+----------+-----+-----+-------+-------+-------+------+----------+------------+--------+-------+----------
11 | 61165731 | - | A | | | 0 | -5 |
-5 | -5 | -5 | 20.3 | 3.973177

I have a feeling we used to consistently store the position and alleles
the same way Jannovar does but at some point the frequency table has got
broken for indels?

On Tue, Feb 3, 2015 at 3:35 PM, Damian Smedley [email protected]
wrote:

Not entirely sure what is happening here but the db has the data:

nsfpalizer=# select * from frequency where chromosome = 11 and position =
61165731;
chromosome | position | ref | alt | rsid | dbsnpmaf | espeamaf |
espaamaf | espallmaf

------------+----------+-----+-----+----------+----------+----------+-----------+-----------
11 | 61165731 | C | CA | 11382548 | | 14.11 |
23.857599 | 17.2379

What version of Exomiser are you using Orion? I know Jannovar always
chooses to strip off the ref allele - there was a long debate a while back
about whether this was right or not. And at some point the code was fine
i.e. it knew what Jannovar was doing and still managed to look up the
indels.

This definitely needs investigating

On Tue, Feb 3, 2015 at 3:26 PM, Orion Buske [email protected]
wrote:

@visze https://github.com/visze Okay, fair point on the position. So
it sounds like it's not getting looked up because of the dbSNP version, and
there is no underlying cause for concerns.

—
Reply to this email directly or view it on GitHub
#36 (comment).

from exomiser.

damiansm commented on June 2, 2024

Does that mean it would be represented as chr11:g.61165731C->CA in the new
version of Jannovar and we would no longer need to do any conversion of
coordinates/alleles when we generate the database?

On Tue, Feb 3, 2015 at 3:46 PM, Manuel Holtgrewe [email protected]
wrote:

The variant representation in Jannovar has changed in my updated versions.
I was planning to integrate the new version with the Exomiser in this month.

If the fix is not easy it might be simpler to just integrate the new
Jannovar version and make sure that works properly.

—
Reply to this email directly or view it on GitHub
#36 (comment).

from exomiser.

holtgrewe commented on June 2, 2024

Jannovar uses the HTSJDK internally, i.e. you can see the VCF file directly if you want to.

Each alternative allele of the HTSJDK VariantContext is converted into a Jannovar GenomeChange object. These objects describe changes in the sequence using zero-based coordinates, in pseudo-code:

GenomeChange(1, 123, "A", "C") describes a SNV of the 124-th base (in 1-based notation) on chromosome 1.
GenomeChange(1, 234, "", "CGT") describes the insertion of the string CGT before the 234-th base (1-based).
GenomeChange(1, 456, "CGT", "") describes the deletion of the 457th to the 459th base (1-based).

Of course, any conversion can be made on an update.

from exomiser.

pnrobinson commented on June 2, 2024

Thanks for pointing this out, Orion. Another thing that I noticed recently that we need to do better is when there is a deletion or insertion right in the splice donor or acceptor site. Jannovar currently annotated this as c.123->C, for instance, which is incorrect. Manuel, can we discuss this?
thanks Peter

Von: Orion Buske [[email protected]]
Gesendet: Dienstag, 3. Februar 2015 16:09
An: exomiser/Exomiser
Betreff: [Exomiser] Exomiser parsing of indels is incorrect? (#36)

It seems that indels are not getting parsed correctly, resulting in the AF and dbSNP lookups to fail.

For example, the VCF file contains:
chr11 61165731 . C CA

This results in the following annotation in the exomiser output: chr11:g.61165731->A, which is incorrect. It should be g.61165732

The output lists there as being no frequency data, but this is actually rs11382548http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?searchType=adhoc_search&type=rs&rs=rs11382548 with MAF 14%

Not sure how common this is? Can anyone confirm?

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/36.

from exomiser.

holtgrewe commented on June 2, 2024

@buske the current Jannovar version (upcoming v0.13 definitely, v0.11 should give the same HGVS annotation albeit with the old effect names) yields the following:

# java -jar jannovar-cli/target/jannovar-cli-0.13-SNAPSHOT.jar annotate-pos data/hg19_ucsc.ser 'chr1:61165732>A'
[...]
#change effect  hgvs_annotation
chr1:61165732>A NON_CODING_TRANSCRIPT_INTRON_VARIANT    AK097193:uc001czt.1:n.463+29515_463+29516insT

@pnrobinson the current Jannovar version should handle this correctly:

# java -jar jannovar-cli/target/jannovar-cli-0.13-SNAPSHOT.jar annotate-pos data/hg19_ucsc.ser 'chr1:61127145C>A' 'chr1:61127145>A'
[...]
#change effect  hgvs_annotation
chr1:61127145C>A    SPLICE_REGION_VARIANT+NON_CODING_TRANSCRIPT_INTRON_VARIANT  AK097193:uc001czt.1:n.505-3G>T
chr1:61127145>A SPLICE_REGION_VARIANT+NON_CODING_TRANSCRIPT_INTRON_VARIANT  AK097193:uc001czt.1:n.505-3_505-2insT

from exomiser.

buske commented on June 2, 2024

@damiansm, this happened on both the 6.0.0 binary downloaded directly, and the latest build of our phenomecentral branch (built off of development commit: adb7bb0).

from exomiser.

buske commented on June 2, 2024

Here are a few more examples we ran into. All of these appear are technically "fixed" in Jannovar v0.12 because it does not appear to do prefix/suffix stripping. If Exomiser continues to do variant normalization, I'd suggest doing something like: http://genome.sph.umich.edu/wiki/Variant_Normalization.

The variant is improperly reduced, resulting in both an incorrect position (+2 instead of +1) and an incorrect alt allele (TCCGCCGCC instead of CCTCCGCCG).

VCF:

chr1    120612040       .       Tcc     TCCTCCGCCGcc    217     .       INDEL;DP=58;VDB=0.0430 AF1=0.5;AC1=1;DP4=11,13,10,13;MQ=43;FQ=217;PV4=1,1,0.18,1      GT:PL:DP:SP:GQ  0/1:255,0,255:47:0:99

exomiser.variant.tsv:

chr1    120612042    -    TCCGCCGCC    217.0    Pathogenicity    0/1    58    UTR5    NOTCH2:uc001eil.3:c.-22_-21insGGCGGCGGA    NOTCH2    .    ..    .    .    0.0    .    .    .    0.0    0.6022212    1.0    0.8961792

The variant is not left-aligned, resulting in an incorrect position (+5 instead of +1)

VCF:

chr6    31242257        .       accccc  acccc   9.9     .       INDEL;DP=3;VDB=0.0191;AF1=1;AC1=2;DP4=0,0,0,3;MQ=60;FQ=-43.5    GT:PL:DP:SP:GQ  1/1:49,9,0:3:0:12

exomiser.variant.tsv:

chr6    31242262    C    -    9.9    Target    1/1    3    INTRONIC    HLA-B:uc003ntf.2:intron2:c.344-3137G>-    HLA-B    .    .    ..    .    .    .    .    .    0.0    0.5999689    1.0    0.8939807

from exomiser.

holtgrewe commented on June 2, 2024

Hi @buske, thanks for the link.

The different representation of variants in the input VCF and the output TSV/VCF in Exomiser is something that Jules and I already talked about. Our proposal is to write out the same REF and ALT fields as is in the input VCF file.

The Exomiser and Jannovar do not know about the reference sequence (IIRC for the Exomiser) and thus proper normalization is currently not easily possible. However, this could be added later since the normalization is trivial given the proof and algorithm in your link.

Within the Exomiser, variant trimming is done by Jannovar that first does trimming (not following the algorithm in your link) and then 3' alignment for each transcript as required for the HGVS output. This is based on the mRNA sequence.

Do you think the following is a sensible approach for now?

the Exomiser itself is agnostic about variant normalization by default for now in its input and output and for the database lookup and requires normalized VCF files
Jannovar continues to perform its variant normalization for now based on the mRNA and in 3' direction but uses the algorithm in your link
we provide a little precompiled Java standalone tool for variant normalization if people know the pain about installing

from exomiser.

buske commented on June 2, 2024

@holtgrewe Sorry for the slow response. Yes, I think that sounds good. Honestly, however you handle it is fine with me, I just wanted to make sure it was all being discussed so we don't run into erroneous results downstream.

I believe we would be very interested in a little standalone variant normalization tool, as well. :)

from exomiser.

damiansm commented on June 2, 2024

Current status on 7.0.0.BETA with Jannovar 0.13 incorporated.
(i) Seems Exomiser just outputs what was input i.e. the 11:61165731 C CA variant comes out as that.
(ii) The internal representation within Exomiser/Jannovar seems to be pos=61165731, ref="-", alt="A"

Therefore it seems to me that the indels in the frequency table should be normalized before loading into the database so they are correctly loaded. This used to happen but at some stage got broken. I will fix this and then hopefully we are there.

The only gotcha is my normalization should correspond to Jannovar normalization.

from exomiser.

damiansm commented on June 2, 2024

I have now tested this fix in 7.0.0.BETA and all seems fine

from exomiser.

buske commented on June 2, 2024

Here's another variant for testing:

On Exomiser 6.0 this variant gets shifted to a point where it's called as novel, despite the allele frequency of around 30%.

Here's the line from the VCF file:

chr15   100252709       .       Ccagcagcagcagcagcagcagcagcagcagcagc     Ccagcagcagcagcagcagcagcagcagc   999     .       INDEL;DP=521;VDB=0.0390;AF1=0.25;AC1=1;DP4=66,106,47,95;MQ=50;FQ=999;PV4=0.35,1,1,1     GT:PL:GQ        0/1:255,0,255:99

And the line from the Exomiser 6.0 output tsv file:

chr15   100252738       AGCAGC  -       999.0   PASS    0/1     521     NON_FS_DELETION MEF2A:uc010bot.3:exon8:c.1052_1057del:p.Q351_Q352del

Which is the same as this variant:
http://exac.broadinstitute.org/variant/15-100252709-CCAGCAG-C

from exomiser.

Exomiser parsing of indels is incorrect? about exomiser HOT 19 CLOSED

Comments (19)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent