Giter Club home page Giter Club logo

Comments (10)

nuno-agostinho avatar nuno-agostinho commented on September 17, 2024 1

Hi @wuttke,

The excerpt outputs are reversed (109 is the first and 110 is the last), but I get the same results as you do. I am going to check what is going on.

Thanks for opening this issue.

Best,
Nuno

from ensembl-vep.

dvg-p4 avatar dvg-p4 commented on September 17, 2024 1

Hi Nuno,

I'm trying to run VEP on a sets of ~20-30 million variants, with --tab output. Our project consistently uses the VCF standard notation for indels, with a one-allele REF for insertions and a one-allele ALT for deletions, e.g. 1:69433:A:AGAT.

It turns out that for this particular project we actually want to run VEP 109 for other reasons, so this isn't actually a blocker for meeting my current deadline--I can flesh out an example run to illustrate the problem next week.

Best,
Dan

from ensembl-vep.

nuno-agostinho avatar nuno-agostinho commented on September 17, 2024 1

Hi @dvg-p4,

As you say, customising your variant identifiers in a VCF file is the best workaround to uniquely identify variants (regardless of using VEP or not).

I will now close this issue, but feel free to open new issues in case you face any other problems.

Best regards,
Nuno

from ensembl-vep.

wuttke avatar wuttke commented on September 17, 2024

True, I've edited the post to clarify. Thanks for your prompt response!

from ensembl-vep.

nuno-agostinho avatar nuno-agostinho commented on September 17, 2024

Hi @wuttke,

Our team decided to update the default representation of variants to be minimised for indels by default, as the results after minimisation are the most accurate. The original allele from the input can still be accessed with --uploaded_allele.

We are sorry that this change is not documented. We will update our docs to reflect this.

Thanks,
Nuno

from ensembl-vep.

dvg-p4 avatar dvg-p4 commented on September 17, 2024

Is there any way to override this default behavior and return to the previous behavior of not minimizing indels? This change completely breaks my workflow.

from ensembl-vep.

dvg-p4 avatar dvg-p4 commented on September 17, 2024

--uploaded_allele is not sufficient, since it does not account for the changed coordinates.

from ensembl-vep.

nuno-agostinho avatar nuno-agostinho commented on September 17, 2024

Hi @dvg-p4,

--uploaded_allele is not sufficient, since it does not account for the changed coordinates.

Can you show an example of what you mean by this? How does this affect your workflow?

Best,
Nuno

from ensembl-vep.

dvg-p4 avatar dvg-p4 commented on September 17, 2024

So, my input to vep has indels in "vcf-style", like such (short_vep_input.tsv):

1       1220770 1220772 GAC/G   +
1       1220772 1220772 C/T     +
1       1220794 1220794 G/GCGGGCA       +
1       1223144 1223146 ACT/A   +
1       1223149 1223149 T/A     +
1       1223153 1223153 C/T     +
1       1223154 1223154 G/GAC   +
1       1223154 1223156 GAC/G   +
1       1223182 1223184 AAC/A   +

This is the "canonical" form for variants in our database.

If I run vep 110+ with --tab:

~/ensembl-vep/vep \
--input_file ~/vep_test/short_vep_input.tsv \
--format ensembl \
--no_stats \
--verbose \
--cache \
--offline \
--dir ~/.vep \
--assembly GRCh38 \
--show_ref_allele \
--uploaded_allele \
--output_file ~/vep_test/test_output.tsv \
--tab

I get output like such (test_output.tsv):

## ENSEMBL VARIANT EFFECT PREDICTOR v112.0
[...]
#Uploaded_variation	Location	Allele	Gene	Feature	Feature_type	Consequence	cDNA_position	CDS_position	Protein_position	Amino_acids	Codons	Existing_variation	REF_ALLELE	UPLOADED_ALLELE	IMPACT	DISTANCE	STRAND	FLAGS
1_1220771_AC/-	1:1220771-1220772	-	ENSG00000078808	ENST00000263741	Transcript	intron_variant	-	-	-	-	-	-	AC	GAC/G	MODIFIER	-	-1	-
1_1220771_AC/-	1:1220771-1220772	-	ENSG00000078808	ENST00000360001	Transcript	intron_variant	-	-	-	-	-	-	AC	GAC/G	MODIFIER	-	-1	-
1_1220771_AC/-	1:1220771-1220772	-	ENSG00000078808	ENST00000403997	Transcript	intron_variant	-	-	-	-	-	-	AC	GAC/G	MODIFIER	-	-1	cds_start_NF,cds_end_NF
1_1220771_AC/-	1:1220771-1220772	-	ENSG00000078808	ENST00000465727	Transcript	intron_variant,NMD_transcript_variant	-	-	-	-	-	-	AC	GAC/G	MODIFIER	-	-1	-
1_1220771_AC/-	1:1220771-1220772	-	ENSG00000078808	ENST00000478938	Transcript	upstream_gene_variant	-	-	-	-	-	-	AC	GAC/G	MODIFIER	478	-1	-
1_1220771_AC/-	1:1220771-1220772	-	ENSG00000078808	ENST00000494748	Transcript	non_coding_transcript_exon_variant	580-581	-	-	-	-	-	AC	GAC/G	MODIFIER	-	-1	-
1_1220772_C/T	1:1220772	T	ENSG00000078808	ENST00000263741	Transcript	intron_variant	-	-	-	-	-	-	C	C/T	MODIFIER	-	-1	-
1_1220772_C/T	1:1220772	T	ENSG00000078808	ENST00000360001	Transcript	intron_variant	-	-	-	-	-	-	C	C/T	MODIFIER	-	-1	-
1_1220772_C/T	1:1220772	T	ENSG00000078808	ENST00000403997	Transcript	intron_variant	-	-	-	-	-	-	C	C/T	MODIFIER	-	-1	cds_start_NF,cds_end_NF
1_1220772_C/T	1:1220772	T	ENSG00000078808	ENST00000465727	Transcript	intron_variant,NMD_transcript_variant	-	-	-	-	-	-	C	C/T	MODIFIER	-	-1	-
1_1220772_C/T	1:1220772	T	ENSG00000078808	ENST00000478938	Transcript	upstream_gene_variant	-	-	-	-	-	-	C	C/T	MODIFIER	479	-1	-
1_1220772_C/T	1:1220772	T	ENSG00000078808	ENST00000494748	Transcript	non_coding_transcript_exon_variant	580	-	-	-	-	-	C	C/T	MODIFIER	-	-1	-
1_1220795_-/CGGGCA	1:1220794-1220795	CGGGCA	ENSG00000078808	ENST00000263741	Transcript	intron_variant	-	-	-	-	-	-	-	G/GCGGGCA	MODIFIER	-	-1	-
1_1220795_-/CGGGCA	1:1220794-1220795	CGGGCA	ENSG00000078808	ENST00000360001	Transcript	intron_variant	-	-	-	-	-	-	-	G/GCGGGCA	MODIFIER	-	-1	-
1_1220795_-/CGGGCA	1:1220794-1220795	CGGGCA	ENSG00000078808	ENST00000403997	Transcript	intron_variant	-	-	-	-	-	-	-	G/GCGGGCA	MODIFIER	-	-1	cds_start_NF,cds_end_NF
1_1220795_-/CGGGCA	1:1220794-1220795	CGGGCA	ENSG00000078808	ENST00000465727	Transcript	intron_variant,NMD_transcript_variant	-	-	-	-	-	-	-	G/GCGGGCA	MODIFIER	-	-1	-
1_1220795_-/CGGGCA	1:1220794-1220795	CGGGCA	ENSG00000078808	ENST00000478938	Transcript	upstream_gene_variant	-	-	-	-	-	-	-	G/GCGGGCA	MODIFIER	501	-1	-
1_1220795_-/CGGGCA	1:1220794-1220795	CGGGCA	ENSG00000078808	ENST00000494748	Transcript	non_coding_transcript_exon_variant	557-558	-	-	-	-	-	-	G/GCGGGCA	MODIFIER	-	-1	-
1_1223145_CT/-	1:1223145-1223146	-	ENSG00000078808	ENST00000263741	Transcript	intron_variant	-	-	-	-	-	-	CT	ACT/A	MODIFIER	-	-1	-
[...]

There are many advantages to this output format--it works pretty seamlessly with awk, cut, column -t, R's data.table::fread(), etc. However, note the minimization. The "UPLOADED_ALLELE" column preserves the original ref/alt that I uploaded; but not the original coordinates. (It would also be nice to have separate chrom/pos/ref/alt columns, so as to not have to regex out the chromosome from the position.)

One alternative is to run vep with --vcf output:

~/ensembl-vep/vep \
--input_file ~/vep_test/short_vep_input.tsv \
--format ensembl \
--no_stats \
--verbose \
--cache \
--offline \
--dir ~/.vep \
--assembly GRCh38 \
--show_ref_allele \
--uploaded_allele \
--output_file ~/vep_test/test_output.vcf \
--vcf

However, as per the VCF standard, this condenses all output to one line per variant (test_output.vcf):

##fileformat=VCFv4.1
##VEP="v112.0" API="v112" time="2024-06-03 17:06:46" cache="/home/dgealow/.vep/homo_sapiens/112_GRCh38" ensembl=112.3add379 ensembl-funcgen=112.be19ffa ensembl-io=112.2851b6f ensembl-variation=112.4113356 1000genomes="phase3" COSMIC="98" ClinVar="202310" HGMD-PUBLIC="20204" assembly="GRCh38.p14" dbSNP="156" gencode="GENCODE 46" genebuild="2014-07" gnomADe="r2.1.1" gnomADg="v3.1.2" polyphen="2.2.3" regbuild="1.0" sift="6.2.1"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|REF_ALLELE|UPLOADED_ALLELE|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">
##VEP-command-line='vep --assembly GRCh38 --cache --database 0 --format ensembl --input_file [PATH]/short_vep_input.tsv --no_stats --offline --output_file [PATH]/test_output.vcf --show_ref_allele --uploaded_allele --vcf --verbose'
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	1220770	1_1220771_AC/-	GAC	G	.	.	CSQ=-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6|||||||||AC|GAC/G||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6|||||||||AC|GAC/G||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4|||||||||AC|GAC/G||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,-|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6|||||||||AC|GAC/G||-1||HGNC|HGNC:24188,-|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron|||||||||||AC|GAC/G|478|-1||HGNC|HGNC:24188,-|non_coding_transcript_exon_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|1/3||||580-581||||||AC|GAC/G||-1||HGNC|HGNC:24188
1	1220772	1_1220772_C/T	C	T	.	.	CSQ=T|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6|||||||||C|C/T||-1||HGNC|HGNC:24188,T|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6|||||||||C|C/T||-1||HGNC|HGNC:24188,T|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4|||||||||C|C/T||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,T|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6|||||||||C|C/T||-1||HGNC|HGNC:24188,T|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron|||||||||||C|C/T|479|-1||HGNC|HGNC:24188,T|non_coding_transcript_exon_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|1/3||||580||||||C|C/T||-1||HGNC|HGNC:24188
1	1220794	1_1220795_-/CGGGCA	G	GCGGGCA	.	.	CSQ=CGGGCA|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6||||||||||G/GCGGGCA||-1||HGNC|HGNC:24188,CGGGCA|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6||||||||||G/GCGGGCA||-1||HGNC|HGNC:24188,CGGGCA|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4||||||||||G/GCGGGCA||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,CGGGCA|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6||||||||||G/GCGGGCA||-1||HGNC|HGNC:24188,CGGGCA|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron||||||||||||G/GCGGGCA|501|-1||HGNC|HGNC:24188,CGGGCA|non_coding_transcript_exon_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|1/3||||557-558|||||||G/GCGGGCA||-1||HGNC|HGNC:24188
1	1223144	1_1223145_CT/-	ACT	A	.	.	CSQ=-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6|||||||||CT|ACT/A||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6|||||||||CT|ACT/A||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4|||||||||CT|ACT/A||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,-|downstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000459994|protein_coding_CDS_not_defined|||||||||||CT|ACT/A|4126|-1||HGNC|HGNC:24188,-|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6|||||||||CT|ACT/A||-1||HGNC|HGNC:24188,-|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron|||||||||||CT|ACT/A|2852|-1||HGNC|HGNC:24188,-|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|||||||||||CT|ACT/A|1794|-1||HGNC|HGNC:24188
[...]

This has the variant identification scheme that we use, but the comma/pipe/ampersand-delimited amalgam of an INFO field is quite annoying to parse.

What would be most convenient for us would be an option to output in --tsv format, but with vcf-style CHROM/POS/REF/ALT columns. (Perhaps --include_vcf_id_cols?) An "UPLOADED_COORDINATES" column (paralleling "UPLOADED_ALLELE") would also work well.

(And suggestions for workarounds I haven't thought of here would also be much appreciated!)

from ensembl-vep.

dvg-p4 avatar dvg-p4 commented on September 17, 2024

Ah, here's a good workaround: just pass chrom:pos:ref:alt-style variant identifiers (or whatever style is meaningful to you) as the final column of the input (https://useast.ensembl.org/info/docs/tools/vep/vep_formats.html#default)

from ensembl-vep.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.