Comments (10)
Hi @wuttke,
The excerpt outputs are reversed (109 is the first and 110 is the last), but I get the same results as you do. I am going to check what is going on.
Thanks for opening this issue.
Best,
Nuno
from ensembl-vep.
Hi Nuno,
I'm trying to run VEP on a set of ~20-30 million variants, with --tab output. Our project consistently uses the VCF standard notation for indels, with a single-base REF for insertions and a single-base ALT for deletions, e.g. 1:69433:A:AGAT.
It turns out that for this particular project we want to run VEP 109 for other reasons, so this isn't a blocker for meeting my current deadline. I can flesh out an example run to illustrate the problem next week.
Best,
Dan
Hi @dvg-p4,
As you say, customising your variant identifiers in a VCF file is the best workaround to uniquely identify variants (regardless of using VEP or not).
I will now close this issue, but feel free to open new issues in case you face any other problems.
Best regards,
Nuno
True, I've edited the post to clarify. Thanks for your prompt response!
Hi @wuttke,
Our team decided to minimise indels in the default representation of variants, as the results after minimisation are the most accurate. The original allele from the input can still be accessed with --uploaded_allele.
We are sorry that this change is not documented. We will update our docs to reflect this.
Thanks,
Nuno
Is there any way to override this default behavior and return to the previous behavior of not minimizing indels? This change completely breaks my workflow.
--uploaded_allele is not sufficient, since it does not account for the changed coordinates.
Hi @dvg-p4,
"--uploaded_allele is not sufficient, since it does not account for the changed coordinates."
Can you show an example of what you mean by this? How does this affect your workflow?
Best,
Nuno
So, my input to vep has indels in "vcf-style", like so (short_vep_input.tsv):
1 1220770 1220772 GAC/G +
1 1220772 1220772 C/T +
1 1220794 1220794 G/GCGGGCA +
1 1223144 1223146 ACT/A +
1 1223149 1223149 T/A +
1 1223153 1223153 C/T +
1 1223154 1223154 G/GAC +
1 1223154 1223156 GAC/G +
1 1223182 1223184 AAC/A +
This is the "canonical" form for variants in our database.
If I run vep 110+ with --tab:
~/ensembl-vep/vep \
--input_file ~/vep_test/short_vep_input.tsv \
--format ensembl \
--no_stats \
--verbose \
--cache \
--offline \
--dir ~/.vep \
--assembly GRCh38 \
--show_ref_allele \
--uploaded_allele \
--output_file ~/vep_test/test_output.tsv \
--tab
I get output like this (test_output.tsv):
## ENSEMBL VARIANT EFFECT PREDICTOR v112.0
[...]
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation REF_ALLELE UPLOADED_ALLELE IMPACT DISTANCE STRAND FLAGS
1_1220771_AC/- 1:1220771-1220772 - ENSG00000078808 ENST00000263741 Transcript intron_variant - - - - - - AC GAC/G MODIFIER - -1 -
1_1220771_AC/- 1:1220771-1220772 - ENSG00000078808 ENST00000360001 Transcript intron_variant - - - - - - AC GAC/G MODIFIER - -1 -
1_1220771_AC/- 1:1220771-1220772 - ENSG00000078808 ENST00000403997 Transcript intron_variant - - - - - - AC GAC/G MODIFIER - -1 cds_start_NF,cds_end_NF
1_1220771_AC/- 1:1220771-1220772 - ENSG00000078808 ENST00000465727 Transcript intron_variant,NMD_transcript_variant - - - - - - AC GAC/G MODIFIER - -1 -
1_1220771_AC/- 1:1220771-1220772 - ENSG00000078808 ENST00000478938 Transcript upstream_gene_variant - - - - - - AC GAC/G MODIFIER 478 -1 -
1_1220771_AC/- 1:1220771-1220772 - ENSG00000078808 ENST00000494748 Transcript non_coding_transcript_exon_variant 580-581 - - - - - AC GAC/G MODIFIER - -1 -
1_1220772_C/T 1:1220772 T ENSG00000078808 ENST00000263741 Transcript intron_variant - - - - - - C C/T MODIFIER - -1 -
1_1220772_C/T 1:1220772 T ENSG00000078808 ENST00000360001 Transcript intron_variant - - - - - - C C/T MODIFIER - -1 -
1_1220772_C/T 1:1220772 T ENSG00000078808 ENST00000403997 Transcript intron_variant - - - - - - C C/T MODIFIER - -1 cds_start_NF,cds_end_NF
1_1220772_C/T 1:1220772 T ENSG00000078808 ENST00000465727 Transcript intron_variant,NMD_transcript_variant - - - - - - C C/T MODIFIER - -1 -
1_1220772_C/T 1:1220772 T ENSG00000078808 ENST00000478938 Transcript upstream_gene_variant - - - - - - C C/T MODIFIER 479 -1 -
1_1220772_C/T 1:1220772 T ENSG00000078808 ENST00000494748 Transcript non_coding_transcript_exon_variant 580 - - - - - C C/T MODIFIER - -1 -
1_1220795_-/CGGGCA 1:1220794-1220795 CGGGCA ENSG00000078808 ENST00000263741 Transcript intron_variant - - - - - - - G/GCGGGCA MODIFIER - -1 -
1_1220795_-/CGGGCA 1:1220794-1220795 CGGGCA ENSG00000078808 ENST00000360001 Transcript intron_variant - - - - - - - G/GCGGGCA MODIFIER - -1 -
1_1220795_-/CGGGCA 1:1220794-1220795 CGGGCA ENSG00000078808 ENST00000403997 Transcript intron_variant - - - - - - - G/GCGGGCA MODIFIER - -1 cds_start_NF,cds_end_NF
1_1220795_-/CGGGCA 1:1220794-1220795 CGGGCA ENSG00000078808 ENST00000465727 Transcript intron_variant,NMD_transcript_variant - - - - - - - G/GCGGGCA MODIFIER - -1 -
1_1220795_-/CGGGCA 1:1220794-1220795 CGGGCA ENSG00000078808 ENST00000478938 Transcript upstream_gene_variant - - - - - - - G/GCGGGCA MODIFIER 501 -1 -
1_1220795_-/CGGGCA 1:1220794-1220795 CGGGCA ENSG00000078808 ENST00000494748 Transcript non_coding_transcript_exon_variant 557-558 - - - - - - G/GCGGGCA MODIFIER - -1 -
1_1223145_CT/- 1:1223145-1223146 - ENSG00000078808 ENST00000263741 Transcript intron_variant - - - - - - CT ACT/A MODIFIER - -1 -
[...]
There are many advantages to this output format: it works pretty seamlessly with awk, cut, column -t, R's data.table::fread(), etc. However, note the minimization. The "UPLOADED_ALLELE" column preserves the original ref/alt that I uploaded, but not the original coordinates. (It would also be nice to have separate chrom/pos/ref/alt columns, so as not to have to regex out the chromosome from the position.)
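The coordinate shift can be reproduced by hand. The sketch below is not VEP's implementation; it only assumes that minimisation strips the leading bases shared by ref and alt and advances the start position, which is consistent with the identifiers in the output above. The names `minimise_indel` and `vep_style_id` are made up for illustration.

```python
def minimise_indel(start, ref, alt):
    """Strip leading bases shared by ref and alt, shifting start to match.

    Illustrative only: mirrors the coordinate shift visible in the --tab
    output above, not VEP's internal code.
    """
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1
    # An allele that has been trimmed to nothing is written as "-".
    return start, ref or "-", alt or "-"


def vep_style_id(chrom, start, ref, alt):
    """Build an identifier shaped like the default ones in the output above."""
    new_start, new_ref, new_alt = minimise_indel(start, ref, alt)
    return f"{chrom}_{new_start}_{new_ref}/{new_alt}"
```

For the first input line, `vep_style_id("1", 1220770, "GAC", "G")` gives `1_1220771_AC/-`: the uploaded position 1220770 no longer appears anywhere in the identifier.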
One alternative is to run vep with --vcf output:
~/ensembl-vep/vep \
--input_file ~/vep_test/short_vep_input.tsv \
--format ensembl \
--no_stats \
--verbose \
--cache \
--offline \
--dir ~/.vep \
--assembly GRCh38 \
--show_ref_allele \
--uploaded_allele \
--output_file ~/vep_test/test_output.vcf \
--vcf
However, as per the VCF standard, this condenses all output to one line per variant (test_output.vcf):
##fileformat=VCFv4.1
##VEP="v112.0" API="v112" time="2024-06-03 17:06:46" cache="/home/dgealow/.vep/homo_sapiens/112_GRCh38" ensembl=112.3add379 ensembl-funcgen=112.be19ffa ensembl-io=112.2851b6f ensembl-variation=112.4113356 1000genomes="phase3" COSMIC="98" ClinVar="202310" HGMD-PUBLIC="20204" assembly="GRCh38.p14" dbSNP="156" gencode="GENCODE 46" genebuild="2014-07" gnomADe="r2.1.1" gnomADg="v3.1.2" polyphen="2.2.3" regbuild="1.0" sift="6.2.1"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|REF_ALLELE|UPLOADED_ALLELE|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">
##VEP-command-line='vep --assembly GRCh38 --cache --database 0 --format ensembl --input_file [PATH]/short_vep_input.tsv --no_stats --offline --output_file [PATH]/test_output.vcf --show_ref_allele --uploaded_allele --vcf --verbose'
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1220770 1_1220771_AC/- GAC G . . CSQ=-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6|||||||||AC|GAC/G||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6|||||||||AC|GAC/G||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4|||||||||AC|GAC/G||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,-|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6|||||||||AC|GAC/G||-1||HGNC|HGNC:24188,-|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron|||||||||||AC|GAC/G|478|-1||HGNC|HGNC:24188,-|non_coding_transcript_exon_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|1/3||||580-581||||||AC|GAC/G||-1||HGNC|HGNC:24188
1 1220772 1_1220772_C/T C T . . CSQ=T|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6|||||||||C|C/T||-1||HGNC|HGNC:24188,T|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6|||||||||C|C/T||-1||HGNC|HGNC:24188,T|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4|||||||||C|C/T||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,T|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6|||||||||C|C/T||-1||HGNC|HGNC:24188,T|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron|||||||||||C|C/T|479|-1||HGNC|HGNC:24188,T|non_coding_transcript_exon_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|1/3||||580||||||C|C/T||-1||HGNC|HGNC:24188
1 1220794 1_1220795_-/CGGGCA G GCGGGCA . . CSQ=CGGGCA|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6||||||||||G/GCGGGCA||-1||HGNC|HGNC:24188,CGGGCA|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6||||||||||G/GCGGGCA||-1||HGNC|HGNC:24188,CGGGCA|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4||||||||||G/GCGGGCA||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,CGGGCA|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6||||||||||G/GCGGGCA||-1||HGNC|HGNC:24188,CGGGCA|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron||||||||||||G/GCGGGCA|501|-1||HGNC|HGNC:24188,CGGGCA|non_coding_transcript_exon_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|1/3||||557-558|||||||G/GCGGGCA||-1||HGNC|HGNC:24188
1 1223144 1_1223145_CT/- ACT A . . CSQ=-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6|||||||||CT|ACT/A||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6|||||||||CT|ACT/A||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4|||||||||CT|ACT/A||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,-|downstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000459994|protein_coding_CDS_not_defined|||||||||||CT|ACT/A|4126|-1||HGNC|HGNC:24188,-|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6|||||||||CT|ACT/A||-1||HGNC|HGNC:24188,-|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron|||||||||||CT|ACT/A|2852|-1||HGNC|HGNC:24188,-|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|||||||||||CT|ACT/A|1794|-1||HGNC|HGNC:24188
[...]
This has the variant identification scheme that we use, but the comma/pipe/ampersand-delimited amalgam of an INFO field is quite annoying to parse.
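For what it's worth, the INFO field can be unpacked with a few lines of Python. This is a minimal sketch, assuming the field names are read from the "Format: ..." part of the ##INFO=<ID=CSQ,...> header line, that transcripts are separated by commas, fields by pipes, and multiple consequences by ampersands, as in the output above. The function names are hypothetical, not part of VEP.

```python
def csq_field_names(header_line):
    """Extract the CSQ field names from the ##INFO=<ID=CSQ,...> header line."""
    fmt = header_line.strip().split("Format: ", 1)[1]
    # Drop the closing quote and angle bracket that end the header line.
    return fmt.rstrip('">').split("|")


def parse_csq(info, names):
    """Turn the CSQ entry of an INFO string into one dict per transcript."""
    for item in info.split(";"):
        if item.startswith("CSQ="):
            entries = item[len("CSQ="):].split(",")
            break
    else:
        # No CSQ key in this INFO string.
        return []
    rows = []
    for entry in entries:
        row = dict(zip(names, entry.split("|")))
        # Several consequences for one transcript are joined with "&".
        row["Consequence"] = row["Consequence"].split("&")
        rows.append(row)
    return rows
```

Each returned dict corresponds to one transcript, so the rows can be written back out as a tab-delimited table next to the CHROM/POS/REF/ALT columns of the VCF line.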
What would be most convenient for us would be an option to output in --tab format, but with VCF-style CHROM/POS/REF/ALT columns (perhaps --include_vcf_id_cols?). An "UPLOADED_COORDINATES" column (paralleling "UPLOADED_ALLELE") would also work well.
(And suggestions for workarounds I haven't thought of here would also be much appreciated!)
Ah, here's a good workaround: just pass chrom:pos:ref:alt-style variant identifiers (or whatever style is meaningful to you) as the final column of the input (https://useast.ensembl.org/info/docs/tools/vep/vep_formats.html#default)
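A minimal sketch of that workaround, assuming whitespace-separated input lines in the five-column layout shown earlier in this thread (chrom, start, end, ref/alt, strand) and that the identifier goes in a sixth column as the linked docs describe. The helper names are made up for illustration.

```python
def add_identifier(line):
    """Append a chrom:pos:ref:alt identifier to one default-format input line."""
    chrom, start, end, alleles, strand = line.split()[:5]
    # Assumes a single alt allele; with "A/G/T" everything after the
    # first "/" would end up in alt.
    ref, alt = alleles.split("/", 1)
    ident = f"{chrom}:{start}:{ref}:{alt}"
    return "\t".join([chrom, start, end, alleles, strand, ident])


def split_identifier(ident):
    """Recover chrom, pos, ref, alt from an identifier built above."""
    chrom, pos, ref, alt = ident.split(":")
    return chrom, int(pos), ref, alt
```

The identifier is carried through to the output unchanged, so `split_identifier` can be applied to the first output column to get the original coordinates back after minimisation.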