spliceai's Introduction

SpliceAI: A deep learning-based tool to identify splice variants

This package annotates genetic variants with their predicted effect on splicing, as described in Jaganathan et al, Cell 2019 in press. The annotations for all possible substitutions, 1 base insertions, and 1-4 base deletions within genes are available here for download. These annotations are free for academic and not-for-profit use; other use requires a commercial license from Illumina, Inc.

License

SpliceAI source code is provided under the GPLv3 license. SpliceAI includes several third-party packages provided under other open source licenses; please see NOTICE for additional details. The trained models used by SpliceAI (located in this package at spliceai/models) are provided under the CC BY NC 4.0 license for academic and non-commercial use; other use requires a commercial license from Illumina, Inc.

Installation

The simplest way to install SpliceAI is through pip or conda:

pip install spliceai
# or
conda install -c bioconda spliceai

Alternatively, SpliceAI can be installed from the GitHub repository:

git clone https://github.com/Illumina/SpliceAI.git
cd SpliceAI
python setup.py install

SpliceAI requires tensorflow>=1.2.0, which is best installed separately via pip or conda (see the TensorFlow website for other installation options):

pip install tensorflow
# or
conda install tensorflow

Usage

SpliceAI can be run from the command line:

spliceai -I input.vcf -O output.vcf -R genome.fa -A grch37
# or you can pipe the input and output VCFs
cat input.vcf | spliceai -R genome.fa -A grch37 > output.vcf

Required parameters:

  • -I: Input VCF with variants of interest.
  • -O: Output VCF with SpliceAI predictions ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL included in the INFO column (see table below for details). Only SNVs and simple INDELs (REF or ALT is a single base) within genes are annotated. Variants in multiple genes have separate predictions for each gene.
  • -R: Reference genome fasta file. Can be downloaded from GRCh37/hg19 or GRCh38/hg38.
  • -A: Gene annotation file. You can instead provide grch37 or grch38 to use the GENCODE V24 canonical annotation files included with the package. To create custom gene annotation files, use spliceai/annotations/grch37.txt in the repository as a template.

Optional parameters:

  • -D: Maximum distance between the variant and gained/lost splice site (default: 50).
  • -M: Mask scores representing annotated acceptor/donor gain and unannotated acceptor/donor loss (default: 0).

Details of SpliceAI INFO field:

ID        Description
ALLELE    Alternate allele
SYMBOL    Gene symbol
DS_AG     Delta score (acceptor gain)
DS_AL     Delta score (acceptor loss)
DS_DG     Delta score (donor gain)
DS_DL     Delta score (donor loss)
DP_AG     Delta position (acceptor gain)
DP_AL     Delta position (acceptor loss)
DP_DG     Delta position (donor gain)
DP_DL     Delta position (donor loss)

The delta score of a variant, defined as the maximum of (DS_AG, DS_AL, DS_DG, DS_DL), ranges from 0 to 1 and can be interpreted as the probability of the variant being splice-altering. In the paper, a detailed characterization is provided for the 0.2 (high recall), 0.5 (recommended), and 0.8 (high precision) cutoffs. The delta position conveys where splicing changes relative to the variant position (positive values are downstream of the variant, negative values are upstream).
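As a rough illustration of the definition above, the sketch below splits one SpliceAI annotation string into its ten fields and takes the maximum of the four delta scores. The helper names `parse_spliceai` and `max_delta_score` are made up for this example and are not part of the package; fields given as `.` are treated as missing.

```python
# Field order as listed in the INFO description.
FIELDS = ["ALLELE", "SYMBOL", "DS_AG", "DS_AL", "DS_DG", "DS_DL",
          "DP_AG", "DP_AL", "DP_DG", "DP_DL"]


def parse_spliceai(annotation):
    """Split 'ALLELE|SYMBOL|DS_AG|...|DP_DL' into a dict with typed values."""
    values = annotation.split("|")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    for key in FIELDS[2:6]:   # delta scores are floats
        record[key] = None if record[key] == "." else float(record[key])
    for key in FIELDS[6:]:    # delta positions are integer offsets
        record[key] = None if record[key] == "." else int(record[key])
    return record


def max_delta_score(record):
    """Maximum of the four delta scores, or None if none are present."""
    scores = [record[k] for k in FIELDS[2:6] if record[k] is not None]
    return max(scores) if scores else None


rec = parse_spliceai("T|RYR1|0.00|0.00|0.91|0.08|-28|-46|-2|-31")
print(max_delta_score(rec))  # 0.91
```

The resulting value can then be compared against whichever cutoff (0.2, 0.5, or 0.8) suits the analysis.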

Examples

A sample input file and the corresponding output file can be found at examples/input.vcf and examples/output.vcf respectively. The output T|RYR1|0.00|0.00|0.91|0.08|-28|-46|-2|-31 for the variant 19:38958362 C>T can be interpreted as follows:

  • The probability that the position 19:38958360 (=38958362-2) is used as a splice donor increases by 0.91.
  • The probability that the position 19:38958331 (=38958362-31) is used as a splice donor decreases by 0.08.

Similarly, the output CA|TTN|0.07|1.00|0.00|0.00|-7|-1|35|-29 for the variant 2:179415988 C>CA has the following interpretation:

  • The probability that the position 2:179415981 (=179415988-7) is used as a splice acceptor increases by 0.07.
  • The probability that the position 2:179415987 (=179415988-1) is used as a splice acceptor decreases by 1.00.
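The arithmetic in these interpretations can be sketched as follows. The function `affected_positions` is a hypothetical helper written for this illustration; it simply adds each delta position to the variant position and pairs it with the matching delta score.

```python
def affected_positions(pos, annotation):
    """Map each splicing change to (genomic position, delta score).

    pos is the variant position from the VCF; annotation is one
    'ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL' string.
    """
    fields = annotation.split("|")
    scores = [float(x) for x in fields[2:6]]
    offsets = [int(x) for x in fields[6:10]]
    labels = ["acceptor gain", "acceptor loss", "donor gain", "donor loss"]
    return {label: (pos + off, score)
            for label, off, score in zip(labels, offsets, scores)}


result = affected_positions(38958362, "T|RYR1|0.00|0.00|0.91|0.08|-28|-46|-2|-31")
print(result["donor gain"])  # (38958360, 0.91)
print(result["donor loss"])  # (38958331, 0.08)
```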

Frequently asked questions

1. Why are some variants not scored by SpliceAI?

SpliceAI only annotates variants within genes defined by the gene annotation file. Additionally, SpliceAI does not annotate variants that are close to chromosome ends (5kb on either side), that are deletions longer than twice the input parameter -D, or that are inconsistent with the reference fasta file.
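To make these conditions concrete, here is a minimal sketch of a pre-check one might run on a variant. It is not the package's implementation: the function name `skip_reason`, the exact boundary handling at chromosome ends, and measuring deletion length as `len(ref) - len(alt)` are all assumptions, and the "within genes" condition is not covered.

```python
def skip_reason(pos, ref, alt, chrom_len, fasta_ref, dist=50, margin=5000):
    """Return why a variant might go unscored, or None if no rule applies.

    pos       : 1-based variant position
    chrom_len : length of the chromosome
    fasta_ref : bases read from the reference fasta at pos (same length as ref)
    dist      : value of the -D parameter
    margin    : distance from chromosome ends (5kb in the FAQ)
    """
    if pos <= margin or pos + len(ref) - 1 > chrom_len - margin:
        return "too close to chromosome end"
    if len(ref) > len(alt) and len(ref) - len(alt) > 2 * dist:
        return "deletion longer than 2 * D"
    if fasta_ref.upper() != ref.upper():
        return "REF does not match reference fasta"
    return None


print(skip_reason(50000, "A" * 120, "A", 1000000, "A" * 120))
# deletion longer than 2 * D
```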

2. What are the differences between raw (-M 0) and masked (-M 1) precomputed files?

The raw files also include splicing changes corresponding to strengthening annotated splice sites and weakening unannotated splice sites, which are typically much less pathogenic than weakening annotated splice sites and strengthening unannotated splice sites. The delta scores of such splicing changes are set to 0 in the masked files. We recommend using raw files for alternative splicing analysis and masked files for variant interpretation.
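The masking rule described here can be sketched as below. This is only an illustration of the rule as stated, not the package's code; `mask_delta_scores` and the `annotated` lookup (whether the position each score refers to is an annotated splice site) are hypothetical.

```python
def mask_delta_scores(scores, annotated):
    """Zero out gain scores at annotated sites and loss scores at unannotated sites.

    scores    : dict with keys DS_AG, DS_AL, DS_DG, DS_DL
    annotated : dict with the same keys; True if the affected position
                is an annotated splice site of the matching type
    """
    masked = dict(scores)
    for key in ("DS_AG", "DS_DG"):
        if annotated[key]:        # strengthening an annotated site
            masked[key] = 0.0
    for key in ("DS_AL", "DS_DL"):
        if not annotated[key]:    # weakening an unannotated site
            masked[key] = 0.0
    return masked


raw = {"DS_AG": 0.30, "DS_AL": 0.57, "DS_DG": 0.10, "DS_DL": 0.48}
sites = {"DS_AG": True, "DS_AL": True, "DS_DG": False, "DS_DL": False}
print(mask_delta_scores(raw, sites))
# {'DS_AG': 0.0, 'DS_AL': 0.57, 'DS_DG': 0.1, 'DS_DL': 0.0}
```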

3. Can SpliceAI be used to score custom sequences?

Yes, install SpliceAI and use the following script:

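from keras.models import load_model
from pkg_resources import resource_filename
from spliceai.utils import one_hot_encode
import numpy as np

# Replace this with your custom sequence
input_sequence = 'CGATCTGACGTGGGTGTCATCGCATTATCGATATTGCAT'

# Total padding added around the sequence (context//2 on each side)
context = 10000

# Load the five trained models shipped with the package
paths = ('models/spliceai{}.h5'.format(x) for x in range(1, 6))
models = [load_model(resource_filename('spliceai', x)) for x in paths]

# Pad with N on both sides, one-hot encode, and add a batch dimension
x = one_hot_encode('N'*(context//2) + input_sequence + 'N'*(context//2))[None, :]

# Average the predictions of the five models
y = np.mean([models[m].predict(x) for m in range(5)], axis=0)

# Acceptor and donor probabilities from output channels 1 and 2
acceptor_prob = y[0, :, 1]
donor_prob = y[0, :, 2]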

Contact

Kishore Jaganathan: [email protected]

spliceai's People

Contributors

david-a-parry, jeremymcrae, kishorejaganathan, rizkg, rybval, sandeepaswathnarayana, tsnowlan

spliceai's Issues

receptive field calculation

The SpliceAI-10k model architecture has 4 stages, as follows:
4 x (11, 1)
4 x (11, 4)
4 x (21, 10)
4 x (41, 25)
Each tuple gives the kernel size and dilation rate.
The receptive field of the neurons in the final layer is 1 + 10 * 4 + 10 * 4 * 4 + 20 * 10 * 4 + 40 * 25 * 4 = 5001.
You can check it here.
So does the model actually take only a 5001 nt sequence as input, and should the Cropping1D layer crop only 2500 from each side?
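The sum in this question can be reproduced with a short sketch. Each convolution with kernel size k and dilation d widens the receptive field by (k - 1) * d. The `convs_per_block` parameter is a hypothetical knob added for this illustration: with 1 it reproduces the 5001 figure above, while 2 shows what the result would be if each of the counted units contained two stacked convolutions. That is an assumption to check against the model definition, not a statement about the released architecture.

```python
def receptive_field(stages, convs_per_block=1):
    """Receptive field of stacked dilated convolutions.

    stages: list of (num_blocks, kernel_size, dilation_rate)
    """
    field = 1
    for num_blocks, kernel, dilation in stages:
        field += num_blocks * convs_per_block * (kernel - 1) * dilation
    return field


# Stages as listed in the issue: count x (kernel size, dilation rate)
SPLICEAI_10K = [(4, 11, 1), (4, 11, 4), (4, 21, 10), (4, 41, 25)]

print(receptive_field(SPLICEAI_10K))                     # 5001
print(receptive_field(SPLICEAI_10K, convs_per_block=2))  # 10001
```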

delta score VCFs

Thanks for this important resource. I am currently working with the whole genome delta scores provided on BaseSpace (whole_genome_filtered_spliceai_scores.vcf.gz). I am wondering whether these include predictions of both essential and cryptic splice mutations.

If both, how did you distinguish these two categories for your paper? Is this defined by distance from the nearest splice site (i.e., the "DIST" INFO field in the VCF)?

discrepancy between pre-computed scores and spliceAI scores

I downloaded the pre-computed scores from Basespace and for some variants I see a difference in the output. E.g.

This is the output from the pre-computed file spliceai_scores.raw.snv.hg19.vcf.gz :
X 32466728 . C A . . SpliceAI=A|DMD|0.00|0.57|0.00|0.00|-42|27|2|-25

When I run the spliceAI tool on the same variant:
X 32466728 . C A . . SpliceAI=A|DMD|0.00|0.57|0.00|0.48|-155|27|2|-155

Invalid argument

Using TensorFlow backend.

2020-05-21 02:23:54.313276: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-05-21 02:23:54.313341: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-05-21 02:23:54.313397: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (3016ca31e4f9): /proc/driver/nvidia/version does not exist
2020-05-21 02:23:54.313774: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-05-21 02:23:54.330051: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2000140000 Hz
2020-05-21 02:23:54.337342: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x61a52a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-21 02:23:54.337397: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
/usr/local/lib/python3.6/site-packages/keras/engine/saving.py:341: UserWarning: No training configuration found in save file: the model was not compiled. Compile it manually.
warnings.warn('No training configuration found in save file: '
[W::vcf_parse] INFO 'callstatus' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'AF' is not defined in the header, assuming Type=String
[E::vcf_format] Invalid BCF, the INFO index is too large
Traceback (most recent call last):
  File "/usr/local/bin/spliceai", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/spliceai/__main__.py", line 75, in main
    output.write(record)
  File "pysam/libcbcf.pyx", line 4400, in pysam.libcbcf.VariantFile.write
  File "pysam/libcbcf.pyx", line 4437, in pysam.libcbcf.VariantFile.write
OSError: [Errno 22] b'Invalid argument'

SpliceAI output.vcf equals input.vcf

Hi all,

I ran SpliceAI with the following parameters: spliceai -I 0000.vcf -O out.vcf -R hg19.fa -A grch37. The output VCF is identical to the input VCF, and the column with the delta scores is missing entirely.

My actual plan was to find the variants in my sample.vcf that are NOT in spliceai_scores.raw.snv.hg19.vcf.gz by using bcftools isec, which produced 0000.vcf.

I then wanted SpliceAI to predict the splicing effects of these "unknown" variants in 0000.vcf and write them to out.vcf.

Am I missing something, or is my approach wrong? I would appreciate any suggestions on this issue. Best regards!

transcript list does not match prescored transcripts for hg38

I noticed that for GRCh38 the prescored file contains more transcripts than are annotated by the script, so variants are annotated differently.

For example, tabix spliceai_scores.raw.snv.hg38.vcf.gz 17:7013943-7013943 gives me this output:

17      7013943 .       A       C       .       .       SpliceAI=C|AC040977.1|0.00|0.00|0.00|0.00|3|2|19|2
17      7013943 .       A       C       .       .       SpliceAI=C|RNASEK-C17orf49|0.00|0.00|0.00|0.00|-5|36|5|-6
17      7013943 .       A       C       .       .       SpliceAI=C|RNASEK|0.00|0.00|0.00|0.00|10|36|10|9
17      7013943 .       A       G       .       .       SpliceAI=G|AC040977.1|0.00|0.00|0.00|0.00|42|2|-10|2
17      7013943 .       A       G       .       .       SpliceAI=G|RNASEK-C17orf49|0.00|0.00|0.03|0.00|9|36|9|-6
17      7013943 .       A       G       .       .       SpliceAI=G|RNASEK|0.00|0.00|0.00|0.00|1|36|9|-6
17      7013943 .       A       T       .       .       SpliceAI=T|AC040977.1|0.00|0.00|0.00|0.00|42|2|-10|2
17      7013943 .       A       T       .       .       SpliceAI=T|RNASEK-C17orf49|0.00|0.00|0.11|0.00|10|36|-6|9
17      7013943 .       A       T       .       .       SpliceAI=T|RNASEK|0.00|0.00|0.01|0.00|10|36|-6|9

However, annotations/grch38.txt only contains the transcript coordinates for RNASEK-C17orf49 and not for RNASEK and AC040977.1, while the latter two are found in annotations/grch37.txt. I assume this is because you calculated scores only for GRCh37 and did a liftover from there. Is there a reason why RNASEK and AC040977.1 (and apparently many others) are not in the GRCh38 annotation? Those are active genes in Ensembl, so I assume they are relatively recently added genes that were somehow only added to the GRCh37 list and not to GRCh38?
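One way to list every gene affected by this kind of mismatch is to compare the name columns of the two annotation files. The sketch below assumes the files are tab-separated with a `#NAME` header column (the column name appears in the package's own traceback elsewhere on this page; the delimiter is an assumption). The in-memory tables are toy stand-ins for annotations/grch37.txt and annotations/grch38.txt, reduced to two columns.

```python
import csv
import io


def gene_names(handle):
    """Collect the '#NAME' column of a tab-separated annotation table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["#NAME"] for row in reader}


# Toy stand-ins; with real files use open("spliceai/annotations/grch37.txt") etc.
grch37 = io.StringIO("#NAME\tCHROM\nRNASEK\t17\nRNASEK-C17orf49\t17\nAC040977.1\t17\n")
grch38 = io.StringIO("#NAME\tCHROM\nRNASEK-C17orf49\t17\n")

missing = gene_names(grch37) - gene_names(grch38)
print(sorted(missing))  # ['AC040977.1', 'RNASEK']
```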

TensorFlow binary was not compiled to use: SSE4.1 SSE4.2

Hi SpliceAI --

Thanks for the code but I am new to this application and need some assistance.

After running an installed spliceai on CentOS6, I get this message:

TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 ... UserWarning: No training configuration found in save file: the model was not compiled. Compile it manually.
warnings.warn('No training configuration found in save file: '
tensorflow/1.4.1

I do not fully understand the warning. Do I need to modify my installation for the CPUs I have? I get output, but it is incomplete, and the job ran overnight.

Can you recommend a strategy for CentOS6 HPC users and perhaps give a typical runtime for this? I am not sure my implementation is working yet. :-/

Daniel

running error

Hello,

I installed spliceAI but I got the below error when I ran the test example. Any idea?

[ec2-user@ip-172-31-17-206 SpliceAI]$ ls
COPYRIGHT.txt examples LICENSE README.md setup.py spliceai tests
[ec2-user@ip-172-31-17-206 SpliceAI]$ spliceai -I examples/input.vcf -O examples/output2.vcf -R examples/human_g1k_v37.fasta -A spliceai/annotations/GENCODE.v24lift37
Using TensorFlow backend.
2019-01-22 07:16:15.708152: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
/usr/local/lib64/python2.7/site-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was not compiled. Compile it manually.
warnings.warn('No training configuration found in save file: '
Traceback (most recent call last):
  File "/usr/local/bin/spliceai", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/site-packages/spliceai/__main__.py", line 53, in main
    scores = get_delta_scores(record, ann)
  File "/usr/local/lib/python2.7/site-packages/spliceai/utils.py", line 95, in get_delta_scores
    seq = ann.ref_fasta['chr'+str(record.chrom)][
  File "/usr/local/lib/python2.7/site-packages/pyfasta/fasta.py", line 128, in __getitem__
    c = self.index[i]
KeyError: 'chr2'
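The traceback shows the lookup key being built as 'chr' + chromosome and then not being found in the FASTA index. One plausible cause is that the sequence names in the reference FASTA carry no "chr" prefix. A minimal sketch for checking this is given below; `fasta_names` and `lookup_key` are hypothetical helpers, and the in-memory FASTA is a toy stand-in for the real reference file.

```python
import io


def fasta_names(handle):
    """Return the sequence names (first word of each '>' header line)."""
    names = []
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


def lookup_key(chrom, names):
    """Find the FASTA name matching chrom with or without a 'chr' prefix."""
    bare = chrom[3:] if chrom.startswith("chr") else chrom
    for candidate in (chrom, "chr" + bare, bare):
        if candidate in names:
            return candidate
    return None


# Toy FASTA; with a real file use open("genome.fa")
toy = io.StringIO(">1 first sequence\nACGT\n>2 second sequence\nACGT\n")
names = fasta_names(toy)
print(names)                      # ['1', '2']
print("chr2" in names)            # False
print(lookup_key("chr2", names))  # 2
```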

Attribute error

Hello,
I am trying to run spliceai in Singularity interactive mode, and I see the attribute error below. I have python3.6 installed.

spliceai -I /home/input.vcf.gz -O /home/output.vcf -R /home/ucsc.fa -A grch38
Using TensorFlow backend.
2020-06-12 10:42:28.533297: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-06-12 10:42:28.535179: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
Traceback (most recent call last):
  File "/usr/local/bin/spliceai", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/spliceai/__main__.py", line 69, in main
    ann = Annotator(args.R, args.A)
  File "/usr/local/lib/python3.6/dist-packages/spliceai/utils.py", line 21, in __init__
    self.genes = df['#NAME'].get_values()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5274, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'get_values'
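For context, `Series.get_values()` was deprecated in pandas 0.25 and removed in pandas 1.0, which matches this error. The sketch below shows the equivalent calls on current pandas; the `column_values` helper is made up for illustration and is not a patch to the package. Whether to use a SpliceAI release that no longer calls `get_values` or an older pandas is left to the reader.

```python
import pandas as pd

# Toy frame standing in for the annotation table read by the package
df = pd.DataFrame({"#NAME": ["RYR1", "TTN"]})


def column_values(frame, column):
    """Return a column as a NumPy array without relying on get_values()."""
    series = frame[column]
    if hasattr(series, "to_numpy"):   # available from pandas 0.24
        return series.to_numpy()
    return series.values              # fallback for older pandas


genes = column_values(df, "#NAME")
print(list(genes))  # ['RYR1', 'TTN']
```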

ValueError: all the input array dimensions for the concatenation axis must match exactl

I'm running a WGS VCF through spliceAI and got this traceback:

Traceback (most recent call last):
  File "/usr/local/bin/spliceai", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/spliceai/__main__.py", line 72, in main
    scores = get_delta_scores(record, ann, args.D, args.M)
  File "/usr/local/lib/python3.6/site-packages/spliceai/utils.py", line 168, in get_delta_scores
    y = np.concatenate([y_ref, y_alt])
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 101 and the array at index 1 has size 122

With this command:

spliceai -I brain-eQTL-235-19-02-2018.gt.snp.indel.recal.vcf.bgz -O brainVar-splice.vcf -R Homo_sapiens_assembly38.fasta -A grch38

And it looks like the variant spliceAI failed on was this:

chr1    1028996 .       CAAGGAACCGAGCCCCAGCCCCTCGTGGGCCAAGGGCGCCCACACCCACGCCACCCTCTCCCAAGGAACCGAGCCCCAGCCCCTCGTGGGCCAAGGGCGCCCACAGCCACGCCACCCTTTCCG     C,*

I know that previously spliceAI had trouble with variants of length > 500 but this one is 124nt.
Any help on this is greatly appreciated!

Problems running

I am running the command: spliceai -I 4-ZL2487_S31_L001_001.vcf -O 2487_output.vcf -R ucsc_hg19.fa -A GENCODE.v24lift37

and I get this warning:

Using TensorFlow backend.
2019-02-11 00:29:47.295532: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
/usr/local/lib/python2.7/site-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '

And then I get this output for all variants:
... SpliceAI=G|.|.|.|.|.|.|.|.|. ...

Example of VCF used:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="The phred-scaled genotype likelihoods rounded to the closest integer">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Combined depth across samples">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts, for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS mapping quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##SentieonCommandLine.GVCFtyper=<ID=GVCFtyper,Version="sentieon-genomics-201808",Date="2018-12-06T09:10:32Z",CommandLine="/usr/local/sentieon-genomics-201808/libexec/driver --interval ignore_decoy.bed -r genome/ucsc_hg19.fa -t 32 --algo GVCFtyper --call_conf 20 --emit_conf 20 -v output.g.vcf.gz -d resources/dbsnp_138.hg19.vcf.gz output.vcf.gz">
##SentieonCommandLine.Haplotyper=<ID=Haplotyper,Version="sentieon-genomics-201808",Date="2018-12-06T09:07:16Z",CommandLine="/usr/local/sentieon-genomics-201808/libexec/driver --interval ignore_decoy.bed -r genome/ucsc_hg19.fa -t 32 -i realigned.bam -q recal_data.table --algo Haplotyper --call_conf 20 --emit_conf 20 --phmm_chunk_size 1000 --emit_mode gvcf -d resources/dbsnp_138.hg19.vcf.gz output.g.vcf.gz">
##contig=<ID=chrM,length=16571,assembly=hg19>
##contig=<ID=chr1,length=249250621,assembly=hg19>
##contig=<ID=chr2,length=243199373,assembly=hg19>
##contig=<ID=chr3,length=198022430,assembly=hg19>
##contig=<ID=chr4,length=191154276,assembly=hg19>
##contig=<ID=chr5,length=180915260,assembly=hg19>
##contig=<ID=chr6,length=171115067,assembly=hg19>
##contig=<ID=chr7,length=159138663,assembly=hg19>
##contig=<ID=chr8,length=146364022,assembly=hg19>
##contig=<ID=chr9,length=141213431,assembly=hg19>
##contig=<ID=chr10,length=135534747,assembly=hg19>
##contig=<ID=chr11,length=135006516,assembly=hg19>
##contig=<ID=chr12,length=133851895,assembly=hg19>
##contig=<ID=chr13,length=115169878,assembly=hg19>
##contig=<ID=chr14,length=107349540,assembly=hg19>
##contig=<ID=chr15,length=102531392,assembly=hg19>
##contig=<ID=chr16,length=90354753,assembly=hg19>
##contig=<ID=chr17,length=81195210,assembly=hg19>
##contig=<ID=chr18,length=78077248,assembly=hg19>
##contig=<ID=chr19,length=59128983,assembly=hg19>
##contig=<ID=chr20,length=63025520,assembly=hg19>
##contig=<ID=chr21,length=48129895,assembly=hg19>
##contig=<ID=chr22,length=51304566,assembly=hg19>
##contig=<ID=chrX,length=155270560,assembly=hg19>
##contig=<ID=chrY,length=59373566,assembly=hg19>
##contig=<ID=chr1_gl000191_random,length=106433,assembly=hg19>
##contig=<ID=chr1_gl000192_random,length=547496,assembly=hg19>
##contig=<ID=chr4_ctg9_hap1,length=590426,assembly=hg19>
##contig=<ID=chr4_gl000193_random,length=189789,assembly=hg19>
##contig=<ID=chr4_gl000194_random,length=191469,assembly=hg19>
##contig=<ID=chr6_apd_hap1,length=4622290,assembly=hg19>
##contig=<ID=chr6_cox_hap2,length=4795371,assembly=hg19>
##contig=<ID=chr6_dbb_hap3,length=4610396,assembly=hg19>
##contig=<ID=chr6_mann_hap4,length=4683263,assembly=hg19>
##contig=<ID=chr6_mcf_hap5,length=4833398,assembly=hg19>
##contig=<ID=chr6_qbl_hap6,length=4611984,assembly=hg19>
##contig=<ID=chr6_ssto_hap7,length=4928567,assembly=hg19>
##contig=<ID=chr7_gl000195_random,length=182896,assembly=hg19>
##contig=<ID=chr8_gl000196_random,length=38914,assembly=hg19>
##contig=<ID=chr8_gl000197_random,length=37175,assembly=hg19>
##contig=<ID=chr9_gl000198_random,length=90085,assembly=hg19>
##contig=<ID=chr9_gl000199_random,length=169874,assembly=hg19>
##contig=<ID=chr9_gl000200_random,length=187035,assembly=hg19>
##contig=<ID=chr9_gl000201_random,length=36148,assembly=hg19>
##contig=<ID=chr11_gl000202_random,length=40103,assembly=hg19>
##contig=<ID=chr17_ctg5_hap1,length=1680828,assembly=hg19>
##contig=<ID=chr17_gl000203_random,length=37498,assembly=hg19>
##contig=<ID=chr17_gl000204_random,length=81310,assembly=hg19>
##contig=<ID=chr17_gl000205_random,length=174588,assembly=hg19>
##contig=<ID=chr17_gl000206_random,length=41001,assembly=hg19>
##contig=<ID=chr18_gl000207_random,length=4262,assembly=hg19>
##contig=<ID=chr19_gl000208_random,length=92689,assembly=hg19>
##contig=<ID=chr19_gl000209_random,length=159169,assembly=hg19>
##contig=<ID=chr21_gl000210_random,length=27682,assembly=hg19>
##contig=<ID=chrUn_gl000211,length=166566,assembly=hg19>
##contig=<ID=chrUn_gl000212,length=186858,assembly=hg19>
##contig=<ID=chrUn_gl000213,length=164239,assembly=hg19>
##contig=<ID=chrUn_gl000214,length=137718,assembly=hg19>
##contig=<ID=chrUn_gl000215,length=172545,assembly=hg19>
##contig=<ID=chrUn_gl000216,length=172294,assembly=hg19>
##contig=<ID=chrUn_gl000217,length=172149,assembly=hg19>
##contig=<ID=chrUn_gl000218,length=161147,assembly=hg19>
##contig=<ID=chrUn_gl000219,length=179198,assembly=hg19>
##contig=<ID=chrUn_gl000220,length=161802,assembly=hg19>
##contig=<ID=chrUn_gl000221,length=155397,assembly=hg19>
##contig=<ID=chrUn_gl000222,length=186861,assembly=hg19>
##contig=<ID=chrUn_gl000223,length=180455,assembly=hg19>
##contig=<ID=chrUn_gl000224,length=179693,assembly=hg19>
##contig=<ID=chrUn_gl000225,length=211173,assembly=hg19>
##contig=<ID=chrUn_gl000226,length=15008,assembly=hg19>
##contig=<ID=chrUn_gl000227,length=128374,assembly=hg19>
##contig=<ID=chrUn_gl000228,length=129120,assembly=hg19>
##contig=<ID=chrUn_gl000229,length=19913,assembly=hg19>
##contig=<ID=chrUn_gl000230,length=43691,assembly=hg19>
##contig=<ID=chrUn_gl000231,length=27386,assembly=hg19>
##contig=<ID=chrUn_gl000232,length=40652,assembly=hg19>
##contig=<ID=chrUn_gl000233,length=45941,assembly=hg19>
##contig=<ID=chrUn_gl000234,length=40531,assembly=hg19>
##contig=<ID=chrUn_gl000235,length=34474,assembly=hg19>
##contig=<ID=chrUn_gl000236,length=41934,assembly=hg19>
##contig=<ID=chrUn_gl000237,length=45867,assembly=hg19>
##contig=<ID=chrUn_gl000238,length=39939,assembly=hg19>
##contig=<ID=chrUn_gl000239,length=33824,assembly=hg19>
##contig=<ID=chrUn_gl000240,length=41933,assembly=hg19>
##contig=<ID=chrUn_gl000241,length=42152,assembly=hg19>
##contig=<ID=chrUn_gl000242,length=43523,assembly=hg19>
##contig=<ID=chrUn_gl000243,length=43341,assembly=hg19>
##contig=<ID=chrUn_gl000244,length=39929,assembly=hg19>
##contig=<ID=chrUn_gl000245,length=36651,assembly=hg19>
##contig=<ID=chrUn_gl000246,length=38154,assembly=hg19>
##contig=<ID=chrUn_gl000247,length=36422,assembly=hg19>
##contig=<ID=chrUn_gl000248,length=39786,assembly=hg19>
##contig=<ID=chrUn_gl000249,length=38502,assembly=hg19>
##reference=file://genome/ucsc_hg19.fa
##INFO=<ID=SpliceAI,Number=.,Type=String,Description="SpliceAI variant annotation. These include delta scores (DS) and delta positions (DP) for acceptor gain (AG), acceptor loss (AL), donor gain (DG), and donor loss (DL). Format: ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL">
#CHROM POS ID REF ALT QUAL FILTER INFO
chrM 195 . C T 725.77 . AC=2;AF=1;AN=2;DP=25;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=29.03;SOR=0.941;SpliceAI=T|.|.|.|.|.|.|.|.|.
chrM 302 . A AC 77 . AC=2;AF=1;AN=2;DP=4;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=19.25;SOR=3.258;SpliceAI=AC|.|.|.|.|.|.|.|.|.

I am able to produce the right results for the example input VCF, although I still get the above-mentioned message that the model is not compiled.

Top-k calculations

I ran SpliceAI_train_code/Canonical/test_model.py (downloaded from BaseSpace) on the models provided on GitHub, and got these results:

Acceptor:
0.9955 0.9204 0.9722 0.9811 0.9572 0.9878 0.6865 0.1195 0.0240 1796

Donor:
0.9967 0.9282 0.9833 0.9850 0.9683 0.9903 0.7042 0.1343 0.0244 1796

If I'm interpreting this correctly, the averaged top-k accuracy is between 92% and 92.8% (instead of ~95%). Did I maybe run this incorrectly, or was top-k accuracy calculated differently in the paper? Thanks a lot!

SpliceAI prediction for 2.8kb deletion

Hello,

Thank you so much for providing this software.

I am trying to run splicing prediction on the below intronic deletion. When I put it in my input VCF and run spliceai -I in.vcf -O out.vcf -R hs37d5.fa -A grch37, my output VCF shows SpliceAI output for the other variants in my VCF but not for this long deletion.

I went back to the FAQ and realized I should expect this behavior, since the deletion is greater than twice the input parameter -D. So I reran with spliceai -I in.vcf -O out.vcf -R hs37d5.fa -A grch37 -D 4999, and there is still no output provided for this variant.

Can you help me figure out why?

I do get the below warnings, but they have not impeded me in the past:

2020-07-09 12:03:01.876276: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-09 12:03:01.899785: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fefd3896dc0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-09 12:03:01.899804: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
/usr/local/lib/python3.7/site-packages/keras/engine/saving.py:341: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '

And here is the variant I am trying to run prediction on:
5 142047933 . GTGGTCGCTAAAGCTCCCGCATTCGTGATGTTTATGTGACCGTGACTTGTCACATGACCTTGGGGAGGCTGCTCTTGTCTTCACTGTCCCATCCTCTTGACACCCCTCCCTGCCCCGGCCCCAGTAAGGACTTATATACAAGGAACTCACCCAGACACAGTTAGGTAATAAAGGTGAAGTCATGGTCAAGGGCATGGGTTTTGAAGTGAAGCTTCCTGAACTTAAGTCCAGAAAGTCAATTCATCTCTCTGAGCTTCCTTTATAAAAGGAGGCCAGTGATATACCTACCTCACGTGTTTGTCAAAAGGTTGCAACAAGATAGTGCAAGGGAAGCACCTAGCACATTGCCTCATATAAATAAAATATGCATTCTAACATCTCATAATTTTAGCTATTATTTTTGTTGTGATTATTCAGGATGAATGAGATAAGCTTCCTGTCCTTGAAAGAAGGTGAGAAGCTAAGTCATAAATACCATGACAAGCTATGATAAGTGCTGTGAAGGTTGTGCCAAGGGCTGGGGGGATGGATGGAGGGCTGAAGAGGAAGGGGTCAGGAGGACTTTCTGGAAGATGTGGCATCTCAGCAGGAAGGACAAACAGGTTTTACATATGTGGGCATGGAATAAAGGAATCACTTGAACAGAGGCCCAGGGGTGGGCAAGTAGAGGGCATCTTCGCCCAGCATGCATGGCTGATGCCGCTGAGCGGGGGGTTTTCAGGGACACACAGGGTTAGTGCAAGCTAGCTTGGAGTCAGATCATAGAGGACCTTGAATGATGCTGTTCCAAAGCATTTAGACTTTTTTCTGCAAGCAAATGAAAGTGGAGCCAGTGGAGACTTTGGAGAAGGGAATTTGGAAATTCTTCTGGAAAATTGTTGGTTACTGGTAGGGGGCTGGGCAGAGCAGAGGCCACGGGGCACAGTATCCTCTATTTCCATCTCTGAAAAGCTCATCAGCCCTCAAGTCTTCTTGACTAGACTACAAGCCAGTCGAGGACAGAAGCATGAGCTTCCTTCTCACGATTATGTCCCCAGCACAGCCCCTGGGGCTCAATAAATGGCTGTTGAATTGAAATAGACAAAAGATCCTAAACTTGAGCCATGAGAGTGGAAGGGAGAGATGGCTCCAAAGACACTCCAGGAGAAGAATGAAGAGGATTCTGCAAAGGAGTAATCAGAGGTGACTCGGGCATAGAGGGACCGGAAGGATGAATGGAGAGGCTATTTACTGAGAGAATGGACGCAGCCAGGAGCAGCCCATCTGGCAGGACGACTGCTGCACTCGCTGTGGGGCGTGTCGCGGCTGAGGTGAACAGAGTCATCCATCTGCTGACTGGCAGGCAGTGGACGCGAGCTCTGGAGAGAACCCAGAATATGAGGCAGGCACAGGAGAAGCAGAGCAGAGGGGAAAGATGAGATCTGCAAGGAAGAACAGCCTGGAGCCTGAAGAGCGAAGCAGACAGCGGCAGCCACCCTGGGGGTCACTCCACACTCTGAGATCCCCAGTGCAGCCGGTTACCACAGAGGAGCAACCCCAGGATCCTGGGGAGAAGAGGCACAGTCAGGAGGAGAGCCAAGAAGTACAGGATCACAGCCTTCAAGGGAGAAGTCAGCTGGTCTGCACTGCCCGAAGCTACAAAGAGGTTAACCAGAGTGGAAAGTGGGTGCCGGAGCCTCGGGATTGATCTCCTGCCCCACAGCTTTGCTGTTCCCGTCACTTCTGCACACTGGCACTGAGGTCATGCTTCTCAAGAAAGATCTGCATGTGCCAACCATTGGCTCAGAAACCTTCCCTTGTTTCCTACTTCCCAGCTAGGAAGGCCAAAGTCCTTGGCCAAGTGACCTCAAATCTGGTGTGAAAAGCCTCACCTTCCACTACGTCCTTCCACATCCCAGCAATCTGAGCTGCTCCCTCTCCCCTGTTGCACACCAAGAACTTCCATGCCTTCCTATGTTGGTTCTTGCTGTTCCCTCAGCTAGGAAG
GCCATTTTGCTCAATGTTCTGCTAATTAAAACTCTCCTCATCCTTCAAGGCCTATGTCAAATGCTACCTCCACTGCAAAGCCTTCCTTGGTCTCTCCAGTCAGTAGGGTCCTTCCTTCAAGGACCCACGGTGTGGTGCTTTGCCTAGTCATTACTTCATTCTTTCTACAATTGAGTTGTTGGCCCCCATCCACTCACACATGACAGCATGAGAGCAGAGAACATGTTTTATCTTTGAATCTCCCATGGTATAGATTCTTACATGCAGAAGGCACTCAACCAACGCCAGCAAAGCATATGAAGAAATTAGGAAAAAAACCCCACAAAAACCTGACAGCAGTAGATGCTACCACTTCCAATCATTTATTTGGAAATTACGATGTGCCTCACAGTTATTTGGACTTCAGGGGTTTTGCCCTGCAATTAATGTCTCGGCCCGATGTTGGGTTGCCTTACAATCACATGTGTTATCTGTGATCTCTAGGGCCATTTCGGTTTTGAGAGCACACTATTGCACATACACTAAGATGTTTCTGAGTATTTCCCATAGTTAGACAAAAATGGAAAAACTTGGCCCAGCCTTCAGAAGCGCTTTATCACATGTGCTGCCTATGACAGTGGTTTAGGGAAATCTCTACACAATTCCACCAGGATCATAAACACATTTCCTGAGCTATGAATTAAGAGGAAAAAGTTTCTCAACATGCAACACAAACTCAAAAAGCAGCTGTTCATTTAACCATCTATCATCTTCTTTTGACCCTTTGAACAGGCTCTGACAGTATTTAGATTACATGAA G

Thanks very much for your time and help,
Lee-kai

ImportError

Hello! I am trying to use SpliceAI 1.3.1 on Ubuntu 14.04. When I ran the sample input file, I got an ImportError (see the traceback below).
Thank you in advance!

Traceback (most recent call last):
  File "/home/ngs/program/anaconda3/bin/spliceai", line 7, in <module>
    from spliceai.__main__ import main
  File "/home/ngs/program/anaconda3/lib/python3.6/site-packages/spliceai/__main__.py", line 5, in <module>
    from spliceai.utils import Annotator, get_delta_scores
  File "/home/ngs/program/anaconda3/lib/python3.6/site-packages/spliceai/utils.py", line 2, in <module>
    import pandas as pd
  File "/home/ngs/program/anaconda3/lib/python3.6/site-packages/pandas/__init__.py", line 55, in <module>
    from pandas.core.api import (
  File "/home/ngs/program/anaconda3/lib/python3.6/site-packages/pandas/core/api.py", line 5, in <module>
    from pandas.core.arrays.integer import (
  File "/home/ngs/program/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/__init__.py", line 13, in <module>
    from .sparse import SparseArray # noqa: F401
  File "/home/ngs/program/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/sparse/__init__.py", line 3, in <module>
    from pandas.core.arrays.sparse.accessor import SparseAccessor, SparseFrameAccessor
  File "/home/ngs/program/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/sparse/accessor.py", line 10, in <module>
    from pandas.core.arrays.sparse.array import SparseArray
  File "/home/ngs/program/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/sparse/array.py", line 46, in <module>
    from pandas.core.indexers import check_array_indexer
ImportError: cannot import name 'check_array_indexer'

Ref too long error

I installed the latest version of SpliceAI using git clone and the most recent version of TensorFlow with pip install. Everything appears to run OK, but for every record I get the warning WARNING:root:Skipping record (ref too long). I am using the input.vcf supplied on the GitHub page. This is the output I get:

spliceai -I test_input.vcf -O test_output.vcf -R ~/refs/hg19.fa -A ~/projects/spliceai/annotations/grch37.txt -D 0
Using TensorFlow backend.
2020-03-24 20:29:22.429127: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-03-24 20:29:22.437459: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2399995000 Hz
2020-03-24 20:29:22.439313: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x561edee5c420 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-24 20:29:22.439347: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-03-24 20:29:22.439534: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
/home/ubuntu/miniconda3/lib/python3.7/site-packages/Keras-2.3.1-py3.7.egg/keras/engine/saving.py:341: UserWarning: No training configuration found in save file: the model was not compiled. Compile it manually.
warnings.warn('No training configuration found in save file: '
WARNING:root:Skipping record (ref too long): 2 152389953 . T A,C,G . . .

WARNING:root:Skipping record (ref too long): 2 179415988 . C CA . . .

WARNING:root:Skipping record (ref too long): 2 179446218 . ATACT A . . .

WARNING:root:Skipping record (ref too long): 2 179446218 . ATACT AT,ATA . . .

WARNING:root:Skipping record (ref too long): 2 179642185 . G A . . .

WARNING:root:Skipping record (ref too long): 19 38958362 . C T . . .

WARNING:root:Skipping record (ref too long): 21 47406854 . CCA C . . .

WARNING:root:Skipping record (ref too long): 21 47406856 . A AT . . .

WARNING:root:Skipping record (ref too long): X 129274636 . A C,G,T . . .

Any advice would be appreciated.
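One detail that stands out in the command above is `-D 0`. Below is a minimal sketch of why that could matter. It assumes, and this is not confirmed anywhere in this thread, that a record is skipped when its REF allele is longer than twice the `-D` distance. The function name `ref_too_long` and the comparison value of 50 are hypothetical and chosen only for illustration.

```python
def ref_too_long(ref: str, max_distance: int) -> bool:
    """Hypothetical check: REF counts as 'too long' if it exceeds 2 * max_distance.

    This models an assumed rule; it is not code quoted from SpliceAI.
    """
    return len(ref) > 2 * max_distance


# REF alleles taken from some of the warnings above.
refs = ["T", "C", "ATACT", "CCA"]

for dist in (0, 50):
    skipped = [ref for ref in refs if ref_too_long(ref, dist)]
    print(f"-D {dist}: {len(skipped)} of {len(refs)} records would be skipped")
```

Under that assumption, even a single-base REF fails the check when the distance is 0, which would match the fact that SNVs such as `G A` are reported as "ref too long". Rerunning without `-D 0` may be worth trying.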

Running example skips records when using gzipped reference

When I run the example files with a gzipped reference like this:

$ spliceai -I input.vcf -O output.vcf -R ~/vep_data/cache/homo_sapiens/92_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz -A grch37

all records are skipped:

Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-04-30 18:11:24.618251: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
/usr/local/lib/python2.7/site-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
WARNING:root:Skipping record (fasta issue): 2   179642185       .       G       A       .       .       .

It works fine when I decompress the reference first:

$ zcat ~/vep_data/cache/homo_sapiens/92_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz > genome.fa
$ spliceai -I input.vcf -O output.vcf -R genome.fa -A grch37
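For pipelines that are driven from Python rather than the shell, the same decompression step can be sketched with the standard library. The function name `decompress_fasta` is hypothetical and the file names are placeholders; this only reproduces the `zcat` workaround above and does not change how SpliceAI reads the reference.

```python
import gzip
import shutil


def decompress_fasta(gz_path: str, out_path: str) -> None:
    """Stream a gzipped FASTA into an uncompressed file.

    copyfileobj copies in chunks, so the genome is never held in memory at once.
    """
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Example call (placeholder paths):
# decompress_fasta("Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz", "genome.fa")
```

The uncompressed `genome.fa` can then be passed to `-R` as in the command above.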

Impossible output shape error

I ran the following command:

spliceai -I inputfile.vcf -O runaiM.vcf -R ucsc.hg19.fasta -A GENCODE.V29LIFT37.BASIC.TXT

and got the output below. For reference, the input file includes 522 variants, and I get an output VCF file with 58 variants (and then spliceai crashes). 11 of those 58 variants are fully annotated, and most of the rest have "SpliceAI=C|.|.|.|.|.|.|.|.|.". Any clues on how to fix it? Thanks!

/home/users/aldocp/.local/lib/python2.7/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using Theano backend.
/home/users/aldocp/.local/lib/python2.7/site-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
Traceback (most recent call last):
  File "/home/users/aldocp/.local/bin/spliceai", line 11, in <module>
    sys.exit(main())
  File "/home/users/aldocp/.local/lib/python2.7/site-packages/spliceai/__main__.py", line 53, in main
    scores = get_delta_scores(record, ann)
  File "/home/users/aldocp/.local/lib/python2.7/site-packages/spliceai/utils.py", line 109, in get_delta_scores
    Y0 = np.asarray(ann.models[0].predict(X_ref))
  File "/home/users/aldocp/.local/lib/python2.7/site-packages/keras/engine/training.py", line 1169, in predict
    steps=steps)
  File "/home/users/aldocp/.local/lib/python2.7/site-packages/keras/engine/training_arrays.py", line 294, in predict_loop
    batch_outs = f(ins_batch)
  File "/home/users/aldocp/.local/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 1388, in __call__
    return self.function(*inputs)
  File "/share/software/user/open/py-theano/1.0.1_py27/lib/python2.7/site-packages/theano/compile/function_module.py", line 917, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/share/software/user/open/py-theano/1.0.1_py27/lib/python2.7/site-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/share/software/user/open/py-theano/1.0.1_py27/lib/python2.7/site-packages/theano/compile/function_module.py", line 903, in __call__
    self.fn() if output_subset is None else\
ValueError: CorrMM: impossible output shape
  bottom shape: 1 x 32 x 0 x 1
  weights shape: 3 x 32 x 1 x 1
  top shape: 1 x 3 x 0 x 1

Apply node that caused the error: CorrMM{valid, (1, 1), (1, 1), 1 False}(Elemwise{Add}[(0, 0)].0, Subtensor{::, ::, ::int64, ::int64}.0)
Toposort index: 1286
Inputs types: [TensorType(float32, (False, False, False, True)), TensorType(float32, (False, False, False, True))]
Inputs shapes: [(1, 32, 0, 1), (3, 32, 1, 1)]
Inputs strides: [(1208704, 37772, 4, 4), (4, 12, -4, -4)]
Inputs values: [array([], shape=(1, 32, 0, 1), dtype=float32), 'not shown']
Outputs clients: [[InplaceDimShuffle{0,2,3,1}(CorrMM{valid, (1, 1), (1, 1), 1 False}.0)]]

Backtrace when the node is created(use Theano flag traceback.limit=N to make it longer):
  File "/home/users/aldocp/.local/lib/python2.7/site-packages/keras/layers/__init__.py", line 55, in deserialize
    printable_module_name='layer')
  File "/home/users/aldocp/.local/lib/python2.7/site-packages/keras/utils/generic_utils.py", line 145, in deserialize_keras_object
    list(custom_objects.items())))
  File "/home/users/aldocp/.local/lib/python2.7/site-packages/keras/engine/network.py", line 1032, in from_config
    process_node(layer, node_data)
  File "/home/users/aldocp/.local/lib/python2.7/site-packages/keras/engine/network.py", line 991, in process_node
    layer(unpack_singleton(input_tensors), **kwargs)
  File "/home/users/aldocp/.local/lib/python2.7/site-packages/keras/engine/base_layer.py", line 457, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/users/aldocp/.local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 163, in call
    dilation_rate=self.dilation_rate[0])
  File "/home/users/aldocp/.local/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 2096, in conv1d
    data_format=data_format, dilation_rate=dilation_rate)
  File "/home/users/aldocp/.local/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 2134, in conv2d
    filter_dilation=dilation_rate)

HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
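The log above reports `Using Theano backend.`, whereas the installation notes for this package list TensorFlow as the requirement. Whether that explains the crash is not established in this thread. As a first check, here is a minimal sketch of how to see which backend multi-backend Keras 2.x would be expected to select. It assumes the usual lookup order: the `KERAS_BACKEND` environment variable, then the `backend` key in `~/.keras/keras.json`, then `tensorflow`. The helper name `resolve_keras_backend` is hypothetical, and the sketch ignores the `KERAS_HOME` override.

```python
import json
import os


def resolve_keras_backend(environ=None, config_path=None):
    """Return the backend name Keras 2.x would be expected to pick (assumed order)."""
    environ = os.environ if environ is None else environ
    if config_path is None:
        config_path = os.path.join(os.path.expanduser("~"), ".keras", "keras.json")

    backend = "tensorflow"  # default when nothing else is configured
    if os.path.exists(config_path):
        try:
            with open(config_path) as handle:
                backend = json.load(handle).get("backend", backend)
        except ValueError:
            pass  # unreadable config file: keep the default

    # A non-empty environment variable takes precedence over keras.json.
    return environ.get("KERAS_BACKEND") or backend


print(resolve_keras_backend())
```

If this prints `theano`, setting `KERAS_BACKEND=tensorflow` in the environment before launching `spliceai` would be one way to test whether the backend is involved.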

No SpliceAI results

I tried SpliceAI and the results all look like this: SpliceAI=C|.|.|.|.|.|.|.|.|.
The VCF file was generated with GATK from WES data. Could you suggest a solution? Thanks!

Here is the head of the results:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=my_snp_filter,Description="QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.7-0-gcfedb67,Date="Tue Jul 23 11:51:30 EDT 2019",Epoch=1563897090053,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[Reorder_dedup_NGTS1801253601.sam.bam.bam] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=../../ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=500 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 secondsBetweenProgressUpdates=10 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=20 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=LINEAR variant_index_parameter=128000 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false out=/external/rprshnas01/wgs_data/ALS-WES/Analysis/Reorder_dedup_NGTS1801253601.sam.bam.bam.vcf 
likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN dbsnp=(RodBinding name= source=UNBOUND) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[] excludeAnnotation=[] group=[StandardAnnotation, StandardHCAnnotation] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=NONE bamOutput=null bamWriterType=CALLED_HAPLOTYPES emitDroppedReads=false disableOptimizations=false annotateNDA=false useNewAFCalculator=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 heterozygosity_stdev=0.01 standard_min_confidence_threshold_for_calling=10.0 standard_min_confidence_threshold_for_emitting=30.0 max_alternate_alleles=6 max_genotype_count=1024 max_num_PL_values=100 input_prior=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=false gcpHMM=10 pair_hmm_implementation=VECTOR_LOGLESS_CACHING pair_hmm_sub_implementation=ENABLE_ALL always_load_vector_logless_PairHMM_lib=false phredScaledGlobalReadMismappingRate=45 noFpga=false sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false maxNumHaplotypesInPopulation=128 errorCorrectKmers=false minPruning=2 debugGraphTransformations=false allowCyclesInKmerGraphToGeneratePaths=false graphOutput=null kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] 
indelSizeToEliminateInRefModel=10 min_base_quality_score=10 includeUmappedReads=false useAllelesTrigger=false doNotRunPhysicalPhasing=true keepRG=null justDetermineActiveRegions=false dontGenotype=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false errorCorrectReads=false pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=10000 minReadsPerAlignmentStart=10 mergeVariantsViaLD=false activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSize=null bandPassSigma=null maxReadsInMemoryPerSample=30000 maxTotalReadsInMemory=10000000 maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##GATKCommandLine.SelectVariants=<ID=SelectVariants,Version=3.7-0-gcfedb67,Date="Thu Jul 25 13:08:54 EDT 2019",Epoch=1564074534110,CommandLineOptions="analysis_type=SelectVariants input_file=[] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=../../ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 secondsBetweenProgressUpdates=10 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=Reorder_dedup_NGTS1801253601.sam.bam.bam.vcf) discordance=(RodBinding name= source=UNBOUND) 
concordance=(RodBinding name= source=UNBOUND) out=/external/rprshnas01/wgs_data/ALS-WES/Analysis/rawsnps_Reorder_dedup_NGTS1801253601.sam.bam.bam.vcf.vcf sample_name=[] sample_expressions=null sample_file=null exclude_sample_name=[] exclude_sample_file=[] exclude_sample_expressions=[] selectexpressions=[] invertselect=false excludeNonVariants=false excludeFiltered=false preserveAlleles=false removeUnusedAlternates=false restrictAllelesTo=ALL keepOriginalAC=false keepOriginalDP=false mendelianViolation=false invertMendelianViolation=false mendelianViolationQualThreshold=0.0 select_random_fraction=0.0 remove_fraction_genotypes=0.0 selectTypeToInclude=[SNP] selectTypeToExclude=[] keepIDs=null excludeIDs=null fullyDecode=false justRead=false maxIndelSize=2147483647 minIndelSize=0 maxFilteredGenotypes=2147483647 minFilteredGenotypes=0 maxFractionFilteredGenotypes=1.0 minFractionFilteredGenotypes=0.0 maxNOCALLnumber=2147483647 maxNOCALLfraction=1.0 setFilteredGtToNocall=false ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES=false forceValidOutput=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.7-0-gcfedb67,Date="Thu Jul 25 14:08:22 EDT 2019",Epoch=1564078102578,CommandLineOptions="analysis_type=VariantFiltration input_file=[] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=../../ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 secondsBetweenProgressUpdates=10 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=rawsnps_Reorder_dedup_NGTS1801253601.sam.bam.bam.vcf.vcf) mask=(RodBinding name= source=UNBOUND) 
out=/external/rprshnas01/wgs_data/ALS-WES/Analysis/filtered_snps_NGTS1801253601.vcf filterExpression=[QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0] filterName=[my_snp_filter] genotypeFilterExpression=[] genotypeFilterName=[] clusterSize=3 clusterWindowSize=0 maskExtension=0 maskName=Mask filterNotInMask=false missingValuesInExpressionsShouldEvaluateAsFailing=false invalidatePreviousFilters=false invertFilterExpression=false invertGenotypeFilterExpression=false setFilteredGtToNocall=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##contig=<ID=chrM,length=16571,assembly=hg19>
##contig=<ID=chr1,length=249250621,assembly=hg19>
##contig=<ID=chr2,length=243199373,assembly=hg19>
##contig=<ID=chr3,length=198022430,assembly=hg19>
##contig=<ID=chr4,length=191154276,assembly=hg19>
##contig=<ID=chr5,length=180915260,assembly=hg19>
##contig=<ID=chr6,length=171115067,assembly=hg19>
##contig=<ID=chr7,length=159138663,assembly=hg19>
##contig=<ID=chr8,length=146364022,assembly=hg19>
##contig=<ID=chr9,length=141213431,assembly=hg19>
##contig=<ID=chr10,length=135534747,assembly=hg19>
##contig=<ID=chr11,length=135006516,assembly=hg19>
##contig=<ID=chr12,length=133851895,assembly=hg19>
##contig=<ID=chr13,length=115169878,assembly=hg19>
##contig=<ID=chr14,length=107349540,assembly=hg19>
##contig=<ID=chr15,length=102531392,assembly=hg19>
##contig=<ID=chr16,length=90354753,assembly=hg19>
##contig=<ID=chr17,length=81195210,assembly=hg19>
##contig=<ID=chr18,length=78077248,assembly=hg19>
##contig=<ID=chr19,length=59128983,assembly=hg19>
##contig=<ID=chr20,length=63025520,assembly=hg19>
##contig=<ID=chr21,length=48129895,assembly=hg19>
##contig=<ID=chr22,length=51304566,assembly=hg19>
##contig=<ID=chrX,length=155270560,assembly=hg19>
##contig=<ID=chrY,length=59373566,assembly=hg19>
##contig=<ID=chr1_gl000191_random,length=106433,assembly=hg19>
##contig=<ID=chr1_gl000192_random,length=547496,assembly=hg19>
##contig=<ID=chr4_ctg9_hap1,length=590426,assembly=hg19>
##contig=<ID=chr4_gl000193_random,length=189789,assembly=hg19>
##contig=<ID=chr4_gl000194_random,length=191469,assembly=hg19>
##contig=<ID=chr6_apd_hap1,length=4622290,assembly=hg19>
##contig=<ID=chr6_cox_hap2,length=4795371,assembly=hg19>
##contig=<ID=chr6_dbb_hap3,length=4610396,assembly=hg19>
##contig=<ID=chr6_mann_hap4,length=4683263,assembly=hg19>
##contig=<ID=chr6_mcf_hap5,length=4833398,assembly=hg19>
##contig=<ID=chr6_qbl_hap6,length=4611984,assembly=hg19>
##contig=<ID=chr6_ssto_hap7,length=4928567,assembly=hg19>
##contig=<ID=chr7_gl000195_random,length=182896,assembly=hg19>
##contig=<ID=chr8_gl000196_random,length=38914,assembly=hg19>
##contig=<ID=chr8_gl000197_random,length=37175,assembly=hg19>
##contig=<ID=chr9_gl000198_random,length=90085,assembly=hg19>
##contig=<ID=chr9_gl000199_random,length=169874,assembly=hg19>
##contig=<ID=chr9_gl000200_random,length=187035,assembly=hg19>
##contig=<ID=chr9_gl000201_random,length=36148,assembly=hg19>
##contig=<ID=chr11_gl000202_random,length=40103,assembly=hg19>
##contig=<ID=chr17_ctg5_hap1,length=1680828,assembly=hg19>
##contig=<ID=chr17_gl000203_random,length=37498,assembly=hg19>
##contig=<ID=chr17_gl000204_random,length=81310,assembly=hg19>
##contig=<ID=chr17_gl000205_random,length=174588,assembly=hg19>
##contig=<ID=chr17_gl000206_random,length=41001,assembly=hg19>
##contig=<ID=chr18_gl000207_random,length=4262,assembly=hg19>
##contig=<ID=chr19_gl000208_random,length=92689,assembly=hg19>
##contig=<ID=chr19_gl000209_random,length=159169,assembly=hg19>
##contig=<ID=chr21_gl000210_random,length=27682,assembly=hg19>
##contig=<ID=chrUn_gl000211,length=166566,assembly=hg19>
##contig=<ID=chrUn_gl000212,length=186858,assembly=hg19>
##contig=<ID=chrUn_gl000213,length=164239,assembly=hg19>
##contig=<ID=chrUn_gl000214,length=137718,assembly=hg19>
##contig=<ID=chrUn_gl000215,length=172545,assembly=hg19>
##contig=<ID=chrUn_gl000216,length=172294,assembly=hg19>
##contig=<ID=chrUn_gl000217,length=172149,assembly=hg19>
##contig=<ID=chrUn_gl000218,length=161147,assembly=hg19>
##contig=<ID=chrUn_gl000219,length=179198,assembly=hg19>
##contig=<ID=chrUn_gl000220,length=161802,assembly=hg19>
##contig=<ID=chrUn_gl000221,length=155397,assembly=hg19>
##contig=<ID=chrUn_gl000222,length=186861,assembly=hg19>
##contig=<ID=chrUn_gl000223,length=180455,assembly=hg19>
##contig=<ID=chrUn_gl000224,length=179693,assembly=hg19>
##contig=<ID=chrUn_gl000225,length=211173,assembly=hg19>
##contig=<ID=chrUn_gl000226,length=15008,assembly=hg19>
##contig=<ID=chrUn_gl000227,length=128374,assembly=hg19>
##contig=<ID=chrUn_gl000228,length=129120,assembly=hg19>
##contig=<ID=chrUn_gl000229,length=19913,assembly=hg19>
##contig=<ID=chrUn_gl000230,length=43691,assembly=hg19>
##contig=<ID=chrUn_gl000231,length=27386,assembly=hg19>
##contig=<ID=chrUn_gl000232,length=40652,assembly=hg19>
##contig=<ID=chrUn_gl000233,length=45941,assembly=hg19>
##contig=<ID=chrUn_gl000234,length=40531,assembly=hg19>
##contig=<ID=chrUn_gl000235,length=34474,assembly=hg19>
##contig=<ID=chrUn_gl000236,length=41934,assembly=hg19>
##contig=<ID=chrUn_gl000237,length=45867,assembly=hg19>
##contig=<ID=chrUn_gl000238,length=39939,assembly=hg19>
##contig=<ID=chrUn_gl000239,length=33824,assembly=hg19>
##contig=<ID=chrUn_gl000240,length=41933,assembly=hg19>
##contig=<ID=chrUn_gl000241,length=42152,assembly=hg19>
##contig=<ID=chrUn_gl000242,length=43523,assembly=hg19>
##contig=<ID=chrUn_gl000243,length=43341,assembly=hg19>
##contig=<ID=chrUn_gl000244,length=39929,assembly=hg19>
##contig=<ID=chrUn_gl000245,length=36651,assembly=hg19>
##contig=<ID=chrUn_gl000246,length=38154,assembly=hg19>
##contig=<ID=chrUn_gl000247,length=36422,assembly=hg19>
##contig=<ID=chrUn_gl000248,length=39786,assembly=hg19>
##contig=<ID=chrUn_gl000249,length=38502,assembly=hg19>
##reference=file:///external/rprshnas01/wgs_data/ALS-WES/Analysis/../../ucsc.hg19.fasta
##source=SelectVariants
##INFO=<ID=SpliceAI,Number=.,Type=String,Description="SpliceAI variant annotation. These include delta scores (DS) and delta positions (DP) for acceptor gain (AG), acceptor loss (AL), donor gain (DG), and donor loss (DL). Format: ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chrM 150 . T C 1685.78 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=1.606;ClippingRankSum=0;DP=53;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=33.72;ReadPosRankSum=-1.21;SOR=0.693;SpliceAI=C|.|.|.|.|.|.|.|.|. GT:AD:DP:GQ:PL
chrM 152 . T C 395.77 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-0.703;ClippingRankSum=0;DP=53;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=7.92;ReadPosRankSum=0.42;SOR=0.743;SpliceAI=C|.|.|.|.|.|.|.|.|. GT:AD:DP:GQ:PL
chrM 184 . G A 371.77 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-0.521;ClippingRankSum=0;DP=49;ExcessHet=3.0103;FS=1.257;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=8.08;ReadPosRankSum=-0.813;SOR=0.495;SpliceAI=A|.|.|.|.|.|.|.|.|. GT:AD:DP:GQ:PL
chrM 195 . C T 1827.77 PASS AC=2;AF=1;AN=2;DP=50;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=34.04;SOR=0.874;SpliceAI=T|.|.|.|.|.|.|.|.|. GT:AD:DP:GQ:PL
chrM 199 . T C 552.77 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0.582;ClippingRankSum=0;DP=49;ExcessHet=3.0103;FS=2.66;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=12.28;ReadPosRankSum=0.281;SOR=0.977;SpliceAI=C|.|.|.|.|.|.|.|.|. GT:AD:DP:GQ:PL
chrM 200 . A G 357.77 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0.837;ClippingRankSum=0;DP=47;ExcessHet=3.0103;FS=4.97;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=8.32;ReadPosRankSum=0.291;SOR=1;SpliceAI=G|.|.|.|.|.|.|.|.|. GT:AD:DP:GQ:PL
chrM 204 . T C 271.77 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0.832;ClippingRankSum=0;DP=46;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=6.47;ReadPosRankSum=0.177;SOR=0.936;SpliceAI=C|.|.|.|.|.|.|.|.|. GT:AD:DP:GQ:PL
chrM 235 . A G 243.77 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=2.066;ClippingRankSum=0;DP=41;ExcessHet=3.0103;FS=1.448;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=6.59;ReadPosRankSum=-0.394;SOR=0.404;SpliceAI=G|.|.|.|.|.|.|.|.|. GT:AD:DP:GQ:PL

Inaccurate error message when unable to write a new index file for reference fasta

If the reference genome fasta file specified is in a read-only directory, spliceai fails with:

ERROR:root:Reference genome fasta file /opt/reference.fa not found, exiting.

However, the error being thrown (with some variation in *.fa.fai filename) is actually:

OSError: /opt/reference.fa.fai may not be writable. Please use Fasta(rebuild=False), Faidx(rebuild=False) or faidx --no-rebuild.

We keep our references write-protected to prevent accidental editing or overwriting, and we have already created index files for them. This also causes problems when running SpliceAI in a Singularity image, since those use a read-only filesystem.
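A minimal sketch of how a wrapper script could tell these situations apart before invoking spliceai, using only the standard library. The function name `fasta_index_status` and its status strings are hypothetical and not part of SpliceAI; the `.fai` suffix is taken from the error message above, and treating an index older than its FASTA as "stale" is an assumption about what prompts the rebuild attempt.

```python
import os


def fasta_index_status(fasta_path):
    """Classify a reference FASTA path (hypothetical helper, not SpliceAI code).

    Returns one of:
      "missing_fasta"    - the FASTA itself does not exist
      "index_present"    - a .fai index exists and is not older than the FASTA
      "index_stale"      - a .fai index exists but is older than the FASTA
                           (assumed to be what triggers a rebuild attempt)
      "index_creatable"  - no index yet, but the directory is writable
      "index_unwritable" - no index and the directory is not writable
    """
    if not os.path.isfile(fasta_path):
        return "missing_fasta"

    index_path = fasta_path + ".fai"
    if os.path.isfile(index_path):
        if os.path.getmtime(index_path) < os.path.getmtime(fasta_path):
            return "index_stale"
        return "index_present"

    directory = os.path.dirname(os.path.abspath(fasta_path))
    if os.access(directory, os.W_OK):
        return "index_creatable"
    return "index_unwritable"
```

Only the first status corresponds to the "not found" message; the others would point at the index rather than at the FASTA.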

Gene AC068620.1 location

I was checking the SpliceAI scores for variants in the gene PPAT from file: spliceai_scores.masked.snv.hg38.vcf.gz and I noticed this gene overlaps gene AC068620.1.
Example:

4       56410516        .       A       C       .       .       SpliceAI=C|AC068620.1|0.00|0.00|0.00|0.00|6|8|38|-22
4       56410516        .       A       C       .       .       SpliceAI=C|PPAT|0.01|0.00|0.00|0.00|-1|27|0|-41
4       56410516        .       A       G       .       .       SpliceAI=G|AC068620.1|0.00|0.00|0.00|0.00|1|6|-1|38
4       56410516        .       A       G       .       .       SpliceAI=G|PPAT|0.00|0.00|0.00|0.00|27|-37|-41|0
4       56410516        .       A       T       .       .       SpliceAI=T|AC068620.1|0.00|0.00|0.00|0.00|6|0|50|-22
4       56410516        .       A       T       .       .       SpliceAI=T|PPAT|0.00|0.00|0.00|0.00|27|-37|-41|-42

From the scores file we can see that AC068620.1 has coordinates 4:56410516-56410965. However, in GRCh38 this gene has coordinates 4:56387625-56388153 (which doesn't overlap PPAT).
Is there any explanation of why this is happening?
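One way to reproduce this kind of check is to collect the lowest and highest annotated position per gene symbol from the scores file and compare that span with the expected coordinates. The sketch below assumes whitespace-separated VCF data lines with INFO in the eighth column, as in the example above; `gene_spans` and `overlaps` are hypothetical helper names.

```python
def gene_spans(vcf_lines):
    """Return {(chrom, symbol): (min_pos, max_pos)} over SpliceAI-annotated lines."""
    spans = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        chrom, pos, info = fields[0], int(fields[1]), fields[7]
        for entry in info.split(";"):
            if not entry.startswith("SpliceAI="):
                continue
            # One INFO entry may hold several comma-separated annotations.
            for annotation in entry[len("SpliceAI="):].split(","):
                symbol = annotation.split("|")[1]
                lo, hi = spans.get((chrom, symbol), (pos, pos))
                spans[(chrom, symbol)] = (min(lo, pos), max(hi, pos))
    return spans


def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


example = [
    "4 56410516 . A C . . SpliceAI=C|AC068620.1|0.00|0.00|0.00|0.00|6|8|38|-22",
    "4 56410516 . A C . . SpliceAI=C|PPAT|0.01|0.00|0.00|0.00|-1|27|0|-41",
]
print(gene_spans(example))
# The span seen in the scores file vs. the GRCh38 coordinates quoted above:
print(overlaps((56410516, 56410965), (56387625, 56388153)))  # False
```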

Lacking Support for pandas Version 1.0.0

Summary

Went through a fresh install today and received the following error when running the command
spliceai -I input.vcf -O output.vcf -R genome.fa -A grch38

image

Downgrading to the previous minor version (0.25.3) was functional but raised these deprecation warnings.

image

Cause

The latest pandas release has removed the deprecated get_values method (seemingly in favor of the to_numpy / array methods).

Solution

Replace the get_values calls within the __init__ of the Annotations class with the pandas-recommended methods.
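As a hedged illustration of that replacement, a small shim can prefer `to_numpy()` and fall back to `get_values()` only where the newer method is missing. The helper name `column_values` is hypothetical; it is a sketch of one possible fix, not the change actually made in SpliceAI.

```python
def column_values(series):
    """Return the underlying array of a pandas Series on old and new pandas.

    Series.to_numpy() exists from pandas 0.24 onwards; get_values() is the
    deprecated spelling that pandas 1.0.0 no longer provides.
    """
    if hasattr(series, "to_numpy"):
        return series.to_numpy()
    return series.get_values()  # only reached on older pandas
```

A call such as `df['#NAME'].get_values()` would then become `column_values(df['#NAME'])`; if only pandas 0.24 or later needs to be supported, calling `.to_numpy()` directly is simpler.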

optimisation, run recommendations

Hey,
I was finally able to run SpliceAI on my server (after reinstalling conda). It uses 64 CPUs but it's going really slowly. Is there anything I can do to speed it up?
My VCFs have ~120000 variants. Maybe I should remove the variants that are in the middle of exons?
Does it cache already-encountered variants anywhere, so that the same variants in other samples won't be processed again?

Is this warning important?
UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.

here is my example ongoing output:

spliceai -I xxxx_final.vcf -O output.vcf -R /mnt/ssd_01/refs/hs37d5_noHap.fa -A grch37
Using TensorFlow backend.
WARNING: Logging before flag parsing goes to stderr.
W0726 10:16:57.406675 140344263636800 deprecation_wrapper.py:119] From /home/damian/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0726 10:16:57.473724 140344263636800 deprecation_wrapper.py:119] From /home/damian/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0726 10:16:57.605217 140344263636800 deprecation_wrapper.py:119] From /home/damian/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:131: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0726 10:16:57.605415 140344263636800 deprecation_wrapper.py:119] From /home/damian/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

W0726 10:17:05.270048 140344263636800 deprecation_wrapper.py:119] From /home/damian/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2019-07-26 10:17:05.270867: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2019-07-26 10:17:05.323606: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1995195000 Hz
2019-07-26 10:17:05.335810: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5632022d3230 executing computations on platform Host. Devices:
2019-07-26 10:17:05.335847: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
OMP: Info #212: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-63
OMP: Info #156: KMP_AFFINITY: 64 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 4 packages x 8 cores/pkg x 2 threads/core (32 total cores)
OMP: Info #214: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 32 maps to package 0 core 0 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 1 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 36 maps to package 0 core 1 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 2 thread 0 
(...)
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 3 core 8 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 47 maps to package 3 core 8 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 3 core 17 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 51 maps to package 3 core 17 thread 1to package 3 core 24 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 31 maps to package 3 core 25 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 63 maps to package 3 core 25 thread 1 
OMP: Info #250: KMP_AFFINITY: pid 56609 tid 56609 thread 0 bound to OS proc set 0
2019-07-26 10:17:05.346735: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2019-07-26 10:17:07.518215: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
/home/damian/anaconda3/lib/python3.7/site-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
OMP: Info #250: KMP_AFFINITY: pid 56609 tid 56790 thread 1 bound to OS proc set 4
OMP: Info #250: KMP_AFFINITY: pid 56609 tid 56795 thread 4 bound to OS proc set 16
OMP: Info #250: KMP_AFFINITY: pid 56609 tid 56793 thread 2 bound to OS proc set 8
(...)
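On the question of repeated variants across samples: independent of whatever the tool does internally, one workaround is to collapse all samples to a list of unique sites, score that list once, and join the scores back afterwards. A minimal sketch, assuming tab-separated VCF data lines; `unique_variants` is a hypothetical helper name.

```python
def unique_variants(*vcf_line_iterables):
    """Yield each (chrom, pos, ref, alt) once, in first-seen order."""
    seen = set()
    for lines in vcf_line_iterables:
        for line in lines:
            if not line.strip() or line.startswith("#"):
                continue
            chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
            # Multi-allelic records are split into one key per ALT allele.
            for alt in alts.split(","):
                key = (chrom, int(pos), ref, alt)
                if key not in seen:
                    seen.add(key)
                    yield key
```

Each open VCF file handle can be passed as one argument, since file objects iterate over lines.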

GRCh38 Concatenation Issue

When I tried to run SpliceAI on a list of variants on hg38 I obtained this error message. Looks to be something with numpy concatenate that is used within SpliceAI.

image

Any help would be much appreciated!

No training config error

I was wondering if anyone might have any ideas as to what happened with my execution of SpliceAI. I ran my command in January and it took weeks to run. I just checked the output with the diff command and it seems that my output is the same as the input file. The log shows the following error:

2020-01-10 15:12:01.924711: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: FMA
2020-01-10 15:12:01.947975: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499910000 Hz
2020-01-10 15:12:01.955758: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x561956be8590 executing computations on platform Host. Devices:
2020-01-10 15:12:01.955856: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
Using TensorFlow backend.
/home/cavery/miniconda3/envs/py36/lib/python3.6/site-packages/spliceai/utils.py:21: FutureWarning: The 'get_values' method is deprecated and will be removed in a future version. Use '.to_numpy()' or '.array' instead.
  self.genes = df['#NAME'].get_values()
/home/cavery/miniconda3/envs/py36/lib/python3.6/site-packages/spliceai/utils.py:22: FutureWarning: The 'get_values' method is deprecated and will be removed in a future version. Use '.to_numpy()' or '.array' instead.
  self.chroms = df['CHROM'].get_values()
/home/cavery/miniconda3/envs/py36/lib/python3.6/site-packages/spliceai/utils.py:23: FutureWarning: The 'get_values' method is deprecated and will be removed in a future version. Use '.to_numpy()' or '.array' instead.
  self.strands = df['STRAND'].get_values()
/home/cavery/miniconda3/envs/py36/lib/python3.6/site-packages/spliceai/utils.py:24: FutureWarning: The 'get_values' method is deprecated and will be removed in a future version. Use '.to_numpy()' or '.array' instead.
  self.tx_starts = df['TX_START'].get_values()+1
/home/cavery/miniconda3/envs/py36/lib/python3.6/site-packages/spliceai/utils.py:25: FutureWarning: The 'get_values' method is deprecated and will be removed in a future version. Use '.to_numpy()' or '.array' instead.
  self.tx_ends = df['TX_END'].get_values()
/home/cavery/miniconda3/envs/py36/lib/python3.6/site-packages/spliceai/utils.py:27: FutureWarning: The 'get_values' method is deprecated and will be removed in a future version. Use '.to_numpy()' or '.array' instead.
  for c in df['EXON_START'].get_values()]
/home/cavery/miniconda3/envs/py36/lib/python3.6/site-packages/spliceai/utils.py:29: FutureWarning: The 'get_values' method is deprecated and will be removed in a future version. Use '.to_numpy()' or '.array' instead.
  for c in df['EXON_END'].get_values()]
/home/cavery/miniconda3/envs/py36/lib/python3.6/site-packages/keras/engine/saving.py:341: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '

This is the complete log and there is no traceback to hint at what might have occurred.
Advice?

Thank you!

Consider omitting the INFO field for non-predictions?

Would it be better to not output an INFO field for non-predictions? VCF specs allow for this. It would make the output smaller and easier to grep for variants with predictions. I'm getting a lot of SpliceAI=[ACGT]+|.|.|.|.|.|.|.|.|..

Thanks for the tool. It was dearly needed.
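Until the output changes, such records can be filtered after the fact. The sketch below treats an annotation as a non-prediction when the allele is followed by nine "." fields, matching the pattern quoted above; `has_prediction` is a hypothetical helper name.

```python
import re

# ALLELE followed by nine "." fields (SYMBOL, four DS values, four DP values).
_EMPTY = re.compile(r"^[ACGT]+\|\.(\|\.){8}$")


def has_prediction(info):
    """True if any SpliceAI annotation in a VCF INFO string carries scores."""
    for entry in info.split(";"):
        if entry.startswith("SpliceAI="):
            annotations = entry[len("SpliceAI="):].split(",")
            return any(not _EMPTY.match(a) for a in annotations)
    return False  # no SpliceAI key at all


print(has_prediction("DP=53;SpliceAI=C|.|.|.|.|.|.|.|.|."))               # False
print(has_prediction("SpliceAI=C|PPAT|0.01|0.00|0.00|0.00|-1|27|0|-41"))  # True
```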

PyVCF.reader crashes on SpliceAI generated VCF

Hi SpliceAI masters --

Has anyone noticed that, when reading in a SpliceAI VCF file that was computed de novo, the Python PyVCF parser's .reader class crashes on

"##ALT=ALT from Sutr file"

Is this a duplicate definition that breaks the VCF spec? I took it out and all is better...

Thanks for the excellent work!

Daniel

Illegal instruction (core dumped)

Doesn't work for me :)

spliceai -I test.vcf -O spliceai.vcf -R /refs/hs37d5_noHap.fa -A grch37
Using TensorFlow backend.
Illegal instruction (core dumped)

How to debug it?

Apply mask before argmax

Dear Kishore,

Thanks for the update. Can you explain the masking feature a bit more? The way I understand the code, for each task you choose the position with the highest predicted effect and then check whether that position is at a known splice junction. This goes against my intuition, since I would mask the position of the splice junction (or everything else for a junction) and then look for the most affected remaining (unmasked) position. Further, it seems to me that you are only checking the closest splice junction, which, especially for small exons, is not necessarily the only one to consider?

Cheers, Philipp
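The two orderings contrasted in this question can be shown on toy numbers. Neither function below is SpliceAI's actual masking code; the names, the toy scores, and the boolean mask are all assumptions made for illustration.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def pick_then_mask(delta, masked):
    """Take the top position first; report 0.0 if it happens to be masked."""
    i = argmax(delta)
    return i, (0.0 if masked[i] else delta[i])


def mask_then_pick(delta, masked):
    """Zero out masked positions first, then take the top remaining position."""
    kept = [0.0 if m else d for d, m in zip(delta, masked)]
    i = argmax(kept)
    return i, kept[i]


delta = [0.05, 0.90, 0.40, 0.10]      # toy per-position delta scores
masked = [False, True, False, False]  # position 1 stands for a known junction

print(pick_then_mask(delta, masked))  # (1, 0.0)
print(mask_then_pick(delta, masked))  # (2, 0.4)
```

With the first ordering the effect at position 2 is never reported; with the second it is.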

Different output for same MT variant

Hello, I'm running SpliceAI to annotate all possible ref/alt alterations in chromosome MT (within genes). I ran the same job twice, once with 32 cores and a second time with 2 cores, using the same input file, the same default values and the same fasta file.
I would expect to get the same output; however, there's one variant with a different DP_AL.

Output run with 32 cores:
MT 5173 . A T . . SpliceAI=T|MT-ND2|0.00|0.00|0.00|0.00|10|-32|9|3
Output run with 2 cores:
MT 5173 . A T . . SpliceAI=T|MT-ND2|0.00|0.00|0.00|0.00|10|3|9|3

Was this behavior observed before?
Thanks in advance.

custom sequence annotation

Hi, thank you for sharing this very interesting project.

I am interested in annotating splice sites, and as I understand this is possible using the code in the README under "3. Can SpliceAI be used to score custom sequences?".

What is the output supposed to look like?

This is what I get:

array([[[9.99996364e-01, 3.40426527e-06, 1.84067687e-07],
        [9.99996662e-01, 2.71707631e-06, 6.23323729e-07],
        [9.99994576e-01, 5.18830893e-06, 2.79781887e-07],
        [9.99998927e-01, 7.55508211e-07, 3.41592965e-07],
        [9.99997735e-01, 2.15835212e-06, 2.53134544e-07],
        [9.99997318e-01, 1.50860694e-06, 1.16936849e-06],
        [9.99996960e-01, 2.02024376e-06, 1.05931451e-06],
        [9.99994755e-01, 5.04309583e-06, 2.04231384e-07],
        [9.98610377e-01, 2.63422262e-05, 1.36317930e-03],
        [9.99994755e-01, 4.74206627e-06, 5.73345631e-07],
        [9.99991596e-01, 5.64515540e-06, 2.68636268e-06],
        [9.99995530e-01, 1.74788761e-06, 2.76254627e-06],
        [9.99977887e-01, 6.86904605e-06, 1.53402689e-05],
        [9.99995053e-01, 4.54184283e-06, 3.90558029e-07],
        [9.99982834e-01, 7.04187778e-06, 1.01336136e-05],
        [9.99998093e-01, 1.29824912e-06, 6.42674308e-07],
        [9.99998271e-01, 1.44282819e-06, 3.55954569e-07],
        [9.99996364e-01, 2.94889173e-06, 5.85653822e-07],
        [9.99987781e-01, 1.10939018e-05, 1.12047121e-06],
        [9.99997616e-01, 2.22323160e-06, 2.50069832e-07],
        [9.99994159e-01, 5.47670834e-06, 3.12220266e-07],
        [9.99993503e-01, 5.98185261e-06, 5.30127295e-07],
        [9.99992371e-01, 6.53128291e-06, 1.09182520e-06],
        [9.99994457e-01, 3.16445880e-06, 2.30126102e-06],
        [9.99996960e-01, 2.51003553e-06, 5.60837691e-07],
        [9.99990821e-01, 7.84443364e-06, 1.30404328e-06],
        [9.99998093e-01, 1.68042948e-06, 2.47538310e-07],
        [9.99998212e-01, 1.40566101e-06, 4.16315345e-07],
        [9.99993980e-01, 4.94392680e-06, 1.01284820e-06],
        [9.99956906e-01, 1.56221540e-05, 2.74412796e-05],
        [9.99984622e-01, 1.46040666e-05, 6.91071818e-07],
        [9.99996960e-01, 1.28972226e-06, 1.74354807e-06],
        [9.99993980e-01, 4.21906589e-06, 1.75190837e-06],
        [9.99996960e-01, 2.47964954e-06, 5.13104339e-07],
        [9.99973595e-01, 1.17214395e-05, 1.46894872e-05],
        [9.99996006e-01, 3.24902180e-06, 7.17190403e-07],
        [9.99983966e-01, 7.92639821e-06, 8.18002263e-06],
        [9.99997139e-01, 2.31662875e-06, 5.50594791e-07],
        [9.99994278e-01, 1.66906466e-06, 4.15140403e-06]]], dtype=float32)

From the probabilities, I don't see any clear acceptor/ donor site being called -- am I misunderstanding the code (i.e. you're not supposed to call splice sites with it) or the result?

Thanks!
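For reading such an array: the README example takes column 1 as the acceptor probability and column 2 as the donor probability, which leaves column 0 as "neither". A minimal sketch of thresholding on that basis follows; the function name `call_sites` and the 0.5 cutoff are assumptions chosen for illustration, not recommended values.

```python
def call_sites(rows, threshold=0.5):
    """Return (acceptor_positions, donor_positions) for rows of
    [neither, acceptor, donor] probabilities (column order as assumed above)."""
    acceptors = [i for i, r in enumerate(rows) if r[1] >= threshold]
    donors = [i for i, r in enumerate(rows) if r[2] >= threshold]
    return acceptors, donors


# Three rows copied from the output above (rows 0, 8 and 9, counting from 0).
rows = [
    [9.99996364e-01, 3.40426527e-06, 1.84067687e-07],
    [9.98610377e-01, 2.63422262e-05, 1.36317930e-03],
    [9.99994755e-01, 4.74206627e-06, 5.73345631e-07],
]
print(call_sites(rows))  # ([], [])
```

Even the largest donor value in the posted output (about 0.0014, at row 8) is far below such a cutoff, so no site is called for this input.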

Performance Warnings with Tensorflow 2.3.0

Hi,
I'm getting a few of the following warnings after updating Tensorflow to version 2.3.0:

WARNING:tensorflow:5 out of the last 5 calls to <function Model.make_predict_function.<locals>.predict_function at 0x7fbdf4728d08> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.

Can this be ignored or should I better downgrade to version 2.2.0 ?

Thanks and best regards,

Sebastian

Runtime delta score error

Hello,

Everything is running fine except at the end when it gets the delta scores and outputs this error:

  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/bin/spliceai", line 11, in <module>
    sys.exit(main())
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/spliceai/__main__.py", line 53, in main
    scores = get_delta_scores(record, ann)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/spliceai/utils.py", line 137, in get_delta_scores
    Y = np.concatenate([Y_ref, Y_alt])
ValueError: all the input array dimensions except for the concatenation axis must match exactly

Bug in precomputed hg38 SpliceAI scores

OR4F5 gene is located at position chr1:69091-70008 in hg19 and at position chr1:69055-70108 in hg38.

In both raw and masked precomputed vcfs for hg38, OR4F5 is computed on the hg19 coordinates. I manually checked a couple random genes and they all look fine, so I think it is just an off by 1 error

Performance bottleneck

Greetings.

y_ref = np.mean([ann.models[m].predict(x_ref) for m in range(5)], axis=0)

Have you considered that this line (along with the next one) is a huge performance bottleneck, because you are basically computing a small batch while loading and unloading models from GPU memory? GPUs are only effective when you feed them consistently with large batches, without any context switches.
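A generic sketch of the batching idea: group inputs of equal length and make one prediction call per group instead of one per variant. Here `predict_fn` is a hypothetical stand-in for a model's predict call, and nothing below reflects how SpliceAI itself is structured.

```python
from collections import defaultdict


def predict_in_batches(inputs, predict_fn):
    """Call predict_fn once per group of equal-length inputs.

    inputs     : list of sequences (any objects supporting len())
    predict_fn : hypothetical callable taking a list of same-length inputs
                 and returning one result per input, in the same order
    Returns the results in the original input order.
    """
    groups = defaultdict(list)
    for index, item in enumerate(inputs):
        groups[len(item)].append(index)

    results = [None] * len(inputs)
    for indices in groups.values():
        batch = [inputs[i] for i in indices]
        for i, result in zip(indices, predict_fn(batch)):
            results[i] = result
    return results
```

Grouping by length matters because inputs can only be stacked into one array when their shapes agree.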

installation issue with conda

conda create --yes --name spliceai; conda activate spliceai; conda install --yes tensorflow=1.2-0; conda install --yes spliceai=1.2

Running spliceai -h works.

But running with actual data gives this error:

$ spliceai -I temp/vt.VCFs.GATK__1:83083541-110778053.vcf.gz -R /data/OGVFB/resources/1000G_phase2_GRCh37/human_g1k_v37_decoy.fasta -A grch37 -O temp/spliceai.VCFs.GATK__1:83083541-110778053.vcf.gz
Using TensorFlow backend.
Traceback (most recent call last):
  File "/data/mcgaugheyd/conda/envs/spliceai/bin/spliceai", line 12, in <module>
    sys.exit(main())
  File "/usr/local/apps/spliceai/20190507/src/spliceai/__main__.py", line 53, in main
    ann = Annotator(args.R, args.A)
  File "/usr/local/apps/spliceai/20190507/src/spliceai/utils.py", line 39, in __init__
    self.models = [load_model(resource_filename(__name__, x)) for x in paths]
  File "/usr/local/apps/spliceai/20190507/src/spliceai/utils.py", line 39, in <listcomp>
    self.models = [load_model(resource_filename(__name__, x)) for x in paths]
  File "/usr/local/apps/spliceai/20190507/lib/python3.6/site-packages/keras/engine/saving.py", line 419, in load_model
    model = _deserialize_model(f, custom_objects, compile)
  File "/usr/local/apps/spliceai/20190507/lib/python3.6/site-packages/keras/engine/saving.py", line 225, in _deserialize_model
    model = model_from_config(model_config, custom_objects=custom_objects)
  File "/usr/local/apps/spliceai/20190507/lib/python3.6/site-packages/keras/engine/saving.py", line 458, in model_from_config
    return deserialize(config, custom_objects=custom_objects)
  File "/usr/local/apps/spliceai/20190507/lib/python3.6/site-packages/keras/layers/__init__.py", line 55, in deserialize
    printable_module_name='layer')
  File "/usr/local/apps/spliceai/20190507/lib/python3.6/site-packages/keras/utils/generic_utils.py", line 145, in deserialize_keras_object
    list(custom_objects.items())))
  File "/usr/local/apps/spliceai/20190507/lib/python3.6/site-packages/keras/engine/network.py", line 1022, in from_config
    process_layer(layer_data)
  File "/usr/local/apps/spliceai/20190507/lib/python3.6/site-packages/keras/engine/network.py", line 1008, in process_layer
    custom_objects=custom_objects)
  File "/usr/local/apps/spliceai/20190507/lib/python3.6/site-packages/keras/layers/__init__.py", line 55, in deserialize
    printable_module_name='layer')
  File "/usr/local/apps/spliceai/20190507/lib/python3.6/site-packages/keras/utils/generic_utils.py", line 147, in deserialize_keras_object
    return cls.from_config(config['config'])
  File "/usr/local/apps/spliceai/20190507/lib/python3.6/site-packages/keras/engine/base_layer.py", line 1109, in from_config
    return cls(**config)
  File "/usr/local/apps/spliceai/20190507/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/apps/spliceai/20190507/lib/python3.6/site-packages/keras/engine/input_layer.py", line 87, in __init__
    name=self.name)
  File "/usr/local/apps/spliceai/20190507/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 517, in placeholder
    x = tf.placeholder(dtype, shape=shape, name=name)
AttributeError: module 'tensorflow' has no attribute 'placeholder'

version info of dependencies

Hi,

What are the exact versions of Python, TensorFlow and the other dependencies that are compatible with SpliceAI? Could you update the compatible version info, or ideally provide a Docker container? I tried to install using the latest versions by default but came across errors when calling SpliceAI. Thanks.

Mismatching scores

I ran the script as spliceai -I input.vcf -O output.vcf -R hg37.genome.fa on my VCF files, and I also extracted the scores from the Illumina prediction data files [Predicting_splicing_from_primary_sequence-66029966].
Most variants have matching scores in both, but a few of them do not.
For example:
16:57998386:C:G
SpliceAI Illumina predicting data file:
SYMBOL=CNGB1;STRAND=-;TYPE=I;DIST=5;DS_AG=0.0000;DS_AL=0.0004;DS_DG=0.0000;DS_DL=0.9000;DP_AG=-28;DP_AL=5;DP_DG=-19;DP_DL=5
SpliceAI running:
SpliceAI=G|CNGB1|0.00| 0.81|0.01|0.90|-64|62|-125|5

1:216495345:T:C
SpliceAI Illumina predicting data file: SYMBOL=USH2A;STRAND=-;TYPE=I;DIST=-27;DS_AG=0.0235;DS_AL=0.0000;DS_DG=0.0000;DS_DL=0.0000;DP_AG=17;DP_AL=-2;DP_DG=-27;DP_DL=-2
SpliceAI=C|USH2A|0.02|0.27|0.00|0.17|17|-2|-27|-120

Note: why are some of the DP values different as well?
Please help me figure out the reason.
Many thanks!

Refactor get_delta_scores(..) to allow access to raw absolute scores.

To allow for different ways of displaying model predictions, it would be nice to move
https://github.com/Illumina/SpliceAI/blob/master/spliceai/utils.py#L96-L176
to a separate function (something like get_raw_scores(..) or run_model(..)?) which would return y_ref, y_alt, dist_ann[2], genes. Then, external code (and get_delta_scores(..)) could call this function to get both masked and raw delta scores in one go, and also could allow different visualizations of y_ref and y_alt.
If you agree with this idea, would you be open to a PR?

Very slow prediction speed

GPU utilization is low, please optimize it.

Try to reduce the number of prediction calls; each call should process a batch of data.

All possible SNVs for grch37

Hello,
Is there an available file of all possible SNVs for grch37, or can you advise how to create one?
Many thanks!
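For the "create one" part of this question, a minimal sketch that enumerates every single-base substitution over a stretch of reference sequence and emits VCF data lines. The function name `all_snvs` is hypothetical; fetching the reference sequence and writing a VCF header are left out.

```python
def all_snvs(chrom, start, sequence):
    """Yield minimal VCF data lines for every SNV over `sequence`.

    start is the 1-based position of sequence[0]; bases other than
    A/C/G/T (for example N) are skipped.
    """
    for offset, ref in enumerate(sequence.upper()):
        if ref not in "ACGT":
            continue
        for alt in "ACGT":
            if alt != ref:
                yield "\t".join(
                    [chrom, str(start + offset), ".", ref, alt, ".", ".", "."]
                )


for line in all_snvs("1", 100, "AN"):
    print(line)
```

Each reference base yields three lines, so the result for whole chromosomes is large and is probably best restricted to the regions of interest.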
