Comments (14)
When running make_examples, you can pass the following flag: --select_variant_types='indels'
If you are using the run_deepvariant command, you can pass --make_examples_extra_args="select_variant_types=indels"
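As a rough illustration of how that extra-args string is put together, here is a minimal Python sketch. It assumes the value is a comma-separated list of flag_name=flag_value pairs; build_extra_args is a hypothetical helper, not part of DeepVariant, and the other flags run_deepvariant needs are left out.

```python
def build_extra_args(flags):
    """Join {flag_name: value} into a 'name=value,name=value' string.

    Hypothetical helper for illustration only; it is not a DeepVariant API.
    """
    parts = []
    for name, value in flags.items():
        # Assumption: boolean values are spelled as lowercase true/false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{name}={value}")
    return ",".join(parts)


extra_args = build_extra_args({"select_variant_types": "indels"})

# Only the flag discussed in this thread is shown; the remaining
# run_deepvariant flags (reference, reads, outputs, ...) are omitted.
command = ["run_deepvariant", f"--make_examples_extra_args={extra_args}"]
print(" ".join(command))
```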
from deepvariant.
Thank you @danielecook,
How is --select_variant_types='indels' different from the types_to_alt_align parameter?
I want to train DeepVariant on INDEL variants of a particular length. Is there a command in make_examples to generate labeled examples for these particular variants?
Thank you
from deepvariant.
types_to_alt_align refers to the types of variants for which we perform alignment against the alternate allele, when you have also set the alt_aligned_pileup flag.
You might be able to accomplish something like this by making use of the VCF candidate importer. See --truth_variants + --variant_caller=vcf_candidate_importer
I would expect that if you filter your training and test data in the same way, this could be a way to develop a model specific to INDEL variants of a certain size, but we have never tried to do something like this.
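One way to prepare such a size-restricted VCF is sketched below in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the records are toy lines (alleles borrowed from this thread, other columns are placeholders), symbolic ALT alleles are not handled, and a multi-allelic site is kept only when every ALT allele falls inside the length range.

```python
def indel_length(ref, alt):
    """Length difference between REF and ALT (0 for a SNP)."""
    return abs(len(alt) - len(ref))


def filter_vcf_by_indel_length(lines, min_len, max_len):
    """Keep header lines and records whose ALT alleles are all indels
    with a length between min_len and max_len (inclusive).

    Hypothetical helper; assumes tab-separated VCF text lines.
    """
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # pass header lines through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        lengths = [indel_length(ref, alt) for alt in alts]
        # With min_len >= 1, SNPs (length 0) are dropped automatically.
        if all(min_len <= n <= max_len for n in lengths):
            kept.append(line)
    return kept


vcf_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t99092\t.\tC\tCT\t.\tPASS\t.\n",         # 1 bp insertion
    "chr1\t101674\t.\tC\tCAAA\t.\tPASS\t.\n",      # 3 bp insertion
    "chr1\t10247\t.\tTAAACCCTA\tT\t.\tPASS\t.\n",  # 8 bp deletion
    "chr1\t10249\t.\tA\tC\t.\tPASS\t.\n",          # SNP
]

kept = filter_vcf_by_indel_length(vcf_lines, min_len=2, max_len=5)
for line in kept:
    print(line, end="")
```

The same filter would need to be applied to both the training and the evaluation data so that the model sees the same kind of candidates in both.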
from deepvariant.
Hi @danielecook, I ran the run_deepvariant command with and without the --select_variant_types='indels' parameter and observed a different number of INDELs between the two outputs. My understanding is that this parameter removes all the SNP candidates but, surprisingly, it also lowers the number of INDEL variants. I attached the visual reports of the DeepVariant results with and without this parameter.
When all types of variants are considered:
Only INDEL variants are reported:
from deepvariant.
Hi @sophienguyen01,
Can you please share the VCF files so I can look at them? I think the issue is multi-allelic sites, but I want to confirm that.
from deepvariant.
The files are large (124 MB), so I cannot attach them here. Is there an email address I can send them to?
from deepvariant.
yes, please send it to [email protected]
from deepvariant.
I think I was able to get to the problem but unfortunately, I am unable to reproduce the issue.
Here's what I did:
Extract indels from the full file:
bcftools view -v indels -Oz -o HG003_043024.indels_only.bcftools_filter.vcf.gz HG003_043024.vcf.gz
Run stats:
bcftools stats HG003_043024.indels_only.bcftools_filter.vcf.gz | grep 'indels:'
SN 0 number of indels: 1240956
Compared to the output generated with the parameter:
SN 0 number of indels: 1056774
So, we are looking for 184182 variants (1240956 - 1056774).
So I ran a subtract:
bedtools subtract \
-a HG003_043024.indels_only.bcftools_filter.vcf.gz \
-b HG003_indels_043024.vcf.gz | wc -l
180869
So roughly it matches.
Now look at some variants:
bedtools subtract \
-a HG003_043024.indels_only.bcftools_filter.vcf.gz \
-b HG003_indels_043024.vcf.gz | head
chr1 10247 . TAAACCCTA T 0.5 RefCall . GT:GQ:DP:AD:VAF:PL ./.:9:41:28,4:0.097561:0,14,10
chr1 98999 . TTTTATTTA T,TTTTATTTATTTA 20 PASS . GT:GQ:DP:AD:VAF:PL 1/2:10:31:20,9,2:0.290323,0.0645161:19,16,12,16,0,21
chr1 99092 . C CT 2.7 RefCall . GT:GQ:DP:AD:VAF:PL ./.:3:50:19,7:0.14:0,1,8
chr1 101674 . C CAAA 0.6 RefCall . GT:GQ:DP:AD:VAF:PL ./.:9:29:23,2:0.0689655:0,8,17
chr1 104160 . A AACAC,AACACACAC 15.1 PASS . GT:GQ:DP:AD:VAF:PL 1/2:5:79:1,37,21:0.468354,0.265823:13,14,6,14,0,9
chr1 108545 . C CA 2.7 RefCall . GT:GQ:DP:AD:VAF:PL ./.:3:44:12,21:0.477273:0,1,6
chr1 109575 . CGT C,CGTGTGT 13 PASS . GT:GQ:DP:AD:VAF:PL 1/2:4:22:0,8,10:0.363636,0.454545:11,13,15,13,0,4
chr1 111513 . C CTA 19.3 PASS . GT:GQ:DP:AD:VAF:PL 1/1:18:33:0,30:0.909091:19,22,0
chr1 180150 . AC A,GC 15 PASS . GT:GQ:DP:AD:VAF:PL 1/2:2:19:2,6,9:0.315789,0.473684:11,13,2,13,0,1
chr1 180174 . TAA T 3.5 PASS . GT:GQ:DP:AD:VAF:PL 1/1:3:14:7,4:0.285714:0,9,0
So there are a few variants that we are not picking up.
Next, I picked the region containing the variant "chr1 10247" and ran make_examples with a debug command:
Without the filter:
chr1 10240 T ['TA']
chr1 10246 TA ['T']
chr1 10249 A ['C']
chr1 10253 TA ['T']
chr1 10256 A ['C']
I see these five variants.
With filtering I see:
FILTERING CANDIDATES
chr1 10240 T ['TA']
chr1 10246 TA ['T']
chr1 10253 TA ['T']
I am unsure how to reproduce this. Are you using a publicly available bam file? I can also run DV with and without this command and generate results to investigate further. It would be faster and helpful if you can point me to the bam you are using so it's more specific to your issue.
from deepvariant.
Hi @kishwarshafin,
It's a HG003 sample, I believe you also use this sample for training DeepVariant. I used our internal HG003 cram file.
If you can send me a link to a public HG003 bam/cram file, I can rerun DV and see if the issue still persists.
Thank you!
from deepvariant.
Hello,
I found the reason for the difference. Using the parameter --select_variant_types='indels' removes multi-allelic INDEL sites. You can see this using the command:
bcftools view -i 'N_ALT>1' HG003_043024.vcf.gz
and no multi-allelic sites show up.
My question: if I use --select_variant_types='indels multi-allelics', the DeepVariant output will include INDELs plus multi-allelic variants of both SNP and INDEL type, right?
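To check how much of the gap multi-allelic sites account for, the ten "missing" records printed earlier in this thread can be tallied by number of ALT alleles. The sketch below hard-codes the REF and ALT columns of those ten records; n_alt is a hypothetical helper that mirrors what the bcftools N_ALT expression counts.

```python
# (chrom, pos, ref, alt) for the ten records shown by `bedtools subtract ... | head`
missing = [
    ("chr1", 10247, "TAAACCCTA", "T"),
    ("chr1", 98999, "TTTTATTTA", "T,TTTTATTTATTTA"),
    ("chr1", 99092, "C", "CT"),
    ("chr1", 101674, "C", "CAAA"),
    ("chr1", 104160, "A", "AACAC,AACACACAC"),
    ("chr1", 108545, "C", "CA"),
    ("chr1", 109575, "CGT", "C,CGTGTGT"),
    ("chr1", 111513, "C", "CTA"),
    ("chr1", 180150, "AC", "A,GC"),
    ("chr1", 180174, "TAA", "T"),
]


def n_alt(alt_field):
    """Number of ALT alleles in a comma-separated VCF ALT column."""
    return len(alt_field.split(","))


multi_allelic = [r for r in missing if n_alt(r[3]) > 1]
bi_allelic = [r for r in missing if n_alt(r[3]) == 1]

print("multi-allelic:", len(multi_allelic))  # 4
print("bi-allelic:", len(bi_allelic))        # 6
```

In this small sample, four of the ten missing records are multi-allelic and six are bi-allelic, so multi-allelic sites explain part of the difference but not all of it.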
from deepvariant.
@sophienguyen01, I can run a chr20 test on our end to try to replicate.
Looking at the code, the variant type selectors are defined here.
def _select_biallelic_snps(v):
  return variant_utils.is_snp(v) and variant_utils.is_biallelic(v)

def _select_biallelic_indels(v):
  return variant_utils.is_indel(v) and variant_utils.is_biallelic(v)

def _select_biallelic_insertions(v):
  return variant_utils.has_insertion(v) and variant_utils.is_biallelic(v)

def _select_biallelic_deletions(v):
  return variant_utils.has_deletion(v) and variant_utils.is_biallelic(v)

VARIANT_TYPE_SELECTORS = {
    'snps': _select_biallelic_snps,
    'indels': _select_biallelic_indels,
    'insertions': _select_biallelic_insertions,
    'deletions': _select_biallelic_deletions,
    'multi-allelics': variant_utils.is_multiallelic,
    'all': lambda v: True,
}
And the filtering logic is implemented here.
It looks like --select_variant_types='indels multi-allelics' will give you all multi-allelic indels too. I am unsure whether it will solve the issue, because the VCF you provided earlier is also missing bi-allelic indels. I will need to debug further to see if something is missing. Meanwhile, you can try --select_variant_types='indels multi-allelics' to see if it fixes your issue.
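To make the selector behaviour concrete, here is a self-contained sketch that mimics the table above with a toy Variant type. The is_snp, is_indel, is_biallelic and is_multiallelic functions below are simplified stand-ins, not the real variant_utils implementations, and the select function assumes that a candidate is kept when any of the listed selectors matches.

```python
from collections import namedtuple

# Toy stand-in for a variant record: reference bases plus a tuple of ALT alleles.
Variant = namedtuple("Variant", ["ref", "alts"])


def is_snp(v):
    return len(v.ref) == 1 and all(len(a) == 1 for a in v.alts)


def is_indel(v):
    return any(len(a) != len(v.ref) for a in v.alts)


def is_biallelic(v):
    return len(v.alts) == 1


def is_multiallelic(v):
    return len(v.alts) > 1


SELECTORS = {
    "snps": lambda v: is_snp(v) and is_biallelic(v),
    "indels": lambda v: is_indel(v) and is_biallelic(v),
    "multi-allelics": is_multiallelic,
    "all": lambda v: True,
}


def select(variants, types):
    """Keep variants matched by ANY selector named in the space-separated
    `types` string (assumed union semantics)."""
    names = types.split()
    return [v for v in variants if any(SELECTORS[n](v) for n in names)]


bi_indel = Variant("C", ("CT",))
multi_indel = Variant("CGT", ("C", "CGTGTGT"))
bi_snp = Variant("A", ("C",))
multi_snp = Variant("A", ("C", "G"))
candidates = [bi_indel, multi_indel, bi_snp, multi_snp]

print(select(candidates, "indels"))
print(select(candidates, "indels multi-allelics"))
```

Under these assumptions, 'indels' keeps only the bi-allelic indel, while 'indels multi-allelics' also keeps the multi-allelic indel and the multi-allelic SNP, because 'multi-allelics' does not look at the variant type.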
from deepvariant.
Hi,
I tried --select_variant_types='indels multi-allelics', and the VCF includes all the INDELs found in the output generated without --select_variant_types. However, the output VCF using --select_variant_types='indels multi-allelics' also includes SNP variants.
My purpose is to create INDEL-only training examples, and --select_variant_types='indels multi-allelics' still produces SNP examples. Is there a way to filter examples (from tfrecord???.gz files) after the make_examples step but before the shuffling step?
I also tried --truth_variants (using a truth VCF that only contains INDEL variants) with the --variant_caller=vcf_candidate_importer parameter, but the number of examples is reduced significantly.
from deepvariant.
If you are looking for INDEL purity, then --select_variant_types='indels' should work well, I believe. You are losing some INDELs, but ultimately you are creating a training set that contains only INDELs. You can also generate more training samples by downsampling the BAM and re-running make_examples.
On the other hand, think about the inference pipeline: in prediction mode, you can't switch --select_variant_types to something different from what you used for your training data, because the model would then have to deal with data it has never seen before. There are two scenarios:
- You train a model with --select_variant_types='indels' and, after the model is trained, run inference with --select_variant_types='indels multi-allelics'. The model then has to deal with multi-allelic variants it has never seen before, so the predictions on those would be expected to be poor.
- You train a model with --select_variant_types='indels multi-allelics' and allow a few SNPs to be present in the training data. In that case, during inference the model will be confident on all indel cases.
It depends on what exactly you are trying to do, but my suggestion would be to use --select_variant_types='indels multi-allelics' for your purpose.
from deepvariant.
@sophienguyen01, given you have found a solution and we have discussed the to-dos, I will close this bug. Please feel free to reopen if you want to discuss further.
from deepvariant.