Comments (14)

danielecook commented on July 19, 2024

When running make_examples, you can pass the following flag: --select_variant_types='indels'

If you are using the run_deepvariant command, you can pass --make_examples_extra_args="select_variant_types=indels"

from deepvariant.

sophienguyen01 commented on July 19, 2024

Thank you, @danielecook.

How is --select_variant_types='indels' different from the types_to_alt_align parameter?

I want to train DeepVariant on INDEL variants of a particular length. Is there a make_examples option to generate labeled examples for only these variants?

Thank you

danielecook commented on July 19, 2024

types_to_alt_align refers to the types of variants for which we perform alignments against the alternate allele, when you have also set the alt_aligned_pileup flag.

You might be able to accomplish something like this by using the VCF candidate importer. See --truth_variants + --variant_caller=vcf_candidate_importer

I would expect that if you filter your training and test data in the same way, this could be a way to develop a model specific to INDEL variants of a certain size, but we have never tried this.

sophienguyen01 commented on July 19, 2024

Hi @danielecook, I ran the run_deepvariant command with and without the --select_variant_types='indels' parameter and observed a different number of INDELs between the two outputs. My understanding is that this flag removes all SNP candidates, but surprisingly it also lowers the number of INDEL variants. I attached the visual reports of the DeepVariant results with and without this parameter.

When all types of variants are considered:
HG003_all visual_report

Only INDEL variants are reported:
HG003_indels visual_report

kishwarshafin commented on July 19, 2024

Hi @sophienguyen01,

Can you please share the VCF files so I can look at them? I think the issue is multi-allelic sites, but I want to confirm that.

sophienguyen01 commented on July 19, 2024

The files are large (124 MB), so I cannot attach them here. Is there an email address I can send them to?

kishwarshafin commented on July 19, 2024

yes, please send it to [email protected]

kishwarshafin commented on July 19, 2024

@sophienguyen01 ,

I think I was able to narrow down the problem, but unfortunately I am unable to reproduce the issue.

Here's what I did:

Extract indels from the full file:

bcftools view -v indels -Oz -o HG003_043024.indels_only.bcftools_filter.vcf.gz HG003_043024.vcf.gz

Run stats:

bcftools stats HG003_043024.indels_only.bcftools_filter.vcf.gz | grep 'indels:'
SN	0	number of indels:	1240956

Compared to the parameter-based output:

SN	0	number of indels:	1056774

So, we are looking for 184182 variants.

So I ran a subtract:

bedtools subtract \
-a HG003_043024.indels_only.bcftools_filter.vcf.gz \
-b HG003_indels_043024.vcf.gz | wc -l

180869

So roughly it matches.
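
As a quick sanity check on the numbers above, the gap between the two bcftools counts can be compared with the bedtools subtract count. This is a minimal sketch that uses only the figures quoted in this comment; the variable names are illustrative.

```python
# Counts quoted above from `bcftools stats` and `bedtools subtract`.
indels_bcftools_filter = 1_240_956  # indels extracted from the full VCF
indels_flag_based = 1_056_774       # indels in the parameter-based output
subtract_count = 180_869            # lines reported by bedtools subtract

# Indels present in the full run but absent from the parameter-based run.
missing = indels_bcftools_filter - indels_flag_based

# Part of the gap that the subtract step does not account for.
unexplained = missing - subtract_count

print(missing)      # 184182
print(unexplained)  # 3313
print(f"{subtract_count / missing:.1%} of the gap is covered by the subtract output")
```

The subtract output covers most of the gap but not all of it, which is what "roughly" means here.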

Now look at some variants:

bedtools subtract \
-a HG003_043024.indels_only.bcftools_filter.vcf.gz \
-b HG003_indels_043024.vcf.gz | head

chr1	10247	.	TAAACCCTA	T	0.5	RefCall	.	GT:GQ:DP:AD:VAF:PL	./.:9:41:28,4:0.097561:0,14,10
chr1	98999	.	TTTTATTTA	T,TTTTATTTATTTA	20	PASS	.	GT:GQ:DP:AD:VAF:PL	1/2:10:31:20,9,2:0.290323,0.0645161:19,16,12,16,0,21
chr1	99092	.	C	CT	2.7	RefCall	.	GT:GQ:DP:AD:VAF:PL	./.:3:50:19,7:0.14:0,1,8
chr1	101674	.	C	CAAA	0.6	RefCall	.	GT:GQ:DP:AD:VAF:PL	./.:9:29:23,2:0.0689655:0,8,17
chr1	104160	.	A	AACAC,AACACACAC	15.1	PASS	.	GT:GQ:DP:AD:VAF:PL	1/2:5:79:1,37,21:0.468354,0.265823:13,14,6,14,0,9
chr1	108545	.	C	CA	2.7	RefCall	.	GT:GQ:DP:AD:VAF:PL	./.:3:44:12,21:0.477273:0,1,6
chr1	109575	.	CGT	C,CGTGTGT	13	PASS	.	GT:GQ:DP:AD:VAF:PL	1/2:4:22:0,8,10:0.363636,0.454545:11,13,15,13,0,4
chr1	111513	.	C	CTA	19.3	PASS	.	GT:GQ:DP:AD:VAF:PL	1/1:18:33:0,30:0.909091:19,22,0
chr1	180150	.	AC	A,GC	15	PASS	.	GT:GQ:DP:AD:VAF:PL	1/2:2:19:2,6,9:0.315789,0.473684:11,13,2,13,0,1
chr1	180174	.	TAA	T	3.5	PASS	.	GT:GQ:DP:AD:VAF:PL	1/1:3:14:7,4:0.285714:0,9,0

So there are a few variants that we are not picking up.

Next, I picked the region containing the variant at "chr1 10247" and ran make_examples with a debug command:

Without the filter flag:

chr1 10240 T ['TA']
chr1 10246 TA ['T']
chr1 10249 A ['C']
chr1 10253 TA ['T']
chr1 10256 A ['C']

I see these five variants.

With filtering I see:

FILTERING CANDIDATES
chr1 10240 T ['TA']
chr1 10246 TA ['T']
chr1 10253 TA ['T']

I am unsure how to reproduce this. Are you using a publicly available BAM file? I can also run DV with and without this flag and generate results to investigate further. It would be faster and more helpful if you could point me to the BAM you are using, so the investigation is specific to your issue.

sophienguyen01 commented on July 19, 2024

Hi @kishwarshafin,

It's an HG003 sample; I believe you also use this sample for training DeepVariant. I used our internal HG003 CRAM file.

If you can send me a link to a public HG003 BAM/CRAM file, I can rerun DV and see if the issue persists.

Thank you!

sophienguyen01 commented on July 19, 2024

Hello,
I found the reason for the difference. Using --select_variant_types='indels' removes multi-allelic INDEL sites. You can see it using the command:
bcftools view -i 'N_ALT>1' HG003_043024.vcf.gz
No multi-allelic sites show up.
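
For readers without bcftools at hand, the N_ALT>1 test amounts to counting comma-separated entries in the VCF ALT column. The sketch below applies that test to a few REF/ALT pairs copied from the bedtools subtract output earlier in this thread; the helper name n_alt is illustrative and not part of any library.

```python
# A few (CHROM, POS, REF, ALT) values copied from the subtract output above.
records = [
    ("chr1", 10247, "TAAACCCTA", "T"),
    ("chr1", 98999, "TTTTATTTA", "T,TTTTATTTATTTA"),
    ("chr1", 99092, "C", "CT"),
    ("chr1", 104160, "A", "AACAC,AACACACAC"),
    ("chr1", 109575, "CGT", "C,CGTGTGT"),
    ("chr1", 111513, "C", "CTA"),
]


def n_alt(alt_field: str) -> int:
    """Number of ALT alleles in a VCF ALT column (comma-separated)."""
    return len(alt_field.split(","))


# Same split that bcftools makes with -i 'N_ALT>1'.
multiallelic = [(chrom, pos) for chrom, pos, _ref, alt in records if n_alt(alt) > 1]
biallelic = [(chrom, pos) for chrom, pos, _ref, alt in records if n_alt(alt) == 1]

print("multi-allelic:", multiallelic)
print("biallelic:", biallelic)
```

In this small sample, three of the six missing records are biallelic, so multi-allelic filtering alone may not explain every missing record.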

My question is: if I use --select_variant_types='indels multi-allelics', will the DeepVariant output include INDELs plus multi-allelic variants of both SNP and INDEL types?

kishwarshafin commented on July 19, 2024

@sophienguyen01, I can run a chr20 test on our end to try to replicate.

Looking at the code, the filtering logic is implemented here.

def _select_biallelic_snps(v):
  return variant_utils.is_snp(v) and variant_utils.is_biallelic(v)


def _select_biallelic_indels(v):
  return variant_utils.is_indel(v) and variant_utils.is_biallelic(v)


def _select_biallelic_insertions(v):
  return variant_utils.has_insertion(v) and variant_utils.is_biallelic(v)


def _select_biallelic_deletions(v):
  return variant_utils.has_deletion(v) and variant_utils.is_biallelic(v)


VARIANT_TYPE_SELECTORS = {
    'snps': _select_biallelic_snps,
    'indels': _select_biallelic_indels,
    'insertions': _select_biallelic_insertions,
    'deletions': _select_biallelic_deletions,
    'multi-allelics': variant_utils.is_multiallelic,
    'all': lambda v: True,
}

And the filtering logic is implemented here.

It looks like --select_variant_types='indels multi-allelics' will give you all multi-allelic indels too. I am unsure if it will solve the issue because the VCF you provided before also misses bi-allelic indels. I will need to debug it further to see if there's something missing. Meanwhile, you can use --select_variant_types='indels multi-allelics' to see if it fixes your issue.
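
To illustrate how combining selectors could behave, here is a self-contained sketch. It assumes a candidate is kept if any of the listed selectors matches, which is an assumption about the filtering step and not a copy of the DeepVariant code. The Variant class and the is_* helpers are simplified, hypothetical stand-ins for the Nucleus Variant proto and variant_utils.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical stand-in for the Nucleus Variant proto."""
    ref: str
    alts: tuple


# Simplified stand-ins for variant_utils; the real helpers handle more cases.
def is_biallelic(v):
    return len(v.alts) == 1


def is_multiallelic(v):
    return len(v.alts) > 1


def is_snp(v):
    return len(v.ref) == 1 and all(len(a) == 1 for a in v.alts)


def is_indel(v):
    return len(v.ref) > 1 or any(len(a) > 1 for a in v.alts)


SELECTORS = {
    "snps": lambda v: is_snp(v) and is_biallelic(v),
    "indels": lambda v: is_indel(v) and is_biallelic(v),
    "multi-allelics": is_multiallelic,
    "all": lambda v: True,
}


def keep(v, select_variant_types: str) -> bool:
    """Assumed behavior: keep a candidate if ANY listed selector matches."""
    return any(SELECTORS[name](v) for name in select_variant_types.split())


candidates = {
    "biallelic SNP": Variant("A", ("C",)),
    "biallelic indel": Variant("TA", ("T",)),
    "multi-allelic indel": Variant("CGT", ("C", "CGTGTGT")),
    "multi-allelic SNP": Variant("A", ("C", "G")),
}

results = {
    flag: [name for name, v in candidates.items() if keep(v, flag)]
    for flag in ("indels", "indels multi-allelics")
}
for flag, kept in results.items():
    print(flag, "->", kept)
```

Under this assumption, 'indels' alone drops every multi-allelic site, while 'indels multi-allelics' brings back multi-allelic indels but also admits multi-allelic SNP sites, because the 'multi-allelics' selector does not check the variant type.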

sophienguyen01 commented on July 19, 2024

Hi,

I tried --select_variant_types='indels multi-allelics'. The VCF now includes all the INDELs that were missing from the output with --select_variant_types='indels'. However, the output VCF also includes SNP variants.

My purpose is to create INDEL-only training examples, and --select_variant_types='indels multi-allelics' still produces SNP examples. Is there a way to filter examples (from the tfrecord???.gz files) after the make_examples step but before the shuffling step?

I also tried --truth_variants (using a truth VCF that contains only INDEL variants) together with --variant_caller=vcf_candidate_importer, but the number of examples was reduced significantly.

kishwarshafin commented on July 19, 2024

@sophienguyen01 ,

If you are looking for INDEL purity, then --select_variant_types='indels' should work well, I believe. You are losing some INDELs, but ultimately you are creating a training set that contains only INDELs. You can also generate more training samples by downsampling the BAM and re-running make_examples.

On the other hand, think about the inference pipeline: in prediction mode, you can't switch --select_variant_types to something other than what you used for training, because the model would then have to deal with data it has never seen before. There are two scenarios:

  1. You train a model with --select_variant_types='indels' and, after the model is trained, run inference with --select_variant_types='indels multi-allelics'. The model will then have to deal with multi-allelic variants it has never seen before, so predictions on those would be expected to be poor.

  2. You train a model with --select_variant_types='indels multi-allelics' and allow a few SNPs to be present in training. In that case, during inference the model will be confident on all indel cases.

It depends on what exactly you are trying to do, but my suggestion would be to use --select_variant_types='indels multi-allelics' for your purpose.

kishwarshafin commented on July 19, 2024

@sophienguyen01, given you have found a solution and we have discussed the to-dos, I will close this bug. Please feel free to reopen if you want to discuss further.
