Giter Club home page Giter Club logo

Comments (12)

ACEnglish avatar ACEnglish commented on September 27, 2024 1

I'm assuming there are haploid genotypes in your file. I put in a quick guard on the if statement if you'd like to pull develop and try again.

from truvari.

ACEnglish avatar ACEnglish commented on September 27, 2024

Hello,

Thanks for the detailed ticket. This is going to be tricky to debug without any example data since I'll need to dig into why pysam is doing things differently. Is it possible to just share the comparison variants in your two example regions? e.g. bcftools view -r chr1:1653825-1653863). Actually, can you buffer the regions by ±100bp just in case there's edge effects?

Thanks,
~/Adam English

from truvari.

fennerm avatar fennerm commented on September 27, 2024

@ACEnglish No problem. VCFs are attached (with buffering included). Reference is hg38. Let me know if I can help with anything else.

truth.subset.vcf.txt
query.subset.vcf.txt

from truvari.

ACEnglish avatar ACEnglish commented on September 27, 2024

I'm having some trouble recreating the issue. When I ran the truth/query using pysam 0.22 I saw the two variants matching up after refine. However, I used the GIABTR_v1.0. The provided regions overlap the benchmark, but don't match up.

Provided:
chr1    1653825 1653863
chr1    3712929 3712943
Benchmark:
chr1    1653800 1653887 Tier1   TP_TP_TP        15      0.9325  -6      0
chr1    3712928 3712990 Tier1   TP_TP_TP        5       0.8681  7       0

When I used the provided regions, no variants matched up because none were within the provided regions. So, I made the following script to recreate Truvari's --includebed logic:

import sys
import pysam
import truvari
v = pysam.VariantFile(sys.argv[1])

from intervaltree import IntervalTree
def check_overlap(entry, tree):
    astart, aend = truvari.entry_boundaries(entry)
    if astart == aend - 1:
        return tree.overlaps(astart, aend)

    m_ovl = tree.overlap(astart, aend)
    if len(m_ovl) != 1:
        return False
    m_ovl = list(m_ovl)[0]
    return astart >= m_ovl.begin and aend <= m_ovl.end
    
provided_coords = IntervalTree()
provided_coords.addi(1653825, 1653863)
provided_coords.addi(3712929, 3712943)

benchmark_coords = IntervalTree()
benchmark_coords.addi(1653800, 1653887)
benchmark_coords.addi(3712928, 3712990)

for entry in v:
    print('analyzing', str(entry).strip())
    print('provided', check_overlap(entry, provided_coords))
    print('benchmark', check_overlap(entry, benchmark_coords))
    print('-' * 10)

And saw:

analyzing chr1	1653825	.	T	TACACAC	60	PASS	AN=2;AC=1;NS=80;AF=0.223684;MAF=0.223684;AC_Het=23;AC_Hom=10;AC_Hemi=1;HWE=0.501081;ExcHet=0.886246	GT:BPDP:FT:AD	1/0:1,1,1,1:PASS:1,1
provided False
benchmark True
----------
analyzing chr1	3712929	.	GCCCCTGC	G	60	PASS	AN=2;AC=1;NS=86;AF=0.145349;MAF=0.145349;AC_Het=25;AC_Hom=0;AC_Hemi=0;HWE=0.199527;ExcHet=0.129597	GT:BPDP:FT:AD	1/0:1,1,1,1:PASS:1,1
provided False
benchmark True
----------
(py) english@Adams-MacBook-Pro pysamfn % python show_coords.py query.subset.vcf.gz
analyzing chr1	1653825	chr1_1653825_1653863	TACACACACACACACACACACACACACACACACACACAC	TACACACACACACACACACACACACACACACACACACACACACAC	.	PASS	SVTYPE=INS;SVLEN=6;RUS=AC	GT:CN	0/1:1.15789
provided False
benchmark True
----------
analyzing chr1	3712929	chr1_3712929_3712943	GCCCCTGCCCCCTGC	GCCCCTGC	.	PASS	SVTYPE=DEL;SVLEN=-7;RUS=CCCCTGC	GT:CN	0/1:0.5
provided False
benchmark True
----------

So my questions for you is which bed file are you using? Is there any reason why the GIABTR bed file isn't being used? Are you sure that the bed file you've created is using 0-based coordinates? Are you certain that your pysam 0.22 and pysam 0.19 runs used the same bed files?

Despite the possible regions issue, I wanted to check if there was a difference between pysam 0.22 and 0.19. For the region checking script above and running bench/refine I saw identical results.

from truvari.

fennerm avatar fennerm commented on September 27, 2024

So my questions for you is which bed file are you using? Is there any reason why the GIABTR bed file isn't being used?

Our include bed is the GIABTR bed file, but our refine regions is a STR catalog which covers just repeats of interest.

Are you sure that the bed file you've created is using 0-based coordinates?

It does actually look like the BED file is 1-based, which is news to me. I'll run a test with a 0-indexed BED and see if this improves anything. I suspect its not the main issue because all variants in the query VCF have POS=bed.start.

Are you certain that your pysam 0.22 and pysam 0.19 runs used the same bed files?

Yeah we've been slowly narrowing down on this for a couple of weeks. We've checked and rechecked that all other params are identical (barring something really stupid 😉)

I think I have a minimal reproducible example which at least shows a difference between pysam versions. Looking at chr1:1653800 in the refine.regions.txt, pysam 0.21.0 lists it as TP, while pysam 0.22.0 lists it as TN. Oddly this variant was instead called FN by pysam 0.22.0 in our full analysis. Data is attached (excluding the reference which was GRCh38).

I ran this script with pysam 0.21.0 and pysam 0.22.0 installed. Nothing else was changed.

#!/usr/bin/env bash
output_dir="$1"
truvari bench \
    --base=truth.vcf.gz \
    --comp=query.vcf.gz \
    --output="$output_dir" \
    --includebed=include.bed \
    --refdist=500 \
    --pctseq=0.7 \
    --pctsize=0.7 \
    --pctovl=0.0 \
    --sizemin=5 \
    --sizefilt=5 \
    --sizemax=50000 \
    --pick=ac \
    --extend=50 \
    --chunksize=1000

truvari refine \
    --reference=genome.fa \
    --regions=regions.bed \
    --use-original-vcfs \
    --threads=1 \
    --debug \
    "$output_dir"

truvari_repro.tar.gz

from truvari.

ACEnglish avatar ACEnglish commented on September 27, 2024

Okay, now I'm seeing it.

And I've been able to uncover what is different

Details

0.21 [W::hts_idx_load3] The index file is older than the data file: /Users/english/code/truvari/tickets/pysamfn/truvari_repro/query.vcf.gz.tbi CRITICAL:root:Applied 6 variants

CRITICAL:root:Applied 3 variants

[W::hts_idx_load3] The index file is older than the data file: /Users/english/code/truvari/tickets/pysamfn/truvari_repro/query.vcf.gz.tbi
CRITICAL:root:Applied 3 variants

[W::hts_idx_load3] The index file is older than the data file: /Users/english/code/truvari/tickets/pysamfn/truvari_repro/query.vcf.gz.tbi
CRITICAL:root:Applied 6 variants

0.22
CRITICAL:root:Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1585949
Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1585949
Warning: ignoring overlapping variant starting at chr1:1845824
Warning: ignoring overlapping variant starting at chr1:1904424
Warning: ignoring overlapping variant starting at chr1:2003928
Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1585949
Warning: ignoring overlapping variant starting at chr1:1845824
Warning: ignoring overlapping variant starting at chr1:1904424
Warning: ignoring overlapping variant starting at chr1:2003928
Applied 2 variants

[W::hts_idx_load3] The index file is older than the data file: /Users/english/code/truvari/tickets/pysamfn/truvari_repro/query.vcf.gz.tbi
CRITICAL:root:Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1585949
Warning: ignoring overlapping variant starting at chr1:1653825
Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1585949
Warning: ignoring overlapping variant starting at chr1:1653825
Warning: ignoring overlapping variant starting at chr1:1845824
Warning: ignoring overlapping variant starting at chr1:1904424
Warning: ignoring overlapping variant starting at chr1:2003928
Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1585949
Warning: ignoring overlapping variant starting at chr1:1653825
Warning: ignoring overlapping variant starting at chr1:1845824
Warning: ignoring overlapping variant starting at chr1:1904424
Warning: ignoring overlapping variant starting at chr1:2003928
Warning: ignoring overlapping variant starting at chr1:2371211
Applied 5 variants

I'm interpreting this as pysam.bcftools.consensus with 0.22 isn't including some variants in the haplotypes. The main difference between 0.22 and 0.21 is they wrap htslib 1.18 and 1.17 respectively. However in the htslib release notes I can't find any indication why this would have changed. Looking at the source (I think) for consensus I can't tell if/when/how it was changed to affect this behavior.

This is the 3rd or 4th ticket that pysam.bcftools.consensus has caused issues. I'm going to start looking into a longer term solution. In the meantime, I'd recommend downgrading to pysam=0.21 but be wary of other sites which it may be excluding variants from haplotypes.

from truvari.

fennerm avatar fennerm commented on September 27, 2024

This is the 3rd or 4th ticket that pysam.bcftools.consensus has caused issues. I'm going to start looking into a longer term solution.

Yeah 100% feel you on this, if I had a penny for every time I've needed to pin pysam 😅. Really appreciate the quick and thorough response on this

from truvari.

ACEnglish avatar ACEnglish commented on September 27, 2024

Hello,

I have a feature branch with a possible alternative to pysam.bcftools.consensus. I've been doing spot testing and it looks to be correct so far. I need to do a lot more work before committing the feature to develop since it changes the results expected by the functional tests. It turns out that 'overlapping' variants are somewhat common.

If you're interested in testing out the new consensus, any feed back on the accuracy would be greatly appreciated. Because VCFs come in so many shapes/sizes, it's difficult to anticipate all the edge cases. You can install it by running git checkout feature/consensus and then running python3 -m pip install . from a truvari repo directory.

Running Notes:

  1. I think the errors started from pysam 0.22 because bcftools consensus started allowing higher ploidy than just -H1 and -H2.
  2. Two overlapping heterozygous variants from different haplotypes may always have conflicts. For example, a deletion variant from 10-20 with GT 1|0 and a deletion from 15-25 with GT 0|1 are conflicting because the first deletion says the second allele has all the reference bases (the 0 in the 1|0).
  3. To address the overlapping issue, the new procedure to generate consensus sequence by assuming that a 0 genotype (i.e. reference) can be interpreted as "ignore this variant / don't deviate from the reference path".
  4. Star (*) alternate alleles are ignored.
  5. Need to deal with symbolic ALTs. I'll probably settle on just raising an error.

from truvari.

fennerm avatar fennerm commented on September 27, 2024

Hi Adam,

Awesome yeah I can do some testing when I'm back at work start of January.

from truvari.

fennerm avatar fennerm commented on September 27, 2024

@ACEnglish I ran with the same params as above and hit an exception.

multiprocessing.pool.RemoteTraceback:
--
2024-01-02 | 13:14:16,147 | """
2024-01-02 | 13:14:16,147 | Traceback (most recent call last):
2024-01-02 | 13:14:16,147 | File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
2024-01-02 | 13:14:16,147 | result = (True, func(*args, **kwds))
2024-01-02 | 13:14:16,147 | File "/usr/local/lib/python3.10/dist-packages/truvari/phab.py", line 120, in make_consensus
2024-01-02 | 13:14:16,148 | if entry.samples[sample]['GT'][1] == 1:
2024-01-02 | 13:14:16,148 | IndexError: tuple index out of range
2024-01-02 | 13:14:16,148 | """
2024-01-02 | 13:14:16,148 |  
2024-01-02 | 13:14:16,148 | The above exception was the direct cause of the following exception:
2024-01-02 | 13:14:16,148 |  
2024-01-02 | 13:14:16,148 | Traceback (most recent call last):
2024-01-02 | 13:14:16,148 | File "/usr/local/bin/truvari", line 8, in <module>
2024-01-02 | 13:14:16,148 | sys.exit(main())
2024-01-02 | 13:14:16,148 | File "/usr/local/lib/python3.10/dist-packages/truvari/__main__.py", line 102, in main
2024-01-02 | 13:14:16,148 | TOOLS[args.cmd](args.options)
2024-01-02 | 13:14:16,148 | File "/usr/local/lib/python3.10/dist-packages/truvari/refine.py", line 413, in refine_main
2024-01-02 | 13:14:16,148 | truvari.phab(to_eval_coords, base_vcf, args.reference, phab_vcf, buffer=PHAB_BUFFER,
2024-01-02 | 13:14:16,149 | File "/usr/local/lib/python3.10/dist-packages/truvari/phab.py", line 344, in phab
2024-01-02 | 13:14:16,149 | sample_haps = collect_haplotypes(ref_haps_fn, hap_jobs, threads)
2024-01-02 | 13:14:16,149 | File "/usr/local/lib/python3.10/dist-packages/truvari/phab.py", line 184, in collect_haplotypes
2024-01-02 | 13:14:16,149 | for haplotypes in pool.imap(to_call, hap_jobs):
2024-01-02 | 13:14:16,149 | File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
2024-01-02 | 13:14:16,149 | raise value
2024-01-02 | 13:14:16,149 | IndexError: tuple index out of range

from truvari.

fennerm avatar fennerm commented on September 27, 2024

The results with the new consensus code is looking great as far as I can tell. PPV/TPR/TNR are all looking pretty much identical to truvari 4.1.0 with pysam==0.21.0. I do see 2 more TPs and 2 fewer FPs and FNs with the new consensus. But these may be related to other changes since 4.1.0, and for our purposes they don't make any difference.

from truvari.

ACEnglish avatar ACEnglish commented on September 27, 2024

Right, there's another change to develop that I forgot to mention which has to do with how sequence similarity is calculated with the 'unroll' technique (tl;dr-its a little more sensitive to finding higher similarities).

I really appreciate your help in discovering and fixing this regression. Hopefully in addition to the code working again you saw some improvement in runtime as I believe the new code is faster than bcftools.consensus.

I'm aiming to release v4.2 by the end of the month. Let me know if you ever need anything else.

from truvari.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.