Comments (12)
I'm assuming there are haploid genotypes in your file. I put in a quick guard on the if statement if you'd like to pull develop and try again.
from truvari.
Hello,
Thanks for the detailed ticket. This is going to be tricky to debug without any example data since I'll need to dig into why pysam is doing things differently. Is it possible to just share the comparison variants in your two example regions? e.g. bcftools view -r chr1:1653825-1653863
). Actually, can you buffer the regions by ±100bp just in case there's edge effects?
Thanks,
~/Adam English
from truvari.
@ACEnglish No problem. VCFs are attached (with buffering included). Reference is hg38. Let me know if I can help with anything else.
truth.subset.vcf.txt
query.subset.vcf.txt
from truvari.
I'm having some trouble recreating the issue. When I ran the truth/query using pysam 0.22 I saw the two variants matching up after refine. However, I used the GIABTR_v1.0. The provided regions overlap the benchmark, but don't match up.
Provided:
chr1 1653825 1653863
chr1 3712929 3712943
Benchmark:
chr1 1653800 1653887 Tier1 TP_TP_TP 15 0.9325 -6 0
chr1 3712928 3712990 Tier1 TP_TP_TP 5 0.8681 7 0
When I used the provided regions, no variants matched up because none were within the provided regions. So, I made the following script to recreate Truvari's --includebed
logic:
import sys
import pysam
import truvari
v = pysam.VariantFile(sys.argv[1])
from intervaltree import IntervalTree
def check_overlap(entry, tree):
astart, aend = truvari.entry_boundaries(entry)
if astart == aend - 1:
return tree.overlaps(astart, aend)
m_ovl = tree.overlap(astart, aend)
if len(m_ovl) != 1:
return False
m_ovl = list(m_ovl)[0]
return astart >= m_ovl.begin and aend <= m_ovl.end
provided_coords = IntervalTree()
provided_coords.addi(1653825, 1653863)
provided_coords.addi(3712929, 3712943)
benchmark_coords = IntervalTree()
benchmark_coords.addi(1653800, 1653887)
benchmark_coords.addi(3712928, 3712990)
for entry in v:
print('analyzing', str(entry).strip())
print('provided', check_overlap(entry, provided_coords))
print('benchmark', check_overlap(entry, benchmark_coords))
print('-' * 10)
And saw:
analyzing chr1 1653825 . T TACACAC 60 PASS AN=2;AC=1;NS=80;AF=0.223684;MAF=0.223684;AC_Het=23;AC_Hom=10;AC_Hemi=1;HWE=0.501081;ExcHet=0.886246 GT:BPDP:FT:AD 1/0:1,1,1,1:PASS:1,1
provided False
benchmark True
----------
analyzing chr1 3712929 . GCCCCTGC G 60 PASS AN=2;AC=1;NS=86;AF=0.145349;MAF=0.145349;AC_Het=25;AC_Hom=0;AC_Hemi=0;HWE=0.199527;ExcHet=0.129597 GT:BPDP:FT:AD 1/0:1,1,1,1:PASS:1,1
provided False
benchmark True
----------
(py) english@Adams-MacBook-Pro pysamfn % python show_coords.py query.subset.vcf.gz
analyzing chr1 1653825 chr1_1653825_1653863 TACACACACACACACACACACACACACACACACACACAC TACACACACACACACACACACACACACACACACACACACACACAC . PASS SVTYPE=INS;SVLEN=6;RUS=AC GT:CN 0/1:1.15789
provided False
benchmark True
----------
analyzing chr1 3712929 chr1_3712929_3712943 GCCCCTGCCCCCTGC GCCCCTGC . PASS SVTYPE=DEL;SVLEN=-7;RUS=CCCCTGC GT:CN 0/1:0.5
provided False
benchmark True
----------
So my questions for you is which bed file are you using? Is there any reason why the GIABTR bed file isn't being used? Are you sure that the bed file you've created is using 0-based coordinates? Are you certain that your pysam 0.22 and pysam 0.19 runs used the same bed files?
Despite the possible regions issue, I wanted to check if there was a difference between pysam 0.22 and 0.19. For the region checking script above and running bench/refine I saw identical results.
from truvari.
So my questions for you is which bed file are you using? Is there any reason why the GIABTR bed file isn't being used?
Our include bed is the GIABTR bed file, but our refine regions is a STR catalog which covers just repeats of interest.
Are you sure that the bed file you've created is using 0-based coordinates?
It does actually look like the BED file is 1-based, which is news to me. I'll run a test with a 0-indexed BED and see if this improves anything. I suspect its not the main issue because all variants in the query VCF have POS=bed.start.
Are you certain that your pysam 0.22 and pysam 0.19 runs used the same bed files?
Yeah we've been slowly narrowing down on this for a couple of weeks. We've checked and rechecked that all other params are identical (barring something really stupid 😉)
I think I have a minimal reproducible example which at least shows a difference between pysam versions. Looking at chr1:1653800 in the refine.regions.txt, pysam 0.21.0 lists it as TP, while pysam 0.22.0 lists it as TN. Oddly this variant was instead called FN by pysam 0.22.0 in our full analysis. Data is attached (excluding the reference which was GRCh38).
I ran this script with pysam 0.21.0 and pysam 0.22.0 installed. Nothing else was changed.
#!/usr/bin/env bash
output_dir="$1"
truvari bench \
--base=truth.vcf.gz \
--comp=query.vcf.gz \
--output="$output_dir" \
--includebed=include.bed \
--refdist=500 \
--pctseq=0.7 \
--pctsize=0.7 \
--pctovl=0.0 \
--sizemin=5 \
--sizefilt=5 \
--sizemax=50000 \
--pick=ac \
--extend=50 \
--chunksize=1000
truvari refine \
--reference=genome.fa \
--regions=regions.bed \
--use-original-vcfs \
--threads=1 \
--debug \
"$output_dir"
from truvari.
Okay, now I'm seeing it.
And I've been able to uncover what is different
Details
0.21 [W::hts_idx_load3] The index file is older than the data file: /Users/english/code/truvari/tickets/pysamfn/truvari_repro/query.vcf.gz.tbi CRITICAL:root:Applied 6 variants
CRITICAL:root:Applied 3 variants
[W::hts_idx_load3] The index file is older than the data file: /Users/english/code/truvari/tickets/pysamfn/truvari_repro/query.vcf.gz.tbi
CRITICAL:root:Applied 3 variants
[W::hts_idx_load3] The index file is older than the data file: /Users/english/code/truvari/tickets/pysamfn/truvari_repro/query.vcf.gz.tbi
CRITICAL:root:Applied 6 variants
0.22
CRITICAL:root:Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1585949
Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1585949
Warning: ignoring overlapping variant starting at chr1:1845824
Warning: ignoring overlapping variant starting at chr1:1904424
Warning: ignoring overlapping variant starting at chr1:2003928
Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1585949
Warning: ignoring overlapping variant starting at chr1:1845824
Warning: ignoring overlapping variant starting at chr1:1904424
Warning: ignoring overlapping variant starting at chr1:2003928
Applied 2 variants
[W::hts_idx_load3] The index file is older than the data file: /Users/english/code/truvari/tickets/pysamfn/truvari_repro/query.vcf.gz.tbi
CRITICAL:root:Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1585949
Warning: ignoring overlapping variant starting at chr1:1653825
Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1585949
Warning: ignoring overlapping variant starting at chr1:1653825
Warning: ignoring overlapping variant starting at chr1:1845824
Warning: ignoring overlapping variant starting at chr1:1904424
Warning: ignoring overlapping variant starting at chr1:2003928
Warning: ignoring overlapping variant starting at chr1:1010481
Warning: ignoring overlapping variant starting at chr1:1585949
Warning: ignoring overlapping variant starting at chr1:1653825
Warning: ignoring overlapping variant starting at chr1:1845824
Warning: ignoring overlapping variant starting at chr1:1904424
Warning: ignoring overlapping variant starting at chr1:2003928
Warning: ignoring overlapping variant starting at chr1:2371211
Applied 5 variants
I'm interpreting this as pysam.bcftools.consensus
with 0.22 isn't including some variants in the haplotypes. The main difference between 0.22 and 0.21 is they wrap htslib 1.18 and 1.17 respectively. However in the htslib release notes I can't find any indication why this would have changed. Looking at the source (I think) for consensus I can't tell if/when/how it was changed to affect this behavior.
This is the 3rd or 4th ticket that pysam.bcftools.consensus
has caused issues. I'm going to start looking into a longer term solution. In the meantime, I'd recommend downgrading to pysam=0.21 but be wary of other sites which it may be excluding variants from haplotypes.
from truvari.
This is the 3rd or 4th ticket that pysam.bcftools.consensus has caused issues. I'm going to start looking into a longer term solution.
Yeah 100% feel you on this, if I had a penny for every time I've needed to pin pysam 😅. Really appreciate the quick and thorough response on this
from truvari.
Hello,
I have a feature branch with a possible alternative to pysam.bcftools.consensus
. I've been doing spot testing and it looks to be correct so far. I need to do a lot more work before committing the feature to develop since it changes the results expected by the functional tests. It turns out that 'overlapping' variants are somewhat common.
If you're interested in testing out the new consensus, any feed back on the accuracy would be greatly appreciated. Because VCFs come in so many shapes/sizes, it's difficult to anticipate all the edge cases. You can install it by running git checkout feature/consensus
and then running python3 -m pip install .
from a truvari repo directory.
Running Notes:
- I think the errors started from pysam 0.22 because bcftools consensus started allowing higher ploidy than just
-H1
and-H2
. - Two overlapping heterozygous variants from different haplotypes may always have conflicts. For example, a deletion variant from 10-20 with GT
1|0
and a deletion from 15-25 with GT0|1
are conflicting because the first deletion says the second allele has all the reference bases (the 0 in the1|0
). - To address the overlapping issue, the new procedure to generate consensus sequence by assuming that a 0 genotype (i.e. reference) can be interpreted as "ignore this variant / don't deviate from the reference path".
- Star (
*
) alternate alleles are ignored. - Need to deal with symbolic ALTs. I'll probably settle on just raising an error.
from truvari.
Hi Adam,
Awesome yeah I can do some testing when I'm back at work start of January.
from truvari.
@ACEnglish I ran with the same params as above and hit an exception.
multiprocessing.pool.RemoteTraceback:
--
2024-01-02 | 13:14:16,147 | """
2024-01-02 | 13:14:16,147 | Traceback (most recent call last):
2024-01-02 | 13:14:16,147 | File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
2024-01-02 | 13:14:16,147 | result = (True, func(*args, **kwds))
2024-01-02 | 13:14:16,147 | File "/usr/local/lib/python3.10/dist-packages/truvari/phab.py", line 120, in make_consensus
2024-01-02 | 13:14:16,148 | if entry.samples[sample]['GT'][1] == 1:
2024-01-02 | 13:14:16,148 | IndexError: tuple index out of range
2024-01-02 | 13:14:16,148 | """
2024-01-02 | 13:14:16,148 |
2024-01-02 | 13:14:16,148 | The above exception was the direct cause of the following exception:
2024-01-02 | 13:14:16,148 |
2024-01-02 | 13:14:16,148 | Traceback (most recent call last):
2024-01-02 | 13:14:16,148 | File "/usr/local/bin/truvari", line 8, in <module>
2024-01-02 | 13:14:16,148 | sys.exit(main())
2024-01-02 | 13:14:16,148 | File "/usr/local/lib/python3.10/dist-packages/truvari/__main__.py", line 102, in main
2024-01-02 | 13:14:16,148 | TOOLS[args.cmd](args.options)
2024-01-02 | 13:14:16,148 | File "/usr/local/lib/python3.10/dist-packages/truvari/refine.py", line 413, in refine_main
2024-01-02 | 13:14:16,148 | truvari.phab(to_eval_coords, base_vcf, args.reference, phab_vcf, buffer=PHAB_BUFFER,
2024-01-02 | 13:14:16,149 | File "/usr/local/lib/python3.10/dist-packages/truvari/phab.py", line 344, in phab
2024-01-02 | 13:14:16,149 | sample_haps = collect_haplotypes(ref_haps_fn, hap_jobs, threads)
2024-01-02 | 13:14:16,149 | File "/usr/local/lib/python3.10/dist-packages/truvari/phab.py", line 184, in collect_haplotypes
2024-01-02 | 13:14:16,149 | for haplotypes in pool.imap(to_call, hap_jobs):
2024-01-02 | 13:14:16,149 | File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
2024-01-02 | 13:14:16,149 | raise value
2024-01-02 | 13:14:16,149 | IndexError: tuple index out of range
from truvari.
The results with the new consensus code is looking great as far as I can tell. PPV/TPR/TNR are all looking pretty much identical to truvari 4.1.0 with pysam==0.21.0. I do see 2 more TPs and 2 fewer FPs and FNs with the new consensus. But these may be related to other changes since 4.1.0, and for our purposes they don't make any difference.
from truvari.
Right, there's another change to develop that I forgot to mention which has to do with how sequence similarity is calculated with the 'unroll' technique (tl;dr-its a little more sensitive to finding higher similarities).
I really appreciate your help in discovering and fixing this regression. Hopefully in addition to the code working again you saw some improvement in runtime as I believe the new code is faster than bcftools.consensus
.
I'm aiming to release v4.2 by the end of the month. Let me know if you ever need anything else.
from truvari.
Related Issues (20)
- BED Region off-by-one error HOT 4
- Zero matches between base and comp HOT 4
- AttributeError: 'CollapsedCalls' object has no attribute 'consolidate' | version 4.2.1 HOT 4
- Calculate SNV HOT 7
- complex genotype problem HOT 3
- GT integrate HOT 1
- No TP or FP calls for CNV HOT 1
- merging different SV type? HOT 3
- No FP or TP calls HOT 2
- Unable to run MAFFT HOT 9
- md5sum FIPS issue HOT 1
- Support vector for intra-sample merge HOT 6
- some questions about the results in fp.vcf.gz
- some questions about the results in fp.vcf.gz HOT 1
- Getting same numbers of TP-base and TP-comp HOT 4
- Suggested minor documentation changes
- Truvari, STRs and Expansion Hunter - Query HOT 2
- Bug in benchmarking HOT 4
- Request: truvari collapse --keep option to mantain the ALT sequence HOT 1
- Inquiry on the Determination of Representative Structural Variants in Merged VCF Sets HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from truvari.