
Comments (3)

ACEnglish commented on June 25, 2024

There's a huge deletion on chromosome 4, ID=chr4-49657849-DEL-140440990. It spans many TR regions, and each one triggers a pysam.VariantFile.fetch that has to parse that record again. The variant by itself in a gzipped VCF is 38M. Removing that variant from the VCF allows the job to complete.

Chromosome 9 also has some larger variants that might need to be pre-filtered:

# LEN              ID
-140440990	chr4-49657849-DEL-140440990
-22543055	chr9-43222012-DEL-22543055
-20115032	chr9-42684836-DEL-20115032
-19608353	chr9-40910205-DEL-19608353
-4215131	chr21-5393558-DEL-4215131
-2828305	chr5-46867696-DEL-2828305
-2818263	chr9-62556860-DEL-2818263
-2240586	chr9-60559282-DEL-2240586
-1664861	chr9-40910205-DEL-1664861
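As a sketch of that pre-filtering, a small helper that computes an indel's signed length change from its REF/ALT alleles and drops anything beyond a cutoff (the helper names and the 1000 bp threshold are my own illustration, not part of Truvari's API):

```python
def variant_span(ref: str, alt: str) -> int:
    """Signed length change: positive for insertions, negative for deletions."""
    return len(alt) - len(ref)

def keep_record(ref: str, alt: str, max_len: int = 1000) -> bool:
    """Keep SNPs and indels whose length change is within +/- max_len bp."""
    return abs(variant_span(ref, alt)) < max_len

# The chr4 deletion above has a length change of -140440990 and is dropped,
# while an ordinary SNP (length change 0) is kept.
assert not keep_record("A" * 140440991, "A")
assert keep_record("A", "G")
```

In practice the same check could sit inside a loop over `pysam.VariantFile` records, writing kept records to a new VCF before running bench.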

from truvari.

TimD1 commented on June 25, 2024

Thanks for looking into this! I never thought to check whether large INDELs were slowing things down, since I assumed the --sizemax flag was excluding them from the analysis entirely.


TimD1 commented on June 25, 2024

Reopening this issue, but now with Truvari refine (Truvari bench now succeeds on this input in ~15 minutes, thanks!). I've filtered out all remotely large variants and all inversions as follows:

bcftools view \
    -i 'TYPE=="SNP" || (ILEN < 1000 && ILEN > -1000)' \
    pav.all.vcf.gz |
    grep -v "INV" > pav.most.vcf
bgzip -f pav.most.vcf
tabix -p vcf pav.most.vcf.gz

I ran Truvari WFA refine on the bench results, limiting the GIAB-TR regions to candidate.refine.bed. I've included my log file here: pav.log.

On this attempt, I set the default /tmp directory to a location on an external hard drive. The run crashed after 2.5 hours with a MemoryError, having used 475GB of space in /x/tmp.

On a previous attempt, it hung for a few days after filling the /tmp directory with 185GB of data, and then the current directory with another 600GB.

All of this data is in files named tmp********, which appear to be FASTA files auto-generated by samtools or something similar. How much space is this expected to take, and how long should the analysis run?
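If those scratch files are created through Python's tempfile module (an assumption on my part; I haven't confirmed that's how Truvari writes the tmp******** files), they honor the TMPDIR environment variable, so they can be redirected to a roomier filesystem before launching the tool:

```python
import os
import tempfile

# Create a scratch directory on the roomy filesystem (a real path like
# /x/tmp in the report above; here a throwaway directory stands in for it).
roomy = tempfile.mkdtemp(prefix="truvari_scratch_")

# Point Python's tempfile machinery at it and force re-evaluation,
# since gettempdir() caches its result in tempfile.tempdir.
os.environ["TMPDIR"] = roomy
tempfile.tempdir = None

assert tempfile.gettempdir() == roomy
```

Setting TMPDIR in the shell before invoking the command has the same effect for any child process that uses tempfile's defaults.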

Thanks in advance, sorry for opening so many issues. I'd be happy to help provide any more info to get this working.


