Hello, I've run ksrates paralogs-ks for a few different genome assemblies and they

ksrates paralogs-ks seemingly freezing on i-adhore step

sjfleck commented on May 28, 2024

ksrates paralogs-ks seemingly freezing on i-adhore step

from ksrates.

Comments (9)

sjfleck commented on May 28, 2024

The new paralogs-ks run finished in 4 hours and 10 minutes and used 3.05 GB memory. I always check to see if AGAT messed up the gff3 or protein file by looking at the number of genes, mRNA, and BUSCOs before and after running it. I just learned that those are not always enough to make sure there weren't any major changes to the data. I reran all the orthologs-ks jobs that used the old CDS file and they finished pretty quickly too.
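As a rough illustration of a stricter check than feature counts, the sketch below compares the actual sequence content of two CDS FASTA files, for example an AGAT-extracted file against a publicly available one. The function names and any file paths are hypothetical and not part of ksrates or AGAT, and it is only an assumption that a comparison like this would have exposed the difference in this case.

```python
def read_fasta(path):
    """Return a dict mapping sequence ID (first word of the header) to sequence."""
    records = {}
    seq_id = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Store the previous record before starting a new one.
                if seq_id is not None:
                    records[seq_id] = "".join(chunks)
                parts = line[1:].split()
                seq_id = parts[0] if parts else ""
                chunks = []
            elif seq_id is not None:
                chunks.append(line)
    if seq_id is not None:
        records[seq_id] = "".join(chunks)
    return records


def compare_cds(path_a, path_b):
    """Summarise how two CDS FASTA files differ in IDs and sequence content."""
    a = read_fasta(path_a)
    b = read_fasta(path_b)
    shared = set(a) & set(b)
    return {
        "n_a": len(a),
        "n_b": len(b),
        "total_bp_a": sum(len(s) for s in a.values()),
        "total_bp_b": sum(len(s) for s in b.values()),
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        # Shared IDs whose sequences differ (case-insensitive comparison).
        "different_sequence": sorted(
            i for i in shared if a[i].upper() != b[i].upper()
        ),
    }
```

Two files can have the same number of sequences and still differ in total length or in individual sequences, which is what the `total_bp_*` and `different_sequence` entries are meant to catch.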

This turned out to be a pretty niche issue, but if anyone uses AGAT in combination with ksrates, there's a small chance they might benefit from my misadventure here. Thanks for your help!


Cecilia-Sensalari commented on May 28, 2024

Hi sjfleck,

It indeed shouldn't take that many hours, or at least I've never had such a case with any FASTA file so far (and I generally use 10 threads).
I've just started the paralogs-ks command on the same species (thanks for the link) by using 10 threads and 3G per thread (total 30G) to see how it goes on my cluster.

Did your last try finish within the 72 hours?

It would perhaps be useful to know which step it was stalling at: for example, is it during BLAST or later, during the Ks estimate per gene family (GF)?
The BLAST step can run for a long time without printing anything, to the point that you might think it's stalling while it's actually still working. The Ks estimate of the first (and largest) gene families also takes a long time when they have many members, and the log looks like the following for a while:

INFO	Running whole paranome Ks analysis...
INFO	Started analysis of 4999 gene families in parallel using 3 threads
INFO	Performing analysis on gene family GF_000001 (size 175)
INFO	Performing analysis on gene family GF_000002 (size 108)
INFO	Performing analysis on gene family GF_000003 (size 82)

If it is the BLAST step, could you double-check the number of threads (-num_threads) in its command line? The command appears in the log file:

INFO	Running Blastp
INFO	blastp -db /path/to/.../eustomagrandiflorum.blast_tmp/eustomagrandiflorum.db.fasta -query /path/to/.../eustomagrandiflorum.blast_tmp/eustomagrandiflorum.query.fasta -evalue 1e-10 -outfmt 6 -num_threads 10 -out /path/to/../eustomagrandiflorum.blast.tsv

Best,
Cecilia
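A small sketch of how the thread count could be pulled out of a saved log instead of read by eye. It assumes only the log format quoted above (a line containing the lowercase `blastp` command with a `-num_threads` value); the function name is hypothetical and not part of ksrates.

```python
import re


def blastp_threads(log_text):
    """Return the -num_threads value of each logged blastp command line.

    A command line without the flag yields None for that entry.
    """
    values = []
    for line in log_text.splitlines():
        # Skip lines such as "INFO  Running Blastp" that are not the command.
        if "blastp " not in line:
            continue
        match = re.search(r"-num_threads\s+(\d+)", line)
        values.append(int(match.group(1)) if match else None)
    return values
```

Calling it on the text of the log file gives one entry per logged blastp command, which can then be compared with the number of cores requested from the scheduler.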


Cecilia-Sensalari commented on May 28, 2024

Hi!

My run took 2h30 and the whole-paranome looks good.
The stalling is then probably not related to the species itself, but to something else, e.g. cluster settings. Do you spot any difference in settings between the runs for species A-B-C and the run for Eustoma? I wonder why the latter used ~40G instead of ~4G like the three other species you reported.

Best,
Cecilia


sjfleck commented on May 28, 2024

I had a feeling something was wrong with this run. The job has been running for 69 hours and is still on the i-adhore step. There weren't any differences between the commands for the 4 different species, so I'm not exactly sure why this one isn't working over here.

The only difference that I can tell between my run and yours is that I extracted the CDS file using AGAT instead of using the CDS file that they already had available. I did it this way because some of the species I'm using didn't have publicly available CDS files and I wanted to be consistent with all of them. The other three species I ran paralogs-ks with finished quickly and had results that made sense. I'll rerun this job with the publicly available CDS file and see what happens.


Cecilia-Sensalari commented on May 28, 2024

I see! And what's the last thing appearing in the log file, during the i-ADHoRe step?
Perhaps the problem occurs during the parsing of the GFF file ("feature", "attribute", e.g. "mRNA" and "ID")?


sjfleck commented on May 28, 2024

This is the entire .out file for ksrates so far. This run picked up where another one left off, so some of the steps were already completed. I also tried deleting all files created in the previous run, but it didn't seem to make a difference. I know orthologs-ks needs the files from a failed run deleted, but it didn't seem to be the case with paralogs-ks.

SLURM_JOBID=11057273
SLURM_JOB_NODELIST=cpn-u23-14
SLURM_NNODES=1
SLURMTMPDIR=/scratch/11057273
working directory = /full/path/to/working/directory
INFO - - - - - - - - - - - - - - - - - - -
INFO Paralog wgd analysis for species [Eg]
INFO Sun Jan 22 13:42:04 2023
INFO - - - - - - - - - - - - - - - - - - -
INFO Checking if sequence data files exist and if sequence IDs are compatible with wgd pipeline...
INFO Completed
INFO Running wgd paralog Ks pipeline...
INFO Paralog blast data Eg.blast.tsv already exists, will skip wgd all versus all Blastp
INFO Paralog gene family data Eg.mcl.tsv already exists, will skip wgd mcl
INFO Paralog Ks data Eg.ks.tsv already exists, will skip wgd Ks analysis
INFO All paralog data already exist, nothing to do
INFO Done
INFO ---
INFO Running wgd colinearity Ks pipeline...
INFO No colinearity anchor pair Ks data, will run wgd colinearity Ks analysis
INFO Checking external software...
INFO This is i-ADHoRe v3.0.
INFO Parsing GFF file
INFO Writing gene lists for i-ADHoRe
INFO Writing families file for i-ADHoRe
INFO Writing i-ADHoRe configuration file
INFO Running i-ADHoRe 3.0...
INFO i-adhore /full/path/to/paralog_distributions/wgd_Eg/Eg.ks_anchors_tmp/i-adhore.conf


Cecilia-Sensalari commented on May 28, 2024

The log doesn't really help... But you could check the tmp files that are produced; perhaps something is going wrong there.

/path/to/.../paralog_distributions/wgd_eustomagrandiflorum/eustomagrandiflorum.ks_anchors_tmp

Now that you're running the workflow also with the CDS/GFF downloaded from the public database, you could compare the content of the i-ADHoRe tmp directory between the two runs.
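One way such a comparison could be done is sketched below: it lists every file found in either of two directories together with its size in each, so empty or unusually large files stand out. The function names are hypothetical, and the two directories would be the `ks_anchors_tmp` folders of the two runs.

```python
from pathlib import Path


def dir_sizes(directory):
    """Map each file's path (relative to the directory) to its size in bytes."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_dirs(dir_a, dir_b):
    """Return (name, size_in_a, size_in_b) rows; None marks a missing file."""
    a = dir_sizes(dir_a)
    b = dir_sizes(dir_b)
    return [(name, a.get(name), b.get(name)) for name in sorted(set(a) | set(b))]


def print_comparison(dir_a, dir_b):
    """Print one line per file with its size in each directory."""
    for name, size_a, size_b in compare_dirs(dir_a, dir_b):
        print(f"{name}\t{size_a}\t{size_b}")
```

This only compares file names and sizes; files of similar size would still need to be inspected (or diffed) to see whether their content actually matches.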


sjfleck commented on May 28, 2024

That's a good idea. I'll keep an eye on this run for the next 2-3 hours. Hopefully it finishes like yours did.

For the i-adhore_out directory in the 70-hour run, the only file that isn't empty is genes.txt. The other files are empty or just have headers. The other thing worth mentioning is that E. grandiflorum has a blast.tsv file that is WAY bigger than the ones from the other species I ran. I'm not sure how big yours was, but for my other 3 species, this file was between 214 and 486 MB, not 8.3 GB.


Cecilia-Sensalari commented on May 28, 2024

Ah! That's it then: my BLAST table is also way smaller (305M), and all the tables I've generated so far were ~500M at most. With an 8G table, your workflow will be busy for a very long time.
-rw-r--r-- 1 user group 305M Jan 25 12:29 /path/to/.../eustomagrandiflorum.blast.tsv

Side note: in my i-ADHoRe tmp directory, genes.txt is also the only file with content, so that's apparently normal.
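Since the size of the BLAST table turned out to be the telling symptom, a quick summary of the table before the later steps start could flag the problem early. The sketch below assumes the tabular `-outfmt 6` layout used in the blastp command quoted in this thread, where the first tab-separated column is the query ID; the function name is hypothetical and not part of ksrates.

```python
import os
from collections import Counter


def blast_table_summary(path, top=5):
    """Summarise a tabular BLAST table: file size, hit count, busiest queries."""
    hits_per_query = Counter()
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and comment lines, if any.
            if not line.strip() or line.startswith("#"):
                continue
            query_id = line.split("\t", 1)[0]
            hits_per_query[query_id] += 1
    return {
        "size_bytes": os.path.getsize(path),
        "n_hits": sum(hits_per_query.values()),
        "n_queries": len(hits_per_query),
        "top_queries": hits_per_query.most_common(top),
    }
```

Comparing `n_hits` and `top_queries` between species gives a more direct picture than file size alone of whether one table is inflated relative to the others.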

