Comments (9)
The new paralogs-ks run finished in 4 hours and 10 minutes and used 3.05 GB of memory. I always check whether AGAT messed up the GFF3 or protein file by comparing the number of genes, mRNAs, and BUSCOs before and after running it. I just learned that those checks are not always enough to rule out major changes to the data. I reran all the orthologs-ks jobs that used the old CDS file and they finished pretty quickly too.
This turned out to be a pretty niche issue, but if anyone uses AGAT in combination with ksrates, there's a small chance they might benefit from my misadventure here. Thanks for your help!
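For anyone who wants a stricter before/after check than gene, mRNA, and BUSCO counts, here is a minimal sketch that compares two CDS FASTA files directly. It is not part of AGAT or ksrates; the function names are made up for illustration, and it assumes plain-text FASTA where the sequence ID is the first whitespace-separated token of each header and IDs are unique within a file. It does not say what went wrong in my case, it only flags files that differ in record count, total length, reading-frame consistency, or IDs.

```python
from collections import namedtuple

# Hypothetical helper types/functions, not part of AGAT or ksrates.
FastaSummary = namedtuple(
    "FastaSummary", "n_records total_length not_multiple_of_3 ids"
)


def summarize_fasta(path):
    """Collect per-file statistics from a plain-text FASTA file."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumption: the ID is the first token of the header line.
                current = line[1:].split()[0]
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    # CDS lengths are normally a multiple of 3; count the ones that are not.
    bad_frame = sum(1 for n in lengths.values() if n % 3 != 0)
    return FastaSummary(
        len(lengths), sum(lengths.values()), bad_frame, set(lengths)
    )


def compare_fasta(path_a, path_b):
    """Return side-by-side statistics for two FASTA files."""
    a, b = summarize_fasta(path_a), summarize_fasta(path_b)
    return {
        "records": (a.n_records, b.n_records),
        "total_length": (a.total_length, b.total_length),
        "not_multiple_of_3": (a.not_multiple_of_3, b.not_multiple_of_3),
        "only_in_a": len(a.ids - b.ids),
        "only_in_b": len(b.ids - a.ids),
    }
```

Usage would be something like `compare_fasta("agat_cds.fasta", "public_cds.fasta")` (both file names are placeholders). Large differences in total length or many sequences out of frame would be a hint to look closer before starting a long run.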
from ksrates.
Hi sjfleck,
It indeed shouldn't take that many hours; at least, I have never had such a case with any FASTA file so far (and I generally use 10 threads).
I've just started the paralogs-ks command on the same species (thanks for the link) by using 10 threads and 3G per thread (total 30G) to see how it goes on my cluster.
Did your last try finish within the 72 hours?
It would perhaps be useful to know which step it was stalling at: for example, is it during BLAST or later during the Ks estimate per GF?
The BLAST step can run for a long time without printing anything, to the point that you might think it's stalling while it's actually still working; the Ks estimate for the first (and largest) gene families also takes a long time when they have many members, and the log looks like the following for a while:
INFO Running whole paranome Ks analysis...
INFO Started analysis of 4999 gene families in parallel using 3 threads
INFO Performing analysis on gene family GF_000001 (size 175)
INFO Performing analysis on gene family GF_000002 (size 108)
INFO Performing analysis on gene family GF_000003 (size 82)
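To see how far the Ks step has progressed, log lines like the ones above can be summarized with a small script. This is only a sketch: `ks_progress` is a made-up helper, and it assumes the log messages have exactly the wording shown in this comment.

```python
import re


def ks_progress(log_lines):
    """Summarize Ks-step progress from log lines (hypothetical helper)."""
    total = None
    started = 0
    last_family = None
    for line in log_lines:
        # e.g. "Started analysis of 4999 gene families in parallel ..."
        match = re.search(r"Started analysis of (\d+) gene families", line)
        if match:
            total = int(match.group(1))
        # e.g. "Performing analysis on gene family GF_000001 (size 175)"
        match = re.search(
            r"Performing analysis on gene family (\S+) \(size (\d+)\)", line
        )
        if match:
            started += 1
            last_family = (match.group(1), int(match.group(2)))
    return {"total": total, "started": started, "last_family": last_family}
```

Calling it on the lines of the log file (for example `ks_progress(open("ksrates.log"))`, where the file name is a placeholder) gives the number of gene families started so far and the most recent one, which helps tell a slow step from a stalled one.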
If it is the BLAST step, could you double-check the number of threads (-num_threads) in its command line? The command appears in the log file:
INFO Running Blastp
INFO blastp -db /path/to/.../eustomagrandiflorum.blast_tmp/eustomagrandiflorum.db.fasta -query /path/to/.../eustomagrandiflorum.blast_tmp/eustomagrandiflorum.query.fasta -evalue 1e-10 -outfmt 6 -num_threads 10 -out /path/to/../eustomagrandiflorum.blast.tsv
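If the log is long, the thread count can also be pulled out with a few lines of Python. This is a rough sketch with a made-up helper name (`blastp_threads`); it assumes the blastp command is logged on a single line in the form shown above.

```python
import re
import shlex


def blastp_threads(log_lines):
    """Return the -num_threads value of each logged blastp command.

    Hypothetical helper; yields None for a command without -num_threads.
    """
    found = []
    for line in log_lines:
        # Find the start of the command, skipping any log prefix like "INFO".
        match = re.search(r"\bblastp\s+-", line)
        if not match:
            continue
        tokens = shlex.split(line[match.start():])
        if "-num_threads" in tokens:
            idx = tokens.index("-num_threads")
            if idx + 1 < len(tokens):
                found.append(int(tokens[idx + 1]))
        else:
            found.append(None)
    return found
```

For instance, `blastp_threads(open("ksrates.log"))` (placeholder file name) would return one entry per blastp command found in the log.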
Best,
Cecilia
from ksrates.
Hi!
My run took 2h30 and the whole-paranome looks good.
The stalling is then probably not related to the species itself but to something else, e.g. the cluster settings. Do you spot any difference in settings between the runs for species A-B-C and the run for Eustoma? I wonder why the latter used ~40G instead of ~4G like the three other species you reported.
Best,
Cecilia
from ksrates.
I had a feeling something was wrong with this run. The job has been running for 69 hours and is still on the i-ADHoRe step. There weren't any differences between the commands for the 4 species, so I'm not sure why this one isn't working on my end.
The only difference that I can tell between my run and yours is that I extracted the CDS file using AGAT instead of using the CDS file that was already publicly available. I did it this way because some of the species I'm using didn't have publicly available CDS files and I wanted to be consistent across all of them. The other three species I ran paralogs-ks with finished quickly and had results that made sense. I'll rerun this job with the publicly available CDS file and see what happens.
from ksrates.
I see! And what's the last thing appearing in the log file, during the i-ADHoRe step?
Perhaps the problem occurs during the parsing of the GFF file ("feature", "attribute", e.g. "mRNA" and "ID")?
from ksrates.
This is the entire .out file for ksrates so far. This run picked up where another one left off, so some of the steps were already completed. I also tried deleting all files created in the previous run, but it didn't seem to make a difference. I know orthologs-ks needs the files from a failed run deleted, but it didn't seem to be the case with paralogs-ks.
SLURM_JOBID=11057273
SLURM_JOB_NODELIST=cpn-u23-14
SLURM_NNODES=1
SLURMTMPDIR=/scratch/11057273
working directory = /full/path/to/working/directory
INFO - - - - - - - - - - - - - - - - - - -
INFO Paralog wgd analysis for species [Eg]
INFO Sun Jan 22 13:42:04 2023
INFO - - - - - - - - - - - - - - - - - - -
INFO Checking if sequence data files exist and if sequence IDs are compatible with wgd pipeline...
INFO Completed
INFO Running wgd paralog Ks pipeline...
INFO Paralog blast data Eg.blast.tsv already exists, will skip wgd all versus all Blastp
INFO Paralog gene family data Eg.mcl.tsv already exists, will skip wgd mcl
INFO Paralog Ks data Eg.ks.tsv already exists, will skip wgd Ks analysis
INFO All paralog data already exist, nothing to do
INFO Done
INFO ---
INFO Running wgd colinearity Ks pipeline...
INFO No colinearity anchor pair Ks data, will run wgd colinearity Ks analysis
INFO Checking external software...
INFO This is i-ADHoRe v3.0.
INFO Parsing GFF file
INFO Writing gene lists for i-ADHoRe
INFO Writing families file for i-ADHoRe
INFO Writing i-ADHoRe configuration file
INFO Running i-ADHoRe 3.0...
INFO i-adhore /full/path/to/paralog_distributions/wgd_Eg/Eg.ks_anchors_tmp/i-adhore.conf
from ksrates.
The log doesn't really help... But you could look into the tmp files that are produced; perhaps something is going wrong there.
/path/to/.../paralog_distributions/wgd_eustomagrandiflorum/eustomagrandiflorum.ks_anchors_tmp
Now that you're running the workflow also with the CDS/GFF downloaded from the public database, you could compare the content of the i-ADHoRe tmp directory between the two runs.
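One simple way to do that comparison is to list the files in both tmp directories side by side with their sizes. The sketch below is generic Python, not a ksrates or i-ADHoRe feature; `dir_sizes` and `compare_dirs` are made-up names.

```python
import os


def dir_sizes(path):
    """Map each file under `path` (relative name) to its size in bytes."""
    sizes = {}
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            sizes[os.path.relpath(full, path)] = os.path.getsize(full)
    return sizes


def compare_dirs(dir_a, dir_b):
    """Return (file, size_in_a, size_in_b) rows; None marks a missing file."""
    a, b = dir_sizes(dir_a), dir_sizes(dir_b)
    return [(name, a.get(name), b.get(name)) for name in sorted(set(a) | set(b))]


def print_comparison(dir_a, dir_b):
    """Print the rows as a small table for eyeballing differences."""
    for name, size_a, size_b in compare_dirs(dir_a, dir_b):
        print(f"{name}\t{size_a}\t{size_b}")
```

Files that are present or non-empty in one run but missing or empty in the other are the first ones worth opening.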
from ksrates.
That's a good idea. I'll keep an eye on this run for the next 2-3 hours. Hopefully it finishes like yours did.
For the i-adhore_out directory in the 70 hour run, the only file that isn't empty is genes.txt. The other files are empty or just have headers. The other thing worth mentioning is that E. grandiflorum has a blast.tsv file that is WAY bigger than the others I ran. I'm not sure how big yours was, but for my other 3 species, this file was between 214 and 486 MB, not 8.3 GB.
from ksrates.
Ah! That's it then: my BLAST table is also way smaller (305M), and all the tables I have generated so far were at most ~500M. With an 8G table, your workflow will be busy for a very long time.
-rw-r--r-- 1 user group 305M Jan 25 12:29 /path/to/.../eustomagrandiflorum.blast.tsv
Side note: in my i-ADHoRe tmp directory, genes.txt is also the only file with content, so that's apparently normal.
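To catch an oversized BLAST table early, the table can be checked right after the BLAST step. Below is a minimal sketch with a made-up helper (`blast_table_stats`). It assumes the table is in the tabular format requested with -outfmt 6, where the first tab-separated column is the query ID; what counts as "too many" hits per query is a judgment call and is not defined here.

```python
import os


def blast_table_stats(path):
    """Report size and hit counts for a tabular (-outfmt 6) BLAST table."""
    hits = 0
    queries = set()
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            hits += 1
            # First column of -outfmt 6 is the query sequence ID.
            queries.add(line.split("\t", 1)[0])
    return {
        "size_mb": os.path.getsize(path) / 1e6,
        "hits": hits,
        "queries": len(queries),
        "hits_per_query": hits / len(queries) if queries else 0.0,
    }
```

Running this on each species' table and comparing the numbers should make an outlier like the 8.3 GB one stand out before the later steps start.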
from ksrates.