Recovering run that failed while executing LTR_retriever

raul-w commented on August 11, 2024

Recovering run that failed while executing LTR_retriever

from tetools.

Comments (11)

jebrosen commented on August 11, 2024

I see this quite a few times in the log:

sh: 1: Cannot fork

This error suggests resource exhaustion, in particular running low on memory or hitting the maximum number of processes/threads. This can cause all kinds of issues and false reporting of error conditions. Can you share the output of ulimit -u from inside the container, and do you know how much memory was in use at the time / the memory available? The main log / screen output of the overall RepeatModeler run may also be helpful if you still have it. And finally, what was the full command line you used to run RepeatModeler?
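The numbers asked for above can be gathered in one pass. The following is only a minimal sketch, assuming a Linux system with /proc mounted. The function name report_limits is made up for illustration, and some minimal shells do not support ulimit -u, in which case "n/a" is printed.

```shell
#!/bin/sh
# Hypothetical helper: print the limits and usage figures relevant to a
# "Cannot fork" error. Assumes Linux with /proc available.
report_limits() {
    # Per-user process/thread limit; fall back to "n/a" if the shell
    # does not implement the -u option.
    echo "max user processes (ulimit -u): $(ulimit -u 2>/dev/null || echo 'n/a')"

    # The 4th field of /proc/loadavg is "runnable/total" scheduling
    # entities; the total counts every process and thread on the system.
    if [ -r /proc/loadavg ]; then
        echo "processes and threads in use: $(awk '{ split($4, a, "/"); print a[2] }' /proc/loadavg)"
    fi

    # System-wide ceiling on PIDs (each thread consumes one).
    if [ -r /proc/sys/kernel/pid_max ]; then
        echo "kernel pid_max: $(cat /proc/sys/kernel/pid_max)"
    fi

    # Total and currently available memory.
    if [ -r /proc/meminfo ]; then
        grep -E '^(MemTotal|MemAvailable):' /proc/meminfo
    fi
}

report_limits
```

Running this both on the host and inside the container may show whether the two see different limits.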

Do you have any suggestions on how I could finish this run without starting from scratch?

You can run the LTRPipeline separately: LTRPipeline -pa 4 genome.fa (where 4 is the number of parallel threads to use). However, it is not easy to combine the results between RECON+RepeatScout and LTRPipeline if you do them separately. I will open a GitHub issue for RepeatModeler to properly support -recoverDir when only the LTR pipeline failed.

Note that the fork error could have impacted the previous rounds, so it may be worth re-doing the run anyway if that turns out to be a serious problem.

raul-w commented on August 11, 2024

I ran the container on a shared server, so the fork error could indeed be caused by a lack of available threads during runtime. Memory was likely not the issue (I assume that there was at least 100G available at the time), nor were any limits placed on the container itself (ulimit -u and ulimit -m both gave unlimited as output when run inside the container).

The full command with which I ran RepeatModeler was: docker run -it --mount type=bind,source="/mnt/local_scratch/wijfj001/repeatmodeler_dir",target=/work --mount type=bind,source="/lustre/BIF/nobackup/wijfj001/Software/bin/TRF",target=/opt/trf,ro --user "$(id -u):$(id -g)" --workdir "/work" --env "HOME=/work" "rwijfjes/tetools:1.1" RepeatModeler -database Hinc_ctgs -pa 5 -LTRStruct >& repeat_modeler_run.log

Besides the message that no results were obtained after running LTR_retriever, the main log file does not contain any red flags as far as I can tell:

repeat_modeler_run.log

It seems that restarting the run on a system on which I do not have to share threads is the most practical solution for now. Thanks for the help up to this point!

raul-w commented on August 11, 2024

Alright, I managed to complete a run through the container on a different server and the main log file does not seem to point out any issues. However, we had to increase the pid_max value of the server to 200,000 to make it happen, as RepeatModeler had generated 57,795 zombie processes by the time it reached the final RepeatClassifier stage. Any idea what could have caused this?

The full command that I used was: docker run -it --mount type=bind,source="/mnt/local_scratch/wijfj001/repeatmodeler_dir",target=/work --mount type=bind,source="/mnt/scratch/wijfj001/Software/bin/TRF",target=/opt/trf,ro --user "$(id -u):$(id -g)" --workdir "/work" --env "HOME=/work" "rwijfjes/tetools:1.1" RepeatModeler -database Hinc_scfs -pa 6 -LTRStruct

jebrosen commented on August 11, 2024

as RepeatModeler had generated 57,795 zombie processes by the time it reached the final RepeatClassifier stage. Any idea what could have caused this?

That is definitely wrong, and it's possible it is the cause of, or otherwise related to, the cannot fork error you experienced earlier. However, RepeatModeler does wait on its children, so I am surprised that you would accumulate any zombies, let alone such an absurd number of them. Do you know if the zombies were of RepeatModeler itself or of another script/program that it runs?
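One way to answer that question is to group the defunct processes by program name and parent PID. This is a rough sketch, assuming Linux with /proc mounted; the function name count_zombies_by_name is hypothetical and not part of TETools or RepeatModeler.

```shell
#!/bin/sh
# Hypothetical helper: list zombie (state Z) processes grouped by command
# name and parent PID, most frequent first. Assumes Linux with /proc.
count_zombies_by_name() {
    for status in /proc/[0-9]*/status; do
        # A process may exit between the glob expansion and the read,
        # so errors from awk are silenced.
        awk '
            /^Name:/  { sub(/^Name:[ \t]*/, ""); name = $0 }
            /^State:/ { state = $2 }
            /^PPid:/  { ppid = $2 }
            END { if (state == "Z") print name, "ppid=" ppid }
        ' "$status" 2>/dev/null
    done | sort | uniq -c | sort -rn
}

count_zombies_by_name
```

Each output line is a count followed by the program name and the parent that has not reaped it. No output means no zombies were found at that moment.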

raul-w commented on August 11, 2024

The zombie processes were generated while RepeatModeler was running LTR_retriever and seemed to stick around until the complete run had finished. Nevertheless, the main log file reported that LTR_retriever had finished (see attachment). There was unfortunately no log file (besides makeblastdb.log) present in the working directory of the LTRPipeline part of the run, so I could not check what the stderr output was during that time.

repeat_modeler_run.log

jebrosen commented on August 11, 2024

Does running LTRPipeline -pa X genome.fa by itself also create the zombie processes? I have not been able to reproduce it yet but this does help narrow it down.

raul-w commented on August 11, 2024

I tried out the following command: docker run -it --mount type=bind,source="/mnt/local_scratch/wijfj001/repeatmodeler_dir",target=/work --mount type=bind,source="/mnt/scratch/wijfj001/Software/bin/TRF",target=/opt/trf,ro --user "$(id -u):$(id -g)" --workdir "/work" --env "HOME=/work" "rwijfjes/tetools:1.1" LTRPipeline -pa 6 scaffold_sequences_short_ids.fa >& LTRPipeline_scaffold_rerun.out

This command had spawned ~57,000 [bash] <defunct> processes before I killed it. They were all generated between 14:17 and 14:19, which corresponded to the end of the modules 2-5 part of the pipeline (see log):

LTR_retriever_20200316.log

Hope this helps!

jebrosen commented on August 11, 2024

Thanks. I should have asked for these details right away, in case there is a known bug in those versions:

Host operating system and version (from uname -a, lsb_release -a, etc.)
Version of docker and where you got it from (OS package manager or from source)
Whether and how you modified the Dockerfile to build rwijfjes/tetools:1.1

jebrosen commented on August 11, 2024

You can disregard the previous message - I was able to reproduce this after all.

When you run a command directly as in docker run ... LTRPipeline ..., that command becomes PID 1 inside the container. Child processes whose parents die are reparented to PID 1, here LTRPipeline, which does not reap adopted orphans automatically (and it does not usually need to).

One simple change that should work for you is to use sh -c as an intermediate, which does reap adopted children:

docker run -it --mount type=bind,source="/mnt/local_scratch/wijfj001/repeatmodeler_dir",target=/work --mount type=bind,source="/lustre/BIF/nobackup/wijfj001/Software/bin/TRF",target=/opt/trf,ro --user "$(id -u):$(id -g)" --workdir "/work" --env "HOME=/work" "rwijfjes/tetools:1.1" sh -c 'RepeatModeler -database Hinc_ctgs -pa 5 -LTRStruct' >& repeat_modeler_run.log

Unreaped orphans are often not a huge problem, but LTR_retriever indirectly spawns a large number of processes, so the issue became noticeable here.

EDIT: Docker's --init flag may also work or even be preferred as a workaround. I will continue to test and make sure the issue is resolved on our side for anyone who uses the dfam-tetools.sh script.
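For reference, the --init variant could be wrapped roughly as below. This is a sketch under assumptions, not the contents of dfam-tetools.sh: the function name run_tetools and the DOCKER and TETOOLS_IMAGE variables are made up for illustration, the image name is the one from the commands above, and the -it flags and the TRF mount are left out for brevity.

```shell
#!/bin/sh
# Hypothetical wrapper around "docker run --init"; names and defaults
# here are illustrative only.
DOCKER="${DOCKER:-docker}"
TETOOLS_IMAGE="${TETOOLS_IMAGE:-rwijfjes/tetools:1.1}"

run_tetools() {
    # --init starts a small init process as PID 1 in the container.
    # It forwards signals and reaps orphaned children, so the tool
    # itself no longer has to.
    "$DOCKER" run --init \
        --mount type=bind,source="$PWD",target=/work \
        --user "$(id -u):$(id -g)" \
        --workdir /work \
        --env HOME=/work \
        "$TETOOLS_IMAGE" "$@"
}
```

With this in place, run_tetools LTRPipeline -pa 6 scaffold_sequences_short_ids.fa would run the same pipeline as before from the current directory, with the init process as PID 1.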

raul-w commented on August 11, 2024

The --init option did the trick! I ran a docker run ... LTRPipeline ... command with this flag and it finished without starting a zombie apocalypse. I expect that the full RepeatModeler pipeline will now run properly as well.

Thanks for getting to the bottom of this, Jeb!

jebrosen commented on August 11, 2024

I added --init to the docker command in dfam-tetools.sh, so this should not negatively impact other users in the future.

I have also opened Dfam-consortium/RepeatModeler#65 to track the issue where one cannot resume the run if only the LTRPipeline step fails.

Thanks for reporting both issues!
