Pggb on bacteria. about pggb HOT 17 CLOSED

SilasK commented on September 15, 2024

Pggb on bacteria.

from pggb.

Comments (17)

ekg commented on September 15, 2024

You ran out of memory (signal 4 I assume). There are several possible causes, but it always comes back to putting sequences that are too large into spoa, which has quadratic memory costs. I'm working on a generic solution to this now, and we will probably swap in abPOA to drop memory usage. I'd suggest either trying a significantly larger segment size -s (10kb to 100kb) and/or waiting a few days for me to resolve this issue with smoothxg. Thanks for testing!
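To give a feel for why sequence length matters so much under a quadratic cost, here is a rough sketch. It assumes, purely for illustration, a single square dynamic-programming matrix with 4-byte cells whose side equals the sequence length. The function name and the cell size are hypothetical, and spoa's real memory layout is not described in this thread and may differ.

```python
def quadratic_dp_bytes(seq_len: int, bytes_per_cell: int = 4) -> int:
    """Bytes needed for a seq_len x seq_len matrix (illustrative model only,
    not spoa's actual allocation pattern)."""
    return seq_len * seq_len * bytes_per_cell


if __name__ == "__main__":
    # Each 10x increase in length costs 100x the memory under this model.
    for seq_len in (1_000, 10_000, 100_000):
        gib = quadratic_dp_bytes(seq_len) / 2**30
        print(f"{seq_len:>7} bp -> {gib:10.3f} GiB")
```

Under this toy model a tenfold longer sequence needs a hundredfold more memory, which is why a few very long sequences in one block can dominate the whole run.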

On Wed, Sep 23, 2020, 07:29 Silas Kieser wrote:

Very cool tool. I'd like to apply it to a set of bacterial genomes.

Do I understand this message correctly that there was a command that took too long? Shouldn't the exit status then be 1? In any case, the smooth.gfa is empty. I should probably change one of the smoothing parameters; could you give me some advice?

[smoothxg::smoothable_blocks] computing blocks
[smoothxg::smoothable_blocks] computing blocks 100.00%
Command terminated by signal 4plying spoa to block 3/37675 0.008%
	Command being timed: "smoothxg -t 8 -g /data/Akkermansia_ref.fasta.gz.pggb-s1000-p90-n10-a90-K11-k8-w10000-j5000-W0-e100.seqwish.gfa -w 10000 -j 5000 -k 0 -e 100"
	User time (seconds): 1952.10
	System time (seconds): 23.83
	Percent of CPU this job got: 318%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 10:21.26
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 4197032
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 2
	Minor (reclaiming a frame) page faults: 2382976
	Voluntary context switches: 760552
	Involuntary context switches: 9034
	Swaps: 0
	File system inputs: 264
	File system outputs: 7309432
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
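On the exit-status question: a process that is killed by a signal does not return an ordinary exit code at all, so "terminated by signal N" and an exit status are two separate pieces of information. The sketch below is plain Python and not part of pggb. It starts a child process that sends itself a signal, then decodes the result. The helper names are hypothetical, and the signal numbers (4 is SIGILL, 11 is SIGSEGV) are the usual Linux values.

```python
import signal
import subprocess
import sys


def describe(returncode: int) -> str:
    """Turn a subprocess return code into a readable message.
    Python reports death-by-signal as the negated signal number."""
    if returncode < 0:
        sig = signal.Signals(-returncode)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited with status {returncode}"


def run_child_that_raises(sig: signal.Signals) -> int:
    """Run a child Python process that sends `sig` to itself."""
    child = f"import os; os.kill(os.getpid(), {int(sig)})"
    return subprocess.run([sys.executable, "-c", child]).returncode


if __name__ == "__main__":
    print(describe(run_child_that_raises(signal.SIGILL)))
    print(describe(0))
```

A wrapper script that wants to fail loudly should therefore check for both cases, a non-zero exit status and termination by a signal.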

dirkjanvw commented on September 15, 2024

Hi, very cool tool indeed :)

I am not sure whether to comment on this issue or create a new one, but I ran into problems in the exact same step.
I managed to run pggb for both public Xanthomonas and yeast data, but when running it for some public cucumber data (the newest versions of these genomes: ftp://cucurbitgenomics.org/pub/cucurbit/genome/cucumber), I found the following:

[smoothxg::smoothable_blocks] computing blocks
[smoothxg::smoothable_blocks] computing blocks 100.00%%
Command terminated by signal 11lying spoa to block 62339/189753 32.853%
	Command being timed: "smoothxg -t 45 -g output/all.fa.pggb-s50000-p75-n5-a70-K16-k8-w10000-j5000-W0-e100.seqwish.gfa -w 10000 -j 5000 -k 0 -e 100"
	User time (seconds): 17007.23
	System time (seconds): 909.29
	Percent of CPU this job got: 342%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:27:09
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 49033128
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 6288
	Minor (reclaiming a frame) page faults: 229546238
	Voluntary context switches: 15770329
	Involuntary context switches: 75027
	Swaps: 0
	File system inputs: 135369878
	File system outputs: 185148472
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

I also tried the same settings but with -s 100000, in which case I also got signal 11 and no smooth.gfa.

ekg commented on September 15, 2024

49033128 kbytes is about 49G. Is that more than the memory you have on this system?
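For reference, the "Maximum resident set size" figure can be converted explicitly. This small sketch assumes that one reported kbyte is 1024 bytes, as is usual on Linux. The function names are illustrative only.

```python
KBYTE = 1024  # assumption: one reported "kbyte" is 1024 bytes


def kbytes_to_gib(kbytes: int) -> float:
    """Convert kbytes to binary gigabytes (GiB, 2**30 bytes)."""
    return kbytes * KBYTE / 2**30


def kbytes_to_gb(kbytes: int) -> float:
    """Convert kbytes to decimal gigabytes (GB, 10**9 bytes)."""
    return kbytes * KBYTE / 1e9


if __name__ == "__main__":
    peak = 49033128  # value from the log above
    print(f"{kbytes_to_gib(peak):.2f} GiB, {kbytes_to_gb(peak):.2f} GB")
```

Whichever unit is used, the peak is roughly 47 to 50 gigabytes, so the round figure of 49G is a fair shorthand.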

dirkjanvw commented on September 15, 2024

No, the maximum memory on this system is 128GB.
I also checked with htop during the run, and total usage (everything running on the system) never went above half of the available memory.

ekg commented on September 15, 2024

This can happen when you try to allocate a lot more memory in one go. The actual resident size never goes to the level you requested. The allocations in spoa don't seem to be guarded, or I'm not interacting with their errors correctly.

Working on a fix for this now. Hope to push in the next hour or two.
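The effect described here, a single oversized request failing while resident memory stays low, can be reproduced in miniature. The sketch below is plain Python, not smoothxg or spoa code. It asks for an absurdly large buffer (10**18 bytes, a hypothetical size chosen only so that the request is refused on ordinary machines) and guards the allocation so that the failure is handled instead of killing the process.

```python
import resource


def try_huge_allocation(n_bytes: int) -> bool:
    """Return True if the buffer could be allocated, False if the request
    was refused. The try/except is the 'guard' around the allocation."""
    try:
        buf = bytearray(n_bytes)
    except MemoryError:
        return False
    del buf
    return True


def peak_rss() -> int:
    """Peak resident set size of this process (kbytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


if __name__ == "__main__":
    before = peak_rss()
    ok = try_huge_allocation(10**18)
    after = peak_rss()
    print(f"allocation succeeded: {ok}; peak RSS before={before}, after={after}")
```

Because the request is rejected up front, the peak resident size barely moves, which matches a monitor such as htop never showing the machine as full.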

ekg commented on September 15, 2024

Please try with the current smoothxg HEAD. This should be resolved; see pangenome/smoothxg#8.

I've tested it on all the cases I had that were failing in a similar way.

dirkjanvw commented on September 15, 2024

Thank you! There are no errors anymore and it runs smoothly (pun intended)!

SilasK commented on September 15, 2024

I've updated smoothxg in the Docker container, but I think I still get signal 4:

[smoothxg::main] building xg index
[smoothxg::smoothable_blocks] computing blocks
[smoothxg::smoothable_blocks] computing blocks for 206004 handles: 100.00% @ 1.65e+05/s elapsed: 00:00:00:01 remain: 00:00:00:00
[smoothxg::break_blocks] splitting short sequences out of 1625 blocks: 100.00% @ 6.49e+03/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::break_blocks] split 117 blocks
[smoothxg::break_blocks] cutting blocks that contain sequences longer than max-poa-length (10000)
[smoothxg::break_blocks] cutting 1742 blocks: 100.00% @ 6.96e+03/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::break_blocks] cut 446 blocks of which 5 had repeats
Command terminated by signal 4
smoothxg -t 8 -g /data/akkermansia.fasta.gz.pggb-s100000-p90-n10-a90-K11-k8-w10000-j5000-e5000.seqwish.gfa -w 10000 -j 5000 -e 5000 -l 10000 -m /data/akkermansia.fasta.gz.pggb-s100000-p90-n10-a90-K11-k8-w10000-j5000-e5000.smooth.maf -s /data/akkermansia.fasta.gz.pggb-s100000-p90-n10-a90-K11-k8-w10000-j5000-e5000.consensus -a -C 10,100,1000,10000
35.37s user 1.68s system 193% cpu 19.12s total 155316Kb max memory

ekg commented on September 15, 2024

Please try running this under gdb to catch exactly where the error happens.

(Replying by email to Silas Kieser's message of Fri, Oct 30, 2020, 14:04, shown in full above.)

SilasK commented on September 15, 2024

Can you tell me which command to use? I installed gdb but don't know how to use it.

ekg commented on September 15, 2024

Take the smoothxg command that ran last and run it again like this:

gdb smoothxg -ex 'r ___'

where ___ is the rest of the command line after smoothxg. gdb will stop at the point where the error occurs. Then enter 'bt' to get a backtrace. Please share the message you get here.
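For anyone who prefers a non-interactive run, the same idea can be scripted. The helper below is a hypothetical convenience and not part of pggb or smoothxg. It only builds a gdb command line from the failing command, using the standard gdb options `-batch`, `-ex` and `--args`. The file name `graph.seqwish.gfa` is a placeholder.

```python
import shlex


def gdb_backtrace_command(failing_command: str) -> list[str]:
    """Wrap a failing command so that gdb runs it, then prints a backtrace
    ('bt') once the program stops, and exits (-batch)."""
    argv = shlex.split(failing_command)
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args", *argv]


if __name__ == "__main__":
    # Placeholder command line; substitute the real smoothxg invocation.
    cmd = gdb_backtrace_command(
        "smoothxg -t 8 -g graph.seqwish.gfa -w 10000 -j 5000"
    )
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell, and its output contains the backtrace to share.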

(Replying by email to Silas Kieser's message of Fri, Oct 30, 2020, 14:36.)

ekg commented on September 15, 2024

Any word on this? If the data is public, I can try to reproduce it.

SilasK commented on September 15, 2024

Sorry, I didn't manage to debug smoothxg because the intermediate files are removed, and using gdb with pggb directly didn't work out.

Here is my data: 5 (fragmented) bacterial genomes for which I wanted to create a pangenome graph. I concatenated the fasta files and used them as input for pggb. The genomes have 98% average nucleotide identity, so I used high mapping and alignment identity thresholds in pggb.

pggb -i combined_genomes.fasta.gz --segment-length=100000 -K 11 --map-pct-id=90 --align-pct-id=90 -n 10 -t 2 -v -l

ekg commented on September 15, 2024

Adding -K to smoothxg keeps the "prep" graph. I'll try out your test.

(Replying by email to Silas Kieser's message of Thu, Nov 5, 2020, 18:07. The data linked in that message: https://github.com/pangenome/pggb/files/5495827/genomes.tar.gz)

SilasK commented on September 15, 2024

I have a question:

The input for pggb is a fasta file with complete genomes, isn't it? But most bacterial genomes are only available as scaffolds or contigs. Should I simply fill the gaps with NNN, or should I filter the paf file to allow only between-genome alignments?
What do you think?
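To make the second option concrete, here is a minimal sketch of filtering a PAF file so that only alignments between different genomes remain. It relies on an assumption that is not stated in this thread: every sequence name starts with its genome name followed by "#", for example "genomeA#contig12". Column 1 (query name) and column 6 (target name) are the standard PAF positions. The function names are hypothetical, and the sketch illustrates the idea from the question without being a recommendation.

```python
from typing import Iterable, Iterator


def genome_of(seq_name: str, sep: str = "#") -> str:
    """Genome label = text before the first separator (assumed naming scheme)."""
    return seq_name.split(sep, 1)[0]


def between_genome_only(paf_lines: Iterable[str], sep: str = "#") -> Iterator[str]:
    """Yield only PAF records whose query and target come from different genomes."""
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # PAF has 12 mandatory columns; skip malformed lines
            continue
        query, target = fields[0], fields[5]
        if genome_of(query, sep) != genome_of(target, sep):
            yield line


if __name__ == "__main__":
    demo = [
        "gA#c1\t100\t0\t100\t+\tgB#c7\t200\t50\t150\t95\t100\t60\n",
        "gA#c1\t100\t0\t100\t+\tgA#c2\t300\t10\t110\t90\t100\t60\n",
    ]
    for record in between_genome_only(demo):
        print(record, end="")
```

In the demo, the first record (gA against gB) is kept and the second (gA against gA) is dropped.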

ekg commented on September 15, 2024

You can input shorter contigs. Be careful to align them uniquely against longer scaffolds. If you have whole assemblies, those should help to structure the graph. The segment length in the alignment needs to be long enough for that purpose. Sequences shorter than the segment size are not multi-mapped; they get only the best mapping (this is configurable in edyeet).

(Replying by email to Silas Kieser's message of Thu, Nov 12, 2020, 16:16.)

subwaystation commented on September 15, 2024

Because there has been no recent activity here, the issue seems to be solved. Closing. If you feel otherwise, please reopen it.
