I am trying to assemble a PacBio dataset containing 52Gb of data (my genome is 620Mb).

memory issues about flye HOT 8 CLOSED

fenderglass commented on August 27, 2024

memory issues

from flye.

Comments (8)

mikolmogorov commented on August 27, 2024

Hello,

We haven't tested ABruijn on genomes that large yet (but it is definitely a direction we want to improve).

There is a warning that basically means ABruijn could not find enough solid kmers to represent the genome, and the default threshold t = 2 is too low, which slows everything down.

I can think of two possible reasons for that. First, it might be that some fraction of the input reads does not come from the genome, but rather from contamination, symbiotic bacteria, etc. If the fraction of such reads is high, it might confuse the selection of solid kmers. Do you think that might be the case?
Secondly, your genome is 620Mb, so it already contains a significant part of all possible 15-mers (about 1G), which also might trigger some effects that we did not observe previously.
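
To put rough numbers on the second point, here is a minimal Python sketch of the arithmetic. The function name is made up for illustration, and the bound is deliberately crude: it ignores repeats, reverse complements, and sequencing errors, so it only shows how crowded the k-mer space can get, not how many distinct kmers ABruijn actually sees.

```python
def kmer_space_fraction(genome_size: int, k: int) -> float:
    """Upper bound on the fraction of all possible k-mers one strand of a genome can contain."""
    total_kmers = 4 ** k  # four nucleotides per position
    # A sequence of length G has at most G - k + 1 distinct k-mers.
    max_distinct = max(genome_size - k + 1, 0)
    return min(max_distinct / total_kmers, 1.0)


GENOME_SIZE = 620_000_000  # 620Mb, as stated in the thread

for k in (15, 17):
    fraction = kmer_space_fraction(GENOME_SIZE, k)
    print(f"k={k}: 4^k = {4 ** k:,}, upper bound on fraction covered = {fraction:.3f}")
```

With k = 15 the bound is a bit over half of the k-mer space, while with k = 17 it drops to a few percent, which is the motivation for trying a larger kmer size.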

So, I would suggest trying the following: (i) increase the kmer size to 17 (-k parameter), (ii) filter out short reads (say, shorter than 14k), (iii) check whether there is contamination in the sample.
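
For step (ii), a length filter can be written in a few lines of plain Python. This is only a sketch under assumptions: it expects FASTA input (not FASTQ or BAM), the function names are hypothetical, and the 14,000 bp cutoff is simply the value suggested above.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Tuple[str, str]],
                     min_length: int = 14_000) -> Iterator[Tuple[str, str]]:
    """Keep only reads whose sequence is at least min_length bases long."""
    for header, seq in records:
        if len(seq) >= min_length:
            yield header, seq


# Tiny in-memory demo; a real run would iterate over an open reads file instead.
demo = [">short_read", "ACGT", ">long_read", "ACGT" * 4000]
for name, seq in filter_by_length(read_fasta(demo)):
    print(f">{name} length={len(seq)}")
```

Because both functions are generators, a large reads file can be streamed line by line without loading it all into memory.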

Also, check if you are using the latest version from master (we have pushed an update about a week ago).

StefanoLonardi commented on August 27, 2024

Thanks for the quick answer. The organism is a legume. I have now filtered my reads at 14K and I've got about 27Gb of data. I am re-running ABruijn with -k 17. Regarding the contamination, I cannot be 100% sure we are contaminant free, but I removed mito DNA and chloroplast DNA. I will post progress here.

StefanoLonardi commented on August 27, 2024

I have been able to run ABruijn on CANU-corrected PacBio data (16Gb) from my dataset up to this point:

[09:01:17] INFO: Running ABruijn
[09:01:17] INFO: Assembling reads
[09:06:41] INFO: Counting kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[09:27:43] INFO: Counting kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[09:36:44] INFO: Building kmer index
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[09:59:39] INFO: Finding overlaps:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[17:07:59] INFO: Extending reads
[17:17:17] INFO: Assembled 741 contigs
[17:17:17] INFO: Generating contig sequences
[17:33:37] INFO: Polishing genome (1/2)
[17:33:47] INFO: Running BLASR

It has been running BLASR now for two days on 32 cores - any idea how long it will take?
Can I instead polish the draft assembly with Quiver?

mikolmogorov commented on August 27, 2024

Hmm, this looks strange, how big is the genome? Maybe it depends on the BLASR version, but it usually finishes in a reasonable time even for human alignments.

You can roughly estimate the completion by checking the size of the .m5 file in the output directory. At the end it should be roughly 3x the size of the reads file.

If the reads are already corrected, you can definitely try to apply Quiver right away - the draft assembly is in the "draft_assembly.fasta" file. I am not sure how it performs on non-corrected reads though.

StefanoLonardi commented on August 27, 2024

The expected genome size is 620Mb. Here is what I have in the output directory:

-rw-rw-r--  1 stelo stelo  203785934 Aug 27 17:33 abruijn.log
-rw-rw-r--  1 stelo stelo         62 Aug 27 17:33 abruijn.save
-rw-rw-r--  1 stelo stelo 4824166003 Aug 29 11:35 blasr_1.m5
-rw-rw-r--  1 stelo stelo  480953800 Aug 27 17:33 blasr_ref_1.fasta
-rw-rw-r--  1 stelo stelo  480750468 Aug 27 17:33 draft_assembly.fasta
-rw-rw-r--  1 stelo stelo  486778112 Aug 27 17:32 reads_order.fasta

It seems that blasr_1.m5 is about 10x the size of draft_assembly.fasta. I have the latest version of BLASR from GitHub.
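
The rule of thumb given earlier compares the .m5 file against the reads file (final size roughly 3x the reads), not against the draft assembly. A rough Python sketch of that estimate follows. The function name is hypothetical, the reads file size is an assumption (the listing does not show it, so the 16Gb of corrected reads is taken as about 16 GB on disk), and the remaining-time figure further assumes the .m5 file grows linearly, which may not hold.

```python
def estimate_blasr_progress(m5_bytes: int, reads_bytes: int, elapsed_hours: float,
                            expected_ratio: float = 3.0):
    """Return (fraction_done, remaining_hours) from the current .m5 size.

    expected_ratio is the rule of thumb from the thread: the final .m5 file
    is roughly 3x the size of the reads file.
    """
    expected_final = expected_ratio * reads_bytes
    fraction = min(m5_bytes / expected_final, 1.0)
    if fraction == 0:
        return 0.0, float("inf")
    # Linear extrapolation: assumes alignment output is written at a constant rate.
    remaining = elapsed_hours * (1.0 - fraction) / fraction
    return fraction, remaining


M5_BYTES = 4_824_166_003       # blasr_1.m5 from the listing above
READS_BYTES = 16_000_000_000   # ASSUMPTION: about 1 byte per base for 16Gb of reads
ELAPSED_HOURS = 42.0           # Aug 27 17:33 to Aug 29 11:35, rounded

fraction, remaining = estimate_blasr_progress(M5_BYTES, READS_BYTES, ELAPSED_HOURS)
print(f"about {fraction:.0%} done, roughly {remaining:.0f} more hours at this rate")
```

Under these assumptions the alignment would only be about a tenth of the way through, which is consistent with the impression that BLASR is running unusually slowly here.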

mikolmogorov commented on August 27, 2024

Hm, it seems that BLASR alignment goes really slow. I'll try to check if there is a significant difference in running time between BLASR versions on the datasets we have.

mikolmogorov commented on August 27, 2024

A bit of a late follow-up...

It seems that there is no significant difference between BLASR versions for the datasets we currently have. But there is at least a 2x slowdown when running on error-corrected reads compared to raw reads (ABruijn does not require prior error correction).

We also checked different BLASR parameters and found a combination that gives about a 1.5-2x speedup without affecting the result quality. The new version is in the master branch. The speed gain could be even bigger for your dataset, since it contains more complicated repeat structures.

Hope this helps. Let us know if you have any other questions.

StefanoLonardi commented on August 27, 2024

Mikhail: many thanks for the follow-up. I know that ABruijn does not require prior error correction, but I was unable to run it on my whole dataset (see my very first post). Running it on CANU-corrected data seems to be a way to remove some of the low-quality data from the dataset without compromising the ability to assemble the genome. I will definitely try ABruijn again (and compare with CANU, FALCON, and HINGE). We are still struggling to get contaminants out of the reads.
