
Comments (8)

mikolmogorov commented on August 27, 2024

Hello,

We haven't tested ABruijn on genomes that large yet (but it is definitely a direction we want to improve).

The warning basically means that ABruijn could not find enough solid kmers to represent the genome, and that the default threshold t = 2 is too low, which slows everything down.

I can currently think of two possible reasons for that. First, some fraction of the input reads might not come from the genome, but rather from contamination, symbiotic bacteria, etc. If the fraction of such reads is high, it might confuse the selection of solid kmers. Do you think that might be the case?
Second, your genome is 620Mb, so it already contains a significant part of all possible 15-mers (4^15, about 1G), which might also trigger some effects that we have not observed previously.
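To put rough numbers on the second point, here is a minimal sketch (not part of ABruijn) that computes an upper bound on the fraction of all possible k-mers a genome of a given length could contain. The function name is made up for illustration, and the bound ignores repeats and reverse complements, so the real fraction is lower.

```python
def kmer_space_fraction(genome_size: int, k: int) -> float:
    """Upper bound on the fraction of all 4**k k-mers present in a genome."""
    total_kmers = 4 ** k
    # A sequence of length G contains at most G - k + 1 distinct k-mers.
    max_distinct = max(genome_size - k + 1, 0)
    return min(max_distinct / total_kmers, 1.0)


GENOME_SIZE = 620_000_000  # 620Mb, as stated in the thread

for k in (15, 17):
    fraction = kmer_space_fraction(GENOME_SIZE, k)
    print(f"k={k}: 4^k = {4 ** k:,}, upper bound on coverage = {fraction:.1%}")
```

With k = 15 the bound is about 58% of the k-mer space, while with k = 17 it drops to about 3.6%, which is one way to see why a larger k could help here.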

So, I would suggest trying the following: (i) increase the kmer size to 17 (-k parameter), (ii) filter out short reads (say, shorter than 14k), (iii) check whether there is contamination in the sample.
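For step (ii), a length filter can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the reads are in FASTA format, "14k" means 14,000 bases, and the helper names `read_fasta` and `filter_by_length` are hypothetical, not part of ABruijn.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_length: int = 14_000
) -> Iterator[Tuple[str, str]]:
    """Keep only reads with at least min_length bases (assumed cutoff)."""
    for name, seq in records:
        if len(seq) >= min_length:
            yield name, seq


# Tiny in-memory demo with a toy cutoff of 5 bases.
demo = [">short", "ACGT", ">long", "ACGTAC", "GT"]
for name, seq in filter_by_length(read_fasta(demo), min_length=5):
    print(f">{name}\n{seq}")
```

On a real dataset the same generators can be fed an open file handle instead of the demo list, so the reads are streamed rather than loaded into memory at once.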

Also, check that you are using the latest version from master (we pushed an update about a week ago).

from flye.

StefanoLonardi commented on August 27, 2024

Thanks for the quick answer. The organism is a legume. I have now filtered my reads at 14K, which leaves about 27Gb of data, and I am re-running ABruijn with -k 17. Regarding contamination, I cannot be 100% sure we are contaminant-free, but I removed mito DNA and chloroplast DNA. I will post progress here.


StefanoLonardi commented on August 27, 2024

I have been able to run ABruijn on CANU-corrected PacBio data (16Gb) up to this point:

[09:01:17] INFO: Running ABruijn
[09:01:17] INFO: Assembling reads
[09:06:41] INFO: Counting kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[09:27:43] INFO: Counting kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[09:36:44] INFO: Building kmer index
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[09:59:39] INFO: Finding overlaps:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[17:07:59] INFO: Extending reads
[17:17:17] INFO: Assembled 741 contigs
[17:17:17] INFO: Generating contig sequences
[17:33:37] INFO: Polishing genome (1/2)
[17:33:47] INFO: Running BLASR

It has now been running BLASR for two days on 32 cores - any idea how long it will take?
Can I polish the draft assembly with Quiver instead?


mikolmogorov commented on August 27, 2024

Hmm, this looks strange, how big is the genome? Maybe it depends on the BLASR version, but it usually finishes in a reasonable time even for human alignments.

You can roughly estimate the completion by checking the size of the .m5 file in the output directory. At the end it should be roughly 3x the size of the reads file.

If the reads are already corrected, you can definitely try to apply Quiver right away - the draft assembly is in the "draft_assembly.fasta" file. I am not sure how it performs on non-corrected reads though.
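The rule of thumb above can be turned into a quick progress estimate. This is a rough sketch only: the function name is hypothetical, the 3x ratio is the approximate figure quoted above rather than a guarantee, and the byte counts in the example are illustrative placeholders.

```python
def estimate_alignment_progress(
    m5_bytes: int, reads_bytes: int, final_ratio: float = 3.0
) -> float:
    """Estimate the completed fraction of a BLASR run from file sizes.

    Assumes the finished .m5 file is about final_ratio times the size of
    the input reads file (rule of thumb from the thread, not exact).
    """
    if reads_bytes <= 0 or final_ratio <= 0:
        raise ValueError("reads_bytes and final_ratio must be positive")
    expected_final_bytes = final_ratio * reads_bytes
    # Cap at 1.0 because the real ratio may differ from the assumed one.
    return min(m5_bytes / expected_final_bytes, 1.0)


# Illustrative sizes only: a 4.8 GB .m5 file against a 16 GB reads file.
progress = estimate_alignment_progress(4_800_000_000, 16_000_000_000)
print(f"Estimated progress: {progress:.0%}")
```

In practice the two sizes could be read with `os.path.getsize` on the .m5 file and on the input reads file; comparing against the draft assembly instead of the reads file would give a misleading figure.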


StefanoLonardi commented on August 27, 2024

The expected genome size is 620Mb. Here is what I have in the output directory:

-rw-rw-r--  1 stelo stelo  203785934 Aug 27 17:33 abruijn.log
-rw-rw-r--  1 stelo stelo         62 Aug 27 17:33 abruijn.save
-rw-rw-r--  1 stelo stelo 4824166003 Aug 29 11:35 blasr_1.m5
-rw-rw-r--  1 stelo stelo  480953800 Aug 27 17:33 blasr_ref_1.fasta
-rw-rw-r--  1 stelo stelo  480750468 Aug 27 17:33 draft_assembly.fasta
-rw-rw-r--  1 stelo stelo  486778112 Aug 27 17:32 reads_order.fasta

It seems that blasr_1.m5 is about 10x the size of draft_assembly.fasta. I have the latest version of BLASR from GitHub.


mikolmogorov commented on August 27, 2024

Hm, it seems that the BLASR alignment is going really slowly. I'll check whether there is a significant difference in running time between BLASR versions on the datasets we have.


mikolmogorov commented on August 27, 2024

A somewhat late follow-up.

It seems that there is no significant difference between BLASR versions for the datasets we currently have. But there is at least a 2x slowdown when running on error-corrected reads compared to raw reads (ABruijn does not require prior error correction).

We also checked different BLASR parameters and found a combination which gives about a 1.5-2x speedup without affecting the result quality. The new version is in the master branch. The speed gain could be even bigger for your dataset, since it contains more complicated repeat structures.

Hope this helps. Let us know if you have any other questions.


StefanoLonardi commented on August 27, 2024

Mikhail: many thanks for the follow-up. I know that ABruijn does not require prior error correction, but I was unable to run it on my whole dataset (see my very first post). Running it on CANU-corrected data seems to be a way to remove some of the low-quality data in the dataset without compromising the ability to assemble the genome. I will definitely try ABruijn again (and compare with CANU, FALCON, and HINGE). We are still struggling to get contaminants out of the reads.
