Hi!
I'm super impressed with the speed of Shasta, and I'm taking advantage of it to explore parameters that maximize total assembly size and the longest assembled segment for ONT and PacBio data sets from different marine invertebrates, where read lengths are skewed to <10 kb.
For ONT:
I'm wondering if you could recommend Shasta parameters to explore for working with ONT data sets that are skewed to shorter read lengths of 500-10000 bp.
I am assembling ~60x coverage of PromethION reads (~15 million reads) from the pygmy squid (Idiosepius paradoxus), whose genome is estimated at 2-2.2 Gb. Unfortunately, sequencing ran into pore clogging issues unless the DNA was sheared and size selected, so most of our reads are shorter than 10 kb (~65% of reads and ~33% of bases) and many are even under 1 kb (~20% of reads and ~0.5% of bases). I read that the default settings in Shasta are optimized for read lengths >10 kb, so I'm interested in exploring different parameters to optimize Shasta for my data set.
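For reference, the read and base fractions quoted above can be computed along these lines. This is only a minimal sketch: the function name `fraction_below` and the toy read lengths are hypothetical, and in practice the lengths would be taken from the actual FASTQ/FASTA input.

```python
def fraction_below(read_lengths, cutoff):
    """Return (fraction of reads, fraction of bases) found in reads shorter than cutoff."""
    total_reads = len(read_lengths)
    total_bases = sum(read_lengths)
    if total_reads == 0 or total_bases == 0:
        return 0.0, 0.0
    short = [n for n in read_lengths if n < cutoff]
    return len(short) / total_reads, sum(short) / total_bases


# Toy lengths for illustration only, not the real data set.
toy_lengths = [500, 800, 3000, 7000, 12000, 25000]
for cutoff in (1000, 10000):
    read_frac, base_frac = fraction_below(toy_lengths, cutoff)
    print(f"<{cutoff} bp: {read_frac:.0%} of reads, {base_frac:.1%} of bases")
```

The same function gives the share of reads that a given `--Reads.minReadLength` cutoff would reject, since reads shorter than the cutoff are the ones being dropped.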
To begin with, I tested minimum read length cutoffs (--Reads.minReadLength) from 500 to 10000 bp. The results were interesting in that including more short reads didn't always help: both the total assembly length and the longest assembled segment peaked at a cutoff of 6000 bp (see below). This is promising, but at a 6000 bp cutoff over 45% of the reads are rejected, so I'd like to optimize other parameters to see if I can improve things and use more of the reads.
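A sweep like the one described above can be scripted roughly as follows. This is a sketch under assumptions: the executable name `shasta`, the input file `reads.fastq`, the output directory prefix, and the intermediate cutoff values are placeholders rather than the exact runs I did, and newer Shasta releases may also require a `--config` argument that is not shown here.

```python
import shlex


def shasta_sweep_commands(reads_path, cutoffs, out_prefix="shasta_minlen"):
    """Build one Shasta command line per minimum read length cutoff."""
    commands = []
    for cutoff in cutoffs:
        commands.append([
            "shasta",  # assumed executable name; release binaries are often versioned
            "--input", reads_path,
            "--assemblyDirectory", f"{out_prefix}_{cutoff}",  # one directory per run
            "--Reads.minReadLength", str(cutoff),
        ])
    return commands


# Placeholder cutoffs spanning the 500-10000 bp range.
for cmd in shasta_sweep_commands("reads.fastq", [500, 1000, 2000, 4000, 6000, 8000, 10000]):
    print(shlex.join(cmd))
```

The commands are only printed here so they can be inspected or submitted to a scheduler; each list could instead be passed to `subprocess.run` to launch the assemblies directly.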
Also - here is the Assembly Summary for a 1000 bp cutoff, as an example.
Given my ONT read lengths skewed to smaller sizes, I'm unsure what additional Shasta parameters I should be exploring and was wondering if you have any suggestions.
Also, are there reasons to reject short ONT reads outright? For instance, do they generally have higher rates of sequencing error, or are they otherwise viewed as poor quality and best avoided in assembly?
For PacBio:
I saw that the use of Shasta with PacBio data was discussed and closed for now in Issue #56. I can move my PacBio questions to that thread, no problem - just let me know.
I'm preparing to run a Shasta assembly of a deeply sequenced PacBio data set from a ctenophore (Bolinopsis species) with an estimated genome size of 200-300 Mb. Like my ONT data set above, the PacBio reads skew to under 10 kb, at least in comparison to the typical human ONT data sets that Shasta is optimized for.
I plan to follow the Issue #56 recommendations to explore minimum read length and k-mer length, and to use the Modal consensus caller for repeat counts. Are there other Shasta parameters that might be useful to explore for PacBio data sets, and/or do you now have recommended Shasta settings for PacBio?
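The PacBio exploration I have in mind would be a small grid over minimum read length and k-mer length with the Modal caller fixed, roughly as sketched below. The specific cutoffs and k values, the input file name, and the directory naming are hypothetical placeholders, not recommendations from Issue #56, and newer Shasta releases may also require a `--config` argument that is not shown here.

```python
from itertools import product


def pacbio_grid(reads_path, min_lengths, kmer_lengths):
    """Build Shasta command lines for every (min read length, k) combination."""
    runs = []
    for min_len, k in product(min_lengths, kmer_lengths):
        runs.append([
            "shasta",  # assumed executable name
            "--input", reads_path,
            "--assemblyDirectory", f"pacbio_min{min_len}_k{k}",
            "--Reads.minReadLength", str(min_len),
            "--Kmers.k", str(k),
            "--Assembly.consensusCaller", "Modal",
        ])
    return runs


# Hypothetical grid values, chosen only to illustrate the layout of the runs.
for run in pacbio_grid("pacbio_reads.fasta", [2000, 4000, 6000], [10, 12, 14]):
    print(" ".join(run))
```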
I'm happy to update on parameter exploration as things progress on both the ONT and PacBio data sets above, if there is interest.