

Lower contiguity with longer insert size about hifiasm HOT 18 CLOSED

chhylp123 commented on May 27, 2024

Lower contiguity with longer insert size

from hifiasm.

Comments (18)

chhylp123 commented on May 27, 2024

Like this: hifiasm -o hifiasm.asm --write-ec input.fq


kevfengler227 commented on May 27, 2024

I wish I had recorded that talk more recently. I have learned a lot more since then. Like you say, there are a lot of nuances to HiFi that can make a big difference. Increasing the minimum predicted accuracy when there is excess coverage only seems to help with HiCanu. I have not been doing that with hifiasm. Having a library with a tail of long reads negatively impacts the yield too, with fewer reads having enough passes. So I really have been pressing the lab to make the sizing as tight as possible around ~17 kb. This seems to be the sweet spot for accuracy, throughput and assembly for many plant genomes. Surprisingly, I don't think that 40-50 kb HiFi reads are needed in most cases. I am working on resolving a 100 kb tandem duplication with just a few SNP differences, but I am going to try to supplement with a ~60 kb CLR library.


chhylp123 commented on May 27, 2024

Thanks for the extensive experiments. I personally think we had better check whether the lower contiguity is caused by insufficient coverage or by low read accuracy. You can map r_utg/p_ctg/reads to GRCh38 and check all the break points. We observed some coverage drops on other HiFi datasets. Besides, for a diploid genome, 16/17-fold coverage might be a little low, since the coverage for each haplotype is then only about 8.
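The per-haplotype figure above is simple arithmetic, sketched here under the assumption that reads are split evenly between haplotypes. The function name is made up for illustration and is not part of hifiasm.

```python
def per_haplotype_coverage(total_coverage: float, ploidy: int = 2) -> float:
    """Average coverage per haplotype, assuming reads split evenly across haplotypes."""
    if ploidy < 1:
        raise ValueError("ploidy must be at least 1")
    return total_coverage / ploidy


# The two coverage levels mentioned in the thread, for a diploid genome.
for total in (16, 17):
    print(total, per_haplotype_coverage(total))
```

Real data will deviate from the even split, so this is only a rough lower-resolution guide to whether coverage is a plausible cause.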

BTW, NGA50 for HiFi assemblies is tricky. HiFi assemblers like hifiasm and HiCanu can reconstruct hard regions. When you align contigs to GRCh38, the contig alignments usually fail in these regions. Minigraph might be a better choice since it tends to generate longer chains, but it still doesn't fully reflect the contiguity of HiFi assemblies.


proteinosome commented on May 27, 2024

Thanks @chhylp123 for your comments! I will try to investigate the contig breaks. However, if coverage is the problem here, the 15kb libraries should also have low contiguity, right? The 15kb libraries are only 1X higher in coverage but have almost double the contiguity of the 20kb libraries.

Thanks for the suggestion on minigraph too! I will try it out.


chhylp123 commented on May 27, 2024

That's just my guess, and it might be wrong. If all attributes of the 20kb and 15kb libraries are the same except length, I agree that the 20kb assemblies should be better. But I worry that the 20kb and 15kb cells might have been generated with different chemistry, so some other things changed. Of course, another possibility is that hifiasm has issues. We have tested hifiasm a lot on HG002 mixed 20kb/15kb libraries, but we haven't performed the experiments you have done. Thanks a lot.


proteinosome commented on May 27, 2024

Thanks! They are actually using the same chemistry 2.0 from what I understand from the GIAB description:

https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA586863

I tried using IPA to assemble both the 15kb and 20kb libraries, and the same phenomenon happens with IPA (lower contiguity for the 20kb library), so I don't think it's an issue with hifiasm. Is it possible that there's a problem with assembling reads of slightly lower accuracy?


chhylp123 commented on May 27, 2024

I have heard that HiFi reads have a tradeoff between length and quality, but I forget what quality means there... Again, I may be wrong. So I still recommend that you check some break points first. As for read accuracy, maybe you can evaluate the QV of the corrected reads generated by hifiasm. I guess the 15kb and 20kb corrected reads should be comparable in terms of QV.
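For reference, QV here is taken to be the usual Phred-scaled error rate, QV = -10 * log10(error rate). A minimal sketch of the conversion in both directions follows. The helper names are illustrative, and how the error rate itself is measured (for example, from alignments) is left as an assumption.

```python
import math


def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error rate in (0, 1]."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return -10.0 * math.log10(error_rate)


def qv_to_accuracy(qv: float) -> float:
    """Per-base accuracy implied by a Phred-scaled quality value."""
    return 1.0 - 10.0 ** (-qv / 10.0)


# Example: the QV implied by 99.4% accuracy (error rate 0.006).
print(round(error_rate_to_qv(0.006), 2))
```

With this scale, comparing the 15kb and 20kb corrected reads comes down to comparing their measured error rates on a log scale.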


HenrivdGeest commented on May 27, 2024

The simplest thing is to map the error-corrected reads from hifiasm to the reference. The 20kb library reads might contain more errors, which I think would hamper the overlapping in hifiasm.

On Wed, 24 Jun 2020, 17:17 chhylp123 wrote: I have heard that HiFi reads have tradeoff between length and quality, but forget what does quality mean... Again probably I was wrong. So I still recommend you to check some break points first. And for read accuracy, maybe you can evaluate the QV of corrected reads generated by hifiasm. I guess 15kb and 20kb corrected reads should be comparable in terms of QV.


proteinosome commented on May 27, 2024

Yeah, I can try that! I might be missing something obvious, but how do I convert the "ec.bin" file to FASTA for mapping?

Thank you!


kevfengler227 commented on May 27, 2024

There is definitely a trade-off between accuracy and read length, and from my testing accuracy is more important. Our biggest issue with HiFi is consistently getting a tight peak at ~17 kb; if we have a tail of longer reads, the accuracy and the assemblies suffer.

Indeed, aligning the HiFi reads, before correction, to the assembly and computing the accuracy is the best way to do this.


kevfengler227 commented on May 27, 2024

or here is data from the same cell, got a much better assembly with the small fraction than the large fraction


HenrivdGeest commented on May 27, 2024

Is the result above also applicable to big plant genomes, i.e. do you see any scenario where the longer fragments are preferred?

On Wed, 24 Jun 2020, 19:20 kevfengler227 wrote: or here is data from the same cell, got a much better assembly with the small fraction than the large fraction [image: image] <https://user-images.githubusercontent.com/20604003/85602500-0dc7fc80-b615-11ea-8e3d-5cb7809f1430.png>


kevfengler227 commented on May 27, 2024

Well, how big are we talking about? The 11 Gb oat genome I referenced in the other thread had a nice tight band at 17 kb and yielded a 30 Mb (HiCanu) or 70 Mb (hifiasm) contig N50. I see the same with cotton, maize, etc. What are the chances that a rare 30 kb HiFi read is going to land on the exact spot you need it? HiFi dropouts in CT repeats are a much bigger issue. That is what prevents gapless chromosome-length contigs in plants more than anything else.
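For anyone comparing the contiguity numbers quoted in this thread, contig N50 is assumed here to follow the standard definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch with toy lengths (not real assembly data):

```python
def n50(lengths):
    """Contig N50: length at which the running sum of contigs,
    sorted longest first, reaches at least half of the total length."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example with made-up contig lengths.
print(n50([10, 8, 6, 4, 2]))
```

NGA50 differs in that it uses aligned blocks and the reference genome length, so the two values are not directly interchangeable.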


proteinosome commented on May 27, 2024

Thanks @kevfengler227 for chipping in! I watched your SMRT Leiden talk and tried one of your suggestions, keeping only reads above 99.4% accuracy, but it didn't improve contiguity in my case, presumably because coverage is already quite limited at 17-fold.

If the long tail is the issue, then in theory, when coverage is high enough, we should be able to computationally sample from the reads to obtain a tight quality distribution and perhaps use those for a draft assembly? I wonder then if there's any way to bring those very long (40-50kb) reads back in to help with the assembly in complex regions.
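A rough sketch of what that computational selection could look like: keep only reads inside a length window and above a minimum accuracy. The thresholds, the function names, and the use of Phred+33 base qualities as a stand-in for the predicted read accuracy are all assumptions made for illustration; none of this comes from hifiasm or from the talk.

```python
def mean_accuracy(qual: str, offset: int = 33) -> float:
    """Mean per-base accuracy implied by a Phred+33 quality string (an approximation)."""
    if not qual:
        return 0.0
    errors = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return 1.0 - sum(errors) / len(errors)


def select_reads(records, min_len, max_len, min_accuracy=0.0):
    """Yield (name, seq, qual) records inside the length window and above min_accuracy."""
    for name, seq, qual in records:
        if min_len <= len(seq) <= max_len and mean_accuracy(qual) >= min_accuracy:
            yield name, seq, qual


# Toy records; real HiFi reads would be thousands of bases long.
toy = [
    ("a", "ACGT", "IIII"),          # too short for the window below
    ("b", "ACGTACGT", "IIIIIIII"),  # in window, high quality
    ("c", "ACGTAC", "++++++"),      # in window, low quality
]
kept = [name for name, _, _ in select_reads(toy, 5, 10, min_accuracy=0.99)]
print(kept)
```

Whether the selected subset still has enough coverage per haplotype would need to be checked before assembling it.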

It appears that there are still a lot of nuances in HiFi assembly to be discovered!


proteinosome commented on May 27, 2024

A bit of an update: I took the error-corrected reads, mapped them to hg38, and selected reads with a minimum mapping quality of 20 (a minimum of 1 gives similar results). The "error" profile (not really error, because I calculated it as NM/read length) looks similar, with 15kb having slightly better accuracy. The question is whether this slightly higher accuracy really makes it so much better, and whether there is any way to optimize this by allowing a slightly higher error rate in the overlap process.
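The NM/read length calculation described above can be sketched roughly as follows, assuming the alignments are available as SAM text lines carrying an NM:i tag. The function name and the choice to skip unmapped, secondary, and supplementary records are assumptions for illustration. As noted, the ratio also counts true differences from the reference, so it overestimates read error.

```python
def nm_error_rates(sam_lines, min_mapq=20):
    """Return NM / read length for primary alignments with MAPQ >= min_mapq."""
    rates = []
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue
        flag, mapq, seq = int(fields[1]), int(fields[4]), fields[9]
        # Skip unmapped (0x4), secondary (0x100), and supplementary (0x800) records.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        if mapq < min_mapq or seq == "*":
            continue
        nm = None
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                nm = int(tag[5:])
                break
        if nm is None:
            continue
        rates.append(nm / len(seq))
    return rates


# Toy SAM records (made up): one passing read, one low-MAPQ read, one unmapped read.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:1",
    "r2\t0\tchr1\t200\t5\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:3",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
print(nm_error_rates(toy_sam))
```

Running it separately on the 15kb and 20kb alignments gives the two distributions to compare.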

I shall continue to see if there are any interesting developments or observations coming up!


HenrivdGeest commented on May 27, 2024

How about the mapped reads at the spots where you have contig breakage in the 20kb assembly? Can you look at those alignments visually? I would expect a locally elevated error rate or a simple drop in coverage.
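Besides looking at the alignments by eye, the coverage check could be sketched numerically: take the read alignment intervals on one reference sequence and compute the mean depth in a window around each break point. The half-open interval convention, the flank size, and the function names are assumptions for illustration.

```python
def mean_depth(intervals, win_start, win_end):
    """Mean per-base depth over [win_start, win_end) from half-open (start, end) intervals."""
    if win_end <= win_start:
        raise ValueError("window must have positive length")
    covered = 0
    for start, end in intervals:
        overlap = min(end, win_end) - max(start, win_start)
        if overlap > 0:
            covered += overlap
    return covered / (win_end - win_start)


def depth_around_breakpoints(intervals, breakpoints, flank=5000):
    """Mean depth in a window of +/- flank bases around each break point."""
    return {
        bp: mean_depth(intervals, max(0, bp - flank), bp + flank)
        for bp in breakpoints
    }


# Toy alignment intervals on a single reference sequence (made up).
toy_intervals = [(0, 100), (50, 150), (120, 200)]
print(depth_around_breakpoints(toy_intervals, [100], flank=50))
```

A window whose depth is well below the genome-wide average would point to a coverage drop rather than a read accuracy problem.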

On Wed, 1 Jul 2020, 14:55 proteinosome wrote: A little bit of update, I took the error corrected reads, map it to hg38, and selected reads with minimum mapping quality of 20 (minimum 1 gives similar results) and it seems like the "error" (not really error because I calculated it as NM/read length) profile looks similar, with 15kb having slightly better accuracy. Question is does this slightly higher accuracy really makes it so much better and if there's any way to optimize this by allowing slightly higher error rate in the overlap process? [image: plot_accuracy] <https://user-images.githubusercontent.com/13174349/86246146-1acc8900-bbdd-11ea-9256-bd536b94dc5b.png> [image: hex_plot_rl_qual] <https://user-images.githubusercontent.com/13174349/86244576-c58f7800-bbda-11ea-8142-ddde2bfd7778.png> Shall continue to see if there' any interesting developments or observations coming up!


chhylp123 commented on May 27, 2024

Unlike other assemblers, hifiasm mainly relies on phasing instead of a stringent sequence similarity threshold, so it should be able to tolerate slightly lower accuracy. May I ask whether you have checked the coverage at the break points? Of course, there might be some other reasons. I will check hifiasm again on the 20kb datasets. Supporting low-quality datasets is a key feature for assemblers. Thanks for letting us know about this dataset for debugging.


proteinosome commented on May 27, 2024

Sorry, I haven't had the time to investigate the break points yet, but it's on my to-do list. Will update once I do! @chhylp123 Good to hear you're gonna look into this, do let us know your findings, thank you!!

