chhylp123 commented on May 27, 2024

Like this: hifiasm -o hifiasm.asm --write-ec input.fq

from hifiasm.

kevfengler227 commented on May 27, 2024

I wish I had recorded that talk more recently; I have learned a lot more since then. As you say, there are a lot of nuances to HiFi that can make a big difference. Increasing the minimum predicted accuracy when there is excess coverage only seems to help with HiCanu; I have not been doing that with hifiasm. Having a library with a tail of long reads negatively impacts the yield too, with fewer reads having enough passes. So I really have been pressing the lab to make the sizing as tight as possible around ~17 kb. This seems to be the sweet spot for accuracy, throughput and assembly for many plant genomes. Surprisingly, I don't think that 40-50 kb HiFi reads are needed in most cases. I am working on resolving a 100 kb tandem duplication with just a few SNP differences, but I am going to try to supplement with a ~60 kb CLR library.

chhylp123 commented on May 27, 2024

Thanks for the extensive experiments. I personally think we should first determine whether the lower contiguity is caused by insufficient coverage or by low read accuracy. You can map r_utg/p_ctg/reads to GRCh38 and check all the break points. We have observed coverage drops on other HiFi datasets. Besides, for a diploid genome, 16- or 17-fold coverage might be a little low, since the coverage for each haplotype is then only about 8-fold.

BTW, NGA50 for HiFi assemblies is tricky. HiFi assemblers like hifiasm and HiCanu can reconstruct hard regions. When you align contigs to GRCh38, the contig alignments usually fail in these regions. Minigraph might be a better choice since it tends to generate longer chains, but it still doesn't fully reflect the contiguity of HiFi assemblies.
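
The per-haplotype coverage arithmetic mentioned above can be written out as a short Python sketch. The function names are illustrative assumptions, not part of hifiasm; only the 16- and 17-fold figures come from this thread.

```python
def fold_coverage(total_bases, haploid_genome_size):
    """Fold coverage = sequenced bases divided by haploid genome size."""
    return total_bases / haploid_genome_size


def per_haplotype_coverage(coverage, ploidy=2):
    """Average coverage each haplotype receives, assuming reads are
    sampled evenly from all haplotypes."""
    return coverage / ploidy


# Figures quoted in the thread: 16- or 17-fold total coverage, diploid genome.
for total in (16, 17):
    print(total, per_haplotype_coverage(total))
```

Under the even-sampling assumption, both cases land at roughly 8-fold per haplotype, which is the number the comment refers to.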

proteinosome commented on May 27, 2024

Thanks @chhylp123 for your comments! I will try to investigate the contig breaks. However, if coverage is the problem here, the 15kb libraries should also have low contiguity, right? The 15kb libraries are only 1X higher in coverage but have almost double the contiguity of the 20kb libraries.

Thanks for the suggestion on minigraph too! I will try it out.

chhylp123 commented on May 27, 2024

That's just my guess; I might be wrong. If all attributes of the 20kb and 15kb libraries are the same except length, I agree that the 20kb assemblies should be better. But I worry that the 20kb cells and 15kb cells might have been generated with different chemistry, so some other things changed. Of course, another possibility is that hifiasm has issues. We have tested hifiasm a lot on HG002 mixed 20kb/15kb libraries, but we haven't performed the experiments you have done. Thanks a lot.

proteinosome commented on May 27, 2024

Thanks! They are actually using the same chemistry 2.0 from what I understand from the GIAB description:

https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA586863

I tried using IPA to assemble both the 15kb and 20kb libraries, and the same phenomenon happens with IPA (lower contiguity for the 20kb library). So I don't think it's an issue with hifiasm, but is it possible that there's something up with assembling reads with slightly lower accuracy?

chhylp123 commented on May 27, 2024

I have heard that HiFi reads have a tradeoff between length and quality, but I forget what quality means there... Again, I may be wrong. So I still recommend checking some break points first. As for read accuracy, maybe you can evaluate the QV of the corrected reads generated by hifiasm. I guess the 15kb and 20kb corrected reads should be comparable in terms of QV.

HenrivdGeest commented on May 27, 2024

proteinosome commented on May 27, 2024

Yeah I can try that! I might be missing something obvious, but how do I convert the "ec.bin" file to FASTA for mapping?

Thank you!

kevfengler227 commented on May 27, 2024

There is definitely a trade-off between accuracy and read length, and from my testing accuracy is more important. Our biggest issue with HiFi is consistently getting a tight peak at ~17 kb; if we have a tail of longer reads, the accuracy and the assemblies suffer.

Indeed, aligning the HiFi reads, before correction, to the assembly and computing the accuracy is the best way to do this.
image

kevfengler227 commented on May 27, 2024

Or here is data from the same cell: I got a much better assembly with the small size fraction than with the large one.

image

HenrivdGeest commented on May 27, 2024

kevfengler227 commented on May 27, 2024

Well, how big are we talking about? The 11 Gb oat genome I referenced in the other thread had a nice tight band at 17 kb and yielded a 30 Mb (HiCanu) or 70 Mb (hifiasm) contig N50. I see the same with cotton, maize, etc. What are the chances that a rare 30 kb HiFi read is going to land on the exact spot where you need it? HiFi dropouts in CT repeats are a much bigger issue. That is what prevents gapless chromosome-length contigs in plants more than anything else.

proteinosome commented on May 27, 2024

Thanks @kevfengler227 for chipping in! I watched your SMRT Leiden talk and tried one of your suggestions, keeping only reads above 99.4% accuracy, but it didn't improve contiguity in my case, presumably because coverage is already quite limited at 17-fold.

If the long tail is the issue, then theoretically, when coverage is high enough, we should be able to computationally sample from the reads to obtain a tight quality distribution and perhaps use those for a draft assembly? I wonder then if there's any way to incorporate those very long (40-50kb) reads back in to help with the assembly in complex regions.

It appears that there are still a lot of nuances in HiFi assembly to be discovered!
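
The computational sampling idea above could be sketched in Python roughly as follows. This is a minimal illustration, not a tested pipeline: it assumes a four-line FASTQ with Phred+33 qualities, estimates each read's accuracy from its per-base qualities (which may differ from the predicted accuracy reported by the instrument software), and uses hypothetical function names and thresholds.

```python
def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def predicted_accuracy(qual, offset=33):
    """1 minus the mean per-base error probability (Phred+33 assumed)."""
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return 1.0 - sum(probs) / len(probs)


def select_reads(records, min_len, max_len, min_accuracy):
    """Keep reads inside a length window and above an accuracy cutoff."""
    for name, seq, qual in records:
        if not seq:
            continue  # skip empty records
        if min_len <= len(seq) <= max_len and predicted_accuracy(qual) >= min_accuracy:
            yield name, seq, qual


def write_fastq(records, handle):
    """Write (name, sequence, quality) tuples back out as FASTQ."""
    for name, seq, qual in records:
        handle.write(f"@{name}\n{seq}\n+\n{qual}\n")
```

A call such as `select_reads(read_fastq(f), 15000, 20000, 0.994)` would keep a tight band of reads; the window and the 99.4% cutoff are placeholders to be tuned against the coverage that remains after filtering.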

proteinosome commented on May 27, 2024

A little update: I took the error-corrected reads, mapped them to hg38, and selected reads with a minimum mapping quality of 20 (a minimum of 1 gives similar results). The "error" profile (not really error, because I calculated it as NM/read length) looks similar, with 15kb having slightly better accuracy. The question is whether this slightly higher accuracy really makes it so much better, and whether there's any way to optimize this by allowing a slightly higher error rate in the overlap process.

plot_accuracy
hex_plot_rl_qual

I shall continue to see if there are any interesting developments or observations coming up!
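
For reference, the NM/read length calculation described above could look something like this Python sketch. It is an assumption about how such a metric might be computed, not the script actually used here: it reads plain SAM text, keeps primary alignments at or above a mapping-quality cutoff, and divides the `NM` tag by the length of the sequence stored in the record. The function name is hypothetical.

```python
def nm_rate_per_read(sam_lines, min_mapq=20):
    """Yield (read_name, read_length, NM / read_length) per primary alignment."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, mapq, seq = int(fields[1]), int(fields[4]), fields[9]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800).
        if flag & 0x904 or mapq < min_mapq or seq == "*":
            continue
        nm = None
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                nm = int(tag[5:])
                break
        if nm is None:
            continue  # aligner did not emit an NM tag
        yield fields[0], len(seq), nm / len(seq)
```

Note that NM counts mismatches plus inserted and deleted bases, so it also picks up real variants between the sample and hg38, which is why it is only a proxy for read error. Hard-clipped records would shorten the stored sequence, so the sketch is best applied to soft-clipped primary alignments.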

HenrivdGeest commented on May 27, 2024

chhylp123 commented on May 27, 2024

Unlike other assemblers, hifiasm mainly relies on phasing instead of a stringent sequence similarity threshold, so it should be able to tolerate slightly lower accuracy. May I ask whether you have checked the coverage at the break points? Of course there might be other reasons. I will check hifiasm again on the 20kb datasets. Supporting lower-quality datasets is a key feature for assemblers. Thanks for letting us know about this dataset for debugging.

proteinosome commented on May 27, 2024

Sorry I haven't gotten the time to investigate the breakpoints but it's on my to-do as well. Will update once I do! @chhylp123 Good to hear you're gonna look into this, do let us know your findings, thank you!!
