Giter Club home page Giter Club logo

Comments (5)

AndreaGuarracino avatar AndreaGuarracino commented on August 11, 2024 1

Regarding question 3, we are working on improving the performance with lower identity thresholds. The bottleneck is the mapping phase, where the aligner (wfmash) has to find the homology map between all input sequences, given the input segment length and estimated identity threshold. Preliminary results are quite hot. Stay tuned, something could pop up soon (weeks / a very few months),

from pggb.

AndreaGuarracino avatar AndreaGuarracino commented on August 11, 2024

About questions 1 and 2, seqwish must always emit P lines. If you can somehow share your input FASTA file and PAF file, we could check what is triggering this buggy behavior of seqwish.

Regarding the runtime problem in question 3, the identity threshold is the problem here (-p 85). We don't have very good alternatives to improve performance in this regard yet. For now, I would suggest exploring your dataset with a bit higher identity thresholds. Of note, using -p 90, wfmash will catch also mappings with a bit lower estimated identity, so it could work in your case. As for the segment length, I would suggest using 50k (or at most 100k if there are too big performance issues).

from pggb.

sivico26 avatar sivico26 commented on August 11, 2024

Thanks for asnwering @AndreaGuarracino

About questions 1 and 2, seqwish must always emit P lines. If you can somehow share your input FASTA file and PAF file, we could check what is triggering this buggy behavior of seqwish.

Unfortunately, I am on the receiving end of a consortium that generated the data, and I am not sure how tight are the sharing policies. if you are interested in hunting down this bug, I can ask and let you know.

Regarding the runtime problem in question 3, the identity threshold is the problem here (-p 85). We don't have very good alternatives to improve performance in this regard yet. For now, I would suggest exploring your dataset with a bit higher identity thresholds. Of note, using -p 90, wfmash will catch also mappings with a bit lower estimated identity, so it could work in your case. As for the segment length, I would suggest using 50k (or at most 100k if there are too big performance issues).

I see. I might give it a try but basically depends on the following: I have a question regarding the relationship between the parameters and the sensitivity/accuracy of the alignments found by wfmash. In your mind, being everything else equal (input data and the other parameters), would you expect the alignments recovered with -p 90 to be a subset of the alignments recovered with -p 85? As far as I understand, this is the main difference (besides/at the expense of performance).

I am asking because I am already having severe under-alignment issues even at -p 80 for my data. So, unless for some reason I should expect higher sensitivity at, let's say, -p 90 than what I am already getting at -p 80, I see no point in trying to tweak performance if the results are not there in the first place.

Funny enough, I also did some tests using minimap2 and it seems to yield good alignments (sensitive enough), but as @ekg pointed out in other issues (e.g. here), the current minimap2 implementation is basically unviable for this application performance-wise (time-wise might work, but excessive memory consumption). I can expand on this under-alignment problem for high-divergence cases (in my tests ~8%) in another Issue if you are interested.

Regards
Sivico

from pggb.

AndreaGuarracino avatar AndreaGuarracino commented on August 11, 2024

Hi @sivico26, I don't think the alignments recovered with -p 90 would be exactly a subset of the ones recovered with -p 85, but I would expect a strong overlap between the two sets of alignments. Perhaps, have you already checked that?

from pggb.

sivico26 avatar sivico26 commented on August 11, 2024

Hi @AndreaGuarracino,

I have not run the set at -p 90. And I think I won't. Simply because with -p 85 I am seeing that only 1% of the bases between two homolog chromosomes are aligning. So, based on your expectation of strong overlap (which I shared), it seems pointless to run the set -p 90 if it would yield around the same 1%.

from pggb.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.