GFA with no P lines (pggb issue, closed)

sivico26 commented on August 11, 2024

from pggb.

Comments (5)

AndreaGuarracino commented on August 11, 2024

Regarding question 3, we are working on improving performance at lower identity thresholds. The bottleneck is the mapping phase, where the aligner (wfmash) has to find the homology map between all input sequences, given the input segment length and estimated identity threshold. Preliminary results are quite promising. Stay tuned: something could appear soon (within weeks or a few months).

AndreaGuarracino commented on August 11, 2024

About questions 1 and 2, seqwish must always emit P lines. If you can somehow share your input FASTA file and PAF file, we could check what is triggering this buggy behavior of seqwish.
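Before sharing the files, a quick sanity check is to count the record types in the GFA and confirm that P lines are really absent. The sketch below assumes a plain-text GFA 1.x file in which each record starts with a one-letter type field (H, S, L, P, ...); the function name and the in-memory example graph are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def count_gfa_record_types(lines: Iterable[str]) -> Counter:
    """Count GFA records by their first tab-separated field (H, S, L, P, ...)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        counts[line.split("\t", 1)[0]] += 1
    return counts


# Tiny in-memory GFA, used only for illustration (not real pggb output).
example_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tTTGA",
    "L\t1\t+\t2\t+\t0M",
    "P\tsample#1#chr1\t1+,2+\t*",
]

counts = count_gfa_record_types(example_gfa)
print(dict(counts))
print("has P lines:", counts["P"] > 0)
```

For a real graph, the same function can be fed an open file handle, e.g. `with open("graph.gfa") as fh: count_gfa_record_types(fh)`, where `graph.gfa` is a placeholder path.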

Regarding the runtime problem in question 3, the identity threshold (-p 85) is the problem here. We don't have very good alternatives to improve performance in this regard yet. For now, I would suggest exploring your dataset with slightly higher identity thresholds. Of note, with -p 90, wfmash will also catch mappings with a slightly lower estimated identity, so it could work in your case. As for the segment length, I would suggest using 50k (or at most 100k if the performance issues are too severe).

sivico26 commented on August 11, 2024

Thanks for answering, @AndreaGuarracino.

> About questions 1 and 2, seqwish must always emit P lines. If you can somehow share your input FASTA file and PAF file, we could check what is triggering this buggy behavior of seqwish.

Unfortunately, I am on the receiving end of a consortium that generated the data, and I am not sure how strict the sharing policies are. If you are interested in hunting down this bug, I can ask and let you know.

> Regarding the runtime problem in question 3, the identity threshold (-p 85) is the problem here. We don't have very good alternatives to improve performance in this regard yet. For now, I would suggest exploring your dataset with slightly higher identity thresholds. Of note, with -p 90, wfmash will also catch mappings with a slightly lower estimated identity, so it could work in your case. As for the segment length, I would suggest using 50k (or at most 100k if the performance issues are too severe).

I see. I might give it a try, but it basically depends on the following. I have a question about the relationship between the parameters and the sensitivity/accuracy of the alignments found by wfmash. In your view, everything else being equal (input data and the other parameters), would you expect the alignments recovered with -p 90 to be a subset of the alignments recovered with -p 85? As far as I understand, this is the main difference (besides, or at the expense of, performance).

I am asking because I am already having severe under-alignment issues even at -p 80 for my data. So, unless for some reason I should expect higher sensitivity at, let's say, -p 90 than what I am already getting at -p 80, I see no point in trying to tweak performance if the results are not there in the first place.

Funnily enough, I also ran some tests using minimap2, and it seems to yield good (sufficiently sensitive) alignments, but as @ekg pointed out in other issues (e.g. here), the current minimap2 implementation is basically unviable for this application performance-wise (the runtime might be acceptable, but the memory consumption is excessive). I can expand on this under-alignment problem for high-divergence cases (~8% in my tests) in another issue if you are interested.

Regards
Sivico

AndreaGuarracino commented on August 11, 2024

Hi @sivico26, I don't think the alignments recovered with -p 90 would be exactly a subset of the ones recovered with -p 85, but I would expect a strong overlap between the two sets of alignments. Have you already checked that?
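One way such an overlap check could be done, as a rough sketch only: take the PAF output of the two runs, merge the aligned query intervals per (query, target) pair, and count how many bases are covered in both. This assumes standard 12-column PAF and compares query coordinates only, which is a simplification; the function names and the toy records are hypothetical.

```python
from collections import defaultdict


def merged_query_intervals(paf_lines):
    """Merge aligned query intervals per (query, target) pair from PAF lines."""
    raw = defaultdict(list)
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # not a complete PAF record
        raw[(f[0], f[5])].append((int(f[2]), int(f[3])))
    merged = {}
    for key, ivs in raw.items():
        ivs.sort()
        out = [list(ivs[0])]
        for s, e in ivs[1:]:
            if s <= out[-1][1]:
                out[-1][1] = max(out[-1][1], e)
            else:
                out.append([s, e])
        merged[key] = [tuple(x) for x in out]
    return merged


def total_bases(intervals_by_pair):
    """Total number of query bases covered across all pairs."""
    return sum(e - s for ivs in intervals_by_pair.values() for s, e in ivs)


def shared_bases(a, b):
    """Query bases covered in both interval sets (two-pointer sweep per pair)."""
    shared = 0
    for key in a.keys() & b.keys():
        x, y = a[key], b[key]
        i = j = 0
        while i < len(x) and j < len(y):
            lo = max(x[i][0], y[j][0])
            hi = min(x[i][1], y[j][1])
            if lo < hi:
                shared += hi - lo
            if x[i][1] < y[j][1]:
                i += 1
            else:
                j += 1
    return shared


# Toy PAF records standing in for the -p 85 and -p 90 runs (made-up values).
paf_p85 = [
    "q\t1000\t0\t500\t+\tt\t1000\t0\t500\t450\t500\t60",
    "q\t1000\t600\t800\t+\tt\t1000\t600\t800\t180\t200\t60",
]
paf_p90 = [
    "q\t1000\t100\t400\t+\tt\t1000\t100\t400\t290\t300\t60",
    "q\t1000\t850\t900\t+\tt\t1000\t850\t900\t48\t50\t60",
]

a = merged_query_intervals(paf_p85)
b = merged_query_intervals(paf_p90)
both = shared_bases(a, b)
print("bases in both:", both)
print("fraction of -p 90 bases also found at -p 85:", both / total_bases(b))
```

If the -p 90 alignments were a strict subset, that last fraction would be 1.0; anything lower points to mappings that only appear at the higher threshold.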

sivico26 commented on August 11, 2024

Hi @AndreaGuarracino,

I have not run the set at -p 90, and I don't think I will, simply because with -p 85 I am seeing that only 1% of the bases between two homologous chromosomes align. So, based on your expectation of strong overlap (which I share), it seems pointless to run the set at -p 90 if it would yield around the same 1%.
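For reference, a figure like this 1% can be obtained along the following lines. This is only a sketch of one possible calculation, not necessarily how the number above was computed: it assumes 12-column PAF input and measures the fraction of query bases covered by at least one alignment to the target; the sequence names and coordinates are invented.

```python
def aligned_fraction(paf_lines, query, target):
    """Fraction of `query` bases covered by at least one alignment to `target`."""
    intervals = []
    query_len = None
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12 or f[0] != query or f[5] != target:
            continue
        query_len = int(f[1])  # PAF column 2: query sequence length
        intervals.append((int(f[2]), int(f[3])))
    if not intervals:
        return 0.0
    # Merge overlapping intervals so that no base is counted twice.
    intervals.sort()
    covered = 0
    cur_start, cur_end = intervals[0]
    for s, e in intervals[1:]:
        if s <= cur_end:
            cur_end = max(cur_end, e)
        else:
            covered += cur_end - cur_start
            cur_start, cur_end = s, e
    covered += cur_end - cur_start
    return covered / query_len


# Toy records: two overlapping alignments covering 100 of 10,000 query bases.
example_paf = [
    "chrA\t10000\t0\t60\t+\tchrB\t10000\t0\t60\t55\t60\t60",
    "chrA\t10000\t50\t100\t+\tchrB\t10000\t50\t100\t45\t50\t60",
]

fraction = aligned_fraction(example_paf, "chrA", "chrB")
print(f"{fraction:.2%} of chrA aligned to chrB")
```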
