Comments (5)
Regarding question 3, we are working on improving the performance with lower identity thresholds. The bottleneck is the mapping phase, where the aligner (wfmash) has to find the homology map between all input sequences, given the input segment length and estimated identity threshold. Preliminary results are quite promising. Stay tuned, something could pop up soon (within weeks or a few months).
from pggb.
About questions 1 and 2, `seqwish` must always emit P lines. If you can somehow share your input FASTA file and PAF file, we could check what is triggering this buggy behavior of `seqwish`.
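To confirm the symptom before sharing data, a minimal Python sketch like the following tallies GFA record types, so a missing `P` count is easy to spot. It assumes standard tab-separated GFA 1 with the record type in the first field; the example graph and its path name are made up for illustration.

```python
from collections import Counter

def count_gfa_records(lines):
    """Count GFA records by their one-letter type (first tab-separated field)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        counts[line.split("\t", 1)[0]] += 1
    return counts

# Tiny made-up GFA 1 graph: header, two segments, one link, one path.
example = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tGGA",
    "L\t1\t+\t2\t+\t0M",
    "P\tsample#1#chr1\t1+,2+\t*",
]
counts = count_gfa_records(example)
print("segments:", counts["S"], "paths:", counts["P"])
```

On a real graph, the same function can be fed an open file handle (for example `open("graph.gfa")`, where the file name is hypothetical). A result with `S` records but zero `P` records would match the behavior reported here.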
Regarding the runtime problem in question 3, the identity threshold is the problem here (`-p 85`). We don't have very good alternatives to improve performance in this regard yet. For now, I would suggest exploring your dataset with slightly higher identity thresholds. Of note, with `-p 90`, `wfmash` will also catch mappings with a slightly lower estimated identity, so it could work in your case. As for the segment length, I would suggest using 50k (or at most 100k if the performance issues are too severe).
Thanks for answering, @AndreaGuarracino.
> About questions 1 and 2, `seqwish` must always emit P lines. If you can somehow share your input FASTA file and PAF file, we could check what is triggering this buggy behavior of `seqwish`.
Unfortunately, I am on the receiving end of a consortium that generated the data, and I am not sure how tight the sharing policies are. If you are interested in hunting down this bug, I can ask and let you know.
> Regarding the runtime problem in question 3, the identity threshold is the problem here (`-p 85`). We don't have very good alternatives to improve performance in this regard yet. For now, I would suggest exploring your dataset with a bit higher identity thresholds. Of note, using `-p 90`, `wfmash` will catch also mappings with a bit lower estimated identity, so it could work in your case. As for the segment length, I would suggest using 50k (or at most 100k if there are too big performance issues).
I see. I might give it a try, but that basically depends on the following. I have a question regarding the relationship between the parameters and the sensitivity/accuracy of the alignments found by `wfmash`. In your mind, everything else being equal (input data and the other parameters), would you expect the alignments recovered with `-p 90` to be a subset of the alignments recovered with `-p 85`? As far as I understand, this is the main difference (besides/at the expense of performance).
I am asking because I am already having severe under-alignment issues even at `-p 80` for my data. So, unless for some reason I should expect higher sensitivity at, let's say, `-p 90` than what I am already getting at `-p 80`, I see no point in trying to tweak performance if the results are not there in the first place.
Funnily enough, I also did some tests using `minimap2`, and it seems to yield good alignments (sensitive enough), but as @ekg pointed out in other issues (e.g. here), the current `minimap2` implementation is basically unviable for this application performance-wise (time-wise it might work, but memory consumption is excessive). I can expand on this under-alignment problem for high-divergence cases (~8% in my tests) in another issue if you are interested.
Regards
Sivico
Hi @sivico26, I don't think the alignments recovered with `-p 90` would be exactly a subset of the ones recovered with `-p 85`, but I would expect a strong overlap between the two sets of alignments. Have you already checked that?
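One rough way to check that overlap, sketched in Python under stated assumptions: it reads standard 12-column PAF records, merges the query intervals per (query, target) pair for each run, and counts the query bases covered in both runs. The function names and the toy records are made up for illustration; this is not part of `wfmash` or `pggb`.

```python
from collections import defaultdict

def merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def query_intervals(paf_lines):
    """Merged query intervals keyed by (query name, target name)."""
    raw = defaultdict(list)
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # not a complete PAF record
        raw[(f[0], f[5])].append((int(f[2]), int(f[3])))
    return {pair: merge(ivs) for pair, ivs in raw.items()}

def shared_bases(a, b):
    """Total intersection length of two sorted, disjoint interval lists."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

def overlap_report(paf_a, paf_b):
    """Return (bases covered in A, bases covered in B, bases covered in both)."""
    ia, ib = query_intervals(paf_a), query_intervals(paf_b)
    size = lambda d: sum(e - s for ivs in d.values() for s, e in ivs)
    both = sum(shared_bases(ia[k], ib[k]) for k in ia.keys() & ib.keys())
    return size(ia), size(ib), both

def rec(qs, qe):
    """Build a toy PAF record (chrA vs chrB); all values are made up."""
    return f"chrA\t100000\t{qs}\t{qe}\t+\tchrB\t100000\t{qs}\t{qe}\t{qe - qs}\t{qe - qs}\t60"

run_p85 = [rec(0, 1000), rec(2000, 3000)]   # stand-in for a -p 85 run
run_p90 = [rec(100, 900)]                   # stand-in for a -p 90 run
print(overlap_report(run_p85, run_p90))
```

If the shared count is close to the smaller of the two totals, the stricter run is approximately contained in the looser one; it only compares query coordinates, so it is a coarse check rather than a base-level comparison of the alignments.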
I have not run the set at `-p 90`, and I think I won't, simply because with `-p 85` I am seeing that only 1% of the bases between two homologous chromosomes are aligning. So, based on your expectation of strong overlap (which I share), it seems pointless to run the set at `-p 90` if it would yield around the same 1%.
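For reference, a figure like that 1% can be estimated from a PAF file roughly as follows. This is a minimal sketch assuming standard 12-column PAF; the function name and the toy records are hypothetical, and it counts each query base once even when mappings overlap.

```python
def aligned_fraction(paf_lines, query, target):
    """Fraction of `query` bases covered by at least one mapping to `target`."""
    intervals, qlen = [], None
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12 or f[0] != query or f[5] != target:
            continue
        qlen = int(f[1])                      # query length (PAF column 2)
        intervals.append((int(f[2]), int(f[3])))
    if not intervals:
        return 0.0
    covered, cur_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, cur_end)           # skip bases already counted
        if end > start:
            covered += end - start
            cur_end = end
    return covered / qlen

# Toy records (made up): two overlapping mappings on a 100 kb query.
toy = [
    "chrA\t100000\t0\t500\t+\tchrB\t100000\t0\t500\t480\t500\t60",
    "chrA\t100000\t400\t1000\t+\tchrB\t100000\t400\t1000\t570\t600\t60",
]
frac = aligned_fraction(toy, "chrA", "chrB")
print(f"{frac:.1%}")  # → 1.0%
```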