Dear developer: We are conducting the collection test of our real data according t

command is not supported for more than 65534 about panacus HOT 6 CLOSED

marschall-lab commented on July 25, 2024

command is not supported for more than 65534

from panacus.

Comments (6)

danydoerr commented on July 25, 2024

Yes, that's right: at the moment the tool is limited to 65534 path groups (that is, "samples" or "taxa"). I did not consider it likely that data sets with more distinct samples/taxa exist right now. How many samples does your data set have?

Typically, you want to group your paths into samples or haplotypes, but this requires that path names adhere to the PanSN naming scheme. Then you can simply group by sample (-S) or haplotype (-H).

danydoerr commented on July 25, 2024

Oh, and if your paths are not PanSN-compatible, you can still do the grouping by hand by specifying a path-to-group mapping with -g.
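A hand-made mapping for -g could be generated with a small script. The following is only a sketch under stated assumptions: it assumes the mapping is a two-column, tab-separated file (path name, group name), which should be confirmed with `panacus histgrowth --help`; it only reads P-lines; and the naming rule in `group_of` (`sampleA_ctg1` becomes `sampleA`) is hypothetical and needs to be adapted to your own path names.

```python
import io
import re


def p_line_path_names(gfa_lines):
    """Yield path names from GFA P-lines (second tab-separated field)."""
    for line in gfa_lines:
        if line.startswith("P\t"):
            yield line.rstrip("\n").split("\t")[1]


def group_of(path_name):
    """Hypothetical naming rule: 'sampleA_ctg1' -> 'sampleA'. Adapt as needed."""
    return re.sub(r"_ctg\d+$", "", path_name)


def write_mapping(names, out):
    """Write one 'path<TAB>group' line per path (assumed -g file layout)."""
    for name in names:
        out.write(f"{name}\t{group_of(name)}\n")


# Tiny in-memory GFA used for illustration only.
demo_gfa = [
    "H\tVN:Z:1.0\n",
    "S\t1\tACGT\n",
    "P\tsampleA_ctg1\t1+\t*\n",
    "P\tsampleA_ctg2\t1+\t*\n",
    "P\tsampleB_ctg1\t1+\t*\n",
]

buf = io.StringIO()
write_mapping(p_line_path_names(demo_gfa), buf)
mapping_text = buf.getvalue()
n_groups = len({line.split("\t")[1] for line in mapping_text.splitlines()})
```

For a real graph, the list would be replaced by an open file handle and the buffer by an output file. Counting the distinct groups (`n_groups`) is a quick way to see whether the grouping stays below the 65534 limit.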

ld9866 commented on July 25, 2024

Thank you for getting back to me so quickly. In fact, we only have 27 samples, and the genome size of each sample is 2.5G, so a data set of this size (comparable to a human pangenome) should not be a problem to visualize.
We used minigraph-cactus for pangenome construction and then used vg to convert the graph to GFA 1.1 format for visual analysis. How should we carry out quality control or other steps to complete the visualization?
Best regards.

danydoerr commented on July 25, 2024

OK, then this means that you need to group the paths by samples (-S) or haplotypes (-H). Regarding quality control, I think panacus is a good starting point; here is my suggestion:

1. Generate an HTML page that contains coverage histograms and growth curves for all count types:

   RUST_LOG=info panacus histgrowth -t4 -l 1,2,1,1,1 -q 0,0,1,0.5,0.1 -H -c all -a -o html test.giffa2.1.0.gfa > test.giffa2.histgrowth.all.html

2. I find the coverage plots very insightful for quality control. Typically, you expect that the two highest bars correspond to coverage by a single sample/haplotype and by all samples/haplotypes, respectively. Anything else indicates that you might want to reconsider your alignment parameters.

3. I find the node-resolved coverage table extremely helpful for checking some basic properties of pangenome graphs, especially in combination with node length information (see script gfa2nodelen.py.zip). The table can be generated with

   RUST_LOG=info panacus table -t4 -H -c node test.giffa2.1.0.gfa > test.giffa2.coverage.node.tsv
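The expectation about the two highest bars can also be checked programmatically once the histogram values are available. This is a minimal sketch, not part of panacus: it assumes the histogram has already been loaded into a dictionary mapping coverage (number of haplotypes or samples, starting at 1) to a count, and the function names and demo numbers are made up for illustration.

```python
def two_highest_bars(hist):
    """Return the two coverage values with the largest counts.

    `hist` maps coverage (>= 1) to the number of items with that coverage.
    """
    ranked = sorted(hist, key=lambda cov: hist[cov], reverse=True)
    return set(ranked[:2])


def looks_as_expected(hist, n_groups):
    """True if the two highest bars sit at coverage 1 and coverage n_groups."""
    return two_highest_bars(hist) == {1, n_groups}


# Invented histograms for four haplotypes (illustration only).
typical = {1: 900, 2: 120, 3: 80, 4: 700}
suspicious = {1: 900, 2: 800, 3: 80, 4: 300}

typical_ok = looks_as_expected(typical, 4)
suspicious_ok = looks_as_expected(suspicious, 4)
```

A `False` result is only a hint to look at the plot more closely; it does not by itself say which alignment parameter to change.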

I am a bit surprised that you have 170 million nodes in your graph, given a genome size of 2.5 Gbp per sample. For comparison, the HPRC+Chinese human pangenome graph (also generated with minigraph-cactus) contains 211 haplotypes, each ~2.7 Gbp long, yet has only about 119 million nodes. This does not necessarily mean that your graph has poor quality; the number of nodes also depends very much on the diversity of the genomes. The large number of nodes might make the analysis that I propose (see 3.) a bit more resource-demanding, but typical HPC systems nowadays should be able to deal with these large tables.
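The node length information mentioned in 3. can be read directly from the S-lines of the GFA file. The sketch below is not the attached gfa2nodelen.py script, only an illustration of the idea: it takes the length of the sequence field, and falls back to the optional `LN:i:` tag when the sequence is given as `*`. Because it streams line by line, memory use stays flat even for graphs with many millions of nodes.

```python
def node_lengths(gfa_lines):
    """Yield (node_id, length) for every S-line of a GFA file.

    Length is taken from the sequence field; if the sequence is '*',
    the optional LN:i: tag is used, or None when that tag is absent.
    """
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        node_id, seq = fields[1], fields[2]
        if seq != "*":
            yield node_id, len(seq)
            continue
        ln = next((f[5:] for f in fields[3:] if f.startswith("LN:i:")), None)
        yield node_id, (int(ln) if ln is not None else None)


# Tiny in-memory GFA used for illustration only.
demo_gfa = [
    "S\t1\tACGT\n",
    "S\t2\t*\tLN:i:10\n",
    "L\t1\t+\t2\t+\t0M\n",
    "S\t3\t*\n",
]

lengths = list(node_lengths(demo_gfa))
```

The resulting pairs can then be joined with the node-resolved coverage table on the node identifier, for example to see how many base pairs (rather than how many nodes) fall into each coverage class.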

danydoerr commented on July 25, 2024

If you have further questions on QC of your pangenome graph, please email me at [email protected]

ld9866 commented on July 25, 2024

OK! I will send the detailed information to your email for consultation!
With best wishes
