
Comments (6)

danydoerr commented on July 25, 2024

Yes, that's right. At the moment the tool is limited to 65534 path groups (that is, "samples" or "taxa"). I did not think it likely that data sets with more distinct samples/taxa exist right now. How many samples does your data set have?

Typically, you want to group your paths into samples or haplotypes, but this requires that path names adhere to the PanSN naming scheme. Then, you can simply group by sample (-S) or haplotype (-H).
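To check whether path names fit the PanSN scheme and how many groups they would yield, a minimal sketch along these lines may help. It assumes the common PanSN layout `sample#haplotype#contig` with `#` as the delimiter; the helper names (`parse_pansn`, `count_groups`) and the example path names are made up for illustration and are not part of panacus.

```python
from typing import NamedTuple, Optional

MAX_GROUPS = 65534  # path-group limit quoted in this thread


class PanSN(NamedTuple):
    sample: str
    haplotype: str
    contig: str


def parse_pansn(path_name: str, delim: str = "#") -> Optional[PanSN]:
    """Split 'sample#haplotype#contig'; return None if the name does not fit."""
    parts = path_name.split(delim, 2)
    if len(parts) != 3 or not all(parts):
        return None
    return PanSN(*parts)


def count_groups(path_names):
    """Count distinct samples and haplotypes; collect names that do not parse."""
    samples, haplotypes, unparsed = set(), set(), []
    for name in path_names:
        parsed = parse_pansn(name)
        if parsed is None:
            unparsed.append(name)
        else:
            samples.add(parsed.sample)
            haplotypes.add((parsed.sample, parsed.haplotype))
    return len(samples), len(haplotypes), unparsed


# Hypothetical path names, three PanSN-style and one that is not.
names = ["HG002#1#chr1", "HG002#2#chr1", "HG003#1#chr1", "scaffold_17"]
n_samples, n_haps, unparsed = count_groups(names)
print(n_samples, n_haps, unparsed)  # 2 3 ['scaffold_17']
print(n_haps <= MAX_GROUPS)         # True
```

Any name that ends up in `unparsed` would not be grouped by -S or -H under this assumption and would count as its own group.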

from panacus.

danydoerr commented on July 25, 2024

Oh, and if your paths are not PanSN compatible, you can still do the grouping by hand by specifying a path-to-group mapping with -g.
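A hand-made mapping could be produced with a short script such as the one below. This is only a sketch: it assumes the mapping is a two-column, tab-separated file with the path name first and the group second, which should be checked against the tool's help output. The path names and the rule "group = text before the first dot" are hypothetical.

```python
import csv
import io


def build_group_mapping(path_names, group_of):
    """Pair every path name with the group that `group_of` assigns to it."""
    return [(name, group_of(name)) for name in path_names]


def write_group_mapping(mapping, handle):
    """Write 'path<TAB>group' lines (assumed file format, see lead-in)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(mapping)


# Hypothetical non-PanSN path names; group by the text before the first dot.
paths = ["sampleA.chr1", "sampleA.chr2", "sampleB.chr1"]
mapping = build_group_mapping(paths, lambda name: name.split(".", 1)[0])

buf = io.StringIO()  # swap for open("groups.tsv", "w", newline="") to write a file
write_group_mapping(mapping, buf)
print(buf.getvalue(), end="")
```

Passing the grouping rule as a function keeps the naming convention in one place, so only the lambda changes when the paths follow a different pattern.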


ld9866 commented on July 25, 2024

Thank you for getting back to me so quickly. In fact, we only have 27 samples, each with a genome size of 2.5 Gbp, so a tool that handles human pangenomes should have no problem visualizing our data.
We built the pangenome with minigraph-cactus and then used vg to convert it to GFA 1.1 format for visual analysis. How should we carry out quality control or other steps to complete the visualization?
Best regards.


danydoerr commented on July 25, 2024

OK, then this means that you need to group the paths by samples (-S) or haplotypes (-H). Regarding quality control, I think panacus is a good starting point. Here is my suggestion:

  1. Generate an HTML page that contains coverage histograms and growth curves for all count types:
     RUST_LOG=info panacus histgrowth -t4 -l 1,2,1,1,1 -q 0,0,1,0.5,0.1 -H -c all -a -o html test.giffa2.1.0.gfa > test.giffa2.histgrowth.all.html
  2. I find the coverage plots very insightful for quality control. Typically, you expect the two highest bars to correspond to coverage by a single sample/haplotype and by all samples/haplotypes, respectively. Anything else indicates that you might want to reconsider your alignment parameters.
  3. I find the node-resolved coverage table extremely helpful for checking some basic properties of pangenome graphs, especially in combination with node length information (see script gfa2nodelen.py.zip). The table can be generated with:
     RUST_LOG=info panacus table -t4 -H -c node test.giffa2.1.0.gfa > test.giffa2.coverage.node.tsv
  4. I am a bit surprised that you have 170 million nodes in your graph, given a genome size of 2.5 Gbp per sample. For comparison, the HPRC+Chinese human pangenome graph (also generated with minigraph-cactus) contains 211 haplotypes, each ~2.7 Gbp long, yet has only about 119 million nodes. This does not necessarily mean that your graph is of poor quality; the number of nodes also depends very much on the diversity of the genomes. The large number of nodes might make the analysis that I propose (see 3.) a bit more resource-demanding, but typical HPC systems nowadays should be able to handle these large tables.
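The rule of thumb for the coverage plots (the two tallest bars should sit at coverage 1 and at full coverage) can be written down as a small check. The sketch below does not read panacus output; it assumes the histogram has already been loaded into a plain dict mapping coverage level to count, and the function names and numbers are invented for illustration.

```python
def top_two_coverages(hist):
    """Return the two coverage levels with the largest counts, tallest first."""
    ranked = sorted(hist, key=lambda cov: hist[cov], reverse=True)
    return ranked[:2]


def matches_expected_shape(hist, n_groups):
    """True if the two tallest bars are at coverage 1 and coverage n_groups.

    Only meaningful for n_groups >= 2.
    """
    return set(top_two_coverages(hist)) == {1, n_groups}


# Made-up counts for 5 haplotypes: coverage level -> number of nodes.
good = {1: 900, 2: 120, 3: 80, 4: 150, 5: 700}
odd = {1: 300, 2: 850, 3: 90, 4: 100, 5: 700}

print(matches_expected_shape(good, 5))  # True
print(matches_expected_shape(odd, 5))   # False
```

A `False` result is a prompt to look at the plot more closely, not a verdict on the graph.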


danydoerr commented on July 25, 2024

If you have further questions on QC of your pangenome graph, please email me at [email protected]


ld9866 commented on July 25, 2024

OK! I will send the detailed information to your email for consultation.
With best wishes
