marschall-lab / panacus

Panacus is a tool for computing statistics for GFA-formatted pangenome graphs

License: MIT License

Rust 89.35% Python 4.79% CSS 0.27% JavaScript 4.09% Shell 0.13% HTML 1.37%

gfa pangenomics pangenome-growth graph pangenome pangenome-graph

panacus's Issues

Could you please make a new release?

To use the new file format feature in pipeline development, we need a new release, because we can only work with a Conda package.
Could you please provide one, @danydoerr?

Thanks!

How to visualize the results of Minigraph-Cactus?

Dear developer,
Following the example you provided, we cannot extract the Minigraph-Cactus (MC) results for visual analysis, including #4.
The path file obtained in the first step is empty, but we can extract the paths after using vg to convert the graph to GFA version 1.0. Why is that?
Code:
grep '^P' test.full.gfa | cut -f2 > test.paths.txt
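
A possible explanation, offered as a guess rather than a diagnosis: if the original GFA stores haplotypes as W-lines instead of P-lines, a `^P` filter finds nothing, whereas a file converted to GFA 1.0 contains P-lines. The minimal Python sketch below collects names from both line types. The W-line naming scheme (`sample#hap#seq:start-end`) mirrors the awk command quoted in a later issue on this page; the function name `path_names` and the example records are made up for illustration.

```python
def path_names(gfa_lines):
    """Yield one name per P-line or W-line of a GFA file."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "P" and len(fields) > 1:
            # P-line: the path name is the second column
            yield fields[1]
        elif fields[0] == "W" and len(fields) > 5:
            # W-line: sample, haplotype index, sequence id, start, end
            sample, hap, seq, start, end = fields[1:6]
            yield f"{sample}#{hap}#{seq}:{start}-{end}"


# Hypothetical records, only to show the behaviour
example = [
    "S\t1\tACGT\n",
    "P\tref#0#chr1\t1+\t*\n",
    "W\tHG002\t1\tchr1\t0\t4\t>1\n",
]
names = list(path_names(example))
print("\n".join(names))
```

For a real file, `path_names(open("test.full.gfa"))` would stream the names without loading the whole graph into memory.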

macOS binary missing

wget --no-check-certificate -c https://github.com/marschall-lab/panacus/releases/download/0.2.3/panacus-0.2.3_macos_arm64.tar.gz
--2024-06-20 08:19:21-- https://github.com/marschall-lab/panacus/releases/download/0.2.3/panacus-0.2.3_macos_arm64.tar.gz
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2024-06-20 08:19:21 ERROR 404: Not Found.

AttributeError: 'DataFrame' object has no attribute 'cumulative'

Hi @danydoerr,

I tried to visualize my pangenome-growth output with your Python script, however, I ran into the following problem:

python3 ~/software/panacus/git/master/scripts/visualize_growth.py chrM.hprc-v1.0-pggb.og.pangenome-growth_r100.txt
Traceback (most recent call last):
  File "/home/ubuntu/software/panacus/git/master/scripts/visualize_growth.py", line 66, in <module>
    plot(df, path.basename(args.growth_stats.name), out)
  File "/home/ubuntu/software/panacus/git/master/scripts/visualize_growth.py", line 35, in plot
    plt.bar(x=xticks, height=df.cumulative)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'cumulative'

chrM.hprc-v1.0-pggb.og.pangenome-growth_r100.txt

Thanks for any tips!
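
The traceback says that the table pandas parsed has no column named `cumulative`, so the script and the file disagree about the header. Before plotting, it can help to look at which column names the file actually carries. The sketch below does that with the standard library only; the comment prefix `#` and the column names in the demo are assumptions for illustration, not the documented panacus output format.

```python
import csv
import io


def header_columns(handle, comment="#"):
    """Return the column names from the first non-comment line of a TSV."""
    for row in csv.reader(handle, delimiter="\t"):
        if row and not row[0].startswith(comment):
            return row
    return []


# Made-up content standing in for a growth table
demo = io.StringIO("# made-up comment line\ncoverage\tgrowth\n1\t10\n")
cols = header_columns(demo)

# Columns the plotting script expects (only `cumulative` is known from the traceback)
missing = [name for name in ["cumulative"] if name not in cols]
print("columns:", cols, "missing:", missing)
```

If `missing` is not empty, the script and the table were most likely produced by different versions of the tool.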

What does the output content represent?

When I used the pangenome-growth command, I got five columns of data. What do the last three columns represent? When I build a sample file, I only need to write the sample name instead of using the haplotype form of sample name. 1.2. Is this a requirement for the sample file that was previously constructed as a pangenome?

panacus-visualize.py is overwhelmed by 1000 haplotypes

Hi there :)
I applied panacus-visualize.py to a histgrowth output of 1000 haplotypes, but the PDF is not showing any colors and some weird x-axis labels:

The TSV input is available for 10 days at https://fex.belwue.de/fop/rFYpUmCn/chr19.1000.fa.gz.gfaffix.unchop.Ygs.og.crush.gfa.histgrowth.tsv.

panacus-visualize.py -e -l "lower right" chr19.1000.fa.gz.gfaffix.unchop.Ygs.og.crush.gfa.histgrowth.tsv > chr19.1000.fa.gz.gfaffix.unchop.Ygs.og.crush.gfa.histgrowth.tsv.pdf

Next I want to try it on a data set with ~2k sequences.
Thanks for any feedback :)

Update Readme to reflect installation of python dependencies in the installation section.

After `panacus hist` and `panacus growth`, the final visualization will show `#nodes` instead of `bps`. I use `-c bp` for hist

After panacus hist and panacus growth, the final visualization will show #nodes instead of bps. I use -c bp for hist

Originally posted by @baozg in #7 (comment)

Haplotype labels in TSV, visualisations ?

Hello,
Great tool.
Is it possible to display the haplotype labels from GFA P-lines or W-lines in the TSV outputs, as well as in the HTML/PDF outputs?

I tried with different GFA files, containing either P-lines or W-lines.
However, the output always lists the haplotypes on X axes with integers from 1 to n.

If this is not implemented, does the enumeration correspond to the order of the lines in the GFA, or to the order in the haplotype.txt files?

Thanks

How is panacus treating Ns?

Assuming I have 100 haplotypes and each brings in 1000 Ns, would this lead to a steep growth curve, or is panacus ignoring the Ns?

Installation instructions

Nowadays, conda should be replaced by mamba, which is much faster and also solves dependencies in a more optimal way. Further, the bioconda channel should always be combined with the conda-forge channel, because it is built against that instead of the defaults channel. Something like `mamba install -c conda-forge -c bioconda panacus`. For getting a mamba installation, you can refer to the mambaforge distribution.

Could you give us an example for Cactus?

What is the meaning of common and consensus?

Hi,

Thanks for this approach to check the GFA file's growth plot.
But I am curious about the meaning of "common" and "consensus".
Which one is the number contained in all accessions? I guess "common".
What about "consensus"?

Best wishes,
Lan

A problem while running panacus-visualize

I was trying to run panacus-visualize with my data, but it gave me this error:
Traceback (most recent call last):
  File "/home/gustavo/panacus-0.2.3_linux_x86_64/bin/panacus-visualize", line 16, in <module>
    from matplotlib.transforms import Bbox
ModuleNotFoundError: No module named 'matplotlib'

The file contains this line: from matplotlib.transforms import Bbox
I also tried to run it with your example and got the same error.

Thanks for your time.
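
`ModuleNotFoundError` means that the Python interpreter running the script cannot find the package, which usually points to a missing installation in that environment rather than to a bug in the script. A minimal sketch for checking this ahead of time is shown below; only `matplotlib` is known from the traceback to be required, and the helper name `missing_modules` is made up for illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Only matplotlib is confirmed by the traceback; extend the list as needed
required = ["matplotlib"]
absent = missing_modules(required)
if absent:
    print("not installed in this environment:", ", ".join(absent))
else:
    print("all required modules found")
```

Running the check with the same `python3` that executes the visualization script matters, since several interpreters on one machine can have different packages installed.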

command is not supported for more than 65534

Dear developer:
We are testing panacus on our real data, following the example data, but we have encountered some problems and hope to get your help. The errors are as follows.
Best day!

step1 is ok

grep '^P' test.giffa2.1.0.gfa | cut -f2 | grep -ve 'refernece' > test.giffa2.paths.haplotypes.txt

step2 fails

RUST_LOG=info /home/test/Software/panacus-0.2.3_linux_x86_64/bin/panacus histgrowth -t 4 -l 1,2,1,1,1 -q 0,0,1,0.5,0.1 -S -a -s test.giffa2.paths.haplotypes.txt test.giffa2.1.0.gfa > test.giffa2.histgrowth.node.tsv

error

[2023-11-30T00:41:26Z INFO panacus::cli] running panacus on 4 threads
[2023-11-30T00:41:26Z INFO panacus::cli] constructing indexes for node/edge IDs, node lengths, and P/W lines..
[2023-11-30T00:43:19Z INFO panacus::cli] ..done; found 383935 paths/walks and 174028496 nodes
[2023-11-30T00:43:19Z INFO panacus::cli] loading data from group / subset / exclude files
[2023-11-30T00:43:19Z INFO panacus::abacus] loading coordinates from pig.giffa2.paths.haplotypes.txt
Error: Custom { kind: Unsupported, error: "data has 383917 path groups, but command is not supported for more than 65534" }
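
According to the message, the command accepts at most 65534 path groups, while this data set yields 383917. One way to see whether coarser grouping would fit under the limit is to count the distinct groups for different name prefixes. The sketch below assumes PanSN-style path names (`sample#haplotype#contig`), which the report does not confirm; the example names and the helper `count_groups` are hypothetical.

```python
LIMIT = 65534  # taken from the error message above


def count_groups(path_names, fields=1, sep="#"):
    """Count distinct groups when paths are keyed by their first `fields` name parts."""
    return len({sep.join(name.split(sep)[:fields]) for name in path_names})


# Made-up path names in the assumed sample#haplotype#contig scheme
paths = ["pigA#1#ctg1", "pigA#1#ctg2", "pigA#2#ctg1", "pigB#1#ctg1"]

per_path = len(set(paths))                     # every path is its own group
per_haplotype = count_groups(paths, fields=2)  # sample#haplotype
per_sample = count_groups(paths, fields=1)     # sample only

for label, n in [("path", per_path), ("haplotype", per_haplotype), ("sample", per_sample)]:
    print(f"grouped by {label}: {n} groups, within limit: {n <= LIMIT}")
```

Applied to the real list of path names, this shows at which grouping level the number of groups drops below the limit.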

compiler error in rustc-serialize

Hi @danydoerr :)

When compiling with

cargo --version
cargo 1.76.0 (c84b36747 2024-01-18)

I get the following error:

cargo build --release
   Compiling thiserror-impl v1.0.40
   Compiling getrandom v0.2.9
   Compiling io-lifetimes v1.0.10
   Compiling num_cpus v1.15.0
   Compiling clap_derive v4.5.3
   Compiling serde_json v1.0.114
   Compiling rustc-serialize v0.3.24
   Compiling regex v1.7.3
   Compiling strum_macros v0.25.3
   Compiling time v0.3.34
   Compiling rustix v0.37.19
   Compiling rand_core v0.6.4
   Compiling rayon-core v1.11.0
error[E0310]: the parameter type `T` may not live long enough
    --> /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rustc-serialize-0.3.24/src/serialize.rs:1155:5
     |
1155 |     fn decode<D: Decoder>(d: &mut D) -> Result<Cow<'static, T>, D::Error> {
     |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     |     |
     |     the parameter type `T` must be valid for the static lifetime...
     |     ...so that the type `T` will meet its required lifetime bounds...
     |
note: ...that is required by this bound
    --> /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/borrow.rs:180:30
help: consider adding an explicit lifetime bound
     |
1151 | impl<'a, T: ?Sized + 'static> Decodable for Cow<'a, T>
     |                    +++++++++

   Compiling rand_chacha v0.3.1
   Compiling rand v0.8.5
   Compiling rayon v1.7.0
For more information about this error, try `rustc --explain E0310`.
error: could not compile `rustc-serialize` (lib) due to 1 previous error
warning: build failed, waiting for other jobs to finish...

Any ideas? Thanks!

Discrepancy between graph length, reference length, and novel base pairs

Hi,

I have created a pangenome using the minigraph-cactus pipeline in a manner very similar to how the HPRC created their pangenome (using T2T-CHM13 as the primary reference and including GRCh38 as an assembly). I then ran panacus and identified the number of base pairs added per assembly and in total. The total number of base pairs added (coverage 1, quorum 0, last row) was 0.489 Gb. The length of the T2T-CHM13 reference is 3.11 Gb, and the total length of the graph (determined using odgi stats) is 3.38 Gb. Am I wrong in thinking that the length of the reference, 3.11 Gb, plus the total bp added, 0.489 Gb, should equal the total length of the graph (3.38 Gb)? Or do I misunderstand? I've tried a variety of parameters and still get the same results.
The code:

grep -e '^W' $gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' > corrected.paths.txt

grep -e 'GRCh38' corrected.paths.txt > corrected.paths.grch38.txt

grep -ive 'grch38|chm13' corrected.paths.txt > corrected.paths.haplotypes.txt

panacus ordered-histgrowth --count bp --order $order --exclude corrected.paths.grch38.txt --subset corrected.paths.haplotypes.txt --output-format table --coverage 1,2,4,67 --groupby-sample --threads $SLURM_CPUS_ON_NODE $gfa > $gfa.ordered-histgrowth.bp.tsv
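
One detail in the commands above may be worth double-checking: plain `grep -e` uses basic regular expressions, in which an unescaped `|` is a literal character, so `grep -ive 'grch38|chm13'` may not exclude anything (`grep -E` gives alternation). Whether this explains the discrepancy is not known. The Python sketch below builds the same two lists with an explicit alternation and shows the literal interpretation next to it for comparison; the path names are made up and `split_paths` is a hypothetical helper.

```python
import re


def split_paths(paths):
    """Return (GRCh38 paths, paths matching neither GRCh38 nor CHM13)."""
    grch38 = [p for p in paths if "GRCh38" in p]
    reference = re.compile(r"grch38|chm13", re.IGNORECASE)  # true alternation
    haplotypes = [p for p in paths if not reference.search(p)]
    return grch38, haplotypes


# Made-up names in the sample#hap#seq:start-end scheme used above
paths = [
    "CHM13#0#chr1:0-10",
    "GRCh38#0#chr1:0-10",
    "HG002#1#chr1:0-10",
]
grch38, haplotypes = split_paths(paths)

# What a literal match on the text "grch38|chm13" would keep: everything
literal_kept = [p for p in paths if "grch38|chm13" not in p.lower()]

print(len(haplotypes), "paths kept with alternation,", len(literal_kept), "with a literal match")
```

Comparing the line counts of `corrected.paths.txt` and `corrected.paths.haplotypes.txt` would show which of the two behaviours actually occurred.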

path coordinates

Hi,

How can I provide path coordinates to panacus for a PGGB graph? Since its GFA does not contain intervals, does this function only work for Minigraph-Cactus graphs?

Request software updates

Hello Daniel,

I found that the current version of panacus is no longer compatible with the results of Minigraph-Cactus (MC), and the final visualized graph is incorrect. I think this problem needs to be completely solved. It should be possible to construct the pangenome with the current Minigraph-Cactus pipeline and use the final results for visualization.

Best wishes
Xuelei

Merge different chroms stats into one graph

Hi,

How do we merge stats from all chromosomes into a single plot?

Best regards
Zhigui
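
One possible approach, sketched under assumptions: if each chromosome graph was processed with the same set of haplotypes, the per-chromosome coverage histograms count disjoint sets of nodes or base pairs, so their counts can be added coverage by coverage, and a single plot can be drawn from the summed table. The sketch below works on simplified `(coverage, count)` pairs and does not parse the actual panacus TSV layout; `merge_histograms` and the numbers are made up for illustration.

```python
from collections import Counter


def merge_histograms(tables):
    """Sum counts per coverage level across several (coverage, count) tables."""
    total = Counter()
    for table in tables:
        for coverage, count in table:
            total[coverage] += count
    return sorted(total.items())


# Made-up histograms for two chromosomes
chr1 = [(1, 100), (2, 40)]
chr2 = [(1, 70), (2, 10), (3, 5)]
merged = merge_histograms([chr1, chr2])
print(merged)
```

This only makes sense when the coverage levels refer to the same haplotypes in every chromosome; otherwise the summed counts mix incomparable groups.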

Feature request: Alternative plot with #nodes/#edges vs AC

A very useful visualization to QC variant call sets has been this:
https://www.nature.com/articles/nature15394/figures/5
That is, plotting the number of variants (Y) vs their allele count (X) in a log-log plot. This should be a power law and hence a straight line. Would it be easy to extend Panacus to output such plots as well for bp/#node/#edges? At least for edges I would hope/expect to get a straight line as well if the graph is good. In any case, being able to produce such plots would be very valuable.
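
To make the request concrete: on a log-log scale a power law `y = c * x^k` becomes the straight line `log y = log c + k * log x`, so the slope of a least-squares fit through the log-transformed points estimates the exponent `k`. The sketch below fits that line with the standard library only; the data are synthetic, and `loglog_slope` is a made-up helper, not part of panacus.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope and intercept of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic data following count = 1000 / allele_count exactly
allele_counts = [1, 2, 4, 8]
counts = [1000 / ac for ac in allele_counts]
slope, intercept = loglog_slope(allele_counts, counts)
print(f"slope={slope:.3f} intercept={intercept:.3f}")
```

For real data, strong deviations of the points from the fitted line would be the signal to look at, which is the quality check the linked figure is used for.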

Option to output plots in separate PNG files

Hi @danydoerr,

could you please provide an option to output the plots in separate PNG files?
@heringerp is working on making panacus a module for nf-core. However, in order for the results to be integrated into a MultiQC report, we need PNGs. PDFs are not supported.

Thanks!

Best,
Simon
