marschall-lab / panacus
Panacus is a tool for computing statistics for GFA-formatted pangenome graphs
License: MIT License
In order to use the new file format feature in pipeline development, we need a new release, because we can only work with a Conda package.
Would you please provide a new one, @danydoerr?
Thanks!
Dear developer:
We have run into a problem: following the example you provided, we cannot extract the MC results for visual analysis, including #4.
The path file obtained in the first step is empty, but we can extract the paths after using vg to convert the GFA to version 1.0. Could you explain why?
Code:
grep '^P' test.full.gfa | cut -f2 > test.paths.txt
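One possible explanation, offered as an assumption rather than a diagnosis: a GFA that stores its haplotypes in W-lines (walks) has no P-lines, so grep '^P' returns nothing. That would be consistent with the paths appearing after conversion to GFA 1.0. The sketch below uses a hypothetical helper, path_names, which is not part of panacus, to collect names from both line types. The W-line name layout sample#haplotype#sequence:start-end is an assumed convention.

```python
def path_names(gfa_lines):
    """Collect path names from P-lines and W-lines of a GFA file.

    Hypothetical helper for illustration; not part of panacus.
    """
    names = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "P" and len(fields) > 1:
            # P-line: the second field is the path name.
            names.append(fields[1])
        elif fields[0] == "W" and len(fields) > 5:
            # W-line: sample, haplotype index, sequence ID, start, end.
            sample, hap, seq, start, end = fields[1:6]
            # Assumed naming layout: sample#hap#seq:start-end
            names.append(f"{sample}#{hap}#{seq}:{start}-{end}")
    return names
```

To apply it to a file, pass an open file handle, for example path_names(open("test.full.gfa")).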
wget --no-check-certificate -c https://github.com/marschall-lab/panacus/releases/download/0.2.3/panacus-0.2.3_macos_arm64.tar.gz
--2024-06-20 08:19:21-- https://github.com/marschall-lab/panacus/releases/download/0.2.3/panacus-0.2.3_macos_arm64.tar.gz
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2024-06-20 08:19:21 ERROR 404: Not Found.
Hi @danydoerr,
I tried to visualize my pangenome-growth output with your Python script, however, I ran into the following problem:
python3 ~/software/panacus/git/master/scripts/visualize_growth.py chrM.hprc-v1.0-pggb.og.pangenome-growth_r100.txt
Traceback (most recent call last):
File "/home/ubuntu/software/panacus/git/master/scripts/visualize_growth.py", line 66, in <module>
plot(df, path.basename(args.growth_stats.name), out)
File "/home/ubuntu/software/panacus/git/master/scripts/visualize_growth.py", line 35, in plot
plt.bar(x=xticks, height=df.cumulative)
File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 5989, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'cumulative'
chrM.hprc-v1.0-pggb.og.pangenome-growth_r100.txt
Thanks for any tips!
When I ran the pangenome-growth command, I got five columns of data. What do the last three columns represent? Also, when I build a sample file, I only need to write the sample name instead of using the haplotype form of sample name. 1.2. Is this a requirement for the sample file that was previously constructed as a pangenome?
Hi there :)
I applied panacus-visualize.py to a histgrowth output of 1000 haplotypes, but the PDF is not showing any colors and some weird x-axis labels:
The TSV input is available for 10 days at https://fex.belwue.de/fop/rFYpUmCn/chr19.1000.fa.gz.gfaffix.unchop.Ygs.og.crush.gfa.histgrowth.tsv.
panacus-visualize.py -e -l "lower right" chr19.1000.fa.gz.gfaffix.unchop.Ygs.og.crush.gfa.histgrowth.tsv > chr19.1000.fa.gz.gfaffix.unchop.Ygs.og.crush.gfa.histgrowth.tsv.pdf
Next I want to try it on a data set with ~2k sequences.
Thanks for any feedback :)
After panacus hist and panacus growth, the final visualization will show #nodes instead of bps. I use -c bp for hist.
Originally posted by @baozg in #7 (comment)
Hello,
Great tool.
Is it possible to display the haplotype labels from GFA P-lines or W-lines in the TSV outputs, as well as in the HTML/PDF outputs?
I tried with different GFA files, containing either P-lines or W-lines.
However, the output always lists the haplotypes on the x-axis as integers from 1 to n.
If this is not implemented, does the enumeration correspond to the order of the lines in the GFA, or to the order in the haplotype.txt files?
Thanks
Assuming I have 100 haplotypes and each brings in 1000 Ns. Would this lead to a steep growth curve or is panacus ignoring the Ns?
Nowadays, conda should be replaced by mamba, which is much faster and also solves dependencies in a more optimal way. Furthermore, the bioconda channel should always be combined with the conda-forge channel, because it is built against that instead of the defaults channel. Something like mamba install -c conda-forge -c bioconda panacus. To get a mamba installation, you can refer to the mambaforge distribution.
Hi,
Thanks for this approach to checking a GFA file's growth plot.
But I am curious about the meaning of "common" and "consensus".
Which one is the number contained in all accessions? I guess "common".
What about "consensus"?
Best wishes,
Lan
I was trying to run panacus-visualize with my data, but it gave me this error:
Traceback (most recent call last):
  File "/home/gustavo/panacus-0.2.3_linux_x86_64/bin/panacus-visualize", line 16, in <module>
    from matplotlib.transforms import Bbox
ModuleNotFoundError: No module named 'matplotlib'
The file does contain the line from matplotlib.transforms import Bbox.
I also tried running it with your example and got the same error.
Thanks for your time.
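The traceback suggests that matplotlib is not installed in the Python environment that runs the script. A minimal sketch for checking this before launching the visualization is shown below. The helper name missing_modules is hypothetical, and which modules the script needs beyond matplotlib is not stated here, so the list passed in is an assumption.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found.

    Hypothetical helper for illustration; pass top-level names only.
    """
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # matplotlib is the module named in the traceback above.
    print(missing_modules(["matplotlib"]))
```

An empty list means every listed module is importable in the interpreter that ran the check. The interpreter must be the same one that executes panacus-visualize.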
Dear developer:
We are running tests on our real data, following the example data, but we have encountered some problems and hope to get your help. The errors are as follows.
Best day!
grep '^P' test.giffa2.1.0.gfa | cut -f2 | grep -ve 'refernece' > test.giffa2.paths.haplotypes.txt
RUST_LOG=info /home/test/Software/panacus-0.2.3_linux_x86_64/bin/panacus histgrowth -t 4 -l 1,2,1,1,1 -q 0,0,1,0.5,0.1 -S -a -s test.giffa2.paths.haplotypes.txt test.giffa2.1.0.gfa > test.giffa2.histgrowth.node.tsv
[2023-11-30T00:41:26Z INFO panacus::cli] running panacus on 4 threads
[2023-11-30T00:41:26Z INFO panacus::cli] constructing indexes for node/edge IDs, node lengths, and P/W lines..
[2023-11-30T00:43:19Z INFO panacus::cli] ..done; found 383935 paths/walks and 174028496 nodes
[2023-11-30T00:43:19Z INFO panacus::cli] loading data from group / subset / exclude files
[2023-11-30T00:43:19Z INFO panacus::abacus] loading coordinates from pig.giffa2.paths.haplotypes.txt
Error: Custom { kind: Unsupported, error: "data has 383917 path groups, but command is not supported for more than 65534" }
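The error message reports 383917 path groups against a limit of 65534, which suggests that nearly every path is being counted as its own group. The sketch below is only a way to estimate how many groups would remain if paths were grouped by sample or by haplotype. It assumes PanSN-style names of the form sample#haplotype#contig, which may not match the names in this GFA, and count_groups is a hypothetical helper, not a panacus function.

```python
def count_groups(path_names, level="sample"):
    """Count distinct groups among PanSN-style path names.

    Assumes names look like sample#haplotype#contig (an assumption).
    level="sample" groups on the first field, any other value groups
    on the first two fields (sample#haplotype).
    """
    depth = 1 if level == "sample" else 2
    return len({"#".join(name.split("#")[:depth]) for name in path_names})
```

For example, count_groups(open("paths.txt").read().split()) would show whether grouping by sample brings the number of groups below the limit quoted in the error; paths.txt is a placeholder file name.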
Hi @danydoerr :)
When compiling with
cargo --version
cargo 1.76.0 (c84b36747 2024-01-18)
I get the following error:
cargo build --release
Compiling thiserror-impl v1.0.40
Compiling getrandom v0.2.9
Compiling io-lifetimes v1.0.10
Compiling num_cpus v1.15.0
Compiling clap_derive v4.5.3
Compiling serde_json v1.0.114
Compiling rustc-serialize v0.3.24
Compiling regex v1.7.3
Compiling strum_macros v0.25.3
Compiling time v0.3.34
Compiling rustix v0.37.19
Compiling rand_core v0.6.4
Compiling rayon-core v1.11.0
error[E0310]: the parameter type `T` may not live long enough
--> /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rustc-serialize-0.3.24/src/serialize.rs:1155:5
|
1155 | fn decode<D: Decoder>(d: &mut D) -> Result<Cow<'static, T>, D::Error> {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| |
| the parameter type `T` must be valid for the static lifetime...
| ...so that the type `T` will meet its required lifetime bounds...
|
note: ...that is required by this bound
--> /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/borrow.rs:180:30
help: consider adding an explicit lifetime bound
|
1151 | impl<'a, T: ?Sized + 'static> Decodable for Cow<'a, T>
| +++++++++
Compiling rand_chacha v0.3.1
Compiling rand v0.8.5
Compiling rayon v1.7.0
For more information about this error, try `rustc --explain E0310`.
error: could not compile `rustc-serialize` (lib) due to 1 previous error
warning: build failed, waiting for other jobs to finish...
Any ideas? Thanks!
Hi,
I have created a pangenome using the minigraph-cactus pipeline in a manner very similar to how the HPRC created their pangenome (using T2T-CHM13 as the primary reference and including GRCh38 as an assembly). I then ran panacus and identified the number of base pairs added per assembly, and in total. The total number of base pairs added (coverage 1, quorum 0, last row) was 0.489 Gb. The length of the T2T-CHM13 reference is 3.11 Gb and the total length of the graph (determined using odgi stats) is 3.38 Gb. Am I wrong in thinking that the length of the reference, 3.11 Gb, plus the total bp added, 0.489 Gb, should equal the total length of the graph (3.38 Gb)? Or do I misunderstand? I've tried a variety of parameters and still get the same results.
The code:
grep -e '^W' $gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' > corrected.paths.txt
grep -e 'GRCh38' corrected.paths.txt > corrected.paths.grch38.txt
grep -ive 'grch38|chm13' corrected.paths.txt > corrected.paths.haplotypes.txt
panacus ordered-histgrowth --count bp --order $order --exclude corrected.paths.grch38.txt --subset corrected.paths.haplotypes.txt --output-format table --coverage 1,2,4,67 --groupby-sample --threads $SLURM_CPUS_ON_NODE $gfa > $gfa.ordered-histgrowth.bp.tsv
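For reference, the arithmetic in the question can be written out as below. This only restates the three reported numbers and makes no claim about the cause of the difference. The function name growth_gap is hypothetical.

```python
def growth_gap(reference_gb, added_gb, graph_gb):
    """Return (expected total, gap to the observed graph length) in Gb."""
    expected = reference_gb + added_gb
    return expected, expected - graph_gb


# Numbers as reported in the post above.
expected, gap = growth_gap(reference_gb=3.11, added_gb=0.489, graph_gb=3.38)
print(f"expected {expected:.3f} Gb, gap {gap:.3f} Gb")
```

With the reported values, the reference plus the added base pairs exceeds the graph length from odgi stats by roughly 0.22 Gb, which is the discrepancy being asked about.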
Hi,
How could I provide path coordinates to panacus for a pggb graph? Since there are no intervals in the GFA, does this function only work for Minigraph-Cactus graphs?
Hello Daniel,
I found that panacus is no longer compatible with the results of Minigraph-Cactus (MC), and the final visualized graph is incorrect. I think this problem needs to be completely solved. It is necessary to use the current Minigraph-Cactus pipeline to construct the pan-genome and use the final results for visualization.
Best wishes
Xuelei
Hi,
How do we merge stats from all chromosomes into a single plot?
Best regards
Zhigui
A very useful visualization to QC variant call sets has been this:
https://www.nature.com/articles/nature15394/figures/5
That is, plotting the number of variants (Y) vs their allele count (X) in a log-log plot. This should be a power law and hence a straight line. Would it be easy to extend Panacus to output such plots as well for bp/#node/#edges? At least for edges I would hope/expect to get a straight line as well if the graph is good. In any case, being able to produce such plots would be very valuable.
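As a rough illustration of the straight-line check described above, the sketch below fits a least-squares line to a histogram in log-log space and returns its slope. The input is a hypothetical list of (allele count, number of items) pairs; how such a histogram would be read from panacus output is not covered here, and loglog_slope is not an existing panacus function.

```python
import math


def loglog_slope(hist):
    """Least-squares slope of log10(count) against log10(allele count).

    hist: iterable of (allele_count, count) pairs; non-positive values
    are skipped because their logarithm is undefined.
    """
    pts = [(math.log10(x), math.log10(y)) for x, y in hist if x > 0 and y > 0]
    if len(pts) < 2:
        raise ValueError("need at least two positive points")
    n = len(pts)
    mean_x = sum(p[0] for p in pts) / n
    mean_y = sum(p[1] for p in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx
```

A perfect power law count = c * x^k gives a slope of k. Large residuals around the fitted line would indicate a departure from the straight line expected in the plot.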
Hi @danydoerr,
could you please provide an option to output the plots in separate PNG files?
@heringerp is working on making panacus a module for nf-core. However, in order for the results to be integrated into a MultiQC report, we need PNGs. PDFs are not supported.
Thanks!
Best,
Simon