Giter Club home page Giter Club logo

panacus's Introduction

Rust Build Anaconda-Server Badge Anaconda-Server Badge Anaconda-Server Badge

A Counting Tool for Pangenome Graphs

panacus is a counting tool for pangenome graphs

panacus is a tool for calculating statistics for GFA files. It supports GFA files with P and W lines, but requires that the graph is blunt, i.e., nodes do not overlap and consequently, each link (L) points from the end of one segment (S) to the start of another.

panacus supports the following calculations:

  • coverage histogram
  • pangenome growth statistics
  • path-/group-resolved coverage table

Dependencies

panacus is written in RUST and requires a working RUST build system for installation. See here for more details.

  • clap
  • itertools
  • quick-csv
  • rand
  • rayon
  • regex
  • rustc-hash
  • strum, strum_macros

panacus provides a Python script for visualizing the calculated counting statistics and requires the following Python libraries:

  • matplotlib
  • numpy
  • pandas
  • scikit-learn
  • scipy
  • seaborn

Installation

From bioconda channel

Make sure you have conda/mamba installed!

mamba install -c conda-forge -c bioconda panacus

From binary release

Linux x86_64

wget --no-check-certificate -c https://github.com/marschall-lab/panacus/releases/download/0.2.3/panacus-0.2.3_linux_x86_64.tar.gz
tar -xzvf panacus-0.2.3_linux_x86_64.tar.gz

# suggestion: add tool to path in your ~/.bashrc
export PATH="$(readlink -f panacus-0.2.3_linux_x86_64/bin)":$PATH

# you are ready to go! 
panacus --help

Mac OSX arm64

wget --no-check-certificate -c https://github.com/marschall-lab/panacus/releases/download/0.2.3/panacus-0.2.3_macos_arm64.tar.gz
tar -xzvf panacus-0.2.3_macos_arm64.tar.gz

# suggestion: add tool to path in your ~/.bashrc
export PATH="$(readlink -f panacus-0.2.3_macos_arm64/bin)":$PATH

# you are ready to go! 
panacus --help

From repository

git clone [email protected]:marschall-lab/panacus.git

cd panacus
cargo build --release

mkdir bin
ln -s ../target/release/panacus bin/
ln -s ../scripts/panacus-visualize.py bin/panacus-visualize

# suggestion: add tool to path in your ~/.bashrc
export PATH="$(readlink -f bin)":$PATH

# you are ready to go! 
panacus --help

Run

$ panacus
Calculate count statistics for pangenomic data

Usage: panacus <COMMAND>

Commands:
  histgrowth          Run in default mode, i.e., run hist and growth successively and output the results of the latter
  hist                Calculate coverage histogram from GFA file
  growth              Construct growth table from coverage histogram
  ordered-histgrowth  Compute growth table for order specified in grouping file (or, if non specified, the order of paths in the GFA file)
  table               Compute coverage table for count type
  help                Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

Pangenome Coverage and Growth Statistics

PGGB

Here's a quick example for computing coverage and pangenome growth statistics on the HPRC v.1.0 pggb, chr 22:

  1. Download and unpack the graph:
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/chroms/chr22.hprc-v1.0-pggb.gfa.gz
gunzip chr22.hprc-v1.0-pggb.gfa.gz
  1. Prepare file to select subset of paths corresponding to haplotypes:
grep '^P' chr22.hprc-v1.0-pggb.gfa | cut -f2 | grep -ve 'grch38\|chm13' > chr22.hprc-v1.0-pggb.paths.haplotypes.txt
  1. Run panacus histgrowth to calculate coverage and pangenome growth for nodes (default) with coverage/quorum thresholds 1/0, 2/0, 1/1, 1/0.5, and 1/0.1 using up to 4 threads:
RUST_LOG=info panacus histgrowth -t4 -l 1,2,1,1,1 -q 0,0,1,0.5,0.1 -S -a -s chr22.hprc-v1.0-pggb.paths.haplotypes.txt chr22.hprc-v1.0-pggb.gfa > chr22.hprc-v1.0-pggb.histgrowth.node.tsv
  1. Visualize coverage histogram and pangenome growth curve with estimated growth parameters:
panacus-visualize -e chr22.hprc-v1.0-pggb.histgrowth.node.tsv > chr22.hprc-v1.0-pggb.histgrowth.node.pdf

coverage histogram and pangenome growth of nodes in chr22.hprc-v1.0-pggb.gfa

Minigraph-Cactus

This example shows how to ccompute coverage and pangenome growth statistics for the HPRC v.1.1 mc, chr 22:

  1. Download the graph:
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.chroms/chr22.vg
  1. Convert to GFA (this graph is provided in VG format and requires conversion into GFA with vg:
vg view --gfa chr22.vg > chr22.hprc-v1.1-mc-grch38.gfa
  1. Prepare file to select subset of paths corresponding to haplotypes:
grep -e '^W' chr22.hprc-v1.1-mc-grch38.gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' > chr22.hprc-v1.1-mc-grch38.paths.txt
grep -ve 'grch38\|chm13' chr22.hprc-v1.1-mc-grch38.paths.txt > chr22.hprc-v1.1-mc-grch38.paths.haplotypes.txt
  1. Run panacus histgrowth to calculate coverage and pangenome growth for nodes (default) with coverage/quorum thresholds 1/0, 2/0, 1/1, 1/0.5, and 1/0.1 using up to 4 threads:
RUST_LOG=info panacus histgrowth -t4 -l 1,2,1,1,1 -q 0,0,1,0.5,0.1 -S -a -s chr22.hprc-v1.1-mc-grch38.paths.haplotypes.txt chr22.hprc-v1.1-mc-grch38.gfa > chr22.hprc-v1.1-mc-grch38.histgrowth.node.tsv
  1. Visualize coverage histogram and pangenome growth curve with estimated growth parameters:
panacus-visualize -e chr22.hprc-v1.1-mc-grch38.histgrowth.node.tsv > chr22.hprc-v1.1-mc-grch38.histgrowth.node.pdf

coverage histogram and pangenome growth of nodes in chr22.hprc-v1.1-mc-grch38.gfa

Generate an HTML report for your pangenome graph!

Instead of tab-separated tables, panacus supports for many commands also HTML output. The generated report page is interactive and self-contained.

  1. Follow steps 1-2 from above.
  2. Run panacus histgrowth with settings to output stats for all graph features (-c all), include coverage histogram in output (-a), and set output to HTML (-o html):
RUST_LOG=info panacus histgrowth -t4 -l 1,2,1,1,1 -q 0,0,1,0.5,0.1 -S -s chr22.hprc-v1.0-pggb.paths.haplotypes.txt -c all -a -o html chr22.hprc-v1.0-pggb.gfa > chr22.hprc-v1.0-pggb.histgrowth.html

πŸ‘‰ πŸ‘‰ πŸ‘‰ view the resulting HTML report here!

panacus report (coverage histogram) for chr22.hprc-v1.0-pggb.gfa

Figure legend

  1. Navigate between coverage histograms for bp, node, and edge through tabs
  2. Toggle log-scale on Y-axis
  3. Download plot as PNG file
  4. Download raw data as tab-separated-values (TSV) file
  5. Choose between light and dark theme
  6. Proceed to pangenome growth plots

panacus report (pangenome growth) for chr22.hprc-v1.0-pggb.gfa

Figure legend

  1. Navigate between coverage histograms for bp, node, and edge through tabs
  2. Disable curves that you do not want to view by clicking on legend
  3. Download plot as PNG file
  4. Download raw data as tab-separated-values (TSV) file
  5. Choose between light and dark theme

Ordered Pangenome Growth Statistics

Sometimes it is interesting to look at the pangenome growth when samples are processed in a specific order rather than considering all all possible orders. panacus' capability to construct such plots is illustrated here by the example of the GRCh38-based HPRC v.1.0 minigraph-cactus graph (all chromosomes). The example reproduces Figure 3g(left) from the publication A draft human pangenome reference that quantifies pangenome growth of the amount of non-reference (GRCh38) sequence of the minigraph-cactus based human pangenome reference graph.

  1. Download and unpack the graph:
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.gfa.gz
gunzip hprc-v1.0-mc-grch38.gfa.gz
  1. Establish order of samples in the growth statistics:
echo 'HG03492 HG00438 HG00621 HG00673 HG02080 HG00733 HG00735 HG00741 HG01071 HG01106 HG01109 HG01123 HG01175 HG01243 HG01258 HG01358 HG01361 HG01928
HG01952 HG01978 HG02148 HG01891 HG02055 HG02109 HG02145 HG02257 HG02486 HG02559 HG02572 HG02622 HG02630 HG02717 HG02723 HG02818 HG02886 HG03098
HG03453 HG03486 HG03516 HG03540 HG03579 NA18906 NA20129 NA21309' | tr ' ' '\n' > hprc-v1.0-mc-grch38.order.samples.txt
  1. Exclude paths from reference genome GRCh38
grep '^P' hprc-v1.0-mc-grch38.gfa | cut -f2 | grep -ie 'grch38' > hprc-v1.0-mc-grch38.paths.grch38.txt
  1. Run panacus ordered-histgrowth to calculate pangenome growth for basepairs with coverage thresholds 1,2,3, and 42 using up to 4 threads:
RUST_LOG=info panacus ordered-histgrowth -c bp -t4 -l 1,2,3,42 -S -e hprc-v1.0-mc-grch38.paths.grch38.txt hprc-v1.0-mc-grch38.gfa > hprc-v1.0-mc-grch38.ordered-histgrowth.bp.tsv

(The log will report some errors regarding missing order information of CHM13 paths. These paths will be ignored in the plot, which is the intended behavior of this command line call)

  1. Visualize growth curve and estimate growth parameters :
panacus-visualize hprc-v1.0-mc-grch38.ordered-histgrowth.bp.tsv > hprc-v1.0-mc-grch38.ordered-histgrowth.bp.pdf

ordered pangenome growth of bps in hprc-v1.0-mc-grch38.gfa

panacus's People

Contributors

andreaguarracino avatar danydoerr avatar lucaparmigiani avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

panacus's Issues

path coordinates

Hi,

How could I provide path coordinates for panacus from pggb graph? Since it didn't have an interval in the GFA, this function only works for Minigraph-Cactus graph?

Option to output plots in separate PNG files

Hi @danydoerr,

could you please provide an option to output the plots in separate PNG files?
@heringerp is working on making panacus a module for nf-core. However, in order for the results to be integrated into a MultiQC report, we need PNGs. PDFs are not supported.

Thanks!

Best,
Simon

Could you please make an new release?

In order to use the new file format feature in pipeline development, we need a new release, because we can only work with a Conda package......
Would you please provide a new one @danydoerr ?

Thanks!

panacus-visualize.py is overwhelmed by 1000 haplotypes

Hi there :)
I applied panacus-visualize.py to a histgrowth output of 1000 haplotypes, but the PDF is not showing any colors and some weird x-axis labels:

image

The TSV input is available for 10 days at https://fex.belwue.de/fop/rFYpUmCn/chr19.1000.fa.gz.gfaffix.unchop.Ygs.og.crush.gfa.histgrowth.tsv.

panacus-visualize.py -e -l "lower right" chr19.1000.fa.gz.gfaffix.unchop.Ygs.og.crush.gfa.histgrowth.tsv > chr19.1000.fa.gz.gfaffix.unchop.Ygs.og.crush.gfa.histgrowth.tsv.pdf

Next I want to try it on a data set with ~2k sequences.
Thanks for any feedback :)

What is the meaning of common and consensus?

Hi,

Thanks for this approach to check the GFA file's growth plot.
But I am curious the meaning of "common" and "consensus".
Which means the number that all accessions contained? I guess "common".
What about "consensus"?

Best wishes,
Lan

how is panacus treating Ns

Assuming I have 100 haplotypes and each brings in 1000 Ns. Would this lead to a steep growth curve or is panacus ignoring the Ns?

What does the output content represent

When I used the pangenome-growth command, I got a 5-column data, and what do the last three columns represent?When I build a sample file, I only need to write the sample name instead of using the haplotype form of sample name. 1.2. Is this a requirement for the sample file that was previously constructed as a pangenome?

Feature request: Alternative plot with #nodes/#edges vs AC

A very useful visualization to QC variant call sets has been this:
https://www.nature.com/articles/nature15394/figures/5
That is, plotting the number of variants (Y) vs their allele count (X) in a log-log plot. This should be a power law and hence a straight line. Would it be easy to extend Panacus to output such plots as well for bp/#node/#edges? At least for edges I would hope/expect to get a straight line as well if the graph is good. In any case, being able to produce such plots would be very valuable.

Request software updates

Hello Daniel,

I found that the version of the panacus is no longer compatible with the results of Mingraph-Cactus (MC), and the final visualized graph is incorrect. I think this problem needs to be completely solved. It is necessary to use the current Mingraph-Cactus pipeline to construct the pan-genome and use the final results for visualization.

Best wishes
Xuelei

A problem while running panacus-visualize

I was trying to run panacus-visualize with my data but it gave me this error
Traceback (most recent call last): File "/home/gustavo/panacus-0.2.3_linux_x86_64/bin/panacus-visualize", line 16, in <module> from matplotlib.transforms import Bbox ModuleNotFoundError: No module named 'matplotlib'

and in the file is this code from matplotlib.transforms import Bbox
And I tried to run it with your example and it got the same error.

Thanks for your time.

AttributeError: 'DataFrame' object has no attribute 'cumulative'

Hi @danydoerr,

I tried to visualize my pangenome-growth output with your Python script, however, I ran into the following problem:

python3 ~/software/panacus/git/master/scripts/visualize_growth.py chrM.hprc-v1.0-pggb.og.pangenome-growth_r100.txt
Traceback (most recent call last):
  File "/home/ubuntu/software/panacus/git/master/scripts/visualize_growth.py", line 66, in <module>
    plot(df, path.basename(args.growth_stats.name), out)
  File "/home/ubuntu/software/panacus/git/master/scripts/visualize_growth.py", line 35, in plot
    plt.bar(x=xticks, height=df.cumulative)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'cumulative'

chrM.hprc-v1.0-pggb.og.pangenome-growth_r100.txt

Thanks for any tips!

How to Visualize the results of the minigraph-cactus?

Dear developer:
Now we have found a problem, according to the example you provided, we can not extract the results of MC for visual analysis, including #4.
Our problem is that the path file obtained in the first step is empty, but we can extract the path file after using vg to convert to version 1.0, I would like to ask why?
code:
grep '^P' test.full.gfa | cut -f2 > test.paths.txt

command is not supported for more than 65534

Dear developer:
We are conducting the collection test of our real data according to the example data, but we have encountered some problems and hope to get your help. The errors are as follows.
Best day!

step1 is ok

grep '^P' test.giffa2.1.0.gfa | cut -f2 | grep -ve 'refernece' > test.giffa2.paths.haplotypes.txt

step2 is erro

RUST_LOG=info /home/test/Software/panacus-0.2.3_linux_x86_64/bin/panacus histgrowth -t 4 -l 1,2,1,1,1 -q 0,0,1,0.5,0.1 -S -a -s test.giffa2.paths.haplotypes.txt test.giffa2.1.0.gfa > test.giffa2.histgrowth.node.tsv

erro

[2023-11-30T00:41:26Z INFO panacus::cli] running panacus on 4 threads
[2023-11-30T00:41:26Z INFO panacus::cli] constructing indexes for node/edge IDs, node lengths, and P/W lines..
[2023-11-30T00:43:19Z INFO panacus::cli] ..done; found 383935 paths/walks and 174028496 nodes
[2023-11-30T00:43:19Z INFO panacus::cli] loading data from group / subset / exclude files
[2023-11-30T00:43:19Z INFO panacus::abacus] loading coordinates from pig.giffa2.paths.haplotypes.txt
Error: Custom { kind: Unsupported, error: "data has 383917 path groups, but command is not supported for more than 65534" }

compiler error in rustc-serialize

Hi @danydoerr :)

When compiling with

cargo --version
cargo 1.76.0 (c84b36747 2024-01-18)

I get the following error:

cargo build --release
   Compiling thiserror-impl v1.0.40
   Compiling getrandom v0.2.9
   Compiling io-lifetimes v1.0.10
   Compiling num_cpus v1.15.0
   Compiling clap_derive v4.5.3
   Compiling serde_json v1.0.114
   Compiling rustc-serialize v0.3.24
   Compiling regex v1.7.3
   Compiling strum_macros v0.25.3
   Compiling time v0.3.34
   Compiling rustix v0.37.19
   Compiling rand_core v0.6.4
   Compiling rayon-core v1.11.0
error[E0310]: the parameter type `T` may not live long enough
    --> /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rustc-serialize-0.3.24/src/serialize.rs:1155:5
     |
1155 |     fn decode<D: Decoder>(d: &mut D) -> Result<Cow<'static, T>, D::Error> {
     |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     |     |
     |     the parameter type `T` must be valid for the static lifetime...
     |     ...so that the type `T` will meet its required lifetime bounds...
     |
note: ...that is required by this bound
    --> /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/borrow.rs:180:30
help: consider adding an explicit lifetime bound
     |
1151 | impl<'a, T: ?Sized + 'static> Decodable for Cow<'a, T>
     |                    +++++++++

   Compiling rand_chacha v0.3.1
   Compiling rand v0.8.5
   Compiling rayon v1.7.0
For more information about this error, try `rustc --explain E0310`.
error: could not compile `rustc-serialize` (lib) due to 1 previous error
warning: build failed, waiting for other jobs to finish...

Any ideas? THanks!

Installation instructions

Nowadays, conda should be replaced by mamba, which is much faster and also solves dependencies in a more optimal way. Further, the bioconda channel always should be combined with the conda-forge channel because it is build against that instead of the defaults channel. Something like mamba install -c conda-forge -c bioconda panacus. For getting a mamba installation you can refer to the mambaforge distribution.

Haplotype labels in TSV, visualisations ?

Hello,
Great tool.
Is it possible to display the haplotypes labels from GFA P-lines or W-lines into the TSV outputs, as well as the HTML/PDF outputs ?

I tried with different GFA files, containing either P-lines or W-lines.
However, the output always lists the haplotypes on X axes with integers from 1 to n.

If this is not implemented, does the enumeration corresponds to the order of the lines in the GFA ? or from the haplotype.txt files ?

Thanks

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.