
pggb's Introduction


pggb: the PanGenome Graph Builder


pggb builds pangenome variation graphs from a set of input sequences.

A pangenome variation graph is a kind of generic multiple sequence alignment. It lets us understand any kind of sequence variation between a collection of genomes. It shows us similarity where genomes walk through the same parts of the graph, and differences where they do not.

pggb generates this kind of graph using an all-to-all alignment of input sequences (wfmash), graph induction (seqwish), and progressive normalization (smoothxg, gfaffix). After construction, pggb generates diagnostic visualizations of the graph (odgi). A variant call report (in VCF) representing both small and large variants can be generated based on any reference genome included in the graph (vg). pggb writes its output in GFAv1 format, which can be used as input by numerous "genome graph" and pangenome tools, such as the vg and odgi toolkits.

pggb has been tested at scale in the Human Pangenome Reference Consortium (HPRC) as a method to build a graph from the draft human pangenome.

Documentation is available at https://pggb.readthedocs.io/; see also the pggb manuscript (WIP).

quick start

  1. Install pggb with Docker, Singularity, bioconda, guix, or by manually building its dependencies.

  2. Put your sequences in one FASTA file (in.fa), optionally compress it with bgzip, and index it with samtools faidx. If you have many samples and/or haplotypes, we recommend using the PanSN prefix naming pattern (see the preparation sketch below).

  3. [OPTIONAL] If you have whole-genome assemblies, you might consider partitioning your sequences into communities, which usually correspond to the different chromosomes of the genomes. Then, you can run pggb on each community (set of sequences) independently (see partition before pggb).

  4. To build a graph in the directory output from in.fa, which here contains 9 haplotypes, scaffolding the graph with 5 kb matches at >= 90% identity and using 16 parallel threads, execute:

pggb -i in.fa \       # input file in FASTA format
     -o output \      # output directory
     -n 9 \           # number of haplotypes (optional with PanSN-spec)
     -t 16 \          # number of threads
     -p 90 \          # minimum average nucleotide identity for segments
     -s 5k \          # segment length for scaffolding the graph
     -V 'ref:1000'    # make a VCF against "ref" decomposing variants <= 1000bp
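
For step 2, here is a minimal preparation sketch. It assumes the fastix tool for PanSN renaming (names follow sample#haplotype#contig, e.g. HG002#1#chr1); the input file names are hypothetical:

fastix -p 'HG002#1#' HG002.hap1.fa >> in.fa   # prefix contig names with sample#haplotype#
fastix -p 'HG002#2#' HG002.hap2.fa >> in.fa
bgzip -@ 4 in.fa                              # optional compression
samtools faidx in.fa.gz                       # index for pggb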

The final output will match output/in.fa.*final.gfa. By default, we render 1D and 2D visualizations of the graph with odgi, which are very useful for understanding the result of the build.

See also this step-by-step example for more information.

partition before pggb

In the above example, to partition your sequences into communities, execute:

partition-before-pggb -i in.fa \       # input file in FASTA format
                      -o output \      # output directory
                      -n 9 \           # number of haplotypes (optional with PanSN-spec)
                      -t 16 \          # number of threads
                      -p 90 \          # minimum average nucleotide identity for segments
                      -s 5k \          # segment length for scaffolding the graph
                      -V 'ref:1000'    # make a VCF against "ref" decomposing variants <= 1000bp

This generates the command lines to run pggb on each community (2 in this example) independently:

pggb -i output/in.fa.dd9e519.community.0.fa \
     -o output/in.fa.dd9e519.community.0.fa.out \
     -s 5k -l 25000 -p 90 -n 9 -K 19 -F 0.001 \
     -k 19 -f 0 -B 10000000 \
     -H 9 -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     -V ref:1000 --threads 16 --poa-threads 16
pggb -i output/in.fa.dd9e519.community.1.fa \
     -o output/in.fa.dd9e519.community.1.fa.out \
     -s 5k -l 25000 -p 90 -n 9 -K 19 -F 0.001 \
     -k 19 -f 0 -B 10000000 \
     -H 9 -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     -V ref:1000 --threads 16 --poa-threads 16

See also the sequence partitioning tutorial for more information.

citation

Erik Garrison*, Andrea Guarracino*, Simon Heumos, Flavia Villani, Zhigui Bao, Lorenzo Tattini, Jörg Hagmann, Sebastian Vorbrugg, Santiago Marco-Sola, Christian Kubica, David G. Ashbrook, Kaisa Thorell, Rachel L. Rusholme-Pilcher, Gianni Liti, Emilio Rudbeck, Sven Nahnsen, Zuyu Yang, Mwaniki N. Moses, Franklin L. Nobrega, Yi Wu, Hao Chen, Joep de Ligt, Peter H. Sudmant, Nicole Soranzo, Vincenza Colonna, Robert W. Williams, Pjotr Prins, Building pangenome graphs, bioRxiv 2023.04.05.535718; doi: https://doi.org/10.1101/2023.04.05.535718

usage

Pangenome graphs let us understand multi-way relationships between many genomes. They are thus models of many-way sequence alignments. A goal of pggb is to reduce the complexity of making these alignments, which are notoriously difficult to design.

key parameters

The overall structure of pggb's output graph is defined by two parameters: segment length (-s) and pairwise identity (-p). Segment length defines the seed length used by the "MashMap3" homology mapper in wfmash. The pairwise identity is the minimum allowed pairwise identity between seeds, which is estimated using a mash-type approximation based on k-mer Jaccard. Mappings are initiated from collinear chains of around 5 seeds (-l, --block-length), and extended greedily as far as possible, allowing up to -n minus 1 mappings at each query position. Genome number (-n) is automatically computed if sequence names follow PanSN-spec.

An additional parameter, -k, can also greatly affect graph structure by pruning matches shorter than a given threshold from the initial graph model. In effect, -k N removes any match shorter than N bp from the initial alignment. This filter removes potentially ambiguous pairwise alignments from consideration when establishing the initial scaffold of the graph.

The initial graph is defined by parameters to wfmash and seqwish. But due to the ambiguities generated across the many pairwise alignments we use as input, this graph can be locally very complex. To regularize it we orchestrate a series of graph transformations. First, with smoothxg, we "smooth" it by locally realigning sequences to each other with a traditional multiple sequence alignment (we specifically apply POA). This process repeats multiple times to smooth over any boundary effects that may occur due to binning errors near MSA boundaries. Finally, we apply gfaffix to remove forks where both alternatives have the same sequence.

bringing your pangenome into focus

We suggest using default parameters for initial tests. For instance, pggb -i in.fa.gz -o out would be a minimal build command for a pangenome from in.fa.gz. The default parameters provide a good balance between runtime and graph quality for small-to-medium (1 kbp-100 Mbp) problems.

However, we find that parameters may still need to be adjusted to fine-tune pggb to a given problem. Furthermore, it is a good idea to explore a range of settings of the pipeline's key graph-shaping parameters (-s, -p, -k) to appreciate the degree to which changes in parameters may or may not affect results. These parameters must be tuned so that the graph resolves structures of interest (variants and homologies) that are implied by the sequence space of the pangenome.
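
For example, a small parameter sweep can be scripted directly (a sketch; the ranges are illustrative and in.fa.gz is a placeholder):

for s in 5k 10k; do
  for p in 90 95; do
    for k in 19 29 47; do
      pggb -i in.fa.gz -o out.s$s.p$p.k$k -s $s -p $p -k $k -t 16
    done
  done
done

Comparing the resulting graph statistics and visualizations across runs shows how sensitive your pangenome is to each parameter.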

example builds for diverse species

In preparing a manuscript on pggb, we have developed a set of example pangenome builds for a collection of diverse species. (These also use cross-validation against nucmer to evaluate graph quality.)

Examples (-n is optional if sequence names follow PanSN-spec):

  • Human, whole genome, 90 haplotypes: pggb -p 98 -s 10k -k 47 [-n 90]...
  • 15 Helicobacter genomes, 5% divergence: pggb -k 47 [-n 15]; and 15 at higher (10%) divergence: pggb -k 23 [-n 15] ...
  • Yeast genomes, 5% divergence: pggb's defaults should work well.
  • Aligning 9 MHC class II assemblies from vertebrate genomes (5-10% divergence): pggb -k 29 [-n 9] ...
  • A few thousand bacterial genomes: pggb -x auto [-n 2000] .... In general, mapping sparsification (-x auto) is a good idea when you have many hundreds to thousands of genomes.

pggb defaults to using as many threads as there are logical processors on the system (the count given by getconf _NPROCESSORS_ONLN). Use -t to set an appropriate level of parallelism if you can't use all the processors on your system.
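
For example, to leave a few processors free for other work (a sketch):

pggb -i in.fa.gz -o out -t $(( $(getconf _NPROCESSORS_ONLN) - 4 ))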

visual diagnostics

To guide graph building, pggb provides several visual diagnostic outputs. Here, we explain them using a test from the data/HLA directory in this repo:

git clone --recursive https://github.com/pangenome/pggb
cd pggb
./pggb -i data/HLA/DRB1-3123.fa.gz -p 70 -s 500 -n 10 -t 16 -V 'gi|568815561' -o out -M

This yields a variation graph in GFA format, a multiple sequence alignment in MAF format (-M), and several diagnostic images (all in the directory out/). We specify -n because the sequences do not follow PanSN-spec, so the number of haplotypes cannot be automatically computed. We also call variants with -V with respect to the reference gi|568815561.

1D graph visualization

odgi viz rendering of DRB1-3123 graph

  • The graph nodes are arranged from left to right, forming the pangenome’s sequence.
  • The colored bars represent the binned, linearized renderings of the embedded paths versus this pangenome sequence in a binary matrix.
  • The black lines under the paths, so-called links, represent the topology of the graph.
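
You can regenerate this rendering yourself. A sketch, assuming the final graph from the run above (the exact file name includes run-specific hashes):

odgi build -g out/DRB1-3123.fa.gz.*.smooth.final.gfa -o out/graph.og   # load the GFA into odgi's format
odgi viz -i out/graph.og -o out/graph.1d.png                           # 1D rendering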

2D graph visualization

odgi layout rendering of DRB1-3123 graph

  • Each colored rectangle represents a node of a path. The node’s x-coordinates are plotted on the x-axis and its y-coordinates on the y-axis.
  • A bubble indicates that some paths diverge in sequence at that point, or it can represent a repeat region.
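
The 2D rendering can likewise be reproduced (a sketch; it reuses the graph.og built above):

odgi layout -i out/graph.og -o out/graph.lay                     # compute the 2D layout
odgi draw -i out/graph.og -c out/graph.lay -p out/graph.2d.png   # draw it to a PNG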

extension

The pipeline is provided as a single script with configurable command-line options. Users should consider taking this script as a starting point for their own pangenome project. For instance, you might consider swapping out wfmash for minimap2 or another PAF-producing long-read aligner (see the sketch below). If the graph is small, it might also be possible to use abPOA or spoa to generate it directly. On the other hand, maybe you're starting with an assembly overlap graph, which can be converted to blunt-ended GFA using gimbricate. You might have a validation process based on alignment of sequences to the graph, which should be added at the end of the process.
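
As an illustration of the aligner swap, here is a hedged sketch using minimap2 and seqwish (the preset and flags are suggestions, not the pipeline's tested defaults):

minimap2 -cx asm20 -X -t 16 in.fa in.fa > in.paf   # all-vs-all PAF with CIGARs (-c), skipping self mappings (-X)
seqwish -s in.fa -p in.paf -g out.gfa -t 16        # induce the graph from the alignments

The resulting GFA would still need normalization (e.g. smoothxg, gfaffix) to approach the quality of pggb's output.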

downstream

The resulting graph can then be manipulated with odgi for transformation, analysis, simplification, validation, interrogation, and visualization. It can also be loaded into any of the GFA-based mapping tools, including vg map, mpmap, giraffe, and GraphAligner. Alignments to the graph can be used to make variant calls (vg call) and coverage vectors over the pangenome, which can be useful for phylogeny and association analyses. Using odgi matrix, we can render the graph in a sparse matrix format suitable for direct use in a variety of statistical frameworks, including phylogenetic tree construction, PCA, or association studies.
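
A sketch of a few such downstream steps (file names are placeholders; check each tool's --help for exact flags):

odgi stats -i graph.og -S                        # summary statistics for the graph
odgi matrix -i graph.og > graph.matrix.tsv       # sparse matrix for PCA / phylogenetics
vg deconstruct -P ref graph.gfa > variants.vcf   # VCF against paths whose names start with "ref"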

scalability

pggb's initial use is as a mechanism to generate variation graphs from the contig sets produced by the human pangenome project. Although its design represents efforts to scale these approaches to collections of many human genomes, it is not intended to be human-specific.

principles

pangenome alignment

It's straightforward to generate a pangenome graph by the all-pairs alignment of a set of input sequences. And it's good: this provides a completely unbiased view of the pangenome. However, all-to-all alignment scales very poorly, with the square of the number of sequences, making it untenable for even small collections of genomes (tens). To make matters worse, existing standards for alignment are based on k-mer matching, which can require processing enormous numbers of seeds when aligning repetitive sequences.

We address this alignment problem with the mashmap- and WFA-based algorithms in wfmash. wfmash provides a practical way to generate alignments between the sequences. Its design focuses on the long, collinear homologies that are often "orthologous" in a phylogenetic context. Crucially, wfmash is robust to repetitive sequences, and it can be adjusted using probabilistic thresholds for segment alignment identity. Finally, seqwish converts the alignments directly into a graph model. This allows us to define the base graph structure using a few free parameters, as discussed above.

graph normalization

The manifold nature of typical variation graphs means that they are very likely to look linear locally. By running a stochastic 1D layout algorithm that attempts to match graph distances (as given by paths) between nodes and their distances in the layout, we execute a kind of multi-dimensional scaling (MDS). In the aggregate, we see that regions that are linear (the chains of nodes and bubbles) in the graph tend to co-localize in the 1D sort. Applying an MSA algorithm (in this case, abPOA or spoa) to each of these chunks enforces a local linearity and homogenizes the alignment representation. This smoothing step thus yields a graph that is locally as we expect: partially ordered, and linear as the base DNA molecules are, but globally can represent large structural variation. The homogenization also rectifies issues with the initial wfa-based alignment.

installation

manual-mode

You'll need wfmash, seqwish, smoothxg, odgi, and gfaffix in your shell's PATH. These can be individually built and installed. Then, put the pggb bash script in your path to complete installation. Optionally, install bcftools, vcfbub, vcfwave, and vg for calling and normalizing variants, MultiQC for generating summarized statistics in a MultiQC report, or pigz to compress the output files of the pipeline.
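
A quick way to verify that everything is on PATH (a sketch):

for tool in wfmash seqwish smoothxg odgi gfaffix bcftools vcfbub vcfwave vg multiqc pigz; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done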

Docker

To simplify installation and versioning, we have an automated GitHub action that pushes the current docker build to the GitHub registry. To use it, first pull the latest image (IMPORTANT: see also how to build docker locally):

docker pull ghcr.io/pangenome/pggb:latest

Or if you want to pull a specific snapshot from https://github.com/orgs/pangenome/packages/container/package/pggb:

docker pull ghcr.io/pangenome/pggb:TAG

You can also pull the docker image from DockerHub:

docker pull pangenome/pggb

As an example, after cloning the repository and entering the pggb directory

git clone --recursive https://github.com/pangenome/pggb.git
cd pggb

you can run the container using the human leukocyte antigen (HLA) data provided in this repo:

docker run -it -v ${PWD}/data/:/data ghcr.io/pangenome/pggb:latest /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"

The -v argument of docker run always expects an absolute path for the host directory. Using ${PWD} takes care of this.

build docker locally

Several of pggb's tools use SIMD instructions that require AVX (like abPOA) or need it to improve performance. The currently built docker image has -Ofast -march=sandybridge set. This means that the docker image can run on any processor that supports AVX or later, improving portability, but it prevents your system's hardware from being fully exploited. In practice, this can mean that specific tools run up to 9 times slower, and that the whole pipeline runs ~30% slower than with a natively built docker image.

To achieve better performance, it is STRONGLY RECOMMENDED to build the docker image locally after replacing -march=sandybridge with -march=native and the Generic build type with Release in the Dockerfile:

sed -i 's/-march=sandybridge/-march=native/g' Dockerfile 
sed -i 's/Generic/Release/g' Dockerfile 

To build a docker image locally using the Dockerfile, execute:

docker build --target binary -t ${USER}/pggb:latest .

Staying in the pggb directory, we can run pggb with the locally built image:

docker run -it -v ${PWD}/data/:/data ${USER}/pggb /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"

A script that handles the whole building process automatically can be found at https://github.com/nf-core/pangenome#building-a-native-container.

Singularity

Many managed HPCs utilize Singularity as a secure alternative to docker. Fortunately, docker images can be run through Singularity seamlessly.

First pull the docker image and convert it into a Singularity SIF image. This might take a few minutes.

singularity pull docker://ghcr.io/pangenome/pggb:latest

Next, clone the pggb repo and cd into it:

git clone --recursive https://github.com/pangenome/pggb.git
cd pggb

Finally, run pggb from the Singularity image. For Singularity to be able to read and write files in a directory on the host operating system, we need to 'bind' that directory using the -B option and pass the pggb command as an argument.

singularity run -B ${PWD}/data:/data ../pggb_latest.sif pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out

A script that handles the whole building process automatically can be found at https://github.com/nf-core/pangenome#building-a-native-container.

Bioconda (VCF NORMALIZATION IS CURRENTLY MISSING, SEE THIS ISSUE)

pggb recipes for Bioconda are available at https://anaconda.org/bioconda/pggb. To install the latest version using Conda, execute:

conda install -c bioconda pggb

If you run into dependency problems, try:

conda install -c bioconda -c conda-forge pggb
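
Alternatively, you may prefer to isolate pggb in its own environment (a sketch):

conda create -n pggb -c conda-forge -c bioconda pggb
conda activate pggb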

guix

git clone https://github.com/ekg/guix-genomics
cd guix-genomics
GUIX_PACKAGE_PATH=. guix package -i pggb

nextflow

A Nextflow DSL2 port of pggb is being developed by the nf-core community. See nf-core/pangenome for more details.

issues

Resolving the 'Illegal option --' error with Singularity on HPC (thanks to Rachel Rusholme Pilcher)

When running a Singularity container on an Alma Linux HPC node, you may encounter an 'Illegal option --' error. This issue arises from an incompatibility between the Singularity container and the Alma Linux environment.

The which function from the Alma Linux 9 host machine is inherited by the Singularity container during execution. However, the container's Debian operating system may not be compatible with this function, leading to the error. Modifying the container's definition file (.def) alone is insufficient to resolve this issue persistently.

To address this problem, you have two options:

  1. Unset the which function before running the container:

    Before executing your Singularity container, run the following command on the host machine:

    unset -f which
    

    This command unsets the which function, preventing it from being inherited by the container.

  2. Use the -e flag when running the container:

    Execute your Singularity container using the -e flag:

    singularity exec -e ...
    

    The -e flag prevents any host environment variables and functions from being inherited by the container.

reporting

MultiQC

Many thanks go to @Zethson and @Imipenem, who started implementing a MultiQC module for odgi stats. With -m, --multiqc, statistics are generated automatically and summarized in a MultiQC report. If created, visualizations and layouts are integrated into the report too. Below is an example excerpt:

MultiQC example report

installation

pip install multiqc --user

The docker image already contains v1.11 of MultiQC.
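
As a sketch, a report can be produced as part of any build, or regenerated by hand afterwards:

pggb -i in.fa.gz -o out -m   # -m/--multiqc: emit odgi stats and a MultiQC report
multiqc out/ -o out/         # optional: re-run MultiQC on the output directory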

authors

Erik Garrison*, Andrea Guarracino*, Simon Heumos, Flavia Villani, Zhigui Bao, Lorenzo Tattini, Jörg Hagmann, Sebastian Vorbrugg, Santiago Marco-Sola, Christian Kubica, David G. Ashbrook, Kaisa Thorell, Rachel L. Rusholme-Pilcher, Gianni Liti, Emilio Rudbeck, Sven Nahnsen, Zuyu Yang, Mwaniki N. Moses, Franklin L. Nobrega, Yi Wu, Hao Chen, Joep de Ligt, Peter H. Sudmant, Nicole Soranzo, Vincenza Colonna, Robert W. Williams, Pjotr Prins

license

MIT

pggb's People

Contributors

andreaguarracino, asleonard, ekg, joehagmann, kdm9, mbhall88, petersudmant, subwaystation, thomasraulet


pggb's Issues

Issue with odgi

Hello all,

I am having an error with pggb 0.1.3 in the smoothxg phase. Apparently it could not find the flag 'O' in the odgi build options:

warning [libhandlegraph]: Serialized object does not appear to match deserialzation type.
warning [libhandlegraph]: It is either an old version or in the wrong format.
warning [libhandlegraph]: Attempting to load it anyway. Future releases will reject it!
terminate called after throwing an instance of 'std::runtime_error'
 what():  Error rewinding to load non-magic-prefixed "SerializableHandleGraph"

This is the command line:
pggb -i test.fa -s1000 -n10 -p90 -U -v -L -S -V -o /PGGB

input file: test.txt
output file: pggb_test_9105756out.txt

Thank you in advance.

Can't build the wfmash dependency

After git-cloning the repo and attempting to run the de-novo graph generation workflow, the program fails to find wfmash, which, indeed, I hadn't installed. I tried building wfmash from the GitHub repo, but gave up after a bunch of linker errors (see waveygang/wfmash#87). Would it be possible to bundle a wfmash static binary with the pggb release?

Thanks!

Error with latest PGGB docker build

Hi

I see you've just updated the docker build. I've tried it out and noticed that I get errors with the parameters now:

pggb -i barley_chromosome_7H.fasta -s 100000 -p 85 -a 85 -n 18 -t 16 -m -v -l -S -o barley_pangenome_7H
Argument 'N' received invalid value type '-S'
  edyeet [target] [queries...] {OPTIONS}

    edyeet: base-accurate alignments using edlib and mashmap2

  OPTIONS:

Command terminated by signal 4
edyeet -X -s 100000 -l -S -p 85 -n 18 -a 85 -k 16 -t 16 barley_chromosome_7H.fasta barley_chromosome_7H.fasta
0.00s user 0.02s system 10% cpu 0.27s total 4492Kb max memory

I've also tried running with parameters the same as mentioned in issue #51

pggb -i barley_chromosome_7H.fasta -s 100000 -p 85 -a 85 -n 18 -t 16 -v -L -o out_test_file

Command terminated by signal 4
edyeet -X -s 100000 -l 300000 -p 85 -n 18 -a 85 -k 16 -t 16 barley_chromosome_7H.fasta barley_chromosome_7H.fasta
0.00s user 0.03s system 18% cpu 0.19s total 4908Kb max memory

The error seems to happen as soon as edyeet is run. I also tried running by accessing edyeet directly, and I get an error:

singularity exec --bind ${PWD}:${PWD} /data/pggb_builds/pggb.sif edyeet
Illegal instruction (core dumped)

smoothxg Command terminated by signal 4/exit status 132

Does smoothxg exit properly when run from the Docker container?

I see Command terminated by signal 4 below

$ docker run -it \
  -v ${PWD}/data/:/data ghcr.io/pangenome/pggb:latest \
  "pggb -i /data/HLA/A-3105.fa.gz -s 3000 -K 11 -p 70 -a 70 -n 10 -t 2 -v -l"

...
[smoothxg::break_blocks] cut 0 blocks of which 0 had repeats
Command terminated by signal 4plying spoa to block 0/25 0.000%
	Command being timed: "smoothxg -t 2 -g /data/HLA/A-3105.fa.gz.pggb-s3000-p70-n10-a70-K11-k8-w10000-j5000-W0-e5000.seqwish.gfa -w 10000 -j 5000 -k 0 -e 5000 -l 10000"
	User time (seconds): 0.89
	System time (seconds): 0.44
	Percent of CPU this job got: 15%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.91
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 15068
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 20
	Minor (reclaiming a frame) page faults: 4819
	Voluntary context switches: 15828
	Involuntary context switches: 1436
	Swaps: 0
	File system inputs: 6168
	File system outputs: 5872
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

and exit status 132/Illegal instruction when attempting to run via Docker from a Nextflow workflow

$ nextflow run main.nf \
  -with-docker ghcr.io/pangenome/pggb:latest \
  --input_fasta data/HLA/A-3105.fa.gz \
  --segment_length 3000 --mash_kmer 11 --map_pct_id 70 --align_pct_id 70 --n_secondary 10

N E X T F L O W  ~  version 20.07.1
Launching `main.nf` [cheeky_mclean] - revision: 661863f1b0
executor >  local (3)
[52/639d59] process > edyeet (1)   [100%] 1 of 1 ✔
[89/eb2666] process > seqwish (1)  [100%] 1 of 1 ✔
[b9/ca540c] process > smoothxg (1) [100%] 1 of 1, failed: 1 ✘
[-        ] process > odgiBuild    -
Error executing process > 'smoothxg (1)'

Caused by:
  Process `smoothxg (1)` terminated with an error exit status (132)

Command executed:

  smoothxg     -t 1     -g A-3105.pggb-s3000-p70-n10-a70-K11-k8-w10000-j5000-W0-e5000.seqwish.gfa     -w 10000     -j 5000     -k 0     -e 5000     -l 10000     >A-3105.pggb-s3000-p70-n10-a70-K11-k8-w10000-j5000-W0-e5000.smooth.gfa

Command exit status:
  132

Command output:
  (empty)

Command error:
...
  [smoothxg::break_blocks] cut 0 blocks of which 0 had repeats
  [smoothxg::smooth_and_lace] applying spoa to block 0/24 0.000%
  .command.sh: line 2:     7 Illegal instruction     smoothxg -t 1 -g A-3105.pggb-s3000-p70-n10-a70-K11-k8-w10000-j5000-W0-e5000.seqwish.gfa -w 10000 -j 5000 -k 0 -e 5000 -l 10000 > A-3105.pggb-s3000-p70-n10-a70-K11-k8-w10000-j5000-W0-e5000.smooth.gfa

Question about parameter changes in the latest pggb updates

Hi

I've noticed that with the more recent updates a few parameters have changed their default values, namely:
max_path_jump=100 (-j)
max_edge_jump=0 (-e)
block_id_min=0 (-I)
block_ratio_min=0 (-R)

With the -I and -R parameters, is it no longer necessary to apply -I 0.9 -R 0.05, due to updates in smoothxg?

With path and edge jumps, have these been reduced heavily due to observations from the human pangenome work? I've been using 12000 and as high as 15000. My understanding is that lower values would pull in more variation. I'm currently testing -j 100 and -e 0 on a single gene region in barley, but I may not see much of a difference unless I work at the chromosome level (chromosome-level pggb runs with -p 95 take weeks to complete with the barley pangenome, even longer on chr 7H due to its high diversity, so it's not feasible to run multiple quick tests). Thanks.

Error with latest Docker build

Hi Erik

I just updated to the latest Docker build of PGGB: 20201210144337a5a404

As soon as Edyeet starts to run I get this error:

+ srun -n 1 singularity exec --bind /data/pangenome_20way/edyeet_results/2H:/data/pangenome_20way/edyeet_results/2H /data/pggb_builds/pggb.sif pggb -i /data/pangenome_20way/edyeet_results/2H/all_genomes_1640.tmp -w 10000 -s 1000000 -I 0 -p 85 -a 85 -n 18 -t 16 -v -l -S -o barley_pangenome_2H_s1000000_p85_a85_split
Command terminated by signal 4
edyeet -X -s 1000000 -p 85 -n 18 -a 85 -k 16 -t 16 /data/pangenome_20way/edyeet_results/2H/all_genomes_1640.tmp /data/pangenome_20way/edyeet_results/2H/all_genomes_1640.tmp
0.00s user 0.03s system 6% cpu 0.55s total 5424Kb max memory
srun: error: node-7: task 0: Exited with exit code 132

Edyeet doesn't seem to be detecting the correct memory available. This only happened immediately after I updated to the latest PGGB build. I'll see if rolling back to the version from a week ago helps for now. Thanks.

problem in running pggb with empty file

Hi All,

I am running pggb using:
pggb -i input.fa -w 100000 -s 100000 -p 80 -a 100 -n 24 -k 8 -I 0.6 -R 0.2 -t 16 -v -L -o out

It produces all the files but the input.paf file is empty. The 1-D graph does not look correct and there is no 2-D graph output. Thank you!

Cycles in graph

Hello,
The first paragraph in the README mentions that the goal is a locally acyclic graph. I recall seqwish not having this requirement, i.e. it used to happily build cyclic graphs. Did the graph model change in seqwish or is it just being called with some constraints on cycles?

This ticket suggests it might be the latter, but could you confirm?
ekg/seqwish#19

Thank you for continuing your work in this direction!

odgi::gfa_to_handle Command terminated by signal 11

Hi there, thanks for making this great software available. My error looks similar to, but distinct from, issue #139, so I figured I'd post it. I've been running pggb v0.1.3 (cloned from GitHub) on a collection of 361 50Mb genomes, chromosome by chromosome. The pggb pipeline successfully completed for 18/21 chromosomes, but I am getting the following error with the 3 remaining chromosomes at an odgi::gfa_to_handle step just after the second call of smoothxg::main

[smoothxg::main] unchopping smoothed graph
[odgi::unchop] unchopped 9049 nodes into 3573 new nodes.
[smoothxg::main] smoothed graph length 119666901bp in 1811962 nodes
[smoothxg::main] writing smoothed graph to IPO323.chr_5.fasta.10111f1.8b7f2f6.f37f590.smooth.gfa
smoothxg -t 24 -T 24 -g IPO323.chr_5.fasta.10111f1.8b7f2f6.f37f590.smooth.1.gfa -w 4772059 -K -X 100 -d 2000 -I 0 -R 0 -j 100 -e 0 -l 13219 -p 1,19,39,3,81,1 -O 0.001 -Q Consensus_ -V -o IPO323.chr_5.fasta.10111f1.8b7f2f6.f37f590.smooth.gfa
312470.88s user 28337.46s system 1768% cpu 19266.12s total 145814260Kb max memory
[odgi::gfa_to_handle] building nodes: 0.00% @ 1.78e+02/s elapsed: 00:00:00:00 remain: 00:02:49:41
[odgi::gfa_to_handle] building nodes: 100.00% @ 8.34e+05/s elapsed: 00:00:00:02 remain: 00:00:00:00
[odgi::gfa_to_handle] building edges: 0.00% @ 9.89e+00/s elapsed: 00:00:00:00 remain: 02:22:31:44
[odgi::gfa_to_handle] building edges: 100.00% @ 6.84e+05/s elapsed: 00:00:00:03 remain: 00:00:00:00
[odgi::gfa_to_handle] building paths: 0.01% @ 3.90e+00/s elapsed: 00:00:00:00 remain: 00:00:44:16
[odgi::gfa_to_handle] building paths: 1.38% @ 2.82e+02/s elapsed: 00:00:00:00 remain: 00:00:00:36
[odgi::gfa_to_handle] building paths: 2.82% @ 3.86e+02/s elapsed: 00:00:00:00 remain: 00:00:00:26
[odgi::gfa_to_handle] building paths: 5.34% @ 5.47e+02/s elapsed: 00:00:00:01 remain: 00:00:00:17Command terminated by signal 11
odgi build -t 24 -P -g IPO323.chr_5.fasta.10111f1.8b7f2f6.f37f590.smooth.fix.gfa -o - -O
19.24s user 1.13s system 228% cpu 8.89s total 2141432Kb max memory
warning [libhandlegraph]: Serialized object does not appear to match deserialzation type.
warning [libhandlegraph]: It is either an old version or in the wrong format.
warning [libhandlegraph]: Attempting to load it anyway. Future releases will reject it!
terminate called after throwing an instance of 'std::runtime_error' what(): Error rewinding to load non-magic-prefixed SerializableHandleGraph Command terminated by signal 6

Here is my command:
/home/emile/software/pggb/pggb -i IPO323.chr_5.fasta -s 50000 -k 700 -p 95 -n 361 -t 24 --exclude-delim '.' --normalize --vcf-spec IPO323:chr_5.ids

Any insight would be greatly appreciated. I've tried increasing the number of threads, but still get the same error. I've attached the log file edited to remove progress printouts to reduce its size:
IPO323.chr_5.fasta.10111f1.8b7f2f6.f37f590.10-27-2021_08:24:36.log

Thanks!

Consensus graph filled with 5000 bp blocks which might be trivially joined

Hi,
I had a few questions about the consensus graph generation. I generated this with -C 100

image

There are many blocks of around 5000 bp which are essentially a linear chain, with only one in-edge and one out-edge. It feels like these should be joined into a single node, since they don't contain any variation. This length is suspiciously close to poa-length-target [5000], so I wasn't sure if this was related to that, i.e. whether it won't produce longer nodes even if there is no bubble in them.

On a related note, I ran into a lot of issues initially, because I completed the initial run without -C (previous default behaviour included the consensus graph). When I reran with --resume, this kept causing assertion errors in odgi build. I believe this is because in L305 of pggb, it only checks whether a smooth.gfa exists (regardless of whether the consensus graphs exist), while later in L338 it tries to get stats on the consensus graphs (which were never made).

Running out of memory even on a high-memory cluster

I am running pggb on 13 genomes of ~45 Mbp each. I am using 2 high-memory compute nodes on a supercomputer: zeus (https://pawsey.org.au/systems/zeus/), where each node has 1 TB of memory.
Please advise on how to efficiently utilize memory and processors to run this program.

Here is my submission slurm script

#!/bin/bash -l
### Job Name
#SBATCH --job-name=pggb
### Set email type for job
### Accepted options: NONE, BEGIN, END, FAIL, ALL
#SBATCH --mail-type=FAIL
### email address for user
#SBATCH [email protected]
### Queue name that job is submitted to
#SBATCH --partition=highmemq
#SBATCH --account=xxxx
### Request nodes
#SBATCH --nodes=2
#SBATCH --time=96:00:00
#SBATCH --export=NONE
echo Running on host `hostname`
echo Time is `date`
#module(s) if required module load application_module
module load singularity/3.7.0
#Run the executable
singularity exec pggb-latest.simg \
	pggb -i concatenated.fasta -N -w 100000 -s 100000 -I 0 -p 80 -n 24 -k 8 -t 24 -v -L -o pggb_out_zeus -m -r

And the error

[smoothxg::main] loading graph
[smoothxg::main] prepping graph for smoothing
[odgi::gfa_to_handle] building nodes: 100.00% @ 1.88e+05/s elapsed: 00:00:00:01 remain: 00:00:00:00
[odgi::gfa_to_handle] building edges: 100.00% @ 3.99e+05/s elapsed: 00:00:00:01 remain: 00:00:00:00
[odgi::gfa_to_handle] building paths: 100.00% @ 1.41e+03/s elapsed: 00:00:00:02 remain: 00:00:00:00
[smoothxg::prep] building path index
[smoothxg::prep] sorting graph
[odgi::path_linear_sgd] calculating linear SGD schedule (7.06e-10 1.00e+00 100 0 1.00e-02)
[odgi::path_linear_sgd] calculating zetas for 10003 zipf distributions
[odgi::path_linear_sgd] 1D path-guided SGD: 100.00% @ 4.17e+06/s elapsed: 00:00:00:43 remain: 00:00:00:00
[odgi::groom] grooming: 100.00% @ 8.95e+05/s elapsed: 00:00:00:00 remain: 00:00:00:00
[odgi::groom] organizing handles: 100.00% @ 2.24e+07/s elapsed: 00:00:00:00 remain: 00:00:00:00
[odgi::groom] flipped 113433 handles
[odgi::topological_order] sorting nodes: 100.00% @ 9.49e+04/s elapsed: 00:00:00:02 remain: 00:00:00:00
[smoothxg::prep] chopping graph to 100
[odgi::chop] 53350 node(s) to chop.
[smoothxg::prep] writing graph pggb_out_zeus/concatenated.fasta.pggb-E-s100000-l300000-p80-n24-a0-K16-N.seqwish-k8-B1000000.gfa.prep.gfa
[smoothxg::main] building xg index
[smoothxg::smoothable_blocks] computing blocks
[smoothxg::smoothable_blocks] computing blocks for 3524875 handles: 100.00% @ 2.34e+05/s elapsed: 00:00:00:15 remain: 00:00:00:00
[smoothxg::break_and_split_blocks] cutting blocks that contain sequences longer than max-poa-length (10000)
[smoothxg::break_and_split_blocks] splitting 57430 blocks at identity 0.000 (edlib-based clustering) and at estimated-identity 0.000 (mash-based clustering)
[smoothxg::break_and_split_blocks] cutting and splitting 57430 blocks: 100.00% @ 6.55e+02/s elapsed: 00:00:01:27 remain: 00:00:00:00
[smoothxg::break_and_split_blocks] cut 2989 blocks of which 208 had repeats
[smoothxg::break_and_split_blocks] split 0 blocks
[smoothxg::smooth_and_lace] applying global abPOA to 57430 blocks: Command terminated by signal 90:00:10:15 remain: 03:04:29:01
smoothxg -t 24 -g pggb_out_zeus/concatenated.fasta.pggb-E-s100000-l300000-p80-n24-a0-K16-N.seqwish-k8-B1000000.gfa -w 100000 -M -J 0.7 -K -G 150 -I 0 -R 0 -j 5000 -e 5000 -l 10000 -p 1,4,6,2,26,1 -m pggb_out_zeus/concatenated.fasta.pggb-E-s100000-l300000-p80-n24-a0-K16-N.seqwish-k8-B1000000.smooth-w100000-j5000-e5000-I0-p1_4_6_2_26_1.maf -C pggb_out_zeus/concatenated.fasta.pggb-E-s100000-l300000-p80-n24-a0-K16-N.seqwish-k8-B1000000.smooth-w100000-j5000-e5000-I0-p1_4_6_2_26_1.consensus,10,100,1000,10000 -o pggb_out_zeus/concatenated.fasta.pggb-E-s100000-l300000-p80-n24-a0-K16-N.seqwish-k8-B1000000.smooth-w100000-j5000-e5000-I0-p1_4_6_2_26_1.gfa
673.07s user 268.05s system 94% cpu 995.56s total 63485420Kb max memory
slurmstepd: error: Detected 1 oom-kill event(s) in step 5021988.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Command terminated by signal 6

Hello again,

Yesterday I used docker to pull, and try, the latest available build of pggb. While some samples do work, running this one particular sample of ten genomes results in an error which terminates the run.
Is this due to the parameters, the sample, or the version?

Starting pggb on Fri Aug 20 09:28:30 CEST 2021

Command: /usr/local/bin/pggb -i ./indata/completes/Ten_global_refs_cleaned.fa -s 500 -p 85 -n 9 -P 1,9,16,2,41,1 -t 10 -v -C 1000 -L -o ./outdata/completes/cleaned

PARAMETERS

general:
  input-fasta:        ./indata/completes/Ten_global_refs_cleaned.fa
  output-dir:         ./outdata/completes/cleaned
  resume:             false
  pigz-compress:      false
  threads:            10
alignment:
  mapping-tool:       wfmash
  no-splits:          false
  segment-length:     500
  block-length:       1500
  no-merge-segments:  false
  map-pct-id:         85
  n-mappings:         9
  mash-kmer:          16
  exclude-delim:      false
seqwish:
  min-match-len:      19
  transclose-batch:   10000000
smoothxg:
  n-haps:             9
  path-jump-max:      100
  edge-jump-max:      0
  poa-length-target:  13117,13219
  poa-params:         1,9,16,2,41,1
  write-maf:          false
  consensus-prefix:   Consensus_
  consensus-spec:     1000
  split-min-depth:    2000
  block-id-min:       0
  block-ratio-min:    0
  poa_threads:        10
  poa_padding:        0.001
odgi:
  normalize:          false
  viz:                true
  layout:             true
  stats:              false
vg:
  deconstruct:        false  
reporting:
  multiqc:            false

Running pggb

[wfmash::map] Reference = [./indata/completes/Ten_global_refs_cleaned.fa]
[wfmash::map] Query = [./indata/completes/Ten_global_refs_cleaned.fa]
[wfmash::map] Kmer size = 16
[wfmash::map] Window size = 1
[wfmash::map] Segment length = 500 (read split allowed)
[wfmash::map] Block length min = 1500
[wfmash::map] Alphabet = DNA
[wfmash::map] Percentage identity threshold = 85%
[wfmash::map] Mapping output file = /crex/proj/uppstore2017270/Emilio/pggb/2021_08_19-late_runs/wfmash-pynVi9
[wfmash::map] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[wfmash::map] Execution threads  = 10
[wfmash::skch::Sketch::build] minimizers picked from reference = 16385897
[wfmash::skch::Sketch::index] unique minimizers = 6084903
[wfmash::skch::Sketch::computeFreqHist] Frequency histogram of minimizers = (1, 3031997) ... (268, 1)
[wfmash::skch::Sketch::computeFreqHist] With threshold 0.001%, ignore minimizers occurring >= 48 times during lookup.
[wfmash::map] time spent computing the reference index: 4.88354 sec
[wfmash::skch::Map::mapQuery] WARNING, no .fai index found for ./indata/completes/Ten_global_refs_cleaned.fa, reading file to sum sequence length (slow)
[wfmash::skch::Map::mapQuery] mapped 100.00% @ 5.20e+05 bp/s elapsed: 00:00:00:31 remain: 00:00:00:00
[wfmash::skch::Map::mapQuery] count of mapped reads = 9, reads qualified for mapping = 10, total input reads = 10, total input bp = 16386557
[wfmash::map] time spent mapping the query: 3.16e+01 sec
[wfmash::map] mapping results saved in: /crex/proj/uppstore2017270/Emilio/pggb/2021_08_19-late_runs/wfmash-pynVi9
[wfmash::align] Reference = [./indata/completes/Ten_global_refs_cleaned.fa]
[wfmash::align] Query = [./indata/completes/Ten_global_refs_cleaned.fa]
[wfmash::align] Mapping file = /crex/proj/uppstore2017270/Emilio/pggb/2021_08_19-late_runs/wfmash-pynVi9
[wfmash::align] Alignment identity cutoff = 8.50e-01%
[wfmash::align] Alignment output file = /dev/stdout
[wfmash::align] time spent read the reference sequences: 1.68e-07 sec
[wfmash::align::computeAlignments] aligned 100.00% @ 9.22e+05 bp/s elapsed: 00:00:01:13 remain: 00:00:00:00
[wfmash::align::computeAlignments] count of mapped reads = 10, total aligned bp = 67319873
[wfmash::align] time spent computing the alignment: 7.31e+01 sec
[wfmash::align] alignment results saved in: /dev/stdout
wfmash -X -s 500 -l 1500 -p 85 -n 9 -k 16 -t 10 ./indata/completes/Ten_global_refs_cleaned.fa ./indata/completes/Ten_global_refs_cleaned.fa
1006.17s user 28.84s system 935% cpu 110.59s total 856728Kb max memory
[seqwish::seqidx] 0.005 indexing sequences
[seqwish::seqidx] 0.232 index built
[seqwish::alignments] 0.232 processing alignments
[seqwish::alignments] 0.788 indexing
[seqwish::alignments] 2.203 index built
[seqwish::transclosure] 2.212 computing transitive closures
[seqwish::transclosure] 2.246 0.00% 0-10000000 overlap_collect
[seqwish::transclosure] 3.451 0.00% 0-10000000 rank_build
[seqwish::transclosure] 3.778 0.00% 0-10000000 parallel_union_find
[seqwish::transclosure] 4.082 0.00% 0-10000000 dset_write
[seqwish::transclosure] 4.278 0.00% 0-10000000 dset_compression
[seqwish::transclosure] 4.521 0.00% 0-10000000 dset_sort
[seqwish::transclosure] 4.622 0.00% 0-10000000 dset_invert
[seqwish::transclosure] 4.849 0.00% 0-10000000 graph_emission
[seqwish::transclosure] 4.946 92.95% 10000038-16386557 overlap_collect
[seqwish::transclosure] 5.054 92.95% 10000038-16386557 rank_build
[seqwish::transclosure] 5.147 92.95% 10000038-16386557 parallel_union_find
[seqwish::transclosure] 5.178 92.95% 10000038-16386557 dset_write
[seqwish::transclosure] 5.220 92.95% 10000038-16386557 dset_compression
[seqwish::transclosure] 5.236 92.95% 10000038-16386557 dset_sort
[seqwish::transclosure] 5.263 92.95% 10000038-16386557 dset_invert
[seqwish::transclosure] 5.287 92.95% 10000038-16386557 graph_emission
[seqwish::transclosure] 7.583 100.00% building node_iitree and path_iitree indexes
[seqwish::transclosure] 9.327 100.00% done
[seqwish::transclosure] 9.327 done with transitive closures
[seqwish::compact] 9.327 compacting nodes
[seqwish::compact] 9.881 done compacting
[seqwish::compact] 9.909 built node index
[seqwish::links] 9.909 finding graph links
[seqwish::links] 10.597 links derived
[seqwish::gfa] 10.597 writing graph
[seqwish::gfa] 13.044 done
seqwish -t 10 -s ./indata/completes/Ten_global_refs_cleaned.fa -p ./outdata/completes/cleaned/Ten_global_refs_cleaned.fa.adfd3d2.wfmash.paf -k 19 -g ./outdata/completes/cleaned/Ten_global_refs_cleaned.fa.adfd3d2.34ee7b1.seqwish.gfa -B 10000000 -P
31.89s user 6.21s system 287% cpu 13.27s total 851480Kb max memory
[smoothxg::main] loading graph
[smoothxg::main] prepping graph for smoothing
[odgi::gfa_to_handle] building nodes: 100.00% @ 1.06e+06/s elapsed: 00:00:00:00 remain: 00:00:00:00
[odgi::gfa_to_handle] building edges: 100.00% @ 1.00e+06/s elapsed: 00:00:00:01 remain: 00:00:00:00
[odgi::gfa_to_handle] building paths: 100.00% @ 9.61e+00/s elapsed: 00:00:00:01 remain: 00:00:00:00
[smoothxg::prep] building path index
[smoothxg::prep] sorting graph
[odgi::path_linear_sgd] calculating linear SGD schedule (1.20e-11 1.00e+00 100 0 1.00e-02)
[odgi::path_linear_sgd] calculating zetas for 10028 zipf distributions
[odgi::path_linear_sgd] 1D path-guided SGD: 100.00% @ 2.71e+06/s elapsed: 00:00:01:40 remain: 00:00:00:00
[odgi::groom] grooming: 100.00% @ 2.12e+06/s elapsed: 00:00:00:00 remain: 00:00:00:00
[odgi::groom] organizing handles: 100.00% @ 2.12e+06/s elapsed: 00:00:00:00 remain: 00:00:00:00
[odgi::groom] flipped 297776 handles
[odgi::topological_order] sorting nodes: 100.00% @ 1.01e+05/s elapsed: 00:00:00:05 remain: 00:00:00:00
[smoothxg::prep] chopping graph to 100
[odgi::chop] 1459 node(s) to chop.
[smoothxg::prep] writing graph ./outdata/completes/cleaned/Ten_global_refs_cleaned.fa.adfd3d2.34ee7b1.seqwish.gfa.prep.gfa
[smoothxg::main] building xg index
[smoothxg::smoothable_blocks] computing blocks
[smoothxg::smoothable_blocks] computing blocks for 535041 handles: 100.00% @ 4.11e+04/s elapsed: 00:00:00:13 remain: 00:00:00:00
[smoothxg::break_and_split_blocks] cutting blocks that contain sequences longer than max-poa-length (26234) and depth >= 2000
[smoothxg::break_and_split_blocks] splitting 520 blocks at identity 0.000 (WFA-based clustering) and at estimated-identity 0.000 (mash-based clustering)
[smoothxg::break_and_split_blocks] cutting and splitting 520 blocks: 100.00% @ 4.41e+04/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::break_and_split_blocks] cut 0 blocks of which 0 had repeats
[smoothxg::break_and_split_blocks] split 0 blocks
[smoothxg::smooth_and_lace] applying global abPOA to 520 blocks: 100.00% @ 7.47e+00/s elapsed: 00:00:01:09 remain: 00:00:00:00
[smoothxg::smooth_and_lace] flipping 0 block graphs: 100.00% @ 1.34e+05/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::smooth_and_lace] sorting path_mappings
[smoothxg::smooth_and_lace] adding edges from 520 graphs: 100.00% @ 6.88e+02/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::smooth_and_lace] embedding 4691 path fragments: 100.00% @ 9.87e+02/s elapsed: 00:00:00:04 remain: 00:00:00:00
[smoothxg::smooth_and_lace] validating 10 path sequences: 100.00% @ 1.74e+01/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::smooth_and_lace] adding nodes from 520 graphs: 100.00% @ 8.15e+01/s elapsed: 00:00:00:06 remain: 00:00:00:00
[smoothxg::smooth_and_lace] walking edges in 10 paths: 100.00% @ 7.22e+00/s elapsed: 00:00:00:01 remain: 00:00:00:00
[smoothxg::main] unchopping smoothed graph
[odgi::unchop] unchopped 332 nodes into 160 new nodes.
[smoothxg::main] smoothed graph length 2676195bp in 809019 nodes
[smoothxg::main] writing smoothed graph to ./outdata/completes/cleaned/Ten_global_refs_cleaned.fa.adfd3d2.34ee7b1.76f53b7.smooth.1.gfa
smoothxg -t 10 -T 10 -g ./outdata/completes/cleaned/Ten_global_refs_cleaned.fa.adfd3d2.34ee7b1.seqwish.gfa -w 118053 -K -X 100 -d 2000 -I 0 -R 0 -j 100 -e 0 -l 13117 -p 1,9,16,2,41,1 -O 0.001 -V -o ./outdata/completes/cleaned/Ten_global_refs_cleaned.fa.adfd3d2.34ee7b1.76f53b7.smooth.1.gfa
1585.78s user 91.85s system 742% cpu 225.85s total 6343400Kb max memory
[smoothxg::main] loading graph
[smoothxg::main] prepping graph for smoothing
[odgi::gfa_to_handle] building nodes: 100.00% @ 6.46e+05/s elapsed: 00:00:00:01 remain: 00:00:00:00
[odgi::gfa_to_handle] building edges: 100.00% @ 1.36e+06/s elapsed: 00:00:00:00 remain: 00:00:00:00
[odgi::gfa_to_handle] building paths: 100.00% @ 6.67e+00/s elapsed: 00:00:00:01 remain: 00:00:00:00
[smoothxg::prep] building path index
[smoothxg::prep] sorting graph
[odgi::path_linear_sgd] calculating linear SGD schedule (4.53e-12 1.00e+00 100 0 1.00e-02)
[odgi::path_linear_sgd] calculating zetas for 10046 zipf distributions
[odgi::path_linear_sgd] 1D path-guided SGD: 100.00% @ 3.16e+06/s elapsed: 00:00:02:23 remain: 00:00:00:00
[odgi::groom] grooming: 100.00% @ 1.62e+06/s elapsed: 00:00:00:00 remain: 00:00:00:00
[odgi::groom] organizing handles: 100.00% @ 3.23e+06/s elapsed: 00:00:00:00 remain: 00:00:00:00
[odgi::groom] flipped 308875 handles
[odgi::topological_order] sorting nodes: 100.00% @ 9.51e+04/s elapsed: 00:00:00:08 remain: 00:00:00:00
[smoothxg::prep] chopping graph to 100
[odgi::chop] 770 node(s) to chop.
[smoothxg::prep] writing graph ./outdata/completes/cleaned/Ten_global_refs_cleaned.fa.adfd3d2.34ee7b1.76f53b7.smooth.1.gfa.prep.gfa
[smoothxg::main] building xg index
[smoothxg::smoothable_blocks] computing blocks
[smoothxg::smoothable_blocks] computing blocks for 813112 handles: 100.00% @ 3.61e+04/s elapsed: 00:00:00:22 remain: 00:00:00:00
[smoothxg::break_and_split_blocks] cutting blocks that contain sequences longer than max-poa-length (26438) and depth >= 2000
[smoothxg::break_and_split_blocks] splitting 485 blocks at identity 0.000 (WFA-based clustering) and at estimated-identity 0.000 (mash-based clustering)
[smoothxg::break_and_split_blocks] cutting and splitting 485 blocks: 100.00% @ 3.95e+04/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::break_and_split_blocks] cut 0 blocks of which 0 had repeats
[smoothxg::break_and_split_blocks] split 0 blocks
[smoothxg::smooth_and_lace] applying global abPOA to 485 blocks: 100.00% @ 7.98e+00/s elapsed: 00:00:01:00 remain: 00:00:00:00
[smoothxg::smooth_and_lace] flipping 0 block graphs: 100.00% @ 6.38e+04/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::smooth_and_lace] sorting path_mappings
[smoothxg::smooth_and_lace] adding edges from 485 graphs: 100.00% @ 6.43e+02/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::smooth_and_lace] embedding 3758 path fragments: 100.00% @ 7.51e+02/s elapsed: 00:00:00:05 remain: 00:00:00:00
[smoothxg::smooth_and_lace] validating 10 path sequences: 100.00% @ 1.52e+01/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::smooth_and_lace] sorting consensus
[smoothxg::smooth_and_lace] embedding consensus: creating path handles
[smoothxg::smooth_and_lace] embedding consensus: creating step handles
[smoothxg::smooth_and_lace] adding nodes from 485 graphs: 100.00% @ 7.10e+01/s elapsed: 00:00:00:06 remain: 00:00:00:00
[smoothxg::smooth_and_lace] walking edges in 10 paths: 100.00% @ 8.77e+00/s elapsed: 00:00:00:01 remain: 00:00:00:00
[smoothxg::main] unchopping smoothed graph
[odgi::unchop] unchopped 154 nodes into 73 new nodes.
[smoothxg::main] smoothed graph length 2667410bp in 810120 nodes
[smoothxg::main] writing smoothed graph to ./outdata/completes/cleaned/Ten_global_refs_cleaned.fa.adfd3d2.34ee7b1.76f53b7.smooth.gfa
[smoothxg::main] building xg index from smoothed graph
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_M_construct null not valid
Command terminated by signal 6
smoothxg -t 10 -T 10 -g ./outdata/completes/cleaned/Ten_global_refs_cleaned.fa.adfd3d2.34ee7b1.76f53b7.smooth.1.gfa -w 118971 -K -X 100 -d 2000 -I 0 -R 0 -j 100 -e 0 -l 13219 -p 1,9,16,2,41,1 -O 0.001 -Q Consensus_ -C ./outdata/completes/cleaned/Ten_global_refs_cleaned.fa.adfd3d2.34ee7b1.76f53b7.cons,1000 -o ./outdata/completes/cleaned/Ten_global_refs_cleaned.fa.adfd3d2.34ee7b1.76f53b7.smooth.gfa
1972.45s user 91.14s system 719% cpu 286.91s total 5966132Kb max memory

Thank you in advance!

can't run your example

I followed the instructions to pull your docker image but get the following error when trying to run:

docker run -it -v /home/heumos/git/pggb/data/:/data ghcr.io/pangenome/pggb:latest "pggb -i data/HLA/A-3105.fa.gz -s 3000 -K 11 -p 70 -a 70 -n 10 -t 2 -v -l"

/usr/local/bin/pggb: line 99: /data/HLA/A-3105.fa.gz.pggb-s3000-p70-n10-a70-K11-k8-w10000-j5000-W0-e100.paf: No such file or directory

Any idea what's wrong?

Use pggb to get fasta alignment for 20 1Mb genomes

Hi,

I tried using pggb to align 20 1Mb genomes, and it seems much faster than FSA! My goal is to produce FASTA alignments that are better than I can get with FSA. If there were a way to clean up FSA alignments by running POA on 1kb windows, that would probably work fine.

  1. How can I generate FASTA files from the graphs? It looks like the FASTA file should have a similar layout to the 1D visualization. The MAF alignments are really the same kind of thing. I tried increasing the block size with -s, but this just seemed to generate weird alignments.

  2. Preliminary inspection of the MAF alignments (using less 😬) suggests that there is some over-alignment, mostly in aligning sequences to themselves (i.e. loops). Is there a recommendation to eliminate or minimize this?

-BenRI

Command terminated by signal 11

Hi All,

I get an error while running pggb on 15 Saccharomyces assemblies:

[smoothxg::create_consensus_graph] cleaning up redundant link paths
Command terminated by signal 11
smoothxg -t 132 -g /data/cp-full/multisaccha-prefix.fa.gz.pggb-s50000-p70-n10-a70-K16-k8-w10000-j5000-e5000-I0.seqwish.gfa -w 10000 -I 0 -j 5000 -e 5000 -l 10000 -m /data/cp-full/multisaccha-prefix.fa.gz.pggb-s50000-p70-n10-a70-K16-k8-w10000-j5000-e5000-I0.smooth.maf -s /data/cp-full/multisaccha-prefix.fa.gz.pggb-s50000-p70-n10-a70-K16-k8-w10000-j5000-e5000-I0.consensus -a -C 10,100,1000,10000 -o /data/cp-full/multisaccha-prefix.fa.gz.pggb-s50000-p70-n10-a70-K16-k8-w10000-j5000-e5000-I0.smooth.gfa
32771.07s user 727.37s system 2672% cpu 1253.48s total 20854404Kb max memory

Everything works fine if I reduce the number of assemblies (e.g. to 7).

I'm using a server with ~500 GB RAM, 176 cores, more than 10 TB of free temporary HDD storage.

Any suggestion?

Best,

Lorenzo

Issues in using pggb

Hello,

I want to use pggb to make a graph pangenome from 15 genomes. I tried the pggb command line but am not sure how to input multiple genomes. For example, in the following command line, how can I input all 15 genomes to create the output file?

pggb -i sample1.fa.gz -N -w 50000 -s 10000 -I 0 -p 70 -a 70 -n 5 -t 16 -v -l -o out

I could not find a way to input multiple files in the documentation.

Thank you for the help!

smoothxg Flag could not be matched: 'o'

Running from a Docker container built on ghcr.io/pangenome/pggb:latest as of today (22 Nov 2020)

$ smoothxg \
  -t 1 \
  -g DRB1-3123.pggb-s1000-p90-n10-a90-K16-k8-w10000-j5000-W0-e5000.seqwish.gfa \
  -w 10000 \
  -j 5000 \
  -e 5000 \
  -l 10000 \
  -o DRB1-3123.pggb-s1000-p90-n10-a90-K16-k8-w10000-j5000-W0-e5000.smooth.gfa \
  -m DRB1-3123.pggb-s1000-p90-n10-a90-K16-k8-w10000-j5000-W0-e5000.smooth.maf \
  -s DRB1-3123.pggb-s1000-p90-n10-a90-K16-k8-w10000-j5000-W0-e5000.consensus \
  -a \
  -C 5000
...
Flag could not be matched: 'o'
  smoothxg {OPTIONS}

Did this -o option go away recently? If so, I assume the script should be updated to > out.smooth.gfa?

Command terminated by signal 4

Hey to all,
Really nice tool, and I am pretty happy about the conda installation.
Just a notice to everyone: I am trying to run this tool on quite a lot of Metagenome-Assembled Genomes (MAGs). Within the first second of running, inside the wfmash step, it just throws "Illegal instruction" and stops. After some experimentation, at least in my case, it was related to an out-of-memory problem (I was running it on a cluster). When I provided more memory, it worked.
Just keep in mind that could be a problem, even though it is not explicitly mentioned.

Best,
Matija

Latest version - Command terminated by signal 11

Hi!

This morning I pulled PGGB as I did in #62 and tried running with the same two samples as mentioned in that issue, both of which I have previously been able to run without any issues. While the other sample works, this sample (and others) fails with this error:

Starting pggb on Tue Jun 15 08:37:30 CEST 2021

Command: /usr/local/bin/pggb -i ./New_indata/messy_genomes_concatenated.fa -s 500 -w 25000 -p 85 -n 9 -P 1,9,16,2,41,1 -t 10 -v -C 1000 -L -o ./runtest_3

PARAMETERS

general:
  input-fasta:        ./New_indata/messy_genomes_concatenated.fa
  output-dir:         ./runtest_3
  resume:             false
  pigz-compress:      false
  threads:            10
alignment:
  mapping-tool:       wfmash
  no-splits:          false
  segment-length:     500
  block-length:       1500
  no-merge-segments:  false
  map-pct-id:         85
  n-secondary:        9
  mash-kmer:          16
  exclude-delim:      false
seqwish:
  min-match-len:      19
  transclose-batch:   10000000
smoothxg:
  block-weight-max:   25000
  path-jump-max:      100
  edge-jump-max:      0
  poa-length-target:  5000
  poa-params:         1,9,16,2,41,1
  write-maf:          false
  consensus-prefix:   Consensus_
  consensus-spec:     1000
  split-min-depth:    2000
  block-id-min:       0
  block-ratio-min:    0
odgi:
  viz:                true
  layout:             true
  stats:              false
reporting:
  multiqc:            false

Running pggb

[wfmash::map] Reference = [./New_indata/messy_genomes_concatenated.fa]
[wfmash::map] Query = [./New_indata/messy_genomes_concatenated.fa]
[wfmash::map] Kmer size = 16
[wfmash::map] Window size = 1
[wfmash::map] Segment length = 500 (read split allowed)
[wfmash::map] Block length min = 1500
[wfmash::map] Alphabet = DNA
[wfmash::map] Percentage identity threshold = 85%
[wfmash::map] Mapping output file = /crex/proj/uppstore2017270/Emilio/pggb/wfmash-E66Pry
[wfmash::map] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[wfmash::map] Execution threads  = 10
[wfmash::skch::Sketch::build] minimizers picked from reference = 16424393
[wfmash::skch::Sketch::index] unique minimizers = 5451643
[wfmash::skch::Sketch::computeFreqHist] Frequency histogram of minimizers = (1, 2430033) ... (364, 1)
[wfmash::skch::Sketch::computeFreqHist] With threshold 0.001%, ignore minimizers occurring >= 65 times during lookup.
[wfmash::map] time spent computing the reference index: 4.96914 sec
[wfmash::skch::Map::mapQuery] WARNING, no .fai index found for ./New_indata/messy_genomes_concatenated.fa, reading file to sum sequence length (slow)
[wfmash::skch::Map::mapQuery] mapped 100.00% @ 3.73e+05 bp/s elapsed: 00:00:00:44 remain: 00:00:00:00
[wfmash::skch::Map::mapQuery] count of mapped reads = 172, reads qualified for mapping = 174, total input reads = 174, total input bp = 16427933
[wfmash::map] time spent mapping the query: 4.41e+01 sec
[wfmash::map] mapping results saved in: /crex/proj/uppstore2017270/Emilio/pggb/wfmash-E66Pry
[wfmash::align] Reference = [./New_indata/messy_genomes_concatenated.fa]
[wfmash::align] Query = [./New_indata/messy_genomes_concatenated.fa]
[wfmash::align] Mapping file = /crex/proj/uppstore2017270/Emilio/pggb/wfmash-E66Pry
[wfmash::align] Alignment identity cutoff = 8.50e-01%
[wfmash::align] Alignment output file = /dev/stdout
[wfmash::align] time spent read the reference sequences: 5.63e-02 sec
[wfmash::align::computeAlignments] aligned 21.53% @ 6.75e+05 bp/s elapsed: 00:00:00:19 remain: 00:00:01:12Command terminated by signal 11
wfmash -X -s 500 -l 1500 -p 85 -n 9 -k 16 -t 10 ./New_indata/messy_genomes_concatenated.fa ./New_indata/messy_genomes_concatenated.fa
478.35s user 10.79s system 698% cpu 70.05s total 1223272Kb max memory

Running with or without the -Y "_" flag fails with this error.
Any help is greatly appreciated!

Consensus paths in smoothed graph

Hi

I've noticed that my smoothed graph has over 1 million consensus paths for each chromosome. For visualisation in sequenceTubeMap this becomes quite a problem: I have to search through all the consensus paths to find the paths for embedded gene annotations I've added from a GFF, or for particular genomes of interest. I'd like to request a parameter in PGGB to avoid including consensus paths in the resulting smoothed graph. The consensus paths are useful, but they are a problem for visualisation tools such as sequenceTubeMap. My aim is to deploy sequenceTubeMap for non-technical people to use, and I think the consensus paths will only confuse them if they are left in the final graph. Thanks.
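
In the meantime, one possible workaround (a sketch, assuming vg's convert and paths options behave as documented; file names are illustrative) is to drop the consensus paths by their name prefix before loading the graph into sequenceTubeMap:

vg convert -g in.smooth.gfa -p > graph.vg                 # GFA -> vg (PackedGraph)
vg paths -x graph.vg -Q Consensus_ -d > graph.nocons.vg   # drop paths whose names start with Consensus_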

Does PGGB miss odgi sort -O?

From @brettChapman pangenome/odgi#282

Hi @subwaystation and @ekg
Something I thought worth pointing out: while running odgi I came across errors about my graphs being unsorted, so I've added odgi sort with the -O flag prior to odgi viz in my script. I notice odgi sort is left out of the PGGB script, which may cause issues with some graphs, as I found with mine after removing the consensus paths. It might be worth adding it to the PGGB script to avoid possible issues.
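
For anyone hitting the same errors, a sketch of the workaround described above (file names are illustrative):

odgi sort -i in.smooth.og -O -o in.sorted.og    # -O is the optimize flag mentioned above
odgi viz -i in.sorted.og -o in.sorted.png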

docker build error with MacBook Pro

Hi! I am trying to build an image following the instructions, with the command line docker build --target binary -t ${USER}/pggb:latest . in the pggb directory, but I get the following error:

...
#8 40.71 [ 85%] Building CXX object CMakeFiles/seqwish.dir/src/main.cpp.o
#8 40.71 [ 85%] Building CXX object CMakeFiles/seqwish.dir/src/utils.cpp.o
#8 40.71 [ 86%] Building CXX object CMakeFiles/seqwish.dir/src/seqindex.cpp.o
#8 40.71 c++: error: unrecognized command-line option '-mcx16'
#8 40.71 c++: error: unrecognized command-line option '-mcx16'
#8 40.71 gmake[2]: *** [CMakeFiles/seqwish.dir/build.make:82: CMakeFiles/seqwish.dir/src/utils.cpp.o] Error 1
#8 40.71 gmake[2]: *** Waiting for unfinished jobs....
#8 40.71 gmake[2]: *** [CMakeFiles/seqwish.dir/build.make:95: CMakeFiles/seqwish.dir/src/main.cpp.o] Error 1
#8 40.71 c++: error: unrecognized command-line option '-mcx16'
#8 40.71 gmake[2]: *** [CMakeFiles/seqwish.dir/build.make:108: CMakeFiles/seqwish.dir/src/seqindex.cpp.o] Error 1
#8 40.71 [ 87%] Building CXX object CMakeFiles/seqwish.dir/src/paf.cpp.o
#8 40.72 c++: error: unrecognized command-line option '-mcx16'
#8 40.72 gmake[2]: *** [CMakeFiles/seqwish.dir/build.make:121: CMakeFiles/seqwish.dir/src/paf.cpp.o] Error 1
#8 40.72 gmake[1]: *** [CMakeFiles/Makefile2:454: CMakeFiles/seqwish.dir/all] Error 2
#8 40.72 gmake: *** [Makefile:149: all] Error 2

executor failed running [/bin/sh -c git clone --recursive https://github.com/ekg/seqwish && cd seqwish && git pull && git checkout 6da2102 && git submodule update --init --recursive && cmake -H. -Bbuild && cmake --build build -- -j $(nproc) && cp bin/seqwish /usr/local/bin/seqwish && cd ../]: exit code: 2

Could you please help me with this problem? Thanks a lot in advance!
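
If this is an Apple Silicon (M1) MacBook, the '-mcx16' errors suggest x86-specific compiler flags failing on an ARM host; one thing that may help (an assumption, since the host architecture isn't stated) is forcing an x86-64 build under emulation:

docker build --platform linux/amd64 --target binary -t ${USER}/pggb:latest .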

Suggestion for 1D graph visualisation

Hi

I've been using the 1D graph visualizations to identify regions of high variation across the pangenome graph, and also slicing the pangenome graph by different germplasm (I'm working with barley) to identify striking differences between groups of haplotypes (see the two attached haplotype groups for examples).

My question is: is there a way to add chromosome coordinates to the visualisation, so that I can narrow in on specific regions of interest? I know a common coordinate across the entire pangenome could be based on a consensus graph coordinate, but perhaps there could be node IDs on the x-axis as major/minor ticks, or some estimate of the number of Mbp from the start to the end of the chromosome, just to give a ball-park idea of the region to zoom into.

Thanks.

[attached: 1D visualizations barley_pangenome_graph_7H rowtype6 viz_inv and barley_pangenome_graph_7H rowtype2 viz_inv]

MAF output

Hi, the block number/accession (e.g. "blocks=394-395_399-413") in the MAF output does not seem to coincide with the segment/node numbers (or anything else) in the smooth.gfa file. Is it possible to connect these blocks to a value in the GFA file? Thanks very much!

Auto create output directory based on params and time

When working with multiple parameter sets and wanting to analyze what happened, it would be nice if pggb did not create files in the current directory, but rather auto-created a directory named after the current parameters, such as prefix_paf, prefix_seqwish, prefix_smoothed, etc., and also dumped the log and/or the commands that were run into that directory.

For example:
pggb-Z-s100000-l300000-p95-n10-a95-K16-z_gfextend_chain_nogapped.03-27-2021_18:58:25 or pggb-Z-s50000-l150000-p85-n10-a0-K16-z-M.03-10-2021_17:24:15.
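
A minimal wrapper sketch of this idea (parameter values are illustrative):

params="s50000-p85-n10"                                # summarize the settings you care about
out="pggb-${params}.$(date +%m-%d-%Y_%H:%M:%S)"        # e.g. pggb-s50000-p85-n10.03-10-2021_17:24:15
mkdir -p "$out"
echo "pggb -i in.fa -s 50000 -p 85 -n 10 -o $out" > "$out/command.txt"    # record the command
pggb -i in.fa -s 50000 -p 85 -n 10 -o "$out" 2>&1 | tee "$out/pggb.log"   # keep the log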

Example command issue?

Hey guys,

I'm running into an issue trying to execute the example command:

git clone --recursive https://github.com/pangenome/pggb
cd pggb
./pggb -i data/HLA/DRB1-3123.fa.gz -N -s 5000 -I 0 -p 80 -n 10 -k 8 -t 16 -v -L -o out

which gives the usage info:

Mandatory arguments -i, -s, -n, -p
usage: ./pggb -i <input-fasta> -s <segment-length> -n <n-mappings>
              -p <map-pct-id> [options]
options:
   [wfmash]
    -i, --input-fasta FILE      input FASTA/FASTQ file
    -s, --segment-length N      segment length for mapping
    -l, --block-length N        minimum block length filter for mapping [default: 3*segment-length]
    -N, --no-split              disable splitting of input sequences during mapping [enabled by default]
    -M, --no-merge-segments     do not merge successive mappings
    -p, --map-pct-id PCT        percent identity in the wfmash step
    -n, --n-mappings N          number of mappings to retain for each segment
    -K, --mash-kmer N           kmer size for mashmap [default: 16]
    -Y, --exclude-delim C       skip mappings between sequences with the same name prefix before
                                the given delimiter character [default: all-vs-all and !self]
   [seqwish]
    -k, --min-match-len N       ignore exact matches below this length [default: 19]
    -B, --transclose-batch      number of bp to use for transitive closure batch [default: 10000000]
   [smoothxg]
    -H, --n-haps N              number of haplotypes, if different than that set with -n [default: n-mappings]
    -d, --split-min-depth N     minimum POA block depth to trigger splitting [default: 2000]
    -I, --block-id-min N        split blocks into groups connected by this identity threshold [default: 0.0]
    -R, --block-ratio-min N     minimum small / large length ratio to cluster in a block [default: 0.0]
    -j, --path-jump-max         maximum path jump to include in block [default: 100]
    -e, --edge-jump-max N       maximum edge jump before breaking [default: 0 / off]
    -G, --poa-length-target N,M target sequence length for POA, first pass = N, second pass = M [default: 13117,13219]
    -P, --poa-params PARAMS     score parameters for POA in the form of match,mismatch,gap1,ext1,gap2,ext2
                                [default: 1,19,39,3,81,1]
    -O, --poa-padding N         pad each end of each sequence in POA with N*(longest_poa_seq) bp [default: 0.03]
    -F, --write-maf             write MAF output representing merged POA blocks [default: off]
    -Q, --consensus-prefix P    use this prefix for consensus path names [default: Consensus_]
    -C, --consensus-spec SPEC   consensus graph specification: write consensus graphs to
                                BASENAME.cons_[spec].gfa; where each spec contains at least a min_len parameter
                                (which defines the length of divergences from consensus paths to preserve in the
                                output), optionally a file containing reference paths to preserve in the output,
                                a flag (y/n) indicating whether we should also use the POA consensus paths, a
                                minimum coverage of consensus paths to retain (min_cov), and a maximum allele
                                length (max_len, defaults to 1e6); implies -a; example:
                                cons,100,1000:refs1.txt:n,1000:refs2.txt:y:2.3:1000000,10000
                                [default: off]
   [odgi]
    -v, --viz                   render a visualization of the graph in 1D [default: off]
    -L, --layout                render a 2D layout of the graph [default: off]
    -S, --stats                 generate statistics of the seqwish and smoothxg graph [default: off]
   [gfaffix]
    -U, --normalize             normalize and re-sort the output graph [default: off]
   [vg]
    -V, --vcf-spec SPEC         specify a set of VCFs to produce with SPEC = [REF:SAMPLE_LIST_FILE,]*
                                the paths matching ^REF are used as a reference, while the samples are taken
                                one per line from the given SAMPLE_LIST_FILE (e.g. -V chm13:sample.list,grch38:sample.list)
   [multiqc]
    -m, --multiqc               generate MultiQC report of graphs' statistics and visualizations,
                                automatically runs odgi stats [default: off]
   [general]
    -o, --output-dir PATH       output directory
    -r, --resume PATH           do not overwrite existing output from wfmash, seqwish, smoothxg in given directory
                                [default: start pipeline from scratch in a new directory]
    -t, --threads N             number of compute threads to use in parallel steps
    -T, --poa-threads N         number of compute threads to use during POA (set lower if you OOM during smoothing)
    -Z, --pigz-compress         compress alignment (.paf), graph (.gfa, .og), and MSA (.maf) outputs with pigz
    -h, --help                  this text

Use wfmash, seqwish, smoothxg, and odgi to build and display a pangenome graph.

What might I be doing wrong here? Thanks in advance!

Downstream suggestions for use pggb gfa in the vg pipeline

Hi,

I have run pggb on each chromosome FASTA of a pangenome. Does the HPRC graph have any downstream use in the vg pipeline?
I have several questions.

  1. How to combine each chromosome GFA for vg and prepare the vg indexes?
vg combine chr*/*.smooth.gfa > test.gfa
vg autoindex -p ref -R XG -w giraffe -g test.gfa -T ./tmp -M 230G -t 64
  2. Comparing the VCF from the mapping pipeline with vg snarls

vg genotyping and calling are based on vg's snarl results. If I want to compare the VCF with the pggb pangenome graph, how can I use the GFA for vg deconstruct? I tried the following code but failed to produce a VCF.

vg autoindex -p ref -R XG -w giraffe -g chr1.920c6bb.2ff309f.adf7ed8.smooth.gfa  -T ./tmp -M 230G -t 64

# VCF only has variants without genotypes, no samples
vg deconstruct -e -p ref -t 64 chr1.920c6bb.2ff309f.adf7ed8.smooth.gfa > chr1.vcf

# empty VCF with only a header, but with exactly the n-1 samples I used
vg deconstruct -e -p ref -t 64 chr1.xg > chr1.xg.vcf

But with the Cactus vg graph, I can use deconstruct to produce a VCF with multi-sample genotypes.

  3. General pipeline suggestions for pggb GFA
    Does the HPRC have any downstream analysis using pggb output (pggb -> vg index -> vg giraffe -> vg pack -> vg call)? Is there any publicly available pipeline I can follow?

How should I evaluate different parameter sets for graph building?

Thanks,
Zhigui Bao

Docker image build failed

Hello!

I was just trying to build a Docker image from the latest Dockerfile and got compilation errors in seqwish and smoothxg. If I comment out lines 50 and 59 (the lines which check out specific commits in the seqwish and smoothxg repos, respectively) and compile the latest versions, everything compiles normally.

After that, pggb seems to work just fine (at least on the test example), but it does not generate the diagnostic images (even though the -v and -l options were set).

Is there any particular reason for pinning these commits, or was pggb just tested with these particular versions of the tools?

Thanks.

Best regards,

Vitaly.

smooth maf output file question

Hello,

I am trying to understand MAF format from smooth.maf file. I have 18 genomes and created graphs for each of the chromosomes. Here are some example lines from smooth.maf output file for one of the chromosomes:

a blocks=12-13 loops=false merged=true below_thresh=true
s Consensus_12-13     0 7250 +     7250 CCCTCCTACTCATCGGGGCCTGGCACTTGCCCCGACGGCCGGGTGTAGGTCGCGCGCTTAAGCGCCATCCATTTTCGGGGCTAGTTGATTCGGCAGGTGAGTTGTTACACATTCCTTAGCGGA
s Sample2 0 7250 + 5243748              CCCTCCTACTCATCGGGGCCTGGCACTTGCCCCGACGGCCGGGTGTAGGTCGCGCGCTTAAGCGCCATCCATTTTCGGGGCTAGTTGATTCGGCAGGTGAGTTGTTACACATTCCTTAGCGGA

From the MAF documentation, I understand that this paragraph represents one multiple-alignment block. But I could not find any documentation for the Consensus rows. Does it mean that for block 12-13, only Sample2 aligns to the consensus? Thank you for any help!

Issues with using PGGB output with VG or GraphAligner

I have a set of a few bacterial assemblies of a ~5 Mbp genome. My pggb run on the genome is quick and the results look good in the included visualizations: it's highly contiguous and the paths are reasonable.

However, I want to align short reads to the pangenome graph and call variants. When I use GraphAligner with the pggb graph and paired-end Illumina short-read data, I get reasonable read mapping (about 95% of reads mapped), but the GAM output doesn't seem to work with vg call: I get completely blank, all-"ref", or all-"0" calls in the resulting VCF. Is there a formatting issue when converting pggb's smoothxg output into .vg format that I should be aware of? It seems like at some point the pggb graph isn't converting to vg correctly.

An example of my piping is below, where $gfa is the smoothxg graph from pggb.

vg view -Fv $gfa > salm.vg                      # -F: read GFA, -v: write vg format
vg index -t 38 -x salm.xg salm.vg               # build the xg index

GraphAligner --verbose -x vg -t 38 -g salm.vg -f $fq1 -f $fq2 -a salm.gam   # align the short reads, write GAM

vg pack -t 38 -x salm.xg -g salm.gam -Q 5 -o salm.gam.pack    # compute read support, ignoring MAPQ < 5
vg call -d 1 -t 38 -m 1,2 -k salm.gam.pack -s SRR13061905 -a salm.xg > salm.xg.gam.vcf   # genotype from the pack

set up Memory and threads of pggb

Dear Sir/Madam,

Is there any guidance on how to set the memory and threads for pggb?
For example, with seven genomic sequences of about 100 Mbp each, for a fast analysis there is a balance between the memory available per thread and the number of threads. How should it be balanced?
What is the minimum memory per thread, e.g. 4 GB? And will more threads matter more than the memory available per thread?
Thanks.

Behaviour of PGGB when resuming

Hi

After PGGB completes, I intend to add spliced regions to my seqwish graph using vg rna, then add it back in and resume from the smoothxg step to normalise the splice regions.

I was wondering: to resume, do I need to specify the same parameters again, or will PGGB read the YAML file to infer which parameters to resume with? Could I basically just run pggb -r -o output_folder? Thanks.
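
For reference, the usage text quoted earlier in this document shows -r taking the directory to resume in; whether the other parameters can then be omitted is exactly the open question here, so a cautious sketch repeats them (values are illustrative):

pggb -i in.fa -s 5000 -p 90 -n 9 -t 16 -o output_folder -r output_folder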

Bubble popping/gfa-format issue

Hi!

After running pggb I would like to check out the variants/bubbles found. I tried using "gfatools bubble", but no output is produced when running it on the pggb GFA. Is there a known difference between the GFA format of pggb and that of e.g. minigraph causing this, or is there another way to get the variant/bubble information?
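
One route I could try instead (a sketch based on the vg deconstruct usage that appears elsewhere in these issues, assuming the graph contains a path named "ref" to serve as the reference):

vg deconstruct -e -p ref -t 16 in.smooth.gfa > variants.vcf   # bubbles reported as VCF records against "ref"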

Thanks,
Kristina Stenløkk

error in chromosome alignment

Hi,

I am trying to run pggb on my genomes but I am getting a memory error. Can you please help with this? Thank you!

seqwish issue in pggb command line

Hello All,

I am aligning multiple genomes to make a graph genome using:

pggb -i in_file.fasta.gz -w 100000 -s 20000 -p 120 -a 70 -n 5 -t 16 -v -l out_file

It runs the alignment step and generates the corresponding PAF output file. However, at the seqwish step, it gives this error:

Passed in argument, but no positional arguments were ready to receive it: 8
  seqwish {OPTIONS}

    seqwish: a variation graph inducer

  OPTIONS:

      -h, --help                        display this help menu
      -a[FILE], --sxs-alns=[FILE]       induce the graph from these .sxs
                                        formatted alignments
      -p[FILE], --paf-alns=[FILE]       induce the graph from these PAF
                                        formatted alignments
      -s[FILE], --seqs=[FILE]           the sequences used to generate the
                                        alignments (FASTA, FASTQ, .seq)
      -b[BASE], --base=[BASE]           build graph using this basename
      -g[FILE], --gfa=[FILE]            write the graph in GFA to FILE
      -m[FILE], --match-list=[FILE]     use the sequence match list in FILE to
                                        subset the input alignments
      -o[BASE], --vgp-out=[BASE]        write the graph in VGP format with
                                        basename FILE
      -t[N], --threads=[N]              use this many threads during parallel
                                        steps
      -r[N], --repeat-max=[N]           limit transitive closure to include no
                                        more than N copies of a given input base
      -k, --keep-temp                   keep intermediate files generated during
                                        graph induction
      -d, --debug                       enable debugging

seqwish -t 16 -s in_file.fasta.gz -p in_file.paf -k 8 -g in_file.gfa -B 1000000 -P

From the above message, I understand that it is not recognizing the options -k, -B, and -P, which I did not pass on the pggb command line; pggb must be adding them automatically with some other parameter. I am not sure how to avoid these parameters and make it run. Thank you for any help!

Segmentation fault (core dumped) when smoothing

Hi, I ran pggb on small datasets (say 200 or 2000 virus genomes) successfully. However, when I ran it on 20000 virus genomes, it terminated at the smoothing stage. Here is the information:

[smoothxg::main] loading graph
[smoothxg::main] prepping graph for smoothing
Command terminated by signal 11
smoothxg -t 40 -g input.fasta.pggb-s3000-p70-n10-a70-K16-k8-w10000-j5000-e5000.seqwish.gfa -w 10000 -j 5000 -e 5000 -l 10000 -m input.fasta.pggb-s3000-p70-n10-a70-K16-k8-w10000-j5000-e5000.smooth.maf -s input.fasta.pggb-s3000-p70-n10-a70-K16-k8-w10000-j5000-e5000.consensus -a -C 10,100,1000,10000
0.02s user 0.00s system 5% cpu 0.46s total 6000Kb max memory

Here is the command I used: pggb -i input.fasta -s 3000 -p 70 -a 70 -n 10 -t 40 -v -l

What could be the reason for this termination? Many thanks in advance.

Latest version - Command terminated by signal 4

Hello,
I have been using pggb for a while now, over a couple of different builds. I have been able to run previous versions and can still run one pulled February 8 without seeing this error. Yesterday I pulled the newest version with Singularity to try the --poa-params option, but I have so far been unable to run it. I have tried running this version with both wfmash and edyeet, but both terminate at the same place. I do seem to produce alignments, as I can see a readable "wfmash-us6ZOJ" file in the directory where I ran pggb. Any help that allows me to continue is greatly appreciated!

This is the log with the error I am seeing:

Starting pggb on Mon Feb 22 17:15:29 CET 2021

Command: /usr/local/bin/pggb -i /proj/uppstore2017270/Emilio/pggb/indata/Erik_test_data/complete_reference_genomes.fa -s 500 -w 25000 -p 85 -a 85 -n 14 -W -t 10 --poa-params 1,4,6,2,26,1 -v -o /proj/uppstore2017270/Emilio/pggb/new_version_out/Erik_suggested/All_wfmash

PARAMETERS

general:
  input-fasta:        /proj/uppstore2017270/Emilio/pggb/indata/Erik_test_data/complete_reference_genomes.fa
  output-dir:         /proj/uppstore2017270/Emilio/pggb/new_version_out/Erik_suggested/All_wfmash
  resume:             false
  threads:            10
alignment:
  mapping-tool:       wfmash
  no-splits:          false
  segment-length:     500
  block-length:       1500
  no-merge-segments:  false
  map-pct-id:         85
  align-pct-id:       85
  n-secondary:        14
  mash-kmer:          16
  wfmash:             true
  exclude-delim:      false
seqwish:
  min-match-len:      19
  transclose-batch:   1000000
smoothxg:
  block-weight-max:   25000
  path-jump-max:      5000
  edge-jump-max:      5000
  poa-length-max:     10000
  poa-params:         1,4,6,2,26,1
  consensus-spec:     10,100,1000,10000
  block-id-min:       0
  ratio-contain:      0
odgi:
  viz:                true
  layout:             false
  stats:              false
reporting:
  multiqc:            false

Running pggb

[wfmash::map] Reference = [/proj/uppstore2017270/Emilio/pggb/indata/Erik_test_data/complete_reference_genomes.fa]
[wfmash::map] Query = [/proj/uppstore2017270/Emilio/pggb/indata/Erik_test_data/complete_reference_genomes.fa]
[wfmash::map] Kmer size = 16
[wfmash::map] Window size = 20
[wfmash::map] Segment length = 500 (read split allowed)
[wfmash::map] Block length min = 1500
[wfmash::map] Alphabet = DNA
[wfmash::map] Percentage identity threshold = 0.85%
[wfmash::map] Mapping output file = /crex/proj/uppstore2017270/Emilio/pggb/wfmash-us6ZOJ
[wfmash::map] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[wfmash::map] Execution threads = 10
[wfmash::skch::Sketch::build] minimizers picked from reference = 2330611
[wfmash::skch::Sketch::index] unique minimizers = 761163
[wfmash::skch::Sketch::computeFreqHist] Frequency histogram of minimizers = (1, 371539) ... (282, 1)
[wfmash::skch::Sketch::computeFreqHist] With threshold 0.001%, ignore minimizers occurring >= 67 times during lookup.
[wfmash::map] time spent computing the reference index: 1.84586 sec
[wfmash::skch::Map::mapQuery] WARNING, no .fai index found for /proj/uppstore2017270/Emilio/pggb/indata/Erik_test_data/complete_reference_genomes.fa, reading file to sum sequence length (slow)
[wfmash::skch::Map::mapQuery] mapped 100.00% @ 3.15e+06 bp/s elapsed: 00:00:00:07 remain: 00:00:00:00
[wfmash::skch::Map::mapQuery] count of mapped reads = 14, reads qualified for mapping = 15, total input reads = 15, total input bp = 24465645
[wfmash::map] time spent mapping the query: 7.81e+00 sec
[wfmash::map] mapping results saved in: /crex/proj/uppstore2017270/Emilio/pggb/wfmash-us6ZOJ
[wfmash::align] Reference = [/proj/uppstore2017270/Emilio/pggb/indata/Erik_test_data/complete_reference_genomes.fa]
[wfmash::align] Query = [/proj/uppstore2017270/Emilio/pggb/indata/Erik_test_data/complete_reference_genomes.fa]
[wfmash::align] Mapping file = /crex/proj/uppstore2017270/Emilio/pggb/wfmash-us6ZOJ
[wfmash::align] Alignment identity cutoff = 8.50e-01%
[wfmash::align] Alignment output file = /dev/stdout
[wfmash::align] time spent read the reference sequences: 4.60e-02 sec
Command terminated by signal 4
wfmash -X -s 500 -l 1500 -p 85 -n 14 -k 16 -t 10 /proj/uppstore2017270/Emilio/pggb/indata/Erik_test_data/complete_reference_genomes.fa /proj/uppstore2017270/Emilio/pggb/indata/Erik_test_data/complete_reference_genomes.fa
56.16s user 1.49s system 557% cpu 10.33s total 254096Kb max memory

how to extract sequences from graph genome

Hello,

I want to extract a sequence segment that shows up in only one of the samples used to build the graph and appears as a gap in the rest of the samples. Can you please advise whether I should use the multiple alignment file for this? Thank you!
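
A sketch of one possible route with odgi (assuming its path and extraction options; path names and coordinates are illustrative):

odgi paths -i in.smooth.og -f > paths.fa                             # dump every embedded path (input sequence) as FASTA
odgi extract -i in.smooth.og -r "sample1:10000-20000" -o region.og   # or pull out the subgraph around a region of one sample's path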

Edyeet -N no split and resolved memory issue with abPOA

Hi Erik

I see you've added new options to edyeet and smoothxg.

Would using -N (no split) with edyeet be the same as setting the segment length to the length of the chromosome/genome? Would using -N reduce memory requirements, and would there be an advantage to using -N over the default of splitting the sequences into segment lengths? If so, why not set -N by default? I also see you've added an update to use abPOA. Has the memory issue with abPOA been resolved now? Being able to use abPOA again would greatly reduce the memory overhead.

I have also just gained access to compute nodes in my cluster with 128GB RAM. Smoothxg jobs which failed before are now completing. I'm now testing with different Edyeet segment lengths and -a/-p parameters to see where the new memory limits are.

gfaffix not found with latest PGGB build

Hi

I've tried running PGGB with the latest docker build (pulled 3rd August). I get the following error:

[odgi::unchop] unchopped 827209 nodes into 20 new nodes.
[smoothxg::main] smoothed graph length 12656130003bp in 20 nodes
[smoothxg::main] writing smoothed graph to barley_pangenome_7H_s1000000_l3000000_p95_k79_B10000000_w10000000_I0_R0_j100_e0_P1-4-6-2-26-1/barley_pangenome_7H.fasta.4c2b892.7748b33.37f4624.smooth.gfa
smoothxg -t 32 -T 32 -g barley_pangenome_7H_s1000000_l3000000_p95_k79_B10000000_w10000000_I0_R0_j100_e0_P1-4-6-2-26-1/barley_pangenome_7H.fasta.4c2b892.7748b33.37f4624.smooth.1.gfa -w 10000000 -K -X 100 -d 2000 -I 0 -R 0 -j 100 -e 0 -l 15331 -p 1,4,6,2,26,1 -O 0.01 -Q Consensus_ -V -o barley_pangenome_7H_s1000000_l3000000_p95_k79_B10000000_w10000000_I0_R0_j100_e0_P1-4-6-2-26-1/barley_pangenome_7H.fasta.4c2b892.7748b33.37f4624.smooth.gfa
61667.98s user 5170.68s system 1310% cpu 5100.68s total 81869412Kb max memory
/usr/local/bin/pggb: line 387: gfaffix: command not found

Thanks.

/usr/bin/time: cannot run edyeet: No such file or directory

Hello,

I'm trying to use pggb with this command line:
/SD5/people/s1060627/pggb/pggb -i /SD5/people/u464891/Data/Projects/mutate_fasta_genome/genome_comparison/datasets/Oryza_sativa__japonica_v3.fas -s 5000 -n 10 -p 90
but I've got the following error message:

/usr/bin/time: cannot run edyeet: No such file or directory
edyeet -X -s 5000 -l 15000 -p 90 -n 10 -a 0 -k 16 -t 1 /SD5/people/u464891/Data/Projects/mutate_fasta_genome/genome_comparison/datasets/Oryza_sativa__japonica_v3.fas /SD5/people/u464891/Data/Projects/mutate_fasta_genome/genome_comparison/datasets/Oryza_sativa__japonica_v3.fas
0.00s user 0.00s system 0% cpu 0.00s total 352Kb max memory

Is there a way to specify the path to edyeet so pggb can find it? That seems to be the problem, from what I understand of the error message.
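
For example, something like this should work if an edyeet binary exists somewhere on the system (the path below is illustrative):

export PATH=/path/to/dir/containing/edyeet:$PATH
which edyeet    # should now resolve before re-running pggb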

Thanks in advance for your help
Have a great day
Best,
C.

Segfault at start of smoothxg step (on broadwell machine)

Hi,
I've been encountering a segfault at the smoothxg step of the pipeline. It happens very early on, so it doesn't appear to be similar to #47. I've checked that the segfault happens on several different nodes, all of which are Broadwell with avx2 flags, so that shouldn't be the issue. In addition, the version is only several days old (pangenome/smoothxg@3481f90).

vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz

flags		:  ... avx2 ...

The error looks like it occurs during graph prepping.

[smoothxg::main] loading graph
[smoothxg::main] prepping graph for smoothing
Command terminated by signal 11
smoothxg -t 12 -g /output/input.fa.gz.pggb-E-s100000-l300000-p90-n10-a0-K16-k19.seqwish.gfa -w 10000 -M -J 0.7 -K -I 0 -R 0 -j 5000 -e 5000 -l 10000 -m /output/input.fa.gz.pggb-E-s100000-l300000-p90-n10-a0-K16-k19.seqwish-w10000-j5000-e5000-I0.smooth.maf -C /output/input.fa.gz.pggb-E-s100000-l300000-p90-n10-a0-K16-k19.seqwish-w10000-j5000-e5000-I0.smooth.consensus,10,100,1000,10000 -o /output/input.fa.gz.pggb-E-s100000-l300000-p90-n10-a0-K16-k19.seqwish-w10000-j5000-e5000-I0.smooth.gfa
37.02s user 10.61s system 153% cpu 31.01s total 14194488Kb max memory

I tried running it directly in the container, additionally with the debug and spoa flags (-d -S), but hit the same issue with no additional information:

[smoothxg::main] loading graph
[smoothxg::main] prepping graph for smoothing
Segmentation fault

I reached the same point (error at the start of smoothxg) with a previous version of pggb (sometime mid-January), so I'm at a bit of a loss for what to try next.
Thanks,
Alex

Issue with the vcf file created using pggb graphs

Hello,

I have created a pggb graph for a chromosome (pggb command screenshot attached).
I made a dot plot of two samples (one with the extra piece, the other without) using mashmap with the same percent identity and segment length as pggb (dot plot attached).
Then I made a VCF using vg deconstruct to locate the insertion, but I cannot find the insertion in the VCF based on the pggb graph. I am not able to attach the VCF here; I can send it separately if needed.

I used BLAST and it looks like there are repetitive sequences in this region (5000000-10000000). I would appreciate any help in figuring out the insertion in the VCF file.

Thank you!

wfmash error with latest pggb build v0.1

Hi

I've just updated to the latest build and I'm now getting errors at the wfmash step when aligning only small gene regions of a few thousand bp. I pulled the build from Docker and set many of the parameters to the now-recommended defaults (e.g. block_id_min=0). This has happened consistently with several different gene regions.

+ block_length=0
+ min_match_len=29
+ transclose_batch=10000000
+ block_weight_max=10000000
+ block_id_min=0
+ ratio_contain=0
+ path_jump_max=100
+ edge_jump_max=0
+ poa_params=1,4,6,2,26,1
+ segment_length=2000
+ srun -n 1 singularity exec --bind /data/pangene_graph_workflow/pangene-HORVU.MOREX.r2.3HG0222170_mmseq2pctid-0.9_mmseq2mincov-0.8_pggb-mapping-pctid-85_geneBuffer-1000bp:/data/pangene_graph_workflow/pangene-HORVU.MOREX.r2.3HG0222170_mmseq2pctid-0.9_mmseq2mincov-0.8_pggb-mapping-pctid-85_geneBuffer-1000bp /data/pggb_builds/pggb.sif pggb -i /data/pangene_graph_workflow/pangene-HORVU.MOREX.r2.3HG0222170_mmseq2pctid-0.9_mmseq2mincov-0.8_pggb-mapping-pctid-85_geneBuffer-1000bp/HORVU.MOREX.r2.3HG0222170_pan_genes_buffer-1000bp.fna -N -s 2000 -l 0 -p 85 -n 21 -k 29 -B 10000000 -w 10000000 -I 0 -R 0 -j 100 -e 0 -P 1,4,6,2,26,1 -t 16 -m -v -L -S -o pggb_output
[wfmash::map] Reference = [/data/pangene_graph_workflow/pangene-HORVU.MOREX.r2.3HG0222170_mmseq2pctid-0.9_mmseq2mincov-0.8_pggb-mapping-pctid-85_geneBuffer-1000bp/HORVU.MOREX.r2.3HG0222170_pan_genes_buffer-1000bp.fna]
[wfmash::map] Query = [/data/pangene_graph_workflow/pangene-HORVU.MOREX.r2.3HG0222170_mmseq2pctid-0.9_mmseq2mincov-0.8_pggb-mapping-pctid-85_geneBuffer-1000bp/HORVU.MOREX.r2.3HG0222170_pan_genes_buffer-1000bp.fna]
[wfmash::map] Kmer size = 16
[wfmash::map] Window size = 5
[wfmash::map] Segment length = 2000 (read split disabled)
[wfmash::map] Block length min = 0
[wfmash::map] Alphabet = DNA
[wfmash::map] Percentage identity threshold = 0.85%
[wfmash::map] Mapping output file = /data/pangene_graph_workflow/pangene-HORVU.MOREX.r2.3HG0222170_mmseq2pctid-0.9_mmseq2mincov-0.8_pggb-mapping-pctid-85_geneBuffer-1000bp/wfmash-H8TzOD
[wfmash::map] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[wfmash::map] Execution threads  = 16
[wfmash::skch::Sketch::build] minimizers picked from reference = 20497
[wfmash::skch::Sketch::index] unique minimizers = 1415
[wfmash::skch::Sketch::computeFreqHist] Frequency histogram of minimizers = (1, 261) ... (27, 1)
[wfmash::skch::Sketch::computeFreqHist] With threshold 0.001%, consider all minimizers during lookup.
[wfmash::map] time spent computing the reference index: 0.00455175 sec
[wfmash::skch::Map::mapQuery] mapped 100.00% @ 2.44e+05 bp/s elapsed: 00:00:00:00 remain: 00:00:00:00
[wfmash::skch::Map::mapQuery] count of mapped reads = 21, reads qualified for mapping = 21, total input reads = 21, total input bp = 60984
[wfmash::map] time spent mapping the query: 2.63e-01 sec
[wfmash::map] mapping results saved in: /data/pangene_graph_workflow/pangene-HORVU.MOREX.r2.3HG0222170_mmseq2pctid-0.9_mmseq2mincov-0.8_pggb-mapping-pctid-85_geneBuffer-1000bp/wfmash-H8TzOD
[wfmash::align] Reference = [/data/pangene_graph_workflow/pangene-HORVU.MOREX.r2.3HG0222170_mmseq2pctid-0.9_mmseq2mincov-0.8_pggb-mapping-pctid-85_geneBuffer-1000bp/HORVU.MOREX.r2.3HG0222170_pan_genes_buffer-1000bp.fna]
[wfmash::align] Query = [/data/pangene_graph_workflow/pangene-HORVU.MOREX.r2.3HG0222170_mmseq2pctid-0.9_mmseq2mincov-0.8_pggb-mapping-pctid-85_geneBuffer-1000bp/HORVU.MOREX.r2.3HG0222170_pan_genes_buffer-1000bp.fna]
[wfmash::align] Mapping file = /data/pangene_graph_workflow/pangene-HORVU.MOREX.r2.3HG0222170_mmseq2pctid-0.9_mmseq2mincov-0.8_pggb-mapping-pctid-85_geneBuffer-1000bp/wfmash-H8TzOD
[wfmash::align] Alignment identity cutoff = 8.50e-01%
[wfmash::align] Alignment output file = /dev/stdout
[wfmash::align] time spent read the reference sequences: 2.53e-04 sec
Command terminated by signal 4
wfmash -X -s 2000 -l 0 -N -p 85 -n 21 -k 16 -t 16 /data/pangene_graph_workflow/pangene-HORVU.MOREX.r2.3HG0222170_mmseq2pctid-0.9_mmseq2mincov-0.8_pggb-mapping-pctid-85_geneBuffer-1000bp/HORVU.MOREX.r2.3HG0222170_pan_genes_buffer-1000bp.fna /data/pangene_graph_workflow/pangene-HORVU.MOREX.r2.3HG0222170_mmseq2pctid-0.9_mmseq2mincov-0.8_pggb-mapping-pctid-85_geneBuffer-1000bp/HORVU.MOREX.r2.3HG0222170_pan_genes_buffer-1000bp.fna
0.34s user 0.05s system 18% cpu 2.16s total 21224Kb max memory
srun: error: node-4: task 0: Exited with exit code 132

how to construct pangenome graph from two assembly genome of two species

Hello, Professor,
I have genome assemblies of two species and want to build a graph from them using pggb. I first combine the two genomes into one input file, then index it, and finally run pggb, following the human example in the README. The commands are:
cat A.fasta B.fasta > input.fasta
samtools faidx input.fasta
pggb -i input.fasta -s 100000 -p 70 -t 40 -v -L -U -o out -T 20 -n 2 -H 2 -G 20000
I have used "mash dist" to estimate the divergence between A.fasta and B.fasta. The result is:
A.fasta B.fasta 0.0181313 0 519/1000
I do not know how to convert this result into an approximate percent identity to provide as -p, nor how to set the related parameters -k, -s, and -G. I set -n to 2 according to the number of assemblies. I also cannot run pggb per chromosome, because I cannot match up the chromosomes between the two genomes.
Also, can you tell me the expected memory and CPU usage? Each of my genomes is about 3 Gbp.
I would appreciate any suggestions you could give. Looking forward to your reply.
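
On the -p conversion: mash dist reports an approximate per-base divergence, so percent identity ≈ 100 * (1 - distance); with the value above, 100 * (1 - 0.0181313) ≈ 98.2, i.e. the two genomes are ~98% identical, so a -p of 90 (or even 95) should comfortably cover the mapping step. As a one-liner:

awk 'BEGIN { d=0.0181313; printf "%.1f\n", 100*(1-d) }'    # prints 98.2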

Pggb on bacteria.

Very cool tool. I'd like to apply it to a set of bacterial genomes.

Do I understand this message correctly that a command took too long? Shouldn't the exit status then be 1?
In any case, the smooth.gfa is empty. I should probably change one of the smoothing parameters; could you give me some advice?

[smoothxg::smoothable_blocks] computing blocks
[smoothxg::smoothable_blocks] computing blocks 100.00%
Command terminated by signal 4 (while applying spoa to block 3/37675, 0.008%)
        Command being timed: "smoothxg -t 8 -g /data/Akkermansia_ref.fasta.gz.pggb-s1000-p90-n10-a90-K11-k8-w10000-j5000-W0-e100.seqwish.gfa -w 10000 -j 5000 -k 0 -e 100"
        User time (seconds): 1952.10
        System time (seconds): 23.83
        Percent of CPU this job got: 318%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 10:21.26
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4197032
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 2
        Minor (reclaiming a frame) page faults: 2382976
        Voluntary context switches: 760552
        Involuntary context switches: 9034
        Swaps: 0
        File system inputs: 264
        File system outputs: 7309432
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

filtering false positive graph link/paths

I found that the pggb output graph has too many false-positive links to other places in the genomes.
Is there a good way to filter out the long links and keep only the appropriate links close to the gaps? Detailed instructions on how to tune the parameters would be very useful.
My goal is to keep links among very close blocks and filter out those spanning many blocks. The genomes are collinear, so the links among very close blocks should be the primary ones and should be kept. My genome may have many repeat sequences, and some links (edges) are obviously wrong: they are several megabases long and span many blocks.

Parameters used:
-s25000-p70-n5-a70-K16-k5-w10000-j5000-e5000
Graph image: https://ibb.co/m9TmGGK
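
One knob worth trying (a sketch, not a guaranteed fix): longer segments and a higher identity threshold make the wfmash mappings more specific, which tends to suppress spurious long-range links arising from repeats:

pggb -i in.fa -s 50000 -p 90 -n 5 -t 16 -o out_strict   # larger -s and higher -p than the run above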
