gfaffix's Introduction

GFAffix

GFAffix collapses walk-preserving shared affixes

GFAffix identifies walk-preserving shared affixes in variation graphs and collapses them into a non-redundant graph structure.
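To make the notion of a shared prefix concrete, here is a toy sketch in Python. It is not GFAffix's algorithm and ignores walks, orientations and graph structure entirely; it only shows what it means for the sequences of sibling nodes (hypothetical example data) to share a prefix that could be stored once instead of several times.

```python
from os.path import commonprefix


def split_shared_prefix(sequences):
    """Split sibling node sequences into one shared prefix and the leftover suffixes.

    Toy illustration only: real affix collapsing must also preserve every walk
    through the graph, which this function does not model.
    """
    # commonprefix compares the strings character by character.
    prefix = commonprefix(list(sequences))
    suffixes = [seq[len(prefix):] for seq in sequences]
    return prefix, suffixes


# Hypothetical sequences of three nodes that all follow the same parent node.
siblings = ["ACGTTA", "ACGTCC", "ACGG"]
prefix, suffixes = split_shared_prefix(siblings)
print(prefix)    # shared part that would be stored once
print(suffixes)  # what remains specific to each sibling
```

In this example the three siblings share `ACG`, so a non-redundant representation would keep that prefix in a single node and leave `TTA`, `TCC` and `G` in separate nodes.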

Dependencies

GFAffix is written in Rust and requires a working Rust toolchain for installation. See https://www.rust-lang.org/tools/install for more details.

It makes use of the following crates:

  • clap
  • env_logger
  • gfa
  • handlegraph
  • log
  • quick-csv
  • regex
  • rustc-hash

Installation

From bioconda channel

Make sure you have conda installed!

conda install -c bioconda gfaffix

From binary release

Linux x86_64


wget --no-check-certificate -c https://github.com/marschall-lab/GFAffix/releases/download/0.1.5b/GFAffix-0.1.5b_linux_x86_64.tar.gz 
tar -xzvf GFAffix-0.1.5b_linux_x86_64.tar.gz 

# you are ready to go! 
./GFAffix-0.1.5b_linux_x86_64/gfaffix


MacOS X arm64


wget --no-check-certificate -c https://github.com/marschall-lab/GFAffix/releases/download/0.1.5b/GFAffix-0.1.5b_macos_x_arm64.tar.gz 
tar -xzvf GFAffix-0.1.5b_macos_x_arm64.tar.gz 

# you are ready to go! 
./GFAffix-0.1.5b_macos_x_arm64/gfaffix


From repository

# install GFAffix
git clone https://github.com/marschall-lab/GFAffix.git
# build program
cargo build --manifest-path GFAffix/Cargo.toml --release

Command Line Interface

$ gfaffix --help
gfaffix 0.1.5b
Daniel Doerr <[email protected]>
Discover walk-preserving shared prefixes in multifurcations of a given graph.

    - Do you want log output? Call program with 'RUST_LOG=info gfaffix ...'
    - Log output not informative enough? Try 'RUST_LOG=debug gfaffix ...'

USAGE:
    gfaffix [OPTIONS] <GRAPH>

ARGS:
    <GRAPH>    graph in GFA1 format

OPTIONS:
    -c, --check_transformation
            Verifies that the transformed parts of the graphs spell out the identical sequence as in
            the original graph. Only for debugging purposes

    -h, --help
            Print help information

    -o, --output_refined <REFINED_GRAPH_OUT>
            Write refined graph in GFA1 format to supplied file [default: " "]

    -t, --output_transformation <TRANSFORMATION_OUT>
            Report original nodes and their corresponding walks in refined graph to supplied file
            [default: " "]

    -V, --version
            Print version information

    -x, --dont_collapse <NO_COLLAPSE_PATH>
            Do not collapse nodes on a given paths ("P" lines) that match given regular expression
            [default: " "]

Execution

RUST_LOG=info gfaffix examples/example1.gfa -o example1.gfa -t example1.trans > example1.shared_affixes

gfaffix's People

Contributors

danydoerr, natir

gfaffix's Issues

speeding up deep graphs

On deep graphs (2k-fold) I'm seeing gfaffix taking quite a bit of time. It's essentially single threaded, right? Is there a possible way to adapt it to operate in parallel?

memory allocation error in 0.1.5b

I've been trying to figure out why a cactus run has been crashing, and have found it happens during a gfaffix step. It turns out that this step fails when using gfaffix v0.1.5b, but not v0.1.5. A reproducible example is below, though I apologize that the input file is quite large (5.5Gb uncompressed). I don't know enough about this program to make a smaller example. However, it only takes a few minutes to crash. Although the message says "memory allocation error", there is plenty of memory remaining, and I get the same behavior on both a 512Gb and 1Tb RAM server. The actual RAM usage seems to stay well under 100Gb.

wget ftp://cbsuftp.biohpc.cornell.edu/melissa/test.vg.gfa.gz
gunzip test.vg.gfa.gz
gfaffix test.vg.gfa --output_refined out.gfa

with gfaffix v0.1.5b, this outputs:

memory allocation of 560 bytes failed
Aborted (core dumped)

But with v0.1.5, it appears to run smoothly.

Gfaffix removes one edge needed by paths, leaving graph invalid

This comes by way of ComparativeGenomicsToolkit/cactus#971

To reproduce

wget -q http://public.gi.ucsc.edu/~hickey/debug/NC_006096.5.vg.gfa.gz
gzip -d NC_006096.5.vg.gfa.gz
gfaffix ./NC_006096.5.vg.gfa --output_refined fixed.gfa --check_transformation --dont_collapse "REF*" 2> gfaffix.stderr > gfaffix.stdout
vg validate fixed.gfa

Which gives the output below. It is all complaining about a single missing edge: (1322919:1) -> (1322920:1)

graph invalid: missing edge between 435768th step (1322919:1) and 435769th step (1322920:1) of path Thailand#0#NC_006096.5_RagTag#0
graph invalid: missing edge between 435769th step (1322920:0) and 435768th step (1322919:1) of path Thailand#0#NC_006096.5_RagTag#0
graph invalid: missing edge between 437552th step (1322919:1) and 437553th step (1322920:1) of path Cornish#0#Chr9#0
graph invalid: missing edge between 437553th step (1322920:0) and 437552th step (1322919:1) of path Cornish#0#Chr9#0
graph invalid: missing edge between 436158th step (1322919:1) and 436159th step (1322920:1) of path Silkies#0#Chr9#0
graph invalid: missing edge between 436159th step (1322920:0) and 436158th step (1322919:1) of path Silkies#0#Chr9#0
graph invalid: missing edge between 436542th step (1322919:1) and 436543th step (1322920:1) of path Asil#0#NC_006096.5_RagTag#0
graph invalid: missing edge between 436543th step (1322920:0) and 436542th step (1322919:1) of path Asil#0#NC_006096.5_RagTag#0
graph invalid: missing edge between 434343th step (1322919:1) and 434344th step (1322920:1) of path Tibetan#0#Chr9#0
graph invalid: missing edge between 434344th step (1322920:0) and 434343th step (1322919:1) of path Tibetan#0#Chr9#0
graph invalid: missing edge between 436112th step (1322919:1) and 436113th step (1322920:1) of path BLH#0#Chr9#0
graph invalid: missing edge between 436113th step (1322920:0) and 436112th step (1322919:1) of path BLH#0#Chr9#0
graph invalid: missing edge between 930th step (1322919:1) and 931th step (1322920:1) of path _MINIGRAPH_#s57999
graph invalid: missing edge between 931th step (1322920:0) and 930th step (1322919:1) of path _MINIGRAPH_#s57999
graph invalid: missing edge between 434428th step (1322919:1) and 434429th step (1322920:1) of path Houdan#0#Chr9#0
graph invalid: missing edge between 434429th step (1322920:0) and 434428th step (1322919:1) of path Houdan#0#Chr9#0
graph invalid: missing edge between 446879th step (1322919:1) and 446880th step (1322920:1) of path Naked_neck#0#NC_006096.5_RagTag#0
graph invalid: missing edge between 446880th step (1322920:0) and 446879th step (1322919:1) of path Naked_neck#0#NC_006096.5_RagTag#0
graph invalid: missing edge between 442289th step (1322919:1) and 442290th step (1322920:1) of path REF#NC_006096.5
graph invalid: missing edge between 442290th step (1322920:0) and 442289th step (1322919:1) of path REF#NC_006096.5
graph invalid: missing edge between 478515th step (1322919:1) and 478516th step (1322920:1) of path Fayoumi#0#NC_006096.5_RagTag#0
graph invalid: missing edge between 478516th step (1322920:0) and 478515th step (1322919:1) of path Fayoumi#0#NC_006096.5_RagTag#0
graph: invalid
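For anyone without vg at hand, the kind of check that fails above can be approximated with a few lines of Python. This is a minimal sketch, assuming a GFA1 file with tab-separated S, L and P lines only; it is not how `vg validate` is implemented, and the function names are made up for this example. It reports every pair of consecutive path steps that has no corresponding L line in either direction.

```python
def flip(orient):
    """Return the opposite orientation sign."""
    return "-" if orient == "+" else "+"


def missing_path_edges(gfa_lines):
    """List (path, from_step, to_step) triples whose edge is absent from the L lines."""
    edges = set()
    paths = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "L":
            a, oa, b, ob = fields[1:5]
            edges.add((a, oa, b, ob))
            # The same link traversed from the other end.
            edges.add((b, flip(ob), a, flip(oa)))
        elif fields[0] == "P":
            # Each step looks like "12+" or "12-".
            steps = [(s[:-1], s[-1]) for s in fields[2].split(",")]
            paths.append((fields[1], steps))

    missing = []
    for name, steps in paths:
        for (a, oa), (b, ob) in zip(steps, steps[1:]):
            if (a, oa, b, ob) not in edges:
                missing.append((name, a + oa, b + ob))
    return missing


# Hypothetical three-node graph in which the link 2+ -> 3+ is absent.
example = [
    "H\tVN:Z:1.0",
    "S\t1\tACG",
    "S\t2\tT",
    "S\t3\tGG",
    "L\t1\t+\t2\t+\t0M",
    "P\tpathA\t1+,2+,3+\t*",
    "P\tpathB\t2-,1-\t*",
]
print(missing_path_edges(example))
```

Here `pathB` is fine because it uses the only link in reverse, while `pathA` is reported for the step from `2+` to `3+`.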

GFAffix breaks path

I've run into a rare problem with gfaffix where it seems to leave an invalid path on this graph. Here is how to reproduce:

wget http://public.gi.ucsc.edu/~hickey/debug/chr2.vg.gfa.gz
gzip -d chr2.vg.gfa.gz
gfaffix chr2.vg.gfa --output_refined chr2.gfaffixed.gfa --dont_collapse "GRCh38*" 2> gfaffix.stderr > gfaffix.stdout
vg validate chr2.gfaffixed.gfa
graph invalid: missing edge between 67753th step (7866944:0) and 67754th step (7861328:0) of path HG03579.1.JAGYVU010000149.1
graph invalid: missing edge between 67754th step (7861328:1) and 67753th step (7866944:0) of path HG03579.1.JAGYVU010000149.1
graph: invalid

GFAffix strips header information

In vg we're using these "RS" GFA header tags to distinguish between reference and haplotype paths in W-lines. But this information is getting lost in GFAffix which apparently just writes H VG:Z:1.0 in the output no matter what. This is not an urgent issue as it's so easy to work around, but it would be nice if GFAffix would be updated to preserve the input H line as is.
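The workaround mentioned above could look roughly like the following Python sketch. It assumes both graphs are available as lists of tab-separated GFA lines; `restore_header` is a name invented for this example, and the tag values in the sample data are hypothetical.

```python
def is_header(line):
    """True if the line is a GFA header ("H") record."""
    return line.rstrip("\n").split("\t", 1)[0] == "H"


def restore_header(original_lines, refined_lines):
    """Replace the H lines of the refined graph with those of the original graph."""
    original_headers = [line for line in original_lines if is_header(line)]
    if not original_headers:
        # Nothing to restore, keep the refined graph untouched.
        return list(refined_lines)
    body = [line for line in refined_lines if not is_header(line)]
    return original_headers + body


# Hypothetical input header carrying an RS tag, and a refined graph with a generic header.
original = ["H\tVN:Z:1.1\tRS:Z:GRCh38 CHM13", "S\t1\tA"]
refined = ["H\tVG:Z:1.0", "S\t7\tA"]
print(restore_header(original, refined))
```

The sketch keeps every non-header line of the refined graph in its original order and only swaps the header records.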

--dont_collapse braids a deep nested snarl

This is messing with @xchang1's distance index, because it makes a deep nested snarl structure.

To reproduce, gfaffix this graph with and without collapsing

wget -q http://public.gi.ucsc.edu/~hickey/debug/gfaffix-snarl69/chunk_133493101_133529958_raw.gfa
gfaffix chunk_133493101_133529958_raw.gfa -o chunk_133493101_133529958_fix.gfa --dont_collapse 'CHM13*' > /dev/null
gfaffix chunk_133493101_133529958_raw.gfa -o chunk_133493101_133529958_fixc.gfa  > /dev/null

This is what the original graph looks like:
[image: chunk_133493101_133529958_raw]
After gfaffix, it zips up part of the bubble:
[image: chunk_133493101_133529958_fix]
But if I zoom into the zipped part, it's a really weird "braid" structure:
[image: chunk_133493101_133529958_fix_zoom]
That same part looks like this in _fixc.gfa, where --dont_collapse wasn't used (but the ref path loops back through the zipped part):
[image: chunk_133493101_133529958_fix_collapse_zoom]

The net result is a really nested distance index, which can be checked as follows.

vg stats -b chunk_133493101_133529958_raw.dist | sort -rnk 4 | head -1
vg stats -b chunk_133493101_133529958_fix.dist | sort -rnk 4 | head -1
vg stats -b chunk_133493101_133529958_fixc.dist | sort -rnk 4 | head -1

These show a snarl depth of 3 for the raw graph, 4 for the collapsed graph and 69 for the --dont_collapse graph.

Do you think there is a way of preventing this type of motif? It's true that the fixed graph has 75 fewer bases and 64 fewer nodes (one more edge, though), but it is much more difficult for vg to work with.

--dont_collapse doesn't work on W-lines?

When I run gfaffix on a GFA with W-lines, I get downstream sanity check failures about cycles on reference paths (that I'd passed to --dont_collapse). When I run the same file with P-lines, this error doesn't happen. This could be something I've screwed up, but it really seems like --dont_collapse may not be working on W-line paths?
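One way to compare the two cases on the same graph is to rewrite W-lines as P-lines before running gfaffix. The Python sketch below does this for a single line. The `sample#haplotype#sequence` naming of the resulting path is an assumption made for this example, the start and end coordinates of the walk are dropped, and `w_line_to_p_line` is a hypothetical helper, not part of gfaffix or vg.

```python
import re


def w_line_to_p_line(line):
    """Convert one GFA W-line into a P-line; other lines are returned unchanged."""
    text = line.rstrip("\n")
    fields = text.split("\t")
    if fields[0] != "W":
        return text
    _, sample, hap, seq_id, _start, _end, walk = fields[:7]
    # A walk such as ">1<2>3" becomes the step list "1+,2-,3+".
    steps = [
        seg + ("+" if direction == ">" else "-")
        for direction, seg in re.findall(r"([<>])([^<>]+)", walk)
    ]
    name = f"{sample}#{hap}#{seq_id}"  # assumed naming scheme
    return "\t".join(["P", name, ",".join(steps), "*"])


print(w_line_to_p_line("W\tCHM13\t0\tchrX\t0\t12\t>1<2>3"))
```

The regular expression given to --dont_collapse then has to match the generated path names rather than the original W-line fields.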

gfaffix invalidates graph by removing self-looping edge

This issue comes courtesy of ComparativeGenomicsToolkit/cactus#1123

Note that to reproduce, you need a vg version < 1.50.0 due to this issue!

gfaffix --version
gfaffix 0.1.4b

vg version
vg version v1.49.0 "Peschici"
wget -q http://public.gi.ucsc.edu/~hickey/debug/gfaffix-crash-aug4-2023/NW_017567117.1.vg.gfa.gz
gzip -d NW_017567117.1.vg.gfa.gz

vg validate NW_017567117.1.vg.gfa
graph: valid

gfaffix NW_017567117.1.vg.gfa --output_refined NW_017567117.1.vg.gfaffixed.gfa --check_transformation --dont_collapse GCF_0016612551* > /dev/null

vg validate NW_017567117.1.vg.gfaffixed.gfa
graph invalid: missing edge between 691th step (86210:1) and 692th step (86210:0) of path GCA_0193217651#0#JAHTLY010002216.1#0
graph invalid: missing edge between 692th step (86210:1) and 691th step (86210:1) of path GCA_0193217651#0#JAHTLY010002216.1#0
graph: invalid

--dont_collapse regex behaves weirdly

Something I noticed when debugging another issue (which I'll post right after this). When I use CHM* or CHM13* for --dont_collapse I get different results:

wget http://public.gi.ucsc.edu/~hickey/debug/gfaffix-snarl69/chunk_133493101_133529958_raw.gfa

grep ^P chunk_133493101_133529958_raw.gfa | grep CHM | awk '{print $2}'
CHM13#chrX[133493094]
gfaffix chunk_133493101_133529958_raw.gfa -o chunk_133493101_133529958_fix.gfa --dont_collapse 'CHM13*' > /dev/null
gfaffix chunk_133493101_133529958_raw.gfa -o chunk_133493101_133529958_fix1.gfa --dont_collapse 'CHM*' > /dev/null
ls -l *.gfa
-rw-rw-r-- 1 hickey hickey 823926 Nov  9 11:59 chunk_133493101_133529958_fix1.gfa
-rw-rw-r-- 1 hickey hickey 823091 Nov  9 11:59 chunk_133493101_133529958_fix.gfa
-rw-rw-r-- 1 hickey hickey 821759 Nov  9 11:17 chunk_133493101_133529958_raw.gfa

Seems unexpected? Am I missing something obvious?
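A possible factor, given that the help text describes --dont_collapse as taking a regular expression: `CHM*` and `CHM13*` read differently as regular expressions than as shell-style globs. The sketch below uses Python's `re` and `fnmatch` modules to show the difference. It assumes these simple patterns behave the same in the Rust `regex` crate, and it makes no claim about whether gfaffix matches anywhere in the path name or against the whole name.

```python
import re
from fnmatch import fnmatchcase


def describe(pattern, name):
    """Evaluate one pattern against one path name under three interpretations."""
    return {
        "regex_search": re.search(pattern, name) is not None,       # match anywhere
        "regex_fullmatch": re.fullmatch(pattern, name) is not None,  # match whole name
        "glob": fnmatchcase(name, pattern),                          # shell-style wildcard
    }


name = "CHM13#chrX[133493094]"
for pattern in ["CHM*", "CHM13*", "CHM13.*"]:
    print(pattern, describe(pattern, name))

# As regular expressions, "CHM*" means "CH" followed by zero or more "M",
# and "CHM13*" means "CHM1" followed by zero or more "3".
print(describe("CHM*", "CH38"))      # hypothetical name without an "M"
print(describe("CHM13*", "CHM1#x"))  # hypothetical name without a "3"
```

Under the regular-expression reading, the two patterns accept different sets of names, and neither covers the full path name unless written as `CHM13.*`.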

update Bioconda recipe

The latest release is not reflected in Bioconda. Could you please force an update? Thanks!
