Giter Club home page Giter Club logo

rhea's Introduction

rhea: reference-free heterogeneity and evolution in assembly graphs

Description

Rhea is a software used to detect structural variants (SVs) between steps in long-read metagenomic series data. Current SVs detected include: insertions, deletions, complex indel (insertion and deletion in the same location), and tandem duplications. Below is a graphic represtation of the rhea pipeline.

pipeline_image

(a) To utilize rhea, first, microbiome series data must be collected and long whole genome sequencing reads generated. Then, within rhea, a coassembly graph of all reads in the series is created with metaFlye. Reads from each sample are then separately aligned to the coassembly graph with minigraph. Rhea evaluates log fold change in coverage between series steps for SV-specific patterns in the assembly graph to detect structural variants between steps. (b) Assembly graph patterns detected in rhea, which indicate potential insertions, deletions, complex indels, and tandem duplicates. Insertions and deletions are detected by observing a triangle where one node has a significantly higher (insertion) or lower (deletion) log fold change. Complex indels are noted by a square with one or two outliers; in the case of two outliers, the two outliers must be of opposing sides of the median and not have an edge between them. Tandem duplicates are detected by a log fold change of a self-loop edge coverage greater than 1.

Demo

Here is a toy example to detect SV in two different variants E. coli in a metagenome. First, download t0.fasta and t1.fasta from OSF under the example directory. Then run rhea on the samples.

python ./rhea.py example/t0.fasta example/t1.fasta

Expected output: All output files are expected to be in directory rhea_results. Output files include: Bandage_metadata.csv, bp_counts.tsv, edge_coverage-t{0,1}, metaflye directory, node_coverage_norm.csv, node_coverage.csv, rhea.log, structual_variants-c0, t{0,1}.gaf.

Expected run time: ~5 minutes for flye

Installation & Dependencies

Dependencies for rhea include: python v3.8+, numpy, pandas, networkx v3.2+, seaborn, Flye, and minigraph. Environment.yaml can be used build a conda environment with all necessary dependencies; rhea environment will need to be activated for each rhea run.

https://github.com/treangenlab/rhea.git
cd rhea
conda env create -f environment.yml
conda activate rhea

We highly recommend using metaFlye v2.9.3+ due to improvements in strain variations maintained.

Output files

  • structual_variants-c{1-N}.tsv: detected structual variants for each pair of subsequent samples. Denoted in the file name.
  • node_coverage.csv: Spreadsheet where each row is a unique node. Contains sequence length for each node, detected coverage for each sample, linear change in coverage between subsequent pairs, and log fold change in coverage. Color hex values are also provided for each type of coverage/coverage difference for graph heatmap visuals.
  • node_coverage_norm.csv: Same as node_coverage.csv, except coverage values are normalized across all samples in the set based on total number of base pairs per sample. Use of normalized or raw values for SV detection is determined by the input parameter --raw-diff (normalized is default).
  • edge_coverage-c{}.tsv: 2D matrix of edge coverage for each provided sample.
  • Bandage_metadata.csv: consolidated data for regarding node length, coverage, coverage change, and detected SVs; designed to be used with Bandage to create visuals. See below.
  • bp_counts.tsv: total number of bps in each sample; used for normalization.
  • rhea.log: Computational log and established min/max values for heatmap graph visuals.

Graph visuals

Rhea output can be used in conjunction with Bandage to see patterns SV and evolutionary activity in the metagenome.

  1. Install and open Bandage.
  2. Load assembly graph: ./rhea_results/metaflye/assembly_graph.gfa.
  3. Select column for node colors from Bandage_metadata.csv by changing any of the column name ending in _colour to colour.
  4. Import CSV for updated CSV into Bandage.
  5. Draw graph single with custom colours.
  6. Min/max color scale values can be found in rhea.log. Colored bar can be found in docs/visual_scales.png.

Parameters

Command Default Description
input (required) paths to input metagenome sequences or alignments
--output-dir, -o rhea_results path to output directory
--type nano-raw type of reads for MetaFlye graph construction ['pacbio-raw', 'pacbio-corr', 'pacbio-hifi', 'nano-raw', 'nano-corr', 'nano-hq']
--input-graph (generated) path to graph if alignments are provided (req. with alignments)
--bp-table (generated) path to bp counts per sample if alignments are provided (req. with alignments)
--node-std 1 number of standard deviations away from median to call indels
--edge-lfc-thresh 1 number of lfc increase to call duplications
--raw-diff FALSE set to true if no normalization for bp count between samples is desired
--collapse FALSE activate to collapse metaFlye bubble (i.e. not use --keep-haplotypes)
--flye-exec flye path to flye executable
--minigraph-exec minigraph path to minigraph executable
--threads, -t 3 number of threads utilized by flye & minigraph

Citations

If you cite rhea, be sure to also cite metaFlye and mingraph. If you use rhea visuals, be sure to cite Bandage.

Rhea preprint: Curry, K. D., Yu, F. B., Vance, S. E., Segarra, S., Bhaya, D., Chikhi, R., Rocha, E. P. C., & Treangen, T. J. (2024). Reference-free Structural Variant Detection in Microbiomes via Long-read Coassembly Graphs (p. 2024.01.25.577285). bioRxiv. https://doi.org/10.1101/2024.01.25.577285

Validation tests & scripts: https://osf.io/fvhw8/

rhea's People

Contributors

kdc10 avatar

Stargazers

Jie Zhu avatar  avatar  avatar  avatar Rayan Chikhi avatar Bryce Kille avatar Yilei Fu avatar Austin Marshall avatar

Watchers

Todd J Treangen avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.