Galah

Galah aims to be a more scalable metagenome-assembled genome (MAG) dereplication method. That is, it clusters microbial genomes based on their average nucleotide identity (ANI) and chooses a single member of each cluster as the representative.

Galah uses a greedy clustering approach to speed up genome dereplication relative to tools such as dRep, particularly when there are many closely related genomes (i.e. >95% ANI). The resulting clusters have two properties. If the ANI threshold is set to 99%, then:

  1. Each representative is <99% ANI to each other representative.
  2. All members are >=99% ANI to the representative.

If CheckM genome qualities were specified, then the clusters have an additional property:

  1. Each representative genome has a better quality score than the other members of its cluster. Each genome is assigned a quality score using the formula completeness - 5*contamination - 5*num_contigs/100 - 5*num_ambiguous_bases/100000, a reduced form of the quality formula described in Parks et al. 2020 (https://doi.org/10.1038/s41587-020-0501-8).
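The quality formula can be written out as a small function. This is a minimal sketch rather than Galah's own code: it assumes completeness and contamination are percentages as reported by CheckM, and the two example genomes are hypothetical.

```python
def quality_score(completeness, contamination, num_contigs, num_ambiguous_bases):
    """Quality score from the formula above; a higher score is better.

    Assumes completeness and contamination are percentages (0-100).
    """
    return (
        completeness
        - 5 * contamination
        - 5 * num_contigs / 100
        - 5 * num_ambiguous_bases / 100000
    )


# Two hypothetical genomes: a clean, contiguous one and a fragmented one.
genome_a = quality_score(
    completeness=98.0, contamination=1.0, num_contigs=40, num_ambiguous_bases=0
)
genome_b = quality_score(
    completeness=99.0, contamination=3.0, num_contigs=200, num_ambiguous_bases=20000
)

print(genome_a, genome_b)
```

Although genome_b is slightly more complete, its contamination and fragmentation penalties give it the lower score, so genome_a would be preferred as a representative if the two fell in the same cluster.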

If CheckM qualities were not provided, then the following holds instead:

  1. Each representative genome was specified to galah before other members of the cluster.

The overall greedy clustering approach was largely inspired by the work of Donovan Parks, as described in Parks et al. 2020. It operates in two steps. In the first step, genomes are assigned as representatives if no genomes of higher quality are >99% ANI. In the second step, each non-representative genome is assigned to the representative genome it has the highest ANI with.
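The two steps can be sketched as follows. This is an illustration under stated assumptions, not Galah's implementation: the genomes are assumed to arrive sorted best-first (by quality score, or by input order when no qualities are given), each candidate is compared only against already-chosen representatives, and the ANI values come from a hypothetical lookup table rather than a real ANI tool.

```python
def greedy_cluster(genomes, ani, threshold=99.0):
    """Cluster genomes greedily.

    genomes: list of genome names, ordered best-first (assumption).
    ani: callable (a, b) -> ANI as a percentage.
    Returns a dict mapping each representative to its members.
    """
    # Step 1: a genome becomes a representative if it is below the
    # threshold against every representative chosen so far.
    representatives = []
    for genome in genomes:
        if all(ani(genome, rep) < threshold for rep in representatives):
            representatives.append(genome)

    # Step 2: every other genome joins the representative it has the
    # highest ANI with.
    clusters = {rep: [rep] for rep in representatives}
    for genome in genomes:
        if genome in clusters:
            continue
        best = max(representatives, key=lambda rep: ani(genome, rep))
        clusters[best].append(genome)
    return clusters


# Hypothetical pairwise ANI values for four genomes.
toy_ani = {
    frozenset(("g1", "g2")): 99.5,
    frozenset(("g1", "g3")): 96.0,
    frozenset(("g2", "g3")): 97.0,
    frozenset(("g1", "g4")): 96.5,
    frozenset(("g2", "g4")): 96.2,
    frozenset(("g3", "g4")): 99.2,
}


def lookup_ani(a, b):
    return toy_ani[frozenset((a, b))]


clusters = greedy_cluster(["g1", "g2", "g3", "g4"], lookup_ani)
print(clusters)
```

In this toy input g1 and g3 become representatives, with g2 joining g1 and g4 joining g3, so both cluster properties listed earlier hold at the 99% threshold.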

Installation

Install through the bioconda package

Galah can be installed through the bioconda conda channel. After initial setup of conda and the bioconda channel, it can be installed with

conda install galah

Galah can also be used indirectly through CoverM via its cluster subcommand, which is also available on bioconda.

Pre-compiled binary

Galah can be installed by downloading statically compiled binaries, available on the releases page.

Third party dependencies listed below are required for this method.

Compiling from source

Galah can also be installed from source, using the cargo build system after installing Rust.

cargo install galah

Third party dependencies listed below are required for this method.

Development

To run an unreleased version of Galah, after installing Rust:

git clone https://github.com/wwood/galah
cd galah
cargo run -- cluster ...etc...

Third party dependencies listed below are required for this method.

Dependencies

Galah relies on third-party tools, such as the Dashing and FastANI programs described under Precluster ANI, which must be installed separately.

Usage

For clustering a set of genomes at 99% ANI:

galah cluster --genome-fasta-files /path/to/genome1.fna /path/to/genome2.fna --output-cluster-definition clusters.tsv

There are several other options for specifying genomes, ANI cutoffs, etc.

The full usage is described on the manual page, which can be accessed on the command line by running galah cluster --full-help.
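The file written by --output-cluster-definition can be loaded for downstream analysis. The sketch below assumes each line holds two tab-separated fields, the representative followed by one member of its cluster; check that layout against the --full-help output before relying on it. The function name read_clusters and the example rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Group members by representative.

    Assumes each row is: representative <TAB> member.
    """
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        representative, member = row
        clusters[representative].append(member)
    return dict(clusters)


# In-memory stand-in for clusters.tsv, using the paths from the command above.
example = io.StringIO(
    "/path/to/genome1.fna\t/path/to/genome1.fna\n"
    "/path/to/genome1.fna\t/path/to/genome2.fna\n"
)
clusters = read_clusters(example)
print(clusters)
```

With a real run, replace the in-memory example with open("clusters.tsv", newline="").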

Precluster ANI

Similar to dRep, galah operates in two stages. In the first, a fast pre-clustering distance (dashing) is calculated between each pair of genomes. In the second, FastANI is run only on genome pairs whose precluster ANI is greater than the specified value, since only those pairs can potentially belong to the same cluster. By default, the precluster ANI is set at 95% and the final ANI is set at 99%.
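The benefit of the two stages is that the slower comparison is skipped for most pairs. A minimal sketch of that filter, assuming both tools are wrapped as Python callables returning a percentage; fake_fast and fake_exact are hypothetical stand-ins for Dashing and FastANI, not real bindings:

```python
def two_stage_ani(a, b, fast_ani, exact_ani, precluster_threshold=95.0):
    """Return the exact ANI for a pair, or None if the fast estimate
    rules the pair out."""
    if fast_ani(a, b) <= precluster_threshold:
        return None  # cannot be in the same cluster; skip the slow step
    return exact_ani(a, b)


exact_calls = []  # records which pairs reached the slow comparison


def fake_fast(a, b):
    # Hypothetical fast estimates keyed by the second genome.
    return {"near": 98.0, "far": 80.0}[b]


def fake_exact(a, b):
    exact_calls.append((a, b))
    return 99.3  # hypothetical exact ANI


near = two_stage_ani("query", "near", fake_fast, fake_exact)
far = two_stage_ani("query", "far", fake_fast, fake_exact)
print(near, far, exact_calls)
```

Only the ("query", "near") pair reaches the exact comparison; the distant pair is rejected by the cheap estimate alone.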

License

Galah is made available under GPL3+. See LICENSE.txt for details. Copyright Ben Woodcroft.

Developed by Ben Woodcroft at the Centre for Microbiome Research, Queensland University of Technology.
