rust-msbwt's Introduction

msbwt2

The intent of this crate is to provide Rust functionality for querying a Multi-String BWT (MSBWT). It is mostly based on the same methodology used by the original msbwt.

NOTE: This is very much a work in progress and is currently only updated as a side project during spare time. If you have any feature requests, feel free to submit a new issue on GitHub. Here is the current list of planned additions:

  1. Incorporate the high-memory BWT implementation from fmlrc2
  2. Add some more query functionality
  3. Improve the performance of the built-in BWT construction tool (msbwt2-build)

Installation

All installation options assume you have installed Rust along with the cargo crate manager for Rust.

From Cargo

cargo install msbwt2
msbwt2-convert -h

From GitHub

git clone https://github.com/HudsonAlpha/rust-msbwt.git
cd rust-msbwt
# testing is optional
cargo test --release
cargo build --release
./target/release/msbwt2-convert -h

Usage

MSBWT Building

The Multi-String Burrows-Wheeler Transform (MSBWT or BWT) must be built before performing any queries. Currently, there are two ways to build the BWT, and both produce identical results:

  1. Using the built-in msbwt2-build tool. This approach accepts any combination of FASTQ and FASTA files, which may be gzip-compressed.
    This method currently tends to be slower and is not parallelized (we hope to improve both of these over time). However, it is easier to use with different file types and requires only msbwt2 to be installed:
msbwt2-build \
    -o comp_msbwt.npy \
    reads.fq.gz [reads2.fq.gz ...]
  2. Using an external tool and feeding its output to msbwt2-convert. This approach currently tends to be faster. However, the following command is more complex, is less flexible about file types (it requires FASTQ in this example), and requires the ropebwt2 executable (or a similar tool) to be installed:
gunzip -c reads.fq.gz [reads2.fq.gz ...] | \
    awk 'NR % 4 == 2' | \
    sort | \
    tr NT TN | \
    ropebwt2 -LR | \
    tr NT TN | \
    msbwt2-convert comp_msbwt.npy
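
For readers who want to see what the text-processing stages of the pipeline above do before the data reaches ropebwt2, here is a minimal Rust sketch of the `awk 'NR % 4 == 2' | sort | tr NT TN` steps applied to in-memory FASTQ text. The helper `prepare_reads` is hypothetical and is not part of msbwt2. It assumes well-formed four-line FASTQ records and a plain byte-wise sort, which may differ from `sort` under some locales.

```rust
/// Hypothetical helper (not part of msbwt2): mirrors
/// `awk 'NR % 4 == 2' | sort | tr NT TN` on in-memory FASTQ text.
fn prepare_reads(fastq: &str) -> Vec<String> {
    // Keep the 2nd line of every 4-line FASTQ record (the sequence).
    let mut reads: Vec<&str> = fastq
        .lines()
        .enumerate()
        .filter(|(i, _)| i % 4 == 1)
        .map(|(_, line)| line)
        .collect();
    // Byte-wise lexicographic sort of the reads.
    reads.sort_unstable();
    // Swap N and T in every read, as `tr NT TN` does.
    reads
        .into_iter()
        .map(|read| {
            read.chars()
                .map(|c| match c {
                    'N' => 'T',
                    'T' => 'N',
                    other => other,
                })
                .collect()
        })
        .collect()
}

fn main() {
    let fastq = "@r1\nTTGA\n+\nIIII\n@r2\nACGN\n+\nIIII\n";
    for read in prepare_reads(fastq) {
        println!("{read}");
    }
}
```

The second `tr NT TN` in the shell pipeline applies the same swap again to the ropebwt2 output, which restores the original N and T symbols.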

Queries

The general use case of the library is k-mer queries, which can be performed as follows:

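use msbwt2::msbwt_core::BWT;
use msbwt2::rle_bwt::RleBWT;
use msbwt2::string_util;
// Create an empty run-length encoded BWT
let mut bwt = RleBWT::new();
// Load a previously built BWT from disk
let filename: String = "test_data/two_string.npy".to_string();
bwt.load_numpy_file(&filename);
// Convert the k-mer to the integer encoding and count its occurrences
assert_eq!(bwt.count_kmer(&string_util::convert_stoi(&"ACGT")), 1);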
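
To illustrate the idea behind a k-mer count on a multi-string BWT, the following is a self-contained sketch that uses only the standard library. It is not msbwt2's implementation: `naive_msbwt` and `count_kmer` are hypothetical names, the BWT is built by naively sorting all rotations, and the rank queries are linear scans rather than an indexed structure. It assumes each string is terminated by `$`, which sorts before the nucleotide symbols in ASCII.

```rust
/// Hypothetical, naive construction: sort every rotation of every
/// `$`-terminated string and keep the last character of each rotation.
fn naive_msbwt(strings: &[&str]) -> Vec<u8> {
    let mut rotations: Vec<Vec<u8>> = Vec::new();
    for s in strings {
        let mut t = s.as_bytes().to_vec();
        t.push(b'$'); // `$` sorts before A, C, G, N, T in ASCII
        for i in 0..t.len() {
            let mut rot = t[i..].to_vec();
            rot.extend_from_slice(&t[..i]);
            rotations.push(rot);
        }
    }
    rotations.sort();
    rotations.iter().map(|r| *r.last().unwrap()).collect()
}

/// Count occurrences of `kmer` (which must not contain `$`) using
/// FM-index style backward search over the BWT.
fn count_kmer(bwt: &[u8], kmer: &[u8]) -> usize {
    // Number of symbols in the BWT strictly smaller than `c`.
    let smaller = |c: u8| bwt.iter().filter(|&&b| b < c).count();
    // Occurrences of `c` in bwt[..end] (a linear-time "rank" query).
    let rank = |c: u8, end: usize| bwt[..end].iter().filter(|&&b| b == c).count();

    // Start with the full range, then narrow it one symbol at a time,
    // reading the k-mer from its last symbol to its first.
    let (mut lo, mut hi) = (0, bwt.len());
    for &c in kmer.iter().rev() {
        lo = smaller(c) + rank(c, lo);
        hi = smaller(c) + rank(c, hi);
        if lo >= hi {
            return 0;
        }
    }
    hi - lo
}

fn main() {
    let bwt = naive_msbwt(&["ACGT", "CCGG"]);
    println!("BWT: {}", String::from_utf8_lossy(&bwt));
    println!("CG occurs {} time(s)", count_kmer(&bwt, b"CG"));
}
```

Each step of the loop costs time proportional to the BWT length here. A practical index replaces the two closures with precomputed counts and sampled rank tables so that a query depends mainly on the k-mer length.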

Reference

msbwt2 does not currently have a pre-print or paper. If you use msbwt2, please cite one of the msbwt papers:

Holt, James, and Leonard McMillan. "Merging of multi-string BWTs with applications." Bioinformatics 30.24 (2014): 3524-3531.

Holt, James, and Leonard McMillan. "Constructing Burrows-Wheeler transforms of large string collections via merging." Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. 2014.

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

rust-msbwt's People

Contributors

ebedthan, holtjma, natproach

rust-msbwt's Issues

very slow

Hello rust-msbwt team,

I found that the build command in this version is very slow: 5 GB of FASTQ reads takes more than 2 hours, while ropebwt2 is very fast and finishes in several minutes. I am wondering whether this could be parallelized, because it is not practical for real-world sequence files, which are always more than 20 GB.

Thanks,

Jianshu

Incorporate the high-memory BWT implementation from fmlrc2

Hi @holtjma,

Sorry to bother you, and sorry if more issues and PRs have come from me than from anyone else in recent days.
I believe this tool will greatly benefit from this addition.
I would like to see if we can incorporate the high-memory BWT implementation from fmlrc2 here. Before starting, I would prefer to have your recommendations and direction so that it will not be a waste of time for either of us. And if it is not something you would like for msbwt2, we can just leave it as it is.

Thanks!

remove unsafe in msbwt2::string_util::convert_itos

Hi,

I was wondering why we do not directly use from_utf8_lossy, which "returns a Cow<'a, str>. If our byte slice is invalid UTF-8, then we need to insert the replacement characters, which will change the size of the string, and hence, require a String. But if it's already valid UTF-8, we don't need a new allocation. This return type allows us to handle both cases."

So unless we are sure that convert_itos can receive invalid UTF-8 as input, using from_utf8_lossy also provides a non-allocating way of doing what we want, and it also reduces the amount of unsafe in the code...

What is your advice?

Anicet
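
The behavior described in this issue can be checked with a small standard-library sketch. The helper `lossy_is_borrowed` is hypothetical and only reports which `Cow` variant `String::from_utf8_lossy` returns. It does not reproduce `convert_itos`, whose exact output is assumed here to be plain ASCII symbols.

```rust
use std::borrow::Cow;

/// Hypothetical helper: true when the lossy conversion could borrow the
/// input bytes instead of allocating a new String.
fn lossy_is_borrowed(bytes: &[u8]) -> bool {
    matches!(String::from_utf8_lossy(bytes), Cow::Borrowed(_))
}

fn main() {
    // ASCII symbols are valid UTF-8, so the result borrows the input.
    let valid: &[u8] = b"ACGNT$";
    println!("valid input borrowed: {}", lossy_is_borrowed(valid));

    // 0xFF is never valid in UTF-8, so the result is an owned String in
    // which the bad byte is replaced by U+FFFD.
    let invalid: &[u8] = &[b'A', 0xFF, b'C'];
    println!("invalid input borrowed: {}", lossy_is_borrowed(invalid));
    println!("lossy text: {}", String::from_utf8_lossy(invalid));
}
```

Note that the allocation is only avoided while the result stays a `Cow`. If the function must return an owned `String`, calling `into_owned()` on a borrowed result still allocates, and `String::from_utf8` is the safe, non-lossy alternative that returns an error on invalid input.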
