
mtsv's Introduction


MTSv Pipeline

Change log

MTSv is a suite of metagenomic binning and analysis tools. It attempts to identify which taxa are present in a given DNA sequencing sample by finding signature reads that are unique to a single taxon. It assumes that the reads in a sample are in a "shotgun" or short-read format, typically ~50-200 bases in length.

Pipeline Overview

The pipeline has two major sections: (1) the download and setup of sequences from the sequence database, which needs to be completed only once per database and repeated as needed to include updates, and (2) the main binning and analysis pipeline, which is run for each new read set.

  1. Installation
  2. Sequence Download and Setup Quick Start Guide
  3. Binning and Analysis Quick Start Guide

Installation

MTSv is written in Python but calls compiled Rust binaries for most of the core functionality. It is currently available as a conda package (on linux-64 platforms), and we recommend installing it into a Python 3.6 conda environment.

Dependencies

Create a Conda environment with Python 3.6.

$ conda create -n [ENV_NAME] python=3.6

Activate Conda Environment

$ source activate [ENV_NAME]

Install MTSv

$ conda install mtsv -c bioconda -c conda-forge

Deactivate Conda Environment

$ source deactivate

License & Copyright

© 2018 FofanovLab

Licensed under the MIT License

mtsv's People

Contributors

tfursten, ktaed, isaacshaffer, fellmk


mtsv's Issues

MTSv-summary bug

There is a potential bug in the MTSv-summary pipeline that causes an unexpected crash. The bug is reproducible.

here is the command line:
srun --mem=64000 python /home/vyf2/MTSv/scripts/MTSv_summary.py --threads 1 -o /scratch/vyf2/CR2/metaSlava/YP/ YERPE_Yp4027 /scratch/vyf2/CR2/metaSlava/YP/vedro/YERPE_Yp4027.clp /scratch/vyf2/CR2/metaSlava/YP/vedro/YERPE_Yp4027.sig &

The taxdump I've used is here (but the newer one results in same issue):
/scratch/vyf2/NCBI_012417/treeBuild/taxdump.tar.gz

The error message produced is (int overflow?):
Traceback (most recent call last):
File "/home/vyf2/MTSv/scripts/MTSv_summary.py", line 241, in <module>
get_summary(ARGS.all, ARGS.sig, outfile, ARGS.threads, ARGS.verbose)
File "/home/vyf2/MTSv/scripts/MTSv_summary.py", line 148, in get_summary
data_dict = parse_all_hits(all_file, data_dict, sig_reads, threads, verbose)
File "/home/vyf2/MTSv/scripts/MTSv_summary.py", line 140, in parse_all_hits
all_file, sep=":", header=None, chunksize=n_rows//threads))
File "/home/vyf2/.conda/envs/biopy3/lib/python3.5/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/vyf2/.conda/envs/biopy3/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
File "/home/vyf2/.conda/envs/biopy3/lib/python3.5/multiprocessing/pool.py", line 385, in _handle_tasks
put(task)
File "/home/vyf2/.conda/envs/biopy3/lib/python3.5/multiprocessing/connection.py", line 206, in send
self._send_bytes(ForkingPickler.dumps(obj))
File "/home/vyf2/.conda/envs/biopy3/lib/python3.5/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

The issue seems to be squarely in the data from collapsed file /scratch/vyf2/CR2/metaSlava/YP/vedro/YERPE_Yp4027.clp

Removing the .clp file from the equation removes the issue. For example:
srun --mem=64000 python /home/vyf2/MTSv/scripts/MTSv_summary.py --threads 1 -o /scratch/vyf2/CR2/metaSlava/YP/ YERPE_Yp4027 /scratch/vyf2/CR2/metaSlava/YP/vedro/YERPE_Yp4027.sig /scratch/vyf2/CR2/metaSlava/YP/vedro/YERPE_Yp4027.sig &

The above command works fine.
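
The traceback ends in `struct.pack("!i", n)`, which suggests that a single task sent to a worker was larger than the signed 32-bit length header can describe. With `--threads 1`, `chunksize=n_rows//threads` covers the whole file, so one chunk can be very large. A minimal sketch of one possible mitigation is to cap the rows per chunk. The helper names and the `max_rows` default below are hypothetical and are not part of MTSv.

```python
import struct

INT32_MAX = 2**31 - 1  # largest length the "!i" header can encode


def fits_in_header(n_bytes):
    """Return True if a payload length can be packed as a signed 32-bit int."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        return False


def bounded_chunksize(n_rows, threads, max_rows=1000000):
    """Rows per chunk: an even split across threads, capped at max_rows.

    max_rows is an illustrative value, not a tuned one.
    """
    per_thread = max(1, n_rows // max(1, threads))
    return min(per_thread, max_rows)


if __name__ == "__main__":
    print(fits_in_header(INT32_MAX))      # True
    print(fits_in_header(INT32_MAX + 1))  # False, the struct.error case
    print(bounded_chunksize(50000000, 1)) # 1000000
```

Whether a cap like this is enough depends on how wide the rows in the .clp file are, so the value would need to be checked against real data.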

requested enhancement: custom adaptive --lca on MTSv-summary

From correspondence:

Regarding the LCA functionality, it would seem that it would be straightforward to implement things such that any of the major ranks could be specified (so instead of --family, --genus, etc., just a single --rank flag that accepts any of the major ranks). I do understand the concept of a “relative” LCA specified numerically that limits how many jumps “up” the tree to allow, although I’m not sure why there’s a limit on how high this can be set (if it’s set too high, then the program could just report the most inclusive rank allowed: phylum, kingdom, root, etc.). Finally, what about a “generic” LCA mode that simply returns the LCA of the various hits for each read (potentially all the way up to the root)? I guess this would be equivalent to setting --lca to a very large value, if this were allowed. This functionality doesn’t seem possible currently but is actually the default behavior of many classifiers.

Adding --rank is a little more complicated. We’ve chosen to stick with --genus and --family for a couple of reasons: (a) not all organisms have all of the ranks in their taxonomic lineage, and genus and family are the most common; (b) we do some precomputing to speed up signature read identification after binning, with several high-speed access tables for species, genus, and family. As the code stands, adding --rank will break a few things. I've added this on our GitHub page as one of the desired features, but it will require substantial testing to make sure it works as intended, so it is unlikely to be available by June 30th.
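
To make the two LCA modes in this correspondence concrete, here is a minimal sketch. It assumes each hit is given as a root-to-leaf lineage, and it treats the numeric limit as the maximum number of steps between any hit and the reported ancestor. How MTSv's --lca actually counts jumps is not stated here, so that interpretation, the function names, and the toy lineages are all assumptions.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest node shared by all root-to-leaf lineages, or None."""
    if not lineages:
        return None
    lca = None
    for nodes in zip(*lineages):
        if all(node == nodes[0] for node in nodes):
            lca = nodes[0]
        else:
            break
    return lca


def capped_lca(lineages, max_jumps):
    """Return the LCA only if every hit is within max_jumps steps of it."""
    lca = lowest_common_ancestor(lineages)
    if lca is None:
        return None
    for lineage in lineages:
        jumps = len(lineage) - 1 - lineage.index(lca)
        if jumps > max_jumps:
            return None
    return lca


if __name__ == "__main__":
    # Toy lineages with made-up names, not real taxonomy.
    a = ["root", "Bacteria", "GenusA", "species_1"]
    b = ["root", "Bacteria", "GenusA", "species_2"]
    c = ["root", "Bacteria", "GenusB", "species_3"]
    print(lowest_common_ancestor([a, b]))     # GenusA
    print(lowest_common_ancestor([a, b, c]))  # Bacteria ("generic" mode)
    print(capped_lca([a, b, c], 1))           # None, needs two jumps
    print(capped_lca([a, b, c], 2))           # Bacteria
```

In this reading, the "generic" mode is simply `lowest_common_ancestor` with no cap, which matches the suggestion that it is equivalent to a very large --lca value.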

mtsv-signature bug

The mtsv-signature code crashed at high thread counts giving the following error:

Thread N "mtsv-signature" received signal SIGSEGV, Segmentation fault.

where N is the thread that crashed. This was fixed by updating the Cargo package cue v0.1.0, which is one of MTSv's dependencies. The Cargo package crossbeam was at fault: the Cargo lock file for cue was set to use crossbeam version 0.2, which has known issues that were documented after cue was written (crossbeam-rs/crossbeam#107). These issues were solved in early 2017, and the newer version of crossbeam no longer causes a segfault. The version of cue available through Cargo can no longer be updated, so a local copy of cue was added to MTSv, and the Cargo.toml file for MTSv was updated to refer to the local copy. The dependencies in the local copy's Cargo.toml file were updated to use crossbeam v0.3.2. In 500 test runs of mtsv-signature with the update, no segfaults occurred with 24 threads. Prior to the change, segfaults occurred in approximately 10% of tests with 24 threads.

The changes to fix this problem were merged with MTSv on 7/18/18 (#17).

Additional Details
To determine what was causing the original error above, two backtraces were used. The first was obtained using rust-gdb and indicated an issue in the cue package. Frame 12 in the backtrace indicates an issue with lib.rs, which is the single source file for the cue package. A second backtrace was obtained using Cargo directly. In the first section of that backtrace there is a panic at "libstd/thread/mod.rs:1058:22", with the crossbeam package indicated in frame 7 and cue in frame 8. This information led to the solution above. Both backtraces are included below.

rust-gdb backtrace:

 rust-gdb -statistics --args ./../target/debug/mtsv-signature -t 24 -x tree.index --input test.clp --lca 0 --output testmkf.sig

(gdb) run
Starting program: /home/michael/Documents/Slava/MTSV/MTSv/target/debug/mtsv-signature -t 24 -x tree.index --input test.clp --lca 0 --output testmkf.sig
Command execution time: 0.036000 (cpu), 0.042659 (wall)
Space used: 10846208 (+0 for this command)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[INFO 2018-06-01 13:01:31.110 mtsv_signature] Opening files...
[INFO 2018-06-01 13:01:31.178 mtsv_signature] Deserializing taxonomic index...
[INFO 2018-06-01 13:02:12.032 mtsv_signature] Finding informative reads...
[New Thread 0x7ffff09ff700 (LWP 22272)]
[New Thread 0x7ffff07fe700 (LWP 22273)]
[New Thread 0x7fffeffff700 (LWP 22274)]
[New Thread 0x7fffefdfe700 (LWP 22275)]
[New Thread 0x7fffefbfd700 (LWP 22276)]
[New Thread 0x7fffef3ff700 (LWP 22277)]
[New Thread 0x7fffeebff700 (LWP 22278)]
[New Thread 0x7fffee9fe700 (LWP 22279)]
[New Thread 0x7fffee3ff700 (LWP 22280)]
[New Thread 0x7fffedbff700 (LWP 22281)]
[New Thread 0x7fffed9fe700 (LWP 22282)]
[New Thread 0x7fffed3ff700 (LWP 22283)]
[New Thread 0x7fffecdff700 (LWP 22284)]
[New Thread 0x7fffec7ff700 (LWP 22285)]
[New Thread 0x7fffec1ff700 (LWP 22286)]
[New Thread 0x7fffebbff700 (LWP 22287)]
[New Thread 0x7fffeb5ff700 (LWP 22288)]
[New Thread 0x7fffeafff700 (LWP 22289)]
[New Thread 0x7fffea9ff700 (LWP 22290)]
[New Thread 0x7fffea3ff700 (LWP 22291)]
[New Thread 0x7fffe9dff700 (LWP 22292)]
[New Thread 0x7fffe97ff700 (LWP 22293)]
[New Thread 0x7fffe91ff700 (LWP 22294)]
[New Thread 0x7fffe8bff700 (LWP 22295)]
[New Thread 0x7fffe85ff700 (LWP 22296)]

Thread 22 "mtsv-signature" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe9dff700 (LWP 22292)]
0x0000555555810d2e in core::sync::atomic::atomic_compare_exchange::h49dae1b491b312cb () at /checkout/src/libcore/sync/atomic.rs:1635
1635       /checkout/src/libcore/sync/atomic.rs: No such file or directory.

(gdb) bt
#0  0x0000555555810d2e in core::sync::atomic::atomic_compare_exchange::h49dae1b491b312cb () at /checkout/src/libcore/sync/atomic.rs:1635
#1  core::sync::atomic::AtomicUsize::compare_exchange::he54cc56f32692a41 () at /checkout/src/libcore/sync/atomic.rs:1232
#2  std::thread::Thread::unpark::h245253837e117a38 () at libstd/thread/mod.rs:1051
#3  0x000055555558a98d in _$LT$crossbeam..sync..ms_queue..MsQueue$LT$T$GT$$GT$::push::h7c43ec3831975a1c (self=0x7fffffff6530, t=...)
    at /home/michael/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.12/src/sync/ms_queue.rs:178
#4  0x00005555555878f3 in cue::pipeline::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::hc4a53ccd492768ef ()
    at /home/michael/.cargo/registry/src/github.com-1ecc6299db9ec823/cue-0.1.0/src/lib.rs:89
#5  0x0000555555586564 in crossbeam::scoped::Scope::spawn::_$u7b$$u7b$closure$u7d$$u7d$::hb65b5eac789e5eb3 ()
    at /home/michael/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.12/src/scoped.rs:237
#6  0x0000555555585958 in _$LT$F$u20$as$u20$crossbeam..FnBox$GT$::call_box::had8bc517e74e907b (self=0x7ffff6a22b80)
    at /home/michael/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.12/src/lib.rs:44
#7  0x00005555555a4e53 in crossbeam::spawn_unsafe::_$u7b$$u7b$closure$u7d$$u7d$::he497fb2e5d48d154 ()
    at /home/michael/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.12/src/lib.rs:53
#8  0x00005555555a4ead in std::sys_common::backtrace::__rust_begin_short_backtrace::ha12f93978e9a4d56 (f=closure = {...})
    at /checkout/src/libstd/sys_common/backtrace.rs:136
#9  0x0000555555571935 in std::thread::Builder::spawn::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h32c9d626e56a56cb ()
    at /checkout/src/libstd/thread/mod.rs:406
#10 0x000055555558fb3d in _$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::had269266294c1a69 (
    self=AssertUnwindSafe = {...}, _args=0) at /checkout/src/libstd/panic.rs:296
#11 0x000055555557f09a in std::panicking::try::do_call::hbd02c4d3077641c3 (data=0x7fffe9dfec08 "\200+\242\366\377\177") at /checkout/src/libstd/panicking.rs:306
#12 0x000055555581fe7f in __rust_maybe_catch_panic () at libpanic_unwind/lib.rs:102
#13 0x000055555557ef91 in std::panicking::try::h4a558b787dcdbfd7 (f=AssertUnwindSafe = {...}) at /checkout/src/libstd/panicking.rs:285
#14 0x000055555558fb9d in std::panic::catch_unwind::h39f87b33dfb2af23 (f=AssertUnwindSafe = {...}) at /checkout/src/libstd/panic.rs:361
#15 0x00005555555714f0 in std::thread::Builder::spawn::_$u7b$$u7b$closure$u7d$$u7d$::h109c817cc61acd7f () at /checkout/src/libstd/thread/mod.rs:405
#16 0x0000555555571a18 in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::h42849e6864f22f8f (self=0x7ffff6a22ba0, args=0)
    at /checkout/src/liballoc/boxed.rs:784
#17 0x0000555555815368 in _$LT$alloc..boxed..Box$LT$alloc..boxed..FnBox$LT$A$C$$u20$Output$u3d$R$GT$$u20$$u2b$$u20$$u27$a$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::hfb7267911708b31a () at /checkout/src/liballoc/boxed.rs:794
#18 std::sys_common::thread::start_thread::hbdb0265288bf7dcc () at libstd/sys_common/thread.rs:24
#19 0x0000555555809fa9 in std::sys::unix::thread::Thread::new::thread_start::h99a7fe00c40fd1be () at libstd/sys/unix/thread.rs:90
#20 0x00007ffff77b56ba in start_thread (arg=0x7fffe9dff700) at pthread_create.c:333
#21 0x00007ffff72d541d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Command execution time: 0.044000 (cpu), 0.057290 (wall)
Space used: 21016576 (+2154496 for this command)

Cargo backtrace:

RUST_BACKTRACE=1 cargo run --bin mtsv-signature -- -t 24 -x mkftests/tree.index --input mkftests/test.clp --lca 0 --output mkftests/testmkf.sig
    Finished dev [unoptimized + debuginfo] target(s) in 0.17s
    Running `target/debug/mtsv-signature -t 24 -x mkftests/tree.index --input mkftests/test.clp --lca 0 --output mkftests/testmkf.sig`
[INFO 2018-07-12 11:36:13.517 mtsv_signature] Opening files...
[INFO 2018-07-12 11:36:13.570 mtsv_signature] Deserializing taxonomic index...
[INFO 2018-07-12 11:36:50.719 mtsv_signature] Finding informative reads...
thread '<unnamed>' panicked at 'inconsistent state in unpark', libstd/thread/mod.rs:1058:22
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:71
             at libstd/sys_common/backtrace.rs:59
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:211
   3: std::panicking::default_hook
             at libstd/panicking.rs:227
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:463
   5: std::panicking::begin_panic
             at libstd/panicking.rs:397
   6: std::thread::Thread::unpark
             at libstd/thread/mod.rs:1058
   7: <crossbeam::sync::ms_queue::MsQueue<T>>::push
             at /home/mkf58/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.12/src/sync/ms_queue.rs:178
   8: cue::pipeline::{{closure}}::{{closure}}
             at /home/mkf58/.cargo/registry/src/github.com-1ecc6299db9ec823/cue-0.1.0/src/lib.rs:89
   9: crossbeam::scoped::Scope::spawn::{{closure}}
             at /home/mkf58/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.12/src/scoped.rs:237
  10: <F as crossbeam::FnBox>::call_box
             at /home/mkf58/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.12/src/lib.rs:44
  11: crossbeam::spawn_unsafe::{{closure}}
             at /home/mkf58/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.12/src/lib.rs:53
 
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any', libcore/result.rs:945:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:71
             at libstd/sys_common/backtrace.rs:59
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:211
   3: std::panicking::default_hook
             at libstd/panicking.rs:227
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:463
   5: std::panicking::begin_panic_fmt
             at libstd/panicking.rs:350
   6: rust_begin_unwind
             at libstd/panicking.rs:328
   7: core::panicking::panic_fmt
             at libcore/panicking.rs:71
   8: core::result::unwrap_failed
             at /checkout/src/libcore/macros.rs:26
   9: <core::result::Result<T, E>>::unwrap
             at /checkout/src/libcore/result.rs:782
  10: crossbeam::scoped::JoinState::join
             at /home/mkf58/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.12/src/scoped.rs:33
  11: crossbeam::scoped::Scope::spawn::{{closure}}
             at /home/mkf58/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.12/src/scoped.rs:247
  12: <F as crossbeam::FnBox>::call_box
             at /home/mkf58/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.12/src/lib.rs:44
  13: crossbeam::scoped::Scope::drop_all
             at /home/mkf58/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.12/src/scoped.rs:98
  14: crossbeam::scoped::scope
             at /home/mkf58/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.12/src/scoped.rs:63
  15: cue::pipeline
             at /home/mkf58/.cargo/registry/src/github.com-1ecc6299db9ec823/cue-0.1.0/src/lib.rs:59
  16: mtsv::tax_tree::TreeWithIndices::find_and_write_informatives
             at ./src/tax_tree.rs:88
  17: mtsv_signature::main
             at src/bin/mtsv-signature.rs:120
  18: std::rt::lang_start::{{closure}}
             at /checkout/src/libstd/rt.rs:74
  19: std::panicking::try::do_call
             at libstd/rt.rs:59
             at libstd/panicking.rs:310
  20: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:105
  21: std::rt::lang_start_internal
             at libstd/panicking.rs:289
             at libstd/panic.rs:374
             at libstd/rt.rs:58
  22: std::rt::lang_start
             at /checkout/src/libstd/rt.rs:74
  23: main
  24: __libc_start_main
  25: _start 

mtsv_setup looking for decompression.log

When trying to run the database build_only commands, we're getting the following error:
(mtsv) [sguertin@c mtsv]$ mtsv_setup database --path Jun-15-2018 --thread 4 --build_only --includedb "Complete Genome"
Traceback (most recent call last):
File "/data/home/sguertin/miniconda3/envs/mtsv/bin/mtsv_setup", line 11, in <module>
sys.exit(main())
File "/data/home/sguertin/miniconda3/envs/mtsv/lib/python3.6/site-packages/mtsv/mtsv_prep/main.py", line 451, in main
setup_and_run(parser)
File "/data/home/sguertin/miniconda3/envs/mtsv/lib/python3.6/site-packages/mtsv/mtsv_prep/main.py", line 362, in setup_and_run
oneclickbuild(args)
File "/data/home/sguertin/miniconda3/envs/mtsv/lib/python3.6/site-packages/mtsv/mtsv_prep/main.py", line 80, in oneclickbuild
with open(os.path.join(args.path, "artifacts","decompression.log"), "w" ):
PermissionError: [Errno 13] Permission denied: 'Jun-15-2018/artifacts/decompression.log'

The case shown above is with the latest bioconda build as of 5/14, but we've observed the same error with the latest singularity image as well.

Automatic MTSv-extract options

Tara, in your snakemake implementation, let's have MTSv-extract called to at least pull out the reads that did not align to anything. We'll need these reads anyway for alignment-free binning.

Creating databases for use on 'offline' systems

What is the recommended procedure for use on non-internet facing systems? Ideally, the database download step would be decoupled from the database build step as the available internet-accessible system would not have the resources to build.

The desired database would include all assemblies for bacteria (including contig level) in addition to all of the default database contents (which I understand to be all scaffold-level or higher assemblies as well as Genbank records).

Offline mirror of Genbank/Refseq for use in database creation.

Due to our not having Internet access on our analysis network, we try to mirror portions of NCBI that are used in our work. If this is a duplication of effort in terms of the offline_download scripts, maybe some of what we have could be used to build the databases internally.

I won't include the entire higher-level directory structure, but assuming all of the following are within a single ncbi folder and are using the structure of the ncbi ftp, here is our mirror layout as you requested:
genomes/refseq/*
genomes/genbank/*
genomes/Viruses/*
genbank/*.seq.gz
pub/taxonomy/

blast/db/nt* and nr* and the v5 versions
refseq/release/plasmid/*

I can also add other sections of the ftp if needed, if it would facilitate an easier internal build.

Thanks for any help in getting this working.

Database won't build from list of unzipped flat files

The current version (1.0.4) of mtsv_setup gives the following error when attempting to build a database from a list of genbank flat files:

Traceback (most recent call last):
File "/scratch/.../.conda/envs/MTSV_env/bin/mtsv_setup", line 11, in <module>
sys.exit(main())
File "/scratch/.../.conda/envs/MTSV_env/lib/python3.6/site-packages/mtsv/mtsv_prep/main.py", line 458, in main
setup_and_run(parser)
File "/scratch/.../.conda/envs/MTSV_env/lib/python3.6/site-packages/mtsv/mtsv_prep/main.py", line 368, in setup_and_run
oneclickbuild(args)
File "/scratch/.../.conda/envs/MTSV_env/lib/python3.6/site-packages/mtsv/mtsv_prep/main.py", line 94, in oneclickbuild
if x.strip().rsplit(".",1)[1] == "gz":
IndexError: list index out of range
Command exited with non-zero status 1

The specific call is: mtsv_setup database --path 10-Sep-2019 --thread 4 --build_only --taxonomy_path 10-Sep-2019/artifacts/ --ff_list selected_species_open.txt
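
The failing expression, `x.strip().rsplit(".",1)[1]`, raises IndexError whenever a line in the file list contains no "." at all, for example a blank line or a path without an extension. Whether that is what happened with this particular list is not known. A minimal sketch of a more tolerant check follows; the function names are hypothetical and not part of mtsv_setup.

```python
def is_gzipped(path):
    """Return True if the path ends in '.gz'; tolerate names with no extension."""
    return path.strip().endswith(".gz")


def split_flat_files(lines):
    """Partition a file list into (gzipped, plain) paths, skipping blank lines."""
    gzipped, plain = [], []
    for line in lines:
        path = line.strip()
        if not path:
            continue
        (gzipped if is_gzipped(path) else plain).append(path)
    return gzipped, plain


if __name__ == "__main__":
    # Illustrative entries only; these are not files from the report.
    listing = ["a.gbff.gz\n", "b.gbff\n", "\n", "noext\n"]
    print(split_flat_files(listing))
```

Using `endswith` avoids indexing into the result of `rsplit`, so entries without an extension are treated as uncompressed rather than crashing the build.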

strange --lca behavior in MTSv-signature

Correspondence with one of the beta testers is below:

Anyway, at that point we’d probably check to make sure e.g. the bug we’d noticed with LCA assignments to the B. cereus group had gone away, check out all the new features, etc.

Can you provide some more details on what kind of read set you were using when you first encountered the bug? Were these B. cereus or B. anthracis reads?

To answer your question about the B cereus LCA issue: I simulated reads to a strain of B anthracis and then tried changing the lca option and/or using the genus or family flags. Every set of parameters (other than lca 0) gave 34 reads to the Bacillus genus. I dug in further and found that there is a large population of reads compatible with B cereus, B anthracis, and B cereus that I would have expected to go up to the Bacillus genus. They don’t. It seems that the only reads moving up to Bacillus are compatible with two or more Bacillus species outside the cereus group.

I noticed a similar issue with B pseudomallei reads, but no problem with F tularensis reads. My suspicion is that the existence of a species group is causing problems.
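
If the suspicion about species groups is right, one way it could arise is that the LCA of two species in the same group is the group node itself, which carries no genus rank, so the read is never counted toward the genus. This is only a hypothesis, not a confirmed diagnosis. The sketch below shows the kind of promotion step that would avoid it: walking from the LCA toward the root until a node with the requested rank is found. The node names and rank table are toy values.

```python
def nearest_ancestor_with_rank(lineage, ranks, target_rank):
    """Walk a root-to-node lineage upward and return the first node at target_rank.

    Returns None if no node in the lineage has that rank.
    """
    for node in reversed(lineage):
        if ranks.get(node) == target_rank:
            return node
    return None


if __name__ == "__main__":
    # Toy taxonomy: a species group sits between the genus and its species.
    ranks = {
        "root": "no rank",
        "GenusX": "genus",
        "GenusX group": "species group",
    }
    lineage_of_lca = ["root", "GenusX", "GenusX group"]
    print(nearest_ancestor_with_rank(lineage_of_lca, ranks, "genus"))   # GenusX
    print(nearest_ancestor_with_rank(lineage_of_lca, ranks, "family"))  # None
```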

Difficulty running v1.06

I'm having trouble running v1.06.

  1. I create a run directory.
  2. Copy reads into 'reads/{SAMPLE}_S1_L001_R[12]_001.fastq', using only one sample.
  3. mtsv init /path/to/genbank.json
  4. mtsv pipeline -c mtsv.cfg

At this point, I get an error reading:

WorkflowError in line 234 of /path/to/analyze_snek:
Can only use unpack() on list and dict

Any ideas what might be causing this?

bug in summary portion on 1.0.4 singularity container

I get the following error running the summary command in the latest version of the singularity container, 1.0.4. What could be causing this, please (full command progress and error below)?

Singularity> mtsv summary -c sing2_test/mtsv.cfg
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count    jobs
    1    signature
    1    summary
    1    summary_all
    1    summary_report
    4

[Mon Sep 9 08:53:37 2019]
Job 2:
Finding signature hits from /data/mtsv/sing2_test/Binning/merged.clp.
Writing to /data/mtsv/sing2_test/Summary/signature.txt
Logging to /data/mtsv/Logs/mtsv_summary_2019-09-09_08-53-35.log
Snakemake scheduler assuming 1 thread(s)

Loading node names...
2117225 names loaded.
208877 synonyms loaded.
Loading nodes...
2117225 nodes loaded.
Linking nodes...
Tree is loaded.
Updating database: /data/home/sguertin/.etetoolkit/mtsv_8f419d58a6caaa5932664f991a10ec12_taxa.sqlite ...
2117000 generating entries...
Uploading to /data/home/sguertin/.etetoolkit/mtsv_8f419d58a6caaa5932664f991a10ec12_taxa.sqlite

Inserting synonyms: 205000
Inserting taxid merges: 50000
Inserting taxids: 2115000
Traceback (most recent call last):
File "/data/mtsv/.snakemake/scripts/tmpsupbwhzz.MTSv_signature.py", line 121, in <module>
NCBI = get_ete_ncbi(snakemake.params[0])
File "/usr/local/lib/python3.6/site-packages/mtsv/utils.py", line 143, in get_ete_ncbi
with open(ete_database_data(), 'w') as out:
OSError: [Errno 30] Read-only file system: '/usr/local/lib/python3.6/site-packages/mtsv/data/ete_databases.json'
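
The failure comes from writing `ete_databases.json` inside the installed package directory, which is read-only in the container, while the ete database itself was written under the user's home directory without trouble. A minimal sketch of one possible workaround is to fall back to a user-writable directory when the preferred location cannot be written. The function name and the fallback location are assumptions, not MTSv behavior.

```python
import os


def writable_path(preferred, fallback_dir):
    """Return preferred if its directory is writable, else a path in fallback_dir."""
    directory = os.path.dirname(preferred) or "."
    if os.access(directory, os.W_OK):
        return preferred
    # Create the fallback directory if needed and reuse the same file name.
    os.makedirs(fallback_dir, exist_ok=True)
    return os.path.join(fallback_dir, os.path.basename(preferred))


if __name__ == "__main__":
    # Hypothetical fallback location under the user's home directory.
    fallback = os.path.join(os.path.expanduser("~"), ".mtsv")
    target = writable_path(
        "/usr/local/lib/python3.6/site-packages/mtsv/data/ete_databases.json",
        fallback,
    )
    print(target)
```

A check with `os.access` is advisory only, so real code would still want to handle an OSError from the `open` call itself.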

Rust 1.36.0 Compilation Issue

It looks like the binner no longer compiles with Rust 1.36.0.
The solution I found was to change the rust requirement in meta.yaml to < 1.36.0. I wanted to confirm this wasn't isolated to me before pushing a fix.

MTSv-summary lacks graceful exit

MTSv-summary.py crashes (see below for the traceback message) when it encounters a taxID that is not in its database. This is going to happen more often once we start adding custom DBs from the chromosome and scaffold portions of NCBI (which may be updated at a different rate than the original GenBank).

Rather than throwing an exception and crashing, we should put such reads into an 'unknown' category and report them. Note that when we are not running the "--lca 0" option and are instead using the genus or family option, we'll need to assume that all unknown hits are non-signatures to be safe.

To replicate on monsoon, use the following collapsed, signature, and taxdump files respectively:

  • /scratch/vyf2/CR2/metaSlava/bats/vedro/NHCS-EchoLake01.clp
  • /scratch/vyf2/CR2/metaSlava/bats/vedro/NHCS-EchoLake01.sig
  • /scratch/vyf2/NCBI_012417/treeBuild/taxdump.tar.gz

==
Traceback (most recent call last):
File "/home/vyf2/MTSv/scripts/MTSv_summary.py", line 234, in <module>
get_summary(ARGS.all, ARGS.sig, outfile, ARGS.threads, ARGS.verbose)
File "/home/vyf2/MTSv/scripts/MTSv_summary.py", line 152, in get_summary
row_list = [taxa, tax2div(taxa), taxid2name[taxa]]
File "/home/vyf2/MTSv/scripts/MTSv_summary.py", line 42, in tax2div
lineage = NCBI.get_lineage(taxid)
File "/home/vyf2/.conda/envs/biopy3/lib/python3.5/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 238, in get_lineage
raise ValueError("%s taxid not found" %taxid)
ValueError: 1526222 taxid not found
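
The 'unknown' category proposed above could be sketched as follows. The traceback shows that the lineage lookup raises ValueError for a missing taxid, so the sketch catches that exception and sets the taxid aside instead of crashing. `classify_taxa` and the stand-in lookup are hypothetical; the stand-in replaces the real taxonomy query only so the example runs without a database.

```python
def classify_taxa(taxa, get_lineage):
    """Split taxids into those with a known lineage and those without.

    get_lineage is any callable that raises ValueError for an unknown taxid,
    as the lineage lookup does in the traceback above.
    """
    known, unknown = {}, []
    for taxid in taxa:
        try:
            known[taxid] = get_lineage(taxid)
        except ValueError:
            unknown.append(taxid)
    return known, unknown


# Toy lineage table standing in for the taxonomy database.
_TOY_LINEAGES = {10: [1, 5, 10]}


def toy_get_lineage(taxid):
    """Mimic the lookup's behavior: return a lineage or raise ValueError."""
    if taxid not in _TOY_LINEAGES:
        raise ValueError("%s taxid not found" % taxid)
    return _TOY_LINEAGES[taxid]


if __name__ == "__main__":
    known, unknown = classify_taxa([10, 1526222], toy_get_lineage)
    print(known)    # {10: [1, 5, 10]}
    print(unknown)  # [1526222]
```

Reads that hit anything in the unknown list could then be reported separately and, outside of "--lca 0" mode, treated as non-signature, as suggested above.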

Taxonomy-aware (genus/family collapse) MTSv-summary output

From stakeholders: "It would probably be most useful if it were sorted in a taxonomy-aware manner (at least in the cases where multiple taxonomic ranks are being output), à la KrakenHLL."

This is a feature for the option year that we've discussed before. The idea is to run MTSv-signature at the genus / family / custom lca level and collapse the nodes with a large number of signature hits but low p-value (a.k.a. the Bradyrhizobium problem I've illustrated before).

Conda install of v1.02 bug

The Conda install of v1.02 was not functional when first installed. The pipeline expected scripts to be located in 'envs/mtsv1.02/lib/python3.6/site-packages/mtsv/ext/' that were actually located in 'envs/mtsv1.02/lib/python3.6/site-packages/mtsv/ext/target/release/'.

The issue was resolved by soft-linking 'envs/mtsv1.02/lib/python3.6/site-packages/mtsv/ext/target/release/*' inside 'envs/mtsv1.02/lib/python3.6/site-packages/mtsv/ext/'.
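
The soft-linking workaround can be sketched in Python as below. The function name is hypothetical, and unlike a shell glob over `release/*` this version links regular files only and skips anything that already exists at the destination.

```python
import os


def link_release_binaries(ext_dir):
    """Symlink each file in ext_dir/target/release into ext_dir.

    Returns the sorted list of names that were newly linked.
    """
    release_dir = os.path.join(ext_dir, "target", "release")
    linked = []
    for name in sorted(os.listdir(release_dir)):
        source = os.path.join(release_dir, name)
        dest = os.path.join(ext_dir, name)
        # Skip subdirectories and never overwrite an existing entry.
        if os.path.isfile(source) and not os.path.lexists(dest):
            os.symlink(source, dest)
            linked.append(name)
    return linked
```

It would be called with the path to the environment's 'mtsv/ext' directory, for example `link_release_binaries("envs/mtsv1.02/lib/python3.6/site-packages/mtsv/ext")`.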

Remove non-species taxids from databases

For example, taxid 34146 appears in the binning output and corresponds to the Closterium peracerosum-strigosum-littorale complex, which is not a species. Such taxids should be removed as part of the sequence database cleanup. This one caused problems in the summary step because it has no associated rank, but that was fixed in fee92c5.
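
A minimal sketch of the cleanup step, assuming a rank lookup is available for every taxid in the database. The rank table is a toy stand-in, taxid 100 is made up, and whether taxids below species level should also be kept is a decision this sketch does not address.

```python
def keep_species(taxids, rank_of):
    """Return (kept, dropped): taxids whose rank is 'species', and all others."""
    kept, dropped = [], []
    for taxid in taxids:
        (kept if rank_of.get(taxid) == "species" else dropped).append(taxid)
    return kept, dropped


if __name__ == "__main__":
    # Toy rank table: 100 is a made-up species taxid; 34146 has no usable
    # rank, following the description in the issue text.
    rank_of = {100: "species", 34146: None}
    kept, dropped = keep_species([100, 34146], rank_of)
    print(kept)     # [100]
    print(dropped)  # [34146]
```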
