
matam's People

Contributors

kbseah, loic-couderc, ppericard, tazend


matam's Issues

Docker is not building anymore

The error seems to come from the compilation of the ovgraphbuild module:

/matam/ovgraphbuild/lib/seqan/include/seqan/system/file_sync.h:337:25: error: call of overloaded 'empty(std::__cxx11::string&)' is ambiguous
         if (empty(tmpDir))
                         ^
[...]
CMakeFiles/ovgraphbuild.dir/build.make:62: recipe for target 'CMakeFiles/ovgraphbuild.dir/src/alignmentsComparison.cpp.o' failed
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/ovgraphbuild.dir/all' failed
Makefile:83: recipe for target 'all' failed

2017-10-06 14:41:25,076 - WARNING - A problem might have happened while compiling ovgraphbuild. Check log above

This ambiguity most likely comes from C++17's std::empty being found by argument-dependent lookup alongside SeqAn's own empty() under newer compilers; pinning an older C++ standard or updating the bundled SeqAn should resolve it (see also the gcc-6 issue below).

Add a first manual

Let's add a PDF manual to MATAM. The first iteration could be filled with the info already in the README.
Ideally, the PDF should be built automatically after each push or release.

No valid binary found for sga

MATAM spills out these messages, complaining that it cannot find the sga binary.

INFO - Graph compaction & Components identification terminated in 63.3342 seconds wall time
INFO - Compressed graph: 472 components

INFO - === LCA labelling ===
INFO - LCA labelling terminated in 7.3417 seconds wall time

INFO - === Contigs assembly ===
INFO - Save components to fastq files
INFO - Assemble components
No valid binary found for sga
No valid binary found for sga
No valid binary found for sga
No valid binary found for sga
No valid binary found for sga
...

I added $MATAM_HOME/sga/src/bin to $PATH, but that doesn't help.

Ovgraphbuild compilation issue with gcc-6

While investigating issue #44, we noticed that ovgraphbuild does not compile with gcc-6.
The problem seems to come from the way the flags are set in CMakeLists.txt.
More precisely:

set(ENABLE_CXXFLAGS_TO_CHECK
    -std=gnu++1z
    -std=c++1z
    -std=gnu++14
    -std=c++14
    -std=gnu++1y
    -std=c++1y)

Update this file to allow compilation with gcc-6.

Improve conda recipe

For now (see #5), the process of building a conda package uses the default build.py script. It would be preferable to write a specific build script: instead of using submodules, use conda requirements.

To enable this, the code has to be slightly modified to search for binaries in the MATAM directory first, then fall back to the PATH, as sketched below.
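
A minimal sketch of that lookup order, with hypothetical names (MATAM's real resolution code may differ):

import os
import shutil

def find_binary(name, matam_dir):
    """Return the path to `name`, searching matam_dir first, then $PATH."""
    local = os.path.join(matam_dir, name)
    if os.path.isfile(local) and os.access(local, os.X_OK):
        return local
    found = shutil.which(name)  # falls back to the conda env / system PATH
    if found is None:
        raise FileNotFoundError('No valid binary found for {}'.format(name))
    return found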

Running index_default_ssu_rrna_db.py before building should raise an error

When running index_default_ssu_rrna_db.py before building MATAM, the SortMeRNA indexdb_rna executable is not found, but the indexing script still terminates with no error...

$ ./index_default_ssu_rrna_db.py

2017-01-30 11:26:01,604 - INFO - -- Get compressed archive --
2017-01-30 11:26:01,604 - DEBUG - PWD: /home/pericard/matam
2017-01-30 11:26:01,604 - DEBUG - CMD: mkdir /home/pericard/matam/db; wget http://bioinfo.lifl.fr/matam/SILVA_128_SSURef_NR95.tar.bz2 -O /home/pericard/matam/db/SILVA_128_SSURef_NR95.tar.bz2
mkdir: cannot create directory '/home/pericard/matam/db': File exists
--2017-01-30 11:26:01-- http://bioinfo.lifl.fr/matam/SILVA_128_SSURef_NR95.tar.bz2
Resolving bioinfo.lifl.fr (bioinfo.lifl.fr)… 193.48.186.71
Connecting to bioinfo.lifl.fr (bioinfo.lifl.fr)|193.48.186.71|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 140567671 (134M) [application/x-bzip2]
Saving to: '/home/pericard/matam/db/SILVA_128_SSURef_NR95.tar.bz2'

100%[==========>] 140,567,671 10.6MB/s in 12s

2017-01-30 11:26:14 (10.8 MB/s) - '/home/pericard/matam/db/SILVA_128_SSURef_NR95.tar.bz2' saved [140567671/140567671]

2017-01-30 11:26:14,032 - INFO - -- Extracting default ref db --
2017-01-30 11:26:14,032 - DEBUG - PWD: /home/pericard/matam/db
2017-01-30 11:26:14,032 - DEBUG - CMD: tar jxvf SILVA_128_SSURef_NR95.tar.bz2
SILVA_128_SSURef_NR95.clustered.fasta
SILVA_128_SSURef_NR95.complete.fasta
SILVA_128_SSURef_NR95.complete.taxo.tab

2017-01-30 11:26:58,580 - INFO - -- Indexing default ref db --
2017-01-30 11:26:58,580 - DEBUG - PWD: /home/pericard/matam/db
2017-01-30 11:26:58,580 - DEBUG - CMD: /home/pericard/matam/scripts/index_ref_db.py -v -i /home/pericard/matam/db/SILVA_128_SSURef_NR95 --max_memory 10000
INFO - Indexing complete ref db
/bin/sh: 1: /home/pericard/matam/sortmerna/indexdb_rna: not found

INFO - Indexing clustered ref db
/bin/sh: 1: /home/pericard/matam/sortmerna/indexdb_rna: not found

2017-01-30 11:26:58,617 - INFO - -- Completed default SSU rRNA DB indexing --
2017-01-30 11:26:58,617 - DEBUG - Indexing completed in 57.01 seconds
2017-01-30 11:26:58,617 - INFO - Indexing went well. Default SSU rRNA DB and its indexes can be found in: /home/pericard/matam/db/SILVA_128_SSURef_NR95*
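
The indexing wrapper apparently ignores the child process's exit status. A minimal sketch of a fix, assuming a hypothetical run_command() helper rather than MATAM's actual code:

import logging
import subprocess
import sys

logger = logging.getLogger(__name__)

def run_command(cmd):
    """Run a shell command and abort the script if the command fails."""
    logger.debug('CMD: %s', cmd)
    returncode = subprocess.call(cmd, shell=True)
    if returncode != 0:
        logger.error('Command failed (exit code %d): %s', returncode, cmd)
        sys.exit(returncode)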

Ovgraphbuild major overhaul: Improve implementation speed ++

Change how read pairs are iterated over: read the SAM file reference by reference and use a position window gliding along each reference.

One major problem: how do we store read pairs that we already know overlap, so that we don't deal with them again later? (One option is sketched below.)

Parallelize the treatment: at the reference level, but also at the read-pair level.
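
For illustration only (ovgraphbuild itself is C++), a Python sketch of the gliding window with a shared set of already-seen pairs; all names are hypothetical:

from collections import deque

def candidate_pairs(reads, window_size, seen):
    """Yield overlapping read-pair candidates at most once.

    `reads` is an iterable of (read_id, position) tuples sorted by position
    on one reference; `seen` is a set shared across references, so a pair
    that recurs under another reference is not processed twice.
    """
    window = deque()  # reads whose positions are still within range
    for read_id, pos in reads:
        while window and pos - window[0][1] > window_size:
            window.popleft()  # too far away to overlap any newer read
        for other_id, _ in window:
            key = (min(read_id, other_id), max(read_id, other_id))
            if key not in seen:
                seen.add(key)
                yield key
        window.append((read_id, pos))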

Better handling of paired-end reads

Right now, all reads are treated as single reads.
Some steps could benefit from paired-end information (contig assembly with SGA, MATAM scaffolding, ...).
We could still accept the input file the same way, but be more precise about its formatting so that paired-end reads can be retrieved and used later on.

Improve output files visibility

MATAM output files should be easier to find. We could work in a sub-directory and then symlink the final files into the MATAM output directory (a linking sketch follows the layout below).
As in:

matam_assembly/

  • final_assembly.fa --> wkdir/final_assembly.fa
  • ...
  • wkdir/
    . final_assembly.fa
    . scaffolds.NR.min_500bp.fa
    . ...
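
A minimal sketch of the linking step, with a hypothetical helper and an illustrative file list:

import os

def link_final_files(out_dir, final_files):
    """Expose files from out_dir/wkdir/ as top-level symlinks."""
    os.makedirs(out_dir, exist_ok=True)
    for name in final_files:
        link = os.path.join(out_dir, name)
        if os.path.lexists(link):
            os.remove(link)  # refresh a stale link on re-runs
        os.symlink(os.path.join('wkdir', name), link)

link_final_files('matam_assembly', ['final_assembly.fa'])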

Improve matam error management

Today, we only know that an error happened during a MATAM run, and at which step.
Moreover, a MATAM run can keep going even if there was a major problem in a previous step.

So predictable errors should be enumerated and dealt with, at least by printing them. And MATAM should not be able to start a step if the previous one had a major problem; a minimal fail-fast sketch follows.
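
A fail-fast runner sketch, with hypothetical step names; errors are assumed to surface as exceptions:

import logging
import sys

logger = logging.getLogger('matam')

def run_pipeline(steps):
    """`steps` is a list of (name, callable) pairs, run in order."""
    for name, step in steps:
        logger.info('=== %s ===', name)
        try:
            step()
        except Exception as exc:
            logger.error('Step %r failed: %s -- aborting the run', name, exc)
            sys.exit(1)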

Rethink the README

Over time, the README has grown bigger and bigger, and the amount of information displayed can be overwhelming or unclear (#27).

We have to re-organize the README.

Duplicate the STDOUT and STDERR outputs in a log file

We should have a log file storing everything that is currently output to STDOUT and STDERR.

Currently, if people don't redirect their STDOUT and STDERR to a file, they can lose important information about the run.

The implementation should be flexible enough that it is easy to choose whether we want to output to the terminal, to the log file, or both; see the sketch below.
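
A minimal sketch with the standard logging module (handler setup and file name are illustrative):

import logging
import sys

def setup_logging(log_file='matam.log', to_terminal=True, to_file=True):
    logger = logging.getLogger('matam')
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    if to_terminal:
        console = logging.StreamHandler(sys.stderr)
        console.setFormatter(fmt)
        logger.addHandler(console)
    if to_file:
        filelog = logging.FileHandler(log_file)
        filelog.setFormatter(fmt)
        logger.addHandler(filelog)
    return logger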

Use GitLab as our main repository manager?

Pros:

  • GitLab is fully open-source
  • the repository can be hosted on the main GitLab server or on a private server
  • exact same functionalities as GitHub + additional ones
  • better issue tracking
  • better continuous integration management
  • our current GitHub repository is fully transferable to GitLab (code, issues, ...)

Cons:

  • GitLab is currently less popular than GitHub, which means less visibility. But this is rapidly changing.

https://usersnap.com/blog/gitlab-github/
https://www.upwork.com/hiring/development/gitlab-vs-github-how-are-they-different/

Code deduplication

The read_fasta_file_handle function is duplicated in many locations:

scripts/compute_assembly_stats.py
scripts/compute_lca_from_tab.py
scripts/compute_pairwise_distance_matrix.py
scripts/compute_ref_coverage_histogram.py
scripts/exonerate_to_sam.py
scripts/extract_taxo_from_fasta.py
scripts/fasta_clean_name.py
scripts/fasta_get_lengths.py
scripts/fasta_length_filter.py
scripts/fasta_name_filter.py
scripts/filter_sam_by_coverage.py
scripts/filter_sam_by_pid.py
scripts/get_HMP_OTU_psn.py
scripts/matam_assembly.py
scripts/remove_redundant_sequences.py
scripts/replace_Ns_by_As.py
scripts/replace_Ns_by_rand_nu.py
scripts/sort_fasta_by_length.py

Issue #8 introduces a new module (scripts/fasta_utils.py).
Use it to deduplicate the code; a typical shape for the shared helper is sketched below.

scripts/compute_abundance.py and scripts/krona.py use the copy from fasta_clean_name; replace that import.

Some other functions may need replacing as well (read_fastq_file_handle, format_seq, ...).
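
For illustration, one common shape for the shared reader in scripts/fasta_utils.py (the actual implementation from issue #8 may differ):

def read_fasta_file_handle(fasta_file_handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header = None
    seq_chunks = []
    for line in fasta_file_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(seq_chunks)
            header = line[1:]
            seq_chunks = []
        else:
            seq_chunks.append(line)
    if header is not None:
        yield header, ''.join(seq_chunks)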

Better memory management (issue to be redistributed into more specific issues)

Find a way to improve memory management at nearly every step:

  • SortMeRNA runs are not memory-capped: only the ref db index memory is capped, not the alignments memory.
  • Ovgraphbuild could process the SAM file by bins instead of reading it all into RAM.
  • Unix sort is sometimes used twice in a piped command line, so there is a risk of using twice the RAM that was given.

BIOM format

Hi,

Any chance of producing the results, i.e. abundance and taxonomy assignment, in BIOM format?
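
Not something MATAM produces today; for illustration, a minimal sketch with the biom-format Python package (a hypothetical extra dependency; IDs and values are made up):

import numpy as np
from biom.table import Table

data = np.array([[120.0], [43.0]])  # per-scaffold abundances, one sample
table = Table(data,
              observation_ids=['scaffold_1', 'scaffold_2'],
              sample_ids=['sample_1'],
              observation_metadata=[
                  {'taxonomy': ['Bacteria', 'Proteobacteria']},
                  {'taxonomy': ['Bacteria', 'Firmicutes']}])
with open('matam_abundance.biom', 'w') as out:
    out.write(table.to_json('MATAM'))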

Parallelize components assembly

This should be easy. Right now, components are assembled one at a time, but since they are independent, this could be parallelized in the Python code of matam_assembly.py, e.g. with a process pool as sketched below.
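
A minimal sketch with multiprocessing; assemble_component() is a stand-in for the real per-component SGA call:

import multiprocessing

def assemble_component(component_id):
    # placeholder for the real work: run SGA on one component's reads
    return component_id

if __name__ == '__main__':
    components = range(472)  # illustrative number of components
    with multiprocessing.Pool(processes=4) as pool:
        for _contigs in pool.imap_unordered(assemble_component, components):
            pass  # collect per-component contigs here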

LCA labelling error with custom database

When running matam_v1.3.0 with a depleted version of the SILVA database, an error occurred:

2017-10-16 13:26:24,198 - DEBUG - CMD: cat /workdir/lcouderc/data_matam/16S_rRNA/simulated_dataset/50x/matam/matam_v1.3.0_S128_SSURef_NR95_depleted_cov_50/genomes_complets.art_HS25_pe_1
01bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.cov_filt_50.ovgb_i100_o50.cpts_N1_E1.read_metanode_component_taxo.tab | sort -T /workdir/lcouderc/data_matam/16S_r
RNA/simulated_dataset/50x/matam/matam_v1.3.0_S128_SSURef_NR95_depleted_cov_50 -S 10000M --parallel 1 -k3,3 -k1,1 | /home/lcouderc/matam/scripts/compute_lca_from_tab.py -t 4 -f 3 -g 1 -m
 0.51 -o /workdir/lcouderc/data_matam/16S_rRNA/simulated_dataset/50x/matam/matam_v1.3.0_S128_SSURef_NR95_depleted_cov_50/genomes_complets.art_HS25_pe_101bp_50x.sortmerna_vs_SILVA_128_SS
URef_NR95_b10_m10.scr_filt_geo_90pct.cov_filt_50.ovgb_i100_o50.cpts_N1_E1.component_lca51pct.tab
Traceback (most recent call last):
  File "/home/lcouderc/matam/scripts/compute_lca_from_tab.py", line 157, in <module>
    factor_id = factor_tab_list[0][0][args.factor]
IndexError: list index out of range
2017-10-16 13:26:24,285 - INFO - LCA labelling terminated in 10.4576 seconds wall time
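
The traceback suggests compute_lca_from_tab.py indexes into an empty list when the piped input yields no rows (plausible with a depleted database). A hedged sketch of a guard, reusing the names from the traceback but not MATAM's surrounding code:

import sys

def first_factor_id(factor_tab_list, factor_col):
    """Return the first factor id, or None when no rows were read."""
    if not factor_tab_list or not factor_tab_list[0]:
        return None
    return factor_tab_list[0][0][factor_col]

if __name__ == '__main__':
    if first_factor_id([], factor_col=3) is None:  # empty input, as reported
        sys.stderr.write('No rows to label; writing empty output\n')
        sys.exit(0)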

SGA logging

  • In debug mode, there is too much verbosity at the contig assembly step. Restrict the logs to the assembly of the first component and the last one.
  • The script sga_assemble.py logs to a file, and this information is hidden from the MATAM log. Show this info for the first and last components.

Replace sumaclust by vsearch

Vsearch is now the clustering tool that should be used instead of Sumaclust (which is unpublished and very slow).
!!! Careful: such a change will have a significant impact on the clustered database, and therefore on MATAM results. Major tests will have to be performed before validation !!!

Improve scaffolding by using paired-end reads

Paired-end reads could be used to help us choose between compatible contigs mapping on the same reference, allowing us to discriminate between two close species mapping on the same reference sequence and decreasing the risk of creating chimeras.

Has to be done after #31

Allow dynamic reduction of the coverage for highly covered references

At the moment, for huge datasets, some species can be highly covered (> 10000x).
Computing the overlap graph then becomes intractable due to the complexity of the algorithm used: O(n²), with n being the read coverage.

To be able to assemble such datasets, we can implement different kinds of dynamic coverage reduction (from simplest to hardest; a sketch of the first option follows the list):

  • randomly subsample a highly covered reference until we reach a coverage threshold. Potential loss of specific regions due to their low coverage compared to the conserved regions.
  • randomly subsample highly covered regions. Potential loss of connections between conserved regions (= can induce some chimeric sequences).
  • pseudo-randomly subsample highly covered regions so as to keep links between conserved regions (keep reads shared by multiple references).
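
A minimal sketch of the first option, with simplified coverage arithmetic and a fixed seed for reproducibility:

import random

def subsample_reads(reads, ref_length, read_length, max_coverage, seed=42):
    """Randomly keep at most the number of reads giving `max_coverage`."""
    max_reads = max_coverage * ref_length // read_length
    if len(reads) <= max_reads:
        return reads
    return random.Random(seed).sample(reads, max_reads)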

Standardize RDP taxa

Some RDP assignments can have additional levels like subclass or suborder. This translates badly into the Krona file, where taxonomic levels do not all have the same depth: for example, genus will be at level 6 for most assignments but at level 8 for those particular ones, while the suborder will sit at level 6.

A simple way to correct the problem would be to generate a Krona file with a fixed number of levels, which means extracting only the six standard ranks from the RDP assignments, as sketched below.
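
A minimal sketch; the input is assumed to be a list of (name, rank) pairs as parsed from an RDP assignment:

STANDARD_RANKS = ('domain', 'phylum', 'class', 'order', 'family', 'genus')

def standardize_rdp_taxa(assignment):
    """Keep only the six standard ranks, dropping subclass/suborder etc."""
    by_rank = {rank: name for name, rank in assignment}
    return [by_rank.get(rank, 'unclassified') for rank in STANDARD_RANKS]

print(standardize_rdp_taxa([('Bacteria', 'domain'),
                            ('Actinobacteria', 'phylum'),
                            ('Actinobacteria', 'class'),
                            ('Actinobacteridae', 'subclass'),
                            ('Actinomycetales', 'order'),
                            ('Micrococcineae', 'suborder'),
                            ('Micrococcaceae', 'family'),
                            ('Arthrobacter', 'genus')]))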

Add automatic error correction to components assembly

Run the assembly of one component a first time, then estimate the fold coverage a posteriori: ~reads_nt/contigs_nt. If it is above 20-50, re-run the assembly with error correction activated (see the sketch below).

We probably need to wait until #8 is implemented, so that assembly doesn't take forever.
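
A tiny sketch of the test; the 30x threshold is an arbitrary pick within the 20-50 range above:

def needs_error_correction(reads_nt, contigs_nt, threshold=30):
    """True when the estimated fold coverage (reads_nt / contigs_nt) is high."""
    if contigs_nt == 0:
        return False  # nothing assembled; error correction won't help
    return reads_nt / contigs_nt > threshold

assert needs_error_correction(reads_nt=5000000, contigs_nt=100000)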

Problem completing installation (download-traindata)

Installation halted after

download-traindata:
      [get] Getting: http://rdp.cme.msu.edu/download/rdpclassifiertraindata/data.tgz
      [get] To: /home/ycl6/tools/matam/RDPTools/classifier/build/classes/data.tgz

It seems data.tgz is unavailable; I even tried wget on the URL, but that failed. Is there an alternative location for the training data?

preferred installation method

I recall Pierre saying that Conda was the preferred method for getting MATAM, yet the README appears to recommend compiling from source.

Also, the "sudo apt-get" commands for a "quick installation" aren't user-friendly; not that many users are root on their machine.

Docker classification issue

Dear authors,

Thank you for making this great software!

I'm having an issue running MATAM through Docker: it works fine until the classification step, then outputs this error:

INFO - Write abundance informations to: /matam/matam_assembly/scaffolds.NR.min_500bp.fa.abd
INFO - === Taxonomic assignment ===
No valid binary found for classifier

Seems like it's not finding the RDP binaries?

Any help would be greatly appreciated.

Best regards,
Raphael

Look into SGA scaffolding step if paired-end reads are used

Right now we do not use the scaffolding step, since we treat all reads as single reads.
Using paired-end reads could allow us to use the scaffolding step from SGA (not implemented yet in sga_assembly.py) and improve MATAM contig lengths.
However, this means that SGA would generate scaffolds with Ns, and we would have to evaluate how to handle those.

Allow to resume from taxonomic_assignment step

When the --perform_taxonomic_assignment flag was omitted, we have to re-run MATAM from the abundance calculation step. That is an unnecessary waste of time; allow restarting directly from the taxonomic_assignment step.
