
matam's People

Contributors

kbseah, loic-couderc, ppericard, tazend


matam's Issues

Docker is not building anymore

The error seems to come from the compilation of the ovgraphbuild module:

/matam/ovgraphbuild/lib/seqan/include/seqan/system/file_sync.h:337:25: error: call of overloaded 'empty(std::__cxx11::string&)' is ambiguous
         if (empty(tmpDir))
                         ^
[...]
CMakeFiles/ovgraphbuild.dir/build.make:62: recipe for target 'CMakeFiles/ovgraphbuild.dir/src/alignmentsComparison.cpp.o' failed
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/ovgraphbuild.dir/all' failed
Makefile:83: recipe for target 'all' failed

2017-10-06 14:41:25,076 - WARNING - A problem might have happened while compiling ovgraphbuild. Check log above

This ambiguity most likely comes from C++17's std::empty being found by argument-dependent lookup alongside SeqAn's own empty() under newer compilers; pinning an older C++ standard or updating the bundled SeqAn should resolve it (see also the gcc-6 issue below).

Add a first manual

Let's add a PDF manual to MATAM. The first iteration could be filled with the info already in the README.
Ideally, the PDF should be built automatically after each push or release.

No valid binary found for sga

MATAM spills out these messages, complaining that it cannot find the sga binary.

INFO - Graph compaction & Components identification terminated in 63.3342 seconds wall time
INFO - Compressed graph: 472 components

INFO - === LCA labelling ===
INFO - LCA labelling terminated in 7.3417 seconds wall time

INFO - === Contigs assembly ===
INFO - Save components to fastq files
INFO - Assemble components
No valid binary found for sga
No valid binary found for sga
No valid binary found for sga
No valid binary found for sga
No valid binary found for sga
...

I added $MATAM_HOME/sga/src/bin to $PATH, but that doesn't help.

Ovgraphbuild compilation issue with gcc-6

While investigating issue #44, we noticed that ovgraphbuild does not compile with gcc-6.
The problem seems to come from the way the flags are set in CMakeLists.txt.
More precisely:

set(ENABLE_CXXFLAGS_TO_CHECK
    -std=gnu++1z
    -std=c++1z
    -std=gnu++14
    -std=c++14
    -std=gnu++1y
    -std=c++1y)

Update this file to allow compilation with gcc-6.

Improve conda recipe

For now (see #5), the process of building a conda package uses the default build.py script. It would be preferable to write a specific build script: instead of using submodules, use conda requirements.

To enable this, the code has to be slightly modified to search for binaries in the MATAM directory first, then fall back to the PATH, as sketched below.
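
A minimal sketch of that lookup order, with hypothetical names (MATAM's real resolution code may differ):

import os
import shutil

def find_binary(name, matam_dir):
    """Return the path to `name`, searching matam_dir first, then $PATH."""
    local = os.path.join(matam_dir, name)
    if os.path.isfile(local) and os.access(local, os.X_OK):
        return local
    found = shutil.which(name)  # falls back to the conda env / system PATH
    if found is None:
        raise FileNotFoundError('No valid binary found for {}'.format(name))
    return found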

Running index_default_ssu_rrna_db.py before building should raise an error

When running index_default_ssu_rrna_db.py before building MATAM, the SortMeRNA indexdb_rna executable is not found, but the indexing script still terminates with no error...

$ ./index_default_ssu_rrna_db.py

2017-01-30 11:26:01,604 - INFO - -- Get compressed archive --
2017-01-30 11:26:01,604 - DEBUG - PWD: /home/pericard/matam
2017-01-30 11:26:01,604 - DEBUG - CMD: mkdir /home/pericard/matam/db; wget http://bioinfo.lifl.fr/matam/SILVA_128_SSURef_NR95.tar.bz2 -O /home/pericard/matam/db/SILVA_128_SSURef_NR95.tar.bz2
mkdir: cannot create directory '/home/pericard/matam/db': File exists
--2017-01-30 11:26:01-- http://bioinfo.lifl.fr/matam/SILVA_128_SSURef_NR95.tar.bz2
Resolving bioinfo.lifl.fr (bioinfo.lifl.fr)… 193.48.186.71
Connecting to bioinfo.lifl.fr (bioinfo.lifl.fr)|193.48.186.71|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 140567671 (134M) [application/x-bzip2]
Saving to: '/home/pericard/matam/db/SILVA_128_SSURef_NR95.tar.bz2'

100%[==========>] 140,567,671 10.6MB/s in 12s

2017-01-30 11:26:14 (10.8 MB/s) - '/home/pericard/matam/db/SILVA_128_SSURef_NR95.tar.bz2' saved [140567671/140567671]

2017-01-30 11:26:14,032 - INFO - -- Extracting default ref db --
2017-01-30 11:26:14,032 - DEBUG - PWD: /home/pericard/matam/db
2017-01-30 11:26:14,032 - DEBUG - CMD: tar jxvf SILVA_128_SSURef_NR95.tar.bz2
SILVA_128_SSURef_NR95.clustered.fasta
SILVA_128_SSURef_NR95.complete.fasta
SILVA_128_SSURef_NR95.complete.taxo.tab

2017-01-30 11:26:58,580 - INFO - -- Indexing default ref db --
2017-01-30 11:26:58,580 - DEBUG - PWD: /home/pericard/matam/db
2017-01-30 11:26:58,580 - DEBUG - CMD: /home/pericard/matam/scripts/index_ref_db.py -v -i /home/pericard/matam/db/SILVA_128_SSURef_NR95 --max_memory 10000
INFO - Indexing complete ref db
/bin/sh: 1: /home/pericard/matam/sortmerna/indexdb_rna: not found

INFO - Indexing clustered ref db
/bin/sh: 1: /home/pericard/matam/sortmerna/indexdb_rna: not found

2017-01-30 11:26:58,617 - INFO - -- Completed default SSU rRNA DB indexing --
2017-01-30 11:26:58,617 - DEBUG - Indexing completed in 57.01 seconds
2017-01-30 11:26:58,617 - INFO - Indexing went well. Default SSU rRNA DB and its indexes can be found in: /home/pericard/matam/db/SILVA_128_SSURef_NR95*
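
The indexing wrapper apparently ignores the child process's exit status. A minimal sketch of a fix, assuming a hypothetical run_command() helper rather than MATAM's actual code:

import logging
import subprocess
import sys

logger = logging.getLogger(__name__)

def run_command(cmd):
    """Run a shell command and abort the script if the command fails."""
    logger.debug('CMD: %s', cmd)
    returncode = subprocess.call(cmd, shell=True)
    if returncode != 0:
        logger.error('Command failed (exit code %d): %s', returncode, cmd)
        sys.exit(returncode)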

Ovgraphbuild major overhaul: Improve implementation speed ++

Change how read pairs are iterated over: read the SAM file reference by reference and use a position window gliding along each reference.

One major problem: how do we store read pairs that we already know overlap, so that we don't deal with them again later? (One option is sketched below.)

Parallelize the treatment: at the reference level, but also at the read-pair level.
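
For illustration only (ovgraphbuild itself is C++), a Python sketch of the gliding window with a shared set of already-seen pairs; all names are hypothetical:

from collections import deque

def candidate_pairs(reads, window_size, seen):
    """Yield overlapping read-pair candidates at most once.

    `reads` is an iterable of (read_id, position) tuples sorted by position
    on one reference; `seen` is a set shared across references, so a pair
    that recurs under another reference is not processed twice.
    """
    window = deque()  # reads whose positions are still within range
    for read_id, pos in reads:
        while window and pos - window[0][1] > window_size:
            window.popleft()  # too far away to overlap any newer read
        for other_id, _ in window:
            key = (min(read_id, other_id), max(read_id, other_id))
            if key not in seen:
                seen.add(key)
                yield key
        window.append((read_id, pos))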

Better handling of paired-end reads

Right now, all reads are treated as single reads.
Some steps could benefit from paired-end information (contig assembly with SGA, MATAM scaffolding, ...).
We could still accept the input file the same way, but be more precise about its formatting so that paired-end reads can be retrieved and used later on.

Improve output files visibility

MATAM output files should be easier to find. We could work in a sub-directory and then symlink the final files into the MATAM output directory (a linking sketch follows the layout below).
As in:

matam_assembly/

  • final_assembly.fa --> wkdir/final_assembly.fa
  • ...
  • wkdir/
    . final_assembly.fa
    . scaffolds.NR.min_500bp.fa
    . ...
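
A minimal sketch of the linking step, with a hypothetical helper and an illustrative file list:

import os

def link_final_files(out_dir, final_files):
    """Expose files from out_dir/wkdir/ as top-level symlinks."""
    os.makedirs(out_dir, exist_ok=True)
    for name in final_files:
        link = os.path.join(out_dir, name)
        if os.path.lexists(link):
            os.remove(link)  # refresh a stale link on re-runs
        os.symlink(os.path.join('wkdir', name), link)

link_final_files('matam_assembly', ['final_assembly.fa'])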

Improve matam error management

Today, we only know that an error happened during a MATAM run, and at which step.
Moreover, a MATAM run can keep going even if there was a major problem in a previous step.

So predictable errors should be enumerated and dealt with, at least by printing them. And MATAM should not be able to start a step if the previous one had a major problem; a minimal fail-fast sketch follows.
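
A fail-fast runner sketch, with hypothetical step names; errors are assumed to surface as exceptions:

import logging
import sys

logger = logging.getLogger('matam')

def run_pipeline(steps):
    """`steps` is a list of (name, callable) pairs, run in order."""
    for name, step in steps:
        logger.info('=== %s ===', name)
        try:
            step()
        except Exception as exc:
            logger.error('Step %r failed: %s -- aborting the run', name, exc)
            sys.exit(1)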

Rethink the README

Over time, the README has grown bigger and bigger, and the amount of information displayed can be overwhelming or unclear (#27).

We have to re-organize the README.

Duplicate the STDOUT and STDERR outputs in a log file

We should have a log file storing everything that is currently output to STDOUT and STDERR.

Currently, if people don't redirect their STDOUT and STDERR to a file, they can lose important information about the run.

The implementation should be flexible enough that it is easy to choose whether we want to output to the terminal, to the log file, or both; see the sketch below.
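
A minimal sketch with the standard logging module (handler setup and file name are illustrative):

import logging
import sys

def setup_logging(log_file='matam.log', to_terminal=True, to_file=True):
    logger = logging.getLogger('matam')
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    if to_terminal:
        console = logging.StreamHandler(sys.stderr)
        console.setFormatter(fmt)
        logger.addHandler(console)
    if to_file:
        filelog = logging.FileHandler(log_file)
        filelog.setFormatter(fmt)
        logger.addHandler(filelog)
    return logger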

Use GitLab as our main repository manager?

Pros:

  • GitLab is fully open-source
  • the repository can be hosted on the main GitLab server or on a private server
  • exact same functionalities as GitHub + additional ones
  • better issue tracking
  • better continuous integration management
  • our current GitHub repository is fully transferable to GitLab (code, issues, ...)

Cons:

  • GitLab is currently less popular than GitHub, which means less visibility. But this is rapidly changing.

https://usersnap.com/blog/gitlab-github/
https://www.upwork.com/hiring/development/gitlab-vs-github-how-are-they-different/

Code deduplication

The read_fasta_file_handle function is duplicated in many locations:

scripts/compute_assembly_stats.py
scripts/compute_lca_from_tab.py
scripts/compute_pairwise_distance_matrix.py
scripts/compute_ref_coverage_histogram.py
scripts/exonerate_to_sam.py
scripts/extract_taxo_from_fasta.py
scripts/fasta_clean_name.py
scripts/fasta_get_lengths.py
scripts/fasta_length_filter.py
scripts/fasta_name_filter.py
scripts/filter_sam_by_coverage.py
scripts/filter_sam_by_pid.py
scripts/get_HMP_OTU_psn.py
scripts/matam_assembly.py
scripts/remove_redundant_sequences.py
scripts/replace_Ns_by_As.py
scripts/replace_Ns_by_rand_nu.py
scripts/sort_fasta_by_length.py

Issue #8 introduces a new module (scripts/fasta_utils.py).
Use it to deduplicate the code; a typical shape for the shared helper is sketched below.

scripts/compute_abundance.py and scripts/krona.py use the copy from fasta_clean_name; replace that import.

Some other functions may need replacing as well (read_fastq_file_handle, format_seq, ...).
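
For illustration, one common shape for the shared reader in scripts/fasta_utils.py (the actual implementation from issue #8 may differ):

def read_fasta_file_handle(fasta_file_handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header = None
    seq_chunks = []
    for line in fasta_file_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(seq_chunks)
            header = line[1:]
            seq_chunks = []
        else:
            seq_chunks.append(line)
    if header is not None:
        yield header, ''.join(seq_chunks)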

Better memory management (issue to be redistributed into more specific issues)

Find a way to improve memory management at nearly every step:

  • SortMeRNA runs are not memory-capped: only the ref db index memory is capped, not the alignments memory.
  • Ovgraphbuild could process the SAM file by bins instead of reading it all into RAM.
  • Unix sort is sometimes used twice in a piped command line, so there is a risk of using twice the RAM that was given.

BIOM format

Hi,

Any chance of producing the results, i.e. abundance and taxonomy assignment, in BIOM format?
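
Not something MATAM produces today; for illustration, a minimal sketch with the biom-format Python package (a hypothetical extra dependency; IDs and values are made up):

import numpy as np
from biom.table import Table

data = np.array([[120.0], [43.0]])  # per-scaffold abundances, one sample
table = Table(data,
              observation_ids=['scaffold_1', 'scaffold_2'],
              sample_ids=['sample_1'],
              observation_metadata=[
                  {'taxonomy': ['Bacteria', 'Proteobacteria']},
                  {'taxonomy': ['Bacteria', 'Firmicutes']}])
with open('matam_abundance.biom', 'w') as out:
    out.write(table.to_json('MATAM'))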

Parallelize components assembly

This should be easy. Right now, components are assembled one at a time, but since they are independent, this could be parallelized in the Python code of matam_assembly.py, e.g. with a process pool as sketched below.
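
A minimal sketch with multiprocessing; assemble_component() is a stand-in for the real per-component SGA call:

import multiprocessing

def assemble_component(component_id):
    # placeholder for the real work: run SGA on one component's reads
    return component_id

if __name__ == '__main__':
    components = range(472)  # illustrative number of components
    with multiprocessing.Pool(processes=4) as pool:
        for _contigs in pool.imap_unordered(assemble_component, components):
            pass  # collect per-component contigs here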

LCA labelling error with custom database

When running matam_v1.3.0 with a depleted version of the SILVA database, an error occurred:

2017-10-16 13:26:24,198 - DEBUG - CMD: cat /workdir/lcouderc/data_matam/16S_rRNA/simulated_dataset/50x/matam/matam_v1.3.0_S128_SSURef_NR95_depleted_cov_50/genomes_complets.art_HS25_pe_1
01bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.cov_filt_50.ovgb_i100_o50.cpts_N1_E1.read_metanode_component_taxo.tab | sort -T /workdir/lcouderc/data_matam/16S_r
RNA/simulated_dataset/50x/matam/matam_v1.3.0_S128_SSURef_NR95_depleted_cov_50 -S 10000M --parallel 1 -k3,3 -k1,1 | /home/lcouderc/matam/scripts/compute_lca_from_tab.py -t 4 -f 3 -g 1 -m
 0.51 -o /workdir/lcouderc/data_matam/16S_rRNA/simulated_dataset/50x/matam/matam_v1.3.0_S128_SSURef_NR95_depleted_cov_50/genomes_complets.art_HS25_pe_101bp_50x.sortmerna_vs_SILVA_128_SS
URef_NR95_b10_m10.scr_filt_geo_90pct.cov_filt_50.ovgb_i100_o50.cpts_N1_E1.component_lca51pct.tab
Traceback (most recent call last):
  File "/home/lcouderc/matam/scripts/compute_lca_from_tab.py", line 157, in <module>
    factor_id = factor_tab_list[0][0][args.factor]
IndexError: list index out of range
2017-10-16 13:26:24,285 - INFO - LCA labelling terminated in 10.4576 seconds wall time
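
The traceback suggests compute_lca_from_tab.py indexes into an empty list when the piped input yields no rows (plausible with a depleted database). A hedged sketch of a guard, reusing the names from the traceback but not MATAM's surrounding code:

import sys

def first_factor_id(factor_tab_list, factor_col):
    """Return the first factor id, or None when no rows were read."""
    if not factor_tab_list or not factor_tab_list[0]:
        return None
    return factor_tab_list[0][0][factor_col]

if __name__ == '__main__':
    if first_factor_id([], factor_col=3) is None:  # empty input, as reported
        sys.stderr.write('No rows to label; writing empty output\n')
        sys.exit(0)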

SGA logging

  • In debug mode, there is too much verbosity at the contig assembly step. Restrict the logs to the assembly of the first component and the last one.
  • The script sga_assemble.py logs to a file, and this information is hidden from the MATAM log. Show this info for the first and last components.

Replace sumaclust by vsearch

Vsearch is now the clustering tool that should be used instead of Sumaclust (which is unpublished and very slow).
!!! Careful: such a change will have a significant impact on the clustered database, and therefore on MATAM results. Major tests will have to be performed before validation !!!

Improve scaffolding by using paired-end reads

Paired-end reads could be used to help us choose between compatible contigs mapping on the same reference, allowing us to discriminate between two close species mapping on the same reference sequence and decreasing the risk of creating chimeras.

Has to be done after #31

Allow dynamic reduction of the coverage for highly covered references

At the moment, for huge datasets, some species can be highly covered (> 10000x).
Computing the overlap graph then becomes intractable due to the complexity of the algorithm used: O(n²), with n being the read coverage.

To be able to assemble such datasets, we can implement different kinds of dynamic coverage reduction (from simplest to hardest; a sketch of the first option follows the list):

  • randomly subsample a highly covered reference until we reach a coverage threshold. Potential loss of specific regions due to their low coverage compared to the conserved regions.
  • randomly subsample highly covered regions. Potential loss of connections between conserved regions (= can induce some chimeric sequences).
  • pseudo-randomly subsample highly covered regions so as to keep links between conserved regions (keep reads shared by multiple references).
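
A minimal sketch of the first option, with simplified coverage arithmetic and a fixed seed for reproducibility:

import random

def subsample_reads(reads, ref_length, read_length, max_coverage, seed=42):
    """Randomly keep at most the number of reads giving `max_coverage`."""
    max_reads = max_coverage * ref_length // read_length
    if len(reads) <= max_reads:
        return reads
    return random.Random(seed).sample(reads, max_reads)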

Standardize RDP taxa

Some RDP assignments can have additional levels like subclass or suborder. This translates badly into the Krona file, where taxonomic levels do not all have the same depth: for example, genus will be at level 6 for most assignments but at level 8 for those particular ones, while the suborder will sit at level 6.

A simple way to correct the problem would be to generate a Krona file with a fixed number of levels, which means extracting only the six standard ranks from the RDP assignments, as sketched below.
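
A minimal sketch; the input is assumed to be a list of (name, rank) pairs as parsed from an RDP assignment:

STANDARD_RANKS = ('domain', 'phylum', 'class', 'order', 'family', 'genus')

def standardize_rdp_taxa(assignment):
    """Keep only the six standard ranks, dropping subclass/suborder etc."""
    by_rank = {rank: name for name, rank in assignment}
    return [by_rank.get(rank, 'unclassified') for rank in STANDARD_RANKS]

print(standardize_rdp_taxa([('Bacteria', 'domain'),
                            ('Actinobacteria', 'phylum'),
                            ('Actinobacteria', 'class'),
                            ('Actinobacteridae', 'subclass'),
                            ('Actinomycetales', 'order'),
                            ('Micrococcineae', 'suborder'),
                            ('Micrococcaceae', 'family'),
                            ('Arthrobacter', 'genus')]))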

Add automatic error correction to components assembly

Run the assembly of one component a first time, then estimate the fold coverage a posteriori: ~reads_nt/contigs_nt. If it is above 20-50, re-run the assembly with error correction activated (see the sketch below).

We probably need to wait until #8 is implemented, so that assembly doesn't take forever.
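
A tiny sketch of the test; the 30x threshold is an arbitrary pick within the 20-50 range above:

def needs_error_correction(reads_nt, contigs_nt, threshold=30):
    """True when the estimated fold coverage (reads_nt / contigs_nt) is high."""
    if contigs_nt == 0:
        return False  # nothing assembled; error correction won't help
    return reads_nt / contigs_nt > threshold

assert needs_error_correction(reads_nt=5000000, contigs_nt=100000)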

Problem completing installation (download-traindata)

Installation halted after

download-traindata:
      [get] Getting: http://rdp.cme.msu.edu/download/rdpclassifiertraindata/data.tgz
      [get] To: /home/ycl6/tools/matam/RDPTools/classifier/build/classes/data.tgz

It seems data.tgz is unavailable; I even tried wget on the URL, but that failed. Is there an alternative location for the training data?

preferred installation method

I recall Pierre saying that Conda was the preferred method for getting MATAM, yet the README appears to recommend compiling from source.

Also, the "sudo apt-get" commands for a "quick installation" aren't user-friendly; not that many users are root on their machine.

Docker classification issue

Dear authors,

Thank you for making this great software!

I'm having an issue running MATAM through Docker: it works fine until the classification step, then outputs this error:

INFO - Write abundance informations to: /matam/matam_assembly/scaffolds.NR.min_500bp.fa.abd
INFO - === Taxonomic assignment ===
No valid binary found for classifier

Seems like it's not finding the RDP binaries?

Any help would be greatly appreciated.

Best regards,
Raphael

Look into SGA scaffolding step if paired-end reads are used

Right now we do not use the scaffolding step, since we treat all reads as single reads.
Using paired-end reads could allow us to use the scaffolding step from SGA (not implemented yet in sga_assembly.py) and improve MATAM contig lengths.
However, this means that SGA would generate scaffolds with Ns, and we would have to evaluate how to handle those.

Allow to resume from taxonomic_assignment step

When the --perform_taxonomic_assignment flag was omitted, we have to re-run MATAM from the abundance calculation step. That is an unnecessary waste of time; allow restarting directly from the taxonomic_assignment step.
