leaemiliepradier / plasforest

A random forest classifier to identify contigs of plasmid origin in contig and scaffold genomes

License: GNU General Public License v3.0

Python 88.45% Shell 11.55%

plasmid random-forest-classifier genome-analysis homology-search pipeline

plasforest's Issues

Hi, it seems like you updated something; I now get another error, shown below

PlasForest: a homology-based random forest classifier for plasmid identification.
(C) Lea Pradier, Tazzio Tissot, Anna-Sophie Fiston-Lavier, Stephanie Bedhomme. 2020.
Traceback (most recent call last):
File "PlasForest.py", line 311, in <module>
main(sys.argv[1:])
File "PlasForest.py", line 147, in main
finalfile = plasforest_predict(features, showFeatures, besthits, verbose, attributed_IDs, attributed_identities, nthreads)
File "PlasForest.py", line 289, in plasforest_predict
plasforest.n_jobs = int(nthreads)
NameError: name 'plasforest' is not defined

install and database download conda

Hi!
Could you please provide guidance on installing PlasForest in a conda environment? I have tried, but after installing all the Python dependencies with pip and running conda install -c bioconda blast, I could not download the database of plasmid sequences. Thanks.

error with database_downloader.sh

Hi!

When I run "bash database_downloader.sh", I get:

"All sequences were downloaded correctly. Good!
Program finished without error."

but also this line:
"database_downloader.sh: line 41: 24585 Segmentation fault makeblastdb -in plasmid_refseq.fasta -dbtype nucl -parse_seqids"

Is that an error?

database

Hi!

Do you know what the issue is?

Traceback (most recent call last):
File "train_plasforest.py", line 174, in <module>
main(sys.argv[1:])
File "train_plasforest.py", line 103, in main
blast_launcher(inputfile, blast_table, verbose, nthreads, database)
File "train_plasforest.py", line 126, in blast_launcher
stdout, stderr = blastn_cline()
File "/root/.local/lib/python3.6/site-packages/Bio/Application/__init__.py", line 574, in __call__
raise ApplicationError(return_code, str(self), stdout_str, stderr_str)
Bio.Application.ApplicationError: Non-zero return code 2 from 'blastn -out /root/PlasForest/test.fasta_blast.out -outfmt 6 -query /root/PlasForest/test.fasta -db plasmid_refseq.fasta -evalue 0.001 -num_threads 1', message 'BLAST Database error: No alias or index file found for nucleotide database [plasmid_refseq.fasta] in search path [/root/PlasForest::]'
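
A note on this error: "No alias or index file found" usually means blastn could not find the files that makeblastdb writes next to the FASTA file (the makeblastdb command quoted in the database_downloader.sh issue above is `makeblastdb -in plasmid_refseq.fasta -dbtype nucl -parse_seqids`). The sketch below is one way to check for them before launching PlasForest. It assumes the usual nucleotide database extensions (`.nin` for a single-volume index, `.nal` for a multi-volume alias file); `blast_nucl_db_present` is a hypothetical helper, not part of PlasForest.

```python
import os

def blast_nucl_db_present(db_prefix):
    """Return True if a nucleotide BLAST index or alias file exists for db_prefix.

    Assumption: makeblastdb wrote its output next to the FASTA file, so the
    database prefix is the FASTA path itself (e.g. 'plasmid_refseq.fasta').
    """
    # .nal = alias file (multi-volume database), .nin = index file (single volume)
    return any(os.path.exists(db_prefix + ext) for ext in (".nal", ".nin"))

if __name__ == "__main__":
    prefix = "plasmid_refseq.fasta"
    if blast_nucl_db_present(prefix):
        print("BLAST database files found for", prefix)
    else:
        print("No BLAST index found for", prefix, "- makeblastdb may have failed")
```

If this reports that nothing was found even though the FASTA file exists, the makeblastdb step probably did not complete and needs to be rerun in the PlasForest directory.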

Do you know what the problem is when downloading the database?

/mibi/Wanli/anaconda/envs/plasplinev1.4.1/lib/python3.9/site-packages/Bio/Entrez/Parser.py:903: UserWarning: Failed to save epost.dtd at /usr/local/home/hsv709/.config/biopython/Bio/Entrez/DTDs/epost.dtd
warnings.warn("Failed to save %s at %s" % (filename, path))
Traceback (most recent call last):
File "/mibi/users/Wanli/test_plasplinev1.4.1/Plaspline/db/db/plasforest/check_and_download_database.py", line 95, in <module>
download_missing(list_missing, email)
File "/mibi/users/Wanli/test_plasplinev1.4.1/Plaspline/db/db/plasforest/check_and_download_database.py", line 77, in download_missing
result = Entrez.read(request)
File "/mibi/Wanli/anaconda/envs/plasplinev1.4.1/lib/python3.9/site-packages/Bio/Entrez/__init__.py", line 508, in read
record = handler.read(handle)
File "/mibi/Wanli/anaconda/envs/plasplinev1.4.1/lib/python3.9/site-packages/Bio/Entrez/Parser.py", line 304, in read
self.parser.ParseFile(handle)
File "/home/conda/feedstock_root/build_artifacts/python-split_1653669926144/work/Modules/pyexpat.c", line 459, in EndElement
File "/mibi/Wanli/anaconda/envs/plasplinev1.4.1/lib/python3.9/site-packages/Bio/Entrez/Parser.py", line 666, in endErrorElementHandler
raise RuntimeError(value)
RuntimeError: Some IDs have invalid value and were omitted. Maximum ID value 18446744073709551615

run test plasforest error

Dear,
I have installed PlasForest on a cluster and wanted to test the installation, but I get an error.
I don't know how to fix it.
Can you please help me resolve the issue?
Please find the error file attached:
slurm-52074819.out.zip
Cordially,
Azim

Test fails syntax error

Running test_plasforest.sh errored out:

$ ./test_plasforest.sh 
Starting to test your PlasForest install...
Checking if all the files are here.... OK
We now run PlasForest on the test dataset...  File "PlasForest.py", line 6
SyntaxError: Non-ASCII character '\xc3' in file PlasForest.py on line 6, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
We now run PlasForest on the test dataset... ERROR

I'm running Python 3.7.2. I get this same error whether I keep the PlasForest.py shebang as-is or change it to /usr/bin/env python3.
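
A note on this error: the wording "Non-ASCII character ... but no encoding declared" is the message Python 2 produces, since Python 3 reads source files as UTF-8 by default. That suggests PlasForest.py is being launched by a Python 2 interpreter, possibly because the test script calls `python` rather than `python3` (an assumption, not confirmed here). The sketch below is one way to see which interpreter a given command really starts. `interpreter_report` is a hypothetical helper, and the code is deliberately written so that it also runs under Python 2.

```python
import sys

def interpreter_report(version_info=sys.version_info, executable=sys.executable):
    """Describe the running interpreter and flag Python 2."""
    major, minor = version_info[0], version_info[1]
    status = "OK" if major >= 3 else "too old"
    return "{} -> Python {}.{} ({})".format(executable, major, minor, status)

if __name__ == "__main__":
    print(interpreter_report())
```

Saving this as a small script and running it once with `python` and once with `python3` shows whether the two commands point to different interpreters.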

Cannot import name 'ChainMap'

Hi. I'm having a problem. When I try to run my command (python3 PlasForest.py -i /home/pedro/Guaymas_C/Analyses/GuaymasC.fasta) or the test_plasforest.sh script, I receive the same error:
I'm using Python version 3.6.0

Traceback (most recent call last):
  File "PlasForest.py", line 28, in <module>
    import pandas as pd
  File "/home/pedro/miniconda3/envs/PlastForest_env/lib/python3.6/site-packages/pandas/__init__.py", line 121, in <module>
    from pandas.core.computation.api import eval
  File "/home/pedro/miniconda3/envs/PlastForest_env/lib/python3.6/site-packages/pandas/core/computation/api.py", line 3, in <module>
    from pandas.core.computation.eval import eval
  File "/home/pedro/miniconda3/envs/PlastForest_env/lib/python3.6/site-packages/pandas/core/computation/eval.py", line 12, in <module>
    from pandas.core.computation.engines import _engines
  File "/home/pedro/miniconda3/envs/PlastForest_env/lib/python3.6/site-packages/pandas/core/computation/engines.py", line 9, in <module>
    from pandas.core.computation.ops import _mathops, _reductions
  File "/home/pedro/miniconda3/envs/PlastForest_env/lib/python3.6/site-packages/pandas/core/computation/ops.py", line 19, in <module>
    from pandas.core.computation.scope import _DEFAULT_GLOBALS
  File "/home/pedro/miniconda3/envs/PlastForest_env/lib/python3.6/site-packages/pandas/core/computation/scope.py", line 17, in <module>
    from pandas.compat.chainmap import DeepChainMap
  File "/home/pedro/miniconda3/envs/PlastForest_env/lib/python3.6/site-packages/pandas/compat/chainmap.py", line 1, in <module>
    from typing import ChainMap, MutableMapping, TypeVar, cast
ImportError: cannot import name 'ChainMap'
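
A note on this error: `typing.ChainMap` was added to the standard library in Python 3.5.4 and 3.6.1, so the installed pandas cannot be imported on Python 3.6.0. Updating the interpreter inside the environment to a later 3.6.x release (or newer) should avoid it. The check below is a minimal sketch of that version rule; `typing_has_chainmap` is a hypothetical helper name.

```python
import sys

def typing_has_chainmap(version_info=sys.version_info):
    """Return True if this Python version ships typing.ChainMap."""
    version = tuple(version_info[:3])
    if version[:2] == (3, 5):
        # Backported to the 3.5 series in 3.5.4
        return version >= (3, 5, 4)
    # First available in the 3.6 series in 3.6.1
    return version >= (3, 6, 1)

if __name__ == "__main__":
    print("typing.ChainMap available:", typing_has_chainmap())
```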

AttributeError: module 'numpy' has no attribute 'float'.

Hi,
Thank you for the awesome package. I would like to try it on my bacterial metagenomic sequences. However, when I try it with the test data I get the following error:

conda activate plasforest-1.4
bash test_plasforest.sh

bash test_plasforest.sh
Starting to test your PlasForest install...
Checking if all the files are here.... OK
We now run PlasForest on the test dataset...Traceback (most recent call last):
  File "PlasForest.py", line 27, in <module>
    from sklearn.ensemble import RandomForestClassifier
  File "/home/user/.local/lib/python3.8/site-packages/sklearn/ensemble/__init__.py", line 7, in <module>
    from ._forest import RandomForestClassifier
  File "/home/user/.local/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 56, in <module>
    from ..tree import (DecisionTreeClassifier, DecisionTreeRegressor,
  File "/home/user/.local/lib/python3.8/site-packages/sklearn/tree/__init__.py", line 6, in <module>
    from ._classes import BaseDecisionTree
  File "/home/user/.local/lib/python3.8/site-packages/sklearn/tree/_classes.py", line 40, in <module>
    from ._criterion import Criterion
  File "sklearn/tree/_splitter.pxd", line 34, in init sklearn.tree._criterion
  File "sklearn/tree/_tree.pxd", line 37, in init sklearn.tree._splitter
  File "sklearn/neighbors/_quad_tree.pxd", line 55, in init sklearn.tree._tree
  File "/home/user/.local/lib/python3.8/site-packages/sklearn/neighbors/__init__.py", line 17, in <module>
    from ._nca import NeighborhoodComponentsAnalysis
  File "/home/user/.local/lib/python3.8/site-packages/sklearn/neighbors/_nca.py", line 22, in <module>
    from ..decomposition import PCA
  File "/home/user/.local/lib/python3.8/site-packages/sklearn/decomposition/__init__.py", line 17, in <module>
    from .dict_learning import dict_learning
  File "/home/user/.local/lib/python3.8/site-packages/sklearn/decomposition/dict_learning.py", line 4, in <module>
    from . import _dict_learning
  File "/home/user/.local/lib/python3.8/site-packages/sklearn/decomposition/_dict_learning.py", line 21, in <module>
    from ..linear_model import Lasso, orthogonal_mp_gram, LassoLars, Lars
  File "/home/user/.local/lib/python3.8/site-packages/sklearn/linear_model/__init__.py", line 12, in <module>
    from ._least_angle import (Lars, LassoLars, lars_path, lars_path_gram, LarsCV,
  File "/home/user/.local/lib/python3.8/site-packages/sklearn/linear_model/_least_angle.py", line 30, in <module>
    method='lar', copy_X=True, eps=np.finfo(np.float).eps,
  File "/home/user/.local/lib/python3.8/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
We now run PlasForest on the test dataset... ERROR
(plasforest-1.4)[user@user1 PlasForest-1.4]$

It may be due to a compatibility issue: scikit-learn 0.22.2 requires NumPy (>= 1.11.0), but it still uses the `np.float` alias that newer NumPy versions no longer provide.
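
Two things stand out in the traceback. First, the `np.float` alias was deprecated in NumPy 1.20 and removed in NumPy 1.24, so an old scikit-learn release that still uses it needs a NumPy older than 1.24. Second, both packages are being imported from /home/user/.local/lib/python3.8/site-packages, not from the conda environment, so the versions in use may not be the ones installed with conda. The sketch below checks both points; `numpy_has_float_alias` is a hypothetical helper, and the import is guarded so the script still runs if NumPy is missing.

```python
def numpy_has_float_alias(numpy_version):
    """Return True if this NumPy version still provides the np.float alias."""
    major, minor = (int(part) for part in numpy_version.split(".")[:2])
    # The alias was removed in NumPy 1.24
    return (major, minor) < (1, 24)

if __name__ == "__main__":
    try:
        import numpy
    except ImportError:
        print("NumPy is not installed for this interpreter")
    else:
        # __file__ shows whether NumPy comes from the conda env or from ~/.local
        print("NumPy", numpy.__version__, "loaded from", numpy.__file__)
        print("np.float available:", numpy_has_float_alias(numpy.__version__))
```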

blastn

Hi!

Do you know what the issue is?

Traceback (most recent call last):
File "/home/projects/ku_00041/apps/wanli/F_pipeline/db/plasforest/PlasForest.py", line 236, in <module>
main(sys.argv[1:])
File "/home/projects/ku_00041/apps/wanli/F_pipeline/db/plasforest/PlasForest.py", line 114, in main
blast_launcher(tmp_fasta, blast_table, verbose, nthreads)
File "/home/projects/ku_00041/apps/wanli/F_pipeline/db/plasforest/PlasForest.py", line 161, in blast_launcher
stdout, stderr = blastn_cline()
File "/home/projects/ku_00041/apps/wanli/F_pipeline/conda_envs/ceb528a9/lib/python3.8/site-packages/Bio/Application/__init__.py", line 569, in __call__
raise ApplicationError(return_code, str(self), stdout_str, stderr_str)

Bio.Application.ApplicationError: Non-zero return code -11 from 'blastn -out assmebly_res/SRR2145291_contigs_1kb.fasta_blast.out -outfmt 6 -query assmebly_res/SRR2145291_contigs_1kb.fasta_tmp.fasta -db plasmid_refseq.fasta -evalue 0.001 -num_threads 30'
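
A note on reading this message: when a child process is started through Python's subprocess machinery, a negative return code means the process was terminated by a signal, and the absolute value is the signal number. Return code -11 therefore corresponds to signal 11, a segmentation fault inside blastn itself, not a Python error. A minimal sketch of that decoding follows; `describe_return_code` is a hypothetical helper name.

```python
import signal

def describe_return_code(code):
    """Translate a subprocess-style return code into a readable description."""
    if code < 0:
        # Negative codes mean "terminated by signal number -code"
        name = signal.Signals(-code).name
        return "terminated by signal {} ({})".format(-code, name)
    return "exited with status {}".format(code)

if __name__ == "__main__":
    print(describe_return_code(-11))
    print(describe_return_code(2))
```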

Infinite loop in database_downloader.sh

Hello,

I have executed database_downloader.sh, but after it downloads all the records it does not finish, but starts the process over again. This is a sample of what is shown in the Linux terminal:

Downloading record 34401 to 34600 of 34701
Downloading record 32201 to 32400 of 34701
Downloading record 33201 to 33400 of 34701
Downloading record 34601 to 34701 of 34701
Downloading record 33401 to 33600 of 34701
Checking for sequences that did not download... Please wait.
Downloading accession 1 to 34701 of 34701
WARNING: Master record found and removed: NZ_CBTO000000000.1.
All sequences were downloaded correctly. Good!
Program finished without error.
Downloading record 4801 to 5000 of 34701
Downloading record 7201 to 7400 of 34701
Downloading record 1201 to 1400 of 34701
Downloading record 3601 to 3800 of 34701

The download stops only if it is manually interrupted, and running test_plasforest.sh afterwards returns the following error:
ERROR: You must first download the plasmid database by using database_downloader.sh
(plasforest)

Do you know what could be causing this error and how to solve it?

Thanks!

Version of the python dependencies

Hello,

I am trying to install the required Python packages, but I am getting an error with scikit-learn. This is sometimes due to incompatibilities with the other packages, which are installed without specifying versions.

Could you please provide the exact versions of the python packages that you have installed?

Thank you.

Best,
Susana
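
For anyone comparing environments, the sketch below lists the installed versions of a set of packages so they can be pasted into an issue. The package list is an assumption based on the imports visible in the tracebacks on this page (pandas, scikit-learn, Biopython, NumPy), not an official requirements list, and `installed_versions` is a hypothetical helper. It uses `importlib.metadata`, which needs Python 3.8 or newer.

```python
from importlib import metadata

def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    # Assumed dependency names; adjust to the project's actual requirements
    for name, version in installed_versions(
        ["numpy", "pandas", "scikit-learn", "biopython"]
    ).items():
        print(name, version if version is not None else "not installed")
```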

release candidate

Could you please specify a release? This is important for version control.

Thank you.

cannot find the path to the .sav file

Hi, I have run the test at the end of the installation and everything seemed to work. I am able to call the program; however, the following error occurs:
PlasForest: a homology-based random forest classifier for plasmid identification.
(C) Lea Pradier, Tazzio Tissot, Anna-Sophie Fiston-Lavier, Stephanie Bedhomme. 2020.
Error: cannot find the path to the .sav file
Any idea what is going on or how it can be fixed?
Thanks

command line option for plasforest.sav and plasmid_refseq.fasta

Could we have a command line option to set the location of

plasmid_refseq.fasta
and
plasforest.sav

rather than have it hardcoded in a certain location?
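
To make the request concrete, here is a minimal sketch of what such options could look like using argparse. The flag names `--model` and `--database` are hypothetical and do not exist in PlasForest as far as this page shows; the defaults simply mirror the current file names.

```python
import argparse

def build_parser():
    """Build a parser with hypothetical options for the model and database paths."""
    parser = argparse.ArgumentParser(description="Sketch of configurable file locations")
    parser.add_argument("-i", "--input", required=True, help="input FASTA file")
    # Hypothetical flags; defaults keep today's behaviour of using the local files
    parser.add_argument("--model", default="plasforest.sav",
                        help="path to the pickled classifier")
    parser.add_argument("--database", default="plasmid_refseq.fasta",
                        help="path (prefix) of the plasmid BLAST database")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print("model:", args.model)
    print("database:", args.database)
```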

FileNotFoundError: [Errno 2] No such file or directory: 'plasforest.sav'

Hello!
When I try this :
for file in *.fna; do python ../../../softwares/PlasForest/PlasForest.py -b -i $file -o ${file%%.fna}.csv --threads 8; done
I get this for each fasta file:

Traceback (most recent call last):
File "../../../softwares/PlasForest/PlasForest.py", line 45, in <module>
plasforest = pickle.load(open("plasforest.sav","rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'plasforest.sav'

The test turned out well.

Help me please.
thanks in advance
Benjamin Leyton
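
A note on this error: the traceback shows the model being opened as `open("plasforest.sav","rb")`, a relative path, so Python looks for the file in the current working directory, not in the directory that contains PlasForest.py. That would explain why the test passes when run from the PlasForest folder but the loop fails when run from the data folder. One workaround is to run the loop from inside the PlasForest directory and pass full paths to the input files. Another, sketched below, is to resolve the file relative to the script's own location; `resource_path` is a hypothetical helper, not existing PlasForest code.

```python
import os

def resource_path(script_file, name):
    """Return the absolute path of a file stored next to the given script."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, name)

if __name__ == "__main__":
    # Inside a script, __file__ is that script's own path
    model_path = resource_path(__file__, "plasforest.sav")
    print("The model would be loaded from:", model_path)
```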