
drep's People

Contributors

bebatut · franciscozorrilla · mrolm · mruehlemann · tanaes · valentynbez · wwood


drep's Issues

Pass options to checkM?

(Suggestion) I'm wondering if it would be possible to pass options to checkM, such as --reduced_tree during lineage_wf, so that it could run on machines with <40 GB of RAM.
Thanks
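
For reference, CheckM exposes this flag directly, so the reduced-tree run can at least be done outside dRep; a minimal sketch (placeholder paths, assuming checkm is on PATH):

import subprocess

# Sketch: CheckM's lineage_wf with --reduced_tree uses the reduced
# reference tree, which needs far less RAM than the full ~40 GB tree.
# All paths below are placeholders.
cmd = [
    "checkm", "lineage_wf",
    "--reduced_tree",      # reduced reference tree = lower memory
    "-t", "8",             # threads
    "-x", "fa",            # genome file extension
    "input_genomes/",      # directory of genome FASTAs
    "checkm_out/",         # output directory
]
subprocess.run(cmd, check=True)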

Pairwise MASH clustering issue

Dear Matt,

Me again :)

I was trying to see if dRep was behaving weirdly because of the datasets/environment. Therefore, I copied the test genomes provided with dRep into my input folder and proceeded to run the following command:

$ dRep dereplicate_wf -p 4 --completeness 0.6 --strain_htr 101 --P_ani 0.6 --S_ani 0.965 --overwrite --run_tax /scratch/users/snarayanasamy/LAO_TS_IMP-v1.3/TemporalBinning/RepresentativeBins -g /scratch/users/snarayanasamy/LAO_TS_IMP-v1.3/TemporalBinning/AllBins/*.fa

It fails on the clustering step for some reason:

Step 1. Parse Arguments
Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
[====================] 100.00%
2b. Cluster pair-wise MASH clustering
Traceback (most recent call last):
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/env3.4.3/bin/dRep", line 26, in <module>
    controller.parseArguments(args)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/controller.py", line 144, in parseArguments
    self.dereplicate_wf_operation(**vars(args))
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/controller.py", line 86, in dereplicate_wf_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/d_workflows.py", line 36, in dereplicate_wrapper
    drep.d_cluster.d_cluster_wrapper(wd, **kwargs)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/d_cluster.py", line 287, in d_cluster_wrapper
    Cdb, Mdb, Ndb = cluster_genomes(Bdb, data_folder, **kwargs)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/d_cluster.py", line 106, in cluster_genomes
    Cdb = cluster_mash_database(Mdb, data_folder= data_folder, **kwargs)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/d_cluster.py", line 662, in cluster_mash_database
    linkage_cutoff= P_Lcutoff)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/d_cluster.py", line 355, in cluster_hierarchical
    linkage = scipy.cluster.hierarchy.linkage(arr, method= linkage_method)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/scipy/cluster/hierarchy.py", line 660, in linkage
    n = int(distance.num_obs_y(y))
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/scipy/spatial/distance.py", line 1718, in num_obs_y
    raise ValueError("The number of observations cannot be determined on "
ValueError: The number of observations cannot be determined on an empty distance matrix.

Any idea if this is an issue with my environment?

Please know that dRep is working very well for our data sets. It is just that we find some inconsistencies from time to time and would like to report them so they can be fixed. We are looking to keep using dRep on more of our data. Looking forward to hearing from you.

Cheers,
Shaman
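
For reference, the crash itself can be reproduced outside dRep: scipy refuses an empty condensed distance matrix, which is what cluster_hierarchical receives when a primary cluster ends up with no pairwise comparisons. A minimal repro:

import numpy as np
import scipy.cluster.hierarchy

# An empty condensed distance matrix triggers the same ValueError
# raised inside dRep's cluster_hierarchical().
scipy.cluster.hierarchy.linkage(np.array([]), method="average")
# ValueError: The number of observations cannot be determined on an
# empty distance matrix.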

ANImf - how is filtering done?

Hi, I can't find any details on the filtering done by the ANImf algorithm on your Read the Docs page. What exactly does it do? This might also be good to add to the documentation.

Thanks!
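
Judging from the *.delta.filtered files dRep writes under data/ANImf_files/ (see the FileNotFoundError report later in this list), ANImf appears to pass each nucmer .delta file through MUMmer's delta-filter before computing ANI. A hedged sketch of that two-step pipeline (an inference from the file names, not confirmed from the source):

import subprocess

# Sketch (assumption): align with nucmer, then keep only the best
# reference/query alignments with delta-filter before parsing
# identity and coverage.
subprocess.run(["nucmer", "--prefix", "A_vs_B", "A.fa", "B.fa"], check=True)
with open("A_vs_B.delta.filtered", "w") as out:
    subprocess.run(["delta-filter", "-r", "-q", "A_vs_B.delta"],
                   stdout=out, check=True)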

Error bonus step

Hi,

I have run dereplication. The genomes were properly dereplicated, but there was an error during the bonus step.

This is the command I have run:

dRep dereplicate -p 28 output_dir -g genomes/*.fasta --genomeInfo genomeinfo.csv --run_tax --cent_index /home/name/scripts/centrifuge_database/

The directory /home/name/scripts/centrifuge_database/ contains these files and folders:
abv.1.cf abv.2.cf abv.3.cf input-sequences.fna library seqid2taxid.map taxonomy

Is that the correct path to pass to --cent_index?


..:: dRep dereplicate Step 1. Filter ::..

Will filter the genome list
Calculating genome info of genomes
99.08% of genomes passed length filtering
42.52% of genomes passed checkM filtering


..:: dRep dereplicate Step 2. Cluster ::..

Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
100 primary clusters made
Step 3. Perform secondary clustering
Running 11390 ANImf comparisons- should take ~ 203.4 min
Step 4. Return output


..:: dRep dereplicate Step 3. Choose ::..

Loading work directory


..:: dRep dereplicate Step 4. Bonus ::..

Loading work directory
Running tax
Running Centrifuge
Traceback (most recent call last):
  File "/services/tools/anaconda3/4.0.0/bin/dRep", line 26, in <module>
    controller.parseArguments(args)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/controller.py", line 144, in parseArguments
    self.dereplicate_operation(**vars(args))
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/controller.py", line 86, in dereplicate_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/d_workflows.py", line 52, in dereplicate_wrapper
    drep.d_bonus.d_bonus_wrapper(wd, **kwargs)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/d_bonus.py", line 26, in d_bonus_wrapper
    run_taxonomy(wd,**kwargs)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/d_bonus.py", line 45, in run_taxonomy
    Tdb, Bdb = parse_taxonomy(Bdb, cent_dir, **kwargs)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/d_bonus.py", line 62, in parse_taxonomy
    Tdb = parse_centrifuge_percent(Bdb, cent_dir, **kwargs)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/d_bonus.py", line 200, in parse_centrifuge_percent
    "{0}{1}_report.tsv".format(cent_dir,genome))
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/d_bonus.py", line 340, in parse_raw_centrifuge
    tax = pd.read_table(report)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/pandas/io/parsers.py", line 449, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/pandas/io/parsers.py", line 818, in __init__
    self._make_engine(self.engine)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/pandas/io/parsers.py", line 1049, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/pandas/io/parsers.py", line 1695, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 402, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 718, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: File b'/home/output_dir/data/centrifuge/102_Con1500.fasta_report.tsv' does not exist

In the data folder, both the prodigal and centrifuge folders are empty:

83 Jan 12 14:44 MASH_files
53130 Jan 12 15:43 ANImf_files
5014 Jan 12 15:43 Clustering_files
0 Jan 12 15:45 prodigal
0 Jan 12 15:45 centrifuge

And I have also seen this error:

Warning: Could not open read file "/home/output_dir/data/centrifuge/103_Con1500.fasta.fna" for reading; skipping...
Error: No input read files were valid
(ERR): centrifuge-class exited with value 1

Thank you very much in advance.
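
One thing worth checking: centrifuge's -x option takes an index prefix shared by the .cf files (here presumably .../abv), not the directory that contains them, which would explain centrifuge failing and leaving the folders empty. A hedged sketch of a standalone centrifuge call (the prefix is an assumption based on the abv.*.cf file names; the input file is a placeholder):

import subprocess

# Sketch: classify one genome FASTA with centrifuge. "-x" takes the
# index *prefix* (the part before .1.cf), which here would be ".../abv".
cmd = [
    "centrifuge",
    "-x", "/home/name/scripts/centrifuge_database/abv",  # index prefix
    "-f", "-U", "genome.fna",             # unpaired FASTA input (placeholder)
    "-S", "genome_hits.tsv",              # per-read classification output
    "--report-file", "genome_report.tsv",
]
subprocess.run(cmd, check=True)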

Error when plotting

Hi,

First of all, great job on the development of dRep. It has been an extremely useful tool for my analyses and has saved me a lot of time.

However, whenever I run the dereplicate workflow, once it reaches the analyze stage it stops at the MDS plot with:

***************************************************
    ..:: dRep dereplicate Step 6. Analyze ::..
***************************************************
    
making plots 1, 2, 3, 4, 5, 6
Plotting primary dendrogram
Plotting secondary dendrograms
Plotting MDS plot
Traceback (most recent call last):
  File "/hps/nobackup/production/metagenomics/aalmeida/software/miniconda3/bin/dRep", line 26, in <module>
    controller.parseArguments(args)
  File "/hps/nobackup/production/metagenomics/aalmeida/software/miniconda3/lib/python3.6/site-packages/drep/controller.py", line 144, in parseArguments
    self.dereplicate_operation(**vars(args))
  File "/hps/nobackup/production/metagenomics/aalmeida/software/miniconda3/lib/python3.6/site-packages/drep/controller.py", line 86, in dereplicate_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/hps/nobackup/production/metagenomics/aalmeida/software/miniconda3/lib/python3.6/site-packages/drep/d_workflows.py", line 68, in dereplicate_wrapper
    drep.d_analyze.d_analyze_wrapper(wd, plots = 'a', **kwargs)
  File "/hps/nobackup/production/metagenomics/aalmeida/software/miniconda3/lib/python3.6/site-packages/drep/d_analyze.py", line 66, in d_analyze_wrapper
    plot_secondary_mds_from_wd(wd, plot_dir, **kwargs)
  File "/hps/nobackup/production/metagenomics/aalmeida/software/miniconda3/lib/python3.6/site-packages/drep/d_analyze.py", line 252, in plot_secondary_mds_from_wd
    cluster2color = cluster2color)
  File "/hps/nobackup/production/metagenomics/aalmeida/software/miniconda3/lib/python3.6/site-packages/drep/d_analyze.py", line 1401, in _make_mds_plot
    bbox_to_anchor=(1, 0.5), prop = {'size': 10, 'style': 'italic'})
  File "/hps/nobackup/production/metagenomics/aalmeida/software/miniconda3/lib/python3.6/site-packages/matplotlib/pyplot.py", line 3744, in legend
    ret = gca().legend(*args, **kwargs)
  File "/hps/nobackup/production/metagenomics/aalmeida/software/miniconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py", line 500, in legend
    self.legend_ = mlegend.Legend(self, handles, labels, **kwargs)
  File "/hps/nobackup/production/metagenomics/aalmeida/software/miniconda3/lib/python3.6/site-packages/matplotlib/legend.py", line 583, in __init__
    for label, handle in zip(labels[:], handles[:]):
TypeError: 'dict_keys' object is not subscriptable

This happens consistently across different datasets. I am using dRep v2.0.0 and all the dependencies seem to be properly installed:

Loading work directory
Checking dependencies
mash.................................... all good        (location = /hps/nobackup/production/metagenomics/aalmeida/software/mash-Linux64-v2.0/mash)
nucmer.................................. all good        (location = /hps/nobackup/production/metagenomics/aalmeida/software/MUMmer3.23/nucmer)
checkm.................................. all good        (location = /hps/nobackup/production/metagenomics/aalmeida/software/miniconda3/bin/checkm)
ANIcalculator........................... all good        (location = /hps/nobackup/production/metagenomics/aalmeida/software/ANIcalculator_v1/ANIcalculator)
prodigal................................ all good        (location = /hps/nobackup/production/metagenomics/aalmeida/software/prodigal)
centrifuge.............................. all good        (location = /hps/nobackup/production/metagenomics/aalmeida/software/centrifuge-1.0.3-beta/centrifuge)

Any ideas on how I could solve this? Alternatively, is there any way to skip the analyze step when launching the dereplicate workflow, to avoid always getting this error at the end?

Thanks a lot in advance.

Alex
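
For reference, this is a plain Python 3 incompatibility that is easy to reproduce: dict.keys() returns a view, and the slice attempted inside matplotlib's legend code fails on it. Minimal repro:

cluster2color = {"cluster_1": "blue", "cluster_2": "red"}
labels = cluster2color.keys()
labels[:]   # TypeError: 'dict_keys' object is not subscriptable
# Passing list(cluster2color.keys()) instead would avoid the crash.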

os.path.isfile(genome), "{0} is not a file".format(genome)

Hey there,
I am attempting to use viral contigs (from VirSorter) as "genomes", but I get an error:
Traceback (most recent call last):
  File "/srv/sw/miniconda3/envs/dRep_2.2.4/bin/dRep", line 33, in <module>
    controller.parseArguments(args)
  File "/srv/sw/miniconda3/envs/dRep_2.2.4/lib/python3.6/site-packages/drep/controller.py", line 144, in parseArguments
    self.dereplicate_operation(**vars(args))
  File "/srv/sw/miniconda3/envs/dRep_2.2.4/lib/python3.6/site-packages/drep/controller.py", line 86, in dereplicate_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/srv/sw/miniconda3/envs/dRep_2.2.4/lib/python3.6/site-packages/drep/d_workflows.py", line 28, in dereplicate_wrapper
    drep.d_filter.d_filter_wrapper(wd, genomes = genomes, Chdb = Chdb, **kwargs)
  File "/srv/sw/miniconda3/envs/dRep_2.2.4/lib/python3.6/site-packages/drep/d_filter.py", line 65, in d_filter_wrapper
    bdb = drep.d_cluster.load_genomes(kwargs['genomes'])
  File "/srv/sw/miniconda3/envs/dRep_2.2.4/lib/python3.6/site-packages/drep/d_cluster.py", line 1175, in load_genomes
    assert os.path.isfile(genome), "{0} is not a file".format(genome)
AssertionError: /concatenatedASSVirs/04042019_concatenated_VirSorter_output_for_dRep_cats123/assembly*.fasta is not a file

When I attempt to run the following command:

$ dRep dereplicate workingdirectory/ -sa 0.97 -l 1000 -g /path/assembly*.fasta
How can I format my viral contigs so that they are accepted as genome files? I.e., what are the requirements for a genome file?
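
A hint sits in the assertion text itself: the path dRep received still contains a literal '*', meaning the shell glob never expanded (usually because nothing matched the pattern, or the pattern was quoted). A quick way to check what dRep will actually be handed:

import glob
import os

pattern = "/path/assembly*.fasta"     # the same pattern from the command
genomes = glob.glob(pattern)
print(len(genomes), "files matched")  # 0 matches means the literal pattern
for g in genomes:                     # reaches dRep and the isfile
    assert os.path.isfile(g), g      # assertion fires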

dRep fails when no genome passes the filtering step

Hi, as recommended in #25 I ran dRep with this option, but it seems this is not implemented correctly in the current version:

***************************************************
    ..:: dRep dereplicate Step 2. Cluster ::..
***************************************************
    
Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2. Nevermind! Skipping Mash
0 primary clusters made
Step 3. Perform secondary clustering
Running 0 ANImf comparisons- should take ~ 0.0 min
Traceback (most recent call last):
  File "/home/johdro/.conda/envs/drep/bin/dRep", line 26, in <module>
    controller.parseArguments(args)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/controller.py", line 144, in parseArguments
    self.dereplicate_operation(**vars(args))
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/controller.py", line 86, in dereplicate_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/d_workflows.py", line 36, in dereplicate_wrapper
    drep.d_cluster.d_cluster_wrapper(wd, **kwargs)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/d_cluster.py", line 75, in d_cluster_wrapper
    data_folder, wd=workDirectory, **kwargs)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/d_cluster.py", line 175, in cluster_genomes
    Cdb, c2ret = _cluster_Ndb(Ndb, comp_method=algorithm, **kwargs)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/d_cluster.py", line 207, in _cluster_Ndb
    for name, ndb in Ndb.groupby(id):
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/generic.py", line 5162, in groupby
    **kwargs)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/groupby.py", line 1848, in groupby
    return klass(obj, by, **kwds)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/groupby.py", line 516, in __init__
    mutated=self.mutated)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/groupby.py", line 2934, in _get_grouper
    raise KeyError(gpr)
KeyError: 'primary_cluster'

Apparently, skipping MASH produces zero primary clusters, and this leads to errors.

Command line used was

dRep dereplicate -pa 0.8 -sa 0.98 -comp 25 -con 25 -l 20000 -p 1 --SkipMash drep_out -g *.fna

Best,
Johannes
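
The KeyError is consistent with an empty comparison table: with zero primary clusters there are no comparisons, so Ndb never gains a 'primary_cluster' column, and pandas raises exactly this when asked to group by a missing column. Minimal repro:

import pandas as pd

Ndb = pd.DataFrame()                # what zero comparisons produce
Ndb.groupby("primary_cluster")      # KeyError: 'primary_cluster'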

Installation issue

Dear authors,

I am attempting to install dRep on my cluster. However, I am facing some issues.

Here is what I did:

$ git clone https://github.com/MrOlm/drep.git

$ cd drep

$ pip install --user .

It looks like it goes fine:

Unpacking /mnt/gaiagpfs/users/homedirs/snarayanasamy/repositories/drep
  Running setup.py egg_info for package from file:///mnt/gaiagpfs/users/homedirs/snarayanasamy/repositories/drep
    
Installing collected packages: drep
  Running setup.py install for drep
    changing mode of build/scripts-3.3/dRep from 644 to 755
    
    changing mode of /home/users/snarayanasamy/.local/bin/dRep to 755
Successfully installed drep
Cleaning up...

I have Python 3.3.2 loaded in my environment. Then I attempt the test:
$ python3 test_suite.py

To get:

Traceback (most recent call last):
  File "tests/test_suite.py", line 14, in <module>
    from drep import argumentParser
  File "/home/users/snarayanasamy/.local/lib/python3.4/site-packages/drep/argumentParser.py", line 17, in <module>
    from drep.controller import Controller
  File "/home/users/snarayanasamy/.local/lib/python3.4/site-packages/drep/controller.py", line 25, in <module>
    from drep.WorkDirectory import WorkDirectory
  File "/home/users/snarayanasamy/.local/lib/python3.4/site-packages/drep/WorkDirectory.py", line 5, in <module>
    import pandas as pd
ImportError: No module named 'pandas'

I also try to call the script directly:
$ bin/dRep

And get this:

Traceback (most recent call last):
  File "bin/dRep", line 19, in <module>
    import drep.argumentParser
  File "/home/users/snarayanasamy/.local/lib/python3.3/site-packages/drep/__init__.py", line 5, in <module>
    from Bio import SeqIO
ImportError: No module named 'Bio'

Is there something I am missing here?

I look forward to your reply.

Cheers,
Shaman
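
A quick way to confirm which interpreter the installed dRep actually sees, and whether its Python dependencies resolve from there (a sketch; failures here mean the packages landed in a different Python than the one running dRep):

import sys
print(sys.executable, sys.version)

# These are the imports the tracebacks above trip over.
import pandas
import Bio
import drep
print(pandas.__version__, Bio.__version__, drep.__file__)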

Update

I proceeded to install all the dependencies via pip. These include pandas, matplotlib, and seaborn.

I then arrive at this error when trying to run the test:

Traceback (most recent call last):
  File "tests/test_suite.py", line 215, in <module>
    test_cluster()
  File "tests/test_suite.py", line 207, in test_cluster
    verifyCluster.run()
  File "tests/test_suite.py", line 147, in run
    self.functional_test_1()
  File "tests/test_suite.py", line 163, in functional_test_1
    controller.parseArguments(args)
  File "/home/users/snarayanasamy/.local/lib/python3.4/site-packages/drep/controller.py", line 143, in parseArguments
    self.cluster_operation(**vars(args))
  File "/home/users/snarayanasamy/.local/lib/python3.4/site-packages/drep/controller.py", line 50, in cluster_operation
    drep.d_cluster.d_cluster_wrapper(kwargs['work_directory'],**kwargs)
  File "/home/users/snarayanasamy/.local/lib/python3.4/site-packages/drep/d_cluster.py", line 284, in d_cluster_wrapper
    Bdb, data_folder, kwargs = parse_arguments(workDirectory, **kwargs)
  File "/home/users/snarayanasamy/.local/lib/python3.4/site-packages/drep/d_cluster.py", line 316, in parse_arguments
    .format(prog))
NameError: name 'prog' is not defined

"Argument list too long" when running w/ 45k bins

Hi @MrOlm ,

I'm trying to run dRep on a very large set of bins and reference genomes, totalling ~45k. No matter how I set up the script, I get the following error:

run.dRep.sh: line 1: [...]/bin/dRep: Argument list too long

I'm running dRep v2.2.3, in a conda environment.

Is this an OS issue, or does dRep somehow have a hard cutoff on how many genomes/bins can be processed?

Edit: regarding the command line I'm using, nothing fancy there. I have pre-filtered the genome list based on my own checkM run, so here's what I do:

dRep dereplicate [path_to_wd] -p 24 --noQualityFiltering --genomes [genome_1.fa genome_2.fa ...]
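
"Argument list too long" is the kernel's E2BIG: the shell-expanded command line exceeds the OS limit before dRep even starts, so it is an OS limit rather than a dRep cutoff. The two sizes can be compared directly (sketch, placeholder pattern):

import glob
import os

genomes = glob.glob("bins/*.fa")                  # placeholder pattern
argv_bytes = sum(len(g) + 1 for g in genomes)     # rough size of the arg list
print("genomes:", len(genomes))
print("argument bytes:", argv_bytes)
print("OS limit (ARG_MAX):", os.sysconf("SC_ARG_MAX"))
# ~45k genome paths easily exceed ARG_MAX (commonly ~2 MB on Linux).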

something about MASH

Hi,
I'm a newcomer to bioinformatics. I have the genomes of two eukaryotic organisms, and I want to ask: can I use the "dRep compare" module to get the ANI of the two organisms? And is ANI meaningful for eukaryotic organisms?

Looking forward to your reply!

Yangfan
2019/3/29

pandas.io.common.EmptyDataError: No columns to parse from file

Hello,
I can now run the first step of dRep (with CheckM), but when I get to the second module, Cluster:


..:: dRep Step 2. Cluster ::..

Step 1. Parse Arguments
Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
[====================] 100.00%
Traceback (most recent call last):
  File "/home/zjs/tools/drep/bin/dRep", line 26, in <module>
    controller.parseArguments(args)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/drep/controller.py", line 144, in parseArguments
    self.dereplicate_wf_operation(**vars(args))
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/drep/controller.py", line 86, in dereplicate_wf_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/drep/d_workflows.py", line 36, in dereplicate_wrapper
    drep.d_cluster.d_cluster_wrapper(wd, **kwargs)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/drep/d_cluster.py", line 288, in d_cluster_wrapper
    Cdb, Mdb, Ndb = cluster_genomes(Bdb, data_folder, **kwargs)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/drep/d_cluster.py", line 104, in cluster_genomes
    Mdb = all_vs_all_MASH(Bdb, data_folder, **kwargs)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/drep/d_cluster.py", line 632, in all_vs_all_MASH
    table = pd.read_csv(file,sep='\t',header = None)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 388, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 729, in __init__
    self._make_engine(self.engine)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 922, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 1389, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 538, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5896)
pandas.io.common.EmptyDataError: No columns to parse from file


If I use the "SkipMash" parameter I can get past this step, but when I reach the final step, it happens again:

[zjs@www drep]$ /home/zjs/tools/drep/bin/dRep evaluate ./drep_out/ -e all
will compare winners
[====================] 100.00%
Traceback (most recent call last):
  File "/home/zjs/tools/drep/bin/dRep", line 26, in <module>
    controller.parseArguments(args)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/drep/controller.py", line 161, in parseArguments
    self.evaluate_operation(**vars(args))
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/drep/controller.py", line 81, in evaluate_operation
    drep.d_evaluate.d_evaluate_wrapper(kwargs['work_directory'],**kwargs)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/drep/d_evaluate.py", line 29, in d_evaluate_wrapper
    Wmdb, Wndb = compare_winners(wd,**kwargs)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/drep/d_evaluate.py", line 67, in compare_winners
    Wmdb = dClust.all_vs_all_MASH(Bdb,data_folder)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/drep/d_cluster.py", line 632, in all_vs_all_MASH
    table = pd.read_csv(file,sep='\t',header = None)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 388, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 729, in __init__
    self._make_engine(self.engine)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 922, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/zjs/.pyenv/versions/3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 1389, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 538, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5896)
pandas.io.common.EmptyDataError: No columns to parse from file


Could you help me with this issue?
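
Both tracebacks bottom out in pandas reading a MASH output table, and pandas raises this exact error for a zero-byte file, so the MASH step most likely wrote empty output (the MASH logs in the work directory's log folder would show why). Minimal repro:

import pandas as pd

open("empty.tsv", "w").close()   # simulate an empty MASH output table
pd.read_csv("empty.tsv", sep="\t", header=None)
# EmptyDataError: No columns to parse from file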

Test_backend folder missing for test_suite.py

I'm checking to see if dRep installed correctly, so I am running the test_suite.py script from the tests folder. I get an error:

FileNotFoundError: [Errno 2] No such file or directory: '/XXX/XXX/drep/tests/../tests/test_backend/ecoli_wd'

Indeed, this isn't in the tests folder. Is there a workaround to get the test scripts to work?

Thanks.

too many bins?

I ran dRep cluster and got the following message:

Step 1. Parse Arguments
Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
Traceback (most recent call last):
  File "/usr/local/bin/dRep", line 26, in <module>
    controller.parseArguments(args)
  File "/usr/local/lib/python3.6/site-packages/drep/controller.py", line 151, in parseArguments
    self.cluster_operation(**vars(args))
  File "/usr/local/lib/python3.6/site-packages/drep/controller.py", line 56, in cluster_operation
    drep.d_cluster.d_cluster_wrapper(kwargs['work_directory'],**kwargs)
  File "/usr/local/lib/python3.6/site-packages/drep/d_cluster.py", line 59, in d_cluster_wrapper
    Cdb, Mdb, Ndb = cluster_genomes(Bdb, data_folder, wd=workDirectory, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/drep/d_cluster.py", line 138, in cluster_genomes
    Mdb = all_vs_all_MASH(Bdb, data_folder, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/drep/d_cluster.py", line 612, in all_vs_all_MASH
    dm.run_cmd(cmd, dry, shell=False, logdir=logdir)
  File "/usr/local/lib/python3.6/site-packages/drep/__init__.py", line 62, in run_cmd
    call(cmd,stdout=sto, stderr=ste)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 267, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: '/usr/local/bin/mash'
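
Same E2BIG as the 45k-bins report above, but here the oversized argument list is the one dRep builds for mash itself. A generic workaround for this class of problem is chunked sketching followed by a merge, which mash supports via mash paste (a hedged sketch, arbitrary chunk size):

import subprocess

def sketch_in_chunks(genomes, out_prefix="all", chunk=500):
    """Sketch FASTAs in chunks, then merge with `mash paste`, keeping
    every command line well under the OS argument-size limit."""
    parts = []
    for i in range(0, len(genomes), chunk):
        part = "{0}_part{1}".format(out_prefix, i // chunk)
        subprocess.run(["mash", "sketch", "-o", part] + genomes[i:i + chunk],
                       check=True)
        parts.append(part + ".msh")
    subprocess.run(["mash", "paste", out_prefix] + parts, check=True)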

python 3 for drep but python 2 for checkm

Based on looking at the drep python code and Issue 22, it appears that drep requires python3. I don't see such a requirement specified in the setup.py file, but still, it appears to be python3 dependent. However, drep has checkm as a dependency, which is only compatible with python2. This has caused some issues with the bioconda drep recipe. It would be very helpful to make drep python2 compatible so that it can "play nicely" with checkm.

Report secondary clusters in text format

Hi, thanks for the useful program. I wonder if you can provide the secondary clusters in text format, rather than as a list of pairwise similarity scores. At present I basically have to redo the clustering or extract the data from the images, and both seem wrong to me.

Best,
Johannes
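
For what it's worth, the work directory already stores this mapping as a table: data_tables/Cdb.csv assigns every genome to its primary and secondary cluster, so the clusters can be pulled out without re-clustering (sketch):

import pandas as pd

Cdb = pd.read_csv("work_dir/data_tables/Cdb.csv")
for cluster, rows in Cdb.groupby("secondary_cluster"):
    print(cluster, list(rows["genome"]))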

figures

Hi,
Is there any way to get the figures in a format other than PDF, like .emf or .svg?

Thanks!

Cristian.-
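
dRep plots with matplotlib, which writes .svg natively, so one stopgap is to rebuild a figure of interest from the data_tables/ and save it yourself (a trivial sketch; the plotted data here is a stand-in):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])       # stand-in for a re-plot from data_tables/
fig.savefig("figure.svg")     # output format follows the file extension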

Something wrong while running the evaluate module

I ran the modules individually, with the command dRep evaluate work_dir -e a, and got an error like this:

will compare winners
will provide warnings about clusters
2 warnings generated: saved to /stor9000/apps/users/NWSUAF/2013130172/Jingrx/drep/log/warnings.txt
will produce Widb (winner information db)
Traceback (most recent call last):
  File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/pandas/core/indexes/bas
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Bin Id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/bin/dRep", line 33, in <module>
    controller.parseArguments(args)
  File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/drep/controller.py", li
    self.evaluate_operation(**vars(args))
  File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/drep/controller.py", li
    drep.d_evaluate.d_evaluate_wrapper(kwargs['work_directory'],**kwargs)
  File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/drep/d_evaluate.py", li
    Widb = evaluate_winners(wd, **kwargs)
  File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/drep/d_evaluate.py", li
    d = Chdb[Chdb['Bin Id'] == row['genome']]
  File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/pandas/core/frame.py",
    return self._getitem_column(key)
  File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/pandas/core/frame.py",
    return self._get_item_cache(key)
  File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/pandas/core/generic.py"
    values = self._data.get(item)
  File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/pandas/core/internals.p
    loc = self.items.get_loc(item)
  File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/pandas/core/indexes/bas
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Bin Id'
I wonder what's wrong with the 'Bin Id' column?

how to set dependencies

I am new to bioinformatics and not sure how to set up all the dependencies correctly. After downloading the dependencies, should I set a path so that dRep can find them? And where should I download the dependencies from?

'missing' genome

Hi,

I ran dRep on a bunch of MAGs. If I am reading your explanations right, genomes with >75% completeness and <24% redundancy (contamination) should be kept. Yet I found that some MAGs with completeness at 83-89% (and contamination at 3.6-3.9%) according to CheckM are not included among the winning dereplicated MAGs. Is there something about these MAGs that excludes them, which I missed while reading the documents?

I ran this analysis in February on the previous dRep version. I recently updated, so I will re-run the analysis if this has already been addressed.

Camilla
The bins are attached in FASTA (I added .txt to the names to upload them):
UASBVU0Maxbin.001.fasta.txt
UASBVU03megahit.metabat.bin.41.fasta.txt

checkM failed !!!

Hi,
I am trying to run the filter module of dRep, with genomes supplied via a Bdb.csv file:
dRep filter ~/drep_pub -p 12

I keep getting an error at the checkM step. Here is the log output I am getting:
07-25 12:04 DEBUG Filtering genomes by size
07-25 12:04 INFO 100.00% of genomes passed length filtering
07-25 12:04 DEBUG Running CheckM
07-25 12:04 INFO Running prodigal
07-25 12:04 INFO Past prodigal runs found- will not re-run
07-25 12:04 INFO Running checkM
07-25 12:04 DEBUG Running CheckM with command: ['/home/balaji/.pyenv/shims/checkm', 'lineage_wf', '/home/balaji/drep_pub/data/prodigal/', '/home/balaji/drep_pub/data/checkM/checkM_outdir/', '-f', '/home/balaji/drep_pub/data/checkM/checkM_outdir//results.tsv', '--tab_table', '-t', '12', '--pplacer_threads', '12', '-g', '-x', 'faa']
07-25 12:04 DEBUG Running CheckM with command: ['/home/balaji/.pyenv/shims/checkm', 'qa', '/home/balaji/drep_pub/data/checkM/checkM_outdir/lineage.ms', '/home/balaji/drep_pub/data/checkM/checkM_outdir/', '-f', '/home/balaji/drep_pub/data/checkM/checkM_outdir/Chdb.tsv', '-t', '12', '--tab_table', '-o', '2']
07-25 12:04 ERROR !!! checkM failed !!!
If using pyenv, make sure both python2 and python3 are available (for example: pyenv global 3.5.1 2.7.9)

I am using pyenv.
$ pyenv global
3.5.1
2.7.9

Also, dRep check_dependencies works fine:
dRep bonus testDir --check_dependencies
Loading work directory
Checking dependencies
mash.................................... all good (location = /home/balaji/softwares/mash-Linux64-v2.1.1/mash)
nucmer.................................. all good (location = /home/balaji/softwares/mummer-3.9.4alpha/bin/nucmer)
checkm.................................. all good (location = /home/balaji/.pyenv/shims/checkm)
ANIcalculator........................... all good (location = /home/balaji/softwares/ANIcalculator_v1/ANIcalculator)
prodigal................................ all good (location = /home/balaji/softwares/prodigal/prodigal)
centrifuge.............................. !!! ERROR !!! (location = None)
nsimscan................................ all good (location = /home/balaji/softwares/ANIcalculator_v1/nsimscan)

Please help.
Thanks

installing from github gives dRep v1.4.3

Hello,

When I install from GitHub using the following commands, I get dRep v1.4.3:

$ git clone https://github.com/MrOlm/drep.git
$ cd drep
$ pip install .

When I install with pip install drep, I get version 2.0.

Is the code on this GitHub repo the latest version of dRep?

Breaks if stopped during checkM stage, but restarts if delete checkM dir

Hi Matt,

Not sure this is an issue, but I'm leaving this here for posterity.

I've stopped dRep at different stages and restarted it, and it usually works better than I would expect. This seems correct given the documentation's statement:

Work Directory
The work directory is where all of the program’s internal workings, log files, cached data, and output is stored. When running dRep modules multiple times on the same dataset, it is essential that you use the same work directory so the program can find the results of previous runs.
https://drep.readthedocs.io/en/latest/module_descriptions.html

However, if dRep is stopped during the checkM stage and restarted, it crashes and reports the same error it does when pyenv is not set to both the python3 version with dRep installed and the python2 version with checkM installed.

Example

pyenv local 3.6.1 miniconda2-latest
dRep dereplicate test_checkM -g tests/genomes/*
***************************************************
    ..:: dRep dereplicate Step 1. Filter ::..
***************************************************

Will filter the genome list
Calculating genome info of genomes
100.00% of genomes passed length filtering
Running prodigal
Running checkM

#control+c to stop dRep
^CTraceback (most recent call last):
  File "/home/talex/.pyenv/versions/3.6.1/bin/dRep", line 33, in <module>
    controller.parseArguments(args)
  File "/home/talex/.pyenv/versions/3.6.1/lib/python3.6/site-packages/drep/controller.py", line 144, in parseArguments
    self.dereplicate_operation(**vars(args))
  File "/home/talex/.pyenv/versions/3.6.1/lib/python3.6/site-packages/drep/controller.py", line 86, in dereplicate_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/home/talex/.pyenv/versions/3.6.1/lib/python3.6/site-packages/drep/d_workflows.py", line 28, in dereplicate_wrapper
    drep.d_filter.d_filter_wrapper(wd, genomes = genomes, Chdb = Chdb, **kwargs)
  File "/home/talex/.pyenv/versions/3.6.1/lib/python3.6/site-packages/drep/d_filter.py", line 81, in d_filter_wrapper
    Gdb = _get_run_genomeInfo(wd, bdb, **kwargs)
  File "/home/talex/.pyenv/versions/3.6.1/lib/python3.6/site-packages/drep/d_filter.py", line 141, in _get_run_genomeInfo
    Chdb = _run_checkM_wrapper(bdb, workDirectory, **kwargs)
  File "/home/talex/.pyenv/versions/3.6.1/lib/python3.6/site-packages/drep/d_filter.py", line 363, in _run_checkM_wrapper
    Chdb = run_checkM(prod_folder, checkM_outfolder, wd=workDirectory, **kwargs)
  File "/home/talex/.pyenv/versions/3.6.1/lib/python3.6/site-packages/drep/d_filter.py", line 512, in run_checkM
    drep.run_cmd(cmd, shell=False, logdir=logdir)
  File "/home/talex/.pyenv/versions/3.6.1/lib/python3.6/site-packages/drep/__init__.py", line 47, in run_cmd
    call(cmd,stdout=sto, stderr=ste)
  File "/home/talex/.pyenv/versions/3.6.1/lib/python3.6/subprocess.py", line 269, in call
    return p.wait(timeout=timeout)
  File "/home/talex/.pyenv/versions/3.6.1/lib/python3.6/subprocess.py", line 1439, in wait
    (pid, sts) = self._try_wait(0)
  File "/home/talex/.pyenv/versions/3.6.1/lib/python3.6/subprocess.py", line 1386, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

#restart 
dRep dereplicate test_checkM -g tests/genomes/*
***************************************************
    ..:: dRep dereplicate Step 1. Filter ::..
***************************************************

Will filter the genome list
Calculating genome info of genomes
100.00% of genomes passed length filtering
Running prodigal
Past prodigal runs found- will not re-run
Running checkM
!!! checkM failed !!!
If using pyenv, make sure both python2 and python3 are available (for example: pyenv global 3.5.1 2.7.9)

If I delete test_checkM/data/checkM though, dRep restarts just fine

rm -r test_checkM/data/checkM/
dRep dereplicate test_checkM -g tests/genomes/*
***************************************************
    ..:: dRep dereplicate Step 1. Filter ::..
***************************************************

Will filter the genome list
Calculating genome info of genomes
100.00% of genomes passed length filtering
Running prodigal
Past prodigal runs found- will not re-run
Running checkM
100.00% of genomes passed checkM filtering
***************************************************
    ..:: dRep dereplicate Step 2. Cluster ::..
***************************************************

#etc ... ran to completion

Like I said, not sure this is really an issue, but it was confusing for me for a bit...

Cheers,
Alex
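
A hedged way to automate that workaround: before restarting, drop the checkM data directory whenever its results table never got written (directory layout and the results.tsv name taken from the checkM logs quoted elsewhere in this list):

import os
import shutil

checkm_dir = "test_checkM/data/checkM"
results = os.path.join(checkm_dir, "checkM_outdir", "results.tsv")

# A checkM directory without results.tsv is a half-finished run; removing
# it lets dRep re-run checkM instead of tripping over stale partial output.
if os.path.isdir(checkm_dir) and not os.path.isfile(results):
    shutil.rmtree(checkm_dir)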

" 'module' has no attribute 'which' "after installation

Hello.
The test suite is failing with:

Traceback (most recent call last):
  File "test_suite.py", line 632, in <module>
    test_quick()
  File "test_suite.py", line 623, in test_quick
    rerun_test()
  File "test_suite.py", line 595, in rerun_test
    QuickTests().run()
  File "test_suite.py", line 404, in run
    self.setUp()
  File "test_suite.py", line 394, in setUp
    os.mkdir(self.working_wd_loc)
OSError: [Errno 2] No such file or directory: '~/Software/drep/tests/../tests/test_backend/ecoli_wd'

And I can see that the subdirectory "test_backend" is indeed not in there, and not in the GitHub repo.

However, dRep is also failing while running the test dataset on its own:

/home/Software/drep/tests $ dRep compare_wf outputdir/ -g ./genomes/*fasta
Step 1. Cluster
Traceback (most recent call last):
  File "/home/Software/BioTools/Anaconda/bin/dRep", line 26, in <module>
    controller.parseArguments(args)
  File "/home/Software/BioTools/Anaconda/lib/python2.7/site-packages/drep/controller.py", line 146, in parseArguments
    self.compare_wf_operation(**vars(args))
  File "/home/Software/BioTools/Anaconda/lib/python2.7/site-packages/drep/controller.py", line 91, in compare_wf_operation
    drep.d_workflows.compare_wrapper(kwargs['work_directory'],**kwargs)
  File "/home/Software/BioTools/Anaconda/lib/python2.7/site-packages/drep/d_workflows.py", line 92, in compare_wrapper
    drep.d_cluster.d_cluster_wrapper(wd, **kwargs)
  File "/home/Software/BioTools/Anaconda/lib/python2.7/site-packages/drep/d_cluster.py", line 282, in d_cluster_wrapper
    Bdb, data_folder, kwargs = parse_arguments(workDirectory, **kwargs)
  File "/home/Software/BioTools/Anaconda/lib/python2.7/site-packages/drep/d_cluster.py", line 311, in parse_arguments
    loc = shutil.which('mash')
AttributeError: 'module' object has no attribute 'which'

I get this exact same error when running with my own data too, which leads me to think this is related to an overall installation problem. I also reinstalled it through pip and got the same result.
I am running on Ubuntu 16.04, and set pyenv to run globally with Python 2.7.9 and 3.5.1...

Any idea what might be going wrong?
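
The site-packages path in the traceback (python2.7) is the tell: shutil.which() only exists on Python 3.3+, so this dRep is being executed by Anaconda's Python 2. Quick check:

import shutil
import sys

print(sys.version_info)
print(hasattr(shutil, "which"))   # False on Python 2 -> this AttributeError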

TypeError: sequence item 3: expected str instance, NoneType found

Hi,

I'm running dereplication with dRep version 2.0.

dRep dereplicate -p 28 dereplicate_dir -g genomes_dir/*.fasta --run_tax --genomeInfo genomeinfo.csv

Everything was running fine, but it threw an error in the bonus step:


..:: dRep dereplicate Step 1. Filter ::..

Will filter the genome list
Calculating genome info of genomes
99.08% of genomes passed length filtering
42.52% of genomes passed checkM filtering


..:: dRep dereplicate Step 2. Cluster ::..

Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
100 primary clusters made
Step 3. Perform secondary clustering
Running 11390 ANImf comparisons- should take ~ 203.4 min
Step 4. Return output


..:: dRep dereplicate Step 3. Choose ::..

Loading work directory


..:: dRep dereplicate Step 4. Bonus ::..

Loading work directory
Running tax
Running Centrifuge
Traceback (most recent call last):
  File "/services/tools/anaconda3/4.0.0/bin/dRep", line 26, in <module>
    controller.parseArguments(args)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/controller.py", line 144, in parseArguments
    self.dereplicate_operation(**vars(args))
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/controller.py", line 86, in dereplicate_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/d_workflows.py", line 52, in dereplicate_wrapper
    drep.d_bonus.d_bonus_wrapper(wd, **kwargs)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/d_bonus.py", line 26, in d_bonus_wrapper
    run_taxonomy(wd,**kwargs)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/d_bonus.py", line 36, in run_taxonomy
    run_centrifuge(Bdb, prod_dir, cent_dir, wd=wd, **kwargs)
  File "/services/tools/anaconda3/4.0.0/lib/python3.5/site-packages/drep/d_bonus.py", line 138, in run_centrifuge
    logging.debug(' '.join(cmd))
TypeError: sequence item 3: expected str instance, NoneType found

Any idea what is producing the problem?

Thanks in advance.
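
The failing line is only logging the centrifuge command, and str.join refuses a None element, which points to one command argument (plausibly an unset --cent_index value, given that this run passed --run_tax without one) being None. Minimal repro:

cmd = ["centrifuge", "-f", "-x", None]   # None standing in for a missing value
" ".join(cmd)
# TypeError: sequence item 3: expected str instance, NoneType found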

Python error at the dereplicate step

Hi. First of all, thanks for developing such a useful program! I'm trying to run the dereplicate step, but the checkM step is failing:

$ dRep dereplicate outdR/ -g Final/*.fasta


..:: dRep dereplicate Step 1. Filter ::..

Will filter the genome list

Calculating genome info of genomes
100.00% of genomes passed length filtering
Running prodigal
Past prodigal runs found- will not re-run
Running checkM
!!! checkM failed !!!
If using pyenv, make sure both python2 and python3 are available (for example: pyenv global 3.5.1 2.7.9)

However, I have already set the pyenv global parameter:
$ pyenv global
3.5.1
2.7.9

Everything else looks fine:
$ dRep bonus testDir --check_dependencies
Loading work directory
Checking dependencies
mash.................................... all good (location = /usr/local/bin/mash)
nucmer.................................. all good (location = /usr/bin/nucmer)
checkm.................................. all good (location = /usr/local/bin/checkm)
ANIcalculator........................... all good (location = /usr/bin/ANIcalculator_v1/ANIcalculator)
prodigal................................ all good (location = /home/linuxbrew/.linuxbrew/bin/prodigal)
centrifuge.............................. all good (location = /usr/local/bin/centrifuge)

Any idea what might be causing this problem? I don't have much experience in bioinformatics, so this is giving me a lot of headaches.

AttributeError: 'module' object has no attribute 'which'

Hello,

I have tried installing dRep by first installing python3 from Homebrew, then downloading pip, then downloading MUMmer, mash, and scikit-learn (it wasn't this straightforward, but I ultimately got to this point after troubleshooting things I seemed to need).

First I was getting this error at the end of trying to install:
error: Installed distribution numpy 1.8.0rc1 conflicts with requirement numpy>=1.9.0

But I believe I overcame this by, instead of just using 'python setup.py install', pointing to the Python in my bin that had numpy:
bash-3.2# /usr/bin/python setup.py install
and it seemed to install successfully. However, I also tried installing numpy for python3, since it seemed to be missing, with:
karissas-mbp:~ ikf$ brew install numpy --with-python3 --without-python

NOW, however, when I try
karissas-mbp:dRep ikf$ dRep bonus testDir --check_dependencies
or
karissas-mbp:tests ikf$ python test_suite.py
I get the same error:

Traceback (most recent call last):
  File "test_suite.py", line 852, in <module>
    cluster_test()
  File "test_suite.py", line 788, in cluster_test
    verifyCluster.run()
  File "test_suite.py", line 421, in run
    self.functional_test_3()
  File "test_suite.py", line 502, in functional_test_3
    controller.parseArguments(args)
  File "/Library/Python/2.7/site-packages/drep-1.4.3-py2.7.egg/drep/controller.py", line 151, in parseArguments
    self.cluster_operation(**vars(args))
  File "/Library/Python/2.7/site-packages/drep-1.4.3-py2.7.egg/drep/controller.py", line 56, in cluster_operation
    drep.d_cluster.d_cluster_wrapper(kwargs['work_directory'],**kwargs)
  File "/Library/Python/2.7/site-packages/drep-1.4.3-py2.7.egg/drep/d_cluster.py", line 53, in d_cluster_wrapper
    Bdb, data_folder, kwargs = parse_arguments(workDirectory, **kwargs)
  File "/Library/Python/2.7/site-packages/drep-1.4.3-py2.7.egg/drep/d_cluster.py", line 297, in parse_arguments
    loc = shutil.which('mash')
AttributeError: 'module' object has no attribute 'which'

I have a feeling it has something to do with my Pythons, but I cannot seem to figure out how to get around this!

Thank you for your time in helping me troubleshoot this.
Karissa

Docker Image?

Going to try out dRep, and I got stuck when I was reminded that ANIcalculator is Linux-only and I was installing dRep on a Mac. Do you by any chance have a Docker image with dRep and all its dependencies?
Thanks!

sort Cluster_scoring barplots in winner order

In Cluster_scoring.pdf the y-axis sorts genomes alphabetically. I think it would be more useful to sort by score. I realize I could parse data_tables/, but I think you should consider making this the default. I also just wanted to be the first person to put an issue on your issue tracker for dRep ;)
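
Until something like that is the default, the scores can be re-plotted in sorted order from the work directory's score table (a sketch; assumes the table is data_tables/Sdb.csv with 'genome' and 'score' columns, per dRep's table naming):

import pandas as pd
import matplotlib.pyplot as plt

Sdb = pd.read_csv("work_dir/data_tables/Sdb.csv")
Sdb = Sdb.sort_values("score")              # winners end up at the top
plt.barh(Sdb["genome"], Sdb["score"])
plt.tight_layout()
plt.savefig("Cluster_scoring_sorted.pdf")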

FileNotFoundError when performing secondary clustering

Hi,

I am running dRep v2.2.4 on ~5000 genomes and am having issues during the secondary clustering step. A log snippet can be found below:

    ..:: dRep dereplicate Step 1. Filter ::..
***************************************************
    
Will filter the genome list
Calculating genome info of genomes
100.00% of genomes passed length filtering
5372.26% of genomes passed checkM filtering
***************************************************
    ..:: dRep dereplicate Step 2. Cluster ::..
***************************************************
    
Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
3287 primary clusters made
Step 3. Perform secondary clustering
Running 353383 ANImf comparisons- should take ~ 5521.6 min
Traceback (most recent call last):
  File "/nfs/production/interpro/metagenomics/mags-scripts/miniconda3/bin/dRep", line 33, in <module>
    controller.parseArguments(args)
  File "/nfs/production/interpro/metagenomics/mags-scripts/miniconda3/lib/python3.7/site-packages/drep/controller.py", line 144, in parseArguments
    self.dereplicate_operation(**vars(args))
  File "/nfs/production/interpro/metagenomics/mags-scripts/miniconda3/lib/python3.7/site-packages/drep/controller.py", line 86, in dereplicate_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/nfs/production/interpro/metagenomics/mags-scripts/miniconda3/lib/python3.7/site-packages/drep/d_workflows.py", line 36, in dereplicate_wrapper
    drep.d_cluster.d_cluster_wrapper(wd, **kwargs)
  File "/nfs/production/interpro/metagenomics/mags-scripts/miniconda3/lib/python3.7/site-packages/drep/d_cluster.py", line 78, in d_cluster_wrapper
    data_folder, wd=workDirectory, **kwargs)
  File "/nfs/production/interpro/metagenomics/mags-scripts/miniconda3/lib/python3.7/site-packages/drep/d_cluster.py", line 200, in cluster_genomes
    ndb = compare_genomes(bdb, algorithm, data_folder, **kwargs)
  File "/nfs/production/interpro/metagenomics/mags-scripts/miniconda3/lib/python3.7/site-packages/drep/d_cluster.py", line 817, in compare_genomes
    df = run_pairwise_ANImf(genome_list, working_data_folder, **kwargs)
  File "/nfs/production/interpro/metagenomics/mags-scripts/miniconda3/lib/python3.7/site-packages/drep/d_cluster.py", line 964, in run_pairwise_ANImf
    df = process_deltafiles(deltafiles, org_lengths, **kwargs)
  File "/nfs/production/interpro/metagenomics/mags-scripts/miniconda3/lib/python3.7/site-packages/drep/d_cluster.py", line 707, in process_deltafiles
    tot_length, tot_sim_error = parse_delta(deltafile)
  File "/nfs/production/interpro/metagenomics/mags-scripts/miniconda3/lib/python3.7/site-packages/drep/d_cluster.py", line 645, in parse_delta
    for line in [l.strip().split() for l in open(filename, 'r').readlines()]:
FileNotFoundError: [Errno 2] No such file or directory: '/hps/nobackup2/production/metagenomics/clustering/iter_8/drep32/data/ANImf_files/ERS608499_12.fa/ERS608499_12.fa_vs_SRS475589_82.fa.delta.filtered'

This is not the first time I've encountered this error. It seems to occur when I launch de-replication of large numbers of genomes (5,000-6,000 seems to be the limit). I am usually able to work around it by splitting the genomes into batches and de-replicating them separately first. However, for this particular set I'd like to do all of them in one go. It doesn't seem to be a RAM issue, as I've tried allocating much more memory (>100 GB) than the program uses. Could it just be a filesystem issue when too many ANImf comparisons are being made?

Many thanks in advance.

Best,
Alex
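
One way to narrow this down after a crash is to list which delta-filter outputs are missing; each missing file should have a matching nucmer/delta-filter entry in the work directory's log folder showing what failed (sketch, using the layout from the traceback):

import glob
import os

animf_dir = "drep32/data/ANImf_files"       # layout as in the traceback
deltas = glob.glob(os.path.join(animf_dir, "*", "*.delta"))
missing = [d + ".filtered" for d in deltas
           if not os.path.isfile(d + ".filtered")]
print(len(missing), "comparisons are missing filtered delta files")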

ValueError: 'label' must be of length 'x'

Hi! I'm running version 2.0, and at the end of the analysis I get the error below. I'm not sure what is going on, but please let me know if you need more info about the run.
Thanks!
Coto


..:: dRep dereplicate Step 6. Analyze ::..

making plots 1, 2, 3, 4, 5, 6
Plotting primary dendrogram
Plotting secondary dendrograms
Plotting MDS plot
Plotting scatterplots
Plotting bin scorring plot
Plotting winning genomes plot...
Traceback (most recent call last):
  File "/usr/local/bin/drep", line 26, in <module>
    controller.parseArguments(args)
  File "/usr/local/lib/python3.6/site-packages/drep/controller.py", line 144, in parseArguments
    self.dereplicate_operation(**vars(args))
  File "/usr/local/lib/python3.6/site-packages/drep/controller.py", line 86, in dereplicate_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/usr/local/lib/python3.6/site-packages/drep/d_workflows.py", line 68, in dereplicate_wrapper
    drep.d_analyze.d_analyze_wrapper(wd, plots = 'a', **kwargs)
  File "/usr/local/lib/python3.6/site-packages/drep/d_analyze.py", line 78, in d_analyze_wrapper
    plot_winners_from_wd(wd, plot_dir, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/drep/d_analyze.py", line 345, in plot_winners_from_wd
    plot_winners(Wdb, Gdb, Wndb, Wmdb, Widb, plot_dir = plot_dir, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/drep/d_analyze.py", line 734, in plot_winners
    _make_piechart(labels,sizes)
  File "/usr/local/lib/python3.6/site-packages/drep/d_analyze.py", line 878, in _make_piechart
    shadow = True)
  File "/usr/local/lib/python3.6/site-packages/matplotlib/pyplot.py", line 3219, in pie
    frame=frame, rotatelabels=rotatelabels, data=data)
  File "/usr/local/lib/python3.6/site-packages/matplotlib/__init__.py", line 1710, in inner
    return func(ax, *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/matplotlib/axes/_axes.py", line 2635, in pie
    raise ValueError("'label' must be of length 'x'")
ValueError: 'label' must be of length 'x'
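
The matplotlib side of this is easy to reproduce: pie() insists that labels and sizes have the same length, so one of the winner categories dRep tallies presumably came out empty. Minimal repro:

import matplotlib.pyplot as plt

plt.pie([10, 20], labels=["kept"])   # two wedges, one label
# ValueError: 'label' must be of length 'x'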

Secondary clustering failing

Hi!

I am trying to cluster a large number of population-level genomes (5,449 genomes after dRep's length filtering).

It seemed to run just fine until the secondary clustering step, where I got the error below:

Step 1. Parse Arguments
Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
109 primary clusters made
Step 3. Perform secondary clustering
Running 143000 ANIn comparisons- should take ~ 1179.8 min
Traceback (most recent call last):
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/env3.4.3/bin/dRep", line 26, in <module>
    controller.parseArguments(args)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-p
ackages/drep/controller.py", line 136, in parseArguments
    self.dereplicate_wf_operation(**vars(args))
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/controller.py", line 80, in dereplicate_wf_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/d_workflows.py", line 36, in dereplicate_wrapper
    drep.d_cluster.d_cluster_wrapper(wd, **kwargs)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/d_cluster.py", line 287, in d_cluster_wrapper
    Cdb, Mdb, Ndb = cluster_genomes(Bdb, data_folder, **kwargs)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/d_cluster.py", line 119, in cluster_genomes
    **kwargs)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/d_cluster.py", line 153, in run_secondary_clustering
    ndb = compare_genomes(bdb, algorithm, data_folder, **kwargs)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/d_cluster.py", line 866, in compare_genomes
    df = run_pairwise_ANIn(genome_list, working_data_folder, **kwargs)
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/d_cluster.py", line 913, in run_pairwise_ANIn
    for x,y in zip(genomes,genomes)}
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/d_cluster.py", line 913, in <dictcomp>
    for x,y in zip(genomes,genomes)}
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/drep/__init__.py", line 64, in fasta_length
    for seq_record in SeqIO.parse(fasta, "fasta"):
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/Bio/SeqIO/__init__.py", line 600, in parse
    for r in i:
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/Bio/SeqIO/FastaIO.py", line 122, in FastaIterator
    for title, sequence in SimpleFastaParser(handle):
  File "/mnt/nfs/projects/ecosystem_biology/local_tools/pyenv/versions/3.4.3/envs/env3.4.3/lib/python3.4/site-packages/Bio/SeqIO/FastaIO.py", line 62, in SimpleFastaParser
    line = handle.readline()
BrokenPipeError: [Errno 108] Cannot send after transport endpoint shutdown

Note that dRep worked fine for us when applied to smaller data sets. Any idea why this happened? I'm looking forward to your reply.

Best regards,
Shaman

genomeInfo.csv and strain_heterogeneity

Hi there,

I appreciate the fact that you provided a way to skip the checkM step by providing a csv file with external quality information. dRep documentation for the dereplicate workflow states:

  --genomeInfo GENOMEINFO
                        location of .csv file containing quality information
                        on the genomes. Must contain: ["genome"(basename of
                        .fasta file of that genome), "completeness"(0-100
                        value for completeness of the genome),
                        "contamination"(0-100 value of the contamination of
                        the genome)] (default: None)

But in other parts of the workflow, it seems dRep also uses strain heterogeneity to compute a score and select "winner" genomes. I therefore do not understand why only completeness and contamination are requested in the csv file.

I have a list of genomes on which I've already run checkM: is it possible for me to provide completeness, contamination and strain heterogeneity using --genomeInfo? Or will dRep ignore the strain heterogeneity criteria?
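
For reference, the minimal file implied by the help text above would contain just the first three columns below; the strain_heterogeneity column is only my guess at what an extended file might look like if dRep accepts it, not something the help text promises:

    genome,completeness,contamination,strain_heterogeneity
    bin_1.fasta,98.2,1.4,0.0
    bin_2.fasta,76.5,3.1,12.5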

Cheers,
Nils

Changing secondary ANI gives the same results

Hi Matt

I de-replicated genomes with the command: dRep dereplicate out_dir -g genome/*.fasta
Then I wanted to change the ANI threshold with the following command:
dRep dereplicate out_dir -sa 0.95 -g genome/*.fasta
but I got the same results with the two commands above.
Then I tried to use the single operations step by step:
1. dRep filter out_dir -g genome/*.fasta
2. dRep cluster out_dir -sa 0.95
3. dRep choose out_dir
4. dRep bonus out_dir
5. dRep evaluate out_dir -e a
6. dRep analyze out_dir -pl a
There were no errors in the process, but the results were the same across all three runs.

Did I use the wrong command?
Could you help me?
Thank you very much!

Best
Alice

pplacer OOM

Hey Matt,

I'm consistently getting an out of memory error (OOM) in my new environment, AWS EC2 m4.4xlarge (16 CPU, 64 Gb RAM). One option is to rent another machine with higher specs for more money. Another option is to add a low memory flag to dRep.

I came upon the solution of calling "--reduced_tree" to reduce memory via checkM here. Also see checkM's options here.
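
For concreteness, the change would presumably amount to inserting the flag into the lineage_wf call that dRep already issues (command adapted from dRep logs elsewhere on this page; whether dRep exposes a way to pass it through is exactly the open question). Dropping --pplacer_threads to 1 may also help, since each pplacer thread appears to hold its own copy of the reference tree:

    checkm lineage_wf --reduced_tree -t 6 --pplacer_threads 1 -g -x faa <prodigal_dir> <checkM_outdir> -f <checkM_outdir>/results.tsv --tab_table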

Another thought is to change pplacer parameters, see "Memory Usage" section here, but I don't have much experience with this.

I'm happy to send along dRep logs - just let me know what's relevant for you.

As a start here's a system log clip:

Apr 18 02:16:29 ip-172-31-20-20 kernel: [92767.295137] Out of memory: Kill process 14057 (pplacer) score 541 or sacrifice child
Apr 18 02:16:29 ip-172-31-20-20 kernel: [92767.299315] Killed process 14057 (pplacer) total-vm:35679400kB, anon-rss:35673656kB, file-rss:0kB

And attached is the command log, captured with your favorite &> mattsfavoritewaytocapture.logs &.

The end of the log is:

$ tail cmd_drep_gANI_gt99.log
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ubuntu/miniconda3/envs/irep/lib/python3.6/site-packages/pandas/io/parsers.py", line 730, in __init__
    self._make_engine(self.engine)
  File "/home/ubuntu/miniconda3/envs/irep/lib/python3.6/site-packages/pandas/io/parsers.py", line 923, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/ubuntu/miniconda3/envs/irep/lib/python3.6/site-packages/pandas/io/parsers.py", line 1390, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 373, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4184)
  File "pandas/parser.pyx", line 667, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:8449)
FileNotFoundError: File b'/home/ubuntu/db/ncbi_Blautia_20170418/dRep/gANI_gt99/data/centrifuge/GCF_000153905.1_ASM15390v1_genomic.fna_report.tsv' does not exist

And the directory indeed doesn't have this file; I think the OOM kill terminated the process before it was written:

$ ls /home/ubuntu/db/ncbi_Blautia_20170418/dRep/gANI_gt99/data/centrifuge/
GCF_000153905.1_ASM15390v1_genomic.fna_hits.tsv    GCF_001404755.1_13470_2_82_genomic.fna_hits.tsv
GCF_000156675.1_ASM15667v1_genomic.fna_hits.tsv    GCF_001404775.1_14207_7_80_genomic.fna_hits.tsv
GCF_000157975.1_ASM15797v1_genomic.fna_hits.tsv    GCF_001404775.1_14207_7_80_genomic.fna_report.tsv
GCF_000373885.1_ASM37388v1_genomic.fna_hits.tsv    GCF_001404935.1_13414_6_41_genomic.fna_hits.tsv
GCF_000424085.1_ASM42408v1_genomic.fna_hits.tsv    GCF_001405215.1_14207_7_44_genomic.fna_hits.tsv
GCF_000424085.1_ASM42408v1_genomic.fna_report.tsv  GCF_001405215.1_14207_7_44_genomic.fna_report.tsv
GCF_000439125.1_ASM43912v1_genomic.fna_hits.tsv    GCF_001405455.1_13470_2_80_genomic.fna_hits.tsv
GCF_000466565.1_ASM46656v1_genomic.fna_hits.tsv    GCF_001405455.1_13470_2_80_genomic.fna_report.tsv
GCF_000484655.1_ASM48465v1_genomic.fna_hits.tsv    GCF_001487165.1_Blautia_massiliensis1_genomic.fna_hits.tsv
GCF_000484655.1_ASM48465v1_genomic.fna_report.tsv  GCF_001487165.1_Blautia_massiliensis1_genomic.fna_report.tsv
GCF_000702025.1_ASM70202v1_genomic.fna_hits.tsv    GCF_900078295.1_PRJEB13136_genomic.fna_hits.tsv
GCF_000765245.1_ASM76524v1_genomic.fna_hits.tsv    GCF_900078295.1_PRJEB13136_genomic.fna_report.tsv
GCF_001404455.1_13414_6_22_genomic.fna_hits.tsv    GCF_900120195.1_PRJEB18016_genomic.fna_hits.tsv
GCF_001404535.1_13414_6_21_genomic.fna_hits.tsv    GCF_900120195.1_PRJEB18016_genomic.fna_report.tsv
GCF_001404535.1_13414_6_21_genomic.fna_report.tsv  GCF_900120295.1_PRJEB18018_genomic.fna_hits.tsv
GCF_001404735.1_14207_7_34_genomic.fna_hits.tsv    GCF_900120295.1_PRJEB18018_genomic.fna_report.tsv

cmd_drep_gANI_gt99.log.txt

ANImf comparisons running time

Hi
I have 24 thousand bins for dRep. After completeness (50) and contamination (10) filtering, the log file gives an estimated run time of 28184 min, about 19 days. Is there anything I can do to accelerate this process?
The parameters I ran dRep with are "-p 32 -comp 50 -con 10".
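
For a rough sense of scale: the slow ANI comparisons only run within each primary (MASH) cluster, and pairwise work grows quadratically with cluster size, so a few very large clusters dominate the estimate. A back-of-the-envelope sketch (cluster sizes hypothetical):

    # pairwise comparisons within a cluster of n genomes: n*(n-1)/2
    cluster_sizes = [1200, 400, 50, 10]
    comparisons = sum(n * (n - 1) // 2 for n in cluster_sizes)
    print(comparisons)  # 800470 for these hypothetical sizes

Raising the primary clustering threshold (-pa) splits large clusters into smaller ones, so besides adding threads it may be the biggest lever here.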

thanks!

dRep fails with symlinks pointing to the same file

Hi, dRep produces errors when I try to lower the thresholds. I basically want to cluster/dereplicate all bins, regardless of size and completeness. So I ran dRep dereplicate -pa 0.8 -sa 0.98 -comp 0 -con 50 -l 20000, which gave the following error in v2.0.5:

2b. Cluster pair-wise MASH clustering
Traceback (most recent call last):
  File "/home/johdro/.conda/envs/drep/bin/dRep", line 26, in <module>
    controller.parseArguments(args)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/controller.py", line 144, in parseArguments
    self.dereplicate_operation(**vars(args))
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/controller.py", line 86, in dereplicate_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/d_workflows.py", line 36, in dereplicate_wrapper
    drep.d_cluster.d_cluster_wrapper(wd, **kwargs)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/d_cluster.py", line 75, in d_cluster_wrapper
    data_folder, wd=workDirectory, **kwargs)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/d_cluster.py", line 145, in cluster_genomes
    Cdb, cluster_ret = cluster_mash_database(Mdb, **kwargs)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/d_cluster.py", line 570, in cluster_mash_database
    linkage_db = db.pivot("genome1","genome2","dist")
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/frame.py", line 4382, in pivot
    return pivot(self, index=index, columns=columns, values=values)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 389, in pivot
    return indexed.unstack(columns)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/series.py", line 2224, in unstack
    return unstack(self, level, fill_value)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 474, in unstack
    fill_value=fill_value)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 116, in __init__
    self._make_selectors()
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 154, in _make_selectors
    raise ValueError('Index contains duplicate entries, '
ValueError: Index contains duplicate entries, cannot reshape
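
The ValueError at the bottom is pandas refusing to pivot a table in which the same (genome1, genome2) pair appears twice, which is what you would get if two symlinked paths resolve to the same genome and MASH therefore reports that pair twice. A minimal sketch of the failure mode (values hypothetical):

    import pandas as pd

    # the same pair listed twice, as when two input paths point at one file
    db = pd.DataFrame({"genome1": ["a.fa", "a.fa"],
                       "genome2": ["b.fa", "b.fa"],
                       "dist":    [0.01, 0.01]})

    # ValueError: Index contains duplicate entries, cannot reshape
    db.pivot(index="genome1", columns="genome2", values="dist")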

It would be great if dRep could also do ANI clustering and representative picking for the smaller bins that usually get filtered out; that is the real challenge in metagenome data.

Best,
Johannes

Hard-code colors in the pie charts

Hi, I'd recommend an improvement: the categories in the pie charts (winning genomes) are fixed, so it would make sense to pick a nice palette and stick to it. Right now, the colors swap between different runs and are thus more difficult to compare.

Best,
Johannes

wrong data in Chdb.csv

Hi,

I noticed that the data in the Chdb.csv file produced when I ran dereplicate_wf looks wrong.
My command line was
dRep dereplicate_wf dereplicationDG074078UASBVU03 -g genomes/*.fasta &

I first noticed that the # contig numbers are too high.
For instance, for the winning bin DG074_161215megahit.metabat.bin.2, #contigs should be 110, not 6758. Longest contig and N50 also look wrong.
Genome size, GC, quality and contamination are correct.
I ran CheckM on my winning bins and got the correct numbers for number of contigs (also attached).
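
For anyone wanting to cross-check these numbers independently of both tools, here is a minimal sketch with Biopython (file name taken from above; adjust the path):

    from Bio import SeqIO

    lengths = sorted((len(rec.seq) for rec in SeqIO.parse(
        "DG074_161215megahit.metabat.bin.2.fasta", "fasta")), reverse=True)

    n_contigs = len(lengths)
    longest = lengths[0]

    # N50: the contig length at which the running total first covers
    # half of the summed genome length
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            n50 = length
            break

    print(n_contigs, longest, n50)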
Thanks for any help with this!

Camilla

bin_stats_ext.txt
Chdb.txt

similarity rate in filter

I use dRep to select representative bins from seven metagenomes. I want to know whether there is a parameter to set the similarity threshold used when filtering bins.
I used the command dRep dereplicate -p 28 -comp 50 -con 10 -l 10000 output_dir/ -g /path/to/genomes/*.fasta to dereplicate 540 bins, and 450 bins were left. How can I find the similarity between a selected bin and the bins it replaced?
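
In the meantime, the work directory keeps the raw comparison tables, so the MASH-based similarity for any pair of genomes can be looked up directly. A sketch, assuming the usual data_tables/Mdb.csv layout with the genome1/genome2/dist columns seen in the tracebacks on this page (file names hypothetical):

    import pandas as pd

    mdb = pd.read_csv("output_dir/data_tables/Mdb.csv")
    pair = mdb[(mdb["genome1"] == "kept_bin.fasta") &
               (mdb["genome2"] == "removed_bin.fasta")]

    # MASH distance is roughly 1 - ANI
    print(1 - pair["dist"].iloc[0])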

Thanks in advance.

How could I get strain_heterogeneity?

When I run the choose module of dRep, following the Advanced Use guide, I get this error:
Loading work directory
Traceback (most recent call last):
File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/bin/dRep", line 33, in
controller.parseArguments(args)
File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/drep/controller.py", li
self.choose_operation(**vars(args))
File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/drep/controller.py", li
drep.d_choose.d_choose_wrapper(kwargs['work_directory'],**kwargs)
File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/drep/d_choose.py", line
Gdb = drep.d_filter._get_run_genomeInfo(wd, bdb, **kwargs)
File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/drep/d_filter.py", line
Idb = chdb_to_genomeInfo(Chdb)
File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/drep/d_filter.py", line
Gdb = Gdb[['genome', 'completeness', 'contamination', 'strain_heterogeneity']]
File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/pandas/core/frame.py",
return self._getitem_array(key)
File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/pandas/core/frame.py",
indexer = self.loc._convert_to_indexer(key, axis=1)
File "/stor9000/apps/users/NWSUAF/2013130172/software/miniconda_dRep/lib/python3.6/site-packages/pandas/core/indexing.py
.format(mask=objarr[mask]))
KeyError: "['strain_heterogeneity'] not in index"
I have no idea how I could find the strain_heterogeneity information.
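
A quick way to check whether the checkM table in your work directory even carries the column dRep is asking for (path assumed from the standard layout shown in other logs on this page):

    import pandas as pd

    chdb = pd.read_csv("out_dir/data/checkM/checkM_outdir/Chdb.tsv", sep="\t")
    print(chdb.columns.tolist())  # look for a 'Strain heterogeneity' column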

What is the format of checkM output needed?

I have problems running checkM due to python2/python3 issues. I would like to provide the output file of checkM as input in order to skip the use of python2. I tried the following command:
dRep dereplicate working_dir -g directory_with_genomes/*.fa --genomeInfo checkMfile

As the checkM output I used "bin_stats_ext.tsv", but again dRep terminated with an error, and I cannot figure out the required format of the input file.
Could you please provide an example?
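
As far as I can tell, "bin_stats_ext.tsv" is an internal checkM file rather than the qa summary table, which may be why dRep chokes on it. A hedged sketch that converts a checkm qa --tab_table output (with 'Bin Id', 'Completeness' and 'Contamination' columns) into the three-column csv described by the --genomeInfo help text quoted earlier on this page:

    import pandas as pd

    qa = pd.read_csv("checkm_qa_output.tsv", sep="\t")  # hypothetical file name
    info = pd.DataFrame({
        # genomeInfo wants the basename of the .fasta file; checkM's Bin Id
        # usually has the extension stripped, so add it back
        "genome": qa["Bin Id"] + ".fa",
        "completeness": qa["Completeness"],
        "contamination": qa["Contamination"],
    })
    info.to_csv("genomeInfo.csv", index=False)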
Thanks

Stefano

Primary clustering figure labels are out of order when many genomes are compared

Hi, I am using dRep (v2.2.2, installed using pip install drep) to cluster and then analyze a few hundred genomes at a time, and noticed that the labels on plot 1 (Primary_clustering_dendrogram.pdf) were all out of order. The steps I took were:

  1. Download a few hundred genomes. For example, all Prevotella. https://www.ncbi.nlm.nih.gov/assembly/?term=prevotella (Click "Download Assemblies", source = GenBank, file type = Genomic FASTA)
  2. tar -xf the archive and gunzip the files within the folder
  3. Run dRep cluster out_dir -p 14 --SkipSecondary -g *.fna
  4. Run dRep analyze out_dir --plots 1

The files Mdb.csv and Cdb.csv appear to be correct, however. Doing steps 3-4 with just a handful (~20) genomes doesn't seem to cause an error with misplaced labels. I'm not sure where the upper bound is where this becomes a problem.

Here's a snippet of a tree with labels that are out of order:
[screenshot: dendrogram with out-of-order labels]
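
For debugging, the leaf order scipy itself would use can be recovered without plotting and compared against the tick labels in the PDF. A sketch, assuming average linkage over the MASH distances in Mdb.csv (dRep's actual linkage settings may differ):

    import pandas as pd
    from scipy.cluster.hierarchy import linkage, dendrogram
    from scipy.spatial.distance import squareform

    mdb = pd.read_csv("out_dir/data_tables/Mdb.csv")
    dist = mdb.pivot(index="genome1", columns="genome2", values="dist")
    names = dist.index.tolist()

    Z = linkage(squareform(dist.values, checks=False), method="average")
    leaves = dendrogram(Z, no_plot=True)["ivl"]  # leaf indices as strings
    print([names[int(i)] for i in leaves])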

'str' object has no attribute 'hasDb'

Hi, I'm trying to use dRep but it does not work; the error message is "AttributeError: 'str' object has no attribute 'hasDb'", raised at line 54 of d_filter.py.
When I open d_filter.py, "wd.hasDb" seems to cause the error, but "wd" is not defined as a class instance beforehand.
So I changed the line "wd.hasDb('Bdb')" to "workDirectory.hasDb('Bdb')" and ran the script again.
But then there is another error in def calc_genome_info.
Please check my error message.
Thanks
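
From the wd.hasDb call, d_filter presumably expects a WorkDirectory object rather than a plain path string, so passing a path would produce exactly this AttributeError. A guess at the intended usage (the class location is assumed from the error message, not verified):

    # hypothetical: assumes drep ships a WorkDirectory class wrapping the
    # work directory, as the wd.hasDb('Bdb') call suggests
    from drep.WorkDirectory import WorkDirectory

    wd = WorkDirectory("path/to/work_directory")
    print(wd.hasDb("Bdb"))  # should return a boolean instead of raising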

checkM table input

When I try to use a checkM-generated table using the --Chdb option I get the following error:

Will filter the genome list
100.00% of genomes passed length filtering
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2522, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Bin Id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/bin/dRep", line 26, in
controller.parseArguments(args)
File "/usr/local/lib/python3.6/site-packages/drep/controller.py", line 144, in parseArguments
self.dereplicate_wf_operation(**vars(args))
File "/usr/local/lib/python3.6/site-packages/drep/controller.py", line 86, in dereplicate_wf_operation
drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
File "/usr/local/lib/python3.6/site-packages/drep/d_workflows.py", line 28, in dereplicate_wrapper
drep.d_filter.d_filter_wrapper(wd, genomes = genomes, Chdb = Chdb, **kwargs)
File "/usr/local/lib/python3.6/site-packages/drep/d_filter.py", line 65, in d_filter_wrapper
validate_chdb(Chdb, bdb)
File "/usr/local/lib/python3.6/site-packages/drep/d_filter.py", line 127, in validate_chdb
if genome not in Chdb['Bin Id'].tolist():
File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 2139, in getitem
return self._getitem_column(key)
File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 2146, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python3.6/site-packages/pandas/core/generic.py", line 1842, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python3.6/site-packages/pandas/core/internals.py", line 3838, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2524, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Bin Id'
If I compare the checkM file generated during a dRep run to the one I'm using with the --Chdb option, they look the same (i.e., same number of columns, etc.). Even if I use the .tsv file generated by dRep, I get the same error.
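
One common cause of this exact KeyError is a separator mismatch: if a tab-separated checkM table is parsed as comma-separated (or vice versa), the whole header lands in one column and 'Bin Id' disappears. A quick hedged check:

    import pandas as pd

    chdb = pd.read_csv("my_checkm_table.tsv", sep="\t")  # hypothetical file name
    print(chdb.columns.tolist())
    assert "Bin Id" in chdb.columns, "wrong separator or wrong checkM output file?"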

Bin made drep crash

Hi,

I had issues with dRep not doing the third CheckM run.
Our systems-person looked at it for me and found that it could be due to one 'problematic' bin (see below; attached as DGG1DGG0PEPACbiospadesmb.bin.40.fa.txt). I looked at DGG1DGG0PEPACbiospadesmb.bin.40.fa and it's just one large contig, with N's in the middle.
I removed DGG1DGG0PEPACbiospadesmb.bin.40.fa and dRep completed.
Do you know what the problem with the bin could be? I don't think it's the N's, since many other bins also have them.

Thanks
Camilla

This is what Dean reported to me:
I am not sure how checkM works, but dRep is using it fine. There is a log file in your “FeiBinDrep” directory that shows two checkM calls that worked fine, and then the third one fails:

07-03 13:46 DEBUG Starting the dereplicate operation
07-03 13:46 INFO ***************************************************
..:: dRep dereplicate Step 1. Filter ::..


07-03 13:46 DEBUG Loading work directory in filter
07-03 13:46 DEBUG Located: /nfs/groups/edwards/camilla/Fei/allDGGBins/Bins/FeiBinDrep
Datatables: []
Cluster files: []
Arguments: []
07-03 13:46 DEBUG Validating filter arguments
07-03 13:46 INFO Will filter the genome list
07-03 13:46 INFO Calculating genome info of genomes
07-03 13:48 DEBUG Filtering genomes by size
07-03 13:48 INFO 100.00% of genomes passed length filtering
07-03 13:48 DEBUG Running CheckM
07-03 13:48 INFO Running prodigal
07-03 15:39 INFO Running checkM
07-03 15:39 DEBUG Running CheckM with command: ['/usr/local/bin/checkm', 'lineage_wf', '/nfs/groups/edwards/camilla/Fei/allDGGBins/Bins/FeiBinDrep/data/prodigal/', '/nfs/groups/edwards/camilla/Fei/allDGGBins/Bins/FeiBinDrep/data/checkM/checkM_outdir/', '-f', '/nfs/groups/edwards/camilla/Fei/allDGGBins/Bins/FeiBinDrep/data/checkM/checkM_outdir//results.tsv', '--tab_table', '-t', '6', '--pplacer_threads', '6', '-g', '-x', 'faa']
07-03 15:47 DEBUG Running CheckM with command: ['/usr/local/bin/checkm', 'qa', '/nfs/groups/edwards/camilla/Fei/allDGGBins/Bins/FeiBinDrep/data/checkM/checkM_outdir/lineage.ms', '/nfs/groups/edwards/camilla/Fei/allDGGBins/Bins/FeiBinDrep/data/checkM/checkM_outdir/', '-f', '/nfs/groups/edwards/camilla/Fei/allDGGBins/Bins/FeiBinDrep/data/checkM/checkM_outdir/Chdb.tsv', '-t', '6', '--tab_table', '-o', '2']
07-03 15:47 ERROR !!! checkM failed !!!

There were no files created by the second command, so it failed. I ran the first command and it failed because it could not find "guppy". I found some guppy binaries on the system and placed one in the path.


[CheckM - tree] Placing bins in reference genome tree.


Identifying marker genes in 1028 bins with 6 threads:
Finished processing 1028 of 1028 (100.00%) bins.
Saving HMM info to file.

Calculating genome statistics for 1028 bins with 6 threads:
Finished processing 1028 of 1028 (100.00%) bins.

Extracting marker genes to align.
Parsing HMM hits to marker genes:
Finished parsing hits for 1028 of 1028 (100.00%) bins.
Extracting 43 HMMs with 6 threads:
Finished extracting 43 of 43 (100.00%) HMMs.
Aligning 43 marker genes with 6 threads:
Finished aligning 43 of 43 (100.00%) marker genes.

[Error] Make sure guppy, which is part of the pplacer package, is on your system path.

Controlled exit resulting from an unrecoverable error or warning.

Running the command:
/usr/local/bin/checkm lineage_wf /nfs/groups/edwards/camilla/Fei/allDGGBins/Bins/FeiBinDrep/data/prodigal/ /nfs/groups/edwards/camilla/Fei/allDGGBins/Bins/FeiBinDrep/data/checkM/checkM_outdir/ -f /nfs/groups/edwards/camilla/Fei/allDGGBins/Bins/FeiBinDrep/data/checkM/checkM_outdir//results.tsv --tab_table -t 6 --pplacer_threads 6 -g -x faa

gives an error with the file DGG1DGG0PEPACbiospadesmb.bin.40.fa. Are you able to see what that issue may be, or can that bin be removed to see if it works?


[CheckM - tree] Placing bins in reference genome tree.


Identifying marker genes in 1028 bins with 6 threads:
Finished processing 383 of 1028 (37.26%) bins.
Finished processing 387 of 1028 (37.65%) bins.
Finished processing 1028 of 1028 (100.00%) bins.
Saving HMM info to file.

Calculating genome statistics for 1028 bins with 6 threads:
Finished processing 1028 of 1028 (100.00%) bins.

Extracting marker genes to align.
Parsing HMM hits to marker genes:
Finished parsing hits for 1028 of 1028 (100.00%) bins.
Extracting 43 HMMs with 6 threads:
Finished extracting 43 of 43 (100.00%) HMMs.
Aligning 43 marker genes with 6 threads:
Finished aligning 43 of 43 (100.00%) marker genes.

Reading marker alignment files.
Concatenating alignments.
Placing 1028 bins into the genome tree with pplacer (be patient).

{ Current stage: 0:50:32.227 || Total: 0:50:32.227 }


[CheckM - lineage_set] Inferring lineage-specific marker sets.


Reading HMM info from file.
Parsing HMM hits to marker genes:
Finished parsing hits for 1028 of 1028 (100.00%) bins.

Determining marker sets for each genome bin.
Finished processing 245 of 1028 (23.83%) bins (current: DGG1DGG0PEPACbiospadesmb.bin.40.fa).
Unexpected error: <type 'exceptions.AttributeError'>
Traceback (most recent call last):
File "/usr/local/bin/checkm", line 712, in
checkmParser.parseOptions(args)
File "/usr/local/lib/python2.7/dist-packages/checkm/main.py", line 1245, in parseOptions
self.lineageSet(options)
File "/usr/local/lib/python2.7/dist-packages/checkm/main.py", line 230, in lineageSet
resultsParser, options.unique, options.multi)
File "/usr/local/lib/python2.7/dist-packages/checkm/treeParser.py", line 515, in getBinMarkerSets
domainNode = self.__findDomainNode(node)
File "/usr/local/lib/python2.7/dist-packages/checkm/treeParser.py", line 244, in __findDomainNode
if curNode.label:
AttributeError: 'NoneType' object has no attribute 'label'
nesbocam@silicon:/groups/edwards/camilla/Fei/allDGGBins/Bins$

A section of file 40 has:

ACAGATTAAACAAATCCGACGGGAATAAACATAACCCATAAACGAGTANNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNAAGTGGGTAGCGCTCAAATTGTAGGTGGAACA
Maybe the “N”s are the issue?
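
For anyone hitting similar crashes, a hedged little scanner that reports the longest run of N's per bin, to test whether that property really separates the problematic bin from the rest:

    import glob
    import re
    from Bio import SeqIO

    for path in glob.glob("Bins/*.fa"):  # hypothetical bin directory
        longest_n = 0
        for record in SeqIO.parse(path, "fasta"):
            runs = re.findall(r"[Nn]+", str(record.seq))
            if runs:
                longest_n = max(longest_n, max(len(r) for r in runs))
        print(path, longest_n)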

Suggestion

It would be great if a user could add information about the sample each genome came from, and the output would then show which genome is the best for each sample in each cluster. Something along the lines of:

cluster 1: Best genome, best genome from sample 1, best genome from sample 2 etc...
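
In the meantime this can be approximated outside dRep: join the cluster assignments in Cdb.csv against a user-made genome-to-sample map and some per-genome score, then take the top genome per (cluster, sample). A sketch with hypothetical file and column names for the map and score:

    import pandas as pd

    cdb = pd.read_csv("out_dir/data_tables/Cdb.csv")  # cluster assignments
    samples = pd.read_csv("genome_to_sample.csv")     # hypothetical: genome,sample
    scores = pd.read_csv("genome_scores.csv")         # hypothetical: genome,score

    merged = cdb.merge(samples, on="genome").merge(scores, on="genome")
    best = (merged.sort_values("score", ascending=False)
                  .groupby(["secondary_cluster", "sample"], as_index=False)
                  .first())  # highest-scoring genome per cluster and sample
    print(best[["secondary_cluster", "sample", "genome"]])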

Why is that? It seems to be a compatibility problem with my Python

zhouzhichao@ubuntu-32:/mnt/nfs_storage7/tools/drep/bin$ dRep -h
Traceback (most recent call last):
File "/usr/local/bin/dRep", line 19, in
import drep.argumentParser
File "/usr/local/lib/python2.7/dist-packages/drep/init.py", line 10, in
import drep.d_filter
File "/usr/local/lib/python2.7/dist-packages/drep/d_filter.py", line 211
def calc_genome_info(genomes: list):
^
SyntaxError: invalid syntax
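
For context: the failing line uses a Python 3 function annotation, which the Python 2.7 interpreter cannot parse, so this is an interpreter mismatch rather than a broken install. The same two lines reproduce it under any Python 2:

    # valid under Python 3, a SyntaxError under Python 2.7
    def calc_genome_info(genomes: list):
        pass

Installing and running dRep under Python 3 should make the error go away.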

Can you help me with the installation?

New checkM db needs to be made

Hi Matt,

dRep worked in one of my projects, but failed in another one with the error message "New checkM db needs to be made". Here is the detailed info:

02-04 08:21 DEBUG    Starting the dereplicate operation
02-04 08:21 INFO     ***************************************************
    ..:: dRep dereplicate Step 1. Filter ::..
***************************************************
    
02-04 08:21 DEBUG    Loading work directory in filter
02-04 08:21 DEBUG    Located: /mnt/redundans_dRep/derep
Datatables: []
Cluster files: []
Arguments: []
02-04 08:21 DEBUG    Validating filter arguments
02-04 08:21 INFO     Will filter the genome list
02-04 08:21 INFO     Calculating genome info of genomes
02-04 08:21 DEBUG    Filtering genomes by size
02-04 08:21 INFO     95.23% of genomes passed length filtering
02-04 08:21 DEBUG    Running CheckM
02-04 08:21 INFO     Running prodigal
02-04 08:30 INFO     Running checkM
02-04 08:30 DEBUG    Running CheckM with command: ['/usr/local/bin/checkm', 'lineage_wf', '/mnt/redundans_dRep/derep/data/prodigal/', '/mnt/redundans_dRep/derep/data/checkM/checkM_outdir/', '-f', '/mnt/redundans_dRep/derep/data/checkM/checkM_outdir//results.tsv', '--tab_table', '-t', '6', '--pplacer_threads', '6', '-g', '-x', 'faa']
02-04 10:43 DEBUG    Running CheckM with command: ['/usr/local/bin/checkm', 'qa', '/mnt/redundans_dRep/derep/data/checkM/checkM_outdir/lineage.ms', '/mnt/redundans_dRep/derep/data/checkM/checkM_outdir/', '-f', '/mnt/redundans_dRep/derep/data/checkM/checkM_outdir/Chdb.tsv', '-t', '6', '--tab_table', '-o', '2']
02-04 10:45 ERROR    BinSanity.213.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    concoct.140.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    concoct.227.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    maxbin2.095.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    maxbin2.610.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.2036.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.2158.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.2176.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.2303.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.2361.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.2410.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.2490.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.2613.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.2730.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.2808.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.282.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.3004.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.3009.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.3083.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.3284.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.3338.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.3407.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.347.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.3510.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.3543.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.3852.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.3893.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.4098.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.4279.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.4802.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.4924.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.5154.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.647.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.771.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    vamb.911.contigs.filtered.filtered.filtered.fa is not in checkM db
02-04 10:45 ERROR    New checkM db needs to be made

So is it a controlled exit, or have I missed anything? I'm using docker; you can pull the image via docker pull shengwei/drep:latest, and the source code of the docker image is here: https://github.com/housw/docker/tree/master/dRep
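
If it helps the debugging, the bins that checkM dropped can be listed by diffing the input genomes against the Bin Ids in the qa table. A hedged sketch (the Chdb.tsv path is from the log above; the input directory is hypothetical, and checkM usually strips the extension from Bin Id):

    import glob
    import os
    import pandas as pd

    chdb = pd.read_csv(
        "/mnt/redundans_dRep/derep/data/checkM/checkM_outdir/Chdb.tsv", sep="\t")
    have = set(chdb["Bin Id"])
    want = {os.path.splitext(os.path.basename(f))[0]
            for f in glob.glob("genomes/*.fa")}  # hypothetical input directory
    print(sorted(want - have))  # bins missing from the checkM table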

Thanks in advance.

Best,
Shengwei
