mdu-phl / bohra
A pipeline for bioinformatics analysis of bacterial genomes
License: GNU General Public License v3.0
Gracefully skip isolates where no MLST is found. Thanks to @willpitchers for finding this.
line 404
In the summary table
Add a setup command that sets up the databases and checks that all files and dependencies are installed.
It would be useful to report the minaln, but the responsibility should rest with users to think about what makes sense for them. Instead, state on GitHub / show in case studies the standard/recommended threshold for inclusion based on core minimum alignment (when doing large-scale population genomics). Add an example Snippy config to GitHub.
Check that all binary and database dependencies exist, and maybe their versions / names.
bohra run
```
[INFO:03/05/2020 05:13:47 PM] Starting bohra pipeline using /home/linuxbrew/.linuxbrew/bin/bohra run
[INFO:03/05/2020 05:13:47 PM] You are running bohra in preview mode.
Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/bin/bohra", line 8, in <module>
    sys.exit(main())
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/bohra.py", line 136, in main
    args.func(args)
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/bohra.py", line 32, in run_pipeline
    R = RunSnpDetection(args)
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/SnpDetection.py", line 65, in __init__
    self.log_messages('warning', 'Input file can not be empty, please set -i path_to_input to try again')
AttributeError: 'RunSnpDetection' object has no attribute 'log_messages'
```
Snakemake parameter
Bohra passes args.cpus to the SnpDetection object's self.cpus during initialisation.
Then there is a function, set_snakemake_jobs(), that double-checks whether self.cpus is over the limit:
SnpDetection.py#L141
```python
def set_snakemake_jobs(self):
    '''
    set the number of jobs to run in parallel based on the number of cpus from args
    '''
    if int(self.cpus) < int(psutil.cpu_count()):
        self.jobs = self.cpus
    else:
        self.jobs = 1
```
However, the final command line that runs the Snakemake file does not use self.jobs:
```python
if self.cluster:
    cmd = f"{self.cluster_cmd()} -s {snake_name} -d {wd} {force} {singularity_string} --latency-wait 1200"
else:
    cmd = f"snakemake {dry} -s {snake_name} {singularity_string} -j {self.cpus} -d {self.job_id} {force} --verbose 2>&1"
```
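A minimal fix would be to pass the capped self.jobs value to -j instead of self.cpus. The sketch below reproduces just the relevant logic (names follow SnpDetection.py; os.cpu_count() stands in for psutil.cpu_count() to keep the sketch dependency-free):

```python
import os


class RunSnpDetection:
    """Sketch of the relevant logic only; names follow SnpDetection.py."""

    def __init__(self, cpus):
        self.cpus = cpus
        self.jobs = 1
        self.set_snakemake_jobs()

    def set_snakemake_jobs(self):
        # cap the number of parallel jobs at the machine's CPU count
        if int(self.cpus) < int(os.cpu_count()):
            self.jobs = self.cpus
        else:
            self.jobs = 1

    def build_cmd(self, snake_name, job_id):
        # use the capped self.jobs here, not self.cpus
        return f"snakemake -s {snake_name} -j {self.jobs} -d {job_id}"
```

That way the cap computed by set_snakemake_jobs() actually reaches Snakemake.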
A good convention is to put at the top of each *.py file or module:

```python
import logging

logging.basicConfig(format='[%(asctime)s] %(message)s', datefmt='%m/%d/%Y %I:%M:%S %p', level=logging.INFO)
logger = logging.getLogger(__name__)
```
Then call logger wherever needed (e.g., logger.info("Starting up")).
The Unix standard for a dry run is --dry-run, with -n as the short synonym if needed.
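With argparse that convention looks like the following (flag names only; this is not bohra's actual CLI definition):

```python
import argparse

parser = argparse.ArgumentParser(prog="bohra")
# long form --dry-run per Unix convention, -n as the short synonym;
# argparse exposes the flag as args.dry_run
parser.add_argument("-n", "--dry-run", action="store_true",
                    help="show what would run without executing anything")

args = parser.parse_args(["-n"])
print(args.dry_run)  # True
```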
bohra/bohra/utils/iqtree_generator.sh
Line 4 in 33671b8
double-N please :)
A handful of isolates that we map to a reference, extracting reads that align to a small 5 kb region.
I find that if I run e.g.

```
bohra run --input_file isolates.tab --job_id PA_20200115_1 --reference ref.fa --mask phastaf/phage.bed -mdu -n
```
...and get the warning message:
```
[WARNING:01/15/2020 02:59:44 PM] This may be a re-run of an existing job. Please try again using rerun instead of run OR use -f to force an overwrite of the existing job.
[WARNING:01/15/2020 02:59:44 PM] Exiting....
```
...then running the same command with -f added only repeats the same error message.
push to pypi
Ideally ~SNPs and evol dist
Ideally want these:
Maybe these
line 693
When a report.toml is too large, writing the report.html from this file is far too slow, to the point of not being possible (tested on >1000 isolates).
If the report directory is found, it should be renamed or removed and the report generated again.
It would be useful to include one or more ref genomes in preview mode.
It is a convention to have the tests folder outside the module folder. You can modify tasks.py to run tests before packaging, and stop if one or more tests fail.
```
  File "/home/linuxbrew/.linuxbrew/bin/bohra", line 8, in <module>
    sys.exit(main())
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/bohra.py", line 136, in main
    args.func(args)
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/bohra.py", line 33, in run_pipeline
    return(R.run_pipeline())
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/SnpDetection.py", line 920, in run_pipeline
    self.index_reference()
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/SnpDetection.py", line 612, in index_reference
    if '.fa' not in self.ref:
TypeError: argument of type 'PosixPath' is not iterable
```
The offending line should instead use pathlib's glob matching:

```python
if not self.ref.match("*.fa*"):
```
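A quick illustration of the difference, since pathlib.Path supports glob-style matching but not the in operator (the paths here are made up):

```python
from pathlib import Path

ref = Path("ref.fasta")

# '.fa' in ref  -> TypeError: argument of type 'PosixPath' is not iterable

# Path.match tests the path name against a glob pattern instead
print(ref.match("*.fa*"))               # True
print(Path("ref.gbk").match("*.fa*"))   # False
```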
Swap in seqtk in place of fq.
Update README.md for installation options
Include pip, singularity, conda and brew?
```
  File "/home/linuxbrew/.linuxbrew/bin/bohra", line 11, in <module>
    load_entry_point('bohra==1.0.3', 'console_scripts', 'bohra')()
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/bohra.py", line 96, in main
    parser.print_help(sys.stderr)
NameError: name 'parser' is not defined
```
nulla2bohra: ensure that bohra can be rerun over an existing nullarbor folder. Can also be used to update older bohra directories. Must supply the name of the nullarbor directory and your isolates.tab file.
Can you use nullarbor/input.tab so they only provide the folder? It is a copy of their original file from when they last ran it.
You either get None or the full path to the executable. You could then easily allow the user to supply a path to the executable if it is not in $PATH for some reason. You will, of course, still need to use subprocess.run to get the exact version of the tool.
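A sketch of that check using shutil.which; find_tool and tool_version are hypothetical helper names, and the --version flag is a placeholder since version flags vary between tools:

```python
import shutil
import subprocess


def find_tool(name, user_path=None):
    """Return the full path to an executable, or None if it cannot be found.

    user_path lets the user point at an executable that is not in $PATH.
    """
    return shutil.which(user_path or name)


def tool_version(exe, flag="--version"):
    """Run the tool once to capture its exact version string."""
    result = subprocess.run([exe, flag], capture_output=True, text=True)
    # some tools print their version to stderr instead of stdout
    return (result.stdout or result.stderr).strip()
```

The returned version string can then be parsed and compared as described next.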
You can then be a bit clever with regex to parse out versions and use the packaging package to compare:
```python
import re
from packaging import version

version_pat = re.compile(r'\bv?(?P<major>[0-9]+)\.(?P<minor>[0-9]+)\.(?P<release>[0-9]+)(?:\.(?P<build>[0-9]+))?\b')

# note: don't call this variable `version`, or it would shadow the
# packaging module imported above
version_string = "snippy v3.2.1"
m = version_pat.search(version_string)

# you can access individual components
m.group("major")  # "3"
# the whole matching string
m.group()  # "v3.2.1"
# as a dictionary
m.groupdict()  # {'major': '3', 'minor': '2', 'release': '1', 'build': None}
# or tuple
m.groups()  # ('3', '2', '1', None)

# the packaging package offers some comparison tools and it comes with
# setuptools, so no additional requirements needed
min_version = version.parse("v3.2.3")
version.parse(m.group()) >= min_version  # False
min_version = version.parse("v3.2.0")
version.parse(m.group()) >= min_version  # True
```
Add some docstrings to the top of these two files (at least; I would like to see them in all of them). That will help with self-documenting later. Ideally, all functions and class definitions would have one too. They don't have to be long. I generally try to start every function/class definition by writing down in the docstring what the element will do, then what parameters it will take and what output to expect. That helps in organising things, and you can quickly see if the function is trying to do too much.
You can follow the Sphinx model and then add Sphinx to tasks.py to automatically generate some docs too. https://pythonhosted.org/an_example_pypi_project/sphinx.html
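For example, a Sphinx-style (reST field) docstring on a hypothetical function might look like:

```python
def mask_reference(reference, bed_path):
    """Mask regions listed in a BED file on a reference sequence.

    :param reference: path to the reference FASTA file
    :param bed_path: path to a BED file of regions to mask
    :returns: path to the masked FASTA file
    :raises FileNotFoundError: if either input path does not exist
    """
    ...
```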
```
--workdir WORKDIR, -w WORKDIR
                    Working directory, default is current directory
                    (default: /home/linuxbrew)
```
If this is meant to be fast scratch, please use tempfile.gettempdir() instead, which will use $TMPDIR.
This will allow Kraken2 to play nicer with the Snakemake scheduler.
Ion Torrent produces single-end reads (one FASTQ file) and normally needs different settings in BWA-MEM. Do you really support them?
make dependencies brew and conda installable
main() should be short, and really just call on other functions.
Add a zoom functionality to the tree
Hi Kristy,
My command to run bohra:

```
bohra run -c 8 -i ids.tab -j CRL_20210120_ -r GCF_001548355.1_JKo3.fna -p sa -mdu -ma 0 -mc 0
```
Josh
Add in example reports for each type of pipeline
Add the keyword __version__ = "1.0.1" to __init__.py (note the version number is a string). You can then import bohra into setup.py and never have to modify the version in more than one place. Read the bumpversion docs too. This is good practice when deploying Python packages.
You can add __author__, __copyright__ and __license__ variables to __init__.py too.
Modify the nextflow pipeline to allow each process to have customisable resources.
If input files are not accessible: prefill.
If the path is not accessible for assembly and speciation, default to running assembly or speciation.
Add in paths to singularity containers and recipes.
Create a little invoke script to automatically version bump, generate the bundles, and push to PyPI. Example using bumpversion (to automatically bump the version) and twine (to upload to PyPI) below. You can add that to a file called tasks.py and then just run inv to push new versions to PyPI (after pip3 install invoke). You could also make it a bash script, of course.
Read the twine README for some more background: https://github.com/pypa/twine
And invoke: http://www.pyinvoke.org/
```python
'''
Automate deployment to PyPI
'''
import invoke


@invoke.task
def deploy_patch(ctx):
    '''
    Automate deployment:
        rm -rf build/* dist/*
        bumpversion patch --verbose
        python3 setup.py sdist bdist_wheel
        twine check dist/*
        twine upload dist/*
        git push --tags
    '''
    ctx.run("rm -rf build/* dist/*")
    ctx.run("bumpversion patch --verbose")
    ctx.run("python3 setup.py sdist bdist_wheel")
    ctx.run("twine check dist/*")
    ctx.run("twine upload dist/*")
    ctx.run("git push --tags")
```
Add in a cluster.json and a flag for cluster mode with an alternate snakemake command.