kblin / ncbi-genome-download


Scripts to download genomes from the NCBI FTP servers

License: Apache License 2.0

Python 99.69% Makefile 0.22% Shell 0.09%

ncbi bioinformatics download-genomes python biology genomics command-line genbank

ncbi-genome-download's Introduction

NCBI Genome Downloading Scripts

Some scripts to download bacterial and fungal genomes from NCBI after they restructured their FTP a while ago.

Idea shamelessly stolen from Mick Watson's Kraken downloader scripts that can also be found in Mick's GitHub repo. However, Mick's scripts are ~~written in Perl~~ specific to actually building a Kraken database (as advertised).

So this is a set of scripts that focuses on the actual genome downloading.

Installation

pip install ncbi-genome-download

Alternatively, clone this repository from GitHub, then run (in a python virtual environment)

pip install .

If this fails on older versions of Python, try updating your pip tool first:

pip install --upgrade pip

and then rerun the ncbi-genome-download install.

Alternatively, ncbi-genome-download is packaged in conda. Refer to the Anaconda/miniconda site to install a distribution (highly recommended). With that installed, one can do:

conda install -c bioconda ncbi-genome-download

ncbi-genome-download is only developed and tested on Python releases still under active support by the Python project. At the moment, this means versions 3.7, 3.8, 3.9, 3.10 and 3.11. Specifically, no attempt at testing under Python versions older than 3.7 is being made.

If your system is stuck on an older version of Python, consider using a tool like Homebrew to obtain a more up-to-date version.

ncbi-genome-download 0.2.12 was the last version to support Python 2.

Usage

To download all bacterial RefSeq genomes in GenBank format from NCBI, run the following:

ncbi-genome-download bacteria

Downloading multiple groups is also possible:

ncbi-genome-download bacteria,viral

Note: To see all available groups, see ncbi-genome-download --help, or simply use all to check all groups. Naming a more specific group will reduce the download size and the time needed to find the sequences to download.

If you're on a reasonably fast connection, you might want to try running multiple downloads in parallel:

ncbi-genome-download bacteria --parallel 4

To download all fungal GenBank genomes from NCBI in GenBank format, run:

ncbi-genome-download --section genbank fungi

To download all viral RefSeq genomes in FASTA format, run:

ncbi-genome-download --formats fasta viral

It is possible to download multiple formats by supplying a list of formats or simply downloading all formats:

ncbi-genome-download --formats fasta,assembly-report viral
ncbi-genome-download --formats all viral

To download only completed bacterial RefSeq genomes in GenBank format, run:

ncbi-genome-download --assembly-levels complete bacteria

It is possible to download multiple assembly levels at once by supplying a list:

ncbi-genome-download --assembly-levels complete,chromosome bacteria

To download only bacterial reference genomes from RefSeq in GenBank format, run:

ncbi-genome-download --refseq-categories reference bacteria

To download bacterial RefSeq genomes of the genus Streptomyces, run:

ncbi-genome-download --genera Streptomyces bacteria

Note: This is a simple string match on the organism name provided by NCBI only.

You can also use this with a slight trick to download genomes of a certain species as well:

ncbi-genome-download --genera "Streptomyces coelicolor" bacteria

Note: The quotes are important. Again, this is a simple string match on the organism name provided by the NCBI.

Multiple genera are also possible:

ncbi-genome-download --genera "Streptomyces coelicolor,Escherichia coli" bacteria

You can also put genus names into a file, one organism per line, e.g.:

Streptomyces
Amycolatopsis

Then, pass the path to that file (e.g. my_genera.txt) to the --genera option, like so:

ncbi-genome-download --genera my_genera.txt bacteria

Note: The above command will download all Streptomyces and Amycolatopsis genomes from RefSeq.

You can make the string match fuzzy using the --fuzzy-genus option. This can be handy if you need to match a value in the middle of the NCBI organism name, like so:

ncbi-genome-download --genera coelicolor --fuzzy-genus bacteria

Note: The above command will download all bacterial genomes containing "coelicolor" anywhere in their organism name from RefSeq.
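
The difference between the default match and --fuzzy-genus can be pictured with a small sketch. It assumes that the default compares against the start of the organism name while the fuzzy variant accepts the term anywhere in it. The function name matches_genus and the case-insensitive comparison are illustrative assumptions, not the tool's actual code.

```python
def matches_genus(organism_name, term, fuzzy=False):
    """Return True if `term` matches `organism_name`.

    Assumption: the default match is anchored at the start of the name,
    while the fuzzy match accepts the term anywhere in the name.
    """
    name = organism_name.lower()
    needle = term.lower()
    if fuzzy:
        return needle in name
    return name.startswith(needle)


names = [
    "Streptomyces coelicolor A3(2)",
    "Streptomyces griseus",
    "Escherichia coli",
]

# Default match: only names starting with the term are kept.
print([n for n in names if matches_genus(n, "Streptomyces")])
# Fuzzy match: the term may appear anywhere in the name.
print([n for n in names if matches_genus(n, "coelicolor", fuzzy=True)])
```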

To download bacterial RefSeq genomes based on their NCBI species taxonomy ID, run:

ncbi-genome-download --species-taxids 562 bacteria

Note: The above command will download all RefSeq genomes belonging to Escherichia coli.

To download a specific bacterial RefSeq genome based on its NCBI taxonomy ID, run:

ncbi-genome-download --taxids 511145 bacteria

Note: The above command will download the RefSeq genome belonging to Escherichia coli str. K-12 substr. MG1655.

It is also possible to download multiple species taxids or taxids by supplying the numbers in a comma-separated list:

ncbi-genome-download --taxids 9606,9685 --assembly-levels chromosome vertebrate_mammalian

Note: The above command will download the reference genomes for cat and human.

In addition, you can put multiple species taxids or taxids into a file, one per line and pass that filename to the --species-taxids or --taxids parameters, respectively.

Assuming you had a file my_taxids.txt with the following contents:

9606
9685

You could download the reference genomes for cat and human like this:

ncbi-genome-download --taxids my_taxids.txt --assembly-levels chromosome vertebrate_mammalian
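
If the taxids come out of another script, the one-per-line file can be written with a few lines of Python. This is only a sketch: the helper name write_id_file and the temporary output directory are made up for illustration.

```python
import os
import tempfile


def write_id_file(path, ids):
    """Write one identifier per line, the layout expected by --taxids."""
    with open(path, "w") as handle:
        for item in ids:
            handle.write("{}\n".format(item))
    return path


# Write the human (9606) and cat (9685) taxids used in the example.
out_dir = tempfile.mkdtemp()
taxid_file = write_id_file(os.path.join(out_dir, "my_taxids.txt"), [9606, 9685])
with open(taxid_file) as handle:
    print(handle.read())
```

The same layout works for files passed to --species-taxids.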

It is also possible to create a human-readable directory structure in parallel to the mirror of the layout used by NCBI:

ncbi-genome-download --human-readable bacteria

This will use links to point to the appropriate files in the NCBI directory structure, so it saves file space. Note that links are not supported on some Windows file systems and some older versions of Windows.

It is also possible to re-run a previous download with the --human-readable option. In this case, ncbi-genome-download will not download any new genome files and will just create the human-readable directory structure. Note that if any files have been changed on the NCBI side, a file download will be triggered.

There is a "dry-run" option to show which accessions would be downloaded, given your filters:

ncbi-genome-download --dry-run bacteria

If you want to filter for the "relation to type material" column of the assembly summary file, you can use the --type-materials option. Possible values are "any", "all", "type", "reference", "synonym", "proxytype", and/or "neotype". "any" will include assemblies with no relation to type material value defined, "all" will download only assemblies with a defined value. Multiple values can be given, separated by comma:

ncbi-genome-download --type-materials type,reference

By default, ncbi-genome-download caches the assembly summary files for the respective taxonomic groups for one day. You can skip using the cache file by using the --no-cache option. The output of --help also shows the cache directory, should you want to remove any of the cached files.

To get an overview of all options, run

ncbi-genome-download --help

As a method

You can also use it as a method call:

import ncbi_genome_download as ngd
ngd.download()

Pass the pythonised keyword arguments as described above or in the --help. To specify taxonomic groups, like bacteria, use the groups keyword. To specify file formats, like for the --formats CLI option, use file_formats. All other keywords should match the CLI options, with - converted to _. Note that because the method call follows the same logic as the CLI, list data should still be passed as strings, separated by commas but no spaces, just like on the command line.
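
The naming rules can be sketched as a small helper that turns CLI-style options into keyword arguments. The helper names (to_csv, build_download_kwargs, run_download) are hypothetical; only groups and file_formats are named in the text, and the other keywords are derived from the "- becomes _" rule, so check them against --help before relying on them. The actual call needs the package installed and network access, so it is wrapped in a function that is not invoked here.

```python
def to_csv(value):
    """Join lists/tuples into the comma-separated string form used on the CLI."""
    if isinstance(value, (list, tuple)):
        return ",".join(str(item) for item in value)
    return value


def build_download_kwargs(groups, formats=None, options=None):
    """Build keyword arguments for ngd.download() from CLI-style names.

    `groups` and `file_formats` are the two keywords that differ from the
    CLI; every other option name just has "-" replaced by "_".
    """
    kwargs = {"groups": to_csv(groups)}
    if formats is not None:
        kwargs["file_formats"] = to_csv(formats)
    for name, value in (options or {}).items():
        kwargs[name.lstrip("-").replace("-", "_")] = to_csv(value)
    return kwargs


def run_download(kwargs):
    """Call the library; needs ncbi-genome-download installed and network access."""
    import ncbi_genome_download as ngd
    return ngd.download(**kwargs)


kwargs = build_download_kwargs(
    ["bacteria", "viral"],
    formats=["fasta", "assembly-report"],
    options={"--assembly-levels": ["complete", "chromosome"], "--dry-run": True},
)
print(kwargs)
```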

Contributed Scripts: `gimme_taxa.py`

This script lets you find out what TaxIDs to pass to ngd, and will write a simple one-item-per-line file to pass in to it. It utilises the ete3 toolkit, so refer to their site to install the dependency if it's not already satisfied.

You can query the database using a particular TaxID, or a scientific name. The primary function of the script is to return all the child taxa of the specified parent taxa. The script has various options for what information is written in the output.

A basic invocation may look like:

# Fetch all descendent taxa for Escherichia (taxid 561):
python gimme_taxa.py -o ~/mytaxafile.txt 561

# Alternatively, just provide the taxon name
python gimme_taxa.py -o all_descendent_taxids.txt Escherichia

# You can provide multiple taxids and/or names
python gimme_taxa.py -o all_descendent_taxids.txt 561,Methanobrevibacter

On first use, a small sqlite database will be created in your home directory by default (change the location with the --database flag). You can update this database by using the --update flag. Note that if the database is not in your home directory, you must specify it with --database or a new database will be created in your home directory.

To see all help:

python gimme_taxa.py
python gimme_taxa.py -h
python gimme_taxa.py --help

To use the gimme_taxa.py script with ncbi-genome-download's --taxids option, you need to call gimme_taxa.py with the -j option, like this:

gimme_taxa.py -j -o my_taxids.txt Escherichia
ncbi-genome-download --taxids my_taxids.txt bacteria

Citing `ncbi-genome-download`

You can cite ncbi-genome-download via the Zenodo deposit under DOI: 10.5281/zenodo.8192432 or the specific DOI for the version you used.

License

All code is available under the Apache License version 2, see the LICENSE file for details.


ncbi-genome-download's Issues

Progress bar in bash

Since I do not use this software for biological work, I'm not sure of the value. But during my testing, if I downloaded a lot of files, I had no idea whether the program was working unless I looked at my file system to see the new directories. Would a progress bar be something you are interested in?

Something like this sample output:
Progress: |█████████████████████████████████████████████-----| 90.0% Complete
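
A line like the sample can be produced with a few lines of Python. This is only a sketch of the requested output, not an existing feature of the tool; the function name render_progress and the bar width of 50 characters are assumptions.

```python
def render_progress(done, total, width=50):
    """Return a one-line text progress bar for `done` out of `total` items."""
    # Treat an empty job list as finished to avoid dividing by zero.
    fraction = float(done) / total if total else 1.0
    filled = int(round(width * fraction))
    bar = "█" * filled + "-" * (width - filled)
    return "Progress: |{}| {:.1f}% Complete".format(bar, fraction * 100)


print(render_progress(9, 10))
```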

bioconda package - import error

I got an import error for enum in a brand-new conda environment. Is enum34 missing in the recipe?

ImportError: No module named enum

(I now installed with pip which went just fine. Thank you for the effort to put it into bioconda, I appreciate it.)

Here's the output I got when installing the package:

$ conda create -n "genome_download" ncbi-genome-download
Fetching package metadata .................
Solving package specifications: .

Package plan for installation in environment /hpc/local/CentOS7/dla_mm/tools/miniconda2/envs/genome_download:

The following NEW packages will be INSTALLED:

    ncbi-genome-download: 0.2.5-py27_0  bioconda
    openssl:              1.0.2l-0              
    pip:                  9.0.1-py27_1          
    python:               2.7.13-0              
    readline:             6.2-2                 
    requests:             2.14.2-py27_0         
    setuptools:           27.2.0-py27_0         
    sqlite:               3.13.0-0              
    tk:                   8.5.18-0              
    wheel:                0.29.0-py27_0         
    zlib:                 1.2.11-0 

...
(genome_download)$ ncbi-genome-download bacteria
Traceback (most recent call last):
  File "/hpc/local/CentOS7/dla_mm/tools/miniconda2/envs/genome_download/bin/ncbi-genome-download", line 4, in <module>
    import ncbi_genome_download.__main__
  File "/hpc/local/CentOS7/dla_mm/tools/miniconda2/envs/genome_download/lib/python2.7/site-packages/ncbi_genome_download/__init__.py", line 2, in <module>
    from .core import (
  File "/hpc/local/CentOS7/dla_mm/tools/miniconda2/envs/genome_download/lib/python2.7/site-packages/ncbi_genome_download/core.py", line 8, in <module>
    from enum import Enum, unique
ImportError: No module named enum

Allow passing a list of genus names instead of using a single one.

It is possible to get ncbi-genome-download to download from a list of genomes by using a shell script loop like this:

for genus in $(cat genomes_list.txt); do
    ncbi-genome-download -g $genus
done

But this causes ncbi-genome-download to repeatedly download the assembly_summary.txt file, wasting some bandwidth. It'd be nicer to have this built in and only download the summary file once.
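
Until something like this is built in, a script can collapse the list into a single comma-separated value so that only one invocation is needed. A minimal sketch, assuming the tool accepts several genus names in one comma-separated argument; the helper name genera_argument is made up.

```python
def genera_argument(lines):
    """Turn the lines of a genus list into one comma-separated string."""
    names = [line.strip() for line in lines]
    # Skip blank lines so they do not turn into empty list entries.
    return ",".join(name for name in names if name)


genera = genera_argument(["Streptomyces\n", "\n", "Amycolatopsis\n"])
print(genera)
```

The resulting string would then be passed once to the genus option instead of once per loop iteration.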

Only 3 viral sequences downloaded?

Hi,
I am trying to download all viral sequences from genbank in fasta format, equivalent of the search:
((viral)) AND "viruses"[porgn:__txid10239]
This package looked very promising, so I tried the command below and few other variations. However, it only downloads 3 sequences. Am I missing something?

ncbi-genome-download -s genbank -F fasta -l all viral -v
INFO: Downloading record u'GCA_001857745.1'
INFO: Downloading record u'GCA_001857805.1'
INFO: Downloading record u'GCA_001857825.1'

Thank you!

Python version requirement not listed

After having an issue with "requests" which was resolved, oddly enough, by following the easy_install suggestion in chrippa/livestreamer#384, it seems that there may be a feature which requires a minimum of Python 2.7.

There is a solution for Python 2.6.6 for those of us stuck on older distros, but I don't know if it can be implemented.

home path altered to $HOME for following block to reduce line length

ncbi-genome-download bacteria
Traceback (most recent call last):
  File "$HOME/.local/bin/ncbi-genome-download", line 9, in <module>
    load_entry_point('ncbi-genome-download==0.1.8', 'console_scripts', 'ncbi-genome-download')()
  File "$HOME/.local/lib/python2.6/site-packages/ncbi_genome_download/__main__.py", line 59, in main
    ncbi_genome_download.download(args)
  File "$HOME/.local/lib/python2.6/site-packages/ncbi_genome_download/core.py", line 50, in download
    args.assembly_level, args.genus, args.parallel)
  File "$HOME/.local/lib/python2.6/site-packages/ncbi_genome_download/core.py", line 67, in _download
    download_jobs.extend(download_entry(entry, section, domain, output, file_format))
  File "$HOME/.local/lib/python2.6/site-packages/ncbi_genome_download/core.py", line 98, in download_entry
    checksums = grab_checksums_file(entry)
  File "$HOME/.local/lib/python2.6/site-packages/ncbi_genome_download/core.py", line 139, in grab_checksums_file
    full_url = '{}/md5checksums.txt'.format(http_url)
ValueError: zero length field name in format
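
The ValueError comes from the auto-numbered {} placeholder, which str.format only accepts from Python 2.7 onwards. A minimal sketch of a spelling that also works on Python 2.6 uses an explicit field index; the function name checksums_url and the example URL path are placeholders.

```python
def checksums_url(http_url):
    """Build the checksum file URL with an explicit field index.

    '{0}' works on Python 2.6 as well as on later versions, whereas the
    auto-numbered '{}' form requires Python 2.7 or newer.
    """
    return '{0}/md5checksums.txt'.format(http_url)


# The path below is a placeholder, not a real assembly directory.
print(checksums_url('https://ftp.ncbi.nlm.nih.gov/genomes/all/example'))
```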

Problem installing version 0.2.3 on Biolinux8 (Virtualbox) [solved]

I am trying to update ncbi-genome-download on a Biolinux8 (Ubuntu 14.04) VirtualBox virtual machine. Installation and running keep failing on the cryptography module.

Error messages when upgrading from 0.1.8 to 0.2.3:
Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-p6t82g/cryptography/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-pi1daZ-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-p6t82g/cryptography/

When I first uninstall version 0.1.8 and then install 0.2.3, I get the following when running "ncbi-genome-download --version"

Traceback (most recent call last):
  File "/usr/local/bin/ncbi-genome-download", line 7, in <module>
    from ncbi_genome_download.__main__ import main
  File "/usr/local/lib/python2.7/dist-packages/ncbi_genome_download/__init__.py", line 2, in <module>
    from ncbi_genome_download.core import (
  File "/usr/local/lib/python2.7/dist-packages/ncbi_genome_download/core.py", line 16, in <module>
    from requests.packages.urllib3.contrib import pyopenssl
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/contrib/pyopenssl.py", line 47, in <module>
    from cryptography import x509
ImportError: No module named cryptography

But cryptography etc is installed normally.

It does work on a non-VirtualBox Biolinux install. I'm not sure what the best approach is now.

Fix the setup.py/requirements.txt magic to only install pyOpenSSL on old Python versions

Currently, because setuptools was being stupid in old versions and I wanted to get the 0.2.0 release out, requirements.txt and thus setup.py always list pyOpenSSL and ndg-httpsclient as dependencies. This is not true on Python versions >= 2.7.9.

This could be fixed in theory by using an environment marker in requirements.txt like

pyOpenSSL >= 16.0.0 ; python_version < '2.7.9'

Unfortunately, on one of the bigger install bases for a pre-2.7.9 Python, Ubuntu 14.04, the setuptools version is also pretty ancient, and doesn't support environment markers. In my testing before the release, I was forced to remove the environment markers again (commit a078d93) to get pip install to work on both py2.7 and py3.4 as shipped with Ubuntu 14.04.

It would be really nice to come up with a good way of doing this properly, but only requiring people to build pyOpenSSL on old Pythons.
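
One way to sidestep environment markers is to compute the dependency list in setup.py itself. The following is a sketch under assumptions, not what the project does: the function name build_requirements is made up, and the requests pin is only illustrative.

```python
import sys


def build_requirements(version_info=None):
    """Return the dependency list for the running (or given) Python version."""
    if version_info is None:
        version_info = sys.version_info
    requirements = ['requests >= 2.4.3']
    # Per the issue text, only Pythons older than 2.7.9 need the extra packages.
    if tuple(version_info[:3]) < (2, 7, 9):
        requirements.extend(['pyOpenSSL >= 16.0.0', 'ndg-httpsclient'])
    return requirements


print(build_requirements())
```

The result would be passed as install_requires in the setup() call. One caveat with this approach is that a prebuilt wheel records the list computed at build time, so it only behaves as intended for installs from source.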

Parallel downloads of MD5SUM files?

ncbi-genome-download can already download data in parallel to speed up the process. But the part of the process that downloads the MD5SUMS files and checks whether a download job needs to be started in the first place still runs one at a time. Fix this to run in parallel as well.

OSError: [Errno 17] File exists

Been getting this a bit when I restart jobs that failed due to unreachable network:

INFO: Starting new HTTPS connection (1): ftp.ncbi.nlm.nih.gov
DEBUG: "GET /genomes/all/GCA_001443705.1_ASM144370v1/GCA_001443705.1_ASM144370v1_genomic.gbff.gz HTTP/1.1" 200 185238057
Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/bin/ncbi-genome-download", line 11, in <module>
    sys.exit(main())
  File "/home/linuxbrew/.linuxbrew/Cellar/python/2.7.12_1/lib/python2.7/site-packages/ncbi_genome_download/__main__.py", line 78, in main
INFO: Starting new HTTPS connection (1): ftp.ncbi.nlm.nih.gov
    ret = ncbi_genome_download.download(args)
  File "/home/linuxbrew/.linuxbrew/Cellar/python/2.7.12_1/lib/python2.7/site-packages/ncbi_genome_download/core.py", line 60, in download
    args.taxid, args.human_readable, args.parallel)
  File "/home/linuxbrew/.linuxbrew/Cellar/python/2.7.12_1/lib/python2.7/site-packages/ncbi_genome_download/core.py", line 94, in _download
    pool.map(worker, download_jobs)
  File "/home/linuxbrew/.linuxbrew/Cellar/python/2.7.12_1/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/home/linuxbrew/.linuxbrew/Cellar/python/2.7.12_1/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
OSError: [Errno 17] File exists
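
The traceback does not show which call raised the error. One common source of Errno 17 with parallel workers is two processes trying to create the same directory at the same time. A minimal sketch of a tolerant directory helper, assuming that this is the cause here; the name ensure_dir and the example path are made up.

```python
import errno
import os
import tempfile


def ensure_dir(path):
    """Create `path` if needed; tolerate another worker creating it first."""
    try:
        os.makedirs(path)
    except OSError as err:
        # Only swallow "already exists" for directories; re-raise anything else.
        if err.errno != errno.EEXIST or not os.path.isdir(path):
            raise
    return path


base = tempfile.mkdtemp()
target = os.path.join(base, "refseq", "bacteria", "GCF_example")
ensure_dir(target)
ensure_dir(target)  # the second call is a no-op instead of raising
print(os.path.isdir(target))
```

On Python 3 only, os.makedirs(path, exist_ok=True) achieves the same effect.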

Make it usable as a regular python method

That is, allow skipping the command-line argument parsing phase and using it in any Python script as a regular method call.
Typical usage would be:

import ncbi_genome_download as ngd
ngd.download(section="refseq")

Program hangs when downloading not existing 'unknown' section.

DEBUG: Starting new HTTPS connection (1): ftp.ncbi.nih.gov
DEBUG: https://ftp.ncbi.nih.gov:443 "GET /genomes/refseq/unknown/assembly_summary.txt HTTP/1.1" 404 None

This also applies when downloading all, which is a common use case.

Enhancement to --human-readable output

Firstly, let me say thank you for adding the human readable output option. It is working great so far with my testing!

I was hoping you could add another option to make the human readable name in the file name as well? This makes it easy to provide to software a bunch of these files and have output we can understand. Not quite sure how versions would work though.

Current:
human_readable/genbank/bacteria/Dichelobacter/nodosus/VCS1703A/GCA_000015345.1_ASM1534v1_genomic.gbff.gz

Alternate suggested:
human_readable/genbank/bacteria/Dichelobacter/nodosus/VCS1703A/Dichelobacter_nodosus_VCS1703A.gbff.gz

Should we install `gimme_taxa.py` via setup.py?

This came up in PR #67. I'm undecided on the topic, but want to make sure I don't forget about it.

Lots of "ERROR: checksum mismatches" with Genbank download

I suddenly get lots of "ERROR: checksum mismatches" with downloads from Genbank or RefSeq, due to a significant number of downloads not completing. But if I check the link given in the report, it is correct and the files can be downloaded from the FTP-server.

I have already tried reducing the number of parallel downloads (first to 4, then to 2), but this had no effect. Any thoughts? I am using the latest version; last week there were no problems.

Add "chromosome" as an assembly level

Apparently besides "complete", "scaffold" and "contig", RefSeq also knows "chromosome" as assembly level. So support that as well.

Downloads stall in mid-process

To quote @forestdussault from issue #19

Just came across your script today - nice work. Unfortunately it is hanging for me with the following command:

ncbi-genome-download --assembly-level complete --format fasta --human-readable --verbose --debug bacteria

The downloader will progress for a few minutes and then stop. It seems to stop while attempting to start a new HTTPS connection:
DEBUG: Skipping entry with assembly level 'Contig'
INFO: Downloading record 'GCF_000465235.1'
INFO: Starting new HTTPS connection (1): ftp.ncbi.nlm.nih.gov

Can I also use this to download specific genus?

As I am a small player with ditto diskspace, and usually only want the genomes from a specific genus, can this be done?

Is it possible to download NCBI protein database instead of genome?

Hi!
Nice tool! I have used it many times before to download large genome data sets, but now I need to download the Bacteria protein database, and I was not able to find a suitable application to do this. Is it possible to do this with your app?

Thank you in advance.

KeyError: 'assembly_accession'

% ncbi-genome-download -V
0.1.0

% ncbi-genome-download -v viral

INFO: Starting new HTTP connection (1): ftp.ncbi.nih.gov
Traceback (most recent call last):
  File "/bio/linuxbrew/bin/ncbi-genome-download", line 11, in <module>
    sys.exit(main())
  File "/bio/linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/ncbi_genome_download/__main__.py", line 43, in main
    ncbi_genome_download.download(args)
  File "/bio/linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/ncbi_genome_download/core.py", line 21, in download
    _download(args.section, args.domain, args.uri, args.output)
  File "/bio/linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/ncbi_genome_download/core.py", line 29, in _download
    download_entry(entry, section, domain, uri, output)
  File "/bio/linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/ncbi_genome_download/core.py", line 56, in download_entry
    logging.info('Downloading record %r', entry['assembly_accession'])
KeyError: 'assembly_accession'

DEBUG: Skipping over unexpected checksum line u''

This line occurs a lot - should I be worried?

Handle non-ASCII characters

Both the RefSeq summary and the GenBank summary contain non-ASCII characters that cause csv to fall over. Work around this.

Newly added column in the assembly_summary.txt file

Hi,

Are you aware of this newly added column in the assembly_summary.txt files?

A new column was added to the assembly_summary.txt files on the genomes FTP site.

The new “relation_to_type_material” field is most relevant to bacteria and archaea, although some fungi 
and a few algae also have assemblies from type material.

Column 22: “relation_to_type_material”
Relation to type material: contains a value if the sequences in the genome assembly were derived from type material.
Values:
assembly from type material - the sequences in the genome assembly were derived from type material
assembly from synonym type material - the sequences in the genome assembly were derived from synonym type material
assembly from proxytype material - the sequences in the genome assembly were derived from proxy type material

Just to let you know if this could affect your tool

Regards

Parallel download for us with crappy FTP connections

The FTP download is single threaded and is slow for us in Australia.

Would you consider an enhancement to have multiple download threads?

It doesn't seem to actually support comma-separated list.

(despite documentation saying so. Tried quoting too.)

$ ncbi-genome-download -F fasta bacteria,viral
usage: ncbi-genome-download [-h] [-s {refseq,genbank}]
[-F {genbank,fasta,features,gff,protein-fasta,genpept,wgs,cds-fasta,rna-fasta,assembly-report,assembly-stats,all}]
[-l {all,complete,chromosome,scaffold,contig}]
[-g GENUS] [-T SPECIES_TAXID] [-t TAXID]
[-R {all,reference,representative}] [-o OUTPUT]
[-H] [-u URI] [-p N] [-r N] [-v] [-d] [-V]
{all,archaea,bacteria,fungi,invertebrate,plant,protozoa,unknown,vertebrate_mammalian,vertebrate_other,viral}
ncbi-genome-download: error: argument group: invalid choice: 'bacteria,viral' (choose from 'all', 'archaea', 'bacteria', 'fungi', 'invertebrate', 'plant', 'protozoa', 'unknown', 'vertebrate_mammalian', 'vertebrate_other', 'viral')

Bump to 0.2.4

Hi,

I would need the additional features from the latest pull requests. As it is part of a larger project, it would be much easier to use if available from PyPI. Are you planning to soon publish an update on PyPI?

--taxids with input file is not working

I'm using ncbi-genome-download 0.2.6 py35_0 bioconda

The following works:

ncbi-genome-download bacteria --format fasta --taxid 562

However, when I try to place the taxid in a file and provide that as input for --taxid:

ncbi-genome-download bacteria --format fasta --taxid my_taxids.txt

...I get ERROR: No downloads matched your filter. Please check your options. I've tried this with other taxids, and the same behavior occurs.

It appears that this version of ngd doesn't accept a file that lists taxids, even if they work when provided as a comma-separated list.

return entry['organism_name'].split(' ')[1] IndexError: list index out of range

% ncbi-genome-download --parallel 4 -H -d -l complete viral

<snip>
DEBUG: Starting new HTTPS connection (1): ftp.ncbi.nlm.nih.gov
DEBUG: https://ftp.ncbi.nlm.nih.gov:443 "GET /genomes/all/GCF/000/838/005/GCF_000838005.1_ViralProj14104/md5checksums.txt HTTP/1.1" 200 649
INFO: Downloading record 'GCF_000847185.1'
DEBUG: Starting new HTTPS connection (1): ftp.ncbi.nlm.nih.gov
DEBUG: https://ftp.ncbi.nlm.nih.gov:443 "GET /genomes/all/GCF/000/847/185/GCF_000847185.1_ViralProj14595/md5checksums.txt HTTP/1.1" 200 649
INFO: Downloading record 'GCF_000883035.1'
DEBUG: Starting new HTTPS connection (1): ftp.ncbi.nlm.nih.gov
DEBUG: https://ftp.ncbi.nlm.nih.gov:443 "GET /genomes/all/GCF/000/883/035/GCF_000883035.1_ViralProj31249/md5checksums.txt HTTP/1.1" 200 649
INFO: Downloading record 'GCF_000864565.1'
Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/bin/ncbi-genome-download", line 11, in <module>
    sys.exit(main())
  File "/home/linuxbrew/.linuxbrew/opt/python3/lib/python3.5/site-packages/ncbi_genome_download/__main__.py", line 73, in main
    ret = ncbi_genome_download.download(args)
  File "/home/linuxbrew/.linuxbrew/opt/python3/lib/python3.5/site-packages/ncbi_genome_download/core.py", line 60, in download
    args.taxid, args.human_readable, args.parallel)
  File "/home/linuxbrew/.linuxbrew/opt/python3/lib/python3.5/site-packages/ncbi_genome_download/core.py", line 91, in _download
    download_jobs.extend(download_entry(entry, section, domain, output, file_format, human_readable))
  File "/home/linuxbrew/.linuxbrew/opt/python3/lib/python3.5/site-packages/ncbi_genome_download/core.py", line 125, in download_entry
    symlink_path = create_readable_dir(entry, section, domain, output)
  File "/home/linuxbrew/.linuxbrew/opt/python3/lib/python3.5/site-packages/ncbi_genome_download/core.py", line 169, in create_readable_dir
    get_species_label(entry),
  File "/home/linuxbrew/.linuxbrew/opt/python3/lib/python3.5/site-packages/ncbi_genome_download/core.py", line 295, in get_species_label
    return entry['organism_name'].split(' ')[1]
IndexError: list index out of range
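
The last frame shows that split(' ')[1] assumes at least two words in the organism name, which does not hold for every entry. A defensive variant could fall back to a placeholder instead of raising. This is only a sketch: the function name safe_species_label and the fallback value 'sp.' are assumptions, not the fix that was applied to the project.

```python
def safe_species_label(entry, default='sp.'):
    """Return the second word of the organism name, or `default` if missing."""
    # split() without arguments also copes with repeated whitespace.
    parts = entry['organism_name'].split()
    if len(parts) < 2:
        return default
    return parts[1]


print(safe_species_label({'organism_name': 'Escherichia coli str. K-12'}))
print(safe_species_label({'organism_name': 'Enterobacteria'}))
```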

Add --acc id for download specific genomes

It would be very useful to fetch specific genomes by accession ID only, either from a file listing several IDs or by looping over each one.

Regards

taxid usage

How does one specify the --taxid argument correctly? When I call ...

ncbi-genome-download --taxid 83654 bacteria

... I get ...

ERROR: No downloads matched your filter. Please check your options.

However, this works:

ncbi-genome-download --genus leclercia bacteria

However, I would like to download the entire family, so using the taxid would be much more convenient.

Thank you,
Adrian

Support for proxies in requests

So far, the only way to set the proxies for the requests package is via EXPORT. It would be great to also support the proxies as parameter to requests, like here: http://docs.python-requests.org/en/latest/user/advanced/#proxies

I would see two options for that:

Either set the proxies via a sub-command, after which all following calls to ncbi_genome_download use those proxies. This would be reflected as a setter in core.py, so that it could also be set programmatically when using download(). This is maybe more user-friendly, as proxy settings usually do not change between two calls.
Or set the proxies at each call of ncbi_genome_download via specific parameters. This would be reflected as keyword arguments to download(). This is easier to implement.

If you do not mind, I will open a PR next week, as I need this functionality in my project anyway. Let me know which solution (or another one) you prefer!
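
For reference, this is roughly what passing proxies to requests looks like. It is a sketch only: build_proxies and fetch are hypothetical helpers, the proxy address is a placeholder, and none of this is part of the ncbi_genome_download API. The fetch function needs requests installed and network access, so it is defined but not called.

```python
import os


def build_proxies(http=None, https=None, environ=None):
    """Return a requests-style proxies mapping.

    Explicit arguments win; otherwise fall back to the conventional
    HTTP_PROXY / HTTPS_PROXY environment variables.
    """
    if environ is None:
        environ = os.environ
    proxies = {}
    http = http or environ.get('HTTP_PROXY') or environ.get('http_proxy')
    https = https or environ.get('HTTPS_PROXY') or environ.get('https_proxy')
    if http:
        proxies['http'] = http
    if https:
        proxies['https'] = https
    return proxies


def fetch(url, proxies):
    """Fetch `url` through the given proxies (needs `requests` and network)."""
    import requests
    session = requests.Session()
    session.proxies.update(proxies)
    return session.get(url)


print(build_proxies(https='http://proxy.example.org:3128', environ={}))
```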

Make symlinks relative rather than absolute

Can you make the human-readable symlinks relative (e.g. ../../../../refseq/..... etc.) instead of absolute paths?

The reason is that they all break if I move the folder hierarchy anywhere.
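
A relative link target can be computed with os.path.relpath from the directory that will hold the link. The sketch below uses a throwaway directory tree; the function name make_relative_symlink and all path components are made up for illustration, and symlink creation may not be available on some Windows setups.

```python
import os
import tempfile


def make_relative_symlink(target, link_path):
    """Create `link_path` pointing at `target` via a relative path."""
    link_dir = os.path.dirname(link_path)
    if not os.path.isdir(link_dir):
        os.makedirs(link_dir)
    # The link is resolved relative to the directory that contains it.
    relative_target = os.path.relpath(target, start=link_dir)
    os.symlink(relative_target, link_path)
    return relative_target


root = tempfile.mkdtemp()
target = os.path.join(root, "refseq", "bacteria", "GCF_example", "genome.gbff.gz")
os.makedirs(os.path.dirname(target))
open(target, "w").close()

link = os.path.join(root, "human_readable", "refseq", "bacteria",
                    "Genus", "species", "strain", "genome.gbff.gz")
print(make_relative_symlink(target, link))
```

Because the stored target is relative, the link keeps resolving after the whole tree is moved, as long as both sub-trees move together.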

starts, then fail

Hi,

I try to download all viral reference genomes in refseq like so:

ncbi-genome-download --format fasta --section refseq --verbose --parallel 10 --human-readable viral

The program starts downloading things but then, at the same ID (GCF...) breaks with the following error message:

INFO: Downloading record 'GCF_000868825.1'
INFO: Downloading record 'GCF_000844105.1'
Traceback (most recent call last):
  File "/Users/pi/.virtualenvs/lab3/bin/ncbi-genome-download", line 11, in <module>
    sys.exit(main())
  File "/Users/pi/.virtualenvs/lab3/lib/python3.5/site-packages/ncbi_genome_download/__main__.py", line 73, in main
    ret = ncbi_genome_download.download(args)
  File "/Users/pi/.virtualenvs/lab3/lib/python3.5/site-packages/ncbi_genome_download/core.py", line 60, in download
    args.taxid, args.human_readable, args.parallel)
  File "/Users/pi/.virtualenvs/lab3/lib/python3.5/site-packages/ncbi_genome_download/core.py", line 75, in _download
    for entry in entries:
  File "/Users/pi/.virtualenvs/lab3/lib/python3.5/site-packages/ncbi_genome_download/summary.py", line 28, in __next__
    entry[self._fields[i]] = val
IndexError: list index out of range

I tried a minimal version as well, same result:

ncbi-genome-download --verbose --parallel 10 viral

Am I using it wrongly, or is this an uncaught error, like some empty entry in the database or similar?

Thanks a lot!
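
The IndexError in the traceback is raised while a value is assigned to a field name by position, which suggests a row with more columns than the header declares. A hedged sketch of a more tolerant reader follows; parse_summary_rows and its arguments are hypothetical and not the code in summary.py.

```python
import logging


def parse_summary_rows(lines, fields):
    """Yield one dict per data row, skipping rows with an unexpected column count."""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        values = line.split("\t")
        if len(values) != len(fields):
            # Report the malformed row instead of crashing on it.
            logging.warning("Skipping line %d: expected %d columns, got %d",
                            line_number, len(fields), len(values))
            continue
        yield dict(zip(fields, values))
```

Skipping and logging is one possible policy; failing with a message that names the offending line would be a stricter alternative.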

requests 2.4.3 is required

Hi,
I just tried to install it under Python 2.7.6. Installation is okay, but it fails to launch and the error is:
from requests.packages.urllib3.contrib import pyopenssl
ImportError: No module named packages.urllib3.contrib

This is because the installed requests version is 2.2.1, but 2.4.3 is actually required.
It would be better to specify requests >= 2.4.3 in requirements.txt.

viral division has disappeared from NCBI

This just hangs:

ncbi-genome-download -r 2 -v -o viral -H -p 1 --debug -s genbank -F genbank viral
DEBUG: Downloading summary for 'genbank'/'viral' uri: 'https://ftp.ncbi.nih.gov/genomes'
INFO: Starting new HTTPS connection (1): ftp.ncbi.nih.gov
DEBUG: "GET /genomes/genbank/viral/assembly_summary.txt HTTP/1.1" 404 240

Going via web browser says:

Not Found
The requested URL /genomes/genbank/viral/assembly_summary.txt was not found on this server.

It seems viral is not in the genbank folder?
https://ftp.ncbi.nih.gov/genomes/genbank/

The refseq folder has viral but it is a dead symlink.
https://ftp.ncbi.nih.gov/genomes/refseq

Could be a victim of the NCBI FTP rearrangement?

Support rsync and/or aspera as an alternative download method

As described in some comments on issue #15, there is some interest in allowing downloads from NCBI to go via rsync or Aspera's ascp tool.

FYI - NCBI doing "Genus species" folders now?

Just FYI

eg. all the latest Leptospira_borgpetersenii assemblies:

ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Leptospira_borgpetersenii/latest_assembly_versions/

Complete genomes can be 'complete' or 'chromosome'

We have a problem with bacterial genomes that sometimes the assembly_summary.txt file says complete and sometimes it says chromosome for finished bacterial genomes.

This is partially to do with the fact that bacteria usually only have 1 chromosome, but partially because they have plasmids too. It's confusing.

If I want both, can I do -l complete -l chromosome to get both?
Or do I run 2 commands with the same -o folder?
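
For illustration only, selecting several assembly levels at once amounts to a set membership test on the assembly_level column of assembly_summary.txt. The sketch below assumes the entries are dicts keyed by column name; filter_by_assembly_level is a hypothetical helper and says nothing about how the -l option is parsed.

```python
def filter_by_assembly_level(entries, wanted_levels):
    """Keep entries whose assembly_level matches any wanted level (case-insensitive)."""
    wanted = {level.strip().lower() for level in wanted_levels}
    return [entry for entry in entries
            if entry["assembly_level"].strip().lower() in wanted]
```

For example, passing ["Complete Genome", "Chromosome"] keeps both kinds of finished genome in a single pass.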

ncbi-genome-download: error: unrecognized arguments: --dry-run

I downloaded the current version, but --dry-run does not seem to be an option when I call

ncbi-genome-download --dry-run bacteria

Thank you for looking into this,
Adrian

Use batch rsync instead of FTP

The NCBI FTP server supports rsync://ftp.ncbi.... connections.

Could you create a "batch file" and do this via a single rsync call?

So incremental updates are easy and fast?
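
A rough sketch of the batch idea, under assumptions: the wanted files are written to a list, and a single rsync call reads that list through its --files-from option. The helper name, the list file, and the source root are all hypothetical; the source root has to be supplied by the caller because the server path is not spelled out above.

```python
def build_rsync_batch(relative_paths, list_path, source_root, dest_dir):
    """Write the file list and return the argument vector for one rsync call."""
    with open(list_path, "w") as handle:
        for path in relative_paths:
            # --files-from expects paths relative to the source root.
            handle.write(path.lstrip("/") + "\n")
    return ["rsync", "--times", "--files-from=" + list_path, source_root, dest_dir]
```

The returned list could be handed to subprocess.run(cmd, check=True). Because rsync skips files that are already up to date, rerunning the same batch would give the incremental behaviour asked about.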

Include --plasmid option?

Is it possible to only download plasmids? NCBI Genome has a plasmid-specific page, but I don't see that option in ncbi-genome-download.

Don't know how easy this is?

Create output of descriptors of downloaded genomes

Currently all genomes are downloaded as cryptic filenames, such as: "GCF_000469325.1.fna"

The FASTA header of that file is:
"NZ_KI271582.1 Lactobacillus shenzhenensis LY-73 genomic scaffold LY73.Scaffold1, whole genome shotgun sequence"

Is it possible that ncbi-genome-download also makes a list of filename + descriptor?

Example:
GCF_000469325.1 NZ_KI271582.1 Lactobacillus shenzhenensis LY-73 ...
GCF_000967245.1 NZ_KQ033877.1 Lactobacillus mellis strain Hon2 ...

etc

I am sure I can create something like that myself, but for linux-novices (as I am a bit) this would really enhance the tool 👍
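
A small sketch of such a list, assuming the FASTA files have already been downloaded and that only the first header of each file is wanted. Both function names are hypothetical and not part of ncbi-genome-download.

```python
import gzip
import os


def first_fasta_header(path):
    """Return the first FASTA header of a plain or gzipped file, without the '>'."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                return line[1:].strip()
    return ""


def write_descriptor_list(fasta_paths, output_path):
    """Write one 'file name<TAB>first header' line per FASTA file."""
    with open(output_path, "w") as out:
        for path in sorted(fasta_paths):
            out.write("{}\t{}\n".format(os.path.basename(path),
                                        first_fasta_header(path)))
```

A multi-record file contributes only its first header here, which matches the example list in the request.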

Human readable files/folders as Genus_species_strain

In my (older unreleased scruffier) version of this tool I create a hierarchy like this:

Kingdom/
   Genus/
       species/
            strain/
                    blah.gbk
                    blah.fna
                    blah.gff

I currently set blah to Genus_species_strain but that loses the GCA_xxxx accession. I was thinking of having a 'mirror' folder of symlinks with human readable names.

It was tricky to extract the strain as it appears in up to 3 different columns sometimes, but it mostly works.

The reason for this is to make it easy to work with sequences and get human readable labels etc.

Ability to download specific species

I know you added genus, but I really need species too.

For example, Staphylococcus aureus has 1000s of assemblies and I don't want the whole genus.

Improve handling of network outages during transfer

While downloading all bacteria, I received the errors below, both when running it the first time and when running it a second time (to correct the mismatches). Not sure if there's an actual problem or not.

bash-3.2$ ncbi-genome-download bacteria --parallel 5
ERROR: No entry for file ending in '_genomic.gbff.gz'
ERROR: Checksum mismatch for u'$HOME/tmp/bacteria/refseq/bacteria/GCF_000986765.1/GCF_000986765.1_ASM98676v1_genomic.gbff.gz'. Expected u'06e551ac87e510f4275fcd04302277b7', got 'ed004afb30643a1cae37e01d7fa93523'
ERROR: Checksum mismatch for u'$HOME/tmp/bacteria/refseq/bacteria/GCF_001650315.1/GCF_001650315.1_ASM165031v1_genomic.gbff.gz'. Expected u'fd17ff629c497bd8de1715f35a1c04eb', got '691d3769cb390b10495c5235b40bb993'
ERROR: Checksum mismatch for u'$HOME/tmp/bacteria/refseq/bacteria/GCF_000633455.1/GCF_000633455.1_de_novo_genomic.gbff.gz'. Expected u'b7b1441a00d8983e7ad0bcebc3c031e1', got '1e4b8ce848e5fd22caaed4ba76bd1f9f'
bash-3.2$ ncbi-genome-download bacteria --parallel 5
ERROR: No entry for file ending in '_genomic.gbff.gz'
bash-3.2$ python --version
Python 2.7.11
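
One way to make interrupted transfers less painful is to verify the MD5 after each attempt and retry on a mismatch. The sketch below is an assumption about how that could look, not the tool's implementation; fetch stands for any hypothetical callable that writes the file to the given path.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_with_retries(fetch, path, expected_md5, attempts=3):
    """Call fetch(path) until the file's MD5 matches or the attempts run out."""
    for _ in range(attempts):
        fetch(path)
        if md5sum(path) == expected_md5:
            return True
    return False
```

A truncated download then shows up as a failed attempt that is retried immediately, rather than as a checksum error that needs a second run.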

Human readable path names and NCBI underscore escaping

The NCBI don't just use standard percent escapes in the FTP site folder names, but instead replace certain characters with underscores.

e.g. using txid 613 problem cases include the # character:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/564/475/GCA_001564475.1_12082_3#81/GCA_001564475.1_12082_3#81_genomic.fna.gz
-->
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/564/475/GCA_001564475.1_12082_3_81/GCA_001564475.1_12082_3_81_genomic.fna.gz

and brackets ( and ):

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/468/075/GCA_000468075.1_Serratia_fanticola_AU-P3(3)/GCA_000468075.1_Serratia_fanticola_AU-P3(3)_genomic.fna.gz
-->
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/468/075/GCA_000468075.1_Serratia_fanticola_AU-P3_3_/GCA_000468075.1_Serratia_fanticola_AU-P3_3__genomic.fna.gz

This is handled here in a related script which I have contributed to, and which also fetches genomes by taxon from the NCBI FTP site:

widdowquinn/pyani@585b8b6

This code replaces the following with an underscore: white space including , slash \, comma ,, hash #, brackets ( and ).

There is something similar in ncbi-genome-download where (if I have understood the context correctly) you make a local folder name which would ideally match the NCBI FTP naming convention:

https://github.com/kblin/ncbi-genome-download/blob/0.2.4/ncbi_genome_download/core.py#L375
https://github.com/kblin/ncbi-genome-download/blob/0.2.4/ncbi_genome_download/core.py#L574

This code replaces the following with an underscore: space, semi-colon ;, slash /, back-slash \.

It seems likely to me the two scripts have overlapping subsets of the full list of characters which the NCBI replaces with an underscore - but I'm not 100% sure of the use-case for your cleanup function.
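
As a sketch, the union of the two character lists above can be expressed as a single regular expression. This is an assumption built from those two lists and the two example URLs, not NCBI's documented rule set, and ncbi_style_name is a hypothetical name.

```python
import re

# Whitespace, semi-colon, comma, slash, back-slash, hash and brackets.
# Assumed union of the two scripts' lists, not an official NCBI specification.
_REPLACED_BY_UNDERSCORE = re.compile(r"[\s;,/\\#()]")


def ncbi_style_name(label):
    """Replace each listed character with an underscore, one for one."""
    return _REPLACED_BY_UNDERSCORE.sub("_", label)
```

Applied to the two problem cases, this maps GCA_001564475.1_12082_3#81 to GCA_001564475.1_12082_3_81 and GCA_000468075.1_Serratia_fanticola_AU-P3(3) to GCA_000468075.1_Serratia_fanticola_AU-P3_3_.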

Allow creation of --human-readable hierarchy if files are already downloaded

Currently the links in the human_readable folder are just created for records that are downloaded while ncbi-genome-download is being called with --human-readable. If a file is already up-to-date with a good checksum match from a previous run without --human-readable, no link will be created.

The logic should be changed to allow for a run with --human-readable to create links for already downloaded files as well.

Download the correct FASTA file

There are multiple files that end in _genomic.fna.gz (see ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Shewanella_sp._cp20/latest_assembly_versions/GCF_000832025.1_ASM83202v1/md5checksums.txt), so using endswith("_genomic.fna.gz") alone doesn't work.
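
One possible fix, sketched under the assumption that the wanted file is always named after its assembly directory (as in the URLs quoted in the underscore-escaping issue): match the full expected name instead of the suffix. The function name and the sibling file names in the usage note are illustrative.

```python
import posixpath


def select_genomic_fasta(assembly_dir, filenames):
    """Return the entry named '<assembly dir name>_genomic.fna.gz', or None."""
    wanted = posixpath.basename(assembly_dir.rstrip("/")) + "_genomic.fna.gz"
    for name in filenames:
        # Compare base names so that a leading './' in the listing is ignored.
        if posixpath.basename(name) == wanted:
            return name
    return None
```

With this check, a sibling such as GCF_000832025.1_ASM83202v1_cds_from_genomic.fna.gz is ignored even though it ends in _genomic.fna.gz.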

Human readable producing non-unique folders

The strain column isn't unique. Might need to detect this and append the GCF_ number to the strain to discriminate?

/home/tseemann/tmp/B.cereus/human_readable/refseq/bacteria/Bacillus/cereus/E33L/GCF_000011625.1_ASM1162v1_genomic.fna.gz

/home/tseemann/tmp/B.cereus/human_readable/refseq/bacteria/Bacillus/cereus/E33L/GCF_000833045.1_ASM83304v1_genomic.fna.gz
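
A sketch of the suggested disambiguation, assuming the records are available as (accession, strain) pairs before any folders are created. strain_folder_names is a hypothetical helper.

```python
from collections import Counter


def strain_folder_names(records):
    """Map each accession to a folder name, adding the accession when strains collide."""
    records = list(records)
    counts = Counter(strain for _, strain in records)
    names = {}
    for accession, strain in records:
        if counts[strain] > 1:
            # Strain label is shared, so the accession is needed to tell folders apart.
            names[accession] = "{}_{}".format(strain, accession)
        else:
            names[accession] = strain
    return names
```

Appending the accession only on collision keeps most names short, but a folder name can then change between runs once a second assembly of the same strain appears. Always appending it would be the more stable choice.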

Print a helpful message if no sequences were downloaded due to the filter options

If you specify a filter option that causes ncbi-genome-download to not download anything, this looks like the tool mysteriously failed. Instead, print out a helpful error message.

Is it possible to use this script for higher taxon-levels than "Genus"?

I frequently do genome comparisons between all currently known genomes of a given family, order, class or sometimes even phylum of bacteria. Downloading all currently available reference genomes can be laborious, so I was glad to find your script here.

However, currently I would like to download all available NCBI genomes of the phylum "Chloroflexi", no matter which genus or species, as long as it's classified as "Chloroflexi". If possible, I would like to include draft genomes which have not been assigned to a specific genus yet.

I tried using your script with the "-t" option, giving the TaxID for the phylum "Chloroflexi" (which would be 200795). My exact program call was first: ncbi-genome-download -F genbank -l all -t 200795 bacteria.
then I tried with ncbi-genome-download -F genbank -l all -T 200795 bacteria

However nothing was downloaded. I notice that you included a separate argument for "genus". Does that mean higher taxonomic levels than "genus" are not supported?
Would it be possible/feasible to support higher taxon-levels in the future?