
parallel-fastq-dump's People

Contributors

kfuku52, nh13, rvalieris


parallel-fastq-dump's Issues

Problem with large files

Hello,

Thank you for the useful tool! I've used it on downloaded SRA files for my projects and things sped up quite a lot. However, when I tried it on a very large file, e.g. SRR3192525 (a 30 GB archive), it crashed with a "not enough storage" message, while serial fastq-dump finished the job fine.

Does it use the default Python temp directory or something like that? Maybe it would make sense to set the temp dir to the directory in which the tool is run?
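For context, Python's tempfile module (which parallel-fastq-dump uses for its scratch space, as tracebacks further down this page show) resolves its default directory from the TMPDIR, TEMP and TMP environment variables before falling back to /tmp. A minimal sketch of that behavior and of redirecting it, with "." standing in for a user-supplied --tmpdir:

import tempfile

# gettempdir() consults TMPDIR, TEMP and TMP, then falls back to /tmp.
print(tempfile.gettempdir())

# Passing dir= explicitly redirects the scratch space; "." stands in for
# a --tmpdir argument, so large intermediates land in the working directory.
with tempfile.TemporaryDirectory(prefix="pfd_", dir=".") as tmp_dir:
    print("scratch space:", tmp_dir)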

Cannot find temporary files

So the program creates temporary files in /tmp/pfd_0hm17d1p but at the end, it cannot find the second fastq file: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pfd_0hm17d1p/10/SRR4253583_2.fastq'

$ parallel-fastq-dump --sra-id SRR4253583 -t 16 -O out/ --split-files
SRR ids: ['SRR4253583']
extra args: ['--split-files']
SRR4253583 spots: 143228176
tempdir: /tmp/pfd_0hm17d1p
blocks: [[1, 8951761], [8951762, 17903522], [17903523, 26855283], [26855284, 35807044], [35807045, 44758805], [44758806, 53710566], [53710567, 62662327], [62662328, 71614088], [71614089, 80565849], [80565850, 89517610], [89517611, 98469371], [98469372, 107421132], [107421133, 116372893], [116372894, 125324654], [125324655, 134276415], [134276416, 143228176]]
Rejected 8951761 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Rejected 8951761 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Rejected 8951761 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Rejected 8951761 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Rejected 8951761 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Rejected 8951761 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Rejected 1880687 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Traceback (most recent call last):
  File "/home/$username/anaconda3/bin/parallel-fastq-dump", line 4, in <module>
    __import__('pkg_resources').run_script('parallel-fastq-dump==0.6.2', 'parallel-fastq-dump')
  File "/home/$username/anaconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 750, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/$username/anaconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1534, in run_script
    exec(script_code, namespace, namespace)
  File "/home/$username/anaconda3/lib/python3.6/site-packages/parallel_fastq_dump-0.6.2-py3.6.egg/EGG-INFO/scripts/parallel-fastq-dump", line 100, in <module>
  File "/home/$username/anaconda3/lib/python3.6/site-packages/parallel_fastq_dump-0.6.2-py3.6.egg/EGG-INFO/scripts/parallel-fastq-dump", line 93, in main
  File "/home/$username/anaconda3/lib/python3.6/site-packages/parallel_fastq_dump-0.6.2-py3.6.egg/EGG-INFO/scripts/parallel-fastq-dump", line 43, in pfd
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pfd_0hm17d1p/10/SRR4253583_2.fastq'

Thanks!
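A plausible reading of the log above: with --split-files, a block whose second reads are all rejected (READLEN < 1) never produces a _2.fastq, so the final merge trips over the missing file. A minimal sketch of a merge step that tolerates absent per-block parts, assuming the tmp_dir/<block>/<name> layout visible in the traceback (this is not the project's actual merge code):

import os
import shutil

def merge_blocks(tmp_dir, n_blocks, name, out_path):
    """Concatenate per-block FASTQ parts in order, skipping blocks with no file."""
    with open(out_path, "wb") as out:
        for i in range(n_blocks):
            part = os.path.join(tmp_dir, str(i), name)
            if not os.path.exists(part):  # e.g. every read in this block was filtered
                continue
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

# Example: merge_blocks("/tmp/pfd_0hm17d1p", 16, "SRR4253583_2.fastq", "out/SRR4253583_2.fastq")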

Unknown error

Hello,

Thank you for developing this tool. I used v0.6.6 to download some SRA data before and it was fine. Now, I'm trying to download a different dataset, and parallel-fastq-dump is strangely failing, producing logs like the ones below. I'm not sure whether it is a parallel-fastq-dump problem, or a fastq-dump or prefetch problem per se, but it is reproducible on my system for all 84 samples in the dataset. I'm attaching the report mentioned in the log as well.

Would you have any suggestions as to what could be causing this error and how to work around it? Any pointers you may be able to provide are highly appreciated.

Best,
Azza

$ parallel-fastq-dump -V
parallel-fastq-dump : 0.6.6

"fastq-dump" version 2.10.8

$ parallel-fastq-dump --sra-id SRR9070188 --threads 4 --split-files --gzip
SRR ids: ['SRR9070188']
extra args: ['--split-files', '--gzip']
tempdir: /tmp/pfd_5k4ylnch
SRR9070188 spots: 323486139
blocks: [[1, 80871534], [80871535, 161743068], [161743069, 242614602], [242614603, 323486139]]
2020-12-07T13:22:07 fastq-dump.2.10.8 err: timeout exhausted while waiting condition within process system module - failed SRR9070188

=============================================================
An error occurred during processing.
A report was generated into the file '/home/azzaea/ncbi_error_report.txt'.
If the problem persists, you may consider sending the file
to '[email protected]' for assistance.
=============================================================

2020-12-07T13:39:06 fastq-dump.2.10.8 err: transfer canceled while allocating buffer within file system module - Cannot KHttpFileTimedReadChunked: to=480
fastq-dump quit with error code 3
2020-12-07T14:08:23 fastq-dump.2.10.8 err: timeout exhausted while waiting condition within process system module - failed SRR9070188

=============================================================
An error occurred during processing.
A report was generated into the file '/home/azzaea/ncbi_error_report.txt'.
If the problem persists, you may consider sending the file
to '[email protected]' for assistance.
=============================================================

2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='6'
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - failed SRR9070188
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='5'
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - failed SRR9070188
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='5'
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='5'
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='6'
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='6'

=============================================================
An error occurred during processing.
A report was generated into the file '/home/azzaea/ncbi_error_report.txt'.
If the problem persists, you may consider sending the file
to '[email protected]' for assistance.
=============================================================


=============================================================
An error occurred during processing.
A report was generated into the file '/home/azzaea/ncbi_error_report.txt'.
If the problem persists, you may consider sending the file
to '[email protected]' for assistance.
=============================================================

fastq-dump quit with error code 3
fastq-dump quit with error code 3
fastq-dump error! exit code: 3

ncbi_error_report.txt

downloads are slower if maxSpotId is set higher than n_spots

This is an issue when using maxSpotId to make sure no more than N spots are downloaded (e.g., when there are some very large RNA-Seq experiments I want to ignore).

For example, in this case there are only 5.4 million spots so the third thread does not do anything.
This makes the download slower than not using -X 10000000.

$ parallel-fastq-dump -X 10000000 -t 3 -s SRR868679
SRR ids: ['SRR868679']
extra args: []
tempdir: /tmp/pfd_k2htn18j
SRR868679 spots: 5487730
blocks: [[1, 3333333], [3333334, 6666666], [6666667, 10000000]]
Read 2154397 spots for SRR868679
Written 2154397 spots for SRR868679
Read 3333333 spots for SRR868679
Written 3333333 spots for SRR868679

I believe the fix is just:

end = min(n_spots, args.maxSpotId) if args.maxSpotId is not None else n_spots

Thanks for the useful tool!
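To see where the one-line fix slots in, here is a sketch of the block computation it patches, reconstructed from the log output above rather than from the project source (n_spots comes from sra-stat, max_spot_id from -X):

def compute_blocks(n_spots, n_threads, max_spot_id=None):
    """Split [1, end] into n_threads contiguous ranges, clamping end to n_spots."""
    end = min(n_spots, max_spot_id) if max_spot_id is not None else n_spots
    size = end // n_threads
    blocks = []
    for i in range(n_threads):
        start = i * size + 1
        stop = end if i == n_threads - 1 else (i + 1) * size
        blocks.append([start, stop])
    return blocks

# With the clamp, -X 10000000 on a 5,487,730-spot run keeps all three
# threads busy instead of leaving the third block empty:
print(compute_blocks(5487730, 3, 10000000))
# [[1, 1829243], [1829244, 3658486], [3658487, 5487730]]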

parallel-fastq-dump can't download some old datasets

Hi,
thank you for writing this neat tool - it nicely covers a critical gap in ncbi-tools!
I started using it in our lab's snakemake pipelines and found that it can't deal with some old datasets.
In particular, it froze when it tried to download SRR567536 (from https://www.ncbi.nlm.nih.gov/sra/SRX183520). I used the following command:
parallel-fastq-dump --origfmt --split-files --gzip -s SRR567537 -t 8

At the same time, ncbi-tools can download it:
fastq-dump --origfmt --split-files --gzip SRR567536

I also tried prefetch-ing and it didn't help.

Anton.

query unauthorized while resolving query within virtual file

Hi,

I installed parallel-fastq-dump on a CentOS cluster using conda. I get the error below when I run the example command.

$ which sra-stat.2.9.6
~/.conda/envs/parallel-fastq-dump/bin/sra-stat.2.9.6

Error:

2019-08-15T20:36:57 sra-stat.2.9.6 err: query unauthorized while resolving query within virtual file system module - failed to resolve accession 'SRR1219899' - Access denied - please request permission to access phs000710/UR in dbGaP ( 403 )
2019-08-15T20:36:57 sra-stat.2.9.6 int: directory not found while opening manager within virtual file system module - 'SRR1219899'
SRR ids: ['SRR1219899']
extra args: ['--split-files', '--gzip']
tempdir: /tmp/uge/61368154.1.secondary.q/pfd_do4anflw
Traceback (most recent call last):
  File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 4, in <module>
    __import__('pkg_resources').run_script('parallel-fastq-dump==0.6.5', 'parallel-fastq-dump')
  File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1460, in run_script
    exec(script_code, namespace, namespace)
  File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/lib/python3.7/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 112, in <module>
  File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/lib/python3.7/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 105, in main
  File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/lib/python3.7/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 15, in pfd
  File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/lib/python3.7/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 64, in get_spot_count
IndexError: list index out of range

Thanks

an integer is required (got type _io.BufferedWriter)

Hi, I am trying to use pfd to convert SRA files to fastq files in a pipeline, on a cluster with 100 cores. The following is my code; the variable word1 holds the SRR id.

time $path_to_parallel_fastq_dump --threads 100 --sra-id $word1 --tmpdir $path_to_sra_folder --outdir $path_to_fastq_folder$word1.fastq |& tee $path_to_stats_folder/sra_fastq/fout1_$word1.txt

But, I am getting following error:
File "/home/csb/tools/parallel-fastq-dump-master/parallel-fastq-dump", line 123, in
main()
File "/home/csb/tools/parallel-fastq-dump-master/parallel-fastq-dump", line 116, in main
pfd(args, si, extra_args)
File "/home/csb/tools/parallel-fastq-dump-master/parallel-fastq-dump", line 50, in pfd
os.close(fd)
TypeError: an integer is required (got type _io.BufferedWriter)

Can you please help ?
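The traceback shows os.close(fd) receiving a Python file object. os.close() only accepts integer OS-level descriptors; file objects are closed via their own .close() method (or expose their descriptor through .fileno()). A small standalone illustration of the difference:

import os
import tempfile

fd, path = tempfile.mkstemp()  # fd is an int, so os.close(fd) is correct
os.close(fd)
os.unlink(path)

f = open("example.txt", "wb")  # f is an _io.BufferedWriter, not an int
try:
    os.close(f)                # raises: an integer is required (got type _io.BufferedWriter)
except TypeError as e:
    print(e)
f.close()                      # the correct way to close a file object
os.unlink("example.txt")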

does it work with the -Z option?

Looks like parallel-fastq-dump writes out multiple different parts of the output file simultaneously.
So I assume it can't be used with fastq-dump's -Z option which writes a single stream to stdout.

But thought I'd make sure. Is that the case?

I usually use this option to stream the fastq output directly to Amazon S3 since I have limited disk space available (these fastq files get big) but S3 provides unlimited storage.

It looks like parallel-fastq-dump writes out N files, where N is the number of threads. Are these concatenated together with the equivalent of cat? If so I could write to N files which are actually named pipes that stream to S3, and then concatenate them myself. Since I don't see an option to suppress the concatenation, I probably need to mess with the source... Any thoughts on this?

Thanks.
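On the "equivalent of cat" question: conceptually, the per-block outputs only need to be appended in block order, so a stdout-streaming merge is possible. A sketch of that idea (paths and the S3 pipe are illustrative, and this is not the tool's actual merge code):

import shutil
import sys

def stream_parts(part_paths):
    """cat-equivalent: copy each part to stdout in order, without loading whole files."""
    for path in part_paths:
        with open(path, "rb") as src:
            shutil.copyfileobj(src, sys.stdout.buffer)

# Example: stream_parts([f"/tmp/pfd_example/{i}/SRR000001.fastq" for i in range(4)])
# Shell usage could then pipe onward, e.g.: python stream.py | aws s3 cp - s3://bucket/SRR000001.fastq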

it never ends...

Hello,
I started a download with the following command:
parallel-fastq-dump --sra-id SRR9089604 --threads 10 --outdir out/ --tmpdir temp/
This SRR is about 80 GB. The temp folder has already grown past 100 GB and it keeps going... is that normal?

thanks for your help

Alex

point to a local file?

I have access to some restricted SRA files on dbGaP which I have downloaded. When executing $ parallel-fastq-dump -s SRR.sra I get an error message.

2018-05-08T14:34:42 fastq-dump.2.8.2 err: file not found while opening file within network system module - FailekeHttpFileInt('https://gap-download.be-md.ncbi.nlm.nih.gov/sragap/D3B39D27-6567-4F0A-87EA-2707BC243B8B/SRR100758B0E3F-7D85-4391-84B7-2C1B31A30C5B' (130.14.250.15))

It looks like the command is attempting to generate the fastq files directly from the online SRA files. However, as they are protected, it presumably fails to find them. I have uploaded the correct .ngc file for permission, as per the instructions at https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=dbgap_use, to the new version of SRA tools bundled with parallel-fastq-dump.

Is there any way to direct this command to a local file, or is there another solution you could suggest?

Many thanks, any help appreciated.

james

Multiple SRA to FASTQ conversion issue

Hello,

This might be a novice problem, but when I try to convert multiple files within an SRA group (i.e. SRA227673, 33 files) I get 0-byte files... I can get the FASTQ files when I do individual files. I am using an iMac desktop and terminal version 2.7.3. The way I tried doing multiple files is:

$ parallel-fastq-dump --split-3 --outdir out/ ---sra-id SRR174709[0-9]

I did [#] for the last one because within this range there are 2 files (I wanted to test whether it worked before trying all 33 files). Any help will be greatly appreciated!!
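Two things stand out in the command above: --sra-id has a triple dash, and the shell only expands a bracket pattern like SRR174709[0-9] against existing local file names, so the literal string reaches the tool. A hedged sketch of looping over explicit accessions instead, one invocation per run (the accession list is illustrative):

import subprocess

runs = [f"SRR174709{i}" for i in range(10)]  # hypothetical expansion of SRR174709[0-9]

for run in runs:
    subprocess.run(
        ["parallel-fastq-dump", "--split-3", "--outdir", "out/", "--sra-id", run],
        check=True,  # stop the batch if one conversion fails
    )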

OSError: [Errno 39] Directory not empty: '2'

Hello!

This is my first time running parallel-fastq-dump, so I apologize if this is an easy fix. Below is the output I receive from the program.

(base) [arglatha]$ parallel-fastq-dump --sra-id SRR6820613 --threads 6 --outdir ./SRR6820613/ --split-files --gzip --tmpdir ./tmp/
SRR ids: ['SRR6820613']
extra args: ['--split-files', '--gzip']
tempdir: ./tmp/pfd_kh5jeeyw
SRR6820613 spots: 75623640
blocks: [[1, 12603940], [12603941, 25207880], [25207881, 37811820], [37811821, 50415760], [50415761, 63019700], [63019701, 75623640]]
2019-08-12T15:04:38 fastq-dump.2.9.6 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
2019-08-12T15:04:38 fastq-dump.2.9.6 int: storage exhausted while writing file within file system module - switching cache-tee-file to read-only
2019-08-12T15:04:38 fastq-dump.2.9.6 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
2019-08-12T15:04:38 fastq-dump.2.9.6 int: storage exhausted while writing file within file system module - switching cache-tee-file to read-only
2019-08-12T15:04:38 fastq-dump.2.9.6 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
2019-08-12T15:04:38 fastq-dump.2.9.6 int: storage exhausted while writing file within file system module - switching cache-tee-file to read-only
2019-08-12T15:04:38 fastq-dump.2.9.6 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
2019-08-12T15:04:38 fastq-dump.2.9.6 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
2019-08-12T15:04:38 fastq-dump.2.9.6 int: storage exhausted while writing file within file system module - switching cache-tee-file to read-only
2019-08-12T15:04:38 fastq-dump.2.9.6 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
2019-08-12T15:04:38 fastq-dump.2.9.6 int: storage exhausted while writing file within file system module - switching cache-tee-file to read-only
2019-08-12T15:04:38 fastq-dump.2.9.6 int: storage exhausted while writing file within file system module - switching cache-tee-file to read-only
2019-08-12T15:13:04 fastq-dump.2.9.6 sys: error unknown while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
2019-08-12T15:15:52 fastq-dump.2.9.6 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
2019-08-12T15:18:40 fastq-dump.2.9.6 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
2019-08-12T15:58:34 fastq-dump.2.9.6 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
fastq-dump error! exit code: -24
Exception ignored in: <finalize object at 0x7f29c0854460; dead>
Traceback (most recent call last):
  File "/util/common/bioinformatics/bioconda/miniconda3-4.6.14/lib/python3.7/weakref.py", line 552, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/util/common/bioinformatics/bioconda/miniconda3-4.6.14/lib/python3.7/tempfile.py", line 795, in _cleanup
    _shutil.rmtree(name)
  File "/util/common/bioinformatics/bioconda/miniconda3-4.6.14/lib/python3.7/shutil.py", line 491, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/util/common/bioinformatics/bioconda/miniconda3-4.6.14/lib/python3.7/shutil.py", line 433, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())
  File "/util/common/bioinformatics/bioconda/miniconda3-4.6.14/lib/python3.7/shutil.py", line 431, in _rmtree_safe_fd
    os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: '2'

Any advice would be appreciated!

conda install failed on MacOS X Yosemite 10.10.5

There are two parts of the error (given below), though they could be related:

  1. Solving environment failed
  2. PackageNotFoundError

Is this because the package is not available for Mac OS?

I did do "conda update conda" to make sure my conda is up to date (conda/4.5.0).
And I tried install another package using conda and it worked fine. (conda install geopy)

Error message is given below:


$ conda install parallel-fastq-dump
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  • parallel-fastq-dump

Current channels:


storage exhausted while writing file within file system module

Hi,

I am trying to download several reads. Eg using the following command:

parallel-fastq-dump --sra-id SRR925794 --threads 32 --gzip

But when I do, I get this error message:

fastq-dump.2.9.1 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='7'

I am running this on an AWS EC2 instance which has plenty of disk space, and yet I am still getting this error. However, I am not sure whether parallel-fastq-dump is running inside one of the mounted volumes, which have less disk space. These volumes appear to be mounted on the instance and I am not sure how else they might have been created.

Do you know if this is the case and if so how I can change the location where the command is run to prevent this error.

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      394G   56G  339G  15% /
devtmpfs        121G   96K  121G   1% /dev
tmpfs           121G     0  121G   0% /dev/shm
/dev/dm-3       9.8G  4.1G  5.2G  45% /var/lib/docker/devicemapper/mnt/686a42157daedddaf1f9e187ea042311bbc553e466013fb79adc4bec8da51432
shm              64M     0   64M   0% /var/lib/docker/containers/dada83892501925da80d6abd9c27cc048c1e2c3fb90b6f587d051ca0f5e8c12c/shm

Also, do you know how I can get it to run any faster? For example, how does parallel-fastq-dump compare to fasterq-dump? I tried running that instead and it seems it may be a bit slower. And what is the optimal value for --threads? I know fasterq-dump shows diminishing returns as the thread count grows; does parallel-fastq-dump behave the same way?

superslow

I download the same data with wget ten times faster. As an example,
wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra
takes 60 seconds, with 2 minutes of follow-up extraction, while parallel-fastq-dump spends half an hour with 4 threads.

temp file attribute error

Hi, apologies that this is probably user error in my environment, but when I attempt to run parallel-fastq-dump using this command:

parallel-fastq-dump --sra-id SRR1930123 --threads 4 --outdir ./ --split-files --gzip

I get this error (with traceback):

Traceback (most recent call last):
  File "/usr/bin/parallel-fastq-dump", line 100, in <module>
    main()
  File "/usr/bin/parallel-fastq-dump", line 93, in main
    pfd(args, si, extra_args, n_spots)
  File "/usr/bin/parallel-fastq-dump", line 12, in pfd
    tmp_dir = tempfile.TemporaryDirectory(prefix="pfd_",dir=args.tmpdir)
AttributeError: 'module' object has no attribute 'TemporaryDirectory'

I'm running this on CentOS, and I noticed that the shebang of the parallel-fastq-dump script calls for Python 2: #!/usr/bin/python2
Is this an error? I installed with pip install, and the version is 0.6.2. Thanks for your help!
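tempfile.TemporaryDirectory was added in Python 3.2, so a python2 shebang guarantees this AttributeError; running the script under Python 3 is the real fix. For completeness, a rough Python 2-compatible stand-in can be built from mkdtemp plus shutil.rmtree (a sketch, not the project's code):

import contextlib
import shutil
import tempfile

@contextlib.contextmanager
def temporary_directory(prefix="pfd_", dir=None):
    """Minimal stand-in for tempfile.TemporaryDirectory on Python 2."""
    path = tempfile.mkdtemp(prefix=prefix, dir=dir)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)

with temporary_directory() as tmp:
    print(tmp)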

The dependency sra-tools in conda is outdated and will trigger errors

When installing parallel-fastq-dump from conda, it also pulls in sra-tools v2.8.0 from conda, even if I have a local sra-tools. That version is outdated and causes network connection errors. This should be fixed.


The following NEW packages will be INSTALLED:

  parallel-fastq-dump bioconda/noarch::parallel-fastq-dump-0.6.7-pyhdfd78af_0
  sra-tools          bioconda/linux-64::sra-tools-2.8.0-0


Question about using parallel-fastq-dump with prefetch

Hi,

I was wondering whether I can run the following to download the data. I know this is possible with fastq-dump. I just wanted to confirm it is also possible with parallel-fastq-dump 0.6.6.

prefetch [runid] && vdb-validate [runid] && parallel-fastq-dump --outdir [out] --skip-technical --split-3 --sra-id [runid] --gzip

Thank you!

Usage

Hello, I want to use your parallel-fastq-dump as it would help me a lot in my studies.
It seems I have issues using the right syntax somehow... Every time I try to use it I get the ERROR:
File "/home/hpc/t1172/di36dih/bin/parallel-fastq-dump", line 29
p = subprocess.Popen(["fastq-dump", "-N", str(out[i][0]), "-X", str(out[i][1]), "-O", d, *extra_args, args.sra_id])
Could you send one or two example calls so I understand my problem?

Thanks in advance,
Sincerely, TripleB

fastq-dump uses network even though I prefetched

I have ERR3240205 and ERR3240205.vdbcache files retrieved through SRA Cloud Data Delivery. I then ran:

parallel-fastq-dump -s ERR3240193 -t 2 -O out --tmpdir tmp --split-files --gzip

I found the two fastq-dump processes and ran strace -f -e trace=network -p <pid> on each. I found that the fastq-dump that starts at the beginning of the SRA file does not use network IO, while the one that starts mid-range does, and I was wondering why, given that I've downloaded the file.

What does it mean if --split-files is used but I still only get a single output file for each input file?

Hi, this is more of a question about fastq, but not sure where I should go to get help with it. I would appreciate it if anyone here can help me!

What does it mean if --split-files is used but I still only get a single output file (i.e. I get ..._1.fastq but not ..._2.fastq)? My understanding is that I should get 2 output files.

import os

command = (
    'parallel-fastq-dump/parallel-fastq-dump '
    f'--outdir {data_dirpath}/sratofastq/{sra_id} '
    '--threads 32 '
    '--split-files '
    '--tmpdir tmpdir '
    f'-s {data_dirpath}/pysradb_downloads/{sra_id}/SRX2536403/SRR5227288.sra'
)
print(command)
print(os.system(command))

Problem installing in Linux

When trying to install the program using either pip or setup.py, the following error appears:

Traceback (most recent call last):
  File "setup.py", line 8, in <module>
    open(pname).read().split("\n")
TypeError: list object is not an iterator

Problem running parallel-fastq-dump on a cluster

Hi,
Sorry for bothering you with a very basic problem...
I have tried to install parallel-fastq-dump locally on our cluster after cloning it from git :

$ python setup.py --prefix=/data/users/scollomb/programs
$echo $PYTHONPATH
/bioinfo/local/build/Centos/python/python-2.7.12/lib/python2.7/site-packages/:/data/users/scollomb/programs/lib/python2.7/site-packages

but I cannot run it :

parallel-fastq-dump --tmpdir /data/tmp/scollomb --threads 10 -O ./ -s SRR3933583 --split-3 --gzip
SRR ids: ['SRR3933583']
extra args: ['--split-3', '--gzip']
Traceback (most recent call last):
  File "/data/users/scollomb/programs/bin/parallel-fastq-dump", line 4, in <module>
    __import__('pkg_resources').run_script('parallel-fastq-dump==0.6.2', 'parallel-fastq-dump')
  File "/bioinfo/local/build/Centos/python/python-2.7.12/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/bioinfo/local/build/Centos/python/python-2.7.12/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1501, in run_script
    exec(script_code, namespace, namespace)
  File "/data/users/scollomb/programs/lib/python2.7/site-packages/parallel_fastq_dump-0.6.2-py2.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 100, in <module>
    
  File "/data/users/scollomb/programs/lib/python2.7/site-packages/parallel_fastq_dump-0.6.2-py2.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 92, in main
    
  File "/data/users/scollomb/programs/lib/python2.7/site-packages/parallel_fastq_dump-0.6.2-py2.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 47, in get_spot_count
    
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Any idea what is going on?

Download all reads for given experiment accession

Thanks for the great tool! As an enhancement, it would be great if I could give the program an experiment (or sample) accession and have it download reads for all run accessions that correspond to the experiment. This is a feature of the standard fastq-dump program.
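Until such a feature exists, run accessions can be resolved externally and fed to the tool one at a time; for instance, ENA's portal API has a filereport endpoint that maps an experiment or sample accession to its runs. A sketch under that assumption (endpoint behavior may change; error handling omitted; the accession reuses SRX183520 from an issue above):

import subprocess
import urllib.request

def runs_for_accession(acc):
    """Ask ENA's filereport for the run accessions under an experiment/sample."""
    url = ("https://www.ebi.ac.uk/ena/portal/api/filereport"
           f"?accession={acc}&result=read_run&fields=run_accession")
    with urllib.request.urlopen(url) as r:
        lines = r.read().decode().strip().splitlines()
    return [line.split("\t")[0] for line in lines[1:]]  # skip the TSV header

for run in runs_for_accession("SRX183520"):
    subprocess.run(["parallel-fastq-dump", "--sra-id", run, "--threads", "4",
                    "--outdir", "out/", "--split-files", "--gzip"], check=True)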

Failed to retrieve results error

Hi,

I am running into this error - I installed parallel-fastq-dump on an AWS EC2 instance using conda. Any insights would be wonderful. I suspect this may be an error with fastq-dump itself, but I could not find any information on it either.

My code:
parallel-fastq-dump --sra-id ERS1829683 --threads 56 --outdir ./ --split-files
Output:
SRR ids: ['ERS1829683']
extra args: ['--split-files']
tempdir: /tmp/pfd_9bpf3fvs
ERS1829683 spots: 13339116
blocks: [[1, 238198], [238199, 476396], [476397, 714594], [714595, 952792], [952793, 1190990], [1190991, 1429188], [1429189, 1667386], [1667387, 1905584], [1905585, 2143782], [2143783, 2381980], [2381981, 2620178], [2620179, 2858376], [2858377, 3096574], [3096575, 3334772], [3334773, 3572970], [3572971, 3811168], [3811169, 4049366], [4049367, 4287564], [4287565, 4525762], [4525763, 4763960], [4763961, 5002158], [5002159, 5240356], [5240357, 5478554], [5478555, 5716752], [5716753, 5954950], [5954951, 6193148], [6193149, 6431346], [6431347, 6669544], [6669545, 6907742], [6907743, 7145940], [7145941, 7384138], [7384139, 7622336], [7622337, 7860534], [7860535, 8098732], [8098733, 8336930], [8336931, 8575128], [8575129, 8813326], [8813327, 9051524], [9051525, 9289722], [9289723, 9527920], [9527921, 9766118], [9766119, 10004316], [10004317, 10242514], [10242515, 10480712], [10480713, 10718910], [10718911, 10957108], [10957109, 11195306], [11195307, 11433504], [11433505, 11671702], [11671703, 11909900], [11909901, 12148098], [12148099, 12386296], [12386297, 12624494], [12624495, 12862692], [12862693, 13100890], [13100891, 13339116]]
failed to retrieve result
(the "failed to retrieve result" line repeats many times)
fastq-dump error! exit code: 22
Read 238198 spots for ERR2040621
Written 238198 spots for ERR2040621
(the Read/Written pair repeats for each completed block)

IndexError: list index out of range

$ parallel-fastq-dump -s ERS1444621 --outdir pfd_output --tmpdir pfd_tmp
SRR ids: ['ERS1444621']
extra args: []
tempdir: pfd_tmp/pfd___5awz6o
2021-04-11T06:28:36 sra-stat.2.8.2 int: directory not found while opening manager within virtual file system module - 'ERS1444621'
Traceback (most recent call last):
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgqc/.snakemake/conda/af0470e0f4e49441f0a6d5af028c9398/bin/parallel-fastq-dump", line 119, in <module>
    main()
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgqc/.snakemake/conda/af0470e0f4e49441f0a6d5af028c9398/bin/parallel-fastq-dump", line 112, in main
    pfd(args, si, extra_args)
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgqc/.snakemake/conda/af0470e0f4e49441f0a6d5af028c9398/bin/parallel-fastq-dump", line 15, in pfd
    n_spots = get_spot_count(srr_id)
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgqc/.snakemake/conda/af0470e0f4e49441f0a6d5af028c9398/bin/parallel-fastq-dump", line 65, in get_spot_count
    total += int(l.split("|")[2].split(":")[0])
IndexError: list index out of range

The output from subprocess.Popen(["sra-stat", "--meta", "--quick", sra_id], stdout=subprocess.PIPE) is just [''], which is causing the IndexError
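The parsing assumes every sra-stat stdout line looks like accession|member|spots:bases:..., which breaks as soon as sra-stat emits nothing but errors (newer versions of the tool, quoted further down this page, already raise a clearer "sra-stat output parsing error"). A defensive version of the spot count, sketched from the fragments quoted in these issues rather than from the project source:

import subprocess

def get_spot_count(sra_id):
    p = subprocess.run(["sra-stat", "--meta", "--quick", sra_id],
                       capture_output=True, text=True)
    total = 0
    for line in p.stdout.splitlines():
        fields = line.split("|")
        if len(fields) < 3:  # skip blank or diagnostic lines instead of crashing
            continue
        total += int(fields[2].split(":")[0])
    if total == 0:           # surface sra-stat's stderr instead of an IndexError
        raise RuntimeError(f"sra-stat gave no usable output for {sra_id}:\n{p.stderr}")
    return total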

Error message appears after running

Hello, I'm getting this error during an analysis.
I'm using Ubuntu on WSL.

$ parallel-fastq-dump --threads 8 --split-files --gzip --sra-id SRR2244401
2022-10-05 10:14:30,278 - SRR ids: ['SRR2244401']
2022-10-05 10:14:30,278 - extra args: ['--split-files', '--gzip']
2022-10-05 10:14:30,283 - tempdir: /tmp/pfd_hec73ych
2022-10-05 10:14:30,283 - CMD: sra-stat --meta --quick SRR2244401
Traceback (most recent call last):
  File "/home/takehiro/miniconda3/bin/parallel-fastq-dump", line 116, in get_spot_count
    total += int(l.split('|')[2].split(':')[0])
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/takehiro/miniconda3/bin/parallel-fastq-dump", line 181, in <module>
    main()
  File "/home/takehiro/miniconda3/bin/parallel-fastq-dump", line 175, in main
    pfd(args, si, extra_args)
  File "/home/takehiro/miniconda3/bin/parallel-fastq-dump", line 49, in pfd
    n_spots = get_spot_count(srr_id)
  File "/home/takehiro/miniconda3/bin/parallel-fastq-dump", line 122, in get_spot_count
    raise IndexError(msg.format('\n'.join(txt), '\n'.join(etxt)))
IndexError: sra-stat output parsing error!
--sra-stat STDOUT--

--sra-stat STDERR--
2022-10-05T01:14:30 sra-stat.2.8.0 sys: connection failed while opening file within cryptographic module - mbedtls_ssl_handshake returned -9984 ( X509 - Certificate verification failed, e.g. CRL, CA or signature check failed )
2022-10-05T01:14:30 sra-stat.2.8.0 sys: mbedtls_ssl_get_verify_result returned 0x8 ( !! The certificate is not correctly signed by the trusted CA )
2022-10-05T01:14:31 sra-stat.2.8.0 sys: connection failed while opening file within cryptographic module - mbedtls_ssl_handshake returned -9984 ( X509 - Certificate verification failed, e.g. CRL, CA or signature check failed )
2022-10-05T01:14:31 sra-stat.2.8.0 sys: mbedtls_ssl_get_verify_result returned 0x8 ( !! The certificate is not correctly signed by the trusted CA )
2022-10-05T01:14:31 sra-stat.2.8.0 int: directory not found while opening manager within virtual file system module - 'SRR2244401'

Upgrade to sra toolkit 3.0

Could you please upgrade your tool to use fastq-dump from SRA Toolkit 3.0? NCBI changed something, and some SRA runs can only be downloaded with the new version of the toolkit (for example, try SRR15808775: it works with 3.0 but gives "access denied" with earlier versions or any other method of downloading). And with the new toolkit version I cannot use the parallel wrapper.

Thank you!

TemporaryDirectory error

Hi @rvalieris ,
first, thank you for developing this tool. I was searching for a better way to download SRA data and found a thread of tweets pointing to this. Great stuff.

I am having trouble making the script work on our server. It runs fine in my local computer, but on the server I get this error:

SRR ids: ['DRR093002']
extra args: ['--split-files', '--gzip']
Traceback (most recent call last):
  File "parallel-fastq-dump.py", line 119, in <module>
    main()
  File "parallel-fastq-dump.py", line 112, in main
    pfd(args, si, extra_args)
  File "parallel-fastq-dump.py", line 12, in pfd
    tmp_dir = tempfile.TemporaryDirectory(prefix="pfd_",dir=args.tmpdir)
AttributeError: 'module' object has no attribute 'TemporaryDirectory'

here the call

parallel-fastq-dump --sra-id DRR093002 --threads 16 --outdir out/ --split-files --gzip

I've been googling for a solution, but I cannot find an obvious one. When I run parallel-fastq-dump by itself I get the list of options just fine.

Any suggestions?
D

Forward and reverse identifier /1 and /2

Hello,

Thank you for your wonderful tool. Is there any way I can download the reads with the forward and reverse identifiers (/1 and /2)? Some downstream applications require these identifiers.

Many thanks!!
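fastq-dump's defline options can emit those suffixes, and parallel-fastq-dump forwards unrecognized arguments to fastq-dump (the "extra args" lines throughout this page show that; one log above even passes --defline-seq '@$ac.$si.$sg/$ri'). A hedged example call with a placeholder accession, where $ri expands to the read index and yields the /1 and /2 endings:

import subprocess

subprocess.run([
    "parallel-fastq-dump", "--sra-id", "SRR000001",  # placeholder accession
    "--threads", "4", "--split-files", "--gzip",
    "--defline-seq", "@$ac.$si/$ri",  # $ri -> 1 or 2, giving .../1 and .../2
    "--defline-qual", "+",
], check=True)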

exception: No such file or directory: 'sra-stat'

I am trying to run parallel-fastq-dump on an SRA file that is already downloaded, and I get an error:

root@4eb8d72f7f10:~/ncbi/dbGaP-0/sra# /root/miniconda3/bin/parallel-fastq-dump --sra-id SRR1219902.sra --threads 10 --outdir out2
SRR ids: ['SRR1219902.sra']
extra args: []
Traceback (most recent call last):
  File "/root/miniconda3/bin/parallel-fastq-dump", line 4, in <module>
    __import__('pkg_resources').run_script('parallel-fastq-dump==0.6.2', 'parallel-fastq-dump')
  File "/root/miniconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 748, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/root/miniconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1524, in run_script
    exec(script_code, namespace, namespace)
  File "/root/miniconda3/lib/python3.6/site-packages/parallel_fastq_dump-0.6.2-py3.6.egg/EGG-INFO/scripts/parallel-fastq-dump", line 100, in <module>
  File "/root/miniconda3/lib/python3.6/site-packages/parallel_fastq_dump-0.6.2-py3.6.egg/EGG-INFO/scripts/parallel-fastq-dump", line 92, in main
  File "/root/miniconda3/lib/python3.6/site-packages/parallel_fastq_dump-0.6.2-py3.6.egg/EGG-INFO/scripts/parallel-fastq-dump", line 47, in get_spot_count
  File "/root/miniconda3/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/root/miniconda3/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'sra-stat': 'sra-stat'
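This FileNotFoundError means Python could not execute sra-stat at all, i.e. the SRA Toolkit binaries are not on PATH in this container; the OSError: [Errno 2] from the cluster issue above has the same shape. A small preflight check along these lines makes the failure explicit (a sketch, not part of the tool):

import shutil
import sys

# parallel-fastq-dump shells out to these SRA Toolkit programs.
for tool in ("sra-stat", "fastq-dump"):
    if shutil.which(tool) is None:
        sys.exit(f"{tool} not found on PATH; install sra-tools or extend PATH")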

How are files named in the output of parallel-fastq-dump?

Hi dbGaP,

I'm writing to ask about this specific project by NYGC: '3 CANCER CELL LINES ON 2 SEQUENCERS' dbGaP accession number: phs001839.v1.p1 https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001839.v1.p1

I've reached out to NYGC for comment on what the 72x, 74x, or 75x means or what the 48,49,45,43 means in these files: HCC1143-N_72x_48_1.fastq.gz*
HCC1143-N_72x_48_2.fastq.gz*
HCC1143-N_72x_49_1.fastq.gz*
HCC1143-N_72x_49_2.fastq.gz*
HCC1143-N_74x_45_1.fastq.gz*
HCC1143-N_74x_45_2.fastq.gz*
HCC1143-N_75x_43_1.fastq.gz*
HCC1143-N_75x_43_2.fastq.gz*

To get these files, I had to install Aspera Connect, and then use prefetch from the SRAtoolkit and then parallel-fastq-dump to extract and gzip fastq files.

NYGC said dbGaP named these files because they uploaded these BAM files and they were converted by dbGaP to FASTQs to host on the dbGaP FTP.

dbGaP is now saying that parallel-fastq-dump named these files.

Can you help me understand how files are named in the output of parallel-fastq-dump please?

Appreciate your prompt reply and thank you!

Best,
Nicole

Parallel-fastq-dump has been running for nearly 24 hours

Dear developer,

Greetings! I would like to express my gratitude for developing this software. I have been using it since yesterday to convert a 10X single-cell sequencing SRA file of 14.8 GB. However, the program has been running continuously for nearly 24 hours without producing any output. Inspecting its CPU usage with htop, I observed that it did not exceed 100%. I am uncertain whether this behavior is normal, and so I seek your expert opinion. I am running the program on my home computer with 16 threads, yet the outcome remains unchanged.

The first image illustrates the command I used and the corresponding output, while the second shows the CPU utilization as displayed in htop.
[screenshots not included]

I would highly appreciate it if you could provide me with any suggestions or guidance.

Thank you sincerely!

Benchmark comparison

Hi,

This is more for your information and not an issue.

I wanted to let you know about a comparison that I ran between parallel-fastq-dump and sra-tools prefetch + fasterq-dump. You can find the code and results in this repo.

This is the way that I invoke parallel-fastq-dump so if you see some problem, tweak, or think that it is an unfair comparison, please let me know.

can't install parallel-fastq-dump

I used conda install parallel-fastq-dump but I keep getting this error:

Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  • parallel-fastq-dump

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

Is the package still there in conda? or was it removed?

Conflicting environment

Hi,
When I tried to install parallel-fastq-dump with "conda install parallel-fastq-dump", the following error showed:
Solving environment: failed

UnsatisfiableError: The following specifications were found to be in conflict:

  • enum34
  • parallel-fastq-dump
    Use "conda info " to see the dependencies for each package.

So I tried to uninstall enum34 with "conda uninstall enum34", but that was not allowed.

Is there any way to resolve this conflict?
Thank you!

Stella

AttributeError: 'module' object has no attribute 'TemporaryDirectory'

Running parallel-fastq-dump 0.6.3 on Scientific Linux using sratools/2.8.2-1, I get this error.

[rmf@r43 temp]$ parallel-fastq-dump --split-spot --split-files --gzip -s SRR390728 -t 8

SRR ids: ['SRR390728']
extra args: ['--split-spot', '--split-files', '--gzip']
Traceback (most recent call last):
  File "/home/rmf/.pyenv/versions/2.7.6/bin/parallel-fastq-dump", line 103, in <module>
    main()
  File "/home/rmf/.pyenv/versions/2.7.6/bin/parallel-fastq-dump", line 96, in main
    pfd(args, si, extra_args)
  File "/home/rmf/.pyenv/versions/2.7.6/bin/parallel-fastq-dump", line 12, in pfd
    tmp_dir = tempfile.TemporaryDirectory(prefix="pfd_",dir=args.tmpdir)
AttributeError: 'module' object has no attribute 'TemporaryDirectory'

I have also tried -s SRR390728 and -s ./SRR390728.sra because I do have the .sra file locally. But, I am not sure if it uses that.

IndexError: list index out of range

You can download the SRA here:
https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-2/SRR2778062/SRR2778062.1

And when I dump with 8 cores it fails; plain fastq-dump works fine.

Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
2020-01-09T14:29:16 sra-stat.2.10.0 int: path incorrect while opening manager within database module - '/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/tmp'
SRR ids: ['/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062', '/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/tmp']
extra args: ['--split-spot', '--skip-technical', '--dumpbase', '--readids', '--clip', '--read-filter', 'pass', '--defline-seq', '@$ac.$si.$sg/$ri', '--defline-qual', '+', '--gzip']
tempdir: /tmp/pfd_7z0gpkz8
/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062 spots: 51460528
blocks: [[1, 6432566], [6432567, 12865132], [12865133, 19297698], [19297699, 25730264], [25730265, 32162830], [32162831, 38595396], [38595397, 45027962], [45027963, 51460528]]
tempdir: /tmp/pfd_l9hikqiw
Traceback (most recent call last):
  File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/bin/parallel-fastq-dump", line 4, in <module>
    __import__('pkg_resources').run_script('parallel-fastq-dump==0.6.5', 'parallel-fastq-dump')
  File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/lib/python3.8/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1469, in run_script
    exec(script_code, namespace, namespace)
  File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/lib/python3.8/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 112, in <module>
  File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/lib/python3.8/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 105, in main
  File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/lib/python3.8/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 15, in pfd
  File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/lib/python3.8/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 64, in get_spot_count
IndexError: list index out of range

Crash happens on this line:

total += int(l.split("|")[2].split(":")[0])

how to deal with a downloaded sra file

"parallel-fastq-dump --sra-id SRR1219899 --threads 4 --outdir out/ --split-files --gzip"
Does it mean that this command line combines prefetch and fastq-dump? However, I have already downloaded the sra file; can I point the tool at it?

wrong file size of output fastq files

Hello! It looks like parallel-fastq-dump creates fastq files that are not the same size as the fastq files created by fasterq-dump.

For example:

prefetch SRR5683211 -O output
parallel-fastq-dump -t 80 -O output/parallel-fastq --tmpdir tmp/ -s output/SRR5683211.sra --split-files
fasterq-dump -e 80 -O output/fasterq -S output/SRR5683211.sra

Fasterq size:

du -h output/fasterq/SRR5683211.sra_1.fastq
7.8G    output/fasterq/SRR5683211.sra_1.fastq

Parallel fastq size:

du -h output/parallel-fastq/SRR5683211_1.fastq
7.6G    output/parallel-fastq/SRR5683211_1.fastq

I've noticed this before and sometimes the data loss is very large -- especially when I don't use prefetch first. Do you know what may be causing this behavior?
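Raw du sizes are a blunt comparison: the two tools write different deflines (note the fasterq-dump output is even named SRR5683211.sra_1.fastq), so byte counts can differ with no reads lost. Counting FASTQ records, four lines apiece, is a more direct check; a sketch assuming the uncompressed outputs at the paths above:

def fastq_records(path):
    """A FASTQ record is exactly 4 lines; count records instead of bytes."""
    with open(path, "rb") as fh:
        return sum(1 for _ in fh) // 4

a = fastq_records("output/parallel-fastq/SRR5683211_1.fastq")
b = fastq_records("output/fasterq/SRR5683211.sra_1.fastq")
print(a, b, "same record count" if a == b else "record counts differ")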

fastq-dump.2.11.0 err: storage exhausted while writing...

Hello,

I'm getting this error running parallel-fastq-dump on an HPC.

2022-09-28T15:55:16 fastq-dump.2.11.0 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'

The line is present repeatedly in slurm output.

I've read a few other threads here about the same problem. I changed my --tmpdir and --outdir to a scratch drive and yes the temp files are being written there. Both sra-toolkit and parallel-fastq-dump were installed with conda (today).

Could another folder be filling up, and is that why I'm getting the complaint? Any thoughts?

Something writes to root, even if tmpdir is defined.

Hi,

This is a very nice tool to speed up downloads via fastq-dump.

I have the issue that fastq-dump sometimes fails with errors like: timeout exhausted while waiting condition within process system module

a) Do you know what this error is? It seems related to internet speed: on my local computer with ~50 Mbit/s it occurs; however, if I download via my local compute center over a 1 Gbit/s connection, the error does not occur.

b) Using the fast connection (1 Gbit/s) I am facing another problem. I defined a tmp dir in my home directory, but something is still written to my root partition, and then I get 2020-03-08T11:06:02 fastq-dump.2.10.3 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'.

parallel-fastq-dump --sra-id SRR5229019 -t 40 --outdir out_fastq/ --tmpdir local_tmp


Any idea how to solve the two issues?

For the first one: if it is an issue with fastq-dump, would it be possible to have something that restarts the download from the point where it crashed?

Best,

Joachim

IndexError

Hi,

I'm trying to run parallel-fastq-dump, but I get the error provided below. I can find similar issues here, but none of them solves my problem. The specific call is as follows:

parallel-fastq-dump --sra-id $1 --threads 4 --outdir raw/ --split-files --gzip

Where $1 is the result of parsing an SRR ID list.

The log from one of the SRA IDs (SRR6337208):

2021-04-30 13:45:38,797 - SRR ids: ['SRR6337208']
2021-04-30 13:45:38,797 - extra args: ['--split-files', '--gzip']
2021-04-30 13:45:38,798 - tempdir: /tmp/pfd_zhzmw3es
2021-04-30 13:45:38,798 - CMD: sra-stat --meta --quick SRR6337208

Traceback (most recent call last):
  File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 116, in get_spot_count
    total += int(l.split('|')[2].split(':')[0])
IndexError: list index out of range

During handling of the above exception, another exception occurred:


Traceback (most recent call last):
  File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 181, in <module>
    main()
  File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 175, in main
    pfd(args, si, extra_args)
  File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 49, in pfd
    n_spots = get_spot_count(srr_id)
  File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 122, in get_spot_count
    raise IndexError(msg.format('\n'.join(txt), '\n'.join(etxt)))
IndexError: sra-stat output parsing error!
--sra-stat STDOUT--

--sra-stat STDERR--
2021-04-30T11:47:40 sra-stat.2.11.0 int: directory not found while opening manager within virtual file system module - 

