rvalieris / parallel-fastq-dump — parallel fastq-dump wrapper (MIT License)
Hello,
thank you for the useful tool! I've used it for my projects with downloaded SRA files and things speed up quite a lot. However, I tried it on a very large file, e.g. SRR3192525 (a 30 GB archive), and it crashed with a "not enough storage" message, while serial fastq-dump finished the job fine.
Does it use the default Python tmp directory or something like that? Maybe it would make sense to set the tmp dir to the directory in which the tool is run?
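For reference, later commands on this page do pass a --tmpdir option, and the choice of scratch location is exactly what matters here: /tmp is often a small (or RAM-backed) filesystem. A minimal Python sketch of the idea, placing the scratch directory beside the output instead of under the system default (make_scratch_dir is a hypothetical helper; only the "pfd_" prefix is taken from the tool's logs):

```python
import os
import tempfile

# Sketch, not the tool's actual code: create the scratch directory under a
# caller-chosen base (here, the current working directory) instead of the
# system default /tmp, so large intermediate FASTQ chunks land on the same
# filesystem as the final output.
def make_scratch_dir(base=None):
    base = base or os.getcwd()  # default: the directory the tool is run from
    return tempfile.TemporaryDirectory(prefix="pfd_", dir=base)

td = make_scratch_dir()
print(td.name)  # created under the current directory, not /tmp
td.cleanup()    # removed automatically; a context manager works too
```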
So the program creates temporary files in /tmp/pfd_0hm17d1p but at the end, it cannot find the second fastq file: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pfd_0hm17d1p/10/SRR4253583_2.fastq'
$ parallel-fastq-dump --sra-id SRR4253583 -t 16 -O out/ --split-files
SRR ids: ['SRR4253583']
extra args: ['--split-files']
SRR4253583 spots: 143228176
tempdir: /tmp/pfd_0hm17d1p
blocks: [[1, 8951761], [8951762, 17903522], [17903523, 26855283], [26855284, 35807044], [35807045, 44758805], [44758806, 53710566], [53710567, 62662327], [62662328, 71614088], [71614089, 80565849], [80565850, 89517610], [89517611, 98469371], [98469372, 107421132], [107421133, 116372893], [116372894, 125324654], [125324655, 134276415], [134276416, 143228176]]
Rejected 8951761 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Rejected 8951761 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Rejected 8951761 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Rejected 8951761 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Rejected 8951761 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Rejected 8951761 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Rejected 1880687 READS because READLEN < 1
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Read 8951761 spots for SRR4253583
Written 8951761 spots for SRR4253583
Traceback (most recent call last):
File "/home/$username/anaconda3/bin/parallel-fastq-dump", line 4, in <module>
__import__('pkg_resources').run_script('parallel-fastq-dump==0.6.2', 'parallel-fastq-dump')
File "/home/$username/anaconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 750, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/$username/anaconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1534, in run_script
exec(script_code, namespace, namespace)
File "/home/$username/anaconda3/lib/python3.6/site-packages/parallel_fastq_dump-0.6.2-py3.6.egg/EGG-INFO/scripts/parallel-fastq-dump", line 100, in <module>
File "/home/$username/anaconda3/lib/python3.6/site-packages/parallel_fastq_dump-0.6.2-py3.6.egg/EGG-INFO/scripts/parallel-fastq-dump", line 93, in main
File "/home/$username/anaconda3/lib/python3.6/site-packages/parallel_fastq_dump-0.6.2-py3.6.egg/EGG-INFO/scripts/parallel-fastq-dump", line 43, in pfd
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pfd_0hm17d1p/10/SRR4253583_2.fastq'
Thanks!
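Note the "Rejected ... READS because READLEN < 1" lines above: a block can legitimately produce no mate-2 file at all, and the merge step then fails on the missing path. A defensive merge would skip part files that were never written. Sketch only (merge_parts is hypothetical; the tmpdir/<block>/<name> layout is inferred from the path in the traceback):

```python
import os
import tempfile

# Sketch, not the tool's code: concatenate tmp_dir/<i>/<fname> for
# i in 0..n_blocks-1, tolerating blocks that never produced that file
# (e.g. a block whose reads were all rejected for READLEN < 1).
def merge_parts(tmp_dir, n_blocks, fname, out_path):
    with open(out_path, "wb") as out:
        for i in range(n_blocks):  # sequential, to preserve spot order
            part = os.path.join(tmp_dir, str(i), fname)
            if not os.path.exists(part):
                continue  # block produced no reads for this file
            with open(part, "rb") as f:
                out.write(f.read())

# Tiny demo: block 1 has no "_2" part, the merge still succeeds.
base = tempfile.mkdtemp()
for i in range(3):
    os.makedirs(os.path.join(base, str(i)))
for i in (0, 2):
    with open(os.path.join(base, str(i), "SRR_2.fastq"), "wb") as f:
        f.write(b"block%d\n" % i)
merged = os.path.join(base, "SRR_2.fastq")
merge_parts(base, 3, "SRR_2.fastq", merged)
print(open(merged, "rb").read())  # b'block0\nblock2\n'
```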
I have several SRA files on my computer. Can it deal with local SRA files?
Hello,
Thank you for developing this tool. I used v0.6.6 to download some SRA data before and it was fine. Now I'm trying to download a different dataset, and parallel-fastq-dump is strangely failing, producing logs like the ones below. I'm not sure if it is a parallel-fastq-dump, fastq-dump, or prefetch problem per se, but it is repeatable on my system for all 84 samples in the dataset. I'm attaching the report mentioned in the log as well.
Would you have any suggestions as to what could be causing this error and how to work around it? Any pointers you may be able to provide are highly appreciated.
Best,
Azza
$ parallel-fastq-dump -V
parallel-fastq-dump : 0.6.6
"fastq-dump" version 2.10.8
$ parallel-fastq-dump --sra-id SRR9070188 --threads 4 --split-files --gzip
SRR ids: ['SRR9070188']
extra args: ['--split-files', '--gzip']
tempdir: /tmp/pfd_5k4ylnch
SRR9070188 spots: 323486139
blocks: [[1, 80871534], [80871535, 161743068], [161743069, 242614602], [242614603, 323486139]]
2020-12-07T13:22:07 fastq-dump.2.10.8 err: timeout exhausted while waiting condition within process system module - failed SRR9070188
=============================================================
An error occurred during processing.
A report was generated into the file '/home/azzaea/ncbi_error_report.txt'.
If the problem persists, you may consider sending the file
to '[email protected]' for assistance.
=============================================================
2020-12-07T13:39:06 fastq-dump.2.10.8 err: transfer canceled while allocating buffer within file system module - Cannot KHttpFileTimedReadChunked: to=480
fastq-dump quit with error code 3
2020-12-07T14:08:23 fastq-dump.2.10.8 err: timeout exhausted while waiting condition within process system module - failed SRR9070188
=============================================================
An error occurred during processing.
A report was generated into the file '/home/azzaea/ncbi_error_report.txt'.
If the problem persists, you may consider sending the file
to '[email protected]' for assistance.
=============================================================
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='6'
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - failed SRR9070188
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='5'
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - failed SRR9070188
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='5'
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='5'
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='6'
2020-12-07T14:12:23 fastq-dump.2.10.8 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='6'
=============================================================
An error occurred during processing.
A report was generated into the file '/home/azzaea/ncbi_error_report.txt'.
If the problem persists, you may consider sending the file
to '[email protected]' for assistance.
=============================================================
=============================================================
An error occurred during processing.
A report was generated into the file '/home/azzaea/ncbi_error_report.txt'.
If the problem persists, you may consider sending the file
to '[email protected]' for assistance.
=============================================================
fastq-dump quit with error code 3
fastq-dump quit with error code 3
fastq-dump error! exit code: 3
This is an issue when using maxSpotId to ensure no more than N spots are downloaded (e.g., when there are some very large RNA-Seq experiments I want to ignore).
For example, in this case there are only 5.4 million spots, so the third thread does not do anything.
This makes the download slower than not using -X 10000000 at all.
$ parallel-fastq-dump -X 10000000 -t 3 -s SRR868679
SRR ids: ['SRR868679']
extra args: []
tempdir: /tmp/pfd_k2htn18j
SRR868679 spots: 5487730
blocks: [[1, 3333333], [3333334, 6666666], [6666667, 10000000]]
Read 2154397 spots for SRR868679
Written 2154397 spots for SRR868679
Read 3333333 spots for SRR868679
Written 3333333 spots for SRR868679
I believe the fix is just:
end = min(n_spots, args.maxSpotId) if args.maxSpotId is not None else n_spots
Thanks for the useful tool!
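The proposed one-liner can be seen in context with a small sketch of the block-splitting logic (split_blocks is a hypothetical stand-in for the tool's internals; only the capping line comes from the report above):

```python
# Sketch of the proposed fix: cap the last spot at the run's actual spot
# count, so no thread is handed a range past the end of the data when
# -X/--maxSpotId exceeds the number of spots in the run.
def split_blocks(n_spots, n_threads, max_spot_id=None):
    end = min(n_spots, max_spot_id) if max_spot_id is not None else n_spots
    per = end // n_threads
    blocks = []
    for i in range(n_threads):
        start = i * per + 1
        stop = end if i == n_threads - 1 else (i + 1) * per
        blocks.append([start, stop])
    return [b for b in blocks if b[0] <= b[1]]  # drop empty ranges

# With 5,487,730 spots, 3 threads and -X 10000000, every thread gets work:
print(split_blocks(5487730, 3, 10000000))
```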
Hi,
thank you for writing this neat tool - it nicely covers a critical gap in ncbi-tools!
I started using it in our lab's snakemake pipelines and found that it can't deal with some old datasets.
In particular, it froze when it tried to download SRR567536 (from https://www.ncbi.nlm.nih.gov/sra/SRX183520). I used the following command:
parallel-fastq-dump --origfmt --split-files --gzip -s SRR567537 -t 8
At the same time, ncbi-tools can download it:
fastq-dump --origfmt --split-files --gzip SRR567536
I also tried prefetch-ing and it didn't help.
Anton.
Hi,
I installed parallel-fastq-dump on centos cluster using conda. I get the below error when I ran the example command.
$ which sra-stat.2.9.6
~/.conda/envs/parallel-fastq-dump/bin/sra-stat.2.9.6
Error:
2019-08-15T20:36:57 sra-stat.2.9.6 err: query unauthorized while resolving query within virtual file
system module - failed to resolve accession 'SRR1219899' - Access denied - please request permission
to access phs000710/UR in dbGaP ( 403 )
2019-08-15T20:36:57 sra-stat.2.9.6 int: directory not found while opening manager within virtual file
system module - 'SRR1219899'
SRR ids: ['SRR1219899']
extra args: ['--split-files', '--gzip']
tempdir: /tmp/uge/61368154.1.secondary.q/pfd_do4anflw
Traceback (most recent call last):
File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 4, in <module>
__import__('pkg_resources').run_script('parallel-fastq-dump==0.6.5', 'parallel-fastq-dump')
File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1460, in run_script
exec(script_code, namespace, namespace)
File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/lib/python3.7/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 112, in <module>
File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/lib/python3.7/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 105, in main
File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/lib/python3.7/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 15, in pfd
File "/sonas-hs/ware/hpc/home/kchougul/.conda/envs/parallel-fastq-dump/lib/python3.7/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 64, in get_spot_count
IndexError: list index out of range
Thanks
Hi, I am trying to use pfd to convert SRA files to FASTQ files in a pipeline. I have a cluster with 100 cores. The following is my code:
The variable word1 holds the SRR id.
time $path_to_parallel_fastq_dump --threads 100 --sra-id $word1 --tmpdir $path_to_sra_folder --outdir $path_to_fastq_folder$word1.fastq |& tee $path_to_stats_folder/sra_fastq/fout1_$word1.txt
But, I am getting following error:
File "/home/csb/tools/parallel-fastq-dump-master/parallel-fastq-dump", line 123, in <module>
main()
File "/home/csb/tools/parallel-fastq-dump-master/parallel-fastq-dump", line 116, in main
pfd(args, si, extra_args)
File "/home/csb/tools/parallel-fastq-dump-master/parallel-fastq-dump", line 50, in pfd
os.close(fd)
TypeError: an integer is required (got type _io.BufferedWriter)
Can you please help ?
Looks like parallel-fastq-dump writes out multiple different parts of the output file simultaneously.
So I assume it can't be used with fastq-dump's -Z option, which writes a single stream to stdout.
But thought I'd make sure. Is that the case?
I usually use this option to stream the fastq output directly to Amazon S3 since I have limited disk space available (these fastq files get big) but S3 provides unlimited storage.
It looks like parallel-fastq-dump writes out N files, where N is the number of threads. Are these concatenated together with the equivalent of cat? If so, I could write to N files which are actually named pipes that stream to S3, and then concatenate them myself. Since I don't see an option to suppress the concatenation, I probably need to mess with the source... Any thoughts on this?
Thanks.
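On the named-pipes idea: it can work, but only if the pipes are drained strictly in block order, since each worker's chunk must appear in the stream in spot order. A rough POSIX-only Python sketch under that assumption (concat_pipes and worker are hypothetical, not the tool's interface):

```python
import io
import os
import tempfile
import threading

# Sketch: if each worker wrote its chunk to a named pipe instead of a
# regular file, one reader could concatenate the pipes and stream the
# result onward (e.g. into an S3 uploader). Pipe i must be drained
# completely before pipe i+1 to preserve spot order.
def concat_pipes(pipe_paths, sink):
    for p in pipe_paths:  # strictly sequential
        with open(p, "rb") as f:
            while chunk := f.read(1 << 20):
                sink.write(chunk)

def worker(path, data):
    # Stand-in for one fastq-dump block writer.
    with open(path, "wb") as f:
        f.write(data)

tmp = tempfile.mkdtemp()
pipes = [os.path.join(tmp, f"part_{i}") for i in range(3)]
for p in pipes:
    os.mkfifo(p)  # POSIX named pipe

threads = [threading.Thread(target=worker, args=(p, f"block{i}\n".encode()))
           for i, p in enumerate(pipes)]
for t in threads:
    t.start()  # each worker blocks in open() until the reader arrives

out = io.BytesIO()  # stands in for the S3 upload stream
concat_pipes(pipes, out)
for t in threads:
    t.join()
print(out.getvalue())
```

The design point is that a FIFO's open() blocks until both ends attach, so the workers naturally wait their turn while the reader drains earlier pipes.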
Hello,
I started a download with the following command:
parallel-fastq-dump --sra-id SRR9089604 --threads 10 --outdir out/ --tmpdir temp/
This SRR is about 80 GB. The temp folder has already downloaded more than 100 GB and it keeps going... is this normal?
thanks for your help
Alex
Does this work with SRA Lite?
https://ncbiinsights.ncbi.nlm.nih.gov/2021/10/19/sra-lite/
I have access to some restricted SRA files on dbGAP which I have downloaded. When executing $ parallel-fastq-dump -s SRR.sra I get an error message.
2018-05-08T14:34:42 fastq-dump.2.8.2 err: file not found while opening file within network system module - FailekeHttpFileInt('https://gap-download.be-md.ncbi.nlm.nih.gov/sragap/D3B39D27-6567-4F0A-87EA-2707BC243B8B/SRR100758B0E3F-7D85-4391-84B7-2C1B31A30C5B' (130.14.250.15))
It looks like the command is attempting to generate the fastq files directly from the online SRA files. However, as they are protected, presumably it fails to find them. I have uploaded the correct ngc file
for permission, as per the instructions on (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=dbgap_use), to the new version of SRA tools bundled with parallel-fastq-dump.
Is there any way to direct this command to a local file, or is there another solution you may be able to suggest?
Many thanks, any help appreciated
james
Hello,
This might be a novice problem, but when I try to do the conversion of multiple files within an SRA group (i.e. SRA227673, 33 files) I get 0-byte files... I can get the FASTQ files when I do individual files. I am using an iMac desktop and terminal version 2.7.3. The way I tried doing multiple files is like:
$ parallel-fastq-dump --split-3 --outdir out/ ---sra-id SRR174709[0-9]
I did [#] for the last one because within this range there are 2 files (I wanted to test whether it worked before trying all 33 files). Any help will be greatly appreciated!!
Hello!
This is my first time running parallel-fastq-dump, so I apologize if this is an easy fix. Below is the output I receive from the program.
(base) [arglatha]$ parallel-fastq-dump --sra-id SRR6820613 --threads 6 --outdir ./SRR6820613/ --split-files --gzip --tmpdir ./tmp/
SRR ids: ['SRR6820613']
extra args: ['--split-files', '--gzip']
tempdir: ./tmp/pfd_kh5jeeyw
SRR6820613 spots: 75623640
blocks: [[1, 12603940], [12603941, 25207880], [25207881, 37811820], [37811821, 50415760], [50415761, 63019700], [63019701, 75623640]]
2019-08-12T15:04:38 fastq-dump.2.9.6 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
2019-08-12T15:04:38 fastq-dump.2.9.6 int: storage exhausted while writing file within file system module - switching cache-tee-file to read-only
2019-08-12T15:04:38 fastq-dump.2.9.6 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
2019-08-12T15:04:38 fastq-dump.2.9.6 int: storage exhausted while writing file within file system module - switching cache-tee-file to read-only
2019-08-12T15:04:38 fastq-dump.2.9.6 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
2019-08-12T15:04:38 fastq-dump.2.9.6 int: storage exhausted while writing file within file system module - switching cache-tee-file to read-only
2019-08-12T15:04:38 fastq-dump.2.9.6 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
2019-08-12T15:04:38 fastq-dump.2.9.6 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
2019-08-12T15:04:38 fastq-dump.2.9.6 int: storage exhausted while writing file within file system module - switching cache-tee-file to read-only
2019-08-12T15:04:38 fastq-dump.2.9.6 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
2019-08-12T15:04:38 fastq-dump.2.9.6 int: storage exhausted while writing file within file system module - switching cache-tee-file to read-only
2019-08-12T15:04:38 fastq-dump.2.9.6 int: storage exhausted while writing file within file system module - switching cache-tee-file to read-only
2019-08-12T15:13:04 fastq-dump.2.9.6 sys: error unknown while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
2019-08-12T15:15:52 fastq-dump.2.9.6 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
2019-08-12T15:18:40 fastq-dump.2.9.6 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
2019-08-12T15:58:34 fastq-dump.2.9.6 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
fastq-dump error! exit code: -24
Exception ignored in: <finalize object at 0x7f29c0854460; dead>
Traceback (most recent call last):
File "/util/common/bioinformatics/bioconda/miniconda3-4.6.14/lib/python3.7/weakref.py", line 552, in __call__
return info.func(*info.args, **(info.kwargs or {}))
File "/util/common/bioinformatics/bioconda/miniconda3-4.6.14/lib/python3.7/tempfile.py", line 795, in _cleanup
_shutil.rmtree(name)
File "/util/common/bioinformatics/bioconda/miniconda3-4.6.14/lib/python3.7/shutil.py", line 491, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/util/common/bioinformatics/bioconda/miniconda3-4.6.14/lib/python3.7/shutil.py", line 433, in _rmtree_safe_fd
onerror(os.rmdir, fullname, sys.exc_info())
File "/util/common/bioinformatics/bioconda/miniconda3-4.6.14/lib/python3.7/shutil.py", line 431, in _rmtree_safe_fd
os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: '2'
Any advice would be appreciated!
There are two parts to the error (given below), though they could be related:
Is this because the package is not available for macOS?
I did run "conda update conda" to make sure my conda is up to date (conda/4.5.0).
And I tried installing another package using conda and it worked fine (conda install geopy).
Error message is given below:
$ conda install parallel-fastq-dump
Solving environment: failed
PackagesNotFoundError: The following packages are not available from current channels:
Current channels:
Hi,
I am trying to download several reads. Eg using the following command:
parallel-fastq-dump --sra-id SRR925794 --threads 32 --gzip
But when I do, I get this error message:
fastq-dump.2.9.1 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='7'
I am running this on an AWS EC2 instance which has plenty of disk space, and yet I am still getting this error. However, I am not sure if it is because parallel-fastq-dump is mounting volumes and running there, and those volumes have less disk space. It looks like these volumes are mounted on the instance, and I am not sure how else I might have created them.
Do you know if this is the case, and if so, how I can change the location where the command is run to prevent this error?
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 394G 56G 339G 15% /
devtmpfs 121G 96K 121G 1% /dev
tmpfs 121G 0 121G 0% /dev/shm
/dev/dm-3 9.8G 4.1G 5.2G 45% /var/lib/docker/devicemapper/mnt/686a42157daedddaf1f9e187ea042311bbc553e466013fb79adc4bec8da51432
shm 64M 0 64M 0% /var/lib/docker/containers/dada83892501925da80d6abd9c27cc048c1e2c3fb90b6f587d051ca0f5e8c12c/shm
Also, do you know how I can get it to run any faster? For example, do you know how parallel-fastq-dump compares to fasterq-dump? I tried running the latter instead and it seems like it may be a bit slower. Or what is the optimum value to set --threads to? For example, I know fasterq-dump has diminishing returns as the thread count grows; does parallel-fastq-dump behave the same way?
I download stuff with wget ten times faster. As an example,
wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra
takes 60 seconds, with 2 minutes of follow-up extraction, while parallel-fastq-dump spends half an hour with 4 threads.
Hi, apologies that this is probably user error in my environment,
but when I attempt to run parallel-fastq-dump using this command:
parallel-fastq-dump --sra-id SRR1930123 --threads 4 --outdir ./ --split-files --gzip
I get this error (with traceback):
Traceback (most recent call last):
File "/usr/bin/parallel-fastq-dump", line 100, in <module>
main()
File "/usr/bin/parallel-fastq-dump", line 93, in main
pfd(args, si, extra_args, n_spots)
File "/usr/bin/parallel-fastq-dump", line 12, in pfd
tmp_dir = tempfile.TemporaryDirectory(prefix="pfd_",dir=args.tmpdir)
AttributeError: 'module' object has no attribute 'TemporaryDirectory'
I'm running this in centos, and I noticed that the shebang of the parallel-fastq-dump script is calling for python2: #!/usr/bin/python2
Is this an error? I installed with pip install, and version is 0.6.2. Thanks for your help!
When installing parallel-fastq-dump from conda, it will also pull sra-tools v2.8.0 from conda, even if I have a local sra-tools. This sra-tools is outdated and will cause network connection errors. This should be fixed.
The following NEW packages will be INSTALLED:
parallel-fastq-dump bioconda/noarch::parallel-fastq-dump-0.6.7-pyhdfd78af_0
sra-tools bioconda/linux-64::sra-tools-2.8.0-0
Hi,
I was wondering whether I can run the following to download the data. I know this is possible with fastq-dump; I just wanted to confirm it is also possible with parallel-fastq-dump 0.6.6.
prefetch [runid] && vdb-validate [runid] && parallel-fastq-dump --outdir [out] --skip-technical --split-3 --sra-id [runid] --gzip
Thank you!
Hello, I want to use your parallel-fastq-dump as it would help me a lot in my studies.
It seems I have issues using the right syntax somehow... Every time I try to use it I get the ERROR:
File "/home/hpc/t1172/di36dih/bin/parallel-fastq-dump", line 29
p = subprocess.Popen(["fastq-dump", "-N", str(out[i][0]), "-X", str(out[i][1]), "-O", d, *extra_args, args.sra_id])
Could you send one or two example calls so I understand my problem?
Thanks in advance,
Sincerely, TripleB
I have ERR3240205 and ERR3240205.vdbcache files retrieved through SRA Cloud Data Delivery. I then ran:
parallel-fastq-dump -s ERR3240193 -t 2 -O out --tmpdir tmp --split-files --gzip
I found the two fastq-dump processes and ran strace -f -e trace=network -p <pid> on each. I found that the fastq-dump that starts from the start of the SRA file does not use network IO, while the one that starts mid-range does, and I was wondering why, given that I've downloaded the file.
Hi, this is more of a question about fastq, but not sure where I should go to get help with it. I would appreciate it if anyone here can help me!
What does it mean if --split-files is used but I still only get a single output file (i.e. I get ..._1.fastq but not ..._2.fastq)? My understanding is that I should get 2 output files.
import os
command = 'parallel-fastq-dump/parallel-fastq-dump '+\
f'--outdir {data_dirpath}/sratofastq/{sra_id} '+\
'--threads 32 '+\
'--split-files '+\
'--tmpdir tmpdir '+\
f'-s {data_dirpath}/pysradb_downloads/{sra_id}/SRX2536403/SRR5227288.sra'
print(command)
print(os.system(command))
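As an aside on the snippet above: building the command as an argument list for subprocess.run avoids the shell-quoting pitfalls of os.system with interpolated paths, and check=True raises on a non-zero exit instead of silently returning the code. Sketch only (build_cmd is a hypothetical helper; the flags mirror the snippet):

```python
import subprocess

# Hypothetical helper: assemble the parallel-fastq-dump invocation as an
# argument list, so paths with spaces or shell metacharacters are safe.
def build_cmd(sra_path, outdir, tmpdir, threads=32):
    return [
        "parallel-fastq-dump",
        "--outdir", outdir,
        "--threads", str(threads),
        "--split-files",
        "--tmpdir", tmpdir,
        "-s", sra_path,
    ]

print(build_cmd("SRR5227288.sra", "out", "tmpdir"))
# To actually run it:
#   subprocess.run(build_cmd(...), check=True)
```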
When trying to install the program using either pip or setup.py, the following error appears:
Traceback (most recent call last):
File "setup.py", line 8, in <module>
open(pname).read().split("\n")
TypeError: list object is not an iterator
Hi,
Sorry for bothering you with a very basic problem...
I have tried to install parallel-fastq-dump locally on our cluster after cloning it from git :
$ python setup.py --prefix=/data/users/scollomb/programs
$echo $PYTHONPATH
/bioinfo/local/build/Centos/python/python-2.7.12/lib/python2.7/site-packages/:/data/users/scollomb/programs/lib/python2.7/site-packages
but I cannot run it :
parallel-fastq-dump --tmpdir /data/tmp/scollomb --threads 10 -O ./ -s SRR3933583 --split-3 --gzip
SRR ids: ['SRR3933583']
extra args: ['--split-3', '--gzip']
Traceback (most recent call last):
File "/data/users/scollomb/programs/bin/parallel-fastq-dump", line 4, in <module>
__import__('pkg_resources').run_script('parallel-fastq-dump==0.6.2', 'parallel-fastq-dump')
File "/bioinfo/local/build/Centos/python/python-2.7.12/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/bioinfo/local/build/Centos/python/python-2.7.12/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1501, in run_script
exec(script_code, namespace, namespace)
File "/data/users/scollomb/programs/lib/python2.7/site-packages/parallel_fastq_dump-0.6.2-py2.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 100, in <module>
File "/data/users/scollomb/programs/lib/python2.7/site-packages/parallel_fastq_dump-0.6.2-py2.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 92, in main
File "/data/users/scollomb/programs/lib/python2.7/site-packages/parallel_fastq_dump-0.6.2-py2.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 47, in get_spot_count
File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
Any idea what is going on?
Thanks for the great tool! As an enhancement, it would be great if I could give the program an experiment (or sample) accession and download reads for all run accessions that correspond to the experiment. This is a feature of the standard fastq-dump program.
Hi,
I am running into this error - I installed parallel-fastq-dump on an AWS EC2 using conda. Any insights would be wonderful. I suspect this may be an error with fastq dump itself but I could not find any information on it either.
My code:
parallel-fastq-dump --sra-id ERS1829683 --threads 56 --outdir ./ --split-files
Output:
SRR ids: ['ERS1829683']
extra args: ['--split-files']
tempdir: /tmp/pfd_9bpf3fvs
ERS1829683 spots: 13339116
blocks: [[1, 238198], [238199, 476396], [476397, 714594], [714595, 952792], [952793, 1190990], [1190991, 1429188], [1429189, 1667386], [1667387, 1905584], [1905585, 2143782], [2143783, 2381980], [2381981, 2620178], [2620179, 2858376], [2858377, 3096574], [3096575, 3334772], [3334773, 3572970], [3572971, 3811168], [3811169, 4049366], [4049367, 4287564], [4287565, 4525762], [4525763, 4763960], [4763961, 5002158], [5002159, 5240356], [5240357, 5478554], [5478555, 5716752], [5716753, 5954950], [5954951, 6193148], [6193149, 6431346], [6431347, 6669544], [6669545, 6907742], [6907743, 7145940], [7145941, 7384138], [7384139, 7622336], [7622337, 7860534], [7860535, 8098732], [8098733, 8336930], [8336931, 8575128], [8575129, 8813326], [8813327, 9051524], [9051525, 9289722], [9289723, 9527920], [9527921, 9766118], [9766119, 10004316], [10004317, 10242514], [10242515, 10480712], [10480713, 10718910], [10718911, 10957108], [10957109, 11195306], [11195307, 11433504], [11433505, 11671702], [11671703, 11909900], [11909901, 12148098], [12148099, 12386296], [12386297, 12624494], [12624495, 12862692], [12862693, 13100890], [13100891, 13339116]]
failed to retrieve result
(the line above repeats many times, interleaved across the worker processes)
fastq-dump error! exit code: 22
failed to retrieve result
(virus) jmifsud@dacelo:/disks/dacelo/data/jmifsud$
(more "failed to retrieve result" lines, then:)
Read 238198 spots for ERR2040621
Written 238198 spots for ERR2040621
(the Read/Written pair above repeats 11 times in total)
$ parallel-fastq-dump -s ERS1444621 --outdir pfd_output --tmpdir pfd_tmp
SRR ids: ['ERS1444621']
extra args: []
tempdir: pfd_tmp/pfd___5awz6o
2021-04-11T06:28:36 sra-stat.2.8.2 int: directory not found while opening manager within virtual file system module - 'ERS1444621'
Traceback (most recent call last):
File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgqc/.snakemake/conda/af0470e0f4e49441f0a6d5af028c9398/bin/parallel-fastq-dump", line 119, in <module>
main()
File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgqc/.snakemake/conda/af0470e0f4e49441f0a6d5af028c9398/bin/parallel-fastq-dump", line 112, in main
pfd(args, si, extra_args)
File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgqc/.snakemake/conda/af0470e0f4e49441f0a6d5af028c9398/bin/parallel-fastq-dump", line 15, in pfd
n_spots = get_spot_count(srr_id)
File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgqc/.snakemake/conda/af0470e0f4e49441f0a6d5af028c9398/bin/parallel-fastq-dump", line 65, in get_spot_count
total += int(l.split("|")[2].split(":")[0])
IndexError: list index out of range
The output from subprocess.Popen(["sra-stat", "--meta", "--quick", sra_id], stdout=subprocess.PIPE) is just [''], which is causing the IndexError.
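A guarded version of that parsing would turn the bare IndexError into a readable error. Sketch only: the field layout is assumed from the line total += int(l.split("|")[2].split(":")[0]) quoted in the traceback above, and parse_spot_count is a hypothetical replacement for part of get_spot_count:

```python
# Sketch of a more defensive spot-count parser: skip empty or malformed
# lines (e.g. when sra-stat printed nothing because the accession could
# not be resolved, or only error text), and fail with a message that
# shows what was actually received.
def parse_spot_count(lines):
    total = 0
    parsed = False
    for l in lines:
        parts = l.split("|")
        if len(parts) < 3 or ":" not in parts[2]:
            continue  # not a "acc|...|spots:bases|..." stats line
        total += int(parts[2].split(":")[0])
        parsed = True
    if not parsed:
        raise ValueError("no parsable lines in sra-stat output: %r" % (lines,))
    return total

print(parse_spot_count(["SRRX|meta|10:20"]))  # 10
```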
Hello, I'm getting this error while analyzing.
I'm using Ubuntu on WSL.
$ parallel-fastq-dump --threads 8 --split-files --gzip --sra-id SRR2244401
2022-10-05 10:14:30,278 - SRR ids: ['SRR2244401']
2022-10-05 10:14:30,278 - extra args: ['--split-files', '--gzip']
2022-10-05 10:14:30,283 - tempdir: /tmp/pfd_hec73ych
2022-10-05 10:14:30,283 - CMD: sra-stat --meta --quick SRR2244401
Traceback (most recent call last):
File "/home/takehiro/miniconda3/bin/parallel-fastq-dump", line 116, in get_spot_count
total += int(l.split('|')[2].split(':')[0])
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/takehiro/miniconda3/bin/parallel-fastq-dump", line 181, in <module>
main()
File "/home/takehiro/miniconda3/bin/parallel-fastq-dump", line 175, in main
pfd(args, si, extra_args)
File "/home/takehiro/miniconda3/bin/parallel-fastq-dump", line 49, in pfd
n_spots = get_spot_count(srr_id)
File "/home/takehiro/miniconda3/bin/parallel-fastq-dump", line 122, in get_spot_count
raise IndexError(msg.format('\n'.join(txt), '\n'.join(etxt)))
IndexError: sra-stat output parsing error!
--sra-stat STDOUT--
--sra-stat STDERR--
2022-10-05T01:14:30 sra-stat.2.8.0 sys: connection failed while opening file within cryptographic module - mbedtls_ssl_handshake returned -9984 ( X509 - Certificate verification failed, e.g. CRL, CA or signature check failed )
2022-10-05T01:14:30 sra-stat.2.8.0 sys: mbedtls_ssl_get_verify_result returned 0x8 ( !! The certificate is not correctly signed by the trusted CA )
2022-10-05T01:14:31 sra-stat.2.8.0 sys: connection failed while opening file within cryptographic module - mbedtls_ssl_handshake returned -9984 ( X509 - Certificate verification failed, e.g. CRL, CA or signature check failed )
2022-10-05T01:14:31 sra-stat.2.8.0 sys: mbedtls_ssl_get_verify_result returned 0x8 ( !! The certificate is not correctly signed by the trusted CA )
2022-10-05T01:14:31 sra-stat.2.8.0 int: directory not found while opening manager within virtual file system module - 'SRR2244401'
Could you please upgrade your tool to use fastq-dump from sra toolkit 3.0? NCBI changed something, and some sra runs can only be downloaded with the new version of the toolkit (for example, try SRR15808775. It works with 3.0 but has "access denied" with the earlier version or any other methods of downloading). And with the new toolkit version I cannot use the parallel wrapper.
Thank you!
If the user provides a temp directory that doesn't exist, parallel-fastq-dump throws a FileNotFoundError. How about adding an os.makedirs() call to create the directory if it doesn't exist?
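A minimal version of that guard might look like this (a sketch; make_tmp_dir is a hypothetical wrapper, not the script's actual function):

```python
import os
import tempfile

def make_tmp_dir(user_tmpdir):
    # Create the user-supplied tmpdir if it does not exist yet, then hand it
    # to TemporaryDirectory, mirroring what the script does with args.tmpdir.
    if user_tmpdir is not None:
        os.makedirs(user_tmpdir, exist_ok=True)
    return tempfile.TemporaryDirectory(prefix="pfd_", dir=user_tmpdir)
```

os.makedirs with exist_ok=True is idempotent, so it is safe to call whether or not the directory already exists.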
Hi @rvalieris ,
first of all, thank you for developing this tool. I was searching for a better way to download SRA data and found a thread of tweets pointing to this. Great stuff.
I am having trouble making the script work on our server. It runs fine on my local computer, but on the server I get this error:
SRR ids: ['DRR093002']
extra args: ['--split-files', '--gzip']
Traceback (most recent call last):
File "parallel-fastq-dump.py", line 119, in <module>
main()
File "parallel-fastq-dump.py", line 112, in main
pfd(args, si, extra_args)
File "parallel-fastq-dump.py", line 12, in pfd
tmp_dir = tempfile.TemporaryDirectory(prefix="pfd_",dir=args.tmpdir)
AttributeError: 'module' object has no attribute 'TemporaryDirectory'
here the call
parallel-fastq-dump --sra-id DRR093002 --threads 16 --outdir out/ --split-files --gzip
I've been googling for a solution, but I cannot find an obvious one. When I run parallel-fastq-dump by itself I get the list of options just fine.
Any suggestions?
D
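The AttributeError above arises because tempfile.TemporaryDirectory only exists on Python 3.2+; the "'module' object has no attribute 'TemporaryDirectory'" message means the script is being run under Python 2. A sketch of an early guard that would give a clearer message (check_python_version is a hypothetical helper):

```python
import sys
import tempfile

def check_python_version(version_info=None):
    """Return True when tempfile.TemporaryDirectory (Python >= 3.2) is available."""
    vi = sys.version_info if version_info is None else version_info
    return vi >= (3, 2)

if __name__ == "__main__":
    if not check_python_version():
        # Fail early with an actionable message instead of an AttributeError later.
        sys.exit("parallel-fastq-dump requires Python >= 3.2; "
                 "found %d.%d" % (sys.version_info[0], sys.version_info[1]))
    assert hasattr(tempfile, "TemporaryDirectory")
```

In practice the fix on the user side is simply to run the script with a Python 3 interpreter (e.g. a Python 3 conda environment).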
Hello,
Thank you for your wonderful tool. Is there any way I can download the reads with the forward and reverse identifiers? Some downstream applications require these identifiers.
Many thanks!!
I am trying to run parallel-fastq-dump on an SRA file that is already downloaded, and get an error:
root@4eb8d72f7f10:~/ncbi/dbGaP-0/sra# /root/miniconda3/bin/parallel-fastq-dump --sra-id SRR1219902.sra --threads 10 --outdir out2
SRR ids: ['SRR1219902.sra']
extra args: []
Traceback (most recent call last):
File "/root/miniconda3/bin/parallel-fastq-dump", line 4, in <module>
__import__('pkg_resources').run_script('parallel-fastq-dump==0.6.2', 'parallel-fastq-dump')
File "/root/miniconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 748, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/root/miniconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1524, in run_script
exec(script_code, namespace, namespace)
File "/root/miniconda3/lib/python3.6/site-packages/parallel_fastq_dump-0.6.2-py3.6.egg/EGG-INFO/scripts/parallel-fastq-dump", line 100, in <module>
File "/root/miniconda3/lib/python3.6/site-packages/parallel_fastq_dump-0.6.2-py3.6.egg/EGG-INFO/scripts/parallel-fastq-dump", line 92, in main
File "/root/miniconda3/lib/python3.6/site-packages/parallel_fastq_dump-0.6.2-py3.6.egg/EGG-INFO/scripts/parallel-fastq-dump", line 47, in get_spot_count
File "/root/miniconda3/lib/python3.6/subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "/root/miniconda3/lib/python3.6/subprocess.py", line 1344, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'sra-stat': 'sra-stat'
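This FileNotFoundError means sra-stat is not on PATH: the SRA Toolkit binaries are a separate install from parallel-fastq-dump itself. A sketch of an up-front dependency check (require_tools is a hypothetical helper, not part of the tool):

```python
import shutil
import sys

def require_tools(tools=("sra-stat", "fastq-dump")):
    # parallel-fastq-dump shells out to the NCBI SRA Toolkit; checking PATH
    # up front gives an actionable message instead of a traceback mid-run.
    missing = [t for t in tools if shutil.which(t) is None]
    if missing:
        sys.exit("missing required tool(s) on PATH: %s "
                 "(install the NCBI SRA Toolkit)" % ", ".join(missing))
```

shutil.which performs the same lookup the shell would, so this catches the misconfiguration before any work starts.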
Hi dbGaP,
I'm writing to ask about this specific project by NYGC: '3 CANCER CELL LINES ON 2 SEQUENCERS' dbGaP accession number: phs001839.v1.p1 https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001839.v1.p1
I've reached out to NYGC for comment on what the 72x, 74x, or 75x means or what the 48,49,45,43 means in these files: HCC1143-N_72x_48_1.fastq.gz*
HCC1143-N_72x_48_2.fastq.gz*
HCC1143-N_72x_49_1.fastq.gz*
HCC1143-N_72x_49_2.fastq.gz*
HCC1143-N_74x_45_1.fastq.gz*
HCC1143-N_74x_45_2.fastq.gz*
HCC1143-N_75x_43_1.fastq.gz*
HCC1143-N_75x_43_2.fastq.gz*
To get these files, I had to install Aspera Connect, and then use prefetch from the SRAtoolkit and then parallel-fastq-dump to extract and gzip fastq files.
NYGC said dbGaP named these files because they uploaded these BAM files and they were converted by dbGaP to FASTQs to host on the dbGaP FTP.
dbGaP is now saying that parallel-fastq-dump named these files.
Can you help me understand how files are named in the output of parallel-fastq-dump please?
Appreciate your prompt reply and thank you!
Best,
Nicole
The recommended procedure using conda install returns a non-zero status.
Cloning the repository also returns a non-zero status.
Dear developer,
Greetings! I would like to express my gratitude for developing this software. I have been using it since yesterday to convert a 10X single-cell sequencing SRA file, which has a size of 14.8 GB. However, the program has been running continuously for nearly 24 hours without producing any output. Inspecting the program's CPU usage with htop, I observed that it did not exceed 100%. I am uncertain whether this behavior is normal, and thus I seek your expert opinion. I have been running the program on my home computer with 16 threads, yet the outcome remains unchanged.
The first image illustrates the command I used and the corresponding output, while the second image showcases the CPU utilization as displayed in hTOP.
I would highly appreciate it if you could provide me with any suggestions or guidance.
Thank you sincerely!
Hi,
This is more for your information and not an issue.
I wanted to let you know about a comparison that I ran between parallel-fastq-dump and sra-tools prefetch + fasterq-dump. You can find the code and results in this repo.
This is the way that I invoke parallel-fastq-dump, so if you see some problem, a tweak, or think that it is an unfair comparison, please let me know.
I used conda install parallel-fastq-dump but I keep getting this error:
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
Current channels:
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
Is the package still available in conda, or was it removed?
Hi,
When I tried to install parallel-fastq-dump with "conda install parallel-fastq-dump", the following error showed:
Solving environment: failed
UnsatisfiableError: The following specifications were found to be in conflict:
So I tried to uninstall "enum34" with "conda uninstall enum34", but it was not allowed.
Is there any way to resolve this conflict?
Thank you!
Stella
Running parallel-fastq-dump 0.6.3 on Scientific Linux with sratools/2.8.2-1, I get this error.
[rmf@r43 temp]$ parallel-fastq-dump --split-spot --split-files --gzip -s SRR390728 -t 8
SRR ids: ['SRR390728']
extra args: ['--split-spot', '--split-files', '--gzip']
Traceback (most recent call last):
File "/home/rmf/.pyenv/versions/2.7.6/bin/parallel-fastq-dump", line 103, in <module>
main()
File "/home/rmf/.pyenv/versions/2.7.6/bin/parallel-fastq-dump", line 96, in main
pfd(args, si, extra_args)
File "/home/rmf/.pyenv/versions/2.7.6/bin/parallel-fastq-dump", line 12, in pfd
tmp_dir = tempfile.TemporaryDirectory(prefix="pfd_",dir=args.tmpdir)
AttributeError: 'module' object has no attribute 'TemporaryDirectory'
I have also tried -s SRR390728 and -s ./SRR390728.sra because I do have the .sra file locally, but I am not sure if it uses that.
You can download the SRA here:
https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-2/SRR2778062/SRR2778062.1
When I dump with 8 cores it fails, while normal fastq-dump performs fine:
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Read 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
Written 6432566 spots for /home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062
2020-01-09T14:29:16 sra-stat.2.10.0 int: path incorrect while opening manager within database module - '/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/tmp'
SRR ids: ['/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062', '/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/tmp']
extra args: ['--split-spot', '--skip-technical', '--dumpbase', '--readids', '--clip', '--read-filter', 'pass', '--defline-seq', '@$ac.$si.$sg/$ri', '--defline-qual', '+', '--gzip']
tempdir: /tmp/pfd_7z0gpkz8
/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/sra/SRR2778062/SRR2778062 spots: 51460528
blocks: [[1, 6432566], [6432567, 12865132], [12865133, 19297698], [19297699, 25730264], [25730265, 32162830], [32162831, 38595396], [38595397, 45027962], [45027963, 51460528]]
tempdir: /tmp/pfd_l9hikqiw
Traceback (most recent call last):
File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/bin/parallel-fastq-dump", line 4, in <module>
__import__('pkg_resources').run_script('parallel-fastq-dump==0.6.5', 'parallel-fastq-dump')
File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/lib/python3.8/site-packages/pkg_resources/__init__.py", line 666, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1469, in run_script
exec(script_code, namespace, namespace)
File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/lib/python3.8/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 112, in <module>
File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/lib/python3.8/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 105, in main
File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/lib/python3.8/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 15, in pfd
File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/download_fastq/.snakemake/conda/f4ec0168/lib/python3.8/site-packages/parallel_fastq_dump-0.6.5-py3.7.egg/EGG-INFO/scripts/parallel-fastq-dump", line 64, in get_spot_count
IndexError: list index out of range
The crash happens on line 64 of parallel-fastq-dump (commit fcddfa0).
"parallel-fastq-dump --sra-id SRR1219899 --threads 4 --outdir out/ --split-files --gzip"
Does this command line combine prefetch and fastq-dump? I have already downloaded the SRA file, however.
Thanks for this great tool!
Can I use the --include-technical flag in a parallel-fastq-dump command, like fasterq-dump, to make a separate file for the UMI reads of a single-cell SRA?
Hello! It looks like parallel-fastq-dump creates fastq files which are not the same size as the fastq files created by fasterq-dump.
For example:
prefetch SRR5683211 -O output
parallel-fastq-dump -t 80 -O output/parallel-fastq --tmpdir tmp/ -s output/SRR5683211.sra --split-files
fasterq-dump -e 80 -O output/fasterq -S output/SRR5683211.sra
Fasterq size:
du -h output/fasterq/SRR5683211.sra_1.fastq
7.8G output/fasterq/SRR5683211.sra_1.fastq
Parallel fastq size:
du -h output/parallel-fastq/SRR5683211_1.fastq
7.6G output/parallel-fastq/SRR5683211_1.fastq
I've noticed this before and sometimes the data loss is very large -- especially when I don't use prefetch first. Do you know what may be causing this behavior?
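File size alone is a weak signal here, since fastq-dump and fasterq-dump format deflines differently; counting FASTQ records is a more direct check for lost reads. A sketch (fastq_record_count is a hypothetical helper; gzipped output would need gzip.open instead of open):

```python
def fastq_record_count(path):
    # A FASTQ record is exactly 4 lines; comparing record counts between two
    # dumps detects real read loss, which differing file sizes alone cannot.
    with open(path) as fh:
        n_lines = sum(1 for _ in fh)
    if n_lines % 4 != 0:
        raise ValueError("%s looks truncated: %d lines" % (path, n_lines))
    return n_lines // 4
```

Running this on both SRR5683211_1.fastq files would show whether the 0.2 GB size difference reflects missing reads or just formatting.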
Hello,
I'm getting this error running parallel-fastq-dump on an HPC.
2022-09-28T15:55:16 fastq-dump.2.11.0 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
The line is present repeatedly in slurm output.
I've read a few other threads here about the same problem. I changed my --tmpdir and --outdir to a scratch drive and yes the temp files are being written there. Both sra-toolkit and parallel-fastq-dump were installed with conda (today).
Could another folder be filling up, and could that be why I'm getting the complaint? Any thoughts?
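"storage exhausted" can come from any of the filesystems involved: --tmpdir, --outdir, or the SRA cache (often under ~/ncbi), so it is worth checking free space on each, not just the scratch drive. A sketch using shutil.disk_usage (check_free_space is a hypothetical helper):

```python
import shutil

def check_free_space(path, needed_bytes):
    # Report whether the filesystem holding `path` has enough free space;
    # run this against tmpdir, outdir, and the SRA cache directory.
    return shutil.disk_usage(path).free >= needed_bytes
```

Note that uncompressed FASTQ can be several times larger than the .sra archive, and the per-thread temp files exist alongside the final output until the merge finishes, so the transient footprint is well above the final output size.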
Hi,
This is a very nice tool to speed-up the download via fastq-dump.
I have the issue that fastq-dump sometimes fails with errors like: timeout exhausted while waiting condition within process system module
a) Do you know what this error is? It seems to be related to internet speed: on my local computer with ~50 Mbit/s it is raised; however, if I download via my local compute center with a 1 Gbit/s connection, this error does not occur.
b) Using the fast connection (1 Gbit/s) I am facing another problem. I defined a tmp dir in my home, but still something is written to my root, and then I get: 2020-03-08T11:06:02 fastq-dump.2.10.3 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='4'
parallel-fastq-dump --sra-id SRR5229019 -t 40 --outdir out_fastq/ --tmpdir local_tmp
Any idea how to solve the two issues?
For the first one: if it is an issue of fastq-dump, would it be possible to have something that restarts the download from the point where it crashed?
Best,
Joachim
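On the restart question: prefetch can resume an interrupted download, so one workaround is to prefetch the .sra first and only then dump locally; failing that, a simple retry loop around the command is easy to script. A sketch (run_with_retries is a hypothetical wrapper; whether a re-run reuses partially cached data depends on the toolkit's cache settings):

```python
import subprocess
import time

def run_with_retries(cmd, max_tries=3, delay_s=30):
    """Re-invoke `cmd` until it exits 0 or the attempts run out; return its exit code."""
    rc = 1
    for attempt in range(1, max_tries + 1):
        rc = subprocess.call(cmd)
        if rc == 0:
            return 0
        if attempt < max_tries:
            time.sleep(delay_s)  # back off before retrying a flaky network fetch
    return rc
```

For example, run_with_retries(["prefetch", "SRR5229019"]) would retry the download step a few times before giving up, and the subsequent dump then reads from the local copy.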
Recently I found that parallel-fastq-dump is not working. I installed the latest version from conda.
Hi,
I'm trying to run parallel-fastq-dump, but I get the error provided below. I can find similar issues here, but none of them solves my problem. The specific call is as follows:
parallel-fastq-dump --sra-id $1 --threads 4 --outdir raw/ --split-files --gzip
Where $1 is the result of parsing a SRR ID list.
The log from one of the SRA IDs (SRR6337208):
2021-04-30 13:45:38,797 - SRR ids: ['SRR6337208']
2021-04-30 13:45:38,797 - extra args: ['--split-files', '--gzip']
2021-04-30 13:45:38,798 - tempdir: /tmp/pfd_zhzmw3es
2021-04-30 13:45:38,798 - CMD: sra-stat --meta --quick SRR6337208
Traceback (most recent call last):
File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 116, in get_spot_count
total += int(l.split('|')[2].split(':')[0])
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 181, in <module>
main()
File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 175, in main
pfd(args, si, extra_args)
File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 49, in pfd
n_spots = get_spot_count(srr_id)
File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 122, in get_spot_count
raise IndexError(msg.format('\n'.join(txt), '\n'.join(etxt)))
IndexError: sra-stat output parsing error!
--sra-stat STDOUT--
--sra-stat STDERR--
2021-04-30T11:47:40 sra-stat.2.11.0 int: directory not found while opening manager within virtual file system module -