open2c / distiller-nf
A modular Hi-C mapping pipeline
License: MIT License
It would be helpful to have shorter, polished error reports instead of raw nextflow or python exception messages. For example, when the path to an input file was wrong, I got the following output, and it took me a while to figure out that there was a typo in the input path:
N E X T F L O W ~ version 0.24.3
Launching /home/ubuntu/distiller/distiller.nf
[distracted_leakey] - revision: 3548dd9b36
[warm up] executor > local
[1d/011c3a] Submitted process > chunk_fastqs (library:lib-1-A run:lane1)
[f7/6acfcc] Submitted process > fastqc (library:lib-1-A run:lane1 side:2)
[fb/6762f7] Submitted process > fastqc (library:lib-1-A run:lane1 side:1)
ERROR ~ Error executing process > 'fastqc (library:lib-1-A run:lane1 side:2)'
Caused by:
Missing output file(s) lib-1-A.lane1.2_fastqc.html
expected by process fastqc (library:lib-1-A run:lane1 side:2)
Command executed:
mkdir -p ./temp_fastqc/
ln -s $(readlink -f lib-1-A_S1_L001_R2_001.fastq.gz) ./temp_fastqc/lib-1-A.lane1.2.fastq.gz
fastqc --threads 4 -o ./ -f fastq ./temp_fastqc/lib-1-A.lane1.2.fastq.gz
rm -r ./temp_fastqc/
Command exit status:
0
Command output:
(empty)
Command error:
Skipping './temp_fastqc/lib-1-A.lane1.2.fastq.gz' which didn't exist, or couldn't be read
.command.run.1: line 99: 12 Terminated nxf_trace "$pid" .command.trace
Work dir:
/data/work/f7/6acfcce9869c0f07e606ff85dbe122
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh
-- Check '.nextflow.log' file for details
WARN: Killing pending tasks (2)
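One way to get a shorter report for this case is to check the input paths before any process is submitted. The following is only a sketch in Python, not distiller code; the function names missing_inputs and format_report are made up for illustration.

```python
import os


def missing_inputs(paths):
    """Return the paths that are not existing, readable regular files."""
    return [
        p for p in paths
        if not (os.path.isfile(p) and os.access(p, os.R_OK))
    ]


def format_report(missing):
    """Build a short, human-readable message; empty string if all is fine."""
    if not missing:
        return ""
    lines = ["Input file(s) not found or not readable:"]
    lines += ["  " + p for p in missing]
    return "\n".join(lines)
```

Calling format_report(missing_inputs(paths)) on the fastq paths from the project file would then name the mistyped path directly, before fastqc ever runs.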
storeDir is a huge troublemaker in a cluster environment: whenever some process fails downstream, for whatever reason, some upstream processes can be copying intermediate results to the storeDir location. After a restart or -resume, processes take off with whatever is present in storeDir, which isn't always a good thing: the results might be incomplete, and it often creates a mess.
errorStrategy='retry' does not necessarily save us in all cases.
The intended use of storeDir is to store things as a permanent cache, in most cases after downloading from some database (like SRA), not to store all intermediate results there (according to Paolo).
Without the -o pipefail option, commands in a pipe can fail while the whole pipeline still returns as successful, creating all kinds of hard-to-trace havoc in nextflow. Based on this discussion, we just add something like process.shell = ['/bin/bash', '-uexo','pipefail'] to nextflow.config to set these globally.
Will make a PR shortly.
More on safe bash scripts.
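To see what pipefail changes, here is a small sketch that runs the same pipe through bash with and without the option. It assumes bash is on the PATH; the helper name pipe_exit_status is made up for illustration.

```python
import subprocess


def pipe_exit_status(command, pipefail):
    """Run a shell command line through bash and return its exit status."""
    flags = ["-o", "pipefail"] if pipefail else []
    result = subprocess.run(["bash", *flags, "-c", command])
    return result.returncode


if __name__ == "__main__":
    # "false" fails, but "cat" succeeds and is the last command in the pipe.
    print("without pipefail:", pipe_exit_status("false | cat", pipefail=False))
    print("with pipefail:   ", pipe_exit_status("false | cat", pipefail=True))
```

Without the option, bash reports the status of the last command in the pipe only, so the failure of the first one goes unnoticed.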
I've tried both on Linux and Mac and I'm getting the same error:
ERROR ~ Error executing process > 'download_sra (sra:SRR2601843?start=0&end=10000)'
Caused by:
Process `download_sra (sra:SRR2601843?start=0&end=10000)` terminated with an error exit status (3)
Command executed:
fastq-dump -F SRR2601843 --minSpotId 0 --maxSpotId 10000 --split-files --gzip
mv SRR2601843_1.fastq.gz MATalpha_R1.lane2.1.fastq.gz
mv SRR2601843_2.fastq.gz MATalpha_R1.lane2.2.fastq.gz
Command exit status:
3
Command output:
(empty)
Command error:
2017-05-02T12:50:09 fastq-dump.2.8.1 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
2017-05-02T12:50:39 fastq-dump.2.8.1 sys: connection failed while opening file within cryptographic module - mbedtls_ssl_handshake returned -76 ( NET - Reading information from the socket failed )
2017-05-02T12:50:39 fastq-dump.2.8.1 err: no error - error with http open 'http://sra-download.ncbi.nlm.nih.gov/srapub/SRR2601843'
2017-05-02T12:50:39 fastq-dump.2.8.1 err: item not found while constructing within virtual database module - the path 'SRR2601843' cannot be opened as database or table
.command.run.1: line 104: 13 Terminated nxf_trace "$pid" .command.trac
We built the docker image after modifying pairsamtools. Everything except fastqc works fine: the fastqc processes get stuck when we use the new image, while fastqc works in the existing docker image on Docker Hub.
Analysis complete for 2-Cellaphidicolintreated_R1.lane_39096.1.fastq.gz
Exception in thread "Thread-1" java.lang.Error: Probable fatal error:No fonts found.
at sun.font.SunFontManager.getDefaultPhysicalFont(SunFontManager.java:1236)
at sun.font.SunFontManager.initialiseDeferredFont(SunFontManager.java:1100)
at sun.font.SunFontManager.findOtherDeferredFont(SunFontManager.java:1037)
at sun.font.SunFontManager.findDeferredFont(SunFontManager.java:1054)
at sun.font.SunFontManager.findFont2D(SunFontManager.java:2256)
Just reporting this in case you are interested in having it work on Mac OS:
$ bash setup_test.sh
++ pwd
+ PROJECT_DIR=/Users/pditommaso/Downloads/test/test
+ mkdir -p /Users/pditommaso/Downloads/test/test
+ cd /Users/pditommaso/Downloads/test/test
+ curl -LkSs https://api.github.com/repos/mirnylab/distiller-test-data/tarball
+ tar -zxf - --strip=1 --wildcards '*/genome/*' --wildcards '*/fastq/*'
tar: Option --wildcards is not supported
Usage:
List: tar -tf <archive-filename>
Extract: tar -xf <archive-filename>
Create: tar -cf <archive-filename> [filenames...]
Help: tar --help
curl: (23) Failed writing body (0 != 1370)
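The tar on this Mac rejects --wildcards, so the pattern-based extraction fails. One possible portable workaround is to do the selection in Python instead of relying on tar options. This is a sketch only and is not what setup_test.sh does; the function name extract_matching and its strip argument are made up here, and fnmatch patterns are not guaranteed to behave exactly like tar wildcards.

```python
import fnmatch
import os
import tarfile


def extract_matching(archive_path, patterns, dest, strip=1):
    """Extract regular files whose names match any pattern.

    The first `strip` path components are dropped from each name.
    Returns the relative paths that were written under `dest`.
    """
    extracted = []
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            if not any(fnmatch.fnmatch(member.name, p) for p in patterns):
                continue
            parts = member.name.split("/")[strip:]
            if not parts or ".." in parts:
                continue  # nothing left after stripping, or unsafe path
            target = os.path.join(dest, *parts)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with tar.extractfile(member) as src, open(target, "wb") as out:
                out.write(src.read())
            extracted.append(os.path.join(*parts))
    return extracted
```

With the tarball saved to a local file, extract_matching(path, ["*/genome/*", "*/fastq/*"], ".") would select the same two folders that the script asks for.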
Should there be checks of available memory, with warnings issued depending on chunksize, etc.?
N GB of input data requires more than 10N GB of space to run. Also, the space usage peaks during sorting, as mentioned in an earlier ticket. In particular, in my run, 54M of input required almost 600M of free space. Is there any chance to get rid of intermediate files as soon as we delete them?
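A pre-flight check along these lines could warn before the run starts. This is a rough sketch, not part of distiller: the factor of 10 is only the rule of thumb quoted above (the 54M run needed slightly more), and the function names are made up.

```python
import shutil


def required_bytes(input_bytes, factor=10):
    """Estimate the free space needed; `factor` is a rough rule of thumb."""
    return input_bytes * factor


def has_enough_space(input_bytes, free_bytes, factor=10):
    """True if the free space covers the estimated requirement."""
    return free_bytes >= required_bytes(input_bytes, factor)


def check_work_dir(input_bytes, work_dir=".", factor=10):
    """Compare the estimate against the free space on the work dir's disk."""
    free = shutil.disk_usage(work_dir).free
    return has_enough_space(input_bytes, free, factor)
```

Since the observed peak was closer to 11 times the input size, a larger factor may be the safer choice.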
Need to have some pre-check maybe. I was lucky it was my first file to be processed.
output:
ERROR ~ Error executing process > 'parse_runs (library:strain_2 run:lane1)'
Caused by:
Cannot invoke method toBoolean() on null object
Source block:
dropsam_flag = params['map'].get('drop_sam','false').toBoolean() ? '--drop-sam' : ''
dropreadid_flag = params['map'].get('drop_readid','false').toBoolean() ? '--drop-readid' : ''
dropseq_flag = params['map'].get('drop_seq','false').toBoolean() ? '--drop-seq' : ''
stats_command = (params.get('do_stats', 'true').toBoolean() ?
"pairsamtools stats ${library}.${run}.pairsam.gz -o ${library}.${run}.stats" :
"touch ${library}.${run}.stats" )
n_parse_processes = (int)Math.ceil(task.cpus / 2)
n_parse_processes = n_parse_processes < 1 ? 1 : n_parse_processes
if( isSingleFile(bam))
"""
mkdir ./tmp4sort
pairsamtools parse ${dropsam_flag} ${dropreadid_flag} ${dropseq_flag} \
-c ${chrom_sizes} ${bam} \
| pairsamtools sort --nproc ${task.cpus} \
-o ${library}.${run}.pairsam.gz \
--tmpdir ./tmp4sort \
| cat
rm -rf ./tmp4sort
${stats_command}
"""
else
"""
mkdir ./tmp4sort
mkdir ./tmp_pairsam
parallel -P${n_parse_processes} 'pairsamtools parse \
${dropsam_flag} ${dropseq_flag} ${dropreadid_flag} -c ${chrom_sizes} {} \
| pairsamtools sort --nproc 4 \
-o ./tmp_pairsam/{}.pairsam.gz \
--tmpdir ./tmp4sort ' ::: ${bam}
pairsamtools merge ./tmp_pairsam/* --nproc ${task.cpus} -o ${library}.${run}.pairsam.gz
rm -rf ./tmp4sort
rm -rf ./tmp_pairsam
${stats_command}
"""
project.yml
do_fastqc: True
do_stats: True
# Fastqs can be provided as:
# -- a pair of relative/absolute paths
# -- sra:<SRA_NUMBER>, optionally followed by the indices of the first and
# the last entry in the SRA in the form of "?start=<first>&end=<last>"
# [to implement] -- as a path to a folder with fastqs '<base_folder>', with the structure
# <base_folder>/<library_name>/<run_name>/, with each folder containing only
# two fastq.gz files
input:
    raw_reads_paths:
        strain_1:
            lane1:
                - /home/hbrandao/Documents/Rudner_data/test_distiller/fastq/SRR2002581_1.fastq.gz
                - /home/hbrandao/Documents/Rudner_data/test_distiller/fastq/SRR2002581_2.fastq.gz
        strain_2:
            lane1:
                - /home/hbrandao/Documents/Rudner_data/test_distiller/fastq/SRR2002583_2.fastq.gz
                - /home/hbrandao/Documents/Rudner_data/test_distiller/fastq/SRR2002584_1.fastq.gz
    library_groups:
        PY79:
            - strain_1
            - strain_2
genome:
    assembly: 'bacillus'
    bwa_index_wildcard: '/home/hbrandao/Documents/Rudner_data/PY79_genome/bwa_index/NC_022898.1.fa.*'
    chrom_sizes_path: '/home/hbrandao/Documents/Rudner_data/PY79_genome/bwa_index/PY79.chrom.sizes'
map:
    chunksize: 6000
    #drop_sam: False
    #drop_readid: False
    #drop_seq: False
filter:
    pcr_dups_max_mismatch_bp: 3
bin:
    resolutions:
        - 1000000
        - 400000
        - 200000
        - 100000
        - 40000
        - 20000
        - 10000
intermediates:
    base_dir: './intermediates/'
    dirs:
        downloaded_fastqs: 'downloaded_fastqs/'
        fastq_chunks: 'fastq_chunks'
        bam_run: 'bam/run'
        pairsam_run: 'pairsam/run'
        pairsam_library: 'pairsam/library'
output:
    base_dir: '/home/hbrandao/Documents/Rudner_data/test_distiller/output/'
    dirs:
        fastqc: 'fastqc/'
        pairs_library: 'pairs/library/'
        stats_run: 'stats/run/'
        stats_library: 'stats/library/'
        stats_library_group: 'stats/library_group/'
        coolers_library: 'coolers/library/'
        coolers_library_group: 'coolers/library_group/'
        bams_library: 'bams/library_group/'
In the fastqc process, the command that creates the symlink, ln -s $(readlink -f ${fastq}) ..., would fail if one were to execute the distiller pipeline in a folder with spaces or parentheses in its name.
A possible solution would be to change the symlink command to ln -s \"$(readlink -f ${fastq})\" ..., in order to guarantee proper escaping.
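The effect of the missing quotes can be illustrated with Python's shlex module, which splits a command line on whitespace much like a shell does and can also quote an argument safely. This is only an illustration of the word-splitting problem, not distiller code; link_command is a made-up helper and the example path is hypothetical.

```python
import shlex


def link_command(source, target):
    """Build an ln -s command line with both paths quoted for the shell."""
    return "ln -s {} {}".format(shlex.quote(source), shlex.quote(target))


if __name__ == "__main__":
    path = "/data/my run (1)/lib.fastq.gz"  # hypothetical folder name
    unquoted = "ln -s " + path + " ./temp_fastqc/lib.fastq.gz"
    # Unquoted, the path falls apart into several separate words.
    print(shlex.split(unquoted))
    # Quoted, it stays a single argument.
    print(shlex.split(link_command(path, "./temp_fastqc/lib.fastq.gz")))
```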
Currently they are somewhere in the work folder, I believe
Both in the "work" and in the "intermediates" folder
I got the following error in the parse_runs step. I was able to use the same configuration files on other datasets without any errors, but this time it failed after the alignment step.
My map parameters were:
map:
    chunksize: 10000000
    # drop_sam: True
    drop_readid: True
    drop_seq: True
####################################
Caused by:
Cannot invoke method toBoolean() on null object
Source block:
dropsam_flag = params['map'].get('drop_sam','false').toBoolean() ? '--drop-sam' : ''
dropreadid_flag = params['map'].get('drop_readid','false').toBoolean() ? '--drop-readid' : ''
dropseq_flag = params['map'].get('drop_seq','false').toBoolean() ? '--drop-seq' : ''
stats_command = (params.get('do_stats', 'true').toBoolean() ?
"pairsamtools stats ${library}.${run}.pairsam.gz -o ${library}.${run}.stats" :
"touch ${library}.${run}.stats" )
n_parse_processes = (int)Math.ceil(task.cpus / 2)
n_parse_processes = n_parse_processes < 1 ? 1 : n_parse_processes
if( isSingleFile(bam))
"""
mkdir ./tmp4sort
pairsamtools parse ${dropsam_flag} ${dropreadid_flag} ${dropseq_flag} \
-c ${chrom_sizes} ${bam} \
| pairsamtools sort --nproc ${task.cpus} \
-o ${library}.${run}.pairsam.gz \
--tmpdir ./tmp4sort \
| cat
rm -rf ./tmp4sort
${stats_command}
"""
else
"""
mkdir ./tmp4sort
mkdir ./tmp_pairsam
parallel -P${n_parse_processes} 'pairsamtools parse \
${dropsam_flag} ${dropseq_flag} ${dropreadid_flag} -c ${chrom_sizes} {} \
| pairsamtools sort --nproc 4 \
-o ./tmp_pairsam/{}.pairsam.gz \
--tmpdir ./tmp4sort ' ::: ${bam}
pairsamtools merge ./tmp_pairsam/* --nproc ${task.cpus} -o ${library}.${run}.pairsam.gz
rm -rf ./tmp4sort
rm -rf ./tmp_pairsam
${stats_command}
"""
[Nezar: edited formatting]
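The message says that one of the lookups returned null before toBoolean() was called on it. Which lookup that was is not clear from the log. As a sketch of a null-tolerant version of the same flag logic, written in Python rather than in the pipeline's Groovy, with made-up helper names as_bool and drop_flags:

```python
def as_bool(value, default=False):
    """Interpret booleans and strings like 'True'; fall back on None."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("true", "yes", "1")


def drop_flags(map_params):
    """Translate the map section of the config into pairsamtools parse flags."""
    map_params = map_params or {}  # tolerate a missing or empty section
    options = [
        ("drop_sam", "--drop-sam"),
        ("drop_readid", "--drop-readid"),
        ("drop_seq", "--drop-seq"),
    ]
    return [flag for key, flag in options if as_bool(map_params.get(key))]
```

With the map parameters from the report above, drop_flags would give --drop-readid and --drop-seq, and a commented-out or empty key simply counts as false.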
Somehow the result is not being stored as completed
When I use conda to install the dependencies locally, following the commands in the Dockerfile and the environment.yml file, it gets stuck when installing pairsamtools.
I replaced
git+git://github.com/mirnylab/pairsamtools
with
git+https://github.com/mirnylab/pairsamtools.git
and this solved the problem for me.
Could be a nice way to avoid needing to re-balance everything after mapping
I think parallel should be installed in the container. I'm getting this error:
ERROR ~ Error executing process > 'parse_runs (library:MATa_R2 run:lane1)'
Caused by:
Process `parse_runs (library:MATa_R2 run:lane1)` terminated with an error exit status (127)
Command executed:
mkdir ./tmp4sort
mkdir ./tmp_pairsam
parallel -P4 'pairsamtools parse {} | pairsamtools sort --nproc 4 -o ./tmp_pairsam/{}.pairsam.gz --tmpdir ./tmp4sort ' ::: MATa_R2.lane1.00.bam MATa_R2.lane1.01.bam
pairsamtools merge ./tmp_pairsam/* --nproc 8 -o MATa_R2.lane1.pairsam.gz
rm -rf ./tmp4sort
rm -rf ./tmp_pairsam
pairsamtools stats MATa_R2.lane1.pairsam.gz -o MATa_R2.lane1.stats
Command exit status:
127
Command output:
(empty)
Command error:
.command.sh: line 4: parallel: command not found
.command.run.1: line 104: 13 Terminated nxf_trace "$pid" .command.trace
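Exit status 127 is the shell's way of saying the command was not found. A check for required executables before launching could catch this early. Here is a minimal sketch, assuming the tool list is maintained by hand; missing_tools is a made-up name, and the which argument only exists to make the function easy to test.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Running missing_tools(["parallel", "pairsamtools", "fastqc"]) inside the container would list whatever is not installed there.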
Distiller does not check, before running, that the library groups are consistent with the libraries defined in the experiment. As such, a misprint in the library groups of a complex experiment would result in an error.
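Such a consistency check could look roughly like this. It is a sketch that assumes the project file has already been loaded into dictionaries shaped like the raw_reads_paths and library_groups sections of project.yml; the function name undefined_libraries is made up.

```python
def undefined_libraries(raw_reads_paths, library_groups):
    """Map each group name to the libraries it lists that are not defined."""
    known = set(raw_reads_paths)
    problems = {}
    for group, libraries in library_groups.items():
        unknown = [lib for lib in libraries if lib not in known]
        if unknown:
            problems[group] = unknown
    return problems
```

An empty result means every group refers only to libraries that have reads assigned, so a misprint shows up before any mapping starts.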
It would be nice to be able to specify where GNU sort puts its temporary files.
Error log, just in case:
[3c/371792] Submitted process > parse_runs (library:WaplKO_1.14-B run:lane2) [18/1965]
ERROR ~ Error executing process > 'parse_runs (library:WaplKO_3.3-C run:lane1)'
Caused by:
Process parse_runs (library:WaplKO_3.3-C run:lane1)
terminated with an error exit status (1)
Command executed:
mkdir ./tmp4sort
mkdir ./tmp_pairsam
parallel -P4 'pairsamtools parse --drop-seq --drop-readid -c hg19.chrom.sizes.reduced {} | pairsamtools sort --nproc 4 -o ./tmp
_pairsam/{}.pairsam.gz --tmpdir ./tmp4sort ' ::: WaplKO_3.3-C.lane1.20.bam WaplKO_3.3-C.lane1.00.bam WaplKO_3.3-C.lane1.01.bam WaplKO_3.3-C.lane1.03.bam W
aplKO_3.3-C.lane1.02.bam WaplKO_3.3-C.lane1.04.bam WaplKO_3.3-C.lane1.05.bam WaplKO_3.3-C.lane1.43.bam WaplKO_3.3-C.lane1.06.bam WaplKO_3.3-C.lane1.07.bam WaplKO_3.3-C.lane1.08.bam WaplK
O_3.3-C.lane1.09.bam WaplKO_3.3-C.lane1.10.bam WaplKO_3.3-C.lane1.11.bam WaplKO_3.3-C.lane1.12.bam WaplKO_3.3-C.lane1.13.bam WaplKO_3.3-C.lane1.14.bam WaplKO_3.3-C.lane1.15.bam WaplKO_3.
3-C.lane1.17.bam WaplKO_3.3-C.lane1.18.bam WaplKO_3.3-C.lane1.21.bam WaplKO_3.3-C.lane1.19.bam WaplKO_3.3-C.lane1.24.bam WaplKO_3.3-C.lane1.22.bam WaplKO_3.3-C.lane1.23.bam WaplKO_3.3-C.
lane1.27.bam WaplKO_3.3-C.lane1.26.bam WaplKO_3.3-C.lane1.25.bam WaplKO_3.3-C.lane1.29.bam WaplKO_3.3-C.lane1.28.bam WaplKO_3.3-C.lane1.31.bam WaplKO_3.3-C.lane1.30.bam WaplKO_3.3-C.lane
1.32.bam WaplKO_3.3-C.lane1.33.bam WaplKO_3.3-C.lane1.34.bam WaplKO_3.3-C.lane1.35.bam WaplKO_3.3-C.lane1.36.bam WaplKO_3.3-C.lane1.40.bam WaplKO_3.3-C.lane1.41.bam WaplKO_3.3-C.lane1.37
.bam WaplKO_3.3-C.lane1.39.bam WaplKO_3.3-C.lane1.38.bam WaplKO_3.3-C.lane1.42.bam WaplKO_3.3-C.lane1.48.bam WaplKO_3.3-C.lane1.44.bam WaplKO_3.3-C.lane1.45.bam WaplKO_3.3-C.lane1.47.bam
WaplKO_3.3-C.lane1.46.bam WaplKO_3.3-C.lane1.16.bam
pairsamtools merge ./tmp_pairsam/* --nproc 8 -o WaplKO_3.3-C.lane1.pairsam.gz
rm -rf ./tmp4sort
rm -rf ./tmp_pairsam
pairsamtools stats WaplKO_3.3-C.lane1.pairsam.gz -o WaplKO_3.3-C.lane1.stats
Command exit status:
1
Command output:
(empty)
Command error:
sort: write failed: /tmp/sortE2aarR: No space left on device
.command.run.1: line 50: cannot create temp file for here-document: No space left on device
Traceback (most recent call last):
File "/miniconda3/bin/pairsamtools", line 11, in
load_entry_point('pairsamtools==0.0.1.dev0', 'console_scripts', 'pairsamtools')()
File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 722, in call
return self.main(*args, **kwargs)
File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/miniconda3/lib/python3.6/site-packages/pairsamtools/pairsam_stats.py", line 50, in stats
stats_py(input_path, output, merge)
File "/miniconda3/lib/python3.6/site-packages/pairsamtools/pairsam_stats.py", line 99, in stats_py
cols[_pairsam_format.COL_C1],
IndexError: list index out of range
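For reference on the request above: GNU sort itself accepts -T (--temporary-directory) and also honours the TMPDIR environment variable, so the temporary files do not have to go to /tmp. Whether and how the pipeline passes this through is not shown in the log. A small sketch, assuming a sort binary on the PATH, with the made-up helper name sort_lines:

```python
import os
import subprocess


def sort_lines(lines, tmpdir):
    """Sort lines with the system sort, keeping its temp files in tmpdir."""
    env = dict(os.environ, LC_ALL="C")  # byte-wise, locale-independent order
    result = subprocess.run(
        ["sort", "-T", tmpdir],
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,
        env=env,
    )
    return result.stdout.splitlines()
```

Pointing tmpdir at a scratch location with enough free space is the idea; sort only writes temporary files there once the input no longer fits in its memory buffer.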