open2c / distiller-nf Goto Github PK


A modular Hi-C mapping pipeline

License: MIT License

Shell 5.03% Groovy 67.19% Nextflow 3.73% Dockerfile 2.69% Python 21.36%

pipeline nextflow hi-c

distiller-nf's Issues

add pairs and coolers to intermediates

chunks of fastq.gz not deleted from work folder upon completion of a job

Add normalization and corresponding parameters

It would be helpful to have shorter, polished error reports instead of raw nextflow or python exception messages. For example, when the path to an input file was wrong, I got the following output, and it took me a while to figure out that there was a typo in the input path:

N E X T F L O W ~ version 0.24.3
Launching /home/ubuntu/distiller/distiller.nf [distracted_leakey] - revision: 3548dd9b36
[warm up] executor > local
[1d/011c3a] Submitted process > chunk_fastqs (library:lib-1-A run:lane1)
[f7/6acfcc] Submitted process > fastqc (library:lib-1-A run:lane1 side:2)
[fb/6762f7] Submitted process > fastqc (library:lib-1-A run:lane1 side:1)
ERROR ~ Error executing process > 'fastqc (library:lib-1-A run:lane1 side:2)'

Caused by:
Missing output file(s) lib-1-A.lane1.2_fastqc.html expected by process fastqc (library:lib-1-A run:lane1 side:2)

Command executed:

mkdir -p ./temp_fastqc/
ln -s $(readlink -f lib-1-A_S1_L001_R2_001.fastq.gz) ./temp_fastqc/lib-1-A.lane1.2.fastq.gz
fastqc --threads 4 -o ./ -f fastq ./temp_fastqc/lib-1-A.lane1.2.fastq.gz
rm -r ./temp_fastqc/

Command exit status:
0

Command output:
(empty)

Command error:
Skipping './temp_fastqc/lib-1-A.lane1.2.fastq.gz' which didn't exist, or couldn't be read
.command.run.1: line 99: 12 Terminated nxf_trace "$pid" .command.trace

Work dir:
/data/work/f7/6acfcce9869c0f07e606ff85dbe122

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh

-- Check '.nextflow.log' file for details
WARN: Killing pending tasks (2)
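
One way to get a shorter report for this kind of failure would be to check the input paths before any process is submitted. A minimal sketch, assuming a hypothetical helper named check_inputs that is not part of distiller:

```shell
# Hypothetical pre-flight check (not part of distiller): print one short
# message per input path that is missing or unreadable.
check_inputs() {
    local missing=0 path
    for path in "$@"; do
        if [ ! -r "$path" ]; then
            echo "ERROR: input file not found or not readable: $path" >&2
            missing=1
        fi
    done
    # Non-zero status if at least one path failed the check.
    return "$missing"
}
```

It could be called with the fastq paths from the project file before launching the pipeline, for example `check_inputs lib-1-A_S1_L001_R1_001.fastq.gz lib-1-A_S1_L001_R2_001.fastq.gz || exit 1`.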

get rid of most of `storeDir` - the troublemaker

storeDir is a huge troublemaker in a cluster environment: when a downstream process fails, for whatever reason, upstream processes may still be copying intermediate results to the storeDir location. After a restart or -resume, processes pick up whatever is present in storeDir, which may be incomplete, and this often creates a mess.

Jobs/processes can fail on a cluster for many reasons that we cannot control, and even errorStrategy='retry' does not necessarily save us in all cases.
The intended use of storeDir is as a permanent cache, in most cases for data downloaded from a database (like SRA), not as a store for all intermediate results (according to Paolo).

Use safe script bash options

Without the -o pipefail option, a command in a pipe can fail while the pipe as a whole still returns success, creating all kinds of hard-to-trace havoc in nextflow. Based on this discussion, we can just add something like process.shell = ['/bin/bash', '-uexo','pipefail'] to nextflow.config to set these options globally.

Will make a PR shortly.
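
For illustration, the effect of pipefail can be seen outside nextflow with a pipe whose first command fails. This is a small standalone sketch; the variable names are only for the demonstration:

```shell
# Exit status of a pipe whose first command fails, without and with pipefail.

# Default: the pipe reports the status of the last command (cat) only.
status_without=$(bash -c 'false | cat; echo $?')

# With pipefail: the pipe reports the status of the failing command.
status_with=$(bash -o pipefail -c 'false | cat; echo $?')

echo "without pipefail: $status_without"
echo "with pipefail:    $status_with"
```

The first status is 0 even though `false` failed; the second is 1, which is what lets nextflow notice the failure.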

SRA download step fails

I've tried both on Linux and Mac and I'm getting the same error:

ERROR ~ Error executing process > 'download_sra (sra:SRR2601843?start=0&end=10000)'

Caused by:
  Process `download_sra (sra:SRR2601843?start=0&end=10000)` terminated with an error exit status (3)

Command executed:

  fastq-dump -F SRR2601843  --minSpotId 0  --maxSpotId 10000 --split-files --gzip
  mv SRR2601843_1.fastq.gz MATalpha_R1.lane2.1.fastq.gz
  mv SRR2601843_2.fastq.gz MATalpha_R1.lane2.2.fastq.gz

Command exit status:
  3

Command output:
  (empty)

Command error:
  2017-05-02T12:50:09 fastq-dump.2.8.1 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
  2017-05-02T12:50:39 fastq-dump.2.8.1 sys: connection failed while opening file within cryptographic module - mbedtls_ssl_handshake returned -76 ( NET - Reading information from the socket failed )
  2017-05-02T12:50:39 fastq-dump.2.8.1 err: no error - error with http open 'http://sra-download.ncbi.nlm.nih.gov/srapub/SRR2601843'
  2017-05-02T12:50:39 fastq-dump.2.8.1 err: item not found while constructing within virtual database module - the path 'SRR2601843' cannot be opened as database or table
  .command.run.1: line 104:    13 Terminated              nxf_trace "$pid" .command.trac

specify number of cores for the whole pipeline and individual tasks

Problem with Fastqc in Recently Built Docker Image

We built the docker image after modifying pairsamtools. Everything except for fastqc works fine. Fastqc processes got stuck when we used the new image. Fastqc works in the existing docker image in docker hub.

save the uniquely mapped non-rescued chimeric alignments in a flexible format

add a command to modify fastq files and pipe the output to bwa-mem

fastqc seems to halt after completion with an error

Analysis complete for 2-Cellaphidicolintreated_R1.lane_39096.1.fastq.gz
Exception in thread "Thread-1" java.lang.Error: Probable fatal error:No fonts found.
at sun.font.SunFontManager.getDefaultPhysicalFont(SunFontManager.java:1236)
at sun.font.SunFontManager.initialiseDeferredFont(SunFontManager.java:1100)
at sun.font.SunFontManager.findOtherDeferredFont(SunFontManager.java:1037)
at sun.font.SunFontManager.findDeferredFont(SunFontManager.java:1054)
at sun.font.SunFontManager.findFont2D(SunFontManager.java:2256)

Setup script is not working on Mac

Just reporting this in case you are interested in having it work on Mac OS

$ bash setup_test.sh 
++ pwd
+ PROJECT_DIR=/Users/pditommaso/Downloads/test/test
+ mkdir -p /Users/pditommaso/Downloads/test/test
+ cd /Users/pditommaso/Downloads/test/test
+ curl -LkSs https://api.github.com/repos/mirnylab/distiller-test-data/tarball
+ tar -zxf - --strip=1 --wildcards '*/genome/*' --wildcards '*/fastq/*'
tar: Option --wildcards is not supported
Usage:
  List:    tar -tf <archive-filename>
  Extract: tar -xf <archive-filename>
  Create:  tar -cf <archive-filename> [filenames...]
  Help:    tar --help
curl: (23) Failed writing body (0 != 1370)
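
The log above shows a tar that does not recognize --wildcards. One possible workaround is to pass the flag only to GNU tar. The sketch below assumes that other tar implementations match member patterns without the flag; tar_wildcard_flag is a hypothetical helper, not part of setup_test.sh:

```shell
# Hypothetical helper: print "--wildcards" only when the given
# `tar --version` text identifies GNU tar, otherwise print nothing.
tar_wildcard_flag() {
    case "$1" in
        *"GNU tar"*) echo "--wildcards" ;;
        *) echo "" ;;
    esac
}

# Detect the local tar once; $flag is empty for non-GNU implementations.
flag=$(tar_wildcard_flag "$(tar --version 2>/dev/null)")
echo "wildcard flag for this tar: '${flag}'"
```

The resulting `$flag` could then be placed, unquoted, before each pattern in the extraction command of the setup script.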

memory checks?

should there be some available memory checks & warnings issued depending on chunksize etc?

Space Usage

N GB of input data requires more than 10N GB of space to run. Also, the space usage peaks during sorting, as mentioned in an earlier ticket. In particular, in my run, 54M of input required almost 600M of free space. Is there any chance to get rid of intermediate files as soon as they are no longer needed?

expose min_mapq option of pairsamtools parse

Parentheses in library group names lead to bash errors

Need to have some pre-check maybe. I was lucky it was my first file to be processed.

Failed at parse_runs

output:

ERROR ~ Error executing process > 'parse_runs (library:strain_2 run:lane1)'

Caused by:
  Cannot invoke method toBoolean() on null object


Source block:
  dropsam_flag = params['map'].get('drop_sam','false').toBoolean() ? '--drop-sam' : ''
  dropreadid_flag = params['map'].get('drop_readid','false').toBoolean() ? '--drop-readid' : ''
  dropseq_flag = params['map'].get('drop_seq','false').toBoolean() ? '--drop-seq' : ''
  stats_command = (params.get('do_stats', 'true').toBoolean() ?
      "pairsamtools stats ${library}.${run}.pairsam.gz -o ${library}.${run}.stats" :
      "touch ${library}.${run}.stats" )
  n_parse_processes = (int)Math.ceil(task.cpus / 2)
  n_parse_processes = n_parse_processes < 1 ? 1 : n_parse_processes
  if( isSingleFile(bam))
      """
      mkdir ./tmp4sort
      pairsamtools parse ${dropsam_flag} ${dropreadid_flag} ${dropseq_flag} \
          -c ${chrom_sizes}  ${bam} \
              | pairsamtools sort --nproc ${task.cpus} \
                                  -o ${library}.${run}.pairsam.gz \
                                  --tmpdir ./tmp4sort \
              | cat
  
      rm -rf ./tmp4sort
  
      ${stats_command}
      """
  else 
      """
      mkdir ./tmp4sort
      mkdir ./tmp_pairsam
      parallel -P${n_parse_processes} 'pairsamtools parse \
          ${dropsam_flag} ${dropseq_flag} ${dropreadid_flag} -c ${chrom_sizes} {} \
          | pairsamtools sort --nproc 4 \
                              -o ./tmp_pairsam/{}.pairsam.gz \
                              --tmpdir ./tmp4sort ' ::: ${bam}
  
      pairsamtools merge ./tmp_pairsam/* --nproc ${task.cpus} -o ${library}.${run}.pairsam.gz
  
      rm -rf ./tmp4sort
      rm -rf ./tmp_pairsam
  
      ${stats_command}
      """

project.yml

do_fastqc: True
do_stats: True
    

# Fastqs can be provided as:
# -- a pairs of relative/absolute paths
# -- sra:<SRA_NUMBER>, optionally followed by the indices of the first and
# the last entry in the SRA in the form of "?start=<first>&end=<last>
# [to implement] -- as a path to a folder with fastqs '<base_folder>', with the structure 
# <base_folder>/<library_name>/<run_name>/, with each folder containing only
# two fastq.gz files
input:
    raw_reads_paths:
        strain_1:
            lane1:
                - /home/hbrandao/Documents/Rudner_data/test_distiller/fastq/SRR2002581_1.fastq.gz
                - /home/hbrandao/Documents/Rudner_data/test_distiller/fastq/SRR2002581_2.fastq.gz  

        strain_2:
            lane1:
                - /home/hbrandao/Documents/Rudner_data/test_distiller/fastq/SRR2002583_2.fastq.gz  
                - /home/hbrandao/Documents/Rudner_data/test_distiller/fastq/SRR2002584_1.fastq.gz


    library_groups:
        PY79:
            - strain_1
            - strain_2

    genome:
        assembly: 'bacillus'
        bwa_index_wildcard: '/home/hbrandao/Documents/Rudner_data/PY79_genome/bwa_index/NC_022898.1.fa.*'
        chrom_sizes_path: '/home/hbrandao/Documents/Rudner_data/PY79_genome/bwa_index/PY79.chrom.sizes'

map:
    chunksize: 6000
    #drop_sam: False
    #drop_readid: False
    #drop_seq: False

filter:
    pcr_dups_max_mismatch_bp: 3

bin:
    resolutions:
        - 1000000
        - 400000
        - 200000
        - 100000
        - 40000
        - 20000
        - 10000

intermediates:
    base_dir: './intermediates/'
    dirs:
        downloaded_fastqs: 'downloaded_fastqs/'
        fastq_chunks: 'fastq_chunks'
        bam_run: 'bam/run'
        pairsam_run: 'pairsam/run'
        pairsam_library: 'pairsam/library'

output:
    base_dir: '/home/hbrandao/Documents/Rudner_data/test_distiller/output/'
    dirs:
        fastqc: 'fastqc/'
        pairs_library: 'pairs/library/'
        stats_run: 'stats/run/'
        stats_library: 'stats/library/'
        stats_library_group: 'stats/library_group/'
        coolers_library: 'coolers/library/'
        coolers_library_group: 'coolers/library_group/'
        bams_library: 'bams/library_group/'

merge mapping and parsing steps to prevent storing of intermediate bam files

Spaces and parentheses in file names

In the fastqc process, the command that creates a symlink, ln -s $(readlink -f ${fastq}) ..., fails if the distiller pipeline is executed in a folder with spaces or parentheses in its name.

A possible solution is to change the symlink command to ln -s \"$(readlink -f ${fastq})\" ..., in order to guarantee proper escaping.
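
The quoting can be tried in plain bash, outside the nextflow script block (so without the backslash escapes). This sketch uses a scratch directory and made-up file names purely for the demonstration:

```shell
# Scratch directory whose name contains a space and parentheses.
workdir=$(mktemp -d)
mkdir -p "$workdir/my run (test)" "$workdir/temp_fastqc"
touch "$workdir/my run (test)/reads.fastq.gz"

fastq="$workdir/my run (test)/reads.fastq.gz"

# Quoting the command substitution keeps the resolved path as one argument,
# so the space and parentheses are not interpreted by the shell.
ln -s "$(readlink -f "$fastq")" "$workdir/temp_fastqc/reads.fastq.gz"

ls -l "$workdir/temp_fastqc/"
```

Without the outer quotes, the resolved path is split at the space and ln receives several arguments instead of one.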

chunksize=0 does not seem to work

combine sra downloading with chunking; find a way to use pbgzip and online splitting instead of fastq-dump --gzip --split-fastq

Attend the lab BBQ

reconstruct the library/run structure from a provided folder with fastqs

add option to download from http/https/ftp

Delete .sra after finished chunking

Currently they are somewhere in the work folder, I believe

fastqc is given 4 threads by default, but only 1 is actually being used.

fastq chunks are saved twice

Both in the "work" and in the "intermediates" folder

Error at parse_runs step

I got the following error in parse_runs step. I was able to use the same configuration files on other datasets without any errors but this time it failed after the alignment step.

My map parameters were:

map:
    chunksize: 10000000
#    drop_sam: True
    drop_readid: True
    drop_seq: True

####################################

Caused by:
  Cannot invoke method toBoolean() on null object


Source block:
  dropsam_flag = params['map'].get('drop_sam','false').toBoolean() ? '--drop-sam' : ''
  dropreadid_flag = params['map'].get('drop_readid','false').toBoolean() ? '--drop-readid' : ''
  dropseq_flag = params['map'].get('drop_seq','false').toBoolean() ? '--drop-seq' : ''
  stats_command = (params.get('do_stats', 'true').toBoolean() ?
      "pairsamtools stats ${library}.${run}.pairsam.gz -o ${library}.${run}.stats" :
      "touch ${library}.${run}.stats" )
  n_parse_processes = (int)Math.ceil(task.cpus / 2)
  n_parse_processes = n_parse_processes < 1 ? 1 : n_parse_processes
  if( isSingleFile(bam))
      """
      mkdir ./tmp4sort
      pairsamtools parse ${dropsam_flag} ${dropreadid_flag} ${dropseq_flag} \
          -c ${chrom_sizes}  ${bam} \
              | pairsamtools sort --nproc ${task.cpus} \
                                  -o ${library}.${run}.pairsam.gz \
                                  --tmpdir ./tmp4sort \
              | cat
  
      rm -rf ./tmp4sort
  
      ${stats_command}
      """
  else 
      """
      mkdir ./tmp4sort
      mkdir ./tmp_pairsam
      parallel -P${n_parse_processes} 'pairsamtools parse \
          ${dropsam_flag} ${dropseq_flag} ${dropreadid_flag} -c ${chrom_sizes} {} \
          | pairsamtools sort --nproc 4 \
                              -o ./tmp_pairsam/{}.pairsam.gz \
                              --tmpdir ./tmp4sort ' ::: ${bam}
  
      pairsamtools merge ./tmp_pairsam/* --nproc ${task.cpus} -o ${library}.${run}.pairsam.gz

      rm -rf ./tmp4sort
      rm -rf ./tmp_pairsam
  
      ${stats_command}
      """

[Nezar: edited formatting]

fastqc re-runs when restarted

Somehow the result is not being stored as completed

use a local tmp folder for sorting

add option to drop sequences/phreds

git problem in the environment.yml file

When I use conda to install the dependencies locally, following the commands in the Dockerfile and the environment.yml file, it gets stuck while installing pairsamtools.

I replaced

git+git://github.com/mirnylab/pairsamtools

with

git+https://github.com/mirnylab/pairsamtools.git

and this solved the problem for me.

option to balance coolers?

Could be a nice way to avoid needing to re-balance everything after mapping

In aggregation, do not use cis pairs separated by less than a given distance

Add custom locations for the intermediates

in parse, use chrom.sizes to flip pairs in the semantic order, later used in coolers.

Missing parallel tool

I think parallel should be installed in the container. I'm getting this error:

ERROR ~ Error executing process > 'parse_runs (library:MATa_R2 run:lane1)'

Caused by:
  Process `parse_runs (library:MATa_R2 run:lane1)` terminated with an error exit status (127)

Command executed:

  mkdir ./tmp4sort
  mkdir ./tmp_pairsam
  parallel -P4 'pairsamtools parse                {}             | pairsamtools sort --nproc 4                                 -o ./tmp_pairsam/{}.pairsam.gz                                 --tmpdir ./tmp4sort ' ::: MATa_R2.lane1.00.bam MATa_R2.lane1.01.bam
  
  pairsamtools merge ./tmp_pairsam/* --nproc 8 -o MATa_R2.lane1.pairsam.gz
  
  rm -rf ./tmp4sort
  rm -rf ./tmp_pairsam
  
  pairsamtools stats MATa_R2.lane1.pairsam.gz -o MATa_R2.lane1.stats

Command exit status:
  127

Command output:
  (empty)

Command error:
  .command.sh: line 4: parallel: command not found
  .command.run.1: line 104:    13 Terminated              nxf_trace "$pid" .command.trace

specify and pass bwa index files using a wildcard

add continuous integration

Make an option to make multiresolution and Higlass coolers

introduce versions, sync the docker version with the library version

Drop sequences in intermediate bam files created in the work folder

allow to specify a custom location for the products (pairs/coolers/stats/etc)

check inputs: invalid characters in group names, missing libraries in library groups, etc...

Distiller does not check (before running) the consistency of library groups or experiments. As such, a typo in the library groups of a complex experiment results in an error.
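
A minimal sketch of what such pre-run checks might look like. Both helpers are hypothetical and not part of distiller, and the set of allowed characters is an assumption chosen for the example:

```shell
# Hypothetical check: succeed only if the name is non-empty and consists
# solely of letters, digits, '_', '.', or '-'.
is_safe_name() {
    case "$1" in
        ''|*[!A-Za-z0-9_.-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical check: succeed only if the first argument (a library listed
# in a group) appears among the remaining arguments (the defined libraries).
is_known_library() {
    local wanted=$1 lib
    shift
    for lib in "$@"; do
        [ "$lib" = "$wanted" ] && return 0
    done
    return 1
}
```

For example, `is_safe_name "group(1)"` fails because of the parentheses, and `is_known_library strain_3 strain_1 strain_2` fails because strain_3 is not among the defined libraries.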

GNU sort uses too much space on /tmp; runs out of space

Would be nice to be able to specify where to put temporary files of GNU sort

Error log just in case

[3c/371792] Submitted process > parse_runs (library:WaplKO_1.14-B run:lane2) [18/1965]
ERROR ~ Error executing process > 'parse_runs (library:WaplKO_3.3-C run:lane1)'

Caused by:
Process parse_runs (library:WaplKO_3.3-C run:lane1) terminated with an error exit status (1)

Command executed:

mkdir ./tmp4sort
mkdir ./tmp_pairsam
parallel -P4 'pairsamtools parse --drop-seq --drop-readid -c hg19.chrom.sizes.reduced {} | pairsamtools sort --nproc 4 -o ./tmp_pairsam/{}.pairsam.gz --tmpdir ./tmp4sort ' ::: WaplKO_3.3-C.lane1.20.bam WaplKO_3.3-C.lane1.00.bam WaplKO_3.3-C.lane1.01.bam WaplKO_3.3-C.lane1.03.bam WaplKO_3.3-C.lane1.02.bam WaplKO_3.3-C.lane1.04.bam WaplKO_3.3-C.lane1.05.bam WaplKO_3.3-C.lane1.43.bam WaplKO_3.3-C.lane1.06.bam WaplKO_3.3-C.lane1.07.bam WaplKO_3.3-C.lane1.08.bam WaplKO_3.3-C.lane1.09.bam WaplKO_3.3-C.lane1.10.bam WaplKO_3.3-C.lane1.11.bam WaplKO_3.3-C.lane1.12.bam WaplKO_3.3-C.lane1.13.bam WaplKO_3.3-C.lane1.14.bam WaplKO_3.3-C.lane1.15.bam WaplKO_3.3-C.lane1.17.bam WaplKO_3.3-C.lane1.18.bam WaplKO_3.3-C.lane1.21.bam WaplKO_3.3-C.lane1.19.bam WaplKO_3.3-C.lane1.24.bam WaplKO_3.3-C.lane1.22.bam WaplKO_3.3-C.lane1.23.bam WaplKO_3.3-C.lane1.27.bam WaplKO_3.3-C.lane1.26.bam WaplKO_3.3-C.lane1.25.bam WaplKO_3.3-C.lane1.29.bam WaplKO_3.3-C.lane1.28.bam WaplKO_3.3-C.lane1.31.bam WaplKO_3.3-C.lane1.30.bam WaplKO_3.3-C.lane1.32.bam WaplKO_3.3-C.lane1.33.bam WaplKO_3.3-C.lane1.34.bam WaplKO_3.3-C.lane1.35.bam WaplKO_3.3-C.lane1.36.bam WaplKO_3.3-C.lane1.40.bam WaplKO_3.3-C.lane1.41.bam WaplKO_3.3-C.lane1.37.bam WaplKO_3.3-C.lane1.39.bam WaplKO_3.3-C.lane1.38.bam WaplKO_3.3-C.lane1.42.bam WaplKO_3.3-C.lane1.48.bam WaplKO_3.3-C.lane1.44.bam WaplKO_3.3-C.lane1.45.bam WaplKO_3.3-C.lane1.47.bam WaplKO_3.3-C.lane1.46.bam WaplKO_3.3-C.lane1.16.bam

pairsamtools merge ./tmp_pairsam/* --nproc 8 -o WaplKO_3.3-C.lane1.pairsam.gz

rm -rf ./tmp4sort
rm -rf ./tmp_pairsam

pairsamtools stats WaplKO_3.3-C.lane1.pairsam.gz -o WaplKO_3.3-C.lane1.stats

Command exit status:
1

Command output:
(empty)

Command error:
sort: write failed: /tmp/sortE2aarR: No space left on device
.command.run.1: line 50: cannot create temp file for here-document: No space left on device
Traceback (most recent call last):
  File "/miniconda3/bin/pairsamtools", line 11, in <module>
    load_entry_point('pairsamtools==0.0.1.dev0', 'console_scripts', 'pairsamtools')()
  File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/miniconda3/lib/python3.6/site-packages/pairsamtools/pairsam_stats.py", line 50, in stats
    stats_py(input_path, output, merge)
  File "/miniconda3/lib/python3.6/site-packages/pairsamtools/pairsam_stats.py", line 99, in stats_py
    cols[_pairsam_format.COL_C1],
IndexError: list index out of range
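
The "sort: write failed: /tmp/sortE2aarR: No space left on device" line in the log shows sort spilling to /tmp. GNU sort accepts -T (--temporary-directory) and otherwise honors the TMPDIR environment variable, so one option is to point it at a larger scratch location. A minimal sketch, with a mktemp directory standing in for that location and a tiny made-up input file:

```shell
# Stand-in for a large local scratch directory.
scratch=$(mktemp -d)
printf '3\n1\n2\n' > "$scratch/unsorted.txt"

# Explicit temporary directory via -T.
sort -T "$scratch" -n "$scratch/unsorted.txt" > "$scratch/sorted.txt"

# Same effect through the environment, for cases where sort is called
# indirectly and its command line cannot be changed.
TMPDIR="$scratch" sort -n "$scratch/unsorted.txt" > "$scratch/sorted_env.txt"

cat "$scratch/sorted.txt"
```

Whether a given wrapper passes its own temporary-directory option through to sort would need to be checked separately; the environment variable route does not depend on that.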
