distiller-nf's People

Contributors

agalitsyna, azkalot1, golobor, gspracklin, molsim, nvictus, pditommaso, phlya, sergpolly

distiller-nf's Issues

Error Reporting

It would be helpful to have shorter, polished error reports instead of raw Nextflow or Python exception messages. For example, when the path to an input file was wrong, I got the following output, and it took me a while to figure out that there was a typo in the input path:

N E X T F L O W ~ version 0.24.3
Launching /home/ubuntu/distiller/distiller.nf [distracted_leakey] - revision: 3548dd9b36
[warm up] executor > local
[1d/011c3a] Submitted process > chunk_fastqs (library:lib-1-A run:lane1)
[f7/6acfcc] Submitted process > fastqc (library:lib-1-A run:lane1 side:2)
[fb/6762f7] Submitted process > fastqc (library:lib-1-A run:lane1 side:1)
ERROR ~ Error executing process > 'fastqc (library:lib-1-A run:lane1 side:2)'

Caused by:
Missing output file(s) lib-1-A.lane1.2_fastqc.html expected by process fastqc (library:lib-1-A run:lane1 side:2)

Command executed:

mkdir -p ./temp_fastqc/
ln -s $(readlink -f lib-1-A_S1_L001_R2_001.fastq.gz) ./temp_fastqc/lib-1-A.lane1.2.fastq.gz
fastqc --threads 4 -o ./ -f fastq ./temp_fastqc/lib-1-A.lane1.2.fastq.gz
rm -r ./temp_fastqc/

Command exit status:
0

Command output:
(empty)

Command error:
Skipping './temp_fastqc/lib-1-A.lane1.2.fastq.gz' which didn't exist, or couldn't be read
.command.run.1: line 99: 12 Terminated nxf_trace "$pid" .command.trace

Work dir:
/data/work/f7/6acfcce9869c0f07e606ff85dbe122

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh

-- Check '.nextflow.log' file for details
WARN: Killing pending tasks (2)
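
One way to get a shorter message for this particular failure would be a pre-flight check that runs before the pipeline is launched. The sketch below is only an illustration: `check_inputs` is a hypothetical helper, not part of distiller, and it assumes the input paths are available as plain arguments.

```shell
# Hypothetical pre-flight check (not part of distiller): report every input
# file that is missing or unreadable, and fail if any was found.
check_inputs() {
    local missing=0
    local path
    for path in "$@"; do
        if [ ! -r "$path" ]; then
            echo "Input file not found or not readable: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (paths are placeholders):
# check_inputs fastq/sample_R1.fastq.gz fastq/sample_R2.fastq.gz || exit 1
```

With a check like this, a typo in a path would surface as a single line naming the offending file rather than as a missing-output error from a downstream process.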

get rid of most of `storeDir` - the troublemaker

storeDir is a huge troublemaker in a cluster environment: whenever some process fails downstream, for whatever reason, upstream processes may still be copying intermediate results to the storeDir location. After a restart or -resume, processes take off with whatever is present in storeDir. That isn't always a good thing, because the results there might be incomplete, and it often creates a mess.

  • Jobs/processes can fail on a cluster for many reasons that we cannot control, and even errorStrategy='retry' does not necessarily save us in all cases.
  • The intended use of storeDir is to keep things as a permanent cache, in most cases after downloading from some database (like SRA), not to hold all intermediate results (according to Paolo).

Use safe script bash options

Without the -o pipefail option, a command in a pipe can fail while the pipeline as a whole still returns success, creating all kinds of hard-to-trace havoc in Nextflow. Based on this discussion, we could just add something like process.shell = ['/bin/bash', '-uexo','pipefail'] to nextflow.config to set these options globally.

Will make a PR shortly.

More on safe bash scripts.
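
The effect of pipefail can be seen in plain bash, outside Nextflow. The sketch below runs the pipe `false | cat` in a fresh bash with and without the option; `pipeline_status` is a hypothetical helper name used only for this illustration.

```shell
# Print the exit status of `false | cat`, run in a fresh bash process,
# with or without `set -o pipefail`.
pipeline_status() {
    local status=0
    if [ "$1" = "pipefail" ]; then
        bash -c 'set -o pipefail; false | cat' || status=$?
    else
        bash -c 'false | cat' || status=$?
    fi
    echo "$status"
}

# Without pipefail the status is that of the last command (cat), so the
# failure of `false` is hidden.
echo "without pipefail: $(pipeline_status default)"    # prints 0
# With pipefail the failing command's status is reported.
echo "with pipefail:    $(pipeline_status pipefail)"   # prints 1
```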

SRA download step fails

I've tried both on Linux and Mac and I'm getting the same error:

ERROR ~ Error executing process > 'download_sra (sra:SRR2601843?start=0&end=10000)'

Caused by:
  Process `download_sra (sra:SRR2601843?start=0&end=10000)` terminated with an error exit status (3)

Command executed:

  fastq-dump -F SRR2601843  --minSpotId 0  --maxSpotId 10000 --split-files --gzip
  mv SRR2601843_1.fastq.gz MATalpha_R1.lane2.1.fastq.gz
  mv SRR2601843_2.fastq.gz MATalpha_R1.lane2.2.fastq.gz

Command exit status:
  3

Command output:
  (empty)

Command error:
  2017-05-02T12:50:09 fastq-dump.2.8.1 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
  2017-05-02T12:50:39 fastq-dump.2.8.1 sys: connection failed while opening file within cryptographic module - mbedtls_ssl_handshake returned -76 ( NET - Reading information from the socket failed )
  2017-05-02T12:50:39 fastq-dump.2.8.1 err: no error - error with http open 'http://sra-download.ncbi.nlm.nih.gov/srapub/SRR2601843'
  2017-05-02T12:50:39 fastq-dump.2.8.1 err: item not found while constructing within virtual database module - the path 'SRR2601843' cannot be opened as database or table
  .command.run.1: line 104:    13 Terminated              nxf_trace "$pid" .command.trac

Problem with Fastqc in Recently Built Docker Image

We built the docker image after modifying pairsamtools. Everything except for fastqc works fine. Fastqc processes got stuck when we used the new image. Fastqc works in the existing docker image in docker hub.

fastqc seems to halt after completion with an error:

Analysis complete for 2-Cellaphidicolintreated_R1.lane_39096.1.fastq.gz
Exception in thread "Thread-1" java.lang.Error: Probable fatal error:No fonts found.
at sun.font.SunFontManager.getDefaultPhysicalFont(SunFontManager.java:1236)
at sun.font.SunFontManager.initialiseDeferredFont(SunFontManager.java:1100)
at sun.font.SunFontManager.findOtherDeferredFont(SunFontManager.java:1037)
at sun.font.SunFontManager.findDeferredFont(SunFontManager.java:1054)
at sun.font.SunFontManager.findFont2D(SunFontManager.java:2256)

Setup script is not working on Mac

Just reporting this in case you are interested in having it work on macOS:

$ bash setup_test.sh 
++ pwd
+ PROJECT_DIR=/Users/pditommaso/Downloads/test/test
+ mkdir -p /Users/pditommaso/Downloads/test/test
+ cd /Users/pditommaso/Downloads/test/test
+ curl -LkSs https://api.github.com/repos/mirnylab/distiller-test-data/tarball
+ tar -zxf - --strip=1 --wildcards '*/genome/*' --wildcards '*/fastq/*'
tar: Option --wildcards is not supported
Usage:
  List:    tar -tf <archive-filename>
  Extract: tar -xf <archive-filename>
  Create:  tar -cf <archive-filename> [filenames...]
  Help:    tar --help
curl: (23) Failed writing body (0 != 1370)
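
A possible workaround, sketched below, is to pass --wildcards only when the installed tar identifies itself as GNU tar. This rests on the assumption that BSD tar (the macOS default) already treats member arguments as glob patterns when extracting. `extract_test_data` is a hypothetical helper, and --strip-components is spelled out in full rather than abbreviated.

```shell
# Hypothetical helper: extract only the genome/ and fastq/ folders from a
# tarball, dropping the top-level directory.
extract_test_data() {
    local tarball="$1" dest="$2"
    local wildcard_opt=""
    # GNU tar needs --wildcards to match member names as patterns;
    # the tar in the log above rejects that option.
    if tar --version 2>/dev/null | grep -q 'GNU tar'; then
        wildcard_opt="--wildcards"
    fi
    # $wildcard_opt is left unquoted on purpose: when empty it adds no argument.
    tar -zxf "$tarball" -C "$dest" --strip-components=1 $wildcard_opt \
        '*/genome/*' '*/fastq/*'
}

# Example call (file names are placeholders):
# extract_test_data distiller-test-data.tar.gz ./test
```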

memory checks?

Should there be checks of available memory, with warnings issued depending on chunksize etc.?

Space Usage

N GB of input data requires more than 10N GB of space to run. Also, space usage peaks during sorting, as mentioned in an earlier ticket. In my run, 54M of input required almost 600M of free space. Is there any chance of deleting intermediate files as soon as they are no longer needed?
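
Until intermediates are cleaned up earlier, a rough pre-flight check could compare free space against a multiple of the input size. The sketch below is an illustration only: the helper names are hypothetical, and the factor of 11 is an assumption derived from the 54M to 600M figures above, not a measured property of the pipeline.

```shell
# Sum the sizes (in bytes) of the given files and scale by a safety factor.
required_bytes() {
    local factor="$1"
    shift
    local total=0 size path
    for path in "$@"; do
        size=$(wc -c < "$path")
        total=$((total + size))
    done
    echo $((total * factor))
}

# Free space (in bytes) on the filesystem holding the given directory.
available_bytes() {
    local kb
    kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    echo $((kb * 1024))
}

# Succeeds only if the work directory has room for factor x input size.
has_enough_space() {
    local workdir="$1" factor="$2"
    shift 2
    [ "$(available_bytes "$workdir")" -ge "$(required_bytes "$factor" "$@")" ]
}

# Example call (paths are placeholders; 11 is an assumed factor):
# has_enough_space ./work 11 fastq/*.fastq.gz || echo "Not enough free space" >&2
```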

Failed at parse_runs

output:

ERROR ~ Error executing process > 'parse_runs (library:strain_2 run:lane1)'

Caused by:
  Cannot invoke method toBoolean() on null object


Source block:
  dropsam_flag = params['map'].get('drop_sam','false').toBoolean() ? '--drop-sam' : ''
  dropreadid_flag = params['map'].get('drop_readid','false').toBoolean() ? '--drop-readid' : ''
  dropseq_flag = params['map'].get('drop_seq','false').toBoolean() ? '--drop-seq' : ''
  stats_command = (params.get('do_stats', 'true').toBoolean() ?
      "pairsamtools stats ${library}.${run}.pairsam.gz -o ${library}.${run}.stats" :
      "touch ${library}.${run}.stats" )
  n_parse_processes = (int)Math.ceil(task.cpus / 2)
  n_parse_processes = n_parse_processes < 1 ? 1 : n_parse_processes
  if( isSingleFile(bam))
      """
      mkdir ./tmp4sort
      pairsamtools parse ${dropsam_flag} ${dropreadid_flag} ${dropseq_flag} \
          -c ${chrom_sizes}  ${bam} \
              | pairsamtools sort --nproc ${task.cpus} \
                                  -o ${library}.${run}.pairsam.gz \
                                  --tmpdir ./tmp4sort \
              | cat
  
      rm -rf ./tmp4sort
  
      ${stats_command}
      """
  else 
      """
      mkdir ./tmp4sort
      mkdir ./tmp_pairsam
      parallel -P${n_parse_processes} 'pairsamtools parse \
          ${dropsam_flag} ${dropseq_flag} ${dropreadid_flag} -c ${chrom_sizes} {} \
          | pairsamtools sort --nproc 4 \
                              -o ./tmp_pairsam/{}.pairsam.gz \
                              --tmpdir ./tmp4sort ' ::: ${bam}
  
      pairsamtools merge ./tmp_pairsam/* --nproc ${task.cpus} -o ${library}.${run}.pairsam.gz
  
      rm -rf ./tmp4sort
      rm -rf ./tmp_pairsam
  
      ${stats_command}
      """

project.yml

do_fastqc: True
do_stats: True
    

# Fastqs can be provided as:
# -- a pairs of relative/absolute paths
# -- sra:<SRA_NUMBER>, optionally followed by the indices of the first and
# the last entry in the SRA in the form of "?start=<first>&end=<last>
# [to implement] -- as a path to a folder with fastqs '<base_folder>', with the structure 
# <base_folder>/<library_name>/<run_name>/, with each folder containing only
# two fastq.gz files
input:
    raw_reads_paths:
        strain_1:
            lane1:
                - /home/hbrandao/Documents/Rudner_data/test_distiller/fastq/SRR2002581_1.fastq.gz
                - /home/hbrandao/Documents/Rudner_data/test_distiller/fastq/SRR2002581_2.fastq.gz  

        strain_2:
            lane1:
                - /home/hbrandao/Documents/Rudner_data/test_distiller/fastq/SRR2002583_2.fastq.gz  
                - /home/hbrandao/Documents/Rudner_data/test_distiller/fastq/SRR2002584_1.fastq.gz


    library_groups:
        PY79:
            - strain_1
            - strain_2

    genome:
        assembly: 'bacillus'
        bwa_index_wildcard: '/home/hbrandao/Documents/Rudner_data/PY79_genome/bwa_index/NC_022898.1.fa.*'
        chrom_sizes_path: '/home/hbrandao/Documents/Rudner_data/PY79_genome/bwa_index/PY79.chrom.sizes'

map:
    chunksize: 6000
    #drop_sam: False
    #drop_readid: False
    #drop_seq: False

filter:
    pcr_dups_max_mismatch_bp: 3

bin:
    resolutions:
        - 1000000
        - 400000
        - 200000
        - 100000
        - 40000
        - 20000
        - 10000

intermediates:
    base_dir: './intermediates/'
    dirs:
        downloaded_fastqs: 'downloaded_fastqs/'
        fastq_chunks: 'fastq_chunks'
        bam_run: 'bam/run'
        pairsam_run: 'pairsam/run'
        pairsam_library: 'pairsam/library'

output:
    base_dir: '/home/hbrandao/Documents/Rudner_data/test_distiller/output/'
    dirs:
        fastqc: 'fastqc/'
        pairs_library: 'pairs/library/'
        stats_run: 'stats/run/'
        stats_library: 'stats/library/'
        stats_library_group: 'stats/library_group/'
        coolers_library: 'coolers/library/'
        coolers_library_group: 'coolers/library_group/'
        bams_library: 'bams/library_group/'

Spaces and parentheses in the file names

In the fastqc process, the command that creates a symlink, ln -s $(readlink -f ${fastq}) ..., fails if the distiller pipeline is executed in a folder with spaces or parentheses in its name.

A possible solution is to change the symlink command to ln -s \"$(readlink -f ${fastq})\" ..., in order to guarantee proper quoting.
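
The difference the quotes make can be checked in plain bash, outside Nextflow. The sketch below builds a scratch folder with a space and parentheses in its name and counts how many arguments the command substitution produces; `count_args` and the folder name are made up for the illustration.

```shell
# Print the number of arguments received.
count_args() { echo "$#"; }

# Scratch folder whose name contains a space and parentheses.
dir="$(mktemp -d)/my run (test)"
mkdir -p "$dir"
touch "$dir/reads.fastq.gz"

# Unquoted: the resolved path is split on whitespace into several arguments.
unquoted=$(count_args $(readlink -f "$dir/reads.fastq.gz"))
# Quoted: the resolved path stays a single argument.
quoted=$(count_args "$(readlink -f "$dir/reads.fastq.gz")")
echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"

# With the quotes in place the symlink is created as intended.
ln -s "$(readlink -f "$dir/reads.fastq.gz")" "$dir/link.fastq.gz"
```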

Error at parse_runs step

I got the following error at the parse_runs step. I was able to use the same configuration files on other datasets without any errors, but this time it failed after the alignment step.

My map parameters were:

map:
    chunksize: 10000000
#    drop_sam: True
    drop_readid: True
    drop_seq: True

####################################

Caused by:
  Cannot invoke method toBoolean() on null object


Source block:
  dropsam_flag = params['map'].get('drop_sam','false').toBoolean() ? '--drop-sam' : ''
  dropreadid_flag = params['map'].get('drop_readid','false').toBoolean() ? '--drop-readid' : ''
  dropseq_flag = params['map'].get('drop_seq','false').toBoolean() ? '--drop-seq' : ''
  stats_command = (params.get('do_stats', 'true').toBoolean() ?
      "pairsamtools stats ${library}.${run}.pairsam.gz -o ${library}.${run}.stats" :
      "touch ${library}.${run}.stats" )
  n_parse_processes = (int)Math.ceil(task.cpus / 2)
  n_parse_processes = n_parse_processes < 1 ? 1 : n_parse_processes
  if( isSingleFile(bam))
      """
      mkdir ./tmp4sort
      pairsamtools parse ${dropsam_flag} ${dropreadid_flag} ${dropseq_flag} \
          -c ${chrom_sizes}  ${bam} \
              | pairsamtools sort --nproc ${task.cpus} \
                                  -o ${library}.${run}.pairsam.gz \
                                  --tmpdir ./tmp4sort \
              | cat
  
      rm -rf ./tmp4sort
  
      ${stats_command}
      """
  else 
      """
      mkdir ./tmp4sort
      mkdir ./tmp_pairsam
      parallel -P${n_parse_processes} 'pairsamtools parse \
          ${dropsam_flag} ${dropseq_flag} ${dropreadid_flag} -c ${chrom_sizes} {} \
          | pairsamtools sort --nproc 4 \
                              -o ./tmp_pairsam/{}.pairsam.gz \
                              --tmpdir ./tmp4sort ' ::: ${bam}
  
      pairsamtools merge ./tmp_pairsam/* --nproc ${task.cpus} -o ${library}.${run}.pairsam.gz

      rm -rf ./tmp4sort
      rm -rf ./tmp_pairsam
  
      ${stats_command}
      """

[Nezar: edited formatting]

Missing parallel tool

I think parallel should be installed in the container. I'm getting this error:

ERROR ~ Error executing process > 'parse_runs (library:MATa_R2 run:lane1)'

Caused by:
  Process `parse_runs (library:MATa_R2 run:lane1)` terminated with an error exit status (127)

Command executed:

  mkdir ./tmp4sort
  mkdir ./tmp_pairsam
  parallel -P4 'pairsamtools parse                {}             | pairsamtools sort --nproc 4                                 -o ./tmp_pairsam/{}.pairsam.gz                                 --tmpdir ./tmp4sort ' ::: MATa_R2.lane1.00.bam MATa_R2.lane1.01.bam
  
  pairsamtools merge ./tmp_pairsam/* --nproc 8 -o MATa_R2.lane1.pairsam.gz
  
  rm -rf ./tmp4sort
  rm -rf ./tmp_pairsam
  
  pairsamtools stats MATa_R2.lane1.pairsam.gz -o MATa_R2.lane1.stats

Command exit status:
  127

Command output:
  (empty)

Command error:
  .command.sh: line 4: parallel: command not found
  .command.run.1: line 104:    13 Terminated              nxf_trace "$pid" .command.trace
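
Independently of fixing the container, a missing tool could be caught before any process starts. The sketch below uses the standard `command -v` lookup; `require_tools` is a hypothetical helper, not something distiller provides.

```shell
# Hypothetical helper: fail with a short message if any of the named
# executables cannot be found on PATH.
require_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "Required tool not found on PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call for the tools used by parse_runs:
# require_tools parallel pairsamtools || exit 1
```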

GNU sort uses too much space on /tmp; runs out of space

It would be nice to be able to specify where GNU sort puts its temporary files.

Error log just in case

[3c/371792] Submitted process > parse_runs (library:WaplKO_1.14-B run:lane2) [18/1965]
ERROR ~ Error executing process > 'parse_runs (library:WaplKO_3.3-C run:lane1)'

Caused by:
Process parse_runs (library:WaplKO_3.3-C run:lane1) terminated with an error exit status (1)

Command executed:

mkdir ./tmp4sort
mkdir ./tmp_pairsam
parallel -P4 'pairsamtools parse --drop-seq --drop-readid -c hg19.chrom.sizes.reduced {} | pairsamtools sort --nproc 4 -o ./tmp_pairsam/{}.pairsam.gz --tmpdir ./tmp4sort ' ::: WaplKO_3.3-C.lane1.20.bam WaplKO_3.3-C.lane1.00.bam WaplKO_3.3-C.lane1.01.bam WaplKO_3.3-C.lane1.03.bam WaplKO_3.3-C.lane1.02.bam WaplKO_3.3-C.lane1.04.bam WaplKO_3.3-C.lane1.05.bam WaplKO_3.3-C.lane1.43.bam WaplKO_3.3-C.lane1.06.bam WaplKO_3.3-C.lane1.07.bam WaplKO_3.3-C.lane1.08.bam WaplKO_3.3-C.lane1.09.bam WaplKO_3.3-C.lane1.10.bam WaplKO_3.3-C.lane1.11.bam WaplKO_3.3-C.lane1.12.bam WaplKO_3.3-C.lane1.13.bam WaplKO_3.3-C.lane1.14.bam WaplKO_3.3-C.lane1.15.bam WaplKO_3.3-C.lane1.17.bam WaplKO_3.3-C.lane1.18.bam WaplKO_3.3-C.lane1.21.bam WaplKO_3.3-C.lane1.19.bam WaplKO_3.3-C.lane1.24.bam WaplKO_3.3-C.lane1.22.bam WaplKO_3.3-C.lane1.23.bam WaplKO_3.3-C.lane1.27.bam WaplKO_3.3-C.lane1.26.bam WaplKO_3.3-C.lane1.25.bam WaplKO_3.3-C.lane1.29.bam WaplKO_3.3-C.lane1.28.bam WaplKO_3.3-C.lane1.31.bam WaplKO_3.3-C.lane1.30.bam WaplKO_3.3-C.lane1.32.bam WaplKO_3.3-C.lane1.33.bam WaplKO_3.3-C.lane1.34.bam WaplKO_3.3-C.lane1.35.bam WaplKO_3.3-C.lane1.36.bam WaplKO_3.3-C.lane1.40.bam WaplKO_3.3-C.lane1.41.bam WaplKO_3.3-C.lane1.37.bam WaplKO_3.3-C.lane1.39.bam WaplKO_3.3-C.lane1.38.bam WaplKO_3.3-C.lane1.42.bam WaplKO_3.3-C.lane1.48.bam WaplKO_3.3-C.lane1.44.bam WaplKO_3.3-C.lane1.45.bam WaplKO_3.3-C.lane1.47.bam WaplKO_3.3-C.lane1.46.bam WaplKO_3.3-C.lane1.16.bam

pairsamtools merge ./tmp_pairsam/* --nproc 8 -o WaplKO_3.3-C.lane1.pairsam.gz

rm -rf ./tmp4sort
rm -rf ./tmp_pairsam

pairsamtools stats WaplKO_3.3-C.lane1.pairsam.gz -o WaplKO_3.3-C.lane1.stats

Command exit status:
1

Command output:
(empty)

Command error:
sort: write failed: /tmp/sortE2aarR: No space left on device
.command.run.1: line 50: cannot create temp file for here-document: No space left on device
Traceback (most recent call last):
  File "/miniconda3/bin/pairsamtools", line 11, in <module>
    load_entry_point('pairsamtools==0.0.1.dev0', 'console_scripts', 'pairsamtools')()
  File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/miniconda3/lib/python3.6/site-packages/pairsamtools/pairsam_stats.py", line 50, in stats
    stats_py(input_path, output, merge)
  File "/miniconda3/lib/python3.6/site-packages/pairsamtools/pairsam_stats.py", line 99, in stats_py
    cols[_pairsam_format.COL_C1],
IndexError: list index out of range
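
For reference, GNU sort itself can already be pointed at another scratch location, either with the -T option or through the TMPDIR environment variable. The sketch below shows both forms in plain bash; `sort_with_tmpdir` is a hypothetical wrapper, and how such a setting would be exposed through the pipeline configuration is left open.

```shell
# Hypothetical wrapper: numeric sort of stdin, with temporary files placed
# in the given directory instead of the default /tmp.
sort_with_tmpdir() {
    local tmpdir="$1"
    sort -T "$tmpdir" -n
}

scratch="$(mktemp -d)"

# Option form: -T names the directory for temporary files.
printf '3\n1\n2\n' | sort_with_tmpdir "$scratch"

# Environment form: sort falls back to TMPDIR when -T is not given.
printf '3\n1\n2\n' | TMPDIR="$scratch" sort -n
```

Both calls print 1, 2, 3 on separate lines; with inputs this small no temporary files are actually written, so the example only shows where the setting goes.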
