
flowr's Introduction

Latest documentation: flow-r.github.io/flowr

The flowr framework allows you to design and implement complex pipelines and deploy them on your institution's computing cluster. It was built with the needs of bioinformatics workflows in mind, but it is easily extendable to any field where a series of steps (shell commands) is to be executed as a (work)flow.

Highlights

  • No new syntax or language to learn. Simply list all shell commands in a tsv file called the flow mat.
  • Define the flow of steps (serial, scatter, gather, burst ...) using a simple tsv file called the flow def (see the sketch below).
  • Works on your laptop, server, cluster, or cloud.
  • Supports multiple cluster computing platforms (torque, lsf, sge, slurm ...), the cloud (StarCluster), or a local machine.
  • One-line installation: install.packages("flowr")
  • Reproducible and transparent, with cleanly structured execution logs
  • Track and re-run flows
  • Lean and portable, with easy installation
  • Fine-grained control over the resources (CPU, memory, walltime, etc.) of each step.
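
To make the two files concrete, here is a minimal sketch in R (column names follow the flowr documentation; the sleep/tmp commands and resource values are just placeholders):

## flow mat: one row per shell command, grouped by samplename/jobname
flow_mat = data.frame(
  samplename = "sample1",
  jobname    = c("sleep", "tmp"),
  cmd        = c("sleep 5", "head -c 1000 /dev/urandom > tmp.txt"),
  stringsAsFactors = FALSE)

## flow def: one row per step, describing how it runs and what it depends on
flow_def = data.frame(
  jobname         = c("sleep", "tmp"),
  sub_type        = c("scatter", "serial"),
  prev_jobs       = c("none", "sleep"),
  dep_type        = c("none", "serial"),
  queue           = "short",
  memory_reserved = "2000",
  walltime        = "1:00",
  cpu_reserved    = 1,
  platform        = "local",
  stringsAsFactors = FALSE)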

Example

ex_fq_bam

A few lines to get started:

## Official stable release from CRAN (updated every other month)
## visit flow-r.github.io/flowr/install for more details
install.packages("flowr",  repos = "http://cran.rstudio.com")

# or the latest version from DRAT; provide CRAN for dependencies
install.packages("flowr", repos = c(CRAN="http://cran.rstudio.com", DRAT="http://sahilseth.github.io/drat"))

library(flowr) ## load the library
setup() ## copy the flowr bash script and create a flowr folder under home

# Run an example pipeline

# style 1: sleep_pipe() function creates system cmds
flowr run x=sleep_pipe platform=local execute=TRUE

# style 2: we start with a tsv of system cmds
# get example files
wget --no-check-certificate http://raw.githubusercontent.com/sahilseth/flowr/master/inst/pipelines/sleep_pipe.tsv
wget --no-check-certificate http://raw.githubusercontent.com/sahilseth/flowr/master/inst/pipelines/sleep_pipe.def

# submit to local machine
flowr to_flow x=sleep_pipe.tsv def=sleep_pipe.def platform=local execute=TRUE
# submit to local LSF cluster
flowr to_flow x=sleep_pipe.tsv def=sleep_pipe.def platform=lsf execute=TRUE
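
The same two styles can also be run from within an R session (a sketch; the argument names mirror the command-line flags above):

library(flowr)
# style 1: run the built-in sleep_pipe example locally
fobj = run(x = "sleep_pipe", platform = "local", execute = TRUE)
# style 2: build a flow from the tsv/def pair and submit it
fobj = to_flow(x = "sleep_pipe.tsv", def = "sleep_pipe.def",
               platform = "local", execute = TRUE)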

Example pipelines: inst/pipelines

Resources

Updates

This package is under active development; you may watch for changes using the watch link above.

Feedback

Please feel free to raise a GitHub issue with questions or comments.

Acknowledgements

  • Jianhua Zhang
  • Samir Amin
  • Roger Moye
  • Kadir Akdemir
  • Ethan Mao
  • Henry Song
  • An excellent resource for writing your own R packages: r-pkgs.org

flowr's People

Contributors

larsgr, sahilseth, sbamin


flowr's Issues

Pattern matching in kill is not always as expected

flowr kill x=runs/bam_mutect-HCC1143___HCC1143_BL-20160413-21-*
found multiple wds:
runs/bam_mutect-HCC1143___HCC1143_BL-20160413-21-52-53-VhvZDq9L
runs/bam_mutect-HCC1143___HCC1143_BL-20160413-21-54-49-abhTg3DJ
If you want to kill all of them, kill again with force=TRUE
flowr kill x=runs/bam_mutect-HCC1143___HCC1143_BL-20160413-21*
found multiple wds:
runs/bam_mutect-HCC1143___HCC1143_BL-20160413-21-52-53-VhvZDq9L
runs/bam_mutect-HCC1143___HCC1143_BL-20160413-21-54-49-abhTg3DJ
runs/bam_mutect-HCC1143___HCC1143_BL-20160413-22-07-02-kNZqOJ9W
If you want to kill all of them, kill again with force=TRUE

Better separation of status from multiple folders

When monitoring status across multiple folders, it would be useful to add a visible header to easily differentiate them.

Something like this is desired:

================================================================================
Showing status of:
/rsrch2/iacs/iacs_dep/sseth/flowr/runs/flowname-127390-T-20150829-03-11-48-NVbDveFr

|            | total| started| completed| exit_status|status  |
|:-----------|-----:|-------:|---------:|-----------:|:-------|
|001.markdup |     1|       1|         1|           1|errored |
|002.target  |     1|       0|         0|           0|pending |
|003.realign |     1|       0|         0|           0|pending |

================================================================================
Showing status of:
/rsrch2/iacs/iacs_dep/sseth/flowr/runs/flowname-127390-T-20150831-01-46-16-cv6mBQ8D

automatically subset flowmat, in case of `rerun`

If flowmat contains multiple samples, rerun fails:

flowr rerun start_from=transfer_out mat=flowmat.tsv x=flow_run_path execute=TRUE
.>  working on... 
.Error in (function (cl, name, valueClass)  :
  ‘status’ is not a slot in class “list”
In addition: There were 16 warnings (use warnings() to see them)
Error in funr(args = commandArgs(trailingOnly = TRUE), script_name = "flowr",  :
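
As a workaround sketch (assuming the usual samplename column in the flow mat; file and sample names below are placeholders), one could subset the flowmat to a single sample before calling rerun:

# subset the flowmat to one sample, then point rerun at the smaller file
mat = read.delim("flowmat.tsv", stringsAsFactors = FALSE)
mat = subset(mat, samplename == "sample1")
write.table(mat, "flowmat_sample1.tsv", sep = "\t",
            quote = FALSE, row.names = FALSE)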

error in dependent job tabulation in flow_details.txt for scatter type jobs

Is it possible to fix one-to-one dependent job id assignment in scattered jobs? E.g., when sam2bam_1 finishes, it starts mduprg_1 (which is dependent on sam2bam_1's job id: 2765506). However, the other sam2bam steps (_2, _3, ...) have different job ids, i.e., 2765507, 2765508, ... When the matching next jobs (scatter - serial) start, mduprg_2 and mduprg_3 should have 2765507 and 2765508 in their dependent job id column. At present, flow_details.txt writes the dependent job id from the first of the several scattered jobs in the dependent id column, i.e. 2765506, 2765506, ...

Not a priority issue, as this does not break the HPC dependency structure: flowr writes the multiple dependencies in the msub or bsub script correctly. Only in flow_details.txt is the same first job id copied as the dependency for the other scattered jobs under the same command group (mduprg in this case).

foo@helix:/fastscratch/foo/flowr/runs/glass_tools/flowr_aln_fqs-CGP-S03-E7AB-8AD9ACF7-T1-A2-J02-20170506-02-40-22-lTAxP5Eu/003.sam2bam$ grep "2765506" ../flow_details.txt
003.sam2bam     sam2bam 1       2765506 003.sam2bam_1   bwamem  2765503 submitted       NA      /fastscratch/foo/flowr/runs/glass_tools/flowr_aln_fqs-CGP-S03-E7AB-8AD9ACF7-T1-A2-J02-20170506-02-40-22-lTAxP5Eu/003.sam2bam/sam2bam_cmd_1.sh       /fastscratch/foo/flowr/runs/glass_tools/flowr_aln_fqs-CGP-S03-E7AB-8AD9ACF7-T1-A2-J02-20170506-02-40-22-lTAxP5Eu/trigger/trigger_003.sam2bam_1.txt
004.mduprg      mduprg  1       2765509 004.mduprg_1    sam2bam 2765506 submitted       NA      /fastscratch/foo/flowr/runs/glass_tools/flowr_aln_fqs-CGP-S03-E7AB-8AD9ACF7-T1-A2-J02-20170506-02-40-22-lTAxP5Eu/004.mduprg/mduprg_cmd_1.sh /fastscratch/foo/flowr/runs/glass_tools/flowr_aln_fqs-CGP-S03-E7AB-8AD9ACF7-T1-A2-J02-20170506-02-40-22-lTAxP5Eu/trigger/trigger_004.mduprg_1.txt
004.mduprg      mduprg  2       2765510 004.mduprg_2    sam2bam 2765506 submitted       NA      /fastscratch/foo/flowr/runs/glass_tools/flowr_aln_fqs-CGP-S03-E7AB-8AD9ACF7-T1-A2-J02-20170506-02-40-22-lTAxP5Eu/004.mduprg/mduprg_cmd_2.sh /fastscratch/foo/flowr/runs/glass_tools/flowr_aln_fqs-CGP-S03-E7AB-8AD9ACF7-T1-A2-J02-20170506-02-40-22-lTAxP5Eu/trigger/trigger_004.mduprg_2.txt
004.mduprg      mduprg  3       2765511 004.mduprg_3    sam2bam 2765506 submitted       NA      /fastscratch/foo/flowr/runs/glass_tools/flowr_aln_fqs-CGP-S03-E7AB-8AD9ACF7-T1-A2-J02-20170506-02-40-22-lTAxP5Eu/004.mduprg/mduprg_cmd_3.sh /fastscratch/foo/flowr/runs/glass_tools/flowr_aln_fqs-CGP-S03-E7AB-8AD9ACF7-T1-A2-J02-20170506-02-40-22-lTAxP5Eu/trigger/trigger_004.mduprg_3.txt

issue with to_flow, if a data.frame is supplied instead of a flowmat

to_flow(x = flowmat, def = flowdef, execute = FALSE, containerize = TRUE, module_cmds = "module load gcc/4.8.1")
--> Using `samplename` as the grouping column
--> Using `jobname` as the jobname column
--> Using `cmd` as the cmd column
> reading and checking flow def ...
 Show Traceback

 Rerun with Debug
 Error in is.flowdef(x) : argument "def" is missing, with no default 

killing a flow with a single job seems to be an issue

For some reason, the progress bar fails if the loop has a single element, causing issues when killing a flow with a single job.

killing 1 jobs, please wait... See kill_jobs.out in the wd for more details.
Error in txtProgressBar(style = 3, min = 1, max = length(cmds)) :
  must have 'max' > 'min'
Error in funr(args = commandArgs(trailingOnly = TRUE), script_name = "flowr",  :

rerun ability

For more complex flows, if a user could specify specific jobs to be re-run, re-running would become a lot more versatile.

create_jobs_mat

Replace this function with:

  • to_flowdef()
  • split_multi_dep()

to make the plotting of flows simpler.

Job execution on remote grid node

Is there a feature to execute jobs on a remote grid server?
~/flowr/conf/flowr.conf does not provide any parameter to specify remote grid (GCS) nodes.

Thanks.

modifying commands/resources before re-running

Hi Sahil,

Sometimes a flowr pipeline can stop because of insufficient memory/walltime (which I set up) or inappropriate commands, and these therefore need to be revised before re-running. However, I am not sure which flowr file (flow_details.rds?) I should modify in order to update the commands/memory/walltime. Would you give me your comments on this?

Thanks,

job name of a `job` object is changed after submission

As of now, the job name of the job object is changed after submission.

Rather, using a separate variable, uid, would be better:

name: a simple name given by the user
uid: 001.jobname1, 002.jobname2
id: cluster submission ids given by the cluster

A Universal parser for default parameters.

A function which accepts a parameter file and populates its values into the user environment.

  • Defaults sit in the flowr and ngsflows installations.
  • A file with defaults lives at ~/flowr/conf/default.conf
  • Further, each pipeline may have its own file: default.conf
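
A minimal sketch of what such a parser could look like (the function name, file format and use of options() are assumptions, not the flowr API):

# hypothetical helper: read a two-column name/value conf file and
# populate the pairs into R's options for later retrieval
read_conf <- function(file){
  conf = read.delim(file, header = FALSE, comment.char = "#",
                    col.names = c("name", "value"),
                    stringsAsFactors = FALSE)
  opts = as.list(conf$value)
  names(opts) = conf$name
  do.call(options, opts)
  invisible(opts)
}
# read_conf("~/flowr/conf/default.conf"); getOption("verbose_level")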

bash: "nocobbler on" gives errors with flow status and rerun function.

By default, bash has noclobber set to off, allowing touch hello && echo hi > hello to work. If noclobber is on (check with set -o | grep noclobber), the same command returns an error:

bash: hello: cannot overwrite existing file

Since user commands, as well as flowr's torque.sh and other job-scheduler files, rely on the > operator for the job trigger, dependency and rerun functions, the noclobber flag is worth checking before using >. Alternatively, run set +o noclobber && touch hello && echo hi > hello, or force the overwrite with touch hello && echo hi >| hello.

PS: the exitstat variable, currently at https://github.com/sahilseth/flowr/blob/master/inst/conf/torque.sh#L37, needs to be set just after the {{{CMD}}} line, https://github.com/sahilseth/flowr/blob/master/inst/conf/torque.sh#L33

Better error message for to_flowdef

for an invalid path:

 to_flowdef(wd)
Sorry we do not recognize this file format  please use tsv, csv or xlsx2 (sheetname: sample_sheet)
Looks good, just check...
Creating a skeleton flow definition
Following jobnames detected: 

if a folder is provided:

Following jobnames detected: 
 Show Traceback

 Rerun with Debug
 Error in rep("gather", njobs - 1) : invalid 'times' argument 

Check if the path is a flow wd; if yes, read the flow object and proceed.

Ability to control the max number of flows being run concurrently

For this, try the devel version of flowr. Use devtools::load_all() to load the functions.

git clone -b devel https://github.com/sahilseth/flowr.git
devtools::load_all("flowr")

Say you already have a list of flow objects to be run (fobjs).

One may use this function to loop over the list and submit a flow only if fewer than the maximum allowed flows are running.

submit_run(fobjs, wd = "<flow_run_path>", max_processing = 7)

This basically monitors the folder wd and checks how many flows in that folder are running.
If fewer than max_processing are running, ONE is submitted; otherwise the function waits for 10 minutes.
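
A rough sketch of that loop (submit_flow() is flowr's; the way running flows are counted below is a placeholder, and the rest is an assumption about how submit_run works):

submit_when_free <- function(fobjs, wd, max_processing = 7, wait = 10 * 60){
  for (fobj in fobjs){
    repeat {
      # placeholder: count run folders under wd as a proxy for running flows
      running = length(list.dirs(wd, recursive = FALSE))
      if (running < max_processing) break
      Sys.sleep(wait)                      # wait ~10 minutes, then check again
    }
    submit_flow(fobj, execute = TRUE)      # submit ONE flow
  }
}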

Reduce dependency load

There seem to be quite a few dependencies whose functions are rarely used.

  • ggplot2
  • knitr

add ability to set dynamic parameters

Currently, in flowr (like many other tools), all parameters are decided before execution.

There are a few use cases in which it might be useful for step A to analyze the data and set a parameter for step B.

An example: we align reads and infer the insert size, which is then used as a parameter for tools like breakdancer (for looking at translocations).

http://bionics.it/posts/dynamic-workflow-scheduling
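
One way to approximate this today (a sketch, not a built-in feature; the two commands below are hypothetical) is to let step A write the value to a file and have step B's command read it at run time:

# step A computes the insert size and writes it to a file;
# step B reads it only when its command actually runs on the cluster
cmd_A = "infer_isize.sh aln.bam > isize.txt"
cmd_B = "run_breakdancer.sh --insert-size $(cat isize.txt) aln.bam"
flow_mat = data.frame(samplename = "s1",
                      jobname = c("isize", "breakdancer"),
                      cmd = c(cmd_A, cmd_B),
                      stringsAsFactors = FALSE)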

Simplify the configuration setup

In two subsequent versions of flowr, remove the extra folders from which configuration files are loaded.

By default, all configuration is loaded from the .flowr file in the home directory.

Other than this, all remaining configuration would be specific to each pipeline.

Each pipeline can have its own configuration files as well. This issue is somewhat related to #58.

PBS scheduler errors with qsub submission: &amp;&amp; in bash command

Dependent jobs are failing because of a syntax error while parsing the first bash command:

PBS Job Id: 40993.
Job Name:   T95f3A8d9eG_001.job_start-1
Exec host:  
Execution terminated
Exit_status=2
resources_used.cput=00:00:00
resources_used.mem=0kb
resources_used.vmem=0kb
resources_used.walltime=00:00:01

sample 1, command 1:

cat ~/flowr/runs/<SAMPLE_ID>/001../job_start_cmd_1.sh 
## --- command to run comes here (flow_mat)
## ------  
 mkdir -p 95f31dd1-0014-4f11-b866-f6bf5cb152c8__8d9ebd2e-9ac6-4d93-81fb-5cfe6f4a4e7a &amp;&amp; touch 95f31dd1-0014-4f11-b866-f6bf5cb152c8__8d9ebd2e-9ac6-4d93-81fb-5cfe6f4a4e7a.running 
cat ~/flowr/runs/<SAMPLE_ID>/001../job_start_cmd_1.out
/var/spool/torque/mom_priv/jobs/40993....SC: line 35: syntax error near unexpected token `;&'
/var/spool/torque/mom_priv/jobs/40993.....SC: line 35: ` mkdir -p 95f31dd1-0014-4f11-b866-f6bf5cb152c8__8d9ebd2e-9ac6-4d93-81fb-5cfe6f4a4e7a &amp;&amp; touch 95f31dd1-0014-4f11-b866-f6bf5cb152c8__8d9ebd2e-9ac6-4d93-81fb-5cfe6f4a4e7a.running '

&amp;&amp; was passed in bash command and probable cause of error.

Subsequent jobs dependent on job_start_cmd_1 fail:

PBS Job Id: 40994.
Job Name:  2_make_config
Aborted by PBS Server 
Job deleted as result of dependency on job 40993...

sge conf should be pbs?

This is really a minor point, but should conf/sge.sh be renamed conf/pbs.sh?
A new configuration file could then be added for sge.sh with SGE-tailored commands.
SGE commands/variables start with an SGE prefix (SGExxx).

rerun ignores platform, when specified via run command

For example, if we run:

flowr run x=fastq_bam_bwa fqs1=$fqs1 fqs2=$fqs2 samplename=samp execute=TRUE platform=local rerun_wd=output/runs/fastq_bam_bwa-samp-20151201-18-07-17-aiJFUNjt start_from=aln1

it fails with the error:

-> Flow is being processed. Track it from cmd line using:
flowr status x=/Users/sahilseth/Dropbox2/Dropbox/public/flow-r/fastq_bam/output/runs/fastq_bam_bwa-samp-20151201-18-07-17-aiJFUNjt
OR from R using:
status(x='/Users/sahilseth/Dropbox2/Dropbox/public/flow-r/fastq_bam/output/runs/fastq_bam_bwa-samp-20151201-18-07-17-aiJFUNjt')
  |                                                                      |   0%sh: msub: command not found
Error in system(cmd, intern = TRUE) : error in running command
Error in x$value : $ operator is invalid for atomic vectors

flowr flowdef bug while running scatter-gather approach

The scatter-gather approach is failing due to a major bug in the flowr code.

In brief: the bug was found while using scatter-serial to parallelize the GATK steps index_7, recal_8, bqsr_9 and run_mutect by chromosome. flowr currently does not respect the one-to-one relationship for successive scatter-serial jobs; instead, it will start successive scatter jobs without making sure that the preceding completed job is the one matched to the current job and not a different one. E.g., if indel_8 for chromosome 22 is complete, flowr should start recal_8 for the same chr 22, but instead it can start other chromosome(s) even though indel_8 might still be running for those chromosome(s).

The issue stems from a faulty flowmat which assigns, as the dependency job ID for the HPC job scheduler, the same ID as the first job of the scatter-serial step, instead of respecting the one-to-one relationship between consecutive scatter-serial jobs.

This issue has been cross-referenced from an internal code repository, and work is in progress to fix it.

Temporary fix:

Disallow scatter-serial and force the use of scatter-gather instead, i.e., all jobs of a scatter step have to be completed before the next scatter step can start.

Adding a new platform

Adding a new platform involves a few moving pieces.

  1. Job submission
  2. Parsing the job IDs the platform returns upon submission
  3. Providing these job IDs as dependencies to subsequent jobs

Example, adding torque:

  1. submission: Template used for submission:
    https://github.com/sahilseth/flowr/blob/master/inst/conf/torque.sh
  2. parse_jobid: the job IDs are parsed using a regular expression provided in:
    https://github.com/sahilseth/flowr/blob/master/inst/conf/flowr.conf
    as used in parse_jobids() (see the sketch after this list):
    https://github.com/sahilseth/flowr/blob/master/R/parse-jobids.R
  3. parse_dependency: These are then parsed to create a dependency string, as seen here:
    https://github.com/sahilseth/flowr/blob/master/R/parse-dependency.R
  4. job(): add a new class using the platform name. This is essentially just a wrapper around the job class.
    https://github.com/sahilseth/flowr/blob/master/R/class-def.R
    setClass("torque", contains = "job")
  5. Killing jobs
  • It's important to have the right kill command in: detect_kill_cmd()
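
For step 2, a sketch of what the job-id parsing amounts to (the regex below is only an illustration; the real one ships in inst/conf/flowr.conf):

# extract the numeric job id from the string a scheduler returns on submission
raw   = "40993.pbs-server.example.org"       # hypothetical qsub return value
jobid = gsub("^([0-9]+).*$", "\\1", raw)     # -> "40993"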

How to integrate flowr and dplyr

I am using dplyr and sparklyr heavily, and flowr looks great for workflows.

Could I translate my existing dplyr and sparklyr pipelines into flowr?

If the dir is missing, kill does not provide an informative error message

If the dir is missing, kill does not provide an informative error message. Currently it shows:

Flowr: streamlining workflows
Error in tail(fobj@jobs, 1) :
  trying to get slot "jobs" from an object of a basic class ("character") with no slots
Calls: main ... kill.character -> kill.flow -> detect_kill_cmd -> tail
Execution halted

Installing flowr fails from github - diagram pkg - Invalid comparison operator in dependency

A fresh install of flowr on RHEL 6.5 fails with an error related to the package diagram.

> devtools::install_github("glass-consortium/flowr", ref="master")
Downloading GitHub repo glass-consortium/flowr@master
from URL https://api.github.com/repos/glass-consortium/flowr/zipball/master
Installing flowr
trying URL 'https://cran.rstudio.com/src/contrib/diagram_1.6.3.tar.gz'
Content type 'application/x-gzip' length 466691 bytes (455 KB)
==================================================
downloaded 455 KB

Installing diagram
Error in FUN(X[[i]], ...) :
  Invalid comparison operator in dependency: >=
> devtools::install_github("sahilseth/flowr", ref="master")
Downloading GitHub repo sahilseth/flowr@master
from URL https://api.github.com/repos/sahilseth/flowr/zipball/master
Installing flowr
trying URL 'https://cran.rstudio.com/src/contrib/diagram_1.6.3.tar.gz'
Content type 'application/x-gzip' length 466691 bytes (455 KB)
==================================================
downloaded 455 KB

Installing diagram
Error in FUN(X[[i]], ...) :
  Invalid comparison operator in dependency: >=
  • The following works, though, if the package diagram is installed beforehand. It looks like diagram has an additional dependency on the package shape, which the default flowr install is not picking up.
install.packages("diagram")
devtools::install_github("sahilseth/flowr", ref="master")
  • install log:
> install.packages("diagram")
Installing package into ‘/projects/verhaak-lab/verhaak_env/verhaak_libs/R/3.3.2’
(as ‘lib’ is unspecified)
also installing the dependency ‘shape’

trying URL 'https://cran.rstudio.com/src/contrib/shape_1.4.2.tar.gz'
Content type 'application/x-gzip' length 683515 bytes (667 KB)
==================================================
downloaded 667 KB

trying URL 'https://cran.rstudio.com/src/contrib/diagram_1.6.3.tar.gz'
Content type 'application/x-gzip' length 466691 bytes (455 KB)
==================================================
downloaded 455 KB

* installing *source* package ‘shape’ ...
** package ‘shape’ successfully unpacked and MD5 sums checked
** R
** demo
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (shape)
* installing *source* package ‘diagram’ ...
** package ‘diagram’ successfully unpacked and MD5 sums checked
** R
** data
*** moving datasets to lazyload DB
** demo

... truncated
** testing if installed package can be loaded
* DONE (diagram)

> devtools::install_github("sahilseth/flowr", ref="master")
Downloading GitHub repo sahilseth/flowr@master
from URL https://api.github.com/repos/sahilseth/flowr/zipball/master
Installing flowr
trying URL 'https://cran.rstudio.com/src/contrib/params_0.6.1.tar.gz'
Content type 'application/x-gzip' length 23757 bytes (23 KB)
==================================================
downloaded 23 KB

Installing params
'/opt/compsci/R/3.3.2/lib64/R/bin/R' --no-site-file --no-environ --no-save  \
  --no-restore --quiet CMD INSTALL  \
  '/tmp/RtmpVDwgUS/devtools46e63f9d7d43/params'  \
  --library='/projects/verhaak-lab/verhaak_env/verhaak_libs/R/3.3.2'  \
  --install-tests

* installing *source* package ‘params’ ...
** package ‘params’ successfully unpacked and MD5 sums checked
** R
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (params)

... truncated

** installing vignettes
** testing if installed package can be loaded
* DONE (flowr)

  • session info
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BiocInstaller_1.24.0

loaded via a namespace (and not attached):
 [1] httr_1.2.1        R6_2.2.0          tools_3.3.2       withr_1.0.2
 [5] curl_2.3          memoise_1.0.0     knitr_1.15.1      git2r_0.18.0.9000
 [9] digest_0.6.12     devtools_1.12.0
>

Strict checking of flow definition

Add two new separate functions to make this cleaner and user accessible, helping users debug better.

check_resources() (a sketch follows the list below)

  • cpu: numeric, default 1
  • nodes: now supports character, default 1
  • walltime: character, default 1:00
  • memory: character, default 100
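
A sketch of what check_resources() could enforce, using the defaults listed above (column names and implementation details are assumptions):

check_resources <- function(def){
  # fill missing columns of a flowdef with the documented defaults
  if (is.null(def$cpu_reserved))    def$cpu_reserved    = 1
  if (is.null(def$nodes))           def$nodes           = "1"     # character
  if (is.null(def$walltime))        def$walltime        = "1:00"
  if (is.null(def$memory_reserved)) def$memory_reserved = "100"
  def$cpu_reserved = as.numeric(def$cpu_reserved)   # coerce cpu to numeric
  def
}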

check_dependency():
check for things that are not allowed:
consider jobs A and B, each with 10 commands.

  • sub: scatter, dep: serial and prev_sub: serial
    This would try to do a one-to-one mapping between A and B, but B does not appear to be a parallel operation, so this does not seem intended.
    Rather, this should be:
    • sub: scatter, dep: serial and prev_sub: scatter, to have a one-to-one mapping.

Will add more such incompatible/ambiguous relationships.

to_flow submission gives error: object 'fobjuuid' not found

When running ...

fobj = to_flow(x = user_flow_mat, def = flow_def, platform = myplatform,
                                                         flow_run_path = myoutdir,
                                                         flowname = myflowname,
                                                         execute = myexecute)

it gives the following error and halts batch submission.

Detected 2 samples/groups in flow_mat.
flow_mat would be split and each would be submitted seperately...

Working on... sample1
input x is list
.....Error in FUN(X[[i]], ...) : object 'fobjuuid' not found
Calls: to_flow -> to_flow.data.frame -> lapply -> FUN
Execution halted

PS: library(uuid) is already installed in R.

Check complex dependency

For dependency type serial and submission type scatter:

The number of commands in this step should match the previous one.
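
A small sketch of the check being requested (flowmat column names follow the standard format; the logic is an assumption):

# for a serial dependency on a scatter step, the current step must have the
# same number of commands as the previous step (here, jobs "A" and "B")
n_prev = sum(flowmat$jobname == "A")
n_curr = sum(flowmat$jobname == "B")
if (n_curr != n_prev)
  stop("serial dependency with scatter submission: expected ",
       n_prev, " commands, found ", n_curr)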

Use opts_flow$get/set/load instead of get_opts, set_opts, etc.

This makes sure that all functions always refer to the correct environment when getting, setting and loading parameters.

A typical example: if we load flowr and then reload the params package for some reason, get_opts would refer to the environment from the params package and not from flowr. This is undesirable.

Version 0.9.9 will show a message, and version 0.10 will deprecate the use of get_opts in the flowr package.
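
For example (opts_flow is part of flowr; the parameter name and exact call style below are assumptions):

opts_flow$set(flow_run_path = "~/flowr/runs")   # set a parameter
opts_flow$get("flow_run_path")                  # retrieve it
opts_flow$load("~/flowr/conf/flowr.conf")       # load a conf file

# get_opts()/set_opts() are discouraged: if the params package is re-loaded
# after flowr, they may resolve to params' environment instead of flowr's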

better handling of conf file loading.

If conf files cannot be loaded properly, die gracefully.

The following error is quite confusing:

library(flowr)
Loading required package: params
Loading required package: whisker
Flowr: streamlining workflows
Error : .onAttach failed in attachNamespace() for 'flowr', details:
  call: `colnames<-`(`*tmp*`, value = c("name", "value"))
  error: attempt to set 'colnames' on an object with less than two dimensions
Error: package or namespace load failed for ‘flowr’

Issue with recent R version, when creating a flowmat

The following error occurs when creating a flowmat, possibly due to an issue in the function to_flowmat.list:

Error in data.frame(samplename = samplename, ret, row.names = NULL, stringsAsFactors = FALSE) : 
  arguments imply differing number of rows: 1, 2

status function works but throws a warning

At times, status gives a warning like:
Error in file(file, "rt") : invalid 'description' argument

Also, status fails if a user does not have write privileges to the run folder.

feature request submit jobs by priority rather than by jobname

Currently flowr submits jobs to the HPC by jobname. E.g., you want to submit 5 independent flowr pipelines (say A-E), each with 50 dependent sub-jobs (1-50). Submission is then done sequentially for each jobname, starting with pipeline A sub-job 1, then sub-job 2, etc. Because of the delay in submitting jobs, this can take a while.

Can you make it so that when you submit multiple jobnames, they are submitted in order of priority, e.g. first A1, then B1 (rather than A2), etc.?

Switch to using qsub/bsub scripts rather than one-line commands

It may be easier to edit scripts than long one-line commands, so it may be useful to switch from one-line commands to qsub/bsub scripts.

Example:
torque.sh

flowr will first check the internal conf folder, then ~/flowr/conf. Additionally, one can change this by making a call to the function queue() and then supplying this object to, say, to_flow().

LSF CPU/core/thread request

Some LSF clusters do not seem to honor the number of cores requested via the -n parameter.

From the documentation of bsub command, I gather:

bsub -h[elp] [all] [description] [category_name ...]
       [-option_name ...]

 bsub -h limit
 -p      Sets the limit of the number of processes to the specified value for the
whole job.
-T      Sets the limit of the number of concurrent threads to the specified
value for the whole job.

bsub -h resource
-n      Submits a parallel job and specifies the number of tasks in the job.

The latest version of the LSF job template uses the following options:
https://github.com/sahilseth/flowr/blob/master/inst/conf/lsf.sh

#BSUB -n {{{CPU}}}                                      # CPU reserved
#BSUB -R span[ptile={{{CPU}}}]                          # CPU reserved, all reserved on same node

An online LSF tutorial also uses the -n param: https://wiki.med.harvard.edu/Orchestra/IntroductionToLSF#How_many_cores

First, you need to tell LSF the number of cores you want to run on. You do this with the bsub -n flag. The number of cores you request from the program should always be the same as the number of cores you request from LSF. Note that different programs have different options to specify multiple cores. For example, tophat -p 8 asks tophat for eight cores. So you might run bsub -q mcore -W 1:00 -n 8 tophat -p 8 ...

rerun support

Currently, rerun is supported, but there are a few quirks and the process needs to be streamlined.

Each time submit_flow is called, it saves a few files:

  • the flow object is saved to flow_details.rds
  • flow_details, a table with several details including all the job ids, is written to flow_details.txt
  • flow_design.pdf: a flow chart

When re-running, these should not be overwritten but updated; currently no such update feature is available.

Currently, rerun skips creating these files.

Apart from this, rerun should work properly.
