clicumu / doepipeline

A python package for optimizing processing pipelines using statistical design of experiments (DoE).

License: MIT License

Python 100.00%

bioinformatics doe optimization pipeline

doepipeline's People

Forkers

druvus keshavaspanda

doepipeline's Issues

pyDOE2 issues

This is only indirectly related to doepipeline, but since issues cannot be opened on a fork, I am opening it here.

I tried to install pyDOE2 with Python 2.7.14 in a conda environment and got the following error:

$ python setup.py install
  File "setup.py", line 11
SyntaxError: Non-ASCII character '\xc3' in file setup.py on line 11, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

This is due to the non-ASCII letter in the author's family name. After I changed it, setup.py complained about the encoding argument instead: Python 2's built-in open() does not accept an encoding keyword; I think it was introduced in Python 3.

$ python setup.py install
Traceback (most recent call last):
  File "setup.py", line 16, in <module>
    long_description=read('README.md'),
  File "setup.py", line 5, in read
    with open(fname, encoding=encoding) as f:
TypeError: 'encoding' is an invalid keyword argument for this function

Removing it solves the issue and the package installs, but the README is then read without an explicit encoding.
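A fix that keeps the encoding on both Python versions is to use io.open, which accepts an encoding keyword on Python 2 and 3 alike, together with a PEP 263 coding declaration so that Python 2 can parse the non-ASCII author name. A minimal sketch, assuming pyDOE2's setup.py helper has the shape shown in the traceback (read(fname) opening the file with an encoding):

```python
# -*- coding: utf-8 -*-
# The coding declaration above (PEP 263) lets Python 2 parse non-ASCII
# characters, e.g. in an author name, inside this setup.py.
import io


def read(fname, encoding='utf-8'):
    """Return the decoded contents of fname.

    io.open accepts an ``encoding`` keyword on both Python 2 and 3,
    unlike the built-in open() on Python 2.
    """
    with io.open(fname, encoding=encoding) as f:
        return f.read()
```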

Prepending scripts with env-variable setting and setup-scripts

I was asked whether it is possible to set environment variables when running remotely. This currently does not work at all, since paramiko executes each command in an isolated session.

I suggest solving this by prepending environment-variable settings and setup-script execution to each command, in the same way as the directory change is handled now:

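    import posixpath

    # paramiko runs every command in a fresh session, so the profile is
    # sourced (and the directory changed) in front of each command.
    prefix = '. ./.bash_profile;'
    if execution_dir:
        # skip directories that the command already changes into itself
        cd = [path for path in execution_dir
              if 'cd {path}'.format(path=path) not in command]
        if cd:
            prefix += ' cd {path};'.format(path=posixpath.join(*cd))

    full_command = prefix + command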

Sourcing .bash_profile could then also be specified as a "setup script". This would generalize the current functionality, which assumes that Bash is the shell used on the server.

BaseSSHExecutor can simply override BasePipelineExecutor._set_env_variables to build a prefix that can be fetched in execute_command.
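A minimal sketch of such a prefix builder, written as a standalone hypothetical class: CommandPrefix is not part of doepipeline, and the real _set_env_variables signature may differ. It sources each setup script and exports each variable (shell-quoted) in front of the command, which is what an overridden _set_env_variables could store for execute_command to fetch:

```python
import shlex


class CommandPrefix:
    """Hypothetical helper: shell statements prepended to every remote
    command, since paramiko runs each command in a new session."""

    def __init__(self, setup_scripts=()):
        # e.g. ['./.bash_profile']; each is sourced with "." before the command
        self.setup_scripts = list(setup_scripts)
        self.env_variables = {}

    def set_env_variables(self, env_variables):
        # stand-in for what an override of _set_env_variables could record
        self.env_variables.update(env_variables)

    def build(self):
        parts = ['. {};'.format(shlex.quote(s)) for s in self.setup_scripts]
        parts += ['export {}={};'.format(k, shlex.quote(str(v)))
                  for k, v in sorted(self.env_variables.items())]
        return ' '.join(parts)

    def wrap(self, command):
        prefix = self.build()
        return '{} {}'.format(prefix, command) if prefix else command
```

Because the setup scripts are configurable, Bash's .bash_profile becomes just one possible setup script rather than a built-in assumption.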

Sidenote:
I also think the remote executor could override BasePipelineExecutor._cd in a similar manner, keeping track of the current location so that it can be fetched as a script prefix. This would tidy up the currently butt-ugly conditional checks in BaseSSHExecutor.execute_command.
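A sketch of that idea, with RemoteLocation as a hypothetical stand-in for what an overridden _cd could store: it remembers the current remote directory and turns it into a prefix, instead of searching each command string for "cd". posixpath is used because the remote host is assumed to be POSIX:

```python
import posixpath
import shlex


class RemoteLocation:
    """Hypothetical tracker of the remote working directory."""

    def __init__(self, start=None):
        self.current = start  # None means "stay in the login directory"

    def cd(self, path):
        # absolute paths replace the location; relative ones are joined on
        base = self.current or ''
        self.current = posixpath.normpath(posixpath.join(base, path))

    def prefix(self):
        # prefix to put in front of the next command, or '' if no cd is needed
        return 'cd {};'.format(shlex.quote(self.current)) if self.current else ''
```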

init values outside min/max range

Thanks to a typo, I realized that the predicted optimum can get stuck outside the tested values if the init_min/init_max values lie outside the min/max range. We should validate against this to catch mistakes in the config file.
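A minimal sketch of such a check. It assumes the factor keys used in the ExperimentDesigner example further down this page (min, max, low_init, high_init); the issue calls them init_min/init_max, so the actual config keys may differ, and validate_factor_ranges is a hypothetical name:

```python
def validate_factor_ranges(factors):
    """Raise ValueError unless min <= low_init < high_init <= max for every factor.

    A missing min/max is treated as unbounded; a factor without an initial
    range is skipped.
    """
    errors = []
    for name, spec in factors.items():
        low, high = spec.get('low_init'), spec.get('high_init')
        if low is None or high is None:
            continue  # nothing to check without an initial range
        fmin = spec.get('min', float('-inf'))
        fmax = spec.get('max', float('inf'))
        if not fmin <= low < high <= fmax:
            errors.append('{}: expected {} <= low_init ({}) < high_init ({}) <= {}'
                          .format(name, fmin, low, high, fmax))
    if errors:
        raise ValueError('invalid factor ranges:\n' + '\n'.join(errors))
```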

Understanding doepipeline on simple example

Great package. I've just begun tinkering at a very low level to make sure my intuition matches what the code is doing. I've set up a two-factor experiment with a normal distribution as the response. For example:

# standard
import logging

import numpy as np
import pandas as pd

# plotting
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cmx
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib

# stats
from scipy.stats import multivariate_normal

# doe
from doepipeline.designer import ExperimentDesigner

# logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def generate_response(dfi,rv):
    return 1000*rv.pdf(dfi)

number_of_iterations = 5 # number of iterations to try

mu_x = 60
variance_x = 500

mu_y = 75
variance_y = 500

minx = 40
maxx = 100
miny = 45
maxy = 110

responses = {"fraction":{"criterion":"maximize"}}

factors = {
            "A": {
                    "min":10,
                    "max": 150,
                    "low_init": minx,
                    "high_init": maxx,
                },
            "B": {
                    "min":10,
                    "max":150,
                    "low_init":miny,
                    "high_init":maxy,
                     }
            } 

# create grid and multivariate normal

x = np.linspace(minx,maxx,100)
y = np.linspace(miny,maxy,100)
X, Y = np.meshgrid(x,y)
pos = np.empty(X.shape + (2,))
pos[:, :, 0] = X; pos[:, :, 1] = Y
rv = multivariate_normal([mu_x, mu_y], [[variance_x, 0], [0, variance_y]])

Z = generate_response(pos,rv)

# view response surface for factors A & B

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X,Y,Z)
ax.set_xlabel("A")
ax.set_ylabel("B")
ax.set_zlabel("response")
fig.savefig("response.png",bbox_inches='tight',dpi=300)
plt.show()

I then want to optimize iteratively with doepipeline, hopefully converging on the optimum set by mu_x and mu_y.

exp = ExperimentDesigner(factors,
                        'fullfactorial2levels',
                        responses,
                        model_selection='greedy',
                        skip_screening=True,
                        shrinkage=0.9)

# create first design (skipping screening)

df = exp.new_design()

dfstart = df.copy()
factor_updates = []  # renamed so the `factors` dict above is not shadowed
optimal = []
designs = [df]
best = []

for niters in range(number_of_iterations):
    # evaluate the simulated response for every run in the current design
    r_0 = generate_response(df, rv)
    fractioni = pd.DataFrame.from_dict({"fraction": r_0})
    bi = exp.get_best_experiment(df, fractioni)
    fi = exp.update_factors_from_optimum(bi)
    opti = exp.get_optimal_settings(fractioni)
    optimal.append(opti)
    df = exp.new_design()
    best.append(bi)
    factor_updates.append(fi)
    designs.append(df)
    print("Iteration", niters + 1)

Next, I wrote a simple function that loops through the experimental design from each step. I also made one or two modifications so that the model and optima are returned, to inspect them interactively in my notebook.

def plot_search(dflist, Zin, optima, expi):
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(dfstart['A'], dfstart['B'], 'rs', ms=5, label='start')

    # one colour per design, taken from the viridis colormap
    cmap = plt.get_cmap('viridis')
    cNorm = colors.Normalize(vmin=0, vmax=len(dflist))
    scalarMap = cmx.ScalarMappable(norm=cNorm, cmap=cmap)
    xyb = []
    # use the arguments rather than the globals Z, mu_x and mu_y
    ax.plot(optima[0], optima[1], 'yo', label='peak response (optimal)')
    cbar = ax.imshow(Zin, origin='lower', cmap='Blues', extent=[minx, maxx, miny, maxy], aspect='auto')

    for idx,dfin in enumerate(dflist):

        # set rectangle locations based on new design

        width = np.max(dfin["A"])-np.min(dfin["A"])
        height = np.max(dfin["B"])-np.min(dfin["B"])
        corner_y = np.min(dfin["B"])
        corner_x = np.min(dfin["A"])
        colorVal = scalarMap.to_rgba(idx)
        if idx != 0:
            rect = plt.Rectangle((corner_x,corner_y),width,height,linewidth=3,edgecolor=colorVal,facecolor='none')
        else:
            rect = plt.Rectangle((corner_x,corner_y),width,height,ls='--',linewidth=2,edgecolor='r',facecolor='none')
        
        ax.add_patch(rect)

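        # best settings from the iteration that produced this design;
        # the starting design (idx 0) has no preceding iteration
        if idx > 0:
            xyb.append([best[idx-1]['factor_settings']["A"], best[idx-1]['factor_settings']["B"]])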
        
        # label the iteration in the center of each square

        ax.text(corner_x+width/2.,corner_y+height/2.,idx+1,fontsize=15,fontweight='bold',color=colorVal,va='center',ha='center')

    xyb = np.array(xyb)
    best_exp = expi._best_experiment
    plt.plot(best_exp['optimal_x']['A'],best_exp['optimal_x']['B'],'gx',ms=15,label='best')
    ax.set_xlabel("A")
    ax.set_ylabel("B")
    plt.colorbar(cbar)
    plt.xlim([minx,maxx])
    plt.ylim([miny,maxy])
    plt.subplots_adjust(left=0.1,right=0.99,top=0.9,bottom=0.1)
    plt.savefig("search.png",bbox_inches='tight',dpi=300)
    plt.grid()

# plot each experimental design proposed
plot_search(designs,Z,[mu_x,mu_y],exp)

r_0 = generate_response(df,rv)
fractioni = pd.DataFrame.from_dict({"fraction":r_0})
dfoptimal,model,prediction = exp.get_optimal_settings(fractioni)
plt.legend(loc='best')
plt.savefig("search.png",bbox_inches='tight',dpi=300)
plt.show()

The dashed rectangle is the initial design, and each numbered rectangle (1, 2, 3, 4, 5, 6) marks one iteration. The yellow dot is the hard-coded optimum that the search should find.

A few questions:

At each iteration, does doepipeline "forget" what has been run beforehand and treat each new run separately? I would have thought that the responses from all previous runs would go into the OLS regression used to propose new runs. If not, should I just concatenate data frames to grow the set of experimental runs fed to exp.get_best_experiment each time?
It seems that get_best_experiment and get_optimal_settings can only return experimental conditions that have already been tested, not interpolated optima. I would expect that after three or four iterations the predicted optimum would be closer to the true optimum, no? I guess I'm a bit confused by the nomenclature.
Is this response surface an appropriate test for doepipeline? Even though model.summary() looks quite good, it still didn't converge (see below).
Should I be using shrinkage this way? It seems to be the only way to converge on anything, as the other approaches just keep generating new, slightly different experimental settings for A and B.
Inside designer.py, in _new_optimization_design(self), I can't see how OLS is used to generate a new set of experimental settings, at least in my intuitive arrangement here:

    fractioni = pd.DataFrame.from_dict({"fraction":r_0})
    bi = exp.get_best_experiment(df,fractioni)
    fi = exp.update_factors_from_optimum(bi)
    opti = exp.get_optimal_settings(fractioni)
    df = exp.new_design()
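For reference, this is not doepipeline's code (the source of _new_optimization_design is not reproduced here). It is a generic sketch of how an OLS fit of a full quadratic model can yield an interpolated optimum that was never run, assuming all runs so far are pooled (e.g. with pd.concat) into arrays x, y and z; fit_quadratic_surface and stationary_point are hypothetical names:

```python
import numpy as np


def fit_quadratic_surface(x, y, z):
    """OLS fit of z = b0 + b1*x + b2*y + b3*x**2 + b4*y**2 + b5*x*y."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    X = np.column_stack([np.ones_like(x), x, y, x**2, y**2, x * y])
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)
    return coef


def stationary_point(coef):
    """Solve grad z = 0 for the fitted quadratic (a max, min or saddle)."""
    b0, b1, b2, b3, b4, b5 = coef
    # dz/dx = b1 + 2*b3*x + b5*y = 0 and dz/dy = b2 + b5*x + 2*b4*y = 0
    H = np.array([[2 * b3, b5], [b5, 2 * b4]])
    return np.linalg.solve(H, -np.array([b1, b2]))
```

The stationary point is only a maximum if both eigenvalues of H are negative, and it may fall outside the tested region, so in practice it would be checked and clipped to the factor limits.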

Lastly, I get the following:

Given mu_x = 60 and mu_y = 75, I'm wondering whether I've done something wrong, set it up incorrectly, or taken the pipeline into an area it isn't suited for. Apologies for any silly errors; I'm just trying to understand what's going on, as the examples provided have a certain overhead to getting started. A very simple, purely Pythonic example would be greatly appreciated. Thanks for any help you can provide.

Break out or change the scheduler part

We discussed earlier finding a better way of submitting/distributing jobs. I have used Snakemake quite a lot, and I think using its API instead could give a more stable solution. It supports several schedulers as well as Kubernetes.

Below is a small example I found:

#!/usr/bin/env python3
"""
rule all:
    input:
        "reads.counts"
rule unpack_fastq:
    '''Unpack a FASTQ file'''
    output: "{file}.fastq"
    input: "{file}.fastq.gz"
    resources: time=60, mem=100
    params: "{file}.params"
    threads: 8
    log: 'unpack.log'
    shell:
        '''zcat {input} > {output}
        echo finished 1>&2 {log}
        '''
rule count:
    '''Count reads in a FASTQ file'''
    output: counts="{file}.counts"
    input: fastq="{file}.fastq"
    run:
        n = 0
        with open(input.fastq) as f:
            for _ in f:
                n += 1
        with open(output.counts, 'w') as f:
            print(n / 4, file=f)
"""

In pure Python, this is equivalent to the following code:

workflow.include("pipeline.conf")

shell.prefix("set -euo pipefail;")

@workflow.rule(name='all', lineno=6, snakefile='.../Snakefile')
@workflow.input("reads.counts")
@workflow.norun()
@workflow.run
def __all(input, output, params, wildcards, threads, resources, log, version):
    pass


@workflow.rule(name='unpack_fastq', lineno=17, snakefile='.../Snakefile')
@workflow.docstring("""Unpack a FASTQ file""")
@workflow.output("{file}.fastq")
@workflow.input("{file}.fastq.gz")

@workflow.resources(time=60, mem=100)
@workflow.params("{file}.params")
@workflow.threads(8)
@workflow.log('unpack.log')
@workflow.shellcmd(
    """zcat {input} > {output}
        echo finished 1>&2 {log}
        """
)
@workflow.run
def __unpack_fastq(input, output, params, wildcards, threads, resources, log, version):
    shell("""zcat {input} > {output}
        echo finished 1>&2 > {log}
        """
)


@workflow.rule(name='count', lineno=52, snakefile='.../Snakefile')
@workflow.docstring("""Count reads in a FASTQ file""")
@workflow.output(counts = "{file}.counts")
@workflow.input(fastq = "{file}.fastq")
@workflow.run
def __count(input, output, params, wildcards, threads, resources, log, version):
    n = 0
    with open(input.fastq) as f:
        for _ in f:
            n += 1
    with open(output.counts, 'w') as f:
        print(n / 4, file=f)


### End of output from snakemake --print-compilation


workflow.check()
print("Dry run first ...")
workflow.execute(dryrun=True, updated_files=[])
print("And now for real")
workflow.execute(dryrun=False, updated_files=[], resources=dict())

Another option I have used before is ipython-cluster-helper, but there are probably other options available.

Local serial execution of test pipeline

I modified the test pipeline in an attempt to run it locally on the KAW server.

Modified batch_execution.py:

from doepipeline.generator import PipelineGenerator
#from doepipeline.executor import SSHExecutor
from doepipeline.executor import LocalExecutor
import os
import yaml


if __name__ == '__main__':

    generator = PipelineGenerator.from_yaml('/media/data/daniel/doe/doepipeline_testscript/pipeline.yaml')
    designer = generator.new_designer_from_config()
    design = designer.new_design()
    pipeline = generator.new_pipeline_collection(design)

    executor = LocalExecutor(workdir='test_pipeline', execution_type='serial', base_command='nohup {script}')
    executor.execute_command('cd test_pipeline; ls | grep "[0-9]" | xargs rm -r')
    results = executor.run_pipeline_collection(pipeline)
    optimum = designer.update_factors_from_response(results)
    pass

$ (doe_pipeline)daniel@kaw:/media/data/daniel/doe/doepipeline_testscript$ python batch_execution.py
Traceback (most recent call last):
  File "/media/data/db/data/anaconda/envs/doe_pipeline/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/generator.py", line 22, in __init__
    self._validate_config(config)
  File "/media/data/db/data/anaconda/envs/doe_pipeline/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/generator.py", line 221, in _validate_config
    'job specified with SLURM but SLURM project-name is missing'
AssertionError: job specified with SLURM but SLURM project-name is missing

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "batch_execution.py", line 10, in <module>
    generator = PipelineGenerator.from_yaml('/media/data/daniel/doe/doepipeline_testscript/pipeline.yaml')
  File "/media/data/db/data/anaconda/envs/doe_pipeline/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/generator.py", line 55, in from_yaml
    return cls(config, *args, **kwargs)
  File "/media/data/db/data/anaconda/envs/doe_pipeline/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/generator.py", line 24, in __init__
    raise ValueError('Invalid config: ' + str(e))
ValueError: Invalid config: job specified with SLURM but SLURM project-name is missing

Does this boil down to line 220 in generator.py?

assert (any('SLURM' in job.keys() for job in jobs) and 'SLURM' in config_dict),\
            'job specified with SLURM but SLURM project-name is missing'

The YAML file I use does not contain 'SLURM'.
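As quoted, the assertion requires both conditions to hold, so it fails whenever no job uses SLURM, which matches the error above. The check was presumably meant as an implication: if any job uses SLURM, the config must contain a SLURM section. A sketch of that logic as a standalone function (check_slurm_config is a hypothetical name; it keeps the assert style of _validate_config):

```python
def check_slurm_config(config_dict, jobs):
    """Only require a top-level SLURM section if some job actually uses SLURM."""
    uses_slurm = any('SLURM' in job.keys() for job in jobs)
    # "A implies B" is "not A or B"; the quoted assert tested "A and B"
    assert not uses_slurm or 'SLURM' in config_dict, \
        'job specified with SLURM but SLURM project-name is missing'
```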

pyDOE issue

I ran into the following error. I'm using the same version of pyDOE as before.

Traceback (most recent call last):
  File "/media/data/daniel/doe_2018/doepipeline/bin/doepipeline", line 76, in <module>
    designer = generator.new_designer_from_config()
  File "/media/data/db/data/anaconda2/envs/doepipeline/lib/python3.6/site-packages/doepipeline-0.1-py3.6.egg/doepipeline/generator.py", line 77, in new_designer_from_config
    return designer_class(factors, design_type, responses, *args, **kwargs)
  File "/media/data/db/data/anaconda2/envs/doepipeline/lib/python3.6/site-packages/doepipeline-0.1-py3.6.egg/doepipeline/designer.py", line 179, in __init__
    self._design_matrix = matrix_designer(n)
  File "/media/data/db/data/anaconda2/envs/doepipeline/lib/python3.6/site-packages/doepipeline-0.1-py3.6.egg/doepipeline/designer.py", line 167, in <lambda>
    'ccf': lambda n: pyDOE.ccdesign(n, (0, 1), face='ccf'),
  File "/media/data/db/data/anaconda2/envs/doepipeline/lib/python3.6/site-packages/pyDOE/doe_composite.py", line 147, in ccdesign
    H1 = ff2n(n)
  File "/media/data/db/data/anaconda2/envs/doepipeline/lib/python3.6/site-packages/pyDOE/doe_factorial.py", line 115, in ff2n
    return 2*fullfact([2]*n) - 1
  File "/media/data/db/data/anaconda2/envs/doepipeline/lib/python3.6/site-packages/pyDOE/doe_factorial.py", line 78, in fullfact
    rng = lvl*range_repeat
TypeError: 'numpy.float64' object cannot be interpreted as an integer
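The same doe_factorial.py line (78) triggers a VisibleDeprecationWarning about "using a non-integer number instead of an integer" in the SLURM issue further down this page, so newer NumPy versions have presumably turned that warning into this TypeError: the number of factors n reaching pyDOE.ccdesign is a numpy.float64. A minimal guard, assuming the count can be coerced before the call (as_int_count is a hypothetical helper):

```python
def as_int_count(n):
    """Coerce a factor count (possibly a numpy float) to a plain int.

    Raises ValueError rather than silently truncating a non-integral value.
    """
    n_int = int(n)
    if n_int != n:
        raise ValueError('number of factors must be integral, got {!r}'.format(n))
    return n_int

# e.g. in the designer's lookup table (pyDOE imported as in designer.py):
# 'ccf': lambda n: pyDOE.ccdesign(as_int_count(n), (0, 1), face='ccf'),
```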

Print of version number

It would be convenient to see which version is installed using doepipeline --version.
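If the doepipeline command-line script parses its arguments with argparse (an assumption), the standard action='version' covers this. The version string below is taken from the egg name doepipeline-0.1 in the tracebacks on this page and would normally be read from the package itself:

```python
import argparse

__version__ = '0.1'  # assumption: would normally come from the package metadata


def build_parser():
    parser = argparse.ArgumentParser(prog='doepipeline')
    # action='version' prints the formatted string and exits with status 0
    parser.add_argument('--version', action='version',
                        version='%(prog)s {}'.format(__version__))
    return parser
```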

possible number of "qos=short" jobs passed to SLURM is capped

UPPMAX seems to cap the number of qos=short jobs you can have in the queue; I think the limit is 10. I get the following error when submitting a CCC design with three factors (14 experiments):

$ python gatk_snp_execute.py
Traceback (most recent call last):
  File "gatk_snp_execute.py", line 18, in <module>
    results = executor.run_pipeline_collection(pipeline)
  File "/media/data/db/data/anaconda/envs/doe_pipeline/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/base.py", line 199, in run_pipeline_collection
    self.run_jobs(job_steps, experiment_index, env_variables, **kwargs)
  File "/media/data/db/data/anaconda/envs/doe_pipeline/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/mixins.py", line 334, in run_jobs
    _, stdout, _ = self.execute_command(command, job_name=exp_name)
  File "/media/data/db/data/anaconda/envs/doe_pipeline/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/remote.py", line 159, in execute_command
    raise CommandError('\n'.join(err))
doepipeline.executor.base.CommandError: sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

This should be fixable with a try/except clause when submitting the jobs: if a submission is rejected, the job should stay in an internal queue and be submitted later.
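A minimal sketch of that internal queue. submit(job) and wait() are hypothetical callables (submit would wrap sbatch via execute_command, wait would sleep or poll until running jobs finish), and SubmissionRejected stands in for the CommandError raised on a rejected submission:

```python
from collections import deque


class SubmissionRejected(Exception):
    """Stand-in for the CommandError raised when sbatch refuses a job."""


def submit_with_queue(jobs, submit, wait=lambda: None, max_rounds=100):
    """Submit jobs in order; rejected jobs stay queued and are retried later."""
    pending = deque(jobs)
    submitted = []
    for _ in range(max_rounds):
        while pending:
            try:
                submit(pending[0])
            except SubmissionRejected:
                break  # scheduler limit reached; retry after waiting
            submitted.append(pending.popleft())
        if not pending:
            return submitted
        wait()
    raise RuntimeError('{} jobs still unsubmitted'.format(len(pending)))
```

Only QOS rejections should be retried; other errors (e.g. a broken script) should still propagate. Telling them apart would presumably mean checking the sbatch message, such as "Job violates accounting/QOS policy".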

Remote host closing connection causes crash

I've had trouble running a pipeline remotely on UPPMAX. After a few iterations, the following error usually appears. I suspect it could be fixed by reconnecting.
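A sketch of the suggested reconnect, kept library-agnostic so it runs without a server. get_client and reconnect are hypothetical callables, e.g. returning the current paramiko.SSHClient and calling its connect() again. ConnectionResetError is a subclass of OSError, so it is retried by default; paramiko.SSHException would likely need to be added to retry_on:

```python
import time


def exec_with_reconnect(get_client, reconnect, command, retries=3, delay=5.0,
                        retry_on=(OSError,)):
    """Run get_client().exec_command(command), reconnecting on failure."""
    for attempt in range(retries + 1):
        try:
            return get_client().exec_command(command)
        except retry_on:
            if attempt == retries:
                raise  # give up after the last retry
            time.sleep(delay)
            reconnect()
```

Retrying is harmless for polling commands such as the squeue call in poll_jobs, but re-running an sbatch command after a dropped connection could submit the same job twice.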

Socket exception: An existing connection was forcibly closed by the remote host (10054)
Traceback (most recent call last):
  File "manta_sv.py", line 49, in <module>
    results = executor.run_pipeline_collection(pipeline)
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\doepipeline-0.1-py3.5.egg\doepipeline\executor\base.py", line 200, in run_pipeline_collection
    self.run_jobs(job_steps, experiment_index, env_variables, **kwargs)
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\doepipeline-0.1-py3.5.egg\doepipeline\executor\mixins.py", line 353, in run_jobs
    self.wait_until_current_jobs_are_finished()
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\doepipeline-0.1-py3.5.egg\doepipeline\executor\base.py", line 239, in wait_until_current_jobs_are_finished
    status, msg = self.poll_jobs()
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\doepipeline-0.1-py3.5.egg\doepipeline\executor\remote.py", line 247, in poll_jobs
    return mixins.SlurmExecutorMixin.poll_jobs(self)
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\doepipeline-0.1-py3.5.egg\doepipeline\executor\mixins.py", line 377, in poll_jobs
    __, stdout, __ = self.execute_command(cmd)
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\doepipeline-0.1-py3.5.egg\doepipeline\executor\remote.py", line 153, in execute_command
    stdin, stdout, stderr = self._client.exec_command(full_command)
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\paramiko\client.py", line 341, in exec_command
    chan = self._transport.open_session(timeout=timeout)
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\paramiko\transport.py", line 618, in open_session
    timeout=timeout)
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\paramiko\transport.py", line 739, in open_channel
    raise e
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\paramiko\transport.py", line 1608, in run
    ptype, m = self.packetizer.read_message()
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\paramiko\packet.py", line 386, in read_message
    header = self.read_all(self.__block_size_in, check_rekey=True)
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\paramiko\packet.py", line 249, in read_all
    x = self.__socket.recv(n)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

rounded ints passed as arguments, floats used for internal calculations

The changes made in this commit introduce a discrepancy between the arguments actually used for an experiment and the arguments doepipeline thinks were used. All factor values are rounded and converted to ints before being substituted into the template script, because HaplotypeCaller (and probably other software) takes int arguments and cannot handle floats.
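One way to remove the discrepancy is to round the design itself before it is used both for the scripts and for modelling, so that doepipeline records exactly what was run. A pandas sketch; round_design and the list of integer-valued factors are assumptions:

```python
import pandas as pd


def round_design(design, int_factors):
    """Return a copy of the design with int_factors rounded to ints.

    Using the returned frame for both script substitution and the model
    fit keeps the two consistent.
    """
    rounded = design.copy()
    for name in int_factors:
        rounded[name] = rounded[name].round().astype(int)
    return rounded
```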

Ideas for scaffolding example

I think in the first round we could try the following setup for the scaffolding optimization.

Organism

E. coli (5 Mbp)
F. tularensis (2 Mbp)

Assembly

ABySS
SPAdes

Scaffolder

SSPACE(-longread)
LINKS

Slurm error: PipelineRunFailed

I'm having trouble getting my pipeline to work correctly with SLURM. The same pipeline works nicely with the local executor in serial mode. I'm using UPPMAX (/proj/nobackup/b2015353/scaffolding/) with those files.

The output indicates that the job failed but it seems that it finished correctly.

Andreas-MBP-6:scaffolding_optimization andreassjodin$ python links_execute_1.py
/Users/andreassjodin/anaconda/lib/python3.5/site-packages/pyDOE-0.3.8-py3.5.egg/pyDOE/doe_factorial.py:78: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
design:    KMER  DVALUE
0  15.0  1000.0
1  25.0  1000.0
2  15.0  4000.0
3  25.0  4000.0
4  15.0  2500.0
5  25.0  2500.0
6  20.0  1000.0
7  20.0  4000.0
8  20.0  2500.0
LINKSScaffolder_exp_7 has failed. (exit code 127:0)
Traceback (most recent call last):
  File "links_execute_1.py", line 19, in 
    results = executor.run_pipeline_collection(pipeline)
  File "/Users/andreassjodin/anaconda/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/base.py", line 199, in run_pipeline_collection
    self.run_jobs(job_steps, experiment_index, env_variables, **kwargs)
  File "/Users/andreassjodin/anaconda/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/mixins.py", line 349, in run_jobs
    self.wait_until_current_jobs_are_finished()
  File "/Users/andreassjodin/anaconda/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/base.py", line 246, in wait_until_current_jobs_are_finished
    raise PipelineRunFailed(msg)
doepipeline.executor.base.PipelineRunFailed: LINKSScaffolder_exp_7 has failed. (exit code 127:0)

I'm not sure what I did wrong, so any advice on how to fix it would be helpful.
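The "127:0" presumably follows SLURM's "<exit status>:<signal>" ExitCode format, and 127 is what a POSIX shell returns when a command cannot be found. That would fit a job script that runs locally but cannot find LINKS (or another executable) on the compute node, for example because a module isn't loaded there; this is a guess, not a diagnosis. A small sketch for interpreting such codes (describe_slurm_exit_code is a hypothetical helper):

```python
def describe_slurm_exit_code(code):
    """Interpret a SLURM-style "<exit status>:<signal>" string."""
    status, signal = (int(part) for part in code.split(':'))
    if signal:
        return 'killed by signal {}'.format(signal)
    # shell conventions for 126 and 127
    meanings = {0: 'success',
                126: 'command found but not executable',
                127: 'command not found'}
    return meanings.get(status, 'exited with status {}'.format(status))
```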

Restarting failed experiments during iteration, and to resume at a failed iteration

I'm currently trying to optimize parameters for Manta, a structural variant caller. A recurring issue is that experiments fail, for example experiment 19 below:

RunManta_exp_19 has failed. (exit code 1:0)
Traceback (most recent call last):
  File "manta_sv.py", line 31, in <module>
    results = executor.run_pipeline_collection(pipeline)
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\doepipeline-0.1-py3.5.egg\doepipeline\executor\base.py", line 199, in run_pipeline_collection
    self.run_jobs(job_steps, experiment_index, env_variables, **kwargs)
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\doepipeline-0.1-py3.5.egg\doepipeline\executor\mixins.py", line 349, in run_jobs
    self.wait_until_current_jobs_are_finished()
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\doepipeline-0.1-py3.5.egg\doepipeline\executor\base.py", line 246, in wait_until_current_jobs_are_finished
    raise PipelineRunFailed(msg)
doepipeline.executor.base.PipelineRunFailed: RunManta_exp_19 has failed. (exit code 1:0)

Checking the Manta log file, I see this:

[2016-09-27T13:24:09.175970] [m196.uppmax.uu.se] [61581_1] [TaskManager] Completed command task: 'generateCandidateSV_0066' launched from master workflow
[2016-09-27T13:24:52.698109] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR] Unhandled Exception in TaskManager-Thread
[2016-09-27T13:24:52.909386] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR] Traceback (most recent call last):
[2016-09-27T13:24:52.910425] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1660, in run
[2016-09-27T13:24:52.911376] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     self._startTasks()
[2016-09-27T13:24:52.912096] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 526, in wrapped
[2016-09-27T13:24:52.912850] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     return f(self, *args, **kw)
[2016-09-27T13:24:52.913684] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1818, in _startTasks
[2016-09-27T13:24:52.914829] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     self._launchTask(task)
[2016-09-27T13:24:52.916007] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1762, in _launchTask
[2016-09-27T13:24:52.917214] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     trun = self._getCommandTaskRunner(task)
[2016-09-27T13:24:52.918028] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1745, in _getCommandTaskRunner
[2016-09-27T13:24:52.918808] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     task.setRunstate)
[2016-09-27T13:24:52.919517] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1137, in __init__
[2016-09-27T13:24:52.920267] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     BaseTaskRunner.__init__(self, runStatus, taskStr, sharedFlowLog, setRunstate)
[2016-09-27T13:24:52.921161] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1041, in __init__
[2016-09-27T13:24:52.921949] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     self.setInitialRunstate()
[2016-09-27T13:24:52.922960] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1079, in setInitialRunstate
[2016-09-27T13:24:52.923728] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     self.setRunstate("running")
[2016-09-27T13:24:52.924557] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1076, in setRunstate
[2016-09-27T13:24:52.925421] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     self._setRunstate(*args, **kw)
[2016-09-27T13:24:52.926591] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 526, in wrapped
[2016-09-27T13:24:52.927825] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     return f(self, *args, **kw)
[2016-09-27T13:24:52.928734] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 2110, in setRunstate
[2016-09-27T13:24:52.929669] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     self.tdag.writeTaskStatus()
[2016-09-27T13:24:52.930562] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 526, in wrapped
[2016-09-27T13:24:52.931612] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     return f(self, *args, **kw)
[2016-09-27T13:24:52.932358] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 2475, in writeTaskStatus
[2016-09-27T13:24:52.933449] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     forceRename(tmpFile, self.taskStateFile)
[2016-09-27T13:24:52.934858] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 170, in forceRename
[2016-09-27T13:24:52.935823] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     os.rename(src,dst)
[2016-09-27T13:24:52.936617] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR] OSError: [Errno 2] No such file or directory
[2016-09-27T13:25:07.711376] [m196.uppmax.uu.se] [61581_1] [WorkflowRunner] [ERROR] Workflow terminated due to unhandled exception in TaskManager

I believe doepipeline should have some kind of error checking that detects when an experiment has failed and restarts it. I doubt this kind of spontaneous failure is restricted to Manta, and it could be a major issue for the usability of doepipeline across a range of optimization problems.
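A minimal sketch of such restart logic, assuming each experiment can be rerun independently through a callable run(name) that raises on failure. ExperimentFailed stands in for PipelineRunFailed and both names are hypothetical; the sketch runs experiments one at a time for clarity, whereas doepipeline submits them to SLURM in parallel:

```python
class ExperimentFailed(Exception):
    """Stand-in for doepipeline's PipelineRunFailed for one experiment."""


def run_with_restarts(experiments, run, max_attempts=3):
    """Run each experiment, restarting it up to max_attempts times.

    Returns (results, failed): results maps experiment name to run()'s
    return value; failed lists experiments that failed every attempt.
    """
    results, failed = {}, []
    for name in experiments:
        for attempt in range(1, max_attempts + 1):
            try:
                results[name] = run(name)
                break
            except ExperimentFailed:
                if attempt == max_attempts:
                    failed.append(name)
    return results, failed
```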

For the other kind of problem, where the site running the experiments (here UPPMAX) becomes unavailable, whether due to connection trouble or planned downtime, I think there should be a feature that lets the user resume the optimization after the last completed iteration.
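Resuming could rest on a checkpoint written after every completed iteration. A sketch with hypothetical helper names; what the state should contain (e.g. the current factor ranges) is an assumption. The write-then-rename keeps the previous checkpoint intact if the process dies mid-write:

```python
import json
import os


def save_checkpoint(path, iteration, state):
    """Record the last completed iteration and the state needed to resume."""
    tmp = path + '.tmp'
    with open(tmp, 'w') as f:
        json.dump({'iteration': iteration, 'state': state}, f)
    os.replace(tmp, path)  # atomic replacement of the old checkpoint


def load_checkpoint(path):
    """Return (last_completed_iteration, state), or (0, None) if none exists."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        data = json.load(f)
    return data['iteration'], data['state']
```

A driver script would call load_checkpoint at start-up, continue the loop from the returned iteration, and call save_checkpoint at the end of each iteration.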

/Daniel

Factor value definition

I have tested the code, and I think we need a replacement for the quick fix f5298d6. Too many programs are picky about being fed the correct type. I think the best solution, as suggested by @RickardSjogren, would be to add an int/float type field to the YAML file.
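A sketch of how such a field could be applied when values are substituted into scripts. The key name type and a YAML entry like KMER: {min: 10, max: 40, type: int} are hypothetical, and factors without the field default to float:

```python
# Casting functions for a hypothetical per-factor "type" field
CASTERS = {
    'int': lambda value: int(round(value)),  # round first; int() alone truncates
    'float': float,
}


def format_factor_values(values, factors, default_type='float'):
    """Cast design values per factor according to its declared type.

    values:  {factor name: value from the design}
    factors: {factor name: factor definition dict from the config}
    """
    cast = {}
    for name, value in values.items():
        ftype = factors[name].get('type', default_type)
        if ftype not in CASTERS:
            raise ValueError('unknown type {!r} for factor {}'.format(ftype, name))
        cast[name] = CASTERS[ftype](value)
    return cast
```

Note that Python 3's round() rounds halves to even (round(20.5) == 20).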
