llnl / merlin
Machine Learning for HPC Workflows
License: MIT License
We are missing Changelog info for 1.0.1 and 1.0.2 @ben-bay
Originally posted by @koning in #33 (comment)
Describe the bug
The feature example, which shows how to stop workers from within a worker, leaves the stop-workers command in the queue, since the worker kills itself without acknowledging that it finished. Other workers can later pick the command back up and kill themselves and others.
To Reproduce
Steps to reproduce the behavior:
--> you'll see there's still a task in the queue
Expected behavior
The queue should be empty after the workflow finishes.
Additional context
I think the fix is to add a delay to the stop workers command in the step and background it, something like
( sleep 30; merlin stop-workers ) &
instead of just merlin stop-workers
This should fork a child background process that will execute after the parent worker finishes the step (and removes it from the server). I'm not sure what the sleep delay should be.
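The fork-and-delay idea above can be sketched in Python (a minimal sketch; the actual step command stays in shell, and the 30-second delay is the guess from the comment above):

```python
import subprocess


def delayed_stop(delay_seconds=30, command="merlin stop-workers"):
    """Fork a detached child that sleeps, then issues the stop command.

    Hypothetical helper: the parent worker finishes (and acks) its current
    task before the child fires, so the stop message is not left
    unacknowledged in the queue for another worker to pick up.
    """
    shell_line = f"sleep {delay_seconds}; {command}"
    # start_new_session detaches the child from the parent worker's process
    # group, so it survives the worker shutting itself down.
    return subprocess.Popen(["/bin/sh", "-c", shell_line], start_new_session=True)
```

The right delay is still an open question; it just needs to exceed the time the worker takes to ack and exit.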
The name merlin is currently taken on PyPI.
This page is our official request to transfer ownership: pypi/support#87
What problem is this feature looking to solve?
Currently, we can only launch tasks that are defined in a YAML file. There are cases where you'd like to just do something asynchronously, like a one-off delayed execution of a script. It'd be nice to expose this ability.
Describe the solution you'd like
Something like
merlin run-script [args] -- script.sh
args could be various Celery arguments like --queue. Perhaps it could also have a blocking/non-blocking argument.
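The `--` separator convention above could be parsed like this (a sketch only; the subcommand, its flags, and the default queue name are all assumptions, not an existing merlin interface):

```python
import argparse


def parse_run_script(argv):
    """Hypothetical parser for a `merlin run-script` subcommand.

    Everything after `--` is treated as the script and its own arguments;
    everything before it belongs to merlin/Celery.
    """
    if "--" in argv:
        split = argv.index("--")
        merlin_args, script_args = argv[:split], argv[split + 1:]
    else:
        merlin_args, script_args = argv, []

    parser = argparse.ArgumentParser(prog="merlin run-script")
    parser.add_argument("--queue", default="merlin")  # assumed default queue
    parser.add_argument("--no-block", action="store_true")
    opts = parser.parse_args(merlin_args)
    return opts, script_args
```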
Describe alternatives you've considered
This could probably be done now with a yaml file with a single step in it. The whole DAG / workspace creation seems like potentially a lot of overhead. Maybe if that were relaxed (see #207 ) it would be cleaner.
Additional context
Not really sure on this one. Mostly a discussion prompt.
Merlin has a way of registering success and failure of tasks through the MERLIN_SOFT_FAIL and MERLIN_HARD_FAIL variables. Each HPC application will terminate (successfully or otherwise) differently - is there a common protocol you want to suggest to application developers for triggering hard/soft fail conditions (e.g. write a file of a particular name, print to terminal, return an integer value)?
Describe the bug
An exception is given when running merlin with spack, or any install that relocates merlin.
[sh]: merlin
.
.
.
FileNotFoundError: [Errno 2] No such file or directory: '<path>/site-packages/merlin/examples/workflows'
To Reproduce
Install merlin with spack, type merlin
Expected behavior
No exceptions.
Describe the bug
pip install from PyPI doesn't grab all the necessary dependencies.
To reproduce
Expected behavior
pip install merlinwf should grab all dependencies
Describe the bug
The string "$_" expands to a path in the venv.
To Reproduce
Steps to reproduce the behavior:
merlin run <spec.yaml>
Expected behavior
The string "$_" should not change.
EDIT: According to this, "$_" is a special shell variable.
change to LLNL: LLNL-CODE-797170
Originally posted by @lucpeterson in #159
A request has been made to run the workers for a given yaml spec on different machines.
The implementation will need to specify the hosts required for a given set of steps in the workers section of the spec. This can be accomplished by adding the machines keyword to the workers spec. All the hosts listed in the machines section will need access to the OUTPUT_PATH; because of this constraint, the OUTPUT_PATH must be a full path that can be checked for existence on every host listed under machines. If the machines keyword is not present, those workers will be started on all hosts where merlin run-workers is executed.
An example is given below:
merlin:
  resources:
    task_server: celery
    workers:
      step1workers:
        args: -O fair --prefetch-multiplier 1 -E -l info --concurrency 36
        batch:
          type: local
        steps: step1
        machines: [host1A, hostB]
      step2workers:
        args: -O fair --prefetch-multiplier 1 -E -l info --concurrency 36
        steps: step2
        batch:
          type: local
        machines: [hostC]
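The host check implied by the spec above can be sketched as follows (a sketch under stated assumptions: the worker-spec dict shape and the "start everywhere when machines is absent" rule are taken from the description above, not from merlin's actual code):

```python
import socket


def should_start_worker(worker_spec):
    """Decide whether run-workers should start this worker on the current host.

    No `machines` entry means the worker starts on every host where
    `merlin run-workers` is executed; otherwise only on listed hosts.
    """
    machines = worker_spec.get("machines")
    if not machines:
        return True
    return socket.gethostname() in machines
```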
Describe the bug
The Maestro warning Cannot set the submission time of 'sim_runs' because it has already been set.
appears when there should be no warning.
To Reproduce
Steps to reproduce the behavior:
exit $(MERLIN_RESTART)
Expected behavior
The warning should not appear.
What problem is this feature looking to solve?
It is time-consuming and tedious to manually increment Merlin's version.
Describe the solution you'd like
A Makefile target for developers that increments the project-wide version, or something similar.
Is your feature request related to a problem? Please describe.
While the two commands merlin run and merlin run-workers are useful as separate units, it is sometimes cumbersome to repeatedly type both commands.
Describe the solution you'd like
It would be convenient to have the ability to combine these into a single command: merlin run ... --workers ...
or merlin [third subparser name here] ...
Describe alternatives you've considered
The alternative to this is to simply keep things as they are for the time being. This isn't project-critical, but would add user-friendliness.
Additional context
supervisord may be helpful in daemonizing Celery.
Also provide a link in the docs.
Is there a way to query the remaining tasks on the task broker? The merlin query-workers command is a bit too coarse to get a sense for the status of a currently running ensemble.
Probably more useful than a list of remaining tasks would be statistics on the total number of tasks in each queue, the number completed successfully, number failed, and number remaining.
What problem is this feature looking to solve?
Currently, multi-machine workflows need all machines to have access to the same workspace on the file system. This can be pretty limiting. For instance, running a multi-machine workflow on two machines with different air-gapped parallel file systems would require the workspace to be in a common location, i.e. not on the parallel file systems. The user would have to do gymnastics to make the tasks run on the parallel file systems while keeping the workspace in a shared spot.
Describe the solution you'd like
The ability to do this. One possible path forward: merlin run can launch work to be done elsewhere.
Issues to consider:
What problem is this feature looking to solve?
When I try to generate samples from the merlin block, I don't get an error telling me what went wrong. It gives an error that is unrelated to the actual bug, which is in the script itself. That error is suppressed when it should be output somewhere.
Describe the solution you'd like
Print the error just like in the study block, and have out and err files.
Describe alternatives you've considered
Manually going through every step and trying to figure out what went wrong. This takes far too long, and it's not always clear where the problem is unless everything is broken down.
Describe the solution you'd like
This includes:
Describe the bug
Merlin does not report when a step isn't assigned to a worker when running in distributed mode.
To Reproduce
Steps to reproduce the behavior:
Don't assign a step to a worker in distributed mode and check the study directory.
Expected behavior
Ideally it would show that one of the steps isn't assigned to a worker. Currently it does not show any errors/warnings and simply doesn't run the step.
Screenshots
resources:
  workers:
    merge_posthoc_workers:
      args: -l INFO --concurrency 36 --prefetch-multiplier 1 -Ofair
      steps: [merge_posthoc]
      batch:
        type: slurm
study:
  - name: setup
    description:
    run:
      cmd:
  - name: merge_posthoc
    description: Combines the outputs of the previous step
    run:
      cmd: |
      depends: [setup]
EDIT: this is only the case when submitting the spec as a slurm job.
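The missing validation could look something like this (a sketch, assuming the spec is already loaded as a dict shaped like the example above; the `all` default-worker convention is an assumption about merlin's spec format):

```python
def unassigned_steps(spec):
    """Return study step names that no worker covers.

    A warning for each returned name would surface the silent failure
    described in this bug.
    """
    covered = set()
    for worker in spec.get("resources", {}).get("workers", {}).values():
        steps = worker.get("steps", [])
        covered.update(steps if isinstance(steps, list) else [steps])
    if "all" in covered:  # assumed convention: a worker may cover every step
        return []
    return [s["name"] for s in spec.get("study", []) if s["name"] not in covered]
```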
I'm just getting started with Merlin and noticed that there are a few steps that need to be coordinated properly and it might be nice if there were utilities in Merlin to support that coordination.
Specifically, when I submit my merlin job to a parallel compute resource (e.g. via a slurm script as described in https://merlin.readthedocs.io/en/latest/merlin_commands.html), I specify the number of compute nodes. Then, within my merlin workflow, I specify the number of "nodes" for my workers in the workers sub-block of the merlin block (e.g. number of workers). Then I also specify the number of nodes and processors for each sample directly using the cmd: keyword in a step defined in the study block.
A utility to generate an sbatch file from a merlin spec would help keep things sorted, it seems like there's enough information contained there to do such a thing.
Also, as a novice to celery, I found the nodes keyword in the workers block to be somewhat misleading - it seems to control the number of workers, but my first instinct was that it was some kind of 'node allocation' assigned to the worker. Particularly if you're working with multi-node tasks, I can see others making the same mistake.
What problem is this feature looking to solve?
Merlin runs sometimes fail and leave celery workers running. I want a convenient way to get rid of the workers from the ensemble that failed.
Describe the solution you'd like
I could use "merlin -f stop-workers", but I only want to get rid of workers associated with the queues from the ensemble that failed. I can supply the --queues argument, but that hard codes the queue name. I want something like "merlin -f stop-workers my_ensemble.yaml".
I organize my merlin runs in series with a sub-directory for each ensemble. There is a yaml file with the same name (but different parameters) in each directory. I can put the above command in a shell script and use it with any ensemble in the series (the yaml file knows how to compute the queue name).
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
What problem is this feature looking to solve?
To provide a solution for times when sample generation could be done with a small python script instead of a bash call to an external script.
Describe the solution you'd like
Add an optional merlin.samples.generate.shell
entry, same as <step>.run.shell
.
One step on the road to running a full maestrowf YAML spec is implementing a handler for the restart command in a step. The issue with this keyword is that it may be necessary to restart the initial task as opposed to always running the restart cmd on task restart. Research must be done to determine whether this is possible in the Celery system.
A possible solution would be to split a parallel run task into multiple tasks including a setup task and then only have the parallel launch commands including initial and restart for a run step.
Describe the bug
run-workers --echo shows the logger message:
[2020-03-23 16:23:18: INFO] Launching workers from <spec.yaml>
Expected behavior
No incorrect logger message.
Describe the solution you'd like
Add step timeout/walltime fields, through Celery.
Describe the solution you'd like
A merlin run flag --batch that interfaces with popular batch managers like Slurm, LSF, and Flux.
Report missing sample file. If I make a typo, the file name in the samples section may not match the file name in my samples generation command. Merlin responds by silently not doing anything. I think it would be better to say "Sample file my_samples_1024.npy not found" or something of that sort.
When exiting with a $(MERLIN_RESTART) flag, merlin overwrites the .out and .err files with the restart step's own .out and .err files, as opposed to creating new output files specifically for the restart step like it does with the .sh files.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The expected outputs should be something like this:
step.err step.out step.sh step.restart.err step.restart.out step.restart.sh
but it's only outputting:
step.err step.out step.sh step.restart.sh
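The naming scheme in the expected output above can be sketched as a small helper (hypothetical; it just mirrors the `.sh` behavior described in the report, where restarts get their own `step.restart.*` files):

```python
def output_names(step_name, restarts):
    """Return the (.out, .err) filenames for a step run.

    First run writes step.out / step.err; any restart writes
    step.restart.out / step.restart.err instead of clobbering the originals.
    """
    suffix = ".restart" if restarts > 0 else ""
    return (f"{step_name}{suffix}.out", f"{step_name}{suffix}.err")
```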
Describe the bug
[2020-01-24 09:41:15: ERROR] Cannot find a config file, run merlin config and " "edit the file, "/Users/bay1/.merlin/app.yaml"
[2020-01-24 09:41:15: ERROR] expected str, bytes or os.PathLike object, not NoneType
To Reproduce
Remove the ~/.merlin/ dir.
merlin run <spec>.yaml
Expected behavior
No weird whitespace formatting, no type error.
The merlin info command uses python and pip instead of python3 and pip3; python and pip will point to Python 2 in a Docker container or a standard Python install.
To Reproduce
run merlin info for a merlin install outside of a virtualenv.
Describe the solution you'd like
The ability to use this: merlin run spec.yaml --vars STATE=Illinois ID=42
And alter this:
description:
  name: $(STATE)_State$(ID)
  description: ...
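The `$(VAR)` substitution requested above can be sketched with a regex replace (a sketch only; merlin's real variable expansion may behave differently around nesting and defaults):

```python
import re


def expand_vars(text, variables):
    """Replace $(NAME) tokens with values from `variables`.

    Unknown names are left untouched, so later expansion passes
    (or merlin's own built-ins) can still resolve them.
    """
    return re.sub(
        r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        text,
    )
```

With `--vars STATE=Illinois ID=42`, the name above would expand to `Illinois_State42`.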
Describe the solution you'd like
Tasks / steps defined in a spec that repeat every n seconds.
These would be able to depend on other steps, but would not be valid as a dependency for any other step.
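Since merlin's task server is Celery, one way to sketch the repetition is Celery beat's schedule format (hypothetical wiring; the task name and arguments here are placeholders, not an existing merlin feature):

```python
# A minimal sketch, assuming Celery beat drives the repetition. The entry
# name, task path, and args are hypothetical placeholders.
beat_schedule = {
    "repeat-step-every-30s": {
        "task": "merlin.common.tasks.merlin_step",  # placeholder task name
        "schedule": 30.0,  # seconds between runs (the "n" in "every n seconds")
        "args": (),
    }
}
```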
Line 105 in 2f726c6
Right now, this defaults to 2 hours (despite the docs saying 1 hour). This is too short: apps that run longer than 2 hours could be run again by a new worker. This should be changed to 24 hours and exposed in the config file.
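The proposed 24-hour default could be wired up like this (a sketch: with Celery's Redis broker the visibility timeout is set through broker transport options, but the exact config key merlin would expose is an assumption):

```python
from datetime import timedelta

# Proposed default: 24 hours, so long-running apps are not redelivered
# to a second worker while the first is still executing them.
VISIBILITY_TIMEOUT = int(timedelta(hours=24).total_seconds())

# Celery reads this from the app config when using the Redis broker.
broker_transport_options = {"visibility_timeout": VISIBILITY_TIMEOUT}
```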
Is your feature request related to a problem? Please describe.
We want to keep the CHANGELOG and version updated as master is changed.
Describe the solution you'd like
A Travis check that the CHANGELOG and version in merlin/__init__.py have been modified for every PR into master.
Describe alternatives you've considered
As an alternative, we could have a test that looks for changes and have people run it by hand or have travis run that test. I do think an automated alert that says "hey, this needs to be changed" would be helpful. Much like a style check.
Describe the bug
Tests are broken because of the zero-pad modification.
To Reproduce
see run tests and travis build failure
Expected behavior
Tests should pass
@ben-bay thinks he has a solution? something about paths being hardcoded in the testing module
In the run-workers output, should we change the message level of this line:
[2020-02-07 13:14:38: INFO] ['celery worker -A merlin -n default_worker.%%h -l INFO -Q merlin']
to debug instead of info?
*
*~~~~~
*~~*~~~* __ __ _ _
/ ~~~~~ | \/ | | (_)
~~~~~ | \ / | ___ _ __| |_ _ __
~~~~~* | |\/| |/ _ \ '__| | | '_ \
*~~~~~~~ | | | | __/ | | | | | | |
~~~~~~~~~~ |_| |_|\___|_| |_|_|_| |_|
*~~~~~~~~~~~
~~~*~~~* Machine Learning for HPC Workflows
[2020-02-07 13:14:38: INFO] Launching workers from 'hello.yaml'
[2020-02-07 13:14:38: WARNING] Workflow specification missing
encouraged 'merlin' section! Run 'merlin example' for examples.
Using default configuration with no sampling.
[2020-02-07 13:14:38: INFO] Starting celery workers
[2020-02-07 13:14:38: INFO] ['celery worker -A merlin -n default_worker.%%h -l INFO -Q merlin']
Describe the bug
Cryptography is in the makefile (installed with easy_install) and in the requirements.txt. The easy_install was a temporary workaround that should be removed. Having it in both places means the pip install ignores cryptography.
Expected behavior
No dependency installs in makefile: everything should be in requirements.
Describe the solution you'd like
Specifically:
nodes
procs
walltime
$(LAUNCHER)
What problem is this feature looking to solve?
This is a bottleneck on large-scale task recreation (500 secs to create 100k) and could be parallelized with additional tasks.
Describe the solution you'd like
Break up the lists etc into groups of tasks to make it parallel.
Describe alternatives you've considered
Use something like asyncio, but we're already using celery so we should just do this.
Additional context
We may have to break up the functions as we dig down.
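The break-up described above amounts to chunking the sample list so each chunk becomes its own task-creation task (a sketch of the chunking step only; how the chunks are fanned out through Celery is left out):

```python
def chunk(items, size):
    """Yield successive groups of `size` items.

    Each group could then be expanded into tasks by a separate Celery
    task, parallelizing the 100k-task creation instead of doing it serially.
    """
    for i in range(0, len(items), size):
        yield items[i:i + size]
```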
Describe the bug
Instead of continuing to support a large README file, move the bulk of it to readthedocs.
In the Spack repo, for example, the README is more manageable and concise.
When launching a workflow across multiple nodes, the example given in the documentation (https://merlin.readthedocs.io/en/latest/merlin_commands.html) suggests ending the sbatch script with a 'sleep inf' to hold the compute allocation and keep the workers running until the time limit of the allocation is reached.
It would be nice if there were a way to stop the workers when the task broker queue is empty and gracefully release the compute nodes back to the cluster when the workflow is complete.
The requirements/* files are not in the source code tarball on PyPI. These are required to use setup.py to install and should be included in the deployment.
What problem is this feature looking to solve?
The sample path directories (0/0/0, etc.) end up out of order.
Describe the solution you'd like
zfill the samples directories, so that you get 00/00/00 etc
Describe alternatives you've considered
do nothing
Additional context
we'll have to make sure we catch everywhere the directories are written
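The zfill idea can be sketched as follows (a sketch; the two-digit width and 100-per-level fan-out are illustrative assumptions, not merlin's actual directory parameters):

```python
def padded_sample_path(index, width=2, levels=3, base=100):
    """Turn a flat sample index into a zero-padded nested path.

    With the defaults, index 7 becomes "00/00/07", so lexicographic
    directory listings match numeric sample order.
    """
    parts = []
    for _ in range(levels):
        parts.append(index % base)
        index //= base
    return "/".join(str(p).zfill(width) for p in reversed(parts))
```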
The merlin status command is not documented, and some commands do not have documented arguments.
Is your feature request related to a problem? Please describe.
Currently, pip-installers have no example workflows to run right out of the box.
Describe the solution you'd like
A CLI command that copies an internal workflow to the user's cwd.
Describe the bug
Looks like this:
ERROR:celery.app.trace:Task merlin.common.tasks.merlin_step[31704a45-84d0-4b5a-8c71-4d9aa55cd9f2] raised unexpected: RuntimeError('Never call result.get() within a task!\nSee http://docs.celeryq.org/en/latest/userguide/tasks.html#task-synchronous-subtasks\n',)
Traceback (most recent call last):
File "/Users/bay1/merlin/venv_merlin/lib/python3.6/site-packages/celery/app/base.py", line 487, in run
return task._orig_run(*args, **kwargs)
File "/Users/bay1/merlin/merlin/common/tasks.py", line 130, in merlin_step
raise RestartException
merlin.exceptions.RestartException
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/bay1/merlin/venv_merlin/lib/python3.6/site-packages/celery/app/trace.py", line 385, in trace_task
R = retval = fun(*args, **kwargs)
File "/Users/bay1/merlin/venv_merlin/lib/python3.6/site-packages/celery/app/base.py", line 500, in run
raise task.retry(exc=exc, **retry_kwargs)
File "/Users/bay1/merlin/venv_merlin/lib/python3.6/site-packages/celery/app/task.py", line 716, in retry
S.apply().get()
File "/Users/bay1/merlin/venv_merlin/lib/python3.6/site-packages/celery/result.py", line 1027, in get
assert_will_not_block()
File "/Users/bay1/merlin/venv_merlin/lib/python3.6/site-packages/celery/result.py", line 43, in assert_will_not_block
raise RuntimeError(E_WOULDBLOCK)
RuntimeError: Never call result.get() within a task!
See http://docs.celeryq.org/en/latest/userguide/tasks.html#task-synchronous-subtasks
To Reproduce
Steps to reproduce the behavior:
study:
  - name: step1
    description: step 1
    run:
      cmd: exit $(MERLIN_RESTART)
      restart: echo "restarted :)"
Run the spec in merlin run --local mode.
What problem is this feature looking to solve?
When designing a sample generator for a workflow, it can be challenging to debug using merlin.
Describe the solution you'd like
When samples are generated, the stdout and stderr should be output to merlin_info as samples.out and samples.err.
Requested by @ymubarka
Merlin breaks when attempting to use Celery version 4.4.3.
We have locked merlin at using Celery v4.4.2 to temporarily get around this, but we need a solution that allows us to use future versions of Celery.
Describe the bug
Even with the proper results backend password in app.yaml, a run of merlin info shows the results backend as not working because encrypt_data_key has not yet been generated.
To Reproduce
Steps to reproduce the behavior:
Remove encrypt_data_key from the merlin home dir.
Run merlin info.
Expected behavior
merlin info should generate its own encrypt_data_key if one does not yet exist.
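The expected behavior could be sketched as a create-on-first-use helper (a sketch under stated assumptions: the file name comes from this report, while the 32-byte random content is a placeholder for whatever key format merlin's crypto layer actually expects):

```python
import os
from pathlib import Path


def ensure_encrypt_data_key(merlin_home):
    """Return the encrypt_data_key path, generating the key if absent.

    This is the desired fix: merlin info would call something like this
    instead of reporting the backend as broken when the key is missing.
    """
    key_path = Path(merlin_home) / "encrypt_data_key"
    if not key_path.exists():
        key_path.parent.mkdir(parents=True, exist_ok=True)
        key_path.write_bytes(os.urandom(32))  # placeholder key material
        key_path.chmod(0o600)  # keep the key private to the user
    return key_path
```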
Describe the solution you'd like
Steps already have the capacity to restart. Add an optional delay time to this.