llnl / merlin
Machine Learning for HPC Workflows
License: MIT License
We are missing Changelog info for 1.0.1 and 1.0.2 @ben-bay
Originally posted by @koning in #33 (comment)
Describe the bug
The feature example, which shows how to stop workers from within a worker, leaves the stop-workers command in the queue, since the worker kills itself without acknowledging that it finished. Other workers can later pick the command back up and kill themselves and others.
To Reproduce
Steps to reproduce the behavior:
--> you'll see there's still a task in the queue
Expected behavior
The queue should be empty after the workflow finishes.
Additional context
I think the fix is to add a delay to the stop workers command in the step and background it, something like
( sleep 30; merlin stop-workers ) &
instead of just merlin stop-workers
This should fork a child background process that will execute after the parent worker finishes the step (and removes it from the server). I'm not sure what the sleep delay should be.
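The fork-and-delay idea above can be sketched in Python (a minimal sketch; the actual step command stays in shell, and the 30-second delay is the guess from the comment above):

```python
import subprocess


def delayed_stop(delay_seconds=30, command="merlin stop-workers"):
    """Fork a detached child that sleeps, then issues the stop command.

    Hypothetical helper: the parent worker finishes (and acks) its current
    task before the child fires, so the stop message is not left
    unacknowledged in the queue for another worker to pick up.
    """
    shell_line = f"sleep {delay_seconds}; {command}"
    # start_new_session detaches the child from the parent worker's process
    # group, so it survives the worker shutting itself down.
    return subprocess.Popen(["/bin/sh", "-c", shell_line], start_new_session=True)
```

The right delay is still an open question; it just needs to exceed the time the worker takes to ack and exit.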
The name merlin is currently taken on PyPI.
This page is our official request to transfer ownership: pypi/support#87
What problem is this feature looking to solve?
Currently, we can only launch tasks that are defined in a YAML file. There are cases where you'd like to just do something asynchronously, like a one-off delayed execution of a script. It'd be nice to expose this ability.
Describe the solution you'd like
Something like
merlin run-script [args] -- script.sh
args could be various Celery arguments like --queue. Perhaps it could also have a blocking/non-blocking argument.
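The `--` separator convention above could be parsed like this (a sketch only; the subcommand, its flags, and the default queue name are all assumptions, not an existing merlin interface):

```python
import argparse


def parse_run_script(argv):
    """Hypothetical parser for a `merlin run-script` subcommand.

    Everything after `--` is treated as the script and its own arguments;
    everything before it belongs to merlin/Celery.
    """
    if "--" in argv:
        split = argv.index("--")
        merlin_args, script_args = argv[:split], argv[split + 1:]
    else:
        merlin_args, script_args = argv, []

    parser = argparse.ArgumentParser(prog="merlin run-script")
    parser.add_argument("--queue", default="merlin")  # assumed default queue
    parser.add_argument("--no-block", action="store_true")
    opts = parser.parse_args(merlin_args)
    return opts, script_args
```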
Describe alternatives you've considered
This could probably be done now with a yaml file with a single step in it. The whole DAG / workspace creation seems like potentially a lot of overhead. Maybe if that were relaxed (see #207 ) it would be cleaner.
Additional context
Not really sure on this one. Mostly a discussion prompt.
Merlin has a way of registering success and failure of tasks through the MERLIN_SOFT_FAIL and MERLIN_HARD_FAIL variables. Each HPC application will terminate (successfully or otherwise) differently - is there a common protocol you want to suggest to application developers for triggering hard/soft fail conditions (e.g. write a file of a particular name, print to terminal, return an integer value)?
Describe the bug
An exception is given when running merlin with spack, or any install that relocates merlin.
[sh]: merlin
.
.
.
FileNotFoundError: [Errno 2] No such file or directory: '<path>/site-packages/merlin/examples/workflows'
To Reproduce
Install merlin with spack, type merlin
Expected behavior
No exceptions.
Describe the bug
pip install from PyPI doesn't grab all the necessary dependencies.
To reproduce
Expected behavior
pip install merlinwf should grab all dependencies
Describe the bug
The string "$_" expands to a path in the venv.
To Reproduce
Steps to reproduce the behavior:
merlin run <spec.yaml>
Expected behavior
The string "$_" should not change.
EDIT: According to this, "$_" is a special shell variable.
change to LLNL: LLNL-CODE-797170
Originally posted by @lucpeterson in #159
A request has been made to run the workers for a given yaml spec on different machines.
The implementation will need to specify the hosts required for a given set of steps in the workers section of the spec. This can be accomplished by adding the machines keyword to the workers spec. All the hosts listed in the machines section will need access to the OUTPUT_PATH; because of this constraint, the OUTPUT_PATH must be a full path that can be checked for existence on every host listed under machines. If the machines keyword is not present, those workers will be started on all hosts where merlin run-workers is executed.
An example is given below:
merlin:
  resources:
    task_server: celery
    workers:
      step1workers:
        args: -O fair --prefetch-multiplier 1 -E -l info --concurrency 36
        batch:
          type: local
        steps: step1
        machines: [host1A, hostB]
      step2workers:
        args: -O fair --prefetch-multiplier 1 -E -l info --concurrency 36
        steps: step2
        batch:
          type: local
        machines: [hostC]
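The host check implied by the spec above can be sketched as follows (a sketch under stated assumptions: the worker-spec dict shape and the "start everywhere when machines is absent" rule are taken from the description above, not from merlin's actual code):

```python
import socket


def should_start_worker(worker_spec):
    """Decide whether run-workers should start this worker on the current host.

    No `machines` entry means the worker starts on every host where
    `merlin run-workers` is executed; otherwise only on listed hosts.
    """
    machines = worker_spec.get("machines")
    if not machines:
        return True
    return socket.gethostname() in machines
```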
Describe the bug
The Maestro warning Cannot set the submission time of 'sim_runs' because it has already been set.
appears when there should be no warning.
To Reproduce
Steps to reproduce the behavior:
exit $(MERLIN_RESTART)
Expected behavior
The warning should not appear.
What problem is this feature looking to solve?
It is time-consuming and tedious to manually increment Merlin's version.
Describe the solution you'd like
A Makefile target for developers that increments the project-wide version, or something similar.
Is your feature request related to a problem? Please describe.
While the two commands merlin run and merlin run-workers are useful as separate units, it is sometimes cumbersome to repeatedly type both commands.
Describe the solution you'd like
It would be convenient to have the ability to combine these into a single command: merlin run ... --workers ...
or merlin [third subparser name here] ...
Describe alternatives you've considered
The alternative to this is to simply keep things as they are for the time being. This isn't project-critical, but would add user-friendliness.
Additional context
supervisord may be helpful in daemonizing Celery.
Also provide a link in the docs.
Is there a way to query the remaining tasks on the task broker? The merlin query-workers command is a bit too coarse to get a sense for the status of a currently running ensemble.
Probably more useful than a list of remaining tasks would be statistics on the total number of tasks in each queue, the number completed successfully, number failed, and number remaining.
What problem is this feature looking to solve?
Currently, multi-machine workflows need all machines to have access to the same workspace on the file system. This can be pretty limiting. For instance, running a multi-machine workflow on two machines with different air-gapped parallel file systems would require the workspace to be in a common location, i.e. not on the parallel file systems. The user would have to do gymnastics to make the tasks run on the parallel file systems while keeping the workspace in a shared spot.
Describe the solution you'd like
The ability to do this. One possible path forward: merlin run can launch work to be done elsewhere.
Issues to consider:
What problem is this feature looking to solve?
When I try to generate samples from the merlin block, I don't get an error telling me what went wrong. It gives an error that is unrelated to the actual bug, which is in the script itself. That error is suppressed when it should be output somewhere.
Describe the solution you'd like
Print the error just like in the study block, and have out and err files.
Describe alternatives you've considered
Manually going through every step and trying to figure out what went wrong. This takes far too long, and it's not always clear where the problem is unless everything is broken down.
Describe the solution you'd like
This includes:
Describe the bug
Merlin does not report when a step isn't assigned to a worker when running in distributed mode.
To Reproduce
Steps to reproduce the behavior:
Don't assign a step to a worker in distributed mode and check the study directory.
Expected behavior
Ideally it would show that one of the steps isn't assigned to a worker. Currently it does not show any errors/warnings and simply doesn't run the step.
Screenshots
resources:
  workers:
    merge_posthoc_workers:
      args: -l INFO --concurrency 36 --prefetch-multiplier 1 -Ofair
      steps: [merge_posthoc]
      batch:
        type: slurm
study:
  - name: setup
    description:
    run:
      cmd:
  - name: merge_posthoc
    description: Combines the outputs of the previous step
    run:
      cmd: |
      depends: [setup]
EDIT: this is only the case when submitting the spec as a slurm job.
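The missing validation could look something like this (a sketch, assuming the spec is already loaded as a dict shaped like the example above; the `all` default-worker convention is an assumption about merlin's spec format):

```python
def unassigned_steps(spec):
    """Return study step names that no worker covers.

    A warning for each returned name would surface the silent failure
    described in this bug.
    """
    covered = set()
    for worker in spec.get("resources", {}).get("workers", {}).values():
        steps = worker.get("steps", [])
        covered.update(steps if isinstance(steps, list) else [steps])
    if "all" in covered:  # assumed convention: a worker may cover every step
        return []
    return [s["name"] for s in spec.get("study", []) if s["name"] not in covered]
```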
I'm just getting started with Merlin and noticed that there are a few steps that need to be coordinated properly and it might be nice if there were utilities in Merlin to support that coordination.
Specifically, when I submit my merlin job to a parallel compute resource (e.g. via a slurm script as described in https://merlin.readthedocs.io/en/latest/merlin_commands.html), I specify the number of compute nodes. Then, within my merlin workflow, I specify the number of "nodes" for my workers in the workers sub-block of the merlin block (e.g. number of workers). Then I also specify the number of nodes and processors for each sample directly using the cmd: keyword in a step defined in the study block.
A utility to generate an sbatch file from a merlin spec would help keep things sorted, it seems like there's enough information contained there to do such a thing.
Also, as a novice to celery, I found the nodes keyword in the workers block to be somewhat misleading - it seems to control the number of workers, but my first instinct was that it was some kind of 'node allocation' assigned to the worker. Particularly if you're working with multi-node tasks, I can see others making the same mistake.
What problem is this feature looking to solve?
Merlin runs sometimes fail and leave celery workers running. I want a convenient way to get rid of the workers from the ensemble that failed.
Describe the solution you'd like
I could use "merlin -f stop-workers", but I only want to get rid of workers associated with the queues from the ensemble that failed. I can supply the --queues argument, but that hard codes the queue name. I want something like "merlin -f stop-workers my_ensemble.yaml".
I organize my merlin runs in series with a sub-directory for each ensemble. There is a yaml file with the same name (but different parameters) in each directory. I can put the above command in a shell script and use it with any ensemble in the series (the yaml file knows how to compute the queue name).
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
What problem is this feature looking to solve?
To provide a solution for times when sample generation could be done with a small python script instead of a bash call to an external script.
Describe the solution you'd like
Add an optional merlin.samples.generate.shell
entry, same as <step>.run.shell
.
One step on the road to running a full maestrowf YAML spec is implementing a handler for the restart command in a step. The issue with this keyword is that it may be necessary to restart the initial task as opposed to always running the restart cmd on task restart. Research must be done to determine whether this is possible in the Celery system.
A possible solution would be to split a parallel run task into multiple tasks including a setup task and then only have the parallel launch commands including initial and restart for a run step.
Describe the bug
run-workers --echo shows the logger message:
[2020-03-23 16:23:18: INFO] Launching workers from <spec.yaml>
Expected behavior
No incorrect logger message.
Describe the solution you'd like
Add step timeout/walltime fields, through Celery.
Describe the solution you'd like
A merlin run flag --batch that interfaces with popular batch managers like Slurm, LSF, and Flux.
Report missing sample file. If I make a typo, the file name in the samples section may not match the file name in my samples generation command. Merlin responds by silently not doing anything. I think it would be better to say "Sample file my_samples_1024.npy not found" or something of that sort.
When exiting with a $(MERLIN_RESTART) flag, merlin overwrites the .out and .err files with the restart step's own .out and .err files, as opposed to creating new output files specifically for the restart step like it does with the .sh files.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The expected outputs should be something like this:
step.err step.out step.sh step.restart.err step.restart.out step.restart.sh
but it's only outputting:
step.err step.out step.sh step.restart.sh
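The naming scheme in the expected output above can be sketched as a small helper (hypothetical; it just mirrors the `.sh` behavior described in the report, where restarts get their own `step.restart.*` files):

```python
def output_names(step_name, restarts):
    """Return the (.out, .err) filenames for a step run.

    First run writes step.out / step.err; any restart writes
    step.restart.out / step.restart.err instead of clobbering the originals.
    """
    suffix = ".restart" if restarts > 0 else ""
    return (f"{step_name}{suffix}.out", f"{step_name}{suffix}.err")
```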
Describe the bug
[2020-01-24 09:41:15: ERROR] Cannot find a config file, run merlin config and " "edit the file, "/Users/bay1/.merlin/app.yaml"
[2020-01-24 09:41:15: ERROR] expected str, bytes or os.PathLike object, not NoneType
To Reproduce
Remove the ~/.merlin/ dir.
merlin run <spec>.yaml
Expected behavior
No weird whitespace formatting, no type error.
The merlin info command uses python and pip instead of python3 and pip3; python and pip will point to Python 2 in a Docker container or a standard Python install.
To Reproduce
run merlin info for a merlin install outside of a virtualenv.
Describe the solution you'd like
The ability to use this: merlin run spec.yaml --vars STATE=Illinois ID=42
And alter this:
description:
  name: $(STATE)_State$(ID)
  description: ...
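The `$(VAR)` substitution requested above can be sketched with a regex replace (a sketch only; merlin's real variable expansion may behave differently around nesting and defaults):

```python
import re


def expand_vars(text, variables):
    """Replace $(NAME) tokens with values from `variables`.

    Unknown names are left untouched, so later expansion passes
    (or merlin's own built-ins) can still resolve them.
    """
    return re.sub(
        r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        text,
    )
```

With `--vars STATE=Illinois ID=42`, the name above would expand to `Illinois_State42`.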
Describe the solution you'd like
Tasks / steps defined in a spec that repeat every n seconds.
These would be able to depend on other steps, but would not be valid as a dependency for any other step.
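Since merlin's task server is Celery, one way to sketch the repetition is Celery beat's schedule format (hypothetical wiring; the task name and arguments here are placeholders, not an existing merlin feature):

```python
# A minimal sketch, assuming Celery beat drives the repetition. The entry
# name, task path, and args are hypothetical placeholders.
beat_schedule = {
    "repeat-step-every-30s": {
        "task": "merlin.common.tasks.merlin_step",  # placeholder task name
        "schedule": 30.0,  # seconds between runs (the "n" in "every n seconds")
        "args": (),
    }
}
```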
Line 105 in 2f726c6
Right now, this defaults to 2 hours (despite the docs saying 1 hour). This is too short: apps that run longer than 2 hours could be run again by a new worker. This should be changed to 24 hours and exposed in the config file.
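The proposed 24-hour default could be wired up like this (a sketch: with Celery's Redis broker the visibility timeout is set through broker transport options, but the exact config key merlin would expose is an assumption):

```python
from datetime import timedelta

# Proposed default: 24 hours, so long-running apps are not redelivered
# to a second worker while the first is still executing them.
VISIBILITY_TIMEOUT = int(timedelta(hours=24).total_seconds())

# Celery reads this from the app config when using the Redis broker.
broker_transport_options = {"visibility_timeout": VISIBILITY_TIMEOUT}
```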
Is your feature request related to a problem? Please describe.
We want to keep the CHANGELOG and version updated as master is changed.
Describe the solution you'd like
A Travis check that the CHANGELOG and version in merlin/__init__.py have been modified for every PR into master.
Describe alternatives you've considered
As an alternative, we could have a test that looks for changes and have people run it by hand or have travis run that test. I do think an automated alert that says "hey, this needs to be changed" would be helpful. Much like a style check.
Describe the bug
Tests are broken because of the zero-pad modification.
To Reproduce
see run tests and travis build failure
Expected behavior
Tests should pass
@ben-bay thinks he has a solution? something about paths being hardcoded in the testing module
In the run-workers output, should we change the message level of this line:
[2020-02-07 13:14:38: INFO] ['celery worker -A merlin -n default_worker.%%h -l INFO -Q merlin']
to debug instead of info?
*
*~~~~~
*~~*~~~* __ __ _ _
/ ~~~~~ | \/ | | (_)
~~~~~ | \ / | ___ _ __| |_ _ __
~~~~~* | |\/| |/ _ \ '__| | | '_ \
*~~~~~~~ | | | | __/ | | | | | | |
~~~~~~~~~~ |_| |_|\___|_| |_|_|_| |_|
*~~~~~~~~~~~
~~~*~~~* Machine Learning for HPC Workflows
[2020-02-07 13:14:38: INFO] Launching workers from 'hello.yaml'
[2020-02-07 13:14:38: WARNING] Workflow specification missing
encouraged 'merlin' section! Run 'merlin example' for examples.
Using default configuration with no sampling.
[2020-02-07 13:14:38: INFO] Starting celery workers
[2020-02-07 13:14:38: INFO] ['celery worker -A merlin -n default_worker.%%h -l INFO -Q merlin']
Describe the bug
Cryptography is in the makefile (installed with easy_install) and in the requirements.txt. The easy_install was a temporary workaround that should be removed. Having it in both places means the pip install ignores cryptography.
Expected behavior
No dependency installs in makefile: everything should be in requirements.
Describe the solution you'd like
Specifically:
nodes
procs
walltime
$(LAUNCHER)
What problem is this feature looking to solve?
This is a bottleneck on large-scale task recreation (500 secs to create 100k) and could be parallelized with additional tasks.
Describe the solution you'd like
Break up the lists etc into groups of tasks to make it parallel.
Describe alternatives you've considered
Use something like asyncio, but we're already using celery so we should just do this.
Additional context
We may have to break up the functions as we dig down.
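The break-up described above amounts to chunking the sample list so each chunk becomes its own task-creation task (a sketch of the chunking step only; how the chunks are fanned out through Celery is left out):

```python
def chunk(items, size):
    """Yield successive groups of `size` items.

    Each group could then be expanded into tasks by a separate Celery
    task, parallelizing the 100k-task creation instead of doing it serially.
    """
    for i in range(0, len(items), size):
        yield items[i:i + size]
```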
Describe the bug
Instead of continuing to support a large README file, move the bulk of it to readthedocs.
In the Spack repo, for example, the README is more manageable and concise.
When launching a workflow across multiple nodes, the example given in the documentation (https://merlin.readthedocs.io/en/latest/merlin_commands.html) suggests ending the sbatch script with a 'sleep inf' to hold the compute allocation and keep the workers running until the time limit of the allocation is reached.
It would be nice if there were a way to stop the workers when the task broker queue is empty and gracefully release the compute nodes back to the cluster when the workflow is complete.
The requirements/* files are not in the source code tarball on PyPI. These are required to use setup.py to install and should be included in the deployment.
What problem is this feature looking to solve?
The sample path directories (0/0/0, etc.) end up out of order.
Describe the solution you'd like
zfill the samples directories, so that you get 00/00/00 etc
Describe alternatives you've considered
do nothing
Additional context
we'll have to make sure we catch everywhere the directories are written
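The zfill idea can be sketched as follows (a sketch; the two-digit width and 100-per-level fan-out are illustrative assumptions, not merlin's actual directory parameters):

```python
def padded_sample_path(index, width=2, levels=3, base=100):
    """Turn a flat sample index into a zero-padded nested path.

    With the defaults, index 7 becomes "00/00/07", so lexicographic
    directory listings match numeric sample order.
    """
    parts = []
    for _ in range(levels):
        parts.append(index % base)
        index //= base
    return "/".join(str(p).zfill(width) for p in reversed(parts))
```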
The merlin status command is not documented, and some commands do not have documented arguments.
Is your feature request related to a problem? Please describe.
Currently, pip-installers have no example workflows to run right out of the box.
Describe the solution you'd like
A CLI command that copies an internal workflow to the user's cwd.
Describe the bug
Looks like this:
ERROR:celery.app.trace:Task merlin.common.tasks.merlin_step[31704a45-84d0-4b5a-8c71-4d9aa55cd9f2] raised unexpected: RuntimeError('Never call result.get() within a task!\nSee http://docs.celeryq.org/en/latest/userguide/tasks.html#task-synchronous-subtasks\n',)
Traceback (most recent call last):
File "/Users/bay1/merlin/venv_merlin/lib/python3.6/site-packages/celery/app/base.py", line 487, in run
return task._orig_run(*args, **kwargs)
File "/Users/bay1/merlin/merlin/common/tasks.py", line 130, in merlin_step
raise RestartException
merlin.exceptions.RestartException
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/bay1/merlin/venv_merlin/lib/python3.6/site-packages/celery/app/trace.py", line 385, in trace_task
R = retval = fun(*args, **kwargs)
File "/Users/bay1/merlin/venv_merlin/lib/python3.6/site-packages/celery/app/base.py", line 500, in run
raise task.retry(exc=exc, **retry_kwargs)
File "/Users/bay1/merlin/venv_merlin/lib/python3.6/site-packages/celery/app/task.py", line 716, in retry
S.apply().get()
File "/Users/bay1/merlin/venv_merlin/lib/python3.6/site-packages/celery/result.py", line 1027, in get
assert_will_not_block()
File "/Users/bay1/merlin/venv_merlin/lib/python3.6/site-packages/celery/result.py", line 43, in assert_will_not_block
raise RuntimeError(E_WOULDBLOCK)
RuntimeError: Never call result.get() within a task!
See http://docs.celeryq.org/en/latest/userguide/tasks.html#task-synchronous-subtasks
To Reproduce
Steps to reproduce the behavior:
study:
  - name: step1
    description: step 1
    run:
      cmd: exit $(MERLIN_RESTART)
      restart: echo "restarted :)"
Run the spec in merlin run --local mode.
What problem is this feature looking to solve?
When designing a sample generator for a workflow, it can be challenging to debug using merlin.
Describe the solution you'd like
When samples are generated, the stdout and stderr should be output to merlin_info as samples.out and samples.err.
Requested by @ymubarka
Merlin breaks when attempting to use Celery version 4.4.3.
We have locked merlin at using Celery v4.4.2 to temporarily get around this, but we need a solution that allows us to use future versions of Celery.
Describe the bug
Even with the proper results backend password in app.yaml, a run of merlin info shows the results backend as not working because encrypt_data_key has not yet been generated.
To Reproduce
Steps to reproduce the behavior:
Remove encrypt_data_key from the merlin home dir.
Run merlin info.
Expected behavior
merlin info should generate its own encrypt_data_key if one does not yet exist.
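The expected behavior could be sketched as a create-on-first-use helper (a sketch under stated assumptions: the file name comes from this report, while the 32-byte random content is a placeholder for whatever key format merlin's crypto layer actually expects):

```python
import os
from pathlib import Path


def ensure_encrypt_data_key(merlin_home):
    """Return the encrypt_data_key path, generating the key if absent.

    This is the desired fix: merlin info would call something like this
    instead of reporting the backend as broken when the key is missing.
    """
    key_path = Path(merlin_home) / "encrypt_data_key"
    if not key_path.exists():
        key_path.parent.mkdir(parents=True, exist_ok=True)
        key_path.write_bytes(os.urandom(32))  # placeholder key material
        key_path.chmod(0o600)  # keep the key private to the user
    return key_path
```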
Describe the solution you'd like
Steps already have the capacity to restart. Add an optional delay time to this.