Comments (8)
A multi-instance run will always crash if a single member fails.
You can use create_clone --ensemble to create the cases. Then I think you need to get an interactive login (see https://docs.nersc.gov/jobs/interactive/#cori-haswell), and once you are logged into the node you can use case.submit with the --no-batch option.
from cime.
Thanks for the feedback @jedwards4b. The problem for me is that I need to run 3072 ensemble members, so the login option is not feasible. If I understood correctly, there is no way with the current settings for cori to call case.submit --no-batch without it automatically invoking srun, unless I run the job in interactive mode.
Basically what I am trying to do is to have multiple cases running on the same node (like the multi instance) so my project is not overcharged, but with them being independent runs (like submitting multiple case.submit) so the good ensemble members are not affected by the problematic ones.
I may try to clone the xml configurations from cori to ${HOME}/.cime and delete the built-in slurm settings, so I can set up slurm manually, but I wanted to check whether there was some simpler option I missed (still learning how to run E3SM).
from cime.
I think you will need some custom scripts for what you want to do. Instead of trying to run ./case.submit --no-batch I would modify template.case.run (it's in the machines directory) to call case.case_run for all of your runs instead of just a single one. So you would have a list of caseroots and loop over the list starting each in a new thread something like:
import threading

for i in range(3072):  # one thread per ensemble member
    t = threading.Thread(target=case.case_run)
    t.start()
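A minimal sketch of the same idea, with one swap: each case is run as a separate process instead of calling case.case_run in-process, so a failing member cannot take the others down. The names launch, run_all, runner and caseroots are hypothetical and not part of CIME. The default command is the case.submit --no-batch call from this thread, and it would need to change on a machine whose configuration wraps the run in srun.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def launch(caseroot, command=("./case.submit", "--no-batch")):
    """Run one case in its own process and return its exit code."""
    result = subprocess.run(list(command), cwd=caseroot)
    return result.returncode


def run_all(caseroots, runner=launch, max_workers=4):
    """Run every case independently and map caseroot -> exit code.

    An exception raised for one case is recorded as exit code -1,
    so the remaining cases still run.
    """
    def safe(caseroot):
        try:
            return runner(caseroot)
        except Exception:
            return -1

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(safe, caseroots))
    return dict(zip(caseroots, codes))
```

Any non-zero value in the returned dictionary marks a member that needs to be rerun or inspected.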
from cime.
With CLM at NCAR, we use casper for single point simulations as an easier way to get just 1 processor for each simulation. So we haven't explicitly been forced to solve this sort of thing.
from cime.
@djk2120 with the PPE work, you do send off a long list of simulations all at once. At least that's my understanding. You have one case that you clone the others from, and then have modifications for each of the cloned cases. How do you set it up to run through your list of simulations at one go? I know you have some scripts to manage all of this together.
from cime.
Just a shell script to submit them independently. Loop through a bunch of paramfiles, cloning the basecase and pointing to the appropriate paramfile in user_nl_clm. I was at one point using multi-instance, but likewise abandoned it because it was too cumbersome when a rogue ensemble member failed.
If Marcos can find a way to request just 1 cpu at a time on the DOE system, as we can with casper, that would probably be the easiest solution.
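The per-member step described above (point each cloned case at its own paramfile) could be sketched roughly as follows. This is written in Python rather than shell, and the helper names case_name and point_to_paramfile, the "ens" prefix and the example path are assumptions for illustration. It assumes the parameter file is selected through a paramfile entry in user_nl_clm.

```python
from pathlib import Path


def case_name(paramfile, prefix="ens"):
    """Derive a case name from the parameter file name (hypothetical scheme)."""
    return f"{prefix}_{Path(paramfile).stem}"


def point_to_paramfile(caseroot, paramfile):
    """Append a paramfile setting to the user_nl_clm of an existing case."""
    namelist = Path(caseroot) / "user_nl_clm"
    with namelist.open("a") as handle:
        handle.write(f"paramfile = '{paramfile}'\n")
    return namelist
```

Each cloned case would then be submitted on its own, so a failing member affects only itself.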
from cime.
Thanks @jedwards4b @ekluzek and @djk2120 for the input! After a few additional attempts, I managed to submit multiple independent simulations sharing a single node on cori. In the end the solution was simple: I just had to switch the MPI library from the default to mpi-serial before building each case.
./create_newcase <new case settings>
cd <path to new case>
./xmlchange MPILIB="mpi-serial"
./case.setup
./case.build
And then I bundled the multiple independent cases using NERSC's TaskFarmer workflow, by requesting 1 node for TaskFarmer and 1 node for every 64 ensemble members on cori-knl (or 1 node for every 32 ensemble members if using cori-haswell).
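As a quick check of that sizing rule, here is a small sketch that computes the node request, assuming all members are meant to run at the same time. The function name taskfarmer_nodes is made up for illustration.

```python
import math


def taskfarmer_nodes(n_members, members_per_node):
    """One node for TaskFarmer plus enough nodes to hold every member."""
    return 1 + math.ceil(n_members / members_per_node)


print(taskfarmer_nodes(3072, 64))  # cori-knl: prints 49
print(taskfarmer_nodes(3072, 32))  # cori-haswell: prints 97
```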
from cime.
There is an argument to create_newcase, --mpilib mpi-serial, so you don't need the xmlchange step, and you can use create_clone with the --keepexe argument so that you only need to compile once. Glad you figured it out.
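A rough sketch of how those two suggestions combine into a build-once workflow: it only assembles the command lines and does not run them. The function name build_once_commands and the extra_args parameter (standing in for the remaining create_newcase settings) are assumptions. The base case still has to go through case.setup and case.build before the clones are created.

```python
def build_once_commands(basecase, members, extra_args=()):
    """Return argv lists: one create_newcase, then one create_clone per member."""
    base = ["./create_newcase", "--case", basecase, "--mpilib", "mpi-serial"]
    base += list(extra_args)  # remaining case settings, left to the user
    clones = [
        # --keepexe reuses the executable built for the base case
        ["./create_clone", "--clone", basecase, "--case", member, "--keepexe"]
        for member in members
    ]
    return [base] + clones
```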
from cime.