
Comments (8)

jedwards4b commented on August 22, 2024

A multi-instance run will always crash if a single member fails.

You can use create_clone --ensemble to create the cases. Then I think you need to get an interactive login (see https://docs.nersc.gov/jobs/interactive/#cori-haswell); once you are logged into the node, you can use case.submit with the --no-batch option.

from cime.

mpaiao commented on August 22, 2024

Thanks for the feedback @jedwards4b. The problem for me is that I need to run 3072 ensemble members, so the login option is not feasible. If I understood correctly, there is no way with the current settings for cori to call case.submit --no-batch without it automatically invoking srun, unless I run the job in interactive mode.

Basically what I am trying to do is to have multiple cases running on the same node (like the multi instance) so my project is not overcharged, but with them being independent runs (like submitting multiple case.submit) so the good ensemble members are not affected by the problematic ones.

I may try to clone the xml configurations from cori to ${HOME}/.cime and delete the built-in slurm settings, so I can set up slurm manually, but I wanted to check whether there was some simpler option I missed (still learning how to run E3SM).


jedwards4b commented on August 22, 2024

I think you will need some custom scripts for what you want to do. Instead of trying to run ./case.submit --no-batch, I would modify template.case.run (it's in the machines directory) to call case.case_run for all of your runs instead of just a single one. So you would have a list of caseroots and loop over the list, starting each in a new thread, something like:
for i in range(3072):
    t = threading.Thread(target=case.case_run)
    t.start()
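
To make the thread-per-case idea a little more concrete, here is a minimal sketch of looping over a list of caseroots and starting each run in its own thread while recording failures per case. Everything here is an assumption for illustration: run_all, demo_runner and the ens_NNNN names are hypothetical, and demo_runner is only a stand-in for whatever per-case run call would actually be used.

```python
import threading

def run_all(caseroots, runner):
    """Start one thread per caseroot and record per-case success or failure."""
    results = {}

    def worker(caseroot):
        try:
            runner(caseroot)
            results[caseroot] = "ok"
        except Exception as exc:
            # A failing member is recorded but does not stop the others.
            results[caseroot] = f"failed: {exc}"

    threads = [threading.Thread(target=worker, args=(c,)) for c in caseroots]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every member before returning
    return results

def demo_runner(caseroot):
    # Hypothetical stand-in for the real per-case run call.
    if caseroot.endswith("0003"):
        raise RuntimeError("simulated crash")

demo_cases = [f"ens_{i:04d}" for i in range(1, 9)]  # 8 hypothetical caseroots
demo_results = run_all(demo_cases, demo_runner)
```

Catching the exception inside each worker is what keeps one bad member from taking the rest down, which is the behavior the multi-instance setup lacks.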


djk2120 commented on August 22, 2024

With CLM at NCAR, we use casper for single point simulations as an easier way to get just 1 processor for each simulation. So we haven't explicitly been forced to solve this sort of thing.


ekluzek commented on August 22, 2024

@djk2120 with the PPE work, you do send off a long list of simulations all at once. At least that's my understanding. You have one case that you clone the others from, and then have modifications for each of the cloned cases. How do you set it up to run through your list of simulations at one go? I know you have some scripts to manage all of this together.


djk2120 commented on August 22, 2024

Just a shell script to submit them independently. It loops through a bunch of paramfiles, cloning the basecase and pointing to the appropriate paramfile in user_nl_clm. I was at one point using multi-instance, but likewise abandoned it because it was too cumbersome when a rogue ensemble member failed.

If Marcos can find a way to request just 1 cpu at a time on the DOE system, as we can with casper, that would probably be the easiest solution.
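
For readers who want to see the "point each clone at its own paramfile" step spelled out, here is a rough Python sketch (the script described above is a shell script; this is not it). It assumes each cloned case directory already exists and simply appends a paramfile entry to that case's user_nl_clm. The function name, the directory layout, and the parameter file names are all hypothetical.

```python
import tempfile
from pathlib import Path

def point_case_at_paramfile(caseroot, paramfile):
    """Append a paramfile setting to the user_nl_clm file in caseroot."""
    namelist = Path(caseroot) / "user_nl_clm"
    with namelist.open("a") as f:
        f.write(f"paramfile = '{paramfile}'\n")
    return namelist

# Demo in a temporary directory; the mkdir call stands in for cloning the basecase.
workdir = Path(tempfile.mkdtemp())
paramfiles = ["params_0001.nc", "params_0002.nc"]  # hypothetical file names
written = []
for i, pf in enumerate(paramfiles, start=1):
    caseroot = workdir / f"ens_{i:04d}"
    caseroot.mkdir()
    written.append(point_case_at_paramfile(caseroot, pf))
```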


mpaiao commented on August 22, 2024

Thanks @jedwards4b, @ekluzek and @djk2120 for the input! After a few additional attempts, I managed to submit multiple independent simulations sharing a single node on cori. In the end the solution was simple: I just had to switch the MPI library from the default to mpi-serial before building each case.

./create_newcase <new case settings>
cd <path to new case>

./xmlchange MPILIB="mpi-serial"
./case.setup
./case.build

And then I bundled the multiple independent cases using NERSC's TaskFarmer workflow, by requesting 1 node for TaskFarmer and 1 node for every 64 ensemble members on cori-knl (or 1 node for every 32 ensemble members if using cori-haswell).
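
As a quick check of the node arithmetic above, a small sketch (the function name is just for illustration) that adds the one TaskFarmer node to the number of compute nodes needed for a given ensemble size, using the 3072-member ensemble mentioned earlier in the thread:

```python
import math

def nodes_needed(n_members, members_per_node, taskfarmer_nodes=1):
    """Total nodes: the TaskFarmer node(s) plus enough nodes to hold every member."""
    return taskfarmer_nodes + math.ceil(n_members / members_per_node)

knl_nodes = nodes_needed(3072, 64)      # 1 + 48 = 49 on cori-knl
haswell_nodes = nodes_needed(3072, 32)  # 1 + 96 = 97 on cori-haswell
```

The ceiling matters when the ensemble size is not a multiple of the per-node count, since a partly filled node still has to be requested.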


jedwards4b commented on August 22, 2024

create_newcase accepts --mpilib mpi-serial, so you don't need the xmlchange step, and you can use create_clone with the --keepexe argument so that you only need to compile once. Glad you figured it out.
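
Putting those two suggestions together, the command lines might be assembled along the following lines. This sketch only builds the argument lists and does not run anything; the case names, the ens prefix, and the "<compset>" and "<res>" placeholders are assumptions to be replaced with real settings, and the helper functions are hypothetical.

```python
def newcase_command(case, compset, res, mpilib="mpi-serial"):
    """Argument list for creating the base case with the serial MPI library."""
    return ["./create_newcase", "--case", case, "--compset", compset,
            "--res", res, "--mpilib", mpilib]

def clone_commands(basecase, n_members, prefix="ens"):
    """One create_clone call per member, reusing the base case's executable."""
    return [
        ["./create_clone", "--clone", basecase,
         "--case", f"{prefix}_{i:04d}", "--keepexe"]
        for i in range(1, n_members + 1)
    ]

base_cmd = newcase_command("basecase", "<compset>", "<res>")
clone_cmds = clone_commands("basecase", 3072)
```

Each list could then be handed to subprocess.run from the directory containing the CIME scripts, once the base case has been set up and built.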
