Comments (4)

suchyta1 commented on September 9, 2024

Can you try setting processes-per-node to 4? I don't know what happens if it exceeds processes.

from effis.

cwsmith commented on September 9, 2024

Thank you @suchyta1. I reduced processes-per-node to 4 for the coupler and got the same error at run time.

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaGetDeviceCount(&m_cudaDevCount) error( cudaErrorNoDevice): no CUDA-capable device is detected /gpfs/alpine/fus123/scratch/cwsmith/spack-stage/spack-stage-kokkos-3.1-if6ttd7st5pdd3hese5w5d3stjdtgemz/spack-src/core/src/Cuda/Kokkos_Cuda_Instance.cpp:223
Traceback functionality not available

AFAIK, LSF needs to be told that the job step (I'm not sure what the official LSF term for a jsrun call is) requires four resource sets, each with one CPU process and one GPU. I recall that EFFIS was using ERF files to supply this info to LSF. For example, without using EFFIS, my jsrun command for the coupler is:

jsrun <environment stuff> \
  -n 4 --tasks_per_rs 1 --cpu_per_rs 1 --gpu_per_rs 1 --bind rs \
  /path/to/coupler 1
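To make the resource-set layout in that command concrete, here is a minimal Python sketch that assembles the same argument list. The function name `jsrun_command` and its parameters are hypothetical and are not part of EFFIS or LSF; only the jsrun flags themselves come from the command above, and the `<environment stuff>` portion is left out.

```python
import shlex


def jsrun_command(executable, args, resource_sets,
                  tasks_per_rs=1, cpu_per_rs=1, gpu_per_rs=1):
    """Build a jsrun argument list (hypothetical helper, not EFFIS code).

    Each resource set gets `tasks_per_rs` tasks, `cpu_per_rs` CPUs and
    `gpu_per_rs` GPUs, mirroring the flags used in the command above.
    """
    cmd = [
        "jsrun",
        "-n", str(resource_sets),
        "--tasks_per_rs", str(tasks_per_rs),
        "--cpu_per_rs", str(cpu_per_rs),
        "--gpu_per_rs", str(gpu_per_rs),
        "--bind", "rs",
        executable,
    ]
    # Append the program's own command-line arguments as strings.
    cmd.extend(str(a) for a in args)
    return cmd


# Four resource sets, each with one task, one CPU and one GPU.
command = jsrun_command("/path/to/coupler", [1], resource_sets=4)
print(shlex.join(command))
```

With `gpu_per_rs=0` the resource sets would contain no GPU, which is consistent with the cudaErrorNoDevice failure reported above.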

suchyta1 commented on September 9, 2024

Can you try using True or On instead of 1 for use-gpus? I'm not sure whether 1 will resolve in YAML as a boolean rather than an integer. The way it's implemented now, a GPU will be assigned to each MPI rank. (use-gpus only has a direct effect on Summit, because Summit explicitly has the --gpu_per_rs setting. On Rhea, if you use the gpu partition, use-gpus doesn't have any effect, as there's no GPU setting flag with srun.)

Are the environment things important? You might need to set those, though as far as I'm aware you don't need to load CUDA.

  coupler:
    pre-submit-commands: ["mkdir out"]
    processes: 4
    processes-per-node: 4
    cpus-per-process: 1
    use-gpus: True
    executable_path: /gpfs/alpine/fus123/scratch/cwsmith/spack-install/linux-rhel7-power9le/gcc-8.1.1/coupler-develop-6eczhtc7ufb6o62onfeabrropnj4ahv6/bin/cpl
    commandline_args:
      - ${steps}
    env:
      OMP_NUM_THREADS: 1
      HDF5_USE_FILE_LOCKING: 'FALSE'
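The concern about `1` versus `True` can be illustrated with a short Python sketch. A YAML 1.1 loader such as PyYAML loads `use-gpus: 1` as the integer 1 and `use-gpus: True` or `use-gpus: On` as the boolean True. The helper `gpu_per_rs` below is a hypothetical strict check written only for illustration; how EFFIS actually tests the value is an assumption here and is not confirmed by this thread.

```python
def gpu_per_rs(use_gpus):
    """Hypothetical strict check: only a real boolean True requests a GPU."""
    return 1 if use_gpus is True else 0


as_integer = 1     # what a YAML 1.1 loader yields for `use-gpus: 1`
as_boolean = True  # what it yields for `use-gpus: True` or `use-gpus: On`

# The integer compares equal to True but is not a bool instance,
# so a type-sensitive or identity-based check treats it differently.
print(isinstance(as_integer, bool))  # False
print(as_integer == as_boolean)      # True
print(gpu_per_rs(as_integer), gpu_per_rs(as_boolean))  # 0 1
```

Under such a check, writing `True` in the config is the safer choice, since it does not depend on how an integer is interpreted.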

cwsmith commented on September 9, 2024

Adding use-gpus: True got the coupler job step past the cudaGetDeviceCount error. Thank you.

The change is here:
https://github.com/SCOREC/testcases/commit/d9fd662daf312452b0157f662d86d6b3501eecd3

The <environment stuff> passed to jsrun is an LD_PRELOAD setting/hack to avoid a Spack issue on Summit.
