Comments (7)

tazend commented on May 28, 2024

Hi,

Oh yeah, the slurm_init call in pyslurm/__init__.py only exists on the most recent commit of the main branch (or the 23.2.x branch), which I recommend using, as it already includes a bit of API rework. There will be a new release soon, though.

A potential segfault when making an API call after removing slurm_init is indeed expected. I just wanted to make sure that the slurm_init call is actually the point where the lookup error is raised.

I will also try to reproduce it on my test cluster and run some tests.

from pyslurm.

KrisDavie commented on May 28, 2024

Just jumping in to say that the branch you linked worked great, thanks a lot for the quick fix!


tazend commented on May 28, 2024

Hi,

Hm, interesting...
If you run nm -D /usr/lib64/slurm/cli_filter_lua.so | grep data_init, it's really not showing up, right?

Could you check whether the error is gone when you manually remove the slurm_init call from pyslurm/__init__.py and reinstall?


KrisDavie commented on May 28, 2024

Thanks for the help.

Running nm does find it:

➜ nm -D /usr/lib64/slurm/cli_filter_lua.so | grep data_init
    U data_init

I couldn't find a slurm_init call in pyslurm/__init__.py, but there was one on the last line of pyslurm/pyslurm.pyx. Removing that seems to let me load the library, but then a call to pyslurm.slurmdb_jobs() causes a segfault (maybe not unexpected?).

Cheers,

Kris


tazend commented on May 28, 2024

Hi again,

As I found out, that error was introduced with slurm 23.02.

Basically, in 23.02, slurm_init now explicitly loads any client plugins, such as cli_filter, that may be required to interact with the API. The problem, however, is that, as the error indicates, a symbol called data_init is expected to be provided by some other shared library: the U (undefined) in the nm output shows that it isn't defined in cli_filter_lua.so itself.
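
To make the meaning of that U column concrete, here is a minimal sketch that reads nm -D output and reports the type letter for a symbol. It assumes the usual nm layout (address, type, name, with the address column left blank for undefined symbols); find_symbol is a hypothetical helper for illustration, not part of pyslurm or Slurm.

```python
def find_symbol(nm_output, symbol):
    """Return the nm type letter for `symbol`, or None if it is not listed.

    Assumes `nm -D` style lines: "<address> <type> <name>" for defined
    symbols and "<type> <name>" (no address) for undefined ones.
    """
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            # No address column: typically an undefined ("U") symbol.
            sym_type, name = fields
        elif len(fields) == 3:
            # Address, type, name: the symbol is defined in this object.
            _, sym_type, name = fields
        else:
            continue
        # Newer binutils may append a version suffix such as "@GLIBC_2.2.5".
        if name.split("@")[0] == symbol:
            return sym_type
    return None


if __name__ == "__main__":
    # The line reported earlier in this thread for cli_filter_lua.so.
    sample = "    U data_init\n"
    print(find_symbol(sample, "data_init"))
```

A result of "U" means the plugin only references data_init and relies on the process that loads it to provide the definition.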

This symbol is in libslurmfull.so, which basically contains the public API plus all internal functions, and every slurm tool like squeue, sbatch, slurmctld, slurmdbd, ... links to that one. That's why no error appears when using the slurm tools.

It is, however, not in libslurm.so, which is usually the recommended library to link against to interact with slurm. Because of that, basically any client application that links with libslurm.so in 23.02, like pyslurm, and calls slurm_init (which is mandatory when doing API calls) is broken. If you have some of the tools from the slurm-contribs package installed, like seff, they should also yield the same error.
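
A quick way to compare the two libraries from Python is a ctypes lookup. This is only a sketch: exports_symbol is a hypothetical helper, and both library paths in the usage part are assumptions that will differ between installations. Note also that the lookup goes through dlsym, which searches the library's dependencies as well, so a True result does not prove the symbol is defined in that exact file.

```python
import ctypes


def exports_symbol(lib_path, symbol):
    """Return True if `symbol` can be resolved through the library at lib_path."""
    try:
        lib = ctypes.CDLL(lib_path)
    except OSError:
        # Library is missing or could not be loaded.
        return False
    # ctypes raises AttributeError when dlsym cannot find the name,
    # which hasattr turns into False.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Assumed locations; adjust to wherever your Slurm libraries live.
    for path in ("/usr/lib64/libslurm.so", "/usr/lib64/slurm/libslurmfull.so"):
        print(path, exports_symbol(path, "data_init"))
```

On an affected 23.02 installation, the expectation from the explanation above is that only libslurmfull.so resolves data_init.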

The bug has already been reported, however: https://bugs.schedmd.com/show_bug.cgi?id=16503
(Not sure if it's already fixed in 23.02.2, but I don't think so.)

I have actually been thinking about switching pyslurm back to libslurmfull anyway, as it might make certain things a bit easier to implement in the future.


tazend commented on May 28, 2024

You can build from this branch for now if you want. It links with libslurmfull, so the error should go away.


tazend commented on May 28, 2024

Hi @KrisDavie,

Just wanted to let you know that the issue with the missing data_init symbol should be fixed in Slurm 23.02.2 (by this commit).
If your cluster has already been updated to this version, you can continue to use the normal pyslurm releases instead of the branch I made that links to libslurmfull.

Also a note on that: I had actually planned on merging the change that links against libslurmfull again into the main branch, but I noticed that a specific test was failing.
The issue can be triggered with this, for example:

python -c "import pyslurm; gg = pyslurm.utils.nodelist_from_range_str('node[001:002]'); print(gg)";

You should probably see some weird unknown error if you are still using the branch and 23.02.1. I have absolutely no idea why it's happening with libslurmfull and not libslurm. It also only happens in a Python context (I can't reproduce it with a simple C program that does the same).

So just a heads up: the version I provided via the branch might not be 100% stable in some cases, and slurm 23.02.2 is the minimum requirement for using the normal pyslurm 23.2.x releases if the cluster uses the cli_filter functionality.
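
As a rough sketch of that version rule: the helper below flags Slurm versions from 23.02.0 up to, but not including, 23.02.2 as affected by the missing data_init symbol. It assumes plain "major.minor.micro" version strings; parse_slurm_version and is_affected are hypothetical names, not pyslurm functions.

```python
def parse_slurm_version(version):
    """Turn a string like "23.02.1" into a comparable tuple (23, 2, 1)."""
    return tuple(int(part) for part in version.split(".")[:3])


# Per this thread: introduced with 23.02, expected to be fixed in 23.02.2.
FIRST_AFFECTED = (23, 2, 0)
FIRST_FIXED = (23, 2, 2)


def is_affected(version):
    """Return True if this Slurm version falls in the affected range."""
    return FIRST_AFFECTED <= parse_slurm_version(version) < FIRST_FIXED


if __name__ == "__main__":
    for v in ("22.05.8", "23.02.1", "23.02.2"):
        print(v, is_affected(v))
```

This only matters when the cluster actually uses the cli_filter functionality; without it, the plugin is not loaded in the first place.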
