cea-hpc / auks
Kerberos credential support for batch environments
License: Other
This is the same as hautreux/auks#61.
I've just hit this or a similar problem (on 0.5.3).
Starting any job, I get:
Auks API request failed : krb5 cred : unable to read credential cache
and the auks -R loop fails, i.e. no cache renewal; the rest works just fine.
AFAICS, it stems from the fact that krb5_cc_default_name (which is called by krb5_cc_default) calls secure_getenv(KRB5_ENV_CCNAME), and at the point where auks is started from slurmstepd, euid (the uid of the user starting the job) != uid (root); the same goes for gid/egid. In that state secure_getenv returns NULL, so the KRB5CCNAME setting is ignored.
Either adding -C $auks_credcache to the arglist or dropping privileges to the normal user (setgid(), setuid()) inside this if fixes the problem.
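To make the reasoning above concrete, here is a minimal Python model of it. It is only a sketch: the real secure_getenv decides based on how the process was loaded, not on a live uid comparison, and the function names, the ids, and the cache path below are made up for illustration. Only the -C option and the "auks -R loop" invocation come from this report.

```python
from typing import List, Mapping, Optional


def secure_lookup(env: Mapping[str, str], name: str,
                  uid: int, euid: int, gid: int, egid: int) -> Optional[str]:
    """Simplified model of secure_getenv: ignore the environment
    whenever real and effective ids differ."""
    if uid != euid or gid != egid:
        return None
    return env.get(name)


def with_explicit_ccache(args: List[str], ccache: str) -> List[str]:
    """Workaround 1 from the report: pass the cache with -C so that
    no environment lookup is needed."""
    return list(args) + ["-C", ccache]


# Hypothetical values: root real ids, job user effective ids.
env = {"KRB5CCNAME": "FILE:/tmp/krb5cc_1000_job"}

# Mixed ids: the variable is ignored, which matches the reported failure.
print(secure_lookup(env, "KRB5CCNAME", uid=0, euid=1000, gid=0, egid=1000))

# After dropping privileges (workaround 2) the ids match and the lookup works.
print(secure_lookup(env, "KRB5CCNAME", uid=1000, euid=1000, gid=1000, egid=1000))

# Workaround 1: explicit cache on the command line.
print(with_explicit_ccache(["auks", "-R", "loop"], env["KRB5CCNAME"]))
```

Both workarounds attack the same point: either the ids are made equal so the environment is trusted again, or the cache name no longer comes from the environment at all.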
We have not been able to get this combination to work. With 0.4.4 and some patches, we have had it working on RHEL7 for several years by using the AUKS_PRIV_CCACHE_APPEND environment variable to force auks to use a different cache than root's default (/tmp/krb5cc_0). The problem with auks using root's cache is that gssproxy also uses it and overwrites the file with something incompatible. So having auks use /tmp/krb5cc_0_auks avoids that issue.
0.4.4 won't compile on RHEL8. Or at least not easily. It looks like gcc jumped so many versions since RHEL7 that there is a problem with compatibility.
With 0.5.0, 0.5.3, and 0.5.4 it seems like AUKS_PRIV_CCACHE_APPEND was removed (though it is still in the man page for aukspriv).
Has anyone gotten this combination working? We could really use some advice. I think our next step might be to try and patch the code ourselves to restore AUKS_PRIV_CCACHE_APPEND but we are really hoping someone has already done that work.
Hello,
first of all, thanks for writing AUKS!
I have installed the AUKS plugin in Slurm, and apparently all is going well: I submit a job to Slurm with "--auks=yes", and I can verify that AUKS adds the ticket to the cache, and that when the job runs it succeeds with a "get request" to get the ticket. The problem is that in our environment our HOME directories are NFS drives that are accessed via Kerberos. When connecting via "ssh" to a machine, these drives are automounted. When running with Slurm + AUKS, despite having a valid TGT, the NFS drive is not automounted and I get a "Stale file handle" error if I try to list its contents.
I suspect this is not a problem really with AUKS, but perhaps somebody has come across something similar and has pointers on where I should look to solve this?
We ran into an issue similar to the one reported a few years back:
https://github.com/hautreux/auks/issues/3
We are using an Active Directory setup, and there are users that have a lot of groups.
The user with the issue had a cache ticket of 8308 bytes.
We created a custom build with AUKS_CRED_DATA_MAX_LENGTH changed to 16384, and that solved the issue for us.
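For anyone else sizing this constant, a quick sanity check in Python. This is only arithmetic on the numbers above: the helper names are made up, the power-of-two rounding is an assumption about why 16384 is a natural choice (the post does not say), and whether the real check in auks is strict or inclusive has not been verified here.

```python
def fits(cred_len: int, max_len: int) -> bool:
    """True if a credential of cred_len bytes fits in a max_len-byte buffer."""
    return cred_len <= max_len


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    size = 1
    while size < n:
        size *= 2
    return size


cred_len = 8308  # ticket size reported above

print(next_power_of_two(cred_len))   # 16384
print(fits(cred_len, 8192))          # False: 8192 would still be too small
print(fits(cred_len, 16384))         # True
```

Users with even more group memberships may need a larger value, so it is worth measuring the biggest ticket on your site before picking a limit.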
First of all, I need to admit that I have very limited knowledge of autoconf etc., so it might be that I am just doing it wrong ...
My understanding of the build process for auks is that I would create an RPM using make rpm and deploy it on my cluster. I can successfully run autoreconf -fvi && ./configure && make to build the binaries, but unfortunately, I cannot build the RPMs.
As I need to support scheduling of GPUs in Slurm, I build Slurm from source against NVIDIA's current libraries. If I do so and run make rpm instead of make, I get the error message
error: Failed build dependencies:
slurm-devel >= 20.11.0 is needed by auks-UNKNOWN-UNKNOWN.el7.x86_64
I assume that is because I do not have slurm-devel installed from yum, because I build Slurm from source. Is there any way to work around this?
When we build RPMs for RHEL8, we get a service file with "After=network.target". I assume this would be true for any distro but haven't checked. That is not late enough in the boot process and we kept having to manually restart auksd after a reboot. Changing to "After=network-online.target" seems to have fully resolved that problem. I'm not sure exactly why (maybe because slurmctld and munge come later?), but it is 100% reproducible in our environment.
Anyway, we would suggest changing from "After=network.target" to "After=network-online.target" for the auksd service file.
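Until the packaged unit changes, the edit can be scripted after installing the RPM. Below is a rough Python sketch of that rewrite. The sample unit text is invented for the example and is not the real auksd unit, and the function name is hypothetical. The added Wants= line is not part of the suggestion above; it follows the general systemd recommendation that a unit ordered after network-online.target should also pull that target in.

```python
def use_network_online(unit_text: str) -> str:
    """Replace After=network.target with network-online.target ordering."""
    out = []
    for line in unit_text.splitlines():
        if line.strip() == "After=network.target":
            out.append("After=network-online.target")
            # Assumption: also pull the target in, per systemd guidance.
            out.append("Wants=network-online.target")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Invented sample content, for illustration only.
sample = """[Unit]
Description=example daemon
After=network.target

[Service]
ExecStart=/usr/sbin/exampled
"""

print(use_network_online(sample))
```

A systemd drop-in would achieve the same without touching the packaged file, which may be preferable if the RPM gets rebuilt often.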
The issue asking about file based caches in this scenario hasn't garnered any responses so we have been trying to use KCM caches. Unfortunately, a similar (maybe the same) problem exists.
We have been using auks (0.4.4-1 with some patches) for a couple years on RHEL7 with Kerberized NFS homes. We are having lots of trouble with auks 0.5.4 on RHEL8. The issue is that when the gssproxy service is running it creates a cache in root's collection for the principal HOSTNAME$@realm in addition to the one for host/hostname.domain@REALM that auks uses. Here is an example.
[root@dcompute ~]# klist -A
Ticket cache: KCM:0
Default principal: host/compute01.dartmouth.edu@REALM
Valid starting Expires Service principal
09/06/2023 14:39:49 09/07/2023 00:39:48 krbtgt/REALM@REALM
renew until 10/06/2023 14:39:48
Ticket cache: KCM:0:57660
Default principal: COMPUTE01$@realm
Valid starting Expires Service principal
12/31/1969 19:00:00 12/31/1969 19:00:00 Encrypted/Credentials/v1@X-GSSPROXY:
When that gssproxy cache exists on a compute node, it cannot retrieve a credential using auks. If we delete the gssproxy cache (with kdestroy -c as root), compute nodes can retrieve credentials. e.g. jobs can be submitted with that cache deleted and seem to run normally until gssproxy puts the cache back. It's not clear exactly when gssproxy creates this cache (which makes testing this a little frustrating) but eventually (less than an hour) it always puts it back.
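Since the cache keeps coming back, it helps to find it mechanically while testing. Here is a small Python sketch that scans klist -A output for caches holding an X-GSSPROXY entry and prints the matching kdestroy -c command. The sample text is the output shown above; the function name is made up, and the matching rule (any line containing "X-GSSPROXY") is an assumption based on that single example.

```python
from typing import List


def gssproxy_caches(klist_output: str) -> List[str]:
    """Return the names of caches whose listing mentions X-GSSPROXY."""
    caches: List[str] = []
    current = None
    for raw in klist_output.splitlines():
        line = raw.strip()
        if line.startswith("Ticket cache:"):
            # Everything after the first colon is the cache name.
            current = line.split(":", 1)[1].strip()
        elif "X-GSSPROXY" in line and current is not None:
            if current not in caches:
                caches.append(current)
    return caches


SAMPLE = """Ticket cache: KCM:0
Default principal: host/compute01.dartmouth.edu@REALM
Valid starting Expires Service principal
09/06/2023 14:39:49 09/07/2023 00:39:48 krbtgt/REALM@REALM
renew until 10/06/2023 14:39:48
Ticket cache: KCM:0:57660
Default principal: COMPUTE01$@realm
Valid starting Expires Service principal
12/31/1969 19:00:00 12/31/1969 19:00:00 Encrypted/Credentials/v1@X-GSSPROXY:
"""

for name in gssproxy_caches(SAMPLE):
    print("kdestroy -c " + name)   # kdestroy -c KCM:0:57660
```

This only prints the command; it still has to be run as root, and it is a testing aid rather than a fix, because gssproxy recreates the cache.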
We have tried disabling gssproxy, but it seems to start regardless of the enabled/disabled status of the gssproxy service or the GSS_USE_PROXY setting in /etc/sysconfig/nfs or the existence of /etc/gssproxy/99-nfs-client.conf. Uninstalling gssproxy wants to uninstall nfs-utils but we definitely need NFS! So that is a no-go.
This also causes problems on the system running auksd and auksdrenewer. When the gssproxy cache is there, "auks -D" fails. If we delete the cache, "auks -D" starts working again. It also seems like auksdrenewer is unable to dump credentials when the gssproxy cache is in place.
When the gssproxy ticket is deleted from the compute nodes and the auksd node, things in auks otherwise seem to be working correctly. e.g. Submitting a job successfully adds the TGT to auksd. auksdrenewer renews credentials (verified by looking in /var/cache/auks at the tickets stashed there), jobs run and have access to Kerberized NFS home directories.
We would really love to hear from people using auks on RHEL8 and understand how you got it to work. We can post configurations etc. if that would be useful but it does seem like we have zeroed in on the core issue here.