DIRAC pilot 3.0
Home Page: http://diracgrid.org/
License: GNU General Public License v3.0
PR #210 adds a number of integration tests for the Pilot. One of these highlighted potential issues with Pilots using CVMFS-deployed installations [1]. At the moment, each DIRAC client installation relies on the environment variables X509_CERT_DIR, X509_VOMS_DIR and X509_VOMSES, pointing to 3 distinct directories; if they are unset, dirac-configure creates the directories in the local directory tree [2]. For CVMFS-installed releases, these environment variables should normally point to locations that are independent from where a release is found. If any of them is already set in the worker node environment, its value is kept. Otherwise, the "standard" CVMFS locations can normally be found in /cvmfs/grid.cern.ch/etc/grid-security:
$ ls -l /cvmfs/grid.cern.ch/etc/grid-security/
total 10
drwxr-xr-x.  2 cvmfs cvmfs 8192 Oct  6 14:34 certificates
drwxrwxr-x. 59 cvmfs cvmfs   18 Mar 17  2023 vomsdir
drwxrwxr-x.  2 cvmfs cvmfs   42 Aug  3 13:50 vomses
And this is what I have added as default in #210 .
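The precedence described above can be sketched as follows (resolveGridSecurity and CVMFS_DEFAULTS are illustrative names, not the actual code from #210):

```python
import os

# Hedged sketch of the precedence logic: a value already set in the worker
# node environment wins, and the standard CVMFS grid-security paths are
# only fallbacks.
CVMFS_DEFAULTS = {
    "X509_CERT_DIR": "/cvmfs/grid.cern.ch/etc/grid-security/certificates",
    "X509_VOMS_DIR": "/cvmfs/grid.cern.ch/etc/grid-security/vomsdir",
    "X509_VOMSES": "/cvmfs/grid.cern.ch/etc/grid-security/vomses",
}


def resolveGridSecurity(env=None):
    """Return the three X509_* settings, preferring the host environment."""
    env = os.environ if env is None else env
    return {var: env.get(var, default) for var, default in CVMFS_DEFAULTS.items()}
```
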
For historical reasons, in LHCb we have been using /cvmfs/lhcb.cern.ch/etc/grid-security. At the same time, there might be some sites that have CVMFS mounted in a non-standard location (this was certainly true in AFS and NFS times, but could still be true for some sites today, and especially for HPCs). For LHCb, another historical environment variable was $VO_LHCB_SW_DIR, which could be set by sites to signal a non-standard location of the "shared area".
My questions here are: did I get everything correct? What am I missing? Would you prefer something different?
[1] The first implementation of https://github.com/DIRACGrid/Pilot/issues/166 went in https://github.com/DIRACGrid/Pilot/pull/205. PR #210 adds some fixes.
[2] This is what happens when you issue dirac-configure:

Checking DIRAC installation at "/client/diracos"
Current hash for bundle CAs in directory /client/diracos/etc/grid-security/certificates is ''
Synchronizing directory with remote bundle
Dir has been synchronized
Current hash for bundle CRLs in directory /client/diracos/etc/grid-security/certificates is ''
Synchronizing directory with remote bundle
Dir has been synchronized
Created vomsdir file /client/diracos/etc/grid-security/vomsdir/dteam/voms2.hellasgrid.gr.lsc
Created vomses file /client/diracos/etc/grid-security/vomses/dteam
Created vomsdir file /client/diracos/etc/grid-security/vomsdir/gridpp/voms.gridpp.ac.uk.lsc
Created vomsdir file /client/diracos/etc/grid-security/vomsdir/gridpp/voms02.gridpp.ac.uk.lsc
Created vomsdir file /client/diracos/etc/grid-security/vomsdir/gridpp/voms03.gridpp.ac.uk.lsc
Created vomses file /client/diracos/etc/grid-security/vomses/gridpp
Created vomsdir file /client/diracos/etc/grid-security/vomsdir/wlcg/wlcg-voms.cloud.cnaf.infn.it.lsc
Created vomses file /client/diracos/etc/grid-security/vomses/wlcg
Today I tried to integrate a new resource running SunGridEngine. The job agent started and the job status changed to Running. After that it hangs, while the job on the resource has finished. In the error file of the batch system I see the following:
Traceback (most recent call last):
File "dirac-pilot.py", line 66, in
command.execute()
File "/tmp/DIRAC_6fagrxpilot/pilotCommands.py", line 917, in execute
self.__startJobAgent()
File "/tmp/DIRAC_6fagrxpilot/pilotCommands.py", line 904, in __startJobAgent
retCode, _output = self.executeAndGetOutput(jobAgent, self.pp.installEnv)
File "/tmp/DIRAC_6fagrxpilot/pilotTools.py", line 384, in executeAndGetOutput
outData = _p.stdout.read().decode().strip()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 5591: ordinal not in range(128)
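The crash comes from decoding the job agent's output with the environment's default (here ASCII) codec. A minimal sketch of a more defensive read, assuming the output is UTF-8 (the actual pilotTools code may differ):

```python
import subprocess
import sys

# Child process that emits non-ASCII bytes on stdout, mimicking a job
# whose output is UTF-8 while the pilot runs under a plain-ASCII locale.
_p = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stdout.buffer.write('caf\\xe9'.encode('utf-8'))"],
    stdout=subprocess.PIPE,
)
# An explicit encoding plus errors="replace" cannot raise UnicodeDecodeError:
# undecodable bytes become U+FFFD instead of killing the pilot thread.
outData = _p.stdout.read().decode("utf-8", errors="replace").strip()
```
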
Hi,
While deploying pilot3 in production, one of our users noticed their argument list was now being corrupted... After a bit of poking around I found that "pilot.cfg" is always appended to the list of user arguments for JDL jobs (but not API jobs). We missed this during our initial tests as everything we run can (unfortunately) silently tolerate an extra argument.
I've tracked this behaviour down to this line of code:
Line 902 in e244e65
Why is this set? We've run our full GridPP test suite with this line commented out and all of the tests pass, but I can imagine that there must be a reason for setting this (for reference our inner CE is just InProcess)... Can we somehow get rid of this setting (by changing it to something else?) to fix this behaviour?
Just for complete clarity our test case for this is a JDL with:
Executable = "/usr/bin/echo";
Arguments = "Hello";

The Stdout contains "Hello pilot.cfg".
Regards,
Simon
Several scripts in DIRAC are used only by the pilot: they can be copied here (for the moment, and properly moved later).
This also applies to some specific parts of the DIRAC code, including the JobAgent.
It would be nice for the Pilot to take an existing DIRACOS installation from CVMFS to reduce the load on the worker node's filesystem.
I've put an example on /cvmfs/lhcbdev.cern.ch, which would probably be hosted on /cvmfs/dirac.egi.eu instead (don't use it, I may delete it at any time!). The idea would then be to make a Python venv on top of it like so:
$ source /cvmfs/lhcbdev.cern.ch/experimental/dirac-pilot-ideas/v1/DIRACOS/v2.29/Linux-x86_64/diracosrc
$ python -m venv --system-site-packages /tmp/diracos-venv
$ cp $DIRACOS/diracosrc /tmp/diracos-venv/diracosrc
$ echo "source /tmp/diracos-venv/bin/activate" >> /tmp/diracos-venv/diracosrc
# Some other modifications should probably also be done to the diracosrc
# From now onwards "source /tmp/diracos-venv/diracosrc" can be used to activate the CVMFS based DIRACOS
$ source /tmp/diracos-venv/diracosrc
$ pip install 'DIRAC==7.3.32'
This avoids creating 62108 files/links/directories (1.6 GB of data) and would make pilots much faster, while still letting people control the DIRAC version and install extensions if desired.
My suggestion to implement this would be:
cc @IgorPelevanyuk, who expressed interest in this before the BiLD meeting.
There are a few issues with the current implementation of PilotLogging.
Running without X509_CERT_DIR and X509_USER_PROXY set generates the following:

2023-10-31T15:48:30.341093Z DEBUG [PilotParams] Release project:
2023-10-31T15:48:30.342023Z INFO [Pilot] Remote logger activated
Traceback (most recent call last):
File "/home/runner/work/Pilot/Pilot/Pilot/dirac-pilot.py", line 76, in <module>
log.buffer.flush()
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 514, in wrapper
return func(self, *args, **kwargs)
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 599, in flush
self.senderFunc(buf)
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 6[35](https://github.com/fstagni/Pilot/actions/runs/6709010578/job/18231106705#step:6:36), in sendMessage
context.load_cert_chain(cert)
TypeError: certfile should be a valid filesystem path
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 522, in run
self.function(*self.args, **self.kwargs)
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 514, in wrapper
return func(self, *args, **kwargs)
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 599, in flush
self.senderFunc(buf)
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 635, in sendMessage
context.load_cert_chain(cert)
TypeError: certfile should be a valid filesystem path
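Pending a proper fix, the sender could also guard against an unusable certificate path instead of raising. A minimal sketch (make_context is a hypothetical helper, not the real pilotTools code):

```python
import os
import ssl


def make_context(cert, caPath):
    """Build an SSL context for the remote logger, or return None when the
    credentials are not usable, so the caller can fall back to local-only
    logging instead of crashing with a TypeError."""
    if not cert or not os.path.isfile(cert):
        return None  # X509_USER_PROXY / host cert missing or not a file
    context = ssl.create_default_context()
    if caPath and os.path.isdir(caPath):
        # CA directory from X509_CERT_DIR, when available
        context.load_verify_locations(capath=caPath)
    context.load_cert_chain(cert)
    return context
```
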
This is especially true for the case when DIRAC is sourced from CVMFS (not installed).
To address this issue, in #218 I moved the setting of these variables as early as possible.
2023-11-01T11:50:36.304812Z DEBUG [PilotParams] Release project:
2023-11-01T11:50:44.286345Z DEBUG [PilotParams] X509_CERT_DIR is set in the host environment as /cvmfs/grid.cern.ch/etc/grid-security/certificates, aligning installEnv to it
2023-11-01T11:50:44.287539Z DEBUG [PilotParams] X509_VOMS_DIR is set in the host environment as /cvmfs/grid.cern.ch/etc/grid-security/vomsdir, aligning installEnv to it
2023-11-01T11:50:44.287626Z DEBUG [PilotParams] X509_VOMSES is not set in the host environment
2023-11-01T11:50:44.287688Z DEBUG [PilotParams] Candidate directory for X509_VOMSES is /cvmfs/grid.cern.ch/etc/grid-security/vomses
2023-11-01T11:50:44.289156Z DEBUG [PilotParams] Setting X509_VOMSES=/cvmfs/grid.cern.ch/etc/grid-security/vomses
2023-11-01T11:50:44.289448Z INFO [Pilot] Remote logger activated
Traceback (most recent call last):
File "/home/runner/work/Pilot/Pilot/Pilot/dirac-pilot.py", line 76, in <module>
log.buffer.flush()
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 503, in wrapper
return func(self, *args, **kwargs)
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 588, in flush
self.senderFunc(buf)
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 6[31](https://github.com/fstagni/Pilot/actions/runs/6719269671/job/18260562881#step:6:32), in sendMessage
res = urlopen(url, data, context=context)
File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.10/urllib/request.py", line 525, in open
response = meth(req, response)
File "/usr/lib/python3.10/urllib/request.py", line 6[34](https://github.com/fstagni/Pilot/actions/runs/6719269671/job/18260562881#step:6:35), in http_response
response = self.parent.error(
File "/usr/lib/python3.10/urllib/request.py", line 563, in error
return self._call_chain(*args)
File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: Unauthorized
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 511, in run
self.function(*self.args, **self.kwargs)
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 503, in wrapper
return func(self, *args, **kwargs)
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 588, in flush
self.senderFunc(buf)
File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 631, in sendMessage
res = urlopen(url, data, context=context)
File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.10/urllib/request.py", line 525, in open
response = meth(req, response)
File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
response = self.parent.error(
File "/usr/lib/python3.10/urllib/request.py", line 563, in error
return self._call_chain(*args)
File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: Unauthorized
Error: Process completed with exit code 1.
This is because the host certificate is not recognized as such. A DIRAC client would add {"extraCredentials": "hosts"} to the credentials dictionary, but this is not done when running before DIRAC is installed.
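What the client effectively does can be sketched as follows (build_payload and the overall payload shape are illustrative assumptions; only the "extraCredentials": "hosts" entry comes from the issue text):

```python
import json


def build_payload(message, use_host_cert):
    """Serialize a request payload, marking host-certificate authentication.

    Without the extraCredentials marker the server does not recognize the
    host certificate and answers HTTP 401 Unauthorized.
    """
    payload = {"method": "sendMessage", "args": [message]}
    if use_host_cert:
        payload["extraCredentials"] = "hosts"
    return json.dumps(payload)
```
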
2023-11-01T11:51:13.939924Z INFO [Pilot] Remote logger activated
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/__w/Pilot/Pilot/Pilot/pilotTools.py", line 511, in run
self.function(*self.args, **self.kwargs)
File "/__w/Pilot/Pilot/Pilot/pilotTools.py", line 503, in wrapper
return func(self, *args, **kwargs)
File "/__w/Pilot/Pilot/Pilot/pilotTools.py", line 588, in flush
self.senderFunc(buf)
File "/__w/Pilot/Pilot/Pilot/pilotTools.py", line 623, in sendMessage
context.load_verify_locations(capath=caPath)
TypeError: cafile, capath and cadata cannot be all omitted
Again, this is because X509_CERT_DIR and X509_USER_PROXY are not defined at this stage.
I don't know how to solve the latter two problems (the 401 Unauthorized and the undefined-variables cases). Maybe for the 401 case the simplest would be to use a diracx service.
Hi @fstagni,
Would it be possible to merge some of the older features from the devel branch out to production now? (I'm mainly interested in the rolling output patch, but there are some others in there too that are a few months old now).
Regards,
Simon
Not a real issue, just a question related to DIRACGrid/DIRAC#7086.

Tags/RequiredTags are added both in /LocalSite and in /Resources/Computing/CEDefaults. Is that expected?
See here:
Lines 565 to 568 in 15a0874
And here:
Lines 679 to 687 in 15a0874
Thanks
There is some weird logic in this script. The line:
https://github.com/DIRACGrid/Pilot/blob/3a837eccbf9f494d16e504699db1e32c3dc683d5/.github/workflows/basic.yml#L41
will prevent pytest from ever being run. I am not sure what the reason behind this was.
When the Pilot creates its DIRACOS environment, it directly calls the DIRACOS-Linux-machine.sh installer, which eventually invokes mamba. Mamba assumes that it can use all cores of the machine (see mamba-org/mamba#2463), which isn't realistic for Pilot environments. This leads to excessive process creation, which can negatively affect the pilot, the user, or even the entire compute resource.
As far as I can tell, the templates from which DIRACOS is generated do not provide a feasible way to limit this internally. The Pilot thus seems like the best place to do it, since it is aware of resource restrictions.
A solution would be to set MAMBA_EXTRACT_THREADS when installing DIRACOS, either to a conservative 1 or to pp.maxNumberOfProcessors.
For a sense of scale: we caught this on a WLCG Tier 1 worker node with 256 cores that got allocated mostly to one VO. Each of the single-core pilots tried to use 256 child processes; each pilot quickly ground to a halt due to resource limits and fork-bomb protection, which caused each new pilot to also immediately get stuck on nproc limits and similar safeguards.
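The suggestion above could be sketched like this, assuming the installer is invoked from the pilot's working directory (MAMBA_EXTRACT_THREADS is the real mamba setting; DIRAC_MAX_PROCESSORS here only stands in for the pilot's pp.maxNumberOfProcessors parameter):

```shell
# Cap mamba's extraction parallelism before running the DIRACOS installer;
# 1 is the conservative choice when the processor count is unknown.
export MAMBA_EXTRACT_THREADS="${DIRAC_MAX_PROCESSORS:-1}"
echo "MAMBA_EXTRACT_THREADS=${MAMBA_EXTRACT_THREADS}"
# bash "DIRACOS-Linux-$(uname -m).sh"   # then run the installer as usual
```
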
In the case when a VO cannot be obtained from a proxy, we pass it within a remote logger call. The server will use it only if a VO cannot be guessed by calling getRemoteCredentials(). This is the case when a pilot is authenticated with a certificate.
Including the pilot code creation.
As far as I can tell, the GFAL (etc.) variables survive the purging by the diracos2/diracosrc because the original environment is kept from the start:
Line 528 in 2a1dc97
Line 307 in 2a1dc97
Line 316 in 2a1dc97
Line 1053 in 2a1dc97
Should we just set self.pp.installEnv = {} before filling in the updated values?
The field pilot_options in Jenkins needs a trailing whitespace to work properly, e.g. after --dirac-os; if run without it, the Pilot complains because the following options are passed as --dirac-os-M 1 -S DIRAC-Certification ...
Hi,
Following the merge of #185, we've started getting errors from users whose jobs now fail to find the dirac.cfg when they run any dirac-* commands in their job scripts. As an emergency fix, I've swapped the GridPP production server over to a local branch without that commit, but it would be good to agree on what the behaviour here should be (as we frequently seem to run into pilot.cfg/dirac.cfg problems!).
Tagging @marianne013 .
Regards,
Simon
As requested in DIRACGrid/DIRAC#4604 (comment), instrument the pilot to pass a GlobalDefaults.cfg file when installing DIRAC.
The installation script doesn't recognize the --dirac-os-version flag, so the Jenkins tests fail if a specific version is given. Passing only the --dirac-os flag works fine.
When the PoolCE solution is fully certified.
Wrt DIRACGrid/DIRAC#4616
As far as I can tell, the pilots in v7r1 are still using the dirac-install.py from the Pilot repo (rather than the management repo). It looks like the dirac-install scripts have diverged a bit, but would it be possible to backport the XRD_RUNFORKHANDLER fix to this version as well? (I only just found this as I was trying to remove the "test the new code" override of our pilot configs.)
@chrisburr
Follow-up of DIRACGrid/DIRAC#6208 (comment)
An easy way to get the VO from a proxy would be using voms-proxy-info, e.g.
$ voms-proxy-info -file /tmp/x509up_u49429 -vo
lhcb
but this of course relies on voms-proxy-info being on the WN before DIRAC is installed, or on using a CVMFS-deployed version, like the one in dirac.egi.eu/client/diracos/. This would most probably be OK for several LCG resources, but not necessarily for all resources (and on HPCs we might be in trouble). Base openssl is not enough.
We can of course fall back to "sensible defaults" if we fail to get the VO from the proxy (and, soon, from the token).
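Putting the two together, a wrapper with a fallback could look like this (guess_vo and the "dteam" default are illustrative assumptions, not pilot code):

```python
import shutil
import subprocess


def guess_vo(proxy_path, default_vo="dteam"):
    """Return the VO of a proxy via voms-proxy-info, or default_vo when the
    tool is unavailable (e.g. on an HPC worker node before DIRAC is
    installed) or the proxy cannot be parsed."""
    if shutil.which("voms-proxy-info") is None:
        return default_vo
    try:
        out = subprocess.run(
            ["voms-proxy-info", "-file", proxy_path, "-vo"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return default_vo
    vo = out.stdout.strip()
    return vo if out.returncode == 0 and vo else default_vo
```
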
Verify if it is possible to do it with basic openssl.
dirac-install switch --pythonVersion=3