
pilot's Issues

Security-related (CVMFS) locations

PR #210 adds a number of integration tests for the Pilot. One of these highlighted potential issues with Pilots using CVMFS-deployed installations [1]. Each DIRAC client installation at the moment relies on these environment variables, pointing to 3 distinct directories:

  • X509_CERT_DIR : for the CAs
  • X509_VOMS_DIR : for some VO-dependent VOMS files
  • DIRAC_VOMSES (or X509_VOMSES) : for some other VO-dependent VOMS files

For locally-installed clients these variables are added, and populated, by dirac-configure in the local directory tree [2].

For CVMFS-installed releases the above env variables should normally point to locations that are independent of where a release is found. If any of the above env variables is set within the worker node environment, its value is kept. Otherwise, the "standard" CVMFS locations can normally be found in /cvmfs/grid.cern.ch/etc/grid-security:

 |-> ll /cvmfs/grid.cern.ch/etc/grid-security/
total 10
drwxr-xr-x.  2 cvmfs cvmfs 8192  6 Oct 14:34 certificates
drwxrwxr-x. 59 cvmfs cvmfs   18 17 Mar  2023 vomsdir
drwxrwxr-x.  2 cvmfs cvmfs   42  3 Aug 13:50 vomses

And this is what I have added as the default in #210.
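As a minimal sketch of this fallback logic (the variable names and CVMFS paths are the ones above; everything else here is illustrative, not the actual pilot code):

import os

CVMFS_GRID_SECURITY = "/cvmfs/grid.cern.ch/etc/grid-security"
defaults = {
    "X509_CERT_DIR": os.path.join(CVMFS_GRID_SECURITY, "certificates"),
    "X509_VOMS_DIR": os.path.join(CVMFS_GRID_SECURITY, "vomsdir"),
    "X509_VOMSES": os.path.join(CVMFS_GRID_SECURITY, "vomses"),
}
for var, default in defaults.items():
    # values already set in the worker node environment are kept
    if var not in os.environ and os.path.isdir(default):
        os.environ[var] = default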

For historical reasons, in LHCb we have been using /cvmfs/lhcb.cern.ch/etc/grid-security. At the same time, there might be some sites that have CVMFS mounted in a non-standard location (this was certainly true in AFS and NFS times, but could still be true for some sites today, and especially for HPCs). For LHCb, another historical env variable was $VO_LHCB_SW_DIR, which could be set by sites to signal a non-standard location of the "shared area".

My questions here are: did I get everything correct? What am I missing? Would you prefer something different?

Footnotes

  1. The first implementation of https://github.com/DIRACGrid/Pilot/issues/166 went in https://github.com/DIRACGrid/Pilot/pull/205. PR #210 adds some fixes.

  2. This is what happens when you issue dirac-configure:

     Checking DIRAC installation at "/client/diracos"
     Current hash for bundle CAs in directory /client/diracos/etc/grid-security/certificates is ''
     Synchronizing directory with remote bundle
     Dir has been synchronized
     Current hash for bundle CRLs in directory /client/diracos/etc/grid-security/certificates is ''
     Synchronizing directory with remote bundle
     Dir has been synchronized
     Created vomsdir file /client/diracos/etc/grid-security/vomsdir/dteam/voms2.hellasgrid.gr.lsc
     Created vomses file /client/diracos/etc/grid-security/vomses/dteam
     Created vomsdir file /client/diracos/etc/grid-security/vomsdir/gridpp/voms.gridpp.ac.uk.lsc
     Created vomsdir file /client/diracos/etc/grid-security/vomsdir/gridpp/voms02.gridpp.ac.uk.lsc
     Created vomsdir file /client/diracos/etc/grid-security/vomsdir/gridpp/voms03.gridpp.ac.uk.lsc
     Created vomses file /client/diracos/etc/grid-security/vomses/gridpp
     Created vomsdir file /client/diracos/etc/grid-security/vomsdir/wlcg/wlcg-voms.cloud.cnaf.infn.it.lsc
     Created vomses file /client/diracos/etc/grid-security/vomses/wlcg

Problem decoding output in pilotTools.py

Today I tried to integrate a new resource on SunGridEngine. The job agent started and the job status changed to Running; after that the pilot hung, while the job on the resource finished. In the error file of the batch system I see the following:
Traceback (most recent call last):
File "dirac-pilot.py", line 66, in
command.execute()
File "/tmp/DIRAC_6fagrxpilot/pilotCommands.py", line 917, in execute
self.__startJobAgent()
File "/tmp/DIRAC_6fagrxpilot/pilotCommands.py", line 904, in __startJobAgent
retCode, _output = self.executeAndGetOutput(jobAgent, self.pp.installEnv)
File "/tmp/DIRAC_6fagrxpilot/pilotTools.py", line 384, in executeAndGetOutput
outData = _p.stdout.read().decode().strip()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 5591: ordinal not in range(128)
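
A possible fix, as a sketch (assuming the job output is UTF-8; "replace" is one choice of error handler, and the positional arguments keep it working on both Python 2 and 3):

# avoid relying on the default ASCII codec: decode explicitly and
# tolerate undecodable bytes instead of raising UnicodeDecodeError
outData = _p.stdout.read().decode("utf-8", "replace").strip()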

Pilot3 adds "pilot.cfg" on to user argument list

Hi,

While deploying pilot3 in production, one of our users noticed their argument list was now being corrupted... After a bit of poking around I found that "pilot.cfg" is always appended to the list of user arguments for JDL jobs (but not API jobs). We missed this during our initial tests as everything we run can (unfortunately) silently tolerate an extra argument.

I've tracked this behaviour down to this line of code:

self.innerCEOpts.append(' -o /AgentJobRequirements/ExtraOptions=%s' % self.pp.localConfigFile)

Why is this set? We've run our full GridPP test suite with this line commented out and all of the tests pass, but I can imagine that there must be a reason for setting this (for reference our inner CE is just InProcess)... Can we somehow get rid of this setting (by changing it to something else?) to fix this behaviour?

Just for complete clarity our test case for this is a JDL with:
Executable = "/usr/bin/echo";
Arguments = "Hello";

The StdOut then contains "Hello pilot.cfg".

Regards,
Simon

Add scripts used by the pilot

Several scripts used by the pilot currently live only in DIRAC: they can be copied here for the moment, and properly moved later.

This also applies to some specific parts of the DIRAC code, including the JobAgent.

Support taking unpacked install from CVMFS

It would be nice for the Pilot to take an existing DIRACOS installation from CVMFS to reduce the load on the worker node's filesystem.

I've put an example on /cvmfs/lhcbdev.cern.ch which would probably be hosted on /cvmfs/dirac.egi.eu instead (Don't use it, I may delete it at any time!). The idea would then be to make a Python venv on top of it like so:

$ source /cvmfs/lhcbdev.cern.ch/experimental/dirac-pilot-ideas/v1/DIRACOS/v2.29/Linux-x86_64/diracosrc
$ python -m venv --system-site-packages /tmp/diracos-venv
$ cp $DIRACOS/diracosrc /tmp/diracos-venv/diracosrc
$ echo "source /tmp/diracos-venv/bin/activate" >> /tmp/diracos-venv/diracosrc
# Some other modifications should probably also be done to the diracosrc
# From now onwards "source /tmp/diracos-venv/diracosrc" can be used to activate the CVMFS based DIRACOS
$ source /tmp/diracos-venv/diracosrc
$ pip install 'DIRAC==7.3.32'

This avoids creating 62108 files/links/directories (1.6GB of data) and would make pilots much faster while still letting people control the DIRAC version and install extensions if desired.

My suggestion to implement this would be:

  • Add a script to DIRACOS which creates this virtual environment (see the sketch below)
  • Add support to the pilot to do this instead of installing DIRACOS2 (might also need a little SiteDirector work)
  • Set up automatic installation of versioned DIRACOS environments on /cvmfs/dirac.egi.eu
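
A minimal sketch of what such a script could look like (hypothetical: it assumes it runs with the Python obtained after sourcing the CVMFS diracosrc, and that $DIRACOS is set by that diracosrc; paths are illustrative):

import os
import shutil
import venv

cvmfs_diracos = os.environ["DIRACOS"]  # set by the sourced CVMFS diracosrc
venv_dir = "/tmp/diracos-venv"

# equivalent of `python -m venv --system-site-packages`: the venv sees the
# CVMFS-provided site-packages
venv.create(venv_dir, system_site_packages=True, with_pip=True)

# reuse the CVMFS diracosrc, then make sourcing it also activate the venv
shutil.copy(os.path.join(cvmfs_diracos, "diracosrc"), os.path.join(venv_dir, "diracosrc"))
with open(os.path.join(venv_dir, "diracosrc"), "a") as rc:
    rc.write("\nsource %s/bin/activate\n" % venv_dir)
# some other modifications should probably also be done to the diracosrc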

cc @IgorPelevanyuk who expressed interest in this before the BiLD meeting

On Pilot Logging

There are a few issues with the current implementation of PilotLogging.

  1. SOLVED: activating RemoteLogger in an environment that does NOT include the env variables X509_CERT_DIR and X509_USER_PROXY generates the following:
2023-10-31T15:48:30.341093Z DEBUG [PilotParams] Release project: 
2023-10-31T15:48:30.342023Z INFO [Pilot] Remote logger activated
Traceback (most recent call last):
  File "/home/runner/work/Pilot/Pilot/Pilot/dirac-pilot.py", line 76, in <module>
    log.buffer.flush()
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 514, in wrapper
    return func(self, *args, **kwargs)
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 599, in flush
    self.senderFunc(buf)
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 6[35](https://github.com/fstagni/Pilot/actions/runs/6709010578/job/18231106705#step:6:36), in sendMessage
    context.load_cert_chain(cert)
TypeError: certfile should be a valid filesystem path
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 522, in run
    self.function(*self.args, **self.kwargs)
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 514, in wrapper
    return func(self, *args, **kwargs)
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 599, in flush
    self.senderFunc(buf)
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 635, in sendMessage
    context.load_cert_chain(cert)
TypeError: certfile should be a valid filesystem path

This is especially true for the case when DIRAC is sourced from CVMFS (not installed).

To address this issue, in #218 I moved the setting of these variables as early as possible.

  2. NOT SOLVED: Testing the above PR (#218) when DIRAC is sourced from CVMFS, and we're using certificates and not proxies, generates the following (https://github.com/fstagni/Pilot/actions/runs/6719269671/job/18260562881):
2023-11-01T11:50:36.304812Z DEBUG [PilotParams] Release project: 
2023-11-01T11:50:44.286345Z DEBUG [PilotParams] X509_CERT_DIR is set in the host environment as /cvmfs/grid.cern.ch/etc/grid-security/certificates, aligning installEnv to it
2023-11-01T11:50:44.287539Z DEBUG [PilotParams] X509_VOMS_DIR is set in the host environment as /cvmfs/grid.cern.ch/etc/grid-security/vomsdir, aligning installEnv to it
2023-11-01T11:50:44.287626Z DEBUG [PilotParams] X509_VOMSES is not set in the host environment
2023-11-01T11:50:44.287688Z DEBUG [PilotParams] Candidate directory for X509_VOMSES is /cvmfs/grid.cern.ch/etc/grid-security/vomses
2023-11-01T11:50:44.289156Z DEBUG [PilotParams] Setting X509_VOMSES=/cvmfs/grid.cern.ch/etc/grid-security/vomses
2023-11-01T11:50:44.289448Z INFO [Pilot] Remote logger activated
Traceback (most recent call last):
  File "/home/runner/work/Pilot/Pilot/Pilot/dirac-pilot.py", line 76, in <module>
    log.buffer.flush()
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 503, in wrapper
    return func(self, *args, **kwargs)
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 588, in flush
    self.senderFunc(buf)
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 6[31](https://github.com/fstagni/Pilot/actions/runs/6719269671/job/18260562881#step:6:32), in sendMessage
    res = urlopen(url, data, context=context)
  File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/usr/lib/python3.10/urllib/request.py", line 6[34](https://github.com/fstagni/Pilot/actions/runs/6719269671/job/18260562881#step:6:35), in http_response
    response = self.parent.error(
  File "/usr/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: Unauthorized
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 511, in run
    self.function(*self.args, **self.kwargs)
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 503, in wrapper
    return func(self, *args, **kwargs)
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 588, in flush
    self.senderFunc(buf)
  File "/home/runner/work/Pilot/Pilot/Pilot/pilotTools.py", line 631, in sendMessage
    res = urlopen(url, data, context=context)
  File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: Unauthorized
Error: Process completed with exit code 1.

This is because the host certificate is not recognized as such. A DIRAC client would add {"extraCredentials": "hosts"} to the credentials dictionary, but this is not done when running before DIRAC is installed.
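
As a hedged sketch of what the pilot-side sender could do to mimic that (the payload layout below is illustrative only, not the actual service contract):

import json

payload = {"method": "sendMessage", "args": json.dumps(["<log buffer>", "<pilot UUID>"])}
# mimic what a DIRAC client would add when authenticating with a host certificate
payload["extraCredentials"] = json.dumps("hosts")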

  3. NOT SOLVED: Testing the above PR (#218) when DIRAC is installed locally, and we're using certificates and not proxies, generates the following (https://github.com/fstagni/Pilot/actions/runs/6719269671/job/18260566428):
2023-11-01T11:51:13.939924Z INFO [Pilot] Remote logger activated
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/__w/Pilot/Pilot/Pilot/pilotTools.py", line 511, in run
    self.function(*self.args, **self.kwargs)
  File "/__w/Pilot/Pilot/Pilot/pilotTools.py", line 503, in wrapper
    return func(self, *args, **kwargs)
  File "/__w/Pilot/Pilot/Pilot/pilotTools.py", line 588, in flush
    self.senderFunc(buf)
  File "/__w/Pilot/Pilot/Pilot/pilotTools.py", line 623, in sendMessage
    context.load_verify_locations(capath=caPath)
TypeError: cafile, capath and cadata cannot be all omitted

Again, this is because X509_CERT_DIR and X509_USER_PROXY are not defined at this stage.
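
A defensive sketch (illustrative only: it does not provide the missing variables, it just fails with a clearer message than the bare TypeError above):

import os
import ssl

caPath = os.environ.get("X509_CERT_DIR")
context = ssl.create_default_context()
if caPath and os.path.isdir(caPath):
    context.load_verify_locations(capath=caPath)
else:
    raise RuntimeError("X509_CERT_DIR is not set, cannot verify the remote logging endpoint")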


I don't know how to solve 2) and 3).

Maybe for 2) the simplest solution would be to use a diracx service.

Merge devel branch?

Hi @fstagni,

Would it be possible to merge some of the older features from the devel branch out to production now? (I'm mainly interested in the rolling output patch, but there are some others in there too that are a few months old now).

Regards,
Simon

Question: Tags/RequiredTags added twice to `pilot.cfg`

Not a real issue, just a question related to DIRACGrid/DIRAC#7086

Tags/RequiredTags are added both in /LocalSite and /Resources/Computing/CEDefaults. Is that expected?

See here:

for queueParamName, queueParamValue in self.pp.queueParameters.items():
    if isinstance(queueParamValue, list):  # for the tags
        queueParamValue = ",".join([str(qpv).strip() for qpv in queueParamValue])
    self.cfg.append("-o /LocalSite/%s=%s" % (queueParamName, quote(queueParamValue)))

And here:

self.pp.tags = list(set(self.pp.tags))
if self.pp.tags:
    self.cfg.append('-o "/Resources/Computing/CEDefaults/Tag=%s"' % ",".join((str(x) for x in self.pp.tags)))
self.pp.reqtags = list(set(self.pp.reqtags))
if self.pp.reqtags:
    self.cfg.append(
        '-o "/Resources/Computing/CEDefaults/RequiredTag=%s"' % ",".join((str(x) for x in self.pp.reqtags))
    )
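
So, for a hypothetical queue defining Tag = ["MultiProcessor"], the same value ends up twice in pilot.cfg:

-o /LocalSite/Tag=MultiProcessor
-o "/Resources/Computing/CEDefaults/Tag=MultiProcessor"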

Thanks

Setup of DIRACOS does not restrict cpu cores

When the Pilot creates its DIRACOS environment, it directly calls DIRACOS-Linux-machine.sh, which eventually invokes mamba. Mamba assumes that it can use all cores of the machine (see mamba-org/mamba#2463), which isn't realistic for Pilot environments. This leads to excessive process creation, which can negatively affect the pilot, the user, or even the entire compute resource.

As far as I can tell, the templates from which DIRACOS is generated do not provide a feasible way to limit this internally. The Pilot thus seems like the best place to do it, seeing how it is aware of resource restrictions.
A solution would be to set MAMBA_EXTRACT_THREADS when installing DIRACOS, either to a conservative 1 or to pp.maxNumberOfProcessors, as sketched below.
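
A minimal sketch of that suggestion (the helper name is hypothetical; the pilot would pass the returned environment to the DIRACOS installer):

import os

def diracosInstallEnv(maxNumberOfProcessors=1):
    # cap mamba's extraction parallelism (MAMBA_EXTRACT_THREADS is
    # honoured by mamba, see mamba-org/mamba#2463 above)
    env = dict(os.environ)
    env["MAMBA_EXTRACT_THREADS"] = str(max(1, maxNumberOfProcessors))
    return env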


For a sense of scale: we caught this on a WLCG Tier 1 WN with 256 cores that got allocated mostly to one VO. Each of the single-core pilots tried to use 256 child processes; each pilot quickly ground to a halt due to resource limits and fork-bomb protection, which caused each new pilot to also immediately get stuck on nproc limits and similar safeguards.

Purging variables in pilotCommand installEnv

As far as I can tell, the GFAL (etc.) variables survive the purging by the diracos2/diracosrc because the original environment is kept from the start:

self.installEnv = os.environ

and is then only updated with the new values from diracosrc, obtained here:
retCode, output = self.executeAndGetOutput('bash -c "source diracos/diracosrc && env"', self.pp.installEnv)

updating the installEnv dict:
self.pp.installEnv[var] = value

but variables that do not exist in the new environment are never popped, and then the job agent runs with installEnv:
retCode, _output = self.executeAndGetOutput(jobAgent, self.pp.installEnv)

Should we just do self.pp.installEnv = {} before filling in the updated values?
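
A standalone sketch of that behaviour (parseEnvOutput is a hypothetical helper, not an existing pilot function):

def parseEnvOutput(output):
    # build installEnv from scratch so variables absent from the sourced
    # diracosrc environment (GFAL_*, etc.) cannot leak in from os.environ
    installEnv = {}
    for line in output.splitlines():
        var, sep, value = line.partition("=")
        if sep:
            installEnv[var] = value
    return installEnv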

Pilot3 pipeline Jenkins

The field pilot_options in Jenkins needs a trailing whitespace to work properly, e.g. --dirac-os . If run without it, the pilot complains because the options get concatenated and are passed as --dirac-os-M 1 -S DIRAC-Certification ...

Recent pilot patch breaks command line tools in jobs

Hi,

Following the merge of #185, we've started getting errors from users whose jobs now fail to find the dirac.cfg when they run any dirac-* commands in their job scripts. As an emergency fix, I've swapped the GridPP production server over to a local branch without that commit, but it would be good to agree on what the behaviour here should be (as we frequently seem to run into pilot.cfg/dirac.cfg problems!).

Tagging @marianne013 .

Regards,
Simon

[v7r1] backport XRD_RUNFORKHANDLER fix to v7r1

Wrt DIRACGrid/DIRAC#4616
As far as I can tell, the pilots in v7r1 are still using the dirac-install.py from the Pilot repo (rather than the management repo). It looks like the dirac-install scripts have diverged a bit, but would it be possible to backport the XRD_RUNFORKHANDLER fix to this version as well? (I only just found this as I was trying to remove the "test the new code" override of our pilot configs.)
@chrisburr

(for PilotLogging) find the VO from the proxy, and from the token

Follow-up of DIRACGrid/DIRAC#6208 (comment)

An easy way to get the VO from a proxy would be using voms-proxy-info, e.g.
$ voms-proxy-info -file /tmp/x509up_u49429 -vo
lhcb
but this of course relies on voms-proxy-info being on the WN before DIRAC is installed, or on using a CVMFS-deployed version, like the one in dirac.egi.eu/client/diracos/. This would most probably be OK for several LCG resources, but not necessarily for all resources (and on HPCs we might be in trouble). Base openssl is not enough.
We can of course rely on "sensible defaults" if we fail to get the VO from the proxy (and ~soon from the token).
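
A sketch of that fallback (the function name and the default are illustrative assumptions):

import shutil
import subprocess

def getVOFromProxy(proxyFile, default="unknown"):
    # use voms-proxy-info if available on the WN (or from a CVMFS-deployed
    # client), otherwise fall back to a "sensible default"
    vpi = shutil.which("voms-proxy-info")
    if vpi:
        try:
            return subprocess.check_output([vpi, "-file", proxyFile, "-vo"]).decode().strip()
        except subprocess.CalledProcessError:
            pass
    return default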

Verify if it is possible to do it with basic openssl.
