cms-pdmv / gridpackmachine
Repository used for development of gridpack machine
The lxplus main alias (lxplus.cern.ch) is used to access CERN internal nodes to perform a wide variety of tasks, such as, in this case, synchronizing files with /eos via SSH sessions. The resolution for this alias will be updated to point to EL9 nodes (AlmaLinux 9 or RHEL 9) [1]. To avoid compatibility issues, update all the source code sections that reference this alias to use lxplus7.cern.ch instead.
If the alias is not updated, SSH sessions opened after the migration to sync files to /eos/ folders could fail because of the new node configuration.
Change all the source code sections to retrieve the remote hostname via environment/configuration variables, and set those variables to reference the current pool of nodes, lxplus7.cern.ch. Some components currently hard-code the remote node alias to lxplus.cern.ch. After this change, the components that require CERN internal nodes take the remote hostname from configuration variables instead of a fixed default value.
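The change above can be sketched as a small helper; the variable name `REMOTE_SSH_HOST` and the function name are assumptions for illustration, not the application's actual identifiers:

```python
import os

# Minimal sketch: REMOTE_SSH_HOST and the fallback value are assumptions
# for illustration; the real application may use different names.
def get_remote_host() -> str:
    """Read the CERN remote node from the environment, with a fallback."""
    return os.environ.get("REMOTE_SSH_HOST", "lxplus7.cern.ch")
```

With this in place, pointing the application at a different node pool is a deployment-time configuration change rather than a code change.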
References
[1] OTG0076556 - lxplus.cern.ch alias switch to EL9 resources. Available at: https://cern.service-now.com/service-portal?id=outage&n=OTG0076556
Extract a CSV file for all generators and processes, then import the CSV file back with additional information (Campaign, Tune, Events, GEN productions), letting users provide only the process block, which is then substituted into the template for the different MadGraph versions.
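A minimal round-trip sketch of the export/import step; the column names below are assumptions for illustration, not the fields the application actually tracks:

```python
import csv
import io

# Hypothetical column set; the real export would use whatever fields
# the GEN production bookkeeping requires.
FIELDS = ["generator", "process", "campaign", "tune", "events"]

def export_rows(rows):
    """Serialize process rows to CSV text for editing."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def import_rows(text):
    """Parse the edited CSV text back into process rows."""
    return list(csv.DictReader(io.StringIO(text)))
```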
Hi, we have a new issue regarding the OS of the cluster nodes used by the machine.
Let me tag @agrohsje and @sihyunjeon so they can follow up/comment on this.
At some point during the break we had to change the HTCondor submission files so that submitted jobs run in
Singularity containers, due to the change of OS on the Condor nodes. This solved the problem of not being able to submit jobs at all, but in the last couple of weeks (before the winter break) we have seen that some jobs stay IDLE far longer than they should under normal conditions (a 64-core job took ~2 weeks during the Christmas break to start running).
We therefore think this issue might be related to the migration to AL9 in some way, but we are not sure how.
@sihyunjeon implemented a solution here that modifies the OS used on the node for the gridpack generation: MY.WantOS = "el9". So currently, according to the CERN docs on HTCondor, this runs an EL9 container.
We can try anything by submitting gridpacks. Right now (January 2024) there are three jobs that have been IDLE for quite some time.
We think something is not working properly in the Condor job submission, so we want to make sure that the change in the Condor submission file is the correct approach.
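For reference, the relevant submit-file lines look roughly like the fragment below; only the MY.WantOS attribute is quoted in the discussion above, the rest is an assumption about how the surrounding file is laid out:

```
# OS selection via container, per the CERN batch documentation;
# request_cpus shown only as context, the deployed value may differ.
MY.WantOS    = "el9"
request_cpus = 16
```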
Allow GridpackMachine application users to manually set some attributes related to the hardware resources for the Gridpack generation job (HTCondor job). Specifically, the number of cores and memory used by the job.
Sometimes, the GEN production team needs to set job core configurations different from the default value (16 cores) to perform tests faster. However, increasing this value beyond the default implies longer wait times before resources are assigned by the HTCondor master node. Moreover, each update to the number of cores currently requires human intervention from the PdmV development team, which also prevents these changes from being applied automatically.
Include two new fields in the Gridpack creation panel to allow users to manually set the number of cores and the memory assigned to the HTCondor execution. Set their values as follows:
- cores: chosen from a group of preset values.
- memory: a value lower than a factor of (#cores * 1000 MB), e.g. 64 cores and 4000 MB of memory. It is unclear whether this could raise side effects on the execution of the Gridpack job.
Currently, all Gridpack generation jobs (HTCondor jobs) are provided with a default number of execution cores and memory (#cores * 2000 MB).
Users are able to manually set the number of cores and memory provisioned for a Gridpack generation job based on a group of preset values.
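A sketch of the proposed validation; the preset list and function names are assumptions, while the 2000 MB-per-core default comes from the issue text:

```python
# Hypothetical presets; the deployed list may differ.
CORE_PRESETS = (8, 16, 32, 64)
DEFAULT_MB_PER_CORE = 2000

def default_memory_mb(cores: int) -> int:
    """Default memory assigned to a Gridpack job for a given core count."""
    return cores * DEFAULT_MB_PER_CORE

def is_valid_request(cores: int, memory_mb: int) -> bool:
    """Cores must come from a preset; memory may be lowered below the default."""
    return cores in CORE_PRESETS and 0 < memory_mb <= default_memory_mb(cores)
```

Keeping the presets in one place means the UI drop-downs and the server-side check cannot drift apart.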
It would be nice if we could (for some of the samples) skip the gridpack generation part and just use the one existing in CVMFS if applicable.
We might have some cases where we want to use the same gridpack but only change the fragments. Then everything in [1] is ignorable except the things related to "fragment".
Can we create a channel that takes the following arguments from a JSON file:
"parent_process": "DY"
"parent_dataset": "DYGto2LG-1Jets_MLL-4to50_PTG-100to200_amcatnloFXFX-pythia8"
and make the Python script search for
/cvmfs/cms.cern.ch/phys_generator/gridpacks/PdmV/Run3Summer22/DY/DYGto2LG-1Jets_MLL-50_PTG-50to100_amcatnloFXFX-pythia8_*
and use this gridpack instead of submitting a new gridpack job, only making new fragments with a new prepid?
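The lookup described above can be sketched as a glob over the CVMFS area; the base path and JSON keys come from the issue text, while the function names are hypothetical:

```python
import glob
import os

# Base path taken from the issue; campaign subdirectory may vary.
BASE = "/cvmfs/cms.cern.ch/phys_generator/gridpacks/PdmV/Run3Summer22"

def build_pattern(config: dict) -> str:
    """Glob pattern for an existing gridpack matching the parent dataset."""
    return os.path.join(BASE, config["parent_process"],
                        config["parent_dataset"] + "_*")

def find_existing_gridpack(config: dict):
    """Return the first matching gridpack path, or None so the caller can
    fall back to submitting a new gridpack job."""
    matches = sorted(glob.glob(build_pattern(config)))
    return matches[0] if matches else None
```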
The default disk space assigned to a job in HTCondor is 20 GB [1]. Some Gridpack jobs tend to consume close to ~19.5 GB or more. This causes the job to be evicted from execution and restarted, repeating the cycle indefinitely.
The disk space requested for a Gridpack job is set to 30 GB so that the situation described above is mitigated.
Set the attribute RequestDisk in the HTCondor configuration file to 3e7 (30 * 1e6). This attribute is measured in KB.
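In the submit file this amounts to a single line; the arithmetic is 30 GB = 30 * 10^6 KB = 3e7:

```
# RequestDisk is interpreted in KB; 30 GB = 30 * 10^6 KB.
RequestDisk = 30000000
```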
References
[1] Resources and limits – CERN Batch Jobs. Available at: https://batchdocs.web.cern.ch/local/submit.html#resources-and-limits
Currently, all the logs related to the execution of a Gridpack job are stored locally on the worker node. Colleagues would like to check them without having to request that they be retrieved from the source.
More than a functionality problem, this is a UX issue. Users cannot freely check the output of their jobs without human intervention from PdmV Service personnel.
Standard output and error for submitted jobs are streamed into files stored in a public area that allows colleagues to read them.
All logs for the running jobs are stored internally in the worker node and are only accessible to PdmV Service personnel.
All logs for the running jobs are stored in a public folder that allows CMS members to read them. Logs are streamed to AFS and are also sent by email when the job finishes.
This is done via the stream_output and stream_error attributes in the job configuration file.
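A sketch of the relevant submit-file lines; the AFS paths are placeholders, not the actual public area used by the application:

```
# Placeholder AFS paths; the real public area is an assumption here.
output        = /afs/cern.ch/work/placeholder/gridpack_$(ClusterId).out
error         = /afs/cern.ch/work/placeholder/gridpack_$(ClusterId).err
stream_output = True
stream_error  = True
```

With streaming enabled, HTCondor appends stdout/stderr to the target files while the job runs instead of copying them back only at completion.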
Sometimes it is not possible to capture the file name of the resulting Gridpack created by the batch job. Therefore, no name is assigned and the fragment builder sets an invalid placeholder when the McM request is created. The API should verify that the file name is valid; otherwise it should avoid creating the request and set the Gridpack as failed.
Only Gridpacks with the status done or reused that set the values archive or archive_reused, respectively, should be considered valid for submitting an McM request. Otherwise, the Gridpack should be labeled as failed.
Currently, all Gridpacks that require a batch job for creating this artifact create an McM request independently of whether there is an output file or not.
Create a new method inside the Gridpack class to check that all the constraints described before are satisfied. Otherwise, record the reason why the McM request is not going to be created. Finally, send an email notifying about this situation.
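A minimal sketch of the proposed check; the statuses and field names mirror the issue text, while the class shape and method name are assumptions:

```python
# Map each valid status to the field that must hold the output file name.
REQUIRED_FILE_FIELD = {"done": "archive", "reused": "archive_reused"}

class Gridpack:
    def __init__(self, status, archive="", archive_reused=""):
        self.status = status
        self.archive = archive
        self.archive_reused = archive_reused

    def mcm_request_allowed(self):
        """Return (allowed, reason); only done/reused Gridpacks with a
        valid output file name should produce an McM request."""
        field = REQUIRED_FILE_FIELD.get(self.status)
        if field is None:
            return False, f"status {self.status!r} is not done/reused"
        if not getattr(self, field):
            return False, f"missing value for {field}"
        return True, ""
```

The returned reason string can feed directly into the notification email described above.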
It is difficult to reproduce this bug because it relies on rare failures in HTCondor or in the SSH session that retrieves the output. However, the potential failure can be seen in the code.
A Gridpack that requires a batch job to create this artifact finishes, then:
This issue tracks some UI improvements for the web page aimed at refining the UX a little: how many items are displayed on the page, and automatic approval for all the new Gridpack requests created in this application.
These are mainly UX issues. Users want to perform fewer clicks when submitting requests to this application, and see only a small set of elements on the main page, the most recent ones.
For the number of elements displayed, group all gridpacks into pages of 10 elements (this number could change after some experimentation) and display one page at a time. Include two new labels showing the current page and the range of items it contains. Allow users to change the page via a drop-down element.
For automatic approval, include a new button in the creation wizard (modal) that fulfills this requirement. Create all the Gridpacks and try to automatically approve them as soon as the creation action finishes.
Currently, all Gridpacks are shown on the same main page and, for a new batch of Gridpack requests, each of them must be approved manually.
Gridpacks are shown in small pages, and it is possible to create and approve several requests without a click for each of them.
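The paging behavior above can be sketched as a small helper; the function and label format are illustrative, not the application's actual code:

```python
# Hypothetical pagination helper mirroring the proposed 10-items-per-page view.
def paginate(items, page, per_page=10):
    """Return the slice for a 0-based page plus an "x-y of n" range label."""
    start = page * per_page
    chunk = items[start:start + per_page]
    label = f"{start + 1}-{start + len(chunk)} of {len(items)}"
    return chunk, label
```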