
gridpackmachine's People

Contributors: ggonzr, justinasr, sihyunjeon

gridpackmachine's Issues

Update aliases to sync files

The Lxplus main alias (lxplus.cern.ch) is used to access CERN internal nodes to perform a wide variety of tasks, such as, in this case, synchronizing files with /eos via SSH sessions. The resolution of this alias will be updated to point to EL9 nodes (AlmaLinux 9 or RHEL 9) [1]. To avoid incompatibility issues, update all the source code sections that reference this alias and use lxplus7.cern.ch as the value instead.

Is your feature related to a problem?

If the alias is not updated, then after the migration is performed there could be issues opening SSH sessions to sync files to /eos/ folders, related to the new node configuration.

Describe the solution you'd like

Change all the source code sections to retrieve the remote hostname via environment/configuration variables, and set it to reference the current pool of nodes, lxplus7.cern.ch.

Current behavior

Some components hard-code the remote node alias to lxplus.cern.ch.

Expected behavior

Components that run instructions requiring CERN internal nodes take the remote hostname from configuration variables instead of a fixed default value.
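A minimal sketch of the expected behavior, assuming an environment variable is used (the name `LXPLUS_HOST` is an assumption, not the application's actual configuration key):

```python
import os

# Hypothetical helper: resolve the remote SSH host from the environment
# instead of hard-coding the alias. LXPLUS_HOST is an assumed variable name.
def get_remote_host(default: str = "lxplus7.cern.ch") -> str:
    """Return the host used to open SSH sessions that sync files to /eos."""
    return os.environ.get("LXPLUS_HOST", default)
```

With this, switching the pool of nodes only requires changing one configuration value instead of touching every source code section.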

References
[1] OTG0076556 - lxplus.cern.ch alias switch to EL9 resources. Available at: https://cern.service-now.com/service-portal?id=outage&n=OTG0076556

CSV compatibility

Extract a CSV file for all generators and processes, then import the CSV file back with additional information (Campaign, Tune, Events, GEN productions).
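A possible round trip could look like the following sketch (the field names are assumptions, not the application's actual schema):

```python
import csv
from io import StringIO

# Assumed column layout: generator/process exported first, extra columns
# (campaign, tune, events) filled in by hand and imported back.
FIELDS = ["generator", "process", "campaign", "tune", "events"]

def export_csv(rows):
    """Serialize a list of dicts to CSV text with a fixed header."""
    buf = StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def import_csv(text):
    """Parse CSV text back into a list of dicts."""
    return list(csv.DictReader(StringIO(text)))
```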

OS migration

Hi, we have a new issue regarding the OS of the cluster nodes used by the Machine.
Let me tag @agrohsje and @sihyunjeon so they can follow up/comment on this.

Is your feature related to a problem?

At some point during the break we had to change the condor submission files so that submitted jobs run in Singularity containers, due to the OS change on the condor nodes. This solved the problem of not being able to submit jobs at all, but in the last couple of weeks (before the winter break) we have seen that some jobs are kept IDLE far longer than normal conditions would justify (a 64-core job took ~2 weeks during the Christmas break to be scheduled).

Therefore we think this issue might be related to the migration to AL9 in some way, but we are not sure how.

Current behavior

@sihyunjeon implemented a solution here that modifies the OS used on the node for the gridpack generation: MY.WantOS = "el9". So currently, according to the CERN docs on HTCondor, this runs an EL9 container.
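For reference, the relevant submit-file line looks like this (a sketch of the context only, not the Machine's exact template):

```
# HTCondor submit file excerpt (sketch): request an EL9 OS for the job,
# which the batch system satisfies by running it in an EL9 container
MY.WantOS = "el9"
```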

Provided test bench

We can try anything by submitting gridpacks. Right now (January 2024) there are three jobs stuck in IDLE that have been like that for quite some time.

Points of discussion

We think something is not working properly in the job submission on condor, so we want to make sure that the change in the condor submission file is the correct approach.

Set number of `cores` and `memory` for the Gridpack job via UI

Allow GridpackMachine application users to manually set some attributes related to the hardware resources for the Gridpack generation job (HTCondor job). Specifically, the number of cores and memory used by the job.

Is your feature related to a problem?

Sometimes the GEN production team needs to set different job core configurations to run tests faster, above the default value (16 cores). However, increasing this value beyond the default implies longer wait times before resources are assigned by the HTCondor master node. Furthermore, each change to the number of cores currently requires human intervention from the PdmV development team, which also prevents applying these changes automatically.

Describe the solution you'd like

Include two new fields in the Gridpack creation panel to allow users to manually set the number of cores and memory that will be assigned to the HTCondor execution. Set their values as follows:

  1. For the number of cores, create a drop-down list with values: 4, 8, 16, 32, and 64 to allow the user to choose from it.
  2. For the memory, take each core option and multiply it by 1000 (i.e., 1000 MB per core).

Constraints

  • Prevent users from setting a memory value lower than (#cores * 1000 MB), e.g. 64 cores with only 4000 MB of memory. It is unclear whether such a mismatch could cause side effects during the execution of the Gridpack job.
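The presets and the constraint above can be sketched as follows (the names are illustrative, not the application's actual API):

```python
# Proposed drop-down values for the number of cores.
CORE_OPTIONS = [4, 8, 16, 32, 64]

def default_memory_mb(cores: int) -> int:
    """Proposed default memory: 1000 MB per core."""
    return cores * 1000

def validate_resources(cores: int, memory_mb: int) -> None:
    """Reject core counts outside the presets and memory below the floor."""
    if cores not in CORE_OPTIONS:
        raise ValueError(f"cores must be one of {CORE_OPTIONS}")
    if memory_mb < cores * 1000:
        raise ValueError("memory must be at least #cores * 1000 MB")
```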

Current behavior

All Gridpack generation jobs (HTCondor jobs) are provided with a default value of execution cores and memory (#cores * 2000 MB).

Expected behavior

Users are able to manually set the number of cores and memory provisioned for a Gridpack generation job based on a group of preset values.

Skipping the gridpack generation and only parsing the fragments

It would be nice if, for some of the samples, we could skip the gridpack generation part and just use the one existing in CVMFS if applicable.

We might have some cases where we want to use the same gridpack and only change the fragments. Then everything in [1] is ignorable except the things related to "fragment".

[1] https://github.com/cms-PdmV/GridpackFiles/blob/master/Cards/MadGraph5_aMCatNLO/DY/DYGto2LG-1Jets_MLL-4to50_PTG-100to200_amcatnloFXFX-pythia8/DYGto2LG-1Jets_MLL-4to50_PTG-100to200_amcatnloFXFX-pythia8.json

Can we create a channel that takes the following arguments from the JSON file:

"parent_process" : DY
"parent_dataset" : "DYGto2LG-1Jets_MLL-4to50_PTG-100to200_amcatnloFXFX-pythia8"

and make the Python script search for

/cvmfs/cms.cern.ch/phys_generator/gridpacks/PdmV/Run3Summer22/DY/DYGto2LG-1Jets_MLL-50_PTG-50to100_amcatnloFXFX-pythia8_*

and use this gridpack instead of submitting a new gridpack job, only making new fragments with a new prepid?
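The lookup could be sketched like this (the path layout follows the example above; the campaign default and function names are assumptions):

```python
import glob

# Base path taken from the CVMFS example in the issue text.
CVMFS_BASE = "/cvmfs/cms.cern.ch/phys_generator/gridpacks/PdmV"

def gridpack_pattern(parent_process, parent_dataset, campaign="Run3Summer22"):
    """Build the glob pattern for existing gridpacks of a dataset."""
    return f"{CVMFS_BASE}/{campaign}/{parent_process}/{parent_dataset}_*"

def find_existing_gridpack(parent_process, parent_dataset,
                           campaign="Run3Summer22"):
    """Return the newest matching gridpack path, or None if there is none."""
    matches = sorted(glob.glob(
        gridpack_pattern(parent_process, parent_dataset, campaign)))
    return matches[-1] if matches else None
```

If `find_existing_gridpack` returns a path, the job submission could be skipped and only the fragment rebuilt with a new prepid.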

Set the requested disk space to 30GB for Gridpack jobs

The default disk space assigned to a job in HTCondor is 20 GB [1]. Some Gridpack jobs tend to consume close to ~19.5 GB or more. This causes the job to be evicted from execution and restarted, repeating the cycle indefinitely.

Expected behavior

The disk space requested for a Gridpack job is set to 30 GB so that the eviction cycle described above is mitigated.

Possible solution

Set the attribute RequestDisk in the HTCondor submit file to 3e7 (30 * 1e6). This attribute is measured in KB.
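The possible solution above maps to a single submit-file line (a sketch, with 30 GB expressed in KB as noted):

```
# HTCondor submit file excerpt (sketch): request 30 GB of disk, in KB
RequestDisk = 30000000
```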

References
[1] Resources and limits – CERN Batch Jobs. Available at: https://batchdocs.web.cern.ch/local/submit.html#resources-and-limits

Stream job progress to files into AFS

Currently, all the logs related to the execution of a Gridpack job are stored locally in the worker node. Colleagues would like to check them without performing a request to retrieve them from the source.

Is your feature related to a problem?

More than a functionality problem, this is a UX issue. Users cannot freely check the output of their jobs without human intervention from PdmV Service personnel.

Describe the solution you’d like

Standard output and error for submitted jobs are streamed into files stored in a public area that allows colleagues to read them.

Current behavior

All logs for the running jobs are stored internally in the worker node and are only accessible to PdmV Service personnel.

Expected behavior

All logs for the running jobs are stored in a public folder that allows CMS members to read them. Logs are streamed to AFS and are also sent by email when the job finishes.

Possible solutions

  1. Set the stream_output and stream_error attributes in the job configuration file.
  2. Modify the job execution script to store the job progress in a file using the redirection (>) operator.
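The first option could look like the following submit-file sketch (the AFS path is hypothetical):

```
# HTCondor submit file excerpt (sketch): stream logs while the job runs.
# The AFS destination below is a hypothetical public area, not the
# service's actual path.
output        = /afs/cern.ch/work/g/gridpacks/public/logs/job.$(ClusterId).out
error         = /afs/cern.ch/work/g/gridpacks/public/logs/job.$(ClusterId).err
stream_output = True
stream_error  = True
```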

Avoid creating McM requests for invalid Gridpacks

Sometimes it is not possible to capture the file name of the resulting Gridpack created by the batch job. In that case, no name is assigned and the fragment builder sets an invalid placeholder when the McM request is created. The API should verify that the file name is valid; otherwise, it should skip creating the request and set the Gridpack as failed.

Expected behavior

Only Gridpacks with the status done or reused that set the values archive or archive_reused, respectively, should be considered valid for submitting an McM request. Otherwise, the Gridpack should be labeled as failed.

Current behavior

All Gridpacks that require a batch job to create this artifact create an McM request regardless of whether there is an output file or not.

Possible solution

Create a new method inside the Gridpack class to check that all the constraints described above are met. Otherwise, record the reason why the McM request is not going to be created. Finally, send an email notifying about this situation.

Steps to reproduce

It is difficult to reproduce this bug because it relies on rare failures of HTCondor or of the SSH session that retrieves the output. However, the potential failure can be seen in the code.

A Gridpack that requires a batch job to create this artifact finishes, then:

  1. Scans the remote output folder to retrieve all the files available.
  2. For each line of the standard output available after scanning the folder, filter the content to check whether one matches the desired output format.
  3. If none matches, an empty string is set as the output file name.
  4. The resulting McM request attempts to use a Gridpack file that doesn't exist.
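The proposed guard could be sketched as follows (the statuses and field names follow the issue text; the Gridpack class layout itself is assumed):

```python
# Map each valid terminal status to the field that must carry the file name.
VALID_OUTPUT_FIELD = {"done": "archive", "reused": "archive_reused"}

def can_create_mcm_request(status: str, data: dict) -> bool:
    """Allow an McM request only when a valid output file name is present."""
    field = VALID_OUTPUT_FIELD.get(status)
    if field is None:
        return False
    # An empty file name means the batch output could not be captured
    # (step 3 above), so the Gridpack should be labeled as failed instead.
    return bool(data.get(field, ""))
```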

UI enhancements: Gridpack's pagination and automatic approval

This issue tracks some UI improvements for the web page aimed at refining the UX a little: in this case, how many items are displayed on the web page, and automatic approval for all the new Gridpack requests created in this application.

Is your feature related to a problem?

Mainly UX issues. Users want to perform fewer clicks when executing requests in this application, and to see only a small set of elements on the main page, the most recent ones.

Describe the solution you'd like

  • For the number of elements displayed on the page, group all gridpacks into pages of 10 elements (this number could change after some experimentation) and display one page at a time. Include two new labels showing the current page and the range of items it contains. Allow users to change the page via a drop-down element.

  • For automatic approval, include a new button in the creation wizard (modal) that fulfills this requirement. Create all the Gridpacks and try to automatically approve them as soon as the creation action finishes.
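The pagination logic could be sketched like this (the page size and names are illustrative):

```python
# Proposed page size; subject to change after experimentation.
PAGE_SIZE = 10

def paginate(items, page):
    """Return the items of a 1-indexed page plus the (first, last) item range
    to show in the new labels."""
    start = (page - 1) * PAGE_SIZE
    end = min(start + PAGE_SIZE, len(items))
    return items[start:end], (start + 1, end)
```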

Current behavior

All Gridpacks are shown on the same main page, and for a new batch of Gridpack requests each one has to be approved manually.

Expected behavior

Gridpacks are shown in small groups, and it is possible to create and approve several requests without performing a click for each of them.
