xenonnt / outsource Goto Github PK

View Code? Open in Web Editor NEW

3.0 3.0 2.0 436 KB

Job submission code for XENONnT

Python 92.86% Shell 4.51% Perl 2.63%

outsource's People

Contributors

Stargazers

Watchers

Forkers

rynge napoliion

outsource's Issues

Noop workflow generated for `afterpulses`

No idea why it is happening. Currently we are not able to process afterpulses, and example workflow can be found here, which is a noop:

(XENONnT_development) [yuanlq@ap23 outsource]$  cd /home/yuanlq/software/outsource ; /usr/bin/env /cvmfs/xenon.opensciencegrid.org/releases/nT/development/anaconda/envs/XENONnT_development/bin/python /home/yuanlq/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 50337 -- /home/yuanlq/software/outsource/bin/outsource --run 51777 --name debug_051777_1 --context xenonnt_offline --detector tpc 
2024-02-07 19:04:33,088 - utilix - DEBUG - Token exists at /home/yuanlq/.dbtoken
2024-02-07 19:04:33,088 - utilix - DEBUG - Token exists at /home/yuanlq/.dbtoken
2024-02-07 19:04:33,089 - utilix - DEBUG - Token is valid.
2024-02-07 19:04:33,089 - utilix - DEBUG - Token is valid.
*** Detector definition message ***
You are currently using the default XENON10 template detector.

2024-02-07 19:04:39,303 - utilix - DEBUG - Token exists at /home/yuanlq/.dbtoken
2024-02-07 19:04:39,303 - utilix - DEBUG - Token exists at /home/yuanlq/.dbtoken
DEBUG:utilix:Token exists at /home/yuanlq/.dbtoken
2024-02-07 19:04:39,303 - utilix - DEBUG - Token is valid.
2024-02-07 19:04:39,303 - utilix - DEBUG - Token is valid.
DEBUG:utilix:Token is valid.
2024-02-07 19:04:39,362 - utilix - DEBUG - Token exists at /home/yuanlq/.dbtoken
2024-02-07 19:04:39,362 - utilix - DEBUG - Token exists at /home/yuanlq/.dbtoken
DEBUG:utilix:Token exists at /home/yuanlq/.dbtoken
2024-02-07 19:04:39,362 - utilix - DEBUG - Token is valid.
2024-02-07 19:04:39,362 - utilix - DEBUG - Token is valid.
DEBUG:utilix:Token is valid.
You specified _auto_append_rucio_local=True and you are not on dali compute nodes, so we will add the following rucio local path:  /project/lgrandi/rucio/
/cvmfs/xenon.opensciencegrid.org/releases/nT/development/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/straxen/url_config.py:743: UserWarning: From straxen version 2.1.0 onward, URLConfig parameterswill be sorted alphabetically before being passed to the plugins, this will change the lineage hash for non-sorted URLs. To load data processed with non-sorted URLs, you will need to use an older version.
  warnings.warn(
Skipping neutron_veto data
Skipping muon_veto data
Run modes for runs passing the basic queires: tpc_pmtap
The following are the run numbers passing the basic queries:
[51777]
------------------------------------------
Run modes for runs passing the basic queires and have raw data available: tpc_pmtap
The following are the run numbers passing the basic queries and have raw data available:
[51777]
------------------------------------------
The following are the run numbers passing the basic queries and have no to_process data available:
[51777]
------------------------------------------
You specified _auto_append_rucio_local=True and you are not on dali compute nodes, so we will add the following rucio local path:  /project/lgrandi/rucio/
2024-02-07 19:14:19 WARNING:  Using cutax: /ospool/uc-shared/project/xenon/xenonnt/software/cutax/v1-16-0.tar.gz
defaultdict(<class 'collections.OrderedDict'>, {})
defaultdict(<class 'collections.OrderedDict'>, {})
defaultdict(<class 'collections.OrderedDict'>, {})
edu.isi.pegasus.planner.catalog.transformation.TransformationFactoryException:  Unable to connect to Transformation Catalog with properties{directory=/scratch/yuanlq/workflows/generated/debug_051777_1/transformations}
at edu.isi.pegasus.planner.catalog.transformation.TransformationFactory.loadInstance(TransformationFactory.java:293)
at edu.isi.pegasus.planner.catalog.transformation.TransformationFactory.loadTransformationStoreFromDirectories(TransformationFactory.java:390)
at edu.isi.pegasus.planner.catalog.transformation.TransformationFactory.loadTransformationStoreFromDirectories(TransformationFactory.java:351)
at edu.isi.pegasus.planner.catalog.transformation.TransformationFactory.loadInstanceWithStores(TransformationFactory.java:102)
at edu.isi.pegasus.planner.client.CPlanner.executeCommand(CPlanner.java:450)
at edu.isi.pegasus.planner.client.CPlanner.executeCommand(CPlanner.java:328)
at edu.isi.pegasus.planner.client.CPlanner.main(CPlanner.java:206)
java.lang.RuntimeException: The File to be used as TC should be defined with the property pegasus.catalog.transformation.file
at edu.isi.pegasus.planner.catalog.transformation.impl.YAML.connect(YAML.java:167)
at edu.isi.pegasus.planner.catalog.transformation.TransformationFactory.loadInstance(TransformationFactory.java:292)
at edu.isi.pegasus.planner.catalog.transformation.TransformationFactory.loadInstanceWithStores(TransformationFactory.java:158)
at edu.isi.pegasus.planner.catalog.transformation.TransformationFactory.loadInstanceWithStores(TransformationFactory.java:104)
at edu.isi.pegasus.planner.client.CPlanner.executeCommand(CPlanner.java:450)
at edu.isi.pegasus.planner.client.CPlanner.executeCommand(CPlanner.java:328)
at edu.isi.pegasus.planner.client.CPlanner.main(CPlanner.java:206)
2024.02.07 19:19:07.797 CST:
2024.02.07 19:19:07.802 CST:   -----------------------------------------------------------------------
2024.02.07 19:19:07.807 CST:   File for submitting this DAG to HTCondor           : xenonnt-0.dag.condor.sub
2024.02.07 19:19:07.812 CST:   Log of DAGMan debugging messages                   : xenonnt-0.dag.dagman.out
2024.02.07 19:19:07.817 CST:   Log of HTCondor library output                     : xenonnt-0.dag.lib.out
2024.02.07 19:19:07.823 CST:   Log of HTCondor library error messages             : xenonnt-0.dag.lib.err
2024.02.07 19:19:07.828 CST:   Log of the life of condor_dagman itself            : xenonnt-0.dag.dagman.log
2024.02.07 19:19:07.833 CST:
2024.02.07 19:19:07.838 CST:   -no_submit given, not submitting DAG to HTCondor. You can do this with:
2024.02.07 19:19:07.848 CST:   -----------------------------------------------------------------------
2024.02.07 19:19:09.578 CST:   Database version: '5.0.7dev' (sqlite:////home/yuanlq/.pegasus/workflow.db)
2024.02.07 19:19:10.836 CST:   Pegasus database was successfully created.
2024.02.07 19:19:11.317 CST:   Database version: '5.0.7dev' (sqlite:////scratch/yuanlq/workflows/runs/straxen_v2.2.0/xenonnt_offline/debug_051777_1/xenonnt-0.replicas.db)
2024.02.07 19:19:11.369 CST:   Output replica catalog set to jdbc:sqlite:/scratch/yuanlq/workflows/runs/straxen_v2.2.0/xenonnt_offline/debug_051777_1/xenonnt-0.replicas.db
[WARNING]  Submitting to condor xenonnt-0.dag.condor.sub
2024.02.07 19:19:12.436 CST:   Time taken to execute is 5.929 seconds
Worfklow written to 

        /scratch/yuanlq/workflows/runs/straxen_v2.2.0/xenonnt_offline/debug_051777_1

Failure when trying to download a rucio-corrupted file

See this for example: /scratch/yuanlq/workflows/runs/straxen_v2.2.1/xenonnt_offline/daniel_20240409/00/03/events_ID0000291.out.002. The failure happened here:

        strax.storage.common.DataCorrupted: Cannot open metadata for xnt_051694:peak_positions_cnn-iongm54rho

The issue is that this rule xnt_051694:peak_positions_cnn is incomplete. See this

(XENONnT_development) yuanlq@ap23:/scratch/yuanlq/workflows/runs/straxen_v2.2.1/xenonnt_offline/daniel_20240409/00/03$ rucio list-rules xnt_051694:peak_positions_cnn-iongm54rho
/cvmfs/xenon.opensciencegrid.org/releases/nT/development/anaconda/envs/XENONnT_development/bin/rucio:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').run_script('rucio-clients==32.8.0', 'rucio')
ID                                ACCOUNT     SCOPE:NAME                                STATE[OK/REPL/STUCK]    RSE_EXPRESSION      COPIES    SIZE    EXPIRES (UTC)    CREATED (UTC)
--------------------------------  ----------  ----------------------------------------  ----------------------  ------------------  --------  ------  ---------------  -------------------
e3667ddd171242b3a46121928bda4f09  production  xnt_051694:peak_positions_cnn-iongm54rho  OK[3/0/0]               UC_MIDWAY_USERDISK  1         N/A                      2024-04-07 06:54:55
(XENONnT_development) yuanlq@ap23:/scratch/yuanlq/workflows/runs/straxen_v2.2.1/xenonnt_offline/daniel_20240409/00/03$ rucio list-files xnt_051694:peak_positions_cnn-iongm54rho
/cvmfs/xenon.opensciencegrid.org/releases/nT/development/anaconda/envs/XENONnT_development/bin/rucio:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').run_script('rucio-clients==32.8.0', 'rucio')
+-------------------------------------------------+--------------------------------------+-------------+------------+----------+
| SCOPE:NAME                                      | GUID                                 | ADLER32     | FILESIZE   | EVENTS   |
|-------------------------------------------------+--------------------------------------+-------------+------------+----------|
| xnt_051694:peak_positions_cnn-iongm54rho-000000 | 7422C99D-721C-43EC-BE4D-1DB2E932C4D1 | ad:3eb8895b | 66.119 MB  |          |
| xnt_051694:peak_positions_cnn-iongm54rho-000002 | C3B3BC57-4F75-464F-BFF4-20D6DA9F119D | ad:cc34ae4d | 66.412 MB  |          |
| xnt_051694:peak_positions_cnn-iongm54rho-000004 | 04106F2F-674D-4042-B5F0-210582EDA52E | ad:ad6b2416 | 65.421 MB  |          |
+-------------------------------------------------+--------------------------------------+-------------+------------+----------+
Total files : 3
Total size : 197.952 MB

We want a mechanism that allow download to fail, but just modify the to process list.

Want to check chunk number in runstrax too

There is evidence that the patch we added in #95 is not fully solving the problems.

ValueError: Cannot merge chunks with different number of items: [[048149.peaks: 1665769375sec 34340400 ns - 1665769395sec 762648600 ns, 55791 items, 11.3 MB/s], [048149.peak_positions: 1665769375sec 34340400 ns - 1665769395sec 762648600 ns, 55791 items, 0.1 MB/s], [048149.peak_shadow: 1665769375sec 34340400 ns - 1665769395sec 762648600 ns, 55790 items, 0.2 MB/s], [048149.peak_se_density: 1665769375sec 34340400 ns - 1665769395sec 762648600 ns, 55790 items, 0.1 MB/s], [048149.cut_position_shadow_peak: 1665769375sec 34340400 ns - 1665769395sec 762648600 ns, 55791 items, 0.0 MB/s]]

Abnormally high resource usage

Below is an example in event processing. Are we cleaning up temp files if previous trial failed? How is it possible that processing a event level data will use more than 50GB disk and 50GB ram? Example

001 (26796748.000.000) 2023-07-01 06:01:11 Job executing on host: <10.23.185.214:39969?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector1#21535007%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-d55c]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector1#4989687&PrivNet=Expanse-PATH-EP.osgvo-docker-pilot-ospool-557fd4f758-mh2jj&addrs=10.23.185.214-39969&alias=Expanse-PATH-EP.osgvo-docker-pilot-ospool-557fd4f758-mh2jj&noUDP&sock=startd_5550_6ab6>
        SlotName: slot1_13@Expanse-PATH-EP.osgvo-docker-pilot-ospool-557fd4f758-mh2jj
        CondorScratchDir = "/pilot/osgvo-pilot-RNAyDu/execute/dir_2795961"
        Cpus = 1
        Disk = 60993116
        GLIDEIN_ResourceName = "Expanse-PATH-EP"
        GPUs = 0
        Memory = 48000
...
006 (26796748.000.000) 2023-07-01 06:01:22 Image size of job updated: 2988064
        111  -  MemoryUsage of job (MB)
        112972  -  ResidentSetSize of job (KB)
...
006 (26796748.000.000) 2023-07-01 06:06:23 Image size of job updated: 3604928
        111  -  MemoryUsage of job (MB)
        112972  -  ResidentSetSize of job (KB)
...
006 (26796748.000.000) 2023-07-01 06:26:24 Image size of job updated: 4973060
        2384  -  MemoryUsage of job (MB)
        2440300  -  ResidentSetSize of job (KB)
...
007 (26796748.000.000) 2023-07-01 06:26:25 Shadow exception!
        Error from slot1_13@Expanse-PATH-EP.osgvo-docker-pilot-ospool-557fd4f758-mh2jj: disk usage exceeded request_disk
        0  -  Run Bytes Sent By Job
        9513319  -  Run Bytes Received By Job
...
012 (26796748.000.000) 2023-07-01 06:26:26 Job was held.
        Error from slot1_13@Expanse-PATH-EP.osgvo-docker-pilot-ospool-557fd4f758-mh2jj: disk usage exceeded request_disk
        Code 21 Subcode 104

More details can be found in /scratch/yuanlq/workflows/runs/straxen_v2.1.0/xenonnt_offline/sr1_kr83m_401-427_0

Adding `SDSC_NSDF_USERDISK` into `_rse_site_map`

Hi @rynge! Do you know what should we do to add the coming rse SDSC_NSDF_USERDISK to here?

Suspicious combine jobs at surfsara

In principle we don't want any jobs to run outside USA, otherwise it might leads to transferring headache.

This is an example bad batch which basically did nothing. You can see massive failure in combine jobs: /scratch/yuanlq/workflows/runs/straxen_v2.1.1/xenonnt_offline/sr1_hotspot_batch_1_1. For example this piece from /scratch/yuanlq/workflows/runs/straxen_v2.1.1/xenonnt_offline/sr1_hotspot_batch_1_1/00/09/combine_ID0000463.out.004

---------------pegasus-multipart
- transfer_attempts:
  - src_url: "file:///srv/pegasus.N7Q2EGPSS/046503-peaklets-combined.tar.gz"
    src_label: "condorpool"
    dst_url: "gsiftp://xenon-gridftp.grid.uchicago.edu:2811/xenon/workflow_scratch/yuanlq/sr1_hotspot_batch_1_1/00/08/046503-peaklets-combined.tar.gz"
    dst_label: "staging"
    success: False
    start: 1689999168
    duration: 0.0
    lfn: "046503-peaklets-combined.tar.gz"
  - src_url: "file:///srv/pegasus.N7Q2EGPSS/046503-peaklets-combined.tar.gz"
    src_label: "condorpool"
    dst_url: "gsiftp://xenon-gridftp.grid.uchicago.edu:2811/xenon/workflow_scratch/yuanlq/sr1_hotspot_batch_1_1/00/08/046503-peaklets-combined.tar.gz"
    dst_label: "staging"
    success: False
    start: 1689999307
    duration: 0.0
    lfn: "046503-peaklets-combined.tar.gz"
  - src_url: "file:///srv/pegasus.N7Q2EGPSS/046503-peaklets-combined.tar.gz"
    src_label: "condorpool"
    dst_url: "gsiftp://xenon-gridftp.grid.uchicago.edu:2811/xenon/workflow_scratch/yuanlq/sr1_hotspot_batch_1_1/00/08/046503-peaklets-combined.tar.gz"
    dst_label: "staging"
    success: False
    start: 1689999609
    duration: 0.0
    lfn: "046503-peaklets-combined.tar.gz"

We need to go through codes again to see where it could lead to such issue.

Want to retire CMT features

More details here in slack.

Need to test runlist functionality

The current runlist doesn't seem to be working... Will add failure example here.

Embed reprox targets into outsource

So we don't waste time on midway for cuts etc processing

Reconsider memory/disk assignment

Currently, just in case that we have problems with memory/disk, we are assigning over-safe amount of memory/disk. However, this is slowing down our iteration, especially when the failure is NOT due to lack of resources. (#57 as the bottleneck).

We need to re-evaluate the following:

How much resource shall we add after each failure?
How much resource shall we give each kind of job?

Want to make `standalone_download` into some configuration

Right now we want it to be True instead of False, so that it won't try to download raw_records from non-OSG locations

Optimizations

We should do some optimization studies to determine the best:

The cluster size per job -- this line
Memory required
Disk required

`_validate_x509_proxy` thread not alive error

This issue started yesterday. Same codes ran well last week.

'Thread' object has no attribute 'isAlive'
  File "/home/yuanlq/software/outsource/outsource/Shell.py", line 46, in run
    if thread.isAlive():
  File "/home/yuanlq/software/outsource/outsource/Outsource.py", line 516, in _validate_x509_proxy
    shell.run()
  File "/home/yuanlq/software/outsource/outsource/Outsource.py", line 178, in submit_workflow
    self._validate_x509_proxy()
  File "/home/yuanlq/software/outsource/bin/outsource", line 229, in main
    outsource.submit_workflow()
  File "/home/yuanlq/software/outsource/bin/outsource", line 233, in <module>
    main()
AttributeError: 'Thread' object has no attribute 'isAlive'

Need to reconsider processing flow

The current data flow is described here. However, so far in v11 we observed a very different success rate between peak level processing and event level processing. My hypothesis for this phenomenon is that most event processing suffer from data transferring bottleneck. We need to check the following:

What is the threshold in batch size that event-level processing will start to become difficult?
Will it be better if we only process peak-level or event-level in one single batch? By doing so Midway will only be heavy on either downloading or uploading.

Upload problem rooted in uncleaned rucio metadata

Not really a problem for outsource itself. Original slack thread here.

Right now, I am trying to upload this file /ospool/uc-shared/project/xenon/yuanlq/data/051689-peaklets-euocvpkv3y/peaklets-euocvpkv3y-000144, which you can see doesn’t exist yet as a did.

(XENONnT_development) yuanlq@ap23:~/software/admix$ rucio list-files xnt_051689:peaklets-euocvpkv3y
/cvmfs/xenon.opensciencegrid.org/releases/nT/development/anaconda/envs/XENONnT_development/bin/rucio:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').run_script('rucio-clients==1.23.14', 'rucio')
+---------------------------------------+--------------------------------------+-------------+------------+----------+
| SCOPE:NAME                            | GUID                                 | ADLER32     | FILESIZE   | EVENTS   |
|---------------------------------------+--------------------------------------+-------------+------------+----------|
| xnt_051689:peaklets-euocvpkv3y-000049 | 1C38CE71-BB8E-46DB-BBE1-1407D5DD160B | ad:0fe6cf5c | 114.435 MB |          |
| xnt_051689:peaklets-euocvpkv3y-000056 | 3AC903B8-5B44-4A2A-92B5-EA4F60346454 | ad:ac35d526 | 119.025 MB |          |
| xnt_051689:peaklets-euocvpkv3y-000245 | 083366AB-C92B-4A67-9CB8-0FD416FAC691 | ad:1dd01840 | 121.214 MB |          |
+---------------------------------------+--------------------------------------+-------------+------------+----------+
Total files : 3
Total size : 354.674 MB
(XENONnT_development) yuanlq@ap23:~/software/admix$ rucio ls xnt_051689:peaklets-euocvpkv3y-000144
/cvmfs/xenon.opensciencegrid.org/releases/nT/development/anaconda/envs/XENONnT_development/bin/rucio:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').run_script('rucio-clients==1.23.14', 'rucio')
+--------------+--------------+
| SCOPE:NAME   | [DID TYPE]   |
|--------------+--------------|
+--------------+--------------+

However, when I upload by rucio.client.uploadclient.UploadClient().upload , it gives this error:

>>> import rucio
>>> client = rucio.client.uploadclient.UploadClient()
>>> to_upload=[{'path': '/ospool/uc-shared/project/xenon/yuanlq/data/051689-peaklets-euocvpkv3y/peaklets-euocvpkv3y-000144', 'rse': 'UC_OSG_USERDISK', 'did_scope': 'xnt_051689', 'did_name': 'peaklets-euocvpkv3y-000144', 'dataset_scope': 'xnt_051689', 'dataset_name': 'peaklets-euocvpkv3y', 'register_after_upload': False, 'lifetime': None}]
>>> client.upload(to_upload)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cvmfs/xenon.opensciencegrid.org/releases/nT/development/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/rucio_clients-1.23.14-py3.9.egg/rucio/client/uploadclient.py", line 203, in upload
    self._register_file(file, registered_dataset_dids)
  File "/cvmfs/xenon.opensciencegrid.org/releases/nT/development/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/rucio_clients-1.23.14-py3.9.egg/rucio/client/uploadclient.py", line 392, in _register_file
    raise DataIdentifierAlreadyExists
rucio.common.exception.DataIdentifierAlreadyExists: Data Identifier Already Exists.

After a bit more source codes reading, it seems that the issue is in metadata… For those who I can successfully upload, before I upload so there is NO metadata, for example:

(XENONnT_development) yuanlq@ap23:~/software/admix$ rucio get-metadata xnt_051689:peaklets-euocvpkv3y-000055
/cvmfs/xenon.opensciencegrid.org/releases/nT/development/anaconda/envs/XENONnT_development/bin/rucio:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').run_script('rucio-clients==1.23.14', 'rucio')
2024-02-04 18:34:03,014	ERROR	Data identifier not found.
Details: Data identifier 'xnt_051689:peaklets-euocvpkv3y-000055' not found

What I don’t understand is why the first upload attempt for a specific file could fail because the metadata for that specific file already existed. Is it possible that somehow the purging (delete rules + erase) we did before neglected the metadata? It seems indeed that this file’s metadata (for which file I cannot upload because of rucio.common.exception.DataIdentifierAlreadyExists: Data Identifier Already Exists.) has escaped all purging since Dec 2023…

(XENONnT_development) yuanlq@ap23:~/software/admix$ rucio get-metadata xnt_051689:peaklets-euocvpkv3y-000052
/cvmfs/xenon.opensciencegrid.org/releases/nT/development/anaconda/envs/XENONnT_development/bin/rucio:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').run_script('rucio-clients==1.23.14', 'rucio')
access_cnt:      None
accessed_at:     None
account:         production
adler32:         62ff996e
availability:    AVAILABLE
bytes:           131695330
campaign:        None
closed_at:       None
complete:        None
constituent:     None
created_at:      2023-12-02 23:01:52
datatype:        None
deleted_at:      None
did_type:        FILE
eol_at:          None
events:          None
expired_at:      None
guid:            30f3d69612574d78ae52cbbb499d87ff
hidden:          False
is_archive:      None
is_new:          True
is_open:         None
length:          None
lumiblocknr:     None
md5:             b441be3291a5200cf5bacc99a3d02b06
monotonic:       False
name:            peaklets-euocvpkv3y-000052
obsolete:        False
panda_id:        None
phys_group:      None
prod_step:       None
project:         None
provenance:      None
purge_replicas:  True
run_number:      None
scope:           xnt_051689
stream_name:     None
suppressed:      False
task_id:         None
transient:       False
updated_at:      2023-12-02 23:01:52
version:         None

This explains why for some runs we have never been able to process.

Massive failure in combine jobs when staging/uploading

An example: /scratch/yuanlq/workflows/runs/straxen_v2.1.1/xenonnt_offline/43043_1.

This happens even when the workload is pretty low.

----------------Task #1 - combine - ID0000001 - Kickstart stderr----------------

 WARNING: X509_CERT_DIR is set set and could lead to problems when using this environment
/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/straxen/url_config.py:711: UserWarning: From straxen version 2.1.0 onward, URLConfig parameterswill be sorted alphabetically before being passed to the plugins, this will change the lineage hash for non-sorted URLs. To load data processed with non-sorted URLs, you will need to use an older version.
  warnings.warn("From straxen version 2.1.0 onward, URLConfig parameters"
No <rechunk_to_mb> specified!
No <rechunk_to_mb> specified!
No <rechunk_to_mb> specified!
No <rechunk_to_mb> specified!
Traceback (most recent call last):
  File "/srv/pegasus.PGR8F6hoQ/./combine.py", line 161, in <module>
    main()
  File "/srv/pegasus.PGR8F6hoQ/./combine.py", line 157, in main
    admix.upload(this_path, rse=rse, did=dataset_did, update_db=args.update_db)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/admix/uploader.py", line 80, in upload
    clients.upload_client.upload(to_upload)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/rucio_clients-1.23.14-py3.9.egg/rucio/client/uploadclient.py", line 330, in upload
rucio.common.exception.NoFilesUploaded: None of the given files have been uploaded.

Want array job submission

Basically a wrapper executable around outsource. Currently, if we submit 100 runs, the logs will be quite unreadable. However, if we submit 100 workflows for each run, it will be better.

Deeper strax processing

Right now we still only process raw_records to records. We need to do the next step of processing to peaks, and ideally have (configurable) functionality to go all the way to the highest level (though right now the plan is for that to be done on midway).

We need to do per-chunk jobs for raw_records-->records and records-->peaks processing. For those we should probably do on the same job; i.e. for a given chunk, we process raw_records-->records-->peaks. Then we send the two outputs out on their separate journeys.

Manually check length equality between metadata and chunks in combine jobs

Discussed with @dachengx. The storage black magic here could have manually bypassed many safety checks which should be in frontend, leading to the mismatch between length in meta data and actual chunk. However, we don't understand the mechanism yet.

Just for safety, consider using similar idea of this guy, to do something right before uploading here.

Need a better estimation how much ram is enough for a certain run

Given the SR1 physical situation...

Clean up `combine.py`

This is probably introducing extra burden in memory. Want to remove at least the trial to get_array. It was introduced when we were trying to debug the problem when we have mismatch between rucio/disk for peak-level data.

Want to handle `ref_mon_nv `

Original thread here. Should be doable since we just need to adjust the datatypes in nv pipeline.

Want to add `admix.preupload` before `admix.upload`

Details here in admix PR.. So that we can resolve the orphan file issue, assuming the dataset did has been deleted and erased already.

Can we add `event_pattern_fit` into the to-process list?

Otherwise we will have to compute it on midway

PegasusClientError

(XENONnT_development) yuanlq@login:~$ outsource --context xenonnt_offline --force --detector tpc --runlist ~/generated_runlist.txt --name radon_batch0_2
RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf . Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem .
Skipping neutron_veto data
Skipping muon_veto data
Run modes: tpc_radon_hev
2023-06-13 15:17:45 WARNING:  Using cutax: /xenon/xenonnt/software/cutax/v1-14-3.tar.gz
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:25<00:00,  8.56s/it]
defaultdict(<class 'collections.OrderedDict'>, {})
defaultdict(<class 'collections.OrderedDict'>, {})
defaultdict(<class 'collections.OrderedDict'>, {})
defaultdict(<class 'collections.OrderedDict'>, {})
2023.06.13 15:18:14.132 CDT: [WARNING]  Unable to determine the version of condor
2023.06.13 15:18:14.162 CDT: [WARNING]  Unable to determine the version of condor
2023.06.13 15:18:16.262 CDT:
2023.06.13 15:18:16.267 CDT:   -----------------------------------------------------------------------
2023.06.13 15:18:16.273 CDT:   File for submitting this DAG to HTCondor           : xenonnt-0.dag.condor.sub
2023.06.13 15:18:16.278 CDT:   Log of DAGMan debugging messages                 : xenonnt-0.dag.dagman.out
2023.06.13 15:18:16.284 CDT:   Log of HTCondor library output                     : xenonnt-0.dag.lib.out
2023.06.13 15:18:16.289 CDT:   Log of HTCondor library error messages             : xenonnt-0.dag.lib.err
2023.06.13 15:18:16.294 CDT:   Log of the life of condor_dagman itself          : xenonnt-0.dag.dagman.log
2023.06.13 15:18:16.300 CDT:
2023.06.13 15:18:16.305 CDT:   -no_submit given, not submitting DAG to HTCondor.  You can do this with:
2023.06.13 15:18:16.315 CDT:   -----------------------------------------------------------------------
2023.06.13 15:18:18.063 CDT:   Database version: '5.0.2' (sqlite:////home/yuanlq/.pegasus/workflow.db)
2023.06.13 15:18:22.157 CDT:   Pegasus database was successfully created.
2023.06.13 15:18:22.162 CDT:   Database version: '5.0.2' (sqlite:////scratch/yuanlq/workflows/runs/straxen_v2.0.6/xenonnt_offline/radon_batch0_2/xenonnt-0.replicas.db)
2023.06.13 15:18:22.227 CDT:   Output replica catalog set to jdbc:sqlite:/scratch/yuanlq/workflows/runs/straxen_v2.0.6/xenonnt_offline/radon_batch0_2/xenonnt-0.replicas.db
[WARNING]  Submitting to condor xenonnt-0.dag.condor.sub
2023.06.13 15:18:23.021 CDT: [FATAL ERROR]
[1] java.lang.RuntimeException: Unable to submit the workflow using pegasus-run at edu.isi.pegasus.planner.client.CPlanner.executeCommand(CPlanner.java:667)
Traceback (most recent call last):
  File "/home/yuanlq/.local/bin/outsource", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/yuanlq/software/outsource/bin/outsource", line 220, in <module>
    main()
  File "/home/yuanlq/software/outsource/bin/outsource", line 216, in main
    outsource.submit_workflow()
  File "/home/yuanlq/software/outsource/outsource/Outsource.py", line 175, in submit_workflow
    self._plan_and_submit(wf)
  File "/home/yuanlq/software/outsource/outsource/Outsource.py", line 475, in _plan_and_submit
    wf.plan(conf=base_dir + '/workflow/pegasus.conf',
  File "/opt/pegasus/current/lib64/python3.6/site-packages/Pegasus/api/_utils.py", line 85, in wrapper
    assert f(self, *args, **kwargs) == None
  File "/opt/pegasus/current/lib64/python3.6/site-packages/Pegasus/api/workflow.py", line 937, in wrapper
    return f(self, *args, **kwargs)
  File "/opt/pegasus/current/lib64/python3.6/site-packages/Pegasus/api/workflow.py", line 1264, in plan
    workflow_instance = self._client.plan(
  File "/opt/pegasus/current/lib64/python3.6/site-packages/Pegasus/client/_client.py", line 249, in plan
    rv = self._exec(cmd, stream_stdout=False, stream_stderr=True)
  File "/opt/pegasus/current/lib64/python3.6/site-packages/Pegasus/client/_client.py", line 720, in _exec
    raise PegasusClientError("Pegasus command: {} FAILED".format(cmd), result)
Pegasus.client._client.PegasusClientError: Pegasus command: ['/opt/pegasus/current/bin/pegasus-plan', '--conf', '/home/yuanlq/software/outsource/outsource/workflow/pegasus.conf', '--sites', 'condorpool', '--staging-site', 'condorpool=staging', '--dir', '/scratch/yuanlq/workflows/runs/straxen_v2.0.6/xenonnt_offline', '--relative-dir', 'radon_batch0_2', '--cleanup', 'inplace', '--submit', 'workflow.yml', '--json'] FAILED

Want better `--help` info

Provide example for --runlist RUNLIST Path to a runlist file
Example for --run
More details for --dry-run

Want to overwrite package like what we did for cutax

Upload local tarball and override the software version in environment. This should be very beneficial for debugging.

Add flag to disable runDB

We should have an optional flag in case we don't want to update the runDB.

Checksum after uploading in combine jobs

In light of solving #57, we want to impose checksum after uploading. Namely in combine and runstrax

Gfal+rucio when binding /cvmfs

Sometimes jobs fail because of problems downloading from rucio, giving an error of Protocol implementation not found.

The rucio API doesn’t always give the most useful errors so I tried replicating on the login node. It seems like there is some conflict between environments if /cvmfs is binded to the container, causing gfal errors, see the gist I link below. I’m not 100% sure this is exactly what’s happening on the grid but the error does look similar. So maybe some of the jobs land on sites that don’t see /cvmfs or the env is setup in such a way that it doesn’t run into this error? I don't know why else the jobs would succeed sometimes and fail others when downloading files from the same run.

If I don’t bind /cvmfs when starting the container, I can do rucio downloads without a problem, but I can’t use outsource because it can’t find Pegasus. I can see Pegasus binaries are installed in the container -- I guess that’s separate from the python bindings though? At least in my local repo I use the pegasus at pegasus_path = /cvmfs/oasis.opensciencegrid.org/osg/projects/pegasus/rhel7/4.9.2dev and I can't import pegasus if I don't append this to my path.

The errors I got and debugging process are outlined in this gist: https://gist.github.com/ershockley/a72a72a3dad8403d174dc5248cc2f777

@rynge any ideas?

`standalone_download` leads to failure in download jobs

This commit seems to trigger the following problems.

---------------Task #1 - download - ID0000010 - Kickstart stderr----------------

 WARNING: X509_CERT_DIR is set set and could lead to problems when using this environment
RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf . Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem .
RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf . Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem .
ls: cannot access data/*: No such file or directory
ls: cannot access data/*: No such file or directory

2023-06-18 23:43:22,905:ERROR:pegasus-analyzer(1515): Workflow Failed  wf_uuid: 72040480-3e60-4291-9edc-2c188cdb1b4a submit dir: /scratch/yuanlq/workflows/runs/straxen_v2.0.6/xenonnt_offline/AmBe_single_test_master
2023-06-18 23:43:22,905:ERROR:pegasus-analyzer(2147): One or more workflows failed

Tensorflow AVX issue

@rynge is it possible to specify sites that have AVX cpus? It looks like we are seeing a version of this issue again. I can reproduce the error on the xenon login node like this:

(XENONnT_development) Singularity> python -c "import tensorflow"
Illegal instruction

(XENONnT_development) Singularity> cat /proc/cpuinfo  | grep flags | grep avx
# returns nothing

Whereas on e.g. midway I have no such issue:

Singularity xenonnt-development.simg:~> python -c "import tensorflow"
2021-03-15 22:16:27.390327: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/XENONnT/anaconda/envs/XENONnT_development/lib64:/opt/XENONnT/anaconda/envs/XENONnT_development/lib:/opt/rh/devtoolset-9/root/usr/lib64:/opt/rh/devtoolset-9/root/usr/lib:/opt/rh/devtoolset-9/root/usr/lib64/dyninst:/opt/rh/devtoolset-9/root/usr/lib/dyninst:/opt/rh/devtoolset-9/root/usr/lib64:/opt/rh/devtoolset-9/root/usr/lib
2021-03-15 22:16:27.390378: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

Singularity xenonnt-development.simg:~> cat /proc/cpuinfo  | grep flags | grep avx
# returns all the cpus

I am then seeing the illegal instruction error out on the grid, on sites that I presume do not have AVX. This only affects the high-level job.

Want to force save peaks

This is related to the peaks-loading problem. We will force saving peaks and check the loading. How to implement will be the new issue.

Want priority computing sites options

Something like:
If some data lives in EU, then prefer EU computing sites while making US possible.

Remove `max_worker` and `allow_multiple` in computation

Which might lead to huge consumption in memory

Test EU grid

We will start from removing US requirement here.
- Making this an option in xenon_config maybe
How to cooperate with this

Want to impose rucio file check after upload

So to avoid empty rules and incomplete upload.

Test ms and separate array info datatypes

Need to unite configs

Like #23. Currently we are scattering configs here and there. At least in the following:

Config.py
RunConfig.py
pegasus.conf
xenon_config untracked in this repo.
Detector types hardcoded in outsource bin
resource allocations in Outsource.py
default memory in _job, which is later used in download job
combine job memory
raw type download sites

Need a refactor -- memory issues

It should've been clear at the beginning, but creating a large list of DBConfig objects does not scale well to large reprocessings. We should refactor things a bit to have outsource take a list of run numbers, not a list of these relatively heavy config objects. Also many attributes of the DBConfig class are more natural for the Outsource class (like context, force_rerun, etc).

Update README

We should update the README with more info. I'll do it but making the reminder here.

Want to add `event_n_channel ` into event-level production chain

This is the new situation since this straxen PR

Specifying raw_records_rse is not working

Details here. We are still downloading from Europe even though we specified SDSC_NSDF_USERDISK as raw_records_rse.

Make more configurable in xenon_config

We keep hitting issues in the following way when processing:

disk quota blows up
memory blows up

The way to solve these troubles, for now, is to hardcode the Outsource.py, which is a shame. We will collect more stuff beyond disk and memory to add into user's configs in this thread.

Excessive rucio metadata lookup

We crashed rucio server, and one suspect is that outsource jobs are doing overwhelming rucio metadata lookup. One example from a peaklet job

[root@rucio-xenon rucio_httpd]# grep mwt2-c071.campuscluster.illinois.edu httpd_access_log | grep 11/Dec/2023:12
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:29:12 -0600] "GET /auth/x509_proxy HTTP/1.1" 200 -
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:29:12 -0600] "GET /accounts/whoami HTTP/1.1" 303 -
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:29:12 -0600] "GET /accounts/production HTTP/1.1" 200 202
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/aqmon_hits-gsmdrzd6gz HTTP/1.1" 404 128
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/peak_classification_bayes-gkrokfhp5z HTTP/1.1" 404 143
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/corrected_areas-o44rjrrdci HTTP/1.1" 404 133
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/raw_records-rfzvpzj4mf HTTP/1.1" 200 186
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/raw_records-rfzvpzj4mf/rules HTTP/1.1" 200 1630
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/raw_records_he-rfzvpzj4mf HTTP/1.1" 200 189
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/raw_records_he-rfzvpzj4mf/rules HTTP/1.1" 200 1632
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/raw_records_aqmon-rfzvpzj4mf HTTP/1.1" 200 192
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/raw_records_aqmon-rfzvpzj4mf/rules HTTP/1.1" 200 1638
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/raw_records_nv-rfzvpzj4mf HTTP/1.1" 200 189
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/raw_records_nv-rfzvpzj4mf/rules HTTP/1.1" 200 1632
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/raw_records_aqmon_nv-rfzvpzj4mf HTTP/1.1" 200 195
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/raw_records_aqmon_nv-rfzvpzj4mf/rules HTTP/1.1" 200 1643
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:27 -0600] "GET /dids/xnt_049794/raw_records_aux_mv-rfzvpzj4mf HTTP/1.1" 200 193
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/raw_records_aux_mv-rfzvpzj4mf/rules HTTP/1.1" 200 1639
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/raw_records_mv-rfzvpzj4mf HTTP/1.1" 200 189
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/raw_records_mv-rfzvpzj4mf/rules HTTP/1.1" 200 1632
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/detector_time_offsets-xyl6d4sclw HTTP/1.1" 404 139
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/distinct_channels-qzbcespqp3 HTTP/1.1" 404 135
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/energy_estimates-c2fdcw5wcr HTTP/1.1" 404 134
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_ambience-b7qndgom42 HTTP/1.1" 404 132
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_area_per_channel-ywkbseoogl HTTP/1.1" 404 140
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_n_channel-ywkbseoogl HTTP/1.1" 404 133
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_basics-o5culyytu5 HTTP/1.1" 404 130
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_info-5ovx4l2iul HTTP/1.1" 404 128
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_info_double-22sfnn5kim HTTP/1.1" 404 135
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_ms_naive-epbbboxkuj HTTP/1.1" 404 132
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_pattern_fit-cahobtcvru HTTP/1.1" 404 135
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peak_per_event-klsx7noczh HTTP/1.1" 404 132
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_positions-dayaj7r4yj HTTP/1.1" 404 133
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_s1_positions_cnn-hvvmp65uq2 HTTP/1.1" 404 140
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_s2_positions_cnn-bh27t2hzlw HTTP/1.1" 404 140
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_s2_positions_gcn-d5gu4vewsv HTTP/1.1" 404 140
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_s2_positions_mlp-f6ytzu6qqn HTTP/1.1" 404 140
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_shadow-zum2aprycl HTTP/1.1" 404 130
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_top_bottom_params-urabkrxdqc HTTP/1.1" 404 141
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_waveform-loszf2t5d6 HTTP/1.1" 404 132
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_w_bayes_class-yhok6eq7b3 HTTP/1.1" 404 137
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/raw_records_diagnostic-5jnmouazik HTTP/1.1" 404 140
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/individual_peak_monitor-spjp2pl4hl HTTP/1.1" 404 141
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/afterpulses-uf63o427z7 HTTP/1.1" 404 129
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/led_calibration-lsdigsccxn HTTP/1.1" 404 133
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/event_local_min_info-hazyatbm5u HTTP/1.1" 404 138
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/merged_s2s-wpzg6lsm2m HTTP/1.1" 404 128
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/merged_s2s_he-hohrh5yh6o HTTP/1.1" 404 131
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/online_monitor_mv-if3qqljiby HTTP/1.1" 404 135
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/online_monitor_nv-z2cushqctj HTTP/1.1" 404 135
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/online_peak_monitor-etr2pjxerr HTTP/1.1" 404 137
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peak_ambience-aglg7ad3jl HTTP/1.1" 404 131
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peak_basics-6bdyxhzzfz HTTP/1.1" 404 129
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peak_basics_he-4bjhf4bqfd HTTP/1.1" 404 132
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peak_corrections-tqortpgfmp HTTP/1.1" 404 134
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peak_positions_cnn-iongm54rho HTTP/1.1" 404 136
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peak_positions_gcn-sgi5hgfujv HTTP/1.1" 404 136
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peak_positions_mlp-zfge4gdktl HTTP/1.1" 404 136
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peak_proximity-abfkn7nmwv HTTP/1.1" 404 132
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peak_s1_positions_cnn-vfvtclcw3u HTTP/1.1" 404 139
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peak_shadow-pctjku7ovb HTTP/1.1" 404 129
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peak_top_bottom_params-366morvvx5 HTTP/1.1" 404 140
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peaklet_classification-p3m6pr2fhz HTTP/1.1" 404 140
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peaklet_classification_he-jc5epv3vtq HTTP/1.1" 404 143
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peaklets-euocvpkv3y HTTP/1.1" 200 183
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peaklets-euocvpkv3y/rules HTTP/1.1" 200 -
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/lone_hits-euocvpkv3y HTTP/1.1" 200 184
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/lone_hits-euocvpkv3y/rules HTTP/1.1" 200 -
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peaklets_he-qsiaztqf6c HTTP/1.1" 404 129
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peaks-5i3zhnt5vx HTTP/1.1" 404 123
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/peaks_he-iakptlspfu HTTP/1.1" 404 126
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/records-hwcemk7dbt HTTP/1.1" 404 125
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:28 -0600] "GET /dids/xnt_049794/veto_regions-hwcemk7dbt HTTP/1.1" 200 187
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:29 -0600] "GET /dids/xnt_049794/veto_regions-hwcemk7dbt/rules HTTP/1.1" 200 803
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:29 -0600] "GET /dids/xnt_049794/pulse_counts-hwcemk7dbt HTTP/1.1" 200 187
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:29 -0600] "GET /dids/xnt_049794/pulse_counts-hwcemk7dbt/rules HTTP/1.1" 200 803
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:29 -0600] "GET /dids/xnt_049794/records_he-cbkz5nmyc4 HTTP/1.1" 404 128
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:29 -0600] "GET /dids/xnt_049794/pulse_counts_he-cbkz5nmyc4 HTTP/1.1" 404 133
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:29 -0600] "GET /dids/xnt_049794/s2_recon_pos_diff-5srxt25yui HTTP/1.1" 404 135
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:29 -0600] "GET /dids/xnt_049794/veto_intervals-hyu26wo3bv HTTP/1.1" 404 132
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:29 -0600] "GET /dids/xnt_049794/veto_proximity-c2u2t7vf5z HTTP/1.1" 404 132
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:29 -0600] "GET /dids/xnt_049794/events_sync_mv-e3gwcxlgee HTTP/1.1" 404 132
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:29 -0600] "GET /dids/xnt_049794/events_mv-5wxwp4k6sz HTTP/1.1" 404 127
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:29 -0600] "GET /dids/xnt_049794/hitlets_mv-yf3ezxotyn HTTP/1.1" 404 128
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:29 -0600] "GET /dids/xnt_049794/records_mv-gsrdzepgyt HTTP/1.1" 404 128
mwt2-c071.campuscluster.illinois.edu - - [11/Dec/2023:12:31:29 -0600] "GET /dids/xnt_049794/event_positions_nv-l6qg7r6k5v HTTP/1.1" 404 136
mwt2-c071.campuscluster.illinois.edu...

It is suspected to be what strax's rucio backend is doing. The question here:

Can we bypass this in strax?
If not, can we download manually what we need and then not using rucio remote?

`notification_email` is not working

Trying to fetch a chunk which doesn't exist

More details can be found in /scratch/yuanlq/workflows/runs/straxen_v2.1.0/xenonnt_offline/sr0_ar37_786-787_0.

----------------Task #1 - events - ID0000032 - Kickstart stderr-----------------

 WARNING: X509_CERT_DIR is set set and could lead to problems when using this environment
/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/straxen/url_config.py:707: FutureWarning: From straxen version 2.1.0 onward, URLConfig parameters will be sorted alphabetically before being passed to the plugins, this will change the lineage hash for non-sorted URLs. To load data processed with non-sorted URLs, you will need to use an older version.
  warnings.warn("From straxen version 2.1.0 onward, URLConfig parameters will be sorted alphabetically before being passed to the plugins, this will change the lineage hash for non-sorted URLs. To load data processed with non-sorted URLs, you will need to use an older version.", FutureWarning)
/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/chunk.py:364: NumbaExperimentalFeatureWarning: Record(Start time since unix epoch [ns][type=int64;offset=0;title=Start time since unix epoch [ns]],time[type=int64;offset=0;title=Start time since unix epoch [ns]],Length of the interval in samples[type=int32;offset=8;title=Length of the interval in samples],length[type=int32;offset=8;title=Length of the interval in samples],Width of one sample [ns][type=int32;offset=12;title=Width of one sample [ns]],dt[type=int32;offset=12;title=Width of one sample [ns]],Channel/PMT number[type=int16;offset=16;title=Channel/PMT number],channel[type=int16;offset=16;title=Channel/PMT number],Classification of the peak(let)[type=int8;offset=18;title=Classification of the peak(let)],type[type=int8;offset=18;title=Classification of the peak(let)],Integral across channels [PE][type=float32;offset=19;title=Integral across channels [PE]],area[type=float32;offset=19;title=Integral across channels [PE]],Integral per channel [PE][type=nestedarray(float32, (494,));offset=23;title=Integral per channel [PE]],area_per_channel[type=nestedarray(float32, (494,));offset=23;title=Integral per channel [PE]],Number of hits contributing at least one sample to the peak [type=int32;offset=1999;title=Number of hits contributing at least one sample to the peak ],n_hits[type=int32;offset=1999;title=Number of hits contributing at least one sample to the peak ],Waveform data in PE/sample (not PE/ns!)[type=nestedarray(float32, (200,));offset=2003;title=Waveform data in PE/sample (not PE/ns!)],data[type=nestedarray(float32, (200,));offset=2003;title=Waveform data in PE/sample (not PE/ns!)],Waveform data in PE/sample (not PE/ns!), top array[type=nestedarray(float32, (200,));offset=2803;title=Waveform data in PE/sample (not PE/ns!), top array],data_top[type=nestedarray(float32, (200,));offset=2803;title=Waveform data in PE/sample (not PE/ns!), top array],Peak widths in range of central area fraction [ns][type=nestedarray(float32, (11,));offset=3603;title=Peak widths in range of central area fraction [ns]],width[type=nestedarray(float32, (11,));offset=3603;title=Peak widths in range of central area fraction [ns]],Peak widths: time between nth and 5th area decile [ns][type=nestedarray(float32, (11,));offset=3647;title=Peak widths: time between nth and 5th area decile [ns]],area_decile_from_midpoint[type=nestedarray(float32, (11,));offset=3647;title=Peak widths: time between nth and 5th area decile [ns]],Does the channel reach ADC saturation?[type=nestedarray(int8, (494,));offset=3691;title=Does the channel reach ADC saturation?],saturated_channel[type=nestedarray(int8, (494,));offset=3691;title=Does the channel reach ADC saturation?],Total number of saturated channels[type=int16;offset=4185;title=Total number of saturated channels],n_saturated_channels[type=int16;offset=4185;title=Total number of saturated channels],Channel within tight range of mean[type=int16;offset=4187;title=Channel within tight range of mean],tight_coincidence[type=int16;offset=4187;title=Channel within tight range of mean],Largest gap between hits inside peak [ns][type=int32;offset=4189;title=Largest gap between hits inside peak [ns]],max_gap[type=int32;offset=4189;title=Largest gap between hits inside peak [ns]],Maximum interior goodness of split[type=float32;offset=4193;title=Maximum interior goodness of split],max_goodness_of_split[type=float32;offset=4193;title=Maximum interior goodness of split],Largest time difference between apexes of hits inside peak [ns][type=int32;offset=4197;title=Largest time difference between apexes of hits inside peak [ns]],max_diff[type=int32;offset=4197;title=Largest time difference between apexes of hits inside peak [ns]],Smallest time difference between apexes of hits inside peak [ns][type=int32;offset=4201;title=Smallest time difference between apexes of hits inside peak [ns]],min_diff[type=int32;offset=4201;title=Smallest time difference between apexes of hits inside peak [ns]];4205;False) has been considered a subtype of Record(Start time since unix epoch [ns][type=int64;offset=0;title=Start time since unix epoch [ns]],time[type=int64;offset=0;title=Start time since unix epoch [ns]],Length of the interval in samples[type=int32;offset=8;title=Length of the interval in samples],length[type=int32;offset=8;title=Length of the interval in samples],Width of one sample [ns][type=int32;offset=12;title=Width of one sample [ns]],dt[type=int32;offset=12;title=Width of one sample [ns]],Channel/PMT number[type=int16;offset=16;title=Channel/PMT number],channel[type=int16;offset=16;title=Channel/PMT number],Classification of the peak(let)[type=int8;offset=18;title=Classification of the peak(let)],type[type=int8;offset=18;title=Classification of the peak(let)];19;False)  This is an experimental feature.
  strax.endtime(d))
/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/straxen/storage/mongo_storage.py:340: DownloadWarning: Downloading xnt_gcn_SR0_mix_2000030_2000020_20211211.tar.gz to /tmp/straxen_resource_cache/b8baee8874d4f688303d9d524a2a9746
  warn(f'Downloading {config_name} to {destination_path}',
2023-06-30 06:26:03.335769: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-06-30 06:26:04.249902: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-06-30 06:26:04.251597: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-30 06:26:08.612425: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Exception in thread build:peaks:
Traceback (most recent call last):
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/mailbox.py", line 294, in _send_from
    self.kill_from_exception(e)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/mailbox.py", line 213, in kill_from_exception
    raise e
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/mailbox.py", line 281, in _send_from
    x = next(iterable)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/plugins/plugin.py", line 452, in iter
    self._fetch_chunk(
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/plugins/plugin.py", line 373, in _fetch_chunk
    [self.input_buffer[d], next(iters[d])])
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/mailbox.py", line 438, in _read
    res = msg.result()
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/storage/files.py", line 249, in _read_and_format_chunk
    chunk = super()._read_and_format_chunk(*args, **kwargs)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/storage/common.py", line 508, in _read_and_format_chunk
    data = self._read_chunk(backend_key,
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/straxen/storage/rucio_remote.py", line 164, in _read_chunk
    downloaded = admix.download(chunk_did, rse=rse, location=self.staging_dir)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/admix/downloader.py", line 85, in download
    did_type = get_did_type(did)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/admix/clients.py", line 39, in wrapped
    return func(*args, **kwargs)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/admix/rucio.py", line 99, in get_did_type
    return clients.rucio_client.get_did(scope, name)['type']
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/rucio_clients-1.23.14-py3.9.egg/rucio/client/didclient.py", line 427, in get_did
rucio.common.exception.DataIdentifierNotFound: Data identifier not found.
Details: Data identifier 'xnt_032153:merged_s2s-qkatogwl36-000005' not found
Target Mailbox (peak_positions_gcn) killed, exception <class 'strax.mailbox.MailboxKilled'>, message (<class 'rucio.common.exception.DataIdentifierNotFound'>, DataIdentifierNotFound("Data identifier 'xnt_032153:merged_s2s-qkatogwl36-000005' not found"), <traceback object at 0x1554a4cb5d80>)
Traceback (most recent call last):
  File "/srv/pegasus.UjarGpDUa/./runstrax.py", line 486, in <module>
    main()
  File "/srv/pegasus.UjarGpDUa/./runstrax.py", line 323, in main
    process(runid,
  File "/srv/pegasus.UjarGpDUa/./runstrax.py", line 139, in process
    st.make(runid_str, keystring,
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/context.py", line 1414, in make
    for _ in self.get_iter(run_ids[0], targets,
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/context.py", line 1324, in get_iter
    generator.throw(e)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/context.py", line 1296, in get_iter
    for n_chunks, result in enumerate(strax.continuity_check(generator), 1):
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/chunk.py", line 303, in continuity_check
    for chunk in chunk_iter:
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/processor.py", line 302, in iter
    raise exc.with_traceback(traceback)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/mailbox.py", line 281, in _send_from
    x = next(iterable)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/plugins/plugin.py", line 452, in iter
    self._fetch_chunk(
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/plugins/plugin.py", line 373, in _fetch_chunk
    [self.input_buffer[d], next(iters[d])])
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/mailbox.py", line 438, in _read
    res = msg.result()
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/storage/files.py", line 249, in _read_and_format_chunk
    chunk = super()._read_and_format_chunk(*args, **kwargs)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/strax/storage/common.py", line 508, in _read_and_format_chunk
    data = self._read_chunk(backend_key,
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/straxen/storage/rucio_remote.py", line 164, in _read_chunk
    downloaded = admix.download(chunk_did, rse=rse, location=self.staging_dir)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/admix/downloader.py", line 85, in download
    did_type = get_did_type(did)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/admix/clients.py", line 39, in wrapped
    return func(*args, **kwargs)
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/admix/rucio.py", line 99, in get_did_type
    return clients.rucio_client.get_did(scope, name)['type']
  File "/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.9/site-packages/rucio_clients-1.23.14-py3.9.egg/rucio/client/didclient.py", line 427, in get_did
rucio.common.exception.DataIdentifierNotFound: Data identifier not found.
Details: Data identifier 'xnt_032153:merged_s2s-qkatogwl36-000005' not found