
admix's People

Contributors

darrylmasson, e-masson, ershockley, joranangevaare, lucascottolavina, xeboris


Forkers

ershockley

admix's Issues

Too many print statements when uploading.

  • aDMIX version: 0.2.0
  • Python version: 3.6
  • Operating System: CentOS Linux

Description

When uploading a dataset with admix, there are quite a few print statements that seem misleading. For a brand new dataset, I get the following:

An object with the same identifier already exists.
Details: Scope 'xnt' already exists!
Data Identifier Already Exists.
Details: Data Identifier already exists!
An object with the same identifier already exists.
Details: Scope 'xnt_007079' already exists!
Data Identifier Already Exists.
Details: Data Identifier already exists!
Data identifier already added to the destination content.
Details: ['(psycopg2.IntegrityError) duplicate key value violates unique constraint "CONTENTS_PK"\nDETAIL:  Key (scope, name, child_scope, child_name)=(xnt, run_007079, xnt_007079, data) already exists.\n']
An object with the same identifier already exists.
Details: Scope 'xnt_007079' already exists!
Data Identifier Already Exists.
Details: Data Identifier already exists!
Data identifier already added to the destination content.
Details: ['(psycopg2.IntegrityError) duplicate key value violates unique constraint "CONTENTS_PK"\nDETAIL:  Key (scope, name, child_scope, child_name)=(xnt_007079, data, xnt_007079, raw_records-5jnmouazik) already exists.\n']
A duplicate rule for this account, did, rse_expression, copies already exists.
Details: (psycopg2.IntegrityError) duplicate key value violates unique constraint "RULES_SC_NA_AC_RS_CO_UQ_IDX"
DETAIL:  Key (scope, name, account, rse_expression, copies)=(xnt_007079, raw_records-5jnmouazik, production, LNGS_USERDISK, 1) already exists.

As shown above, it repeatedly reports that this DID already exists, which is untrue for a brand-new dataset. We need to fix this.
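
One way the upload path could silence these expected messages is to catch the duplicate error and log it at DEBUG level instead of printing. This is only a sketch: `DuplicateEntry` below is a local stand-in for Rucio's duplicate-identifier exceptions, and the client/function names are hypothetical, not admix's actual API.

```python
import logging

logger = logging.getLogger("admix.upload")

# Stand-in for Rucio's duplicate-identifier exceptions; in admix these
# would be the real exception classes raised by the Rucio client.
class DuplicateEntry(Exception):
    pass


def ensure_dataset(client, scope, name):
    """Create a dataset, quietly skipping the step if it already exists.

    Instead of printing "Data Identifier Already Exists" during a
    brand-new upload, the expected duplicate is logged at DEBUG level.
    """
    try:
        client.add_dataset(scope, name)
    except DuplicateEntry:
        logger.debug("Dataset %s:%s already registered, skipping", scope, name)
```

The same pattern would apply to scope creation, attachment, and rule creation, so a normal upload produces no output unless something genuinely unexpected happens.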

GetBoundary fetches the entire run DB

The GetBoundary method currently fetches the entire run DB, see here. It does use a projection, which helps, but this approach is not scalable and will take a long time once we have tens of thousands of datasets (or more).
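
A more scalable pattern is to let the database return only the boundary document via a sort plus a single-document fetch, rather than pulling every run and scanning client-side. The sketch below assumes a pymongo-style `find_one(filter, projection, sort=...)` interface; the `InMemoryCollection` is just a self-contained stand-in for the real runDB collection.

```python
class InMemoryCollection:
    """Minimal stand-in for a pymongo collection, for illustration only."""

    def __init__(self, docs):
        self._docs = docs

    def find_one(self, filter=None, projection=None, sort=None):
        docs = list(self._docs)
        if sort:
            key, direction = sort[0]
            docs.sort(key=lambda d: d[key], reverse=(direction == -1))
        if not docs:
            return None
        doc = docs[0]
        if projection:
            return {k: doc[k] for k in projection if k in doc}
        return doc


def get_boundary(collection):
    """Return the highest run number without fetching every document."""
    doc = collection.find_one({}, {"number": 1}, sort=[("number", -1)])
    return doc["number"] if doc else None


runs = InMemoryCollection([{"number": n, "data": []} for n in (7079, 10038, 12233)])
print(get_boundary(runs))  # -> 12233
```

With a real MongoDB backend and an index on the sort key, this touches one document instead of the whole collection.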

Complete command line arguments in aDMIX

Description

The aDMIX command line tool offers arguments which can be handed over when starting aDMIX. But these arguments are not yet wired into the data selection process. Therefore these four command line arguments (--rse, --lifetime, --tag, --source) need a proper implementation in the future.

Solve:

The proper implementation belongs in the data selection process, where each module pulls its data selection from the runDB. The --tag and --source arguments are meant to narrow the selection of data to be uploaded. For example, a user may prefer to initiate transfer rules only for data with a specific tag or source.

The --rse and --lifetime arguments are meant to select the destination in Rucio and the file lifetime manually. This might be useful if no database is available.
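
As a sketch of what the wiring could look like: the four option names come from this issue, but the function names and the run-doc fields (`tags`, `source`) are assumptions about the schema, not admix's actual implementation.

```python
import argparse


def make_parser():
    """Parser for the four arguments named in this issue."""
    parser = argparse.ArgumentParser(description="aDMIX upload manager (sketch)")
    parser.add_argument("--rse", help="destination RSE in Rucio")
    parser.add_argument("--lifetime", type=int, help="rule lifetime in seconds")
    parser.add_argument("--tag", help="only select runs carrying this tag")
    parser.add_argument("--source", help="only select runs from this source")
    return parser


def select_runs(runs, tag=None, source=None):
    """Narrow the runDB selection; 'tags'/'source' fields are hypothetical."""
    selected = []
    for run in runs:
        if tag is not None and tag not in run.get("tags", []):
            continue
        if source is not None and run.get("source") != source:
            continue
        selected.append(run)
    return selected


args = make_parser().parse_args(["--tag", "_sciencerun", "--source", "none"])
```

Each module would then pass `args.tag` / `args.source` into its runDB query, and `args.rse` / `args.lifetime` into the rule creation call.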

Allowing programmatic downloads in jupyterlab

Software

straxen.print_versions('strax straxen admix utilix admix'.split())

Working on midway2-0416.rcc.local with the following versions and installation paths:
python	v3.8.8	(default, Apr 13 2021, 19:58:26) [GCC 7.3.0]
strax	v0.15.0	/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.8/site-packages/strax
straxen	v0.18.0	/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.8/site-packages/straxen
admix	v0.3.1	/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.8/site-packages/admix
utilix	v0.5.3	/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.8/site-packages/utilix
admix	v0.3.1	/opt/XENONnT/anaconda/envs/XENONnT_development/lib/python3.8/site-packages/admix

Description

It would be nice if we could run admix-download using jupyter bash magic, so that one does not have to do this manually on the login node, e.g. the command below. I am not sure whether this is due to some setup in rucio/the container/.. or due to the fact that the batch nodes don't have internet access.

What I Did

!admix-download 012233 raw_records --threads 10 --dir /dali/lgrandi/angevaare/download_rr/.

Then it simply hangs, and nothing happens for hours. Running the exact same command on the login node works, but that is not suitable since we don't want everyone running jobs on the login node.

Downloading xnt_012233:raw_records-rfzvpzj4mf from CCIN2P3_USERDISK
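
Until the network question is settled, one defensive workaround (a sketch, not part of admix) is to wrap the download invocation in a timeout, so a batch job fails fast with a clear message instead of hanging for hours:

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run a command, returning (ok, output); abort instead of hanging.

    Useful on batch nodes where a blocked outbound connection would
    otherwise stall an admix-download call indefinitely.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        return result.returncode == 0, result.stdout
    except subprocess.TimeoutExpired:
        return False, "timed out after %ds (no network access?)" % timeout_s


# Harmless demonstration command; in practice cmd would be the
# admix-download invocation from this issue.
ok, out = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=30)
```

A failure after the timeout at least distinguishes "no network from this node" from "transfer is slow".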

Public?

Does this need to be private?

Too many open files

We keep getting this error on datamanager, which causes the clean process to break.

OSError: [Errno 24] Too many open files: '/home/datamanager/software/admix/admix/config/datamanager.config'

Can we download from non-US RSE sites?

  • aDMIX version: admix.version: '0.2.0'
  • Python version: Python 3.6.12 at /opt/XENONnT/anaconda/envs/XENONnT_development/bin/python
  • Operating System: dali / linux?

Description

I wanted to download some peaklets on dali using admix download. It said that the data was not in the US (but the rundoc says it is).

What I Did

(XENONnT_development) [angevaare@dali-login1 test]$ admix-download 010038 peaklets
This run is not in the US so can't be processed here. Exit 255

The reason is line:
https://github.com/XENONnT/admix/blob/master/admix/download.py#L31
Could we not return the data even if it is not in the US, or at least report the RSE where it is available?
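
A hedged sketch of what the check could do instead of exiting: list the RSEs that do hold the data and fall back to one of them. The rundoc layout assumed here (`data` entries with `type` and `rse` fields) is a guess based on the error context, not the confirmed runDB schema.

```python
def find_rses(rundoc, dtype):
    """Return all RSEs that hold `dtype` for this run.

    Assumed (hypothetical) rundoc layout: rundoc['data'] entries carrying
    'type' and 'rse' fields; the real runDB schema may differ.
    """
    return [d["rse"] for d in rundoc.get("data", []) if d.get("type") == dtype]


def pick_rse(rundoc, dtype, preferred=()):
    """Prefer an RSE from `preferred` (e.g. US sites), otherwise fall back
    to any available RSE instead of exiting with an error."""
    rses = find_rses(rundoc, dtype)
    for rse in preferred:
        if rse in rses:
            return rse
    return rses[0] if rses else None


rundoc = {"data": [{"type": "peaklets", "rse": "CCIN2P3_USERDISK"}]}
print(pick_rse(rundoc, "peaklets"))  # -> CCIN2P3_USERDISK
```

Returning `None` (with the list of available RSEs in the error message) would also be more informative than "Exit 255".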

Setup the runDB correctly

Issue:

The config file references a "collection" in MongoDB, which is used as the runDB.
In fact, the first argument is the database name and the second refers to the collection within that database.

"collection": "xenon1t-runs",

Solve:

To fix this and avoid future confusion, a slight change is needed in the config script and in how it is read by the "interfaces/database.py" class.
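
For example, the config could name the database and the collection separately, and the reader could consume both keys explicitly. The key names and values below are a suggestion, not the current admix schema.

```python
import json

# Suggested config layout: separate keys instead of overloading "collection".
config = json.loads("""
{
    "database": "xenon1t-runs",
    "collection": "runs"
}
""")


def get_collection(client, config):
    """Resolve database and collection from the config.

    `client` is assumed to be pymongo-style: indexable first by database
    name, then by collection name.
    """
    return client[config["database"]][config["collection"]]
```

With unambiguous names, "interfaces/database.py" no longer has to know that the first field is secretly the database.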

afterpulses data to dali

This is a request from @hoetzsch. Could we make sure to ship the afterpulses datatype to dali once it has been processed on the eventbuilders? This is a new datatype since recent straxen updates.

Many thanks!

Feature request, admix insights (disk usage)

Dear admix experts,
Motivation
We are seeing more difficulties on the event builders that I think can be attributed to admix accessing the eventbuilder disks, which naturally decreases their performance. There is no way around this, and that is fine; we need to adapt the processing on the eventbuilders to be less sensitive to such cases. I do have a request, though: monitor admix's access to these disks to better pinpoint bottlenecks.

Request
Could we track the amount of data (MB/s + number of threads) accessing a given eventbuilder at a given time, for example in a database? This would allow me to query what admix was doing at a given time and how high the peak load was. In the future, I would argue we could display this info on the DAD website, as many people ask about it and it is an essential link in the chain of getting the data to dali.
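
A sketch of the kind of record such monitoring could write per transfer interval; all field names and the in-memory "database" below are hypothetical placeholders, not an existing admix feature.

```python
import time


def make_load_record(eventbuilder, bytes_moved, interval_s, n_threads):
    """Build one monitoring document for a transfer interval.

    mb_per_s is the average throughput over the interval; the record could
    be inserted into a MongoDB collection and later queried by host/time.
    """
    return {
        "host": eventbuilder,
        "time": time.time(),
        "mb_per_s": bytes_moved / interval_s / 1e6,
        "threads": n_threads,
    }


# In production these records would go to a database queried by the DAD
# website; here a plain list stands in for that database.
load_db = []
load_db.append(make_load_record("eb0", bytes_moved=5e8, interval_s=60, n_threads=8))
```

One record per eventbuilder per minute would be enough to reconstruct peak load and correlate it with processing slowdowns.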
