
Comments (14)

cinquo commented on July 28, 2024

mcinquil: (1) Add a user table (wm_users) with a few columns (id, dn, proxy_file) that can hold more information in the future if needed (phgroup, email, hnname, ...?). It should be populated as soon as a new workflow arrives: the CRAB client sends the user DN in the REST request, and the agent adds a new entry to the user table if one doesn't exist yet. The wmbs_workflow.owner column will contain the user id (wmbs_workflow.owner -> wm_users.id), and the best idea is probably to make owner a foreign key to the user table. (I do not know whether a cleaning mechanism is needed.)
(2) Add a bulk table (bl_bulk) with a few columns (id, bulk_id, user). This implies adding a foreign key to the bl_runjob table (bl_runjob.bulk_id -> bl_bulk.id).
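
A minimal sketch of the get-or-create step described in (1), written against Python's built-in sqlite3 module rather than the database backends the agent actually uses. The helper names `get_or_create_user` and `add_workflow`, the column types, and the `name` column are assumptions made for illustration; only wm_users, its id/dn/proxy_file columns, and wmbs_workflow.owner come from the proposal.

```python
import sqlite3

SCHEMA = """
CREATE TABLE wm_users (
    id         INTEGER PRIMARY KEY,
    dn         TEXT NOT NULL UNIQUE,   -- one row per user DN
    proxy_file TEXT
);
CREATE TABLE wmbs_workflow (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,               -- hypothetical column, for the example only
    owner INTEGER NOT NULL REFERENCES wm_users(id)
);
"""

def get_or_create_user(conn, dn):
    """Return the wm_users.id for dn, inserting a new row if none exists."""
    # INSERT OR IGNORE is SQLite syntax; it does nothing when the DN is already stored.
    conn.execute("INSERT OR IGNORE INTO wm_users (dn) VALUES (?)", (dn,))
    return conn.execute("SELECT id FROM wm_users WHERE dn = ?", (dn,)).fetchone()[0]

def add_workflow(conn, name, dn):
    """Register a workflow, storing the owner's id instead of the DN string."""
    owner_id = get_or_create_user(conn, dn)
    cur = conn.execute(
        "INSERT INTO wmbs_workflow (name, owner) VALUES (?, ?)", (name, owner_id)
    )
    return cur.lastrowid
```

With this layout, two workflows submitted under the same DN share a single wm_users row, so the DN string is stored once.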

from crabserver.

drsm79 commented on July 28, 2024

metson: What do you intend to do with the contents of wm_users? For CRAB, wmbs_workflow.owner should be a DN; you don't need the proxy file (since the API will handle that). Things like email and hnname can come from services, if needed.
I don't see a good motivation for adding the new table, and I'd rather add it when there's a clear requirement than put it in there just in case.

Can you explain what the bulk table does?

cinquo commented on July 28, 2024

mcinquil: You may have the same DN N times in the wmbs_workflow table. Why not group that information?

At the job level in BossAir you need to track which user owns the jobs.
Since the bl_runjob table currently contains only a reference to wmbs_job.id, you need to query the db with something like this:

  select wmbs_workflow.owner, wmbs_job.id
  from wmbs_job
    inner join wmbs_jobgroup on (wmbs_job.jobgroup = wmbs_jobgroup.id)
    inner join wmbs_subscription on (wmbs_jobgroup.subscription = wmbs_subscription.id)
    inner join wmbs_workflow on (wmbs_subscription.workflow = wmbs_workflow.id)
  where wmbs_job.id = 1;

for every job (well, this query can probably be optimized, but you still need to go through all 4 tables).
Currently the jobs already share a parameter (bulk_id). Adding another parameter (user) to the bl_runjob table would make that table grow too much. Creating a new table bl_bulk probably makes sense, with a 1->N reference from bl_bulk to bl_runjob. Since all jobs in a bulk belong to the same user, bl_bulk can hold the bulk identifier and the user identifier.

Replicating the user DN around (in wmbs_workflow and then in runjob) doesn't make much sense. That is also why it can be a good idea to have a dedicated table for users.
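
As a rough illustration of the bl_bulk idea, the sketch below stores the user once per bulk and finds a job's user with a single join instead of four. It uses Python's sqlite3 module for convenience; the column types, the `wmbs_id` column name, and the helper `user_for_job` are assumptions, not the actual BossAir schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE bl_bulk (
    id      INTEGER PRIMARY KEY,
    bulk_id TEXT NOT NULL,
    "user"  INTEGER NOT NULL          -- user identifier, stored once per bulk
);
CREATE TABLE bl_runjob (
    id      INTEGER PRIMARY KEY,
    wmbs_id INTEGER NOT NULL,         -- assumed name for the reference to wmbs_job.id
    bulk_id INTEGER NOT NULL REFERENCES bl_bulk(id)
);
"""

def user_for_job(conn, wmbs_id):
    """Return the user identifier for a job via bl_bulk, or None if unknown."""
    row = conn.execute(
        'SELECT b."user" FROM bl_runjob r '
        "JOIN bl_bulk b ON r.bulk_id = b.id WHERE r.wmbs_id = ?",
        (wmbs_id,),
    ).fetchone()
    return row[0] if row else None
```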

drsm79 commented on July 28, 2024

metson: For the DN the value of N is probably quite small (e.g. there'll be a workflow for signal and background samples, so it's not likely to be much over 10). If you do need to add a user table you should just hold hn username and DN. The API should provide the path to the proxy file (and you don't want to be keeping that field in sync with the filesystem).

Wouldn't it be better to make two queries:

  1. get the high-level metadata of the workflow (user DN etc.)
  2. get the jobs for the workflow

This would mean you get the data you need, plus you put the minimum load on the database. Doing it all in one query is going to be inefficient; you'll pull back a lot of duplicated information that you don't actually need (you don't need to get the DN for each job, you need to get the DN for the workflow and know the association of workflow to jobs).
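
The two-query approach could look roughly like the sketch below, again using sqlite3 for illustration. The table and join column names are taken from the query quoted earlier in the thread; the function names and the reduced schema are assumptions.

```python
import sqlite3

# Reduced schema: only the columns the two queries touch.
SCHEMA = """
CREATE TABLE wmbs_workflow     (id INTEGER PRIMARY KEY, owner TEXT NOT NULL);
CREATE TABLE wmbs_subscription (id INTEGER PRIMARY KEY, workflow INTEGER NOT NULL);
CREATE TABLE wmbs_jobgroup     (id INTEGER PRIMARY KEY, subscription INTEGER NOT NULL);
CREATE TABLE wmbs_job          (id INTEGER PRIMARY KEY, jobgroup INTEGER NOT NULL);
"""

def workflow_owner(conn, workflow_id):
    """Query 1: fetch the workflow-level metadata (here just the owner DN) once."""
    row = conn.execute(
        "SELECT owner FROM wmbs_workflow WHERE id = ?", (workflow_id,)
    ).fetchone()
    return row[0] if row else None

def workflow_job_ids(conn, workflow_id):
    """Query 2: fetch only the ids of the jobs that belong to the workflow."""
    rows = conn.execute(
        "SELECT j.id FROM wmbs_job j "
        "JOIN wmbs_jobgroup g ON j.jobgroup = g.id "
        "JOIN wmbs_subscription s ON g.subscription = s.id "
        "WHERE s.workflow = ? ORDER BY j.id",
        (workflow_id,),
    ).fetchall()
    return [r[0] for r in rows]
```

The DN crosses the wire once per workflow, and the job query returns integers only.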

cinquo commented on July 28, 2024

mcinquil: OK for the users, but I certainly expect at least double that (Daniele just showed me some interesting plots about this, which I can show you if you want). I also expect that a user can submit several workflows to an agent, and that a workflow does not get cleaned up before the user requests a new one. OK for HN name + DN.

The currently agreed BAir implementation does not take any workflow information into account, so the workflow is not known at the BAir level. The only WMBS-related information is the wmbs_job.id.
So each plugin always has a set of jobs with no relation to a workflow or user (just a list of jobs). Moreover, some BAir operations do not take any input and are expected to be self-consistent with BAir's own bl_runjob table. So the two solutions I see at the plugin level are:

  1. querying the wmbs tables to get the needed information (this means going through the wmbs_job, wmbs_jobgroup, wmbs_subscription and wmbs_workflow tables... and I do not like this being done for each job!);
  2. propagating the user id information into the BAir tables (hence the solution of adding another table).

If you see other solutions let me know, but given (1) and (2) I prefer the latter.

Here you can see the current MySQL table structure for BAir:
https://svnweb.cern.ch/trac/CMSDMWM/browser/WMCore/trunk/src/python/WMCore/BossAir/MySQL/Create.py

drsm79 commented on July 28, 2024

metson: Replying to [comment:5 mcinquil]:

> OK for the users, but I certainly expect at least double that (Daniele just showed me some interesting plots about this, which I can show you if you want). I also expect that a user can submit several workflows to an agent, and that a workflow does not get cleaned up before the user requests a new one.

Right, but even then it's going to be hundreds of records, not millions.

Replying to [comment:5 mcinquil]:

> OK for HN name + DN.

Cool.

> The currently agreed BAir implementation does not take any workflow information into account, so the workflow is not known at the BAir level. The only WMBS-related information is the wmbs_job.id.
> So each plugin always has a set of jobs with no relation to a workflow or user (just a list of jobs). Moreover, some BAir operations do not take any input and are expected to be self-consistent with BAir's own bl_runjob table. So the two solutions I see at the plugin level are:
>
>   1. querying the wmbs tables to get the needed information (this means going through the wmbs_job, wmbs_jobgroup, wmbs_subscription and wmbs_workflow tables... and I do not like this being done for each job!);
>   2. propagating the user id information into the BAir tables (hence the solution of adding another table).
>
> If you see other solutions let me know, but given (1) and (2) I prefer the latter.

I think you want to avoid pulling back the DN with jobs; that will potentially be millions of records and you don't want a long string a million times over.

How about the following (it's sort of a halfway house):

  1. add a ba_users table with ID, HN username and DN
  2. wmbs_workflow.owner is ba_users.ID
  3. jobs have an owner which is ba_users.ID
  4. jobs are processed by user, so you go through the users in ba_users, get their ID and DN, and then get a list of active jobs where owner = the ID. You don't do a join in the database, you just use the user ID as a constraint in the query. You could do something funky like asking the DB for a count of jobs per user id and then processing through the users on some appropriate algorithm, but that sounds like a v2 type of thing.
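
A minimal sketch of steps 1 to 4, assuming the job owner lives in a column on bl_runjob and that an `active` flag marks the jobs of interest; both are hypothetical, as are the function names. It uses sqlite3 for illustration and keeps the user id as a plain query constraint, with no join against the user table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE ba_users (
    id          INTEGER PRIMARY KEY,
    hn_username TEXT NOT NULL,
    dn          TEXT NOT NULL UNIQUE
);
CREATE TABLE bl_runjob (
    id      INTEGER PRIMARY KEY,
    wmbs_id INTEGER NOT NULL,
    owner   INTEGER NOT NULL REFERENCES ba_users(id),  -- hypothetical column
    active  INTEGER NOT NULL DEFAULT 1                 -- hypothetical status flag
);
CREATE INDEX bl_runjob_owner ON bl_runjob (owner, active);
"""

def active_jobs_by_user(conn):
    """Yield (dn, [job ids]) per user: one query for the users, one per user for jobs."""
    users = conn.execute("SELECT id, dn FROM ba_users ORDER BY id").fetchall()
    for user_id, dn in users:
        rows = conn.execute(
            "SELECT id FROM bl_runjob WHERE owner = ? AND active = 1 ORDER BY id",
            (user_id,),
        ).fetchall()
        yield dn, [r[0] for r in rows]

def job_counts(conn):
    """The 'v2' idea: count active jobs per user id, e.g. to decide processing order."""
    return dict(conn.execute(
        "SELECT owner, COUNT(*) FROM bl_runjob WHERE active = 1 GROUP BY owner"
    ).fetchall())
```

Each DN is read once per user, and the per-user job query returns only integer ids.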

TBH, I think you probably need to do some testing of the schema with some realistic (and exaggerated) numbers. E.g. how the schema behaves in a simple test isn't a good reflection of how it would behave if all of CMS were using one server and submitting 100000 jobs in 10 workflows each.

cinquo commented on July 28, 2024

mcinquil: Replying to [comment:6 metson]:

> I think you want to avoid pulling back the DN with jobs; that will potentially be millions of records and you don't want a long string a million times over.

Exactly.

> How about the following (it's sort of a halfway house):
>
>   1. add a ba_users table with ID, HN username and DN
>   2. wmbs_workflow.owner is ba_users.ID
>   3. jobs have an owner which is ba_users.ID
>   4. jobs are processed by user, so you go through the users in ba_users, get their ID and DN, and then get a list of active jobs where owner = the ID. You don't do a join in the database, you just use the user ID as a constraint in the query. You could do something funky like asking the DB for a count of jobs per user id and then processing through the users on some appropriate algorithm, but that sounds like a v2 type of thing.

This is what I proposed in comment 1 of this ticket, but with the user table moved to the BossAir level. Would this solution also be good enough for other cases such as the Async stage-out? (Other use cases may apply as well, e.g. gridftp interaction.)

> TBH, I think you probably need to do some testing of the schema with some realistic (and exaggerated) numbers. E.g. how the schema behaves in a simple test isn't a good reflection of how it would behave if all of CMS were using one server and submitting 100000 jobs in 10 workflows each.

OK!

drsm79 commented on July 28, 2024

metson: Replying to [comment:7 mcinquil]:

> This is what I proposed in comment 1 of this ticket, but with the user table moved to the BossAir level. Would this solution also be good enough for other cases such as the Async stage-out? (Other use cases may apply as well, e.g. gridftp interaction.)

The difference (or maybe I misunderstood your first post) is that you'd get the DN in one call and then get all jobs for the user in another; you wouldn't join the two databases (e.g. to avoid getting the DN for all the jobs, since it doesn't change for a user).

AsyncStageout gets the HN username from the files, so it could use this table to look up the DN from the HN username.

cinquo commented on July 28, 2024

mcinquil: Replying to [comment:8 metson]:

> The difference (or maybe I misunderstood your first post) is that you'd get the DN in one call and then get all jobs for the user in another; you wouldn't join the two databases (e.g. to avoid getting the DN for all the jobs, since it doesn't change for a user).

I would do exactly what you said in the previous comment!

> AsyncStageout gets the HN username from the files, so it could use this table to look up the DN from the HN username.

OK, then Async will have the BossAir db as a dependency. Or maybe we want a dedicated package (e.g. WMCore.Users).

cinquo commented on July 28, 2024

mcinquil: This modification also includes a few changes to:
wmbs_workflow.owner -> this is currently a VARCHAR and will become an INT (containing the ID from the user table; the best idea would be to make it a foreign key to the user id)
bl_runjob.user_id -> this column needs to be added and used for various join operations (the RunJob object will then have a 'userdn' field)

This will possibly break what is currently in svn. Let me know how to proceed.
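
One way the owner column change could be carried out is sketched below, purely as an illustration in sqlite3. The function name `migrate_owner_to_user_id`, the `name` column, and the rebuild-and-rename strategy are assumptions; a MySQL or Oracle schema would likely use its own ALTER TABLE syntax instead. The user table is called ba_users here, as in the earlier proposal.

```python
import sqlite3

def migrate_owner_to_user_id(conn):
    """Replace the DN string in wmbs_workflow.owner with an integer user id."""
    conn.executescript("""
        CREATE TABLE ba_users (
            id INTEGER PRIMARY KEY,
            dn TEXT NOT NULL UNIQUE
        );
        -- One user row per distinct DN currently stored in owner.
        INSERT INTO ba_users (dn) SELECT DISTINCT owner FROM wmbs_workflow;

        -- SQLite cannot change a column type in place, so rebuild the table.
        CREATE TABLE wmbs_workflow_new (
            id    INTEGER PRIMARY KEY,
            name  TEXT NOT NULL,
            owner INTEGER NOT NULL REFERENCES ba_users(id)
        );
        INSERT INTO wmbs_workflow_new (id, name, owner)
            SELECT w.id, w.name, u.id
            FROM wmbs_workflow w JOIN ba_users u ON u.dn = w.owner;
        DROP TABLE wmbs_workflow;
        ALTER TABLE wmbs_workflow_new RENAME TO wmbs_workflow;
    """)
```

Any code that reads wmbs_workflow.owner as a DN would need to be updated at the same time, which is the breakage the comment warns about.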

sfoulkes commented on July 28, 2024

sfoulkes: I'd rather ba_users be wmbs_users so we don't introduce a dependency in WMBS on BossAir. BossAir would still depend on WMBS and bl_runjob.user_id would link to wmbs_users, wmbs_workflow.owner would link to wmbs_users, etc.

As far as how to proceed, assuming we're doing wmbs_users, update the wmbs schema to add the table, modify wmbs_workflow.owner to link to the table. I'd either create a new class in WMBS for the users stuff or add methods to WMBS Workflow to manipulate user information. Simon, comment?

cinquo commented on July 28, 2024

mcinquil: Please Review

sfoulkes commented on July 28, 2024

sfoulkes: (In 11794) Integrate the ProxyAPI with the gLite BossAir plugin. Fixes #672.

From: Matt Cinquilli [email protected]
Signed-off-by: Steve Foulkes [email protected]

sfoulkes commented on July 28, 2024

sfoulkes: (In 11795) Integrate the ProxyAPI with the gLite BossAir plugin. Fixes #672.

From: Matt Cinquilli [email protected]
Signed-off-by: Steve Foulkes [email protected]
