Comments (14)

sidnarayanan commented on July 29, 2024

CNAF is also observing the same thing. Not yet clear what's causing it. There haven't been any changes in the router config recently as far as I know...

from phedex.

nataliaratnikova commented on July 29, 2024

Looking at the routing activity table for this dataset [1], nothing is currently routed from any Tier-1.
The only destination for this DS currently is T2_FR_GRIF_LLR, and all blocks are routed from T2 sites.

Could triple-A be causing this?

[1] https://cmsweb.cern.ch/phedex/prod/Activity::Routing?tofilter=.*&fromfilter=.*&priority=any&blockfilter=%2FZeroBias%2FRun2016H-PromptReco-v2%2FAOD&.submit=Update#

DAMason commented on July 29, 2024

DAMason commented on July 29, 2024

sidnarayanan commented on July 29, 2024

Yeah, I'm guessing PhEDEx gave up on the ones from that dataset. If I look at what is currently routed from FNAL_Buffer, it's tons of stuff that shouldn't be coming from tape [1]. Lots of 2017 data, (MINI)AOD(SIM), etc. Picking one at random, I see 3 full disk copies, and yet it is routed from FNAL_MSS [2].

[1] https://cmsweb.cern.ch/phedex/prod/Activity::Routing?tofilter=.*&fromfilter=T1_US_FNAL_Buffer&priority=any&showinvalid=on&blockfilter=&.submit=Update#
[2] https://cmsweb.cern.ch/phedex/datasvc/json/prod/subscriptions?dataset=/JetHT/Run2016G-07Aug17-v1/MINIAOD

nataliaratnikova commented on July 29, 2024

Sid, thanks for the example. I do not think the router can decide based on the dataset name (AOD, etc). I will see if I can figure out the link weights from the router agent log.

Dave, you should be able to see from the local stager agent logs whether and when it tried to re-stage the file. By default the stager will "forget" about staged files after 8 hours; you may adjust this using the -stage-stale option:
https://github.com/dmwm/PHEDEX/blob/master/Toolkit/Transfer/FileStager#L49-L50

sidnarayanan commented on July 29, 2024

The dataset name should not have anything to do with what the router decides. I was trying to point out that these are data tiers that are already replicated on disk, and therefore should not be recalled from tape.

nataliaratnikova commented on July 29, 2024

Okay, I got your point. As far as I can tell, the Router considers all available sources, including T1_*_Buffer nodes, and chooses the link with minimal cost. It simply adds a half-hour penalty for files that need staging:
https://github.com/dmwm/PHEDEX/blob/master/perl_lib/PHEDEX/Infrastructure/FileRouter/Agent.pm#L1044-L1051
If you want the disk-only sources to outweigh the Buffer nodes, we could try to adjust this penalty.
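
The selection rule described above can be sketched as follows. This is only an illustration in Python, not the FileRouter's Perl code: the `Source` class, the function names, and the cost numbers are hypothetical, and the real cost function may include terms not shown here. The only facts taken from the thread are that the cheapest link wins and that sources needing staging get a half-hour penalty.

```python
from dataclasses import dataclass

# Half an hour, as described above; everything else here is hypothetical.
STAGING_PENALTY_SECONDS = 30 * 60


@dataclass(frozen=True)
class Source:
    node: str
    transfer_cost: float  # assumed estimate, in seconds, for this link
    needs_staging: bool   # True when the file must be recalled from tape


def link_cost(source, staging_penalty=STAGING_PENALTY_SECONDS):
    """Link cost plus a flat penalty when the file needs staging."""
    penalty = staging_penalty if source.needs_staging else 0.0
    return source.transfer_cost + penalty


def choose_source(sources, staging_penalty=STAGING_PENALTY_SECONDS):
    """Pick the source whose penalised cost is smallest."""
    return min(sources, key=lambda s: link_cost(s, staging_penalty))


# Made-up costs: a fast tape-backed link against a slower disk-only one.
sources = [
    Source("T1_US_FNAL_Buffer", transfer_cost=600.0, needs_staging=True),
    Source("T1_US_FNAL_Disk", transfer_cost=3600.0, needs_staging=False),
]

print(choose_source(sources).node)                             # T1_US_FNAL_Buffer
print(choose_source(sources, staging_penalty=12 * 3600).node)  # T1_US_FNAL_Disk
```

With the half-hour penalty the Buffer node still wins (600 + 1800 < 3600); raising the penalty to half a day flips the choice to the disk replica.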

vlimant commented on July 29, 2024

+1 on making this penalty 10 thousand hours to prevent tape copies from being considered a good source

DAMason commented on July 29, 2024

Again today I'm seeing encp recalls at FNAL very much dominated by things that also exist on disk, even at FNAL_Disk. I would bet fixing this goes a long way toward settling any tape recall problems CMS has. The penalty should be set high enough that all functional disk replicas are tried first, but not so high that a tape replica is excluded when the only disk replicas are at broken or very backlogged sites. Not knowing the distribution, I won't offer a number :) I would put addressing this at high priority, possibly just behind the secret 4th queue.
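
To make the two constraints concrete, here is a rough Python sketch. It assumes a purely additive penalty on top of a per-link cost in seconds; `penalty_window` and all the numbers are hypothetical and not taken from PhEDEx.

```python
def penalty_window(tape_cost, healthy_disk_costs, backlogged_disk_costs):
    """Return (lower, upper) bounds for an additive staging penalty, or None.

    A penalty above `lower` makes every healthy disk replica cheaper than tape.
    A penalty below `upper` keeps tape cheaper than every backlogged replica.
    """
    lower = max(max(healthy_disk_costs) - tape_cost, 0.0)
    upper = min(backlogged_disk_costs) - tape_cost
    return (lower, upper) if lower < upper else None


# Made-up costs in seconds: healthy disk links at 30 min and 2 h,
# one backlogged disk link at 7 days, and a tape link at 10 min.
print(penalty_window(600.0, [1800.0, 7200.0], [604800.0]))  # (6600.0, 604200.0)
```

If the function returns None, no single penalty value satisfies both conditions for the given costs.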

DAMason commented on July 29, 2024

Actually, now that I look at the code @nataliaratnikova referenced: assuming half an hour for unstaged data is ridiculous! Half a day or a day is maybe as low as I would ever have thought for that. Maybe the real number is something like longer than 90% of the "from disk" transfer latencies? But I guess I'd need to see what that cost function looks like. Is there a data service query to see what these numbers look like?
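
As a sketch of the "longer than 90% of the from-disk latencies" idea, one could take a nearest-rank percentile over observed disk-link latencies. The function names and the sample latencies below are made up for illustration; real numbers would have to come from the router's own statistics.

```python
def percentile(values, percent):
    """Nearest-rank percentile: smallest value >= `percent`% of the samples."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, clamped to at least rank 1.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def suggested_staging_penalty(disk_latencies_seconds, percent=90):
    """Candidate penalty: the chosen percentile of from-disk latencies."""
    return percentile(disk_latencies_seconds, percent)


# Hypothetical from-disk latencies in seconds (10 minutes up to 1 day).
latencies = [600, 900, 1200, 1800, 2400, 3600, 5400, 7200, 14400, 86400]
print(suggested_staging_penalty(latencies))  # 14400
```

For this invented sample the candidate penalty comes out at 4 hours, well above the current half hour.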

nataliaratnikova commented on July 29, 2024

Hi Dave,
https://cmsweb.cern.ch/phedex/datasvc/perl/prod/routerhistory
shows the last hour's numbers for rate and latency used in the cost calculation per link.
See https://cmsweb.cern.ch/phedex/datasvc/doc/routerhistory for more filters.

In the last hour the latency varies from 0 to 7 days...

I'll see how easy it would be to pass the staging penalty to the Router as an option, instead of a hard-coded value.

DAMason commented on July 29, 2024

Just checking where we are on this?

nataliaratnikova commented on July 29, 2024

I'm done with the new priority queue. This one is next on my list. If you have figured out the desired number, I can put it in right away as the new default. Since this is a trivial change, we could also ask the T0 PhEDEx operators to patch the FileRouter in place to put this feature into action.
