Giter Club home page Giter Club logo

Comments (6)

maxfischer2781 avatar maxfischer2781 commented on September 1, 2024

This is unintended. The intention is for files to be considered regardless where they come from, unless they are outdated.
The cache metadata is stored in a git-friendly way on purpose.

I'm guessing this is caused by storing dependencies as absolute paths. Note that the cache will report why it needs regeneration if its log level is set low enough. Running excalibur.py with --log-level cache:info core:info conf:info should allow debugging this.

from excalibur.

dhaitz avatar dhaitz commented on September 1, 2024

Hi Max, thanks for the quick reply! I've found that the cache does not check the presence of the actually needed (lumi/pileup) files, but the cache (.pkl) file. Also, because of
https://github.com/artus-analysis/Excalibur/blob/master/cfg/python/configutils.py#L199 the .pkl lumi files are stored in git in data/json/, while the pileup .pkl files are in cache/. I'm not sure whether the .pkl files are supposed to be in git ... Personally I'd put them in cache/ and have the caching mechanism look for the actual pu/lumi files

from excalibur.

maxfischer2781 avatar maxfischer2781 commented on September 1, 2024

The .pkl files must be in git as well. It includes the metadata of the cache.
When invoked, the cache must check the current state of dependencies against the old state of dependencies. This old state is stored in the .pkl. Implicitly, the cache does check the JSON - both original and derived JSONs are dependencies (e.g. here and here).
If the .pkl does not exist, the cache will break at this point already, before checking the JSON.

I agree that having some things in cache/ and others in data/ is not clean. Especially figuring out what to commit and what not. I'll write up how this could be resolved soon.

from excalibur.

maxfischer2781 avatar maxfischer2781 commented on September 1, 2024

If somebody has the time (hint**hint) there's an alternative: have a global and local cache mode.
The main problem is that putting cache/generated files into git is very messy. Both git and cache do a kind of VCS, but the first is persistent while the later is volatile. What is actually needed is a way to share caches.

  • Local: all .pkl and automatically generated files go to $(pwd)/cache/<subdir>, e.g. /usr/users/mfischer/Excalibur/cache/data/lumi. This is the default, working as now. Everybody has his own cache. One can copy (or commit) it between users if needed.
  • Global: all .pkl and automatically generated files go to $(EXCALIBURCACHE)/<subdir>, e.g. /storage/a/dhaitz/excalibur_cache/data/lumi. This is activated by defining EXCALIBURCACHE. This acts as a shared cached: someone populates it (w/ CERN accounts), then everybody can use it (w/o CERN accounts). Once can copy it between hosts if needed.

The functionality for this is already there, but it needs cleanup and consolidation. Required changes:

  • cached_query expects and cache_dir as a relative path. E.g., the default path of RunJSON would be just os.path.join("data", "json").
    • It might suffice to just do cache_dir = os.path.relpath(cache_dir) at the start of cached_query. I strongly advice against it, to keep cache magic to a minimum.
  • cached_query expects dependency_files and dependency_folders as absolute paths, relative paths, or beginning with "%(cache_dir)s". The later must be used when files are generated. E.g. generated JSONs would now go to os.path.join("%(cache_dir)s", self._store_path, os.path.basename(json_name))
    • Using relative paths with a global cache or absolute paths inside cwd with a local cache may be ill-defined (aka useless). See below.
  • cached_query performs a switch between global and local cache at the beginning.
    • If global, convert relative paths to absolute. If local, convert absolute paths to relative if a subpath of os.getcwd(). Paths using cache_dir should be resolved dynamically. I guess it should look something like this:
def cached_query(*args, **kwargs):
    try:
        cache_dir = os.path.join(os.environ['EXCALIBURCACHE'], cache_dir)
    except KeyError:  # thrown by os.environ
        # local cache
        cache_dir = os.path.join(getPath(), "cache", cache_dir)
        # convert *true* relative paths
        dependency_files = [get_relsubpath(dpath) for dpath in dependency_files]
        dependency_folders = [get_relsubpath(dpath) for dpath in dependency_folders]
    else:  # this triggers if no exception is raised
        # global cache
        # purge relative paths
        dependency_files = [os.path.abspath(dpath) for dpath in dependency_files]
        dependency_folders = [os.path.abspath(dpath) for dpath in dependency_folders]
    # define this once cache_dir is set
    def stat_file(file_path):
        """Get a comparable representation of file validity"""
        try:
            file_stat = os.stat(file_path % {'cache_dir': cache_dir})  # resolve cache_dir dynamically
            return file_stat.st_size, file_stat.st_mtime
        except OSError:
            return -1, -1
    # run the rest of cached_query
    # ...

I guess that should do the trick, but there's only so much coffee in one day...

from excalibur.

dhaitz avatar dhaitz commented on September 1, 2024

For now, I stuck to a simple fix :) 6b88c754
more advanced solutions maybe after holiday

from excalibur.

maxfischer2781 avatar maxfischer2781 commented on September 1, 2024

Since both Dominik and me are no longer at EKP, please let me know if there is help required on this. I don't have a working test environment anymore.

from excalibur.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.