
fscacher's People

Contributors

dependabot[bot], jwodder, yarikoptic


Forkers

yarikoptic

fscacher's Issues

conda-forge feedstock

The process starts with submitting a PR against https://github.com/conda-forge/staged-recipes containing the recipe (LICENSE and meta.yaml).

The process is quite simple; see any of the recent PRs there, or the one we had for dandi: https://github.com/conda-forge/staged-recipes/pull/11289/files

After the PR gets merged, a dedicated repo is generated for the feedstock, e.g. like https://github.com/conda-forge/dandi-feedstock, and bots will submit PRs (to finalize or just merge) whenever new releases appear on PyPI (apparently I missed the one for dandi 0.10.0, heh: conda-forge/dandi-feedstock#15, and it had some tests fail and now the logs are gone :-/)

support generator functions

the simplest implementation would be to detect whether the function to be wrapped is a generator function (inspect.isgeneratorfunction); fscacher would then cache all yielded results, and the wrapped function would itself be a generator, possibly just re-yielding them from the cache.

The use case which inspired it -- caching of a hash computation which yields progress status while computing: dandi/dandi-cli#400. But I think, although this could be used there, in that case we need a somewhat different construct, since we would not be interested in all the progress indicators if the value is cached. So, although inspired by that use case, it would not be a complete/optimal solution for it.
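
A minimal in-memory sketch of the idea (a hypothetical helper, not current fscacher API; the real thing would go through the persistent joblib-backed cache):

    import inspect
    from functools import wraps

    def memoize_generator(cache_dict, func):
        """Wrap a generator function so its yielded values are cached as a list."""
        if not inspect.isgeneratorfunction(func):
            raise TypeError("expected a generator function")

        @wraps(func)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            if key not in cache_dict:
                # exhaust the generator once and store everything it yielded
                cache_dict[key] = list(func(*args, **kwargs))
            # the wrapped function is itself a generator re-yielding cached values
            yield from cache_dict[key]

        return wrapper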

support for hierarchical organization / boundaries

Well, #5 could provide a way for this to be implemented.
Use case -- git submodules / DataLad subdatasets. Caching should be done per repository, even if repositories are nested in a hierarchy. So if we hit a "boundary" (e.g. a submodule -- a directory with .git in it) we do not descend, under the assumption that the nested repository has its own "fscached" call.
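
A rough illustration of the boundary rule (assuming an os.walk-based traversal and .git as the marker):

    import os

    def iter_tree(root, boundary_marker=".git"):
        """Yield paths under root, but do not descend into nested repositories.

        A subdirectory containing `boundary_marker` is treated as a boundary:
        it is reported but not walked, on the assumption that the nested
        repository gets its own "fscached" call.
        """
        for dirpath, dirnames, filenames in os.walk(root):
            for d in list(dirnames):
                if os.path.exists(os.path.join(dirpath, d, boundary_marker)):
                    dirnames.remove(d)  # prune: do not descend past the boundary
                    yield os.path.join(dirpath, d)
            for f in filenames:
                yield os.path.join(dirpath, f)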

"caching" on directories

This is the most interesting/challenging part of this helper module.

  • Initial implementation could just follow what was initially suggested by @satra in #1 (comment), with the slight tune-ups/safeguards I added in https://gist.github.com/yarikoptic/7284b634d8ab12277a3d316a117cc0db, and timings for a sample use case in #1 (comment)
  • storing an entire list of stats as a "fingerprint" allows for fast decision making (if any of the stats change) but might be prohibitively large for many target use cases; we need to research/think about a better approach. Alternatives which immediately come to mind are:
    • a list of digest pairs, one pair per directory, where for each directory we keep two digests: 1 -- the digest of the stat for the most recently modified file/directory in that directory (to quickly detect whether any immediate entry was modified), 2 -- a joint digest of all other entries in the directory. (The digest should be whichever is fastest to compute on the stat records of interest; a rough sketch follows this list.)
    • but I bet we would find a better approach if we look into how git does it (I believe @kyleam and/or @mih looked into that at some point) for its status etc. needs
thoughts which led me to the above "list of digest pairs":
  • store some digest of that list:
    • the digest could be updated with each record -- the fastest possible (in conjunction with the time it might take to convert a stat into bytes for the digest update)
    • big con: no early decision making -- we would need to traverse the entire tree to make a decision. Show stopper.
  • I bet there could be some hierarchical structure/tree which would allow for a not-that-heavy fingerprint while still allowing for early termination. E.g. it could be a digest per directory: as soon as we run into a directory with a different digest -- early termination. This should work, but might still not be that efficient in some cases with heavy flat directories (a heuristic could be to keep a digest per N, e.g. 100, files in a directory, so in most cases it would still be a single digest per directory)
  • optimization:
    • we could sort (an O(N*log(N)) operation) the list of files in a directory by their mtime, so if any file changed, its mtime changed and the order changed -- early detection
    • but the above shows that we do not really need to know the full order -- we just need to detect a change of
      • the mtime of the directory (so if a file is removed, or copied in while preserving its timestamps, the directory's mtime would change!)
      • the digest of the most recently modified file in the directory + the digest of the entire directory listing. We would still need an O(N) pass over all files in the directory to find the "latest", but we would not need to spend time right away estimating the digest of the join of N items (we could multithread, btw, since os.stat would probably allow a "digest"ing thread to coexist?). So insofar I think this is the way to go
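
The digest-pair idea above, as a rough sketch (blake2b stands in for "some fast digest", and the exact stat fields are illustrative):

    import hashlib
    import os

    def dir_fingerprint(path):
        """Return (latest_digest, joint_digest) for a single directory.

        latest_digest covers the stat record of the most recently modified
        entry; joint_digest covers all the other entries' stat records.
        """
        records = []
        with os.scandir(path) as entries:
            for e in entries:
                s = e.stat(follow_symlinks=False)
                records.append((s.st_mtime_ns,
                                f"{e.name}:{s.st_mtime_ns}:{s.st_size}:{s.st_ino}"))
        if not records:
            return None, None
        records.sort()  # the last element is the most recently modified entry
        latest = hashlib.blake2b(records[-1][1].encode()).hexdigest()
        joint = hashlib.blake2b(
            "\0".join(r[1] for r in records[:-1]).encode()).hexdigest()
        return latest, joint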

After that initial implementation, we should consider the following features to be added (the list is likely to grow):

  • add "support" for specification of paths to ignore (or consider?): #5
  • add "support" for hierarchical organization. #6

"background" caching

we could run the decorated function in the main thread and do the "fingerprinting" in a separate thread (or process?), so that if the fingerprint is found unchanged, we interrupt execution of the main thread and return the cached value; if we find that it changed -- we just keep going and store the new fingerprint.
this should be optional, since it would probably only benefit cases where the underlying function is not itself stat-heavy. E.g. I would not use this around git status, but it might still be of benefit for datalad status, which does git calls operating on the index etc., so a good amount of time is spent on IO/text parsing etc.; if we can make fingerprinting and status run in parallel -- it could be faster than fingerprinting + status sequentially.
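
A minimal sketch of the optional mode (roles swapped for simplicity: the expensive call runs in the worker thread; also note that Python threads cannot really be interrupted, so a "cache hit" here abandons the in-flight call rather than stopping it):

    from concurrent.futures import ThreadPoolExecutor

    def call_with_background_fingerprint(func, path, fingerprint, cached):
        """Run func(path) while recomputing the fingerprint concurrently.

        `cached` is the previously stored (fingerprint, value) pair or None.
        """
        pool = ThreadPoolExecutor(max_workers=1)
        fut_value = pool.submit(func, path)   # the possibly expensive call
        new_fprint = fingerprint(path)        # fingerprint in the current thread
        if cached is not None and cached[0] == new_fprint:
            pool.shutdown(wait=False)         # do not block on the running call
            return cached[1], new_fprint
        value = fut_value.result()
        pool.shutdown()
        return value, new_fprint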

do not resolve path we pass into decorated function?

prompted by investigating dandi/dandisets#38 -- it seems we resolve the path which we pass to the underlying function:

code -- we pass the op.realpath (os.path.realpath) resolved path into the original function:
 86             @wraps(f)
 87             def fingerprinter(path, *args, **kwargs):
 88                 # we need to dereference symlinks and use that path in the function
 89                 # call signature
 90                 path_orig = path
 91                 path = op.realpath(path)
 92                 if path != path_orig:
 93                     lgr.log(5, "Dereferenced %r into %r", path_orig, path)
 94                 if op.isdir(path):
 95                     fprint = self._get_dir_fingerprint(path)
 96                 else:
 97                     fprint = self._get_file_fingerprint(path)
 98                 if fprint is None:
 99                     lgr.debug("Calling %s directly since no fingerprint for %r", f, path)
100                     # just call the function -- we have no fingerprint,
101                     # probably does not exist or permissions are wrong
102                     ret = f(path, *args, **kwargs)
103                 # We should still pass through if file was modified just now,
104                 # since that could mask out quick modifications.
105                 # Target use cases will not be like that.
106                 elif fprint.modified_in_window(self._min_dtime):
107                     lgr.debug("Calling %s directly since too short for %r", f, path)
108                     ret = f(path, *args, **kwargs)
109                 else:
110                     lgr.debug("Calling memoized version of %s for %s", f, path)
111                     # If there is a fingerprint -- inject it into the signature
112                     kwargs_ = kwargs.copy()
113                     kwargs_[fingerprint_kwarg] = fprint.to_tuple() + (
114                         tuple(self._tokens) if self._tokens else ()
115                     )
116  ->                 ret = fingerprinted(path, *args, **kwargs_)
117                 lgr.log(1, "Returning value %r", ret)
118                 return ret
119     
120             # and we memoize actually that function
121             return fingerprinter
(Pdb) p path
'/mnt/backup/dandi/dandisets/000003/.git/annex/objects/64/J4/SHA256E-s8417855886--918e541a92b934d7543acc92981ed0425eadc0d24dcabf9d8e5b44924dbb1046.nwb/SHA256E-s8417855886--918e541a92b934d7543acc92981ed0425eadc0d24dcabf9d8e5b44924dbb1046.nwb'
(Pdb) p path_orig
PosixPath('/mnt/backup/dandi/dandisets/000003/sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140327_behavior+ecephys.nwb')

we should check on the original motivation and whether it could/should be avoided!

flaky test_memoize_path_dir ?

A fresh run https://github.com/con/fscacher/runs/4440200422?check_suite_focus=true failed on macOS:

____________________________ test_memoize_path_dir _____________________________

cache = <fscacher.cache.PersistentCache object at 0x1074818e0>
tmp_path = PosixPath('/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pytest-of-runner/pytest-0/test_memoize_path_dir0')

    def test_memoize_path_dir(cache, tmp_path):
        calls = []
    
        @cache.memoize_path
        def memoread(path, arg, kwarg=None):
            calls.append([path, arg, kwarg])
            total_size = 0
            with os.scandir(path) as entries:
                for e in entries:
                    if e.is_file():
                        total_size += e.stat().st_size
            return total_size
    
        def check_new_memoread(arg, content, expect_new=False):
            ncalls = len(calls)
            assert memoread(path, arg) == content
            assert len(calls) == ncalls + 1
            assert memoread(path, arg) == content
            assert len(calls) == ncalls + 1 + int(expect_new)
    
        fname = "foo"
        path = tmp_path / fname
    
        with pytest.raises(IOError):
            memoread(path, 0)
        # and again
        with pytest.raises(IOError):
            memoread(path, 0)
        assert len(calls) == 2
    
        path.mkdir()
        (path / "a.txt").write_text("Alpha")
        (path / "b.txt").write_text("Beta")
    
        t0 = time.time()
        try:
            # unless this computer is too slow -- there should be less than
            # cache._min_dtime between our creating the file and testing,
            # so we would force a direct read:
>           check_new_memoread(0, 9, True)

.tox/py/lib/python3.9/site-packages/fscacher/tests/test_cache.py:204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

arg = 0, content = 9, expect_new = True

    def check_new_memoread(arg, content, expect_new=False):
        ncalls = len(calls)
        assert memoread(path, arg) == content
        assert len(calls) == ncalls + 1
        assert memoread(path, arg) == content
>       assert len(calls) == ncalls + 1 + int(expect_new)
E       AssertionError: assert 3 == ((2 + 1) + 1)
E        +  where 3 = len([['/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pytest-of-runner/pytest-0/test_memoize_path_dir0/foo', 0, ...rivate/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pytest-of-runner/pytest-0/test_memoize_path_dir0/foo', 0, None]])
E        +  and   1 = int(True)

.tox/py/lib/python3.9/site-packages/fscacher/tests/test_cache.py:183: AssertionError

although it was all green before, AFAIK.

from joblib: UserWarning: Persisting input arguments took XXX to run.

(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/

/home/jovyan/fscacher/src/fscacher/cache.py:138: UserWarning: Persisting input arguments took 0.53s to run.
If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  ret = fingerprinted(*args, **kwargs_)
/home/jovyan/fscacher/src/fscacher/cache.py:138: UserWarning: Persisting input arguments took 1.30s to run.
If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  ret = fingerprinted(*args, **kwargs_)
/home/jovyan/fscacher/src/fscacher/cache.py:138: UserWarning: Persisting input arguments took 0.60s to run.
If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  ret = fingerprinted(*args, **kwargs_)
... still running

trying on #67 state of things

jovyan@jupyter-yarikoptic:~/fscacher/src/fscacher$ nl -ba cache.py | grep -3 138
   135                      + fprint.to_tuple()
   136                      + (tuple(self._tokens) if self._tokens else ())
   137                  )
   138                  ret = fingerprinted(*args, **kwargs_)
   139              lgr.log(1, "Returning value %r", ret)
   140              return ret
   141

jovyan@jupyter-yarikoptic:~/fscacher/src/fscacher$ git describe
0.1.6-17-g7ec27c1

jovyan@jupyter-yarikoptic:~/fscacher/src/fscacher$ git branch
* gh-66
  master

and I am not sure why it took so long, since all args here should be quite concise AFAIK (a path). Needs to be investigated whether/how it should be addressed/sped up.

establish benchmarking setup

even if it would be quite basic, and tricky (e.g. "how to make sure the file system is cold?" and "run only once"), IMHO it would be very valuable so we can see how each PR affects performance.

  • I guess it should use asv: https://github.com/airspeed-velocity/asv . DataLad has a collection of benchmarks and helpers to run benchmarking in a GitHub workflow for PRs, and to output a comparison between the target branch and the PR. We should have that here (a rough asv sketch follows this list).
  • ideally the GitHub Actions setup should run it on the default file system as well as on some additional ones (NFS, *fat), as was set up in the matrix run for the git-annex builder at https://github.com/datalad/git-annex/ . But that would mean, AFAIK, having multiple asv machine definitions, and it would be harder to get an "overall view". So the GitHub action should configure all filesystems at once and pass them into the asv run, to be used as parameters for "3.4 Parameterized benchmarks" (e.g. each benchmark runs on "default", "nfs", "vfat").
  • per-file-level fingerprinting would probably not be very stable, but since asv does many runs -- it should be ok
  • for directory-level fingerprinting we should have at least two critically different types of layout: "flat" (e.g. 100 files in a single directory) and "deep" (a tree up to some depth, e.g. 3, with about 3**3=27 directories total and a few files (3) in the leaf directories). The layout would then again be a parameter to the parametric benchmark.
  • so overall we just need a benchmark for a file and a benchmark for a tree, but exercised across file systems and, for the tree, across layouts
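
A rough asv sketch of such a parameterized benchmark (class and parameter names are illustrative, and PersistentCache(name=...) is assumed from the current API):

    import os
    from fscacher import PersistentCache

    class TimeMemoizePathDir:
        params = (["flat", "deep"],)   # the directory layouts described above
        param_names = ["layout"]

        def setup(self, layout):
            self.cache = PersistentCache(name="fscacher-bench")
            self.path = os.path.abspath(f"tree-{layout}")
            if layout == "flat":
                os.makedirs(self.path, exist_ok=True)
                for i in range(100):
                    with open(os.path.join(self.path, f"f{i}"), "w") as f:
                        f.write("x")
            else:  # "deep": 3 levels of 3 directories, 3 files in each leaf
                for a in range(3):
                    for b in range(3):
                        for c in range(3):
                            d = os.path.join(self.path, str(a), str(b), str(c))
                            os.makedirs(d, exist_ok=True)
                            for i in range(3):
                                with open(os.path.join(d, f"f{i}"), "w") as f:
                                    f.write("x")

            @self.cache.memoize_path
            def count_files(path):
                return sum(len(files) for _, _, files in os.walk(path))

            self.count_files = count_files

        def time_memoized_walk(self, layout):
            self.count_files(self.path)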

add inode into the file fingerprint

We recently had to deal with improving decision making on whether to reload the config in datalad, where in effect we also rely on a similar concept. We ended up adding st_ino, which helped greatly, at least in the case of git config, which seems to just write a new file (producing a new inode) on modifications of .git/config -- something we otherwise failed to detect: datalad/datalad@0ead554#diff-94f9b0e4cdbcfe44b244bbf4450719d1034680f27d3a1cef81393694ed7d6e08R378
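
For reference, a file fingerprint along those lines (using the same stat fields mentioned in the st_ctime_ns issue further down) would look roughly like:

    import os

    def file_fingerprint(path):
        """Fingerprint a single file from its stat record, including st_ino."""
        s = os.stat(path)
        return (s.st_mtime_ns, s.st_ctime_ns, s.st_size, s.st_ino)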

I even wonder (blindly, I didn't analyze) whether that would somehow relate to test failures in the mothership dandi-cli, like this one from https://github.com/dandi/dandi-cli/pull/330/checks?check_run_id=1723505920

dandi/support/tests/test_cache.py:121: in test_memoize_path
    check_new_memoread(0, "content", True)
dandi/support/tests/test_cache.py:101: in check_new_memoread
    assert len(calls) == ncalls + 1 + int(expect_new)
E   assert 3 == 4
E     +3
E     -4

fix in joblib broke fscacher?

Using joblib with the recent fix (thanks @jwodder) http://github.com/joblib/joblib/commit/457d2c8a926105fd156b1dfccfaae3e500d22451 (1.0.1-3-g457d2c8) causes tests to error out with

E               TypeError: Ignore list for memoread() contains an unexpected keyword argument '_cache_fingerprint'
full log
(git)lena:~/proj/fscacher[master]git
$> python -m pytest --tb=short src/fscacher/tests/test_cache.py
====================================================== test session starts ======================================================
platform linux -- Python 3.9.1, pytest-6.2.2, py-1.9.0, pluggy-0.13.0
rootdir: /home/yoh/proj/fscacher, configfile: tox.ini
plugins: mock-3.5.1, cov-2.11.1, pyfakefs-4.1.0, forked-1.3.0, timeout-1.4.1, xdist-1.32.0, hypothesis-5.41.2, xonsh-0.9.24
collected 16 items                                                                                                              

src/fscacher/tests/test_cache.py ..FFFF..........                                                                         [100%]

=========================================================== FAILURES ============================================================
_______________________________________________________ test_memoize_path _______________________________________________________
src/fscacher/tests/test_cache.py:131: in test_memoize_path
    check_new_memoread(1, "content")
src/fscacher/tests/test_cache.py:98: in check_new_memoread
    assert memoread(path, arg) == content
src/fscacher/cache.py:116: in fingerprinter
    ret = fingerprinted(path, *args, **kwargs_)
../../deb/gits/joblib/joblib/memory.py:591: in __call__
    return self._cached_call(args, kwargs)[0]
../../deb/gits/joblib/joblib/memory.py:483: in _cached_call
    func_id, args_id = self._get_output_identifiers(*args, **kwargs)
../../deb/gits/joblib/joblib/memory.py:620: in _get_output_identifiers
    argument_hash = self._get_argument_hash(*args, **kwargs)
../../deb/gits/joblib/joblib/memory.py:614: in _get_argument_hash
    return hashing.hash(filter_args(self.func, self.ignore, args, kwargs),
../../deb/gits/joblib/joblib/func_inspect.py:295: in filter_args
    raise TypeError("Ignore list for %s() contains an unexpected "
E   TypeError: Ignore list for memoread() contains an unexpected keyword argument '_cache_fingerprint'
_____________________________________________________ test_memoize_path_dir _____________________________________________________
src/fscacher/tests/test_cache.py:212: in test_memoize_path_dir
    check_new_memoread(1, 9)
src/fscacher/tests/test_cache.py:178: in check_new_memoread
    assert memoread(path, arg) == content
src/fscacher/cache.py:116: in fingerprinter
    ret = fingerprinted(path, *args, **kwargs_)
../../deb/gits/joblib/joblib/memory.py:591: in __call__
    return self._cached_call(args, kwargs)[0]
../../deb/gits/joblib/joblib/memory.py:483: in _cached_call
    func_id, args_id = self._get_output_identifiers(*args, **kwargs)
../../deb/gits/joblib/joblib/memory.py:620: in _get_output_identifiers
    argument_hash = self._get_argument_hash(*args, **kwargs)
../../deb/gits/joblib/joblib/memory.py:614: in _get_argument_hash
    return hashing.hash(filter_args(self.func, self.ignore, args, kwargs),
../../deb/gits/joblib/joblib/func_inspect.py:295: in filter_args
    raise TypeError("Ignore list for %s() contains an unexpected "
E   TypeError: Ignore list for memoread() contains an unexpected keyword argument '_cache_fingerprint'
___________________________________________________ test_memoize_path_persist ___________________________________________________
src/fscacher/tests/test_cache.py:275: in test_memoize_path_persist
    assert outputs[0].stdout.strip().decode() == "Running script.py.DONE"
E   AssertionError: assert '' == 'Running script.py.DONE'
E     - Running script.py.DONE
----------------------------------------------------- Captured stdout call ------------------------------------------------------
Full outputs: [CompletedProcess(args=['/home/yoh/proj/dandi/dandi-cli-master/venvs/dev3/bin/python', '/home/yoh/.tmp/pytest-of-yoh/pytest-84/test_memoize_path_persist0/script.py'], returncode=1, stdout=b'', stderr=b'Traceback (most recent call last):\n  File "/home/yoh/.tmp/pytest-of-yoh/pytest-84/test_memoize_path_persist0/script.py", line 10, in <module>\n    print(func(r"/home/yoh/.tmp/pytest-of-yoh/pytest-84/test_memoize_path_persist0/script.py"))\n  File "/home/yoh/proj/fscacher/src/fscacher/cache.py", line 116, in fingerprinter\n    ret = fingerprinted(path, *args, **kwargs_)\n  File "/home/yoh/deb/gits/joblib/joblib/memory.py", line 591, in __call__\n    return self._cached_call(args, kwargs)[0]\n  File "/home/yoh/deb/gits/joblib/joblib/memory.py", line 483, in _cached_call\n    func_id, args_id = self._get_output_identifiers(*args, **kwargs)\n  File "/home/yoh/deb/gits/joblib/joblib/memory.py", line 620, in _get_output_identifiers\n    argument_hash = self._get_argument_hash(*args, **kwargs)\n  File "/home/yoh/deb/gits/joblib/joblib/memory.py", line 614, in _get_argument_hash\n    return hashing.hash(filter_args(self.func, self.ignore, args, kwargs),\n  File "/home/yoh/deb/gits/joblib/joblib/func_inspect.py", line 295, in filter_args\n    raise TypeError("Ignore list for %s() contains an unexpected "\nTypeError: Ignore list for func() contains an unexpected keyword argument \'_cache_fingerprint\'\n'), CompletedProcess(args=['/home/yoh/proj/dandi/dandi-cli-master/venvs/dev3/bin/python', '/home/yoh/.tmp/pytest-of-yoh/pytest-84/test_memoize_path_persist0/script.py'], returncode=1, stdout=b'', stderr=b'Traceback (most recent call last):\n  File "/home/yoh/.tmp/pytest-of-yoh/pytest-84/test_memoize_path_persist0/script.py", line 10, in <module>\n    print(func(r"/home/yoh/.tmp/pytest-of-yoh/pytest-84/test_memoize_path_persist0/script.py"))\n  File "/home/yoh/proj/fscacher/src/fscacher/cache.py", line 116, in fingerprinter\n    ret = fingerprinted(path, *args, **kwargs_)\n  File "/home/yoh/deb/gits/joblib/joblib/memory.py", line 591, in __call__\n    return self._cached_call(args, kwargs)[0]\n  File "/home/yoh/deb/gits/joblib/joblib/memory.py", line 483, in _cached_call\n    func_id, args_id = self._get_output_identifiers(*args, **kwargs)\n  File "/home/yoh/deb/gits/joblib/joblib/memory.py", line 620, in _get_output_identifiers\n    argument_hash = self._get_argument_hash(*args, **kwargs)\n  File "/home/yoh/deb/gits/joblib/joblib/memory.py", line 614, in _get_argument_hash\n    return hashing.hash(filter_args(self.func, self.ignore, args, kwargs),\n  File "/home/yoh/deb/gits/joblib/joblib/func_inspect.py", line 295, in filter_args\n    raise TypeError("Ignore list for %s() contains an unexpected "\nTypeError: Ignore list for func() contains an unexpected keyword argument \'_cache_fingerprint\'\n'), CompletedProcess(args=['/home/yoh/proj/dandi/dandi-cli-master/venvs/dev3/bin/python', '/home/yoh/.tmp/pytest-of-yoh/pytest-84/test_memoize_path_persist0/script.py'], returncode=1, stdout=b'', stderr=b'Traceback (most recent call last):\n  File "/home/yoh/.tmp/pytest-of-yoh/pytest-84/test_memoize_path_persist0/script.py", line 10, in <module>\n    print(func(r"/home/yoh/.tmp/pytest-of-yoh/pytest-84/test_memoize_path_persist0/script.py"))\n  File "/home/yoh/proj/fscacher/src/fscacher/cache.py", line 116, in fingerprinter\n    ret = fingerprinted(path, *args, **kwargs_)\n  File "/home/yoh/deb/gits/joblib/joblib/memory.py", line 591, in __call__\n    return self._cached_call(args, kwargs)[0]\n  
File "/home/yoh/deb/gits/joblib/joblib/memory.py", line 483, in _cached_call\n    func_id, args_id = self._get_output_identifiers(*args, **kwargs)\n  File "/home/yoh/deb/gits/joblib/joblib/memory.py", line 620, in _get_output_identifiers\n    argument_hash = self._get_argument_hash(*args, **kwargs)\n  File "/home/yoh/deb/gits/joblib/joblib/memory.py", line 614, in _get_argument_hash\n    return hashing.hash(filter_args(self.func, self.ignore, args, kwargs),\n  File "/home/yoh/deb/gits/joblib/joblib/func_inspect.py", line 295, in filter_args\n    raise TypeError("Ignore list for %s() contains an unexpected "\nTypeError: Ignore list for func() contains an unexpected keyword argument \'_cache_fingerprint\'\n')]
___________________________________________________ test_memoize_path_tokens ____________________________________________________
src/fscacher/tests/test_cache.py:311: in test_memoize_path_tokens
    check_new_memoread(memoread, 0, "content")
src/fscacher/tests/test_cache.py:299: in check_new_memoread
    assert call(path, arg) == content
src/fscacher/cache.py:116: in fingerprinter
    ret = fingerprinted(path, *args, **kwargs_)
../../deb/gits/joblib/joblib/memory.py:591: in __call__
    return self._cached_call(args, kwargs)[0]
../../deb/gits/joblib/joblib/memory.py:483: in _cached_call
    func_id, args_id = self._get_output_identifiers(*args, **kwargs)
../../deb/gits/joblib/joblib/memory.py:620: in _get_output_identifiers
    argument_hash = self._get_argument_hash(*args, **kwargs)
../../deb/gits/joblib/joblib/memory.py:614: in _get_argument_hash
    return hashing.hash(filter_args(self.func, self.ignore, args, kwargs),
../../deb/gits/joblib/joblib/func_inspect.py:295: in filter_args
    raise TypeError("Ignore list for %s() contains an unexpected "
E   TypeError: Ignore list for memoread() contains an unexpected keyword argument '_cache_fingerprint'
==================================================== short test summary info ====================================================
FAILED src/fscacher/tests/test_cache.py::test_memoize_path - TypeError: Ignore list for memoread() contains an unexpected keyw...
FAILED src/fscacher/tests/test_cache.py::test_memoize_path_dir - TypeError: Ignore list for memoread() contains an unexpected ...
FAILED src/fscacher/tests/test_cache.py::test_memoize_path_persist - AssertionError: assert '' == 'Running script.py.DONE'
FAILED src/fscacher/tests/test_cache.py::test_memoize_path_tokens - TypeError: Ignore list for memoread() contains an unexpect...
================================================= 4 failed, 12 passed in 1.69s ==================================================

RF: to allow for more flexible use

For now just "thinking out loud"; I guess more specifics will come as I try...

  • ATM the main use pattern is to decorate a function using @cache.memoize_path under the assumption that the path is the first positional argument. But that might not be the case in an already existing code base, which would thus require either

    • creating use-case-specific shims which would reorder/extract the path from the corresponding function signature (it could be in kwargs, or another positional arg), or even from a value itself (e.g. in datalad it would be a Dataset instance passed through, from which we would get .path) -- see the shim sketch at the end of this issue, or
    • providing some convenience, e.g. a path_getter(*args, **kwargs) which, if not defined, would default to the current args[0]
  • if we add a path_getter as above, and whenever we make it parametrized (e.g. for #5 #6 #10) -- should those be defined at the class level or at the memoize_path invocation level? ATM we seem to (actually I am a bit confused how we account for different functions...) allow the same cache to be used across multiple functions. So if we are to keep allowing that, it smells like path_getter should be at the invocation level, since the same cache could be used across multiple invocations (functions)

  • probably there is no non-decorator use pattern we need to provide, but we need to make sure that it works if we decorate a function instance right in the code, e.g.

def func(path, ...):
    ...

def some(...):
    out = cache.memoize_path(func)(path)

or, especially, whether it would somehow work on locally defined functions/closures like

def some(...):
    def func(path, ...):
        ...
    out = cache.memoize_path(func)(path)

and make sure that multiple invocations of some() do take advantage of the cache from prior invocations

  • if we do not foresee a use for __call__, should we just rename memoize_path -> __call__?
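
The shim option from the list above could look roughly like this with the existing decorator (the Dataset-like object and the trivial body are illustrative):

    import os
    from fscacher import PersistentCache

    cache = PersistentCache(name="example")

    @cache.memoize_path
    def _entries_on_path(path):
        # expensive work keyed on `path`; a trivial stand-in here
        return sorted(os.listdir(path))

    def entries(dataset):
        # shim: extract the path from the Dataset-like object and pass it first,
        # so the "path is the first positional argument" contract still holds
        return _entries_on_path(dataset.path)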

add a test and verify that fscacher works ok with git-annex'ed files in direct mode

i.e. whenever a file is a symlink pointing to something under .git/annex/objects, we should be using the cached value even if the original file changes its name and/or location but still points to the same file under .git/annex/objects.
I guess a test could just be done by manipulating symlinks explicitly, but I think it would be best to ensure that there is no other side effect from git-annex doing that. I guess there is no need to drag datalad in for this minor test; we could just use git-annex and git commands directly (roughly as sketched below) to:

  • create a file with content, git annex add and git commit it
  • cache some function result
  • git mv the file to a new filename and git commit, redo the function call and ensure that we do not recompute
  • git mv the file to another directory and git commit, redo the function call and ensure that we do not recompute
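
A rough pytest sketch of that sequence (the existing cache fixture is assumed, subprocess is used directly to drive git/git-annex, and timing around the minimum-dtime window is glossed over):

    import subprocess

    def test_memoize_path_annexed_rename(cache, tmp_path):
        def git(*cmd):
            subprocess.run(["git", *cmd], cwd=tmp_path, check=True)

        calls = []

        @cache.memoize_path
        def memoread(path):
            calls.append(path)
            with open(path) as f:
                return f.read()

        git("init")
        git("annex", "init")
        (tmp_path / "foo.dat").write_text("content")
        git("annex", "add", "foo.dat")
        git("commit", "-m", "add foo")

        assert memoread(tmp_path / "foo.dat") == "content"
        ncalls = len(calls)

        # rename: the symlink changes, but it points at the same annexed key
        git("mv", "foo.dat", "bar.dat")
        git("commit", "-m", "rename")
        assert memoread(tmp_path / "bar.dat") == "content"
        assert len(calls) == ncalls  # no recompute expected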

related to #44, where we actually don't want to pass the resolved path into the underlying function. But we still want to use the resolved path for computing the fingerprint. So getting this test done would ensure that we do not break that behavior whenever we finally address #44.

ENH(TST): add a test to verify correct operation on closures/local functions

as given in the example in #17

if that cannot work (e.g. if we would then need to add the closure's variable space into the fingerprinted signature, and somehow cannot/should not do that) -- we should raise some exception (perhaps NotImplementedError("caching on functions with closures is not supported"))

the same would probably go for functions accessing the global variable space, but I guess that is nothing we could deduce/control for, right?
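
Detecting the closure case itself is cheap, e.g.:

    def has_closure(func):
        """True if func closes over variables from an enclosing scope."""
        return bool(getattr(func, "__closure__", None))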

Automagic GC

I feel that we need to think about garbage collection now. /home on drogon ran out of space, and fscacher contributed -- e.g. in my case it was 4GB under ~/.cache/fscacher.

  • Ideally it should be some kind of LRU mechanism.
    Unfortunately I think we cannot rely on atime for deciding which cached files to remove, since the filesystem might be mounted with relatime or noatime, so we might need another way to decide what is recent and what is not.

  • another idea would be to help the user decide by creating leading directories for those extra tags (versions) we add into the function signature (but they could become too long too fast :-/). We could then make the decision at that level -- the oldest created/modified ones go. If one was modified recently, it means that that particular version is still in use (a minimal pruning sketch follows).
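
A minimal mtime-based pruning sketch (assuming joblib's one-directory-per-call layout and that atime cannot be trusted):

    import os
    import shutil
    import time

    def prune_cache(cache_root, max_age_days=90):
        """Remove leaf call directories whose newest file is older than max_age_days."""
        cutoff = time.time() - max_age_days * 86400
        for dirpath, dirnames, filenames in os.walk(cache_root, topdown=False):
            if dirnames or not filenames:
                continue  # only consider leaf directories holding cached output
            newest = max(os.stat(os.path.join(dirpath, f)).st_mtime
                         for f in filenames)
            if newest < cutoff:
                shutil.rmtree(dirpath)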

Any ideas @jwodder ?

Statement of the problem and review of possible existing solutions

High level overview

An increasing number of projects need to operate on large and growing collections of files. Having (hundreds of) thousands of files within a file tree is no longer atypical, e.g. in neuroimaging. Many supplementary tools/functions need to "walk" the file tree and return the result of some function. Such a result should not change between invocations if nothing in the file tree changes (no file added/removed/changed), and theoretically the output of the function should remain the same.

The situation scales down all the way to individual files: a function could operate on a specific file path and return a result which should not change if the file does not change.

"Fulfilled" use-case

  • git already does some level of caching to speed up git status (see the comment on datalad below), so I think it should be possible to provide some generic helper which would cache the result of an operation on some path argument (a file or a tree).
  • dandi-cli: loading metadata from a .nwb file takes a while. For that purpose, based on joblib.Memory, I came up with a PersistentCache class providing a general persistent (across processes) cache, and a .memoize_path method, which is a decorator for any function which takes a path to a file as its first argument (sketched below).
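
In use it looks roughly like this (the name argument and the stand-in body are illustrative):

    from fscacher import PersistentCache

    # a persistent (across processes) cache, keyed on the file's fingerprint
    cache = PersistentCache(name="nwb-metadata")

    @cache.memoize_path
    def get_metadata(path):
        # stand-in for the expensive .nwb parsing; re-run only when `path` changes
        with open(path, "rb") as f:
            return len(f.read())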

Implementation

  • Dunno yet what to do given a file tree, and whether there are filesystem specifics to be (ab)used. We probably should not rely on active monitoring (e.g. inotify), so as not to depend on a daemon running.
  • The PersistentCache mentioned above can also account (see example) for versions of the relevant Python modules by placing them into the signature, so that if there is an update, the cached result will not be used.
  • it should probably accept a list of paths, changes to which could be safely ignored
  • consider support for accounting for the hierarchical organization of files/directories -- so the caching helper would cause re-invocation only on the relevant sub-trees while reloading cached results for the others which have not changed

Target use-cases

  • DataLad: datalad status and possibly some other commands' operation (e.g. diff). Whereas git status already uses some smart caching of results, so subsequent invocations take advantage of it, there is nothing like that within datalad yet.

  • PyBIDS: construction of the BIDSLayout can take a while. If the instance (or at least a list of walked paths) could be cached -- it would speed up subsequent invocations. (ref: bids-standard/pybids#609 (comment))

  • dandi-cli: ATM we cache per file, but as datasets with larger numbers of files appear, we might like to cache results on full file trees

If anyone (attn @kyleam @mih @vsoch @con @tyarkoni @effigies @satra) knows an already existing solution -- that would be great!

support for ignoring/considering some paths

I am not certain yet whether it should be a list of regexes or of relative paths. Most likely it would need to be both. E.g., considering git repositories (or DataLad datasets): we can immediately ignore .git/objects and probably .git/logs and some others if we are only interested in status. But we might also want to ignore anything git ignores (which could come from .gitignore or some other way of specifying ignores)! So we might want to consult git on that somehow, to ideally discover directories to skip, etc.
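
Consulting git could be as simple as piping candidate paths through git check-ignore (a sketch; error handling omitted):

    import subprocess

    def git_ignored(repo, paths):
        """Return the subset of `paths` that git would ignore in `repo`."""
        proc = subprocess.run(
            ["git", "-C", repo, "check-ignore", "--stdin"],
            input="\n".join(paths),
            capture_output=True,
            text=True,
        )
        return set(proc.stdout.splitlines())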

post-call fingerprinting mode

if a function can potentially affect the tree (especially if changes to it were detected in pre-call fingerprinting), e.g. I am thinking about some datalad operation which would trigger a git annex init call... then the fingerprint which we estimated (largely to decide whether to even proceed with calling the function) would no longer be valid. Sure, we would discover that whenever we call that function again and just redo the function call, but I wonder if there would be some optimization in redoing the fingerprinting (possibly in the background, thus related to #7) after the function call. I guess some timing should be done, and clear target use cases which would benefit identified, first.

memoized_path_copy helper to complement @memoize_path

In light of the dandi/dandi-cli#848 discussion on allowing more efficient caching of digests, I wondered whether it would be feasible to provide something like memoized_path_copy, which would copy all (?) memoized invocations of a specific decorated function as if they had been invoked for another "new" path.

ATM, looking at the code, and since we rely on joblib memoization and otherwise do not track which specific parametrizations of the function were used, I really do not see how we could even do that. But maybe you, @jwodder, see some way to provide such functionality?

What should replace the "DANDI_CACHE" environment variable?

When a new PersistentCache is constructed, the code currently checks the value of the DANDI_CACHE environment variable. If it is "clear", the cache is cleared; if it is "ignore", memoization is a no-op. Should the fscacher library continue to perform these actions based on the value of an environment variable, and, if so, what should the variable be renamed to?

One possible way to handle this (sketched below):

  • Remove the DANDI_CACHE and _ignore_cache code from fscacher
  • Add to fscacher a NullCache class that is API-compatible with PersistentCache but does nothing
  • In dandi, replace all constructions of a PersistentCache instance with a call to a get_persistent_cache() function which checks DANDI_CACHE and, based on its value, returns either a PersistentCache instance, a NullCache instance, or a PersistentCache instance on which clear() has been called.
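
Sketched out, the proposal would look roughly like this (names as proposed above; the constructor signature is assumed):

    import os
    from fscacher import PersistentCache

    class NullCache:
        """API-compatible stand-in whose memoization is a no-op."""

        def memoize_path(self, func):
            return func

        def clear(self):
            pass

    def get_persistent_cache(name):
        # dandi-side helper, per the proposal above
        setting = os.environ.get("DANDI_CACHE", "")
        if setting == "ignore":
            return NullCache()
        cache = PersistentCache(name=name)
        if setting == "clear":
            cache.clear()
        return cache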

run benchmarking only for PRs

I guess I rushed merging #13 a bit: running benchmarks on CI makes sense only when both versions (the target branch to be merged into, e.g. master, and the corresponding PR (merged)) are run on the same box, since CI could schedule runs on different boxes with different speeds etc. So we can disable that workflow completely for push events. I will push a commit for that (hopefully I am not missing smth ;-))

Can get stuck on windows

Origin:

which I believe leads us here, somehow causing the stuck behavior when uploading in parallel threads (but not multiple chunks per file). Not yet sure how to troubleshoot/reproduce, besides maybe torturing some Windows CI with upload jobs to the staging or primary DANDI bucket. @jwodder any ideas?

Next design, directories oriented?/localized, is needed

This issue could be considered a duplicate of the earlier dandi/dandi-cli#848 (comment), but as that one just started with a suggestion of an alternative implementation within dandi-cli, I decided to file a separate one within fscacher to provide more coverage of the situation, and because I still hope that fscacher can be a reusable package providing the needed solution.

The outstanding issues we have are all, IMHO, bottlenecked by our initial simplistic fscacher implementation, which:

  • I1: uses the same cache for a function (or collection of functions) regardless of their parametrization etc.
  • I2: places all caches into a centralized user-wide cache directory (e.g., ~/.cache/fscacher/{cachename}), including caches related to a dataset which might get removed,
  • I3: uses joblib's memoize, so each invocation gets its own directory + 2 files on disk (thus 3 inodes):

    (dandi-devel) jovyan@jupyter-yarikoptic:~/.cache/fscacher/dandi-checksums/joblib/dandi/support/digests$ ls get_zarr_checksum/b74a957ae8f7e2e4fc2b07d1aaa73775/
    metadata.json  output.pkl

    - in effect this allows avoiding the need for any locking, since the fingerprint defines the directory name to use

This simplistic initial design has shown its limitations by

  • making it difficult to prune old cache entries selectively (helpful interfacing/functionality was proposed in joblib/joblib#1200 but never finalized/merged)
  • making it hard or impossible to "carry over" the cache while moving files around, e.g. by the dandi mv command: dandi/dandi-cli#848 (comment), with some analysis showing it to be blocked by the joblib backend: #57
  • adding lots (over 1000%?) of overhead in some cases (e.g., see #914) -- possibly due to needing to create a directory with 2 files even for a quick invocation
  • wasting filesystem inodes, which can be a precious resource on HPC/some filesystems

The question is how we could redesign (or just expand, since the current implementation is good in its simplicity) fscacher to possibly:

  • RF I3: provide an alternative (to a directory with 2 files) storage backend (see the sketch after this list) which is
    • more efficient when working with lots of small files
    • less reliant on the filesystem and thus gentler on inodes
    • it might need locking mechanisms added to guarantee consistency
  • RF I2 and/or maybe I1: provide some "locality":
    • "dataset level": keep the cache within any given dandiset, so that if that dandiset is removed, the cache goes along with it
    • "directory/path" (in some cases only): keep all fingerprints for files within a .zarr/ folder in a single cache "file" -- this would save inodes, and for mv etc. operations it would be a matter of copying/changing one such file.
    • in principle, both localities could already be achieved with the current fscacher to some degree, by establishing a separate cache per specific dataset/zarr file. We would just need to get away from using a simple generic @decorator form and instantiate a cache per dandiset/zarr file in a given location. For 'zarr' support, though, some alternative storage backend would be needed
  • gain more versatile cleaning
    • maybe tie it a little with the code, i.e. removal of a file by the code could trigger cleaning the cache for that path, although I am not sure whether such explicit ties are really worth adding.
  • support "early decision" for directories (say it changed -- run the target compute while, in parallel, finishing the full directory fingerprint; might need storing a "tree" fingerprint at least to some depth)
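
For the I3 alternative, something like a single SQLite file per dandiset/zarr could serve as the storage backend (a sketch; values would be pickled results or fingerprints):

    import sqlite3

    class SQLiteCacheBackend:
        """One database file instead of one directory + 2 files per call."""

        def __init__(self, dbpath):
            self.db = sqlite3.connect(dbpath)
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)"
            )

        def get(self, key):
            row = self.db.execute(
                "SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
            return row[0] if row else None

        def set(self, key, value):
            with self.db:  # the implicit transaction provides the needed locking
                self.db.execute(
                    "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
                    (key, value),
                )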

edit: adding an explicit section

Features

  • be able to query the status / whether a path (directory) has changed. This will be of use for DataLad as well, so we might need to abstract those interfaces

WDYT @jwodder -- have any ideas/vision on what next design in fscacher could be to make our life in dandi-cli better?

Inconsistent behavior on Windows from deprecated `st_ctime_ns` stat

didn't know this was yours as well!

Noticed that only the Windows tests were failing here, for a circumstance where a cache should have been invalidated because the creation time, but not the mtime, was changed: dandi/dandi-cli#1364

I think this comes from here: https://github.com/con/fscacher/blob/e3a6dee365f89a3b0212d19976e61a8165bf0488/src/fscacher/cache.py#L209C10-L209C10

where a fingerprint is generated for a file using s.st_mtime_ns, s.st_ctime_ns, s.st_size, s.st_ino

st_ctime_ns is deprecated on windows:

In 3.12:

Changed in version 3.12: st_ctime_ns is deprecated on Windows. Use st_birthtime_ns for the file creation time. In the future, st_ctime will contain the time of the most recent metadata change, as for other platforms.

In 3.11:

Platform dependent:
the time of most recent metadata change on Unix,
the time of creation on Windows, expressed in nanoseconds as an integer.

It's not clear to me how one would reconcile them, since the 'get last metadata change' behavior isn't available on Windows, but the docs say it should be 'in the future.'

In the meantime it might be good to add a small hash of ~1 KB from the front and back of the file, or something similar, to the fingerprint.
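
E.g. something along these lines (a sketch; blake2b and the 1 KB window are arbitrary choices):

    import hashlib
    import os

    def head_tail_digest(path, nbytes=1024):
        """Digest ~1 KB from the front and the back of a file."""
        h = hashlib.blake2b(digest_size=16)
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            h.update(f.read(nbytes))
            if size > nbytes:
                f.seek(max(size - nbytes, nbytes))
                h.update(f.read(nbytes))
        return h.hexdigest()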

Mostly raising the issue to bring it to attention; it doesn't seem like an urgent problem, but it seems good to note when OS inconsistencies arise.

0.1 alpha release?

I think it is time -- we should start trying to use it. The API etc. is surely subject to change, so it should not be used "too widely" yet.

How do we cut the first release (there is no prior PyPI upload etc.)?

account for symlinks

another "complication" -- for datalad we would also need to account for whether a symlink (if it is a symlink) is broken or not, to provide relevant status. For git it doesn't matter. So I guess it should be an option on a specific instance, and if symlinks='add-followed' (or alike), then the record/fingerprint should also contain the stat for the dereferenced (all the way?) file
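
A sketch of what symlinks='add-followed' could record (the choice of stat fields is illustrative):

    import os

    def symlink_aware_fingerprint(path):
        """Fingerprint the path itself and, for symlinks, the dereferenced target."""
        ls = os.lstat(path)
        fprint = [(ls.st_mtime_ns, ls.st_size, ls.st_ino)]
        if os.path.islink(path):
            try:
                ts = os.stat(path)  # follows the symlink all the way
                fprint.append((ts.st_mtime_ns, ts.st_size, ts.st_ino))
            except FileNotFoundError:
                fprint.append(None)  # broken symlink -- relevant for status
        return tuple(fprint)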
